Despite my computer science colleagues complaining about how awkward, bloated, and annoying it is, most ecologists who dabble in command-line programming now use R, and it’s easy to see why. It’s freely available, has a multitude of statistical and graphical packages, and though the initial learning curve is steep, climbing it is an invaluable experience. When most statistical or graphing programs cost hundreds of dollars, this free alternative is an obvious choice.
For those not familiar with R, it’s a basic command-line program (like FORTRAN, BASIC, C++, etc.) that performs mathematical analyses and produces graphics. The main way R accomplishes this is through packages, which are collections of routines designed for specific tasks. One of the most commonly used packages is called “MASS” (for “Modern Applied Statistics with S” by Venables and Ripley; S is another programming language). This is where one can find the routines to run analysis of variance and other linear models, for example.
While an ANOVA is fairly simple (heck, I calculated ANOVA F-ratios by hand in undergrad, as I suspect many of us did), many packages perform more complex analyses. This is what makes them so useful to “end users” who aren’t computer programmers – all we need to do is format the data properly and select a few options, and the package does the math for us. Here’s an example:
model <- glm(RS ~ Island, data = crau, family = "binomial")
summary(model)
This is an analysis of reproductive success of Crested Auklets, testing for differences among islands. Since reproductive success is a binomial response (the pair either did or did not raise a chick successfully), it’s a generalized linear model with binomial error. Simple! Nowhere did I have to code how to calculate the result; R’s glm routine does it for me as long as I give it the appropriate input.
But there’s a downside to this crowdsourcing of computer code – not all of it is up to the same standard. I work a fair bit with stable isotopes in ecology. These are non-decaying isotopes of carbon and nitrogen that can be used to get information on foraging. When we plot δ13C vs. δ15N (an “isotope biplot”), we can conceptualize it as analogous to the trophic niche concept, where δ15N relates to trophic position, and δ13C to foraging location. Pretty cool stuff.
In 2011, a group in the UK basically put a number on the size of this axis of Hutchinson’s “n-dimensional hypervolume” by calculating the standard ellipse (a two-dimensional analogue of the standard deviation). Great! Now we can compare niche sizes among species/groups/sites. In their paper, they outlined three methods for calculating the standard ellipse area (SEA):
- Calculating the SEA directly from the data
- Correcting the SEA for small sample size, much as AICc corrects AIC (SEAc)
- A Bayesian approach that produces a median and credible intervals (the Bayesian analogue of confidence intervals)
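As a rough illustration, the first two estimates can be computed straight from the covariance matrix of the isotope data. This is a minimal sketch, not the package’s own code (the function names are mine), and the Bayesian version is left to the package itself:

```r
# Sketch: the standard ellipse area is pi times the product of the
# semi-axis lengths, which are the square roots of the eigenvalues of
# the covariance matrix of (d13C, d15N).
sea <- function(d13C, d15N) {
  eig <- eigen(cov(cbind(d13C, d15N)))$values
  pi * sqrt(prod(eig))
}

# Small-sample correction, analogous in spirit to AICc:
seac <- function(d13C, d15N) {
  n <- length(d13C)
  sea(d13C, d15N) * (n - 1) / (n - 2)
}
```

With only a handful of points per group, the difference between sea() and seac() can be substantial, which is exactly why the correction exists.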
I like the third option, since it also includes an estimate of error. In theory, all three numbers should be close. But when I ran the analysis for a community of birds I worked on during my MSc, I noticed a problem: the estimate of SEAc didn’t fall within the 95% credible interval of the Bayesian estimate. So which is correct? Because the mathematics behind the calculations is beyond me (I only completed two semesters of calculus in undergrad), I have to rely on the package’s author(s).
But herein lies the problem: while software companies have dedicated staff for troubleshooting issues like this, most contributors of R packages have other jobs (as professors, government researchers, or full-time graduate students, in most cases). If the author of the package can even be found*, chances are they’ve moved on to some other research project, or are teaching courses, and have little time to go back and mess around with code they wrote one, two, three, or more years ago. The result is a population of frustrated users – or, in the worst case, users who don’t even know there’s a problem because they lack the specific knowledge to evaluate the package critically. As ecologists, many of us rely on our software to perform properly, and take for granted that it does. Very few of us have the technical know-how to fix it if and when we find a problem.
So what’s the solution? I don’t know of a practical one. Beyond hoping that most people who contribute R packages are of the open, data-sharing ilk and feel a moral obligation to ensure their contributions work, what can be done?
*I thought I recalled a paper on the difficulty in contacting corresponding authors successfully by e-mail, but can’t seem to find it. If you know it, leave a note in the comments.
Tom Evans (@ThomasEvans) said:
I accept that there is potentially a problem with unvetted ‘crowd-sourced’ coding, in terms of maintaining quality. However, one should also recognise that the code is at least available – it is open source. So should you think there is an error, it is possible to get the code and check through it. If you can’t do that yourself, then you can find someone to check it for you. If it were a proprietary program like SPSS, you would have no chance.
Alex Bond said:
Great point, Tom. It’s a step forward from the pre-packaged SAS/SPSS/JMP/etc. software. But unlike those programs, there’s little competition among packages. It’s not as though there are multiple competing packages to calculate isotopic niche, for example, and the creators receive no benefit from my using their package over another. So while commercial packages are inflexible, I think the competition (and associated job security) keeps their quality high. R, on the other hand, is far more flexible (if there’s no package to do a particular analysis, one can be written by anyone with the appropriate programming skills). I should also note that one must generally expend a great deal of effort to first find the source code for a package, then understand someone else’s code, and finally locate and fix the problem – certainly not for everyone!
Margaret Kosmala said:
Just wanted to point out that C++ is *not* a command-line program; it needs to be compiled.
Alex Bond said:
Thanks, Margaret. Yes, it does need to be compiled. What I meant was that it requires typing in code/commands rather than using a point-and-click interface.
Margaret Kosmala said:
Ah. How about “programming language” then, as an alternate term?
Just discovered your blog and am enjoying it. 🙂