Despite my computer science colleagues' complaints about how awkward, bloated, and annoying it is, most ecologists who dabble in command-line programming now use R, and it's easy to see why. It's freely available, it has a multitude of statistical and graphical packages, and though the initial learning curve is steep, climbing it is an invaluable learning experience. When most statistical or graphing packages cost hundreds of dollars, a free alternative is an obvious choice.
For those not familiar with R, it's a basic command-line program (in the family of FORTRAN, BASIC, C++, etc.) that performs mathematical analyses and produces graphics. The main way R accomplishes this is through packages, which are collections of functions written for a specific task. One of the most commonly used packages is called "MASS" (for "Modern Applied Statistics with S" by Venables and Ripley; S is another programming language). This is where one can find routines that extend R's built-in linear models — robust regression and negative binomial GLMs, for example.
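To make the package idea concrete, here's a minimal sketch of the install-and-use cycle, using a function that really does live in MASS (robust regression via rlm()) on one of R's built-in datasets:

```r
# Install once (needs internet access), then load in each session
# install.packages("MASS")
library(MASS)

# MASS's rlm() fits a robust linear model; stackloss is a dataset
# that ships with base R, so this runs as-is
fit <- rlm(stack.loss ~ ., data = stackloss)
coef(fit)  # intercept plus three predictor coefficients
```

The point is the division of labour: the package author wrote the iterative fitting algorithm once, and the user only supplies a formula and a data frame.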
While an ANOVA is fairly simple (heck, I calculated ANOVA F-ratios by hand in undergrad, as I suspect many of us did), many packages perform more complex analyses. This is what makes them so useful to “end users” who aren’t computer programmers – all we need is to format the data properly, and select a few options in the package, and it does the math for us. Here’s an example:
```r
model <- glm(RS ~ Island, data = crau, family = "binomial")
summary(model)
```
This is an analysis of reproductive success of Crested Auklets, testing for differences among islands. Since reproductive success is a binomial response (the pair did or did not raise a chick successfully), it's a generalized linear model with binomial error. Simple! Nowhere did I have to code how to calculate the result; the glm() function does it for me as long as I give it the appropriate input.
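The original crau dataset isn't public, so here's a self-contained sketch of the same model structure on simulated data (the island names and success probabilities are invented for illustration):

```r
# Simulate breeding success (1 = raised a chick, 0 = failed) for
# pairs on three hypothetical islands with different success rates
set.seed(42)
crau_sim <- data.frame(
  Island = rep(c("A", "B", "C"), each = 50),
  RS     = rbinom(150, size = 1, prob = rep(c(0.7, 0.4, 0.6), each = 50))
)

# Same model structure as in the text: binomial GLM of success vs. island
model <- glm(RS ~ Island, data = crau_sim, family = "binomial")
summary(model)$coefficients  # log-odds for island A, contrasts for B and C
```

The coefficient table gives the log-odds of success on the baseline island and the contrasts for the others, with Wald tests for each.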
But there’s a downside to this crowdsourcing of computer code – not all of it is up to the same standard. I work a fair bit with stable isotopes in ecology. These are non-decaying isotopes of carbon and nitrogen that can be used to get information on foraging. When we plot δ13C vs. δ15N (an “isotope biplot”), we can conceptualize it as analogous to the trophic niche concept, where δ15N relates to trophic position, and δ13C to foraging location. Pretty cool stuff.
In 2011, a group in the UK basically put a number on the size of this axis of Hutchinson's "n-dimensional hypervolume" by calculating the standard ellipse (a two-dimensional analogue of the standard deviation). Great! Now we can compare niche sizes among species/groups/sites. In their paper, they outlined three methods for calculating the standard ellipse area (SEA):
- The standard ellipse area itself (SEA)
- The SEA with a correction for small sample size, analogous to AICc in AIC analysis (SEAc)
- A Bayesian estimate that produces a median and credible intervals (the Bayesian equivalent of confidence intervals)
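The first two estimators are simple enough to sketch in base R, assuming the formulas from the 2011 paper: SEA comes from the eigenvalues of the sample covariance matrix, and SEAc multiplies it by (n − 1)/(n − 2). The isotope values below are simulated stand-ins, not real data:

```r
# Simulated isotope measurements for one group (n = 12 individuals)
set.seed(1)
d13C <- rnorm(12, mean = -18, sd = 1)
d15N <- rnorm(12, mean = 14, sd = 0.8)
n    <- length(d13C)

# Eigenvalues of the covariance matrix are the squared semi-axis
# lengths of the standard ellipse, so its area is pi * a * b
ev   <- eigen(cov(cbind(d13C, d15N)))$values
SEA  <- pi * sqrt(ev[1] * ev[2])   # standard ellipse area
SEAc <- SEA * (n - 1) / (n - 2)    # small-sample-corrected version

c(SEA = SEA, SEAc = SEAc)
```

The Bayesian version fits a multivariate normal model and summarizes the posterior distribution of ellipse areas; that's the part that needs a package rather than a few lines of base R, which is exactly where the trust problem below comes in.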
I like the third option, since it also includes some estimate of error. In theory, all three numbers should be close. But when I ran the analysis for a community of birds I worked on in my MSc, I noticed a problem: the estimate of SEAc didn't fall within the 95% credible interval of the Bayesian estimate. So which is correct? Because the mathematics behind the calculations is beyond me (having only completed two semesters of calculus in undergrad), I have to rely on the package's author(s).
But herein lies the problem: while software companies have staff dedicated to troubleshooting issues like this, most contributors of R packages have other jobs (professors, government researchers, or full-time graduate students in most cases). If the author of the package can even be found*, chances are they've moved on to some other research project, or are teaching courses, and have little time to go back and mess around with code they wrote one, two, three, or more years ago. The result is a population of frustrated users — or, in the worst case, users who don't even know there's a problem because they don't possess the specific knowledge to evaluate the package critically. As ecologists, many of us rely on our software to perform properly, and take for granted that it does. Very few of us have the technical know-how to fix it if/when we find a problem.
So what's the solution? I don't know of a practical one. Beyond hoping that most people who contribute R packages are of the open, data-sharing ilk and feel a moral obligation to ensure their contributions work, what can be done?
*I thought I recalled a paper on the difficulty in contacting corresponding authors successfully by e-mail, but can’t seem to find it. If you know it, leave a note in the comments.