
The Lab and Field

~ Science, people, adventure


Tag Archives: R

Academic hipsters redux (and why open science is like going to the dentist)

09 Friday Aug 2013

Posted by Alex Bond in opinion

≈ 9 Comments

Tags

LaTeX, Markdown, open science, R, workflow

Wow – ecologists sure like their writing programs.

After posting yesterday on why I’m reluctant to move out of my established Word/Endnote workflow, the post quickly became the most read in a single day on The Lab and Field (maybe I should have also put “Bayesian” or “frequentist” in the title!).

Folks brought up some great points in the comments, and I wanted to elaborate, and add some perspective.  I’m also not ashamed to say I was swayed by some arguments.

First, I don’t mean to imply that advocates for Markdown/LaTeX/R etc. think of themselves as superior, or look down on those who use Word/Endnote/SPSS.  When I was first learning R (a tool I now use daily, and have taught others to use), I had some pretty sour experiences dealing with more advanced useRs.  Yes, these are the exception to the rule, but their smugness left an off taste in my mouth.  And apparently, I’m not alone:

@ibartomeus @ftmaestre @thelabandfield I’m not saying don’t be open to new things, but these things are a tool rather than a religion (1/2)

— Franciska de Vries (@frantecol) August 9, 2013

@ibartomeus @ftmaestre @thelabandfield And people who don’t use them are not 1) stupid 2) conservative 3) against open science (2/2)

— Franciska de Vries (@frantecol) August 9, 2013

Second, there are non-monetary costs of using Word.  The format is proprietary, meaning you need a Word license to access the file (or convert it to another format in something like OpenOffice Writer).  For many of us, this isn’t an issue since Word is standard-issue on workplace machines.  Switching between operating systems (Mac to Windows, anything to Linux) can be fraught with formatting errors though.  Plain text formats (like Markdown and LaTeX) don’t suffer the same fate since the formatting is part of the actual document, which itself is plain text and readable in freely-available programs on any platform.  A first-order heading in Markdown, for example, looks like this:

This is the title of an article
===============================

Markdown and LaTeX (and their ilk) are fantastic if you need to embed equations, code, etc., which is painfully awful to do in Word.  This isn’t something I normally have to deal with (though when I do, I use LaTeX).  Most of the work in conservation biology / ecology / ornithology that I’ve done has used small datasets (<2000 data points), at most 1 or 2 simple equations in the manuscript, and relatively simple stats (general/generalized linear models).  The places where I’ve worked have had site licenses for Word and Endnote, so I’ve always had access to them.
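To give a flavour of what that embedding looks like, here’s how a simple equation – the logit link of a binomial GLM, sketched purely as an illustration – would be written directly in a LaTeX source file:

```latex
% The logit link of a binomial GLM, typeset in the manuscript source itself:
\begin{equation}
  \log\!\left( \frac{p_i}{1 - p_i} \right) = \beta_0 + \beta_1 x_i
\end{equation}
```

Because the markup lives in the plain-text file, the equation renders identically on any platform, with no equation-editor objects to break in transit.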

The tools I’ve used have worked for the problems at hand.  But they may not for you.  You might be at a non-profit that can’t afford multiple licenses for MS Office or Endnote.  You might be doing lots of modeling and require multi-line equations in your manuscripts.

There is also a philosophical argument to be made, however.  When I move on to my next job, will I have access to the same tools? Will my previous work be lost to the Proprietary File Format gods?

I’m heading out of town for a week, but when I get back, I’m going to work with Andrew MacDonald and Gavin Simpson, two strong proponents of open science, to see if I can (and want to) make the switch to 100% open software (heck, if I can do it, I’m certain just about anyone can!).

Here are the features of my current workflow I’d require, in somewhat decreasing order of importance to me:

  • Integration with my reference database (currently ~3300 entries), ideally including something similar to Endnote’s “Cite While You Write” feature, and the ability to easily format references to a journal’s particular style.
  • Ability for coauthors to comment on and edit manuscript drafts easily and concurrently, and to track, accept/reject these changes.
  • Work across platforms (Windows & Mac), and not require an internet connection (rain days in the field are some of my more productive writing days, after all!).

Data from spreadsheets (the most common format of data in my life) can be stored in *.csv files, and analysed in R (if you use R and haven’t tried the RStudio interface yet, give it a go).  If we can work this out, I’ll try it out for the fall semester to make a good comparison with my current system.  I’m already a half-convert (I use R, and have used LaTeX).  It might turn out to be something that works for me (which is the most important point), and I’ll keep everyone updated.
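As a minimal sketch of that plain-text workflow (the species data and island names below are made up for illustration), a *.csv file can go from disk to a fitted model in a couple of lines of R:

```r
# Hypothetical example: plain-text data straight into a model.
# In practice the data would live in a file, e.g. crau <- read.csv("crau.csv");
# here it is inlined so the snippet stands alone.
csv_text <- "RS,Island
1,Kiska
0,Kiska
1,Kiska
0,Buldir
1,Buldir
1,Buldir"
crau <- read.csv(text = csv_text)

# Binomial GLM of reproductive success (0/1) by island
model <- glm(RS ~ Island, data = crau, family = "binomial")
summary(model)
```

The *.csv and the script are both plain text, so the whole analysis stays readable in any editor, on any platform, without a licence.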

Why have I (and many others) been resistant to this change? Because it means admitting we’ve failed at something.  Confronting it is like going to the dentist – sure, you brush your teeth, perhaps floss occasionally, and for most of the year, that’s fine. But when you sit in that chair and are told you need a filling, you feel like crap, and are embarrassed.  Dealing with “Open Science” questions of file format, and of how repeatable our analyses are, is awkward for most of us because if we look critically at our own work, it’s often as closed as a nun’s habit on Sunday. Sure, we could (sometimes) repeat our analyses, but could anyone else if we sent them the data? But openness of data and repeatability in ecology is another post for another day.

Have I completely changed positions? No. Use what works for you. If you use Word to write your manuscript, keep data in Excel files, and run all your stats in SPSS, I couldn’t care less. If you use Markdown/Pandoc to submit your paper that was analysed in R, I care just as little.  Above all else, I care about good science. Can you do good science in Excel, Word, Endnote and SPSS? Sure. Can you do bad science using Markdown, LaTeX, Pandoc, and R? Absolutely.  And the reverse of both is equally true.  As I wrote in my very first post, one shouldn’t let the study site or focal species drive the research questions.  Pick the best system to answer your question.  Similarly, pick the best software to do what you need to do.  And if that’s Word, Markdown, or an XF stub nib for your fountain pen, that’s OK.

Beware the academic hipster (or, use what works for you) UPDATED

08 Thursday Aug 2013

Posted by Alex Bond in opinion

≈ 34 Comments

Tags

advice, Github, Markdown, R, software, word processors, writing

UPDATE: Be sure to read the comments below, and my response

As a newly-minted PhD student, I was talking with a friend about writing papers.  “Use LaTeX”, he said.  I thought he meant the rubbery material commonly found in lab gloves.  But apparently not.  LaTeX (pronounced “lay-tech”) is typesetting software that he used for writing papers.

Eager to be on the cutting edge of scholarship, I spent a few days learning how LaTeX worked, how to insert symbols, figures, and tables.  I even produced my thesis proposal with it.  But my supervisor used Word exclusively, and I had no compelling reason to use LaTeX over Word, so I switched back.

Fast-forward a few years.  Now, everyone should be using Markdown in a plain text editor, doing statistics in R, uploading versions to GitHub or figshare, and managing citations with JabRef, BibTeX, or Mendeley.  Apparently, Word, Excel, Endnote, and SPSS are things of the past.  Special sessions at the 2013 Ecological Society of America meeting seem to be the nail in the proverbial coffin.  Some are even calling these new tools essential pieces of software for students.

There is a movement afoot to move the process of writing science out of Microsoft Word, and into other “better” formats like LaTeX, or Markdown with the argument that “researchers shouldn’t waste time on formatting, just the text of what they’re writing”.  They can then keep version control using something like GitHub, and invite collaborators to do the same.  This also keeps science open, since scientists aren’t beholden to a proprietary file format.

But in my mind, there are two arguments: the practical (A is tangibly better than B), and the philosophical (A is better than B because of ethical, moral, or philosophical reasons).  These are both important discussions to have, but in this post, I’m going to focus on the first.

Learning Curve

I’ve used Word for my typing needs since about 1997 (prior to which, I used ClarisWorks and WordPerfect, two functionally similar programs).  I know how to easily insert commonly used non-Roman symbols (like β), and most of my work (>95%) doesn’t extend beyond simple mathematical symbols or diacritical marks (like ±, Σ, or é).  I use minimal formatting in Word (bold, italics, line numbers, maybe changing the font size of the title), and after almost 20 years, I’ve gotten pretty good at Ctrl-B (or, in the last 10 years, command-B).

Coauthor inertia

The vast majority of my work is collaborative to some degree.  Whether it’s a supervisor or boss, or a larger group of other researchers, someone’s going to read, comment on, revise, and critique any paper I write before it goes to the journal.  Word is ubiquitous, while these other methods are not.  And like me, my coauthors are most familiar with Word, and use its Track Changes feature to make suggestions, comment on text, and insert their own edits.

Reference integration

This is really the deal-breaker for me.  Since 2005, I’ve used Endnote to manage my reference papers, and I use the “Cite While You Write” feature in every paper.  Basically, this means I can write something like “Birds have feathers, and can fly (Gill 2007)”, and Endnote will drop the full citation (in the specified format) in the Literature Cited section.  How cool is that?  It also makes reformatting for different journals relatively easy.  Yes, there are other types of programs that can do that for you (e.g., BibTeX), but there’s a learning curve, and many hours updating citation keys so that there aren’t 4 “Jones2007”s.
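For the curious, a BibTeX entry is just a block of plain text organised around that citation key – the sketch below reuses the “Gill 2007” example from above, with the entry details written from memory, so treat them as illustrative:

```bibtex
% "Gill2007" is the citation key; \cite{Gill2007} in the manuscript
% pulls in the formatted reference, much like Cite While You Write.
@book{Gill2007,
  author    = {Gill, Frank B.},
  title     = {Ornithology},
  edition   = {3},
  publisher = {W. H. Freeman},
  address   = {New York},
  year      = {2007}
}
```

The key is the catch: keep the keys unique and consistent, or you end up with those duplicate “Jones2007”s.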

Cost & Access

Word (and, to a lesser extent, Endnote) is readily available at most research organizations, or is relatively cheaply obtained (let’s say a maximum of $200).  If you want to keep your projects private, GitHub will run you $7/month (or about $200 over 2 years), while the rest are free.  Word and Endnote are perpetual licenses.  True, universities and research organizations pay for these, but it’s unlikely that will change since the programs are used by non-academic staff, too.

Academic hipsters

The following was just tweeted from the 2013 ESA conference

Do it MT @ucfagls @recology_ : “throw away MS Word and pick up Markdown” – great advice in the reproducible research workshop #ESA2013

— Andrew MacDonald (@polesasunder) August 5, 2013

@thelabandfield But some of us want to have reproducible research so embed R or Python for the analysis in paper @polesasunder @recology_

— Gavin Simpson (@ucfagls) August 5, 2013

The implication, whether intended or not, is that those of us still using Word aren’t doing reproducible research.

Now before folks get their open sources all in a knot, I’m not just being a Luddite.  I use R regularly.  I’ve also used LaTeX for one manuscript.  I’m not advocating against using any of these tools if they’re the right tools for the job.  What I’m saying is don’t use them for the sake of using them–a form of what I could call academic hipsterism.

Feel like I should write an R package. I don’t have anything that needs doing, it just feels like it’s what all the cool kids are doing now.

— Steven Hamblin (@BehavEcology) August 5, 2013

Case in point.

My experience with other early-career researchers, collaborators, supervisors, and grad students is that 99% of them will keep their data in Excel, write the manuscript in Word, and some will integrate references using Endnote (important point: the same applies to non-Microsoft products like Apple’s Pages and Numbers, OpenOffice, etc.).

And for a good chunk of the statistical analyses I do, or that are in papers I read, review, and co-author, it doesn’t matter if they were done in R, or SPSS, or SAS, or Minitab, or JMP, or many other common statistical programs.

Are there issues with all these pieces of software? Yes. Are there issues with any piece of software? Yes.  Has a manuscript in ecology/zoology been rejected because the authors used a particular program to compose their text? I don’t think so.

Jeremy Fox at Dynamic Ecology wrote about how he keeps on top of the literature.  His point was that his system works for him, and yes, there are other systems out there.  The interface that I set up on my computer between Word and Endnote when I started my MSc aeons ago still works for me.  It also works with my coauthors, all of whom use Word as a primary text editor for manuscripts, and it works for journals, all of which accept submissions in Word format, or the easily-generated PDF.

Are tools like markdown, LaTeX, and github useful? To some, they are.  But they’re not yet useful to me. If they look useful to you, check them out – they just may be. But don’t feel beholden to adopt the latest software trend.

30 years ago, John Wiens wrote in The Auk on the perils of word processors:

John Wiens on the perils of using word processors in an editorial in The Auk, 1983


Has word processing improved how science is disseminated? Of course.  Perhaps we could say the same for the current crop of new tools in manuscript writing and statistics. But not for me, at least not yet.

I’m not saying these new pieces of software are terrible and useless.  I’m saying that I’m not inclined to use them because I don’t see how they are materially better than my current system.  Sometimes, it seems like the argument from the non-Word proponents is that “our way is better than yours in every case” (see the quote tweets above), which isn’t the case.

For what it’s worth, I’m going to have a lengthy Skype chat with Andrew MacDonald later this month about the advantages of Markdown, and integrating it with BibTeX.  I might even try it.  I’ll let you all know how it goes.

— — —

As a quick note, I’m off to the Society of Canadian Ornithologists meeting in Winnipeg, and won’t be as quick to approve new commenters, or respond to comments. Thanks for your patience. -AB

The perils of free software

16 Wednesday Jan 2013

Posted by Alex Bond in opinion

≈ 5 Comments

Tags

digital, R, stable isotopes, tools

Despite my computer science colleagues complaining about how awkward, bloated, and annoying it is, most ecologists who dabble in command-line programming now use R, and it’s easy to see why.  It’s freely available, has a multitude of statistical and graphical packages, and though the initial learning curve is steep, it can be an invaluable learning experience.  When most statistical or graphing packages cost hundreds of dollars, this free alternative is an obvious choice.

For those not familiar with R, it’s a command-line environment (in the lineage of languages like FORTRAN, BASIC, and C++) that performs statistical analyses and produces graphics.  Much of R’s functionality comes from packages, which are collections of functions designed for specific tasks.  Base R ships with the routines for analysis of variance and other linear models; one of the most commonly used contributed packages is “MASS” (a companion to “Modern Applied Statistics with S” by Venables and Ripley; S is the language on which R is based), which extends these with more specialised methods.

While an ANOVA is fairly simple (heck, I calculated ANOVA F-ratios by hand in undergrad, as I suspect many of us did), many packages perform more complex analyses.  This is what makes them so useful to “end users” who aren’t computer programmers – all we need to do is format the data properly and select a few options, and the package does the math for us.  Here’s an example:

# Binomial GLM: does reproductive success (0/1) differ among islands?
model <- glm(RS ~ Island, data = crau, family = "binomial")
summary(model)

This is an analysis of reproductive success of Crested Auklets, testing for differences among islands.  Since reproductive success is a binomial response (the pair did or did not raise a chick successfully), it’s a generalized linear model with binomial error.  Simple!  Nowhere did I have to code in how to calculate the result; the glm() function does it for me as long as I give it the appropriate input.

But there’s a downside to this crowdsourcing of computer code – not all of it is up to the same standard.  I work a fair bit with stable isotopes in ecology.  These are non-decaying isotopes of carbon and nitrogen that can be used to get information on foraging.  When we plot δ13C vs. δ15N (an “isotope biplot”), we can conceptualize it as analogous to the trophic niche concept, where δ15N relates to trophic position, and δ13C to foraging location.  Pretty cool stuff.

In 2011, a group in the UK basically put a number on the size of this axis of Hutchinson’s “n-dimensional hypervolume” by calculating the standard ellipse (a 2D analogue of the standard deviation).  Great!  Now we can compare niche sizes among species/groups/sites.  In their paper, they outlined 3 methods for calculating the standard ellipse area (SEA):

  • Just figuring out the SEA
  • Adjusting for small sample size, like we do in AIC analysis (SEAc)
  • A Bayesian approach that produces a median and credible intervals (the Bayesian equivalent of confidence intervals).

I like the 3rd option, since it also includes some estimate of error.  In theory, all 3 numbers should be close.  But when I ran the analysis for a community of birds I worked on in my MSc, I noticed a problem: the estimate of SEAc didn’t fall within the 95% credible interval of the Bayesian estimate.  So which is correct?  Because the mathematics behind the calculations is beyond me (having only completed 2 semesters of calculus in undergrad), I have to rely on the package’s author(s).

But herein lies the problem: while software companies have dedicated staff for troubleshooting issues like this, most contributors of R packages have other jobs (professors, government researchers, or full-time graduate students in most cases).  If the author of the package can even be found*, chances are they’ve moved on to some other research project, or are teaching courses, and have little time to go back and mess around with code they wrote 1, 2, 3, or more years ago.  The result is a population of frustrated users, or in a worst-case scenario, users who don’t even know there’s a problem because they don’t possess the specific knowledge to evaluate the package critically.  As ecologists, many of us rely on our software to perform properly, and take for granted that it does.  Very few of us have the technical know-how to fix it if/when we find a problem.

So what’s the solution?  I don’t know of a practical one.  Besides the assumption that most people contributing R packages are of the open data-sharing ilk, and would feel a moral obligation to ensure their contribution works, what can be done?

 

*I thought I recalled a paper on the difficulty in contacting corresponding authors successfully by e-mail, but can’t seem to find it. If you know it, leave a note in the comments.
