Friday, November 23, 2007

Statistics Software

Nearly every paleontologist will, at some point, have to delve into statistics to answer a question related to his or her research. Unfortunately, many of the necessary statistical tests exceed the options available in Excel (I find it highly unlikely that Excel will ever have a principal components analysis, for example). So, what's a researcher to do? In this post, I'll cover some of the statistical packages available as freeware or open source software.

  • PAST. This is probably the easiest-to-use statistical package out there, and it is geared especially toward paleontologists (as you might guess from the name, which is short for "PALaeontological STatistics"). You can run diversity indices, PCA, and a whole bunch of other methods. The interface is quite user-friendly, although it has occasional quirks in how it wants the data arranged in columns. Bugs, once reported, are quickly ironed out, and new features are added relatively frequently. The statistical plots it produces are generally quite good, but there aren't many options for customizing them. The website and documentation are generally pretty good, although the documentation is a bit simplistic. Unfortunately, versions after 1.56b no longer run under WINE in Linux (but you can still download version 1.56b from the PAST site). Available for Windows.
  • R. The gold standard in statistical analysis: this is for people who are really serious about their data. One big plus with R is that it handles large data files without batting an eye. This was a lifesaver when I had FEM outputs with over 150,000 values. (To be fair, PAST loaded the file too, although much more slowly and only after a lot of data massaging. SPSS choked.) R has a very active development community, and you can find packages to do just about anything. The big downside (for some users) is that R is command-line only, although front ends such as R Commander now allow access to some, but not all, of R's features via a graphical user interface. Still, it is incredibly powerful, and it is very easy to set up little scripts that run through whole masses of data in a matter of seconds. The graphical outputs are highly customizable and easily exportable into widely used formats. The user manuals are some of the best I've ever seen in open source software, too.
  • (S)MATR. This handy little program, available as a standalone executable for Windows (it also runs under WINE in Linux), an R package, or a MATLAB toolbox, will fill all of your reduced major axis regression needs. It's fast, powerful, and about the best way I've found to deal with data that don't meet the assumptions of Model I regression (à la Sokal and Rohlf). The downside is that it doesn't produce graphical plots, but it does all the statistical tests that PAST doesn't.
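For anyone who hasn't run into reduced major axis (RMA) regression before, here is a rough Python sketch of the basic idea, set next to ordinary least squares (Model I) for comparison. This is only an illustration of the textbook slope formula and not how (S)MATR itself is implemented; the function names (`pearson_r`, `rma_regression`, `ols_regression`) and the sample numbers are made up for the example.

```python
import math
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    # Assumes neither variable is constant (otherwise this divides by zero).
    return sxy / math.sqrt(sxx * syy)


def rma_regression(x, y):
    """Reduced major axis fit: slope = sign(r) * sd(y) / sd(x)."""
    r = pearson_r(x, y)
    # copysign gives the slope the sign of r (positive if r is exactly 0).
    slope = math.copysign(stdev(y) / stdev(x), r)
    intercept = mean(y) - slope * mean(x)
    return slope, intercept


def ols_regression(x, y):
    """Ordinary least squares (Model I) fit: slope = r * sd(y) / sd(x)."""
    r = pearson_r(x, y)
    slope = r * stdev(y) / stdev(x)
    intercept = mean(y) - slope * mean(x)
    return slope, intercept


if __name__ == "__main__":
    # Hypothetical measurements, invented purely for illustration.
    length = [1, 2, 3, 4, 5]
    width = [2, 1, 4, 3, 5]
    for name, fit in (("RMA", rma_regression), ("OLS", ols_regression)):
        slope, intercept = fit(length, width)
        print(f"{name}: slope={slope:.3f}, intercept={intercept:.3f}")
```

Because the RMA slope is the OLS slope divided by the absolute value of r, it is always at least as steep as the Model I slope, and the two agree only when the points fall exactly on a line. The sketch stops at the fitted line; the confidence intervals and tests that (S)MATR reports are not covered here.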
A quick web search will turn up other statistical packages, too. The ones above are just those with which I am most familiar.
