A major issue in both Computational Science and Cognitive Science is evaluation. Statistics was largely developed to provide a way of dealing quantitatively with the kind of qualitative data we traditionally have in the Behavioural Sciences. However, different disciplines have different traditions and conventions when it comes to designing, evaluating and writing up work. This issue is of great concern to many journal editors, and has become a significant area of research for me personally, so the rest of this editorial will shamelessly explore it. I invite people interested in these problems to get in touch, as I am considering setting up a Thematic Series on Evaluation and Visualization that will focus on the issues involved in comparing results from computational models with human/animal data.
In the age of Big Data, with teams of people around the planet tackling very similar problems, or even exactly the same problem using exactly the same data set, the traditional ideas of significance are becoming dated and misleading. Not that the area has ever been anything but controversial. We encourage researchers to think in terms of the 'New Statistics', which encourages presenting data in a way that permits meta-analysis, that is, combining data from the published work of multiple groups to obtain a bigger, more accurate and more diverse sample. In particular, we want to see effect sizes with standard deviations and standard errors, and when significance is reported, a precise p-value should be given rather than a broad α-band. Furthermore, the sample size should always be specified.
We recommend showing both the standard deviation and the standard error in tables and plots.
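As a concrete illustration, here is a minimal sketch in Python (with hypothetical scores from two groups, assuming NumPy and SciPy are available) of the quantities we ask authors to report: sample sizes, means, standard deviations, standard errors, an effect size (Cohen's d), and an exact p-value.

```python
import numpy as np
from scipy import stats

# Hypothetical accuracy scores from two independent groups (illustration only).
group_a = np.array([0.71, 0.74, 0.68, 0.77, 0.73, 0.70, 0.75, 0.72])
group_b = np.array([0.66, 0.69, 0.64, 0.70, 0.67, 0.65, 0.71, 0.68])

def describe(x):
    """Return n, mean, sample standard deviation and standard error."""
    n = len(x)
    sd = x.std(ddof=1)
    return n, x.mean(), sd, sd / np.sqrt(n)

n_a, mean_a, sd_a, se_a = describe(group_a)
n_b, mean_b, sd_b, se_b = describe(group_b)

# Cohen's d using a pooled standard deviation.
pooled_sd = np.sqrt(((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2))
cohens_d = (mean_a - mean_b) / pooled_sd

# Exact p-value, not just an alpha-band.
t_stat, p_value = stats.ttest_ind(group_a, group_b)

print(f"A: n={n_a}, mean={mean_a:.3f}, SD={sd_a:.3f}, SE={se_a:.3f}")
print(f"B: n={n_b}, mean={mean_b:.3f}, SD={sd_b:.3f}, SE={se_b:.3f}")
print(f"Cohen's d = {cohens_d:.2f}, t = {t_stat:.2f}, p = {p_value:.4f}")
```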
The reason for this is that if many groups do similar experiments, and each uses an α = 0.05 threshold of significance, then as soon as we have a few groups or a dozen tests we have a good chance of seeing a 'significant' result by chance. If people have test beds where they explore many algorithms against many datasets, then we expect one in twenty comparisons to come out 'significantly better' just by chance, unless they correct for the multiple testing or graphically align all the relevant results (dropping the 'bad' ones biases the selection just as much as showing only the 'good' results).
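To make that arithmetic concrete, the sketch below computes the probability of at least one spurious 'significant' result across k independent tests at α = 0.05, and applies Bonferroni and Holm adjustments to some made-up p-values; it is an illustration of the problem, not a prescription for any particular correction.

```python
import numpy as np

alpha = 0.05

# Probability of at least one false positive across k independent tests at alpha.
for k in (1, 5, 12, 20):
    fwer = 1 - (1 - alpha) ** k
    print(f"k={k:2d} tests: P(at least one 'significant' by chance) = {fwer:.2f}")

# Bonferroni and Holm corrections on hypothetical p-values (illustration only).
p_values = np.array([0.003, 0.012, 0.021, 0.04, 0.26, 0.61])
m = len(p_values)

bonferroni = np.minimum(p_values * m, 1.0)

order = np.argsort(p_values)
holm = np.empty(m)
running_max = 0.0
for rank, idx in enumerate(order):
    adjusted = min((m - rank) * p_values[idx], 1.0)
    running_max = max(running_max, adjusted)   # enforce monotonicity of adjusted p-values
    holm[idx] = running_max

print("raw       :", p_values)
print("Bonferroni:", bonferroni)
print("Holm      :", holm)
```

With a dozen tests the chance of at least one spurious 'significant' result is already close to one in two, and with twenty it is roughly two in three.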
Unfortunately, error bars and ± notations are not only used to show standard errors, but are also employed to display larger confidence intervals, which represent the confidence that the mean lies in the indicated range and are set as some multiple of the standard error according to some specific, but often implicit, model of significance. Even worse is using the whisker notation to display standard deviations.
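The three quantities that whiskers are commonly used for can differ substantially, which is why they must be labelled explicitly; a minimal sketch with made-up scores follows.

```python
import numpy as np
from scipy import stats

# Made-up scores for illustration.
scores = np.array([0.71, 0.74, 0.68, 0.77, 0.73, 0.70, 0.75, 0.72])
n = len(scores)

mean = scores.mean()
sd = scores.std(ddof=1)                        # spread of the data
se = sd / np.sqrt(n)                           # uncertainty of the mean
ci_half = stats.t.ppf(0.975, df=n - 1) * se    # half-width of a 95% CI under a t model

print(f"mean = {mean:.3f}")
print(f"SD   = {sd:.3f}  (whisker spanning mean ± SD describes the data)")
print(f"SE   = {se:.3f}  (whisker spanning mean ± SE describes the mean)")
print(f"95% CI half-width = {ci_half:.3f}  (roughly 2.36 × SE for n = 8)")
```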
Some basic principles…
Start with the standards that are appropriate to the disciplines you have been trained in; we expect that every set of authors will include some computational or mathematical training and some training in at least one cognitive discipline. Show effect sizes, not just p-values, and show actual p-values, not broad α-bands: p-values do not tell us how strong the effect is, only how unlikely such a result would be by chance, and α should be determined a priori. For a properly performed experiment using a reasonable model with plausible assumptions, failure to achieve a significant effect is still of interest and worth reporting, whether because the effect is of low magnitude or because it formally fails a statistical test.
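The distinction between significance and effect size is easy to demonstrate: with a large enough sample, a negligible effect can yield an arbitrarily small p-value. The sketch below, using simulated data purely for illustration, produces a 'highly significant' result whose effect size is trivial.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two large simulated groups differing by a negligible amount.
n = 100_000
a = rng.normal(loc=0.500, scale=0.10, size=n)
b = rng.normal(loc=0.503, scale=0.10, size=n)

t_stat, p_value = stats.ttest_ind(a, b)
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
cohens_d = (b.mean() - a.mean()) / pooled_sd

print(f"p = {p_value:.2e}   (looks impressive)")
print(f"Cohen's d = {cohens_d:.3f}   (a trivially small effect)")
```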
Where data is paired (within subject, repeated measures), show the differences and treat them as your effect, and calculate and present means, standard deviations and standard errors on those differences.
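For paired designs, a minimal sketch (with hypothetical within-subject scores) computes the per-subject differences and reports their mean, standard deviation and standard error, with a paired t-test for reference.

```python
import numpy as np
from scipy import stats

# Hypothetical within-subject scores under two conditions (illustration only).
condition_a = np.array([0.71, 0.74, 0.68, 0.77, 0.73, 0.70, 0.75, 0.72])
condition_b = np.array([0.66, 0.71, 0.65, 0.72, 0.70, 0.66, 0.71, 0.69])

differences = condition_a - condition_b        # the effect of interest
n = len(differences)

mean_diff = differences.mean()
sd_diff = differences.std(ddof=1)
se_diff = sd_diff / np.sqrt(n)

# Equivalent to a one-sample t-test on the differences.
t_stat, p_value = stats.ttest_rel(condition_a, condition_b)

print(f"n = {n}")
print(f"mean difference = {mean_diff:.3f}, SD = {sd_diff:.3f}, SE = {se_diff:.3f}")
print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
```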
In terms of the underlying measure, pay particular attention to whether it is meaningful, and whether you should be evaluating using chance-corrected techniques. Generally, single-class measures like Recall, Precision and F-measure are inappropriate for CCS (Powers 2008a, b), uncorrected Accuracy is inappropriate for imbalanced or variable-prevalence data, and various forms of Kappa, as well as DeltaP and Youden's J statistic, have been proposed for the dichotomous case (Powers, 2012). It is important to consider both Sensitivity and Specificity, and it turns out that balanced optimization of the pair optimizes Informedness, the probability of an informed decision. Informedness is useful in the multiclass case, being a generalization of Youden's J and a form of Kappa. These statistics are appropriate for a single direction of prediction, with Accuracy assuming all instances are weighted or costed equally, while Informedness assumes all classes are weighted or costed equally.
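As an illustration of the dichotomous case, the sketch below computes Sensitivity, Specificity, and Informedness (Youden's J = Sensitivity + Specificity - 1) from a 2 × 2 contingency table, with raw Accuracy shown for contrast; the counts are made up.

```python
# Hypothetical 2x2 contingency table (counts are made up for illustration).
tp, fn = 80, 20    # real positives: predicted positive / negative
fp, tn = 30, 170   # real negatives: predicted positive / negative

sensitivity = tp / (tp + fn)            # recall on the positive class
specificity = tn / (tn + fp)            # recall on the negative class
accuracy = (tp + tn) / (tp + fn + fp + tn)

# Informedness (Youden's J): probability of an informed (better-than-chance) decision.
informedness = sensitivity + specificity - 1

print(f"Sensitivity  = {sensitivity:.3f}")
print(f"Specificity  = {specificity:.3f}")
print(f"Accuracy     = {accuracy:.3f}   (prevalence-dependent, not chance-corrected)")
print(f"Informedness = {informedness:.3f}   (chance-corrected)")
```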
If there is no preferred direction of prediction, then Correlation is an appropriate statistic - this is the geometric mean of the two directions of Informedness, and is also closely related to common tests for Significance (Powers 2008b). If groups or clusterings are being compared, and there is not even a known or fixed number of classes, then there are a large number of clustering comparison techniques to choose from (Pfitzner et al. 2009).
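Continuing the hypothetical table above, the following sketch adds Markedness (Informedness in the reverse, marking direction) and confirms that the geometric mean of the two equals the Matthews (phi) correlation computed directly from the table.

```python
import math

# Same hypothetical contingency table as above.
tp, fn, fp, tn = 80, 20, 30, 170

informedness = tp / (tp + fn) + tn / (tn + fp) - 1   # prediction direction
markedness   = tp / (tp + fp) + tn / (tn + fn) - 1   # reverse (marking) direction

# Matthews (phi) correlation computed directly from the table.
mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(f"Informedness = {informedness:.3f}")
print(f"Markedness   = {markedness:.3f}")
print(f"geometric mean       = {math.sqrt(informedness * markedness):.3f}")
print(f"Matthews correlation = {mcc:.3f}")
```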
While tables of results are useful, they are usually easier to understand when the best and the statistically significant results are highlighted, and when the table is supplemented with a visualization.
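A minimal sketch of the kind of visualization we have in mind, using matplotlib with made-up results: bars show mean scores per system, whiskers show the standard error and are labelled as such, and the best-performing system is highlighted.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up results: mean score and standard error per system (illustration only).
systems = ["Baseline", "Model A", "Model B", "Model C"]
means = np.array([0.61, 0.68, 0.73, 0.70])
ses = np.array([0.02, 0.02, 0.015, 0.025])

best = means.argmax()
colors = ["tab:gray"] * len(systems)
colors[best] = "tab:blue"                      # highlight the best result

fig, ax = plt.subplots(figsize=(5, 3))
ax.bar(systems, means, yerr=ses, color=colors, capsize=4)
ax.set_ylabel("Mean score")
ax.set_title("Whiskers show ±1 standard error")
fig.tight_layout()
fig.savefig("results.png", dpi=150)
```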