If you have a dataset that requires analysing, have a question about statistical techniques, or want advice on the statistical design of an experiment, feel free to send an email or call in to the lab.  We'll generally attempt to tackle any statistical problems, but if you get us involved early on, we can help ensure that the questions of interest are answerable, and that the data format will enable rapid generation of your results.

Gavin and Stuart are the members of the group with the most experience in statistics; contact either of them for further details.

General Considerations

We've seen a lot of datasets, from a variety of experiments.  Generally, we don't need to know much about the biological details behind the dataset, but there are a few things that will enable us to do the correct analysis quickly.  Take the following subset of data:

X    condition  dose  replicate
6    Control    1.5   1
10   Control    2     2
100  Treated    1.5   1
142  Treated    3     2

Types of response
Here, X is the thing we're measuring (the response variable, in statistical terminology).  All the instances of X here are whole numbers: is this an experimental constraint (perhaps it is the number of cells, which is intrinsically a whole number), or could X take on any value?  For each variable (column), we need to know what values it can take (whole numbers between 1 and N, all whole numbers, positive numbers, any number, ...) if we are to choose the correct statistical test.
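
As a rough illustration of the kind of summary that helps us, the sketch below inspects each column of the four example rows and reports whether the observed values are labels, whole numbers or fractional numbers. The function name describe_column is our own invention for this example. Note that it can only report what was observed: whether X is constrained to whole numbers is something only the experimenter can tell us.

```python
# The four example rows from the table above, one dict per measurement.
rows = [
    {"X": 6, "condition": "Control", "dose": 1.5, "replicate": 1},
    {"X": 10, "condition": "Control", "dose": 2, "replicate": 2},
    {"X": 100, "condition": "Treated", "dose": 1.5, "replicate": 1},
    {"X": 142, "condition": "Treated", "dose": 3, "replicate": 2},
]


def describe_column(values):
    """Return a rough description of the kind of values seen in a column.

    Hypothetical helper for illustration only: it describes the observed
    values, not the values the variable could in principle take.
    """
    if all(isinstance(v, str) for v in values):
        return "label with levels " + ", ".join(sorted(set(values)))
    if all(float(v).is_integer() for v in values):
        kind = "whole numbers"
    else:
        kind = "numbers with a fractional part"
    return f"{kind}, observed range {min(values)} to {max(values)}"


# Build one description per column, keyed by column name.
summary = {name: describe_column([r[name] for r in rows]) for name in rows[0]}
for name, text in summary.items():
    print(f"{name}: {text}")
```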

For instance, the famous t-test is only really applicable to variables that can take any value (positive, negative, fractional), and so wouldn't necessarily be appropriate in this example.

Types of predictor
condition and dose are both predictors of our outcome X.  condition is not numeric: it's a label that can take one of at least two values.  Typically, we'll want to know if X varies between the two conditions, and it looks like it does here.  dose is a numeric variable.  It looks as if it is also related to X, in that high values of dose correspond to high values of X.

Types of structure
We notice that replicate takes on only two values.  Does that mean there are two biological units (say, mouse 1 and mouse 2), and we've measured them before some treatment, and then again after treatment?  Or are there actually 4 biological units, and the replicate number is just a convenient label to distinguish units within an experimental group, which doesn't link individuals between groups?  The results will be different depending on the situation, and if you don't tell us, or get it wrong, the results won't be reliable.
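
To show how much the structure matters, here is a minimal sketch that computes a t statistic from the four example values under each reading of replicate. It is only meant to show that the arithmetic changes; it sets aside whether a t-test suits this kind of response at all, and two values per group is far too few for a real analysis.

```python
from math import sqrt
from statistics import mean, stdev, variance

control = [6, 10]     # X for Control, replicates 1 and 2
treated = [100, 142]  # X for Treated, replicates 1 and 2

# Reading 1: replicate identifies the same unit in both conditions (paired).
# The analysis works on the within-unit differences.
diffs = [t - c for c, t in zip(control, treated)]
t_paired = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Reading 2: four independent units (unpaired, pooled-variance t statistic).
n1, n2 = len(control), len(treated)
pooled_var = ((n1 - 1) * variance(control) + (n2 - 1) * variance(treated)) / (n1 + n2 - 2)
t_unpaired = (mean(treated) - mean(control)) / sqrt(pooled_var * (1 / n1 + 1 / n2))

print(f"paired:   t = {t_paired:.2f} on {len(diffs) - 1} degree(s) of freedom")
print(f"unpaired: t = {t_unpaired:.2f} on {n1 + n2 - 2} degree(s) of freedom")
```

The same four numbers give a different statistic and different degrees of freedom under the two readings, so the conclusion can differ too.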

Types of question
Most statistical questions turn out to be estimation (what's the best average of these replicates, together with an error bar?) or hypothesis testing (is this set of measurements equal to a certain value?  is there a difference between group A and group B, above and beyond what would be expected by chance?).  The trick is in phrasing the biological question in such terms.  Using the table above as our example again, there are a number of statistical hypotheses that one can immediately phrase:
  • Is there a difference in (average value of) X between the control and treated groups, ignoring dose for the moment?
  • Is there a difference in X between these two groups, taking account of dose?
  • Do both treatment and dose have an effect on X?
  • Is the dose-response the same in each of the treatment groups?
These won't all give the same answer - so it's important that we understand which question you're trying to answer when we analyse your data.
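
As a purely descriptive sketch of why the questions differ, the code below computes two quantities from the four example rows: the difference in mean X between groups ignoring dose (the first question), and the change in X per unit of dose within each group (relevant to the last question). The helper two_point_slope is hypothetical and assumes exactly two rows per group, as in the example; with so little data there is no estimate of uncertainty, so this is not a test of anything.

```python
from statistics import mean

rows = [
    {"X": 6, "condition": "Control", "dose": 1.5, "replicate": 1},
    {"X": 10, "condition": "Control", "dose": 2, "replicate": 2},
    {"X": 100, "condition": "Treated", "dose": 1.5, "replicate": 1},
    {"X": 142, "condition": "Treated", "dose": 3, "replicate": 2},
]

# Group the rows by their condition label.
groups = {}
for r in rows:
    groups.setdefault(r["condition"], []).append(r)

# First question: difference in mean X between groups, ignoring dose.
group_means = {c: mean([r["X"] for r in g]) for c, g in groups.items()}
mean_difference = group_means["Treated"] - group_means["Control"]


def two_point_slope(group):
    """Change in X per unit dose; assumes exactly two rows in the group."""
    low, high = sorted(group, key=lambda r: r["dose"])
    return (high["X"] - low["X"]) / (high["dose"] - low["dose"])


# Last question: is the dose-response the same in each group?
slopes = {c: two_point_slope(g) for c, g in groups.items()}

print("difference in mean X (Treated - Control):", mean_difference)
print("change in X per unit dose, by group:", slopes)
```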


At some point, the biological question has to be phrased in a way that is amenable to numerical analysis, and which captures all the hidden constraints and assumptions.  If you can clarify what is going on regarding the four types above, then the likelihood that you get a correct answer to the correct question is substantially improved.  So if you can say "I've measured X in four experimental groups, with three independent replicates in each group.  X can only take the values 1, 2, ..., 10, and I'm interested to know which of the three treatment groups are different from the control group", then we'll be ready to go.

The other recommendation I would make is that, if possible, you arrange your data in a rectangular table format as above. That is, each measurement gets its own row, and the columns record every relevant attribute of the data.  This might seem more tedious than just saying "column A corresponds to the control group, column B corresponds to the treatment group,...", but it is much more amenable to analysis, and generalises well (so that you don't have to resort to colours to represent other attributes of your data, for instance, you can just add another column).
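
If your data already sits in a "one column per group" layout, converting it is straightforward. The sketch below uses a made-up wide table (holding the same four X values as the example) and a hypothetical helper, wide_to_long, to produce one row per measurement; the column names are assumptions chosen to match the table above.

```python
import csv
import io

# Hypothetical wide layout: one column per group, one row per replicate.
wide_text = """replicate,Control,Treated
1,6,100
2,10,142
"""


def wide_to_long(text, id_column="replicate", value_name="X", label_name="condition"):
    """Turn a one-column-per-group table into one row per measurement."""
    long_rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for column, value in row.items():
            if column == id_column:
                continue  # the id column is carried along, not a measurement
            long_rows.append(
                {
                    value_name: int(value),
                    label_name: column,
                    id_column: int(row[id_column]),
                }
            )
    return long_rows


long_rows = wide_to_long(wide_text)
for r in long_rows:
    print(r)
```

Once the data is in this form, a further attribute such as dose is just one more key in each row, rather than a colour or a note in the margin.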