#### General Considerations

We've seen a lot of datasets, from a variety of experiments. Generally, we don't need to know much about the biological details behind the dataset, but there are a few things that will enable us to do the correct analysis quickly. Take the following subset of data:

| X | condition | dose | replicate |
|---|---|---|---|
| 6 | Control | 1.5 | 1 |
| 10 | Control | 2 | 2 |
| 100 | Treated | 1.5 | 1 |
| 142 | Treated | 3 | 2 |

##### Types of response

Here, **X** is the thing we're measuring (the *response variable* in statistical terminology). All the instances of **X** here are whole numbers: is this an experimental constraint (perhaps it is the number of cells, which is intrinsically a whole number), or could **X** take on any value? For each *variable* (column), it is important to know what values it can take (whole numbers between 1 and N, all whole numbers, positive numbers, any number, ...) if we are to do the correct statistical test.

For instance, the famous t-test is only really applicable to variables that can take any value (positive, negative, fractions) - and so wouldn't necessarily be applicable in this example situation.
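As a rough illustration of the point above, the following Python sketch summarises which kinds of values appear in each column of the example table. The helper `describe_column` is a hypothetical name introduced here, not part of any library. Note that it can only report what the observed values look like: it cannot tell you whether **X** is *constrained* to whole numbers, which is exactly the information we need from you.

```python
# The example table from above, one dict per measurement (row).
rows = [
    {"X": 6, "condition": "Control", "dose": 1.5, "replicate": 1},
    {"X": 10, "condition": "Control", "dose": 2, "replicate": 2},
    {"X": 100, "condition": "Treated", "dose": 1.5, "replicate": 1},
    {"X": 142, "condition": "Treated", "dose": 3, "replicate": 2},
]


def describe_column(rows, name):
    """Summarise the kinds of values seen in one column (hypothetical helper)."""
    values = [row[name] for row in rows]
    numeric = all(isinstance(v, (int, float)) for v in values)
    return {
        "numeric": numeric,
        # True for counts, but also for measurements that merely happen to be whole.
        "whole_numbers": numeric and all(float(v).is_integer() for v in values),
        "all_positive": numeric and all(v > 0 for v in values),
        "distinct_values": sorted(set(values)),
    }


for name in ("X", "condition", "dose", "replicate"):
    print(name, describe_column(rows, name))
```

Here **X** and **replicate** come out as whole numbers, **dose** does not, and **condition** is flagged as a non-numeric label.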

##### Types of predictor

**condition** and **dose** are both predictors of our outcome **X**.

**condition** is not numeric - it's a label that can take one of at least two values. Typically, we'll want to know if **X** varies between the two conditions. It looks like it does in this situation.

**dose** is a numeric variable. It looks as if it is also related to **X**, in that high values in dose correspond to high values in **X**.

##### Types of structure

We notice that **replicate** takes on only two values. Does that mean there are two biological units (say, mouse 1 and mouse 2), and we've measured them before some treatment and then again after treatment? Or are there actually four biological units, with the replicate number just a convenient label to distinguish units within an experimental group, rather than a link between individuals across groups? The results will be different depending on the situation, and if you don't tell us, or get it wrong, the results won't be reliable.
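To see how much the structure matters, here is a minimal Python sketch that computes a test statistic from the same four values of **X** under each reading of **replicate**. The t-statistic is used purely because it is familiar. With two replicates per group and a whole-number response it may well not be the right test for these data, so treat this as an illustration of the bookkeeping rather than a recommended analysis. The function names are made up for this example.

```python
import math
import statistics

control = [6, 10]     # X for Control, replicates 1 and 2
treated = [100, 142]  # X for Treated, replicates 1 and 2


def paired_t(before, after):
    """t statistic when replicate i is the SAME unit in both groups."""
    diffs = [a - b for a, b in zip(after, before)]
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / se, len(diffs) - 1


def independent_t(group_a, group_b):
    """Pooled-variance t statistic when every row is a DIFFERENT unit."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * statistics.variance(group_a)
                  + (n_b - 1) * statistics.variance(group_b)) / df
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (statistics.mean(group_b) - statistics.mean(group_a)) / se, df


t_paired, df_paired = paired_t(control, treated)
t_indep, df_indep = independent_t(control, treated)
print(f"two units measured twice: t = {t_paired:.2f} on {df_paired} df")
print(f"four separate units:      t = {t_indep:.2f} on {df_indep} df")
```

The numbers going in are identical, yet both the statistic and its degrees of freedom change with the assumed structure, and so would any p-value derived from them.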

##### Types of question

Most statistical questions turn out to be estimation (*what's the best average of these replicates, together with an error bar?*) or hypothesis testing (*is this set of measurements equal to a certain value? is there a difference between group A and group B, above and beyond what would be expected by chance?*). The trick is in phrasing the biological question in such terms. Using the table above as our example again, there are a number of statistical hypotheses that one can immediately phrase:

- Is there a difference in (average value of) **X** between the control and treated groups, ignoring dose for the moment?
- Is there a difference in **X** between these two groups, taking account of dose?
- Do both treatment and dose have an effect on **X**?
- Is the dose-response the same in each of the treatment groups?
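The estimation side of these questions can be sketched in a few lines of Python. The example below groups the example table by **condition** and computes, for each group, the mean of **X** with its standard error (relevant to the first question) and the least-squares slope of **X** against **dose** (relevant to the last). These are descriptive summaries only, not hypothesis tests, and the helper names are assumptions made for this illustration. With two rows per group the slope is simply the line through the two points.

```python
import math
import statistics
from collections import defaultdict

rows = [
    {"X": 6, "condition": "Control", "dose": 1.5, "replicate": 1},
    {"X": 10, "condition": "Control", "dose": 2, "replicate": 2},
    {"X": 100, "condition": "Treated", "dose": 1.5, "replicate": 1},
    {"X": 142, "condition": "Treated", "dose": 3, "replicate": 2},
]

# Group the rows by their condition label.
by_condition = defaultdict(list)
for row in rows:
    by_condition[row["condition"]].append(row)


def mean_and_se(values):
    """Estimate of the average, plus its standard error (the 'error bar')."""
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(len(values))


def slope(points):
    """Least-squares slope of X against dose, from (dose, X) pairs."""
    d_bar = statistics.mean(d for d, _ in points)
    x_bar = statistics.mean(x for _, x in points)
    sxy = sum((d - d_bar) * (x - x_bar) for d, x in points)
    sxx = sum((d - d_bar) ** 2 for d, _ in points)
    return sxy / sxx


summary = {}
for condition, group in by_condition.items():
    mean, se = mean_and_se([r["X"] for r in group])
    summary[condition] = {
        "mean": mean,
        "se": se,
        "slope": slope([(r["dose"], r["X"]) for r in group]),
    }
    print(condition, summary[condition])
```

Whether the difference between the group means or between the slopes is larger than expected by chance is then the hypothesis-testing step, and the appropriate test depends on the types of response and structure discussed above.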

#### Recommendations

At some point, the biological question has to be phrased in a way that is amenable to numerical analysis and that captures all the hidden constraints and assumptions. If you can clarify what is going on regarding the four *types* above, then the likelihood that you get a correct answer to the correct question is substantially improved. So if you can say "I've measured **X** in four experimental groups, with three independent replicates in each group. **X** can only take values 1, 2, ..., 10, and I'm interested to know which of the three treatment groups are different from the control group", then we'll be ready to go.

The other recommendation I would make is that, if possible, you arrange your data in a rectangular table format as above. That is, each measurement gets its own row, and the columns record every relevant attribute of the data. This might seem more tedious than just saying "column A corresponds to the control group, column B corresponds to the treatment group,...", but it is much more amenable to analysis, and it generalises well: instead of resorting to colours to represent other attributes of your data, for instance, you can just add another column.
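If your data currently live in the "one column per group" layout, converting them is usually mechanical. The Python sketch below assumes a hypothetical wide layout held in a dictionary called `wide`, turns it into one row per measurement, adds **dose** as a further column, and writes the result as CSV text. The variable names are illustrative only, and the values are those of the example table.

```python
import csv
import io

# Wide layout: one column per group, one entry per replicate (assumed layout).
wide = {
    "Control": [6, 10],
    "Treated": [100, 142],
}

# Long layout: one row per measurement, one column per attribute.
long_rows = [
    {"X": x, "condition": condition, "replicate": replicate}
    for condition, values in wide.items()
    for replicate, x in enumerate(values, start=1)
]

# A further attribute is just another column; doses taken from the example table.
doses = {
    ("Control", 1): 1.5,
    ("Control", 2): 2,
    ("Treated", 1): 1.5,
    ("Treated", 2): 3,
}
for row in long_rows:
    row["dose"] = doses[(row["condition"], row["replicate"])]

# Write the rectangular table out as CSV text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["X", "condition", "dose", "replicate"])
writer.writeheader()
writer.writerows(long_rows)
print(buffer.getvalue())
```

Once the data are in this shape, adding a new attribute never requires restructuring the table, only one more key per row.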