Accurately assessing and estimating errors is a crucial but often undervalued step in any scientific experiment. This is especially critical for structural biologists concerned with analysing how well 3D structural models of proteins agree with the experimental data.
Using incorrect error estimates may skew analyses and lead to invalid conclusions. Scientists at EMBL Hamburg have now developed an approach to assess how well sets of data fit together, which bypasses the problem of error estimation altogether. The approach is relevant not only to experimentalists working with small-angle X-ray scattering (SAXS) data but also to researchers across the physical sciences.
When employed to extract structural information embedded in SAXS data from proteins and other macromolecules, the new Correlation Map (CorMap) test – details of which are published today in the journal Nature Methods – quickly and reliably discriminates between models that do and do not fit the experimental data, without the need for explicit error estimates.
As experimental scientists everywhere are aware, data collection is only the first step. This has to be followed by careful data analysis that typically involves comparing models with experimentally observed data. Assessing how well a model and the data agree inevitably requires the use of a statistical test to decide which models do, and which models do not, fit the given experimental data. For over a century, experimentalists have employed the reduced χ² (or chi-square) test – the gold standard in many fields of research, including SAXS, since its introduction in 1900. However, this test requires estimates of the measurement errors in each experimental data point: for many experiments, these errors are not available or are inaccurate, making the chi-square test invalid.
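To make the dependence on error estimates concrete, here is a minimal Python sketch of a reduced chi-square calculation. The function name `reduced_chi_square`, the `n_params` argument and the numbers are illustrative assumptions, not taken from the published work.

```python
def reduced_chi_square(observed, model, sigma, n_params=0):
    """Reduced chi-square: sum of squared, error-weighted residuals
    divided by the degrees of freedom (points minus fitted parameters)."""
    if not (len(observed) == len(model) == len(sigma)):
        raise ValueError("observed, model and sigma must have equal length")
    dof = len(observed) - n_params
    if dof <= 0:
        raise ValueError("need more data points than fitted parameters")
    chi2 = sum(((o - m) / s) ** 2 for o, m, s in zip(observed, model, sigma))
    return chi2 / dof


# Made-up data: every residual is exactly one estimated error in size.
observed = [10.2, 8.1, 6.3, 4.9, 4.1]
model = [10.0, 8.0, 6.5, 5.0, 4.0]
sigma = [0.2, 0.1, 0.2, 0.1, 0.1]

chi2_nominal = reduced_chi_square(observed, model, sigma)

# Same data and model, but with the errors overestimated by a factor of two.
chi2_inflated = reduced_chi_square(observed, model, [2 * s for s in sigma])
```

With these made-up numbers, doubling the error estimates divides the statistic by four even though the data and the model are unchanged, which illustrates why the test is only as reliable as the errors supplied to it.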
Now, Daniel Franke, Cy Jeffries and Dmitri Svergun from the biological SAXS group at EMBL Hamburg have devised a new test that sidesteps the problem of explicit error estimation altogether. Its statistical power to detect systematic deviations between data and model is nevertheless comparable to that of the reduced χ² test used with correctly specified errors.
The CorMap test takes into account the distribution of positive and negative differences between experimental data and model fit. “Daniel noticed that the longest run of positive or negative values while comparing SAXS data sets obeys the same statistics as that of the longest run of heads or tails in a simple coin toss experiment,” says Svergun, BioSAXS group leader. “If someone tossing a coin gets 30 heads in a row, you would probably think something must be wrong. Similarly, long stretches of positive or negative correlations between experimental data and a model also point to something wrong – that is, to systematic deviations. The probability of having deviations is directly calculated from the longest stretch of repetitions of the same positive or negative sign.”
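The coin-toss analogy in the quote can be sketched in a few lines of Python. This is a simplified illustration of the idea only, not the CorMap implementation in ATSAS: the function names and the residuals are hypothetical, zero differences are counted as negative for simplicity, and the probability is computed with a small dynamic programme over fair coin tosses.

```python
def longest_run(residuals):
    """Length of the longest stretch of consecutive residuals with the same sign.
    Simplification: a residual of exactly zero is counted as negative."""
    best = current = 0
    previous = None
    for r in residuals:
        sign = r > 0
        current = current + 1 if sign == previous else 1
        previous = sign
        best = max(best, current)
    return best


def prob_longest_run_at_least(n, k):
    """Probability that n fair coin tosses contain a run of at least k
    identical outcomes (heads or tails)."""
    if n < 1 or k < 1:
        raise ValueError("n and k must be positive")
    if k > n:
        return 0.0
    if k == 1:
        return 1.0
    # state[j]: probability that the current run has length j + 1
    # and no run of length k has occurred so far.
    state = [0.0] * (k - 1)
    state[0] = 1.0  # after the first toss the run length is 1
    for _ in range(n - 1):
        new_state = [0.0] * (k - 1)
        new_state[0] = 0.5 * sum(state)       # outcome switches, run restarts
        for j in range(k - 2):
            new_state[j + 1] = 0.5 * state[j]  # outcome repeats, run grows
        state = new_state
    return 1.0 - sum(state)


# Hypothetical differences between experimental data and a model fit.
residuals = [0.3, 0.1, 0.2, 0.4, 0.1, 0.2, 0.5, 0.3, -0.1, 0.2]
run = longest_run(residuals)
p_value = prob_longest_run_at_least(len(residuals), run)
```

A small `p_value` means that a run this long would be unlikely if positive and negative differences were scattered at random, which points to a systematic deviation between data and model. No error estimates enter the calculation at any point.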
CorMap is implemented in the ATSAS software package, a program suite for SAXS data analysis of biological macromolecules designed and maintained by the SAXS group at EMBL Hamburg. Both CorMap and ATSAS are free for academic use. Although the method was developed in the context of SAXS, the authors are confident that it will also be applicable in other fields of research.