Replicates of Complex Mixtures in Ultra-High Resolution Mass Spectrometry Could Help Pave The Way to Big Data
Big Data, The Last Hyphenation
Mr Remy Gavard
Dr Mark Barrow
Dr Simon Spencer
Dr David Rossell
University of Warwick
Abstract Information:
By using Fourier transform ion cyclotron resonance mass spectrometry (FTICR MS),
scientists are able to determine an unprecedented number of components in crude oil. The
statistical tools required to analyse the mass spectra struggle to keep pace with advancing
instrument capabilities and increasing quantities of data. Today, we face "fat data", with
many attributes per sample, but not "tall data", because the number of exploitable training
samples is limited. This is because most ultrahigh resolution analyses of complex mixture samples
are based on single, labour-intensive experiments. As a result, it can be challenging to
monitor repeatability and differentiate between noise and true signals. Another factor
contributing to the low number of training samples available is that the data analysis is
usually performed once for the purpose of a specific investigation but may not be stored for
later use. To develop methods that exploit greater numbers of samples, we
need to ensure the consistency, reliability and organisation of MS data.
We present a new algorithm developed in R, named Themis, to jointly pre-process replicate measurements of a complex sample. False positive peaks with low intensity can arise throughout a single mass spectrum due to the presence of noise. Because the noise is random, the locations of these peaks are not consistent between replicate measurements. Researchers are typically faced with a trade-off: the peak picking threshold must be low enough to avoid omitting genuine peaks, but a threshold that low can also admit large numbers of noise peaks. By combining information across replicate datasets, we identify and remove false positive peaks with a smaller margin of error. This enables true peaks of low intensity to be extracted from the background noise and improves consistency as a preliminary step to assigning chemical compositions and data analysis. Through the use of peak alignment and an adaptive mixture-model-based strategy, it is possible to distinguish true peaks from noise and obtain more reliable datasets for further use.
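The idea of combining replicates can be illustrated with a minimal sketch. It is not the Themis implementation: it is written in Python rather than R, and it replaces the adaptive mixture model with a much simpler rule that keeps a peak only if it appears in a minimum number of replicates. The function names, the 1 ppm alignment tolerance and the `min_replicates` parameter are all assumptions made for illustration.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def align_peaks(replicates: List[List[Peak]], tol_ppm: float = 1.0):
    """Group peaks from all replicates by m/z.

    A peak joins the current group if its m/z lies within tol_ppm of the
    running mean m/z of that group; otherwise it starts a new group.
    (Hypothetical helper; the tolerance is an illustrative assumption.)
    """
    # Tag every peak with the index of the replicate it came from, then sort by m/z.
    tagged = sorted(
        (mz, intensity, rep_idx)
        for rep_idx, peaks in enumerate(replicates)
        for mz, intensity in peaks
    )
    groups = []
    for mz, intensity, rep_idx in tagged:
        if groups:
            last = groups[-1]
            centre = sum(p[0] for p in last) / len(last)
            if abs(mz - centre) / centre * 1e6 <= tol_ppm:
                last.append((mz, intensity, rep_idx))
                continue
        groups.append([(mz, intensity, rep_idx)])
    return groups


def consensus_peaks(replicates: List[List[Peak]],
                    tol_ppm: float = 1.0,
                    min_replicates: int = 2):
    """Keep aligned peaks seen in at least min_replicates distinct replicates.

    Returns (mean m/z, mean intensity, number of replicates) per kept peak.
    This simple count rule stands in for the mixture-model step described above.
    """
    kept = []
    for group in align_peaks(replicates, tol_ppm):
        n_reps = len({rep_idx for _, _, rep_idx in group})
        if n_reps >= min_replicates:
            mean_mz = sum(mz for mz, _, _ in group) / len(group)
            mean_intensity = sum(i for _, i, _ in group) / len(group)
            kept.append((mean_mz, mean_intensity, n_reps))
    return kept


# Three synthetic replicates: one strong peak, one weak but reproducible peak,
# and one low-intensity peak per replicate at an inconsistent location.
replicates = [
    [(300.10000, 5.0), (300.25000, 0.8), (301.40000, 0.6)],
    [(300.10010, 5.2), (300.25005, 0.7), (302.90000, 0.5)],
    [(300.09995, 4.9), (300.25010, 0.9), (303.70000, 0.7)],
]
result = consensus_peaks(replicates)
```

On this synthetic input, the weak peak near m/z 300.25 is retained because it recurs in every replicate, while the three isolated low-intensity peaks are dropped even though their intensities are similar. An intensity threshold applied to a single spectrum could not separate these two cases.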
We applied Themis to a variety of crude oils and naphthenic acid samples. The results demonstrated more effective removal of noise-related peaks, together with preservation and improvement of the chemical composition profile. Themis enabled the isolation of peaks that would otherwise have been discarded by traditional peak picking (based upon signal-to-noise ratio alone) on a single spectrum. Themis therefore retains information that would typically be lost, while also reducing dataset sizes.
Themis affords greater success with the assignment of chemical compositions to low-intensity peaks using petroleomic software. In addition, improved monitoring of data quality and handling of replicate datasets will allow researchers to process larger numbers of samples with greater confidence. This, in turn, will enable larger-scale data analysis methods, which inform decision making.