Using public datasets to eliminate false-positive compound annotations in untargeted LC-MS at Københavns Universitet

One of the major challenges in untargeted metabolomics is compound identification. The first step towards identifying an unknown in a sample is typically to match its measured mass against a database of compound masses. For a true identification, this initial hypothesis annotation is then confirmed with authentic standards where possible. It is, however, often not possible to investigate all unknowns at this level. Published data therefore often contain many of these “hypothesis identifications”, and conclusions are drawn from these uncertain results. In this study we seek to use a large number of public datasets to identify annotations whose evidence is questionable when compared with the corresponding data across the whole collection. One approach is to compare retention times across studies and flag compounds whose experimental retention time is unlikely given the retention times reported by others. Another is to analyse the difference between the experimental and theoretical mass of all identified compounds and flag those where the difference is larger than would be expected from the distribution of the error across all identifications.
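The mass-error check described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's method: the annotation names and m/z values are made up, the error is expressed in parts per million (ppm), and outliers are flagged with a robust z-score based on the median and the median absolute deviation (MAD) using an assumed cut-off of 3.5. The same flagging function could in principle be applied to the retention times reported for one compound across comparable studies.

```python
from statistics import median


def ppm_error(experimental_mz, theoretical_mz):
    """Mass error in parts per million (ppm)."""
    return (experimental_mz - theoretical_mz) / theoretical_mz * 1e6


def robust_z_scores(values):
    """Robust z-scores based on the median and the MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; robust z-scores are undefined")
    # 1.4826 scales the MAD to the standard deviation of a normal distribution
    return [(v - med) / (1.4826 * mad) for v in values]


def flag_outliers(values, threshold=3.5):
    """True for values whose absolute robust z-score exceeds the threshold.

    The threshold of 3.5 is an assumed, commonly used cut-off.
    """
    return [abs(z) > threshold for z in robust_z_scores(values)]


# Hypothetical annotations: (label, experimental m/z, theoretical m/z).
# All values are invented for illustration only.
annotations = [
    ("A", 200.0004, 200.0000),
    ("B", 300.0003, 300.0000),
    ("C", 150.00015, 150.0000),
    ("D", 400.0012, 400.0000),
    ("E", 250.0005, 250.0000),
    ("F", 500.0125, 500.0000),
]

errors = [ppm_error(exp, theo) for _, exp, theo in annotations]
flags = flag_outliers(errors)
flagged = [label for (label, _, _), is_out in zip(annotations, flags) if is_out]
```

With these invented values, only annotation F (about 25 ppm, against 1 to 3 ppm for the rest) ends up in `flagged`. The median and MAD are used instead of the mean and standard deviation so that the questionable annotations themselves do not inflate the spread they are judged against.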

Key learnings: metabolomics, LC-MS, compound identification, univariate statistics, bioinformatics.

Note: You will often need prior approval from your university or study advisor to ensure that projects or thesis work found on SDU Jobbank will be accepted as part of your degree programme. Contact the relevant parties in good time to make sure you choose the right project.