Spectra Prediction

The Spectra Prediction utility predicts the spectra for a given input molecule.

Submit Prediction Query

Figure 1
Submit Prediction Query Steps:

  1. Enter InChI or SMILES
  2. Select desired spectra type, ion mode and adduct type.
  3. Submit prediction job to server

Note: InChI strings must start with "InChI=", and the input molecule should not carry any charge.

Definitions of Inputs:

InChI/SMILES: The molecule must be represented in either InChI format or SMILES format. InChI strings must start with "InChI=" and should not carry any charge; an additional H+ will be added. InChI strings must contain at least the main layer, with its chemical formula and atom connections sublayers, for proper computation.

Examples:
CN1CCC[C@H]1c2cccnc2
InChI=1S/C10H14N2/c1-12-7-3-5-10(12)9-4-2-6-11-8-9/h2,4,6,8,10H,3,5,7H2,1H3
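The InChI requirements above can be checked before submitting a job. The following is a minimal sketch, not the server's own validation: the function name check_inchi is hypothetical, and treating any /q (charge) or /p (proton balance) sublayer as a sign of a charged molecule is an assumption based on the standard InChI layer prefixes.

```python
def check_inchi(inchi):
    """Return a list of problems found; an empty list means the string passes."""
    problems = []
    if not inchi.startswith("InChI="):
        problems.append('must start with "InChI="')
        return problems
    # Layers are separated by "/"; the first item is the version (e.g. "1S").
    layers = inchi[len("InChI="):].split("/")
    formula = layers[1] if len(layers) > 1 else ""
    sublayers = layers[2:]
    # Sublayers start with a lowercase prefix letter; a formula does not.
    if not formula or formula[0].islower():
        problems.append("missing chemical formula in the main layer")
    if not any(s.startswith("c") for s in sublayers):
        problems.append("missing atom connections (/c) sublayer")
    # Assumption: a /q or /p sublayer means the species is charged.
    if any(s.startswith(("q", "p")) for s in sublayers):
        problems.append("molecule appears to be charged (/q or /p sublayer)")
    return problems


nicotine = "InChI=1S/C10H14N2/c1-12-7-3-5-10(12)9-4-2-6-11-8-9/h2,4,6,8,10H,3,5,7H2,1H3"
print(check_inchi(nicotine))
print(check_inchi("InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)/p-1"))
```

The first call uses the example InChI above and reports no problems; the second is a deprotonated species and is flagged as charged.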

Spectra Type: The type of spectra, either ESI (Electrospray Ionization) or EI (Electron Ionization/Impact).

Ion Mode: Indicates whether the precursor ion has a positive or negative adduct.

Adduct: Indicates the specific adduct used.

Spectra Prediction Results

Figure 1

Results Part 1: Spectra are computed for low (10 eV), medium (20 eV), and high (40 eV) collision energy levels and are represented by a list of 'mass intensity' pairs, each corresponding to a peak in the spectrum. Each peak in a predicted spectrum has an m/z value, an intensity value, and one or more possible fragment ion structures.

Figure 1

Results Part 2: A detailed list of all predicted fragments can be found farther down the page. Each fragment can be linked back to the predicted spectra by its fragment ID.

Peak Assignment

The Peak Assignment utility annotates the peaks in a provided set of spectra given a known molecule. The complete list of feasible fragments is computed, then the most likely fragments for each spectrum peak are determined using a pre-trained model.

Submit Peak Assignment Query:

Figure 1
Peak Assignment Query Steps:

  1. Enter InChI or SMILES
  2. Select desired spectra type, ion mode and adduct type.
  3. Enter spectra data. The spectra should be represented as a list of peaks with the format 'm/z intensity' on each line. Multiple energy levels are optional; only one is required.
  4. Select mass tolerance to use when matching peaks within the spectrum comparison.
  5. Submit job to server

Definitions of Inputs:

InChI/SMILES: The molecule must be represented in either InChI format or SMILES format. InChI strings must start with "InChI=" and should not carry any charge; an additional H+ will be added. InChI strings must contain at least the main layer, with its chemical formula and atom connections sublayers, for proper computation.

Examples:
Oc1ccc(CC(NC(=O)C(N)CO)C(=O)NC(CC(O)=O)C(O)=O)cc1
InChI=1S/C16H21N3O8/c17-10(7-20)14(24)18-11(5-8-1-3-9(21)4-2-8)15(25)19-12(16(26)27)6-13(22)23/h1-4,10-12,20-21H,5-7,17H2,(H,18,24)(H,19,25)(H,22,23)(H,26,27)

Spectra: The spectra should be represented as a list of peaks with the format 'mass intensity' on each line. For ESI spectra, each energy level should begin with a header line, using either 'low', 'medium', and 'high' or 'energy0', 'energy1', and 'energy2' (in that order); multiple energy levels are optional (only one is required). EI spectra need only one energy level. Spectra may also be provided in .msp file format, in which case the energy level of each ESI spectrum should be specified in the "Comment: " field (EI spectra do not need a specified energy level). A corresponding spectra ID must be selected for .msp spectra. Each spectrum in an .msp file must have both an "ID" and a "Num peaks" attribute.

Example peak list format:
low
87.054687 7.567280
105.069174 1.791050
136.076161 3.081500
160.076289 2.225420
178.084616 5.319120
223.106608 100.000000
251.101734 0.722900
297.107567 3.945980
384.140384 11.216900
medium
60.044545 2.476820
87.056965 9.632580
119.046086 2.367850
135.066335 1.865000
136.077192 46.373600
160.074417 6.652730
178.08705 20.078100
223.109344 100.000000
251.108668 3.127750
297.113687 1.892360
high
42.033909 3.047230
60.043746 26.520300
70.027268 3.162400
87.056272 18.342000
91.054494 23.516200
119.048281 5.711000
121.063402 7.273900
133.06551 5.039960
135.066238 3.626030
136.074907 100.000000
160.074409 26.458000
178.085454 12.211700

Example .msp format:
Name: Diazirine
NISTNO: 305841
ID: ID_3
Num peaks: 12
Comment: energy0
12 108.00
13 228.99
14 999.00
15 21.98
26 17.98
27 58.05
28 178.04
29 22.98
40 17.98
41 108.00
42 431.01
43 7.99
Name: Methane, diazo-
NISTNO: 57
ID: ID_4
Num peaks: 12
Comment: energy1
12 110.10
13 220.30
14 999.00
15 25.18
26 12.59
27 58.25
28 179.34
29 20.48
40 21.98
41 110.10
42 424.82
43 10.99
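The plain peak-list format described above (an energy header line followed by 'mass intensity' lines) is simple to parse locally, for example to sanity-check a file before pasting it into the form. This is a rough sketch only: parse_peak_list and the HEADERS table are hypothetical names, and assigning peaks that appear before any header to the first energy level is an assumption, not documented server behaviour.

```python
# Both header vocabularies map to the same energy index (in that order).
HEADERS = {"low": 0, "medium": 1, "high": 2,
           "energy0": 0, "energy1": 1, "energy2": 2}


def parse_peak_list(text):
    """Parse 'mass intensity' lines into {energy_index: [(mz, intensity), ...]}."""
    spectra = {}
    current = 0  # assumption: peaks before any header belong to the first level
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.lower() in HEADERS:
            current = HEADERS[line.lower()]
            spectra.setdefault(current, [])
            continue
        mz, intensity = line.split()  # raises ValueError on a malformed line
        spectra.setdefault(current, []).append((float(mz), float(intensity)))
    return spectra


# Abridged, illustrative input in the documented layout.
example = """low
87.054687 7.567280
223.106608 100.000000
medium
136.077192 46.373600
"""
spectra = parse_peak_list(example)
print(sorted(spectra))   # energy levels present
print(spectra[0])        # peaks of the first level
```

A line whose two numbers are not separated by whitespace fails at the split, which makes a damaged peak list easy to spot.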

Spectra Type: The type of spectra, either ESI (Electrospray Ionization) or EI (Electron Ionization/Impact).

Ion Mode: Indicates whether the precursor ion has a positive or negative adduct.

Mass Tolerance: The mass tolerance to use when matching peaks within the dot product comparison. The default value is 10.0 ppm.
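To make the ppm tolerance concrete: two m/z values match when their difference, relative to the reference m/z, is at most the tolerance in parts per million. The sketch below assumes the tolerance is taken relative to the reference peak; the server may define it slightly differently, and within_ppm is a hypothetical helper name.

```python
def within_ppm(observed_mz, reference_mz, tol_ppm=10.0):
    """True if the two m/z values differ by no more than tol_ppm parts per million."""
    return abs(observed_mz - reference_mz) <= reference_mz * tol_ppm * 1e-6


print(within_ppm(136.0762, 136.0757))  # about 3.7 ppm apart
print(within_ppm(136.0800, 136.0757))  # about 31.6 ppm apart
```

At m/z 136 the default 10 ppm window is roughly 0.0014 wide on each side, so the first pair matches and the second does not.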

Peak Assignment Results:

Figure 1

Results: Input spectra are shown in the plot. Peaks for which corresponding fragments have been found are colored red; unassigned peaks are colored blue. Hover over a peak to see its exact mass and intensity values, along with the highest-scoring assigned fragments, if any were found. More detailed information can be found farther down the page.

Note that in the proposed fragment annotations the charge is located on a specific atom. This may be a true representation of the charged fragment, but it is not necessarily so. Currently, CFM-ID determines the charge location by searching for an electron configuration that (1) meets the valence requirements of each atom and (2) uses exactly the number of electrons in the fragment. Several charge locations may satisfy these requirements; in that case CFM-ID picks the first valid electron configuration it finds.

Compound Identification

The Compound Identification function determines the compounds that most closely match a given MS/MS spectrum. The input MS/MS spectra (at one or more collision energies) are compared to in silico predicted MS/MS spectra and/or experimental MS/MS spectra, as chosen by the user. The top candidates are ranked according to how closely they match and returned in a list. Users may view the matching compounds and their scores in a table, and the similarity of the observed spectra to the matched spectra in an MS mirror plot.

Compound Identification Query

Figure 1
Compound Identification Steps:

  1. Select the "find candidate" or "find neutral loss candidates" option. With the neutral loss option, you enter a spectrum and the corresponding neutral loss spectrum is calculated from the parent ion mass.
  2. Select desired candidate databases. Both experimental and predicted databases are available.
  3. Select desired spectra type, ion mode and adduct type.
  4. Enter parent ion information.
  5. Select the mass tolerance for candidate retrieval.
  6. Enter spectra data. The spectra should be represented as a list of peaks with the format 'm/z intensity' on each line. Multiple energy levels are optional; only one is required. This function only accepts centroided spectra.
  7. Select scoring function for ranking.
  8. Select mass tolerance to use when matching peaks within the spectrum comparison.
  9. Submit job to server
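The neutral loss option in step 1 derives a neutral loss spectrum from the entered spectrum and the parent ion mass. How the server does this is not spelled out here; the sketch below assumes the usual definition, in which each peak is replaced by the difference between the parent ion m/z and the fragment m/z, keeping the intensity. The function name and the decision to drop peaks above the parent m/z are assumptions.

```python
def neutral_loss_spectrum(peaks, parent_mz):
    """Convert (m/z, intensity) peaks to (neutral loss, intensity) pairs.

    Assumption: loss = parent_mz - fragment m/z; peaks heavier than the
    parent ion are dropped because they would give a negative loss.
    """
    losses = [(parent_mz - mz, intensity)
              for mz, intensity in peaks if mz <= parent_mz]
    return sorted(losses)


# Illustrative values only.
peaks = [(136.0757, 100.0), (223.1066, 40.0), (384.1404, 10.0)]
nl = neutral_loss_spectrum(peaks, parent_mz=384.1404)
for loss, intensity in nl:
    print(f"{loss:.4f} {intensity:.1f}")
```

The parent ion peak itself maps to a loss of zero, and the smallest fragment maps to the largest loss.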

Definitions of Inputs:

Spectra: The spectra should be represented as a list of peaks with the format 'mass intensity' on each line. Only centroided spectra can be entered. For ESI spectra, each energy level should begin with a header line, using either 'low', 'medium', and 'high' or 'energy0', 'energy1', and 'energy2' (in that order); multiple energy levels are optional (only one is required). EI spectra need only one energy level. Spectra may also be provided in .msp file format, in which case the energy level of each ESI spectrum should be specified in the "Comment: " field (EI spectra do not need a specified energy level). A corresponding spectra ID must be selected for .msp spectra. Each spectrum in an .msp file must have both an "ID" and a "Num peaks" attribute.

Example:
low
87.054687 7.567280
105.069174 1.791050
136.076161 3.081500
160.076289 2.225420
178.084616 5.319120
223.106608 100.000000
251.101734 0.722900
297.107567 3.945980
384.140384 11.216900
medium
60.044545 2.476820
87.056965 9.632580
119.046086 2.367850
135.066335 1.865000
136.077192 46.373600
160.074417 6.652730
178.08705 20.078100
223.109344 100.000000
251.108668 3.127750
297.113687 1.892360
high
42.033909 3.047230
60.043746 26.520300
70.027268 3.162400
87.056272 18.342000
91.054494 23.516200
119.048281 5.711000
121.063402 7.273900
133.06551 5.039960
135.066238 3.626030
136.074907 100.000000
160.074409 26.458000
178.085454 12.211700

Example .msp format:
Name: Diazirine
NISTNO: 305841
ID: ID_3
Num peaks: 12
Comment: energy0
12 108.00
13 228.99
14 999.00
15 21.98
26 17.98
27 58.05
28 178.04
29 22.98
40 17.98
41 108.00
42 431.01
43 7.99
Name: Methane, diazo-
NISTNO: 57
ID: ID_4
Num peaks: 12
Comment: energy1
12 110.10
13 220.30
14 999.00
15 25.18
26 12.59
27 58.25
28 179.34
29 20.48
40 21.98
41 110.10
42 424.82
43 10.99
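The .msp requirements above (an "ID" and a "Num peaks" attribute per spectrum, with the energy level in the "Comment: " field for ESI) can be checked with a small reader. This is a simplified sketch rather than a full .msp parser: the function names are hypothetical, it assumes every record begins with a "Name:" line, and it assumes peak lines contain no colon.

```python
def parse_msp(text):
    """Parse .msp text into a list of records with metadata and peaks."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if ":" in line:
            key, value = line.split(":", 1)
            key = key.strip().lower()
            if key == "name":  # assumption: a Name field starts a new record
                current = {"peaks": []}
                records.append(current)
            if current is not None:
                current[key] = value.strip()
        elif current is not None:
            mz, intensity = line.split()[:2]
            current["peaks"].append((float(mz), float(intensity)))
    return records


def validate(record):
    """Check the attributes the help text lists as required."""
    missing = [k for k in ("id", "num peaks") if k not in record]
    if missing:
        raise ValueError(f"missing attribute(s): {missing}")
    if int(record["num peaks"]) != len(record["peaks"]):
        raise ValueError("Num peaks does not match the number of peak lines")


# Abridged, illustrative record in the documented layout.
sample = """Name: Diazirine
ID: ID_3
Num peaks: 3
Comment: energy0
12 108.00
13 228.99
14 999.00
"""
records = parse_msp(sample)
validate(records[0])
print(records[0]["id"], records[0]["comment"], len(records[0]["peaks"]))
```

Comparing "Num peaks" with the number of parsed peak lines is an extra consistency check chosen for this sketch; the help text only says the attribute must be present.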

Database: Instead of providing a candidate list, one can be generated from a selected database. Additional input options for generating a compound list from a database are:

Parent Ion Mass: The parent ion mass of the compound used in the mass spectrometry.

Adduct Type: The adduct type used in the mass spectrometry.

Candidate Mass Tolerance: The mass tolerance to use when identifying candidate compounds in the database. The default value is 100.0 ppm.

Candidate Limit: The maximum number of candidates to return. The maximum and default value is 100.
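The candidate mass tolerance and candidate limit act together as a filter on the database. The sketch below shows one way such a filter could work; it is not the server's implementation. It assumes the target mass has already been made comparable to the database masses (the adduct handling is not described here), and returning the closest candidates first is a choice made for the example. All names are hypothetical.

```python
def select_candidates(candidates, target_mass, tol_ppm=100.0, limit=100):
    """Return at most `limit` (identifier, mass) pairs within tol_ppm of target_mass.

    candidates: iterable of (identifier, mass) pairs.
    Assumption: results are ordered by absolute mass difference, closest first.
    """
    window = target_mass * tol_ppm * 1e-6
    hits = [(abs(mass - target_mass), ident, mass)
            for ident, mass in candidates
            if abs(mass - target_mass) <= window]
    hits.sort()
    return [(ident, mass) for _, ident, mass in hits[:limit]]


# Illustrative database entries.
database = [("B", 383.1700), ("C", 383.2000), ("A", 383.1329)]
print(select_candidates(database, target_mass=383.1329))
print(select_candidates(database, target_mass=383.1329, limit=1))
```

At a mass of about 383 Da the default 100 ppm window is roughly 0.038 Da on each side, so entry C, which is about 0.067 Da away, is excluded.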

Spectra Type: The type of spectra, either ESI (Electrospray Ionization) or EI (Electron Ionization/Impact).

Ion Mode: Indicates whether the precursor ion has a positive or negative adduct.

Number of Results: The number of results to return, with the default value being 10. If left blank, all results will be returned.

Mass Tolerance: The mass tolerance to use when matching peaks within the dot product comparison. The default value is 10.0 ppm.

Scoring Function: The type of scoring function to use when comparing spectra. The options are Dice and DotProduct.
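For orientation, the two scoring options can be sketched as follows. These are textbook forms of the Dice coefficient and a normalized dot product over peaks matched within a ppm tolerance; the exact matching and weighting used by CFM-ID are not described in this section, so the greedy matching and every function name below are assumptions.

```python
import math


def match_peaks(query, reference, tol_ppm=10.0):
    """Greedily pair each query peak with the closest unused reference peak."""
    pairs, used = [], set()
    for q_mz, q_int in query:
        best = None
        for j, (r_mz, r_int) in enumerate(reference):
            diff = abs(q_mz - r_mz)
            if j in used or diff > r_mz * tol_ppm * 1e-6:
                continue
            if best is None or diff < best[0]:
                best = (diff, j, r_int)
        if best is not None:
            used.add(best[1])
            pairs.append((q_int, best[2]))
    return pairs


def dice(query, reference, tol_ppm=10.0):
    """2 * (matched peaks) / (peaks in query + peaks in reference)."""
    total = len(query) + len(reference)
    matched = len(match_peaks(query, reference, tol_ppm))
    return 2.0 * matched / total if total else 0.0


def dot_product(query, reference, tol_ppm=10.0):
    """Cosine-style score: matched intensity products over the spectrum norms."""
    pairs = match_peaks(query, reference, tol_ppm)
    num = sum(q * r for q, r in pairs)
    norm = math.sqrt(sum(i * i for _, i in query) *
                     sum(i * i for _, i in reference))
    return num / norm if norm else 0.0


# Illustrative spectra as (m/z, intensity) pairs.
query = [(100.0, 50.0), (200.0, 100.0)]
reference = [(100.0005, 50.0), (200.0, 100.0), (300.0, 10.0)]
print(dice(query, reference))
print(dot_product(query, reference))
```

Dice only counts how many peaks match, while the dot product also rewards agreement in intensity, so a small unmatched peak lowers Dice more than it lowers the dot product.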

Compound Identification Results:

Figure 3

Results: Input spectra are shown in blue and candidate spectra are shown in red. If a database was queried, candidate spectra are overlaid on top for comparison. The top-ranking candidate spectrum is shown by default; to compare other database candidates, use the "Compare" buttons in the list of ranked candidate compounds that follows the spectra.

Browser Compliance

OS       Version      Chrome         Firefox  Microsoft Edge  Safari
Linux    Mint 20.1    95.0.4638      89.0.2   N/A             N/A
MacOS    BigSur 11.6  96.0.4664.93   95.0.2   N/A             14.1.2 (16611.3.10.1.6)
Windows  10           96.0.4664.110  95.0.1   96.0.1054.57    N/A

HCD vs CID Spectra

Higher-energy C-trap dissociation (HCD) is considered a more gentle fragmentation process than CID; that is, an HCD spectrum typically has more unique fragments across the entire mass-to-charge ratio range than its CID counterpart. However, for the same molecule, CID and HCD spectra acquired at similar collision energies have many fragments in common. Recall that CFM-ID is trained on CID data, so its predicted spectra are less similar to Orbitrap spectra than to QToF spectra (yielding lower Dice and/or Dot Product scores). In our experience, CFM-ID predicted spectra are still very useful for identifying compounds from Orbitrap data. In the CASMI 2016 experiments, we first determined the true collision energy of a given spectrum from its NCE value using the equation provided by Thermo Fisher, then compared this spectrum with the CID spectra at the closest collision energy.
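The last step of that procedure, choosing the closest CID collision energy, can be sketched as below. The NCE-to-eV conversion is deliberately left out because the equation is not given in this section; the function takes an energy already expressed in eV. The level values are the 10, 20, and 40 eV levels used for predicted spectra, and measuring closeness by absolute difference in eV is an assumption.

```python
# Collision energy levels of the predicted spectra, in eV.
CID_LEVELS_EV = {"low": 10.0, "medium": 20.0, "high": 40.0}


def closest_cid_level(energy_ev):
    """Return the name of the energy level nearest to energy_ev (in eV)."""
    return min(CID_LEVELS_EV,
               key=lambda name: abs(CID_LEVELS_EV[name] - energy_ev))


print(closest_cid_level(17.5))  # medium
print(closest_cid_level(33.0))  # high
```

On an exact tie, such as 15 eV, this sketch returns the lower level because it appears first in the table.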

Performance

Compound-to-Spectra Prediction Performance

Figure 1. Spectrum prediction results for the Metlin 2015 dataset in [M+H]+. Each bar displays the mean score for its metric, with an error bar indicating the 95% confidence interval. The plot on the left presents the overall performance of the model, and the plots on the right provide the performance measures for each collision energy.

Figure 2. Spectrum prediction results for the Metlin 2015 dataset in [M-H]-. Each bar displays the mean score for its metric, with an error bar indicating the 95% confidence interval. The plot on the left presents the overall performance of the model, and the plots on the right provide the performance measures for each collision energy.


Superclass                                 Dice   Dot Product  Recall  Precision  Count
Hydrocarbons                               0.474  0.259        49.3    57.1       2
Organic 1,3-dipolar compounds              0.465  0.703        43.1    69.4       1
Organic nitrogen compounds                 0.43   0.454        50.8    50.8       124
Nucleosides, nucleotides, and analogues    0.418  0.53         51.5    48.9       73
Organosulfur compounds                     0.405  0.364        43.7    54.9       18
Organic acids and derivatives              0.399  0.382        48.5    43.3       481
Lipids                                     0.394  0.417        52.2    42.6       33
Organoheterocyclic compounds               0.384  0.414        40.4    51         988
Alkaloids and derivatives                  0.377  0.471        42.5    47.3       90
Phenylpropanoids and polyketides           0.367  0.427        37.5    51         382
Benzenoids                                 0.358  0.367        42.5    44         797
Lipid-like molecules                       0.358  0.313        37.8    45.7       827
Organic oxygen compounds                   0.346  0.298        40.5    41.5       223
Organophosphorus compounds                 0.33   0.207        27.2    51.4       4
Lignans, neolignans and related compounds  0.242  0.217        31      28.4       11
Organometallic compounds                   0.227  0.22         16.8    42.8       2
Hydrocarbon derivatives                    0.208  0.248        22.2    33.9       5
Acetylides                                 0.095  0.001        11.1    8.3        1

Table 1. Average (over all three energy levels) performance metrics based on chemical classes of the training set in ESI positive mode.

Superclass                                 Dice   Dot Product  Recall  Precision  Count
Lipids                                     0.387  0.451        48.2    42.4       15
Organic acids and derivatives              0.348  0.364        38.1    45.4       335
Organosulfur compounds                     0.322  0.329        29.9    49.2       4
Nucleosides, nucleotides, and analogues    0.314  0.373        40.5    39.9       60
Lipid-like molecules                       0.310  0.366        37.0    42.4       428
Organic oxygen compounds                   0.301  0.272        32.1    42.2       84
Organoheterocyclic compounds               0.300  0.351        29.8    50.5       485
Benzenoids                                 0.297  0.318        31.3    44.2       402
Phenylpropanoids and polyketides           0.287  0.339        26.9    48.8       231
Alkaloids and derivatives                  0.277  0.263        25.1    46.3       20
Organic nitrogen compounds                 0.254  0.233        31.9    30.7       30
Hydrocarbon derivatives                    0.232  0.331        14.8    58.6       1
Lignans, neolignans and related compounds  0.205  0.285        19.7    27.0       6
Organometallic compounds                   0.145  0.100        11.0    30.6       1
Organophosphorus compounds                 0.131  0.070        8.5     42.9       1

Table 2. Average (over all three energy levels) performance metrics based on chemical classes of the training set in ESI negative mode.


Domain-Specific Performance

Figure 3. Spectrum prediction results for the Exposome, Foodome, and HMDBOopme in [M+H]+ and [M-H]-. Each bar displays the mean score for its metric, with an error bar indicating the 95% confidence interval.


Spectra-to-Compound Identification Performance

Version                                                             # Top 1  # Top 3  # Top 10
CFM-ID 2.0 + Candidate Database                                     120      160      182
CFM-ID 2.0 + Candidate Database + Experimental Spectra              123      171      201
SIRIUS 4 + CSI:FingerID                                             138      N/A      186
MS-FINDER                                                           146      162      174
CFM-ID 3.0 + Candidate Database + Experimental Spectra + Meta Data  149      194      204
CFM-ID 4.0 + Candidate Database                                     147      178      203
CFM-ID 4.0 + Candidate Database + Meta Data                         162      186      204

Table 3. Comparison of CFM-ID 4.0, CFM-ID 3.0, CFM-ID 2.0, MS-FINDER, and SIRIUS 4 compound identification performance on the CASMI 2016 contest (category 3). Reported are the total number of challenges in which the corresponding implementation of the scoring function ranked the query compound in the top 1, top 3, and top 10. "CFM-ID 4.0 + Candidate Database + Meta Data" is the method provided in this web server.

Docker Image

What is Included?

  1. Latest CFM-ID 4 MSML machine learning model
  2. C++ Runtime for CFM-ID 4 MSML
  3. Latest CFM-ID 4 MSRB rule based extension
  4. Java Runtime for CFM-ID 4 MSRB

What is not Included?

  1. CFM-ID in-silico spectra library
  2. CFM-ID experimental spectra library

What can it do?

  1. Predict spectra for a given molecule structure
  2. Annotate a spectrum given a molecule structure and its spectrum
  3. Identify a molecule from a given spectrum and a user-provided candidate list

How to use it?

Please refer to the user guide on the DockerHub page.