CFM-ID: Help

Spectra Prediction

The Spectra Prediction utility predicts the spectra for a given input molecule.

Submit Prediction Query

Submit Prediction Query Steps:

Enter InChI or SMILES
Select desired spectra type, ion mode and adduct type.
Submit prediction job to server

Note: InChI strings must start with "InChI=", and the input molecule should not carry any charge.

Definitions of Inputs:

InChI/SMILES: The molecule must be represented in either InChI format or SMILES format. InChI strings need to start with "InChI=" and are not expected to have any charge - an additional H+ will be added. InChI strings need to contain AT LEAST the main layer with its chemical formula and atom connections sublayers for proper computation.

Examples:	CN1CCC[C@H]1c2cccnc2
	InChI=1S/C10H14N2/c1-12-7-3-5-10(12)9-4-2-6-11-8-9/h2,4,6,8,10H,3,5,7H2,1H3
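
The InChI requirements above can be checked before submitting a job. The following is a minimal sketch, not the server's own validation: the function name check_inchi is hypothetical, and it assumes that a charged input can be recognised by the presence of a charge (/q) or proton (/p) layer.

```python
def check_inchi(inchi: str) -> list[str]:
    """Return a list of problems; an empty list means the string passes these checks."""
    problems = []
    if not inchi.startswith("InChI="):
        problems.append('must start with "InChI="')
        return problems
    layers = inchi.split("/")
    # layers[0] is the version prefix, layers[1] is the chemical formula.
    if len(layers) < 2 or not layers[1]:
        problems.append("missing chemical formula")
    # The atom connections sublayer is prefixed with "c".
    if not any(layer.startswith("c") for layer in layers[2:]):
        problems.append("missing atom connections sublayer (/c)")
    # Assumption: a /q (charge) or /p (proton) layer marks a charged molecule.
    if any(layer.startswith(("q", "p")) for layer in layers[2:]):
        problems.append("charged input (/q or /p layer present)")
    return problems


example = "InChI=1S/C10H14N2/c1-12-7-3-5-10(12)9-4-2-6-11-8-9/h2,4,6,8,10H,3,5,7H2,1H3"
print(check_inchi(example))
```

An empty list for the example string means none of the three checks failed. SMILES input is not covered by this sketch.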

Spectra Type: The type of spectra, either ESI (Electrospray Ionization) or EI (Electron Ionization/Impact).

Ion Mode: Indicates whether the precursor ion has a positive or negative adduct.

Adduct: Indicates the specific adduct used.

Spectra Prediction Results

Results Part 1: Spectra are computed for low (10 eV), medium (20 eV) and high (40 eV) collision energy levels and are represented by a list of 'mass intensity' pairs, each corresponding to a peak in the spectra. Each peak in a predicted spectrum has an m/z value, an intensity value, and one or more possible fragment ion structure(s).

Results Part 2: A detailed list of all predicted fragments can be found further down the page. Each fragment can be linked back to the predicted spectra by its fragment ID.

Peak Assignment

The Peak Assignment utility annotates the peaks in a provided set of spectra given a known molecule. The complete list of feasible fragments is computed, then the most likely fragments for each spectrum peak are determined using a pre-trained model.

Submit Peak Assignment Query:

Peak Assignment Query Steps:

Enter InChI or SMILES
Select desired spectra type, ion mode and adduct type.
Enter spectra data. The spectra should be represented as a list of peaks with the format 'm/z intensity' on each line. Multiple energy levels are optional; only one is required.
Select mass tolerance to use when matching peaks within the spectrum comparison.
Submit job to server

Definitions of Inputs:

Examples:	Oc1ccc(CC(NC(=O)C(N)CO)C(=O)NC(CC(O)=O)C(O)=O)cc1
	InChI=1S/C16H21N3O8/c17-10(7-20)14(24)18-11(5-8-1-3-9(21)4-2-8)15(25)19-12(16(26)27)6-13(22)23/h1-4,10-12,20-21H,5-7,17H2,(H,18,24)(H,19,25)(H,22,23)(H,26,27)

Spectra: The spectra should be represented as a list of peaks with the format 'mass intensity' on each line. For ESI spectra, 'low', 'medium', and 'high' or 'energy0', 'energy1', and 'energy2' header lines should begin spectra of different energy levels (in that order); multiple energy levels are optional (only one is required). EI spectra only need one energy level. Spectra may also be in .msp file format, in which case energy levels for ESI spectra should be specified in the "Comment: " field (EI spectra do not need a specified energy level). A corresponding spectrum ID must be selected for .msp spectra. .msp files must have "ID" and "Num peaks" attributes for each spectrum.

Example peak list format:	low
	87.054687	7.567280
	105.069174	1.791050
	136.07616	13.081500
	160.076289	2.225420
	178.084616	5.319120
	223.106608	100.000000
	251.10173	40.722900
	297.107567	3.945980
	384.140384	11.216900
	medium
	60.044545	2.476820
	87.056965	9.632580
	119.046086	2.367850
	135.066335	1.865000
	136.077192	46.373600
	160.074417	6.652730
	178.08705	20.078100
	223.109344	100.000000
	251.108668	3.127750
	297.113687	1.892360
	high
	42.033909	3.047230
	60.043746	26.520300
	70.027268	3.162400
	87.056272	18.342000
	91.054494	23.516200
	119.04828	15.711000
	121.063402	7.273900
	133.06551	5.039960
	135.066238	3.626030
	136.074907	100.000000
	160.074409	26.458000
	178.085454	12.211700
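
For readers preparing input programmatically, the peak list format above can be read with a few lines of Python. This is only a sketch under stated assumptions: the function name parse_peak_list is hypothetical, the two header styles are treated as synonyms, and peaks that appear before any header are assigned to the first energy level.

```python
# Both header styles map to one key per energy level.
ENERGY_ALIASES = {
    "low": "energy0", "medium": "energy1", "high": "energy2",
    "energy0": "energy0", "energy1": "energy1", "energy2": "energy2",
}


def parse_peak_list(text: str) -> dict[str, list[tuple[float, float]]]:
    """Parse 'mass intensity' lines grouped under optional energy headers."""
    spectra: dict[str, list[tuple[float, float]]] = {}
    current = "energy0"  # assumption: headerless peaks belong to the first level
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key = ENERGY_ALIASES.get(line.lower())
        if key is not None:
            # A header line starts a new energy level.
            current = key
            spectra.setdefault(current, [])
            continue
        # Any other line is a peak: mass and intensity separated by whitespace.
        mz, intensity = line.split()[:2]
        spectra.setdefault(current, []).append((float(mz), float(intensity)))
    return spectra


sample = "low\n87.054687\t7.567280\n105.069174\t1.791050\nmedium\n60.044545\t2.476820\n"
print(parse_peak_list(sample))
```

A line with fewer than two columns raises a ValueError, which is a convenient way to catch malformed input before submission.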

Example .msp format:
	Name: Diazirine
	NISTNO: 305841
	ID: ID_3
	Num peaks: 12
	Comment: energy0
	12	108.00
	13	228.99
	14	999.00
	15	21.98
	26	17.98
	27	58.05
	28	178.04
	29	22.98
	40	17.98
	41	108.00
	42	431.01
	43	7.99

	Name: Methane, diazo-
	NISTNO: 57
	ID: ID_4
	Num peaks: 12
	Comment: energy1
	12	110.10
	13	220.30
	14	999.00
	15	25.18
	26	12.59
	27	58.25
	28	179.34
	29	20.48
	40	21.98
	41	110.10
	42	424.82
	43	10.99
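
The .msp layout shown above can be read in a similar way. The sketch below is illustrative only: parse_msp and check_msp_record are hypothetical names, records are assumed to be separated by blank lines, and the comparison of "Num peaks" against the number of peak lines is an extra sanity check that the help text does not itself require.

```python
def parse_msp(text: str) -> list[dict]:
    """Split .msp text into records with lower-cased field names and a peak list."""
    records = []
    record = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            record = None  # a blank line ends the current record
            continue
        if record is None:
            record = {"fields": {}, "peaks": []}
            records.append(record)
        if ":" in line:
            # Attribute line such as "ID: ID_3" or "Comment: energy0".
            name, value = line.split(":", 1)
            record["fields"][name.strip().lower()] = value.strip()
        else:
            # Peak line: m/z and intensity separated by whitespace.
            mz, intensity = line.split()[:2]
            record["peaks"].append((float(mz), float(intensity)))
    return records


def check_msp_record(record: dict) -> list[str]:
    """Report missing required attributes for one record."""
    problems = []
    fields = record["fields"]
    for required in ("id", "num peaks"):
        if required not in fields:
            problems.append(f'missing "{required}" attribute')
    # Assumption: the declared peak count should equal the number of peak lines.
    if "num peaks" in fields and int(fields["num peaks"]) != len(record["peaks"]):
        problems.append("Num peaks does not match the number of peak lines")
    return problems
```

For ESI spectra, the energy level of each record is then available as record["fields"]["comment"].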

Spectra Type: The type of spectra, either ESI (Electrospray Ionization) or EI (Electron Ionization/Impact).

Ion Mode: Indicates whether the precursor ion has a positive or negative adduct.

Mass Tolerance: The mass tolerance to use when matching peaks within the dot product comparison. The default value is 10.0 ppm.
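
To give a feel for what a ppm tolerance means in absolute terms, the sketch below tests whether two m/z values agree within a given tolerance. The function names are hypothetical, and measuring the tolerance relative to the reference m/z is an assumption; the server may define the window differently.

```python
def ppm_error(observed_mz: float, reference_mz: float) -> float:
    """Absolute difference between two m/z values in parts per million."""
    return abs(observed_mz - reference_mz) / reference_mz * 1e6


def within_ppm(observed_mz: float, reference_mz: float, tol_ppm: float = 10.0) -> bool:
    """True if the two m/z values agree within tol_ppm of the reference."""
    return ppm_error(observed_mz, reference_mz) <= tol_ppm


# Two peak pairs taken from the example peak list (low vs medium energy).
print(within_ppm(136.07616, 136.077192))
print(within_ppm(223.106608, 223.109344))
```

At m/z 223, a 10 ppm window is only about 0.0022 Da wide, so the second pair (which differs by about 0.0027 Da) falls outside the default tolerance while the first pair falls inside it.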

Peak Assignment Results:

Results: Input spectra are shown in the plot. Peaks for which corresponding fragments have been found are colored red; unassigned peaks are colored blue. Hover over a peak to see its exact mass and intensity values, along with the highest scoring assigned fragments, if any. More detailed information can be found farther down the page. Note that, in the proposed fragment annotations, the charge is located on a specific atom. While this may be a true representation of the charged fragment, this is not necessarily the case. Currently, CFM-ID determines the charge location by finding an electron configuration that (1) meets the valence requirements of each atom and (2) uses exactly the number of electrons in the fragment. Several charge locations may satisfy these requirements; in that case CFM-ID picks the first electron configuration it finds.

Compound Identification

The Compound Identification function determines the compounds that most closely match a given MS/MS spectrum. The input MS/MS spectra (at one or more collision energies) are compared to in silico predicted MS/MS spectra and/or experimental MS/MS spectra, as chosen by the user. The top candidates are ranked according to how closely they match and returned in a list. Users may view the matching compounds and their scores in a table, and the similarity of the observed spectra to the matched spectra in an MS mirror plot.

Compound Identification Query

Compound Identification Steps:

Select the find candidate or find neutral loss candidates option. With the neutral loss option, you enter a spectrum and the corresponding neutral loss spectrum is calculated from the parent ion mass.
Select desired candidate databases. Both experimental and predicted databases are available.
Select desired spectra type, ion mode and adduct type.
Enter parent ion information.
Select mass tolerance for candidate retrieval.
Enter spectra data. The spectra should be represented as a list of peaks with the format 'm/z intensity' on each line. Multiple energy levels are optional; only one is required. This function only accepts centroided spectra.
Select scoring function for ranking.
Select mass tolerance to use when matching peaks within the spectrum comparison.
Submit job to server
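
The neutral loss option in the first step can be illustrated with a short sketch. It assumes the usual definition of a neutral loss (parent ion mass minus fragment m/z) and drops peaks above the parent ion mass; the function name is hypothetical and the server's exact procedure may differ.

```python
def neutral_loss_spectrum(peaks, parent_ion_mass):
    """Convert (m/z, intensity) peaks into (neutral loss, intensity) pairs."""
    losses = [
        (parent_ion_mass - mz, intensity)
        for mz, intensity in peaks
        if mz <= parent_ion_mass  # assumption: ignore peaks heavier than the parent ion
    ]
    # Sort by neutral loss mass so the result reads like an ordinary peak list.
    return sorted(losses)


peaks = [(100.0, 50.0), (150.0, 100.0), (250.0, 5.0)]
print(neutral_loss_spectrum(peaks, parent_ion_mass=200.0))
```

Each fragment peak keeps its intensity; only the mass axis changes.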

Definitions of Inputs:

Spectra: The spectra should be represented as a list of peaks with the format 'mass intensity' on each line. Only centroided spectra can be entered. For ESI spectra, 'low', 'medium', and 'high' or 'energy0', 'energy1', and 'energy2' header lines should begin spectra of different energy levels (in that order); multiple energy levels are optional (only one is required). EI spectra only need one energy level. Spectra may also be in .msp file format, in which case energy levels for ESI spectra should be specified in the "Comment: " field (EI spectra do not need a specified energy level). A corresponding spectrum ID must be selected for .msp spectra. .msp files must have "ID" and "Num peaks" attributes for each spectrum.

Example:	low
	87.054687	7.567280
	105.069174	1.791050
	136.07616	13.081500
	160.076289	2.225420
	178.084616	5.319120
	223.106608	100.000000
	251.10173	40.722900
	297.107567	3.945980
	384.140384	11.216900
	medium
	60.044545	2.476820
	87.056965	9.632580
	119.046086	2.367850
	135.066335	1.865000
	136.077192	46.373600
	160.074417	6.652730
	178.08705	20.078100
	223.109344	100.000000
	251.108668	3.127750
	297.113687	1.892360
	high
	42.033909	3.047230
	60.043746	26.520300
	70.027268	3.162400
	87.056272	18.342000
	91.054494	23.516200
	119.04828	15.711000
	121.063402	7.273900
	133.06551	5.039960
	135.066238	3.626030
	136.074907	100.000000
	160.074409	26.458000
	178.085454	12.211700

Example .msp format:
	Name: Diazirine
	NISTNO: 305841
	ID: ID_3
	Num peaks: 12
	Comment: energy0
	12	108.00
	13	228.99
	14	999.00
	15	21.98
	26	17.98
	27	58.05
	28	178.04
	29	22.98
	40	17.98
	41	108.00
	42	431.01
	43	7.99

	Name: Methane, diazo-
	NISTNO: 57
	ID: ID_4
	Num peaks: 12
	Comment: energy1
	12	110.10
	13	220.30
	14	999.00
	15	25.18
	26	12.59
	27	58.25
	28	179.34
	29	20.48
	40	21.98
	41	110.10
	42	424.82
	43	10.99

Database: Instead of providing a candidate list, one can be generated from a selected database. Additional input options for generating a compound list from a database are:

Parent Ion Mass: The parent ion mass of the compound used in the mass spectrometry.

Adduct Type: The adduct type used in the mass spectrometry.

Candidate Mass Tolerance: The mass tolerance to use when identifying candidate compounds in the database. The default value is 100.0 ppm.

Candidate Limit: The maximum number of candidates to return. The maximum and default value is 100.
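
The way these three options interact can be sketched as follows. This is not the server's implementation: find_candidates and the compound list are hypothetical, the adduct is assumed to be [M+H]+ (one added proton, about 1.007276 Da), and candidates are ordered by mass error purely for illustration.

```python
PROTON_MASS = 1.007276  # Da; mass added by an [M+H]+ adduct


def find_candidates(parent_ion_mass, compounds, tol_ppm=100.0, limit=100):
    """Return names of compounds whose [M+H]+ mass lies within tol_ppm of the parent ion mass."""
    hits = []
    for name, neutral_mass in compounds:
        ion_mass = neutral_mass + PROTON_MASS  # assumption: [M+H]+ adduct
        error_ppm = abs(ion_mass - parent_ion_mass) / parent_ion_mass * 1e6
        if error_ppm <= tol_ppm:
            hits.append((error_ppm, name))
    hits.sort()  # smallest mass error first
    return [name for _, name in hits[:limit]]


# Hypothetical database entries: (name, neutral monoisotopic mass in Da).
compounds = [("compound_A", 162.115698), ("compound_B", 162.200000), ("compound_C", 180.000000)]
print(find_candidates(163.122974, compounds))
```

With the default 100 ppm window at a parent ion mass near 163, only masses within roughly 0.016 Da are kept, so compound_B (about 0.084 Da away) is excluded.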

Spectra Type: The type of spectra, either ESI (Electrospray Ionization) or EI (Electron Ionization/Impact).

Ion Mode: Indicates whether the precursor ion has a positive or negative adduct.

Number of Results: The number of results to return, with the default value being 10. If left blank, all results will be returned.

Mass Tolerance: The mass tolerance to use when matching peaks within the dot product comparison. The default value is 10.0 ppm.

Scoring Function: The type of scoring function to use when comparing spectra. The options are Dice and DotProduct.
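
For orientation, the two scoring functions can be sketched using their common textbook forms: a Dice coefficient over matched peaks and a cosine-style dot product over intensities. These are assumptions for illustration; the exact formulas, intensity weighting, and peak-matching rules used by CFM-ID are not given here and may differ. All function names are hypothetical.

```python
import math


def match_peaks(query, reference, tol_ppm=10.0):
    """Greedily pair each query peak with the closest unused reference peak within tolerance."""
    pairs = []
    used = set()
    for q_mz, q_int in query:
        best = None
        for i, (r_mz, _) in enumerate(reference):
            if i in used:
                continue
            diff = abs(q_mz - r_mz)
            if diff <= r_mz * tol_ppm * 1e-6 and (best is None or diff < best[0]):
                best = (diff, i)
        if best is not None:
            used.add(best[1])
            pairs.append((q_int, reference[best[1]][1]))
    return pairs


def dice_score(query, reference, tol_ppm=10.0):
    """2 * matched peaks / (peaks in query + peaks in reference)."""
    total = len(query) + len(reference)
    if total == 0:
        return 0.0
    return 2.0 * len(match_peaks(query, reference, tol_ppm)) / total


def dot_product_score(query, reference, tol_ppm=10.0):
    """Cosine similarity of intensities, counting only matched peaks in the numerator."""
    pairs = match_peaks(query, reference, tol_ppm)
    numerator = sum(q * r for q, r in pairs)
    norm = math.sqrt(sum(i * i for _, i in query)) * math.sqrt(sum(i * i for _, i in reference))
    return numerator / norm if norm else 0.0


query = [(100.0, 100.0), (200.0, 50.0)]
reference = [(100.0005, 100.0), (300.0, 50.0)]
print(dice_score(query, reference), dot_product_score(query, reference))
```

Dice only counts whether peaks match, while the dot product also rewards agreement in intensity, which is why the two scores can rank candidates differently.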

Compound Identification Results:

Results: Input spectra are shown in blue and candidate spectra are shown in red. If a database was queried, candidate spectra are overlaid on top for comparison. The top-ranking candidate spectrum is shown by default; to compare other database candidates, use the "Compare" buttons in the list of ranked candidate compounds that follows the spectra.

Browser Compliance

OS	Version	Chrome	Firefox	Microsoft Edge	Safari
Linux	Mint 20.1	95.0.4638	89.0.2	N/A	N/A
MacOS	BigSur 11.6	96.0.4664.93	95.0.2	N/A	14.1.2 (16611.3.10.1.6)
Windows	10	96.0.4664.110	95.0.1	96.0.1054.57	N/A

HCD vs CID Spectra

High-energy C-trap dissociation (HCD) is considered a gentler fragmentation process than CID; that is, an HCD spectrum typically has more unique fragments across the entire mass-to-charge ratio range than its CID counterpart. However, for the same molecule, CID and HCD spectra at similar collision energies have many fragments in common. Recall that CFM-ID is trained on CID data, so its predicted spectra are less similar to Orbitrap spectra than to QToF spectra (yielding lower Dice and/or Dot Product scores). In our experience, CFM-ID predicted spectra are still very useful for identifying compounds from Orbitrap data. In the CASMI 2016 experiments, we first determined the true collision energy of a given spectrum from its NCE value using the equation provided by Thermo Fisher, then compared the spectrum with the predicted spectrum at the closest CID collision energy.
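
The last step, choosing the closest CID collision energy, can be sketched as below. It assumes the collision energy has already been converted from NCE to eV (the conversion equation is not reproduced here) and uses the three energy levels listed under Spectra Prediction Results; the function name is hypothetical.

```python
# Collision energies (eV) of the three predicted ESI spectra.
CFMID_ENERGIES_EV = {"low": 10.0, "medium": 20.0, "high": 40.0}


def closest_energy_level(collision_energy_ev: float) -> str:
    """Return the predicted energy level nearest to the given collision energy in eV."""
    return min(
        CFMID_ENERGIES_EV,
        key=lambda level: abs(CFMID_ENERGIES_EV[level] - collision_energy_ev),
    )


print(closest_energy_level(27.0))
```

In this sketch a collision energy exactly halfway between two levels resolves to the lower one, which is an arbitrary choice.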

Performance

Compound-to-Spectra Prediction Performance

Figure 1. Spectrum prediction results for the Metlin 2015 dataset in [M+H]+. Each bar displays the mean score for its metric, with an error bar indicating the 95% confidence interval. The plot on the left presents the overall performance of the model, and the plots on the right provide the performance measures for each collision energy.

Figure 2. Spectrum prediction results for the Metlin 2015 dataset in [M-H]-. Each bar displays the mean score for its metric, with an error bar indicating the 95% confidence interval. The plot on the left presents the overall performance of the model, and the plots on the right provide the performance measures for each collision energy.

Superclass	Dice	Dot Product	Recall	Precision	Count
Hydrocarbons	0.474	0.259	49.3	57.1	2
Organic 1,3-dipolar compounds	0.465	0.703	43.1	69.4	1
Organic nitrogen compounds	0.43	0.454	50.8	50.8	124
Nucleosides, nucleotides, and analogues	0.418	0.53	51.5	48.9	73
Organosulfur compounds	0.405	0.364	43.7	54.9	18
Organic acids and derivatives	0.399	0.382	48.5	43.3	481
Lipids	0.394	0.417	52.2	42.6	33
Organoheterocyclic compounds	0.384	0.414	40.4	51	988
Alkaloids and derivatives	0.377	0.471	42.5	47.3	90
Phenylpropanoids and polyketides	0.367	0.427	37.5	51	382
Benzenoids	0.358	0.367	42.5	44	797
Lipid-like molecules	0.358	0.313	37.8	45.7	827
Organic oxygen compounds	0.346	0.298	40.5	41.5	223
Organophosphorus compounds	0.33	0.207	27.2	51.4	4
Lignans, neolignans and related compounds	0.242	0.217	31	28.4	11
Organometallic compounds	0.227	0.22	16.8	42.8	2
Hydrocarbon derivatives	0.208	0.248	22.2	33.9	5
Acetylides	0.095	0.001	11.1	8.3	1

Table 1. Average (over all three energy levels) performance metrics based on chemical classes of the training set in ESI positive mode.

Superclass	Dice	Dot Product	Recall	Precision	Count
Lipids	0.387	0.451	48.2	42.4	15
Organic acids and derivatives	0.348	0.364	38.1	45.4	335
Organosulfur compounds	0.322	0.329	29.9	49.2	4
Nucleosides, nucleotides, and analogues	0.314	0.373	40.5	39.9	60
Lipid-like molecules	0.310	0.366	37.0	42.4	428
Organic oxygen compounds	0.301	0.272	32.1	42.2	84
Organoheterocyclic compounds	0.300	0.351	29.8	50.5	485
Benzenoids	0.297	0.318	31.3	44.2	402
Phenylpropanoids and polyketides	0.287	0.339	26.9	48.8	231
Alkaloids and derivatives	0.277	0.263	25.1	46.3	20
Organic nitrogen compounds	0.254	0.233	31.9	30.7	30
Hydrocarbon derivatives	0.232	0.331	14.8	58.6	1
Lignans, neolignans and related compounds	0.205	0.285	19.7	27.0	6
Organometallic compounds	0.145	0.100	11.0	30.6	1
Organophosphorus compounds	0.131	0.070	8.5	42.9	1

Table 2. Average (over all three energy levels) performance metrics based on chemical classes of the training set in ESI negative mode.

Domain-Specific Performance

Figure 3. Spectrum prediction results for the Exposome, Foodome, and HMDBOopme in [M+H]+ and [M-H]-. Each bar displays the mean score for its metric, with an error bar indicating the 95% confidence interval.

Spectra-to-Compound Identification Performance

Version	# Top 1	# Top 3	# Top 10
CFM-ID 2.0 + Candidate Database	120	160	182
CFM-ID 2.0 + Candidate Database + Experimental Spectra	123	171	201
SIRIUS 4 + CSI:FingerID	138	N/A	186
MS-Finder	146	162	174
CFM-ID 3.0 + Candidate Database + Experimental Spectra + Meta Data	149	194	204
CFM-ID 4.0 + Candidate Database	147	178	203
CFM-ID 4.0 + Candidate Database + Meta Data	162	186	204

Table 3. Comparison of CFM-ID 4.0, CFM-ID 3.0, CFM-ID 2.0, MS-FINDER, and SIRIUS 4 compound identification performance on the CASMI 2016 contest (category 3). Reported are the total number of challenges in which the corresponding implementation of the scoring function ranked the query compound in the top 1, top 3, and top 10. CFM-ID 4.0 + Candidate Database + Meta Data is the method provided in this web server.

Docker Image

What is Included?

Latest CFM-ID 4 MSML machine learning model
C++ Runtime for CFM-ID 4 MSML
Latest CFM-ID 4 MSRB rule based extension
Java Runtime for CFM-ID 4 MSRB

What is not Included?

CFM-ID in-silico spectra library
CFM-ID experimental spectra library

What can it do?

Predict spectra for a given molecule structure
Annotate a spectrum given a molecule structure and its spectrum
Identify a molecule from a given spectrum and a user-provided candidate list

How to use it?

Please refer to the user guide on the DockerHub page.