Skip to the content.

Performing Database Search using SpeCollate

SpeCollate is available as a standalone executable that can be downloaded and run on a Linux server with a Cuda-enabled GPU.

Two different executables are included in the downloadable specollate.tar.gz file; 1) specollate_train for retraining a model and 2) specollate_search for performing database search using a trained model. A pre-trained model is provided within the download file.

The below sections separately explain the setup for training and search operation. You can skip the training section if you only intend to use SpeCollate for database search.


  1. Same as step 1 in the Training section.
  2. Download one of the mgf files. Or you can use your own spectra files in mgf format.
  3. Download the human peptide database. You can provide your own peptide database file created using the Digestor tool provided by OpenMS.
  4. Set the following parameters in the [search] section of the config.ini file:
    • model_name: Name of the model to be used. The model should be in the /models directory.
    • mgf_dir: Absolute path to the directory containing mgf files to be searched.
    • prep_dir: Absolute path to the directory where preprocessed mgf files will be saved.
    • pep_dir: Absolute path to the directory containing peptide database.
    • out_pin_dir: Absolute path to a directory where percolator pin files will be saved. The directory must exist; otherwise, the process will exit with an error.
    • Set database search parameters e.g. precursor_mass_tolerance etc.
  5. Execute the specollate_search file:
    If you want to use the preprocessed spectra from a previous run, use the -p False flag:
    ./specollate_search -p False

  6. Once the search is complete; you can analyze the percolator files using the crux percolator tool:
    cd <out_pin_dir>
    crux percolator --list-of-files T --overwrite T

Visualizing Embeddings

Here we will show the UMAP projections of the embedded spectra and peptides generated by the trained network.

In Fig 1, mappings of spectra and peptide embeddings from NIST Library Human-HCD-Phospho are shown at mass 2000 Da. Peptides are shown in red points, and the clustering of the blue points around the peptide (red point) represents the spectra belonging to that peptide.

Fig 1: Spectra and Peptide projections of the embedded spectra and peptides. Embeddings are generated from 40 peptides and their corresponding spectra selected in increasing precursor mass value starting at 2000 Da. Adjacent mass values are chosen to show that the network is able to distinguish peptides and spectra that are close in mass. Subfigures a) and b) show how spectra with different charges map close to the peptides. Although different charge spectra are mapped close to the peptide, they still tend to be relatively farther from other charges. In subfigure a) spectra with charge 4 are closest to the peptide while in subfigure b) spectra with lower charge are closest to the peptide.

Figs 2-6 show the mappings for Proteome Tools library data. Peptides with more than 15 corresponding spectra of different precursor charge are selected.

Fig 2: Proteome Tools spectra and peptides selected at 1500 Da.

Fig 3: Zoom in of Fig 2.

Fig 4: Proteome Tools spectra and peptides selected at 1700 Da.

Fig 5: Proteome Tools spectra and peptides selected at 2300 Da.

Fig 6: Proteome Tools spectra and peptides selected at 2500 Da.