SDRF-Proteomics

Introduction

Many resources have emerged that provide raw or integrated proteomics data in the public domain. Among them, the ProteomeXchange consortium (including PRIDE Archive, MassIVE, jPOST and iProX) defines a set of guidelines to ensure the quality of the data and the metadata associated with the datasets.

Unfortunately, proteomics experimental design and sample-related information are often missing in public repositories or stored in very diverse ways and formats. For example:

  • The CPTAC consortium provides, for every dataset, a set of Excel files with information on each sample (e.g. S048), including tumor size and origin, but also how every sample is related to a specific raw file (e.g. instrument configuration parameters).

  • ProteomicsDB captures, for each sample in the database, a minimum number of properties describing the sample and the related experimental protocol, such as tissue, digestion method and instrument (e.g. Project 4267).

  • ProteomeXchange submissions only require minimal unstructured metadata such as species, instruments, post-translational modifications or disease. This metadata is captured at the project level, making it difficult to associate each specific metadata term with the samples in the study (Figure 1).

Note

The lack of detailed and well-structured metadata at the sample level prevents data interpretation, reproducibility, and integration of data from different resources.

_images/sample-metadata.png

Figure 1: SDRF-Proteomics file format stores the information of the sample and its relation to the data files in the dataset. The file format includes not only information about the sample but also about how the data was acquired and processed.

Important

The following use cases can be defined for the format:

  • Capturing the experimental design of a proteomics experiment, particularly the relationship between the samples analyzed and the instrument files generated during data acquisition in the laboratory.

  • Capturing sample metadata, including information on the source and any treatments applied that could affect data analysis.

  • Providing comprehensive metadata for instrument files, so that users can have a general understanding of how the data was acquired.

Specifications

The SDRF-Proteomics format describes the sample characteristics and the relationships between samples and data files included in a dataset. The information in SDRF files is organised so that it follows the natural flow of a proteomics experiment. The main requirements to be fulfilled for the SDRF-Proteomics format are:

  • The SDRF file is a tab-delimited format where each ROW corresponds to a relationship between a Sample and a Data file.

  • Each column MUST correspond to an attribute/property of the Sample or the Data file.

  • Each value in each cell MUST be the property for a given Sample or Data file.

  • The SDRF file must start with columns describing the properties of the sample (e.g. organism, disease, phenotype etc.), followed by the properties of the data files which were generated from the analysis of the experimental results (e.g. label, fraction identifier, data file etc.).

  • Support for handling unknown values/characteristics.

Caution

The SDRF-Proteomics aims to capture the sample metadata and its relationship with the data files (e.g., raw files from mass spectrometers). The SDRF-Proteomics does not aim to capture the downstream analysis part of the experimental design including details of which samples were compared to which other samples, how samples are combined into study variables or parameters for the downstream analysis such as FDR or p-values thresholds.

SDRF-Proteomics Format

The SDRF-Proteomics file format describes the sample characteristics and the relationships between samples and data files. The file is tab-delimited: each ROW corresponds to a relationship between a Sample and a Data file (and, in multiplexed experiments, the MS signal corresponding to a label), each column corresponds to an attribute/property of the Sample, and the value in each cell is the specific value of the property for a given Sample (Figure 2).

_images/sdrf-nutshell.png

Figure 2: SDRF-Proteomics in a nutshell. The file format is a tab-delimited one where columns are properties of the sample, the data file or the variables under study. The rows are the samples of origin and the cells are the values for one property in a specific sample.
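
As a rough illustration of this layout, the sketch below reads SDRF content with Python's standard csv module. It keeps the header as a list rather than a dictionary because the same column name (for example comment[modification parameters]) may appear more than once in a file. The function name read_sdrf, the file names and the sample text are made up for this example and are not part of the specification.

```python
import csv
import io


def read_sdrf(text):
    """Split tab-delimited SDRF text into a header list and a list of rows."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    headers = next(reader)
    rows = [row for row in reader if row]  # skip blank lines
    return headers, rows


# Hypothetical SDRF content: one sample measured in two MS runs.
example = (
    "source name\tcharacteristics[organism]\tassay name\tcomment[data file]\n"
    "sample 1\thomo sapiens\trun 1\tfile_1.raw\n"
    "sample 1\thomo sapiens\trun 2\tfile_2.raw\n"
)

headers, rows = read_sdrf(example)
for row in rows:
    # Each row links one sample (first column) to one data file (last column).
    print(row[0], "->", row[-1])
```

Keeping rows as plain lists preserves repeated columns, at the cost of looking values up by position instead of by name.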

Rules

There are general scenarios/use cases that are addressed by the following rules:

  • Unknown values: In some cases, the column is mandatory in the format but for some samples the corresponding value is unknown. In those cases, users SHOULD use not available.

  • Not Applicable values: In some cases, the column is mandatory but for some samples the corresponding value is not applicable. In those cases, users SHOULD use not applicable.

  • Case sensitivity: By specification the SDRF is case insensitive, but we RECOMMEND using lowercase characters throughout all the text (Column names and values).

  • Spaces: By specification the SDRF is sensitive to spaces (sourcename != source name).

  • Column order: The SDRF MUST start with the source name column (accession/name of the sample of origin), then all the sample characteristics; followed by the assay name corresponding to the MS run. Finally, after the assay name all the comments (properties of the data file generated).

  • Extension: The extension of the SDRF should be .tsv or .txt.
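
The column order rule above can be checked mechanically. The following is a minimal sketch, not the behaviour of the official validators: the function name check_column_order and the messages it returns are assumptions made for illustration, and headers are lowercased before comparison because the format is case insensitive.

```python
def check_column_order(headers):
    """Return a list of problems found in the column order of an SDRF header."""
    names = [h.lower() for h in headers]  # the SDRF is case insensitive
    problems = []
    if not names or names[0] != "source name":
        problems.append("first column must be 'source name'")
    if "assay name" not in names:
        problems.append("missing 'assay name' column")
        return problems
    assay = names.index("assay name")
    for position, name in enumerate(names):
        # Sample characteristics go before the assay name, comments after it.
        if name.startswith("characteristics[") and position > assay:
            problems.append(f"'{name}' must come before 'assay name'")
        if name.startswith("comment[") and position < assay:
            problems.append(f"'{name}' must come after 'assay name'")
    return problems


good = ["source name", "characteristics[organism]", "assay name",
        "comment[label]", "comment[data file]"]
bad = ["source name", "comment[label]", "assay name", "characteristics[organism]"]

print(check_column_order(good))
print(check_column_order(bad))
```

Because spaces are significant, a header such as sourcename is reported as a missing source name column rather than silently accepted.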

Values

The value for each property (e.g. characteristics, comment) corresponding to each sample can be represented in multiple ways.

  • Free Text (Human readable): In the free text representation, the value is provided as text without Ontology support (e.g. colon or providing accession numbers). This is only RECOMMENDED when the text inserted in the table is the exact name of an ontology/CV term in EFO. If the term is not in EFO, other ontologies can be used.

SDRF values annotated in free text

source name | characteristics[organism]
sample 1 | homo sapiens
sample 2 | homo sapiens

  • Ontology url (Computer readable): Users can provide the corresponding URI (Uniform Resource Identifier) of the ontology/CV term as a value. This is recommended for enriched files where the user does not want to use intermediate tools to map from free text to ontology/CV terms.

SDRF with ontology terms

source name | characteristics[organism]
sample 1 | http://purl.obolibrary.org/obo/NCBITaxon_9606
sample 2 | http://purl.obolibrary.org/obo/NCBITaxon_9606

  • Key=value representation (Human and Computer readable): This representation provides a mechanism to represent the complete information of the ontology/CV term, including Accession, Name and other additional properties. In the key=value pair representation, the value of the property is represented as an object with multiple properties, where the key is one of the properties of the object and the value is the corresponding value for that key. An example of key=value pairs is a post-translational modification (see Protein Modifications):

NT=Glu->pyro-Glu; MT=fixed; PP=Anywhere; AC=Unimod:27; TA=E
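
A cell in the key=value representation can be turned into a dictionary with a few lines of Python. This is a sketch under simple assumptions: pairs are separated by ";", the key ends at the first "=", and values never contain ";". The function name parse_key_value is hypothetical.

```python
def parse_key_value(cell):
    """Parse 'KEY=value;KEY=value' into a dict, splitting each pair on its first '='."""
    result = {}
    for part in cell.split(";"):
        part = part.strip()
        if not part:
            continue
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"not a key=value pair: {part!r}")
        result[key.strip().upper()] = value.strip()
    return result


mod = parse_key_value("NT=Glu->pyro-Glu; MT=fixed; PP=Anywhere; AC=Unimod:27; TA=E")
print(mod["NT"], mod["AC"])
```

Splitting on the first "=" only matters for values that themselves contain "=", such as the regular expressions used for cleavage sites.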

Samples metadata

The Sample metadata has different categories/headings to organize all the attributes/column headers of a given sample. Each Sample contains a source name (accession) and a set of characteristics. Any proteomics sample MUST contain the following characteristics:

  • source name: Unique sample name (it can be present multiple times if the same sample is used several times in the same dataset).

  • characteristics[organism]: The organism of the Sample of origin.

  • characteristics[disease]: The disease under study in the Sample.

  • characteristics[organism part]: The part of organism’s anatomy or substance arising from an organism from which the biomaterial was derived (e.g. liver).

  • characteristics[cell type]: A cell type is a distinct morphological or functional form of cell. Examples are epithelial, glial etc.

Example:

Minimum sample metadata for any proteomics dataset

source name | characteristics[organism] | characteristics[organism part] | characteristics[disease] | characteristics[cell type]
sample_treat | homo sapiens | liver | liver cancer | liver cancer cell
sample_control | homo sapiens | liver | liver cancer | liver

Note

Additional characteristics can be added depending on the type of the experiment and sample. SDRF-Proteomics defines a set of templates and checklists of properties that should be provided depending on the proteomics experiment.

Some important notes:

  • Each characteristics name in the column header SHOULD be a CV term from the EFO ontology. For example, the header characteristics[organism] corresponds to the ontology term Organism.

  • Multiple values (columns) for the same characteristics term are allowed in SDRF-Proteomics. However, it is RECOMMENDED not to repeat the same column in the same file. If you have multiple phenotypes, you can specify what each one refers to or use another, more specific term, e.g. “immunophenotype”.
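
The minimum sample columns and the recommendation against repeated characteristics can be checked with a short script. The sketch below is only an illustration: the function name check_sample_columns is an assumption, and the required list simply restates the characteristics named in this section.

```python
from collections import Counter

# Columns that this section says every proteomics sample MUST provide.
REQUIRED_SAMPLE_COLUMNS = [
    "source name",
    "characteristics[organism]",
    "characteristics[disease]",
    "characteristics[organism part]",
    "characteristics[cell type]",
]


def check_sample_columns(headers):
    """Return (missing required columns, characteristics columns used more than once)."""
    names = [h.lower() for h in headers]
    missing = [c for c in REQUIRED_SAMPLE_COLUMNS if c not in names]
    counts = Counter(n for n in names if n.startswith("characteristics["))
    repeated = sorted(n for n, k in counts.items() if k > 1)
    return missing, repeated


# Hypothetical header with a repeated phenotype column and missing required ones.
headers = ["source name", "characteristics[organism]",
           "characteristics[phenotype]", "characteristics[phenotype]"]
missing, repeated = check_sample_columns(headers)
print("missing:", missing)
print("repeated:", repeated)
```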

Data files metadata

The connection between the Samples to the Data files is done by using a series of properties and attributes. All the properties referring to the MS run (file) itself are annotated with the category/prefix comment. The use of comment is mainly aimed at differentiating sample properties from the data properties. It matches a given sample to the corresponding file(s). The word comment is used for backwards-compatibility with gene expression experiments (RNA-Seq and Microarrays experiments).

The order of the columns is important: assay name MUST always be located before the comments. It is RECOMMENDED to put comment[data file] as the last column. The following properties MUST be provided for each data file (MS run):

  • assay name: An accession for each MS run. For backwards-compatibility with SDRF in transcriptomics, the more generic term assay name is used instead of ms run. Examples of assay names are “run 1” and “run_fraction_1_2”. It must be a unique accession for every MS run.

  • comment[fraction identifier]: The fraction identifier records the number of a given fraction. The fraction identifier corresponds to this ontology term. It MUST start from 1 and, if the experiment is not fractionated, 1 MUST be used for each MS run (assay).

  • comment[label]: Describes the label applied to each Sample (if any). In multiplexed experiments such as TMT, SILAC and/or iTRAQ, the corresponding label SHOULD be added. For label-free experiments, the label free sample term MUST be used (see Label annotations).

  • comment[data file]: The name of the raw file generated by the instrument. The data files can be instrument raw files but also converted peak lists such as mzML or MGF, or result files such as mzIdentML.

  • comment[instrument]: The instrument model used to analyze the sample (see Instrument information).

Example:

Minimum data metadata for any proteomics dataset

source name | assay name | comment[label] | comment[fraction identifier] | comment[instrument] | comment[data file]
sample 1 | run 1 | label free sample | 1 | NT=LTQ Orbitrap XL | 000261_C05_P0001563_A00_B00K_R1.RAW
sample 1 | run 2 | label free sample | 2 | NT=LTQ Orbitrap XL | 000261_C05_P0001563_A00_B00K_R2.RAW

Note

All the possible label values can be seen in the PRIDE CV under the labels node.

Label annotations

In order to annotate quantitative datasets, the SDRF file format uses tags for each channel associated with the sample in comment[label]. The label values are organized under the following ontology term Label. Some of the most popular labels are:

  • For label-free experiments the value SHOULD be: label free sample or the corresponding key value pair term: AC=MS:1002038;NT=label free sample

  • For TMT experiments the SDRF uses the PRIDE ontology terms under sample label. Here are some examples of TMT channels:

    TMT126, TMT127, TMT127C, TMT127N, TMT128, TMT128C, TMT128N, TMT129, TMT129C, TMT129N, TMT130, TMT130C, TMT130N, TMT131

In order to achieve a clear relationship between the label and the sample characteristics, each channel of each sample (in multiplex experiments) SHOULD be defined in a separate row: one row per channel used (annotated with the corresponding comment[label]) per file.
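
One consequence of the one-row-per-channel convention is that a given label should appear at most once for each data file. The sketch below looks for violations; it assumes rows have already been loaded as dictionaries keyed by column name, and the function name duplicated_channels, the file name and the sample names are hypothetical.

```python
from collections import defaultdict


def duplicated_channels(rows):
    """Return {data file: [labels that occur in more than one row of that file]}."""
    seen = defaultdict(list)
    for row in rows:
        seen[row["comment[data file]"]].append(row["comment[label]"].lower())
    return {
        data_file: sorted({label for label in labels if labels.count(label) > 1})
        for data_file, labels in seen.items()
        if len(set(labels)) < len(labels)
    }


# Hypothetical multiplexed file in which TMT126 was annotated twice by mistake.
rows = [
    {"source name": "sample 1", "comment[label]": "TMT126", "comment[data file]": "mix_1.raw"},
    {"source name": "sample 2", "comment[label]": "TMT127N", "comment[data file]": "mix_1.raw"},
    {"source name": "sample 3", "comment[label]": "TMT126", "comment[data file]": "mix_1.raw"},
]
print(duplicated_channels(rows))
```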

Examples:

Instrument information

The model of the mass spectrometer SHOULD be specified as comment[instrument]. Possible values are listed in PSI-MS.

Additionally, it is strongly RECOMMENDED to include comment[MS2 analyzer type]. This is important, for example, for Orbitrap models where MS2 scans can be acquired either in the Orbitrap or in the ion trap. Setting this value makes it possible to identify high-resolution MS/MS data. Possible values of comment[MS2 analyzer type] are mass analyzer types.

Additional Data files technical properties

It is RECOMMENDED to encode some of the technical parameters of the MS experiment as comments including the following parameters:

  • Protein Modifications

  • Precursor and Fragment ion mass tolerances

  • Digestion Enzymes

Protein Modifications

Sample modifications (including both chemical modifications and post-translational modifications, PTMs) originate from multiple sources: artifact modifications, isotope labeling, adducts that are encoded as PTMs (e.g. sodium) or the most biologically relevant PTMs. It is RECOMMENDED to provide the modifications expected in the sample, including the amino acid affected, whether the modification is Variable or Fixed (Custom and Annotated modifications are also supported), and other properties such as the mass shift/delta mass and the position (e.g. anywhere in the sequence). The RECOMMENDED name of the column for sample modification parameters is comment[modification parameters]; modification parameters is the name of the ontology term MS:1001055. For each modification, different properties are captured using a key=value pair structure, including name, position, etc. All the possible (optional) features available for modification parameters are:

Key=value properties available for modification parameters

Property | Key | Example | Required | Comment
Name of the Modification | NT | NT=Acetylation | Yes | Name of the term, in this case the modification. For custom modifications, it can be a name defined by the user.
Modification Accession | AC | AC=UNIMOD:1 | Yes | Accession in an external database; UNIMOD and PSI-MOD are supported.
Chemical Formula | CF | CF=H(2)C(2)O | No | The chemical formula of the added or removed atoms. For the formula composition, please follow the guidelines.
Modification Type | MT | MT=Fixed | No | Specifies which modification group the modification should be included with. Choose from the following options: [Fixed, Variable, Annotated]. Annotated is used to search for all the occurrences of the modification in an annotated protein database file such as UniProt XML or PEFF.
Position of the modification in the Polypeptide | PP | PP=Any N-term | No | Choose from the following options: [Anywhere, Protein N-term, Protein C-term, Any N-term, Any C-term]. Default is Anywhere.
Target Amino acid | TA | TA=S,T,Y | No | The target amino acid letter. If the modification targets multiple sites, separate them with a comma (,).
Monoisotopic Mass | MM | MM=42.010565 | No | The exact atomic mass shift produced by the modification. Please use at least 5 decimal places of accuracy. This should only be used if the chemical formula of the modification is not known. If the chemical formula is specified, the monoisotopic mass will be overwritten by the calculated monoisotopic mass.
Target Site | TS | TS=N[^P][ST] | No | For some software, it is important to capture complex rules for modification sites. These use cases should be specified as regular expressions.

Note

To indicate the modification name, we RECOMMEND using the UNIMOD interim name or the PSI-MOD name. For custom modifications, we RECOMMEND using an intuitive name. If the PTM is unknown (custom), the Chemical Formula or Monoisotopic Mass MUST be annotated.

An example of an SDRF-Proteomics file with sample modifications annotated, where each modification needs an extra column:

Example of how to annotate two modifications in SDRF-Proteomics

source name | comment[modification parameters] | comment[modification parameters]
Sample 1 | NT=Glu->pyro-Glu;MT=fixed;PP=Anywhere;AC=Unimod:27;TA=E | NT=Oxidation;MT=Variable;TA=M
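
A modification cell can be sanity-checked against the option lists given for MT and PP. The sketch below is an illustration only, not the official validation logic: the function name check_modification is hypothetical, values are compared case-insensitively, and "custom" is accepted for MT on the assumption that the custom modifications mentioned in the text use that keyword.

```python
# Allowed values taken from the MT and PP descriptions; "custom" is an assumption.
ALLOWED_TYPES = {"fixed", "variable", "annotated", "custom"}
ALLOWED_POSITIONS = {"anywhere", "protein n-term", "protein c-term",
                     "any n-term", "any c-term"}


def check_modification(cell):
    """Return a list of problems found in one comment[modification parameters] cell."""
    fields = {}
    for part in cell.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep:
            fields[key.strip().upper()] = value.strip()
    problems = []
    if not fields.get("NT"):
        problems.append("NT (modification name) is missing")
    if "MT" in fields and fields["MT"].lower() not in ALLOWED_TYPES:
        problems.append(f"unexpected MT value: {fields['MT']}")
    position = fields.get("PP", "Anywhere")  # Anywhere is the documented default
    if position.lower() not in ALLOWED_POSITIONS:
        problems.append(f"unexpected PP value: {position}")
    return problems


print(check_modification("NT=Glu->pyro-Glu;MT=fixed;PP=Anywhere;AC=Unimod:27;TA=E"))
print(check_modification("MT=sometimes;PP=Middle"))
```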

Cleavage agents

The REQUIRED comment[cleavage agent details] property is used to capture the enzyme information. As with protein modifications, a key=value pair representation is used to encode the following properties for each enzyme:

Key=value properties available for cleavage agent details

Property | Key | Example | Required | Comment
Name of the Enzyme | NT | NT=Trypsin | required | Name of the term, in this case the name of the enzyme.
Enzyme Accession | AC | AC=MS:1001251 | required | Accession in the PSI-MS ontology, under the category cleavage agent name.
Cleavage site regular expression | CS | CS=(?<=[KR])(?!P) | optional | The cleavage site defined as a regular expression.

An example of an SDRF-Proteomics file with an annotated endopeptidase:

Example of how to annotate enzymes in SDRF-Proteomics

source name | comment[cleavage agent details]
Sample 1 | NT=Trypsin;AC=MS:1001251;CS=(?<=[KR])(?!P)

Warning

If no endopeptidase is used, for example in the case of Top-down/intact protein experiments, the value SHOULD be ‘not applicable’.
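
To show what the CS regular expression encodes, the sketch below applies the trypsin pattern from the example in this section to a made-up sequence. It splits at every zero-width match, which Python's re.split supports from version 3.7 onwards. The function name digest and the sequence are hypothetical, and missed cleavages are not modelled.

```python
import re


def digest(sequence, cleavage_site):
    """Split a protein sequence at every zero-width match of the CS regex."""
    # Splitting on empty matches requires Python 3.7 or later.
    return [peptide for peptide in re.split(cleavage_site, sequence) if peptide]


trypsin_cs = "(?<=[KR])(?!P)"  # cut after K or R, but not before P
print(digest("MKRPAKLLR", trypsin_cs))  # ['MK', 'RPAK', 'LLR']
```

The R at position 3 is followed by P, so no cut is made there and RPAK stays in one piece.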

Precursor and Fragment mass tolerances

For proteomics experiments, it is important to encode different mass tolerances (for precursor and fragment ions).

Example:

Example of how to annotate tolerances in SDRF-Proteomics

source name | comment[fragment mass tolerance] | comment[precursor mass tolerance]
Sample 1 | 0.6 Da | 20 ppm

Note

Units for the mass tolerances (either Da or ppm) MUST be provided.
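
A tolerance cell can be split into its number and unit, and a ppm value can be converted to Da at a given m/z using tolerance_Da = m/z × ppm / 10^6. The sketch below assumes the "number, space, unit" layout shown in the example; the function names parse_tolerance and tolerance_in_da are hypothetical.

```python
import re

# A number followed by a unit, either Da or ppm (compared case-insensitively).
TOLERANCE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(da|ppm)\s*$", re.IGNORECASE)


def parse_tolerance(cell):
    """Return (value, unit) for cells such as '0.6 Da' or '20 ppm'."""
    match = TOLERANCE.match(cell)
    if match is None:
        raise ValueError(f"tolerance needs a number and a unit (Da or ppm): {cell!r}")
    unit = "Da" if match.group(2).lower() == "da" else "ppm"
    return float(match.group(1)), unit


def tolerance_in_da(cell, mz):
    """Express a tolerance in Da; ppm values are scaled by the given m/z."""
    value, unit = parse_tolerance(cell)
    return value if unit == "Da" else mz * value / 1e6


print(parse_tolerance("0.6 Da"))
print(tolerance_in_da("20 ppm", mz=1000.0))
```

A cell without a unit, such as "20", is rejected, which mirrors the rule that units MUST be provided.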

Factor values

The variable/property under study MUST be highlighted using the factor value category. For example, the factor value[disease] is used when the main purpose of a given experiment is to compare protein expression across different diseases or different states of a given disease. Multiple variables under study can be included by adding multiple factor values columns.

Important

“factor value” columns SHOULD indicate which experimental factor/variable is used as the hypothesis to perform the data analysis. The “factor value” columns SHOULD occur after all characteristics and attributes of the samples.
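
Downstream tools typically use the factor value columns to group samples. The sketch below shows one way to do that, assuming the header and rows have already been read into lists and that source name is the first column; the function name group_by_factor and the example values are hypothetical.

```python
from collections import defaultdict


def group_by_factor(headers, rows, factor):
    """Map each value of factor value[<factor>] to the sorted sample names that carry it."""
    column = f"factor value[{factor}]"
    index = [h.lower() for h in headers].index(column)
    groups = defaultdict(set)
    for row in rows:
        groups[row[index]].add(row[0])  # first column is source name
    return {value: sorted(samples) for value, samples in groups.items()}


# Hypothetical study comparing two disease states.
headers = ["source name", "characteristics[disease]", "assay name", "factor value[disease]"]
rows = [
    ["sample 1", "liver cancer", "run 1", "liver cancer"],
    ["sample 2", "normal", "run 2", "normal"],
    ["sample 3", "liver cancer", "run 3", "liver cancer"],
]
print(group_by_factor(headers, rows, "disease"))
```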

SDRF-Proteomics Templates

The sample metadata Templates are a set of guidelines to annotate the different types of proteomics experiments (use cases) to ensure that Minimum Metadata and characteristics are provided to understand the dataset. These templates respond to the distribution and frequency of experiment types in public databases like PRIDE and ProteomeXchange. The Python/Java validators will check the columns checklists depending on the template.

NOTE: It is planned that, unlike in other PSI formats, regular updates will need to be done to be able to explain how new use cases for the format can be accommodated.

Note

Each template is a TSV file with the minimum columns needed to describe the experiment. The community can create their own templates, for example for meta-proteomics experiments, imaging proteomics or top-down. If the community would like to add a new template, the following table should be modified and the corresponding TSV file should be created in this folder.

Sample attributes: Minimum sample attributes for primary cells from different species and cell lines

SDRF-Proteomics templates sample attributes

Default

Human

Vertebrates

Non-vertebrates

Plants

Cell lines

source name

required

required

required

required

required

required

characteristics[organism]

required

required

required

required

required

required

characteristics[strain/breed]

required

characteristics[ecotype/cultivar]

required

characteristics[ancestry category]

required

characteristics[age]

required

required

required

characteristics[developmental stage]

required

required

required

characteristics[sex]

required

required

required

characteristics[organism part]

required

required

required

required

required

required

characteristics[cell type]

required

required

required

required

required

required

technology type

required

required

required

required

required

required

characteristics[disease]

required

required

required

required

required

required

characteristics[individual]

required

characteristics[biological replicate]

required

required

required

required

required

required

characteristics[cell line]

required

assay name

required

required

required

required

required

required

comment[data file]

required

required

required

required

required

required

comment[technical replicate]

required

required

required

required

required

required

comment[fraction identifier]

required

required

required

required

required

required

comment[label]

required

required

required

required

required

required

comment[instrument]

required

required

required

required

required

required

Additional conventions

Specific use cases and conventions

Conventions define how to encode some particular information in the file format by supporting specific use cases. Conventions define a set of new columns that are needed to represent a particular use case or experiment type (e.g., phosphorylation-enriched dataset). In addition, conventions define how some specific free-text columns (values that are not defined as ontology terms) should be written.

Conventions are documented and compiled from issues at https://github.com/bigbio/proteomics-sample-metadata/issues or from pull requests. New conventions will be added to updated versions of this specification document in the future. It is planned that, unlike in other PSI formats, more regular updates will be needed to explain how new use cases for the format can be accommodated.

How to encode age and other elapsed times

One of the characteristics of a sample can be the age of an individual. It is RECOMMENDED to provide the age in the following format: {X}Y{X}M{X}D. Some valid examples are:

  • 40Y (forty years)

  • 40Y5M (forty years and 5 months)

  • 40Y5M2D (forty years, 5 months, and 2 days)

When needed, weeks can also be used: 8W (eight weeks)

Age interval:

Sometimes the sample does not have an exact age but contains a range of ages. To annotate an age range the following convention is RECOMMENDED:

40Y-85Y

This means that the subject (sample) is between 40 and 85 years old. Other temporal information can be encoded similarly.
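
The age convention lends itself to a regular expression check. The sketch below accepts a single age or a range of two ages. It assumes that the units appear in the order years, months, weeks, days; the position of weeks in that order is a guess, since the text only says that weeks can be used. The function name is_valid_age is hypothetical.

```python
import re

# One age: optional years, months, weeks and days, in that (assumed) order.
AGE = r"(?:\d+Y)?(?:\d+M)?(?:\d+W)?(?:\d+D)?"
# A single age, or two ages joined by "-"; each must start with a digit.
AGE_OR_RANGE = re.compile(rf"^(?=\d)({AGE})(?:-(?=\d)({AGE}))?$", re.IGNORECASE)


def is_valid_age(value):
    """Return True for values such as 40Y, 40Y5M2D, 8W or 40Y-85Y."""
    return AGE_OR_RANGE.match(value.strip()) is not None


for value in ["40Y", "40Y5M", "40Y5M2D", "8W", "40Y-85Y", "40", "forty years"]:
    print(value, is_valid_age(value))
```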

Phosphoproteomics and other post-translational modifications enriched studies

In PTM-enriched experiments, the characteristics[enrichment process] SHOULD be provided. The different values already included in EFO are:

  • enrichment of phosphorylated proteins

  • enrichment of glycosylated proteins

This characteristic can be used as a factor value[enrichment process] to differentiate the expression between proteins in the phospho-enriched sample when compared with the control.

Synthetic peptide libraries

It is common to use synthetic peptide libraries for multiple use cases including:

  • Benchmark of analytical and bioinformatics methods and algorithms.

  • Improvement of peptide identification/quantification using spectral libraries.

When describing synthetic peptide libraries, most of the sample metadata can be declared as “not applicable”. However, some authors may also annotate the organism, for example because they know that the library has been designed from specific peptide species; see, for example, the following experiment containing synthetic peptides (Example PXD000759).

In these cases, it is important to annotate that the sample is composed of a synthetic peptide library. This can be done by adding the characteristics[synthetic peptide]. The possible values are “synthetic”, “not synthetic” or “mixed”.

Normal and healthy samples

Samples from healthy patients or individuals normally appear in manuscripts and are often annotated as healthy or normal. We RECOMMEND using the word “normal” mapped to the CV term PATO_0000461, which is also included in EFO: normal PATO term.

Example:

Example annotation of a normal (control) sample

source name | characteristics[organism] | characteristics[organism part] | characteristics[phenotype] | characteristics[compound] | factor value[phenotype]
sample_treat | homo sapiens | liver | necrotic tissue | drug A | necrotic tissue
sample_control | homo sapiens | liver | normal | none | normal

Multiple projects into one annotation file

It may be necessary to annotate multiple ProteomeXchange datasets in one large SDRF-Proteomics file, e.g. for reanalysis purposes. If that is the case, it is RECOMMENDED to use the column name comment[proteomexchange accession number] to differentiate between different datasets.
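
Merging several annotated datasets can be sketched as follows. The example assumes that all inputs share exactly the same columns and places the accession column directly after assay name so that it sits among the comments; both choices, the function name merge_sdrf and the placeholder accessions are assumptions made for illustration.

```python
ACCESSION_COLUMN = "comment[proteomexchange accession number]"


def merge_sdrf(datasets):
    """Merge {accession: (headers, rows)} into one header and row list with an accession column."""
    merged_headers, merged_rows = None, []
    for accession, (headers, rows) in datasets.items():
        position = headers.index("assay name") + 1  # first slot after assay name
        new_headers = headers[:position] + [ACCESSION_COLUMN] + headers[position:]
        if merged_headers is None:
            merged_headers = new_headers
        elif new_headers != merged_headers:
            raise ValueError(f"{accession} has different columns; align them first")
        for row in rows:
            merged_rows.append(row[:position] + [accession] + row[position:])
    return merged_headers, merged_rows


# Placeholder accessions and file names, not real datasets.
columns = ["source name", "assay name", "comment[data file]"]
datasets = {
    "PXD-EXAMPLE-1": (columns, [["sample 1", "run 1", "a.raw"]]),
    "PXD-EXAMPLE-2": (columns, [["sample 1", "run 1", "b.raw"]]),
}
merged_headers, merged_rows = merge_sdrf(datasets)
print(merged_headers)
print(merged_rows)
```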

Additional information

Ontologies/Controlled Vocabularies Supported

The list of ontologies/controlled vocabularies (CV) supported are:

  • PSI Mass Spectrometry CV (PSI-MS)

  • Experimental Factor Ontology (EFO).

  • Unimod protein modification database for mass spectrometry (UNIMOD)

  • PSI-MOD CV (PSI-MOD)

  • Cell line ontology (CLO)

  • Drosophila anatomy ontology (FBBT)

  • Cell ontology (CL)

  • Plant ontology (PO)

  • Uber-anatomy ontology (UBERON)

  • Zebrafish anatomy and development ontology (ZFA)

  • Zebrafish developmental stages ontology (ZFS)

  • Plant Environment Ontology (PEO)

  • FlyBase Developmental Ontology (FBdv)

  • Rat Strain Ontology (RSO)

  • Chemical Entities of Biological Interest Ontology (CHEBI)

  • NCBI organismal classification (NCBITaxon)

  • PATO - the Phenotype and Trait Ontology (PATO)

  • PRIDE Controlled Vocabulary (PRIDE)

Relations with other formats

SDRF-Proteomics is fully compatible with the SDRF file format part of MAGE-TAB. The MAGE-TAB is the file format to store the metadata and sample information on transcriptomics experiments. MAGE-TAB (MicroArray Gene Expression Tabular) is a standard format for storing and exchanging microarray and other high-throughput genomics data. It consists of two spreadsheets for each experiment: the Investigation Description Format (IDF) file and the Sample and Data Relationship Format (SDRF) file.

The IDF file contains general information about the experiment, such as the project title, description, and funding sources, as well as details about the experimental design, such as the type of technology used, the organism studied, and the experimental conditions. The SDRF file contains detailed information about the samples and the data generated from them, including sample annotations, data file locations, and data processing parameters. It also defines the relationships between samples, such as replicates or time-course experiments. Together, the IDF and SDRF files provide a complete description of the experiment and the data generated from it, allowing researchers to share and compare their data with others in a standardized and interoperable format.

SDRF-Proteomics sample information can be embedded into mzTab metadata files. The mzTab (Mass Spectrometry Tabular) format is a standard format for reporting the results of proteomics and metabolomics experiments. It can be used to store information such as protein identification, peptide sequences, and quantitation results. The mzTab format allows for the embedding of sample metadata into the file, which includes information about the samples and the experimental conditions. This metadata can be derived from the Sample and Data Relationship Format (SDRF) file in a proteomics experiment. In the mzTab format, sample metadata is stored in a separate section called the “metadata section,” which contains a list of key-value pairs that describe the samples. The keys in the metadata section correspond to the column names in the SDRF file, and the values correspond to the values in the Sample cells. By embedding sample metadata into the mzTab file, researchers can ensure that all relevant information about the experiment is stored in a single file, making it easier to share and compare data with others.

Documentation

The official website for SDRF-Proteomics project is https://github.com/bigbio/proteomics-sample-metadata. New use cases, changes to the specification and examples can be added by using Pull requests or issues in GitHub (see introduction to GitHub).

A set of examples and annotated projects from ProteomeXchange can be found here

Multiple tools have been implemented to validate SDRF-Proteomics files:

  • sdrf-pipelines (Python): This tool allows a user to validate an SDRF-Proteomics file. In addition, it allows a user to convert SDRF to other popular pipelines and software configuration files such as: MaxQuant or OpenMS.

  • jsdrf (Java): This Java library and tool allows a user to validate SDRF-Proteomics files. It also includes a generic data model that can be used by Java applications.

Tools

sdrf-pipelines

The SDRF pipelines provide a set of tools to validate and convert SDRF-Proteomics files to different workflow configuration files such as MSstats, OpenMS and MaxQuant.

Installation:

$> pip install sdrf-pipelines

Validate the SDRF:

Then, you can use the tool by executing the following command:

$> parse_sdrf validate-sdrf --sdrf_file {here_the_path_to_sdrf_file}

jsdrf

jsdrf is a Java library to validate SDRF files. The SDRF file format represents the sample-to-data information in proteomics experiments.

Validation of SDRF files with proteomics rules. How to use it:

$> java -jar jsdrf-{X.X.X}.jar --sdrf query_file.tsv --template HUMAN

Using the Java library with maven:

<dependency>
    <groupId>uk.ac.ebi.pride.sdrf</groupId>
    <artifactId>jsdrf</artifactId>
    <version>{version}</version>
</dependency>

To get support and help from the SDRF maintainers, use the following links:

  • Report Issue

  • Get help on GitHub Forum