Additional conventions

Specific use cases and conventions

Conventions define how to encode particular information in the file format in order to support specific use cases. A convention defines the set of new columns needed to represent a particular use case or experiment type (e.g., a phosphorylation-enriched dataset). In addition, conventions define how specific free-text columns (those whose values are not defined as ontology terms) should be written.

Conventions are documented and compiled from issues raised at https://github.com/bigbio/proteomics-sample-metadata/issues or from pull requests. New conventions will be added to future versions of this specification document. Unlike other PSI formats, this format is expected to need more regular updates to explain how new use cases can be accommodated.

How to encode age and other elapsed times

One of the characteristics of a sample can be the age of an individual. It is RECOMMENDED to provide the age in the following format: {X}Y{X}M{X}D. Some valid examples are:

40Y (forty years)
40Y5M (forty years and 5 months)
40Y5M2D (forty years, 5 months, and 2 days)

When needed, weeks can also be used: 8W (eight weeks)

Age interval:

Sometimes a sample does not have an exact age but instead covers a range of ages. To annotate an age range, the following convention is RECOMMENDED:

40Y-85Y

This means that the subject (sample) is between 40 and 85 years old. Other temporal information can be encoded similarly.
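
The age convention above can be checked mechanically. The following is a minimal sketch, not part of the specification: the function names parse_age and parse_age_or_range are hypothetical, and placing the optional week component between months and days is an assumption, since the text only shows weeks on their own (8W).

```python
import re

# {X}Y{X}M{X}D with every component optional, but at least one required.
# Assumption: an optional week component (W) sits between months and days.
_AGE = re.compile(r"(?=\d)(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)W)?(?:(\d+)D)?")


def parse_age(text):
    """Parse a single age such as '40Y5M2D' into its components."""
    match = _AGE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid age: {text!r}")
    years, months, weeks, days = (int(g) if g else 0 for g in match.groups())
    return {"years": years, "months": months, "weeks": weeks, "days": days}


def parse_age_or_range(text):
    """Parse either a single age ('40Y') or an interval ('40Y-85Y').

    Returns a tuple holding one parsed age, or two for an interval.
    """
    parts = text.split("-")
    if len(parts) > 2:
        raise ValueError(f"not a valid age or age range: {text!r}")
    return tuple(parse_age(part) for part in parts)


print(parse_age("40Y5M2D"))
print(parse_age_or_range("40Y-85Y"))
```

Values that mix the component order (for example 5M40Y) or omit the unit letter are rejected by this sketch; whether a real validator should be that strict is a design choice left open here.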

Phosphoproteomics and other post-translational modification-enriched studies

In PTM-enriched experiments, the characteristics[enrichment process] SHOULD be provided. The different values already included in EFO are:

enrichment of phosphorylated proteins
enrichment of glycosylated proteins

This characteristic can also be used as factor value[enrichment process] to distinguish protein expression in the phospho-enriched sample from that in the control.
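
As an illustration of the SHOULD requirement above, the sketch below reports when the enrichment column is absent and flags values other than the two EFO terms listed in this section. The function name check_enrichment_column is hypothetical, and the two-term set is only what this section lists, not the full EFO vocabulary, so a flagged value is a prompt to check EFO rather than an error.

```python
ENRICHMENT_COLUMN = "characteristics[enrichment process]"

# Only the two terms named in this section; EFO may contain others.
LISTED_TERMS = {
    "enrichment of phosphorylated proteins",
    "enrichment of glycosylated proteins",
}


def check_enrichment_column(header, rows):
    """Return a list of warnings for a table given as a header and row lists."""
    if ENRICHMENT_COLUMN not in header:
        return [f"column {ENRICHMENT_COLUMN!r} is missing"]
    index = header.index(ENRICHMENT_COLUMN)
    warnings = []
    for row in rows:
        value = row[index].strip().lower()
        if value not in LISTED_TERMS:
            # Assumption: the first cell is the source name, used as a label.
            warnings.append(f"{row[0]}: check {value!r} against EFO")
    return warnings
```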

Synthetic peptide libraries

It is common to use synthetic peptide libraries for multiple use cases including:

Benchmark of analytical and bioinformatics methods and algorithms.
Improvement of peptide identification/quantification using spectral libraries.

When describing synthetic peptide libraries, most of the sample metadata can be declared as “not applicable”. However, some authors also annotate the organism, for example because they know that the library has been designed from specific peptide species; see, for example, the experiment PXD000759, which contains synthetic peptides.

In these cases, it is important to annotate that the sample is composed of a synthetic peptide library. This can be done by adding the characteristics[synthetic peptide]. The possible values are “synthetic”, “not synthetic” or “mixed”.
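
A validator for this column can be very small. The sketch below is illustrative only: the function name is hypothetical, and comparing values case-insensitively after trimming whitespace is an assumption, not something this section states.

```python
SYNTHETIC_COLUMN = "characteristics[synthetic peptide]"

# The three values permitted by the convention above.
ALLOWED_VALUES = ("synthetic", "not synthetic", "mixed")


def is_valid_synthetic_value(value):
    """Return True if value is one of the three permitted values."""
    # Assumption: surrounding whitespace and letter case are ignored.
    return value.strip().lower() in ALLOWED_VALUES


print(is_valid_synthetic_value("mixed"))
print(is_valid_synthetic_value("partly synthetic"))
```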

Normal and healthy samples

Samples from healthy patients or individuals appear frequently in manuscripts and are often annotated as healthy or normal. We RECOMMEND using the word “normal”, mapped to the CV term PATO_0000461, which is also included in EFO (normal PATO term).

Example:

Minimum data metadata for any proteomics dataset
source name	characteristics[organism]	characteristics[organism part]	characteristics[phenotype]	characteristics[compound]	factor value[phenotype]
sample_treat	homo sapiens	liver	necrotic tissue	drug A	necrotic tissue
sample_control	homo sapiens	liver	normal	none	normal
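
When converting existing annotations, labels such as “healthy” can be rewritten to the recommended term. The following is a small sketch under stated assumptions: the function name normalize_phenotype and the synonym set are hypothetical choices for illustration, and only “healthy” and “normal” are taken from the text above.

```python
NORMAL_TERM = "normal"
NORMAL_ACCESSION = "PATO_0000461"  # CV term named in the text above

# Assumption: only these two labels are treated as equivalent.
_NORMAL_SYNONYMS = {"healthy", "normal"}


def normalize_phenotype(value):
    """Map 'healthy'/'normal' labels to the recommended term 'normal'.

    Any other value is returned unchanged.
    """
    if value.strip().lower() in _NORMAL_SYNONYMS:
        return NORMAL_TERM
    return value


print(normalize_phenotype("Healthy"))
print(normalize_phenotype("necrotic tissue"))
```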

Multiple projects into one annotation file

It may be necessary to annotate multiple ProteomeXchange datasets in one large SDRF-Proteomics file, e.g., for reanalysis purposes. In that case, it is RECOMMENDED to use the column comment[proteomexchange accession number] to differentiate between the datasets.
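
One way to build such a combined file is sketched below. It is not a reference implementation: the function name merge_sdrf is hypothetical, the inputs are assumed to be tab-separated text with identical header rows, and the accessions in the usage lines are placeholders.

```python
import csv
import io

ACCESSION_COLUMN = "comment[proteomexchange accession number]"


def merge_sdrf(datasets):
    """Merge tab-separated tables into one, tagging each row with its accession.

    datasets maps an accession string to the text of one table.
    Assumption: every table has exactly the same header row.
    """
    if not datasets:
        raise ValueError("no datasets given")
    merged_header = None
    merged_rows = []
    for accession, text in datasets.items():
        rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
        if not rows:
            raise ValueError(f"{accession}: empty table")
        header, body = rows[0], rows[1:]
        if merged_header is None:
            merged_header = header
        elif header != merged_header:
            raise ValueError(f"{accession}: header differs from the first table")
        # Append the accession as the last cell of every sample row.
        merged_rows.extend(row + [accession] for row in body)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(merged_header + [ACCESSION_COLUMN])
    writer.writerows(merged_rows)
    return out.getvalue()


# Placeholder accessions and minimal tables, for illustration only.
first = "source name\tcharacteristics[organism]\nsample_1\thomo sapiens\n"
second = "source name\tcharacteristics[organism]\nsample_2\thomo sapiens\n"
print(merge_sdrf({"PXD000001": first, "PXD000002": second}))
```

Requiring identical headers keeps the sketch short; real datasets often differ in their columns, and reconciling them would need an explicit policy for missing values.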