New SPP2002 Publications

smORFer: a modular algorithm to detect small ORFs in prokaryotes

Alexander Bartholomäus, Baban Kolte, Ayten Mustafayeva, Ingrid Goebel, Stephan Fuchs, Dirk Benndorf, Susanne Engelmann, Zoya Ignatova


Emerging evidence places small proteins (≤50 amino acids) more centrally in physiological processes. Yet their functional identification and the systematic genome annotation of their cognate small open-reading frames (smORFs) remain challenging both experimentally and computationally. Ribosome profiling, or Ribo-Seq (deep sequencing of ribosome-protected fragments), enables detection of actively translated open-reading frames (ORFs) and empirical annotation of coding sequences (CDSs) using the in-register translation pattern that is characteristic of genuinely translating ribosomes. Multiple ORF identifiers that use the 3-nt periodicity in Ribo-Seq data sets have been successful in eukaryotic smORF annotation. However, they have difficulties evaluating prokaryotic genomes because of their unique architecture (e.g. polycistronic messages, overlapping ORFs, leaderless translation, non-canonical initiation). Here, we present a new algorithm, smORFer, which detects putative smORFs in prokaryotic organisms with high accuracy. The unique feature of smORFer is its integrated approach: it considers structural features of the genetic sequence along with in-frame translation and uses a Fourier transform to convert these parameters into a measurable score to faithfully select smORFs. The algorithm is executed in a modular way, and depending on the data available for a particular organism, different modules can be selected for the smORF search.
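The idea of turning 3-nt periodicity into a score with a Fourier transform can be illustrated with a minimal sketch. This is not smORFer's implementation; the function name `periodicity_score`, the normalisation, and the toy coverage values are assumptions chosen for illustration. The sketch takes per-nucleotide read counts over a candidate ORF and reports the fraction of one-sided, non-DC spectral power that falls at a period of exactly three nucleotides.

```python
import numpy as np

def periodicity_score(counts):
    """Fraction of one-sided, non-DC spectral power at the 3-nt period.

    `counts` holds per-nucleotide read counts over a candidate ORF
    (hypothetical input format, not taken from smORFer).
    """
    counts = np.asarray(counts, dtype=float)
    n = len(counts) - len(counts) % 3      # trim to whole codons
    if n < 6:                              # too short to assess periodicity
        return 0.0
    signal = counts[:n] - counts[:n].mean()    # remove the constant offset
    power = np.abs(np.fft.rfft(signal)) ** 2
    total = power[1:].sum()                # ignore the DC bin
    if total == 0:                         # perfectly flat coverage
        return 0.0
    # With n a multiple of 3, bin n // 3 is the frequency 1/3 cycles per nt.
    return float(power[n // 3] / total)

# Toy data: reads piled on the first codon position vs. flat coverage.
in_frame = [9, 0, 0] * 10
flat = [3] * 30
print(periodicity_score(in_frame), periodicity_score(flat))
```

A score near 1 means almost all variation in coverage repeats every codon, as expected for translating ribosomes; a score near 0 means no codon-level rhythm. A real tool would also need a significance threshold and handling of low-coverage regions, which this sketch leaves out.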


Towards the characterization of the hidden world of small proteins in Staphylococcus aureus, a proteogenomics approach

Stephan Fuchs, Martin Kucklick, Erik Lehmann, Alexander Beckmann, Maya Wilkens, Baban Kolte, Ayten Mustafayeva, Tobias Ludwig, Maurice Diwo, Josef Wissing, Lothar Jänsch, Christian H. Ahrens, Zoya Ignatova, Susanne Engelmann

Small proteins play essential roles in bacterial physiology and virulence; however, automated algorithms for genome annotation are often not yet able to accurately predict the corresponding genes. The accuracy and reliability of genome annotations, particularly for small open reading frames (sORFs), can be significantly improved by integrating protein evidence from experimental approaches. Here we present a highly optimized and flexible bioinformatics workflow for bacterial proteogenomics covering all steps from (i) generation of protein databases, (ii) database searches and (iii) peptide-to-genome mapping to (iv) visualization of results. We used the workflow to identify high-quality peptide spectrum matches (PSMs) for small proteins (≤ 100 aa, SP100) in Staphylococcus aureus Newman. Protein extracts from S. aureus were subjected to different experimental workflows for protein digestion and prefractionation and measured with highly sensitive mass spectrometers. In total, 175 proteins with up to 100 aa (SP100) were identified. Of these, 24 (ranging from 9 to 99 aa) were novel and not contained in the genome annotation used. 144 SP100 are highly conserved and were found in at least 50% of the publicly available S. aureus genomes, while 127 are additionally conserved in other staphylococci. Almost half of the identified SP100 were basic, suggesting a role in binding to more acidic molecules such as nucleic acids or phospholipids.

The small DUF1127 protein CcaF1 from Rhodobacter sphaeroides is an RNA-binding protein involved in sRNA maturation and RNA turnover

Julian Grützner, Fabian Billenkamp, Daniel-Timon Spanka, Tim Rick, Vivian Monzon, Konrad U. Förstner, Gabriele Klug

Many different protein domains are conserved among numerous species, but their function remains obscure. Proteins with DUF1127 domains number >17 000 in current databases, but a biological function has not yet been assigned to any of them. They are mostly found in alpha- and gammaproteobacteria, some of them plant and animal pathogens, symbionts or species used in industrial applications. Bioinformatic analyses revealed similarity of the DUF1127 domain of bacterial proteins to the RNA-binding domain of eukaryotic Smaug proteins, which are involved in RNA turnover and have a role in development from Drosophila to mammals. This study demonstrates that the 71-amino-acid DUF1127 protein CcaF1 from the alphaproteobacterium Rhodobacter sphaeroides participates in maturation of the CcsR sRNAs, which are processed from the 3′ UTR of the ccaF mRNA and have a role in the oxidative stress defense. CcaF1 binds to many cellular RNAs of different types, including several mRNAs with a function in cysteine / methionine / sulfur metabolism. It affects the stability of the CcsR RNAs and other non-coding RNAs and mRNAs. Thus, the widely distributed DUF1127 domain can mediate RNA binding, affect the stability of its binding partners and consequently modulate the bacterial transcriptome, thereby influencing different physiological processes.

Multi-protease Approach for the Improved Identification and Molecular Characterization of Small Proteins and Short Open Reading Frame-Encoded Peptides

Philipp T. Kaulich, Liam Cassidy, Jürgen Bartel, Ruth A. Schmitz, Andreas Tholey


The identification of proteins below approximately 70-100 amino acids in bottom-up proteomics is still a challenging task due to the limited number of peptides generated by proteolytic digestion. This includes the short open reading frame-encoded peptides (SEPs), which are a subset of the small proteins that were not previously annotated or that are alternatively encoded. Here, we systematically investigated the use of multiple proteases (trypsin, chymotrypsin, LysC, LysargiNase, and GluC) in GeLC-MS/MS analysis to improve the sequence coverage and the number of identified peptides for small proteins, with a focus on SEPs, in the archaeon Methanosarcina mazei. Combining the data of all proteases, we identified 63 small proteins and an additional 28 SEPs with at least two unique peptides, while only 55 small proteins and 22 SEPs could be identified using trypsin only. For 27 small proteins and 12 SEPs, complete sequence coverage was achieved. Moreover, for five SEPs, incorrectly predicted translation start points or potential in vivo proteolytic processing were identified, confirming the data of a previous top-down proteomics study of this organism. The results clearly show that a multi-protease approach improves the identification and molecular characterization of small proteins and SEPs. LC-MS data: ProteomeXchange PXD023921.
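Why several proteases help can be seen in a small in silico digestion. The sketch below is an illustration only, not the authors' pipeline. It assumes simplified cleavage rules (no missed cleavages, no proline exception, GluC cutting only after E, chymotrypsin only after F/W/Y), an assumed detectable peptide length window of 6 to 30 residues, and a made-up toy sequence. The names `RULES`, `digest` and `coverage` are hypothetical.

```python
# Simplified cleavage rules: (target residues, side of the residue that is cut).
# "C" = cut after the residue, "N" = cut before it. These are assumptions for
# illustration; real enzymes have exceptions and missed cleavages.
RULES = {
    "trypsin": ("KR", "C"),
    "LysC": ("K", "C"),
    "GluC": ("E", "C"),
    "chymotrypsin": ("FWY", "C"),
    "LysargiNase": ("KR", "N"),
}

def digest(seq, protease):
    """Split a protein sequence into peptides for one protease."""
    residues, side = RULES[protease]
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        if aa in residues:
            cut = i + 1 if side == "C" else i
            if cut > start:                 # skip empty fragments
                peptides.append(seq[start:cut])
                start = cut
    if start < len(seq):                    # trailing peptide
        peptides.append(seq[start:])
    return peptides

def coverage(seq, proteases, min_len=6, max_len=30):
    """Fraction of residues covered by peptides within the length window."""
    covered = [False] * len(seq)
    for name in proteases:
        pos = 0
        for pep in digest(seq, name):
            if min_len <= len(pep) <= max_len:
                for j in range(pos, pos + len(pep)):
                    covered[j] = True
            pos += len(pep)
    return sum(covered) / len(seq)

# Hypothetical 40-residue small protein (not from M. mazei).
toy = "MKRKAEELLKKGDFSYVAEKRLKEWAGTPEKKRSLIDEFG"
print(coverage(toy, ["trypsin"]))
print(coverage(toy, list(RULES)))
```

Because the covered positions are pooled across enzymes, adding proteases can only keep coverage equal or raise it. For lysine- and arginine-rich small proteins, trypsin alone tends to produce many peptides too short to detect, which is where enzymes with other specificities contribute.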