Description

LNCipedia.org is an integrated database of annotated human long non-coding RNA (lncRNA) transcripts and genes obtained from different sources [1]. The latest iteration of the database (version 5.2) contains 127,802 lncRNA transcripts (56,946 genes). In addition to basic transcript information and structure, several statistics are calculated for each entry in the database, such as secondary structure information, protein coding potential and microRNA binding sites.

LncRNAs constitute a large and diverse class of non-coding RNA genes. While several lncRNAs have been functionally annotated, the majority remain to be characterized. Different high-throughput methods to identify new lncRNAs (including RNA sequencing and annotation of chromatin-state maps) have been applied in various studies, resulting in multiple unrelated lncRNA datasets.

The database is publicly available and allows users to query and download lncRNA sequences and structures based on different search criteria. The database may serve as a source of information on individual lncRNAs or as a starting point for large-scale studies.

	Full set	High-confidence set
Transcripts	127,802	107,039
Genes	56,946	49,372

Methods

Content

The sources used in the data collection step are listed in the table below. The most recent version of each source at the time of development has been included. The sequences and annotations are extracted and stored in a MongoDB database using custom Perl scripts. For this purpose, import scripts for different file formats, such as FASTA, BED and GFF, have been developed. Redundant transcripts are grouped in a single record, while maintaining all annotation from the original sources. The web interface for LNCipedia is built using the Mojolicious Perl web framework and offers different ways of querying the data. LNCipedia is updated when newer versions of the lncRNA sources are released or when new sources become available. In addition, researchers are encouraged to submit new transcript sequences or annotations through lncipedia.org.
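The grouping of redundant transcripts can be sketched as follows. This is a minimal Python illustration, not the Perl import code used by LNCipedia. It assumes that two transcripts are redundant when they share chromosome, strand and exon coordinates; the actual redundancy criterion is not stated here. The field names (`chrom`, `strand`, `exons`, `source`, `source_id`) and the example identifiers are hypothetical.

```python
from collections import defaultdict


def group_redundant(transcripts):
    """Merge transcripts with identical structure into one record each.

    Assumption: "redundant" means same chromosome, strand and exon
    coordinates. Every source annotation is kept on the merged record.
    """
    records = defaultdict(list)
    for t in transcripts:
        key = (t["chrom"], t["strand"], tuple(sorted(t["exons"])))
        records[key].append({"source": t["source"], "id": t["source_id"]})
    return [
        {"chrom": k[0], "strand": k[1], "exons": list(k[2]), "annotations": v}
        for k, v in records.items()
    ]


# Toy input, invented for illustration only.
transcripts = [
    {"chrom": "chr1", "strand": "+", "exons": [(100, 200), (300, 400)],
     "source": "Ensembl", "source_id": "T1"},
    {"chrom": "chr1", "strand": "+", "exons": [(100, 200), (300, 400)],
     "source": "NONCODE", "source_id": "N1"},
    {"chrom": "chr2", "strand": "-", "exons": [(50, 80)],
     "source": "RefSeq", "source_id": "R1"},
]
merged = group_redundant(transcripts)
```

Here the first two toy transcripts collapse into one record that carries both source annotations, while the third stays on its own.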

Source	Version	Number of transcripts
Ensembl	90	26,115
RefSeq	NCBI Annotation Release 106	5,487
NONCODE	4	68,331
FANTOM CAT	-	27,719
Human Body Map lincRNAs	-	13,964
Hangauer et al., 2013	-	5,298
Nielsen et al., 2014	-	7,134
Sun and Gadad et al., 2015	-	2,198

High-confidence set

Since LNCipedia contains a non-negligible number of putative coding transcripts, we have introduced a filtering strategy to create a stringent or high-confidence data set [2]. The high-confidence set contains transcripts that do not show coding potential by any metric. The following methods are used to assess the coding potential of the transcripts:

PhyloCSF: Coding Potential of a multi-species nucleotide sequence alignment

We benchmarked the PhyloCSF algorithm on the (non)coding Ensembl data, achieving a specificity and sensitivity of 93% at a cutoff of 60.7876. A score below this cutoff indicates that the transcript is non-coding; a score above it indicates that the transcript is likely to be coding.
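To make the relationship between a score cutoff and the reported sensitivity and specificity concrete, the sketch below computes both values for a labeled set of scores. It is an illustration only: the scores are invented toy values, the function name is hypothetical, and treating a score equal to the cutoff as non-coding is an assumption.

```python
def sensitivity_specificity(scores, is_coding, cutoff):
    """Return (sensitivity, specificity) when scores above cutoff are called coding."""
    pairs = list(zip(scores, is_coding))
    tp = sum(1 for s, c in pairs if c and s > cutoff)       # coding, called coding
    fn = sum(1 for s, c in pairs if c and s <= cutoff)      # coding, missed
    tn = sum(1 for s, c in pairs if not c and s <= cutoff)  # non-coding, called non-coding
    fp = sum(1 for s, c in pairs if not c and s > cutoff)   # non-coding, called coding
    return tp / (tp + fn), tn / (tn + fp)


PHYLOCSF_CUTOFF = 60.7876  # cutoff stated in the text

# Toy scores, invented for illustration only.
scores = [150.2, 75.0, 20.1, -30.5, 10.0, 90.3]
is_coding = [True, True, True, False, False, False]
sens, spec = sensitivity_specificity(scores, is_coding, PHYLOCSF_CUTOFF)
```

Sensitivity is the fraction of known coding transcripts scoring above the cutoff, and specificity is the fraction of known non-coding transcripts scoring at or below it.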

CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model

We use the CPAT algorithm to calculate the coding probability from the sequence of the lncRNA. We apply the suggested coding probability cutoff of 0.364, which corresponds to a sensitivity and specificity of 0.966.

PRIDE: database search

We have re-analysed more than 100 Homo sapiens proteomics projects from the PRIDE database by searching the MS/MS spectra against the standard UniProtKB/Swiss-Prot human database together with the translated version of LNCipedia [2].

Ribosome-profiling: Lee et al., 2012 and Bazzini et al., 2014

253 lncRNAs containing small open reading frames (smORFs) are provided by Bazzini et al., 2014. Bazzini and colleagues developed an approach to detect smORFs using ribosome profiling whereby the periodicity of ribosome movement on actively translated ORFs is used to distinguish coding from non-coding sequences.
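The idea of periodicity can be illustrated with a small sketch: ribosomes advance one codon at a time, so footprint positions on a translated ORF concentrate in one of the three reading frames. The code below only computes the per-frame fractions. It is not the scoring method of Bazzini et al., and the function name and toy offsets are invented for illustration.

```python
from collections import Counter


def frame_fractions(psite_offsets):
    """Fraction of ribosome P-site positions in each reading frame (0, 1, 2).

    psite_offsets: positions relative to the first nucleotide of the ORF.
    Expects at least one offset.
    """
    counts = Counter(offset % 3 for offset in psite_offsets)
    total = sum(counts.values())
    return [counts[frame] / total for frame in range(3)]


# Toy offsets, invented for illustration only.
periodic = [0, 3, 6, 9, 12, 15, 18, 21, 1, 5]  # mostly frame 0
uniform = [0, 1, 2, 3, 4, 5, 6, 7, 8]          # spread evenly over the frames

periodic_fractions = frame_fractions(periodic)
uniform_fractions = frame_fractions(uniform)
```

A strong bias towards a single frame is the kind of signal expected from active translation, whereas an even spread gives no such evidence.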

A second approach to apply ribosome profiling in the quest for novel coding RNAs has been described by Lee et al., 2012. Using lactimidomycin, a ribosome inhibitor specific to initiating ribosomes, translation initiation sites (TIS) were mapped in HEK-293 cells.
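Taken together, the four lines of evidence above define the high-confidence set: a transcript is kept only if none of them suggests coding potential. A minimal sketch of such a filter is given below. The class, field names and example transcripts are hypothetical, and treating a score equal to a cutoff as coding is an assumption; only the two cutoff values come from the text.

```python
from dataclasses import dataclass

PHYLOCSF_CUTOFF = 60.7876  # from the PhyloCSF benchmark described above
CPAT_CUTOFF = 0.364        # suggested CPAT coding probability cutoff


@dataclass
class Transcript:
    name: str
    phylocsf_score: float
    cpat_probability: float
    pride_hit: bool    # peptide evidence in the PRIDE re-analysis
    riboseq_hit: bool  # reported by Lee et al., 2012 or Bazzini et al., 2014


def is_high_confidence(t):
    """True if no metric indicates coding potential."""
    return (
        t.phylocsf_score < PHYLOCSF_CUTOFF
        and t.cpat_probability < CPAT_CUTOFF
        and not t.pride_hit
        and not t.riboseq_hit
    )


# Toy transcripts, invented for illustration only.
candidates = [
    Transcript("example-1", -12.4, 0.05, False, False),
    Transcript("example-2", 85.0, 0.10, False, False),  # PhyloCSF above cutoff
    Transcript("example-3", 3.2, 0.02, True, False),    # peptide evidence
]
high_confidence = [t.name for t in candidates if is_high_confidence(t)]
```

Requiring every test to pass makes the filter stringent: a single positive result from any method removes the transcript from the high-confidence set.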

Credits and contact

LNCipedia is maintained at Ghent University and VIB. If you wish to contact us, please use the contact form on lncipedia.org or email pieterjan.volders@ugent.be.

References

[1] LNCipedia: a database for annotated human lncRNA transcript sequences and structures
Pieter-Jan Volders, Kenny Helsens, Xiaowei Wang, Björn Menten, Lennart Martens, Kris Gevaert, Jo Vandesompele and Pieter Mestdagh
Nucleic Acids Res. 2013 January 1. doi: 10.1093/nar/gks915

[2] An update on LNCipedia: a database for annotated human lncRNA sequences
Pieter-Jan Volders, Kenneth Verheggen, Gerben Menschaert, Klaas Vandepoele, Lennart Martens, Jo Vandesompele and Pieter Mestdagh
Nucleic Acids Res. 2015 January 1. doi: 10.1093/nar/gku1060