How to Improve Your RNA-Seq Data with UMIs and ERCC RNA
Quantifying transcripts accurately and reproducibly is critical when studying gene expression. Scientists have developed two powerful tools for RNA sequencing (RNA-Seq) experiments to minimize bias and variability: unique molecular identifiers (UMIs) and ERCC RNA spike-ins. We’ll cover the basics of each and how they can help you produce better RNA-Seq data.
Unique Molecular Identifiers (UMIs)
What are UMIs?
UMIs are short, random nucleotide sequences added to individual transcripts during library preparation. They are typically 8 to 12 nucleotides in length and act as molecular barcodes, each one uniquely identifying a cDNA molecule. A UMI is distinct from an index sequence, which is the same for all molecules in a sample. Indexes are barcodes that enable multiplexing (pooling) of samples for sequencing.
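To get a feel for why 8 to 12 nucleotides is enough, here is a small Python sketch that counts how many distinct barcodes a random UMI of a given length can form. The function name umi_space is our own, chosen for illustration, and the calculation assumes all four bases are equally allowed at every position.

```python
def umi_space(length: int, alphabet_size: int = 4) -> int:
    """Return how many distinct sequences of `length` bases exist.

    Assumes every position can be any of `alphabet_size` bases (A, C, G, T).
    """
    return alphabet_size ** length


# Print the barcode space for a few UMI lengths in the typical range.
for n in (8, 10, 12):
    print(f"{n}-nt UMI: {umi_space(n):,} possible sequences")
```

The barcode space grows fourfold with each added base, so longer UMIs make it less likely that two different molecules from the same gene receive the same barcode by chance.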
Why are UMIs used?
Libraries usually undergo PCR amplification to increase the amount of material available for sequencing, which raises the chances that low-abundance transcripts will be read. However, PCR is not a neutral copying process. It’s usually biased, as some molecules in a library are amplified more efficiently than others. PCR can also introduce errors, or mutations, when making copies. As a result, the sequencing data may not accurately represent the original population of RNA molecules. In other words, the relative abundance of reads can differ from that of the input RNA.
UMIs help correct the bias and errors caused by PCR amplification. Because each original cDNA molecule is tagged with a UMI before amplification, all its PCR copies carry the same barcode. After sequencing, bioinformatic tools identify PCR duplicates (reads from the same transcript that share a UMI) and remove them from the data set, so each original molecule is counted once. PCR and sequencing errors can also be identified and corrected by comparing the sequences of all reads with the same UMI. The cleaned-up data is then ready for gene expression analysis.
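The idea can be sketched in a few lines of Python. This is a simplified illustration, not how any particular deduplication tool works: the function name dedup_by_umi, the (gene, UMI, sequence) read format, and the example reads are all made up for this post. The sketch assumes reads in a group have equal length and that the UMI itself was read without error, which dedicated tools do not take for granted.

```python
from collections import Counter, defaultdict


def dedup_by_umi(reads):
    """Collapse reads that share a gene and a UMI into one molecule each.

    `reads` is an iterable of (gene, umi, sequence) tuples (hypothetical format).
    Returns ({gene: molecule_count}, {(gene, umi): consensus_sequence}).
    """
    # Group read sequences by the original molecule they came from.
    groups = defaultdict(list)
    for gene, umi, seq in reads:
        groups[(gene, umi)].append(seq)

    molecule_counts = Counter()
    consensus = {}
    for (gene, umi), seqs in groups.items():
        # Each (gene, UMI) group counts as one original molecule.
        molecule_counts[gene] += 1
        # Majority vote at each position; assumes equal-length reads.
        consensus[(gene, umi)] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
    return dict(molecule_counts), consensus


# Made-up example: three PCR copies of one GeneA molecule (one with an error),
# a second GeneA molecule, and one GeneB molecule.
reads = [
    ("GeneA", "ACGTACGT", "TTGACC"),
    ("GeneA", "ACGTACGT", "TTGACC"),
    ("GeneA", "ACGTACGT", "TTGTCC"),  # copy carrying a single-base error
    ("GeneA", "GGCATTAC", "TTGACC"),
    ("GeneB", "ACGTACGT", "CCATGA"),
]
counts, consensus = dedup_by_umi(reads)
print(counts)
print(consensus[("GeneA", "ACGTACGT")])
```

Without UMIs, GeneA would appear to have four reads against GeneB’s one. After collapsing, GeneA is counted as two molecules, and the erroneous base in the third copy is outvoted by the other two.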
When should UMIs be used?
UMIs are most useful for deep sequencing (>50 million reads/sample) and for low-input RNA samples, as these scenarios are the most likely to produce PCR duplicates.
ERCC RNA
What is ERCC RNA?
ERCC stands for the External RNA Controls Consortium. The ERCC is a group of researchers and organizations that has developed a set of synthetic RNA molecules to standardize RNA quantification in gene expression profiling. The ERCC spike-in mix contains 92 transcripts of known concentration. The set is organized into four subgroups, each containing 23 controls that span six orders of magnitude in concentration. The ERCC mix is added to an RNA sample at the start of library preparation.
Why is ERCC RNA used?
ERCC spike-ins help standardize RNA quantification across different experiments. By comparing the read counts of the ERCC controls with their known input concentrations, researchers can determine the sensitivity (i.e., the limit of detection), dynamic range, linearity, and accuracy of an RNA-Seq experiment. They can also control for technical variation between runs.
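As a rough illustration of how linearity and sensitivity might be checked, the Python sketch below fits a straight line to log2 read counts against log2 known concentrations and reports the lowest concentration that was detected. The function names, the pseudocount of 1, the one-read detection threshold, and the example numbers are all our assumptions for this post. The values are invented to keep the arithmetic easy to follow and are not real ERCC concentrations or measurements.

```python
import math


def ercc_log_fit(known_conc, observed_counts, pseudocount=1.0):
    """Least-squares fit of log2(count + pseudocount) on log2(concentration).

    Returns (slope, intercept, r_squared). A slope near 1 and an r_squared
    near 1 suggest read counts scale linearly with input amount.
    """
    xs = [math.log2(c) for c in known_conc]
    ys = [math.log2(n + pseudocount) for n in observed_counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


def limit_of_detection(known_conc, observed_counts, min_reads=1):
    """Return the lowest concentration with at least `min_reads` reads, or None."""
    detected = [c for c, n in zip(known_conc, observed_counts) if n >= min_reads]
    return min(detected) if detected else None


# Invented illustration values (arbitrary concentration units), not real data.
known_conc = [0.5, 4, 32, 256, 2048]
observed_counts = [1, 15, 127, 1023, 8191]

slope, intercept, r_squared = ercc_log_fit(known_conc, observed_counts)
lod = limit_of_detection(known_conc, observed_counts)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r_squared:.3f}")
print(f"lowest detected concentration: {lod}")
```

Running the same fit on each library gives a simple set of numbers (slope, R², lowest detected concentration) that can be compared from run to run.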
When should ERCC RNA be used?
ERCC RNA is used to compare the technical performance of runs across multiple experiments.
Conclusion
RNA-Seq is widely used for gene expression analysis, but the journey from purified RNA to read counts is susceptible to bias, sequencing errors, and variability. Fortunately, using UMIs and ERCC RNA can mitigate these risks and generate more accurate results.
Are you ready to generate high-quality RNA-Seq data the first time around? Consider adding UMIs and ERCC RNA to your GENEWIZ RNA-Seq experiment to reduce technical variability and enhance the reliability of your data.