A complete metatranscriptome analysis pipeline
This project is maintained by transcript
Version 2 of the SAMSA pipeline. Faster! Lighter! More options! Less waiting!
The following programs can either be downloaded or installed from the binaries provided in the programs/ folder.
git clone https://github.com/transcript/samsa2.git
Metatranscriptomics - RNA-seq data from multiple members of a microbial community - offers powerful insights into the workings of a complex ecosystem. RNA sequences can not only identify the individual members of a community down to the strain level, but also provide information on the activity of these microbes at the time of sample collection - something that cannot be determined through other meta-omics methods (metagenomics, 16S rRNA sequencing).
However, working with metatranscriptome data often proves challenging, given its high complexity and large size. SAMSA is one of the first bioinformatics pipelines designed with metatranscriptome data specifically in mind. It accepts raw sequence data in FASTQ form as its input, and performs four phases:
Preprocessing: If the sequencing was paired-end, PEAR is used to merge mate pairs. Trimmomatic is used for the removal of adaptor contamination and low-quality reads. SortMeRNA removes ribosomal sequences, as these don’t contribute to the mRNA functional profile of the metatranscriptome.
Annotation: Annotation is completed using DIAMOND, an accelerated BLAST-like sequence aligner. (Why DIAMOND? At a standard rate of 10 annotations per second, a standard BLAST approach would take several months to finish - just for a single file!)
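The "several months" figure follows from simple arithmetic. The sketch below works it through for a hypothetical file of 100 million reads; that read count is an assumption chosen for illustration, not a number from the SAMSA documentation, and only the 10 annotations per second rate comes from the text above.

```python
# Back-of-the-envelope runtime estimate for annotating reads at a fixed rate.
SECONDS_PER_DAY = 24 * 60 * 60


def annotation_days(read_count: int, reads_per_second: float) -> float:
    """Return the number of days needed to annotate read_count reads."""
    return read_count / reads_per_second / SECONDS_PER_DAY


# Assumed file size: 100 million reads (hypothetical, for illustration only).
assumed_reads = 100_000_000
days = annotation_days(assumed_reads, reads_per_second=10)
print(f"{days:.0f} days (~{days / 30:.1f} months)")
```

Smaller files finish sooner, but the estimate scales linearly with read count, which is why a faster aligner matters for this kind of data.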
Aggregation: DIAMOND returns results on a per-read basis, a bit like a ticker tape or a line item receipt. In the aggregation step, Python scripts condense these line-by-line results to create summary tables.
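To illustrate the idea of aggregation, here is a minimal sketch that condenses per-read hits into a count table. It is not one of the pipeline's own scripts. It assumes tab-separated, BLAST-style tabular output in which the second column is the matched reference ID; the function names and the three example lines are hypothetical.

```python
from collections import Counter
import io


def summarize_hits(lines):
    """Count how many reads matched each reference (second tab-separated column)."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip malformed lines
        counts[fields[1]] += 1
    return counts


def format_summary(counts):
    """Return rows of 'percent<TAB>count<TAB>reference', most abundant first."""
    total = sum(counts.values())
    return [
        f"{100 * n / total:.2f}\t{n}\t{subject}"
        for subject, n in counts.most_common()
    ]


# Hypothetical per-read results: read ID, matched reference, percent identity.
example = io.StringIO(
    "read1\trefA\t98.0\n"
    "read2\trefB\t95.5\n"
    "read3\trefA\t99.1\n"
)
counts = summarize_hits(example)
for row in format_summary(counts):
    print(row)
```

In practice the input would be a file handle opened on a DIAMOND results file rather than an in-memory string.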
Analysis: R scripts use DESeq to identify the most significantly different features between control and experimental samples. These R scripts generate tabular output with assigned p-values and log2FoldChange scores for each feature. These ‘features’ can be either organisms or specific functions. R can also create graphs that give a visual representation of the metatranscriptome(s).
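For readers unfamiliar with log2FoldChange, the sketch below shows what the score means for a single feature. It is a deliberate simplification and not what DESeq computes: DESeq also normalizes for library size, models dispersion, and derives p-values. The pseudocount, the function name, and the example counts are assumptions made for illustration.

```python
import math


def log2_fold_change(control_counts, experimental_counts, pseudocount=1.0):
    """Log2 ratio of mean experimental to mean control counts for one feature.

    The pseudocount (an assumption of this sketch) avoids division by zero
    when a feature is absent from one group.
    """
    control_mean = sum(control_counts) / len(control_counts)
    experimental_mean = sum(experimental_counts) / len(experimental_counts)
    return math.log2(
        (experimental_mean + pseudocount) / (control_mean + pseudocount)
    )


# Hypothetical read counts for one organism or function across three samples each.
control = [10, 12, 8]
experimental = [40, 46, 43]
lfc = log2_fold_change(control, experimental)
print(f"log2FoldChange = {lfc:.2f}")
```

In this sketch a positive score means the feature is more abundant in the experimental samples, and each unit corresponds to a doubling.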
For more information, please consult the manual, which goes into more detail on each step in the SAMSA pipeline.
Preprocessing: The following program steps can either be run through master_script.bash or individually:
Annotation: This step can be performed through master_script.bash, or individually.
Aggregation: This step can be performed through master_script.bash, or individually.
Analysis:
Note: these programs are located in the “R_scripts” folder. They all require two sets of DIAMOND summarized results from the aggregation step: an experimental set and a control set.
Step 1: check the documentation! The documentation includes in-depth explanations of each step, including sample commands. Be sure to check there if you’re having an issue on one particular step.
If you’re unsure whether your files are being processed properly, take a look at the sample files. These correspond to each step in the pipeline. If a quick look (from the command line, “less $file”) reveals a layout that differs from these example files, there may be an issue with the most recent program used in the pipeline.
Check out the Google Group for SAMSA: https://groups.google.com/forum/#!forum/samsa-bioinformatics-group
If you find an error with one of these programs, or simply want to ask me questions, you can contact me at swestreich@gmail.com.
Westreich, S.T., Korf, I., Mills, D.A., Lemay, D.G. (2016). SAMSA: A comprehensive metatranscriptome analysis pipeline. BMC Bioinformatics.
Bolger, A.M., Lohse, M., Usadel, B. (2014). Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics, btu170.
Kopylova, E., Noé, L., Touzet, H. (2012). SortMeRNA: Fast and accurate filtering of ribosomal RNAs in metatranscriptomic data. Bioinformatics. doi: 10.1093/bioinformatics/bts611.
Zhang, J., Kobert, K., Flouri, T., Stamatakis, A. (2014). PEAR: a fast and accurate Illumina Paired-End reAd mergeR. Bioinformatics.