SAMSA2

A complete metatranscriptome analysis pipeline

This project is maintained by transcript

Version 2 of the SAMSA pipeline - faster! Lighter! More options! Less waiting!

New in version 2:

Dependencies

The following programs can be downloaded OR can be installed from the binaries provided in the programs/ folder.

  1. DIAMOND, version 0.8.3: https://github.com/bbuchfink/diamond
  2. Trimmomatic, a flexible read cleaner: http://www.usadellab.org/cms/?page=trimmomatic
  3. PEAR, if using paired-end data (recommended): https://sco.h-its.org/exelixis/web/software/pear/
  4. SortMeRNA: http://bioinfo.lifl.fr/RNA/sortmerna/

Quick start

  1. Download SAMSA2:
    git clone https://github.com/transcript/samsa2.git
  2. Either install the dependencies from the links above, or use the setup_and_test/package_installation.bash script provided with SAMSA2 for installing from the included binaries.
  3. Edit master_script.bash to match your setup. This script performs the first 3 of the 4 steps in the SAMSA2 pipeline (preprocessing, annotation, aggregation).
  4. If not using master_script, use DIAMOND to annotate your reads against a database of your choosing (note that database must be local and DIAMOND-indexed). See “example_DIAMOND_annotation_script.bash” for more details.
  5. If not using master_script, use “DIAMOND_analysis_counter.py” to create a ranked abundance summary of the DIAMOND results from each metatranscriptome file.
  6. Import these abundance summaries into R and use “run_DESeq_stats.R” to determine the most significantly differing features between either individual metatranscriptomes, or control vs. experimental groups.

Background

Metatranscriptomics - RNA-seq data from multiple members of a microbial community - offers powerful insights into the workings of a complex ecosystem. RNA sequences can not only identify the individual members of a community down to the strain level, but also provide information on the activity of these microbes at the time of sample collection - something that cannot be determined through other meta- methods (metagenomics, 16S rRNA sequencing).

However, working with metatranscriptome data often proves challenging, given its high complexity and large size. SAMSA is one of the first bioinformatics pipelines designed with metatranscriptome data specifically in mind. It accepts raw sequence data in FASTQ form as its input, and performs four phases:

Preprocessing: If the sequencing was paired-end, PEAR is used to merge mate pairs. Trimmomatic is used for the removal of adaptor contamination and low-quality reads. SortMeRNA removes ribosomal sequences, as these don’t contribute to the mRNA functional profile of the metatranscriptome.

Annotation: Annotation is completed using DIAMOND, an accelerated BLAST-like sequence aligner. (Why DIAMOND? At a standard rate of 10 annotations per second, a standard BLAST approach would take several months to finish - just for a single file!)
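The runtime claim above can be checked with a quick back-of-the-envelope calculation. This is only a sketch: the read count of 100 million per file is a hypothetical assumption chosen for illustration, not a figure from the SAMSA2 documentation, and `blast_days` is a helper name made up here.

```python
# Back-of-the-envelope estimate of a serial BLAST-style annotation run.
# ASSUMPTION: 100 million reads in one file (hypothetical, for illustration only).
READS_PER_FILE = 100_000_000
ANNOTATIONS_PER_SECOND = 10  # the rate quoted in the text above

SECONDS_PER_DAY = 60 * 60 * 24


def blast_days(reads: int, rate_per_second: float) -> float:
    """Days needed to annotate `reads` sequences at a fixed rate."""
    return reads / rate_per_second / SECONDS_PER_DAY


days = blast_days(READS_PER_FILE, ANNOTATIONS_PER_SECOND)
print(f"{days:.1f} days (~{days / 30:.1f} months)")
```

Under that assumed file size, the estimate comes out to a bit under four months, which is in line with "several months" for a single file.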

Aggregation: DIAMOND returns results on a per-read basis, a bit like a ticker tape or a line item receipt. In the aggregation step, Python scripts condense these line-by-line results to create summary tables.
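As a rough illustration of what aggregation means, the sketch below condenses per-read results into a ranked summary. It is not the code in DIAMOND_analysis_counter.py. It assumes tab-separated input with the read ID in the first column and the matched database entry in the second (the BLAST-style tabular layout), and one hit per read. The function name, the sample data, and the output column order are hypothetical.

```python
import csv
import io
from collections import Counter


def summarize_hits(tabular_lines):
    """Count reads per subject ID; return (percent, count, subject) rows, most abundant first."""
    counts = Counter()
    for row in csv.reader(tabular_lines, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        counts[row[1]] += 1  # ASSUMPTION: second column holds the database hit ID
    total = sum(counts.values())
    return [(100.0 * n / total, n, subject) for subject, n in counts.most_common()]


# Hypothetical per-read results, one line per read (extra columns are ignored).
example = io.StringIO(
    "read1\tgeneA\t98.0\n"
    "read2\tgeneB\t91.5\n"
    "read3\tgeneA\t99.1\n"
    "read4\tgeneA\t95.0\n"
)
for percent, count, subject in summarize_hits(example):
    print(f"{percent:.1f}\t{count}\t{subject}")
```

Each input line describes one read, while each output row describes one feature, which is the line-item-receipt-to-summary-table step described above.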

Analysis: R scripts use DESeq to compute the most significantly different features between control and experimental samples. These R scripts generate tabular output with assigned p-values and log2FoldChange scores for each feature. These ‘features’ can be either organisms or specific functions. R can also create graphs that visualize the metatranscriptome(s).
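To give a feel for how a log2FoldChange score reads, here is a minimal sketch of a plain log2 ratio between group means. This is not what DESeq computes (DESeq normalizes counts and fits a statistical model before reporting fold changes and p-values). The pseudocount, the feature names, the counts, and the sign convention (positive means higher in the experimental group) are all assumptions made for illustration.

```python
import math


def log2_fold_change(control_mean: float, experimental_mean: float,
                     pseudocount: float = 1.0) -> float:
    """log2 of the experimental/control ratio; the pseudocount avoids division by zero."""
    return math.log2((experimental_mean + pseudocount) / (control_mean + pseudocount))


# Hypothetical mean read counts per feature: (control, experimental)
features = {
    "featureA": (99, 399),   # about 4x higher in experimental
    "featureB": (199, 49),   # about 4x lower in experimental
    "featureC": (63, 63),    # unchanged
}
for name, (ctrl, exp) in features.items():
    print(name, round(log2_fold_change(ctrl, exp), 2))
```

On this scale a score of 1 corresponds to a doubling, -1 to a halving, and 0 to no change between groups.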

Individual programs in SAMSA2 and their functions

For more information, please consult the manual, which goes into more detail on each step in the SAMSA pipeline.

Preprocessing: The following program steps can either be run through master_script.bash or individually:

Annotation: This step can be performed through master_script.bash, or individually.

Aggregation: This step can be performed through master_script.bash, or individually.

Analysis:
Note: these programs are located in the “R_scripts” folder. They all require two sets of DIAMOND summarized results from the aggregation step: an experimental set and a control set.

Need assistance?

Step 1: check the documentation! The documentation includes in-depth explanations of each step, including sample commands. Be sure to check there if you’re having an issue on one particular step.

If you’re unsure if your files are being processed properly, take a look at the sample files. These correspond to each step in the pipeline. If a quick look (from the command line, “less $file”) reveals a dissimilar setup to these example files, there may be an issue with the most recent program used in the pipeline.

Check out the Google Group for SAMSA! https://groups.google.com/forum/#!forum/samsa-bioinformatics-group.

If you find an error with one of these programs, or simply want to ask me questions, you can contact me at swestreich@gmail.com.

Citations of other tools used:

Westreich, S.T., Korf, I., Mills, D.A., Lemay, D.G. (2016) SAMSA: A comprehensive metatranscriptome analysis pipeline. BMC Bioinformatics.

Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: A flexible trimmer for Illumina Sequence Data. Bioinformatics, btu170.

Kopylova E., Noé L. and Touzet H., “SortMeRNA: Fast and accurate filtering of ribosomal RNAs in metatranscriptomic data”, Bioinformatics (2012), doi: 10.1093/bioinformatics/bts611.

Zhang, J., Kobert, K., Flouri, T., Stamatakis, A. (2014). PEAR: a fast and accurate Illumina Paired-End reAd mergeR. Bioinformatics.
