RandA is a tool for deep-sequencing data analysis. Specifically, RandA performs profiling and differential expression analysis for all sequenced non-coding RNA (ncRNA) transcripts. The tool performs several RNA-seq analysis steps: adapter clipping; alignment against a database derived from Rfam; handling of multiply mapped reads (SEQ-EM); read count normalization; and differential expression testing for each transcript detected in the samples.

Instructions

 

In order to run RandA please follow these instructions:

  • Download the RandA tool and extract it.
  • Download the RandA-compatible Rfam database and extract it.
  • Copy the Rfam.fa file to the DB folder inside the RandA folder.
  • After all the prerequisites are properly installed, run the jar file from the RandA folder.
  • To perform the analysis, follow the workflow described below.
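
The database copy step above can be scripted. The following is a minimal sketch in Python, assuming the RandA folder and the extracted Rfam.fa are both on the local disk; the function name install_rfam and the paths passed to it are hypothetical and are not part of RandA.

```python
import shutil
from pathlib import Path


def install_rfam(rfam_fa, randa_dir):
    """Copy an extracted Rfam.fa into the DB folder inside the RandA folder."""
    db_dir = Path(randa_dir) / "DB"
    # Create the DB folder if the extracted package does not contain it yet
    db_dir.mkdir(parents=True, exist_ok=True)
    target = db_dir / "Rfam.fa"
    shutil.copyfile(rfam_fa, target)
    return target
```

For example, install_rfam("downloads/Rfam.fa", "RandA") would place the file at RandA/DB/Rfam.fa (both paths are placeholders).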

Workflow

 

The workflow of the tool is divided into three main sections:



(1) INPUT & OUTPUT

RandA takes in deep sequencing output files, in FASTQ format or FASTA format.

  • The user should choose one or more deep sequencing output files for analysis, one for each sequenced sample.

  • For each file, the user should set the number of allowed alignment mismatches. When dealing with short RNA, it is recommended to reduce this number in order to produce more accurate results.

  • The user should assign a condition name (e.g. “Normal”, “Tumor”) to each file. Several samples may be assigned to the same condition (technical or biological replicates).

  • In the analysis section, these conditions are coupled so as to perform differential expression analysis between them. In order to add a condition to the conditions list, the user must click the “Update” button below the input files list.

  • The user may choose to clip and/or trim the sequence reads prior to alignment. RandA utilizes ea-utils' fastq-mcf tool.

    • Currently fastq-mcf supports clipping only in FASTQ files. If you wish to clip FASTA files, please perform the clipping prior to the RandA run, and do not check the clip and trim checkbox.

    • In order to perform clipping, the user should supply a FASTA file containing the possible adapter sequences (see example file).

    • In order to trim low-quality bases from the 3' end of the sequence reads, the user may set a minimal quality threshold.

  • The user can set the organisms to be included in the reference database created from the full Rfam database. The organism list can be filtered using the search-as-you-type text box.

  • The user should specify the output directory in which to create the RandA output folder.
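
To illustrate what 3' quality trimming with a minimal quality threshold means, here is a minimal sketch in Python. It is not the fastq-mcf algorithm; it simply removes trailing bases whose quality is below the threshold, and it assumes Phred+33 quality encoding. The function name trim_3prime is hypothetical.

```python
def trim_3prime(sequence, quality, min_quality, phred_offset=33):
    """Drop bases from the 3' end while their quality is below min_quality.

    Assumes Phred+33 encoded quality strings (an assumption of this sketch).
    """
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - phred_offset < min_quality:
        end -= 1
    return sequence[:end], quality[:end]


# The last two bases have qualities 2 ('#') and 0 ('!') and are removed
trimmed_seq, trimmed_qual = trim_3prime("ACGTAC", "IIII#!", min_quality=20)
```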





(2) DATABASE PREPARATION

  • Because Rfam is large and highly diverse in sequence, RandA allows the user to tailor the database in several ways. The first is the organism selection described above. The second is the selection of the RNA types that are relevant to the experiment.

  • The user may choose to save the newly created database by ticking the Save DB checkbox.

  • The user may also collapse the new database (prior to the analysis) in order to reduce redundancy. The collapse is performed either by sequence identity or by description identity.
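
The database manipulations listed above (organism selection, RNA type selection and collapsing) can be pictured with the following Python sketch. The Record fields, the function names and the rule of keeping the first entry of each duplicate group are assumptions made for illustration; they do not describe RandA's internal database format.

```python
from typing import NamedTuple


class Record(NamedTuple):
    # Hypothetical fields; the real RandA database layout may differ
    accession: str
    organism: str
    rna_type: str
    description: str
    sequence: str


def select(records, organisms=None, rna_types=None):
    """Keep records that match the chosen organisms and RNA types."""
    return [
        r for r in records
        if (organisms is None or r.organism in organisms)
        and (rna_types is None or r.rna_type in rna_types)
    ]


def collapse(records, by="sequence"):
    """Keep the first record for each distinct sequence or description."""
    seen, kept = set(), []
    for r in records:
        key = r.sequence.upper() if by == "sequence" else r.description
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept


records = [
    Record("acc1", "Homo sapiens", "miRNA", "mir-21", "ACGU"),
    Record("acc2", "Homo sapiens", "miRNA", "mir-21", "ACGG"),
    Record("acc3", "Mus musculus", "tRNA", "tRNA", "ACGU"),
]
human_mirna = select(records, organisms={"Homo sapiens"}, rna_types={"miRNA"})
by_sequence = collapse(records, by="sequence")
by_description = collapse(records, by="description")
```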





(3) ANALYSIS

  • In this section the user may select conditions on which differential expression analysis should be performed, or leave the differential expression table empty in order to perform a regular transcript expression profiling analysis for each of the input files.

  • RandA maps the reads against the newly formed database using a Burrows-Wheeler transform based alignment tool (BWA), summing the number of reads that mapped uniquely to each of the annotated ncRNA sequences. The user may choose to resolve ambiguous mappings using SEQ-EM, an algorithm that distributes multiply mapped reads based on the distribution of uniquely mapped reads. If the user does not tick the SEQ-EM checkbox, only uniquely mapped reads are incorporated into the downstream analysis.

  • For the purpose of differential expression analysis, RandA utilizes DESeq, an R-based tool that performs differential expression analysis on deep sequencing data and uses a negative binomial distribution model for variance estimation. Prior to the differential expression analysis, RandA reviews the number of samples under each condition (replicates) and sets the DESeq analysis parameters accordingly.
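
The idea of distributing multiply mapped reads according to the uniquely mapped ones can be sketched as follows. This is an illustrative EM-style loop in Python, not the SEQ-EM implementation; the function name, the input layout (one list of candidate transcripts per read) and the even split used when no unique evidence exists are all assumptions of this sketch.

```python
from collections import defaultdict


def distribute_multireads(read_hits, iterations=20):
    """read_hits: one list of candidate transcript IDs per read."""
    # Counts from uniquely mapped reads are fixed and seed the estimate
    unique = defaultdict(float)
    for hits in read_hits:
        if len(hits) == 1:
            unique[hits[0]] += 1.0
    multi = [hits for hits in read_hits if len(hits) > 1]

    counts = dict(unique)
    for _ in range(iterations):
        new_counts = defaultdict(float, unique)
        for hits in multi:
            weights = [counts.get(t, 0.0) for t in hits]
            total = sum(weights)
            if total == 0:
                # No evidence for any candidate: split the read evenly
                weights, total = [1.0] * len(hits), float(len(hits))
            for transcript, weight in zip(hits, weights):
                new_counts[transcript] += weight / total
        counts = dict(new_counts)
    return counts


# 3 reads unique to A, 1 unique to B, 4 reads mapping to both
hits = [["A"]] * 3 + [["B"]] + [["A", "B"]] * 4
estimated = distribute_multireads(hits)
```

In this toy input the ambiguous reads are shared 3:1 between A and B, matching the ratio of their uniquely mapped reads.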

Database

 

For the purpose of ncRNA alignment and subsequent annotation, RandA utilizes Rfam (v11.0), an extensive database of known ncRNA. The database covers almost 200,000 different organisms and contains more than 1 million unique sequences for a variety of RNA species such as rRNA, tRNA, miRNA, cis-regulatory elements, snRNA, snoRNA, ribozymes and other documented non-coding transcripts.

In order to run RandA, the user must first download the Rfam database (formatted for RandA) and copy it into the DB folder in the main RandA folder. The database will be updated every time a new Rfam version is available.
It is also possible to download the Rfam.full file from the Rfam FTP site and convert it into a RandA-compatible format using the prepDB package (see instructions).

Once downloaded, RandA allows the user to manipulate the full Rfam database, in order to create a more specific database to fit the relevant experimental needs. The user may choose to:

  • Select specific organisms to include in the database

  • Select specific ncRNA types (e.g. tRNA, rRNA) to include in the database

  • Collapse the database by sequence identity or RNA description, to lower redundancy. Since version 1.1.3 the user may choose to collapse same-description sequences by their sequence similarity. In order to do so, the user must choose to collapse the database by description and then set the identity level above which same-description sequences will be discarded. Lowering the identity level will result in a more specific, less sensitive database.
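
The similarity-based collapse of same-description sequences can be illustrated with the sketch below. How RandA measures identity is not stated here, so the similarity ratio from Python's difflib is used purely as a stand-in, and the function name and input layout are hypothetical.

```python
from difflib import SequenceMatcher


def collapse_by_similarity(entries, identity_level):
    """entries: list of (description, sequence) tuples.

    A sequence is discarded when an already kept sequence with the same
    description is more similar to it than identity_level (0 to 1).
    """
    kept = []
    for description, sequence in entries:
        redundant = any(
            kept_desc == description
            and SequenceMatcher(None, kept_seq, sequence).ratio() > identity_level
            for kept_desc, kept_seq in kept
        )
        if not redundant:
            kept.append((description, sequence))
    return kept


entries = [
    ("tRNA", "ACGTACGTAC"),
    ("tRNA", "ACGTACGTAA"),  # 0.9 similar to the first tRNA entry
    ("tRNA", "TTTTTTTTTT"),
    ("5S_rRNA", "ACGTACGTAC"),  # same sequence, different description
]
strict = collapse_by_similarity(entries, identity_level=0.95)
loose = collapse_by_similarity(entries, identity_level=0.80)
```

With the higher identity level all four entries are kept, while the lower level discards the second tRNA entry, which is the trade-off described in the last list item above.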

If the user ticks the “Save DB” checkbox, the newly created database is not removed, and it can be found in the DB folder in the main RandA folder.

Output

 

After the tool has finished the analysis, an output folder is created in the user-specified location. The output folder includes:


The main analysis table – an Excel (.xls) file for each pair of conditions set for differential expression. The columns of the table are:

  • RNA Accession – the transcript name taken from Rfam.

  • RNA type – the type of RNA transcript (e.g. miRNA, snoRNA).

  • RNA description – a more specific description of the transcript (e.g. miR-321).

  • Organism – the organism matching the RNA transcript.

  • Base mean count – the mean of the read counts of all the samples under the same condition normalized by each sample's size factor.

  • Fold change – the ratio between base means.

  • Log 2 fold change – the log 2 of the ratio between base means.

  • P-values – raw and adjusted with the Benjamini-Hochberg procedure.

    For more details regarding the DESeq analysis, please refer to the DESeq paper and manual.

  • Rfam link – a link to the Rfam page for the given transcript accession.

  • EMBL link – a link to the EMBL page for the given transcript accession.
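
The numeric columns above can be reproduced on a toy example. The Python sketch below assumes median-of-ratios size factors (the approach DESeq uses for normalization), takes the second condition as the numerator of the fold change (an assumption about direction), and implements the standard Benjamini-Hochberg adjustment. Sample names, counts and function names are hypothetical, and the sketch does not replace the DESeq analysis itself.

```python
import math
from statistics import mean, median


def size_factors(counts):
    """counts: dict sample -> list of raw counts in the same transcript order."""
    samples = list(counts)
    log_ratios = {s: [] for s in samples}
    for i in range(len(counts[samples[0]])):
        row = [counts[s][i] for s in samples]
        if min(row) == 0:
            continue  # transcripts with a zero count are skipped
        log_geo_mean = sum(math.log(c) for c in row) / len(row)
        for s in samples:
            log_ratios[s].append(math.log(counts[s][i]) - log_geo_mean)
    return {s: math.exp(median(log_ratios[s])) for s in samples}


def base_mean(counts, factors, samples, i):
    """Mean of the size-factor-normalized counts of transcript i."""
    return mean(counts[s][i] / factors[s] for s in samples)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw counts: four transcripts, one sample per condition
counts = {"normal_1": [10, 20, 30, 40], "tumor_1": [10, 20, 30, 80]}
factors = size_factors(counts)
mean_normal = base_mean(counts, factors, ["normal_1"], 3)
mean_tumor = base_mean(counts, factors, ["tumor_1"], 3)
fold_change = mean_tumor / mean_normal
log2_fold_change = math.log2(fold_change)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```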

Additional output files include:

- summary_report.txt – a text file summarizing all the RandA analysis output.

- tmpfile.txt – the input file for the DESeq analysis, holds all the counts for each of the transcripts in all conditions.

- DESeqRes.csv – the output table for the DESeq analysis.

- conditions.ini – a file describing each file and the condition it belongs to.

- DESeq.ini – the input file for the runDESeq.r script.

- alignment_files – a folder containing the alignment files in SAM format.

- graphs – a folder containing several plots describing the multiple mapping and size distribution for each sample file.

Command Line

 

RandA can also be run from the command line. The main Perl script running the tool is RandA_pipe.pl, which takes two parameters:

1) a config file containing all the parameters for the analysis (see example file – randa.config)

2) “cluster” or “no-cluster”, indicating whether the pipeline should be run on a computer cluster (via qsub)

If you wish to run RandA on a computer cluster, you must first change the path in the file RandA_pipe.pl to the correct pipeline path (in the CHANGE PATH TO PIPE section). After the path is changed, use the command line to run RandA.

Usage:

perl RandA_pipe.pl randa.config cluster/no-cluster
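
When RandA is launched from another script, the same command can be assembled programmatically. The Python sketch below only wraps the usage line above; the helper names are hypothetical, and it assumes that perl is on the PATH and that the script is started from the folder containing RandA_pipe.pl.

```python
import subprocess


def build_randa_command(config_path, use_cluster=False, script="RandA_pipe.pl"):
    """Return the argument list for the usage line shown above."""
    mode = "cluster" if use_cluster else "no-cluster"
    return ["perl", script, config_path, mode]


def run_randa(config_path, use_cluster=False):
    """Run the pipeline and raise an error if it exits with a non-zero status."""
    return subprocess.run(build_randa_command(config_path, use_cluster), check=True)
```

For example, run_randa("randa.config") is equivalent to typing perl RandA_pipe.pl randa.config no-cluster.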

Prerequisites

 

RandA requires these elements to be installed in order to work properly:

Perl and modules: Math::CDF  ;  Spreadsheet::WriteExcel  ;  Getopt::Long  ;  GD::Graph::bars

BWA   ;  fastq-mcf   ;  R   ;  DESeq   ;  Java v1.6+

Once RandA is downloaded and installed, these prerequisites can be automatically installed (except for Java) by the shell script install_prereqs.sh inside the RandA directory:

$>cd path-to-RandA
$path-to-RandA> sudo ./install_prereqs.sh

The Perl modules can be downloaded and installed using this command line:

$>cpan Math::CDF OLE::Storage_Lite Spreadsheet::WriteExcel Getopt::Long GD::Graph::bars
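
Before starting an analysis it can be useful to confirm that the external programs are reachable. The Python sketch below checks the PATH for a list of executables; the default names (perl, bwa, fastq-mcf, R, java) are assumptions about how the prerequisites are installed on a given system, and the function is not part of RandA.

```python
import shutil


def missing_executables(names=("perl", "bwa", "fastq-mcf", "R", "java")):
    """Return the executables from names that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]
```

An empty list means that every listed program was found. The Perl modules and the DESeq R package are not covered by this check.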

System Requirements

  • RandA can run on a Linux operating system, with 32-bit (without SEQ-EM) or 64-bit architecture. (Tested on Ubuntu 11.0 and Fedora 16)

  • 4GB or more of RAM is recommended when using the SEQ-EM option on Illumina sequence files (10,000,000 or more short reads).

Downloads

 

Source

Citing

 

1. Isakov,O., Ronen,R., Kovarsky,J., Gabay,A., Gan,I., Modai,S. and Shomron,N. (2012) Novel Insight into the Non-Coding Repertoire Through Deep Sequencing Analysis. Nucl. Acids Res., 10.1093/nar/gks228.

Help

 

For additional help and support, please contact RandAhelp@gmail.com