Microbial RNA-seq#
Summary of methods#
- Index generated with RSEM
- Filter and trim reads with fastp
- Quality checking of reads with FastQC
- Reads mapped to reference and quantified with RSEM
- Alignment post-processing and QC with Picard
- Aggregation of QC tables using MultiQC
Parameters#
Required parameters:#
- --pubdir
  - Default: /<PATH>
  - Description: The directory where the saved outputs will be stored.
- -w
  - Default: /<PATH>
  - Description: The directory used for all intermediary files and Nextflow processes. This directory can become quite large, so it should be on scratch space or another location with ample storage.
- --sample_folder
  - Default: /<PATH>
  - Description: The path to the folder that contains all the samples to be run by the pipeline. The files in this path can also be symbolic links.
- --strandedness
  - Default: NA
  - Supported options: reverse_stranded, forward_stranded, non_stranded
- --concat_lanes
  - Default: false
  - Options: false and true. When enabled, FASTQ files are concatenated by sample. Use this when the reads for a sample are divided across individual sequencing lanes.
- --fasta
  - Default: /<PATH>
  - Description: Path to the reference genome in FASTA format.
- --gff
  - Default: /<PATH>
  - Description: Path to the annotation for the reference genome in GFF3 format.
- --read_type
  - Default: PE
  - Description: Type of reads: paired-end (PE) or single-end (SE).
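
The constraints on the required parameters can be checked before launching a run. The following is a minimal sketch, not part of the pipeline: it assumes that the NA default for --strandedness must be replaced by one of the supported options, and the entry script name `main.nf` is a hypothetical placeholder for whatever script or repository is actually passed to `nextflow run`.

```python
import shlex

ALLOWED_STRANDEDNESS = {"reverse_stranded", "forward_stranded", "non_stranded"}
ALLOWED_READ_TYPES = {"PE", "SE"}
REQUIRED_PATHS = ("pubdir", "sample_folder", "fasta", "gff")


def build_command(params, work_dir, entry_script="main.nf"):
    """Validate the documented parameters and return a nextflow argv list.

    entry_script is a hypothetical placeholder, not the pipeline's real path.
    """
    for key in REQUIRED_PATHS:
        if not params.get(key):
            raise ValueError(f"missing required parameter: --{key}")
    # Assumption: the NA default is not accepted and must be overridden.
    if params.get("strandedness") not in ALLOWED_STRANDEDNESS:
        allowed = ", ".join(sorted(ALLOWED_STRANDEDNESS))
        raise ValueError(f"--strandedness must be one of: {allowed}")
    if params.get("read_type", "PE") not in ALLOWED_READ_TYPES:
        raise ValueError("--read_type must be PE or SE")

    # -w is Nextflow's own work-directory option, so it uses a single dash.
    argv = ["nextflow", "run", entry_script, "-w", work_dir]
    for key, value in params.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        argv += [f"--{key}", str(value)]
    return argv


if __name__ == "__main__":
    example = {
        "pubdir": "/path/to/results",
        "sample_folder": "/path/to/fastq",
        "strandedness": "reverse_stranded",
        "concat_lanes": False,
        "fasta": "/path/to/reference.fasta",
        "gff": "/path/to/reference.gff3",
        "read_type": "PE",
    }
    print(shlex.join(build_command(example, "/path/to/scratch/work")))
```

All paths in the example are placeholders. Building the command as a list keeps each argument separate, so paths containing spaces are not split by the shell.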
fastp filtering parameters:#
- --quality_phred
  - Default: 15
  - Description: Phred quality score threshold for a base to count as qualified.
- --unqualified_perc
  - Default: 40
  - Description: Maximum percentage of unqualified bases a read may contain and still pass filtering.
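
The way the two thresholds interact can be illustrated with a short sketch. This is a simplified illustration of the filtering rule, not fastp's implementation; it assumes that --quality_phred and --unqualified_perc correspond to fastp's qualified-quality and unqualified-percent limits, and that quality strings use Phred+33 encoding.

```python
def decode_phred33(quality_string):
    """Convert a FASTQ quality string (Phred+33) to integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def passes_quality_filter(phred_scores, quality_phred=15, unqualified_perc=40):
    """Return True if the read is kept.

    A base is unqualified when its score is below quality_phred. The read
    is discarded when the percentage of unqualified bases exceeds
    unqualified_perc.
    """
    if not phred_scores:
        return False
    unqualified = sum(1 for q in phred_scores if q < quality_phred)
    return 100.0 * unqualified / len(phred_scores) <= unqualified_perc


if __name__ == "__main__":
    # "I" encodes Q40 and "#" encodes Q2 in Phred+33.
    for quality in ("IIIIIIII##", "IIIII#####"):
        print(quality, passes_quality_filter(decode_phred33(quality)))
```

With the defaults, a read in which 20% of bases fall below Q15 is kept, while a read in which 50% fall below Q15 is discarded.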
Pipeline Default Outputs#
Naming Convention | Description |
---|---|
rsem.merged.gene_counts.tsv | RSEM-generated gene-level raw counts merged across all samples |
rsem.merged.gene_tpm.tsv | RSEM-generated gene-level TPM values merged across all samples |
rsem.merged.isoform_counts.tsv | RSEM-generated isoform-level raw counts merged across all samples |
rsem.merged.isoform_tpm.tsv | RSEM-generated isoform-level TPM values merged across all samples |
microbial_rnaseq_report.html | Nextflow autogenerated report |
multiqc/ | MultiQC report summarizing quality metrics across all samples in the run |
${sampleID}/bam/${sampleID}.genome.bam | RSEM-generated alignment of reads to the reference genome |
${sampleID}/bam/${sampleID}.transcript.bam | RSEM-generated alignment of reads to the reference transcriptome |
${sampleID}/${sampleID}.genes.results | RSEM-generated quantification of gene-level abundances |
${sampleID}/${sampleID}.isoforms.results | RSEM-generated quantification of transcript-level abundances |
trace.txt | Nextflow trace of processes |
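
The merged count tables can be loaded for downstream analysis with the standard library alone. The sketch below assumes a tab-separated layout with a `gene_id` column, an optional `transcript_id(s)` column, and one column per sample; these column names are assumptions and should be checked against the header of the actual file.

```python
import csv


def parse_merged_counts(lines, id_column="gene_id",
                        skip_columns=("transcript_id(s)",)):
    """Parse a merged RSEM table from an iterable of text lines.

    Returns (sample_names, counts) where counts maps each gene ID to a
    dict of sample name -> numeric value. Column names are assumptions.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    samples = [
        name for name in reader.fieldnames
        if name != id_column and name not in skip_columns
    ]
    counts = {}
    for row in reader:
        counts[row[id_column]] = {s: float(row[s]) for s in samples}
    return samples, counts


if __name__ == "__main__":
    with open("rsem.merged.gene_counts.tsv", newline="") as handle:
        samples, counts = parse_merged_counts(handle)
    print(len(counts), "genes across", len(samples), "samples")
```

Values are parsed as floats because RSEM reports expected counts, which need not be integers.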
Workflow Validation#
The genome assembly and annotation for Enterococcus faecalis V583 were downloaded from NCBI and used to generate simulated RNA-seq reads with the simReads function in the R package Rsubread. To test the performance of the workflow, we generated three sets of "control" paired-end reads, in which the TPM values were assigned by random sampling, and three sets of "case" reads, in which 50 genes were randomly selected to have substantially higher TPM values.
We then ran the workflow using the E. faecalis reference to generate indices, and compared the raw count values quantified by RSEM with the "true" count values expected from the simulated reads generated by Rsubread.
In the resulting comparison, the dashed line represents a 1:1 reference. The deviation of the counts quantified by RSEM from the "true" counts is due to the quality filtering of the read data, as the reads were simulated to include sequencing error.
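
A comparison of this kind can be summarized numerically as well as graphically. The following sketch is one possible way to do so and is not the analysis used for the validation: it computes the Pearson correlation of log2(count + 1) values and the overall fraction of true counts recovered, for genes present in both inputs. The function name and the choice of summary statistics are illustrative assumptions.

```python
import math


def compare_to_truth(true_counts, estimated_counts):
    """Compare estimated counts with known true counts.

    Both arguments map gene ID -> count. Returns (pearson_r, recovered),
    where pearson_r is computed on log2(count + 1) and recovered is the
    sum of estimated counts divided by the sum of true counts.
    """
    genes = sorted(set(true_counts) & set(estimated_counts))
    if len(genes) < 2:
        raise ValueError("need at least two shared genes")
    x = [math.log2(true_counts[g] + 1) for g in genes]
    y = [math.log2(estimated_counts[g] + 1) for g in genes]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("counts have zero variance")
    pearson_r = cov / math.sqrt(var_x * var_y)
    total_true = sum(true_counts[g] for g in genes)
    if total_true == 0:
        raise ValueError("true counts sum to zero")
    recovered = sum(estimated_counts[g] for g in genes) / total_true
    return pearson_r, recovered
```

A recovered fraction below 1 with a correlation close to 1 would be consistent with points lying parallel to, but below, the 1:1 line, as expected when quality filtering removes a share of the simulated reads.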
We then performed differential expression analyses on both the true counts and the counts estimated by RSEM in the workflow.
Again, the dashed line is a 1:1 reference. The red points are those that would be scored as "significant" under standard thresholds of |log2 fold-change| > 2 and FDR-adjusted P-values < 0.01.
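
The significance rule stated above can be written as a small predicate, which also makes it easy to tally how often the two analyses agree. This is a minimal sketch; the function names and the input layout (gene ID mapped to a pair of log2 fold-change and adjusted P-value) are assumptions made for illustration.

```python
def is_significant(log2_fold_change, adjusted_p,
                   lfc_threshold=2.0, fdr_threshold=0.01):
    """Apply |log2 fold-change| > 2 and FDR-adjusted P-value < 0.01."""
    return abs(log2_fold_change) > lfc_threshold and adjusted_p < fdr_threshold


def compare_calls(true_results, estimated_results):
    """Tally agreement between two differential expression results.

    Each argument maps gene ID -> (log2 fold-change, adjusted P-value).
    Only genes present in both inputs are counted.
    """
    tally = {"both": 0, "true_only": 0, "estimated_only": 0, "neither": 0}
    for gene in set(true_results) & set(estimated_results):
        in_true = is_significant(*true_results[gene])
        in_estimated = is_significant(*estimated_results[gene])
        if in_true and in_estimated:
            tally["both"] += 1
        elif in_true:
            tally["true_only"] += 1
        elif in_estimated:
            tally["estimated_only"] += 1
        else:
            tally["neither"] += 1
    return tally
```

Both comparisons are strict, so a gene with a log2 fold-change of exactly 2 or an adjusted P-value of exactly 0.01 is not counted as significant.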