toil-rnaseq
with toilkit
nbayley
│ example_sample_key.txt
└───raw_fastqs
│ 24C-001_S4_L001_R1_001.fastq.gz
│ 24C-001_S4_L001_R2_001.fastq.gz
│ 24C-001_S4_L002_R1_001.fastq.gz
│ 24C-001_S4_L002_R2_001.fastq.gz
│ 24C-002_S5_L001_R1_001.fastq.gz
│ 24C-002_S5_L001_R2_001.fastq.gz
│ 24C-002_S5_L002_R1_001.fastq.gz
│ 24C-002_S5_L002_R2_001.fastq.gz
│ 24C-003_S6_L001_R1_001.fastq.gz
│ 24C-003_S6_L001_R2_001.fastq.gz
│ 24C-003_S6_L002_R1_001.fastq.gz
│ 24C-003_S6_L002_R2_001.fastq.gz
└
fastq-merge
can merge fastqs from multiple lanes corresponding to the same sample
While it is important to check for lane-specific biases in sequencing data, many tools require a single fastq file per read orientation (e.g. forward R1 and reverse R2 in paired-end sequencing).
# default split_char is "_", specified here for completeness
toilkit fastq-merge --indir nbayley/raw_fastqs/ --outdir nbayley/merged_fastqs/ --split_char _
24C-001
24C-002
24C-003
24C-004
If you are unsure how to parameterize a command, check the docstrings
# TODO: add defaults to docstring
toilkit fastq-merge --help
usage: toilkit fastq-merge [-h] [--indir [INDIR]] [--split_char [SPLIT_CHAR]]
[--outdir [OUTDIR]] [--write_method [WRITE_METHOD]]
optional arguments:
-h, --help show this help message and exit
--indir [INDIR] Directory containing .gz files
--split_char [SPLIT_CHAR] Character to split file names
--outdir [OUTDIR] Output directory
--write_method [WRITE_METHOD] The method of merging (p, i, or s)
nbayley
│ example_sample_key.txt
└───raw_fastqs
│ │ 24C-001_S4_L001_R1_001.fastq.gz
│ │ 24C-001_S4_L001_R2_001.fastq.gz
│ │ 24C-001_S4_L002_R1_001.fastq.gz
│ │ 24C-001_S4_L002_R2_001.fastq.gz
│ │ 24C-002_S5_L001_R1_001.fastq.gz
│ │ 24C-002_S5_L001_R2_001.fastq.gz
│ │ 24C-002_S5_L002_R1_001.fastq.gz
│ │ 24C-002_S5_L002_R2_001.fastq.gz
│ │ 24C-003_S6_L001_R1_001.fastq.gz
│ │ 24C-003_S6_L001_R2_001.fastq.gz
│ │ 24C-003_S6_L002_R1_001.fastq.gz
│ │ 24C-003_S6_L002_R2_001.fastq.gz
│ └
└───merged_fastqs
│ 24C-001_R1.fastq.gz
│ 24C-001_R2.fastq.gz
│ 24C-002_R1.fastq.gz
│ 24C-002_R2.fastq.gz
│ 24C-003_R1.fastq.gz
│ 24C-003_R2.fastq.gz
└
fastq-rename
can rename fastqs based on a tab-delimited table of original and new sample names
24C-001 patientA
24C-002 patientB
24C-003 patientC
fastq-rename
example call
# make sure the pattern is surrounded by single or double quotes, default is '*.fastq.gz'
toilkit fastq-rename --dir nbayley/merged_fastqs/ --pattern '*.fastq.gz' --keyname nbayley/example_sample_key.txt
Directory: nbayley/merged_fastqs/
Filename pattern: *.fastq.gz
Filename key: example_sample_key.txt
{'24C-001': 'patientA', '24C-002': 'patientB', '24C-003': 'patientC'}
nbayley
│ example_sample_key.txt
└───raw_fastqs
│ │ 24C-001_S4_L001_R1_001.fastq.gz
│ │ 24C-001_S4_L001_R2_001.fastq.gz
│ │ 24C-001_S4_L002_R1_001.fastq.gz
│ │ 24C-001_S4_L002_R2_001.fastq.gz
│ │ 24C-002_S5_L001_R1_001.fastq.gz
│ │ 24C-002_S5_L001_R2_001.fastq.gz
│ │ 24C-002_S5_L002_R1_001.fastq.gz
│ │ 24C-002_S5_L002_R2_001.fastq.gz
│ │ 24C-003_S6_L001_R1_001.fastq.gz
│ │ 24C-003_S6_L001_R2_001.fastq.gz
│ │ 24C-003_S6_L002_R1_001.fastq.gz
│ │ 24C-003_S6_L002_R2_001.fastq.gz
│ └
└───merged_fastqs
│ patientA_R1.fastq.gz
│ patientA_R2.fastq.gz
│ patientB_R1.fastq.gz
│ patientB_R2.fastq.gz
│ patientC_R1.fastq.gz
│ patientC_R2.fastq.gz
└
fastq-rename
assumes that sample names in the filename are separated by “_”As part of our pipeline we remove mouse reads from samples taken from mice (either direct or sphere-derived xenografts). To do this we use functions from the BBTools package. Sequencing reads are aligned to both mouse and human genomes and reads that uniquely map to mouse are removed, retaining human and ambiguously mapping reads.
nbayley
│ example_sample_key.txt
└───raw_fastqs
└───merged_fastqs
│ xenograftA_R1.fastq.gz
│ xenograftA_R2.fastq.gz
│ xenograftB_R1.fastq.gz
│ xenograftB_R2.fastq.gz
└
bbseal
runs the BBduk (adapter trimming) and Seal (dual alignment) commands and produces separate fastqs for human, mouse, unaligned reads. With the hard-coded Seal paramterization in our pipeline, ambiguously aligning reads are added to both the human and mouse fastq files.
# note: currently reference file names are hard-assigned in the source code,
# which are v31 human and vM22 mouse gtf files from GENCODE
toilkit bbseal --bbduk_dir /home/graeberlab/bbmap/ --ref_genome_dir /media/graeberlab/wdgold/nbayley/refs/
nbayley
│ example_sample_key.txt
└───raw_fastqs
└───merged_fastqs
│ xenograftA_R1.fastq.gz
│ xenograftA_R2.fastq.gz
│ xenograftB_R1.fastq.gz
│ xenograftB_R2.fastq.gz
└───xenograftA
│ │ xenograftA_unmapped_1.fq.gz
│ │ xenograftA_out.gencode.vM22.transcripts_1.fq.gz
│ │ xenograftA_out.gencode.vM22.transcripts_2.fq.gz
│ │ xenograftA_out.gencode.v31.transcripts_1.fq.gz
│ │ xenograftA_out.gencode.v31.transcripts_2.fq.gz
│ └ xenograftA_refstats.txt
└───xenograftB
bbcat
merges the human and ambiguous reads with the unmapped reads into one fastq (per read direction)
toilkit bbcat --dir nbayley/merged_fastqs/
nbayley
│ example_sample_key.txt
└───raw_fastqs
└───merged_fastqs
│ xenograftA_R1.fastq.gz
│ xenograftA_R2.fastq.gz
│ xenograftB_R1.fastq.gz
│ xenograftB_R2.fastq.gz
└───xenograftA
│ │ xenograftA_unmapped_1.fq.gz
│ │ xenograftA_out.gencode.vM22.transcripts_1.fq.gz
│ │ xenograftA_out.gencode.vM22.transcripts_2.fq.gz
│ │ xenograftA_out.gencode.v31.transcripts_1.fq.gz
│ │ xenograftA_out.gencode.v31.transcripts_2.fq.gz
│ │ xenograftA_human_ambig_umap_reads_R1.fq.gz
│ │ xenograftA_human_ambig_umap_reads_R1.fq.gz
│ └ xenograftA_refstats.txt
└───xenograftB
bbmetrics
collects the sequencing alignment metrics across multiple xenograft sequencing samples
# note: if you want spaces in the output filename prefix surround it in quotes
toilkit bbmetrics --dir nbayley/merged_fastqs/ --outfile "my example"
nbayley
│ example_sample_key.txt
└───raw_fastqs
└───merged_fastqs
│ xenograftA_R1.fastq.gz
│ xenograftA_R2.fastq.gz
│ xenograftB_R1.fastq.gz
│ xenograftB_R2.fastq.gz
│ my example_BBSeal_mouse_read_filtering_results.tsv
└───xenograftA
│ │ xenograftA_unmapped_1.fq.gz
│ │ xenograftA_out.gencode.vM22.transcripts_1.fq.gz
│ │ xenograftA_out.gencode.vM22.transcripts_2.fq.gz
│ │ xenograftA_out.gencode.v31.transcripts_1.fq.gz
│ │ xenograftA_out.gencode.v31.transcripts_2.fq.gz
│ │ xenograftA_human_ambig_umap_reads_R1.fq.gz
│ │ xenograftA_human_ambig_umap_reads_R1.fq.gz
│ └ xenograftA_refstats.txt
└───xenograftB
toil-rnaseq
manifestsOnce you have prepared the fastqs, move the analysis-ready fastqs to the same directory
nbayley
│ example_sample_key.txt
└───raw_fastqs
└───merged_fastqs
└───analysis_fastqs
│ patientA_R1.fastq.gz
│ patientA_R2.fastq.gz
│ patientB_R1.fastq.gz
│ patientB_R2.fastq.gz
│ xenograftA_human_ambig_umap_reads_R1.fq.gz
│ xenograftA_human_ambig_umap_reads_R2.fq.gz
│ xenograftB_human_ambig_umap_reads_R1.fq.gz
└ xenograftB_human_ambig_umap_reads_R2.fq.gz
make-manifest
creates a manifest file for input to toil-rnaseq
# note: this command assumes paired-end sequencing fastqs (R1 and R2)
toilkit make-manifest --dir /nbayley/ --tdir /nbayley/analysis_fastqs/ --suffix my-example.tsv --starting_num 0
nbayley
│ example_sample_key.txt
│ manifest-toil-rnaseq-my-example.tsv
└───raw_fastqs
└───merged_fastqs
└───analysis_fastqs
│ patientA_R1.fastq.gz
│ patientA_R2.fastq.gz
│ patientB_R1.fastq.gz
│ patientB_R2.fastq.gz
│ xenograftA_human_ambig_umap_reads_R1.fq.gz
│ xenograftA_human_ambig_umap_reads_R2.fq.gz
│ xenograftB_human_ambig_umap_reads_R1.fq.gz
└ xenograftB_human_ambig_umap_reads_R2.fq.gz
cut-manifest
splits a manifest file based on the desired number of samples per toil-rnaseq
run
# note: files we be ordered alphanumerically and split into groups based on that order
toilkit cut-manifest --manifest_file manifest-toil-rnaseq-my-example.tsv --split_num 2
nbayley
│ example_sample_key.txt
│ manifest-toil-rnaseq-my-example.tsv
│ manifest-toil-rnaseq-my-example-1.tsv
│ manifest-toil-rnaseq-my-example-2.tsv
└───raw_fastqs
└───merged_fastqs
└───analysis_fastqs
│ patientA_R1.fastq.gz
│ patientA_R2.fastq.gz
│ patientB_R1.fastq.gz
│ patientB_R2.fastq.gz
│ xenograftA_human_ambig_umap_reads_R1.fq.gz
│ xenograftA_human_ambig_umap_reads_R2.fq.gz
│ xenograftB_human_ambig_umap_reads_R1.fq.gz
└ xenograftB_human_ambig_umap_reads_R2.fq.gz
Assuming you have also prepared a config.yaml you are now ready to run toil-rnaseq run
. See the documentation for instructions on preparing config files and executing commands
One more thing is to create a sample name key for converting UUIDs used by toil-rnaseq
back to the sample names in the fastq filenames in the outputted results
# note: use the complete manifest file
toilkit manifest-key --infile nbayley/manifest-toil-rnaseq-my-example.tsv --outfile nbayley/example_UUID_key.txt
toil-rnaseq
outputsDepending on whether you enabled bamQC in the config.yaml for toil-rnaseq
, you may have .tar.gz outputs for one or more samples containing the QC and gene expression results with the prefix “FAIL_”
nbayley
│ example_UUID_key.txt
│ example_sample_key.txt
│ manifest-toil-rnaseq-my-example.tsv
│ manifest-toil-rnaseq-my-example-1.tsv
│ manifest-toil-rnaseq-my-example-2.tsv
└───raw_fastqs
└───merged_fastqs
└───analysis_fastqs
└───toil_output
│ UUID_0.tar.gz
│ UUID_1.tar.gz
│ FAIL.UUID_2.tar.gz
│ UUID_3.tar.gz
│ UUID_0.sortedByCoord.md.bam
│ UUID_1.sortedByCoord.md.bam
│ UUID_2.sortedByCoord.md.bam
└ UUID_3.sortedByCoord.md.bam
toil-fix
removes the FAIL prefix from outputs (don’t worry we will pull out the relevant QC data related to the FAIL prefix later!)
toilkit toil-fix --indir nbayley/toil_output/
nbayley
│ example_UUID_key.txt
│ example_sample_key.txt
│ manifest-toil-rnaseq-my-example.tsv
│ manifest-toil-rnaseq-my-example-1.tsv
│ manifest-toil-rnaseq-my-example-2.tsv
└───raw_fastqs
└───merged_fastqs
└───analysis_fastqs
└───toil_output
│ UUID_0.tar.gz
│ UUID_1.tar.gz
│ UUID_2.tar.gz
│ UUID_3.tar.gz
│ UUID_0.sortedByCoord.md.bam
│ UUID_1.sortedByCoord.md.bam
│ UUID_2.sortedByCoord.md.bam
│ UUID_3.sortedByCoord.md.bam
└───renamed
│ FAIL.UUID_3.tar.gz
└
toil-combine
is the primary command for extracting all relevant information from the .tar.gz and compiling the data across samples. By default this command will extract and collate FastQC and STAR alignment results, bamQC results, RSEM gene/isoform expression results, and STAR junction alignment results. Check the docstrings to see how to omit specific results.
# note: currently this command will output results in the current working directory. Let's assume we are in the directory *../nbayley/*
toilkit toil-combine --prefix my_example --anno_filename example_UUID_key.txt --indir toil_output/ --star_output junctions/
nbayley
│ example_UUID_key.txt
│ example_sample_key.txt
│ manifest-toil-rnaseq-my-example.tsv
│ manifest-toil-rnaseq-my-example-1.tsv
│ manifest-toil-rnaseq-my-example-2.tsv
│ my_example_rsem_genes_raw_counts.txt
│ my_example_rsem_genes_tpm_counts.txt
│ my_example_rsem_transcripts_hugo_raw_counts.txt
│ my_example_rsem_transcripts_hugo_tpm_counts.txt
│ my_example_rsem_transcripts_raw_counts.txt
│ my_example_rsem_transcripts_raw_counts.txt
│ my_example_toil-rnaseq_qc_data.txt
└───raw_fastqs
└───merged_fastqs
└───analysis_fastqs
└───toil_output
The last thing to do is rename the toil-rnaseq
output files with UUIDs back to our sample names using toil-rename
. Currently this command must come after toil-combine
because it expects UUID outputs
# note: currently this command operates in the current working directory. Let's assume we `cd` into *../nbayley/toil_output/*
toilkit toil-rename --infile ../example_UUID_key.txt
If you accidentally rename the output files before toil-combine
you can easily reverse the renaming
toilkit toil-rename --infile ../example_UUID_key.txt --direction 2
Assuming all went well you now have all the QC, gene expression, and splice junction data extracted from the .tar.gz outputs and output filenames that correspond to biological samples :D
nbayley
│ example_UUID_key.txt
│ example_sample_key.txt
│ manifest-toil-rnaseq-my-example.tsv
│ manifest-toil-rnaseq-my-example-1.tsv
│ manifest-toil-rnaseq-my-example-2.tsv
│ my_example_rsem_genes_raw_counts.txt
│ my_example_rsem_genes_tpm_counts.txt
│ my_example_rsem_transcripts_hugo_raw_counts.txt
│ my_example_rsem_transcripts_hugo_tpm_counts.txt
│ my_example_rsem_transcripts_raw_counts.txt
│ my_example_rsem_transcripts_raw_counts.txt
│ my_example_toil-rnaseq_qc_data.txt
└───raw_fastqs
└───merged_fastqs
└───analysis_fastqs
└───toil_output
│ patientA.tar.gz
│ patientB.tar.gz
│ xenograftA.tar.gz
│ xenograftB.tar.gz
│ patientA.sortedByCoord.md.bam
│ patientB.sortedByCoord.md.bam
│ xenograftA.sortedByCoord.md.bam
│ xenograftB.sortedByCoord.md.bam
└───renamed
│ FAIL.UUID_3.tar.gz
└
toil-rnaseq
command