Optional: Our current trimmomatic setup - the over-represented sequences from FastQC are assumed to be adapter sequences and thus fed to Trimmomatic.
java -jar /tigress/BEE/gtex/tools/Trimmomatic-0.32/trimmomatic-0.32.jar PE \
-threads 1 -phred33 \
${SCRATCH}/${SAMPLE}_1.fastq \
${SCRATCH}/${SAMPLE}_2.fastq \
/tigress/BEE/gtex/data/phenotype/expression/trimmed_rna_seq_reads/${SAMPLE}_1.trimmed.P.fastq \
${SCRATCH}/${SAMPLE}_1.trimmed.U.fastq \
/tigress/BEE/gtex/data/phenotype/expression/trimmed_rna_seq_reads/${SAMPLE}_2.trimmed.P.fastq \
${SCRATCH}/${SAMPLE}_2.trimmed.U.fastq \
ILLUMINACLIP:/tigress/BEE/gtex/data/auxiliary/adapters/${SAMPLE}_custom_adapters.fasta:2:30:15 \
LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:30 \
2>/tigress/BEE/gtex/data/job_logs/trim/trim_err/${SAMPLE}.trim.err.log.txt
Each of the parameters above can be tweaked to what makes the most sense for the dataset.
Now that we have prepared the reads (still in *.fastq format), and that we have a pretty well-annotated genome/transcriptome in the case of human (we don't cover the case of de novo alignment here), there are two different routes that we can take:
1. Align reads to the reference using a read aligner, and then quantify the genes/transcripts.
2. Use Pseudo-alignments using hash tables built from reference to directly quantify transcripts.
In the case of 1, we can align reads either using splice-aware transcriptome aligners such as Bowtie2 and STAR (2-pass), or simply genome aligners such as TopHat2, and then we'll have the output in *.sam/*.bam format. Then, we can quantify the genes/transcripts using tools such as RSEM (for STAR 2-pass), eXpress (for Bowtie2), or Cufflinks (STAR and TopHat2), which will yield counts of genes/transcripts depending on the tool used.
STAR paper:
RSEM paper:
However, we run into multiple problems, which are not unique to the alignment scenario:
Problem:
The transcript annotation may not include all the splice junctions.
A Solution:
We use STAR 2-pass implementation, which discovers novel splice junctions from the read alignment in the first pass, and uses this updated annotation to align the reads.
Optional: Initial STAR genomeGenerate, and the same function applied after first pass to include novel splice junctions
/tigress/BEE/tools/STAR-STAR_2.4.2a/bin/Linux_x86_64/STAR \
--runMode genomeGenerate \
--sjdbGTFfile /tigress/BEE/gtex/external_sources/hg19/gencode.v19.annotation.gtf \
--sjdbOverhang 75 \
--genomeDir $STAR_genomeDir \
--genomeFastaFiles $genome_fa \
--runThreadN 32
/tigress/BEE/tools/STAR-STAR_2.4.2a/bin/Linux_x86_64/STAR \
--runMode genomeGenerate \
--genomeDir $STAR_genomeDir \
--genomeFastaFiles $genome_fa \
--sjdbGTFfile /tigress/BEE/gtex/external_sources/hg19/gencode.v19.annotation.gtf \
--sjdbFileChrStartEnd /tigress/BEE/gtex/data/auxiliary/STAR_1pass_hg19/SJ.out.tab.Pass1.conservative.sjdb \
--sjdbOverhang 75
In the example of GTEx, we have the following stats: