RNASeq analysis walk-through

Tutorial

This wiki will guide you through RNAseq analysis, starting from quality checking through obtaining the differential gene expression results. The next part of the wiki series will guide you through some of the downstream analyses you can perform on the results obtained here. Here is an overview of the RNAseq analysis covered in this tutorial.

Overview

Figure 1: Overview of the RNAseq workflow

Experimental design

This experiment consists of 3 conditions. The first condition is "control": mock-infected soybean plants. The second and third conditions are infections with 2 different pathogens. All 3 conditions have three replicates each, for a total of 9 pairs of fastq files: 3 pairs belonging to control, 3 pairs belonging to condition 1 (infected with pathogen 1), and 3 pairs belonging to condition 2 (infected with pathogen 2). Pairs, because the data is paired-end.

            Replicate 1                Replicate 2                Replicate 3
Control     Control.A_R1.fastq.gz      Control.B_R1.fastq.gz      Control.C_R1.fastq.gz
            Control.A_R2.fastq.gz      Control.B_R2.fastq.gz      Control.C_R2.fastq.gz
Infected 1  Infected1.A_R1.fastq.gz    Infected1.B_R1.fastq.gz    Infected1.C_R1.fastq.gz
            Infected1.A_R2.fastq.gz    Infected1.B_R2.fastq.gz    Infected1.C_R2.fastq.gz
Infected 2  Infected2.A_R1.fastq.gz    Infected2.B_R1.fastq.gz    Infected2.C_R1.fastq.gz
            Infected2.A_R2.fastq.gz    Infected2.B_R2.fastq.gz    Infected2.C_R2.fastq.gz

1. Download the data

If the data is hosted somewhere, you can download it using the wget or curl commands. If not, you might have to upload the data to the HPC using scp or rsync (if the data is located locally on your computer), or use Globus to get the data from another machine. Here we will assume that the data is at our DNA facility (at Iowa State University) and that you have access to those files. We will use the wget command to download them.
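If the data were instead on your local computer, the upload to the HPC could look something like this (a sketch only; the hostname and paths are placeholders):

  # copy a local directory of fastq files to the HPC using rsync (resumable, shows progress)
  rsync -avP ./raw_fastq/ username@hpc.example.edu:/work/project/raw_fastq/
  # or copy individual files with scp
  scp ./raw_fastq/*.fastq.gz username@hpc.example.edu:/work/project/raw_fastq/

In our case, the files are hosted at the DNA facility, so we download them with wget: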

  wget http://upart.biotech.iastate.edu/pub/5_NNNN_01_1_HCC22_956.tar
  wget http://upart.biotech.iastate.edu/pub/5_NNNN_01_2_HCC22_956.tar
  wget http://upart.biotech.iastate.edu/pub/5_NNNN_01_3_HCC22_956.tar
  wget http://upart.biotech.iastate.edu/pub/5_NNNN_01_4_HCC22_956.tar
  wget http://upart.biotech.iastate.edu/pub/5_NNNN_01_5_HCC22_956.tar
  wget http://upart.biotech.iastate.edu/pub/5_NNNN_01_6_HCC22_956.tar
  wget http://upart.biotech.iastate.edu/pub/5_NNNN_01_7_HCC22_956.tar
  wget http://upart.biotech.iastate.edu/pub/5_NNNN_01_8_HCC22_956.tar
  wget http://upart.biotech.iastate.edu/pub/5_NNNN_01_9_HCC22_956.tar

Note that these web links are inactive, so actually running these commands will fail; they don't point to any files. Once downloaded, you can untar the archives and you will find the fastq files. To untar:

  module load parallel
  parallel "tar xf {}" ::: *.tar

Here we load parallel and then run tar efficiently, in parallel, on all the tar files. The other option is to run it in a for loop (see the sketch below), which takes considerably more time as it untars one file at a time. After this step, you will have gzipped fastq files.
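For reference, the serial alternative with a for loop would be:

  # untars one archive at a time; slower, but needs no extra module
  for tarfile in *.tar; do
    tar xf ${tarfile}
  done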

We will also need the genome file and the associated GTF/GFF file for this wiki. Since the data is for soybean, we will download them directly from the Phytozome website (this requires logging in and selecting the files). Specifically, you will need:

Gmax_275_Wm82.a2.v1.gene.gff3.gz 
Gmax_275_v2.0.fa.gz

Figure 2: Files needed for the RNAseq tutorial: the genome assembly (unmasked) from the "assembly" directory and the GFF3 from the "annotation" directory

  # decompress files
  gunzip Gmax_275_Wm82.a2.v1.gene.gff3.gz
  gunzip Gmax_275_v2.0.fa.gz

2. Quality Check

For this we will use fastqc, a tool that provides a simple way to do quality-control checks on raw sequence data coming from high-throughput sequencing pipelines (link). It provides various metrics to give an indication of how good your data is. A high-quality Illumina RNAseq file should look something like this. Since there are 9 sets of files (18 files total) and we need to run fastqc on each one of them, you can either use a for loop or the parallel command. We need to submit it as a job to the cluster, but the command should contain:

  module load fastqc
  module load parallel
  parallel "fastqc {}" ::: *.fastq.gz

You will find .html files once the job is complete. You can open them using the firefox browser on the HPC (see the guide here), or download them to view in your local browser (see the example transfer below). The main metrics to check are:

  * Per base sequence quality
  * Adapter content
  * Per base N content
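For downloading the reports, something along these lines works from your local machine (the hostname and paths are placeholders):

  # run on your local computer, not on the HPC
  mkdir -p fastqc_reports
  scp username@hpc.example.edu:/path/to/fastq_dir/*_fastqc.html ./fastqc_reports/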

Once you are happy with the results, proceed with the mapping part.

3. Mapping reads to the genome

There are several mapping programs available for aligning RNAseq reads back to the genome. Generic aligners such as BWA, BowTie2, BBMap, etc., are not suitable for mapping RNAseq reads because they are not splice-aware. RNAseq reads come from mRNA and contain only exonic regions, so mapping them back to the genome requires splitting individual reads across introns, which only splice-aware mappers can do. For this tutorial, we will use HiSat2 (a derivative of BowTie2), the STAR aligner, and GSNAP.

Note: you don't have to run all three mapping programs; use any one of the methods below.

Option A: Use HiSat2 for mapping

For HiSat2 mapping, you need to first index the genome and then map the read pairs to the indexed genome (one set at a time). For indexing the genome, HiSat2 is packaged with the hisat2-build script. Building the index is as follows:

  hisat2-build -p 16 Gmax_275_v2.0.fa Gmax_275_v2.0_hisat2
  # -p for number of processors
  # first argument is the fasta file
  # second argument is the base name for indexed files

Once complete, you should see a number of files with the .ht2 extension. These are our index files.
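hisat2-build typically writes eight index parts, so a quick listing should show something like this:

  ls Gmax_275_v2.0_hisat2*
  # Gmax_275_v2.0_hisat2.1.ht2  Gmax_275_v2.0_hisat2.2.ht2  ...  Gmax_275_v2.0_hisat2.8.ht2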

For mapping each set of reads (forward and reverse, or R1 and R2), we will set up a run script. This script can also be found on our GitHub page.

  #!/bin/bash
  module load hisat2
  module load samtools
  DBDIR="/path/containing/HiSat2_index_files"
  GENOME="Gmax_275_v2.0_hisat2"
  # number of processors to use
  p=16
  R1_FQ="$1"
  R2_FQ="$2"
  # output file prefix (uses a portion of the fastq filename, but can be changed if these are not unique)
  OUTPUT=$(basename ${R1_FQ} | cut -f 1 -d "_")
  # run the HiSat2 aligner
  hisat2 \
    -p ${p} \
    -x ${DBDIR}/${GENOME} \
    -1 ${R1_FQ} \
    -2 ${R2_FQ} \
    -S ${OUTPUT}.sam &> ${OUTPUT}.log
  # convert the SAM file to BAM format
  samtools view \
    --threads 16 \
    -b \
    -o ${OUTPUT}.bam \
    ${OUTPUT}.sam
  # sort the BAM file by coordinate
  samtools sort \
    -m 7G \
    -o ${OUTPUT}_sorted.bam \
    -T ${OUTPUT}_temp \
    --threads 16 \
    ${OUTPUT}.bam

To set it up to run with each set of files, we will use this loop:

  for fastq in *R1.fastq.gz; do
    fastq2=$(echo $fastq | sed 's/R1.fastq.gz/R2.fastq.gz/g');
    echo "./runHISAT2.sh ${fastq} ${fastq2}";
  done > hisat2.cmds
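The resulting hisat2.cmds file should contain one line per sample, for example:

  ./runHISAT2.sh Control.A_R1.fastq.gz Control.A_R2.fastq.gz
  ./runHISAT2.sh Control.B_R1.fastq.gz Control.B_R2.fastq.gz
  ...
  ./runHISAT2.sh Infected2.C_R1.fastq.gz Infected2.C_R2.fastq.gz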

For creating PBS/Slurm submission scripts, use either makePBSs.py or makeSLURMs.py from the common_scripts directory on GitHub. Submit jobs using the for loop.

  makeSLURMs.py 1 hisat2.cmds
  for sub in hisat2_*.sub; do
    qsub $sub;
  done

This should create the following files as output:

Control.A_sorted.bam
Control.B_sorted.bam
Control.C_sorted.bam
Infected1.A_sorted.bam
Infected1.B_sorted.bam
Infected1.C_sorted.bam
Infected2.A_sorted.bam
Infected2.B_sorted.bam
Infected2.C_sorted.bam

Option B: Use STAR for mapping

Again, we need to index the genome first:

  #!/bin/bash
  module load star
  FASTA="Gmax_275_v2.0.fa"
  GFF="Gmax_275_Wm82.a2.v1.gene.gff3"
  DB=$(pwd)
  mkdir -p ${DB}/${FASTA%.*}_star
  STAR \
    --runMode genomeGenerate \
    --runThreadN 16 \
    --genomeDir ${DB}/${FASTA%.*}_star \
    --genomeFastaFiles ${FASTA} \
    --sjdbGTFfile ${GFF} \
    --sjdbGTFtagExonParentTranscript Parent \
    --sjdbOverhang 99
  # --sjdbGTFtagExonParentTranscript Parent is needed because the annotation is GFF3 (STAR assumes GTF by default)

Once this completes, you should see a folder named Gmax_275_v2.0_star with lots of files in it. This is our indexed genome.

For mapping, we will set up a run script, like we did for HiSat2. You can also find the run script on our GitHub.

  #!/bin/bash
  module load star
  R1="$1"
  R2="$2"
  OUT=$(basename ${R1} | cut -f 1 -d "_")
  GFF="/home/arnstrm/arnstrm/GMAPDB/Gmax_275_Wm82.a2.v1.gene.gff3"
  DB="/home/arnstrm/arnstrm/GMAPDB/Gmax_275_v2.0_star"
  STAR \
    --runMode alignReads \
    --runThreadN 16 \
    --genomeDir ${DB} \
    --readFilesCommand zcat \
    --outFileNamePrefix ${OUT}_star \
    --readFilesIn ${R1} ${R2}

To set it up to run with each set of files, we will use this loop:

  for fastq in *R1.fastq.gz; do
    fastq2=$(echo $fastq | sed 's/R1.fastq.gz/R2.fastq.gz/g');
    echo "./runSTAR.sh ${fastq} ${fastq2}";
  done > star.cmds

For creating PBS/Slurm submission scripts, use either makePBSs.py or makeSLURMs.py from the common_scripts directory on GitHub. Submit jobs using the for loop.

  makeSLURMs.py 1 star.cmds
  for sub in star_*.sub; do
    qsub $sub;
  done

This should create the following files as output:

Control.A_star.sam
Control.B_star.sam
Control.C_star.sam
Infected1.A_star.sam
Infected1.B_star.sam
Infected1.C_star.sam
Infected2.A_star.sam
Infected2.B_star.sam
Infected2.C_star.sam
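Strictly speaking, STAR appends Aligned.out to the prefix given with --outFileNamePrefix, so the files on disk will be named like Control.A_starAligned.out.sam. If you prefer the shorter names listed above, a rename along these lines works:

  # strip STAR's Aligned.out suffix from the alignment files
  for f in *_starAligned.out.sam; do
    mv "$f" "${f%Aligned.out.sam}.sam"
  done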

Option C: Use GSNAP for mapping

The final alternative is to use the GSNAP aligner. For indexing, use the script below:

  #!/bin/bash
  module load gmap-gsnap
  FASTA="Gmax_275_v2.0.fa"
  DB="/work/GIF/arnstrm/GENOMEDB"
  gmap_build -d ${FASTA%.*}_gsnap -D ${DB} ${FASTA}

A directory named Gmax_275_v2.0_gsnap, with lots of files in it, will be created once the indexing is complete. Next, we will set up a run script for GSNAP (as found here):

  #!/bin/bash
  module load gmap-gsnap
  GMAPDB="/work/GIF/arnstrm/GENOMEDB"
  DB_NAME="Gmax_275_v2.0_gsnap"
  R1="$1"
  R2="$2"
  OUTFILE=$(basename ${R1} | cut -f 1 -d "_")
  gsnap \
    -d ${DB_NAME} \
    -D ${GMAPDB} \
    -t 16 \
    -B 5 \
    -N 1 \
    -m 5 \
    --gunzip \
    --fails-as-input \
    --input-buffer-size=10000000 \
    --output-buffer-size=10000000 \
    -A sam ${R1} ${R2} > ${OUTFILE}_gsnap.sam

To set it up to run with each set of files, we will use this loop:

  for fastq in *R1.fastq.gz; do
    fastq2=$(echo $fastq | sed 's/R1.fastq.gz/R2.fastq.gz/g');
    echo "./runGSNAP.sh ${fastq} ${fastq2}";
  done > gsnap.cmds

For creating PBS/Slurm submission scripts, use either makePBSs.py or makeSLURMs.py from the common_scripts directory on GitHub. Submit jobs using the for loop.

  makeSLURMs.py 1 gsnap.cmds
  for sub in gsnap_*.sub; do
    qsub $sub;
  done

This should create the following files as output:

Control.A_gsnap.sam
Control.B_gsnap.sam
Control.C_gsnap.sam
Infected1.A_gsnap.sam
Infected1.B_gsnap.sam
Infected1.C_gsnap.sam
Infected2.A_gsnap.sam
Infected2.B_gsnap.sam
Infected2.C_gsnap.sam

Once we have the SAM or BAM files generated (from any of the 3 methods above), we can proceed with the abundance estimation. If you chose STAR or GSNAP you will have SAM files; in case a downstream tool wants coordinate-sorted BAM instead (as produced in the HiSat2 option), the sketch below shows the conversion.
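A minimal conversion sketch with samtools (assuming a samtools version that reads SAM directly, as used elsewhere in this wiki):

  module load samtools
  # sort each SAM and write coordinate-sorted BAM (format inferred from the .bam extension)
  for sam in *.sam; do
    samtools sort -o ${sam%.*}_sorted.bam --threads 16 ${sam}
  done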

4. Abundance estimation

There are many programs we can use for quantifying transcript abundance from RNA-seq data. The two most popular tools are HTSeq and featureCounts. We will need a file with aligned sequencing reads (the SAM/BAM files generated in the previous step) and a list of genomic features (the downloaded GFF file).

Note: you don't have to run both abundance estimation methods; use any one.

Option A: HTSeq

As we need to process one SAM/BAM file at a time, we will set up a run script as follows:

  #!/bin/bash
  GFF="Gmax_275_Wm82.a2.v1.gene.gff3"
  INFILE="$1"
  OUTFILE=$(echo $INFILE | cut -f 1 -d "_")_counts.txt
  htseq-count \
    -m intersection-nonempty \
    -t gene \
    -i ID \
    $INFILE $GFF > $OUTFILE

To set it up to run on each SAM file, we will use this loop:

  for sam in *.sam; do
    echo "./runHTSEQ.sh ${sam}";
  done > htseq.cmds

For creating PBS/Slurm submission scripts, use either makePBSs.py or makeSLURMs.py from the common_scripts directory on GitHub. Submit jobs using the for loop.

  makeSLURMs.py 9 htseq.cmds
  # as these commands run quickly, we will put them all in one sub file
  qsub htseq_0.sub

You will have a text file with counts for every SAM/BAM file you provide. Next, we need to merge the individual files into a single file. We can do this using awk as follows:

  awk '{arr[$1]=arr[$1]"\t"$2}END{for(i in arr)print i,arr[i]}' *_counts.txt >> merged_htseq_counts.tsv
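One caveat: awk's for (i in arr) iterates in no particular order, so the merged rows will not be sorted (the count columns follow the shell's alphabetical expansion of *_counts.txt). If you want the genes in a deterministic order, sort the merged file:

  sort -k1,1 merged_htseq_counts.tsv > merged_htseq_counts_sorted.tsv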

Option B: featureCounts

featureCounts is a highly efficient general-purpose read summarization program that counts mapped reads for genomic features such as genes, exons, promoters, gene bodies, genomic bins, and chromosomal locations. featureCounts takes as input SAM/BAM files and an annotation file including chromosomal coordinates of features. It outputs the numbers of reads assigned to features (or meta-features). It also outputs stats for the overall summarization results, including the number of successfully assigned reads and the number of reads that failed to be assigned for various reasons (these reasons are included in the stats). We can run this on all SAM/BAM files at the same time.

  #!/bin/bash
  module load subread
  # the subread package contains the featureCounts program
  GFF="Gmax_275_Wm82.a2.v1.gene.gff3"
  OUTFILE="featureCounts"
  featureCounts \
    -T 16 \
    -p \
    -t gene \
    -g ID \
    -a ${GFF} \
    -o ${OUTFILE}_counts.txt *.bam

The output will look something like this (headers might differ; only the first 10 lines are displayed here):

#Geneid Chr Start   End Strand  Length  Control.A   Control.B   Control.C   Infected1.A Infected1.B Infected1.C Infected2.A Infected2.B Infected2.C
Glyma.01G000100.Wm82.a2.v1  Chr01   27355   28320   -   966 24  0   51  93  126 91  121 32  52
Glyma.01G000200.Wm82.a2.v1  Chr01   58975   67527   -   8553    91  1   122 193 214 239 102 111 148
Glyma.01G000300.Wm82.a2.v1  Chr01   67770   69968   +   2199    9   0   12  22  18  21  3   11  18
Glyma.01G000400.Wm82.a2.v1  Chr01   90152   95947   -   5796    169 0   407 480 402 518 502 443 379
Glyma.01G000500.Wm82.a2.v1  Chr01   90289   91197   +   909 0   0   0   0   0   0   0   0   0
Glyma.01G000600.Wm82.a2.v1  Chr01   116094  127845  +   11752   149 4   310 374 402 529 304 352 300
Glyma.01G000700.Wm82.a2.v1  Chr01   143467  155573  +   12107   39  2   78  119 113 129 34  100 107
Glyma.01G000800.Wm82.a2.v1  Chr01   157030  157772  +   743 0   0   0   1   1   0   0   0   0
Glyma.01G000900.Wm82.a2.v1  Chr01   170534  193342  +   22809   240 1   517 759 760 859 462 658 494

Since we only need the counts (without the feature information), we will trim this table using the cut command.

  cut -f 1,7-15 featureCounts_counts.txt > featureCounts_counts_clean.txt
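featureCounts also writes a comment line (starting with #, recording the program call) at the top of its output; if a downstream tool trips on it, drop that line at the same time:

  # skip the first (comment) line, then keep gene IDs plus the nine count columns
  tail -n +2 featureCounts_counts.txt | cut -f 1,7-15 > featureCounts_counts_clean.txt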

Now we are ready to perform the DGE analysis!

5. Differential Gene Expression analysis

Again, there are a few options here. You can use edgeR, DESeq2, or QuasiSeq (and many more!). Here we will describe how to do this with reference to the data we have. You can easily modify it to suit your needs (e.g., a different number of samples/replicates/conditions); before running any of them, confirm the sample column order in your counts file, since the condition vectors below must follow it (see the snippet below).
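A quick check of the column order, assuming the merged counts file is named counts.txt as in the R code below:

  # print the header as one column name per line
  head -n 1 counts.txt | tr '\t' '\n'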

Note: you don't have to run all three methods; use any one.

Option A: edgeR

Run the following code in RStudio or an R terminal:

  # import data
  datain <- read.delim("counts.txt", row.names="Geneid")

  # experimental design: 3 conditions with 3 replicates each, in the column order of counts.txt
  DataGroups <- c("CTL", "CTL", "CTL", "INF1", "INF1", "INF1", "INF2", "INF2", "INF2")

  # load edgeR
  library(edgeR)

  # create the DGE object of edgeR
  dgList <- DGEList(counts=datain, group=factor(DataGroups))

  # filter data to retain genes represented at at least 1 count per million (cpm) in at least 2 samples
  countsPerMillion <- cpm(dgList)
  countCheck <- countsPerMillion > 1
  keep <- which(rowSums(countCheck) >= 2)
  dgList <- dgList[keep,]
  dgList$samples$lib.size <- colSums(dgList$counts)

  # normalization using the TMM method
  dgList <- calcNormFactors(dgList, method="TMM")

  ## data exploration
  # MDS plot
  png("plotmds.png")
  plotMDS(dgList, method="bcv", col=as.numeric(dgList$samples$group))
  dev.off()

  # dispersion estimates
  design.mat <- model.matrix(~ 0 + dgList$samples$group)
  colnames(design.mat) <- levels(dgList$samples$group)
  dgList <- estimateGLMCommonDisp(dgList, design.mat)
  dgList <- estimateGLMTrendedDisp(dgList, design.mat, method="power")
  dgList <- estimateGLMTagwiseDisp(dgList, design.mat)
  png("plotbcv.png")
  plotBCV(dgList)
  dev.off()

  # differential expression analysis: Infected1 (INF1) vs Control (CTL);
  # adjust the contrast for other comparisons, e.g. c(-1,0,1) for INF2 vs CTL
  fit <- glmFit(dgList, design.mat)
  lrt <- glmLRT(fit, contrast=c(-1,1,0))
  edgeR_results <- topTags(lrt, n=Inf)

  # plot log2FC of genes and highlight the DE genes
  deGenes <- decideTestsDGE(lrt, p=0.05)
  deGenes <- rownames(lrt)[as.logical(deGenes)]
  png("plotsmear.png")
  plotSmear(lrt, de.tags=deGenes)
  abline(h=c(-1, 1), col=2)
  dev.off()

  # save the results as a table
  write.table(edgeR_results, file="Results_edgeR.txt")

Option B: DESeq2

Again, this should be done either in RStudio or an R terminal. The steps are as follows:

  ## RNA-seq analysis with DESeq2
  ## Largely based on Stephen Turner, @genetics_blog
  ## https://gist.github.com/stephenturner/f60c1934405c127f09a6

  # Import the data
  countdata <- read.table("counts.txt", header=TRUE, row.names=1)

  # Remove .bam or .sam from filenames
  colnames(countdata) <- gsub("\\.[sb]am$", "", colnames(countdata))

  # Convert to matrix
  countdata <- as.matrix(countdata)
  head(countdata)

  # Assign condition (first three are controls, the next three and last three are the two infections)
  (condition <- factor(c(rep("ctl", 3), rep("inf1", 3), rep("inf2", 3))))

  # Analysis with DESeq2

  library(DESeq2)

  # Create a coldata frame and instantiate the DESeqDataSet. See ?DESeqDataSetFromMatrix
  (coldata <- data.frame(row.names=colnames(countdata), condition))
  dds <- DESeqDataSetFromMatrix(countData=countdata, colData=coldata, design=~condition)
  dds

  # Run the DESeq pipeline
  dds <- DESeq(dds)

  # Plot dispersions
  png("qc-dispersions.png", 1000, 1000, pointsize=20)
  plotDispEsts(dds, main="Dispersion plot")
  dev.off()

  # Regularized log transformation for clustering/heatmaps, etc.
  rld <- rlogTransformation(dds)
  head(assay(rld))
  hist(assay(rld))

  # Colors for plots below
  ## Ugly:
  ## (mycols <- 1:length(unique(condition)))
  ## Use RColorBrewer, better
  library(RColorBrewer)
  (mycols <- brewer.pal(8, "Dark2")[1:length(unique(condition))])

  # Sample distance heatmap
  sampleDists <- as.matrix(dist(t(assay(rld))))
  library(gplots)
  png("qc-heatmap-samples.png", w=1000, h=1000, pointsize=20)
  heatmap.2(as.matrix(sampleDists), key=F, trace="none",
            col=colorpanel(100, "black", "white"),
            ColSideColors=mycols[condition], RowSideColors=mycols[condition],
            margin=c(10, 10), main="Sample Distance Matrix")
  dev.off()

  # Principal components analysis
  ## Could do with built-in DESeq2 function:
  ## DESeq2::plotPCA(rld, intgroup="condition")
  ## I (Stephen Turner) like mine better:
  rld_pca <- function (rld, intgroup = "condition", ntop = 500, colors=NULL, legendpos="bottomleft", main="PCA Biplot", textcx=1, ...) {
    require(genefilter)
    require(calibrate)
    require(RColorBrewer)
    rv = rowVars(assay(rld))
    select = order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]
    pca = prcomp(t(assay(rld)[select, ]))
    fac = factor(apply(as.data.frame(colData(rld)[, intgroup, drop = FALSE]), 1, paste, collapse = " : "))
    if (is.null(colors)) {
      if (nlevels(fac) >= 3) {
        colors = brewer.pal(nlevels(fac), "Paired")
      } else {
        colors = c("black", "red")
      }
    }
    pc1var <- round(summary(pca)$importance[2,1]*100, digits=1)
    pc2var <- round(summary(pca)$importance[2,2]*100, digits=1)
    pc1lab <- paste0("PC1 (",as.character(pc1var),"%)")
    pc2lab <- paste0("PC2 (",as.character(pc2var),"%)")
    plot(PC2~PC1, data=as.data.frame(pca$x), bg=colors[fac], pch=21, xlab=pc1lab, ylab=pc2lab, main=main, ...)
    with(as.data.frame(pca$x), textxy(PC1, PC2, labs=rownames(as.data.frame(pca$x)), cex=textcx))
    legend(legendpos, legend=levels(fac), col=colors, pch=20)
  }
  png("qc-pca.png", 1000, 1000, pointsize=20)
  rld_pca(rld, colors=mycols, intgroup="condition", xlim=c(-75, 35))
  dev.off()

  # Get differential expression results
  res <- results(dds)
  table(res$padj<0.05)
  ## Order by adjusted p-value
  res <- res[order(res$padj), ]
  ## Merge with normalized count data
  resdata <- merge(as.data.frame(res), as.data.frame(counts(dds, normalized=TRUE)), by="row.names", sort=FALSE)
  names(resdata)[1] <- "Gene"
  head(resdata)
  ## Write results
  write.csv(resdata, file="diffexpr-results.csv")

  ## Examine plot of p-values
  hist(res$pvalue, breaks=50, col="grey")

  ## Examine independent filtering
  attr(res, "filterThreshold")
  plot(attr(res,"filterNumRej"), type="b", xlab="quantiles of baseMean", ylab="number of rejections")

  ## MA plot
  ## Could do with built-in DESeq2 function:
  ## DESeq2::plotMA(dds, ylim=c(-1,1), cex=1)
  ## I like mine better:
  maplot <- function (res, thresh=0.05, labelsig=TRUE, textcx=1, ...) {
    with(res, plot(baseMean, log2FoldChange, pch=20, cex=.5, log="x", ...))
    with(subset(res, padj<thresh), points(baseMean, log2FoldChange, col="red", pch=20, cex=1.5))
    if (labelsig) {
      require(calibrate)
      with(subset(res, padj<thresh), textxy(baseMean, log2FoldChange, labs=Gene, cex=textcx, col=2))
    }
  }
  png("diffexpr-maplot.png", 1500, 1000, pointsize=20)
  maplot(resdata, main="MA Plot")
  dev.off()

  ## Volcano plot with "significant" genes labeled
  volcanoplot <- function (res, lfcthresh=2, sigthresh=0.05, main="Volcano Plot", legendpos="bottomright", labelsig=TRUE, textcx=1, ...) {
    with(res, plot(log2FoldChange, -log10(pvalue), pch=20, main=main, ...))
    with(subset(res, padj<sigthresh), points(log2FoldChange, -log10(pvalue), pch=20, col="red", ...))
    with(subset(res, abs(log2FoldChange)>lfcthresh), points(log2FoldChange, -log10(pvalue), pch=20, col="orange", ...))
    with(subset(res, padj<sigthresh & abs(log2FoldChange)>lfcthresh), points(log2FoldChange, -log10(pvalue), pch=20, col="green", ...))
    if (labelsig) {
      require(calibrate)
      with(subset(res, padj<sigthresh & abs(log2FoldChange)>lfcthresh), textxy(log2FoldChange, -log10(pvalue), labs=Gene, cex=textcx, ...))
    }
    legend(legendpos, xjust=1, yjust=1, legend=c(paste("FDR<",sigthresh,sep=""), paste("|LogFC|>",lfcthresh,sep=""), "both"), pch=20, col=c("red","orange","green"))
  }
  png("diffexpr-volcanoplot.png", 1200, 1000, pointsize=20)
  volcanoplot(resdata, lfcthresh=1, sigthresh=0.05, textcx=.8, xlim=c(-2.3, 2))
  dev.off()

Option C: QuasiSeq

Also in RStudio or an R terminal. As a first step, save the following code to a file named QLresultsPvaluePlot.R using any text editor, and place it in the same directory as the count data.

  QLresultsPvaluePlot <- function(QLfit, Strname){
    filename <- Strname
    results <- QL.results(QLfit, Plot=F)
    designNum <- dim(results$P.values$QLSpline)[2]
    designNames <- colnames(results$P.values$QLSpline)
    for (i in 1:designNum){
      print(i)
      if (min(results$P.values$QLSpline[,i])<1 && min(results$P.values$QLSpline[,i])!="NaN"){
        Rnames <- rownames(dataIn)
        if (min(results$Q.values$QLSpline[,i])<10 && min(results$Q.values$QLSpline[,i])!="NaN"){
          if (length(which(results$Q.values$QLSpline[,i]<10))>1){
            meanTrt <- apply(dataIn.norm[which(results$Q.values$QLSpline[,i]<10),which(trt==2)],1,mean)
            meanWt <- apply(dataIn.norm[which(results$Q.values$QLSpline[,i]<10),which(trt==1)],1,mean)
            FoldTrtoverWt <- meanTrt/meanWt
            logTwoFC <- log2(FoldTrtoverWt)
            outData <- cbind(as.matrix(dataIn.norm[which(results$Q.values$QLSpline[,i]<10),]),meanTrt,meanWt,as.matrix(results$P.values$QLSpline[which(results$Q.values$QLSpline[,i]<10),i]),as.matrix(results$Q.values$QLSpline[which(results$Q.values$QLSpline[,i]<10),i]),FoldTrtoverWt,logTwoFC)
            colnames(outData) <- c(colnames(dataIn),"mean_treat","mean_control","Pvalues","Qvalues","fold_change","Log2FC")
            write.table(outData,file=paste(filename,".FulldesignVS.",i,".txt",sep=""))
          }
          if (length(which(results$Q.values$QLSpline[,i]<10))==1){
            outData <- cbind(matrix(dataIn[which(results$Q.values$QLSpline[,i]<10),],1,dim(dataIn)[2]),as.matrix(results$P.values$QLSpline[which(results$Q.values$QLSpline[,i]<10),i]),as.matrix(results$Q.values$QLSpline[which(results$Q.values$QLSpline[,i]<10),i]),(sign(mean(dataIn.norm[which(results$Q.values$QLSpline[,i]<10),which(trt==2)])-mean(dataIn.norm[which(results$Q.values$QLSpline[,i]<10),which(trt==1)])))*mean(dataIn.norm[which(results$Q.values$QLSpline[,i]<10),which(trt==2)])/mean(dataIn.norm[which(results$Q.values$QLSpline[,i]<10),which(trt==1)]))
            colnames(outData) <- c(colnames(dataIn),"Pvalues","Qvalues","fold_change")
            write.table(outData,file=paste(filename,".FullvsDesignVS.",i,".txt",sep=""))
          }
        }
        pdf(file=paste(filename,".",i,".pdf",sep=""),width=5,height=5)
        a <- hist(results$P.values$QLSpline[,i],breaks=seq(0,1,.01),main=paste(Strname,designNames[i]),cex.main=.5)
        b <- a$counts[1]*.75
        bb <- a$counts[1]*.65
        bbb <- a$counts[1]*.55
        text(.5,b,paste("Number of genes qvalue below 0.5 = ",as.character(dim(as.matrix(dataIn[which(results$Q.values$QLSpline[,i]<0.5),i]))[1])),cex=.8)
        text(.5,bb,paste("Number of genes qvalue below 0.3 = ",as.character(dim(as.matrix(dataIn[which(results$Q.values$QLSpline[,i]<0.3),i]))[1])),cex=.8)
        text(.5,bbb,paste("Number of genes qvalue below 0.1 = ",as.character(dim(as.matrix(dataIn[which(results$Q.values$QLSpline[,i]<.1),i]))[1])),cex=.8)
        dev.off()
      }
    }
  }

Next, run these steps in RStudio, after setting the working directory to the counts data directory.

  # set the work directory (change this to the folder containing your counts file)
  setwd("/path/to/counts/directory")
  # source the code you just created
  source("QLresultsPvaluePlot.R")
  # Import the data
  uniq <- as.matrix(read.table("counts.txt", header=TRUE, row.names=1))

  # Check dimensions
  cols <- dim(uniq)[2]
  # remove the rows with all zero counts
  colsm1 <- cols
  dataIn2 <- (uniq[which(rowSums(uniq[,1:cols])>colsm1),])
  dataIn3 <- dataIn2[which(rowSums(sign(dataIn2[,1:cols]))>1),]
  dataIn <- as.matrix(dataIn3)
  # normalize data using the upper quartile (0.75)
  log.offset <- log(apply(dataIn, 2, quantile, .75))
  upper.quartiles <- apply(dataIn, 2, function(x) quantile(x, 0.75))
  # calculate scaling factors
  scalingFactors <- mean(upper.quartiles)/upper.quartiles
  dataIn.norm <- round((sweep(dataIn, 2, scalingFactors, FUN="*")))
  # standard deviation before and after normalization (sanity check)
  sd(dataIn[1,])
  sd(dataIn.norm[1,])
  # experimental design: 1=Control, 2=Infected1, 3=Infected2, three replicates each
  # (the helper function above compares groups 1 and 2)
  trt <- as.factor(c(1,1,1,2,2,2,3,3,3))
  mn <- as.numeric(rep(1, cols))
  # QuasiSeq analysis
  library(QuasiSeq)
  design.list <- vector('list', 2)
  design.list[[1]] <- model.matrix(~trt)
  design.list[[2]] <- mn
  log.offset <- log(apply(dataIn, 2, quantile, .75))
  fit2 <- QL.fit(round(dataIn), design.list, log.offset=log.offset, Model='NegBin')
  QLresultsPvaluePlot(fit2, paste("results_", 1, 2, sep=""))