Reconstructing a Species Phylogenetic Tree with BUSCO Genes Using the Maximum Likelihood Method

In this guide, we will explain how to build a species phylogenetic tree using only draft genomes (genomes that do not yet have annotations). For data, we will use the sequenced legume genomes. You will need the following programs installed and access to a large cluster (we use Condo in this exercise).


As mentioned above, we will use the sequenced legume genomes for this purpose. Not all of these genomes have gene predictions, so we will predict the BUSCO genes before constructing the phylogenetic tree.

The data can be downloaded using commands like the following:
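The original download commands are not shown here; as a sketch, each assembly can be fetched with wget in a loop. The URLs below are placeholders, not the real locations of the legume genomes:

```shell
# Placeholder URLs -- replace with the actual locations of the genome assemblies.
# The 'echo' prints the command that would run; drop it to actually download.
for url in https://example.org/genomes/speciesA.fasta.gz \
           https://example.org/genomes/speciesB.fasta.gz; do
  echo "wget ${url}"
done
```

Once downloaded, decompress the files with `gunzip` before running BUSCO.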

Submitting dependency jobs using SLURM

The SLURM scheduler uses the sbatch command to submit jobs. With the same command you can submit a large number of jobs in a loop, or chain a series of jobs so that each runs only after a set of previous jobs completes. You can also schedule a job to start at a predefined time. In this tutorial, we explain how to submit jobs that run depending on the status of previously submitted jobs, and how to schedule a series of jobs to run one after the other.
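A minimal sketch of the dependency workflow (the script names `first_job.sub` and `second_job.sub` are assumptions):

```shell
# --parsable makes sbatch print only the job ID, so it can be captured
jobid=$(sbatch --parsable first_job.sub)

# Run second_job.sub only if the first job finishes successfully (exit code 0)
sbatch --dependency=afterok:${jobid} second_job.sub

# Or schedule a job to start at a predefined time
sbatch --begin=2030-01-15T08:00:00 first_job.sub
```

Other dependency types include `afterany` (run after the job ends regardless of status) and `afternotok` (run only if the job failed).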

Slurm: SLURM job management cheat sheet

Quick reference sheet for SLURM resource manager

Job scheduling commands

Command     Function                  Basic Usage        Example
sbatch      submit a slurm job        sbatch [script]    $ sbatch job.sub
scancel     delete slurm batch job    scancel [job_id]   $ scancel 123456

Introduction to Regular Expressions

Regular Expressions Definition -- regex, regexp

Regular expressions are patterns used to find, or find and replace, text on the command line. They are supported in most modern programming languages, and the syntax is usually very similar across them.

Regular expressions in Perl

In this tutorial we will use Perl on the command line to showcase regular expressions. Perl one-liners are a great addition to your bioinformatics toolkit for finding, replacing, extracting, and generally manipulating strings.

Submitting dependency jobs using PBS-Torque

To submit jobs one after the other (i.e., run the second job after the first completes), we can use the depend attribute of qsub (the -W depend=... option).

First, submit the first job as normal:

qsub first_job.sub

You will get the job ID (jobid#) as output.


Second, submit the second job as follows, using the job ID from the first submission:
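The second submission is not shown in the original; a sketch, assuming the script names `first_job.sub` and `second_job.sub`:

```shell
# qsub prints the ID of the submitted job; capture it in a variable
jobid=$(qsub first_job.sub)

# Submit the second job so it starts only after the first exits without errors
qsub -W depend=afterok:${jobid} second_job.sub
```

Here `afterok` means "run only if the named job completed successfully"; Torque also accepts `afterany` and `afternotok`.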

Downloading SRA files from NCBI

The SRA toolkit has been configured to connect to NCBI SRA and download via FTP. To fetch an SRA file and split it into forward/reverse reads, use this command:

module load sratoolkit
fastq-dump --split-files --origfmt --gzip SRR1234567

You will see two files, SRR1234567_1.fastq.gz and SRR1234567_2.fastq.gz (gzipped, because of the --gzip flag), downloaded directly from NCBI. If the file size is more than 1 GB, submit this within a PBS script.
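For large downloads, a minimal PBS script might look like the following. The job name and resource requests are assumptions; adjust them for your cluster:

```shell
#!/bin/bash
#PBS -N sra_download
#PBS -l nodes=1:ppn=1
#PBS -l walltime=12:00:00

# PBS starts jobs in the home directory; move to the submission directory
cd "$PBS_O_WORKDIR"

module load sratoolkit
fastq-dump --split-files --origfmt --gzip SRR1234567
```

Save this as, say, `download_sra.sub` and submit it with `qsub download_sra.sub`.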

Torque: PBS job management cheat sheet

Job scheduling commands

Command     Function               Basic Usage      Example
qsub        submit a pbs job       qsub [script]    $ qsub job.pbs
qdel        delete pbs batch job   qdel [job_id]    $ qdel 123456

Retrieve FASTA sequences using sequence ids

1. cdbfasta/cdbyank

This is a tutorial for the file-based hashing tools cdbfasta and cdbyank, which create indices for quick retrieval of particular sequences from large multi-FASTA files. Use cdbfasta to create the index file for a multi-FASTA file, and cdbyank to pull records using that index. To create an index file for a large multi-FASTA file:
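A sketch of the typical usage, assuming a multi-FASTA file named `sequences.fasta` (cdbfasta writes the index next to it with a `.cidx` suffix):

```shell
# Build the index: creates sequences.fasta.cidx
cdbfasta sequences.fasta

# Pull a single record by its sequence ID
cdbyank -a seq_id_1 sequences.fasta.cidx > seq_id_1.fasta

# Pull many records, reading IDs (one per line) from stdin
cat ids.txt | cdbyank sequences.fasta.cidx > subset.fasta
```

The IDs passed to cdbyank must match the FASTA header tokens exactly as they were indexed.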

Downloading files via wget

Normally, to download a file we use wget or curl and paste the link to the file:

wget http://link.edu/filename

But you can also download every file in a directory that matches a pattern, using the examples below.

Using Wget

There are two options: you can either give wget a pattern for the files to accept, or put a pattern in the URL itself. The first option is useful when a directory contains a large number of files but you only want files of a specific format (e.g., fasta).
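A sketch of both options, reusing the placeholder host from above (note that wget's `-A` accept list takes shell-style glob patterns, not full regular expressions, and option 2 relies on the shell's brace expansion rather than wget itself):

```shell
# Option 1: recursively fetch only files matching the accept pattern
#   -r  recursive   -np don't ascend to the parent directory
#   -nd don't recreate the remote directory tree locally
#   -A  accept list of file-name patterns
wget -r -np -nd -A "*.fasta" http://link.edu/directory/

# Option 2: let bash expand a pattern in the URL into multiple URLs
wget http://link.edu/files/sample_{1,2,3}.fasta
```

In option 2, bash expands the braces before wget runs, so wget receives three separate URLs.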