Bioawk basics

Bioawk is an extension of the UNIX utility awk. It adds features for manipulating biological data and is used in much the same way as awk. This tutorial gives a brief introduction and examples of some common tasks that can be done with it.

Changing file permission defaults

By default, files and directories created by a user have the following permissions (except those generated by a script or software that explicitly changes the attributes):

-rw-r--r-- for any_file
drwxr-xr-x for any_directory

Alternatively, the default permissions can be set in the .bashrc or .login file, which runs at login and configures the account automatically. Some examples are:
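A minimal sketch of such settings, assuming a bash shell and that the umask builtin is the mechanism used. The two commented-out values are illustrative alternatives and are not taken from the original:

```shell
# Lines that could go in ~/.bashrc. Keep exactly one umask active.
# The mask removes permission bits from the base 666 (files) and 777 (directories).
umask 022     # files -rw-r--r-- (644), directories drwxr-xr-x (755): the defaults listed above
# umask 027   # files -rw-r----- (640), directories drwxr-x--- (750): nothing for others
# umask 077   # files -rw------- (600), directories drwx------ (700): owner only
```

The new mask only affects files created afterwards; existing files keep their permissions until changed with chmod.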

Converting FASTQ to FASTA

There are several ways to convert FASTQ sequences to FASTA. Some methods are listed below.

Using SED

sed can be used to selectively print the desired lines from a file, so if you print the 1st and 2nd line of every 4 lines, you get the sequence header and the sequence needed for FASTA format. The leading @ of the header must also be replaced with >.
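A minimal sketch of this approach, assuming GNU sed (the first~step address form is a GNU extension) and strictly 4-line FASTQ records. The function name fastq_to_fasta is hypothetical:

```shell
# Convert a 4-line-per-record FASTQ file to FASTA on standard output.
#   1~4 s/^@/>/p  on lines 1, 5, 9, ... replace the leading @ with > and print
#   2~4 p         on lines 2, 6, 10, ... print the sequence unchanged
# Lines 3 and 4 of each record (the + separator and the qualities) are skipped.
fastq_to_fasta() {
    sed -n '1~4s/^@/>/p;2~4p' "$1"
}

# Usage (file names are placeholders):
#   fastq_to_fasta reads.fastq > reads.fasta
```

Selecting lines by position rather than by a leading @ matters because a quality string may itself begin with @.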

Password-less SSH login

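To log in automatically from your machine to a remote host, generate a private/public key pair on your machine and add the public key to the remote host's list of authorized keys; the private key stays on your machine. This way, you don't have to enter your password each time you log in.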
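The usual OpenSSH workflow can be sketched as follows. The key type, file names and user@remote.host are placeholders, and install_pubkey is a hypothetical helper showing a manual equivalent of ssh-copy-id for hosts where that command is unavailable:

```shell
# Step 1 (local machine): create a key pair; the private key stays here.
#   ssh-keygen -t ed25519 -f "$HOME/.ssh/id_ed25519"
# Step 2 (local machine): send only the public key to the remote host.
#   ssh-copy-id -i "$HOME/.ssh/id_ed25519.pub" user@remote.host
# Step 3: log in; the account password should no longer be requested.
#   ssh user@remote.host

# Manual equivalent of step 2, to be run on the remote host.
# Appends a public key to an authorized_keys file unless it is already
# listed, and restricts permissions to the owner.
install_pubkey() {
    pubkey_file=$1
    auth_file=${2:-$HOME/.ssh/authorized_keys}
    mkdir -p "$(dirname "$auth_file")"
    chmod 700 "$(dirname "$auth_file")"
    touch "$auth_file"
    grep -qxF "$(cat "$pubkey_file")" "$auth_file" || cat "$pubkey_file" >> "$auth_file"
    chmod 600 "$auth_file"
}
```

If the key is protected by a passphrase, ssh asks for that passphrase instead of the account password unless an agent such as ssh-agent holds the key.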

Using PSC supercell

Our research allocation gives us about 40 TB of long-term storage that we can use for backups. This guide will help you get started with the Supercell. If you don't already have a username/password for PSC systems (e.g., Blacklight), you need to get one now. Use the [password reset link](http://psc.edu/index.php/resources-for-users/allocations "password reset link").

Accessing Supercell

SFTP is the best method to browse the files and create the directory structure you want.

Calculate sequence lengths in a fasta file

Sometimes it is essential to know the length distribution of your sequences. They may be newly assembled scaffolds, a genome whose chromosome sizes you want to know, or any multi-FASTA sequence file.
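As a dependency-free illustration, the lengths can also be tabulated with plain awk. This is only a sketch and not one of the methods this tutorial lists; the function name fasta_lengths is hypothetical:

```shell
# Print "name<TAB>length" for every record of a (multi-line) FASTA file.
# The name is the first word of the header line, without the leading ">".
fasta_lengths() {
    awk '
        /^>/ {
            if (seen) print name "\t" len   # flush the previous record
            name = substr($1, 2)
            len = 0
            seen = 1
            next
        }
        {
            gsub(/[ \t\r]/, "")             # ignore stray whitespace and CRs
            len += length($0)
        }
        END { if (seen) print name "\t" len }
    ' "$1"
}

# Usage (file name is a placeholder), longest sequences first:
#   fasta_lengths scaffolds.fasta | sort -k2,2nr
```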

1. Using biopython

Save this as a script, make it executable and run it on a FASTA file:

Running BLAST jobs in parallel

With a large file of sequences, the traditional way to do a BLAST search is to start with the first sequence and search all the sequences one after another. This is very time-consuming and wastes the processing power that HPC systems can offer. There are several ways to speed up this process. Most of them split the input sequence file into multiple pieces (usually as many as there are processors) and run the BLAST search simultaneously on each piece. So the larger the number of processors, the faster the whole process.
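The splitting step can be sketched with awk. This is one possible approach, not necessarily the one the tutorial goes on to use: records are dealt out round-robin so that each piece gets a similar number of sequences. The function name split_fasta, the chunk file naming and the database name in the usage comment are hypothetical:

```shell
# Split a FASTA file into N pieces named <prefix>_1.fasta ... <prefix>_N.fasta.
# Each header line starts a new record, which goes to the next piece in turn.
split_fasta() {
    input=$1
    pieces=$2
    prefix=${3:-chunk}
    awk -v n="$pieces" -v prefix="$prefix" '
        /^>/ { file = sprintf("%s_%d.fasta", prefix, (count++ % n) + 1) }
        file != "" { print > file }
    ' "$input"
}

# Usage sketch (assumes blastn is on PATH and a database called my_db exists):
#   split_fasta queries.fasta 4 chunk
#   for f in chunk_*.fasta; do
#       blastn -query "$f" -db my_db -outfmt 6 -out "$f.out" &
#   done
#   wait                                   # block until all searches finish
#   cat chunk_*.fasta.out > all_hits.tsv
```

Round-robin assignment balances the number of sequences per piece but not their total length, so pieces can still take different amounts of time when sequence lengths vary widely.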

Getting data from iPlant via iRODS

This is a quick tutorial on downloading data from the iPlant Data Store to Lightning3/Condo using iRODS. iRODS is currently installed on Lightning3/Condo; to start using it, load it with the module system:

module use /data004/software/GIF/modules
module load iRODS

iRODS provides Unix command-line utilities to interact with the iPlant Data Store. Many of them mirror common UNIX commands: prefixing the UNIX command with i gives the corresponding iRODS command, or icommand (for example, ls becomes ils).
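A short sketch of how the icommands map onto their Unix counterparts, assuming the iRODS module is loaded and iinit has been run once to store your credentials. The function name iplant_download and the example path are hypothetical:

```shell
# List a remote collection, then download it recursively.
#   ils  ~ ls      list the contents of a collection
#   iget ~ cp -r   copy from the Data Store to the local file system
#                  (-r recursive, -K verify checksums after transfer)
iplant_download() {
    remote_dir=$1          # e.g. /iplant/home/your_username/project
    local_dir=${2:-.}
    ils "$remote_dir" &&
    iget -r -K "$remote_dir" "$local_dir"
}

# Usage (placeholder path):
#   iplant_download /iplant/home/your_username/project ./project_copy
```

Other icommands follow the same pattern, for example icd, ipwd, imkdir and iput for uploads.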