Exercises Linux Tutorial - Part 2

Stefan Wyder Nov 2013
URPP Evolution
University of Zurich
Many thanks to Gregor Roth (von Mering group / IMLS, UZH) who agreed to share some exercises.
Exercise
1: connecting to a remote host, transferring files
|
Command |
Task |
|
ssh -X user@hostname |
Connect to server |
|
scp <what> <towhere> |
Transfer file from/to server |
|
sftp user@hostname |
Transfer file from/to server (interactive) |
ssh ("secure shell") is a secure protocol for remote login and also for executing commands in a remote machine. To connect to a remote machine using ssh, simply type:
$ ssh username@130.60.201.40
inside your local computer shell. You will be asked for the password and after you successfully login, you can work with the remote machine in the same way as you work on your local machine. (You are dropped at your home directory). But ssh can’t transfer files, for that you can use another program called sftp. Disconnect before you go on to the next step (Ctrl+d).
$ sftp username@130.60.201.40
After you are connected, you can use “cd”, “mkdir”, “rm” to navigate around and manipulate files on the remote computer. To upload a file from your local computer to the remote computer, simply use “put filename”. Filename here refers to a file on your local computer that will be uploaded to the remote computer.
|
|
Create a new file with nano on your local computer. Save a few lines of text to the file. Connect to the remote server using sftp and upload the file using the put command inside the sftp session. |
For
Windows Users only: To
work on a server you don't need to first start Linux in a Virtual
Machine and to connect from there. There is an alternative by using
PuTTY, a SSH client.
Download
and install
PuTTY Download
and install Xming (graphics emulation)
http://www.straightrunning.com/XmingNotes/#head-13 For
file transfer use Use WinSCP or FileZilla
http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
Exercise
2: Writing and executing a perl script
Many scripts in bioinformatics are written in perl. You have used the nano editor from the first exercise in part 1 of the Tutorial. Try to copy/paste the below simple perl program into a file and execute it. The hello world perl script:
#!/usr/bin/perl
print
"Hello World.\n";
Copy the above 2 lines and save them to the file “hello.pl”. The first line tells the Unix shell to interpret the program with perl. In order to run the program you can either start it with:
$ perl hello.pl
The output on the screen should be “Hello World.”
Or you can make your file executable by typing:
$ chmod +x hello.pl
And now you can simply type:
$ ./hello.pl
Note the “./” at the start of the command. This is because the directory where we stored hello.pl is not in the system variable $PATH.
Check the file's permissions using ls -l. Remove the permission to execute it and check the permissions again.
Exercise
3: Compressing/Decompressing files
File compression and decompression
|
Command |
Meaning |
|
gzip filename |
compress
file with gzip |
|
gunzip filename.gz |
decompress
file with gzip |
|
zmore
filename.gz |
display file content of a gzip compressed file |
|
tar xfz filename.tar.gz |
extract/decompress files from tar.gz archive |
|
tar zcvf archive.tar.gz folder_to_compress |
creates archive.tar.gz |
|
unzip filename.zip |
unzip archive |
|
zgrep pattern filename.gz |
search text/pattern in a compressed file |
Use the program wget to download the FASTA file of proteins from:
ftp://ftp.ensembl.org/pub/release-71/fasta/homo_sapiens/pep/Homo_sapiens.GRCh37.71.pep.all.fa.gz
When
downloaded, gunzip the
file. Then count how many proteins are in the file (
use
grep and wc).
Now gzip the file and try to count the number of proteins without using gunzip.
For
Mac OS X users only: On
Mac OS X wget is not available by default, you can install it with
your favorite package manager e.g if you have installed homebrew,
simply type: brew install wget
Alternatively, if you don’t
want to install wget, use a web browser to download the file.
Exercise
4: Write a simple bash script
Often, you would like to run the same command with different parameters. As an exercise, write a simple bash script that will output numbers from 1 to 100. Use a for loop.
#!/bin/bash
for i in {1..100}
do
echo $i
done
Save the above code to a file (e.g. script.sh), make the file executable (+x flag) and run it.
What
is the output?
Exercise
5: Iterating over files
First lets prepare some files: Make a new directory called FASTAS and change into the new directory.
Copy 10 Fasta files from the server by typing:
$ scp studi15@130.60.201.40:~/FASTAS/* .
By using the same concept (for loop) from exercise 4, can you try to iterate over all fasta files in a directory, print their name and the sequence length in each file?
for
filename in *.fasta
1. Use wc to count the sequence length in each file. Make sure not to include the sequence header
2. Count the number of Methionines (M) in each sequence (use grep -o)
*
Exercise 6: Reading in arguments
We often need to pass information to our script (e.g. for a cut-off, a parameter or an input file).
#!/bin/bash
firstarg=$1
secondarg=$2
echo "You have entered \"$firstarg\" and \"$secondarg\""
The backslash in front of " is needed for escaping as " is used a string delimiter.
Try to write a shell script which takes as argument a filename. The script shall then display all Arabidopsis genes contained in the file. Use as input file TAIR10_GFF3_genes.gff from part 1 of the tutorial.
Exercise
7: Download and install bowtie2 software
Bowtie2 is a short-sequence read aligner (e.g. 150nt long). The reads are aligned to a reference sequence (e.g. human genome).
Make a new ~/software directory and download bowtie2 source code using wget from this link:
http://downloads.sourceforge.net/project/bowtie-bio/bowtie2/2.1.0/bowtie2-2.1.0-source.zip
Unzip the file, change to the unzipped folder and compile the source by running make:
$ make
You have compiled your first program. If you got an error message first install the gcc compiler on your system (see Appendix) and then repeat make. You can now try to run bowtie2:
$ ./bowtie2
Since bowtie2 directory is not in the $PATH environment variable (a list of directory locations which Unix searches for commands when you try to run them), you can only run it from the bowtie2-2.1.0 folder or by providing the full path (e.g under Linux: /home/swyder/software/bowtie... or under Mac OS X: /Users/swyder/software/bowtie). You can add the bowtie2 folder to the $PATH:
$ export PATH=$PATH:/home/username/software/bowtie2-2.1.0
Now you can simply type “bowtie2” anywhere (in any directory) and the shell will find the bowtie2 software. The modification to $PATH affects only the current window until it is closed – you have to add it to ~/.bash_profile to make it permanent.
Exercise
8: Try out your package manager
You can often save time and work using the package manager as compiling software from source can be painful and time-consuming. The package manager helps you to install, upgrade and remove software. But first we have to check whether a software is available in the repositories. Try out the package manager of your system:
For Ubuntu
users:
The Linux
package management systems are comprehensive (for Ubuntu > 35'000
software packages), powerful and still easy to use – one of the
advantages of the Linux world. It also checks automatically for
software updates and your system will propose you from time to time
to upgrade your system.
You can interact with the package
manager using a graphical user interface or the command line. There
is a sort of App Store called "Ubuntu Software Center"
which includes software reviews. To run it, click on topmost icon in
the dock on the left and type "Ubuntu Software Center" in
the search field. Try it.
For Mac OS X
users: Unfortunately
the package managers for Mac OS X are lagging behind the Linux
world. But if you work regularly on the command line you will need
a package manager. The best one in my opinion is homebrew. To
install it, you first need to install the command line tools from
XCode (see Appendix). Once they are installed go to the homebrew
website and follow the instructions there. $
brew search bowtie2 $
brew install PackageName To get help on
brew, simply type brew
Check for some
software you know whether it is available under homebrew (or in your
package manager of choice). To check bowtie2 e.g. type in the
terminal:
*
Exercise 9: Exploring a FASTA format file
Moraxella catarrhalis (http://www.ncbi.nlm.nih.gov/nuccore/NC_014147) is an interesting Gram-negative Gammaproteobacterium and a human pathogen of the respiratory tract. The whole genome sequence is already available on the server in the Morax/NC_014147.fasta file. The FASTA format is widely used in sequence distribution, see the description at http://en.wikipedia.org/wiki/FASTA_format.
Use
grep to find out how
many chromosomes are present in the file. Use grep
(
-v)
to only print out the genomic sequence. How large is the genome?
Now look at the RNA-seq data sample stored in the Morax/rnaseq.fasta.gz file. This file includes qualities for each nucleotide (see FASTQ description at http://en.wikipedia.org/wiki/FASTQ_format).
How
many reads are in the file?
*
Exercise 10: Searching for short sequences in the Moraxella
catarrhalis genome
Is
the sequence “CTGTATCACCGATTT” present in the Moraxella
genome (Morax/
NC_014147.fasta)?
You
can simply use grep to
find out.
You can also use bowtie2. First build the index of the Moraxella reference genome. The format is: “bowtie2-build <fasta_file> <custom_index_name>”. In the Morax folder, you could use:
$ bowtie2-build NC_014147.fasta Morax
Once the index is created (you do this only once for each reference genome, i.e. each FASTA file), you align one read (sequence) to the genome by typing:
$ bowtie2 Morax -ac CTGTATCACCGATTT > Morax.sam
Explore the SAM results file with “less -S”. The parameter “-S” prevents line wraps, so you can see one alignment per line.
Does
bowtie2 find more
alignments compared to grep? Why could that be?
*
Exercise 11: Alignment of RNA-seq sample reads to the Moraxella
catarrhalis genome
To align the RNA-seq reads in the rnaseq.fastq file, you first need to index the Moraxella catarrhalis genome. Use bowtie2-build to create an index with name Morax.
$ bowtie2-build fasta_file custom_index_name
After you created the index, you can align the reads by typing:
$ bowtie2 Morax -U rnaseq.fastq.gz > Morax2.sam
The results are returned in SAM format and stored to the Morax2.sam file. How many reads align?
Check what are the options of bowtie2. Use Bowtie2 manual (http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml) to explore and change parameters. How do different parameters influence your results?
Appendix
Installation of the gcc compiler for C/C++
Typically
open-source software is written in C/C++, to compile it from source
(and make a binary) one needs a compiler. If its not available on
your system, you have to install it
Linux:
install the C++ compiler using the package manager is very easy,
under Ubuntu simply type in the terminal: sudo apt-get install g++ .
Of course we could install the package also by using the "Ubuntu
Software Center".
Mac OS X: Depending on the version of Mac OS X you are using installation differs. First you have to install the command line tools of XCode, you have 3 different options:
Option 1 (easy, but large download of 2Gb, only if you have enough free disk space): install the whole Xcode via the the App Store. In XCode select <Preferences | Downloads< then <Command Line Tools>
Option
2 (a bit more steps, but you save a lot of
disk space):
Go to https://developer.apple.com/xcode/
choose <View downloads>, log in (or register first, its free),
type <command line tools> in the search field, then select the
correct version for your version of Mac OS X, download and install
it.
Option
3 (not tested):
OSX GCC Installer
Go to https://github.com/kennethreitz/osx-gcc-installer/ and follow the instructions there
Getting help
|
man command |
display manual page of command |
|
command -h |
display shorter manual page of command (only GNUtools, not in Mac OS X) |
|
Program --help |
display help / usage information for software/scripts |
|
Program -h |
display help / usage information for software/scripts |
File and folder manipulation
|
pwd |
display current folder |
|
ls -l path |
list files and folders |
|
cd path |
change folder to path |
|
cd ~ |
change folder to home folder |
|
mkdir dir_name |
make folder |
|
rmdir dir_name |
remove folder |
|
cp source dest |
copy file/folder and all its contents |
File compression and decompression
|
gzip filename |
compress file with gzip (adds .gz extension) |
|
gunzip filename.gz |
decompress fildecompress file with gzip (removes .gz extension) |
|
zmore
filename.gz |
fildecompress display file content of a gzip compressed file |
|
tar xfz filename.tar.gz |
extract/decompress files from tar.gz archive |
|
tar zcvf archive.tar.gz folder_to_compress |
creates archive.tar.gz |
|
unzip filename.zip |
unzip archive |
|
zgrep pattern filename.gz |
search text/pattern in a compressed file |
Text processing
|
grep pattern filename |
search text/pattern |
|
cut |
extract column |
|
tr |
substitute/delete text/pattern |
|
less |
display file content |
|
wc |
count number of lines in file |
|
sort |
sort lines |
|
uniq |
remove lines occurring more than once |
|
comm filename |
compress file with gzip (adds .gz extension) |
Network and file transfer
|
wget URL |
download file (also html page) and save to current folder |
|
ssh –X username@host |
remote
login to host with username |
|
sftp username@host |
remote login to host with username and transfer files |
|
scp source target
|
copy files from/to host scp username@host:~/path/file . scp file username@host:~/path/file |
Permissions and Ownership
These commands also work on directories
|
chmod ug+rx filename |
Set write and execute permissions to user and group |
|
chown user filename |
changes user ownership |
|
chgrp group filename |
changes group ownership |
|
chown user:group filename |
changes user & group ownership |
System information & processes
|
uname -a |
display system information |
|
df -h |
list mounted disks with available space |
|
du -h path |
show space usage |
|
top |
display running processes |
|
kill -9 pid |
kill process |
“vi” editor
|
$ vi filename |
start editing file with vi |
|
i |
switch to “insert” mode |
|
ESC |
switch to “command” mode |
|
:w |
save |
|
:q |
quit |
|
:x |
save and quit |
|
/<pattern> |
search for pattern, <n> gives you the next match |
|
:q! |
quit without saving changes |