
Programming on the Della/SLURM cluster at Princeton.

[tutorial] bash commands for research

This post contains a quick overview of useful command line tools for research found in standard bash shells. For the sake of brevity, I will only include essential information here, and link to other informative pages where available.

awk

awk (also available in the newer implementations nawk and gawk) is a utility for manipulating text files. Its syntax is much simpler than Perl or Python, but it still lets you quickly write scripts that process files by lines and columns.

examples

For each line in the file input.txt, print the first and third columns separated by a tab. Store the result in output.txt.

awk '{ print $1 "\t" $3 }' input.txt > output.txt

Replace each column in file input.txt with its absolute value and print out each line to output.txt.

awk '{for (i=1; i<=NF; i++) if ($i < 0) $i = -$i; print }' input.txt > output.txt

List the number of columns in each row in a file input.txt.

cat input.txt | awk '{ print NF}'
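
Two more patterns I reach for often are column sums and row filters. The sketch below is illustrative only: the sample data and the variable names are made up, and it assumes tab-separated input.

```shell
# Hypothetical sample data: name, count, and score columns separated by tabs.
tmpdir=$(mktemp -d)
printf 'a\t1\t-2.5\nb\t2\t4\nc\t3\t-1\n' > "$tmpdir/input.txt"

# -F '\t' makes the tab the field separator; END runs once after the last line.
col2_sum=$(awk -F '\t' '{ sum += $2 } END { print sum }' "$tmpdir/input.txt")
echo "sum of column 2: $col2_sum"

# A pattern before the braces filters lines: keep rows whose third column is negative.
negatives=$(awk -F '\t' '$3 < 0 { print $1 "\t" $3 }' "$tmpdir/input.txt")
echo "$negatives"

rm -rf "$tmpdir"
```

Setting -F explicitly matters when fields may contain spaces, since awk splits on any run of whitespace by default.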

links

sed

sed is a stream editor, similar to awk, but is focused on streams of text instead of delimited files. I find it most useful when working with patterns in files.

examples

Convert a comma-separated value file to a tab-delimited file (the \t escape in the replacement requires GNU sed)


sed -e 's/,/\t/g' input.csv > output.csv

Print the line before each line matching the regular expression regexp to standard output.


sed -n '/regexp/{g;1!p;};h' input.txt
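
As a further illustration of working with patterns, here is a small sketch that deletes lines, rewrites fields with capture groups, and prints only selected lines. The file contents and variable names are invented for the example.

```shell
# Hypothetical sample data: one comment line and two comma-separated records.
tmpdir=$(mktemp -d)
printf '# comment line\nalpha,1\nbeta,2\n' > "$tmpdir/input.csv"

# /^#/d deletes lines starting with '#'. The s command then swaps the two
# comma-separated fields: \( \) capture text and \1, \2 refer back to it.
swapped=$(sed -e '/^#/d' -e 's/^\([^,]*\),\([^,]*\)$/\2,\1/' "$tmpdir/input.csv")
echo "$swapped"

# -n suppresses the default output, so only lines selected with p are printed.
beta_only=$(sed -n '/^beta/p' "$tmpdir/input.csv")
echo "$beta_only"

rm -rf "$tmpdir"
```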

links

screen

screen is an invaluable tool for creating virtual consoles that let you (1) keep sessions active through network failures, e.g. when using secure shells, (2) connect to your session from different locations, and (3) run a long process without maintaining an active shell. Also see the alias command for a useful screen alias.

examples

Open a new screen console, list all available screen sessions, and reattach to screen with a particular ID


screen; screen -ls; screen -r ID

links

alias

Creates a command which represents a more complex command or group of commands. Commonly used to simplify a complex command without writing a script.

examples

Set up an alias to quickly view all queued cluster jobs for user daguiar


alias myq='showq -u daguiar'

Reattach to the first detached screen session. The \$1 is escaped so that it reaches awk instead of being expanded by the shell.


alias cscreen='screen -r $(screen -ls | grep Detached | head -n1 | awk "{print \$1}")'

links

grep

Finds patterns in text files. Useful options include -c to count matching lines, -i for case-insensitive matching, -v to invert the match, and -P to use Perl-style regular expressions (sometimes easier to work with).

examples

Print all lines that begin with a tab followed by a 1 in file input.txt.


grep -P "^\t1" input.txt

Print the count of all the lines that end with a 2 followed by a tab in file input.txt.


grep -Pc "2\t$" input.txt
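
The -c, -i and -v options combine naturally. A minimal sketch on made-up data (the file contents and variable names are hypothetical):

```shell
# Hypothetical sample data: a name and a diet, separated by a tab.
tmpdir=$(mktemp -d)
printf 'Rex\tcarnivore\nbronto\therbivore\nBRONTO\therbivore\nraptor\tcarnivore\n' > "$tmpdir/input.txt"

# -c prints the number of matching lines; -i ignores case.
bronto_count=$(grep -ci 'bronto' "$tmpdir/input.txt")
echo "lines mentioning bronto: $bronto_count"

# -v inverts the match: print the lines that do NOT contain 'herbivore'.
non_herbivores=$(grep -v 'herbivore' "$tmpdir/input.txt")
echo "$non_herbivores"

rm -rf "$tmpdir"
```

Note that -c counts lines, not individual occurrences, so a line with two matches still counts once.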

links

find

A tool for finding files. Can be chained with other tools for powerful pipelines. Useful options are -name to find files by filename, -wholename to find files by filename and path, and -maxdepth to descend at most this many levels. The -exec option is VERY useful.

examples

Find all files ending in .txt and concatenate them all together.


find . -name "*.txt" -exec cat {} \;
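
To see how -maxdepth and -exec behave, here is a small sketch on a throwaway directory tree. The directory layout and variable names are invented for illustration.

```shell
# Hypothetical directory tree with .txt files at depths 1, 2 and 3.
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/sub/deeper"
echo one > "$tmpdir/a.txt"
echo two > "$tmpdir/sub/b.txt"
echo three > "$tmpdir/sub/deeper/c.txt"
echo skip > "$tmpdir/notes.log"

# -maxdepth 2 stops two levels below the starting point, so c.txt is not reached.
shallow=$(cd "$tmpdir" && find . -maxdepth 2 -type f -name '*.txt' | sort)
echo "$shallow"

# Ending -exec with + passes many files to one cat call instead of one call per file.
line_total=$(cd "$tmpdir" && find . -type f -name '*.txt' -exec cat {} + | wc -l | tr -d ' ')
echo "total lines in all .txt files: $line_total"

rm -rf "$tmpdir"
```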

links

sort

Sorts lines in a file; can sort by column using the -k option. Be sure to specify the type of sort, e.g. -n for numeric sort or -g for general numeric sort.

examples

Sort text file input.txt by its numeric first column and store it in output.txt.


sort -k1,1n input.txt > output.txt
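
The reason the sort type matters is easiest to see side by side. A minimal sketch with made-up data (variable names are hypothetical):

```shell
# Hypothetical sample data: a number and a label, separated by a tab.
tmpdir=$(mktemp -d)
printf '10\tb\n9\ta\n100\tc\n' > "$tmpdir/input.txt"

# Without n, keys are compared as text, so "100" sorts before "9".
as_text=$(sort -k1,1 "$tmpdir/input.txt" | cut -f1 | paste -sd ' ' -)
echo "text order:    $as_text"

# With n, keys are compared as numbers.
as_numbers=$(sort -k1,1n "$tmpdir/input.txt" | cut -f1 | paste -sd ' ' -)
echo "numeric order: $as_numbers"

# Adding r reverses the order for that key.
descending=$(sort -k1,1nr "$tmpdir/input.txt" | cut -f1 | paste -sd ' ' -)
echo "descending:    $descending"

rm -rf "$tmpdir"
```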

links

paste/join/cat

Tools for combining files.
paste merges files line by line.
join merges two files on a common field.
cat concatenates a number of files one after the other. Can also be used to print a file to standard output.

examples

paste the lines of input1.txt and input2.txt together separating them with a space


paste -d " " input1.txt input2.txt

join two files input1.txt and input2.txt by the first field of both files; join expects both files to already be sorted on that field


join -1 1 -2 1 input1.txt input2.txt

concatenate two files and store them in output.txt


cat input1.txt input2.txt > output.txt
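
Because join expects sorted input, a typical workflow sorts both files first. The sketch below uses invented file contents to show the difference between joining on a key and pasting line by line.

```shell
# Hypothetical sample data: two unsorted files sharing the keys a, b, c.
tmpdir=$(mktemp -d)
printf 'c 3\na 1\nb 2\n' > "$tmpdir/input1.txt"
printf 'b beta\nc gamma\na alpha\n' > "$tmpdir/input2.txt"

# Sort both files on the join field before joining.
sort -k1,1 "$tmpdir/input1.txt" > "$tmpdir/sorted1.txt"
sort -k1,1 "$tmpdir/input2.txt" > "$tmpdir/sorted2.txt"

# join matches rows by key and prints the key once, followed by the other fields.
joined=$(join -1 1 -2 1 "$tmpdir/sorted1.txt" "$tmpdir/sorted2.txt")
echo "$joined"

# paste ignores keys entirely and glues line N of each file together.
pasted_first=$(paste -d ' ' "$tmpdir/input1.txt" "$tmpdir/input2.txt" | head -n1)
echo "$pasted_first"

rm -rf "$tmpdir"
```

The pasted line pairs key c with key b, which is why paste is only safe when both files are already in the same row order.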

links

Useful shorthand.

shorthand             description                                                           examples
.                     prefix to a filename means hidden file;                               .ssh
                      in a path means current directory;                                    ls ./
                      synonym for source when used alone                                    . executable.sh
/                     root of the file system; all absolute paths start from here           cd /home/daguiar
../                   up one directory                                                      cd ../
|                     redirect output of command left of pipe to the command right of pipe  ls | sort -r | head -n2
~                     represents your home directory; has other uses                        cd ~; ls ~
[Tab][Tab]            lists all available completions                                       cd a[Tab][Tab]
[ArrowUp/ArrowDown]   cycle through previously submitted commands
[Ctrl]c               kill the current process
[Ctrl]d               log out of terminal
[Ctrl]z               suspend the current process; then bg sends it to the background       find . -name "*" [Ctrl]z bg
                      and fg brings it to the foreground; appending & to a command
                      also runs it in the background
>out.txt              redirect output to file out.txt                                       ls -al > files.txt
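
Several of these shorthands can be tried safely in a scratch directory. This is only a sketch: the file names, the vars.sh script and the greeting variable are all made up.

```shell
# Hypothetical scratch directory with three visible files and one hidden file.
tmpdir=$(mktemp -d)
touch "$tmpdir/a.txt" "$tmpdir/b.txt" "$tmpdir/c.txt" "$tmpdir/.hidden"

# Pipes chain commands: list, sort in reverse, keep the first two names.
top_two=$(ls "$tmpdir" | sort -r | head -n2)
echo "$top_two"

# Files starting with . are hidden from plain ls; ls -a shows them.
hidden_listed=$(ls -a "$tmpdir" | grep -c '^\.hidden$')
echo "hidden files seen by ls -a: $hidden_listed"

# > writes output to a file; a lone . (source) then runs that file in the current shell.
echo 'greeting="hello from a sourced file"' > "$tmpdir/vars.sh"
. "$tmpdir/vars.sh"
echo "$greeting"

rm -rf "$tmpdir"
```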

More information.

New to Linux or need a refresher? Learn these more common commands.

You can view the manual for any Linux command by typing man command.

  • mv/cp/rm: move, copy, remove files
  • cd: change directory
  • ls: list files. Common options -l, -a, -S, -t
  • mkdir: make a directory
  • less: view a file in order; does not load the entire file into memory, great for previewing large files.
  • chmod: change permissions of a file. There are 3 permissions (read, write, execute) for the file owner, group, and everyone. Examples: chmod 777 (give owner/group/everyone read/write/execute permissions); chmod g+rw (add read and write access to the group permissions).
  • chgrp/chown: change the group/owner for a file.
  • vim/emacs/nano: popular text editors
  • history: print previously used commands
  • head/tail: view start/end of file
  • wget: network downloader
  • tar/gzip: popular compression and archival software. Examples: extract an archive: tar xvf file.tar; extract a gzipped tar archive: tar xzvf file.tar.gz; gzip a file: gzip file
  • groups: see which groups you belong to. Groups can be used to manage permissions for a set of individuals for a file.
  • pwd: print working directory
  • ssh: login to remote host
  • diff: compare two files, output differences. -w option will ignore whitespace
  • touch: update the timestamp of a file (or create a new one if it doesn’t exist)
  • xargs: builds argument lists from standard input, for commands that take arguments rather than reading from a pipe. Example: echo break up a sentence into groups of 2 | xargs -n 2
  • ps: view running processes. Example: ps -U daguiar
  • top: display top processes
  • finger: look up information on a user
  • echo: print arguments to standard output
  • rsync: remote and local file synchronization tool
  • du: summarize disk space usage, example: du -h --max-depth=1
  • uname -a: get information about currently logged in machine. Related: echo $0 to print your interpreter.
  • md5sum: checksum files. Verify that some important files you downloaded are genuine.
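
A couple of the commands in this list are easier to understand by running them. The sketch below runs the xargs example from the list and round-trips a file through a gzipped tar archive. The file names are invented for illustration.

```shell
# xargs -n 2 calls echo once per pair of words read from standard input.
grouped=$(echo break up a sentence into groups of 2 | xargs -n 2)
echo "$grouped"

# Hypothetical file to archive.
tmpdir=$(mktemp -d)
echo "some data" > "$tmpdir/file.txt"

# c = create, z = gzip, f = archive name; -C switches directory before adding files.
tar czf "$tmpdir/file.tar.gz" -C "$tmpdir" file.txt

# x = extract; unpack into a separate directory.
mkdir "$tmpdir/extracted"
tar xzf "$tmpdir/file.tar.gz" -C "$tmpdir/extracted"

# diff prints nothing and exits with status 0 when the two files are identical.
roundtrip=failed
diff "$tmpdir/file.txt" "$tmpdir/extracted/file.txt" && roundtrip=ok
echo "round trip: $roundtrip"

rm -rf "$tmpdir"
```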

A worked-through exercise

Now that we’ve learned the basics of these commands, let’s put them all together.
You will have to use what you have learned to work out solutions to the tasks in bold. Note that there are many ways to solve each problem.

The scenario

Due to the recent discovery that the Brontosaurus may indeed be a genus of dinosaur, the NSF has reallocated all of its funding to dinosaur research. A collaboration between leading archaeologist Dr. Li-Fang (aka the iron fist of Taiwan) and crazed molecular biologist Dr. Bianca resulted in the extraction of DNA samples from several Brontosauri fossils. After characterizing the set of variants in the sample, a variant call format (VCF) file was generated and uploaded to the cloud. During upload, a deranged hacker and former world-class sprinter named Greg, who has a personal vendetta against the Brontosaurus, corrupted the text file. We must clean this file using the tools described in this tutorial.

Download the (VCF) file.

(1) In case we get interrupted during this workflow, start a new screen session.

(2) In a bash prompt, change the directory to the file and print the file to standard output.

(3) It appears the third individual (12th column) was maliciously inserted into the VCF file. Remove the 12th column in the file; equivalently, retain all columns except the 12th.
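
One possible approach, among many, is cut. The sketch below runs on a made-up one-line stand-in for dino.vcf, since the real file is not reproduced here, and it assumes the columns are tab-delimited. The output file name dino.step3.vcf is hypothetical.

```shell
tmpdir=$(mktemp -d)
# Hypothetical stand-in for dino.vcf: one row of 13 tab-separated fields, c1 to c13.
printf 'c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc10\tc11\tc12\tc13\n' > "$tmpdir/dino.vcf"

# -f1-11,13- keeps fields 1 through 11 and 13 onward; tab is cut's default delimiter.
cut -f1-11,13- "$tmpdir/dino.vcf" > "$tmpdir/dino.step3.vcf"

field_count=$(awk -F '\t' '{ print NF }' "$tmpdir/dino.step3.vcf")
last_fields=$(cut -f11,12 "$tmpdir/dino.step3.vcf")
echo "$field_count fields; last two: $last_fields"

rm -rf "$tmpdir"
```

Lines that contain no tab at all, such as free-text header lines, pass through cut unchanged.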

(4) There appears to be the string “ihatedinosaur” scattered around the file. Find and replace all occurrences of “ihatedinosaur” with the empty string.
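
One possible approach is a global sed substitution. The data below is a made-up stand-in for the corrupted file, not the real dino.vcf.

```shell
tmpdir=$(mktemp -d)
# Hypothetical stand-in: the junk string appears at different places in each line.
printf 'chr1\t100\tihatedinosaurA\tT\nihatedinosaurchr2\t200\tG\tCihatedinosaur\n' > "$tmpdir/dino.vcf"

# s/pattern//g replaces every occurrence on every line with the empty string.
cleaned=$(sed 's/ihatedinosaur//g' "$tmpdir/dino.vcf")
echo "$cleaned"

# Count the lines that still contain the string; grep -c prints 0 when none do.
leftover=$(echo "$cleaned" | grep -c 'ihatedinosaur')
echo "lines still corrupted: $leftover"

rm -rf "$tmpdir"
```

Without the trailing g, sed would only remove the first occurrence on each line.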

(5) An erroneous chromosome was also added to the VCF file. Remove the chromosome FABRICATED from dino.vcf.
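
One possible approach is to filter on the first column with awk. The sketch assumes the chromosome name is the first tab-delimited field, as in a standard VCF, and uses invented rows in place of the real file.

```shell
tmpdir=$(mktemp -d)
# Hypothetical stand-in for dino.vcf: a header line and three records.
printf '#CHROM\tPOS\nchr1\t100\nFABRICATED\t5\nchr2\t50\n' > "$tmpdir/dino.vcf"

# A bare pattern with no action prints every line for which it is true,
# so this keeps all lines whose first field is not exactly FABRICATED.
kept=$(awk -F '\t' '$1 != "FABRICATED"' "$tmpdir/dino.vcf")
echo "$kept"

rm -rf "$tmpdir"
```

Comparing the whole first field is stricter than grep -v FABRICATED, which would also drop any line that mentions the word in another column.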

(6) We’ve recovered a sample that was removed from dino.vcf. Add the file dino4.vcf as a new sample in the dino.vcf file.
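
One possible approach is paste. This sketch rests on a strong assumption that I cannot check against the real files: that dino4.vcf holds a single column with exactly one line for every line of dino.vcf, in the same order. All file contents below are made up.

```shell
tmpdir=$(mktemp -d)
# Hypothetical stand-ins: dino.vcf with one sample column, dino4.vcf with the new column.
printf '#CHROM\tPOS\tdino1\nchr1\t100\t0/1\n' > "$tmpdir/dino.vcf"
printf 'dino4\n1/1\n' > "$tmpdir/dino4.vcf"

# paste glues line N of each file together, separated by a tab by default.
combined=$(paste "$tmpdir/dino.vcf" "$tmpdir/dino4.vcf")
echo "$combined"

rm -rf "$tmpdir"
```

If the two files differ in line count or row order, the new column would be attached to the wrong records, so it is worth comparing wc -l for both files first.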

(7) Finally, sort the file so it is in chromosome – position order.
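
One possible approach is a two-key sort that leaves header lines on top. The sketch assumes that header lines start with #, that chromosome is the first tab-delimited column and position the second. The rows are made up.

```shell
tmpdir=$(mktemp -d)
# Hypothetical stand-in for dino.vcf: a header line and three unsorted records.
printf '#CHROM\tPOS\nchr2\t30\nchr1\t200\nchr1\t50\n' > "$tmpdir/dino.vcf"

# Store a literal tab so it can be handed to sort -t in any POSIX shell.
tab=$(printf '\t')

# Print the header lines first, then the records sorted by chromosome (as text)
# and, within each chromosome, by position (as a number).
sorted=$(
  grep '^#' "$tmpdir/dino.vcf"
  grep -v '^#' "$tmpdir/dino.vcf" | sort -t "$tab" -k1,1 -k2,2n
)
echo "$sorted"

rm -rf "$tmpdir"
```

The chromosome key is compared as text, so a name like chr10 would sort before chr2. Whether that matters depends on how the chromosomes in the real file are named.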

When you are done, you can terminate your screen [Ctrl+D]. Congratulations, you’ve made the world safe