Donnerstag, 31. März 2016

Evaluating the distribution of amino acid triples in human proteins using Amazon's Elastic MapReduce

This is a tutorial on how to use Amazon's EMR to analyze data in parallel on a cluster. It is an educational article whose purpose is to show how Elastic MapReduce works. For the sake of simplicity, we will count sequences of 3 amino acids in this tutorial. This (rather small) problem could also be solved by a classical (single-threaded) imperative program, much faster and with much less overhead.
However, for bigger problems (like analyzing sequences of >=8 amino acids), the scalability of map/reduce really starts to kick in and makes it possible to analyze huge amounts of data (it's part of the "big data" hype after all :) )

The data

Since the human genome was fully sequenced, we know all types of proteins (and their amino acid sequences) that exist in the human body. In fact, you can download this information as a zipped text file from NCBI: ftp://ftp.ncbi.nih.gov/genomes/Homo_sapiens/protein/ - download protein.fa.gz

This zip file is only 11.3 MB in size ... our whole life in such a small file ...

Proteins are chains of amino acids (AA). Only 22 different AAs are used in human proteins, so each of them is assigned a letter. You can look up these assignments on Wikipedia.

In the file, each pair of lines holds the name and the actual AA sequence of one protein found in the human body.
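To make that format concrete, here is a small Python sketch (a hypothetical helper, not part of the actual job) that pairs up name and sequence lines as described above:

```python
def read_proteins(lines):
    """Yield (name, sequence) pairs from a file where each protein
    is stored as one name line followed by one sequence line."""
    it = iter(lines)
    for name in it:
        sequence = next(it)
        yield name.strip(), sequence.strip()

# made-up sample input in the two-lines-per-protein layout
sample = [">protein A", "AABABAB", ">protein B", "FLIFLI"]
pairs = list(read_proteins(sample))
# pairs == [(">protein A", "AABABAB"), (">protein B", "FLIFLI")]
```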

In this example, the occurrence of each possible triple of AAs is counted across all proteins in the input file. The following example shows what is done:

Let's consider the following AA sequence: AABABAB

From this sequence, we can extract the following triples and their counts of occurrences:

  • AAB: 1x (AABABAB)
  • ABA: 2x (AABABAB)
  • BAB: 2x (AABABAB)
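The same counting can be sketched in a few lines of plain Python (single-threaded, just to illustrate the result we expect from the cluster later on):

```python
from collections import Counter

def triple_counts(sequence):
    """Count every overlapping triple of amino acids in a sequence."""
    return Counter(sequence[n:n+3] for n in range(len(sequence) - 2))

counts = triple_counts("AABABAB")
# counts: AAB -> 1, ABA -> 2, BAB -> 2, matching the list above
```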

The Results

It is remarkable that there are indeed a few triples that occur very often in the analyzed AA chains. Below is a graph showing the number of occurrences of the top 400 triples (each triple is assigned a number, starting with triple 1 - the triple with the most occurrences - at almost 80K).


The logarithmic pattern of the distribution continues throughout the whole dataset.

The following is a table of the most frequently occurring triples:

TRIPLE   OCCURRENCES
SSS      78314
EEE      66653
LLL      63178
PPP      59174
AAA      48985
SSL      46638
SLL      43912
LLS      43323
LSS      43262

You can download here a sorted CSV file with all triples and their occurrences.

How it's done

Map/Reduce is a proven programming model for analyzing data. What makes it so special is that it can be executed (massively) in parallel. Paired with modern implementations like Hadoop and its flexible and fast filesystem HDFS, Map/Reduce can be applied as a massively parallel procedure without running into problems like locks, dirty reads or all the other things we dislike about parallel algorithms :)

Map/Reduce works as follows in Hadoop:
  1. The input file (for example the amino acid sequence file) is split into several parts (chosen by the system or configured by the user)
  2. A number of MAPPER tasks are spawned. Each mapper is fed one of the parts of the input file split in step 1. (The mapper gets the data via stdin.)
  3. The MAPPER writes out a number of KEY-VALUE-PAIRS (k, v) to stdout. In our example, the key would be the sequence of 3 amino acids and the value the (constant) count of 1.
  4. After all mappers have finished, a number of REDUCER tasks are spawned. Each reducer is fed a tuple that consists of a key and a LIST of values (k, ListOf(v)). In our example, if 4 key-value pairs were emitted for the amino acid combination "FLI", a reducer could get an input that looks like this:
    1. (FLI, (1, 1, 1, 1))
  5. The reducer can now aggregate the values of the list (e.g. by calculating the sum or the average of the values). The reducer outputs another key-value pair (k, v'). In our example, it should output:
    1. (FLI, 4)

After the reduce step we see that the subsequence "FLI" was found 4 times among all the data (of ALL mappers and reducers).
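To make the data flow tangible, the five steps above can be simulated in a few lines of plain Python (no Hadoop involved; the input lines are made up so that "FLI" is emitted exactly 4 times):

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    # step 3: emit a (triple, 1) pair for every overlapping triple
    for line in lines:
        for n in range(len(line) - 2):
            yield (line[n:n+3], 1)

def shuffle(pairs):
    # step 4: group all emitted pairs by key -> (k, ListOf(v))
    for key, group in groupby(sorted(pairs), itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_counts(grouped):
    # step 5: aggregate each value list into a single sum
    for key, values in grouped:
        yield key, sum(values)

lines = ["FLIFLI", "FLIAFLI"]        # made-up sequences
result = dict(reduce_counts(shuffle(mapper(lines))))
# result["FLI"] == 4, just like the (FLI, (1, 1, 1, 1)) example above
```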

This was just a very brief overview. Much more can be done with Map/Reduce. Wikipedia is (as always) a good entry point: https://en.wikipedia.org/wiki/MapReduce

This is an illustration of how the map/reduce was used for this particular problem:


The Implementation

Most steps of the process are executed automatically by Hadoop. The only custom code to write is the mapper code and the reducer code.

In this example, Python was used as the implementation language for the mapper and the reducer. Here is the code:

mapper.py

#!/usr/bin/env python
import sys

def main(separator='\t'):
    for line in sys.stdin:
        line = line.rstrip()                 # remove the trailing newline
        if line.startswith('>'):             # skip name lines (FASTA headers)
            continue
        for n in range(len(line) - 2):       # indices of all triples
            print '%s%s%d' % (line[n:n+3], separator, 1)

if __name__ == "__main__":
    main()


reducer.py

The reducer code was taken from a great blog post by Michael Noll: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/

I recommend this blog post for additional info on how to write map/reduce code.

#!/usr/bin/env python
"""A more advanced Reducer, using Python iterators and generators."""

from itertools import groupby
from operator import itemgetter
import sys

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_mapper_output(sys.stdin, separator=separator)
    # groupby groups multiple word-count pairs by word,
    # and creates an iterator that returns consecutive keys and their group:
    #   current_word - string containing a word (the key)
    #   group - iterator yielding all ["<current_word>", "<count>"] items
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            total_count = sum(int(count) for current_word, count in group)
            print "%s%s%d" % (current_word, separator, total_count)
        except ValueError:
            # count was not a number, so silently discard this item
            pass

if __name__ == "__main__":
    main()

Combiner

Combiners are useful for reducing the amount of data that has to be transferred and processed during the shuffle step, especially if mappers run on different machines.
They can be introduced as additional processes that are executed directly after the map step but before the shuffle step. In this example, reducer.py is also used as the combiner. This means that the results of each mapper are reduced directly after the mapping process.

A combiner works only on the results of each mapper separately! So if there is more than one mapper process, combining can drastically speed up the whole process (by pre-aggregating the data), but the reduce step is still needed to process the data of all mappers.
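A small sketch of this effect in plain Python (the pairs are made up): each mapper's raw (k, 1) output is combined locally before anything is shuffled, so fewer pairs travel across the network, yet the final reduce still has to merge the per-mapper results:

```python
from collections import Counter

def combine(pairs):
    # pre-aggregate one mapper's output locally
    # (same aggregation logic as the reducer: summing the 1s)
    return list(Counter(key for key, _ in pairs).items())

mapper_a = [("SSS", 1), ("SSS", 1), ("SSL", 1)]   # raw output of mapper 1
mapper_b = [("SSS", 1), ("SSL", 1), ("SSL", 1)]   # raw output of mapper 2

combined = combine(mapper_a) + combine(mapper_b)
# 6 raw pairs shrink to 4 combined pairs before the shuffle

totals = Counter()                 # the global reduce step
for key, count in combined:
    totals[key] += count
# totals: SSS -> 3, SSL -> 3
```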

Cluster creation and process management

To run a map/reduce process on Amazon, you first have to create a cluster and add an execution step. This can be done via the GUI or via the AWS command line interface.

In this example, the AWS CLI was used.

Furthermore, an S3 bucket with the following subfolders needs to be created:

  • Folder for log files
  • Folder for source files
    • Here the python code files are placed
  • Folder for input file(s)
  • Folder for output files
You can name these folders as you wish.

Locally you need:
  • steps.json

Cluster creation command

aws emr create-cluster --release-label emr-4.0.0 --use-default-roles \
--log-uri s3://yourS3bucket/logfolder --enable-debugging \
--applications Name=Hadoop --ec2-attributes AvailabilityZone=YOUR-ZONE \
--steps file://./steps.json \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=4,InstanceType=m3.xlarge \
--auto-terminate


steps.json

This file needs to be placed locally (= on your machine) and serves as the definition of the step you want to execute. It contains information on where to find the necessary files for map/reduce:

[
  {
     "Name": "Protenom analyzer",
     "Type": "STREAMING",
     "ActionOnFailure": "CONTINUE",
     "Args": [
         "-files",
         "s3://yourS3bucket/src/mapper.py,s3://yourS3bucket/src/reducer.py",
         "-mapper",
         "mapper.py",
         "-reducer",
         "reducer.py",
         "-combiner",
         "reducer.py",
         "-input",
         "s3://yourS3bucket/inputFiles",
         "-output",
         "s3://yourS3bucket/outputFiles"]
  }
]

After starting the cluster, you can monitor the cluster in the AWS web console:


In this example, 1 master and 4 worker (virtual) machines were started as a cluster to run the map/reduce processes.

The output files are written to the corresponding folder specified in steps.json. In the log folder you will find detailed logs of the whole map/reduce run, including the execution time, the number of spawned mappers/combiners/reducers, etc.

How many of these processes and tasks are spawned is determined automatically by Hadoop. However, you can configure a lot of parameters to influence the system. Please consult the corresponding documentation. This is a good starting point: http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-what-is-emr.html

Freitag, 11. März 2016

Resources on how to learn C++

With great power comes great responsibility! This applies to C++. Whereas most "modern" programming languages offer you a more abstract view of the underlying operating system and its resources, with C++ you are able to work with the system directly.
If you are a programmer and you want to or have to learn C++ for your next project, this is the way I did it and - may I say - I'm very satisfied with how fast I was able to learn a good amount of C++ with these resources.

Three important things

With C++ you can of course cover a wide range of programming techniques in different domains. However, when I learn something new, I try to find the core concepts of the thing I have to learn. With C++, there are the following three concepts every programmer of that language must be aware of:

  1. Object Oriented Programming in general
    C++ is all about object orientation. Therefore, OOP is a real must when learning C++. However, chances are that you already have experience in another OO language such as Java or C#.
    C++ is different in some details (e.g. there are no interfaces in C++, but you can use multiple inheritance (of abstract classes), which is not available in Java or C#). But in most cases, C++ fits well with what you might already know about OOP.
  2. The C++ core language
    Obviously, you have to learn the core language. This also includes C++ templates - at least their basics. (Templates may seem similar to what you might know as generics in other languages, but templates are actually much more powerful in C++.)
    It also includes memory management - which might be new to you if you have only used programming languages with garbage collectors before. Memory management can be a real pain if you don't know what you are doing. But if you use the appropriate techniques, the complexity of memory management is abstracted away, making your code more stable and flexible. Regarding memory management (and in general) you have to know about:
    1. Objects (both in heap and on the stack)
    2. References (as aliases to objects)
    3. Pointers (raw pointers)
    4. Smart pointers ("managed" pointers that take care of memory management themselves)
  3. The Standard Template Library (STL)
    The STL is for C++ what the respective standard frameworks are for Java and C#. You could write everything from scratch using pure C++, or you can use the implementations of the STL. The STL offers you very powerful, performant and secure implementations of the most important data structures and algorithms. You have to learn about the following concepts:
    1. Containers such as vector, map, set or list - how they work and which to use for which type of problem
    2. Algorithms (that can be fed by containers)
    3. Iterators (that act as abstractions for containers, so that the same algorithms work with different kinds of containers)

Resources I used in the beginning of learning C++

Here are some recommendations for resources I used in the beginning. Unless mentioned otherwise, all resources are designed for at least C++11, which added a lot of new features to C++ and the STL. If you start learning C++, always check that the resources you look at are written for at least C++11. That said, you might find a lot of very good books for older versions as well.

Online video lectures

Introduction to C++:

The online course "C++ for C Programmers" on Coursera by Ira Pohl (from the University of California, Santa Cruz) provides an amazing introduction to all three of the core concepts mentioned above. Even if you don't do the exercises (which are highly recommended), watching the videos alone will give you a very good jump start into programming C++.
It covers a good amount of work with data structures, as well as object-oriented and algorithmic thinking. As a "bonus", you will learn (and implement) the most important algorithms on graphs :-)

Digging into the STL:

There is a terrific course available on Channel 9: C9 Lectures: Stephan T. Lavavej - Standard Template Library (STL)
Stephan T. Lavavej is the maintainer of the Visual Studio C++ Standard Template Library implementation. So in this course you will learn about the STL from a person who actually implements one of its implementations. The course is very comprehensive and also provides you with exercises and showcases. After this course you should definitely know all the differences between the important container data structures I mentioned above and exactly when to use which container.

Books

There are really tons of books available for C++. I researched on the internet which books I should get when I started learning, and here are the ones I can personally recommend to you:

A tour of C++

This book is written by Bjarne Stroustrup, the creator of C++. Here, Bjarne gives a quick overview of the language, the STL, what is possible and how to use the features C++ provides. Don't expect any details here, though!

Kindle / ebook
Paperback


C++ Primer

In addition to the quick tour of C++, there are lots of "big books" available. While Bjarne Stroustrup himself also wrote one of those big books, I decided to get myself a different one - just to have C++ presented from yet another angle.

The C++ primer is written by people with manifold experience who coded in C++ for companies like Microsoft, IBM, Bell Laboratories, AT&T and Pixar. This book provides an in-depth view on all the important concepts and lots of exercises.

Kindle / ebook
Paperback


Effective C++

This book is not intended for beginners. However, in my opinion, it is a must-read for every C++ programmer. It contains tons of material on how to make your code more robust, secure and performant. It covers all those important and subtle details that can make the difference between a good program and a mess.

Paperback

Effective Modern C++

This is the counterpart of Effective C++ for the new features brought in by C++11 and C++14. This book is also highly recommended for anybody wanting to switch from an older version.

Kindle / ebook
Paperback


More of the Effective C++ - Series

As I am very satisfied with this series, I'm going to get myself the other books as well. These are:

More Effective C++: 35 New Ways to Improve Your Programs and Designs

This is part 2 of Effective C++.

Effective STL: 50 Specific Ways to Improve Your Use of the Standard Template Library

I'm definitely looking forward to this one, as it specifically covers the usage of the STL - the bread and butter of every C++ program.