Sequence Analysis Algorithms for Bioinformatics Application


Master's Thesis, 2014

88 Pages, Grade: N


Excerpt


Table of Contents

List of Figures

List of Tables

List of Abbreviation

Chapter 1: Introduction
1.1 Bioinformatics
1.2 Bioinformatics cycle
1.3 A recap of molecular biology
1.3.1 DNA
1.3.2 RNA and transcription
1.3.3 Proteins
1.4 Sequence analysis
1.5 Sequence alignment
1.5.1 Pairwise sequence alignment
1.6 Motivations
1.7 Objectives
1.8 Main contributions

Chapter 2: Literature Review
2.1 Pairwise sequence alignment algorithms
2.1.1 Dynamic programming pairwise sequence alignment algorithms
2.2 Accelerating DP pairwise sequence alignment
2.3 Graphics processing unit ( GPU ):
2.3.1 GPU programming model
2.3.2 Threads
2.3.3 GPU Memory :
2.4 Accelerating sequence alignment on GPU :
2.5 Conclusion

Chapter 3: Gene Tracer algorithm
3.1 Overview
3.2 Gene tracer algorithm
3.3 Gene Tracer Flow chart
3.4 Implementation of Gene Tracer program
3.4.1 For DNA :
3.4.2 For Proteins :
3.5 Conclusion

Chapter 4: Gene Tracer by scanning database on GPU (GTDBGPU) algorithm
4.1 GTDBGPU algorithm
4.2 Modifications of striemer’s local alignment algorithm On GPU for GTDBGPU
4.3 GTDBGPU algorithm
4.3.1 SW sequence alignment on GPU function
4.3.2 Gene tracer with proposed modifications
4.4 Gene Tracer with scanning database using CPU vs GPU
4.4.1 GTDBCPU(3 nested loop)
4.4.2 GTDBGPU ( 2 nested loop)
4.5 Gene Tracer with scanning database on GPU Flow chart
4.6 GTDBGPU algorithm outputs
4.6.1 For proteins :
4.6.2 For DNA :
4.7 GTDBGPU’s execution time on GPU vs CPU
4.8 Effect of number of threads per block on execution time
4.8.1 GPU’s Occupancy computing equations
4.8.2 Steps to determine the suitable number of threads per block

Chapter 5: Conclusion and Future work

References .

List of Figures

Figure 1.1 Bioinformatics cycle

Figure 1.2 Double stands of DNA forms as double helix

Figure 1.3 Translation from Gene To DNA

Figure 2.1 pairwsie sequence alignment

Figure 2.2 Backtracking scoring matrix

Figure 2.3 Global alignment vs Local alignment

Figure 2.4 CPU organization versus GPU one

Figure 2.5 Memory bandwidth for CPU and GPU

Figure 2.6 FLOPS for CPU and GPU

Figure 2.7 programming model for programming on GPU

Figure 2.8 Heterogeneous programming model of GPU

Figure 2.9 Access time between processors and different memory of GPU device

Figure 2.10 Flow chart of G.Striemer implementation on GPU

Figure 2.11 An example of G.Striemer implementation on GPU

Figure 3.1 : Gene tracer function

Figure 3.2 : Output of Gene Tracer for DNA sequences

Figure 3.3 : Output of Gene Tracer for proteins (Ancestor1 and offspring sequences)

Figure 3.4 : Output of Gene Tracer for proteins (Ancestor 2 and offspring sequences)

Figure 4.1 The first step of (GTDBGPU) algorithm

Figure 4.2 The second step of (GTDBGPU) algorithm

Figure 4.3 Common parts between Sequence 1 and Query

Figure 4.4 Common parts between Sequence 1 and Query

Figure 4.5 GT – DB GPU’s output for protein sequence 1 and offspring

Figure 4.6 GT – DB GPU’s output for protein sequence 2 and offspring

Figure 4.7 GT – DB GPU’s output for DNA sequence 1 and offspring

Figure 4.8 GT – DB GPU’s output for DNA sequence 2 and offspring

Figure 4.9 GPU vs CPU execution time for GTDBGPU

Figure 4.10 Speed up ratio vs Query sequence length

Figure 4.11 Occupancy calculator result of 64 thread

Figure 4.12 GTDB GPU time vs # of threads per block

List of Tables

Table 2-1 GPU memories

Table 4-1 part of BLOSUM 50 substitution matrix

Table 4-2 substitution matrix for DNA

Table 4-3 part of coincidence matrix for proteins

Table 4-4 Coincidence matrix for DNA

Table 4-5 Differnce between using BLOSUM50 and Coincidence matrices

Table 4-6 Execution time of gene tracer on GPU and CPU

Table 4-7 GPU's occupancy vs # of threads

List of Abbreviation

illustration not visible in this excerpt

Acknowledgments

I would like to thank all of Prof. Ibrahim Ziedan, Prof. Ahmed Al zohairy and Dr.Hitham Abo bakr, for their great efforts during their supervision on the thesis.

Great thanks to Nvidia company for supporting us with two powerful GPU units to perform this work.

Not forget to mention my colleague Eng. Ahmed Al Mansi, for his support in proposing web interface tools.

Last but not least, I would like to acknowledge and extend my heartfelt gratitude to my honest wife and daughter Lamar for their support and encouragement which has made the completion of this dissertation possible.

Mohamed Issa

Preface

The common theme of this dissertation is proposing an algorithm that extracting the most nearest two sequences ( ancestors ) to a query sequence (offspring ) with scanning biological database using GPU. Also it determines parts of similarity and difference between the 3 sequences and percentage of similarities. This algorithm called Gene Tracer with scanning database using GPU (GTDBGPU ). The material is organized in five chapters. Most of the material has been published in international journal and Jhon Wiley publisher.

Some of the material in Chapter two are published as chapter entitled as “Accelerating Pairwise Alignment Algorithms by Using Graphics Processor Units “ in a book entitled as “Biological Knowledge Discovery Handbook: Preprocessing, Mining and Post processing of Biological Data ” [M.Issa , Wiley , 2012].

Some of the material in Chapter Three “Gene Tracer algorithm” is proposed and published in IJCA as “GeneTracer: Algorithm Tracing Genes Modification from Ancestors through Offsprings ” [M.Issa , Vol 52 , No 19 ,2012].

Thesis’s works were included in a scientific project from H3ABIONET [http://www.h3abionet.org].

Abstract

In bioinformatics, it is very important to relate common matched parts between DNA or Protein sequences. Gene Tracer algorithm was proposed to perform this function. Gene Tracer based mainly on local sequence alignment. Local alignment determines the common parts between sequences. Gene Tracer extends this function, so it has three sequences as inputs. It determines common parts between one of the sequences and each one of the other two sequences. Then it locates common substrings in each sequence ( The two ancestors and offspring ). Another important application is determining ancestors sequences of an offspring before it is used in any process. Gene Tracer was extended to scan a large biological database to get the nearest two sequences to offspring sequence. Due to database was large, this application was implemented on Graphical Processing Unit “GPU” to decrease execution time of algorithm. GPU computes alignment for all sequences in database with query in parallel. But CPU executes alignment for all sequences serially. So GPU gives significant reduction of application’s execution time than CPU. Execution time also was decreased by increasing occupancy of GPU. Occupancy of GPU means keeping GPU’s resources as busy as possible.

There are 3 contributions for this research work. i) Gene Tracer algorithm was proposed to relate DNA or Protein sequences. ii) Gene Tracer with scanning database using GPU algorithm was proposed to search for offspring’s ancestors. iii) Effect of GPU’s occupancy on execution time was discussed, where execution time of an application was decreased by increasing occupancy.

Keywords : Bioinformatics, Gene Tracer algorithm, GPU.

Chapter 1: Introduction

In this chapter, we give an overview of Bioinformatics with special high light to pairwise sequence alignment. The motivation of our research is presented next. Main objectives and research challenges are then manifested. Main contributions are introduced and the rest of the thesis is finally outlined.

1.1 Bioinformatics

Bioinformatics is a mixture field of biological and information sciences, which study methods of storing, retrieving and analyzing biological data. Biological data include nucleic acid (Deoxy ribonucleic acid “DNA” or Ribonucleic acid “RNA”) sequence and protein sequence. This analysis may require the use of different techniques such as algorithms, databases, data mining, web technologies, image processing, modeling and simulation, software engineering, artificial intelligence and statistics. Bioinformatics is useful in many fields such as drug design and genetic engineering.

1.2 Bioinformatics cycle

Bioinformatics cycle consists of three stages as shown in figure 1.1. In the first stage, biological data, which represent informations, are extracted from biological experiments in life science. In the second stage biological data are stored in biological database. In the third stage several operations may be performed on biological sequences that retrieved from database such as statistical analysis, visualization, prediction and modeling using information science techniques. The ultimate goal of statistical bioinformatics is to statistically identify significant changes in biological processes for the purpose of answering biological questions. For example if the biologist want to determine originality of biological sequence that was extracted from biological experiments. Analysis tools are used to compare this sequence with sequences in biological databases to determine the most nearest matching sequence for biologist.

Abbildung in dieser Leseprobe nicht enthaltenFigure ‎1.1 Bioinformatics cycle

1.3 A recap of molecular biology

In the following a short recapitulation of the basics of molecular biology is mentioned.

1.3.1 DNA

DNA is the basis of inheritance. It is a polymer consisting of small molecules called nucleotides which include four bases adenine (A), cytosine (C), guanine (G) and thymine (T). A DNA sequence is a series of the four alphabets {A, C, G, T}. DNA occurs in double complementary strands where there is a hydrogen bond between A & T also a hydrogen bond between G & C. Two strands of DNA forms a double helix as shown in figure 1.2 [1].

illustration not visible in this excerpt

Figure ‎1.2 Double stands of DNA forms as double helix [1].

1.3.2 RNA and transcription

Genetic informations are carried by DNA. Informations in DNA are copied to new copies called mRNA (messenger RNA) and this process is called transcription. The DNA nucleotides A,C,G,T are respectively transcribed into mRNA nucleotides U (uracil (U) replaces thymine (T) in RNA molecules),G,C,A. After transcription to mRNA a translation process to protein occurs as shown in figure1.3.

1.3.3 Proteins

Proteins are polypeptide chains, which consist of 20 different alphabet of amino acids. Three mRNA nucleotides encode one of 20 amino acids [1].

illustration not visible in this excerpt

Figure ‎1.3 Translation from Gene To DNA [1]

1.4 Sequence analysis

DNA or protein sequence analysis is a basic research area of bioinformatics. Sequence analysis means analyzing DNA, RNA and proteins sequences using many analytical methods for understanding their features, functions and structures. Sequence alignment or sequence comparison is a public methodology for performing sequence analysis process [23].

1.5 Sequence alignment

Sequence alignment is the mainly important problem in computational biology. It is used to determine which parts in sequences are a like and which ones are differ [4]. Also it is used to measure the grade of similarity between two or more sequences which gives conclusion about the homology between them [5]. Sequence alignment between two sequences called pairwise alignment and that between more than two sequences is called multiple alignment [6].

In this research work pairwise sequence alignment is focused. Protein or DNA sequences with unknown originality or functionality are often compared with other known sequences stored in large scale biological database (size ~ 300,000 sequence ) to determine their originality or functionality through their alignment. Determining the origin of a sequence is depending on extracting the most nearest matching sequence that has the highest alignment score from biological database [7].

1.5.1 Pairwise sequence alignment

Pairwise sequence alignment can be done by hand for very short sequences, so alignment of lengthy, highly variable or extremely numerous sequences cannot be done by human efforts.

In fact there are many alignment algorithms such as dynamic programming ( DP ) that produce high quality sequence alignment. Generally there are three kinds of DP sequence alignment, 1) the local sequence alignment developed by smith – waterman (SW) [8], 2) the global alignment developed by Needleman [9] and 3) semi global alignment [10].

The suitable kind of DP sequence alignment algorithms for searching database is the local sequence alignment. It extracts the most nearest matching sequence that has the longest substring matching between query sequence and each sequence in a database. This algorithm is slow, It consumes many hours to search for sequence about 1024 base and this is impractical for use by biologists but it has higher accuracy of alignment. Other algorithms based on heuristic methods which is fast were designed for searching large scale biological database, but it consumes many minutes (30 minute for sequence length about 1024 ) [11] and don’t guarantee accurate alignment result.

So DP sequence alignment is suitable for such problem due to it’s high accuracy. Scanning biological database time increases due to the rapid growth of it. A speed up technique for this algorithm is proposed in parallel manner using Graphical Processing Unit (GPU) to decrease execution time of alignment algorithms as possible.

1.6 Motivations

Pairwise sequence alignment is one of the main bioinformatics applications. As mentioned in section 1.5, its main usage to get the alignment between two sequences. This alignment may be used to get the similarity between the two sequences as global alignment or originality of a sequence as local alignment or used in DNA fragment assembly as semi global alignment.

We concentrate on pairwise local alignment that used to get originality for a sequence by determining the alignment of common longest nearest substrings between two sequences. Also is used to extract the nearest sequence from biological database for a query sequence depending on the maximum score of alignment’s scores between this query sequence and each sequence in database. This alignment scores not express the length of common substring. Also this algorithm extract alignment of common parts only but not locates it in each sequence.

Another important application is that many biologists need to find the relation between offspring sequences and two ancestors and also estimation percentage of contribution of each ancestor in offspring sequence or called similarity percentage. Such application cannot be computed by local alignment.

1.7 Objectives

This thesis concentrates on developing pairwise local alignment algorithm so that besides finding the common substrings between two sequences, it positions these substrings in each sequence. Also developing local alignment to perform function of finding the relations between an offspring sequence and two ancestors.

1.8 Main contributions

In the following main contributions in this thesis:

1 Developing pairwise local alignment so that it performs function of finding the relations between offspring sequences and two ancestors sequences and positioning the common parts in each sequence and also estimation of similarity percentage between offspring and each ancestors. This developed algorithm was called Gene Tracer.
2 Another development for Gene Tracer algorithm is instead input two ancestors sequences and offspring one, it extracts the nearest two sequences from biological databases that matching the maximum length of common parts for an offspring sequence. Then positions common parts in each sequence and computes similarity percentage.
3 For more development for Gene Tracer algorithm is implementing and executing it on Graphical Processing Unit “ GPU ” to decrease execution time as much as possible.
4 Study effect of the number of threads per block on GPU’s occupancy which affect on GPU’s execution time.

The rest of thesis are organized as follows :

Chapter 2 : Literature review

This chapter survey previous methods of pairwise sequence alignment and developments for accelerating it.

Chapter 3 : Gene Tracer

This chapter discuss gene tracer algorithm that was proposed to get relations and positions of similarity between three sequences (two ancestor and offspring sequence).

Chapter 4 : Gene Tracer by scanning database on GPU (GTDBGPU) algorithm .

This chapter discuss proposed algorithm to get the most nearest two sequences to a biological sequence on CPU and GPU, and compare execution time on CPU vs GPU. Also in this chapter more development to accelerate algorithm execution on GPU by discussing effect of GPU’s occupancy.

Chapter 5 : Conclusion and Future work

This chapter concludes thesis works and suggestion of future work.

Chapter 2: Literature Review

In this chapter; an introduction to pairwise sequence alignment algorithms and it’s main applications are presented. Also developments of local alignment algorithms are focused. Then an introduction to GPU and developments of pairwise local alignment using it are presented.

Sequence alignment used to determine portions of similarity and difference in two sequences. As shown in figure 2.1 two sequences are aligned where similar portions are aligned and different portions are aligned with gap). There are two main kinds of sequence alignment. The alignment between two sequences called pairwise alignment and alignment between more than two sequences called multiple alignment. In this research, pairwise alignment is focused.

illustration not visible in this excerpt

Figure ‎2.1 pairwsie sequence alignment.

2.1 Pairwise sequence alignment algorithms

There are two approaches for implementing pairwise alignment, a) DP methods which consumes long time of execution but gives highly accurate alignment, b) heuristic methods which may have small execution time but don’t gurantee accurate alignment.

2.1.1 Dynamic programming pairwise sequence alignment algorithms

Dynamic programming is a programming approach like the divideandconquer approach ,where the problem is solved by divides it to sub problems and the optimal solution may be obtained [12].

There are three types of DP pairwise sequence alignment a) global sequence alignment, b) local sequence alignment and c) semiglobal alignment.

1.Global Alignment

A pairwise global Alignment involves the alignment of the entire of two sequences. An example of global alignment, the following two sequences GTTCGTTG , GTTCTTG seem to be similar but there is extra nucleotide in the first sequence. So global alignment is used to determine position of similar and different portions in two sequences as follow:

illustration not visible in this excerpt

The indel or gap () in the lower sequence means that there is an extra nucleotide in upper sequence. A DP algorithm called Needleman and Wunsch [13] was proposed for pairwise global alignment. A way for aligning two sequences S 1 and S 2, with respective lengths m and n using pairwise global alignment may be performed as follow:

(i) A Scoring matrix M of size (m+1)*(n+1) is constructed and initialized using a substitution matrix, such as PAM (Percent Accepted Mutations) [13], BLOSUM (BLOcks SUbstitution Matrix) [14]. Line by line scores are added starting from the left upper cell to the right lower cell where the following equation is used [15]:

illustration not visible in this excerpt

where P is a gap penalty, se is the score between the character at position i in S 1 and the one at position j in S 2.

(ii) Backtrace the scoring matrix by building a path was called maximum score path such as in figure 2.2, which gives an optimal pairwise alignment, such path starts from the lowest right cell to the upper left cell and three types of possible movements are allowed:

(a) Diagonal movement: This movement corresponds to the passage from a cell (i, j) to a cell (i 1, j 1).

(b) Vertical movement: This movement corresponds to the passage from a cell (i, j) to a cell (i 1, j).

(c) Horizontal movement: This movement corresponds to the passage from a cell (i, j) to a cell (i, j 1).

illustration not visible in this excerpt

Figure ‎2.2 Backtracking scoring matrix

A) Global sequence alignment algorithm

illustration not visible in this excerpt

End

Time complexity of the algorithm of Needleman and Wunsch is O (m * n) and space complexity is O (m * n) where m, n are lengths of the two sequences.

B ) Example of global alignment

illustration not visible in this excerpt

Scoring matrix ( H ) :

illustration not visible in this excerpt

The bold cells are the optimal alignment path starting from the lower right cell and ending at the upper left cell. The first row and column are not included in matrix but to clarify the algorithm.

illustration not visible in this excerpt

Application of global alignment is comparing two genes with almost the same function, for example comparing human’s gene versus mouse’s gene [17].

2.Local Alignment :

A pairwise local Alignment involves the alignment of portions of two sequence as shown in figure 2.3 but global alignment aligning the entire of two sequences. A DP algorithm SW alignment algorithm [8] was proposed for pairwise local alignment .

illustration not visible in this excerpt

Figure ‎2.3 Global alignment vs Local alignment

The equation used for calculation of the scoring matrix is as follows[2]:

illustration not visible in this excerpt

Where M is scoring matrix, S 1 is sequence 1 and S 2 is sequence 2, se is the score between the character at position i in S 1 and the one at position j in S 2 and P is a constant gap penalty. With m and n respectively being the lengths of the sequences S 1 and S 2 [2].

Global alignment differs than local alignment in two points. 1) In local alignment any cell of scoring alignment matrix has minimum score zero, but in global alignment score may be lower than zero. 2) In local alignment backtrace start from cell has maximum score and stop at first cell has score zero, but in global alignment back trace start from lower right cell and end at upper left cell in scoring matrix.

illustration not visible in this excerpt

End

Time complexity of the algorithm of Smith and Waterman and also space complexity is O (m * n) [8].

B ) Example of local alignment

Input : Q_seq=GACGG , D_seq = ACGA

For Match score = 1 , Mismatch = 1 , Constant gap penality = 1

illustration not visible in this excerpt

Scoring matrix H :

Starting from the maximum score and ending at first 0 in the optimal alignment path. The maximum score express the length of common substring.

Output :

illustration not visible in this excerpt

Application of local alignment is searching for local similarities in large sequences (e.g., newly sequenced genomes) [17].

3.Semi – global Alignment

A pairwise semiglobal alignment [14] is like pairwise global alignment but neglecting start gaps and end gaps. Where start gaps are gaps that occur before the first character in a sequence ,and end gaps are that occur after the last character in a sequence [4]. An overlap of two sequences is considered an alignment where start and end gaps are ignored.

A ) Semi Global alignment algorithm with constant gap algorithm

// input , output the same as local and global alignment .

illustration not visible in this excerpt

End

Time complexity of Semi – Global alignment algorithm and also space complexity is O (m * n).

B ) Example of Semi Global alignment algorithm

illustration not visible in this excerpt

For Match score = 1 , Mismatch = 1 , Constant gap penality = 1

illustration not visible in this excerpt

Scoring matrix : (H)

Output :

illustration not visible in this excerpt

Application of Semi – Global alignment algorithm are DNA assembly fragment [17].

2.2 Accelerating DP pairwise sequence alignment

Time and space complexities of the three kinds of DP pairwise alignment is large and increases as biological sequence length increases. So many efforts were performed to reduce time and space complexties.

Space is often the limiting factor when computing sequence alignment for long biological sequences, many biological literature have proposed space saving strategies like Hirschberg DP algorithm [18] that requires only space proportional to the sum of the sequence lengths. So the space complexity of sequence alignment algorithm be linear. But the time complexity still quadratic.

For time complexity , many development are proposed to overcome slowely execution of DP sequence alignment such as heuristic methods . Many heuristic sequence alignment were proposed that attempting to keep as much sensitivity as possible but don’t guarantee accurate alignment. Such as FASTA [19], BLAST[20], SSEARCH [21].

FASTA is a heuristic tools proposed by lipman and pearson that find local similar regions between two sequences but faster than dynamic programming approach. Although the FASTA algorithm is faster than any of the previous algorithms, it is not guaranteed to find the optimal alignment between the two sequences.

Another development to FASTA algorithm is BLAST (Basic Local Alignment Search Tool) that approximate measure of alignment as DP alignment and similar to FASTA but faster than FASTA. SSEARCH is another heuristic tool that was proposed to accelerate execution time of sequence alignment algorithm faster than FASTA and BLAST.

With the new sequencing technologies, the number of biological sequences is increasing exponentionally. In addition, the length of of these sequences is large, hundreds of bases. So alignment between a query sequence and the sequences in biological databases becomes expensive in computing time and memory space.

So other developments like parallel solutions developed for accelerating this process. These parallel techniques such as using multi core CPU that reduce complexity of time such as an algorithm was developed and implemented on 8 core multiprocessor [22].

Also there are many hardware accelerators [23] may be used to accelerate sequence alignment algorithm such as General purpose multicore, Field Programmable Gate Arrays “FPGAs” and Graphics Processing Unit “GPU”.

Many parallel techniques take the advantage of single instruction multiple data “SIMD”. where the same instruction executed on many cores and do the same function but in different data. Such as Farrar [24] compute the SW algorithm using SSE2 SIMD on intel processors. Farrar’s algorithm implementation achieve 28 speed up than other implementations using SIMD techniques . Farrar’s implementation was then optimized by Rognes [25] to further enhancing the performance. Rognes implemented the stripped algorithm on SSE3 and Linux 64 bit.

FPGAs have been used to implement SW in many solutions . Zhang et al. [26] propose a field programmable gate array (FPGA) solution for the SW algorithm. Their solution provides 250 times faster than a CPU version running on a 2.2 GHz Opteron processor.

They presented impressive speedups over software implementations. However, they are still not considered to be commodity hardware and their programming interface is rather complex. Due to the limited storage, FPGAs cannot produce the alignment for huge sequences.

Due to FPGAs are expensive and has limited space storage, GPUs are used to do sequence alignment for huge sequences and with highly speed up than implementation on FPGAs. In the following section GPU’s architecture and programming are discussed briefly.

2.3 Graphics processing unit ( GPU ):

Graphics processing unit is a device that was designed to accelerate the computation of Graphics operations. GPU is basically an electronic device consisting of many processors with memory. It was used to accelerate image building for output to a display unit. GPU has high memory bandwidth and high computational capabilities and higher parallel structure that makes it useful as a general purpose unit for other applications rather than any imaging applications like computational biology, financial work and many other applications.

A GPU consists of many multiprocessors and a large Dynamic RAM (DRAM). Each multiprocessor is coupled with a small cache memory and consists of a large number of cores, i.e., Arithmetic Logical Units (ALUs), controlled by a control unit.

Comparison between GPU and CPU is as shown in figure 2.4. A CPU consists of a small number of ALUs and a large DRAM and a single large cache memory. GPU has large number of ALUs (Arithmatic Logic Unit) that give it high computational power in performing data parallel computations faster than CPU. GPU has small cache memory and control unit for each set of ALUs. Also GPU has a higher bandwidth between memory and processing elements (ALUs) that make it performing operations faster.

illustration not visible in this excerpt

Figure ‎2.4 CPU organization versus GPU one [27]

As shown in figure 2.5 show memory bandwidth for the CPU vs GPU. For different versions of GPU and CPU memory bandwidth which measured as number of gigabyte per second of GPU is higher than CPU’s memory bandwidth.

Figure 2.6 show the evolution of FLoatingpoint Operations Per Second (FLOPS) .Number of floating point per operation in GPUs is higher than that in CPUs. Figures [2.5 , 2.6] show speed and powerful computation speed of GPU over CPU.

GPUs are used in game consoles, embedded systems, mobile phones and computers. In a computer, a GPU can be found on a video card, or on the motherboard. Most of the new desktop and notebook computers have integrated GPUs.

GPUs were originally designed to accelerate computer Graphics algorithms. However, their high computational capabilities and their highly parallel structure opened up to them a wide a range of other fields, like scientific computing [28], computational geometry [ 29 ] and bioinformatics [30]. In bioinformatics, mainly GPUs were adopted to accelerate pairwise alignment algorithms.

illustration not visible in this excerpt

Figure ‎2.5 Memory bandwidth for CPU and GPU [27]

illustration not visible in this excerpt

Figure ‎2.6 FLOPS for CPU and GPU [27]

2.3.1 GPU programming model

GPU can be programmed using OpenGL [31] or new released library from Nvidia called the Compute Unified Device Architecture (CUDA) [32]. CUDA is a scalable parallel programming model and a software environment for parallel computing, and also is an extension for C programming language.

A program that can be executed on GPU is written in C language. It consists of two parts serial and parallel. The serial part is executed on CPU and the parallel part is executed on GPU . The parallel part is written as a function called kernel which is executed on GPU.

There are two compilers one for C code and the other for GPU code called NVCC compiler proposed by Nvidia [27]. NVCC output compiled parallel part to be executed on GPU. CPU executes serial code and launches kernel to GPU grid to execute the parallel code where grid consists of blocks of thread that is assigned to kernel as in figure 2.7.

Then CPU execute another serial code till reaches another parallel code and launch another kernel to another GPU grid as in figure 2.8. This manner of execution is called heterogeneous execution.

illustration not visible in this excerpt

Figure ‎2.7 programming model for programming on GPU [27]

illustration not visible in this excerpt

Figure ‎2.8 Heterogeneous programming model of GPU [27]

2.3.2 Threads

Executing parallel program on GPU is performed by launching a kernel on a grid of blocks and each block consist of set of threads. All threads execute the same kernel but on different data.

Threads are grouped in blocks where there is a limit on the number of threads per block. Blocks can be grouped in a grid as shown in figure 2.7 where each grid launches a certain kernel and many grids can do many different functions in parallel. There are many cores on GPU each core executes a thread and many threads are scheduled on this core.

An advantage of these threads of GPU than threads of CPU is that they are light weight meaning that they don’t waste time for scheduling or launching a core. Threads can be synchronized at certain break point on a kernel function which means each core reaches a breakpoint of synchronization stalls execution for this thread until all threads reach the same break point.

2.3.3 GPU Memory :

GPU memory consists of an on – chip memory with processors like shared , local memory and registers and off chip memory which is separated from processors like global, constant and texture memory. Table 21 show difference between GPU memories.

illustration not visible in this excerpt

Large

Table ‎2‑1 GPU memories

There are differnet access time to different memory that is as shown in figure 2.9. This figure show access time to each memory of GPU by processors where global memory has the highest access time around 400 – 600 cycle. Access time for constant cache, texture cache and shared memory approximately is the same as access time to registers 1 cycle.

illustration not visible in this excerpt

Figure ‎2.9 Access time between processors and different memory of GPU device [11, 27]

2.4 Accelerating sequence alignment on GPU :

GPUs have a massively parallel architecture. With GPUs impressive speedups can be achieved using a programming model that is simpler than the one required for FPGAs. GPU programming is based on OpenGL [31] and now on CUDA parallel programming languages which is the computing engine in NVIDIA Graphics processing units [27].

The first implementations of SW algorithm on GPUs are described in [ 33 ]. This implementation uses OpenGL. The implementation has two versions, the first one is with traceback for computing alignment and the second one is without traceback. Using an Nvidia GeForce 7800 GTX card, their implementation achieves a 10fold speedup over SSEARCH .

SWCUDA [34] is used CUDA for sequence alignment implementation on GPU . Each GPU thread computes the whole alignment of the query sequence with one database sequence. The threads are grouped in a grid of blocks during execution. In order to make the most efficient use of the GPU resources , the computing time of all the threads in the same grid must be as near as possible. Experimental studies have been done to compare SW CUDA running on two Geforce 8800 GTX with BLAST [20] and SSEARCH [21], running on a 3 GHz Intel Pentium IV processor. The execution times of CUDA implementation were up to 30 times faster than SSEARCH and up to 2.4 times faster than BLAST. SW CUDA was also 3 times faster than Single Instruction Multiple Data (SIMD) implementation [24].

SW – CUDA needed to be improved to utilize the full resources on the GPU because it use local memory of GPU which is the slowest memory resource on GPU.

An effective development [35] is proposed that using onchip shared memory that reducing amount of data transfer from global memory to processing elements in a GPU. That gives result of reducing data fetch amount to 1 / 140.

An implementation proposed by Striemer [11] is 23 times speed than SSEARCH. SW computation matrix are computed purely using on GPU in Striemer’s Implementation. It works in 3 stages, 1) load biological sequences database to GPU’s global memory. 2) Each thread used to compute alignment score between a query sequence and each sequence in a database. 3) Alignment scores back to CPU to determine highest alignment score. So this implementation used to give alignment score only not the alignment.

Unlike SW CUDA, this implementation don’t involve using of CPU for partial SW computations, it were done purely using GPU. Another difference that this implementation use constant memory of GPU to save query sequence and substitution matrix because constant cache has access time as register access time as shown in figure 2.9.

Striemer’s implementation of SW alignment using GPU has two nested loops as shown in figure 2.10.

illustration not visible in this excerpt

Figure ‎2.10 Flow chart of G.Striemer implementation on GPU [11].

For each loop of the outer loops character of database sequence is got. Then the inner loops compute alignment scores between this database sequence character and all query sequence characters. Constant memory contain query sequence and BLOSUM50 [14], so accessing constant cache will reduce execution time due to access to constant cache need 0 – 1 cycle as shown in figure 2.9. Also shared memory used to save temporary calculations of SW computation, this also reduce execution time because access to shared memory need 1 cycle. After computing SW computation matrix for each sequence in database with query sequence using GPU, alignment score for each database sequence saved in global memory of GPU and transfer all scores to CPU to determine the sequence that has the highest alignment score. So this implementation used to compute alignment scores only not the alignment of sequences.

An example of striemer’s alignment on GPU is as shown in figure 2.11. It computes alignment score between database sequence has 4 residue (character) and query sequence has 8 residue, where cell calculations matrix used to save temporary alignment scores for each column. Cell calculations matrix used to save temporary alignment score for each column to be used in computing alignment scores of next column. Initially cell calculations matrix is filled with zeros.

illustration not visible in this excerpt

Figure ‎2.11 An example of aligning query sequence 8 residues and database sequence has 4 residues [11].

Each step in figure as follow: 1) compute alignment scores in column of database sequence character ‘G’ 4 cell each time and then save this scores in cell calculations matrix. 2) Compute alignment scores between ‘A’ and query sequence in column 2 using previous scores in cell calculation matrix and then save it in cell calculation matrix. 3) Repeat step 1 & 2 for column 3 and 4 respectively. The largest score in each column is saved. Finally the largest score in all columns represents the alignment score between the two sequence.

2.5 Conclusion

In this chapter we listed the most famous available pairwise dynamic programming sequence alignment algorithms that are available and documented. We found that there are three alignment algorithms, 1) global alignment, 2) local alignment, semiglobal alignment. This thesis focuses on local alignment algorithm that is used to get similar portions between two sequences. We listed the development for local alignment algorithm in time and space complexity. Developments using hardware accelerator such as multicore CPU,FPGA and GPU were also mentioned. These hardware accelerators speed execution of local alignment algorithm.

We found that all development algorithms for local alignment algorithms used to get alignment between similar parts in two sequences.

In chapter 3, we proposed an algorithm that was development for local alignment algorithm called Gene Tracer. This algorithm determines longest common substrings between two sequences and is used to determine common portions between offspring sequence and it’s ancestors sequences and percentage of similarity.

In chapter 4, we developed Gene Tracer algorithm to extract the nearest two sequences ( ancestors ) to offspring sequence from biological database. GPU was used for speeding up executing of this algorithm and compared to execution of algorithm on CPU.

Chapter 3: Gene Tracer algorithm

In this chapter, we present a proposed algorithm , called Gene Tracer that is used to get relations between sequences to determine homology of them. This algorithm is used in application such as determining originality or functionality of sequences. In other words gene tracer is used to trace genes modification from ancestors sequences through offspring sequence. It tracks down genes modification in the ancestor sequences and finds related parts of each ancestor sequence in the offspring one.

3.1 Overview

Gene Tracer can find precisely the location of the ancestor sequences contribution inside the offspring one and gives statistical results that express the relationship between the two ancestor sequences and their offspring one as shown in figure 3.1.

illustration not visible in this excerpt

Figure ‎3.1 : Gene tracer function

Gene Tracer’s inputs are two ancestor sequences and one offspring sequence. Gene Tracer’s output is as shown in figure 3.1 the output is three sequences have colored parts which represent the matching parts. Longest matching parts common between ancestor 1 and offspring sequence are given a red color with length L and the total length of ancestor 1 is Z1 then matching percentage is L / Z1. The same was done with ancestor 2 and offspring but in blue color and percentage is K / Z2.

3.2 Gene tracer algorithm

Gene Tracer has three sequences A_Seq1 (ancestor 1), A_Seq2 (ancestor 2) and Off_Seq (offspring). Gene Tracer operates in two steps, The first step is that it constructs a local alignment [8] between A_Seq1 and Off_Seq. The second step is constructing another local alignment between A_Seq2 and Off_Seq. Based on the constructed alignments, The location of the ancestor sequences contribution inside the offspring one can be found precisely. Statistical results that express the relationships that exist between the two ancestor sequences and the offspring one may be found. A Gene Tracer based on Smith waterman algorithm [8] is presented in the following section and a complete description of Smith waterman algorithm was presented in chapter 2.

As stated in chapter 1, Gene Tracer algorithm was modified of SW local alignment algorithm. The modifications are as following:

1 Determining locations of common substrings in both ancestors and offspring sequences.

2 Distinguishing these common substrings by giving them clear and different color.

These modifications are as shown in the following algorithm.

Gene Tracer Algorithm

Inputs

illustration not visible in this excerpt

End

Where : Smith_Waterman (A_Seq1, Off_Seq, Match, NonMatch, ConstGap, L, i, j) is the Smith and Waterman algorithm [8] with modifications that it returns length of common substring and its location in each sequence and also distinguish this common substring in clear color in each sequence in stead of return score of alignment or alignment between common substrings only. It receives as input the sequences A_Seq1, Off_Seq, the match score Match, the non match score NonMatch and the constant gap value ConstGap. It returns L with maximum score in SW computation matrix which represent length of longest common substring. Also returns positions of end of common substring ( i ) in ancestors sequences (A_seq1 or A_seq2) and ( j ) in offspring sequence. Color_Seq (A_Seq1, iL, L) is a function to color a sequence from position (iL) to position L.

Gene Tracer algorithm is of complexity O (max (M, N)* P) in computing time and memory space, where M, N and P are respectively the lengths of A_Seq1, A_Seq2 and Off_Seq sequences.

3.3 Gene Tracer Flow chart

illustration not visible in this excerpt

3.4 Implementation of Gene Tracer program

The Gene Tracer algorithm was implemented and coded in php ( web programming language). It was executed on a 2.27 GHZ core i3 machine running windows 7 with 4 GB of RAM. In what follow, results of Gene Tracer program for DNA and Proteins sequences.

3.4.1 For DNA :

Inputs :

illustration not visible in this excerpt

Figure ‎3.2 : Output of Gene Tracer for DNA sequences

Figure 3.2 shows the obtained results: In the upper part of this figure, Locations of common substrings are represented graphically but for real matched portions between ancestro1 sequence and offspring sequence are in red and match percentage between these sequences is also given also execution time is presented. The same for ancestor2 sequence and offspring sequence in the lower part.

3.4.2 For Proteins :

Inputs :

Ancestor 1 :

Ancestor 2:: AERYCMRGVKNTAGELVSRVSSDADPAGGWCRKWYSAHRGPDQDAALGSFCIKNPGAAD

Offspring :

AGGWCRKWKQYENVNLIHYI

The output is as shown in figure 3.3 , 3.4

illustration not visible in this excerpt

Figure ‎3.3 : Output of Gene Tracer for proteins (Ancestor1 and offspring sequences)

illustration not visible in this excerpt

Figure ‎3.4 : Output of Gene Tracer for proteins (Ancestor 2 and offspring sequences)

3.5 Conclusion

In this chapter, Gene Tracer algorithm was presented. It has Three sequences ancestor 1 , ancestor 2 and offspring as input. It output the same input sequences but common substrings between ancestor 1 and offspring is presented in the red color and the same for ancestor 2 and offspring but in blue color. Also it estimates percentage of contribution of ancestors sequences in offspring sequence In the next chapter we will present the extended Gene Tracer algorithm that extracts the nearest two ancestors sequences to offspring sequence from biological database instead of be input for it. Also for more time development for algorithm it will be executed on GPU.

Chapter 4: Gene Tracer by scanning database on GPU (GTDBGPU) algorithm

Gene Tracer algorithm is used to determine the relationships and positions of common substrings between a certain query sequence (ex. offspring) and two other sequences ( ex. ancestors ). One of the main contributions is that the algorithm is extended to scan a biological database to extract the nearest two ancestors sequences to an offspring sequence based on the longest common substring between offspring sequence and each one of ancestors sequences.

Gene Tracer algorithm based on local alignment. Local alignment takes significant execution time on a CPU. Consequently, scanning database by Gene Tracer algorithm is extended to be executed on GPU to minimize execution time as possible because GPU has many cores and executed applications parallel as SIMD (Single Instruction Multiple Data) kind. Where each core execute the same application but on different data (sequences in database).

Another development for execution application on GPU was discussing influence of number of threads per block on GPU’s occupancy and so on execution time on GPU. Therefore one of main contributions is determining the suitable number of threads per block that give an occupancy 100% and decrease execution time as possible.

4.1 GTDBGPU algorithm

GTDBGPU algorithm operates in 3 steps. The first step is as shown in figure 4.1, i) GTDBGPU load sequences from database to GPU memory. The algorithm uses local sequence alignment [8] to extract the nearest 100 sequences to query sequence. ii) Sequence alignment between the query and each sequence in the database is used to determine their relations through a thread of a GPU for each sequence. iii) Alignment scoring matrix is computed for each sequence in the database using BLOSUM50 [36] substitution matrix for proteins is as shown in table 4.1 and table 4.2 for DNA and the constant gap is 10. The highest score in scoring matrix is preserved for each sequence. iv) Alignment scores of all sequences of database with query sequences are returned to the CPU. CPU extracts 100 sequences from database those have the highest 100 score and stores these 100 sequences in a new database called Gene Tracer database.

illustration not visible in this excerpt

Figure ‎4.1 The first step of (GTDBGPU) algorithm.

Table ‎4‑1 part of BLOSUM 50 substitution matrix .

illustration not visible in this excerpt

As shown in table 4.2 scoring substitutions matrix has a high score of 10 is given for matched nucleotides (characters). For A with T or G with C the score is 5 because they are in the same biological group. For A with C or G with T the score of 10 is given because they are in different biological group.

The second step , The nearest two sequences are extracted from the new database in the same way as in the first step as follow as shown in figure 4.2. i) 100 sequences in new databases are loaded to the GPU. ii) Alignment scores between query sequence and each sequence in gene tracer database is computed but using substitution matrix may be called coincidence matrix. Coincidence matrix for proteins as shown in table 4.3 and table 4.4 for DNA. Where for matched residues in proteins or nucleotides in DNA (residue or nucleotide is each character of sequence), score is 1 and for mismatched residues score is 1. For a constant gap is 1.

These scores are the length of longest common substring between a query sequence and each sequence of 100 sequences in Gene Tracer database .

illustration not visible in this excerpt

Figure ‎4.2 The second step of (GTDBGPU) algorithm.

Table ‎4‑3 part of coincidence matrix for proteins.

illustration not visible in this excerpt

Table ‎4‑4 Coincidence matrix for DNA .

illustration not visible in this excerpt

Coincidence matrix is used for alignment because score of alignment using this matrix is the length of longest common substring. But scores that were computed using BLOSUM50 matrix don’t represent the length of longest common substring that is as shown in the following example.

Query = LMNCCH, Sequence 1 = CCPKLM, Sequence 2 = LMNPA

Result of alignment’s scores using BLOSUM50 and Coincidence matrices are as shown in table 4.5.

Table ‎4‑5 Differnce between using BLOSUM50 and Coincidence matrices

illustration not visible in this excerpt

Common substring between query and sequence 1 is : CC

Common substring between query and sequence 2 is : LMN

As shown in the example, although sequence 1 has higher score than sequence 2 using BLOSUM50 substitution matrix, but sequence 2 has longer common substring with query than sequence 1’s common substring with query. So it’s clear that local alignment’score using coincidence substitution matrix represent the real length of matched characters between two sequences.

One of main contributions of this thesis is development for local alignment algorithm is to use first BLOSUM50 matrix to compute the alignment between sequences of database and query and extract the best 100 sequences that have the highest alignment scores with query sequence. Then using coincidence matrix to compute the alignment score between the 100 sequences and query sequence. Score using coincidence matrix represents length of common substring between any sequences of 100 and the query sequence. Positions of each score are preserved. Where positions are row and column numbers in each sequence’s scoring matrix in the database. Where row number means end of longest common substring in query and column number means end of longest common substring in sequence. as shown in the following.

Scoring matrix for alignment of Query and Sequence 2:

illustration not visible in this excerpt

The maximum score in matrix is the length of common parts. The row number is the shaded row from maximum score and represent end of common substring in query. The column number is the shaded column from maximum score and represent end of common substring in sequence 2.

So the result of alignment between two score resulting of using coincidence substitution matrix is as follow, the common parts are bold: Query: LMN CCH

Sequence 2 : LMN PA

iii ) Scores and positions are returned to the CPU. So two sequences that have the longest common substring with query are determined depending on this scores.

The two sequences are preserved in text file with a developed format called tracer format as shown in the following.

illustration not visible in this excerpt

Tracer format :

Q1 : position in query of the end of common substring between query and sequence 1.

S1: position in sequence 1 of end of common substring between query and sequence 1.

L1 : the length of common substring between query and sequence 1.

Q2 : position in query of the end of common substring between query and sequence 2.

S2: position in sequence 2 of end of common substring between query and sequence 2.

L2 : the length of common substring between query and sequence 2.

As the previous example :

Query = LMNCCH, Sequence 1 = CCPKLM, Sequence 2 = LMNPA

illustration not visible in this excerpt

The third step, Text file that have the two sequences and query with alignment’s score in tracer format is uploaded a web interface tool that is developed to show the common parts between query and each sequence in the same color, also shows the matching percentages of common substring length to the length of a query.

As the previous example :

illustration not visible in this excerpt

Figure ‎4.3 Common parts between Sequence 1 and Query.

illustration not visible in this excerpt

Figure ‎4.4 Common parts between Sequence 1 and Query.

4.2 Modifications of striemer’s local alignment algorithm On GPU for GTDBGPU

An algorithm developed by M. Striemer [11] is used to get local alignment using GPU. As mentioned in chapter 2 it has some features such as i) It is 23 times faster than SSEARCH [21]. ii) It used BLOSUM 50 as substitution matrix. iii) query sequence and substitution matrix are stored in constant memory as hard coded because access to constant cache of GPU for reading is fast as access to register if all threads read the same address [27].

The following are proposed modifications to striemer’s algorithm on GPU.

1 ) Striemer’s algorithm is modified so that it finds protein or DNA sequence that has the highest alignment score instead of protein sequence only. 2) Striemer’s algorithm is developed also to do function of gene tracer. After getting alignment score of each sequence in database, the highest 100 sequences score are extracted and saved in new database. Alignment scores of these 100 sequences with query are recomputed but using coincidence matrix and positions of scores in each sequence. Then two sequences with highest scores are extracted and are preserved in text file and a developed web tool used to show the common substrings between sequences in text file in clear and difference color. 3) Another modifications is determining the suitable number of threads per block by calculating occupancy [27] instead of using just 64 thread as proposed by striemer’s algorithm.

The following algorithm shows Gene Tracer algorithm with scanning database on GPU.

4.3 GTDBGPU algorithm

Input :

illustration not visible in this excerpt

End

4.3.1 SW sequence alignment on GPU function

illustration not visible in this excerpt

End

Time complexity of SW_GT_GPU function is O (m * n) and space complexity is (O (m*100)+ 3*O(100 )) where n length of query and m length of maximum sequence length in database .

Then Time complexity of GTDBGPU algorithm is O (m * n) and space complexity is (O (m*k)+O(k)) where k is sequence count in database, n length of query and m length of max sequence in database .

4.4 Gene Tracer with scanning database using CPU vs GPU

The difference between GTDBCPU and GTDBGPU algorithms is that in GTDBGPU each thread of a GPU computes alignment between query and one sequence from database. So alignment of all sequences in database with query is computed in parallel. But in GTDBCPU, alignment of all sequences in database with query is computed serially. The difference is shown in the following.

4.4.1 GTDBCPU(3 nested loop)

illustration not visible in this excerpt

4.4.2 GTDBGPU ( 2 nested loop)

illustration not visible in this excerpt

So GTDBCPU algorithm has total time complexity O (m * n*z) and space complexity is (O (m*z)+ 3*O(z)) where n is the length of query, m is the length of maximum sequence length in database and z is the sequences count in database.

4.5 Gene Tracer with scanning database on GPU Flow chart

illustration not visible in this excerpt

Input Query sequence to constant memory

4.6 GTDBGPU algorithm outputs

The GTDBGPU algorithm was applied for both DNA sequences and proteins sequences with a query sequence , results obtained are as follow :

4.6.1 For proteins :

Query input :

MSFQNPSYINAKHRSFLQPKDTQDSQDLRNWVSHSSVDEETAYSSSTLSSSSSKSFPCYDFEAAEMPETDVTITPQYVMDRVGKVADDRD

The text file contain results of alignment in tracer format is as shown in the following.

illustration not visible in this excerpt

This mean common substring between query and sequence 1 has length 60 and end in position 60 in sequence 1 and end in position 60 in query. Also common substring between query and sequence 2 has length 30 and end in position 450 in sequence 2 and end in position 90 in query.

Uploading text file of result :

illustration not visible in this excerpt

Figure ‎4.5 GT – DB GPU’s output for protein sequence 1 and offspring.

illustration not visible in this excerpt

Figure ‎4.6 GT – DB GPU’s output for protein sequence 2 and offspring .

As shown in figure 4.5 and figure 4.6 common parts between query & sequence 1 are in red color also matching percentage between query and sequence 1 are computed, where matching percentage equal to the length of common part divide by the length of query sequence .The same for sequence 2 with query.

4.6.2 For DNA :

Results of DNA Query sequence :

Text result file :

illustration not visible in this excerpt

This mean common substring between query and sequence 1 has length 25 and end in position 60 in sequence 1 and end in position 25 in query. Also common substring between query and sequence 2 has length 50 and end in position 247 in sequence 2 and end in position 85 in query .

Uploading text file of result :

illustration not visible in this excerpt

Figure ‎4.7 GT – DB GPU’s output for DNA sequence 1 and offspring.

Abbildung in dieser Leseprobe nicht enthaltenFigure ‎4.8 GT – DB GPU’s output for DNA sequence 2 and offspring.

4.7 GTDBGPU’s execution time on GPU vs CPU

GTDBGPU algorithm was implemented on a GPU Tesla C2075 . It has total memory 6 GByte and 448 CUDA cores. Each core is 1.15 GHZ. The algorithm was implemented using CUDA C extension in Microsoft visual studio environment. GPU execution time is measured and compared with execution time on CPU using device with core i3 multiprocessor each core is 2.27 GHZ. Main memory is 4 GByte and run windows 7 64 bit. The tested biological database contains 300,000 sequence.

Results of execution time on GPU vs CPU are as shown in table 4.5 .

Table ‎4‑6 Execution time of gene tracer on GPU and CPU using 64 thread per block.

illustration not visible in this excerpt

Figure ‎4.9 GPU vs CPU execution time for GTDBGPU.

illustration not visible in this excerpt

Figure ‎4.10 Speed up ratio vs Query sequence length.

These results show that implementation of algorithm on GPU is 140 faster than execution on CPU due to parallel execution of algorithm on a GPU where each thread computes alignment between query and a sequence in database. CPU computes alignment between query and each sequence in database serially.

4.8 Effect of number of threads per block on execution time

In this section effect of the number of threads per block on occupancy and so execution time will be discussed. The number of threads is determined depending on calculating occupancy and then it’s effect on execution time. This is one of main contributions in thesis.

To optimize execution of programs on GPU, multiprocessors on the device should be kept busy. So it is very important to design applications by using threads and blocks in a way that maximizes hardware usage. Threads of blocks are divided into warps and each multiprocessor executes some of this warps. As the number of warps increases per multiprocessor, multiprocessors will be more busy because if a warp is paused or stalled then another warp is executed and this decreases latencies and keeps hardware as busy as possible. A metric that is used to measure utilization of hardware called occupancy and is a key measure for GPU efficiency[37]. Occupancy is the ratio of the number of active warps per multiprocessor to the maximum number of possible active warps per multiprocessor.

Limitation factors for occupancy are 1) the number of active threads per block. 2) The number of registers used in the kernel. 3) The amount of shared memory allocated for block in the kernel. Equations that are used to determine occupancy is shown in the following [27].

4.8.1 GPU’s Occupancy computing equations

Active Warps per Multiprocessor = A_Th_MP * No_W_B

illustration not visible in this excerpt

Warp size = 32 , As defined by Nvidia [27]

A_Th_MP: Number of active thread blocks per multiprocessor

A_Th_MP= Minumum ( A1, A2,A3)

A1: The maximum number of thread blocks per multiprocessor limited by maximum blocks per multiprocessor (physical limit ) .

, As defined by Nvidia [27]

illustration not visible in this excerpt

For devices with compute capability 2.0 :

A3: The maximum number of thread blocks per multiprocessor limited by shared memory per multiprocessor.

For devices with compute capability 2.0 :

illustration not visible in this excerpt

Occupancy calculator spreadsheet on excel was designed depending on these equations by Nvidia to measure occupancy. Occupancy 100 % is the best utilization of hardware. The best value of the number of threads per block is determined so that give occupancy 100 %.

Striemer’s algorithm [11] assumes number of threads per block 64. The number of registers used is 13. The shared memory per block is 1152 Byte.

In occupancy calculator , occupancy was measured using these values and found to be 33 %, result is as shown in figure 4.11. This mean the number of active warps to multiprocessor is 33 % of maximum number of warps per multiprocessor, so this number of threads was not acceptable. As shown in figure 4.11 there are 3 sections. Section 1, determine the compute capability of GPU device used and shared memory size per multiprocessor.

illustration not visible in this excerpt

Figure ‎4.11 Occupancy calculator result of 64 thread.

Section 2, limitation factors on occupancy are fed. Section 3, GPU occupancy data is computed and displayed such as active warps per multiprocessor, active threads per multiprocessor and occupancy. Occupancy is 33 %. The last section includes physical limits for GPU that are constant for device such as warps per multiprocessor is the maximum number of warps allocated to multiprocessor.

Effect of changing number of threads per block on occupancy.

Number of threads per block is increased by 32 value and computing occupancy and execution time[27]. So at number of threads is 96 thread and occupancy will be 50 % and execution time is 185 at a query length 1024 . Therefore when occupancy decreases, execution time decreased also. Number of threads is increased by 32 due to warp size is 32[27].

Number of threads is increased by 32 and measure occupancy until it reach 100 % at 192 thread. By calculating execution time at 192 thread, it’s 172 second . That is as shown in table 4.7.

Table ‎4‑7 GPU's occupancy vs # of threads.

illustration not visible in this excerpt

Figure ‎4.12 GTDB GPU time vs # of threads per block

As shown in table 4.7, there are small max difference in execution time (~ 17 sec ). But this show effect of occupancy on execution time. So determining the suitable number of threads per block is very important. In some applications increasing occupancy may not improve performance [27], so measuring occupancy and execution time correspond to it, and then the number of threads that give minimum execution time is determined. In the following are steps to determine suitable number of threads for application.

4.8.2 Steps to determine the suitable number of threads per block

1 Assume suitable initial number threads.
2 Calculate shared memory per block (as mention in occupancy equations ) and using occupancy calculator to determine occupancy.
3 If occupancy less than 100 %, increase number of threads per block by 32.
4 Measure occupancy and execution time and repeat step 3 if occupancy less than 100 % until it reach 100%.
5 Choose the number of thread give minimum execution time that in many cases at occupancy 100 % but in some cases increasing occupancy don’t improve performance.

Chapter 5: Conclusion and Future work

Three main contributions are proposed in this thesis. i) Gene Tracer algorithm was developed based on pairwise local alignment algorithm and is used to determine and locate common parts between DNA or Protein sequence (offspring) and other two DNA or Protein sequences (ancestors) and also computes percentage of contributions of ancestors sequences in offspring sequence. ii) Gene Tracer algorithm was extended, so it determines origins of a sequence by scanning database on GPU. This extended algorithm was abbreviated GT DB GPU. The main application of GT DB GPU algorithm is extracting ancestors sequences of an offspring sequence from biological database using GPU and then locates common parts between offspring and each one of the two ancestors. Algorithm’s implementation on GPU has maximum speed up 140 times than implementation on CPU. iii) More reduction in execution time on GPU was reached by increasing occupancy to 100 %. It was found that the number of threads per block increases so occupancy increases. For query sequence having a length 1024, database contains 300000 sequences, the number of thread was 64 per block, so occupancy was 33 % and execution time was 189 second on GPU. For the same query sequence and database increasing the number of thread to 192 per block, occupancy was 100 % and execution time was 172 second on GPU. There was 9 % reduction in time. Steps of determining suitable number of threads per block were listed.

To the above contributions, an important point is the limitations of GPU usage. GPU’s cost and Developing software for GPU are the two important limitations for GPU usage. GPU’s cost is added to total cost of a PC.

The GPU device that was used in testing and gave significant optimization is more expensive than CPU. Developing software for GPUs is more complex than doing the same for CPU code. It is not easy to recommend the use of GPU for all applications due to its cost and development of its software. But if one must have better performance for applications, then GPUs might be an attractive option.

For future work, I suggest development of the GT DB – GPU algorithm so that it can deal with many offsprings at the same time and extracted the ancestors for each offspring sequence that has certain properties. Also I suggest using cloud computing system as an environment of execution and its usage is compared to GPU.

References

[1] A. W.Liew , H.Yan and M.Yang, “Pattern recognition techniques for the emerging field of bioinformatics : A review ”, the journal of Pattern Recognition society , October 2006 .
[2] Mourad Elloumi, Albert Y. Zomaya,” Algorithms in Computational Molecular Biology: Techniques, Approaches” and Applications edited , , 2011, John Wiley & Sons .
[3] W.pearson, D.lipman , “Improved tools for biological sequence comparison”, Proc. Nat.Acad. Sci. USA , Vol. 85, pp. 24442448, April 1988 .
[4] J. SETUBAL , J. MEIDANIS ,” introduction of computational molecular biology ” , PWS PUBLISHING 1997.
[5] K.R. Sharma , “Bioinformatics Sequence Alignment and Markov Models”, McGrawHill 2009.
[6] J. C. Venter, M. D. Adams, E. W. Myers, et al., “The sequence of the human genome,” Science 291 (2001), 1304–1351 .
[7] W. Liu, B. Schmidt, G. Voss, A. Schroder, and W. MullerWittig, “ BioSequence Database Scanning on a GPU “, in Proc. 20th IEEE International Parallel & Distributed Processing Symposium (IPDPS’06), 5th IEEE International Workshop on High Performance Computational Biology) Workshop (HICOMB’06), Rhode Island, Greece : (2006).
[8] T. F. Smith, M. S. Waterman, “Identification of common molecular subsequences”, J. Molecular Biology, no. 147, pp. 195197, 1981.
[9] C. S. B Needleman, C. D. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins”. Journal of molecular biology, vol. 48, no. 1, pp. 443453. 1970.
[10] S.Coull, B.Szymanski , “Sequence alignment for masquerade detection”, Computational Statistics and Data Analysis 52 (2008) 4116–4131
[11] Gregory M. Striemer and Ali Akoglu ,” Sequence Alignment with GPU: Performance and Design Challenges”, IEEE Xplore,2009
[12] T.Cormen , E.Rivest and C.Stelin ,”introduction to algorithms ”, The MIT Press Cambridge, Massachusetts London, England .
[13] M. O. Dayhoff, R. M. Schwartz, B. C. Orcutt,” A model of evolutionary change in proteins, in Atlas of Protein Sequence and Structure”, chapter 22, National Biomedical Research Foundation, Washington, DC: (1978), p345–358.
[14] S. Henikoff, J. G. Henikoff, “Amino acid substitution matrices from protein blocks”, Proc. Natl. Acad. Sci. USA, Vol. 89, N°22: (1992), p1091510919.
[15] M . Elloumi , Y. Zomaya , “ Algorithms in Computational Molecular Biology: Techniques, Approaches and Applications “ , 2011, John Wiley & Sons .
[16] X. Huang, K. M. Chao, A generalized global Alignment algorithm, Bioinformatics”, Vol. 19, N°2: (2003), p228–233.
http://www.seas.gwu.edu/~simhaweb/cs151/lectures/module12/align.html .
[18] D.Hirschberg. “A linear space algorithm for computing maximal common sub expressions”. Communications ofthe ACM, 18(6)
[19] W.pearson, D.lipman , “Improved tools for biological sequence comparison”, Proc. Nat.Acad. Sci. USA , Vol. 85, pp. 24442448, April 1988 .
[20] S.Altschul , W.Gish, W.Miller, E.Myers , and D. Lipman. “Basic local alignment search tool”. Journal of Molecular Biology, 1990
[21] W. R. Pearson, “Searching protein sequence libraries: Comparison of the sensitivity and selectivity of the SmithWaterman and FASTA algorithms,”Genomics, vol. 11, no. 3, pp. 635–650, Nov. 1991 .
[22] A. M.Hosny , H. A Shedeed , A. S. Hussein and M. F. Tolba “An Efficient Solution for Aligning Huge DNA Sequences” , International Journal of Computer Applications ,Vol 32 , October 2011 .
[23] S. Sarkar, T. Majumder, A. Kalyanaraman, “Hardware Accelerators for Biocomputing: A Survey” , IEEE Explorer , 2010 .
[24] M. Farrar, “ Striped SmithWaterman speeds database searches six times over other SIMD implementations”, Bioinformatics, Vol. 23, Issue 2 : (2007), p156161.
[25] Rognes Torbjørn, "Faster SmithWaterman database searches with intersequence SIMD parallelisation," BMC Bioinformatics, vol. 12, no. 1, p. 221, 2011.
[26] P . Zhang, N. Sun, X. Jiang, X. Liu. L. Xu, "A reconfigurable accelerator for smithwaterman algorithm," Circuits and Systems II: Express Briefs, IEEE Transactions on, vol. 54, no. 12, pp. 1077 1081 , December 2007.
[27] Nvidia Programming Guide http://developer.Nvidia.com/Nvidiagpuprogrammingguide .
[28] J. Krüger and R. Westermann, Linear algebra operators for GPU implementation of numerical algorithms, ACM Transactions on Graphics (TOG), in Proc ACM SIGGRAPH’03, Vol. 22, Issue 3 : (July 2003), p908–916.
[29] P. Agarwal, S. Krishnan, N. Mustafa, and S. Venkatasubramanian. Streaming geometric optimization using graphics hardware, in Proc. 11th Annual European Symposium on Algorithms (ESA’03), Budapest, Hungary : (September, 2003).
[30] M. Charalambous, P. Trancoso, and A. Stamatakis. Initial experiences porting a bioinformatics application to a graphics processor, in Proc. 10th Panhellenic Conference on Informatics (PCI’05), Volos, Greece : (November 2005).
[31] D. Shreiner, M. Woo, J. Neider, and T. Davis, “ OpenGL Programming Guide”, 5th ed. Reading, MA: AddisonWesley, Aug. 2005.
[32] Nvidia CUDA parallel programming (http://www.Nvidia.com/object/cuda_home_new.html)
[33] Y. Liu, W. Huang, J. Johnson, S. Vaidya, “ GPU accelerated SmithWaterman “, in Proc. Computational Science (ICCS’06), Lecture Notes in Computer Science, Vol. 3994, SpringerVerlag, Berlin, Germany : (2006), p188195.
[34] S. A. Manavski and G. Valle, “CUDA compatible GPU cards as efficient hardware accelerators for SmithWaterman sequence alignment”, BMC Bioinformatics, Vol. 9 (Suppl 2), S10 : (2008).
[35] Yuma Munekawa, Fumihiko Ino , Kenichi Hagihara , “Design and implementation of the SmithWaterman Algorithm on the CUDACompatible GPU“ , 2008 .
[36] http://alg.csie.ncnu.edu.tw/DataSet/BLOSUM%2050.txt.
[37] http://docs.nvidia.com/cuda/cudacbestpracticesguide/index.html

Excerpt out of 88 pages

Details

Title
Sequence Analysis Algorithms for Bioinformatics Application
Grade
N
Author
Year
2014
Pages
88
Catalog Number
V280808
ISBN (eBook)
9783656747918
ISBN (Book)
9783656747871
File size
1665 KB
Language
English
Keywords
sequence, analysis, algorithms, bioinformatics, application
Quote paper
Mohamed Issa (Author), 2014, Sequence Analysis Algorithms for Bioinformatics Application, Munich, GRIN Verlag, https://www.grin.com/document/280808

Comments

  • No comments yet.
Look inside the ebook
Title: Sequence Analysis Algorithms for Bioinformatics Application



Upload papers

Your term paper / thesis:

- Publication as eBook and book
- High royalties for the sales
- Completely free - with ISBN
- It only takes five minutes
- Every paper finds readers

Publish now - it's free