#### **CO421 - Presentation**

## Accelerating Adaptive Banded Event Alignment Algorithm on FPGAs using OpenCL

**Group 07** E/15/123 Wishma Herath E/15/280 Pubudu Premathilaka E/15/316 Suneth Samarasinghe Supervisors Prof. Roshan Ragel Mr. Hasindu Gamaarachchi (UNSW)

### Terminology

**Genome** : a long sequence composed of four types of nucleotide bases

4 Bases : adenine (A), cytosine(C), guanine (G) and thymine (T)
Sequencing : the process of reading strings of contiguous bases
Reads : the resulting strings of bases from sequencing

## Introduction

## DNA sequencing is the most powerful method to reveal genetic variations at the molecular level

### The process of determining precise order of Nucleotides within a DNA molecule

Sugar phosphate backbone

U.S. National Library of Medicine

### **Oxford Nanopore Sequencing Technology (ONT)**

- An enzyme unwinds DNA feeding one strand through a nanometer size protein pore.
- Unique shapes of DNA bases cause disruption in electrical current



### **Oxford Nanopore Sequencing Technology (ONT)**

- Base calling
- Sequence alignment
- Downstream analysis (Polishing)
  - Methylation calling:
    - Events detection
    - Signal space alignment
    - Hidden Markov model



# X lma Problem Definition) = Imx + x + x $e^{(e^{+}+1)}$ $(a^{x}) = d^{x} d^{x} d^{x} d^{x}$

(106(3)) = 30

 $\left( \prod_{n \in X} (1 + X) \right) = \frac{1}{1 + X}$ 

 $\left(\frac{e_{1}}{e_{+1}}\right) \neq \left(\frac{e_{+1}}{e_{+1}}\right)$ 

 $-\left( ln \times \right) =$ 

(lvgax)=

 $\frac{4}{(e^{x})} = e^{x}$ 

## ABEA is one of the most time consuming steps when analyzing raw nanopore data



### **Original ABEA**

Nanopore long reads can be >1M bases long

>10<sup>12</sup> computations Hundreds of GB of RAM Takes ~ 48-64 hours

A fine-tuned version of ABEA is used in a recent work (called f5c by Hasindu Gamaarchchi et al.) - an optimized version of nanopolish to run on general purpose GPUs and CPUs.

### **According to the literature**

FPGAs are more power efficient relative to GPUs and provides reasonable performance for HPC

Literature suggests to do a Hardware-software codesign to get the best performance



# Related work (literature review)

## Takeaways from FPGA based accelerations using OpenCL (1/5)

From SWIFOLD by Rucci et al., 2018

• Use of smaller data types for kernel(Eg: ALMs Usage: int(32 bit) 89%, short(16-bit)

52%)

- Larger pipelines
- In SWIFOLD, regardless of the sequence length and sequence similarity they have achieved higher average performance
- The exploitation of OpenCL memory hierarchy has offered a considerable benefit in performance.

#### Takeaways from FPGA based accelerations using OpenCL (2/5)

From SW Protein Search by Rucci et al., 2015

• Data level parallelism

From KNN implementation by Fahad et al., 2016

• FPGA offers better power/ energy efficiency compared to GPU implementation

## Takeaways from FPGA based accelerations using OpenCL (3/5)

Design of FPGA-based computing systems with openCL by Waidyasooriya et al., 2017

- Performance improvement techniques for NDRange kernels
  - Specifying the Work-Group Size (Compiler level optimization)
  - Kernel Vectorization (SIMD)
  - Increasing the Number of Compute Units

## Takeaways from FPGA based accelerations using OpenCL (4/5)

Design of FPGA-based computing systems with openCL by Waidyasooriya et al., 2017

- Performance improvement techniques for Single Work Item kernels
  - Avoiding Nested Loops
  - Reducing Initiation Interval Due to Read-Modify-Write Operations to Global Memory
  - Ignore Loop-Carried Dependencies

## Takeaways from FPGA based accelerations using OpenCL (5/5)

Design of FPGA-based computing systems with openCL by Waidyasooriya et al., 2017

- Common performance improvement techniques
  - $\circ$  Use of Loop unrolling
  - Inserting the 'restrict' keyword in pointer arguments whenever possible

## Implementation

#### **Implementation Choices**

- OpenCL
  - OpenCL allows a programmer to use various compute devices(FPGA,GPU,CPU)
  - HDL is a time consuming and more complex. Therefore, HLS tools are favourable
  - Most of the HLS tools does not support interface between FPGA and CPU. Designer still needs to use HDL to design interface circuit
  - OpenCL allows designers to describe whole computation: computation on the host, data transfer between the host and accelerators, and computation on accelerators
  - Board Support Package(BSP) facilitates to use the same OpenCL code to different FPGA boards
- FPGA
  - Efficient power consumption when working as an accelerator
  - Reconfigurable hardware provides more flexibility

## Tools, devices and technologies

- Altera Stratix V GX (DE5-NET)
- Intel FPGA SDK for OpenCL
- Intel SDK for OpenCL
- Quartus Prime
- Visual Studio







## Implementation (1/2)

#### ABEA

Input:

ref [] : base-called read (1D char array) model : pore-model events [] : the output of the event detection step

#### Output:



|   |   | G | А | А | Т | Т | С | А | G | Т | Т | A |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|   | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| G | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| G | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2 |
| A | 0 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3 |
| Т | 0 | 1 | 2 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| С | 0 | 1 | 2 | 2 | 3 | 3 | 3 | 4 | 4 | 4 | 4 | 4 |
| G | 0 | 1 | 2 | 2 | 3 | 3 | 3 | 4 | 4 | 5 | 5 | 5 |
| A | 0 | 1 | 2 | 3 | 3 | 3 | 3 | 4 | 5 | 5 | 5 | 6 |

| k-mer  | mean              | sd              |
|--------|-------------------|-----------------|
| AAAAAA | μ <sub>0</sub>    | σ00             |
| AAAAAC | μ1                | σ1              |
| AAAAAG | μ2                | σ2              |
| AAAAAT | μ3                | σ3              |
| AAAACA | μ4                | σ4              |
|        |                   |                 |
|        |                   |                 |
| -      |                   |                 |
| 2      |                   |                 |
| πππ    | μ <sub>4095</sub> | $\sigma_{4095}$ |

## Implementation (2/2)

Phase 1 (Related to F5c)

• Flow of Execution



- Alignment function is divided into 3 kernels:
  - 1. **Pre-Kernel** : Initialising the first two bands of the dynamic programming table and pre-computing frequently accessed values by the next kernel.
  - 2. **Core-Kernel** : The filling of the dynamic programming table which is the compute intensive portion of the ABEA algorithm.
  - 3. **Post-Kernel** : Performs backtracking.
- Memory Pre-allocation

R annen fürfennen: A CONTRACTOR OF THE OWNER OWNER OF THE OWNER and the first of the 400 S 17 11 ADDED TO LEADED The same is the second to be used to state and the second s Carl Cours 2 Planter & Later Bill VAT 2 - D. CONTENSION . LELEWERCOSE 1 return the south and the second state the second state the second s he function conversion and length bl.gef. length while ge -is.setDocument-function(a) (var b.e.g-ala.ownerDocument) a:v:return glass&g----g.nodel.ownerDocument) ettributes is function(a) (return a.className "i", la.getAttribute("className") }),c.getElementsSylagh wetion(a) return o.appendChild(a).id=u, in.getElementsByName((in.getElementsByName(u).length) Edt Fur getAttribute("id")===b)}):(delete d.find.ID,d.filter.ID=function(a) (var b=a.replace(ba,ca);return funct unction a bireturn"undefined" != typeof b.getElementsByTagName?b.getElementsByTagName(a):c.qsa?b.querySelectorAll "-\r\\' msallowcapture=''><option s
lev a b){return"undefined"!=typeof b.getElementsByClassName& function(a, b){return"undefined"" lev ByClassName& function(a, b){return"undefined"!=typeof b.getElementsByClassName& function(a, b){return"undefined"!=typeof b.getElementsByClassName& function(a, b){return"undefined"!=typeof b.getElementsByClassName& function(a, b){re querySelectorAll("[name=d]").length&&q.push("name"+L+"\*[\*^\$]!~]?="),a.querySelectorAll(":enabled").length // push " thesSelector))&&ia(function(a){c.disconnectedMatch=s.call(a,"div"),s.call(a,"[s!='']:x"),r.push("!=" documentElement:a,d=b&&b.parentNode;return a===d||!(!d||1!==d.nodeType||!(c.contains?c.contains) eDocumentPosition-th.compareDocumentPosition: return d?d:(d=(a.ownerDocument//a)===(b.ownerDocument//b)?a. compare x(v,b)?1:k?3(k,a)-3(k,b):0:46d?-1:1)):function(a,b)(if(a===b)return l=!0,0;var c,d=0,e=a.parentNode.f= nshift call(a, b): if(d)) c.disconnectedMatch) [a.document&511 --- a.document.nodeType] return d)catch(e)() return fa call toLowerCase e void & return void & attributes getAttribute rCase to the december of extended on the substant of the slice @ sort as the worte to all a liber alt the end push it is white textContent for firstchild while the set of Leas. A well would be delevel. and the second designed to the second design of the 100.000 a a character attacks a Contraction (Contraction) The rate of the second second second second

## **Experimental setup**



#### **Host specifications**

#### **Device specifications**

| PLATFORM NAME    | Intel(R) FPGA SDK for           | DEVICE NAME               | Emulated Device              |  |  |
|------------------|---------------------------------|---------------------------|------------------------------|--|--|
|                  | OpenCL(TM)                      |                           | OpenCL 1.0 Intel(R) FPGA SDK |  |  |
| PLATFORM VERSION | Version 18.0                    | DEVICE VERSION            | for OpenCL(TM), Version 18.0 |  |  |
| CPU              | Intel Core i5-4200H 2.80GHz x 2 | DEVICE MAX CLOCK          | 1000 MHz                     |  |  |
| RAM (GB)         | 12                              | DEVICE GLOBAL MEM<br>SIZE | 12 GB                        |  |  |
|                  |                                 | DEVICE MAX CU             | 1                            |  |  |

### Dataset

Publicly available reads that aligned to a 2kb region in the E. coli draft assembly

| Sample     | E. coli str. K-12 substr. MG1655              |
|------------|-----------------------------------------------|
| Instrument | MinION sequencing R9.4 chemistry              |
| Basecaller | Albacore v2.0.1                               |
| Region     | "tig0000001:200000-202000"                    |
| Note       | Ligation-mediated PCR amplification performed |

|         |                 |                 | Mean read length | Max read length |
|---------|-----------------|-----------------|------------------|-----------------|
| Dataset | Number of reads | Number of bases | (Bases)          | (Bases)         |
| testset | 143             | 819102          | 5727             | 12618           |

## **Execution time (1/2)**

#### **OpenCL** implementation on Intel FPGA emulator

| Kernel      | Execution time (s) | Percentage |
|-------------|--------------------|------------|
| Pre-kernel  | 0.125              | 0.008%     |
| Core-kernel | 1501.740           | 99.922%    |
| Post-kernel | 1.045              | 0.070%     |

#### Explanation

The Core-kernel takes the longest time of around 99.9% of the total execution time since it is the most compute intensive step

## Execution time (2/2)

| Implementation                                | Total execution time (s) |
|-----------------------------------------------|--------------------------|
| OpenCL implementation on DE5net FPGA Emulator | 1503.126                 |
| CPU implementation on Host PC                 | 6.187                    |

around 240x slowdown

#### **Explanation**

Intel FPGA emulator has limited resources such as 1 GHz maximum clock frequency and one compute unit(CU) compared to a physical FPGA hardware. Since the PC acts as both the Host and the Emulator Device, the resources are allocated for both host platform and the emulator device.

## **Comparison with CPU implementation**

## Output

#### Per read:

N: number of event align pairs

| reference position | read position |   |     |
|--------------------|---------------|---|-----|
| x1                 | y1            | _ |     |
| x2                 | y2            |   | — N |
| x3                 | у3            | _ |     |
|                    |               |   |     |

#### Number of event align pairs comparison for FPGA and CPU implementations



CPU 📕 FPGA

read index

#### Difference vs. read index



read index

#### **Explanation**

Almost all reads gives exact same no. of event align pairs for both OpenCL implementation and CPU implementation. Others with deviation of 1-6. Only for 139th read has a deviation of 95. For now we suspect it is due to the floating point precision difference. In the next phase we hope to identify the reasons for this deviations.

#### 139th read



## **Further justification**

Source: Intel ® FPGA SDK for OpenCL <sup>™</sup> Standard Edition Programming Guide

- Emulator run uses floating point computation hardware of the CPU whereas the hardware run uses floating point cores implemented as FPGA cores.
- The emulation of channel behavior has limitations, In such cases, the Emulator might execute channel operations in an order different from that on the hardware.

(especially for conditional channel operations where the kernel does not call the channel operation in every loop iteration)

• Difference in channel depths might lead to execution on the hardware hangs while kernel emulation works without any issue.

#### B' GREDIUS

## **Conclusion and Future Directions**

## Conclusion

- ABEA algorithm is a key component in nanopore data analysis
- f5c has been fine-tuned to run on CPUs/GPUs
- Accelerations using FPGAs has a lower power requirement relative to GPUs
- The portable version of f5c can be superior to deploy in many hardware platforms

## **Future Directions**

## Include the OpenCL implementation into nanopolish package

## Adapt the hardware design into an embedded device to perform event alignment in real-time

## Work Progress

## Timeline

• Semester 7 progress and timeline

| Week Number                  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|------------------------------|---|---|---|---|---|---|---|---|---|----|
| Background Study             |   |   |   |   |   |   |   |   |   |    |
| Literature Review            |   |   |   |   |   |   |   |   |   |    |
| Extract the 'Align' function |   |   |   |   |   |   |   |   |   |    |
| Convert into OpenCL          |   |   |   |   |   |   |   |   |   |    |
| <b>Results Verification</b>  |   |   |   |   |   |   |   |   |   |    |

## **Plan for Semester 8**

- Evaluate performance on Altera Stratix V FPGA and compare with f5c
- Improve performance on FPGA through optimization techniques identified from literature review
- Deploy on different hardware platforms such as CPU, GPU and evaluate performance

