Extending the application of GPUs beyond processing accelerators

Leonel Sousa and Aleksandar Ilic and Frederico Pratas







- Graphics Processing Units (GPUs) in integrated circuits since 80's
  - Alpha-Numeric Television Interface Circuit (ANTIC) LSI Atari  $\mu$ C
- GPUs development mainly thrust by the gamming industry & market (e.g. Prince of Persia- 1<sup>st</sup> version1989; 2<sup>nd</sup> version 2008)

- Not only high computational power but growing at a high rate







Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa

technology

from seed

linescid

 Originally GPUs only dedicated for graphics but the high computational capacities woke up interest for using them to other applications





, inescid

- General purpose Computation on Graphics Processing Units (GPGPU: <u>www.gpgpu.org</u>)
  - Acceleration of specific functions, mainly operating as a coprocessor in host computers
    - Scientific computation (e.g., Fluid Dynamics)
    - Signal Processing (FFT, Filtering, Image and video processing,...)
    - High Performance libraries (BLAS,...)

•





- To use computational power of GPUs in comm computers:
  - CPU + GPU = heterogeneous system with slow communication
  - or as a computer component to design hardware processing structures

Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa

SUPERIOR



- GPU architectures and programming models
- Extending the application of GPUs
  - CHPS: Collaborative-execution-environment for Heterogeneous Parallel Systems
  - SDHA: applying the Stream-based-computing-model to Design Hardware Accelerators
- Conclusions and future work



### Graphics Operations: Rasterizationbased Rendering

technology from seed









26/8/2009

CE Colloquium, TU Delft

### Flow-model and Caravela Programming Tools for GPGPU



technology from seed

• Flow-model and Caravela programming tools (OpenGL, DirectX) S. Yamagiwa and L. Sousa, "Caravela: A Novel Stream-Based Distributed Computing Environment", IEEE Computer, May 2007





- Streaming Multiprocessor (SM): 8 cores (SP) + shared memory
- Texture/Processor Cluster 2 SMPs



technology from seed

- Warp groups of 32 parallel threads
  - Each SM manages a pool of 24 warps
  - SIMT controls execution and branching behavior of one thread
  - Large number of threads scheduled by hardware masks memory latency

- NVIDIA GT200 GPUs
  - A collection of SIMD SMs- up to 30
  - High memory bandwidth-up to 150GB/s
  - Connected via PCIe bus, with 2GB/s bandwidth



Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa



, inescid



technology from seed

- Compute Unified Device Architecture (CUDA)
  - Divide a kernel into thousands of parallel threads
  - Organize these threads into blocks, which run independently and in parallel in different SMs and share 16KB of memory each
  - Group blocks into grids
  - All threads have access to global memory
- Programming interface
  - Standard C programming languages with minimal extensions for parallel programming





 CHPS - Collaborative Execution Environment for Heterogeneous Parallel Systems

• **SDHA** - Applying the Stream-Based Computing Model to Design Hardware Accelerators



# CHPS Echnology from seed

- Commodity computers = **Heterogeneous systems** 
  - Multi-core general-purpose processors (CPUs)
  - Many-core graphic processing units (GPUs)
  - Special accelerators, co-processors, FPGAs, DSPs
- ⇒ Huge **collaborative** computing power
  - Not yet explored
  - In most research done target the usage of one these devices at time, mainly domain-specific computations
- Heterogeneity makes problems much more complex
  - many programming challenges



### **CHPS: Heterogeneous Systems**

### Master-slave execution paradigm

- Distributed-memory programming techniques
- CPU (Master)
  - Global execution controller
  - Access the whole global memory

### Interconnection Busses

- Reduced communication bandwidth comparing to distributed-memory systems
- Underlying Devices (Slaves)
  - Different architectures and programming models
  - Computation performed using local memories



technology

from seed



### **CHPS: Programming Challenges**

technology from seed

### Computation Partitioning

- To fulfill device capabilities/limitations and achieve optimal load-balancing
- Data Migration
  - Significant and usually asymmetric
  - Require detailed modeling
  - Potential execution bottleneck
- Synchronization

### • Different programming models

- Per device type and vendor-specific
- Vendors provide high performance libraries and software



, inesc id

- Application Optimization
  - Very large set of parameters and solutions affects performance

INSTITUTO SUPERIOR TÉCNICO

### **CHPS: Unified Execution Model**

technology from seed





### **Primitive jobs**

- Minimal problem portions for parallel execution (1<sup>st</sup> agglomeration)
- Balanced granularity: fine enough partition to sustain execution on every device in the system, but coarse enough to reduce data movement problems/overheads

### Job Queue

- Classifies the primitive jobs to support specific parallel execution scheme
- Accommodates primitive jobs' batching into a single job according to the individual device demands (2<sup>nd</sup> agglomeration)

2nd

agglomeration

26/8/2009

CE Colloquium, TU Delft

### **CHPS: Unified Execution Model**

technology from seed



### **Device Query**

- Identifies all underlying devices
- Holds per-device information (device type, current status, requested workload and performance history)

inesc id

### Scheduler

- Adopts a scheduling scheme according to devices' capabilities, availability and restrictions (currently, FCFS)
- Forms execution pair:
  <device, computational\_portion>
- Upon the execution completion, the data is transferred back to the host, and the device is free to support another request from the Scheduler



### **CHPS:** Case study **Dense Matrix Multiplication**

Matrix multiplication  $C_{M \times N} = A_{M \times K} \times B_{K \times N}$  is based on **sub-matrix partitioning** 

$$c_{i,j} = \sum_{l=0}^{K/R} a_{i,l} \times b_{l,j}, \quad 0 \le i \le \frac{M}{P}; 0 \le j \le \frac{N}{Q}$$

by creating a set of **primitive jobs**  $a_{i,l} \times b_{l,j}$ 

- Devices can request several primitive jobs at once •
- Single  $C_{i,i}$  updates are performed atomically •
- High performance libraries: •
  - ATLAS CBLAS [CPU], CUBLAS [GPU]





R

Р

 $\boldsymbol{O}$ 

Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa

K

Μ

technology

Ν

, inesc id

R

from seed



 $H = FFT_{1D}(FFT_{2D}(h))$ 

### Traditional parallel implementation requires transpositions

- between FFTs applied on different dimensions and after executing the final FFT

### **Our implementation**

 Only portions of data which are assigned to the device are transposed, and the results are retrieved on adequate positions after the final FFT



High performance libraries: *FFTW* [CPU], *CUFFT* [GPU]

### CHPS: Case study 3D Fast Fourier Transform



technology from seed

| Problem size                         |                | N <sub>1</sub>                              | N <sub>1</sub> N <sub>2</sub>                                           |                |
|--------------------------------------|----------------|---------------------------------------------|-------------------------------------------------------------------------|----------------|
| Primitive Job parameters             |                | P <sub>1</sub> P <sub>2</sub>               |                                                                         | P <sub>3</sub> |
| NI<br>PI<br>PI<br>NI<br>N2<br>ED FFT | Total Size     | $N_1$ 2D <i>FFT</i> s of a size $N_2 x N_3$ |                                                                         |                |
|                                      | Primitive Job  | ✓                                           | ×                                                                       | ×              |
|                                      | Job Queue Size | N <sub>1</sub> / P <sub>1</sub>             |                                                                         |                |
|                                      | Total Size     | $N_2 x N_3$ 1D <i>FFT</i> s of a size $N_1$ |                                                                         |                |
| N1<br>P3<br>N3<br>N2<br>1D FFT       | Primitive Job  | ×                                           | ✓                                                                       | ~              |
|                                      | Job Queue Size |                                             | (N <sub>2</sub> / P <sub>2</sub> ) x (N <sub>3</sub> / P <sub>3</sub> ) |                |



### CHPS: Experimental Results Optimized Dense Matrix Multiplication

inesc id

technology from seed

#### RESULTS

| Exhaustive search                | e search 32x32 tests |     |  |  |  |
|----------------------------------|----------------------|-----|--|--|--|
| Optimal load belonging           | GPU                  | CPU |  |  |  |
| Optimal load-balancing           | 13                   | 1   |  |  |  |
| Obtained speedup (comparing to): |                      |     |  |  |  |
| 1 CPU core <b>4.5</b>            |                      |     |  |  |  |
| 2 CPU cores                      | 2.3                  |     |  |  |  |
| 1 GPU <b>1.2</b>                 |                      | .2  |  |  |  |



GPU [CPU] Workload

| Single precision floating-point arithmetic (Largest problem executable by the GPU) |      |      |      |  |  |
|------------------------------------------------------------------------------------|------|------|------|--|--|
| Problem size                                                                       | N    | К    |      |  |  |
| Problem size                                                                       | 4096 | 4096 | 4096 |  |  |
| Primitive Job                                                                      | Р    | Q    | R    |  |  |
| parameters                                                                         | 4096 | 16   | 4096 |  |  |
| Job Queue size                                                                     |      | 32   |      |  |  |

|                        | CPU          | GPU                     |  |
|------------------------|--------------|-------------------------|--|
| Experimental Setup     | Intel Core 2 | NVIDIA<br>GeForce8600GT |  |
| Speed/Core (GHz)       | 2.33         | 0.54                    |  |
| Global Memory (MB)     | 2048         | 256                     |  |
| High Performance Softw | /are         |                         |  |
| Matrix Multiplication  | ATLAS 3.8.3  | CUBLAS 2.1              |  |
| 3D FFT                 | FFTW 3.2.1   | CUFFT 2.1               |  |



Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa

5

### **CHPS: Experimental Results Complex 3D FFT [2D FFT Batch]**

technology from seed



#### **Complex 3D fast Fourier transform**

linescid

(Problem not possible to execute on the GPU)

| Problem size   | N <sub>1</sub>           | N <sub>2</sub> | N <sub>3</sub> |
|----------------|--------------------------|----------------|----------------|
|                | 512                      | 512            | 512            |
| Primitive Job  | P <sub>1</sub>           | P <sub>2</sub> | P <sub>3</sub> |
| parameters     | 16                       | х              | х              |
| Job Queue size | 32 x 2D FFT<br>(512x512) |                | • •            |

### ıſi INSTITUTO SUPERIOR TÉCNICO

RESULTS

1 CPU core

2 CPU cores

1 GPU

Exhaustive search

Optimal load-balancing

Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa

Χ

GPU maximal workload

is 5x2D FFTs (512x512)

### CHPS: Experimental Results Complex 3D FFT [1D FFT Batch]

technology from seed

| inesc id |
|----------|
|----------|

#### RESULTS

| Exhaustive search      | austive search 416 tests |     |  |
|------------------------|--------------------------|-----|--|
| Ontimal load halanging | GPU                      | CPU |  |
| Optimal load-balancing | 12                       | 4   |  |

Obtained speedup (comparing to):

| 1 CPU core  | 2.2 |
|-------------|-----|
| 2 CPU cores | ~1! |
| 1 GPU       | X   |



**Complex 3D fast Fourier transform** (Problem not possible to execute on the GPU)

|              | Problem size   | N <sub>1</sub>        | N <sub>2</sub> | N <sub>3</sub> |
|--------------|----------------|-----------------------|----------------|----------------|
| Problem size | 512            | 512                   | 512            |                |
|              | Primitive Job  | P <sub>1</sub>        | P <sub>2</sub> | P <sub>3</sub> |
| _            | parameters     |                       | 16             | 16             |
|              | Job Queue size | 32x32 1D FFT<br>(512) |                | FTs            |

FFTs (512)Results include transpose time!

GPU maximal workload is 13 x 1D



•

### CHPS: Experimental Results Complex 3D FFT [1D FFT Batch]

technology from seed

linescid

| 3D FFT RESULTS (TOTAL)           |     |  |  |
|----------------------------------|-----|--|--|
| Obtained speedup (comparing to): |     |  |  |
| 1 CPU core 2.8                   |     |  |  |
| 2 CPU cores                      | 1.3 |  |  |
| 1 GPU                            | X   |  |  |

- **Transposition time** dominates the execution for higher workloads!
- **Memory transfers** are significant!





 CHPS - Collaborative Execution Environment for Heterogeneous Parallel Systems

 SDHA - Applying the Stream-Based Computing Model to Design Hardware Accelerators: A Case Study



### **SDHA: Motivation**



 Co-processing hardware can be a very efficient solution to accelerate certain types of applications



technology

from seed

 However, mapping algorithms into hardware is not straightforward



 Relatively easier to program according to the streambased computing model (many stream-based algorithms have been proposed for different applications, e.g. for running on GPUs)





- The stream-based description seems to be suitable as an approximation to hardware description:
  - decouples computation from memory accesses
  - exposes maximum parallelism available in the application
- Stream-based description can be used as an intermediate step:
  - besides being easier and faster to implement it also provides the HW designer with early-stage programmable prototyping platforms





### SDHA: Overview Stream-based Computation



technology from seed

- A data stream is a **sequence** of data items.
- In stream computing a kernel, comprising several operations in sequence, is applied to a single or multiple input data streams to produce an output data stream.





# SDHA: Overview Hardware Design



#### technology from seed

- Efficient solutions but complex design
  - Customizable
  - Differently optimized for each application

### Adaptable/Reconfigurable

 Different applications can be dynamically loaded

### Hard to:

We try to overcome these difficulties: stream-based computing (GPUs) hardware accelerators







technology

from seed

 MrBayes is a bioinformatics application that performs Bayesian inference of evolutionary (phylogenetic) trees.

Phylogenetic trees are used for instance in

"Origins and evolutionary genomics of the 2009 swineorigin H1N1 influenza A epidemic"

http://www.nature.com/nature/journal/v459/n7250/full/nature08182.html

 A current real-world phylogenomic data set study contains 1,500 genes and requires 2,000,000 CPU hours on a BlueGene/L system<sup>1</sup>.



<sup>1</sup> M. Ott, et. al, "Large-scale Maximum Likelihood-based Phylogenetic Analysis on the IBM BlueGene/L", In SC'07, NY, USA.



• Phylogenetic Tree: Origin and evolution of the H1N1 (simplified)

Reconstruction of the sequence of reassortment events leading up to the emergence of S-OIV.



GJD Smith *et al. Nature* **459**, 1122-1125 (2009) doi:10.1038/nature08182





- Profiling showed that >85% of the program execution is spent in the Phylogenetic Likelihood Functions (PLF).
  - 2 PLF. CondLikeDown and CondLikeRoot
- Data Structures

- Conditional Likelihood vector
- Nucleotide substitution matrix







### SDHA: MrBayes Stream-based Parallelization

technology from seed

linescid





Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa

### SDHA: MrBayes Stream-based Parallelization

technology from seed

- The stream-based version was implemented with CUDA for GPGPU
  - The PLF are executed on the GPU, while the remaining part of the program is computed on the CPU
  - Each thread computes one inner product, which can be seen as a reduction
  - Number of threads used: 10240
  - The data streams are split between blocks
  - Number of blocks used: 40

More details about the CUDA implementation and the algorithm parallelization can be found at F. Pratas, et. al, "*Fine-grain Parallelism using Multi-core, Cell/BE, and GPU Systems: Accelerating the Phylogenetic Likelihood Function*", In ICPP'09, Vienna, Austria.

Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa

SUPERIOR



inesc id





### SDHA: MrBayes From Stream-based to Reconfigurable Hardware

technology from seed

inescid





technology from seed

inesc id

- Prototyping platform
  - The stream-based algorithm is programmed in CUDA
  - The GPU is used for optimization and debugging
  - The algorithm is ported to VHDL with a minimum effort
    - A simple control unit can be used for several units following a SIMD approach





| Characteristics                                  | Baseline  | GPU             | Virtex 4   | Virtex 5         | Virtex 5     |
|--------------------------------------------------|-----------|-----------------|------------|------------------|--------------|
|                                                  | Intel x86 | GeForce 8800 GT | LX100      | LX110            | FX200T       |
| Frequency [GHz]                                  | 3.000     | 0.575           | 0.450      | 0.667            | 0.450        |
| Power [Watts]                                    | 65        | 105             | <10        | <10              | <10          |
| # Cores                                          | 1         | 112             | —          | —                | —            |
| # Slices <sup>(1)</sup> / $#$ DSP <sup>(2)</sup> | —         | —               | 49152 / 96 | $17280 \ / \ 64$ | 122880 / 384 |
| Technology                                       | 45-nm     | 65-nm           | 90-nm      | 65-nm            | 65-nm        |

- The Stream-based implementation was programmed with CUDA API v2.1
- The hardware was described in VHDL and the results were obtained with Xilinx ISE 10.1 after Place-and-Route.
- Baseline is a general-purpose architecture and is used as reference system.
- MrBayes version 3.1.2 with input datasets obtained from Seq-Gen.
- Datasets are named according to X\_Y, where X is related to the number of PLF calls and Y is related to the size of the data set.



SUPERIOR



 Scalability of the accelerated kernel for different number of FCAs on the Virtex5 FX200T



- Input Data Set [organisms-columns]
- We implemented Super-pipelined FCAs, with 8 stages floating-point units.
- Multiples of 4 FCAs were used to simplify the control due to the number of discrete rates.



inesc id

## SDHA: Experimental Results

| PLF function   | Characteristics            | Virtex 4<br>LX100 | Virtex 5<br>LX110 | Virtex 5<br>FX200T |
|----------------|----------------------------|-------------------|-------------------|--------------------|
|                | Occupation (Slices / DSPs) | 15% / 100%        | 17% / 75%         | 63% / 100%         |
| CondLikeDown   | Frequency [MHz]            | 128               | 104               | 54                 |
| ColldLikeDowli | # of cycles                | 40                | 40                | 40                 |
|                | $Maximum \ \# \ FCAs$      | 8                 | 8                 | 64                 |
|                | Occupation (Slices / DSPs) | $15\% \ / \ 54\%$ | 31% / $88%$       | 84% / 95%          |
| CondLikeRoot   | Frequency [MHz]            | 98                | 105               | 56                 |
|                | # of cycles                | 72                | 72                | 72                 |
|                | Maximum $\#$ FCAs          | 4                 | 8                 | 52                 |

- The CondLikeRoot function has one extra inner product per conditional likelihood element extra floating-point units.
- The obtained frequency is similar for the two FPGAS when using approximately the same number of FCAs.
- The frequency decreases with the increase of the number of FCAs due to the FPGA occupancy and associated routing overhead.
- The latency is constant because the parallelism is the only factor being modified across different topologies.



SUPERIOR



• Results are relative to the kernel functions. FPGA can be reconfigured two implement one of the PLFs at a time

LX100

[8 FCA]

LX110

[8 FCA]

FX200T

[64 FCA]

• It is assumed that the input data fits in the main memory.

8800 GT

[112 PE]

- The fastest architecture is the Virtex5 FX200T with 64 FCAs, showing a speedup of 10.2x. The GPU achieves a speedup of 4.3x.
- Although the difference between the # FCAs in the two Virtex5 is 8x, the speedup obtained is only 4x (different operating frequencies).





technology from seed

 CHPS - Collaborative Execution Environment for Heterogeneous Parallel Systems

Aleksandar Ilic and Leonel Sousa, "*Collaborative Execution Environment for Heterogeneous Parallel Systems*", submitted to 18th Euromicro International Conference on Parallel, Distributed and Network-Based Computing (PDP) 2010.

• **SDHA** - Applying the Stream-Based Computing Model to Design Hardware Accelerators

Frederico Pratas and Leonel Sousa, "*Applying the Stream-Based Computing Model to Design Hardware Accelerators: A Case Study*", In International Workshop on Systems, Architectures, MOdeling, and Simulation (SAMOS), Springer, July 2009.





- The proposed unified execution environment achieves significant speedups relatively to the single CPU core execution:
  - 4.5 for dense matrix multiplication
  - **2.8** for complex 3D fast Fourier transform

## • Future work:

- Implementation in more heterogeneous systems (more GPUs, more CPU cores, or special-purpose accelerators)
- Asynchronous memory transfers
- Adoption of advanced scheduling policies
- Performance prediction and application self-tuning
- To identify limits in performance to choose the processor to use (e.g



GPU versus CPU)

Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa

technology

from seed



- We were able to obtain a hardware implementation of the target algorithm using the stream-based
- Due to the stream-based parallel characteristics it naturally leads to scalable hardware architectures
- Typical folding procedures were used systematically to optimize the operations in hardware
- The results show that the solution is efficient, even not being a completely custom solution
- Future work:
- GPUs can be used as prototyping platforms to design hardware
- Development of software tools to assist hardware design.



