# Vertical Acceleration: From Algorithms to Logic Gates

# considering Economics of Computation

Oskar Mencer

Beograd, August 2010



## Maxeler Technologies





MaxBox 4 MaxCards in a 1U box



MaxRack Storage, Network and Compute



**Real-time trace processing** 

6 7 8 9 10 11 12

#### Finite Difference (with Chevron)



**Local Vol Approximation** 





BSR

Geologic Formation

Survey Ship

mo (A ir C

2/27



#### \* published by the Society of Exploration Geophysicists in 2008



# 48x Speedup of Angle Gathers

with Stanford Center for Earth and Environmental Sciences \*)



Angle gathers from CPU computed subsurface offsets

| • | Can be | dominant | cost in | shot | profile | migration |
|---|--------|----------|---------|------|---------|-----------|
|   | •      |          |         |      | -       |           |

Cross-correlating two fields by various shifts:

$$I(h, x, z) = \sum_{s} \sum_{w} S(x - h, z, w, s) \cdot G^{*}(x + h, z, w, s)$$



subsurface offsets

#### SPEEDUP RESULTS FROM CUSTOM TRACE MEMORY SYSTEM:

- Trace = Unit of Transfer
- Buffers Prefetch Right Traces in Advance



\*) Case Study: Subsurface Offset Gathers, presented at SEG 2008

Maxeler April 2010

## Seismic Data Acquisition





Courtesy of Schlumberger

### ENI-AGIP Seismic Trace App: Conjugate Gradient Optimization 100 MAX2 cards delivering performance of 21,800 CPU cores[EAGE2010]





Data Use with 64 t<sub>o</sub>



#### Acceleration Projects end up at 20-30x







Customer ACustomer BApp1: 19x, App2: 25x1.2GB/s per card

Customer C App1: 22x, App2: 22x





Customer D App1: 32x, App2: 29x

Customer E App1: 30x

Customer F App1: 26x, App2: 30x



#### **Computing Technology 2010**

# Microprocessor

# **FPGA**

#### Intel Quad Core Nehalem





19.6 mm 731 million transistors 8 MB L3 plus 4 x 0.5 MB L2 128 bit DDR3 bus and 2x Quick patch I/O Branch pred. and prefetchers doubled for SMT? Reworked SSE / FP Single core size: ~29.6 mm2 L2 and L3 cache tiles: ~5.8 mm2 / MB (excl.tags) www.chip-architect.com rev.4: Oct 15, 2007



#### Xilinx Virtex-6 FPGA



• 5MB at >1TB/s

- 2000 multipliers
- ~1M logic elements

#### Computing with CPUs versus FPGAs



Streaming Data through a data flow machine



## Acceleration is Hard



## Vertical Optimizations in a Horizontal world





## Managing a Complex C++ Acceleration Project

"With C you can shoot yourself in the foot.

with C++ you can blow your whole leg away."

Abstraction
 The enemy of acceleration?
 We balance abstraction with modelling of underlying dataflow
 In the process we make the code easier to read and maintain
 Requires the support of the people who wrote the code
 Focus on large "chunk size", the unit of data and computation
 Requires adaption of the job distribution system
 Acceleration has to track a moving target; ongoing development
 Accelerating the acceleration process: Agile Programming
 Coding standards to achieve fast deliveries



# Maxeler Loop Flow Graphs for JP Morgan Credit Derivatives Transformation Options





#### 30x Accelerated 2-Fluid Lattice Boltzmann: Changing the sorting step inside the iteration





#### **Slow and Fast Memories**

John von Neumann, 1946:

"We are forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding, but which is less quickly accessible."





$$\begin{split} \text{SABR model:} & dF_t = \sigma_t F_t^{\ \beta} dW_t \\ & d\sigma_t = \alpha \sigma_t dZ_t \\ & < dW, dZ >= \rho dt \end{split}$$
we integrate in time (Euler in log-forward, Milstein in vol.)
$$& \ln F_{t+1} = \ln F_t - \frac{1}{2} (\sigma_t \exp((\beta - 1) \ln F_t))^2 . dt + \sigma_t \exp((\beta - 1) \ln F_t) \Delta W_t \\ & \sigma_{t+1} = \sigma_t + \alpha \sigma_t \Delta Z_t + \frac{1}{2} (\alpha \sigma_t) (\alpha) (\Delta Z_t^2 - dt) \end{split}$$
For each path, initialize: 
$$& \ln F_0, \quad \sigma_0 \end{split}$$

For each t:  $\ln F_{t+1} = G(\sigma_t, \ln F_t, \Delta W_t)$  $\sigma_{t+1} = H(\sigma_t, \Delta Z_t)$ 

... How do we compute G and H optimally?



## **Design Space for Function Evaluation**

and +,-,×,÷

#### Computing f(x) in the range [a,b] with $|E| \le 2^{-n}$ Table Table Table

Arithmetic

+,-,×,÷

- uniform vs non-uniform
- number of table entries
- how many coefficients

- polynomial or rational approx
- continued fractions
- multi-partite tables

#### Underlying hardware/technology changes the optimum



# Variable Segments and Approximations



Approximate in each segment separately:

$$r(x) = c_0 + c_1 x' + c_2 x'^2 + c_3 x'^3 \dots \qquad r(x) = \frac{a_0 + a_1 x + a_2 x^2 + a_3 x^3 \dots}{b_0 + b_1 x + b_2 x^2 + b_3 x^3 \dots}$$

Compute Coefficients: Taylor Series, MiniMax (Remez), Splines,... Precision: How many coefficients AND how many bits per variable?



p(x')

## Finite Difference Stencils and Coefficients

- Monte Carlo vs Finite Difference vs Finite Elements
- Explicit versus implicit
- Discretization, delta\_t versus accuracy
- Example: 3D stencil shape for Finite Difference





joint Stanford/Maxeler presentation at the SEG meeting 2009.

#### MaxCompiler



#### **Generic Acceleration Architecture**







## **Providing Complete Solutions**

Maxeler offers complete hardware, software and application acceleration solutions for high performance computing



- Card: PCI Express x16, compute, memory and local interconnect
- Box: 1U solutions with 1 or 4 Cards
- Rack: 10U, 20U or 40U, balancing compute, storage & network
- MaxelerOS: Resource management of Stream Computing
- Runtime support: memory management and data choreography
- MaxCompiler: providing programmability
- HPC System Performance Architecture
- Algorithms and Numerical Optimization
- Integration into business and technical processes





PERFORMANCE COMPUTING



