Financial Software on GPUs: Between Haskell and Fortran (pdf, 2012) Nov 19 2019 16:51 MoneyScience
Cosmin E. Oancea, Christian Andreetta, Jost Berthold, Alain Frisch, Fritz Henglein
This paper presents a real-world pricing kernel for financial derivatives and evaluates the language and compiler tool chain that would allow expressive, hardware-neutral algorithm implementation and efficient execution on graphics processing units (GPUs).
The language issues refer to preserving algorithmic invariants, e.g., inherent parallelism made explicit by map-reduce-scan functional combinators. Efficient execution is achieved by manually applying a series of generally applicable compiler transformations that allow the generated OpenCL code to yield speedups as high as 70x and 540x on a commodity mobile and desktop GPU, respectively. Apart from the concrete speedups attained, our contributions are twofold: First, from a language perspective, we illustrate that even state-of-the-art auto-parallelization techniques are incapable of discovering all the requisite data parallelism when rendering the functional code in Fortran-style imperative array processing form.
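To illustrate what "parallelism made explicit by map-reduce-scan combinators" can mean, here is a minimal sketch. It is not the paper's pricing kernel: the function names, the additive price model, and the call payoff are hypothetical choices made only for illustration. The combinator version states outright that paths are independent (map) and that the final sum is associative (reduce), whereas the loop version hides the same facts behind accumulator updates that a compiler would have to rediscover.

```python
from functools import reduce
from itertools import accumulate
import operator


def path_from_increments(start, increments):
    # scan: the running sum of increments gives the price at each date
    return list(accumulate(increments, operator.add, initial=start))


def call_payoff(path, strike):
    # payoff depends only on this path, so paths can be processed independently
    return max(path[-1] - strike, 0.0)


def price_combinators(start, all_increments, strike):
    # map: one independent computation per path
    paths = map(lambda inc: path_from_increments(start, inc), all_increments)
    payoffs = map(lambda p: call_payoff(p, strike), paths)
    # reduce: addition is associative, so the sum could be computed as a tree
    total = reduce(operator.add, payoffs, 0.0)
    return total / len(all_increments)


def price_loop(start, all_increments, strike):
    # Imperative rendering of the same computation: the independence of the
    # outer iterations is no longer stated, only implied by the loop body.
    total = 0.0
    for increments in all_increments:
        level = start
        for step in increments:
            level += step
        total += max(level - strike, 0.0)
    return total / len(all_increments)
```

Both functions compute the same value; the difference lies in how much of the data-parallel structure is visible in the program text.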
Second, from a performance perspective, we study which compiler transformations are necessary to map the high-level functional code to hand-optimized OpenCL code for GPU execution. We discover a rich optimization space with nontrivial trade-offs and cost models. Memory reuse in map-reduce patterns, strength reduction, branch divergence optimization, and memory access coalescing each exhibit significant impact individually. When combined, they enable essentially full utilization of all GPU cores. Functional programming has played a crucial double role in our case study: capturing the naturally data-parallel structure of the pricing algorithm in a transparent, reusable and entirely hardware-independent fashion; and supporting the correctness of the subsequent compiler transformations to a hardware-oriented target language by a rich class of universally valid equational properties.
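Two of the ideas named above can be sketched in a few lines. The example is an assumption-laden illustration, not the transformations as applied in the paper: all function names are hypothetical. The first pair shows an equational property of the kind referred to, map fusion (mapping g and then f equals mapping their composition, provided f and g are pure), which removes an intermediate array. The second pair shows strength reduction, replacing a power computed per element with a running product.

```python
def compose(f, g):
    # (f . g)(x) = f(g(x))
    return lambda x: f(g(x))


def map_unfused(f, g, xs):
    # Two passes: the intermediate list is fully materialized in memory.
    intermediate = [g(x) for x in xs]
    return [f(y) for y in intermediate]


def map_fused(f, g, xs):
    # One pass, no intermediate list. Valid because f and g have no side
    # effects, so map f . map g == map (f . g).
    h = compose(f, g)
    return [h(x) for x in xs]


def powers_naive(base, n):
    # One exponentiation per element; every element is independent.
    return [base ** i for i in range(n)]


def powers_reduced(base, n):
    # Strength reduction: one multiplication per element instead of a power.
    # Note the trade-off: each element now depends on the previous one.
    out = []
    acc = 1.0
    for _ in range(n):
        out.append(acc)
        acc *= base
    return out
```

The second pair hints at why the optimization space has nontrivial trade-offs: the cheaper arithmetic introduces a sequential dependency, so whether it pays off depends on how the work is distributed across threads.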
Given the observed difficulty of automatically parallelizing imperative sequential code and the inherent labor of porting hardware-oriented and optimized programs, our case study suggests that functional programming technology can facilitate high-level expression of leading-edge, portable, high-performance systems for massively parallel hardware architectures.