In this talk we will describe PyFR, a high-order accurate compressible fluid flow solver for mixed unstructured grids that is authored in Python and being developed at Imperial College London. We will outline the techniques employed to allow PyFR to run heterogeneously on CPUs, GPUs and the Intel Xeon Phi, with an emphasis on run-time code generation.
Python is one of the most commonly used programming languages throughout the computing industry. It has proven itself to be an easy-to-use, elegant scripting language that allows for rapid prototyping and development of highly flexible software. Within many scientific circles Python is now the Swiss army knife of both pre- and post-processing. Due to relatively high interpreter overheads, however, Python is often eschewed for high-performance computing (HPC) codes in favour of more traditional languages such as C and Fortran. Over the past few years a variety of modules have come to prominence which promise to help close this performance gap.
Moreover, there is also a buzz within the community around accelerators, including graphics processing units (GPUs) and co-processors. Successfully exploiting these novel architectures, many of which sport their own programming interfaces, is extremely labour intensive.
In this talk we will make the case for why Python is perfectly geared towards helping scientists and engineers write the next generation of accelerated HPC codes. Our case study will focus on PyFR, a high-order accurate compressible fluid flow solver being developed at Imperial College London. By intelligently leveraging Python and packages such as NumPy, mpi4py, PyCUDA, pyMIC, and Mako, PyFR is able to run at scale on a variety of hardware platforms. All of this is accomplished without sacrificing the compact elegance of Python. A prime focus of our talk will be on run-time code generation, whereby kernels are specified as templates which are translated, compiled, and linked at run-time. We will demonstrate how this concept can both improve the portability of a code base between architectures and permit optimisations that are simply not viable with ahead-of-time compilation. Results will be shown demonstrating the ability of PyFR to obtain in excess of 50% of peak FLOP/s on a variety of platforms and its ability to scale to hundreds of GPUs.
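To give a flavour of the run-time code generation idea, the following is a minimal sketch and not PyFR's actual implementation. It assumes a hypothetical axpy kernel, uses the standard library's string.Template as a stand-in for Mako, and uses Python's built-in compile and exec as a stand-in for compiling and linking a C or CUDA kernel. The vector length is only known at run-time, so the generated source can be fully unrolled for it, which is the kind of specialisation that ahead-of-time compilation cannot easily provide.

```python
from string import Template

# Hypothetical kernel template; string.Template stands in for Mako here.
KERNEL_TEMPLATE = Template('''
def ${name}(a, x, y):
    # Unrolled y <- a*x + y for vectors of length ${n}
${body}
    return y
''')


def generate_axpy(n):
    """Render, compile and return an axpy kernel unrolled for length n."""
    # The loop is unrolled at generation time, once n is known.
    body = "\n".join(f"    y[{i}] = a * x[{i}] + y[{i}]" for i in range(n))
    name = f"axpy_{n}"
    source = KERNEL_TEMPLATE.substitute(name=name, n=n, body=body)

    # Stand-in for the compile-and-link step of a real accelerator backend.
    namespace = {}
    exec(compile(source, f"<generated {name}>", "exec"), namespace)
    return namespace[name], source


kernel, source = generate_axpy(4)
result = kernel(2.0, [1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])
print(source)
print(result)
```

In a real solver the same template would typically be rendered once per backend, with the rendered source handed to that platform's compiler rather than to exec.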