While they used to depend on real tests and evaluations, today, industries lean more and more on numerical simulation. Planning ahead and visualizing what your end-product could be and how it could work is both time- and money-saving. As such, they have slowly thought of new questions, of new problems that require greater and greater processing speed. Now, be it in the fields of meteorology, geology, chemistry, energy or media and entertainment, computers are asked to run complex algorithms that, if not executed effectively, may take so long to complete none of us will be here to see it!
One domain that is famous for using demanding mathematical models is flow simulation. This semester, with my colleague A. Khizar, we took a quick peek at it by implementing the shallow water equations that help describe, for instance, the movement of water on a surface. We personally computed and visualized the collapse of a water-filled bump with animations like this one (I am also showing here the side-view evolution of the height of the water at one specific y):
[projectinfo languages=”c/c++,python” team=”2″ duration=”3 months”]
What is High Performance Computing (HPC)?
To answer these recent challenges, computer scientists have refined their algorithms, improved their hardware and designed new ways of efficiently combining multiple machines to solve a problem. Innovations in programming, electronics or algorithmics have made a lot of incredible things possible: tsunami simulations, large data analysis like at the CERN’s LHC, safer cryptography for the military, precise crash tests for automobiles…
Optimizing your code is, of course, necessary; but it is as important to know what kind of computer you work on to benefit from all its features and ‘talk’ to it in the quickest and most valuable way. BLAS routines are but one example of super-adapted computation where, in the 1970s, specialists conceived low-level programs that perform vector and matrix operations very efficiently and can be used as standard building blocks through many interfaces. Since then, computer companies have written their own versions specifically tuned for their architectures and the routines have been improved: it boosts linear algebra computation so much most of the well-known data processing softwares use it now (Mathematica, MATLAB, NumPy, R, Julia…).
One of the main concepts of HPC our course focused on is parallel computing. The goal is to connect a cluster of processing units together so each can do its part and you get your result by regrouping their individual outputs. So, for example, instead of having one computer execute one program on all of your data at a time, you’ll use several of them and tell each to execute the program on a subset of the data at the same time. There are numerous types of parallelism depending on the ressources you consider, the level at which you implement it (do you do several tasks at once? or the same task on different data?), etc.
Thanks to HPC, you can solve a problem faster, or solve more problems at once, or solve bigger problems.
Application to the shallow water equations
In this project, we wanted to transform a given sequential code (meaning a program that can only be executed by one unit, line by line) into a parallel code using the 3 techniques we had learnt about in class: MPI parallelization, OpenMP multi-threading and vectorization with intrinsics. Our program was to produce a large binary file that could later be visualized with a Python script to give animations such as the one at the beginning of the article.
MPI parallelization relies on the MPI, or Message Passing Interface. This standard of communication libraries allows us to run a program as a group of autonomous MPI processes that each execute their own instance of the code on their assigned data and communicate by exchanging messages. A. Khizar and I spent the majority of the project on this tool because, as explained later, it raises several design issues and can be gradually optimized.
In contrast, OpenMP uses threads: this means the different processes share the memory but still have their own execution state.
Finally, vectorization is a technique to greatly improve the performance of your code by working on vectors rather than scalars. In other words, you don’t do your computations on one tiny bit of data but on a whole chunk.
A focus on MPI
When distributing computation between several units, as with MPI, you need to think of how you divide the work. This question actually comprises different things: how do you cut down your data to scatter it across your processes? will all the processes always use the same part of the data? is your network more adapted to a lot of little communications, or fewer longer ones? All of these issues must be dealt with early on so you establish a functioning cluster of processors and don’t actually slow things down by overflooding the network with messages, for example.
In our case, we computed the shallow water equations on a grid, so we wondered how to best split this grid between the different MPI processes. For this sort of problem, there are two natural decompositions in subdomains, either by blocks of rows or by square blocks:
There is no ‘perfect answer’ because both decompositions have good and bad points: the first one is simpler to implement and uses less communications but with longer messages, while the other one can distribute the work more fairly but requires a lot of message exchanges. Depending on whether your processing units are on a network with a low latency or a low bandwith, you may wish to choose one or the other.
To improve the performance of your MPI cluster, you can also use the library’s non-blocking communication routines: instead of waiting for each exchange to be completely finished (both on sending and receiving ends), you can throw the message and let it live its own life while resuming your activity. Also, when you want your MPI processes to access a shared resource (here, the output binary file), you have to be aware of concurrent access and avoid your units all colliding together at the same spot.
Comparing the 3 techniques was a nice way of identifying their advantages and their drawbacks. In particular, it made us think of the compromise between execution time gain and programming time loss: while OpenMP does not give as much acceleration as intrinsics do, you only have to add one line of code in your program to make it working; vectorization, on the other hand, requires a complete redesign – but, yeah, it made our program 3 or 4 times faster!
HPC and code optimization isn’t always straight-forward – sufficient to look at how complicated automatic parallelization is! – but it is really interesting and essential in the current context: chances are we are not going to stop asking for more effective computers and more efficient algorithms!
- S. Graillat and P. Fortin, “Calcul haute performance : notions de base,” 2018. [Course notes for MAIN4 – Polytech Sorbonne].
- I. Free Software Foundation, “Shallow water equations.” (https://en.wikipedia.org/wiki/Shallow_water_equations), May 2018.
- Dr. Ralf-Peter Mundani, “Parallel Programming and High-Performance Computing,” 2008. [Course notes for IGSSE]. (https://www5.in.tum.de/lehre/vorlesungen/parhpp/SS08/slides/part01.pdf)
- BLAS routines: http://www.netlib.org/blas/
- I. Free Software Foundation, “Parallel computing.” (https://en.wikipedia.org/wiki/Parallel_computing), May 2018.
- I. Dupays, M. Flé, J. Gaidamour, D. Girou, P.-F. Lavallée, D. Lecas and P. Wautelet, “MPI.” (http://www.idris.fr/media/formations/mpi/idrismpien.pdf), January 2018.