Why Globus for Extreme-Scale Cosmology: A Conversation with Argonne’s Katrin Heitmann
July 08, 2019 | Mary Bass
Globus recently saw the largest single file transfer in our history: a team led by Argonne National Laboratory scientists moved 2.9 petabytes of data generated on the Summit system at Oak Ridge National Laboratory, as part of a research project involving three of the largest cosmological simulations to date.
We sat down with Dr. Katrin Heitmann, the Argonne physicist and computational scientist who led the project, to get details on the work and why she uses Globus.
Tell us about your work – in particular, what aspects of your research require powerful file transfer?
[Dr. Heitmann] "Our group is working in the field of computational cosmology. We are generating extreme-scale simulations to model the evolution and content of the Universe. The simulations are used to build sophisticated synthetic sky maps that are used to understand observational data from cosmological surveys. The results from the simulations are shared with the community for further investigations and we need to be able to move the data around quickly; powerful, reliable file transfer is a key to this."
I understand your recent project using Summit resulted in the largest single transfer (~3 PB) Globus has ever seen. What can you tell us about the importance of this project?
[Dr. Heitmann] "We carried out three different simulations on Summit (each simulation resulted in a file transfer of 2-3PB) to model three different scenarios of the makeup of the Universe. We are trying to understand the subtle differences in the distribution of matter in the Universe when we change the underlying model slightly. These simulations are unique worldwide, and we were very lucky that the CAAR (Center for Accelerated Application Readiness) program at the Oak Ridge Leadership Computing Facility provided early access to Summit and therefore enabled this work.
Due to its uniqueness, the data is very precious and the analysis will take some time. The first step after the simulations were finished was to make a backup copy of the data to HPSS. One major problem we have with our simulations is that we can't analyze the data as fast as the computing centers want us to move it off their machines. So the copy to tape has two functions: make sure we have a copy of the data set in case something bad happens to the disk (which does occur rather regularly), and ensure we can pull it back if we need to do new analysis tasks. In this way, we can move the data back and forth between disk and tape and carry out the analysis in steps.
Storage in general is a very large problem in our community -- the Universe is just very big, so our work often generates a lot of data. Using Globus to easily move the data between different storage solutions and different institutions for analysis is essential."
Why do you choose Globus for this kind of job?
[Dr. Heitmann] "Several reasons: speed, reliability, and ease of use. The implementations in Globus are extremely convenient -- for example, it reminds me when my credentials expire, so a job basically never times out. Also, the Globus interface is easy to use, and it provides excellent monitoring interfaces (first thing in the morning when I get to my office is making a coffee, second thing is very often checking in on the transfers!), and I can really fully rely on it due to the checksums.
In addition, this work would not have been possible without Oak Ridge's excellent setup of data transfer nodes, enabling the use of Globus for HPSS transfers."
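For readers who want to see what such a transfer looks like in practice, here is a minimal sketch using the Globus Python SDK (globus_sdk). The client ID, endpoint UUIDs, and paths below are placeholders, not the actual collections used in this work; the sketch simply shows how a transfer can be submitted with the checksum verification Dr. Heitmann relies on and then checked on later.

```python
import globus_sdk

# Placeholder IDs -- substitute your own client ID and collection UUIDs.
CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"
SRC_ENDPOINT = "source-collection-uuid"   # e.g., a data transfer node at the compute facility
DST_ENDPOINT = "dest-collection-uuid"     # e.g., an archive/HPSS-backed collection

# Authenticate interactively as a native app and get a Transfer token.
auth_client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
auth_client.oauth2_start_flow(requested_scopes=globus_sdk.scopes.TransferScopes.all)
print("Log in at:", auth_client.oauth2_get_authorize_url())
tokens = auth_client.oauth2_exchange_code_for_tokens(input("Paste auth code: "))
transfer_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token)
)

# Build a transfer with checksum verification enabled, so every file is
# validated after it lands at the destination.
tdata = globus_sdk.TransferData(
    tc,
    SRC_ENDPOINT,
    DST_ENDPOINT,
    label="simulation snapshot backup",
    sync_level="checksum",
    verify_checksum=True,
)
tdata.add_item("/scratch/sim_output/", "/archive/sim_output/", recursive=True)

task = tc.submit_transfer(tdata)
print("Submitted task:", task["task_id"])

# Check on the transfer later (the "morning coffee" check).
status = tc.get_task(task["task_id"])
print(status["status"], "-", status["files_transferred"], "files transferred")
```

Enabling checksum verification costs a little speed but gives end-to-end integrity checking, which is exactly the trade-off Dr. Heitmann discusses next.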
What are the most important elements of file transfer performance? How important is speed relative to reliability and other aspects of transfer?
[Dr. Heitmann] "The most important part is reliability. It is basically impossible for me as a user to check the very large amounts of data upon arrival after a transfer has finished. The analysis of the data often uses a subset of the data, so it would take quite a while until bad data would be discovered and at that point we might not have the data anymore at the source. So the reliability aspects of Globus are key.
Of course, speed is also important -- given the amount of data we transfer, very slow transfers would be a problem as well, so it's good to be able to rely on Globus for fast data movement. But a super-fast transfer service only takes you so far; you also need reliability, security, ease of use, and other factors. Globus offers all of these."
What other aspects of Globus do you rely on, beyond file transfer?
[Dr. Heitmann] "Many people in our community use Globus. We have recently started to stand up a data portal with cosmological simulations at the ALCF (Argonne Leadership Computing Facility), and Globus is a central piece to that work. Our portal is called the HACC Simulation Data Portal, and it provides access to results from large cosmological simulations carried out with HACC, the Hardware/Hybrid Accelerated Cosmology Code, developed primarily at Argonne.
Basically, users pick a data set and then Globus is used to get the data to their institution. This is very convenient because the transfers are fast and reliable."
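To illustrate the portal pattern Dr. Heitmann describes, here is a brief sketch, again using the Globus Python SDK and assuming an authenticated TransferClient (tc) as in the earlier example. The endpoint UUIDs and dataset paths are illustrative placeholders, not the actual HACC Simulation Data Portal configuration.

```python
import globus_sdk

# Assumes `tc` is an authenticated globus_sdk.TransferClient (see earlier example).
PORTAL_ENDPOINT = "portal-source-collection-uuid"  # collection that hosts published datasets
USER_ENDPOINT = "users-own-collection-uuid"        # collection at the user's institution

# Browse the datasets published on the portal's collection.
for entry in tc.operation_ls(PORTAL_ENDPOINT, path="/published/"):
    print(entry["type"], entry["name"])

# Pull one chosen dataset back to the user's institution.
tdata = globus_sdk.TransferData(
    tc, PORTAL_ENDPOINT, USER_ENDPOINT,
    label="HACC dataset download",
    verify_checksum=True,
)
tdata.add_item("/published/chosen_dataset/", "/local/data/chosen_dataset/", recursive=True)
task = tc.submit_transfer(tdata)
print("Download task:", task["task_id"])
```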
For more information about the record-breaking transfer Dr. Heitmann achieved, read the press release.
For details about the project on Summit, read the news article.
To read about the HACC data portal, visit the HACC site or see the slides presented at GlobusWorld 2019.