Globus Participates in Three-Year Program to Modernize the Earth System Grid Federation
February 12, 2024 | Lee Liming
Data node upgrade eliminates IT obstacles for climate scientists
The Earth System Grid Federation (ESGF) is an international federation of data repositories for climate model simulation outputs, part of the World Climate Research Programme (WCRP). More than 35 global repositories—using a common software base—currently hold 26M+ datasets (~50 PB of data, including replicas). ESGF data is produced by climate observations and simulations and is used by scientists around the world to investigate the consequences of possible climate change scenarios. Globus is involved in a three-year program to upgrade the ESGF data nodes operated by the U.S. Department of Energy (DOE). Goals of the program include: increasing system capacity in preparation for the next wave of climate simulations in 2025, improving repository sustainability, and broadening access to data analysis capabilities.
The modernization process involves leveraging institutional resources available at the labs. Currently the US DOE ESGF nodes manage more than 7 petabytes of climate data. While the national laboratories excel at provisioning large-scale storage, Globus excels at large-scale data access. So the nodes are being upgraded to use lab-wide storage with Globus access rather than custom storage and data access systems purchased and operated by the climate research teams. The nodes’ search index services (formerly run by climate researchers) are also transitioning to the Globus Search service (operated by Globus and Amazon Web Services for the global research community). These changes offload significant IT responsibility from the climate teams, allowing them to focus more on climate research.
ESGF data portals - Data access, search and discovery
Each ESGF data repository provides a web portal. The latest version of the ESGF web portal is Metagrid, a web application that enables researchers to log in, search the repository for relevant data, and download or transfer selected data to other systems. An important future direction for Metagrid is offering analysis methods within the web application, both to make analysis capabilities more widely available and to avoid transferring large amounts of data to other systems.
Metagrid is designed around the Modern Research Data Portal (MRDP) design pattern. The web application supplies the user interface, and leverages external services for login, search, data transfer, and data analysis. Initially the external services were provided by ESGF operators, but the upgrades conducted by US DOE are replacing them with institutional and multi-institutional services, resulting in increased capacity and a reduction in cost to the ESGF projects.
For logins, Metagrid initially used OpenID authentication services. ESGF personnel installed and operated OpenID servers, each of which included a local account database and an authentication service. The US DOE upgrade replaces these OpenID servers with Globus authentication. Globus provides a simple but secure authentication API that applications like Metagrid can use. It builds on the thousands of institutional login servers supplied by universities, federal labs, and research federations, so researchers can log in to Metagrid using accounts they already have instead of creating new accounts and managing new passwords. ESGF personnel no longer need to run OpenID servers, keep them secure, and provide account management support.
For data discovery and search, Metagrid currently uses a local Apache Solr search engine. With more than 50 million entries in the index, this search engine requires a minimum of 128 GB of RAM for query processing. Keeping these index services running, available, and secure is costly for each ESGF site. The US DOE upgrade is replacing these local indices with Globus Search. Globus Search is a cloud-hosted search engine operated by the Globus team, built on Amazon Web Services’ Elasticsearch. On top of Elasticsearch, Globus Search adds research federation identity and access management (IAM), that is, institutional identities with item-level visibility permissions. It is used by multiple research teams and supported by hundreds of campus/lab subscriptions. Metagrid uses the Globus Search API for query services, and ESGF personnel no longer have to install, configure, and operate search engine software or maintain expensive host systems.
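To make the query path concrete, here is a minimal sketch of how a portal might form a free-text request to a hosted search index. It is not Metagrid's actual code. The base URL and the `q`, `limit`, and `offset` parameters reflect my understanding of the Globus Search API and should be checked against its reference; the index ID is a made-up placeholder and `build_search_url` is a hypothetical helper. The sketch only builds the URL and makes no network call.

```python
from urllib.parse import urlencode

# Base URL of the Globus Search API as I understand it; verify against the docs.
SEARCH_API_BASE = "https://search.api.globus.org/v1"

# Hypothetical placeholder, not a real ESGF index ID.
EXAMPLE_INDEX_ID = "00000000-0000-0000-0000-000000000000"


def build_search_url(index_id: str, query: str, limit: int = 10, offset: int = 0) -> str:
    """Return the GET URL for a simple free-text query against one index."""
    if not query.strip():
        raise ValueError("query must not be empty")
    # urlencode handles escaping; spaces in the query become '+'.
    params = urlencode({"q": query, "limit": limit, "offset": offset})
    return f"{SEARCH_API_BASE}/index/{index_id}/search?{params}"


url = build_search_url(EXAMPLE_INDEX_ID, "precipitation monthly", limit=5)
print(url)
```

Because the index is hosted, the portal only has to issue requests like this one; the memory-heavy query processing happens on the service side.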
For data access, Metagrid initially supported single file links, OPeNDAP links, and THREDDS links. In each case, ESGF personnel operated the servers that supported these links and in most cases operated the data storage servers as well. The US DOE upgrade is replacing ESGF-operated storage with mass storage provided by the leadership computing facilities at Oak Ridge National Laboratory and Argonne National Laboratory. These facilities offer massive capacity at significantly lower per-unit cost. Data access is provided by Globus, already widely used at research universities, federal labs, and federal research agencies. The ESGF data collections are configured by facility staff, freeing ESGF personnel from administration chores. With Globus, users can move data easily and reliably in three ways: bulk data transfer protocols for reliable, high-performance transfers; HTTPS access for browser-based downloads; or the Globus Transfer API for bulk transfers to and from more than 35,000 other Globus collections. Globus also synchronizes data between the three US DOE ESGF data nodes located in California, Illinois, and Tennessee.
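As an illustration of the bulk-transfer option, the sketch below assembles the kind of request body a client might send to the Globus Transfer API. The field names follow my reading of the Transfer API's task document and should be confirmed against the official reference; the collection IDs and paths are made-up placeholders, and `build_transfer_document` is a hypothetical helper. A real submission would also need an access token and a submission ID obtained from the service, which are omitted here.

```python
import json

# Hypothetical placeholder collection IDs, not real ESGF collections.
SOURCE_COLLECTION = "11111111-1111-1111-1111-111111111111"
DEST_COLLECTION = "22222222-2222-2222-2222-222222222222"


def build_transfer_document(source, destination, paths, label="ESGF download"):
    """Assemble a bulk-transfer request body as a plain dict.

    paths is a list of (source_path, destination_path) pairs.
    Field names are assumptions based on the Transfer API task document.
    """
    if not paths:
        raise ValueError("at least one path pair is required")
    items = [
        {
            "DATA_TYPE": "transfer_item",
            "source_path": src,
            "destination_path": dst,
            # Treat paths ending in "/" as directories to copy recursively.
            "recursive": src.endswith("/"),
        }
        for src, dst in paths
    ]
    return {
        "DATA_TYPE": "transfer",
        "label": label,
        "source_endpoint": source,
        "destination_endpoint": destination,
        "verify_checksum": True,
        "DATA": items,
    }


doc = build_transfer_document(
    SOURCE_COLLECTION,
    DEST_COLLECTION,
    [
        ("/esgf/example/dataset/", "/scratch/dataset/"),
        ("/esgf/example/readme.txt", "/scratch/readme.txt"),
    ],
)
print(json.dumps(doc, indent=2))
```

One request can list many files and directories, which is what makes this route suitable for moving whole datasets rather than clicking through single-file links.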
Simplified data publication using Globus Flows
Data publication is a key function of the ESGF platform. For climate model comparisons, the data is highly specialized, and the details of both the data itself and the metadata that provides the surrounding context are incredibly important. The publication software used by ESGF is quite complicated. Installing this software, configuring it for a specific intercomparison project, and keeping it updated is a significant effort, which creates technical hurdles for climate researchers to overcome before they can publish their data. The next part of the US DOE upgrade, to be completed during 2024, is to make the publishing software available online as a service so individual researchers don’t have to install and maintain it themselves. Publication software will be deployed and configured for active intercomparison projects on compute systems at Argonne and Oak Ridge national laboratories. The team will use Globus Flows to create web interfaces that automate the process of transferring a new dataset to a DOE facility, running the publication software using Globus Compute, and adding the resulting metadata to the Metagrid portals so new data becomes discoverable.
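The three-step automation described above (transfer, run the publisher, add metadata to the index) can be sketched as a flow definition. This is an illustration only, not the ESGF team's flow. Globus Flows definitions are JSON documents with a `StartAt` state and a `States` map, as I understand them; the action URLs are assumptions to be checked against the Flows documentation, and the parameter names are placeholders rather than the action providers' real schemas. The helper `execution_order` is hypothetical and simply walks the chain to confirm the steps are wired in the intended order.

```python
# Action URLs and parameter names below are assumptions for illustration;
# check them against the Globus Flows documentation before use.
PUBLISH_FLOW = {
    "Comment": "Sketch of a dataset publication flow",
    "StartAt": "TransferDataset",
    "States": {
        "TransferDataset": {
            "Type": "Action",
            "ActionUrl": "https://actions.globus.org/transfer/transfer",
            "Parameters": {"source.$": "$.source", "destination.$": "$.destination"},
            "ResultPath": "$.TransferResult",
            "Next": "RunPublisher",
        },
        "RunPublisher": {
            "Type": "Action",
            "ActionUrl": "https://compute.actions.globus.org",
            "Parameters": {"dataset_path.$": "$.destination"},
            "ResultPath": "$.PublishResult",
            "Next": "IngestMetadata",
        },
        "IngestMetadata": {
            "Type": "Action",
            "ActionUrl": "https://actions.globus.org/search/ingest",
            "Parameters": {"metadata.$": "$.PublishResult"},
            "ResultPath": "$.IngestResult",
            "End": True,
        },
    },
}


def execution_order(flow):
    """Walk a linear flow from StartAt to its End state and return state names."""
    order = []
    seen = set()
    name = flow["StartAt"]
    while True:
        if name in seen:
            raise ValueError(f"cycle detected at state {name!r}")
        if name not in flow["States"]:
            raise ValueError(f"unknown state {name!r}")
        seen.add(name)
        order.append(name)
        state = flow["States"][name]
        if state.get("End"):
            return order
        if "Next" not in state:
            raise ValueError(f"state {name!r} has neither Next nor End")
        name = state["Next"]


print(" -> ".join(execution_order(PUBLISH_FLOW)))
```

Expressing the pipeline as data is what lets a web form start it: the researcher supplies the inputs, and the service runs each step in turn.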
Data access and analysis
Climate researchers using ESGF to discover data relevant to their work currently have only one option for working with data they discover: to download the data to their own systems and analyze it using their own computers and analysis software. We expect that datasets produced in the next round of intercomparison projects (CMIP7) will be large enough that downloading and analyzing them on local systems will be beyond the capabilities of some researchers. Consequently, the final part of the US DOE ESGF upgrade will be to add server-side analysis capabilities that enable researchers to perform common analysis methods on ESGF datasets without downloading the data or using their own computers.
Using web browsers, researchers will be able to request analyses via web portals, and the analyses will be performed on the DOE facility systems, orchestrated and secured by Globus Flows and Globus Compute. Server-side analysis will be limited to the most common forms using well-known software and configurations. (Some may even be pre-computed in anticipation of popular requests.) Researchers conducting specialized or unusual analyses will need to perform those on their own systems. But common analyses will be within reach of all researchers.
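The request-handling policy described here (a fixed menu of common analyses, with some results pre-computed) can be sketched in a few lines. Everything in this sketch is hypothetical: the analysis names, the toy datasets, the pre-computed value, and the `handle_request` function are invented for illustration and do not describe the planned ESGF service.

```python
from statistics import mean

# Hypothetical menu of "common" analyses the server is willing to run.
ALLOWED_ANALYSES = {"mean": mean, "max": max, "min": min}

# Toy stand-ins for datasets held at the facility.
DATASETS = {
    "example-dataset": [13.9, 14.1, 14.6],
    "other-dataset": [1.0, 2.0, 3.0],
}

# Results computed ahead of time for popular (analysis, dataset) requests.
PRECOMPUTED = {("mean", "example-dataset"): 14.2}


def handle_request(analysis, dataset_id):
    """Return (result, origin), where origin is 'precomputed' or 'computed'."""
    if analysis not in ALLOWED_ANALYSES:
        # Specialized analyses are out of scope for the server side.
        raise ValueError(f"unsupported analysis: {analysis!r}")
    if dataset_id not in DATASETS:
        raise KeyError(dataset_id)
    key = (analysis, dataset_id)
    if key in PRECOMPUTED:
        return PRECOMPUTED[key], "precomputed"
    return ALLOWED_ANALYSES[analysis](DATASETS[dataset_id]), "computed"


print(handle_request("mean", "example-dataset"))
print(handle_request("max", "other-dataset"))
```

Restricting requests to a known menu keeps the facility's workload predictable, while anything outside the menu falls back to the researcher's own systems.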