Navigating Cloud Storage with Globus
February 18, 2022 | Greg Nawrocki
I was speaking with a colleague from another institution the other day. We were lamenting the lack of in-person events and actually missing traveling to visit colleagues at their home institutions. While this is certainly one of the major themes of the past two years, another theme that I’ve heard in the course of this new normal of virtual meetings (and the few in-person events, thank you Supercomputing) is the mass migration of data between cloud storage platforms, and even between cloud stores and on-prem storage.
There are varying reasons for this mass migration, from security concerns to cost. Regardless of the reason, there are common barriers to success that can make these data transfers difficult.
The simple fact is that while they all have “cloud” in the name, either literally or philosophically, they are different storage systems. While the human interface to these systems is often browser-based, the way they operate and their data transfer workflows can be quite different. They are also not intrinsically designed to communicate with each other. This often necessitates a two-hop transfer: from cloud storage A down to local storage, and then up to cloud storage B. With the file sizes we deal with in the big data science community, this can be a time-consuming proposition.
Globus has a connector for nearly every popular cloud storage system out there. This allows each of those storage types to be abstracted as a Globus Collection. The real beauty of this solution is that regardless of the storage type, they all look like one thing to the Globus Web App (and the CLI and API as well) and therefore present themselves to the user the same way. There is no need to fish around for a specific app to connect to that storage or remember how a particular workflow functions. All storage systems are Globus collections, and Globus Transfer functions in the way Globus users are familiar with.
When Globus Connect Server is abstracting the storage (and if the storage appears as a Globus Collection, it is), the Globus Service (the component a user interacts with when using the Globus Web App) orchestrates the transfer. It initiates a connection directly between the Globus Collections, allowing them to communicate and transfer files in a seamless, interoperable way. No two-hop transfer, no attempting to match disparate transfer protocols; it just works.
Cloud storage providers often throttle transfers. This can take the form of bandwidth limiting or only allowing a certain amount of data within a specific time period and completely halting the transfer until that time period expires. Depending on the method of transfer, this can mean complete failure or the need to constantly babysit the transfer to ensure success.
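To make the babysitting problem concrete, here is a minimal sketch of what a do-it-yourself transfer loop has to handle when a provider throttles. It is a generic illustration, not Globus code or any provider's API: `ThrottledError`, `transfer_with_backoff`, and the `send_chunk` callable are all hypothetical names, and the retry policy (exponential backoff with a cap) is just one common choice.

```python
import time


class ThrottledError(Exception):
    """Hypothetical error raised when a provider rejects a request over quota."""


def transfer_with_backoff(send_chunk, chunks, base_delay=1.0, max_delay=60.0,
                          max_attempts=5, sleep=time.sleep):
    """Send each chunk, waiting and retrying whenever the provider throttles.

    send_chunk is an assumed callable that uploads one chunk and raises
    ThrottledError when the provider says "slow down".
    Returns the number of chunks sent successfully.
    """
    sent = 0
    for chunk in chunks:
        delay = base_delay
        for attempt in range(1, max_attempts + 1):
            try:
                send_chunk(chunk)
                sent += 1
                break
            except ThrottledError:
                if attempt == max_attempts:
                    # Out of retries: surface the failure to the caller.
                    raise
                sleep(delay)
                # Double the wait each time, up to the cap.
                delay = min(delay * 2, max_delay)
    return sent
```

Even this small loop needs tuning per provider, and it only covers one failure mode. That is the kind of bookkeeping a managed transfer service takes off your plate.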
With Globus, the reliability features that ensure data integrity, such as checksum verification and restart on failure, are not unique to cloud storage, but they apply here as well. With the fire-and-forget model, you'll get an email when the transfer completes successfully. Head out to dinner or a movie and let Globus be your big data babysitter.
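For readers who want to see the idea behind checksum verification and restart on failure, here is a small sketch using only the Python standard library on local files. It illustrates the general technique, not how Globus implements it; `sha256_of` and `copy_verified` are hypothetical helper names, and SHA-256 is an assumed choice of algorithm.

```python
import hashlib
import shutil


def sha256_of(path, block_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def copy_verified(source, destination, copy=shutil.copyfile, max_attempts=3):
    """Copy a file, then compare checksums; redo the copy if they differ.

    Returns the attempt number on which the copy verified successfully.
    """
    expected = sha256_of(source)
    for attempt in range(1, max_attempts + 1):
        copy(source, destination)
        if sha256_of(destination) == expected:
            return attempt
    raise IOError(f"checksum mismatch after {max_attempts} attempts")
```

Calling `copy_verified("data.bin", "backup.bin")` (file names are placeholders) returns 1 when the first copy verifies cleanly. The point is simply that the transfer is not considered done until the destination's checksum matches the source.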
When it comes to migrating data between cloud stores en masse, sit back, relax, and let Globus do the flying. We may not be earning those frequent flier miles these days, but at least our data can.