GridFTP: A Brief History of Fast File Transfer
April 16, 2024 | John Bresnahan
How GridFTP became the de facto standard for data transfer in the scientific community
Globus is widely used in the research ecosystem for reliable, secure, and high-performance data transfer. The service has two key aspects: an orchestration layer that manages these transfers, and GridFTP, a highly performant protocol that moves the bits on the wire between source and destination storage systems. The GridFTP protocol, now more than 25 years old, was designed and built to manage ever-increasing scientific data volumes across highly distributed, yet connected, sets of systems. In this blog post, we reflect on the evolution of GridFTP and the drivers behind a system designed to serve as the workhorse for scientific data movement.
Science Drivers
In the late 1990s, significant hardware resources were coming online that made grid computing a necessary reality. Important scientific instruments were being built at geographically distributed sites. The best example of this is the Large Hadron Collider at CERN, but there were many others, like the Advanced Photon Source at Argonne National Lab. These instruments were crucial (and still are today) to a wide swath of scientific research across the globe. They produced immense amounts of data, on the order of multiple petabytes per month. There were also powerful data centers at various laboratories and other locations capable of teraflops of processing. Additionally, graphics centers where the processed data could be visualized existed in limited locations, e.g., the CAVE at the University of Illinois Chicago.
These resources had potentially world-changing consequences, like discovering the “God particle” (the Higgs boson), by providing the processing power to search for the needle in massive data haystacks and the visualization systems to expedite human understanding of the results. In most cases these resources were thousands of miles apart, yet very accessible, due to another rapidly emerging technology: fast wide-area networks, also known as Long Fat Networks or LFNs.
LFNs, such as ESnet, connected laboratories across the globe at gigabit per second line speeds. This made it possible for institutions to feed their compute farms with data at rates that substantially accelerated the pace of discovery.
Design Requirements
While all these fast hardware resources were interconnected, the connections between them were only barely fast enough; there were no cycles to waste. What was missing was the software to efficiently make use of these resources, and GridFTP was the solution for moving data among them at near line speeds. Let’s take a look at what was needed for GridFTP to make all this possible.
Third-Party Transfer
In a scientific research scenario, a typical application operated as a pipeline (illustrated below). It started by scheduling a job to gather raw data at the source. Next, it coordinated transferring the data to a facility for processing. Once processed, the results were sent to the visualization center.
A key thing to note is that the orchestrating program was often running on a system distinct from any of the systems storing and processing the data. It only needed to provide instructions and schedule jobs—it did not need to see the data as they moved between systems. In fact, it typically did not have access to a fast network, so routing the data through this software was a significant performance disadvantage.
GridFTP provided the ability to perform a “third-party transfer”. Third-party transfer (also known as managed file transfer, or MFT) is now an established concept and the cornerstone of the current Globus transfer service; at the time, however, this was much less the case. We needed to design a solution such that the orchestrating program could contact a sender and a receiver of data, and give them enough information to securely transfer the data over a fast network without the orchestrator ever seeing it.
Security
It goes without saying that data transfer needs to be secure and reliable. The orchestrator of the transfer must safely access the data, be assured that no unauthorized party can see them, and most importantly, be sure that the data received by the destination are the same as the data sent by the source. Even in cases where the data are not sensitive in any way, it is still crucial that the data are not altered by a bad actor or faulty systems along the way.
The third-party transfer approach introduced a caveat: an orchestrator connecting securely to the source and destination was well-known practice, but telling the sender how to connect to the receiver and authenticate on behalf of the orchestrator was not. This created the need for a “delegated credential” that the orchestrator could give to both sender and receiver, containing enough information for each to prove to the other that it was in fact acting on the orchestrator’s behalf.
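As a rough sketch of the idea only (GridFTP's actual delegation was built on public-key certificates, and the details differ substantially), one can picture the orchestrator handing the same short-lived secret to both endpoints, which each side then uses to prove to the other that it acts on the orchestrator's behalf:

```python
import hashlib
import hmac
import os
import secrets

# The orchestrator generates a short-lived secret and hands it to both the
# sending and receiving servers over their (already secured) control channels.
delegated_key = secrets.token_bytes(32)

def prove(challenge: bytes) -> bytes:
    """Answer a challenge using the delegated key, proving possession of it."""
    return hmac.new(delegated_key, challenge, hashlib.sha256).digest()

def verify(challenge: bytes, answer: bytes) -> bool:
    """Check that the other party holds the same delegated key."""
    return hmac.compare_digest(prove(challenge), answer)

# When the data channel is opened, the receiver challenges the sender.
challenge = os.urandom(16)
assert verify(challenge, prove(challenge))   # both ends hold the delegation
```

The important property is that the two endpoints can authenticate each other directly, without the orchestrator ever sitting in the data path.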
Wide Area High Performance
The LFNs connecting compute sites were impressively fast at the time, given the state of the art (see a timeline of ESnet's capabilities). Fast as they were, it was not acceptable to use just 50%, 60%, or even 80% of the bandwidth. Applications needed all of it. Performance was the most important design requirement for GridFTP.
Coupled with this was the importance of reliability in the face of errors. Networks were inherently lossy, even the non-congested scientific networks that were the main target of GridFTP. This was greatly exacerbated on WANs, and even more so on LFNs.
The GridFTP Protocol
The authors of GridFTP used the well-known FTP protocol as a starting point. While it was not a widely used feature, FTP had the ability to do third-party transfers by using a two-channel protocol that separates the control channel and the data channel. The control channel was used to contact servers and make requests, and the data channel was where the requested data was streamed. In most historic use cases an FTP client would establish a control channel with a server and request to either have a file sent to the client, or have a file received by the server, with both the control and data channels using the same network path. However, they were still separate channels, and thus it was entirely possible for a client to form two control channels, one with the source server and one with the destination server, and instruct them to form a data channel between them, as shown below.
With traditional FTP we could meet the requirement of third-party transfers, but modifications were still needed in order to secure both the control channel and the data channel. While they are not discussed in detail here, suffice it to say that the control channel was wrapped in a security layer which, once established, could delegate a credential from the client to each server. The delegated credential on each end could then be used to authenticate the data channel.
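Setting the security layer aside, the control-channel exchange for such a third-party transfer can be sketched with plain FTP commands. The hostnames, credentials, and file name below are placeholders; the point is only that the client wires the two servers together without ever touching the data:

```python
import socket

def ftp_cmd(sock, line):
    """Send one control-channel command and return the server's reply."""
    sock.sendall((line + "\r\n").encode())
    return sock.recv(4096).decode()

def connect(host, user, password):
    s = socket.create_connection((host, 21))
    s.recv(4096)                              # 220 greeting
    ftp_cmd(s, f"USER {user}")
    ftp_cmd(s, f"PASS {password}")
    return s

src = connect("source.example.org", "demo", "demo")
dst = connect("dest.example.org", "demo", "demo")

# 1. Ask the destination to listen for a data connection (passive mode);
#    the 227 reply encodes the address it is listening on.
reply = ftp_cmd(dst, "PASV")                  # "227 ... (h1,h2,h3,h4,p1,p2)"
addr = reply[reply.index("(") + 1 : reply.index(")")]

# 2. Tell the source to open its data channel to that address.
ftp_cmd(src, f"PORT {addr}")

# 3. Start the transfer: the destination stores what the source retrieves.
#    The file now flows directly between the two servers; the client that
#    issued these commands never sees a byte of it.
ftp_cmd(dst, "STOR results.dat")
ftp_cmd(src, "RETR results.dat")
```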
A New Data Channel Protocol
The most substantial change that GridFTP introduced was the creation of a new data channel protocol. Traditional FTP largely relies on the stream mode (Mode S) data channel protocol. Stream mode is simply a TCP channel over which the data flows. There is no framing or blocking beyond what TCP offers. While TCP was, and is still today, the backbone of most network communication, this leaves a lot to be desired if we are designing for high performance.
TCP
To appreciate the improvements GridFTP introduces, it’s important to understand a few TCP concepts at a high level before we explain the design choices we made.
The biggest problem with stream mode is that TCP was not designed with high latency networks in mind. Nor was it designed to be greedy over uncongested networks. TCP was designed under the assumption that the network was an on-demand resource for many active users and thus needed to be cooperatively shared. The networks that GridFTP targeted were high latency, uncongested networks. While they were shared resources, they were made specifically for these scientific applications and thus, for better or for worse, sharing the resource was much less of a priority than it was for TCP.
TCP allows for a variety of congestion control algorithms to be used. These algorithms are what allow TCP to share network bandwidth fairly among many users' streams. At the time, TCP Reno was the most widely deployed TCP congestion control algorithm. While it attempted to share the network fairly, it was actually very unfair to streams with high latency, like those crossing a WAN or LFN.
The two parties sending data via TCP have no idea about the network topology that connects them—and how many users are competing for network bandwidth is even more of a mystery. Thus there is no a priori knowledge of how fast data can flow. If the sender sends too fast, the network will not be able to keep up and thus packets will have to be dropped and then retransmitted, making the transfer slower. On the other hand, if the sender goes too slow, network resources will remain idle.
To address this issue TCP sends packets in windows. When a stream first starts, TCP will send one packet of data and wait for an acknowledgement (ACK). Once received, it will send two packets and wait for their corresponding ACKs. Once those are received, it will send four, then eight, then 16, and so on in an exponential fashion until a packet is lost (also known as a congestion event). When a congestion event occurs, TCP assumes that it has found the fastest transfer rate the network can handle and settles at that speed. This ramp-up phase is known as TCP slow start.
If a congestion event occurs after TCP slow start, the allowed window size is cut in half, thus reducing the transfer rate by half. The assumption here is that a packet was lost because someone else tried to use the network while the current stream was hogging it. Thus, by halving the sending rate, a fair share of the network is released for the other flow. At this point TCP Reno slowly increases the window size, one packet at a time, until it hits another congestion event, at which point it again halves the window size. This is known as additive increase, multiplicative decrease (AIMD).
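The window dynamics described above can be captured in a toy simulation. The capacity and the round at which loss occurs are made-up numbers; the point is only the shape of the curve: exponential growth, a halving, then a slow linear climb:

```python
def simulate_cwnd(rounds, capacity_pkts, loss_rounds):
    """Return the congestion window (in packets) at each round trip."""
    cwnd, slow_start = 1.0, True
    history = []
    for rtt in range(rounds):
        history.append(cwnd)
        if rtt in loss_rounds or cwnd > capacity_pkts:
            cwnd = max(cwnd / 2, 1.0)    # multiplicative decrease on loss
            slow_start = False           # exponential ramp-up ends at first loss
        elif slow_start:
            cwnd *= 2                    # slow start: double every round trip
        else:
            cwnd += 1                    # additive increase: +1 packet per RTT
    return history

# A link that can hold ~256 packets in flight, with one extra loss at round 25.
print(simulate_cwnd(rounds=40, capacity_pkts=256, loss_rounds={25}))
```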
The WAN Penalty
The process described above is a very aggressive way to back off and share the network. While it was exceptionally effective during the growth of the Internet, it was far from ideal for the GridFTP use case, in particular because it was unfair to WANs.
The figure above shows two flows transferring a 2 GB file across a 1 Gbps network. One is a LAN flow with 10 ms of latency, and the other a WAN flow with 100 ms. The flows were simulated at different times and thus did not interfere with each other.
From 0 to about 1 on the x-axis we see the TCP slow start algorithm. Because the round trip time in the LAN case is so much lower than it is in the WAN case, we see that the LAN reaches full speed faster than the WAN. However, because the growth is exponential, it is not too much of a setback.
Looking further down the timeline on the x-axis we see a congestion event: the flows were both at full speed, lost a packet, and were reduced to half speed. This is where the WAN penalty is more pronounced. In this case the packet loss was not due to a long-held stream; it was either due to an anomaly in the network (which happens more often on WANs than LANs) or a very short competing flow. After the event, half of the network is again open for use and the two flows start slowly ramping back up, increasing their window sizes by roughly one packet with each round trip. The LAN gets back up to full speed in a reasonable amount of time, but we can barely see the increase in the WAN case. This is because each round trip on the WAN takes 10 times longer than it does on the LAN. Thus, for the WAN, additive increase moved far too slowly for it to ever reclaim the available bandwidth.
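A back-of-envelope calculation makes the penalty concrete. Assuming a 1 Gbps link and 1500-byte packets (round numbers of our choosing, not taken from the figure), the window needed to keep the pipe full is the bandwidth-delay product, and additive increase wins back roughly one packet per round trip:

```python
LINK_BPS = 1_000_000_000      # 1 Gbps link (assumption)
PACKET_BYTES = 1500           # typical Ethernet-sized packet (assumption)

def recovery_seconds(rtt_s):
    # Window (in packets) needed to keep the pipe full: bandwidth-delay product.
    full_window = LINK_BPS * rtt_s / (8 * PACKET_BYTES)
    # After a loss the window is halved; additive increase wins back
    # roughly one packet per round trip.
    packets_to_regain = full_window / 2
    return packets_to_regain * rtt_s

print(f"LAN  (10 ms RTT): ~{recovery_seconds(0.010):.0f} s to recover")   # ~4 s
print(f"WAN (100 ms RTT): ~{recovery_seconds(0.100):.0f} s to recover")   # ~417 s
```

The same halving thus costs the WAN flow roughly a hundred times more recovery time: the window deficit is ten times larger, and each increment takes ten times longer to earn.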
Parallel Streams
GridFTP targeted WANs specifically, and as a result needed to overcome the inherent bias of TCP against WANs. At the same time, TCP was available everywhere and GridFTP aspired to be a general purpose transfer protocol. Requiring a custom network stack in the operating system kernel in order to use GridFTP would create a major barrier to deployment. Hence we wanted to use TCP as the base reliability layer.
To mitigate the WAN penalty we used a technique known as parallel streams. Instead of using a single TCP stream, as stream mode (Mode S) does, we created Mode E, extended block mode. Mode E allowed the sender to establish many TCP connections with the receiver. It would then send chunks of data across whichever stream had availability. While the data were almost always sent over the same physical path, dividing up the logical path in this way mitigated the effects of the congestion events discussed above. The effect is illustrated below.
In this graph we show the throughput of four parallel streams. The green line at the top is the sum of the four streams; the other lines represent the throughput of each individual stream. At around the 10th second we see that one stream had a congestion event. While this stream experiences a setback, the others do not. Because the data are divided over several streams, the overall throughput of the transfer suffers a much smaller setback: it drops by about one eighth, rather than by half as it would with a single stream.
This is an important advantage of parallel streams, but not the only one. They also allow GridFTP to leverage multicore machines by encrypting each stream in parallel, dynamically adjust to discrepancies in stream speed, and make more efficient use of parallel file systems. This was a well-researched topic, and the curious reader is encouraged to read more about it in papers like this one.
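To make the Mode E idea concrete, here is a highly simplified sender-side sketch: the file is cut into blocks, and each block is pushed onto whichever TCP connection is free, prefixed with a small header carrying its length and offset so the receiver can write it into place. The header layout, addresses, and block size are illustrative, not the full Mode E wire protocol:

```python
import socket
import struct
import threading
from queue import Queue

BLOCK_SIZE = 1 << 20                    # 1 MiB blocks (illustrative)
RECEIVER = ("dest.example.org", 50000)  # placeholder data-channel address
NUM_STREAMS = 4

def send_blocks(path, blocks):
    """Drain blocks from the shared queue over one TCP connection."""
    with socket.create_connection(RECEIVER) as sock, open(path, "rb") as f:
        while True:
            item = blocks.get()
            if item is None:            # sentinel: no more blocks for this stream
                return
            offset, length = item
            f.seek(offset)
            data = f.read(length)
            # Illustrative 17-byte header: descriptor, byte count, offset.
            sock.sendall(struct.pack("!BQQ", 0, len(data), offset) + data)

def transfer(path, size):
    blocks = Queue()
    for offset in range(0, size, BLOCK_SIZE):
        blocks.put((offset, min(BLOCK_SIZE, size - offset)))
    workers = []
    for _ in range(NUM_STREAMS):
        blocks.put(None)                # one stop sentinel per stream
        t = threading.Thread(target=send_blocks, args=(path, blocks))
        t.start()
        workers.append(t)
    for t in workers:
        t.join()
```

A matching receiver would read each header and write the following bytes at the indicated offset, so blocks can arrive on any stream, in any order.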
In the decades that followed the creation of GridFTP, TCP has evolved quite a bit. GridFTP has incorporated congestion control algorithms like CUBIC, and will continue to take advantage of technological innovations.
Conclusion
In this post we discussed the origins of GridFTP and the design choices that made it a fast, secure, and successful transfer protocol. While the protocol and its performance benefits are a key pillar in delivering highly performant data transfers, the Globus transfer service has built on this protocol to provide new capabilities such as secure data sharing and a unified interface across diverse storage systems.
More importantly, it introduced the fire-and-forget paradigm that allows researchers to simply request that data be moved and not be concerned with the underlying infrastructure. Outsourcing reliability, performance optimization, and security to the Globus service greatly reduces the barrier to using such high performance protocols. And Globus does not stop there, as we continue to break down the data management obstacles that hinder scientific research and discovery. With recent additions such as remote computation and automation to support federated, distributed workflows, the possibilities are endless.