The State of the Craft in Research Data Management

November 23, 2020 | Susan Tussy

Part II in a Five-Part Series

Data volumes are exploding, and the need to store and share data quickly, reliably and securely, while also making it discoverable, is increasingly important. Deliberate planning and execution are necessary to properly collect, curate, manage, protect and disseminate the data that are the lifeblood of the modern research enterprise.

To better understand the current state of research data management, we reached out to several research computing leaders at institutions around the world to gain insight into their approaches and their recommendations for others. We wanted to explore how the data management tools, services and processes deployed at their institutions help the research community reduce the time spent addressing technological hurdles and accelerate their time to science.

At these and many other leading institutions, new methodologies and technologies are being developed and adopted to create a frictionless environment for scientific discovery. Self-service data portals and point-and-click data management tools such as Globus are being made available so researchers can spend more time on science and less on technology. Education, documentation and a data management plan further support that goal; hackathons, webinars and researcher resource books are just some of the ways institutions are teaching researchers best practices. Science Node published a summary of our findings, and we present here a fuller discussion with Doug Jennewein, Senior Director of Research Technology at Arizona State University, as the second installment in our five-part series.


What do you consider to be the key challenges in research data management?

I see four key things: movement, storage, metadata and publication. People are still moving data by mailing hard drives; yes, it is boring, but it is a real problem. Where do you store data, and how do you charge people for it? Again, boring but real. Movement and storage are perennial challenges, but good metadata and metadata management have bubbled to the surface. The challenge with metadata is ensuring that sufficient metadata is expressed and associated with data throughout the data lifecycle, because metadata is what makes data reusable and results reproducible. Nobody is putting data in cold storage anymore; in the age of AI you need to be able to revisit data the moment a new machine learning algorithm comes out. And lastly, publication: supporting data publication as a first-class element of cyberinfrastructure. It should be just as well supported as optimizing a GPU workflow.
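
To make the metadata point concrete, here is a minimal sketch of the kind of descriptive record that might travel with a dataset through its lifecycle. The field names loosely echo common schemas such as DataCite, and every identifier and value shown is hypothetical.

```python
# A minimal, hypothetical metadata record for a published dataset.
# Field names loosely follow common schemas (e.g., DataCite); values
# are illustrative only. Keeping a record like this alongside the data
# supports discovery, reuse, and reproducibility across the lifecycle.
import json

metadata = {
    "identifier": "doi:10.xxxx/example-dataset",  # hypothetical DOI
    "title": "Example sensor readings, 2020 field campaign",
    "creators": ["Jane Researcher"],
    "publication_year": 2020,
    "description": "Raw and calibrated readings from the 2020 campaign.",
    "subjects": ["environmental sensing", "machine learning"],
    "rights": "CC-BY-4.0",
    "related_files": ["readings_raw.csv", "readings_calibrated.csv"],
}

# Serialize next to the data so the description travels with it.
with open("dataset_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```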

What are some of the issues you face in research data management?

Certainly automation is a problem. In the advanced networking world we talk about the need for automation and friction-free access to data. Take, for example, when a new faculty member comes on board: we provide software packages and install them, and from their perspective all of this is streamlined and automated. But numerous issues arise when it comes to data, such as how to publish data, move data, and make data findable. On many campuses we don't have the same level of automation, the same level of service, for data management that we have for advanced computing.
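
As one illustration of what that automation could look like, here is a minimal sketch that scripts a recurring transfer with the Globus Python SDK. The endpoint UUIDs, paths, and token handling are assumptions; a production script would obtain tokens through whatever OAuth2 flow the institution has deployed.

```python
# Minimal sketch: scripting a data transfer with the Globus Python SDK
# (pip install globus-sdk). Endpoint UUIDs, paths, and the access token
# are hypothetical placeholders; a real deployment would obtain tokens
# via an OAuth2 flow appropriate to the institution's setup.
import globus_sdk

TRANSFER_TOKEN = "..."        # assumed: obtained via a Globus OAuth2 flow
SRC_ENDPOINT = "SOURCE-UUID"  # hypothetical source endpoint UUID
DST_ENDPOINT = "DEST-UUID"    # hypothetical destination endpoint UUID

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN)
)

# Describe the transfer: verify integrity with checksums and sync only
# changed files, so the script can run repeatedly without friction.
task = globus_sdk.TransferData(
    tc, SRC_ENDPOINT, DST_ENDPOINT,
    label="nightly lab data sync",
    sync_level="checksum",
)
task.add_item("/lab/instrument/output/", "/project/archive/", recursive=True)

result = tc.submit_transfer(task)
print("Submitted Globus transfer, task_id:", result["task_id"])
```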

What about secure data movement and managing protected data?

With the pandemic, a lot of attention and institutional support has been given to secure data storage and movement with an audit trail. These capabilities are now front and center, and stepping up overnight has been a challenge.

How can you encourage adoption of data management tools and technologies?

It needs to be part of the mission of the campus cyberinfrastructure facilitator, or whoever fills that kind of role. We have made great strides in high-performance computing and things like the Open Science Grid, but while many people compute, everyone has data. Many researchers employ advanced computing; virtually all university researchers have data management needs. Look at what we did in computing: the need for guidance and expertise in the practice of advanced computing gave rise to an emerging job family of advanced computing facilitators. The same thing needs to happen with data management. You need to redefine the facilitator job category to embrace data, and you need to educate users: sit with faculty and show them how to move data, and where tools like Globus transfer and SFTP make sense. You need to demonstrate the data tools as first-class citizens alongside advanced computing. Yes, computing is where the analysis happens, but there are two other pieces, networking and data, that are equally challenging. Data management is as important as advanced computing, and not just storage but data; let's make that distinction. Data must be a first-class citizen. At ASU we have a Research Data Management office and a digitally focused data repository; that is a strength and an investment the institution takes seriously.

How can you make researchers aware of the capabilities that are available?

We present data services and the research data management office in a researcher resource book that covers core facilities, data management, and everything else researchers need, and when a researcher comes on board we show them these tools. We incorporate data aspects into our outreach on advanced computing, because you can't have computing without data. We also engage with the library, since researchers know to go there; the combination of data, computing and the library works well for us.

How do you measure success?

Frequency of data use could be a measure, or mean time to reuse. In the case of cloud solutions it could be egress, measured in both time and money. It could also be the number of scientific results, because at the end of the day you don't want the data infrastructure to impede scientific progress.
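
As a rough illustration of one of these metrics, here is a minimal sketch that computes mean time to reuse from a hypothetical access log; the datasets and timestamps are invented for the example.

```python
# Minimal sketch: computing "mean time to reuse" from a hypothetical
# access log. Each record pairs a dataset's publication time with its
# first reuse time; names and timestamps are illustrative only.
from datetime import datetime
from statistics import mean

access_log = [
    # (dataset, published, first_reused) -- hypothetical values
    ("dataset-a", datetime(2020, 1, 10), datetime(2020, 2, 1)),
    ("dataset-b", datetime(2020, 3, 5), datetime(2020, 3, 9)),
    ("dataset-c", datetime(2020, 6, 20), datetime(2020, 9, 1)),
]

days_to_reuse = [(reused - published).days
                 for _, published, reused in access_log]
print(f"Mean time to reuse: {mean(days_to_reuse):.1f} days")
```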