The State of the Craft in Research Data Management - Part I

October 05, 2020   |  Susan Tussy

Part I in a Five-Part Series

Data volumes are exploding, and the need to efficiently store and share data quickly, reliably and securely—along with making the data discoverable—is increasingly important. Deliberate planning and execution are necessary to properly collect, curate, manage, protect and disseminate the data that are the lifeblood of the modern research enterprise.

To better understand the current state of research data management, we reached out to several research computing leaders at institutions around the world to gain insights into their approaches and their recommendations for others. We wanted to explore how data management tools, services, and processes deployed at their institutions help the research community spend less time on technological hurdles and accelerate their time to science.

At these and many other leading institutions, new methodologies and technologies are being developed and adopted to create a frictionless environment for scientific discovery. Self-service data portals and point-and-click data management tools such as Globus are being made available to allow researchers to spend more time on science and less on technology. Education, documentation and a data management plan support the same goal. Hackathons, webinars and researcher resource books are just some of the ways that institutions are educating researchers on best practices. Science Node published a summary of our findings, and we present here a more complete discussion with Rajendra Bose from Columbia University, as the first in our five-part series.

 

Rajendra Bose, Director of Research Computing, The Mortimer B. Zuckerman Mind Brain Behavior Institute, Columbia University

Globus: Why is ease of data management important for science?

Rajendra: Our experience interacting with neuroscience labs in recent years is not surprising: researchers have limited time to do science in between developing proposals and dealing with the administrative tasks of a lab, so they would benefit from a worry-free solution for keeping and organizing data at the lab scale. Preferably, a proposed solution matches a simple conceptual model of how it should work, so that researchers can focus on tackling science questions rather than on the details of how to move and copy data sets around and how backup and restore work. Some faculty principal investigators (PIs) face data management challenges and find it difficult to retrieve data from previous experiments, or they are not able to replicate experimental results or combine data from multiple experiments to get better insights. NIH and some journals require data sharing, which can be achieved only if you organize your data in an acceptable way. If you incorporate routine procedures or a standard database into your data management tasks, you gain some sanity checks against mistakes in saving your data. The cost issues are not trivial either, as labs do not always understand their own costs or the level of security needed to maintain their research data.

Globus: Describe the “universe” of tools you use for data management at your institution.

Rajendra: We provide each of our 50 labs with the opportunity to use ~10 TB of free storage per year, with the ability to purchase more, on the Institute’s enterprise storage system, which includes offsite backup. We purposefully give labs the ability to structure their shares in the way that suits them. We are considering developing an “in-house” archiving tier of storage, and we are creating a self-service portal so that key lab administrators can adjust lab storage without having to send a ticket to our team.

We use a few different tools here at the university. We use our Globus subscription and have an Institute endpoint so we can help with the occasional need for large data transfers. Through NIH U19 awards, we are moving forward with several labs on the use of the free, open-source DataJoint software developed by Vathes LLC to track lab activity, including complete experiment and data analysis pipelines. Some parts of the university make use of the LabArchives online lab notebook service provided by our central IT and libraries, but our labs do not seem to be using it.

Globus: How do you encourage adoption of new data management technologies?

Rajendra: One way is that we have been able to use sponsored projects to pilot technology with labs and PIs that have bought into the plan. We have learned that initial discussions or consultations with lab members are key to understanding what they are trying to accomplish, and only then do we provide guidance and options. In labs that collect a lot of data and transfer data between lab members, the burden of data and even hardware maintenance tends to fall on one "technical" student or postdoctoral researcher. These labs are often the best customers to start with.

Globus: Are there areas where you feel current approaches can be improved?

Rajendra: I sense that those of us providing the technology think that our teams' online web and wiki pages of instructions and explanations of how to connect via SMB and NFS and the like are simple. I can imagine that researchers who are concentrating on creating experiments and delving into the details of biology and chemistry, while they will adapt to the demands of current technical solutions, really want to work in an environment that is even more conceptually simple and offers the easiest possible way to do things. For example, a quick drag and drop here and there, or clicking on some items in a list and pressing a button. Obviously these could be masking layers of technology, but in the best case they would appear simple.

Globus: What is the biggest barrier to adoption?

Rajendra: What I sense is that researchers with limited time do not want to (or cannot) invest much time in wading through online pages of instructions, and even if a training session could be set up, they don’t all necessarily want to sit through one to learn how to use new tools that are not directly connected to their science. The researchers' view is that these are technical issues with low priority and no relation to science. We have a case where one of our researchers took note of another organization's ability to explore comprehensive records of a large number of experiments within seconds using a query. This escalated our Institute's involvement in DataJoint and GitHub projects. Based on our experience, many student and postdoctoral researchers will want to have these tools in their skillset.

Globus: What steps have you taken to facilitate data management tools?

Rajendra: We are trying to be active in the NIH U19 Data Science Consortium through two NIH U19 awards with lead PIs at the Institute. It is a somewhat novel idea to have teams from normally competitive sponsored projects, awarded to different organizations, work together on the “data science” side of things. This means people from different projects share and learn from the tools everyone is using, and distill experiences, with the ultimate goal being that the neuroscience community shares a framework that other or new projects can use or adapt.

Globus: Which approaches worked well for you? What didn’t work that you expected to work?

Rajendra: Having staff with the right backgrounds and an interest in supporting science, who can have good consulting-type discussions with researchers and have the time to do so. Having staff build relationships and trust with faculty and their varying lab populations is proving to be a successful approach to delivering solutions and resolving researcher problems related to computing and data.

In terms of introducing a new approach, we have experimented (through part of a sponsored project award) with having a dedicated “data engineer” with an appropriate background who can serve as a bridge between the technology and the research world. Having someone work closely, almost daily, with lab members whose PIs have “bought into” piloting new data and code management tools is the kind of effort needed to really learn whether a brand new approach or tool (in this case DataJoint) will eventually be adopted.

It takes much more than simply developing something and announcing on a web page or through email that something is available for use. In some cases, because researchers are often already fluent in sophisticated data analysis software and scientific instruments and procedures, we may be overestimating their willingness to jump in and explore new tools. It is not easy, but you need to get to a point where the PIs trust your advice and you have an understanding of their main goals versus their need for technical solutions. Then they will feel more comfortable reaching out to your team and more willing to try a solution the team is providing.