Collect Data

What Is Research Data?

To understand research data management (RDM) requirements, you first have to understand the definition of research data. The term research data combines two key concepts: research and data. Research might be described as a systematic process of investigation, a way of finding out about things. Research transforms information into knowledge and is a part of how we discover the world. Data can be an important part of that knowledge discovery. Data are one type of information or evidence that serves as input to research. But not all information in a research project is data.

Canada’s Tri-Agency FAQ (2021) states that “What is considered relevant research data is often highly contextual, and determining what counts as such should be guided by disciplinary norms.” In short, context is important; you can’t really define research data without looking at how it’s being generated and used. The FAQ section “How are research materials related to research data?” delves into this: “Research materials serve as the object of an investigation, whether scientific, scholarly, literary or artistic, and are used to create research data. Research materials are transformed into data through method or practice.”

General	Social Sciences	STEM Fields
images; video; mapping/GIS data; numerical measurements; software & code	survey responses; focus group and individual interviews; economic indicators; demographics; opinion polling	measurements generated by sensors/laboratory instruments; computer modeling; simulations; observations and/or field studies; specimens

Table 1. Common data types by discipline

That transformation is a key part of separating general information from research data. Data are the results of taking raw information from any source (e.g., informants/survey respondents, archival or bibliographic data, social media, scientific instruments, document text) and collecting or assembling that information into a structured form to serve as an input for further research. Because of the work that goes into structuring, annotating, and organizing research data, they can also be considered a research output, along with books, articles, and other items created by researchers. Research data are a vital source of information that may not be captured in any other source. If they are published or shared, they can be referred to by other researchers and cited just like any other research output.

For example, a researcher may use a set of research articles as input for their research. If the researcher is simply reading those articles and referring to their contents through citations to support other ideas, the articles are serving as research material, but not research data. However, if the researcher takes the same set of articles, imports them into a piece of software, and reviews and annotates them in a structured way to come to some sort of formal conclusion on the group of articles as a whole, then those articles form a dataset and are considered research data.

Research data can be secondary data, meaning that the researcher did not collect or assemble the material themselves. In this case, the structuring or refining to serve as input may have been done by another researcher. Or the data may come pre-structured if they are administrative data (say, extracted from an admissions database). Either way, a structured collection of information that is being refined into research through analysis is still considered research data.

Data Dictionaries

Consider creating a data dictionary. A data dictionary is a file that describes each element of your dataset. If your dataset includes tabular (spreadsheet) data, the data dictionary would include a list of the fields in the table and what they mean, including units and precision.
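As a minimal sketch of what such a file could look like for tabular data, the Python example below writes a small data dictionary as CSV, one row per field. The dataset, field names, units, and precision values are hypothetical and chosen only for illustration; the column layout (field, description, type, units, precision) is one reasonable choice, not a required standard.

```python
import csv
import io

# Hypothetical data dictionary for an illustrative tabular dataset.
# Every field name, unit, and precision value here is an assumption.
DATA_DICTIONARY = [
    {"field": "site_id", "description": "Identifier of the sampling site",
     "type": "string", "units": "", "precision": ""},
    {"field": "sample_date", "description": "Date the sample was collected (ISO 8601)",
     "type": "date", "units": "", "precision": "1 day"},
    {"field": "water_temp", "description": "Water temperature at 1 m depth",
     "type": "float", "units": "degrees Celsius", "precision": "0.1"},
]


def dictionary_to_csv(entries):
    """Serialize data dictionary entries to CSV text, one row per field."""
    columns = ["field", "description", "type", "units", "precision"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


if __name__ == "__main__":
    # Print the dictionary; in practice it would be saved next to the dataset.
    print(dictionary_to_csv(DATA_DICTIONARY))
```

Saving the result as a plain CSV or text file alongside the dataset keeps the dictionary readable without special software.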

If your dataset includes R or Python code or scripts, the dictionary would provide a brief overview of the purpose of the code (if not already contained in comments) and information about how the code relates to the dataset. [From Smithsonian Data Management Best Practices, Describing Your Data: Data Dictionaries (PDF).]

Data dictionaries have several benefits:

Keeping things consistent across a project. The dictionary can define data names, labels, units, constraints such as acceptable range of values, and other characteristics.
Enabling software to process a data file by providing details about the file. This information might include the type of data in each column (integer, character, date, etc.), the name of the column, the physical units, if relevant, and whether nulls are included.
Increasing interoperability and reuse of the data that you want to share and publish.
Providing “human-readable” details to support discovery, interpretation and analysis.
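To illustrate the consistency and software-processing benefits, the sketch below uses a machine-readable data dictionary to check rows of a dataset. It assumes a hypothetical dictionary stored as a Python mapping with a type, a null policy, and an optional acceptable range for each field; the field names, limits, and the `validate_row` helper are all invented for this example.

```python
# Hypothetical dictionary: each field's type, whether nulls are allowed,
# and an optional acceptable range. All names and limits are illustrative.
FIELDS = {
    "site_id": {"type": str, "nullable": False},
    "water_temp": {"type": float, "nullable": True, "min": -5.0, "max": 40.0},
    "fish_count": {"type": int, "nullable": False, "min": 0},
}


def validate_row(row, fields=FIELDS):
    """Return a list of problems found in one row (a dict of raw strings)."""
    problems = []
    for name, spec in fields.items():
        raw = row.get(name, "")
        if raw == "":
            # An empty string is treated as a null value.
            if not spec["nullable"]:
                problems.append(f"{name}: missing value")
            continue
        try:
            value = spec["type"](raw)
        except ValueError:
            problems.append(f"{name}: expected {spec['type'].__name__}, got {raw!r}")
            continue
        if "min" in spec and value < spec["min"]:
            problems.append(f"{name}: {value} is below minimum {spec['min']}")
        if "max" in spec and value > spec["max"]:
            problems.append(f"{name}: {value} is above maximum {spec['max']}")
    # Columns that the dictionary does not describe are also flagged.
    for name in row:
        if name not in fields:
            problems.append(f"{name}: not defined in the data dictionary")
    return problems


if __name__ == "__main__":
    print(validate_row({"site_id": "", "water_temp": "55", "fish_count": "-1"}))
```

Because the rules live in the dictionary rather than in the checking code, the same definitions can be reused by everyone on a project and shared with the published dataset.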

For more details on what might be in a data dictionary, how to make one, and examples, see:

The information on this webpage is adapted from Documenting Data: Metadata by the University of Iowa Libraries under a CC-BY 4.0 license, Research Data Management 101: The Lifecycle of a Dataset by MIT Libraries under a CC-BY 4.0 license, and Data Management for postdocs and researchers by MIT Libraries under a CC-BY 4.0 license.