Data quality

Quality control of data is an integral part of all research and takes place at various stages, for example during data collection, data entry or digitisation, and data checking.

It is important to develop suitable procedures before data gathering starts. Including information on data quality in your Data Management Plan will ensure that standards and procedures are adhered to throughout the project by all members of the research team.

The following video, 'How to avoid a data management nightmare', produced by the New York University Health Sciences Library, demonstrates in a fun way the potentially disastrous consequences of quality assurance failures in data management.

NYU Health Sciences Library. Karen Hanson, Kevin Read, Alisa Surkis

Everybody wants to avoid data loss, article retractions, data errors. So it’s important to think about naming conventions, backups, workflows, variables.

Variables

(Meet Dr B. Hartwell, Cardiology Researcher, and Dr M. Audaheer, Research Coordinator)

Dr Hartwell: I’ve been looking at our data. I think we might get some interesting results if we use the new Oxford BMI calculation rather than the standard BMI. Can we run that analysis?

Dr Audaheer: Uh… I need to show you something. We only recorded the standard BMI so we can’t calculate an alternate one!

Dr Hartwell: That’s too bad… that could have been a game changer but at least we can still do the standard analysis.

How could this have been avoided? Never record compounded variables when the component variables are available.

Workflows

Dr Hartwell: We need to collect data from 300 patients with hypertension. In this folder are blank copies of the form I designed for data collection. I’ve also emailed each of you an empty spreadsheet for the results. Bye! Send me your spreadsheets when you’re done.

6 months later… Okay, let’s see what kind of results have been collected…

Wait, what is this? This data is a mess! I can probably sort out height and weight but I’ll have to leave smoking out of the analysis. I guess that will have to do.

When planning data collection, never assume that variable names mean the same thing to everybody. Be explicit about variable type: numeric, text, categorical. Be explicit about units of measurement: kgs, stones, lbs. Be explicit about definitions: current smoker? How many cigarettes a day? Ever smoked? Make sure everyone is on the same page BEFORE starting to collect data!

Naming conventions

Dr Audaheer: I finished entering the last of the data, but was confused about something… Some of the values for pulse-ox are over 100 per cent!

Dr Hartwell: That’s because that’s the pulse RATE… Wait a minute! The pulse-ox and pulse rate data have been mixed up. How can we possibly correct this?! The study is ruined!! Ruined!!

Ambiguous file or field names can be dangerous! Come up with a consistent naming convention that can be used by the entire team and document it!

Backups

One final thought… life is unpredictable so create a backup plan! Keep your data on a secure server. Ensure copies are geographically dispersed. How many days of data can you afford to lose?

Transcribed by Library Services, The Open University
Video produced by the NYU Health Sciences Library
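
The video's advice on component variables and explicit units can be illustrated with a short Python sketch. It is only an illustration: the field names and the choice of centimetres and kilograms are assumptions, not details taken from the video. The record stores height and weight, and the standard BMI is derived when needed instead of being stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """Component variables, with the unit written into each field name.

    The field names and units are hypothetical examples.
    """
    height_cm: float
    weight_kg: float


def standard_bmi(m: Measurement) -> float:
    """Standard BMI: weight in kilograms divided by height in metres squared."""
    height_m = m.height_cm / 100
    return m.weight_kg / height_m ** 2


record = Measurement(height_cm=180.0, weight_kg=81.0)
print(round(standard_bmi(record), 1))
```

Because height and weight are kept, a different index could be calculated later from the same records.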

Data collection

Quality control during data collection can include:

  • calibration of instruments to check the precision, bias and/or scale of measurement
  • taking multiple measurements, observations or samples
  • checking the truth of the record with an expert
  • using standardised methods and protocols for capturing observations, alongside recording forms with clear instructions
  • using computer-assisted interview software to standardise interviews, verify response consistency, route and customise questions so that only appropriate questions are asked, confirm responses against previous answers where appropriate, and detect inadmissible responses
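
As a minimal sketch of the "multiple measurements" point in the list above, the Python below averages repeated readings and flags a set for review when the readings disagree by more than a tolerance. The function name, the example readings and the tolerance are hypothetical; a real project would set the tolerance from its own protocol.

```python
from statistics import mean


def summarise_repeats(readings, tolerance):
    """Average repeated readings and flag the set if they disagree too much."""
    spread = max(readings) - min(readings)
    return {
        "mean": mean(readings),
        "spread": spread,
        "needs_review": spread > tolerance,  # True when readings differ by more than the tolerance
    }


# Hypothetical repeated readings of the same quantity.
consistent = summarise_repeats([120, 122, 121], tolerance=5)
inconsistent = summarise_repeats([120, 140, 121], tolerance=5)
print(consistent["needs_review"], inconsistent["needs_review"])
```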

Digitisation and data entry

When data are digitised, transcribed, entered in a database or spreadsheet, or coded, quality can be ensured by standardised and consistent procedures for data entry with clear instructions.

This may include:

  • setting up validation rules or input masks in data entry software
  • using data entry screens
  • using controlled vocabularies, code lists and choice lists to minimise manual data entry
  • detailed labelling of variable and record names to avoid confusion
  • designing a purpose-built database structure to organise data and data files
  • accompanying notes and documentation about the data
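
Validation rules and code lists of the kind listed above are usually configured in the data entry software itself. The Python sketch below only illustrates the idea: the field names, the smoking code list and the 0 to 100 range are assumptions made for the example.

```python
# Hypothetical code list and plausible range; a real project would take
# these from its own documentation.
SMOKING_CODES = {"never", "former", "current"}
PULSE_OX_RANGE = (0.0, 100.0)  # oxygen saturation, per cent


def validate_entry(entry: dict) -> list:
    """Return a list of problems found in one data-entry record."""
    problems = []

    # Controlled vocabulary: only values from the code list are accepted.
    if entry.get("smoking_status") not in SMOKING_CODES:
        problems.append(
            "smoking_status must be one of: " + ", ".join(sorted(SMOKING_CODES))
        )

    # Validation rule: the value must be numeric and within the allowed range.
    try:
        pulse_ox = float(entry.get("pulse_ox_percent", ""))
    except (TypeError, ValueError):
        problems.append("pulse_ox_percent must be a number")
    else:
        low, high = PULSE_OX_RANGE
        if not low <= pulse_ox <= high:
            problems.append("pulse_ox_percent must be between 0 and 100")

    return problems


print(validate_entry({"smoking_status": "yes", "pulse_ox_percent": "104"}))
```

Rejecting a value at entry time, as here, catches a mix-up such as a pulse rate typed into an oxygen saturation field before it reaches the dataset.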

Data checking

Data checking is the stage at which data are edited, cleaned, verified, cross-checked and validated. Checking typically involves both automated and manual procedures, for example:

  • double-checking coding of observations or responses and out-of-range values
  • checking data completeness
  • adding variable and value labels where appropriate
  • verifying random samples of the digital data against the original data
  • double entry of data
  • statistical analyses such as frequencies, means, ranges or clustering to detect errors and anomalous values
  • correcting errors made during transcription
  • peer review
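
Some of the automated checks listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a recommended tool: the records, field names and the plausible pulse-rate range are invented for the example, and the double-entry comparison assumes both lists hold the same records in the same order.

```python
from collections import Counter


def missing_fields(record, required):
    """Completeness check: list required fields that are absent or empty."""
    return [f for f in required if record.get(f) in (None, "")]


def out_of_range(records, field, low, high):
    """Return the positions of records whose value lies outside [low, high]."""
    return [
        i for i, r in enumerate(records)
        if r.get(field) not in (None, "") and not low <= float(r[field]) <= high
    ]


def double_entry_mismatches(first, second):
    """Compare two independent entries of the same records, field by field."""
    mismatches = []
    for i, (a, b) in enumerate(zip(first, second)):
        for field in sorted(set(a) | set(b)):
            if a.get(field) != b.get(field):
                mismatches.append((i, field, a.get(field), b.get(field)))
    return mismatches


def frequencies(records, field):
    """Frequency table for one field, useful for spotting inconsistent codes."""
    return Counter(r.get(field) for r in records)


# Hypothetical data entered twice by different people.
first = [
    {"id": "p01", "pulse_rate": 72, "smoker": "never"},
    {"id": "p02", "pulse_rate": 720, "smoker": "current"},
    {"id": "p03", "pulse_rate": None, "smoker": "Current"},
]
second = [
    {"id": "p01", "pulse_rate": 72, "smoker": "never"},
    {"id": "p02", "pulse_rate": 72, "smoker": "current"},
    {"id": "p03", "pulse_rate": None, "smoker": "Current"},
]

print(out_of_range(first, "pulse_rate", 30, 220))
print(double_entry_mismatches(first, second))
print(missing_fields(first[2], ["id", "pulse_rate", "smoker"]))
print(frequencies(first, "smoker"))
```

Flagged values still need a manual decision: the code only points to records that should be compared against the original source.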

Video guides to managing data quality

Video: Data Quality Checking

Produced by the Statistical Services Centre in collaboration with the University of Reading, this video discusses the importance of data quality checks throughout a research project.
