Data management

September 20th, 2013

There’s a software-developer maxim that I heard recently, which is “if you’re writing code, you should be using source control”. Source control, also known as version control, is a system that looks after all the code that you write and stores it, and data about all the changes you make to it, in a form that allows you to revert to previous versions easily. It also allows you to work collaboratively with other people. By analogy with this principle, let me introduce a new maxim: “if you’re collecting data, you should be doing data management”. What’s “data management”? Let me explain…

Data management is looking after your data and storing it in a form that makes it easy to retrieve and understand later. A common situation is that you start out doing some “play” experiments, fiddling about to try and get a handle on some new piece of equipment. You collect some data, perhaps some numbers in a logbook, perhaps some sort of data file. Then you do more experiments, and unless you were meticulous, you end up with a whole load of different experimental results with filenames like DATA_EXPT.xls, DATA_EXPT2.XLS, DATA_TUESDAY.xls, etc. You put them aside for a week and then come back to them. They were meaningful then, but now you’ve forgotten what the parameters were, or which of the runs produced interesting results. Now you have a data management problem.

I’ve amassed some useful techniques for doing data management, mostly from my PhD and some subsequent work. Here are some tips:

  • Give your experiments a “run number”, which uniquely identifies that particular experiment. I tend to use four-digit numbers, so my first run is called 0001 or perhaps “SOMETHING_0001”. The fixed number of digits means the filenames sort into the right order when you browse them on the computer.
  • Write the run number on everything that’s relevant – sample bottles, printouts, filenames, in your logbook – anywhere that you need to refer to that experiment.
  • Create a master log, which has the run number for every experiment, and all the metadata associated with it, in columns. You can do this in a paper logbook, with pre-printed forms that you fill in in the lab and then scan or file, or in a spreadsheet. My master logs typically contain the run number, the date and time of the experiment, details of the parameters used, and then comments on the results. (There’s a small Python sketch of a spreadsheet-style master log after this list.)
  • Don’t be tempted to put metadata in filenames or filepaths – just stick to the run number.
  • If you have an experiment which turns out to have gone wrong – perhaps because of a fault with the equipment, or somesuch – make an entry in the master log so that you know that was a “duff” run.
  • For datasets over a few tens of runs, I tend to put a simple “QA” column in the master log. Is this data “Good”, “Bad” or “Questionable”? When you see the results of a run, classify it and put the appropriate label in the QA column. You can then quickly filter out bad runs when browsing for the right dataset (see the filtering sketch after this list).
  • Back up your data. Ideally, you should be keeping your data in an electronic format on a machine that’s regularly backed up, like a shared network drive on a workplace server, or using Dropbox or a similar cloud sync system. There are obviously data protection and privacy concerns for some types of data, so choose carefully what’s best for your situation. Make sure that you keep the master log with your data, and keep it backed up too!
  • If you have data files with some sort of arbitrary format – CSV files, for instance – it’s worth making sure that you keep a description of the file format with your data. You may not remember what the format was next month, or even next year!
  • If you are collecting a large amount of data, it is well worth considering using a full-on database to store and retrieve it. This applies particularly where you are collecting time-series data over multiple days, as the database can easily handle “give me all the data from 8pm yesterday to 8am today” without you having to deal with the fact that new log files were created every hour, for instance (see the query sketch after this list).
  • Automation is your friend. You spend less time doing clerical work and more time thinking about the meaning of your datasets.
  • For time-series data, set up a system to automatically load your data into the database regularly – I have a system running at the moment that imports every six hours (there’s a sketch of a simple importer after this list).
  • If you have data in a database, you can write reporting tools to produce graphs and analyses easily. I recommend Python as a language with good libraries for this sort of thing – I’m using Matplotlib for graphs and python-pptx to automatically make nice PowerPoint slides (with the corporate template and all!) for my analyses. A sketch of this kind of script is below.
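
Here are a few concrete sketches of the points above, all in Python. First, the master log: if you keep it as a CSV spreadsheet, appending one row per run takes a handful of lines. The file name, parameter columns and values below are placeholders I’ve made up – swap in whatever your experiment actually records.

    # append one row per run to a CSV master log - a minimal sketch, not a finished tool
    import csv
    import os
    from datetime import datetime

    MASTER_LOG = "master_log.csv"   # hypothetical path - keep it alongside your data

    def log_run(run_number, temperature_c, flow_rate, comments=""):
        row = {
            "run": f"{run_number:04d}",   # zero-padded run number so files sort in order
            "timestamp": datetime.now().isoformat(timespec="seconds"),
            "temperature_c": temperature_c,   # example parameters - use your own
            "flow_rate": flow_rate,
            "qa": "",                         # fill in Good/Bad/Questionable later
            "comments": comments,
        }
        new_file = not os.path.exists(MASTER_LOG)
        with open(MASTER_LOG, "a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=row.keys())
            if new_file:
                writer.writeheader()          # header row only for a brand-new log
            writer.writerow(row)

    log_run(1, temperature_c=25.0, flow_rate=1.2, comments="first play run")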
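
The QA column then pays off when you want to browse only the good runs. A couple of lines against that same (hypothetical) master log does the filtering:

    # list the run numbers whose QA label is "Good"
    import csv

    with open("master_log.csv", newline="") as f:
        good_runs = [row["run"] for row in csv.DictReader(f) if row["qa"] == "Good"]

    print(good_runs)   # e.g. ['0001', '0003', ...]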
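
For the database point, here’s roughly what the “give me all the data from 8pm yesterday to 8am today” query looks like with SQLite, which ships with Python. The database file, table and columns are made up for the example:

    # query a time range from an SQLite database of readings - a sketch
    import sqlite3
    from datetime import datetime, timedelta

    conn = sqlite3.connect("experiments.db")   # hypothetical database file
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings (timestamp TEXT, run TEXT, value REAL)"
    )

    # 8pm yesterday to 8am today, regardless of how many log files that spans
    start = (datetime.now() - timedelta(days=1)).replace(hour=20, minute=0,
                                                         second=0, microsecond=0)
    end = datetime.now().replace(hour=8, minute=0, second=0, microsecond=0)

    rows = conn.execute(
        "SELECT timestamp, run, value FROM readings "
        "WHERE timestamp BETWEEN ? AND ? ORDER BY timestamp",
        (start.isoformat(), end.isoformat()),
    ).fetchall()
    conn.close()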
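
The regular import can be a short script that a scheduler (cron on Linux, Task Scheduler on Windows) runs every few hours. This sketch assumes hourly CSV log files landing in an “incoming_logs” folder and the same hypothetical “readings” table as above:

    # load any new CSV log files into the database, remembering which ones we've done
    import csv
    import sqlite3
    from pathlib import Path

    conn = sqlite3.connect("experiments.db")
    conn.execute("CREATE TABLE IF NOT EXISTS readings (timestamp TEXT, run TEXT, value REAL)")
    conn.execute("CREATE TABLE IF NOT EXISTS imported_files (name TEXT PRIMARY KEY)")

    for log_file in sorted(Path("incoming_logs").glob("*.csv")):
        seen = conn.execute("SELECT 1 FROM imported_files WHERE name = ?",
                            (log_file.name,)).fetchone()
        if seen:
            continue                      # already loaded on a previous pass
        with open(log_file, newline="") as f:
            rows = [(r["timestamp"], r["run"], r["value"]) for r in csv.DictReader(f)]
        conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
        conn.execute("INSERT INTO imported_files VALUES (?)", (log_file.name,))
        conn.commit()
    conn.close()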
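
And finally the reporting end: a Matplotlib chart saved to an image and dropped onto a PowerPoint slide with python-pptx. The data and file names are placeholders, and the slide layout index assumes python-pptx’s built-in default template – pass your corporate template’s path to Presentation() instead if you have one.

    # plot some results and put the chart on a PowerPoint slide - a minimal sketch
    import matplotlib
    matplotlib.use("Agg")                 # render without a display, e.g. on a server
    import matplotlib.pyplot as plt
    from pptx import Presentation
    from pptx.util import Inches

    # placeholder data - in practice, pull this out of the database
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3, 4], [10.2, 11.8, 9.4, 13.1], marker="o")
    ax.set_xlabel("run")
    ax.set_ylabel("value")
    fig.savefig("results.png", dpi=150)

    prs = Presentation()                  # or Presentation("corporate_template.pptx")
    slide = prs.slides.add_slide(prs.slide_layouts[6])   # layout 6 is blank in the default template
    slide.shapes.add_picture("results.png", Inches(1), Inches(1), width=Inches(8))
    prs.save("analysis_report.pptx")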

The University of Glasgow has published their own guide to data management, which is well worth a look.

I hope this helps! May all your datasets be fruitful and easy to find later.
