Handling Messy CSV Files

Photo by [Jan Kolar](https://unsplash.com/@jankolar) on [Unsplash](https://unsplash.com).

If you’re a working data scientist, CSV files might very well be your bread and butter. They are easy to read for humans and computers alike, can be tracked in version control, and can be emailed and compressed easily! However, if you’ve been around for a bit longer, you may also be familiar with the dark side of CSV files: uncommon cell delimiters, uneven row lengths, double quoting, escape characters, comment lines, and more! 😱

There’s a very simple reason for this problem: a “comma separated value” file is not a standard file format, but is actually more of a convention that people haven’t fully agreed on. RFC 4180 proposes a definition, but it is not a standard in the formal sense, and libraries and programming languages do not always adhere to it. Because of this, the number of variations of CSV file formats that you may encounter on the internet is enormous! As an illustration, here are some examples of real-world CSV files:

Examples of real-world CSV files: (a) includes comment lines at the top, prefixed with the # character, which is not part of the CSV “standard”, (b) uses the ^ symbol as delimiter and the ~ symbol for quotes, and (c) uses the semicolon as delimiter, but yields the exact same number of columns when using the comma. Image adapted from Van den Burg et al. (2019).
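To make this concrete, here is a minimal sketch (with made-up data) of what it takes to read a file in the style of example (b) with Python’s standard csv library, where you have to spell out the delimiter and quote character yourself:

>>> import csv
>>> import io
>>> # Made-up data in the style of example (b): '^' as delimiter, '~' as quote
>>> data = '~title~^~year~\n~Hello^ World~^2019\n'
>>> list(csv.reader(io.StringIO(data), delimiter='^', quotechar='~'))
[['title', 'year'], ['Hello^ World', '2019']]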

Now, why is this a problem? Why do we care that CSV files come in different formats? Isn’t this a wonderful way to express your individuality when saving tabular data? Well … no. A CSV file is used to store data, so it should be easy to load data from it. When the format varies from file to file, CSV files require human inspection before they can be loaded.

Here’s an example of the latter point. This dataset on Kaggle contains information on 14,762 movies retrieved from IMDB. Say we want to load this data into Python with Pandas, to get a nice data frame:

>>> import pandas as pd
>>> df = pd.read_csv('./imdb.csv')
Traceback (most recent call last):
# ... skipping the full traceback ...
pandas.errors.ParserError: Error tokenizing data. C error: Expected 44 fields in line 66, saw 46

Huh, that didn’t work. What if we use the standard way of detecting the format, also known as the dialect, and load the file as suggested by the documentation for the Python standard csv library?

>>> import csv
>>> with open('./imdb.csv', newline='') as csvfile:
...   dialect = csv.Sniffer().sniff(csvfile.read())
...   csvfile.seek(0)
...   reader = csv.reader(csvfile, dialect)
...   rows = list(reader)
>>> len(rows)
13928

Okay, that did something, but it ended up reading 13,928 rows instead of the 14,762 we expected! For comparison, R’s read.csv() function doesn’t fare much better: it ends up reading 15,190 rows! What’s going on here??

Well, it turns out that this particular CSV file uses an escape character (\) when a movie title contains a comma! Neither Pandas nor the standard csv library detected this automatically, and both therefore failed to load the data properly. Imagine if you had started analyzing this data without realizing that this happened! 🙈
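For completeness: once you know the dialect, you can spell it out and load the file correctly. Here’s a sketch of what that might look like with Pandas, assuming (as we just discovered) backslash escapes and no quoting:

>>> import csv
>>> import pandas as pd
>>> # Spell out the dialect by hand: backslash escapes, no quoting
>>> df = pd.read_csv('./imdb.csv', escapechar='\\', quoting=csv.QUOTE_NONE)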

Of course, you can manually inspect every CSV file you encounter on the web and make sure it doesn’t have any problems. But it’s 2019: why do we still have to deal with messy CSV files? Why can’t these packages detect the dialect correctly? One reason is that there are simply too many variations of CSV files out there. Another is that it’s non-trivial to come up with an algorithm that gets it right every time, because any dialect will give you some table, but only one table correctly reflects the data that was stored.
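To see what “any dialect will give you some table” means in practice, consider a made-up two-line file in the style of example (c) above. Both the semicolon and the comma produce a perfectly regular table; only one of them is the table the author intended:

>>> import csv
>>> import io
>>> data = '7,5;Gravity;2013\n8,3;Inception;2010\n'
>>> # Semicolon dialect: three columns (decimal-comma rating, title, year)
>>> list(csv.reader(io.StringIO(data), delimiter=';'))
[['7,5', 'Gravity', '2013'], ['8,3', 'Inception', '2010']]
>>> # Comma dialect: also a perfectly regular table, just not the right one
>>> list(csv.reader(io.StringIO(data), delimiter=','))
[['7', '5;Gravity;2013'], ['8', '3;Inception;2010']]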

CSV is a textbook example of how not to design a textual file format. (Eric S. Raymond, The Art of Unix Programming)

Thankfully, there’s now a solution: CleverCSV, a Python package for detecting the dialect of CSV files with high accuracy. It is modeled on the way a human would determine the dialect: by looking for patterns that result in a regular tabular structure with “clean data” in the cells (such as numbers, dates, etc.). CleverCSV is based on research in which we investigated almost 10,000 CSV files to develop the best way to detect CSV dialects. To make it easy to switch existing code to CleverCSV, the package has been designed as a drop-in replacement for the csv module. So instead of using import csv, you can use import clevercsv (or, if you’re really smart: import clevercsv as csv).
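In practice, that means the sniffing example from earlier works essentially unchanged. A minimal sketch, assuming the drop-in Sniffer API mirrors the standard library’s:

>>> import clevercsv
>>> with open('./imdb.csv', newline='') as csvfile:
...   dialect = clevercsv.Sniffer().sniff(csvfile.read())
...   csvfile.seek(0)
...   reader = clevercsv.reader(csvfile, dialect)
...   rows = list(reader)

This time, the detected dialect should include the backslash escape character, so the full table is read.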

But wait, there’s more! Of course you don’t want to detect the dialect of the same file over and over again, because it isn’t likely to change all that often. So CleverCSV also provides a command line interface that simply gives you the code you need:

$ clevercsv code ./imdb.csv

# Code generated with CleverCSV version 0.4.7

import clevercsv

with open("imdb.csv", "r", newline="", encoding="utf-8") as fp:
    reader = clevercsv.reader(fp, delimiter=",", quotechar="", escapechar="\\")
    rows = list(reader)

And if you prefer to get a Pandas data frame, simply use clevercsv code -p <filename>. CleverCSV is available on GitHub and on PyPI. Furthermore, the research that led to CleverCSV is fully reproducible and publicly available (if you care about such a thing! :)). Finally, CleverCSV comes with handy wrappers for commonly used functionality, such as read_csv, which detects the dialect and loads the file as a list of lists, and csv2df, which loads a file into a Pandas data frame.
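A minimal sketch of those wrappers; I’m assuming here that both take just a filename, so check the documentation for the exact signatures:

>>> import clevercsv
>>> rows = clevercsv.read_csv('./imdb.csv')  # dialect detection + list of lists
>>> df = clevercsv.csv2df('./imdb.csv')      # dialect detection + Pandas data frame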

Data wrangling and data cleaning are some of the most time-consuming tasks for data scientists, and they’re not the most fun either. In fact, surveys show that data scientists spend the majority of their time on these menial tasks, and that this is also the part of their job they dislike the most! CleverCSV aims to solve part of this problem by giving data scientists a way to save time on the boring task of correctly loading data from messy CSV files. I hope you give it a try!