If you’re a working data scientist, CSV files might very well be your bread and butter. They are easy to read for both humans and computers alike, can be tracked in version control, and can be emailed and compressed easily! However, if you’ve been around for a bit longer you may also be familiar with the dark side of CSV files: uncommon cell delimiters, uneven row lengths, double quoting, escape characters, comment lines, and more! 😱
There’s a very simple reason for this problem: a “comma separated value” file is not a standard file format, but is actually more a convention that people haven’t fully agreed on. The RFC4180 suggests a definition, but this is not a standard in the formal sense and libraries and programming languages do not always adhere to it. Because of this, the number of variations of CSV file formats that you may encounter on the internet is enormous! As an illustration, here are some examples of real-world CSV files:
Now, why is this a problem? Why do we care that CSV files come in different formats? Isn’t this a wonderful way to express your individuality when saving tabular data? Well … no. A CSV file is used to store data, so it should be easy to load data from it. By varying the format that is used, CSV files require human inspection before they can be loaded.
Here’s an example of the latter point. This dataset on Kaggle contains information on 14,762 movies retrieved from IMDB. Say we want to load this data into Python, and want to use Pandas to load it into a nice data frame:
>>> import pandas as pd
>>> df = pd.read_csv('./imdb.csv')
Traceback (most recent call last):
# ... skipping the full traceback ...
pandas.errors.ParserError: Error tokenizing data. C error: Expected 44 fields
in line 66, saw 46
Huh, that didn’t work. What if we use the standard way of detecting the format, also known as the dialect, and load the file as suggested by the documentation for the Python standard csv library?
>>> import csv
>>> with open('./imdb.csv', newline='') as csvfile:
... dialect = csv.Sniffer().sniff(csvfile.read())
... csvfile.seek(0)
... reader = csv.reader(csvfile, dialect)
... rows = list(reader)
>>> len(rows)
13928
Okay, that did something, but it ended up reading 13,928 rows, instead of
the 14,762 that we expected!
For comparison, R’s
read.csv()
method doesn’t fare much better and ends up reading 15,190
rows!
What’s going on here??
Well, it turns out that this particular CSV file uses an escape character
(\
) when a movie title contains a comma! Neither Pandas nor the standard
csv
library detected this automatically, and therefore failed to load the
data properly. Imagine if you would start analyzing this data without
realizing that this happened! 🙈
Of course, you can manually inspect every CSV file you encounter on the web and make sure it doesn’t have any problems. But it’s 2019, why do we still have to deal with messy CSV files? Why can’t these packages detect the dialect correctly? One reason this is difficult is that there are just too many variations of CSV files out there. Another reason is that it’s actually non-trivial to come up with an algorithm that can do it correctly all the time, because any dialect will give you some table, but there’s only supposed to be one table that correctly reflects the data that was stored.
CSV is a textbook example of how not to design a textual file format.
Thankfully, there’s now a solution:
CleverCSV, a Python
package for detecting the dialect of CSV files with high accuracy. It is
modeled on the way in which a human would determine the dialect: by looking
for patterns that result in a regular tabular structure with “clean data” in
the cells (such as numbers, dates, etc.). CleverCSV is actually based on
research, where we investigated almost 10,000 CSV
files to develop the best way to detect CSV dialects. To make it easy to
switch existing code to CleverCSV, the package has been designed to be a
drop-in replacement for the CSV module. So instead of using import csv
,
you can use import clevercsv
(or, if you’re really smart: import
clevercsv as csv
).
But wait, there’s more! Of course you don’t want to detect the dialect of the
same file over and over again, because it’s not likely to change all that
often. So CleverCSV also provides a command line interface that simply gives
you the code you need:
And if you prefer to get a Pandas
data frame, simply use: clevercsv code -p <filename>
.
$ clevercsv code ./imdb.csv
# Code generated with CleverCSV version 0.4.7
import clevercsv
with open("imdb.csv", "r", newline="", encoding="utf-8") as fp:
reader = clevercsv.reader(fp, delimiter=",", quotechar="", escapechar="\\")
rows = list(reader)
CleverCSV also comes with handy wrappers for commonly used functionality, such
as read_csv
to detect the dialect and load the file as a list of lists,
and csv2df
to load a file into a Pandas data frame. CleverCSV is available
on GitHub and on
PyPI. Furthermore, the research that
led to CleverCSV is fully reproducible and publicly
available (if you
care about such a thing! :))
Data wrangling and data cleaning are some of the most time consuming tasks for data scientists, and they’re not the most fun either. In fact, survey’s show that data scientists spend the majority of their time on these menial tasks, while also being the part of their job they dislike the most! CleverCSV is a tool that aims to solve part of this problem, by giving data scientists a way to save time on the boring task of correctly loading data from messy CSV files. I hope that you give it a try!