Make is a build automation tool. It uses a configuration file called a Makefile that contains the rules for what to build. Make builds targets using recipes. Targets can optionally have prerequisites. Prerequisites can be files on your computer or other targets. Make determines what to build based on the directed acyclic graph of the targets and prerequisites. It uses the modification time of prerequisites to update targets only when needed.
Here’s a simple example that compiles some C code:
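hello: hello.c
	cc -o hello hello.c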
In this example:

- hello is the target
- hello.c is the dependency
- cc -o hello hello.c is the recipe

Collectively, the two lines are called a rule.
Make is great when:
One of the things that might scare people off from using Make is that an existing Makefile can seem daunting to understand and tailor to your own needs.
In this hands-on part of the tutorial we will iteratively construct a Makefile for a real project. In the Example section below there are a number of Makefiles that I’ve curated over time. Hopefully the experience you gain from this tutorial will make it easy to adapt them to your specific projects.
Because there are probably enough tutorials on how to use Make to build a software project, this tutorial will show how to use Make for a data analysis pipeline. The idea is to explain different features of Make by iterating through several versions of a Makefile for this project.
We will create a Makefile for a data analysis pipeline. The task is as follows:
Task: Given two datasets, create a summary report (in pdf) that contains the histograms of these datasets.
(Of course this data science task is kept as simple as possible to explain how to use Make.)
To start, clone the base repository:
This basic repository contains all the code that we’ll need in this tutorial:
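data/
├── input_file_1.csv
└── input_file_2.csv
report/
└── report.tex
scripts/
└── generate_histogram.py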
You’ll notice that there are two datasets in the data directory (input_file_1.csv and input_file_2.csv) and that there is already a basic Python script in scripts and a basic report LaTeX file in report. Ensure that you have the matplotlib and numpy packages installed:
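$ pip install matplotlib numpy   # or however you normally install Python packages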
You will also need a working version of pdflatex and, of course, make.
Let’s create our first Makefile. Using your favorite editor, create a file called Makefile with the following contents:
# Makefile for analysis report

output/figure_1.png: data/input_file_1.csv scripts/generate_histogram.py
	python scripts/generate_histogram.py -i data/input_file_1.csv -o output/figure_1.png

output/figure_2.png: data/input_file_2.csv scripts/generate_histogram.py
	python scripts/generate_histogram.py -i data/input_file_2.csv -o output/figure_2.png

output/report.pdf: report/report.tex output/figure_1.png output/figure_2.png
	cd report/ && pdflatex report.tex && mv report.pdf ../output/report.pdf
To test that everything works correctly, you should now be able to type:
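$ make output/report.pdf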
If everything works correctly, the two figures will be created and the PDF report will be built.
Let’s go through the Makefile in a bit more detail. We have three rules: two for the figures, which are very similar, and one for the report. Let’s look at the rule for output/figure_1.png first. This rule has the target output/figure_1.png, which has two prerequisites: data/input_file_1.csv and scripts/generate_histogram.py. By giving the output file these prerequisites, it will be updated if either of these files changes. You’ll notice that the recipe line calls Python with the script name and uses command line flags (-i and -o) to mark the input and output of the script.
The rule for the PDF report is very similar, but it has three prerequisites (the LaTeX source and the figures). Notice that the recipe changes the working directory before calling LaTeX and also moves the generated PDF to the output directory. This has to do with some limitations of pdflatex. However, it’s important to distinguish the above rule from the following:
# don't do this
output/report.pdf: report/report.tex output/figure_1.png output/figure_2.png
	cd report/
	pdflatex report.tex
	mv report.pdf ../output/report.pdf
This rule places the three commands on separate lines. However, Make executes each line independently in separate subshells, so changing the working directory in the first line has no effect on the second, and a failure in the second line won’t stop the third line from being executed. Therefore, we combine the three commands in a single recipe above.
This is what the DAG looks like for this Makefile:
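Sketched as a tree, with each target’s prerequisites indented beneath it:

output/report.pdf
├── report/report.tex
├── output/figure_1.png
│   ├── data/input_file_1.csv
│   └── scripts/generate_histogram.py
└── output/figure_2.png
    ├── data/input_file_2.csv
    └── scripts/generate_histogram.py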
In our first Makefile we have the basic rules in place. We could stick with this if we wanted to, but there are a few improvements we can make:
- We now have to explicitly call make output/report.pdf if we want to make the report.
- We have no way to clean up and start fresh.
Let’s remedy this by adding two new targets: all and clean. Again, open your editor and change the contents to add the all and clean rules:
# Makefile for analysis report

all: output/report.pdf

output/figure_1.png: data/input_file_1.csv scripts/generate_histogram.py
	python scripts/generate_histogram.py -i data/input_file_1.csv -o output/figure_1.png

output/figure_2.png: data/input_file_2.csv scripts/generate_histogram.py
	python scripts/generate_histogram.py -i data/input_file_2.csv -o output/figure_2.png

output/report.pdf: report/report.tex output/figure_1.png output/figure_2.png
	cd report/ && pdflatex report.tex && mv report.pdf ../output/report.pdf

clean:
	rm -f output/report.pdf
	rm -f output/figure_*.png
Note that we’ve added the all target to the top of the file. This is because Make executes the first target when no explicit target is given, so you can now type make on the command line and it will do the same as make all. Also, note that we’ve only added the report as the prerequisite of all, because that’s our desired output and the other rules help to build that output. If you had multiple outputs, you could add them as further prerequisites to the all target.
The clean rule is typically placed at the bottom, but that’s more a matter of style than a requirement. Note that we use the -f flag to rm to stop it complaining when there is no file to remove.
You can try out the new Makefile by running:
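$ make clean
$ make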
Typically, all and clean are defined as so-called Phony Targets: targets that don’t actually create an output file. Such targets are always run when they come up as a dependency, whereas without the phony declaration they would no longer be run once a file or directory named all or clean happened to exist. We therefore add a line at the top of the Makefile so that it looks like this:
# Makefile for analysis report

.PHONY: all clean

all: output/report.pdf

output/figure_1.png: data/input_file_1.csv scripts/generate_histogram.py
	python scripts/generate_histogram.py -i data/input_file_1.csv -o output/figure_1.png

output/figure_2.png: data/input_file_2.csv scripts/generate_histogram.py
	python scripts/generate_histogram.py -i data/input_file_2.csv -o output/figure_2.png

output/report.pdf: report/report.tex output/figure_1.png output/figure_2.png
	cd report/ && pdflatex report.tex && mv report.pdf ../output/report.pdf

clean:
	rm -f output/report.pdf
	rm -f output/figure_*.png
Phony targets are also useful when you want to use Make recursively. In that case you would specify the subdirectories as phony targets. You can read more about phony targets in the documentation, but for now it’s enough to know that all and clean are typically declared as phony.
Sidenote: another target that’s typically phony is test, in case you have a directory of tests called test and want to have a target to run them that’s also called test.
There’s nothing wrong with the Makefile we have so far, but it’s somewhat verbose because we’ve declared all the targets explicitly using separate rules. We can simplify this by using Automatic Variables and Pattern Rules.
Automatic Variables. With automatic variables we can use the names of the prerequisites and targets in the recipe. Here’s how we would do that for the figure rules:
# Makefile for analysis report

.PHONY: all clean

all: output/report.pdf

output/figure_1.png: data/input_file_1.csv scripts/generate_histogram.py
	python scripts/generate_histogram.py -i $< -o $@

output/figure_2.png: data/input_file_2.csv scripts/generate_histogram.py
	python scripts/generate_histogram.py -i $< -o $@

output/report.pdf: report/report.tex output/figure_1.png output/figure_2.png
	cd report/ && pdflatex report.tex && mv report.pdf ../output/report.pdf

clean:
	rm -f output/report.pdf
	rm -f output/figure_*.png
We’ve replaced the input and output filenames in the recipes by $<, which denotes the first prerequisite, and $@, which denotes the target, respectively. There are more automatic variables that you could use; see the documentation.
Pattern Rules. Notice that the figure recipes have become identical! Because we don’t like to repeat ourselves, we can combine the two rules into a single one by using pattern rules. Pattern rules allow you to use the % symbol as a wildcard and combine the two rules into one:
# Makefile for analysis report

.PHONY: all clean

all: output/report.pdf

output/figure_%.png: data/input_file_%.csv scripts/generate_histogram.py
	python scripts/generate_histogram.py -i $< -o $@

output/report.pdf: report/report.tex output/figure_1.png output/figure_2.png
	cd report/ && pdflatex report.tex && mv report.pdf ../output/report.pdf

clean:
	rm -f output/report.pdf
	rm -f output/figure_*.png
The % symbol is now a wildcard that (in our case) takes the value 1 or 2 based on the input files in the data directory. You can check that everything still works by running make clean followed by make.
When Makefiles get more complex, you may want to use more “bash-like” features such as building outputs for all the files in an input directory. While pattern rules get you a long way, Make also has features for wildcards and string/path manipulation for when pattern rules are insufficient.
While before our input files were numbered, we’ll now switch to a scenario where they have more meaningful names. Let’s switch over to the big_data branch:
$ git stash # stash the state of your working directory
$ git checkout big_data # checkout the big_data branch
If you inspect the data directory, you’ll notice that there are now additional input files that are named more meaningfully (the data are IMDB movie ratings by genre). Also, the report.tex file has been updated to work with the expected figures.
We’ll adapt our Makefile to create a figure in the output directory called figure_{genre}.png for each {genre}.csv file, while excluding the input_file_{N}.csv files.

Sidenote: if we were to remove the input_file_{N}.csv files, pattern rules would be sufficient here. This highlights that sometimes your directory structure and Makefile should be developed hand in hand.
Before changing the Makefile, run make clean to remove the output files.
First, we’ll create a variable that lists all the CSV files in the data directory and one that lists only the old input_file_{N}.csv files:
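ALL_CSV = $(wildcard data/*.csv)
INPUT_CSV = $(wildcard data/input_file_*.csv)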
A code convention for Makefiles is to use full caps for variable names and define them at the top of the file.
Next, we’ll list just the data that we’re interested in by filtering out the INPUT_CSV from ALL_CSV:
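DATA = $(filter-out $(INPUT_CSV),$(ALL_CSV))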
This line uses the filter-out function to remove the items in the INPUT_CSV variable from the ALL_CSV variable. Note that the $( ... ) syntax is used for both functions and variables (the ${ ... } syntax also works). Finally, we’ll use the DATA variable to create a FIGURES variable with the desired output:
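FIGURES = $(patsubst data/%.csv,output/figure_%.png,$(DATA))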
Here we’ve used the patsubst function to transform the input in the DATA variable (which follows the data/{genre}.csv pattern) to the desired output filenames (using the output/figure_{genre}.png pattern).
Now we can use these variables for the figure generation rule as follows:
$(FIGURES): output/figure_%.png: data/%.csv scripts/generate_histogram.py
	python scripts/generate_histogram.py -i $< -o $@
Note that this rule again applies a pattern: it takes a list of targets (FIGURES) that all follow a given pattern (output/figure_%.png) and based on that creates a prerequisite (data/%.csv). Such a pattern rule is slightly different from the one we saw before because it uses two : symbols. It is called a static pattern rule.
Of course we also have to update the report.pdf rule:
output/report.pdf: report/report.tex $(FIGURES)
	cd report/ && pdflatex report.tex && mv report.pdf ../$@
and the clean rule:
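clean:
	rm -f output/report.pdf
	rm -f $(FIGURES)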
The resulting Makefile should now look like this:
# Makefile for analysis report
#
ALL_CSV = $(wildcard data/*.csv)
INPUT_CSV = $(wildcard data/input_file_*.csv)
DATA = $(filter-out $(INPUT_CSV),$(ALL_CSV))
FIGURES = $(patsubst data/%.csv,output/figure_%.png,$(DATA))

.PHONY: all clean

all: output/report.pdf

$(FIGURES): output/figure_%.png: data/%.csv scripts/generate_histogram.py
	python scripts/generate_histogram.py -i $< -o $@

output/report.pdf: report/report.tex $(FIGURES)
	cd report/ && pdflatex report.tex && mv report.pdf ../$@

clean:
	rm -f output/report.pdf
	rm -f $(FIGURES)
If you run this Makefile, it will need to build 28 figures. You may want to use the -j flag to make to build these targets in parallel!
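For example, to let Make run up to four jobs at once:

$ make -j4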
In a data science pipeline, it may be quite common to apply multiple scripts to the same data (for instance when you’re comparing classifiers or testing different parameters). In that case, it can become tedious to write a separate rule for each script when only the script name changes. To simplify this process, we can let Make expand a so-called canned recipe.
To follow along, switch to the canned branch:
$ make clean
$ git stash --all # note the '--all' flag so we also stash the Makefile
$ git checkout canned
On this branch you’ll notice that there is a new script in the scripts directory called generate_qqplot.py. This script works similarly to the generate_histogram.py script (it has the same command line syntax), but it generates a QQ-plot. The report.tex file has also been updated to incorporate these plots.
Now, we could simply add another rule that generates the QQ-plot figures, so that the Makefile becomes:
# Makefile for analysis report
#
ALL_CSV = $(wildcard data/*.csv)
DATA = $(filter-out $(wildcard data/input_file_*.csv),$(ALL_CSV))
HISTOGRAMS = $(patsubst data/%.csv,output/histogram_%.png,$(DATA))
QQPLOTS = $(patsubst data/%.csv,output/qqplot_%.png,$(DATA))

.PHONY: all clean

all: output/report.pdf

$(HISTOGRAMS): output/histogram_%.png: data/%.csv scripts/generate_histogram.py
	python scripts/generate_histogram.py -i $< -o $@

$(QQPLOTS): output/qqplot_%.png: data/%.csv scripts/generate_qqplot.py
	python scripts/generate_qqplot.py -i $< -o $@

output/report.pdf: report/report.tex $(HISTOGRAMS) $(QQPLOTS)
	cd report/ && pdflatex report.tex && mv report.pdf ../$@

clean:
	rm -f output/report.pdf
	rm -f $(HISTOGRAMS) $(QQPLOTS)
(which is what is currently in your repo.)
But as the number of scripts that you want to run on your data grows, this may lead to a large number of rules in the Makefile that are almost exactly the same. We can simplify this by creating a canned recipe (which is like a function) that takes both the name of the script and the name of the genre as input:
define run-script-on-data
output/$(1)_$(2).png: data/$(2).csv scripts/generate_$(1).py
	python scripts/generate_$(1).py -i $$< -o $$@
endef
Note that in this recipe we use $(1) for either histogram or qqplot, and $(2) for the genre. These correspond to the expected function arguments to the run-script-on-data canned recipe. Also, notice that we use $$< and $$@ in the actual recipe, with two $ symbols for escaping. To actually create all the targets, we need a line that calls this canned recipe. In our case, we use a double for loop over the genres and the scripts:
$(foreach genre,$(GENRES),\
$(foreach script,$(SCRIPTS),\
$(eval $(call run-script-on-data,$(script),$(genre))) \
) \
)
The full Makefile then becomes:
# Makefile for analysis report
#
ALL_CSV = $(wildcard data/*.csv)
DATA = $(filter-out $(wildcard data/input_file_*.csv),$(ALL_CSV))
HISTOGRAMS = $(patsubst %,output/histogram_%.png,$(GENRES))
QQPLOTS = $(patsubst %,output/qqplot_%.png,$(GENRES))
GENRES = $(patsubst data/%.csv,%,$(DATA))
SCRIPTS = histogram qqplot

.PHONY: all clean

all: output/report.pdf

define run-script-on-data
output/$(1)_$(2).png: data/$(2).csv scripts/generate_$(1).py
	python scripts/generate_$(1).py -i $$< -o $$@
endef

$(foreach genre,$(GENRES),\
$(foreach script,$(SCRIPTS),\
$(eval $(call run-script-on-data,$(script),$(genre)))\
)\
)

output/report.pdf: report/report.tex $(HISTOGRAMS) $(QQPLOTS)
	cd report/ && pdflatex report.tex && mv report.pdf ../$@

clean:
	rm -f output/report.pdf
	rm -f $(HISTOGRAMS) $(QQPLOTS)
Note that we’ve added a SCRIPTS variable that holds the histogram and qqplot names. If we were to add another script that follows the same pattern as these two, we would only need to add it to the SCRIPTS variable.
To build all of this, run:
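$ make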
Before wrapping up I wanted to briefly mention two commands that are very useful when writing Makefiles with complex variables: info and error.
With the info command you can print the current value of a variable to stdout:
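$(info DATA = $(DATA))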
This will print DATA = ..., where ... is the content of the DATA variable.
The error command stops execution at that point in the Makefile. This is useful in combination with the info command when you want to just print the variables and not run any further:
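For instance, placing something like this near the top of the Makefile prints the variables and stops before any rules run (the message passed to error is arbitrary):

$(info DATA = $(DATA))
$(info FIGURES = $(FIGURES))
$(error Stopped after printing variables)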
And of course, you can run make -d to see the debug output. This can help you figure out where a rule fails or a syntax error occurs.
I hope this tutorial has given you some insight into how to use Make for your projects. Of course it isn’t feasible to expand on all the features of Make in such a short space, but I think I’ve covered the most important ones. A few things that may be relevant that I haven’t covered:
Special targets other than .PHONY. For instance, .POSIX stops execution of a recipe upon the first failure, and .PRECIOUS allows you to specify targets that should not be deleted when Make is interrupted.
If you want to explore other uses of Make, I have collected some links to Makefiles that I use for various other projects:
LaTeX project with a separate figure directory. This is an example of using Make recursively, which has some caveats (see Further Reading).
Python package. This Makefile is self-documenting: the default goal is help, which prints a brief description of each target to the command line.
R package. Not mine but I’ve used something like this for several R packages.
Real-world reproducible research paper. Shameless self-promotion. This Makefile generates all the empirical results for a paper (including constants, figures, and tables) from the raw input data, resulting in reproducibility through a single make invocation!
Thanks for your attention!
Some links I’ve collected while writing this.
Recursive Make Considered Harmful. This is a well-known paper on why you shouldn’t use nested Makefiles. To summarise: if you do this, Make can’t see the entire DAG, and that leads to problems.
Non-Recursive Make Considered Harmful: This is a research paper describing the failings of Make for large and complex builds.
Plot the DAG of the Makefile with makefile2graph.
I’m not explicitly recommending any of these, but if you want features that Make doesn’t support, or have a really big project with complex dependencies, here are some alternatives. I would recommend not optimising your build system until you run into a problem with your current one: optimising your build system might lead you down a rabbit hole that ends up taking more time than slow builds!
Bazel: An open-source version of Google’s Blaze build system.
Buck: Facebook’s build system.
Tup: a fast build system that processes prerequisites bottom-up instead of Make’s top-down. The speed looks impressive and the paper describing it is interesting. Would have been great if it could be run on Makefiles too instead of requiring a Tupfile with different syntax.
Walk: A Make alternative that claims to be more flexible than Make because it doesn’t just rely on modification time. Its equivalent of the Makefile is essentially a bash script.
A Directed Acyclic Graph (DAG) is a graph of nodes and edges that is:

- directed: each edge points from one node to another, and
- acyclic: following the directed edges, you can never return to a node you have already visited.
The latter property is of course quite handy for a build system. More on DAGs on Wikipedia.
First check if you have GNU Make installed already. In a terminal type:
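$ make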
If you get make: command not found (or similar), you don’t have make. If you instead get make: *** No targets specified and no makefile found. Stop., you do have make.
We’ll be using GNU Make in this tutorial. Verify that this is what you have by typing:
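$ make --version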
If you don’t have GNU make but have the BSD version, some things might not work as expected.
To install GNU Make, please follow these instructions:
Linux: Use your package manager to install Make. For instance on Arch Linux:
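$ sudo pacman -S make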
Ubuntu:
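$ sudo apt-get install make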
MacOS: If you have Homebrew, it’s simply:
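$ brew install make   # note: Homebrew installs GNU make as "gmake"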
If you have a built-in Make implementation, please ensure that it’s GNU Make by checking make --version.
Windows: Cygwin, probably?