Resources for “Make”

Discussions

Blogs

Of course we are not the first to suggest the use of Make for reproducibility! The blog posts cited below were found after the above tutorial was written, but can add further information and examples.

Tools

Alternatives to Make

There are many alternatives to Make. Below are some that caught our eye and that might be worth a look.

  • Snakemake. A Python 3-based alternative to Make. Snakemake supports multiple wildcards in filenames, supports Python code in rules, and can run workflows on workstations, clusters, the grid, and in the cloud without modification.

  • Tup. A fast build system that processes prerequisites bottom-up instead of Make’s top-down. The speed looks impressive and the paper describing it is interesting, but for small projects Make’s speed will not be a bottleneck. The Tupfile syntax is not compatible with that of Makefiles.

  • Bazel. An open-source version of Google’s Blaze build system.

  • Buck. Facebook’s build system.

Glossary

Makefile: a text file that contains the configuration for the build.

Rule: an element of the Makefile that defines something that must be built. A rule usually consists of one or more targets, a recipe, and, optionally, prerequisites.

Target: the outcome of a rule in a Makefile. It is usually a file. If it is not a file, it’s a phony target.

Recipe: one or more shell commands that are executed by Make. Usually these commands update the target of the rule.

Prerequisite: the prerequisite(s) of a rule correspond to files or other targets in the Makefile that must be up to date before the rule is run.

Phony: a phony target is one that doesn’t correspond to a file on the filesystem. A target is marked as phony by making it a prerequisite of the .PHONY target.

Pattern: a pattern rule is a rule that contains exactly one % character in the target. The % matches part of a filename, and the matched part can be reused in the prerequisites.
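The glossary terms can be seen together in a small throwaway project. The sketch below is not part of the tutorial repository: the file names hello.txt and hello.out are made up, and the Makefile is written through sed only so that the "> " marker at the start of a recipe line becomes the tab character that Make requires.

```shell
# Work in a temporary directory so nothing in your project is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
echo "hello" > hello.txt

# Recipe lines must start with a tab; sed turns the "> " marker into one.
tab=$(printf '\t')
sed "s/^> /$tab/" > Makefile <<'EOF'
# "all" and "clean" are phony targets: they do not name files.
.PHONY: all clean

# A rule with target "all" and prerequisite "hello.out", without a recipe.
all: hello.out

# A pattern rule: % matches "hello", so hello.out is built from hello.txt.
# $< is the first prerequisite and $@ is the target.
%.out: %.txt
> cp $< $@

clean:
> rm -f hello.out
EOF

make                      # runs the recipe: cp hello.txt hello.out
first_build=$(cat hello.out)
make clean                # runs the recipe: rm -f hello.out
```

Running make a second time before make clean would do nothing, because hello.out is then newer than its prerequisite hello.txt.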

Appendix

Directed Acyclic Graph

A Directed Acyclic Graph (DAG) is a graph of nodes and edges that is:

  1. directed: edges have a direction and you can only walk the graph in that direction

  2. acyclic: it contains no cycles, so A can’t depend on B when B depends on A.

The latter property is of course quite handy for a build system. More information on DAGs can be found on Wikipedia.
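To see why cycles matter, the following sketch deliberately creates a circular dependency between two made-up targets, a and b, in a temporary directory. GNU Make is expected to detect the cycle, report that it dropped the offending dependency, and then continue with the rest of the graph.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Recipe lines must start with a tab; sed turns the "> " marker into one.
tab=$(printf '\t')
sed "s/^> /$tab/" > Makefile <<'EOF'
# a depends on b and b depends on a: this is a cycle, not a DAG.
a: b
> touch a

b: a
> touch b
EOF

# LC_ALL=C keeps Make's messages in English; they are printed on stderr.
cycle_output=$(LC_ALL=C make a 2>&1)
printf '%s\n' "$cycle_output"
```

The output should contain a line of the form "Circular b <- a dependency dropped.", after which both files are still created.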

Installing Make

First, check if you have GNU Make installed already. In a terminal type:

$ make

If you get make: command not found (or similar), you don’t have Make. If you get the message make: *** No targets specified and no makefile found.  Stop. then Make is installed.

We’ll be using GNU Make in this tutorial. Verify that this is what you have by typing:

$ make --version

If you don’t have GNU Make but have the BSD version, some things might not work as expected and we recommend installing GNU Make.
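The version check can also be scripted. The sketch below relies on GNU Make printing a banner that starts with "GNU Make" on the first line of its --version output. The function name check_make is made up for this example.

```shell
# Report whether a given command is GNU Make.
# Usage: check_make [command]   (the command defaults to "make")
check_make() {
  cmd="${1:-make}"
  # Other Make implementations either print a different banner or reject
  # the --version flag, so both cases end up in the else branch.
  if "$cmd" --version 2>/dev/null | head -n 1 | grep -q '^GNU Make'; then
    echo "GNU Make"
  else
    echo "not GNU Make (or not installed)"
  fi
}

check_make make
```

The same function can be pointed at another command name, for example check_make gmake on systems where GNU Make is installed under that name.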

To install GNU Make, please follow these instructions:

  • Linux: Use your package manager to install Make. For instance on Arch Linux:

    $ sudo pacman -S make
    

    Ubuntu:

    $ sudo apt-get install build-essential
    
  • MacOS: If you have Homebrew installed, it’s simply:

    $ brew install make
    

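    Note that Homebrew installs GNU Make under the name gmake, so use gmake wherever this tutorial says make. If you use a builtin Make implementation instead, please ensure that it’s GNU Make by checking make --version.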

Advanced: Generating Rules using Call

This section continues the tutorial above and demonstrates a feature of Make for automatic generation of rules.

In a data science pipeline, it may be quite common to apply multiple scripts to the same data (for instance when you’re comparing methods or testing different parameters). In that case, it can become tedious to write a separate rule for each script when only the script name changes. To simplify this process, we can let Make expand a so-called canned recipe.

To follow along, switch to the canned branch:

$ make clean
$ git stash --all        # note the '--all' flag so we also stash the Makefile
$ git checkout canned

On this branch you’ll notice that there is a new script in the scripts directory called generate_qqplot.py. This script works similarly to the generate_histogram.py script (it has the same command line syntax), but it generates a QQ-plot. The report.tex file has also been updated to incorporate these plots.

After switching to the canned branch there will be a Makefile in the repository that contains a separate rule for generating the QQ-plots. This Makefile looks like this:

# Makefile for analysis report
#

ALL_CSV = $(wildcard data/*.csv)
DATA = $(filter-out $(wildcard data/input_file_*.csv),$(ALL_CSV))
HISTOGRAMS = $(patsubst data/%.csv,output/histogram_%.png,$(DATA))
QQPLOTS = $(patsubst data/%.csv,output/qqplot_%.png,$(DATA))

.PHONY: all clean

all: output/report.pdf

$(HISTOGRAMS): output/histogram_%.png: data/%.csv scripts/generate_histogram.py
	python scripts/generate_histogram.py -i $< -o $@

$(QQPLOTS): output/qqplot_%.png: data/%.csv scripts/generate_qqplot.py
	python scripts/generate_qqplot.py -i $< -o $@

output/report.pdf: report/report.tex $(HISTOGRAMS) $(QQPLOTS)
	cd report/ && pdflatex report.tex && mv report.pdf ../$@

clean:
	rm -f output/report.pdf
	rm -f $(HISTOGRAMS) $(QQPLOTS)

You’ll notice that the rules for histograms and QQ-plots are very similar.

As the number of scripts that you want to run on your data grows, this may lead to a large number of rules in the Makefile that are almost exactly the same. We can simplify this by creating a canned recipe that takes both the name of the script and the name of the genre as input:

define run-script-on-data
output/$(1)_$(2).png: data/$(2).csv scripts/generate_$(1).py
	python scripts/generate_$(1).py -i $$< -o $$@
endef

Note that in this recipe we use $(1) for either histogram or qqplot and $(2) for the genre. These correspond to the arguments that are passed to the run-script-on-data canned recipe. Also, notice that we use $$< and $$@ in the actual recipe. The doubled $ is an escape: when the canned recipe is expanded by call, $$ becomes a single $, so the generated rule contains $< and $@, which Make only interprets as automatic variables when the recipe runs. To actually create all the targets, we need a line that calls this canned recipe. In our case, we use a double for loop over the genres and the scripts:

$(foreach genre,$(GENRES),\
	$(foreach script,$(SCRIPTS),\
		$(eval $(call run-script-on-data,$(script),$(genre))) \
	) \
)

In these lines the \ character is used for continuing long lines.
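A useful way to check what these loops generate is to replace eval with info, so that each expanded rule is printed instead of defined. The sketch below does this in a temporary directory. The genre names action and comedy are placeholders for illustration, and the Makefile is written through sed only to turn the "> " marker into the tab that starts a recipe line.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

tab=$(printf '\t')
sed "s/^> /$tab/" > Makefile <<'EOF'
# Placeholder genre names, for illustration only.
GENRES = action comedy
SCRIPTS = histogram qqplot

define run-script-on-data
output/$(1)_$(2).png: data/$(2).csv scripts/generate_$(1).py
> python scripts/generate_$(1).py -i $$< -o $$@
endef

# Same loops as above, but with info instead of eval:
# every generated rule is printed rather than defined.
$(foreach genre,$(GENRES),\
  $(foreach script,$(SCRIPTS),\
    $(info $(call run-script-on-data,$(script),$(genre)))\
  )\
)

# A do-nothing default target so that Make has something to "build".
.PHONY: all
all: ;
EOF

generated=$(make -s)
printf '%s\n' "$generated"
```

The printed text should contain four rules, one per combination of script and genre, and each recipe should end in -i $< -o $@ with a single $, which shows the effect of the $$ escaping.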

The full Makefile then becomes:

# Makefile for analysis report
#

ALL_CSV = $(wildcard data/*.csv)
DATA = $(filter-out $(wildcard data/input_file_*.csv),$(ALL_CSV))
HISTOGRAMS = $(patsubst %,output/histogram_%.png,$(GENRES))
QQPLOTS = $(patsubst %,output/qqplot_%.png,$(GENRES))

GENRES = $(patsubst data/%.csv,%,$(DATA))
SCRIPTS = histogram qqplot

.PHONY: all clean

all: output/report.pdf

define run-script-on-data
output/$(1)_$(2).png: data/$(2).csv scripts/generate_$(1).py
	python scripts/generate_$(1).py -i $$< -o $$@
endef

$(foreach genre,$(GENRES),\
	$(foreach script,$(SCRIPTS),\
		$(eval $(call run-script-on-data,$(script),$(genre)))\
	)\
)

output/report.pdf: report/report.tex $(HISTOGRAMS) $(QQPLOTS)
	cd report/ && pdflatex report.tex && mv report.pdf ../$@

clean:
	rm -f output/report.pdf
	rm -f $(HISTOGRAMS) $(QQPLOTS)

Note that we’ve added a SCRIPTS variable with the histogram and qqplot names. If we were to add another script that follows the same pattern as these two, we would only need to add it to the SCRIPTS variable.

To build all of this, run the following command, where -j 4 allows Make to run up to four jobs in parallel:

$ make -j 4
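If you want to try the generated rules without the tutorial repository, the following sketch reproduces the same structure in a temporary directory. It makes several assumptions: the genre names action and comedy are placeholders, the OUTPUTS variable is introduced only for this example, and a plain cp stands in for the Python scripts, so the resulting .png files are merely copies of the CSV files.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
mkdir data output
echo "1,2,3" > data/action.csv
echo "4,5,6" > data/comedy.csv

# Recipe lines must start with a tab; sed turns the "> " marker into one.
tab=$(printf '\t')
sed "s/^> /$tab/" > Makefile <<'EOF'
GENRES = $(patsubst data/%.csv,%,$(wildcard data/*.csv))
SCRIPTS = histogram qqplot
# One output file per combination of script and genre.
OUTPUTS = $(foreach s,$(SCRIPTS),$(patsubst %,output/$(s)_%.png,$(GENRES)))

.PHONY: all
all: $(OUTPUTS)

# Stub version of the canned recipe: cp replaces the Python script.
define run-script-on-data
output/$(1)_$(2).png: data/$(2).csv
> cp $$< $$@
endef

$(foreach genre,$(GENRES),\
  $(foreach script,$(SCRIPTS),\
    $(eval $(call run-script-on-data,$(script),$(genre)))\
  )\
)
EOF

make -j 4
ls output
```

Because the four generated rules do not depend on each other, Make is free to run their recipes at the same time, which is why the order of the printed commands can differ between runs.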