The Problem

We have accumulated quite a few notebooks. We keep notebooks short and to the point to make collaboration easier, but that means we create a lot of them. One of our projects has more than 25 notebooks. Even with a good folder structure and naming scheme, the work feels scattered. The situation is bad for both practical and motivational reasons.

On the practical side, it can be hard to find what you are looking for. There is no good way of searching across all the notebooks in a project. We need excellent search capabilities on top of the notebooks.

I am also often in a situation where one of my coworkers wants a figure that was presented during one of the weekly demos. They may not be on GitHub, so I can’t simply send them a link to the notebook. In some cases the notebook’s output will have been cleared before it was pushed to GitHub anyway. We need to be able to easily share the results presented in the notebooks.

The sense of cohesion is lost.

On the motivational side there is the feeling of progression and completion. The whole situation makes me long for the days of thesis writing, when three years of work was contained between two covers. The sense of progression I got from writing a chapter of the thesis kept me going through the rough times. Even if an experiment or analysis failed I would just go “Eh, I guess it’ll be a chapter in the thesis”, and finishing that chapter still gave me a feeling of completion. I want to create the same feeling of progression for our data scientists, and I don’t think that writing up a notebook gets the job done. Once you have finished an analysis it should go in “the report”. One solution is to compile all the notebooks into a report.

In summary, I want a workflow that makes it easy to:

combine a collection of notebooks into a single report.
share the report.
search across all notebooks within a project.

I want a workflow that lets me convert a collection of notebooks into a self-contained report with as little overhead as possible.

The Solution

The solution we’re currently using revolves around three tools: Make, nbconvert, and MkDocs.

Use Jupyter Notebooks to present the work.
Convert notebooks to Markdown using nbconvert.
Build a static website from the Markdown files using MkDocs.
Use Make to orchestrate everything.

If you want to see a template of how this would work, then check out this repo. Below I’ll explain what the different tools do and how they work together.

MkDocs

MkDocs is a fast and simple static site generator that’s geared towards building project documentation. Documentation source files are written in Markdown, and configured with a single YAML configuration file.

We use MkDocs because it gives us a simple way of structuring the notebooks. The default search is also great.

I won’t give a full introduction here. I will simply point out some of the details and why I like them. The YAML configuration file is called mkdocs.yml and should be placed in the root of the project directory.

The Table of Contents or Navigation is written in the nav section of the YAML file. For a data science project it could look something like this.

    nav:
        - introduction: introduction.md
        - pre-processing:
            - introduction: pre-processing/introduction.md
            - missing data: pre-processing/missing-data.md
            - feature engineering: pre-processing/feature-engineering.md
        - model building: model-building.md
        - testing: testing.md
        - experimental:
            - introduction: experimental/introduction.md

I like that I have to define the TOC myself, rather than having MkDocs infer it from some predefined folder structure or naming convention. Here I can simply include the things that I want.
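The flip side of a hand-written TOC is that it is easy to convert a notebook and forget to list it. Here is a minimal sketch of a check for that, assuming the nav section has already been loaded from mkdocs.yml into plain Python lists and dictionaries (the loading step is not shown). The names nav_pages and pages_missing_from_nav are made up for this illustration and are not part of MkDocs.

```python
def nav_pages(nav):
    """Yield every page path from an MkDocs-style nav structure."""
    for entry in nav:
        if isinstance(entry, str):
            yield entry  # a bare path without a title
            continue
        for value in entry.values():
            if isinstance(value, str):
                yield value  # a "title: path" entry
            elif isinstance(value, list):
                yield from nav_pages(value)  # a nested section


def pages_missing_from_nav(nav, md_pages):
    """Return the Markdown pages (relative to docs/) that the nav does not list."""
    return sorted(set(md_pages) - set(nav_pages(nav)))


# Example nav, shaped like the YAML above once it has been parsed.
nav = [
    {"introduction": "introduction.md"},
    {"pre-processing": [
        {"introduction": "pre-processing/introduction.md"},
        {"missing data": "pre-processing/missing-data.md"},
    ]},
    {"model building": "model-building.md"},
]
```

Calling pages_missing_from_nav(nav, ["introduction.md", "testing.md"]) would report testing.md as a page that exists but is not yet in the TOC.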

We are using the Material theme.

nbconvert

Primarily, the nbconvert tool allows you to convert a Jupyter .ipynb notebook document file into another static format including HTML, LaTeX, PDF, Markdown, reStructuredText, and more. nbconvert can also add productivity to your workflow when used to execute notebooks programmatically.

For this workflow we use it to generate the Markdown files. A neat thing about nbconvert is that it allows you to use custom templates when converting the notebook. We use this to hide all input fields.

# $< and $@ are Make automatic variables: the source notebook and the Markdown target.
jupyter nbconvert \
    --to markdown $< \
    --output-dir $(dir $@) \
    --TagRemovePreprocessor.remove_cell_tags='{"remove_cell"}' \
    --TagRemovePreprocessor.remove_input_tags='{"remove_input"}' \
    --template=src/to_markdown.tpl
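The tag-based flags in the command above only affect cells that actually carry the matching tag. As a rough sketch, assuming the standard notebook JSON layout (format version 4, where each cell has a metadata dictionary with an optional tags list), a small helper could add the remove_input tag to every code cell. The function names tag_code_cells and tag_notebook_file are hypothetical and are not part of nbconvert.

```python
import json


def tag_code_cells(notebook: dict, tag: str = "remove_input") -> dict:
    """Add `tag` to every code cell of a notebook dict (nbformat 4 layout).

    Hypothetical helper for illustration; it mutates and returns `notebook`.
    """
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # leave Markdown and raw cells untouched
        tags = cell.setdefault("metadata", {}).setdefault("tags", [])
        if tag not in tags:
            tags.append(tag)
    return notebook


def tag_notebook_file(path: str, tag: str = "remove_input") -> None:
    """Read an .ipynb file, tag its code cells, and write it back in place."""
    with open(path, encoding="utf-8") as handle:
        notebook = json.load(handle)
    tag_code_cells(notebook, tag)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(notebook, handle, indent=1)
        handle.write("\n")
```

Tags can also be set by hand in the Jupyter interface; a script like this is only worth it when a whole notebook should have its inputs hidden.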

Make

Make is a build tool driven by a dependency graph of targets and their prerequisites. I think this is a nice introduction to Make.

Make isn’t strictly necessary, since we could get by with a collection of bash scripts, but I like Make. It lets us convert only the notebooks that have changed, so we are not constantly converting 25 notebooks.
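To make the “only convert what has changed” idea concrete, here is a simplified sketch of the timestamp rule Make applies to a target with a single prerequisite: rebuild when the target is missing or older than its source. The function name needs_rebuild is invented for this illustration; real Make handles many more cases, such as multiple prerequisites and phony targets.

```python
import os


def needs_rebuild(source: str, target: str) -> bool:
    """Return True if `target` is missing or older than `source`.

    A simplified stand-in for Make's out-of-date check, for illustration only.
    """
    if not os.path.exists(target):
        return True  # nothing has been built yet
    # Rebuild only when the source was modified after the target was written.
    return os.path.getmtime(target) < os.path.getmtime(source)
```

With this rule, editing one notebook makes only its own Markdown page stale, which is why a rebuild touches one file instead of all of them.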

The Makefile has a couple of responsibilities. It needs to locate all the notebook files, create a mirror structure for the Markdown files in the docs folder, and populate those folders based on the notebooks.

The most important part of the Makefile is this:

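# Find every notebook under notebooks/, at any depth, skipping Jupyter checkpoints.
# (Make's wildcard does not recurse on **, and a leading ./ would stop patsubst from matching.)
notebooks := $(shell find notebooks -name '*.ipynb' -not -path '*/.ipynb_checkpoints/*')
# Map notebooks/<path>.ipynb to docs/<path>.md.
md_pages := $(patsubst notebooks/%.ipynb,docs/%.md,$(notebooks))

.PHONY: build.env build.site
build.env: ; conda env create -f environment.yml
build.site: $(md_pages)

# Convert one notebook ($<) into its Markdown page; $(dir $@) is the folder of that page.
docs/%.md: notebooks/%.ipynb
	jupyter nbconvert \
		--to markdown $< \
		--output-dir $(dir $@) \
		--template=src/to_markdown.tpl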

Here notebooks collects the Jupyter notebooks in the notebooks folder, and md_pages replaces the notebooks/ prefix with docs/ and the .ipynb extension with .md. This gives us the list of Markdown files to build.
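If Make’s pattern substitution feels opaque, the same mapping can be sketched in Python. This is only an illustration of what the substitution computes, not something the workflow needs; the function name md_page_for and the example notebook paths are assumptions.

```python
from pathlib import PurePosixPath


def md_page_for(notebook: str, src_root: str = "notebooks", out_root: str = "docs") -> str:
    """Map notebooks/<path>.ipynb to docs/<path>.md, keeping subfolders."""
    relative = PurePosixPath(notebook).relative_to(src_root)
    return str(PurePosixPath(out_root) / relative.with_suffix(".md"))


# Example notebook paths (made up for this sketch).
notebooks = [
    "notebooks/introduction.ipynb",
    "notebooks/pre-processing/missing-data.ipynb",
]
md_pages = [md_page_for(nb) for nb in notebooks]
print(md_pages)
# ['docs/introduction.md', 'docs/pre-processing/missing-data.md']
```

The subfolder is carried over unchanged, which is what gives the docs folder its mirror structure.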

The build.site target requires all of the md_pages. The last section is a pattern rule that tells Make how to produce any .md file in the docs folder, provided that a .ipynb file with the same name exists in the notebooks folder. It creates the file by running the jupyter nbconvert command.

The workflow

Here I wanted to be explicit about how I work with this set-up.

If I’ve just finished a new notebook I will:

Add it to the nav section of mkdocs.yml.
Run make build.site to convert the notebooks with changes into Markdown files in the docs folder.
Run mkdocs serve to launch the site on localhost. This way I can click around and see how the new notebook fits into the greater picture.

If I’m editing a notebook I will have mkdocs serve running in one terminal window and then run make build.site in another terminal window every time I want to see my changes. mkdocs serve will automatically pick up the changes to the Markdown version of the notebook and rebuild the site.

In Conclusion

I quite like this solution, but I’ve only just started using it. Problems or challenges are sure to crop up over time. I still haven’t set it up to work with Git. This should just require a file in the .githooks folder, but it may cause some trouble; you never know.