Jupyter notebooks and collaboration

28 October 2017

Git is now the primary means of collaborating on code, and Jupyter notebooks are the standard environment for data exploration and interactive modelling. However, Git was designed to version plain text files such as source code, whereas a Jupyter notebook is a structured JSON document that can embed binary data such as images.

Without additional tooling and processes, this makes it hard to follow best practices such as making small, atomic commits on topic branches and submitting them for code review. At PyCon UK, I presented a talk on tools and practices that make working with Jupyter notebooks for machine learning more collaborative and productive.

First, I demonstrate how built-in Git features, such as incremental staging of changed files, can eliminate noise from changed cell execution counts. Second, I show how simple tooling can automatically clear output cells from notebooks before changes are committed to Git, which keeps binary data out of the repository. Finally, I introduce nbdime, a set of tools from the Jupyter project for diffing and merging Jupyter notebooks.
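
As a rough illustration of the second point, clearing outputs can be sketched in a few lines of Python using only the standard library. This is not the tooling shown in the talk; the function names strip_outputs and clean_notebook_text are hypothetical, and the sketch assumes the version 4 notebook format, in which code cells carry "outputs" and "execution_count" fields.

```python
import copy
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with code-cell outputs cleared.

    Assumes the nbformat 4 layout: a top-level "cells" list whose code
    cells have "outputs" and "execution_count" keys.
    """
    cleaned = copy.deepcopy(notebook)  # leave the caller's data untouched
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drops embedded images and text output
            cell["execution_count"] = None  # removes execution-count noise
    return cleaned


def clean_notebook_text(text: str) -> str:
    """Take notebook JSON as text and return it with outputs stripped."""
    cleaned = strip_outputs(json.loads(text))
    # The indent and trailing newline are a guess at a stable, diff-friendly
    # layout, not a guarantee of matching what Jupyter itself writes.
    return json.dumps(cleaned, indent=1, ensure_ascii=False) + "\n"
```

A function like clean_notebook_text could be called from a pre-commit hook or a Git clean filter so that only stripped notebooks reach the repository; how it is wired in depends on the project.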

I show how to install and configure these tools to integrate Jupyter with version control systems, which makes collaboration easier for machine learning practitioners. By adopting these tools and practices, we can make working with Jupyter notebooks for machine learning more effective and enjoyable.