Jupyter notebooks and collaboration
Git has been almost universally adopted for collaborating on code, as Jupyter notebooks have been for data exploration and interactive modelling. However, when bringing the two together, a problem arises: Git was designed to version plain text files, such as those containing source code, and not structured data like JSON documents or binary data such as images embedded in Jupyter notebooks.
This makes it challenging to follow best practices, including making small, atomic commits on topic branches and submitting them for code review, without additional tooling and processes.
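To make the problem concrete, here is a minimal sketch of what a line-based diff sees when a notebook is re-run without any change to its code. The notebook is a hand-built dictionary that imitates the nbformat 4 JSON layout, and the helper names `make_notebook` and `changed_lines` are hypothetical, chosen only for this illustration. Python's standard `difflib` stands in for Git's line-based diff.

```python
import difflib
import json


def make_notebook(execution_count, output_text):
    """Build a minimal nbformat-4-style notebook with one code cell.

    Hypothetical helper for illustration; real notebooks carry more metadata.
    """
    return {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": execution_count,
                "metadata": {},
                "outputs": [
                    {
                        "name": "stdout",
                        "output_type": "stream",
                        "text": [output_text],
                    }
                ],
                # The source is stored as text and is never executed here.
                "source": ["print(model.score(X, y))"],
            }
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


def changed_lines(before, after):
    """Return the added and removed lines of a line-based diff of the JSON."""
    a = json.dumps(before, indent=1, sort_keys=True).splitlines()
    b = json.dumps(after, indent=1, sort_keys=True).splitlines()
    diff = difflib.unified_diff(a, b, lineterm="")
    return [
        line
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


# Same code, same output; the cell was simply run again later in the session.
first = make_notebook(1, "0.91\n")
rerun = make_notebook(7, "0.91\n")

changed = changed_lines(first, rerun)
for line in changed:
    print(line)
```

Even though the code in the cell is identical, the diff is not empty, because the execution count is stored in the file. With plots or large tables in the outputs, the same effect produces far larger diffs that say nothing about the code under review.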
At PyCon UK, I gave a talk on tools and practices that enhance the collaborative and productive use of Jupyter notebooks for machine learning, which you can watch below.
I demonstrate how built-in Git features, such as incremental staging of changes, can eliminate noise from changed cell execution counts. I then show how simple tooling can automatically clear output cells from notebooks before changes are committed to Git, which keeps binary data out of the repository.
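As a rough idea of what clearing outputs involves, the sketch below strips them using only the Python standard library. It is not the tooling shown in the talk: the function names `strip_outputs` and `strip_file` are hypothetical, and the code assumes the nbformat 4 layout in which each code cell has `outputs` and `execution_count` fields. Existing tools such as nbstripout, or `jupyter nbconvert --clear-output --inplace`, do the same job more robustly.

```python
import copy
import json


def strip_outputs(notebook):
    """Return a copy of a notebook dict with code-cell outputs cleared."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drops text, images, tables
            cell["execution_count"] = None  # removes run-order noise
    return cleaned


def strip_file(path):
    """Rewrite the notebook at `path` in place with outputs cleared."""
    with open(path, encoding="utf-8") as handle:
        notebook = json.load(handle)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(strip_outputs(notebook), handle, indent=1, ensure_ascii=False)
        handle.write("\n")


# A hand-built example cell holding an embedded (placeholder) PNG payload.
example = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 12,
            "metadata": {},
            "outputs": [
                {
                    "output_type": "display_data",
                    "data": {"image/png": "iVBORw0KGgo..."},
                    "metadata": {},
                }
            ],
            "source": ["plot_confusion_matrix(model)"],
        },
        {"cell_type": "markdown", "metadata": {}, "source": ["# Results"]},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}

cleaned = strip_outputs(example)
print(json.dumps(cleaned["cells"][0], indent=1))
```

A function like `strip_file` could be called from a pre-commit hook or a Git filter so that outputs never reach the repository, while the working copy on disk can keep them if the filter approach is used.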
I then introduce the nbdime tools from the Jupyter project, which allow diffing and merging Jupyter notebooks, and show how to install and configure these tools to integrate Jupyter with Git. By implementing these tools and practices, you can make working with Jupyter notebooks for machine learning both more effective and more enjoyable.