9 minutes
Publish jupyter notebooks as blog posts
I wanted to publish my data science work to a blog as easy as possible. My motivation was to do it really simple, because I want to concentrate on the data science part and not on the blog post creation. With this approach I can just have one single source of truth, which is the jupyter notebook. I can just create a notebook with all the code and explanations, and the rest is done by the workflow. This blog post which you are reading right now is also created with this approach. 🙃
Motivation
The root of the problem, that I really like working in different repositories for each of my hobby projects. This basically means, that I am creating a new github repository for each of my data science projects. For the blog it is not ideal because the blog posts are living and should live in one single repository (I am using Hugo). I could fit my workflow to the mono repository approach, where I just create an extra directory for the data science work in the blog’s repository, but I don’t like it 😀. I know this is pretty subjective, but I like to keep my projects separate; I might not want to share every project, I might want specific automation for individual ones, and a monorepo could quickly grow in size, making it harder to manage. Because I stick with the separate repository approach, I needed to find a way to publish the data science projects to the blog repository.
The workflow
As mentioned before, my goal is to publish my hobby projects with minimal effort. The first time I published my work, I had to create a notebook with all the code and explanations just to generate the final markdown file. Afterward, I had to copy the markdown from the project’s repository to the blog repository, and I had to deploy it manually. This process is tedious and time-consuming.
The go to solution for me was to use github actions. Github actions are a great way to automate workflows, and I can implement custom actions to fit my needs.
I created a custom github action, which basically can be used to publish the jupyter notebooks to the blog. The advantage of this approach, that I can just create a template repository, which can be used for any future data science projects, where the required github actions are already set up. Furthermore it allows me to make changes only in the custom github action, without the need to change the action in each project repository. After setting up the data science projects, from the template repository I can just focus on creating the individual project notebooks. When I am ready to publish, I just need to push the notebook to it’s repository, and the workflow will take care of the rest.
The following diagram shows the workflow of the process:

The custom github action
As described the first step was to create a custom github action. The action needs 5 parameters:
blog-gh-token: The github token for the blog repository. This is needed to be able to push the changes to the blog repository. The token in my case is a fine-grained personal access token, which only has the permissions to push to the blog repository.notebook-path: The path to the notebook, which we want to publish.target-repo: The target repository. This is the repository where the hugo blog is living.target-branch: The target branch, of the blog repository.target-page: This is the folder in the blog repository, where the notebook should be published.
Usage of the action:
After triggering the action, the following steps are executed under the hood:
Setting up python with nbconvert
nbconvert is used to convert the notebook to a markdown file. In hugo the content is stored in markdown format, so this is a requirement.
Convert the notebook to a markdown file
This step is using a python script, which is located in the template repository. The python script is basically using the nbconvert library to convert the notebook to a markdown file. I added some custom preprocessing steps and configurations, to bring the notebook to the desired output format as markdown file.
In the first part of the code I am just defining some variables such as the:
USERNAME: The username of the user who triggered the actionRESPONSE: The response of the github api, which contains the user informationREAL_NAME: The real name of the userREPO_NAME: The name of the repository with the ownerREPO_NAME_ONLY: The name of the repository without the usernameTARGET_DIR: The target directory where the markdown file should be stored
After that I am calling the python script, which is doing the conversion.
In the script I am reading the notebook with nbconvert after that front-matter need to be added for the notebook; this is required for hugo to be able to detect the title, date, and other metadata from the notebook.
In some cases there are empty cells and unnecessary whitespace in the notebook cells, to get rid of them a custom Preprocessor is used.
If the notebook has markdown cells with referenced images, it is necessary to handle them additionally, these resources are referenced with . These images are not converted by nbconvert, so I need to take care of them. In the handle_custom_resources function the custom resources are added to the resourcesdirectory created by the nbconvert.
Plots, and images generated by python code are automatically converted by nbconvert, however they are added to the root level where the markdown file is stored. I wanted to organize the resources in separate folders, so I moving the resources to a separate folder.
With these steps the notebook is converted with all desired files. And the files can easily be moved to the target repository. (aka. blog repository)
Copy the notebook to the target repository
This is done by cloning the target repository in the github action and copying the converted notebook and its resources (like images) to the target branch.
If everything goes well, the notebook is coverted with all its assets and pushed to the target repository. The blog repository is set up in a way that it publishes the markdown files as blog posts in case of a push to the main branch.
Blog repositoy
The blog repository is a normal hugo repository. I am using netlify to host the blog. Netlify provides some nice features to deploy the blog with CI/CD. However I decided to deploy my blog “manually” (I mean with manually that I am not using the netlify’s build service) because I want to have control over the full deployment process. I am using the netlify cli to deploy the blog.
The deployment is pretty straightforward:
For the deployment I needed to set up a netlify authentication token and a site id. With these two environment variables the deployment is done with only one line of code.
Final thoughts
This approach is not perfect, but it works for me. I can live with the current solution, and it is a good starting point. In any future project, I can just create a notebook with all the code and explanations, and the workflow will take care of the rest.
Links
1721 Words
2024-10-04 00:00