michael decrescenzo - Highly modular blogging with Blogdown

When I finished graduate school, I tore down my website.

For a handful of reasons. I no longer needed a website that cried out, “Help me, I’m finishing my PhD and I need to escape.” I didn’t need to showcase unpublished papers, teaching resources, or old blog posts that I had grown detached from. It was time for a clean reset.

But if you work with Blogdown, you know that starting over is laborious. Not that Blogdown isn’t great, because it is. It’s that, when you’re a finicky person like me, setting up a website with the right balance of capable features, pleasant aesthetics, and a principled codebase is legitimately challenging. I was encountering the same familiar challenges over and over.

For example, the site’s Hugo theme. I would take a lot of Hugo themes for test-drives. Hugo advertizes themes as if they were completely modular with respect to your /content/ folder. For most themes, this is a lie. Themes usually want too many bespoke variables or file structures in your website content. Some amount of this is okay, but it comes at a cost. If you really want to take a theme for a spin, I would find it easier to create an entirely new blogdown::new_site() than to change my theme in an existing site.

Except now you’re dragging the same files around your computer all over again. Ugh, this new website directory needs your blog source files, your website-level .Rprofile that controls Blogdown’s build behaviors, the jpg/png image files that you use to brand yourself online, etc… And maybe these files need to go in different folders or be given different file names from the previous theme. After a while, these files no longer have a single authoritative “home” on your computer, and you may have multiple conflicting(!) versions of these files across your various experimental website folders.

And then there’s reproducibility. Even after lugging around all the same files to the new site, good luck getting your blog posts to render if your package library has changed since they were written, which it probably has. Danielle Navarro wrote about reproducibility in Rmarkdown blogging, and argues convincingly that the best way to protect your .Rmarkdown posts from this rebuilding risk is to create a dedicated package library for each separate blog post using renv. This sounds intense at first, but the underlying principle is simple, which makes it a good solution to a difficult problem.

This post will continue that pattern: intense at first, but well-founded, solutions for difficult problems.

What this post is about: modularity

What we want is a principled and robust approach for managing the many interlocking components of your website. Specifically, we explore the modularity of the elements in your website. I take the view that your website is a collection of modular components that are better managed independently, with different Git repositories for the different site components. Yes, managing your website with multiple repositories. Stay with me.

The modules that compose your website include your theme, your blog posts, your Blogdown build preferences (implemented in your .Rprofile), and maybe more. These modular components come together at the nexus of the website, but as I will argue, these components should not belong to the website. Why not? Because these components can be re-used across different websites or substituted with other similar components. Hugo already flirts with this idea in its core design by separating the /theme/ directory from the /content/ directory, as if to say, “these components can be combined but do not depend on one another.” This post takes an opinionated stance that such modularity is a good idea and should be assertively extended to other components of your Blogdown site. That said, I make no assertion that this stance is objectively correct—only that it has been useful enough for me that I wanted to share some thoughts about the principles and processes at work. (You should do what works for you!)

Modularity as a software philosophy is one thing, but implementing it in code requires technical solutions. This post will discuss how to achieve this using Git submodules, an intermediate-level Git construct that, if you’re like me, is somewhat familiar but somewhat intimidating. In short, a Git submodule is a repository-within-a-repository. It has its own version history that is distinct from the “parent” repository. In this post, I provide a simple tour of submodules and how they can be used to structure your website workflow. We will recast our website as the “primary” repository, and we import other modular site components (like our blog posts) as Git submodules. In case you host your blog on Netlify, I will also discuss how to ensure that Netlify can build your site successfully.

Aside: some terminology

This discussion will involve plenty of concepts that sound similar to one another but should be understood as distinct things. I want to flag these concepts so that we understand each other better.

Directory vs. repository. A directory is a folder on your computer that holds files. A (Git) repository tracks changes to files. For many projects, the project’s root directory is entirely managed by one repository, so the distinction between the two may be blurred. When Git submodules are involved, this is no longer true. Your website directory will be managed by one repository, and sub-directories below your website will be managed by other repositories.

Website vs. module. The website is the entire project that puts your website online. Your website will contain various modules that combine to build the entire project. Your blog posts will be considered a module (or several modules, depending on your implementation). Your theme is another module. Think of modules as building blocks for your website that can be stacked, swapped out, and so on.

Parent repository (for the website) vs. child repository (for the module), a.k.a. “submodule”. The website and the module will be versioned by separate repositories. We can refer to the over-arching project repo (the website) as the “parent” repo and the module repo as the “child” repo. A “Git submodule” is a Git construct that is overlaid onto this relationship between repositories. A repository, in isolation, is simply a repository. But if you import a repository into another project as a dependency, Git designates the dependency as a “submodule” to the parent repository, and this affects our Git workflow as a result. I explain all of that below.

Websites are a collection of modules

Modules are like little building blocks, and your website has plenty of them. Setting aside any formal definition of what would mathematically be considered a “module”, let’s crudely define them as structures in your website that are agnostic to the content of other structures. We may be able to replace modules with other modules, or remove modules entirely, without affecting the core function of other modules.

Here are some examples from my own workflow. I consider my blog posts, Hugo theme, and blogdown build settings (in my site-level .Rprofile) as modular components within the website as a whole, and I version each component with its own separate repository. Here is how I justify this view for each component:

Blog posts: The content of a blog post is completely separable from the website repo. We can take a blog post and locate it in a different website, and the blog post should still be meaningful (and reproducible) unto itself. Many blogdown users remake their websites and carry their old blog posts to the new sites, which shows that the blog content doesn’t functionally depend on the website.

It turns out that, for blog posts, modularity and reproducibility are pretty closely related. In her discussion of blog reproducibility, Danielle Navarro touched on the principle that a blog should be “encapsulated” or “isolated” away from the broader website to robustify the blog against other dependencies. By insisting that blog posts also be modular, not only is the blog protected from the website’s computational environment, we can control each post independently of one another, move posts around across contexts, and remove posts entirely without side-effects.

This also affects how we treat the blog post’s dependencies. Suppose that your post includes an analysis on a data file that you read from disk. This file should belong to your blog post—and be versioned by that blog post’s Git repository—not your website. This means you should keep all of these files in the blog post directory, and forget about the website’s /static/ folder except for files that rightfully belong to the website.
Hugo theme: Hugo is designed such that the /content/ of a website (specified in markdown files) is more-or-less independent of its /theme/. The same theme can be used for multiple websites, and a single website can (in theory¹) swap out one theme for another. Because themes are managed with Git repositories already, you can pull theme updates from their remote repositories without overwriting any bespoke theme customizations specified in your /layouts/ folder.

Blogdown complicates this somewhat. When you install a theme with blogdown::install_theme(), Blogdown actually deletes the theme’s .git directory. (At least, this was my experience.) This is probably for ease-of-use among users who would not appreciate having to manage the theme as a submodule. But we are enthusiastic seekers of modularity, so we want to keep that upstream remote connection alive. As such, I installed my site’s Hugo theme using Git submodule operations instead of installing it with blogdown::install_theme().

The website .Rprofile file: You may have a global .Rprofile file, but it is an increasingly common Blogdown workflow staple to set up a website-specific .Rprofile to control Blogdown’s build behavior. How is this a module? Your blogdown build preferences are probably not specific to this website repository. Instead, it is likely that your preferences reflect your workflow for blogging in general and could be equally applicable to any other website repo you create or manage. If you change your blogdown workflow in a way that bears on this .Rprofile file, that change may affect all of your blogdown websites equally! Managing these .Rprofiles separately for each website would be inefficient and error-prone, so instead we manage the .Rprofile in one repository that we import to our website as a submodule.

How to accomplish this: Git submodules

Git submodules are repositories-within-repositories. Suppose you are working on a project repository (like your website), and there are external tools or resources that you want to import from another project. You have a strong project-based workflow, so you want all of the code that creates your website to be contained within the website directory on your computer. At the same time, the external dependency is clearly its own entity, and there is no reason why its code should be owned by the website repository. Git submodules allow you to clone this dependency repo into your website directory so you can use this code without versioning it redundantly.

Submodule basics

If you have never worked with submodules before, here is how they work in broad strokes. (This is not an exhaustive intro.)

When you add a submodule to a parent repository, the parent repository tracks the presence of the submodule, but it does not track the content. Your website repo tracks the presence of submodules to ensure that your project can be reproduced (read: cloned) with all necessary dependencies in place.² However, your website repo is ignorant of the actual content of the submodule because the submodule code is versioned by its own separate repo. There is no need to duplicate that effort.

Upstream changes to the submodule repo can be pulled into your website repo. This is standard workflow for Git. If you want to pin your dependency to a particular commit of the submodule, you can git checkout that commit. If you want your dependency to stay dynamically up to date with the submodule’s remote repo, checkout the desired branch and pull changes as they arise on the upstream remote.

Local changes to the submodule content can be pushed to remote. If you have write access to the submodule’s remote repository—either you own the repo, or it’s your fork of some other repo—you can make changes to the submodule contents from within the submodule and push those changes back upstream.³ This is just like a Git workflow where multiple users are pushing to the same remote repository, except instead of multiple users, it’s only you, editing the repo and committing/pushing changes from different endpoints. This allows you to keep the submodule content updated on all of its local and remote copies without duplicating any effort.

How to add your website components as submodules

In the spirit of modularity, there is actually nothing Blogdown-specific about including submodules within a project repository. All the same, I will discuss a Blogdown-specific example: the .Rprofile module, which I keep in its own repository here. I discuss how I manage blog posts with submodules later on, because that conversation is a little more involved.

You can add a submodule to your (already initialized) website repo with git submodule add [my-url] [my-destination-folder]. You will want to be strategic about where you add the repo, since it will effectively behave like a cloned repository. I often create a /submodules/ folder under my project root and clone submodules to that location.

# from /path/to/site
mkdir submodules
cd submodules
git submodule add git@github.com:mikedecr/dots_blogdown.git

Adding the submodule does not clone its contents. It simply registers the submodule with the repository, creating an entry in the website repo’s .gitmodules file. You have to run a separate command to actually clone the submodule repo’s contents:

git submodule update --init --recursive

The output will look like you did a git clone. At this point, there should exist a folder called /dots_blogdown/ that contains the repo contents.

From there, your next step depends on how you want to use the contents of the submodule. For this particular example, we want this .Rprofile to live at the top of our website root. This ensures that the file’s code is executed when we open R to manage our website. I achieve this by linking the file to the website root (and, bonus, removing write permissions⁴).

# exit /submodules/
cd ..
# -s = symlink, -f = force
ln -f ./submodules/dots_blogdown/.Rprofile ./.Rprofile
# bonus: remove write-permissions (make read-only)
chmod -w ./.Rprofile

It is smart to automate any post-Git processes, such as linking files to other destinations, by putting these commands and other pre-build operations in your website’s /R/build.R file. This ensures that these operations are done each time your website is built, ensuring that your website can be safely reproduced if your submodule content should ever change. With that automation in place, if I ever changed my .Rprofile repo, I never have to worry about manually re-linking my updates to the right destination. The build script does it for me.

Developing within the submodule repo

The above instructions describe how to simply employ submodule files in your website. But suppose you wanted to change the content of the submodule files and push those changes back upstream. What would you do?

Before making any changes to the submodule files, make sure the submodule isn’t in detached HEAD state. A detached HEAD state is basically what happens when you have checked out a commit in isolation of the branch on which that commit lives. When you are in detached HEAD state, you are basically looking at a copy of the project, but you cannot alter the project tree itself. Any files you change cannot be committed to a persistent branch. To make permanent changes, you have to checkout the branch that you want to track and commit changes to, which is probably main.

Make your changes. Even though you are editing a file within a submodule repository, Blogdown doesn’t know or care, so it shouldn’t behave any differently. It will knit/render blog posts and serve your website locally like nothing is wrong. That’s because nothing is wrong.

Commit changes to submdodule files to the submodule repository. From the command line, this means you probably should cd into the submodule repo before adding any files to the index. If you do Git stuff inside of a GUI, you should be able to make the submodule appear as its own repo that you can do add/commit/push actions to. (I don’t use Rstudio, so unfortunately I don’t know if Rstudio makes this easy.) After committing to your local copy of the submodule repo, you should notice that your parent repository detects an updated commit in the submodule! You should commit that change to the parent repository as well. This simply tells the parent repo that it should consult this new submodule to reproduce the project correctly. This is important because anyone else who clones your website repository (ahem, Netlify!) will need to import the submodule at the correct commit.

Both submodule and parent repos can be pushed. If this is your first time pushing any submodule-related commits to Netlify, you will want to read the section about Netlify below.

As you get more familiar with Git, you won’t need to follow a checklist. You will simply be familiar enough with how Git works to know exactly what to do!

What to do about your blog?

Should your blog be one submodule repository, or several? My current setup is to treat every blog post as its own, separate module with its own, separate repository. This keeps each post and all of its dependencies isolated from other posts, which is cleanest for me from a reproducibility and modularity standpoint.

However, you may find many blog post repositories to be overkill, and would instead want a single repository containing all of your blog posts. Would that be fine?

In short, the single-blog-module setup may be possible, but it will likely require even more advanced Git magic than just submodules. If you really want to know the nasty technical details, you can read about the problem and one potential solution, with the caveat that I haven’t tested that workflow out. If you trust me that the single-repo workflow is pretty complicated except for people looking to increase their Git dexterity stats, you can skip ahead to read about separate repositories for each post.

One submodule for all posts: the problem

To explain, consider the submodule workflow mentioned earlier. If we wanted to use a “single submodule” approach to blogging, we would

Move our blog posts to another repository and push it to the web.
Add this repository as a submodule located in your content/blog or analogous subdirectory.
The changes in the content/blog folder are now owned by the submodule repository. The parent repo will no longer see what’s happening in those files—only if you have made new commits.

Unfortunately, this may be a critical problem for your website! This is because many themes ask you to put other important files under your content/blog directory, in addition to the posts. Many popular themes ask for a content/blog/_index.md file to manage the blog’s “listings” page. Many themes also will accept image files in that directory to use for headers and sidebars on the listings page. These files are problems for the single-repo workflow. If we let our blog submodule own the content/blog directory, those files can no longer be tracked by the parent (website) repository. Adding the files to the submodule’s .gitignore does not fix it either. So, what can be done?

One submodule for all posts: there might be a way

I haven’t tested this, but there might be a way to save the unified-blog-repository workflow: you could make your blog repository a bare repository.

A bare repository is a repository with no root directory. Now, if you have only used Git on a per-project basis, the idea of a repo with no root directory sounds unthinkable, but it is actually a common way to version your “dotfiles”. Here’s why: your dotfiles usually live at your /home/username/ or ~/ directory. Many folks want to track these files to keep certain preferences synchronized on different machines, but as you can foresee, making a Git repository track your entire ~/ folder would be a horrible and terrifying idea. Instead, people create bare repositories that only track the contents that are explicitly added to the repository, regardless of their relative location to the repo’s .git folder.

How might this ameliorate our workflow problem? If we want one submodule repo to track all of the posts in /content/blog, but we don’t want that repo to own the other files in that directory, we might be able to achieve that effect with a bare submodule repo. Such a repo shouldn’t be aware of the other files under content/blog, because the repo doesn’t know that it is the same folder as those files.

Again, try it if you want, but you have no assurances from me that it will work.

My choice: every post gets its own repository

In lieu of the “advanced solution”, we opt for peak modularity: every blog post gets its own repository.

This workflow sounds tedious but is actually easier than you would think, and most of the steps are identical to what I have already covered above. Here’s a quick rundown of what I do:

Start on Github or whichever remote service you prefer. Make a remote-first repository for a new post (give it a meaningful title) and copy its cloning link.
Add the new repo as a submodule to a new folder for that post inside of /content/blog. I assume you use a “page bundle” model for organizing your blog code: separate folders for each post that contain respective index.[R]markdown files. It’s advisable to blog with page bundles even if you don’t want to implement hyper-modular blog versioning. Learn more about page bundles from Alison Hill here.
Your .gitmodules file will automatically update to reflect the new submodule. You will eventually want to commit that change, but it doesn’t have to be now. If necessary, initialize/update the submodule to clone its contents into the new post directory.
Checkout your desired submodule branch (e.g. main) so you can commit changes to your blog repo.
Edit your post as you normally would by creating an index.Rmarkdown and typing away. This is where you would use renv to take a snapshot of your R package library in order to reproduce the post. Hugo will trip over the files created by renv, however, so if you want to use it (again, you should), add "renv" to the ignoreFiles field in your website’s config.toml (which you only have to do once per site).
Commit changes to the blog module repository and push to remote.
You should notice that your parent repository detects an updated commit in the submodule. Commit that change to the parent repository as well. Pushing this website commit to remote will kick off a new Netlify build if you use continuous integration. Speaking of that…

Getting it working with Netlify

Once you are done getting your site looking the way you want, and all of your files are committed to the parent and child repositories, you can push your website repo to the remote that Netlify is tracking.

Except, whoops, your site may fail to build on Netlify. Why? Netlify works by cloning your website repository to their servers and building it with Hugo on their end. This process fails if Netlify can’t successfully reproduce your website repo with all of the submodules declared in your .gitmodules file. Such failure can happen for two benign and fixable reasons: (1) the submodule is a private repository, or (2) the submodule was added using the repo’s ssh URL instead of the https URL.

In either case, all you have to do is add ssh-keys to grant Netlify access to these repositories. It sounds complicated and jargony, but Netlify describes the whole process right here.

Once Netlify has access to the repositories, it can build its own copy of your website. This is because your parent Git repository spells out all of the instructions for cloning the required submodules at their requested commits.

Closing note

This post presents an opinionated view of a Blogdown website as a collection of modules and a corresponding workflow for managing them. If you find it helpful, awesome! But as always, you should do what works for you. It happened to be the case that I had a particular set of problems and a desire to strengthen some skills could help me solve them.

Footnotes

The system isn’t perfect. Some themes define special fields whose values are specified in your content files, but the main idea is there.↩︎
This is how Netlify builds your site, in fact. Netlify clones your website’s Git repository and builds it on their servers, so this is actually super important.↩︎
Just be sure you have checked out a branch (not in detached HEAD state) before you commit changes to the submodule files. More here.↩︎
Because I forcefully link the .Rprofile file from the submodule to the website root, any changes I make to the copy at the root would be overwritten if I ever re-linked the file. This is why I make the file read-only: to prevent myself from editing the wrong copy of the file. Just a little trick to guard against bugs :)↩︎