17 December 2020

Organise your commits

Crafting a beautiful PR is not a high priority for a lot of people, but I think it should be! In this post, I'll expound on what a good, well-organised PR looks like, why you should do that, and how you can do that. This is an opinionated post, not necessarily about an objective truth (there are multiple different Git workflows and tools, and this post is about my personal approach, which I believe to be the best :-) ).

Why?

Clean, organised commits make code easier to develop, debug, review, and maintain. If you find a bug during development or later, you can bisect through your commits to narrow down the bug to a single, small chunk of code. By making each commit a logical unit with a good description, you make your PR easier to review. Thinking about your work as a narrative flow of small changes can lead to better design and neater code. Good documentation of commits has all the benefits of good documentation in general, plus they are intrinsically timestamped, so it never gets out of date. I also find good commit docs generally more useful than code docs because they are more likely to explain why code was written a certain way.

Some basics

I'm going to assume you know what Git is and what commits and PRs are. This post is only really relevant for non-trivial PRs. For really small PRs, you just need a single commit and most of the rest is irrelevant.

It's worth noting that PRs are not a Git feature, they are a GitHub feature (or whatever Git hosting system). So the concept of a PR does not show up in your Git history or have any representation as far as Git is concerned. When a PR is 'merged' (I'm using quotes because this may not technically be a Git merge, but is colloquially called that), GitHub must represent it only as commits; exactly how it is represented is up to the project maintainers (i.e., not up to the PR author). In other words, there are technical reasons to follow the project's workflow, as well as social ones.

Git histories

Git allows significant flexibility in the shape of its history, it can be any DAG. However, it is convenient for humans for the history to be as linear as possible. It makes it easiest to reason about, and easiest to bisect. Tools often depict the history as linear, even when it is not, which can make things confusing. Finally a linear history makes reverting code much easier.

When branches are merged, Git creates a merge commit, which is a commit with two parents and whose purpose is solely to unite the two branches into a single branch. If we never merge, then the Git history will be a directed tree, i.e., from any branch looking back in time, the history is linear.

For project maintainers

Project maintainers can choose to preserve every commit in a PR or squash all commits in the PR into a single commit. They can encourage authors to use either rebase or merge to keep their code up to date (e.g., by making these tools available via GitHub or a bot). It is common to use either the combination of preserving commits and using rebase, or to squash commits and use merge. It's also fine to squash commits and use rebase (which produces a more linear history, but has coarse-grained commits). Preserving commits and using merge is ugly because it creates lots of noise in the history.

You should preserve commits and use rebase. Rebasing keeps your history linear and clean. Preserving commits lets each commit stand alone, for documentation, bisection, etc.

For authors

You should follow the project conventions. Being consistent is more important than being right. However, there is some wiggle room. Even if the project squashes commits, you can use fine-grained commits rather than one big commit to make your code easier to review (even though the commits won't be preserved in the history). Rebasing your code to keep it up to date helps keep the history more linear, even if the code is eventually merged. If the project has made the correct choices, then you can still submit your PR with a single commit and merge to keep it up to date, but you will not be popular.

The rest of this post is for PR authors and gets into the details of what a good PR structure is, assuming your commits will be preserved and that you and the project maintainers will rebase your code, rather than merge it with master.

Structuring commits

Each commit should logically stand alone (like the cheese). It should do one thing and do it well. That includes comments and docs, unit tests, etc., those things should not be in a separate commit.

You should break up a PR into as many commits as possible, whilst still keeping each commit coherent. The smaller the change, the easier it is to reason about and review.

Every commit should compile and pass all linting and tests; try not to reformat or polish in later commits. Doing this makes the PR easy to bisect and review.

Some good things to separate into their own commits are:

preparatory refactoring (best is to make each refactoring its own commit, e.g., it is easier to review a commit which only moves a large function to another file (just check that the text is the same), then another commit which splits the function up, rather than one commit which does both).
Bulk changes (anything done with search and replace, generated code, reformatting existing code, changes to error messages or user-facing strings, etc.).
Integration tests and benchmarks.

Review changes

Your reviewer is likely to have change requests for multiple commits. However, it is convenient to have all changes in a single commit so the reviewer can check they have been made to their satisfaction. So what to do?

I think the ideal workflow is that you make all the changes in a single commit (or a single commit for each round of change requests). Then when the PR is ready, the author splits up the commit and amends the changes to the appropriate commit so there is no 'review changes' commit. However, this is impractical because splitting a commit and fixing up others is a bit painful, the reviewer should check that no new changes have been introduced (effectively requiring a re-review), and it doesn't fit with the common workflow of the reviewer merging the PR as soon as they are satisfied.

Therefore, in practice I think that keeping a 'review changes' commit is the least bad solution (unless all changes are to a single commit). If there are rounds of re-review and it is not too painful, then you might be able to tidy things up a bit.

Commit messages

A good commit message makes a commit easier to review and serves as useful documentation.

By convention (though a convention which is understood by Git and GitHub, so almost a requirement) a commit message has two parts: a short title, and a longer description. The title is the first line and should be limited to 50-ish characters. Then a blank line, then the description (some people advocate for line breaking the description at 72 characters).

The title should give a summary of what is changed. Some projects have conventions for the title, a common one is module name: summary, e.g., storage: refactor prewrite handling.

The description should be detailed. There is no reason to write a short description (though obviously you should avoid being pointlessly verbose). A good description should include:

a more detailed summary of the changes,
how it works,
why the change is necessary,
why the implementation is the way it is,
any alternatives considered and why they weren't chosen,
the problem being solved, including links to issues, previous work, etc.
any additional context,
links to design specs or previous PRs, if appropriate,
any known issues with the solution.

As with any documentation, keep your audience in mind. In this case the audience is likely to be other engineers or QA folk. Probably they have a mysterious bug they can't fix and are looking for more information: they are likely to be in a bad mood. They may just be looking for more context when making changes to the code. In either case they probably want to know what is in the commit and why it was implemented that way.

They may not know the code well (bugs have a habit of showing up non-locally). They will have a lot of commit messages to read, and need to decide quickly if this commit is relevant. It is therefore very important for the summary to be clear and accurate, and not to bury important things deep in the description. On the other hand, the description should be thorough - you can't assume the reader has a lot of context.

Examples

Here are a couple of examples of good commit messages from the Rust repo:

b2c51c, by @nikomatsakis:

avoid generating drops for moved operands of calls

Currently, after a CALL terminator is created in MIR, we insert DROP
statements for all of its operands -- even though they were just moved
shortly before! These spurious drops are later removed, but not before
causing borrow check errors.

This PR series modifies the drop code to track operands that were
moved and avoid creating drops for them.

Right now, I'm only using this mechanism for calls, but it seems
likely it could be used in more places.

c7bd5a, by @alexcrichton:

Fix disagreeement about CGU reuse and LTO

This commit fixes an issue where the codegen backend's selection of LTO
disagreed with what the codegen later thought was being done. Discovered
in #72006 we have a longstanding issue where if `-Clinker-plugin-lto` in
optimized mode is compiled incrementally it will always panic on the
second compilation. The underlying issue turned out to be that the
production of the original artifact determined that LTO should not be
done (because it's being postponed to the linker) but the CGU reuse
selection thought that LTO was done so it was trying to load pre-LTO
artifacts which were never generated.

The fix here is to ensure that the logic when generating code which
determines what kind of LTO is being done is shared amongst the CGU
reuse decision and the backend actually doing LTO. This means that
they'll both be in agreement about whether the previous compilation did
indeed produce incremental pre-LTO artifacts.

Closes #72006

Git workflow

OK, so how do you practically organise your commits? There are many different tools and techniques for using Git. I prefer to use the command line, and here is my workflow for producing an organised PR.

My top tip is to plan to tidy up later. Coding is intrinsically iterative and it is unlikely that you'll get everything right first time around, or that you'll know the perfect structure for your PR in advance. Having said that, the more you can plan things up front, the easier organising your commits will be (and probably the better your code will be in general).

The most important way to support this is to commit early and often. Make lots of small commits with basic summary descriptions (but avoid just using WIP for every commit, you want to know what is in each one). It is much easier to merge commits than it is to split them up.

Before I get into the actual Git commands I use, it is really useful to have some aliases for commands you use a lot. Git has a terrible UI and you can improve things a lot by having your own aliases. I have the following in my .zshrc:

alias amend="git commit -a --amend -C@"
alias mr="git checkout master"
alias rb="git rebase master"
alias rbc="git rebase --continue"
alias up="git pull upstream master"

rbi() {
    git rebase -i HEAD~$1
}

cmt() {
    git commit -a -s -m $1
}

ck() {
    git checkout $1
}

The most useful of those is probably amend (amends all changes to the current head commit), followed by rbi (interactive rebase over the previous $1 commits).

When I start work I create a new branch (git checkout -b $name) then I commit often (git commit -a -m $message) and amend often (git commit -a --amend -C@). As mentioned above, lots of small commits makes things easy to organise later. By amending little and often, I can easily fix things if I make a mistake using git reflog. I find it easier to write commit messages at the end, once I know the structure of the PR, but if it's a large PR then I'll do multiple rounds of organising commits and writing messages so that I don't forget things.

I rebase my work daily to keep it up to date (commit, then git checkout master, git pull upstream master, checkout my branch, git rebase master. There are quicker ways to do this, but I often work on multiple branches so I like this way). I never merge.

When I'm done, I use git rebase -i ... to 'merge' and move commits, and to rewrite commit messages. If I need to break up a commit, then I un-commit it (git reset HEAD~1, changing 1 if I need to do multiple commits), then add changes with git add ... and commit without -a. I use git add with -i or -p if I need more fine-grained control. Then when I'm done, I push to a remote branch (git push).

I end up using the following commands fairly often:

log to list commits,
show to see a specific commit,
status and diff to see current changes,
reflog and reset when things go wrong,
commit --amend to change the commit message of the most recent commit.