Automatic Code Formatting for Partially-Staged Files

I wrote git-format-staged to apply an automatic code formatter or linter to staged files. It ignores unstaged changes, and leaves those changes unstaged. When run in a Git pre-commit hook, git-format-staged guarantees that committed files are formatted properly, and does not clobber unstaged changes if formatting cannot be applied to working tree files cleanly.

How I learned to love automatic formatting

I used to pay a lot of attention to code formatting. I would split or join lines, indent to just the right column, and so on. Then I learned about Prettier, and tried it out in my JavaScript projects. After setting up an editor plugin I could tap a key combination, and Prettier would do the same things that I was doing by hand in an instant without any thought on my part. The code did not wind up in exactly the style that I was used to - but it was formatted according to a set of consistent, sensible rules that I can live with. I realized that if I can trust a program to format my code nicely I can free up a not-insubstantial chunk of my attention. That leaves me with more mental capacity to devote to matters that actually require human input. I have not looked back!

Automatic formatting is not specific to JavaScript. The idea has been around for a while - but I credit the Golang community with boosting its popularity. Early on, the community made gofmt a core part of Go programming conventions. The spotlight on gofmt encouraged wider adoption of code formatting in other languages. These days there is a tool available to prettily format just about every language out there.

Automatic formatting in a team

Automatic formatting can help a team to be more focused too. The formatter makes style expectations explicit; and people joining the team do not have to expend effort to learn the style rules for a project. But if some contributors run a formatter and some do not then you get unnecessary code churn when two contributors work on the same file, and one contributor's formatting rewrites the other's recent changes. That can happen in a solo project too if you forget to run the formatter consistently. (When I forget I feel guilty about not following my own rules!) The most reliable way to ensure that all code committed to your project is formatted consistently is to run the formatter in a Git pre-commit hook.

The problem with content changes on pre-commit

The naïve way to format files in a pre-commit hook is to:

  • get a list of files with changes that will be committed (staged files)
  • run the formatter on those files
  • run git add to stage any changes from formatting

The last step is important: if you do not add changes to the staging area after formatting then changes from formatting will not make it into the commit. But running git add non-interactively will stage the entire file, which will irritate contributors who like to use git add --patch or a similar editor feature to selectively stage changes to each file. A contributor might deliberately leave some changes to a file unstaged if they want to save those changes for a later commit, or if those changes contain debugging code. In the worst cases an unstaged change might include a password that the contributor has temporarily pasted into a source file for use in development. If your pre-commit hook runs git add, all unstaged changes in the affected files will be committed, and the contributor might push that commit before they realize what happened. Even if that does not happen you risk receiving a sternly-worded issue report for interfering with another developer's workflow!
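
To make the hazard concrete, here is a minimal Python sketch of the naive hook described above. It is an illustration only, not part of git-format-staged: the function names are hypothetical, and it assumes Prettier's --write flag as the formatter. Notice that git add receives a whole path, so every unstaged hunk in that file is swept into the commit.

```python
import subprocess


def staged_files():
    """Return paths with staged changes (added, copied, or modified)."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return [line for line in out.splitlines() if line]


def naive_hook_commands(paths, formatter=("prettier", "--write")):
    """Build the commands the naive hook would run, without running them."""
    commands = []
    for path in paths:
        # Rewrites the working tree file, unstaged changes included.
        commands.append([*formatter, path])
        # Stages the WHOLE file, not just the hunks the contributor chose.
        commands.append(["git", "add", path])
    return commands
```

Inside a repository, a hook built this way would loop over naive_hook_commands(staged_files()) and run each command, which is exactly the behavior the rest of this post tries to avoid.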

Enter git-format-staged

I wrote git-format-staged to reconcile automatic formatting with partially staged files. Git-format-staged runs your formatter on the staged version of each file. It ignores unstaged changes, and leaves those changes unstaged. Or it can run a linter, which can be helpful if you do not want your linter to report problems in unstaged hunks.

You can use git-format-staged in any project that uses Git for version control. It is a standalone script, and its only dependencies are Git and Python. There are detailed instructions in the project's readme; but in case you want to get started right away here is the two-step process to set up automatic formatting in your JavaScript project:

Install git-format-staged from npm. I recommend installing it as a development dependency of your project. While you are at it install a code formatter, and Husky, which will hook up your pre-commit script.

$ yarn add --dev git-format-staged prettier husky

Git does not provide a way to pre-install event hooks in repo clones. Husky fills that gap. Did you know that by default npm packages can run arbitrary code when the package is installed? Husky uses that power to copy a bunch of hooks into .git/hooks/. Each hook checks for an npm script with a corresponding name. Once Husky is installed, all you need to do is add a script called "precommit" to your package.json file that formats your code:

"scripts": {
  "precommit": "git-format-staged --formatter 'prettier --stdin' '*.js'"
}

You can provide any command that you want for formatting, but it must be "pipeable": it must read file content from stdin and write formatted code to stdout.
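
The "pipeable" contract can be sketched in a few lines of Python. This is a hedged illustration rather than git-format-staged's actual code: run_formatter is a hypothetical helper, and UPPERCASE is a stand-in "formatter" that merely upper-cases its input so the example runs without Prettier installed.

```python
import shlex
import subprocess
import sys


def run_formatter(command, source):
    """Pipe `source` (bytes) to a shell command's stdin; return its stdout."""
    result = subprocess.run(
        command,
        shell=True,          # the command is given as one quoted string
        input=source,
        capture_output=True,
        check=True,          # a non-zero exit status raises CalledProcessError
    )
    return result.stdout


# Stand-in formatter for illustration only. A real setup would pass a
# command such as 'prettier --stdin' instead.
UPPERCASE = "{} -c {}".format(
    shlex.quote(sys.executable),
    shlex.quote("import sys; sys.stdout.write(sys.stdin.read().upper())"),
)
```

Any command that behaves like UPPERCASE here (content in on stdin, result out on stdout) fits the contract; a tool that only knows how to rewrite files in place does not.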

Note that quotes are required around both the formatter command that you provide and the patterns for files that you want to format. The file patterns are similar to those in .gitignore, except that * matches files in nested subdirectories. You can supply multiple patterns, and exclude files from formatting with !. See the readme for details.
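
Here is one way the pattern rules could be modeled in Python. It is a sketch under two assumptions that the readme should confirm: patterns are matched with fnmatch-style globbing (where * also matches /), and when several patterns match a path the last one wins. The function name should_format is hypothetical.

```python
from fnmatch import fnmatch


def should_format(path, patterns):
    """Return True if `path` is selected by `patterns`.

    A pattern starting with "!" excludes matching paths. Patterns are
    checked in order, and the last matching pattern decides (assumed).
    """
    selected = False
    for pattern in patterns:
        negated = pattern.startswith("!")
        glob = pattern[1:] if negated else pattern
        if fnmatch(path, glob):
            selected = not negated
    return selected
```

With these rules, should_format("src/lib/a.js", ["*.js"]) selects the nested file, while adding "!vendor/*" after "*.js" excludes anything under vendor/.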

And that's it! Happy coding!

Git objects and the index

Explaining exactly how git-format-staged works requires getting into some details of Git's internal operation. If you want a high-level view of how Git works behind the scenes I recommend reading The Git Parable. Git is complicated - but it is more approachable than you might think.

Put briefly, every version of every file that you commit is stored as a distinct file in the Git object database in .git/objects/. Every object has a unique name, which is the hash of its content. You can see the hashes of files in your project with this command:

$ git ls-files --stage

100644 1fed445333e85fb9996542978fa56866de90a2fb 0       .flowconfig
100644 d95266a6abbfb88067c449565b3ed01ab08fc639 0       .gitignore
100644 0e81c64902c1e6d5455addac38a9c6a3f01c2190 0       .travis.yml
100644 792ca2246057929ed88cd5ecc02eda6f1472cea9 0       LICENSE
100644 ee9a2bc0c0226cff24154937e70ef8bb4599e25d 0
100644 721bafb9848dcc0e5bd5166e5a227adfd8ccfe92 0       commitlint.config.js
100755 3f1ffbb770142bd0b37eb1e855a742fb38c2cb8b 0       git-format-staged
100644 be28f28111a221450bb8b8f11e9a7f6fe397947d 0       no-main.js
100644 75e8c8c29ae642f52462a5286bcd342b089d4783 0       package-lock.json
100644 294e47fcffe47264e1a3d53678890cadac0b26ba 0       package.json
100644 236c1ff26fcd39f5efdf8b31c0c68f7b3839762f 0       test/git-format-staged_test.js
100644 408f26e1aa5450a8a42987f2f46df5041e3fdd75 0       test/helpers/git.js

From left to right those columns show each object's mode bits, object name / hash, stage number, and file path. The output of git ls-files --stage shows the state of the Git index. The index (sometimes called the cache) can be approximately described as the state of your repository content right now. It initially matches the state of the most recent commit in your working branch. Whenever you stage changes, the index changes. When you create a commit the state in the index becomes the state of content in the new commit.
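
If you want to work with index entries programmatically, the four columns are easy to pull apart. The Python sketch below assumes the usual git ls-files --stage layout, in which a tab separates the path from the other three fields; IndexEntry and parse_index_line are hypothetical names used for illustration.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    mode: str          # mode bits, e.g. "100644"
    object_name: str   # hash of the object the entry points to
    stage: int         # stage number (0 unless a merge is in progress)
    path: str          # file path relative to the repository root


def parse_index_line(line):
    """Parse one line of `git ls-files --stage` output."""
    # The path comes after a tab, so it may safely contain spaces.
    meta, path = line.split("\t", 1)
    mode, object_name, stage = meta.split()
    return IndexEntry(mode, object_name, int(stage), path)
```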

Try creating a new file, and stage it:

$ echo "const foo = ()=>'foo'" > new_file.js && git add new_file.js

What happens when you run git add is that Git creates a new object in the object database using the content of new_file.js, and adds an entry to the index that points to the new object:

$ git ls-files --stage | grep new_file.js

100644 9a622ce1db369d03a7eaca94c4306b9b0f00429c 0       new_file.js

When you stage changes to a previously-committed file the process is similar: Git creates a new object with the latest content of the file, and changes the index entry to point to the new object.

Now make another change to the same file, but do not stage the change:

$ echo "const bar = ()=>'bar'" >> new_file.js

Unstaged changes are not represented in the index, or in Git's object database. If you check the index again you can see that it is still holding onto the first version of new_file.js:

$ git ls-files --stage | grep new_file.js

100644 9a622ce1db369d03a7eaca94c4306b9b0f00429c 0       new_file.js

The object name in the index has not changed; and Git objects are immutable, which means that the content of new_file.js in the index is the same as before. You can verify this by dumping the content of the object:

$ git cat-file -p 9a622ce1db369d03a7eaca94c4306b9b0f00429c

const foo = ()=>'foo'

What this shows us is that the staged version of a file is a file on disk, distinct from the working tree version of the file. It just happens that the staged version exists in the Git object database. When you view "staged changes" what you see is actually a diff between the latest commit and the index.

Git-format-staged works by bypassing the working tree

With the right commands you can read and write directly to the index without touching the working tree. For example you can emulate the process of staging changes to a file by running the low-level steps yourself. First create an object with the content that you want. Let's format new_file.js with Prettier:

$ git cat-file -p 9a622ce1db369d03a7eaca94c4306b9b0f00429c \
    | prettier --stdin \
    | git hash-object -w --stdin

d562cee83a7d2a4108c9e37a4372e509d49e59ee

We pulled the staged version of new_file.js from the object database, fed the content to Prettier via a pipe, and piped formatted code to git hash-object, which creates a new Git object. Because we pulled file content from the object database we got the staged version of new_file.js, which does not include the definition for bar.

Next update the index entry for new_file.js to point to the formatted version:

$ git update-index --cacheinfo 100644,d562cee83a7d2a4108c9e37a4372e509d49e59ee,new_file.js

The --cacheinfo argument is of the form MODE_BITS,OBJECT_NAME,FILE_PATH. We kept the same mode bits and file path from before, and supplied a new object name / hash.

If you look at staged changes you will see that the staged version of new_file.js has now been prettily formatted.

$ git diff --cached

diff --git a/new_file.js b/new_file.js
new file mode 100644
index 0000000..d562cee
--- /dev/null
+++ b/new_file.js
@@ -0,0 +1 @@
+const foo = () => "foo";

When you run git-format-staged it runs the same steps, using the same commands.
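
Those manual steps translate fairly directly into a script. The following Python sketch mirrors the commands shown in this section. It is an illustration that assumes it runs inside a Git repository, and the function names are hypothetical rather than taken from git-format-staged's source.

```python
import subprocess


def format_object(object_name, formatter):
    """Format a Git object's content; return the name of the new object."""
    # Read the staged content straight from the object database.
    content = subprocess.run(
        ["git", "cat-file", "-p", object_name],
        check=True, capture_output=True,
    ).stdout
    # Pipe it through the formatter (a shell command string).
    formatted = subprocess.run(
        formatter, shell=True, input=content,
        check=True, capture_output=True,
    ).stdout
    # Write the result back as a new object; Git prints its name.
    new_name = subprocess.run(
        ["git", "hash-object", "-w", "--stdin"],
        input=formatted, check=True, capture_output=True,
    ).stdout
    return new_name.decode().strip()


def cacheinfo(mode, object_name, path):
    """Build the MODE_BITS,OBJECT_NAME,FILE_PATH argument for --cacheinfo."""
    return "{},{},{}".format(mode, object_name, path)


def replace_in_index(mode, object_name, path):
    """Point the index entry for `path` at `object_name`."""
    subprocess.run(
        ["git", "update-index", "--cacheinfo", cacheinfo(mode, object_name, path)],
        check=True,
    )
```

Neither function touches the working tree file, which is the whole point: only the object database and the index change.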

Keeping the working tree in sync

The staged version of the file is now nicely formatted. But the working tree file has not been changed. And the unstaged definition for bar is still there:

$ cat new_file.js

const foo = ()=>'foo'
const bar = ()=>'bar'

Directly manipulating the Git object database and index means that we did not read or write the working tree file. This leaves us with a problem: when the staged version of the file is changed by automatic formatting we want the same changes to be made to the working tree file. Otherwise any discrepancies between the working tree file and the staged file will be presented as unstaged changes that did not exist before the pre-commit hook ran.

To get the working tree back in sync with the index git-format-staged gets a diff between the original staged file and the formatted staged file to compute a patch of changes introduced by formatting. Then it applies that patch to the working tree:

$ STAGED_OBJECT=9a622ce1db369d03a7eaca94c4306b9b0f00429c
$ FORMATTED_OBJECT=d562cee83a7d2a4108c9e37a4372e509d49e59ee
$ git diff $STAGED_OBJECT $FORMATTED_OBJECT | git apply -

The patch actually needs to be massaged a bit to fix up working tree paths before it can be given to git apply. So the command above does not work literally. But it gives an idea of what git-format-staged does.
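
One plausible form of that massaging is sketched below in Python. It rests on an assumption worth checking against your Git version: when git diff compares two bare objects, the patch headers use the full object names where file paths would normally appear. The helper name retarget_patch is hypothetical.

```python
def retarget_patch(patch, old_object, new_object, path):
    """Replace object names used as file names in a patch with `path`.

    `patch` is the raw bytes produced by diffing two objects. Only the
    full-length names are replaced, so an abbreviated "index" line is
    left untouched.
    """
    target = path.encode()
    return (
        patch
        .replace(old_object.encode(), target)
        .replace(new_object.encode(), target)
    )
```

After this rewrite the headers refer to a/new_file.js and b/new_file.js, which is what git apply needs in order to find the working tree file.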

In most cases merging formatting changes with unstaged changes works transparently from the user's perspective. Unstaged portions of the file did not get run through the formatter, and they end up as unformatted islands in an otherwise-formatted file. Sometimes there is a conflict applying the patch. In that case git-format-staged aborts the merge, leaving the working tree file entirely unformatted. This is the least-lossy outcome possible: changes that are committed are properly formatted; unstaged changes are preserved; and formatting changes to the working tree file can be recomputed by running the formatter again.
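
The fallback behavior can be sketched as follows. This Python example leans on a documented property of git apply: by default it is all-or-nothing, so a patch that does not apply cleanly leaves the working tree untouched. The function name try_apply and its run parameter (there so the subprocess call can be swapped out in tests) are hypothetical.

```python
import subprocess


def try_apply(patch, run=subprocess.run):
    """Apply `patch` (bytes) to the working tree.

    Returns True on success. Returns False if the patch conflicts with
    unstaged changes, in which case the working tree file is left as-is.
    """
    result = run(["git", "apply", "-"], input=patch, capture_output=True)
    return result.returncode == 0
```

A caller would treat a False result as "formatting was committed, but the working tree copy stays unformatted for now", which matches the least-lossy outcome described above.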
