Time for my annual “oh shit, I forgot to bump the copyright year again” round-up!
In the F/OSS community, there are two different philosophies when it comes to applying copyright statements to a project. If the code base consists exclusively (or almost exclusively) of code developed for that specific project by the project’s author or co-authors, many projects will have a single file (usually named LICENSE
) containing the license, a list of copyright holders, and the copyright dates or ranges. However, if the code base incorporates a significant body of code taken from other projects or contributed by parties outside the project, it is customary to include the copyright statements and either the complete license or a reference to it in each individual file. In my experience, projects that use the BSD, ISC, MIT, adjacent licenses tend to use the latter model regardless.
The advantage of the second model is that it’s hard to get wrong. You might forget to add a name to a central list, but you’re far less likely to forget to state the name of the author when you add a new file. The disadvantage is that it’s really, really easy to forget to update the copyright year when you commit a change to an existing file that hasn’t been touched in a while.
So, how can we automate this?
One possibility is to have a pre-commit hook that updates it for you (generally a bad idea), or one that rejects the commit if it thinks you forgot (better, but not perfect; what if you’re adding a file from an outside source?), or one that prints a big fat warning if it thinks you forgot (much better, especially with Git since you can commit --amend
once you’ve fixed it, before pushing).
But how do you fix the mistake retroactively, without poring over commit logs to figure out what was modified when?
Let’s start by assuming that you have a list of files that were modified in 2017, and that each file only has one copyright statement that needs to be updated to reflect that fact. The following Perl one-liner should do the trick:
perl -p -i -e 'if (/Copyright/) { s/ ([0-9]{4})-20(?:0[0-9]|1[0-6]) / $1-2017 /; s/ (20(?:0[0-9]|1[0-6])) / $1-2017 /; }'
It should be fairly self-explanatory if you know regular expressions. The first substitution handles the case where the existing statement contains a range, in which case we extend it to include 2017, and the second substitution handles the case where the existing statement contains a single year, which we replace with a range starting with the original year and ending with 2017. The complexity stems mostly from having to take care not to replace 2018 (or later) with 2017; our regexes only match years in the range 2000-2016.
OK, so now we know how to fix the files, but how do we figure out which ones need fixing?
With Git, we could try something like this:
git diff --name-only 'HEAD@{2017-01-01}..HEAD@{2018-01-01}'
This is… imperfect, though. The first problem is that it will list every file that was touched, including files that were added, moved, renamed, or deleted. Files that were added should be assumed to have had a correct copyright statement at the time they were added; files that were only moved or renamed should not be updated, since their contents did not change; and files that were deleted are no longer there to be updated.¹ We should restrict our search to files that were actually modified:
git diff --name-only --diff-filter M 'HEAD@{2017-01-01}..HEAD@{2018-01-01}'
Some of those changes might be too trivial to copyright, though. This is a fairly complex legal matter, but to simplify, if the change was inevitable and there was no room for creative expression — for instance, a function in a third-party library you are using was renamed, so both the reason for the change and the nature of the change are external to the work itself — then it is not protected. So perhaps you should remove --name-only
and review the diff, which is when you realize that half those files were only modified to update their copyright statements because you forgot to do so in 2016. Let’s try to exclude them mechanically, rather than manually. Unfortunately, git diff
does not have anything that resembles diff -I
, so we have to write our own diff command which does that, and ask git to use it:
$ echo 'exec diff -u -ICopyright "$@"' >diff-no-copyright
$ chmod a+rx diff-no-copyright
$ git difftool --diff-filter M --extcmd $PWD/diff-no-copyright 'HEAD@{2017-01-01}..HEAD@{2018-01-01}'
This gives us a diff, though, not a list of files. We can try to extract the names as follows:
$ git difftool --diff-filter M --no-prompt --extcmd $PWD/diff-no-copyright 'HEAD@{2017-01-01}..HEAD@{2018-01-01}' | awk '/^---/ { print $2 }'
Wait… no, that’s just garbage. The thing is, git difftool
works by checking out both versions of a file and diffing them, so what we get is a list of the names of the temporary files it created. We have to be a little more creative:
$ echo '/usr/bin/diff -q -ICopyright "$@" >/dev/null || echo "$BASE"' >list-no-copyright
$ chmod a+rx list-no-copyright
$ git difftool --diff-filter M --no-prompt --extcmd $PWD/list-no-copyright 'HEAD@{2017-01-01}..HEAD@{2018-01-01}'
Much better. We can glue this together with our Perl one-liner using xargs
, then repeat the process for 2018.
Finally, how about Subversion? On the one hand, Subversion is far simpler than Git, so we can get 90% of the way much more easily. On the other hand, Subversion is far less flexible than Git, so we can’t go the last 10% of the way. Here’s the best I could do:
$ echo 'exec diff -u -ICopyright "$@"' >diff-no-copyright
$ chmod a+rx diff-no-copyright
$ svn diff --ignore-properties --diff-cmd $PWD/diff-no-copyright -r'{2017-01-01}:{2018-01-01}' | awk '/^---/ { print $2 }'
This will not work properly if you have files with names that contain whitespace; you’ll have to use sed
with a much more complicated regex, which I leave as an exercise.
¹ I will leave the issue of move-and-modify being incorrectly recorded as delete-and-add to the reader. One possibility is to include added files in the list by using --diff-filter AM
, and review them manually before committing.