Taking Months to Bits

A calendrical journey into low-level programming

(TL;DR: for the impatient, see here and here. For everyone else, read on…)

A calendar system is an attempt to make sense of the passing of time with respect to astronomical phenomena — the positioning of the Sun, the Moon and the stars with respect to our unfixed and moveable Earth. It is perhaps ironic, then, that the widely used Gregorian calendar system is a cobbled-together mishmash that better reflects the arbitrariness of history than it does either celestial order or logical consistency.

The division of the year into months based in some way on the cycle of the Moon (month and moon have a shared etymology) is common across many cultures. The Moon’s sidereal period (its position with respect to stars) is just over 27 days. Its synodic period is a half-day over 29 days (its position with respect to the Sun as seen from Earth). Four weeks — 28 days — would not be an unreasonable compromise. If we were to design a calendar from scratch that tried to reconcile the otherwise misaligned concepts of day, week, month and year in as consistent a form as possible, the International Fixed Calendar would be a good candidate — 13 months of 28 days each, plus an intercalary day (or two, for a leap year).

But this isn’t what we have. Months do not have a fixed length.

A system of alternating long months (31 days) and short months (30 days) would work quite well, covering 366 days nicely. In such a system, leap years would be regular, but common years would need to clip a day off one of the months — shortening a long month from 31 to 30 rather than a short month from 30 to 29 would keep things as consistent as possible in the face of inconsistency. Such a calendar would have twelve months every year and all months would have either 30 or 31 days, no exceptions. In a leap year there would be six 30-day months and six 31-day months; in a common year there would be seven 30-day months and five 31-day months.

But this isn’t what we have. As with other attempts at organising time, our calendar is “an unhappy marriage between astronomy and politics.

The calendar has seven long months and five short months, and one of the short months — February — accumulates all of the exceptions — it is indeed 28 days by default, but is shorter than the other short months and is the dumping ground for the intercalary day injected every leap year. The other months kind of alternate between long and short, but only kind of. To make sense of this, people have resorted to everything from knuckles to rhymes to remember the pattern.

Thirty days hath September,
April, June, and November;
All the rest have thirty-one,
Excepting February alone.
To which we twenty-eight assign,
Till leap year gives it twenty-nine.

If you write out the months in sequence, emphasising the long ones, you can see the pattern and where it breaks:

We can see that August is the troublemaker — or, to be precise, Emperor Augustus, as one story has it:

But this is one story. Other sources suggest that the month of Sextilis already had 31 days and the causality was the other way around, i.e., rather than renaming Sextilis to August and extending it to 31 days, a month of 31 days was chosen for renaming, and Sextilis was available. That it immediately follows the month formerly known as Quintilis (renamed July to honour of Julius Caesar) is convenient and coincidental.

It is also worth remembering that having two adjacent months with 31 days was not unusual. Our perception that months mostly alternate between long and short is largely an artefact of when we start and end the year. The Roman year did not begin in January:

Hence, why what we consider to be months 7 (July) and 8 (August) were previously named for being months 5 (Quintilis) and 6 (Sextilis) — and yes, months 7, 8, 9 and 10 gave us September, October, November and December.

Either way, it was Johannes’ tweet that led to me to revisit a coding problem from many years ago and, ultimately, down the rabbit hole to this article.

Sets and LUTs and code’n’flow

Because month length doesn’t alternate in a way that is consistent, to determine whether a month is long or not, we can recast it as a problem of set membership: for a given month m, it is long iff m ∈ {January, March, May, July, August, October, December}. If we count months from 1, this translates directly into Python as

Letting data do the talking — sets, lookup tables, etc. — is an underdeveloped skill and a commonly missed opportunity. You are as likely — if not more likely — to encounter something like the following:

Some programmers might approach this as a control-flow problem, giving rise to the all-too-common code smell of a Boolean-to-Boolean converter:

Burying the logic in control flow in an effort to make the code more explicit unfortunately has the opposite effect, making it unwieldy with tautology and verbosity. Smothering the logic even further is also a popular choice:

As is introducing needless asymmetry into the control flow:

Every if comes free with an else, whether you use it or not. I have even heard programmers read such code aloud as “… then return true, else return false” or “… then return true, otherwise return false.” If you mean what you say, that is an opportunity to write what you mean.

Breaking the existing symmetry without good reason is more of a habit than a consequence of reason and, unfortunately, leads to code that becomes harder to reason about, compose or refactor. There are lessons to be learned and habits to be acquired from both functional and structured programming that are broadly applicable.

We can combine the styles of the previous two examples into code that is about as imperative as we can get without inflicting unnecessary violence on the flow of control:

It is not necessarily that many programmers are unfamiliar with data-driven approaches or are uncomfortable with Boolean logic or are unaware of block-structured programming practices… but sometimes it seems that many programmers are unfamiliar with data-driven approaches or are uncomfortable with Boolean logic or are unaware of block-structured programming practices.

Habit — and therefore what is considered ‘normal’ or ‘simple’ by an individual — is largely a by-product of familiarity and exposure. A lot of what we learn comes from copying — consciously or otherwise — what we see others doing. If you change the ideas and styles people are exposed to, what is seen as ‘normal’ or ‘simple’ will also change. A lot of legacy code and example code is relentlessly procedural, regardless of its advertised paradigm, and so the lens through which many programmers view problems and frame solutions is correspondingly control-flow-oriented and imperative. This is habit forming.

Staring at the C

Let’s switch languages and see what we would do in C to write is_long_month conveniently and without resorting to redirecting the flow of control. C lacks set-like constructs, such as Python’s in, but has decent support for array-based lookup tables:

Or, if you want to go old school:

Reflecting on my own experience, it is probably learning C that properly introduced me to the value and power of table-based approaches, an approach that was further reinforced in my habits by languages with out-of-the-box support for other key-based data structures (whether termed associate arrays, dictionaries or maps).

Now, while we’re here, a quick word on naming style: although I am keeping function names in long form, other identifiers are shorter than I would normally write (e.g., m instead of month and month for long_months). A blog post doesn’t have much horizontal real-estate, so code wraps at the drop of a meaningful identifier. And things are about to get interesting…

Standing on the bits

C has such an intimate relationship between integers and Booleans that they were considered equivalent for most of C’s history. Even with the introduction of _Bool in C99 (available as bool if you include <stdbool.h>), the traffic between the two remains free and easy.

If you ever see a small array of 1s and 0s, there is a good chance you can replace it with an integer treated as a bit set. (But just because you can, doesn’t mean you should.) In this case, the twelve truth values fit comfortably within any 16-bit-or-better integer:

There are two things to note about the binary literal:

  1. The digits are in reverse order from the previous array, i.e., January is the rightmost 1 rather than the leftmost.
  2. Binary literals are standard in C++, Java, Python, Ruby and many other languages, but not in C, where they are supported only as a compiler extension.

Given that C is a systems programming language, the second point might come as a surprise… but perhaps not as surprising as the rationale offered by the ANSI C committee for the C89 standard (unchanged for ISO C99):

A proposal to add binary constants was rejected due to lack of precedent and insufficient utility.

The idea that binary literals would be of ‘insufficient utility’ in a language that hugs the metal for a living is bemusing, astonishing and a timely reminder of Barnett Cocks’ observation that

A committee is a cul-de-sac down which ideas are lured and then quietly strangled.

The statement of ‘insufficient utility’ has certainly never been true and ‘lack of precedent’ ceased to be true a long time ago. The boat was again missed with C11 and C17. As it currently stands, binary literals are likely to appear in the C23 standard.

If we look carefully again at the pattern of months, we can see that if we add a bias of 1 to any month following July, we can determine whether the month is long or short using modulo arithmetic alone:

No need for lookups, control flow or bit twiddling. This code exploits an artefact of the programming language and a coincidence in the problem domain to achieve its brevity — it is left as an exercise for the reader to determine whether or not this is a good thing.

Knowing whether a month is short or long is perhaps of less interest than knowing how many days it has. If month lengths were strictly either 30 or 31 days, this would be an easy problem to solve: the number of days in month m would always be 30 + is_long_month(m).

But this isn’t what we have. Months can have one of four different lengths, with February accounting for half of that variation; for the other 11 months, 30 + is_long_month(m) works just fine.

We could throw it all in and go back to control flow:

Where we assume the leap year is for a proleptic Gregorian calendar with a year 0:

Alternatively, we could focus more on function than on control flow by making good on the observation that for every month except February the problem is already solved:

Another approach is to switch back to using lookup tables, which was the path taken in both editions of The C Programming Language.

We can adopt this approach in our code:

The result is determined declaratively by a purely tabular approach, with no other arithmetic or special-case handling involved. A halfway house between the conditional and the tabular approaches could look something like the following:

It’s an interesting coincidence that the maximum month length is 31, which is 2⁵-1, because it means we only need 5 bits to encode each month’s length. For twelve months, we therefore need only 60 bits. That means our table can be shoehorned into a 64-bit integer with bits to spare.

There are a few things to note here, the first of which is don’t do this at work.

The second is that I have used long long, which is guaranteed to be at least 64-bits wide, rather than long, which is not. I could have made this more explicit by using the int64_t type name from <stdint.h>, but as I then use long long literals explicitly in the encoding there’d be little abstraction benefit.

Third, those long long literals: I use LL and not ll. If you have a habit of using lower-case suffixes for numeric literal types in C, that may be a habit you want to reconsider. Here’s that encoding again using lower case:

In most fixed-width typefaces, the visual similarity between 1 and l is a liability.

And, speaking of literals, the fourth thing to note is the use of the binary literal 0b11111 to make it clear that I am masking the lower five bits of my shifted lookup to recover the month’s day count. (If you believe being able to show bit values directly in binary form is of ‘insufficient utility’ then feel free to replace 0b11111 with 31.)

Let’s turn our attention back to the special handling for February:

We could take advantage of the close relationship between Booleans and integers to generate the 0 case logically:

Or we could do it arithmetically:

What’s interesting about this fragment is that, except for the code inside is_leap_year, this code is branchless, i.e., there are no control-flow branches either from an explicit statement of control flow, such as if and switch, or from a short-circuiting operator, such as && and ?:.

We can also eliminate the control flow when determining the leap year:

We can, therefore, determine the number of days in a month without either performing a lookup or taking any decisions:

For maximum effect, we can inline all the code:

My work here is done.

Equal rights

Or so I thought.

Years ago, I tried to figure out if it was possible to calculate the number of days in the month without branching or using lookups. (I was on a train. I was bored.) The code I wrote is lost to history and to my memory, but I am guessing it ended up something like the code above.

But doing it in a language that has strong prohibitions on a relationship between Booleans and integers? C’s permissive and laissez-faire view of such a relationship is exactly the reason I chose C all those years ago and again this time around: it’s already tricky enough without also trying to convert Boolean operations to integer operations.

Things are different now. (I’m at home in a pandemic. I’m very bored.) And, when Ben asked, I thought it was an interesting follow-on challenge. We can find our way to a solution by dividing the problem up into its constituent parts.

The low hanging fruit is determining whether a month is long or short. In the C code the post-July bias is added to the month using the Boolean result of m > 7 as an integer. Another way this can be written is m ≥ 8. Conveniently, 8 is 2³, so all values greater than or equal to 8 will have the 2³-value bit set.

Now that the style has shifted from arithmetic to bitwise, it would not be unreasonable to replace the remainder operator with a bitwise and for consistency, i.e., (…) % 2 becomes (…) & 1, but I’ll leave it as is.

The rest of the fruit is much higher up, but can be reached by the ladder of realising all we need is a way of comparing two integers for equality that results in a 1 or a 0 rather than a true or a false. To quote from Structure and Interpretation of Computer Programs:

We are using here a powerful strategy of synthesis: wishful thinking.

If we imagine an eq operation that does this for us, we can detect a leap year as follows:

And then define daysInMonth as

The introduction of the yet-to-be-defined eq structures our solution for us.

Now, to make our wish come true, we need to reify eq — we’ve cornered the problem, but hopefully we’ve not written ourselves into a corner. How to compare two ints to get an int rather than a boolean?

Here’s what we know:

  • If we subtract one integer from another we get zero if they are equal and non-zero if they are unequal.
  • In the equal case, when the result is 0, adding 1 gets us the Boolean-to-integer conversion we want. This does not, however, work in the general case when they are unequal.
  • Integers use a two’s complement representation. For a 32-bit int, the positive integers from 0 to 2³¹-1 are represented in binary as we’d expect. Negative integers, however, are represented by complementing the bits and adding 1. So for an integer n, -n is ~n+1. This means the leading bit is a sign bit that is 0 for zero and positive integers and 1 for negative integers.
  • For two integers a and b, a-b is equal to -(b-a). When a and b are unequal, this means that either a-b is negative and b-a is positive or a-b is positive and b-a is negative. Either way, there is a subtraction with a negative result if a and b are unequal.
  • If you combine two numbers of opposite sign using a bitwise or then the sign bit will always be 1. This means that (a-b) | (b-a) has a sign bit of 0 when a and b are equal and 1 when they are unequal.
  • If you right shift the integer 0 by any number of bits it remains 0.
  • If you perform an arithmetic right shift of a negative number it remains negative, i.e., it fills the leftmost bits with 1. Therefore, if you right shift a 32-bit negative integer by 31 bits you will get all bits set 1, which is the integer value -1, to which adding 1 gets us 0 and, therefore, the Boolean-to-integer conversion we want.

We can therefore define our eq function as

Again, for maximum NSFW effect, we can inline all the code:

And now, my work here is done.

End of days

There are a number of things you should and should not take away from this article. The first is that however much you know about dates and times, it will never be enough.

The second is that exploring the lower-level side of programming for its own sake is challenging, fun and educational.

Low-level programming is good for the programmer’s soul.

— John Carmack

The third is that for most of the work that most developers undertake, most of the code and many of the techniques I have shown here are NSFW and, where they are SFW, the context I have shown them in is not really suitable.

I would normally say, don’t do any of this in production code without tests to back you up. My advice here is simpler: don’t bother with the tests… because the odds are that you shouldn’t be doing this in production code at all.

Use existing libraries or table lookups for this kind of problem, or just leave the clever stuff to optimisers. The state of the art for optimisation is such that you are better off not second-guessing compilers and JIT runtimes. Write production code that is first and foremost code that you and others want to come back to with love rather than with a pickaxe.

The cleverness of the code here is not for your codebase, it’s for you — it’s about learning and play, not production. It exercises the brain, broadens your problem-solving skills and makes you aware of the kinds of thing that are being done on your behalf by compilers, libraries, VMs, etc. And did I mention it was fun?

consultant · father · husband · itinerant · programmer · speaker · trainer · writer