Test Precisely and Concretely

Assertions should be necessary, sufficient, and comprehensible

A previous version of this article appeared in 97 Things Every Programmer Should Know

It is important to test for the desired, essential behaviour of a unit of code, rather than for the incidental behaviour of its particular implementation. But this should not be taken or mistaken as an excuse for vague tests. Tests need to be both accurate and precise.

Something of a tried, tested, and testing classic, sorting routines offer an illustrative example — they are to computer science as fruit flies are to genetics. Implementing sorting algorithms is far from an everyday task for a programmer, commodified as they are in most language libraries, but sorting is such a familiar idea that most people believe they know what to expect from it. This casual familiarity, however, can make it harder to see past certain assumptions.

When programmers are asked, “What would you test for?”, by far and away the most common response is something like, “The result of sorting a sequence elements is a sorted sequence of elements.” As definitions go, it’s perhaps a little circular, but it’s not false.

So, given the following C function:

void sort(int values[], size_t length);

And some values to be sorted:

int values[length];

The expected result of sorting:

sort(values, length);

Would pass the following:

assert(is_sorted(values, length));

For some appropriate definition of is_sorted.

While this is true, it is not the whole truth. First of all, what do we mean when we say the result is “a sorted sequence of elements”? Sorted in what way? Most commonly, a sorted result goes from the lowest to the highest value, but that is an assumption worth stating explicitly. Assumptions are the hidden rocks many programs run aground on — if anything, one goal of testing and other development practices is to uncover assumptions rather than gloss over them.

So, are we saying they are sorted in ascending order? Not quite. What about duplicate values? We expect duplicate values to sort together rather than be discarded or placed elsewhere in the resulting sequence. Stated more precisely, “The result of sorting a sequence of elements is a sequence of elements sorted in non-descending order.” Non-descending and ascending are not equivalent.

When prompted for an even more precise condition, many programmers add that the resulting sequence should be the same length as the original. Although correct, whether or not this deserves to be tested depends largely on the programming language.

In C, for example, the length of an array cannot be changed. By definition, the array length after the call to sort will be the same as it was before the call. In contrast to the previous point about stating and asserting assumptions explicitly, this is not something you should or could write an assertion for. If you’re not sure about this, consider what you might write:

const size_t expected = length;
sort(values, length);
assert(length == expected);

The only thing being tested here is that the C compiler is a working C compiler. Neither length nor expected will — or can — change in this fragment of code, so a good compiler could simply optimise this to:

sort(values, length);
assert(true);

If the goal is to test sort, this truism not particularly helpful. It is one thing to test precisely by making assumptions explicit; it is another to pursue false precision by restating defined properties of the platform.

The equivalent sort in Java would be

... void sort(int[] values) ...

And the corresponding tautologous test would be

final int expected = values.length;
sort(values);
assert values.length == expected;

Sneaking into such tests I sometimes also see assertions along the lines of

assert values != null;

If the criteria you are testing can’t be falsified by you, those tests have little value — hat tip to Karl Popper:

In so far as a scientific statement speaks about reality, it must be falsifiable: and in so far as it is not falsifiable, it does not speak about reality.

This is not to say you will never encounter compiler, library, VM, or other platform bugs, but unless you are the implementer of the compiler, library, VM, or other platform, these are outside your remit and the reality of what you are testing.

For other languages or data structure choices, that the resulting length is unchanged is a property to be asserted rather than a property that is given. For example, if we chose to use a List rather than an array in Java, its length is one of the properties that could change and would, therefore, be something to assert had remained unchanged:

final int expected = values.size();
sort(values);
assert values.size() == expected;

Similarly, where sorting is implemented as a pure function, so that it returns a sorted sequence as its result, leaving the original passed sequence untouched, stating that the result has the same length as the input makes the test more complete. This is the case with Python’s own built-in sorted function and in functional languages. If we follow the same convention for our own sort, in Python, it would look like

result = sort(values)
assert len(result) == len(values)

And, unless we are already offered guarantees on the immutability of the argument, it makes sense to assert the original values are unchanged by taking a copy for posterity and later comparison:

original = values[:]
result = sort(values)
assert values == original
assert len(result) == len(values)

We’ve navigated the clarity of what we mean by sorted and questions of convention and immutability… but it’s not enough.

Given the following test code:

original = values[:]
result = sort(values)
assert values == original
assert len(result) == len(values)
assert is_sorted(result)

The following implementation satisfies the postcondition of not changing its parameter and of returning a result sorted in non-descending order with same length as the original sequence:

def sort(values):
return list(range(len(values)))

As does the following:

def sort(values):
return [0] * len(values)

And the following:

def sort(values):
return [] if len(values) == 0 else [values[0]] * len(values)

Given the following sequence:

values = [3, 1, 4, 1, 5, 9]

The first example simply returns an appropriately sized list of numbers counting up from zero:

[0, 1, 2, 3, 4, 5]

The second example makes even less of an effort:

[0, 0, 0, 0, 0, 0]

The third example at least shows willing to use something more than just the length of the given argument:

[3, 3, 3, 3, 3, 3]

This last example was inspired by an error taken from production C code (fortunately caught before it was released). Rather than the contrived implementation shown here, a simple slip of a keystroke or a momentary lapse of reason led to an elaborate mechanism for populating the whole result with the first element of the given array — an i that should have been a j converted an optimal sorting algorithm into a clunky fill routine.

All these implementations satisfy the spec that the result is sorted and the same length as the original, but what they let pass is also most certainly not what was intended! Although these conditions are necessary, they are not sufficient. The result is an underfitting test that only weakly models the requirement and is too permissive in letting flawed implementations through.

The full postcondition is that “The result of sorting a sequence of elements is a sequence of the original elements sorted in non-descending order.” Once the constraint that the result must be a permutation of the original values is added, that the result length is the same as the input length comes out in the wash and doesn’t need restating regardless of language or call convention.

Are we done? Not yet.

Even stating the postcondition in the way described is not enough to give you a good test. A good test should be comprehensible and simple enough that you can readily see that it is correct (or not).

If you already have code lying around that has the same functionality as the functionality you want to test, you can use it as a test oracle. Under the same conditions, the new code should produce the same results as the old code. There are many reasons you may find yourself in this situation: the old code represents a dependency you are trying to decouple from; the new code has better performance than the old code (faster, smaller, etc.); the new code has a more appropriate API than the old code (less error-prone, more type safe, more idiomatic, etc.); you are trying something new (programming language, tools, technique, etc.) and it makes sense to use a familiar example as your testing ground.

For example, the following Python checks our sort against the built-in sorted function:

values = ...
original = values[:]
result = sort(values)
assert values == original
assert result == sorted(values)

Sometimes, however, the scaffolding we need to introduce makes the resulting test code more opaque and less accessible. For example, to test our C version of sort against C’s standard qsort function:

int actual[length];
...
int expected[length];
for (size_t at = 0; at != length; ++at)
expected[at] = actual[at];
qsort(expected, length, sizeof(int), compare_ints);
sort(actual, length);
for (size_t at = 0; at != length; ++at)
assert(actual[at] == expected[at]);

Even making the narrative structure of the test case more explicit, the code still looks to be more about bookkeeping local variables than about testing sort:

// Given
int actual[length] = {...};
...
int expected[length];
for (size_t at = 0; at != length; ++at)
expected[at] = actual[at];
qsort(expected, length, sizeof(int), compare_ints);
// When
sort(actual, length);
// Then
for (size_t at = 0; at != length; ++at)
assert(actual[at] == expected[at]);

And compare_ints is ours to define, so will lie outside the test case, making the test code even harder to assimilate on reading:

int compare_ints(const void * lhs_entry, const void * rhs_entry)
{
int lhs = *(const int *) lhs_entry;
int rhs = *(const int *) rhs_entry;
return lhs < rhs ? -1 : lhs > rhs ? 1 : 0;
}

Note that this dislocation of functionality is not the same as making test code more readable by extracting clunky bookkeeping code into intentionally named functions:

// Given
int actual[length];
...
int expected[length];
presort_expected_values(expected, actual, length);
// When
sort(actual, length);
// Then
assert_equal_arrays(actual, expected, length);

Refactoring to reduce bulk and raise intention is certainly a practice that should be considered (much, much more often than it is) when writing test code.

One limitation of testing against an existing implementation is that it might not always be obvious what the expected result is. It is one thing to say, “The new implementation should produce the same results as the old implementation”, but quite another to make clear exactly what those results are. In the case of sorting, this is not much of an issue. In the case of something from a more negotiated domain, such as insurance quotes or delivery scheduling, it might not be clear what “the same results as the old implementation” entails. The business rules, although replicated in the new implementation, may be no clearer with tests than they were without. You may have regression, but you do not necessarily have understanding.

The temptation is to make these rules explicit in the body of the test by formulating the postcondition of the called code and asserting its truth. As we’ve already seen in the case of something as seemingly trivial as sorting, arriving at a sound postcondition can be far from trivial. But now that we’ve figured it out, we could in principle use it in our test:

values = ...
original = values[:]
result = sort(values)
assert values == original, "Original list unchanged"
assert is_sorted(result), "Non-descending values"
assert is_permutation(result, original), "Original values preserved"

Where extracting is_sorted and is_permutation make the postcondition clear and the test more readable, but are left as an exercise for the reader to implement. And herein lies the problem: the auxiliary test code — to check that a sequence is sorted and that one sequence contains a permutation of values in another — may quite possibly be more complex than the code under test. Complexity is a breeding ground for bugs.

One response to the evergreen question, “How do we know that our tests are correct?”, it to make the test code significantly simpler. Tony Hoare pointed out that

There are two ways of constructing a software design: one way is to make it so simple that there are obviously no deficiencies and the other is to make it so complicated that there are no obvious deficiencies.

If the tests are significantly simpler than the code being tested, they are also more likely to be correct. And when they are incorrect, the errors are easier to spot and fix.

The solution to the problem has been staring us in the face. Alfred North Whitehead observed that

We think in generalities, but we live in detail.

Using concrete examples eliminates this accidental complexity and opportunity for accident. For example, given the following input:

[3, 1, 4, 1, 5, 9]

The result of sorting is the following:

[1, 1, 3, 4, 5, 9]

No other answer will do. And there is no need to write any auxiliary code. Extracting the constancy check as a separate test case, the test reduces to the pleasingly simple and direct

assert sort([3, 1, 4, 1, 5, 9]) == [1, 1, 3, 4, 5, 9]

We are, of course, not restricted to only a single example. For each given input there is a single output, and we are free to source many inputs. This helps highlight how the balance of effort has shifted. From being distracted into a time sink by the mechanics and completeness of the auxiliary code, we can now spend time writing a variety of tests to demonstrate different properties of the code under test, such as sorting empty lists, lists of single items, lists of identical values, large lists, etc.

By being more precise and more concrete, the resulting tests will both cover more and communicate more. An understanding of postconditions can guide us in how we select our tests, or our tests can illustrate and teach us about the postconditions, but the approach no longer demands logical completeness and infallibility on our part.

Precision matters. A test is not simply an act of confirmation; it is an act of communication. There are many assertions made in tests that, although not wrong, reflect only a vague description of what we can say about the code under test.

For example, the result of adding an item to an empty repository object is not simply that it is not empty: it is that the repository now has a single entry, and that the single item held is the item added. Two or more items would also qualify as not empty, but would be wrong. A single item of a different value would also be wrong. Another example would be that the result of adding a row to a table is not simply that the table is one row bigger: it’s also that the row’s key can be used to recover the row added. And so on.

Of course, not all domains have such neat mappings from input to output, but where the results are determined, the tests should be just as determined.

In being precise, however, it is easy to get lost in a level of formality that makes tests hard to work with, no matter how precise and correct they are. Concrete examples help to illustrate general cases in an accessible and unambiguous way. We can draw representative examples from the domain, bringing the test code closer to the reader, rather than forcing the reader into the test code.

consultant · father · husband · itinerant · programmer · speaker · trainer · writer