Negative empiricism; testing

If empiricism is the belief in knowledge gained through experience and observation, then what is negative empiricism? Simply, it is a skewed form of empiricism that rejects knowledge acquisition by confirmation. It is an attempt to resolve the issue of inductive knowledge, without rejecting the possibility of any knowledge.

This is quite an unintuitive concept, because our default mode of operation is to look for confirmation of our beliefs. Most people by default read newspapers that align with their political views; most of us have an idea for a holiday and then go looking for what would make that holiday great. It is simply unnatural for us to assume, as a starting point, that our opinions are wrong. This habit of seeking confirmation runs counter to the scientific method so widely espoused in our society today, which asks us to test our beliefs by trying to falsify them.

Realising that you likely do this a lot yourself can be quite discomforting; it can shake your notions of how well you understand the world around you. That is a broader implication of induction than I want to discuss here, though. I want only to discuss how we can use negative empiricism to cope with uncertainty.

Negative empiricism answers the induction problem by giving us a narrow range where we can have extremely high confidence in our knowledge, and leaving the rest as low-confidence inductive knowledge. The narrow range consists of all knowledge gained by deduction, that is, knowledge gained by attempting falsification. If we think all swans are white and then see a single black swan, we now know that not all swans are white. We can treat anything in this category with high confidence; we cannot be certain, but we can have faith. Everything else, everything in the induction category that was "proved" by confirmation, must be viewed with very low confidence and high scepticism. This presents us with an asymmetric distribution of knowledge, where a small set accounts for almost all of our high-quality knowledge. All this has quite serious implications for how testing should be approached, because it implies that a lot of tests are less useful than they may seem to us through inductive eyes.

Misleading tests

Suppose I want to test the Reverse LINQ method, to ensure it always returns the reverse of a given list. I might write a test like below:

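// Assumes xUnit for [Fact] and Assert. The list is typed as IEnumerable<int> so
// that Reverse() binds to LINQ's Enumerable.Reverse rather than the in-place,
// void-returning List<int>.Reverse().
[Fact]
public void Test()
{
    IEnumerable<int> list = new List<int> { 1, 2, 3, 4, 5, 6, 7, 8, 9, 0 };

    // Without an assertion the comparison result would simply be discarded.
    Assert.True(list.Reverse().Reverse().SequenceEqual(list));
}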

The test, unsurprisingly, passes. What does this tell me? It does not tell me the Reverse method functions correctly all the time. It tells me that the Reverse method works as expected for the given list only. It does not tell me the Reverse method is devoid of bugs and works for any given sequence; we merely assume this is the case, inducing it from the limited test above.¹
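To make the point concrete, here is a minimal sketch of how a passing round-trip check can hide a defect. BuggyReverse and RoundTrips are hypothetical names invented for this illustration; they are not part of LINQ, and the defect is deliberately planted.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ReverseSketch
{
    // Hypothetical, deliberately defective reverse: correct for even-length
    // input, but it leaves out the middle element when the length is odd.
    public static List<int> BuggyReverse(IReadOnlyList<int> source)
    {
        var result = new List<int>();
        for (int i = source.Count - 1; i >= 0; i--)
        {
            bool isMiddleOfOddList = source.Count % 2 == 1 && i == source.Count / 2;
            if (!isMiddleOfOddList)
            {
                result.Add(source[i]);
            }
        }
        return result;
    }

    // The same confirmation check as the test above: reversing twice
    // should give back the original sequence.
    public static bool RoundTrips(IReadOnlyList<int> input) =>
        BuggyReverse(BuggyReverse(input)).SequenceEqual(input);
}
```

For the ten-element list used in the test, RoundTrips returns true, so the confirmation test would pass. For a list such as { 1, 2, 3 } it returns false. The first result tells us very little; the second tells us, with high confidence, that the method is broken.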

Now, the above example is perhaps obviously going to work every time, because the functionality to reverse a collection is easily made universal, and one may place high trust in the LINQ methods. However, it is easy to imagine business code for which a test is written that covers only a subset of the possible inputs, as above, and yet is treated as though it proves the functionality works for all. I know I have written plenty of such tests. I think we have all known insufficient happy-path testing come back to bite us.

This isn’t a newly discovered limitation of positive unit testing; boundary value testing has been used to test at the edges of wide input variations for decades. For example, tests to list an item for sale at -$1.00, $0.00, $1.00. These values are important because they sit at the boundaries of validity; $1 is as valid as $200, and -$1 is as invalid as -$5000, but the assumption still remains that what does not work for -$1.00 will also not work for -$300, and -$10,000. The inference is just that, inference. It’s certainly useful information, but it can’t be treated as exhaustive proof.
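A sketch of what such boundary checks might look like is below. ListingRules.IsValidPrice is a hypothetical rule invented for the example, and treating $0.00 as invalid is my assumption; a real listing service may well decide otherwise.

```csharp
using System;
using System.Collections.Generic;

public static class ListingRules
{
    // Hypothetical rule, assumed for illustration: a listing price must be
    // strictly greater than zero, so $0.00 is treated as invalid.
    public static bool IsValidPrice(decimal price) => price > 0m;
}

public static class BoundaryChecks
{
    // Checks one value below, one on, and one above the boundary of validity.
    // Returns a description of every case that disagreed with its expectation.
    public static List<string> Run()
    {
        var cases = new (decimal Price, bool Expected)[]
        {
            (-1.00m, false),
            (0.00m, false),
            (1.00m, true),
        };

        var failures = new List<string>();
        foreach (var (price, expected) in cases)
        {
            if (ListingRules.IsValidPrice(price) != expected)
            {
                failures.Add($"{price}: expected {expected}");
            }
        }
        return failures;
    }
}
```

An empty result from Run confirms the rule for those three prices and nothing more. That -$5000 is rejected and $200 is accepted remains an inference from the shape of the rule, not something these checks have shown.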

More recently, libraries like QuickCheck (and FsCheck) have tried to resolve the issue by leveraging the processor to generate large numbers of randomised test inputs. The limitation, though, is that for more complex tests than the above, the developer is required to build models for the test inputs, and what isn’t modelled, or is insufficiently modelled, won’t be caught. Sufficiently complex business cases require equally complex models as test inputs, increasing the likelihood of the model mismatching reality and hiding faults. There is an inescapable limit to confirmation testing in this way; you can never be sure you’ve covered everything.
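The idea can be sketched with a hand-rolled loop. To be clear, this is not FsCheck's or QuickCheck's API; RandomisedCheck, TryFalsify and the two properties are hypothetical names for this illustration, and the generator (arrays of up to 19 integers between -1000 and 999) is an arbitrary model chosen by me.

```csharp
using System;
using System.Linq;

public static class RandomisedCheck
{
    // Tries to falsify a property with randomly generated int arrays.
    // Returns true and the offending input if a counterexample is found.
    public static bool TryFalsify(
        Func<int[], bool> property, int runs, int seed, out int[] counterexample)
    {
        var random = new Random(seed);
        for (int run = 0; run < runs; run++)
        {
            // The "model" of the input: anything outside these ranges
            // is never generated, so it is never tested.
            var input = new int[random.Next(0, 20)];
            for (int i = 0; i < input.Length; i++)
            {
                input[i] = random.Next(-1000, 1000);
            }

            if (!property(input))
            {
                counterexample = input;
                return true;
            }
        }

        counterexample = Array.Empty<int>();
        return false;
    }

    // Property expected to hold: reversing twice gives back the input.
    public static bool ReverseRoundTrips(int[] xs) =>
        Enumerable.Reverse(Enumerable.Reverse(xs)).SequenceEqual(xs);

    // Property that is false in general: every array equals its own reverse.
    public static bool IsItsOwnReverse(int[] xs) =>
        Enumerable.Reverse(xs).SequenceEqual(xs);
}
```

Running TryFalsify against IsItsOwnReverse should turn up a counterexample almost immediately, which is deductive knowledge. Running it against ReverseRoundTrips and finding nothing is still only confirmation over the inputs the generator happened to produce; an array of 10,000 elements, for instance, is outside the model and is never tried.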

Better tests? Better code?

So what does all this tell us? Well, firstly it tells us unit tests aren’t everything. Unfortunately, they are sometimes sold as a panacea for instability issues, and when taken in this way, they can make situations worse by introducing a false sense of confidence. Unit testing is a work tool and a necessity when pursuing software development from an engineering perspective, but unit tests are limited in their ability to “prove” working software. We should not blindly take a passing unit test as proof that the code works; it is far more nuanced than that.²

Another, more subtle thing negative empiricism tells us about our code is that in order to move our tests closer to the rare realm of high-confidence knowledge, we must limit the complexity and functionality they are testing over. TDD guides us to write in this way, by getting us to ask what minimum unit of code would cause our test to pass. The results tend to be smaller, more composable, testable units; they are easier to reason about, and it is safer to trust the inference of confirmation tests, because the range of behaviour in a single method is so limited. If my functionality reverses a collection, and only reverses a collection, then I am far safer building some confidence from confirmation tests than if the method contains a multitude of side effects that are not consistently triggered. This isn’t deductive knowledge, but it is “better” inductive knowledge, because the scope of the inference is narrower.

Finally, we should hold deductive tests in higher esteem than inductive ones. Deductive tests are rarely seen in the main source control branches, because they would break a CI pipeline. These are the tests you write that fail, because they’ve exposed behaviour in the code you were not expecting, be that a defect or not. In TDD, deductive tests are the tests we first write to prove our functionality fails and to prove our test is effective. The test falsifies the assumption that the functionality works for the given inputs; we then modify the system under test (SUT) to meet the criteria of the failing test.

The tragedy of these deductive tests is that the moment we change the SUT to pass the test, the test becomes inductive again. For the time it is proving our code works, it is inductive, but it becomes deductive when we receive a notification that functionality is failing our assumptions. As we must have all tests passing to integrate our changes, so all deductive tests must end up inductive, until the time comes for them to fail again, protecting another developer from an unknown defect.

Footnotes:

¹ The information the test provides is actually even worse than this. The test tells us only that the Reverse method worked correctly for the given sequence, for the given test run. There is nothing in the test to confirm that the next test run will produce the same result.

² This can be extended to include many other forms of testing: integration, acceptance, etc. The confirmation we receive is confirmation for the tested data set only, at the given testing time.