Software Testing: Theory and Practice (Part 8) - One Step Beyond
Key Takeaways
- Every form of software testing we have examined so far samples the input space. Any inputs that fall outside the sample are never exercised, creating risk.
- In some projects this risk is unacceptable. In such cases, formal methods—which allow exhaustive verification—become an attractive option.
Challenges of Traditional Software Testing
To conclude this series, we will spend two installments looking at technologies that go one step beyond traditional software testing.
In Part 1 “Fundamental Concepts of Software Testing” and Part 2 “Software Testing and Logical Expressions,” we explained that software testing of functional systems means confirming that, for every input satisfying the pre-conditions, the output
satisfies the post-conditions. Exhaustively exercising every possible input is, however, prohibitively expensive.
Consequently, as discussed in Part 5 “How to Select Test Cases,” we finish testing after examining only a subset of inputs, then extrapolate the overall risk from that subset’s results. Inputs that were not selected remain untested, and defects may
well lurk there. Most of us have triggered a software failure at some point in daily life; almost invariably that failure was caused by an input never chosen as a test sample.
Failures caused by out-of-sample inputs are not rare. In other words, traditional sampling-based testing deliberately trades lower cost for a non-trivial amount of risk—and we should stay aware of that fact.
Sometimes a project simply cannot take that risk—for example, aerospace, transportation, or medical systems demanding high reliability. Even where the reliability bar is lower, Part 3 “The Significance and Strategy of Shift-Left Testing” showed that the earlier we detect defects, the cheaper they are to remove. From a cost
perspective alone, a project may therefore prefer heavier early-stage testing.
When you want exhaustive verification rather than sampling, formal methods are invaluable. Across this two-part finale we will start with a single sample in a traditional test, gradually increase the number of test cases, and eventually arrive at formal methods that cover all inputs. Along the way we will observe
what we gain—and what we lose.
What Happens as We Add Test Cases?
When There Is Only One Test Case
Let’s begin with just one test case, i.e. one input/output pair.
In traditional Example-Based Testing (EBT) we supply a concrete input and the expected output, then compare the actual output.
Suppose we have our own natural-number type—ZERO, ONE, TWO, …—and an add function. To verify that 1 + 1 = 2, an EBT looks like Listing 1.
Listing 1. First test case for the natural-number addition
// Verify that our custom addition returns 2 for 1 + 1
describe('add', () => {
  context('when given 1 + 1', () => {
    it('returns 2', () => {
      // Provide concrete inputs
      const actual = add(ONE, ONE);
      // Provide the expected output
      const expected = TWO;
      // Compare actual and expected; fail if they differ, succeed if they match
      assert.deepStrictEqual(actual, expected);
    });
  });
});
Of course, checking only 1 + 1 is hardly reassuring, so we add more cases.
When There Are Two Test Cases
To verify 1 + 0 = 1 as well, we might simply add another almost identical block, as in Listing 2.
Listing 2. Two separate test cases
describe('add', () => {
  // Original test case
  context('when given 1 and 1', () => {
    it('returns 2', () => {
      const actual = add(ONE, ONE);
      const expected = TWO;
      assert.deepStrictEqual(actual, expected);
    });
  });
  // Additional input goes in its own test case
  context('when given 1 and 0', () => {
    it('returns 1', () => {
      const actual = add(ONE, ZERO);
      const expected = ONE;
      assert.deepStrictEqual(actual, expected);
    });
  });
});
But this approach needs roughly eight lines for every added case—verbose and error-prone. Instead, we can share the common scaffolding by supplying the input/expected pairs in table form, as in Listing 3. Such tests are called Parameterized Tests or Table-Driven Tests.
Listing 3. Converted to a parameterized test
describe('add', () => {
  // Table of input and expected output
  // lhs = left-hand side, rhs = right-hand side
  [
    { lhs: ONE, rhs: ONE, expected: TWO },
    { lhs: ONE, rhs: ZERO, expected: ONE },
  ].forEach(({ lhs, rhs, expected }) => {
    context(`${lhs} + ${rhs}`, () => {
      it(`returns ${expected}`, () => {
        const actual = add(lhs, rhs);
        assert.deepStrictEqual(actual, expected);
      });
    });
  });
});
Now each extra test case costs just one new line.
Column: Eager Tests
You often see code that tries multiple inputs within a single test case—the anti-pattern known as an Eager Test. It has two problems:
- If the first assertion fails, later assertions never run, so diagnostic information is limited.
- Bundling multiple inputs under one case blurs the test’s intent, because the case name can no longer describe exactly what is being verified.
The recommended style is one input → one test case, as shown here.
If We Keep Increasing the Number of Test Cases
How many cases are enough to trust our addition? Certainly more than two. To identify good cases we would like to use the On–Off Point method, introduced in Part 5 “How to Select Test Cases.”
However, that method assumes the implementation performs a top-level branch with no loops or recursion afterwards—an assumption our add does not meet. In such situations Property-Based Testing (PBT) is more suitable.
Recap: PBT automatically generates large numbers of inputs. Because the inputs are generated, we can’t specify concrete expected outputs. Instead we supply a relationship that must hold between each input and its output.
Listing 4. Skeleton of a property-based test
describe('add', () => {
  context('given x and 0', () => {
    it('returns ???', () => {
      const x = valuesGen(); // generator picks some natural number
      const actual = add(x, 0);
      const expected = '???'; // concrete expectation unavailable
      assert.deepStrictEqual(actual, expected);
    });
  });
});
Thus, we replace explicit outputs with properties.
Part 4 suggested using the post-condition as that property. Table 1 restates the specification of add.
Table 1. Specification of add
| Left operand | Right operand | Output |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 0 | 2 | 2 |
| … | … | … |
| 1 | 0 | 1 |
| 1 | 1 | 2 |
| 1 | 2 | 3 |
| … | … | … |
Unfortunately, expressing this relation without using “add two natural numbers” is difficult—yet that is exactly what we are implementing. So we fall back to a simpler property: the commutative law.
Listing 5. Verifying commutativity: add(x, y) equals add(y, x)
describe('add', () => {
  context('given x and y', () => {
    it('returns the same as add(y, x)', () => {
      const [x, y] = valuesGen(); // generator picks x and y
      const actual = add(x, y);
      const expected = add(y, x); // commutativity
      assert.deepStrictEqual(actual, expected);
    });
    // In practice valuesGen runs many times with different inputs.
  });
});
This alone is not sufficient—an implementation that mistakenly performs multiplication would still pass—so we add more relations (Listing 6).
Listing 6. Additional properties for addition
describe('add', () => {
  context('given x and 0', () => {
    it('returns x', () => { /* … */ });
  });
  context('given x and y + 1', () => {
    it('returns (x + y) + 1', () => { /* … */ });
  });
  context('given x and y', () => {
    it('returns the same as y + x (commutativity)', () => { /* … */ });
  });
  context('given x and y, then adding z to the result', () => {
    it('matches adding x to (y + z) (associativity)', () => { /* … */ });
  });
  // valuesGen runs many variations.
});
Automatic generation lets us exercise far more inputs than hand-written cases ever could. What we lose is concreteness: we must discover and encode abstract input/output relationships—a skill that takes practice.
Even here the space is not fully covered; the generated values, though numerous, are still a tiny subset of all natural numbers. One extension is to shrink the input domain until exhaustive testing becomes feasible (e.g. by imposing a maximum value). But sometimes we cannot lower that maximum. That is where formal methods shine.
What Are Formal Methods?
Formal methods are mathematically based techniques for specifying, developing, and verifying software and hardware. The field is vast and rapidly evolving; we cannot cover everything here. Instead, we will demonstrate a proof assistant, one category of formal-methods tools.
Exhaustive Testing with a Proof Assistant
We will use the proof assistant Isabelle to verify the entire input space exhaustively.
A proof assistant lets you define programs (like an ordinary language) and prove mathematical properties about them. Crucially, logical quantifiers such as “for all x” allow statements about every possible input—transcending the limits of traditional and property-based tests.
Consider the Pythagorean theorem: for the legs a, b and hypotenuse c of a right triangle, a² + b² = c² always holds. Checking a handful of triangles does not prove the universal claim, just as testing can never prove the absence of all defects. Mathematical proof does.
Isabelle
We will translate the earlier JavaScript add into Isabelle[2].
Listing 7. JavaScript version of add
// Inputs are Zero or Succ.
// Zero represents 0; Succ(n) is n + 1.
// new Succ(new Zero()) is 1,
// new Succ(new Succ(new Zero())) is 2, and so on.
class Zero {}
class Succ {
  constructor(nat) { this.v = nat; }
}
function add(a, b) {
  // Base case: if b is 0 then a + 0 = a
  if (b instanceof Zero) {
    return a;
  }
  // Recursive case: a + b = Succ(a + (b − 1))
  return new Succ(add(a, b.v));
}
Listing 8. Isabelle version of add
theory Scratch
  imports Main
begin

datatype myNat = Zero | Succ myNat

primrec add :: "myNat ⇒ myNat ⇒ myNat"
  where "add a Zero = a"
      | "add a (Succ b) = Succ (add a b)"

(* ... more to come ... *)

end
Proving Program Behaviour
First, we replicate the property from PBT: x + 0 = x.
Listing 9. Proof that x + 0 = x
lemma add_Zero_eq:
  shows "add x Zero = x"
  by (induct x; auto)
(* cursor here *)
Isabelle reports (excerpt):
theorem add_Zero_eq:
add ?x Zero = ?x
To see a failed proof, change the 0 to 1 (Listing 10).
Listing 10. Attempting (and failing) to prove x + 1 = x
lemma add_One_eq:
  shows "add x (Succ Zero) = x"
  by (induct x; auto)
(* cursor here *)
Output:
theorem add_One_eq:
add ?x (Succ Zero) = ?x
Failed to finish proof:
goal (1 subgoal):
1. False
Was the claim false, or merely too hard for automation? Isabelle's try command helps distinguish the two by invoking Nitpick, Quickcheck, and Sledgehammer.
Listing 11. Using try
lemma add_One_eq:
  shows "add x (Succ Zero) = x"
  try
Output:
Trying “solve_direct”, “quickcheck”, “try0”, “sledgehammer”, and “nitpick”…
Nitpick found a counterexample:
Free variable:
x = Succ (Succ Zero)
Nitpick found the counterexample x = 2, confirming the statement is false.
Similarly we can prove the remaining properties (Listing 12).
Listing 12. Proving the remaining properties
lemma add_Succ_eq:
  shows "add (Succ x) y = Succ (add x y)"
  by (induct y; auto)

lemma
  shows "add x y = add y x"
  by (induct y; auto simp add: add_Zero_eq add_Succ_eq)

lemma [simp]:
  shows "add (add x y) z = add x (add y z)"
  by (induct z; auto)
And because Isabelle provides mathematical naturals, we can easily verify the post-condition as well (Listing 13).
Listing 13. Proving the post-condition
fun denoteMyNat :: "myNat ⇒ nat"
  where "denoteMyNat Zero = 0"
      | "denoteMyNat (Succ x) = 1 + denoteMyNat x"

lemma
  shows "denoteMyNat (add x y) = denoteMyNat x + denoteMyNat y"
proof (induct x)
  case Zero
  thus ?case by (induct y; auto)
next
  case (Succ x)
  thus ?case by (simp add: add_Succ_eq)
qed
A full Isabelle tutorial—e.g. Programming and Proving in Isabelle/HOL[3]—is beyond our scope, but highly recommended.
Summary
Using natural-number addition as an example, we observed this progression:
| Stage | Number of test cases | Technique |
|---|---|---|
| ① | A handful | Traditional example-based test |
| ② | Dozens | Parameterized test |
| ③ | Hundreds | Property-based test |
| ④ | All inputs | Formal methods |
Moving from stage ① to ② is easy and loses nothing. From ② to ③ we lose concrete inputs and outputs; from ③ to ④ we additionally need mathematical-proof skills. The further down the table we move, the more abstract our descriptions must become to keep maintenance feasible.
Mastering formal methods takes training. Humans rarely write perfect logical statements or proofs on the first try—just as we rarely write bug-free code. Hence we must verify that our own formulas and proofs match our intent. Understanding the messages produced by tools like Isabelle therefore requires a solid grounding in logic, and process skills akin to Test-Driven Development help enormously.
I hope this article sparks your interest in formal methods.
[2] A rigorous correspondence would require a formal semantics for JavaScript, which is beyond our scope here.
[3] https://isabelle.in.tum.de/dist/Isabelle2024/doc/prog-prove.pdf