2022-08-24 04:07: Ran Over by Test-Driven Development

Hello,

At the time of writing, it's been over four months since I wrote one of these diaries and six months since the last release of Quoll, and while a lot of that has to do with me being busy with moving, a large part has been me working on getting the first block of content that actually uses the reality modification game mechanics out. Imagine! Using programs to modify reality! I'm sure that sounds super interesting, but to be honest, I covered the basics of that back in this dev diary!

So let's talk about a much more boring but still very useful part of coding: writing automated tests.

(WARNING: This Dev Diary entry contains spoilers for the Seattle Apartment Level as well as the Reckoning Room which is first revealed in the Glitch's Non-Breaking Space Level. Nothing too major, but it may ruin some of the surprise.)

Why We Write Automated Tests in the First Place

To say that Quoll, despite being a single-dev game, is a complicated program is an understatement. The traditional measurement of code size is lines of code, which is pretty self-explanatory: it's the number of lines in all the source code files that make up the program. As of writing, just my code for the game part of Quoll is 7224 lines spread across 15 files. I didn't bother figuring out how many of those lines were just blank lines that I added to make things look pretty, but that's a good-sized program, and I'm still just getting started.

That's pretty much just the apartment level, the basic implementation of the reality modification mechanics, and the beginning of the next level, Glitch's Non-breaking Space. And that's just my code! It doesn't include all of the code in all of the extensions done by other people that I'm bringing in. That is also just the Inform 7 code. It does not include the C# code that builds the project and makes the website. Nor does it include the HTML, CSS, and JavaScript of the website itself. Nonetheless, a lot gets done in those 7224 lines of code and thus I find the need to be organized in my testing.

When code is small, you can often get away with ad hoc testing, where your testing is just running the program, trying various inputs, and making sure it works as you expect. When I was first starting out on Quoll, this was pretty much my testing process. But once code gets "chunky", you need to start being a bit more organized in your testing, and one of the ways to make that easier is automated tests.

It's not really about how tedious it is to run the program again and again. There's no real way to ever escape that while working on a program, honestly. It's more about our ability to know what tests we need to rerun when code changes. There's always some basic functionality that we want to work no matter what changes we make elsewhere. And thus, we should make tests targeting that basic functionality, since changes in one part of the code base often end up causing bugs in another part. Since I tend to like to work from an example for these dev diaries, let's talk about that example.

Why Only Have One Bul when You Can Have Three

In the game, Ada uses what is called "Propositional Reckoning" as a method for changing reality. It works by setting the values of Buls, which can be either Trul or Ful. The tie of this concept to Boolean values (bools), which can be either true or false, should be obvious. A Boolean value can easily be represented with a single binary digit (1 or 0), which makes it a very easy thing to represent in a computer. You could say real-life programming is also about setting the values of Buls. Uh, bools.

These Buls include both the very important Propositional Buls that participate in the manifestation of reality and other, more transitory Buls that are only used when Ada's program runs, the Free Buls. That is, along with all the sigils of the SGL programming language, SGL programs can have Free Buls spread out in Funge Space¹. Feel... free to read the details in the help files; this is mostly just a quick overview for the sake of those readers who haven't read them yet or played the game.

I was writing the phone app that will give Ada and the player an overview of the current state of Funge Space, which looks like this in the game:

> x phone
The phone is on CartoFunge, which displays a map of the current state of Funge Space:
 0123456789
0Α1Χ0Ο1ΔΕ0Ι
1ΝυΞ··Κ·ΜΣΗ
2ΛΒΦΒΡΖ····
3Β@1Ν·Γ0Ψ0Θ
4Κ·ΠΥΗ·Τ10Η
5···ΝΚΗΛ01Ω
6··········
7··········
8··········
9··········
My current location is at (1, 3). In this location there is an empty cell.

There is 1 Free Bul on the displayed floor at coordinates (1, 3).

While that last line only mentions one Free Bul, I of course do not intend to limit the player to just one. I only needed one such bul early on in my development. And rather than trusting my code that is displaying that last line, I figured it was time to add more Free Buls to my test setup.

It was easy enough to move the code that added the single Free Bul into a loop to add three² buls instead, to add some basic code to describe all three buls in one sentence rather than repeating "There is a Free Bul set to Trul here" three times, and to handle some state clean up when dropping Free Buls back into the pile in Reckoning Room³. That's not actually the focus of this dev diary, but rather the instigating event of the rest of it, so I'm going to gloss over it.

More importantly, by doing my due diligence to the tenets of test-driven development, I then wrote a test to test the new code⁴. But what is test-driven development?

A Very Brief Introduction to Test-Driven Development

The "traditional" way to write automated tests for code was to write the actual feature code, and then to go back over and write test code for that feature code. This last job would often be handed off to another developer⁵. There are problems with this way of doing things, though.

The main problem is that it furthers the game of telephone away from the specification. The designer writes a spec that the first dev uses to write their feature code, and then the second dev uses that spec and the feature code to write test code to test the feature code. While having more people engaged in the process is good, it also introduces more chances for deviations. Obviously the major deviations come from either dev failing to fully understand the specification, or even from the designer failing to fully understand the problem. But there're a few others that're worth pointing out, and I don't mean simple errors like typos.

One deviation comes about in how programmers often think about problems. That is, programmers think about code, and the structure of code is often required to be very different from how the problem would be specified in a specification. For example, a specification may describe a large number of dialog messages to be displayed under certain conditions. But it doesn't make sense to write separate code for each dialog message. You write one dialog message function, and then pass in the different messages as needed. Note that it does make sense to have a different test for each dialog message, as the differing circumstances are still part of what needs to be tested⁶.

That's an example of where the code may be simpler than the problem. But programmers also like to think ahead and prepare for future expansions, especially if that would be easy to tack on their current work. This is another influx of deviation from the spec, and it may not be benign. Code that is meant to be there for expansion is still code that runs, and still can cause bugs.

Even if all the people involved do their jobs perfectly, there's still this lag between when code is done and when it is tested that can be dangerous. Yes, most devs will do ad hoc testing as they write code, but as mentioned before, that cannot hope to suffice as the problems and the code get more complicated. On top of that, when deadlines rear their ugly heads, something has to be cut, and often, sadly, that's the testing part. So someone had the bright idea to actually write the tests first. This is Test-Driven Development (TDD).

The idea is that first you write a failing test that covers the functionality you're about to add. Note that the test is meant to be one that fails! If you write a test and the test actually passes, then hey, you don't need to write any more new feature code, because it must already be handled by the existing code. We can keep things small!

After you have your failing test, you can write your feature code until your test passes. Obviously that's the next step, but by having a clear finish line of a passing test, this also keeps your actual feature code small. You're not supposed to deviate away to make other small changes or do refactoring; you just need to get that test passing. Repeat these two steps again and again and you have working code and you can point to the tests to prove it⁷! And all with one dev so you can fire that other loser⁸!
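As a minimal sketch of that two-step loop, here it is in Python rather than Inform 7. The function name `describe_free_buls` and its wording are made up for illustration and are not Quoll's code. The test is written first; until the function under it exists, running the test fails.

```python
# Step 1: write the test first. With only this part written, calling it
# fails with a NameError, because describe_free_buls does not exist yet.
def test_describe_free_buls():
    assert describe_free_buls(1) == "There is 1 Free Bul here."
    assert describe_free_buls(3) == "There are 3 Free Buls here."


# Step 2: write just enough feature code to make the failing test pass.
def describe_free_buls(count):
    """Describe a pile of Free Buls in one sentence (hypothetical helper)."""
    verb = "is" if count == 1 else "are"
    noun = "Free Bul" if count == 1 else "Free Buls"
    return f"There {verb} {count} {noun} here."


# The passing test is the finish line for this small change.
test_describe_free_buls()
print("test passes")
```

The point of the ordering is that the test pins down the expected sentence before any feature code exists, so the feature code has no reason to grow beyond what the test asks for.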

I am personally a big fan of TDD, because I think it helps me keep my feature code changes small, as well as focus on the end functional result of my code. However, of course in reality, it is often easier to write the code first than all of the tests. For example, if I have a particularly complicated set of branches and loops in my code, I can usually "see" the feature code more easily than I can "see" the total set of outcomes that would have to be covered by testing. So I often relax these rules a bit and just make sure that no code is checked in without a matching test covering it, even if I end up writing the code first. It's a good compromise.

So let's talk about how doing testing in Inform 7 looks, because as you might expect it is quite different from "normal" programming languages.

Testing in the Inform 7 Editor

Quoll is a piece of interactive fiction, that is, a text adventure, and it might be useful to think about what that means. The state of the game is a series of rooms, with items and people in them, one of which is the player character⁹. Who, I remind you, is named "Ada" in Quoll.

The game is turn based, so the input each turn is some text typed in by the player. This text is parsed, and represents actions the player character may take in the game world. As a tester, I am the zeroth player of the game. To test interactive fiction, you play the game, just like you test any other game. So in that same way, in Inform 7 we can test by entering in commands. And the Inform 7 Editor has some nice features when we do that.

Let me play a bit. It'll be too much for a transcript, but in short, I play the early part of the Apartment level, reading Glitch's note, looking up the spell components on the phone, then going in and examining the picture of Malia on the wall, which I take. After doing that in the Editor, the window looked like this:

Fig. 0 - The Inform Editor after a Playthrough

The Editor is divided into two panes. On the right is the Story Pane where I have just finished a playthrough of the game, thus showing the results of entering the commands. For example, I typed "X PICTURE", "TAKE PICTURE", "S"¹⁰. The game read this as issuing commands to examine the picture, take the picture, and then go south. And the game dutifully printed out the descriptions of it doing that, including the implicit "LOOK" command that was part of "S", and thus, the description of the room Ada entered.

However, what's that on the left? That pane is a special feature of the Inform 7 Editor, the Skein Pane¹¹. This pane kept track of all of my inputs. You can see a vertical chain of commands, starting with "ENTER NANWIGE" and ending with the same "S" you see in the Story pane. Every time I entered a command in the game, the Editor kept track of it. And it's not just pretty, I can double-click on bubbles to play the game back to that command and then branch off on a different path. So after double-clicking on the second "W" bubble before I did "X PICTURE", I can then enter in a completely different set of commands into the Story Pane.

And after doing that, the window looked like this:

Fig. 1 - Going Down an Alternate Path

You can still see the results of the last few commands in the Story Pane to the right like before, but now the Skein Pane is the more interesting thing to look at. In addition to my earlier set of commands in the left branch, you can now see my second, alternate set of commands in a chain to the right of the original chain. The branch-off happens on the second "W". The Editor will keep track of this whole connected tree of the commands you enter, allowing you to surf around that tree to test your game in an ad-hoc fashion. But that's not all! Let's change the right pane to the Transcript Pane¹²:

Fig. 2 - The Transcript Pane

Each command shows up here, too. In the box for each command is the transcript of what the game printed for that command. In addition to that, there is a "Bless" button for each command. "Blessing" a command saves that transcript as the correct output for that command at that point. If I were to change the code and re-run all the commands in the Skein (something I often do as part of testing), when the output does not match the Blessed transcript, it will be flagged with a bright red dot on its bubble in the Skein, as you can see below (note that I had to do some work to get this state to happen for the sake of demonstration, so we're back at that second "W" command again):

Fig. 3 - This Transcript is Cursed

The command is similarly marked as not matching the expected Blessed transcript in the transcript pane, because I took a second to fix some uncapitalized directions in the room description for the Apartment Front Hallway room, as well as slightly change the description of the hallway¹³.

And congrats, that's how you test. Basically, if you were to make a code change that did something, like moving three Buls to a room rather than one¹⁴, the Editor will point out every command in the Skein/Transcript whose output now differs from its Blessed transcript.

Sometimes you will expect these differences, like seeing there are three Buls in the room instead of one. You can just Bless the updated transcript in that case. But perhaps some other command differs as a consequence of your changes, and you need to go through and start investigating to see what mistake you made. Heck, your investigation will become a new branch of the Skein that can be used to test that your newfound bug doesn't happen in the future¹⁵.
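The Bless-and-compare idea can be sketched outside the Editor in a few lines of Python. This is only an illustration of the concept, not how Inform 7 implements it: the `play` function is a stand-in for the game, the room text is invented, and the blessed transcripts are keyed by command text for simplicity, whereas the real Skein tracks each command's position in its tree.

```python
def find_cursed(commands, play, blessed):
    """Replay each command and list those whose output differs from its blessed transcript."""
    cursed = []
    for command in commands:
        output = play(command)
        # Commands that were never blessed have nothing to compare against.
        if command in blessed and output != blessed[command]:
            cursed.append(command)
    return cursed


def play(command):
    """Stand-in for the game after a code change (invented output text)."""
    outputs = {
        "W": "Apartment Front Hallway. Exits lead East and West.",
        "X PICTURE": "A picture of Malia.",
    }
    return outputs[command]


# Transcripts blessed before the code change.
blessed = {
    "W": "Apartment Front Hallway. Exits lead east and west.",
    "X PICTURE": "A picture of Malia.",
}

cursed = find_cursed(["W", "X PICTURE"], play, blessed)
print(cursed)  # only the command whose description changed is flagged

# Blessing accepts the new output as the expected one from now on.
blessed["W"] = play("W")
print(find_cursed(["W", "X PICTURE"], play, blessed))
```

After the re-Bless, the same replay reports nothing, which is the state you want before moving on to the next change.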

However, it likely seems optimistic to think that we can comprehensively cover every state of a game through ad hoc play. We might want to get more organized and write down a set of commands as an explicit test.

Defining Tests in Source Code

The Skein is pretty great for ad hoc testing, and you have quite a few tools to prune its tree to better organize tests. For the most part, though, I do not use it that way. Instead, I use another way Inform 7 has to define tests, as a set of commands written more explicitly in the source code:

Test free-buls with "xyzzy /
? filling and entering / fill-funge-space 3 / in /
? taking free / take free /
? description updated after taking free / l /
? dropping free / drop free /
? taking free again / take free /
? switching free / switch free /
? dropping free again / drop free /
? taking multiple frees / take free / take free /
? description updated after taking multiple frees / l".

This code creates a test called "free-buls", which I can then run in the debug version of the game with "TEST FREE-BULS". Each / here marks the division between commands, and the commands that start with ? are taking advantage of the Player Feedback functionality of Quoll that was originally added to allow my test players to add comments. Here, I'm using it to give myself a bit more information about what I'm testing with each command.
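To make the structure of that test string concrete, here is a small Python sketch that splits a shortened copy of it on / and labels each piece. It is only a reading aid: Inform 7 simply feeds each piece to the game as a typed command, and the "annotation" label for ? entries is my own naming, not Inform 7 or Quoll terminology.

```python
# A shortened copy of the test script from the listing above.
TEST_SCRIPT = (
    "xyzzy / ? filling and entering / fill-funge-space 3 / in / "
    "? taking free / take free"
)


def split_test_script(script):
    """Split a slash-separated test script into (kind, text) steps."""
    steps = []
    for piece in script.split("/"):
        text = piece.strip()
        if not text:
            continue  # ignore empty pieces from stray slashes
        if text.startswith("?"):
            # Entries starting with ? carry a note about what is being tested.
            steps.append(("annotation", text[1:].strip()))
        else:
            steps.append(("command", text))
    return steps


steps = split_test_script(TEST_SCRIPT)
for kind, text in steps:
    print(f"{kind:10} {text}")
```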

If you're wondering about that first command, "XYZZY", that is a debug-only command that puts the game in a special test mode. This command, "XYZZY", kicks off an Inform 7 Action¹⁶ called, appropriately enough, "invoking test mode". You can enter that command into a release version of the game, but you'll only get a snarky message from Ada. As for why I chose those letters, check out the Wikipedia article on it.

As of writing, I have 63 such tests written, and if you were to look at the Skein that is checked into the repository (y'know, when I make it public eventually), you'd see that the only bubbles in it are me calling these tests. This way I get the benefit of having a set of tests that I expect to have a certain output, and a way to associate a Blessed transcript with them.

It's not perfect however:

Fig. 4 - My Sad Horizontal Skein

In this screenshot you can now see the top of the Skein in the pane to the left. My ad hoc playthrough is still there stretching down vertically, how the Skein likes it. On the left of the Skein Pane, you can see a few bubbles for "TEST" commands, including "TEST FREE-BULS", which is the currently highlighted command bubble. Part of the issue with using "TEST" commands like this is that each "TEST" command is treated as only one command by the Skein, so it limits the Skein to being purely horizontal.

This also applies to the Transcript Pane that you can see in the right pane. The Blessed transcript, rather than being tied to each individual command in "TEST FREE-BULS", is tied to the whole thing. This means that any change to the transcript, no matter how small, will "curse"¹⁷ the whole thing. The Transcript Pane attempts to mitigate this by underlining the exact changed text in the transcript, but as you may have noticed in the example of a cursed transcript above, it's not perfect¹⁸.

But, it does the job. I'm able to keep a list of my tests, and re-run them all on occasion. However, since there have simply not been enough code blocks in this Dev Diary, let's talk some more about the code that is being tested by the above test.

In which We Get Back to Tripling the Buls

So, as I've admitted, I did not write this while perfectly sticking to the tenets of TDD. I first wrote some code to copy three Buls to the room when setting things up when invoking test mode:

Carry out invoking test mode (this is the propositional reckoning test mode invocation rule):
    [...]
    repeat with I running from 1 to 3:
        let B be a random bul in Bul Limbo;
        now the bul state of B is Ful;
        now B is not propositional;
        now B is editable;
        move B to Reckoning Room;
    [...]

This was as simple as moving the lines to copy one Bul into a repeat loop.

I then updated my Rule for writing a paragraph about a bul so that it combined all Free Buls in Reckoning Room¹⁹ into one paragraph rather than writing the line "There is 1 Free Bul set to Ful (0) here." three times:

Rule for writing a paragraph about a bul (called B):
    if the location is Reckoning Room:
        let FBL be the list of buls in Reckoning Room;
        let FBLC be the number of entries in FBL;
        say "There [regarding FBLC][are] also [FBLC] Free Bul[if FBLC is greater than 1]s here, all[otherwise] here,[end if] set to Ful.";
        repeat with FB running through FBL:
            now FB is mentioned;
    otherwise:
        say "There is [a B] set to [bul state of B] ([numeric version of bul state of B]) here";
        if B is fixed and the location is fixed:
            say ", in the ground";
        say ".".

I realized that my claim that all the Free Buls are set to Ful could be screwed up if the player makes Ada take a Bul, switch it to Trul and then drop it, so my last change of code was to update my Check dropping a bul Rule to account for this and set any dropped Buls back to Ful:

Check dropping a bul:
    [...]
    otherwise if the location is Reckoning Room and the noun is editable and the bul state of the noun is Trul:
        now the bul state of the noun is Ful;
        say "Let me just set this back to Ful before dropping it... [run paragraph on]";
        continue the action;
    [...]

And with all that in mind, I finally wrote my test (which you saw earlier) to minimally cover what I had changed:

Test free-buls with "xyzzy /
? filling and entering / fill-funge-space 3 / in /
? taking free / take free /
? description updated after taking free / l /
? dropping free / drop free /
? taking free again / take free /
? switching free / switch free /
? dropping free again / drop free /
? taking multiple frees / take free / take free /
? description updated after taking multiple frees / l".

After making sure this test's transcript was to my liking, I was ready to run all my other tests²⁰ to make sure that my changes didn't break anything. After all, this seems like a pretty neat and self-contained change, right? It totally won't break things.

It broke things.

Out of our 63 tests, a whole 54 of them now had cursed transcripts²¹. In many cases, it was just simply that the test had Ada walk through Reckoning Room, which of course had an updated description that there were three Buls there instead of one. Those are easy enough to Bless.

In other cases, the tests were indeed affected by my changes in a predictable way. For example, I had been setting the single Free Bul to Trul but I decided to set them to Ful during my update. This caused small changes to tests like my test of the Funge Space Mapper app, where the map was slightly different. These were also easy enough to Bless. Another set of tests was actually affected by the small change to the Apartment Front Hallway description that I used to demonstrate the cursing of transcripts. Also easy to Bless.

But then we get to some tests that had some differences that were a bit odder. For example, in my test for creating two-dimensional Funge Space, rather than Venerable the Grue being there for Ada in the Alpha-Sigil, Page the Grue was there. Why did that happen? For that, we need to talk about randomness.

When Randomness is not Random

That randomness is a part of Quoll should be unsurprising. When my code creates the Funge Space area, one of the things it does is to randomly choose a grue to put in Alpha-Sigil. This is purely aesthetic, as the grues all work the same once Ada rides them. This bit of randomness is why Venerable was subbed out for Page.

More impactful is that when Ada reads Glitch's second note in the Apartment Level, seven plushies are semi-randomly strewn around the apartment. I say "semi-randomly" because Autumn the Mouse, Malachi the Panther, and Null the Skunktaur are always chosen from the 13 available plushies. Then four more are chosen from the last ten to make seven. That means there are 210 different sets of plushies that can be chosen here²².

For all the non-Null plushies, the code then assigns one each of six colors to the plushies. That makes 6! or 720 possible color assignments. In addition, each plushie has a 1 in 4 chance of being marked as "dirty" meaning that they would have to be cleaned before being used as a sacrifice. This doesn't apply to Malachi who is always marked dirty, and Autumn and Null who are always marked clean. So with four plushies that can either be dirty or clean, there's 16 different ways that can end up.

There are then ten different rooms in the apartment that one plushie each will be distributed to, for a total of 120 possibilities for what rooms they end up in. This makes for a whopping 290,304,000 different ways that the plushies can be chosen, colored, dirtied, and distributed in the apartment after reading the second note. I assure you I do not test all these ways.
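Multiplying the four counts quoted above can be checked with a few lines of Python; the product comes to 290,304,000. One assumption is labeled in the code: the 120 is taken to be the number of ways to pick which seven of the ten rooms receive a plushie. Counting which plushie lands in which room would give a larger figure.

```python
from math import comb, factorial

# Four more plushies chosen from the ten that are not always picked.
plushie_sets = comb(10, 4)

# Six non-Null plushies, each given a different one of six colors.
color_assignments = factorial(6)

# Four plushies that can each independently be dirty or clean.
dirty_states = 2 ** 4

# ASSUMPTION: 120 counts which seven of the ten rooms get a plushie.
room_choices = comb(10, 7)

total = plushie_sets * color_assignments * dirty_states * room_choices

print(plushie_sets, color_assignments, dirty_states, room_choices)  # 210 720 16 120
print(f"{total:,}")  # 290,304,000
```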

Instead, I take advantage of a curious fact about random numbers generated on computers, in that they are actually pseudo-random. That is, they are random enough from the eyes of people, but they are actually generated by a predictable function that starts with a particular "seed" value. Getting truly random numbers is actually a hard problem, and most solutions involve falling back on physical randomness, which tends to be better²³.
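A short Python sketch shows the property that matters here: the same seed produces the same "random" results every time. Python's generator is not the one Inform 7 uses, and the seed value, room numbers, and function name are all made up for illustration.

```python
import random


def roll_plushie_rooms(seed):
    """Pick 7 distinct rooms out of 10 using a generator started from `seed`."""
    rng = random.Random(seed)  # private generator, so nothing else disturbs it
    rooms = list(range(10))    # hypothetical room numbers 0 through 9
    return rng.sample(rooms, 7)


first = roll_plushie_rooms(seed=1234)
second = roll_plushie_rooms(seed=1234)

print(first)
print(second)
print(first == second)  # True: same seed, same sequence of choices
```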

So when my tests run, they are set to use a particular seed value for the pseudo-random number generator ((P)RNG) every time. This makes the results predictable. As long as I call for the exact same number of random results at the exact same times, I get the same results. But then why did poor Venerable get replaced by Page? Well, look back at my update of the Carry out invoking test mode rule:

Carry out invoking test mode (this is the propositional reckoning test mode invocation rule):
    [...]
    repeat with I running from 1 to 3:
        let B be a random bul in Bul Limbo;
        now the bul state of B is Ful;
        now B is not propositional;
        now B is editable;
        move B to Reckoning Room;
    [...]

That random in let B be a random bul in Bul Limbo means what it says. It takes all the Buls in Bul Limbo (where they all begin at the start of the game), and picks one at random. While before I was only doing this one time, I'm now doing it three times. So while the same seed is being worked with, the call to the RNG that would have put Venerable into the room gets eaten up getting a second and third Bul from Bul Limbo.

And I don't just use this let X be a random [...] when there's actually real randomness! There is no phrase like let B be the bul in the location or let P be the plush animal toy on the sacrificial platform for grabbing something that I know has at most a single instance. I have to use this random phrasing. And nonetheless, it counts as a call to the RNG.

I've spent a good amount of time trying to work around this, but unfortunately, there's no good solution that I've found. Actually, I don't think there can be one as long as we're working with the lower level Relations like Containment, Incorporation, and Support that are actually implemented as part of the Inform 6 code under Inform 7.

I'll just have to occasionally go and update my tests to account for the RNG sometimes being thrown off by extra RNG calls. Oh well, programming sucks sometimes. The tests are still worth it in the end.

And that's all I really have to say about testing in Quoll (for now, at least).

Your Not-so-humble Dev,

¹ Technically, there are also Embedded Buls that are there before Ada gets there and cannot be changed, but we can relegate them to a footnote for now. It was hard enough explaining everything else in a couple of paragraphs.
² Three is a magic number, indeed.
³ No, really! I got all that code done in like an hour of insomnia right before I decided to write this Dev Diary.
⁴ Yeah, yeah, I should have done the test first to really follow the rules of TDD. Leave me alone.
⁵ Me. I was that developer, specifically focused on writing automated tests. Your code is bad and you should feel bad. (Don't worry, so is mine.)
⁶ This is one of the reasons I'm so tired.
⁷ This is not the whole picture of testing, sadly. Most devs write simpler tests called "unit tests" that only really cover how their small "unit" of code works, not how the whole thing works. How the whole thing works is covered by "integration" or "functional" tests.
⁸ As I said in the last footnote, not really, but that didn't stop them from firing me.
⁹ If you noticed I didn't capitalize these, it's because I'm not really talking about the Inform 7 concepts right now. Just the general text adventure concepts of those things. It's minor, don't sweat it.
¹⁰ You may notice I typed in these commands in lowercase in the screenshot. That is how I normally enter in my commands for these types of games, but I am putting them in all-caps in the text to distinguish the text as a command.
¹¹ If you downloaded Inform 7 recently, you likely have the newer open-source version so this pane is now called the Testing Pane and it looks and works way better. I had issues using the Beta version, so I'm still behind as of writing.
¹² If you have the newer version of Inform 7, the Transcript Pane has been removed, with the same functionality directly available in the new Testing Pane.
¹³ The capitalization was a real error I just noticed in the middle of writing this Dev Diary!
¹⁴ Remember how that was our original goal in this Dev Diary?
¹⁵ I should say "can become a branch", because yes, it will get added, but you can prune off branches you don't need.
¹⁶ We're back to using capitalization to mark special Inform 7 meanings of words here...
¹⁷ Unlike "Bless", "curse" is just something I made up myself, so it is not actually part of Inform 7's terminology. Hence, no capitalization.
¹⁸ The new release of Inform 7 with the Testing Pane has greatly improved the look of these transcripts, actually showing differences by placing the changed text in red and green boxes. This is more often what you'll see in "diffing" programs that programmers use like Beyond Compare, and is a welcome change!
¹⁹ You might wonder why these rules are so specific to this one Room, Reckoning Room. What happens if the player makes Ada take Buls out elsewhere? Well, she will only carry them within Funge Space and the Reckoning Room, and refuse to leave those areas while holding them. There's a bunch of already existing code to ensure that.
²⁰ As of writing, running all 63 tests takes just under four minutes. It's a good time for taking a break.
²¹ I had to count this by hand! I did something I had never done before that worked well: I used Roman numerals as tally marks.
²² Feel free to check my combinatorics work here on your own time.
²³ One of my favorite sites, random.org, uses atmospheric noise for this purpose, but you could also literally have a physical die that is rolled and checked, which I know other sites do.
