During a journey to upgrade to Rails 4, the test suite was crucial for success. I ran the test suite and recorded all of the test failures. Then I went through test by test fixing them until I was all done. Barely containing my excitement I pushed to remote and waited. Sadly, our CI server reported a build failure. What the heck?! The tests all passed in isolation. Why do they not pass when run together? Well…
The mark of a good test is that they can be run in any order and still pass. In order to accomplish this, we need each test (or the suite at large) to be responsible for setting up its own data, and tearing down that data when its finished. You want to leave the state of everything exactly like how you found it before the test was run. This can include clearing caches, truncating database tables, or rolling back transactional changes. Many test suites including Rspec have the capability to do these things for you without much headache.
But what happens when your tests are run in a different order and fail? Or what happens when your tests pass in isolation, but fail when run in part of the larger test suite? Likely something in an unknown test is causing your test failure. And finding which other test is being naughty can be a real pain. Fortunately, this is the case for a new feature in Rspec 3.3 called
In case you are not familiar with bisect its a simple, but powerful algorithm. Suppose you want to drop an egg off a building at different floors to discover the lowest floor on which the egg will break. There are multiple ways to accomplish this task. The most wasteful way would be to start at floor one, drop the egg, if it doesn’t break, go to floor two and repeat. When you drop the egg and it breaks, then you have your answer. This is slow. In Big O notation, this would be
n is the number of floors in the building.
We can do better. What if we could cut the floors that we test on in half each time we perform our egg drop test? For argument sake, if the building has 100 floors (and we have an unusually strong egg!) we might bisect the floors and start at floor 50. If we drop the egg and it breaks, we know that the answer is somewhere between 1 and 50. If it doesn’t break, we know the answer is between 51 and 100. Lets assume it breaks. We can then do our next test (with a new egg) on floor 25. If it breaks, we know our answer is between 1 and 25. If it doesn’t break, we know our answer is between floors 26 and 50. Just by dropping two eggs on different floors, we’ve cut our sample size from 100 to 25. In Big O notation, this would be
O(n/2). The number of times we have to do this test is the number of powers of 2 we can divide into our number of floors. On a 100 floor building, we would have our answer in ~6 tests (2^6). We are going to save a lot of eggs.
So why are we talking about eggs when we have test failures to fix? Rspec offers a sweet utility called “bisect” that will do something analogous to our egg drop algorithm in our test suite. We have a known test failure in the larger suite, but it passes in isolation. And we want to know which currently unknown test is interfering and causing our known test failure. How would you tackle this? Well, you could remove half your tests (keeping your failing test) and see if you still get a test error when running the test suite. If you do, you remove half again, and repeat your testing. Do this until you get down to two tests – the known failing test, and the previously unknown (now known) passing test. But that is tedious work to do manually.
Lets take a look at Rspec bisect:
$ bundle exec rspec --seed 1234 --bisect Bisect started using options: "--seed 1234" Running suite to find failures... (3 minutes 14.1 seconds) Starting bisect with 3 failing examples and 2475 non-failing examples. Checking that failure(s) are order-dependent... failure appears to be order-dependent Round 1: bisecting over non-failing examples 1-2475 . ignoring examples 1-1238 (1 minute 38.37 seconds) Round 2: bisecting over non-failing examples 1239-2475 . ignoring examples 1239-1857 (57.25 seconds) Round 3: bisecting over non-failing examples 1858-2475 . ignoring examples 1858-2166 (29.98 seconds) Round 4: bisecting over non-failing examples 2167-2475 .. ignoring examples 2322-2475 (46.11 seconds) Round 5: bisecting over non-failing examples 2167-2321 .. ignoring examples 2245-2321 (43.61 seconds) Round 6: bisecting over non-failing examples 2167-2244 .. ignoring examples 2206-2244 (42.92 seconds) Round 7: bisecting over non-failing examples 2167-2205 .. ignoring examples 2187-2205 (43.65 seconds) Round 8: bisecting over non-failing examples 2167-2186 .. ignoring examples 2177-2186 (41.96 seconds) Round 9: bisecting over non-failing examples 2167-2176 . ignoring examples 2167-2171 (24.15 seconds) Round 10: bisecting over non-failing examples 2172-2176 .. ignoring examples 2175-2176 (36.95 seconds) Round 11: bisecting over non-failing examples 2172-2174 .. ignoring example 2174 (37.3 seconds) Round 12: bisecting over non-failing examples 2172-2173 . ignoring example 2172 (24.68 seconds) Bisect complete! Reduced necessary non-failing examples from 2475 to 1 in 9 minutes 1 second. The minimal reproduction command is: rspec ./spec/models/file_1.rb[1:4:3] ./spec/models/file_2.rb[1:1:1,1:1:2,1:1:3] --seed 1234
That is a lot of output. Lets break it down:
- We call rspec with a given
--seed. What this seed value does is determine the order of the tests. When
--seedis omitted a random number is chosen which makes the tests order random. This is normally a good thing. But when bisecting, its a bad thing because we want to reproduce a certain arrangement of tests to reliably generate our failure.
- The test suite initially takes 3 minutes 14 seconds to run. This is the longest test run because it is discovering which test failures we have so it must run the full suite.
- The first round we ignore the first half of our tests. This takes half the time to run because we are running half the tests.
- The second round we detected the failure from round one, so we know the half we tested includes the failure. We’ve just eliminated 50% of our tests from being the culprit. The next bisect will ignore the first half of the remaining pool and test again. We half the test run time again (tests don’t have exactly the same duration)
- The third round we repeat what we did in round two, but with the remaining pool, and so on. If the failure is no longer detected, we test on the half we most recently removed.
- Finally we get a completed message with a command to reproduce the failure. I can copy and paste this into a terminal and see the two files that are causing a failure.
rspec ./spec/models/file_1.rb[1:4:3] ./spec/models/file_2.rb[1:1:1,1:1:2,1:1:3] --seed 1234 What is that identity notation you might ask?
./spec/models/file_1.rb is straightforward. This is just the filename. The next part is a bit more interesting:
[1:4:3]. This is telling us the nesting of the assertion that is failing when combined with the other file. Not only does Rspec bisect tell us the two files that are interfering each other, but which tests within those two files are interfering! The notation
1:4:3 is telling us its the first
describe context, then the fourth context within that context, and finally the third test within those previous two contexts. In the case of the second notation:
./spec/models/file_2.rb[1:1:1,1:1:2,1:1:3] its just telling us that we have multiple tests that are failing (3 to be precise), separated by commas. Notice that the command also includes the
seed value we used when bisecting. This is still important because your failure might be order dependent.
And what if your tests fail in a different order? Same principle. When you get an intermittent test failure when running the test suite that you suspect is related to order, note the seed value that was used. Perform the same rspec bisect algorithm (and the same command) and use this seed value instead of
1234. This will reproduce the test failure you got by running the same tests in the same order as when the failure was recorded.
The Rspec team did an awesome job with the bisect utility. This used to be a dreaded problem to face as someone running tests. To verify your fix you had to run the entire test suite just to know if your change worked. This would have been a 4 minute wait in the example above, but could easily have been much longer. When reproducing failures, you want the shortest interval possible. Running 4 tests is certainly faster than running 2475 tests!
Another shout out goes to the authors of the
transpec gem that is used when upgrading versions of Rspec. This automatically changes your deprecated code with the newer style syntax. These two features together show that the Rspec team has a powerful and mature product.