Sure enough, when we came in the next morning every test had stopped! All the equipment was locked up and nothing was working. We eventually tracked down the bug. The problem was that the day of the year was being stored in a byte rather than a word. A byte can only hold values from 0 to 255, so everything stopped working on Friday, September 13th, the 256th day of the year.
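To see why day 256 was the breaking point, here is a minimal sketch of the arithmetic in Python. It is not the original code, which the story does not show. It assumes the counter behaved like an unsigned 8-bit value that silently wraps around. The names `day_of_year` and `store_in_byte` are made up for illustration, and the year 1985 is an assumption, chosen only because September 13th fell on a Friday that year.

```python
from datetime import date

BYTE_MAX = 0xFF  # largest value an unsigned 8-bit byte can hold (255)


def day_of_year(d: date) -> int:
    """Return the 1-based day of the year (1 January is day 1)."""
    return d.timetuple().tm_yday


def store_in_byte(value: int) -> int:
    """Simulate storing a value in an unsigned byte: only the low 8 bits survive."""
    return value & BYTE_MAX


# Hypothetical year: the story does not say which year it was.
day_before = date(1985, 9, 12)
failing_day = date(1985, 9, 13)

for d in (day_before, failing_day):
    true_day = day_of_year(d)
    stored = store_in_byte(true_day)
    print(f"{d.isoformat()}: true day {true_day}, stored in a byte as {stored}")
```

Under those assumptions, September 12th is stored correctly as 255, while September 13th (day 256) wraps around to 0. Any code expecting a day number of 1 or more would then be looking at a value that should never occur. In a leap year the same overflow would arrive a day earlier, on September 12th.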
The hardware had been designed to take the timing for the memory refresh from the actual instruction execution. The DIVIDE instruction on that particular machine took so long to execute that a memory refresh could be missed, and the memory contents were scrambled.
But is that a hardware or a software problem? A decision had been made earlier in the project not to use the DIVIDE instruction because of the effect it had on interrupt response. We never decided whether it was a software or hardware problem, but instead simply fixed both.
Got any horror stories you want to share? Want to vote for your favourites among the recent bugs? Got anything to say on the subject of bugs in modern software? Write to Rob to let me know.