I used a program called Monkey extensively in 1986 while performing final testing of Microsoft Works for the Macintosh to shake out the last bugs in the program. I found quite a few bugs that would probably never occur in real life, but it made the program much more robust. Nobody ever complained about Works crashing.
APL has a primitive for generating random sequences, and I'm pretty sure I've seen toy examples of it being used to characterise functions; the story here may have been the origin of "fuzz testing" under that name, but it would not surprise me at all to find 1960s computer use (or 1950s cybernetic use, or even 1930s radio use?) under a different name.
I know I was interested in random noise inputs and test oracles that could classify them into known equivalence classes for inputs/outputs as a way to verify equivalence class boundaries. That was in the 90s. But it's an example of similar stuff that's not "fuzz testing," per se.
Well, there are ups and downs. In theory, we should be able to automatically generate tests for all of our software. There are methods that include path-based, combinatorial, symbolic, model-driven, etc. While some options are generic (e.g. range tests on primitive types), most tests need a good understanding of what counts as correct vs. incorrect behavior.
That brings us to the other problem: software specification. We can’t even know if software is correct if we don’t precisely define what correct means. Same with “secure.” So, we need specifications of each property we want to check. Then, we can use a variety of methods, including testing, to check those. Developers usually don’t specify the correctness conditions. The tools for doing so aren’t great for developers either.
Enter fuzzing. It can do something similar to path-based and combinatorial testing with no user guidance. Many failures cause crashes or other obviously bad behavior. That gets value on software without specifications.
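As a sketch of that idea: a dumb fuzzer needs nothing more than a random input generator and a way to tell a documented rejection from a crash. The parser below is a made-up example (not from the thread); the point is that only unexpected exceptions count as failures, which is exactly the "obviously bad behavior" that needs no specification.

```python
import random

# Hypothetical function under test: a tiny length-prefixed parser
# that should reject bad input gracefully rather than crash.
def parse_length_prefixed(data: bytes) -> bytes:
    if len(data) < 1:
        raise ValueError("empty input")
    n = data[0]
    if len(data) - 1 < n:
        raise ValueError("truncated payload")
    return data[1:1 + n]

def fuzz(iterations: int = 10_000, seed: int = 0) -> int:
    """Throw random byte strings at the parser and count crashes.

    ValueError is the documented rejection path, so only *other*
    exceptions count as failures -- no spec needed for those.
    """
    rng = random.Random(seed)
    failures = 0
    for _ in range(iterations):
        data = bytes(rng.randrange(256) for _ in range(rng.randrange(16)))
        try:
            parse_length_prefixed(data)
        except ValueError:
            pass           # graceful rejection: fine
        except Exception:
            failures += 1  # unexpected crash: a bug
    return failures

print(fuzz())  # -> 0 for this (correct) parser
```

Real fuzzers add coverage feedback and input mutation on top of this loop, but the contract is the same: random input in, crash or no crash out.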
Even if we specify things, we might not get all the conditions right or cover every module. Maybe a developer updates code without updating the specification. Model-based testing is only as good as the human effort behind it; fuzzing doesn't depend on that effort, so it will catch what the humans missed.
So, there’s better methods to test software from both an efficiency and accuracy standpoint. Yet, the labor involved plus room for human error makes fuzzing a valuable tool in the toolbox. There’s no shame in using it.
> That brings us to the other problem: software specification. We can’t even know if software is correct if we don’t precisely define what correct means.
However if we have software that is mostly correct, and if the bugs are unpredictable and specific to particular implementations, we can still find bugs without a specification by using differential testing: using two or more separate implementations, identify inputs on which their behavior differs. (This requires the software be deterministic.) Places where they differ will not tell us which of the two is wrong, but that can be determined manually once the discrepancy is found.
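A minimal differential-testing harness can be sketched in a few lines. Here Python's built-in `sorted` plays one implementation and `fast_sort` is a hypothetical second implementation with a deliberately planted bug; as described above, the harness only reports inputs where they disagree, not which one is wrong.

```python
import random

def reference_sort(xs):
    # First implementation: defer to the built-in sort.
    return sorted(xs)

def fast_sort(xs):
    # Second, "independent" implementation with a planted bug for the
    # demo: a bogus fast path assumes a list whose first element is
    # <= its last element is already sorted.
    if len(xs) < 2 or xs[0] <= xs[-1]:
        return list(xs)
    return sorted(xs)

def differential_test(trials: int = 1000, seed: int = 1):
    """Feed both implementations the same random inputs and collect
    every input on which their outputs differ."""
    rng = random.Random(seed)
    discrepancies = []
    for _ in range(trials):
        xs = [rng.randrange(10) for _ in range(rng.randrange(8))]
        if reference_sort(xs) != fast_sort(xs):
            discrepancies.append(xs)
    return discrepancies

bad = differential_test()
print(len(bad), "disagreements; first example:", bad[0] if bad else None)
```

Note that determinism matters: both functions must map the same input to the same output on every run, or the discrepancy list is noise.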
Fuzzing is guesswork. I see this attitude: because an unknown bug could possibly exist, it could be a high-severity bug, and is therefore worth expending a lot of time and effort to discover.
But to beat those odds and discover something you can't predict, something you don't even know the shape of, some effort should first be made to reduce the problem space.
In the given example, I don't see why you need to test input at the CLI level when access control and input sanitization should already be verified using known parameters that reject all unpredictable input. Obviously, certain combinations of input are more dangerous than others; at the very least, those individual systems should have focused parameterized tests, and that set subtracted from the random fuzz possibilities.
Exploratory testing on highly secure, safety-prioritized systems is one thing. Sure, chaotic testing like this has a place in a very specific, hopefully highly structured system. Even then, I would use it only after every other type of testing.
When someone wants to test every input possibility with random noise I roll my eyes. Test what you know is a threat first, achieve solid coverage, run tests at every stage of development and then maybe we can talk about fuzzing. Is the system actually functioning as it's intended? Are all the happy path use cases tested? Are you sure about that? Boring, I know.
Fuzz testing is incredibly effective at finding gaps in the programmer’s understanding. You should read Barton Miller’s papers on fuzz testing https://pages.cs.wisc.edu/~bart/fuzz/ to see how effective dumb fuzzing still is over 30 years later.
I regularly write custom, small, “fuzzers” in my test suites - at least when I can. For example, recently I’ve implemented a btree for a project with some custom optimisations. My test suite includes a fuzzer which randomly mutates the btree and performs the same mutations on a slower reference data structure. After each change, I test that both structures contain the same data.
The test has shaken out about 10 obscure bugs in my code that my other unit tests failed to find. And that is about what I expected - it’s what I find more or less every time I do randomised testing on code that hasn’t experienced this before.
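That pattern (random mutations applied to both the structure under test and a slow reference, with an equivalence check after each step) can be sketched as below. `TinyMap` is a trivial stand-in for the commenter's btree, invented for this example; the shape of the harness is the point, not the class.

```python
import random

class TinyMap:
    """Stand-in for the structure under test (e.g. a custom btree).
    Implemented as a sorted association list."""
    def __init__(self):
        self.items = []
    def set(self, key, value):
        for i, (k, _) in enumerate(self.items):
            if k == key:
                self.items[i] = (key, value)  # overwrite existing key
                return
            if k > key:
                self.items.insert(i, (key, value))  # keep list sorted
                return
        self.items.append((key, value))
    def delete(self, key):
        self.items = [(k, v) for k, v in self.items if k != key]
    def as_dict(self):
        return dict(self.items)

def fuzz_against_reference(steps: int = 5000, seed: int = 42):
    """Apply the same random mutations to the subject and to a plain
    dict (the slower reference), checking equivalence after each one."""
    rng = random.Random(seed)
    subject, reference = TinyMap(), {}
    for _ in range(steps):
        key = rng.randrange(20)  # small key space forces collisions
        if rng.random() < 0.7:
            value = rng.randrange(1000)
            subject.set(key, value)
            reference[key] = value
        else:
            subject.delete(key)
            reference.pop(key, None)
        # After every mutation, both structures must hold the same data.
        assert subject.as_dict() == reference
    return reference

fuzz_against_reference()
```

The small key space is deliberate: forcing overwrites and deletes of existing keys is what exercises the rebalancing-style edge cases that hand-written unit tests tend to miss.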
I really think this sort of testing should be taught and done almost everywhere. It’s wild how many bugs you find with randomly generated input. It is by far the most efficient unit testing you can do, measured by bugs found per line of testing code.
According to these reports, the effectiveness of fuzzing decreased over that time period.
The majority of the bugs reported are attributed to poorly designed code, which could have been tested without fuzzing.
For example: a primary class of bugs is unbounded inputs, which could easily be found with static analysis. There's no reason to toss random strings at the program until it breaks; you can know it will break simply because that input is unbounded.
The lack of adequate traditional testing for each utility is specifically mentioned as a limitation of the studies. All fuzzing proves here is the value of traditional testing. Of course fuzzing is going to find bugs where there are inadequate tests, but there really should be tests.
I believe it was invented by Margaret Hamilton and a research assistant during the Apollo project:
"Often in the evening or at weekends I would bring my young daughter, Lauren, into work with me. One day, she was with me when I was doing a simulation of a mission to the moon. She liked to imitate me – playing astronaut. She started hitting keys and all of a sudden, the simulation started. Then she pressed other keys and the simulation crashed. She had selected a program which was supposed to be run prior to launch – when she was already 'on the way' to the moon. The computer had so little space, it had wiped the navigation data taking her to the moon. I thought: my God – this could inadvertently happen in a real mission. I suggested a program change to prevent a prelaunch program being selected during flight. But the higher-ups at MIT and Nasa said the astronauts were too well trained to make such a mistake. Midcourse on the very next mission – Apollo 8 – one of the astronauts on board accidentally did exactly what Lauren had done. The Lauren bug! It created much havoc and required the mission to be reconfigured. After that, they let me put the program change in, all right."
It was a joke. I didn’t know that about Hamilton, though.
In case her work interests you, I did get their book on Higher Order Software where she applied everything they learned. The method was like executable specifications with code generators. The specs were reminiscent of Prolog and CSP mixed together. Later, it became USL in the 001 Toolkit. It and her other papers are on htius.com.
I didn’t think HOS/USL was practical. Her team’s early work was really impressive, though. I also still respect everyone that tried to achieve the hard goal of formally-specified, bug-proof software. Each attempt teaches us lessons.
Now, if you want a practical method, the best were Design by Contract, Cleanroom Software Engineering, and Praxis' Correct by Construction. Cleanroom and CbyC hit very low defect rates. DbyC was highly pragmatic in balancing developer productivity, ease of adoption, and correctness. I think they'd all be even better combined with functional programming, static analysis, test generators, etc. that we now have. In any case, here are links to descriptions of and books on some of those methods.
Wow, thanks for all the links. I hadn't realized you really meant book - I thought by their book on Higher Order Software you meant Hamilton's and Zeldin's paper Higher Order Software—A Methodology for Defining Software [1], which I managed to get through the local university library.
https://www.folklore.org/Monkey_Lives.html