
I'm sure there is some code review for configuration changes, but clearly the engineer(s) and reviewer(s) missed the scope of the selector it was targeting. I've used Terraform and am learning Pulumi, and both provide detailed plans/previews of all changes before they are applied. I wonder how Google's process works for networking configuration. It's so vague it's hard to tell what actually happened.
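With Terraform, at least, a destructive change is spelled out in the plan before anything runs. A trimmed, hypothetical example of what a record deletion looks like:

    # aws_route53_record.www will be destroyed
    - resource "aws_route53_record" "www" { ... }

    Plan: 0 to add, 0 to change, 1 to destroy.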


We use Terraform a lot too - and most of the time it's great, but not infallible.

Our team managed to screw up some pretty major DNS due to a valid Terraform plan that looked OK, but in reality then deleted a bunch of records, before failing (for some reason I can't remember) before it could create new ones.

And of course, we forgot that although we had shortened the TTL on our records, the negative-caching TTL on the parent zone (which I think is what applies when no records are found) was much longer, so we had a real bad afternoon. :)


> but in reality then deleted a bunch of records, before failing […] before it could create new ones.

    lifecycle {
      create_before_destroy = true
    }
may be your friend :) (not sure if applicable though)
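In context, a minimal sketch (the resource, zone ID, and address are all made up; note that for records keyed by name and type, the replacement can conflict with the original, which may be why it isn't always applicable):

    resource "aws_route53_record" "www" {
      zone_id = "Z0123456789ABCDEF" # hypothetical hosted zone
      name    = "www.example.com"
      type    = "A"
      ttl     = 300
      records = ["203.0.113.10"]

      lifecycle {
        # Create the replacement before destroying the old record, so a
        # failed apply can't leave the name with no records at all.
        create_before_destroy = true
      }
    }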


Code is reviewed, but I'm not aware of any companies where terminal commands are reviewed before each execution (though maybe they should be - it seems like every major cloud outage is config-related). It sounds like the change was reviewed and approved but incorrectly pushed.


We do at AWS. Not all commands, but most of the ones we've audited and found to be dangerous.

See a similar outage in S3 from 2 years ago - https://aws.amazon.com/message/41926/


How does this code review of commands work? Does the command get saved to a file, which is then reviewed like regular source code, and once approved the command is copy-pasted back into the terminal and run?

That seems pretty clunky, so it's very likely not what happens.


I've seen scripts get checked in and deployed just like you would new service code: same code review process and same release pipeline.

In this particular case, commands run on a production machine were limited by design in what they could do and affect (mostly just the physical host they're run on, or a few hosts in the logical group of hosts they belong to).


Even with Kubernetes, you can clearly see what is deploying to which nodes. Not sure what Google's pipeline is, but I would suspect they have some "undo" function to stop the deployment.
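Stock Kubernetes does expose exactly that; a quick sketch (the deployment name here is made up):

    # Show the rollout's revision history, then roll back to the
    # previous revision if the new one is misbehaving.
    kubectl rollout history deployment/my-app
    kubectl rollout undo deployment/my-app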


When you're dealing with a system at this scale, the number of config changes, automated or human, can make determining which config caused the problem the harder issue. Once you've found out what the issue is, you probably also want the revert to go via the normal flows, for fear that your revert could exacerbate the situation. Both of those add time to remediation.


I'm guessing it was something lower level than Kubernetes/Borg, since it was able to affect all of their networking bandwidth across multiple regions. ¯\_(ツ)_/¯


The interesting tidbit in here (really the only piece of information at all) is that the outage itself prevented remediation of the outage. That indicates that they might have somehow hosed their DSCP markings such that all traffic was undifferentiated. Large network operators typically implement a "network control" traffic class that trumps all others, for this reason.
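For the unfamiliar: that class is conventionally marked DSCP CS6 and given its own strict-priority queue, so control-plane traffic still gets through when links are saturated. A rough, purely illustrative sketch in Cisco-style config (all names invented; Google's SDN gear obviously isn't configured this way):

    class-map match-any NETWORK-CONTROL
      match dscp cs6
    !
    policy-map CORE-EGRESS
      class NETWORK-CONTROL
        ! strict-priority queue, served ahead of all other classes
        priority percent 5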



