Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

> The description is vague about what devices ("servers") were misconfigured.

"servers" when said by Googlers usually means processes that serve requests, not machines. Hopefully a future postmortem will provide more details.

> How would restricted bandwidth utilization on servers cause network congestion...

This is a common problem with load balancing if you ever use non-trivial configuration. Imagine you split 100 qps of traffic between equally sized pods A and B. If each pod has an actual capacity of 60 qps and received 50 qps, then everything is fine. However, if you configure your load balancer not to send more than 10 qps to A, then it has to send the remaining 90 qps to B. Now B is actually overloaded by 50%. Using automatic utilization based load balancing can prevent this in some cases, but it can also cause it if utilization isn't reported accurately.

> Someone forgot to classify management traffic as high-priority? Oops.

I have some sympathy. During normal operations, you usually want administrative traffic (e.g. config or executable updates) to be low-priority so it doesn't disrupt production traffic. If you have extreme foresight, maybe you ignored that temptation or built in an escape hatch for emergencies. However, with a complicated layered infrastructure, it's very difficult to be sure that all network communication has the appropriate network priority, and you usually don't find out until a situation like this.



> During normal operations, you usually want administrative traffic (e.g. config or executable updates) to be low-priority so it doesn't disrupt production traffic

Honest question: is it not best practice to have an isolated, dedicated management network? I can’t for the life of me understand why a misconfig on the production network should hamper access through the admin network. Unless on Google’s scale it’s not the proper way to design and operate a network ?


On Google's scale, the networks are themselves production systems. So the question they face isn't whether to keep a single isolated network, but how long it's worth keeping the recursion going.


Presumably it's a trade off of complexity against redundancy, and at the scale that google's datacenters run the complexity is too high to make it worthwhile.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: