MASSIVE storms in the VA area, where us-east-1 is. 326,000 customers without power already; worst lightning I have seen in my 20 years of life. The sky is an intense blue/green/purple. This is most likely what the issue is.
I'm completely ignorant here. But aren't these outages usually solved by having backup servers in different locations? As many datacenters do, and as I imagine something as huge as heroku would?
Underground cables have more expensive setup costs, a shorter lifetime, and higher maintenance costs. The price you pay for electricity doesn't even come close to justifying burying power lines. There's also the ecological stuff, if you find that a reasonable argument. Bottom line: burying power cables just so you don't have to light a candle for a night isn't worth it.
Depending on where you are in the world, earthquakes are much rarer than insane storms. I'm speaking as a Floridian. I'm fairly ignorant on this issue, but would it be that difficult to use one or the other depending on which natural occurrence is more likely? Or is this also a cost issue?
"The North Carolina Utilities Commission
studied the cost of placing Duke Power’s distribution facilities underground and found it would
cost more than $41 billion, resulting in a 125 percent increase in customer rates."
Do they? Here in Germany the entire cabling within cities is underground; only the high-voltage long-distance lines are above ground. I've never heard a story about people stealing underground cables (they do steal, e.g., above-ground train-track cabling). That also wouldn't make sense: digging up those cables is much more effort than taking them down from a post.
I've also never heard stories about issues with rats.
Power outages still happen, but they are quite rare - in 30 years I can only remember twoish.
I don't know, but I've heard stories. Stealing underground cables is not common, but it has happened. And rats and groundwater are quite a problem for underground (copper) cables.
Well, until we figure out plausible ways to control weather reliably on a large enough scale, at least. Without killing the atmosphere or our species or anything like that.
I have a feeling the electromagnetic conditions are much more stable on Earth than in space. The magnetosphere and the atmosphere deflect a great deal of energy.
The Sun is far away, so you'd still get the data back before a burst hits. But if a burst made one satellite hit another, or knocked one down to Earth, it would be a worse situation.
Which is why I brought it up; it's hilarious. I thought people were just trolling at first, but man, the first time I saw it, it made my day. Relating something like "God" to natural disasters. I love how people come up with that kind of stuff.
Was watching a movie in a big 20-screen theater in Richmond, and they told everybody to just leave (and not through the emergency exits, incidentally; they funneled hundreds of people into the lobby all at once :/)
Saw this post here on HN, pulled up www.chart.state.md.us to watch the live traffic cams in the area. Clicked through a couple, some of which showed heavy rain, wind & lightning. Then the stream froze and now the site is completely unresponsive.
8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region.
8:31 PM PDT We are investigating elevated error rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone.
8:40 PM PDT We can confirm that a large number of instances in a single Availability Zone have lost power due to electrical storms in the area. We are actively working to restore power.
9:20 PM PDT We are continuing to work to bring the instances and volumes back online. In addition, EC2 and EBS APIs are currently experiencing elevated error rates.
It's a hard thing to engineer, especially after the fact, and especially when you are trying to hide it behind an abstraction layer. (Which is to say: You can't expect your customers to engineer their apps with multiregion in mind, or to take it kindly when you raise rates to support additional redundant hardware and bandwidth.)
e.g. because region-to-region data transfer is not free, and trans-region latency is ugly, you can't just relaunch half your instance farm in another region and expect happiness. There are also routing issues: Internal IPs don't work across regions, elastic IPs don't transfer across regions...
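To make those routing pains concrete, here's a minimal sketch of a cross-region relaunch using boto3 (a later library; boto was the 2012-era equivalent). The AMI ID, instance type, and instance count are hypothetical placeholders, and real failover would also need data replication, which this ignores:

```python
# Sketch: relaunching capacity in a fallback region with boto3.
# Region names are real AWS regions; the AMI ID and counts are
# hypothetical. Note the two pain points from the comment above:
# AMIs and Elastic IPs are region-scoped, so the image must be
# copied over and new addresses allocated on the other side.
import boto3

PRIMARY = "us-east-1"
FALLBACK = "us-west-2"
SOURCE_AMI = "ami-0123456789abcdef0"  # hypothetical image in us-east-1

def fail_over(instance_count=4):
    dst = boto3.client("ec2", region_name=FALLBACK)

    # 1. AMIs don't exist across regions: copy the image first.
    copied = dst.copy_image(
        SourceRegion=PRIMARY,
        SourceImageId=SOURCE_AMI,
        Name="failover-copy",
    )
    ami = copied["ImageId"]
    dst.get_waiter("image_available").wait(ImageIds=[ami])

    # 2. Launch replacements in the fallback region.
    run = dst.run_instances(
        ImageId=ami,
        InstanceType="m1.large",
        MinCount=instance_count,
        MaxCount=instance_count,
    )
    ids = [i["InstanceId"] for i in run["Instances"]]

    # 3. Elastic IPs don't transfer across regions either:
    #    allocate fresh ones and repoint DNS at them instead.
    for iid in ids:
        addr = dst.allocate_address(Domain="vpc")
        dst.associate_address(InstanceId=iid,
                              AllocationId=addr["AllocationId"])
    return ids
```

Even with all of that scripted, internal IPs baked into configs and the cost of re-replicating data are still on you, which is the commenter's point.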
Even if they can't do it right away, they should communicate a plan for how they are going to tackle this recurring issue. That's the whole point behind status.heroku.com / trust.salesforce.com. They are part of a publicly traded corporation with a lot of resources.
Extremely nerve-wracking for new startups like ours.
But anyone deploying a critical application to AWS makes a point of cross-region data replication. Heroku have long known that they lose potential customers to, say, Engine Yard as a result of only hosting at US-East.
One can only conclude that this is a clear business decision on their part. I can hardly believe that Heroku's engineers are incapable of it. Indeed I would be very surprised to learn that they haven't brought up an instance of their platform at, say, US-West, for testing or proof-of-concept purposes.
Of course, productising that is a different matter. Extending the control plane, front end, and pricing/billing systems might have considerable associated project cost. Perhaps they have concluded that the costs outweigh the additional revenue. Or, just haven't got around to it yet.
An appropriate title might be "Heroku is down due to an AWS outage, which is due to a power failure, which happened due to storms caused by moist winds colliding with hot air heated over the continent by the sun that....". It really doesn't matter. Heroku is down. Customers don't care.
Even now that it's updated, it has yellow triangles for "performance issues" instead of a red circle for "service disruption". Seems like they're in denial.
This was the disappointing thing for me as well. Our connectivity died around 8PM EST-ish, and I immediately went to status.aws and it said everything was normal. I then proceeded to waste half my night looking at our internal infrastructure trusting that page was accurate.
At 11:25 Eastern, https://status.heroku.com/incidents/386 was posted: "We're currently experiencing a widespread application outage. We've disabled API access while engineers work on resolving the issues."
Slightly different scenario, however: the power was shut off by the fire marshal, if I recall correctly.
Rackspace (and many, many other co's) tend to have functional UPS units & generators. Amazon tends to choose the cheapest datacenter facility imaginable and then these sort of failures occur.
Given their size they'll inevitably fix the power issues, though -- they've got the finances and they're capable of adding a few more levels of redundancy.
I found the reports about the outage - it was 2007 (so obviously much more than a year ago) but very similar to one of Amazon's recent outages - a truck took out a transformer, Rackspace fired up backup power, but the cooling failed to start, so Rackspace had to shut it all down to avoid melting everything.
Looks like Amazon wasn't the only one with inadequate testing of their continuity plan. And I don't think Rackspace offered alternate Availability Zones at that point.
I think Netflix are expecting another cloud to offer the same model and API as Amazon though, which isn't likely to happen - everyone else is learning from AWS's mistakes!
Even if it did, many of the features they're waiting for (like auto-scaling groups) probably wouldn't be as useful in a multi-cloud environment, and would therefore have to be built into Asgard.
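For context, here's a minimal sketch of what the feature in question looks like against the actual AWS API (via boto3; all the names below are hypothetical). It's this AWS-specific surface that a tool like Asgard would have to re-implement to work against another cloud:

```python
# Sketch of an Auto Scaling group spanning several AZs in one region.
# Group, launch-config, and AMI names are hypothetical placeholders.
import boto3

asg = boto3.client("autoscaling", region_name="us-east-1")

asg.create_launch_configuration(
    LaunchConfigurationName="web-lc",         # hypothetical
    ImageId="ami-0123456789abcdef0",          # hypothetical AMI
    InstanceType="m1.small",
)

asg.create_auto_scaling_group(
    AutoScalingGroupName="web-asg",           # hypothetical
    LaunchConfigurationName="web-lc",
    MinSize=2,
    MaxSize=8,
    # Spreading across AZs survives a single-AZ power loss like
    # tonight's -- but only within the region; it does nothing for
    # a region-wide or provider-wide failure.
    AvailabilityZones=["us-east-1a", "us-east-1b", "us-east-1c"],
)
```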
- One AZ is down
- API commands are spotty and may return incorrect results (see the retry sketch after this list)
- ELB looks screwed
- IP reassignments don't seem to be working
- Who knows what the fuck else is broken
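When the control-plane APIs are flaky like this, about the only defense is retrying with backoff so transient errors don't get mistaken for real ones. A minimal sketch, assuming boto3 and treating the EC2 call as just an example operation:

```python
# Sketch: retry wrapper with exponential backoff for a flaky API.
import time

import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

def with_backoff(call, attempts=5, base_delay=1.0):
    """Retry `call` on transient errors, doubling the delay each time."""
    for attempt in range(attempts):
        try:
            return call()
        except (ClientError, EndpointConnectionError) as exc:
            if attempt == attempts - 1:
                raise  # out of attempts; let the caller see the error
            delay = base_delay * (2 ** attempt)
            print(f"API error ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)

ec2 = boto3.client("ec2", region_name="us-east-1")
reservations = with_backoff(lambda: ec2.describe_instances())
```

Of course, retries don't help with the "may return incorrect results" part; nothing client-side does.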
Just tried using filepicker.io and it seems to be down too.
Beginning to feel pretty lucky though -- this is at least the 4th AWS-East outage that has made enough of a splash to notice but missed my instances. Upgrading to multiple availability zones was scheduled for Monday anyway.
A simple solution to this is to have a backup or failover to a non-AWS datacenter too; basically, don't be dependent on just one datacenter. E.g. MS Azure/Google/Rackspace.
This not only spreads your risk but keeps your customers happy.
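The simplest version of that failover is a health check that flips DNS to a standby at another provider. A rough sketch: the URL and IP below are placeholders, and `update_dns_record` is a hypothetical hook for whatever DNS provider's API you actually use.

```python
# Sketch: poll the primary deployment, flip DNS to a standby host
# at another provider when it stops answering.
import time
import urllib.request

PRIMARY_URL = "https://app.example.com/health"  # hypothetical
STANDBY_IP = "203.0.113.10"                     # hypothetical non-AWS host

def primary_healthy(timeout=5):
    try:
        with urllib.request.urlopen(PRIMARY_URL, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, timeouts, refused connections
        return False

def update_dns_record(name, ip):
    # Hypothetical: call your DNS provider's API here.
    print(f"Would point {name} at {ip}")

while True:
    if not primary_healthy():
        # Keep DNS TTLs short, or the switch takes too long to matter.
        update_dns_record("app.example.com", STANDBY_IP)
        break
    time.sleep(30)
```

The catch is that the standby has to actually work, which means replicating data to a second provider continuously, not just pointing DNS at it.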
Google App Engine's just fine. Don't know how many AZs I'm on and don't want to know :))
The more I see of Amazon failures, the more I think VMs are just not high enough in the abstraction layer for me...
I just lost a potential hire because of this. I was demoing my app to her and it wasn't working, and she thought it was because of the product. Damn you, Heroku!
The little red ribbon that you pull to get the AA batteries out is stuck underneath - they are looking for a pen to flick the battery out, but since everyone switched over to Fire tablets there aren't any pens.
Idea for Heroku: allow customers to host a "my app is down for blah blah reason" page somewhere else (Rackspace, I guess?). Who thinks this would be useful? My users see a blank page right now when they go to ZeTrip; I'd rather show them a static page saying: "our site is down due to Amazon's lack of redundancy."
Cloudflare lets you do this afaik. I'm not sure I'd trust a service to show a proper 'this site is temporarily down' page when something very bad has happened.
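If you'd rather not trust a third party for it, a static down page is small enough to run yourself on any box outside the failed platform. A minimal stdlib sketch (the message text is a placeholder):

```python
# Sketch: a tiny standalone "we're down" page, meant to run
# somewhere *other* than the failed hosting platform.
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"""<html><body>
<h1>We'll be right back</h1>
<p>Our site is down due to an upstream hosting outage.</p>
</body></html>"""

class DownPage(BaseHTTPRequestHandler):
    def do_GET(self):
        # 503 tells crawlers and clients this is temporary.
        self.send_response(503)
        self.send_header("Content-Type", "text/html")
        self.send_header("Retry-After", "3600")
        self.end_headers()
        self.wfile.write(PAGE)

if __name__ == "__main__":
    HTTPServer(("", 8080), DownPage).serve_forever()
```

Point DNS at it (or have your failover flip to it) and at least users see an explanation instead of a blank page.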
I can't believe the guys at Heroku are not ready for situations like this!
They rely ONLY on Virginia instances because it's the cheapest option, without caring about customers... or thinking about replicating their services across multiple locations for exactly these kinds of issues!