We just wanted to take a minute to address the brief downtime we experienced on Friday evening that delayed some messages throughout the weekend.
Here’s developer Brandon Keene with an elegant postmortem on the technical side of things:
Our hosting provider, Heroku, fixed a security vulnerability last week. In response, a Heroku-affiliated vendor, RedisToGo, changed our password without warning or notification. This caused our connection to their Redis service to fail, which crashed the app. We didn’t lose any texts, but the site was unavailable for about 30 minutes. To recover, we updated our credentials and brought the app back up.
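For readers curious what guarding against this kind of failure can look like, here is a minimal Python sketch. It is not our production code: `GuardedStore`, `StoreUnavailable`, `make_connect`, and `handle_request` are hypothetical names invented for illustration, and the "store" is a stand-in rather than a real Redis client. The idea is only that a rejected password is turned into a temporary "unavailable" response instead of an unhandled crash.

```python
class StoreUnavailable(Exception):
    """Raised when the backing store rejects or drops the connection."""


class GuardedStore:
    """Wraps a connect function so a failed login does not crash the caller."""

    def __init__(self, connect):
        self._connect = connect      # hypothetical callable returning a client
        self._client = None
        self.last_error = None

    def available(self):
        # Reuse an existing client; otherwise try to connect again.
        if self._client is not None:
            return True
        try:
            self._client = self._connect()
            self.last_error = None
            return True
        except StoreUnavailable as exc:
            # Remember why we failed so it can be reported or alerted on.
            self.last_error = str(exc)
            return False


def make_connect(expected_password, credentials):
    """Build a fake connect function that checks a password (illustration only)."""
    def connect():
        if credentials["password"] != expected_password:
            raise StoreUnavailable("authentication failed")
        return object()  # stand-in for a real client object
    return connect


def handle_request(store):
    """Answer with a 503-style status while the store is unreachable."""
    if not store.available():
        return 503, "temporarily unavailable: " + store.last_error
    return 200, "ok"


if __name__ == "__main__":
    credentials = {"password": "old-secret"}
    store = GuardedStore(make_connect("new-secret", credentials))
    print(handle_request(store))           # the vendor rotated the password
    credentials["password"] = "new-secret"
    print(handle_request(store))           # credentials updated, service recovers
```

A wrapper like this does not prevent the outage, since the password is still wrong until someone updates it, but it keeps the failure visible and contained.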
Because this was an emergency deploy, we missed an error caused by a deleted type of delayed job. The missing job type crashed our job workers, so texts were queued but not delivered. Once we realized this, we fixed the error and worked through the backlog of queued jobs. As a result, some texts were sent much later than intended.
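To illustrate the second failure, here is a small Python sketch of a worker that sets aside jobs whose type it no longer recognizes instead of crashing on them. This is an assumption-laden example, not a description of our actual workers: the handler registry, the `drain` function, the dead-letter list, and the `legacy_digest` job type are all hypothetical.

```python
from collections import deque

# Registry mapping job type names to handler functions (hypothetical design).
HANDLERS = {}


def job(name):
    """Decorator that registers a handler under a job type name."""
    def register(fn):
        HANDLERS[name] = fn
        return fn
    return register


@job("send_text")
def send_text(payload, outbox):
    # Stand-in for real delivery: record the message body.
    outbox.append(payload["body"])


def drain(queue, outbox, dead_letters):
    """Process every queued job, parking unknown job types instead of raising."""
    while queue:
        item = queue.popleft()
        handler = HANDLERS.get(item["type"])
        if handler is None:
            # The job type was deleted from the code base; keep the job for
            # later inspection and move on to the next one.
            dead_letters.append(item)
            continue
        handler(item["payload"], outbox)


if __name__ == "__main__":
    queue = deque([
        {"type": "send_text", "payload": {"body": "hi"}},
        {"type": "legacy_digest", "payload": {}},  # hypothetical deleted type
        {"type": "send_text", "payload": {"body": "bye"}},
    ])
    outbox, dead_letters = [], []
    drain(queue, outbox, dead_letters)
    print(outbox)
    print(len(dead_letters))
```

The design choice here is that one bad job should never block the deliverable ones queued behind it; the parked jobs can be reviewed and replayed or discarded afterwards.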
We’ve fixed these particular failures and have increased monitoring so that we catch and resolve issues like these more quickly.
In short, we were caught off-guard by one of our providers, the app went down, and once we got it back up, it took longer than it should have for the backlogged texts to be sent. We’re incredibly sorry for any inconvenience this may have caused.
It’s fixed now, and we’re doing everything we can to ensure it won’t happen again. If you’re having trouble with GroupMe at any point, or have any suggestions or questions, we’d love to talk to you and figure it all out. Just send an email to firstname.lastname@example.org.
By the way, we always send out alerts on our Twitter when we’re experiencing technical issues, so follow us there, and you’ll always be in the loop.
Thanks for your patience and support, and thanks for using GroupMe!