This is a public post-mortem that was intended to be released on the engineering blog of MercadoLibre, but my departure (and the slowness of the editing) prevented that.
The event
On December 28, 2018, the year's last Friday, our shipping calculator triggered an alert. This component is responsible for delivering the estimated delivery time along with the delivery options for a given article.
In our New Relic monitors we saw that the cause was a severe degradation of the API response times. I guess it was time to leave the panettone leftovers aside and start diagnosing the issue.
Our metrics pointed to a specific flow, SLA Coverage Brazil (carrier coverage by service level agreement). Under the hood, one of its dependencies was the culprit: service coverage.
The diagnosis
So, at this point, our monitors showed that:
- In service coverage, the Apdex and the execution time metrics degraded by 100%, simultaneously across the whole instance pool, within a window of just a couple of minutes
- The service failure triggered retries, adding an extra 300K requests per minute (RPM) and (not so) gracefully turning into a cascading failure
In the meantime, we were manually rebooting instances and adding new ones to the pool in an attempt to restore the service's health. Additionally, we started observing replaced instances (that is, an instance that dies due to a process error and gets replaced by a new one).
From this point, the following events unfolded:
- We identified test requests that had started before the event
- We located the client and stopped that traffic
- The instances slowly recovered and the Apdex went back to green
Amid the confusion this created (the traffic was just a couple dozen requests), we retraced our steps and found the real cause of the downtime, something we had discarded at the beginning: the panics.
When checking the service coverage logs we had seen panics, but we quickly dismissed them due to their frequency and the fact that they didn't seem related to our particular problem.
What happened?
“…panics do not propagate between goroutines, and must be handled independently.”
When inspecting the panic stack traces, we found the same pattern:
- The stack frames did not include any of our recovery mechanisms at all, so this wasn’t being handled
- The offending code came from a goroutine
As you can see in this issue, one design detail of Go that is easily overlooked at first is that panics do not propagate between goroutines and must be handled independently.
When a panic happens inside a Go application and nothing recovers it, the process terminates abnormally (exit code ≠ 0).
Our web application stack included a recovery middleware covering everything that happens in the incoming request's goroutine, but any new goroutine must have its recovery mechanism defined explicitly.
Implications (a.k.a. panics for dummies)
If a panic occurs in a goroutine without a recover(), the caller isn't going to handle it, no matter how many recovery frames it has on its own stack, and the application will die.
It is a known issue and a design decision that won't change in Go 1.x (fingers crossed for Go 2.x).
Compare the following examples. This one has a propagated (unhandled) panic:
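Something along these lines (a minimal sketch; the sync.WaitGroup is only there to keep main alive):

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	// This recover only protects main's own goroutine.
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered in main:", r)
		}
	}()

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		// No recover in this goroutine: the whole process dies,
		// no matter how many recovers the caller has on its stack.
		panic("boom")
	}()
	wg.Wait()
}
```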
And this one handles the panic. You may notice that this code is in need of some composability:
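A sketch of that (same structure, with the recover moved into the spawned goroutine):

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		// The spawned goroutine defines its own recovery, explicitly.
		defer func() {
			if r := recover(); r != nil {
				fmt.Println("recovered inside the goroutine:", r)
			}
		}()
		panic("boom")
	}()
	wg.Wait()
	fmt.Println("the process survives")
}
```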
Thinking about solutions
A couple of days after the incident, I gathered the team with some alternatives and let the votes decide which one was best for our situation (checking and guarding a set of N API calls across several repositories).
There were 3 candidates:
The obvious and hard
Not generating panics inside the spawned goroutines. I wanted to patent that idea, but I had the feeling Rob Pike already did.
This is the ideal way, but in the real world it is not always possible. Most of the time there are third-party dependencies or I/O-driven packages that use panics as their error-signaling mechanism. Getting rid of those is beyond us.
The idiomatic and ugly
How about adding this lil' guy at the beginning of every new goroutine we spawn in code?
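Roughly this (a sketch: logging via the standard log and runtime/debug packages is just one option, and doWork stands in for the goroutine's real job):

```go
go func() {
	// First thing in every spawned goroutine: its own recovery guard.
	defer func() {
		if r := recover(); r != nil {
			log.Printf("recovered from panic: %v\n%s", r, debug.Stack())
		}
	}()

	doWork()
}()
```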
It is going to work, with the disadvantage of breaking the DRY rule and becoming an extra point of failure for specialized handlers.
Remember what I said about composability? We can mitigate that with a handler like the following, as long as f is passed as an argument to another function that launches the goroutine, or we launch it ourselves:
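A sketch of such a handler (the name withRecover and the logging choices are mine, for illustration):

```go
// withRecover wraps f so that a panic inside it is recovered and logged
// instead of killing the whole process.
func withRecover(f func()) func() {
	return func() {
		defer func() {
			if r := recover(); r != nil {
				log.Printf("recovered from panic: %v\n%s", r, debug.Stack())
			}
		}()
		f()
	}
}
```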
We’d use it the following way:
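For instance (assuming the withRecover sketch above):

```go
// We launch the wrapped function ourselves, or pass it to whatever
// function ends up spawning the goroutine for us.
go withRecover(func() {
	doWork() // doWork stands in for the code that might panic
})()
```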
This way, we centralize the panic handling. And last but not least, in case we want to return something from said handler, we communicate through channels:
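A sketch of that, where the goroutine reports either a value or the recovered panic back to the caller (the result type and compute function are illustrative):

```go
package main

import (
	"fmt"
	"log"
	"runtime/debug"
)

type result struct {
	value int
	err   error
}

func main() {
	out := make(chan result, 1)

	go func() {
		// Turn a panic into a value the caller can handle.
		defer func() {
			if r := recover(); r != nil {
				log.Printf("recovered from panic: %v\n%s", r, debug.Stack())
				out <- result{err: fmt.Errorf("panic: %v", r)}
			}
		}()
		out <- result{value: compute()}
	}()

	res := <-out
	fmt.Println(res.value, res.err)
}

// compute simulates a dependency that misbehaves.
func compute() int {
	panic("boom")
}
```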
This simple pattern is idiomatic, but it can accumulate code smells quickly as the code gains complexity or use cases, so use it judiciously.
Bonus track: Infrastructure
In order to be resilient, a service must be able to respond quickly to failures. Throwing away the entire instance and requesting a new one is far from an ideal solution (in this particular case, simply restarting the dying web server with a watchdog script could've been orders of magnitude faster than the default workflow).
Bonus track II: Spotting unrecovered panics
This was a dump of our middleware stack (notice RecoveryWithWriter being the recovery middleware):
[middleware stack dump]
And the unhandled panic stack trace in all of its glory:
[unhandled panic stack trace]
The behavior is given away by two things:
- The code never passes through the recovery middleware
- The panic occurs inside a goroutine launched by an errgroup.Group instance, as in the sketch below
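A minimal sketch that reproduces this failure mode (the nil dereference stands in for whatever actually blew up in the real code):

```go
package main

import (
	"context"

	"golang.org/x/sync/errgroup"
)

func main() {
	g, _ := errgroup.WithContext(context.Background())

	// g.Go launches a plain goroutine under the hood: a panic in there
	// never reaches the recovery middleware sitting on the HTTP goroutine.
	g.Go(func() error {
		var coverage *string
		_ = *coverage // nil dereference -> panic -> the whole process dies
		return nil
	})

	_ = g.Wait()
}
```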