It is 10 in the morning. Approaching at 2 o’clock, a feature demo. We are supposed to show some of last developments on one of the newest micro-service of the platform.
The code worked nice before. We did not touch the implementation of the service since the last tests. There is no reason why it should fail. Still, I decide to do a last manual test on our demo environment.
It fails miserably. The service is not even responsive. After two hours debugging the problem, it turned out the cause was ridiculously and surprisingly small.
Every horror story needs a context
My company has been doing a shift toward the use of Service Oriented Architecture in the past few years. In this context, we recently built a micro-service A that maintains a view on some financial information, to have a representation that offers better performance for some common query patterns.
The service A is a stateless service, fed by a message broker to receive a feed of real-time updates (from services X and Y below). It aggregates asynchronously these updates and serves queries on the result to other services (service Z below) via a REST API:
The goal of the demo is to trigger an update on service X (via GUI manipulations) and show that it correctly updates the view maintained in service A. Technically, the goal is to demonstrate that service X correctly sends update messages to service A and that service A correctly processes them.
Investigating by following the log trail
Unfortunately, when I tried running the test case just before the demo, service A was not reacting to any changes made in service X…
The service A was not responding to updates as expected. It would still service queries though. It is just that the queries would return data that remained invariably stale. The view maintained by the service A was not impacted by GUI updates performed on service X.
I therefore followed the rule of the 4 Rs:
But it did not work. Not this time. I had just a few tens of minutes left to investigate, and to find the guilty service before the demo. Time to look into log files…
Production side? No. Consumer side? No.
I first looked into the log of the service X, hoping to find why it did not send any update messages to service A. Instead, I found that the service X correctly sent the messages.
So I looked the logs on the consumer side, hoping to find why the service A did not receive or process any of the messages it should have been receiving from service X. I could not find the expected message in particular, but found that the service A was being overwhelmed with tons of messages from the broker.
So I looked at these messages in details. To my surprise, they all looked the same! The service A was consuming tons of messages (several hundreds per second), but these messages were always the same ten or so.
Acknowledgement issue? No.
My next hypothesis was that service A was not correctly acknowledging the messages it received. In such case, the message broker would always serve the same messages over an over again.
I verified the logs of the service A and it was not the case: A was correctly acknowledging the messages sent to it. The broker was correctly dropping these messages after consumption.
I also went back again to the of logs of service X, to check whether it did sent the messages more than once. But it was only sending one message for each update.
Finding the guilty emitter
Something, somewhere, must be sending these same messages over and over again. They are not appearing out of thin air. I consulted the monitoring API of the messaging system to make sure I was not having an hallucination. I was not. There ware indeed tons of messages being transmitted.
I stopped, one by one, each and every service that sends data to the service A through the messaging system, until I noticed a reduction of the traffic. And so I found the guilty service. It was service Y.
Once Y was shut down, I saw the queue being slowly consumed by service A. A few minutes later, the whole garbage was gone. And finally the update of service X was processed.
No message was lost in the making
Now at this point, it was pretty clear why service A was not very responsive to updates. Actually, it was responsive but completely overwhelmed by the traffic it was facing. The resulting latency was so high that the service was not able to react to these updates in real-time.
It is also clear why the error first looked like the lost update I first suspected. Because the service has an idempotent business logic, the 10 repeating messages had no observable effect (other than overwhelming the service). And because the service has separate thread pools for processing updates and serving queries, the query side was still responding with stale data.
The sweeper is flooding the system
It was time to understand why service Y suddenly started to act as a rogue, flooding the messaging infrastructure with a continuous flow of identical messages. But first, we have to explain what service Y is for.
Sweeping unsent messages
The service Y, guilty of flooding the messaging system, is a technical service that acts as a Sweeper. Its goal is to make sure that messages that have not be sent to the broker (because the broker was not alive or not reachable) will eventually be sent to the broker.
Basically, if service X wants to send a message to the service A, it needs the messaging infrastructure (the message broker) to be alive. When the broker is down, service X falls back to writing the message in the Sweeping table, a table stored in database (*).
The Sweeper (the service Y) has then for only task to poll periodically the sweeping table, try to transmit these unsent messages, and delete (or mark as sent) these messages from the sweeping table if successful.
(*) The sweeper table has another important role as well. By writing in the sweeper table inside a DB transaction local to the sending service, we make sure that we only send messages if the commit was successful.
Diving in the Sweeper code
In principle, the code of the Sweeper is really simple. It regularly polls the sweeping table, fetch the unsent messages, sends them, wait for the acknowledgment of the message broker, and then mark these messages as being sent (we mark them as sent after acknowledgement for guaranteed delivery).
In our case, the code was written in Java, and roughly looked like this:
- The load and save of unsent entries is done using JPA (via the message repository of JPA)
- The messages are sent using JMS (the Session object passed as parameter)
These might seem as details, but they are not. The failure mode of our APIs are often what makes the difference in how a failure propagate in our system.
JPA saving errors …
It turns out that JPA has an “interesting” error management. When JPA tries to save an object, it throws an exception if the ID of the object is not unique in the table. Basically, JPA uses the ID to know which record to replace, and does not know which one to replace, so it panics.
The overall issue started like this: one service managed at some point to corrupt the Sweeping table. The direct consequence was that two messages now shared the same ID in the sweeper table (yes, there should have been a unique index, we will come back to that).
So when the sweeper encountered these messages, it managed to load them, send them, but failed to save them, marked as “sent” for there were two messages with the same ID. So the next sweeper loop would load the same two messages, send them again, and fail to save them again.
Basically, we got an infinite loop sending the same messages over and over again.
… Interacting with transactions …
This infinite loop does not yet explain everything. There were 2 corrupted messages. Why are there 10 messages being sent repeatedly instead of 2?
A quick look at the code gives us the answer: the sweeper marks all the messages as sent in the same DB transaction. Therefore, a single corrupted message makes the whole transaction fails.
As a consequence, instead of just sending the two corrupted messages over and over again, the Sweeper sends all the unsent messages of the Sweeping table over and over again. And guess what? There were about 10 messages in the sweeping table during the incident.
Note: This behavior was motivated by performance. Marking all the messages as sent inside a single commit limits the impact on the database. It was an interesting bet knowing that sending the same messages twice – occasionally – is not a problem: idempotent receivers were already needed by the “at least once” guarantee of our messaging infrastructure.
… and being swallowed by a pokemon catch
To finish up, there was a last issue that made everything much worse. The Sweeper iteration loop was surrounded by a wonderful Pokemon try-catch, logging and swallowing all exceptions:
Now, for sure, not swallowing the exception would not have solved the issue. The Sweeper would have crashed upon marking the messages as sent, and then would have been revived by the supervisor, and then would have sent the messages again before crashing again.
In short, the sweeper would still have flooded the messaging system. But at a much lower rate. And the fact that it crashed repeatedly would have shown on the monitoring.
Instead, absorbing the exception made the identification of the root cause of the system failure much less obvious. The first visible sign was another service that stopped responding.
Asymmetric errors, Pokemon catching and system failures
In retrospect, there are many things that went wrong that we could have prevented. We are still in a learning process. This was only the first demo of an ongoing development.
Dealing with errors. Not anticipating them.
Distributed systems are a weird beast. This experience clearly reaffirmed in us the fact that we cannot anticipate most errors in a distributed systems.
As illustrated here, some errors are not even necessarily symmetric. We may be able to connect to the DB to load a record, but not to save the record, the same way that we may be able to send a message through a network link but not receive an answer. Some errors are out of our control (another service corruption our DB in our case).
Therefore the important thing is to put in place mechanisms that limit errors from propagating and lead to a system failure, and try to detect and report these errors as soon as possible.
Some measures we took
To solve our issue, we started by adding a unique index on the sweeper table to prevent corruption of the message IDs and identify the source of the corruption (which we still haven’t found yet).
We customized our JPA repository to use a JQL update where request. Now, whenever there are messages with conflicting IDs in the sweeper table, both duplicates will be marked as “sent” (we will lose a corrupted message instead of losing the entire system):
We are adding back-pressure on the broker message queue, so that we get clear overflow errors when the situation gets out of control. And we removed the Pokemon catching try-catch block 🙂