How small programming faults led to overflowing an entire system

2018-07-27T19:38:02+00:00

Shouldn’t you also implement a better way to figure out which service is flooding service A than stopping and resuming each service one by one?


2018-07-27T20:48:11+00:00

Hi, and thank you for the comment 🙂 Indeed, this is probably something we should look into.


	private void pollSweeperTable(Session session) {
		List<Message> unsentMessages = messageRepository.fetchUnsentMessages();
		for (Message message : unsentMessages) {
			sendMessage(session, message);
		}
		messageRepository.markSent(unsentMessages);
	}

	private void pollSweeperTable(Session session) {
		List<Message> unsentMessages = messageRepository.fetchUnsentMessages();
		for (Message message : unsentMessages) {
			sendMessage(session, message);
		}
		messageRepository.markSent(unsentMessages); // One DB transaction
	}

	try {
		pollSweeperTable(session);
		// …
	} catch (Exception e) {
		// Log the exception
	}

	@Modifying
	@Query(
		value = "update SweeperTable set sent = 1 where messageId in (:messageIds)",
		nativeQuery = true)
	void markSent(@Param("messageIds") List<Long> messageIds);
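The snippets above send the whole batch first and only then flag it as sent in a single update, inside a catch-all `try`. One possible alternative, sketched here under assumptions and not taken from the actual service, is to mark each message right after it has been sent and to report failures per message, so that an error on one row does not leave the rest of the batch flagged as unsent. `SweeperSketch`, its `Message` class, `sweep` and the two callbacks are hypothetical names used only for illustration; they stand in for `sendMessage` and `markSent`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.LongConsumer;

// Hypothetical sketch: none of these names come from the original service.
final class SweeperSketch {

    /** Minimal stand-in for a row of the sweeper table. */
    static final class Message {
        final long id;
        final String payload;

        Message(long id, String payload) {
            this.id = id;
            this.payload = payload;
        }
    }

    /**
     * Sends each unsent message and marks it as sent immediately afterwards.
     * A failure only affects the message being processed: it stays unsent and
     * its id is returned, so the caller can log it or raise an alert.
     */
    static List<Long> sweep(List<Message> unsent,
                            Consumer<Message> sender,
                            LongConsumer markSent) {
        List<Long> failedIds = new ArrayList<>();
        for (Message message : unsent) {
            try {
                sender.accept(message);
                markSent.accept(message.id); // assumed: one small update per message
            } catch (RuntimeException e) {
                // Narrow catch scope: the loop continues with the next message,
                // and only this one is picked up again on the next poll.
                failedIds.add(message.id);
            }
        }
        return failedIds;
    }
}
```

With this shape, a message whose update fails after a successful send is still emitted again on the next poll, but the duplication is limited to that message, and the returned ids make the failure visible to the caller.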


Every horror story needs a context

Investigating by following the log trail

The symptoms

Production side? No. Consumer side? No.

Acknowledgement issue? No.

Finding the guilty emitter

No message was lost in the making

The sweeper is flooding the system

Sweeping unsent messages

Diving into the Sweeper code

JPA saving errors …

… Interacting with transactions …

… and being swallowed by a Pokemon catch

Asymmetric errors, Pokemon catching and system failures

Dealing with errors. Not anticipating them.

Some measures we took
