Algorithms as design tools

Today’s post tells a true war story: the story of some hard-to-read and slow code, patiently reworked into a faster and more readable version.

This post will first walk through the steps it took to get there. Then we will step back and conclude by discussing a trick I found useful to improve code in general: thinking of algorithms as design tools, not just as implementation tools.

 

The initial code


The whole journey will be based on the following snippet of C++ code, inspired by the original code I had to refactor. I simplified it for the exercise, keeping only the parts we need to focus on.

bool bulk_save(vector<Data const*> const& data_set)
{
   bool error = false;
   DbTransaction tx("bad_save");

   for (auto it = begin(data_set);
        !error && it != end(data_set); ++it)
   {
      auto* d = *it;
      error = !is_consistent(d);
      if (error)
      {
         tx.rollback();
      }
      else
      {
         if (has_id(d))
            update_in_db(d, tx);
         else
            insert_in_db(d, tx);
      }
   }
   return !error && tx.succeeded();
}

 
The purpose of this code is to store records in the database. Some of the records are new and must be inserted. Some of the records already exist and must be updated.

The function opens a transaction at the beginning, loops through all the records, checks their consistency, then inserts or updates them in the DB, and finally closes the transaction. If any inconsistent record is found, the transaction is rolled back.

This code has several code smells:

  • A non-standard stop condition in the for loop
  • A variable that impacts the control flow of the program
  • Too many levels of indentation
  • An implementation that does not show the intent

All of this makes it harder for the reader to understand the code. In particular, non-standard stop conditions and control variables are typical bug nests.

As we will see later, there is also a performance issue lurking there. This performance defect was the reason why the code was reworked in the first place.

 

Step 1: Simplify the initial code


The first step is to simplify the code. I always try to proceed with minor changes on a piece of code to increase my understanding of it. If it breaks, I get some vital information, otherwise I just get better code.

In our case, a simple way to improve the code is to early return upon finding an inconsistent data record. It removes the need for the control variable error, and allows us to use a standard range-based for loop.

This leads to the following code:

bool bulk_save(vector<Data const*> const& data_set)
{
   DbTransaction tx("bad_save");
   for (auto* d : data_set)
   {
      if (!is_consistent(d))
      {
         tx.rollback();
         return false;
      }
      if (has_id(d))
         update_in_db(d, tx);
      else
         insert_in_db(d, tx);
   }
   return tx.succeeded();
}

 
Now that the situation is slightly improved, it is time to move to step 2.

Note: early returns are quite potent at improving code quality. If you want to know more, have a look at this article from Simplify C++.

 

Step 2: Identify and separate concerns


The next step is to separate concerns. Separating concerns is a necessary step before proceeding with making our code more declarative.

As we will see, if we try to make the code more declarative before separating concerns, there is a good chance we will not be able to find the right abstractions.

We have already identified two concerns that need to be decoupled from one another:

  • The consistency check on all the Data records
  • The actual work of updating the DB (inserts and updates)

We can separate them by splitting the loop in two:

bool bulk_save(vector<Data const*> const& data_set)
{
   for (auto* d : data_set)
   {
      if (!is_consistent(d))
         return false;
   }
  
   DbTransaction tx("good_save");
   for (auto* d : data_set)
   {
      if (has_id(d))
         update_in_db(d, tx);
      else
         insert_in_db(d, tx);
   }
   return tx.succeeded();
}

 
Note: If you are worried by the performance impact of introducing a second loop over the data, the next section addresses this concern.

Now, and only because we separated concerns, we can easily identify one abstraction: the all_of algorithm (or any_of, depending on your taste).

bool bulk_save(vector<Data const*> const& data_set)
{
   if (!all_of(data_set, is_consistent))
      return false;

   DbTransaction tx("good_save");
   for (auto* d : data_set)
   {
      if (has_id(d))
         update_in_db(d, tx);
      else
         insert_in_db(d, tx);
   }
   return tx.succeeded();
}

 
Had we tried to use algorithms before separating the concerns, we would have ended up with nothing more than a for_each.

 

Side note on performance


When I first showed this code, I immediately got some remarks on performance: instead of one loop, we now have two loops. How can we justify such a waste?

While it is true that we go through the vector one more time, the two-loop approach also brings something to the table:

  • If an inconsistent record is found, no transaction is opened at all.
  • If an inconsistent record is found, no rollback is triggered.
  • In every case, the transaction is shorter than it previously was.

Is it a gain or a loss then? It all depends. How many inconsistent records are there on average? How expensive is the call to is_consistent?

The only answer to this performance concern is to measure, using representative sets of data. For this code, the impact of the additional loop was not noticeable, while putting less pressure on the database is always a win for us.

Trying to aim for short critical sections almost always pays off. This is true for locks, DB transactions, STM transaction blocks, etc.

 

Step 3: Revisit and Improve


The first steps have already helped us clean the function quite a bit. We did it by separating concerns and identifying the right abstractions.

And we can go further in the decoupling:

  • We can partition the Data records to insert into the DB from the ones to update.
  • We can perform this partitioning outside of the DB transaction.
  • We can then leverage bulk insert or bulk update operations (if available).

It leads to the following piece of code:

bool bulk_save(vector<Data const*> const& data_set)
{
   if (!all_of(data_set, is_consistent))
      return false;

   vector<Data const*> working_set(data_set);
   auto mid = partition(working_set, has_id);

   DbTransaction tx("fast_save");
   bulk_insert_in_db(mid, end(working_set), tx);
   bulk_update_in_db(begin(working_set), mid, tx);
   return tx.succeeded();
}

 
Partition will permute the elements of the data set such that:

  • Elements before the mid iterator are the ones to update
  • Elements after the mid iterator are the ones to insert

We then call two specialized functions, bulk_insert_in_db and bulk_update_in_db, to deal with each responsibility. They can perform bulk operations and increase the efficiency of our function.

Following these changes, we get much less indentation, better performance, better separation of concerns and a transaction further reduced in size. We have come a pretty long way from the initial code. It is time to wrap up, and give an explanation to the title of this post.

 

Use algorithms as design tools


I must confess, I lied to you. The original code did not look like what I showed you. Instead of a bulk_save function taking a vector of Data as a parameter, it was a for loop calling the save method of each Data object:

class Data
{
public:
   bool save_in_db(DbTransaction& tx) const
   {
      if (!is_consistent())
      {
         tx.rollback();
         return false;
      }
      if (has_id())
         update_in_db(tx);
      else
         insert_in_db(tx);
      return true;
   }

private:
   bool has_id() const;
   bool is_consistent() const;
   void insert_in_db(DbTransaction&) const;
   void update_in_db(DbTransaction&) const;
};

 
This difference is critical: looking only at the prototypes of the public methods of the class, there would be no way to identify any of the all_of, partition or for_each algorithms.

To even see the opportunity for algorithmic improvement, we have to look at implementation details: we have to read the body of the method and look at the private methods of the class. Note that this gets really hard to do if the save method is virtual.

What does it tell us?

The encapsulation of this class is much too strict: it does not provide us with the tools to manipulate the class efficiently. It also prevents declarative code by hiding meaningful algorithms from our sight. Last but not least, it might lead to putting more features inside the class itself.

So, if there is such a thing as too much encapsulation, how do we choose the right level of encapsulation for our classes? I like to use Mike Acton’s rule of thumb: when there is one, there are many.

It states that objects are usually instantiated in groups, either at the same time, or spread across some period of time. So it makes sense to design our code around the most common case and to:

  1. Think in terms of collection of objects, not just single instances.
  2. Identify the kind of algorithms we need to perform on these collections.
  3. Use that knowledge to choose the right encapsulation level.

To summarize, algorithms are not only useful for the implementation of routines. We can think of them as design tools of classes and routines too.
