This post is the third in a series focused on providing useful higher-order functions that go hand in hand with STL algorithms and ranges.
The two previous posts went over the following set of tools:
- Lenses allow us to “zoom” on the elements being scanned
- Traversal combinators allow us to merge several scans into one single scan
Today’s post considers a more complex example, inspired by real work, and shows how these tools help in solving it.
I hope to convince you that these tools are useful, if not to put directly into practice, at least to help you consider different design choices.
Motivation
I work as a Back Office software engineer for a trading platform. Our platform receives a lot of financial transactions per day. These transactions contribute to the position of the bank, generate payments or confirmations, participate in the risk, etc.
We rely on stream processing during the day and switch to batch processing outside of working hours to consolidate the data. In batch mode, we need to handle a lot of transactions fast, many more than can fit in memory.
To do so, we parallelise the work across several engines. Each engine streams a set of transactions from the DB through the functions we want to execute on them (recompute positions, consolidate the risk, etc.).
The work performed by the engines can be seen as a traversal of the DB, running several computations on the data while scanning each transaction only once.
The problem description below is greatly inspired by this kind of problem.
Problem description
Instead of receiving a stream of financial transactions, we will receive a stream of stock values as strings. Each string will contain the name of the stock and its value.
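The rest of the post assumes a small stock_update structure and a parse_message function turning a raw string into it. To keep the examples readable on their own, here is a minimal sketch of what they could look like, assuming a simple “id;value” message format (the format itself is an assumption made for illustration; only the m_id and m_value fields matter later):

struct stock_update
{
    std::string m_id;
    double m_value;
};

stock_update parse_message(std::string const& msg)
{
    // Assumes a well-formed "id;value" message
    auto separator = msg.find(';');
    return stock_update{
        msg.substr(0, separator),
        std::stod(msg.substr(separator + 1)) };
}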
Instead of consolidating positions or risk, our API client is interested in extracting trends from this data. He has a bunch of different metrics of interest, things like:
- Finding the minimum value
- Computing the average value
- Counting the points far away from the initial stock value
Our client is not interested in running all of these metrics all the time. He wants to be able to cherry-pick them. Our goal is to provide an API to build these capabilities easily, from the raw stream of stock values. We need to:
- Parse these inputs
- Be able to select a stock name
- Extract the stock value
- Combine the desired analytics on them
The terrible solution
Sit tight and be prepared for a true moment of horror. What follows is obviously a terrible solution, but I had to include it as:
- It is unfortunately pretty common to see such patterns in our code.
- It has a pedagogic interest: it shows us where not to go.
This is a design in which everything is packed together. The reading of the inputs, the filtering and the processing, the selection of the different metrics… everything is tightly coupled inside one function.
struct batch_options
{
    bool m_with_minimum;
    bool m_with_average;
    bool m_with_variations;
    double m_limit;
    std::function<bool (stock_update const&)> m_filter;
};

struct batch_result
{
    double m_minimum;
    double m_average;
    int m_variation_count;
};

batch_result terrible_design(std::vector<std::string> const& source, batch_options const& options)
{
    batch_result result;
    result.m_minimum = std::numeric_limits<double>::max();
    result.m_average = 0.0;
    result.m_variation_count = 0;

    int count = 0;
    double init_value = 0.0;
    for (auto& msg: source)
    {
        stock_update s = parse_message(msg);
        if (!options.m_filter(s))
            continue;

        if (count == 0)
            init_value = s.m_value;
        ++count;

        if (options.m_with_minimum)
            result.m_minimum = std::min(result.m_minimum, s.m_value);
        if (options.m_with_average)
            result.m_average += s.m_value;
        if (options.m_with_variations)
            if (std::abs(s.m_value - init_value) > options.m_limit)
                ++result.m_variation_count;
    }

    if (count)
        result.m_average /= count;
    return result;
}
The usage of this code is quite obscure as well: the reader will have great trouble figuring out what each boolean value represents:
batch_result res = terrible_design(source, batch_options {
    true, true, true, 2.0,
    [](stock_update const& m) { return m.m_id == "A"; }
});
So this is obviously awful code. And I claim that better naming, method extraction, stronger types (instead of booleans), none of this will really make it much less awful: we are dealing with an architectural problem.
But this bad design has its pedagogic merits. Inside a single function, it exposes all the things we want to avoid:
- The source of data is coupled with the traversal
- The algorithm is coupled with the parsing and filtering
- The algorithm contains data specific to the metrics
- The data of disabled metrics is still initialised
- The output of the function contains non-computed metrics results
- The client code is coupled with all metrics
As a result, adding a new metric might break the other metrics: everything will have to be re-tested.
Our goals
Let us negate all the bad things that we identified from the previous code, to get an idea of what we are looking for:
- A source of data decoupled from the traversal
- An algorithm that does not know about the filtering nor the parsing
- An algorithm agnostic of the data specific to the metrics
- Disabled metrics should have no cost
- Disabled metrics do not appear in the results
- A client code only coupled with the metrics it uses
The rest of this post will try to come up with a solution that complies with all these desirable properties.
Taking care of the data source
Let us start by separating the traversal (the algorithm we will later select) from both the source and the filtering / transformations we will perform on it. The following code provides a pure pipeline, implemented using Boost ranges:
template<typename Filter>
auto read_stock_filtered_by(Filter filter) {
    return [filter] (auto const& source) {
        return source
            | transformed(parse_message)
            | filtered(filter)
            | transformed(to(&stock_update::m_value));
    };
}
Returning a lambda allows us to build a pure pipeline, agnostic of the data source. The client code will be able to plug its source into this pipeline, to assemble it with the rest of the algorithm:
auto read_stock_by_id(std::string const& id) {
    return read_stock_filtered_by(equaling(id, on(&stock_update::m_id)));
}

// Assembling the algorithm, the source and the pipeline
std::vector<std::string> fake_source = ... ;
auto result = algorithm(read_stock_by_id("EUR")(fake_source));
Using lens functions here (to and equaling) leads to shorter and more declarative code that can almost read like a sentence:
- “read stock filtered by equaling ID on the stock id”
- “transformed to the stock value”
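The to, on and equaling helpers come from the previous posts of the series. To keep this post readable on its own, here is roughly what they could look like, reconstructed from the way they are used above (a sketch only; the implementations from the previous posts may differ in their details):

// to: lifts a pointer-to-member into a getter function
template<typename Member>
auto to(Member member)
{
    return [member](auto const& value) { return value.*member; };
}

// on: also lifts a pointer-to-member into a getter; in this sketch it is
// identical to 'to', the two names only differing in how the call site reads
template<typename Member>
auto on(Member member)
{
    return [member](auto const& value) { return value.*member; };
}

// equaling: builds a predicate checking that the "lensed" value equals
// a reference value
template<typename T, typename Lens>
auto equaling(T const& expected, Lens lens)
{
    return [expected, lens](auto const& value) { return lens(value) == expected; };
}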
Thinking in terms of collections
Now, we need to find a way to separate the algorithm from the metrics we would like to plug into it, while keeping everything in one pass. This means that the following (although nicer) API would not solve our problem:
template<typename Range>
double extract_min_value(Range const& values);

template<typename Range>
double compute_average(Range const& values);

template<typename Range>
int count_big_variations(Range const& values);
It would indeed implicitly force several scans of our data:
auto read = read_stock_by_id("A");
double min_val = extract_min_value(read(source));
double avg_val = compute_average(read(source));
int variation_count = count_big_variations(read(source));
Thinking in terms of collections, as encouraged by Mike Acton, does not necessarily mean taking collections as input.
In the previous code snippet, you might have noticed that we assembled our pipeline with the source each time we called an algorithm. We had to. Const ranges are not const themselves: they represent ranges that only provide const access to the elements the iterator points to. This is very different from a real container, which would have been const as well: we could have passed it to the three calls directly.
Thinking in terms of steps
You might have already guessed that the solution relies on the previous post, in which we introduced functions to combine several traversals into one.
The second important insight provided by the traversal combination functions is that we can deal with collections by decomposing them into steps. These steps can either be plugged directly into an algorithm or be combined together into one step before doing so.
Our API will thus provide step functions that can easily be combined and then used inside the std::accumulate algorithm. We only need two functions per metric:
- A binary step function, to build a result from a list of elements
- An initialisation value, or equivalently a no-argument function
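These step functions will be combined with par_steps, the traversal combinator introduced in the previous post. To keep this post self-contained, here is a rough sketch of what it could look like, reconstructed from the way it is used in the fold helper further down (the actual implementation from the previous post may differ): it turns N binary step functions into a single binary step operating on a tuple of N accumulators.

// Sketch of par_steps: applies each step to its own accumulator in the tuple
// and to the current element, producing the updated tuple of accumulators
template<typename... Steps>
auto par_steps(Steps... steps)
{
    return [steps...](auto const& accumulators, auto const& value)
    {
        return std::apply(
            [&](auto const&... accs)
            {
                return std::make_tuple(steps(accs, value)...);
            },
            accumulators);
    };
}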
Our metrics as step functions
Let us start by writing our metrics as step functions.
Finding the minimum
Each step is a simple call to the std::min function. The initialisation can be implemented in several ways (with optional, for example). I deliberately kept it as stupid as possible:
template<typename T>
struct FindMinimum
{
    T operator()() const { return std::numeric_limits<T>::max(); }
    T operator()(T const& current_min, T const& val) const
    {
        return std::min(current_min, val);
    }
};
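As mentioned above, such a step function can already be plugged directly into an algorithm on its own. A small hypothetical usage example, using the zero-argument overload as the initial value of std::accumulate:

// Hypothetical usage of FindMinimum on its own, before any combination
std::vector<double> values = { 3.0, 1.0, 2.0 };
FindMinimum<double> min_step;
double min_value = std::accumulate(values.begin(), values.end(),
                                   min_step(), min_step); // min_value == 1.0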
Finding the average
It requires a little more work: we need to keep track of the number of elements as we go along. This is the motivation behind the RunningAverage structure:
struct RunningAverage
{
    double m_value;
    int m_count;
};

struct GetAverage
{
    RunningAverage operator()() const { return RunningAverage{ 0.0, 0 }; }
    RunningAverage operator()(RunningAverage const& avg, double value) const
    {
        return RunningAverage {
            (avg.m_value * avg.m_count + value) / (avg.m_count + 1),
            avg.m_count + 1 };
    }
};
Computing big variations
This last step function requires capturing a limit parameter and remembering the first value. It could be done differently, but it is kept voluntarily simple here:
struct VariationCount
{
    boost::optional<double> m_initial;
    int m_count;
};

struct CountVariationBiggerThan
{
    CountVariationBiggerThan(double limit) : m_limit(limit) {}
    double m_limit;

    VariationCount operator()() const { return VariationCount{}; }
    VariationCount operator()(VariationCount gap, double val) const
    {
        if (!gap.m_initial)
            return VariationCount { val, 0 };
        if (std::abs(val - *gap.m_initial) > m_limit)
            gap.m_count++;
        return gap;
    }
};
Assembling the steps
To finish up, we only need to provide a small generic helper function, based on par_steps, to combine all these steps together:
template<typename Range, typename... Steps>
auto fold(Range const& rng, Steps... steps)
{
    return std::accumulate(
        rng.begin(), rng.end(),
        std::make_tuple(steps()...),
        par_steps(steps...));
}
Now, our client only needs to assemble the steps together to extract all the metrics it needs. It can be done quite elegantly:
auto read = read_stock_by_id("A");
auto res = fold(read(source),
    FindMinimum<double>(),
    GetAverage(),
    CountVariationBiggerThan(2));
Last but not least, if the client feels uncomfortable dealing with tuples in the rest of his code, wrapping the fold with std::apply will let him summarise his results easily:
auto to_nice_result = [](auto min, auto avg, auto big) {
    return batch_result{ min, avg.m_value, big.m_count };
};

auto typed = std::apply(to_nice_result, res);
std::cout
    << "Result summary:\n"
    << " - Minimum: " << typed.m_minimum << "\n"
    << " - Average: " << typed.m_average << "\n"
    << " - Variations: " << typed.m_variation_count << "\n";
Conclusion
This was a pretty long post, much longer than I first anticipated. I hope you enjoyed it nevertheless, and that it showed you the benefits of thinking in terms of lens functions and traversal combinators:
- We are not limited to creating collections from other collections each time we want to transform them. Instead, we can provide “views” on collections by combining ranges and lenses.
- We are not limited to traversing our collections each time we want to compute something from their elements. Instead, we can combine our computations before running them on our entire collection.
In the end, the tools themselves are not really complex and not that important. The main benefit of knowing them is being able to think differently about collection processing, and sometimes to come up with shorter, faster and more elegant solutions.