In the previous post, we developed the tiny_rand library as an extension of the STL random header. The goal was to build a minimal set of functions such that it would allow to randomly generate any kind of data: containers, classes, tuples, variants, and any combination of those.

The **tiny_rand** library is built around concepts coming from Functional Programming. In particular, it leverages Functors and Applicative to get a lot of composability for free. But it is far from being perfect. We ended the previous post with a bunch of idea for improvements.

In today’s post, we will tackle one of these improvements. Our goal will be to implement one of the missing feature: the ability to randomly select among a bunch of values, adding weights to each of the values.

### Motivation

Before developing the feature, we will first describe what it is useful for. We will therefore start by some interesting use cases.

###### Generating strings

Let us imagine that we wish to generate random strings. But instead of having equal probabilities to generate each character, we want to weight each character, such that the generated strings matches the distribution of characters in the English language.

We could use an API that looks like follows:

std::vector<Weighted<char>> weighted_letters = { | |

{'a', 2}, {'b', 1}, {'c', 1}, {'d', 1}, {'e', 2} | |

}; | |

auto weighted_letters_gen = weighted_choice_gen(weighted_letters); |

- We provide
*weigthed*characters to**weighted_choice_gen** - A weighted character associates a character
**C**to a double**W(C)**

The resulting generator would generate a character **C** with the probability **P(C)**, equal to the weight of the character **W(C)** divided by the sum of the weights of all characters.

###### Generating strings, revisited

Our **weighted_choice_gen** is very useful. It is however pretty low level, and might not be the simplest way to generate strings where each character has an associated weight.

For instance, let us imagine that we wish to generate strings, such that there are in average X times more consonants than vowels (*). Let us assume that we already have two generators available: one for vowels and one for consonants.

In such a use case, we could certainly do it using *weighted_choice_gen*, but it would be very awkward, very quickly:

- We would have to compute the weights of individual letters
- We would loose reuse of the vowel and consonant generators

It would be much better to separate concerns and restore composability, by associating weights to generators instead:

std::string vowels = "aeiouy"; | |

std::string consonants = "bcdfghjklmnpqrstvwxyz"; | |

auto vowels_gen = choice_gen(vowels); | |

auto consonant_gen = choice_gen(consonants); | |

auto weighted_letters_gen = weighted_one_of_gen( | |

[](char c) { return c; }, | |

weighted(vowels_gen, 2), | |

weighted(consonant_gen, 1)); |

*(*) Note: this is very different from generating each vowel with a weight of X, and each consonant with a weight of 1. Can you see why?*

###### The emerging need

Both these test case show the the need to randomly select stuff (value or generators) with associated weights, to introduce bias toward some choices.

**weighted_choice_gen**will implement the weighted choice of values**weighted_one_of_gen**will implement the weighted combination of generators

In some sense, *weighted_choice_gen* is the generalization of *choice_gen*, and *weighted_one_of_gen* is the generalization of *one_of_gen*.

###### Other use cases

Both these generators are pretty useful in practice. For instance, we could use them to stress test a trading platform with test inputs that models the the kind of traffic the platform usually receives.

If the traffic is made of 0.25% IRS and 0.75% Spots, we can combine (using *weighted_one_of_gen*) a generators for IRS and a generator for Spots.

### Overall design

We will follow the same recipe we followed for the design and implementation of the previous features. We will create combinators that take other generators and produce new generators (*).

We recognized two very similar problems to solve:

- The weighted random selection of values
- The weighted random selection of generators

Because these are very similar in nature, we will build an algorithm that works for both values and generators (generators are special case of values) and build nice looking combinators around it.

Because we want to be able to test our combinators, we will also work on separating the non-deterministic part from the deterministic part in our algorithms.

*(*) It is generally a good idea to stick with a very few amount of constructs, patterns or idioms inside a given library. Switching too much (even for reasons that might seem good locally) might lead to an explosion of different ways to do things, making the library as a whole less cohesive. In one word, being consistent in our design makes API users happier.*

### Core implementation

In this section, we will discuss the design and implementation of the deterministic core algorithms that will power our combinators.

###### Types

We could use *std::pair* to associate a value to its weight. To be a bit more declarative thought, we will make a dedicated type to model this association:

template<typename Value> | |

struct Weighted | |

{ | |

Value m_value; | |

double m_weight; | |

}; | |

template<typename Value> | |

Weighted<Value> weighted(Value gen, double weight) | |

{ | |

return Weighted<Value>{gen, weight}; | |

} |

*Note: It is not always clear whether using strong typing leads to real improvements. There is always a trade-off involved between safety versus specificity, which will be discussed in some future posts.*

###### Algorithm discussion

The **choice_gen** combinator for random non-weighted selection had a constant time algorithm complexity. The implementation just had to randomly pick an index and look inside a vector to get back the chosen value.

When we consider weights, the algorithm is less straightforward, and we cannot ensure a constant time complexity. For this reason, and although **choice_gen** could theoretically be implemented in terms of **weighted_choice_gen**, we keep them separated.

Our new algorithm will proceed as follows:

- We will build weight intervals: each interval will be associated a value
- The width of an interval is the weight of the associated value
- We randomly pick a double between zero and the sum of all the weights
- We binary search this number in the intervals to select the value

The construction of the intervals is depicted below:

INPUT: [{ val_1, 1.2 }, { val_2, 0.7 }, { val_3, 1.5 }] INTERVALS: [0, 1.2) => val_1 [1.2, 1.9) => val_2 [1.9, 3.4) => val_3

This algorithm requires linear complexity to build the intervals. But this only has to happen once, at the creation of the generator. The search itself has a logarithmic complexity and occurs at each roll of the generator.

###### Algorithm implementation

The STL offers a pretty useful algorithm that we can use here to help us doing the binary search job: std::lower_bound. Its documentation on cppreference.com says that it *“Returns an iterator pointing to the first element [..] greater or equal to the [..] value.”*

This means that we can implement the interval logic we described earlier by just storing the end of each interval. Finding the appropriate interval is finding the interval whose end is at least greater than the searched value.

Based on this insight, we can implement the *setup* phase that builds our intervals as follows. We just associate end of intervals to the value associated to the interval:

template<typename Value> | |

std::vector<std::pair<double, Value>> make_intervals( | |

std::vector<Weighted<Value>> const& weighted_values) | |

{ | |

std::vector<std::pair<double, Value>> intervals; | |

intervals.reserve(weighted_values.size()); | |

double summed_weights = 0.0; | |

for (auto const& weighted_value: weighted_values) | |

{ | |

summed_weights += weighted_value.m_weight; | |

intervals.emplace_back(summed_weights, weighted_value.m_value); | |

} | |

return intervals; | |

} |

Searching the intervals can be implemented as a simple *std::lower_bound* on the resulting collection, with a custom comparison:

template<typename Value> | |

Value search_containing_interval( | |

std::vector<std::pair<double, Value>> const& intervals, double weight) | |

{ | |

auto it = std::lower_bound( | |

intervals.begin(), intervals.end(), weight, | |

[](auto const& element, double weight) | |

{ | |

return element.first < weight; | |

}); | |

return it->second; //By construction, always exists | |

} |

Both these algorithms can be easily unit tested. We took care avoiding any kind of random generator side-effect in these core building blocks.

*Note: splitting algorithms into parts is a good way to make the parts testable. A lot of times, there is no need for dependency injection, fakes or mocks to make a piece code testable. It is not to say these techniques have no uses, just to state simpler alternatives.*

### Wrapping up as an API

In the previous section, we built the algorithmic building blocks needed to randomly choose a value, with weights associated to each of the values. In this section, we will assemble these blocks into the combinators of our API.

###### Weighted Value Choice

To randomly pick a value, with weights associated to each value, we just assemble our core building blocks.

At the creation of the generator, we build our interval search vector and pass it to the generator returned from the **weighted_choice_gen** combinator. This generator will first generate a random double value, then find its associated interval. It then returns the value associated to the interval.

template<typename Value> | |

auto weighted_choice_gen(std::vector<Weighted<Value>> const& weighted_values) | |

{ | |

auto const& intervals = details::make_intervals(weighted_values); | |

double sum_weights = intervals.back().first; | |

return [=](std::mt19937& bit_gen) -> Value | |

{ | |

std::uniform_real_distribution<double> distribution(0., sum_weights); | |

return details::search_containing_interval(intervals, distribution(bit_gen)); | |

}; | |

} |

###### Weighted Generator Choice

To randomly pick a generator, we will use the combinator we just implemented to pick a random value, *weighted_choice_gen*, and add some glue around.

We first transform our parameter pack of heterogeneous generators into a vector of homogeneous weighted generators that can be fed to *weighted_choice_gen*. We use type erasure and std::function to do this.

The output of *weighted_choice_gen* will be a generator that produces a generator when we roll it. We capture it inside the lambda returned by our new **weighted_one_of_gen** combinator. The implementation mirrors this quite naturally:

template<typename Finalizer, typename Generator, typename... Generators> | |

auto weighted_one_of_gen( | |

Finalizer finalizer, Weighted<Generator> head, Weighted<Generators>... tail) | |

{ | |

using Out = decltype(finalizer(head.m_value(std::declval<std::mt19937&>()))); | |

using OutGen = std::function<Out(std::mt19937&)>; | |

auto map_first = [&](auto&& wg) -> Weighted<OutGen> | |

{ | |

return Weighted<OutGen>{transform_gen(finalizer, wg.m_value), wg.m_weight}; | |

}; | |

std::vector<Weighted<OutGen>> weighted_gens{ map_first(head), map_first(tail)... }; | |

auto generator_picker = weighted_choice_gen(weighted_gens); | |

return [=](std::mt19937& bit_gen) | |

{ | |

return generator_picker(bit_gen)(bit_gen); | |

}; | |

} |

### Conclusion, and what’s next

In this post, we added two combinators in our **tiny_rand** library. These combinators allow to randomly select items (value or generators), with weights associated to each of these items to introduce bias.

We looked at our problem and recognized the similarities between these two features. We saw it could be implemented based on one single algorithm. The resulting generator is quite efficient: each roll has a logarithmic time complexity (logarithmic in the number of items to pick from).

In future posts, we will look at improving the **tiny_rand** library with combinators to build dependent generators (generators whose behavior depends on the result of a previous generator). This will lead us to the concept of *Monads*, which we will discuss in the context of C++.

You can contact or follow me on Twitter.

http://stackoverflow.com/questions/5027757/data-structure-for-loaded-dice

LikeLike

Indeed, I did not know about the alias method. There was similar comments in the associated reddit post: https://www.reddit.com/r/cpp/comments/688xkk/implementing_a_weighted_choice_random_generator/. I will have a look at it when I get the change and improve my implementation.

LikeLike