Weighted choice implementation

In the previous post, we developed the tiny_rand library as an extension of the STL random header. The goal was to build a minimal set of functions such that it would allow to randomly generate any kind of data: containers, classes, tuples, variants, and any combination of those.

The tiny_rand library is built around concepts coming from Functional Programming. In particular, it leverages Functors and Applicative to get a lot of composability for free. But it is far from being perfect. We ended the previous post with a bunch of idea for improvements.

In today’s post, we will tackle one of these improvements. Our goal will be to implement one of the missing feature: the ability to randomly select among a bunch of values, adding weights to each of the values.

 

Motivation


Before developing the feature, we will first describe what it is useful for. We will therefore start by some interesting use cases.

 

Generating strings

Let us imagine that we wish to generate random strings. But instead of having equal probabilities to generate each character, we want to weight each character, such that the generated strings matches the distribution of characters in the English language.

We could use an API that looks like follows:

  • We provide weigthed characters to weighted_choice_gen
  • A weighted character associates a character C to a double W(C)

The resulting generator would generate a character C with the probability P(C), equal to the weight of the character W(C) divided by the sum of the weights of all characters.

 

Generating strings, revisited

Our weighted_choice_gen is very useful. It is however pretty low level, and might not be the simplest way to generate strings where each character has an associated weight.

For instance, let us imagine that we wish to generate strings, such that there are in average X times more consonants than vowels (*). Let us assume that we already have two generators available: one for vowels and one for consonants.

In such a use case, we could certainly do it using weighted_choice_gen, but it would be very awkward, very quickly:

  • We would have to compute the weights of individual letters
  • We would loose reuse of the vowel and consonant generators

It would be much better to separate concerns and restore composability, by associating weights to generators instead:

(*) Note: this is very different from generating each vowel with a weight of X, and each consonant with a weight of 1. Can you see why?

 

The emerging need

Both these test case show the the need to randomly select stuff (value or generators) with associated weights, to introduce bias toward some choices.

  • weighted_choice_gen will implement the weighted choice of values
  • weighted_one_of_gen will implement the weighted combination of generators

In some sense, weighted_choice_gen is the generalization of choice_gen, and weighted_one_of_gen is the generalization of one_of_gen.

 

Other use cases

Both these generators are pretty useful in practice. For instance, we could use them to stress test a trading platform with test inputs that models the the kind of traffic the platform usually receives.

If the traffic is made of 0.25% IRS and 0.75% Spots, we can combine (using weighted_one_of_gen) a generators for IRS and a generator for Spots.

 

Overall design


We will follow the same recipe we followed for the design and implementation of the previous features. We will create combinators that take other generators and produce new generators (*).

We recognized two very similar problems to solve:

  • The weighted random selection of values
  • The weighted random selection of generators

Because these are very similar in nature, we will build an algorithm that works for both values and generators (generators are special case of values) and build nice looking combinators around it.

Because we want to be able to test our combinators, we will also work on separating the non-deterministic part from the deterministic part in our algorithms.

(*) It is generally a good idea to stick with a very few amount of constructs, patterns or idioms inside a given library. Switching too much (even for reasons that might seem good locally) might lead to an explosion of different ways to do things, making the library as a whole less cohesive. In one word, being consistent in our design makes API users happier.

 

Core implementation


In this section, we will discuss the design and implementation of the deterministic core algorithms that will power our combinators.

 

Types

We could use std::pair to associate a value to its weight. To be a bit more declarative thought, we will make a dedicated type to model this association:

Note: It is not always clear whether using strong typing leads to real improvements. There is always a trade-off involved between safety versus specificity, which will be discussed in some future posts.

 

Algorithm discussion

The choice_gen combinator for random non-weighted selection had a constant time algorithm complexity. The implementation just had to randomly pick an index and look inside a vector to get back the chosen value.

When we consider weights, the algorithm is less straightforward, and we cannot ensure a constant time complexity. For this reason, and although choice_gen could theoretically be implemented in terms of weighted_choice_gen, we keep them separated.

Our new algorithm will proceed as follows:

  • We will build weight intervals: each interval will be associated a value
  • The width of an interval is the weight of the associated value
  • We randomly pick a double between zero and the sum of all the weights
  • We binary search this number in the intervals to select the value

The construction of the intervals is depicted below:

INPUT:
[{ val_1, 1.2 }, { val_2, 0.7 }, { val_3, 1.5 }]

INTERVALS:
[0, 1.2)   => val_1
[1.2, 1.9) => val_2
[1.9, 3.4) => val_3

This algorithm requires linear complexity to build the intervals. But this only has to happen once, at the creation of the generator. The search itself has a logarithmic complexity and occurs at each roll of the generator.

 

Algorithm implementation

The STL offers a pretty useful algorithm that we can use here to help us doing the binary search job: std::lower_bound. Its documentation on cppreference.com says that it “Returns an iterator pointing to the first element [..] greater or equal to the [..] value.”

This means that we can implement the interval logic we described earlier by just storing the end of each interval. Finding the appropriate interval is finding the interval whose end is at least greater than the searched value.

Based on this insight, we can implement the setup phase that builds our intervals as follows. We just associate end of intervals to the value associated to the interval:

Searching the intervals can be implemented as a simple std::lower_bound on the resulting collection, with a custom comparison:

Both these algorithms can be easily unit tested. We took care avoiding any kind of random generator side-effect in these core building blocks.

Note: splitting algorithms into parts is a good way to make the parts testable. A lot of times, there is no need for dependency injection, fakes or mocks to make a piece code testable. It is not to say these techniques have no uses, just to state simpler alternatives.

 

Wrapping up as an API


In the previous section, we built the algorithmic building blocks needed to randomly choose a value, with weights associated to each of the values. In this section, we will assemble these blocks into the combinators of our API.

 

Weighted Value Choice

To randomly pick a value, with weights associated to each value, we just assemble our core building blocks.

At the creation of the generator, we build our interval search vector and pass it to the generator returned from the weighted_choice_gen combinator. This generator will first generate a random double value, then find its associated interval. It then returns the value associated to the interval.

 

Weighted Generator Choice

To randomly pick a generator, we will use the combinator we just implemented to pick a random value, weighted_choice_gen, and add some glue around.

We first transform our parameter pack of heterogeneous generators into a vector of homogeneous weighted generators that can be fed to weighted_choice_gen. We use type erasure and std::function to do this.

The output of weighted_choice_gen will be a generator that produces a generator when we roll it. We capture it inside the lambda returned by our new weighted_one_of_gen combinator. The implementation mirrors this quite naturally:

 

Conclusion, and what’s next


In this post, we added two combinators in our tiny_rand library. These combinators allow to randomly select items (value or generators), with weights associated to each of these items to introduce bias.

We looked at our problem and recognized the similarities between these two features. We saw it could be implemented based on one single algorithm. The resulting generator is quite efficient: each roll has a logarithmic time complexity (logarithmic in the number of items to pick from).

In future posts, we will look at improving the tiny_rand library with combinators to build dependent generators (generators whose behavior depends on the result of a previous generator). This will lead us to the concept of Monads, which we will discuss in the context of C++.


You can contact or follow me on Twitter.

2 thoughts on “Weighted choice implementation

Add yours

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Create a website or blog at WordPress.com

Up ↑

%d bloggers like this: