Generating random numbers using C++ standard library: the solutions

The content of this post is based on the three C++ standardization papers I presented in Prague, P2058, P2059, and P2060, and on various conversations I had afterwards on the same topic.

Now, onto the solutions themselves.

Fixing std::random_device

In my last post, I complained that std::random_device is allowed to be not random at all, and there is no way to find out, because std::random_device::entropy is interpreted very differently across different standard library implementations.

My ideal way to fix this would be to mandate that a standard library implementation only provides std::random_device if it provides proper randomness. And by proper, I mean cryptographically strong. While this sounds onerous, the three major implementations already provide this in practice; they just do not advertise it. However, I also think that such a proposal would never pass the standards committee, and so we need to fix it differently.

Provide users with better queries for the properties of the implementation

Users generally care about one of two things.

  1. Whether the random_device is random, that is, it does not produce the same sequence every time the code is run.
  2. Whether the random_device produces cryptographically secure outputs.

Obviously, the second property is much stronger, because a random_device that is cryptographically secure is also random, but random_device can be random while not being cryptographically secure. As currently standardized, a random_device is also allowed to be neither random nor cryptographically secure.

A nice feature of these properties is that they are binary, so the answer to them is either yes or no, with no possibilities in-between. They are also reasonably well defined, which should avoid an entropy-like fiasco with implementations interpreting them differently and causing them to be useless in practice.

My proposal to fix std::random_device in the standard simply follows from the above. The std::random_device interface should be extended with two new member functions:

class random_device {
   ...
   // Returns true if different instances generate different bytes
   constexpr bool is_random() const;
   
   // Returns true if generated bytes are cryptographically secure
   bool is_cryptographically_secure() const;
};

You might notice that only is_random is constexpr. The reason for that is that it is the weaker property and, outside of maliciously constructed cases, the implementation should know whether the random_device is randomized. is_random could even be made static, if we restricted users from using the explicit random_device(const string& token) constructor.

is_cryptographically_secure is not constexpr, to increase implementations' latitude to handle things like hardware errata, which can only be checked at runtime. Just like is_random, it could be made static if we imposed further restrictions on users of random_device.
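To make the intent of the two queries more concrete, here is a rough sketch of how calling code might use them. Since std::random_device has neither member function today, the sketch uses a stand-in type, proposed_random_device, whose answers are hard-coded assumptions; seed_quality and classify are likewise names made up for this illustration.

```cpp
#include <random>

// Hypothetical stand-in for the proposed interface. std::random_device has
// neither of these member functions today, and the answers below are
// hard-coded assumptions, not properties of any real implementation.
class proposed_random_device {
    std::random_device m_dev;
public:
    using result_type = std::random_device::result_type;

    result_type operator()() { return m_dev(); }

    // Weaker property: the implementation is expected to know this upfront.
    constexpr bool is_random() const { return true; }

    // Stronger property: may depend on runtime checks (e.g. hardware errata).
    bool is_cryptographically_secure() const { return false; }
};

enum class seed_quality { fixed, randomized, cryptographic };

// Maps the two binary answers onto the three cases described in the text.
template <typename Device>
seed_quality classify(const Device& dev) {
    if (dev.is_cryptographically_secure()) return seed_quality::cryptographic;
    if (dev.is_random()) return seed_quality::randomized;
    return seed_quality::fixed;
}
```

A caller that needs key material could refuse to continue unless classify returns seed_quality::cryptographic, while a game or a simulation would be happy with seed_quality::randomized.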

Deprecate std::random_device::entropy

Now that random_device provides a way to query basic properties of its implementation, we should also deprecate random_device::entropy, because it is wholly useless, and (very) potentially even dangerous.

Provide reproducible distributions

How reproducible distributions should be standardized is where I have changed my opinion the most since writing the paper. Initially, my preferred solution was to standardize the algorithms underlying std::*_distribution, but that is no longer the case. Nowadays, my preferred solution is to:

Standardize specific algorithms as distributions

The basic idea is simple: we standardize specific algorithms under their own names, and users who want reproducibility just use one of these specific algorithms. As an example, one of the possible algorithms to implement std::normal_distribution is the Marsaglia polar method. To provide a reproducible normal distribution, it would be standardized as std::marsaglia_polar_method_distribution.
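As a rough illustration of what such a named-algorithm distribution could look like, here is a minimal sketch of the polar method. The class name follows the text, but the interface details, the restriction to full-range 64-bit generators, and the bits-to-double conversion are assumptions made for this example. It also relies on std::log and std::sqrt, so on its own it does not guarantee bit-identical results across platforms.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>
#include <random>

class marsaglia_polar_method_distribution {
    double m_mean;
    double m_stddev;
    double m_spare = 0.0;      // second value of the most recent pair
    bool m_has_spare = false;

    // Uniform double in [-1, 1), built from the top 53 bits of one output.
    template <typename URBG>
    static double symmetric_unit(URBG& g) {
        static_assert(URBG::min() == 0 &&
                      URBG::max() == std::numeric_limits<std::uint64_t>::max(),
                      "this sketch only handles full-range 64-bit generators");
        const double unit =
            static_cast<double>(g() >> 11) * (1.0 / 9007199254740992.0);
        return 2.0 * unit - 1.0;
    }

public:
    using result_type = double;

    explicit marsaglia_polar_method_distribution(double mean = 0.0,
                                                 double stddev = 1.0)
        : m_mean(mean), m_stddev(stddev) {}

    template <typename URBG>
    result_type operator()(URBG& g) {
        if (m_has_spare) {
            m_has_spare = false;
            return m_mean + m_stddev * m_spare;
        }
        double u, v, s;
        do {  // rejection step: keep only points strictly inside the unit circle
            u = symmetric_unit(g);
            v = symmetric_unit(g);
            s = u * u + v * v;
        } while (s >= 1.0 || s == 0.0);
        const double factor = std::sqrt(-2.0 * std::log(s) / s);
        m_spare = v * factor;
        m_has_spare = true;
        return m_mean + m_stddev * u * factor;
    }
};
```

Because the algorithm is fixed by the type, two programs that feed it the same std::mt19937_64 output go through exactly the same sequence of operations, which is the property the plain std::normal_distribution does not promise.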

This solution has a significant advantage in that it is both backwards compatible, because it does not change the meaning of existing code, and open to future extensions. If we standardize some set of algorithms as the reproducible distributions, and 10 years later someone comes up with a better algorithm for generating normally distributed numbers, then it can easily be standardized in the next C++ standard. C++ code can then adopt this new algorithm if it does not need backwards compatibility, or keep using the old ones if it does.

It is also very expert friendly, as different algorithms have different performance and numeric characteristics, which the experts might care about. As an example, the Marsaglia polar method calls the underlying RNG more often than the Box-Muller transform does, but it does not use trigonometric functions and provides slightly better numeric properties.
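The difference in RNG usage can be seen with a small counting experiment. The sketch below is only an illustration: counting_engine, box_muller_pair and marsaglia_polar_pair are names invented here, and the conversion from 64 random bits to a double is one simple choice among several.

```cpp
#include <cmath>
#include <cstdint>
#include <random>
#include <utility>

// Wraps a 64-bit engine and counts how many times it is invoked.
struct counting_engine {
    std::mt19937_64 engine;
    std::uint64_t calls = 0;

    explicit counting_engine(std::uint64_t seed) : engine(seed) {}

    // Uniform double in [0, 1) built from the top 53 bits of one output.
    double next_unit() {
        ++calls;
        return static_cast<double>(engine() >> 11) * (1.0 / 9007199254740992.0);
    }
};

// Box-Muller: always two uniform inputs per pair, but needs sin and cos.
std::pair<double, double> box_muller_pair(counting_engine& g) {
    const double two_pi = 6.283185307179586;
    const double u1 = 1.0 - g.next_unit();  // in (0, 1], keeps the log finite
    const double u2 = g.next_unit();
    const double r = std::sqrt(-2.0 * std::log(u1));
    return {r * std::cos(two_pi * u2), r * std::sin(two_pi * u2)};
}

// Marsaglia polar: no trigonometry, but rejected points cost extra RNG calls.
std::pair<double, double> marsaglia_polar_pair(counting_engine& g) {
    double u, v, s;
    do {
        u = 2.0 * g.next_unit() - 1.0;
        v = 2.0 * g.next_unit() - 1.0;
        s = u * u + v * v;
    } while (s >= 1.0 || s == 0.0);
    const double f = std::sqrt(-2.0 * std::log(s) / s);
    return {u * f, v * f};
}
```

Since the polar method accepts a point only when it falls inside the unit circle, roughly a fifth of the candidate points are thrown away, so its calls counter grows faster than the fixed two calls per pair of Box-Muller.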

This approach is not without its negatives. The two big ones are that it introduces a lot of new types, and thus maintenance burden, into the standard library, and that it makes using <random> even less user-friendly. A user that wants a reproducible distribution has to pick which exact algorithm to use. Doing so requires either obtaining a significant amount of expert knowledge, or picking one essentially at random.

Other considered (and rejected) options

Back at the Prague meeting, I proposed two other alternatives to the option above. In fact, I considered the option outlined above the worst one. However, I've changed my mind since then and no longer consider them good. They are:

  1. Mandate a specific implementation of std::foo_distribution
  2. Provide std::reproducible_foo_distribution with a specified implementation

Both of these options share the same problem, that they do not provide future extensibility, and the same advantage, in that they introduce less burden on both the maintainers and the non-expert users of <random>. They also provide some different trade-offs with regard to backwards compatibility, implementation latitude, and so on.

Challenges, problems, and pitfalls

All three options mentioned above share one big problem: floating-point numbers. This problem further splits into two more problems, floating-point representations and transcendental functions.

The problem with floating-point representations is that the C++ standard does not mandate a specific one. In practice, you are unlikely to encounter a platform that does not support IEEE-754, but the C++ standard allows such platforms. There is also the issue of floating-point dialects, caused by compiler flags such as -ffast-math.

This means that any standard-provided reproducible distribution over floating-point numbers will require some wording to the effect of "results are only reproducible between platforms with the same floating-point number representation".
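User code can at least check the representation part of this condition at compile time. The sketch below uses std::numeric_limits for that; the function names are made up for this example, and the __FAST_MATH__ check is an assumption that only helps on compilers which define that macro under -ffast-math (GCC and Clang do), so it is a partial safeguard at best.

```cpp
#include <limits>

// True when double claims IEC 559 (IEEE-754) conformance and has the usual
// binary64 layout of 53 significand bits.
constexpr bool doubles_are_ieee754() {
    return std::numeric_limits<double>::is_iec559 &&
           std::numeric_limits<double>::radix == 2 &&
           std::numeric_limits<double>::digits == 53;
}

// Compiler-specific: GCC and Clang define __FAST_MATH__ under -ffast-math.
// Other compilers may relax floating-point semantics without any such macro.
constexpr bool fast_math_detected() {
#ifdef __FAST_MATH__
    return true;
#else
    return false;
#endif
}

static_assert(doubles_are_ieee754(),
              "reproducible results here assume IEEE-754 doubles");
```

Note that the representation check says nothing about how transcendental functions are computed, so passing it is necessary but not sufficient for bit-identical results.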

The other challenge to providing reproducible floating-point distributions is the fact that most algorithms for e.g. normal distribution use transcendental functions, such as trigonometric operations (Box-Muller), or logarithms (Marsaglia). The problem is that transcendental functions are computed by approximation, both the result and the precision of such approximations vary, and which approximation your code ends up using is compiler, platform, and settings dependent.

There are two possible workarounds for the transcendental functions issue:

  1. Standard mandates specific implementation for use in <random>
  2. We use algorithms that avoid these issues at the cost of performance

Neither of these options is great, but they are workable. I don't think that <random> would be well served by just option 2, but I also don't think it should be overlooked.

Rework seeding of Random Number Engines

The last of my complaints in the previous post was that there is no right way to seed an unknown Random Number Engine properly. This issue is caused by a combination of the requirements on Seed Sequence being overly restrictive, and that there is no way to ask an RNE how much seeding it requires upfront.

Strictly speaking, it is possible to fix this with just one change: letting users query any random number engine on how much data it requires for seeding itself. However, that would still leave proper seeding very unergonomic, and so I propose more changes to fix this. They are:

  1. Let users query RNEs for required seed size
  2. Provide a weaker version of the Seed Sequence requirements
  3. Modify std::random_device to fulfil these requirements

Let users query Random Number Engines for required seed size

The idea behind this change is simple. If we know how much random data is required to seed some RNE, we can generate that much randomness ahead of time, and then use a straightforward Seed Sequence type that just copies randomness in and out, while obeying all Seed Sequence requirements.

To do this, we add a static constexpr size_t required_seed_size() member function to the requirements on Random Number Engines. Its return value is the number of bytes the RNE requires to fully seed itself. Together with a simple, randomness-copying Seed Sequence, sized_seed_seq, the code to fully seed an mt19937 with random data would look something like this:

// This prepares the seed sequence
constexpr auto data_needed = std::mt19937::required_seed_size() / sizeof(std::random_device::result_type);
std::array<std::random_device::result_type, data_needed> random_data;
std::generate(random_data.begin(), random_data.end(), std::random_device{});

// Actual seeding; engines take the seed sequence by non-const reference,
// so it has to be a named object rather than a temporary
sized_seed_seq sseq(random_data.begin(), random_data.end());
std::mt19937 urbg(sseq);

While this works and does what we want, the usability is terrible. To fix the usability for the typical case of random seeding, we need to change the requirements of Seed Sequence.
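The sized_seed_seq type used above is not spelled out, so here is one way it might look. This is only a sketch under assumptions: it implements the members an engine constructor actually calls plus size and param, leaves out the default and initializer-list constructors a full Seed Sequence would need, and throws when asked for more data than it holds (a choice made for this example, not something the text prescribes).

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <stdexcept>
#include <vector>

// Hypothetical randomness-copying seed sequence: it stores the words it was
// given and hands them back unchanged, without any mixing.
class sized_seed_seq {
    std::vector<std::uint32_t> m_data;
public:
    using result_type = std::uint32_t;

    template <typename InputIt>
    sized_seed_seq(InputIt first, InputIt last) {
        for (; first != last; ++first) {
            m_data.push_back(static_cast<std::uint32_t>(*first));
        }
    }

    std::size_t size() const { return m_data.size(); }

    template <typename OutputIt>
    void param(OutputIt out) const {
        std::copy(m_data.begin(), m_data.end(), out);
    }

    // Copies the stored words into [first, last); refuses to invent data.
    template <typename RandomIt>
    void generate(RandomIt first, RandomIt last) {
        if (static_cast<std::size_t>(last - first) > m_data.size()) {
            throw std::length_error("sized_seed_seq: not enough seed data");
        }
        std::size_t i = 0;
        for (; first != last; ++first) {
            *first = m_data[i++];
        }
    }
};
```

For std::mt19937 the amount of data to provide is std::mt19937::state_size (624) 32-bit words, which is what the hypothetical required_seed_size() would report in bytes.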

Provide a weaker version of Seed Sequence requirements

In the ideal world, we would just pass a std::random_device to the constructor of the engine, like so:

std::mt19937(std::random_device{});

However, std::random_device is not a Seed Sequence, and thus the code above does not work. The requirements of Seed Sequence are also such that we cannot create a simple wrapper around random_device that fulfils them. Let's see what requirements we have to drop before a randomized_seed_seq, a seed sequence that just wraps std::random_device, is implementable.

Many of the requirements on Seed Sequence boil down to requiring Seed Sequence instances to be serializable and reproducible. A Seed Sequence-ish type that wraps std::random_device cannot provide either, which means that

  • We should drop both the param and size member functions. Without param, size is useless, and param cannot be implemented on top of random_device.
  • We should also drop both the range and the initializer list constructors. They require that the bits provided therein are used in the seed sequence, but that cannot be done with random_device.

Removing these functions leaves us with the default constructor and the generate member function. And also with the result_type typedef, but that is almost trivial. We obviously need to keep the default constructor, but we cannot satisfy the requirement that the state of all default-constructed instances is the same, so we will drop that part. The same applies to the generate member function: any reasonable Seed Sequence has to provide it, but we would need to drop the requirement that the output depends on the inputs given during construction (not that there are any).

Thus I propose a new set of named requirements, Basic Seed Sequence. A type only has to fulfil 3 requirements to be considered a Basic Seed Sequence, namely:

  • It provides a result_type typedef that is an unsigned integer type of at least 32 bits.
  • It provides a default constructor with constant runtime complexity.
  • It provides generate(rb, re), where rb and re are mutable random access iterators, which fills [rb, re) with 32-bit quantities. There are no constraints on the generated data.
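The syntactic part of these requirements can be approximated with a type trait. The following C++17 sketch is only an illustration: is_basic_seed_sequence is a name invented here, it probes generate with plain pointers as the iterator type, and it cannot check the constant-complexity requirement or what generate actually writes.

```cpp
#include <cstdint>
#include <limits>
#include <random>
#include <type_traits>
#include <utility>

// Primary template: anything without result_type or generate is rejected.
template <typename S, typename = void>
struct is_basic_seed_sequence : std::false_type {};

// Chosen when S has a result_type and a generate(rb, re) that is callable
// with mutable random access iterators (plain pointers are used as a probe).
template <typename S>
struct is_basic_seed_sequence<
    S,
    std::void_t<typename S::result_type,
                decltype(std::declval<S&>().generate(
                    std::declval<std::uint_least32_t*>(),
                    std::declval<std::uint_least32_t*>()))>>
    : std::bool_constant<
          std::is_unsigned_v<typename S::result_type> &&
          std::numeric_limits<typename S::result_type>::digits >= 32 &&
          std::is_default_constructible_v<S>> {};

// std::seed_seq already satisfies the weaker requirements...
static_assert(is_basic_seed_sequence<std::seed_seq>::value, "");
// ...while std::random_device does not, because it has no generate member.
static_assert(!is_basic_seed_sequence<std::random_device>::value, "");
```

The second static_assert shows how small the gap is: std::random_device already has a suitable result_type and a default constructor, and only generate is missing.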

This is the minimal set of requirements for a useful Seed Sequence-ish type, and a wrapper type over std::random_device can easily fulfil them:

#include <climits>   // CHAR_BIT
#include <cstdint>   // std::uint32_t
#include <iterator>  // std::iterator_traits
#include <random>    // std::random_device

class randomized_seed_seq {
    std::random_device m_dev;

    static_assert(32 <= sizeof(std::random_device::result_type) * CHAR_BIT,
                  "I don't wanna handle this case");
public:

    using result_type = std::random_device::result_type;

    template <typename Iter, typename Sentinel>
    void generate(Iter first, Sentinel last) {
        using dest_type = typename std::iterator_traits<Iter>::value_type;
        // We should also check that it is unsigned, but eh.
        static_assert(32 <= sizeof(dest_type) * CHAR_BIT, "");

        while (first != last) {
            // Note that we are _required_ to only output 32 bits
            *first++ = static_cast<std::uint32_t>(m_dev());
        }
    }
};

With the wrapper above, we can now seed any Random Number Engine like this:

randomized_seed_seq sseq;
std::mt19937 rng(sseq);

RNEs take the Seed Sequence constructor argument by plain reference, so we cannot quite write a one-liner, but compared to the original monstrosity, this is good enough. However, I also think that users shouldn't have to wrap std::random_device in their own type to get this behaviour, but rather the standard should provide it. This leads me to my last suggestion:

Turn std::random_device into a Basic Seed Sequence

This one is simple. If we add generate to std::random_device, it becomes a Basic Seed Sequence as per the definition above. This would let users write these two lines to get a randomly seeded Random Number Engine:

std::random_device dev;
std::mt19937 rng(dev);

Users who require a large number of random bytes could also use this interface to achieve a significant performance gain over successively calling random_device::operator().

Other possible improvements

Until now, this post was about fixing the problems outlined in the previous one. However, in that post, I skipped over the "small" issues with <random>, ones that are annoying but do not make it unusable. In this section, I want to go over some of them.

The following issues are mentioned in no specific order.

There are no modern PRNGs in <random>. The best PRNG in <random> is probably the Mersenne Twister, but using Mersenne Twister instead of, say, Xorshift or a PCG variant leaves a lot of performance lying on the table. This lack of modern PRNGs means that serious users will end up writing their own, even if all issues with seeding, distributions, and so on, are fixed.
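To give an idea of how little code such a generator needs, here is a sketch of the classic 32-bit Xorshift with the shift triple (13, 17, 5), written so that it plugs into the existing <random> distributions. The class name, the handling of a zero seed, and the roll_die helper are choices made for this example; it is meant to show the shape of a user-written engine, not to recommend this particular generator.

```cpp
#include <cstdint>
#include <limits>
#include <random>

// Minimal 32-bit Xorshift generator exposing the interface that <random>
// distributions expect from a uniform random bit generator.
class xorshift32 {
    std::uint32_t m_state;
public:
    using result_type = std::uint32_t;

    // A zero state would stay zero forever, so it is replaced by 1.
    constexpr explicit xorshift32(std::uint32_t seed)
        : m_state(seed != 0 ? seed : 1u) {}

    // The generator never outputs 0, so the smallest possible value is 1.
    static constexpr result_type min() { return 1; }
    static constexpr result_type max() {
        return std::numeric_limits<result_type>::max();
    }

    constexpr result_type operator()() {
        m_state ^= m_state << 13;
        m_state ^= m_state >> 17;
        m_state ^= m_state << 5;
        return m_state;
    }
};

// Everything above is constexpr, so outputs can be computed at compile time.
constexpr std::uint32_t first_output(std::uint32_t seed) {
    xorshift32 gen(seed);
    return gen();
}
static_assert(first_output(1) == 270369u, "first output for seed 1");

// The engine works with the standard distributions like any other URBG.
inline int roll_die(xorshift32& gen) {
    std::uniform_int_distribution<int> die(1, 6);
    return die(gen);
}
```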

Most (all?) of the PRNGs in <random> could be constexpr, but they are not. As far as I can tell, this is caused by the fact that nobody actually uses <random> enough to care about constexpr-ing it, rather than any technical reasons.

Random Number Engines take Seed Sequence arguments by plain reference. This prevents creating and fully seeding an RNE from being a one-liner.
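Until that changes, the named seed sequence object can be hidden inside a small helper function. The sketch below is an assumption-laden illustration: device_seed_seq and make_seeded are names invented here, and the wrapper implements only the members that engine constructors use in practice, not the full Seed Sequence requirements.

```cpp
#include <cstdint>
#include <random>

// Minimal seed-sequence-like wrapper that forwards to std::random_device.
class device_seed_seq {
    std::random_device m_dev;
public:
    using result_type = std::uint_least32_t;

    template <typename Iter>
    void generate(Iter first, Iter last) {
        for (; first != last; ++first) {
            // Only 32 bits per element, as seed sequences are expected to do.
            *first = static_cast<std::uint32_t>(m_dev());
        }
    }
};

// Creates an engine whose whole state is filled from std::random_device.
template <typename Engine>
Engine make_seeded() {
    device_seed_seq seq;  // must be a named lvalue: engines take Sseq&
    return Engine(seq);
}
```

With the helper in place, the call site shrinks to a single line such as auto rng = make_seeded<std::mt19937>();.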

There are no ease-of-use utilities. If all the fixes proposed in this post were incorporated, seeding a PRNG would become easy. However, selecting a random element from a std::vector would still require a significant amount of boilerplate.
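For reference, this is roughly the boilerplate in question. random_element is a hypothetical helper written for this example, and throwing on an empty vector is just one possible way to handle that case.

```cpp
#include <cstddef>
#include <random>
#include <stdexcept>
#include <vector>

// Picks one element uniformly at random; every caller currently has to
// write something like this by hand.
template <typename T, typename URBG>
const T& random_element(const std::vector<T>& items, URBG& gen) {
    if (items.empty()) {
        throw std::invalid_argument("random_element: empty vector");
    }
    // Both bounds of uniform_int_distribution are inclusive.
    std::uniform_int_distribution<std::size_t> pick(0, items.size() - 1);
    return items[pick(gen)];
}
```

Calling it still requires a seeded engine, for example std::mt19937 gen(seed); followed by random_element(names, gen).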

There are likely many more tiny issues with <random> that I am either unaware of completely, or that I haven't run into recently enough to remember them. The point is that if all of my proposed changes were standardized, <random> would become much better but definitely not perfect.

That's it for this post, and for my writing about <random>. At some point in the future I want to write a post about my standardization efforts towards fixing <random>, but that will be a non-technical post about the standardization process itself, rather than about the technical details of <random>.
