Consider this simple function chop:

void chop(int* count, std::string& str) {
    for (int i = 0; i < *count; ++i) {
        str[i] = 0;
    }
}

This sets the first *count leading characters in str to zero. Very simple.

If we feed this program to an optimizing compiler, the machine code generated by the compiler is roughly equivalent to the following:

void chop(int *count, std::string *str)
{
  if (*count > 0) {
    long idx = 0;
    do {
      str->__data[idx] = 0;
      idx += 1;
    } while (idx <= *count);
  }
}

This is pretty much equivalent to the program we gave it as input, with an extra jump in the case of *count == 0. This jump is the only “optimization” done by the compiler.

One Small Change…

Let’s mix it up:

void chop(int* count, std::u8string& str) {
    for (int i = 0; i < *count; ++i) {
        str[i] = 0;
    }
}

The only change is that we are now using a u8string (a std::basic_string<char8_t>). From the perspective of most actual machines, a char8_t is indistinguishable from a char (they’re both just an octet), so it should compile to identical code, right?

Not so! Feeding this program to an optimizing compiler produces the following C++-code-equivalent:

void chop(int *count, std::u8string *str)
{
  if (*count > 0) {
    std::memset(str->__data, 0, (size_t)*count);
  }
}

This code is radically different. The loop has entirely disappeared, and is replaced with a call to memset. What’s going on?

Consider this puzzle for yourself, and scroll further for the answer.

[Spoilers Below]

[Spoilers Below]

[Spoilers Below]

The Answer

The answer, unsurprisingly, comes from the guarantees afforded to us by undefined behavior – or rather the compiler’s assumption of its absence.

The potential undefined behavior in question most strongly affects this line in the second sample:

str[idx] = 0;

In the second sample, which used std::u8string:

  1. The compiler may assume that the write through str[idx] cannot effect the value in *count. This assumption holds because we are writing through a pointer-to-char8_t, which cannot be used to alias an int.
  2. The compiler sees that no code within the loop can affect the value of *count, making *count a loop-invariant value, thus the read of *count can be lifted out of the loop and performed once before beginning the loop.
  3. Since we increment idx from zero to the (invariant) value of *count, we can unroll the loop into *count-repeated evaluations of str[idx]=0.
  4. This is now just a sequence of assignments to a contiguous range of memory, so we can convert this into a memset.

But in the first example where we used std::string:

  1. C and C++ allow pointer-to-char to be used to inspect and manipulate the “object representation” of other objects of different types, so a char* can point to an int.
  2. The compiler must assume that the char* and count might alias, and thus any write through str can spuriously affect the value of *count.
  3. Because any iteration of the loop could potentially change the value of *count, *count is not loop-invariant, and we cannot make any assumptions about how many times this loop will execute.

So What?

For the history of C++, we have had std::string, std::string_view, and char*, all of which can potentially alias any other object, and only by tracking the provenance of pointers could a C++ compiler prove otherwise. Pointer provenance is a tricky subject, and it is likely the case that the compiler has absolutely no idea where an object or pointer comes from (think a const std::string& function parameter, or an equivalent std::string_view parameter).

With the addition of char8_t, std::u8string, and std::u8string_view, we now have string and “character” types that are guaranteed not to alias objects of different types.