Mysterious Memset
Consider this simple function chop
:
void chop(int* count, std::string& str) {
for (int i = 0; i < *count; ++i) {
str[i] = 0;
}
}
This sets the first *count
leading characters in str
to zero. Very simple.
If we feed this program to an optimizing compiler, the machine code generated by the compiler is roughly equivalent to the following:
void chop(int *count, std::string *str)
{
if (*count > 0) {
long idx = 0;
do {
str->__data[idx] = 0;
idx += 1;
} while (idx <= *count);
}
}
This is pretty much equivalent to the program we gave it as input, with an extra
jump in the case of *count == 0
. This jump is the only “optimization” done by
the compiler.
One Small Change…
Let’s mix it up:
void chop(int* count, std::u8string& str) {
for (int i = 0; i < *count; ++i) {
str[i] = 0;
}
}
The only change is that we are now using a u8string
(a
std::basic_string<char8_t>
). From the perspective of most actual machines, a
char8_t
is indistinguishable from a char
(they’re both just an octet), so it
should compile to identical code, right?
Not so! Feeding this program to an optimizing compiler produces the following C++-code-equivalent:
void chop(int *count, std::u8string *str)
{
if (*count > 0) {
std::memset(str->__data, 0, (size_t)*count);
}
}
This code is radically different. The loop has entirely disappeared, and is
replaced with a call to memset
. What’s going on?
Consider this puzzle for yourself, and scroll further for the answer.
[Spoilers Below]
[Spoilers Below]
[Spoilers Below]
The Answer
The answer, unsurprisingly, comes from the guarantees afforded to us by undefined behavior – or rather the compiler’s assumption of its absence.
The potential undefined behavior in question most strongly affects this line in the second sample:
str[idx] = 0;
In the second sample, which used std::u8string
:
- The compiler may assume that the write through
str[idx]
cannot effect the value in*count
. This assumption holds because we are writing through a pointer-to-char8_t
, which cannot be used to alias anint
. - The compiler sees that no code within the loop can affect the value of
*count
, making*count
a loop-invariant value, thus the read of*count
can be lifted out of the loop and performed once before beginning the loop. - Since we increment
idx
from zero to the (invariant) value of*count
, we can unroll the loop into*count
-repeated evaluations ofstr[idx]=0
. - This is now just a sequence of assignments to a contiguous range of memory,
so we can convert this into a
memset
.
But in the first example where we used std::string
:
- C and C++ allow pointer-to-
char
to be used to inspect and manipulate the “object representation” of other objects of different types, so achar*
can point to anint
. - The compiler must assume that the
char*
andcount
might alias, and thus any write throughstr
can spuriously affect the value of*count
. - Because any iteration of the loop could potentially change the value of
*count
,*count
is not loop-invariant, and we cannot make any assumptions about how many times this loop will execute.
So What?
For the history of C++, we have had std::string
, std::string_view
, and
char*
, all of which can potentially alias any other object, and only by
tracking the provenance of pointers could a C++ compiler prove otherwise.
Pointer provenance is a tricky subject, and it is likely the case that the
compiler has absolutely no idea where an object or pointer comes from (think a
const std::string&
function parameter, or an equivalent std::string_view
parameter).
With the addition of char8_t
, std::u8string
, and std::u8string_view
, we
now have string and “character” types that are guaranteed not to alias objects
of different types.