Understanding C++ Modules: Part 3: Linkage and Fragments
Wowza! It’s been a while since Part 2, and so much has happened. I’m glad to be back, and I have some “unfinished business” I need to wrap up. Namely: Finish this blog series.
I know that it has been a few months, but buckle up: We’ve got a lot to cover this time around.
This post won’t have an over-arching theme, but will finish up covering all of the new syntax relevant to understanding C++ Modules. This post will cover some odds and ends that we haven’t yet poked at in prior posts.
Posts in this series:
The Module [Unit] Purview
A C++ module unit has what is known as a purview. The purview of a module unit is all tokens beginning at the module-declaration and continues to the end of the translation unit:
// <- NOT within the purview of `Mine`
export module Mine;
// <- Within the purview of `Mine`
I anticipate many readers are now saying
“Hold on… Things can appear before the module-declaration?”
The answer is yes, things can appear before this point. I personally expect usage of this feature will be fairly esoteric and limited. We’ll look at that in a later section below.
For a named module, the purview of the module is the set of purviews for each of its constituent module units. For a module with a single module unit, the module purview and the module unit purview are equivalent.
A New Type of Linkage
C++ has had two primary types of “linkage” for a long time: internal and external. The purpose of linkage is to determine what constitutes when the same name refers to the same entity:
- Internal linkage pertains to entities which are inaccessible outside of the
translation unit in which they are declared. This includes namespace-scope
entities which have been declared
static
, and any entity declared within an anonymous namespace, variables declared withconst
(but notinline
orvolatile
), and members of anonymous unions. These entities may only be named within the translation unit in which they are declared, and their definition is not visible to other translation units. - External linkage pertains to entities declared extern, functions,
class
types and their members (includingstatic
members), and variables, and all templates thereof, that are not already internal by the rules above, and allenum
s. An entity of this type is “the same” between translation units. That is, taking the address of a external-linkage entity will reveal the same address in all translation units. (Note: This is where we can get tripped up by ODR, when “the same” entity between translation units has differing definitions between translation units).
In addition, all external linkage entities have a language linkage, which
allows the “sameness” to be used between different programming languages. The
language linkage is determined by a linkage-specification of extern
followed
by a string literal followed by either a single declaration or a block
containing a sequence of declarations, all of which inherit the language
linkage of the enclosing block. The most common you’ll see is extern "C"
,
which says that the associated names are “the same” as the ones declared within
a C translation unit. There is also an extern "C++"
linkage specification,
but it is the default and, thus, rarely used explicitly.
You may also note that inline
functions get no special treatment: They
have external linkage! Taking the address of an inline
function will yield
the same result between all translation units.
C++ Modules introduce the new aptly named module linkage. By the definition
above, you may have a good guess at what it means. If internal means that
entities are “the same” within a single translation unit, and external means
that they are “the same” between all translation units, then module linkage
means that entities are “the same” within all translation units belonging to
the module in which they are declared. Entities earn module linkage if they are
not internal, not export
ed, and are attached to a named module.
This sounds very esoteric, so let’s pin it down with some concrete examples:
export module foo;
import :counter;
import export :arith;
// Get the value of the counter;
export int get_value() {
return ::counter;
}
export module foo:arith;
import :counter;
// Add to the counter
export void add(int i) {
counter += i;
}
// Subtract from the counter
export void sub(int i) {
counter -= i;
}
module foo:counter;
// A global variable
int counter = 0;
In the above example, add
, sub
, and get_value
all have external linkage,
and counter
has module linkage. There is no way to directly name and
reference counter
from outside of the module, but every module unit within
foo
has the same definition of counter
.
export module bar;
class bar {};
// The bar has a counter
bar counter;
We’ve added another counter
, but this time to another module bar
. Despite
both entities having the fully qualified name ::counter
, and being accessible
across translation units, they aren’t accessible to translation units that
aren’t a part of the module. This gives us a good tradeoff between having a
global variable that is thrown into the wild for anyone to access, and having
an internal variable/class that only your single TU can access. In a way,
module-linkage provides us with encapsulated globals.
If you’re familiar with ABIs, your ears might be turning at this one. With present linker technology, the only way to support module linkage is with additional name-mangling on such entities.
There is a tentative belief that such mangling will also appear on entities
declared within modules with external linkage (anything export
ed or extern
),
which would have a horrible implication that moving an exported entity between
modules is an ABI break, even if it is still re-exported from the same modules.
This author believes such mangling to be unnecessary. The language does not prescribe in this area, but I’m hopeful that the Ecosystem TR might be able to slap this one down before it spreads.
This author hates the trend of ABI stability, so I’ll abstain from opining (even though I just did (kind of)).
IF NDR WARNING
The phrase “ill-formed, no diagnostic required,” strikes fear into the hearts of most developers. In most cases, an “ill-formed” program is immediately obvious, and the compiler/linker will halt and tell you to fix it. For example, missing a closing brace, or declaring the same name as two different things.
With “no diagnostic required,” the compiler/linker might halt and tell you to fix it, but it might silently generate a malformed program with a subtle and hard-to-diagnose issue. It may even appear to work for a while!
C++ Modules introduce an IF NDR situation almost identical to an ODR violation, but a naive understanding of modules tells you that it should work:
export module uk;
export class bin {
void put_refuse();
};
export module murica;
export class trash_can {
void put_refuse();
};
export class bin {
void store_christmas_decorations();
};
If these two modules are combined into a single program, that program becomes
ill-formed, no diagnostic required. There are two conflicting definitions of
::bin
!
In some cases, this is trivial to diagnose for the compiler:
import uk;
import murica; // Hol' up.
Even though a diagnostic isn’t required, a compiler in the above situation
can clearly see the two definitions of ::bin
simultaneously and issue a
diagnostic. It isn’t always so easy:
export module eurasia;
import uk;
export module north_america;
import murica;
export module world;
import eurasia;
import north_america;
In this example, world
imports both eurasia
and north_america
, which both
have a non-exported import of conflicting definitions of ::bin
. Compiling
world
will not necessarily see either ::bin
, as neither module re-exports
the import which introduced the declaration.
On the other hand, this is explicitly banned:
export module statistics;
import uk;
export class bin { // STOP!
};
export class histogram {
vector<bin> bins;
};
In this case, the standard requires a diagnostic for this program, as it is
trivial for an implementation to see that there is already an existing ::bin
.
(As for why statistics
is importing uk
? I dunno, maybe it’s for a census.)
The Global Module
With C++ Modules, every entity must be “attached” to a single module. This
leads us to the questions: “What about code not aware of modules?” and “what
module is main()
attached to?”
The answer to both is the global module. This is a special “implicit” module that catches all code that isn’t declared within a module.
You may see the term “named module.” This simply refers to a module with a name. The only module without a name is the global module.
Declaring a main()
function at global scope attached to a named module is not
allowed!
To add entities to the global module, you can simply declare them in a non-module translation unit.
To use entities from the global module… what do you do? You can’t import
the global module (it’s unnamed). We’ll need to use this all the time,
especially for interacting with non-module-aware libraries. Read on…
A New Preprocessor Directive
Read the entirety of this section. Do not skim the top.
C++ Modules introduces an exciting new preprocessor directive! Bet you weren’t expecting that, eh? After all, Modules are supposed to negate some uses of the preprocessor, right?
Yes, actually. Many uses of the preprocessor’s #include
directive can be
replaced with features introduced by C++ Modules. Nevertheless, we’ll still
need the aid of the preprocessor to import non-module-aware code into our
module-aware code. Should be easy, right?:
export module my_game;
#include <SDL.h>
HALT, CRIMINAL SCUM! This is horribly broken. C++ Modules is almost
entirely agnostic to the existence of the preprocessor. What you’ve just done
in the above code is cause every single symbol in SDL.h
to become attached
to my_game
(with module linkage). Obviously, your module does not own the
contents of SDL.h
.
How would we go about accessing the content of SDL.h
? We use the new
preprocessor directive, of course!
export module my_game;
import <SDL.h>;
This is known as a “header-unit” import (previously called a “legacy
header-unit.” The word “legacy” was dropped), and yes: this is a preprocessor
directive, but it is nothing at all like #include
.
You can also use double-quotes in your header-unit import:
import "my_code.h";
to facilitate the differing <>
and ""
lookup rules of #include
.
A header-unit is created by the compiler by performing lexical translation
phases 1-7 (i.e. everything except codegen), and treating the resulting code
as-if it were a module in which every exportable entity is implicitly
export
ed. (The entities are attached to the global module.)
The act of import
ing a header-unit is as-if importing this synthesized
module. All exportable symbols from the header are made available in the
importing module.
None of the above actually necessitate any usage of the preprocessor, though.
The reason this is a preprocessing directive is that macros from the header
become visible in the import
ing translation unit after the import
directive
is passed.
Important notes about macros with header-unit imports:
- Ambient preprocessor state from the
import
-ing translation unit is not considered by the header-unit. This means that any#define
macros in the importer have no effect on the content of the header-unit. The result is that we have one-way isolation of macros. Presumably, global preprocessor definitions might have an effect (think-D
command-line options). (If you’re disturbed by the weasel words I’m using here: Don’t worry: So am I.) - You may
export import
header-units just like regular modules, but the macros from the header-unit are not re-exported into transitive importers: Only the regular language symbols. Only the literalimport
of a header-unit will expose the macros
So the answer to “How do I use macros with modules?” is here: You just import
them! (There’s one other way discussed later).
I’ve Lied
Well, half-intentionally I’ve made lies. I wrote the above section before I made a discovery in the current draft with important implications. If you want to make yourself sad, read the rest of this section, otherwise: Throw away most everything in the parent section and skip to the next heading.
Header-unit-imports have an important restriction: The import
of a
header-unit is only allowed for what the standard deems
“importable-headers.” The definition of what makes a header “importable” is
entirely implementation-defined.
This gives us a bit of a pickle. Is it safe to rewrite an arbitrary #include
directive to use a header-import? The answer, of course, is ABSOLUTELY NOT.
These two snippets are not equivalent:
#define _UNICODE
#include <windows.h>
#define _UNICODE
import <windows.h>;
Remember: The preprocessor state before import
has no effect on the contents
of the header unit. Those who know windows.h
know that it does some wild
changes based on the presence of _UNICODE
as a macro. Thus, we cannot
safely perform this rewrite without repercussions.
In fact, this is the basis for what most consider “importable” to mean: Safe from being affected by ambient preprocessor state.
Pop-quiz, is this an importable header?
#include <asdfhucini24nyuasdfuybukvbysjfdahjbasdfnvurei.h>
You can’t possibly know, and the nonsensical name is intentional: This is how
the compiler sees user code. It does not speak a human language. It cannot know
the semantics of <asdfhucini24nyuasdfuybukvbysjfdahjbasdfnvurei.h>
, thus the
header cannot be arbitrarily considered an “importable” header.
The current trend is that importable headers must be explicitly enumerated in a
resource that the compiler may consult to discover which headers are candidates
for import
and #include
-rewriting. This means you probably cannot “just
import <SDL.h>
,” but that you must inform the compiler that such an import is
safe. On the other hand, perhaps the presence of import <SDL.h>
is enough to
force the compiler to deem the header “importable,” and perhaps not. Only time
will tell.
Do not worry: We are not out of hope. We have another option.
The Global Module Fragment
Congratulations on making it thus far! We’re really getting deep into the nitty-gritty now.
So, we know that we cannot import <windows.h>
safely without some extra work,
and we can’t #include <windows.h>
within the purview of a module unit, but we
still want to be able to use it within a module. We have one last tool to come
to our aide, and it’s a slippery and weird-looking beast: The global module
fragment. It looks like this:
module;
// stuff ... [1]
module foo;
// module purview... [2]
Remember that things can appear before the module-declaration? Well, that’s
what you have here. The two magic tokens module
followed by ;
tell the
compiler that it’s about to compile a module unit, but that we first have some
non-module code we need to introduce.
The “stuff” in [1]
is the global module fragment, and anything in the
translation unit that is declared/defined here is attached to the global module,
and not to the module of the containing module unit.
There is one gigantic, enormous, immovable caveat that must be taken into
consideration: Prior to preprocessor execution (translation phase 4), only
preprocessor directives may appear in the global fragment. That is: You
cannot write anything between module;
and the module-delcaration except for
preprocessing directives. Anything that you need to declare/define in the
fragment must be accessed via an #include
directive
This is not allowed:
// bar.cpp
module;
extern void foo(); // ILLEGAL: Not a preprocessor directive!
export module bar;
void call_foo() {
foo();
}
However, this is valid:
// foo.h
extern void foo();
module;
#include "foo.h" // Okay: Declares `::foo()`
export module bar;
void call_foo() {
foo(); // Okay: `foo` was declared in the global module fragment.
}
So, if we want to use <windows.h>
in our module unit, we can put it in the
global fragment:
module;
#include <windows.h>
export module bar;
// Do Win32 stuff...
“Discarded” Declarations
C++ modules defines a term decl-reachable, which declares a relationship
between two declarations: A declaration D
may or may not be decl-reachable
from another declaration S
(either of which may be definition). The rules of
being decl-reachable are somewhat complex, and you can read about them
here, but it is
sufficient to note that most uses of a declaration D
from within S
cause
D
to be decl-reachable, and that being decl-reachable is fully
transitive. That is:
- If
A
is decl-reachable fromB
, and B
is decl-reachable fromC
, thenA
is decl-reachable fromC
.
An implemention will discard declarations that appear in the global module fragment if-and-only-if those declarations are not decl-reachable within the module purview.
What does this mean for you, the programmer? Most importantly: Discarded declarations are neither reachable nor visible outside of the module unit. This means that the semantic properties and visibility to any entity declared within the global module fragment is propagated to importers of the containing module unit if-and-only-if that module unit makes use of such a declaration in a way that creates decl-reachability from any of that module’s declarations.
There is one very important case where a declaration is not decl-reachable that you would otherwise expect to work: If the declared entity is used in a way that it is a candidate in an overload set of a dependent expression within a function or class template.
This has some potentially goofy quirks:
// foo.hpp
template <typename T>
void do_something(T val) {
// ...
}
module;
#include "foo.hpp"
export module acme;
template <typename T>
export void frombulate(T item) {
do_something(item);
}
In this case:
do_something
is used as a name in a dependent expression withinfrombulate
.- It cannot be resolved by the first phase of name-lookup.
- We cannot prove that the
do_something
fromfoo.hpp
is used byfrombulate
. - It is not decl-reachable from
frombulate
. - It is not decl-reachable within
acme
, and therefore do_something
fromfoo.hpp
will be discarded!
This means:
import acme;
int main() {
frombulate(42); // ERROR: No matching overload of `do_something`!
}
Something strange happens when we tweak our module, though:
module;
#include "foo.hpp"
export module acme;
template <typename T>
export void frombulate(T item) {
do_something(item);
}
export void use_it() {
frombulate(true);
}
Without making any change to foo.hpp
nor main.cpp
, our program now
compiles and works! What’s going on??
The answer is that use_it
, calls frombulate
, which is resolved immediately
as a specialization frombulate<bool>
. This concrete specialization allows the
compiler to now perform the second phase of name lookup for frombulate<bool>
within the context of the module unit, and it discovers a usage of
do_something(bool)
within the body of frombulate<bool>
, which is no longer
a dependent expression. Thus, do_something
is now decl-reachable from
frombulate
, and therefore decl-reachable from acme
, and it is no longer
allowed to be discarded.
This “spooky action at a distance” is admittedly very contrived, but there is a 99% chance that someone will unwittingly stumble upon this quirk within the first four minutes of the release of a full C++20 module compiler.
The Last New Syntax: The private
Module Fragment
You may remember from the first post a note that a module partition may not
be named private
. Well, now we finally get to talk about why.
There is a special syntax that denotes an aspect of the module purview called the private module fragment. In total, a module unit, with all the wizbang features, looks something like this:
// [The global module fragment - optional]
module;
#include <stuff.hpp>
// [The module preamble]
export module foo:bar;
export import widgets;
import gadgets;
// [The module interface - optional]
export void do_stuff();
// [The private module fragment - optional]
module :private;
void do_stuff() {
std::cout << "Howdy!\n";
}
// [The end]
We’ve always wanted the benefit of keeping the interface and implementation separate (That is: keeping compile times low and hiding implementation from our consumers by reducing the propagation of implementation details into downstream users who only need to see the interface), but we don’t like to pay the price of having multiple source files to define a single logical component. Is there a way we can have the benefits of faster incremental builds without having to juggle multiple source files?
This is where the private module fragment comes in. The way it is specified in
the standard document does not mention its purpose, but rather emphasizes
dozens of restrictions on code and the way it interacts with module :private
,
and it is apparent what its purpose is: to provide a separation of interface
and implementation in a way that we can provide them together in a single
source file without exposing the implementation details in the interface.
In essence, the restrictions on module :private;
are specifically meant to
prohibit code appearing in the fragment from having any apparent effect on the
interface of the containing module. Under no circumstances should modification
of the content of module :private;
necessitate a re-translation of the
importers of the containing module. Another restriction: If a module unit
contains a private module fragment, that module unit must be the primary
module interface unit, and there should be no other module units in that
module. That is: A module which uses the private fragment must be a single-file
module.
The necessary restrictions on the private fragment are numerous and spread all across the specification, so I won’t enumerate them all here, but a good rule of thumb is that anything that can modify a module’s interface is not permitted within the private fragment.
What Makes the “Interface” of a Module?
If you think about the very fact the the private module fragment even exists, you may be asking yourself another question: Why is it necessary at all??
Consider this simple module:
export module greet;
import <iostream>;
export void do_greet(int n) {
while (n--) {
std::cout << "Hello, world!\n";
}
}
and a simple program:
// main.cpp
import greet;
int main(int argc, char**) {
do_greet(argc);
}
When we compile, link, and run this program, it will print Hello, world!
to
standard-output as many times as command line argument we give it plus one.
But what if we want to print to standard-error? Let’s change do_greet
:
export module greet;
import <iostream>;
export void do_greet(int n) {
while (n--) {
std::cerr << "Hello, world!\n";
}
}
Now ask yourself a question: Do we need to recompile our main.cpp
? Of course
not! Right?
… Right?
It depends on which compiler you ask, because it depends if the compiler will
propagate the body of do_greet
into module importers. I doubt it
controversial to say that this is a very surprising behavior. After all: C++
modules are supposed to help us encapsulate!
I think it fair to say that we probably don’t want the body of do_greet
to
be propagated to downstream users, and there are even cases where it is
impossible for do_greet
to be propagated: If do_greet
uses any entity
with internal linkage, the body thereof cannot be exposed outside of the
module. I feel confident that implementations will converge on not
propagating the body of functions, and thus improving the performance of
incremental compilation.
We will finally be able to define our functions at the declaration, and not worry about massive compile times! Hooray!
Except… we won’t.
inline
Ruins the Fun
Update:
This section deals with how the inline
keyword affects the interface of a
module. It assumes that in-class definitions of member functions are implicitly
inline
, which was true until the adoption of
P1779, which adds an explicit exception. As such,
inline
is no longer implied on the declaration of member function defined
within the body of a class when that class is declared within a module unit.
If we redefine do_greet
as such:
export module greet;
import <iostream>;
export inline void do_greet(int n) { // <-- Added `inline`
while (n--) {
std::cerr << "Hello, world!\n";
}
}
It should be no surprise that we now cannot change the body of do_greet
without affecting our importers, as the body of the function is now part of its
interface.
The inline
keyword is used to mean “put the body of the function into
callers.”
Except for when it’s not.
It’s been well-known for many years that inline
is merely a suggestion to a
compiler to embed the body of the function into callers, but under the as-if
rule an optimizer can inline functions as it pleases.
The effect of inline
on the inliner is hotly debated, but completely
irrelevant to this post. What I want to talk about is a sad story regarding a
side-effect of the inline
keyword.
Because a function declared as inline
might be inlined into its callers, the
definition of the function must be visible within every translation unit in
which it is used, but this goes against another rule: ODR. inline
functions
have external linkage, and this means that every translation unit in a
program must agree on the definition, which includes the address of the
function itself.
How can we consolidate the requirement that every translation unit contain the
definition of an inline
function, while also satisfying that there only be
one definition in a program?
The answer is: Don’t even try. Just punch a hole in the rules that permits
inline
functions to a definition in multiple translation units, provided all
definitions are token-wise identical.
In practice, this is implemented by annotating the generated symbol as “weak” on ELF or “selectany” on COFF with MSVC. When the linker merges translation units, it will assume that all definitions are equivalent, and throw away all except a single definition.
This is actually a pretty neat trick. In fact, it was so neat that we wanted to be able to do this with more entities:
class Foo {
// ...
public:
Foo() = default;
};
In here, we have a defaulted definition of the constructor for Foo
, and this
definition is included in every translation unit that sees the definition of
Foo
itself.
class Foo {
string _name;
public:
void say_something() const {
std::cout << "My name is " << _name << '\n';
}
};
And here, we have an in-class definition of the function Foo::say_something
.
Just like with the defaulted constructor, the definition thereof will be
included within every translation unit that sees Foo
.
But we didn’t declare them inline
, and we still have to satisfy ODR. Hmm…
How can we fix this?
Of course! Let’s just implicitly slap inline
on there! Every in-class
definition of a member function is implicitly inline
! [Edit: As of
P1779, inline
is no longer implicit for in-class
member function definitions within a module unit.]
In fact inline
has taken on this role so much that we grant these semantics
to variables using the inline
keyword. In this case, the inline
has nothing
at all to do with inlining code, and is purely a directive to the linker that
we want to allow multiple definitions of the variable.
How does this relate to modules, then?
Recall that the body of an inline
function is part of the interface of that
function. Since in-class definitions of member functions are implicitly
inline
, the in-class definitions of member functions are part of a module
unit’s interface.
export module acme;
export class Widget {
int _age = 0;
private:
void _activate() {
std::cout << "Bzzt!\n";
++_age;
}
void _explode();
public:
void use_it() {
if (_age > 500) {
_explode();
} else {
_activate();
}
}
};
module :private;
void Widget::_explode() {
// ...
}
Note:
The below two paragraphs are no longer correct. Refer to P1779.
In this module, the body of Widget::_explode
is hidden from importers, while
the body of Widget::_activate
and Widget::use_it
are visible to users.
Any modifications to _activate
or use_it
necessitate recompilation of
anyone that imports acme
. Yes: Even though _activate
is private
and
cannot even be referenced by users, it’s body is also part of the module
interface, since it is declared as exported and (implicitly) inline
.
Only _explode
, whose definition is provided out-of-line and in the private
fragment, is safe to modify without triggering a cascading re-compile. Placing
the definition of the method in the private fragment is the closest we can get
to keeping the declaration and definition together without causing cascading
recompilation when the body is modified.
More inline
Goofs
You may remember my frightening side-note about “linkage promotion.” You’ll be glad to hear that linkage promotion is a thing of the past, and it is now a hard-error to use an entity of internal linkage in a way that its definition must be exposed to importers.
Note:
This example is out-dated after the adoption of
P1779. While it is true that internal-linkage
entities must not be used from inline
functions, in-class member function
definitions are no longer implicitly inline
.
Unfortunately, one such illegal-usage is usage within an inline
function.
Since in-class definitions are implicitly inline
, this means that using an
entity of internal linkage is illegal within the body of an in-class member
function definition:
export module acme;
// Internal-linkage:
static const char* greeting = "Go go gadget: Compiler error!";
// Exported class
export class Gadget {
public:
// implicit `export inline` function:
void boom() {
puts(greeting); // ERROR: Usage of internal-linkage entity
// within exported inline function!
}
};
The fix:
export module acme;
// Internal-linkage:
static const char* greeting = "Go go gadget: Compiler error!";
// Exported class
export class Gadget {
public:
// implicit `export` function (not `inline`):
void boom();
};
// Out-of-line definition:
void Gadget::boom() {
puts(greeting); // Okay
}
That’s All for Now
“For now?” I hear you ask. Yes, there’s more to talk about! Fortunately, we’ve covered all of the new syntax and the semantics of that syntax, but we haven’t addressed some of the effects modules play on the semantics of existing syntax, that is primarily: Name and overload resolution.
In the next post, I expect to finish talking about what modules are, but there will be a lengthy post dedicated to what modules are not.
Hopefully it won’t take so long to get to part four.
Stay tuned for more!