We explore when and how our code benefits most from equipping classes with move constructors and assignment operators. We analyse in detail, which constructors and assignment operators (move or copy) are called for C++98 and C++11 when we insert or emplace objects into containers, when we initialise containers and when we shuffle and sort containers. Finally, we run benchmarks for these operations and discuss the actual speed-up, which ranges from 1% to 70%.
A Class with Move Semantics
Let me introduce the class CTeam
, which will accompany us through our experiments with move semantics. The class represents a football team with its name, points and goal difference. The vector m_statistics
of 100 double numbers is only there to make CTeam
objects more heavy-weight during copying.
// cteam.h #ifndef CTEAM_H #define CTEAM_H #include <string> #include <vector> class CTeam { public: ~CTeam(); // dtor CTeam(); // default ctor CTeam(const std::string &n, int p, int gd); // name ctor CTeam(const CTeam &t); // copy ctor CTeam &operator=(const CTeam &t); // copy assign CTeam(CTeam&& t); // move ctor CTeam &operator=(CTeam &&t); // move assign std::string name() const; int points() const; int goalDifference() const; static int copies; private: std::string m_name; int m_points; int m_goalDifference; static const int statisticsSize = 100; std::vector<double> m_statistics; }; #endif // CTEAM_H
I implemented all the constructors and assignment operators, because they print their name to the console and because the copy variants increment the counter copies
. We must only comment out the move constructor and assignment to turn the C++11 code into C++98 code.
The implementation comes with no surprises. The print statements are commented out, because they would completely contort the performance results. We should uncomment them for the detailed analysis. Here is the part that is the same for C++98 and C++11.
// cteam.cpp (same for C++98 and C++11) CTeam::~CTeam() { // cout << "dtor" << endl; } CTeam::CTeam() : m_name("") , m_points(0) , m_goalDifference(0) { // cout << "default ctor" << endl; } CTeam::CTeam(const string &n, int p, int gd) : m_name(n) , m_points(p) , m_goalDifference(gd) { // cout << "name ctor" << endl; m_statistics.reserve(statisticsSize); srand(p); for (int i = 0; i < statisticsSize; ++i) { m_statistics[i] = static_cast(rand() % 10000) / 100.0; } } CTeam::CTeam(const CTeam &t) : m_name(t.m_name) , m_points(t.m_points) , m_goalDifference(t.m_goalDifference) , m_statistics(t.m_statistics) { // cout << "copy ctor" << endl; ++CTeam::copies; } CTeam &CTeam::operator=(const CTeam &t) { // cout << "copyAssign" << endl; ++CTeam::copies; if (this != &t) { m_name = t.m_name; m_points = t.m_points; m_goalDifference = t.m_goalDifference; m_statistics = t.m_statistics; } return *this; }
The move constructor and move assignment are specific to C++11. Again, there is nothing special about them.
// cteam.cpp (specific to C++11) CTeam::CTeam(CTeam &&t) : m_name(move(t.m_name)) , m_points(move(t.m_points)) , m_goalDifference(move(t.m_goalDifference)) , m_statistics(move(t.m_statistics)) { // cout << "move ctor" << endl; } CTeam &CTeam::operator=(CTeam &&t) { // cout << "move assign" << endl; m_name = move(t.m_name); m_points = move(t.m_points); m_goalDifference = move(t.m_goalDifference); m_statistics = move(t.m_statistics); return *this; }
Note that we use the move
function when assigning to the member variables. This tells the compiler to move the members of the argument t
into the members of this object. Moving of plain old data types is the same as copying. Hence, the integers m_points
and m_goalDifference
are copied. The STL data types std::string
and std::vector
support move semantics. Hence, the members m_name
and m_statistics
are eligible to moving.
Detailed Analysis of Copying versus Moving
Pushing an Element to the Back of a Vector
The function testPushBack()
appends a CTeam
object in three slightly different ways to the vector table
.
void Thing::testPushBack() { vectortable; table.reserve(4); cout << "@ Adding Arsenal" << endl; table.push_back(CTeam("Arsenal", 68, 25)); // [1] QCOMPARE(CTeam::copies, 0); cout << "@ Adding Man United" << endl; CTeam manUtd("Man United", 63, 24); table.push_back(manUtd); // [2] QCOMPARE(CTeam::copies, 1); QVERIFY(!manUtd.name().empty()); cout << "@ Adding Liverpool" << endl; CTeam liverpool("Liverpool", 54, 32); table.push_back(move(liverpool)); // [3] QCOMPARE(CTeam::copies, 1); QVERIFY(liverpool.name().empty()); cout << "@ End of test" << endl; }
Let us first look at what the C++98 compiler does for line [1].
table.push_back(CTeam("Arsenal", 68, 25)); // [1]
The debug trace looks as follows.
@ Adding Arsenal // [1], C++98 name ctor copy ctor dtor
The C++98 compiler creates a temporary CTeam
object with the name constructor, copies this object into table
container with the copy constructor, and destroys the temporary object again.
The C++11 compiler recognises that the argument of push_back
in [1] is an rvalue, that is, a temporary object that doesn't have a name and to which the address-of operator cannot be applied. When the C++11 compiler sees an rvalue, it calls the overload push_back(T &&value)
taking an rvalue reference. This overload creates the CTeam
object and moves it into the table
container, as the following debug trace proves. The C++11 compiler chooses the move constructor instead of the copy constructor.
@ Adding Arsenal // [1], C++11 name ctor move ctor dtor
We don't have any speed-up for the two data members m_points
and m_goalDifference
, because - as built-in data types - they are simply copied. The other two data members - m_name
and m_statistics
- are types - std::string
and std::vector
, respectively - that support move semantics. Both hold pointers to their contents. The move constructor copies the pointers, which is cheaper than copying the contents. So, the C++11 version should be slightly faster than the C++98 version.
Finally, the destructor is called on the emptied temporary object. After moving, the temporary object contains an empty string m_name
and an empty vector m_statistics
.
Let us move on to line [2].
CTeam manUtd("Man United", 63, 24); table.push_back(manUtd); // [2]
The debug trace is the same for C++98 and C++11: one call to the name constructor and one to the copy constructor.
@ Adding Man United // [2] name ctor copy ctor
This time we add the local object manUtd
to the table
. This object is not temporary, as it lives on after the call to push_back
. Obviously, it has a name manUtd
and the expression &manUtd
is valid. Hence, manUtd
is an lvalue and not an rvalue. Move semantics and its optimisation do not apply to lvalues. The behaviour and performance for adding an lvalue to a container is the same for C++98 and C++11.
Line [3]
CTeam liverpool("Liverpool", 54, 32); table.push_back(move(liverpool)); // [3]
shows that there is a way to force the C++11 compiler to apply move semantics. We can cast the lvalue liverpool
to an rvalue by calling std::move
on an lvalue. Line [3] behaves the same as line [1], as the debug trace shows.
@ Adding Liverpool // [3] name ctor move ctor @ End of test
It is important to understand that the local variable liverpool
doesn't contain anything any more. It is empty, because it has passed its contents to the containter table
. Hence, the expression liverpool.name().empty()
is true. The local variable liverpool
has no contents any more but it is still valid (in the sense that it doesn't crash).
In summary, we see that the C++98 version of testPushBack
performs 3 copies, whereas the C++11 version performs only 1 copy (line [2]).
Emplacing an Element to the Back of a Vector
Let us have another close look at line [1] of the push_back example.
table.push_back(CTeam("Arsenal", 68, 25)); // [1]
This line creates a temporary CTeam
object, which is half-moved and half-copied into the table
container and then destroyed. The moving is still inefficient. It would be much more efficient to construct this temporary object in-place at the location provided by the table
container. This is exactly what return-value optimisation (RVO) would do for a return value (even for many C++98 compilers). And, this is exactly what the C++11 compiler can do using std::vector::emplace_back()
and move semantics.
If we rewrite the line above as
table.emplace_back("Arsenal", 68, 25); // [1]
the debug trace
@ Adding Arsenal // [1], C++11 name ctor
shows that only the name constructor is called. Neither the move constructor nor the destructor are called. This is the optimal solution!
Note that the three versions
table.emplace_back({"Arsenal", 68, 25}); table.emplace_back(Team{"Arsenal", 68, 25}); table.emplace_back(Team("Arsenal", 68, 25));
are less efficient, because they call the name constructor, the move constructor and the destructor. emplace_back
degenerates into push_back
.
Replacing push_back
by emplace_back
in lines [2] and [3] of the push_back example
CTeam manUtd("Man United", 63, 24); table.emplace_back(manUtd); // [2] name ctor, copy ctor CTeam liverpool("Liverpool", 54, 32); table.emplace_back(move(liverpool)); // [3] name ctor, move ctor
doesn't make any difference.
Initialising Vectors
As enthusiastic C++11 developers, we may be tempted to initialise a vector of teams using C++11's new initialiser lists.
vector<Team> table = { {"Leicester", 80, 32}, {"Spurs", 70, 38}, {"Arsenal", 68, 25}, {"Man City", 65, 30} }; QCOMPARE(Team::copies, 4);
This solution triggers four name-constructor calls for the four teams followed by four copy-constructor calls to place the four teams in the table
and finished by four destructor calls destroying the four temporary objects. This is equivalent to default-constructing an empty table
first and calling push_back
for each team.
OK, we know how to improve that. We use emplace_back
instead of push_back
.
vector<Team> table; table.emplace_back("Leicester", 80, 32); table.emplace_back("Spurs", 70, 38); table.emplace_back("Arsenal", 68, 25); table.emplace_back("Man City", 65, 30); QCOMPARE(Team::copies, 3);
Well, this saves us one copy, but we expected no copy at all. The reason is that vectors expand and shrink their storage capacity as needed. At the beginning, table
has the capacity to store one element. So, team "Leicester" can be name-constructed into the container. When we add the second team, the "Spurs", the capacity is too small and is doubled. Team "Leicester" is copied into the new storage and the "Spurs" are name-constructed into the new storage. When we add the third team, "Arsenal", the capacity is doubled to 4. The first two teams are copied into the new storage and "Arsenal" is name-constructed into the new storage. Now, the capacity is sufficient and we can simply name-construct "Man City" into the existing storage. This makes three copies in total.
The cure is obvious. We reserve four elements in the container before-hand and emplace the team into the existing slots.
vector<Team> table; table.reserve(4); table.emplace_back("Leicester", 80, 32); table.emplace_back("Spurs", 70, 38); table.emplace_back("Arsenal", 68, 25); table.emplace_back("Man City", 65, 30); QCOMPARE(Team::copies, 0);
Finally, there are zero copies. Of course, reserving enough storage capacity only works if we know the approximate number of elements before-hand. If space is at a premium, we may have to use a different data structure like std::unorderd_map
, std::map
or even std::list
.
Shuffling and Sorting
Shuffling and sorting a vector of teams should be where move semantics shines brightest.
void Thing::testShuffleAndSort() { vector<CTeam> table; table.reserve(20); table.emplace_back("Leicester", 81, 32); table.emplace_back("Arsenal", 71, 29); // Adding 18 more teams ... random_shuffle(table.begin(), table.end()); // [1] sort(table.begin(), table.end(), greaterThan); // [2] QCOMPARE(CTeam::copies, 0); }
We first shuffle the 20 teams in table
(line [1]) and then sort them again (line [2]). When the shuffle and sort algorithms swap elements, they can move these elements instead of copying. The function testShuffleAndSort()
doesn't need a single copy.
If we replace the emplace_back
functions by
table.push_back(CTeam("Leicester", 81, 32)); table.push_back(CTeam("Arsenal", 71, 29));
we get perfectly valid C++98 code. If we run the function testShuffleAndSort()
in C++98, we'll see 130 or even 160 copies. The number of copies depends on how badly the container got shuffled.
Shuffling, sorting and moving around elements in containers should be significantly faster in C++11 than in C++98. This is where move semantics really shines.
The Benchmarks
The first benchmark calls the testShuffleAndSort
function 5000 times. We use push_back
instead of emplace_back
for the C++11 version as well, because we can then use identical code for both C++98 and C++11. Note that the C++11 version still works with zero copies.
// thing.cpp namespace { bool greaterThan(const CTeam &t1, const CTeam &t2) { return t1.points() > t2.points() || (t1.points() == t2.points() && t1.goalDifference() > t2.goalDifference()); } } void Thing::benchmarkShuffleAndSort() { vector<CTeam> table; table.reserve(20); table.push_back(CTeam("Leicester", 81, 32)); table.push_back(CTeam("Arsenal", 71, 29)); // Adding 18 more teams ... QBENCHMARK { for (int i = 0; i < 5000; ++i) { random_shuffle(table.begin(), table.end()); sort(table.begin(), table.end(), greaterThan); } } }
The second benchmark adds 20 teams to the table
container using push_back
and clears the container. It repeats these steps 5000 times.
void Thing::benchmarkPushBack() { vector<CTeam> table; table.reserve(20); QBENCHMARK { for (int i = 0; i < 5000; ++i) { table.push_back(CTeam("Leicester", 81, 32)); table.push_back(CTeam("Arsenal", 71, 29)); // Adding 18 more teams ... table.clear(); } } }
The third benchmark is the same as the second, but it uses emplace_back
instead of push_back
. Obviously, we cannot run this test with C++98.
void Thing::benchmarkEmplaceBack() { vector<CTeam> table; table.reserve(20); QBENCHMARK { for (int i = 0; i < 5000; ++i) { table.emplace_back("Leicester", 81, 32); table.emplace_back("Arsenal", 71, 29); // Adding 18 more teams ... table.clear(); } } }
I ran the first and second benchmark with C++98. The results are shown in the column "C++98" in the table below. I ran all three benchmarks with C++11 in two variants. I removed the move constructor and assignment from the class CTeam
for the first variant denoted "C++11/Copy". The second variant denoted "C++11/Move" comes with the full glory of move semantics.
QtTest has the nifty feature that we can run benchmarks in callgrind by passing the option -callgrind
to the test exectuable. Callgrind counts the number of read instructions the CPU executes for the benchmark.
Here is the table with the results.
Benchmark | C++98 | C++11/Copy | C++11/Move |
---|---|---|---|
ShuffleAndSort | 106,442,624 | 106,442,697 | 62,859,675 |
PushBack | 1,823,007,025 | 1,821,111,486 | 1,805,521,091 |
EmplaceBack | n/a | 1,803,255,948 | 1,801,486,078 |
As we want to compare the different implementations of the same problem, we are more interested in relative performance than in absolute numbers of read instructions. We define the implementation "EmplaceBack - C++11/Move" as the reference point with a value of 1.00. The values for the other four implementations of the PushBack and EmplaceBack benchmarks show how much slower the other implementation is than the reference (best) implementation. The implementation "ShuffleAndSort - C++11/Move" is the reference point for the other two ShuffleAndSort implementations.
Benchmark | C++98 | C++11/Copy | C++11/Move |
---|---|---|---|
ShuffleAndSort | 1.693 | 1.693 | 1.000 |
PushBack | 1.012 | 1.011 | 1.002 |
EmplaceBack | n/a | 1.001 | 1.000 |
Move semantics speeds up the ShuffleAndSort benchmark by a factor of nearly 1.7. This is quite impressive but not surprising as the ShuffleAndSort benchmark hits the sweet spot of move semantics. Whenever the shuffle and sort algorithm swap the position of two elements, they do a three-way move instead of a three-way copy.
The results for the PushBack and EmplaceBack benchmarks are within 1% of each other. Let us compare the different implementations one by one.
The difference between "C++98" and "C++11/Copy" is negligible. It is within the measurement variance.
The "C++11/Copy" implementation of the PushBack benchmark uses the push_back(const T&)
overload, whereas the "C++11/Move" implementation uses the push_back(T&&)
overload. The former implementation calls the copy constructor of CTeam
, whereas the latter implementation calls the move constructor. This explains the speed-up of nearly 1% when using move semantics.
At first glance, it is slightly surprising that we see a similar speed-up of 1% when we use emplace_back
instead of push_back
for "C++11/Copy" - although there is no move semantics involved. If we think about it briefly, it is pretty obvious though. emplace_back
constructs the CTeam
object directly into the proper place in the container. It does neither call the copy constructor nor the destructor.
For "C++11/Move", emplace_back
is slightly faster (0.2%) than push_back
. emplace_back
only calls the name constructor of CTeam
, whereas push_back
calls the name constructor, move constructor and destructor.
The difference between "C++11/Copy" and "C++11/Move" for emplace_back
is negligible. It is within the measurement variance.
Accidental Performance Gains
I started writing the code examples in C++11. When I made the code C++98 conformant, I had to use the C++98 way for sorting containers and for generating random numbers. When I profiled the C++98 way against the C++11 way, I noticed significant performance differences. The following table shows columns "C++98" and "C++11/Move" from the Benchmark table and adds the column "C++11/Opt", which uses move semantics and optimised algorithms.
Benchmark | C++98 | C++11/Move | C++11/Opt |
---|---|---|---|
ShuffleAndSort | 1.693 | 1.000 | 0.869 |
PushBack | 1.012 | 1.002 | 0.295 |
EmplaceBack | n/a | 1.000 | 0.294 |
The speed-up is massive: 13% for ShuffleAndSort and 70% for PushBack and EmplaceBack. How can we achieve this speed-up?
For ShuffleAndSort, I replaced
sort(table.begin(), table.end(), greaterThan);
by
sort(table.begin(), table.end(), [](const CTeam &t1, const CTeam &t2) { return t1.points() > t2.points() || (t1.points() == t2.points() && t1.goalDifference() > t2.goalDifference()); });
For C++98, the sort
algorithm calls the function greaterThan
for every comparison. For C++11, it inlines the lamda expression and runs through the inlined instructions of the body of greaterThan
for every comparison. The algorithm saves the overhead of calling greaterThan
for every comparison.
For PushBack and EmplaceBack, the winning change was how the name constructor of CTeam
generates random numbers. In C++98, we call the system function rand
.
srand(p); for (int i = 0; i < statisticsSize; ++i) { m_statistics[i] = static_cast(rand() % 10000) / 100.0; }
In C++11, we use the new classes default_random_engine
and uniform_real_distribution
to generate the random numbers.
std::default_random_engine engine(p); std::uniform_real_distributionrange(0, 100); for (int i = 0; i < statisticsSize; ++i) { m_statistics[i] = range(engine); }
Every time, the system function rand
is called, it sets up the machinery to generate random numbers, generates a number and then tears down the machinery again. In C++, we do the setup only once, when creating the engine
and range
objects. Moreover, we only call a member function (range(engine)
) instead of a more heavy-weight system function.
Conclusion
Equipping our classes with move constructors and assignment operators makes our code sometimes slightly faster and sometimes even a lot faster, but sometimes it doesn't change anything. We must understand when we benefit from C++11's new move semantics.
The most important factor is whether a class is suited for moving. If the data members of a class are builtin types like int
or double
, we will not see any speed-up. If the data members are pointers to "big" objects, the move constructor and assignment will copy the pointer instead of the object pointed to. This applies recursively to data members like strings and vectors whose data members are of pointer type. This is why we see a slight speed-up (factor: 1.012) for CTeam
objects, whose data members m_name
and m_statistics
are of type std::string
and std::vector
, respectively.
This slight speed-up multiplies if such objects are moved around heavily as it happens during sort algorithms. The resulting speed-up can be significant as our ShuffleAndSort benchmark (factor: 1.7) shows.
Even if a class doesn't support move semantics, we can speed up the insertion of elements by using emplacement (factor: 1.012). Emplacing an element into a container constructs the element directly into the right place in the container. It does not move or copy the newly constructed element into the container.
In summary, we can say that using move semantics on suitable classes will speed up operations - some more than others. By using lambda expressions instead of functions in algorithms (e.g., sorting criterion) and by using C++11's new way of generating random numbers, we can speed up our code significantly (by factor 1.15 and 3.4, respectively).