Performance Gains Through C++11 Move Semantics

We explore when and how our code benefits most from equipping classes with move constructors and assignment operators. We analyse in detail, which constructors and assignment operators (move or copy) are called for C++98 and C++11 when we insert or emplace objects into containers, when we initialise containers and when we shuffle and sort containers. Finally, we run benchmarks for these operations and discuss the actual speed-up, which ranges from 1% to 70%.

A Class with Move Semantics

Let me introduce the class CTeam, which will accompany us through our experiments with move semantics. The class represents a football team with its name, points and goal difference. The vector m_statistics of 100 double numbers is only there to make CTeam objects more heavy-weight during copying.

// cteam.h

#ifndef CTEAM_H
#define CTEAM_H

#include <string>
#include <vector>

class CTeam
{
public:
    ~CTeam();                                      // dtor
    CTeam();                                       // default ctor
    CTeam(const std::string &n, int p, int gd);    // name ctor

    CTeam(const CTeam &t);                         // copy ctor
    CTeam &operator=(const CTeam &t);              // copy assign

    CTeam(CTeam&& t);                              // move ctor
    CTeam &operator=(CTeam &&t);                   // move assign

    std::string name() const;
    int points() const;
    int goalDifference() const;

    static int copies;

private:
    std::string m_name;
    int m_points;
    int m_goalDifference;
    static const int statisticsSize = 100;
    std::vector<double> m_statistics;
};

#endif // CTEAM_H

I implemented all the constructors and assignment operators, because they print their name to the console and because the copy variants increment the counter copies. We must only comment out the move constructor and assignment to turn the C++11 code into C++98 code.

The implementation comes with no surprises. The print statements are commented out, because they would completely contort the performance results. We should uncomment them for the detailed analysis. Here is the part that is the same for C++98 and C++11.

// cteam.cpp (same for C++98 and C++11)

CTeam::~CTeam()
{
//    cout << "dtor" << endl;
}

CTeam::CTeam()
    : m_name("")
    , m_points(0)
    , m_goalDifference(0)
{
//    cout << "default ctor" << endl;
}

CTeam::CTeam(const string &n, int p, int gd)
    : m_name(n)
    , m_points(p)
    , m_goalDifference(gd)
{
//    cout << "name ctor" << endl;
    m_statistics.reserve(statisticsSize);
    srand(p);
    for (int i = 0; i < statisticsSize; ++i) {
        m_statistics[i] = static_cast(rand() % 10000) / 100.0;
    }
}

CTeam::CTeam(const CTeam &t)
    : m_name(t.m_name)
    , m_points(t.m_points)
    , m_goalDifference(t.m_goalDifference)
    , m_statistics(t.m_statistics)
{
//    cout << "copy ctor" << endl;
    ++CTeam::copies;
}

CTeam &CTeam::operator=(const CTeam &t)
{
//    cout << "copyAssign" << endl;
    ++CTeam::copies;
    if (this != &t) {
        m_name = t.m_name;
        m_points = t.m_points;
        m_goalDifference = t.m_goalDifference;
        m_statistics = t.m_statistics;
    }
    return *this;
}

The move constructor and move assignment are specific to C++11. Again, there is nothing special about them.

// cteam.cpp (specific to C++11)

CTeam::CTeam(CTeam &&t)
    : m_name(move(t.m_name))
    , m_points(move(t.m_points))
    , m_goalDifference(move(t.m_goalDifference))
    , m_statistics(move(t.m_statistics))
{
//    cout << "move ctor" << endl;
}


CTeam &CTeam::operator=(CTeam &&t)
{
//    cout << "move assign" << endl;
    m_name = move(t.m_name);
    m_points = move(t.m_points);
    m_goalDifference = move(t.m_goalDifference);
    m_statistics = move(t.m_statistics);
    return *this;
}

Note that we use the move function when assigning to the member variables. This tells the compiler to move the members of the argument t into the members of this object. Moving of plain old data types is the same as copying. Hence, the integers m_points and m_goalDifference are copied. The STL data types std::string and std::vector support move semantics. Hence, the members m_name and m_statistics are eligible to moving.

Detailed Analysis of Copying versus Moving

Pushing an Element to the Back of a Vector

The function testPushBack() appends a CTeam object in three slightly different ways to the vector table.

void Thing::testPushBack()
{
    vector table;
    table.reserve(4);

    cout << "@ Adding Arsenal" << endl;
    table.push_back(CTeam("Arsenal", 68, 25)); // [1]
    QCOMPARE(CTeam::copies, 0);

    cout << "@ Adding Man United" << endl;
    CTeam manUtd("Man United", 63, 24);
    table.push_back(manUtd);                   // [2]
    QCOMPARE(CTeam::copies, 1);
    QVERIFY(!manUtd.name().empty());

    cout << "@ Adding Liverpool" << endl;
    CTeam liverpool("Liverpool", 54, 32);
    table.push_back(move(liverpool));          // [3]
    QCOMPARE(CTeam::copies, 1);
    QVERIFY(liverpool.name().empty());

    cout << "@ End of test" << endl;
}

Let us first look at what the C++98 compiler does for line [1].

    table.push_back(CTeam("Arsenal", 68, 25)); // [1]

The debug trace looks as follows.

@ Adding Arsenal      // [1], C++98
name ctor
copy ctor
dtor

The C++98 compiler creates a temporary CTeam object with the name constructor, copies this object into table container with the copy constructor, and destroys the temporary object again.

The C++11 compiler recognises that the argument of push_back in [1] is an rvalue, that is, a temporary object that doesn't have a name and to which the address-of operator cannot be applied. When the C++11 compiler sees an rvalue, it calls the overload push_back(T &&value) taking an rvalue reference. This overload creates the CTeam object and moves it into the table container, as the following debug trace proves. The C++11 compiler chooses the move constructor instead of the copy constructor.

@ Adding Arsenal      // [1], C++11
name ctor
move ctor
dtor

We don't have any speed-up for the two data members m_points and m_goalDifference, because - as built-in data types - they are simply copied. The other two data members - m_name and m_statistics - are types - std::string and std::vector, respectively - that support move semantics. Both hold pointers to their contents. The move constructor copies the pointers, which is cheaper than copying the contents. So, the C++11 version should be slightly faster than the C++98 version.

Finally, the destructor is called on the emptied temporary object. After moving, the temporary object contains an empty string m_name and an empty vector m_statistics.

Let us move on to line [2].

    CTeam manUtd("Man United", 63, 24);
    table.push_back(manUtd);                   // [2]

The debug trace is the same for C++98 and C++11: one call to the name constructor and one to the copy constructor.

@ Adding Man United   // [2]
name ctor
copy ctor

This time we add the local object manUtd to the table. This object is not temporary, as it lives on after the call to push_back. Obviously, it has a name manUtd and the expression &manUtd is valid. Hence, manUtd is an lvalue and not an rvalue. Move semantics and its optimisation do not apply to lvalues. The behaviour and performance for adding an lvalue to a container is the same for C++98 and C++11.

Line [3]

    CTeam liverpool("Liverpool", 54, 32);
    table.push_back(move(liverpool));          // [3]

shows that there is a way to force the C++11 compiler to apply move semantics. We can cast the lvalue liverpool to an rvalue by calling std::move on an lvalue. Line [3] behaves the same as line [1], as the debug trace shows.

@ Adding Liverpool    // [3]
name ctor
move ctor
@ End of test

It is important to understand that the local variable liverpool doesn't contain anything any more. It is empty, because it has passed its contents to the containter table. Hence, the expression liverpool.name().empty() is true. The local variable liverpool has no contents any more but it is still valid (in the sense that it doesn't crash).

In summary, we see that the C++98 version of testPushBack performs 3 copies, whereas the C++11 version performs only 1 copy (line [2]).

Emplacing an Element to the Back of a Vector

Let us have another close look at line [1] of the push_back example.

    table.push_back(CTeam("Arsenal", 68, 25)); // [1]

This line creates a temporary CTeam object, which is half-moved and half-copied into the table container and then destroyed. The moving is still inefficient. It would be much more efficient to construct this temporary object in-place at the location provided by the table container. This is exactly what return-value optimisation (RVO) would do for a return value (even for many C++98 compilers). And, this is exactly what the C++11 compiler can do using std::vector::emplace_back() and move semantics.

If we rewrite the line above as

    table.emplace_back("Arsenal", 68, 25); // [1]

the debug trace

@ Adding Arsenal      // [1], C++11
name ctor

shows that only the name constructor is called. Neither the move constructor nor the destructor are called. This is the optimal solution!

Note that the three versions

    table.emplace_back({"Arsenal", 68, 25});
    table.emplace_back(Team{"Arsenal", 68, 25});
    table.emplace_back(Team("Arsenal", 68, 25));

are less efficient, because they call the name constructor, the move constructor and the destructor. emplace_back degenerates into push_back.

Replacing push_back by emplace_back in lines [2] and [3] of the push_back example

    CTeam manUtd("Man United", 63, 24);
    table.emplace_back(manUtd);               // [2] name ctor, copy ctor

    CTeam liverpool("Liverpool", 54, 32);
    table.emplace_back(move(liverpool));      // [3] name ctor, move ctor

doesn't make any difference.

Initialising Vectors

As enthusiastic C++11 developers, we may be tempted to initialise a vector of teams using C++11's new initialiser lists.

    vector<Team> table = {
        {"Leicester", 80, 32},
        {"Spurs", 70, 38},
        {"Arsenal", 68, 25},
        {"Man City", 65, 30}
    };
    QCOMPARE(Team::copies, 4);

This solution triggers four name-constructor calls for the four teams followed by four copy-constructor calls to place the four teams in the table and finished by four destructor calls destroying the four temporary objects. This is equivalent to default-constructing an empty table first and calling push_back for each team.

OK, we know how to improve that. We use emplace_back instead of push_back.

    vector<Team> table;
    table.emplace_back("Leicester", 80, 32);
    table.emplace_back("Spurs", 70, 38);
    table.emplace_back("Arsenal", 68, 25);
    table.emplace_back("Man City", 65, 30);
    QCOMPARE(Team::copies, 3);

Well, this saves us one copy, but we expected no copy at all. The reason is that vectors expand and shrink their storage capacity as needed. At the beginning, table has the capacity to store one element. So, team "Leicester" can be name-constructed into the container. When we add the second team, the "Spurs", the capacity is too small and is doubled. Team "Leicester" is copied into the new storage and the "Spurs" are name-constructed into the new storage. When we add the third team, "Arsenal", the capacity is doubled to 4. The first two teams are copied into the new storage and "Arsenal" is name-constructed into the new storage. Now, the capacity is sufficient and we can simply name-construct "Man City" into the existing storage. This makes three copies in total.

The cure is obvious. We reserve four elements in the container before-hand and emplace the team into the existing slots.

    vector<Team> table;
    table.reserve(4);
    table.emplace_back("Leicester", 80, 32);
    table.emplace_back("Spurs", 70, 38);
    table.emplace_back("Arsenal", 68, 25);
    table.emplace_back("Man City", 65, 30);
    QCOMPARE(Team::copies, 0);

Finally, there are zero copies. Of course, reserving enough storage capacity only works if we know the approximate number of elements before-hand. If space is at a premium, we may have to use a different data structure like std::unorderd_map, std::map or even std::list.

Shuffling and Sorting

Shuffling and sorting a vector of teams should be where move semantics shines brightest.

void Thing::testShuffleAndSort()
{
    vector<CTeam> table;
    table.reserve(20);

    table.emplace_back("Leicester", 81, 32);
    table.emplace_back("Arsenal", 71, 29);
    // Adding 18 more teams ...

    random_shuffle(table.begin(), table.end());     // [1]
    sort(table.begin(), table.end(), greaterThan);  // [2]
    QCOMPARE(CTeam::copies, 0);
}

We first shuffle the 20 teams in table (line [1]) and then sort them again (line [2]). When the shuffle and sort algorithms swap elements, they can move these elements instead of copying. The function testShuffleAndSort() doesn't need a single copy.

If we replace the emplace_back functions by

    table.push_back(CTeam("Leicester", 81, 32));
    table.push_back(CTeam("Arsenal", 71, 29));

we get perfectly valid C++98 code. If we run the function testShuffleAndSort() in C++98, we'll see 130 or even 160 copies. The number of copies depends on how badly the container got shuffled.

Shuffling, sorting and moving around elements in containers should be significantly faster in C++11 than in C++98. This is where move semantics really shines.

The Benchmarks

The first benchmark calls the testShuffleAndSort function 5000 times. We use push_back instead of emplace_back for the C++11 version as well, because we can then use identical code for both C++98 and C++11. Note that the C++11 version still works with zero copies.

// thing.cpp

namespace {
bool greaterThan(const CTeam &t1, const CTeam &t2)
{
    return t1.points() > t2.points() ||
            (t1.points() == t2.points() && t1.goalDifference() > t2.goalDifference());
}
}

void Thing::benchmarkShuffleAndSort()
{
    vector<CTeam> table;
    table.reserve(20);
    table.push_back(CTeam("Leicester", 81, 32));
    table.push_back(CTeam("Arsenal", 71, 29));
    // Adding 18 more teams ...

    QBENCHMARK {
        for (int i = 0; i < 5000; ++i) {
            random_shuffle(table.begin(), table.end());
            sort(table.begin(), table.end(), greaterThan);
        }
    }
}

The second benchmark adds 20 teams to the table container using push_back and clears the container. It repeats these steps 5000 times.

void Thing::benchmarkPushBack()
{
    vector<CTeam> table;
    table.reserve(20);

    QBENCHMARK {
        for (int i = 0; i < 5000; ++i) {
            table.push_back(CTeam("Leicester", 81, 32));
            table.push_back(CTeam("Arsenal", 71, 29));
            // Adding 18 more teams ...
            table.clear();
        }
    }
}

The third benchmark is the same as the second, but it uses emplace_back instead of push_back. Obviously, we cannot run this test with C++98.

void Thing::benchmarkEmplaceBack()
{
    vector<CTeam> table;
    table.reserve(20);

    QBENCHMARK {
        for (int i = 0; i < 5000; ++i) {
            table.emplace_back("Leicester", 81, 32);
            table.emplace_back("Arsenal", 71, 29);
            // Adding 18 more teams ...
            table.clear();
        }
    }
}

I ran the first and second benchmark with C++98. The results are shown in the column "C++98" in the table below. I ran all three benchmarks with C++11 in two variants. I removed the move constructor and assignment from the class CTeam for the first variant denoted "C++11/Copy". The second variant denoted "C++11/Move" comes with the full glory of move semantics.

QtTest has the nifty feature that we can run benchmarks in callgrind by passing the option -callgrind to the test exectuable. Callgrind counts the number of read instructions the CPU executes for the benchmark.

Here is the table with the results.

Benchmark	C++98	C++11/Copy	C++11/Move
ShuffleAndSort	106,442,624	106,442,697	62,859,675
PushBack	1,823,007,025	1,821,111,486	1,805,521,091
EmplaceBack	n/a	1,803,255,948	1,801,486,078

As we want to compare the different implementations of the same problem, we are more interested in relative performance than in absolute numbers of read instructions. We define the implementation "EmplaceBack - C++11/Move" as the reference point with a value of 1.00. The values for the other four implementations of the PushBack and EmplaceBack benchmarks show how much slower the other implementation is than the reference (best) implementation. The implementation "ShuffleAndSort - C++11/Move" is the reference point for the other two ShuffleAndSort implementations.

Benchmark	C++98	C++11/Copy	C++11/Move
ShuffleAndSort	1.693	1.693	1.000
PushBack	1.012	1.011	1.002
EmplaceBack	n/a	1.001	1.000

Move semantics speeds up the ShuffleAndSort benchmark by a factor of nearly 1.7. This is quite impressive but not surprising as the ShuffleAndSort benchmark hits the sweet spot of move semantics. Whenever the shuffle and sort algorithm swap the position of two elements, they do a three-way move instead of a three-way copy.

The results for the PushBack and EmplaceBack benchmarks are within 1% of each other. Let us compare the different implementations one by one.

The difference between "C++98" and "C++11/Copy" is negligible. It is within the measurement variance.

The "C++11/Copy" implementation of the PushBack benchmark uses the push_back(const T&) overload, whereas the "C++11/Move" implementation uses the push_back(T&&) overload. The former implementation calls the copy constructor of CTeam, whereas the latter implementation calls the move constructor. This explains the speed-up of nearly 1% when using move semantics.

At first glance, it is slightly surprising that we see a similar speed-up of 1% when we use emplace_back instead of push_back for "C++11/Copy" - although there is no move semantics involved. If we think about it briefly, it is pretty obvious though. emplace_back constructs the CTeam object directly into the proper place in the container. It does neither call the copy constructor nor the destructor.

For "C++11/Move", emplace_back is slightly faster (0.2%) than push_back. emplace_back only calls the name constructor of CTeam, whereas push_back calls the name constructor, move constructor and destructor.

The difference between "C++11/Copy" and "C++11/Move" for emplace_back is negligible. It is within the measurement variance.

Accidental Performance Gains

I started writing the code examples in C++11. When I made the code C++98 conformant, I had to use the C++98 way for sorting containers and for generating random numbers. When I profiled the C++98 way against the C++11 way, I noticed significant performance differences. The following table shows columns "C++98" and "C++11/Move" from the Benchmark table and adds the column "C++11/Opt", which uses move semantics and optimised algorithms.

Benchmark	C++98	C++11/Move	C++11/Opt
ShuffleAndSort	1.693	1.000	0.869
PushBack	1.012	1.002	0.295
EmplaceBack	n/a	1.000	0.294

The speed-up is massive: 13% for ShuffleAndSort and 70% for PushBack and EmplaceBack. How can we achieve this speed-up?

For ShuffleAndSort, I replaced

    sort(table.begin(), table.end(), greaterThan);

    sort(table.begin(), table.end(), 
         [](const CTeam &t1, const CTeam &t2) {
            return t1.points() > t2.points() || (t1.points() == t2.points() && t1.goalDifference() > t2.goalDifference());
         });

For C++98, the sort algorithm calls the function greaterThan for every comparison. For C++11, it inlines the lamda expression and runs through the inlined instructions of the body of greaterThan for every comparison. The algorithm saves the overhead of calling greaterThan for every comparison.

For PushBack and EmplaceBack, the winning change was how the name constructor of CTeam generates random numbers. In C++98, we call the system function rand.

    srand(p);
    for (int i = 0; i < statisticsSize; ++i) {
        m_statistics[i] = static_cast(rand() % 10000) / 100.0;
    }

In C++11, we use the new classes default_random_engine and uniform_real_distribution to generate the random numbers.

    std::default_random_engine engine(p);
    std::uniform_real_distribution range(0, 100);
    for (int i = 0; i < statisticsSize; ++i) {
        m_statistics[i] = range(engine);
    }

Every time, the system function rand is called, it sets up the machinery to generate random numbers, generates a number and then tears down the machinery again. In C++, we do the setup only once, when creating the engine and range objects. Moreover, we only call a member function (range(engine)) instead of a more heavy-weight system function.

Conclusion

Equipping our classes with move constructors and assignment operators makes our code sometimes slightly faster and sometimes even a lot faster, but sometimes it doesn't change anything. We must understand when we benefit from C++11's new move semantics.

The most important factor is whether a class is suited for moving. If the data members of a class are builtin types like int or double, we will not see any speed-up. If the data members are pointers to "big" objects, the move constructor and assignment will copy the pointer instead of the object pointed to. This applies recursively to data members like strings and vectors whose data members are of pointer type. This is why we see a slight speed-up (factor: 1.012) for CTeam objects, whose data members m_name and m_statistics are of type std::string and std::vector, respectively.

This slight speed-up multiplies if such objects are moved around heavily as it happens during sort algorithms. The resulting speed-up can be significant as our ShuffleAndSort benchmark (factor: 1.7) shows.

Even if a class doesn't support move semantics, we can speed up the insertion of elements by using emplacement (factor: 1.012). Emplacing an element into a container constructs the element directly into the right place in the container. It does not move or copy the newly constructed element into the container.

In summary, we can say that using move semantics on suitable classes will speed up operations - some more than others. By using lambda expressions instead of functions in algorithms (e.g., sorting criterion) and by using C++11's new way of generating random numbers, we can speed up our code significantly (by factor 1.15 and 3.4, respectively).