Wednesday, July 11, 2012

Comparing objects in C++

The standard C++ pattern for comparing and ordering objects is to overload operator<() for your class. This function is what all the standard containers (std::set, std::priority_queue, etc.) and algorithms (std::sort, std::binary_search, etc.) use for comparing objects.

So if you have a class that you would like to put in a standard container, you have to overload operator<(). Your class usually has more than one member, and you usually want to order your objects first by the value of one member, then by the value of another member, and so on. For example, if you have a class person with string members firstname and lastname, and you would like a "phone book" ordering, you would order by lastname first, then by firstname, e.g.:

Anderson, Clive
Anderson, Lucy
Anderson, Thomas
Armstrong, Lance
Armstrong, Neil
...

To achieve this using operator<(), implemented only in terms of std::string::operator<(), you could do something like this:

struct person
{
    std::string firstname;
    std::string lastname;

    bool operator<(const person &other) const
    {
        /* Order first by lastname */
        if (lastname < other.lastname)
            return true;
        if (other.lastname < lastname)
            return false;

        /* The lastnames were equal, so try to order by firstname instead */
        return firstname < other.firstname;
    }
};

This code is not optimal. To see why, consider a case where you have two persons with the same lastname. The first check will fall through (lastname < other.lastname is false); so will the second (other.lastname < lastname is also false). Each check had to loop over the lastname when a single loop should have sufficed.

Fortunately, std::string happens to implement a member function compare() which returns an int. The semantics are similar to C's strcmp(a, b), which also returns an int. The return value is less than 0 if a < b, 0 if they are equal, or greater than 0 if a > b. Using this function, we can implement a much more efficient operator<():

struct person
{
    std::string firstname;
    std::string lastname;

    bool operator<(const person &other) const
    {
        int d = lastname.compare(other.lastname);
        if (d < 0)
            return true;
        if (d > 0)
            return false;

        /* The lastnames were equal, so try to order by firstname instead */
        return firstname < other.firstname;
    }
};

To see the difference in running time, we can implement a simple loop which inserts the same element many times to a multiset:

int main(int argc, char *argv[])
{
    std::multiset<person> persons;

    for (unsigned int i = 0; i < 1000000; ++i)
        persons.insert(person("Thomas", "Anderson with a rather long last name for the sake of the argument"));
}

We exaggerate the number of elements and the length of the lastname in order to amplify the difference. Running this on my laptop, it takes approximately 3.9 seconds to run the original code and 2.9 seconds to run the new version.

This is all well and good, but the actual problem runs a bit deeper than this. What if our code uses other compound data types than strings? Take e.g. std::pair. It is hopelessly broken, precisely because it doesn't implement a compare() function! std::pair only implements operator<() and only expects its elements to implement operator<() too. Have a look at the libstdc++ implementation (simplified):

template<class _T1, class _T2>
inline bool operator<(const pair<_T1, _T2>& __x, const pair<_T1, _T2>& __y)
{
    return __x.first < __y.first
        || (!(__y.first < __x.first) && __x.second < __y.second);
}

Can you spot the performance bug? Yep, it calls checks both __x.first < __y.first and __y.first < __x.first in the worst case. This also means that std::pair<std::string, std::string> has exactly the same problem as our first implementation of struct person.

In order to show just how bad this really is, consider the case where we have nested pairs:

typedef std::pair<int, int> p1; // 2 elements
typedef std::pair<p1, p1> p2;   // 4 elements
typedef std::pair<p2, p2> p3;   // 8 elements
// ...
typedef std::pair<p5, p5> p6;   // 64 elements

This is not a realistic example per se, but it illustrates very well how big the problem really is. Let's assume we have an object x6 constructed in the following way:

p1 x1(0, 0);
p2 x2(x1, x1);
// ...
p6 x6(x5, x5);

What happens when you try to compare x6 to itself like this?

if (x6 < x6)
    /* ... */;

Let's take a step back. x6 has 64 int elements, which can be reached in the following ways:

x6.first.first.first.first.first.first
x6.first.first.first.first.first.second
x6.first.first.first.first.second.first
x6.first.first.first.first.second.second
...

Intuitively, we should only need to perform 128 comparisons in the worst case (two comparisons for each pair of corresponding elements from the two objects). If we had two arrays of 64 ints, we would do it like this:

int compare(const int a[64], const int b[64])
{
    for (unsigned int i = 0; i < 64; ++i) {
        if (a[i] < b[i])
            return -1;
        if (b[i] < a[i])
            return 1;
    }

    return 0;
}

In order to answer the question posed ("How many comparisons between two ints are we actually making when we compare x6 against itself?"), we can simply program it using a wrapper class for int that counts the number of comparisons:

static unsigned int nr_comparisons = 0;

struct myint
{
    int value;

    myint()
    {
    }

    myint(int v):
        value(v)
    {
    }

    bool operator<(const myint &other) const
    {
        ++nr_comparisons;
        return value < other.value;
    }
};

Now we just have to change the definition of p1 from

typedef std::pair<int, int> p1;

to

typedef std::pair<myint, myint> p1;

and finally print nr_comparisons. The answer? 729.

Because std::pair has three comparisons in the worst case, instead of just two (which would have been sufficient), the complexity of comparing nested pairs has just climbed from 2^n (where n is the number of levels of pairs) to 3^n. That's pretty bad!

Part of the solution, as employed by std::string and our improved struct person, is to define the most fundamental comparison between objects as a function that can tell you whether a < b, a == b, or a > b in a single call rather than determine this using two calls. It is clear that the two are semantically equivalent (you can both define operator<() in terms of compare() and the other way around), but one is computationally superior for compound objects.

This is still not the full story, however. We wouldn't get any performance improvement from std::set if we changed it to call compare() instead of operator<() — because it doesn't need to compare both ways! In fact, if we changed it, we would now find that std::set<int> operations take more time, since calling compare() on an int would have to check both a < b and b < a even when std::set only needs to know the result of one of them! (This is of course tighly coupled to the std::set implementation — implemented as a tree, lookups would typically traverse the tree by comparing the element that is searched for with the current node, and update the current node depending on the result of the comparison.)

The real solution, therefore, seems to be to implement both operator<() and compare() for all types. std::pair's would look something like this:

bool operator<(const pair &other) const
{
    int d = first.compare(other.first);
    if (d < 0)
        return true;
    if (d > 0)
        return false;

    return second < other.second;
}

int compare(const pair &other) const
{
    int d = first.compare(other.first);
    if (d != 0)
        return d;

    return second.compare(other.second);
}

Now containers and algorithms can use the most efficient one — std::set would still use operator<() on its elements, but std::pair::operator<() would now use compare() (for its first element), and so on.

Maybe we could adopt Perl's <=> operator for the next version of C++?

2 comments:

bert hubert said...

Wow, thanks! I had never realized that my 'clever' use of things like if( tie(a.remote, ourSock, a.type) < tie(b.remote, bSock, b.type)) might expand into suboptimal code!

Vegard said...

@bert hubert If it's performance-sensitive and/or you have a lot of nested ties, you may consider just implementing your own tuple/tie that does the same thing using a compare() instead. But I like the idea, it's indeed a clever trick! Your website/blog is nice, by the way :-)