Clang format tanks performance

Let’s benchmark toupper implementations.

Actually, I don’t really care about toupper much at all, but I was writing a different post and needed a peg to hang my narrative hat on, and hey toupper seems like a nice harmless benchmark. Despite my effort to choose something which should be totally straightforward and not sidetrack me, this weird thing popped out.

This will be a pretty short one - the longer, original post on the original, maybe far more interesting, topic is coming soon.

So let’s take a look at three implementations which perform toupper over an array of char : that is, which take an input array and modifies it in-place so any lowercase characters are converted to uppercase.

The first simply calls the C standard library toupper function in a C-style loop:

void toupper_rawloop(char* buf, size_t size) {
    for (size_t i = 0; i < size; i++) {
        buf[i] = toupper(buf[i]);

The second uses the more modern approach of using to std::transform to replace the raw loop:

void toupper_transform(char* buf, size_t size) {
    std::transform(buf, buf + size, buf, [](char c){ return toupper(c); });

Finally, the third one is our bespoke ASCII-specific version that checks if the character lies in the range a - z and remaps it by subtracting 32 if so:

void toupper_branch(char* buf, size_t size) {
    for (size_t i = 0; i < size; i++) {
        char c = buf[i];
        if (c >= 'a' && c <= 'z') {
            buf[i] = c - 32;

Seems straightforward enough, right?

Let’s benchmark these on my Skylake i7-6700HQ laptop, with the default gcc 5.5. Here’s a JPSP:

Let’s get three observations that aren’t really part of the story out of the way.

First, the pattern for the branchy algorithm (green). It’s the only one that varies much at all with input size - the other two are basically flat. This turns out to be just a benchmarking artifact. I use random ASCII input, so primary determinant of performance our branchy algorithm is branch prediction. For small input sizes, the branch predictor learns the entire input sequence across iterations of the benchmark and so mispredictions are low and performance is high, just like this . As sequence size grows the predictor memorizes less and less of the sequence until it flatlines when it mispredicts every time there is an uppercase character (0.27 mispredicts per character).

The second thing is this green blob of much slower samples from the toupper_branch in the upper left:

This wasn’t a one time artifact, I saw those guys hanging out up there across several runs. They don’t reproduce if you run that particular size alone however, only when the script runs to collect input across all size values. They don’t always show up. I didn’t look into it further but my best guess is some of unfortunate collision or aliasing effect perhaps in the branch predictor or in the 4k physical to virtual page mapping (VA space randomization was off, however).

The third not interesting thing is the bimodal behavior of toupper_rawloop – the blue dots form two distinct lines, at just above 2 cycles per char and a faster line at 1.5 cycles per char. All performance counters that I checked were the same between the two “modes”. The fast mode, which runs at 1.57 chars/cycle is basically bottlenecked on the load ports: there are 1.54 uops/cycle going to both port 2 and port 3, so those ports are 98% occupied. The slower mode I can’t explain.

While I was investigating it, the fast mode suddenly stopped appearing and I was stuck in slow mode. Maybe my CPU saw what I was up to and downloaded a microcode update in the background to remove the inconsistency, but I still have the SVG to prove it (for now).

So what’s the interesting thing?

The interesting thing is that the raw loop version runs 3x to 4x faster than the std::transform version: 1.5 to 2 cycles per character versus just above 7 cycles per character.

What’s up with that? Are my standard algorithms letting me down? Does std::transform have some fatal flaw?

Not really. Well, not at all.

It turns out these results occur when the functions are compiled in separate files . If you put them into the same file, suddenly the performance is is the same: they both run slowly.

No, it’s not an alignment thing.

It gets weirder too: the fast raw loop version, compiled in a separate file, slows down if you simply include the <algorithm> header . That’s right - including that header, which is never used and generates no code in the object file, slows down the raw loop by 3 to 4 times. Conversely, the std::transform version speeds up to full speed if you copy and paste the std::transform implementation out of <algorithm> and stop including that file.

It gets even weirder (this is the last “it gets weirder”, I promise): including <algorithm> doesn’t always do this. The slowdown happens if <algorithm> is included before <ctype.h> , but not if you swap them around:


#include <algorithm>
#include <ctype.h>


#include <ctype.h>
#include <algorithm>

In fact, in my case, this performance anomaly was triggered (in a different project) when clang format sorted my headers, moving <algorithm> to the top where it belonged (hence the clickbait title).

Of course, we were going to end up mired in assembly at some point. Let’s not postpone the pain any longer.

Here are are the fast and slow versions of the functions, with the core loops annotated:

<algorithm> included first:

toupper_rawloop(char*, unsigned long):
        push    rbp
        push    rbx
        lea     rbp, [rdi+rsi]
        sub     rsp, 8
        test    rsi, rsi
        je      .L1
        mov     rbx, rdi
        movsx   edi, BYTE PTR [rbx]  ; load a char from *buf
        add     rbx, 1               ; buf++
        call    toupper              ; call toupper(c)
        mov     BYTE PTR [rbx-1], al ; save the result to buf[-1]
        cmp     rbp, rbx             ; check for buf == buf_end
        jne     .L5                  ;
        add     rsp, 8
        pop     rbx
        pop     rbp

With <algorithm> second:

toupper_rawloop(char*, unsigned long):
        test    rsi, rsi
        je      .L7
        push    rbp
        push    rbx
        mov     rbp, rsi
        mov     rbx, rdi
        sub     rsp, 8
        call    __ctype_toupper_loc
        lea     rsi, [rbx+rbp]
        mov     rdi, rbx
        movsx   rcx, BYTE PTR [rdi]        ; load a char from buf
        mov     rdx, QWORD PTR [rax]       ; load the toupper table address (pointed to by __ctype_toupper_loc)
        add     rdi, 1                     ; buf++
        mov     edx, DWORD PTR [rdx+rcx*4] ; look up the toupper result by indexing into table with the char
        mov     BYTE PTR [rdi-1], dl       ; store the result
        cmp     rsi, rdi                   ; check buf == end_buf
        jne     .L4                        ;

        add     rsp, 8
        pop     rbx
        pop     rbp
        rep ret

The key difference is the slow version simply calls toupper in the loop, while the fast version has no function calls at all, just a table lookup- the body of std::toupper has been inlined.

Examining the glibc source , we find the implementation of toupper :

__extern_inline int
__NTH (toupper (int __c)) // __NTH is a macro that indicates the function doesn't throw
  return __c >= -128 && __c < 256 ? (*__ctype_toupper_loc ())[__c] : __c;

We see that toupper is implemented as an extern inline function that first checks that the range of the char fits within a byteand then looks up the character in the table returned by __ctype_toupper_loc() . That function returns a thread-local pointer (a const int ** ), which in turn points to a lookup table which given a character returns the uppercase version.

So now the assembly makes sense: the fast version of the algorithm inlines the toupper body, but it can’t inline the __ctype_toupper_loc() call; however, this call is declared __attribute__((const)) which means that its return value depends only on the arguments (and here there are no arguments) and so the compiler knows it returns the same value every time and so can be hoisted out of the loop, so the loop body has only a few loads associated with the lookup table, the store of the new value to the buffer, and loop control.

The slow version, on the other hand, leaves the toupper() inside the loop. The loop itself is one instruction shorted, but of course you need to run all the code inside toupper as well. Here’s what that looks like on my system:

lea    edx,[rdi+0x80]                   ; edx = rdi + 0x80
  movsxd rax,edi                          ; zero extend c
  cmp    edx,0x17f                        ; check that c is in -128 to 255
  ja     2a                               ; if not, we're done
  mov    rdx,QWORD PTR [rip+0x395f30]     ; lookup TLS index
  mov    rdx,QWORD PTR fs:[rdx]           ; access TLS at index
  mov    rdx,QWORD PTR [rdx]              ; dereference TLS pointer
  mov    rdx,QWORD PTR [rdx+0x48]         ; load current toupper lookup table
  mov    eax,DWORD PTR [rdx+rax*4+0x200]  ; lookup c in LUT

Since it’s a standalone function call, it has to do more work. There are no less than five chained (pointer-chasing) memory accesses. Only two of those accesses remained in the fast loop, because the rest were hoisted up and out of the loop. The input to output latency of this function is probably close to 25 cycles, so out measured throughput of ~7 cycles means that the CPU was able to run several copies in parallel, not too terrible all things considered.

Why does this happen?

Through a long series of includes, C++ files like <algorithm> include <bits/os_defines.h> which has this line:

// This keeps isanum, et al from being propagated as macros.
#define __NO_CTYPE 1

When <ctype.h> is ultimately included, this prevents the block containing the extern inline definition of toupper from being included:

#if !defined __NO_CTYPE
# ifdef __isctype_f
__isctype_f (alnum)
// lots more like this
__isctype_f (xdigit)
# elif defined __isctype
# define isalnum(c)	__isctype((c), _ISalnum)
# define isalpha(c)	__isctype((c), _ISalpha)
// more like this
# endif

// the stuff we care about
__extern_inline int
__NTH (tolower (int __c))
  return __c >= -128 && __c < 256 ? (*__ctype_tolower_loc ())[__c] : __c;

__extern_inline int
__NTH (toupper (int __c))
  return __c >= -128 && __c < 256 ? (*__ctype_toupper_loc ())[__c] : __c;
# endif

// here's where tolower is defined as a macro
# if __GNUC__ >= 2 && defined __OPTIMIZE__ && !defined __cplusplus
#  define tolower(c)	__tobody (c, tolower, *__ctype_tolower_loc (), (c))
#  define toupper(c)	__tobody (c, toupper, *__ctype_toupper_loc (), (c))
# endif /* Optimizing gcc */

#endif /* Not __NO_CTYPE.  */

Note when including <ctype.h> C++ toupper is never defined as a macro - at most it is extern inline - the macro definitions below that are guarded by !defined __cplusplus so they’ll never take effect.

So I’m not sure if the extern inline bodies of tolower and toupper are intended to be excluded by __NO_CTYPE in this case, but that’s what happens and this has a significant performance impact in this toy loop. As a corollary, if you include <cctype> rather than <ctype.h> (the C++ version of the C header which puts functions in the std:: namespace) you also get the slow behavior because <cctype> ultimately includes <bits/os_defines.h> .

Does this natter? Nah. toupper is broken for serious multilingual use and, if you only care about ASCII you can write your own faster function. If you care about proper text handling, you are probably using UTF-8 and you’ll have to use something like ICU to do locale-aware text handling, or wait for C++ to get Unicode support (you might be waiting a while). It’s only interesting in clickbait sense of “clang format can cause a 4x performance regression”.