Why Small Type Changes Can Change Performance in C++ Hot Loops
Question
I am benchmarking the fastest way to compute the population count of large arrays in C++. I found a surprising result: changing the inner loop counter from unsigned to uint64_t causes a major slowdown on Intel CPUs when using _mm_popcnt_u64.
Here is the benchmark:
#include <iostream>
#include <chrono>
#include <cstdlib>
#include <cstdint>
#include <x86intrin.h>

int main(int argc, char* argv[]) {
    using namespace std;

    if (argc != 2) {
        cerr << "usage: array_size in MB" << endl;
        return -1;
    }

    uint64_t size = atol(argv[1]) << 20;
    uint64_t* buffer = new uint64_t[size / 8];
    char* charbuffer = reinterpret_cast<char*>(buffer);
    for (unsigned i = 0; i < size; ++i)
        charbuffer[i] = rand() % 256;

    uint64_t count, duration;
    chrono::time_point<chrono::system_clock> startP, endP;

    {
        startP = chrono::system_clock::now();
        count = 0;
        for (unsigned k = 0; k < 10000; k++) {
            for (unsigned i = 0; i < size / 8; i += 4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i + 1]);
                count += _mm_popcnt_u64(buffer[i + 2]);
                count += _mm_popcnt_u64(buffer[i + 3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<chrono::nanoseconds>(endP - startP).count();
        cout << "unsigned\t" << count << '\t' << (duration / 1.0E9) << " sec\t"
             << (10000.0 * size) / duration << " GB/s" << endl;
    }
    {
        startP = chrono::system_clock::now();
        count = 0;
        for (unsigned k = 0; k < 10000; k++) {
            for (uint64_t i = 0; i < size / 8; i += 4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i + 1]);
                count += _mm_popcnt_u64(buffer[i + 2]);
                count += _mm_popcnt_u64(buffer[i + 3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<chrono::nanoseconds>(endP - startP).count();
        cout << "uint64_t\t" << count << '\t' << (duration / 1.0E9) << " sec\t"
             << (10000.0 * size) / duration << " GB/s" << endl;
    }

    delete[] buffer;
}
The surprising part is that the two loops do the same logical work, but produce very different performance results depending on whether the loop variable is 32-bit or 64-bit.
I would like to understand:
- Why can unsigned and uint64_t produce such different performance in a tight loop?
- Why can changing a runtime value to a compile-time constant lead to slower generated code?
- Why might adding static change optimization behavior?
- Is inline assembly the only reliable way to keep the fastest machine code for a hot loop like this?
The main issue is understanding how small source-level changes affect compiler optimization and generated assembly in performance-critical C++ code.
Short Answer
By the end of this page, you will understand why tiny C++ changes—like using a 32-bit loop index instead of a 64-bit one—can change the generated assembly and significantly affect performance in hot loops. You will also learn how instruction latency, dependency chains, addressing modes, and compiler heuristics influence real CPU throughput.
Concept
In performance-critical C++ code, the source code is only the starting point. What really matters is the machine code the compiler generates and how the CPU executes it.
In this question, the key concept is microarchitectural performance sensitivity:
- two pieces of C++ code can be logically equivalent,
- but compile into different assembly,
- and run at very different speeds on the same CPU.
Why this happens
Modern CPUs do not simply execute one instruction after another in a naive way. They:
- decode instructions,
- track dependencies between results,
- issue work to different execution units,
- and try to overlap independent operations.
A hot loop becomes fast when the compiler gives the CPU enough independent work. A hot loop becomes slow when the compiler accidentally creates a dependency chain where each instruction must wait for the previous one.
The important detail in this benchmark
POPCNT is not a free instruction. It has non-trivial latency. If the loop is generated like this conceptually:
tmp1 = popcnt(a);
sum = sum + tmp1;
tmp2 = popcnt(b);
sum = sum + tmp2;
tmp3 = popcnt(c);
sum = sum + tmp3;
then every add depends on the previous add. That creates a serial chain.
If instead the compiler structures the work more like this:
p1 = popcnt(a);
p2 = popcnt(b);
p3 = popcnt(c);
p4 = popcnt(d);
sum += p1 + p2 + p3 + p4;
then the four popcnt operations are independent of each other, and the CPU can overlap them instead of stalling on one long chain of adds.
Mental Model
Think of the CPU like a kitchen with several workers.
- POPCNT is a cooking step that takes a little time.
- ADD combines finished ingredients.
- A dependency chain means worker B must wait until worker A finishes every single step.
- Independent instructions mean multiple workers can prepare things at once.
Now compare two cooking styles:
Slow style
- Cook ingredient 1.
- Add it to the bowl.
- Cook ingredient 2.
- Add it to the bowl.
- Cook ingredient 3.
- Add it to the bowl.
Everything is serialized.
Faster style
- Start cooking ingredient 1.
- Start cooking ingredient 2.
- Start cooking ingredient 3.
- Start cooking ingredient 4.
- Combine the results.
Now the kitchen stays busy.
That is what happens in a CPU too. A tiny source-level change may cause the compiler to choose the first style instead of the second. The algorithm is the same, but the schedule is worse.
The loop counter type is not the real story. It is just one small input that nudges the compiler toward a different plan.
Syntax and Examples
Basic idea: counting set bits
In C++, population count means counting how many 1 bits are set in a number.
Example with the intrinsic
#include <cstdint>
#include <iostream>
#include <x86intrin.h>
int main() {
    uint64_t value = 0b10110100;
    int bits = _mm_popcnt_u64(value);
    std::cout << bits << '\n'; // 4
}
_mm_popcnt_u64(value) asks the compiler to use the CPU's POPCNT instruction for a 64-bit integer.
Looping over an array
#include <cstdint>
#include <iostream>
#include <vector>
#include <x86intrin.h>
int main() {
    std::vector<uint64_t> data = {1, 3, 7};  // example values
    uint64_t total = 0;
    for (size_t i = 0; i < data.size(); ++i) {
        total += _mm_popcnt_u64(data[i]);
    }
    std::cout << total << '\n';  // 1 + 2 + 3 = 6
}
Step by Step Execution
Traceable example
Consider this code:
#include <cstdint>
#include <iostream>
#include <x86intrin.h>
int main() {
    uint64_t data[4] = {1, 3, 7, 15};
    uint64_t total = 0;
    for (unsigned i = 0; i < 4; ++i) {
        total += _mm_popcnt_u64(data[i]);
    }
    std::cout << total << '\n';
}
Step by step
Initial state
- data[0] = 1 → binary 0001 → popcount 1
- data[1] = 3 → binary 0011 → popcount 2
- data[2] = 7 → binary 0111 → popcount 3
- data[3] = 15 → binary 1111 → popcount 4

Loop iterations

1. i = 0: total = 0 + 1 = 1
2. i = 1: total = 1 + 2 = 3
3. i = 2: total = 3 + 3 = 6
4. i = 3: total = 6 + 4 = 10

The program prints 10.
Real World Use Cases
Population count and hot-loop optimization appear in many real systems.
Common popcount use cases
- Databases and analytics: counting active flags in bitsets
- Search engines: bitmap indexes and filtering
- Compression: bit-level statistics and encoding decisions
- Cryptography: Hamming weight and bit operations
- Bioinformatics: comparing binary feature vectors
- Game engines: occupancy masks and spatial bitboards
Why loop structure matters in real apps
In a production codebase, the same algorithm may run:
- billions of times per second,
- on arrays much larger than CPU caches,
- under compiler and architecture constraints.
Small changes can affect:
- request throughput in a server,
- batch processing time,
- memory bandwidth usage,
- CPU utilization,
- power consumption.
Example scenario
Suppose an API service stores user permissions as bitsets. Each request needs to count enabled permissions across many records. If a hot loop becomes 30% slower because the compiler produced a worse instruction schedule, the whole service can become more expensive to run.
Real Codebase Usage
In real projects, developers usually do not rely on a single fragile loop form. They use patterns that make performance more stable.
1. Prefer size_t for indexing
For indexing containers and memory blocks, size_t is usually the natural choice:
for (size_t i = 0; i < n; ++i) {
    total += _mm_popcnt_u64(data[i]);
}
This matches the platform's native pointer size and avoids some signed/unsigned mismatch issues.
2. Use multiple accumulators
This is a common real-world pattern to reduce dependency chains:
uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
for (size_t i = 0; i < n; i += 4) {
    s0 += _mm_popcnt_u64(data[i]);
    s1 += _mm_popcnt_u64(data[i + 1]);
    s2 += _mm_popcnt_u64(data[i + 2]);
    s3 += _mm_popcnt_u64(data[i + 3]);
}
uint64_t total = s0 + s1 + s2 + s3;
This often performs better regardless of minor compiler decisions.
3. Separate setup from the hot loop
Initialization, parsing, random filling, and timing code should not distract from the hot path.
Common Mistakes
1. Assuming equivalent C++ means equivalent speed
Two loops can compute the same result but generate different machine code.
Mistake
for (unsigned i = 0; i < n; ++i) { ... }
for (uint64_t i = 0; i < n; ++i) { ... }
These are not guaranteed to compile the same way.
Avoid it
- Inspect assembly for hot loops.
- Benchmark real builds.
- Focus on dependency chains, not just source-level similarity.
2. Accumulating everything into one running sum
Slower pattern
uint64_t total = 0;
for (size_t i = 0; i < n; i += 4) {
    total += _mm_popcnt_u64(data[i]);
    total += _mm_popcnt_u64(data[i + 1]);
    total += _mm_popcnt_u64(data[i + 2]);
    total += _mm_popcnt_u64(data[i + 3]);
}
This can create a long serial chain.
Better pattern
uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
for (size_t i = 0; i < n; i += 4) {
    s0 += _mm_popcnt_u64(data[i]);
    s1 += _mm_popcnt_u64(data[i + 1]);
    s2 += _mm_popcnt_u64(data[i + 2]);
    s3 += _mm_popcnt_u64(data[i + 3]);
}
uint64_t total = s0 + s1 + s2 + s3;
Comparisons
Related performance factors
| Topic | What it affects | Notes |
|---|---|---|
| unsigned vs uint64_t index | Register width, addressing, scheduling | Not inherently faster or slower by itself |
| Single accumulator vs multiple accumulators | Dependency length | Multiple accumulators often improve throughput |
| Runtime bound vs constant bound | Loop shape, compare style, unrolling | Can help or hurt depending on generated code |
| Intrinsics vs inline assembly | Control vs compiler freedom | Intrinsics are usually easier and safer |
| size_t vs fixed-width integer for indexing | Portability and natural machine size | size_t is usually the idiomatic index type |
Cheat Sheet
Quick reference
Popcount intrinsic
#include <x86intrin.h>
int bits = _mm_popcnt_u64(x);
Prefer a clean hot loop
uint64_t total = 0;
for (size_t i = 0; i < n; ++i) {
    total += _mm_popcnt_u64(data[i]);
}
Better throughput with multiple accumulators
uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
for (size_t i = 0; i < n; i += 4) {
    s0 += _mm_popcnt_u64(data[i]);
    s1 += _mm_popcnt_u64(data[i + 1]);
    s2 += _mm_popcnt_u64(data[i + 2]);
    s3 += _mm_popcnt_u64(data[i + 3]);
}
uint64_t total = s0 + s1 + s2 + s3;
Key rules
- Equivalent C++ can generate different assembly.
- In hot loops, dependency chains matter a lot.
- Constants do not always make code faster.
- static can change optimization indirectly.
- Use intrinsics before trying inline assembly.
FAQ
Why did uint64_t make the loop slower?
Not because 64-bit integers are always slower, but because the compiler generated a different instruction schedule with less favorable dependency behavior.
Is shorter assembly always faster?
No. A shorter instruction sequence can still run slower if it creates longer dependency chains or uses execution resources poorly.
Why can a compile-time constant make code slower?
A constant can change the compiler's optimization decisions. That may produce a different loop form that happens to run worse on a specific CPU.
Does static optimize code by itself?
Not directly in a magical way. It changes program structure and compiler assumptions, which can lead to different generated code.
Should I always use unsigned for loop counters in hot loops?
No. Use the type that matches the problem, usually size_t for indexing. Then benchmark if the loop is performance-critical.
Is inline assembly the best solution for reliable maximum speed?
Usually no. Intrinsics plus careful benchmarking are preferred first. Inline assembly is harder to maintain and can reduce the compiler's ability to optimize surrounding code.
How do I make popcount loops faster more reliably?
Use multiple accumulators, keep the loop simple, inspect assembly, and measure on the actual hardware you care about.
Mini Project
Description
Build a small C++ benchmark that compares different popcount loop styles on the same data. This project demonstrates how source-level changes can affect generated code and performance, even when the logic is identical.
Goal
Create a benchmark that measures popcount performance using a single accumulator and multiple accumulators, then compare the results.
Requirements
- Create a vector of uint64_t values with deterministic test data.
- Implement one function that uses a single accumulator.
- Implement one function that uses four accumulators.
- Time both functions over many repetitions.
- Print the total count and elapsed time for each version.