Why Small Type Changes Can Change Performance in C++ Hot Loops
Question
I am benchmarking the fastest way to compute the population count of large arrays in C++. I found a surprising result: changing the inner loop counter from unsigned to uint64_t causes a major slowdown on Intel CPUs when using _mm_popcnt_u64.
Here is the benchmark:
#include <iostream>
#include <chrono>
#include <cstdlib>
#include <cstdint>
#include <x86intrin.h>

int main(int argc, char* argv[]) {
    using namespace std;

    if (argc != 2) {
        cerr << "usage: array_size in MB" << endl;
        return -1;
    }

    uint64_t size = atol(argv[1]) << 20;
    uint64_t* buffer = new uint64_t[size / 8];
    char* charbuffer = reinterpret_cast<char*>(buffer);
    for (unsigned i = 0; i < size; ++i)
        charbuffer[i] = rand() % 256;

    uint64_t count, duration;
    chrono::time_point<chrono::system_clock> startP, endP;

    {
        startP = chrono::system_clock::now();
        count = 0;
        for (unsigned k = 0; k < 10000; k++) {
            for (unsigned i = 0; i < size / 8; i += 4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i + 1]);
                count += _mm_popcnt_u64(buffer[i + 2]);
                count += _mm_popcnt_u64(buffer[i + 3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<chrono::nanoseconds>(endP - startP).count();
        cout << "unsigned\t" << count << '\t' << (duration / 1.0E9) << " sec\t"
             << (10000.0 * size) / duration << " GB/s" << endl;
    }
    {
        startP = chrono::system_clock::now();
        count = 0;
        for (unsigned k = 0; k < 10000; k++) {
            for (uint64_t i = 0; i < size / 8; i += 4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i + 1]);
                count += _mm_popcnt_u64(buffer[i + 2]);
                count += _mm_popcnt_u64(buffer[i + 3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<chrono::nanoseconds>(endP - startP).count();
        cout << "uint64_t\t" << count << '\t' << (duration / 1.0E9) << " sec\t"
             << (10000.0 * size) / duration << " GB/s" << endl;
    }

    delete[] buffer;
}
The surprising part is that the two loops do the same logical work, but produce very different performance results depending on whether the loop variable is 32-bit or 64-bit.
I would like to understand:
- Why can unsigned and uint64_t produce such different performance in a tight loop?
- Why can changing a runtime value to a compile-time constant lead to slower generated code?
- Why might adding static change optimization behavior?
- Is inline assembly the only reliable way to keep the fastest machine code for a hot loop like this?
The main issue is understanding how small source-level changes affect compiler optimization and generated assembly in performance-critical C++ code.
Short Answer
By the end of this page, you will understand why tiny C++ changes—like using a 32-bit loop index instead of a 64-bit one—can change the generated assembly and significantly affect performance in hot loops. You will also learn how instruction latency, dependency chains, addressing modes, and compiler heuristics influence real CPU throughput.
Concept
In performance-critical C++ code, the source code is only the starting point. What really matters is the machine code the compiler generates and how the CPU executes it.
In this question, the key concept is microarchitectural performance sensitivity:
- two pieces of C++ code can be logically equivalent,
- but compile into different assembly,
- and run at very different speeds on the same CPU.
Why this happens
Modern CPUs do not simply execute one instruction after another in a naive way. They:
- decode instructions,
- track dependencies between results,
- issue work to different execution units,
- and try to overlap independent operations.
A hot loop becomes fast when the compiler gives the CPU enough independent work. A hot loop becomes slow when the compiler accidentally creates a dependency chain where each instruction must wait for the previous one.
The important detail in this benchmark
POPCNT is not a free instruction. It has non-trivial latency. If the loop is generated like this conceptually:
tmp1 = popcnt(a);
sum = sum + tmp1;
tmp2 = popcnt(b);
sum = sum + tmp2;
tmp3 = popcnt(c);
sum = sum + tmp3;
then every add depends on the previous add. That creates a serial chain.
If instead the compiler structures the work more like this:
p1 = popcnt(a);
p2 = popcnt(b);
p3 = popcnt(c);
p4 = popcnt(d);
sum += p1 + p2 + p3 + p4;
then the four popcnt operations are independent of each other, and the CPU can overlap them instead of stalling on one long chain of adds.
Mental Model
Think of the CPU like a kitchen with several workers.
- POPCNT is a cooking step that takes a little time.
- ADD combines finished ingredients.
- A dependency chain means worker B must wait until worker A finishes every single step.
- Independent instructions mean multiple workers can prepare things at once.
Now compare two cooking styles:
Slow style
- Cook ingredient 1.
- Add it to the bowl.
- Cook ingredient 2.
- Add it to the bowl.
- Cook ingredient 3.
- Add it to the bowl.
Everything is serialized.
Faster style
- Start cooking ingredient 1.
- Start cooking ingredient 2.
- Start cooking ingredient 3.
- Start cooking ingredient 4.
- Combine the results.
Now the kitchen stays busy.
That is what happens in a CPU too. A tiny source-level change may cause the compiler to choose the first style instead of the second. The algorithm is the same, but the schedule is worse.
The loop counter type is not the real story. It is just one small input that nudges the compiler toward a different plan.
Syntax and Examples
Basic idea: counting set bits
In C++, population count means counting how many 1 bits are set in a number.
Example with the intrinsic
#include <cstdint>
#include <iostream>
#include <x86intrin.h>
int main() {
    uint64_t value = 0b10110100;
    int bits = _mm_popcnt_u64(value);
    std::cout << bits << '\n'; // 4
}
_mm_popcnt_u64(value) asks the compiler to use the CPU's POPCNT instruction for a 64-bit integer.
Looping over an array
#include <cstdint>
#include <iostream>
#include <vector>
#include <x86intrin.h>
int main() {
    std::vector<uint64_t> data = {1, 3, 7};  // example values
    uint64_t total = 0;
    for (size_t i = 0; i < data.size(); ++i) {
        total += _mm_popcnt_u64(data[i]);
    }
    std::cout << total << '\n';  // 1 + 2 + 3 = 6
}
Step by Step Execution
Traceable example
Consider this code:
#include <cstdint>
#include <iostream>
#include <x86intrin.h>
int main() {
    uint64_t data[4] = {1, 3, 7, 15};
    uint64_t total = 0;
    for (unsigned i = 0; i < 4; ++i) {
        total += _mm_popcnt_u64(data[i]);
    }
    std::cout << total << '\n';
}
Step by step
Initial state
- data[0] = 1 → binary 0001 → popcount 1
- data[1] = 3 → binary 0011 → popcount 2
- data[2] = 7 → binary 0111 → popcount 3
- data[3] = 15 → binary 1111 → popcount 4

Loop iterations

1. i = 0: total = 0 + 1 = 1
2. i = 1: total = 1 + 2 = 3
3. i = 2: total = 3 + 3 = 6
4. i = 3: total = 6 + 4 = 10

The program prints 10.
Real World Use Cases
Population count and hot-loop optimization appear in many real systems.
Common popcount use cases
- Databases and analytics: counting active flags in bitsets
- Search engines: bitmap indexes and filtering
- Compression: bit-level statistics and encoding decisions
- Cryptography: Hamming weight and bit operations
- Bioinformatics: comparing binary feature vectors
- Game engines: occupancy masks and spatial bitboards
Why loop structure matters in real apps
In a production codebase, the same algorithm may run:
- billions of times per second,
- on arrays much larger than CPU caches,
- under compiler and architecture constraints.
Small changes can affect:
- request throughput in a server,
- batch processing time,
- memory bandwidth usage,
- CPU utilization,
- power consumption.
Example scenario
Suppose an API service stores user permissions as bitsets. Each request needs to count enabled permissions across many records. If a hot loop becomes 30% slower because the compiler produced a worse instruction schedule, the whole service can become more expensive to run.
Real Codebase Usage
In real projects, developers usually do not rely on a single fragile loop form. They use patterns that make performance more stable.
1. Prefer size_t for indexing
For indexing containers and memory blocks, size_t is usually the natural choice:
for (size_t i = 0; i < n; ++i) {
    total += _mm_popcnt_u64(data[i]);
}
This matches the platform's native pointer size and avoids some signed/unsigned mismatch issues.
2. Use multiple accumulators
This is a common real-world pattern to reduce dependency chains:
uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
for (size_t i = 0; i < n; i += 4) {
    s0 += _mm_popcnt_u64(data[i]);
    s1 += _mm_popcnt_u64(data[i + 1]);
    s2 += _mm_popcnt_u64(data[i + 2]);
    s3 += _mm_popcnt_u64(data[i + 3]);
}
uint64_t total = s0 + s1 + s2 + s3;
This often performs better regardless of minor compiler decisions.
3. Separate setup from the hot loop
Initialization, parsing, random filling, and timing code should not distract from the hot path.
Common Mistakes
1. Assuming equivalent C++ means equivalent speed
Two loops can compute the same result but generate different machine code.
Mistake
for (unsigned i = 0; i < n; ++i) { ... }
for (uint64_t i = 0; i < n; ++i) { ... }
These are not guaranteed to compile the same way.
Avoid it
- Inspect assembly for hot loops.
- Benchmark real builds.
- Focus on dependency chains, not just source-level similarity.
2. Accumulating everything into one running sum
Slower pattern
uint64_t total = 0;
for (size_t i = 0; i < n; i += 4) {
    total += _mm_popcnt_u64(data[i]);
    total += _mm_popcnt_u64(data[i + 1]);
    total += _mm_popcnt_u64(data[i + 2]);
    total += _mm_popcnt_u64(data[i + 3]);
}
This can create a long serial chain.
Better pattern
uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
for (size_t i = 0; i < n; i += 4) {
    s0 += _mm_popcnt_u64(data[i]);
    s1 += _mm_popcnt_u64(data[i + 1]);
    s2 += _mm_popcnt_u64(data[i + 2]);
    s3 += _mm_popcnt_u64(data[i + 3]);
}
uint64_t total = s0 + s1 + s2 + s3;
Comparisons
Related performance factors
| Topic | What it affects | Notes |
|---|---|---|
| unsigned vs uint64_t index | Register width, addressing, scheduling | Not inherently faster or slower by itself |
| Single accumulator vs multiple accumulators | Dependency length | Multiple accumulators often improve throughput |
| Runtime bound vs constant bound | Loop shape, compare style, unrolling | Can help or hurt depending on generated code |
| Intrinsics vs inline assembly | Control vs compiler freedom | Intrinsics are usually easier and safer |
| size_t vs fixed-width integer for indexing | Portability and natural machine size | size_t is usually the idiomatic index type |
Cheat Sheet
Quick reference
Popcount intrinsic
#include <x86intrin.h>
int bits = _mm_popcnt_u64(x);
Prefer a clean hot loop
uint64_t total = 0;
for (size_t i = 0; i < n; ++i) {
    total += _mm_popcnt_u64(data[i]);
}
Better throughput with multiple accumulators
uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
for (size_t i = 0; i < n; i += 4) {
    s0 += _mm_popcnt_u64(data[i]);
    s1 += _mm_popcnt_u64(data[i + 1]);
    s2 += _mm_popcnt_u64(data[i + 2]);
    s3 += _mm_popcnt_u64(data[i + 3]);
}
uint64_t total = s0 + s1 + s2 + s3;
Key rules
- Equivalent C++ can generate different assembly.
- In hot loops, dependency chains matter a lot.
- Constants do not always make code faster.
- static can change optimization indirectly.
- Use intrinsics before trying inline assembly.
FAQ
Why did uint64_t make the loop slower?
Not because 64-bit integers are always slower, but because the compiler generated a different instruction schedule with less favorable dependency behavior.
Is shorter assembly always faster?
No. A shorter instruction sequence can still run slower if it creates longer dependency chains or uses execution resources poorly.
Why can a compile-time constant make code slower?
A constant can change the compiler's optimization decisions. That may produce a different loop form that happens to run worse on a specific CPU.
Does static optimize code by itself?
Not directly in a magical way. It changes program structure and compiler assumptions, which can lead to different generated code.
Should I always use unsigned for loop counters in hot loops?
No. Use the type that matches the problem, usually size_t for indexing. Then benchmark if the loop is performance-critical.
Is inline assembly the best solution for reliable maximum speed?
Usually no. Intrinsics plus careful benchmarking are preferred first. Inline assembly is harder to maintain and can reduce the compiler's ability to optimize surrounding code.
How do I make popcount loops faster more reliably?
Use multiple accumulators, keep the loop simple, inspect assembly, and measure on the actual hardware you care about.
Mini Project
Description
Build a small C++ benchmark that compares different popcount loop styles on the same data. This project demonstrates how source-level changes can affect generated code and performance, even when the logic is identical.
Goal
Create a benchmark that measures popcount performance using a single accumulator and multiple accumulators, then compare the results.
Requirements
- Create a vector of uint64_t values with deterministic test data.
- Implement one function that uses a single accumulator.
- Implement one function that uses four accumulators.
- Time both functions over many repetitions.
- Print the total count and elapsed time for each version.