Why Sorted Arrays Can Be Faster in C++: Branch Prediction Explained
Question
In the following C++ program, sorting the array before the timed section makes the main loop run much faster:
#include <algorithm>
#include <ctime>
#include <cstdlib>
#include <iostream>

int main()
{
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // Sorting here makes the later loop much faster.
    std::sort(data, data + arraySize);

    clock_t start = clock();
    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
    {
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

    std::cout << elapsedTime << '\n';
    std::cout << "sum = " << sum << '\n';
}
Without sorting, the loop takes much longer. With sorting, it becomes several times faster.
A similar effect can also be observed in Java:
import java.util.Arrays;
import java.util.Random;

public class Main {
    public static void main(String[] args) {
        int arraySize = 32768;
        int[] data = new int[arraySize];

        Random rnd = new Random(0);
        for (int c = 0; c < arraySize; ++c)
            data[c] = rnd.nextInt() % 256;

        // Sorting here also makes the later loop faster.
        Arrays.sort(data);

        long start = System.nanoTime();
        long sum = 0;

        for (int i = 0; i < 100000; ++i) {
            for (int c = 0; c < arraySize; ++c) {
                if (data[c] >= 128)
                    sum += data[c];
            }
        }

        // Convert nanoseconds to seconds for printing.
        System.out.println((System.nanoTime() - start) / 1000000000.0);
        System.out.println("sum = " + sum);
    }
}
The order of the values should not change the final sum, so why does processing a sorted array often run much faster than processing an unsorted array?
Short Answer
By the end of this page, you will understand why a sorted array can make a loop much faster even when the algorithm is logically doing the same work. The main reason is usually CPU branch prediction: when data is sorted, the if condition becomes highly predictable, so the processor can keep its pipeline full. You will also see the roles of cache, compiler optimizations, and branchless alternatives.
Concept
The core concept here is branch prediction.
When your code contains a condition like this:
if (data[c] >= 128)
    sum += data[c];
the CPU must decide whether the branch will be taken or not taken. Modern processors do not simply wait for the result of every condition. They try to predict the outcome in advance so they can keep executing instructions without stalling.
Why this matters
Modern CPUs are deeply pipelined. That means they prepare and execute multiple instruction stages at once. If the CPU predicts a branch correctly, execution continues smoothly. If it predicts incorrectly, it must throw away speculative work and restart from the correct path. This is called a branch misprediction.
A branch misprediction is expensive compared with a simple integer comparison.
Why sorted data helps
If the array is random, the condition data[c] >= 128 is also roughly random:
- sometimes true
- sometimes false
- often alternating unpredictably
That makes the branch hard to predict.
If the array is sorted, the pattern becomes predictable:
- first many values are < 128
- then many values are >= 128
So instead of a random true/false pattern, the CPU sees a long run of false outcomes followed by a long run of true outcomes. That is much easier to predict.
Mental Model
Imagine a toll road with two lanes:
- the left lane is for values < 128
- the right lane is for values >= 128
A worker tries to guess which lane each next car should go into.
Unsorted array
Cars arrive in random order:
- left
- right
- right
- left
- left
- right
The worker keeps guessing wrong, traffic stops, and cars need to be redirected. That is like branch misprediction.
Sorted array
Now cars arrive in a much more organized pattern:
- a long stream goes left
- then a long stream goes right
The worker quickly learns the pattern and keeps traffic moving. That is like good branch prediction.
The amount of work is logically the same, but the road flows much more smoothly.
Syntax and Examples
The key syntax is a conditional branch inside a loop:
for (int i = 0; i < n; ++i) {
    if (data[i] >= 128)
        sum += data[i];
}
Example with unsorted data
#include <iostream>

int main() {
    int data[] = {130, 12, 200, 7, 128, 5, 250, 1};
    int sum = 0;
    for (int i = 0; i < 8; ++i) {
        if (data[i] >= 128)
            sum += data[i];
    }
    std::cout << sum << '\n';
}
The condition results are:
- true
- false
- true
- false
- true
- false
- true
- false
That alternating pattern is harder for the CPU to predict.
Example with sorted data
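If the same eight values are sorted first, only their order changes. The sketch below mirrors the example above with a std::sort call added:

#include <algorithm>
#include <iostream>

int main() {
    int data[] = {130, 12, 200, 7, 128, 5, 250, 1};
    std::sort(data, data + 8);  // data becomes {1, 5, 7, 12, 128, 130, 200, 250}
    int sum = 0;
    for (int i = 0; i < 8; ++i) {
        if (data[i] >= 128)
            sum += data[i];
    }
    std::cout << sum << '\n';
}

The condition results are now:
- false
- false
- false
- false
- true
- true
- true
- true

A long run of false followed by a long run of true is much easier for the branch predictor to handle, even though the final sum is identical.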
Step by Step Execution
Consider this small example:
int data[] = {3, 180, 20, 140, 8, 200};
int sum = 0;
for (int i = 0; i < 6; ++i) {
    if (data[i] >= 128)
        sum += data[i];
}
Step-by-step
Iteration 1
- data[0] is 3
- 3 >= 128 is false, so nothing is added
- sum = 0

Iteration 2
- data[1] is 180
- 180 >= 128 is true, so add 180
- sum = 180

Iteration 3
- data[2] is 20
- 20 >= 128 is false, so nothing is added
- sum = 180

Iteration 4
- data[3] is 140
- 140 >= 128 is true, so add 140
- sum = 320

Iteration 5
- data[4] is 8
- 8 >= 128 is false, so nothing is added
- sum = 320

Iteration 6
- data[5] is 200
- 200 >= 128 is true, so add 200
- sum = 520

The branch outcome alternates false, true, false, true, false, true. An irregular, alternating pattern like this is exactly what makes the branch predictor guess wrong.
Real World Use Cases
This effect appears in many real programs whenever code branches based on data.
Filtering records
if (user.age >= 18)
    adults.push_back(user);
If the input is grouped by age range, the branch may be more predictable than if users are randomly ordered.
Image or signal processing
if (pixel > threshold)
    brightCount++;
Threshold checks on ordered or partially ordered data can be faster than on random data.
Log processing
if (entry.level == ERROR)
    handleError(entry);
If log entries are grouped by severity, branch behavior becomes more regular.
Data cleanup pipelines
if (!value.empty())
    process(value);
Missing/non-missing data patterns can affect branch predictability.
Numerical computing
if (x > 0)
    sum += x;
A random sign distribution may perform differently from grouped positive and negative values.
Real Codebase Usage
In real projects, developers often deal with this concept indirectly rather than manually sorting just for branch prediction.
Common patterns
1. Filtering with predictable data
If data is naturally grouped, loops with conditions may run faster without any special effort.
2. Guard clauses
Developers often write early checks like:
if (items.empty())
    return;
or:
if (!isValid(input))
    return false;
These are branches too, but usually they are highly predictable and cheap.
3. Validation passes
Code that validates large arrays or records often contains repeated conditions. If the data is noisy and random, branches may become expensive.
4. Branchless hot loops
In performance-critical code, developers may avoid unpredictable branches:
sum += (data[i] >= 128) ? data[i] : 0;
or use SIMD/vectorized approaches.
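A related branchless variant replaces the comparison with bit arithmetic. The sketch below assumes 32-bit int values in the range 0 to 255 and an arithmetic right shift of negative signed values, which is implementation-defined in older C++ standards but holds on mainstream compilers:

#include <cstddef>
#include <vector>

// Sums values >= 128 without a conditional branch.
long long sumBranchless(const std::vector<int>& data) {
    long long sum = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        // t is all ones when data[i] < 128, all zeros otherwise.
        int t = (data[i] - 128) >> 31;
        sum += ~t & data[i];
    }
    return sum;
}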
5. Partitioning instead of full sorting
If the only goal is to separate values by a condition, developers may use partitioning rather than sorting:
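A minimal sketch using std::partition; the wrapper function is illustrative:

#include <algorithm>
#include <vector>

// Groups values >= 128 at the front of the vector in linear time.
// Unlike std::sort, the relative order within each group is not preserved.
void groupByThreshold(std::vector<int>& values) {
    std::partition(values.begin(), values.end(),
                   [](int v) { return v >= 128; });
}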
Common Mistakes
1. Assuming this is mainly a cache effect
A common guess is that sorted data is faster because it is "more in cache." In this example, both versions scan the array sequentially, so cache behavior is very similar.
The bigger issue is usually branch prediction.
2. Thinking source-level work equals CPU-level work
These two pieces of code look like they should cost the same:
if (data[i] >= 128)
    sum += data[i];
and
sum += (data[i] >= 128) ? data[i] : 0;
But the generated machine code may be very different. One may branch; another may use conditional move or arithmetic.
3. Benchmarking without optimization
If you compare performance without compiler optimizations, results may be misleading.
Use release settings such as:
g++ -O2 file.cpp
or
g++ -O3 file.cpp
and verify what the compiler actually generated, for example by inspecting the assembly output (g++ -S) or using a tool such as Compiler Explorer.
4. Using a benchmark the compiler can simplify away
If sum is never used, a compiler might remove the loop entirely.
Broken benchmark example:
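A minimal sketch of the mistake (hypothetical code):

#include <vector>

int main() {
    std::vector<int> data(32768, 200);
    long long sum = 0;
    for (int v : data) {
        if (v >= 128)
            sum += v;
    }
    // sum is never used after the loop, so an optimizing compiler
    // may delete the loop entirely and any timing around it
    // would measure nothing.
}

Printing sum, or otherwise making it observable, keeps the work alive.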
Comparisons
| Concept | What it means | Performance impact |
|---|---|---|
| Sorted vs unsorted data | Same values, different order | Sorted often makes branch outcomes more predictable |
| Branching code | Uses if or similar control flow | Can be fast or slow depending on predictability |
| Branchless code | Uses arithmetic or conditional moves instead of hard branches | Can reduce misprediction cost |
| Cache locality | How memory is laid out and accessed | Important, but not the main reason in this example |
| Sequential access | Reading array from start to end | Good locality in both sorted and unsorted versions |
if branch vs branchless expression
| Style | Example |
|---|---|
| if branch | if (data[i] >= 128) sum += data[i]; |
| Branchless expression | sum += (data[i] >= 128) ? data[i] : 0; |
Cheat Sheet
Main idea
A sorted array can be faster to process because branch outcomes become predictable.
Typical hot-loop pattern
if (data[i] >= 128)
    sum += data[i];
Why sorted data helps
- unsorted random data -> branch is hard to predict
- sorted data -> long run of false, then long run of true
- fewer branch mispredictions -> faster execution
Not usually the main cause here
- not primarily cache
- not primarily language-specific behavior
- not because the sum changes
CPU concept to remember
- correct branch prediction: pipeline keeps flowing
- wrong branch prediction: pipeline flush, wasted work
When the effect is strongest
- data is random
- branch result is near 50/50
- the loop is repeated many times
- the compiler emits an actual branch
When the effect may shrink
- branchless machine code
- vectorization
- memory bandwidth dominates
- data is already biased mostly true or mostly false
Useful alternatives
sum += (data[i] >= 128) ? data[i] : 0;
FAQ
Why does sorting make the loop faster if the result is the same?
Because the CPU cares not only about what work is done, but also how predictable the control flow is. Sorting makes the if condition more predictable.
Is this a cache optimization?
Usually not in this example. Both versions read the array sequentially, so memory locality is already good. The bigger effect is branch prediction.
Why does the branch become hard to predict on random data?
Because values above and below the threshold appear in an irregular order, so the CPU often guesses the wrong path.
Does this happen only in C++?
No. It can also happen in Java, C, Rust, and other languages because the code ultimately runs on CPUs with branch predictors.
Can the compiler remove this problem?
Sometimes. A compiler may generate branchless code or vectorized code, which reduces or eliminates the difference between sorted and unsorted input.
Is sorting worth it just to speed up one loop?
Usually no. Sorting costs more than a single linear pass. It only makes sense if you will reuse the ordered data enough times.
Would std::partition be better than std::sort here?
If you only want to group values by a threshold, yes, partitioning is often a better fit because it is linear time and does not fully sort the array.
Why do modern compilers sometimes show a smaller difference?
Because newer compilers may produce better machine code, including branchless instructions or vectorized loops, which reduce dependence on branch prediction.
Mini Project
Description
Build a small benchmark program in C++ that compares three ways of summing values greater than or equal to a threshold:
- processing random data with an if statement
- processing sorted data with an if statement
- processing random data with a branchless expression
This project helps you see that the same logical task can run differently depending on how predictable the branch is and how the code is written.
Goal
Measure how data order and branch style affect loop performance in a tight C++ benchmark.
Requirements
- Generate a vector of integers with values in a small range such as 0 to 255.
- Benchmark summing values greater than or equal to 128 on unsorted data and sorted data.
- Add a third benchmark that uses a branchless expression instead of if.
- Print the elapsed time and final sum for each approach.
- Keep the computation real so the compiler cannot optimize it away.
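One possible starting skeleton is sketched below. It is illustrative rather than a reference solution: the helper names (sumWithIf, sumBranchless, run) and the repetition count are arbitrary choices, and as noted earlier, an optimizer may turn the ternary into branchless machine code on its own, shrinking the measured difference.

#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <vector>

// Sums values >= 128 using an if statement.
long long sumWithIf(const std::vector<int>& data, int reps) {
    long long sum = 0;
    for (int r = 0; r < reps; ++r)
        for (int v : data)
            if (v >= 128)
                sum += v;
    return sum;
}

// Sums values >= 128 using a branchless expression.
long long sumBranchless(const std::vector<int>& data, int reps) {
    long long sum = 0;
    for (int r = 0; r < reps; ++r)
        for (int v : data)
            sum += (v >= 128) ? v : 0;
    return sum;
}

// Times one approach and prints the sum so the compiler cannot drop the work.
template <typename F>
void run(const char* label, F f) {
    auto start = std::chrono::steady_clock::now();
    long long sum = f();
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    std::cout << label << ": " << elapsed.count()
              << " s, sum = " << sum << '\n';
}

int main() {
    const int reps = 100;

    std::mt19937 rng(0);
    std::uniform_int_distribution<int> dist(0, 255);
    std::vector<int> randomData(32768);
    for (int& v : randomData)
        v = dist(rng);

    std::vector<int> sortedData = randomData;
    std::sort(sortedData.begin(), sortedData.end());

    run("unsorted + if", [&] { return sumWithIf(randomData, reps); });
    run("sorted + if", [&] { return sumWithIf(sortedData, reps); });
    run("unsorted + branchless", [&] { return sumBranchless(randomData, reps); });
}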