Why Sorted Arrays Can Be Faster in C++: Branch Prediction Explained
Question
In the following C++ program, sorting the array before the timed section makes the main loop run much faster:
#include <algorithm>
#include <ctime>
#include <cstdlib>
#include <iostream>

int main()
{
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // Sorting here makes the later loop much faster.
    std::sort(data, data + arraySize);

    clock_t start = clock();
    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
    {
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

    std::cout << elapsedTime << '\n';
    std::cout << "sum = " << sum << '\n';
}
Without sorting, the loop takes much longer. With sorting, it becomes several times faster.
A similar effect can also be observed in Java:
import java.util.Arrays;
import java.util.Random;

public class Main {
    public static void main(String[] args) {
        int arraySize = 32768;
        int[] data = new int[arraySize];

        Random rnd = new Random(0);
        for (int c = 0; c < arraySize; ++c)
            data[c] = rnd.nextInt() % 256;

        // Sorting here also makes the later loop faster.
        Arrays.sort(data);

        long start = System.nanoTime();
        long sum = 0;

        for (int i = 0; i < 100000; ++i) {
            for (int c = 0; c < arraySize; ++c) {
                if (data[c] >= 128)
                    sum += data[c];
            }
        }

        // Convert nanoseconds to seconds for printing.
        System.out.println((System.nanoTime() - start) / 1000000000.0);
        System.out.println("sum = " + sum);
    }
}
The order of the values should not change the final sum, so why does processing a sorted array often run much faster than processing an unsorted array?
Short Answer
By the end of this page, you will understand why a sorted array can make a loop much faster even when the algorithm is logically doing the same work. The main reason is usually CPU branch prediction: when data is sorted, the if condition becomes highly predictable, so the processor can keep its pipeline full. You will also see the roles of cache, compiler optimizations, and branchless alternatives.
Concept
The core concept here is branch prediction.
When your code contains a condition like this:
if (data[c] >= 128)
    sum += data[c];
the CPU must decide whether the branch will be taken or not taken. Modern processors do not simply wait for the result of every condition. They try to predict the outcome in advance so they can keep executing instructions without stalling.
Why this matters
Modern CPUs are deeply pipelined. That means they prepare and execute multiple instruction stages at once. If the CPU predicts a branch correctly, execution continues smoothly. If it predicts incorrectly, it must throw away speculative work and restart from the correct path. This is called a branch misprediction.
A branch misprediction is expensive compared with a simple integer comparison.
Why sorted data helps
If the array is random, the condition data[c] >= 128 is also roughly random:
- sometimes true
- sometimes false
- often alternating unpredictably
That makes the branch hard to predict.
If the array is sorted, the pattern becomes predictable:
- first many values are < 128
- then many values are >= 128
So instead of a random true/false pattern, the CPU sees a long run of false outcomes followed by a long run of true outcomes. That is much easier to predict.
Mental Model
Imagine a toll road with two lanes:
- the left lane is for values < 128
- the right lane is for values >= 128
A worker tries to guess which lane each next car should go into.
Unsorted array
Cars arrive in random order:
- left
- right
- right
- left
- left
- right
The worker keeps guessing wrong, traffic stops, and cars need to be redirected. That is like branch misprediction.
Sorted array
Now cars arrive in a much more organized pattern:
- a long stream goes left
- then a long stream goes right
The worker quickly learns the pattern and keeps traffic moving. That is like good branch prediction.
The amount of work is logically the same, but the road flows much more smoothly.
Syntax and Examples
The key syntax is a conditional branch inside a loop:
for (int i = 0; i < n; ++i) {
    if (data[i] >= 128)
        sum += data[i];
}
Example with unsorted data
#include <iostream>

int main() {
    int data[] = {130, 12, 200, 7, 128, 5, 250, 1};
    int sum = 0;
    for (int i = 0; i < 8; ++i) {
        if (data[i] >= 128)
            sum += data[i];
    }
    std::cout << sum << '\n';
}
The condition results are:
- true
- false
- true
- false
- true
- false
- true
- false
That alternating pattern is harder for the CPU to predict.
Example with sorted data
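If the same eight values are sorted first, only their order changes. The sketch below mirrors the example above with a std::sort call added:

#include <algorithm>
#include <iostream>

int main() {
    int data[] = {130, 12, 200, 7, 128, 5, 250, 1};
    std::sort(data, data + 8);  // data becomes {1, 5, 7, 12, 128, 130, 200, 250}
    int sum = 0;
    for (int i = 0; i < 8; ++i) {
        if (data[i] >= 128)
            sum += data[i];
    }
    std::cout << sum << '\n';
}

The condition results are now:
- false
- false
- false
- false
- true
- true
- true
- true

A long run of false followed by a long run of true is much easier for the branch predictor to handle, even though the final sum is identical.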
Step by Step Execution
Consider this small example:
int data[] = {3, 180, 20, 140, 8, 200};
int sum = 0;
for (int i = 0; i < 6; ++i) {
    if (data[i] >= 128)
        sum += data[i];
}
Step-by-step
Iteration 1
- data[0] is 3
- 3 >= 128 is false, so nothing is added
- sum = 0

Iteration 2
- data[1] is 180
- 180 >= 128 is true, so add 180
- sum = 180

Iteration 3
- data[2] is 20
- 20 >= 128 is false, so nothing is added
- sum = 180

Iteration 4
- data[3] is 140
- 140 >= 128 is true, so add 140
- sum = 320

Iteration 5
- data[4] is 8
- 8 >= 128 is false, so nothing is added
- sum = 320

Iteration 6
- data[5] is 200
- 200 >= 128 is true, so add 200
- sum = 520

The branch outcome alternates false, true, false, true, false, true. An irregular, alternating pattern like this is exactly what makes the branch predictor guess wrong.
Real World Use Cases
This effect appears in many real programs whenever code branches based on data.
Filtering records
if (user.age >= 18)
    adults.push_back(user);
If the input is grouped by age range, the branch may be more predictable than if users are randomly ordered.
Image or signal processing
if (pixel > threshold)
    brightCount++;
Threshold checks on ordered or partially ordered data can be faster than on random data.
Log processing
if (entry.level == ERROR)
    handleError(entry);
If log entries are grouped by severity, branch behavior becomes more regular.
Data cleanup pipelines
if (!value.empty())
    process(value);
Missing/non-missing data patterns can affect branch predictability.
Numerical computing
if (x > 0)
    sum += x;
A random sign distribution may perform differently from grouped positive and negative values.
Real Codebase Usage
In real projects, developers often deal with this concept indirectly rather than manually sorting just for branch prediction.
Common patterns
1. Filtering with predictable data
If data is naturally grouped, loops with conditions may run faster without any special effort.
2. Guard clauses
Developers often write early checks like:
if (items.empty())
    return;
or:
if (!isValid(input))
    return false;
These are branches too, but usually they are highly predictable and cheap.
3. Validation passes
Code that validates large arrays or records often contains repeated conditions. If the data is noisy and random, branches may become expensive.
4. Branchless hot loops
In performance-critical code, developers may avoid unpredictable branches:
sum += (data[i] >= 128) ? data[i] : 0;
or use SIMD/vectorized approaches.
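A related branchless variant replaces the comparison with bit arithmetic. The sketch below assumes 32-bit int values in the range 0 to 255 and an arithmetic right shift of negative signed values, which is implementation-defined in older C++ standards but holds on mainstream compilers:

#include <cstddef>
#include <vector>

// Sums values >= 128 without a conditional branch.
long long sumBranchless(const std::vector<int>& data) {
    long long sum = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        // t is all ones when data[i] < 128, all zeros otherwise.
        int t = (data[i] - 128) >> 31;
        sum += ~t & data[i];
    }
    return sum;
}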
5. Partitioning instead of full sorting
If the only goal is to separate values by a condition, developers may use partitioning rather than sorting:
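A minimal sketch using std::partition; the wrapper function is illustrative:

#include <algorithm>
#include <vector>

// Groups values >= 128 at the front of the vector in linear time.
// Unlike std::sort, the relative order within each group is not preserved.
void groupByThreshold(std::vector<int>& values) {
    std::partition(values.begin(), values.end(),
                   [](int v) { return v >= 128; });
}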
Common Mistakes
1. Assuming this is mainly a cache effect
A common guess is that sorted data is faster because it is "more in cache." In this example, both versions scan the array sequentially, so cache behavior is very similar.
The bigger issue is usually branch prediction.
2. Thinking source-level work equals CPU-level work
These two pieces of code look like they should cost the same:
if (data[i] >= 128)
    sum += data[i];
and
sum += (data[i] >= 128) ? data[i] : 0;
But the generated machine code may be very different. One may branch; another may use conditional move or arithmetic.
3. Benchmarking without optimization
If you compare performance without compiler optimizations, results may be misleading.
Use release settings such as:
g++ -O2 file.cpp
or
g++ -O3 file.cpp
and verify what the compiler actually generated, for example by inspecting the assembly output (g++ -S) or using a tool such as Compiler Explorer.
4. Using a benchmark the compiler can simplify away
If sum is never used, a compiler might remove the loop entirely.
Broken benchmark example:
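A minimal sketch of the mistake (hypothetical code):

#include <vector>

int main() {
    std::vector<int> data(32768, 200);
    long long sum = 0;
    for (int v : data) {
        if (v >= 128)
            sum += v;
    }
    // sum is never used after the loop, so an optimizing compiler
    // may delete the loop entirely and any timing around it
    // would measure nothing.
}

Printing sum, or otherwise making it observable, keeps the work alive.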
Comparisons
| Concept | What it means | Performance impact |
|---|---|---|
| Sorted vs unsorted data | Same values, different order | Sorted often makes branch outcomes more predictable |
| Branching code | Uses if or similar control flow | Can be fast or slow depending on predictability |
| Branchless code | Uses arithmetic or conditional moves instead of hard branches | Can reduce misprediction cost |
| Cache locality | How memory is laid out and accessed | Important, but not the main reason in this example |
| Sequential access | Reading array from start to end | Good locality in both sorted and unsorted versions |
if branch vs branchless expression
| Style | Example |
|---|---|
| if branch | if (data[i] >= 128) sum += data[i]; |
| Branchless expression | sum += (data[i] >= 128) ? data[i] : 0; |
Cheat Sheet
Main idea
A sorted array can be faster to process because branch outcomes become predictable.
Typical hot-loop pattern
if (data[i] >= 128)
    sum += data[i];
Why sorted data helps
- unsorted random data -> branch is hard to predict
- sorted data -> long run of false, then long run of true
- fewer branch mispredictions -> faster execution
Not usually the main cause here
- not primarily cache
- not primarily language-specific behavior
- not because the sum changes
CPU concept to remember
- correct branch prediction: pipeline keeps flowing
- wrong branch prediction: pipeline flush, wasted work
When the effect is strongest
- data is random
- branch result is near 50/50
- the loop is repeated many times
- the compiler emits an actual branch
When the effect may shrink
- branchless machine code
- vectorization
- memory bandwidth dominates
- data is already biased mostly true or mostly false
Useful alternatives
sum += (data[i] >= 128) ? data[i] : 0;
FAQ
Why does sorting make the loop faster if the result is the same?
Because the CPU cares not only about what work is done, but also how predictable the control flow is. Sorting makes the if condition more predictable.
Is this a cache optimization?
Usually not in this example. Both versions read the array sequentially, so memory locality is already good. The bigger effect is branch prediction.
Why does the branch become hard to predict on random data?
Because values above and below the threshold appear in an irregular order, so the CPU often guesses the wrong path.
Does this happen only in C++?
No. It can also happen in Java, C, Rust, and other languages because the code ultimately runs on CPUs with branch predictors.
Can the compiler remove this problem?
Sometimes. A compiler may generate branchless code or vectorized code, which reduces or eliminates the difference between sorted and unsorted input.
Is sorting worth it just to speed up one loop?
Usually no. Sorting costs more than a single linear pass. It only makes sense if you will reuse the ordered data enough times.
Would std::partition be better than std::sort here?
If you only want to group values by a threshold, yes, partitioning is often a better fit because it is linear time and does not fully sort the array.
Why do modern compilers sometimes show a smaller difference?
Because newer compilers may produce better machine code, including branchless instructions or vectorized loops, which reduce dependence on branch prediction.
Mini Project
Description
Build a small benchmark program in C++ that compares three ways of summing values greater than or equal to a threshold:
- processing random data with an if statement
- processing sorted data with an if statement
- processing random data with a branchless expression
This project helps you see that the same logical task can run differently depending on how predictable the branch is and how the code is written.
Goal
Measure how data order and branch style affect loop performance in a tight C++ benchmark.
Requirements
- Generate a vector of integers with values in a small range such as 0 to 255.
- Benchmark summing values greater than or equal to 128 on unsorted data and sorted data.
- Add a third benchmark that uses a branchless expression instead of if.
- Print the elapsed time and final sum for each approach.
- Keep the computation real so the compiler cannot optimize it away.
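One possible starting skeleton is sketched below. It is illustrative rather than a reference solution: the helper names (sumWithIf, sumBranchless, run) and the repetition count are arbitrary choices, and as noted earlier, an optimizer may turn the ternary into branchless machine code on its own, shrinking the measured difference.

#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <vector>

// Sums values >= 128 using an if statement.
long long sumWithIf(const std::vector<int>& data, int reps) {
    long long sum = 0;
    for (int r = 0; r < reps; ++r)
        for (int v : data)
            if (v >= 128)
                sum += v;
    return sum;
}

// Sums values >= 128 using a branchless expression.
long long sumBranchless(const std::vector<int>& data, int reps) {
    long long sum = 0;
    for (int r = 0; r < reps; ++r)
        for (int v : data)
            sum += (v >= 128) ? v : 0;
    return sum;
}

// Times one approach and prints the sum so the compiler cannot drop the work.
template <typename F>
void run(const char* label, F f) {
    auto start = std::chrono::steady_clock::now();
    long long sum = f();
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    std::cout << label << ": " << elapsed.count()
              << " s, sum = " << sum << '\n';
}

int main() {
    const int reps = 100;

    std::mt19937 rng(0);
    std::uniform_int_distribution<int> dist(0, 255);
    std::vector<int> randomData(32768);
    for (int& v : randomData)
        v = dist(rng);

    std::vector<int> sortedData = randomData;
    std::sort(sortedData.begin(), sortedData.end());

    run("unsorted + if", [&] { return sumWithIf(randomData, reps); });
    run("sorted + if", [&] { return sumWithIf(sortedData, reps); });
    run("unsorted + branchless", [&] { return sumBranchless(randomData, reps); });
}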