Why Separate Loops Can Be Faster Than a Combined Loop in C++: Cache, Memory Bandwidth, and Loop Optimization
Question
In C++, why can element-wise additions be significantly faster when performed in two separate loops instead of one combined loop?
Consider heap-allocated arrays a1, b1, c1, and d1 and the following core loop:
const int n = 100000;
for (int j = 0; j < n; j++) {
a1[j] += b1[j];
c1[j] += d1[j];
}
This loop is executed 10,000 times by an outer loop. To improve performance, the code was changed to:
for (int j = 0; j < n; j++) {
a1[j] += b1[j];
}
for (int j = 0; j < n; j++) {
c1[j] += d1[j];
}
When compiled with Microsoft Visual C++ 10.0 using full optimization and SSE2 for 32-bit mode on an Intel Core 2 Duo system, the combined loop takes about 5.5 seconds, while the separated loops take about 1.9 seconds.
After further testing, it appears that the result depends heavily on array size n and CPU cache behavior.
What explains this performance difference in terms of cache usage, memory access patterns, compiler optimization, and CPU architecture? How do array size and cache levels create different performance regions?
Short Answer
Splitting the combined loop into two loops can be faster because each loop then streams over only two arrays instead of four. That reduces cache pressure, lets hardware prefetchers track fewer memory streams, and often makes each loop easier for the compiler to vectorize. By the end of this page, you will see how CPU cache, memory bandwidth, instruction-level parallelism, and compiler loop optimization interact, and why performance shifts as array sizes move from fitting in L1 cache to L2 cache to main memory.
Concept
Modern CPUs are much faster at arithmetic than at fetching data from memory. Because of that, simple array code is often limited not by the additions themselves, but by how quickly data can be delivered to the CPU.
In the example, each iteration reads and writes multiple arrays:
a1[j] += b1[j];
c1[j] += d1[j];
That means the CPU is touching four streams of memory at once. Each iteration performs:
- read a1[j]
- read b1[j]
- write a1[j]
- read c1[j]
- read d1[j]
- write c1[j]
Even though the code looks compact, it creates pressure on several hardware resources:
- L1/L2/L3 cache capacity
- cache-line fills and evictions
- load/store bandwidth
- hardware prefetchers
- register pressure and instruction scheduling
When you split the loop:
for (int j = 0; j < n; j++) {
a1[j] += b1[j];
}
for (int j = 0; j < n; j++) {
c1[j] += d1[j];
}
each loop touches only two memory streams instead of four. The working set per loop is smaller, the hardware prefetchers have fewer streams to track, and the compiler can often vectorize each simpler loop more aggressively.
Mental Model
Imagine a chef working at a small kitchen counter.
- The CPU is the chef.
- Registers and L1 cache are the counter space right in front of the chef.
- RAM is the pantry in another room.
- Arrays are ingredient trays.
In the combined loop, the chef tries to work with four trays at once:
- tray a1
- tray b1
- tray c1
- tray d1
The counter gets crowded. The chef keeps moving trays on and off the counter.
In the split-loop version, the chef first works with only a1 and b1, then with c1 and d1.
Less crowding means fewer trips to the pantry and less rearranging.
That is often why separate loops can be faster: not because the arithmetic changed, but because the data flow became easier for the hardware to manage.
Syntax and Examples
Combined loop
for (int j = 0; j < n; j++) {
a[j] += b[j];
c[j] += d[j];
}
This performs two independent element-wise additions in one pass.
Separate loops
for (int j = 0; j < n; j++) {
a[j] += b[j];
}
for (int j = 0; j < n; j++) {
c[j] += d[j];
}
This performs the same total arithmetic, but with a simpler memory access pattern in each loop.
Small example
#include <iostream>
using namespace std;

int main() {
    const int n = 4;
    double a[n] = {1, 2, 3, 4};
    double b[n] = {10, 20, 30, 40};
    double c[n] = {5, 6, 7, 8};
    double d[n] = {100, 200, 300, 400};

    for (int j = 0; j < n; j++) {
        a[j] += b[j];
    }
    for (int j = 0; j < n; j++) {
        c[j] += d[j];
    }

    for (int j = 0; j < n; j++) {
        cout << a[j] << " ";
    }
    cout << "\n";
    for (int j = 0; j < n; j++) {
        cout << c[j] << " ";
    }
    cout << "\n";
}
Step by Step Execution
Consider this traceable example:
const int n = 3;
double a[3] = {1, 1, 1};
double b[3] = {2, 2, 2};
double c[3] = {3, 3, 3};
double d[3] = {4, 4, 4};
for (int j = 0; j < n; j++) {
a[j] += b[j];
c[j] += d[j];
}
Iteration j = 0:
- read a[0] = 1
- read b[0] = 2
- compute 1 + 2 = 3
- write a[0] = 3
- read c[0] = 3
- read d[0] = 4
- compute 3 + 4 = 7
- write c[0] = 7
Iterations j = 1 and j = 2 repeat the same pattern, leaving a = {3, 3, 3} and c = {7, 7, 7}. Notice that every iteration alternates between the a/b pair and the c/d pair, so four memory streams stay active at once; in the split version, each loop streams over just one pair.
Real World Use Cases
This concept appears in many performance-sensitive programs.
Scientific and numerical computing
Code that updates vectors, matrices, and simulation state often spends most of its time in loops like this.
Example:
position[i] += velocity[i];
energy[i] += delta[i];
Whether these are fused into one loop or split into two can affect throughput.
Image and signal processing
Pixel and sample arrays are processed sequentially. Touching too many arrays at once can hurt cache locality.
Data analytics
Column-oriented processing often performs element-wise transforms across large arrays. Engineers must consider memory traffic, not just arithmetic count.
Game engines
Entity updates often loop over large arrays of component data. Structure-of-arrays layouts and cache-friendly loops matter a lot.
High-performance APIs
Libraries such as BLAS, SIMD kernels, and custom numerics code often tune loops around cache size and bandwidth limits.
Real Codebase Usage
In real projects, developers rarely assume that "one loop is always faster" or "two loops are always faster." Instead, they use patterns based on data size and measured performance.
Common patterns
1. Loop fusion
Combine loops when:
- the same data is reused immediately
- the working set still fits in cache
- it reduces overhead and improves locality
Example:
for (int i = 0; i < n; i++) {
x[i] = a[i] + b[i];
y[i] = x[i] * scale;
}
This can be good because x[i] is produced and used immediately.
2. Loop fission (loop splitting)
Split loops when:
- too many arrays are touched at once
- vectorization is blocked
- memory pressure is high
- each loop becomes simpler for the compiler
Example:
for (int i = 0; i < n; i++) {
a[i] += b[i];
}
for (int i = 0; i < n; i++) {
c[i] += d[i];
}
3. Benchmark-guided optimization
Teams often test multiple versions for:
- small input sizes
- medium sizes that fit in L2/L3
- large sizes that go to DRAM
Common Mistakes
1. Assuming fewer loops always means faster code
A common beginner assumption is:
- one loop = faster
- two loops = slower
That is not always true on modern CPUs. Memory behavior often matters more than loop count.
2. Ignoring cache effects
This benchmark changes dramatically with array size. A version that is faster for small n may be slower for large n.
3. Benchmarking without enough care
Bad benchmarking can mislead you.
Problems include:
- not warming up caches
- timing allocations instead of just the loop
- using too small a workload
- letting the compiler optimize away work
- comparing debug builds instead of optimized builds
4. Forgetting that write operations may trigger extra traffic
When writing to a[j] or c[j], the CPU usually needs the cache line in cache first. This can cause read-for-ownership traffic before the write completes.
5. Expecting the same result on every CPU
Different CPUs have different:
- cache sizes
- cache associativity
- prefetchers
- vector units
- memory subsystem behavior
So a benchmark result on one machine may not generalize.
6. Assuming source-code similarity means assembly similarity
Two versions that look almost identical in source can compile to very different machine code. One may be vectorized or unrolled while the other is not, so inspect the generated assembly or measure instead of reasoning from source alone.
Comparisons
Related optimization ideas
| Technique | What it does | Can help when | Can hurt when |
|---|---|---|---|
| Loop fusion | Combines multiple loops into one | Reusing the same data immediately | Too many arrays are active at once |
| Loop fission | Splits one loop into multiple loops | Cache pressure or vectorization problems exist | Extra passes over memory dominate |
| Manual unrolling | Processes several elements per iteration | Reduces loop overhead and exposes parallelism | Compiler already does it better |
| SIMD/vectorization | Processes multiple values per instruction | Data is contiguous and dependencies are simple | Alignment, aliasing, or unsupported patterns interfere |
Combined loop vs separate loops
| Aspect | Combined loop | Separate loops |
|---|---|---|
| Memory streams per loop | Four (a1, b1, c1, d1) | Two |
| Passes over the data | One | Two |
| Cache pressure per loop | Higher | Lower |
| Vectorization | Can be harder for the compiler | Often easier |
| Tends to win when | Working set fits in cache | Arrays are large and memory-bound |
Cheat Sheet
Quick reference
Core idea
Two loops can be faster than one if splitting them reduces cache pressure and improves memory streaming.
Terms
- Cache line: smallest block moved between memory and cache, often 64 bytes
- Working set: data actively needed during a phase of execution
- Memory-bound: limited by data movement, not arithmetic speed
- Loop fusion: combine loops
- Loop fission: split loops
- Prefetcher: hardware that predicts future memory accesses
Example forms
// fused
for (int i = 0; i < n; i++) {
a[i] += b[i];
c[i] += d[i];
}
// split
for (int i = 0; i < n; i++) {
a[i] += b[i];
}
for (int i = 0; i < n; i++) {
c[i] += d[i];
}
When split loops may help
- arrays are large
- multiple streams exceed cache comfort zone
- compiler optimizes simpler loops better
- hardware prefetch works better with fewer streams
When fused loops may help
- data reused immediately
- working set fits in cache
- fewer loop-control instructions matter
FAQ
Why can two loops be faster than one in C++?
Because the split version may reduce cache pressure, simplify memory access patterns, and allow better compiler scheduling or vectorization.
Is the split-loop version always faster?
No. If the data fits well in cache, a fused loop may be just as fast or faster.
Why does array size matter so much?
Because different sizes fit in different cache levels. Once data spills beyond a cache level, latency and bandwidth costs change.
What is the main bottleneck in this example?
Usually memory traffic, not floating-point addition.
Does contiguous allocation of arrays matter?
Yes. Memory layout can affect spatial locality, cache conflicts, and prefetch behavior.
Why do different CPUs show different graphs?
They have different cache sizes, associativity, prefetch strategies, execution engines, and memory bandwidth.
Should I manually split loops in production code?
Only if profiling shows it helps. Prefer clear code first, then optimize based on measurement.
Does SIMD change the answer?
It can. SIMD often improves throughput, but memory access and cache behavior still remain critical.
Mini Project
Description
Build a small C++ benchmark that compares a fused loop and a split-loop version for element-wise array addition across several array sizes. This demonstrates how performance changes as data moves from fitting in cache to exceeding cache capacity.
Goal
Measure and compare fused-loop versus split-loop performance for different array sizes and observe where cache effects change the result.
Requirements
- Create four std::vector<double> arrays of the same size.
- Fill all arrays with initial values so the compiler cannot remove the work.
- Benchmark both the combined loop and the separated-loop version.
- Run the benchmark for several values of n, from very small to large.
- Print timings and a checksum so results stay meaningful.
Keep learning
Related questions
Basic Rules and Idioms for Operator Overloading in C++
Learn the core rules, syntax, and common idioms for operator overloading in C++, including member vs non-member operators.
C++ Casts Explained: C-Style Cast vs static_cast vs dynamic_cast
Learn the difference between C-style casts, static_cast, and dynamic_cast in C++ with clear examples, safety rules, and real usage tips.
C++ Lambda Expressions Explained: What They Are and When to Use Them
Learn what C++ lambda expressions are, why they exist, when to use them, and how they simplify callbacks, algorithms, and local logic.