Why Separate Loops Can Be Faster Than a Combined Loop in C++: Cache, Memory Bandwidth, and Loop Optimization
Question
In C++, why can element-wise additions be significantly faster when performed in two separate loops instead of one combined loop?
Consider heap-allocated arrays a1, b1, c1, and d1 and the following core loop:
const int n = 100000;
for (int j = 0; j < n; j++) {
a1[j] += b1[j];
c1[j] += d1[j];
}
This loop is executed 10,000 times by an outer loop. To improve performance, the code was changed to:
for (int j = 0; j < n; j++) {
a1[j] += b1[j];
}
for (int j = 0; j < n; j++) {
c1[j] += d1[j];
}
When compiled with Microsoft Visual C++ 10.0 using full optimization and SSE2 for 32-bit mode on an Intel Core 2 Duo system, the combined loop takes about 5.5 seconds, while the separated loops take about 1.9 seconds.
After further testing, it appears that the result depends heavily on array size n and CPU cache behavior.
What explains this performance difference in terms of cache usage, memory access patterns, compiler optimization, and CPU architecture? How do array size and cache levels create different performance regions?
Short Answer
Splitting the combined loop into two loops can be faster because each loop then streams over only two arrays instead of four. That reduces cache pressure, lets hardware prefetchers track fewer memory streams, and often makes each loop easier for the compiler to vectorize. By the end of this page, you will see how CPU cache, memory bandwidth, instruction-level parallelism, and compiler loop optimization interact, and why performance shifts as array sizes move from fitting in L1 cache to L2 cache to main memory.
Concept
Modern CPUs are much faster at arithmetic than at fetching data from memory. Because of that, simple array code is often limited not by the additions themselves, but by how quickly data can be delivered to the CPU.
In the example, each iteration reads and writes multiple arrays:
a1[j] += b1[j];
c1[j] += d1[j];
That means the CPU is touching four streams of memory at once. Each iteration performs:
- read a1[j]
- read b1[j]
- write a1[j]
- read c1[j]
- read d1[j]
- write c1[j]
Even though the code looks compact, it creates pressure on several hardware resources:
- L1/L2/L3 cache capacity
- cache-line fills and evictions
- load/store bandwidth
- hardware prefetchers
- register pressure and instruction scheduling
When you split the loop:
for (int j = 0; j < n; j++) {
a1[j] += b1[j];
}
for (int j = 0; j < n; j++) {
c1[j] += d1[j];
}
each loop touches only two memory streams instead of four. The working set per loop is smaller, the hardware prefetchers have fewer streams to track, and the compiler can often vectorize each simpler loop more aggressively.
Mental Model
Imagine a chef working at a small kitchen counter.
- The CPU is the chef.
- Registers and L1 cache are the counter space right in front of the chef.
- RAM is the pantry in another room.
- Arrays are ingredient trays.
In the combined loop, the chef tries to work with four trays at once:
- tray a1
- tray b1
- tray c1
- tray d1
The counter gets crowded. The chef keeps moving trays on and off the counter.
In the split-loop version, the chef first works with only a1 and b1, then with c1 and d1.
Less crowding means fewer trips to the pantry and less rearranging.
That is often why separate loops can be faster: not because the arithmetic changed, but because the data flow became easier for the hardware to manage.
Syntax and Examples
Combined loop
for (int j = 0; j < n; j++) {
a[j] += b[j];
c[j] += d[j];
}
This performs two independent element-wise additions in one pass.
Separate loops
for (int j = 0; j < n; j++) {
a[j] += b[j];
}
for (int j = 0; j < n; j++) {
c[j] += d[j];
}
This performs the same total arithmetic, but with a simpler memory access pattern in each loop.
Small example
#include <iostream>
using namespace std;

int main() {
    const int n = 4;
    double a[n] = {1, 2, 3, 4};
    double b[n] = {10, 20, 30, 40};
    double c[n] = {5, 6, 7, 8};
    double d[n] = {100, 200, 300, 400};

    for (int j = 0; j < n; j++) {
        a[j] += b[j];
    }
    for (int j = 0; j < n; j++) {
        c[j] += d[j];
    }

    for (int j = 0; j < n; j++) {
        cout << a[j] << " ";
    }
    cout << "\n";
    for (int j = 0; j < n; j++) {
        cout << c[j] << " ";
    }
    cout << "\n";
}
Step by Step Execution
Consider this traceable example:
const int n = 3;
double a[3] = {1, 1, 1};
double b[3] = {2, 2, 2};
double c[3] = {3, 3, 3};
double d[3] = {4, 4, 4};
for (int j = 0; j < n; j++) {
a[j] += b[j];
c[j] += d[j];
}
Iteration j = 0:
- read a[0] = 1
- read b[0] = 2
- compute 1 + 2 = 3
- write a[0] = 3
- read c[0] = 3
- read d[0] = 4
- compute 3 + 4 = 7
- write c[0] = 7
Iterations j = 1 and j = 2 repeat the same pattern, leaving a = {3, 3, 3} and c = {7, 7, 7}. Notice that every iteration alternates between the a/b pair and the c/d pair, so four memory streams stay active at once; in the split version, each loop streams over just one pair.
Real World Use Cases
This concept appears in many performance-sensitive programs.
Scientific and numerical computing
Code that updates vectors, matrices, and simulation state often spends most of its time in loops like this.
Example:
position[i] += velocity[i];
energy[i] += delta[i];
Whether these are fused into one loop or split into two can affect throughput.
Image and signal processing
Pixel and sample arrays are processed sequentially. Touching too many arrays at once can hurt cache locality.
Data analytics
Column-oriented processing often performs element-wise transforms across large arrays. Engineers must consider memory traffic, not just arithmetic count.
Game engines
Entity updates often loop over large arrays of component data. Structure-of-arrays layouts and cache-friendly loops matter a lot.
High-performance APIs
Libraries such as BLAS, SIMD kernels, and custom numerics code often tune loops around cache size and bandwidth limits.
Real Codebase Usage
In real projects, developers rarely assume that "one loop is always faster" or "two loops are always faster." Instead, they use patterns based on data size and measured performance.
Common patterns
1. Loop fusion
Combine loops when:
- the same data is reused immediately
- the working set still fits in cache
- it reduces overhead and improves locality
Example:
for (int i = 0; i < n; i++) {
x[i] = a[i] + b[i];
y[i] = x[i] * scale;
}
This can be good because x[i] is produced and used immediately.
2. Loop fission (loop splitting)
Split loops when:
- too many arrays are touched at once
- vectorization is blocked
- memory pressure is high
- each loop becomes simpler for the compiler
Example:
for (int i = 0; i < n; i++) {
a[i] += b[i];
}
for (int i = 0; i < n; i++) {
c[i] += d[i];
}
3. Benchmark-guided optimization
Teams often test multiple versions for:
- small input sizes
- medium sizes that fit in L2/L3
- large sizes that go to DRAM
Common Mistakes
1. Assuming fewer loops always means faster code
A common beginner assumption is:
- one loop = faster
- two loops = slower
That is not always true on modern CPUs. Memory behavior often matters more than loop count.
2. Ignoring cache effects
This benchmark changes dramatically with array size. A version that is faster for small n may be slower for large n.
3. Benchmarking without enough care
Bad benchmarking can mislead you.
Problems include:
- not warming up caches
- timing allocations instead of just the loop
- using too small a workload
- letting the compiler optimize away work
- comparing debug builds instead of optimized builds
4. Forgetting that write operations may trigger extra traffic
When writing to a[j] or c[j], the CPU usually needs the cache line in cache first. This can cause read-for-ownership traffic before the write completes.
5. Expecting the same result on every CPU
Different CPUs have different:
- cache sizes
- cache associativity
- prefetchers
- vector units
- memory subsystem behavior
So a benchmark result on one machine may not generalize.
6. Assuming source-code similarity means assembly similarity
Two versions that look almost identical in source can compile to very different machine code. One may be vectorized or unrolled while the other is not, so inspect the generated assembly or measure instead of reasoning from source alone.
Comparisons
Related optimization ideas
| Technique | What it does | Can help when | Can hurt when |
|---|---|---|---|
| Loop fusion | Combines multiple loops into one | Reusing the same data immediately | Too many arrays are active at once |
| Loop fission | Splits one loop into multiple loops | Cache pressure or vectorization problems exist | Extra passes over memory dominate |
| Manual unrolling | Processes several elements per iteration | Reduces loop overhead and exposes parallelism | Compiler already does it better |
| SIMD/vectorization | Processes multiple values per instruction | Data is contiguous and dependencies are simple | Alignment, aliasing, or unsupported patterns interfere |
Combined loop vs separate loops
| Aspect | Combined loop | Separate loops |
|---|---|---|
| Memory streams per loop | Four (a1, b1, c1, d1) | Two |
| Passes over the data | One | Two |
| Cache pressure per loop | Higher | Lower |
| Vectorization | Can be harder for the compiler | Often easier |
| Tends to win when | Working set fits in cache | Arrays are large and memory-bound |
Cheat Sheet
Quick reference
Core idea
Two loops can be faster than one if splitting them reduces cache pressure and improves memory streaming.
Terms
- Cache line: smallest block moved between memory and cache, often 64 bytes
- Working set: data actively needed during a phase of execution
- Memory-bound: limited by data movement, not arithmetic speed
- Loop fusion: combine loops
- Loop fission: split loops
- Prefetcher: hardware that predicts future memory accesses
Example forms
// fused
for (int i = 0; i < n; i++) {
a[i] += b[i];
c[i] += d[i];
}
// split
for (int i = 0; i < n; i++) {
a[i] += b[i];
}
for (int i = 0; i < n; i++) {
c[i] += d[i];
}
When split loops may help
- arrays are large
- multiple streams exceed cache comfort zone
- compiler optimizes simpler loops better
- hardware prefetch works better with fewer streams
When fused loops may help
- data reused immediately
- working set fits in cache
- fewer loop-control instructions matter
FAQ
Why can two loops be faster than one in C++?
Because the split version may reduce cache pressure, simplify memory access patterns, and allow better compiler scheduling or vectorization.
Is the split-loop version always faster?
No. If the data fits well in cache, a fused loop may be just as fast or faster.
Why does array size matter so much?
Because different sizes fit in different cache levels. Once data spills beyond a cache level, latency and bandwidth costs change.
What is the main bottleneck in this example?
Usually memory traffic, not floating-point addition.
Does contiguous allocation of arrays matter?
Yes. Memory layout can affect spatial locality, cache conflicts, and prefetch behavior.
Why do different CPUs show different graphs?
They have different cache sizes, associativity, prefetch strategies, execution engines, and memory bandwidth.
Should I manually split loops in production code?
Only if profiling shows it helps. Prefer clear code first, then optimize based on measurement.
Does SIMD change the answer?
It can. SIMD often improves throughput, but memory access and cache behavior still remain critical.
Mini Project
Description
Build a small C++ benchmark that compares a fused loop and a split-loop version for element-wise array addition across several array sizes. This demonstrates how performance changes as data moves from fitting in cache to exceeding cache capacity.
Goal
Measure and compare fused-loop versus split-loop performance for different array sizes and observe where cache effects change the result.
Requirements
- Create four std::vector<double> arrays of the same size.
- Fill all arrays with initial values so the compiler cannot remove the work.
- Benchmark both the combined loop and the separated-loop version.
- Run the benchmark for several values of n, from very small to large.
- Print timings and a checksum so results stay meaningful.
Keep learning
Related questions
Basic Rules and Idioms for Operator Overloading in C++
Learn the core rules, syntax, and common idioms for operator overloading in C++, including member vs non-member operators.
C++ Casts Explained: C-Style Cast vs static_cast vs dynamic_cast
Learn the difference between C-style casts, static_cast, and dynamic_cast in C++ with clear examples, safety rules, and real usage tips.
C++ Lambda Expressions Explained: What They Are and When to Use Them
Learn what C++ lambda expressions are, why they exist, when to use them, and how they simplify callbacks, algorithms, and local logic.