Suppose a1, b1, c1, and d1 point to heap memory, and my numerical code has the following core loop.
const int n = 100000;

for (int j = 0; j < n; j++) {
    a1[j] += b1[j];
    c1[j] += d1[j];
}
This loop is executed 10,000 times via another outer for loop. To speed it up, I changed the code to:
for (int j = 0; j < n; j++) {
    a1[j] += b1[j];
}

for (int j = 0; j < n; j++) {
    c1[j] += d1[j];
}
Compiled on Microsoft Visual C++ 10.0 with full optimization and SSE2 enabled for 32-bit on an Intel Core 2 Duo (x64), the first example takes 5.5 seconds and the double-loop example takes only 1.9 seconds.

The disassembly for the first loop basically looks like this (this block is repeated about five times in the full program):
movsd xmm0,mmword ptr [edx+18h]
addsd xmm0,mmword ptr [ecx+20h]
movsd mmword ptr [ecx+20h],xmm0
movsd xmm0,mmword ptr [esi+10h]
addsd xmm0,mmword ptr [eax+30h]
movsd mmword ptr [eax+30h],xmm0
movsd xmm0,mmword ptr [edx+20h]
addsd xmm0,mmword ptr [ecx+28h]
movsd mmword ptr [ecx+28h],xmm0
movsd xmm0,mmword ptr [esi+18h]
addsd xmm0,mmword ptr [eax+38h]
Each loop of the double-loop example produces this code (the following block is repeated about three times):
addsd xmm0,mmword ptr [eax+28h]
movsd mmword ptr [eax+28h],xmm0
movsd xmm0,mmword ptr [ecx+20h]
addsd xmm0,mmword ptr [eax+30h]
movsd mmword ptr [eax+30h],xmm0
movsd xmm0,mmword ptr [ecx+28h]
addsd xmm0,mmword ptr [eax+38h]
movsd mmword ptr [eax+38h],xmm0
movsd xmm0,mmword ptr [ecx+30h]
addsd xmm0,mmword ptr [eax+40h]
movsd mmword ptr [eax+40h],xmm0
The question turned out to be of no relevance, as the behavior heavily depends on the sizes of the arrays (n) and the CPU cache. So if there is further interest, I rephrase the question:

Could you provide some solid insight into the details that lead to the different cache behaviors, as illustrated by the five regions on the following graph?

It might also be interesting to point out the differences between CPU/cache architectures, by providing a similar graph for those CPUs.
Here is the full code. It uses TBB Tick_Count for higher-resolution timing, which can be disabled by not defining the TBB_TIMING macro:
#include <iostream>
#include <iomanip>
#include <cmath>
#include <string>
#include <cstdio>
#include <algorithm>

//#define TBB_TIMING

#ifdef TBB_TIMING
#include <tbb/tick_count.h>
using tbb::tick_count;
#else
#include <time.h>
#endif

using namespace std;

//#define preallocate_memory new_cont

enum { new_cont, new_sep };

double *a1, *b1, *c1, *d1;

void allo(int cont, int n)
{
    switch (cont) {
      case new_cont:
        a1 = new double[n*4];
        b1 = a1 + n;
        c1 = b1 + n;
        d1 = c1 + n;
        break;
      case new_sep:
        a1 = new double[n];
        b1 = new double[n];
        c1 = new double[n];
        d1 = new double[n];
        break;
    }

    for (int i = 0; i < n; i++) {
        a1[i] = 1.0;
        d1[i] = 1.0;
        c1[i] = 1.0;
        b1[i] = 1.0;
    }
}

void ff(int cont)
{
    switch (cont) {
      case new_sep:
        delete[] b1; delete[] c1; delete[] d1;
        // fall through
      case new_cont:
        delete[] a1;
    }
}

double plain(int n, int m, int cont, int loops)
{
#ifndef preallocate_memory
    allo(cont, n);
#endif

#ifdef TBB_TIMING
    tick_count t0 = tick_count::now();
#else
    clock_t start = clock();
#endif

    if (loops == 1) {
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                a1[j] += b1[j];
                c1[j] += d1[j];
            }
        }
    } else {
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                a1[j] += b1[j];
            }
            for (int j = 0; j < n; j++) {
                c1[j] += d1[j];
            }
        }
    }

    double ret;

#ifdef TBB_TIMING
    tick_count t1 = tick_count::now();
    ret = 2.0 * double(n) * double(m) / (t1 - t0).seconds();
#else
    clock_t end = clock();
    ret = 2.0 * double(n) * double(m) / (double)(end - start) * double(CLOCKS_PER_SEC);
#endif

#ifndef preallocate_memory
    ff(cont);
#endif

    return ret;
}

int main()
{
    freopen("C:\\test.csv", "w", stdout);

    const char *s = " ";

    string na[2] = { "new_cont", "new_sep" };

    cout << "n";
    for (int j = 0; j < 2; j++)
        for (int i = 1; i <= 2; i++)
#ifdef preallocate_memory
            cout << s << i << "_loops_" << na[preallocate_memory];
#else
            cout << s << i << "_loops_" << na[j];
#endif
    cout << endl;

    long long nmax = 1000000;

#ifdef preallocate_memory
    allo(preallocate_memory, nmax);
#endif

    for (long long n = 1L; n < nmax; n = max(n+1, (long long)(n*1.2)))
    {
        const long long m = 10000000 / n;
        cout << n;
        for (int j = 0; j < 2; j++)
            for (int i = 1; i <= 2; i++)
                cout << s << plain(n, m, j, i);
        cout << endl;
    }
}
It shows FLOPS for different values of n.
Upon further analysis of this, I believe this is (at least partially) caused by the data alignment of the four pointers. This will cause some level of cache bank/way conflicts.

If I've guessed correctly about how you are allocating your arrays, they are likely to be aligned to the page line.

This means that all your accesses in each loop will fall on the same cache way. However, Intel processors have had 8-way L1 cache associativity for a while. But in reality, the performance isn't completely uniform: accessing 4 ways is still slower than, say, 2 ways.
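To make this concrete, here is a minimal sketch that maps an address to its L1 set index, assuming typical Core 2 parameters (32 KB, 8-way associative, 64-byte lines, hence 64 sets); the two base addresses are hypothetical and stand in for separate large allocations that share a page offset. Since 64 sets times 64-byte lines span exactly 4 KB, the set index is determined by the offset within a page alone:

#include <cstdint>
#include <cstdio>

int main() {
    // Assumed cache geometry: 32 KB, 8-way, 64-byte lines -> 64 sets.
    const uintptr_t lineSize  = 64;
    const uintptr_t ways      = 8;
    const uintptr_t cacheSize = 32 * 1024;
    const uintptr_t numSets   = cacheSize / (ways * lineSize);  // 64

    // Two hypothetical base addresses with the same 4 KiB page offset,
    // like a1 and b1 from separate large allocations.
    const uintptr_t addrs[] = { 0x00600020, 0x006D0020 };

    for (int i = 0; i < 2; i++) {
        // Set index = line number modulo the set count.
        uintptr_t set = (addrs[i] / lineSize) % numSets;
        printf("%08llx -> set %llu\n",
               (unsigned long long)addrs[i], (unsigned long long)set);
    }
    return 0;  // both addresses print the same set index
}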
EDIT: It does in fact look like you are allocating all the arrays separately. Usually when such large allocations are requested, the allocator will request fresh pages from the OS. Therefore, there is a high chance that large allocations will appear at the same offset from a page boundary.
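A quick way to check this for a given allocator is to print each allocation's offset within a page (a minimal sketch, assuming 4 KiB pages); if the allocator hands back fresh OS pages, the offsets tend to come out identical:

#include <cstdint>
#include <cstdio>
#include <cstdlib>

int main() {
    // Four large allocations, like the separate a1/b1/c1/d1 case.
    for (int i = 0; i < 4; i++) {
        void *p = malloc(100000 * sizeof(double));
        printf("%p  offset within page: %llu\n",
               p, (unsigned long long)((uintptr_t)p % 4096));
    }
    return 0;
}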
Here's the test code:
#include <iostream>
#include <cstdlib>
#include <cstring>
#include <ctime>

using namespace std;

int main()
{
    const int n = 100000;

#ifdef ALLOCATE_SEPERATE
    double *a1 = (double*)malloc(n * sizeof(double));
    double *b1 = (double*)malloc(n * sizeof(double));
    double *c1 = (double*)malloc(n * sizeof(double));
    double *d1 = (double*)malloc(n * sizeof(double));
#else
    double *a1 = (double*)malloc(n * sizeof(double) * 4);
    double *b1 = a1 + n;
    double *c1 = b1 + n;
    double *d1 = c1 + n;
#endif

    //  Zero the data to prevent any chance of denormals.
    memset(a1, 0, n * sizeof(double));
    memset(b1, 0, n * sizeof(double));
    memset(c1, 0, n * sizeof(double));
    memset(d1, 0, n * sizeof(double));

    //  Print the addresses.
    cout << a1 << endl;
    cout << b1 << endl;
    cout << c1 << endl;
    cout << d1 << endl;

    clock_t start = clock();

    int c = 0;
    while (c++ < 10000) {
#ifdef ONE_LOOP
        for (int j = 0; j < n; j++) {
            a1[j] += b1[j];
            c1[j] += d1[j];
        }
#else
        for (int j = 0; j < n; j++) {
            a1[j] += b1[j];
        }
        for (int j = 0; j < n; j++) {
            c1[j] += d1[j];
        }
#endif
    }

    clock_t end = clock();
    cout << "seconds = " << (double)(end - start) / CLOCKS_PER_SEC << endl;

    system("pause");
    return 0;
}
Benchmark Results:

EDIT: Results on an actual Core 2 architecture machine:

2 x Intel Xeon X5482 Harpertown @ 3.2 GHz:
#define ALLOCATE_SEPERATE
#define ONE_LOOP
00600020
006D0020
007A0020
00870020
seconds = 6.206

#define ALLOCATE_SEPERATE
//#define ONE_LOOP
005E0020
006B0020
00780020
00850020
seconds = 2.116

//#define ALLOCATE_SEPERATE
#define ONE_LOOP
00570020
00633520
006F6A20
007B9F20
seconds = 1.894

//#define ALLOCATE_SEPERATE
//#define ONE_LOOP
008C0020
00983520
00A46A20
00B09F20
seconds = 1.993
Observations:
6.206 seconds with one loop and 2.116 seconds with two loops. This reproduces the OP's results exactly.

In the first two tests, the arrays are allocated separately. You'll notice that they all have the same alignment relative to the page.

In the second two tests, the arrays are packed together to break that alignment. (Each array starts 100000 * 8 = 800,000 bytes after the previous one, and 800,000 mod 4096 = 1280, so consecutive arrays land at different page offsets.) Here you'll notice both loops are faster. Furthermore, the second (double) loop is now the slower one, as you would normally expect.
As @Stephen Canon points out in the comments, there is a very likely possibility that this alignment causes false aliasing in the load/store units or the cache. I Googled around for this and found that Intel actually has a hardware counter for partial address aliasing stalls.
5 Regions - Explanations
Region 1:

This one is easy. The dataset is so small that the performance is dominated by overhead like looping and branching.
Region 2:

Here, as the data sizes increase, the amount of relative overhead goes down and the performance "saturates". Here, two loops is slower because it has twice as much loop and branching overhead.

I'm not sure exactly what's going on here... Alignment could still play an effect, as Agner Fog mentions cache bank conflicts. (That link is about Sandy Bridge, but the idea should still apply to Core 2.)
Region 3:

At this point, the data no longer fits in the L1 cache. So performance is capped by the L1 <-> L2 cache bandwidth.
Region 4:

The performance drop in the single loop is what we are observing. And as mentioned, this is due to the alignment which (most likely) causes false aliasing stalls in the processor load/store units.

However, in order for false aliasing to occur, there must be a large enough stride between the datasets. This is why you don't see this in region 3.
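One way to test this explanation is to perturb the relative page offsets of the separately allocated arrays. The following is a minimal sketch of such an experiment, with arbitrary 64-byte offsets chosen for illustration: over-allocating each array and shifting its base pointer means no two base pointers share the same low 12 address bits, which should remove the condition for partial address aliasing even if the allocator returns blocks at identical page offsets:

#include <cstdlib>

// Allocate n doubles with extra slack, then shift the base pointer by
// 'offset' bytes. The raw pointer is leaked; fine for a throwaway test.
static double *alloc_shifted(size_t n, size_t offset) {
    char *raw = (char*)malloc(n * sizeof(double) + 4096);
    return (double*)(raw + offset);
}

int main() {
    const size_t n = 100000;
    // Distinct offsets: even if the allocator returns blocks at the
    // same page offset, the shifted pointers differ in bits 0-11.
    double *a1 = alloc_shifted(n, 0 * 64);
    double *b1 = alloc_shifted(n, 1 * 64);
    double *c1 = alloc_shifted(n, 2 * 64);
    double *d1 = alloc_shifted(n, 3 * 64);
    // ... run the same timing loops as in the test code above ...
    (void)a1; (void)b1; (void)c1; (void)d1;
    return 0;
}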
Region 5:

At this point, nothing fits in the cache. So you're bound by memory bandwidth.
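To put rough numbers on where these region boundaries should sit, here is a back-of-the-envelope sketch; the cache sizes (32 KB L1d per core, 6 MB L2 per die) are assumptions about the Core 2 machine above, not measurements:

#include <cstdio>

int main() {
    const double l1Bytes  = 32.0 * 1024;         // assumed L1d size
    const double l2Bytes  = 6.0 * 1024 * 1024;   // assumed L2 size
    const double perIndex = 4 * sizeof(double);  // a1,b1,c1,d1 per index

    // n at which the working set spills out of each cache level.
    printf("L1 -> L2 boundary:  n ~ %.0f\n", l1Bytes / perIndex);  // ~1024
    printf("L2 -> RAM boundary: n ~ %.0f\n", l2Bytes / perIndex);  // ~196608

    // The OP's n = 100000 gives a 3.2 MB working set: past L1 but
    // still inside L2, i.e. squarely in region 4.
    printf("n = 100000 -> %.1f MB\n", 100000 * perIndex / 1e6);
    return 0;
}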
OK, the right answer definitely has to do something with the CPU cache. But using the cache argument can be quite difficult, especially without data.

There are many answers that led to a lot of discussion, but let's face it: cache issues can be very complex and are not one-dimensional. They depend heavily on the size of the data, so my question was unfair: it turned out to be at a very interesting point in the cache graph.

@Mysticial's answer convinced a lot of people (including me), probably because it was the only one that seemed to rely on facts, but it was only one "data point" of the truth.

That's why I combined his test (using a continuous vs. a separate allocation) and the advice in @James' answer.

The graphs below show that most of the answers, and especially the majority of comments to the question and answers, can be considered completely wrong or true depending on the exact scenario and parameters used.
Note that my initial question was at n = 100,000. This point (by accident) shows special behavior:

It possesses the greatest discrepancy between the one- and two-loop versions (almost a factor of three).

It is the only point where one loop (namely with continuous allocation) beats the two-loop version. (This made Mysticial's answer possible at all.)
The results using initialized data:

The results using uninitialized data (this is what Mysticial tested):

And this is a hard-to-explain one: initialized data that is allocated once and reused for every following test case with a different vector size:
Proposal

Every low-level performance-related question on Stack Overflow should be required to provide MFLOPS information for the whole range of cache-relevant data sizes! It's a waste of everybody's time to think up answers and especially to discuss them with others without this information.