Optimizing linear access to arrays with prefetching and caching in C


You should compile with a recent GCC (so having built GCC 5.2 yourself is a good idea, as of November 2015), and you should enable optimizations for your particular platform, so I suggest compiling with gcc -Wall -O2 -march=native (also try replacing -O2 with -O3).

(Don't benchmark your programs without enabling optimizations in your compiler.)

If you are concerned about cache effects, you could play with __builtin_prefetch, but see this.
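To make the prefetch suggestion concrete, here is a minimal sketch of a read loop that issues a software prefetch a fixed distance ahead. The function name `sum_with_prefetch` and the constant `PREFETCH_DISTANCE` are made up for illustration, and the distance of 64 elements is only a guess. Whether this helps at all on a given CPU has to be measured, since hardware prefetchers already handle linear access well.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuning knob: how many elements ahead to prefetch.
   The best value (if any) depends on the CPU and must be measured. */
#define PREFETCH_DISTANCE 64

/* Sum n elements while hinting the cache about data needed soon. */
uint64_t sum_with_prefetch(const uint64_t *array, size_t n)
{
    assert(array != NULL || n == 0);
    uint64_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
#if defined(__GNUC__)
        /* Only prefetch addresses inside the array. Arguments:
           address, 0 = read access, 0 = low temporal locality. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&array[i + PREFETCH_DISTANCE], 0, 0);
#endif
        sum += array[i];
    }
    return sum;
}
```

Returning the sum also gives the loads a visible result, so an optimizing compiler cannot discard the loop.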

Also read about OpenMP, OpenCL, and OpenACC.

Disclosure: I tried asking a similar question on programmers.stack, but that place is nowhere near as active as this site.

Introduction

I tend to work with lots of large images. They also come in sequences of more than one and have to be processed and played back repeatedly. Sometimes I use the GPU, sometimes the CPU, sometimes both. Most access patterns are linear in nature (back and forth), which got me thinking about more basic things regarding arrays, and how one should approach writing code optimized for the maximum bandwidth possible on given hardware (assuming computation isn't blocking the reads/writes).

Test specs

I ran this on a 2011 MacbookAir4,2 (I5-2557M) with 4 GB of RAM and an SSD. Nothing else was running during the tests except iterm2.
gcc 5.2.0 (homebrew) with flags: -pedantic -std=c99 -Wall -Werror -Wextra -Wno-unused -O0, plus additional include, library, and framework flags in order to use the glfw timer, which I tend to use. I could have done it without that, it doesn't matter. All 64-bit, of course.
I tried the tests with the optional -fprefetch-loop-arrays flag, but it didn't seem to influence the results at all.
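For readers who want to reproduce the timing without GLFW, a small sketch of a timer and throughput helper follows. It assumes a POSIX system that provides `clock_gettime` with `CLOCK_MONOTONIC`; the names `now_seconds` and `megabytes_per_second` are invented here and are not part of the test program.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include <assert.h>
#include <time.h>

/* Seconds from a monotonic clock, a stand-in for glfwGetTime(). */
double now_seconds(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc; /* silence unused warning when NDEBUG is defined */
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Throughput in MB/s, with 1 MB = 1e6 bytes as in the test program. */
double megabytes_per_second(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return (bytes / 1e6) / seconds;
}
```

A measurement would then read `double t0 = now_seconds(); /* work */ double mbs = megabytes_per_second(bytes, now_seconds() - t0);`.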

Test

Allocate two arrays of n bytes on the heap, where n is 8, 16, 32, 64, 128, 256, 512 and 1024 MB
Initialize the array to 0xff, a byte at a time

Test 1 - linear copy

Linear copy:

    for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
        array_copy[i] = array[i];
    }

Test 2 - copying with stride. This is where it gets confusing. I tried playing the prefetch game here. I tried various combinations of how much to do per loop iteration, and it seems that ~40 per iteration yields the best performance. Why? I have no idea. I understand that malloc in c99 with uint64_t would give me an aligned block of memory. I also see the sizes of my L1 to L3 caches, which are larger than these 320 bytes (40 elements of 8 bytes each), so what am I hitting? Clues may be in the graphs further down. I'd really like to understand this.

Strided copy:

    for(uint64_t i = 0; i < ARRAY_NUM; i=i+40) {
        array_copy[i] = array[i]; array_copy[i+1] = array[i+1]; array_copy[i+2] = array[i+2]; array_copy[i+3] = array[i+3];
        array_copy[i+4] = array[i+4]; array_copy[i+5] = array[i+5]; array_copy[i+6] = array[i+6]; array_copy[i+7] = array[i+7];
        array_copy[i+8] = array[i+8]; array_copy[i+9] = array[i+9]; array_copy[i+10] = array[i+10]; array_copy[i+11] = array[i+11];
        array_copy[i+12] = array[i+12]; array_copy[i+13] = array[i+13]; array_copy[i+14] = array[i+14]; array_copy[i+15] = array[i+15];
        array_copy[i+16] = array[i+16]; array_copy[i+17] = array[i+17]; array_copy[i+18] = array[i+18]; array_copy[i+19] = array[i+19];
        array_copy[i+20] = array[i+20]; array_copy[i+21] = array[i+21]; array_copy[i+22] = array[i+22]; array_copy[i+23] = array[i+23];
        array_copy[i+24] = array[i+24]; array_copy[i+25] = array[i+25]; array_copy[i+26] = array[i+26]; array_copy[i+27] = array[i+27];
        array_copy[i+28] = array[i+28]; array_copy[i+29] = array[i+29]; array_copy[i+30] = array[i+30]; array_copy[i+31] = array[i+31];
        array_copy[i+32] = array[i+32]; array_copy[i+33] = array[i+33]; array_copy[i+34] = array[i+34]; array_copy[i+35] = array[i+35];
        array_copy[i+36] = array[i+36]; array_copy[i+37] = array[i+37]; array_copy[i+38] = array[i+38]; array_copy[i+39] = array[i+39];
    }

Test 3 - reading with stride. Same as with copying with stride.

Strided read:

    const int imax = 1000;
    for(int j = 0; j < imax; ++j) {
        uint64_t tmp = 0;
        performance = 0;
        time_start = glfwGetTime();
        for(uint64_t i = 0; i < ARRAY_NUM; i=i+40) {
            tmp = array[i]; tmp = array[i+1]; tmp = array[i+2]; tmp = array[i+3]; tmp = array[i+4]; tmp = array[i+5]; tmp = array[i+6]; tmp = array[i+7];
            tmp = array[i+8]; tmp = array[i+9]; tmp = array[i+10]; tmp = array[i+11]; tmp = array[i+12]; tmp = array[i+13]; tmp = array[i+14]; tmp = array[i+15];
            tmp = array[i+16]; tmp = array[i+17]; tmp = array[i+18]; tmp = array[i+19]; tmp = array[i+20]; tmp = array[i+21]; tmp = array[i+22]; tmp = array[i+23];
            tmp = array[i+24]; tmp = array[i+25]; tmp = array[i+26]; tmp = array[i+27]; tmp = array[i+28]; tmp = array[i+29]; tmp = array[i+30]; tmp = array[i+31];
            tmp = array[i+32]; tmp = array[i+33]; tmp = array[i+34]; tmp = array[i+35]; tmp = array[i+36]; tmp = array[i+37]; tmp = array[i+38]; tmp = array[i+39];
        }
        /* timing and statistics follow here, see the full source */
    }

Test 4 - linear reading. Byte by byte. I was surprised that -fprefetch-loop-arrays yielded no results here. I thought it was meant for exactly these cases.

Linear read:

    for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
        tmp = array[i];
    }
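One caveat about the read loops above: they only assign to tmp, so once optimizations are enabled the compiler is free to remove them entirely, and the timing would then measure nothing. A minimal sketch of one way around this is to fold every element into a running total and store the result in a volatile variable. The names `benchmark_sink`, `read_sum`, and `read_pass` are hypothetical and not taken from the test program.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical global sink: a volatile store cannot be dropped by the
   compiler, so the value written to it must really be computed. */
volatile uint64_t benchmark_sink;

/* Linear read that adds every element into a running total. */
uint64_t read_sum(const uint64_t *array, size_t n)
{
    assert(array != NULL || n == 0);
    uint64_t total = 0;
    for (size_t i = 0; i < n; ++i)
        total += array[i];
    return total;
}

/* One benchmark pass: keep the result alive through the sink. */
void read_pass(const uint64_t *array, size_t n)
{
    benchmark_sink = read_sum(array, n);
}
```

The addition costs very little compared with a memory access, so the loop should still be dominated by read bandwidth, but that is an assumption worth checking on the target machine.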

Test 5 - memcpy as a contrast.

memcpy:

    memcpy(array_copy, array, ARRAY_NUM*sizeof(uint64_t));

Results

Sample output:

    Init done in 0.767 s - size of array: 1024 MBs (x2)
    Performance: 1304.325 MB/s
    Copying (linear) done in 0.898 s
    Performance: 1113.529 MB/s
    Copying (stride 40) done in 0.257 s
    Performance: 3890.608 MB/s
    [1000/1000] Performance stride 40: 7474.322 MB/s
    Average: 7523.427 MB/s
    Performance MIN: 3231 MB/s | Performance MAX: 7818 MB/s
    [1000/1000] Performance dumb: 2504.713 MB/s
    Average: 2481.502 MB/s
    Performance MIN: 1572 MB/s | Performance MAX: 2644 MB/s
    Copying (memcpy) done in 1.726 s
    Performance: 579.485 MB/s
    --
    Init done in 0.415 s - size of array: 512 MBs (x2)
    Performance: 1233.136 MB/s
    Copying (linear) done in 0.442 s
    Performance: 1157.147 MB/s
    Copying (stride 40) done in 0.116 s
    Performance: 4399.606 MB/s
    [1000/1000] Performance stride 40: 6527.004 MB/s
    Average: 7166.458 MB/s
    Performance MIN: 4359 MB/s | Performance MAX: 7787 MB/s
    [1000/1000] Performance dumb: 2383.292 MB/s
    Average: 2409.005 MB/s
    Performance MIN: 1673 MB/s | Performance MAX: 2641 MB/s
    Copying (memcpy) done in 0.102 s
    Performance: 5026.476 MB/s
    --
    Init done in 0.228 s - size of array: 256 MBs (x2)
    Performance: 1124.618 MB/s
    Copying (linear) done in 0.242 s
    Performance: 1057.916 MB/s
    Copying (stride 40) done in 0.070 s
    Performance: 3650.996 MB/s
    [1000/1000] Performance stride 40: 7129.206 MB/s
    Average: 7370.537 MB/s
    Performance MIN: 4805 MB/s | Performance MAX: 7848 MB/s
    [1000/1000] Performance dumb: 2456.129 MB/s
    Average: 2435.556 MB/s
    Performance MIN: 1496 MB/s | Performance MAX: 2637 MB/s
    Copying (memcpy) done in 0.050 s
    Performance: 5095.845 MB/s
    --
    Init done in 0.100 s - size of array: 128 MBs (x2)
    Performance: 1277.200 MB/s
    Copying (linear) done in 0.112 s
    Performance: 1147.030 MB/s
    Copying (stride 40) done in 0.029 s
    Performance: 4424.513 MB/s
    [1000/1000] Performance stride 40: 6497.635 MB/s
    Average: 6714.540 MB/s
    Performance MIN: 4206 MB/s | Performance MAX: 7843 MB/s
    [1000/1000] Performance dumb: 2275.336 MB/s
    Average: 2335.544 MB/s
    Performance MIN: 1572 MB/s | Performance MAX: 2626 MB/s
    Copying (memcpy) done in 0.025 s
    Performance: 5086.502 MB/s
    --
    Init done in 0.051 s - size of array: 64 MBs (x2)
    Performance: 1255.969 MB/s
    Copying (linear) done in 0.058 s
    Performance: 1104.282 MB/s
    Copying (stride 40) done in 0.015 s
    Performance: 4305.765 MB/s
    [1000/1000] Performance stride 40: 7750.063 MB/s
    Average: 7412.167 MB/s
    Performance MIN: 3892 MB/s | Performance MAX: 7826 MB/s
    [1000/1000] Performance dumb: 2610.136 MB/s
    Average: 2577.313 MB/s
    Performance MIN: 2126 MB/s | Performance MAX: 2652 MB/s
    Copying (memcpy) done in 0.013 s
    Performance: 4871.823 MB/s
    --
    Init done in 0.024 s - size of array: 32 MBs (x2)
    Performance: 1306.738 MB/s
    Copying (linear) done in 0.028 s
    Performance: 1148.582 MB/s
    Copying (stride 40) done in 0.008 s
    Performance: 4265.907 MB/s
    [1000/1000] Performance stride 40: 6181.040 MB/s
    Average: 7124.592 MB/s
    Performance MIN: 3480 MB/s | Performance MAX: 7777 MB/s
    [1000/1000] Performance dumb: 2508.669 MB/s
    Average: 2556.529 MB/s
    Performance MIN: 1966 MB/s | Performance MAX: 2646 MB/s
    Copying (memcpy) done in 0.007 s
    Performance: 4617.860 MB/s
    --
    Init done in 0.013 s - size of array: 16 MBs (x2)
    Performance: 1243.011 MB/s
    Copying (linear) done in 0.014 s
    Performance: 1139.362 MB/s
    Copying (stride 40) done in 0.004 s
    Performance: 4181.548 MB/s
    [1000/1000] Performance stride 40: 6317.129 MB/s
    Average: 7358.539 MB/s
    Performance MIN: 5250 MB/s | Performance MAX: 7816 MB/s
    [1000/1000] Performance dumb: 2529.707 MB/s
    Average: 2525.783 MB/s
    Performance MIN: 1823 MB/s | Performance MAX: 2634 MB/s
    Copying (memcpy) done in 0.003 s
    Performance: 5167.561 MB/s
    --
    Init done in 0.007 s - size of array: 8 MBs (x2)
    Performance: 1186.019 MB/s
    Copying (linear) done in 0.007 s
    Performance: 1147.018 MB/s
    Copying (stride 40) done in 0.002 s
    Performance: 4157.658 MB/s
    [1000/1000] Performance stride 40: 6958.839 MB/s
    Average: 7097.742 MB/s
    Performance MIN: 4278 MB/s | Performance MAX: 7499 MB/s
    [1000/1000] Performance dumb: 2585.366 MB/s
    Average: 2537.896 MB/s
    Performance MIN: 2284 MB/s | Performance MAX: 2610 MB/s
    Copying (memcpy) done in 0.002 s
    Performance: 5059.164 MB/s

Linear reading is 3 times slower than strided reading. Strided reading peaks at around 7500-7800 MB/s. Two things confuse me, however. With DDR3 at 1333 MHz, the maximum memory throughput should be 10,664 MB/s, so why am I not hitting it? And why isn't the read speed more consistent, and how would I optimize for that (cache misses)? It's more apparent in the graphs, especially for linear reading, which shows regular dips in performance.

Graphs

[Graphs: 8-16 MB; 32-64 MB; 128-256 MB; 512-1024 MB; all together]

Here's the full source for anyone interested:

    /*
    gcc -pedantic -std=c99 -Wall -Werror -Wextra -Wno-unused -O0 -I "...path to glfw3 includes ..." -L "...path to glfw3 lib ..." arr_test_copy_gnuplot.c -o arr_test_copy_gnuplot -lglfw3 -framework OpenGL -framework Cocoa -framework IOKit -framework CoreVideo

    optional: -fprefetch-loop-arrays
    */

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h> /* memcpy */
    #include <inttypes.h>
    #include <GLFW/glfw3.h>

    #define ARRAY_NUM 1000000 * 128 /* GIG */

    int main(int argc, char *argv[]) {
        if(!glfwInit()) {
            exit(EXIT_FAILURE);
        }

        int cx = 0;
        char filename_stride[50];
        char filename_dumb[50];

        cx = snprintf(filename_stride, 50, "%lu_stride.dat", ((ARRAY_NUM*sizeof(uint64_t))/1000000));
        if(cx < 0 || cx >50) { exit(EXIT_FAILURE); }
        FILE *file_stride = fopen(filename_stride, "w");

        cx = snprintf(filename_dumb, 50, "%lu_dumb.dat", ((ARRAY_NUM*sizeof(uint64_t))/1000000));
        if(cx < 0 || cx >50) { exit(EXIT_FAILURE); }
        FILE *file_dumb = fopen(filename_dumb, "w");

        if(file_stride == NULL || file_dumb == NULL) {
            perror("Error opening file.");
            exit(EXIT_FAILURE);
        }

        uint64_t *array = malloc(sizeof(uint64_t) * ARRAY_NUM);
        uint64_t *array_copy = malloc(sizeof(uint64_t) * ARRAY_NUM);

        double performance = 0.0;
        double time_start = 0.0;
        double time_end = 0.0;
        double performance_min = 0.0;
        double performance_max = 0.0;

        /* Init array */
        time_start = glfwGetTime();
        for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
            array[i] = 0xff;
        }
        time_end = glfwGetTime();

        performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
        printf("Init done in %.3f s - size of array: %lu MBs (x2)\n", (time_end - time_start), (ARRAY_NUM*sizeof(uint64_t)/1000000));
        printf("Performance: %.3f MB/s\n\n", performance);

        /* Linear copy */
        performance = 0;
        time_start = glfwGetTime();
        for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
            array_copy[i] = array[i];
        }
        time_end = glfwGetTime();

        performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
        printf("Copying (linear) done in %.3f s\n", (time_end - time_start));
        printf("Performance: %.3f MB/s\n\n", performance);

        /* Copying with wide stride */
        performance = 0;
        time_start = glfwGetTime();
        for(uint64_t i = 0; i < ARRAY_NUM; i=i+40) {
            array_copy[i] = array[i]; array_copy[i+1] = array[i+1]; array_copy[i+2] = array[i+2]; array_copy[i+3] = array[i+3];
            array_copy[i+4] = array[i+4]; array_copy[i+5] = array[i+5]; array_copy[i+6] = array[i+6]; array_copy[i+7] = array[i+7];
            array_copy[i+8] = array[i+8]; array_copy[i+9] = array[i+9]; array_copy[i+10] = array[i+10]; array_copy[i+11] = array[i+11];
            array_copy[i+12] = array[i+12]; array_copy[i+13] = array[i+13]; array_copy[i+14] = array[i+14]; array_copy[i+15] = array[i+15];
            array_copy[i+16] = array[i+16]; array_copy[i+17] = array[i+17]; array_copy[i+18] = array[i+18]; array_copy[i+19] = array[i+19];
            array_copy[i+20] = array[i+20]; array_copy[i+21] = array[i+21]; array_copy[i+22] = array[i+22]; array_copy[i+23] = array[i+23];
            array_copy[i+24] = array[i+24]; array_copy[i+25] = array[i+25]; array_copy[i+26] = array[i+26]; array_copy[i+27] = array[i+27];
            array_copy[i+28] = array[i+28]; array_copy[i+29] = array[i+29]; array_copy[i+30] = array[i+30]; array_copy[i+31] = array[i+31];
            array_copy[i+32] = array[i+32]; array_copy[i+33] = array[i+33]; array_copy[i+34] = array[i+34]; array_copy[i+35] = array[i+35];
            array_copy[i+36] = array[i+36]; array_copy[i+37] = array[i+37]; array_copy[i+38] = array[i+38]; array_copy[i+39] = array[i+39];
        }
        time_end = glfwGetTime();

        performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
        printf("Copying (stride 40) done in %.3f s\n", (time_end - time_start));
        printf("Performance: %.3f MB/s\n\n", performance);

        /* Reading with wide stride */
        const int imax = 1000;
        double performance_average = 0.0;

        for(int j = 0; j < imax; ++j) {
            uint64_t tmp = 0;
            performance = 0;
            time_start = glfwGetTime();
            for(uint64_t i = 0; i < ARRAY_NUM; i=i+40) {
                tmp = array[i]; tmp = array[i+1]; tmp = array[i+2]; tmp = array[i+3]; tmp = array[i+4]; tmp = array[i+5]; tmp = array[i+6]; tmp = array[i+7];
                tmp = array[i+8]; tmp = array[i+9]; tmp = array[i+10]; tmp = array[i+11]; tmp = array[i+12]; tmp = array[i+13]; tmp = array[i+14]; tmp = array[i+15];
                tmp = array[i+16]; tmp = array[i+17]; tmp = array[i+18]; tmp = array[i+19]; tmp = array[i+20]; tmp = array[i+21]; tmp = array[i+22]; tmp = array[i+23];
                tmp = array[i+24]; tmp = array[i+25]; tmp = array[i+26]; tmp = array[i+27]; tmp = array[i+28]; tmp = array[i+29]; tmp = array[i+30]; tmp = array[i+31];
                tmp = array[i+32]; tmp = array[i+33]; tmp = array[i+34]; tmp = array[i+35]; tmp = array[i+36]; tmp = array[i+37]; tmp = array[i+38]; tmp = array[i+39];
            }
            time_end = glfwGetTime();

            performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
            performance_average += performance;
            if(performance > performance_max) { performance_max = performance; }
            if(j == 0) { performance_min = performance; }
            if(performance < performance_min) { performance_min = performance; }

            printf("[%d/%d] Performance stride 40: %.3f MB/s\r", j+1, imax, performance);
            fprintf(file_stride, "%d\t%f\n", j, performance);

            fflush(file_stride);
            fflush(stdout);
        }

        performance_average = performance_average / imax;
        printf("\nAverage: %.3f MB/s\n", performance_average);
        printf("Performance MIN: %3.f MB/s | Performance MAX: %3.f MB/s\n\n", performance_min, performance_max);

        /* Linear reading */
        performance_average = 0.0;
        performance_min = 0.0;
        performance_max = 0.0;

        for(int j = 0; j < imax; ++j) {
            uint64_t tmp = 0;
            performance = 0;
            time_start = glfwGetTime();
            for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
                tmp = array[i];
            }
            time_end = glfwGetTime();

            performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
            performance_average += performance;
            if(performance > performance_max) { performance_max = performance; }
            if(j == 0) { performance_min = performance; }
            if(performance < performance_min) { performance_min = performance; }

            printf("[%d/%d] Performance dumb: %.3f MB/s\r", j+1, imax, performance);
            fprintf(file_dumb, "%d\t%f\n", j, performance);

            fflush(file_dumb);
            fflush(stdout);
        }

        performance_average = performance_average / imax;
        printf("\nAverage: %.3f MB/s\n", performance_average);
        printf("Performance MIN: %3.f MB/s | Performance MAX: %3.f MB/s\n\n", performance_min, performance_max);

        /* Memcpy */
        performance = 0;
        time_start = glfwGetTime();
        memcpy(array_copy, array, ARRAY_NUM*sizeof(uint64_t));
        time_end = glfwGetTime();

        performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
        printf("Copying (memcpy) done in %.3f s\n", (time_end - time_start));
        printf("Performance: %.3f MB/s\n", performance);

        /* Cleanup and exit */
        free(array);
        free(array_copy);
        glfwTerminate();
        fclose(file_dumb);
        fclose(file_stride);

        exit(EXIT_SUCCESS);
    }

Summary

How should I write code to get maximum and (near) constant speed when working with arrays where a linear access pattern is the most common?
What can I learn about caches and prefetching from this example?
Do these graphs tell me something I should know that I haven't noticed?
How else can I unroll loops? I tried -funroll-loops with no results, so I resorted to writing the loop-in-loop unrolling by hand.
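On the unrolling question, one alternative to typing out forty statements is a fixed-count inner loop plus a tail loop. This is only a sketch under assumptions: the name `copy_blocked` and the block size `UNROLL` are made up, the value 8 is arbitrary, and whether the compiler actually unrolls the inner loop depends on the optimization level (it will not at -O0).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNROLL 8 /* hypothetical block size; tune by measurement */

/* Copy n elements in blocks of UNROLL, then handle the leftovers. */
void copy_blocked(uint64_t *dst, const uint64_t *src, size_t n)
{
    assert((dst != NULL && src != NULL) || n == 0);
    size_t i = 0;
    /* Main part: whole blocks. The inner trip count is a compile-time
       constant, so an optimizing compiler can unroll it fully. */
    for (; i + UNROLL <= n; i += UNROLL)
        for (size_t k = 0; k < UNROLL; ++k)
            dst[i + k] = src[i + k];
    /* Tail: leftover elements when n is not a multiple of UNROLL. */
    for (; i < n; ++i)
        dst[i] = src[i];
}
```

Unlike the hand-written version, this does not read or write past the end when the array length is not a multiple of the block size.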

Thanks for the long read.

EDIT:

It seems that -O0 gives different performance than when no -O flag is given at all! What gives? The absent flag yields better performance, as can be seen in the graph.

EDIT2:

I finally hit the ceiling with AVX.

    === READING WITH AVX ===
    [1000/1000] Performance AVX: 9868.912 MB/s
    Average: 10029.085 MB/s
    Performance MIN: 6554 MB/s | Performance MAX: 11464 MB/s

The average was very close to 10664. I had to switch the compiler to clang because gcc was giving me a hard time using AVX (-mavx). This is also why the graph has more pronounced dips. I would still like to know what causes them and how to get constant performance. I presume this is due to the cache / cache lines. That would also explain the performance above DDR3 speed here and there (MAX was 11464 MB/s).

Excuse my gnuplot-fu and its keys. Blue is SSE2 (_mm_load_si128) and orange is AVX (_mm256_load_si256). Purple is strided as before, and green is dumb reading one at a time.

So, the final two questions are:

What is causing the dips, and how can I get more constant performance?
Is it possible to hit the ceiling without intrinsics?

Gist with the latest version: https://gist.github.com/Keyframe/1ed9062ec52fc4a0d14b and graphs from that version: http://imgur.com/a/cPeor

For the things you are doing, I would look at SIMD (Single Instruction, Multiple Data); search for GCC compiler intrinsics for details.

Your value for the peak bandwidth of main memory is off by a factor of two. Instead of 10664 MB/s it should be 21.3 GB/s (more precisely, 21333⅓ MB/s; see my derivation below). The fact that you sometimes see more than 10664 MB/s should have told you that there might be a problem in your calculation of the peak bandwidth.

To get the maximum bandwidth on Core2 through Sandy Bridge you need to use non-temporal stores. Additionally, you need multiple threads. You don't need AVX instructions or to unroll the loop.

    void copy(char *x, char *y, int n)
    {
        #pragma omp parallel for schedule(static)
        for(int i=0; i<n/16; i++)
        {
            _mm_stream_ps((float*)&y[16*i], _mm_load_ps((float*)&x[16*i]));
        }
    }

The arrays need to be 16-byte aligned, and their size must be a multiple of 16. The rule of thumb for non-temporal stores is to use them when the memory you are copying is larger than half the size of the last-level cache. In your case, half the L3 cache size is 1.5 MB and the smallest array you copy is 8 MB, so it is much larger than half the last-level cache size.
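As a sketch of how the alignment and size requirements could be met portably, the helpers below round a size up to a multiple of 16 and allocate with C11 `aligned_alloc`. This is an assumption on my part: the question compiles with -std=c99, where `aligned_alloc` is not available, and the test code in this answer uses `_mm_malloc` instead. The names `round_up_16` and `alloc_aligned_16` are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Round size up to the next multiple of 16. */
size_t round_up_16(size_t size)
{
    assert(size <= SIZE_MAX - 15); /* avoid wrap-around */
    return (size + 15u) & ~(size_t)15u;
}

/* Allocate at least `size` bytes, 16-byte aligned. Release with free(). */
void *alloc_aligned_16(size_t size)
{
    /* C11 aligned_alloc wants the size to be a multiple of the alignment. */
    return aligned_alloc(16, round_up_16(size));
}
```

With buffers obtained this way, a loop over n/16 blocks of 16 bytes covers the whole rounded-up allocation without a partial block at the end.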

Here is some code to test this.

    //gcc -O3 -fopenmp foo.c
    #include <stdio.h>
    #include <x86intrin.h>
    #include <string.h>
    #include <omp.h>

    void copy(char *x, char *y, int n)
    {
        #pragma omp parallel for schedule(static)
        for(int i=0; i<n/16; i++)
        {
            _mm_stream_ps((float*)&x[16*i], _mm_load_ps((float*)&y[16*i]));
        }
    }

    void copy2(char *x, char *y, int n)
    {
        #pragma omp parallel for schedule(static)
        for(int i=0; i<n/16; i++)
        {
            _mm_store_ps((float*)&x[16*i], _mm_load_ps((float*)&y[16*i]));
        }
    }

    int main(void)
    {
        unsigned n = 0x7fffffff;
        char *x = _mm_malloc(n, 16);
        char *y = _mm_malloc(n, 16);
        double dtime;

        memset(x,0,n);
        memset(y,1,n);

        dtime = -omp_get_wtime();
        copy(x,y,n);
        dtime += omp_get_wtime();
        printf("time %f\n", dtime);

        dtime = -omp_get_wtime();
        copy2(x,y,n);
        dtime += omp_get_wtime();
        printf("time %f\n", dtime);

        dtime = -omp_get_wtime();
        memcpy(x,y,n);
        dtime += omp_get_wtime();
        printf("time %f\n", dtime);
    }

On my system, a Core2 (pre-Nehalem) P9600@2.53GHz, it gives

    time non temporal store 0.39
    time SSE store 1.10
    time memcpy 0.98

for copying 2GB.

Note that it is very important that you first "touch" the memory you will write to (I used memset to do this). Your system does not necessarily allocate your memory until you access it. The overhead of doing this can skew your results significantly if the memory has not been accessed before you do the memory copy.
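A minimal sketch of the "touch first" idea, with a hypothetical helper name `alloc_touched`: allocate, then write every byte before any timed section begins. In the question's program, array_copy is first written inside the timed linear copy loop, so that measurement may include the cost of the system backing the pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Allocate n bytes and write to every byte so the OS backs the
   buffer with real pages before any timed section starts. */
unsigned char *alloc_touched(size_t n, unsigned char fill)
{
    assert(n > 0);
    unsigned char *buf = malloc(n);
    if (buf != NULL)
        memset(buf, fill, n); /* first touch happens here, untimed */
    return buf;
}
```

Calling this for both the source and the destination buffer before starting the timer keeps the first-access overhead out of the copy measurement.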

According to Wikipedia, DDR3-1333 has a memory clock of 166⅔ MHz. DDR transfers data at twice the memory clock rate. Additionally, DDR3 has a bus clock multiplier of four. So DDR3 has a total multiplier of eight per memory clock. Additionally, your motherboard has two memory channels. So the total transfer rate is

21333⅓ MB/s = (166⅔ 1E6 clocks/s) * (8 lines/clock/channel) * (2 channels) * (64-bits/line) * (byte/8-bits) * (MB/1E6 bytes).
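The same arithmetic as a small function, in case you want to plug in other memory types. The function and parameter names are made up for this sketch; the inputs for your machine are the ones from the derivation above (166⅔ MHz memory clock, 8 transfers per clock, 2 channels, 64-bit bus).

```c
#include <assert.h>

/* Peak transfer rate in MB/s, with 1 MB = 1e6 bytes. */
double peak_bandwidth_mb_s(double memory_clock_hz, double transfers_per_clock,
                           double channels, double bus_width_bits)
{
    assert(memory_clock_hz > 0.0);
    return memory_clock_hz * transfers_per_clock * channels
           * (bus_width_bits / 8.0) / 1e6;
}
```

For example, `peak_bandwidth_mb_s(1e9 / 6.0, 8, 2, 64)` gives about 21333.3 MB/s, while passing 1 for the number of channels gives about 10666.7 MB/s, which is close to the 10664 MB/s figure in the question.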