c++ - plantilla - Por qué push_back es más lento que el operador[] para un vector asignado previamente

vector c++ pdf (5)

Acabo de leer este blog http://lemire.me/blog/archives/2012/06/20/do-not-waste-time-with-stl-vectors/ comparando el rendimiento de la asignación del operator[] y push_back en la memoria pre reservado std::vector y decidí probarlo yo mismo. La operación es simple:

// for vector bigarray.reserve(N); // START TIME TRACK for(int k = 0; k < N; ++k) // for operator[]: // bigarray[k] = k; // for push_back bigarray.push_back(k); // END TIME TRACK // do some dummy operations to prevent compiler optimize long sum = accumulate(begin(bigarray), end(array),0 0);

Y aqui esta el resultado:

~/t/benchmark> icc 1.cpp -O3 -std=c++11 ~/t/benchmark> ./a.out [ 1.cpp: 52] 0.789123s --> C++ new [ 1.cpp: 52] 0.774049s --> C++ new [ 1.cpp: 66] 0.351176s --> vector [ 1.cpp: 80] 1.801294s --> reserve + push_back [ 1.cpp: 94] 1.753786s --> reserve + emplace_back [ 1.cpp: 107] 2.815756s --> no reserve + push_back ~/t/benchmark> clang++ 1.cpp -std=c++11 -O3 ~/t/benchmark> ./a.out [ 1.cpp: 52] 0.592318s --> C++ new [ 1.cpp: 52] 0.566979s --> C++ new [ 1.cpp: 66] 0.270363s --> vector [ 1.cpp: 80] 1.763784s --> reserve + push_back [ 1.cpp: 94] 1.761879s --> reserve + emplace_back [ 1.cpp: 107] 2.815596s --> no reserve + push_back ~/t/benchmark> g++ 1.cpp -O3 -std=c++11 ~/t/benchmark> ./a.out [ 1.cpp: 52] 0.617995s --> C++ new [ 1.cpp: 52] 0.601746s --> C++ new [ 1.cpp: 66] 0.270533s --> vector [ 1.cpp: 80] 1.766538s --> reserve + push_back [ 1.cpp: 94] 1.998792s --> reserve + emplace_back [ 1.cpp: 107] 2.815617s --> no reserve + push_back

Para todos los compiladores, el vector con el operator[] es mucho más rápido que el puntero sin formato con el operator[] . Esto llevó a la primera pregunta: ¿por qué? La segunda pregunta es: ya he "reservado" la memoria, entonces ¿por qué el opeator[] es más rápido?

La siguiente pregunta es, dado que la memoria ya está asignada, ¿ push_back qué push_back es más lento que el operator[] ?

El código de prueba se adjunta a continuación:

Como Yakk y yo hemos descubierto, podría haber otro factor interesante que contribuya a la aparente lentitud de push_back .

La primera observación interesante es que en la prueba original, usar new y operar en una matriz en bruto es más lento que usar vector<int> bigarray(N); y operator[] - más que un factor 2. Aún más interesante es que puede obtener el mismo rendimiento para ambos insertando un memset adicional para la variante de matriz sin memset :

int routine1_modified() { int sum; int* bigarray = new int[N]; memset(bigarray, 0, sizeof(int)*N); PROFILE ( { for (unsigned int k = 0; k < N; ++k) bigarray[k] = k; }, "C++ new"); sum = std::accumulate (bigarray, bigarray + N, 0); delete [] bigarray; return sum; }

La conclusión, por supuesto, es que ese PROFILE mide algo diferente de lo esperado. Yakk y yo creemos que tiene algo que ver con la administración de la memoria; del comentario de Yakk al OP:

resize tocará todo el bloque de memoria. reserve se asignará sin tocar. Si tiene un asignador diferido que no capta o asigna páginas de memoria física hasta que se acceda, reserve en un vector vacío puede ser casi gratuito (¡ni siquiera tiene que encontrar memoria física para las páginas!) Hasta que escriba en las páginas ( en ese punto, deben ser encontrados).

Pensé en algo similar, así que probé una pequeña prueba para esta hipótesis tocando ciertas páginas con un "mememet strided" (una herramienta de generación de perfiles podría obtener resultados más confiables):

int routine1_modified2() { int sum; int* bigarray = new int[N]; for(int k = 0; k < N; k += PAGESIZE*2/sizeof(int)) bigarray[k] = 0; PROFILE ( { for (unsigned int k = 0; k < N; ++k) bigarray[k] = k; }, "C++ new"); sum = std::accumulate (bigarray, bigarray + N, 0); delete [] bigarray; return sum; }

Cambiando el paso de cada página a la mitad de cada 4ta página para dejarla completamente afuera, obtenemos una agradable transición de los tiempos del vector<int> bigarray(N); caso para el new int[N] donde no se ha usado memset .

En mi opinión, ese es un fuerte indicio de que la gestión de la memoria es un contribuyente importante a los resultados de medición.

Otro problema es la ramificación en push_back . En muchas respuestas se afirma que esta es una / la razón principal por la que push_back es mucho más lento en comparación con el uso del operator[] . De hecho, al comparar el puntero sin memset con el uso de reserve + push_back , el primero es dos veces más rápido.

Del mismo modo, si agregamos un poco de UB (pero revise los resultados más adelante):

int routine3_modified() { int sum; vector<int> bigarray; bigarray.reserve (N); memset(bigarray.data(), 0, sizeof(int)*N); // technically, it''s UB PROFILE ( { for (unsigned int k = 0; k < N; ++k) bigarray.push_back (k); }, "reserve + push_back"); sum = std::accumulate (begin (bigarray), end (bigarray), 0); return sum; }

esta versión modificada es aproximadamente 2 veces más lenta que el uso de un memset new + completo. Por lo tanto, parece que sea lo que sea lo que push_back la invocación de push_back , resulta en una desaceleración del factor 2 cuando se compara con simplemente establecer el elemento (a través del operator[] tanto en el vector como en el caso del conjunto sin formato).

¿Pero es la bifurcación requerida en push_back , o la operación adicional?

// pseudo-code void push_back(T const& p) { if(size() == capacity()) { resize( size() < 10 ? 10 : size()*2 ); } (*this)[size()] = p; // actually using the allocator ++m_end; }

De hecho, es así de simple, véase, por ejemplo , la implementación de libstdc ++ .

Lo probé usando el vector<int> bigarray(N); + operator[] variante, e insertando una llamada de función que imita el comportamiento de push_back :

unsigned x = 0; void silly_branch(int k) { if(k == x) { x = x < 10 ? 10 : x*2; } } int routine2_modified() { int sum; vector<int> bigarray (N); PROFILE ( { for (unsigned int k = 0; k < N; ++k) { silly_branch(k); bigarray[k] = k; } }, "vector"); sum = std::accumulate (begin (bigarray), end (bigarray), 0); return sum; }

Incluso cuando se declara x como volátil, esto solo tiene una influencia del 1% en la medición. Por supuesto, debe verificar que la rama esté realmente en el código de operación , pero el conocimiento de mi ensamblador no me permite verificarlo (en -O3 ).

El punto interesante ahora es qué sucede cuando agrego un incremento a silly_branch :

unsigned x = 0; void silly_branch(int k) { if(k == x) { x = x < 10 ? 10 : x*2; } ++x; }

Ahora, la routine2_modified modified2_modified se ejecuta 2 veces más lento que la routine2 original2, estando a la par con la routine3_modified propuesta3_modified anterior que incluye UB para confirmar las páginas de memoria. No me parece particularmente sorprendente, ya que agrega otra escritura a cada escritura en el ciclo, por lo que tenemos dos veces el trabajo y dos veces la duración.

Conclusión

Bueno, había que mirar cuidadosamente las herramientas de ensamblaje y perfilado para verificar las hipótesis de la administración de la memoria y la escritura adicional es una buena hipótesis ("correcta"). Pero creo que las pistas son lo suficientemente fuertes como para afirmar que hay algo más complicado que una rama que hace que push_back sea más lento.

Aquí está el código de prueba completo:

#include <iostream> #include <iomanip> #include <vector> #include <numeric> #include <chrono> #include <string> #include <cstring> #define PROFILE(BLOCK, ROUTNAME) ProfilerRun([&](){do {BLOCK;} while(0);}, / ROUTNAME, __FILE__, __LINE__); //#define PROFILE(BLOCK, ROUTNAME) BLOCK template <typename T> void ProfilerRun (T&& func, const std::string& routine_name = "unknown", const char* file = "unknown", unsigned line = 0) { using std::chrono::duration_cast; using std::chrono::microseconds; using std::chrono::steady_clock; using std::cerr; using std::endl; steady_clock::time_point t_begin = steady_clock::now(); // Call the function func(); steady_clock::time_point t_end = steady_clock::now(); cerr << "[" << std::setw (20) << (std::strrchr (file, ''/'') ? std::strrchr (file, ''/'') + 1 : file) << ":" << std::setw (5) << line << "] " << std::setw (10) << std::setprecision (6) << std::fixed << static_cast<float> (duration_cast<microseconds> (t_end - t_begin).count()) / 1e6 << "s --> " << routine_name << endl; cerr.unsetf (std::ios_base::floatfield); } using namespace std; constexpr int N = (1 << 28); constexpr int PAGESIZE = 4096; uint64_t __attribute__((noinline)) routine1() { uint64_t sum; int* bigarray = new int[N]; PROFILE ( { for (int k = 0, *p = bigarray; p != bigarray+N; ++p, ++k) *p = k; }, "new (routine1)"); sum = std::accumulate (bigarray, bigarray + N, 0ULL); delete [] bigarray; return sum; } uint64_t __attribute__((noinline)) routine2() { uint64_t sum; int* bigarray = new int[N]; memset(bigarray, 0, sizeof(int)*N); PROFILE ( { for (int k = 0, *p = bigarray; p != bigarray+N; ++p, ++k) *p = k; }, "new + full memset (routine2)"); sum = std::accumulate (bigarray, bigarray + N, 0ULL); delete [] bigarray; return sum; } uint64_t __attribute__((noinline)) routine3() { uint64_t sum; int* bigarray = new int[N]; for(int k = 0; k < N; k += PAGESIZE/2/sizeof(int)) bigarray[k] = 0; PROFILE ( { for (int k = 0, *p = bigarray; p != bigarray+N; ++p, ++k) *p = k; }, "new + strided memset (every page half) (routine3)"); sum = std::accumulate (bigarray, bigarray + N, 0ULL); delete [] bigarray; return sum; } uint64_t __attribute__((noinline)) routine4() { uint64_t sum; int* bigarray = new int[N]; for(int k = 0; k < N; k += PAGESIZE/1/sizeof(int)) bigarray[k] = 0; PROFILE ( { for (int k = 0, *p = bigarray; p != bigarray+N; ++p, ++k) *p = k; }, "new + strided memset (every page) (routine4)"); sum = std::accumulate (bigarray, bigarray + N, 0ULL); delete [] bigarray; return sum; } uint64_t __attribute__((noinline)) routine5() { uint64_t sum; int* bigarray = new int[N]; for(int k = 0; k < N; k += PAGESIZE*2/sizeof(int)) bigarray[k] = 0; PROFILE ( { for (int k = 0, *p = bigarray; p != bigarray+N; ++p, ++k) *p = k; }, "new + strided memset (every other page) (routine5)"); sum = std::accumulate (bigarray, bigarray + N, 0ULL); delete [] bigarray; return sum; } uint64_t __attribute__((noinline)) routine6() { uint64_t sum; int* bigarray = new int[N]; for(int k = 0; k < N; k += PAGESIZE*4/sizeof(int)) bigarray[k] = 0; PROFILE ( { for (int k = 0, *p = bigarray; p != bigarray+N; ++p, ++k) *p = k; }, "new + strided memset (every 4th page) (routine6)"); sum = std::accumulate (bigarray, bigarray + N, 0ULL); delete [] bigarray; return sum; } uint64_t __attribute__((noinline)) routine7() { uint64_t sum; vector<int> bigarray (N); PROFILE ( { for (int k = 0; k < N; ++k) bigarray[k] = k; }, "vector, using ctor to initialize (routine7)"); sum = std::accumulate (begin (bigarray), end (bigarray), 0ULL); return sum; } uint64_t __attribute__((noinline)) routine8() { uint64_t sum; vector<int> bigarray; PROFILE ( { for (int k = 0; k < N; ++k) bigarray.push_back (k); }, "vector (+ no reserve) + push_back (routine8)"); sum = std::accumulate (begin (bigarray), end (bigarray), 0ULL); return sum; } uint64_t __attribute__((noinline)) routine9() { uint64_t sum; vector<int> bigarray; bigarray.reserve (N); PROFILE ( { for (int k = 0; k < N; ++k) bigarray.push_back (k); }, "vector + reserve + push_back (routine9)"); sum = std::accumulate (begin (bigarray), end (bigarray), 0ULL); return sum; } uint64_t __attribute__((noinline)) routine10() { uint64_t sum; vector<int> bigarray; bigarray.reserve (N); memset(bigarray.data(), 0, sizeof(int)*N); PROFILE ( { for (int k = 0; k < N; ++k) bigarray.push_back (k); }, "vector + reserve + memset (UB) + push_back (routine10)"); sum = std::accumulate (begin (bigarray), end (bigarray), 0ULL); return sum; } template<class T> void __attribute__((noinline)) adjust_size(std::vector<T>& v, int k, double factor) { if(k >= v.size()) { v.resize(v.size() < 10 ? 10 : k*factor); } } uint64_t __attribute__((noinline)) routine11() { uint64_t sum; vector<int> bigarray; PROFILE ( { for (int k = 0; k < N; ++k) { adjust_size(bigarray, k, 1.5); bigarray[k] = k; } }, "vector + custom emplace_back @ factor 1.5 (routine11)"); sum = std::accumulate (begin (bigarray), end (bigarray), 0ULL); return sum; } uint64_t __attribute__((noinline)) routine12() { uint64_t sum; vector<int> bigarray; PROFILE ( { for (int k = 0; k < N; ++k) { adjust_size(bigarray, k, 2); bigarray[k] = k; } }, "vector + custom emplace_back @ factor 2 (routine12)"); sum = std::accumulate (begin (bigarray), end (bigarray), 0ULL); return sum; } uint64_t __attribute__((noinline)) routine13() { uint64_t sum; vector<int> bigarray; PROFILE ( { for (int k = 0; k < N; ++k) { adjust_size(bigarray, k, 3); bigarray[k] = k; } }, "vector + custom emplace_back @ factor 3 (routine13)"); sum = std::accumulate (begin (bigarray), end (bigarray), 0ULL); return sum; } uint64_t __attribute__((noinline)) routine14() { uint64_t sum; vector<int> bigarray; PROFILE ( { for (int k = 0; k < N; ++k) bigarray.emplace_back (k); }, "vector (+ no reserve) + emplace_back (routine14)"); sum = std::accumulate (begin (bigarray), end (bigarray), 0ULL); return sum; } uint64_t __attribute__((noinline)) routine15() { uint64_t sum; vector<int> bigarray; bigarray.reserve (N); PROFILE ( { for (int k = 0; k < N; ++k) bigarray.emplace_back (k); }, "vector + reserve + emplace_back (routine15)"); sum = std::accumulate (begin (bigarray), end (bigarray), 0ULL); return sum; } uint64_t __attribute__((noinline)) routine16() { uint64_t sum; vector<int> bigarray; bigarray.reserve (N); memset(bigarray.data(), 0, sizeof(bigarray[0])*N); PROFILE ( { for (int k = 0; k < N; ++k) bigarray.emplace_back (k); }, "vector + reserve + memset (UB) + emplace_back (routine16)"); sum = std::accumulate (begin (bigarray), end (bigarray), 0ULL); return sum; } unsigned x = 0; template<class T> void /*__attribute__((noinline))*/ silly_branch(std::vector<T>& v, int k) { if(k == x) { x = x < 10 ? 10 : x*2; } //++x; } uint64_t __attribute__((noinline)) routine17() { uint64_t sum; vector<int> bigarray(N); PROFILE ( { for (int k = 0; k < N; ++k) { silly_branch(bigarray, k); bigarray[k] = k; } }, "vector, using ctor to initialize + silly branch (routine17)"); sum = std::accumulate (begin (bigarray), end (bigarray), 0ULL); return sum; } template<class T, int N> constexpr int get_extent(T(&)[N]) { return N; } int main() { uint64_t results[] = {routine2(), routine1(), routine2(), routine3(), routine4(), routine5(), routine6(), routine7(), routine8(), routine9(), routine10(), routine11(), routine12(), routine13(), routine14(), routine15(), routine16(), routine17()}; std::cout << std::boolalpha; for(int i = 1; i < get_extent(results); ++i) { std::cout << i << ": " << (results[0] == results[i]) << "/n"; } std::cout << x << "/n"; }

Una ejecución de muestra, en una computadora vieja y lenta; Nota:

N == 2<<28 , no 2<<29 como en el OP
compilado con g ++ 4.9 20131022 con -std=c++11 -O3 -march=native

[ temp.cpp: 71] 0.654927s --> new + full memset (routine2) [ temp.cpp: 54] 1.042405s --> new (routine1) [ temp.cpp: 71] 0.605061s --> new + full memset (routine2) [ temp.cpp: 89] 0.597487s --> new + strided memset (every page half) (routine3) [ temp.cpp: 107] 0.601271s --> new + strided memset (every page) (routine4) [ temp.cpp: 125] 0.783610s --> new + strided memset (every other page) (routine5) [ temp.cpp: 143] 0.903038s --> new + strided memset (every 4th page) (routine6) [ temp.cpp: 157] 0.602401s --> vector, using ctor to initialize (routine7) [ temp.cpp: 170] 3.811291s --> vector (+ no reserve) + push_back (routine8) [ temp.cpp: 184] 2.091391s --> vector + reserve + push_back (routine9) [ temp.cpp: 199] 1.375837s --> vector + reserve + memset (UB) + push_back (routine10) [ temp.cpp: 224] 8.738293s --> vector + custom emplace_back @ factor 1.5 (routine11) [ temp.cpp: 240] 5.513803s --> vector + custom emplace_back @ factor 2 (routine12) [ temp.cpp: 256] 5.150388s --> vector + custom emplace_back @ factor 3 (routine13) [ temp.cpp: 269] 3.789820s --> vector (+ no reserve) + emplace_back (routine14) [ temp.cpp: 283] 2.090259s --> vector + reserve + emplace_back (routine15) [ temp.cpp: 298] 1.288740s --> vector + reserve + memset (UB) + emplace_back (routine16) [ temp.cpp: 325] 0.611168s --> vector, using ctor to initialize + silly branch (routine17) 1: true 2: true 3: true 4: true 5: true 6: true 7: true 8: true 9: true 10: true 11: true 12: true 13: true 14: true 15: true 16: true 17: true 335544320

Cuando asigna la matriz en el constructor, el compilador / biblioteca puede básicamente memset() el relleno original y luego simplemente establecer cada valor individual. Cuando use push_back() , la clase std::vector<T> necesitará:

Verifica si hay suficiente espacio.
Cambia el puntero final para ser una nueva ubicación.
Establece el valor real.

El último paso es lo único que debe hacerse cuando la memoria se asigna de una vez.

Este es un comentario extendido, no una respuesta, con la intención de ayudar a mejorar la pregunta.

La rutina 4 invoca un comportamiento indefinido. Está escribiendo más allá del final del size de la matriz. Reemplace la reserva con el cambio de tamaño para eliminar eso.

Las rutinas 3 a 5 no pueden hacer nada después de la optimización ya que no tienen salida observable.

Una insert( vec.end(), src.begin(), src.end() ) donde src es un rango de generador de acceso aleatorio ( boost probablemente lo tenga) podría emular la new versión si su insert es inteligente.

La duplicación de la routine1 parece graciosa: por casualidad, ¿esto cambia los tiempos?

Puedo responder tu segunda pregunta. Aunque el vector está preasignado, push_back aún necesita verificar el espacio disponible cada vez que llame a push_back. Por otro lado, el operador [] no realiza comprobaciones y simplemente asume que el espacio está disponible.

push_back hace una verificación de límites. operator[] no. Entonces, incluso si ha reservado el espacio, push_back tendrá una verificación condicional adicional que el operator[] no tendrá. Además, aumentará el valor del size (la reserva solo establece la capacity ), por lo que se actualizará cada vez.

En resumen, push_back está haciendo más de lo que hace el operator[] , razón por la cual es más lento (y más preciso).