usando - ¿Por qué esta función recursiva de C++ tiene un comportamiento de caché tan malo?

funciones recursivas en c (1)

Deje que T sea un árbol binario enraizado de modo que cada nodo interno tenga exactamente dos hijos. Los nodos del árbol se almacenarán en una matriz, vamos a llamarlo TreeArray siguiendo el diseño de preorden.

Entonces, por ejemplo, si este es el árbol que tenemos:

Entonces TreeArray contendrá los siguientes objetos de nodo:

7, 3, 1, 0, 2, 6, 12, 9, 8, 11, 13

Un nodo en este árbol es una estructura de este tipo:

struct tree_node{ int id; //id of the node, randomly generated int numChildren; //number of children, it is 2 but for the leafs it''s 0 int pos; //position in TreeArray where the node is stored int lpos; //position of the left child int rpos; //position of the right child tree_node(){ id = -1; pos = lpos = rpos = -1; numChildren = 0; } };

Ahora supongamos que queremos una función que devuelva la suma de todos los identificadores en el árbol. Suena muy trivial, todo lo que tienes que hacer es usar un ciclo for que itera sobre TreeArray y acumula todos los identificadores encontrados. Sin embargo, estoy interesado en comprender el comportamiento del caché de la siguiente implementación:

void testCache1(int cur){ //find the positions of the left and right children int lpos = TreeArray[cur].lpos; int rpos = TreeArray[cur].rpos; //if there are no children we are at a leaf so update r and return if(TreeArray[cur].numChildren == 0){ r += TreeArray[cur].id; return; } //otherwise we are in an internal node, so update r and recurse //first to the left subtree and then to the right subtree r += TreeArray[cur].id; testCache1(lpos); testCache1(rpos); }

Para probar el comportamiento de la memoria caché, tengo el siguiente experimento:

r = 0; //r is a global variable int main(int argc, char* argv[]){ for(int i=0;i<100;i++) { r = 0; testCache1(0); } cout<<r<<endl; return 0; }

Para un árbol aleatorio con 5 millones de hojas, perf stat -B -e cache-misses,cache-references,instructions ./run_tests 111.txt imprime lo siguiente:

Performance counter stats for ''./run_tests 111.txt'': 469,511,047 cache-misses # 89.379 % of all cache refs 525,301,814 cache-references 20,715,360,185 instructions 11.214075268 seconds time elapsed

Al principio pensé que tal vez sea por la forma en que genero el árbol, que excluyo incluyéndolo en mi pregunta, pero cuando ejecuto sudo perf record -e cache-misses ./run_tests 111.txt ejecución sudo perf record -e cache-misses ./run_tests 111.txt recibí el siguiente resultado:

Como podemos ver, la mayoría de las fallas de caché provienen de esta función. Sin embargo, no puedo entender por qué este es el caso. Los valores de cur serán secuenciales, primero TreeArray posición 0 de TreeArray , luego a la posición 1 , 2 , 3 , etc.

Para agregar más dudas a mi comprensión de lo que está sucediendo, tengo la siguiente función que encuentra la misma suma:

void testCache4(int index){ if(index == TreeArray.size) return; r += TreeArray[index].id; testCache4(index+1); }

testCache4 accede a los elementos de TreeArray de la misma manera, pero el comportamiento del caché es mucho mejor.

salida de perf stat -B -e cache-misses,cache-references,instructions ./run_tests 11.txt :

Performance counter stats for ''./run_tests 111.txt'': 396,941,872 cache-misses # 54.293 % of all cache refs 731,109,661 cache-references 11,547,097,924 instructions 4.306576556 seconds time elapsed

en la salida de sudo perf record -e cache-misses ./run_tests 111.txt la función ni siquiera está allí:

Me disculpo por la publicación larga pero me siento completamente perdido. Gracias de antemano.

EDITAR:

Aquí está el archivo de prueba completo, junto con los analizadores y todo lo que se requiere. Se supone que el árbol está disponible dentro de un archivo de texto que se da como argumento. Para compilar, escriba g++ -O3 -std=c++11 file.cpp y ejecútelo ./executable tree.txt . El árbol que estoy usando se puede encontrar aquí (no abrir, haz clic en "salvarnos").

#include <iostream> #include <fstream> #define BILLION 1000000000LL using namespace std; /* * * Timing functions * */ timespec startT, endT; void startTimer(){ clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &startT); } double endTimer(){ clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &endT); return endT.tv_sec * BILLION + endT.tv_nsec - (startT.tv_sec * BILLION + startT.tv_nsec); } /* * * tree node * */ //this struct is used for creating the first tree after reading it from the external file, for this we need left and child pointers struct tree_node_temp{ int id; //id of the node, randomly generated int numChildren; //number of children, it is 2 but for the leafs it''s 0 int size; //size of the subtree rooted at the current node tree_node_temp *leftChild; tree_node_temp *rightChild; tree_node_temp(){ id = -1; size = 1; leftChild = nullptr; rightChild = nullptr; numChildren = 0; } }; struct tree_node{ int id; //id of the node, randomly generated int numChildren; //number of children, it is 2 but for the leafs it''s 0 int size; //size of the subtree rooted at the current node int pos; //position in TreeArray where the node is stored int lpos; //position of the left child int rpos; //position of the right child tree_node(){ id = -1; pos = lpos = rpos = -1; numChildren = 0; } }; /* * * Tree parser. The input is a file containing the tree in the newick format. * */ string treeNewickStr; //string storing the newick format of a tree that we read from a file int treeCurSTRindex; //index to the current position we are in while reading the newick string int treeNumLeafs; //number of leafs in current tree tree_node ** treeArrayReferences; //stack of references to free node objects tree_node *treeArray; //array of node objects int treeStackReferencesTop; //the top index to the references stack int curpos; //used to find pos,lpos and rpos when creating the pre order layout tree //helper function for readNewick tree_node_temp* readNewickHelper() { int i; if(treeCurSTRindex == treeNewickStr.size()) return nullptr; tree_node_temp * leftChild; tree_node_temp * rightChild; if(treeNewickStr[treeCurSTRindex] == ''(''){ //create a left child treeCurSTRindex++; leftChild = readNewickHelper(); } if(treeNewickStr[treeCurSTRindex] == '',''){ //create a right child treeCurSTRindex++; rightChild = readNewickHelper(); } if(treeNewickStr[treeCurSTRindex] == '')'' || treeNewickStr[treeCurSTRindex] == '';''){ treeCurSTRindex++; tree_node_temp * cur = new tree_node_temp(); cur->numChildren = 2; cur->leftChild = leftChild; cur->rightChild = rightChild; cur->size = 1 + leftChild->size + rightChild->size; return cur; } //we are about to read a label, keep reading until we read a "," ")" or "(" (we assume that the newick string has the right format) i = 0; char treeLabel[20]; //buffer used for the label while(treeNewickStr[treeCurSTRindex]!='','' && treeNewickStr[treeCurSTRindex]!=''('' && treeNewickStr[treeCurSTRindex]!='')''){ treeLabel[i] = treeNewickStr[treeCurSTRindex]; treeCurSTRindex++; i++; } treeLabel[i] = ''/0''; tree_node_temp * cur = new tree_node_temp(); cur->numChildren = 0; cur->id = atoi(treeLabel)-1; treeNumLeafs++; return cur; } //create the pre order tree, curRoot in the first call points to the root of the first tree that was given to us by the parser void treeInit(tree_node_temp * curRoot){ tree_node * curFinalRoot = treeArrayReferences[curpos]; curFinalRoot->pos = curpos; if(curRoot->numChildren == 0) { curFinalRoot->id = curRoot->id; return; } //add left child tree_node * cnode = treeArrayReferences[treeStackReferencesTop]; curFinalRoot->lpos = curpos + 1; curpos = curpos + 1; treeStackReferencesTop++; cnode->id = curRoot->leftChild->id; treeInit(curRoot->leftChild); //add right child curFinalRoot->rpos = curpos + 1; curpos = curpos + 1; cnode = treeArrayReferences[treeStackReferencesTop]; treeStackReferencesTop++; cnode->id = curRoot->rightChild->id; treeInit(curRoot->rightChild); curFinalRoot->id = curRoot->id; curFinalRoot->numChildren = 2; curFinalRoot->size = curRoot->size; } //the ids of the leafs are deteremined by the newick file, for the internal nodes we just incrementally give the id determined by the dfs traversal void updateInternalNodeIDs(int cur){ tree_node* curNode = treeArrayReferences[cur]; if(curNode->numChildren == 0){ return; } curNode->id = treeNumLeafs++; updateInternalNodeIDs(curNode->lpos); updateInternalNodeIDs(curNode->rpos); } //frees the memory of the first tree generated by the parser void treeFreeMemory(tree_node_temp* cur){ if(cur->numChildren == 0){ delete cur; return; } treeFreeMemory(cur->leftChild); treeFreeMemory(cur->rightChild); delete cur; } //reads the tree stored in "file" under the newick format and creates it in the main memory. The output (what the function returns) is a pointer to the root of the tree. //this tree is scattered anywhere in the memory. tree_node* readNewick(string& file){ treeCurSTRindex = -1; treeNewickStr = ""; treeNumLeafs = 0; ifstream treeFin; treeFin.open(file, ios_base::in); //read the newick format of the tree and store it in a string treeFin>>treeNewickStr; //initialize index for reading the string treeCurSTRindex = 0; //create the tree in main memory tree_node_temp* root = readNewickHelper(); //store the tree in an array following the pre order layout treeArray = new tree_node[root->size]; treeArrayReferences = new tree_node*[root->size]; int i; for(i=0;i<root->size;i++) treeArrayReferences[i] = &treeArray[i]; treeStackReferencesTop = 0; tree_node* finalRoot = treeArrayReferences[treeStackReferencesTop]; curpos = treeStackReferencesTop; treeStackReferencesTop++; finalRoot->id = root->id; treeInit(root); //update the internal node ids (the leaf ids are defined by the ids stored in the newick string) updateInternalNodeIDs(0); //close the file treeFin.close(); //free the memory of initial tree treeFreeMemory(root); //return the pre order tree return finalRoot; } /* * experiments * */ int r; tree_node* T; void testCache1(int cur){ int lpos = treeArray[cur].lpos; int rpos = treeArray[cur].rpos; if(treeArray[cur].numChildren == 0){ r += treeArray[cur].id; return; } r += treeArray[cur].id; testCache1(lpos); testCache1(rpos); } void testCache4(int index){ if(index == T->size) return; r += treeArray[index].id; testCache4(index+1); } int main(int argc, char* argv[]){ string Tnewick = argv[1]; T = readNewick(Tnewick); double tt; startTimer(); for(int i=0;i<100;i++) { r = 0; testCache4(0); } tt = endTimer(); cout<<r<<endl; cout<<tt/BILLION<<endl; startTimer(); for(int i=0;i<100;i++) { r = 0; testCache1(0); } tt = endTimer(); cout<<r<<endl; cout<<tt/BILLION<<endl; delete[] treeArray; delete[] treeArrayReferences; return 0; }

EDIT2:

Ejecuto algunas pruebas de perfil con valgrind. Las instrucciones pueden ser los gastos generales aquí, pero no entiendo por qué. Por ejemplo, incluso en los experimentos anteriores con perf, una versión da alrededor de 20 mil millones de instrucciones y los otros 11 mil millones. Eso es una diferencia de 9 billones.

Con -O3 habilitado obtengo lo siguiente:

entonces las llamadas a funciones son costosas en testCache1 y no cuestan nada en testCache4 ? La cantidad de llamadas a funciones en ambos casos debe ser la misma ...

Supongo que el problema es un malentendido de lo que realmente cuentan las referencias de caché.

Como se explica en esta respuesta , las referencias de memoria caché en las CPU de Intel en realidad son el número de referencias a la memoria caché de último nivel. Por lo tanto, las referencias de memoria que sirvió el caché L1 no se cuentan. El manual del desarrollador de arquitecturas Intel 64 e IA-32 indica que las cargas del captador previo L1 se cuentan.

Si realmente compara el número absoluto de fallas de caché, verá que son más o menos iguales para ambas funciones. Usé un árbol completamente balanceado para la prueba, quité la pos para obtener un tamaño de 16 que encajaba muy bien en las líneas de caché y obtuve los siguientes números:

testCache4 :

843.628.131 L1-dcache-loads (56,83%) 193.006.858 L1-dcache-load-misses # 22,73% of all L1-dcache hits (57,31%) 326.698.621 cache-references (57,07%) 188.435.203 cache-misses # 57,679 % of all cache refs (56,76%)

testCache1 :

3.519.968.253 L1-dcache-loads (57,17%) 193.664.806 L1-dcache-load-misses # 5,50% of all L1-dcache hits (57,24%) 256.638.490 cache-references (57,12%) 188.007.927 cache-misses # 73,258 % of all cache refs (57,23%)

Y si desactivo manualmente todos los precaptadores de hardware:

testCache4 :

846.124.474 L1-dcache-loads (57,22%) 192.495.450 L1-dcache-load-misses # 22,75% of all L1-dcache hits (57,31%) 193.699.811 cache-references (57,03%) 185.445.753 cache-misses # 95,739 % of all cache refs (57,17%)

testCache1 :

3.534.308.118 L1-dcache-loads (57,16%) 193.595.962 L1-dcache-load-misses # 5,48% of all L1-dcache hits (57,18%) 193.639.498 cache-references (57,12%) 185.120.733 cache-misses # 95,601 % of all cache refs (57,15%)

Como puede ver, las diferencias se han ido ahora. Simplemente hubo eventos adicionales de referencia de caché debido al prefetcher y la referencia real se contó dos veces.

En realidad, si cuenta todas las referencias de memoria, testCache1 tiene una menor relación de falta total de caché, porque cada tree_node se referencia 4 veces en lugar de unos, pero cada miembro de datos de un tree_node encuentra en la misma línea de caché y solo hay uno de 4 errores. .

Para testCache4 , puede ver que la relación de pérdidas de carga L1d es en realidad cercana al 25%, lo que se esperaría si sizeof(tree_node) == 16 y las líneas de caché son 64 bytes.

Además, el compilador (al menos gcc con -O2) aplica optimización de recursión de cola a ambas funciones eliminando la recursión de testCache4 , al tiempo que convierte testCache1 en recursivo unidireccional. Por testCache1 tanto, testCache1 tiene muchas referencias de caché adicionales a los marcos de pila que testCache4 no tiene.

Puede obtener el resultado sin prefetcher también utilizando valgrind, que probablemente también sea más confiable en su salida. Sin embargo, no simula todas las propiedades de las cachés de la CPU.

En cuanto a sus ediciones: como mencioné, gcc aplica optimización de recursividad de cola, por lo que no quedan llamadas en testCache4 y, por supuesto, la recursividad y las cargas de memoria adicionales en testCache1 tienen una carga de instrucción significativa en comparación con el bucle de carga / adición simple que queda en testCache4 .