multithreading - Mandelbrot multiproceso y SIMD vectorizado en R usando Rcpp y OpenMP

(2)

No use OpenMP con objetos *Vector o *Matrix SEXP ya que enmascaran SEXP funciones SEXP / asignaciones de memoria que son de un solo subproceso. OpenMP es un enfoque multiproceso .

Es por eso que el código está fallando.

Una forma de sortear esta limitación es utilizar una estructura de datos que no sea R para almacenar los resultados. Una de las siguientes será suficiente: arma::mat o Eigen::MatrixXd o std::vector<T> ... Como prefiero el armadillo, cambiaré la matriz res a arma::mat de Rcpp::NumericMatrix . Por lo tanto, lo siguiente ejecutará su código en paralelo:

#include <RcppArmadillo.h> // Note the changed include and new attribute // [[Rcpp::depends(RcppArmadillo)]] // Avoid including header if openmp not on system #ifdef _OPENMP #include <omp.h> #endif // [[Rcpp::plugins(openmp)]] // Note the changed return type // [[Rcpp::export]] arma::mat mandelRcpp(const double x_min, const double x_max, const double y_min, const double y_max, const int res_x, const int res_y, const int nb_iter) { arma::mat ret(res_x, res_y); // note change double x_step = (x_max - x_min) / res_x; double y_step = (y_max - y_min) / res_y; unsigned r,c; #pragma omp parallel for shared(res) for (r = 0; r < res_y; r++) { for (c = 0; c < res_x; c++) { double zx = 0.0, zy = 0.0, new_zx; double cx = x_min + c*x_step, cy = y_min + r*y_step; unsigned n = 0; for (; (zx*zx + zy*zy < 4.0 ) && ( n < nb_iter ); n++ ) { new_zx = zx*zx - zy*zy + cx; zy = 2.0*zx*zy + cy; zx = new_zx; } if(n == nb_iter) { n = 0; } ret(r, c) = n; } } return ret; }

Con el código de prueba (nota y y x no se definieron, por lo tanto supuse y = ylims y x = xlims ) tenemos:

xlims = ylims = c(-2.0, 2.0) x_res = y_res = 400L nb_iter = 256L system.time(m <- mandelRcpp(xlims[[1]], xlims[[2]], ylims[[1]], ylims[[2]], x_res, y_res, nb_iter)) rainbow = c( rgb(0.47, 0.11, 0.53), rgb(0.27, 0.18, 0.73), rgb(0.25, 0.39, 0.81), rgb(0.30, 0.57, 0.75), rgb(0.39, 0.67, 0.60), rgb(0.51, 0.73, 0.44), rgb(0.67, 0.74, 0.32), rgb(0.81, 0.71, 0.26), rgb(0.89, 0.60, 0.22), rgb(0.89, 0.39, 0.18), rgb(0.86, 0.13, 0.13) ) cols = c(colorRampPalette(rainbow)(100), rev(colorRampPalette(rainbow)(100)), "black") # palette par(mar = c(0, 0, 0, 0)) image(m, col = cols, asp = diff(range(ylims)) / diff(range(xlims)), axes = F)

Por:

Como prueba de rendimiento de OpenMP y Rcpp , quería comprobar qué tan rápido podía calcular el conjunto de Mandelbrot en R utilizando la implementación más sencilla y sencilla de Rcpp + OpenMP . Actualmente lo que hice fue:

#include <Rcpp.h> #include <omp.h> // [[Rcpp::plugins(openmp)]] using namespace Rcpp; // [[Rcpp::export]] Rcpp::NumericMatrix mandelRcpp(const double x_min, const double x_max, const double y_min, const double y_max, const int res_x, const int res_y, const int nb_iter) { Rcpp::NumericMatrix ret(res_x, res_y); double x_step = (x_max - x_min) / res_x; double y_step = (y_max - y_min) / res_y; int r,c; #pragma omp parallel for default(shared) private(c) schedule(dynamic,1) for (r = 0; r < res_y; r++) { for (c = 0; c < res_x; c++) { double zx = 0.0, zy = 0.0, new_zx; double cx = x_min + c*x_step, cy = y_min + r*y_step; int n = 0; for (n=0; (zx*zx + zy*zy < 4.0 ) && ( n < nb_iter ); n++ ) { new_zx = zx*zx - zy*zy + cx; zy = 2.0*zx*zy + cy; zx = new_zx; } ret(c,r) = n; } } return ret; }

Y luego en R:

library(Rcpp) sourceCpp("mandelRcpp.cpp") xlims=c(-0.74877,-0.74872); ylims=c(0.065053,0.065103); x_res=y_res=1080L; nb_iter=10000L; system.time(m <- mandelRcpp(xlims[[1]], xlims[[2]], ylims[[1]], ylims[[2]], x_res, y_res, nb_iter)) # 0.92s rainbow=c(rgb(0.47,0.11,0.53),rgb(0.27,0.18,0.73),rgb(0.25,0.39,0.81),rgb(0.30,0.57,0.75),rgb(0.39,0.67,0.60),rgb(0.51,0.73,0.44),rgb(0.67,0.74,0.32),rgb(0.81,0.71,0.26),rgb(0.89,0.60,0.22),rgb(0.89,0.39,0.18),rgb(0.86,0.13,0.13)) cols=c(colorRampPalette(rainbow)(100),rev(colorRampPalette(rainbow)(100)),"black") # palette par(mar=c(0, 0, 0, 0)) system.time(image(m^(1/7), col=cols, asp=diff(ylims)/diff(xlims), axes=F, useRaster=T)) # 0.5s

Sin embargo, no estaba seguro de si hay otras mejoras obvias de velocidad que podría aprovechar, aparte de OpenMP multihilo, por ejemplo, a través de la vectorización simd . (el uso de opciones simd en openmp #pragma no parecía hacer nada)

PD: al principio mi código fallaba, pero luego descubrí que esto se resolvió reemplazando ret[r,c] = n; con ret(r,c) = n; El uso de las clases de armadillo como se sugiere en la respuesta a continuación hace que las cosas sean un poco más rápidas, aunque los tiempos son casi los mismos. También volteó alrededor de x e y para que salga en la orientación correcta cuando se traza con image() . Usando 8 hilos la velocidad es ca. 350 veces más rápido que la versión vectorizada de R Mandelbrot simple here y también aproximadamente 7.3 veces más rápido que la versión Python / Numba (no multiproceso) here (similar a las velocidades PyCUDA o PyOpenCL), muy contento con eso ... Rasterización / visualización ahora Parece el cuello de botella en R ....

Seguí adelante y vectoricé el código de OP usando las extensiones de vector de GCC y Clang. Antes de mostrar cómo hice esto, permítanme mostrar el rendimiento con el siguiente hardware:

Skylake (SKL) at 3.1 GHz with 4 cores Knights Landing (KNL) at 1.5 GHz with 68 cores ARMv8 Cortex-A57 arch64 (Nvidia Jetson TX1) 4 cores at ? GHz nb_iter = 1000000 GCC Clang SKL_scalar 6m5,422s SKL_SSE41 3m18,058s SKL_AVX2 1m37,843s 1m39,943s SKL_scalar_omp 0m52,237s SKL_SSE41_omp 0m29,624s 0m31,356s SKL_AVX2_omp 0m14,156s 0m16,783s ARM_scalar 15m28.285s ARM_vector 9m26.384s ARM_scalar_omp 3m54.242s ARM_vector_omp 2m21.780s KNL_scalar 19m34.121s KNL_SSE41 11m30.280s KNL_AVX2 5m0.005s 6m39.568s KNL_AVX512 2m40.934s 6m20.061s KNL_scalar_omp 0m9.108s KNL_SSE41_omp 0m6.666s 0m6.992s KNL_AVX2_omp 0m2.973s 0m3.988s KNL_AVX512_omp 0m1.761s 0m3.335s

La aceleración teórica de KNL vs. SKL es

(68 cores/4 cores)*(1.5 GHz/3.1 Ghz)* (8 doubles per lane/4 doubles per lane) = 16.45

Entré aquí en detalle sobre las capacidades de extensiones de vector de GCC y Clang here . Para vectorizar el código de OP aquí hay tres operaciones de vectores adicionales que necesitamos definir.

1. Difusión

Para un vector v y un escalar s GCC no puede hacer v = s pero Clang sí. Pero encontré una buena solución que funciona para GCC y Clang here . Por ejemplo

vsi v = s - (vsi){};

2. Una función any() como en OpenCL o como en R

Lo mejor que se me ocurrió es una función genérica

static bool any(vli const & x) { for(int i=0; i<VLI_SIZE; i++) if(x[i]) return true; return false; }

Clang en realidad genera un código relativamente eficiente para esto usando la instrucción ptest (pero no para AVX512 ) pero GCC no.

3. Compresión

Los cálculos se realizan como dobles de 64 bits, pero el resultado se escribe como enteros de 32 bits. Entonces, dos cálculos se realizan utilizando enteros de 64 bits y luego los dos cálculos se comprimen en un vector de enteros de 32 bits. Se me ocurrió una solución genérica con la que Clang hace un buen trabajo

static vsi compress(vli const & lo, vli const & hi) { vsi lo2 = (vsi)lo, hi2 = (vsi)hi, z; for(int i=0; i<VLI_SIZE; i++) z[i+0*VLI_SIZE] = lo2[2*i]; for(int i=0; i<VLI_SIZE; i++) z[i+1*VLI_SIZE] = hi2[2*i]; return z; }

La siguiente solución funciona mejor para GCC pero no es mejor para Clang . Pero como esta función no es crítica, solo uso la versión genérica.

static vsi compress(vli const & low, vli const & high) { #if defined(__clang__) return __builtin_shufflevector((vsi)low, (vsi)high, MASK); #else return __builtin_shuffle((vsi)low, (vsi)high, (vsi){MASK}); #endif }

Estas definiciones no se basan en nada específico de x86 y el código (definido a continuación) también se compila para procesadores ARM con GCC y Clang.

Ahora que están definidos aquí, está el código

#include <string.h> #include <inttypes.h> #include <Rcpp.h> using namespace Rcpp; #ifdef _OPENMP #include <omp.h> #endif // [[Rcpp::plugins(openmp)]] // [[Rcpp::plugins(cpp14)]] #if defined ( __AVX512F__ ) || defined ( __AVX512__ ) static const int SIMD_SIZE = 64; #elif defined ( __AVX2__ ) static const int SIMD_SIZE = 32; #else static const int SIMD_SIZE = 16; #endif static const int VSI_SIZE = SIMD_SIZE/sizeof(int32_t); static const int VLI_SIZE = SIMD_SIZE/sizeof(int64_t); static const int VDF_SIZE = SIMD_SIZE/sizeof(double); #if defined(__clang__) typedef int32_t vsi __attribute__ ((ext_vector_type(VSI_SIZE))); typedef int64_t vli __attribute__ ((ext_vector_type(VLI_SIZE))); typedef double vdf __attribute__ ((ext_vector_type(VDF_SIZE))); #else typedef int32_t vsi __attribute__ ((vector_size (SIMD_SIZE))); typedef int64_t vli __attribute__ ((vector_size (SIMD_SIZE))); typedef double vdf __attribute__ ((vector_size (SIMD_SIZE))); #endif static bool any(vli const & x) { for(int i=0; i<VLI_SIZE; i++) if(x[i]) return true; return false; } static vsi compress(vli const & lo, vli const & hi) { vsi lo2 = (vsi)lo, hi2 = (vsi)hi, z; for(int i=0; i<VLI_SIZE; i++) z[i+0*VLI_SIZE] = lo2[2*i]; for(int i=0; i<VLI_SIZE; i++) z[i+1*VLI_SIZE] = hi2[2*i]; return z; } // [[Rcpp::export]] IntegerVector frac(double x_min, double x_max, double y_min, double y_max, int res_x, int res_y, int nb_iter) { IntegerVector out(res_x*res_y); vdf x_minv = x_min - (vdf){}, y_minv = y_min - (vdf){}; vdf x_stepv = (x_max - x_min)/res_x - (vdf){}, y_stepv = (y_max - y_min)/res_y - (vdf){}; double a[VDF_SIZE] __attribute__ ((aligned(SIMD_SIZE))); for(int i=0; i<VDF_SIZE; i++) a[i] = 1.0*i; vdf vi0 = *(vdf*)a; #pragma omp parallel for schedule(dynamic) collapse(2) for (int r = 0; r < res_y; r++) { for (int c = 0; c < res_x/(VSI_SIZE); c++) { vli nv[2] = {0 - (vli){}, 0 - (vli){}}; for(int j=0; j<2; j++) { vdf c2 = 1.0*VDF_SIZE*(2*c+j) + vi0; vdf zx = 0.0 - (vdf){}, zy = 0.0 - (vdf){}, new_zx; vdf cx = x_minv + c2*x_stepv, cy = y_minv + r*y_stepv; vli t = -1 - (vli){}; for (int n = 0; any(t = zx*zx + zy*zy < 4.0) && n < nb_iter; n++, nv[j] -= t) { new_zx = zx*zx - zy*zy + cx; zy = 2.0*zx*zy + cy; zx = new_zx; } } vsi sp = compress(nv[0], nv[1]); memcpy(&out[r*res_x + VSI_SIZE*c], (int*)&sp, SIMD_SIZE); } } return out; }

El código R es casi el mismo que el código del OP

library(Rcpp) sourceCpp("frac.cpp", verbose=TRUE, rebuild=TRUE) xlims=c(-0.74877,-0.74872); ylims=c(0.065053,0.065103); x_res=y_res=1080L; nb_iter=100000L; t = system.time(m <- frac(xlims[[1]], xlims[[2]], ylims[[1]], ylims[[2]], x_res, y_res, nb_iter)) print(t) m2 = matrix(m, ncol = x_res) rainbow = c( rgb(0.47, 0.11, 0.53), rgb(0.27, 0.18, 0.73), rgb(0.25, 0.39, 0.81), rgb(0.30, 0.57, 0.75), rgb(0.39, 0.67, 0.60), rgb(0.51, 0.73, 0.44), rgb(0.67, 0.74, 0.32), rgb(0.81, 0.71, 0.26), rgb(0.89, 0.60, 0.22), rgb(0.89, 0.39, 0.18), rgb(0.86, 0.13, 0.13) ) cols = c(colorRampPalette(rainbow)(100), rev(colorRampPalette(rainbow)(100)),"black") # palette par(mar = c(0, 0, 0, 0)) image(m2^(1/7), col=cols, asp=diff(ylims)/diff(xlims), axes=F, useRaster=T)

Para compilar para GCC o Clang, cambie el archivo ~/.R/Makevars a

CXXFLAGS= -Wall -std=c++14 -O3 -march=native -ffp-contract=fast -fopenmp #uncomment the following two lines for clang #CXX=clang-5.0 #LDFLAGS= -lomp

Si tiene problemas para que OpenMP funcione para Clang, vea this .

El código produce más o menos la misma imagen.