A C++ version of R's %in% operator


Is there a function in C++ equivalent to the %in% operator in R? Consider the following R command:

which(y %in% x)

I tried to find something equivalent in C++ (specifically in Armadillo) and couldn't find anything. I then wrote my own function, which is very slow compared to the R command above.

Here is what I wrote:

```cpp
#include <RcppArmadillo.h>
// [[Rcpp::depends("RcppArmadillo")]]

// [[Rcpp::export]]
arma::uvec myInOperator(arma::vec myBigVec, arma::vec mySmallVec) {
  // 0-based positions in myBigVec equal to the first element of mySmallVec
  arma::uvec rslt = arma::find(myBigVec == mySmallVec[0]);
  for (arma::uword i = 1; i < mySmallVec.size(); i++) {
    arma::uvec rslt_tmp = arma::find(myBigVec == mySmallVec[i]);
    // Merge the new matches into the result and drop duplicates
    rslt = arma::unique(arma::join_cols(rslt, rslt_tmp));
  }
  return rslt;
}
```

Now, after sourcing the code above, we have:

```r
library(rbenchmark)  # provides benchmark()

x <- 1:4
y <- 1:10
res <- benchmark(myInOperator(y, x), which(y %in% x),
                 columns = c("test", "replications", "elapsed",
                             "relative", "user.self", "sys.self"),
                 order = "relative")
```

And here are the results:

```
                test replications elapsed relative user.self sys.self
2    which(y %in% x)          100   0.001        1     0.001        0
1 myInOperator(y, x)          100   0.002        2     0.001        0
```

Could anyone guide me either to C++ code corresponding to which(y %in% x) or to making my code more efficient? The elapsed time is already very small for both functions. I guess what I meant by efficiency is more from a programming perspective: whether the way I thought about the problem and the commands I used are efficient.

I appreciate your help.

EDIT: Thanks to @MatthewLundberg and @Yakk for catching my silly mistakes.

If what you really want is faster matching, you should check out Simon Urbanek's fastmatch package. However, Rcpp does in fact have a sugar function, in, that can be used here. in uses some of the ideas from the fastmatch package and incorporates them into Rcpp. I also compare @hadley's solution here.

```cpp
// [[Rcpp::plugins("cpp11")]]
#include <Rcpp.h>
#include <algorithm>
#include <unordered_set>
#include <vector>
using namespace Rcpp;

// Rcpp sugar: in(x, y) is the analogue of x %in% y
// [[Rcpp::export]]
std::vector<int> sugar_in(IntegerVector x, IntegerVector y) {
  LogicalVector ind = in(x, y);
  int n = ind.size();
  std::vector<int> output;
  output.reserve(n);
  for (int i = 0; i < n; ++i) {
    if (ind[i]) output.push_back(i + 1);  // 1-based, like which()
  }
  return output;
}

// Hash-set lookup
// [[Rcpp::export]]
std::vector<int> which_in(IntegerVector x, IntegerVector y) {
  int nx = x.size();
  std::unordered_set<int> z(y.begin(), y.end());
  std::vector<int> output;
  output.reserve(nx);
  for (int i = 0; i < nx; ++i) {
    if (z.find(x[i]) != z.end()) {
      output.push_back(i + 1);
    }
  }
  return output;
}

// Binary search in a sorted copy of y
// [[Rcpp::export]]
std::vector<int> which_in2(IntegerVector x, IntegerVector y) {
  std::vector<int> y_sort(y.size());
  std::partial_sort_copy(y.begin(), y.end(), y_sort.begin(), y_sort.end());

  int nx = x.size();
  std::vector<int> out;
  for (int i = 0; i < nx; ++i) {
    std::vector<int>::iterator found =
      std::lower_bound(y_sort.begin(), y_sort.end(), x[i]);
    // lower_bound returns the first element >= x[i], so require an exact match
    if (found != y_sort.end() && *found == x[i]) {
      out.push_back(i + 1);
    }
  }
  return out;
}

/*** R
set.seed(123)
library(microbenchmark)
x <- sample(1:100)
y <- sample(1:10000, 1000)
identical( sugar_in(y, x), which(y %in% x) )
identical( which_in(y, x), which(y %in% x) )
identical( which_in2(y, x), which(y %in% x) )
microbenchmark(
  sugar_in(y, x),
  which_in(y, x),
  which_in2(y, x),
  which(y %in% x)
)
*/
```

Calling sourceCpp on this gives me, from the benchmark,

```
Unit: microseconds
            expr    min      lq  median      uq    max neval
  sugar_in(y, x)  7.590 10.0795 11.4825 14.3630 32.753   100
  which_in(y, x) 40.757 42.4460 43.4400 46.8240 63.690   100
 which_in2(y, x) 14.325 15.2365 16.7005 17.2620 30.580   100
 which(y %in% x) 17.070 21.6145 23.7070 29.0105 78.009   100
```

For this set of inputs, we can eke out a little more performance by using an approach that technically has higher algorithmic complexity (O(log n) vs. O(1) per lookup) but has lower constants: a binary search.

```cpp
// [[Rcpp::plugins("cpp11")]]
#include <Rcpp.h>
#include <algorithm>
#include <unordered_set>
#include <vector>
using namespace Rcpp;

// Hash-set lookup: O(1) expected per element of x
// [[Rcpp::export]]
std::vector<int> which_in(IntegerVector x, IntegerVector y) {
  int nx = x.size();
  std::unordered_set<int> z(y.begin(), y.end());
  std::vector<int> output;
  output.reserve(nx);
  for (int i = 0; i < nx; ++i) {
    if (z.find(x[i]) != z.end()) {
      output.push_back(i + 1);
    }
  }
  return output;
}

// Binary search in a sorted copy of y: O(log n) per element of x
// [[Rcpp::export]]
std::vector<int> which_in2(IntegerVector x, IntegerVector y) {
  std::vector<int> y_sort(y.size());
  std::partial_sort_copy(y.begin(), y.end(), y_sort.begin(), y_sort.end());

  int nx = x.size();
  std::vector<int> out;
  for (int i = 0; i < nx; ++i) {
    std::vector<int>::iterator found =
      std::lower_bound(y_sort.begin(), y_sort.end(), x[i]);
    // lower_bound returns the first element >= x[i], so require an exact match
    if (found != y_sort.end() && *found == x[i]) {
      out.push_back(i + 1);
    }
  }
  return out;
}

/*** R
set.seed(123)
library(microbenchmark)
x <- sample(1:100)
y <- sample(1:10000, 1000)
identical( which_in(y, x), which(y %in% x) )
identical( which_in2(y, x), which(y %in% x) )
microbenchmark(
  which_in(y, x),
  which_in2(y, x),
  which(y %in% x)
)
*/
```

On my computer, that yields

```
Unit: microseconds
            expr  min   lq median   uq  max neval
  which_in(y, x) 39.3 41.0   42.7 44.0 81.5   100
 which_in2(y, x) 12.8 13.6   14.4 15.0 23.8   100
 which(y %in% x) 16.8 20.2   21.0 21.9 31.1   100
```

so around 30% better than base R.
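For anyone who wants to try the binary-search idea outside of R, here is a minimal standalone sketch. It assumes plain std::vector<int> inputs rather than Rcpp types, and the name which_in_sorted is hypothetical (it is not part of Rcpp or of the answers above). One detail worth spelling out: std::lower_bound returns the first element that is not less than the key, so a hit has to be confirmed with an explicit equality check.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration: returns the 1-based positions of the
// elements of `x` that also occur in `table`, mirroring which(x %in% table).
// `table` is taken by value so the caller's data is left unsorted.
std::vector<int> which_in_sorted(const std::vector<int>& x,
                                 std::vector<int> table) {
  // Sort the lookup table once: O(m log m).
  std::sort(table.begin(), table.end());

  std::vector<int> out;
  for (std::size_t i = 0; i < x.size(); ++i) {
    // Binary search for the first element >= x[i]: O(log m) per lookup.
    auto it = std::lower_bound(table.begin(), table.end(), x[i]);
    // Only an exact match counts as membership.
    if (it != table.end() && *it == x[i]) {
      out.push_back(static_cast<int>(i) + 1);
    }
  }
  return out;
}
```

If only a yes/no answer per element is needed, std::binary_search performs the same lookup and the equality test in one call.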