c++ - Determining the minimum element and its position in each column of a matrix with CUDA Thrust

I have a fairly simple problem, but I cannot find an elegant solution to it.

I have Thrust code that produces c vectors of the same size containing values. Say each of these c vectors has an index. For each vector position, I would like to get the index of the c vector whose value at that position is the lowest:

Example:

    C0 = (0,10,20,3,40)
    C1 = (1,2 ,3 ,5,10)

As a result I would get a vector containing the index of the C vector that has the lowest value:

    result = (0,1 ,1 ,0,1)

I thought about doing it with Thrust zip iterators, but ran into problems: I could zip all the c vectors and implement an arbitrary transformation that takes a tuple and returns the index of its lowest value, but:

How do I iterate over the contents of a tuple?
As I understand it, tuples can only store up to 10 elements, and there can be many more than 10 c vectors.

Then I thought about doing it this way: instead of having c separate vectors, append them all into a single vector C, then generate keys referencing the positions and perform a stable sort by key, which regroups the entries from the same position together. In the example that would give:

    C      = (0,10,20,3,40,1,2,3,5,10)
    keys   = (0,1 ,2 ,3,4 ,0,1,2,3,4 )
    after stable sort by key:
    output = (0,1,10,2,20,3,3,5,40,10)
    keys   = (0,0,1 ,1,2 ,2,3,3,4 ,4 )

Then generate keys with the positions in the vector, zip the output with the indices of the c vectors, and perform a reduce by key with a custom functor that, for each reduction, outputs the index with the lowest value. In the example:

    input   = (0,1,10,2,20,3,3,5,40,10)
    indexes = (0,1,0 ,1,0 ,1,0,1,0 ,1)
    keys    = (0,0,1 ,1,2 ,2,3,3,4 ,4)
    after reduce by keys on zipped input and indexes:
    output  = (0,1,1,0,1)

However, how do I write such a functor for the reduce-by-key operation?

Since your vectors all have to be the same length, it is better to concatenate them and treat them as a matrix C.

Your problem then becomes finding the indices of the minimum element of each column in a row-major matrix. It can be solved as follows.

1. change the row-major layout to col-major;
2. find the indices for each column.

In step 1, you proposed using stable_sort_by_key to rearrange the element order, which is not an efficient method, since the rearrangement can be computed directly from the #row and #col of the matrix. In Thrust, it can be done with permutation iterators as:

    thrust::make_permutation_iterator(
        c.begin(),
        thrust::make_transform_iterator(
            thrust::make_counting_iterator((int) 0),
            (_1 % row) * col + _1 / row)
    )
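To make the index arithmetic above concrete, here is a minimal host-only sketch in plain C++ (no Thrust involved). The names `transposed_index` and `gather_by_columns` are hypothetical helpers introduced only for illustration; the formula itself is the one used in the permutation iterator.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Maps position i of a column-by-column traversal to the offset of the same
// element in row-major storage: (i % rows) * cols + i / rows.
std::size_t transposed_index(std::size_t i, std::size_t rows, std::size_t cols) {
    assert(rows > 0 && i < rows * cols);
    return (i % rows) * cols + i / rows;
}

// Hypothetical helper: reads the row-major matrix c in column-by-column order,
// which is what the permutation iterator does lazily on the device.
std::vector<float> gather_by_columns(const std::vector<float>& c,
                                     std::size_t rows, std::size_t cols) {
    assert(c.size() == rows * cols);
    std::vector<float> out(c.size());
    for (std::size_t i = 0; i < c.size(); ++i)
        out[i] = c[transposed_index(i, rows, cols)];
    return out;
}
```

With the 2 x 5 example from the question, this gather yields the same sequence the question obtained with its stable sort by key, without sorting anything.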

In step 2, reduce_by_key can do exactly what you want. In your case the binary-op functor for the reduction is easy, since comparison on tuples (the elements of your zipped vector) is already defined to compare the first element of the tuple, and it is supported by Thrust as

thrust::minimum< thrust::tuple<float, int> >()
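As a host-side analogy, the sketch below folds one column with `std::min` over `std::tuple<float, int>` values. It assumes that tuple comparison is lexicographic (value first, then index), which is how `std::tuple` behaves; `column_min` and `ValueIndex` are hypothetical names used only for this illustration and are not part of Thrust.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

using ValueIndex = std::tuple<float, int>;

// Folds one column with std::min over (value, row index) tuples.
// Tuples compare lexicographically, so the smallest value wins and,
// on equal values, the smaller row index wins.
ValueIndex column_min(const std::vector<float>& column) {
    assert(!column.empty());
    ValueIndex best(column[0], 0);
    for (std::size_t r = 1; r < column.size(); ++r)
        best = std::min(best, ValueIndex(column[r], static_cast<int>(r)));
    return best;
}
```

Because the index travels inside the tuple, no custom comparison functor is needed: the minimum tuple carries the row index of the minimum value along with it.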

The complete program is shown below. Thrust 1.6.0+ is required, since I use placeholders in fancy iterators.

    #include <iostream>
    #include <iterator>
    #include <algorithm>

    #include <thrust/device_vector.h>
    #include <thrust/iterator/counting_iterator.h>
    #include <thrust/iterator/transform_iterator.h>
    #include <thrust/iterator/permutation_iterator.h>
    #include <thrust/iterator/zip_iterator.h>
    #include <thrust/iterator/discard_iterator.h>
    #include <thrust/reduce.h>
    #include <thrust/functional.h>

    using namespace thrust::placeholders;

    int main()
    {
        const int row = 2;
        const int col = 5;
        float initc[] =
            { 0, 10, 20, 3, 40,
              1, 2, 3, 5, 10 };
        thrust::device_vector<float> c(initc, initc + row * col);

        thrust::device_vector<float> minval(col);
        thrust::device_vector<int> minidx(col);

        thrust::reduce_by_key(
            // keys: column index of each element in column-by-column order
            thrust::make_transform_iterator(
                thrust::make_counting_iterator((int) 0),
                _1 / row),
            thrust::make_transform_iterator(
                thrust::make_counting_iterator((int) 0),
                _1 / row) + row * col,
            // values: (matrix element, row index) read column by column
            thrust::make_zip_iterator(
                thrust::make_tuple(
                    thrust::make_permutation_iterator(
                        c.begin(),
                        thrust::make_transform_iterator(
                            thrust::make_counting_iterator((int) 0),
                            (_1 % row) * col + _1 / row)),
                    thrust::make_transform_iterator(
                        thrust::make_counting_iterator((int) 0),
                        _1 % row))),
            // output keys are not needed
            thrust::make_discard_iterator(),
            // output values: (minimum value, row index of the minimum)
            thrust::make_zip_iterator(
                thrust::make_tuple(
                    minval.begin(),
                    minidx.begin())),
            thrust::equal_to<int>(),
            thrust::minimum<thrust::tuple<float, int> >()
        );

        std::copy(minidx.begin(), minidx.end(),
                  std::ostream_iterator<int>(std::cout, " "));
        std::cout << std::endl;
        return 0;
    }

Two remaining issues may affect performance.

the minimum values have to be output, even though they are not required;
reduce_by_key is designed for segments of varying length, so it may not be the fastest algorithm for reducing segments that all have the same length.

Writing your own kernel could be the best solution for the highest performance.
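For reference, the work such a kernel would have to do can be sketched on the host in plain C++. This is only an illustration under the assumption of one GPU thread per column, where each iteration of the outer loop corresponds to one thread; `column_argmin` is a hypothetical name, and this is not an actual CUDA kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// For each column j of a row-major rows x cols matrix, returns the row index
// of the minimum element. Only the indices are produced, and every column has
// the same fixed length, which are the two points raised above.
std::vector<int> column_argmin(const std::vector<float>& c,
                               std::size_t rows, std::size_t cols) {
    assert(rows > 0 && c.size() == rows * cols);
    std::vector<int> result(cols);
    for (std::size_t j = 0; j < cols; ++j) {      // one "thread" per column
        std::size_t best = 0;
        for (std::size_t r = 1; r < rows; ++r)    // strided walk down column j
            if (c[r * cols + j] < c[best * cols + j]) best = r;
        result[j] = static_cast<int>(best);
    }
    return result;
}
```

On the example from the question this gives (0,1,1,0,1). With this layout, neighbouring columns read neighbouring addresses at each row step, which is the access pattern one would normally want from adjacent threads.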

I was curious to test which of the approaches above was faster. So I implemented Robert Crovella's idea in the code below, which, for completeness, also includes Eric's approach.

    #include <cstdio>
    #include <iostream>
    #include <iterator>
    #include <algorithm>

    #include <thrust/random.h>
    #include <thrust/device_vector.h>
    #include <thrust/iterator/counting_iterator.h>
    #include <thrust/iterator/constant_iterator.h>
    #include <thrust/iterator/transform_iterator.h>
    #include <thrust/iterator/permutation_iterator.h>
    #include <thrust/iterator/zip_iterator.h>
    #include <thrust/iterator/discard_iterator.h>
    #include <thrust/transform.h>
    #include <thrust/reduce.h>
    #include <thrust/functional.h>
    #include <thrust/sort.h>

    #include "TimingGPU.cuh"

    using namespace thrust::placeholders;

    template <typename Iterator>
    class strided_range
    {
        public:

        typedef typename thrust::iterator_difference<Iterator>::type difference_type;

        struct stride_functor : public thrust::unary_function<difference_type,difference_type>
        {
            difference_type stride;

            stride_functor(difference_type stride)
                : stride(stride) {}

            __host__ __device__
            difference_type operator()(const difference_type& i) const
            {
                return stride * i;
            }
        };

        typedef typename thrust::counting_iterator<difference_type>                   CountingIterator;
        typedef typename thrust::transform_iterator<stride_functor, CountingIterator> TransformIterator;
        typedef typename thrust::permutation_iterator<Iterator,TransformIterator>     PermutationIterator;

        // type of the strided_range iterator
        typedef PermutationIterator iterator;

        // construct strided_range for the range [first,last)
        strided_range(Iterator first, Iterator last, difference_type stride)
            : first(first), last(last), stride(stride) {}

        iterator begin(void) const
        {
            return PermutationIterator(first, TransformIterator(CountingIterator(0), stride_functor(stride)));
        }

        iterator end(void) const
        {
            return begin() + ((last - first) + (stride - 1)) / stride;
        }

        protected:
        Iterator first;
        Iterator last;
        difference_type stride;
    };

    /*****************************************************************/
    /* CONVERT LINEAR INDEX TO COLUMN INDEX - NEEDED FOR APPROACH #2 */
    /*****************************************************************/
    template< typename T >
    struct mod_functor {
        __host__ __device__ T operator()(T a, T b) { return a % b; }
    };

    /********/
    /* MAIN */
    /********/
    int main()
    {
        /***********************/
        /* SETTING THE PROBLEM */
        /***********************/
        const int Nrows = 200;
        const int Ncols = 200;

        // --- Random uniform integer distribution between 10 and 99
        thrust::default_random_engine rng;
        thrust::uniform_int_distribution<int> dist(10, 99);

        // --- Matrix allocation and initialization
        thrust::device_vector<float> d_matrix(Nrows * Ncols);
        for (size_t i = 0; i < d_matrix.size(); i++) d_matrix[i] = (float)dist(rng);

        TimingGPU timerGPU;

        /******************/
        /* APPROACH NR. 1 */
        /******************/
        timerGPU.StartCounter();

        thrust::device_vector<float> d_min_values(Ncols);
        thrust::device_vector<int>   d_min_indices_1(Ncols);

        thrust::reduce_by_key(
            thrust::make_transform_iterator(
                thrust::make_counting_iterator((int) 0),
                _1 / Nrows),
            thrust::make_transform_iterator(
                thrust::make_counting_iterator((int) 0),
                _1 / Nrows) + Nrows * Ncols,
            thrust::make_zip_iterator(
                thrust::make_tuple(
                    thrust::make_permutation_iterator(
                        d_matrix.begin(),
                        thrust::make_transform_iterator(
                            thrust::make_counting_iterator((int) 0),
                            (_1 % Nrows) * Ncols + _1 / Nrows)),
                    thrust::make_transform_iterator(
                        thrust::make_counting_iterator((int) 0),
                        _1 % Nrows))),
            thrust::make_discard_iterator(),
            thrust::make_zip_iterator(
                thrust::make_tuple(
                    d_min_values.begin(),
                    d_min_indices_1.begin())),
            thrust::equal_to<int>(),
            thrust::minimum<thrust::tuple<float, int> >()
        );

        printf("Timing for approach #1 = %f\n", timerGPU.GetCounter());

        /******************/
        /* APPROACH NR. 2 */
        /******************/
        timerGPU.StartCounter();

        // --- Computing row indices vector
        thrust::device_vector<int> d_row_indices(Nrows * Ncols);
        thrust::transform(thrust::make_counting_iterator(0),
                          thrust::make_counting_iterator(Nrows * Ncols),
                          thrust::make_constant_iterator(Ncols),
                          d_row_indices.begin(),
                          thrust::divides<int>());

        // --- Computing column indices vector
        thrust::device_vector<int> d_column_indices(Nrows * Ncols);
        thrust::transform(thrust::make_counting_iterator(0),
                          thrust::make_counting_iterator(Nrows * Ncols),
                          thrust::make_constant_iterator(Ncols),
                          d_column_indices.begin(),
                          mod_functor<int>());

        // --- int and float iterators
        typedef thrust::device_vector<int>::iterator   IntIterator;
        typedef thrust::device_vector<float>::iterator FloatIterator;

        // --- Relevant tuples of int and float iterators
        typedef thrust::tuple<IntIterator, IntIterator>   IteratorTuple1;
        typedef thrust::tuple<FloatIterator, IntIterator> IteratorTuple2;

        // --- zip_iterator of the relevant tuples
        typedef thrust::zip_iterator<IteratorTuple1> ZipIterator1;
        typedef thrust::zip_iterator<IteratorTuple2> ZipIterator2;

        // --- zip_iterator creation
        ZipIterator1 iter1(thrust::make_tuple(d_column_indices.begin(), d_row_indices.begin()));

        // --- First sort: matrix values as keys, (column, row) indices as values
        thrust::stable_sort_by_key(d_matrix.begin(), d_matrix.end(), iter1);

        ZipIterator2 iter2(thrust::make_tuple(d_matrix.begin(), d_row_indices.begin()));

        // --- Second sort: column indices as keys, (value, row) as values
        thrust::stable_sort_by_key(d_column_indices.begin(), d_column_indices.end(), iter2);

        typedef thrust::device_vector<int>::iterator Iterator;

        // --- Strided access to the sorted array
        strided_range<Iterator> d_min_indices_2(d_row_indices.begin(), d_row_indices.end(), Nrows);

        printf("Timing for approach #2 = %f\n", timerGPU.GetCounter());

        printf("\n\n");
        std::copy(d_min_indices_2.begin(), d_min_indices_2.end(), std::ostream_iterator<int>(std::cout, " "));
        std::cout << std::endl;

        return 0;
    }

Testing the two approaches on matrices of size 2000x2000, this was the result on a Kepler K20c card:

    Eric's            : 8.4s
    Robert Crovella's : 33.4s

One possible idea, building on the vectorized sort idea here

1. Suppose I have vectors like this:

    values:   C = ( 0,10,20, 3,40, 1, 2, 3, 5,10)
    keys:     K = ( 0, 1, 2, 3, 4, 0, 1, 2, 3, 4)
    segments: S = ( 0, 0, 0, 0, 0, 1, 1, 1, 1, 1)

2. zip together K and S to create KS
3. stable_sort_by_key using C as the keys and KS as the values:

    stable_sort_by_key(C.begin(), C.end(), KS_begin);

4. zip together the reordered C and K vectors to create CK
5. stable_sort_by_key using the reordered S as the keys and CK as the values:

    stable_sort_by_key(S.begin(), S.end(), CK_begin);

6. use a permutation iterator or a strided range iterator to access every Nth element (0, N, 2N, ...) of the newly reordered K vector, to retrieve a vector of the indices of the min element in each segment, where N is the length of the segments.
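The two-sort idea can be checked on the host with a minimal plain C++ sketch. It uses `std::stable_sort` on a permutation of positions instead of Thrust's `stable_sort_by_key` on zipped vectors, so it is an analogy rather than the device code; `segmented_argmin_by_sorting` is a hypothetical name, and K and S are derived from the position as `i % N` and `i / N`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Returns, for each length-N segment of C, the key (position within the
// segment) of its minimum element, using two stable sorts.
std::vector<int> segmented_argmin_by_sorting(const std::vector<float>& C,
                                             std::size_t N) {
    assert(N > 0 && C.size() % N == 0);
    std::vector<std::size_t> order(C.size());
    std::iota(order.begin(), order.end(), std::size_t{0});

    // First sort: order all positions by value.
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) { return C[a] < C[b]; });
    // Second sort: regroup by segment S = i / N; stability keeps each
    // segment ordered by value, so its minimum comes first.
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) { return a / N < b / N; });

    // Strided read: every N-th entry is the start of a segment.
    std::vector<int> result;
    for (std::size_t i = 0; i < order.size(); i += N)
        result.push_back(static_cast<int>(order[i] % N));  // K = i % N
    return result;
}
```

For the vectors listed in step 1 (N = 5), both segments have their minimum at key 0, so the result is (0, 0). Reading the last element of each segment instead of the first would give the position of the maximum.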

I haven't implemented this; right now it's just an idea. Maybe it won't work for some reason I haven't noticed yet.

segments (S) and keys (K) are effectively row and column indices.

And your question seems odd to me, because your title mentions "find the index of the max value", but most of your question seems to refer to the "lowest value". Regardless, with a change to step 6 of my algorithm, you can find either value.