
NumPy: shuffle a multidimensional array by row only, keeping the column order unchanged

After a small experiment, I found that the most memory- and time-efficient way to shuffle the rows of an nd-array is to shuffle an index array and then fetch the data through the shuffled index:

rand_num2 = np.random.randint(5, size=(6000, 2000))
perm = np.arange(rand_num2.shape[0])
np.random.shuffle(perm)
rand_num2 = rand_num2[perm]
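As a quick sanity check of the index-based approach, the sketch below uses a small array and a hypothetical helper named shuffle_rows_by_index (not part of the original answer). It suggests that every row survives with its column order intact, and that indexing with an integer array returns a new array rather than modifying the input.

```python
import numpy as np

def shuffle_rows_by_index(a):
    """Return a copy of `a` with its rows in random order (hypothetical helper)."""
    perm = np.arange(a.shape[0])
    np.random.shuffle(perm)   # shuffle only the small index array
    return a[perm]            # integer-array indexing builds a new array

data = np.arange(12).reshape(6, 2)
shuffled = shuffle_rows_by_index(data)

# Same rows as before, each with its columns in the original order.
same_rows = sorted(map(tuple, shuffled)) == sorted(map(tuple, data))
# The result does not share memory with the input.
is_copy = not np.shares_memory(shuffled, data)
print(same_rows, is_copy)
```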

In more detail

Here I am using memory_profiler to measure memory usage and Python's built-in time module to record the run time, comparing all of the previous answers:

import time

import numpy as np
from memory_profiler import profile


@profile
def main():
    # shuffle data itself
    rand_num = np.random.randint(5, size=(6000, 2000))
    start = time.time()
    np.random.shuffle(rand_num)
    print('Time for direct shuffle: {0}'.format((time.time() - start)))

    # Shuffle index and get data from shuffled index
    rand_num2 = np.random.randint(5, size=(6000, 2000))
    start = time.time()
    perm = np.arange(rand_num2.shape[0])
    np.random.shuffle(perm)
    rand_num2 = rand_num2[perm]
    print('Time for shuffling index: {0}'.format((time.time() - start)))

    # using np.take()
    rand_num3 = np.random.randint(5, size=(6000, 2000))
    start = time.time()
    np.take(rand_num3, np.random.rand(rand_num3.shape[0]).argsort(), axis=0, out=rand_num3)
    print("Time taken by np.take, {0}".format((time.time() - start)))


if __name__ == '__main__':
    main()

Timing results

Time for direct shuffle: 0.03345608711242676     # 33.4 msec
Time for shuffling index: 0.019818782806396484   # 19.8 msec
Time taken by np.take, 0.06726956367492676       # 67.2 msec

Memory profiler results

Line #    Mem usage    Increment   Line Contents
================================================
    39  117.422 MiB    0.000 MiB   @profile
    40                             def main():
    41                                 # shuffle data itself
    42  208.977 MiB   91.555 MiB       rand_num = np.random.randint(5, size=(6000, 2000))
    43  208.977 MiB    0.000 MiB       start = time.time()
    44  208.977 MiB    0.000 MiB       np.random.shuffle(rand_num)
    45  208.977 MiB    0.000 MiB       print('Time for direct shuffle: {0}'.format((time.time() - start)))
    46
    47                                 # Shuffle index and get data from shuffled index
    48  300.531 MiB   91.555 MiB       rand_num2 = np.random.randint(5, size=(6000, 2000))
    49  300.531 MiB    0.000 MiB       start = time.time()
    50  300.535 MiB    0.004 MiB       perm = np.arange(rand_num2.shape[0])
    51  300.539 MiB    0.004 MiB       np.random.shuffle(perm)
    52  300.539 MiB    0.000 MiB       rand_num2 = rand_num2[perm]
    53  300.539 MiB    0.000 MiB       print('Time for shuffling index: {0}'.format((time.time() - start)))
    54
    55                                 # using np.take()
    56  392.094 MiB   91.555 MiB       rand_num3 = np.random.randint(5, size=(6000, 2000))
    57  392.094 MiB    0.000 MiB       start = time.time()
    58  392.242 MiB    0.148 MiB       np.take(rand_num3, np.random.rand(rand_num3.shape[0]).argsort(), axis=0, out=rand_num3)
    59  392.242 MiB    0.000 MiB       print("Time taken by np.take, {0}".format((time.time() - start)))

How can I shuffle a multidimensional array by row only in Python (that is, without shuffling the columns)?

I am looking for the most efficient solution, because my matrix is very large. Is it also possible to do this efficiently in place on the original array (to save memory)?

Example:

import numpy as np
X = np.random.random((6, 2))
print(X)
Y = ...  # ??? shuffle by row only, not the columns ???
print(Y)

What I expect, given this original matrix:

[[ 0.48252164  0.12013048]
 [ 0.77254355  0.74382174]
 [ 0.45174186  0.8782033 ]
 [ 0.75623083  0.71763107]
 [ 0.26809253  0.75144034]
 [ 0.23442518  0.39031414]]

The output should have the rows shuffled but not the columns, for example:

[[ 0.45174186  0.8782033 ]
 [ 0.48252164  0.12013048]
 [ 0.77254355  0.74382174]
 [ 0.75623083  0.71763107]
 [ 0.23442518  0.39031414]
 [ 0.26809253  0.75144034]]

That is what numpy.random.shuffle() is for:

>>> X = np.random.random((6, 2))
>>> X
array([[ 0.9818058 ,  0.67513579],
       [ 0.82312674,  0.82768118],
       [ 0.29468324,  0.59305925],
       [ 0.25731731,  0.16676408],
       [ 0.27402974,  0.55215778],
       [ 0.44323485,  0.78779887]])
>>> np.random.shuffle(X)
>>> X
array([[ 0.9818058 ,  0.67513579],
       [ 0.44323485,  0.78779887],
       [ 0.82312674,  0.82768118],
       [ 0.29468324,  0.59305925],
       [ 0.25731731,  0.16676408],
       [ 0.27402974,  0.55215778]])
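For readers on a recent NumPy, the same in-place row shuffle can probably be written with the newer Generator interface. This is a minimal sketch, not part of the original answer, and it assumes NumPy 1.18 or later (where Generator.shuffle accepts an axis argument). The seed and the small array are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # assumption: seed fixed only for repeatability
X = np.arange(12, dtype=float).reshape(6, 2)
rows_before = sorted(map(tuple, X))

# Shuffle in place along axis 0: whole rows move, columns stay paired.
rng.shuffle(X, axis=0)

rows_after = sorted(map(tuple, X))
print(rows_after == rows_before)
```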

You can also use np.random.permutation to generate a random permutation of the row indices and then index into the rows of X using np.take with axis=0. In addition, np.take can write its result back into the input array X through its out= option, which would save memory. The implementation would look like this:

np.take(X,np.random.permutation(X.shape[0]),axis=0,out=X)

Sample run

In [23]: X
Out[23]:
array([[ 0.60511059,  0.75001599],
       [ 0.30968339,  0.09162172],
       [ 0.14673218,  0.09089028],
       [ 0.31663128,  0.10000309],
       [ 0.0957233 ,  0.96210485],
       [ 0.56843186,  0.36654023]])

In [24]: np.take(X,np.random.permutation(X.shape[0]),axis=0,out=X);

In [25]: X
Out[25]:
array([[ 0.14673218,  0.09089028],
       [ 0.31663128,  0.10000309],
       [ 0.30968339,  0.09162172],
       [ 0.56843186,  0.36654023],
       [ 0.0957233 ,  0.96210485],
       [ 0.60511059,  0.75001599]])
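To check that writing into out=X yields exactly the permuted rows, the sketch below keeps a copy of the input purely for comparison. The variable names before and perm are illustrative. One caveat worth keeping in mind: the NumPy documentation for np.take notes that out is always buffered when mode='raise' (the default), so the actual memory saving may be smaller than the call suggests.

```python
import numpy as np

X = np.random.random((6, 2))
before = X.copy()  # kept only so the result can be verified

perm = np.random.permutation(X.shape[0])
np.take(X, perm, axis=0, out=X)  # result is written back into X

# Row i of the result should be row perm[i] of the original.
matches = np.array_equal(X, before[perm])
print(matches)
```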

Further performance boost

Here is a trick to speed up np.random.permutation(X.shape[0]) using np.argsort():

np.random.rand(X.shape[0]).argsort()
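The trick relies on the fact that the sort order of independent uniform random draws is itself a random permutation of 0..n-1, with ties being vanishingly unlikely for floating-point draws. A small check, with n chosen arbitrarily:

```python
import numpy as np

n = 10  # arbitrary size for illustration
perm = np.random.rand(n).argsort()

# argsort returns each index 0..n-1 exactly once, in the order that sorts the draws.
is_permutation = sorted(perm.tolist()) == list(range(n))
print(is_permutation)
```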

Speedup results:

In [32]: X = np.random.random((6000, 2000))

In [33]: %timeit np.random.permutation(X.shape[0])
1000 loops, best of 3: 510 µs per loop

In [34]: %timeit np.random.rand(X.shape[0]).argsort()
1000 loops, best of 3: 297 µs per loop

Thus, the shuffling solution could be modified to:

np.take(X,np.random.rand(X.shape[0]).argsort(),axis=0,out=X)

Runtime tests

These tests cover the two approaches listed in this post and the np.random.shuffle-based one from @Kasramvd's solution.

In [40]: X = np.random.random((6000, 2000))

In [41]: %timeit np.random.shuffle(X)
10 loops, best of 3: 25.2 ms per loop

In [42]: %timeit np.take(X,np.random.permutation(X.shape[0]),axis=0,out=X)
10 loops, best of 3: 53.3 ms per loop

In [43]: %timeit np.take(X,np.random.rand(X.shape[0]).argsort(),axis=0,out=X)
10 loops, best of 3: 53.2 ms per loop

So the np.take-based approaches seem worth using only if memory is a concern; otherwise, the np.random.shuffle-based solution looks like the way to go.
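Timings like these depend on the NumPy version and the machine, so it may be worth re-running the comparison locally. The sketch below is one way to do that outside IPython with the standard timeit module. The helper time_ms is hypothetical, and the array is deliberately smaller than the 6000 x 2000 used in the tests so that it finishes quickly.

```python
import timeit

import numpy as np

def time_ms(fn, repeat=3, number=5):
    """Best average time per call, in milliseconds (hypothetical helper)."""
    return min(timeit.repeat(fn, repeat=repeat, number=number)) / number * 1e3

X = np.random.random((600, 200))  # assumption: reduced size for a quick run

results = {
    "np.random.shuffle": time_ms(lambda: np.random.shuffle(X)),
    "np.take + permutation": time_ms(
        lambda: np.take(X, np.random.permutation(X.shape[0]), axis=0, out=X)),
    "np.take + argsort": time_ms(
        lambda: np.take(X, np.random.rand(X.shape[0]).argsort(), axis=0, out=X)),
}

for name, ms in results.items():
    print("{0}: {1:.3f} ms".format(name, ms))
```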