algorithm - simple - Algoritmo para el muestreo sin reemplazo?

muestreo aleatorio simple pdf (6)

Estoy tratando de probar la probabilidad de que se haya producido una agrupación de datos en particular por casualidad. Una forma sólida de hacerlo es la simulación de Monte Carlo, en la que las asociaciones entre los datos y los grupos se reasignan aleatoriamente una gran cantidad de veces (por ejemplo, 10 000) y se usa una métrica de agrupamiento para comparar los datos reales con las simulaciones para determinar una valor.

Tengo la mayor parte de esto funcionando, con punteros mapeando la agrupación a los elementos de datos, por lo que planeo reasignar punteros aleatoriamente a los datos. LA PREGUNTA: ¿cuál es una forma rápida de muestrear sin reemplazo, de modo que cada apuntador se reasigna aleatoriamente en los conjuntos de datos duplicados?

Por ejemplo (estos datos son solo un ejemplo simplificado):

Datos (n = 12 valores) - Grupo A: 0.1, 0.2, 0.4 / Grupo B: 0.5, 0.6, 0.8 / Grupo C: 0.4, 0.5 / Grupo D: 0.2, 0.2, 0.3, 0.5

Para cada conjunto de datos duplicados, tendría los mismos tamaños de clúster (A = 3, B = 3, C = 2, D = 4) y valores de datos, pero reasignaría los valores a los clústeres.

Para hacer esto, podría generar números aleatorios en el rango 1-12, asignar el primer elemento del grupo A, luego generar números aleatorios en el rango 1-11 y asignar el segundo elemento en el grupo A, y así sucesivamente. La reasignación del puntero es rápida, y habrá preasignado todas las estructuras de datos, pero el muestreo sin reemplazo parece un problema que podría haberse resuelto muchas veces anteriormente.

Lógica o pseudocódigo preferido.

Aquí hay un código para el muestreo sin reemplazo basado en el Algoritmo 3.4.2S del algoritmo Seminumeric de Knuth.

void SampleWithoutReplacement ( int populationSize, // size of set sampling from int sampleSize, // size of each sample vector<int> & samples // output, zero-offset indicies to selected items ) { // Use Knuth''s variable names int& n = sampleSize; int& N = populationSize; int t = 0; // total input records dealt with int m = 0; // number of items selected so far double u; while (m < n) { u = GetUniform(); // call a uniform(0,1) random number generator if ( (N - t)*u >= n - m ) { t++; } else { samples[m] = t; t++; m++; } } }

Hay un método más eficiente pero más complejo de Jeffrey Scott Vitter en "Un algoritmo eficiente para muestreo aleatorio secuencial", Transacciones de ACM en software matemático, 13 (1), marzo de 1987, 58-67.

Otro algoritmo para el muestreo sin reemplazo se describe aquí .

Es similar al descrito por John D. Cook en su respuesta y también de Knuth, pero tiene diferentes hipótesis: El tamaño de la población es desconocido, pero la muestra puede caber en la memoria. Este se llama "Algoritmo S de Knuth".

Citando el artículo de rosettacode:

Seleccione los primeros n elementos como la muestra a medida que estén disponibles;
Para el ítem i-ésimo donde i> n, tiene una posibilidad aleatoria de n / i de mantenerlo. Si falla esta oportunidad, la muestra sigue siendo la misma. Si no, hágalo aleatoriamente (1 / n) reemplazando uno de los n elementos seleccionados previamente de la muestra.
Repita # 2 para cualquier artículo subsecuente.

Inspirado por la respuesta de @John D. Cook , escribí una implementación en Nim. Al principio tuve dificultades para entender cómo funciona, así que comenté exhaustivamente y también incluí un ejemplo. Quizás ayuda entender la idea. Además, he cambiado ligeramente los nombres de las variables.

iterator uniqueRandomValuesBelow*(N, M: int) = ## Returns a total of M unique random values i with 0 <= i < N ## These indices can be used to construct e.g. a random sample without replacement assert(M <= N) var t = 0 # total input records dealt with var m = 0 # number of items selected so far while (m < M): let u = random(1.0) # call a uniform(0,1) random number generator # meaning of the following terms: # (N - t) is the total number of remaining draws left (initially just N) # (M - m) is the number how many of these remaining draw must be positive (initially just M) # => Probability for next draw = (M-m) / (N-t) # i.e.: (required positive draws left) / (total draw left) # # This is implemented by the inequality expression below: # - the larger (M-m), the larger the probability of a positive draw # - for (N-t) == (M-m), the term on the left is always smaller => we will draw 100% # - for (N-t) >> (M-m), we must get a very small u # # example: (N-t) = 7, (M-m) = 5 # => we draw the next with prob 5/7 # lets assume the draw fails # => t += 1 => (N-t) = 6 # => we draw the next with prob 5/6 # lets assume the draw succeeds # => t += 1, m += 1 => (N-t) = 5, (M-m) = 4 # => we draw the next with prob 4/5 # lets assume the draw fails # => t += 1 => (N-t) = 4 # => we draw the next with prob 4/4, i.e., # we will draw with certainty from now on # (in the next steps we get prob 3/3, 2/2, ...) if (N - t)*u >= (M - m).toFloat: # this is essentially a draw with P = (M-m) / (N-t) # no draw -- happens mainly for (N-t) >> (M-m) and/or high u t += 1 else: # draw t -- happens when (M-m) gets large and/or low u yield t # this is where we output an index, can be used to sample t += 1 m += 1 # example use for i in uniqueRandomValuesBelow(100, 5): echo i

Un código de trabajo en C ++ basado en la respuesta de John D. Cook .

#include <random> #include <vector> double GetUniform() { static std::default_random_engine re; static std::uniform_real_distribution<double> Dist(0,1); return Dist(re); } // John D. Cook, https://.com/a/311716/15485 void SampleWithoutReplacement ( int populationSize, // size of set sampling from int sampleSize, // size of each sample std::vector<int> & samples // output, zero-offset indicies to selected items ) { // Use Knuth''s variable names int& n = sampleSize; int& N = populationSize; int t = 0; // total input records dealt with int m = 0; // number of items selected so far double u; while (m < n) { u = GetUniform(); // call a uniform(0,1) random number generator if ( (N - t)*u >= n - m ) { t++; } else { samples[m] = t; t++; m++; } } } #include <iostream> int main(int,char**) { const size_t sz = 10; std::vector< int > samples(sz); SampleWithoutReplacement(10*sz,sz,samples); for (size_t i = 0; i < sz; i++ ) { std::cout << samples[i] << "/t"; } return 0; }

Ver mi respuesta a esta pregunta ¿ Números aleatorios únicos (no repetitivos) en O (1)? . La misma lógica debería lograr lo que estás buscando hacer.

Cuando el tamaño de la población es mucho mayor que el tamaño de la muestra, los algoritmos anteriores se vuelven ineficaces, ya que tienen complejidad O ( n ), siendo n el tamaño de la población.

Cuando era estudiante, escribí algunos algoritmos para el muestreo uniforme sin reemplazo, que tienen una complejidad promedio O ( s log s ), donde s es el tamaño de la muestra. Aquí está el código para el algoritmo de árbol binario, con complejidad promedio O ( s log s ), en R:

# The Tree growing algorithm for uniform sampling without replacement # by Pavel Ruzankin quicksample = function (n,size) # n - the number of items to choose from # size - the sample size { s=as.integer(size) if (s>n) { stop("Sample size is greater than the number of items to choose from") } # upv=integer(s) #level up edge is pointing to leftv=integer(s) #left edge is poiting to; must be filled with zeros rightv=integer(s) #right edge is pointig to; must be filled with zeros samp=integer(s) #the sample ordn=integer(s) #relative ordinal number ordn[1L]=1L #initial value for the root vertex samp[1L]=sample(n,1L) if (s > 1L) for (j in 2L:s) { curn=sample(n-j+1L,1L) #current number sampled curordn=0L #currend ordinal number v=1L #current vertice from=1L #how have come here: 0 - by left edge, 1 - by right edge repeat { curordn=curordn+ordn[v] if (curn+curordn>samp[v]) { #going down by the right edge if (from == 0L) { ordn[v]=ordn[v]-1L } if (rightv[v]!=0L) { v=rightv[v] from=1L } else { #creating a new vertex samp[j]=curn+curordn ordn[j]=1L # upv[j]=v rightv[v]=j break } } else { #going down by the left edge if (from==1L) { ordn[v]=ordn[v]+1L } if (leftv[v]!=0L) { v=leftv[v] from=0L } else { #creating a new vertex samp[j]=curn+curordn-1L ordn[j]=-1L # upv[j]=v leftv[v]=j break } } } } return(samp) }

La complejidad de este algoritmo se discute en: Rouzankin, PS; Voytishek, AV Sobre el costo de los algoritmos para la selección al azar. Monte Carlo Methods Appl. 5 (1999), no. 1, 39-54. http://dx.doi.org/10.1515/mcma.1999.5.1.39

Si encuentra que el algoritmo es útil, por favor haga una referencia.

Ver también: P. Gupta, GP Bhattacharjee. (1984) Un algoritmo eficiente para muestreo aleatorio sin reemplazo. Revista Internacional de Informática Matemática 16: 4, páginas 201-209. DOI: 10.1080 / 00207168408803438

Teuhola, J. y Nevalainen, O. 1982. Dos algoritmos eficientes para muestreo aleatorio sin reemplazo. / IJCM /, 11 (2): 127-140. DOI: 10.1080 / 00207168208803304

En el último documento, los autores utilizan tablas hash y afirman que sus algoritmos tienen O ( s ) complejidad. Existe otro algoritmo de tabla hash rápido, que pronto se implementará en pqR (R bastante rápido): https://stat.ethz.ch/pipermail/r-devel/2017-October/075012.html