python - Optimizar un codificador de reed-solomon(división polinomial)

numpy optimization (3)

Estoy tratando de optimizar un codificador Reed-Solomon, que de hecho es simplemente una operación de división polinomial sobre Galois Fields 2 ^ 8 (lo que simplemente significa que los valores envuelven más de 255). El código es de hecho muy similar a lo que se puede encontrar aquí para Go: http://research.swtch.com/field

El algoritmo para la división polinomial utilizado aquí es una división sintética (también llamada método de Horner).

Intenté todo: numpy, pypy, cython. El mejor rendimiento que obtengo es usar pypy con este simple bucle anidado:

def rsenc(msg_in, nsym, gen): ''''''Reed-Solomon encoding using polynomial division, better explained at http://research.swtch.com/field'''''' msg_out = bytearray(msg_in) + bytearray(len(gen)-1) lgen = bytearray([gf_log[gen[j]] for j in xrange(len(gen))]) for i in xrange(len(msg_in)): coef = msg_out[i] # coef = gf_mul(msg_out[i], gf_inverse(gen[0])) // for general polynomial division (when polynomials are non-monic), we need to compute: coef = msg_out[i] / gen[0] if coef != 0: # coef 0 is normally undefined so we manage it manually here (and it also serves as an optimization btw) lcoef = gf_log[coef] # precaching for j in xrange(1, len(gen)): # optimization: can skip g0 because the first coefficient of the generator is always 1! (that''s why we start at position 1) msg_out[i + j] ^= gf_exp[lcoef + lgen[j]] # equivalent (in Galois Field 2^8) to msg_out[i+j] += msg_out[i] * gen[j] # Recopy the original message bytes msg_out[:len(msg_in)] = msg_in return msg_out

¿Puede un asistente de optimización de Python guiarme a algunas pistas sobre cómo obtener una aceleración? Mi objetivo es obtener al menos una aceleración de 3x, pero más sería increíble. Se acepta cualquier enfoque o herramienta, siempre que sea multiplataforma (funciona al menos en Linux y Windows).

Aquí hay un pequeño script de prueba con algunas de las otras alternativas que probé (¡el intento de cython no está incluido porque era más lento que el python nativo!):

import random from operator import xor numpy_enabled = False try: import numpy as np numpy_enabled = True except ImportError: pass # Exponent table for 3, a generator for GF(256) gf_exp = bytearray([1, 3, 5, 15, 17, 51, 85, 255, 26, 46, 114, 150, 161, 248, 19, 53, 95, 225, 56, 72, 216, 115, 149, 164, 247, 2, 6, 10, 30, 34, 102, 170, 229, 52, 92, 228, 55, 89, 235, 38, 106, 190, 217, 112, 144, 171, 230, 49, 83, 245, 4, 12, 20, 60, 68, 204, 79, 209, 104, 184, 211, 110, 178, 205, 76, 212, 103, 169, 224, 59, 77, 215, 98, 166, 241, 8, 24, 40, 120, 136, 131, 158, 185, 208, 107, 189, 220, 127, 129, 152, 179, 206, 73, 219, 118, 154, 181, 196, 87, 249, 16, 48, 80, 240, 11, 29, 39, 105, 187, 214, 97, 163, 254, 25, 43, 125, 135, 146, 173, 236, 47, 113, 147, 174, 233, 32, 96, 160, 251, 22, 58, 78, 210, 109, 183, 194, 93, 231, 50, 86, 250, 21, 63, 65, 195, 94, 226, 61, 71, 201, 64, 192, 91, 237, 44, 116, 156, 191, 218, 117, 159, 186, 213, 100, 172, 239, 42, 126, 130, 157, 188, 223, 122, 142, 137, 128, 155, 182, 193, 88, 232, 35, 101, 175, 234, 37, 111, 177, 200, 67, 197, 84, 252, 31, 33, 99, 165, 244, 7, 9, 27, 45, 119, 153, 176, 203, 70, 202, 69, 207, 74, 222, 121, 139, 134, 145, 168, 227, 62, 66, 198, 81, 243, 14, 18, 54, 90, 238, 41, 123, 141, 140, 143, 138, 133, 148, 167, 242, 13, 23, 57, 75, 221, 124, 132, 151, 162, 253, 28, 36, 108, 180, 199, 82, 246] * 2 + [1]) # Logarithm table, base 3 gf_log = bytearray([0, 0, 25, 1, 50, 2, 26, 198, 75, 199, 27, 104, 51, 238, 223, # BEWARE: the first entry should be None instead of 0 because it''s undefined, but for a bytearray we can''t set such a value 3, 100, 4, 224, 14, 52, 141, 129, 239, 76, 113, 8, 200, 248, 105, 28, 193, 125, 194, 29, 181, 249, 185, 39, 106, 77, 228, 166, 114, 154, 201, 9, 120, 101, 47, 138, 5, 33, 15, 225, 36, 18, 240, 130, 69, 53, 147, 218, 142, 150, 143, 219, 189, 54, 208, 206, 148, 19, 92, 210, 241, 64, 70, 131, 56, 102, 221, 253, 48, 191, 6, 139, 98, 179, 37, 226, 152, 34, 136, 145, 16, 126, 110, 72, 195, 163, 182, 30, 66, 58, 107, 40, 84, 250, 133, 61, 186, 43, 121, 10, 21, 155, 159, 94, 202, 78, 212, 172, 229, 243, 115, 167, 87, 175, 88, 168, 80, 244, 234, 214, 116, 79, 174, 233, 213, 231, 230, 173, 232, 44, 215, 117, 122, 235, 22, 11, 245, 89, 203, 95, 176, 156, 169, 81, 160, 127, 12, 246, 111, 23, 196, 73, 236, 216, 67, 31, 45, 164, 118, 123, 183, 204, 187, 62, 90, 251, 96, 177, 134, 59, 82, 161, 108, 170, 85, 41, 157, 151, 178, 135, 144, 97, 190, 220, 252, 188, 149, 207, 205, 55, 63, 91, 209, 83, 57, 132, 60, 65, 162, 109, 71, 20, 42, 158, 93, 86, 242, 211, 171, 68, 17, 146, 217, 35, 32, 46, 137, 180, 124, 184, 38, 119, 153, 227, 165, 103, 74, 237, 222, 197, 49, 254, 24, 13, 99, 140, 128, 192, 247, 112, 7]) if numpy_enabled: np_gf_exp = np.array(gf_exp) np_gf_log = np.array(gf_log) def gf_pow(x, power): return gf_exp[(gf_log[x] * power) % 255] def gf_poly_mul(p, q): r = [0] * (len(p) + len(q) - 1) lp = [gf_log[p[i]] for i in xrange(len(p))] for j in range(len(q)): lq = gf_log[q[j]] for i in range(len(p)): r[i + j] ^= gf_exp[lp[i] + lq] return r def rs_generator_poly_base3(nsize, fcr=0): g_all = {} g = [1] g_all[0] = g_all[1] = g for i in range(fcr+1, fcr+nsize+1): g = gf_poly_mul(g, [1, gf_pow(3, i)]) g_all[nsize-i] = g return g_all # Fastest way with pypy def rsenc(msg_in, nsym, gen): ''''''Reed-Solomon encoding using polynomial division, better explained at http://research.swtch.com/field'''''' msg_out = bytearray(msg_in) + bytearray(len(gen)-1) lgen = bytearray([gf_log[gen[j]] for j in xrange(len(gen))]) for i in xrange(len(msg_in)): coef = msg_out[i] # coef = gf_mul(msg_out[i], gf_inverse(gen[0])) # for general polynomial division (when polynomials are non-monic), the usual way of using synthetic division is to divide the divisor g(x) with its leading coefficient (call it a). In this implementation, this means:we need to compute: coef = msg_out[i] / gen[0] if coef != 0: # coef 0 is normally undefined so we manage it manually here (and it also serves as an optimization btw) lcoef = gf_log[coef] # precaching for j in xrange(1, len(gen)): # optimization: can skip g0 because the first coefficient of the generator is always 1! (that''s why we start at position 1) msg_out[i + j] ^= gf_exp[lcoef + lgen[j]] # equivalent (in Galois Field 2^8) to msg_out[i+j] += msg_out[i] * gen[j] # Recopy the original message bytes msg_out[:len(msg_in)] = msg_in return msg_out # Alternative 1: the loops were completely changed, instead of fixing msg_out[i] and updating all subsequent i+j items, we now fixate msg_out[i+j] and compute it at once using all couples msg_out[i] * gen[j] - msg_out[i+1] * gen[j-1] - ... since when we fixate msg_out[i+j], all previous msg_out[k] with k < i+j are already fully computed. def rsenc_alt1(msg_in, nsym, gen): msg_in = bytearray(msg_in) msg_out = bytearray(msg_in) + bytearray(len(gen)-1) lgen = bytearray([gf_log[gen[j]] for j in xrange(len(gen))]) # Alternative 1 jlist = range(1, len(gen)) for k in xrange(1, len(msg_out)): for x in xrange(max(k-len(msg_in),0), len(gen)-1): if k-x-1 < 0: break msg_out[k] ^= gf_exp[msg_out[k-x-1] + lgen[jlist[x]]] # Recopy the original message bytes msg_out[:len(msg_in)] = msg_in return msg_out # Alternative 2: a rewrite of alternative 1 with generators and reduce def rsenc_alt2(msg_in, nsym, gen): msg_in = bytearray(msg_in) msg_out = bytearray(msg_in) + bytearray(len(gen)-1) lgen = bytearray([gf_log[gen[j]] for j in xrange(len(gen))]) # Alternative 1 jlist = range(1, len(gen)) for k in xrange(1, len(msg_out)): items_gen = ( gf_exp[msg_out[k-x-1] + lgen[jlist[x]]] if k-x-1 >= 0 else next(iter(())) for x in xrange(max(k-len(msg_in),0), len(gen)-1) ) msg_out[k] ^= reduce(xor, items_gen) # Recopy the original message bytes msg_out[:len(msg_in)] = msg_in return msg_out # Alternative with Numpy def rsenc_numpy(msg_in, nsym, gen): msg_in = np.array(bytearray(msg_in)) msg_out = np.pad(msg_in, (0, nsym), ''constant'') lgen = np_gf_log[gen] for i in xrange(msg_in.size): msg_out[i+1:i+lgen.size] ^= np_gf_exp[np.add(lgen[1:], msg_out[i])] msg_out[:len(msg_in)] = msg_in return msg_out gf_mul_arr = [bytearray(256) for _ in xrange(256)] gf_add_arr = [bytearray(256) for _ in xrange(256)] # Precompute multiplication and addition tables def gf_precomp_tables(gf_exp=gf_exp, gf_log=gf_log): global gf_mul_arr, gf_add_arr for i in xrange(256): for j in xrange(256): gf_mul_arr[i][j] = gf_exp[gf_log[i] + gf_log[j]] gf_add_arr[i][j] = i ^ j return gf_mul_arr, gf_add_arr # Alternative with precomputation of multiplication and addition tables, inspired by zfec: https://hackage.haskell.org/package/fec-0.1.1/src/zfec/fec.c def rsenc_precomp(msg_in, nsym, gen=None): msg_in = bytearray(msg_in) msg_out = bytearray(msg_in) + bytearray(len(gen)-1) for i in xrange(len(msg_in)): # [i for i in xrange(len(msg_in)) if msg_in[i] != 0] coef = msg_out[i] if coef != 0: # coef 0 is normally undefined so we manage it manually here (and it also serves as an optimization btw) mula = gf_mul_arr[coef] for j in xrange(1, len(gen)): # optimization: can skip g0 because the first coefficient of the generator is always 1! (that''s why we start at position 1) #msg_out[i + j] = gf_add_arr[msg_out[i+j]][gf_mul_arr[coef][gen[j]]] # slower... #msg_out[i + j] ^= gf_mul_arr[coef][gen[j]] # faster msg_out[i + j] ^= mula[gen[j]] # fastest # Recopy the original message bytes msg_out[:len(msg_in)] = msg_in # equivalent to c = mprime - b, where mprime is msg_in padded with [0]*nsym return msg_out def randstr(n, size): ''''''Generate very fastly a random hexadecimal string. Kudos to jcdryer http://stackoverflow.com/users/131084/jcdyer'''''' hexstr = ''%0''+str(size)+''x'' for _ in xrange(n): yield hexstr % random.randrange(16**size) # Simple test case if __name__ == "__main__": # Setup functions to test funcs = [rsenc, rsenc_precomp, rsenc_alt1, rsenc_alt2] if numpy_enabled: funcs.append(rsenc_numpy) gf_precomp_tables() # Setup RS vars n = 255 k = 213 import time # Init the generator polynomial g = rs_generator_poly_base3(n) # Init the ground truth mes = ''hello world'' mesecc_correct = rsenc(mes, n-11, g[k]) # Test the functions for func in funcs: # Sanity check if func(mes, n-11, g[k]) != mesecc_correct: print func.__name__, ": output is incorrect!" # Time the function total_time = 0 for m in randstr(1000, n): start = time.clock() func(m, n-k, g[k]) total_time += time.clock() - start print func.__name__, ": total time elapsed %f seconds." % total_time

Y aquí está el resultado en mi máquina:

With PyPy: rsenc : total time elapsed 0.108183 seconds. rsenc_alt1 : output is incorrect! rsenc_alt1 : total time elapsed 0.164084 seconds. rsenc_alt2 : output is incorrect! rsenc_alt2 : total time elapsed 0.557697 seconds. Without PyPy: rsenc : total time elapsed 3.518857 seconds. rsenc_alt1 : output is incorrect! rsenc_alt1 : total time elapsed 5.630897 seconds. rsenc_alt2 : output is incorrect! rsenc_alt2 : total time elapsed 6.100434 seconds. rsenc_numpy : output is incorrect! rsenc_numpy : total time elapsed 1.631373 seconds

(Nota: las alternativas deben ser correctas, algunos índices deben estar un poco apagados, pero dado que son más lentos de todos modos, no intenté arreglarlos)

/ ACTUALIZACIÓN y objetivo de la recompensa: Encontré un truco de optimización muy interesante que promete acelerar mucho los cálculos: precomputar la tabla de multiplicar . Actualicé el código anterior con la nueva función rsenc_precomp (). Sin embargo, no hay ninguna ganancia en mi implementación, incluso es un poco más lenta:

rsenc : total time elapsed 0.107170 seconds. rsenc_precomp : total time elapsed 0.108788 seconds.

¿Cómo puede ser que las búsquedas de matrices cuesten más que operaciones como adiciones o xor? ¿Por qué funciona en ZFEC y no en Python?

Atribuiré la bonificación a quien me muestre cómo hacer funcionar esta optimización de tablas de búsqueda de multiplicación / adición (más rápido que las operaciones xor y de suma) o quién puede explicarme con referencias o análisis por qué esta optimización no puede funcionar aquí (usando Python / PyPy / Cython / Numpy, etc. Los probé todos).

Alternativamente, si conoce C, le recomendaría reescribir esta función de Python en C simple y llamarla (por ejemplo, con CFFI). Al menos sabes que alcanzas el máximo rendimiento en los bucles internos de tus funciones sin necesidad de conocer los trucos de PyPy o Cython.

Ver: http://cffi.readthedocs.org/en/latest/overview.html#performance

Lo siguiente es 3 veces más rápido que pypy en mi máquina (0.04s vs 0.15s). Usando Cython:

ctypedef unsigned char uint8_t # does not work with Microsoft''s C Compiler: from libc.stdint cimport uint8_t cimport cpython.array as array cdef uint8_t[::1] gf_exp = bytearray([1, 3, 5, 15, 17, 51, 85, 255, 26, 46, 114, 150, 161, 248, 19, lots of numbers omitted for space reasons ...]) cdef uint8_t[::1] gf_log = bytearray([0, 0, 25, 1, 50, 2, 26, 198, 75, 199, 27, 104, more numbers omitted for space reasons ...]) import cython @cython.boundscheck(False) @cython.wraparound(False) @cython.initializedcheck(False) def rsenc(msg_in_r, nsym, gen_t): ''''''Reed-Solomon encoding using polynomial division, better explained at http://research.swtch.com/field'''''' cdef uint8_t[::1] msg_in = bytearray(msg_in_r) # have to copy, unfortunately - can''t make a memory view from a read only object cdef int[::1] gen = array.array(''i'',gen_t) # convert list to array cdef uint8_t[::1] msg_out = bytearray(msg_in) + bytearray(len(gen)-1) cdef int j cdef uint8_t[::1] lgen = bytearray(gen.shape[0]) for j in xrange(gen.shape[0]): lgen[j] = gf_log[gen[j]] cdef uint8_t coef,lcoef cdef int i for i in xrange(msg_in.shape[0]): coef = msg_out[i] if coef != 0: # coef 0 is normally undefined so we manage it manually here (and it also serves as an optimization btw) lcoef = gf_log[coef] # precaching for j in xrange(1, gen.shape[0]): # optimization: can skip g0 because the first coefficient of the generator is always 1! (that''s why we start at position 1) msg_out[i + j] ^= gf_exp[lcoef + lgen[j]] # equivalent (in Galois Field 2^8) to msg_out[i+j] -= msg_out[i] * gen[j] # Recopy the original message bytes msg_out[:msg_in.shape[0]] = msg_in return msg_out

Es solo su versión más rápida con tipos estáticos (y verificando el html desde cython -a hasta que los loops no se resalten en amarillo).

Algunas notas breves:

Cython prefiere x.shape[0] a len(shape)
La definición de las vistas en memoria como [::1] promete que son continuas en la memoria, lo que ayuda
initializedcheck(False) es necesario para evitar muchas comprobaciones de gf_exp y gf_log definidos globalmente. (Puede encontrar que puede acelerar su código Python / PyPy básico al crear una referencia de variable local para estos y usar eso en el lugar)
Tuve que copiar un par de argumentos de entrada. Cython no puede hacer una vista de memoria desde un objeto de solo lectura (en este caso msg_in , una cadena. Sin embargo, probablemente podría haberlo hecho un char *). También gen (una lista) debe estar en algo con acceso rápido a elementos.

Aparte de eso, todo es bastante directo. (No he probado ninguna variación de haberlo conseguido más rápido). Estoy realmente muy impresionado de lo bien que PyPy lo hace.

Basándome en la respuesta de DavidW, aquí está la implementación que estoy usando actualmente, que es aproximadamente un 20% más rápida usando nogil y cómputo paralelo:

from cython.parallel import parallel, prange @cython.boundscheck(False) @cython.wraparound(False) @cython.initializedcheck(False) cdef rsenc_cython(msg_in_r, nsym, gen_t) : ''''''Reed-Solomon encoding using polynomial division, better explained at http://research.swtch.com/field'''''' cdef uint8_t[::1] msg_in = bytearray(msg_in_r) # have to copy, unfortunately - can''t make a memory view from a read only object #cdef int[::1] gen = array.array(''i'',gen_t) # convert list to array cdef uint8_t[::1] gen = gen_t cdef uint8_t[::1] msg_out = bytearray(msg_in) + bytearray(len(gen)-1) cdef int i, j cdef uint8_t[::1] lgen = bytearray(gen.shape[0]) for j in xrange(gen.shape[0]): lgen[j] = gf_log_c[gen[j]] cdef uint8_t coef,lcoef with nogil: for i in xrange(msg_in.shape[0]): coef = msg_out[i] if coef != 0: # coef 0 is normally undefined so we manage it manually here (and it also serves as an optimization btw) lcoef = gf_log_c[coef] # precaching for j in prange(1, gen.shape[0]): # optimization: can skip g0 because the first coefficient of the generator is always 1! (that''s why we start at position 1) msg_out[i + j] ^= gf_exp_c[lcoef + lgen[j]] # equivalent (in Galois Field 2^8) to msg_out[i+j] -= msg_out[i] * gen[j] # Recopy the original message bytes msg_out[:msg_in.shape[0]] = msg_in return msg_out

Me gustaría que fuera más rápido (en una implementación real, los datos están codificados a aproximadamente 6.4 MB / s con n = 255, siendo n el tamaño del mensaje + palabra clave).

La principal ventaja para una implementación más rápida que he encontrado es utilizar un enfoque LUT (tabla de búsqueda), precomputando las matrices de multiplicación y adición. Sin embargo, en mis implementaciones de Python y Cython, el enfoque de LUT es más lento que el cálculo de XOR y las operaciones de adición.

Existen otros enfoques para implementar un codificador RS más rápido, pero no tengo las habilidades ni el tiempo para probarlos. Los dejaré como referencias para otros lectores interesados:

"Implementación rápida de software de operaciones de campo finito", Cheng Huang y Lihao Xu, Universidad de Washington en St. Louis, Tech. Rep (2003). enlace y una implementación correcta del código aquí .
Luo, Jianqiang, y col. "Implementaciones de software eficientes de grandes campos finitos GF (2 n) para aplicaciones de almacenamiento seguras". Transacciones de ACM en almacenamiento (TOS) 8.1 (2012): 2.
"Una evaluación de rendimiento y examen de bibliotecas de codificación de borrado de código abierto para almacenamiento". Plank, JS y Luo, J. y Schuman, CD y Xu, L., y Wilcox-O''Hearn, Z, FAST. Vol. 9. 2009. link O también la versión no extendida: "Una comparación de rendimiento de bibliotecas de codificación de borrado de código abierto para aplicaciones de almacenamiento", Plank y Schuman.
Código fuente de la biblioteca ZFEC, con enlace de optimización LUT de multiplicación.
"Aritmética optimizada para codificadores de Reed-Solomon", Christof Paar (1997, junio). En IEEE International Symposium on Information Theory (pp. 250-250). INSTITUTO DE INGENIEROS ELÉCTRICOS INC (IEEE). enlazar
"Un algoritmo rápido para codificar el (255,233) Código Reed-Solomon sobre GF (2 ^ 8)", RL Miller y TK Truong, IS Reed. enlazar
"Optimización de la aritmética de Galois Field para diversas arquitecturas y aplicaciones de procesadores", Greenan, Kevin y M., Ethan y L. Miller y Thomas JE Schwarz, Modelado, Análisis y Simulación de Computadoras y Sistemas de Telecomunicaciones, 2008. MASCOTS 2008. IEEE International Symposium on . IEEE, 2008. enlace
Anvin, H. Peter. "Las matemáticas de RAID-6". (2007). enlace y enlace
Biblioteca Wirehair , una de las pocas implementaciones de Cauchy Reed-Solomon, que se dice que es muy rápida.
"Algoritmo de tiempo booleano logarítmico para división de polinomios paralelos", Bini, D. y Pan, VY (1987), Cartas de procesamiento de información, 24 (4), 233-237. Ver también Bini, D. y V. Pan. "Rápidos algoritmos paralelos para la división polinómica en un campo arbitrario de constantes". Computadoras y Matemáticas con Aplicaciones 12.11 (1986): 1105-1118. enlazar
Kung, HT "Evaluación rápida e interpolación". (1973) enlazar
Cao, Zhengjun y Hanyue Cao. "Nota sobre el algoritmo de división rápida para polinomios utilizando Newton iteration". arXiv preprint arXiv: 1112.4014 (2011). enlazar
"Introducción a Galois Fields y Reed-Solomon Coding", James Westall y James Martin, 2010. enlace
Mamidi, Suman, y col. "Las extensiones del conjunto de instrucciones para la codificación y decodificación de Reed-Solomon". Sistemas específicos de aplicaciones, procesadores de arquitectura, 2005. ASAP 2005. 16a Conferencia Internacional IEEE en. IEEE, 2005. enlace
Dumas, Jean-Guillaume, Laurent Fousse y Bruno Salvy. "Reducción modular simultánea y sustitución Kronecker para pequeños campos finitos". Journal of Symbolic Computation 46.7 (2011): 823-840.
Greenan, Kevin M., Ethan L. Miller y Thomas Schwarz. Análisis y construcción de campos de galois para una confiabilidad de almacenamiento eficiente. Vol. 9. Informe técnico UCSC-SSRC-07, 2007. enlace

Sin embargo, creo que la mejor ventaja es utilizar una reducción modular polinómica eficiente en lugar de una división polinómica:

"Reducción modular en GF (2 n) sin fase pre-computacional". Kneževic, M., et al. Aritmética de los campos finitos. Springer Berlin Heidelberg, 2008. 77-87.
"En el cálculo de la reducción modular polinómica". Wu, Huapeng. Informe técnico, Univ. de Waterloo, el Centro de investigación criptográfica aplicada, 2000.
"Una implementación de software rápida para operaciones aritméticas en GF (2n)". De Win, E., Bosselaers, A., Vandenberghe, S., De Gersem, P., y Vandewalle, J. (1996, enero). En Advances in Cryptology-Asiacrypt''96 (pp. 65-76). Springer Berlin Heidelberg. enlazar
Reducción Barnett

/ EDITAR: de hecho parece que "En el cálculo de la reducción modular polinomial" solo usa el mismo enfoque que hice con las variantes rsenc_alt1 () y rsenc_alt2 () (la idea principal es que precalculamos los pares de coeficientes que necesitaremos, y reducirlos todos a la vez), y desafortunadamente no es más rápido (en realidad es más lento porque la precomputación no puede hacerse de una vez porque depende de la entrada del mensaje).

/ EDIT: encontré una biblioteca con optimizaciones realmente interesantes, lotes que ni siquiera se encuentran en ningún documento académico (que el autor afirmó haber leído por cierto), y que es probablemente la implementación de software más rápida de Reed-Solomon: el proyecto wirehair y el blog relacionado para más detalles. Vale la pena señalar que el autor también hizo un códec de Cauchy-Reed-Solomon llamado longhair con trucos de optimización similares.

/ FINAL EDIT: parece que la implementación más rápida disponible se basa en este documento:

Plank, James S., Kevin M. Greenan y Ethan L. Miller. "Gritando rápidamente la aritmética de campo de Galois usando las instrucciones SIMD de Intel". RÁPIDO. 2013. enlace

La implementación, en puro Go, está disponible aquí y está escrito por Klaus Post . Es la implementación más rápida que he leído, tanto en un solo hilo como en paralelo (admite ambos). Reclama más de 1GB / s en un solo hilo y más de 4 GB / s con 8 hilos. Sin embargo, confía en las instrucciones SIMD optimizadas y varias optimizaciones de bajo nivel en operaciones matriciales (porque aquí el códec RS está orientado a la matriz en lugar del enfoque polinómico que tengo en mi pregunta).

Entonces, si usted es un lector interesado y desea encontrar el códec Reed-Solomon más rápido disponible, ese es el indicado.