embedded - realizar - que es el algoritmo de la raiz cuadrada

Buscando un algoritmo de raíz cuadrada entero eficiente para ARM Thumb2 (10)

Aquí hay una solución en Java que combina integer log_2 y el método de Newton para crear un algoritmo sin bucle. Como inconveniente, necesita división. Las líneas comentadas son necesarias para convertir a un algoritmo de 64 bits.

private static final int debruijn= 0x07C4ACDD; //private static final long debruijn= ( ~0x0218A392CD3D5DBFL)>>>6; static { for(int x= 0; x<32; ++x) { final long v= ~( -2L<<(x)); DeBruijnArray[(int)((v*debruijn)>>>27)]= x; //>>>58 } for(int x= 0; x<32; ++x) SQRT[x]= (int) (Math.sqrt((1L<<DeBruijnArray[x])*Math.sqrt(2))); } public static int sqrt(final int num) { int y; if(num==0) return num; { int v= num; v|= v>>>1; // first round up to one less than a power of 2 v|= v>>>2; v|= v>>>4; v|= v>>>8; v|= v>>>16; //v|= v>>>32; y= SQRT[(v*debruijn)>>>27]; //>>>58 } //y= (y+num/y)>>>1; y= (y+num/y)>>>1; y= (y+num/y)>>>1; y= (y+num/y)>>>1; return y*y>num?y-1:y; }

Cómo funciona esto: la primera parte produce una raíz cuadrada con una precisión de aproximadamente tres bits. La línea [y = (y + num / y) >> 1;] duplica la precisión en bits. La última línea elimina las raíces del techo que se pueden generar.

Estoy buscando un algoritmo rápido, solo entero para encontrar la raíz cuadrada (parte entera del mismo) de un entero sin signo. El código debe tener un excelente rendimiento en los procesadores ARM Thumb 2. Podría ser lenguaje ensamblador o código C.

Cualquier sugerencia bienvenida.

Depende del uso de la función sqrt. A menudo uso algo de aproximadamente para hacer versiones rápidas. Por ejemplo, cuando necesito calcular el módulo de vector:

Module = SQRT( x^2 + y^2)

Yo suelo :

Module = MAX( x,y) + Min(x,y)/2

Que se puede codificar en 3 o 4 instrucciones como:

If (x > y ) Module = x + y >> 1; Else Module = y + x >> 1;

Encuentro que la mayoría de los algoritmos se basan en ideas simples, pero se implementan de una manera más complicada de lo necesario. Tomé la idea desde aquí: http://ww1.microchip.com/downloads/en/AppNotes/91040a.pdf (por Ross M. Fosler) y la convertí en una función C muy corta:

uint16_t int_sqrt32(uint32_t x) { uint16_t res=0; uint16_t add= 0x8000; int i; for(i=0;i<16;i++) { uint16_t temp=res | add; uint32_t g2=temp*temp; if (x>=g2) { res=temp; } add>>=1; } return res; }

Esto compila a 5 ciclos / bit en mi blackfin. Creo que su código compilado en general será más rápido si utiliza bucles for loop en lugar de while, y obtendrá el beneficio adicional del tiempo determinista (aunque eso depende en cierta medida de cómo su compilador optimiza la instrucción if).

Este método es similar a la división larga: construyes una estimación para el próximo dígito de la raíz, haces una resta e ingresas el dígito si la diferencia cumple con ciertos criterios. Con una versión binaria, su única opción para el siguiente dígito es 0 o 1, por lo que siempre adivina 1, haga la resta e ingrese un 1 a menos que la diferencia sea negativa.

http://www.realitypixels.com/turk/opensource/index.html#FractSqrt

Implementé la sugerencia de Warren y el método de Newton en C # para enteros de 64 bits. Isqrt usa el método de Newton, mientras que Isqrt usa el método de Warren. Aquí está el código fuente:

using System; namespace Cluster { public static class IntegerMath { /// <summary> /// Compute the integer square root, the largest whole number less than or equal to the true square root of N. /// /// This uses the integer version of Newton''s method. /// </summary> public static long Isqrt(this long n) { if (n < 0) throw new ArgumentOutOfRangeException("n", "Square root of negative numbers is not defined."); if (n <= 1) return n; var xPrev2 = -2L; var xPrev1 = -1L; var x = 2L; // From Wikipedia: if N + 1 is a perfect square, then the algorithm enters a two-value cycle, so we have to compare // to two previous values to test for convergence. while (x != xPrev2 && x != xPrev1) { xPrev2 = xPrev1; xPrev1 = x; x = (x + n/x)/2; } // The two values x and xPrev1 will be above and below the true square root. Choose the lower one. return x < xPrev1 ? x : xPrev1; } #region Sqrt using Bit-shifting and magic numbers. // From http://.com/questions/1100090/looking-for-an-efficient-integer-square-root-algorithm-for-arm-thumb2 // Converted to C#. private static readonly ulong debruijn= ( ~0x0218A392CD3D5DBFUL)>>6; private static readonly ulong[] SQRT = new ulong[64]; private static readonly int[] DeBruijnArray = new int[64]; static IntegerMath() { for(int x= 0; x<64; ++x) { ulong v= (ulong) ~( -2L<<(x)); DeBruijnArray[(v*debruijn)>>58]= x; } for(int x= 0; x<64; ++x) SQRT[x]= (ulong) (Math.Sqrt((1L<<DeBruijnArray[x])*Math.Sqrt(2))); } public static long Isqrt2(this long n) { ulong num = (ulong) n; ulong y; if(num==0) return (long)num; { ulong v= num; v|= v>>1; // first round up to one less than a power of 2 v|= v>>2; v|= v>>4; v|= v>>8; v|= v>>16; v|= v>>32; y= SQRT[(v*debruijn)>>58]; } y= (y+num/y)>>1; y= (y+num/y)>>1; y= (y+num/y)>>1; y= (y+num/y)>>1; // Make sure that our answer is rounded down, not up. return (long) (y*y>num?y-1:y); } #endregion } }

Usé lo siguiente para comparar el código:

using System; using System.Diagnostics; using Cluster; using Microsoft.VisualStudio.TestTools.UnitTesting; namespace ClusterTests { [TestClass] public class IntegerMathTests { [TestMethod] public void Isqrt_Accuracy() { for (var n = 0L; n <= 100000L; n++) { var expectedRoot = (long) Math.Sqrt(n); var actualRoot = n.Isqrt(); Assert.AreEqual(expectedRoot, actualRoot, String.Format("Square root is wrong for N = {0}.", n)); } } [TestMethod] public void Isqrt2_Accuracy() { for (var n = 0L; n <= 100000L; n++) { var expectedRoot = (long)Math.Sqrt(n); var actualRoot = n.Isqrt2(); Assert.AreEqual(expectedRoot, actualRoot, String.Format("Square root is wrong for N = {0}.", n)); } } [TestMethod] public void Isqrt_Speed() { var integerTimer = new Stopwatch(); var libraryTimer = new Stopwatch(); integerTimer.Start(); var total = 0L; for (var n = 0L; n <= 300000L; n++) { var root = n.Isqrt(); total += root; } integerTimer.Stop(); libraryTimer.Start(); total = 0L; for (var n = 0L; n <= 300000L; n++) { var root = (long)Math.Sqrt(n); total += root; } libraryTimer.Stop(); var isqrtMilliseconds = integerTimer.ElapsedMilliseconds; var libraryMilliseconds = libraryTimer.ElapsedMilliseconds; var msg = String.Format("Isqrt: {0} ms versus library: {1} ms", isqrtMilliseconds, libraryMilliseconds); Debug.WriteLine(msg); Assert.IsTrue(libraryMilliseconds > isqrtMilliseconds, "Isqrt2 should be faster than Math.Sqrt! " + msg); } [TestMethod] public void Isqrt2_Speed() { var integerTimer = new Stopwatch(); var libraryTimer = new Stopwatch(); var warmup = (10L).Isqrt2(); integerTimer.Start(); var total = 0L; for (var n = 0L; n <= 300000L; n++) { var root = n.Isqrt2(); total += root; } integerTimer.Stop(); libraryTimer.Start(); total = 0L; for (var n = 0L; n <= 300000L; n++) { var root = (long)Math.Sqrt(n); total += root; } libraryTimer.Stop(); var isqrtMilliseconds = integerTimer.ElapsedMilliseconds; var libraryMilliseconds = libraryTimer.ElapsedMilliseconds; var msg = String.Format("isqrt2: {0} ms versus library: {1} ms", isqrtMilliseconds, libraryMilliseconds); Debug.WriteLine(msg); Assert.IsTrue(libraryMilliseconds > isqrtMilliseconds, "Isqrt2 should be faster than Math.Sqrt! " + msg); } } }

Mis resultados en una Dell Latitude E6540 en modo de lanzamiento, Visual Studio 2012 fueron que la biblioteca llamada Math.Sqrt es más rápida.

59 ms - Newton (Isqrt)
12 ms - Cambio de bit (Isqrt2)
5 ms - Math.Sqrt

No soy inteligente con las directivas de compilación, por lo que es posible ajustar el compilador para obtener el entero matemático más rápido. Claramente, el enfoque de cambio de bit está muy cerca de la biblioteca. En un sistema sin coprocesador matemático, sería muy rápido.

Me he conformado con algo similar al algoritmo binario de dígito por dígito descrito en este artículo de Wikipedia .

No es rápido, pero es pequeño y simple:

int isqrt(int n) { int b = 0; while(n >= 0) { n = n - b; b = b + 1; n = n - b; } return b - 1; }

Si no se requiere precisión exacta, tengo una aproximación rápida para usted, que usa 260 bytes de ram (puede reducir a la mitad, pero no).

int ftbl[33]={0,1,1,2,2,4,5,8,11,16,22,32,45,64,90,128,181,256,362,512,724,1024,1448,2048,2896,4096,5792,8192,11585,16384,23170,32768,46340}; int ftbl2[32]={ 32768,33276,33776,34269,34755,35235,35708,36174,36635,37090,37540,37984,38423,38858,39287,39712,40132,40548,40960,41367,41771,42170,42566,42959,43347,43733,44115,44493,44869,45241,45611,45977}; int fisqrt(int val) { int cnt=0; int t=val; while (t) {cnt++;t>>=1;} if (6>=cnt) t=(val<<(6-cnt)); else t=(val>>(cnt-6)); return (ftbl[cnt]*ftbl2[t&31])>>15; }

Aquí está el código para generar las tablas:

ftbl[0]=0; for (int i=0;i<32;i++) ftbl[i+1]=sqrt(pow(2.0,i)); printf("int ftbl[33]={0"); for (int i=0;i<32;i++) printf(",%d",ftbl[i+1]); printf("};/n"); for (int i=0;i<32;i++) ftbl2[i]=sqrt(1.0+i/32.0)*32768; printf("int ftbl2[32]={"); for (int i=0;i<32;i++) printf("%c%d",(i)?'','':'' '',ftbl2[i]); printf("};/n");

En el rango 1-> 2 ^ 20, el error máximo es 11, y en el rango 1-> 2 ^ 30, es aproximadamente 256. Puede usar tablas más grandes y minimizar esto. Vale la pena mencionar que el error siempre será negativo, es decir, cuando sea incorrecto, el valor será MENOR que el valor correcto.

Puede hacer bien en seguir esto con una etapa de refinación.

La idea es bastante simple: (ab) ^ 0.5 = a ^ 0.b * b ^ 0.5.

Entonces, tomamos la entrada X = A * B donde A = 2 ^ N y 1 <= B <2 Luego tenemos un lookuptable para sqrt (2 ^ N), y un lookuptable para sqrt (1 <= B <2) . Almacenamos el lookuptable para sqrt (2 ^ N) como un entero, lo que podría ser un error (las pruebas no muestran efectos negativos), y almacenamos el lookuptable para sqrt (1 <= B <2) en 15bits de punto fijo.

Sabemos que 1 <= sqrt (2 ^ N) <65536, entonces eso es 16bit, y sabemos que realmente solo podemos multiplicar 16bitx15bit en un ARM, sin temor a represalias, así que eso es lo que hacemos.

En términos de implementación, while (t) {cnt ++; t >> = 1;} es efectivamente una instrucción de conteo de bits (CLB), ¡así que si tu versión del chipset tiene eso, estás ganando! Además, la instrucción de cambio sería fácil de implementar con una palanca de cambios bidireccional, si tiene una? Hay un algoritmo Lg [N] para contar el bit más alto establecido here.

En términos de números mágicos, para cambiar los tamaños de las mesas, EL número mágico para ftbl2 es 32, aunque tenga en cuenta que 6 (Lg [32] +1) se usa para el cambio.

Un enfoque común es la bisección.

hi = number lo = 0 mid = ( hi + lo ) / 2 mid2 = mid*mid while( lo < hi-1 and mid2 != number ) { if( mid2 < number ) { lo = mid else hi = mid mid = ( hi + lo ) / 2 mid2 = mid*mid

Algo así debería funcionar razonablemente bien. Realiza pruebas de log2 (número), haciendo que log2 (número) se multiplique y divida. Como la división es una división entre 2, puede reemplazarla por un >> .

La condición de terminación puede no ser inmediata, así que asegúrese de probar una variedad de enteros para asegurarse de que la división por 2 no oscile incorrectamente entre dos valores pares; ellos diferirían por más de 1.

Integer Square Roots por Jack W. Crenshaw podría ser útil como otra referencia.

El archivo C Snippets también tiene una implementación de raíz cuadrada entera . Éste va más allá del resultado entero, y calcula los bits fraccionarios adicionales (punto fijo) de la respuesta. (Actualización: desafortunadamente, el archivo de fragmentos de C ahora ha desaparecido. El enlace apunta al archivo web de la página). Aquí está el código del archivo de fragmentos C:

#define BITSPERLONG 32 #define TOP2BITS(x) ((x & (3L << (BITSPERLONG-2))) >> (BITSPERLONG-2)) struct int_sqrt { unsigned sqrt, frac; }; /* usqrt: ENTRY x: unsigned long EXIT returns floor(sqrt(x) * pow(2, BITSPERLONG/2)) Since the square root never uses more than half the bits of the input, we use the other half of the bits to contain extra bits of precision after the binary point. EXAMPLE suppose BITSPERLONG = 32 then usqrt(144) = 786432 = 12 * 65536 usqrt(32) = 370727 = 5.66 * 65536 NOTES (1) change BITSPERLONG to BITSPERLONG/2 if you do not want the answer scaled. Indeed, if you want n bits of precision after the binary point, use BITSPERLONG/2+n. The code assumes that BITSPERLONG is even. (2) This is really better off being written in assembly. The line marked below is really a "arithmetic shift left" on the double-long value with r in the upper half and x in the lower half. This operation is typically expressible in only one or two assembly instructions. (3) Unrolling this loop is probably not a bad idea. ALGORITHM The calculations are the base-two analogue of the square root algorithm we all learned in grammar school. Since we''re in base 2, there is only one nontrivial trial multiplier. Notice that absolutely no multiplications or divisions are performed. This means it''ll be fast on a wide range of processors. */ void usqrt(unsigned long x, struct int_sqrt *q) { unsigned long a = 0L; /* accumulator */ unsigned long r = 0L; /* remainder */ unsigned long e = 0L; /* trial product */ int i; for (i = 0; i < BITSPERLONG; i++) /* NOTE 1 */ { r = (r << 2) + TOP2BITS(x); x <<= 2; /* NOTE 2 */ a <<= 1; e = (a << 1) + 1; if (r >= e) { r -= e; a++; } } memcpy(q, &a, sizeof(long)); }

Me decidí por el siguiente código. Es esencialmente del artículo de Wikipedia sobre métodos de computación de raíz cuadrada . Pero se ha cambiado para usar los tipos stdint.h uint32_t etc. Estrictamente hablando, el tipo de devolución podría cambiarse a uint16_t .

/** * /brief Fast Square root algorithm * * Fractional parts of the answer are discarded. That is: * - SquareRoot(3) --> 1 * - SquareRoot(4) --> 2 * - SquareRoot(5) --> 2 * - SquareRoot(8) --> 2 * - SquareRoot(9) --> 3 * * /param[in] a_nInput - unsigned integer for which to find the square root * * /return Integer square root of the input value. */ uint32_t SquareRoot(uint32_t a_nInput) { uint32_t op = a_nInput; uint32_t res = 0; uint32_t one = 1uL << 30; // The second-to-top bit is set: use 1u << 14 for uint16_t type; use 1uL<<30 for uint32_t type // "one" starts at the highest power of four <= than the argument. while (one > op) { one >>= 2; } while (one != 0) { if (op >= res + one) { op = op - (res + one); res = res + 2 * one; } res >>= 1; one >>= 2; } return res; }

Lo bueno, descubrí, es que una modificación bastante simple puede devolver la respuesta "redondeada". Encontré esto útil en cierta aplicación para una mayor precisión. Tenga en cuenta que en este caso, el tipo de retorno debe ser uint32_t porque la raíz cuadrada redondeada de 2 ³² - 1 es 2 ¹⁶ .

/** * /brief Fast Square root algorithm, with rounding * * This does arithmetic rounding of the result. That is, if the real answer * would have a fractional part of 0.5 or greater, the result is rounded up to * the next integer. * - SquareRootRounded(2) --> 1 * - SquareRootRounded(3) --> 2 * - SquareRootRounded(4) --> 2 * - SquareRootRounded(6) --> 2 * - SquareRootRounded(7) --> 3 * - SquareRootRounded(8) --> 3 * - SquareRootRounded(9) --> 3 * * /param[in] a_nInput - unsigned integer for which to find the square root * * /return Integer square root of the input value. */ uint32_t SquareRootRounded(uint32_t a_nInput) { uint32_t op = a_nInput; uint32_t res = 0; uint32_t one = 1uL << 30; // The second-to-top bit is set: use 1u << 14 for uint16_t type; use 1uL<<30 for uint32_t type // "one" starts at the highest power of four <= than the argument. while (one > op) { one >>= 2; } while (one != 0) { if (op >= res + one) { op = op - (res + one); res = res + 2 * one; } res >>= 1; one >>= 2; } /* Do arithmetic rounding to nearest integer */ if (op > res) { res++; } return res; }