c++ - the - Posición del bit menos significativo que se establece

numero binario bit mas significativo (21)

¿Por qué no usar el ffs incorporado? (Cogí una página de hombre de Linux, pero está más ampliamente disponible que eso).

ffs (3) - página man de Linux
Nombre
ffs: encuentre el primer bit en una palabra
Sinopsis
#include <strings.h> int ffs(int i); #define _GNU_SOURCE #include <string.h> int ffsl(long int i); int ffsll(long long int i);
Descripción
La función ffs () devuelve la posición del primer bit (menos significativo) establecido en la palabra i. El bit menos significativo es la posición 1 y la posición más significativa, por ejemplo, 32 o 64. Las funciones ffsll () y ffsl () hacen lo mismo pero toman argumentos de un tamaño posiblemente diferente.
Valor de retorno
Estas funciones devuelven la posición del primer bit establecido, o 0 si no se establecen bits en i.
De acuerdo a
4.3BSD, POSIX.1-2001.
Notas
Los sistemas BSD tienen un prototipo en <string.h> .

Estoy buscando una manera eficiente de determinar la posición del bit menos significativo que se establece en un entero, por ejemplo, para 0x0FF0 sería 4.

Una implementación trivial es esta:

unsigned GetLowestBitPos(unsigned value) { assert(value != 0); // handled separately unsigned pos = 0; while (!(value & 1)) { value >>= 1; ++pos; } return pos; }

¿Alguna idea de cómo exprimir algunos ciclos de ella?

(Nota: esta pregunta es para las personas que disfrutan de tales cosas, no para que la gente me diga que la xyzoptimización es malvada).

[edit] ¡ Gracias a todos por las ideas! También aprendí algunas otras cosas. ¡Guay!

¿Por qué no usar la búsqueda binaria ? Esto siempre se completará después de 5 operaciones (suponiendo un tamaño int de 4 bytes):

if (0x0000FFFF & value) { if (0x000000FF & value) { if (0x0000000F & value) { if (0x00000003 & value) { if (0x00000001 & value) { return 1; } else { return 2; } } else { if (0x0000004 & value) { return 3; } else { return 4; } } } else { ... } else { ... } else { ...

De acuerdo con la página de BitScan de Programación de Ajedrez y mis propias medidas, restar y xor es más rápido que negar y máscara.

(Tenga en cuenta que si va a contar los ceros finales en 0 , el método como lo tengo devuelve 63 mientras que el negar y la máscara devuelve 0 ).

Aquí hay un resta de 64 bits y xor:

unsigned long v; // find the number of trailing zeros in 64-bit v int r; // result goes here static const int MultiplyDeBruijnBitPosition[64] = { 0, 47, 1, 56, 48, 27, 2, 60, 57, 49, 41, 37, 28, 16, 3, 61, 54, 58, 35, 52, 50, 42, 21, 44, 38, 32, 29, 23, 17, 11, 4, 62, 46, 55, 26, 59, 40, 36, 15, 53, 34, 51, 20, 43, 31, 22, 10, 45, 25, 39, 14, 33, 19, 30, 9, 24, 13, 18, 8, 12, 7, 6, 5, 63 }; r = MultiplyDeBruijnBitPosition[((uint32_t)((v ^ (v-1)) * 0x03F79D71B4CB0A89U)) >> 58];

Como referencia, aquí hay una versión de 64 bits del método negate y mask:

unsigned long v; // find the number of trailing zeros in 64-bit v int r; // result goes here static const int MultiplyDeBruijnBitPosition[64] = { 0, 1, 48, 2, 57, 49, 28, 3, 61, 58, 50, 42, 38, 29, 17, 4, 62, 55, 59, 36, 53, 51, 43, 22, 45, 39, 33, 30, 24, 18, 12, 5, 63, 47, 56, 27, 60, 41, 37, 16, 54, 35, 52, 21, 44, 32, 23, 11, 46, 26, 40, 15, 34, 20, 31, 10, 25, 14, 19, 9, 13, 8, 7, 6 }; r = MultiplyDeBruijnBitPosition[((uint32_t)((v & -v) * 0x03F79D71B4CB0A89U)) >> 58];

Dios mío tiene esto simplemente en espiral.

Lo que falta en la mayoría de estos ejemplos es un poco de comprensión sobre cómo funciona todo el hardware.

Cada vez que tienes una rama, la CPU tiene que adivinar qué rama se tomará. El tubo de instrucciones se carga con las instrucciones que conducen por la ruta adivinada. Si la CPU ha adivinado mal, la tubería de instrucción se purga y la otra rama debe cargarse.

Considere el simple bucle while en la parte superior. La conjetura será permanecer dentro del ciclo. Va a estar equivocado al menos una vez cuando abandona el ciclo. Esto vaciará la tubería de instrucción. Este comportamiento es un poco mejor que adivinar que abandonará el ciclo, en cuyo caso descargaría el conducto de instrucciones en cada iteración.

La cantidad de ciclos de CPU que se pierden varía mucho de un tipo de procesador a otro. Pero puede esperar entre 20 y 150 ciclos de CPU perdidos.

El siguiente grupo peor es donde crees que vas a guardar algunas iteraciones dividiendo el valor en piezas más pequeñas y agregando varias ramas más. Cada una de estas ramas agrega una oportunidad adicional para enjuagar el tubo de instrucción y cuesta entre 20 y 150 ciclos de reloj.

Consideremos qué sucede cuando busca un valor en una tabla. Lo más probable es que el valor no esté actualmente en caché, al menos no la primera vez que se llama a su función. Esto significa que la CPU se detiene mientras el valor se carga desde la memoria caché. De nuevo, esto varía de una máquina a la siguiente. Los nuevos chips Intel realmente usan esto como una oportunidad para intercambiar hilos mientras el hilo actual está esperando que se complete la carga de la caché. Esto podría ser más costoso que una descarga de tubo de instrucciones, sin embargo, si está realizando esta operación varias veces, es probable que solo ocurra una vez.

Claramente, la solución de tiempo constante más rápida es la que implica matemática determinista. Una solución pura y elegante.

Mis disculpas si esto ya estaba cubierto.

Cada compilador que uso, excepto XCODE AFAIK, tiene intrínsecos de compilador tanto para el bit de bits directo como para el bit de bits inverso. Estos se compilarán en una única instrucción de ensamblaje en la mayoría del hardware sin caché Miss, sin Branch Miss-Prediction y ningún otro programador generado tropiezos.

Para los compiladores de Microsoft, use _BitScanForward y _BitScanReverse.
Para GCC use __builtin_ffs, __builtin_clz, __builtin_ctz.

Además, evite publicar una respuesta y pueda inducir a error a los recién llegados si no conoce adecuadamente el tema que se está discutiendo.

Lo siento, olvidé totalmente proporcionar una solución. Este es el código que uso en el iPad que no tiene instrucciones de nivel de ensamblaje para la tarea:

unsigned BitScanLow_BranchFree(unsigned value) { bool bwl = (value & 0x0000ffff) == 0; unsigned I1 = (bwl * 15); value = (value >> I1) & 0x0000ffff; bool bbl = (value & 0x00ff00ff) == 0; unsigned I2 = (bbl * 7); value = (value >> I2) & 0x00ff00ff; bool bnl = (value & 0x0f0f0f0f) == 0; unsigned I3 = (bnl * 3); value = (value >> I3) & 0x0f0f0f0f; bool bsl = (value & 0x33333333) == 0; unsigned I4 = (bsl * 1); value = (value >> I4) & 0x33333333; unsigned result = value + I1 + I2 + I3 + I4 - 1; return result; }

Lo que hay que entender aquí es que no es la comparación lo que es caro, sino la rama que ocurre después de la comparación. La comparación en este caso se fuerza a un valor de 0 o 1 con ... == 0, y el resultado se usa para combinar las matemáticas que se habrían producido en cualquier lado de la rama.

Editar:

El código anterior está totalmente roto. Este código funciona y aún no tiene ramificaciones (si está optimizado):

int BitScanLow_BranchFree(ui value) { int i16 = !(value & 0xffff) << 4; value >>= i16; int i8 = !(value & 0xff) << 3; value >>= i8; int i4 = !(value & 0xf) << 2; value >>= i4; int i2 = !(value & 0x3) << 1; value >>= i2; int i1 = !(value & 0x1); int i0 = (value >> i1) & 1? 0 : -32; return i16 + i8 + i4 + i2 + i1 + i0; }

Esto devuelve -1 si se da 0. Si no te importa 0 o te complace obtener 31 para 0, elimina el cálculo i0, ahorrando un poco de tiempo.

Hay una instrucción de ensamblaje x86 ( bsf ) que lo hará. :)

¿Más optimizado?

Nota al costado:

La optimización en este nivel depende intrínsecamente de la arquitectura. Los procesadores de hoy en día son demasiado complejos (en términos de predicción de bifurcaciones, errores de caché, canalización) que es tan difícil predecir qué código se ejecuta más rápidamente en qué arquitectura. Disminuir las operaciones de 32 a 9 o cosas por el estilo podría incluso disminuir el rendimiento en algunas arquitecturas. El código optimizado en una sola arquitectura puede dar como resultado un código peor en la otra. Creo que optimizaría esto para una CPU específica o lo dejaría como está y dejaría que el compilador elija lo que cree que es mejor.

Inspirado por esta publicación similar que implica buscar un bit establecido, ofrezco lo siguiente:

unsigned GetLowestBitPos(unsigned value) { double d = value ^ (value - !!value); return (((int*)&d)[1]>>20)-1023; }

Pros:

no hay bucles
sin ramificaciones
funciona en tiempo constante
maneja el valor = 0 devolviendo un resultado de lo contrario fuera de límites
solo dos líneas de código

Contras:

asume poca endianidad como está codificada (se puede arreglar cambiando las constantes)
asume que el doble es un flotante IEEE real * 8 (IEEE 754)

Actualización: Como se señaló en los comentarios, una unión es una implementación más limpia (para C, al menos) y se vería así:

unsigned GetLowestBitPos(unsigned value) { union { int i[2]; double d; } temp = { .d = value ^ (value - !!value) }; return (temp.i[1] >> 20) - 1023; }

Esto supone operaciones de 32 bits con almacenamiento little-endian para todo (piense en procesadores x86).

La mayoría de las arquitecturas modernas tendrán instrucciones para encontrar la posición del bit más bajo establecido, o el bit más alto, o contar el número de ceros a la izquierda, etc.

Si tiene alguna instrucción de esta clase, puede emular a los demás de manera barata.

Tómese un momento para trabajar en papel y darse cuenta de que x & (x-1) borrará el bit más bajo establecido en x, y ( x & ~(x-1) ) devolverá solo el bit ajustado más bajo, independientemente de la arquitectura , longitud de palabra, etc. Sabiendo esto, es trivial usar el hardware count-leading-zeroes / highest-set-bit para encontrar el bit más bajo establecido si no hay instrucciones explícitas para hacerlo.

Si no hay ningún soporte de hardware relevante, la implementación de multiplicar y buscar de los ceros que marcan conteos dados here o uno de los de la página Bit Twiddling Hacks se puede convertir trivialmente para dar el bit más bajo usando las identidades anteriores y tiene la ventaja de ser sin sucursales.

La solución más rápida (no intrínseca / no ensambladora) para esto es encontrar el byte más bajo y luego usar ese byte en una tabla de búsqueda de 256 entradas. Esto le proporciona el peor de los casos de cuatro instrucciones condicionales y el mejor de los casos. No solo es esta la menor cantidad de instrucciones, sino la menor cantidad de ramas que es muy importante en el hardware moderno.

Su tabla (256 entradas de 8 bits) debe contener el índice del LSB para cada número en el rango 0-255. Compruebe cada byte de su valor y encuentre el byte distinto de cero más bajo, luego use este valor para buscar el índice real.

Esto requiere 256 bytes de memoria, pero si la velocidad de esta función es tan importante, entonces los 256 bytes valen la pena,

P.ej

byte lowestBitTable[256] = { .... // left as an exercise for the reader to generate }; unsigned GetLowestBitPos(unsigned value) { // note that order to check indices will depend whether you are on a big // or little endian machine. This is for little-endian byte* bytes = (byte*)value; if (bytes[0]) return lowestBitTable[bytes[0]]; else if (bytes[1]) return lowestBitTable[bytes[1]] + 8; else if (bytes[2]) return lowestBitTable[bytes[2]] + 16; else return lowestBitTable[bytes[3]] + 24; }

Otra solución, no la más rápida posible, pero parece bastante buena.
Al menos no tiene ramas. ;)

uint32 x = ...; // 0x00000001 0x0405a0c0 0x00602000 x |= x << 1; // 0x00000003 0x0c0fe1c0 0x00e06000 x |= x << 2; // 0x0000000f 0x3c3fe7c0 0x03e1e000 x |= x << 4; // 0x000000ff 0xffffffc0 0x3fffe000 x |= x << 8; // 0x0000ffff 0xffffffc0 0xffffe000 x |= x << 16; // 0xffffffff 0xffffffc0 0xffffe000 // now x is filled with ''1'' from the least significant ''1'' to bit 31 x = ~x; // 0x00000000 0x0000003f 0x00001fff // now we have 1''s below the original least significant 1 // let''s count them x = x & 0x55555555 + (x >> 1) & 0x55555555; // 0x00000000 0x0000002a 0x00001aaa x = x & 0x33333333 + (x >> 2) & 0x33333333; // 0x00000000 0x00000024 0x00001444 x = x & 0x0f0f0f0f + (x >> 4) & 0x0f0f0f0f; // 0x00000000 0x00000006 0x00000508 x = x & 0x00ff00ff + (x >> 8) & 0x00ff00ff; // 0x00000000 0x00000006 0x0000000d x = x & 0x0000ffff + (x >> 16) & 0x0000ffff; // 0x00000000 0x00000006 0x0000000d // least sign.bit pos. was: 0 6 13

Otro método (división y búsqueda de módulo) merece una mención especial aquí desde el mismo enlace proporcionado por @ anton-tykhyy. este método es muy similar en rendimiento al método de multiplicación y búsqueda de DeBruijn con una ligera pero importante diferencia.

división y búsqueda de módulo

unsigned int v; // find the number of trailing zeros in v int r; // put the result in r static const int Mod37BitPosition[] = // map a bit value mod 37 to its position { 32, 0, 1, 26, 2, 23, 27, 0, 3, 16, 24, 30, 28, 11, 0, 13, 4, 7, 17, 0, 25, 22, 31, 15, 29, 10, 12, 6, 0, 21, 14, 9, 5, 20, 8, 19, 18 }; r = Mod37BitPosition[(-v & v) % 37];

la división del módulo y el método de búsqueda devuelve diferentes valores para v = 0x00000000 y v = FFFFFFFF, mientras que DeBruijn multiplica y el método de búsqueda devuelve cero en ambas entradas.

prueba:-

unsigned int n1=0x00000000, n2=0xFFFFFFFF; MultiplyDeBruijnBitPosition[((unsigned int )((n1 & -n1) * 0x077CB531U)) >> 27]); /* returns 0 */ MultiplyDeBruijnBitPosition[((unsigned int )((n2 & -n2) * 0x077CB531U)) >> 27]); /* returns 0 */ Mod37BitPosition[(((-(n1) & (n1))) % 37)]); /* returns 32 */ Mod37BitPosition[(((-(n2) & (n2))) % 37)]); /* returns 0 */

Podrías verificar si alguno de los bits de orden inferior está establecido. Si es así, mira el orden inferior de los bits restantes. p.ej,:

Int de 32 bits: compruebe si se ha establecido alguno de los primeros 16. Si es así, verifique si alguno de los primeros 8 está configurado. si es así, ....

si no, verifique si alguno de los 16 superiores está configurado.

Esencialmente es una búsqueda binaria.

Vea mi respuesta here para saber cómo hacerlo con una sola instrucción x86, excepto que para encontrar el bit de ajuste menos significativo, querrá la instrucción BSF ("escaneo de bit forward") en lugar de BSR allí descrita.

Weee, un montón de soluciones y no un punto de referencia a la vista. Ustedes deberían estar avergonzados de ustedes mismos ;-)

Mi máquina es una Intel i530 (2.9 GHz), con Windows 7 de 64 bits. Compilé una versión de 32 bits de MinGW.

$ gcc --version gcc.exe (GCC) 4.7.2 $ gcc bench.c -o bench.exe -std=c99 -Wall -O2 $ bench Naive loop. Time = 2.91 (Original questioner) De Bruijn multiply. Time = 1.16 (Tykhyy) Lookup table. Time = 0.36 (Andrew Grant) FFS instruction. Time = 0.90 (ephemient) Branch free mask. Time = 3.48 (Dan / Jim Balter) Double hack. Time = 3.41 (DocMax) $ gcc bench.c -o bench.exe -std=c99 -Wall -O2 -march=native $ bench Naive loop. Time = 2.92 De Bruijn multiply. Time = 0.47 Lookup table. Time = 0.35 FFS instruction. Time = 0.68 Branch free mask. Time = 3.49 Double hack. Time = 0.92

Mi código:

#include <stdio.h> #include <stdlib.h> #include <time.h> #define ARRAY_SIZE 65536 #define NUM_ITERS 5000 // Number of times to process array int find_first_bits_naive_loop(unsigned nums[ARRAY_SIZE]) { int total = 0; // Prevent compiler from optimizing out the code for (int j = 0; j < NUM_ITERS; j++) { for (int i = 0; i < ARRAY_SIZE; i++) { unsigned value = nums[i]; if (value == 0) continue; unsigned pos = 0; while (!(value & 1)) { value >>= 1; ++pos; } total += pos + 1; } } return total; } int find_first_bits_de_bruijn(unsigned nums[ARRAY_SIZE]) { static const int MultiplyDeBruijnBitPosition[32] = { 1, 2, 29, 3, 30, 15, 25, 4, 31, 23, 21, 16, 26, 18, 5, 9, 32, 28, 14, 24, 22, 20, 17, 8, 27, 13, 19, 7, 12, 6, 11, 10 }; int total = 0; // Prevent compiler from optimizing out the code for (int j = 0; j < NUM_ITERS; j++) { for (int i = 0; i < ARRAY_SIZE; i++) { unsigned int c = nums[i]; total += MultiplyDeBruijnBitPosition[((unsigned)((c & -c) * 0x077CB531U)) >> 27]; } } return total; } unsigned char lowestBitTable[256]; int get_lowest_set_bit(unsigned num) { unsigned mask = 1; for (int cnt = 1; cnt <= 32; cnt++, mask <<= 1) { if (num & mask) { return cnt; } } return 0; } int find_first_bits_lookup_table(unsigned nums[ARRAY_SIZE]) { int total = 0; // Prevent compiler from optimizing out the code for (int j = 0; j < NUM_ITERS; j++) { for (int i = 0; i < ARRAY_SIZE; i++) { unsigned int value = nums[i]; // note that order to check indices will depend whether you are on a big // or little endian machine. This is for little-endian unsigned char *bytes = (unsigned char *)&value; if (bytes[0]) total += lowestBitTable[bytes[0]]; else if (bytes[1]) total += lowestBitTable[bytes[1]] + 8; else if (bytes[2]) total += lowestBitTable[bytes[2]] + 16; else total += lowestBitTable[bytes[3]] + 24; } } return total; } int find_first_bits_ffs_instruction(unsigned nums[ARRAY_SIZE]) { int total = 0; // Prevent compiler from optimizing out the code for (int j = 0; j < NUM_ITERS; j++) { for (int i = 0; i < ARRAY_SIZE; i++) { total += __builtin_ffs(nums[i]); } } return total; } int find_first_bits_branch_free_mask(unsigned nums[ARRAY_SIZE]) { int total = 0; // Prevent compiler from optimizing out the code for (int j = 0; j < NUM_ITERS; j++) { for (int i = 0; i < ARRAY_SIZE; i++) { unsigned value = nums[i]; int i16 = !(value & 0xffff) << 4; value >>= i16; int i8 = !(value & 0xff) << 3; value >>= i8; int i4 = !(value & 0xf) << 2; value >>= i4; int i2 = !(value & 0x3) << 1; value >>= i2; int i1 = !(value & 0x1); int i0 = (value >> i1) & 1? 0 : -32; total += i16 + i8 + i4 + i2 + i1 + i0 + 1; } } return total; } int find_first_bits_double_hack(unsigned nums[ARRAY_SIZE]) { int total = 0; // Prevent compiler from optimizing out the code for (int j = 0; j < NUM_ITERS; j++) { for (int i = 0; i < ARRAY_SIZE; i++) { unsigned value = nums[i]; double d = value ^ (value - !!value); total += (((int*)&d)[1]>>20)-1022; } } return total; } int main() { unsigned nums[ARRAY_SIZE]; for (int i = 0; i < ARRAY_SIZE; i++) { nums[i] = rand() + (rand() << 15); } for (int i = 0; i < 256; i++) { lowestBitTable[i] = get_lowest_set_bit(i); } clock_t start_time, end_time; int result; start_time = clock(); result = find_first_bits_naive_loop(nums); end_time = clock(); printf("Naive loop. Time = %.2f, result = %d/n", (end_time - start_time) / (double)(CLOCKS_PER_SEC), result); start_time = clock(); result = find_first_bits_de_bruijn(nums); end_time = clock(); printf("De Bruijn multiply. Time = %.2f, result = %d/n", (end_time - start_time) / (double)(CLOCKS_PER_SEC), result); start_time = clock(); result = find_first_bits_lookup_table(nums); end_time = clock(); printf("Lookup table. Time = %.2f, result = %d/n", (end_time - start_time) / (double)(CLOCKS_PER_SEC), result); start_time = clock(); result = find_first_bits_ffs_instruction(nums); end_time = clock(); printf("FFS instruction. Time = %.2f, result = %d/n", (end_time - start_time) / (double)(CLOCKS_PER_SEC), result); start_time = clock(); result = find_first_bits_branch_free_mask(nums); end_time = clock(); printf("Branch free mask. Time = %.2f, result = %d/n", (end_time - start_time) / (double)(CLOCKS_PER_SEC), result); start_time = clock(); result = find_first_bits_double_hack(nums); end_time = clock(); printf("Double hack. Time = %.2f, result = %d/n", (end_time - start_time) / (double)(CLOCKS_PER_SEC), result); }

Bit Twiddling Hacks ofrece una excelente colección de, eh, bit twiddling hacks, con discusión de rendimiento / optimización adjunta. Mi solución favorita para su problema (desde ese sitio) es «multiplicar y buscar»:

unsigned int v; // find the number of trailing zeros in 32-bit v int r; // result goes here static const int MultiplyDeBruijnBitPosition[32] = { 0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8, 31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9 }; r = MultiplyDeBruijnBitPosition[((uint32_t)((v & -v) * 0x077CB531U)) >> 27];

Referencias útiles:

" Uso de las secuencias de Bruijn para indexar un 1 en una palabra de computadora ": explicación sobre por qué funciona el código anterior.
" Representación de la placa> Bitboards> BitScan ": análisis detallado de este problema, con un enfoque particular en la programación de ajedrez

If you have the resources, you can sacrifice memory in order to improve the speed:

static const unsigned bitPositions[MAX_INT] = { 0, 0, 1, 0, 2, /* ... */ }; unsigned GetLowestBitPos(unsigned value) { assert(value != 0); // handled separately return bitPositions[value]; }

Note: This table would consume at least 4 GB (16 GB if we leave the return type as unsigned ). This is an example of trading one limited resource (RAM) for another (execution speed).

If your function needs to remain portable and run as fast as possible at any cost, this would be the way to go. In most real-world applications, a 4GB table is unrealistic.

Se puede hacer con el peor caso de menos de 32 operaciones:

Principio: Verificar dos o más bits es tan eficiente como verificar 1 bit.

Así que, por ejemplo, no hay nada que te impida verificar para qué agrupación está primero, luego verificar cada bit del más pequeño al más grande en ese grupo.

Asi que...
Si comprueba 2 bits a la vez, tiene en el peor de los casos (Nbits / 2) + 1 verifica el total.
si comprueba 3 bits a la vez, tiene en el peor de los casos (Nbits / 3) + 2 verificaciones en total.
...

Lo óptimo sería verificar en grupos de 4. Lo que requeriría en el peor de los casos 11 operaciones en lugar de las 32.

El mejor caso va desde el 1 cheque de tu algoritmo hasta 2 cheques si usas esta idea de agrupamiento. Pero ese cheque extra 1 en el mejor de los casos vale la pena para el peor de los casos.

Nota: Lo escribo completo en lugar de usar un bucle porque es más eficiente de esa manera.

int getLowestBitPos(unsigned int value) { //Group 1: Bits 0-3 if(value&0xf) { if(value&0x1) return 0; else if(value&0x2) return 1; else if(value&0x4) return 2; else return 3; } //Group 2: Bits 4-7 if(value&0xf0) { if(value&0x10) return 4; else if(value&0x20) return 5; else if(value&0x40) return 6; else return 7; } //Group 3: Bits 8-11 if(value&0xf00) { if(value&0x100) return 8; else if(value&0x200) return 9; else if(value&0x400) return 10; else return 11; } //Group 4: Bits 12-15 if(value&0xf000) { if(value&0x1000) return 12; else if(value&0x2000) return 13; else if(value&0x4000) return 14; else return 15; } //Group 5: Bits 16-19 if(value&0xf0000) { if(value&0x10000) return 16; else if(value&0x20000) return 17; else if(value&0x40000) return 18; else return 19; } //Group 6: Bits 20-23 if(value&0xf00000) { if(value&0x100000) return 20; else if(value&0x200000) return 21; else if(value&0x400000) return 22; else return 23; } //Group 7: Bits 24-27 if(value&0xf000000) { if(value&0x1000000) return 24; else if(value&0x2000000) return 25; else if(value&0x4000000) return 26; else return 27; } //Group 8: Bits 28-31 if(value&0xf0000000) { if(value&0x10000000) return 28; else if(value&0x20000000) return 29; else if(value&0x40000000) return 30; else return 31; } return -1; }

Found this clever trick using ''magic masks'' in "The art of programming, part 4", which does it in O(log(n)) time for n-bit number. [with log(n) extra space]. Typical solutions checking for the set bit is either O(n) or need O(n) extra space for a look up table, so this is a good compromise.

Magic masks:

m0 = (...............01010101) m1 = (...............00110011) m2 = (...............00001111) m3 = (.......0000000011111111) ....

Key idea: No of trailing zeros in x = 1 * [(x & m0) = 0] + 2 * [(x & m1) = 0] + 4 * [(x & m2) = 0] + ...

int lastSetBitPos(const uint64_t x) { if (x == 0) return -1; //For 64 bit number, log2(64)-1, ie; 5 masks needed int steps = log2(sizeof(x) * 8); assert(steps == 6); //magic masks uint64_t m[] = { 0x5555555555555555, // .... 010101 0x3333333333333333, // .....110011 0x0f0f0f0f0f0f0f0f, // ...00001111 0x00ff00ff00ff00ff, //0000000011111111 0x0000ffff0000ffff, 0x00000000ffffffff }; //Firstly extract only the last set bit uint64_t y = x & -x; int trailZeros = 0, i = 0 , factor = 0; while (i < steps) { factor = ((y & m[i]) == 0 ) ? 1 : 0; trailZeros += factor * pow(2,i); ++i; } return (trailZeros+1); }

If C++11 is available for you, a compiler sometimes can do the task for you :)

constexpr std::uint64_t lssb(const std::uint64_t value) { return !value ? 0 : (value % 2 ? 1 : lssb(value >> 1) + 1); }

Result is 1-based index.

This is in regards of @Anton Tykhyy answer

Here is my C++11 constexpr implementation doing away with casts and removing a warning on VC++17 by truncating a 64bit result to 32 bits:

constexpr uint32_t DeBruijnSequence[32] = { 0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8, 31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9 }; constexpr uint32_t ffs ( uint32_t value ) { return DeBruijnSequence[ (( ( value & ( -static_cast<int32_t>(value) ) ) * 0x077CB531ULL ) & 0xFFFFFFFF) >> 27]; }

To get around the issue of 0x1 and 0x0 both returning 0 you can do:

constexpr uint32_t ffs ( uint32_t value ) { return (!value) ? 32 : DeBruijnSequence[ (( ( value & ( -static_cast<int32_t>(value) ) ) * 0x077CB531ULL ) & 0xFFFFFFFF) >> 27]; }

but if the compiler can''t or won''t preprocess the call it will add a couple of cycles to the calculation.

Finally, if interested, here''s a list of static asserts to check that the code does what is intended to:

static_assert (ffs(0x1) == 0, "Find First Bit Set Failure."); static_assert (ffs(0x2) == 1, "Find First Bit Set Failure."); static_assert (ffs(0x4) == 2, "Find First Bit Set Failure."); static_assert (ffs(0x8) == 3, "Find First Bit Set Failure."); static_assert (ffs(0x10) == 4, "Find First Bit Set Failure."); static_assert (ffs(0x20) == 5, "Find First Bit Set Failure."); static_assert (ffs(0x40) == 6, "Find First Bit Set Failure."); static_assert (ffs(0x80) == 7, "Find First Bit Set Failure."); static_assert (ffs(0x100) == 8, "Find First Bit Set Failure."); static_assert (ffs(0x200) == 9, "Find First Bit Set Failure."); static_assert (ffs(0x400) == 10, "Find First Bit Set Failure."); static_assert (ffs(0x800) == 11, "Find First Bit Set Failure."); static_assert (ffs(0x1000) == 12, "Find First Bit Set Failure."); static_assert (ffs(0x2000) == 13, "Find First Bit Set Failure."); static_assert (ffs(0x4000) == 14, "Find First Bit Set Failure."); static_assert (ffs(0x8000) == 15, "Find First Bit Set Failure."); static_assert (ffs(0x10000) == 16, "Find First Bit Set Failure."); static_assert (ffs(0x20000) == 17, "Find First Bit Set Failure."); static_assert (ffs(0x40000) == 18, "Find First Bit Set Failure."); static_assert (ffs(0x80000) == 19, "Find First Bit Set Failure."); static_assert (ffs(0x100000) == 20, "Find First Bit Set Failure."); static_assert (ffs(0x200000) == 21, "Find First Bit Set Failure."); static_assert (ffs(0x400000) == 22, "Find First Bit Set Failure."); static_assert (ffs(0x800000) == 23, "Find First Bit Set Failure."); static_assert (ffs(0x1000000) == 24, "Find First Bit Set Failure."); static_assert (ffs(0x2000000) == 25, "Find First Bit Set Failure."); static_assert (ffs(0x4000000) == 26, "Find First Bit Set Failure."); static_assert (ffs(0x8000000) == 27, "Find First Bit Set Failure."); static_assert (ffs(0x10000000) == 28, "Find First Bit Set Failure."); static_assert (ffs(0x20000000) == 29, "Find First Bit Set Failure."); static_assert (ffs(0x40000000) == 30, "Find First Bit Set Failure."); static_assert (ffs(0x80000000) == 31, "Find First Bit Set Failure.");

recently I see that singapore''s premier posted a program he wrote on facebook, there is one line to mention it..

The logic is simply "value & -value", suppose you have 0x0FF0, then, 0FF0 & (F00F+1) , which equals 0x0010, that means the lowest 1 is in the 4th bit.. :)

unsigned GetLowestBitPos(unsigned value) { if (value & 1) return 1; if (value & 2) return 2; if (value & 4) return 3; if (value & 8) return 4; if (value & 16) return 5; if (value & 32) return 6; if (value & 64) return 7; if (value & 128) return 8; if (value & 256) return 9; if (value & 512) return 10; if (value & 1024) return 11; if (value & 2048) return 12; if (value & 4096) return 13; if (value & 8192) return 14; if (value & 16384) return 15; if (value & 32768) return 16; if (value & 65536) return 17; if (value & 131072) return 18; if (value & 262144) return 19; if (value & 524288) return 20; if (value & 1048576) return 21; if (value & 2097152) return 22; if (value & 4194304) return 23; if (value & 8388608) return 24; if (value & 16777216) return 25; if (value & 33554432) return 26; if (value & 67108864) return 27; if (value & 134217728) return 28; if (value & 268435456) return 29; if (value & 536870912) return 30; return 31; }

50% of all numbers will return on the first line of code.

75% of all numbers will return on the first 2 lines of code.

87% of all numbers will return in the first 3 lines of code.

94% of all numbers will return in the first 4 lines of code.

97% of all numbers will return in the first 5 lines of code.

etc.

I think people that are complaining on how inefficient the worst case scenario for this code don''t understand how rare that condition will happen.