java - Find the index of a byte array within another byte array


Is this what you are looking for?

public class KPM {
    /**
     * Search the data byte array for the first occurrence of the byte array
     * pattern within given boundaries.
     *
     * @param data    array to search in
     * @param start   first index in data (inclusive)
     * @param stop    end index in data (exclusive), so that stop - start = length
     * @param pattern what is being searched; must be non-empty. The byte '*' can be
     *                used as a wildcard for "ANY character", so a literal '*'
     *                cannot be searched for
     * @return index of the first match, or -1 if there is none
     */
    public static int indexOf(byte[] data, int start, int stop, byte[] pattern) {
        if (data == null || pattern == null) return -1;

        int[] failure = computeFailure(pattern);
        int j = 0; // number of pattern bytes matched so far

        for (int i = start; i < stop; i++) {
            // On a mismatch, fall back to the longest border of the matched prefix.
            while (j > 0 && (pattern[j] != '*' && pattern[j] != data[i])) {
                j = failure[j - 1];
            }
            if (pattern[j] == '*' || pattern[j] == data[i]) {
                j++;
            }
            if (j == pattern.length) {
                return i - pattern.length + 1;
            }
        }
        return -1;
    }

    /**
     * Computes the failure function using a boot-strapping process,
     * where the pattern is matched against itself.
     */
    private static int[] computeFailure(byte[] pattern) {
        int[] failure = new int[pattern.length];

        int j = 0;
        for (int i = 1; i < pattern.length; i++) {
            while (j > 0 && pattern[j] != pattern[i]) {
                j = failure[j - 1];
            }
            if (pattern[j] == pattern[i]) {
                j++;
            }
            failure[i] = j;
        }
        return failure;
    }
}

Given a byte array, how can I find the position of a (smaller) byte array within it?

This documentation looked promising, using ArrayUtils, but if I am right it would only let me find a single byte within the array being searched.

(I can't see that it matters, but just in case: sometimes the search byte array will be regular ASCII characters, other times it will be control characters or extended ASCII characters. So using String operations would not always be appropriate.)

The large array could be between 10 and about 10000 bytes, and the smaller array around 10. In some cases I will have several smaller arrays that I want to find within the larger array in a single search. And sometimes I will want the last index of an occurrence rather than the first.

Copied almost verbatim from java.lang.String.

indexOf(char[], int, int, char[], int, int, int)

static int indexOf(byte[] source, int sourceOffset, int sourceCount,
                   byte[] target, int targetOffset, int targetCount,
                   int fromIndex) {
    if (fromIndex >= sourceCount) {
        return (targetCount == 0 ? sourceCount : -1);
    }
    if (fromIndex < 0) {
        fromIndex = 0;
    }
    if (targetCount == 0) {
        return fromIndex;
    }

    byte first = target[targetOffset];
    int max = sourceOffset + (sourceCount - targetCount);

    for (int i = sourceOffset + fromIndex; i <= max; i++) {
        /* Look for first character. */
        if (source[i] != first) {
            while (++i <= max && source[i] != first)
                ;
        }

        /* Found first character, now look at the rest of v2 */
        if (i <= max) {
            int j = i + 1;
            int end = j + targetCount - 1;
            for (int k = targetOffset + 1; j < end && source[j] == target[k]; j++, k++)
                ;

            if (j == end) {
                /* Found whole string. */
                return i - sourceOffset;
            }
        }
    }
    return -1;
}

Google's Guava provides Bytes.indexOf(byte[] array, byte[] target).

The simplest way would be to compare each element:

public int indexOf(byte[] outerArray, byte[] smallerArray) {
    // Try every start position at which smallerArray could still fit.
    for (int i = 0; i < outerArray.length - smallerArray.length + 1; ++i) {
        boolean found = true;
        for (int j = 0; j < smallerArray.length; ++j) {
            if (outerArray[i + j] != smallerArray[j]) {
                found = false;
                break;
            }
        }
        if (found) return i;
    }
    return -1;
}

Some tests:

@Test
public void testIndexOf() {
    byte[] outer = {1, 2, 3, 4};
    assertEquals(0, indexOf(outer, new byte[]{1, 2}));
    assertEquals(1, indexOf(outer, new byte[]{2, 3}));
    assertEquals(2, indexOf(outer, new byte[]{3, 4}));
    assertEquals(-1, indexOf(outer, new byte[]{4, 4}));
    assertEquals(-1, indexOf(outer, new byte[]{4, 5}));
    assertEquals(-1, indexOf(outer, new byte[]{4, 5, 6, 7, 8}));
}
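The question also asks for the last index of an occurrence. A minimal sketch of the same element-by-element comparison, scanning start positions from right to left, might look like this. The class name `LastIndexSketch` and the method `lastIndexOf` are illustrative names, not part of any library mentioned in this thread.

```java
final class LastIndexSketch {

    /** Returns the start index of the last occurrence of needle in haystack, or -1. */
    static int lastIndexOf(byte[] haystack, byte[] needle) {
        // Start at the rightmost position where needle still fits and walk left.
        for (int i = haystack.length - needle.length; i >= 0; i--) {
            boolean found = true;
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    found = false;
                    break;
                }
            }
            if (found) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] outer = {1, 2, 3, 1, 2};
        System.out.println(lastIndexOf(outer, new byte[] {1, 2})); // prints 3
        System.out.println(lastIndexOf(outer, new byte[] {2, 1})); // prints -1
    }
}
```

For the sizes given in the question (about 10000 bytes searched for about 10), this brute-force scan is likely fast enough; an empty needle yields `haystack.length`, mirroring `String.lastIndexOf("")`.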

Since you updated your question: Java strings are UTF-16 strings; they do not care about the extended ASCII set, so you could use string.indexOf().

Java strings are made up of 16-bit chars, not 8-bit bytes. A char can hold a byte, so you can always convert your byte arrays into strings and use indexOf: ASCII characters, control characters and even zero characters will work fine, provided the charset used for the conversion maps every byte to exactly one char (ISO-8859-1 does).

Here is a demonstration:

import java.nio.charset.StandardCharsets;

byte[] big = new byte[] {1,2,3,0,4,5,6,7,0,8,9,0,0,1,2,3,4};
byte[] small = new byte[] {7,0,8,9,0,0,1};
// ISO-8859-1 maps each byte to one char, so string indices equal byte indices
// even for bytes >= 0x80 (UTF-8 would not guarantee that).
String bigStr = new String(big, StandardCharsets.ISO_8859_1);
String smallStr = new String(small, StandardCharsets.ISO_8859_1);
System.out.println(bigStr.indexOf(smallStr));

This prints 7.
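A caveat on the charset used for the conversion: string indices only line up with byte indices when every byte becomes exactly one char. The sketch below contrasts ISO-8859-1, which has that property, with UTF-8, which merges multi-byte sequences into a single char. The class and method names are illustrative only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Latin1SearchSketch {

    /** Searches via String.indexOf after decoding both arrays with the given charset. */
    static int indexOfVia(Charset charset, byte[] haystack, byte[] needle) {
        String h = new String(haystack, charset);
        String n = new String(needle, charset);
        return h.indexOf(n);
    }

    /** Byte-index-preserving variant: ISO-8859-1 maps each byte to one char. */
    static int indexOf(byte[] haystack, byte[] needle) {
        return indexOfVia(StandardCharsets.ISO_8859_1, haystack, needle);
    }

    public static void main(String[] args) {
        // 0xC3 0xA9 is the UTF-8 encoding of a single character (e with acute accent).
        byte[] big = {(byte) 0xC3, (byte) 0xA9, 0x41};
        byte[] small = {0x41};

        System.out.println(indexOf(big, small));                             // prints 2
        System.out.println(indexOfVia(StandardCharsets.UTF_8, big, small)); // prints 1
    }
}
```

With UTF-8 the two leading bytes collapse into one char, so the reported position is a char index (1) rather than the byte index (2).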

However, considering that your large array can be up to 10,000 bytes and the small array is only about ten bytes, this solution may not be the most efficient, for two reasons:

It requires copying your large array into an array that is twice as large (same capacity, but with char instead of byte). This triples your memory requirements.
Java's string search algorithm is not the fastest available. You may get a noticeably faster search if you implement one of the advanced algorithms, for example Knuth–Morris–Pratt. This could reduce the execution time by a factor of up to ten (the length of the small array), and it requires additional memory proportional to the length of the small array, not the large one.
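To make the extra memory mentioned above concrete: Knuth–Morris–Pratt precomputes one int per pattern byte, a "failure" (or border) table that tells the search how far it can fall back after a mismatch without re-reading the large array. A minimal sketch of that table for a sample pattern follows; the class name `FailureTableSketch` and the sample pattern are chosen purely for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class FailureTableSketch {

    /**
     * failure[i] is the length of the longest proper prefix of pattern[0..i]
     * that is also a suffix of pattern[0..i].
     */
    static int[] failureTable(byte[] pattern) {
        int[] failure = new int[pattern.length];
        int j = 0; // length of the border currently being extended
        for (int i = 1; i < pattern.length; i++) {
            while (j > 0 && pattern[j] != pattern[i]) {
                j = failure[j - 1]; // fall back to the next shorter border
            }
            if (pattern[j] == pattern[i]) {
                j++;
            }
            failure[i] = j;
        }
        return failure;
    }

    public static void main(String[] args) {
        byte[] pattern = "abababca".getBytes(StandardCharsets.US_ASCII);
        System.out.println(Arrays.toString(failureTable(pattern)));
        // prints [0, 0, 1, 2, 3, 4, 0, 1]
    }
}
```

The table has exactly `pattern.length` entries, which is why the additional memory depends only on the small array.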

To save you time on testing:

http://helpdesk.objects.com.au/java/search-a-byte-array-for-a-byte-sequence

gives you code that works if you make computeFailure() static:

public class KPM {
    /**
     * Search the data byte array for the first occurrence
     * of the byte array pattern.
     */
    public static int indexOf(byte[] data, byte[] pattern) {
        int[] failure = computeFailure(pattern);

        int j = 0;

        for (int i = 0; i < data.length; i++) {
            while (j > 0 && pattern[j] != data[i]) {
                j = failure[j - 1];
            }
            if (pattern[j] == data[i]) {
                j++;
            }
            if (j == pattern.length) {
                return i - pattern.length + 1;
            }
        }
        return -1;
    }

    /**
     * Computes the failure function using a boot-strapping process,
     * where the pattern is matched against itself.
     */
    private static int[] computeFailure(byte[] pattern) {
        int[] failure = new int[pattern.length];

        int j = 0;
        for (int i = 1; i < pattern.length; i++) {
            while (j > 0 && pattern[j] != pattern[i]) {
                j = failure[j - 1];
            }
            if (pattern[j] == pattern[i]) {
                j++;
            }
            failure[i] = j;
        }
        return failure;
    }
}

Since it is always wise to test code you borrow, you can start with:

public class Test {
    public static void main(String[] args) {
        do_test1();
    }

    static void do_test1() {
        String[] ss = {
            "",
            "\r\n\r\n",
            "\n\n",
            "\r\n\r\nthis is a test",
            "this is a test\r\n\r\n",
            "this is a test\r\n\r\nthis si a test",
            "this is a test\r\n\r\nthis si a test\r\n\r\n",
            "this is a test\n\r\nthis si a test",
            "this is a test\r\nthis si a test\r\n\r\n",
            "this is a test"
        };
        // Print the index of the first CRLF CRLF sequence in each sample.
        for (String s : ss) {
            System.out.println("" + KPM.indexOf(s.getBytes(), "\r\n\r\n".getBytes()) + " in [" + s + "]");
        }
    }
}

For a small HTTP server I am currently working on, I came up with the following code to find boundaries in a multipart/form-data request. I was hoping to find a better solution here, but I think I will stick with it. I believe it is about as efficient as it can get (quite fast and it does not use much RAM). It uses the input bytes as a ring buffer, reads the next byte as soon as the buffer does not match the boundary, and writes the data to the output stream after the first full cycle. Of course it can be changed to work on byte arrays instead of streams, as asked in the question.

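private boolean multipartUploadParseOutput(InputStream is, OutputStream os, String boundary) {
    try {
        String n = "--" + boundary;
        byte[] bc = n.getBytes("UTF-8"); // boundary bytes to look for
        int s = bc.length;
        byte[] b = new byte[s];          // ring buffer holding the last s bytes read
        int p = 0;                       // next write position (oldest byte once full)
        long l = 0;                      // total number of bytes read so far
        int c;
        boolean r;
        while ((c = is.read()) != -1) {
            b[p] = (byte) c;
            l += 1;
            p = (int) (l % s);
            if (l > p) { // true once at least s bytes have been read
                // Compare the buffer, starting at its oldest byte, with the boundary.
                r = true;
                for (int i = 0; i < s; i++) {
                    if (b[(p + i) % s] != bc[i]) {
                        r = false;
                        break;
                    }
                }
                if (r)
                    break;
                // No match: emit the oldest byte, which is overwritten next.
                // Note: if the stream ends without a match, the bytes still
                // held in the buffer are not written.
                os.write(b[p]);
            }
        }
        os.flush();
        return true;
    } catch (IOException e) {
        e.printStackTrace();
    }
    return false;
}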

Using the Knuth–Morris–Pratt algorithm is the most efficient way.

StreamSearcher.java is an implementation of it and is part of Twitter's elephant-bird project.

Including the whole library is not recommended, since it is rather large for using just a single class.

import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/**
 * An efficient stream searching class based on the Knuth-Morris-Pratt algorithm.
 * For more on how the algorithm works, see:
 * http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
 */
public class StreamSearcher {

    private byte[] pattern_;
    private int[] borders_;

    // An upper bound on pattern length for searching. Results are undefined for longer patterns.
    @SuppressWarnings("unused")
    public static final int MAX_PATTERN_LENGTH = 1024;

    StreamSearcher(byte[] pattern) {
        setPattern(pattern);
    }

    /**
     * Sets a new pattern for this StreamSearcher to use.
     *
     * @param pattern the pattern the StreamSearcher will look for in future calls to search(...)
     */
    public void setPattern(byte[] pattern) {
        pattern_ = Arrays.copyOf(pattern, pattern.length);
        borders_ = new int[pattern_.length + 1];
        preProcess();
    }

    /**
     * Searches for the next occurrence of the pattern in the stream, starting from the
     * current stream position. Note that the position of the stream is changed. If a match
     * is found, the stream points to the end of the match -- i.e. the byte AFTER the pattern.
     * Else, the stream is entirely consumed. The latter is because InputStream semantics make
     * it difficult to have another reasonable default, i.e. leave the stream unchanged.
     *
     * @return bytes consumed if found, -1 otherwise.
     */
    long search(InputStream stream) throws IOException {
        long bytesRead = 0;

        int b;
        int j = 0;

        while ((b = stream.read()) != -1) {
            bytesRead++;

            while (j >= 0 && (byte) b != pattern_[j]) {
                j = borders_[j];
            }
            // Move to the next character in the pattern.
            ++j;

            // If we've matched up to the full pattern length, we found it. Return,
            // which will automatically save our position in the InputStream at the point
            // immediately following the pattern match.
            if (j == pattern_.length) {
                return bytesRead;
            }
        }

        // No dice. Note that the stream is now completely consumed.
        return -1;
    }

    /**
     * Builds up a table of longest "borders" for each prefix of the pattern to find.
     * This table is stored internally and aids in implementation of the
     * Knuth-Morris-Pratt string search.
     * <p>
     * For more information, see:
     * http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
     */
    private void preProcess() {
        int i = 0;
        int j = -1;
        borders_[i] = j;
        while (i < pattern_.length) {
            while (j >= 0 && pattern_[i] != pattern_[j]) {
                j = borders_[j];
            }
            borders_[++i] = ++j;
        }
    }
}

package org.example;

import java.util.List;

import org.riversun.finbin.BinarySearcher;

public class Sample2 {

    public static void main(String[] args) throws Exception {

        BinarySearcher bs = new BinarySearcher();

        // UTF-8 without BOM
        byte[] srcBytes = "Hello world.It's a small world.".getBytes("utf-8");

        byte[] searchBytes = "world".getBytes("utf-8");

        List<Integer> indexList = bs.searchBytes(srcBytes, searchBytes);

        System.out.println("indexList=" + indexList);
    }
}

which results in

indexList=[6, 25]

So you can find the index of a byte[] within a byte[].

Example here on GitHub: https://github.com/riversun/finbin
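If adding a dependency is not an option, collecting every match position can also be sketched with the standard library alone. The helper below is a hypothetical stand-in written for illustration, not finbin's implementation; it uses a plain brute-force scan and also reports overlapping matches, which may or may not match the library's behaviour.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class AllIndexesSketch {

    /** Returns the start index of every occurrence of target in source. */
    static List<Integer> indexesOf(byte[] source, byte[] target) {
        List<Integer> result = new ArrayList<>();
        if (target.length == 0) {
            return result; // assumption: an empty target yields no matches
        }
        for (int i = 0; i <= source.length - target.length; i++) {
            int j = 0;
            while (j < target.length && source[i + j] == target[j]) {
                j++;
            }
            if (j == target.length) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] srcBytes = "Hello world.It's a small world.".getBytes(StandardCharsets.UTF_8);
        byte[] searchBytes = "world".getBytes(StandardCharsets.UTF_8);
        System.out.println("indexList=" + indexesOf(srcBytes, searchBytes));
        // prints indexList=[6, 25]
    }
}
```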