algorithm - horspool - ¿Existe una función de búsqueda y reemplazo rápido de cadenas de Boyer-Moore y un conteo rápido de cadenas para la Cadena Delphi 2010(UnicodeString)?

boyer-moore algorithm (2)

Desde que estaba buscando lo mismo: Jedi JCL tiene un motor de búsqueda con código Unicode que usa Boyer-Moore en jclUnicode.pas. No tengo idea de lo bueno o rápido que es.

Necesito tres funciones fast-on-large-string: búsqueda rápida, búsqueda rápida y reemplazo, y conteo rápido de subcadenas en una cadena.

Me he topado con búsquedas de cadenas de Boyer-Moore en C ++ y Python, pero el único algoritmo de Delphi Boyer-Moore utilizado para implementar la búsqueda y el reemplazo rápidos que he encontrado es parte de los FastStrings de Peter Morris, anteriormente en el software DroopyEyes, y su sitio web y el correo electrónico ya no funciona.

Ya he FastStrings para que funcione bien para AnsiStrings en Delphi 2009/2010, donde un byte es igual a un AnsiChar, pero hacer que también funcionen con String (UnicodeString) en Delphi 2010 parece no trivial.

Usando este algoritmo de Boyer-Moore, debería ser posible realizar fácilmente búsquedas que no distingan entre mayúsculas y minúsculas, así como búsquedas y reemplazos sin distinción de mayúsculas y minúsculas, sin ninguna cadena temporal (usando StrUpper, etc.) y sin llamar a Pos (), que es más lento que Boyer- Moore busca cuando se requieren búsquedas repetidas sobre el mismo texto.

(Edit: Tengo una solución parcial, escrita como una respuesta a esta pregunta, está casi al 100% completa, incluso tiene una función de reemplazo de cadena rápida. Creo que DEBE tener errores, y especialmente creo que ya que pretende ser Unicode capaz de ser que haya fallas debido a promesas Unicode no cumplidas).

(Edit2: resultado interesante e inesperado; el gran tamaño de pila de una tabla de puntos de código de Unicode en la pila - SkipTable en el código de abajo pone un serio obstáculo en la cantidad de optimización de ganar-ganar que puedes hacer aquí en un boyer de cuerdas Unicode -Mayor búsqueda de cadenas. Gracias a Florent Ouchet por señalar lo que debería haber notado de inmediato.

Esta respuesta ahora está completa y funciona en modo sensible a mayúsculas y minúsculas, pero no funciona en modo que no distingue entre mayúsculas y minúsculas, y probablemente también tenga otros errores, ya que no está bien probado en la unidad, y probablemente podría optimizarse aún más, por ejemplo, repetí la función local __SameChar en lugar de utilizar una función de comparación, la devolución de llamada hubiera sido más rápida y, de hecho, permitirle al usuario pasar una función de comparación para todos estos sería genial para los usuarios de Unicode que desean proporcionar alguna lógica adicional (conjuntos equivalentes de glifos de Unicode para algunos idiomas) ).

Basado en el código de Dorin Dominica, construí lo siguiente.

{ _FindStringBoyer: Boyer-Moore search algorith using regular String instead of AnsiSTring, and no ASM. Credited to Dorin Duminica. } function _FindStringBoyer(const sString, sPattern: string; const bCaseSensitive: Boolean = True; const fromPos: Integer = 1): Integer; function __SameChar(StringIndex, PatternIndex: Integer): Boolean; begin if bCaseSensitive then Result := (sString[StringIndex] = sPattern[PatternIndex]) else Result := (CompareText(sString[StringIndex], sPattern[PatternIndex]) = 0); end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean; var SkipTable: array [Char] of Integer; LengthPattern: Integer; LengthString: Integer; Index: Integer; kIndex: Integer; LastMarker: Integer; Large: Integer; chPattern: Char; begin if fromPos < 1 then raise Exception.CreateFmt(''Invalid search start position: %d.'', [fromPos]); LengthPattern := Length(sPattern); LengthString := Length(sString); for chPattern := Low(Char) to High(Char) do SkipTable[chPattern] := LengthPattern; for Index := 1 to LengthPattern -1 do SkipTable[sPattern[Index]] := LengthPattern - Index; Large := LengthPattern + LengthString + 1; LastMarker := SkipTable[sPattern[LengthPattern]]; SkipTable[sPattern[LengthPattern]] := Large; Index := fromPos + LengthPattern -1; Result := 0; while Index <= LengthString do begin repeat Index := Index + SkipTable[sString[Index]]; until Index > LengthString; if Index <= Large then Break else Index := Index - Large; kIndex := 1; while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do Inc(kIndex); if kIndex = LengthPattern then begin // Found, return. Result := Index - kIndex + 1; Index := Index + LengthPattern; exit; end else begin if __SameChar(Index, LengthPattern) then Index := Index + LastMarker else Index := Index + SkipTable[sString[Index]]; end; // if kIndex = LengthPattern then begin end; // while Index <= LengthString do begin end; { Written by Warren, using the above code as a starter, we calculate the SkipTable once, and then count the number of instances of a substring inside the main string, at a much faster rate than we could have done otherwise. Another thing that would be great is to have a function that returns an array of find-locations, which would be way faster to do than repeatedly calling Pos. } function _StringCountBoyer(const aSourceString, aFindString : String; Const CaseSensitive : Boolean = TRUE) : Integer; var foundPos:Integer; fromPos:Integer; Limit:Integer; guard:Integer; SkipTable: array [Char] of Integer; LengthPattern: Integer; LengthString: Integer; Index: Integer; kIndex: Integer; LastMarker: Integer; Large: Integer; chPattern: Char; function __SameChar(StringIndex, PatternIndex: Integer): Boolean; begin if CaseSensitive then Result := (aSourceString[StringIndex] = aFindString[PatternIndex]) else Result := (CompareText(aSourceString[StringIndex], aFindString[PatternIndex]) = 0); end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean; begin result := 0; foundPos := 1; fromPos := 1; Limit := Length(aSourceString); guard := Length(aFindString); Index := 0; LengthPattern := Length(aFindString); LengthString := Length(aSourceString); for chPattern := Low(Char) to High(Char) do SkipTable[chPattern] := LengthPattern; for Index := 1 to LengthPattern -1 do SkipTable[aFindString[Index]] := LengthPattern - Index; Large := LengthPattern + LengthString + 1; LastMarker := SkipTable[aFindString[LengthPattern]]; SkipTable[aFindString[LengthPattern]] := Large; while (foundPos>=1) and (fromPos < Limit) and (Index<Limit) do begin Index := fromPos + LengthPattern -1; if Index>Limit then break; kIndex := 0; while Index <= LengthString do begin repeat Index := Index + SkipTable[aSourceString[Index]]; until Index > LengthString; if Index <= Large then Break else Index := Index - Large; kIndex := 1; while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do Inc(kIndex); if kIndex = LengthPattern then begin // Found, return. //Result := Index - kIndex + 1; Index := Index + LengthPattern; fromPos := Index; Inc(Result); break; end else begin if __SameChar(Index, LengthPattern) then Index := Index + LastMarker else Index := Index + SkipTable[aSourceString[Index]]; end; // if kIndex = LengthPattern then begin end; // while Index <= LengthString do begin end; end;

Este es realmente un buen algoritmo, porque:

es mucho más rápido contar instancias de subcadena X en la cadena Y de esta manera, magníficamente.
Para simplemente reemplazar Pos (), _FindStringBoyer () es más rápido que la versión asm pura de Pos () contribuida a Delphi por el personal del proyecto FastCode, que se usa actualmente para Pos, y si necesita la insensibilidad de mayúsculas, puede imaginar el rendimiento aumenta cuando no tenemos que llamar a UpperCase en una cadena de 100 megabytes. (De acuerdo, tus cuerdas no van a ser TAN grandes. Pero aún así, los Algoritmos eficientes son una cosa de belleza).

Bueno, escribí un String Replace al estilo de Boyer-Moore:

function _StringReplaceBoyer(const aSourceString, aFindString,aReplaceString : String; Flags: TReplaceFlags) : String; var errors:Integer; fromPos:Integer; Limit:Integer; guard:Integer; SkipTable: array [Char] of Integer; LengthPattern: Integer; LengthString: Integer; Index: Integer; kIndex: Integer; LastMarker: Integer; Large: Integer; chPattern: Char; CaseSensitive:Boolean; foundAt:Integer; lastFoundAt:Integer; copyStartsAt:Integer; copyLen:Integer; function __SameChar(StringIndex, PatternIndex: Integer): Boolean; begin if CaseSensitive then Result := (aSourceString[StringIndex] = aFindString[PatternIndex]) else Result := (CompareText(aSourceString[StringIndex], aFindString[PatternIndex]) = 0); end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean; begin result := ''''; lastFoundAt := 0; fromPos := 1; errors := 0; CaseSensitive := rfIgnoreCase in Flags; Limit := Length(aSourceString); guard := Length(aFindString); Index := 0; LengthPattern := Length(aFindString); LengthString := Length(aSourceString); for chPattern := Low(Char) to High(Char) do SkipTable[chPattern] := LengthPattern; for Index := 1 to LengthPattern -1 do SkipTable[aFindString[Index]] := LengthPattern - Index; Large := LengthPattern + LengthString + 1; LastMarker := SkipTable[aFindString[LengthPattern]]; SkipTable[aFindString[LengthPattern]] := Large; while (fromPos>=1) and (fromPos <= Limit) and (Index<=Limit) do begin Index := fromPos + LengthPattern -1; if Index>Limit then break; kIndex := 0; foundAt := 0; while Index <= LengthString do begin repeat Index := Index + SkipTable[aSourceString[Index]]; until Index > LengthString; if Index <= Large then Break else Index := Index - Large; kIndex := 1; while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do Inc(kIndex); if kIndex = LengthPattern then begin foundAt := Index - kIndex + 1; Index := Index + LengthPattern; //fromPos := Index; fromPos := (foundAt+LengthPattern); if lastFoundAt=0 then begin copyStartsAt := 1; copyLen := foundAt-copyStartsAt; end else begin copyStartsAt := lastFoundAt+LengthPattern; copyLen := foundAt-copyStartsAt; end; if (copyLen<=0)or(copyStartsAt<=0) then begin Inc(errors); end; Result := Result + Copy(aSourceString, copyStartsAt, copyLen ) + aReplaceString; lastFoundAt := foundAt; if not (rfReplaceAll in Flags) then fromPos := 0; // break out of outer while loop too! break; end else begin if __SameChar(Index, LengthPattern) then Index := Index + LastMarker else Index := Index + SkipTable[aSourceString[Index]]; end; // if kIndex = LengthPattern then begin end; // while Index <= LengthString do begin end; if (lastFoundAt=0) then begin // nothing was found, just return whole original string Result := aSourceString; end else if (lastFoundAt+LengthPattern < Limit) then begin // the part that didn''t require any replacing, because nothing more was found, // or rfReplaceAll flag was not specified, is copied at the // end as the final step. copyStartsAt := lastFoundAt+LengthPattern; copyLen := Limit; { this number can be larger than needed to be, and it is harmless } Result := Result + Copy(aSourceString, copyStartsAt, copyLen ); end; end;

Bien, problema: apilar la huella de esto:

var skiptable : array [Char] of Integer; // 65536*4 bytes stack usage on Unicode delphi

Adiós CPU infierno, hola apila infierno. Si voy por una matriz dinámica, entonces tengo que cambiar su tamaño en tiempo de ejecución. Así que esto es básicamente rápido, porque el sistema de memoria virtual en su computadora no parpadea a 256K en la pila, pero esto no siempre es un código óptimo. Sin embargo, mi PC no parpadea en una pila tan grande como esta. No se convertirá en un valor predeterminado de la biblioteca estándar de Delphi ni ganará ningún desafío de código rápido en el futuro, con ese tipo de huella. Creo que las búsquedas repetidas son un caso en el que el código anterior debe escribirse como una clase, y el cuadro omitible debe ser un campo de datos en esa clase. Luego, puede construir la tabla boyer-moore una vez, y con el tiempo, si la cadena es invariante, use ese objeto repetidamente para hacer búsquedas rápidas.