c# - textos - Detectar diferencias entre dos cuerdas.

programa para comparar archivos y carpetas (5)

Aquí podría haber otra solución, código completo y comentado. Sin embargo, el resultado de su primer ejemplo original se invierte:

class Program { enum CharState { Add, Equal, Remove } struct CharResult { public char c; public CharState state; } static void Main(string[] args) { string a = "asdfghjk"; string b = "wsedrftr"; while (true) { Console.WriteLine("Enter string a (enter to quit) :"); a = Console.ReadLine(); if (a == string.Empty) break; Console.WriteLine("Enter string b :"); b = Console.ReadLine(); List<CharResult> result = calculate(a, b); DisplayResults(result); } Console.WriteLine("Press a key to exit"); Console.ReadLine(); } static List<CharResult> calculate(string a, string b) { List<CharResult> res = new List<CharResult>(); int i = 0, j = 0; char[] array_a = a.ToCharArray(); char[] array_b = b.ToCharArray(); while (i < array_a.Length && j < array_b.Length) { //For the current char in a, we check for the equal in b int index = b.IndexOf(array_a[i], j); if (index < 0) //not found, this char should be removed { res.Add(new CharResult() { c = array_a[i], state = CharState.Remove }); i++; } else { //we add all the chars between B''s current index and the index while (j < index) { res.Add(new CharResult() { c = array_b[j], state = CharState.Add }); j++; } //then we say the current is the same res.Add(new CharResult() { c = array_a[i], state = CharState.Equal }); i++; j++; } } while (i < array_a.Length) { //b is now empty, we remove the remains res.Add(new CharResult() { c = array_a[i], state = CharState.Remove }); i++; } while (j < array_b.Length) { //a has been treated, we add the remains res.Add(new CharResult() { c = array_b[j], state = CharState.Add }); j++; } return res; } static void DisplayResults(List<CharResult> results) { foreach (CharResult r in results) { Console.WriteLine($"''{r.c}'' - {r.state}"); } } }

Tengo 2 cuerdas

string a = "foo bar"; string b = "bar foo";

y quiero detectar los cambios de a a b . ¿Qué personajes tengo que cambiar para ir de a a b ?

Creo que debe haber una iteración sobre cada personaje y detectar si se agregó, se eliminó o si se mantuvo igual. Así que este es mi resultado exprected

''f'' Remove ''o'' Remove ''o'' Remove '' '' Remove ''b'' Equal ''a'' Equal ''r'' Equal '' '' Add ''f'' Add ''o'' Add ''o'' Add

clase y enumeración para el resultado:

public enum Operation { Add,Equal,Remove }; public class Difference { public Operation op { get; set; } public char c { get; set; } }

Aquí está mi solución, pero el caso de "Eliminar" no me queda claro cómo debe verse el código

public static List<Difference> CalculateDifferences(string left, string right) { int count = 0; List<Difference> result = new List<Difference>(); foreach (char ch in left) { int index = right.IndexOf(ch, count); if (index == count) { count++; result.Add(new Difference() { c = ch, op = Operation.Equal }); } else if (index > count) { string add = right.Substring(count, index - count); result.AddRange(add.Select(x => new Difference() { c = x, op = Operation.Add })); count += add.Length; } else { //Remove? } } return result; }

¿Cómo debe verse el código para los caracteres eliminados?

Actualización - agregó algunos ejemplos más

Ejemplo 1:

string a = "foobar"; string b = "fooar";

Resultado Esperado:

''f'' Equal ''o'' Equal ''o'' Equal ''b'' Remove ''a'' Equal ''r'' Equal

ejemplo 2:

string a = "asdfghjk"; string b = "wsedrftr";

Resultado Esperado:

''a'' Remove ''w'' Add ''s'' Equal ''e'' Add ''d'' Equal ''r'' Add ''f'' Equal ''g'' Remove ''h'' Remove ''j'' Remove ''k'' Remove ''t'' Add ''r'' Add

Actualizar:

Aquí hay una comparación entre la respuesta de Dmitry e ingen: https://dotnetfiddle.net/MJQDAO

Está buscando la distancia de edición (mínima) / la secuencia de edición (mínima) . Puedes encontrar la teoría del proceso aquí:

web.stanford.edu/class/cs124/lec/med.pdf

Implementemos (más simple) el algoritmo Levenstein Distance / Sequence (para más detalles, consulte en.wikipedia.org/wiki/Levenshtein_distance ). Empecemos por las clases de ayuda (he cambiado un poco tu implementación de ellas):

public enum EditOperationKind : byte { None, // Nothing to do Add, // Add new character Edit, // Edit character into character (including char into itself) Remove, // Delete existing character }; public struct EditOperation { public EditOperation(char valueFrom, char valueTo, EditOperationKind operation) { ValueFrom = valueFrom; ValueTo = valueTo; Operation = valueFrom == valueTo ? EditOperationKind.None : operation; } public char ValueFrom { get; } public char ValueTo {get ;} public EditOperationKind Operation { get; } public override string ToString() { switch (Operation) { case EditOperationKind.None: return $"''{ValueTo}'' Equal"; case EditOperationKind.Add: return $"''{ValueTo}'' Add"; case EditOperationKind.Remove: return $"''{ValueFrom}'' Remove"; case EditOperationKind.Edit: return $"''{ValueFrom}'' to ''{ValueTo}'' Edit"; default: return "???"; } } }

Por lo que puedo ver en los ejemplos proporcionados, no tenemos ninguna operación de edición , pero agregamos + eliminamos ; por eso puse editCost = 2 cuando insertCost = 1 , int removeCost = 1 (en caso de empate : insert + remove vs. edit ponemos insert + remove ). Ahora estamos listos para implementar el algoritmo de Levenstein:

public static EditOperation[] EditSequence( string source, string target, int insertCost = 1, int removeCost = 1, int editCost = 2) { if (null == source) throw new ArgumentNullException("source"); else if (null == target) throw new ArgumentNullException("target"); // Forward: building score matrix // Best operation (among insert, update, delete) to perform EditOperationKind[][] M = Enumerable .Range(0, source.Length + 1) .Select(line => new EditOperationKind[target.Length + 1]) .ToArray(); // Minimum cost so far int[][] D = Enumerable .Range(0, source.Length + 1) .Select(line => new int[target.Length + 1]) .ToArray(); // Edge: all removes for (int i = 1; i <= source.Length; ++i) { M[i][0] = EditOperationKind.Remove; D[i][0] = removeCost * i; } // Edge: all inserts for (int i = 1; i <= target.Length; ++i) { M[0][i] = EditOperationKind.Add; D[0][i] = insertCost * i; } // Having fit N - 1, K - 1 characters let''s fit N, K for (int i = 1; i <= source.Length; ++i) for (int j = 1; j <= target.Length; ++j) { // here we choose the operation with the least cost int insert = D[i][j - 1] + insertCost; int delete = D[i - 1][j] + removeCost; int edit = D[i - 1][j - 1] + (source[i - 1] == target[j - 1] ? 0 : editCost); int min = Math.Min(Math.Min(insert, delete), edit); if (min == insert) M[i][j] = EditOperationKind.Add; else if (min == delete) M[i][j] = EditOperationKind.Remove; else if (min == edit) M[i][j] = EditOperationKind.Edit; D[i][j] = min; } // Backward: knowing scores (D) and actions (M) let''s building edit sequence List<EditOperation> result = new List<EditOperation>(source.Length + target.Length); for (int x = target.Length, y = source.Length; (x > 0) || (y > 0);) { EditOperationKind op = M[y][x]; if (op == EditOperationKind.Add) { x -= 1; result.Add(new EditOperation(''/0'', target[x], op)); } else if (op == EditOperationKind.Remove) { y -= 1; result.Add(new EditOperation(source[y], ''/0'', op)); } else if (op == EditOperationKind.Edit) { x -= 1; y -= 1; result.Add(new EditOperation(source[y], target[x], op)); } else // Start of the matching (EditOperationKind.None) break; } result.Reverse(); return result.ToArray(); }

Manifestación:

var sequence = EditSequence("asdfghjk", "wsedrftr"); Console.Write(string.Join(Environment.NewLine, sequence));

Salir:

''a'' Remove ''w'' Add ''s'' Equal ''e'' Add ''d'' Equal ''r'' Add ''f'' Equal ''g'' Remove ''h'' Remove ''j'' Remove ''k'' Remove ''t'' Add ''r'' Add

Me arriesgaré aquí y proporcionaré un algoritmo que no es el más eficiente, pero es fácil de razonar.

Vamos a cubrir un poco de terreno primero:

1) los asuntos de orden

string before = "bar foo" string after = "foo bar"

A pesar de que "bar" y "foo" aparecen en ambas cadenas, la "barra" deberá eliminarse y agregarse más tarde. Esto también nos dice que es la cadena after que nos da el orden de los caracteres que nos interesan, primero queremos "foo".

2) orden sobre cuenta

Otra forma de verlo es que algunos caracteres nunca pueden obtener su turno.

string before = "abracadabra" string after = "bar bar"

Solo los caracteres en negrita de " bar b a r" expresan su opinión en "a b r a cadab ra ". Aunque tenemos dos b en ambas cadenas, solo la primera cuenta . Cuando llegamos a la segunda b en "ba r b ar", la segunda b en "abracada br a" ya ha pasado, cuando estábamos buscando la primera aparición de ''r''.

3) Barreras

Las barreras son los caracteres que existen en ambas cadenas, teniendo en cuenta el orden y la cuenta. Esto ya sugiere que un conjunto podría no ser la estructura de datos más apropiada, ya que perderíamos la cuenta.

Para una entrada

string before = "pinata" string after = "accidental"

Obtenemos (pseudocódigo)

var barriers = { ''a'', ''t'', ''a'' }

"pin ata "

" a cciden ta l"

Sigamos el flujo de ejecución:

''a'' es la primera barrera, también es el primer carácter de after por lo que todo lo que precede a la primera ''a'' puede eliminarse before . "pin a ta" -> " a ta"
la segunda barrera es ''t'', no está en la siguiente posición en nuestra cadena after , por lo que podemos insertar todo lo que está en el medio. "a t a" -> "acciden t a"
la tercera barrera ''a'' ya está en la siguiente posición, por lo que podemos pasar a la siguiente barrera sin hacer ningún trabajo real.
no hay más barreras, pero nuestra longitud de cadena es aún menor que la de after , por lo que habrá un procesamiento posterior. "Accidenta" -> "Accidenta l "

Tenga en cuenta que ''i'' y ''n'' no vuelven a jugar, nuevamente, el orden se cuenta en exceso.

Implementación

Hemos establecido que el orden y el conteo son importantes, una Queue viene a la mente.

static public List<Difference> CalculateDifferences(string before, string after) { List<Difference> result = new List<Difference>(); Queue<char> barriers = new Queue<char>(); #region Preprocessing int index = 0; for (int i = 0; i < after.Length; i++) { // Look for the first match starting at index int match = before.IndexOf(after[i], index); if (match != -1) { barriers.Enqueue(after[i]); index = match + 1; } } #endregion #region Queue Processing index = 0; while (barriers.Any()) { char barrier = barriers.Dequeue(); // Get the offset to the barrier in both strings, // ignoring the part that''s already been handled int offsetBefore = before.IndexOf(barrier, index) - index; int offsetAfter = after.IndexOf(barrier, index) - index; // Remove prefix from ''before'' string if (offsetBefore > 0) { RemoveChars(before.Substring(index, offsetBefore), result); before = before.Substring(offsetBefore); } // Insert prefix from ''after'' string if (offsetAfter > 0) { string substring = after.Substring(index, offsetAfter); AddChars(substring, result); before = before.Insert(index, substring); index += substring.Length; } // Jump over the barrier KeepChar(barrier, result); index++; } #endregion #region Post Queue processing if (index < before.Length) { RemoveChars(before.Substring(index), result); } if (index < after.Length) { AddChars(after.Substring(index), result); } #endregion return result; } static private void KeepChar(char barrier, List<Difference> result) { result.Add(new Difference() { c = barrier, op = Operation.Equal }); } static private void AddChars(string substring, List<Difference> result) { result.AddRange(substring.Select(x => new Difference() { c = x, op = Operation.Add })); } static private void RemoveChars(string substring, List<Difference> result) { result.AddRange(substring.Select(x => new Difference() { c = x, op = Operation.Remove })); }

Probé con los 3 ejemplos anteriores, y devuelve el resultado esperado de manera adecuada y perfecta.

int flag = 0; int flag_2 = 0; string a = "asdfghjk"; string b = "wsedrftr"; char[] array_a = a.ToCharArray(); char[] array_b = b.ToCharArray(); for (int i = 0,j = 0, n= 0; i < array_b.Count(); i++) { //Execute 1 time until reach first equal character if(i == 0 && a.Contains(array_b[0])) { while (array_a[n] != array_b[0]) { Console.WriteLine(String.Concat(array_a[n], " : Remove")); n++; } Console.WriteLine(String.Concat(array_a[n], " : Equal")); n++; } else if(i == 0 && !a.Contains(array_b[0])) { Console.WriteLine(String.Concat(array_a[n], " : Remove")); n++; Console.WriteLine(String.Concat(array_b[0], " : Add")); } else { if(n < array_a.Count()) { if (array_a[n] == array_b[i]) { Console.WriteLine(String.Concat(array_a[n], " : Equal")); n++; } else { flag = 0; for (int z = n; z < array_a.Count(); z++) { if (array_a[z] == array_b[i]) { flag = 1; break; } } if (flag == 0) { flag_2 = 0; for (int aa = i; aa < array_b.Count(); aa++) { for(int bb = n; bb < array_a.Count(); bb++) { if (array_b[aa] == array_a[bb]) { flag_2 = 1; break; } } } if(flag_2 == 1) { Console.WriteLine(String.Concat(array_b[i], " : Add")); } else { for (int z = n; z < array_a.Count(); z++) { Console.WriteLine(String.Concat(array_a[z], " : Remove")); n++; } Console.WriteLine(String.Concat(array_b[i], " : Add")); } } else { Console.WriteLine(String.Concat(array_a[n], " : Remove")); i--; n++; } } } else { Console.WriteLine(String.Concat(array_b[i], " : Add")); } } }//end for MessageBox.Show("Done"); //OUTPUT CONSOLE: /* a : Remove w : Add s : Equal e : Add d : Equal r : Add f : Equal g : Remove h : Remove j : Remove k : Remove t : Add r : Add */

Si desea tener una comparación precisa entre dos cadenas, debe leer y comprender Levenshtein Distance . Al usar este algoritmo, puede calcular con precisión la tasa de similitud entre dos cadenas y también puede retroceder el algoritmo para obtener la cadena de cambios en la segunda cadena. este algoritmo es una métrica importante para el procesamiento del lenguaje natural también.

Hay algunos otros beneficios y es tiempo de aprender.

En este enlace hay una versión C # de Levenshtein Distance:

https://www.dotnetperls.com/levenshtein