.net regex algorithm string pascal-case

.NET - How can you split a "caps" delimited string into an array?




How do I go from this string: "ThisIsMyCapsDelimitedString"

... to this string: "This Is My Caps Delimited String"

The fewest lines of code in VB.net is preferred, but C# is also welcome.

Cheers!


Great answer, MizardX! I tweaked it slightly to treat numerals as separate words, so that "AddressLine1" becomes "Address Line 1" instead of "Address Line1":

    Regex.Replace(s, "([a-z](?=[A-Z0-9])|[A-Z](?=[A-Z][a-z]))", "$1 ")


Below is a prototype that converts the following to Title Case:

  • snake_case
  • camelCase
  • PascalCase
  • sentence case
  • Title Case (keeps the current formatting)

Obviously, you would only need the "ToTitleCase" method yourself.

    using System;
    using System.Collections.Generic;
    using System.Globalization;
    using System.Text.RegularExpressions;

    public class Program
    {
        public static void Main()
        {
            var examples = new List<string>
            {
                "THEQuickBrownFox",
                "theQUICKBrownFox",
                "TheQuickBrownFOX",
                "TheQuickBrownFox",
                "the_quick_brown_fox",
                "theFOX",
                "FOX",
                "QUICK"
            };

            foreach (var example in examples)
            {
                Console.WriteLine(ToTitleCase(example));
            }
        }

        private static string ToTitleCase(string example)
        {
            // snake_case: underscores become spaces
            var fromSnakeCase = example.Replace("_", " ");
            // Space between a lowercase letter and the uppercase letter after it
            var lowerToUpper = Regex.Replace(fromSnakeCase, @"(\p{Ll})(\p{Lu})", "$1 $2");
            // Space between a run of capitals and a following capitalized word
            var sentenceCase = Regex.Replace(lowerToUpper, @"(\p{Lu}+)(\p{Lu}\p{Ll})", "$1 $2");
            return new CultureInfo("en-US", false).TextInfo.ToTitleCase(sentenceCase);
        }
    }

The console output would be as follows:

    THE Quick Brown Fox
    The QUICK Brown Fox
    The Quick Brown FOX
    The Quick Brown Fox
    The Quick Brown Fox
    The FOX
    FOX
    QUICK



Grant Wagner's excellent comment aside:

    Dim s As String = RegularExpressions.Regex.Replace("ThisIsMyCapsDelimitedString", "([A-Z])", " $1")


I made this a while ago. It matches each component of a CamelCase name.

    /([A-Z]+(?=$|[A-Z][a-z])|[A-Z]?[a-z]+)/g

For example:

    "SimpleHTTPServer" => ["Simple", "HTTP", "Server"]
    "camelCase" => ["camel", "Case"]

To convert that to simply insert spaces between the words:

    Regex.Replace(s, "([a-z](?=[A-Z])|[A-Z](?=[A-Z][a-z]))", "$1 ")

If you need to handle digits:

    /([A-Z]+(?=$|[A-Z][a-z]|[0-9])|[A-Z]?[a-z]+|[0-9]+)/g

    Regex.Replace(s, "([a-z](?=[A-Z]|[0-9])|[A-Z](?=[A-Z][a-z]|[0-9])|[0-9](?=[^0-9]))", "$1 ")
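Since the question title asks for an array, the component-matching pattern can also be fed to Regex.Matches directly instead of inserting spaces. Here is a minimal sketch of that idea. It uses the variant without digit handling, and the names CamelCaseSplitter and SplitWords are made up for illustration; they are not part of the answer above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class CamelCaseSplitter
{
    // The component pattern quoted above, without the JavaScript-style /.../g wrapper.
    private static readonly Regex Word =
        new Regex(@"[A-Z]+(?=$|[A-Z][a-z])|[A-Z]?[a-z]+");

    // Returns every match of the pattern, in order, as a string array.
    public static string[] SplitWords(string s)
    {
        return Word.Matches(s)
                   .Cast<Match>()
                   .Select(m => m.Value)
                   .ToArray();
    }
}
```

Calling CamelCaseSplitter.SplitWords("SimpleHTTPServer") should give the three elements "Simple", "HTTP" and "Server". Characters the pattern does not match (digits, underscores) are simply skipped in this variant.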


Procedural and fast impl:

    // Requires: using System.Collections.Generic; using System.Diagnostics;
    //           using System.Diagnostics.Contracts; using System.Text;
    // As an extension method, GetWords must be declared inside a static class.

    /// <summary>
    /// Get the words in a code <paramref name="identifier"/>.
    /// </summary>
    /// <param name="identifier">The code <paramref name="identifier"/> to extract words from.</param>
    public static string[] GetWords(this string identifier)
    {
        Contract.Ensures(Contract.Result<string[]>() != null, "returned array of string is not null but can be empty");
        if (identifier == null) { return new string[0]; }
        if (identifier.Length == 0) { return new string[0]; }

        const int MIN_WORD_LENGTH = 2; // Ignore one letter or one digit words

        var length = identifier.Length;
        var list = new List<string>(1 + length / 2); // Set capacity, not possible more words since we discard one char words
        var sb = new StringBuilder();
        CharKind cKindCurrent = GetCharKind(identifier[0]); // length is not zero here
        CharKind cKindNext = length == 1 ? CharKind.End : GetCharKind(identifier[1]);

        for (var i = 0; i < length; i++)
        {
            var c = identifier[i];
            CharKind cKindNextNext = (i >= length - 2) ? CharKind.End : GetCharKind(identifier[i + 2]);

            // Process cKindCurrent
            switch (cKindCurrent)
            {
                case CharKind.Digit:
                case CharKind.LowerCaseLetter:
                    sb.Append(c); // Append digit or lowerCaseLetter to sb
                    if (cKindNext == CharKind.UpperCaseLetter)
                    {
                        goto TURN_SB_INTO_WORD; // Finish word if next char is upper
                    }
                    goto CHAR_PROCESSED;
                case CharKind.Other:
                    goto TURN_SB_INTO_WORD;
                default: // charCurrent is never Start or End
                    Debug.Assert(cKindCurrent == CharKind.UpperCaseLetter);
                    break;
            }

            // Here cKindCurrent is UpperCaseLetter
            // Append UpperCaseLetter to sb anyway
            sb.Append(c);

            switch (cKindNext)
            {
                default:
                    goto CHAR_PROCESSED;
                case CharKind.UpperCaseLetter:
                    // "SimpleHTTPServer": when we are at 'P' we need to see that NextNext is 'e' to get the word!
                    if (cKindNextNext == CharKind.LowerCaseLetter)
                    {
                        goto TURN_SB_INTO_WORD;
                    }
                    goto CHAR_PROCESSED;
                case CharKind.End:
                case CharKind.Other:
                    break; // goto TURN_SB_INTO_WORD;
            }

            //------------------------------------------------
            TURN_SB_INTO_WORD:
            string word = sb.ToString();
            sb.Length = 0;
            if (word.Length >= MIN_WORD_LENGTH)
            {
                list.Add(word);
            }

            CHAR_PROCESSED:
            // Shift left for next iteration!
            cKindCurrent = cKindNext;
            cKindNext = cKindNextNext;
        }

        string lastWord = sb.ToString();
        if (lastWord.Length >= MIN_WORD_LENGTH)
        {
            list.Add(lastWord);
        }
        return list.ToArray();
    }

    private static CharKind GetCharKind(char c)
    {
        if (char.IsDigit(c)) { return CharKind.Digit; }
        if (char.IsLetter(c))
        {
            if (char.IsUpper(c)) { return CharKind.UpperCaseLetter; }
            Debug.Assert(char.IsLower(c));
            return CharKind.LowerCaseLetter;
        }
        return CharKind.Other;
    }

    enum CharKind
    {
        End, // For end of string
        Digit,
        UpperCaseLetter,
        LowerCaseLetter,
        Other
    }

Tests:

    // Requires: using System.Linq; using NUnit.Framework;
    [TestCase((string)null, "")]
    [TestCase("", "")]
    // Ignore one letter or one digit words
    [TestCase("A", "")]
    [TestCase("4", "")]
    [TestCase("_", "")]
    [TestCase("Word_m_Field", "Word Field")]
    [TestCase("Word_4_Field", "Word Field")]
    [TestCase("a4", "a4")]
    [TestCase("ABC", "ABC")]
    [TestCase("abc", "abc")]
    [TestCase("AbCd", "Ab Cd")]
    [TestCase("AbcCde", "Abc Cde")]
    [TestCase("ABCCde", "ABC Cde")]
    [TestCase("Abc42Cde", "Abc42 Cde")]
    [TestCase("Abc42cde", "Abc42cde")]
    [TestCase("ABC42Cde", "ABC42 Cde")]
    [TestCase("42ABC", "42 ABC")]
    [TestCase("42abc", "42abc")]
    [TestCase("abc_cde", "abc cde")]
    [TestCase("Abc_Cde", "Abc Cde")]
    [TestCase("_Abc__Cde_", "Abc Cde")]
    [TestCase("ABC_CDE_FGH", "ABC CDE FGH")]
    [TestCase("ABC CDE FGH", "ABC CDE FGH")] // Should not happen (white char): anything that is not a letter/digit/'_' is considered a separator
    [TestCase("ABC,CDE;FGH", "ABC CDE FGH")] // Should not happen (,;): anything that is not a letter/digit/'_' is considered a separator
    [TestCase("abc<cde", "abc cde")]
    [TestCase("abc<>cde", "abc cde")]
    [TestCase("abc<D>cde", "abc cde")] // Ignore one letter or one digit words
    [TestCase("abc<Da>cde", "abc Da cde")]
    [TestCase("abc<cde>", "abc cde")]
    [TestCase("SimpleHTTPServer", "Simple HTTP Server")]
    [TestCase("SimpleHTTPS2erver", "Simple HTTPS2erver")]
    [TestCase("camelCase", "camel Case")]
    [TestCase("m_Field", "Field")]
    [TestCase("mm_Field", "mm Field")]
    public void Test_GetWords(string identifier, string expectedWordsStr)
    {
        var expectedWords = expectedWordsStr.Split(' ');
        if (identifier == null || identifier.Length <= 1)
        {
            expectedWords = new string[0];
        }

        var words = identifier.GetWords();
        Assert.IsTrue(words.SequenceEqual(expectedWords));
    }


Implementing the pseudocode from: https://.com/a/5796394/4279201

    // Requires: using System.Text;
    private static StringBuilder camelCaseToRegular(string i_String)
    {
        StringBuilder output = new StringBuilder();
        int i = 0;
        foreach (char character in i_String)
        {
            // Insert a space before every ASCII capital except the first character
            if (character <= 'Z' && character >= 'A' && i > 0)
            {
                output.Append(" ");
            }
            output.Append(character);
            i++;
        }
        return output;
    }


I needed a solution that supports acronyms and numbers. This Regex-based solution treats the following patterns as individual "words":

  • A capital letter followed by lowercase letters
  • A sequence of consecutive numbers
  • Consecutive capital letters (interpreted as acronyms): a new word can begin with the last capital, e.g. "HTMLGuide" => "HTML Guide", "TheATeam" => "The A Team"

You could do it as a one-liner:

    Regex.Replace(value, @"(?<!^)((?<!\d)\d|(?(?<=[A-Z])[A-Z](?=[a-z])|[A-Z]))", " $1")

A more readable approach might be better:

    using System.Text.RegularExpressions;

    namespace Demo
    {
        public class IntercappedStringHelper
        {
            private static readonly Regex SeparatorRegex;

            static IntercappedStringHelper()
            {
                const string pattern = @"
                    (?<!^) # Not start
                    (
                        # Digit, not preceded by another digit
                        (?<!\d)\d
                        |
                        # Upper-case letter, followed by lower-case letter if
                        # preceded by another upper-case letter, e.g. 'G' in HTMLGuide
                        (?(?<=[A-Z])[A-Z](?=[a-z])|[A-Z])
                    )";

                var options = RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled;

                SeparatorRegex = new Regex(pattern, options);
            }

            public static string SeparateWords(string value, string separator = " ")
            {
                return SeparatorRegex.Replace(value, separator + "$1");
            }
        }
    }

Here is an extract from the tests (XUnit):

    [Theory]
    [InlineData("PurchaseOrders", "Purchase-Orders")]
    [InlineData("purchaseOrders", "purchase-Orders")]
    [InlineData("2Unlimited", "2-Unlimited")]
    [InlineData("The2Unlimited", "The-2-Unlimited")]
    [InlineData("Unlimited2", "Unlimited-2")]
    [InlineData("222Unlimited", "222-Unlimited")]
    [InlineData("The222Unlimited", "The-222-Unlimited")]
    [InlineData("Unlimited222", "Unlimited-222")]
    [InlineData("ATeam", "A-Team")]
    [InlineData("TheATeam", "The-A-Team")]
    [InlineData("TeamA", "Team-A")]
    [InlineData("HTMLGuide", "HTML-Guide")]
    [InlineData("TheHTMLGuide", "The-HTML-Guide")]
    [InlineData("TheGuideToHTML", "The-Guide-To-HTML")]
    [InlineData("HTMLGuide5", "HTML-Guide-5")]
    [InlineData("TheHTML5Guide", "The-HTML-5-Guide")]
    [InlineData("TheGuideToHTML5", "The-Guide-To-HTML-5")]
    [InlineData("TheUKAllStars", "The-UK-All-Stars")]
    [InlineData("AllStarsUK", "All-Stars-UK")]
    [InlineData("UKAllStars", "UK-All-Stars")]



For more variety, using plain old C# objects, the following produces the same output as @MizardX's excellent regular expression.

    // Requires: using System.Text;
    public string FromCamelCase(string camel)
    {
        // omitted checking camel for null
        StringBuilder sb = new StringBuilder();
        int upperCaseRun = 0;
        foreach (char c in camel)
        {
            // append a space only if we're not at the start
            // and we're not already in an all caps string.
            if (char.IsUpper(c))
            {
                if (upperCaseRun == 0 && sb.Length != 0)
                {
                    sb.Append(' ');
                }
                upperCaseRun++;
            }
            else if (char.IsLower(c))
            {
                if (upperCaseRun > 1) // The first new word will also be capitalized.
                {
                    sb.Insert(sb.Length - 1, ' ');
                }
                upperCaseRun = 0;
            }
            else
            {
                upperCaseRun = 0;
            }
            sb.Append(c);
        }

        return sb.ToString();
    }


There is probably a more elegant solution, but this is what I came up with off the top of my head:

    string myString = "ThisIsMyCapsDelimitedString";

    for (int i = 1; i < myString.Length; i++)
    {
        // True for capitals, but also for digits and other non-letters
        if (myString[i].ToString().ToUpper() == myString[i].ToString())
        {
            myString = myString.Insert(i, " ");
            i++; // step onto the character that was just shifted right
        }
    }


Regex is about 10-12 times slower than a simple loop:

    // Requires: using System.Text;
    public static string CamelCaseToSpaceSeparated(this string str)
    {
        if (string.IsNullOrEmpty(str))
        {
            return str;
        }

        var res = new StringBuilder();

        res.Append(str[0]);
        for (var i = 1; i < str.Length; i++)
        {
            if (char.IsUpper(str[i]))
            {
                res.Append(' ');
            }
            res.Append(str[i]);
        }
        return res.ToString();
    }
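The exact ratio will depend on the machine, the runtime and the input, so it is worth measuring it yourself. Below is a rough sketch of how that might be done with Stopwatch. The class and method names (SplitBenchmark, WithLoop, WithRegex, Time) are hypothetical, and the regex used for comparison is my assumption of a fair counterpart, since the answer does not say which pattern was timed.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

public static class SplitBenchmark
{
    // Same loop as the answer above, repeated so the sketch is self-contained.
    public static string WithLoop(string str)
    {
        if (string.IsNullOrEmpty(str)) return str;
        var res = new StringBuilder();
        res.Append(str[0]);
        for (var i = 1; i < str.Length; i++)
        {
            if (char.IsUpper(str[i])) res.Append(' ');
            res.Append(str[i]);
        }
        return res.ToString();
    }

    // Assumed regex counterpart: a space before every ASCII capital except at the start.
    public static string WithRegex(string str)
    {
        return Regex.Replace(str, "(?<!^)([A-Z])", " $1");
    }

    // Returns the elapsed milliseconds for `iterations` calls of `f` on `input`.
    public static long Time(Func<string, string> f, string input, int iterations)
    {
        f(input); // warm-up call (JIT compilation, regex cache)
        var sw = Stopwatch.StartNew();
        for (var i = 0; i < iterations; i++) f(input);
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }
}
```

For example, compare SplitBenchmark.Time(SplitBenchmark.WithLoop, "ThisIsMyCapsDelimitedString", 1000000) with the same call using SplitBenchmark.WithRegex. Note that the two only agree on ASCII input, because char.IsUpper also recognizes non-ASCII capitals while [A-Z] does not.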


Just for a little variety... Here is an extension method that doesn't use a regex.

    // Requires: using System; using System.Collections.Generic; using System.Linq;
    public static class CamelSpaceExtensions
    {
        public static string SpaceCamelCase(this String input)
        {
            return new string(InsertSpacesBeforeCaps(input).ToArray());
        }

        // Note: also yields a leading space when the input starts with a capital
        private static IEnumerable<char> InsertSpacesBeforeCaps(IEnumerable<char> input)
        {
            foreach (char c in input)
            {
                if (char.IsUpper(c))
                {
                    yield return ' ';
                }

                yield return c;
            }
        }
    }


Naive regex solution. It will not handle O'Conner, and it also adds a space at the start of the string.

    var s = "ThisIsMyCapsDelimitedString";
    var split = Regex.Replace(s, "[A-Z0-9]", " $&");


Try using:

    "([A-Z]*[^A-Z]*)"

The result also works for letters mixed with numbers:

    Regex.Replace("AbcDefGH123Weh", "([A-Z]*[^A-Z]*)", "$1 ");
    // Abc Def GH123 Weh (followed by trailing spaces)

    Regex.Replace("camelCase", "([A-Z]*[^A-Z]*)", "$1 ");
    // camel Case (followed by trailing spaces)


    Regex.Replace("ThisIsMyCapsDelimitedString", "(\\B[A-Z])", " $1")


    string s = "ThisIsMyCapsDelimitedString";
    string t = Regex.Replace(s, "([A-Z])", " $1").Substring(1);