URL Slugify algorithm in C#?


Here is a way to generate a URL slug in C#. This function removes all accents (Marcel's answer), replaces spaces, removes invalid characters, trims dashes and underscores from both ends, and collapses repeated occurrences of "-" or "_".

Code:

    using System.Text;
    using System.Text.RegularExpressions;

    public static string ToUrlSlug(string value)
    {
        // First to lower case
        value = value.ToLowerInvariant();

        // Remove all accents
        var bytes = Encoding.GetEncoding("Cyrillic").GetBytes(value);
        value = Encoding.ASCII.GetString(bytes);

        // Replace spaces
        value = Regex.Replace(value, @"\s", "-", RegexOptions.Compiled);

        // Remove invalid chars (the hyphen goes last in the class so it is read literally)
        value = Regex.Replace(value, @"[^a-z0-9\s_-]", "", RegexOptions.Compiled);

        // Trim dashes and underscores from both ends
        value = value.Trim('-', '_');

        // Replace double occurrences of - or _
        value = Regex.Replace(value, @"([-_]){2,}", "$1", RegexOptions.Compiled);

        return value;
    }

So I searched and browsed through the slug tag on SO and found only two compelling solutions:

Both are only partial solutions to the problem. I could hand-code this myself, but I'm surprised there isn't already a solution out there.

So, is there a slugify algorithm implementation in C# and/or .NET that properly handles Latin characters, Unicode, and the various other language issues?

Here is my shot at it. It supports:

removal of diacritics (so we don't just strip "invalid" characters)
a maximum length for the result (or applied before diacritic removal: "early truncate")
a custom separator between normalized chunks
forcing the result to uppercase or lowercase
a configurable list of supported Unicode categories
a configurable list of allowed character ranges
.NET Framework 2.0

Code:

    using System;
    using System.Collections.Generic;
    using System.Globalization;
    using System.Text;

    /// <summary>
    /// Defines a set of utilities for creating slug urls.
    /// </summary>
    public static class Slug
    {
        /// <summary>Creates a slug from the specified text.</summary>
        /// <param name="text">The text. If null is specified, null will be returned.</param>
        /// <returns>A slugged text.</returns>
        public static string Create(string text)
        {
            return Create(text, (SlugOptions)null);
        }

        /// <summary>Creates a slug from the specified text.</summary>
        /// <param name="text">The text. If null is specified, null will be returned.</param>
        /// <param name="options">The options. May be null.</param>
        /// <returns>A slugged text.</returns>
        public static string Create(string text, SlugOptions options)
        {
            if (text == null)
                return null;

            if (options == null)
            {
                options = new SlugOptions();
            }

            string normalised;
            if (options.EarlyTruncate && options.MaximumLength > 0 && text.Length > options.MaximumLength)
            {
                normalised = text.Substring(0, options.MaximumLength).Normalize(NormalizationForm.FormD);
            }
            else
            {
                normalised = text.Normalize(NormalizationForm.FormD);
            }

            int max = options.MaximumLength > 0 ? Math.Min(normalised.Length, options.MaximumLength) : normalised.Length;
            StringBuilder sb = new StringBuilder(max);
            for (int i = 0; i < normalised.Length; i++)
            {
                char c = normalised[i];
                UnicodeCategory uc = char.GetUnicodeCategory(c);
                if (options.AllowedUnicodeCategories.Contains(uc) && options.IsAllowed(c))
                {
                    switch (uc)
                    {
                        case UnicodeCategory.UppercaseLetter:
                            if (options.ToLower)
                            {
                                c = options.Culture != null ? char.ToLower(c, options.Culture) : char.ToLowerInvariant(c);
                            }
                            sb.Append(options.Replace(c));
                            break;

                        case UnicodeCategory.LowercaseLetter:
                            if (options.ToUpper)
                            {
                                c = options.Culture != null ? char.ToUpper(c, options.Culture) : char.ToUpperInvariant(c);
                            }
                            sb.Append(options.Replace(c));
                            break;

                        default:
                            sb.Append(options.Replace(c));
                            break;
                    }
                }
                else if (uc == UnicodeCategory.NonSpacingMark)
                {
                    // don't add a separator
                }
                else
                {
                    if (options.Separator != null && !EndsWith(sb, options.Separator))
                    {
                        sb.Append(options.Separator);
                    }
                }

                if (options.MaximumLength > 0 && sb.Length >= options.MaximumLength)
                    break;
            }

            string result = sb.ToString();

            if (options.MaximumLength > 0 && result.Length > options.MaximumLength)
            {
                result = result.Substring(0, options.MaximumLength);
            }

            if (!options.CanEndWithSeparator && options.Separator != null && result.EndsWith(options.Separator))
            {
                result = result.Substring(0, result.Length - options.Separator.Length);
            }

            return result.Normalize(NormalizationForm.FormC);
        }

        private static bool EndsWith(StringBuilder sb, string text)
        {
            if (sb.Length < text.Length)
                return false;

            for (int i = 0; i < text.Length; i++)
            {
                if (sb[sb.Length - 1 - i] != text[text.Length - 1 - i])
                    return false;
            }
            return true;
        }
    }

    /// <summary>
    /// Defines options for the Slug utility class.
    /// </summary>
    public class SlugOptions
    {
        /// <summary>Defines the default maximum length. Currently equal to 80.</summary>
        public const int DefaultMaximumLength = 80;

        /// <summary>Defines the default separator. Currently equal to "-".</summary>
        public const string DefaultSeparator = "-";

        private bool _toLower;
        private bool _toUpper;

        /// <summary>
        /// Initializes a new instance of the <see cref="SlugOptions"/> class.
        /// </summary>
        public SlugOptions()
        {
            MaximumLength = DefaultMaximumLength;
            Separator = DefaultSeparator;
            AllowedUnicodeCategories = new List<UnicodeCategory>();
            AllowedUnicodeCategories.Add(UnicodeCategory.UppercaseLetter);
            AllowedUnicodeCategories.Add(UnicodeCategory.LowercaseLetter);
            AllowedUnicodeCategories.Add(UnicodeCategory.DecimalDigitNumber);
            AllowedRanges = new List<KeyValuePair<short, short>>();
            AllowedRanges.Add(new KeyValuePair<short, short>((short)'a', (short)'z'));
            AllowedRanges.Add(new KeyValuePair<short, short>((short)'A', (short)'Z'));
            AllowedRanges.Add(new KeyValuePair<short, short>((short)'0', (short)'9'));
        }

        /// <summary>Gets the allowed unicode categories list.</summary>
        /// <value>The allowed unicode categories list.</value>
        public virtual IList<UnicodeCategory> AllowedUnicodeCategories { get; private set; }

        /// <summary>Gets the allowed ranges list.</summary>
        /// <value>The allowed ranges list.</value>
        public virtual IList<KeyValuePair<short, short>> AllowedRanges { get; private set; }

        /// <summary>Gets or sets the maximum length.</summary>
        /// <value>The maximum length.</value>
        public virtual int MaximumLength { get; set; }

        /// <summary>Gets or sets the separator.</summary>
        /// <value>The separator.</value>
        public virtual string Separator { get; set; }

        /// <summary>Gets or sets the culture for case conversion.</summary>
        /// <value>The culture.</value>
        public virtual CultureInfo Culture { get; set; }

        /// <summary>Gets or sets a value indicating whether the string can end with a separator string.</summary>
        /// <value><c>true</c> if the string can end with a separator string; otherwise, <c>false</c>.</value>
        public virtual bool CanEndWithSeparator { get; set; }

        /// <summary>Gets or sets a value indicating whether the string is truncated before normalization.</summary>
        /// <value><c>true</c> if the string is truncated before normalization; otherwise, <c>false</c>.</value>
        public virtual bool EarlyTruncate { get; set; }

        /// <summary>Gets or sets a value indicating whether to lowercase the resulting string.</summary>
        /// <value><c>true</c> if the resulting string must be lowercased; otherwise, <c>false</c>.</value>
        public virtual bool ToLower
        {
            get { return _toLower; }
            set
            {
                _toLower = value;
                if (_toLower)
                {
                    _toUpper = false;
                }
            }
        }

        /// <summary>Gets or sets a value indicating whether to uppercase the resulting string.</summary>
        /// <value><c>true</c> if the resulting string must be uppercased; otherwise, <c>false</c>.</value>
        public virtual bool ToUpper
        {
            get { return _toUpper; }
            set
            {
                _toUpper = value;
                if (_toUpper)
                {
                    _toLower = false;
                }
            }
        }

        /// <summary>Determines whether the specified character is allowed.</summary>
        /// <param name="character">The character.</param>
        /// <returns>true if the character is allowed; false otherwise.</returns>
        public virtual bool IsAllowed(char character)
        {
            foreach (var p in AllowedRanges)
            {
                if (character >= p.Key && character <= p.Value)
                    return true;
            }
            return false;
        }

        /// <summary>Replaces the specified character by a given string.</summary>
        /// <param name="character">The character to replace.</param>
        /// <returns>a string.</returns>
        public virtual string Replace(char character)
        {
            return character.ToString();
        }
    }

Here's my version, based on Joan's and Marcel's answers. The changes I made are as follows:

Uses a widely accepted method to remove accents.
Explicit Regex caching for modest speed improvements.
More word separators recognized and normalized to hyphens.

Here is the code:

    using System.Globalization;
    using System.Text;
    using System.Text.RegularExpressions;

    public class UrlSlugger
    {
        // white space, em-dash, en-dash, underscore
        static readonly Regex WordDelimiters = new Regex(@"[\s—–_]", RegexOptions.Compiled);

        // characters that are not valid
        static readonly Regex InvalidChars = new Regex(@"[^a-z0-9\-]", RegexOptions.Compiled);

        // multiple hyphens
        static readonly Regex MultipleHyphens = new Regex(@"-{2,}", RegexOptions.Compiled);

        public static string ToUrlSlug(string value)
        {
            // convert to lower case
            value = value.ToLowerInvariant();

            // remove diacritics (accents)
            value = RemoveDiacritics(value);

            // ensure all word delimiters are hyphens
            value = WordDelimiters.Replace(value, "-");

            // strip out invalid characters
            value = InvalidChars.Replace(value, "");

            // replace multiple hyphens (-) with a single hyphen
            value = MultipleHyphens.Replace(value, "-");

            // trim hyphens (-) from ends
            return value.Trim('-');
        }

        /// See: http://www.siao2.com/2007/05/14/2629747.aspx
        private static string RemoveDiacritics(string stIn)
        {
            string stFormD = stIn.Normalize(NormalizationForm.FormD);
            StringBuilder sb = new StringBuilder();

            for (int ich = 0; ich < stFormD.Length; ich++)
            {
                UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);
                if (uc != UnicodeCategory.NonSpacingMark)
                {
                    sb.Append(stFormD[ich]);
                }
            }

            return (sb.ToString().Normalize(NormalizationForm.FormC));
        }
    }
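To see what the accent-removal step does on its own, here is a minimal sketch of the same decompose-and-filter idea. The class and method names (DiacriticsDemo, Strip) are made up for illustration and are not part of any answer here.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class DiacriticsDemo
{
    // Hypothetical helper name; it follows the same idea as RemoveDiacritics.
    public static string Strip(string input)
    {
        // FormD splits a letter such as "é" into "e" plus U+0301 (combining acute accent).
        string decomposed = input.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);

        foreach (char c in decomposed)
        {
            // Combining accents are classified as NonSpacingMark; skip them.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }

        // Recompose whatever is left.
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

DiacriticsDemo.Strip("Crème Brûlée") returns "Creme Brulee". Letters such as "ø" have no canonical decomposition, so this step leaves them untouched and a later "invalid characters" pass is what removes them.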

This still doesn't solve the non-Latin character problem. A completely different approach would be to use Uri.EscapeDataString to convert the string to its percent-encoded (hexadecimal) representation:

    string original = "测试公司";
    // %E6%B5%8B%E8%AF%95%E5%85%AC%E5%8F%B8
    string converted = Uri.EscapeDataString(original);

Then use the result to generate a hyperlink:

<a href="http://www.example.com/100/%E6%B5%8B%E8%AF%95%E5%85%AC%E5%8F%B8"> 测试公司 </a>

Many browsers will display the Chinese characters in the address bar, but based on my limited testing this is not universally supported.

NOTE: For Uri.EscapeDataString to work this way, iriParsing must be enabled.
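As a rough sketch of how the escaped form could be used end to end, the snippet below builds a path from an id and a title and then reads the title back. The helper names (EscapedSlugDemo, BuildPath, ReadTitle) and the "/id/title" layout are assumptions for illustration, and it assumes a runtime where Uri.EscapeDataString percent-encodes the UTF-8 bytes as shown above.

```csharp
using System;

public static class EscapedSlugDemo
{
    // Hypothetical helper: builds a path such as "/100/<escaped title>".
    public static string BuildPath(int id, string title)
    {
        return "/" + id + "/" + Uri.EscapeDataString(title);
    }

    // Recovers the original title from the last path segment.
    public static string ReadTitle(string path)
    {
        string lastSegment = path.Substring(path.LastIndexOf('/') + 1);
        return Uri.UnescapeDataString(lastSegment);
    }
}
```

EscapedSlugDemo.BuildPath(100, "测试公司") gives "/100/%E6%B5%8B%E8%AF%95%E5%85%AC%E5%8F%B8", and ReadTitle on that path gives back "测试公司". Nothing is lost in the round trip, which is the main difference from the accent-stripping approaches.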

EDIT

For those looking to generate URL slugs in C#, I recommend checking out this related question:

How does Stack Overflow generate its SEO-friendly URLs?

It's what I ended up using for my project.

One problem I've had with slugification (new word!) is collisions. If I have a blog post called, for example, "Stack-Overflow" and another called "", the slugs for those two titles are the same. So my slug generator usually has to involve the database in some way. This may be why you don't see more generic solutions out there.
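One common way to deal with collisions is to append a numeric suffix until the slug is unused. The sketch below is only an illustration: the names (SlugDeduplicator, MakeUnique) are made up, and the in-memory set stands in for whatever database lookup a real application would do.

```csharp
using System;
using System.Collections.Generic;

public static class SlugDeduplicator
{
    // 'existing' is a stand-in for a database lookup (an assumption of this sketch).
    public static string MakeUnique(string slug, ISet<string> existing)
    {
        // ISet<T>.Add returns false when the value is already present.
        if (existing.Add(slug))
        {
            return slug;
        }

        // Try slug-2, slug-3, ... until one is free.
        for (int n = 2; ; n++)
        {
            string candidate = slug + "-" + n;
            if (existing.Add(candidate))
            {
                return candidate;
            }
        }
    }
}
```

Calling MakeUnique("my-post", taken) three times against the same set yields "my-post", "my-post-2" and "my-post-3". With a real database, a unique constraint on the slug column would probably still be needed, since two requests could check for the same slug at the same time.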

http://predicatet.blogspot.com/2009/04/improved-c-slug-generator-or-how-to.html

    using System.Text;
    using System.Text.RegularExpressions;

    // Extension methods must be declared in a static class; the class name here is arbitrary.
    public static class SlugExtensions
    {
        public static string GenerateSlug(this string phrase)
        {
            string str = phrase.RemoveAccent().ToLower();

            // invalid chars
            str = Regex.Replace(str, @"[^a-z0-9\s-]", "");

            // convert multiple spaces into one space
            str = Regex.Replace(str, @"\s+", " ").Trim();

            // cut and trim
            str = str.Substring(0, str.Length <= 45 ? str.Length : 45).Trim();

            // hyphens
            str = Regex.Replace(str, @"\s", "-");

            return str;
        }

        public static string RemoveAccent(this string txt)
        {
            byte[] bytes = System.Text.Encoding.GetEncoding("Cyrillic").GetBytes(txt);
            return System.Text.Encoding.ASCII.GetString(bytes);
        }
    }