regex - How do you match only valid Roman numerals with a regular expression?


Fortunately, the range of numbers is limited to 1..3999 or thereabouts. Therefore, you can build up the regex piecemeal.

<opt-thousands-part><opt-hundreds-part><opt-tens-part><opt-units-part>

Each of those parts deals with the vagaries of Roman notation. For example, using Perl notation:

<opt-hundreds-part> = m/(CM|DC{0,3}|CD|C{1,3})?/;

Repeat and assemble.
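The "repeat and assemble" step can be sketched in Python rather than Perl. This is only an illustration under stated assumptions: the names `digit_part` and `ROMAN` are made up here, and the thousands part is assumed to be `M{0,3}` so the range stays within 1..3999.

```python
import re

def digit_part(one: str, five: str, ten: str) -> str:
    """Build the optional part for one decimal place (hypothetical helper).

    Mirrors the hundreds example from the text: (CM|DC{0,3}|CD|C{1,3})?
    """
    return f"({one}{ten}|{five}{one}{{0,3}}|{one}{five}|{one}{{1,3}})?"

# Assumption: thousands are limited to MMM (1..3999).
ROMAN = re.compile(
    "M{0,3}"
    + digit_part("C", "D", "M")   # hundreds
    + digit_part("X", "L", "C")   # tens
    + digit_part("I", "V", "X")   # units
)

print(digit_part("C", "D", "M"))           # the hundreds part from the text
print(bool(ROMAN.fullmatch("MCMXCIV")))    # a well-formed numeral
print(bool(ROMAN.fullmatch("XM")))         # a non-standard form
```

Using `fullmatch` stands in for the `^` and `$` anchors, so the assembled pattern itself stays free of them.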

Added: The <opt-hundreds-part> can be compressed further:

<opt-hundreds-part> = m/(C[MD]|D?C{0,3})/;

Since the 'D?C{0,3}' clause can match nothing, there is no need for the trailing question mark. And, most likely, the parentheses should be the non-capturing type, in Perl:

<opt-hundreds-part> = m/(?:C[MD]|D?C{0,3})/;

Of course, it should all be case-insensitive, too.

You can also extend this to deal with the options mentioned by James Curran (to allow XM or IM for 990 or 999, and CCCC for 400, etc.).

<opt-hundreds-part> = m/(?:[IXC][MD]|D?C{0,4})/;
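To see what the two hundreds parts accept, here is a small Python sketch. The constant names `STRICT` and `LENIENT` are invented for this example, and `re.IGNORECASE` stands in for the case-insensitivity mentioned above.

```python
import re

# The two hundreds parts from the text, compiled case-insensitively.
STRICT = re.compile(r"(?:C[MD]|D?C{0,3})", re.IGNORECASE)
LENIENT = re.compile(r"(?:[IXC][MD]|D?C{0,4})", re.IGNORECASE)

for text in ["CM", "cd", "DCCC", "CCCC", "XM", "IM"]:
    strict_ok = STRICT.fullmatch(text) is not None
    lenient_ok = LENIENT.fullmatch(text) is not None
    print(f"{text:5} strict={strict_ok} lenient={lenient_ok}")
```

The first three strings pass both forms; `CCCC`, `XM` and `IM` only pass the lenient one.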

Thinking about my other problem, I decided I can't even create a regular expression that will match Roman numerals (let alone a context-free grammar that will generate them).

The problem is matching only valid Roman numerals. For example, 990 is NOT "XM", it's "CMXC".

My problem in making the regex for this is that, in order to allow or disallow certain characters, I need to look back. Let's take thousands and hundreds, for example.

I can allow M{0,2}C?M (to allow for 900, 1000, 1900, 2000, 2900 and 3000). However, if the match is on CM, I can't allow the following characters to be C or D (because I'm already at 900).

How can I express this in a regex?
If it's simply not expressible in a regex, is it expressible in a context-free grammar?

As Jeremy and Pax pointed out above ... '^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$' should be the solution you're after ...

The specific URL that should have been attached (IMHO) is http://thehazeltree.org/diveintopython/7.html

Example 7.8 is the short form using {n,m}.

The problem with the solution from Jeremy and Pax is that it also matches "nothing" (the empty string).

The following regex is meant to require at least one Roman numeral character:

^(M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|[IDCXMLV])$
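A quick check in Python (a sketch, not part of the original answer) suggests that the pattern above still accepts the empty string: every group in its first alternative is optional, so the `[IDCXMLV]` branch never becomes necessary. Testing for a non-empty string first is one way around that; the wrapper name `is_roman` is hypothetical.

```python
import re

PATTERN = re.compile(
    r"^(M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|[IDCXMLV])$"
)

def is_roman(text: str) -> bool:
    """Hypothetical wrapper: reject empty input before applying the pattern."""
    return bool(text) and PATTERN.match(text) is not None

print(PATTERN.match("") is not None)   # the bare pattern on an empty string
print(is_roman(""))                    # the guarded check
print(is_roman("XLII"))
```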

Actually, your premise is flawed. 990 IS "XM", as well as "CMXC".

The Romans were far less concerned about the "rules" than your third-grade teacher. As long as it added up, it was OK. Hence "IIII" was just as good as "IV" for 4. And "IIM" was completely cool for 998.

(If you have trouble dealing with that ... Remember that English spellings were not formalized until the 1700s. Until then, as long as the reader could figure it out, it was good enough.)

I would write functions to do my work for me. Here are two Roman numeral functions in PowerShell.

function ConvertFrom-RomanNumeral
{
    <#
    .SYNOPSIS
        Converts a Roman numeral to a number.
    .DESCRIPTION
        Converts a Roman numeral - in the range of I..MMMCMXCIX - to a number.
    .EXAMPLE
        ConvertFrom-RomanNumeral -Numeral MMXIV
    .EXAMPLE
        "MMXIV" | ConvertFrom-RomanNumeral
    #>
    [CmdletBinding()]
    [OutputType([int])]
    Param
    (
        [Parameter(Mandatory=$true,
                   HelpMessage="Enter a roman numeral in the range I..MMMCMXCIX",
                   ValueFromPipeline=$true,
                   Position=0)]
        [ValidatePattern("^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$")]
        [string]
        $Numeral
    )

    Begin
    {
        # Ordered so that each two-letter subtractive pair is tried before
        # the single letter it starts with.
        $RomanToDecimal = [ordered]@{
            M  = 1000
            CM =  900
            D  =  500
            CD =  400
            C  =  100
            XC =   90
            L  =   50
            XL =   40
            X  =   10
            IX =    9
            V  =    5
            IV =    4
            I  =    1
        }
    }
    Process
    {
        # The trailing space is a sentinel, so Substring(0,2) is always valid.
        $roman = $Numeral + " "
        $value = 0

        do
        {
            foreach ($key in $RomanToDecimal.Keys)
            {
                if ($key.Length -eq 1)
                {
                    if ($key -match $roman.Substring(0,1))
                    {
                        $value += $RomanToDecimal.$key
                        $roman  = $roman.Substring(1)
                        break
                    }
                }
                else
                {
                    if ($key -match $roman.Substring(0,2))
                    {
                        $value += $RomanToDecimal.$key
                        $roman  = $roman.Substring(2)
                        break
                    }
                }
            }
        }
        until ($roman -eq " ")

        $value
    }
    End
    {
    }
}

function ConvertTo-RomanNumeral
{
    <#
    .SYNOPSIS
        Converts a number to a Roman numeral.
    .DESCRIPTION
        Converts a number - in the range of 1 to 3,999 - to a Roman numeral.
    .EXAMPLE
        ConvertTo-RomanNumeral -Number (Get-Date).Year
    .EXAMPLE
        (Get-Date).Year | ConvertTo-RomanNumeral
    #>
    [CmdletBinding()]
    [OutputType([string])]
    Param
    (
        [Parameter(Mandatory=$true,
                   HelpMessage="Enter an integer in the range 1 to 3,999",
                   ValueFromPipeline=$true,
                   Position=0)]
        [ValidateRange(1,3999)]
        [int]
        $Number
    )

    Begin
    {
        # One lookup table per decimal place, indexed by the digit 0..9.
        $DecimalToRoman = @{
            Ones      = "","I","II","III","IV","V","VI","VII","VIII","IX";
            Tens      = "","X","XX","XXX","XL","L","LX","LXX","LXXX","XC";
            Hundreds  = "","C","CC","CCC","CD","D","DC","DCC","DCCC","CM";
            Thousands = "","M","MM","MMM"
        }

        # Position of each decimal place in the zero-padded 4-digit string.
        $column = @{Thousands = 0; Hundreds = 1; Tens = 2; Ones = 3}
    }
    Process
    {
        [int[]]$digits = $Number.ToString().PadLeft(4,"0").ToCharArray() |
                            ForEach-Object { [Char]::GetNumericValue($_) }

        $RomanNumeral  = ""
        $RomanNumeral += $DecimalToRoman.Thousands[$digits[$column.Thousands]]
        $RomanNumeral += $DecimalToRoman.Hundreds[$digits[$column.Hundreds]]
        $RomanNumeral += $DecimalToRoman.Tens[$digits[$column.Tens]]
        $RomanNumeral += $DecimalToRoman.Ones[$digits[$column.Ones]]

        $RomanNumeral
    }
    End
    {
    }
}

To avoid matching the empty string, you'll need to repeat the pattern four times, replacing each 0 with a 1 in turn, and account for V, L and D:

(M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))
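One way to gain confidence in a pattern this long is to test it exhaustively. The sketch below, in Python, assumes a helper `to_roman` (not from the original answer) that produces the standard subtractive form, and checks every value the `M{0,4}` prefix allows.

```python
import re

NON_EMPTY = re.compile(
    r"(M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
    r"|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
    r"|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})"
    r"|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))"
)

def to_roman(number: int) -> str:
    """Hypothetical helper: standard subtractive form for 1..4999."""
    table = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
             (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
             (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in table:
        count, number = divmod(number, value)
        out.append(symbol * count)
    return "".join(out)

# Every numeral in range should match in full; the empty string should not.
all_match = all(NON_EMPTY.fullmatch(to_roman(n)) for n in range(1, 5000))
print(all_match)
print(NON_EMPTY.fullmatch("") is None)
```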

In this case (because this pattern uses ^ and $) you would be better off checking for empty lines first and not bothering to match them. If you are using word boundaries, then you don't have a problem, because there's no such thing as an empty word. (At least regex doesn't define one; don't start philosophising, I'm being pragmatic here!)

In my own particular (real-world) case I needed to match numerals at word endings and I found no other way around it. I needed to scrub the footnote numbers from my plain-text document, where text such as "the Red Sea^{cl} and the Great Barrier Reef^{cli}" had been converted to "the Red Seacl and the Great Barrier Reefcli". But I still had problems with valid words like Tahiti and fantastic being scrubbed into Tahit and fantasti.
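The answer does not show the pattern that was actually used, so the following Python sketch is only a guess at the approach: a lowercase numeral that follows a letter and ends at a word boundary. The name `SUFFIX` is invented here. It reproduces the false positives described above.

```python
import re

# Assumption: footnote markers are lowercase Roman numerals glued to the end
# of a word, so the pattern requires a preceding letter and a word boundary.
SUFFIX = re.compile(
    r"(?<=[A-Za-z])(?=[mdclxvi])"
    r"m{0,4}(?:cm|cd|d?c{0,3})(?:xc|xl|l?x{0,3})(?:ix|iv|v?i{0,3})\b"
)

for word in ["Seacl", "Reefcli", "Tahiti", "fantastic"]:
    print(word, "->", SUFFIX.sub("", word))
```

A regex alone cannot tell a footnote marker from a word that happens to end in `i` or `c`; a word list or manual review would be needed for that.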

Just to save it here:

(^(?=[MDCLXVI])M*(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$)

Matches all the Roman numerals. Doesn't match empty strings (it requires at least one Roman numeral letter). Should work in PCRE, Perl, Python and Ruby.
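A short Python check of the lookahead pattern, as a sketch; the sample strings are chosen here for illustration.

```python
import re

ROMAN = re.compile(
    r"(^(?=[MDCLXVI])M*(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$)"
)

# The lookahead (?=[MDCLXVI]) forces at least one numeral letter.
for text in ["", "MCMXCIV", "MMMMM", "IIII", "XM"]:
    print(repr(text), ROMAN.match(text) is not None)
```

Because `M*` is unbounded, strings such as `MMMMM` are accepted as well; replacing it with `M{0,3}` or `M{0,4}` would cap the thousands.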

Online Ruby demo: http://rubular.com/r/KLPR1zq3Hj

Online conversion: http://www.onlineconversion.com/roman_numerals_advanced.htm

Steven Levithan uses this regex in his post, which validates Roman numerals before "deromanizing" the value:

/^M*(?:D?C{0,3}|C[MD])(?:L?X{0,3}|X[CL])(?:V?I{0,3}|I[XV])$/
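The validate-then-convert idea can be sketched in Python. This is not Levithan's code: the function name `deromanize` is borrowed from the description above, and the conversion loop is an assumption about how such a function might be written. An explicit empty-string check is added because every part of the regex is optional.

```python
import re

# Python spelling of the validation regex quoted above.
VALID = re.compile(r"^M*(?:D?C{0,3}|C[MD])(?:L?X{0,3}|X[CL])(?:V?I{0,3}|I[XV])$")
VALUES = {"M": 1000, "D": 500, "C": 100, "L": 50, "X": 10, "V": 5, "I": 1}

def deromanize(text: str) -> int:
    """Hypothetical helper: validate first, then add up the symbols."""
    numeral = text.upper()
    if not numeral or not VALID.fullmatch(numeral):
        raise ValueError(f"not a valid Roman numeral: {text!r}")
    total = 0
    # Pair each symbol with its successor; a space marks the end.
    for current, following in zip(numeral, numeral[1:] + " "):
        value = VALUES[current]
        # A smaller symbol before a larger one is subtracted (IV, XC, CM).
        if following != " " and value < VALUES[following]:
            total -= value
        else:
            total += value
    return total

print(deromanize("MCMXCIV"))
print(deromanize("mmxiv"))
```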

Try:

^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$

Breaking it down:

M{0,4}

This specifies the thousands section and basically restricts it to between 0 and 4000. It's relatively simple:

   0: <empty> matched by M{0}
1000: M       matched by M{1}
2000: MM      matched by M{2}
3000: MMM     matched by M{3}
4000: MMMM    matched by M{4}

(CM|CD|D?C{0,3})

Slightly more complex, this is for the hundreds section and covers all the possibilities:

  0: <empty> matched by D?C{0} (with D not there)
100: C       matched by D?C{1} (with D not there)
200: CC      matched by D?C{2} (with D not there)
300: CCC     matched by D?C{3} (with D not there)
400: CD      matched by CD
500: D       matched by D?C{0} (with D there)
600: DC      matched by D?C{1} (with D there)
700: DCC     matched by D?C{2} (with D there)
800: DCCC    matched by D?C{3} (with D there)
900: CM      matched by CM

(XC|XL|L?X{0,3})

Same rules as the previous section, but for the tens place:

 0: <empty> matched by L?X{0} (with L not there)
10: X       matched by L?X{1} (with L not there)
20: XX      matched by L?X{2} (with L not there)
30: XXX     matched by L?X{3} (with L not there)
40: XL      matched by XL
50: L       matched by L?X{0} (with L there)
60: LX      matched by L?X{1} (with L there)
70: LXX     matched by L?X{2} (with L there)
80: LXXX    matched by L?X{3} (with L there)
90: XC      matched by XC

(IX|IV|V?I{0,3})

This is the units section, handling 0 through 9, and it is also similar to the previous two sections (Roman numerals, despite their seeming weirdness, follow some logical rules once you figure out what they are):

0: <empty> matched by V?I{0} (with V not there)
1: I       matched by V?I{1} (with V not there)
2: II      matched by V?I{2} (with V not there)
3: III     matched by V?I{3} (with V not there)
4: IV      matched by IV
5: V       matched by V?I{0} (with V there)
6: VI      matched by V?I{1} (with V there)
7: VII     matched by V?I{2} (with V there)
8: VIII    matched by V?I{3} (with V there)
9: IX      matched by IX
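The three tables can be cross-checked by brute force. This Python sketch (the helper name `accepted` is invented for the example) tries every string of up to four letters from each place's alphabet and keeps the ones the group matches in full; four letters is enough because the longest entry in each table, such as DCCC, has four.

```python
import re
from itertools import product

def accepted(group: str, alphabet: str, max_len: int = 4) -> list:
    """Hypothetical helper: every string over `alphabet` (up to max_len
    characters) that the group matches in full."""
    hits = []
    for length in range(max_len + 1):
        for letters in product(alphabet, repeat=length):
            candidate = "".join(letters)
            if re.fullmatch(group, candidate):
                hits.append(candidate)
    return hits

hundreds = accepted(r"(CM|CD|D?C{0,3})", "CDM")
tens = accepted(r"(XC|XL|L?X{0,3})", "XLC")
units = accepted(r"(IX|IV|V?I{0,3})", "IVX")

print(hundreds)
print(len(hundreds), len(tens), len(units))
```

Each group accepts exactly ten strings, one per digit 0 to 9, which agrees with the tables above.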

import re

# Thousands are capped at MMM, so the largest accepted value is 3999.
pattern = r'^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'

if re.search(pattern, 'XCCMCI'):
    print('Valid Roman')
else:
    print('Not valid Roman')

For people who really want to understand the logic, please take a look at the step-by-step explanation spread over 3 pages on diveintopython.

The only difference from the original solution (which had M{0,4}) is that I found 'MMMM' is not a valid Roman numeral (though the old Romans most probably never thought about such a huge number and would disagree with me). If you are one of the disagreeing old Romans, please forgive me and use the {0,4} version.