parsing - nomenclatura - pep8 online

PEG para sangría de estilo Python (5)

Cómo escribiría una gramática de expresión de análisis en cualquiera de los siguientes generadores de analizador ( PEG.js , Citrus , Treetop ) que pueden manejar la sangría de estilo de Python / Haskell / CoffeScript:

Ejemplos de un lenguaje de programación aún no existente:

square x = x * x

cube x = x * square x

fib n = if n <= 1 0 else fib(n - 2) + fib(n - 1) # some cheating allowed here with brackets

Actualización: No intente escribir un intérprete para los ejemplos anteriores. Solo estoy interesado en el problema de sangría. Otro ejemplo podría ser analizar lo siguiente:

foo bar = 1 baz = 2 tap zap = 3 # should yield (ruby style hashmap): # {:foo => { :bar => 1, :baz => 2}, :tap => { :zap => 3 } }

Creo que un lenguaje sensible a las indentaciones es sensible al contexto. Creo que PEG solo puede hacer lenguajes sin contexto.

Tenga en cuenta que, aunque la respuesta de nalply es ciertamente correcta, que PEG.js puede hacerlo a través de un estado externo (es decir, las temidas variables globales), puede ser una ruta peligrosa para caminar (peor que los problemas habituales con variables globales). Algunas reglas pueden coincidir inicialmente (y luego ejecutar sus acciones) pero las reglas primarias pueden fallar, lo que hace que la ejecución de la acción no sea válida. Si se modifica el estado externo en dicha acción, puede terminar con un estado no válido. Esto es súper horrible y podría provocar temblores, vómitos y la muerte. Algunos problemas y soluciones están en los comentarios aquí: https://github.com/dmajda/pegjs/issues/45

Entonces, lo que estamos haciendo aquí con sangría es crear algo así como un bloque de estilo C que a menudo tiene su propio alcance léxico. Si estuviera escribiendo un compilador para un lenguaje como ese, creo que trataría de hacer que el lexer no pierda de vista la sangría. Cada vez que la sangría aumenta, podría insertar un token ''{''. Del mismo modo, cada vez que disminuye puede insertar un token ''}''. Luego, escribir una expresión gramatical con llaves explícitas para representar el alcance léxico se vuelve más directo.

Puedes hacer esto en Treetop usando predicados semánticos. En este caso, necesita un predicado semántico que detecte el cierre de un bloque indentado de espacio en blanco debido a la aparición de otra línea que tiene la misma sangría o menos. El predicado debe contar la sangría de la línea de apertura, y devolver verdadero (bloque cerrado) si la sangría de la línea actual ha terminado en la misma longitud o más corta. Debido a que la condición de cierre depende del contexto, no debe ser memorada. Aquí está el código de ejemplo que estoy a punto de agregar a la documentación de Treetop. Tenga en cuenta que he anulado el método de inspección SyntaxNode de Treetop para que sea más fácil visualizar el resultado.

grammar IndentedBlocks rule top # Initialise the indent stack with a sentinel: &{|s| @indents = [-1] } nested_blocks { def inspect nested_blocks.inspect end } end rule nested_blocks ( # Do not try to extract this semantic predicate into a new rule. # It will be memo-ized incorrectly because @indents.last will change. !{|s| # Peek at the following indentation: save = index; i = _nt_indentation; index = save # We''re closing if the indentation is less or the same as our enclosing block''s: closing = i.text_value.length <= @indents.last } block )* { def inspect elements.map{|e| e.block.inspect}*"/n" end } end rule block indented_line # The block''s opening line &{|s| # Push the indent level to the stack level = s[0].indentation.text_value.length @indents << level true } nested_blocks # Parse any nested blocks &{|s| # Pop the indent stack # Note that under no circumstances should "nested_blocks" fail, or the stack will be mis-aligned @indents.pop true } { def inspect indented_line.inspect + (nested_blocks.elements.size > 0 ? ( "/n{/n" + nested_blocks.elements.map { |content| content.block.inspect+"/n" }*'''' + "}" ) : "") end } end rule indented_line indentation text:((!"/n" .)*) "/n" { def inspect text.text_value end } end rule indentation '' ''* end end

Aquí hay un pequeño programa de prueba de controlador para que pueda probarlo fácilmente:

require ''polyglot'' require ''treetop'' require ''indented_blocks'' parser = IndentedBlocksParser.new input = <<END def foo here is some indented text here it''s further indented and here the same but here it''s further again and some more like that before going back to here down again back twice and start from the beginning again with only a small block this time END parse_tree = parser.parse input p parse_tree

Pure PEG no puede analizar la sangría.

Pero peg.js puede.

Hice un experimento rápido y sucio (inspirándome en el comentario de Ira Baxter sobre hacer trampa).

/* Initializations */ { function start(first, tail) { var done = [first[1]]; for (var i = 0; i < tail.length; i++) { done = done.concat(tail[i][1][0]) done.push(tail[i][1][1]); } return done; } var depths = [0]; function indent(s) { var depth = s.length; if (depth == depths[0]) return []; if (depth > depths[0]) { depths.unshift(depth); return ["INDENT"]; } var dents = []; while (depth < depths[0]) { depths.shift(); dents.push("DEDENT"); } if (depth != depths[0]) dents.push("BADDENT"); return dents; } } /* The real grammar */ start = first:line tail:(newline line)* newline? { return start(first, tail) } line = depth:indent s:text { return [depth, s] } indent = s:" "* { return indent(s) } text = c:[^/n]* { return c.join("") } newline = "/n" {}

depths es una pila de indentaciones. indent () devuelve una matriz de tokens de sangría y start () desenvuelve la matriz para hacer que el analizador se comporte de forma similar a una secuencia.

peg.js produce para el texto:

alpha beta gamma delta epsilon zeta eta theta iota

estos resultados:

[ "alpha", "INDENT", "beta", "gamma", "INDENT", "delta", "DEDENT", "DEDENT", "epsilon", "INDENT", "zeta", "DEDENT", "BADDENT", "eta", "theta", "INDENT", "iota", "DEDENT", "", "" ]

Este analizador incluso detecta malas sangrías.

Sé que este es un hilo viejo, pero solo quería agregar un código PEGjs a las respuestas. Este código analizará un fragmento de texto y lo "anidará" en una especie de estructura "AST-ish". Solo va uno profundo y se ve feo, además, realmente no usa los valores de retorno para crear la estructura correcta pero mantiene un árbol en memoria de tu sintaxis y lo devolverá al final. Esto podría volverse difícil de manejar y causar algunos problemas de rendimiento, pero al menos hace lo que se supone que debe hacer.

Nota: ¡Asegúrate de tener pestañas en lugar de espacios!

{ var indentStack = [], rootScope = { value: "PROGRAM", values: [], scopes: [] }; function addToRootScope(text) { // Here we wiggle with the form and append the new // scope to the rootScope. if (!text) return; if (indentStack.length === 0) { rootScope.scopes.unshift({ text: text, statements: [] }); } else { rootScope.scopes[0].statements.push(text); } } } /* Add some grammar */ start = lines: (line EOL+)* { return rootScope; } line = line: (samedent t:text { addToRootScope(t); }) &EOL / line: (indent t:text { addToRootScope(t); }) &EOL / line: (dedent t:text { addToRootScope(t); }) &EOL / line: [ /t]* &EOL / EOF samedent = i:[/t]* &{ return i.length === indentStack.length; } { console.log("s:", i.length, " level:", indentStack.length); } indent = i:[/t]+ &{ return i.length > indentStack.length; } { indentStack.push(""); console.log("i:", i.length, " level:", indentStack.length); } dedent = i:[/t]* &{ return i.length < indentStack.length; } { for (var j = 0; j < i.length + 1; j++) { indentStack.pop(); } console.log("d:", i.length + 1, " level:", indentStack.length); } text = numbers: number+ { return numbers.join(""); } / txt: character+ { return txt.join(""); } number = $[0-9] character = $[ a-zA-Z->+] __ = [ ]+ _ = [ ]* EOF = !. EOL = "/r/n" / "/n" / "/r"