How to check if a string is a valid Python identifier, including a keyword check?


Python 3

Python 3 now has 'foo'.isidentifier(), so that seems to be the best solution for recent Python versions (thanks to Runciter @ freenode for the suggestion). However, somewhat counterintuitively, it does not check against the list of keywords, so a combination of both must be used:

    import keyword

    def isidentifier(ident: str) -> bool:
        """Determines if string is valid Python identifier."""

        if not isinstance(ident, str):
            raise TypeError("expected str, but got {!r}".format(type(ident)))

        if not ident.isidentifier():
            return False

        if keyword.iskeyword(ident):
            return False

        return True
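To make the gap concrete, here is a minimal Python 3 sketch showing that str.isidentifier() on its own accepts keywords. The helper name is_usable_name is hypothetical and only exists for this illustration; it performs the same two checks as the function above.

```python
import keyword

def is_usable_name(ident: str) -> bool:
    # Hypothetical helper: lexical check plus keyword check in one expression.
    return ident.isidentifier() and not keyword.iskeyword(ident)

# str.isidentifier() only looks at the lexical shape of the name...
print("pass".isidentifier())      # True
# ...so the keyword check has to be done separately.
print(keyword.iskeyword("pass"))  # True
print(is_usable_name("pass"))     # False
print(is_usable_name("foo1_23"))  # True
```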

Python 2

For Python 2, the easiest way to check whether a given string is a valid Python identifier is to let Python parse it itself.

There are two possible approaches. The fastest is to use ast and check whether the AST of a single expression has the desired shape:

    import ast

    def isidentifier(ident):
        """Determines if string is valid Python identifier."""

        # Smoke test - if it's not string, then it's not identifier, but we don't
        # want to just silence exception. It's better to fail fast.
        if not isinstance(ident, str):
            raise TypeError("expected str, but got {!r}".format(type(ident)))

        # Resulting AST of simple identifier is <Module [<Expr <Name "foo">>]>
        try:
            root = ast.parse(ident)
        except SyntaxError:
            return False

        if not isinstance(root, ast.Module):
            return False

        if len(root.body) != 1:
            return False

        if not isinstance(root.body[0], ast.Expr):
            return False

        if not isinstance(root.body[0].value, ast.Name):
            return False

        # Rejects inputs such as "foo " that parse to the name "foo".
        if root.body[0].value.id != ident:
            return False

        return True
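To see why each of those checks is needed, it can help to look at what ast.parse produces for a few inputs. The following is a small sketch for Python 3; the describe helper is hypothetical and only summarizes the shape of the parsed module. Exact node names for literals differ between Python versions, so only names and keywords are shown.

```python
import ast

def describe(source: str) -> str:
    # Hypothetical helper: summarize the top-level shape of the parsed source.
    try:
        root = ast.parse(source)
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return "SyntaxError"
    if len(root.body) != 1:
        return "{} statements".format(len(root.body))
    stmt = root.body[0]
    if isinstance(stmt, ast.Expr):
        return "Expr/" + type(stmt.value).__name__
    return type(stmt).__name__

for source in ("foo", "foo.bar", "pass", "foo bar", " foo", ""):
    print(repr(source), "->", describe(source))
```

Only the first input yields the Module / Expr / Name shape the function accepts. Note that "foo " (trailing space) also parses to a single Name, which is why the final comparison of the node's id against the original string is still required.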

The other is to let the tokenize module split the identifier into a stream of tokens and check that it contains only our name:

    import keyword
    import tokenize

    def isidentifier(ident):
        """Determines if string is valid Python identifier."""

        # Smoke test - if it's not string, then it's not identifier, but we don't
        # want to just silence exception. It's better to fail fast.
        if not isinstance(ident, str):
            raise TypeError("expected str, but got {!r}".format(type(ident)))

        # Quick test - if string is in keyword list, it's definitely not an ident.
        if keyword.iskeyword(ident):
            return False

        readline = lambda g=(lambda: (yield ident))(): next(g)
        tokens = list(tokenize.generate_tokens(readline))

        # You should get exactly 2 tokens
        if len(tokens) != 2:
            return False

        # First is NAME, identifier.
        if tokens[0][0] != tokenize.NAME:
            return False

        # Name should span all the string, so there would be no whitespace.
        if ident != tokens[0][1]:
            return False

        # Second is ENDMARKER, ending stream
        if tokens[1][0] != tokenize.ENDMARKER:
            return False

        return True

The same function, but compatible with Python 3, looks like this:

    import keyword
    import tokenize

    def isidentifier_py3(ident):
        """Determines if string is valid Python identifier."""

        # Smoke test - if it's not string, then it's not identifier, but we don't
        # want to just silence exception. It's better to fail fast.
        if not isinstance(ident, str):
            raise TypeError("expected str, but got {!r}".format(type(ident)))

        # Quick test - if string is in keyword list, it's definitely not an ident.
        if keyword.iskeyword(ident):
            return False

        readline = lambda g=(lambda: (yield ident.encode('utf-8-sig')))(): next(g)
        tokens = list(tokenize.tokenize(readline))

        # You should get exactly 3 tokens
        if len(tokens) != 3:
            return False

        # If using Python 3, first one is ENCODING, it's always utf-8 because
        # we explicitly passed in UTF-8 BOM with ident.
        if tokens[0].type != tokenize.ENCODING:
            return False

        # Second is NAME, identifier.
        if tokens[1].type != tokenize.NAME:
            return False

        # Name should span all the string, so there would be no whitespace.
        if ident != tokens[1].string:
            return False

        # Third is ENDMARKER, ending stream
        if tokens[2].type != tokenize.ENDMARKER:
            return False

        return True

However, be aware of bugs in the Python 3 tokenize implementation that reject some completely valid identifiers such as ℘᧚, ﮯ and 贈ᩭ. ast works fine, though. In general, I advise against using tokenize-based implementations for real checks.

Also, some may consider heavy machinery like the AST parser to be a bit of overkill. This simple implementation is self-contained and guaranteed to work on any Python 2:

    import keyword
    import string

    def isidentifier(ident):
        """Determines if string is valid Python identifier."""

        if not isinstance(ident, str):
            raise TypeError("expected str, but got {!r}".format(type(ident)))

        if not ident:
            return False

        if keyword.iskeyword(ident):
            return False

        # Python 2 only: string.lowercase and string.uppercase were removed in Python 3.
        first = '_' + string.lowercase + string.uppercase
        if ident[0] not in first:
            return False

        other = first + string.digits
        for ch in ident[1:]:
            if ch not in other:
                return False

        return True

Here are a few tests to check that all of these work:

    assert(isidentifier('foo'))
    assert(isidentifier('foo1_23'))
    assert(not isidentifier('pass'))     # syntactically correct keyword
    assert(not isidentifier('foo '))     # trailing whitespace
    assert(not isidentifier(' foo'))     # leading whitespace
    assert(not isidentifier('1234'))     # number
    assert(not isidentifier('1234abc'))  # number and letters
    assert(not isidentifier('👻'))       # Unicode not from allowed range
    assert(not isidentifier(''))         # empty string
    assert(not isidentifier(' '))        # whitespace only
    assert(not isidentifier('foo bar'))  # several tokens
    assert(not isidentifier('no-dashed-names-for-you'))  # no such thing in Python

    # Unicode identifiers are only allowed in Python 3:
    assert(isidentifier('℘᧚'))  # Unicode $Other_ID_Start and $Other_ID_Continue

Performance

All measurements were made on my machine (MBPr mid 2014) on the same randomly generated test set of 1 500 000 elements, 1 000 000 valid and 500 000 invalid. YMMV

    == Python 3:
    method | calls/sec | faster
    ---------------------------
    token  |    48 286 |  1.00x
    ast    |   175 530 |  3.64x
    native | 1 924 680 | 39.86x

    == Python 2:
    method | calls/sec | faster
    ---------------------------
    token  |    83 994 |  1.00x
    ast    |   208 206 |  2.48x
    simple | 1 066 461 | 12.70x
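If you want to get comparable numbers on your own machine, a rough timeit harness along these lines should do. This is only a sketch: the sample list is made up for illustration and is not the randomly generated test set used for the table above, and calls_per_second is a hypothetical helper name.

```python
import keyword
import timeit

def isidentifier(ident: str) -> bool:
    # The "native" Python 3 variant: str.isidentifier() plus keyword check.
    return ident.isidentifier() and not keyword.iskeyword(ident)

def calls_per_second(func, samples, repeat=3):
    # Time one full pass over the samples several times and keep the best run.
    best = min(timeit.repeat(lambda: [func(s) for s in samples],
                             number=1, repeat=repeat))
    return len(samples) / best

# Illustrative data only, not the original benchmark set.
samples = ["foo", "foo1_23", "pass", "1234abc", "foo bar"] * 2000
rate = calls_per_second(isidentifier, samples)
print("native: {:,.0f} calls/sec".format(rate))
```

Swap in the ast or tokenize variants as func to compare them on the same samples.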

Does anyone know whether there is a built-in Python method that checks whether something is a valid Python variable name, INCLUDING a check against reserved keywords? (So something like 'in' or 'for' would fail...)

Failing that, does anyone know where I can get a list of reserved keywords (i.e., dynamically from within Python, rather than copy-and-pasting something from the online docs)? Or do you have another good way of writing your own check?

Surprisingly, testing by wrapping a setattr in try/except doesn't work, because something like this:

    setattr(myObj, 'My Sweet Name!', 23)

...actually works! (...and the value can even be retrieved with getattr!)
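For anyone who wants to reproduce this, here is a minimal Python 3 sketch. The class Thing is a made-up placeholder. On a plain instance, setattr stores the name as a key in the instance's attribute dictionary without checking that it is a valid identifier, which is why it cannot serve as a validity test.

```python
class Thing:
    # Placeholder class, only here so we have an object with a __dict__.
    pass

obj = Thing()
setattr(obj, "My Sweet Name!", 23)

print(getattr(obj, "My Sweet Name!"))    # 23
print("My Sweet Name!" in vars(obj))     # True
print("My Sweet Name!".isidentifier())   # False
```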

The keyword module contains the list of all reserved keywords:

    >>> import keyword
    >>> keyword.iskeyword("in")
    True
    >>> keyword.kwlist
    ['and', 'as', 'assert', 'break', 'class', 'continue', 'def', 'del', 'elif',
     'else', 'except', 'exec', 'finally', 'for', 'from', 'global', 'if', 'import',
     'in', 'is', 'lambda', 'not', 'or', 'pass', 'print', 'raise', 'return', 'try',
     'while', 'with', 'yield']

Note that this list will differ depending on the major version of Python you are using, since the list of keywords changes (especially between Python 2 and Python 3).
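A quick way to see the difference is to query the keyword module at runtime instead of relying on a pasted list. The sketch below assumes Python 3; the names probed are just examples chosen because their status changed between Python 2 and Python 3.

```python
import keyword
import sys

print("Python", ".".join(map(str, sys.version_info[:2])))
print(len(keyword.kwlist), "keywords in this interpreter")

# 'print' and 'exec' were keywords in Python 2 but are functions in Python 3;
# 'nonlocal' and 'True' are keywords only in Python 3.
for name in ("print", "exec", "nonlocal", "True"):
    print(name, keyword.iskeyword(name))
```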

If you also want all the built-in names, use __builtins__:

    >>> dir(__builtins__)
    ['ArithmeticError', 'AssertionError', 'AttributeError', 'BaseException',
     'BlockingIOError', 'BrokenPipeError', 'BufferError', 'BytesWarning',
     'ChildProcessError', 'ConnectionAbortedError', 'ConnectionError',
     'ConnectionRefusedError', 'ConnectionResetError', 'DeprecationWarning',
     'EOFError', 'Ellipsis', 'EnvironmentError', 'Exception', 'False',
     'FileExistsError', 'FileNotFoundError', 'FloatingPointError',
     'FutureWarning', 'GeneratorExit', 'IOError', 'ImportError', 'ImportWarning',
     'IndentationError', 'IndexError', 'InterruptedError', 'IsADirectoryError',
     'KeyError', 'KeyboardInterrupt', 'LookupError', 'MemoryError', 'NameError',
     'None', 'NotADirectoryError', 'NotImplemented', 'NotImplementedError',
     'OSError', 'OverflowError', 'PendingDeprecationWarning', 'PermissionError',
     'ProcessLookupError', 'ReferenceError', 'ResourceWarning', 'RuntimeError',
     'RuntimeWarning', 'StopIteration', 'SyntaxError', 'SyntaxWarning',
     'SystemError', 'SystemExit', 'TabError', 'TimeoutError', 'True', 'TypeError',
     'UnboundLocalError', 'UnicodeDecodeError', 'UnicodeEncodeError',
     'UnicodeError', 'UnicodeTranslateError', 'UnicodeWarning', 'UserWarning',
     'ValueError', 'Warning', 'ZeroDivisionError', '_', '__build_class__',
     '__debug__', '__doc__', '__import__', '__name__', '__package__', 'abs',
     'all', 'any', 'ascii', 'bin', 'bool', 'bytearray', 'bytes', 'callable',
     'chr', 'classmethod', 'compile', 'complex', 'copyright', 'credits',
     'delattr', 'dict', 'dir', 'divmod', 'enumerate', 'eval', 'exec', 'exit',
     'filter', 'float', 'format', 'frozenset', 'getattr', 'globals', 'hasattr',
     'hash', 'help', 'hex', 'id', 'input', 'int', 'isinstance', 'issubclass',
     'iter', 'len', 'license', 'list', 'locals', 'map', 'max', 'memoryview',
     'min', 'next', 'object', 'oct', 'open', 'ord', 'pow', 'print', 'property',
     'quit', 'range', 'repr', 'reversed', 'round', 'set', 'setattr', 'slice',
     'sorted', 'staticmethod', 'str', 'sum', 'super', 'tuple', 'type', 'vars',
     'zip']

And note that some of these (like copyright) are not really that big of a deal to override.
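If you want to tell keywords apart from names that merely shadow a built-in, something like the following Python 3 sketch could work. The function classify_name and its return labels are made up for this example; it uses the documented builtins module rather than __builtins__, which is a CPython implementation detail.

```python
import builtins
import keyword

def classify_name(name: str) -> str:
    # Hypothetical helper: returns one of "invalid", "keyword", "builtin", "ok".
    if not name.isidentifier():
        return "invalid"
    if keyword.iskeyword(name):
        return "keyword"       # cannot be used as a variable name at all
    if hasattr(builtins, name):
        return "builtin"       # legal, but shadows a built-in name
    return "ok"

for name in ("list", "for", "my_var", "1abc", "copyright"):
    print(name, "->", classify_name(name))
```

Whether "builtin" should be treated as an error or just a warning is a policy decision for your own code; Python itself allows the assignment.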

One more caveat: in Python 2, True, False and None are not considered keywords. However, assigning to None is a SyntaxError. Assigning to True or False is allowed, although not recommended (same as with any other built-in). In Python 3, they are keywords, so this is not an issue.
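On Python 3 this is easy to confirm without executing anything, by compiling an assignment and checking for SyntaxError. The helper assignment_compiles below is a hypothetical name used only for this sketch.

```python
import keyword

def assignment_compiles(name: str) -> bool:
    # Only compiles "<name> = 1"; nothing is executed.
    try:
        compile("{} = 1".format(name), "<check>", "exec")
    except SyntaxError:
        return False
    return True

# On Python 3 the three constants are keywords and cannot be assigned to,
# while an ordinary built-in such as list can still be rebound.
for name in ("True", "False", "None", "list"):
    print(name, keyword.iskeyword(name), assignment_compiles(name))
```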

John: as a slight improvement, I added $ to the regex; otherwise the test does not detect spaces:

    import keyword
    import re

    my_var = "$testBadVar"
    # The trailing $ anchors the pattern at the end of the string.
    print(re.match("[_A-Za-z][_a-zA-Z0-9]*$", my_var) and not keyword.iskeyword(my_var))
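One thing to watch out for: in Python's re module, $ also matches just before a trailing newline, so "foo\n" still slips through. A possible tightening, sketched here for Python 3 with the hypothetical helper is_ascii_identifier, is to use fullmatch. Like the original regex, it only covers ASCII names.

```python
import keyword
import re

_NAME_RE = re.compile(r"[_A-Za-z][_A-Za-z0-9]*")

def is_ascii_identifier(name: str) -> bool:
    # fullmatch requires the whole string to match, newline included.
    return _NAME_RE.fullmatch(name) is not None and not keyword.iskeyword(name)

# The $-anchored pattern accepts a trailing newline...
print(bool(re.match(r"[_A-Za-z][_a-zA-Z0-9]*$", "foo\n")))  # True
# ...while fullmatch rejects it.
print(is_ascii_identifier("foo\n"))  # False
print(is_ascii_identifier("foo"))    # True
```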

The list of Python keywords is short, so you can simply check the syntax with a regular expression and then check membership in the relatively small list of keywords:

    import keyword  # thanks asmeurer
    import re

    my_var = "$testBadVar"
    print(re.match("[_A-Za-z][_a-zA-Z0-9]*", my_var) and not keyword.iskeyword(my_var))

A shorter but more dangerous alternative would be:

    my_bad_var = "%#ASD"
    try:
        exec("{0}=1".format(my_bad_var))
    except SyntaxError:  # this may not be the right error
        print("Invalid variable name!")

And finally, a slightly safer variant:

    my_bad_var = "%#ASD"
    try:
        cc = compile("{0}=1".format(my_bad_var), "asd", "single")
        eval(cc)
        print("VALID")
    except SyntaxError:  # maybe a different error
        print("INVALID!")
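A caveat with both of these: "compiles as an assignment target" is not the same thing as "is an identifier". The Python 3 sketch below only compiles the statement (it never runs it) and shows a few strings that pass even though they are not plain names. The helper naive_check is a hypothetical name for this illustration.

```python
def naive_check(name: str) -> bool:
    # Mirrors the compile-based test above, without executing the result.
    try:
        compile("{0}=1".format(name), "<check>", "single")
    except SyntaxError:
        return False
    return True

# "a = b" becomes the chained assignment "a = b=1"; attribute and
# subscript targets are also valid on the left-hand side of "=".
for candidate in ("foo", "a = b", "obj.attr", "items[0]", "%#ASD"):
    print(repr(candidate), naive_check(candidate))
```

So if you go this route, it is probably best to combine it with a direct check such as str.isidentifier() or the regex above rather than rely on it alone.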