Escaping special HTML characters in Python

The cgi.escape function converts special characters to valid HTML entities. (It dates from Python 2; it was deprecated in Python 3.2 and removed in Python 3.8.)

    import cgi  # cgi.escape only exists in Python 2 and in Python 3 before 3.8

    original_string = 'Hello "XYZ" this \'is\' a test & so on '
    escaped_string = cgi.escape(original_string, True)  # True: also escape double quotes
    print(original_string)
    print(escaped_string)

will result in

    Hello "XYZ" this 'is' a test & so on
    Hello &quot;XYZ&quot; this 'is' a test &amp; so on

The optional second parameter of cgi.escape controls whether double quotes are escaped. By default they are not. Single quotes are never escaped.

I have a string in which special characters such as ' or " or & (...) can appear. In the string:

    string = """ Hello "XYZ" this 'is' a test & so on """

how can I automatically escape every special character, so that I get this:

    string = " Hello &quot;XYZ&quot; this &#39;is&#39; a test &amp; so on "

In Python 3.2 and later, you can use the html.escape function, for example:

    >>> string = """ Hello "XYZ" this 'is' a test & so on """
    >>> import html
    >>> html.escape(string)
    ' Hello &quot;XYZ&quot; this &#x27;is&#x27; a test &amp; so on '
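As a minimal sketch of how the quote parameter of html.escape behaves, and of the round trip through html.unescape (the variable names here are only illustrative):

```python
import html

s = """ Hello "XYZ" this 'is' a test & so on """

# quote=True is the default: &, <, >, " and ' are all escaped.
escaped = html.escape(s)

# quote=False leaves both kinds of quotes alone and only escapes &, < and >.
no_quotes = html.escape(s, quote=False)

print(escaped)    #  Hello &quot;XYZ&quot; this &#x27;is&#x27; a test &amp; so on
print(no_quotes)  #  Hello "XYZ" this 'is' a test &amp; so on

# html.unescape reverses the escaping.
print(html.unescape(escaped) == s)  # True
```

Note that the default is the opposite of cgi.escape, which leaves quotes untouched unless asked.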

For earlier versions of Python, see http://wiki.python.org/moin/EscapingHtml :

The cgi module that comes with Python has an escape() function:

    import cgi

    s = cgi.escape("""& < >""")  # s = "&amp; &lt; &gt;"

However, it doesn't escape characters beyond &, < and >. If it is used as cgi.escape(string_to_escape, quote=True), it also escapes ".

Here's a small snippet that will let you escape quotes and apostrophes as well:

    html_escape_table = {
        "&": "&amp;",
        '"': "&quot;",
        "'": "&apos;",
        ">": "&gt;",
        "<": "&lt;",
    }

    def html_escape(text):
        """Produce entities within text."""
        return "".join(html_escape_table.get(c, c) for c in text)

You can also use escape() from xml.sax.saxutils to escape HTML. This function should run faster. The unescape() function of the same module can be passed the same kind of arguments to decode a string.

    from xml.sax.saxutils import escape, unescape

    # escape() and unescape() take care of &, < and >.
    html_escape_table = {
        '"': "&quot;",
        "'": "&apos;",
    }
    html_unescape_table = {v: k for k, v in html_escape_table.items()}

    def html_escape(text):
        return escape(text, html_escape_table)

    def html_unescape(text):
        return unescape(text, html_unescape_table)
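To illustrate the round trip with xml.sax.saxutils, here is a small self-contained sketch. The table names and the sample string are made up for the example; only escape() and unescape() come from the standard library.

```python
from xml.sax.saxutils import escape, unescape

# Extra replacements on top of the &, < and > that escape() always handles.
extra_escapes = {'"': "&quot;", "'": "&apos;"}
# Inverted table, passed to unescape() to undo the extra replacements.
extra_unescapes = {v: k for k, v in extra_escapes.items()}

original = """<a href="x">Tom & 'Jerry'</a>"""
encoded = escape(original, extra_escapes)
decoded = unescape(encoded, extra_unescapes)

print(encoded)  # &lt;a href=&quot;x&quot;&gt;Tom &amp; &apos;Jerry&apos;&lt;/a&gt;
print(decoded == original)  # True
```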

The other answers here will help with characters like the ones you listed and a few others. However, if you also want to convert everything else to entity names, you'll have to do something else. For example, if á needs to be converted to &aacute;, neither cgi.escape nor html.escape will help you there. You'll want to do something like this, using html.entities.entitydefs, which is just a dictionary. (The following code is written for Python 3.x, but there is a partial attempt at making it compatible with 2.x to give you an idea):

    # -*- coding: utf-8 -*-
    import sys

    if sys.version_info[0] > 2:
        from html.entities import entitydefs
    else:
        from htmlentitydefs import entitydefs

    text = ";\"áèïøæỳ"  # This is your string variable containing the stuff you want to convert

    # $ஸ$ is just something random the user isn't likely to have in the document.
    # We're converting it so the semicolons in the entity names don't get converted into entity names.
    text = text.replace(";", "$ஸ$")
    text = text.replace("$ஸ$", "&semi;")  # Converting semicolons to entity names

    # Using appropriate code for each Python version.
    if sys.version_info[0] > 2:
        for k, v in entitydefs.items():
            if k not in {"semi", "amp"}:
                text = text.replace(v, "&" + k + ";")  # You have to add the & and ; manually.
    else:
        for k, v in entitydefs.iteritems():
            if k not in {"semi", "amp"}:
                text = text.replace(v, "&" + k + ";")  # You have to add the & and ; manually.

    # The above code doesn't cover every single entity name, although I believe it covers
    # everything in the Latin-1 character set. So, I'm manually doing some common ones I like hereafter:
    text = text.replace("ŷ", "&ycirc;")
    text = text.replace("Ŷ", "&Ycirc;")
    text = text.replace("ŵ", "&wcirc;")
    text = text.replace("Ŵ", "&Wcirc;")
    text = text.replace("ỳ", "&#7923;")
    text = text.replace("Ỳ", "&#7922;")
    text = text.replace("ẃ", "&wacute;")
    text = text.replace("Ẃ", "&Wacute;")
    text = text.replace("ẁ", "&#7809;")
    text = text.replace("Ẁ", "&#7808;")

    print(text)
    # Python 3.x outputs: &semi;&quot;&aacute;&egrave;&iuml;&oslash;&aelig;&#7923;
    # The Python 2.x version outputs the wrong stuff. So, clearly you'll have to adjust the code somehow for it.
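A single-pass variant is also possible. The following is only a sketch, not the answer's own method. It assumes Python 3 and uses html.entities.codepoint2name, which maps code points to HTML 4 entity names. The function name to_named_entities is hypothetical. Because each character is visited once, no placeholder is needed to protect semicolons, and & is escaped too.

```python
from html.entities import codepoint2name


def to_named_entities(text):
    """Replace characters with named entities where HTML 4 defines one.

    Other non-ASCII characters become decimal numeric references;
    plain ASCII without a named entity is left untouched.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp in codepoint2name:
            out.append("&" + codepoint2name[cp] + ";")
        elif cp > 127:
            out.append("&#%d;" % cp)
        else:
            out.append(ch)
    return "".join(out)


print(to_named_entities('"áèïøæỳ'))
# &quot;&aacute;&egrave;&iuml;&oslash;&aelig;&#7923;
print(to_named_entities("a & b < c"))
# a &amp; b &lt; c
```

Unlike the code above, this leaves semicolons and apostrophes as they are, since HTML 4 has no named entity for either.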

A simple string function will do it:

    def escape(t):
        """HTML-escape the text in `t`."""
        return (t
            .replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
            .replace("'", "&#39;").replace('"', "&quot;")
            )

Other answers in this thread have minor problems: the cgi.escape method for some reason ignores single quotes, and you have to ask it explicitly to do double quotes. The linked wiki page does all five, but uses the XML entity &apos;, which is not an HTML 4 entity (it was only added in HTML5).

This function does all five every time, using standard HTML entities.
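One detail worth spelling out is that the order of the replacements matters: & has to be handled first, otherwise the ampersands of entities produced by earlier replacements get escaped a second time. A minimal sketch, where escape_wrong_order is a hypothetical counterexample and not part of any library:

```python
def escape(t):
    """HTML-escape the text in `t`, replacing & before anything else."""
    return (t
        .replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
        .replace("'", "&#39;").replace('"', "&quot;")
        )


def escape_wrong_order(t):
    """Counterexample: & is replaced last, so '&lt;' becomes '&amp;lt;'."""
    return t.replace("<", "&lt;").replace("&", "&amp;")


print(escape("""<b>"Tom" & 'Jerry'</b>"""))
# &lt;b&gt;&quot;Tom&quot; &amp; &#39;Jerry&#39;&lt;/b&gt;
print(escape_wrong_order("a < b"))
# a &amp;lt; b
```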