html entities to string python

Convert XML/HTML entities into Unicode strings in Python

This question already has an answer here:

Decode HTML entities in a Python string? (5 answers)

I'm doing some web scraping, and sites frequently use HTML entities to represent non-ASCII characters. Does Python have a utility that takes a string with HTML entities and returns a unicode type?

For example:

I get back

    &#x01ce;

which represents an "ǎ" with a tone mark. In binary, this is represented as the 16-bit value 01ce. I want to convert the HTML entity into the value u'\u01ce'.

Here is the Python 3 version of dF's answer:

    import re
    import html.entities

    def unescape(text):
        """
        Removes HTML or XML character references and entities from a text string.

        :param text: The HTML (or XML) source text.
        :return: The plain text, as a Unicode string, if necessary.
        """
        def fixup(m):
            text = m.group(0)
            if text[:2] == "&#":
                # character reference
                try:
                    if text[:3] == "&#x":
                        return chr(int(text[3:-1], 16))
                    else:
                        return chr(int(text[2:-1]))
                except ValueError:
                    pass
            else:
                # named entity
                try:
                    text = chr(html.entities.name2codepoint[text[1:-1]])
                except KeyError:
                    pass
            return text  # leave as is
        return re.sub(r"&#?\w+;", fixup, text)

The main changes concern htmlentitydefs, which is now html.entities, and unichr, which is now chr. See this Python 3 porting guide.

You can find an answer here: How to get international characters from a web page?

EDIT: it seems that BeautifulSoup doesn't convert entities written in hexadecimal form. It can be fixed:

    import copy, re
    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, Python 2

    hexentityMassage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
    # replace hexadecimal character reference by decimal one
    hexentityMassage += [(re.compile('&#x([^;]+);'),
                          lambda m: '&#%d;' % int(m.group(1), 16))]

    def convert(html):
        return BeautifulSoup(html,
                             convertEntities=BeautifulSoup.HTML_ENTITIES,
                             markupMassage=hexentityMassage).contents[0].string

    html = '<html>&#x01ce;&#462;</html>'
    print repr(convert(html))  # u'\u01ce\u01ce'

EDIT:

The unescape() function mentioned by @dF, which uses the htmlentitydefs standard module and unichr(), might be more appropriate in this case.

The standard library's own HTMLParser has an undocumented unescape() function that does exactly what you think it does.

Here is a function that should help you get it right and convert the entities back to UTF-8 characters.

    import re, htmlentitydefs  # Python 2 module names

    def unescape(text):
        """Removes HTML or XML character references
        and entities from a text string.
        @param text The HTML (or XML) source text.
        @return The plain text, as a Unicode string, if necessary.
        from Fredrik Lundh
        2008-01-03: input only unicode characters string.
        http://effbot.org/zone/re-sub.htm#unescape-html
        """
        def fixup(m):
            text = m.group(0)
            if text[:2] == "&#":
                # character reference
                try:
                    if text[:3] == "&#x":
                        return unichr(int(text[3:-1], 16))
                    else:
                        return unichr(int(text[2:-1]))
                except ValueError:
                    print "Value Error"
            else:
                # named entity
                # re-escape the reserved characters.
                try:
                    if text[1:-1] == "amp":
                        text = "&amp;"
                    elif text[1:-1] == "gt":
                        text = "&gt;"
                    elif text[1:-1] == "lt":
                        text = "&lt;"
                    else:
                        print text[1:-1]
                        text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
                except KeyError:
                    print "keyerror"
            return text  # leave as is
        return re.sub(r"&#?\w+;", fixup, text)

I'm not sure why the Stack Overflow thread doesn't include the ';' in the search/replace (i.e. lambda m: '&#%d;'). If you leave it out, BeautifulSoup can barf because the adjacent character may be interpreted as part of the HTML code (i.e. &#39B for &#39Blackout).

This worked better for me:

    import re
    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, Python 2

    html_string = '<a href="/cgi-bin/article.cgi?f=/c/a/2010/12/13/BA3V1GQ1CI.DTL"title="">&#x27;Blackout in a can; on some shelves despite ban</a>'
    hexentityMassage = [(re.compile('&#x([^;]+);'),
                         lambda m: '&#%d;' % int(m.group(1), 16))]
    soup = BeautifulSoup(html_string,
                         convertEntities=BeautifulSoup.HTML_ENTITIES,
                         markupMassage=hexentityMassage)

int(m.group(1), 16) converts the number (given in base 16) back to an integer.
m.group(0) returns the entire match; m.group(1) returns the regex capture group.
Basically, using markupMassage is the same as:
    html_string = re.sub('&#x([^;]+);', lambda m: '&#%d;' % int(m.group(1), 16), html_string)
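
The same hexadecimal-to-decimal rewrite can be tried on its own, without BeautifulSoup. This is only a minimal Python 3 sketch: hex_refs_to_decimal is a hypothetical helper name, and narrowing the pattern to hex digits is an assumption (the answer above uses [^;]+).

```python
import re

# Matches a hexadecimal character reference such as &#x27; (semicolon required).
# Assumption: only hex digits are accepted between "&#x" and ";".
HEX_REF = re.compile(r"&#x([0-9a-fA-F]+);")

def hex_refs_to_decimal(markup):
    """Rewrite each &#xHH; reference as its decimal form &#DD; (hypothetical helper)."""
    # m.group(1) holds the hex digits; int(..., 16) turns them into an integer.
    return HEX_REF.sub(lambda m: "&#%d;" % int(m.group(1), 16), markup)

print(hex_refs_to_decimal("&#x27;Blackout in a can"))  # &#39;Blackout in a can
```

Keeping the ';' in both the pattern and the replacement means the character that follows the reference can never be read as part of it.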

Another solution is the built-in library xml.sax.saxutils (for both HTML and XML). However, it will only convert &gt;, &amp; and &lt;.

    from xml.sax.saxutils import unescape

    escaped_text = unescape(text_to_escape)
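
To make that limitation concrete, here is a small Python 3 sketch. The sample strings are made up for illustration; the optional entities argument is part of the standard saxutils.unescape signature and lets you supply extra replacements yourself.

```python
from xml.sax.saxutils import unescape

# Only &lt;, &gt; and &amp; are handled by default;
# a named entity such as &eacute; is left untouched.
default_result = unescape("&lt;b&gt; caf&eacute; &amp; bar")
print(default_result)  # <b> caf&eacute; & bar

# Extra replacements can be passed as a dict of {escaped: replacement}.
extended_result = unescape("&quot;caf&eacute;&quot;",
                           {"&quot;": '"', "&eacute;": "\u00e9"})
print(extended_result)  # "café"
```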

Python has the htmlentitydefs module, but it doesn't include a function to unescape HTML entities.

Python developer Fredrik Lundh (author of elementtree, among other things) has such a function on his website, which works with decimal, hexadecimal and named entities:

    import re, htmlentitydefs

    ##
    # Removes HTML or XML character references and entities from a text string.
    #
    # @param text The HTML (or XML) source text.
    # @return The plain text, as a Unicode string, if necessary.

    def unescape(text):
        def fixup(m):
            text = m.group(0)
            if text[:2] == "&#":
                # character reference
                try:
                    if text[:3] == "&#x":
                        return unichr(int(text[3:-1], 16))
                    else:
                        return unichr(int(text[2:-1]))
                except ValueError:
                    pass
            else:
                # named entity
                try:
                    text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
                except KeyError:
                    pass
            return text  # leave as is
        return re.sub(r"&#?\w+;", fixup, text)

If you are on Python 3.4 or later, you can simply use html.unescape:

    import html

    s = html.unescape(s)
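
As a quick check of how html.unescape treats the reference forms discussed in this thread (hexadecimal, decimal and named), here is a minimal sketch for Python 3.4+. The sample strings and the names samples and decoded are made up for illustration.

```python
import html

# Escaped input mapped to the text we expect back.
samples = {
    "&#x01ce;": "\u01ce",  # hexadecimal character reference
    "&#462;": "\u01ce",    # decimal character reference (462 == 0x1ce)
    "&eacute;": "\u00e9",  # named entity
    "&amp;lt;": "&lt;",    # only one level of escaping is undone
}

decoded = {escaped: html.unescape(escaped) for escaped in samples}
for escaped, expected in samples.items():
    assert decoded[escaped] == expected
```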

An alternative, if you have lxml:

    >>> import lxml.html
    >>> lxml.html.fromstring('&#x01ce').text
    u'\u01ce'

Use the built-in unichr; BeautifulSoup is not necessary:

    >>> entity = '&#x01ce'
    >>> unichr(int(entity[3:],16))
    u'\u01ce'
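
On Python 3, unichr no longer exists and chr plays the same role. Here is a minimal sketch of the same idea; decode_hex_reference is a hypothetical helper name, and it assumes the input is a single well-formed hexadecimal reference, with or without the trailing semicolon.

```python
def decode_hex_reference(entity):
    """Turn one reference like '&#x01ce' or '&#x01ce;' into its character.

    Hypothetical helper: assumes the string starts with '&#x' and
    contains only hex digits after that prefix.
    """
    digits = entity[3:].rstrip(";")  # drop the '&#x' prefix and optional ';'
    return chr(int(digits, 16))      # chr replaces Python 2's unichr

print(repr(decode_hex_reference("&#x01ce")))  # 'ǎ'
```

For anything beyond a single known reference, the regex-based unescape functions or html.unescape shown earlier are a safer fit, since they also handle decimal and named forms.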