python - scraping - Extending CSS selectors in BeautifulSoup


The question:

BeautifulSoup provides very limited support for CSS selectors. For instance, the only supported pseudo-class is nth-of-type, and it accepts only numeric values; arguments like even or odd are not allowed.

Is it possible to extend BeautifulSoup's CSS selectors, or to have it use lxml.cssselect internally as the underlying CSS selection mechanism?

Let's look at a sample problem / use case. Locate only the even rows in the following HTML:

<table>
    <tr>
        <td>1</td>
    </tr>
    <tr>
        <td>2</td>
    </tr>
    <tr>
        <td>3</td>
    </tr>
    <tr>
        <td>4</td>
    </tr>
</table>

With lxml.html and lxml.cssselect, this is easy to do via :nth-of-type(even):

from lxml.html import fromstring
from lxml.cssselect import CSSSelector

# `data` is the HTML string shown above
tree = fromstring(data)

sel = CSSSelector('tr:nth-of-type(even)')
print([e.text_content().strip() for e in sel(tree)])

But in BeautifulSoup:

print(soup.select("tr:nth-of-type(even)"))

would raise an error:

NotImplementedError: Only numeric values are currently supported for the nth-of-type pseudo-class.

Note that we can work around it with .find_all():

print([row.get_text(strip=True)
       for index, row in enumerate(soup.find_all("tr"), start=1)
       if index % 2 == 0])
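The same workaround can be wrapped in a small helper. This is only a sketch and not part of BeautifulSoup: `select_nth` is a hypothetical name, and it assumes the items passed in are siblings of the same type under one parent (as the rows of the sample table are), because `:nth-of-type` counts positions per parent.

```python
def select_nth(items, spec):
    """Return the items that :nth-of-type(spec) would keep.

    `items` is an ordered sequence of same-type siblings; `spec` is
    "even", "odd", or a positive integer (int or numeric string).
    Positions are 1-based, as in CSS.
    """
    items = list(items)
    if spec == "even":
        return items[1::2]   # positions 2, 4, 6, ...
    if spec == "odd":
        return items[0::2]   # positions 1, 3, 5, ...
    n = int(spec)            # raises ValueError for anything else
    if n < 1:
        raise ValueError("position must be at least 1")
    return items[n - 1:n]    # empty list if n is past the end


rows = ["1", "2", "3", "4"]  # stand-ins for the <tr> texts of the sample table
print(select_nth(rows, "even"))  # ['2', '4']
print(select_nth(rows, "odd"))   # ['1', '3']
print(select_nth(rows, 3))       # ['3']
```

With BeautifulSoup, passing `soup.find_all("tr")` as `items` should give the same rows as the list comprehension above, as long as all rows share one parent.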

After checking the source code, it seems that BeautifulSoup does not provide any convenient point in its interface to extend or monkey patch its existing functionality in this regard. Using lxml's functionality is not possible either, since BeautifulSoup only uses lxml during parsing and uses the parsing results to create its own objects from them. The lxml objects are not preserved and cannot be accessed later.

That being said, with enough determination and with the flexibility and introspection capabilities of Python, anything is possible. You can modify the internals of a BeautifulSoup method even at run time:

import inspect
import re
import textwrap

import bs4.element


def replace_code_lines(source, start_token, end_token,
                       replacement, escape_tokens=True):
    """Replace the source code between `start_token` and `end_token`
    in `source` with `replacement`. The `start_token` portion is included
    in the replaced code. If `escape_tokens` is True (default),
    escape the tokens to avoid them being treated as a regular expression."""

    if escape_tokens:
        start_token = re.escape(start_token)
        end_token = re.escape(end_token)

    def replace_with_indent(match):
        indent = match.group(1)
        return textwrap.indent(replacement, indent)

    return re.sub(r"^(\s+)({}[\s\S]+?)(?=^\1{})".format(start_token, end_token),
                  replace_with_indent, source, flags=re.MULTILINE)


# Get the source code of the Tag.select() method
src = textwrap.dedent(inspect.getsource(bs4.element.Tag.select))

# Replace the relevant part of the method.
# This relies on Tag.select() containing the start_token line verbatim.
start_token = "if pseudo_type == 'nth-of-type':"
end_token = "else"
replacement = """\
if pseudo_type == 'nth-of-type':
    try:
        if pseudo_value in ("even", "odd"):
            pass
        else:
            pseudo_value = int(pseudo_value)
    except ValueError:
        raise NotImplementedError(
            'Only numeric values, "even" and "odd" are currently '
            'supported for the nth-of-type pseudo-class.')
    if isinstance(pseudo_value, int) and pseudo_value < 1:
        raise ValueError(
            'nth-of-type pseudo-class value must be at least 1.')
    class Counter(object):
        def __init__(self, destination):
            self.count = 0
            self.destination = destination

        def nth_child_of_type(self, tag):
            self.count += 1
            if pseudo_value == "even":
                return not bool(self.count % 2)
            elif pseudo_value == "odd":
                return bool(self.count % 2)
            elif self.count == self.destination:
                return True
            elif self.count > self.destination:
                # Stop the generator that's sending us
                # these things.
                raise StopIteration()
            return False
    checker = Counter(pseudo_value).nth_child_of_type
"""
new_src = replace_code_lines(src, start_token, end_token, replacement)

# Compile it and execute it in the target module's namespace
exec(new_src, bs4.element.__dict__)
# Monkey patch the target method
bs4.element.Tag.select = bs4.element.select

This is the portion of the code being modified.

Of course, this is anything but elegant and reliable. I can't imagine this being seriously used anywhere, ever.
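The source-rewriting step of the patch above can be tried in isolation, without BeautifulSoup. The sketch below applies the same regular expression to a toy function held in a string. The `parity` function and its body are made up for illustration, and `inspect.getsource` and the `escape_tokens` flag are left out for brevity.

```python
import re
import textwrap


def replace_code_lines(source, start_token, end_token, replacement):
    """Swap the block from `start_token` up to the next `end_token` at the
    same indentation for `replacement`, re-indented to match."""
    start_token = re.escape(start_token)
    end_token = re.escape(end_token)

    def replace_with_indent(match):
        # group(1) is the indentation in front of start_token
        return textwrap.indent(replacement, match.group(1))

    pattern = r"^(\s+)({}[\s\S]+?)(?=^\1{})".format(start_token, end_token)
    return re.sub(pattern, replace_with_indent, source, flags=re.MULTILINE)


# Toy stand-in for a library function (hypothetical, for illustration only).
src = '''\
def parity(n):
    if n == 0:
        label = "zero"
    else:
        label = "nonzero"
    return label
'''

# New body for the `if` branch; the `else` branch is left untouched.
replacement = '''\
if n % 2 == 0:
    label = "even"
'''

new_src = replace_code_lines(src, "if n == 0:", "else", replacement)

namespace = {}
exec(new_src, namespace)  # compile and run the rewritten definition
parity = namespace["parity"]
print(parity(4), parity(7))  # even nonzero
```

The lookahead `(?=^\1{})` stops the match at the first `else` that has the same indentation as the `if`, so only the `if` branch is replaced. It also means the approach breaks as soon as the target source is formatted differently.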

Officially, BeautifulSoup does not support all CSS selectors.

If Python is not the only option, I strongly recommend JSoup (the Java equivalent of this). It supports all CSS selectors.

It is open source (MIT license).
The syntax is easy.
It supports all CSS selectors.
It can also span multiple threads to scale up.
Rich API support in Java for storing results in databases, so it is easy to integrate.

The other alternative, if you still want to stick with Python, is to make it a Jython implementation.

http://jsoup.org/

https://github.com/jhy/jsoup/