
BeautifulSoup4 get_text still has javascript

Based partly on "Can I remove script tags with BeautifulSoup?":

    from urllib.request import urlopen

    from bs4 import BeautifulSoup

    url = "http://www.cnn.com"
    html = urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")

    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.decompose()    # rip it out

    # get text
    text = soup.get_text()

    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each (phrases are separated by two spaces)
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)

    print(text)
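The same steps can be tried without a network connection. What follows is only a minimal sketch: the function name `visible_text` and the sample markup are made up for illustration and are not from the linked answer, and it assumes the `bs4` package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, used instead of fetching a live URL.
SAMPLE_HTML = """
<html>
  <head>
    <title>Demo page</title>
    <style>body { color: red; }</style>
    <script>var tracker = "should not appear";</script>
  </head>
  <body>
    <h1>Top story</h1>
    <p>First paragraph.</p>
    <script>console.log("inline script");</script>
  </body>
</html>
"""


def visible_text(html):
    """Return the page text with <script> and <style> contents removed."""
    soup = BeautifulSoup(html, "html.parser")
    # Calling the soup like a function is shorthand for soup.find_all(...).
    for element in soup(["script", "style"]):
        element.decompose()  # remove the tag and everything inside it
    # Strip each line and drop the ones that end up empty.
    lines = (line.strip() for line in soup.get_text().splitlines())
    return "\n".join(line for line in lines if line)


text = visible_text(SAMPLE_HTML)
print(text)
```

Because the tags are removed from the tree before `get_text()` is called, none of the script or stylesheet source can end up in the result.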

I'm trying to strip all the HTML/JavaScript using bs4, but it doesn't remove the JavaScript. I still see it there alongside the text. How can I get around this?

I tried nltk, which works fine, but clean_html and clean_url are going to be removed in the future. Is there a way to use soup's get_text and get the same result?

I tried looking at these other pages:

BeautifulSoup get_text does not strip all tags and JavaScript

Currently I'm using nltk's deprecated functions.

EDIT

Here's an example:

    from urllib.request import urlopen

    from bs4 import BeautifulSoup

    url = "http://www.cnn.com"
    html = urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    print(soup.get_text())

I still see the following for CNN:

    $j(function() {
        "use strict";
        if ( window.hasOwnProperty('safaripushLib') && window.safaripushLib.checkEnv() ) {
            var pushLib = window.safaripushLib,
                current = pushLib.currentPermissions();
            if (current === "default") {
                pushLib.checkPermissions("helloClient", function() {});
            }
        }
    });
    /*globals MainLocalObj*/
    $j(window).load(function () {
        'use strict';
        MainLocalObj.init();
    });

How can I remove the JS?

The only other option I found is:

https://github.com/aaronsw/html2text

The problem with html2text is that it is sometimes very slow and causes a noticeable lag, whereas speed is something nltk was always very good at.
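For comparison, the same script/style filtering can be done with nothing but the standard library's `html.parser`. This is a swapped-in technique, not something proposed in the thread, and I have not measured how its speed compares with html2text or nltk. The names `TextOnlyParser` and `strip_html` are hypothetical, and a hand-rolled parser like this will be less forgiving of broken markup than BeautifulSoup.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style.
        stripped = data.strip()
        if stripped and not self._skip_depth:
            self.parts.append(stripped)


def strip_html(html):
    """Return the text of an HTML string, one text node per line."""
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


SAMPLE = (
    "<html><head><style>p { margin: 0; }</style>"
    "<script>var x = 1;</script></head>"
    "<body><p>Hello</p><p>World</p></body></html>"
)
result = strip_html(SAMPLE)
print(result)
```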

To avoid encoding errors at the end ...

    import sys
    from urllib.request import urlopen

    from bs4 import BeautifulSoup

    url = "http://www.cnn.com"  # URL from the question; substitute your own
    html = urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")

    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()    # rip it out

    # get text
    text = soup.get_text()

    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each (phrases are separated by two spaces)
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)

    # write UTF-8 bytes directly so the console's default codec is never used
    sys.stdout.buffer.write(text.encode('utf-8') + b'\n')
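The encoding step can be looked at in isolation. Here is a small sketch, assuming Python 3; the helper name `write_utf8` and the sample string are made up for illustration. It shows why a narrow codec such as ASCII fails on typical page text while an explicit UTF-8 encode does not.

```python
import io


def write_utf8(text, binary_stream):
    """Encode text as UTF-8, write it to a binary stream, return the byte count."""
    data = text.encode("utf-8")
    binary_stream.write(data)
    return len(data)


sample = "Edición en español: ¿qué pasó?"

# An in-memory binary stream stands in for sys.stdout.buffer or a file opened in "wb" mode.
buffer = io.BytesIO()
written = write_utf8(sample, buffer)

# The same text cannot be represented in ASCII, which is the kind of
# error an explicit UTF-8 encode avoids.
try:
    sample.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(written, ascii_ok)
```

When printing to a real terminal in Python 3, `sys.stdout.buffer` is the binary stream to pass in.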