
What is a simple way to extract the list of URLs on a webpage using Python?




I want to create a simple web crawler for fun. I need the crawler to get a list of all the links on a page. Does the Python standard library have any built-in functions that would make this easier? Thanks, any knowledge is appreciated.


This is actually very simple with BeautifulSoup.

    from BeautifulSoup import BeautifulSoup

    # document_contents holds the HTML of the page as a string
    [element['href'] for element in BeautifulSoup(document_contents).findAll('a', href=True)]
    # [u'http://example.com/', u'/example', ...]

One last thing: you can use urlparse.urljoin to make all the URLs absolute. If you need the link text, you can use something like element.contents[0].
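To illustrate the urljoin step on its own, here is a minimal sketch written for Python 3, where the function lives in urllib.parse rather than in the Python 2 urlparse module. The base URL and the href values are made up for the example.

```python
from urllib.parse import urljoin

# Hypothetical page URL and href values, used only for illustration.
base_url = "http://example.com/docs/index.html"
hrefs = ["http://example.com/", "/example", "page2.html", "#top"]

# urljoin leaves absolute URLs untouched and resolves
# relative ones against the URL of the page they came from.
absolute_urls = [urljoin(base_url, href) for href in hrefs]

for url in absolute_urls:
    print(url)
```

Passing the URL of the page itself as the base matters: a relative href such as page2.html resolves against the page's directory, not against the site root.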

And here is how you can tie it all together:

    import urllib2
    import urlparse
    from BeautifulSoup import BeautifulSoup

    def get_all_link_targets(url):
        return [urlparse.urljoin(url, tag['href'])
                for tag in BeautifulSoup(urllib2.urlopen(url)).findAll('a', href=True)]


There is an article about using HTMLParser to get the URLs from the <a> tags on a web page.

The code is this:

    from HTMLParser import HTMLParser
    from urllib2 import urlopen

    class Spider(HTMLParser):
        def __init__(self, url):
            HTMLParser.__init__(self)
            req = urlopen(url)
            self.feed(req.read())

        def handle_starttag(self, tag, attrs):
            # attrs is a list of (name, value) pairs; attrs[0][1] is the value
            # of the tag's first attribute, which is not necessarily href
            if tag == 'a' and attrs:
                print "Found link => %s" % attrs[0][1]

    Spider('http://www.python.org')

If you ran that script, you would get output like this:

    rafe@linux-7o1q:~> python crawler.py
    Found link => /
    Found link => #left-hand-navigation
    Found link => #content-body
    Found link => /search
    Found link => /about/
    Found link => /news/
    Found link => /doc/
    Found link => /download/
    Found link => /community/
    Found link => /psf/
    Found link => /dev/
    Found link => /about/help/
    Found link => http://pypi.python.org/pypi
    Found link => /download/releases/2.7/
    Found link => http://docs.python.org/
    Found link => /ftp/python/2.7/python-2.7.msi
    Found link => /ftp/python/2.7/Python-2.7.tar.bz2
    Found link => /download/releases/3.1.2/
    Found link => http://docs.python.org/3.1/
    Found link => /ftp/python/3.1.2/python-3.1.2.msi
    Found link => /ftp/python/3.1.2/Python-3.1.2.tar.bz2
    Found link => /community/jobs/
    Found link => /community/merchandise/
    Found link => margin-top:1.5em
    Found link => margin-top:1.5em
    Found link => margin-top:1.5em
    Found link => color:#D58228; margin-top:1.5em
    Found link => /psf/donations/
    Found link => http://wiki.python.org/moin/Languages
    Found link => http://wiki.python.org/moin/Languages
    Found link => http://www.google.com/calendar/ical/b6v58qvojllt0i6ql654r1vh00%40group.calendar.google.com/public/basic.ics
    Found link => http://wiki.python.org/moin/Python2orPython3
    Found link => http://pypi.python.org/pypi
    Found link => /3kpoll
    Found link => /about/success/usa/
    Found link => reference
    Found link => reference
    Found link => reference
    Found link => reference
    Found link => reference
    Found link => reference
    Found link => /about/quotes
    Found link => http://wiki.python.org/moin/WebProgramming
    Found link => http://wiki.python.org/moin/CgiScripts
    Found link => http://www.zope.org/
    Found link => http://www.djangoproject.com/
    Found link => http://www.turbogears.org/
    Found link => http://wiki.python.org/moin/PythonXml
    Found link => http://wiki.python.org/moin/DatabaseProgramming/
    Found link => http://www.egenix.com/files/python/mxODBC.html
    Found link => http://sourceforge.net/projects/mysql-python
    Found link => http://wiki.python.org/moin/GuiProgramming
    Found link => http://wiki.python.org/moin/WxPython
    Found link => http://wiki.python.org/moin/TkInter
    Found link => http://wiki.python.org/moin/PyGtk
    Found link => http://wiki.python.org/moin/PyQt
    Found link => http://wiki.python.org/moin/NumericAndScientific
    Found link => http://www.pasteur.fr/recherche/unites/sis/formation/python/index.html
    Found link => http://www.pentangle.net/python/handbook/
    Found link => /community/sigs/current/edu-sig
    Found link => http://www.openbookproject.net/pybiblio/
    Found link => http://osl.iu.edu/~lums/swc/
    Found link => /about/apps
    Found link => http://docs.python.org/howto/sockets.html
    Found link => http://twistedmatrix.com/trac/
    Found link => /about/apps
    Found link => http://buildbot.net/trac
    Found link => http://www.edgewall.com/trac/
    Found link => http://roundup.sourceforge.net/
    Found link => http://wiki.python.org/moin/IntegratedDevelopmentEnvironments
    Found link => /about/apps
    Found link => http://www.pygame.org/news.html
    Found link => http://www.alobbs.com/pykyra
    Found link => http://www.vrplumber.com/py3d.py
    Found link => /about/apps
    Found link => reference external
    Found link => reference external
    Found link => reference external
    Found link => reference external
    Found link => reference external
    Found link => reference external
    Found link => reference external
    Found link => reference external
    Found link => reference external
    Found link => reference external
    Found link => reference external
    Found link => reference external
    Found link => reference external
    Found link => /channews.rdf
    Found link => /about/website
    Found link => http://www.xs4all.com/
    Found link => http://www.timparkin.co.uk/
    Found link => /psf/
    Found link => /about/legal
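
Entries such as margin-top:1.5em and reference in that output do not look like links. The script prints attrs[0][1], the value of whichever attribute comes first in the tag, so they appear to be style or class values from tags where href was not listed first. A minimal sketch of one way to pick out href by name, written for Python 3 (where the module is called html.parser); the LinkCollector class and the sample HTML are illustrative assumptions and not part of the article:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; select href by name
        # instead of assuming it is the first attribute.
        for name, value in attrs:
            if name == "href" and value is not None:
                self.links.append(value)


# Made-up HTML snippet standing in for a downloaded page.
sample_html = """
<a style="margin-top:1.5em" href="/psf/donations/">Donate</a>
<a class="reference" href="/about/quotes">Quotes</a>
<a name="top">No link here</a>
"""

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
```

Collecting the values in a list rather than printing them inside the handler also makes it easier to post-process the links afterwards.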

You can use a regex to distinguish between absolute and relative URLs.
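
One possible way to do that split is sketched below. It assumes that "absolute" means the URL starts with a scheme followed by ://, so under this assumption something like a mailto: address would count as relative. The pattern, the is_absolute helper, and the sample list are illustrative and not taken from the article.

```python
import re

# Treat a URL as absolute when it begins with "<scheme>://",
# where the scheme follows the usual letter/digit/+/./- syntax.
ABSOLUTE_URL = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")


def is_absolute(url):
    """Return True if url starts with a scheme and '://'."""
    return ABSOLUTE_URL.match(url) is not None


# A few values in the style of the crawler output.
links = ["http://docs.python.org/", "/about/", "#content-body", "reference"]

absolute = [u for u in links if is_absolute(u)]
relative = [u for u in links if not is_absolute(u)]

print(absolute)
print(relative)
```

A regex only classifies the string; it does not tell you whether a relative value is a real link, so stray attribute values such as reference still end up in the relative list.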


A solution using libxml2:

    import urllib
    import libxml2

    # url is assumed to hold the address of the page to fetch
    parse_opts = libxml2.HTML_PARSE_RECOVER + \
                 libxml2.HTML_PARSE_NOERROR + \
                 libxml2.HTML_PARSE_NOWARNING

    doc = libxml2.htmlReadDoc(urllib.urlopen(url).read(), '', None, parse_opts)
    print [i.getContent() for i in doc.xpathNewContext().xpathEval("//a/@href")]