scraping - How can I retrieve the page title of a web page using Python?


This is probably overkill for such a simple task, but if you plan to do more than that, it is better to start with these tools (mechanize, BeautifulSoup), because they are much easier to use than the alternatives (urllib to fetch the content, and regexes or some other parser to parse the HTML).

Links: BeautifulSoup, mechanize

#!/usr/bin/env python
# coding: utf-8
# Python 2 code using BeautifulSoup 3 and mechanize
from BeautifulSoup import BeautifulSoup
from mechanize import Browser
# Retrieve the web page content
br = Browser()
res = br.open("https://www.google.com/")
data = res.get_data()
# Parse the content
soup = BeautifulSoup(data)
title = soup.find('title')
# Output the title text
print title.renderContents()

How can I retrieve the page title of a web page (the HTML title tag) using Python?

soup.title.string actually returns a Unicode string. To convert it to a plain ASCII byte string, use string = string.encode('ascii', 'ignore'), which drops any non-ASCII characters.
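To make the effect of encode('ascii', 'ignore') concrete, here is a minimal sketch in Python 3 syntax, where every str is already Unicode. The sample title and the variable names are only illustrative assumptions; the point is that 'ignore' silently loses characters, so a lossless encoding such as UTF-8 may be the better choice when the title has to survive intact.

```python
# Illustrative sample title containing a non-ASCII en dash (U+2013).
title = "Friends (TV Series 1994\u20132004) - IMDb"

# 'ignore' silently drops every character that ASCII cannot represent.
ascii_bytes = title.encode("ascii", "ignore")
print(ascii_bytes)

# 'replace' leaves a visible '?' placeholder instead of dropping the character.
replaced_bytes = title.encode("ascii", "replace")
print(replaced_bytes)

# UTF-8 is lossless: decoding the bytes gives back the original text.
utf8_bytes = title.encode("utf-8")
assert utf8_bytes.decode("utf-8") == title
```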

I would always use lxml for such tasks. You could also use BeautifulSoup.

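import lxml.html
url = "http://example.com/"  # example address (an assumption); parse() accepts a URL or a filename
t = lxml.html.parse(url)
print(t.find(".//title").text)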

Using HTMLParser:

from urllib.request import urlopen
from html.parser import HTMLParser
class TitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.match = False  # True right after a <title> start tag
        self.title = ''
    def handle_starttag(self, tag, attributes):
        # Any start tag other than <title> resets the flag
        self.match = tag == 'title'
    def handle_data(self, data):
        if self.match:
            self.title = data
            self.match = False
url = "http://example.com/"
# Decode the bytes; str() on a bytes object would yield its "b'...'" repr instead
html_string = urlopen(url).read().decode('utf-8', errors='replace')
parser = TitleParser()
parser.feed(html_string)
print(parser.title)  # prints: Example Domain

There is no need for an HTML parsing library. With requests you can fetch the page and slice the title out of the response text using plain string methods.

>>> import requests
>>> headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0'}
>>> n = requests.get('http://www.imdb.com/title/tt0108778/', headers=headers)
>>> al = n.text
>>> al[al.find('<title>') + 7 : al.find('</title>')]
u'Friends (TV Series 1994\u20132004) - IMDb'
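One caveat with the slicing approach: str.find returns -1 when a tag is missing, so the slice quietly yields the wrong text instead of failing. A minimal sketch of a guarded version follows; the helper name slice_title and its default parameter are hypothetical, and it still assumes a lowercase <title> tag without attributes.

```python
def slice_title(html, default=None):
    """Return the text between <title> and </title>, or default if a tag is missing."""
    open_tag = "<title>"
    start = html.find(open_tag)
    if start == -1:
        return default          # no opening tag found
    start += len(open_tag)
    end = html.find("</title>", start)
    if end == -1:
        return default          # opening tag without a closing tag
    return html[start:end].strip()


print(slice_title("<html><head><title> Example Domain </title></head></html>"))
print(slice_title("<html><body>no title here</body></html>", default="No title"))
```

In the session above this would be called as slice_title(n.text).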

Here is a simplified version of @Vinko Vrsalovic's answer:

import urllib2
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen("https://www.google.com"))
print soup.title.string

NOTE:

soup.title finds the first title element anywhere in the HTML document.
title.string assumes the element has only one child node, and that child node is a string.
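Both assumptions in the note can fail: soup.title is None when the document has no title element, and .string is None when the element has several children. A minimal sketch of a defensive helper is shown here, assuming BeautifulSoup 4 (the bs4 package) is installed; the function name page_title and the choice of the built-in "html.parser" backend are illustrative, not part of the original answer.

```python
from bs4 import BeautifulSoup


def page_title(html, default=None):
    """Return the stripped text of the first <title>, or default if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.title              # None when the document has no <title>
    if title_tag is None:
        return default
    # get_text() joins all text inside the tag, so it does not depend on
    # the tag having exactly one string child the way .string does.
    text = title_tag.get_text().strip()
    return text or default


print(page_title("<html><head><title> Example Domain </title></head></html>"))
print(page_title("<html><body><p>x</p></body></html>", default="No title"))
```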

For BeautifulSoup 4.x, use a different import:

from bs4 import BeautifulSoup

The mechanize Browser object has a title() method, so the mechanize example above can be rewritten as follows:

from mechanize import Browser
br = Browser()
br.open("http://www.google.com/")
print br.title()

Using regular expressions:

import re
# raw_html is assumed to hold the page source as a string
match = re.search('<title>(.*?)</title>', raw_html)
title = match.group(1) if match else 'No title'
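The simple pattern misses titles that are written in upper case, carry attributes, or span several lines, and it leaves HTML entities undecoded. The following is a rough sketch of a more tolerant variant using only the standard library; the names TITLE_RE and regex_title are hypothetical, and a regex remains a heuristic rather than a real HTML parser.

```python
import html
import re

# Case-insensitive, allows attributes on the tag, and lets '.' match newlines.
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)


def regex_title(raw_html, default="No title"):
    """Return the first <title> text with entities decoded and whitespace collapsed."""
    match = TITLE_RE.search(raw_html)
    if not match:
        return default
    text = html.unescape(match.group(1))    # e.g. '&amp;' becomes '&'
    return " ".join(text.split()) or default


print(regex_title('<TITLE lang="en">\n  Tom &amp; Jerry\n</TITLE>'))
print(regex_title("<p>nothing here</p>"))
```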