
scraping - Python: Using BeautifulSoup to scrape a table from ESPN




I think you are mistaken. All the data for a team appears to be in the same tr. Here is the first one, with all the styling removed:

<tr>
  <td id="sovrRk_9">1</td>
  <td><a title="Team Li (Royce Li)" href="...">Team Li</a></td>
  <td><spacer type="block" width="1" height="1"> </spacer> </td>
  <td id="tmTotalStat_9_19">.4656</td>
  <td id="tmTotalStat_9_20">.8049</td>
  <td id="tmTotalStat_9_17">437</td>
  <td id="tmTotalStat_9_6">1752</td>
  <td id="tmTotalStat_9_3">962</td>
  <td id="tmTotalStat_9_2">284</td>
  <td id="tmTotalStat_9_1">228</td>
  <td id="tmTotalStat_9_11">578</td>
  <td id="tmTotalStat_9_0">4804</td>
  <td>4-4-1</td>
  <td title="Season Moves">12</td>
</tr>

It is all there.
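To check this, the row above can be parsed on its own. What follows is only a minimal sketch: the variable name `row_html` and the built-in `html.parser` backend are choices made for this illustration, and the `href` stays elided exactly as in the snippet.

```python
from bs4 import BeautifulSoup

# The first team row from the page, styling removed (href left elided).
row_html = """
<tr>
  <td id="sovrRk_9">1</td>
  <td><a title="Team Li (Royce Li)" href="...">Team Li</a></td>
  <td><spacer type="block" width="1" height="1"> </spacer> </td>
  <td id="tmTotalStat_9_19">.4656</td>
  <td id="tmTotalStat_9_20">.8049</td>
  <td id="tmTotalStat_9_17">437</td>
  <td id="tmTotalStat_9_6">1752</td>
  <td id="tmTotalStat_9_3">962</td>
  <td id="tmTotalStat_9_2">284</td>
  <td id="tmTotalStat_9_1">228</td>
  <td id="tmTotalStat_9_11">578</td>
  <td id="tmTotalStat_9_0">4804</td>
  <td>4-4-1</td>
  <td title="Season Moves">12</td>
</tr>
"""

soup = BeautifulSoup(row_html, "html.parser")
row = soup.find("tr")

# Read every cell, strip surrounding whitespace, and drop cells that end up
# empty (the spacer cell contains only whitespace).
cells = [td.get_text(strip=True) for td in row.find_all("td")]
cells = [text for text in cells if text]

print(cells)
```

A single `find_all("td")` on the one `tr` yields the rank, the team name, the nine stat values, the record and the move count, so no second row is needed for the team data.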

I am trying to use BeautifulSoup to scrape the "Season Stats" table on this page. Is there a way to get the whole table into a single soup object? Currently my code looks like this:

seasonStats = soup.find('table', {'id': 'statsTable'})
categoryList = seasonStats.findAll('tr')[2].findAll('a')

The problem I am facing is that FG%, FT%, 3PM, REB, AST, STL, BLK, TO, PTS are stored in one row, but RK, LAST, MOVES are stored in another row. Is there any way to scrape the whole table correctly, so that RK, TEAM, FG%, FT%, 3PM, REB, AST, STL, BLK, TO, PTS, LAST, MOVES are all stored in one row (categoryList)? It seems silly that ESPN even puts these values in different rows. Also, if I could get this whole table into a matrix, that would be a great help.

Desired output:

['RK', 'TEAM', 'FG%', 'FT%', '3PM', 'REB', 'AST', 'STL', 'BLK', 'TO', 'PTS', 'LAST', 'MOVES']
['1', 'Team Li', '.4656', '.8049', '437', '1752', '962', '284', '228', '578', '4804', '4-4-1', '12']
['2', 'Team Aguilar', '.4499', '.7727', '415', '1925', '737', '276', '292', '543', '4901', '4-4-1', '0']
['3', 'Suck MyDirk', '.4533', '.8083', '410', '1798', '1035', '367', '153', '658', '5331', '3-6-0', '8']
['4', 'Knicks Tape', '.4589', '.8057', '339', '1458', '1029', '285', '132', '566', '4304', '4-5-0', '12']
['5', 'Kris Kaman His Pants', '.4576', '.8068', '534', '1530', '940', '306', '115', '515', '4603', '5-4-0', '17']
['6', 'Nutz Vs. Draymond Green', '.4518', '.8000', '404', '1641', '1004', '270', '176', '620', '4554', '5-4-0', '12']
['7', 'Team Keyrouze', '.4548', '.7895', '441', '1558', '809', '293', '195', '510', '4874', '4-5-0', '11']
['8', 'In Porzingod We Trust', '.4607', '.7542', '275', '1699', '1020', '274', '252', '482', '4119', '6-3-0', '13']
['9', 'Team Iannetta', '.4706', '.7908', '260', '1800', '1026', '310', '137', '646', '4909', '8-1-0', '13']
['10', "Jesse's Blue Balls", '.4646', '.6766', '403', '2029', '505', '243', '238', '481', '3929', '5-4-0', '16']
['11', 'Team Pauls 2 da Wall', '.4531', '.7602', '313', '1797', '1197', '313', '268', '525', '3719', '6-3-0', '13']
['12', 'YOU REACH, I TEACH', '.4552', '.7591', '401', '1488', '997', '285', '108', '521', '3694', '4-5-0', '12']
['13', 'Team Noey', '.4740', '.7610', '273', '1821', '681', '301', '226', '491', '4059', '3-6-0', '9']
['14', 'Team Jackson', '.4325', '.7484', '206', '1104', '714', '174', '101', '383', '2532', '1-8-0', '4']

Current output:

['1', 'Team Li', '.4656', '.8049', '437', '1752', '962', '284', '228', '578', '4804', '4-4-1', '12']
['2', 'Team Aguilar', '.4499', '.7727', '415', '1925', '737', '276', '292', '543', '4901', '4-4-1', '0']
['3', 'Suck MyDirk', '.4533', '.8083', '410', '1798', '1035', '367', '153', '658', '5331', '3-6-0', '8']
['4', 'Knicks Tape', '.4589', '.8057', '339', '1458', '1029', '285', '132', '566', '4304', '4-5-0', '12']
['5', 'Kris Kaman His Pants', '.4576', '.8068', '534', '1530', '940', '306', '115', '515', '4603', '5-4-0', '17']
['6', 'Nutz Vs. Draymond Green', '.4518', '.8000', '404', '1641', '1004', '270', '176', '620', '4554', '5-4-0', '12']
['7', 'Team Keyrouze', '.4548', '.7895', '441', '1558', '809', '293', '195', '510', '4874', '4-5-0', '11']
['8', 'In Porzingod We Trust', '.4607', '.7542', '275', '1699', '1020', '274', '252', '482', '4119', '6-3-0', '13']
['9', 'Team Iannetta', '.4706', '.7908', '260', '1800', '1026', '310', '137', '646', '4909', '8-1-0', '13']
['10', "Jesse's Blue Balls", '.4646', '.6766', '403', '2029', '505', '243', '238', '481', '3929', '5-4-0', '16']
['11', 'Team Pauls 2 da Wall', '.4531', '.7602', '313', '1797', '1197', '313', '268', '525', '3719', '6-3-0', '13']
['12', 'YOU REACH, I TEACH', '.4552', '.7591', '401', '1488', '997', '285', '108', '521', '3694', '4-5-0', '12']
['13', 'Team Noey', '.4740', '.7610', '273', '1821', '681', '301', '226', '491', '4059', '3-6-0', '9']
['14', 'Team Jackson', '.4325', '.7484', '206', '1104', '714', '174', '101', '383', '2532', '1-8-0', '4']

Thanks a lot.


import requests, bs4

url = 'http://games.espn.com/fba/standings?leagueId=224165&seasonId=2017'
r = requests.get(url)
soup = bs4.BeautifulSoup(r.text, 'lxml')
table = soup.find(id="statsTable")

# Select the sub-header rows and the team rows, in document order
rows = table.find_all(class_=["tableBody sortableRow", "tableSubHead"])
rows = iter(rows)

# The first two matching rows hold the column labels; skip empty cells
header_1 = [td.text for td in next(rows).find_all('td') if td.text]
header_2 = [td.text for td in next(rows).find_all('td') if td.text]

# Merge them: first two labels, then the stat labels, then the last two labels
header = header_1[:2] + header_2 + header_1[-2:]
print(header)

# Every remaining row holds the data for one team
for row in rows:
    data = [td.text for td in row.find_all('td') if td.text]
    print(data)

out:

['RK', 'TEAM', 'FG%', 'FT%', '3PM', 'REB', 'AST', 'STL', 'BLK', 'TO', 'PTS', 'LAST', 'MOVES']
['1', 'Team Li', '.4656', '.8049', '437', '1752', '962', '284', '228', '578', '4804', '4-4-1', '12']
['2', 'Team Aguilar', '.4499', '.7727', '415', '1925', '737', '276', '292', '543', '4901', '4-4-1', '0']
['3', 'Suck MyDirk', '.4533', '.8083', '410', '1798', '1035', '367', '153', '658', '5331', '3-6-0', '8']
['4', 'Knicks Tape', '.4589', '.8057', '339', '1458', '1029', '285', '132', '566', '4304', '4-5-0', '12']
['5', 'Kris Kaman His Pants', '.4576', '.8068', '534', '1530', '940', '306', '115', '515', '4603', '5-4-0', '17']
['6', 'Nutz Vs. Draymond Green', '.4518', '.8000', '404', '1641', '1004', '270', '176', '620', '4554', '5-4-0', '12']
['7', 'Team Keyrouze', '.4548', '.7895', '441', '1558', '809', '293', '195', '510', '4874', '4-5-0', '11']
['8', 'In Porzingod We Trust', '.4607', '.7542', '275', '1699', '1020', '274', '252', '482', '4119', '6-3-0', '13']
['9', 'Team Iannetta', '.4706', '.7908', '260', '1800', '1026', '310', '137', '646', '4909', '8-1-0', '13']
['10', "Jesse's Blue Balls", '.4646', '.6766', '403', '2029', '505', '243', '238', '481', '3929', '5-4-0', '16']
['11', 'Team Pauls 2 da Wall', '.4531', '.7602', '313', '1797', '1197', '313', '268', '525', '3719', '6-3-0', '13']
['12', 'YOU REACH, I TEACH', '.4552', '.7591', '401', '1488', '997', '285', '108', '521', '3694', '4-5-0', '12']
['13', 'Team Noey', '.4740', '.7610', '273', '1821', '681', '301', '226', '491', '4059', '3-6-0', '9']
['14', 'Team Jackson', '.4325', '.7484', '206', '1104', '714', '174', '101', '383', '2532', '1-8-0', '4']
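The question also asks for the whole table as a matrix. One way to get there is to collect the rows into a list of lists instead of printing them. The sketch below works on plain lists so it runs without network access. The contents of `header_1` are an assumption (the real first header row may hold extra cells between 'TEAM' and 'LAST'), the two data rows are copied from the output above, and the names `data_rows`, `matrix` and `records` are hypothetical.

```python
# Assumed non-empty cells of the two header rows; only the first two and
# last two entries of header_1 matter for the slicing below.
header_1 = ['RK', 'TEAM', 'LAST', 'MOVES']
header_2 = ['FG%', 'FT%', '3PM', 'REB', 'AST', 'STL', 'BLK', 'TO', 'PTS']

# Same merge as in the code above: RK, TEAM + stat labels + LAST, MOVES.
header = header_1[:2] + header_2 + header_1[-2:]

# Two rows copied from the output, standing in for the scraped team rows.
data_rows = [
    ['1', 'Team Li', '.4656', '.8049', '437', '1752', '962', '284',
     '228', '578', '4804', '4-4-1', '12'],
    ['2', 'Team Aguilar', '.4499', '.7727', '415', '1925', '737', '276',
     '292', '543', '4901', '4-4-1', '0'],
]

# The matrix is the merged header followed by one list per team.
matrix = [header] + data_rows

# Optional: one dict per team, keyed by column label.
records = [dict(zip(header, row)) for row in data_rows]

print(matrix[0])
print(records[0]['TEAM'], records[0]['PTS'])
```

In the scraping loop itself, the same result would come from appending each `data` list to a list rather than printing it.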