Parsing HTML with BeautifulSoup fails to find a table

I am trying to parse the data on this website:
http://www.baseball-reference.com/boxes/CHN/CHN201606020.shtml
I want to extract some of the data in the tables, but for some reason I am struggling to find them. For example, this is what I want to do:
from bs4 import BeautifulSoup
import requests
url = 'http://www.baseball-reference.com/boxes/CHN/CHN201606020.shtml'
soup = BeautifulSoup(requests.get(url).text)
soup.find('table', id='ChicagoCubsbatting')
The final line returns nothing even though a table with that id exists in the HTML. Furthermore, len(soup.findAll('table')) returns 1 even though there are many tables on the page. I've tried the 'lxml', 'html.parser' and 'html5lib' parsers. All behave the same way.
What is going on? Why does this not work and what can I do to extract the table?

The tables sit inside HTML comments, so the parser never sees them as tags. Use text = soup.find('div', class_='placeholder').next_sibling.next_sibling to get the comment text, then build a new soup from that text.
In [35]: new_soup = BeautifulSoup(text, 'lxml')
In [36]: new_soup.table
Out[36]:
<table class="teams poptip" data-tip="San Francisco Giants at Atlanta Braves">
<tbody>
<tr class="winner">
<td>SFG</td>
<td class="right">6</td>
<td class="right gamelink">
Final
</td>
</tr>
<tr class="loser">
<td>ATL</td>
<td class="right">0</td>
<td class="right">
</td>
</tr>
</tbody>
</table>
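The answer above relies on sibling navigation from one placeholder div. A more general option is to walk every HTML comment and parse each one separately. The sketch below assumes the page hides its tables inside comments, as that answer implies; the sample markup and the helper name tables_in_comments are made up for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the page: the table lives inside an HTML comment.
html = """
<div class="placeholder"></div>
<!--
<table id="ChicagoCubsbatting"><tr><td>player</td></tr></table>
-->
"""

soup = BeautifulSoup(html, "html.parser")


def tables_in_comments(soup):
    """Return every <table> found inside HTML comments of the parsed document."""
    found = []
    # Comment is a string subclass, so a string filter can pick comments out.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        inner = BeautifulSoup(comment, "html.parser")
        found.extend(inner.find_all("table"))
    return found


tables = tables_in_comments(soup)
visible_tables = soup.find_all("table")  # tables the parser sees directly
print(len(visible_tables), len(tables))
```

On a real page the same helper could be followed by a lookup on the id, for example picking the entry whose "id" attribute matches the table you want.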

Related

Scrape link and name information with Beautiful Soup inside a Python nested loop

I'm trying to scrape data from a website.
The HTML structure looks like this:
<tbody>
<tr id="city_1">
<td class="first">Name_1</td>
<td style="text-align: right;"><span class="text">247 380</span></td>
<td class="hidden-xs"><span class="text">NRW</span></td>
<td class="hidden-xs last"><span class="text">52062</span></td>
</tr>
<tr id="city_1">
<td class="first">Name_2</td>
<td style="text-align: right;"><span class="text">247 380</span></td>
<td class="hidden-xs"><span class="text">NRW</span></td>
<td class="hidden-xs last"><span class="text">52062</span></td>
</tr>
</tbody>
I wrote a nested loop in Python with the Beautiful Soup package to reach the hyperlink that holds the information I need (the link and the name).
My code is below:
import pandas as pd
import requests
from bs4 import BeautifulSoup
#get all the city links of the page
page = requests.get("link")
#print(page)
soup = BeautifulSoup(page.content, "html.parser")
#print(soup)
for x in soup.tbody:
    for y in x:
        for z in y:
            print(z.find('a')) #here the problem.
I don't know how to get the href and the name with soup for every hyperlink in the list.

Try this:
for x in soup.tbody.find_all('td',class_='first'):
    print(x.find('a').get('href'),x.text)
Output:
http://www.aachen.de/ Aachen
http://www.aalen.de/ Aalen
http://www.amberg.de/ Amberg
etc.
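The loop above fails with an AttributeError if a "first" cell has no link, because find('a') returns None. A slightly more defensive sketch follows. The sample markup is hypothetical: it assumes each city name is wrapped in an a tag, which the question's excerpt does not show.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: assumes each "first" cell wraps the city name in a link.
html = """
<table><tbody>
<tr><td class="first"><a href="http://www.aachen.de/">Aachen</a></td></tr>
<tr><td class="first"><a href="http://www.aalen.de/">Aalen</a></td></tr>
<tr><td class="first">No link here</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

cities = []
for cell in soup.find_all("td", class_="first"):
    link = cell.find("a")
    if link is None:
        # Skip cells without a hyperlink instead of raising AttributeError.
        continue
    cities.append((link.get_text(strip=True), link.get("href")))

print(cities)
```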

BeautifulSoup: How to extract text encapsulated in multiple div/span/id tags

I need to extract the digits (0.04) in the "td" tag at the end of this html page.
<div class="boxContentInner">
<table class="values non-zebra">
<thead>
<tr>
<th>Apertura</th>
<th>Max</th>
<th>Min</th>
<th>Variazione giornaliera</th>
<th class="last">Variazione %</th>
</tr>
</thead>
<tbody>
<tr>
<td id="open" class="quaternary-header">2708.46</td>
<td id="high" class="quaternary-header">2710.20</td>
<td id="low" class="quaternary-header">2705.66</td>
<td id="change" class="quaternary-header changeUp">0.99</td>
<td id="percentageChange" class="quaternary-header last changeUp">0.04</td>
</tr>
</tbody>
</table>
</div>
I tried this code using BeautifulSoup with Python 2:
from bs4 import BeautifulSoup
import requests
page= requests.get('https://www.ig.com/au/indices/markets-indices/us-spx-500').text
soup = BeautifulSoup(page, 'lxml')
percent= soup.find('td',{'id':'percentageChange'})
percent2=percent.text
print percent2
The result is None.
Where is the error?

I had a look at https://www.ig.com/au/indices/markets-indices/us-spx-500 and it seems you are not searching for the right element when doing percent= soup.find('td', {'id':'percentageChange'})
The actual value is located in <span data-field="CPC">VALUE</span>
You can retrieve it like this:
percent = soup.find("span", {'data-field': 'CPC'})
print(percent.text.strip())

This worked for me.
percents = soup.find_all("span", {'data-field': 'CPC'})
for percent in percents:
    print(percent.text.strip())
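Both answers match on a data-* attribute rather than on id or class. The sketch below shows that lookup on a small made-up fragment, with a None check so a missing element does not raise. Only the data-field="CPC" span comes from the answers; the rest of the markup is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; only the data-field="CPC" span is taken from the answers.
html = '<div><span data-field="CPC"> 0.04 </span><span data-field="OTHER">1</span></div>'

soup = BeautifulSoup(html, "html.parser")

# Attribute names containing a dash cannot be passed as keyword arguments,
# so they go in the attrs dictionary.
span = soup.find("span", attrs={"data-field": "CPC"})
value = span.get_text(strip=True) if span is not None else None

# The same lookup written as a CSS attribute selector.
selected = soup.select_one('span[data-field="CPC"]')

print(value)
```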

Python scrape specific tag without class name

I'm developing a Python script to scrape data from a specific site.
I'm using the Beautiful Soup module.
The data I'm interested in sits in this structure in the HTML page:
<tbody aria-live="polite" aria-relevant="all">
<tr style="">
<td>
<a href="www.server.com/art/crag">Name</a>
</td>
<td class="nowrap"></td>
<td class="hidden-xs"></td>
</tr>
</tbody>
The tbody tag contains several tr tags, and from each one I would like to take only the first a tag inside a td.
I tried this:
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
a = soup.find(id='tabella_falist')
b = a.find("tbody")
link = [p.attrs['href'] for p in b.select("a")]
But this way the script takes every href in every td tag. How can I take only the first?
Thanks

If I understood correctly you can try this:
from bs4 import BeautifulSoup
import requests
url = 'your_url'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.a)
soup.a will return the first a tag on the page.

This should do the job:
html = '''<html><body><tbody aria-live="polite" aria-relevant="all">
<tr style="">
<td>
<a href="www.server.com/art/crag">GOOD ONE</a>
<a href="www.server.com/art/crag">NOT GOOD ONE</a>
</td>
<td class="nowrap">
GOOD ONE
</td>
<td class="hidden-xs"></td>
</tr>
</tbody></body></html>'''
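from bs4 import BeautifulSoup
# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')
for td in soup.select('td'):
    # find returns only the first a tag inside this td, or None
    a = td.find('a')
    if a is not None:
        print(a.attrs['href'])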
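If the goal is the first link of each row, rather than of the whole page or of every cell, one option is to loop over the tr tags and call find once per row. This is a sketch under that reading of the question. The id tabella_falist comes from the question's code, while the second row and the extra link are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical table: the second link in row one should be ignored.
html = """
<table id="tabella_falist"><tbody>
<tr><td><a href="www.server.com/art/crag">Name</a></td>
    <td class="nowrap"><a href="www.server.com/other">x</a></td></tr>
<tr><td><a href="www.server.com/art/crag2">Name2</a></td>
    <td class="nowrap"></td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
tbody = soup.find(id="tabella_falist").find("tbody")

links = []
for row in tbody.find_all("tr"):
    # href=True skips anchors that have no href attribute at all.
    first = row.find("a", href=True)
    if first is not None:
        links.append(first["href"])

print(links)
```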

Using Python with BeautifulSoup to extract numbers (multiple spans and classes)

I am trying to use Python with BeautifulSoup in order to pull multiple numbers from a web page. I know I am doing something wrong though because my script is returning an empty array. The fact that there are multiple spans and classes confuses me as well. Here is a sample of the HTML data I am working with:
<td class="confluenceTd" colspan="1">
<span>
Autoworks
</span>
</td>
<td class="confluenceTd" colspan="1">
900009
</td>
<td class="confluenceTd" colspan="1">
<p>
uyi: 3456778, 33344778, 11199087
</p>
<p>
PRY: 54675389
</p>
</td>
<td class="confluenceTd" colspan="1">
AutoNone
</td>
<td class="confluenceTd" colspan="1">
9998887
</td>
<td class="confluenceTd" colspan="1">
<p>
YUN: 232323, 6788889, 78695554
</p>
<p>
IOY: 3444666, 2343233, 1232322
</p>
</td>
Here is my Python code:
import requests
from bs4 import BeautifulSoup
s = requests.Session()
s.post('https://wiki.example.com/login', data={'user': "user1", 'password': 'pass1'})
r = s.get('https://wiki.example.com/example/section')
data_payload = r.content
soup = BeautifulSoup(data_payload, 'html.parser')
data = soup.findAll("span", {"class":"confluenceTd"})
print data
Again, I am only trying to pull the actual numbers. Any help would be greatly appreciated. Thanks.

If you want the numbers inside elements of a specific class, search for td rather than span (the confluenceTd class is on the td tags), use a regular expression to pull out the numbers, and make sure requests is actually returning the HTML.
import requests,re
from bs4 import BeautifulSoup
s = requests.Session()
s.post('https://wiki.example.com/login', data={'user':"user1",'password': 'pass1'})
r = s.get('https://wiki.example.com/example/section')
data_payload = r.content
soup = BeautifulSoup(data_payload, 'html.parser')
data = soup.findAll("td", {"class":"confluenceTd"})
for d in data:
    m=re.search('([0-9]+)',str(d.findAll(text=True)))
    if m:
        print(m.group(0))
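Note that re.search stops at the first match, so the loop above prints at most one number per cell. A sketch that collects every number with re.findall is shown below. It runs on a shortened copy of the question's sample HTML wrapped in a table, and it skips the login step, so nothing here says anything about what the real wiki returns.

```python
import re

from bs4 import BeautifulSoup

# Shortened copy of the sample markup from the question, wrapped in a table row.
html = (
    '<table><tr>'
    '<td class="confluenceTd" colspan="1"><span>Autoworks</span></td>'
    '<td class="confluenceTd" colspan="1">900009</td>'
    '<td class="confluenceTd" colspan="1">'
    '<p>uyi: 3456778, 33344778, 11199087</p><p>PRY: 54675389</p></td>'
    '</tr></table>'
)

soup = BeautifulSoup(html, "html.parser")

numbers = []
for cell in soup.find_all("td", class_="confluenceTd"):
    # get_text(" ") joins the text of nested tags with spaces so digits from
    # neighbouring paragraphs cannot run together.
    numbers.extend(re.findall(r"\d+", cell.get_text(" ")))

print(numbers)
```

The values stay as strings; converting them with int is a separate step if arithmetic is needed.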

Get every second URL from HTML in Python

I am struggling with scraping URLs from a website. The HTML code from the website I want to scrape is:
<tr>
<td>
<span>
<table class="search-result-ad-row" cellspacing="3" border="0">
<tbody>
<tr>
<td class="picture" rowspan="2"><a title="3.izbový byt v starom meste na ulici Kpt. Nálepku" href="inzerat/RE0005055-16-000281/3-izbovy-byt-v-starom-meste-na-ulici-kpt-nalepku"><img src="/data/189/RE0005055/ads/195/RE0005055-16-000281/img/thum/37587134.jpeg" alt=""/></a>
</td>
<td class="title" colspan="2"><a title="3.izbový byt v starom meste na ulici Kpt. Nálepku" href="inzerat/RE0005055-16-000281/3-izbovy-byt-v-starom-meste-na-ulici-kpt-nalepku"><h2 style="font-size: inherit;">3.izbový byt v starom meste na ulici Kpt. Nálepku</h2></a>
<span></span>
</td>
</tr>
<tr>
I want to get the href by using this python code:
br = mechanize.Browser()
br.open("http://www.reality.sk/")
br.select_form(nr=0)
br["tabs:scrn243:scrn115:errorTooltip.cityName:cityName"]="poprad"
br.submit()
def hello():
    soup = BeautifulSoup(br.response().read())
    for link in soup.findAll('a'):
        link2 = link.get('href')
        if "inzerat/" in link2:
            print 'http://www.reality.sk/' + link.get('href')
But the problem is that I get 2 results for each URL (because each ad contains two links with the same href). I have tried to scrape using the table tag, the td tag with a class attribute (either "picture" or "title") or even using rowspan (=2), but I am not getting the desired result. I don't know how to make the code work.

I guess you had issues with looking up by the class selector. Also you can chain the tags returned by find - please take a look if this solution helps (I'm not 100% sure if that's what you want to achieve):
for ad_table in soup.find_all('table', class_='search-result-ad-row'):
    print(ad_table.find(class_='picture').find('a').attrs['href'])
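Taking one link per ad table is what removes the duplicates. The sketch below does that and also builds absolute URLs with urljoin instead of string concatenation. The class names and the base URL come from the question; the two ad tables and their shortened hrefs are hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE = "http://www.reality.sk/"

# Hypothetical search result with two ads; each ad repeats its href twice.
html = """
<table class="search-result-ad-row"><tbody><tr>
<td class="picture" rowspan="2"><a href="inzerat/ad-1"><img src="a.jpeg" alt=""/></a></td>
<td class="title" colspan="2"><a href="inzerat/ad-1"><h2>Ad one</h2></a></td>
</tr></tbody></table>
<table class="search-result-ad-row"><tbody><tr>
<td class="picture" rowspan="2"><a href="inzerat/ad-2"><img src="b.jpeg" alt=""/></a></td>
<td class="title" colspan="2"><a href="inzerat/ad-2"><h2>Ad two</h2></a></td>
</tr></tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

urls = []
for ad_table in soup.find_all("table", class_="search-result-ad-row"):
    cell = ad_table.find("td", class_="title")
    link = cell.find("a", href=True) if cell is not None else None
    if link is not None:
        # urljoin resolves the relative href against the site root.
        urls.append(urljoin(BASE, link["href"]))

print(urls)
```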
