Limiting BeautifulSoup output - python

I have been working semi-successfully with BeautifulSoup and Selenium for some weeks now. However I have found myself in a situation I cannot untangle.
I need to extract the html from the first 6 rows or so out of a table. These rows do not share any class, id or similar.
Table structure:
<table class="Table">
<tr class="Table_Header">
<td colspan="2">Some Text</td>
</tr>
<tr>
<td class="Class2">Some Text</td>
<td><span class="Class"></span>Some Text</td>
</tr>
<tr>
<td class="Class2">Some Text</td>
<td>Some Text</td>
</tr>
<tr>
<td class="Class2">Some Text</td>
<td>Some Text</td>
</tr>
<tr class="Class3">
<td class="Class2"> Some Text </td>
<td>Some Text</td>
</tr>
<tr class="Class3">
<td class="Class2">Some Text</td>
<td>Some Text</td>
</tr>
<tr>
<td class="Class2">Some Text</td>
<td> <div class="Class4">Some Text</div>
<div class="Class4">Some Text</div>
</td>
</tr>
The table goes on and on, maintaining this structure but with seemingly random classes popping in and out.
Basically, I need to return the first six tr elements. I have tried several methods, but they return either the entire table or a single tr.
Any ideas?
Thanks in advance!

So you're trying to get the first 6 tr from a table? If I understand the question correctly I had a similar problem where I needed to get the first 400 td. Perhaps the code below would help?
Maybe something like:
i = 0
for row in get_log().findAll('tr'):
    for cell in row.findAll('td'):
        print(cell.text)
        logfile.write('{}\n'.format(cell.text))
        i += 1
        if i == 400:
            break
    if i == 400:
        break  # also leave the outer loop once the limit is reached
Also let me point you at the article I used to solve my own problem, the good stuff is near the end as it assumes you know literally nothing.
https://first-web-scraper.readthedocs.org/en/latest/
EDIT:
Using the table on Boone County as a source:
import requests
from bs4 import BeautifulSoup  # bs4 replaces the old BeautifulSoup 3 package

url = 'http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', attrs={'class': 'collapse shadow BCSDTable'})
i = 0
for row in table.findAll('tr'):
    print(row.prettify())
    i += 1
    print(i)
    if i == 6:
        break  # stop after the first six rows
This outputs a ton of information, so I won't post it. Maybe you want to narrow down what you need from within each tr?
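If only the first few rows are needed, the manual counter can be avoided. The sketch below is one way to do that with bs4, using the limit argument of find_all. The sample table is a made-up stand-in for the one in the question, and slicing the full result with [:6] would work equally well.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the question's table: one header row plus eight data rows.
data_rows = "".join(
    f"<tr><td class='Class2'>Row {n}</td><td>Value {n}</td></tr>"
    for n in range(1, 9)
)
html = (
    "<table class='Table'>"
    "<tr class='Table_Header'><td colspan='2'>Header</td></tr>"
    f"{data_rows}</table>"
)

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="Table")

# limit=6 stops the search after six matching <tr> tags.
first_six = table.find_all("tr", limit=6)

for row in first_six:
    print(row.get_text(" ", strip=True))
```

Each item in first_six is still a full Tag, so row.prettify() or str(row) gives the row's HTML if that is what you need rather than the text.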

Related

How to select <tr> tags inside of a div with a specific CSS attribute via BeautifulSoup?

I want to scrape several columns of text contained in td tags with a common CSS attribute. The td tags sit inside tr tags with a common CSS attribute, inside a table with a specific class, inside a div.
For example, this is exactly how the website is structured.
<div class="stats-table">
<table class="stats_table">
<tbody>
<tr data-row="0">
<td data-stat="games">38</td>
<td data-stat="wins">29</td>
<td data-stat="draws">6</td>
<td data-stat="losses">3</td>
<td data-stat="points">93</td>
</tr>
<tr data-row="1">
<td data-stat="games">38</td>
<td data-stat="wins">28</td>
<td data-stat="draws">8</td>
<td data-stat="losses">2</td>
<td data-stat="points">92</td>
</tr>
.
.
.
<tr data-row="19">
<td data-stat="games">38</td>
<td data-stat="wins">5</td>
<td data-stat="draws">7</td>
<td data-stat="losses">26</td>
<td data-stat="points">22</td>
</tr>
</tbody>
</table>
</div>
I want to get the text enclosed in the td tags.
I tried to solve this problem with the code below:
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
data = soup.select(".stats_table")
all_data = [l.get_text(strip=True) for l in soup.select(".stats_table:has(> [data-row])")]
print(all_data)
But when I try to execute this code, I get an empty list. I need your help on this matter, thanks.
Why didn't your solution work?
The > combinator matches direct children only. The selector .stats_table:has(> [data-row]) therefore selects an element with class stats_table that has a direct child carrying a data-row attribute. In your markup the tr elements with data-row are children of tbody, not of the table itself, so nothing matches. Naming tbody as the parent makes the selector match (the tr in the selector is optional):
soup.select(".stats_table tbody:has(> tr[data-row])")
But this selects the tbody element itself rather than the individual cells, so it won't give you the expected output. To get that, use one of the approaches below.
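To see the direct-child rule in action, here is a small sketch that runs both selectors against a trimmed, hypothetical copy of the table from the question and compares what they match.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical version of the table from the question.
html = """
<div class="stats-table">
<table class="stats_table">
<tbody>
<tr data-row="0"><td data-stat="games">38</td><td data-stat="wins">29</td></tr>
<tr data-row="1"><td data-stat="games">38</td><td data-stat="wins">28</td></tr>
</tbody>
</table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# The table's only element child is <tbody>, so no direct child has data-row.
direct_child = soup.select(".stats_table:has(> [data-row])")

# <tbody> does have <tr data-row> elements as direct children, so it matches.
via_tbody = soup.select(".stats_table tbody:has(> tr[data-row])")

print(len(direct_child))
print([tag.name for tag in via_tbody])
```

The first selector matches nothing, while the second returns the single tbody element.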
Solution
You want the text of every td that carries a data-stat attribute inside the table with class stats_table.
There are two ways to do this.
Using regex
import re
from bs4 import BeautifulSoup
html = '''
<div class="stats-table">
<table class="stats_table">
<tbody>
<tr data-row="0">
<td data-stat="games">38</td>
<td data-stat="wins">29</td>
<td data-stat="draws">6</td>
<td data-stat="losses">3</td>
<td data-stat="points">93</td>
</tr>
<tr data-row="1">
<td data-stat="games">38</td>
<td data-stat="wins">28</td>
<td data-stat="draws">8</td>
<td data-stat="losses">2</td>
<td data-stat="points">92</td>
</tr>
<tr data-row="19">
<td data-stat="games">38</td>
<td data-stat="wins">5</td>
<td data-stat="draws">7</td>
<td data-stat="losses">26</td>
<td data-stat="points">22</td>
</tr>
</tbody>
</table>
</div>
'''
soup = BeautifulSoup(html, "html.parser")
datastats = soup.find_all("td", {"data-stat" : re.compile(r".*")})
for stat in datastats:
    print(stat.text)
which prints the expected values, one per line (condensed here onto a single line):
38 29 6 3 93 38 28 8 2 92 38 5 7 26 22
Using CSS Selector
The selector below picks every element that has a data-stat attribute inside the table with class stats_table. You can optionally qualify it with td, as in "table.stats_table td[data-stat]".
datastats = soup.select("table.stats_table [data-stat]")
for stat in datastats:
    print(stat.text)
which gives us the same output:
38 29 6 3 93 38 28 8 2 92 38 5 7 26 22
You can find more information on CSS selectors in the Beautiful Soup documentation for select().
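If the values need to stay grouped by table row rather than flattened into one sequence, one option is to select the rows first and then the cells within each row. This is a sketch on a trimmed copy of the sample markup. The dictionary layout is just one possible choice, not something the question asks for.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample markup (two rows, three stats each).
html = """
<table class="stats_table">
<tbody>
<tr data-row="0">
<td data-stat="games">38</td><td data-stat="wins">29</td><td data-stat="points">93</td>
</tr>
<tr data-row="1">
<td data-stat="games">38</td><td data-stat="wins">28</td><td data-stat="points">92</td>
</tr>
</tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.select("table.stats_table tr[data-row]"):
    # Map each cell's data-stat name to its text for this row.
    rows.append({td["data-stat"]: td.get_text(strip=True)
                 for td in tr.select("td[data-stat]")})

print(rows)
```

The values come back as strings. Converting them with int() is left out here because real pages may contain empty or non-numeric cells.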

BeautifulSoup how to only return class objects

I have a html document that looks similar to this:
<div class='product'>
<table>
<tr>
random stuff here
</tr>
<tr class='line1'>
<td class='row'>
<span>TEXT I NEED</span>
</td>
</tr>
<tr class='line2'>
<td class='row'>
<span>MORE TEXT I NEED</span>
</td>
</tr>
<tr class='line3'>
<td class='row'>
<span>EVEN MORE TEXT I NEED</span>
</td>
</tr>
</table>
</div>
I have used this code, but it also picks up the text from the first tr, which has no class and which I need to ignore:
soup.findAll('tr').text
Also, when I try to match on just a class, this doesn't seem to be valid Python:
soup.findAll('tr', {'class'})
I would like some help extracting the text.
To get the desired output, use a CSS Selector to exclude the first <tr> tag, and select the rest:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.select('.product tr:not(.product tr:nth-of-type(1))'):
    print(tag.text.strip())
Output:
TEXT I NEED
MORE TEXT I NEED
EVEN MORE TEXT I NEED
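Another way to read the question is "keep only the rows that have a class attribute at all". Assuming that is the intent, the attribute selector tr[class] expresses it directly and does not depend on the unwanted row being first. A sketch on a shortened copy of the sample document:

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document from the question.
html = """
<div class='product'>
<table>
<tr><td>random stuff here</td></tr>
<tr class='line1'><td class='row'><span>TEXT I NEED</span></td></tr>
<tr class='line2'><td class='row'><span>MORE TEXT I NEED</span></td></tr>
</table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# tr[class] matches only <tr> tags that carry a class attribute.
texts = [tr.get_text(strip=True) for tr in soup.select(".product tr[class]")]
print(texts)
```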

BeautifulSoup: How to extract text encapsulated in multiple div/span/id tags

I need to extract the digits (0.04) in the "td" tag at the end of this html page.
<div class="boxContentInner">
<table class="values non-zebra">
<thead>
<tr>
<th>Apertura</th>
<th>Max</th>
<th>Min</th>
<th>Variazione giornaliera</th>
<th class="last">Variazione %</th>
</tr>
</thead>
<tbody>
<tr>
<td id="open" class="quaternary-header">2708.46</td>
<td id="high" class="quaternary-header">2710.20</td>
<td id="low" class="quaternary-header">2705.66</td>
<td id="change" class="quaternary-header changeUp">0.99</td>
<td id="percentageChange" class="quaternary-header last changeUp">0.04</td>
</tr>
</tbody>
</table>
</div>
I tried this code using BeautifulSoup with Python 2.8:
from bs4 import BeautifulSoup
import requests
page= requests.get('https://www.ig.com/au/indices/markets-indices/us-spx-500').text
soup = BeautifulSoup(page, 'lxml')
percent= soup.find('td',{'id':'percentageChange'})
percent2=percent.text
print percent2
The result is NONE.
Where is the error?
I had a look at https://www.ig.com/au/indices/markets-indices/us-spx-500 and it seems you are not searching for the right id when doing percent= soup.find('td', {'id':'percentageChange'})
The actual value is located in <span data-field="CPC">VALUE</span>
You can retrieve this information with the below:
percent = soup.find("span", {'data-field': 'CPC'})
print(percent.text.strip())
This worked for me:
percents = soup.find_all("span", {'data-field': 'CPC'})
for percent in percents:
    print(percent.text.strip())
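A related point: find returns None when nothing matches, so reading .text from the result fails as soon as the page does not contain the element you expect. The sketch below guards against that. The one-line HTML is a hypothetical stand-in for the real page, and text_or_none is a helper name made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; the markup of the real page may differ.
html = '<div><span data-field="CPC"> 0.04 </span></div>'
soup = BeautifulSoup(html, "html.parser")


def text_or_none(tag):
    """Return the stripped text of a tag, or None when find() found nothing."""
    return tag.get_text(strip=True) if tag is not None else None


# The td with id="percentageChange" is absent here, so find() returns None.
missing = text_or_none(soup.find("td", {"id": "percentageChange"}))
found = text_or_none(soup.find("span", {"data-field": "CPC"}))

print(missing, found)
```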

When webscraping with python and Beautifulsoup adding a class to a findAll() search brings back zero results?

I'm new to coding in general. To be brief I am using the soup.findAll('table') function and it brings back all the tables on the web page. When I search soup.findAll('table', class_='playerTable rtable') it brings back []. I know that that is the correct class name as I copied it from the HTML. Do you guys know why this might be happening? What am I missing here?
The URL I'm attempting to scrape is http://www.spotrac.com/nfl/denver-broncos/peyton-manning-5028/
The reason you guys don't see the same table as me is that you need to be signed in to a paid account to access the information. My question still stands: why might this be happening, when I know there is a table with the class I am searching for? Thanks so much for the help guys!
I don't see any class named "playerTable rtable" for the link you provided. Maybe you can try this and let me know if it was what you needed. Happy to delete/change my answer if it doesn't work out for you:
>>> r = BeautifulSoup(requests.get("http://www.spotrac.com/nfl/denver-broncos/peyton-manning-5028/").content, "lxml")
>>> r.findAll("table", attrs = {"class":"playerTable"})
[<table class="playerTable">
<tbody>
<tr>
<td class="contract-type">
<div>
<h2>
<span class="contract-type-logo"><img alt="Team contract signed with" src="http://d1dglpr230r57l.cloudfront.net/images/thumb/broncos.png"/></span>
<span class="contract-type-years">2016-2016 <small>Dead Money</small></span>
</h2>
</div>
</td>
</tr>
</tbody>
</table>, <table class="playerTable">
<tbody>
<tr>
<td style="padding-right:5px;">
<table class="salaryTable rtable current">
<thead>
<tr class="salaryRow">
<th class="header center">Year</th>
<th class="header center"> </th>
<th class="header salaryAmt center "><span>Base Salary</span></th>
<th class="header salaryAmt center"><span title="">Signing Bonus</span></th> <th class="header salaryAmt center"><span>Workout Bonus</span></th> <th class="header salaryAmt center"><span title="">Restruc. Bonus</span></th> <th class="header salaryAmt center"><span>Dead Cap Hit</span></th>
</tr>
</thead>
<tbody>
<tr class="salaryRow">
<td class="salaryYear center">2016</td>
<td class="salaryYear center"><img alt="Player contract details by year" src="http://d1dglpr230r57l.cloudfront.net/images/thumb/broncos.png"/></td>
<td class="salaryAmt ">-</td>
<td class="salaryAmt ">-</td> <td class="salaryAmt ">-</td> <td class="salaryAmt ">$2,500,000</td> <td class="salaryAmt ">$2,500,000</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>]
This happens because no table on this page has both classes playerTable and rtable. A multi-word value such as class_='playerTable rtable' is compared with the full class attribute string, so it only finds tables whose class attribute is exactly "playerTable rtable". No such table exists here, hence the empty list.
EDIT: In the end, the main reason for this behaviour was that the HTML was fetched with an unauthenticated request, so the response contained no table with the specified classes.
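The different ways of matching classes can be compared side by side. This is a sketch on hypothetical markup (the third table deliberately lists its classes in reverse order), not on the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: note the third table lists its classes in reverse order.
html = """
<table class="playerTable"></table>
<table class="salaryTable rtable current"></table>
<table class="rtable playerTable"></table>
"""
soup = BeautifulSoup(html, "html.parser")

# A single class name matches any table that has that class.
single = soup.find_all("table", class_="playerTable")

# A multi-word string is compared with the whole attribute value, in order.
exact = soup.find_all("table", class_="playerTable rtable")

# A CSS selector requires both classes but ignores their order.
both = soup.select("table.playerTable.rtable")

print(len(single), len(exact), len(both))
```

If the goal is "has both classes, in any order", the CSS selector form is usually the safer choice.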

Is there any way to check if a given XPath is valid in Python?

I have Python code that extracts some information from a table, but sometimes the XPath changes. Right now it only switches between two different XPaths. One looks like this:
//*[@id='content-primary']/table[3]/tbody/tr[td[1]/span/span/
and the other differs slightly in the table index:
//*[@id='content-primary']/table[2]/tbody/tr[td[1]/span/span/
This is the code I am using right now to get the information I need:
rows_xpath = XPath("//*[@id='content-primary']/table[3]/tbody/tr[td[1]/span/span//text()='%s']" % (date))
So what I want to do is check whether the given XPath is valid. If it is not, I just try the other XPath alternative.
Hope somebody can help me with this problem. Thank you all.
EDIT1
<table class="clCommonGrid" cellspacing="0">
<thead>
<tr>
<td colspan="3">Kommande matcher</td>
</tr>
<tr>
<th style="width:1%;">Tid</th>
<th style="width:69%;">Match</th>
<th style="width:30%;">Arena</th>
</tr>
</thead>
<tfoot>
<tr>
<td colspan="3">
<dl>
<dt class="clNotify">Röd text</dt>
<dd> = Ändrad matchtid </dd>
<dt><img src="http://svenskfotboll.se/i/u/alert.gif" alt="Röda utropstecknet" /></dt>
<dd> = Peka på utropstecknet så visas en notering </dd>
<dt><img src="http://svenskfotboll.se/i/widget.gif" alt="Widget" /></dt>
<dd>Hämta widget för kommande matcher</dd>
</dl>
</td>
</tr>
</tfoot>
<tbody class="clGrid">
<tr class="clTrOdd">
<td nowrap="nowrap" class="no-line-through">
<span class="matchTid"><span>2015-04-17<!-- br ok --> 19:15</span></span> //This is the date i am checking with first
</td>
<td>Götene IF - Vårgårda IK </td> // The other information that i need from the table later
<td>Sparbanksvallen Götene konstgräs </td>
</tr>
In my situation I did not need to specify which table to extract the information from. The rows I want are identified by the date, which only appears in that one table, so I just used this code and it worked out fine for me:
rows_xpath = XPath("//*[@id='content-primary']/table/tbody/tr[td[1]/span/span//text()='%s']" % (date))
Now it is just table, which means it searches both tables on the page. It may not be the cleanest solution, but it works for me.
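For the original idea of trying one XPath and falling back to another, here is a minimal sketch. It assumes the XPath class in the question is lxml.etree.XPath, and it uses a made-up two-table page; first_matching_rows is a hypothetical helper name. Note that a syntactically valid XPath that matches nothing simply returns an empty list, so the check is really "did this expression find any rows".

```python
from lxml import etree, html as lxml_html

# Hypothetical page with two tables; only the second holds the wanted date.
markup = """<div id="content-primary">
<table><tbody><tr><td><span><span>2015-04-16 18:00</span></span></td><td>Other match</td></tr></tbody></table>
<table><tbody><tr><td><span><span>2015-04-17<!-- br ok --> 19:15</span></span></td><td>Home - Away</td></tr></tbody></table>
</div>"""
tree = lxml_html.fromstring(markup)


def first_matching_rows(tree, expressions):
    """Return (expression, rows) for the first XPath that yields any rows."""
    for expression in expressions:
        try:
            finder = etree.XPath(expression)
        except etree.XPathSyntaxError:
            continue  # malformed expression: skip it and try the next one
        rows = finder(tree)
        if rows:
            return expression, rows
    return None, []


date = "2015-04-17"
template = "//*[@id='content-primary']/table[%d]/tbody/tr[td[1]/span/span//text()='%s']"
# Try table[3] first, then fall back to table[2].
candidates = [template % (index, date) for index in (3, 2)]

used, rows = first_matching_rows(tree, candidates)
print(used)
print(rows[0][1].text)
```

On this sample page the table[3] expression finds nothing, so the helper falls back to table[2].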
