Using Beautiful Soup v4, I have some td elements, some of which contain a child a element.
<tr class="">
<td class="tblimg"><img alt="" src="/blah/deficon.png"/></td>
<td><b>file.mp3</b><br/><span
style="color: grey;">76.33 MB<br/>33129 Downloads<br/>55:34 Mins<br/>192kbps Stereo</span>
</td>
</tr>
Is there a good way to find only those td that have a child a? Currently, I'm iterating over all of them and discarding the ones for which td.find("a") doesn't exist.
Although you already have the answer, I would like to provide another solution for your reference:)
from simplified_scrapy import SimplifiedDoc
html = '''<table><tr class="">
<td class="tblimg"><img alt="" src="/blah/deficon.png"/></td>
<td><b>file.mp3</b><br/><span
style="color: grey;">76.33 MB<br/>33129 Downloads<br/>55:34 Mins<br/>192kbps Stereo</span>
</td>
</tr></table>
'''
doc = SimplifiedDoc(html) # create doc
# First get all a in the table, and then take the parent of a. All the data can be retrieved at one time.
tds = doc.selects('table>a').parent
print (tds)
Result:
[{'tag': 'td', 'html': '<b>file.mp3</b><br /><span style="color: grey;">76.33 MB<br />33129 Downloads<br />55:34 Mins<br />192kbps Stereo</span>\n '}]
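For comparison, here is a minimal sketch of the same filtering in Beautiful Soup itself. The sample HTML is made up for illustration (the snippet in the question contains no a element), and the variable names are arbitrary. The first option filters in Python; the second relies on the :has() pseudo-class, which the soupsieve selector engine bundled with recent bs4 releases supports.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: only the second cell has an <a> as a direct child.
html = """<table><tr>
<td class="tblimg"><img alt="" src="/blah/deficon.png"/></td>
<td><a href="/files/file.mp3">file.mp3</a></td>
<td><b>no link here</b></td>
</tr></table>"""

soup = BeautifulSoup(html, "html.parser")

# Option 1: keep each <td> that has an <a> as a direct child.
tds_with_link = [
    td for td in soup.find_all("td")
    if td.find("a", recursive=False) is not None
]

# Option 2: let the CSS selector do the filtering.
tds_css = soup.select("td:has(> a)")

for td in tds_with_link:
    print(td)
```

Dropping recursive=False (or the > in the selector) would also accept cells where the a is nested deeper, which is what td.find("a") in the question does.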
I usually use selenium but figured I would give bs4 a shot!
I am trying to find a specific piece of text on the website; in the example below I want the last value, 189305014.
<div class="info_container">
<div id="profile_photo">
<img src="https://pbs.twimg.com/profile_images/882103883610427393/vLTiH3uR_reasonably_small.jpg" />
</div>
<table class="profile_info">
<tr>
<td class="left_column">
<p>Twitter User ID:</p>
</td>
<td>
<p>189305014</p>
</td>
</tr>
Here is the script I am using -
TwitterID = soup.find('td',attrs={'class':'left_column'}).text
This returns
Twitter User ID:
You can search for the <p> tag that follows the one containing "Twitter User ID:":
from bs4 import BeautifulSoup
txt = '''<div class="info_container">
<div id="profile_photo">
<img src="https://pbs.twimg.com/profile_images/882103883610427393/vLTiH3uR_reasonably_small.jpg" />
</div>
<table class="profile_info">
<tr>
<td class="left_column">
<p>Twitter User ID:</p>
</td>
<td>
<p>189305014</p>
</td>
</tr>
'''
soup = BeautifulSoup(txt, 'html.parser')
print(soup.find('p', text='Twitter User ID:').find_next('p'))
Prints:
<p>189305014</p>
Or last <p> element inside class="profile_info":
print(soup.select('.profile_info p')[-1])
Or first sibling to class="left_column":
print(soup.select_one('.left_column + *').text)
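Another variant, sketched under the assumption that the value always sits in the td right after the label cell: start from the same find call used in the question and step sideways with find_next_sibling. The HTML is a trimmed copy of the snippet above, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

txt = """<table class="profile_info">
<tr>
<td class="left_column"><p>Twitter User ID:</p></td>
<td><p>189305014</p></td>
</tr>
</table>"""

soup = BeautifulSoup(txt, "html.parser")

# Start from the label cell that the question already locates...
label_td = soup.find("td", attrs={"class": "left_column"})
# ...then move to the next <td> in the same row.
value_td = label_td.find_next_sibling("td")

# strip=True drops the whitespace around the number.
twitter_id = value_td.get_text(strip=True)
print(twitter_id)
```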
Use the following code to get the desired output:
TwitterID = soup.find('td',attrs={'class': None}).text
To only get the digits from the second <p> tag, you can filter if the string isdigit():
from bs4 import BeautifulSoup
html = """<div class="info_container">
<div id="profile_photo">
<img src="https://pbs.twimg.com/profile_images/882103883610427393/vLTiH3uR_reasonably_small.jpg" />
</div>
<table class="profile_info">
<tr>
<td class="left_column">
<p>Twitter User ID:</p>
</td>
<td>
<p>189305014</p>
</td>
</tr>"""
soup = BeautifulSoup(html, 'html.parser')
result = ''.join(
[t for t in soup.find('div', class_='info_container').text if t.isdigit()]
)
print(result)
Output:
189305014
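Note that joining every digit in the container would also pick up digits from any other text inside it. A slightly narrower sketch is to look for a <p> whose whole string is numeric. The extra "Joined 2010" paragraph below is a made-up addition to show the difference; it is not part of the original page.

```python
import re
from bs4 import BeautifulSoup

html = """<div class="info_container">
<p>Joined 2010</p>
<table class="profile_info"><tr>
<td class="left_column"><p>Twitter User ID:</p></td>
<td><p>189305014</p></td>
</tr></table>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# Match only a <p> whose entire string consists of digits.
id_tag = soup.find("p", string=re.compile(r"^\d+$"))
result = id_tag.string if id_tag is not None else None
print(result)
```

With this input the character-by-character isdigit() filter would return 2010189305014, while the pattern match returns only the ID.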
How to find next td of a td with a span in it?
html_text = """
<tr class="someClass">
<td> </td>
<td>A normal string</td>
<td class="someClass">10</td>
<td class="someClass">11</td>
<td class="someClass">12</td>
<td> </td>
</tr>
<tr class="someClass">
<td> </td>
<td>Non normal string <span style="font-size:10px">(with span)</span></td>
<td class="someClass">2 000</td>
<td class="someClass">2 100</td>
<td class="someClass">2 150</td>
<td> </td>
</tr>
"""
To get the td after the td with "A normal string" in it, I would simply find it with:
a_normal_string = str(soup.find("td", text="A normal string").find_next('td'))
a_normal_string = re.findall(r'\d+', a_normal_string)
print a_normal_string #['10']
However, in the second tr, where I need to find the td after the td with the non-normal string, the above method does not work. So how do I deal with a td containing spans?
My first thought was to match it with a compiled regex, a_nonnormal_string = str(soup.find("td", text=re.compile(r'A non normal string')).find_next('td')), but this does not work either.
This is just an example with two trs, but the actual website has hundreds of trs.
One option would be to solve it with a searching function, using get_text() to check the text against a desired string (note that get_text() returns the complete text of an element including its child elements, but .string does not - it would be None if there are child elements - this is actually the reason why your second approach does not work):
tds = soup.find_all(lambda tag: tag.name == "td" and "normal string" in tag.get_text())
for td in tds:
    a_normal_string = td.find_next('td').get_text()
    print(a_normal_string)
Prints:
10
2 000
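The difference between .string and get_text() mentioned above can be seen directly in a small sketch. The HTML is a shortened version of the snippet in the question, and the names plain and mixed are just illustrative.

```python
from bs4 import BeautifulSoup

html_text = """<table><tr>
<td>A normal string</td><td class="someClass">10</td>
</tr><tr>
<td>Non normal string <span style="font-size:10px">(with span)</span></td><td class="someClass">2 000</td>
</tr></table>"""

soup = BeautifulSoup(html_text, "html.parser")

# The two label cells are the ones without a class attribute.
plain, mixed = [td for td in soup.find_all("td") if not td.has_attr("class")]

print(repr(plain.string))  # the cell's only child is a string
print(repr(mixed.string))  # None: the cell has a string plus a <span>
print(mixed.get_text())    # text of the cell including the <span>
```

Since the text= filter compares against .string, the second cell can never match it, whatever the pattern.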
I am trying to extract some simple fields from an HTML page. It is a table with some repetitive data.
Every record has a FIRST_NAME (and a bunch of other stuff) but not every record has a WEBSITE. So my xpath solution was returning 10 names but only 9 website urls.
fname= tree.xpath('//span[@class="given-name"]/text()')
fweb = tree.xpath('//a[@class="url"]/text()')
Using that method I can't tell which record is missing the url.
So now I want to divide the file into chunks; each chunk would start with the span class given-name and end right before the next given-name.
How do I do that? In my code, I have a loop that keeps returning the first instance of span class given-name; it doesn't progress through the HTML file.
with open('sample A.htm') as f:
    soup = bs4.BeautifulSoup(f.read())
many_names= soup.find_all('span',class_='given-name')
print len(many_names)
for i in range(len(many_names)):
    first_name = soup.find('span', class_='given-name').text
    website = soup.find('a', class_='url').text
    myprint (i, first_name, last_name, aco, city, qm, website)
    soup.find_next('span', class_='given-name')
The last statement (find_next) doesn't seem to do anything.
With or without it, it's just a loop that reads from the beginning over and over again. What is the right way to do this?
EDIT: sample from HTML file (I edited some out because there is a lot more)
Physically, the layout is span given-name blah blah blah URL buried in there somewhere, then another span given-name
</div>
<div class="connections-list cn-list-body cn-clear" id="cn-list-body">
<div class="cn-list-section-head" id="cn-char-A"></div><div class="cn-list-row-alternate vcard individual art-literary-agents celebrity-nonfiction-literary-agents chick-lit-fiction-literary-agents commercial-fiction-literary-agents fiction-literary-agents film-entertainment-literary-agents history-nonfiction-literary-agents literary-fiction-literary-agents military-war-literary-agents multicultural-nonfiction-literary-agents multicultural-fiction-literary-agents music-literary-agents new-york-literary-agents-ny nonfiction-literary-agents photography-literary-agents pop-culture-literary-agents religion-nonfiction-literary-agents short-story-collection-literary-agents spirituality-literary-agents sports-nonfiction-literary-agents usa-literary-agents womens-issues-literary-agents" id="richard-abate" data-entry-type="individual" data-entry-id="19337" data-entry-slug="richard-abate"><div id="entry-id-193375501ffd6551a6" class="cn-entry">
<table border="0px" bordercolor="#E3E3E3" cellspacing="0px" cellpadding="0px">
<tr>
<td align="left" width="55%" valign="top">
<span class="cn-image-style"><span style="display: block; max-width: 100%; width: 125px"><img height="125" width="125" sizes="100vw" class="cn-image logo" alt="Logo for Richard Abate" title="Logo for Richard Abate" srcset="http://literaryagencies.com/wp-content/uploads/connections-images/richard-abate/richard-abate-literary-agent_logo_1-7bbdb1a0dbafe8417e994150608c55e4.jpg 1x" /></span></span>
</td>
<td align="right" valign="top" style="text-align: right;">
<div style="clear:both; margin: 5px 5px;">
<div style="margin-bottom: 5px;">
<span class="fn n"> <span class="given-name">Richard</span> <span class="family-name">Abate</span> </span>
<span class="title">3 Arts Entertainment</span>
<span class="org"><span class="organization-unit">Query method(s): Postal Mail *</span></span>
</div>
<span class="address-block">
<span class="adr"><span class="address-name">Work</span> <span class="street-address">16 West 22th St</span> <span class="locality">New York</span> <span class="region">NY</span> <span class="postal-code">10010</span> <span class="country-name">USA</span><span class="type" style="display: none;">work</span></span>
</span>
</div>
</td>
</tr>
<tr>
<td valign="bottom" style="text-align: left;">
<a class="cn-note-anchor toggle-div" id="note-anchor-193375501ffd6551a6" href="#" data-uuid="193375501ffd6551a6" data-div-id="note-block-193375501ffd6551a6" data-str-show="Show Notes" data-str-hide="Close Notes">Show Notes</a> | <a class="cn-bio-anchor toggle-div" id="bio-anchor-193375501ffd6551a6" href="#" data-uuid="193375501ffd6551a6" data-div-id="bio-block-193375501ffd6551a6" data-str-show="Show Bio" data-str-hide="Close Bio">Show Bio</a>
</td>
<td align="right" valign="bottom" style="text-align: right;">
<a class="url" href="http://www.3arts.com" target="new" rel="nofollow">http://www.3arts.com</a>
<span class="cn-image-style"><span style="display: block; max-width: 100%; width: 125px"><img height="125" width="125" sizes="100vw" class="cn-image logo" alt="Logo for Andree Abecassis" title="Logo for Andree Abecassis" srcset="http://literaryagencies.com/wp-content/uploads/connections-images/andree-abecassis/andree-abecassis-literary-agent_logo_1-b531cbac02864497b301e74bc6b37aa9.jpg 1x" /></span></span>
</td>
<td align="right" valign="top" style="text-align: right;">
<div style="clear:both; margin: 5px 5px;">
<div style="margin-bottom: 5px;">
<span class="fn n"> <span class="given-name">Andree</span> <span class="family-name">Abecassis</span> </span>
I'm pretty sure that, assuming you've properly copied and pasted your code, the last statement does not give you a SyntaxError as you say; rather, it will give you an AttributeError, because you've misspelled the method name findNext, calling it find_next instead for some mysterious reason. In general, copy and paste your traceback rather than trying to "paraphrase" it.
However, since you already have a list of all the spans with the relevant class, simplest is to change your second loop to search within each of them:
for i, a_span in enumerate(many_names):
    first_name = a_span.text
    website = a_span.find('a', class_='url')
    if website is None:
        website = '*MISSING*'
    else:
        website = website.text
    last_name = aco = city = qm = 'YOU NEVER EXTRACT THESE!!!'
    myprint(i, first_name, last_name, aco, city, qm, website)
assuming you have indeed defined a function myprint with all of these parameters.
You'll note I've set four variables to remind you that you never extract these values -- I suspect you'll want to fix that, right?-)
EDIT: as it now appears the relation between the tags being sought is not in the HTML's structure, but a fragile dependence on the mere sequence of the tags' occurrence in the HTML text, a very different approach is required. Here's a possibility:
from bs4 import BeautifulSoup
with open('ha.txt') as f:
    soup = BeautifulSoup(f)
def tag_of_interest(t):
    if t.name=='a': return t.attrs.get('class')==['url']
    if t.name=='span': return t.attrs.get('class')==['given-name']
    return False
for t in soup.find_all(tag_of_interest):
    print(t)
E.g., when I save in ha.txt the HTML snippet now given in the question after an edit, this script emits:
<span class="given-name">Richard</span>
<a class="url" href="http://www.3arts.com" rel="nofollow" target="new">http://www.3arts.com</a>
<span class="given-name">Andree</span>
So what now remains is to appropriately group the relevant sequence of tags (which I think will also include others, such as the spans with class family-name, etc.). A class seems appropriate (and functionality such as myprint could neatly be recast as methods of the class, but I'll skip that part).
class Entity(object):
    def __init__(self):
        self.first_name = self.last_name = self.website = None  # &c
entities = []
for t in soup.find_all(tag_of_interest):
    # a given-name span starts a new entity
    if t.name=='span' and t.get('class')==['given-name']:
        ent = Entity()
        ent.first_name = t.text
        entities.append(ent)
    else:
        if not entities:
            print('tag', t, 'out of context')
            continue
        # any other tag of interest belongs to the most recent entity
        ent = entities[-1]
        if t.name=='a' and t.get('class')==['url']:
            ent.website = t.text
        # etc for other tags of interest
In the end, the entities list can be examined for entities missing mandatory bits of data, and so forth.
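If each record does turn out to sit in its own container element, the grouping can be left to the HTML structure instead. The sketch below assumes one div with class vcard per record, which the truncated sample hints at but does not confirm; the HTML here is a made-up reduction, and records is just an illustrative name.

```python
from bs4 import BeautifulSoup

# Hypothetical reduction of the page: one container per record,
# and the second record has no website link.
html = """
<div class="vcard"><span class="given-name">Richard</span>
  <a class="url" href="http://www.3arts.com">http://www.3arts.com</a></div>
<div class="vcard"><span class="given-name">Andree</span></div>
"""

soup = BeautifulSoup(html, "html.parser")

records = []
for card in soup.select("div.vcard"):
    # Search inside this record only, so a missing field stays missing
    # instead of being borrowed from a neighbouring record.
    name = card.select_one("span.given-name")
    link = card.select_one("a.url")
    records.append((
        name.get_text(strip=True) if name is not None else None,
        link.get_text(strip=True) if link is not None else None,
    ))

print(records)
```

Searching from card rather than from soup is what keeps each name paired with its own URL.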
I have the following HTML code:
<td class="image">
<a href="/target/tt0111161/" title="Target Text 1">
<img alt="target img" height="74" src="img src url" title="image title" width="54"/>
</a>
</td>
<td class="title">
<span class="wlb_wrapper" data-caller-name="search" data-size="small" data-tconst="tt0111161">
</span>
<a href="/target/tt0111161/">
Other Text
</a>
<span class="year_type">
(2013)
</span>
I am trying to use beautiful soup to parse certain elements into a tab-delimited file.
I got some great help and have:
for td in soup.select('td.title'):
    span = td.select('span.wlb_wrapper')
    if span:
        print span[0].get('data-tconst') # To get `tt0082971`
Now I want to get "Target Text 1" .
I've tried some things modeled on the code above, such as:
for td in soup.select('td.image'): #trying to select the <td class="image"> tag
    img = td.select('a.title') #from inside td I now try to look inside the a tag that also has the word title
    if img:
        print img[2].get('title') #if it finds anything, then I want to return the text in class 'title'
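If you're trying to get a different td based on its class (i.e. td class="image" and td class="title"), you can index a Beautiful Soup tag like a dictionary to read its class attribute.
The code below walks every row of the table, printing the title of each link inside td class="image" and the data-tconst of each span class="wlb_wrapper" inside td class="title".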
from bs4 import BeautifulSoup
page = """
<table>
<tr>
<td class="image">
<a href="/target/tt0111161/" title="Target Text 1">
<img alt="target img" height="74" src="img src url" title="image title" width="54"/>
</a>
</td>
<td class="title">
<span class="wlb_wrapper" data-caller-name="search" data-size="small" data-tconst="tt0111161">
</span>
<a href="/target/tt0111161/">
Other Text
</a>
<span class="year_type">
(2013)
</span>
</td>
</tr>
</table>
"""
soup = BeautifulSoup(page, 'html.parser')
tbl = soup.find('table')
rows = tbl.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    for col in cols:
        # the class attribute is a list, so compare its first entry
        if col.has_attr('class') and col['class'][0] == 'image':
            hrefs = col.find_all('a')
            for href in hrefs:
                print(href.get('title'))
        elif col.has_attr('class') and col['class'][0] == 'title':
            spans = col.find_all('span')
            for span in spans:
                if span.has_attr('class') and span['class'][0] == 'wlb_wrapper':
                    print(span.get('data-tconst'))
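A shorter sketch of the same extraction with CSS selectors, aimed at the tab-delimited output the question mentions. It assumes both cells of a record share one tr, as in the table above; the HTML is a condensed copy of that table, and lines is an illustrative name. In the question's attempt, a.title looks for an a with class "title", whereas a[title] looks for an a that has a title attribute.

```python
from bs4 import BeautifulSoup

page = """<table><tr>
<td class="image"><a href="/target/tt0111161/" title="Target Text 1"><img alt="target img" src="img src url" title="image title"/></a></td>
<td class="title"><span class="wlb_wrapper" data-tconst="tt0111161"></span><a href="/target/tt0111161/">Other Text</a></td>
</tr></table>"""

soup = BeautifulSoup(page, "html.parser")

lines = []
for row in soup.select("tr"):
    # a[title] selects on the attribute; a.title would select on the class.
    link = row.select_one("td.image a[title]")
    span = row.select_one("td.title span.wlb_wrapper")
    if link is not None and span is not None:
        lines.append(f"{span['data-tconst']}\t{link['title']}")

print("\n".join(lines))
```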
span.wlb_wrapper is a selector that matches <span class="wlb_wrapper" data-caller-name="search" data-size="small" data-tconst="tt0111161">. Refer to the documentation on CSS selectors for more information.
In your Python code, change span = td.select('span.wlb_wrapper') to span = td.select('span'), and also try span = td.select('span.year_type'), and see what each returns.
If you try the above and examine what span holds, you will get what you want.
I'm trying to extract td tags that are nested in a tr tag, but the identifier that I'm using to find the correct value is nested in another td within the same tr tag.
Specifically, I'm using the website LoLKing and trying to scrape it for statistics based on a name, for example, Ahri.
The HTML is:
<tr>
<td data-sorttype="string" data-sortval="Ahri" style="text-align: left;">
<div style="display: table-cell;">
<div class="champion-list-icon" style="background:url(//lkimg.zamimg.com/shared/riot/images/champions/103_32.png)">
<a style="display: inline-block; width: 28px; height: 28px;" href="/champions/ahri"></a>
</div>
</div>
<div style="display: table-cell; vertical-align: middle; padding-top: 3px; padding-left: 5px;">Ahri</div>
</td>
<td style="text-align: center;" data-sortval="975"><img src='//lkimg.zamimg.com/images/rp_logo.png' width='18' class='champion-price-icon'>975</td>
<td style="text-align: center;" data-sortval="6300"><img src='//lkimg.zamimg.com/images/ip_logo.png' width='18' class='champion-price-icon'>6300</td>
<td style="text-align: center;" data-sortval="10.98">10.98%</td>
<td style="text-align: center;" data-sortval="48.44">48.44%</td>
<td style="text-align: center;" data-sortval="18.85">18.85%</td>
<td style="text-align: center;" data-sorttype="string" data-sortval="Middle Lane">Middle Lane</td>
<td style="text-align: center;" data-sortval="1323849600">12/14/2011</td>
</tr>
I'm having problems extracting the statistics, which are nested in td tags outside of the data-sortval. I imagine that I want to pull ALL the tr tags, but I don't know how to pull the tr tag based off of the one that contains the td tag with data-sortval="Ahri". At that point, I would want to step through the tr tag x times until I reach the first statistic I want, 10.98
At the moment, I'm trying to do a find for the td with data-sortval Ahri, but it doesn't return the rest of the tr.
It might be important to note that all of this is nested inside of a larger tag:
<table class="clientsort champion-list" width="100%" cellspacing="0" cellpadding="0">
<thead>
<tr><th>Champion</th><th>RP Cost</th><th>IP Cost</th><th>Popularity</th><th>Win Rate</th><th>Ban Rate</th><th>Meta</th><th>Released</th></tr>
</thead>
<tbody>
I apologize for the lack of clarity, I'm new with this scraping terminology, but I hope that makes enough sense.
Right now, I'm also doing:
main = soup.find('table', {'class':'clientsort champion-list'})
To get only that table
edit:
I typed this for the variable:
for champ in champs:
    a = str(champ)
    print type(a) is str
    td_name = soup.find('td',{"data-sortval":a})
It confirms that a is a string.
But it throws this error:
File "lolrec.py", line 82, in StatScrape
tr = td_name.parent
AttributeError: 'NoneType' object has no attribute 'parent'
GO LOL!
For commercial purposes, please read the terms of service before scraping.
(1) To scrape a given list of heroes, you can do the following, which follows logic similar to what you described.
from bs4 import BeautifulSoup
from urllib.request import urlopen
html = urlopen('http://www.lolking.net/champions/')
soup = BeautifulSoup(html)
# locate the cell that contains each hero name, e.g. Ahri
hero_list = ["Blitzcrank", "Ahri", "Akali"]
for hero in hero_list:
    td_name = soup.find('td', {"data-sortval":hero})
    # the enclosing row holds the statistics cells
    tr = td_name.parent
    popularity = tr.find_all('td', recursive=False)[3].text
    print(hero, popularity)
Output
Blitzcrank 12.58%
Ahri 10.98%
Akali 7.52%
(2) To scrape all the heroes.
from bs4 import BeautifulSoup
from urllib.request import urlopen
html = urlopen('http://www.lolking.net/champions/')
soup = BeautifulSoup(html)
# find the table first
table = soup.find('table', {"class":"clientsort champion-list"})
# find all the rows
for row in table.find('tbody').find_all("tr", recursive=False):
    cols = row.find_all("td")
    hero = cols[0].text.strip()
    popularity = cols[3].text
    print(hero, popularity)
Output:
Aatrox 6.86%
Ahri 10.98%
Akali 7.52%
Alistar 4.9%
Amumu 8.75%
...
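The AttributeError in the question means soup.find returned None for that name, so it may help to guard the lookup. Below is a minimal offline sketch: the HTML is a cut-down copy of the row from the question, and the helper name popularity is made up for illustration.

```python
from bs4 import BeautifulSoup

html = """<table class="clientsort champion-list"><tbody>
<tr>
<td data-sorttype="string" data-sortval="Ahri"><div>Ahri</div></td>
<td data-sortval="975">975</td>
<td data-sortval="6300">6300</td>
<td data-sortval="10.98">10.98%</td>
</tr>
</tbody></table>"""

soup = BeautifulSoup(html, "html.parser")

def popularity(soup, champ):
    """Return the popularity cell text for champ, or None if no row matches."""
    td_name = soup.find("td", {"data-sortval": champ})
    if td_name is None:
        # No cell carries this name; report it instead of crashing on .parent.
        return None
    cells = td_name.find_parent("tr").find_all("td", recursive=False)
    return cells[3].get_text(strip=True)

print(popularity(soup, "Ahri"))
print(popularity(soup, "Teemo"))
```

A None result for a name that should exist usually points to a spelling mismatch with the data-sortval value or to the table not being present in the downloaded HTML.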