I am trying to extract some simple fields from an HTML page. It is a table with some repetitive data.
Every record has a FIRST_NAME (and a bunch of other stuff) but not every record has a WEBSITE. So my xpath solution was returning 10 names but only 9 website urls.
fname = tree.xpath('//span[@class="given-name"]/text()')
fweb = tree.xpath('//a[@class="url"]/text()')
Using that method I can't tell which record is missing the url.
So now I want to divide the file into chunks; each chunk would start with the span class GIVEN-NAME and end right before the next GIVEN-NAME.
How do I do that? In my code, I have a loop that keeps returning the first instance of the span with class given-name; it doesn't progress through the HTML file.
with open('sample A.htm') as f:
    soup = bs4.BeautifulSoup(f.read())
many_names= soup.find_all('span',class_='given-name')
print len(many_names)
for i in range(len(many_names)):
    first_name = soup.find('span', class_='given-name').text
    website = soup.find('a', class_='url').text
    myprint (i, first_name, last_name, aco, city, qm, website)
    soup.find_next('span', class_='given-name')
The last statement (find_next) doesn't seem to do anything.
With or without it, it's just a loop that reads from the beginning over and over again. What is the right way to do this?
EDIT: sample from HTML file (I edited some out because there is a lot more)
Physically, the layout is span given-name blah blah blah URL buried in there somewhere, then another span given-name
</div>
<div class="connections-list cn-list-body cn-clear" id="cn-list-body">
<div class="cn-list-section-head" id="cn-char-A"></div><div class="cn-list-row-alternate vcard individual art-literary-agents celebrity-nonfiction-literary-agents chick-lit-fiction-literary-agents commercial-fiction-literary-agents fiction-literary-agents film-entertainment-literary-agents history-nonfiction-literary-agents literary-fiction-literary-agents military-war-literary-agents multicultural-nonfiction-literary-agents multicultural-fiction-literary-agents music-literary-agents new-york-literary-agents-ny nonfiction-literary-agents photography-literary-agents pop-culture-literary-agents religion-nonfiction-literary-agents short-story-collection-literary-agents spirituality-literary-agents sports-nonfiction-literary-agents usa-literary-agents womens-issues-literary-agents" id="richard-abate" data-entry-type="individual" data-entry-id="19337" data-entry-slug="richard-abate"><div id="entry-id-193375501ffd6551a6" class="cn-entry">
<table border="0px" bordercolor="#E3E3E3" cellspacing="0px" cellpadding="0px">
<tr>
<td align="left" width="55%" valign="top">
<span class="cn-image-style"><span style="display: block; max-width: 100%; width: 125px"><img height="125" width="125" sizes="100vw" class="cn-image logo" alt="Logo for Richard Abate" title="Logo for Richard Abate" srcset="http://literaryagencies.com/wp-content/uploads/connections-images/richard-abate/richard-abate-literary-agent_logo_1-7bbdb1a0dbafe8417e994150608c55e4.jpg 1x" /></span></span>
</td>
<td align="right" valign="top" style="text-align: right;">
<div style="clear:both; margin: 5px 5px;">
<div style="margin-bottom: 5px;">
<span class="fn n"> <span class="given-name">Richard</span> <span class="family-name">Abate</span> </span>
<span class="title">3 Arts Entertainment</span>
<span class="org"><span class="organization-unit">Query method(s): Postal Mail *</span></span>
</div>
<span class="address-block">
<span class="adr"><span class="address-name">Work</span> <span class="street-address">16 West 22th St</span> <span class="locality">New York</span> <span class="region">NY</span> <span class="postal-code">10010</span> <span class="country-name">USA</span><span class="type" style="display: none;">work</span></span>
</span>
</div>
</td>
</tr>
<tr>
<td valign="bottom" style="text-align: left;">
<a class="cn-note-anchor toggle-div" id="note-anchor-193375501ffd6551a6" href="#" data-uuid="193375501ffd6551a6" data-div-id="note-block-193375501ffd6551a6" data-str-show="Show Notes" data-str-hide="Close Notes">Show Notes</a> | <a class="cn-bio-anchor toggle-div" id="bio-anchor-193375501ffd6551a6" href="#" data-uuid="193375501ffd6551a6" data-div-id="bio-block-193375501ffd6551a6" data-str-show="Show Bio" data-str-hide="Close Bio">Show Bio</a>
</td>
<td align="right" valign="bottom" style="text-align: right;">
<a class="url" href="http://www.3arts.com" target="new" rel="nofollow">http://www.3arts.com</a>
<span class="cn-image-style"><span style="display: block; max-width: 100%; width: 125px"><img height="125" width="125" sizes="100vw" class="cn-image logo" alt="Logo for Andree Abecassis" title="Logo for Andree Abecassis" srcset="http://literaryagencies.com/wp-content/uploads/connections-images/andree-abecassis/andree-abecassis-literary-agent_logo_1-b531cbac02864497b301e74bc6b37aa9.jpg 1x" /></span></span>
</td>
<td align="right" valign="top" style="text-align: right;">
<div style="clear:both; margin: 5px 5px;">
<div style="margin-bottom: 5px;">
<span class="fn n"> <span class="given-name">Andree</span> <span class="family-name">Abecassis</span> </span>
I'm pretty sure it's not the case, assuming you've properly copied and pasted your code, that the last statement gives you a SyntaxError as you say; rather, it will give you an AttributeError, because you've misspelled the method name findNext, calling it find_next instead for some mysterious reason. In general, copy and paste your traceback rather than trying to "paraphrase" it.
However, since you already have a list of all the spans with the relevant class, simplest is to change your second loop to search within each of them:
for i, a_span in enumerate(many_names):
    first_name = a_span.text
    website = a_span.find('a', class_='url')
    if website is None:
        website = '*MISSING*'
    else:
        website = website.text
    last_name = aco = city = qm = 'YOU NEVER EXTRACT THESE!!!'
    myprint(i, first_name, last_name, aco, city, qm, website)
assuming you have indeed defined a function myprint with all of these parameters.
You'll note I've set four variables to remind you that you never extract these values -- I suspect you'll want to fix that, right?-)
EDIT: as it now appears the relation between the tags being sought is not in the HTML's structure, but a fragile dependence on the mere sequence of the tags' occurrence in the HTML text, a very different approach is required. Here's a possibility:
from bs4 import BeautifulSoup
with open('ha.txt') as f:
    soup = BeautifulSoup(f)
def tag_of_interest(t):
    if t.name=='a': return t.attrs.get('class')==['url']
    if t.name=='span': return t.attrs.get('class')==['given-name']
    return False
for t in soup.find_all(tag_of_interest):
    print(t)
E.g, when I save in ha.txt the HTML snippet now given in the Q after an edit, this script emits:
<span class="given-name">Richard</span>
<a class="url" href="http://www.3arts.com" rel="nofollow" target="new">http://www.3arts.com</a>
<span class="given-name">Andree</span>
So what now remains is to appropriately group the relevant sequence of tags (which I think will also include others, such as the spans with class family-name, etc.). A class seems appropriate (and functionality such as myprint could neatly be recast as methods of the class, but I'll skip that part).
class Entity(object):
    def __init__(self):
        self.first_name = self.last_name = self.website = None  # &c
entities = []
for t in soup.find_all(tag_of_interest):
    # `class` is a Python keyword, so read the attribute with .get('class')
    if t.name == 'span' and t.get('class') == ['given-name']:
        # a given-name span starts a new entity
        ent = Entity()
        ent.first_name = t.text
        entities.append(ent)
    else:
        if not entities:
            print('tag %s out of context' % t)
            continue
        # any other tag of interest belongs to the most recent entity
        ent = entities[-1]
        if t.name == 'a' and t.get('class') == ['url']:
            ent.website = t.text
        # etc for other tags of interest
In the end, the entities list can be examined for entities missing mandatory bits of data, and so forth.
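As a self-contained illustration of the grouping idea above, here is a minimal sketch. The inline HTML is a made-up sample (two records, the second without a URL), and the helper name group_records and the dict layout are assumptions for illustration, not part of the original code.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two records in document order, the second has no URL.
html = """
<span class="given-name">Richard</span> <span class="family-name">Abate</span>
<a class="url" href="http://www.3arts.com">http://www.3arts.com</a>
<span class="given-name">Andree</span> <span class="family-name">Abecassis</span>
"""

def tag_of_interest(t):
    # Match <span class="given-name"> and <a class="url"> tags only.
    classes = t.get('class') or []
    return ((t.name == 'span' and 'given-name' in classes)
            or (t.name == 'a' and 'url' in classes))

def group_records(soup):
    # Walk the matching tags in document order; each given-name starts a record,
    # and a following url is attached to the most recent record.
    records = []
    for t in soup.find_all(tag_of_interest):
        if t.name == 'span':
            records.append({'first_name': t.get_text(strip=True), 'website': None})
        elif records:
            records[-1]['website'] = t.get_text(strip=True)
    return records

records = group_records(BeautifulSoup(html, 'html.parser'))
for rec in records:
    print(rec['first_name'], rec['website'] or '*MISSING*')
```

Records whose website is still None after the loop are the ones missing a URL.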
from bs4 import BeautifulSoup
html_content = """<div id="formContents" class="dformDisplay ">
<div class="sectionDiv expanded">
<table id="sect_s1" class="formSection LabelsAbove">
<tr class="formRow ">
<td id="tdl_8" class="label lc" >
<label class="fieldLabel " ><b >Address</b></label>
<table class="EmailFieldPadder" border="0" cellspacing="0" cellpadding="0" valign="top" style="width:98%;margin-top:.3em;margin-right:1.5em;">
<tr><td class="EmailDivWrapper" style="background-color:#f5f5f5;padding: 0.83em;border-radius:3px;margin:0;border:0px;">
<div id="tdf_8" class="cell cc" >
<a
href="https://maps.google.com/?q=1183+Pelham+Wood+Dr%2C+Rock+Hill%2C+SC+29732">1183
Pelham Wood Dr, Rock Hill, SC 29732</a>
</span></div>
</td></tr></table>
</td>
"""
try:
    soup = BeautifulSoup(html_content, 'html.parser')
    form_data = soup.find("div",{"id":"formContents"})
    if form_data:
        section_data = soup.findAll("div",{"class":"sectionDiv expanded"})
        for datas in section_data:
            labels = datas.findAll("label",{"class":"fieldLabel"})
            for item in labels:
                labels = item.text
                print(labels)
                entity_data = item.findAll("td").text
                print(entity_data)
except Exception as e:
    print(e)
My required output:
Address : 1183 Pelham Wood Dr, Rock Hill, SC 29732.
Is there any way to get this particular output using BeautifulSoup? I need to extract the address from this HTML source.
In newer code, avoid the old findAll() syntax; use find_all() or select() with CSS selectors instead. For more, take a minute to check the docs.
You could select all <td> elements that contain a <label> and use stripped_strings to extract the contents. In case the goal is the same as in "How to scrape data from the website which is not aligned properly", you could get a nicely structured dict of label and text:
dict(e.stripped_strings for e in soup.select('#formContents td:has(label)'))
or this if it is close to the requirements from How to extract the data from the html content:
dict((e.text,e.find_next('td').get_text(strip=True)) for e in soup.select('label'))
Example
from bs4 import BeautifulSoup
html_content = """<div id="formContents" class="dformDisplay ">
<div class="sectionDiv expanded">
<div class="Title expanded ToggleSection shead"
style="margin-top:1em"
id="sect_s11Header">
<div><!--The div around the table is so that the toggling can be animated smoothly-->
<table id="sect_s1" class="formSection LabelsAbove">
<tr class="formRow ">
<td id="tdl_8" class="label lc" >
<label class="fieldLabel " ><b >Address</b></label>
<table class="EmailFieldPadder" border="0" cellspacing="0" cellpadding="0" valign="top" style="width:98%;margin-top:.3em;margin-right:1.5em;">
<tr><td class="EmailDivWrapper" style="background-color:#f5f5f5;padding: 0.83em;border-radius:3px;margin:0;border:0px;">
<div id="tdf_8" class="cell cc" >
<a
href="https://maps.google.com/?q=1183+Pelham+Wood+Dr%2C+Rock+Hill%2C+SC+29732">1183
Pelham Wood Dr, Rock Hill, SC 29732</a>
</span></div>
</td></tr></table>
</td>
"""
soup = BeautifulSoup(html_content)
dict(e.stripped_strings for e in soup.select('#formContents td:has(label)'))
Output
{'Address': '1183\nPelham Wood Dr, Rock Hill, SC 29732'}
You can search for an <a> tag whose href starts with https://maps.google.com:
>>> soup.find('a', {'href': re.compile('^https://maps.google.com')}).text.replace('\n', ' ')
'1183 Pelham Wood Dr, Rock Hill, SC 29732'
The important thing here is not the soup object used but the strategy with a regexp to extract the address text from the tag.
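A self-contained sketch of that strategy follows. The HTML is a trimmed copy of the snippet from the question, and collapsing whitespace with split()/join() is a choice made here so that the line break inside the link text does not survive.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the relevant part of the question's HTML.
html = '''<div id="tdf_8" class="cell cc">
<a
href="https://maps.google.com/?q=1183+Pelham+Wood+Dr%2C+Rock+Hill%2C+SC+29732">1183
Pelham Wood Dr, Rock Hill, SC 29732</a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# Find the first <a> whose href starts with the Google Maps URL.
link = soup.find('a', href=re.compile(r'^https://maps\.google\.com'))

# Collapse all runs of whitespace (including the newline) into single spaces.
address = ' '.join(link.get_text().split()) if link else None
print(address)
```

The `if link else None` guard keeps the script from raising an AttributeError when no matching link is present.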
When I try your code, it prints
Address
ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
You should take note of the second line, because item.findAll("td").text will always raise an error; you can instead do something like '\n'.join([td.text for td in item.findAll("td")]), which should not raise any error.
However, it will return only an empty string [as item.findAll("td") is an empty ResultSet] because with for item in labels....item.findAll("td")..., you're looking for td tags inside the label tags when they're actually in a table tag next to the label.
Solution 1: Using .find_next_siblings
soup = BeautifulSoup(html_content, 'html.parser')
form_data = soup.find("div",{"id":"formContents"})
if form_data:
    section_data = soup.find_all("div",{"class":"sectionDiv expanded"})
    for datas in section_data:
        labels = datas.find_all("label",{"class":"fieldLabel"})
        for item in labels:
            print(item.text) ## label
            for nxtTable in item.find_next_siblings('table'):
                print('\n'.join([td.text for td in nxtTable.find_all("td")]))
                break ## [ only takes the first table ]
[Like this, you shouldn't need the try...except either.]
And [for me] that printed
Address
1183
Pelham Wood Dr, Rock Hill, SC 29732
Solution 2: Using .select with CSS selectors
soup = BeautifulSoup(html_content, 'html.parser')
section_sel = 'div#formContents div.sectionDiv.expanded'
label_sel = 'label.fieldLabel'
for datas in soup.select(f'{section_sel}:has({label_sel}+table td)'):
    labels = datas.select(f'{label_sel}:has(+table td)')
    labels = [' '.join(l.get_text(' ').split()) for l in labels]
    entity_data = [' '.join([
        ' '.join(td.get_text(' ').split()) for td in ed.select('td')
    ]) for ed in datas.select(f'{label_sel}+table:has(td)')]
    # data_dict = dict(zip(labels, entity_data))
    for l, ed in zip(labels, entity_data): print(f'{l}: {ed}')
And that should print
Address: 1183 Pelham Wood Dr, Rock Hill, SC 29732
Btw, dict(zip(labels, entity_data)) would have returned {'Address': '1183 Pelham Wood Dr, Rock Hill, SC 29732'}, and I've used ' '.join(td.get_text(' ').split()) instead of just td.text (and same with l in labels) to minimize whitespace and get everything in one line.
Note: Both solutions become less reliable unless each label is for exactly one table; the second solution assumes that the table is directly adjacent to the label (and will skip any labels without an adjacent table with td tags); and the first solution risks taking a table from the next label if a label is missing a table after it.
Using Beautiful Soup v4, I've some td elements, some of which contain a child a element.
<tr class="">
<td class="tblimg"><img alt="" src="/blah/deficon.png"/></td>
<td><b>file.mp3</b><br/><span
style="color: grey;">76.33 MB<br/>33129 Downloads<br/>55:34 Mins<br/>192kbps Stereo</span>
</td>
</tr>
Is there a good way to find only those td that have a child a? Currently, I'm iterating over all of them and discarding the ones for which td.find("a") doesn't exist.
Although you already have the answer, I would like to provide another solution for your reference:)
from simplified_scrapy import SimplifiedDoc
html = '''<table><tr class="">
<td class="tblimg"><img alt="" src="/blah/deficon.png"/></td>
<td><b>file.mp3</b><br/><span
style="color: grey;">76.33 MB<br/>33129 Downloads<br/>55:34 Mins<br/>192kbps Stereo</span>
</td>
</tr></table>
'''
doc = SimplifiedDoc(html) # create doc
# First get all a in the table, and then take the parent of a. All the data can be retrieved at one time.
tds = doc.selects('table>a').parent
print (tds)
Result:
[{'tag': 'td', 'html': '<b>file.mp3</b><br /><span style="color: grey;">76.33 MB<br />33129 Downloads<br />55:34 Mins<br />192kbps Stereo</span>\n '}]
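For comparison, the same filtering can be sketched with Beautiful Soup itself. This assumes a bs4 version that ships the soupsieve selector engine (4.7 or later) for the :has() pseudo-class; the sample HTML is hypothetical, and the /files/file.mp3 link is made up so that one td really has a child a.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: only the second <td> has an <a> as a direct child.
html = '''<table><tr>
<td class="tblimg"><img alt="" src="/blah/deficon.png"/></td>
<td><a href="/files/file.mp3"><b>file.mp3</b></a><br/><span>76.33 MB</span></td>
</tr></table>'''

soup = BeautifulSoup(html, 'html.parser')

# CSS route: <td> elements that have an <a> as a direct child.
tds_css = soup.select('td:has(> a)')

# Plain-Python route: same filter, without relying on :has() support.
tds_loop = [td for td in soup.find_all('td') if td.find('a', recursive=False)]

for td in tds_css:
    print(td.a['href'])
```

Dropping the `>` (that is, `td:has(a)`) would match any descendant a rather than only direct children.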
I need to click on the check box in the HTML table after asserting the text. Below is the html.
<div class="k-widget k-grid ">
<div class="k-grid-header" style="padding: 0px 16px 0px 0px;">
<div class="k-grid-header-wrap">
<table>
<colgroup>
<col width="50px">
<col>
<col>
</colgroup>
<thead>
<tr>
<th aria-sort="" colspan="1" rowspan="1" class="k-header checkbox-grid-column"><input id="c3c07f7e-5119-4a36-9f67-98fa4d21fa07" type="checkbox" class="k-checkbox"><label class="k-checkbox-label" for="c3c07f7e-5119-4a36-9f67-98fa4d21fa07"></label></th>
<th aria-sort="" colspan="1" rowspan="1" class="k-header"><a class="k-link">Permission</a></th>
<th aria-sort="" colspan="1" rowspan="1" class="k-header"><a class="k-link">Description</a></th>
</tr>
</thead>
</table>
</div>
</div>
<div class="k-grid-container">
<div class="k-grid-content k-virtual-content">
<div style="position: relative;">
<table tabindex="-1" class="k-grid-table">
<colgroup>
<col width="50px">
<col>
<col>
</colgroup>
<tbody>
<tr class="k-master-row">
<td colspan="1" class="checkbox-grid-column"><input id="c8711bab-702a-43b9-8a75-02ad06a8baa3" type="checkbox" class="k-checkbox"><label class="k-checkbox-label" for="c8711bab-702a-43b9-8a75-02ad06a8baa3"></label></td>
<td>ACCESSGROUP_BULKDELETE</td>
<td colspan="1">Enable Bulk Delete button in access group management</td>
</tr>
<tr class="k-master-row k-alt">
<td colspan="1" class="checkbox-grid-column"><input id="a029bc1e-53d8-4328-89ce-6640363d515a" type="checkbox" class="k-checkbox"><label class="k-checkbox-label" for="a029bc1e-53d8-4328-89ce-6640363d515a"></label></td>
<td>ACCESSGROUP_DELETE</td>
<td colspan="1">Enable Delete Button in the access group details page</td>
</tr>
<tr class="k-master-row">
<td colspan="1" class="checkbox-grid-column"><input id="645f8474-9840-48e2-a02c-112178aaf5e2" type="checkbox" class="k-checkbox"><label class="k-checkbox-label" for="645f8474-9840-48e2-a02c-112178aaf5e2"></label></td>
<td>ACCESSGROUP_NEW</td>
I was able to get text from the TR with the code mentioned
table_id = context.driver.find_elements(By.XPATH, '//*[@id="root"]//table//tbody//tr//td[1]')
print (table_id)
# get all of the rows in the table
#rows = table_id.find_elements(By.TAG_NAME, "tr")
#for row in rows:
#permission = row.find_elements(By.TAG_NAME, 'td')[1]
#print (permission.text)
But I need to iterate through and find exact text and then click the check box.
The locator that you want is an XPath because XPath lets you find an element based on contained text.
//tr[./td[.='ACCESSGROUP_BULKDELETE']]//input
^ find a TR
     ^ that has a child TD that contains the desired text
                                      ^ then find the descendant INPUT of that TR
You can replace the 'ACCESSGROUP_BULKDELETE' text with whichever label you want in the table. I would take this a step further and put this into a method and pass in the label text as a parameter so you can make it reusable.
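A minimal sketch of such a reusable helper is shown here. The function name checkbox_xpath is hypothetical, and the commented-out Selenium call assumes a driver object and the usual By import that are not defined in this snippet.

```python
def checkbox_xpath(label_text):
    """Build the XPath for the checkbox <input> in the row whose cell text equals label_text."""
    # The label is embedded in single quotes, so reject labels containing one
    # rather than producing a broken expression.
    if "'" in label_text:
        raise ValueError("label_text must not contain a single quote")
    return "//tr[./td[.='{}']]//input".format(label_text)

xpath = checkbox_xpath("ACCESSGROUP_BULKDELETE")
print(xpath)

# With Selenium this would be used roughly as follows (assumes `driver` exists
# and `from selenium.webdriver.common.by import By` has been run):
# driver.find_element(By.XPATH, xpath).click()
```

Keeping the string building separate from the click makes the locator easy to check on its own before a browser is involved.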
You can easily click on the associated check box with respect to the desired text e.g. ACCESSGROUP_BULKDELETE, ACCESSGROUP_DELETE, ACCESSGROUP_NEW, etc writing a function as follows:
def click_item(item_text):
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//td[.='" + item_text + "']//preceding::td[1]//label[@for]"))).click()
Now you can call the function passing any of the labels as a string arguments as follows:
click_item("ACCESSGROUP_BULKDELETE")
# or
click_item("ACCESSGROUP_DELETE")
# or
click_item("ACCESSGROUP_NEW")
If you want to locate the relevant checkbox which corresponds to some given text use the following expression:
//td[text()='ACCESSGROUP_BULKDELETE']/preceding-sibling::td/input
In the above expression
preceding-sibling is XPath Axis which matches all "siblings" - DOM elements having the same parent and appearing above the current element.
text() is the XPath Function which returns the text content of the current node
Below is the code in Java; you can adapt it to Python. Hope this helps.
// assumes `driver` is a Selenium WebDriver; needs org.openqa.selenium.By, org.openqa.selenium.WebElement and java.util.List
String rowXpath = "//*[@id='root']//table//tbody//tr[INDEX]"; // XPath template for one row
List<WebElement> rows = driver.findElements(By.xpath("//*[@id='root']//table//tbody//tr")); // all rows, used for the count
for (int i = 1; i <= rows.size(); i++) {
    String currRowXpath = rowXpath.replace("INDEX", String.valueOf(i)); // XPath of the current row
    String textXpath = currRowXpath + "/td[2]"; // 2nd td of the current row holds the text
    String chkBoxXpath = currRowXpath + "/td[1]"; // 1st td of the current row holds the checkbox
    String currentText = driver.findElement(By.xpath(textXpath)).getText(); // text in the 2nd td of the current row
    if (currentText.equals("ACCESSGROUP_BULKDELETE")) { // compare with the text you are interested in
        driver.findElement(By.xpath(chkBoxXpath)).click(); // click the 1st td (checkbox) of the matching row
    }
}
I have the following HTML code:
<td class="image">
<a href="/target/tt0111161/" title="Target Text 1">
<img alt="target img" height="74" src="img src url" title="image title" width="54"/>
</a>
</td>
<td class="title">
<span class="wlb_wrapper" data-caller-name="search" data-size="small" data-tconst="tt0111161">
</span>
<a href="/target/tt0111161/">
Other Text
</a>
<span class="year_type">
(2013)
</span>
I am trying to use beautiful soup to parse certain elements into a tab-delimited file.
I got some great help and have:
for td in soup.select('td.title'):
    span = td.select('span.wlb_wrapper')
    if span:
        print span[0].get('data-tconst') # To get `tt0082971`
Now I want to get "Target Text 1" .
I've tried some things like the above text such as:
for td in soup.select('td.image'): #trying to select the <td class="image"> tag
    img = td.select('a.title') #from inside td I now try to look inside the a tag that also has the word title
    if img:
        print img[2].get('title') #if it finds anything, then I want to return the text in class 'title'
If you're trying to get a different td based on the class (i.e. td class="image" and td class="title"), you can use Beautiful Soup tags like dictionaries to check the different classes.
This will find all the td class="image" in the table.
from bs4 import BeautifulSoup
page = """
<table>
<tr>
<td class="image">
<a href="/target/tt0111161/" title="Target Text 1">
<img alt="target img" height="74" src="img src url" title="image title" width="54"/>
</a>
</td>
<td class="title">
<span class="wlb_wrapper" data-caller-name="search" data-size="small" data-tconst="tt0111161">
</span>
<a href="/target/tt0111161/">
Other Text
</a>
<span class="year_type">
(2013)
</span>
</td>
</tr>
</table>
"""
soup = BeautifulSoup(page)
tbl = soup.find('table')
rows = tbl.findAll('tr')
for row in rows:
    cols = row.find_all('td')
    for col in cols:
        if col.has_attr('class') and col['class'][0] == 'image':
            hrefs = col.find_all('a')
            for href in hrefs:
                print href.get('title')
        elif col.has_attr('class') and col['class'][0] == 'title':
            spans = col.find_all('span')
            for span in spans:
                if span.has_attr('class') and span['class'][0] == 'wlb_wrapper':
                    print span.get('data-tconst')
span.wlb_wrapper is a selector used to select <span class="wlb_wrapper" data-caller-name="search" data-size="small" data-tconst="tt0111161">. Refer this & this for more information on selectors
Change span = td.select('span.wlb_wrapper') in your Python code to span = td.select('span'), and also try span = td.select('span.year_type'), and see what each returns.
If you try the above and inspect what span holds, you will get what you want.
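One possible way to reach "Target Text 1" directly is an attribute selector. In this sketch the page string is a trimmed copy of the question's HTML. Note that title there is an attribute of the a tag, not a class, which is why a.title (a class selector) finds nothing while a[title] does.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's HTML.
page = '''<table><tr>
<td class="image">
<a href="/target/tt0111161/" title="Target Text 1">
<img alt="target img" height="74" src="img src url" title="image title" width="54"/>
</a>
</td>
<td class="title">
<span class="wlb_wrapper" data-tconst="tt0111161"></span>
<a href="/target/tt0111161/">Other Text</a>
</td>
</tr></table>'''

soup = BeautifulSoup(page, 'html.parser')

# 'a[title]' matches <a> tags that carry a title attribute, inside <td class="image"> only.
titles = [a['title'] for a in soup.select('td.image a[title]')]
print(titles)
```

The img tag also has a title attribute, but it is not selected because the selector is restricted to a elements.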
I'm trying to get tags that are nested in a tr tag, but the identifier that I'm using to find the correct value is nested in another td within the tr tag.
That is, I'm using the website LoLKing
And trying to scrape it for statistics based on a name, for example, Ahri.
The HTML is:
<tr>
<td data-sorttype="string" data-sortval="Ahri" style="text-align: left;">
<div style="display: table-cell;">
<div class="champion-list-icon" style="background:url(//lkimg.zamimg.com/shared/riot/images/champions/103_32.png)">
<a style="display: inline-block; width: 28px; height: 28px;" href="/champions/ahri"></a>
</div>
</div>
<div style="display: table-cell; vertical-align: middle; padding-top: 3px; padding-left: 5px;">Ahri</div>
</td>
<td style="text-align: center;" data-sortval="975"><img src='//lkimg.zamimg.com/images/rp_logo.png' width='18' class='champion-price-icon'>975</td>
<td style="text-align: center;" data-sortval="6300"><img src='//lkimg.zamimg.com/images/ip_logo.png' width='18' class='champion-price-icon'>6300</td>
<td style="text-align: center;" data-sortval="10.98">10.98%</td>
<td style="text-align: center;" data-sortval="48.44">48.44%</td>
<td style="text-align: center;" data-sortval="18.85">18.85%</td>
<td style="text-align: center;" data-sorttype="string" data-sortval="Middle Lane">Middle Lane</td>
<td style="text-align: center;" data-sortval="1323849600">12/14/2011</td>
</tr>
I'm having problems extracting the statistics, which are nested in td tags outside of the data-sortval. I imagine that I want to pull ALL the tr tags, but I don't know how to pull the tr tag based off of the one that contains the td tag with data-sortval="Ahri". At that point, I would want to step through the tr tag x times until I reach the first statistic I want, 10.98
At the moment, I'm trying to do a find for the td with data-sortval Ahri, but it doesn't return the rest of the tr.
It might be important to note that all of this is nested inside of a larger tag:
<table class="clientsort champion-list" width="100%" cellspacing="0" cellpadding="0">
<thead>
<tr><th>Champion</th><th>RP Cost</th><th>IP Cost</th><th>Popularity</th><th>Win Rate</th><th>Ban Rate</th><th>Meta</th><th>Released</th></tr>
</thead>
<tbody>
I apologize for the lack of clarity, I'm new with this scraping terminology, but I hope that makes enough sense.
Right now, I'm also doing:
main = soup.find('table', {'class':'clientsort champion-list'})
To get only that table
edit:
I typed this for the variable:
for champ in champs:
    a = str(champ)
    print type(a) is str
    td_name = soup.find('td',{"data-sortval":a})
It confirms that a is a string.
But it throws this error:
File "lolrec.py", line 82, in StatScrape
tr = td_name.parent
AttributeError: 'NoneType' object has no attribute 'parent'
GO LOL!
For commercial purpose, please read the terms of services before scraping.
(1) To scrape a list of heroes, you can do this, which follows a similar logic as you described.
from bs4 import BeautifulSoup
import urllib2
html = urllib2.urlopen('http://www.lolking.net/champions/')
soup = BeautifulSoup(html)
# locate the cell that contains hero name: Ahri
hero_list = ["Blitzcrank", "Ahri", "Akali"]
for hero in hero_list:
    td_name = soup.find('td', {"data-sortval":hero})
    tr = td_name.parent
    popularity = tr.find_all('td', recursive=False)[3].text
    print hero, popularity
Output
Blitzcrank 12.58%
Ahri 10.98%
Akali 7.52%
(2) To scrape all the heroes.
from bs4 import BeautifulSoup
import urllib2
html = urllib2.urlopen('http://www.lolking.net/champions/')
soup = BeautifulSoup(html)
# find the table first
table = soup.find('table', {"class":"clientsort champion-list"})
# find the all the rows
for row in table.find('tbody').find_all("tr", recursive=False):
    cols = row.find_all("td")
    hero = cols[0].text.strip()
    popularity = cols[3].text
    print hero, popularity
Output:
Aatrox 6.86%
Ahri 10.98%
Akali 7.52%
Alistar 4.9%
Amumu 8.75%
...
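The AttributeError in the question ('NoneType' object has no attribute 'parent') appears whenever soup.find() matches nothing, for example when a name differs in case or spelling from the data-sortval value. A minimal sketch of guarding against that is below; it uses a cut-down, hypothetical copy of the table (written for Python 3), and the helper name popularity_for is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Cut-down, hypothetical copy of the champion table with a single row.
html = '''<table class="clientsort champion-list"><tbody>
<tr>
<td data-sortval="Ahri"><div>Ahri</div></td>
<td data-sortval="975">975</td>
<td data-sortval="6300">6300</td>
<td data-sortval="10.98">10.98%</td>
</tr>
</tbody></table>'''

soup = BeautifulSoup(html, 'html.parser')

def popularity_for(soup, name):
    # find() returns None when no <td> has this exact data-sortval value.
    td_name = soup.find('td', {'data-sortval': name})
    if td_name is None:
        return None
    # Direct <td> children of the enclosing row; index 3 is the popularity column.
    cells = td_name.parent.find_all('td', recursive=False)
    return cells[3].get_text(strip=True) if len(cells) > 3 else None

for name in ['Ahri', 'ahri']:
    print(name, popularity_for(soup, name))
```

Returning None for an unknown name lets the caller decide whether to skip it or report it, instead of crashing partway through the list.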