Grabbing values from HTML with BS4 - python

I am having a hard time figuring out how to grab certain data from this HTML snippet that I've obtained from parsing through HTML via BeautifulSoup.
Here is my code:
productpage = 'http://www.sneakersnstuff.com/en/product/26133/adidas-samba-waves-x-naked'
rr = requests.get(productpage)
soup1 = BeautifulSoup(rr.content, 'xml')
productIDArray = soup1.find_all("div", class_="size-button property available")
#print for debugging purposes
print(productIDArray[0])
productIDArray[0] returns
<div class="size-button property available" data-productId="207789">
<span class="size-type" title="UK 3.5 | 36">
US 4
</span>
</div>
How would I grab the value of data-productId and the title of the span so that I can place them into variables?
Thank you.

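find_all() returns a list of matches, so index into it first and then read the attribute from the element:
productIDArray[0]['data-productId']
out:
'207789'
productIDArray[0].span['title']
out:
'UK 3.5 | 36'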
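To collect both values for every size button rather than only the first, something like the following sketch may help. It assumes the markup shown in the question; the second div and its values are made up for illustration. Note that it uses 'html.parser', which lowercases attribute names, so the key becomes data-productid (the 'xml' parser used in the question keeps the original case).

```python
from bs4 import BeautifulSoup

# First div is from the question; the second one is a hypothetical extra entry.
html = """
<div class="size-button property available" data-productId="207789">
<span class="size-type" title="UK 3.5 | 36">US 4</span>
</div>
<div class="size-button property available" data-productId="207790">
<span class="size-type" title="UK 4 | 36 2/3">US 4.5</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

sizes = []
for div in soup.select("div.size-button.available"):
    # html.parser lowercases attribute names, hence 'data-productid'
    product_id = div.get("data-productid")
    # div.span is the first <span> inside the div, or None if there is none
    title = div.span["title"] if div.span else None
    sizes.append((product_id, title))

print(sizes)
```

Using .get() instead of square brackets returns None for a missing attribute rather than raising KeyError, which is usually easier to handle when a page layout changes.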

Related

Python get specific data with Beautifulsoup

I'm trying to get the value of wpid from this website https://www.scan.co.uk/products/1m-canyon-apple-lightning-to-usb-cable-for-apple-non-mfi-certified-iphones-5-6-7-8-x-11-black with Python using BeautifulSoup, but I can't figure out how to get only the wpid and not the other stuff. Help would be appreciated.
page = sess.get(url)
data = page.text
soup = BeautifulSoup(data, 'html.parser')
text = soup.find('div', class_='buyButton large')
text2 = text.find('a')['href']
text3 = (text.find('a').contents[0])
text4 = (soup.find('div', class_='buyButton large').contents[0])
#text3 = text.find('div')['data-wpid']
print(text)
print(text2)
print(text3)
print(text4)
This is the response I get:
<div class="buyButton large" data-instock="1" data-source="2" data-wpid="2951361"><a class="btn" href="https://secure.scan.co.uk/web/basket/addproduct/2951361" rel="nofollow">Add To Basket</a></div>
https://secure.scan.co.uk/web/basket/addproduct/2951361
Add To Basket
<a class="btn" href="https://secure.scan.co.uk/web/basket/addproduct/2951361" rel="nofollow">Add To Basket</a>
But I only want the value of the wpid, which would be 2951361.
Access the data-wpid attribute the way you already tried in the commented-out line, but on the correct element.
text is already the div that carries the attribute, so you don't need the extra find().
>>> print("wpid:", text["data-wpid"])
wpid: 2951361
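A self-contained sketch of the same idea, using the div from the question as inline HTML so it runs without a network request. The variable names and the data-does-not-exist attribute are illustrative only.

```python
from bs4 import BeautifulSoup

# The div printed in the question, pasted in as a string.
html = (
    '<div class="buyButton large" data-instock="1" data-source="2" '
    'data-wpid="2951361"><a class="btn" '
    'href="https://secure.scan.co.uk/web/basket/addproduct/2951361" '
    'rel="nofollow">Add To Basket</a></div>'
)

soup = BeautifulSoup(html, "html.parser")

# class_ matches any single class on the element, so "buyButton" is enough
button = soup.find("div", class_="buyButton")

# Guard against the element being absent before reading its attribute
wpid = button.get("data-wpid") if button is not None else None

# .get() returns None for an attribute that is not present
missing = button.get("data-does-not-exist")

print("wpid:", wpid)
```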

Python BS4 can not extract data properly

So I have this source code
<div class="field-fluid col-xs-12 col-sm-12 col-md-12 col-lg-6">
<div class="label-fluid">Email Address</div>
<div class="data-fluid rev-field" aria-data="rei-0">maldapalmer<span class="hd-form-field">ajk89;fjioasjdfwjepu90f30 v09u30r nv8704rhnv987rjl3409u0asu[amav084-8235 087307304u0[9fd0]] asf74 john 9##83r8cva sarah sj4t8g#!$%#7h v7hgv 398#$&&^#7y9</span>#gmail<span class="hd-form-field">ajk89;fjioasjdfwjepu90f30 v09u30r nv8704rhnv987rjl3409u0asu[amav084-8235 087307304u0[9fd0]] asf74 john 9##83r8cva sarah sj4t8g#!$%#7h v7hgv 398#$&&^#7y9</span>.com</div>
</div>
I seem to be doing everything right, but I cannot extract the email address housed in the second div within the main div element. This is my code:
fields = []
for row in rows:
    fields.append(row.find_all('div', recursive=False))
email = fields[0][0].find(class_="data-fluid rev-field").text
Here, row is the element in which the main div is housed. Any suggestions are welcome; I hope I explained the issue well enough.
The problem is that the string comes back empty (''). Thanks!
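You can extract the email address with the following code:
from bs4 import BeautifulSoup
from requests import get
response = get('http://127.0.0.1/bs.html')  # Replace 'http://127.0.0.1/bs.html' with your URL
sp = BeautifulSoup(response.text, 'html.parser')
# Full text of the field, including the filler text from the hidden spans
email = sp.find('div', class_="data-fluid rev-field").text
# Text of the first hidden span; both spans carry the same filler
spn = sp.find('span', class_="hd-form-field").text
# Remove every occurrence of the filler, leaving only the visible parts
email = email.replace(spn, "")
print(email)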
Output:
maldapalmer#gmail.com
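The string replacement above relies on both hidden spans containing identical text. An alternative sketch, assuming the same structure, removes the hd-form-field spans from the tree before reading the text, so it also works if the filler differs between spans. The filler strings below are shortened stand-ins, not the real page content.

```python
from bs4 import BeautifulSoup

# Same structure as the question, with shortened placeholder filler text.
html = (
    '<div class="data-fluid rev-field" aria-data="rei-0">maldapalmer'
    '<span class="hd-form-field">filler one</span>#gmail'
    '<span class="hd-form-field">filler two</span>.com</div>'
)

soup = BeautifulSoup(html, "html.parser")
field = soup.find("div", class_="rev-field")

# Drop every hidden span from the parse tree
for span in field.find_all("span", class_="hd-form-field"):
    span.decompose()

# Only the visible text fragments remain; join them without separators
email = field.get_text(strip=True)
print(email)
```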

BeautifulSoup extract text from comment html [duplicate]

Apologies if this question is similar to others; I wasn't able to make any of the other solutions work. I'm scraping a website using BeautifulSoup and I am trying to get the information from a table field that's commented out:
<td>
<span class="release" data-release="1518739200"></span>
<!--<p class="statistics">
<span class="views" clicks="1564058">1.56M Clicks</span>
<span class="interaction" likes="0"></span>
</p>-->
</td>
How do I get the part 'views' and 'interaction'?
You need to extract the HTML from the comment and parse it again with BeautifulSoup like this:
from bs4 import BeautifulSoup, Comment
html = """<td>
<span class="release" data-release="1518739200"></span>
<!--<p class="statistics">
<span class="views" clicks="1564058">1.56M Clicks</span>
<span class="interaction" likes="0"></span>
</p>-->
</td>"""
soup = BeautifulSoup(html , 'lxml')
comment = soup.find(text=lambda text:isinstance(text, Comment))
commentsoup = BeautifulSoup(comment , 'lxml')
views = commentsoup.find('span', {'class': 'views'})
interaction= commentsoup.find('span', {'class': 'interaction'})
print (views.get_text(), interaction['likes'])
Outputs:
1.56M Clicks 0
If the comment is not the first on the page you would need to index it like this:
comment = soup.find_all(text=lambda text:isinstance(text, Comment))[1]
or find it from a parent element.
Updated in response to comment:
You can use the parent 'tr' element for this. The page you supplied had "shares", not "interaction", so I expect you got a NoneType object, which caused the error you saw. You could add tests in your code for NoneType objects if you need to.
from bs4 import BeautifulSoup, Comment
import requests
url = "https://imvdb.com/calendar/2018?page=1"
html = requests.get(url).text
soup = BeautifulSoup(html , 'lxml')
for tr in soup.find_all('tr'):
    comment = tr.find(text=lambda text: isinstance(text, Comment))
    commentsoup = BeautifulSoup(comment, 'lxml')
    views = commentsoup.find('span', {'class': 'views'})
    shares = commentsoup.find('span', {'class': 'shares'})
    print(views.get_text(), shares['data-shares'])
Outputs:
3.60K Views 0
1.56M Views 0
220.28K Views 0
6.09M Views 0
133.04K Views 0
163.62M Views 0
30.44K Views 0
2.95M Views 0
2.10M Views 0
83.21K Views 0
5.27K Views 0
...
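The NoneType tests mentioned above could look roughly like this. It is only a sketch: the table rows are invented for illustration (one row has no comment, one has no "shares" span), and it uses 'html.parser' and the string= argument so it runs without lxml.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical rows: a complete one, one without a comment, one without shares.
html = """<table>
<tr><td><!--<p class="statistics"><span class="views">3.60K Views</span><span class="shares" data-shares="0"></span></p>--></td></tr>
<tr><td>no comment here</td></tr>
<tr><td><!--<p class="statistics"><span class="views">1.56M Views</span></p>--></td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    comment = tr.find(string=lambda s: isinstance(s, Comment))
    if comment is None:
        # Row has no commented-out block, so there is nothing to parse
        continue
    inner = BeautifulSoup(comment, "html.parser")
    views = inner.find("span", class_="views")
    shares = inner.find("span", class_="shares")
    rows.append((
        views.get_text() if views else None,
        shares.get("data-shares") if shares else None,
    ))

print(rows)
```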
The simplest solution is to use .replace(): strip the <!-- and --> markers from the HTML so that the commented-out elements are parsed as ordinary markup. Take a look at the script below.
from bs4 import BeautifulSoup
htdoc = """
<td>
<span class="release" data-release="1518739200"></span>
<!--<p class="statistics">
<span class="views" clicks="1564058">1.56M Clicks</span>
<span class="interaction" likes="0"></span>
</p>-->
</td>
"""
elem = htdoc.replace("<!--","").replace("-->","")
soup = BeautifulSoup(elem,'lxml')
views = soup.select_one('span.views').get_text(strip=True)
likes = soup.select_one('span.interaction')['likes']
print(f'{views}\n{likes}')
Output:
1.56M Clicks
0
If you want only the views then:
views = soup.findAll("span", {"class": "views"})
You also can get the whole paragraph with:
p = soup.findAll("p", {"class": "statistics"})
Then you can get the data from the p.

Python: How to extract URL from HTML Page using BeautifulSoup?

I have a HTML Page with multiple divs like
<div class="article-additional-info">
A peculiar situation arose in the Supreme Court on Tuesday when two lawyers claimed to be the representative of one of the six accused in the December 16 gangrape case who has sought shifting of t...
<a class="more" href="http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece">
<span class="arrows">»</span>
</a>
</div>
<div class="article-additional-info">
Power consumers in the city will have to brace for spending more on their monthly bills as all three power distribution companies – the Anil Ambani-owned BRPL and BYPL and the Tatas-owned Tata Powe...
<a class="more" href="http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece">
<a class="commentsCount" href="http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments">
</div>
and I need to get the <a href=> value for all the divs with class article-additional-info.
I am new to BeautifulSoup,
so I need the URLs
"http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece"
"http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece"
What is the best way to achieve this?
According to your criteria, it returns three URLs (not two) - did you want to filter out the third?
The basic idea is to iterate through the HTML, pulling out only those elements with your class, and then to iterate through all of the links inside each of them, pulling out the actual URLs:
In [1]: from bs4 import BeautifulSoup
In [2]: html = # your HTML
In [3]: soup = BeautifulSoup(html, 'html.parser')
In [4]: for item in soup.find_all(attrs={'class': 'article-additional-info'}):
   ...:     for link in item.find_all('a'):
   ...:         print(link.get('href'))
   ...:
http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments
This limits your search to just those elements with the article-additional-info class, and inside each of them looks for all anchor (a) tags and grabs their href values.
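If the third URL (the comments link) should be filtered out, one option is a CSS selector that only matches anchors with class "more". This is a sketch under that assumption; the teaser text and the example.com URLs are placeholders, not the real page content.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page: same classes, placeholder text and URLs.
html = """
<div class="article-additional-info">
  Some teaser text
  <a class="more" href="http://example.com/a.ece"><span class="arrows">»</span></a>
</div>
<div class="article-additional-info">
  More teaser text
  <a class="more" href="http://example.com/b.ece"></a>
  <a class="commentsCount" href="http://example.com/b.ece#comments"></a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Only anchors with class "more" and an href, inside the target divs
links = [a["href"] for a in soup.select("div.article-additional-info a.more[href]")]

print(links)
```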
After working through the documentation, I did it the following way. Thank you all for your answers; I appreciate them.
>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>> f = urlopen('http://www.thehindu.com/news/cities/delhi/?union=citynews')
>>> soup = BeautifulSoup(f, 'html.parser')
>>> for link in soup.select('.article-additional-info'):
...     print(link.find('a').attrs['href'])
...
http://www.thehindu.com/news/cities/Delhi/airport-metro-express-is-back/article4335059.ece
http://www.thehindu.com/news/cities/Delhi/91-more-illegal-colonies-to-be-regularised/article4335069.ece
http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece
http://www.thehindu.com/news/cities/Delhi/nurses-women-groups-demand-safety-audit-of-workplaces/article4331470.ece
http://www.thehindu.com/news/cities/Delhi/delhi-bpl-families-to-get-12-subsidised-lpg-cylinders/article4328990.ece
http://www.thehindu.com/news/cities/Delhi/shias-condemn-violence-against-religious-minorities/article4328276.ece
http://www.thehindu.com/news/cities/Delhi/new-archbishop-of-delhi-takes-over/article4328284.ece
http://www.thehindu.com/news/cities/Delhi/delhi-metro-to-construct-subway-without-disrupting-traffic/article4328290.ece
http://www.thehindu.com/life-and-style/Food/going-for-the-kill-in-patparganj/article4323210.ece
http://www.thehindu.com/news/cities/Delhi/fire-at-janpath-bhavan/article4335068.ece
http://www.thehindu.com/news/cities/Delhi/fiveyearold-girl-killed-as-school-van-overturns/article4335065.ece
http://www.thehindu.com/news/cities/Delhi/real-life-stories-of-real-women/article4331483.ece
http://www.thehindu.com/news/cities/Delhi/women-councillors-allege-harassment-by-male-councillors-of-rival-parties/article4331471.ece
http://www.thehindu.com/news/cities/Delhi/airport-metro-resumes-today/article4331467.ece
http://www.thehindu.com/news/national/hearing-today-on-plea-to-shift-trial/article4328415.ece
http://www.thehindu.com/news/cities/Delhi/protestors-demand-change-in-attitude-of-men-towards-women/article4328277.ece
http://www.thehindu.com/news/cities/Delhi/bjp-promises-5-lakh-houses-for-poor-on-interestfree-loans/article4328280.ece
http://www.thehindu.com/life-and-style/metroplus/papad-bidi-and-a-dacoit/article4323219.ece
http://www.thehindu.com/life-and-style/Food/gharana-of-food-not-just-music/article4323212.ece
>>>
from bs4 import BeautifulSoup as BS
html = # Your HTML
soup = BS(html, 'html.parser')
for text in soup.find_all('div', class_='article-additional-info'):
    for links in text.find_all('a'):
        print(links.get('href'))
Which prints:
http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments

Python BeautifulSoup parsing

I am trying to scrape some content (am very new to Python) and I have hit a stumbling block. The code I am trying to scrape is:
<h2>Spear & Jackson Predator Universal Hardpoint Saw - 22"</h2>
<p><span class="productlist_mostwanted_rrp">
Was: <span class="strikethrough">£12.52</span></span><span class="productlist_mostwanted_save">Save: £6.57(52%)</span></p>
<div class="clear"></div>
<p class="productlist_mostwanted_price">Now: £5.95</p>
What I am trying to scrape is the link text (Spear & Jackson etc) and the price (£5.95). I have looked on Google, in the BeautifulSoup documentation and on this forum, and I managed to extract the "Now: £5.95" using this code:
for node in soup.findAll('p', { "class" : "productlist_grid_price" }):
    print(''.join(node.findAll(text=True)))
However the result I am after is just 5.95. I have also had limited success trying to get the link text (Spear & Jackson) using:
soup.h2.a.contents[0]
However of course this returns just the first result.
The ultimate result that I am aiming for is to have the results look like:
Spear & Jackson Predator Universal Hardpoint Saw - 22 5.95
etc
etc
As I am looking to export this to a csv, I need to figure out how to put the data into 2 columns. Like I say I am very new to python so I hope this makes sense.
I appreciate any help!
Many thanks
I think what you're looking for is something like this:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(open('prueba.html').read(), 'html.parser')
# Collapse runs of whitespace in the link text into single spaces
item = re.sub(r'\s+', ' ', soup.h2.a.text)
price = soup.find('p', {'class': 'productlist_mostwanted_price'}).text
# Keep only the numeric part of "Now: £5.95"
price = re.search(r'\d+\.\d+', price).group(0)
print(item, price)
Example output:
Spear & Jackson Predator Universal Hardpoint Saw - 22" 5.95
Note that for the item, the regular expression is used only to remove extra whitespace, while for the price it is used to capture the number.
html = '''
<h2>Spear & Jackson Predator Universal Hardpoint Saw - 22</h2>
<p><span class="productlist_mostwanted_rrp">
Was: <span class="strikethrough">£12.52</span></span><span class="productlist_mostwanted_save">Save: £6.57(52%)</span></p>
<div class="clear"></div>
<p class="productlist_mostwanted_price">Now: £5.95</p>
'''
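from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(html, 'html.parser')
# h2.get_text() also returns the text of a nested <a>, if there is one
desc = soup.h2.get_text(strip=True)
price_str = soup.find('p', {"class": "productlist_mostwanted_price"}).get_text()
# Pull the first run of digits and dots out of "Now: £5.95"
price = float(re.search(r'[0-9.]+', price_str).group())
print(desc, price)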
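Neither answer covers the CSV part of the question. A minimal sketch of the two-column export follows, assuming each product is an h2 followed later by a p with class productlist_mostwanted_price. The second product, the column names and the href="#" placeholders are invented for illustration, and the CSV is written to an in-memory buffer rather than a file.

```python
import csv
import io
import re

from bs4 import BeautifulSoup

# First product is from the question; the second is a hypothetical extra row.
html = """
<h2><a href="#">Spear &amp; Jackson Predator Universal Hardpoint Saw - 22"</a></h2>
<p class="productlist_mostwanted_price">Now: £5.95</p>
<h2><a href="#">Hypothetical Second Product</a></h2>
<p class="productlist_mostwanted_price">Now: £12.00</p>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for h2 in soup.find_all("h2"):
    # First matching price paragraph that appears after this heading
    price_tag = h2.find_next("p", class_="productlist_mostwanted_price")
    if price_tag is None:
        continue
    match = re.search(r"\d+\.\d+", price_tag.get_text())
    if match is None:
        continue
    # Collapse any internal whitespace in the product name
    name = " ".join(h2.get_text().split())
    rows.append((name, match.group(0)))

# Write a header plus one line per product; swap the buffer for open(...) to save a file
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["name", "price"])
writer.writerows(rows)
csv_text = buffer.getvalue()

print(csv_text)
```

The csv module takes care of quoting, which matters here because the first product name contains a double quote.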
