Python get specific data with Beautifulsoup - python

I'm trying to get the value of wpid from this page, https://www.scan.co.uk/products/1m-canyon-apple-lightning-to-usb-cable-for-apple-non-mfi-certified-iphones-5-6-7-8-x-11-black, using Python and BeautifulSoup, but I can't figure out how to get only the wpid and not everything else. Help would be appreciated.
page = sess.get(url)
data = page.text
soup = BeautifulSoup(data, 'html.parser')
text = soup.find('div', class_='buyButton large')
text2 = text.find('a')['href']
text3 = (text.find('a').contents[0])
text4 = (soup.find('div', class_='buyButton large').contents[0])
#text3 = text.find('div')['data-wpid']
print(text)
print(text2)
print(text3)
print(text4)
This is the response I get:
<div class="buyButton large" data-instock="1" data-source="2" data-wpid="2951361"><a class="btn" href="https://secure.scan.co.uk/web/basket/addproduct/2951361" rel="nofollow">Add To Basket</a></div>
https://secure.scan.co.uk/web/basket/addproduct/2951361
Add To Basket
<a class="btn" href="https://secure.scan.co.uk/web/basket/addproduct/2951361" rel="nofollow">Add To Basket</a>
But I only want the value of the wpid, which would be 2951361.

Read the data-wpid attribute the same way you tried in your commented-out line, but on the correct element.
text is already the div that carries the attribute, so you don't need the extra find().
>>> print("wpid:", text["data-wpid"])
wpid: 2951361
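
If the buy button might be missing from some pages, a slightly more defensive version could look like the sketch below. It parses the markup copied from the question's output rather than fetching the live page, and the variable names button and wpid are only illustrative.

```python
from bs4 import BeautifulSoup

# Markup copied from the question's printed output (no network request here).
html = (
    '<div class="buyButton large" data-instock="1" data-source="2" '
    'data-wpid="2951361"><a class="btn" '
    'href="https://secure.scan.co.uk/web/basket/addproduct/2951361" '
    'rel="nofollow">Add To Basket</a></div>'
)

soup = BeautifulSoup(html, "html.parser")
button = soup.find("div", class_="buyButton large")

# find() returns None when nothing matches, and Tag.get() returns None
# (instead of raising KeyError) when the attribute is absent.
wpid = button.get("data-wpid") if button is not None else None
print(wpid)
```

Using get() instead of square brackets is a design choice: it trades an exception for a None value that the caller has to check.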

Related

How to extract specific part of html using Beautifulsoup?

I am trying to extract the value of the title attribute from the following HTML, but so far I haven't managed to.
<div class="pull_right date details" title="22.12.2022 01:49:03 UTC-03:00">
This is my code:
from bs4 import BeautifulSoup
with open("messages.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')
results = soup.find_all('div', attrs={'class':'pull_right date details'})
print(results)
And the output is a list of all the matching <div> elements in the HTML file.
To access the value of the title attribute, index the tag with ['title'].
find_all returns a list, so you need to index into the list first (e.g. results[0]['title']).
For example:
from bs4 import BeautifulSoup
fp = '<html><div class="pull_right date details" title="22.12.2022 01:49:03 UTC-03:00"></html>'
soup = BeautifulSoup(fp, 'html.parser')
results = soup.find_all('div', attrs={'class':'pull_right date details'})
print(results[0]['title'])
Or:
results = soup.find('div', attrs={'class':'pull_right date details'})
print(results['title'])
Output:
22.12.2022 01:49:03 UTC-03:00
22.12.2022 01:49:03 UTC-03:00
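
Since the real file presumably contains many such divs, here is a minimal sketch that collects every title instead of only the first. The second timestamp and the div without a title are made-up sample data, and the CSS selector is just one way to express the match.

```python
from bs4 import BeautifulSoup

# Sample markup: the first div is from the question, the other two are
# hypothetical additions to show several matches and a missing attribute.
html = """
<div class="pull_right date details" title="22.12.2022 01:49:03 UTC-03:00"></div>
<div class="pull_right date details" title="22.12.2022 01:52:10 UTC-03:00"></div>
<div class="pull_right date details"></div>
"""

soup = BeautifulSoup(html, "html.parser")

# The selector requires all three classes (in any order) and a title
# attribute, so the third div is skipped rather than raising KeyError.
titles = [div["title"] for div in soup.select("div.pull_right.date.details[title]")]
print(titles)
```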

How to take link from onclick value in BeautifulSoup?

I need help scraping a link to an image that is stored in the onclick= value.
I tried this, but I'm stuck on how to strip everything from onclick except the link.
<a onclick="ShowEnlargedImagePreview( 'https://steamuserimages-a.akamaihd.net/ugc/794261971268711656/69C39CF2A2BBCDDC7C04C17DF1E88A6ED875DBE7/' );"></a>
links = soup.find('div', class_='workshopItemPreviewImageMain')
links = links.findChild('a', attrs={'onclick': re.compile("^https://")})
But nothing is output.
links = soup.find('div', class_='workshopItemPreviewImageMain')
links = links.findChild('a')
links = links.get("onclick")
The entire value of onclick is displayed:
ShowEnlargedImagePreview( 'https://steamuserimages-a.akamaihd.net/ugc/794261971268711656/69C39CF2A2BBCDDC7C04C17DF1E88A6ED875DBE7/' )
But only a link is needed.
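You just need to change your regular expression: the onclick value starts with the function name, not with the URL, so a pattern anchored with ^https:// never matches.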
from bs4 import BeautifulSoup
import re
pattern = re.compile(r'''(?P<quote>['"])(?P<href>https?://.+?)(?P=quote)''')
data = '''
<div class="workshopItemPreviewImageMain">
<a onclick="ShowEnlargedImagePreview( 'https://steamuserimages-a.akamaihd.net/ugc/794261971268711656/69C39CF2A2BBCDDC7C04C17DF1E88A6ED875DBE7/' );"></a>
</div>
'''
soup = BeautifulSoup(data, 'html.parser')
div = soup.find('div', class_='workshopItemPreviewImageMain')
links = div.find_all('a', {'onclick': pattern})
for a in links:
    print(pattern.search(a['onclick']).group('href'))
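
The same idea can be wrapped in a small helper that returns None when a tag has no onclick or no quoted URL. This is only a sketch: the helper name onclick_url and the second, empty anchor are assumptions added for illustration.

```python
import re
from bs4 import BeautifulSoup

# Matches an http(s) URL wrapped in single or double quotes.
URL_IN_QUOTES = re.compile(r"""['"](https?://[^'"]+)['"]""")


def onclick_url(tag):
    """Return the first quoted URL in the tag's onclick value, or None."""
    match = URL_IN_QUOTES.search(tag.get("onclick", ""))
    return match.group(1) if match else None


# The first <a> is from the question; the second one is hypothetical.
html = '''<div class="workshopItemPreviewImageMain">
<a onclick="ShowEnlargedImagePreview( 'https://steamuserimages-a.akamaihd.net/ugc/794261971268711656/69C39CF2A2BBCDDC7C04C17DF1E88A6ED875DBE7/' );"></a>
<a href="#"></a>
</div>'''

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="workshopItemPreviewImageMain")
urls = [onclick_url(a) for a in div.find_all("a")]
print(urls)
```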

Getting only numbers from BeautifulSoup instead of whole div

I am trying to learn Python by creating a small web scraping program to make life easier, but I am having trouble getting only the number when using BS4. I was able to get the price when I scraped an individual ad, but I would like to get all the prices from the page.
Here is my code:
from bs4 import BeautifulSoup
import requests
prices = []
url = 'https://www.kijiji.ca/b-cars-trucks/calgary/new__used/c174l1700199a49?ll=51.044733%2C-114.071883&address=Calgary%2C+AB&radius=50.0'
result = requests.get(url)
print (result.status_code)
src = result.content
soup = BeautifulSoup(src, 'html.parser')
print ("CLEARING")
price = soup.findAll("div", class_="price")
prices.append(price)
print (prices)
Here is my output
[<div class="price">
$46,999.00
<div class="dealer-logo">
<div class="dealer-logo-image">
<img src="https://i.ebayimg.com/00/s/NjBYMTIw/z/xMQAAOSwi9ZfoW7r/$_69.PNG"/>
</div>
</div>
</div>
Ideally, I would only want the output to be "46,999.00".
I tried with text=True, although this did not work and I would not get any output from it besides an empty list.
Thank you
You need to get the text of the tag and then clean it up with a regular expression.
import re
def get_price_from_div(div_item):
    # Keep only digits and the decimal point, e.g. "$46,999.00" -> "46999.00"
    str_price = re.sub(r'[^0-9.]', '', div_item.text)
    float_price = float(str_price)
    return float_price
Just call this function in your code after you find the divs:
from bs4 import BeautifulSoup
import requests
prices = []
url = 'https://www.kijiji.ca/b-cars-trucks/calgary/new__used/c174l1700199a49?ll=51.044733%2C-114.071883&address=Calgary%2C+AB&radius=50.0'
result = requests.get(url)
print (result.status_code)
src = result.content
soup = BeautifulSoup(src, 'html.parser')
print ("CLEARING")
price = soup.findAll("div", class_="price")
prices.extend([get_price_from_div(curr_div) for curr_div in price])
print (prices)
An option without regex is to keep only the tags whose text startswith() a dollar sign $ and drop that first character:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.kijiji.ca/b-cars-trucks/calgary/new__used/c174l1700199a49?ll=51.044733%2C-114.071883&address=Calgary%2C+AB&radius=50.0'
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
price_tags = soup.find_all("div", class_="price")
prices = [
    tag.get_text(strip=True)[1:] for tag in price_tags
    if tag.get_text(strip=True).startswith('$')
]
print(prices)
Output:
['48,888.00', '21,999.00', '44,488.00', '5,500.00', '33,000.00', '14,900.00', '1,750.00', '35,600.00', '1,800.00', '25,888.00', '36,888.00', '32,888.00', '30,888.00', '18,888.00', '21,888.00', '29,888.00', '22,888.00', '30,888.00', '17,888.00', '17,888.00', '16,888.00', '22,888.00', '22,888.00', '34,888.00', '31,888.00', '32,888.00', '30,888.00', '21,888.00', '15,888.00', '21,888.00', '28,888.00', '19,888.00', '18,888.00', '30,995.00', '30,995.00', '30,995.00', '19,888.00', '47,995.00', '21,888.00', '46,995.00', '32,888.00', '29,888.00', '26,888.00', '21,888.00']
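
If the prices are going to be compared or summed, converting them to Decimal avoids float rounding. The sketch below runs on the markup from the question plus one made-up listing without a dollar price ("Please Contact" is an assumption, not taken from the page); parse_price is a hypothetical helper name.

```python
from decimal import Decimal
from bs4 import BeautifulSoup

# First div copied from the question's output; the second is hypothetical.
html = """
<div class="price">
$46,999.00
<div class="dealer-logo">
<div class="dealer-logo-image">
<img src="https://i.ebayimg.com/00/s/NjBYMTIw/z/xMQAAOSwi9ZfoW7r/$_69.PNG"/>
</div>
</div>
</div>
<div class="price">Please Contact</div>
"""


def parse_price(div):
    """Return the price as a Decimal, or None if the text is not a $ amount."""
    # stripped_strings yields the tag's text fragments with whitespace removed;
    # the first fragment is the price text in this markup.
    text = next(div.stripped_strings, "")
    if not text.startswith("$"):
        return None
    return Decimal(text[1:].replace(",", ""))


soup = BeautifulSoup(html, "html.parser")
parsed = [parse_price(div) for div in soup.find_all("div", class_="price")]
prices = [p for p in parsed if p is not None]
print(prices)
```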

return value inside html tag with beautifulsoup

I'm trying to get the data from some social networks and put in the mongodb.
This is the information inside the html tag
<span class="ProfileNav-value" data-count="347235" data-is-compact="true">347K</span>
I was able to recover the 347K as follows
page = requests.get("https://twitter.com/cancaonova")
soup = BeautifulSoup(page.content, 'html.parser')
followers = soup.find_all(class_="ProfileNav-value")
seguidores = followers[2]
print(seguidores.get_text())
However, I want to get the value of the data-count attribute. I'm trying it this way, but the result is None:
page = requests.get("https://twitter.com/cancaonova")
soup = BeautifulSoup(page.content, 'html.parser')
followers = soup.find('data-count')
print(followers)
Thanks.
Use element.attrs (or index the tag directly) to read an attribute:
seguidores = followers[2]
datacount = seguidores.attrs['data-count']
For example, with the span from the question:
rel_soup = BeautifulSoup('<span class="ProfileNav-value" data-count="347235" data-is-compact="true">347K</span>','html.parser')
print(rel_soup.span['data-count'])
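
The attempt soup.find('data-count') returns None because find() treats its first argument as a tag name, and there is no <data-count> tag. To search by attribute instead, one option is to pass attrs, as in this minimal sketch (it uses the span from the question as a string, so no request is made):

```python
from bs4 import BeautifulSoup

html = '<span class="ProfileNav-value" data-count="347235" data-is-compact="true">347K</span>'
soup = BeautifulSoup(html, "html.parser")

# A value of True matches any tag that has the attribute, whatever its value.
tag = soup.find(attrs={"data-count": True})

# Attribute values are strings, so convert explicitly if a number is needed.
count = int(tag["data-count"]) if tag is not None else None
print(count)
```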

BeautifulSoup - Python - Find the key from HTML

I have been practicing with bs4 and Python, and now I'm stuck.
My plan is to write an if/else statement, something like this:
If(I find a value inside this html)
Do This method
Else:
Do something else
and I have scraped up a html I found randomly which looks like -
<div class="Talkinghand" data-backing="ShowingHide" data-key="123456" data-theme="$MemeTheme" style=""></div>
and what I have done so far is that:
s = requests.Session()
Url = 'www.myhtml.com' #Just took a random page which I don't feel to insert
r = s.get(Url)
soup = soup(r, "lxml")
findKey = soup.find(('div', {'class': 'Talkinghand'})['data-key'])
print(findKey)
but no luck. It gives me this error:
TypeError: object of type 'Response' has no len()
Once I can find or print the key, I want to write an if/else statement that says:
If(there is a value inside that data-key)
...
To display the data-key attribute from inside the <div> tag, you can do the following:
from bs4 import BeautifulSoup
html = '<div class="Talkinghand" data-backing="ShowingHide" data-key="123456" data-theme="$MemeTheme" style=""></div>'
soup = BeautifulSoup(html, "html.parser")
print(soup.div['data-key'])
This would print:
123456
The TypeError comes from passing the Response object r to BeautifulSoup; pass r.content instead.
Your find() call also had an extra pair of parentheses, so the following would work:
findKey = soup.find('div', {'class': 'Talkinghand'})['data-key']
print(findKey)
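
For the if/else part of the question, a possible sketch is shown below. The function name read_key and the returned messages are placeholders for whatever "do this method" and "do something else" mean in the real script, and the second HTML sample is made up to exercise the else branch.

```python
from bs4 import BeautifulSoup


def read_key(html):
    """Return a message depending on whether a non-empty data-key exists."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", class_="Talkinghand")
    # get() avoids a KeyError when the attribute is missing.
    key = div.get("data-key") if div is not None else None
    if key:
        return f"found key {key}"  # placeholder for "do this method"
    return "no key"  # placeholder for "do something else"


with_key = '<div class="Talkinghand" data-backing="ShowingHide" data-key="123456" data-theme="$MemeTheme" style=""></div>'
without_key = '<div class="Talkinghand" style=""></div>'  # hypothetical

print(read_key(with_key))
print(read_key(without_key))
```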
