I am trying to extract the what's within the 'title' tag from the following html, but so far I didn't manage to.
<div class="pull_right date details" title="22.12.2022 01:49:03 UTC-03:00">
This is my code:
from bs4 import BeautifulSoup
with open("messages.html") as fp:
soup = BeautifulSoup(fp, 'html.parser')
results = soup.find_all('div', attrs={'class':'pull_right date details'})
print(results)
And the output is a list with all <div for the html file.
To access the value inside title. Simply call ['title'].
If you use find_all, then this will return a list. Therefore you will need an index (e.g [0]['title'])
For example:
from bs4 import BeautifulSoup
fp = '<html><div class="pull_right date details" title="22.12.2022 01:49:03 UTC-03:00"></html>'
soup = BeautifulSoup(fp, 'html.parser')
results = soup.find_all('div', attrs={'class':'pull_right date details'})
print(results[0]['title'])
Or:
results = soup.find('div', attrs={'class':'pull_right date details'})
print(results['title'])
Output:
22.12.2022 01:49:03 UTC-03:00
22.12.2022 01:49:03 UTC-03:00
Need help scrubbing a link to an image that is stored in the onclick= value.
I do this, but I stopped how to remove everything in onclick except for the link.
<a onclick="ShowEnlargedImagePreview( 'https://steamuserimages-a.akamaihd.net/ugc/794261971268711656/69C39CF2A2BBCDDC7C04C17DF1E88A6ED875DBE7/' );"></a>
links = soup.find('div', class_='workshopItemPreviewImageMain')
links = links.findChild('a', attrs={'onclick': re.compile("^https://")})
But nothing is output.
links = soup.find('div', class_='workshopItemPreviewImageMain')
links = links.findChild('a')
links = links.get("onclick")
The entire value of onclick is displayed:
howEnlargedImagePreview( 'https://steamuserimages-a.akamaihd.net/ugc/794261971268711656/69C39CF2A2BBCDDC7C04C17DF1E88A6ED875DBE7/' )
But only a link is needed.
You just need to change your regular expression.
from bs4 import BeautifulSoup
import re
pattern = re.compile(r'''(?P<quote>['"])(?P<href>https?://.+?)(?P=quote)''')
data = '''
<div class="workshopItemPreviewImageMain">
<a onclick="ShowEnlargedImagePreview( 'https://steamuserimages-a.akamaihd.net/ugc/794261971268711656/69C39CF2A2BBCDDC7C04C17DF1E88A6ED875DBE7/' );"></a>
</div>
'''
soup = BeautifulSoup(data, 'html.parser')
div = soup.find('div', class_='workshopItemPreviewImageMain')
links = div.find_all('a', {'onclick': pattern})
for a in links:
print(pattern.search(a['onclick']).group('href'))
I am trying to learn python by creating a small websraping program to make life easier, although I am having issues with only getting number when using BS4. I was able to get the price when I scraped an actual ad, but I would like to get all the prices from the page.
Here is my code:
from bs4 import BeautifulSoup
import requests
prices = []
url = 'https://www.kijiji.ca/b-cars-trucks/calgary/new__used/c174l1700199a49?ll=51.044733%2C-114.071883&address=Calgary%2C+AB&radius=50.0'
result = requests.get(url)
print (result.status_code)
src = result.content
soup = BeautifulSoup(src, 'html.parser')
print ("CLEARING")
price = soup.findAll("div", class_="price")
prices.append(price)
print (prices)
Here is my output
[<div class="price">
$46,999.00
<div class="dealer-logo">
<div class="dealer-logo-image">
<img src="https://i.ebayimg.com/00/s/NjBYMTIw/z/xMQAAOSwi9ZfoW7r/$_69.PNG"/>
</div>
</div>
</div>
Ideally, I would only want the output to be "46,999.00".
I tried with text=True, although this did not work and I would not get any output from it besides an empty list.
Thank you
You need to get the text portion of tag and then perform some regex processing on it.
import re
def get_price_from_div(div_item):
str_price = re.sub('[^0-9\.]','', div_item.text)
float_price = float(str_price)
return float_price
Just call this method in your code after you find the divs
from bs4 import BeautifulSoup
import requests
prices = []
url = 'https://www.kijiji.ca/b-cars-trucks/calgary/new__used/c174l1700199a49?ll=51.044733%2C-114.071883&address=Calgary%2C+AB&radius=50.0'
result = requests.get(url)
print (result.status_code)
src = result.content
soup = BeautifulSoup(src, 'html.parser')
print ("CLEARING")
price = soup.findAll("div", class_="price")
prices.extend([get_price_from_div(curr_div) for curr_div in price])
print (prices)
An option without using RegEx, is to filter out tags that startwith() a dollar sign $:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.kijiji.ca/b-cars-trucks/calgary/new__used/c174l1700199a49?ll=51.044733%2C-114.071883&address=Calgary%2C+AB&radius=50.0'
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
price_tags = soup.find_all("div", class_="price")
prices = [
tag.get_text(strip=True)[1:] for tag in price_tags
if tag.get_text(strip=True).startswith('$')
]
print(prices)
Output:
['48,888.00', '21,999.00', '44,488.00', '5,500.00', '33,000.00', '14,900.00', '1,750.00', '35,600.00', '1,800.00', '25,888.00', '36,888.00', '32,888.00', '30,888.00', '18,888.00', '21,888.00', '29,888.00', '22,888.00', '30,888.00', '17,888.00', '17,888.00', '16,888.00', '22,888.00', '22,888.00', '34,888.00', '31,888.00', '32,888.00', '30,888.00', '21,888.00', '15,888.00', '21,888.00', '28,888.00', '19,888.00', '18,888.00', '30,995.00', '30,995.00', '30,995.00', '19,888.00', '47,995.00', '21,888.00', '46,995.00', '32,888.00', '29,888.00', '26,888.00', '21,888.00']
I'm trying to get the data from some social networks and put in the mongodb.
This is the information inside the html tag
<span class="ProfileNav-value" data-count="347235" data-is-compact="true">347K</span>
I was able to recover the 347K as follows
page = requests.get("https://twitter.com/cancaonova")
soup = BeautifulSoup(page.content, 'html.parser')
followers = soup.find_all(class_="ProfileNav-value")
seguidores = followers[2]
print seguidores.get_text()
However I wanted to get the data inside the data-cont tag I'm trying that way, but the result was: none
page = requests.get("https://twitter.com/cancaonova")
soup = BeautifulSoup(page.content, 'html.parser')
followers = soup.find('data-count')
print(followers)
Tks for you
Use 'element.attrs' to read attribute:
seguidores = followers[2]
datacount = seguidores.attrs['data-count']
rel_soup = BeautifulSoup('<span class="ProfileNav-value" data-count="347235" data-is-compact="true">347K</span>','html.parser')
rel_soup.span['data-count']
I have been practicing with bs4 and Python and now I have been stucked.
My plan is to do a If - Else state where I wanted to do similar like
If(I find a value inside this html)
Do This method
Else:
Do something else
and I have scraped up a html I found randomly which looks like -
<div class="Talkinghand" data-backing="ShowingHide" data-key="123456" data-theme="$MemeTheme" style=""></div>
and what I have done so far is that:
s = requests.Session()
Url = 'www.myhtml.com' #Just took a random page which I don't feel to insert
r = s.get(Url)
soup = soup(r, "lxml")
findKey = soup.find(('div', {'class': 'Talkinghand'})['data-key'])
print(findKey)
but no luck. Gives me error and
TypeError: object of type 'Response' has no len()
Once I find or print out the key. I wanted to do a if else statement where it also says:
If(there is a value inside that data-key)
...
To display the data-key attribute from inside the <div> tag, you can do the following:
from bs4 import BeautifulSoup
html = '<div class="Talkinghand" data-backing="ShowingHide" data-key="123456" data-theme="$MemeTheme" style=""></div>'
soup = BeautifulSoup(html, "html.parser")
print soup.div['data-key']
This would print:
123456
You would need to pass r.content to your soup call.
Your script had an extra ( and ), so the following would also work:
findKey = soup.find('div', {'class': 'Talkinghand'})['data-key']
print findKey