Getting only numbers from BeautifulSoup instead of whole div - python
I am trying to learn Python by writing a small web scraping program to make life easier, but I am having trouble getting only the number when using BS4. I was able to get the price when I scraped an individual ad, but I would like to get all the prices from the listing page.
Here is my code:
from bs4 import BeautifulSoup
import requests
prices = []
url = 'https://www.kijiji.ca/b-cars-trucks/calgary/new__used/c174l1700199a49?ll=51.044733%2C-114.071883&address=Calgary%2C+AB&radius=50.0'
result = requests.get(url)
print (result.status_code)
src = result.content
soup = BeautifulSoup(src, 'html.parser')
print ("CLEARING")
price = soup.findAll("div", class_="price")
prices.append(price)
print (prices)
Here is my output
[<div class="price">
$46,999.00
<div class="dealer-logo">
<div class="dealer-logo-image">
<img src="https://i.ebayimg.com/00/s/NjBYMTIw/z/xMQAAOSwi9ZfoW7r/$_69.PNG"/>
</div>
</div>
</div>
Ideally, I would only want the output to be "46,999.00".
I tried passing text=True, but this did not work: all I got back was an empty list.
Thank you
You need to get the text portion of each tag and then strip everything except digits and the decimal point with a regex.
import re
def get_price_from_div(div_item):
    # Drop every character except digits and the decimal point, e.g. "$46,999.00" -> "46999.00"
    str_price = re.sub(r'[^0-9.]', '', div_item.text)
    float_price = float(str_price)
    return float_price
Just call this function in your code after you find the divs:
from bs4 import BeautifulSoup
import requests
prices = []
url = 'https://www.kijiji.ca/b-cars-trucks/calgary/new__used/c174l1700199a49?ll=51.044733%2C-114.071883&address=Calgary%2C+AB&radius=50.0'
result = requests.get(url)
print (result.status_code)
src = result.content
soup = BeautifulSoup(src, 'html.parser')
print ("CLEARING")
price = soup.findAll("div", class_="price")
prices.extend([get_price_from_div(curr_div) for curr_div in price])
print (prices)
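One caveat with the regex approach: if a price div contains no digits at all, re.sub leaves an empty string and float('') raises ValueError. Whether such divs actually occur on this page is an assumption, not something confirmed here. The following is a minimal sketch of a more defensive variant that works on plain strings standing in for div_item.text; parse_price and the sample strings (including the hypothetical "Please Contact" listing) are illustrative names, not part of the original answer.

```python
import re
from decimal import Decimal

def parse_price(text):
    """Return the first dollar amount in text as a Decimal, or None if there is none."""
    match = re.search(r"\$\s*([\d,]+(?:\.\d+)?)", text)
    if match is None:
        return None
    # Remove the thousands separators before converting.
    return Decimal(match.group(1).replace(",", ""))

# Sample strings standing in for div_item.text.
# "Please Contact" is a hypothetical listing without a numeric price.
samples = ["\n$46,999.00\n", "Please Contact", "$5,500.00"]
prices = [parse_price(s) for s in samples]
print(prices)
```

Decimal is used here instead of float only to avoid binary rounding on currency values; float would work the same way for display purposes.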
An option without regex is to keep only the tags whose text starts with a dollar sign ($), checked with str.startswith(), and then drop that leading character:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.kijiji.ca/b-cars-trucks/calgary/new__used/c174l1700199a49?ll=51.044733%2C-114.071883&address=Calgary%2C+AB&radius=50.0'
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
price_tags = soup.find_all("div", class_="price")
prices = [
    tag.get_text(strip=True)[1:] for tag in price_tags
    if tag.get_text(strip=True).startswith('$')
]
print(prices)
Output:
['48,888.00', '21,999.00', '44,488.00', '5,500.00', '33,000.00', '14,900.00', '1,750.00', '35,600.00', '1,800.00', '25,888.00', '36,888.00', '32,888.00', '30,888.00', '18,888.00', '21,888.00', '29,888.00', '22,888.00', '30,888.00', '17,888.00', '17,888.00', '16,888.00', '22,888.00', '22,888.00', '34,888.00', '31,888.00', '32,888.00', '30,888.00', '21,888.00', '15,888.00', '21,888.00', '28,888.00', '19,888.00', '18,888.00', '30,995.00', '30,995.00', '30,995.00', '19,888.00', '47,995.00', '21,888.00', '46,995.00', '32,888.00', '29,888.00', '26,888.00', '21,888.00']
Related
How to extract specific part of html using Beautifulsoup?
I am trying to extract the value of the title attribute from the following HTML, but so far I haven't managed to.
<div class="pull_right date details" title="22.12.2022 01:49:03 UTC-03:00">
This is my code:
from bs4 import BeautifulSoup
with open("messages.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')
results = soup.find_all('div', attrs={'class':'pull_right date details'})
print(results)
And the output is a list of all the matching <div> elements in the HTML file.
To access the value of title, simply index the tag with ['title']. If you use find_all, it returns a list, so you will also need an index (e.g. [0]['title']). For example:
from bs4 import BeautifulSoup
fp = '<html><div class="pull_right date details" title="22.12.2022 01:49:03 UTC-03:00"></html>'
soup = BeautifulSoup(fp, 'html.parser')
results = soup.find_all('div', attrs={'class':'pull_right date details'})
print(results[0]['title'])
Or:
results = soup.find('div', attrs={'class':'pull_right date details'})
print(results['title'])
Output:
22.12.2022 01:49:03 UTC-03:00
22.12.2022 01:49:03 UTC-03:00
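The title value comes back as a plain string. If it needs to be handled as a date, a minimal sketch with the standard library could look like this. It assumes the day.month.year order seen in the example and Python 3.7 or later, where %z accepts an offset written with a colon.

```python
from datetime import datetime

# The title value from the example above, as a plain string.
title = "22.12.2022 01:49:03 UTC-03:00"

# Day.month.year, then the time, then the literal "UTC" followed by a +HH:MM / -HH:MM offset.
parsed = datetime.strptime(title, "%d.%m.%Y %H:%M:%S UTC%z")
print(parsed.isoformat())
```

The result is a timezone-aware datetime, so values from different offsets can be compared directly.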
bs4: splitting text with same class - python
I am web scraping for the first time and ran into a problem: several elements share the same class name. This is the code:
testlink = 'https://www.ah.nl/producten/product/wi387906/wasa-volkoren'
r = requests.get(testlink)
soup = BeautifulSoup(r.content, 'html.parser')
products = (soup.findAll('dd', class_='product-info-definition-list_value__kspp6'))
And this is the output:
[<dd class="product-info-definition-list_value__kspp6">13 g</dd>, <dd class="product-info-definition-list_value__kspp6">20</dd>, <dd class="product-info-definition-list_value__kspp6">Rogge, Glutenbevattende Granen</dd>, <dd class="product-info-definition-list_value__kspp6">Sesamzaad, Melk</dd>]
I need to get the third element (Rogge, Glutenbevattende Granen)... I am using this link to test, and eventually want to scrape multiple pages of the website. Does anyone have any tips? Thank you!
You can select all of the dd tags with the class value product-info-definition-list_value__kspp6 and then use list slicing:
import requests
from bs4 import BeautifulSoup
url='https://www.ah.nl/producten/pasta-rijst-en-wereldkeuken?page={page}'
for page in range(1,11):
    req = requests.get(url.format(page=page))
    soup = BeautifulSoup(req.content, 'html.parser')
    for link in soup.select('div[class="product-card-portrait_content__2xN-b"] a'):
        abs_url = 'https://www.ah.nl' + link.get('href')
        #print(abs_url)
        req2 = requests.get(abs_url)
        soup2 = BeautifulSoup(req2.content, 'html.parser')
        dd = [d.get_text() for d in soup2.select('dd[class="product-info-definition-list_value__kspp6"]')][2:-2]
        print(dd)
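What the [2:-2] slice returns depends on how many dd elements each product page contains, which is not confirmed here. As a minimal sketch, the four values from the question's output are copied below as plain strings (so neither bs4 nor network access is needed) to show the difference between indexing and slicing. With exactly four values, the slice comes back empty, while index 2 gives the third element directly.

```python
# The four values shown in the question's output, copied as plain strings.
values = ["13 g", "20", "Rogge, Glutenbevattende Granen", "Sesamzaad, Melk"]

third = values[2]            # indexing returns a single string
middle = values[2:-2]        # slicing returns a list; empty here because start == stop
third_as_list = values[2:3]  # a slice that keeps only the third element

print(third)
print(middle)
print(third_as_list)
```

With bs4 results the same indexing applies to the list returned by findAll or select, followed by get_text() on the chosen tag.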
How can I scrape Songs Title from this request that I have collected using python
import requests
from bs4 import BeautifulSoup
r = requests.get("https://gaana.com/playlist/gaana-dj-hindi-top-50-1")
soup = BeautifulSoup(r.text, "html.parser")
result = soup.find("div", {"class": "s_c"})
print(result)
From the above code, I am able to scrape this data: https://www.pastiebin.com/5f08080b8db82
Now I would like to scrape only the titles of the songs and make a list out of them, like the one below:
Meri Aashiqui
Genda Phool
Any suggestions are much appreciated!
Try this:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://gaana.com/playlist/gaana-dj-hindi-top-50-1")
soup = BeautifulSoup(r.text, "html.parser")
result = soup.find("div", {"class": "s_c"})
#print(result)
div = result.find_all('div', class_='track_npqitemdetail')
name_list = []
for x in div:
    span = x.find('span').text
    name_list.append(span)
print(name_list)
This code collects all of the song names in name_list.
Certain content not loading when scraping a site with Beautiful Soup
I'm trying to scrape the ratings off recipes on NYT Cooking but having issues getting the content I need. When I look at the source on the NYT page, I see the following:
<div class="ratings-rating">
<span class="ratings-header ratings-content">194 ratings</span>
<div class="ratings-stars-wrap">
<div class="ratings-stars ratings-content four-star-rating avg-rating">
The content I'm trying to pull out is 194 ratings and four-star-rating. However, when I pull in the page source via Beautiful Soup I only see this:
<div class="ratings-rating">
<span class="ratings-header ratings-content"><%= header %></span>
<div class="ratings-stars-wrap">
<div class="ratings-stars ratings-content <%= ratingClass %> <%= state %>">
The code I'm using is:
url = 'https://cooking.nytimes.com/recipes/1020049-lemony-chicken-soup-with-fennel-and-dill'
r = get(url, headers = headers, timeout=15)
page_soup = soup(r.text,'html.parser')
Any thoughts why that information isn't pulling through?
Try using the code below:
import requests
import lxml
from lxml import html
import re
url = "https://cooking.nytimes.com/recipes/1019706-spiced-roasted-cauliflower-with-feta-and-garlic?action=click&module=Recirculation%20Band%20Recipe%20Card&region=More%20recipes%20from%20Alison%20Roman&pgType=recipedetails&rank=1"
r = requests.get(url)
tree = html.fromstring(r.content)
t = tree.xpath('/html/body/script[14]')[0]
# look for the value of bootstrap.recipe.avg_rating
m = re.search("bootstrap.recipe.avg_rating = ", t.text)
colon = re.search(";", t.text[m.end()::])
rating = t.text[m.end():m.end()+colon.start()]
print(rating)
# look for the value of bootstrap.recipe.num_ratings
n = re.search("bootstrap.recipe.num_ratings = ", t.text)
colon2 = re.search(";", t.text[n.end()::])
star = t.text[n.end():n.end()+colon2.start()]
print(star)
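The search-then-slice steps above can also be written as a single regex with a capture group. This is only a sketch: script_text is a hypothetical stand-in for t.text (the real script's values and layout may differ), and js_value is an illustrative helper name.

```python
import re

# Hypothetical script text standing in for t.text; the real page's content may differ.
script_text = """
bootstrap.recipe.avg_rating = 4;
bootstrap.recipe.num_ratings = 194;
"""

def js_value(name, text):
    """Return the text between 'name = ' and the next semicolon, or None if absent."""
    # re.escape keeps the dots in the variable name from matching arbitrary characters.
    match = re.search(re.escape(name) + r"\s*=\s*([^;]*);", text)
    return match.group(1).strip() if match else None

avg_rating = js_value("bootstrap.recipe.avg_rating", script_text)
num_ratings = js_value("bootstrap.recipe.num_ratings", script_text)
print(avg_rating, num_ratings)
```

Returning None for a missing name avoids the AttributeError that m.end() would raise when re.search finds nothing.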
It is much easier to use attribute = value selectors to grab the values from the span with class ratings-metadata:
import requests
from bs4 import BeautifulSoup
data = requests.get('https://cooking.nytimes.com/recipes/1020049-lemony-chicken-soup-with-fennel-and-dill')
soup = BeautifulSoup(data.content, 'lxml')
rating = soup.select_one('[itemprop=ratingValue]').text
ratingCount = soup.select_one('[itemprop=ratingCount]').text
print(rating, ratingCount)
Data missing on requests.get() Python 2
I want to web scrape the IAA Consensus price on https://www.settrade.com/AnalystConsensus/C04_10_stock_saa_p1.jsp?txtSymbol=PTT&ssoPageId=9&selectPage=10
In Google Chrome's inspector, the data is in an <h3> element that I can target through BeautifulSoup. But from the print page.content I get
...
<h3 class="colorGreen"></h3>
...
where it should be
<h3 class="colorGreen">62.00</h3>
Here's my code:
import requests
from bs4 import BeautifulSoup
def findPrice(Quote):
    link = "http://www.settrade.com/AnalystConsensus/C04_10_stock_saa_p1.jsp?txtSymbol="+Quote+"&ssoPageId=9&selectPage=10"
    page = requests.get(link)
    soup = BeautifulSoup(page.content,'html.parser')
    print page.content
    target = soup.findAll('h3')
    return target.string
findPrice('PTT')
I guess the server checks for a LstQtLst cookie and, when it is present, generates the HTML with the "Consensus Target Price" filled in.
import requests
from bs4 import BeautifulSoup
def find_price(quote):
    link = ('http://www.settrade.com/AnalystConsensus/C04_10_stock_saa_p1.jsp'
            '?txtSymbol={}'
            '&ssoPageId=9'
            '&selectPage=10'.format(quote))
    html = requests.get(link, cookies={'LstQtLst': quote}).text
    soup = BeautifulSoup(html, 'html.parser')
    price = soup.find('h3').string
    return price
>>> find_price('PTT')
62.00