I am web scraping for the first time, and ran into a problem: some classes have the same name.
This is the code:
testlink = 'https://www.ah.nl/producten/product/wi387906/wasa-volkoren'
r = requests.get(testlink)
soup = BeautifulSoup(r.content, 'html.parser')
products = (soup.findAll('dd', class_='product-info-definition-list_value__kspp6'))
And this is the output
[<dd class="product-info-definition-list_value__kspp6">13 g</dd>, <dd class="product-info-definition-list_value__kspp6">20</dd>, <dd class="product-info-definition-list_value__kspp6">Rogge, Glutenbevattende Granen</dd>, <dd class="product-info-definition-list_value__kspp6">Sesamzaad, Melk</dd>]
I need to get the 3rd class (Rogge, Glutenbevattende Granen)... I am using this link to test, and eventually want to scrape multiple pages of the website. Anyone any tips?
Thank you!
You can select all of dd tags with class value product-info-definition-list_value__kspp6 and list slicing
import requests
from bs4 import BeautifulSoup
url='https://www.ah.nl/producten/pasta-rijst-en-wereldkeuken?page={page}'
for page in range(1,11):
req = requests.get(url.format(page=page))
soup = BeautifulSoup(req.content, 'html.parser')
for link in soup.select('div[class="product-card-portrait_content__2xN-b"] a'):
abs_url = 'https://www.ah.nl' + link.get('href')
#print(abs_url)
req2 = requests.get(abs_url)
soup2 = BeautifulSoup(req2.content, 'html.parser')
dd = [d.get_text() for d in soup2.select('dd[class="product-info-definition-list_value__kspp6"]')][2:-2]
print(dd)
I am trying to learn python by creating a small websraping program to make life easier, although I am having issues with only getting number when using BS4. I was able to get the price when I scraped an actual ad, but I would like to get all the prices from the page.
Here is my code:
from bs4 import BeautifulSoup
import requests
prices = []
url = 'https://www.kijiji.ca/b-cars-trucks/calgary/new__used/c174l1700199a49?ll=51.044733%2C-114.071883&address=Calgary%2C+AB&radius=50.0'
result = requests.get(url)
print (result.status_code)
src = result.content
soup = BeautifulSoup(src, 'html.parser')
print ("CLEARING")
price = soup.findAll("div", class_="price")
prices.append(price)
print (prices)
Here is my output
[<div class="price">
$46,999.00
<div class="dealer-logo">
<div class="dealer-logo-image">
<img src="https://i.ebayimg.com/00/s/NjBYMTIw/z/xMQAAOSwi9ZfoW7r/$_69.PNG"/>
</div>
</div>
</div>
Ideally, I would only want the output to be "46,999.00".
I tried with text=True, although this did not work and I would not get any output from it besides an empty list.
Thank you
You need to get the text portion of tag and then perform some regex processing on it.
import re
def get_price_from_div(div_item):
str_price = re.sub('[^0-9\.]','', div_item.text)
float_price = float(str_price)
return float_price
Just call this method in your code after you find the divs
from bs4 import BeautifulSoup
import requests
prices = []
url = 'https://www.kijiji.ca/b-cars-trucks/calgary/new__used/c174l1700199a49?ll=51.044733%2C-114.071883&address=Calgary%2C+AB&radius=50.0'
result = requests.get(url)
print (result.status_code)
src = result.content
soup = BeautifulSoup(src, 'html.parser')
print ("CLEARING")
price = soup.findAll("div", class_="price")
prices.extend([get_price_from_div(curr_div) for curr_div in price])
print (prices)
An option without using RegEx, is to filter out tags that startwith() a dollar sign $:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.kijiji.ca/b-cars-trucks/calgary/new__used/c174l1700199a49?ll=51.044733%2C-114.071883&address=Calgary%2C+AB&radius=50.0'
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
price_tags = soup.find_all("div", class_="price")
prices = [
tag.get_text(strip=True)[1:] for tag in price_tags
if tag.get_text(strip=True).startswith('$')
]
print(prices)
Output:
['48,888.00', '21,999.00', '44,488.00', '5,500.00', '33,000.00', '14,900.00', '1,750.00', '35,600.00', '1,800.00', '25,888.00', '36,888.00', '32,888.00', '30,888.00', '18,888.00', '21,888.00', '29,888.00', '22,888.00', '30,888.00', '17,888.00', '17,888.00', '16,888.00', '22,888.00', '22,888.00', '34,888.00', '31,888.00', '32,888.00', '30,888.00', '21,888.00', '15,888.00', '21,888.00', '28,888.00', '19,888.00', '18,888.00', '30,995.00', '30,995.00', '30,995.00', '19,888.00', '47,995.00', '21,888.00', '46,995.00', '32,888.00', '29,888.00', '26,888.00', '21,888.00']
I wrote a code for scraping one real estate website. This is the link:
https://www.nekretnine.rs/stambeni-objekti/stanovi/lista/po-stranici/10/
From this page I can get only location, size and price of the apartment, but Is it possible to write a code that will go on page of each appartment and scrape values from it, because it contains much more info. Check this link:
https://www.nekretnine.rs/stambeni-objekti/stanovi/arena-bulevar-arsenija-carnojevica-97m-2-lode-energoprojekt/NkvJK0Ou5tV/
I have posted a code. I noticed that my url changes when I click on specific real estate. For example:
arena-bulevar-arsenija-carnojevica-97m-2-lode-energoprojekt/NkvJK0Ou5tV/
I taught about creating for loop, but there is no way to know how it changes because it has some id number at the end:
NkvJK0Ou5tV
This is the code that I have:
from bs4 import BeautifulSoup
import requests
website = "https://www.nekretnine.rs/stambeni-objekti/stanovi/lista/po-stranici/10/"
soup = requests.get(website).text
my_html = BeautifulSoup(soup, 'lxml')
lokacija = my_html.find_all('p', class_='offer-location text-truncate')
ukupna_kvadratura = my_html.find_all('p', class_='offer-price offer-price--invert')
ukupna_cena = my_html.find_all('div', class_='d-flex justify-content-between w-100')
ukupni_opis = my_html.find_all('div', class_='mt-1 mb-1 mt-lg-0 mb-lg-0 d-md-block offer-meta-info offer-adress')
for lok, kvadratura, cena_stana, sumarno in zip(lokacija, ukupna_kvadratura, ukupna_cena, ukupni_opis):
lok = lok.text.split(',')[0] #lokacija
kv = kvadratura.span.text.split(' ')[0] #kvadratura
jed = kvadratura.span.text.split(' ')[1] #jedinica mere
cena = cena_stana.span.text #cena
sumarno = sumarno.text
datum = sumarno.split('|')[0].strip()
status = sumarno.split('|')[1].strip()
opis = sumarno.split('|')[2].strip()
print(lok, kv, jed, cena, datum, status, opis)
You can get href from div class="placeholder-preview-box ratio-4-3".
From here you can find the URL.
You can iterate over the links provided by the pagination at the bottom of the page:
from bs4 import BeautifulSoup as soup
import requests
d = soup(requests.get('https://www.nekretnine.rs/stambeni-objekti/stanovi/lista/po-stranici/10/').text, 'html.parser')
def scrape_page(page):
return [{'title':i.h2.get_text(strip=True), 'loc':i.p.get_text(strip=True), 'price':i.find('p', {'class':'offer-price'}).get_text(strip=True)} for i in page.find_all('div', {'class':'row offer'})]
result = [scrape_page(d)]
while d.find('a', {'class':'pagination-arrow arrow-right'}):
d = soup(requests.get(f'https://www.nekretnine.rs{d.find("a", {"class":"pagination-arrow arrow-right"})["href"]}').text, 'html.parser')
result.append(scrape_page(d))
i want to get the name of the url target from a webpage
this is what is have done so far :
check ='https://www.zap.co.il/search.aspx?keyword='+'N3580-5092'
r = requests.get(check)
html = requests.get(r.url)
bsObj = BeautifulSoup(html.content,'xml')
storeName = bsObj.select_one('div.StoresLines div.BuyButtonsTxt')
the result is :
<div class="BuyButtonsTxt">
ב-<a aria-label="לקנייה ב-פיסי אונליין Dell Inspiron 15 3580
N3580-5092" href="/fs.aspx?pid=666473435&sog=c-pclaptop" id=""
target="_blank">פיסי אונליין</a>
</div>
i want only the value in the href : "פיסי אונליין"
how to do it ?
I had to change bsObj = BeautifulSoup(html.content,'xml') to bsObj = BeautifulSoup(html.content,'html.parser'), as the 'xml' wouldn't find the tag for me
from bs4 import BeautifulSoup
import requests
check ='https://www.zap.co.il/search.aspx?keyword='+'N3580-5092'
r = requests.get(check)
html = requests.get(r.url)
bsObj = BeautifulSoup(html.content,'html.parser')
storeName = bsObj.select_one('div.StoresLines div.BuyButtonsTxt')
text = storeName.find('a').text
Output:
'פיסי אונליין'
I want to webscrape the IAA Consensus price on https://www.settrade.com/AnalystConsensus/C04_10_stock_saa_p1.jsp?txtSymbol=PTT&ssoPageId=9&selectPage=10
In Google chrome inspect elements, I can use <h3> through beautifulsoup to get the data. But from the print page.content I get
...
<h3 class="colorGreen"></h3>
...
Where it should be <h3 class="colorGreen">62.00</h3>
Here's my code
import requests
from bs4 import BeautifulSoup
def findPrice(Quote):
link = "http://www.settrade.com/AnalystConsensus/C04_10_stock_saa_p1.jsp?txtSymbol="+Quote+"&ssoPageId=9&selectPage=10"
page = requests.get(link)
soup = BeautifulSoup(page.content,'html.parser')
print page.content
target = soup.findAll('h3')
return target.string
findPrice('PTT')
I guess, the server is checking for a LstQtLst cookie and generates the HTML with the "Consensus Target Price" filled in.
import requests
from bs4 import BeautifulSoup
def find_price(quote):
link = ('http://www.settrade.com/AnalystConsensus/C04_10_stock_saa_p1.jsp'
'?txtSymbol={}'
'&ssoPageId=9'
'&selectPage=10'.format(quote))
html = requests.get(link, cookies={'LstQtLst': quote}).text
soup = BeautifulSoup(html, 'html.parser')
price = soup.find('h3').string
return price
>>> find_price('PTT')
62.00