Error while web-scraping using BeautifulSoup - python

I am gathering housing data from Zillow's website. So far I have gathered data from the first page of results. For my next step, I am trying to find the link behind the next button, which will navigate me to page 2, page 3, and so on. I used the Inspect feature of Chrome to locate the next button, which has the following structure (href omitted):
<a class="on" href="...">Next</a>
I then used Beautiful Soup's find_all method and filtered on tag "a" and class "on". I used the following code to extract the links:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome(chromedriver)  # chromedriver: path to the ChromeDriver executable
zillow_bellevue_1 = "https://www.zillow.com/homes/Bellevue-WA-98004_rb/"
driver.get(zillow_bellevue_1)
soup = BeautifulSoup(driver.page_source, 'html.parser')
next_button = soup.find_all("a", class_="on")
print(next_button)
I am not getting any output. Any inputs on where I am going wrong?

The class for the next button appears to be "off", not "on". As such, you can scrape the details of each property and advance through all the pages as follows. It uses the requests library to get the HTML, which should be faster than using a Chrome driver.
from bs4 import BeautifulSoup
import requests

base_url = "https://www.zillow.com"
url = base_url + "/homes/Bellevue-WA-98004_rb/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}

while url:
    req = requests.get(url, headers=headers)
    soup = BeautifulSoup(req.content, 'html.parser')
    print('\n' + url)

    for div in soup.find_all('div', class_="zsg-photo-card-caption"):
        print(" {}".format(list(div.stripped_strings)))

    next_button = soup.find("a", class_="off", href=True)
    url = base_url + next_button['href'] if next_button else None
This continues requesting URLs until no next button is found. The output would be of the form:
https://www.zillow.com/homes/Bellevue-WA-98004_rb/
['New Construction', '$2,224,995+', '5 bds', '·', '4 ba', '·', '3,796+ sqft', 'The Castille Plan, Verano', 'D.R. Horton - Seattle']
['12 Central Square', '2', '$2,550+', '10290 NE 12th St, Bellevue, WA']
['Apartment For Rent', '$1,800/mo', '1 bd', '·', '1 ba', '·', '812 sqft', '10423 NE 32nd Pl APT E105, Bellevue, WA']
['House For Sale', '$1,898,000', '5 bds', '·', '4 ba', '·', '4,030 sqft', '3230 108th Ave SE, Bellevue, WA', 'Quorum Real Estate/Madison Inc']
['New Construction', '-- bds', '·', '-- ba', '·', '-- sqft', 'Coming Soon Plan, Northtowne', 'D.R. Horton - Seattle']
['The Meyden', '0', '$1,661+', '1', '$2,052+', '2', '$3,240+', '10333 Main St, Bellevue, WA']

I think it will be easier if you use soup.findAll.
My solution goes this way:
import re
import requests
from bs4 import BeautifulSoup

zillow_url = URL  # URL: the Zillow search results page to scrape
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}
response = requests.get(zillow_url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

# Reduce each listed price to its digits (strip currency symbols,
# commas and letter suffixes such as "/mo")
prices = ["$" + re.sub(r'(\s\d)|(\W)|([a-z]+)', "", div.text.split("/")[0])
          for div in soup.find_all('div', class_='list-card-price')]
# print(prices)
addresses = [div.text for div in
             soup.findAll('address', class_='list-card-addr')]
# Prefix relative links with the site root to make them absolute
urls = [a.get('href') if 'http' in a.get('href') else 'https://www.zillow.com' + a.get('href')
        for a in soup.find_all("a", class_="list-card-link list-card-link-top-margin list-card-img")]

Related

How to select and scrape specific texts out of a bunch of <ul> and <li> elements?

I need to scrape "2015" and "09/09/2015" from the below link:
lacentrale.fr/auto-occasion-annonce-87102353714.html
But since there are many li and ul elements, I can't scrape the exact text. I used the below code. Your help is highly appreciated.
from bs4 import BeautifulSoup

soup = BeautifulSoup(HTML)  # HTML: the page source, fetched beforehand
soup.find('span', {'class': 'optionLabel'}).find_next('span').get_text()
I am a fan of css selectors and :-soup-contains(), as mentioned in @Andrej's answer. So, just in case, here is an alternative approach, if it comes to the point that more options are needed.
Generate a dict with all options and pick the relevant value by option label as key:
data = dict((e.button.text, e.find_next('span').text) for e in soup.select('.optionLabel'))
data looks like:
{'Année': '2015', 'Mise en circulation': '09/09/2015', 'Contrôle technique': 'requis', 'Kilométrage compteur': '68 736 Km', 'Énergie': 'Electrique', 'Rechargeable': 'oui', 'Autonomie batterie': '190 Km', 'Capacité batterie': '22 kWh', 'Boîte de vitesse': 'automatique', 'Couleur extérieure': 'gris foncé metal', 'Couleur intérieure': 'cuir noir', 'Nombre de portes': '5', 'Nombre de places': '4', 'Garantie': '6 mois', 'Première main (déclaratif)': 'non', 'Nombre de propriétaires': '2', 'Puissance fiscale': '3 CV', 'Puissance din': '102 ch', 'Puissance moteur': '125 kW', "Crit'Air": '0', 'Émissions de CO2': '0 g/kmA', 'Norme Euro': 'EURO6', 'Prime à la conversion': ''}
Example
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36'}
url = 'https://www.lacentrale.fr/auto-occasion-annonce-87102353714.html'
soup = BeautifulSoup(requests.get(url, headers=headers).text)
data = dict((e.button.text,e.find_next('span').text) for e in soup.select('.optionLabel'))
print(data['Année'], data['Mise en circulation'], sep='\n')
Output
2015
09/09/2015
Try:
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0"
}
url = "https://www.lacentrale.fr/auto-occasion-annonce-87102353714.html"
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
v1 = soup.select_one('.optionLabel:-soup-contains("Année") + span')
v2 = soup.select_one(
    '.optionLabel:-soup-contains("Mise en circulation") + span'
)
print(v1.text)
print(v2.text)
Prints:
2015
09/09/2015

Why is this web scrape not working in Python?

I haven't changed the code attached recently. For the past few weeks, it has been working completely fine and always produced results. However, I used it today and for some reason it didn't work. Could you please help and provide a solution to the problem?
import requests, json
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {"q": "dji", "hl": "en", 'gl': 'us', 'tbm': 'shop'}

response = requests.get("https://www.google.com/search",
                        params=params,
                        headers=headers)
soup = BeautifulSoup(response.text, 'lxml')

# list with two dict() combined
shopping_data = []
shopping_results_dict = {}

for shopping_result in soup.select('.sh-dgr__content'):
    title = shopping_result.select_one('.Lq5OHe.eaGTj h4').text
    product_link = f"https://www.google.com{shopping_result.select_one('.Lq5OHe.eaGTj')['href']}"
    source = shopping_result.select_one('.IuHnof').text
    price = shopping_result.select_one('span.kHxwFf span').text
    try:
        rating = shopping_result.select_one('.Rsc7Yb').text
    except:
        rating = None
    try:
        reviews = shopping_result.select_one('.Rsc7Yb').next_sibling.next_sibling
    except:
        reviews = None
    try:
        delivery = shopping_result.select_one('.vEjMR').text
    except:
        delivery = None
    shopping_results_dict.update({
        'shopping_results': [{
            'title': title,
            'link': product_link,
            'source': source,
            'price': price,
            'rating': rating,
            'reviews': reviews,
            'delivery': delivery,
        }]
    })
    shopping_data.append(dict(shopping_results_dict))

print(title)
Because .select in for shopping_result in soup.select('.sh-dgr__content'): could not find any element, it gives you an empty list. Therefore the body of the for loop is not executed; Python skips over the loop entirely.
title only exists and is defined when the body of the for loop executes, so the final print(title) raises a NameError.
You should make sure you use a correct method to find your element(s).
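A minimal sketch of a guard that makes the failure visible instead of silent (the selectors are the ones from the question; note that Google may also be serving you a consent or CAPTCHA page, which you can check by printing response.text):
results = soup.select('.sh-dgr__content')
if not results:
    # Nothing matched: the class names may have changed, or the response
    # is not the page you expect (e.g. a block/CAPTCHA page)
    print('No shopping results found; inspect response.text and update the selectors')
for shopping_result in results:
    title = shopping_result.select_one('.Lq5OHe.eaGTj h4').text
    print(title)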

Web scraper returning zero in terminal

I'm trying to scrape this website
https://www.merinfo.se/search?d=c&ap=1&emp=0%3A20&rev=0%3A100&who=bygg&bf=1&page=1
And I've put a def getQuestions(tag) in the who={tag} part of the url, and that works fine. When I try to add def getQuestions(tag, page) with page={page}, it just returns 0 in the terminal, and I really have no clue what could be causing this.
Here is the full code:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'}

questionlist = []

def getQuestions(tag, page):
    url = 'https://www.merinfo.se/search?d=c&ap=1&emp=0%3A20&rev=0%3A100&who={bygg}&bf=1&page={page}'
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    questions = soup.find_all('div', {'class': 'box-white p-0 mb-4'})
    for item in questions:
        question = {
            'title': item.find('a', {'class': 'link-primary'}).text,
            'link': item.find('a', {'class': 'link-primary'})['href'],
            'nummer': item.find('a', {'class': 'link-body'})['href'],
            'address': item.find('address', {'class': 'mt-2 mb-0'}).text,
            'RegÅr': item.find('div', {'class': 'col text-center'}).text,
        }
        questionlist.append(question)
    return

for x in range(1, 5):
    getQuestions('bygg', x)

print(len(questionlist))
Any help would be appreciated. Best regards!
Change the string in the url variable to an f-string:
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36"
}

def getQuestions(tag, page):
    questionlist = []
    url = f"https://www.merinfo.se/search?d=c&ap=1&emp=0%3A20&rev=0%3A100&who={tag}&bf=1&page={page}"
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, "html.parser")
    questions = soup.find_all("div", {"class": "box-white p-0 mb-4"})
    for item in questions:
        question = {
            "title": item.find("a", {"class": "link-primary"}).text,
            "link": item.find("a", {"class": "link-primary"})["href"],
            "nummer": item.find("a", {"class": "link-body"})["href"],
            "address": item.find("address", {"class": "mt-2 mb-0"}).text,
            "RegÅr": item.find("div", {"class": "col text-center"}).text,
        }
        questionlist.append(question)
    return questionlist

out = []
for x in range(1, 5):
    out.extend(getQuestions("bygg", x))

print(len(out))
Prints:
80
Try changing your url to this:
url = f'https://www.merinfo.se/search?d=c&ap=1&emp=0%3A20&rev=0%3A100&who={tag}&bf=1&page={page}'
You didn't quite have your f-strings set up right.
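For example, without the f prefix the braces are kept as literal text instead of being substituted (url shortened here for illustration):
tag, page = 'bygg', 2
plain = 'https://www.merinfo.se/search?who={tag}&page={page}'
formatted = f'https://www.merinfo.se/search?who={tag}&page={page}'
print(plain)      # https://www.merinfo.se/search?who={tag}&page={page}
print(formatted)  # https://www.merinfo.se/search?who=bygg&page=2
The literal-brace url matches nothing on the site, so find_all returns an empty list and the count stays 0.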

webscraping script for booking.com doesn't work

I made a script to scrape the hotel name, rating and perks from the hotels on this page: link
Here's my script:
import numpy as np
import time
from random import randint
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import re
import random
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Referer': 'https://www.espncricinfo.com/',
    'Upgrade-Insecure-Requests': '1',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}
url0 = 'https://www.booking.com/searchresults.fr.html?label=gen173nr-1DCA0oTUIMZWx5c2Vlc3VuaW9uSA1YBGhNiAEBmAENuAEXyAEM2AED6AEB-AECiAIBqAIDuAL_5ZqEBsACAdICJDcxYjgyZmI2LTFlYWQtNGZjOS04Y2U2LTkwNTQyZjI5OWY1YtgCBOACAQ;sid=303509179a2849df63e4d1e5bc1ab1e3;dest_id=-1456928;dest_type=city&'
links1 = []
results = requests.get(url0, headers = headers)
soup = BeautifulSoup(results.text, "html.parser")
links1 = [a['href'] for a in soup.find("div", {"class": "hotellist sr_double_search"}).find_all('a', class_ = 'js-sr-hotel-link hotel_name_link url', href=True)]
root_url = 'https://www.booking.com/'
urls1 = [ '{root}{i}'.format(root=root_url, i=i) for i in links1 ]
pointforts = []
hotels = []
notes = []
for url in urls1:
    results = requests.get(url)
    soup = BeautifulSoup(results.text, "html.parser")
    try:
        div = soup.find("div", {"class": "hp_desc_important_facilities clearfix hp_desc_important_facilities--bui"})
        pointfort = [x['data-name-en'] for x in div.select('div[class*="important_facility"]')]
        pointforts.append(pointfort)
    except:
        pointforts.append('Nan')
    try:
        note = soup.find('div', class_='bui-review-score__badge').text
        notes.append(note)
    except:
        notes.append('Nan')
    try:
        hotel = soup.find("h2", attrs={"id": "hp_hotel_name"}).text.strip("\n").split("\n")[1]
        hotels.append(hotel)
    except:
        hotels.append('Nan')
data = pd.DataFrame({
    'Notes': notes,
    'Points fort': pointforts,
    'Nom': hotels})
#print(data.head(20))
data.to_csv('datatest.csv', sep=';', index=False, encoding = 'utf_8_sig')
It worked: I made a loop to scrape all the hotel links and then scraped the ratings and perks for each of those hotels. But I had duplicates, so instead of:
links1 = [a['href'] for a in soup.find("div", {"class": "hotellist sr_double_search"}).find_all('a', href=True)]
I put links1 = [a['href'] for a in soup.find("div", {"class": "hotellist sr_double_search"}).find_all('a', class_ = 'js-sr-hotel-link hotel_name_link url', href=True)], as you can see in my script above.
But now it doesn't work anymore: I obtain only Nan, whereas before, when I had duplicates, some rows were Nan but most of them had the perks and ratings I wanted. I don't understand why.
Here's the html for the hotel links: [screenshot omitted]
Here's the html to get the name (after obtaining the link, the script goes to that page): [screenshot omitted]
And here's the html to get all the perks related to the hotel (like the name, the script goes to the link scraped before): [screenshot omitted]
And here's my result: [screenshot omitted]
The href attributes on that website contain newlines: one at the start and some midway through. As such, when you combine them with root_url you do not get valid URLs.
A fix is to remove all the newlines. As the href always starts with a /, the trailing slash can also be removed from root_url, or you could use urllib.parse.urljoin().
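For illustration, a sketch of the urljoin() variant (the href value here is hypothetical, standing in for one of the scraped links with its embedded newlines):
from urllib.parse import urljoin

href = '\n/hotel/fr/example.fr.html\n?label=abc'  # hypothetical scraped href
clean = href.replace('\n', '')
print(urljoin('https://www.booking.com', clean))
# https://www.booking.com/hotel/fr/example.fr.html?label=abc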
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Referer': 'https://www.espncricinfo.com/',
    'Upgrade-Insecure-Requests': '1',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}
url0 = 'https://www.booking.com/searchresults.fr.html?label=gen173nr-1DCA0oTUIMZWx5c2Vlc3VuaW9uSA1YBGhNiAEBmAENuAEXyAEM2AED6AEB-AECiAIBqAIDuAL_5ZqEBsACAdICJDcxYjgyZmI2LTFlYWQtNGZjOS04Y2U2LTkwNTQyZjI5OWY1YtgCBOACAQ;sid=303509179a2849df63e4d1e5bc1ab1e3;dest_id=-1456928;dest_type=city&'
results = requests.get(url0, headers = headers)
soup = BeautifulSoup(results.text, "html.parser")
links1 = [a['href'].replace('\n','') for a in soup.find("div", {"class": "hotellist sr_double_search"}).find_all('a', class_ = 'js-sr-hotel-link hotel_name_link url', href=True)]
root_url = 'https://www.booking.com'
urls1 = [f'{root_url}{i}' for i in links1]
pointforts = []
hotels = []
notes = []
for url in urls1:
    results = requests.get(url)
    soup = BeautifulSoup(results.text, "html.parser")
    try:
        div = soup.find("div", {"class": "hp_desc_important_facilities clearfix hp_desc_important_facilities--bui"})
        pointfort = [x['data-name-en'] for x in div.select('div[class*="important_facility"]')]
        pointforts.append(pointfort)
    except:
        pointforts.append('Nan')
    try:
        note = soup.find('div', class_='bui-review-score__badge').text
        notes.append(note)
    except:
        notes.append('Nan')
    try:
        hotel = soup.find("h2", attrs={"id": "hp_hotel_name"}).text.strip("\n").split("\n")[1]
        hotels.append(hotel)
    except:
        hotels.append('Nan')
data = pd.DataFrame({
    'Notes': notes,
    'Points fort': pointforts,
    'Nom': hotels})
#print(data.head(20))
data.to_csv('datatest.csv', sep=';', index=False, encoding = 'utf_8_sig')
This would give you an output CSV file starting:
Notes;Points fort;Nom
8,3 ;['Parking (fee required)', 'Free WiFi Internet Access Included', 'Family Rooms', 'Airport Shuttle', 'Non Smoking Rooms', '24 hour Front Desk', 'Bar'];Elysées Union
8,4 ;['Free WiFi Internet Access Included', 'Family Rooms', 'Non Smoking Rooms', 'Pets allowed', '24 hour Front Desk', 'Rooms/Facilities for Disabled'];Hyatt Regency Paris Etoile
8,3 ;['Free WiFi Internet Access Included', 'Family Rooms', 'Non Smoking Rooms', 'Pets allowed', 'Restaurant', '24 hour Front Desk', 'Bar'];Pullman Paris Tour Eiffel
8,7 ;['Free WiFi Internet Access Included', 'Non Smoking Rooms', 'Restaurant', '24 hour Front Desk', 'Rooms/Facilities for Disabled', 'Elevator', 'Bar'];citizenM Paris Gare de Lyon

Website returning spoofed result/404 while scraping?

I am trying to scrape the following site. I tried using requests.get and parsing with Beautiful Soup, but it does not return the same result as when the page is viewed in a browser. I also tried directly calling the endpoint the site uses, but that returns a 404 error. I have tried using headers, but that has not solved it. How do I solve it?
Here is the code I used:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36 X-Requested-With: XMLHttpRequest'}
url = 'url'  # the site's url (omitted in the question)
x = requests.get(url, headers=headers)
The above code does return output, but it does not have the same content as the website; that is, the link to an article that appears in the browser is missing.
It uses ajax to load the page. I found the API.
The full url could be:
url = "https://legitquest.com/Search/GetResultBySelectedSearchResult?caseText=AIR+1950+SC+1&type=citation&filter=&sortBy=1&formattedCitation=AIR+1950+SC+1&removeFilter=&filterValueList=&_={}".format(str(time.time()).replace(".","")[:-4])
But for some reason it still couldn't crawl the page (this page uses strict rules to prevent crawling). Even when I used the right url, I couldn't get it at first.
I strongly recommend you use selenium; it will be easier.
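A minimal sketch of the selenium approach (assuming ChromeDriver is installed and url is the page from the question):
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get(url)  # the browser runs the site's JavaScript, so the results get rendered
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()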
In the end, though, I got it with requests:
import requests
import time
headers = {
    "X-Requested-With": "XMLHttpRequest"
}
url = 'https://legitquest.com/Search/GetResultBySelectedSearchResult?caseText=AIR+1950+SC+1&type=citation&filter=&sortBy=1&formattedCitation=AIR+1950+SC+1&removeFilter=&filterValueList=&_={}'.format(str(time.time()).replace(".","")[:-4])
x = requests.get(url,headers=headers)
print(x.json()["CaseDetails"][0]["LinkText"])
Result:
Sheth Maneklal Mansukhbhai V. Messrs. Hormusji Jamshedji Ginwallaand Sons
The json format:
{
    'filterList': '',
    'filterValueList': '',
    'caseText': 'AIR 1950 SC 1',
    'currentpage': 1,
    'CaseCount': 1,
    'openPopup': False,
    'UserId': '',
    'IsSubscribed': False,
    'IsMobileDevice': False,
    'CaseDetails': [{
        'LinkText': 'Sheth Maneklal Mansukhbhai V. Messrs. Hormusji Jamshedji Ginwallaand Sons',
        'PartyName': 'sheth-maneklal-mansukhbhai-vs-messrs.-hormusji-jamshedji-ginwallaand-sons',
        'SearchString': None,
        'CaseId': 21763,
        'EncryptedId': '1EBBB',
        'CourtName': 'Supreme Court Of India',
        'Id': 125883,
        'CourtId': 1,
        'CaseType': None,
        'HeadNotes': None,
        'Judges': "HON'BLE MR. JUSTICE M.C. MAHAJAN<BR />HON'BLE MR. JUSTICE SAIYID FAZAL ALI<BR />HON'BLE MR. JUSTICE B.K. MUKHERJEA",
        'DateOfJudgment': '21-03-1950',
        'Judgment': None,
        'OrderByDateTime': '/Date(-624326400000)/',
        'CaseNo': None,
        'Advocates': None,
        'CitationText': '',
        'CitatedCount': 0,
        'CopyText': None,
        'AlternativeCitation': '(1950) SCR 75 ; AIR 1950 SC 1 ; 1950 SCJ 317 ; (1950) 63 LW 495',
        'Petitioner': None,
        'Responder': None,
        'Citation': None,
        'Question': None,
        'HighlightedText': '',
        'IsFoundText': True,
        'IsOverruledExist': False,
        'IsDistinguishedExist': False,
        'IsOtherStatusExist': True,
        'OtherStatusImgUrl': 'https://www.legitquest.com/Content/themes/treatment/referred.svg',
        'OverruledImgUrl': None,
        'DistinguishedImgUrl': None,
        'BookmarkId': 0,
        'Chart': None,
        'CaseCitedCount': None,
        'SnapShot': None
    }]
}
On doing this:
import requests
from bs4 import BeautifulSoup as soup

url = 'https://legitquest.com/Home/GetCaseDetails?searchType=citation&publisher=AIR%201950%20SC%201'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36'}
page_html = requests.get(url, headers=headers)
print("Status Code:")
print(page_html.status_code)
page_soup = soup(page_html.content, features="lxml")
I got the result you require.
