BeautifulSoup Python: Goibibo hotel rates

I'm trying to extract hotel rates from Goibibo.
URL: https://www.goibibo.com/hotels/hotels-in-ahmedabad-ct/
I'm using the following code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
# target URL to scrape
url = ""
# headers
headers = {
    'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"
}
response = requests.request("GET", url, headers=headers)
data = BeautifulSoup(response.text, 'html.parser')
print(data)
cards_data = data.find_all('div', attrs={'class': 'HotelCardstyles__HotelCardInfoWrapperDiv-sc-1s80tyk-7 iLLynP'})
print('Total Number of Cards Found:', len(cards_data))
for card in cards_data:
    hotel_name = card.find('a')
    room_price = card.find('p', attrs={'class': 'HotelCardstyles__CurrentPrice-sc-1s80tyk-28 inUyrJ'})
    print(hotel_name.text, room_price.text)
The problem I have is that the given URL picks up default booking dates. When I change the booking dates to the desired values, and adjust the search parameters accordingly, the output turns into 0 cards found.
url with updated dates : https://www.goibibo.com/hotels/find-hotels-in-Jaipur/4278754392716898526/4278754392716898526/%7B%22ci%22:%2220210520%22,%22co%22:%2220210521%22,%22r%22:%221-2-0%22%7D/?{%22filter%22:{}}&sec=dom&cc=IN
I am not able to understand what changes, and what I need to change, in order to get those cards. Any help will be appreciated.

When the date filters are changed, the actual content is served differently. You can see this by opening the Network tab of the Google Chrome developer tools (or your browser's equivalent).
In this particular example you provided, the data comes in JSON form from this URL:
https://hermes.goibibo.com/hotels/v12/search/data/v3/4278754392716898526/20210520/20210521/1-2-0?s=popularity&cur=INR&tmz=-120
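For reference, here is a minimal sketch of querying that endpoint directly with requests. The URL segments mirror the page URL above (city/search id, check-in and check-out dates as YYYYMMDD, and the room string, likely rooms-adults-children); the exact JSON keys for hotel names and prices are an assumption you should verify against the live response:

import requests

# hit the JSON endpoint directly instead of scraping the HTML page
url = ('https://hermes.goibibo.com/hotels/v12/search/data/v3/'
       '4278754392716898526/20210520/20210521/1-2-0')
params = {'s': 'popularity', 'cur': 'INR', 'tmz': '-120'}
headers = {
    'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"
}

response = requests.get(url, params=params, headers=headers)
response.raise_for_status()
data = response.json()

# assumption: inspect the top-level keys first to find where the hotel
# cards live before drilling down to names and prices
print(list(data.keys()))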

Related

BeautifulSoup scrapes very few listing prices instead of all listing prices on a page

I want to scrape data from a real estate website for my education project. I am using BeautifulSoup. I wrote the following code. The code runs properly but returns very little data.
import requests
from bs4 import BeautifulSoup
url = "https://www.zillow.com/homes/San-Francisco,-CA_rb/"
headers = {
    "Accept-Language": "en-GB,en;q=0.5",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:104.0) Gecko/20100101 Firefox/104.0"
}
response = requests.get(url=url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
prices = soup.find_all("span", attrs={"data-test":True})
prices_list = [price.getText().strip("+,/,m,o,1,bd, ") for price in prices]
print(prices_list)
The output of this only shows the first nine listings' prices.
['$2,959', '$2,340', '$2,655', '$2,632', '$2,524', '$2,843', '$2,64', '$2,300', '$2,604']
It's because the content is created progressively with continuous requests (lazy loading). You could try to reverse engineer the backend of the site. I'll look into it, and if I find an easy solution I'll update the answer. :)
The API call to their backend looks something like this: https://www.zillow.com/search/GetSearchPageState.htm?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22usersSearchTerm%22%3A%22San%20Francisco%2C%20CA%22%2C%22mapBounds%22%3A%7B%22west%22%3A-123.07190982226562%2C%22east%22%3A-121.79474917773437%2C%22south%22%3A37.63132659190023%2C%22north%22%3A37.918977518603874%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A20330%2C%22regionType%22%3A6%7D%5D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22sortSelection%22%3A%7B%22value%22%3A%22days%22%7D%2C%22isAllHomes%22%3A%7B%22value%22%3Atrue%7D%7D%2C%22isListVisible%22%3Atrue%7D&wants={%22cat1%22:[%22mapResults%22]}&requestId=3
You need to handle cookies correctly in order to see the results, but it delivers around 1000 results. Have fun :)
UPDATE:
Once you have saved the JSON response to a file, reading the results back should look like this:
import json

with open("GetSearchPageState.json", "r") as f:
    a = json.load(f)

print(a["cat1"]["searchResults"]["mapResults"])

Unable to fetch tabular content from a site using requests

I'm trying to fetch tabular content from a webpage using the requests module. After navigating to that webpage, when I manually type 0466425389 right next to Company number and hit the search button, the table is produced accordingly. However, when I mimic the same using requests, I get the following response.
<?xml version='1.0' encoding='UTF-8'?>
<partial-response><redirect url="/bc9/web/catalog"></redirect></partial-response>
I've tried with:
import requests
link = 'https://cri.nbb.be/bc9/web/catalog?execution=e1s1'
payload = {
    'javax.faces.partial.ajax': 'true',
    'javax.faces.source': 'page_searchForm:actions:0:button',
    'javax.faces.partial.execute': 'page_searchForm',
    'javax.faces.partial.render': 'page_searchForm page_listForm pageMessagesId',
    'page_searchForm:actions:0:button': 'page_searchForm:actions:0:button',
    'page_searchForm': 'page_searchForm',
    'page_searchForm:j_id3:generated_number_2_component': '0466425389',
    'page_searchForm:j_id3:generated_name_4_component': '',
    'page_searchForm:j_id3:generated_address_zipCode_6_component': '',
    'page_searchForm:j_id3_activeIndex': '0',
    'page_searchForm:j_id2_stateholder': 'panel_param_visible;',
    'page_searchForm:j_idt133_stateholder': 'panel_param_visible;',
    'javax.faces.ViewState': 'e1s1'
}
headers = {
    'Faces-Request': 'partial/ajax',
    'X-Requested-With': 'XMLHttpRequest',
    'Accept': 'application/xml, text/xml, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate, br',
    'Host': 'cri.nbb.be',
    'Origin': 'https://cri.nbb.be',
    'Referer': 'https://cri.nbb.be/bc9/web/catalog?execution=e1s1'
}
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    s.get(link)
    s.headers.update(headers)
    res = s.post(link, data=payload)
    print(res.text)
How can I fetch tabular content from that site using requests?
From looking at the "action" attribute on the search form, the form appears to generate a new JSESSIONID every time it is opened, and this seems to be a required attribute. I had some success by including this in the URL.
You don't need to explicitly set the headers other than the User-Agent.
I added some code: (a) to pull out the "action" attribute of the form using BeautifulSoup - you could do this with regex if you prefer, (b) to get the url from that redirection XML that you showed at the top of your question.
import re
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
...
with requests.Session() as s:
    s.headers["User-Agent"] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36"
    # GET to get the search form
    req1 = s.get(link)
    # Get the form action
    soup = BeautifulSoup(req1.text, "lxml")
    form = soup.select_one("#page_searchForm")
    form_action = urljoin(link, form["action"])
    # POST the form
    req2 = s.post(form_action, data=payload)
    # Extract the target from the redirection XML response
    target = re.search('url="(.*?)"', req2.text).group(1)
    # Final GET to get the search result
    req3 = s.get(urljoin(link, target))
    # Parse and print (some of) the result
    soup = BeautifulSoup(req3.text, "lxml").body
    for detail in soup.select(".company-details tr"):
        columns = detail.select("td")
        if columns:
            print(f"{columns[0].text.strip()}: {columns[1].text.strip()}")
Result:
Company number: 0466.425.389
Name: A en B PARTNERS
Address: Quai de Willebroeck 37
: BE 1000 Bruxelles
Municipality code NIS: 21004 Bruxelles
Legal form: Cooperative company with limited liability
Legal situation: Normal situation
Activity code (NACE-BEL)
The activity code of the company is the statistical activity code in use on the date of consultation, given by the CBSO based on the main activity codes available at the Crossroads Bank for Enterprises and supplementary informations collected from the companies: 69201 - Accountants and fiscal advisors
I don't think requests can handle dynamic web pages like this one. I used helium and pandas to do the work.
import helium as he
import pandas as pd

url = 'https://cri.nbb.be/bc9/web/catalog?execution=e1s1'
driver = he.start_chrome(url)
he.write('0466425389', into='Company number')
he.click('Search')
he.wait_until(he.Button('New search').exists)
# show 100 results per page instead of the default 10
he.select(he.ComboBox('10'), '100')
he.wait_until(he.Button('New search').exists)
with open('download.html', 'w') as html:
    html.write(driver.page_source)
he.kill_browser()
df = pd.read_html('download.html')
print(df[2])
Output: the search results table parsed from the saved page.

Having trouble setting up a web scraper with Python

Three days ago I started learning Python to create a web scraper and collect information about new book releases. I'm stuck on one of my target websites... I know this is a really basic question, but I've watched some videos, looked at many related questions on Stack Overflow, tried more than 10 different solutions, and nothing. If anybody could help, much appreciated:
My problem:
I can retrieve the title information but can't retrieve the price information.
Data Source:
https://www.bloomsbury.com/uk/non-fiction/business-and-management/?pagesize=25
My code:
from bs4 import BeautifulSoup
import requests
import csv

url = 'https://www.bloomsbury.com/uk/non-fiction/business-and-management/?pagesize=25'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}
source = requests.get(url, headers=headers).text
soup = BeautifulSoup(source, 'lxml')

# code to retrieve title
for productdetails in soup.find_all("div", class_='figDetails'):
    producttitle = productdetails.a.text
    print(producttitle)

# code to retrieve price
for productpricedetails in soup.find_all("div", class_='related-products-block'):
    productprice = productpricedetails.find("div", class_="new-price").span.text
    print(productprice)
There are two elements named span; I need the information in the second one but don't know how to get to it.
Also, while trying different possible solutions I kept getting a NoneType error...
It looks like the source you're trying to scrape populates this data via JavaScript.
Viewing the source of the page, you can see the raw HTML shows that the div you're trying to target is empty:
<html>
...
<div class="related-products-block" id="reletedProduct_490420">
</div>
...
</html>
You can also see this if you update your second loop like so:
for productpricedetails in soup.find_all("div", class_="related-products-block"):
    print(productpricedetails)
Edit:
As a bonus, you can inspect the JavaScript the page uses. It is very easy to understand, and the request simply returns the HTML you are looking for. Preparing the JSON for the request is a bit more involved, but here's an example:
import json
import requests

url = "https://www.bloomsbury.com/uk/catalog/RelatedProductsData"
payload = {"productId": 490420, "type": "List", "ordertype": 0, "formatType": 0}
headers = {"Content-Type": "application/json"}

# serialise the payload so the body actually matches the JSON Content-Type
response = requests.post(url, headers=headers, data=json.dumps(payload))
print(response.text)
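Continuing from the response above, and assuming the returned fragment matches the markup the page injects (the "new-price" class and the second span come from the question, so verify them against the actual response), parsing the price out might look like this:

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "lxml")
price_div = soup.find("div", class_="new-price")
if price_div is not None:
    # the question notes the price sits in the second span
    spans = price_div.find_all("span")
    if len(spans) > 1:
        print(spans[1].text.strip())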

I want to fetch live stock price data through Google search

I was trying to fetch the real-time stock price through Google search using web scraping, but it gives me an error:
import requests
import bs4 as bs

resp = requests.get("https://www.google.com/search?q=apple+share+price&oq=apple+share&aqs=chrome.0.0j69i57j0l4.11811j1j7&sourceid=chrome&ie=UTF-8")
soup = bs.BeautifulSoup(resp.text, 'lxml')
tab = soup.find('div', attrs={'class': 'gsrt'}).find('span').text
'NoneType' object has no attribute 'find'
You could use
soup.select_one('td[colspan="3"] b').text
Code:
import requests
from bs4 import BeautifulSoup as bs
headers = {'User-Agent' : 'Mozilla/5.0'}
res = requests.get('https://www.google.com/search?q=apple+share+price&oq=apple+share&aqs=chrome.0.0j69i57j0l4.11811j1j7&sourceid=chrome&ie=UTF-8', headers = headers)
soup = bs(res.content, 'lxml')
quote = soup.select_one('td[colspan="3"] b').text
print(quote)
Try this maybe...
resp = requests.get("https://www.google.com/search?q=apple+share+price&oq=apple+share&aqs=chrome.0.0j69i57j0l4.11811j1j7&sourceid=chrome&ie=UTF-8")
soup = bs(resp.text,'lxml')
tab = soup.find('div', class_='g').findAll('span')
print(tab[3].text.strip())
or, if you only want the price..
resp = requests.get("https://www.google.com/search?q=apple+share+price&oq=apple+share&aqs=chrome.0.0j69i57j0l4.11811j1j7&sourceid=chrome&ie=UTF-8")
soup = bs(resp.text,'lxml')
tab = soup.find('div', class_='g').findAll('span')
price = tab[3].text.strip()
print(price[:7])
user-agent is not specified in your request, which could be the reason why you were getting an empty result. Without it, Google treats your request as coming from python-requests, i.e. an automated script, instead of a "real user" visit.
It's fairly easy to do:
Click on SelectorGadget Chrome extension (once installed).
Click on the stock price and receive a CSS selector provided by SelectorGadget.
Use this selector to get the data.
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    'User-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
html = requests.get('https://www.google.com/search?q=nasdaq stock price', headers=headers)
soup = BeautifulSoup(html.text, 'lxml')
current_stock_price = soup.select_one('.wT3VGc').text
print(current_stock_price)
>>> 177,33
Alternatively, you can do the same thing using the Google Direct Answer Box API from SerpApi. It's a paid API with a free trial of 5,000 searches.
The biggest difference in this example is that you don't have to figure out why on earth something doesn't work, although it should. Everything is already done for the end user (in this case, all the selecting and figuring out how to scrape the data), with JSON output.
Code to integrate:
from serpapi import GoogleSearch
params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google",
    "q": "nasdaq stock price",
}
search = GoogleSearch(params)
results = search.get_dict()
current_stock_price = results['answer_box']['price']
print(current_stock_price)
>>> 177.42
Disclaimer: I work for SerpApi.

I get nothing when trying to scrape a table

So I want to extract the number 45.5 from here: https://www.myscore.com.ua/match/I9pSZU2I/#odds-comparison;over-under;1st-qrt
But when I try to find the table I get nothing. Here's my code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.myscore.com.ua/match/I9pSZU2I/#odds-comparison;over-under;1st-qrt'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux armv7l) AppleWebKit/537.36 (KHTML, like Gecko) Raspbian Chromium/65.0.3325.181 Chrome/65.0.3325.181 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
text = soup.find_all('table', class_='odds sortable')
print(text)
Can anybody help me extract the number and store its value in a variable?
You can try to do this without Selenium by recreating the dynamic request that loads the table.
Looking around in the Network tab of the page, I saw this XMLHttpRequest: https://d.myscore.com.ua/x/feed/d_od_I9pSZU2I_ru_1_eu
Try to reproduce the same parameters as the request. To access the Network tab: right-click → Inspect element → Network tab → select XHR, and find the second request.
The final code would be like this:
import requests

headers = {'x-fsign': 'SW9D1eZo'}
page = requests.get('https://d.myscore.com.ua/x/feed/d_od_I9pSZU2I_ru_1_eu', headers=headers)
You should check whether the x-fsign value differs based on your browser/IP.
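To actually pull the 45.5 out, here is a rough sketch under the assumption that the feed is delimited plain text rather than HTML (inspect the raw response first to confirm where the over/under total lives):

import re
import requests

# reproduce the XHR the page makes for the odds feed; the x-fsign token
# was observed in the browser's Network tab and may differ per session
headers = {'x-fsign': 'SW9D1eZo'}
url = 'https://d.myscore.com.ua/x/feed/d_od_I9pSZU2I_ru_1_eu'
response = requests.get(url, headers=headers)
response.raise_for_status()

# assumption: the total appears as a plain decimal somewhere in the feed;
# list the candidates, then pick out the one you need (e.g. 45.5)
candidates = re.findall(r'\d+\.\d+', response.text)
print(candidates[:20])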
