Web scraping the Uber price estimator page using Python - python

So I'm really new to web scraping and I want to build a bot that checks the Uber ride price from point A to point B over a period of time. I used the Selenium library to input the pickup location and the destination and now I want to scrape the resulting estimated price from the page.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
# Initialize webdriver object
firefox_webdriver_path = '/usr/local/bin/geckodriver'
webdriver = webdriver.Firefox(executable_path=firefox_webdriver_path)
webdriver.get('https://www.uber.com/global/en/price-estimate/')
time.sleep(3)
# Find the search box
elem = webdriver.find_element_by_name('pickup')
elem.send_keys('name/of/the/pickup/location')
time.sleep(1)
elem.send_keys(Keys.ENTER)
time.sleep(1)
elem2 = webdriver.find_element_by_name('destination')
elem2.send_keys('name/of/the/destination')
time.sleep(1)
elem2.send_keys(Keys.ENTER)
time.sleep(5)
elem3 = webdriver.find_element_by_class_name('bn rw bp nk ih cr vk')
print(elem3.text)
Unfortunately, there's an error:
selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: .bn rw bp nk ih cr vk
And I can't seem to figure out a solution. While inspecting the page, the name of the class that holds the price has the following name "bn rw bp nk ih cr vk" and after some searches I found out this might be Javascript instead of HTML. (I would also point out that I'm not really familiar with them.)
Inspecting the page
Eventually, I thought I could use BeautifulSoup and requests modules and faced another error.
import requests
from bs4 import BeautifulSoup
import re
import json
response = requests.get('https://www.uber.com/global/en/price-estimate/')
print(response.status_code)
406
I also tried changing the User Agent in hope of resolving this HTTP error message, but it did not work. I have no idea how to approach this.

Not exactly what you need. But I recently made a similar application and I will provide part of the function that you need. The only thing you need is to get the latitude and longitude. I used google_places provider for this, but iam sure there is many free services for this.
import requests
import json
def get_ride_price(origin_latitude, origin_longitude, destination_latitude, destination_longitude):
url = "https://www.uber.com/api/loadFEEstimates?localeCode=en"
payload = json.dumps({
"origin": {
"latitude": origin_latitude,
"longitude": origin_longitude
},
"destination": {
"latitude": destination_latitude,
"longitude": destination_longitude
},
"locale": "en"
})
headers = {
'content-type': 'application/json',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
'x-csrf-token': 'x'
}
response = requests.request("POST", url, headers=headers, data=payload)
result = [[x['vehicleViewDisplayName'], x['fareString']] for x in response.json()['data']['prices']]
return result
print(get_ride_price(51.5072178, -0.1275862, 51.4974948, -0.1356583))
OUTPUT:
[['Assist', '£13.84'], ['Access', '£13.84'], ['Green', '£13.86'], ['UberX', '£14.53'], ['Comfort', '£16.02'], ['UberXL', '£17.18'], ['Uber Pet', '£17.77'], ['Exec', '£20.88'], ['Lux', '£26.32']]

Related

Scrape a phone number inside a popup button using python beautifulsoup

I want to scrape a hidden phone number from a website using beautifulsoup
https://haraj.com.sa/1194697687, as you can see in this link
the phone number is hidden, and it only showed when you click "التواصل" button
The Button
Here is my code
from bs4 import BeautifulSoup
url = "https://haraj.com.sa/1199808969"
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36.'}
r = requests.get(url,headers=headers)
soup = BeautifulSoup(r.content,features='lxml')
post = soup.find('span', {'class', 'contact'})
print(post)
and here is the output I got
<span class="contact"><button class="sc-bdvvaa AGAbw" type="button"><img src="https://v8-cdn.haraj.com.sa/logos/contact_logo.svg" style="margin-left:5px;filter:brightness(0) invert(1)"/>التواصل</button></span>
BeautifulSoup won't be enough for what you're trying to do - it's just an HTML parser. And Selenium is overkill. The page you're trying to scrape from uses JavaScript to dynamically and asynchronously populate the DOM with content when you press the button. If you make a request to that page in Python, and try to parse the HTML, you're only looking at the barebones template, which would normally get populated later on by the browser. The data for the modal comes from a fetch/XHR HTTP POST request to a GraphQL API, the response of which is JSON. If you use your browser's developer tools to log your network traffic when you press the button, you can see the HTTP request URL, query-string parameters, POST payload, request headers, etc. You just need to mimic that request in Python - fortunately this API seems to be pretty lenient, so you won't have to provide all the same parameters that the browser provides:
def main():
import requests
url = "https://graphql.haraj.com.sa"
params = {
"queryName": "postContact",
"token": "",
"clientId": "",
"version": ""
}
headers = {
"user-agent": "Mozilla/5.0"
}
payload = {
"query": "query postContact($postId: Int!) {postContact(postId: $postId){contactText}}",
"variables": {
"postId": 94697687
}
}
response = requests.post(url, params=params, headers=headers, json=payload);
response.raise_for_status()
print(response.json())
return 0
if __name__ == "__main__":
import sys
sys.exit(main())
Output:
{'data': {'postContact': {'contactText': '0562038953'}}}
I was able to do this using selenium and chromedriver https://chromedriver.chromium.org/downloads just be sure to change the path to where you extract chromedriver and install selenium via pip;
pip install selenium
main.py
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import time
url = "https://haraj.com.sa/1199808969"
def main():
print(get_value())
def get_value():
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome("C:\Developement\chromedriver.exe",chrome_options=chrome_options)
driver.get(url)
driver.find_element(By.CLASS_NAME, "AGAbw").click()
time.sleep(5)
val = driver.find_element(By.XPATH, '//*[#id="modal"]/div/div/a[2]/div[2]').text
driver.quit()
return val
main()
Output:
[0829/155029.109:INFO:CONSOLE(1)] "HBM Loaded", source: https://v8-cdn.haraj.com.sa/main.bf98551ba68f8bd6bee4.js (1)
[0829/155030.571:INFO:CONSOLE(1)] "[object Object]", source: https://v8-cdn.haraj.com.sa/main.bf98551ba68f8bd6bee4.js (1)
[0829/155030.604:INFO:CONSOLE(1)] "[object Object]", source: https://v8-cdn.haraj.com.sa/main.bf98551ba68f8bd6bee4.js (1)
[0829/155031.143:INFO:CONSOLE(16)] "Yay! SW loaded 🎉", source: https://haraj.com.sa/sw.js (16)
0559559838

Access denied - python selenium - even after using User-Agent and other headers

Using python, I am trying to extract the options chain data table published publicly by NSE exchange on https://www.nseindia.com/option-chain
Tried to use requests session as well as selenium, but somehow the website is not allowing to extract data using bot.
Below are the attempts done:
Instead of plain requests, tried to setup a session and attempted to first get csrf_token from https://www.nseindia.com/api/csrf-token and then called the url. However the website seems to have certain additional authorization using javascripts.
On studying the xhr and js tabs of chrome developer console, the website seems to be using certain js scripts for first time authorisation, so used selenium this time. Also passed useragent and Accept-Language arguments in headers (as per this stackoverflow answer) while loading driver. But somehow the access is still blocked by website.
Is there anything obvious that i am missing ? Or website will make all attempts to block automated extraction of data from website using selenium/requests + python? Either case, how do i extract this data?
Below is my current code: ( to get table contents from https://www.nseindia.com/option-chain)
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
opts = Options()
opts.add_argument("user-agent=Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36")
opts.add_argument("Accept-Language=en-US,en;q=0.5")
opts.add_argument("Accept=text/html")
driver = webdriver.Chrome(executable_path="C:\\chromedriver.exe",chrome_options=opts)
#driver.get('https://www.nseindia.com/api/csrf-token')
driver.get('https://www.nseindia.com/')
#driver.get('https://www.nseindia.com/api/option-chain-indices?symbol=NIFTY')
driver.get('https://www.nseindia.com/option-chain')
The data is loaded via Javascript from external URL. But you need first to load cookies visiting other URL:
import json
import requests
from bs4 import BeautifulSoup
symbol = 'NIFTY'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
url = 'https://www.nseindia.com/api/option-chain-indices?symbol=' + symbol
with requests.session() as s:
# load cookies:
s.get('https://www.nseindia.com/get-quotes/derivatives?symbol=' + symbol, headers=headers)
# get data:
data = s.get(url, headers=headers).json()
# print data to screen:
print(json.dumps(data, indent=4))
Prints:
{
"records": {
"expiryDates": [
"03-Sep-2020",
"10-Sep-2020",
"17-Sep-2020",
"24-Sep-2020",
"01-Oct-2020",
"08-Oct-2020",
"15-Oct-2020",
"22-Oct-2020",
"29-Oct-2020",
"26-Nov-2020",
"31-Dec-2020",
"25-Mar-2021",
"24-Jun-2021",
"30-Dec-2021",
"30-Jun-2022",
"29-Dec-2022",
"29-Jun-2023"
],
"data": [
{
"strikePrice": 4600,
"expiryDate": "31-Dec-2020",
"PE": {
"strikePrice": 4600,
"expiryDate": "31-Dec-2020",
"underlying": "NIFTY",
"identifier": "OPTIDXNIFTY31-12-2020PE4600.00",
"openInterest": 19,
"changeinOpenInterest": 0,
"pchangeinOpenInterest": 0,
"totalTradedVolume": 0,
"impliedVolatility": 0,
"lastPrice": 31,
"change": 0,
"pChange": 0,
"totalBuyQuantity": 10800,
"totalSellQuantity": 0,
"bidQty": 900,
"bidprice": 3.05,
"askQty": 0,
"askPrice": 0,
"underlyingValue": 11647.6
}
},
{
"strikePrice": 5000,
"expiryDate": "31-Dec-2020",
...and so on.

I want to fetch the live stock price data through google search

I was trying to fetch the real time stock price through google search using web scraping but its giving me an error
resp = requests.get("https://www.google.com/search?q=apple+share+price&oq=apple+share&aqs=chrome.0.0j69i57j0l4.11811j1j7&sourceid=chrome&ie=UTF-8")
soup = bs.BeautifulSoup(resp.text,'lxml')
tab = soup.find('div',attrs = {'class':'gsrt'}).find('span').text
'NoneType'object has no attribute find
You could use
soup.select_one('td[colspan="3"] b').text
Code:
import requests
from bs4 import BeautifulSoup as bs
headers = {'User-Agent' : 'Mozilla/5.0'}
res = requests.get('https://www.google.com/search?q=apple+share+price&oq=apple+share&aqs=chrome.0.0j69i57j0l4.11811j1j7&sourceid=chrome&ie=UTF-8', headers = headers)
soup = bs(res.content, 'lxml')
quote = soup.select_one('td[colspan="3"] b').text
print(quote)
Try this maybe...
resp = requests.get("https://www.google.com/search?q=apple+share+price&oq=apple+share&aqs=chrome.0.0j69i57j0l4.11811j1j7&sourceid=chrome&ie=UTF-8")
soup = bs(resp.text,'lxml')
tab = soup.find('div', class_='g').findAll('span')
print(tab[3].text.strip())
or, if you only want the price..
resp = requests.get("https://www.google.com/search?q=apple+share+price&oq=apple+share&aqs=chrome.0.0j69i57j0l4.11811j1j7&sourceid=chrome&ie=UTF-8")
soup = bs(resp.text,'lxml')
tab = soup.find('div', class_='g').findAll('span')
price = tab[3].text.strip()
print(price[:7])`
user-agent is not specified in your request. It could be the reason why you were getting an empty result. This way Google treats your request as a python-requests aka automated script, instead of a "real user" visit.
It's fairly easy to do:
Click on SelectorGadget Chrome extension (once installed).
Click on the stock price and receive a CSS selector provided by SelectorGadget.
Use this selector to get the data.
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
html = requests.get('https://www.google.com/search?q=nasdaq stock price', headers=headers)
soup = BeautifulSoup(html.text, 'lxml')
current_stock_price = soup.select_one('.wT3VGc').text
print(current_stock_price)
>>> 177,33
Alternatively, you can do the same thing using Google Direct Answer Box API from SerpApi. It's a paid API with a free trial of 5,000 searches.
The biggest difference in this example that you don't have to figure out why the heck something doesn't work, although it should. Everything is already done for the end-user (in this case all selections and figuring out how to scrape this data) with a json output.
Code to integrate:
from serpapi import GoogleSearch
params = {
"api_key": "YOUR_API_KEY",
"engine": "google",
"q": "nasdaq stock price",
}
search = GoogleSearch(params)
results = search.get_dict()
current_stock_price = results['answer_box']['price']
print(current_stock_price)
>>> 177.42
Disclaimer, I work for SerpApi.

Wrong number of results in Google Scrape with Python

I was trying to learn web scraping and I am facing a freaky issue... My task is to search Google for news on a topic in a certain date range and count the number of results.
my simple code is
import requests, bs4
payload = {'as_epq': 'James Clark', 'tbs':'cdr:1,cd_min:1/01/2015,cd_max:1/01/2015','tbm':'nws'}
r = requests.get("https://www.google.com/search", params=payload)
soup = bs4.BeautifulSoup(r.text)
elems = soup.select('#resultStats')
print(elems[0].getText())
And the result I get is
About 8,600 results
So apparently all works... apart from the fact that the result is wrong. If I open the URL in Firefox (I can obtain the complete URL with r.url)
https://www.google.com/search?tbm=nws&as_epq=James+Clark&tbs=cdr%3A1%2Ccd_min%3A1%2F01%2F2015%2Ccd_max%3A1%2F01%2F2015
I see that the results are actually only 2, and if I manually download the HTML file, open the page source and search for id="resultStats" I find that the number of results is indeed 2!
Can anybody help me to understand why searching for the same id tag in the saved HTML file and in the soup item lead to two different numerical results?
************** UPDATE
It seems that the problem is the custom date range that does not get processed correctly by requests.get. If I use the same URL with selenium I get the correct answer
from selenium import webdriver
driver = webdriver.Firefox()
driver.get(url)
content = driver.page_source
soup = bs4.BeautifulSoup(content)
elems = soup.select('#resultStats')
print(elems[0].getText())
And the answer is
2 results (0.09 seconds)
The problem is that this methodology seems to be more cumbersome because I need to open the page in Firefox...
There are a couple of things that is causing this issue. First, it wants day and month parts of date in 2 digits and it is also expecting a user-agent string of some popular browser. Following code should work:
import requests, bs4
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
}
payload = {'as_epq': 'James Clark', 'tbs':'cdr:1,cd_min:01/01/2015,cd_max:01/01/2015', 'tbm':'nws'}
r = requests.get("https://www.google.com/search", params=payload, headers=headers)
soup = bs4.BeautifulSoup(r.content, 'html5lib')
print soup.find(id='resultStats').text
To add to Vikas' answer, Google will also fail to use 'custom date range' for some user-agents. That is, for certain user-agents, Google will simply search for 'recent' results instead of your specified date range.
I haven't detected a clear pattern in which user-agents will break the custom date range. It seems that including a language is a factor.
Here are some examples of user-agents that break cdr:
Mozilla/5.0 (Windows; U; Windows NT 6.1; fr-FR) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27
Mozilla/4.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/5.0)
There's no need in selenium, you're looking for this:
soup.select_one('#result-stats nobr').previous_sibling
# About 10,700,000 results
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": 'James Clark', # query
"hl": "en", # lang
"gl": "us", # country to search from
"tbm": "nws", # news filter
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
# if used without "nobr" selector and previous_sibling it will return seconds as well: (0.41 secods)
number_of_results = soup.select_one('#result-stats nobr').previous_sibling
print(number_of_results)
# About 10,700,000 results
Alternatively, you can achieve the same thing by using Google News Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to find which selectors will make the work done, or figure out why some of them don't return data you want also they should, bypass blocks from search engines, and maintain it over time.
Instead, you only need to iterate over structured JSON and get the data you want, fast.
Code to integrate for your case:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": 'James Clark',
"tbm": "nws",
"gl": "us",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
number_of_results = results['search_information']['total_results']
print(number_of_results)
# 14300000
Disclaimer, I work for SerpApi.

google search with python requests library

(I've tried looking but all of the other answers seem to be using urllib2)
I've just started trying to use requests, but I'm still not very clear on how to send or request something additional from the page. For example, I'll have
import requests
r = requests.get('http://google.com')
but I have no idea how to now, for example, do a google search using the search bar presented. I've read the quickstart guide but I'm not very familiar with HTML POST and the like, so it hasn't been very helpful.
Is there a clean and elegant way to do what I am asking?
Request Overview
The Google search request is a standard HTTP GET command. It includes a collection of parameters relevant to your queries. These parameters are included in the request URL as name=value pairs separated by ampersand (&) characters. Parameters include data like the search query and a unique CSE ID (cx) that identifies the CSE that is making the HTTP request. The WebSearch or Image Search service returns XML results in response to your HTTP requests.
First, you must get your CSE ID (cx parameter) at Control Panel of Custom Search Engine
Then, See the official Google Developers site for Custom Search.
There are many examples like this:
http://www.google.com/search?
start=0
&num=10
&q=red+sox
&cr=countryCA
&lr=lang_fr
&client=google-csbe
&output=xml_no_dtd
&cx=00255077836266642015:u-scht7a-8i
And there are explained the list of parameters that you can use.
import requests
from bs4 import BeautifulSoup
headers_Get = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding': 'gzip, deflate',
'DNT': '1',
'Connection': 'keep-alive',
'Upgrade-Insecure-Requests': '1'
}
def google(q):
s = requests.Session()
q = '+'.join(q.split())
url = 'https://www.google.com/search?q=' + q + '&ie=utf-8&oe=utf-8'
r = s.get(url, headers=headers_Get)
soup = BeautifulSoup(r.text, "html.parser")
output = []
for searchWrapper in soup.find_all('h3', {'class':'r'}): #this line may change in future based on google's web page structure
url = searchWrapper.find('a')["href"]
text = searchWrapper.find('a').text.strip()
result = {'text': text, 'url': url}
output.append(result)
return output
Will return an array of google results in {'text': text, 'url': url} format. Top result url would be google('search query')[0]['url']
input:
import requests
def googleSearch(query):
with requests.session() as c:
url = 'https://www.google.co.in'
query = {'q': query}
urllink = requests.get(url, params=query)
print urllink.url
googleSearch('Linkin Park')
output:
https://www.google.co.in/?q=Linkin+Park
The readable way to send a request with many query parameters would be to pass URL parameters as a dictionary:
params = {
'q': 'minecraft', # search query
'gl': 'us', # country where to search from
'hl': 'en', # language
}
requests.get('URL', params=params)
But, in order to get the actual response (output/text/data) that you see in the browser you need to send additional headers, more specifically user-agent which is needed to act as a "real" user visit when bot or browser sends a fake user-agent string to announce themselves as a different client.
The reason that your request might be blocked is that the default requests user agent is python-requests and websites understand that. Check what's your user agent.
You can read more about it in the blog post I wrote about how to reduce the chance of being blocked while web scraping.
Pass user-agent:
headers = {
'User-agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}
requests.get('URL', headers=headers)
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
'User-agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}
params = {
'q': 'minecraft',
'gl': 'us',
'hl': 'en',
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
for result in soup.select('.tF2Cxc'):
title = result.select_one('.DKV0Md').text
link = result.select_one('.yuRUbf a')['href']
print(title, link, sep='\n')
Alternatively, you can achieve the same thing by using Google Organic API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to create it from scratch and maintain it.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "tesla",
"hl": "en",
"gl": "us",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
print(result['title'])
print(result['link'])
Disclaimer, I work for SerpApi.
In this code by using bs4 you can get all the h3 and print their text
# Import the beautifulsoup
# and request libraries of python.
import requests
import bs4
# Make two strings with default google search URL
# 'https://google.com/search?q=' and
# our customized search keyword.
# Concatenate them
text= "c++ linear search program"
url = 'https://google.com/search?q=' + text
# Fetch the URL data using requests.get(url),
# store it in a variable, request_result.
request_result=requests.get( url )
# Creating soup from the fetched request
soup = bs4.BeautifulSoup(request_result.text,"html.parser")
filter=soup.find_all("h3")
for i in range(0,len(filter)):
print(filter[i].get_text())
You can use 'webbroser', I think it doesn't get easier than that:
import webbrowser
query = input('Enter your query: ')
webbrowser.open(f'https://google.com/search?q={query}')

Categories

Resources