Access denied - python selenium - even after using User-Agent and other headers

Access denied - python selenium - even after using User-Agent and other headers - python

Using python, I am trying to extract the options chain data table published publicly by NSE exchange on https://www.nseindia.com/option-chain
Tried to use requests session as well as selenium, but somehow the website is not allowing to extract data using bot.
Below are the attempts done:
Instead of plain requests, tried to setup a session and attempted to first get csrf_token from https://www.nseindia.com/api/csrf-token and then called the url. However the website seems to have certain additional authorization using javascripts.
On studying the xhr and js tabs of chrome developer console, the website seems to be using certain js scripts for first time authorisation, so used selenium this time. Also passed useragent and Accept-Language arguments in headers (as per this stackoverflow answer) while loading driver. But somehow the access is still blocked by website.
Is there anything obvious that i am missing ? Or website will make all attempts to block automated extraction of data from website using selenium/requests + python? Either case, how do i extract this data?
Below is my current code: ( to get table contents from https://www.nseindia.com/option-chain)
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
opts = Options()
opts.add_argument("user-agent=Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36")
opts.add_argument("Accept-Language=en-US,en;q=0.5")
opts.add_argument("Accept=text/html")
driver = webdriver.Chrome(executable_path="C:\\chromedriver.exe",chrome_options=opts)
#driver.get('https://www.nseindia.com/api/csrf-token')
driver.get('https://www.nseindia.com/')
#driver.get('https://www.nseindia.com/api/option-chain-indices?symbol=NIFTY')
driver.get('https://www.nseindia.com/option-chain')

The data is loaded via Javascript from external URL. But you need first to load cookies visiting other URL:
import json
import requests
from bs4 import BeautifulSoup
symbol = 'NIFTY'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
url = 'https://www.nseindia.com/api/option-chain-indices?symbol=' + symbol
with requests.session() as s:
# load cookies:
s.get('https://www.nseindia.com/get-quotes/derivatives?symbol=' + symbol, headers=headers)
# get data:
data = s.get(url, headers=headers).json()
# print data to screen:
print(json.dumps(data, indent=4))
Prints:
{
"records": {
"expiryDates": [
"03-Sep-2020",
"10-Sep-2020",
"17-Sep-2020",
"24-Sep-2020",
"01-Oct-2020",
"08-Oct-2020",
"15-Oct-2020",
"22-Oct-2020",
"29-Oct-2020",
"26-Nov-2020",
"31-Dec-2020",
"25-Mar-2021",
"24-Jun-2021",
"30-Dec-2021",
"30-Jun-2022",
"29-Dec-2022",
"29-Jun-2023"
],
"data": [
{
"strikePrice": 4600,
"expiryDate": "31-Dec-2020",
"PE": {
"strikePrice": 4600,
"expiryDate": "31-Dec-2020",
"underlying": "NIFTY",
"identifier": "OPTIDXNIFTY31-12-2020PE4600.00",
"openInterest": 19,
"changeinOpenInterest": 0,
"pchangeinOpenInterest": 0,
"totalTradedVolume": 0,
"impliedVolatility": 0,
"lastPrice": 31,
"change": 0,
"pChange": 0,
"totalBuyQuantity": 10800,
"totalSellQuantity": 0,
"bidQty": 900,
"bidprice": 3.05,
"askQty": 0,
"askPrice": 0,
"underlyingValue": 11647.6
}
},
{
"strikePrice": 5000,
"expiryDate": "31-Dec-2020",
...and so on.

Related

Web scraping the Uber price estimator page using Python

So I'm really new to web scraping and I want to build a bot that checks the Uber ride price from point A to point B over a period of time. I used the Selenium library to input the pickup location and the destination and now I want to scrape the resulting estimated price from the page.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
# Initialize webdriver object
firefox_webdriver_path = '/usr/local/bin/geckodriver'
webdriver = webdriver.Firefox(executable_path=firefox_webdriver_path)
webdriver.get('https://www.uber.com/global/en/price-estimate/')
time.sleep(3)
# Find the search box
elem = webdriver.find_element_by_name('pickup')
elem.send_keys('name/of/the/pickup/location')
time.sleep(1)
elem.send_keys(Keys.ENTER)
time.sleep(1)
elem2 = webdriver.find_element_by_name('destination')
elem2.send_keys('name/of/the/destination')
time.sleep(1)
elem2.send_keys(Keys.ENTER)
time.sleep(5)
elem3 = webdriver.find_element_by_class_name('bn rw bp nk ih cr vk')
print(elem3.text)
Unfortunately, there's an error:
selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: .bn rw bp nk ih cr vk
And I can't seem to figure out a solution. While inspecting the page, the name of the class that holds the price has the following name "bn rw bp nk ih cr vk" and after some searches I found out this might be Javascript instead of HTML. (I would also point out that I'm not really familiar with them.)
Inspecting the page
Eventually, I thought I could use BeautifulSoup and requests modules and faced another error.
import requests
from bs4 import BeautifulSoup
import re
import json
response = requests.get('https://www.uber.com/global/en/price-estimate/')
print(response.status_code)
406
I also tried changing the User Agent in hope of resolving this HTTP error message, but it did not work. I have no idea how to approach this.

Not exactly what you need. But I recently made a similar application and I will provide part of the function that you need. The only thing you need is to get the latitude and longitude. I used google_places provider for this, but iam sure there is many free services for this.
import requests
import json
def get_ride_price(origin_latitude, origin_longitude, destination_latitude, destination_longitude):
url = "https://www.uber.com/api/loadFEEstimates?localeCode=en"
payload = json.dumps({
"origin": {
"latitude": origin_latitude,
"longitude": origin_longitude
},
"destination": {
"latitude": destination_latitude,
"longitude": destination_longitude
},
"locale": "en"
})
headers = {
'content-type': 'application/json',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
'x-csrf-token': 'x'
}
response = requests.request("POST", url, headers=headers, data=payload)
result = [[x['vehicleViewDisplayName'], x['fareString']] for x in response.json()['data']['prices']]
return result
print(get_ride_price(51.5072178, -0.1275862, 51.4974948, -0.1356583))
OUTPUT:
[['Assist', '£13.84'], ['Access', '£13.84'], ['Green', '£13.86'], ['UberX', '£14.53'], ['Comfort', '£16.02'], ['UberXL', '£17.18'], ['Uber Pet', '£17.77'], ['Exec', '£20.88'], ['Lux', '£26.32']]

Script fails to extract certain fields from a webpage using requests

I'm trying to scrape title, website and email from this webpage. The content available in there are heavily dynamic. Although requests module can't handle javascript heavy content, there are always alternatives to grab the same using the very library, as in using any external link retrieved from dev tools. However, I just can't find the right way to do the trick.
I've tried with:
import requests
from bs4 import BeautifulSoup
link = 'https://www.firmy.cz/detail/13153157-azajola-stare-mesto-nova-seninka.html'
with requests.Session() as s:
s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
res = s.get(link)
soup = BeautifulSoup(res.text,"lxml")
title = soup.select_one("h1").text
website = soup.select_one("a.companyUrl").text
email = soup.select_one("a.companyMail").text
print(title,website,email)
When I run the script above, it throws AttributeError.
Output that I'm after:
AZAJOLA, s.r.o., www.chatanovaseninka.cz, info#chatanovaseninka.cz
PS I don't wish to go for any browser simulator like selenium.

The site uses FastRPC protocol (which is additionally encoded with Base64). You can install PyFRPC module from https://pypi.org/project/pyfrpc/ to encode/decode the messages:
import re
import pyfrpc # <-- install from https://pypi.org/project/pyfrpc/
import base64
from pprint import pprint
url = (
"https://www.firmy.cz/detail/13153157-azajola-stare-mesto-nova-seninka.html"
)
api_url = "https://www.firmy.cz/RPC2/"
id_ = int(re.search(r"detail/(\d+)", url).group(1))
headers = {
"Accept": "application/x-base64-frpc",
"Content-Type": "application/x-base64-frpc",
}
c = pyfrpc.FrpcCall(
name="system.multicall",
args=[
[
{
"methodName": "detail.getDetail",
"params": [
{"version": "1.0"},
id_,
{"searchInCategory": False, "deliveryNetwork": ""},
],
},
{
"methodName": "review.listReviews",
"params": [{"version": "1.0"}, id_, 0, 3],
},
]
],
)
msg_to_send = base64.b64encode(pyfrpc.encode(c, 0x0201))
r = requests.post(api_url, headers=headers, data=msg_to_send)
response = pyfrpc.decode(base64.b64decode(r.text))
# uncomment to see all data:
# pprint(response.data)
print(response.data[0]["result"]["title_web"])
print(response.data[0]["result"]["email"])
print(response.data[0]["result"]["url"].split("#")[0])
Prints:
AZAJOLA, s.r.o.
info#chatanovaseninka.cz
http://www.chatanovaseninka.cz

Python requests 401 error but url opens in browser

I am trying to pull the json from this location -
https://www.nseindia.com/api/option-chain-indices?symbol=BANKNIFTY
This opens fine in my browser, but using requests in python throws a 401 permission error. I have tried adding headers with different arguments, but to no avail.
Interestingly, the json on this page does not open in the browser as well until https://www.nseindia.com is opened separately. I believe it requires some kind of authentication, but surprised it works in the browser without any.
Is there a way to extract the information from this url? Any help is much appreciated.
Here is my implementation -
import requests
url = 'https://www.nseindia.com/api/option-chain-indices?symbol=BANKNIFTY'
# This throws a 401 response error
page = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
# This throws a 'Connection aborted' error
page = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 GTB7.1 (.NET CLR 3.5.30729)"})

To get the correct data, first load cookies from other URL with requests.get() and then do Ajax request to load the JSON:
import json
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
url = 'https://www.nseindia.com/api/option-chain-indices?symbol=BANKNIFTY'
with requests.session() as s:
# load cookies:
s.get('https://www.nseindia.com/get-quotes/derivatives?symbol=BANKNIFTY', headers=headers)
# get data:
data = s.get(url, headers=headers).json()
# print data to screen:
print(json.dumps(data, indent=4))
Prints:
{
"records": {
"expiryDates": [
"03-Sep-2020",
"10-Sep-2020",
"17-Sep-2020",
"24-Sep-2020",
"01-Oct-2020",
"08-Oct-2020",
"15-Oct-2020",
"22-Oct-2020",
"29-Oct-2020",
"26-Nov-2020"
],
"data": [
{
"strikePrice": 18100,
"expiryDate": "03-Sep-2020",
"CE": {
"strikePrice": 18100,
"expiryDate": "03-Sep-2020",
"underlying": "BANKNIFTY",
"identifier": "OPTIDXBANKNIFTY03-09-2020CE18100.00",
"openInterest": 1,
"changeinOpenInterest": 1,
"pchangeinOpenInterest": 0,
"totalTradedVolume": 2,
"impliedVolatility": 95.51,
"lastPrice": 6523.6,
"change": 2850.1000000000004,
"pChange": 77.58540901048048,
"totalBuyQuantity": 2925,
"totalSellQuantity": 2800,
"bidQty": 25,
"bidprice": 6523.6,
"askQty": 25,
"askPrice": 6570.3,
"underlyingValue": 24523.8
},
"PE": {
"strikePrice": 18100,
"expiryDate": "03-Sep-2020",
"underlying": "BANKNIFTY",
...and so on.

How to check if a url is indexed by google using Google Custom search API and Python?

i need to check if some URLs are indexed by google using a python script and google custom search.
I'd like to obtain in the script the same results i obtain when from my browser i google for site:www.example.it.
My code is:
import urllib2
import json
import pprint
data = urllib2.urlopen('https://www.googleapis.com/customsearch/v1?key=AIzaSyA3xNw1doOc4rjoUGc7sq1gltQvOgalHqA&cx=017576662512468239146:omuauf_lfve&q=site:http://www.repubblica.it/politica/2014/04/07/news/governo_e_patto_su_italicum_brunetta_a_renzi_riforma_elettorale_entro_pasqua_o_si_dimetta-82947958/?ref=HREA-1')
data=json.load(data)
print data
The output of this is:
{ u'kind': u'customsearch#search',
u'queries': { u'request': [ { u'count': 10,
u'cx': u'017576662512468239146:omuauf_lfve',
u'inputEncoding': u'utf8',
u'outputEncoding': u'utf8',
u'safe': u'off',
u'searchTerms': u'site:http://www.repubblica.it/politica/2014/04/07/news/governo_e_patto_su_italicum_brunetta_a_renzi_riforma_elettorale_entro_pasqua_o_si_dimetta-82947958/?ref=HREA-1',
u'title': u'Google Custom Search - site:http://www.repubblica.it/politica/2014/04/07/news/governo_e_patto_su_italicum_brunetta_a_renzi_riforma_elettorale_entro_pasqua_o_si_dimetta-82947958/?ref=HREA-1',
u'totalResults': u'0'}]},
u'searchInformation': { u'formattedSearchTime': u'0.55',
u'formattedTotalResults': u'0',
u'searchTime': 0.552849,
u'totalResults': u'0'},
u'url': { u'template': u'https://www.googleapis.com/customsearch/v1?q={searchTerms}&num={count?}&start={startIndex?}&lr={language?}&safe={safe?}&cx={cx?}&cref={cref?}&sort={sort?}&filter={filter?}&gl={gl?}&cr={cr?}&googlehost={googleHost?}&c2coff={disableCnTwTranslation?}&hq={hq?}&hl={hl?}&siteSearch={siteSearch?}&siteSearchFilter={siteSearchFilter?}&exactTerms={exactTerms?}&excludeTerms={excludeTerms?}&linkSite={linkSite?}&orTerms={orTerms?}&relatedSite={relatedSite?}&dateRestrict={dateRestrict?}&lowRange={lowRange?}&highRange={highRange?}&searchType={searchType}&fileType={fileType?}&rights={rights?}&imgSize={imgSize?}&imgType={imgType?}&imgColorType={imgColorType?}&imgDominantColor={imgDominantColor?}&alt=json',
u'type': u'application/json'}}
As you can see there are no "items" while if you google for site:http://www.repubblica.it/politica/2014/04/07/news/governo_e_patto_su_italicum_brunetta_a_renzi_riforma_elettorale_entro_pasqua_o_si_dimetta-82947958/?ref=HREA-1 you have at least one item.
After various experiments it seems that google custom search doesn't work for the queries site:website.
Do you know any solution or alternative to this problem?
Thanks.

With Google CSE you specify the site via your CSE configuration (corresponding to your 'cx' parameter) not via the 'site:' query parameter. In the 'basics' tab of your CSE you should see a section called "Sites to search".

Urls are in Excel file
import requests
import pandas as pd
import time
from bs4 import BeautifulSoup
from urllib.parse import urlencode
seconds = 3
proxies = {
'https' : 'https://localhost:8123',
'http' : 'http://localhost:8123'
}
user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'
headers = { 'User-Agent' : user_agent}
df = pd.read_excel('url_links.xlsx')
for i in range(0, len(df)):
line = df.loc[i,'links']
#print(line)
if line:
query = {'q': 'site:' + line}
google = "https://www.google.com/search?" + urlencode(query)
data = requests.get(google, headers=headers)
data.encoding = 'ISO-8859-1'
soup = BeautifulSoup(str(data.content), "html.parser")
try:
check = soup.find(id="rso").find("div").find("div").find("div").find("div").find("div").find("a")["href"]
print("URL is Index ")
except AttributeError:
print("URL Not Index")
time.sleep(float(seconds))
else:
print("Invalid Url")

google search with python requests library

(I've tried looking but all of the other answers seem to be using urllib2)
I've just started trying to use requests, but I'm still not very clear on how to send or request something additional from the page. For example, I'll have
import requests
r = requests.get('http://google.com')
but I have no idea how to now, for example, do a google search using the search bar presented. I've read the quickstart guide but I'm not very familiar with HTML POST and the like, so it hasn't been very helpful.
Is there a clean and elegant way to do what I am asking?

Request Overview
The Google search request is a standard HTTP GET command. It includes a collection of parameters relevant to your queries. These parameters are included in the request URL as name=value pairs separated by ampersand (&) characters. Parameters include data like the search query and a unique CSE ID (cx) that identifies the CSE that is making the HTTP request. The WebSearch or Image Search service returns XML results in response to your HTTP requests.
First, you must get your CSE ID (cx parameter) at Control Panel of Custom Search Engine
Then, See the official Google Developers site for Custom Search.
There are many examples like this:
http://www.google.com/search?
start=0
&num=10
&q=red+sox
&cr=countryCA
&lr=lang_fr
&client=google-csbe
&output=xml_no_dtd
&cx=00255077836266642015:u-scht7a-8i
And there are explained the list of parameters that you can use.

import requests
from bs4 import BeautifulSoup
headers_Get = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding': 'gzip, deflate',
'DNT': '1',
'Connection': 'keep-alive',
'Upgrade-Insecure-Requests': '1'
}
def google(q):
s = requests.Session()
q = '+'.join(q.split())
url = 'https://www.google.com/search?q=' + q + '&ie=utf-8&oe=utf-8'
r = s.get(url, headers=headers_Get)
soup = BeautifulSoup(r.text, "html.parser")
output = []
for searchWrapper in soup.find_all('h3', {'class':'r'}): #this line may change in future based on google's web page structure
url = searchWrapper.find('a')["href"]
text = searchWrapper.find('a').text.strip()
result = {'text': text, 'url': url}
output.append(result)
return output
Will return an array of google results in {'text': text, 'url': url} format. Top result url would be google('search query')[0]['url']

input:
import requests
def googleSearch(query):
with requests.session() as c:
url = 'https://www.google.co.in'
query = {'q': query}
urllink = requests.get(url, params=query)
print urllink.url
googleSearch('Linkin Park')
output:
https://www.google.co.in/?q=Linkin+Park

The readable way to send a request with many query parameters would be to pass URL parameters as a dictionary:
params = {
'q': 'minecraft', # search query
'gl': 'us', # country where to search from
'hl': 'en', # language
}
requests.get('URL', params=params)
But, in order to get the actual response (output/text/data) that you see in the browser you need to send additional headers, more specifically user-agent which is needed to act as a "real" user visit when bot or browser sends a fake user-agent string to announce themselves as a different client.
The reason that your request might be blocked is that the default requests user agent is python-requests and websites understand that. Check what's your user agent.
You can read more about it in the blog post I wrote about how to reduce the chance of being blocked while web scraping.
Pass user-agent:
headers = {
'User-agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}
requests.get('URL', headers=headers)
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
'User-agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}
params = {
'q': 'minecraft',
'gl': 'us',
'hl': 'en',
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
for result in soup.select('.tF2Cxc'):
title = result.select_one('.DKV0Md').text
link = result.select_one('.yuRUbf a')['href']
print(title, link, sep='\n')
Alternatively, you can achieve the same thing by using Google Organic API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to create it from scratch and maintain it.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "tesla",
"hl": "en",
"gl": "us",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
print(result['title'])
print(result['link'])
Disclaimer, I work for SerpApi.

In this code by using bs4 you can get all the h3 and print their text
# Import the beautifulsoup
# and request libraries of python.
import requests
import bs4
# Make two strings with default google search URL
# 'https://google.com/search?q=' and
# our customized search keyword.
# Concatenate them
text= "c++ linear search program"
url = 'https://google.com/search?q=' + text
# Fetch the URL data using requests.get(url),
# store it in a variable, request_result.
request_result=requests.get( url )
# Creating soup from the fetched request
soup = bs4.BeautifulSoup(request_result.text,"html.parser")
filter=soup.find_all("h3")
for i in range(0,len(filter)):
print(filter[i].get_text())

You can use 'webbroser', I think it doesn't get easier than that:
import webbrowser
query = input('Enter your query: ')
webbrowser.open(f'https://google.com/search?q={query}')

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Access denied - python selenium - even after using User-Agent and other headers - python

Related

Web scraping the Uber price estimator page using Python

Script fails to extract certain fields from a webpage using requests

Python requests 401 error but url opens in browser

How to check if a url is indexed by google using Google Custom search API and Python?

google search with python requests library

Categories

Resources