I have a web link as below:
https://www.nseindia.com/live_market/dynaContent/live_watch/option_chain/optionKeys.jsp
I use the below code to collect the data, but I am getting this error:
requests.exceptions.ConnectionError: ('Connection aborted.',
OSError("(10060, 'WSAETIMEDOUT')",))
My Code:
from requests import Session
import lxml.html
expiry_list = []
try:
    session = Session()
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'}
    session.headers.update(headers)
    url = 'https://www1.nseindia.com/live_market/dynaContent/live_watch/option_chain/optionKeys.jsp'
    params = {'symbolCode': 9999, 'symbol': 'BANKNIFTY', 'instrument': 'OPTIDX', 'date': '-', 'segmentLink': 17}
    response = session.get(url, params=params)
    soup = lxml.html.fromstring(response.text)
    expiry_list = soup.xpath('//form[@id="ocForm"]//option/text()')
    expiry_list.remove(expiry_list[0])
except Exception as error:
    print("Error:", error)
print("Expiry_Date =", expiry_list)
It works perfectly on my local machine but gives this error on an Amazon EC2 instance. Do any settings need to be changed to resolve the request timeout error?
AWS houses lots of botnets, so spam blacklists frequently list AWS IP ranges, and your EC2 instance is probably sitting in an IP block that is blacklisted. You might be able to verify this by putting your public EC2 IP into https://mxtoolbox.com/. I would also check whether you can make the request at all from the command line with curl -v {URL}. If that times out, then I bet your IP is blocked by the remote server's firewall rules.
Since your home IP has access, you can set up a VPN on your home network, have the EC2 instance connect to that VPN, and then retry your Python script. It should work then, but the request will effectively be coming from your home connection (so don't do anything stupid). Most routers let you set up an OpenVPN or PPTP VPN right in the admin UI. I suspect that once your EC2 instance's traffic exits from a different IP, you'll get past the upstream server and be able to scrape.
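If you want to do the same sanity check from Python rather than curl, a minimal sketch along these lines (same URL as in the question, with a short explicit timeout so a firewall drop fails fast; the timeout value is just illustrative) will tell you whether the host is reachable at all from the EC2 box:
import requests

url = "https://www1.nseindia.com/live_market/dynaContent/live_watch/option_chain/optionKeys.jsp"
try:
    # Short timeout: if the firewall silently drops packets, fail fast instead of hanging
    r = requests.get(url, headers={"user-agent": "Mozilla/5.0"}, timeout=10)
    print("Reachable, status code:", r.status_code)
except requests.exceptions.RequestException as exc:
    print("Not reachable from this host:", exc)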
I am currently building a proxy rotator for Python. Everything is running fine so far, except that despite the proxies, the tracker pages return my own IP.
I have already read through dozens of posts in this forum. It often says "something is wrong with the proxy in this case".
I have a long list of proxies (about 600) which I test with my method, and when I scraped them I made sure they were marked either "elite" or "anonymous" before putting them on this list.
So can it be that the majority of free proxies are "junk" when it comes to anonymity, or am I fundamentally doing something wrong?
And is there a way to find out how a proxy is configured regarding anonymity?
Python 3.10.
import requests
import time

headers = {
    "User-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
}
proxi = {"http": ""}
prox_ping_ready = ["173.219.112.85:8080",
                   "43.132.148.107:2080",
                   "216.176.187.99:8886",
                   "193.108.21.234:1234",
                   "151.80.120.192:3128",
                   "139.255.10.234:8080",
                   "120.24.33.141:8000",
                   "12.88.29.66:9080",
                   "47.241.66.249:1081",
                   "51.79.205.165:8080",
                   "63.250.53.181:3128",
                   "160.3.168.70:8080"]
ipTracker = ["wtfismyip.com/text", "api.ip.sb/ip", "ipecho.net/plain", "ifconfig.co/ip"]

for element in prox_ping_ready:
    for choice in ipTracker:
        try:
            proxi["http"] = "http://" + element
            ips = requests.get(f'https://{choice}', proxies=proxi, timeout=1, headers=headers).text
            print(f'My IP address is: {ips}', choice)
        except Exception as e:
            print("Error:", e)
        time.sleep(3)
Output (example):
My IP address is: 89.13.9.135
api.ip.sb/ip
My IP address is: 89.13.9.135
wtfismyip.com/text
My IP address is: 89.13.9.135
ifconfig.co/ip
(Every time it is my own address.)
You only set your proxy for http traffic; you need to include a key for https traffic as well.
proxi["http"] = "http://" + element
proxi["https"] = "http://" + element # or "https://" + element, depends on the proxy
As James mentioned, you should also set an https proxy:
proxi["https"] = "http://" + element
If you are getting "max retries exceeded with url", it most probably means that the proxy is not working or is too slow and overloaded, so you might increase your timeout.
You can verify whether your proxy is working by setting it as an environment variable. I took one from your list:
import os
os.environ["http_proxy"] = "173.219.112.85:8080"
os.environ["https_proxy"] = "173.219.112.85:8080"
and then run your code without proxy settings by changing your request to
ips = requests.get('https://wtfismyip.com/text', headers=headers).text
In PowerShell, I am currently performing this request, copied from the Network tab of the developer tools:
$session = New-Object Microsoft.PowerShell.Commands.WebRequestSession
$session.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36"
Invoke-WebRequest -useBasicParsing -Uri "https://......?...." `
-WebSession $session `
-Headers @{
"Accept"="*/*"
"Accept-Encoding"="gzip, deflate, br"
"Accept-Language"="en-US,en;q=0.9"
"Authorization"="Basic mzYw....="
"Referer"="https://......."
"sec-Fetch-Dest"="empty"
"sec-Fetch-Mode"="cors"
"sec-Fetch-Site"="same-origin"
"sec-ch-ua"="`"Chromium`";v=`"106`", `"Google Chrome`";v=`"106`", `"Not;A=Brand`";v=`"99`""
"sec-ch-ua-mobile"="?0"
"sec-ch-ua-platform"="`"Windows`""
} `
-ContentType "application/x-www-form-urlencoded"
which returns a 200 response just fine.
However, when I try to perform the same request, with the same headers, in Python requests, I get an SSL proxy-related error (see SSL_verification wrong version number even with certifi verify).
Is a proxy automatically configured for PowerShell requests? How can I find out which proxy my requests are currently routed through? Otherwise, how can I replicate the PowerShell request 1:1 in Python requests?
I have tried running the ipconfig /all command and using the Primary Dns Suffix field as the proxy argument in requests:
requests.get(url, headers=headers_in_powershell, proxies={'http': 'the_dns_suffix', 'https': 'the_dns_suffix'})
but the request just gets stuck (it waits indefinitely with no response).
For most commands, PowerShell uses the system proxy by default (or they have a -Proxy switch to tell them where it is), but some don't and have to be told to use it.
From memory, Invoke-WebRequest can be problematic as (I think) it uses the .NET web client.
Try adding this to the start of the PS script:
[System.Net.WebRequest]::DefaultWebProxy = [System.Net.WebRequest]::GetSystemWebProxy()
[System.Net.WebRequest]::DefaultWebProxy.Credentials = [System.Net.CredentialCache]::DefaultNetworkCredentials
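On the Python side, you can read the proxy configuration the operating system reports and hand the same proxies to requests. A minimal sketch (the URL and headers are placeholders standing in for the ones in your PowerShell call):
import urllib.request
import requests

# getproxies() reads the system proxy settings (on Windows this includes the registry/IE settings)
system_proxies = urllib.request.getproxies()
print("System proxies:", system_proxies)

# Reuse them for the same request you make in PowerShell
response = requests.get("https://......?....",
                        headers={"Accept": "*/*"},
                        proxies=system_proxies)
print(response.status_code)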
I am trying to access https://www.nseindia.com/api/equity-stockIndices?index=NIFTY%2050. It works fine from my localhost (run from VS Code), but when I deploy it on the server I get an HTTP 499 error.
Did anybody get past this and manage to fetch the data using this approach?
It looks like NSE is blocking the request somehow. But then how is it working from localhost?
P.S. I am a paid user of PythonAnywhere (Hacker subscription).
import requests
import time

def marketDatafn(query):
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36'}
    main_url = "https://www.nseindia.com/"
    session = requests.Session()
    response = session.get(main_url, headers=headers)
    cookies = response.cookies
    url = "https://www.nseindia.com/api/equity-stockIndices?index=NIFTY%2050"
    nifty50DataReq = session.get(url, headers=headers, cookies=cookies, timeout=15)
    nifty50DataJson = nifty50DataReq.json()
    return nifty50DataJson['data']
Actually "Pythonanywhere" only supports those website which are in this whitelist.
And I have found that there are only two subdomain available under "nseindia.com", which is not that you are trying to request.
bricsonline.nseindia.com
bricsonlinereguat.nseindia.com
So, pythonanywhere is blocking you to sent request to that website.
Here's the link to read more about how to request to add your website there.
My Django website is hosted on an Apache server. I want to send data to it with requests.post from a Python script on my PC, but it is giving 403 Forbidden.
import requests
import json
url = "http://54.161.205.225/Project/devicecapture"
headers = {'User-Agent':
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36',
'content-type': 'application/json'}
data = {
"nb_imsi":"test API",
"tmsi1":"test",
"tmsi2":"test",
"imsi":"test API",
"country":"USA",
"brand":"Vodafone",
"operator":"test",
"mcc":"aa",
"mnc":"jhj",
"lac":"hfhf",
"cellIid":"test cell"
}
response = requests.post(url, data=json.dumps(data), headers=headers)
print(response.status_code)
I have also given permissions on the directory containing the views.py where this request will go.
I have gone through many other answers, but they didn't help.
I have also tried the code without json.dumps, but it doesn't work either.
How do I resolve this?
After investigating, it looks like the URL you need to post to in order to log in is: http://54.161.205.225/Project/accounts/login/?next=/Project/
You can work out what you need to send in a post request by looking in the Chrome DevTools, Network tab. This tells us that you need to send the fields username, password and csrfmiddlewaretoken, which you need to pull from the page.
You can get it by extracting it from the response of the first get request. It is stored on the page like this:
<input type="hidden" name="csrfmiddlewaretoken" value="OspZfYQscPMHXZ3inZ5Yy5HUPt52LTiARwVuAxpD6r4xbgyVu4wYbfpgYMxDgHta">
So you'll need to do some kind of regex to get it, for example the sketch below.
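A minimal sketch of that extraction with a plain regex against the HTML shown above (the login URL is the one from this answer; in practice you may prefer an HTML parser):
import re
import requests

session = requests.session()
response = session.get('http://54.161.205.225/Project/accounts/login/?next=/Project/')

# Pull the hidden csrfmiddlewaretoken value out of the login form's HTML
match = re.search(r'name="csrfmiddlewaretoken" value="([^"]+)"', response.text)
csrf_token = match.group(1) if match else None
print(csrf_token)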
So first you have to create a session. Then load the login page with a get request. Then send a post request with your login credentials to that same URL. Your session will then gain the required cookies that allow you to post to your desired URL. There is an example below.
import requests
# Create session
session = requests.session()
# Add user-agent string
session.headers.update({'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) ' +
'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'})
# Get login page
response = session.get('http://54.161.205.225/Project/accounts/login/?next=/Project/')
# Get csrf
# Do something to response.text
# Post to login
response = session.post('http://54.161.205.225/Project/accounts/login/?next=/Project/', data={
'username': 'example123',
'password': 'examplexamplexample',
'csrfmiddlewaretoken': 'something123123',
})
# Post desired data
response = session.post('http://url.url/other_page', data={
'data': 'something',
})
print(response.status_code)
Hopefully this should get you there. Good luck.
For more information check out this question on requests: Python Requests and persistent sessions
I have faced that situation many times.
The problems were:
54.161.205.225 is not added to ALLOWED_HOSTS in settings.py (see the sketch after this list)
the Apache WSGI is not correctly configured
Things that might help with debugging:
check the Apache error logs to investigate what went wrong
try running the server locally and posting to it, to make sure the problem is not related to Apache
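For the first point, a minimal sketch of the relevant line in settings.py (the IP is the one from the question; add your own domain if you have one):
# settings.py (Django)
ALLOWED_HOSTS = [
    "54.161.205.225",   # public IP the script posts to
    "localhost",
    "127.0.0.1",
]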
I'm using the OneDrive Python SDK to handle authentication with OneDrive. The authentication is done as:
import onedrivesdk
from onedrivesdk.helpers import GetAuthCodeServer
redirect_uri = "http://localhost:8080/"
client_secret = "your_app_secret"
client = onedrivesdk.get_default_client(client_id='your_client_id',
                                        scopes=['wl.signin',
                                                'wl.offline_access',
                                                'onedrive.readwrite'])
auth_url = client.auth_provider.get_auth_url(redirect_uri)
#this will block until we have the code
code = GetAuthCodeServer.get_auth_code(auth_url, redirect_uri)
client.auth_provider.authenticate(code, redirect_uri, client_secret)
However, since I run this authentication on an EC2 instance, and I don't want to use a browser just for that, the code blocks indefinitely. Here's the get_auth_code from Microsoft:
def get_auth_code(auth_url, redirect_uri):
    """Easy way to get the auth code. Wraps up all the threading
    and stuff. Does block main thread.

    Args:
        auth_url (str): URL of auth server
        redirect_uri (str): Redirect URI, as set for the app. Should be
            something like "http://localhost:8080" for this to work.

    Returns:
        str: A string representing the auth code, sent back by the server
    """
    HOST, PORT = urlparse(redirect_uri).netloc.split(':')
    PORT = int(PORT)
    # Set up HTTP server and thread
    code_acquired = threading.Event()
    s = GetAuthCodeServer((HOST, PORT), code_acquired, GetAuthCodeRequestHandler)
    th = threading.Thread(target=s.serve_forever)
    th.start()
    webbrowser.open(auth_url)
    # At this point the browser will open and the code
    # will be extracted by the server
    code_acquired.wait()  # First wait for the response from the auth server
    code = s.auth_code
    s.shutdown()
    th.join()
    return code
I want to return the code. Here's a sample of auth_url:
https://login.live.com/oauth20_authorize.srf?scope=wl.offline_access+onedrive.readwrite&redirect_uri=http%3A%2F%2Flocalhost%3A8080%2F&response_type=code&client_id='your_client_id'
When I enter that URL in the browser, I get the code back:
http://localhost:8080/?code=Mb0bba7d1-adbc-9c1d-f790-3709cd0b9f16
So I want to avoid that cumbersome process and get the code back by using requests. How can I accomplish that?
I know this is an old question, but I was struggling with the same problem - I wanted to get the code using the requests library. I managed to do it somehow, but I doubt this is a very sustainable solution. Hopefully, after reading my solution, you will understand better how the authentication works and you might find an improved solution.
I have a Python Flask app with a MySQL database. Occasionally, I want to create a backup of the database and send the backup file to my OneDrive, and I want to trigger this process from inside my Flask app.
First, I registered my app at the Microsoft Application Registration Portal and added a new Web platform with the redirect URL http://localhost:8080/signin-microsoft. I gave the app read and write permissions and stored the Application Id (client_id) and Application Secret (client_secret).
Second, I added a new route to my Flask App. Note that my Flask App is running on localhost:8080.
#app.route("/signin-microsoft", methods=['GET'])
def get_code():
return 'Yadda'
Third, I replicated the HTTP request headers created by my browser in my requests.get call. That is, I opened Chrome, pasted auth_url into the address bar, hit Enter, inspected the request headers, and copied their content into my code.
r = requests.get(auth_url,
                 headers={"Host": "login.live.com",
                          "Connection": "keep-alive",
                          "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
                          "Accept-Encoding": "gzip, deflate, br",
                          "Upgrade-Insecure-Requests": "1",
                          "Accept-Language": "fi-FI,fi;q=0.9,en-US;q=0.8,en;q=0.7",
                          "User-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
                          "Cookie": (SUPER LONG WONT PASTE HERE)})
Fourth, I parsed the code from the URL the request was redirected to.
re_url = r.url
code = re_url.split('code=')[-1]
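Splitting on 'code=' works, but a slightly more robust sketch using the standard library (operating on the same re_url variable as above) parses the query string properly:
from urllib.parse import urlparse, parse_qs

# Extract the 'code' query parameter from the redirected URL
query = parse_qs(urlparse(re_url).query)
code = query.get('code', [None])[0]
print(code)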
Here is the final code:
redirect_uri = 'http://localhost:8080/signin-microsoft'
client_secret = CLIENT_SECRET
client_id = CLIENT_ID
api_base_url = 'https://api.onedrive.com/v1.0/'
scopes = ['wl.signin', 'wl.offline_access', 'onedrive.readwrite']
http_provider = onedrivesdk.HttpProvider()
auth_provider = onedrivesdk.AuthProvider(
    http_provider=http_provider, client_id=client_id, scopes=scopes)
client = onedrivesdk.OneDriveClient(api_base_url, auth_provider, http_provider)
auth_url = client.auth_provider.get_auth_url(redirect_uri)
r = requests.get(auth_url,
                 headers={"Host": "login.live.com",
                          "Connection": "keep-alive",
                          "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
                          "Accept-Encoding": "gzip, deflate, br",
                          "Upgrade-Insecure-Requests": "1",
                          "Accept-Language": "fi-FI,fi;q=0.9,en-US;q=0.8,en;q=0.7",
                          "User-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
                          "Cookie": (SUPER LOOONG)})
re_url = r.url
code = re_url.split('code=')[-1]
client.auth_provider.authenticate(code, redirect_uri, client_secret)
I think there are two main points here: you need an HTTP server that listens on the redirect URI (in Microsoft's example they used HTTPServer from http.server), and you need to get the headers of the request right. Without the headers, the request won't redirect correctly and you won't get the code!