Python Script- failing to bypass logon page using beautifulSoup


I had a script that got past a logon page. It looks like this:
from bs4 import BeautifulSoup
from requests import Session
URL = "http://mywebsite.com/logon.aspx"
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"}
username="username"
password="password"
s = Session()
s.verify = False
s.headers.update(headers)
r = s.get(URL)
soup=BeautifulSoup(r.content,"html.parser")
VIEWSTATE = soup.find(id="__VIEWSTATE")['value']
VIEWSTATEGENERATOR = soup.find(id="__VIEWSTATEGENERATOR")['value']
EVENTVALIDATION = soup.find(id="__EVENTVALIDATION")['value']
login_data={"__VIEWSTATE":VIEWSTATE,
"__VIEWSTATEGENERATOR":VIEWSTATEGENERATOR,
"__EVENTVALIDATION":EVENTVALIDATION,
"txtUsername":username,
"txtPassword":password,
"btnLogin":"Login"
}
#r = s.post(URL, data=login_data, verify=False)
r = s.post("http://mywebsite.com/logon.aspx", data=login_data)
r = s.get("http://mywebsite.com/SummaryReport/Index")
That script used to work, but then it started running into SSL errors, so I set verify = False on the session.
Now I don't get SSL errors, but the script no longer posts the data to the logon page. I'm not sure whether the two are related, but any help is much appreciated.

If this is the SSL message you are seeing, it is only a warning and can be ignored:
/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py:1004: InsecureRequestWarning: Unverified HTTPS request is being made to host 'ownwebsite.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning,
What is failing is the following line:
VIEWSTATE = soup.find(id="__VIEWSTATE")['value']
The content received from the URL has no element with the id __VIEWSTATE, so soup.find() returns None, and subscripting None with ['value'] raises this error:
TypeError: 'NoneType' object is not subscriptable
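As a rough sketch of how to make that failure easier to diagnose, the lookup can be wrapped so that a missing field is reported by name instead of surfacing as a TypeError. This version uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra installs; the names InputValueCollector and extract_hidden_fields are made up for the illustration and are not part of any library.

```python
from html.parser import HTMLParser


class InputValueCollector(HTMLParser):
    """Collect the value attribute of every <input> element that has an id."""

    def __init__(self):
        super().__init__()
        self.values = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        if "id" in attributes:
            # An input without a value attribute is stored as an empty string.
            self.values[attributes["id"]] = attributes.get("value") or ""


def extract_hidden_fields(html, field_ids):
    """Return {field_id: value}; raise ValueError naming any missing field."""
    collector = InputValueCollector()
    collector.feed(html)
    missing = [name for name in field_ids if name not in collector.values]
    if missing:
        raise ValueError("page has no input with id: " + ", ".join(missing))
    return {name: collector.values[name] for name in field_ids}
```

If the ValueError fires, printing the response status code and the first few hundred characters of the body usually shows what the server actually returned in place of the logon form.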

Related

Code works from localhost but not on server - https://www.nseindia.com/api/equity-stockIndices?index=NIFTY%2050 - python

I am trying to access https://www.nseindia.com/api/equity-stockIndices?index=NIFTY%2050. It works fine from my localhost (run from VS Code), but when I deploy it on the server I get an HTTP 499 error.
Has anybody run into this and managed to fetch the data using this approach?
It looks like NSE is blocking the request somehow. But then why does it work from localhost?
P.S. I am a paid PythonAnywhere user (Hacker subscription).
import requests
import time
def marketDatafn(query):
    headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36'}
    main_url = "https://www.nseindia.com/"
    session = requests.Session()
    # Load the home page first so the session picks up the site's cookies.
    response = session.get(main_url, headers=headers)
    cookies = response.cookies
    url = "https://www.nseindia.com/api/equity-stockIndices?index=NIFTY%2050"
    nifty50DataReq = session.get(url, headers=headers, cookies=cookies, timeout=15)
    nifty50DataJson = nifty50DataReq.json()
    return nifty50DataJson['data']

Actually, PythonAnywhere only supports the websites that are on this whitelist.
I found only two subdomains of nseindia.com on it, and neither is the one you are trying to request:
bricsonline.nseindia.com
bricsonlinereguat.nseindia.com
So PythonAnywhere is blocking your request to that website.
Here's the link to read more about how to request that your website be added.

Http - Tunnel connection failed: 403 Forbidden error with Python web scraping

I am trying to scrape an HTTP website, and I get the error below when I try to read it.
HTTPSConnectionPool(host='proxyvipecc.nb.xxxx.com', port=83): Max retries exceeded with url: http://campanulaceae.myspecies.info/ (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 403 Forbidden',)))
Below is the code I have written for a similar website. I tried using urllib and setting a user-agent, and I still get the same issue.
import requests
from bs4 import BeautifulSoup
url = "http://campanulaceae.myspecies.info/"
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'})
soup = BeautifulSoup(response.text, 'html.parser')
Can anyone help me with this issue? Thanks in advance.

You should try passing a proxy when requesting the URL:
proxyDict = {
'http' : "add http proxy",
'https' : "add https proxy"
}
requests.get(url, proxies=proxyDict)
you can find more information here
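If you are not sure which proxy addresses to put in that dictionary, one possible starting point is to ask Python which proxy settings it detects on the machine. The sketch below relies on the standard library's urllib.request.getproxies(); the function name detected_proxies is made up for the illustration, and the result is only as good as the proxy settings configured on the machine.

```python
from urllib.request import getproxies


def detected_proxies():
    """Return the http/https proxy settings that Python detects, if any."""
    detected = getproxies()
    # Keep only the two schemes used in the proxies dictionary above.
    return {scheme: detected[scheme] for scheme in ("http", "https") if scheme in detected}


if __name__ == "__main__":
    print(detected_proxies())
```

The returned mapping has the same shape as proxyDict, so it can be passed as requests.get(url, proxies=detected_proxies()). If the proxy shown here is the one that answers with 403 Forbidden, the refusal comes from the proxy itself rather than from the target website.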

I tried setting the User-Agent header to "Defined" and it worked for me.
import requests
from bs4 import BeautifulSoup
url = "http://campanulaceae.myspecies.info/"
headers = {
"Accept-Language" : "en-US,en;q=0.5",
"User-Agent": "Defined",
}
response = requests.get(url, headers=headers)
response.raise_for_status()
data = response.text
soup = BeautifulSoup(data, 'html.parser')
print(soup.prettify())
If you get an error that says "bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: html-parser.", it means BeautifulSoup could not find the parser you named. Check the spelling first: the built-in parser is "html.parser", with a dot. Alternatively, install the lxml module and pass "lxml" instead of "html.parser" when you make the soup.

python https authentication error?

I've been trying to print time data from this site: clockofeidolon.com. I found that the hours, minutes and seconds are stored in <span class="big-x"> tags, and I have tried to get the data with this:
from bs4 import BeautifulSoup
from requests import Session
session = Session()
session.headers['user-agent'] = (
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/'
'66.0.3359.181 Safari/537.36'
)
url = 'https://clockofeidolon'
response = session.get(url=url)
data = response.text
soup = BeautifulSoup(data, "html.parser")
# Match every <span> that has a class starting with "big".
spans = soup.find_all('span', class_=lambda c: c and c.startswith('big'))
print(data)
print([span.text for span in spans])
I keep getting authentication errors, though:
socket.gaierror: [Errno 11001] getaddrinfo failed

This error occurs because you are trying to access a URL that doesn't exist (https://clockofeidolon) or that Python can't reach.
Look at this question, which explains what that error means:
"getaddrinfo failed", what does that mean?

The host clockofeidolon did not resolve to an IP. You were probably looking for clockofeidolon.com.
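A quick way to confirm this kind of problem before making the request is to test whether the host name resolves at all. The following is a minimal sketch using the standard library's socket.getaddrinfo, the same call named in the error message; the helper name host_resolves is made up for the illustration.

```python
import socket


def host_resolves(hostname, port=443):
    """Return True if the host name resolves to at least one address."""
    try:
        socket.getaddrinfo(hostname, port)
    except socket.gaierror:
        # This is the exception behind "getaddrinfo failed".
        return False
    return True


if __name__ == "__main__":
    for name in ("clockofeidolon", "clockofeidolon.com"):
        print(name, host_resolves(name))
```

A False result only says that name resolution failed from this machine. The cause may be a misspelled host, as here, or a missing network connection.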

How can I use POST from requests module to login to Github?

I have tried logging into GitHub using the following code:
url = 'https://github.com/login'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
'login':'username',
'password':'password',
'authenticity_token':'Token that keeps changing',
'commit':'Sign in',
'utf8':'%E2%9C%93'
}
res = requests.post(url)
print(res.text)
Now, res.text prints the code of the login page. I understand that this may be because the token keeps changing. I have also tried setting the URL to https://github.com/session, but that does not work either.
Can anyone tell me a way to generate the token? I am looking for a way to log in without using the API. I had asked another question where I mentioned that I was unable to log in. One comment said that I am not doing it right and that it is possible to log in using just the requests module, without the help of the GitHub API.
ME:
So, can I log in to Facebook or Github using the POST method? I have tried that and it did not work.
THE USER:
Well, presumably you did something wrong
Can anyone please tell me what I did wrong?
After the suggestion about using sessions, I have updated my code:
s = requests.Session()
headers = {Same as above}
s.put('https://github.com/session', headers=headers)
r = s.get('https://github.com/')
print(r.text)
I still can't get past the login page.

I think you end up back at the login page because you are redirected, and since your code doesn't send your cookies back, you can't have a session.
You are looking for session persistence, which requests provides:
Session Objects
The Session object allows you to persist certain parameters across requests. It also persists cookies across all requests made from the Session instance, and will use urllib3's connection pooling. So if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase (see HTTP persistent connection).
s = requests.Session()
s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = s.get('http://httpbin.org/cookies')
print(r.text)
# '{"cookies": {"sessioncookie": "123456789"}}'
http://docs.python-requests.org/en/master/user/advanced/

Actually, in a POST request the parameters should be in the request body, not in the headers. So the login data should go in the data parameter.
For GitHub, the authenticity token is in the value attribute of an input tag, which can be extracted using the BeautifulSoup library.
This code works fine:
import requests
from getpass import getpass
from bs4 import BeautifulSoup
headers = {
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
login_data = {
'commit': 'Sign in',
'utf8': '%E2%9C%93',
'login': input('Username: '),
'password': getpass()
}
url = 'https://github.com/session'
session = requests.Session()
response = session.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html5lib')
login_data['authenticity_token'] = soup.find(
'input', attrs={'name': 'authenticity_token'})['value']
response = session.post(url, data=login_data, headers=headers)
print(response.status_code)
response = session.get('https://github.com', headers=headers)
print(response.text)
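To make the header versus body distinction from this answer concrete, here is a small sketch that builds a form POST without sending it, so the two parts can be inspected. It uses the standard library's urllib rather than requests, purely so that nothing goes over the network; the function name build_form_post and the example values are made up for the illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_form_post(url, form_fields, user_agent):
    """Build (but do not send) a POST whose form fields travel in the body."""
    # Form fields are URL-encoded into the body, which is what data= does in requests.
    body = urlencode(form_fields).encode("ascii")
    # Only metadata about the request, such as the user agent, goes in the headers.
    return Request(url, data=body, headers={"User-Agent": user_agent}, method="POST")


if __name__ == "__main__":
    req = build_form_post(
        "https://example.com/session",
        {"login": "alice", "password": "s3cret"},
        "sketch/0.1",
    )
    print(req.get_method(), req.data, req.header_items())
```

With requests, the equivalent is session.post(url, data=login_data, headers=headers), as in the code above: the credentials and token go in data, and the headers dictionary carries only the user agent.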

This code works perfectly:
import requests
from bs4 import BeautifulSoup
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
login_data = {
    'commit': 'Sign in',
    'utf8': '%E2%9C%93',
    'login': 'your-username',
    'password': 'your-password'
}
with requests.Session() as s:
    url = "https://github.com/session"
    # Fetch the page first to read the authenticity token from the form.
    r = s.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'html5lib')
    login_data['authenticity_token'] = soup.find('input', attrs={'name': 'authenticity_token'})['value']
    # Send the credentials and the token in the request body.
    r = s.post(url, data=login_data, headers=headers)

You can also try the PyGithub library to perform common GitHub tasks.
Check the link below:
https://github.com/PyGithub/PyGithub

Error: HTTPS site requires a 'Referer header' to be sent by your Web browser, but none was sent

You are seeing this message because this HTTPS site requires a 'Referer header' to be sent by your Web browser, but none was sent. This header is required for security reasons, to ensure that your browser is not being hijacked by third parties.
I was trying to log in to a website using requests but received the error above. How do I create a 'Referer header'?
payload = {'inUserName': 'xxx.com', 'inUserPass': 'xxxxxx'}
url = 'https:xxxxxx'
req=requests.post(url, data=payload)
print(req.text)

You can pass the headers you want to send with your request as a keyword argument to requests.post:
payload = {'inUserName': 'xxx.com', 'inUserPass': 'xxxxxx'}
url = 'https:xxxxxx'
req=requests.post(url, data=payload, headers={'Referer': 'yourReferer'})
print(req.text)

I guess you are using this library: http://docs.python-requests.org/en/latest/user/quickstart/
If that is the case, you have to add a custom Referer header (see the section on custom headers). The code would be something like this:
url = '...'
payload = ...
headers = {'Referer': 'https://...'}
r = requests.post(url, data=payload, headers=headers)
For more information on the referer see this wikipedia article: https://en.wikipedia.org/wiki/Referer
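If you are unsure what value to send, one plausible choice is the URL of the login page itself, which is what a browser sends when it submits the form from that page. The sketch below derives a Referer, and a matching Origin, from an absolute page URL using only the standard library. The function name referer_headers is made up for the illustration, and whether a given site accepts these values is an assumption to verify against what your browser sends.

```python
from urllib.parse import urlsplit


def referer_headers(page_url):
    """Return Referer and Origin headers derived from an absolute page URL."""
    parts = urlsplit(page_url)
    if not parts.scheme or not parts.netloc:
        # A relative path such as "/login/" has no scheme or host to build on.
        raise ValueError("expected an absolute URL, got: " + repr(page_url))
    return {"Referer": page_url, "Origin": parts.scheme + "://" + parts.netloc}


if __name__ == "__main__":
    print(referer_headers("https://example.com/login/"))
```

The resulting dictionary can be passed as the headers argument of requests.post, or merged into an existing headers dictionary.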

I was getting the same error in Chrome. What I did was disable all my Chrome extensions, including ad blockers. After that I reloaded the page I wanted to scrape and logged in once again. Then, in the code, as @Stephan Kulla mentioned, you need to add headers. I added user-agent, referer, referrer-policy and origin. You can get all of these from the Network tab of the browser's developer tools (Inspect).
Add all of those to the headers and try to log in again using POST; it should work. (It worked for me.)
import requests
ori = 'https:......'
login_route = 'login/....'
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
    'origin': 'https://www.screener.in',
    'referer': '/login/',
    'referrer-policy': 'same-origin'
}
s = requests.Session()
# The CSRF token arrives as a cookie when the login page is fetched.
csrf = s.get(ori + login_route).cookies['csrftoken']
payload = {
    'username': 'xxxxxx',
    'password': 'yyyyyyy',
    'csrfmiddlewaretoken': csrf
}
login_req = s.post(ori + login_route, headers=header, data=payload)
