Below is some code that I have been trying to use to log in to the Cook's Illustrated website (https://www.cooksillustrated.com/sign_in).
I start a session, get the authenticity token and a hidden encoding field, and then pass the "name" and "value" of the email and password fields (found by inspecting the elements in Chrome). The form doesn't seem to contain any other elements; however, the POST doesn't log me in.
I noticed that all of the CSRF tokens ended in "==", so I tried removing those two characters, but that didn't work.
I also tried modifying the POST to use the "id" of the form inputs instead of the "name" (just a shot in the dark, really; "name" seems like it should work from what I've seen in other examples).
Any thoughts would be much appreciated.
import requests, lxml.html
s = requests.session()
# go to the login page and get its text
login = s.get('https://www.cooksillustrated.com/sign_in')
login_html = lxml.html.fromstring(login.text)
# find the hidden fields names and values; store in a dictionary
hidden_inputs = login_html.xpath(r'//form//input[@type="hidden"]')
form = {x.attrib['name']: x.attrib['value'] for x in hidden_inputs}
print(form)
# I noticed that they all ended in two = signs, so I tried taking that off
# form['authenticity_token'] = form['authenticity_token'][:-2]
# this adds to the form payload the two named fields for user name and password
# found using the "inspect elements" on the login screen
form['user[email]'] = 'my_email'
form['user[password]'] = 'my_pw'
# this uses "id" instead of "name" from the input fields
#form['user_email'] = 'my_email'
#form['user_password'] = 'my_pw'
response = s.post('https://www.cooksillustrated.com/sign_in', data=form)
print(form)
# trying to see if it worked - but the response URL is login again instead of main page
# and it can't find my name
# responses are okay, but I think that just means it posted the form
print(response.url)
print('Christopher' in response.text)
print(response.status_code)
print(response.ok)
Well, the POST request URL should be https://www.cooksillustrated.com/sessions, and if you capture all traffic while logging in, you'll find out the actual POST request made to the server:
POST /sessions HTTP/1.1
Host: www.cooksillustrated.com
Connection: keep-alive
Content-Length: 179
Cache-Control: max-age=0
Origin: https://www.cooksillustrated.com
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Referer: https://www.cooksillustrated.com/sign_in
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.8
utf8=%E2%9C%93&authenticity_token=Uvku64N8V2dq8z%2BGerrqWNobn03Ydjvz8xqgOAvfBmvDM%2B71xJWl2DmRU4zbBE15gGVESmDKP2E16KIqBeAJ0g%3D%3D&user%5Bemail%5D=demo&user%5Bpassword%5D=demodemo
Notice that the last line is the encoded form data for this request; there are 4 parameters: utf8, authenticity_token, user[email] and user[password].
So in your case, form should include all of them:
form = {'user[email]': 'my_email',
        'user[password]': 'my_pw',
        'utf8': '✓',
        'authenticity_token': 'xxxxxx'  # make sure you don't strip the trailing '=='
        }
Also, you might want to add some headers to appear as coming from Chrome (or whatever browser you like), since the default User-Agent of requests is python-requests/2.13.0 and some websites don't like traffic from "bots":
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
           'Accept-Encoding': 'gzip, deflate, br',
           ... # more
           }
Now we're ready to make the POST request:
response = s.post('https://www.cooksillustrated.com/sessions', data=form, headers=headers)
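Putting the answer together, here's a minimal sketch of the scraping step. The HTML snippet below is a hypothetical stand-in for the real sign_in page (which you'd fetch with s.get), chosen to match the field names in the captured request:

```python
import lxml.html

# Hypothetical snippet standing in for the real sign_in page, with the
# same field names as the captured request above.
sample = '''
<form action="/sessions" method="post">
  <input type="hidden" name="utf8" value="&#x2713;"/>
  <input type="hidden" name="authenticity_token" value="abc123=="/>
  <input type="text" name="user[email]"/>
  <input type="password" name="user[password]"/>
</form>
'''

doc = lxml.html.fromstring(sample)
# Note @type (not #type) in the XPath predicate.
hidden = doc.xpath('//form//input[@type="hidden"]')
form = {x.attrib['name']: x.attrib['value'] for x in hidden}
form['user[email]'] = 'my_email'
form['user[password]'] = 'my_pw'

# The form's action attribute confirms the POST target: /sessions,
# not /sign_in.
action = doc.xpath('//form/@action')[0]
```

With the real page, the same dict comprehension picks up utf8 and authenticity_token automatically, '==' suffix and all, so nothing needs to be hard-coded.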
I want to get the data points from the graph at https://index.mysteel.com/price/getChartMultiCity_1_0.html.
I found out from Developer Tools -> Network -> XHR that a request is made when I explore the graph. The response has all the data needed, like date and price.
If I just copy the request URL and fetch the data with python's requests, I get the correct response. Example:
import requests
r = requests.get('https://index.mysteel.com/zs/newprice/getBaiduChartMultiCity.ms?catalog=%25E8%259E%25BA%25E7%25BA%25B9%25E9%2592%25A2_%3A_%25E8%259E%25BA%25E7%25BA%25B9%25E9%2592%25A2&city=%25E4%25B8%258A%25E6%25B5%25B7%3A15278&spec=HRB400E%252020MM_%3A_HRB400E_20MM&startTime=2021-08-10&endTime=2021-08-12&callback=json')
r.json()
{'marketDatas': .........}
However, if I change the values of the endTime or startTime query parameters, I get a 401 error:
Missing authentication parameters!
It looks like I can only send requests that my browser has already sent (I'm running a Jupyter notebook in the same browser). Even if I add request headers like those in the Network tab, I still get the same error:
params = {'catalog': '%E8%9E%BA%E7%BA%B9%E9%92%A2_:_%E8%9E%BA%E7%BA%B9%E9%92%A2', 'city': '%E4%B8%8A%E6%B5%B7:15278',
          'spec': 'HRB400E%2020MM_:_HRB400E_20MM', 'startTime': '2021-08-10', 'endTime': '2021-08-12', 'callback': 'json'}
headers = {'Host': 'index.mysteel.com',
           'User-Agent': 'Mozilla/5.0',
           'Referer': 'https://index.mysteel.com/price/getChartMultiCity_1_0.html',
           'appKey': '47EE3F12CF0C443F851EFDA73AC815',
           'Cookie': 'href=https%3A%2F%2Findex.mysteel.com%2Fprice%2FgetChartMultiCity_1_0.html; accessId=5d36a9e0-919c-11e9-903c-ab24d11b; pageViewNum=2'}
url = 'https://index.mysteel.com/zs/newprice/getBaiduChartMultiCity.ms'
r = requests.get(url, params=params, headers=headers)
What am I missing here? I'm not authenticating myself in the browser, so there shouldn't be any authentication issues. Is there some other parameter missing from my headers dict?
PS: These are all the request headers:
Accept: */*
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9,sq;q=0.8,de;q=0.7
appKey: 47EE3F12CF0C443F851EFDA73AC815
Connection: keep-alive
Cookie: href=https%3A%2F%2Findex.mysteel.com%2Fprice%2FgetChartMultiCity_1_0.html; accessId=5d36a9e0-919c-11e9-903c-ab24d11b; pageViewNum=2
dnt: 1
Host: index.mysteel.com
Referer: https://index.mysteel.com/price/getChartMultiCity_1_0.html
sec-ch-ua: "Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"
sec-ch-ua-mobile: ?0
Sec-Fetch-Dest: empty
Sec-Fetch-Mode: cors
Sec-Fetch-Site: same-origin
sec-gpc: 1
sign: D404184969BF3B8C081A9F0C913AF68E
timestamp: 1629136315665
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36
version: 1.0.0
X-Requested-With: XMLHttpRequest
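Two observations on the capture above. First, the sign and timestamp headers look like per-request values computed by the page's JavaScript (a request signature), which would explain why a URL copied from the browser works once but any modified request gets "Missing authentication parameters!". Second, the params dict in the question holds values that are already percent-encoded, and requests encodes query parameters again when building the URL, so the decoded text should be passed instead. A small sketch of that decoding, using values copied from the question:

```python
from urllib.parse import unquote

# Values as they appear in the question's params dict: already
# percent-encoded, so requests would encode them a second time.
encoded = {
    'catalog': '%E8%9E%BA%E7%BA%B9%E9%92%A2_:_%E8%9E%BA%E7%BA%B9%E9%92%A2',
    'city': '%E4%B8%8A%E6%B5%B7:15278',
}
# Decode once, then let requests do the (single) encoding itself
# when the dict is passed via params=.
decoded = {k: unquote(v) for k, v in encoded.items()}
```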
I've been trying to automatically log in with Python requests to a page to download a file every hour, but I haven't had any luck with it.
The page is this one: https://www.still-fds.com/fleetmanager/login/dercomaq
I'm using the Session object like this:
import requests
login_URL = 'https://www.still-fds.com/fleetmanager/login/dercomaq'
next_URL = 'https://www.still-fds.com/fleetmanager/pages/reports/vehicle.xhtml'
# (these aren't the real username and password; I use the real ones in my code)
payload = {"j_idt181:username": "User", "j_idt181:password": "Password"}

with requests.Session() as ses:
    ses.get(login_URL)  # I get a JSESSIONID cookie here
    ses.post(login_URL, data=payload)  # I send the login request
    r = ses.get(next_URL)  # I try accessing the next page after login
    with open('login-test.html', 'wb') as f:  # write the HTML I get back to a file so I can preview it
        f.write(r.content)
However, when I check the preview of the next page, it always redirects/shows me the login page.
I've also tried sending a more complete payload, copying everything that gets sent in the normal login request, like this
payload = {
    'javax.faces.partial.ajax': 'true',
    'javax.faces.source': 'j_idt181:j_idt190',
    'javax.faces.partial.execute': '@all',
    'j_idt181:j_idt190': 'j_idt181:j_idt190',
    'j_idt181': 'j_idt181',
    'j_idt181:username': 'User',
    'j_idt181:password': 'Password',
    'javax.faces.ViewState': '-4453297688092219000:-1561371993877484606'
}
And I've tried copying the request headers:
# I replace the JSESSIONID cookie with the one I'm given in the first GET request
req_header = {
    'Accept': 'application/xml, text/xml, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'es-ES,es;q=0.9,en-US;q=0.8,en;q=0.7',
    'Connection': 'keep-alive',
    'Content-Length': '287',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Cookie': 'JSESSIONID=U27BXdHsu+EXITX-QfUGkmqW; clientId=dercomaq',
    'Faces-Request': 'partial/ajax',
    'Host': 'www.still-fds.com',
    'Origin': 'https://www.still-fds.com',
    'Referer': 'https://www.still-fds.com/fleetmanager/login/dercomaq',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-origin',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest'
}
However, I never get the clientId cookie (I've tried manually giving it a clientId cookie, since it's always 'dercomaq', but that doesn't help either).
All of this works very easily with Selenium and ChromeDriver, but because of web app constraints I can't use that.
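For what it's worth, one likely culprit in the attempts above: JSF generates a fresh javax.faces.ViewState for every rendered view, so a value copied from an old browser session won't validate. A sketch of scraping it from the login page instead; the HTML fragment below is a hypothetical stand-in for the real response from ses.get(login_URL), and the j_idt181:* names (taken from the question) may themselves vary per deployment:

```python
import re

# Hypothetical stand-in for the HTML returned by ses.get(login_URL).
page = ('<input type="hidden" name="javax.faces.ViewState" '
        'id="j_id1:javax.faces.ViewState:0" value="1234:5678" />')

# Pull the per-view token out of the hidden input.
m = re.search(r'name="javax\.faces\.ViewState"[^>]*value="([^"]+)"', page)

# Post the freshly scraped token, not one copied from an old session.
payload = {'j_idt181:username': 'User',
           'j_idt181:password': 'Password',
           'javax.faces.ViewState': m.group(1)}
```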
I'm trying to get some data from a page, but it returns the error [403 Forbidden].
I thought it was the user agent, but I tried several user agents and it still returns the error.
I also tried to use the fake-useragent library, but I did not succeed.
import requests
from fake_useragent import UserAgent

with requests.Session() as c:
    url = '...'
    #headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2224.3 Safari/537.36'}
    ua = UserAgent()
    header = {'User-Agent': str(ua.chrome)}
    page = c.get(url, headers=header)
    print page.content
When I access the page manually, everything works.
I'm using Python 2.7.14 and the requests library. Any ideas?
The site could be using anything in the request to trigger the rejection.
So, copy all the headers from the request that your browser makes. Then delete them one by one¹ to find out which are essential.
As per "Python requests. 403 Forbidden", to add custom headers to the request, do:
result = requests.get(url, headers={'header':'value', <etc>})
¹ A faster way would be to delete half of them each time, but that's more complicated since there are probably multiple essential headers.
These are all the headers I can see included by the browser for a generic GET request:
Host: <URL>
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Try including all of those incrementally in your request (one by one) in order to identify which one(s) are required for a successful request.
On the other hand, take a look at the Cookies and/or Security tabs available in your browser's developer tools under the Network option.
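The elimination loop described above can be sketched like this. fetch is a placeholder for something like lambda h: c.get(url, headers=h).status_code, and fake_fetch below just simulates a server that happens to require User-Agent and Accept-Language:

```python
# Start from the full set of headers copied from the browser.
full = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Upgrade-Insecure-Requests': '1'}

def essential_headers(full, fetch):
    """Drop one header at a time; keep those whose absence breaks the request."""
    essential = []
    for name in full:
        trial = {k: v for k, v in full.items() if k != name}
        if fetch(trial) != 200:  # the request fails without this header
            essential.append(name)
    return essential

# Toy stand-in for a real request function, for demonstration only.
fake_fetch = lambda h: 200 if 'User-Agent' in h and 'Accept-Language' in h else 403
```

Against a real site, each trial costs one request, which is why bisecting the header set is faster when there are many headers.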
I want to parse some data from a page, but this data is shown only to registered users. So, I have to log in first and then parse it. There is no problem with the parsing side, but I have a problem on the login side. Here is my login code:
I have changed private domain name to domain.com
import requests

with requests.Session() as c:
    url = 'https://domain.com/giris?returnUrl=https://domain.com/'
    USERNAME = 'xxxxxxx#gmail.com'
    PASSWORD = '1111111'
    c.get(url)
    __RequestVerificationToken = c.cookies['__RequestVerificationToken']
    login_data = dict(__RequestVerificationToken=__RequestVerificationToken,
                      UserName=USERNAME, Password=PASSWORD,
                      ReturnUrl='https://domain.com/', RememberMe='false')
    c.post(url, data=login_data,
           headers={"Referer": "https://domain.com/giris?returnUrl=https%3A%2F%2Fdomain.com%2F"})
    page = c.get('https://domain.com/mesaj/')
    print page.content
If the login succeeded I should see the https://domain.com/mesaj/ page, but it redirects to the login page again because the login was unsuccessful.
Also, here are the headers of the login request, captured from Google Chrome:
General
Request URL:https://domain.com/giris
Request Method:POST
Status Code:302 Found
Remote Address:176.53.43.2:443
Response Headers
Cache-Control:no-cache
Content-Length:140
Content-Type:text/html; charset=utf-8
Date:Tue, 24 Jan 2017 16:51:30 GMT
Expires:-1
Location:https://domain.com/
Pragma:no-cache
Set-Cookie:a=vjFFqBh+ZZMKr71K2XrBCr5SutOMAOpWjv1RAS5hRYMrR2RojaTV/wgIP8HiUOjdMU7x28DpfxRsCnfSvLeLHPvGTBKjwF0O5W99julK7w23vdctrnE5FDBlhXSSB9nCQm+DB3vNgGjxEr+DNRMrWNwMZWbSQID+klPDtUnReAJQA/GfLdoo2izsD0HP6tir; path=/; HttpOnly
Strict-Transport-Security:max-age=31536000; includeSubDomains; preload
X-Content-Type-Options:nosniff
X-Frame-Options:DENY
X-XSS-Protection:1; mode=block
Request Headers
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip, deflate, br
Accept-Language:tr-TR,tr;q=0.8,en-US;q=0.6,en;q=0.4
Cache-Control:max-age=0
Connection:keep-alive
Content-Length:251
Content-Type:application/x-www-form-urlencoded
Cookie:iq=c3bae8d44a624f03a431c8df6741af36; __gfp_64b=0icQApqT.NFM3ZR0rZLAyXumlDI4n2eStqQJ74n4H...U7; __gads=ID=bb39b708ca7de25d:T=1484676131:S=ALNI_MYxcRsRoDIaQmsY859bPz_jriFRDA; ASP.NET_SessionId=efsub3xz101y542pyjhugvi2; cookies_info_viewed=yes; notheme=1; __adm_int_sc=1; __adm_int=1; __RequestVerificationToken=xzayOzscFJ3a_m4C5jF8eaKrW7F_Yen7umGMm_nZxDPKmO5rUKacPc4yHK63wVqQwd2S_H2mLiKt_ROW2pCG1B5ZTEtytYF-GU0khK2BlnM1; _gat=1; _ga=GA1.2.633914475.1484676130; __asc=14f63ba5159d14e81061abc20c9; __auc=328f9ce5159ad97e58cfeb70218
Host:domain.com
Origin:https://domain.com
Referer:https://domain.com/giris?returnUrl=https%3A%2F%2Fdomain.com%2F
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36
Form Data
__RequestVerificationToken:F2ZCmIge7rSV4A4Xoelf3aweaDQ9vNHew16Bfb6GDSlpeFQeQ_cfmV6UrFhNRWRBqvGPXzXrxVLAIXgbKI-08Q0fD3Vfttezq5hTkMFYTwo1
ReturnUrl:https://domain.com/
UserName:xxxxxxx#gmail.com
Password:1111111
RememberMe:false
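One likely problem in the code above: ASP.NET's anti-forgery scheme uses two different tokens, one in the __RequestVerificationToken cookie and one in a hidden form input, and the form field must carry the hidden-input value. The capture confirms they differ (the cookie token starts with xzayOz..., the form token with F2ZCmIge...), while the code posts the cookie value in the form. A sketch of scraping the form token instead, using a hypothetical HTML fragment with a made-up token value in place of the real c.get(url) response:

```python
import re

# Hypothetical fragment standing in for the login page returned by
# c.get(url); the token value here is made up for illustration.
page = ('<input name="__RequestVerificationToken" type="hidden" '
        'value="F2ZCmIge_form_token_value" />')

m = re.search(r'name="__RequestVerificationToken"[^>]*value="([^"]+)"', page)
form_token = m.group(1)

# Send the form token in the body; the cookie token travels
# automatically in the session's cookie jar.
login_data = dict(__RequestVerificationToken=form_token,
                  UserName='xxxxxxx#gmail.com', Password='1111111',
                  ReturnUrl='https://domain.com/', RememberMe='false')
```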
I'm trying to write a Python script to log in to my fantasy football account at https://fantasy.premierleague.com/, but something is not quite right with my login. When I log in through my browser and check the details using Chrome developer tools, I find that the request URL is https://users.premierleague.com/accounts/login/ and the form data sent is:
csrfmiddlewaretoken:[My token]
login:[My username]
password:[My password]
app:plfpl-web
redirect_uri:https://fantasy.premierleague.com/a/login
There are also a number of Request headers:
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip, deflate, br
Accept-Language:en-US,en;q=0.8
Cache-Control:max-age=0
Connection:keep-alive
Content-Length:185
Content-Type:application/x-www-form-urlencoded
Cookie:[My cookies]
Host:users.premierleague.com
Origin:https://fantasy.premierleague.com
Referer:https://fantasy.premierleague.com/
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36
So I've written a short Python script using the requests library to try to log in and navigate to a page, as follows:
import requests

with requests.Session() as session:
    url_home = 'https://fantasy.premierleague.com/'
    html_home = session.get(url_home)
    csrftoken = session.cookies['csrftoken']
    values = {
        'csrfmiddlewaretoken': csrftoken,
        'login': <My username>,
        'password': <My password>,
        'app': 'plfpl-web',
        'redirect_uri': 'https://fantasy.premierleague.com/a/login'
    }
    head = {
        'Host': 'users.premierleague.com',
        'Referer': 'https://fantasy.premierleague.com/',
    }
    session.post('https://users.premierleague.com/accounts/login/',
                 data=values, headers=head)
    url_transfers = 'https://fantasy.premierleague.com/a/squad/transfers'
    html_transfers = session.get(url_transfers)
    print(html_transfers.content)
On printing out the content of my POST request, I get an HTTP 500 error with:
b'\n<html>\n<head>\n<title>Fastly error: unknown domain users.premierleague.com</title>\n</head>\n<body>\nFastly error: unknown domain: users.premierleague.com. Please check that this domain has been added to a service.</body></html>'
If I remove 'Host' from my head dict, I get an HTTP 405 error with:
b''
I've tried including various combinations of the Request headers in my head dict and nothing seems to work.
The following worked for me. I simply removed headers = head:
session.post('https://users.premierleague.com/accounts/login/',
             data=values)
I think you are trying to pick your team programmatically, like me. Your code got me started, thanks.