I am trying to log into zyBooks using the Requests library in Python. In the Network tab of Google Chrome I saw that I need an auth_token, appended to the URL, to build and send the actual login request. First, here is a snapshot of the Network tab after I log into the website:
So first, I need to make the first POST request, named 'signin' (the second entry in the list; the first entry, an OPTIONS request, doesn't seem to do or return anything). The signin POST request is supposed to respond with an auth_token, which I can then use to log in with the third entry in the list, which is the first GET request.
The response of the first POST request is the auth_token:
And here is the detail about the first POST request. You can see the request URL and the payload required:
As proof, here is what the request URL looks like. As you can see, it needs the auth_token.
However, I have been unable to get the auth_token from the first POST request in any way that I have tried so far. The request URL for both of the first two 'signin' requests is the one in the code. Here is the code:
import requests
url = 'https://learn.zybooks.com/signin'
payload = {"email":"myemail","password":"mypassword"}
headers = {
'Host': 'zyserver.zybooks.com',
'Connection': 'keep-alive',
'Content-Length': '52',
'Pragma': 'no-cache',
'Cache-Control': 'no-cache',
'sec-ch-ua': "Chromium;v=88, Google Chrome;v=88, ;Not A Brand;v=99",
'Accept': 'application/json, text/javascript, */*; q=0.01',
'DNT': '1',
'sec-ch-ua-mobile': '?0',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
'Content-Type': 'application/json',
'Origin': 'https://learn.zybooks.com',
'Sec-Fetch-Site': 'same-site',
'Sec-Fetch-Mode': 'cors',
'Sec-Fetch-Dest': 'empty',
'Referer': 'https://learn.zybooks.com/',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9',
}
session = requests.Session()
req1 = session.post(url)
req2 = session.post(url, data=payload)
print(req2.json())
I just get a JSONDecodeError:
353 obj, end = self.scan_once(s, idx)
354 except StopIteration as err:
--> 355 raise JSONDecodeError("Expecting value", s, err.value) from None
356 return obj, end
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
From what I have researched in many posts online, this error happens because the response doesn't contain any JSON. But that doesn't make any sense as I need that JSON response with the auth_token to be able to create the GET request to login to the site.
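One way to see what the server actually returned before calling .json() is to look at the status code, the Content-Type header, and the start of the body. The following is only a minimal sketch: the helper name describe_body is hypothetical, and it takes plain values so that it works with any HTTP client.

```python
import json


def describe_body(status_code, content_type, text):
    """Return (parsed_json, note); parsed_json is None when the body is not JSON."""
    if not text.strip():
        return None, f"HTTP {status_code}: empty body"
    try:
        return json.loads(text), f"HTTP {status_code}: valid JSON"
    except json.JSONDecodeError:
        # Show only the start of the body; an HTML page here often means the
        # request reached a web page rather than a JSON endpoint.
        preview = text.strip()[:60]
        kind = content_type or "no Content-Type"
        return None, f"HTTP {status_code}: not JSON ({kind}), starts with {preview!r}"


# With Requests this would be called as:
#   parsed, note = describe_body(req2.status_code,
#                                req2.headers.get("Content-Type", ""),
#                                req2.text)
parsed, note = describe_body(200, "text/html", "<!DOCTYPE html><html></html>")
print(note)
```

If the note reports HTML, the request most likely did not reach the endpoint that issues the auth_token.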
Got it. It's because zyBooks runs on Ember.js. The page contains barely any HTML; it's a JavaScript website. The JavaScript needs to be loaded first, and only then can the form be filled in and submitted.
I did not go through with implementing it myself, but for future readers, there are posts on this subject, such as:
using requests to login to a website that has javascript login form
I am using https://developer.microsoft.com/en-us/graph/graph-explorer to make requests.
I am trying to convert them to Python to use for general automation.
I always copy from browser > Postman > code, so I have all the cookies, tokens, etc. that I need, and my Python request works until something expires. In this case, that something is a bearer token.
I can't figure out how to get a new, valid bearer token other than redoing the above process or copying just the token and pasting it into my code.
While trying to find an auth request that would spit one out, I came across a collection for Postman here:
https://learn.microsoft.com/en-us/azure/active-directory/develop/v2-oauth2-auth-code-flow
and when I replace {{tenant}} with my org's tenant_id, I get a 200 response with a bearer token, but when I insert this bearer token into my Graph API request code I get the following error:
{"error":{"code":"BadRequest","message":"/me request is only valid with delegated authentication flow.","innerError":{"date":"2022-10-23T14:31:22","request-id":"...","client-request-id":"..."}}}
Here is a screenshot of the Postman Auth
Here is my Graph API call that only works with bearer tokens copied from graph-explorer
import requests

def recreate_graph_request1(bearer=None):
    '''
    I went to https://developer.microsoft.com/en-us/graph/graph-explorer
    and picked a request. Outlook>GET emails from a user
    at first response was for some generic user, but I logged in using my account and it actually worked.
    Then I used my old copy curl as bash trick to make it python
    :return: the requests.Response of the Graph API call
    '''
    url = "https://graph.microsoft.com/v1.0/me/messages?$filter=(from/emailAddress/address)%20eq%20%27my.boss#company.com%27"
    payload = {}
    headers = {
        'Accept': '*/*',
        'Accept-Language': 'en-US,en;q=0.9',
        'Authorization': bearer,
        'Connection': 'keep-alive',
        'Origin': 'https://developer.microsoft.com',
        'Referer': 'https://developer.microsoft.com/',
        'SdkVersion': 'GraphExplorer/4.0, graph-js/3.0.2 (featureUsage=6)',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-site',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
        'client-request-id': 'n0t_th3_s4m3_4s_1n_P05tm4n',
        'sec-ch-ua': '"Chromium";v="106", "Google Chrome";v="106", "Not;A=Brand";v="99"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"'
    }
    response = requests.request("GET", url, headers=headers, data=payload)
    return response

token_from_ms_auth = 'eyCOPIED_FROM_POSTMAN....'
bearer_from_ms_auth = 'Bearer ' + token_from_ms_auth
print(recreate_graph_request1(bearer_from_ms_auth).text)
TBH, I was not overly optimistic that any bearer token would work, even if it was somehow associated with my tenant - but I hoped it would, and the resulting disappointment has driven me to ask the universe for help. I do not understand these meandering flows and looking at others' answers only confuses me more. I am hoping someone can help me figure out this scenario.
Access tokens are short-lived. Refresh them after they expire to continue accessing resources.
Please refer to this document: https://learn.microsoft.com/en-us/azure/active-directory/develop/v2-oauth2-auth-code-flow#refresh-the-access-token
Hope this helps.
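The refresh step described in that document can be sketched as follows. This is an illustration under assumptions, not a tested integration: the tenant, client ID, refresh token, and scope values are placeholders, the function name build_refresh_request is hypothetical, and a refresh token is only issued when the original sign-in requested the offline_access scope. Confidential clients also have to send a client_secret. Because the /me endpoint needs a delegated (user) token, as the error message says, the refresh token has to come from a user sign-in such as the authorization code flow.

```python
import urllib.parse
import urllib.request


def build_refresh_request(tenant, client_id, refresh_token, scope):
    """Build (but do not send) a refresh-token request for the v2.0 token endpoint."""
    url = f"https://login.microsoftonline.com/{tenant}/oauth2/v2.0/token"
    form = {
        "client_id": client_id,
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
        "scope": scope,
    }
    body = urllib.parse.urlencode(form).encode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Placeholder values; none of these come from the question.
req = build_refresh_request(
    "my-tenant-id",
    "my-client-id",
    "my-refresh-token",
    "https://graph.microsoft.com/Mail.Read offline_access",
)
print(req.full_url)
# To send it: json.load(urllib.request.urlopen(req))["access_token"]
```

The JSON response normally contains a new access_token and usually a new refresh_token, which should replace the stored one.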
I am trying to get the response body of this request "ListByMovieAndDate" from this specific website:
https://hkmovie6.com/movie/d88a803b-4a76-488f-b587-6ccbd3f43d86/SHOWTIME
Screenshot below is the request in Chrome Dev Tool.
I have tried several methods to mimic the request, including
copying the request as cURL (bash) and using a tool to translate it to Python request
import requests
headers = {'authority': 'hkmovie6.com',
'sec-ch-ua': '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"',
'authorization': 'eyJhbGciOiJIUzUxMiIsImtpZCI6ImFjY2VzcyIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJtb3ZpZTYiLCJhdWQiOiJyb2xlLmJhc2ljIiwiZXhwIjoxNjI4MDg0NTUxLCJpYXQiOjE2MjgwODI3NTEsImp0aSI6IjQxZjJmZDBjLTk3YzgtNDFiYi04NDRiLTU5YWM5MTY0ZmYyNSJ9.jz_G80XDafzSHyzxog1IAY_xikAdQEEFizJXkiiHkNhwAY-MWF1E11Nel7WrsDlE184tcFtSjUKbHdx7281dFA',
'x-grpc-web': '1',
'language': 'zhHK',
'sec-ch-ua-mobile': '?0',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
'content-type': 'application/grpc-web+proto',
'accept': '*/*',
'origin': 'https://hkmovie6.com',
'sec-fetch-site': 'same-origin',
'sec-fetch-mode': 'cors',
'sec-fetch-dest': 'empty',
'referer': 'https://hkmovie6.com/movie/d88a803b-4a76-488f-b587-6ccbd3f43d86/SHOWTIME',
'accept-language': 'en-US,en;q=0.9,zh-TW;q=0.8,zh;q=0.7,ja;q=0.6',
'cookie': '__stripe_mid=dfb76ec9-1469-48ef-81d6-659f8d7c12da9a119d; lang=zhHK; auth=%7B%22isLogin%22%3Afalse%2C%22access%22%3A%7B%22token%22%3A%22eyJhbGciOiJIUzUxMiIsImtpZCI6ImFjY2VzcyIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJtb3ZpZTYiLCJhdWQiOiJyb2xlLmJhc2ljIiwiZXhwIjoxNjI4MDg0NTUxLCJpYXQiOjE2MjgwODI3NTEsImp0aSI6IjQxZjJmZDBjLTk3YzgtNDFiYi04NDRiLTU5YWM5MTY0ZmYyNSJ9.jz_G80XDafzSHyzxog1IAY_xikAdQEEFizJXkiiHkNhwAY-MWF1E11Nel7WrsDlE184tcFtSjUKbHdx7281dFA%22%2C%22expiry%22%3A1628084551%7D%2C%22refresh%22%3A%7B%22token%22%3A%22eyJhbGciOiJIUzUxMiIsImtpZCI6InJlZnJlc2giLCJ0eXAiOiJKV1QifQ.eyJpc3MiOiJtb3ZpZTYiLCJhdWQiOiJyb2xlLmJhc2ljIiwiZXhwIjoxNjMwNjc0NzUxLCJpYXQiOjE2MjgwODI3NTEsImp0aSI6IjM0YWFjNWVhLTkwZTctNDdhYS05OTE3LTQ5N2UxMGUwNmU3YSJ9.Mrwt2iWddQHthQNHafF4mirU-JiynidiTzq0X4J96IMICcWbWEoZBB4M1HhvFdeB2WvU1nHaNDyMZEhkINKK8g%22%2C%22expiry%22%3A1630674751%7D%7D; showtimeMode=time; _gid=GA1.2.2026576359.1628082750; _ga=GA1.2.704463189.1627482203; _ga_8W8P8XEJX1=GS1.1.1628082750.11.1.1628083640.0',
}
data = '$\\u0000\\u0000\\u0000\\u0000,\\n$d88a803b-4a76-488f-b587-6ccbd3f43d86\\u0010\\u0080\xB1\xA7\\u0088\\u0006'
response = requests.post('https://hkmovie6.com/m6-api/showpb.ShowAPI/ListByMovieAndDate', headers=headers, data=data)
All I got is a response header with a message: grpc: received message larger than max:
{'Content-Type': 'application/grpc-web+proto', 'grpc-status': '8',
'grpc-message': 'grpc: received message larger than max (1551183920
vs. 4194304)', 'x-envoy-upstream-service-time': '49',
'access-control-allow-origin': 'https://hkmovie6.com',
'access-control-allow-credentials': 'true',
'access-control-expose-headers': 'grpc-status,grpc-message',
'X-Cloud-Trace-Context': '72c873ad3012ad710f938098310f7f11', ...
I also tried to use Postman Interceptor to capture the actual request sent when I browsed the site. This time with a different message:
I managed to get the response body when I used selenium but it is far from ideal performance-wise.
I wonder if grpc is a hint but I spent several hours reading without getting what I wanted.
My only question is whether it is possible to get the "ListByMovieAndDate" response just by making simple Python http request to the api url? Thanks!
An admittedly cursory read suggests that the backend is gRPC and that the client you're inspecting uses gRPC-Web, which is a clever solution to the problem of wanting to make gRPC requests from a JavaScript client.
Suffice it to say that you can't access the backend using HTTP/1 and REST if it is indeed gRPC, but you may (!) be able to craft a Python gRPC client that talks to it, provided there are no constraints on, e.g., client IP or client type, and no auth.
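For illustration, gRPC-Web prefixes each message with a 5-byte header: one flag byte followed by a 4-byte big-endian length. The size in the error message (1551183920) is exactly what results from reading the characters `\u00` of the escaped text as that length, which suggests the body was sent as literal text rather than raw bytes. The sketch below rebuilds the frame as bytes. The field numbers (1 for the movie id, 2 for a timestamp) are inferred from the captured payload, not from a .proto file, and the names encode_varint and build_frame are hypothetical.

```python
import struct


def encode_varint(value):
    """Encode a non-negative integer as a protobuf varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def build_frame(movie_id, day_timestamp):
    """Frame a request: field 1 = movie id (string), field 2 = timestamp (varint).

    The field numbers are an assumption inferred from the captured bytes.
    """
    raw_id = movie_id.encode("ascii")
    message = (b"\x0a" + encode_varint(len(raw_id)) + raw_id
               + b"\x10" + encode_varint(day_timestamp))
    # gRPC-Web header: 1 flag byte (0 = uncompressed data) + 4-byte big-endian length.
    return struct.pack(">BI", 0, len(message)) + message


frame = build_frame("d88a803b-4a76-488f-b587-6ccbd3f43d86", 1628035200)
print(frame)

# Reading the first five characters of the escaped text as a frame header
# reproduces the size from the error message.
flag, length = struct.unpack(">BI", b"$\\u00")
print(length)  # 1551183920
```

Passing `frame` as the `data=` argument, with the same headers as in the question, sends the bytes the browser sent. Whether the server then answers depends on the token still being valid and on any other checks it performs, and the response body is framed the same way, so it has to be unpacked before the protobuf message can be read.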
I'm trying to get data from eToro. This link works in my browser https://www.etoro.com/sapi/userstats/CopySim/Username/viveredidividend/OneYearAgo but it's forbidden via requests.get(), even if I add a user agent, headers, and even cookies.
import requests
url = "https://www.etoro.com/sapi/userstats/CopySim/Username/viveredidividend/OneYearAgo"
headers = {
'Host': 'www.etoro.com',
'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0',
'Accept': '*/*',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding': 'gzip, deflate, br',
'Connection': 'keep-alive',
'Referer': 'https://www.etoro.com/people/viveredidividend/chart',
'Cookie': 'XXX',
'TE': 'Trailers'
}
requests.get(url, headers=headers)
>>> <Response [403]>
How to solve it without selenium?
This error occurs when your Python code is not authenticated the way your browser is. When you log in on the website, the browser is authenticated and the site remembers it, which is why the link works fine in the browser.
To solve this problem, you first need to authenticate in your Python code.
To authenticate:
import requests
response = requests.get(url, auth=(username, password))
The 403 error means that the request you are making is being blocked. The website is protected by Cloudflare, which prevents it from being scraped. You can check this by executing print(response.text) in your code: the title tag of the returned Cloudflare HTML reads Access denied | www.etoro.com used Cloudflare to restrict access.
Under the hood, when you send the request, it goes through the Cloudflare server, which verifies whether it is coming from a real browser. Only if the request passes the verification is it forwarded to the website's server, which returns a valid response. Otherwise, Cloudflare blocks the request.
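The check described above can be automated. The following is a minimal sketch: the helper name cloudflare_block_title is hypothetical, and the sample HTML is reduced to the title quoted in this answer.

```python
import re


def cloudflare_block_title(status_code, body):
    """Return the page title if a 403 response looks like a Cloudflare block page, else None."""
    if status_code != 403:
        return None
    match = re.search(r"<title>(.*?)</title>", body, re.IGNORECASE | re.DOTALL)
    if match and "cloudflare" in match.group(1).lower():
        return match.group(1).strip()
    return None


sample = ("<html><head><title>Access denied | www.etoro.com used Cloudflare "
          "to restrict access</title></head><body></body></html>")
print(cloudflare_block_title(403, sample))
```

With Requests, the call would be cloudflare_block_title(response.status_code, response.text).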
It's difficult to bypass Cloudflare. Nevertheless, you can try your luck with the code given below.
Code
import urllib.request
url = 'https://www.etoro.com/sapi/userstats/CopySim/Username/viveredidividend/OneYearAgo'
headers = {
'authority': 'www.etoro.com',
'pragma': 'no-cache',
'cache-control': 'no-cache',
'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
'accept': 'application/json, text/plain, */*',
'accounttype': 'Real',
'applicationidentifier': 'ReToro',
'sec-ch-ua-mobile': '?0',
'applicationversion': '331.0.2',
'user-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0',
'sec-fetch-site': 'same-origin',
'sec-fetch-mode': 'cors',
'sec-fetch-dest': 'empty',
'referer': 'https://www.etoro.com/discover/markets/cryptocurrencies',
'accept-language': 'en-US,en;q=0.9',
'cookie': '__cfruid=e7f40231e2946a1a645f6fa0eb19af969527087e-1624781498; _gcl_au=1.1.279416294.1624782732; _gid=GA1.2.518227313.1624782732; _scid=64860a19-28e4-4e83-9f65-252b26c70796; _fbp=fb.1.1624782732733.795190273; __adal_ca=so%3Ddirect%26me%3Dnone%26ca%3Ddirect%26co%3D%28not%2520set%29%26ke%3D%28not%2520set%29; __adal_cw=1624782733150; _sctr=1|1624732200000; _gaexp=GAX1.2.eSuc0QBTRhKbpaD4vT_-oA.18880.x331; _hjTLDTest=1; _hjid=bb69919f-e61b-4a94-a03b-db7b1f4ec4e4; hp_preferences=%7B%22locale%22%3A%22en-gb%22%7D; funnelFromId=38; eToroLocale=en-gb; G_ENABLED_IDPS=google; marketing_visitor_regulation_id=10; marketing_visitor_country=96; __cflb=0KaS4BfEHptJdJv5nwPFxhdSsqV6GxaSK8BuVNBmVkuj6hYxsLDisSwNTSmCwpbFxkL3LDuPyToV1fUsaeNLoSNtWLVGmBErMgEeYAyzW4uVUEoJHMzTirQMGVAqNKRnL; __cf_bm=6ef9d6f250ee71d99f439672839b52ac168f7c89-1624785170-1800-ASu4E7yXfb+ci0NsW8VuCgeJiCE72Jm9uD7KkGJdy1XyNwmPvvg388mcSP+hTCYUJvtdLyY2Vl/ekoQMAkXDATn0gyFR0LbMLl0b7sCd1Fz/Uwb3TlvfpswY1pv2NvCdqJBy5sYzSznxEsZkLznM+IGjMbvSzQffBIg6k3LDbNGPjWwv7jWq/EbDd++xriLziA==; _uetsid=2ba841e0d72211eb9b5cc3bdcf56041f; _uetvid=2babee20d72211eb97efddb582c3c625; _ga=GA1.2.1277719802.1624782732; _gat_UA-2056847-65=1; __adal_ses=*; __adal_id=47f4f887-c22b-4ce0-8298-37d6a0630bdd.1624782733.2.1624785174.1624782818.770dd6b7-1517-45c9-9554-fc8d210f1d7a; _gat=1; TS01047baf=01d53e5818a8d6dc983e2c3d0e6ada224b4742910600ba921ea33920c60ab80b88c8c57ec50101b4aeeb020479ccfac6c3c567431f; outbrain_cid_fetch=true; _ga_B0NS054E7V=GS1.1.1624785164.2.1.1624785189.35; TMIS2=9a74f8b353780f2fbe59d8dc1d9cd901437be0b823f8ee60d0ab36264e2503993c5e999eaf455068baf761d067e3a4cf92d9327aaa1db627113c6c3ae3b39cd5e8ea5ce755fb8858d673749c5c919fe250d6297ac50c5b7f738927b62732627c5171a8d3a86cdc883c43ce0e24df35f8fe9b6f60a5c9148f0a762e765c11d99d; 
mp_dbbd7bd9566da85f012f7ca5d8c6c944_mixpanel=%7B%22distinct_id%22%3A%20%2217a4c99388faa1-0317c936b045a4-34647600-13c680-17a4c993890d70%22%2C%22%24device_id%22%3A%20%2217a4c99388faa1-0317c936b045a4-34647600-13c680-17a4c993890d70%22%2C%22%24initial_referrer%22%3A%20%22%24direct%22%2C%22%24initial_referring_domain%22%3A%20%22%24direct%22%7D',
}
request = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(request).read()
print(response.decode('utf-8'))
I am still a beginner at web scraping. I am trying to extract data from an API, but it requires a Bearer token, and this token changes every 5 to 6 hours, so I have to go to the web page and copy the token again. Is there any way to extract the data without opening the web page and copying the token again?
I also found this info in the network request. Someone told me that I could use the refresh_token to get access, but I don't know how to do that.
Cache-Control: no-cache,
Connection: keep-alive,
Content-Length: 177,
Content-Type: application/json;charset=UTF-8,
Cookie: dhh_token=; refresh_token=; _hurrier_session=81556f54bf555a952d1a7f780766b028,
dnt: 1
import json
import requests
import pandas as pd
from time import sleep

def make_request():
    headers = {
'Connection': 'keep-alive',
'Pragma': 'no-cache',
'Cache-Control': 'no-cache',
'sec-ch-ua': '^\\^',
'Accept': 'application/json',
'Authorization': 'Bearer eyJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJMdXRiZlZRUVZhWlpmNTNJbGxhaXFDY3BCVTNyaGtqZiIsInN1YiI6MzEzMTcwLCJleHAiOjE2MjQzMjU2NDcsInJvbCI6ImRpc3BhdGNoZXIiLCJyb2xlcyI6WyJodXJyaWVyLmRpc3BhdGNoZXIiLCJjb2QuY29kX21hbmFnZXIiXSwibmFtIjoiRXNsYW0gWmVmdGF3eSIsImVtYSI6ImV6ZWZ0YXd5QHRhbGFiYXQuY29tIiwidXNlcm5hbWUiOiJlemVmdGF3eUB0YWxhYmF0LmNvbSIsImNvdW50cmllcyI6WyJrdyIsImJoIiwicWEiLCJhZSIsImVnIiwib20iLCJqbyIsInEyIiwiazMiXX0.XYykBij-jaiIS_2tdqKFIfYGfw0uS0rKmcOTSHor8Nk',
'sec-ch-ua-mobile': '?0',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36',
'Content-Type': 'application/json;charset=UTF-8',
'Origin': 'url',
'Sec-Fetch-Site': 'same-origin',
'Sec-Fetch-Mode': 'cors',
'Sec-Fetch-Dest': 'empty',
'Referer': 'url',
'Accept-Language': 'en-US,en;q=0.9,ar-EG;q=0.8,ar;q=0.7',
'dnt': '1',
    }
    data = {
        'status': 'picked'
    }
    response = requests.post('url/api', headers=headers, json=data)
    print(response.text)
    return json.loads(response.text)

def extract_data(row):
    data_row = {
        'order_id': row['order']['code'],
        'deedline': row['order']['deadline'].split('.')[0],
        'picked_at': row['picked_at'].split('.')[0],
        'picked_by': row['picked_by'],
        'processed_at': row['processed_at'],
        'type': row['type']
    }
    return data_row

def periodique_extract(delay):
    extract_count = 0
    while True:
        extract_count += 1
        data = make_request()
        if extract_count == 1:
            df = pd.DataFrame([extract_data(row) for row in data['data']])
            df.to_csv(r"C:\Users\di\Desktop\New folder\a.csv", mode='a')
        else:
            df = pd.DataFrame([extract_data(row) for row in data['data']])
            df.to_csv(r"C:\Users\di\Desktop\New folder\a.csv", mode='a', header=False)
        print('extracting data {} times'.format(extract_count))
        sleep(delay)

periodique_extract(60)
# Note: the website tracks live operations, so I extract data every minute.
Sometimes these tokens require JavaScript execution to be set and automatically added to API requests. That means you need to open the page in something that actually runs the JavaScript in order to get the token, i.e. actually open the page in a browser.
One solution could be to use something like Selenium or Puppeteer to open the page whenever the token expires and get a new token, which you then feed to your script. This depends on the specifics of the page, and without a link it is difficult to say what the correct solution is. But if opening the page in your browser, copying the token, and then running your script works, then this is very likely to work as well.
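To know when a fresh token is needed without waiting for a failed request, the script can read the expiry time from the token itself. This sketch assumes the Bearer token is a JWT with an `exp` claim, which the token in the question appears to be. The helper names are hypothetical, the signature is not verified, and the "Bearer " prefix must be stripped before calling.

```python
import base64
import json
import time


def jwt_expiry(token):
    """Return the `exp` claim (Unix seconds) from a JWT without verifying it."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url without padding; restore the padding first.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload["exp"]


def needs_refresh(token, now=None, margin=60):
    """True when the token expires within `margin` seconds."""
    now = time.time() if now is None else now
    return jwt_expiry(token) - now <= margin


def _segment(obj):
    """Encode a dict as an unpadded base64url JWT segment (demo helper)."""
    raw = json.dumps(obj).encode("utf-8")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


# A made-up, unsigned token for demonstration only.
demo_token = ".".join([_segment({"alg": "none"}), _segment({"exp": 1700000000}), ""])
print(jwt_expiry(demo_token))
```

The extraction loop could call needs_refresh() before each request and only open the browser when it returns True.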
I'm trying to post a login to this site using Python requests. At first I can send the request 3 or 4 times, but by the fifth time I get a 403 error from the server.
I already tried setting headers, including referer, origin, and user-agent, and using a proxy, but that did not help much.
import json
import requests

response = requests.Session()
url = 'https://www.saksfifthavenue.com/account/login?_k=%2Faccount%2Fsummary'
while True:
    try:
        headers = {
            'sec-fetch-mode': 'cors',
            'origin': 'https://www.saksfifthavenue.com',
            'accept-encoding': 'gzip, deflate',
            'accept-language': 'en-US',
            'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36',
            'content-type': 'application/json;charset=UTF-8',
            'accept': 'application/json, text/plain, */*',
            'referer': url,
            'authority': 'www.saksfifthavenue.com',
            'sec-fetch-site': 'same-origin',
            'dnt': '1',
        }
        data = {"username":"demo#gmail.com","password":"Thisisatest"}
        login = response.post(
            'https://www.saksfifthavenue.com/v1/account-service/accounts/sign-in', headers=headers, data=json.dumps(data)).content
        loginCheck = login.decode()
        print(loginCheck)
        if "Sorry, this does not match our records. Please try again." in loginCheck:
            print('Login failed!!!')
            break
        elif """Your Account""" in loginCheck:
            print('Login success!!!')
        else:
            print('403 Error. Login Failed')
            break
    except:
        pass
It looks like the server detected your automated requests because you are sending them too fast. You could try setting an interval between requests.
But why do you need to post the login inside a while loop (and without logging out)?
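Spacing the requests out can be sketched as below. This is a minimal illustration, not a guaranteed fix: the function name call_with_backoff and the delay values are arbitrary assumptions, and the server may keep blocking regardless. The sender is passed in as a callable that returns a status code, so the demonstration runs without any network access.

```python
import time


def call_with_backoff(send, max_attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call send() until it returns a status other than 403, waiting longer each time."""
    status = None
    for attempt in range(max_attempts):
        status = send()
        if status != 403:
            return status
        if attempt < max_attempts - 1:
            # Wait 2 s, 4 s, 8 s, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    return status


# Demonstration with a fake sender that is blocked twice, then succeeds.
responses = iter([403, 403, 200])
waits = []
result = call_with_backoff(lambda: next(responses), sleep=waits.append)
print(result, waits)
```

In the question's script, send would wrap the session's post call and return its status_code.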