I am trying to make a program that checks for ski lift reservation openings. So far I am able to get the correct response from the API but it only works for about 15 min before some cookie expires. Here is my current process.
I go to the site https://www.keystoneresort.com/plan-your-trip/lift-access/tickets.aspx (the website/API in question), look at the network traffic in DevTools, and copy the highlighted XHR request as cURL (bash).
I then take that cURL (bash) and import it into Postman to get the response:
[screenshot: Postman response]
Then I take the code Postman generates so I can run it in Python:
import requests, json

# URL copied from the XHR request; the trailing "_" parameter is a
# cache-busting timestamp that came with the copied request.
url = ("https://www.keystoneresort.com/api/LiftAccessApi/GetLiftTicketControlReservationInventory"
       "?startDate=01%2F21%2F2021&endDate=03%2F06%2F2021&_=1611254694375")

headers = {
    'authority': 'www.keystoneresort.com',
    'accept': 'application/json, text/javascript, */*; q=0.01',
    'x-queueit-ajaxpageurl': 'https%3A%2F%2Fwww.keystoneresort.com%2Fplan-your-trip%2Flift-access%2Ftickets.aspx%3FstartDate%3D01%252F23%252F2021%26numberOfDays%3D1%26ageGroup%3DAdult',
    'x-requested-with': 'XMLHttpRequest',
    '__requestverificationtoken': 'mbVIzNL1qZUKDT3Re8H9kXVNoYLmQPC-tgLCSbM_inVSN1v_2Pei-A-GWDaKL7i6NRIVTr0lnlmiYACNvfmd6Zzsikk1:HI8y8wZJXMuP7nsTJwS-adYZu7FoHVPVHWY5naHRiB71dg2PzehuQa8WJy418eIrVqwmvhw-a1F34sJ425mXzWpEANE1',
    'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Mobile Safari/537.36',
    'save-data': 'off',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://www.keystoneresort.com/plan-your-trip/lift-access/tickets.aspx?startDate=01%2F23%2F2021&numberOfDays=1&ageGroup=Adult',
    'accept-language': 'en-US,en;q=0.9',
    'cookie': 'QueueITAccepted-SDFrts345E-V3_vailresortsecomm1=EventId%3Dvailresortsecomm1%26QueueId%3D96d15411-09e1-4443-89a3-f0d6e4cef5d5%26RedirectType%3Dsafetynet%26IssueTime%3D1611254692%26Hash%3D06e1aecd2d5cdf64363d53f4fc63f1c22316f604895cd3ecfd1d8b03f86ba36a; TS019b45a2=01d73c084b0f6abf04d77ffeb9e37953f3d047ebae13a4f5ffa8e69045bf156b4959e093cf10f08359c6f45a491fdc474e068898a9; TS01f060ff=01d73c084b0f6abf04d77ffeb9e37953f3d047ebae13a4f5ffa8e69045bf156b4959e093cf10f08359c6f45a491fdc474e068898a9; AMCV_974C370453295F9A0A490D44%40AdobeOrg=1406116232%7CMCIDTS%7C18649%7CMCMID%7C30886069937558409272202898840476568322%7CMCAAMLH-1611859494%7C9%7CMCAAMB-1611859494%7CRKhpRz8krg2tLO6pguXWp5olkAcUniQYPHaMWWgdJ3xzPWQmdj0y%7CMCOPTOUT-1611261894s%7CNONE%7CMCAID%7CNONE%7CvVersion%7C2.5.0;'
}

# The extra session GET that Postman generated was redundant; a single GET
# with the copied headers reproduces the call.
response = requests.get(url, headers=headers)
todos = response.json()
print(json.dumps(todos, indent=2))
Now if you run this in Python, it will not work, because the cookies will have expired for this session by the time someone tries it; to reproduce it you would have to follow the process I listed above. The response I get looks like this, which is what I want, only for it not to expire.
[screenshot: Python response]
I have looked extensively at different ways to get the cookies using requests and Selenium. All the solutions I have tried only get some of the cookies, not all of them. I need the ones that are in the "cookie" header listed in my code, but I have not found a way to do that without refreshing the page, posting the cURL into Postman, and copying the response. I am still fairly new to Python and coding in general, so don't go too hard on me if the answer is super simple.
I think some of these cookies are set by JavaScript, which may be part of the problem. I can also delete some of the cookies in my code and have it still work (until it expires). If there is an easier way to do what I am doing, please let me know.
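The kind of thing I have been trying with Selenium looks like this; a sketch, assuming Chrome, and assuming the JavaScript/queue cookies have been set by the time the page finishes loading (the timing is the part I am unsure about):

import requests
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.keystoneresort.com/plan-your-trip/lift-access/tickets.aspx")
# driver.get_cookies() returns every cookie the browser currently holds,
# including ones set by JavaScript, provided they exist by this point.
session = requests.Session()
for c in driver.get_cookies():
    session.cookies.set(c["name"], c["value"], domain=c["domain"], path=c["path"])
driver.quit()
# session.get(...) now sends the same cookie jar the browser had.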
Thanks.
I am using https://developer.microsoft.com/en-us/graph/graph-explorer to make requests
I am trying to convert them to Python to use for general automation.
I always copy from browser > Postman > code, so I have all the cookies/tokens/etc. I need, and my Python request will work until something expires. In this case, that something is a bearer token.
I can't figure out how to get a new, valid bearer token other than redoing the above process or copying just the token and pasting it into my code.
While trying to find an auth request that would spit one out, I came across a collection for Postman here:
https://learn.microsoft.com/en-us/azure/active-directory/develop/v2-oauth2-auth-code-flow
and when I replace {{tenant}} with my org's tenant_id, I get a 200 response with a bearer token, but when I insert this bearer token into my Graph API request code I get the following error:
{"error":{"code":"BadRequest","message":"/me request is only valid with delegated authentication flow.","innerError":{"date":"2022-10-23T14:31:22","request-id":"...","client-request-id":"..."}}}
Here is a screenshot of the Postman Auth
Here is my Graph API call that only works with bearer tokens copied from graph-explorer
import requests

def recreate_graph_request1(bearer=None):
    '''
    I went to https://developer.microsoft.com/en-us/graph/graph-explorer
    and picked a request: Outlook > GET emails from a user.
    At first the response was for some generic user, but I logged in using my
    account and it actually worked. Then I used my old copy-cURL-as-bash trick
    to turn it into Python.
    '''
    url = "https://graph.microsoft.com/v1.0/me/messages?$filter=(from/emailAddress/address)%20eq%20%27my.boss#company.com%27"
    headers = {
        'Accept': '*/*',
        'Accept-Language': 'en-US,en;q=0.9',
        'Authorization': bearer,
        'Connection': 'keep-alive',
        'Origin': 'https://developer.microsoft.com',
        'Referer': 'https://developer.microsoft.com/',
        'SdkVersion': 'GraphExplorer/4.0, graph-js/3.0.2 (featureUsage=6)',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-site',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
        'client-request-id': 'n0t_th3_s4m3_4s_1n_P05tm4n',
        'sec-ch-ua': '"Chromium";v="106", "Google Chrome";v="106", "Not;A=Brand";v="99"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"'
    }
    response = requests.get(url, headers=headers)
    return response

token_from_ms_auth = 'eyCOPIED_FROM_POSTMAN....'
bearer_from_ms_auth = 'Bearer ' + token_from_ms_auth
print(recreate_graph_request1(bearer_from_ms_auth).text)
TBH, I was not overly optimistic that any bearer token would work, even if it was somehow associated with my tenant, but I hoped it would, and the resulting disappointment has driven me to ask the universe for help. I do not understand these meandering flows, and looking at others' answers only confuses me more. I am hoping someone can help me figure out this scenario.
Access tokens are short-lived. Refresh them after they expire to continue accessing resources.
Please refer to this document: https://learn.microsoft.com/en-us/azure/active-directory/develop/v2-oauth2-auth-code-flow#refresh-the-access-token
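As an aside, the "/me request is only valid with delegated authentication flow" error suggests the token you minted in Postman came from an app-only grant (such as client credentials); /me needs a token obtained through a delegated flow like the auth code flow in the linked document. Once that flow has issued you a refresh_token, renewing the access token is a single POST to the token endpoint. A minimal sketch, where the tenant, client ID, scope, and refresh token are placeholders you fill in from your own app registration:

import requests

tenant_id = "your-tenant-id"  # placeholder
resp = requests.post(
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token",
    data={
        "client_id": "your-client-id",  # placeholder: your app registration's ID
        "grant_type": "refresh_token",
        "refresh_token": "token-from-the-auth-code-flow",  # placeholder
        "scope": "https://graph.microsoft.com/Mail.Read offline_access",  # placeholder scopes
    },
)
tokens = resp.json()
bearer = "Bearer " + tokens["access_token"]  # the response also carries a new refresh_token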
Hope this helps.
I am trying to get the response body of this request "ListByMovieAndDate" from this specific website:
https://hkmovie6.com/movie/d88a803b-4a76-488f-b587-6ccbd3f43d86/SHOWTIME
The screenshot below shows the request in Chrome DevTools.
I have tried several methods to mimic the request, including copying the request as cURL (bash) and using a tool to translate it into a Python requests call:
import requests

headers = {
    'authority': 'hkmovie6.com',
    'sec-ch-ua': '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"',
    'authorization': 'eyJhbGciOiJIUzUxMiIsImtpZCI6ImFjY2VzcyIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJtb3ZpZTYiLCJhdWQiOiJyb2xlLmJhc2ljIiwiZXhwIjoxNjI4MDg0NTUxLCJpYXQiOjE2MjgwODI3NTEsImp0aSI6IjQxZjJmZDBjLTk3YzgtNDFiYi04NDRiLTU5YWM5MTY0ZmYyNSJ9.jz_G80XDafzSHyzxog1IAY_xikAdQEEFizJXkiiHkNhwAY-MWF1E11Nel7WrsDlE184tcFtSjUKbHdx7281dFA',
    'x-grpc-web': '1',
    'language': 'zhHK',
    'sec-ch-ua-mobile': '?0',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
    'content-type': 'application/grpc-web+proto',
    'accept': '*/*',
    'origin': 'https://hkmovie6.com',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://hkmovie6.com/movie/d88a803b-4a76-488f-b587-6ccbd3f43d86/SHOWTIME',
    'accept-language': 'en-US,en;q=0.9,zh-TW;q=0.8,zh;q=0.7,ja;q=0.6',
    'cookie': '__stripe_mid=dfb76ec9-1469-48ef-81d6-659f8d7c12da9a119d; lang=zhHK; auth=%7B%22isLogin%22%3Afalse%2C%22access%22%3A%7B%22token%22%3A%22eyJhbGciOiJIUzUxMiIsImtpZCI6ImFjY2VzcyIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJtb3ZpZTYiLCJhdWQiOiJyb2xlLmJhc2ljIiwiZXhwIjoxNjI4MDg0NTUxLCJpYXQiOjE2MjgwODI3NTEsImp0aSI6IjQxZjJmZDBjLTk3YzgtNDFiYi04NDRiLTU5YWM5MTY0ZmYyNSJ9.jz_G80XDafzSHyzxog1IAY_xikAdQEEFizJXkiiHkNhwAY-MWF1E11Nel7WrsDlE184tcFtSjUKbHdx7281dFA%22%2C%22expiry%22%3A1628084551%7D%2C%22refresh%22%3A%7B%22token%22%3A%22eyJhbGciOiJIUzUxMiIsImtpZCI6InJlZnJlc2giLCJ0eXAiOiJKV1QifQ.eyJpc3MiOiJtb3ZpZTYiLCJhdWQiOiJyb2xlLmJhc2ljIiwiZXhwIjoxNjMwNjc0NzUxLCJpYXQiOjE2MjgwODI3NTEsImp0aSI6IjM0YWFjNWVhLTkwZTctNDdhYS05OTE3LTQ5N2UxMGUwNmU3YSJ9.Mrwt2iWddQHthQNHafF4mirU-JiynidiTzq0X4J96IMICcWbWEoZBB4M1HhvFdeB2WvU1nHaNDyMZEhkINKK8g%22%2C%22expiry%22%3A1630674751%7D%7D; showtimeMode=time; _gid=GA1.2.2026576359.1628082750; _ga=GA1.2.704463189.1627482203; _ga_8W8P8XEJX1=GS1.1.1628082750.11.1.1628083640.0',
}

# NB: this payload is binary gRPC-Web data that got mangled into literal text
# when I copied it out of the browser.
data = '$\\u0000\\u0000\\u0000\\u0000,\\n$d88a803b-4a76-488f-b587-6ccbd3f43d86\\u0010\\u0080\xB1\xA7\\u0088\\u0006'

response = requests.post('https://hkmovie6.com/m6-api/showpb.ShowAPI/ListByMovieAndDate', headers=headers, data=data)
All I got was a response header with the message "grpc: received message larger than max":
{'Content-Type': 'application/grpc-web+proto', 'grpc-status': '8',
 'grpc-message': 'grpc: received message larger than max (1551183920 vs. 4194304)',
 'x-envoy-upstream-service-time': '49',
 'access-control-allow-origin': 'https://hkmovie6.com',
 'access-control-allow-credentials': 'true',
 'access-control-expose-headers': 'grpc-status,grpc-message',
 'X-Cloud-Trace-Context': '72c873ad3012ad710f938098310f7f11', ...
I also tried to use Postman Interceptor to capture the actual request sent when I browsed the site; that attempt failed with a different message.
I managed to get the response body when I used Selenium, but it is far from ideal performance-wise.
I wonder if gRPC is a hint, but I spent several hours reading without getting what I wanted.
My only question is whether it is possible to get the "ListByMovieAndDate" response just by making a plain Python HTTP request to the API URL? Thanks!
An admittedly cursory read suggests that the backend is gRPC and that the client you're introspecting uses gRPC-Web, which is a clever solution to the problem of wanting to make gRPC requests from a JavaScript client.
Suffice it to say that you can't access the backend using HTTP/1 and REST if it is indeed gRPC, but you may (!) be able to craft a Python gRPC client that talks to it, provided the server doesn't constrain e.g. client IP or client type and there's no auth.
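The error you quoted also tells you exactly what went wrong with the copied payload. A gRPC(-Web) data frame is one flag byte followed by a 4-byte big-endian message length; your tool pasted the binary body as literal text, so the server read the characters '\', 'u', '0', '0' (0x5C753030 = 1551183920) as the length, which is precisely the "1551183920 vs. 4194304" in the message. If you capture the request body as raw bytes instead, a sketch of framing it yourself looks like this (the payload bytes are a placeholder I don't have):

import struct
import requests

# gRPC-Web framing: 1 flag byte (0x00 = uncompressed data) + 4-byte big-endian
# length + the serialized protobuf message.
message = b"..."  # placeholder: the raw protobuf payload captured from DevTools
frame = b"\x00" + struct.pack(">I", len(message)) + message

response = requests.post(
    "https://hkmovie6.com/m6-api/showpb.ShowAPI/ListByMovieAndDate",
    headers={  # plus whatever auth/cookie headers the API expects
        "content-type": "application/grpc-web+proto",
        "x-grpc-web": "1",
    },
    data=frame,
)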
I am trying to make a request to https://api.dex.guru/v1/tokens/0x7060d3F1CC70A07f4768560B9D9B692ac29244dE using Python. I have tried tons of different things, but they all respond with a 403 Forbidden error. I have tried everything I can think of and have Googled with no success.
Currently my code for this request looks like this:
import requests

headers = {
    'authority': 'api.dex.guru',
    'cache-control': 'max-age=0',
    'sec-ch-ua': '^\\^',
    'sec-ch-ua-mobile': '?0',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'none',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'sec-fetch-dest': 'document',
    'accept-language': 'en-US,en;q=0.9',
    'cookie': '...',  # cookies are here
}

response = requests.get('https://api.dex.guru/v1/tradingview/symbols?symbol=0x7060d3f1cc70a07f4768560b9d9b692ac29244de-bsc', headers=headers)
Then I print out the response and it is a 403 error. Please help; I need this data for a project.
Good afternoon.
I have managed to get this to work with the help of another user on Reddit.
The key to getting this API call to work is to use the cloudscraper module:
import cloudscraper
scraper = cloudscraper.create_scraper() # returns a CloudScraper instance
print(scraper.get("https://api.dex.guru/v1/tokens/0x8076C74C5e3F5852037F31Ff0093Eeb8c8ADd8D3-bsc").text)
This gave me a 200 response with the expected JSON content (substitute my URL above with yours and you should get the expected 200 response).
Many thanks
Jimmy
I tried messing around with this myself; it appears the site has some sort of DDoS protection from Cloudflare blocking these API calls. I'm not an expert in Python or headers by any means, so you might already be supplying something to deal with that. However, I looked on their website and it seems the API is still in development. Finally, I was getting 503 errors instead, and I was able to access the API normally through my browser. Happy to tinker around more with this if you don't mind explaining what some of the cookies/headers are doing.
Try checking the body of the response (response.content or response.text), as that might give you a clearer picture of why you get blocked.
For me it looks like they do some filtering based on the user agent: I get a Cloudflare DoS protection page (with an HTTP 503 response, for example). Using a user-agent string that suggests JavaScript won't work, I get an HTTP 200:
headers = {"User-Agent": "HTTPie/2.4.0"}
r = requests.get("https://api.dex.guru/v1/tokens/0x7060d3F1CC70A07f4768560B9D9B692ac29244dE", headers=headers)
So I'm making a function that returns whether I can watch a certain anime on, in this case, Crunchyroll, but after looking for the right solution for a couple of days I could not quite find the answer. I am very new to web scraping, so I don't have any experience.
Here are the headers I've added. The only one that makes a difference right now is the host header.
header = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-GB,en;q=0.9,nl;q=0.8,ja;q=0.7',
'connection': 'keep-alive',
'cache-control': 'max-age=0',
'dnt': '1',
'sec-fetch-dest': 'document',
'sec-ch-ua-mobile': '?0',
'sec-fetch-mode': 'navigate',
'sec-fetch-site': 'none',
'sec-fetch-user': '?1',
'sec-ch-ua': '"Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"',
'host': 'crunchyroll.com',
'referer': 'https://www.google.com/',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'
}
And here is the last version I tried. I have already tried to work with the requests module, but it keeps giving me the 'exceeded 20 redirects' error. After using allow_redirects=False it gives me the 301 error, but if there is a solution through the requests module I'd be happy too.
(namelist is, for example: [rezero-kara-hajimeru-isekai-seikatsu, rezero-starting-life-in-another-world-, rezero])
import urllib.request
import urllib.error
from http.cookiejar import CookieJar

for i in namelist:
    # Crunchyroll checker
    Cleanlink = 'https://www.crunchyroll.com/en-gb/'
    attempt = Cleanlink + i
    try:
        req = urllib.request.Request(attempt, headers=header)
        cj = CookieJar()
        opener = urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor(cj),
            urllib.request.HTTPRedirectHandler)
        response = opener.open(req)
        print(response)
        response.close()
    except urllib.error.HTTPError as inst:
        print(format(inst))
This code gives me this response:
The last 30x error message was:
Moved Permanently
The last 30x error message was:
Moved Permanently
HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently
So the only thing I need is for the code to be able to check whether a page exists. For example, https://www.crunchyroll.com/en-gb/rezero-starting-life-in-another-world- should return the 200 code.
Thanks in advance.
Crunchyroll has a very strict Cloudflare WAF. If your requests have anything fishy about them, it will give you a hard time right away. Possible reasons why you get a 301 are:
Maybe you are requesting from behind a proxy.
The WAF may check whether the requester (urllib or the requests module, in your case) has JavaScript enabled, so it can tell whether the requester is a bot or a real user.
Solution
You should use this Python lib to do the requests for you; it is a wrapper around the requests lib, so you can use it as if you were using the requests module (a short sketch follows the link):
https://pypi.org/project/cloudscraper/
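A minimal sketch of the swap, reusing namelist from the question; the status-code reading is an assumption (a missing show may also surface as a redirect rather than a 404):

import cloudscraper

scraper = cloudscraper.create_scraper()  # behaves like a requests.Session
for name in namelist:
    r = scraper.get('https://www.crunchyroll.com/en-gb/' + name)
    print(name, r.status_code)  # 200 -> the page exists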
Note
Even with this lib, you can only send a few hundred requests every few hours, because the WAF will still detect that your IP is requesting too much and block you. Crunchyroll's WAF is nasty.
I am trying to crawl a website and copied the Request Headers information from Chrome directly; however, after using requests.get, the returned content is empty, even though the headers printed from requests are correct. Does anyone know the reason for this? Thanks!
Mac, Chrome, Python 3.7
[screenshots: General Information / Requests Information]
import requests
headers = {
'Accept': '*/*',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
'Cache-Control': 'max-age=0',
'Connection': 'keep-alive',
'Content-Type': 'application/x-www-form-urlencoded; charset=utf-8',
'Cookie': '_RSG=Ja4TD8hvFh2MGc7wBysunA; _RDG=28458f5367f9b123363c043b75e3f9aa31; _RGUID=2acfe6b2-0d74-4913-ac78-dbc2fa1e6416; _abtest_userid=bce0b01e-fdb6-48c8-9b86-4e1d8ef468df; _ga=GA1.2.937100695.1547968515; Session=SmartLinkCode=U155952&SmartLinkKeyWord=&SmartLinkQuery=&SmartLinkHost=&SmartLinkLanguage=zh; HotelCityID=5split%E5%93%88%E5%B0%94%E6%BB%A8splitHarbinsplit2019-01-25split2019-01-26split0; Mkt_UnionRecord=%5B%7B%22aid%22%3A%224897%22%2C%22timestamp%22%3A1548157938143%7D%5D; ASP.NET_SessionId=w1pq5dvchogxhbnxzmbgbtkk; OID_ForOnlineHotel=1509697509766jepc81550141458933102003; _RF1=123.165.147.203; MKT_Pagesource=PC; HotelDomesticVisitedHotels1=698432=0,0,4.5,3674,/hotel/8000/7899/df84daa197dd4b868868cba4db14f71f.jpg,&448367=0,0,4.3,4455,/fd/hotel/g6/M02/6D/8B/CggYtFc1nAKAEnRYAAdgA-rkEXw300.jpg,&13679014=0,0,4.9,1484,/200g0w000000k4wqrB407.jpg,; __zpspc=9.6.1550232718.1550232718.1%234%7C%7C%7C%7C%7C%23; _jzqco=%7C%7C%7C%7C1550232718632%7C1.2024536341.1547968514847.1550141461869.1550232718448.1550141461869.1550232718448.undefined.0.0.13.13; _gid=GA1.2.506035914.1550232719; _bfi=p1%3D102003%26p2%3D102003%26v1%3D18%26v2%3D17; appFloatCnt=8; _bfa=1.1509697509766.jepc8.1.1550141458610.1550232715314.7.19; _bfs=1.2',
'Host': 'hotels.ctrip.com',
'Referer': 'http://hotels.ctrip.com/hotel/698432.html?isFull=F',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36'
}
url = 'http://hotels.ctrip.com/Domestic/tool/AjaxHotelCommentList.aspx?MasterHotelID=698432&hotel=698432&property=0&card=0&cardpos=0&NewOpenCount=0&AutoExpiredCount=0&RecordCount=3663&OpenDate=2015-01-01&currentPage=1&orderBy=2&viewVersion=c&eleven=cb6ab06dc6aff1e215d71d006e6de92d3cb1428213f72763175fe035341c4f61&callback=CASTdHqLYNMOfGFbr&_=1550303542815'
data = requests.get(url, headers=headers)
print(data.request.headers)  # note: this prints the *request* headers, not the response body
The request header information you shared in the image indicates that the server responded correctly to the request. Also, the actual URL you shared, http://hotels.ctrip.com/Domestic/tool/AjaxHotelCommentList.aspx?MasterHotelID=698432&hotel=698432&property=0&card=0&cardpos=0&NewOpenCount=0&AutoExpiredCount=0&RecordCount=3663&OpenDate=2015-01-01&currentPage=1&orderBy=2&viewVersion=c&eleven=cb6ab06dc6aff1e215d71d006e6de92d3cb1428213f72763175fe035341c4f61&callback=CASTdHqLYNMOfGFbr&_=1550303542815, is different from the one shown in the image. In fact, the actual page calls a lot of other URLs to build the final result, so there is no guarantee that requests will return the same response you see in the browser. If the server-side implementation depends on the browser's JavaScript engine to execute scripts and render the content, you won't get the final HTML as it looks in the browser. In those cases it is better to use Selenium WebDriver to hit the URL and then get the HTML content. If you can share the actual URL, I can suggest other ideas.
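A minimal sketch of that Selenium fallback; the page URL is taken from the question's referer header, and the fixed sleep is a crude stand-in for a proper wait:

import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://hotels.ctrip.com/hotel/698432.html?isFull=F")
time.sleep(5)  # crude wait for the page's JavaScript to render the comment list
html = driver.page_source  # the final HTML, as the browser sees it
driver.quit()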