I am trying to get the response body of the "ListByMovieAndDate" request from this specific website:
https://hkmovie6.com/movie/d88a803b-4a76-488f-b587-6ccbd3f43d86/SHOWTIME
The screenshot below shows the request in Chrome DevTools.
I have tried several methods to mimic the request, including
copying the request as cURL (bash) and using a tool to translate it into a Python requests call:
import requests

headers = {
    'authority': 'hkmovie6.com',
    'sec-ch-ua': '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"',
    'authorization': 'eyJhbGciOiJIUzUxMiIsImtpZCI6ImFjY2VzcyIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJtb3ZpZTYiLCJhdWQiOiJyb2xlLmJhc2ljIiwiZXhwIjoxNjI4MDg0NTUxLCJpYXQiOjE2MjgwODI3NTEsImp0aSI6IjQxZjJmZDBjLTk3YzgtNDFiYi04NDRiLTU5YWM5MTY0ZmYyNSJ9.jz_G80XDafzSHyzxog1IAY_xikAdQEEFizJXkiiHkNhwAY-MWF1E11Nel7WrsDlE184tcFtSjUKbHdx7281dFA',
    'x-grpc-web': '1',
    'language': 'zhHK',
    'sec-ch-ua-mobile': '?0',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
    'content-type': 'application/grpc-web+proto',
    'accept': '*/*',
    'origin': 'https://hkmovie6.com',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://hkmovie6.com/movie/d88a803b-4a76-488f-b587-6ccbd3f43d86/SHOWTIME',
    'accept-language': 'en-US,en;q=0.9,zh-TW;q=0.8,zh;q=0.7,ja;q=0.6',
    'cookie': '__stripe_mid=dfb76ec9-1469-48ef-81d6-659f8d7c12da9a119d; lang=zhHK; auth=%7B%22isLogin%22%3Afalse%2C%22access%22%3A%7B%22token%22%3A%22eyJhbGciOiJIUzUxMiIsImtpZCI6ImFjY2VzcyIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJtb3ZpZTYiLCJhdWQiOiJyb2xlLmJhc2ljIiwiZXhwIjoxNjI4MDg0NTUxLCJpYXQiOjE2MjgwODI3NTEsImp0aSI6IjQxZjJmZDBjLTk3YzgtNDFiYi04NDRiLTU5YWM5MTY0ZmYyNSJ9.jz_G80XDafzSHyzxog1IAY_xikAdQEEFizJXkiiHkNhwAY-MWF1E11Nel7WrsDlE184tcFtSjUKbHdx7281dFA%22%2C%22expiry%22%3A1628084551%7D%2C%22refresh%22%3A%7B%22token%22%3A%22eyJhbGciOiJIUzUxMiIsImtpZCI6InJlZnJlc2giLCJ0eXAiOiJKV1QifQ.eyJpc3MiOiJtb3ZpZTYiLCJhdWQiOiJyb2xlLmJhc2ljIiwiZXhwIjoxNjMwNjc0NzUxLCJpYXQiOjE2MjgwODI3NTEsImp0aSI6IjM0YWFjNWVhLTkwZTctNDdhYS05OTE3LTQ5N2UxMGUwNmU3YSJ9.Mrwt2iWddQHthQNHafF4mirU-JiynidiTzq0X4J96IMICcWbWEoZBB4M1HhvFdeB2WvU1nHaNDyMZEhkINKK8g%22%2C%22expiry%22%3A1630674751%7D%7D; showtimeMode=time; _gid=GA1.2.2026576359.1628082750; _ga=GA1.2.704463189.1627482203; _ga_8W8P8XEJX1=GS1.1.1628082750.11.1.1628083640.0',
}

# NOTE: the body is binary gRPC-Web framing that the cURL-to-Python
# converter rendered as literal escape text (the leading '$' looks like
# leftover bash $'...' quoting), so the bytes sent here are not the
# bytes the browser sent.
data = '$\u0000\u0000\u0000\u0000,\n$d88a803b-4a76-488f-b587-6ccbd3f43d86\u0010\u0080\u00b1\u00a7\u0088\u0006'

response = requests.post('https://hkmovie6.com/m6-api/showpb.ShowAPI/ListByMovieAndDate', headers=headers, data=data)
All I got was a response header with the message grpc: received message larger than max:

{'Content-Type': 'application/grpc-web+proto',
 'grpc-status': '8',
 'grpc-message': 'grpc: received message larger than max (1551183920 vs. 4194304)',
 'x-envoy-upstream-service-time': '49',
 'access-control-allow-origin': 'https://hkmovie6.com',
 'access-control-allow-credentials': 'true',
 'access-control-expose-headers': 'grpc-status,grpc-message',
 'X-Cloud-Trace-Context': '72c873ad3012ad710f938098310f7f11', ...
I also tried using Postman Interceptor to capture the actual request sent when I browsed the site. That time I got a different error message.
I managed to get the response body using Selenium, but that is far from ideal performance-wise.
I wonder if gRPC is a hint, but I spent several hours reading without getting what I wanted.
My only question is whether it is possible to get the "ListByMovieAndDate" response just by making a simple Python HTTP request to the API URL. Thanks!
An admittedly cursory read suggests that the backend is gRPC and that the client you're introspecting uses gRPC-Web, a clever solution to the problem of wanting to make gRPC requests from a JavaScript client.
Suffice to say, you can't access the backend using HTTP/1 and REST if it is indeed gRPC, but you may (!) be able to craft a Python gRPC client that talks to it, provided there are no constraints on e.g. client IP or type, and no auth.
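One supporting detail: in your error, 1551183920 is exactly 0x5C753030, i.e. the ASCII bytes \u00 read as a big-endian 32-bit integer. The server took your leading '$' as the frame flag byte and the next four text characters as the message length, which means the escape sequences were sent as literal text rather than the original binary body. A gRPC-Web data frame is one flag byte plus a 4-byte big-endian message length, followed by a serialized protobuf message. Below is a minimal sketch of building such a frame by hand; the payload is a placeholder, since the site's .proto schema isn't public and would have to be reverse-engineered from a captured request:

import struct
import requests

def grpc_web_frame(message: bytes) -> bytes:
    # gRPC-Web data frame: flag byte 0x00 (uncompressed), then the
    # message length as a big-endian uint32, then the message itself.
    return struct.pack(">BI", 0, len(message)) + message

# Placeholder: this must be the serialized protobuf request
# (movie id + date) for ListByMovieAndDate; the schema is unknown.
payload = b""

response = requests.post(
    "https://hkmovie6.com/m6-api/showpb.ShowAPI/ListByMovieAndDate",
    headers={
        "content-type": "application/grpc-web+proto",
        "x-grpc-web": "1",
    },
    data=grpc_web_frame(payload),
)
print(response.status_code, response.headers.get("grpc-status"))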
I am using https://developer.microsoft.com/en-us/graph/graph-explorer to make requests.
I am trying to convert them to Python to use for general automation.
I always copy from browser > Postman > code, so I have all the cookies/tokens/etc. I need, and my Python request works until something expires. In this case, that something is a bearer token.
I can't figure out how to get a new, valid bearer token other than re-doing the above process, or copying just the token and pasting it into my code.
While trying to find an auth request that would spit one out, I came across a Postman collection here:
https://learn.microsoft.com/en-us/azure/active-directory/develop/v2-oauth2-auth-code-flow
When I replace {{tenant}} with my org's tenant_id, I get a 200 response with a bearer token, but when I insert this bearer token into my Graph API request code, I get the following error:
{"error":{"code":"BadRequest","message":"/me request is only valid with delegated authentication flow.","innerError":{"date":"2022-10-23T14:31:22","request-id":"...","client-request-id":"..."}}}
Here is a screenshot of the Postman Auth
Here is my Graph API call, which only works with bearer tokens copied from graph-explorer:
import requests

def recreate_graph_request1(bearer=None):
    '''
    I went to https://developer.microsoft.com/en-us/graph/graph-explorer
    and picked a request: Outlook > GET emails from a user.
    At first the response was for some generic user, but I logged in using
    my account and it actually worked.
    Then I used my old "copy as cURL (bash)" trick to make it Python.
    '''
    # The "@" in the address must be percent-encoded as %40; a bare "#"
    # here would start a URL fragment and truncate the query.
    url = ("https://graph.microsoft.com/v1.0/me/messages"
           "?$filter=(from/emailAddress/address)%20eq%20%27my.boss%40company.com%27")
    payload = {}
    headers = {
        'Accept': '*/*',
        'Accept-Language': 'en-US,en;q=0.9',
        'Authorization': bearer,
        'Connection': 'keep-alive',
        'Origin': 'https://developer.microsoft.com',
        'Referer': 'https://developer.microsoft.com/',
        'SdkVersion': 'GraphExplorer/4.0, graph-js/3.0.2 (featureUsage=6)',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-site',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
        'client-request-id': 'n0t_th3_s4m3_4s_1n_P05tm4n',
        'sec-ch-ua': '"Chromium";v="106", "Google Chrome";v="106", "Not;A=Brand";v="99"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"'
    }
    response = requests.request("GET", url, headers=headers, data=payload)
    return response
token_from_ms_auth = 'eyCOPIED_FROM_POSTMAN....'
bearer_from_ms_auth = 'Bearer '+token_from_ms_auth
print(recreate_graph_request1(bearer_from_ms_auth).text)
TBH, I was not overly optimistic that any bearer token would work even if it was somehow associated with my tenant, but I hoped it would, and the resulting disappointment has driven me to ask the universe for help. I do not understand these meandering flows, and looking at others' answers only confuses me more. I am hoping someone can help me figure out this scenario.
Access tokens are short-lived. Refresh them after they expire to continue accessing resources. Also note what the BadRequest error is telling you: a token minted through an app-only flow (for example client credentials) cannot call /me; that endpoint requires a token obtained through a delegated flow, such as the auth-code flow, on behalf of a signed-in user.
Please refer to this document: https://learn.microsoft.com/en-us/azure/active-directory/develop/v2-oauth2-auth-code-flow#refresh-the-access-token
Hope this helps.
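As a rough sketch of the refresh call itself, assuming you already hold a refresh token from the delegated auth-code flow (the tenant, client id, and scope values below are placeholders for your own app registration):

import requests

TENANT_ID = "your-tenant-id"   # placeholder
CLIENT_ID = "your-client-id"   # placeholder

def refresh_access_token(refresh_token: str) -> dict:
    # v2.0 token endpoint: grant_type=refresh_token exchanges a refresh
    # token for a fresh access token (and usually a new refresh token,
    # which you should store for next time).
    url = f"https://login.microsoftonline.com/{TENANT_ID}/oauth2/v2.0/token"
    data = {
        "client_id": CLIENT_ID,
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
        "scope": "https://graph.microsoft.com/Mail.Read offline_access",
    }
    resp = requests.post(url, data=data)
    resp.raise_for_status()
    return resp.json()  # contains "access_token", "expires_in", ...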
I get 403 Forbidden when I use Python requests to access a site.
However, when I open Charles Proxy, it works.
When I open Fiddler, I also get 403.
I want to know why this happens.
import requests

def get_test():
    # Toggle this to route through Charles/Fiddler on 127.0.0.1:8888:
    # proxies = {'http': 'http://127.0.0.1:8888', 'https': 'http://127.0.0.1:8888'}
    proxies = None
    url = ""
    my_header = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7,ja;q=0.6,zh-HK;q=0.5',
        'cache-control': 'max-age=0',
        'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
        'sec-ch-ua-mobile': '?0',
        'sec-fetch-dest': 'document',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-site': 'none',
        'sec-fetch-user': '?1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    }
    rsp = requests.get(url=url, headers=my_header, proxies=proxies)
    print(rsp)

if __name__ == '__main__':
    get_test()
I tried requesting this page with Postman and also got a 403 Forbidden. It seems this website uses Cloudflare's anti-bot page to block web scrapers, which is hard to get past on your own. That is why the 403 Forbidden happens.
So I tried to use cloudscraper to solve this problem:
import cloudscraper
scraper = cloudscraper.create_scraper()
print(scraper.get("https://www.zolo.ca/").text)
but got this exception:
cloudscraper.exceptions.CloudflareChallengeError: Detected a Cloudflare version 2 Captcha challenge, This feature is not available in the opensource (free) version.
It seems the open-source (free) version of cloudscraper can't solve this problem, and I can't do anything more.
For more details on cloudscraper, see its PyPI page or GitHub:
https://pypi.org/project/cloudscraper/
https://github.com/VeNoMouS/cloudscraper
If you urgently need to scrape the website, you can try Selenium.
This method is not elegant, but it will certainly prove equal to the task.
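For instance, a minimal Selenium sketch, assuming Chrome and Selenium 4.6+ (which fetches a matching chromedriver automatically); the fixed sleep is a crude stand-in for waiting out the anti-bot challenge:

import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.zolo.ca/")
time.sleep(10)  # crude wait so the Cloudflare challenge can complete
html = driver.page_source
driver.quit()
print(len(html))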
Cloudflare checks your TLS settings, and that is how requests gets detected.
Charles isn't detected because it intercepts the connection and negotiates TLS itself, which changes your TLS fingerprint.
Additionally, the site may have blocked HTTP/1.x, which is the only protocol requests supports.
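One way to test the TLS-fingerprint theory (my suggestion, not something from the thread) is a client that impersonates a browser handshake, for example the third-party curl_cffi package, whose requests-like API takes an impersonate target:

# pip install curl_cffi  (hypothetical test of the TLS theory;
# "chrome110" is one of curl_cffi's built-in impersonation targets)
from curl_cffi import requests as cffi_requests

r = cffi_requests.get("https://www.zolo.ca/", impersonate="chrome110")
print(r.status_code)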
I am trying to request https://api.dex.guru/v1/tokens/0x7060d3F1CC70A07f4768560B9D9B692ac29244dE using Python. I have tried tons of different things, but they all respond with a 403 Forbidden error. I have tried everything I can think of and have googled with no success.
Currently my code for this request looks like this:
import requests

headers = {
    'authority': 'api.dex.guru',
    'cache-control': 'max-age=0',
    'sec-ch-ua': '^\\^',  # likely garbled by the cURL copy
    'sec-ch-ua-mobile': '?0',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'none',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'sec-fetch-dest': 'document',
    'accept-language': 'en-US,en;q=0.9',
    'cookie': '...',  # cookies redacted
}

response = requests.get('https://api.dex.guru/v1/tradingview/symbols?symbol=0x7060d3f1cc70a07f4768560b9d9b692ac29244de-bsc', headers=headers)
Then I print out the response, and it is a 403 error. Please help; I need this data for a project.
Good afternoon.
I have managed to get this to work with the help of another user on Reddit.
The key to getting this API call to work is to use the cloudscraper module:
import cloudscraper
scraper = cloudscraper.create_scraper() # returns a CloudScraper instance
print(scraper.get("https://api.dex.guru/v1/tokens/0x8076C74C5e3F5852037F31Ff0093Eeb8c8ADd8D3-bsc").text)
This gave me a 200 response with the expected JSON content (substitute your URL for mine above and you should get the same).
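Since cloudscraper returns standard requests Response objects, you can also parse the body directly; a small follow-up sketch (assuming the body is a JSON object, whose field names I haven't verified):

# .json() works because cloudscraper returns a requests.Response
data = scraper.get("https://api.dex.guru/v1/tokens/0x8076C74C5e3F5852037F31Ff0093Eeb8c8ADd8D3-bsc").json()
print(list(data)[:10])  # peek at the top-level field names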
Many thanks
Jimmy
I tried messing around with this myself; it appears the site has some sort of DDoS protection from Cloudflare that is blocking these API calls. I'm not an expert in Python or headers by any means, so you might already be supplying something to deal with that. However, I looked at their website, and it seems the API is still in development. Finally, I was getting 503 errors instead, and I was able to access the API normally through my browser. Happy to tinker around with this more if you don't mind explaining what some of the cookies/headers are doing.
Try checking the body of the response (response.content or response.text), as that might give you a clearer picture of why you get blocked.
For me it looks like they do some filtering based on the User-Agent. I get a Cloudflare DoS-protection page (with an HTTP 503 response, for example). Using a user-agent string that suggests JavaScript won't work, I get an HTTP 200:

import requests

headers = {"User-Agent": "HTTPie/2.4.0"}
r = requests.get("https://api.dex.guru/v1/tokens/0x7060d3F1CC70A07f4768560B9D9B692ac29244dE", headers=headers)
So I'm making a function that returns whether I can watch a certain anime on, in this case, Crunchyroll, but after looking for the right solution for a couple of days I could not quite find the answer. I am very new to web scraping, so I don't have any experience.
Here are the headers I've added. The only one that makes a difference right now is the Host header.
header = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-GB,en;q=0.9,nl;q=0.8,ja;q=0.7',
    'connection': 'keep-alive',
    'cache-control': 'max-age=0',
    'dnt': '1',
    'sec-fetch-dest': 'document',
    'sec-ch-ua-mobile': '?0',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'sec-ch-ua': '"Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"',
    'host': 'crunchyroll.com',
    'referer': 'https://www.google.com/',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'
}
And here is the latest version I tried. I have already tried to work with the requests module, but it keeps giving me the "exceeding 20 redirects" error. After using allow_redirects=False it gives me a 301 error, but if there is a solution through the requests module I'd be happy with that too.
(namelist is, for example: [rezero-kara-hajimeru-isekai-seikatsu, rezero-starting-life-in-another-world-, rezero])
import urllib.request
import urllib.error
from http.cookiejar import CookieJar

for i in namelist:
    # Crunchyroll checker
    Cleanlink = 'https://www.crunchyroll.com/en-gb/'
    attempt = Cleanlink + i
    try:
        req = urllib.request.Request(attempt, headers=header)
        cj = CookieJar()
        opener = urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor(cj),
            urllib.request.HTTPRedirectHandler,
        )
        response = opener.open(req)
        response.close()
        print(response)
    except urllib.error.HTTPError as inst:
        output = format(inst)
        print(output)
This code gives me this response:
The last 30x error message was:
Moved Permanently
The last 30x error message was:
Moved Permanently
HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently
So the only thing I need is for the code to be able to check whether a page exists. For example, https://www.crunchyroll.com/en-gb/rezero-starting-life-in-another-world- should return the 200 code.
Thanks in advance.
Crunchyroll has a very strict Cloudflare WAF. If your requests have anything fishy about them, it will give you a hard time right away. Possible reasons why you get a 301:
Maybe you are requesting from behind a proxy.
The WAF may check whether the requester (urllib or the requests module, in your case) has JavaScript enabled, so it can tell whether the requester is a bot or a real user.
Solution
You should use this Python lib to do the request for you. It is a wrapper around the requests lib, so you can use it as if you were using the requests module, as sketched below.
https://pypi.org/project/cloudscraper/
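A sketch of the question's loop rewritten on top of cloudscraper (same namelist as in the question; cloudscraper exposes the requests API, so .get() and status_code work as usual):

import cloudscraper

namelist = [
    "rezero-kara-hajimeru-isekai-seikatsu",
    "rezero-starting-life-in-another-world-",
    "rezero",
]

scraper = cloudscraper.create_scraper()  # drop-in, requests-like session
for name in namelist:
    url = "https://www.crunchyroll.com/en-gb/" + name
    resp = scraper.get(url)
    print(url, resp.status_code)  # 200 means the page exists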
Note
Even with this lib, you can only send a few hundred requests every few hours, because the WAF will still detect that your IP is requesting too much and block you. Crunchyroll's WAF is nasty.
I am trying to make a program that checks for ski-lift reservation openings. So far I am able to get the correct response from the API, but it only works for about 15 minutes before some cookie expires. Here is my current process.
Go to the site https://www.keystoneresort.com/plan-your-trip/lift-access/tickets.aspx and look at the network response, then copy the highlighted XHR request as cURL (bash):
website/API in question
I then take that cURL (bash) and import it into Postman and get the response:
Postman response
Then I take the code from Postman so I can run it in Python:
Code used by Postman
import requests, json

url = ("https://www.keystoneresort.com/api/LiftAccessApi/GetLiftTicketControlReservationInventory"
       "?startDate=01%2F21%2F2021&endDate=03%2F06%2F2021&_=1611254694375")

payload = {}
headers = {
    'authority': 'www.keystoneresort.com',
    'accept': 'application/json, text/javascript, */*; q=0.01',
    'x-queueit-ajaxpageurl': 'https%3A%2F%2Fwww.keystoneresort.com%2Fplan-your-trip%2Flift-access%2Ftickets.aspx%3FstartDate%3D01%252F23%252F2021%26numberOfDays%3D1%26ageGroup%3DAdult',
    'x-requested-with': 'XMLHttpRequest',
    '__requestverificationtoken': 'mbVIzNL1qZUKDT3Re8H9kXVNoYLmQPC-tgLCSbM_inVSN1v_2Pei-A-GWDaKL7i6NRIVTr0lnlmiYACNvfmd6Zzsikk1:HI8y8wZJXMuP7nsTJwS-adYZu7FoHVPVHWY5naHRiB71dg2PzehuQa8WJy418eIrVqwmvhw-a1F34sJ425mXzWpEANE1',
    'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Mobile Safari/537.36',
    'save-data': 'off',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://www.keystoneresort.com/plan-your-trip/lift-access/tickets.aspx?startDate=01%2F23%2F2021&numberOfDays=1&ageGroup=Adult',
    'accept-language': 'en-US,en;q=0.9',
    'cookie': 'QueueITAccepted-SDFrts345E-V3_vailresortsecomm1=EventId%3Dvailresortsecomm1%26QueueId%3D96d15411-09e1-4443-89a3-f0d6e4cef5d5%26RedirectType%3Dsafetynet%26IssueTime%3D1611254692%26Hash%3D06e1aecd2d5cdf64363d53f4fc63f1c22316f604895cd3ecfd1d8b03f86ba36a; TS019b45a2=01d73c084b0f6abf04d77ffeb9e37953f3d047ebae13a4f5ffa8e69045bf156b4959e093cf10f08359c6f45a491fdc474e068898a9; TS01f060ff=01d73c084b0f6abf04d77ffeb9e37953f3d047ebae13a4f5ffa8e69045bf156b4959e093cf10f08359c6f45a491fdc474e068898a9; AMCV_974C370453295F9A0A490D44%40AdobeOrg=1406116232%7CMCIDTS%7C18649%7CMCMID%7C30886069937558409272202898840476568322%7CMCAAMLH-1611859494%7C9%7CMCAAMB-1611859494%7CRKhpRz8krg2tLO6pguXWp5olkAcUniQYPHaMWWgdJ3xzPWQmdj0y%7CMCOPTOUT-1611261894s%7CNONE%7CMCAID%7CNONE%7CvVersion%7C2.5.0;'
}

s = requests.Session()
y = s.get(url)
print(y)

response = requests.request("GET", url, headers=headers, data=payload)
todos = json.loads(response.text)
x = json.dumps(todos, indent=2)
print(x)
Now if you run this in Python, it will not work, because the cookies will have expired for this session by the time someone tries it; you would have to follow the process I listed above to see what I am doing. The response I get looks like this, which is what I want, except that it expires.
Python response
I have looked extensively at different ways I can get the cookies using requests and Selenium. All the solutions I have tried only get some of the cookies, not all of them. I need the ones in the "cookie" header listed in my code, but I have not found a way to get those without refreshing the page, posting the cURL into Postman, and copying the response. I am still fairly new to Python and coding in general, so don't go too hard on me if the answer is super simple.
I think some of these cookies are set by JavaScript, which may be part of the problem. I can also delete some of the cookies in my code and have it still work (until it expires). If there is an easier way to do what I am doing, please let me know.
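For reference, the Selenium-to-requests cookie handoff I keep attempting looks roughly like this (Chrome and Selenium 4 assumed; the fixed sleep is a crude wait for the scripts that set cookies late):

import time
import requests
from selenium import webdriver

# Load the page in a real browser so its scripts can set every cookie,
# including the JavaScript-set ones a plain HTTP client never receives.
driver = webdriver.Chrome()
driver.get("https://www.keystoneresort.com/plan-your-trip/lift-access/tickets.aspx")
time.sleep(15)  # crude wait; the queue/anti-bot scripts set cookies late

# Copy every browser cookie into a requests session.
s = requests.Session()
for c in driver.get_cookies():
    s.cookies.set(c["name"], c["value"], domain=c["domain"])
driver.quit()

# Reuse the session, plus the non-cookie headers from above, for the API call.
r = s.get(url, headers={k: v for k, v in headers.items() if k != "cookie"})
print(r.status_code)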
Thanks.