I'm trying to scrape Google Lens with Python requests, but I can't find the request that uploads the image, or work out how the image is encoded.
The request (whose response is the image analysis) is the following:
import requests
cookies = {
    'CONSENT': 'PENDING+XXX',
    'SOCS': 'XXXXXXXXXXXXXXXXXXXXXXXXXXX',
    'HSID': 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
    'SSID': 'XXXXXXXXXXXXXXXXXXXXXXXXXXXX',
    'APISID': 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
    'SAPISID': 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
    '__Secure-1PAPISID': 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
    'SID': 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXx.',
    '__Secure-1PSID': 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
    'SIDCC': 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
    '__Secure-1PSIDCC': 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
    'AEC': 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
    'NID': 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
    'OTZ': 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
    '__Secure-ENID': 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
}
headers = {
    'authority': 'lens.google.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-language': 'de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7',
    'referer': 'https://lens.google.com/upload?hl=de-CH&re=df&st=1675340672651&plm=ChAIARIMCIDX7p4GEMDxtbYC&ep=gisbubb',
    'sec-ch-ua': '"Not_A Brand";v="99", "Google Chrome";v="109", "Chromium";v="109"',
    'sec-ch-ua-arch': '"x86"',
    'sec-ch-ua-bitness': '"64"',
    'sec-ch-ua-full-version': '"109.0.5414.120"',
    'sec-ch-ua-full-version-list': '"Not_A Brand";v="99.0.0.0", "Google Chrome";v="109.0.5414.120", "Chromium";v="109.0.5414.120"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-model': '""',
    'sec-ch-ua-platform': '"Windows"',
    'sec-ch-ua-platform-version': '"10.0.0"',
    'sec-ch-ua-wow64': '?0',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
    'x-client-data': 'CJe2yQEIorbJAQjBtskBCKmdygEIte3KAQiTocsBCPKCzQEIv4TNAQiAjM0BCIiMzQEI14zNAQiGjc0BCMeNzQEI1Y7NAQj2js0BCNLhrAII8fStAg==',
}
p = "AfVzNa_TGIdeDaL6ZaPXF7Wx8FCDSF8grbjYLUPXuk5_7Ia3vUCoQ5BUa8slWojngiUp-88dvc59Ohx3_22wAH3GXJHgaT-bLnpAm0r-5YjYIErXRCYJJ0ndUQUxxdF1JptYTdjqaEXXRR87igdc_xBCpxGpdXkXrf7Nf226SST0MdF3vF7mmtvJyklqA8494byV6bj_I92D3vihWglO3OV6phVD1zsqVyfSU_qZvtuEPEA59LETwQ4SKlztDy0fMWmBGgCsXiCuz2bWH2bOIRqUFo0stSVAvscHpY0iIVcEyRYQhXBxRkibV6UvnSIK2w_JQZV7TP4AkRRBPCwy2iKu-KJS6R28OZ3ABqIth7IPDLGymZKQ20vl_HPjXBHAgHzZgFLTs-AfR7zkmsnyWQ9FB77YVA"
response = requests.get(
    'https://lens.google.com/search?p=' + p + '%3D%3D&ep=gisbubb&hl=en-US&re=df&st=1675340672651&plm=ChAIARIMCIDX7p4GEMDxtbYCCg8IFRILCIDX7p4GENCgvHUKDwgWEgsIgNfungYQkM3CdQoPCBMSCwiA1%2B6eBhCA/MJ1ChAIFBIMCIDX7p4GEOjKj7MC',
    cookies=cookies,
    headers=headers,
)
The p parameter in the URL looks like data to me, but:
it seems too short for an image?
And I can't decode the string as Base64 into an image. Any ideas?
p in my case is:
AfVzNa_TGIdeDaL6ZaPXF7Wx8FCDSF8grbjYLUPXuk5_7Ia3vUCoQ5BUa8slWojngiUp-88dvc59Ohx3_22wAH3GXJHgaT-bLnpAm0r-5YjYIErXRCYJJ0ndUQUxxdF1JptYTdjqaEXXRR87igdc_xBCpxGpdXkXrf7Nf226SST0MdF3vF7mmtvJyklqA8494byV6bj_I92D3vihWglO3OV6phVD1zsqVyfSU_qZvtuEPEA59LETwQ4SKlztDy0fMWmBGgCsXiCuz2bWH2bOIRqUFo0stSVAvscHpY0iIVcEyRYQhXBxRkibV6UvnSIK2w_JQZV7TP4AkRRBPCwy2iKu-KJS6R28OZ3ABqIth7IPDLGymZKQ20vl_HPjXBHAgHzZgFLTs-AfR7zkmsnyWQ9FB77YVA==
In the network tab, when uploading the image, I can't find any other request that carries the image data.
Also, how would I encode an image into such a string using Python?
As you guessed, the data seems too short for the image (in Base64 or any other encoding).
We cannot tell for certain what happens inside Google's image-search internals, but the following scenario comes to mind (search systems of this kind usually work this way):
The user first uploads the image to Google Lens, and Google allocates an ID to the uploaded image in its internal database. You see that ID as the p parameter in the search URL and in your code. The image search then uses that ID to refer to the uploaded image in its internal database.
Just to convince yourself that a tiny string like p cannot hold a whole image, run base64.b64encode(open('path/to/image.png', 'rb').read()) and note that the result is a very long string.
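For example (any local image file will do):

import base64

# Base64 inflates the raw bytes by about a third, so even a small image
# encodes to a string far longer than the few hundred characters of p.
with open('path/to/image.png', 'rb') as f:
    encoded = base64.b64encode(f.read())
print(len(encoded))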
If you watch the network tab in Google Chrome more closely, you will notice that the user is first redirected to an address like https://lens.google.com/upload?re=df&st=some_number_here&plm=internal_database_identifier and is then redirected to the main search page with the p parameter in the address bar.
So in order to use Google image search, the best solution is to use the official API and libraries like this. But if you insist on the unofficial way, something like Selenium can act like a browser and fetch the parameter you are looking for, as in the sketch below.
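A rough sketch of the Selenium route (the file-input selector and the waits are my assumptions; Google changes this page frequently, so expect to adjust it):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://lens.google.com/")
# Assumption: the upload dialog exposes a plain <input type="file">.
file_input = WebDriverWait(driver, 10).until(
    lambda d: d.find_element(By.CSS_SELECTOR, "input[type='file']")
)
file_input.send_keys("/absolute/path/to/image.png")
# After the upload Google redirects to /search?p=..., so the ID can be read
# back out of the address bar.
WebDriverWait(driver, 15).until(lambda d: "p=" in d.current_url)
print(driver.current_url)
driver.quit()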
Related
I am trying to get the response body of this request "ListByMovieAndDate" from this specific website:
https://hkmovie6.com/movie/d88a803b-4a76-488f-b587-6ccbd3f43d86/SHOWTIME
The screenshot below shows the request in Chrome DevTools.
I have tried several methods to mimic the request, including
copying the request as cURL (bash) and using a tool to translate it into a Python request:
import requests
headers = {
    'authority': 'hkmovie6.com',
    'sec-ch-ua': '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"',
    'authorization': 'eyJhbGciOiJIUzUxMiIsImtpZCI6ImFjY2VzcyIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJtb3ZpZTYiLCJhdWQiOiJyb2xlLmJhc2ljIiwiZXhwIjoxNjI4MDg0NTUxLCJpYXQiOjE2MjgwODI3NTEsImp0aSI6IjQxZjJmZDBjLTk3YzgtNDFiYi04NDRiLTU5YWM5MTY0ZmYyNSJ9.jz_G80XDafzSHyzxog1IAY_xikAdQEEFizJXkiiHkNhwAY-MWF1E11Nel7WrsDlE184tcFtSjUKbHdx7281dFA',
    'x-grpc-web': '1',
    'language': 'zhHK',
    'sec-ch-ua-mobile': '?0',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
    'content-type': 'application/grpc-web+proto',
    'accept': '*/*',
    'origin': 'https://hkmovie6.com',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://hkmovie6.com/movie/d88a803b-4a76-488f-b587-6ccbd3f43d86/SHOWTIME',
    'accept-language': 'en-US,en;q=0.9,zh-TW;q=0.8,zh;q=0.7,ja;q=0.6',
    'cookie': '__stripe_mid=dfb76ec9-1469-48ef-81d6-659f8d7c12da9a119d; lang=zhHK; auth=%7B%22isLogin%22%3Afalse%2C%22access%22%3A%7B%22token%22%3A%22eyJhbGciOiJIUzUxMiIsImtpZCI6ImFjY2VzcyIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJtb3ZpZTYiLCJhdWQiOiJyb2xlLmJhc2ljIiwiZXhwIjoxNjI4MDg0NTUxLCJpYXQiOjE2MjgwODI3NTEsImp0aSI6IjQxZjJmZDBjLTk3YzgtNDFiYi04NDRiLTU5YWM5MTY0ZmYyNSJ9.jz_G80XDafzSHyzxog1IAY_xikAdQEEFizJXkiiHkNhwAY-MWF1E11Nel7WrsDlE184tcFtSjUKbHdx7281dFA%22%2C%22expiry%22%3A1628084551%7D%2C%22refresh%22%3A%7B%22token%22%3A%22eyJhbGciOiJIUzUxMiIsImtpZCI6InJlZnJlc2giLCJ0eXAiOiJKV1QifQ.eyJpc3MiOiJtb3ZpZTYiLCJhdWQiOiJyb2xlLmJhc2ljIiwiZXhwIjoxNjMwNjc0NzUxLCJpYXQiOjE2MjgwODI3NTEsImp0aSI6IjM0YWFjNWVhLTkwZTctNDdhYS05OTE3LTQ5N2UxMGUwNmU3YSJ9.Mrwt2iWddQHthQNHafF4mirU-JiynidiTzq0X4J96IMICcWbWEoZBB4M1HhvFdeB2WvU1nHaNDyMZEhkINKK8g%22%2C%22expiry%22%3A1630674751%7D%7D; showtimeMode=time; _gid=GA1.2.2026576359.1628082750; _ga=GA1.2.704463189.1627482203; _ga_8W8P8XEJX1=GS1.1.1628082750.11.1.1628083640.0',
}
# NB: this body came from Chrome's "copy as cURL"; binary gRPC-Web frames
# rarely survive that copy intact, which likely explains the error below.
data = '$\x00\x00\x00\x00,\n$d88a803b-4a76-488f-b587-6ccbd3f43d86\x10\x80\xb1\xa7\x88\x06'
response = requests.post('https://hkmovie6.com/m6-api/showpb.ShowAPI/ListByMovieAndDate', headers=headers, data=data)
All I got was a response header with the message grpc: received message larger than max:
{'Content-Type': 'application/grpc-web+proto', 'grpc-status': '8',
'grpc-message': 'grpc: received message larger than max (1551183920
vs. 4194304)', 'x-envoy-upstream-service-time': '49',
'access-control-allow-origin': 'https://hkmovie6.com',
'access-control-allow-credentials': 'true',
'access-control-expose-headers': 'grpc-status,grpc-message',
'X-Cloud-Trace-Context': '72c873ad3012ad710f938098310f7f11', ...
I also tried using Postman Interceptor to capture the actual request sent when I browsed the site, this time getting a different message:
I managed to get the response body when I used Selenium, but that is far from ideal performance-wise.
I wonder if gRPC is a hint, but I spent several hours reading about it without getting what I wanted.
My only question is whether it is possible to get the "ListByMovieAndDate" response just by making a simple Python HTTP request to the API URL? Thanks!
An admittedly cursory read suggests that the backend is gRPC and that the client you're introspecting uses gRPC-Web, which is a clever solution to the problem of wanting to make gRPC requests from a JavaScript client.
Suffice to say, you can't access the backend using HTTP/1 and REST if it is indeed gRPC, but you may (!) be able to craft a Python gRPC client that talks to it, provided there are no constraints on e.g. client IP or type and no auth.
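A rough sketch of that idea (everything here is an assumption: the message layout is inferred from the captured body, and the "message larger than max" error above is consistent with the copied binary body being mangled, so the frame is rebuilt by hand):

import struct
import requests

movie_id = b"d88a803b-4a76-488f-b587-6ccbd3f43d86"
# Protobuf payload guessed from the capture: field 1 (tag 0x0a) is the
# length-prefixed movie id; field 2 (tag 0x10) is a varint copied verbatim
# from DevTools (presumably the date).
message = b"\x0a" + bytes([len(movie_id)]) + movie_id + b"\x10\x80\xb1\xa7\x88\x06"
# gRPC-Web framing: 1 flag byte (0x00 = data frame), then a 4-byte
# big-endian payload length, then the payload itself.
frame = b"\x00" + struct.pack(">I", len(message)) + message

resp = requests.post(
    "https://hkmovie6.com/m6-api/showpb.ShowAPI/ListByMovieAndDate",
    headers={"content-type": "application/grpc-web+proto", "x-grpc-web": "1"},
    data=frame,
)
print(resp.status_code, resp.content[:200])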
I get 403 Forbidden when I use Python requests to access the site.
However, when I have Charles Proxy open, it works.
When I use Fiddler, I still get 403.
I want to know why this happens.
import requests

def get_test():
    # proxies = {'http': 'http://127.0.0.1:8888', 'https': 'http://127.0.0.1:8888'}
    proxies = None
    url = ""
    my_header = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7,ja;q=0.6,zh-HK;q=0.5',
        'cache-control': 'max-age=0',
        'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
        'sec-ch-ua-mobile': '?0',
        'sec-fetch-dest': 'document',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-site': 'none',
        'sec-fetch-user': '?1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    }
    rsp = requests.get(url=url, headers=my_header)
    print(rsp)

if __name__ == '__main__':
    get_test()
I tried requesting this page with Postman and also got 403 Forbidden. It seems that this website uses Cloudflare's anti-bot page against web scrapers, which is hard to get past on your own. That is why the 403 Forbidden happens.
So I try to use cloudscraper to solve this problem:
import cloudscraper
scraper = cloudscraper.create_scraper()
print(scraper.get("https://www.zolo.ca/").text)
but I get this exception:
cloudscraper.exceptions.CloudflareChallengeError: Detected a Cloudflare version 2 Captcha challenge, This feature is not available in the opensource (free) version.
It seems the open-source (free) version of cloudscraper can't solve this challenge, and there isn't much more you can do with it.
For more details of cloudscraper, you can see this page or github:
https://pypi.org/project/cloudscraper/
https://github.com/VeNoMouS/cloudscraper
If you urgently need to scrape the website, you can try Selenium.
Although this method is not elegant, it will certainly prove equal to the task.
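A minimal sketch (it assumes a local Chrome/chromedriver setup, and Cloudflare may still challenge automated browsers, so treat it as a starting point):

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.zolo.ca/")
# The JS challenge runs in a real browser, so once it settles the page
# source is the actual page rather than the challenge interstitial.
html = driver.page_source
driver.quit()
print(html[:500])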
Cloudflare fingerprints your TLS settings, and that is how requests is being detected.
The reason Charles isn't detected is that Charles terminates and re-establishes the TLS connection itself, so the fingerprint the site sees is Charles's, not that of requests.
Additionally, the site may have blocked HTTP/1.x, which is the only protocol requests supports.
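One workaround (my suggestion, not part of the original answer) is a client that impersonates a browser's TLS and HTTP/2 fingerprint, such as curl_cffi:

# pip install curl_cffi
from curl_cffi import requests as cffi_requests

# impersonate= makes curl present a Chrome-like TLS/HTTP2 fingerprint;
# the exact version tags available depend on the curl_cffi release.
r = cffi_requests.get("https://www.zolo.ca/", impersonate="chrome110")
print(r.status_code)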
I am trying to request https://api.dex.guru/v1/tokens/0x7060d3F1CC70A07f4768560B9D9B692ac29244dE using Python. I have tried tons of different things, but they all respond with 403 Forbidden. I have tried everything I can think of and have googled with no success.
Currently my code for this request looks like this:
headers = {
    'authority': 'api.dex.guru',
    'cache-control': 'max-age=0',
    'sec-ch-ua': '^\\^',
    'sec-ch-ua-mobile': '?0',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'none',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'sec-fetch-dest': 'document',
    'accept-language': 'en-US,en;q=0.9',
    'cookie': (cookies are here)
}
response = requests.get('https://api.dex.guru/v1/tradingview/symbols?symbol=0x7060d3f1cc70a07f4768560b9d9b692ac29244de-bsc', headers=headers)
Then I print out the response and it is a 403 error. Please help, I need this data for a project.
Good afternoon.
I have managed to get this to work with the help of another user on Reddit.
The key to getting this API call to work is to use the cloudscraper module:
import cloudscraper
scraper = cloudscraper.create_scraper() # returns a CloudScraper instance
print(scraper.get("https://api.dex.guru/v1/tokens/0x8076C74C5e3F5852037F31Ff0093Eeb8c8ADd8D3-bsc").text)
This gave me a 200 response with the expected JSON content (substitute your URL for mine above and you should get the same).
Many thanks
Jimmy
I tried messing around with this myself; it appears the site has some sort of DDoS protection from Cloudflare blocking these API calls. I'm not an expert in Python or headers by any means, so you might already be supplying something to deal with that. Looking at their website, it seems the API is still in development. Finally, I was getting 503 errors instead, yet I was able to access the API normally through my browser. Happy to tinker around more with this if you don't mind explaining what some of the cookies/headers are doing.
Try checking the body of the response (response.content or response.text), as that might give you a clearer picture of why you get blocked. For example:
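import requests

# Dumping the first part of the body usually makes a Cloudflare challenge
# page obvious (a quick hypothetical check, using the URL from the question).
r = requests.get("https://api.dex.guru/v1/tokens/0x7060d3F1CC70A07f4768560B9D9B692ac29244dE")
print(r.status_code)
print(r.text[:500])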
For me it looks like they do some filtering based on the User-Agent. I also get a Cloudflare DoS protection page (with an HTTP 503 response, for example). Using a User-Agent string that suggests JavaScript won't work, I get an HTTP 200:
import requests

headers = {"User-Agent": "HTTPie/2.4.0"}
r = requests.get("https://api.dex.guru/v1/tokens/0x7060d3F1CC70A07f4768560B9D9B692ac29244dE", headers=headers)
I am trying to make a program that checks for ski lift reservation openings. So far I am able to get the correct response from the API, but it only works for about 15 minutes before some cookie expires. Here is my current process:
Go to the site https://www.keystoneresort.com/plan-your-trip/lift-access/tickets.aspx, look at the network response, then copy the highlighted XHR request as cURL (bash).
website/api in question
I then take that cURL (bash), import it into Postman, and get the response:
Postman response
Then I take the code from Postman so I can run it in Python:
Code used by postman
import requests, json

url = "https://www.keystoneresort.com/api/LiftAccessApi/GetLiftTicketControlReservationInventory?startDate=01%2F21%2F2021&endDate=03%2F06%2F2021&_=1611254694375"

payload = {}
headers = {
    'authority': 'www.keystoneresort.com',
    'accept': 'application/json, text/javascript, */*; q=0.01',
    'x-queueit-ajaxpageurl': 'https%3A%2F%2Fwww.keystoneresort.com%2Fplan-your-trip%2Flift-access%2Ftickets.aspx%3FstartDate%3D01%252F23%252F2021%26numberOfDays%3D1%26ageGroup%3DAdult',
    'x-requested-with': 'XMLHttpRequest',
    '__requestverificationtoken': 'mbVIzNL1qZUKDT3Re8H9kXVNoYLmQPC-tgLCSbM_inVSN1v_2Pei-A-GWDaKL7i6NRIVTr0lnlmiYACNvfmd6Zzsikk1:HI8y8wZJXMuP7nsTJwS-adYZu7FoHVPVHWY5naHRiB71dg2PzehuQa8WJy418eIrVqwmvhw-a1F34sJ425mXzWpEANE1',
    'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Mobile Safari/537.36',
    'save-data': 'off',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://www.keystoneresort.com/plan-your-trip/lift-access/tickets.aspx?startDate=01%2F23%2F2021&numberOfDays=1&ageGroup=Adult',
    'accept-language': 'en-US,en;q=0.9',
    'cookie': 'QueueITAccepted-SDFrts345E-V3_vailresortsecomm1=EventId%3Dvailresortsecomm1%26QueueId%3D96d15411-09e1-4443-89a3-f0d6e4cef5d5%26RedirectType%3Dsafetynet%26IssueTime%3D1611254692%26Hash%3D06e1aecd2d5cdf64363d53f4fc63f1c22316f604895cd3ecfd1d8b03f86ba36a; TS019b45a2=01d73c084b0f6abf04d77ffeb9e37953f3d047ebae13a4f5ffa8e69045bf156b4959e093cf10f08359c6f45a491fdc474e068898a9; TS01f060ff=01d73c084b0f6abf04d77ffeb9e37953f3d047ebae13a4f5ffa8e69045bf156b4959e093cf10f08359c6f45a491fdc474e068898a9; AMCV_974C370453295F9A0A490D44%40AdobeOrg=1406116232%7CMCIDTS%7C18649%7CMCMID%7C30886069937558409272202898840476568322%7CMCAAMLH-1611859494%7C9%7CMCAAMB-1611859494%7CRKhpRz8krg2tLO6pguXWp5olkAcUniQYPHaMWWgdJ3xzPWQmdj0y%7CMCOPTOUT-1611261894s%7CNONE%7CMCAID%7CNONE%7CvVersion%7C2.5.0;',
}

s = requests.Session()
y = s.get(url)
print(y)

response = requests.request("GET", url, headers=headers, data=payload)
todos = json.loads(response.text)
x = json.dumps(todos, indent=2)
print(x)
Now if you run this in Python, it will not work, because the cookies will have expired for this session by the time someone tries it. So you would have to follow the process I listed above to see what I am doing. The response I get looks like this, which is what I want, except that it expires:
Python response
I have looked extensively at different ways to get the cookies using requests and Selenium. All the solutions I have tried only get some of the cookies, not all of them. I need the ones in the "cookie" header listed in my code, but I have not found a way to get those without refreshing the page, posting the cURL into Postman, and copying the response back out. I am still fairly new to Python and coding in general, so don't go too hard on me if the answer is super simple.
I think some of these cookies are set by JavaScript, which may be part of the problem. I can also delete some of the cookies in my code and have it still work (until it expires). If there is an easier way to do what I am doing, please let me know; something like the sketch below is what I have in mind.
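(The waits and parameters here are guesses on my part, not working code:)

import requests
from selenium import webdriver

# Let a real browser run the JavaScript that sets the cookies, then copy
# every cookie into a requests.Session and call the API from there.
driver = webdriver.Chrome()
driver.get("https://www.keystoneresort.com/plan-your-trip/lift-access/tickets.aspx")
# ... wait here until the page (and any queue/protection script) settles ...

session = requests.Session()
for c in driver.get_cookies():
    session.cookies.set(c["name"], c["value"], domain=c["domain"])
driver.quit()

resp = session.get(
    "https://www.keystoneresort.com/api/LiftAccessApi/GetLiftTicketControlReservationInventory",
    params={"startDate": "01/21/2021", "endDate": "03/06/2021"},
    headers={"x-requested-with": "XMLHttpRequest"},
)
print(resp.status_code)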
Thanks.
I am trying to web-scrape Pokemon information from the online Pokedex, but I'm having trouble with the findAll() function. I've got:
containers = page_soup.findAll("div", {"class": "pokemon-info"})
but I'm not sure if this div is where I need to be looking at all, because (see the HTML in the photo) this div is inside a li, so perhaps I should search within that instead, like so:
containers = page_soup.findAll("li", {"class": "animating"})
But in both cases, when I use len(containers), the length returned is always 0, even though there are several entries.
I also tried find_all(), but the results of len() are the same.
The problem is that BeautifulSoup can't run JavaScript. As furas said, you should open the webpage with JavaScript turned off (here's how) and then see if you can still access what you want. If you can't, you need something like Selenium to control a real browser.
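A quick way to verify this (my sketch, reusing the class name from the question):

import requests
from bs4 import BeautifulSoup

# Fetch the raw HTML the server sends, with no JavaScript executed.
html = requests.get("https://www.pokemon.com/us/pokedex/").text
soup = BeautifulSoup(html, "html.parser")
# If this prints 0, the entries are injected by JavaScript after page load.
print(len(soup.find_all("div", class_="pokemon-info")))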
As the other comments and answer suggest, the site loads the data in the background. The most common response to this is to use Selenium; my approach is to first check for any API calls in Chrome. Luckily for us, the page retrieves 953 Pokemon on load.
Below is a script that retrieves the clean JSON data, and here is a little article I wrote explaining the use of Chrome developer tools in the first instance over Selenium.
# Gotta catch em all
import requests
import pandas as pd

headers = {
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Referer': 'https://www.pokemon.com/us/pokedex/',
    'Connection': 'keep-alive',
}

# Hit the pokedex API directly and parse the JSON list of pokemon.
r = requests.get('https://www.pokemon.com/us/api/pokedex/kalos', headers=headers)
j = r.json()
print(j[0])
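Since pandas is already imported, the list of Pokemon dicts drops straight into a DataFrame (my addition, not part of the original script):

df = pd.DataFrame(j)
print(df.shape)
print(df.head())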