I'm scraping data from Amazon for educational purposes, and I have some problems with cookies and the antibot. I manage to scrape data, but sometimes the cookies are missing from the response, or the antibot flags me.
I already tried using a list of headers and picking one at random, like this:
headers_list = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "DNT": "1",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-User": "?1",
        "TE": "trailers"
    },
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "fr-FR,fr;q=0.7",
        "cache-control": "max-age=0",
        "content-type": "application/x-www-form-urlencoded",
        "sec-fetch-dest": "document",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "same-origin",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1"
    },
]
And put the following in my code:
headers = random.choice(headers_list)
with requests.Session() as s:
    res = s.get(url, headers=headers)
    if not res.cookies:
        print("Error getting cookies")
        raise SystemExit(1)
But this doesn't solve the issue: I still sometimes get no cookie in the response, and the antibot still detects me.
I am scraping the data like this:
post = s.post(url, data=login_data, headers=headers, cookies=cookies, allow_redirects=True)
soup = BeautifulSoup(post.text, 'html.parser')
if soup.find('input', {'name': 'appActionToken'})['value'] is not None \
        and soup.find('input', {'name': 'appAction'})['value'] is not None \
        and soup.find('input', {'name': 'subPageType'})['value'] is not None \
        and soup.find('input', {'name': 'openid.return_to'})['value'] is not None \
        and soup.find('input', {'name': 'prevRID'})['value'] is not None \
        and soup.find('input', {'name': 'workflowState'})['value'] is not None \
        and soup.find('input', {'name': 'email'})['value'] is not None:
    print("found")
else:
    print("not found")
    raise SystemExit(1)
But when the antibot detects me, those inputs are not in the page, so soup.find() returns None and indexing it with ['value'] throws an error.
Any idea on how I could prevent that? Thanks!
You can add a
time.sleep(10)
(or some other delay) before each scrape operation. It will be harder for Amazon to catch you, but if you send too many requests at regular intervals, they may detect and block them as well.
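A randomized delay is harder to profile than a fixed one. Here is a minimal sketch, where product_urls and scrape_product() are placeholders standing in for your own URL list and scraping code:
import random
import time

for url in product_urls:
    # wait a random 5-15 seconds so requests don't arrive at a fixed interval
    time.sleep(random.uniform(5, 15))
    scrape_product(url)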
Rotate your request headers with random user agents (update your headers list with more user agents).
Remove everything (tracking parameters) coming after /dp/ASIN/ from the product URL.
For example, after removing tracking parameters your URL will look like this: https://www.amazon.com/Storage-Stackable-Organizer-Foldable-Containers/dp/B097PVKRYM/
Add a little sleep in between requests (use time.sleep()).
Use a proxy with your requests (you can use a Tor proxy; if they block Tor, go with some other paid proxy service). A rough sketch combining these points follows below.
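Here is a minimal sketch of those suggestions together, assuming headers_list from the question, a placeholder product_urls list, and a local Tor SOCKS proxy (the proxy address is only an example, and the socks5h:// scheme needs requests[socks] installed):
import random
import time
import requests

def clean_product_url(url):
    # keep everything up to and including /dp/<ASIN>/, drop tracking parameters
    start = url.find("/dp/")
    if start == -1:
        return url
    end = url.find("/", start + len("/dp/"))
    return url if end == -1 else url[:end + 1]

proxies = {
    "http": "socks5h://127.0.0.1:9050",   # placeholder: local Tor SOCKS proxy
    "https": "socks5h://127.0.0.1:9050",
}

with requests.Session() as s:
    for url in product_urls:                   # product_urls is a placeholder list
        headers = random.choice(headers_list)  # rotate the header sets
        res = s.get(clean_product_url(url), headers=headers, proxies=proxies)
        time.sleep(random.uniform(2, 6))       # small randomized pause between requests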
I'm trying to log in to the site, but I have a problem!
Here is my code:
from requests_ntlm import HttpNtlmAuth
import requests
from main import username, password
data = {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7",
"Authorization": "NTLM TlRMTVNT.......",
"Cache-Control": "no-cache",
"Connection": "keep-alive",
"Cookie": "_ym_uid=1654686701790358885; _ym_d=1654686701; _ym_isad=2",
"Host": "...",
"Pragma": "no-cache",
"Referer": "https://...",
"sec-ch-ua": '" Not A;Brand";v="99", "Chromium";v="104", "Opera GX";v="90"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": "Windows",
"Sec-Fetch-Dest": "empty",
"Sec-Fetch-Mode": "cors",
"Sec-Fetch-Site": "same-origin",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/104.0.5112.102 Safari/537.36 OPR/90.0.4480.117"
}
auth = HttpNtlmAuth(username, password)
with requests.Session() as session:
    q1 = session.get("https://...", auth=auth, headers=data)
    data['Authorization'] = q1.headers.get("WWW-Authenticate")
    q2 = session.get("https://...", auth=auth, headers=data)
    print(q2.raise_for_status())
I need to log in inside the site. I used to use HttpBaseAuth, but after searching through the site's files I saw that it does something odd with NTLM.
The browser makes a GET request with my headers, receives a 401 plus a "WWW-Authenticate" header in the response, and resends the request with the "Authorization" header set to the value of that "WWW-Authenticate" header. The "Authorization" header in the very first request is always the same, its value never changes (unfortunately I can't post it here), but if I send that request myself, the response is still 401 and the header is not visible via response.headers.get.
What should I do?
I can't log in to the site.
If I log in manually in the browser, it makes a GET request, receives the "WWW-Authenticate" header in response, and makes the GET request again with that header.
When I try to do the same thing through Python, I get a 401 error.
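For what it's worth, here is a minimal sketch assuming the site really does use NTLM: requests_ntlm's HttpNtlmAuth runs the 401 challenge/response handshake itself, so the hard-coded Authorization and Cookie headers can be dropped (the URL stays a placeholder, and the username may need the DOMAIN\user form):
import requests
from requests_ntlm import HttpNtlmAuth
from main import username, password

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/104.0.5112.102 Safari/537.36 OPR/90.0.4480.117",
    # no Authorization and no Cookie here: the auth handler answers the 401
    # challenge and the session keeps whatever cookies the server sets
}

with requests.Session() as session:
    session.auth = HttpNtlmAuth(username, password)  # e.g. "DOMAIN\\user"
    response = session.get("https://...", headers=headers)
    response.raise_for_status()
    print(response.status_code)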
I'm trying to submit a file (test.exe) to a website using a POST request, but instead of the normal 302 response it keeps responding with 500. I don't know what to change in my request: maybe the headers, or the format of files, or maybe I need to pass a data parameter somehow?
I would appreciate any advice on this!
import requests
url = "https://cuckoo.cert.ee/submit/api/presubmit"
files = {"test.exe": open("test.exe", "rb")}
headers = {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-US,en;q=0.9",
"Connection": "keep-alive",
"Content-Length": "199",
"Content-Type": "multipart/form-data; boundary=----WebKitFormBoundarymoUA16cLBrh9JNGC",
"Cookie": "csrftoken=O9tFpNhZuZrj7DsEnBAcj0wmV00z8qE3; theme=cyborg; csrftoken=O9tFpNhZuZrj7DsEnBAcj0wmV00z8qE3",
"Host": "cuckoo.cert.ee",
"Origin": "https://cuckoo.cert.ee",
"Referer": "https://cuckoo.cert.ee/submit/",
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": "Windows",
"Sec-Fetch-Dest": "empty",
"Sec-Fetch-Mode": "cors",
"Sec-Fetch-Site": "same-origin",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36",
"X-CSRFToken": "O9tFpNhZuZrj7DsEnBAcj0wmV00z8qE3"
}
response = requests.post(url, headers=headers, files=files, verify=False)
print(response)
Possibly try changing the content type to application/octet-stream.
A 500 error indicates that the website may not be able to handle the file you are trying to upload. It could also simply be that the website is malfunctioning or having a temporary failure.
If you have access to the back-end logs, I would recommend looking at those, or contacting the website to see if they have any suggestions.
EDIT:
Also verify that your content matches the length you are declaring; you have a Content-Length header declared in your request. Try taking that out to see if it helps.
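One thing worth trying, as a sketch: when you pass files=, requests builds the multipart body itself and sets Content-Type (with its own boundary) and Content-Length, so the values copied from the browser conflict with the body actually being sent. Dropping them might look like this (the form field name "file" is an assumption; check the site's own form for the real one):
import requests

url = "https://cuckoo.cert.ee/submit/api/presubmit"

# Let requests generate Content-Type (with a fresh boundary) and Content-Length.
headers = {
    "Referer": "https://cuckoo.cert.ee/submit/",
    "X-CSRFToken": "O9tFpNhZuZrj7DsEnBAcj0wmV00z8qE3",
    "Cookie": "csrftoken=O9tFpNhZuZrj7DsEnBAcj0wmV00z8qE3",
}

with open("test.exe", "rb") as f:
    # "file" is an assumed field name; inspect the browser's request to confirm it
    files = {"file": ("test.exe", f, "application/octet-stream")}
    response = requests.post(url, headers=headers, files=files, verify=False)

print(response.status_code)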
I have been trying to figure this out for a couple of hours, and no answer I've found here or elsewhere has worked. I am using requests version 2.26.0 and trying to grab the cookie to add to my headers for later use. I did notice on the page that the cookie starts with php-session, so perhaps there is a different way I need to grab it besides requests. Anyway, here are the headers I am using and the code I used to try to get the cookie; all it ever outputs is <RequestsCookieJar[]>, no matter what I try.
import requests
headers = {
"Accept": "*/",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-US,en;q=0.5",
"Connection": "keep-alive",
"Content-Length": "565",
"Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
"Cookie": "ig_cb=1",
"DNT": "1",
"Host": "www.numlookup.com",
"Origin": "https://www.numlookup.com",
"Referer": "https://www.numlookup.com",
"Sec-GPC": "1",
"TE": "Trailers",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 Firefox/78.0",
"X-Requested-With": "XMLHttpRequest"
}
s = requests.Session()
s2 = s.get("https://www.numlookup.com/", headers=headers).cookies
print(s2)
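As a small diagnostic sketch (not a fix): check both the response's Set-Cookie header and the session's cookie jar. If both are empty, the php-session cookie is probably set by JavaScript or by a different endpoint, which plain requests will never see:
import requests

with requests.Session() as s:
    res = s.get("https://www.numlookup.com/", headers=headers)  # headers dict from above
    print(res.status_code)
    print(res.headers.get("Set-Cookie"))  # cookies sent with this particular response, if any
    print(s.cookies.get_dict())           # everything the session has collected so far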
I am new to Python. I am sending a POST request using this line of code:
response = requests.post(url=API_ENDPOINT, headers=headers, data=payload)
The problem is that the header values are dynamic (they are different every time in the browser).
These are the headers in the browser:
headers = {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-US,en;q=0.5",
"Connection": "keep-alive",
"Content-Length": "276",
"Content-Type": "application/x-www-form-urlencoded",
"Cookie": "acceptedCookie=%7B%22type%22%3A%22all%22%7D; TS01a14d32=01f893c9654ba8a49f70366efc3464fd76d4a461343cf44a7f074a5071b9818b6b196051effd669b784f691c8fab79bdc5a7efada418db04fc3cf8c3e43224fe186e64941eab43b5d9500201644abda7c0f5914ebb9ab95046ee2cb83c43f259ab0ed0e538fee3db50b2aa541ee5646d70634cea4cec54352547d3366c51e2ae5270756ee57bf78d915dcb8209c9c5771956c715bd75fb761bf42da6ba5cfa34ffbfee670e871ed33f8e25c09fdfc882953efd981f; ASLBSA=85b54f44c65f329c72b20a3ee7a9fc9a63d44001bc2c4e2c2b2f26fdaba7e0e3; ASLBSACORS=85b54f44c65f329c72b20a3ee7a9fc9a63d44001bc2c4e2c2b2f26fdaba7e0e3; utag_main=v_id:0179ddda773b0020fa6584d13ce40004e024f00d00978$_sn:2$_ss:1$_st:1623016566769$_pn:1%3Bexp-session$ses_id:1623014766769%3Bexp-session; s_cc=true; s_fid=3BE425806C624053-0396695F1870C86E; s_sq=luxmyluxottica%3D%2526pid%253DSite%25253APreLogin%25253ALogin%2526pidt%253D1%2526oid%253DLOGIN%2526oidt%253D3%2526ot%253DSUBMIT; todayVisit=true",
"Host": "mywebsite.com",
"Origin": "https://mywebsite.com",
"Referer": "https://mywebsite.com",
"TE": "Trailers",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0"
}
The values of Content-Length, Cookie, and Accept are different every time I hit the API in the browser, so I cannot just copy and paste the header values and send them with the POST request. How do I generate these dynamic headers (Content-Length, cookies, etc.)? Please help.
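A minimal sketch of the usual pattern (LOGIN_PAGE_URL is a placeholder; API_ENDPOINT and payload are the ones from the question): you normally do not send Content-Length or Cookie yourself, because requests computes the length from data and a Session stores whatever cookies an earlier GET sets:
import requests

session = requests.Session()

# Visiting the page first lets the server set its cookies on the session.
session.get(LOGIN_PAGE_URL)  # placeholder URL for the page you normally open first

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0",
    "Referer": "https://mywebsite.com",
    # no Content-Length and no Cookie: requests fills in the length from `data`
    # and the session replays the cookies collected above
}

response = session.post(url=API_ENDPOINT, headers=headers, data=payload)
print(response.status_code)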
I am trying to scrape German zip codes (PLZ) for a given street in a given city using Python's requests on this server. I am trying to apply what I learned here.
I want to return the PLZ of
Schanzäckerstr. in Nürnberg.
import requests
url = 'https://www.11880.com/ajax/getsuggestedcities/schanz%C3%A4ckerstra%C3%9Fe%20n%C3%BCrnberg?searchString=schanz%25C3%25A4ckerstra%25C3%259Fe%2520n%25C3%25BCrnberg'
data = 'searchString=schanz%25C3%25A4ckerstra%25C3%259Fe%2520n%25C3%25BCrnberg'
headers = {"Authority": "wwww.11880.com",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0",
"Accept": "application/json, text/javascript, */*; q=0.01",
"Accept-Language": "de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7",
"Accept-Encoding": "gzip, deflate, br",
"X-Requested-With": "XMLHttpRequest",
"Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
"Content-Length": "400",
"Origin": "https://www.postleitzahlen.de",
"Sec-Fetch-Site": "cross-site",
"Fetch-Mode": "cors",
"DNT": "1",
"Connection": "keep-alive",
"Referer": "https://www.postleitzahlen.de",
}
multipart_data = {(None, data,)}
session = requests.Session()
response = session.get(url, files=multipart_data, headers=headers)
print(response.text)
The above code yields an empty response with status 200. I want it to return:
'90443'
I was able to solve this problem using the Nominatim OpenStreetMap API. One can also include street numbers.
import requests
city = 'Nürnberg'
street = 'Schanzäckerstr. 2'
response = requests.get(
    'https://nominatim.openstreetmap.org/search',
    headers={'User-Agent': 'PLZ_scrape'},
    params={'city': city, 'street': street, 'format': 'json', 'addressdetails': '1'},
)
print(street, ',', [i.get('address').get('postcode') for i in response.json()][0])
Make sure to send only one request per second.
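If you need several streets, here is a sketch that respects that limit (the street list is made up for illustration):
import time
import requests

city = 'Nürnberg'
streets = ['Schanzäckerstr. 2', 'Hauptmarkt 18']  # example input

for street in streets:
    response = requests.get(
        'https://nominatim.openstreetmap.org/search',
        headers={'User-Agent': 'PLZ_scrape'},
        params={'city': city, 'street': street, 'format': 'json', 'addressdetails': '1'},
    )
    results = response.json()
    if results:
        print(street, ',', results[0]['address'].get('postcode'))
    time.sleep(1)  # Nominatim usage policy: at most one request per second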