I am trying to scrape German zip codes (PLZ) for a given street in a given city using Python's requests on this server. I am trying to apply what I learned here.
I want to return the PLZ of
Schanzäckerstr. in Nürnberg.
import requests
url = 'https://www.11880.com/ajax/getsuggestedcities/schanz%C3%A4ckerstra%C3%9Fe%20n%C3%BCrnberg?searchString=schanz%25C3%25A4ckerstra%25C3%259Fe%2520n%25C3%25BCrnberg'
data = 'searchString=schanz%25C3%25A4ckerstra%25C3%259Fe%2520n%25C3%25BCrnberg'
headers = {"Authority": "wwww.11880.com",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0",
"Accept": "application/json, text/javascript, */*; q=0.01",
"Accept-Language": "de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7",
"Accept-Encoding": "gzip, deflate, br",
"X-Requested-With": "XMLHttpRequest",
"Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
"Content-Length": "400",
"Origin": "https://www.postleitzahlen.de",
"Sec-Fetch-Site": "cross-site",
"Fetch-Mode": "cors",
"DNT": "1",
"Connection": "keep-alive",
"Referer": "https://www.postleitzahlen.de",
}
multipart_data = {(None, data,)}
session = requests.Session()
response = session.get(url, files=multipart_data, headers=headers)
print(response.text)
The above code yields an empty response of the type 200. I want to return:
'90443'
I was able to solve this problem using nominatim openstreetmap API. One can also add street numbers
import requests
city = 'Nürnberg'
street = 'Schanzäckerstr. 2'
response = requests.get( 'https://nominatim.openstreetmap.org/search', headers={'User-Agent': 'PLZ_scrape'}, params={'city': city, 'street': street[1], 'format': 'json', 'addressdetails': '1'}, )
print(street, ',', [i.get('address').get('postcode') for i in response.json()][0])
Make sure to only send one request per second.
Related
I'm scraping data from amazon for educational purposes and I have some problems with the cookies and the antibot.
I managed to scrape data, but sometimes, the cookies will not be in the response, or the antibot flags me.
I already tried to use a random list of headers like this:
headers_list = [{
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0",
"Accept-Encoding": "gzip, deflate, br",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.5",
"DNT": "1",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
"Sec-Fetch-User": "?1",
"TE": "trailers"
},
{
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
"Accept-Encoding": "gzip, deflate, br",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
"Accept-Language": "fr-FR,fr;q=0.7",
"cache-control": "max-age=0",
"content-type": "application/x-www-form-urlencoded",
"sec-fetch-dest": "document",
"sec-fetch-mode": "navigate",
"sec-fetch-site": "same-origin",
"sec-fetch-user": "?1",
"upgrade-insecure-requests": "1"
},
]
And put the following in my code:
headers = random.choice(headers_list)
with requests.Session() as s:
res = s.get(url, headers=headers)
if not res.cookies:
print("Error getting cookies")
raise SystemExit(1)
But this doesn't solve the issue, I still sometimes get no cookie in my response and detection from the antibot.
I am scraping the data like this:
post = s.post(url, data=login_data, headers=headers, cookies=cookies, allow_redirects=True)
soup = BeautifulSoup(post.text, 'html.parser')
if soup.find('input', {'name': 'appActionToken'})['value'] is not None \
and soup.find('input', {'name': 'appAction'})['value'] is not None \
and soup.find('input', {'name': 'subPageType'})['value'] is not None \
and soup.find('input', {'name': 'openid.return_to'})['value'] is not None \
and soup.find('input', {'name': 'prevRID'})['value'] is not None \
and soup.find('input', {'name': 'workflowState'})['value'] is not None \
and soup.find('input', {'name': 'email'})['value'] is not None:
print("found")
else:
print("not found")
raise SystemExit(1)
But when the antibot detects me, this content will not be available, thus throwing out an error.
Any idea on how I could prevent that? Thanks!
You can set a
time.sleep(10)
for a certain amount of time before each Scrape operation. It will be harder for Amazon to catch you, but if you send too many regular requests, they may detect and block them as well.
Rotate your request headers with random user agents (update your headers list with more useragents)
Remove everything (tracking parameters) coming after /dp/ASIN/ from the product url
for example after removing tracking parameters your url will be like this : https://www.amazon.com/Storage-Stackable-Organizer-Foldable-Containers/dp/B097PVKRYM/
Add litle sleep in between requests (use time.sleep() )
Use proxy with your requests (you can use Tor proxy, if they blocks Tor go with some other paid proxy services)
I'm trying to log in to the site, but I have a problem!
Here is my code:
from requests_ntlm import HttpNtlmAuth
import requests
from main import username, password
data = {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7",
"Authorization": "NTLM TlRMTVNT.......",
"Cache-Control": "no-cache",
"Connection": "keep-alive",
"Cookie": "_ym_uid=1654686701790358885; _ym_d=1654686701; _ym_isad=2",
"Host": "...",
"Pragma": "no-cache",
"Referer": "https://...",
"sec-ch-ua": '" Not A;Brand";v="99", "Chromium";v="104", "Opera GX";v="90"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": "Windows",
"Sec-Fetch-Dest": "empty",
"Sec-Fetch-Mode": "cors",
"Sec-Fetch-Site": "same-origin",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/104.0.5112.102 Safari/537.36 OPR/90.0.4480.117"
}
auth = HttpNtlmAuth(username, password)
with requests.Session() as session:
q1 = session.get("https://...", auth=auth, headers=data)
data['Authorization'] = q1.headers.get("WWW-Authenticate")
q2 = session.get("https://...", auth=auth, headers=data)
print(q2.raise_for_status())
You need to log in inside the site. I used to use HttpBaseAuth, but after searching in the site files I saw that it does a strange thing using NTLM.
He makes a get request using my headers, receives a 401 and another "WWW-Authenticate" header in the response and resends this request, but with the changed "Authorization" header just the same to the value of the "WWW-Authenticate" header. The header "Authorization" in the very first request is always the same, the values do not change (unfortunately I can't write it here), but if you send it yourself, then the response is still 401 and via response.headers.get not view
What should I do?enter image description here
I can't log in to the site.
If you log in manually, in the browser, it makes a get request, receives the “WWW-authenticate” header in response, and makes a get request again, but with this header.
When I try to do the same thing through python, I get a 401 error.
I have been trying to figure this out for a couple hours and no answers I've found on here or elsewhere has worked. I am using requests version 2.26.0 and trying to grab the cookie to add to my headers for later use. I did notice on the page that the cookie starts with php-session, perhaps there is a different way I need to grab it besides requests. Anyways, here is the headers I am using and the code I used to try and get the cookie, all it ever outputs is <RequestsCookieJar[]>, no matter what I try.
import requests
headers = {
"Accept": "*/",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-US,en;q=0.5",
"Connection": "keep-alive",
"Content-Length": "565",
"Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
"Cookie": "ig_cb=1",
"DNT": "1",
"Host": "www.numlookup.com",
"Origin": "https://www.numlookup.com",
"Referer": "https://www.numlookup.com",
"Sec-GPC": "1",
"TE": "Trailers",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 Firefox/78.0",
"X-Requested-With": "XMLHttpRequest"
}
s = requests.Session()
s2 = s.get("https://www.numlookup.com/", headers=headers).cookies
print(s2)
I am new on python. I am sending POST Request using this line of code:
response = requests.post(url=API_ENDPOINT, headers=headers, data=payload)
The problem is that the values of header are dynamic(they are different every time on browser).
These are the headers in browser:
headers = {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-US,en;q=0.5",
"Connection": "keep-alive",
"Content-Length": "276",
"Content-Type": "application/x-www-form-urlencoded",
"Cookie": "acceptedCookie=%7B%22type%22%3A%22all%22%7D; TS01a14d32=01f893c9654ba8a49f70366efc3464fd76d4a461343cf44a7f074a5071b9818b6b196051effd669b784f691c8fab79bdc5a7efada418db04fc3cf8c3e43224fe186e64941eab43b5d9500201644abda7c0f5914ebb9ab95046ee2cb83c43f259ab0ed0e538fee3db50b2aa541ee5646d70634cea4cec54352547d3366c51e2ae5270756ee57bf78d915dcb8209c9c5771956c715bd75fb761bf42da6ba5cfa34ffbfee670e871ed33f8e25c09fdfc882953efd981f; ASLBSA=85b54f44c65f329c72b20a3ee7a9fc9a63d44001bc2c4e2c2b2f26fdaba7e0e3; ASLBSACORS=85b54f44c65f329c72b20a3ee7a9fc9a63d44001bc2c4e2c2b2f26fdaba7e0e3; utag_main=v_id:0179ddda773b0020fa6584d13ce40004e024f00d00978$_sn:2$_ss:1$_st:1623016566769$_pn:1%3Bexp-session$ses_id:1623014766769%3Bexp-session; s_cc=true; s_fid=3BE425806C624053-0396695F1870C86E; s_sq=luxmyluxottica%3D%2526pid%253DSite%25253APreLogin%25253ALogin%2526pidt%253D1%2526oid%253DLOGIN%2526oidt%253D3%2526ot%253DSUBMIT; todayVisit=true",
"Host": "mywebsite.com",
"Origin": "https://mywebsite.com",
"Referer": "https://mywebsite.com",
"TE": "Trailers",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0"
}
The value of content length, cookies, Accept parameter is different every time whenever I hit the API on browser, so I cannot just copy paste the values of headers and send it on POST request. How to generate this dynamic header(how to generate content length, cookies etc)? Please help.
I try to build a python script who sends a POST with parameters for extracting the result.
With fiddler, I have extracted the post request who return that I want. The website uses https only.
POST /Services/GetFromDataBaseVersionned HTTP/1.1
Host: www.mywbsite.fr
"Connection": "keep-alive",
"Content-Length": 129,
"Origin": "https://www.mywbsite.fr",
"X-Requested-With": "XMLHttpRequest",
"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.52 Safari/536.5",
"Content-Type": "application/json",
"Accept": "*/*",
"Referer": "https://www.mywbsite.fr/data/mult.aspx",
"Accept-Encoding": "gzip,deflate,sdch",
"Accept-Language": "fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4",
"Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.3",
"Cookie": "ASP.NET_SessionId=j1r1b2a2v2w245; GSFV=FirstVisit=; GSRef=https://www.google.fr/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CHgQFjAA&url=https://www.mywbsite.fr/&ei=FZq_T4abNcak0QWZ0vnWCg&usg=AFQjCNHq90dwj5RiEfr1Pw; HelpRotatorCookie=HelpLayerWasSeen=0; NSC_GSPOUGS!TTM=ffffffff09f4f58455e445a4a423660; GS=Site=frfr; __utma=1.219229010.1337956889.1337956889.1337958824.2; __utmb=1.1.10.1337958824; __utmc=1; __utmz=1.1337956889.1.1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided)"
{"isLeftColumn":false,"lID":-1,"userIpCountryCode":"FR","version":null,"languageCode":"fr","siteCode":"frfr","Quotation":"eu"}
And now my python script:
#!/usr/bin/env python
# -*- coding: iso-8859-1 -*-
import string
import httplib
import urllib2
host = "www.mywbsite.fr/sport/multiplex.aspx"
params='"isLeftColumn":"false","liveID":"-1","userIpCountryCode":"FR","version":"null","languageCode":"fr","siteCode":"frfr","Quotation":"eu"'
headers = { Host: www.mywbsite.fr,
"Connection": "keep-alive",
"Content-Length": 129,
"Origin": "https://www.mywbsite.fr",
"X-Requested-With": "XMLHttpRequest",
"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.52 Safari/536.5",
"Content-Type": "application/json",
"Accept": "*/*",
"Referer": "https://www.mywbsite.fr/data/mult.aspx",
"Accept-Encoding": "gzip,deflate,sdch",
"Accept-Language": "fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4",
"Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.3",
"Cookie": "ASP.NET_SessionId=j1r1b2a2v2w245; GSFV=FirstVisit=; GSRef=https://www.google.fr/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CHgQFjAA&url=https://www.mywbsite.fr/&ei=FZq_T4abNcak0QWZ0vnWCg&usg=AFQjCNHq90dwj5RiEfr1Pw; HelpRotatorCookie=HelpLayerWasSeen=0; NSC_GSPOUGS!TTM=ffffffff09f4f58455e445a4a423660; GS=Site=frfr; __utma=1.219229010.1337956889.1337956889.1337958824.2; __utmb=1.1.10.1337958824; __utmc=1; __utmz=1.1337956889.1.1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided)"
}
url = "/Services/GetFromDataBaseVersionned"
# POST the request
conn = httplib.HTTPConnection(host,port=443)
conn.request("POST",url,params,headers)
response = conn.getresponse()
data = response.read()
print data
But when I run my script, I have this error:
socket.gaierror: [Errno -2] Name or service not known
Thanks a lot for your link to the requests module. It's just perfect. Below the solution to my problem.
import requests
import json
url = 'https://www.mywbsite.fr/Services/GetFromDataBaseVersionned'
payload = {
"Host": "www.mywbsite.fr",
"Connection": "keep-alive",
"Content-Length": 129,
"Origin": "https://www.mywbsite.fr",
"X-Requested-With": "XMLHttpRequest",
"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.52 Safari/536.5",
"Content-Type": "application/json",
"Accept": "*/*",
"Referer": "https://www.mywbsite.fr/data/mult.aspx",
"Accept-Encoding": "gzip,deflate,sdch",
"Accept-Language": "fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4",
"Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.3",
"Cookie": "ASP.NET_SessionId=j1r1b2a2v2w245; GSFV=FirstVisit=; GSRef=https://www.google.fr/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CHgQFjAA&url=https://www.mywbsite.fr/&ei=FZq_T4abNcak0QWZ0vnWCg&usg=AFQjCNHq90dwj5RiEfr1Pw; HelpRotatorCookie=HelpLayerWasSeen=0; NSC_GSPOUGS!TTM=ffffffff09f4f58455e445a4a423660; GS=Site=frfr; __utma=1.219229010.1337956889.1337956889.1337958824.2; __utmb=1.1.10.1337958824; __utmc=1; __utmz=1.1337956889.1.1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided)"
}
# Adding empty header as parameters are being sent in payload
headers = {}
r = requests.post(url, data=json.dumps(payload), headers=headers)
print(r.content)
If we want to add custom HTTP headers to a POST request, we must pass them through a dictionary to the headers parameter.
Here is an example with a non-empty body and headers:
import requests
import json
url = 'https://somedomain.com'
body = {'name': 'Maryja'}
headers = {'content-type': 'application/json'}
r = requests.post(url, data=json.dumps(body), headers=headers)
Source
To make POST request instead of GET request using urllib2, you need to specify empty data, for example:
import urllib2
req = urllib2.Request("http://am.domain.com:8080/openam/json/realms/root/authenticate?authIndexType=Module&authIndexValue=LDAP")
req.add_header('X-OpenAM-Username', 'demo')
req.add_data('')
r = urllib2.urlopen(req)