I am trying to download the HTML of the following page:
https://www.avto.net/Ads/results.asp?znamka=Audi&model=&modelID=&tip=katerikoli%20tip&znamka2=&model2=&tip2=katerikoli%20tip&znamka3=&model3=&tip3=katerikoli%20tip&cenaMin=0&cenaMax=999999&letnikMin=0&letnikMax=2090&bencin=0&starost2=999&oblika=0&ccmMin=0&ccmMax=99999&mocMin=&mocMax=&kmMin=0&kmMax=9999999&kwMin=0&kwMax=999&motortakt=&motorvalji=&lokacija=0&sirina=&dolzina=&dolzinaMIN=&dolzinaMAX=&nosilnostMIN=&nosilnostMAX=&lezisc=&presek=&premer=&col=&vijakov=&EToznaka=&vozilo=&airbag=&barva=&barvaint=&EQ1=1000000000&EQ2=1000000000&EQ3=1000000000&EQ4=100000000&EQ5=1000000000&EQ6=1000000000&EQ7=1000000120&EQ8=1010000001&EQ9=1000000000&KAT=1010000000&PIA=&PIAzero=&PSLO=&akcija=&paketgarancije=&broker=&prikazkategorije=&kategorija=&ONLvid=&ONLnak=&zaloga=&arhiv=&presort=&tipsort=&stran=1
If I view the source in Google Chrome, I get the HTML without any problem. However, I want to download multiple pages with Python requests, and when I try to fetch the HTML that way, I get an error.
Using:
import requests
response = requests.get(url)  # url is the address shown above
content = response.text
with open('filename', 'w') as dat:
    dat.write(content)
I get the following error:
requests.exceptions.TooManyRedirects: Exceeded 30 redirects.
I also tried using "allow_redirects=False"; with that, however, I get a faulty HTML page that contains only the following text:
Object Moved
This document may be found here.
I am wondering what I need to do to download this HTML using requests in Python.
If I add the header:
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'
the code does run, but it still does not give me the HTML I'm looking for. The HTML it produces is just a single line, something like this:
<html><head><title>avto.net</title><style>#cmsg{animation: A 1.5s;}#keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}</style></head><body style="margin:0"><p id="cmsg">Please enable JS and disable any ad blocker</p><script>var ...
Try defining headers for your requests.get() call, e.g.:
import requests
from bs4 import BeautifulSoup
# Browser-like request headers
headers = {
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
}
url = '<url-here>'  # placeholder: put the page URL here
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
This fixed it for me.
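One way to check whether the headers actually helped is to test the response body for the JavaScript challenge text quoted in the question. The following is a minimal sketch, assuming that marker string stays the same; `looks_like_js_challenge` is a hypothetical helper, not part of requests, and the sample bodies are made up:

```python
# Marker taken from the challenge page quoted in the question; the site may change it.
JS_CHALLENGE_MARKER = "Please enable JS and disable any ad blocker"


def looks_like_js_challenge(html: str) -> bool:
    """Return True if the body looks like the anti-bot placeholder page."""
    return JS_CHALLENGE_MARKER in html


# Made-up sample bodies standing in for page.text
challenge = '<html><body><p id="cmsg">Please enable JS and disable any ad blocker</p></body></html>'
listing = "<html><body><div class='results'>Audi A4</div></body></html>"

print(looks_like_js_challenge(challenge))  # True
print(looks_like_js_challenge(listing))    # False
```

In a real script you would pass `page.text` to the helper after `requests.get(url, headers=headers)`; if it returns True, the headers alone were probably not enough for that site.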
Related
I'm trying to log in to a website through a python script that I've created using the requests module. I've issued a post HTTP request with appropriate parameters and headers to the server, but for some reason I get a different response from that site compared to what I see in dev tools. The status is always 200, though. There is also a get request in place within the script that should fetch the credentials once the login is successful. Currently, it throws a JSONDecodeError on the last line.
import requests
link = 'https://propwire.com/login'
check_url = 'https://propwire.com/search'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
    'x-requested-with': 'XMLHttpRequest',
    'referer': 'https://propwire.com/login',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9,bn;q=0.8',
    'origin': 'https://propwire.com',
}
payload = {"email":"some-email","password":"password","remember":"true"}
with requests.Session() as s:
    r = s.get(link)
    headers['x-xsrf-token'] = r.cookies['XSRF-TOKEN'].rstrip('%3D')
    s.headers.update(headers)
    s.post(link,json=payload)
    res = s.get(check_url)
    print(res.json()['props']['auth'])
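A side note on the token handling in the script above: `str.rstrip('%3D')` removes any trailing `%`, `3`, or `D` characters rather than the literal suffix `%3D`, so it can eat real token characters. If the cookie value is URL-encoded (an assumption; the site's behaviour is not confirmed here), decoding it with `urllib.parse.unquote` is the safer choice. Whether this alone fixes the login is not known. The token value below is made up for illustration:

```python
from urllib.parse import unquote

# Made-up cookie value standing in for r.cookies['XSRF-TOKEN'].
raw_token = "eyJpdiI6IkFCQ0D3%3D"

# rstrip treats its argument as a set of characters, not as a suffix,
# so the genuine trailing "D3" of the token is removed as well:
stripped = raw_token.rstrip('%3D')

# unquote decodes the percent-escape instead of discarding characters:
decoded = unquote(raw_token)

print(stripped)  # eyJpdiI6IkFCQ0
print(decoded)   # eyJpdiI6IkFCQ0D3=
```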
I'm a complete newbie to python and trying to get the content of a webpage with a get request. The page I'm trying to access is public without any authorization as far as I can see. It's a job listing from the career website of a popular company and everyone can view the page.
My code looks like this:
import requests
from bs4 import BeautifulSoup
url = 'https://www.tuvsud.com/de-de/karriere/stellen/jobs/projektmanagerin-auditservice-food-corporate-functions-business-support-all-regions-133776'
headers = {
    'Host': '',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
print(r.status_code)
However, I get status code 403. With the Google URL, for example, it works.
I would be happy about any help! Thanks in advance.
I am scraping a number of websites for data. Many websites I have no problem scraping at all, but a couple return encrypted data. I have created a basic demo below of what is going on. Is there a way to decrypt the returned results?
import requests
headers_Get = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1'
}
q = 'www.nike.com'
s = requests.Session()
url = 'http://' + q
r = s.get(url, headers=headers_Get)
r.text
The above code returns the expected html from Nike.Com.
However, if we run the same code and replace q = 'www.nike.com', with q = 'www.vanityfair.com' we receive code that looks like the following:
\x1bX�U?�(J�\x1a��|=;�:���N�\x01��J�.��$�D[����1�\x11[T2/����rq}�\x00ʁ�\x06(��J,�ܳR�\'Gs�я�l�\n���)�Qf��\x11�\x15�\x80��\r\x1d�o �<�o�??>}�������\x07��\n�\x1dE\ti�\x19\x01D�)�z\x06\x00p�\x18�e\n(�s&��\x1c��ga$e\n�PGd\x07琚\x17I�8�ީ�A�\x1f�c^�C�zh�Ǵ�t��#�X��wbl\x18�|}[��o���g\x02;����8+��:6\x039���-\x19\x1b��Q���\t\x1aJJ\x1b�\x11��\rq\x0c\x11��p�Q\x10\x18����\x14͋��\x0bus��e3X�w�狔�\x1d��6�nwen�\x02\x08�J�O�߯ףQ�T\x0c�P����0���]]��bI��5��Em/n��������ze�n.Wx��(\x05���+}���^�.qa����E�V�e���}w}�\x16�U]/�]-�d͋$ਡ�aėup��m���o\x06'
I'm guessing this is the site upgrading the insecure request, but how can I decrypt these results to get the expected HTML, as with Nike?
Note: I get the same results with post and get.
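Make the request without the Accept-Encoding header. The response is most likely not encrypted but compressed: by sending 'br' you tell the server that Brotli is acceptable, and requests cannot decode Brotli unless a Brotli package is installed. Without the explicit header, requests falls back to its own default and only advertises encodings it can decode.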
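To illustrate the point, here is a small sketch using only the standard library. `without_brotli` is a hypothetical helper (not part of requests) that drops `br` from an Accept-Encoding value, and the gzip round trip shows why compressed bytes can look "encrypted" until they are decompressed. The header values are shortened examples:

```python
import gzip


def without_brotli(headers: dict) -> dict:
    """Return a copy of headers whose Accept-Encoding no longer lists 'br'."""
    cleaned = dict(headers)
    for name in list(cleaned):
        if name.lower() == "accept-encoding":
            encodings = [e.strip() for e in cleaned[name].split(",")]
            # Compare only the coding name, ignoring any ";q=..." weight.
            kept = [e for e in encodings
                    if e and e.split(";")[0].strip().lower() != "br"]
            if kept:
                cleaned[name] = ", ".join(kept)
            else:
                del cleaned[name]
    return cleaned


headers_get = {
    "User-Agent": "Mozilla/5.0",
    "Accept-Encoding": "gzip, deflate, br",
}
safe_headers = without_brotli(headers_get)
print(safe_headers["Accept-Encoding"])  # gzip, deflate

# Compressed bytes are unreadable until they are decompressed.
html = b"<html><body>Hello</body></html>"
compressed = gzip.compress(html)
print(compressed[:10])
print(gzip.decompress(compressed) == html)  # True
```

With requests you would then call `s.get(url, headers=safe_headers)`; simply deleting the header, as suggested above, has the same effect.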
I would like to monitor a particular URL and wait until it internally redirects me by using python requests. The website will randomly redirect me after a period of time. However, I am having some issues right now. The strategy I have employed so far is something like this:
import time
import requests
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
session = requests.Session()
success = False
while success is False:
    r = session.get(url, headers=headers, allow_redirects=True)
    if keyword in r.text:
        success = True
    time.sleep(30)
print("Success.")
It seems as though the timer is reset every time I make a GET request, so I am never redirected. I thought a session would fix this, but perhaps not. Then again, how am I meant to check for changes to the page without sending a new request every 30 seconds? Looking at the network tab in Chrome, the status code appears to be 307.
If anyone knows how to resolve this issue, it would be very helpful. Thanks.
Selenium is the quick and ugly answer:
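import time
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
# Override the user agent through Firefox options (works with current Selenium releases)
options = Options()
options.set_preference("general.useragent.override", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36")
browser = webdriver.Firefox(options=options)
browser.get(url)  # url and keyword are defined as in the question
success = False
while success is False:
    # page_source reads the live page from the browser, so no new request is sent here
    text = browser.page_source
    if keyword in text:
        success = True
    else:
        time.sleep(30)
print("Success.")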
As far as using requests goes, I'd hazard a guess that your web browser itself is requesting the reload. Does that request in the network tab differ in any way from the initial request? browsermob-proxy is a great tool for diving deep into these sorts of issues; it's effectively the network tab on steroids.
Apologies for the vagueness of the last half, but it's difficult to say more without having seen the website.
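The polling loop can also be given a timeout so it does not spin forever if the redirect never happens. The following is a minimal sketch, not tied to Selenium: `wait_for_keyword` is a hypothetical helper, and `fetch` is any zero-argument callable that returns the current page text (for the code above that would be `lambda: browser.page_source`). The `sleep` parameter exists only so the example can run without really waiting:

```python
import time
from typing import Callable


def wait_for_keyword(fetch: Callable[[], str], keyword: str,
                     interval: float = 30.0, timeout: float = 600.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll fetch() until keyword appears or timeout seconds have been waited."""
    waited = 0.0
    while True:
        if keyword in fetch():
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval


# Stand-in for browser.page_source: the keyword shows up on the third read.
pages = iter(["loading", "loading", "redirected: welcome"])
found = wait_for_keyword(lambda: next(pages), "welcome",
                         interval=0.0, sleep=lambda seconds: None)
print(found)  # True
```

The default interval of 30 seconds mirrors the loop above; the 600-second timeout is an arbitrary choice for illustration.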
I have written some scraping code that uses requests.get(url, headers=headers), with headers exactly the same as those my Chrome browser sends, except for the cookie. Initially it works fine, but later it gets a 403 error. My Chrome browser still loads the data without any error, but my Python requests code does not. I don't know what the problem is.
url = 'http://www.matchesfashion.com/en-kr/products/1171735'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Whale/0.10.36.11 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'ko-KR,ko;q=0.8,en-US;q=0.6,en;q=0.4',
    'Host': 'www.matchesfashion.com',
    'Upgrade-Insecure-Requests': '1',
    'Cache-Control': 'max-age=0',
    'Accept-Encoding': 'gzip, deflate',
}
r = requests.get(url, headers=headers)