I am trying to download a ZIP file from this website. I have looked at other questions like this and tried both requests and urllib, but I get the same error:
urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop. The last 30x error message was: Found
Any ideas on how to open the file straight from the web?
Here is some sample code
from urllib.request import urlopen
response = urlopen('http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_megase.zip')
The linked URL redirects indefinitely, which is why you get the 302 error.
If you inspect the response yourself, you can see that the URL immediately redirects to itself, creating a single-URL loop.
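One way to check for such a loop from Python is to switch off automatic redirect handling and follow each Location header by hand. The sketch below is only an illustration: trace_redirects and its max_hops parameter are hypothetical names, and session is assumed to be any object with a requests-style get method, such as requests.Session().

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


def trace_redirects(url, session, max_hops=10):
    """Follow redirects by hand and report whether they loop.

    `session` is any object with a requests-style
    get(url, allow_redirects=False, timeout=...) method.
    Returns (visited_urls, looped).
    """
    visited = []
    while len(visited) < max_hops:
        visited.append(url)
        response = session.get(url, allow_redirects=False, timeout=10)
        location = response.headers.get("Location")
        if response.status_code not in REDIRECT_CODES or not location:
            return visited, False  # reached a final, non-redirect response
        url = urljoin(url, location)  # the Location header may be relative
        if url in visited:
            return visited, True  # sent back to a URL that was already tried
    return visited, False  # gave up after max_hops without seeing a repeat
```

With the real library this would be called as trace_redirects(url, requests.Session()). A result of ([url], True) would match the single-URL loop described above.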
Works for me using the Requests library
import requests
url = 'http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_megase.zip'
response = requests.get(url)
# Unzip it into a local directory if you want
import zipfile, io
# Use a name other than "zip" so the built-in zip() function is not shadowed
archive = zipfile.ZipFile(io.BytesIO(response.content))
archive.extractall("/path/to/your/directory")
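Since the question asks about opening the file straight from the web, it may help to note that the archive never has to touch the disk. A minimal sketch, assuming the downloaded bytes are already in memory (read_zip_members is a hypothetical helper name):

```python
import io
import zipfile


def read_zip_members(zip_bytes):
    """Return {member name: file contents} for a ZIP archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {
            info.filename: archive.read(info)
            for info in archive.infolist()
            if not info.is_dir()  # skip directory entries, which have no contents
        }
```

Called as read_zip_members(response.content), this would give the contents of each file in the archive as bytes, keyed by file name.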
Note that sometimes trying to access web pages programmatically leads to 302 responses because they only want you to access the page via a web browser.
If you need to fake this (don't be abusive), just set the 'User-Agent' header to be like a browser. Here's an example of making a request look like it's coming from a Chrome browser.
user_agent = 'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36'
headers = {'User-Agent': user_agent}
requests.get(url, headers=headers)
There are several libraries (e.g. https://pypi.org/project/fake-useragent/) to help with this for more extensive scraping projects.
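The same header can also be sent with urllib, which the question used originally. This is a sketch only: build_request and download are hypothetical helper names, the User-Agent string is the example one from above, and whether it gets past the redirect on this particular site is untested.

```python
from urllib.request import Request, urlopen

# Example browser-style string taken from the answer above; any current one should do.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36"
)


def build_request(url, user_agent=USER_AGENT):
    """Create a urllib Request that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def download(url, timeout=30):
    """Fetch the URL and return the response body as bytes (needs network access)."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

The bytes returned by download(url) could then be passed to zipfile.ZipFile(io.BytesIO(...)) exactly as in the requests version.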
I am new to scraping and am trying to scrape some information off a website with Python, but when I check the HTTP response status (i.e. 200) I get nothing back on the terminal. My code is below. I appreciate any help! Edit: I have fixed my rookie mistake in the print statement below. Thank you for the correction!
import requests
url = "https://www.sephora.ae/en/shop/makeup-c302/"
page = requests.get(url)
print(page.status_code)
The problem is that the page you are trying to scrape protects against scraping by ignoring requests from unusual user agents.
Set the user agent to some well-known string like below
import requests
url = "https://www.sephora.ae/en/shop/makeup-c302/"
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36'
}
response = requests.get(url, headers=headers)
print(response.status_code)
For one thing, you don't print to the console in Python with the syntax Print = (page). That code assigns page to a new variable called Print and never calls the built-in print function, so nothing is output. Python is case-sensitive, so Print and print are different names. To output to the console, change your code to:
print(page)
Second, printing page is just printing the response object you are receiving after making your GET request, which is not very helpful. The response object has a number of properties you can access, which you can read about in the documentation for the requests Python library.
To get the status code of your response, try:
print(page.status_code)
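If the bare number is hard to interpret, the standard library can translate it into its official reason phrase. A small sketch (describe_status is a hypothetical helper, not part of requests):

```python
from http import HTTPStatus


def describe_status(code):
    """Turn a numeric HTTP status code into a short readable label."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # Servers occasionally send codes that are not in the standard list
        return f"{code} (non-standard status code)"
    return f"{code} {status.phrase}"
```

For example, print(describe_status(page.status_code)) would show "200 OK" for a successful request and "403 Forbidden" when the server refuses it.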
I am trying to download the file from the URL:
https://www.cmegroup.com/content/dam/cmegroup/notices/clearing/2020/08/Chadv20-239.pdf
I tried using the Python requests library, but the request just timed out. I tried sending my browser's 'User-Agent' as a header, and then copying every single header from my browser into my Python script, but it still timed out. Setting allow_redirects=True did not help either. I've also tried wget and curl; everything fails except actually opening the browser, visiting the URL and downloading the file.
I'm wondering what the actual difference is between the requests in my browser and the python requests where I set the headers to match those in my browser - is there any way I can download this file using python?
Code snippet:
import requests
requests.get("https://www.cmegroup.com/content/dam/cmegroup/notices/clearing/2020/08/Chadv20-239.pdf") # hangs
Try this; it worked for me.
import requests
headers = {
"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36'}
response = requests.get(
"https://www.cmegroup.com/content/dam/cmegroup/notices/clearing/2020/08/Chadv20-239.pdf", headers=headers)
# Write the downloaded bytes to disk; the with block closes the file automatically
with open("Chadv20-239.pdf", "wb") as pdf:
    pdf.write(response.content)
It is difficult to tell what might be going wrong without a code snippet. How is the file being downloaded? Are you taking the raw response content and saving it as a PDF? The official docs (https://docs.python-requests.org/en/latest/user/quickstart/#raw-response-content) suggest a chunk-based approach for saving streamed/raw content. Did you try that approach?
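The chunk-based approach mentioned above could look roughly like the following. This is a sketch rather than a confirmed fix for this site: save_stream is a hypothetical helper, and response is assumed to be a requests response opened with stream=True.

```python
def save_stream(response, path, chunk_size=8192):
    """Write a streamed requests-style response to `path` chunk by chunk.

    Returns the number of bytes written.
    """
    written = 0
    with open(path, "wb") as target:
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:  # skip empty chunks
                target.write(chunk)
                written += len(chunk)
    return written


# Intended use with requests (not run here; needs network access):
#   with requests.get(url, headers=headers, stream=True, timeout=30) as response:
#       response.raise_for_status()
#       save_stream(response, "Chadv20-239.pdf")
```

Passing timeout is worth considering as well, since requests does not time out on its own by default, which would be consistent with a call that simply hangs.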
Python error when using request get
Hello, I have this in my code:
import requests
from bs4 import BeautifulSoup
url = 'https://www3.animeflv.net/anime/sailor-moon'
r = requests.get(url)
print(r)
And I'm getting this:
<Response [403]>
What could be the solution?
By the way, the title is odd because Stack Overflow would not let me phrase it the way I wanted.
For your specific case you can overcome that by faking your User-Agent in request headers.
import requests
url = 'https://www3.animeflv.net/anime/sailor-moon'
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'}
res = requests.get(url, headers=headers)
print(res.status_code)
200
Some websites try to block requests made with the Python requests library. By default, a request sent from a Python script has a User-Agent like python-requests/2.x, but if you override it in the request headers you can often get past that check. Take a look at this library https://pypi.org/project/fake-useragent/ for generating fake User-Agent strings.
I am currently trying to build a webscraping program to pull data from a real estate website using Beautiful Soup. I haven't gotten very far but the code is as follows:
import requests
from bs4 import BeautifulSoup
r=requests.get("http://pyclass.com/real-estate/rock-springs-wy/LCWYROCKSPRINGS/")
c=r.content
soup=BeautifulSoup(c,"html.parser")
print(soup)
When I try to print the data to at least see if the program is working, I get an error message saying "Not Acceptable! An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security." How do I get the server to stop blocking my IP address? I've read about similar issues with other programs and tried clearing the cookies, trying different browsers, etc., and nothing has fixed it.
This is happening because the webpage thinks that you're a bot (and is correct), so your request gets blocked.
To "bypass" this issue, try adding a User-Agent to the headers parameter of the requests.get() method.
import requests
from bs4 import BeautifulSoup
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
}
url = "http://pyclass.com/real-estate/rock-springs-wy/LCWYROCKSPRINGS/"
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
print(soup.prettify())
The following Python script gives me a 403 error. The request type is 'GET'.
import requests
import json
url ='https://footballapi.pulselive.com/football/players?pageSize=30&compSeasons=210&altIds=true&page=2&type=player&id=-1&compSeasonId=210'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36'}
result = requests.get(url, headers=headers)
print(result.status_code)
[Screenshot: Check XHR Request]
Your code looks fine. I ran it and got the same 403 response. But if you open the url you posted, you'll notice a 403 error there as well. This looks like an issue with the website itself or maybe you are using an incorrect url.
This might be a late answer, but what you're missing is the correct header to access the Pulselive API. The necessary header is 'Origin':'https://www.premierleague.com'.
This makes the API think that the request is coming from the official Premier League website, which is allowed to access the API.
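Put together, the request could be sketched as follows. The helper names api_headers and fetch_status are hypothetical, the User-Agent string is the one from the question, and session is assumed to be a requests.Session() or anything with a compatible get method. Whether the API still accepts this header is not something this sketch can confirm.

```python
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36"
)


def api_headers(origin="https://www.premierleague.com", user_agent=BROWSER_UA):
    """Build the request headers, including the Origin header described above."""
    return {"User-Agent": user_agent, "Origin": origin}


def fetch_status(session, url):
    """Send a GET request with those headers and return the HTTP status code."""
    response = session.get(url, headers=api_headers(), timeout=30)
    return response.status_code


# Intended use with requests (not run here; needs network access):
#   print(fetch_status(requests.Session(), url))
```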
Hope this helps!