I am trying to download the file from the URL:
https://www.cmegroup.com/content/dam/cmegroup/notices/clearing/2020/08/Chadv20-239.pdf
I tried using the Python requests library, but the request just timed out. I tried specifying the 'User-Agent' from my browser as a header, but it still timed out, even when I copied every single header from my browser into my Python script. I also tried setting allow_redirects=True, which did not help. I've tried wget and curl as well; everything fails apart from actually opening the browser, visiting the URL, and downloading the file.
I'm wondering what the actual difference is between the requests made by my browser and the Python requests where I set the headers to match my browser's - is there any way I can download this file using Python?
Code snippet:
import requests
requests.get("https://www.cmegroup.com/content/dam/cmegroup/notices/clearing/2020/08/Chadv20-239.pdf") # hangs
Check this; it worked for me.
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36"
}
response = requests.get(
    "https://www.cmegroup.com/content/dam/cmegroup/notices/clearing/2020/08/Chadv20-239.pdf",
    headers=headers,
)
# Write the PDF bytes to disk
with open("Chadv20-239.pdf", "wb") as pdf:
    pdf.write(response.content)
It is difficult to understand what might be going wrong without a code snippet. How is the file being downloaded? Are you getting the raw response content and saving that as a PDF? The official docs (https://docs.python-requests.org/en/latest/user/quickstart/#raw-response-content) suggest a chunk-based approach for saving streamed/raw content. Did you try that approach?
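For reference, a minimal sketch of that chunk-based approach (the chunk size, timeout, and filename here are arbitrary choices):

import requests

url = "https://www.cmegroup.com/content/dam/cmegroup/notices/clearing/2020/08/Chadv20-239.pdf"
# A browser-like User-Agent may still be needed, per the answer above
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36"}
with requests.get(url, headers=headers, stream=True, timeout=30) as response:
    response.raise_for_status()
    with open("Chadv20-239.pdf", "wb") as f:
        # iter_content yields the body in pieces instead of buffering it all in memory
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)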
Related
I'm new to Python. I have to download some images from the web and save them to my local file system. I've noticed that the response content does not contain any image data.
The problem only occurs with this specific URL; with every other image URL the code works fine.
I know the easiest solution would be to just use another URL, but I'd still like to ask if someone has had a similar problem.
import requests

url = 'https://assets.coingecko.com/coins/images/1/large/bitcoin.png'
filename = "bitcoin.png"

response = requests.get(url, stream=True)
response.raw.decode_content = True
with open(f'images/{filename}', 'wb') as outfile:
    outfile.write(response.content)
First, look at the content of the response with response.text; you'll see that the website blocked your request:
Please turn JavaScript on and reload the page.
Then, check whether changing the User-Agent of your request fixes the issue:
response = requests.get(
    url,
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36',
    },
    stream=True,
)
If it doesn't, you may need to fetch your data with something that can execute JavaScript, such as Selenium or Puppeteer.
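For example, a rough Selenium sketch (assumptions on my part: Selenium 4+ with Chrome installed, and that reusing the browser's cookies afterwards is enough to satisfy the site's JavaScript check):

from selenium import webdriver
import requests

url = 'https://assets.coingecko.com/coins/images/1/large/bitcoin.png'
driver = webdriver.Chrome()
driver.get(url)  # let the real browser pass the JavaScript check
cookies = {c['name']: c['value'] for c in driver.get_cookies()}
driver.quit()

# Reuse the browser's cookies for a plain requests download
response = requests.get(url, cookies=cookies)
with open('images/bitcoin.png', 'wb') as outfile:
    outfile.write(response.content)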
Hi, I'm trying to build a manga downloader app; for this reason I'm scraping several sites. However, I have a problem once I get the image URL.
I can see the image using my browser (Chrome), and I can also download it; however, I can't do the same using any popular scripting library.
Here is what I've tried:
String imgSrc = "https://cdn.mangaeden.com/mangasimg/aa/aa75d306397d1d11d07d66746dae78a36dc78672ae9e97a08cb7abb4.jpg";
Connection.Response resultImageResponse = Jsoup.connect(imgSrc)
        .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
        .referrer("none")
        .execute();

// output here
OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(new java.io.File(String.valueOf(imgPath))));
out.write(resultImageResponse.body()); // resultImageResponse.body() is where the image's contents are.
out.close();
I've also tried this:
URL imgUrl = new URL(imgSrc);
Files.copy(imgUrl.openStream(), imgPath);
Lastly, since I was sure the link works, I tried to download the image using Python, but in this case I also get a 403 error:
import requests

url = "https://cdn.mangaeden.com/mangasimg/d0/d08f07d762acda8a1f004677ab2414b9766a616e20bd92de4e2e44f1.jpg"
res = requests.get(url)
While googling I found Unable to get image url in Mangaeden API Angular 6, which seems really close to my problem; however, I don't understand whether I'm setting the referrer wrong or it just doesn't work at all...
Do you have any tips?
Thank you!
How to fix?
Add some headers to your request to make it look like it's coming from a browser; this will give you a 200 response and you can save the file.
Note: this also works in Postman; just overwrite the hidden User-Agent and you will get the image as the response.
Example (Python)
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}
url = "https://cdn.mangaeden.com/mangasimg/d0/d08f07d762acda8a1f004677ab2414b9766a616e20bd92de4e2e44f1.jpg"
res = requests.get(url, headers=headers)
with open("image.jpg", 'wb') as f:
    f.write(res.content)
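If a User-Agent alone is not enough on some hosts, a Referer header can also matter, as the question itself hints. The referrer value below is a plausible guess on my part, not something confirmed in this thread:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
    'Referer': 'https://www.mangaeden.com/',  # assumption: the CDN may expect its own site as the referrer
}
url = "https://cdn.mangaeden.com/mangasimg/d0/d08f07d762acda8a1f004677ab2414b9766a616e20bd92de4e2e44f1.jpg"
res = requests.get(url, headers=headers)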
Someone wrote this answer, but later deleted it, so I will copy the answer in case it can be useful.
AFAIK, you can't download anything other than HTML documents using Jsoup.
If you open up Developer Tools in your browser, you can get the exact request the browser has made. With Chrome, it's something like this.
The minimal cURL request would in your case be:
'https://cdn.mangaeden.com/mangasimg/aa/aa75d306397d1d11d07d66746dae78a36dc78672ae9e97a08cb7abb4.jpg'
\ -H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21
(KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21' \ --output
image.jpg;
You can refer to HedgeHog's answer for a sample Python solution; here's how to achieve the same in Java using the new HTTP Client:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse.BodyHandlers;
import java.nio.file.Paths;

public class ImageDownload {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://cdn.mangaeden.com/mangasimg/aa/aa75d306397d1d11d07d66746dae78a36dc78672ae9e97a08cb7abb4.jpg"))
                .header("user-agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
                .build();
        client.send(request, BodyHandlers.ofFile(Paths.get("image.jpg")));
    }
}
I adopted this solution in my Java code.
Also, one last bit: if the image is downloaded but you can't open it, that is probably due to a 503 error code on the request; in that case you just have to perform the request again. You can recognize broken images because the image reader will say something like
Not a JPEG file: starts with 0x3c 0x68
which is "<h", meaning an HTML error page was saved instead of the image.
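Putting those two hints together, a rough retry sketch in Python (the magic-byte check and retry count are my own assumptions, not part of the original answer):

import requests

def download_image(url, path, headers, max_retries=3):
    for attempt in range(max_retries):
        res = requests.get(url, headers=headers)
        # A real JPEG starts with 0xff 0xd8; an HTML error page starts with '<' (0x3c)
        if res.status_code == 200 and res.content.startswith(b'\xff\xd8'):
            with open(path, 'wb') as f:
                f.write(res.content)
            return True
    return False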
I've tried searching for this - can't seem to find the answer!
I'm trying to do a really simple scrape of an entire webpage so that I can look for key words. I'm using the following code:
import requests
Website = requests.get('http://www.somfy.com', {'User-Agent':'a'}, headers = {'Accept': '*/*'})
print (Website.text)
print (Website.status_code)
When I visit this website in a browser (e.g. Chrome or Firefox) it works. When I run the Python code I just get the result "Gone" (status code 410).
I'd like to be able to reliably put in a range of website urls, and pull back the raw html to be able to look for key-words.
Questions
1. What have I done wrong, and how should I set this up to have the best chance of success in the future?
2. Could you point me to any guidance on how to go about working out what is wrong?
Many thanks - and sorry for the beginner questions!
You have an invalid User-Agent, and it was never actually sent as a header: the second positional argument to requests.get is params, so {'User-Agent': 'a'} ended up in the URL's query string instead of in your headers.
I have fixed your code for you - it returns a 200 status code.
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3835.0 Safari/537.36',
    'Accept': '*/*',
}
Website = requests.get('http://www.somfy.com', headers=headers)
print(Website.text)
print(Website.status_code)
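To see for yourself where the dict went, here's a quick sketch against httpbin.org (chosen only because it echoes requests back as JSON):

import requests

# Passed positionally, the dict is bound to the params argument, i.e. the query string
r = requests.get('https://httpbin.org/get', {'User-Agent': 'a'})
print(r.json()['url'])      # ends in ?User-Agent=a
print(r.json()['headers'])  # no custom User-Agent, just the python-requests default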
I am trying to download a ZIP file from this website. I have looked at other questions like this one, and tried using the requests and urllib libraries, but I get the same error:
urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop. The last 30x error message was: Found
Any ideas on how to open the file straight from the web?
Here is some sample code
from urllib.request import urlopen
response = urlopen('http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_megase.zip')
The linked URL will redirect indefinitely; that's why you get the 302 error.
You can examine this yourself over here. As you can see, the linked URL immediately redirects to itself, creating a single-URL loop.
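You can also observe the loop from Python with a sketch like this (assuming the server still behaves the way it did when this was written):

import requests

url = 'http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_megase.zip'
r = requests.get(url, allow_redirects=False)
# A 3xx status whose Location header points back at the same URL confirms the loop
print(r.status_code, r.headers.get('Location'))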
Works for me using the Requests library
import requests

url = 'http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_megase.zip'
response = requests.get(url)

# Unzip it into a local directory if you want
import zipfile, io
zf = zipfile.ZipFile(io.BytesIO(response.content))
zf.extractall("/path/to/your/directory")
Note that sometimes trying to access web pages programmatically leads to 302 responses because the site only wants the page accessed via a web browser.
If you need to fake this (don't be abusive), just set the 'User-Agent' header to look like a browser. Here's an example of making a request look like it's coming from a Chrome browser:
user_agent = 'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36'
headers = {'User-Agent': user_agent}
requests.get(url, headers=headers)
There are several libraries (e.g. https://pypi.org/project/fake-useragent/) to help with this for more extensive scraping projects.
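For instance, a short fake-useragent sketch (assuming the package is installed; ua.chrome is part of its documented API):

import requests
from fake_useragent import UserAgent

ua = UserAgent()
# ua.chrome returns a realistic Chrome User-Agent string
headers = {'User-Agent': ua.chrome}
response = requests.get('http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_megase.zip', headers=headers)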
I'd just written a Python crawler to download midi files from freemidi.org. Looking at the request headers in Chrome, I found that the "Referer" attribute had to be https://freemidi.org/download-20225 (referred to as "download-20225" later) if the download page was https://freemidi.org/getter-20225 (referred to as "getter-20225" later) in order to download the midi file properly. I did so in Python, setting the header like this:
headers = {
    'Referer': 'https://freemidi.org/download-20225',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
which was exactly the same as the request header I had viewed in Chrome, and I tried to download the file using this line of code.
midi = requests.get(url, headers=headers).content
However, it did not work properly. Instead of downloading the midi file, it downloaded an HTML file for the site "download-20225". I later found that if I tried to access the site "getter-20225" directly, it took me to "download-20225" as well. I think this probably indicates that the header was wrong, so it took me to the other website instead of starting the download.
I'm quite new to writing Python crawlers, so could someone help me find what went wrong with the program?
It looks like the problem here is that the page with the midi file (e.g. "getter-20225") wants to redirect you back to the song page (e.g. "download-20225") after downloading the song. However, requests is only returning the content from the final page in the redirect.
You can set the allow_redirects parameter to False to have requests return the content from the "getter" page (i.e. the midi file):
midi = requests.get(url, headers=headers, allow_redirects=False)
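If you want to confirm which redirect was happening, one option is to make a normal redirect-following request and inspect its history (a sketch reusing the URLs from the question):

import requests

url = 'https://freemidi.org/getter-20225'
headers = {'Referer': 'https://freemidi.org/download-20225'}
r = requests.get(url, headers=headers)  # redirects followed by default
for hop in r.history:
    # each hop is an intermediate response, e.g. the 302 from the getter page
    print(hop.status_code, hop.url)
print(r.status_code, r.url)  # the final URL requests landed on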
Note that if you want to write the midi file to disk, you will need to open your target file in binary mode (since the midi file is written in bytes).
with open('example.mid', 'wb') as ex:
    ex.write(midi.content)