I'm trying to scrape this image using urllib.urlretrieve.
>>> import urllib
>>> urllib.urlretrieve('http://i9.mangareader.net/one-piece/3/one-piece-1668214.jpg',
...                    path)  # path was previously defined
This code successfully saves the file in the given path. However, when I try to open the file, I get:
Could not load image 'imagename.jpg':
Error interpreting JPEG image file (Not a JPEG file: starts with 0x3c 0x21)
When I do file imagename.jpg in my bash terminal, I get imagename.jpg: HTML document, ASCII text.
So how do I scrape this image as a JPEG file?
It's because the owner of the server hosting that image is deliberately blocking access from Python's urllib; that's why it works with requests, which sends a different User-Agent. You can also do it with just the standard library, but you'll have to give it an HTTP User-Agent header that makes it look like something other than urllib. For example:
import urllib2

req = urllib2.Request('http://i9.mangareader.net/one-piece/3/one-piece-1668214.jpg')
req.add_header('User-Agent', 'Feneric Was Here')  # anything other than the urllib default
resp = urllib2.urlopen(req)
imgdata = resp.read()
with open(path, 'wb') as outfile:
    outfile.write(imgdata)
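If you're on Python 3, the same header trick works with urllib.request; a minimal sketch, assuming path is defined as in the question:

import urllib.request

# Any User-Agent value other than the urllib default will do
req = urllib.request.Request(
    'http://i9.mangareader.net/one-piece/3/one-piece-1668214.jpg',
    headers={'User-Agent': 'Feneric Was Here'},
)
with urllib.request.urlopen(req) as resp, open(path, 'wb') as outfile:
    outfile.write(resp.read())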
So it's a little more involved to get around, but still not too bad.
Note that the site owner probably did this because some people had gotten abusive. Please don't be one of them! With great power comes great responsibility, and all that.
Certain CDNs, like googleusercontent, don't (obviously) encode the filenames of images in their URLs, so you can't get the file type through simple string manipulation like other answers here suggest. Knowing this, how can I tell that
https://lh3.googleusercontent.com/pw/AM-JKLURvu-Ro2N3c1vm1PTM3a7Ae5nG3LNWynuKNEeFNBMwH_uWLQJe0q0HmaOzKC0k0gRba10SbonLaheGcNpxROnCenf1YJnzDC3jL-N9fTtZ7u0q5Z-3iURXtrt4GlyeEI3t4KWxprFDqFWRO29sJc8=w440-h248-no
is a gif whilst
https://lh3.googleusercontent.com/pw/AM-JKLXk2WxafqHOi0ZrETUh2vUNkiLyYW1jRmAQsHBmYyVP7Le-KBCSVASCgO2C6_3QbW3LcLYOV_8OefPafyz2i4g8nqpw8xZnIhzDdemd5dFPS5A7dVAGQWx9DIy5aYOGuh06hTrmfhF9mZmITjjTwuc=w1200-h600-no
is a jpg
Building on the responses to this question, you could try:
import requests
from PIL import Image  # pillow package
from io import BytesIO

url = "your link"
image = Image.open(BytesIO(requests.get(url).content))
file_type = image.format  # e.g. 'GIF' or 'JPEG'
This calls for downloading the entire file, though. If you're looking to do this in bulk, you might want to explore the option in the comment above that mentions "magic bytes"...
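For reference, a rough sketch of that magic-bytes idea (my own illustration, assuming the server honors Range requests so you only fetch the first few bytes):

import requests

resp = requests.get(url, headers={'Range': 'bytes=0-11'})  # first 12 bytes only
head = resp.content
if head.startswith((b'GIF87a', b'GIF89a')):  # GIF signature
    file_type = 'gif'
elif head.startswith(b'\xff\xd8\xff'):  # JPEG signature
    file_type = 'jpeg'
elif head.startswith(b'\x89PNG\r\n\x1a\n'):  # PNG signature
    file_type = 'png'
else:
    file_type = 'unknown'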
Edit:
You can also try to get the image type from the headers of the response to your url:
headers = requests.get(url).headers
file_type = headers.get('Content-Type', "nope/nope").split("/")[1]
# Will print 'nope' if 'Content-Type' header isn't found
print(file_type)
# Will print 'gif' or 'jpeg' for your listed urls
Edit 2:
If you're really only concerned with the file type of the link and not the file itself, you could use the head method instead of the get method of the requests module. It's faster because it fetches only the headers, not the body:
headers = requests.head(url).headers
file_type = headers.get('Content-Type', "nope/nope").split("/")[1]
I have a Python script that crawls various websites and downloads files from them. My problem is that some of the websites seem to be using PHP; at least that's my theory, since the URLs look like this: https://www.portablefreeware.com/download.php?dd=1159
The problem is that I can't get any file names or extensions from a link like this, and therefore can't save the file. Currently I'm only saving the URLs.
Is there any way to get to the actual file name behind the link?
This is my stripped down download code:
import requests

r = requests.get(url, allow_redirects=True)
file = open("name.something", 'wb')
file.write(r.content)
file.close()
Disclaimer: I've never done any work with PHP, so please forgive any incorrect terminology or understanding I have of it. I'm happy to learn more, though.
import requests
import mimetypes

response = requests.get('https://www.portablefreeware.com/download.php?dd=1159')
content = response.content
content_type = response.headers['Content-Type']
ext = mimetypes.guess_extension(content_type)  # already includes the leading dot

print(content)  # [ZipBinary]
print(ext)  # .zip
print(content_type)  # application/zip, application/octet-stream

with open("newFile" + ext, 'wb') as f:
    f.write(content)
With your use of the allow_redirects=True option, requests.get would automatically follow the URL in the Location header of the response to make another request, losing the headers of the first response as a result, which is why you can't find the file name information anywhere.
You should instead use the allow_redirects=False option so that you can read the Location header, which contains the actual download URL:
import requests
url = 'https://www.portablefreeware.com/download.php?dd=1159'
r = requests.get(url, allow_redirects=False)
print(r.headers['Location'])
This outputs:
https://www.diskinternals.com/download/Linux_Reader.exe
Demo: https://replit.com/#blhsing/TrivialLightheartedLists
You can then make another request to the download URL, and use os.path.basename to obtain the name of the file to which the content will be written:
import os

url = r.headers['Location']
with open(os.path.basename(url), 'wb') as file:  # binary mode, since r.content is bytes
    r = requests.get(url)
    file.write(r.content)
You're using requests for downloading. This doesn't work with downloads of this kind.
Try urllib instead:
import urllib.request
urllib.request.urlretrieve(url, filepath)
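As a usage note, urlretrieve also returns the saved path and the response headers, so you can sanity-check what you actually downloaded:

import urllib.request

# urlretrieve returns a (local_path, headers) tuple
local_path, headers = urllib.request.urlretrieve(url, filepath)
print(headers.get('Content-Type'))  # e.g. application/zip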
You can download the file with the file name taken from the response header.
Here's my code for a download with a progress bar and a chunk size buffer:
To display a progress bar, use tqdm (pip install tqdm). Chunked writes are used to save memory while downloading.
import os
import requests
import tqdm

url = "https://www.portablefreeware.com/download.php?dd=1159"

# requests.head doesn't follow redirects by default, so the Location header is available
response_header = requests.head(url)
file_path = response_header.headers["Location"]
file_name = os.path.basename(file_path)

with open(file_name, "wb") as file:
    response = requests.get(url, stream=True)
    total_length = int(response.headers.get("content-length"))
    for chunk in tqdm.tqdm(response.iter_content(chunk_size=1024), total=total_length / 1024, unit="KB"):
        if chunk:
            file.write(chunk)
            file.flush()
Progress output:
6%|▌ | 2848/46100.1640625 [00:04<01:11, 606.90KB/s]
Redirects can be bounced anywhere via a DNS-distributed network. So the example answers above show https://www, but in my case they resolve to Europe, so my fastest local source comes in as
https://eu.diskinternals.com/download/Linux_Reader.exe
By far the simplest approach is to try raw curl first, without bothering to resolve anything; if the result is good, there's no need to inspect or scrape:
curl -o 1159.tmp https://www.portablefreeware.com/download.php?dd=1159
However, I know in this case that's not the expected result, so the next level is:
curl -I https://www.portablefreeware.com/download.php?dd=1159 |find "Location"
and that gives the result shown by others:
https://www.diskinternals.com/download/Linux_Reader.exe
But that's not the full picture, since if we feed that back:
curl.exe -K location.txt
we get:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved here.</p>
</body></html>
hence the nested redirects to
https://eu.diskinternals.com/download/Linux_Reader.exe
All of that can be scripted from the command line to run in loops in a line or two, but I don't use Python, so you'll need to write perhaps a dozen lines to do something similar.
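A minimal sketch of what those dozen lines might look like (untested illustration; it follows each Location header by hand, and the hop limit is my own addition):

import os
import requests

url = 'https://www.portablefreeware.com/download.php?dd=1159'

# Follow each Location header manually, at most 5 hops,
# to reach the final mirror (e.g. eu.diskinternals.com)
for _ in range(5):
    r = requests.head(url, allow_redirects=False)
    if 'Location' not in r.headers:
        break
    url = r.headers['Location']

r = requests.get(url)
with open(os.path.basename(url), 'wb') as f:
    f.write(r.content)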
C:\Users\WDAGUtilityAccount\Desktop>curl -O https://eu.diskinternals.com/download/Linux_Reader.exe
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 44.9M 100 44.9M 0 0 3057k 0 0:00:15 0:00:15 --:--:-- 3640k
C:\Users\WDAGUtilityAccount\Desktop>dir /b lin*.*
Linux_Reader.exe
And, from the help file, yesterday's extra update (Sunday, September 4, 2022) Link:
curl -O https://eu.diskinternals.com/download/Uneraser_Setup.exe
I am trying to use an OCR API with Python to convert PDF to text. The API I'm using is: https://www.convertapi.com/pdf-to-txt . When I upload the file through the website it works perfectly, but the API call has the following issue:
Python code:
import requests
url = 'https://v2.convertapi.com/convert/pdf/to/txt?Secret=mykey'
files = {'file': open('C:\<some_url>\filename.pdf', 'rb')}
r = requests.post(url, files=files)
The API call works fine, but when I try to access the response through
r.text
it returns gibberish (notice the FileData section):
'{"ConversionCost":4,"Files":[{"FileName":"stateoftheartKWextraction.txt","FileExt":"txt","FileSize":60179,"FileData":"QXV0b21hdGljIEtleXBocmFzZSBFeHRyYWN0aW9uOiBBIFN1cnZleSBvZiB0aGUgU3RhdGUgb2YgdGhlIEFydA0KDQpLYXppIFNhaWR1bCBIYXNhbiAgYW5kICBWaW5jZW50IE5nDQpIdW1hbiBMYW5ndWFnZSBUZWNobm9sb2d5IFJlc2VhcmNoIEluc3RpdHV0ZSBVbml2ZXJzaXR5IG9mIFRleGFzIGF0IERhbGxhcyBSaWNoYXJkc29uLCBUWCA3NTA4My0wNjg4DQp7c2FpZHVsLHZpbmNlfUBobHQudXRkYWxsYXMuZW...
Even if I use json.loads to convert it into a dict, it still prints the text as gibberish.
I've tried to upload the file as non-binary, but that doesn't work (it throws an exception).
I've tried many PDF files and they were all in English.
Thank you.
The text is base64-encoded, so you need to decode it. Let's take the first file as an example.
import base64
r = r.json()
text = r['Files'][0]['FileData']
print(base64.b64decode(text))
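Note that b64decode returns bytes; if you want a str, decode that as well (assuming the text is UTF-8):

print(base64.b64decode(text).decode('utf-8'))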
By the way, they seem to have a Python library as well, you might want to check that out: https://github.com/ConvertAPI/convertapi-python
I am trying to have my server, in Python 3, go grab files from URLs. Specifically, I would like to pass a URL into a function, have the function grab an audio file (of many varying formats) and save it as an MP3, probably using ffmpeg or ffmpy. If the URL also has a PDF, I would also like to save that as a PDF. I haven't done much research on the PDF side yet, but I have been working on the audio piece and wasn't sure if this was even possible.
I have looked at several questions here, most notably:
How do I download a file over HTTP using Python?
It's a little old, but I tried several methods from it and always got some sort of issue. I have tried using the requests library, urllib, streamripper, and maybe one other.
Is there a way to do this, with a recommended library?
For example, most of the ones I have tried do save something, like the HTML page, or an empty file called 'file.mp3' in this case.
Streamripper returned a 'try changing user agents' error.
I am not sure if this is possible, but I am sure there is something I'm not understanding here. Could someone point me in the right direction?
This isn't necessarily the code I'm trying to use, just an example of something I have used that doesn't work.
import requests

url = "http://someurl.com/webcast/something"
r = requests.get(url)
with open('file.mp3', 'wb') as f:
    f.write(r.content)

# Retrieve HTTP meta-data
print(r.status_code)
print(r.headers['content-type'])
print(r.encoding)
Edit:
import requests
import ffmpy
import datetime
import os

## THIS SCRIPT CAN BE PASSED A URL AND IF THE URL RETURNS
## HTTP HEADER FOR CONTENT TYPE AUDIO/MPEG, THE FILE WILL
## BE SAVED AS THE CURRENT-DATE-AND-TIME.MP3
##
## THIS SCRIPT CAN BE PASSED A URL AND IF THE URL RETURNS
## HTTP HEADER FOR CONTENT TYPE application/pdf, THE FILE WILL
## BE SAVED AS THE CURRENT-DATE-AND-TIME.PDF
##
## THIS SCRIPT CAN BE PASSED A URL AND IF THE URL RETURNS
## HTTP HEADER FOR CONTENT TYPE other than application/pdf, OR
## audio/mpeg, THE FILE WILL NOT BE SAVED

def BordersPythonDownloader(url):
    print('Beginning file download requests')
    r = requests.get(url, stream=True)
    contype = r.headers['content-type']
    if contype == "audio/mpeg":
        print("audio file")
        filename = '[{}].mp3'.format(str(datetime.datetime.now()))
        with open('file.mp3', 'wb+') as f:
            f.write(r.content)
        ff = ffmpy.FFmpeg(
            inputs={'file.mp3': None},
            outputs={filename: None}
        )
        ff.run()
        if os.path.exists('file.mp3'):
            os.remove('file.mp3')
    elif contype == "application/pdf":
        print("pdf file")
        filename = '[{}].pdf'.format(str(datetime.datetime.now()))
        with open(filename, 'wb+') as f:
            f.write(r.content)
    else:
        print("URL DID NOT RETURN AN AUDIO OR PDF FILE, IT RETURNED {}".format(contype))

# INSERT YOUR URL FOR TESTING,
# OR CALL THIS SCRIPT FROM ELSEWHERE, PASSING IT THE URL

# DEFINE YOUR URL
#url = 'http://archive.org/download/testmp3testfile/mpthreetest.mp3'

# CALL THE SCRIPT, PASSING IT YOUR URL
#x = BordersPythonDownloader(url)

# ANOTHER EXAMPLE WITH A PDF
#url = 'https://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst6500/ios/12-2SY/configuration/guide/sy_swcg/etherchannel.pdf'
#x = BordersPythonDownloader(url)
Thanks Richard, this code works and helps me understand this better. Any suggestions for improving the above working example?
Hey, I just did some research and found that I could download images from URLs which end with filename.extension, like 000000.jpeg. I wonder now how I could download a picture which doesn't have any extension.
Here is the URL of the image I want to download: http://books.google.com/books/content?id=i2xKGwAACAAJ&printsec=frontcover&img=1&zoom=1&source=gbs_api
When I put the URL directly into the browser, it displays an image.
Furthermore, here is what I tried:
from six.moves import urllib

thumbnail = 'http://books.google.com/books/content?id=i2xKGwAACAAJ&printsec=frontcover&img=1&zoom=1&source=gbs_api'
img = urllib.request.Request(thumbnail)
pic = urllib.request.urlopen(img)
pic = urllib.request.urlopen(img).read()
Any help will be appreciated so much.
This is a way to do it using the HTTP response headers:
import requests
import time

r = requests.get("http://books.google.com/books/content?id=i2xKGwAACAAJ&printsec=frontcover&img=1&zoom=1&source=gbs_api", stream=True)
ext = r.headers['content-type'].split('/')[-1]  # converts the response's MIME type to an extension (may not work with everything)

with open("%s.%s" % (time.time(), ext), 'wb') as f:  # open the file to write as binary - replace 'wb' with 'w' for text files
    for chunk in r.iter_content(1024):  # iterate on stream using 1KB packets
        f.write(chunk)  # write the file