I am trying to have my server, in python 3, go grab files from URLs. Specifically, I would like to pass a URL into a function, I would like the function to go grab an audio file(of many varying formats) and save it as an MP3, probably using ffmpeg or ffmpy. If the URL also has a PDF, I would also like to save that, as a PDF. I haven't done much research on the PDF yet, but I have been working on the audio piece and wasn't sure if this was even possible.
I have looked at several questions here, but most notably;
How do I download a file over HTTP using Python?
It's a little old but I tried several methods in there and always get some sort of issue. I have tried using the requests library, urllib, streamripper, and maybe one other.
Is there a way to do this and with a recommended library?
For example, most of the ones I have tried do save something, like the html page, or an empty file called 'file.mp3' in this case.
Streamripper received a try changing user agents error.
I am not sure if this is possible, but I am sure there is something I'm not understanding here, could someone point me in the right direction?
This isn't necessarily the code I'm trying to use, just an example of something I have used that doesn't work.
import requests
url = "http://someurl.com/webcast/something"
r = requests.get(url)
with open('file.mp3', 'wb') as f:
f.write(r.content)
# Retrieve HTTP meta-data
print(r.status_code)
print(r.headers['content-type'])
print(r.encoding)
**Edit
import requests
import ffmpy
import datetime
import os
## THIS SCRIPT CAN BE PASSED A URL AND IF THE URL RETURNS
## HTTP HEADER FOR CONTENT TYPE AUDIO/MPEG, THE FILE WILL
## BE SAVED AS THE CURRENT-DATE-AND-TIME.MP3
##
## THIS SCRIPT CAN BE PASSED A URL AND IF THE URL RETURNS
## HTTP HEADER FOR CONTENT TYPE application/pdf, THE FILE WILL
## BE SAVED AS THE CURRENT-DATE-AND-TIME.PDF
##
## THIS SCRIPT CAN BE PASSED A URL AND IF THE URL RETURNS
## HTTP HEADER FOR CONTENT TYPE other than application/pdf, OR
## audio/mpeg, THE FILE WILL NOT BE SAVED
def BordersPythonDownloader(url):
print('Beginning file download requests')
r = requests.get(url, stream=True)
contype = r.headers['content-type']
if contype == "audio/mpeg":
print("audio file")
filename = '[{}].mp3'.format(str(datetime.datetime.now()))
with open('file.mp3', 'wb+') as f:
f.write(r.content)
ff = ffmpy.FFmpeg(
inputs={'file.mp3': None},
outputs={filename: None}
)
ff.run()
if os.path.exists('file.mp3'):
os.remove('file.mp3')
elif contype == "application/pdf":
print("pdf file")
filename = '[{}].pdf'.format(str(datetime.datetime.now()))
with open(filename, 'wb+') as f:
f.write(r.content)
else:
print("URL DID NOT RETURN AN AUDIO OR PDF FILE, IT RETURNED {}".format(contype))
# INSERT YOUR URL FOR TESTING
# OR CALL THIS SCRIPT FROM ELSEWHERE, PASSING IT THE URL
#DEFINE YOUR URL
#url = 'http://archive.org/download/testmp3testfile/mpthreetest.mp3'
#CALL THE SCRIPT; PASSING IT YOUR URL
#x = BordersPythonDownloader(url)
#ANOTHER EXAMPLE WITH A PDF
#url = 'https://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst6500/ios/12-2SY/configuration/guide/sy_swcg/etherchannel.pdf'
#x = BordersPythonDownloader(url)
Thanks Richard, this code works and helps me understand this better. Any suggestions for improving the above working example?
I'm trying to download and save an image from the web using python's requests module.
Here is the (working) code I used:
img = urllib2.urlopen(settings.STATICMAP_URL.format(**data))
with open(path, 'w') as f:
f.write(img.read())
Here is the new (non-working) code using requests:
r = requests.get(settings.STATICMAP_URL.format(**data))
if r.status_code == 200:
img = r.raw.read()
with open(path, 'w') as f:
f.write(img)
Can you help me on what attribute from the response to use from requests?
You can either use the response.raw file object, or iterate over the response.
To use the response.raw file-like object will not, by default, decode compressed responses (with GZIP or deflate). You can force it to decompress for you anyway by setting the decode_content attribute to True (requests sets it to False to control decoding itself). You can then use shutil.copyfileobj() to have Python stream the data to a file object:
import requests
import shutil
r = requests.get(settings.STATICMAP_URL.format(**data), stream=True)
if r.status_code == 200:
with open(path, 'wb') as f:
r.raw.decode_content = True
shutil.copyfileobj(r.raw, f)
To iterate over the response use a loop; iterating like this ensures that data is decompressed by this stage:
r = requests.get(settings.STATICMAP_URL.format(**data), stream=True)
if r.status_code == 200:
with open(path, 'wb') as f:
for chunk in r:
f.write(chunk)
This'll read the data in 128 byte chunks; if you feel another chunk size works better, use the Response.iter_content() method with a custom chunk size:
r = requests.get(settings.STATICMAP_URL.format(**data), stream=True)
if r.status_code == 200:
with open(path, 'wb') as f:
for chunk in r.iter_content(1024):
f.write(chunk)
Note that you need to open the destination file in binary mode to ensure python doesn't try and translate newlines for you. We also set stream=True so that requests doesn't download the whole image into memory first.
Get a file-like object from the request and copy it to a file. This will also avoid reading the whole thing into memory at once.
import shutil
import requests
url = 'http://example.com/img.png'
response = requests.get(url, stream=True)
with open('img.png', 'wb') as out_file:
shutil.copyfileobj(response.raw, out_file)
del response
How about this, a quick solution.
import requests
url = "http://craphound.com/images/1006884_2adf8fc7.jpg"
response = requests.get(url)
if response.status_code == 200:
with open("/Users/apple/Desktop/sample.jpg", 'wb') as f:
f.write(response.content)
I have the same need for downloading images using requests. I first tried the answer of Martijn Pieters, and it works well. But when I did a profile on this simple function, I found that it uses so many function calls compared to urllib and urllib2.
I then tried the way recommended by the author of requests module:
import requests
from PIL import Image
# python2.x, use this instead
# from StringIO import StringIO
# for python3.x,
from io import StringIO
r = requests.get('https://example.com/image.jpg')
i = Image.open(StringIO(r.content))
This much more reduced the number of function calls, thus speeded up my application.
Here is the code of my profiler and the result.
#!/usr/bin/python
import requests
from StringIO import StringIO
from PIL import Image
import profile
def testRequest():
image_name = 'test1.jpg'
url = 'http://example.com/image.jpg'
r = requests.get(url, stream=True)
with open(image_name, 'wb') as f:
for chunk in r.iter_content():
f.write(chunk)
def testRequest2():
image_name = 'test2.jpg'
url = 'http://example.com/image.jpg'
r = requests.get(url)
i = Image.open(StringIO(r.content))
i.save(image_name)
if __name__ == '__main__':
profile.run('testUrllib()')
profile.run('testUrllib2()')
profile.run('testRequest()')
The result for testRequest:
343080 function calls (343068 primitive calls) in 2.580 seconds
And the result for testRequest2:
3129 function calls (3105 primitive calls) in 0.024 seconds
This might be easier than using requests. This is the only time I'll ever suggest not using requests to do HTTP stuff.
Two liner using urllib:
>>> import urllib
>>> urllib.request.urlretrieve("http://www.example.com/songs/mp3.mp3", "mp3.mp3")
There is also a nice Python module named wget that is pretty easy to use. Found here.
This demonstrates the simplicity of the design:
>>> import wget
>>> url = 'http://www.futurecrew.com/skaven/song_files/mp3/razorback.mp3'
>>> filename = wget.download(url)
100% [................................................] 3841532 / 3841532>
>> filename
'razorback.mp3'
Enjoy.
Edit: You can also add an out parameter to specify a path.
>>> out_filepath = <output_filepath>
>>> filename = wget.download(url, out=out_filepath)
Following code snippet downloads a file.
The file is saved with its filename as in specified url.
import requests
url = "http://example.com/image.jpg"
filename = url.split("/")[-1]
r = requests.get(url, timeout=0.5)
if r.status_code == 200:
with open(filename, 'wb') as f:
f.write(r.content)
There are 2 main ways:
Using .content (simplest/official) (see Zhenyi Zhang's answer):
import io # Note: io.BytesIO is StringIO.StringIO on Python2.
import requests
r = requests.get('http://lorempixel.com/400/200')
r.raise_for_status()
with io.BytesIO(r.content) as f:
with Image.open(f) as img:
img.show()
Using .raw (see Martijn Pieters's answer):
import requests
r = requests.get('http://lorempixel.com/400/200', stream=True)
r.raise_for_status()
r.raw.decode_content = True # Required to decompress gzip/deflate compressed responses.
with PIL.Image.open(r.raw) as img:
img.show()
r.close() # Safety when stream=True ensure the connection is released.
Timing both shows no noticeable difference.
As easy as to import Image and requests
from PIL import Image
import requests
img = Image.open(requests.get(url, stream = True).raw)
img.save('img1.jpg')
This is how I did it
import requests
from PIL import Image
from io import BytesIO
url = 'your_url'
files = {'file': ("C:/Users/shadow/Downloads/black.jpeg", open('C:/Users/shadow/Downloads/black.jpeg', 'rb'),'image/jpg')}
response = requests.post(url, files=files)
img = Image.open(BytesIO(response.content))
img.show()
Here is a more user-friendly answer that still uses streaming.
Just define these functions and call getImage(). It will use the same file name as the url and write to the current directory by default, but both can be changed.
import requests
from StringIO import StringIO
from PIL import Image
def createFilename(url, name, folder):
dotSplit = url.split('.')
if name == None:
# use the same as the url
slashSplit = dotSplit[-2].split('/')
name = slashSplit[-1]
ext = dotSplit[-1]
file = '{}{}.{}'.format(folder, name, ext)
return file
def getImage(url, name=None, folder='./'):
file = createFilename(url, name, folder)
with open(file, 'wb') as f:
r = requests.get(url, stream=True)
for block in r.iter_content(1024):
if not block:
break
f.write(block)
def getImageFast(url, name=None, folder='./'):
file = createFilename(url, name, folder)
r = requests.get(url)
i = Image.open(StringIO(r.content))
i.save(file)
if __name__ == '__main__':
# Uses Less Memory
getImage('http://www.example.com/image.jpg')
# Faster
getImageFast('http://www.example.com/image.jpg')
The request guts of getImage() are based on the answer here and the guts of getImageFast() are based on the answer above.
I'm going to post an answer as I don't have enough rep to make a comment, but with wget as posted by Blairg23, you can also provide an out parameter for the path.
wget.download(url, out=path)
This is the first response that comes up for google searches on how to download a binary file with requests. In case you need to download an arbitrary file with requests, you can use:
import requests
url = 'https://s3.amazonaws.com/lab-data-collections/GoogleNews-vectors-negative300.bin.gz'
open('GoogleNews-vectors-negative300.bin.gz', 'wb').write(requests.get(url, allow_redirects=True).content)
my approach was to use response.content (blob) and save to the file in binary mode
img_blob = requests.get(url, timeout=5).content
with open(destination + '/' + title, 'wb') as img_file:
img_file.write(img_blob)
Check out my python project that downloads images from unsplash.com based on keywords.
You can do something like this:
import requests
import random
url = "https://images.pexels.com/photos/1308881/pexels-photo-1308881.jpeg? auto=compress&cs=tinysrgb&dpr=1&w=500"
name=random.randrange(1,1000)
filename=str(name)+".jpg"
response = requests.get(url)
if response.status_code.ok:
with open(filename,'w') as f:
f.write(response.content)
Agree with Blairg23 that using urllib.request.urlretrieve is one of the easiest solutions.
One note I want to point out here. Sometimes it won't download anything because the request was sent via script (bot), and if you want to parse images from Google images or other search engines, you need to pass user-agent to request headers first, and then download the image, otherwise, the request will be blocked and it will throw an error.
Pass user-agent and download image:
opener=urllib.request.build_opener()
opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
urllib.request.install_opener(opener)
urllib.request.urlretrieve(URL, 'image_name.jpg')
Code in the online IDE that scrapes and downloads images from Google images using requests, bs4, urllib.requests.
Alternatively, if your goal is to scrape images from search engines like Google, Bing, Yahoo!, DuckDuckGo (and other search engines), then you can use SerpApi. It's a paid API with a free plan.
The biggest difference is that there's no need to figure out how to bypass blocks from search engines or how to extract certain parts from the HTML or JavaScript since it's already done for the end-user.
Example code to integrate:
import os, urllib.request
from serpapi import GoogleSearch
params = {
"api_key": os.getenv("API_KEY"),
"engine": "google",
"q": "pexels cat",
"tbm": "isch"
}
search = GoogleSearch(params)
results = search.get_dict()
print(json.dumps(results['images_results'], indent=2, ensure_ascii=False))
# download images
for index, image in enumerate(results['images_results']):
# print(f'Downloading {index} image...')
opener=urllib.request.build_opener()
opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
urllib.request.install_opener(opener)
# saves original res image to the SerpApi_Images folder and add index to the end of file name
urllib.request.urlretrieve(image['original'], f'SerpApi_Images/original_size_img_{index}.jpg')
-----------
'''
]
# other images
{
"position": 100, # 100 image
"thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQK62dIkDjNCvEgmGU6GGFZcpVWwX-p3FsYSg&usqp=CAU",
"source": "homewardboundnj.org",
"title": "pexels-helena-lopes-1931367 - Homeward Bound Pet Adoption Center",
"link": "https://homewardboundnj.org/upcoming-event/black-cat-appreciation-day/pexels-helena-lopes-1931367/",
"original": "https://homewardboundnj.org/wp-content/uploads/2020/07/pexels-helena-lopes-1931367.jpg",
"is_product": false
}
]
'''
Disclaimer, I work for SerpApi.
Here is a very simple code
import requests
response = requests.get("https://i.imgur.com/ExdKOOz.png") ## Making a variable to get image.
file = open("sample_image.png", "wb") ## Creates the file for image
file.write(response.content) ## Saves file content
file.close()
for download Image
import requests
Picture_request = requests.get(url)
My purpose is to download the data from this website:
http://transoutage.spp.org/
When opening this website, in the bottom of web, there is a description used to illustrate how to auto-download the data. For example:
http://transoutage.spp.org/report.aspx?download=true&actualendgreaterthan=3/1/2018&includenulls=true
The code I wrote is this:
import requests
ul_begin = 'http://transoutage.spp.org/report.aspx?download=true'
timeset = '3/1/2018' #define the time, m/d/yyyy
fn = ['&actualendgreaterthan='] + [timeset] + ['&includenulls=true']
fn = ''.join(fn)
ul = ul_begin+fn
r = requests.get(ul, verify=False)
Since, if you input the web address,
http://transoutage.spp.org/report.aspx?download=true&actualendgreaterthan=3/1/2018&includenulls=true,
into the Chrome, it will auto-download the data in .csv file. I do not know how to continue my code.
Please help me!!!!
You need to write the response you receive to a file:
r = requests.get(ul, verify=False)
if 200 >= r.status_code <= 300:
# If the request has succeeded
file_path = '<path_where_file_has_to_be_downloaded>`
f = open(file_path, 'w+')
f.write(r.content)
f.close()
This will work properly if the csv file is small in size. but for large files, you need to use stream param to download: http://masnun.com/2016/09/18/python-using-the-requests-module-to-download-large-files-efficiently.html
I'm using Requests to upload a PDF to an API. It is stored as "response" below. I'm trying to write that out to Excel.
import requests
files = {'f': ('1.pdf', open('1.pdf', 'rb'))}
response = requests.post("https://pdftables.com/api?&format=xlsx-single",files=files)
response.raise_for_status() # ensure we notice bad responses
file = open("out.xls", "w")
file.write(response)
file.close()
I'm getting the error:
file.write(response)
TypeError: expected a character buffer object
I believe all the existing answers contain the relevant information, but I would like to summarize.
The response object that is returned by requests get and post operations contains two useful attributes:
Response attributes
response.text - Contains str with the response text.
response.content - Contains bytes with the raw response content.
You should choose one or other of these attributes depending on the type of response you expect.
For text-based responses (html, json, yaml, etc) you would use response.text
For binary-based responses (jpg, png, zip, xls, etc) you would use response.content.
Writing response to file
When writing responses to file you need to use the open function with the appropriate file write mode.
For text responses you need to use "w" - plain write mode.
For binary responses you need to use "wb" - binary write mode.
Examples
Text request and save
# Request the HTML for this web page:
response = requests.get("https://stackoverflow.com/questions/31126596/saving-response-from-requests-to-file")
with open("response.txt", "w") as f:
f.write(response.text)
Binary request and save
# Request the profile picture of the OP:
response = requests.get("https://i.stack.imgur.com/iysmF.jpg?s=32&g=1")
with open("response.jpg", "wb") as f:
f.write(response.content)
Answering the original question
The original code should work by using wb and response.content:
import requests
files = {'f': ('1.pdf', open('1.pdf', 'rb'))}
response = requests.post("https://pdftables.com/api?&format=xlsx-single",files=files)
response.raise_for_status() # ensure we notice bad responses
file = open("out.xls", "wb")
file.write(response.content)
file.close()
But I would go further and use the with context manager for open.
import requests
with open('1.pdf', 'rb') as file:
files = {'f': ('1.pdf', file)}
response = requests.post("https://pdftables.com/api?&format=xlsx-single",files=files)
response.raise_for_status() # ensure we notice bad responses
with open("out.xls", "wb") as file:
file.write(response.content)
You can use the response.text to write to a file:
import requests
files = {'f': ('1.pdf', open('1.pdf', 'rb'))}
response = requests.post("https://pdftables.com/api?&format=xlsx-single",files=files)
response.raise_for_status() # ensure we notice bad responses
with open("resp_text.txt", "w") as file:
file.write(response.text)
As Peter already pointed out:
In [1]: import requests
In [2]: r = requests.get('https://api.github.com/events')
In [3]: type(r)
Out[3]: requests.models.Response
In [4]: type(r.content)
Out[4]: str
You may also want to check r.text.
Also: https://2.python-requests.org/en/latest/user/quickstart/
I have to read a txt ini file from my browser. [this is required]
res = urllib2.urlopen(URL)
inifile = res.read()
Then I want to basically use this the same way as I would have read any txt file.
config = ConfigParser.SafeConfigParser()
config.read( inifile )
But now looks like I can't use it as this is actually a string now
Can anybody suggest a way around?
You want configparser.readfp -- Presumably, you might even be able to get away with:
res = urllib2.urlopen(URL)
config = ConfgiParser.SafeConfigParser()
config.readfp(res)
assuming that urllib2.urlopen returns an object that is sufficiently file-like (i.e. it has a readline method). For easier debugging, you could do:
config.readfp(res, URL)
If you have to read it the data from a string, you could pack the whole thing into a io.StringIO (or StringIO.StringIO) buffer and read from that:
import io
res = urllib2.urlopen(URL)
inifile_text = res.read()
inifile = io.StringIO(inifile_text)
inifile.seek(0)
config.readfp(inifile)