I'm trying to download 200k images using their URLs.
This is my code:
import requests  # to get the image from the web
import shutil    # to save it locally
import os

r = requests.get(image_url, stream=True)
# Check if the image was retrieved successfully
if r.status_code == 200:
    # Set decode_content to True, otherwise the downloaded image file's size will be zero
    r.raw.decode_content = True
    if not os.path.isdir('images/' + filename.rsplit('/', 1)[0] + '/'):
        os.makedirs('images/' + filename.rsplit('/', 1)[0] + '/')
    with open('images/' + filename, 'wb') as f:
        shutil.copyfileobj(r.raw, f)
When I run it, it downloads some of the images, but the rest fail with this error:
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead
I have no idea why or when this happens. Maybe when a URL is unreachable? How can I make sure that everything that is up gets downloaded, and that failures are skipped instead of crashing the run?
What about using a try/except?
import requests  # to get the image from the web
import shutil    # to save it locally
import os
import urllib3

try:
    r = requests.get(image_url, stream=True)
    # Check if the image was retrieved successfully
    if r.status_code == 200:
        # Set decode_content to True, otherwise the downloaded image file's size will be zero
        r.raw.decode_content = True
        if not os.path.isdir('images/' + filename.rsplit('/', 1)[0] + '/'):
            os.makedirs('images/' + filename.rsplit('/', 1)[0] + '/')
        with open('images/' + filename, 'wb') as f:
            shutil.copyfileobj(r.raw, f)
except urllib3.exceptions.ProtocolError as error:
    # str() is needed here: concatenating the exception object itself raises a TypeError
    print("skipped error: " + str(error))
To download such a large number of images, you might also be interested in an asynchronous HTTP library like aiohttp. That would save you from waiting on a slow site to send one image before you can start downloading the next, as sketched below.
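For illustration, a minimal sketch of that approach, assuming a list of image URLs and using each URL's last path segment as the filename (both are placeholders, not from the question):

import asyncio
import aiohttp

async def fetch_image(session, url, path):
    try:
        async with session.get(url) as resp:
            if resp.status == 200:
                # Read the whole body, then write it out
                with open(path, 'wb') as f:
                    f.write(await resp.read())
    except aiohttp.ClientError as error:
        print("skipped:", url, error)

async def main(urls):
    async with aiohttp.ClientSession() as session:
        # Launch all downloads concurrently instead of one at a time
        await asyncio.gather(*(fetch_image(session, u, u.rsplit('/', 1)[-1]) for u in urls))

# asyncio.run(main(list_of_image_urls))  # list_of_image_urls is hypothetical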
Related
I'm currently facing an issue.
(I started trying out Python 3 hours ago.)
I'm making a Discord bot where, if a user sends an image, the bot saves it. My issue was that the bot wasn't saving it into a specific folder (I don't know how to do that either, haha), so my workaround was to save the image next to the script, create a new folder, and move the image into that folder, deleting the original copy so only the files in the folder remain.
My issue now is that this isn't consistent: it works for the first image, but fails when I attempt it a second time.
I'd like a simpler way to save an image file directly into a folder, rather than saving it in the same place as the Python file.
@client.command()
async def save(ctx):
    try:
        url = ctx.message.attachments[0].url
    except IndexError:
        print("Error: No attachments")
        await ctx.send("No attachments detected")
    else:
        imageName = str(uuid.uuid4()) + '.jpg'
        r = requests.get(url, stream=True)
        with open(imageName, 'wb') as out_file:
            print('Saving image: ' + imageName)
            shutil.copyfileobj(r.raw, out_file)
        images = [f for f in os.listdir() if '.jpg' in f.lower()]
        os.mkdir('Images')
        for image in images:
            new_path = 'Images/' + image
            shutil.move(image, new_path)
You just need to change with open(imageName, 'wb') as out_file:
As it is, it saves the image in the folder where the script is running. If you want the images in the Images folder, change that line to with open("Images/" + imageName, 'wb') as out_file: (or any other folder). Writing directly into the folder also lets you drop the list/mkdir/move block entirely; os.mkdir('Images') raises FileExistsError on the second run, which is why it only worked once.
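Putting that together, a minimal sketch of the simplified command (same imports and names as your snippet; os.makedirs with exist_ok=True is an addition so reruns don't crash):

@client.command()
async def save(ctx):
    try:
        url = ctx.message.attachments[0].url
    except IndexError:
        await ctx.send("No attachments detected")
    else:
        os.makedirs('Images', exist_ok=True)  # safe to call even if the folder already exists
        imageName = str(uuid.uuid4()) + '.jpg'
        r = requests.get(url, stream=True)
        # Save straight into the Images folder; no copy-and-move step needed
        with open('Images/' + imageName, 'wb') as out_file:
            shutil.copyfileobj(r.raw, out_file)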
I think you're not giving a specific path; you can build one with os.path.join(os.getcwd(), 'Images', imageName):
@client.command()
async def save(ctx):
    try:
        url = ctx.message.attachments[0].url
    except IndexError:
        print("Error: No attachments")
        await ctx.send("No attachments detected")
    else:
        imageName = str(uuid.uuid4()) + '.jpg'
        specific_path = os.path.join(os.getcwd(), 'Images', imageName)
        r = requests.get(url, stream=True)
        if r.status_code == 200:
            with open(specific_path, 'wb') as f:
                f.write(r.content)
I have code that fetches a random flag from the Flag Mashup Bot and downloads it:
import requests

DIR = 'C:/Users/myUser/Desktop/Flags/'
URL = 'https://flagsmashupbot.pythonanywhere.com/mashup?passwd=fl4gsm4shupb0t'

def download_image(img_url: str, dest_dir: str):
    img_data = requests.get(img_url).content
    with open(dest_dir, 'wb') as file:
        file.write(img_data)

if __name__ == "__main__":
    response = requests.get(URL)
    if response.ok:
        page = response.text
        image_url = page[page.find('data:image', page.find('data:image') + 1):page.find('" download=')]
        name = page[page.find('" download=') + 12:page.find('_FlagsMashupBot.png"')]
        DIR += (name + '.png')
        print(DIR)
        download_image(image_url, DIR)
When I run it, I get the following error on line 8:
requests.exceptions.InvalidSchema: No connection adapters were found for [image URL]
When I read about it, I realized that it's because the image URLs from the site don't start with "https://" (or at least that's what I understood).
So, is there a way to use requests.get() without having https at the start of the URL?
The reason you don't get an HTTP/HTTPS URL is that the href points to a base64-encoded data: URI that contains the image itself.
You can use urllib to open the data: link and save its contents to a file:
import urllib.request

data = 'data:image/png;charset=utf-8;base64,iVBORw0KGgoAAAANSUhEUgAABwgAAASwCAIAAABggIlUAAAABmJLR0QA/wD/AP+gvaeTAAAgAElEQVR4nOzdaZhd92Hf97Pde+fOYGYAzAx2gCBBivsCgqtEUl7piF.......'
response = urllib.request.urlopen(data)
with open('image.png', 'wb') as f:
    f.write(response.read())
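Equivalently, you can skip urllib and decode the payload yourself with the standard base64 module; a sketch assuming the same data string as above:

import base64

# Split off the 'data:image/png;charset=utf-8;base64,' header and decode the rest
header, encoded = data.split(',', 1)
with open('image.png', 'wb') as f:
    f.write(base64.b64decode(encoded))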
I am trying to download an image from an Instagram media URL:
https://instagram.fybz2-1.fna.fbcdn.net/v/t51.2885-15/fr/e15/p1080x1080/106602453_613520712600632_6255422472318530180_n.jpg?_nc_ht=instagram.fybz2-1.fna.fbcdn.net&_nc_cat=108&_nc_ohc=WQizf6rhDmQAX883HrQ&oh=140f221889178fd03bf654cf18a9d9a2&oe=5F4D2AFE
Pasting this into my browser brings up the image, but when I run the following code I get the error below, which I suspect is caused by the query string in the URL (running this on a simple URL ending in .jpg works without issue):
File "C:/Users/19053/InstagramImageDownloader/downloadImage.py", line 18, in <module>
with open(filename, 'wb') as f:
OSError: [Errno 22] Invalid argument: '106602453_613520712600632_6255422472318530180_n.jpg?_nc_ht=instagram.fybz2-1.fna.fbcdn.net&_nc_cat=108&_nc_ohc=WQizf6rhDmQAX883HrQ&oh=140f221889178fd03bf654cf18a9d9a2&oe=5F4D2AFE'
Full code as follows:
## Importing necessary modules
import requests  # to get the image from the web
import shutil    # to save it locally

## Set up the image URL and filename
image_url = "https://instagram.fybz2-1.fna.fbcdn.net/v/t51.2885-15/fr/e15/p1080x1080/106602453_613520712600632_6255422472318530180_n.jpg?_nc_ht=instagram.fybz2-1.fna.fbcdn.net&_nc_cat=108&_nc_ohc=WQizf6rhDmQAX883HrQ&oh=140f221889178fd03bf654cf18a9d9a2&oe=5F4D2AFE"
filename = image_url.split("/")[-1]

# Open the URL image; stream=True returns the stream content
r = requests.get(image_url, stream=True)

# Check if the image was retrieved successfully
if r.status_code == 200:
    # Set decode_content to True, otherwise the downloaded image file's size will be zero
    r.raw.decode_content = True
    # Open a local file with wb (write binary) permission
    with open(filename, 'wb') as f:
        shutil.copyfileobj(r.raw, f)
    print('Image successfully downloaded: ', filename)
else:
    print('Image couldn\'t be retrieved')
The problem is with the filename. You need to split on ? first, take the first element, and then split on / and take the last element:
import requests  # to get the image from the web
import shutil    # to save it locally

## Set up the image URL and filename
image_url = "https://instagram.fybz2-1.fna.fbcdn.net/v/t51.2885-15/fr/e15/p1080x1080/106602453_613520712600632_6255422472318530180_n.jpg?_nc_ht=instagram.fybz2-1.fna.fbcdn.net&_nc_cat=108&_nc_ohc=WQizf6rhDmQAX883HrQ&oh=140f221889178fd03bf654cf18a9d9a2&oe=5F4D2AFE"
filename = image_url.split("?")[0].split("/")[-1]

# Open the URL image; stream=True returns the stream content
r = requests.get(image_url, stream=True)

# Check if the image was retrieved successfully
if r.status_code == 200:
    # Set decode_content to True, otherwise the downloaded image file's size will be zero
    r.raw.decode_content = True
    # Open a local file with wb (write binary) permission
    with open(filename, 'wb') as f:
        shutil.copyfileobj(r.raw, f)
    print('Image successfully downloaded: ', filename)
else:
    print('Image couldn\'t be retrieved')
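If you'd rather not split strings by hand, a small alternative sketch with the standard library's urllib.parse gives the same result:

import os
from urllib.parse import urlparse

# urlparse separates the path from the ?query part, so basename comes back clean
filename = os.path.basename(urlparse(image_url).path)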
I am new to web scraping, and I am using Python to build a Google Images scraper. This is a snippet of my code:
import requests
import os
import bs4 as bs

query = 'kittens'
url = 'https://www.google.co.in/search?q=' + query + '&source=lnms&tbm=isch'
res = requests.get(url)
res.raise_for_status()
os.makedirs('new1')
soup = bs.BeautifulSoup(res.text, 'html.parser')  # the snippet uses soup below, so presumably it is created like this
imgElem = soup.select('div img')
print(len(imgElem))
for i in range(1, len(imgElem)):
    if imgElem == []:  # if not found, print error
        print('could not find any image')
    else:
        try:
            imgUrl = imgElem[i].get('src')
            print(imgElem[i].get('src'))
            print('Downloading image %s.....' % (imgUrl))
            res = requests.get(imgUrl)
            res.raise_for_status()
        # except requests.exceptions.MissingSchema:
        except Exception as e:
            # skip if not a normal image file
            print(e)
        num = str(i) + ".jpg"
        imageFile = open(os.path.join('.\\new1', num), 'wb')
        # write the downloaded image to the hard disk
        for chunk in res.iter_content(10000):
            imageFile.write(chunk)
        imageFile.close()
len(imgElem) returns 21 for me, but I can currently only download 20 images.
Why do I get only 20 images, and what would be a good way to overcome this?
You are having this issue because not all the src attribute values in imgElem are valid URLs.
Try this:
for el in imgElem:
    print(el['src'])
You will see that the first output line is
/images/branding/searchlogo/1x/googlelogo_desk_heirloom_color_150x55dp.gif
while all the others are valid URLs. So the statement
res = requests.get(imgUrl)
fails in that case (requests needs an absolute URL with a scheme); hence only 20 images are downloaded.
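One way to act on that is to skip anything that isn't an absolute URL before downloading; a sketch reusing the names and imports from the question's snippet:

for i, el in enumerate(imgElem):
    imgUrl = el.get('src')
    # Relative paths like the branding logo above have no scheme; skip them
    if not imgUrl or not imgUrl.startswith(('http://', 'https://')):
        continue
    res = requests.get(imgUrl)
    res.raise_for_status()
    with open(os.path.join('new1', str(i) + '.jpg'), 'wb') as imageFile:
        for chunk in res.iter_content(10000):
            imageFile.write(chunk)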
My purpose is to download the data from this website:
http://transoutage.spp.org/
At the bottom of that page there is a description of how to auto-download the data. For example:
http://transoutage.spp.org/report.aspx?download=true&actualendgreaterthan=3/1/2018&includenulls=true
The code I wrote is this:
import requests

ul_begin = 'http://transoutage.spp.org/report.aspx?download=true'
timeset = '3/1/2018'  # define the time, m/d/yyyy
fn = ['&actualendgreaterthan='] + [timeset] + ['&includenulls=true']
fn = ''.join(fn)
ul = ul_begin + fn
r = requests.get(ul, verify=False)
If you enter that address,
http://transoutage.spp.org/report.aspx?download=true&actualendgreaterthan=3/1/2018&includenulls=true,
into Chrome, it auto-downloads the data as a .csv file. I do not know how to continue my code from here.
Please help!
You need to write the response you receive to a file:
r = requests.get(ul, verify=False)
if 200 <= r.status_code < 300:
    # The request succeeded
    file_path = '<path_where_file_has_to_be_downloaded>'
    with open(file_path, 'wb') as f:  # r.content is bytes, so open the file in binary mode
        f.write(r.content)
This works fine if the csv file is small, but for large files you should use the stream parameter so the file is downloaded in chunks: http://masnun.com/2016/09/18/python-using-the-requests-module-to-download-large-files-efficiently.html
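A minimal streaming sketch of that approach, reusing ul and file_path from above (the 8192-byte chunk size is an arbitrary choice):

r = requests.get(ul, stream=True, verify=False)
with open(file_path, 'wb') as f:
    # Write the body piece by piece instead of holding it all in memory
    for chunk in r.iter_content(chunk_size=8192):
        f.write(chunk)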