Batch download google images with tags - python

I'm trying to find an efficient and replicable way to batch download full-size image files from a Google image search. Other people have asked similar things, but I haven't found anything that's exactly what I'm looking for or that I understand.
Most refer to the deprecated Google Image Search API or to the Google Custom Search API, which doesn't seem to work for the whole web, or are just about downloading images from a single URL.
I imagine this could be a two-step process: first, pull all the image URLs from a search and then batch download from those?
I should add that I am a beginner (which is probably obvious; sorry). So if someone could explain as well as point me in the right direction, that would be much appreciated.
I've also looked into freeware options, but those seem spotty as well, unless anyone knows of a reliable one.
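Something like this is roughly what I imagine for the second step, a rough sketch assuming the image URLs have already been collected somehow (the URL and folder name below are just placeholders):
import os
import requests

def download_all(urls, folder):
    """Download every URL in the list into the given folder."""
    os.makedirs(folder, exist_ok=True)
    for i, url in enumerate(urls):
        try:
            r = requests.get(url, timeout=10)
            r.raise_for_status()
        except requests.RequestException as e:
            print('could not download %s: %s' % (url, e))
            continue
        with open(os.path.join(folder, 'image_%d.jpg' % i), 'wb') as f:
            f.write(r.content)

download_all(['https://example.com/some_image.jpg'], 'myDirectory')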
Download images from google image search (python)
In Python, is there a way I can download all/some of the image files (e.g. JPG/PNG) from a **Google Images** search result?
Also, does anyone know anything about the labels from this, and whether they exist somewhere / are associated with the images? https://en.wikipedia.org/wiki/Google_Image_Labeler
import json
import os
import time
import requests
from PIL import Image
from StringIO import StringIO
from requests.exceptions import ConnectionError


def go(query, path):
    """Download full size images from Google image search.

    Don't print or republish images without permission.
    I used this to train a learning algorithm.
    """
    BASE_URL = 'https://ajax.googleapis.com/ajax/services/search/images?'\
               'v=1.0&q=' + query + '&start=%d'
    BASE_PATH = os.path.join(path, query)
    if not os.path.exists(BASE_PATH):
        os.makedirs(BASE_PATH)
    start = 0  # Google's start query string parameter for pagination.
    while start < 60:  # Google will only return a max of 56 results.
        r = requests.get(BASE_URL % start)
        for image_info in json.loads(r.text)['responseData']['results']:
            url = image_info['unescapedUrl']
            try:
                image_r = requests.get(url)
            except ConnectionError, e:
                print 'could not download %s' % url
                continue
            # Remove file-system path characters from name.
            title = image_info['titleNoFormatting'].replace('/', '').replace('\\', '')
            # Open in binary mode so image data is not mangled on Windows.
            file = open(os.path.join(BASE_PATH, '%s.jpg') % title, 'wb')
            try:
                Image.open(StringIO(image_r.content)).save(file, 'JPEG')
            except IOError, e:
                # Throw away some gifs...blegh.
                print 'could not save %s' % url
                continue
            finally:
                file.close()
        print start
        start += 4  # 4 images per page.
        # Be nice to Google and they'll be nice back :)
        time.sleep(1.5)

# Example use
go('landscape', 'myDirectory')
Update
I was able to create a Custom Search over the full web as specified here, and successfully execute it to get image links, but as also mentioned in that previous post, they don't exactly align with the normal Google Images results.
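In case it helps anyone, a rough sketch of the kind of image query I ran against the Custom Search API (the API key and search engine id cx are placeholders; each result's image URL is in item['link']):
import requests

def image_links(query, api_key, cx, pages=3):
    """Collect image result links from the Custom Search API, 10 per page."""
    links = []
    for start in range(1, pages * 10, 10):
        r = requests.get('https://www.googleapis.com/customsearch/v1', params={
            'key': api_key, 'cx': cx, 'q': query,
            'searchType': 'image', 'start': start})
        for item in r.json().get('items', []):
            links.append(item['link'])
    return links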

Try using the ImageSoup module. To install it, simply:
pip install imagesoup
Some sample code:
>>> from imagesoup import ImageSoup
>>>
>>> soup = ImageSoup()
>>> images_wanted = 50
>>> query = 'landscape'
>>> images = soup.search(query, n_images=images_wanted)
Now you have a list with 50 landscape images from Google Images. Let's play with the first one:
>>> im = images[0]
>>> im.URL
https://static.pexels.com/photos/279315/pexels-photo-279315.jpeg
>>> im.size
(2600, 1300)
>>> im.mode
RGB
>>> im.dpi
(300, 300)
>>> im.color_count
493230
>>> # Let's check the main 4 colors in the image. We use
>>> # reduce_size = True to speed up the process.
>>> im.main_color(reduce_size=True, n=4)
[('black', 0.2244), ('darkslategrey', 0.1057), ('darkolivegreen', 0.0761), ('dodgerblue', 0.0531)]
>>> # Let's take a look at our image
>>> im.show()
>>> # Nice image! Let's save it.
>>> im.to_file('landscape.jpg')
The number of images returned by each search may vary; it is usually smaller than 900. If you want to get all the images, set n_images=1000.
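To save every result in one go, a small loop over the returned list works; a sketch using the same to_file method as above (individual downloads may fail, hence the try/except):
>>> images = soup.search('landscape', n_images=1000)
>>> for i, im in enumerate(images):
...     try:
...         im.to_file('landscape_{}.jpg'.format(i))
...     except Exception:
...         pass  # skip any image that cannot be fetched or written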
To contribute or report bugs, check the github repo: https://github.com/rafpyprog/ImageSoup

Related

How to extract images, video and audio from a pdf file using python

I need a Python program that can extract videos, audio and images from a PDF. I have tried using libraries such as PyPDF2 and Pillow, but I was unable to get all three to work, let alone one.
I think you could achieve this using pymupdf.
To extract images see the following: https://pymupdf.readthedocs.io/en/latest/recipes-images.html#how-to-extract-images-pdf-documents
For Sound and Video these are essentially Annotation types.
The following "annots" function would get all the annotations of a specific type for a PDF page:
https://pymupdf.readthedocs.io/en/latest/page.html#Page.annots
Annotation types are as follows:
https://pymupdf.readthedocs.io/en/latest/vars.html#annotationtypes
Once you have acquired an annotation, I think you can use the get_file method to extract the content (see: https://pymupdf.readthedocs.io/en/latest/annot.html#Annot.get_file).
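A rough sketch of that idea (the file name is a placeholder, and note this covers only FileAttachment annotations):
import fitz  # PyMuPDF

doc = fitz.open("input.pdf")  # placeholder file name
for page in doc:
    for annot in page.annots(types=[fitz.PDF_ANNOT_FILE_ATTACHMENT]):
        data = annot.get_file()             # embedded file content as bytes
        name = annot.file_info["filename"]  # name stored with the attachment
        with open(name, "wb") as out:
            out.write(data)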
Hope this helps!
@George Davis-Diver can you please let me have an example PDF with video?
Sounds and videos are embedded in their specific annotation types. Neither is a FileAttachment annotation, so the respective methods cannot be used.
For a sound annotation, you must use `annot.get_sound()`, which returns a dictionary where one of the keys is the binary sound stream.
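A small sketch of that, assuming the page's first annotation is a Sound annotation and that the raw audio bytes sit under the "stream" key of the returned dictionary:
import fitz  # PyMuPDF

doc = fitz.open("sound.pdf")  # placeholder: a PDF containing a Sound annotation
annot = doc[0].first_annot
if annot.type[0] == fitz.PDF_ANNOT_SOUND:
    sound = annot.get_sound()  # dictionary describing the embedded sound
    # assumption: the binary sound stream is stored under the "stream" key
    with open("sound.bin", "wb") as f:
        f.write(sound["stream"])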
Images, on the other hand, can certainly be embedded as FileAttachment annotations, but this is unusual. Normally they are displayed on the page independently. Find out a page's images like this:
import fitz
from pprint import pprint
doc=fitz.open("your.pdf")
page=doc[0] # first page - use 0-based page numbers
pprint(page.get_images())
[(1114, 0, 1200, 1200, 8, 'DeviceRGB', '', 'Im1', 'FlateDecode')]
# extract the image stored under xref 1114:
img = doc.extract_image(1114)
This is a dictionary with image metadata and the binary image stream.
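Continuing the snippet above, writing that stream to disk takes only a couple of lines; extract_image() reports the file extension under "ext" and the raw image bytes under "image":
# 'img' is the dictionary returned by doc.extract_image(1114) above
with open("extracted_image." + img["ext"], "wb") as f:
    f.write(img["image"])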
Note that PDF stores the transparency data of an image separately, which therefore needs some additional care - but let us postpone this until it actually comes up.
Extracting video from RichMedia annotations is currently possible in PyMuPDF low-level code only.
@George Davis-Diver - thanks for the example file!
Here is code that extracts video content:
import sys
import pathlib
import fitz

doc = fitz.open("vid.pdf")  # open PDF
page = doc[0]  # load desired page (0-based)
annot = page.first_annot  # access the desired annot (first one in example)
if annot.type[0] != fitz.PDF_ANNOT_RICH_MEDIA:
    print(f"Annotation type is {annot.type[1]}")
    print("Only support RichMedia currently")
    sys.exit()

cont = doc.xref_get_key(annot.xref, "RichMediaContent/Assets/Names")
if cont[0] != "array":  # should be PDF array
    sys.exit("unexpected: RichMediaContent/Assets/Names is no array")

array = cont[1][1:-1]  # remove array delimiters
# jump over the name / title: we will get it later
if array[0] == "(":
    i = array.find(")")
else:
    i = array.find(">")

xref = array[i + 1 :]  # here is the xref of the actual video stream
if not xref.endswith(" 0 R"):
    sys.exit("media contents array has more than one entry")
xref = int(xref[:-4])  # xref of video stream file

video_filename = doc.xref_get_key(xref, "F")[1]
video_xref = doc.xref_get_key(xref, "EF/F")[1]
video_xref = int(video_xref.split()[0])
video_stream = doc.xref_stream_raw(video_xref)
pathlib.Path(video_filename).write_bytes(video_stream)

Replacing a word with another word, and replacing an image with another image in a PDF file through python, is this possible?

I need to replace K words with K other words for every PDF file I have within a certain path, and on top of this I need to replace every logo with another logo. I have around 1000 PDF files, so I do not want to use Adobe Acrobat and edit one file at a time. How can I start this?
Replacing words seems at least doable as long as there is a decent PDF reader one can access through Python (note: I want to do this task in Python); however, replacing an image might be more difficult. I will most likely have to find the dimensions of the current image and dynamically resize the replacement image to match, whilst the program runs through these PDF files.
Hi, so I've written down some code regarding this:
from pikepdf import Pdf, PdfImage, Name
import os
import glob
from PIL import Image
import zlib

example = Pdf.open(r'...\Likelihood.pdf')

PagesWithImages = []
ImageCodesForPages = []

# Grab all the pages and all the images in every page.
for i in example.pages:
    if len(list(i.images.keys())) >= 1:
        PagesWithImages.append(i)
        ImageCodesForPages.append(list(i.images.keys()))

pdfImages = []
for i, j in zip(PagesWithImages, ImageCodesForPages):
    for x in j:
        pdfImages.append(i.images[x])

# Replace every single page using random image, ensure that the dimensions remain the same?
for i in pdfImages:
    pdfimage = PdfImage(i)
    rawimage = pdfimage.obj
    im = Image.open(r'...\panda.jpg')
    pillowimage = pdfimage.as_pil_image()
    print(pillowimage.height)
    print(pillowimage.width)
    im = im.resize((pillowimage.width, pillowimage.height))
    im.show()
    rawimage.write(zlib.compress(im.tobytes()), filter=Name("/FlateDecode"))
    rawimage.ColorSpace = Name("/DeviceRGB")
Just one problem: it doesn't actually replace anything. If you're wondering why and how I wrote this code, I actually got it from this documentation:
https://buildmedia.readthedocs.org/media/pdf/pikepdf/latest/pikepdf.pdf
Start at Page 53
I essentially put all the pdfImages into a list, as one page can have multiple images. The last for loop then tries to replace all these images whilst maintaining the same width and height. Also note that I changed the file path names here; they are definitely not the issue.
Again, thank you.
I have figured out what I was doing wrong. So for anyone that wants to actually replace an image with another image in place on a PDF file what you do is:
from pikepdf import Pdf, PdfImage, Name
from PIL import Image
import zlib

example = Pdf.open(filepath, allow_overwriting_input=True)

PagesWithImages = []
ImageCodesForPages = []

# Grab all the pages and all the images in every page.
for i in example.pages:
    imagelists = list(i.images.keys())
    if len(imagelists) >= 1:
        for x in imagelists:
            rawimage = i.images[x]
            pdfimage = PdfImage(rawimage)
            rawimage = pdfimage.obj
            pillowimage = pdfimage.as_pil_image()
            im = Image.open(imagePath)
            im = im.resize((pillowimage.width, pillowimage.height))
            rawimage.write(zlib.compress(im.tobytes()), filter=Name("/FlateDecode"))
            rawimage.ColorSpace = Name("/DeviceRGB")
            rawimage.Width, rawimage.Height = pillowimage.width, pillowimage.height

example.save()
Essentially, I changed the arguments in the first line so that I specify that the input may be overwritten. I also added the last line, which actually saves the file.

Issue writing temp images to temp pdf in pyramid with reportlabs

I am using Python 3 with Pyramid and ReportLab to generate dynamic PDFs.
I am having an issue writing images into a PDF. I am using ReportLab in a web app to generate a PDF with images, but my images are not stored locally; they are on a remote server. I am downloading them locally into a temp directory (they are saving, I have checked). When I go to add the images to the PDF, the space is allocated but the image does not show up.
Here is my relevant code (simplified):
# creates pdf in memory
doc = SimpleDocTemplate(pdfName, pagesize=A4)
elements = []
for item in model['items']:
    # image goes here:
    if item['IMAGENAME']:
        response = getImageFromRemoteServer(item['IMAGENAME'])
        dir_filename = directory + item['IMAGENAME']
        if response.status_code == 200:
            with open(dir_filename, 'wb') as f:
                for chunk in response.iter_content():
                    f.write(chunk)
        elements.append(Image(dir_filename, width=2*inch, height=2*inch))

# create and save the pdf
doc.build(elements, canvasmaker=NumberedCanvas)
I have followed the user guide here https://www.reportlab.com/docs/reportlab-userguide.pdf and have tried the above way, plus embedded images (as the user guide says in the paragraph section) and putting the image in the table.
I also looked here: and it did not help me.
My question is really: what is the right way to download an image and put it in a PDF?
EDIT: fixed code indentation
EDIT 2:
Answered: I was finally able to get the images into the PDF. I am not sure what the trigger was to get it to work. The only thing I know I changed is that I am now using urllib to do the request, whereas before I was not. Here is my working code (simplified for the question; it is more abstracted and encapsulated in my actual code):
doc = SimpleDocTemplate(pdfName, pagesize=A4)
# array of elements in the pdf
elements = []
for question in model['questions']:
    # image goes here:
    if question['IMAGEFILE']:
        filename = question['IMAGEFILE']
        dir_filename = directory + filename
        url = get_url(settings, filename)
        response = urllib.request.urlopen(url)
        raw_data = response.read()
        f = open(dir_filename, 'wb')
        f.write(raw_data)
        f.close()
        response.close()

        myImage = Image(dir_filename)
        myImage.drawHeight = 2 * inch
        myImage.drawWidth = 2 * inch
        myImage.hAlign = "LEFT"
        elements.append(myImage)

# create and save the pdf
doc.build(elements)
Make your code independent of where the files come from. Separate file/resource retrieval from document generation. Ensure that your toolset works with local files. Encapsulate the code that loads files in a loader class or function; the encapsulation is what matters. I noticed this again this week while looking at thumbor loader classes.
If that works, you know reportlab, PIL and your application basically work.
Then make your code work with remote files using a URI like http://path/to/remote/files.
Afterwards you can switch between your file loader and your HTTP loader depending on environment or use case.
Another option would be to make your code work with local files using a URI like file://path/to/file
This way the only thing that changes when switching from local to remote is the URL. Probably you need a Python library supporting this. The requests library is well suited for downloading things; most probably it supports the file:// URL scheme as well.
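A minimal sketch of that loader idea, assuming resources are addressed either by a plain local path or by a file:// / http(s):// URI (the function name is mine, not from the question):
import io
import urllib.request

def load_resource(uri):
    """Return a file-like object for a plain path, a file:// URI or an http(s):// URI."""
    if uri.startswith(('http://', 'https://', 'file://')):
        with urllib.request.urlopen(uri) as resp:
            return io.BytesIO(resp.read())
    with open(uri, 'rb') as f:  # plain local path
        return io.BytesIO(f.read())

# Usage with reportlab, e.g.:
# elements.append(Image(load_resource(image_url), width=2*inch, height=2*inch))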
Most probably the lazy parameter was responsible for your first code sample not rendering the images. Triggering reportlab PDF rendering outside of the context managers for the temporary files could have led to this behaviour.
reportlab.platypus.flowables.py (using version 3.1.8)
class Image(Flowable):
    """an image (digital picture). Formats supported by PIL/Java 1.4 (the Python/Java Imaging Library
       are supported. At the present time images as flowables are always centered horozontally
       in the frame. We allow for two kinds of lazyness to allow for many images in a document
       which could lead to file handle starvation.
       lazy=1 don't open image until required.
       lazy=2 open image when required then shut it.
    """
    _fixedWidth = 1
    _fixedHeight = 1

    def __init__(self, filename, width=None, height=None, kind='direct', mask="auto", lazy=1):
        """If size to draw at not specified, get it from the image."""
        self.hAlign = 'CENTER'
        self._mask = mask
        fp = hasattr(filename, 'read')
        if fp:
            self._file = filename
            self.filename = repr(filename)
        ...
The last three lines of the code example tell you that you can pass an object that has a read method. This is exactly what a call to urllib.request.urlopen(url) returns. Using that memory buffer you create an instance of Image. No need to have write access to the filesystem, and no need to delete these files after PDF rendering. Applying our new knowledge to improve code readability: since your use-case includes retrieval of remote resources, using memory buffers that support the Python file API could be a much cleaner approach to assemble your PDF files.
from contextlib import closing
import urllib.request

doc = SimpleDocTemplate(pdfName, pagesize=A4)
# array of elements in the pdf
elements = []
for question in model['questions']:
    # download image and create Image from file-like object
    if question['IMAGEFILE']:
        filename = question['IMAGEFILE']
        image_url = get_url(settings, filename)
        with closing(urllib.request.urlopen(image_url)) as image_file:
            myImage = Image(image_file, width=2*inch, height=2*inch)
            myImage.hAlign = "LEFT"
            elements.append(myImage)

# create and save the pdf
doc.build(elements)
References
Coding with context managers

How to handle multi-page images in PythonMagick?

I want to convert some multi-pages .tif or .pdf files to individual .png images. From command line (using ImageMagick) I just do:
convert multi_page.pdf file_out.png
And I get all the pages as individual images (file_out-0.png, file_out-1.png, ...)
I would like to handle this file conversion within Python; unfortunately, PIL cannot read .pdf files, so I want to use PythonMagick. I tried:
import PythonMagick
im = PythonMagick.Image('multi_page.pdf')
im.write("file_out%d.png")
or just
im.write("file_out.png")
But I only get 1 page converted to png.
Of course I could load each page individually and convert them one by one, but there must be a way to do them all at once?
ImageMagick is not memory efficient, so if you try to read a large PDF, 100 pages or so, the memory requirement will be huge and it might crash or seriously slow down your system. So after all, reading all pages at once with PythonMagick is a bad idea; it's not safe.
So for PDFs, I ended up doing it page by page, but for that I need to get the number of pages first using pyPdf, which is reasonably fast:
import pyPdf
import PythonMagick

pdf_im = pyPdf.PdfFileReader(file('multi_page.pdf', "rb"))
npage = pdf_im.getNumPages()
for p in range(npage):
    im = PythonMagick.Image('multi_page.pdf[' + str(p) + ']')
    im.write('file_out-' + str(p) + '.png')
A more complete example based on the answer by Ivo Flipse and http://p-s.co.nz/wordpress/pdf-to-png-using-pythonmagick/
This uses a higher resolution and PyPDF2 instead of the older pyPdf.
import sys
import PyPDF2
import PythonMagick

pdffilename = sys.argv[1]

pdf_im = PyPDF2.PdfFileReader(file(pdffilename, "rb"))
npage = pdf_im.getNumPages()
print('Converting %d pages.' % npage)
for p in range(npage):
    im = PythonMagick.Image()
    im.density('300')
    im.read(pdffilename + '[' + str(p) + ']')
    im.write('file_out-' + str(p) + '.png')
I had the same problem, and as a workaround I used ImageMagick and did:
import subprocess
params = ['convert', 'src.pdf', 'out.png']
subprocess.check_call(params)
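If the default resolution is too low, ImageMagick's -density option can be passed before the input file; for a multi-page PDF this still writes out-0.png, out-1.png, and so on:
import subprocess

# render the PDF pages at 300 dpi before rasterising
params = ['convert', '-density', '300', 'src.pdf', 'out.png']
subprocess.check_call(params)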

Where / how to get free high resolution satellite images for geospatial data visualization with python

I want to overlay geospatial data (mostly heatmaps) on top of high resolution satellite images using Python. (I am a newbie, so be gentle with me ;-))
Here is my wish list
detailed enough to show streets and buildings
must be fairly recent (captured within last several years)
coordinates and projection of images/maps must be known so that the heatmaps I create can be overlaid
easy retrieval (hopefully, several lines of python codes will take care of getting right images)
free
I think Google Maps/Earth, Yahoo Maps, Bing, etc. could be potential candidates, but I am not sure how to access them easily. Code examples would be very helpful.
Any suggestions?
OpenStreetMap is a good equivalent to Google Maps (which I do not know very well).
Their database grows with time; it is an open source map acquisition effort. They are sometimes a little more accurate than Google Maps, see the Berlin Zoo example.
It has several APIs, which are read-only access: http://wiki.openstreetmap.org/wiki/XAPI.
It appears to use the REST protocol.
For the use of REST and Python, I would suggest this SO link.
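For instance, the read-only map call of the main OSM API takes a bounding box and returns raw OSM XML for that area; a small sketch with an illustrative bounding box near the Berlin Zoo:
import requests

# bounding box: left,bottom,right,top (illustrative coordinates near the Berlin Zoo)
bbox = '13.3355,52.5025,13.3455,52.5095'
r = requests.get('https://api.openstreetmap.org/api/0.6/map', params={'bbox': bbox})
r.raise_for_status()
with open('map.osm', 'wb') as f:
    f.write(r.content)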
So you want to do something almost exactly like this:
http://www.jjguy.com/heatmap/
which I found by googling for "python heatmap".
Now you are a bit unclear about what you want to do with these images, so remember that Google Earth imagery is copyrighted and there's a set of restrictions on what you can do with them.
Google Maps explicitly forbids using map tiles offline or caching them, but I think Microsoft Bing Maps doesn't say anything explicit against it, and I guess you are not planning to use your program commercially(?)
Then you could use this. It creates a cache, first loading a tile from memory, else from disk, else from the internet, always caching everything to disk for reuse. Of course you'll have to figure out how to tweak it, specifically how to get the tile coordinates and zoom level you need, and for this I strongly suggest this site (a small coordinate-conversion sketch follows the code below). Good study!
#!/usr/bin/env python
# coding: utf-8

import os
import Image
import random
import urllib
import cStringIO
import cairo
#from geofunctions import *


class TileServer(object):
    def __init__(self):
        self.imdict = {}
        self.surfdict = {}
        self.layers = 'ROADMAP'
        self.path = './'
        self.urltemplate = 'http://ecn.t{4}.tiles.virtualearth.net/tiles/{3}{5}?g=0'
        self.layerdict = {'SATELLITE': 'a', 'HYBRID': 'h', 'ROADMAP': 'r'}

    def tiletoquadkey(self, xi, yi, z):
        quadKey = ''
        for i in range(z, 0, -1):
            digit = 0
            mask = 1 << (i - 1)
            if (xi & mask) != 0:
                digit += 1
            if (yi & mask) != 0:
                digit += 2
            quadKey += str(digit)
        return quadKey

    def loadimage(self, fullname, tilekey):
        im = Image.open(fullname)
        self.imdict[tilekey] = im
        return self.imdict[tilekey]

    def tile_as_image(self, xi, yi, zoom):
        tilekey = (xi, yi, zoom)
        result = None
        try:
            result = self.imdict[tilekey]
        except:
            filename = '{}_{}_{}_{}.jpg'.format(zoom, xi, yi, self.layerdict[self.layers])
            fullname = self.path + filename
            try:
                result = self.loadimage(fullname, tilekey)
            except:
                server = random.choice(range(1, 4))
                quadkey = self.tiletoquadkey(*tilekey)
                print quadkey
                url = self.urltemplate.format(xi, yi, zoom, self.layerdict[self.layers], server, quadkey)
                print "Downloading tile %s to local cache." % filename
                urllib.urlretrieve(url, fullname)
                result = self.loadimage(fullname, tilekey)
        return result


if __name__ == "__main__":
    ts = TileServer()
    im = ts.tile_as_image(5, 9, 4)
    im.show()
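To get the tile coordinates and zoom level mentioned above, the usual slippy-map conversion from latitude/longitude works; a sketch (the Berlin coordinates are just an example):
import math

def deg2num(lat_deg, lon_deg, zoom):
    """Convert a WGS84 lat/lon pair to slippy-map tile indices at the given zoom."""
    n = 2 ** zoom
    xtile = int((lon_deg + 180.0) / 360.0 * n)
    ytile = int((1.0 - math.asinh(math.tan(math.radians(lat_deg))) / math.pi) / 2.0 * n)
    return xtile, ytile

# e.g. the tile containing Berlin at zoom 4: deg2num(52.52, 13.40, 4) -> (8, 5)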
One possible source is the imagery from NASA World Wind. You can look at their source to find out how they access their data sources, and do the same in your application.
I have made use of the Bing Maps API coupled with knowledge of map tiling. Please find the code for this in my GitHub repository.
You may find this helpful too.
