Fast download and compression of network images in Python

I am working on an aggregation platform. We want to store resized versions of 'aggregated' images from the web on our servers. To be specific, these images are of e-commerce products from different vendors. The 'item' dictionary has an "image" field, which is a URL that needs to be downloaded, compressed, and saved to disk.
download and compression method:
def downloadCompressImage(url, width, item):
    #Retrieve our source image from a URL
    #Load the URL data into an image
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    response = opener.open(url)
    img = cStringIO.StringIO(response.read())
    im = Image.open(img)
    wpercent = (width / float(im.size[0]))
    hsize = int((float(im.size[1]) * float(wpercent)))
    #Resize the image
    im2 = im.resize((width, hsize), Image.ANTIALIAS)
    key_name = item["vendor"] + "_" + hashlib.md5(url.encode('utf-8')).hexdigest() + "_" + str(width) + "x" + str(hsize) + ".jpg"
    path = "/var/www/html/server/images/"
    path = path + timestamp + "/"
    #save compressed image to disk
    im2.save(path + key_name, 'JPEG', quality=85)
    url = "http://server.com/images/" + timestamp + "/" + key_name
    return url
worker method:
def worker(lines):
    """Make a dict out of the parsed, supplied lines"""
    result = []
    for line in lines:
        line = line.rstrip('\n')
        item = json.loads(line.decode('ascii', 'ignore'))
        #
        #Do stuff with the item dict and update it
        #
        # Append item to result if image dl and compression is successful
        try:
            item["grid_image"] = downloadCompressImage(item["image"], 200, item)
        except:
            print "dl-comp exception in processing: " + item['name'] + item['vendor']
            traceback.print_exc(file=sys.stdout)
            continue
        if(item["grid_image"] != -1):
            result.append(item)
    return result
main method:
if __name__ == '__main__':
    # configurable options. different values may work better.
    numthreads = 15
    numlines = 1000
    lines = open('parserProducts.json').readlines()
    # create the process pool
    pool = multiprocessing.Pool(processes=numthreads)
    for result_lines in pool.imap(worker, (lines[line:line + numlines] for line in xrange(0, len(lines), numlines))):
        for line in result_lines:
            jdata = json.dumps(line)
            f.write(jdata + ',\n')
    pool.close()
    pool.join()
    f.seek(-2, os.SEEK_END)
    f.truncate()
    f.write(']')
    print "parsing is done"
My question:
Is this the best I can do with Python? There are ~3 M dictionary items. Without calling the "downloadCompressImage" method in 'worker', the "#Do stuff with the item dict and update it" portion takes only 8 minutes to complete. With the download and compression, though, it looks like it would take weeks, if not months.
Any ideas appreciated, thanks a bunch.

You are working with 3 million images here, which are downloaded from the internet and then compressed. How long that takes depends on two things, as far as I can tell:
Your network speed (and the speed of the target servers), to download the images.
Your CPU power, to compress the images.
So it is not Python limiting you; you are doing fine with multiprocessing.Pool. The main bottlenecks are your network speed and the number of cores (or CPU power) you have.
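If profiling shows the downloads (network I/O) rather than the resizing (CPU) are the limiting factor, one common tweak is to use far more workers for the download step than you have cores, since those workers spend most of their time waiting on the network. A minimal sketch, assuming the downloadCompressImage function from the question; the helper names and the thread count are illustrative, not part of the original code:

from multiprocessing.pool import ThreadPool  # threads are enough for I/O-bound downloads

def dl_one(item, width=200):
    """Wrap the download so one failure marks the item instead of killing the pool."""
    try:
        item["grid_image"] = downloadCompressImage(item["image"], width, item)
    except Exception:
        item["grid_image"] = -1
    return item

def process_chunk(items):
    pool = ThreadPool(processes=50)  # many more threads than cores; they mostly block on the network
    processed = pool.map(dl_one, items)
    pool.close()
    pool.join()
    return [it for it in processed if it["grid_image"] != -1]

Each chunk handed to the existing multiprocessing.Pool could call a helper like process_chunk, so the CPU-bound resizing still spreads across cores while the network waits overlap.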

Related

Concurrent.futures and SQLAlchemy benchmarks vs. synchronous code

I have a project where I need to upload ~70 files to my Flask app. I'm learning concurrency right now, so this seems like perfect practice. When using print statements, the concurrent version of this function is about 2x to 2.5x faster than the synchronous function.
Though when actually writing to the SQLite database, it takes about the same amount of time.
Original func:
@app.route('/test_sync')
def auto_add():
    t0 = time.time()
    # Code does not work without changing directory. better option?
    os.chdir('my_app/static/tracks')
    list_dir = os.listdir('my_app/static/tracks')
    # list_dir consists of .mp3 and .jpg files
    for filename in list_dir:
        if filename.endswith('.mp3'):
            try:
                thumbnail = [thumb for thumb in list_dir if thumb == filename[:-4] + '.jpg'][0]
            except Exception:
                print(f'ERROR - COULD NOT FIND THUMB for { filename }')
            resize_image(thumbnail)
            with open(filename, 'rb') as f, open(thumbnail, 'rb') as t:
                track = Track(
                    title=filename[15:-4],
                    artist='Sam Gellaitry',
                    description='No desc.',
                    thumbnail=t.read(),
                    binary_audio=f.read()
                )
        else:
            continue
        db.session.add(track)
    db.session.commit()
    elapsed = time.time() - t0
    return f'Uploaded all tracks in {elapsed} seconds.'
Concurrent func(s):
@app.route('/test_concurrent')
def auto_add_concurrent():
    t0 = time.time()
    MAX_WORKERS = 40
    os.chdir('/my_app/static/tracks')
    list_dir = os.listdir('/my_app/static/tracks')
    mp3_list = [x for x in list_dir if x.endswith('.mp3')]
    with futures.ThreadPoolExecutor(MAX_WORKERS) as executor:
        res = executor.map(add_one_file, mp3_list)
    for x in res:
        db.session.add(x)
    db.session.commit()
    elapsed = time.time() - t0
    return f'Uploaded all tracks in {elapsed} seconds.'
-----
def add_one_file(filename):
    list_dir = os.listdir('/my_app/static/tracks')
    try:
        thumbnail = [thumb for thumb in list_dir if thumb == filename[:-4] + '.jpg'][0]
    except Exception:
        print(f'ERROR - COULD NOT FIND THUMB for { filename }')
    resize_image(thumbnail)
    with open(filename, 'rb') as f, open(thumbnail, 'rb') as t:
        track = Track(
            title=filename[15:-4],
            artist='Sam Gellaitry',
            description='No desc.',
            thumbnail=t.read(),
            binary_audio=f.read()
        )
    return track
Here's the resize_image func, for completeness:
def resize_image(thumbnail):
    with Image.open(thumbnail) as img:
        img.resize((500, 500))
        img.save(thumbnail)
    return thumbnail
And benchmarks:
/test_concurrent (with print statements)
Uploaded all tracks in 0.7054300308227539 seconds.
/test_sync
Uploaded all tracks in 1.8661110401153564 seconds.
------
/test_concurrent (with db.session.add/db.session.commit)
Uploaded all tracks in 5.303245782852173 seconds.
/test_sync
Uploaded all tracks in 6.123792886734009 seconds.
What am I doing wrong with this concurrent code, and how can I optimize it?
It seems that the DB writes dominate your timings, and they do not usually benefit from parallelization when writing many rows to the same table, or, in the case of SQLite, the same DB. Instead of adding the ORM objects one by one to the session, perform a bulk insert:
db.session.bulk_save_objects(list(res))
In your current code the ORM has to insert the Track objects one at a time during flush just before the commit in order to fetch their primary keys after insert. Session.bulk_save_objects does not do that by default, which means that the objects are less usable after – they're not added to the session for example – but that does not seem to be an issue in your case.
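Applied to the question's concurrent view, the whole thing would look roughly like this (a sketch reusing the question's app, db, Track and add_one_file names; the route name is made up):

from concurrent import futures
import os, time

@app.route('/test_concurrent_bulk')
def auto_add_concurrent_bulk():
    t0 = time.time()
    os.chdir('/my_app/static/tracks')
    mp3_list = [x for x in os.listdir('.') if x.endswith('.mp3')]
    with futures.ThreadPoolExecutor(40) as executor:
        tracks = list(executor.map(add_one_file, mp3_list))
    # One bulk INSERT instead of a per-row flush; the Track objects are
    # not attached to the session afterwards, which is fine here.
    db.session.bulk_save_objects(tracks)
    db.session.commit()
    return f'Uploaded all tracks in {time.time() - t0} seconds.'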
"I’m inserting 400,000 rows with the ORM and it’s really slow!" is a good read on the subject.
As a side note, when working with files it is best to try to avoid any TOCTOU situations, if possible. In other words, don't use
thumbnail = [thumb for thumb in list_dir if thumb == filename[:-4] + '.jpg'][0]
to check if the file exists; use os.path.isfile() or similar instead if you must, but ideally just try to open it and handle the error if it cannot be opened:
thumbnail = filename[:-4] + '.jpg'
try:
    resize_image(thumbnail)
except FileNotFoundError:
    print(f'ERROR - COULD NOT FIND THUMB for { filename }')
    # Note that the latter open attempt will fail as well, if this fails
    ...

How to check if a file has completed uploading into S3 Bucket using Boto in Python?

I'm trying to upload an image into an S3 bucket using boto. After the image has successfully uploaded, I want to perform a certain operation using the file URL of the image in the S3 bucket. The problem is that sometimes the image doesn't upload fast enough and I end up with a server error when I want to perform the operation dependent on the file URL of the image.
This is my source code. I'm using Python Flask.
def search_test(consumer_id):
    consumer = session.query(Consumer).filter_by(consumer_id=consumer_id).one()
    products = session.query(Product).all()
    product_dictionary = {'Products': [p.serialize for p in products]}
    if request.method == 'POST':
        p_product_image_url = request.files['product_upload_url']
        s3 = boto.connect_s3(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
        bucket = s3.get_bucket(AWS_BUCKET_NAME)
        k = Key(bucket)
        if p_product_image_url and allowed_file(p_product_image_url.filename):
            # Read the contents of the file
            file_content = p_product_image_url.read()
            # Use Boto to upload the file to S3
            k.set_metadata('Content-Type', mimetypes.guess_type(p_product_image_url.filename))
            k.key = secure_filename(p_product_image_url.filename)
            k.set_contents_from_string(file_content)
            print ('consumer search upload successful')
            new_upload = Uploads(picture_upload_url=k.key.replace(' ', '+'), consumer=consumer)
            session.add(new_upload)
            session.commit()
            new_result = jsonify(Result=perform_actual_search(amazon_s3_base_url + k.key.replace(' ', '+'),
                                                              product_dictionary))
            return new_result
    else:
        return render_template('upload_demo.html', consumer_id=consumer_id)
The jsonify call needs a valid image URL to perform the operation. It works sometimes and sometimes it doesn't; I suspect the reason is that the image has not finished uploading by the time that line of code executes.
The perform_actual_search method is as follows:
def get_image_search_results(image_url):
    global description
    url = ('http://style.vsapi01.com/api-search/by-url/?apikey=%s&url=%s' % (just_visual_api_key, image_url))
    h = httplib2.Http()
    response, content = h.request(url,
                                  'GET')  # alternatively write content=h.request((url,'GET')[1]) ///Numbr 2 in our array
    result = json.loads(content)
    result_dictionary = []
    for i in range(0, 10):
        if result:
            try:
                if result['errorMessage']:
                    result_dictionary = []
            except:
                pass
            if result['images'][i]:
                images = result['images'][i]
                jv_img_url = images['imageUrl']
                title = images['title']
                try:
                    if images['description']:
                        description = images['description']
                    else:
                        description = "no description"
                except:
                    pass
                # print("\njv_img_url: %s,\ntitle: %s,\ndescription: %s\n\n" % (
                #     jv_img_url, title, description))
                image_info = {
                    'image_url': jv_img_url,
                    'title': title,
                    'description': description,
                }
                result_dictionary.append(image_info)
    if result_dictionary != []:
        # for i in range(len(result_dictionary)):
        #     print (result_dictionary[i])
        #     print("\n\n")
        return result_dictionary
    else:
        return []
def performSearch(jv_input_dictionary, imagernce_products_dict):
    print jv_input_dictionary
    print imagernce_products_dict
    global common_desc_ratio
    global isReady
    image_search_results = []
    if jv_input_dictionary != []:
        for i in range(len(jv_input_dictionary)):
            print jv_input_dictionary[i]
            for key in jv_input_dictionary[i]:
                if key == 'description':
                    input_description = jv_input_dictionary[i][key]
                    s1w = re.findall('\w+', input_description.lower())
                    s1count = Counter(s1w)
                    print input_description
                    for j in imagernce_products_dict:
                        if j == 'Products':
                            for q in range(len(imagernce_products_dict['Products'])):
                                for key2 in imagernce_products_dict['Products'][q]:
                                    if key2 == 'description':
                                        search_description = imagernce_products_dict['Products'][q]['description']
                                        print search_description
                                        s2w = re.findall('\w+', search_description.lower())
                                        s2count = Counter(s2w)
                                        # Commonality magic
                                        common_desc_ratio = difflib.SequenceMatcher(None, s1w, s2w).ratio()
                                        print('Common ratio is: %.2f' % common_desc_ratio)
                                        if common_desc_ratio > 0.09:
                                            image_search_results.append(imagernce_products_dict['Products'][q])
    if image_search_results:
        print image_search_results
        return image_search_results
    else:
        return {'404': 'No retailers registered with us currently own this product.'}

def perform_actual_search(image_url, imagernce_product_dictionary):
    return performSearch(get_image_search_results(image_url), imagernce_product_dictionary)
Any help solving this would be greatly appreciated.
I would configure S3 to generate notifications on events such as s3:ObjectCreated:*.
Notifications can be posted to an SNS topic, an SQS queue, or can directly trigger a Lambda function.
More details about S3 notifications: http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
You should rewrite your code to separate the upload part from the image-processing part. The latter can be implemented as a Lambda function in Python.
Working in an asynchronous way is key here; writing blocking code is usually not scalable.
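For the Lambda route, the function receives the standard S3 event payload once the object has been fully written; a minimal sketch of the processing side (the handler name and URL format here are illustrative):

def lambda_handler(event, context):
    """Triggered by s3:ObjectCreated:*; the object already exists when this runs."""
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        # Note: keys with spaces or special characters arrive URL-encoded
        key = record['s3']['object']['key']
        image_url = 'https://%s.s3.amazonaws.com/%s' % (bucket, key)
        # Safe to run the URL-dependent search here instead of right after the upload
        print('new object ready: %s' % image_url)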
You can compare the bytes written to S3 with the file size. Let's say you use the following method to write to S3:
bytes_written = key.set_contents_from_file(file_binary, rewind=True)
In your case it's set_contents_from_string.
Then I would compare bytes_written with p_product_image_url.seek(0, os.SEEK_END).
If they match, the whole file has been uploaded to S3.
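A sketch of that check inside the question's upload branch (the names come from the question; set_contents_from_file is used here because, per the example above, it reports the byte count):

# Upload straight from the incoming file object; rewind=True seeks to the
# start first, and the return value is the number of bytes written.
bytes_written = k.set_contents_from_file(p_product_image_url, rewind=True)

# Size of the incoming upload: seek to the end and read back the offset.
p_product_image_url.seek(0, os.SEEK_END)
expected_size = p_product_image_url.tell()

if bytes_written == expected_size:
    # The whole file is in S3; safe to build the URL and run the search.
    pass
else:
    # Incomplete upload: retry or report an error instead of searching.
    pass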

Downloading, writing, converting and saving with imagemagick in python (corrupt images)

Okay so I've been trying to solve this for about six hours now and it's just not happening.
I've got this function that grabs a bunch of image (GIF) files from a URL based on a timestamp (using the Requests library). The images get saved to my desktop in a specific directory just fine.
When I try to open that image, rename it, and process it, everything breaks.
Here's the initial method that sets everything up:
def createImage():
    AB_CODES = ["WHK", "WHN", "WWW", "XBU", "XSM"]
    BASE_URL = "http://url_where_I_get_images_from"
    orig_dir = "originals/"
    new_dir = "processed/"
    # Add new image for each code
    for code in AB_CODES:
        radar_dir = BASE_URL + code
        url = requests.head(radar_dir)
        # parseUrl creates a valid timestamp corresponding to the latest image.
        timestamp = parseUrl(url)
        filename = timestamp + "_" + code + "_PRECIP_RAIN.gif"
        radar = BASE_URL + code + "/" + filename
        radar_img = requests.get(radar, timeout=30.000)
        # This is where the file from original source gets saved to my desktop, works.
        # Image gets saved in path/originals/img.gif
        if (radar_img.status_code == requests.codes.ok):
            image = radar_img.content
            filepath = os.path.join(orig_dir, filename)
            imgfile = open(filepath, "wb")
            imgfile.write(image)
            imgfile.close()
            # This is where I create a new image to be saved and worked on in
            # path/processed/new_img.gif
            image = radar_img.content
            filename = code + "_radar.gif"
            convpath = os.path.join(new_dir, filename)
            convimg = open(convpath, "wb")
            convimg.write(image)
            # This is the call to the function where I use imagemagick
            # which is not working
            image = processImage(convimg.name, "processed/XSM_radar_output.gif")
            convimg.close()
Here are the two methods that make up my processing function. Right now it's in more of a testing phase because it won't work.
def formatArg(arg):
    if arg.startswith("#") or " " in arg:
        return repr(arg)
    return arg

def processImage(input, output, verbose=True, shell=True):
    args = [
        "convert", input,
        "-resize", "50x50",
        output
    ]
    if verbose:
        print("Image: %s" % input)
        print(" ".join(formatArg(a) for a in args))
        print
    if os.path.exists(input):
        try:
            result = subprocess.check_call(args)
        except subprocess.CalledProcessError:
            result = False
        if result:
            print("%s : SUCCESS!" % output)
        else:
            print("%s : FAIL!" % output)
I've tried it without shell=True. The error message I get is that the image is corrupted:
Image: processed/XSM_radar.gif
convert processed/XSM_radar.gif -resize 50x50 processed/XSM_radar.gif
convert.im6: corrupt image `processed/XSM_radar.gif' @ error/gif.c/ReadGIFImage/1356.
convert.im6: no images defined `processed/XSM_radar_output.gif' @ error/convert.c/ConvertImageCommand/3044.
processed/XSM_radar.gif : FAIL!
I don't understand why it's telling me that it's corrupted. I've run these exact same commands from the command line and it works just fine.
(I have imported subprocess)
Someone helped me determine the error.
def processImage(input, output, verbose=True, shell=True):
    args = [
        "convert", input,
        "-resize", "50x50",
        output
    ]
    if verbose:
        print("Image: %s" % input)
        print(" ".join(formatArg(a) for a in args))
        print
    if os.path.exists(input):
        try:
            result = subprocess.check_call(args)    # <-- highlighted
        except subprocess.CalledProcessError:       # <-- highlighted
            result = False                          # <-- highlighted
        if result:                                  # <-- highlighted
            print("%s : SUCCESS!" % output)
        else:
            print("%s : FAIL!" % output)
That section with the results should have been
    result_code = subprocess.check_call(args)
and, if the call failed, result_code should have been set equal to 1 (any non-zero value) instead of False.
So:
    if result_code == 0:
        do stuff
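Putting that together, processImage could lean on check_call's behaviour of returning 0 on success and raising on a non-zero exit (a sketch keeping the question's argument list and its formatArg helper):

import os
import subprocess

def processImage(input, output, verbose=True):
    args = ["convert", input, "-resize", "50x50", output]
    if verbose:
        print("Image: %s" % input)
        print(" ".join(formatArg(a) for a in args))
    if not os.path.exists(input):
        print("%s : FAIL! (input does not exist)" % input)
        return False
    try:
        # Returns 0 on success, raises CalledProcessError on a non-zero exit
        subprocess.check_call(args)
    except subprocess.CalledProcessError:
        print("%s : FAIL!" % output)
        return False
    print("%s : SUCCESS!" % output)
    return True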
Thank you for your attention!
Have a lovely day.

Python 3.3: LAN speed test which calculates the read & write speeds and then returns the figure to Excel

I'm creating a LAN speed test which creates a data file of a specified size in a specified location and records the speed at which it is created/read. For the most part this is working correctly; there is just one problem: the read speed is ridiculously fast, because all it's doing is timing how long it takes for the file to open, rather than how long it takes for the file to actually be read (if that makes sense?).
So far I have this:
import time
import pythoncom
from win32com.client import Dispatch
import os
# create file - write speed
myPath = input('Where do you want to write the file?')
size_MB = int(input('What sized file do you want to test with? (MB)'))
size_B = size_MB * 1024 * 1024
fName = '\pydatafile'
#start timer
start = time.clock()
f = open(myPath + fName,'w')
f.write("\x00" * size_B)
f.close()
# how much time it took
elapsed = (time.clock() -start)
print ("It took", elapsed, "seconds to write the", size_MB, "MB file")
time.sleep(1)
writeMBps = size_MB / elapsed
print("That's", writeMBps, "MBps.")
time.sleep(1)
writeMbps = writeMBps * 8
print("Or", writeMbps, "Mbps.")
time.sleep(2)
# open file - read speed
startRead = time.clock()
f = open(myPath + fName,'r')
# how much time it took
elapsedRead = (time.clock() - startRead)
print("It took", elapsedRead,"seconds to read the", size_MB,"MB file")
time.sleep(1)
readMBps = size_MB / elapsedRead
print("That's", readMBps,"MBps.")
time.sleep(1)
readMbps = readMBps * 8
print("Or", readMbps,"Mbps.")
time.sleep(2)
f.close()
# delete the data file
os.remove(myPath + fName)
# record results on Excel
xl = Dispatch('Excel.Application')
xl.visible= 0
wb = xl.Workbooks.Add(r'C:\File\Location')
ws = wb.Worksheets(1)
# Write speed result
#
# loop until empty cell is found in column
col = 1
row = 1
empty = False
while not empty:
    val = ws.Cells(row, col).value
    print("Looking for next available cell to write to...")
    if val == None:
        print("Writing result to cell")
        ws.Cells(row, col).value = writeMbps
        empty = True
    row += 1
# Read speed result
#
# loop until empty cell is found in column
col = 2
row = 1
empty = False
while not empty:
    val = ws.Cells(row, col).value
    print("Looking for next available cell to write to...")
    if val == None:
        print("Writing result to cell")
        ws.Cells(row, col).value = readMbps
        empty = True
    row += 1
xl.Run('Save')
xl.Quit()
pythoncom.CoUninitialize()
How can I make this so the read speed is correct?
Thanks a lot
Try to actually read the file:
f = open(myPath + fName, 'r')
f.read()
Or (if the file is too large to fit in memory):
f = open(myPath + fName, 'r')
while f.read(1024 * 1024):
    pass
But the operating system could still make the read fast by caching the file content. You've just written it there! And even if you manage to disable caching, your measurement (in addition to network speed) could include the time to write the data to the file server's disk.
If you want network speed only, you need to use two separate machines on the LAN. E.g. run an echo server on one machine (for example by enabling Simple TCP/IP services, or by writing and running your own). Then run a Python echo client on another machine that sends some data to the echo server, makes sure it receives the same data back, and measures the turnaround time.
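A sketch of such an echo client (the address is a placeholder, and a plain TCP echo service is assumed to be listening on the other machine); it alternates sending and reading so neither side's buffers fill up:

import socket
import time

HOST, PORT = '192.168.1.10', 7       # placeholder address; 7 is the classic echo port
CHUNK = 64 * 1024
TOTAL = 16 * 1024 * 1024             # 16 MB round-tripped

start = time.perf_counter()
with socket.create_connection((HOST, PORT)) as sock:
    buf = b'\x00' * CHUNK
    sent = received = 0
    while sent < TOTAL:
        sock.sendall(buf)
        sent += len(buf)
        need = len(buf)              # read the echo back before sending more
        while need:
            data = sock.recv(need)
            if not data:
                raise ConnectionError('echo server closed the connection')
            need -= len(data)
            received += len(data)
elapsed = time.perf_counter() - start

# Data crossed the wire twice (out and back), so count both directions
mbits = (sent + received) * 8 / 1e6
print("Round-tripped", TOTAL // (1024 * 1024), "MB in", elapsed, "seconds:", mbits / elapsed, "Mbit/s")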

Automatic background changer using Python 2.7.3 not working, though it should

I'm very new to Ubuntu/Python/Bash/Gnome in general, so I still feel like there's a chance I'm doing something wrong, but it's been 3 days now without success...
Here's what the script is supposed to do:
* [✓] Download 1 random image from wallbase.cc
* [✓] Save it to the same directory that the script is running from
* [x] Set it as the wallpaper
There are two attempts made to set the wallpaper using different commands, and NEITHER works when run from the script. There is a print statement (2nd line from the bottom) that spits out the correct terminal command, because I can copy and paste the printed result and it works fine; it just doesn't work when it's executed in the script.
#!/usr/bin/env python
import urllib2
import os
from gi.repository import Gio
response = urllib2.urlopen("http://wallbase.cc/random/12/eqeq/1366x768/0.000/100/32")
page_source = response.read()
thlink_pos = page_source.find("ico-X")
address_start = (page_source.find("href=\"", thlink_pos) + 6)
address_end = page_source.find("\"", address_start + 1)
response = urllib2.urlopen(page_source[address_start:address_end])
page_source = response.read()
bigwall_pos = page_source.find("bigwall")
address_start = (page_source.find("src=\"", bigwall_pos) + 5)
address_end = page_source.find("\"", address_start + 1)
address = page_source[address_start:address_end]
slash_pos = address.rfind("/") + 1
pic_name = address[slash_pos:]
bashCommand = "wget " + page_source[address_start:address_end]
os.system(bashCommand)
print "Does my new image exists?", os.path.exists(os.getcwd() + "/" + pic_name)
#attempt 1
settings = Gio.Settings.new("org.gnome.desktop.background")
settings.set_string("picture-uri", "file://" + os.getcwd() + "/" + pic_name)
settings.apply()
#attempt 2
bashCommand = "gsettings set org.gnome.desktop.background picture-uri file://" + os.getcwd() + "/" + pic_name
print bashCommand
os.system(bashCommand)
settings.apply()
You've successfully changed your settings, but they're still left unapplied; try adding
settings.apply()
after setting the "picture-uri" string.
It works for me (Ubuntu 12.04).
I've modified your script (unrelated to your error):
#!/usr/bin/python
"""Set desktop background using random images from http://wallbase.cc
It uses `gi.repository.Gio.Settings` to set the background.
"""
import functools
import itertools
import logging
import os
import posixpath
import random
import re
import sys
import time
import urllib
import urllib2
import urlparse
from collections import namedtuple
from bs4 import BeautifulSoup # $ sudo apt-get install python-bs4
from gi.repository.Gio import Settings # pylint: disable=F0401,E0611
DEFAULT_IMAGE_DIR = os.path.expanduser('~/Pictures/backgrounds')
HTMLPAGE_SIZE_MAX = 1 << 20 # bytes
TIMEOUT_MIN = 300 # seconds
TIMEOUT_DELTA = 30 # jitter
# "Anime/Manga", "Wallpapers/General", "High Resolution Images"
CATEGORY_W, CATEGORY_WG, CATEGORY_HR = range(1, 4)
PURITY_SFW, PURITY_SKETCHY, PURITY_NSFW, PURITY_DEFAULT = 4, 2, 1, 0
DAY_IN_SECONDS = 86400
UrlRetreiveResult = namedtuple('UrlRetreiveResult', "path headers")
def set_background(image_path, check_exist=True):
"""Change desktop background to image pointed by `image_path`.
"""
if check_exist: # make sure we can read it (at this time)
with open(image_path, 'rb') as f:
f.read(1)
# prepare uri
path = os.path.abspath(image_path)
if isinstance(path, unicode): # quote() doesn't like unicode
path = path.encode('utf-8')
uri = 'file://' + urllib.quote(path)
# change background
bg_setting = Settings.new('org.gnome.desktop.background')
bg_setting.set_string('picture-uri', uri)
bg_setting.apply()
def url2filename(url):
"""Return basename corresponding to url.
>>> url2filename('http://example.com/path/to/file?opt=1')
'file'
"""
urlpath = urlparse.urlsplit(url).path # pylint: disable=E1103
basename = posixpath.basename(urllib.unquote(urlpath))
if os.path.basename(basename) != basename:
raise ValueError # refuse 'dir%5Cbasename.ext' on Windows
return basename
def download(url, dirpath, extensions=True, filename=None):
"""Download url to dirpath.
Use basename of the url path as a filename.
Create destination directory if necessary.
Use `extensions` to require the file to have an extension or any
of in a given sequence of extensions.
Return (path, headers) on success.
Don't retrieve url if path exists (headers are None in this case).
"""
if not os.path.isdir(dirpath):
os.makedirs(dirpath)
logging.info('created directory %s', dirpath)
# get filename from the url
filename = url2filename(url) if filename is None else filename
if os.path.basename(filename) != filename:
logging.critical('filename must not have path separator in it "%s"',
filename)
return
if extensions:
# require the file to have an extension
root, ext = os.path.splitext(filename)
if root and len(ext) > 1:
# require the extension to be in the list
try:
it = iter(extensions)
except TypeError:
pass
else:
if ext not in it:
logging.warn(("file extension is not in the list"
" url=%s"
" extensions=%s"),
url, extensions)
return
else:
logging.warn("file has no extension url=%s", url)
return
# download file
path = os.path.join(dirpath, filename)
logging.info("%s\n%s", url, path)
if os.path.exists(path): # don't retrieve if path exists
logging.info('path exists')
return UrlRetreiveResult(path, None)
try:
return UrlRetreiveResult(*urllib.urlretrieve(url, path,
_print_download_status))
except IOError:
logging.warn('failed to download {url} -> {path}'.format(
url=url, path=path))
def _print_download_status(block_count, block_size, total_size):
logging.debug('%10s bytes of %s', block_count * block_size, total_size)
def min_time_between_calls(min_delay):
"""Enforce minimum time delay between calls."""
def decorator(func):
lastcall = [None] # emulate nonlocal keyword
@functools.wraps(func)
def wrapper(*args, **kwargs):
if lastcall[0] is not None:
delay = time.time() - lastcall[0]
if delay < min_delay:
_sleep(min_delay - delay)
lastcall[0] = time.time()
return func(*args, **kwargs)
return wrapper
return decorator
@min_time_between_calls(5)
def _makesoup(url):
try:
logging.info(vars(url) if isinstance(url, urllib2.Request) else url)
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read(HTMLPAGE_SIZE_MAX))
return soup
except (IOError, OSError) as e:
logging.warn('failed to return soup for %s, error: %s',
getattr(url, 'get_full_url', lambda: url)(), e)
class WallbaseImages:
"""Given parameters it provides image urls to download."""
def __init__(self,
categories=None, # default; sequence of CATEGORY_*
resolution_exactly=True, # False means 'at least'
resolution=None, # all; (width, height)
aspect_ratios=None, # all; sequence eg, [(5,4),(16,9)]
purity=PURITY_DEFAULT, # combine with |
thumbs_per_page=None, # default; an integer
):
"""See usage below."""
self.categories = categories
self.resolution_exactly = resolution_exactly
self.resolution = resolution
self.aspect_ratios = aspect_ratios
self.purity = purity
self.thumbs_per_page = thumbs_per_page
def _as_request(self):
"""Create a urllib2.Request() using given parameters."""
# make url
if self.categories is not None:
categories = "".join(str(n) for n in (2, 1, 3)
if n in self.categories)
else: # default
categories = "0"
if self.resolution_exactly:
at_least_or_exactly_resolution = "eqeq"
else:
at_least_or_exactly_resolution = "gteq"
if self.resolution is not None:
resolution = "{width:d}x{height:d}".format(
width=self.resolution[0], height=self.resolution[1])
else:
resolution = "0x0"
if self.aspect_ratios is not None:
aspect_ratios = "+".join("%.2f" % (w / float(h),)
for w, h in self.aspect_ratios)
else: # default
aspect_ratios = "0"
purity = "{0:03b}".format(self.purity)
thumbs = 20 if self.thumbs_per_page is None else self.thumbs_per_page
url = ("http://wallbase.cc/random/"
"{categories}/"
"{at_least_or_exactly_resolution}/{resolution}/"
"{aspect_ratios}/"
"{purity}/{thumbs:d}").format(**locals())
logging.info(url)
# make post data
data = urllib.urlencode(dict(query='', board=categories, nsfw=purity,
res=resolution,
res_opt=at_least_or_exactly_resolution,
aspect=aspect_ratios,
thpp=thumbs))
req = urllib2.Request(url, data)
return req
def __iter__(self):
"""Yield background image urls."""
# find links to bigwall pages
# css-like: #thumbs div[class="thumb"] \
# a[class~="thlink" and href^="http://"]
soup = _makesoup(self._as_request())
if not soup:
logging.warn("can't retrieve the main page")
return
thumbs_soup = soup.find(id="thumbs")
for thumb in thumbs_soup.find_all('div', {'class': "thumb"}):
bigwall_a = thumb.find('a', {'class': "thlink",
'href': re.compile(r"^http://")})
if bigwall_a is None:
logging.warn("can't find thlink link")
continue # try the next thumb
# find image url on the bigwall page
# css-like: #bigwall > img[alt and src^="http://"]
bigwall_soup = _makesoup(bigwall_a['href'])
if bigwall_soup is not None:
bigwall = bigwall_soup.find(id='bigwall')
if bigwall is not None:
img = bigwall.find('img',
src=re.compile(r"(?i)^http://.*\.jpg$"),
alt=True)
if img is not None:
url = img['src']
filename = url2filename(url)
if filename.lower().endswith('.jpg'):
yield url, filename # successfully found image url
else:
logging.warn('suspicious url "%s"', url)
continue
logging.warn("can't parse bigwall page")
def main():
level = logging.INFO
if '-d' in sys.argv:
sys.argv.remove('-d')
level = logging.DEBUG
# configure logging
logging.basicConfig(format='%(levelname)s: %(asctime)s %(message)s',
level=level, datefmt='%Y-%m-%d %H:%M:%S %Z')
if len(sys.argv) > 1:
backgrounds_dir = sys.argv[1]
else:
backgrounds_dir = DEFAULT_IMAGE_DIR
# infinite loop: Press Ctrl+C to interrupt it
#NOTE: here's some arbitrary logic: modify for you needs e.g., break
# after the first image found
timeout = TIMEOUT_MIN # seconds
for i in itertools.cycle(xrange(timeout, DAY_IN_SECONDS)):
found = False
try:
for url, filename in WallbaseImages(
categories=[CATEGORY_WG, CATEGORY_HR, CATEGORY_W],
purity=PURITY_SFW,
thumbs_per_page=60):
res = download(url, backgrounds_dir, extensions=('.jpg',),
filename=filename)
if res and res.path:
found = True
set_background(res.path)
# don't hammer the site
timeout = max(TIMEOUT_MIN, i % DAY_IN_SECONDS)
_sleep(random.randint(timeout, timeout + TIMEOUT_DELTA))
except Exception: # pylint: disable=W0703
logging.exception('unexpected error')
_sleep(timeout)
else:
if not found:
logging.error('failed to retrieve any images')
_sleep(timeout)
timeout = (timeout * 2) % DAY_IN_SECONDS
def _sleep(timeout):
"""Add logging to time.sleep() call."""
logging.debug('sleep for %s seconds', timeout)
time.sleep(timeout)
main()
Tried to implement a Python script that used the PIL library to write text on an image, then updated the GNOME background "picture-uri" to point to that image using the Gio class. The Python script would ping-pong between two images, always modifying the one not in use and then attempting to "switch" by updating the Settings. Did this to avoid any flicker, as modifying the current background directly drops it out temporarily. While in the shell and calling the script directly I rarely saw any issue, but in the cron job it simply wouldn't update on the pong. I used both sync and apply and would wait several minutes before trying to switch the images. Didn't work. Tried cron as the user (su -c "cmd" user) and that didn't work either.
Finally gave up on the ping-pong approach when I noticed that GNOME will detect any change in the background file and update. So I dropped the ping-pong method and went to a temp file that I just copy over the current background using the shutil library. Works like a charm.
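That last approach needs very little code (a sketch; the paths are placeholders, and the one-time Gio call just points picture-uri at the file that keeps being overwritten):

import shutil
from gi.repository import Gio

CURRENT = '/home/user/.backgrounds/current.jpg'   # placeholder: the file picture-uri points at
STAGING = '/tmp/next_background.jpg'              # placeholder: the newly rendered image

def point_background_at(path):
    """One-time setup so the background references the file we will keep overwriting."""
    settings = Gio.Settings.new('org.gnome.desktop.background')
    settings.set_string('picture-uri', 'file://' + path)
    settings.apply()

def swap_background():
    # GNOME watches the current background file and reloads it when its
    # contents change, so copying over it is enough from then on.
    shutil.copy(STAGING, CURRENT)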
