I have a project where I need to upload ~70 files to my Flask app. I'm learning concurrency right now, so this seems like perfect practice. When using print statements, the concurrent version of this function is about 2x to 2.5x faster than the synchronous function.
When actually writing to the SQLite database, though, it takes about the same amount of time.
Original func:
@app.route('/test_sync')
def auto_add():
    t0 = time.time()
    # Code does not work without changing directory. Better option?
    os.chdir('my_app/static/tracks')
    list_dir = os.listdir('my_app/static/tracks')
    # list_dir consists of .mp3 and .jpg files
    for filename in list_dir:
        if filename.endswith('.mp3'):
            try:
                thumbnail = [thumb for thumb in list_dir if thumb == filename[:-4] + '.jpg'][0]
            except Exception:
                print(f'ERROR - COULD NOT FIND THUMB for {filename}')
            resize_image(thumbnail)
            with open(filename, 'rb') as f, open(thumbnail, 'rb') as t:
                track = Track(
                    title=filename[15:-4],
                    artist='Sam Gellaitry',
                    description='No desc.',
                    thumbnail=t.read(),
                    binary_audio=f.read()
                )
        else:
            continue
        db.session.add(track)
    db.session.commit()
    elapsed = time.time() - t0
    return f'Uploaded all tracks in {elapsed} seconds.'
Concurrent func(s):
@app.route('/test_concurrent')
def auto_add_concurrent():
    t0 = time.time()
    MAX_WORKERS = 40
    os.chdir('/my_app/static/tracks')
    list_dir = os.listdir('/my_app/static/tracks')
    mp3_list = [x for x in list_dir if x.endswith('.mp3')]
    with futures.ThreadPoolExecutor(MAX_WORKERS) as executor:
        res = executor.map(add_one_file, mp3_list)
        for x in res:
            db.session.add(x)
    db.session.commit()
    elapsed = time.time() - t0
    return f'Uploaded all tracks in {elapsed} seconds.'
-----
def add_one_file(filename):
    list_dir = os.listdir('/my_app/static/tracks')
    try:
        thumbnail = [thumb for thumb in list_dir if thumb == filename[:-4] + '.jpg'][0]
    except Exception:
        print(f'ERROR - COULD NOT FIND THUMB for {filename}')
    resize_image(thumbnail)
    with open(filename, 'rb') as f, open(thumbnail, 'rb') as t:
        track = Track(
            title=filename[15:-4],
            artist='Sam Gellaitry',
            description='No desc.',
            thumbnail=t.read(),
            binary_audio=f.read()
        )
    return track
Here's the resize_image function, for completeness:
def resize_image(thumbnail):
    with Image.open(thumbnail) as img:
        img = img.resize((500, 500))  # resize() returns a new image; without reassigning, the original is saved unchanged
        img.save(thumbnail)
    return thumbnail
And benchmarks:
/test_concurrent (with print statements)
Uploaded all tracks in 0.7054300308227539 seconds.
/test_sync
Uploaded all tracks in 1.8661110401153564 seconds.
------
/test_concurrent (with db.session.add/db.session.commit)
Uploaded all tracks in 5.303245782852173 seconds.
/test_sync
Uploaded all tracks in 6.123792886734009 seconds.
What am I doing wrong with this concurrent code, and how can I optimize it?
It seems that the DB writes dominate your timings, and they do not usually benefit from parallelization when writing many rows to the same table, or, in the case of SQLite, the same database. Instead of adding the ORM objects one by one to the session, perform a bulk insert:
db.session.bulk_save_objects(list(res))
In your current code the ORM has to insert the Track objects one at a time during the flush just before the commit, in order to fetch their primary keys after the insert. Session.bulk_save_objects does not do that by default, which means the objects are less usable afterwards (they are not added to the session, for example), but that does not seem to be an issue in your case.
"I’m inserting 400,000 rows with the ORM and it’s really slow!" is a good read on the subject.
As a side note, when working with files it is best to avoid TOCTOU situations, if possible. In other words, don't use
thumbnail = [thumb for thumb in list_dir if thumb == filename[:-4] + '.jpg'][0]
to check whether the file exists. Use os.path.isfile() or the like if you must, but it is better to just try to open the file and handle the error if it cannot be opened:
thumbnail = filename[:-4] + '.jpg'
try:
    resize_image(thumbnail)
except FileNotFoundError:
    print(f'ERROR - COULD NOT FIND THUMB for {filename}')
    # Note that the latter open attempt will fail as well, if this fails
...
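Applied to add_one_file, the whole helper might then look roughly like this (a sketch only; the caller would have to skip None results before the bulk insert):
def add_one_file(filename):
    thumbnail = filename[:-4] + '.jpg'
    try:
        resize_image(thumbnail)
    except FileNotFoundError:
        print(f'ERROR - COULD NOT FIND THUMB for {filename}')
        return None  # caller should filter these out before db.session.bulk_save_objects

    with open(filename, 'rb') as f, open(thumbnail, 'rb') as t:
        return Track(
            title=filename[15:-4],
            artist='Sam Gellaitry',
            description='No desc.',
            thumbnail=t.read(),
            binary_audio=f.read()
        )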
I'm trying to speed up a CPU-bound Python script (on Windows 11). Threads in Python do not seem to run on a different CPU core, so the only option I have is multiprocessing.
I have a big dictionary data structure (11 GB memory footprint after loading from file) that I check calculated values against, to see whether they are in that dictionary. The input for the calculation also comes from a file (100 GB in size). This input I can pool-map to the processes in batches, no problem. But I cannot copy the dictionary to all processes because there is not enough memory for that. So I need to find a way for the processes to check whether a value (actually a string) is in the dictionary.
Any advice?
Pseudo program flow:
--main--
- load dictionary structure from file # 11 GB memory footprint
- ...
- While not all chunks loaded
- Load chunk of calcdata from file # (10,000 lines per chunk)
- Distribute (map) calcdata chunk to processes
- Wait for processes to complete all chunks
--process--
- for each element in subchunk
- perform calculation
- check if calculation is in dictionary # here is my problem!
- store result in file
Edit: after implementing the comments below, I am now at:
def ReadDictFromFile():
    cnt = 0
    print("Reading dictionary from " + dictfilename)
    with open(dictfilename, encoding="utf-8", errors="replace") as f:
        next(f)  # skip first line (header)
        for line in f:
            s = line.rstrip("\n")
            (key, keyvalue) = s.split()
            shared_dict[str(key)] = keyvalue
            cnt = cnt + 1
            if (cnt % 1000000) == 0:  # log each 1000000 where we are
                print(cnt)
                return  # temp to speed up testing, not load whole dictionary atm
    print("Done loading dictionary")
def checkqlist(qlist):
    print(str(os.getpid()) + "-" + str(len(qlist)))
    for li in qlist:
        try:
            checkvalue = calculations(li)
            (found, keyval) = InMem(checkvalue)
            if found:
                print("FOUND!!! " + checkvalue + ' ' + keyval)
        except Exception as e:
            print("(" + str(os.getpid()) + ") Error log: %s" % repr(e))
            time.sleep(15)
def InMem(checkvalue):
    if checkvalue in shared_dict:
        return True, shared_dict[checkvalue]
    else:
        return False, ""
if __name__ == "__main__":
start_time = time.time()
global shared_dict
manager = Manager()
shared_dict = manager.dict()
ReadDictFromFile()
chunksize=5
nr_of_processes = 10
with open(filetocheck, encoding=("utf-8"), errors=("replace")) as f:
qlist = []
for line in f:
s = line.rstrip("\n")
qlist.append(s)
if (len(qlist) >= (chunksize * nr_of_processes)):
chunked_list = [qlist[i:i+chunk_size] for i in range(0, len(qlist), chunk_size)]
try:
with multiprocessing.Pool() as pool:
pool.map(checkqlist, chunked_list, nr_of_processes) #problem: qlist is a single string, not a list of about 416 strings.
except Exception as e:
print("error log: %s" % repr(e))
time.sleep(15)
logit("Completed! " + datetime.datetime.now().strftime("%I:%M%p on %B %d, %Y"))
print("--- %s seconds ---" % (time.time() - start_time))
You can use a multiprocessing.Manager.dict for this; it's the fastest IPC you can use to do the check between processes in Python. To shrink the memory footprint, just change all the values to None. On my PC it can do about 33k membership checks every second, roughly 400 times slower than a normal dictionary.
manager = Manager()
shared_dict = manager.dict()
shared_dict.update({x:None for x in main_dictionary})
shared_dict["new_element"] = None # to set another value
del shared_dict["new_element"] # to delete a certain value
You can also use a dedicated in-memory database for this, like Redis, which can handle being polled by multiple processes at the same time.
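For illustration, a membership check against Redis could look roughly like this; it assumes a locally running Redis server and the redis-py client, neither of which is part of your current setup:
import redis

r = redis.Redis(host='localhost', port=6379, db=0)

# One-time load: only the keys matter for a membership test
# (for millions of keys you would execute the pipeline in batches)
pipe = r.pipeline()
for key in main_dictionary:
    pipe.set(key, 1)
pipe.execute()

# Inside any worker process
def in_store(checkvalue):
    return r.exists(checkvalue) == 1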
@Sam Mason's suggestion to use WSL and fork may be better, but this one is the most portable.
Edit: to store it in the children's global scope you have to pass it through the initializer.
def define_global(var):
    global shared_dict
    shared_dict = var

...

if __name__ == "__main__":
    ...
    with multiprocessing.Pool(initializer=define_global, initargs=(shared_dict,)) as pool:
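Putting the pieces together, a minimal runnable sketch of the whole pattern (the dictionary contents and work chunks below are placeholders):
import multiprocessing
from multiprocessing import Manager

def define_global(var):
    global shared_dict
    shared_dict = var

def checkqlist(qlist):
    # Membership tests go through the manager proxy in each worker
    return [item for item in qlist if item in shared_dict]

if __name__ == "__main__":
    manager = Manager()
    shared_dict = manager.dict()
    shared_dict.update({"abc": None, "xyz": None})   # placeholder contents

    chunks = [["abc", "foo"], ["xyz", "bar"]]        # placeholder work
    with multiprocessing.Pool(4, initializer=define_global,
                              initargs=(shared_dict,)) as pool:
        for hits in pool.map(checkqlist, chunks):
            print(hits)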
I have a very interesting case. I have built a file management system in Python which moves files from source to destination or archive every time I run it. Now I want to make two tables in MySQL (using Python) that actually monitor the file management system.
The first table monitors the last time the file management system ran. So it is just a small table with one column and one row which contains the following information: Last run: 1-1-2020 10:30.
The second table has to give me all the content of the last file or files which were moved from source to destination, in table form.
Every time I run my Python script, two things need to happen: 1. the files are moved, and 2. the MySQL monitoring tables are updated. Does anyone know how this needs to be done? Please note I'm using MySQL Workbench 8.0. Thank you indeed.
Here is the code I have right now for moving the files.
import os
import time
from datetime import datetime
import pathlib

SOURCE = r'C:\Users\AM\Desktop\Source'
DESTINATION = r'C:\Users\AM\Desktop\Destination'
ARCHIVE = r'C:\Users\AM\Desktop\Archive'


def get_time_difference(date, time_string):
    """
    You may want to modify this logic to change the way the time difference is calculated.
    """
    time_difference = datetime.now() - datetime.strptime(f"{date} {time_string}", "%d-%m-%Y %H:%M")
    hours = time_difference.total_seconds() // 3600
    minutes = (time_difference.total_seconds() % 3600) // 60
    return f"{int(hours)}:{int(minutes)}"


def move_and_transform_file(file_path, dst_path, delimiter="\t"):
    """
    Reads the data from the old file, writes it into the new file and then
    deletes the old file.
    """
    with open(file_path, "r") as input_file, open(dst_path, "w") as output_file:
        data = {
            "Date": None,
            "Time": None,
            "Power": None,
        }
        time_difference_seen = False
        for line in input_file:
            (line_id, item, line_type, value) = line.strip().split()
            if item in data:
                data[item] = value
            if not time_difference_seen and data["Date"] is not None and data["Time"] is not None:
                time_difference = get_time_difference(data["Date"], data["Time"])
                time_difference_seen = True
                print(delimiter.join([line_id, "TimeDif", line_type, time_difference]), file=output_file)
            if item == "Power":
                value = str(int(value) * 10)
            print(delimiter.join((line_id, item, line_type, value)), file=output_file)
    os.remove(file_path)


def process_files(all_file_paths, newest_file_path, subdir):
    """
    For each file, decide where to send it, then perform the transformation.
    """
    for file_path in all_file_paths:
        if file_path == newest_file_path and os.path.getctime(newest_file_path) < time.time() - 120:
            dst_root = DESTINATION
        else:
            dst_root = ARCHIVE
        dst_path = os.path.join(dst_root, subdir, os.path.basename(file_path))
        move_and_transform_file(file_path, dst_path)


def main():
    """
    Gather the files from the directories and then process them.
    """
    for subdir in os.listdir(SOURCE):
        subdir_path = os.path.join(SOURCE, subdir)
        if not os.path.isdir(subdir_path):
            continue
        all_file_paths = [
            os.path.join(subdir_path, p)
            for p in os.listdir(subdir_path)
            if os.path.isfile(os.path.join(subdir_path, p))
        ]
        if all_file_paths:
            newest_path = max(all_file_paths, key=os.path.getctime)
            process_files(all_file_paths, newest_path, subdir)


if __name__ == "__main__":
    main()
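For the monitoring part, one possible approach is a small logging function called at the end of main(). This is only a sketch: it assumes the mysql-connector-python package and a schema you create yourself, so the credentials, table and column names below are placeholders.
import mysql.connector
from datetime import datetime

def log_run_to_mysql(moved_files):
    # Placeholder credentials and schema; adjust to your MySQL setup
    conn = mysql.connector.connect(host="localhost", user="youruser",
                                   password="yourpassword", database="filemanagement")
    cur = conn.cursor()

    # Table 1: a single row holding the last run timestamp
    cur.execute("REPLACE INTO last_run (id, ran_at) VALUES (1, %s)",
                (datetime.now().strftime("%d-%m-%Y %H:%M"),))

    # Table 2: one row per file moved from source to destination in this run
    cur.executemany("INSERT INTO moved_files (file_name, moved_at) VALUES (%s, %s)",
                    [(name, datetime.now()) for name in moved_files])

    conn.commit()
    cur.close()
    conn.close()
In this sketch the second table only records which files were moved and when; storing the parsed content of those files would follow the same pattern with more columns. process_files() would also need to collect and return the destination paths of the files it moved so that main() can pass them in.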
I am trying to write a Python script that scans a folder and collects updated SQL scripts, and then automatically pulls data for each SQL script. In the code, a while loop scans for new SQL files and sends them to the data pull function. I am having trouble understanding how to make a dynamic queue with the while loop, but also have multiprocessing run the tasks in the queue.
The problem with the following code is that the while loop iteration will work on a long job before it moves to the next iteration and collects other jobs to fill the vacant processor.
Update:
Thanks to @pbacterio for catching the bug; the error message is now gone. After changing the code, the Python script can take all the job scripts during one iteration and distribute them to four processors. However, it gets hung up on a long job before going to the next iteration to scan and submit the newly added job scripts. Any idea how to restructure the code?
I finally figured out the solution; see the answer below. It turned out that what I was looking for is
the_queue = Queue()
the_pool = Pool(4, worker_main,(the_queue,))
For those who stumble on a similar idea, the following is the whole architecture of this automation script, which turns a shared drive into a 'server for SQL pulling' or any other job-queue 'server'.
a. The Python script auto_data_pull.py, as shown in the answer. You need to add your own job function.
b. A 'batch script' with the following:
start C:\Anaconda2\python.exe C:\Users\bin\auto_data_pull.py
c. Add a task triggered by computer startup that runs the 'batch script'.
That's all. It works.
Python Code:
from glob import glob
import os, time
import sys
import csv
import re
import subprocess
import pandas as PD
import pypyodbc
from multiprocessing import Process, Queue, current_process, freeze_support
#
# Function run by worker processes
#
def worker(input, output):
    for func, args in iter(input.get, 'STOP'):
        result = compute(func, args)
        output.put(result)

#
# Function used to compute result
#
def compute(func, args):
    result = func(args)
    return '%s says that %s%s = %s' % \
        (current_process().name, func.__name__, args, result)

def query_sql(sql_file):  # test func
    # jsl file processing and SQL querying, data table will be saved to csv.
    fo_name = os.path.splitext(sql_file)[0] + '.csv'
    fo = open(fo_name, 'w')
    print sql_file
    fo.write("sql_file {0} is done\n".format(sql_file))
    return "Query is done for \n".format(sql_file)

def check_files(path):
    """
    arguments -- root path to monitor
    returns -- dictionary of {file: timestamp, ...}
    """
    sql_query_dirs = glob(path + "/*/IDABox/")
    files_dict = {}
    for sql_query_dir in sql_query_dirs:
        for root, dirs, filenames in os.walk(sql_query_dir):
            [files_dict.update({(root + filename): os.path.getmtime(root + filename)})
             for filename in filenames if filename.endswith('.jsl')]
    return files_dict

##### working in single thread
def single_thread():
    path = "Y:/"
    before = check_files(path)
    sql_queue = []
    while True:
        time.sleep(3)
        after = check_files(path)
        added = [f for f in after if not f in before]
        deleted = [f for f in before if not f in after]
        overlapped = list(set(list(after)) & set(list(before)))
        updated = [f for f in overlapped if before[f] < after[f]]
        before = after
        sql_queue = added + updated
        # print sql_queue
        for sql_file in sql_queue:
            try:
                query_sql(sql_file)
            except:
                pass

##### not working in queue
def multiple_thread():
    NUMBER_OF_PROCESSES = 4
    path = "Y:/"
    sql_queue = []
    before = check_files(path)  # get the current dictionary of sql_files
    task_queue = Queue()
    done_queue = Queue()
    while True:  # while loop to check the changes of the files
        time.sleep(5)
        after = check_files(path)
        added = [f for f in after if not f in before]
        deleted = [f for f in before if not f in after]
        overlapped = list(set(list(after)) & set(list(before)))
        updated = [f for f in overlapped if before[f] < after[f]]
        before = after
        sql_queue = added + updated

        TASKS = [(query_sql, sql_file) for sql_file in sql_queue]
        # Create queues
        # submit tasks
        for task in TASKS:
            task_queue.put(task)
        for i in range(NUMBER_OF_PROCESSES):
            p = Process(target=worker, args=(task_queue, done_queue)).start()
        # try:
        #     p = Process(target=worker, args=(task_queue))
        #     p.start()
        # except:
        #     pass

        # Get and print results
        print 'Unordered results:'
        for i in range(len(TASKS)):
            print '\t', done_queue.get()
        # Tell child processes to stop
        for i in range(NUMBER_OF_PROCESSES):
            task_queue.put('STOP')

# single_thread()
if __name__ == '__main__':
    # freeze_support()
    multiple_thread()
Reference:
monitor file changes with python script: http://timgolden.me.uk/python/win32_how_do_i/watch_directory_for_changes.html
Multiprocessing:
https://docs.python.org/2/library/multiprocessing.html
Where did you define sql_file in multiple_thread(), in
multiprocessing.Process(target=query_sql, args=(sql_file)).start()
You have not defined sql_file in that method, and moreover you have used that variable in a for loop; the variable's scope is confined to the for loop.
Try replacing this:
result = func(*args)
by this:
result = func(args)
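The reason is that each task's args is a single string (the file path), so the star unpacks it character by character. A quick illustration with a hypothetical file name:
def query_sql(sql_file):
    return "Query is done for {0}\n".format(sql_file)

args = "report.jsl"
query_sql(args)    # OK: one argument, the whole path
query_sql(*args)   # TypeError: too many arguments, one per character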
I have figured this out. Thank you for the responses that inspired the thought.
Now the script can run a while loop to monitor the folder for newly updated/added SQL scripts, and then distribute the data pulling to multiple processes. The solution comes from queue.get() and queue.put(). I assume the queue object takes care of the communication by itself.
This is the final code:
from glob import glob
import os, time
import sys
import pypyodbc
from multiprocessing import Process, Queue, Event, Pool, current_process, freeze_support
def query_sql(sql_file):  # test func
    # jsl file processing and SQL querying, data table will be saved to csv.
    fo_name = os.path.splitext(sql_file)[0] + '.csv'
    fo = open(fo_name, 'w')
    print sql_file
    fo.write("sql_file {0} is done\n".format(sql_file))
    return "Query is done for \n".format(sql_file)

def check_files(path):
    """
    arguments -- root path to monitor
    returns -- dictionary of {file: timestamp, ...}
    """
    sql_query_dirs = glob(path + "/*/IDABox/")
    files_dict = {}
    try:
        for sql_query_dir in sql_query_dirs:
            for root, dirs, filenames in os.walk(sql_query_dir):
                [files_dict.update({(root + filename): os.path.getmtime(root + filename)})
                 for filename in filenames if filename.endswith('.jsl')]
    except:
        pass
    return files_dict

def worker_main(queue):
    print os.getpid(), "working"
    while True:
        item = queue.get(True)
        query_sql(item)

def main():
    the_queue = Queue()
    the_pool = Pool(4, worker_main, (the_queue,))
    path = "Y:/"
    before = check_files(path)  # get the current dictionary of sql_files
    while True:  # while loop to check the changes of the files
        time.sleep(5)
        sql_queue = []
        after = check_files(path)
        added = [f for f in after if not f in before]
        deleted = [f for f in before if not f in after]
        overlapped = list(set(list(after)) & set(list(before)))
        updated = [f for f in overlapped if before[f] < after[f]]
        before = after
        sql_queue = added + updated
        if sql_queue:
            for jsl_file in sql_queue:
                try:
                    the_queue.put(jsl_file)
                except:
                    print "{0} failed with error {1}. \n".format(jsl_file, str(sys.exc_info()[0]))
                    pass
        else:
            pass

if __name__ == "__main__":
    main()
I am working on an aggregation platform. We want to store resized versions of 'aggregated' images from the web on our servers. To be specific, these images are of e-commerce products from different vendors. The 'item' dictionary has an "image" field which is a URL that needs to be downloaded, compressed and saved to disk.
Download and compression method:
def downloadCompressImage(url, width, item):
    # Retrieve our source image from a URL
    # Load the URL data into an image
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    response = opener.open(url)
    img = cStringIO.StringIO(response.read())
    im = Image.open(img)
    wpercent = (width / float(im.size[0]))
    hsize = int((float(im.size[1]) * float(wpercent)))

    # Resize the image
    im2 = im.resize((width, hsize), Image.ANTIALIAS)
    key_name = item["vendor"] + "_" + hashlib.md5(url.encode('utf-8')).hexdigest() + "_" + str(width) + "x" + str(hsize) + ".jpg"
    path = "/var/www/html/server/images/"
    path = path + timestamp + "/"

    # Save compressed image to disk
    im2.save(path + key_name, 'JPEG', quality=85)
    url = "http://server.com/images/" + timestamp + "/" + key_name
    return url
worker method:
def worker(lines):
    """Make a dict out of the parsed, supplied lines"""
    result = []
    for line in lines:
        line = line.rstrip('\n')
        item = json.loads(line.decode('ascii', 'ignore'))
        #
        # Do stuff with the item dict and update it
        #
        # Append item to result if image dl and compression is successful
        try:
            item["grid_image"] = downloadCompressImage(item["image"], 200, item)
        except:
            print "dl-comp exception in processing: " + item['name'] + item['vendor']
            traceback.print_exc(file=sys.stdout)
            continue
        if item["grid_image"] != -1:
            result.append(item)
    return result
main method:
if __name__ == '__main__':
    # configurable options. different values may work better.
    numthreads = 15
    numlines = 1000
    lines = open('parserProducts.json').readlines()
    # create the process pool
    pool = multiprocessing.Pool(processes=numthreads)
    for result_lines in pool.imap(worker, (lines[line:line + numlines] for line in xrange(0, len(lines), numlines))):
        for line in result_lines:
            jdata = json.dumps(line)
            f.write(jdata + ',\n')
    pool.close()
    pool.join()
    f.seek(-2, os.SEEK_END)
    f.truncate()
    f.write(']')
    print "parsing is done"
My question:
Is this the best I can do with Python? The count of dictionary items is ~3 million. Without calling the "downloadCompressImage" method in 'worker', the "#Do stuff with the item dict and update it" portion takes only 8 minutes to complete. With compression though, it seems it would take weeks, if not months.
Any ideas appreciated, thanks a bunch.
You are working with 3 million images here, which are downloaded from the internet and then compressed. How much time that takes depends on two things, as far as I can tell:
Your network speed (and the speed of the target server), to download the images.
Your CPU power, to compress the images.
So it is not Python limiting you; you are doing fine with multiprocessing.Pool. The main bottlenecks are your network speed and the number of cores (or CPU power) you have.
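If you want to squeeze more out of a single machine, one option is to keep the downloads and the compression in separate pools, since they are limited by different resources. A rough sketch, written with Python 3's concurrent.futures rather than the Python 2 code above; the URLs, target width and output directory are placeholders:
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from io import BytesIO
from urllib.request import Request, urlopen

from PIL import Image

def download(url):
    # Network-bound: many threads can wait on sockets at once
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    return url, urlopen(req, timeout=30).read()

def compress(job):
    # CPU-bound: run in a separate process per core
    url, raw = job
    im = Image.open(BytesIO(raw))
    width = 200
    hsize = int(im.size[1] * width / float(im.size[0]))
    out_name = '/tmp/' + url.rsplit('/', 1)[-1] + '.jpg'   # placeholder output path
    im.resize((width, hsize)).save(out_name, 'JPEG', quality=85)
    return out_name

def run(urls, batch=1000):
    # Process one batch at a time to keep the downloaded bytes bounded in memory
    with ThreadPoolExecutor(max_workers=50) as tpe, ProcessPoolExecutor() as ppe:
        for start in range(0, len(urls), batch):
            downloaded = list(tpe.map(download, urls[start:start + batch]))
            for saved in ppe.map(compress, downloaded):
                print(saved)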
I'm not sure if anyone else has this problem, but I'm getting a "Too big query offset" exception when using a cursor for chaining tasks on the App Engine development server (not sure if it happens on live).
The error occurs when requesting a cursor after 4000+ records have been processed in a single query.
I wasn't aware that offsets had anything to do with cursors; perhaps it's just a quirk in the App Engine SDK.
To fix it, either shorten the time allowed before the task is deferred (so fewer records get processed at a time), or, when checking the elapsed time, also check that the number of records processed is still within range, e.g. if time.time() > end_time or count == 2000. Reset the count and defer the task. 2000 is an arbitrary number; I'm not sure what the limit should be.
EDIT:
After making the above mentioned changes, the task never finishes executing. The with_cursor(cursor) code is being called, but it seems to start at the beginning each time. Am I missing something obvious?
The code that causes the exception is as follows:
The table "Transact" has 4800 rows. The error occurs when transacts.cursor() is called when time.time() > end_time is true. 4510 records have been processed at the time when the cursor is requested, which seems to cause the error (on development server, haven't tested elsewhere).
def some_task(trans):
    tts = db.get(trans)
    for t in tts:
        # logging.info('in some_task')
        pass

def test_cursor(request):
    ret = test_cursor_task()
    return HttpResponse('')

def test_cursor_task(cursor=None):
    startDate = datetime.datetime(2010, 7, 30)
    endDate = datetime.datetime(2010, 8, 30)
    end_time = time.time() + 20.0
    transacts = Transact.all().filter('transactionDate >', startDate).filter('transactionDate <=', endDate)
    count = 0
    if cursor:
        transacts.with_cursor(cursor)
    trans = []
    logging.info('queue_trans')
    for tran in transacts:
        count += 1
        # trans.append(str(tran))
        trans.append(str(tran.key()))
        if len(trans) == 20:
            deferred.defer(some_task, trans, _countdown=500)
            trans = []
        if time.time() > end_time:
            logging.info(count)
            if len(trans) > 0:
                deferred.defer(some_task, trans, _countdown=500)
                trans = []
            logging.info('time limit exceeded setting next call to queue')
            cursor = transacts.cursor()
            deferred.defer(test_cursor_task, cursor)
            logging.info('returning false')
            return False
    return True
Hope this helps someone.
Thanks
Bert
Try this again without using the iter functionality:
# ...
CHUNK = 500
objs = transacts.fetch(CHUNK)
for tran in objs:
    do_your_stuff
if len(objs) == CHUNK:
    deferred.defer(my_task_again, cursor=str(transacts.cursor()))
This works for me.
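For completeness, a rough sketch of the whole chained task built around fetch() and the cursor; it reuses the names from the question, and the chunk size and countdown are arbitrary:
def test_cursor_task(cursor=None):
    CHUNK = 500
    transacts = Transact.all() \
        .filter('transactionDate >', datetime.datetime(2010, 7, 30)) \
        .filter('transactionDate <=', datetime.datetime(2010, 8, 30))
    if cursor:
        transacts.with_cursor(cursor)

    objs = transacts.fetch(CHUNK)
    trans = []
    for tran in objs:
        trans.append(str(tran.key()))
        if len(trans) == 20:
            deferred.defer(some_task, trans, _countdown=500)
            trans = []
    if trans:
        deferred.defer(some_task, trans, _countdown=500)

    if len(objs) == CHUNK:
        # More rows probably remain: chain the next batch from where this one stopped
        deferred.defer(test_cursor_task, cursor=str(transacts.cursor()))
    return True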