Here's how I thread it:
t = Thread(target=s3_upload, args=(absolute_write_path,raw_unique_key))
t.start()
Here's the function that's called in the thread:
def s3_upload(file_path, key):
    conn = S3.AWSAuthConnection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
    # check if the bucket exists; if not, create it
    if S3_BUCKET_CHECK:
        if not conn.check_bucket_exists(S3_BUCKET_NAME).status == 200:
            conn.create_located_bucket(S3_BUCKET_NAME, S3_LOCATION)
    orig_file = open(file_path, "r")
    obj = S3Object(orig_file.read())
    conn.put(S3_BUCKET_NAME, key, obj)
    os.remove(file_path)
If I don't run it in a thread, it seems to work. But if I run it in a thread, it works up to the line where I call conn.put(), and nothing is printed from there onwards. Does anyone know why?
Thanks.
OK, solved it. The problem was that the default daemon value for the thread under Flask was True. I changed it to False (which I had assumed was the default) and now it works :)
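For reference, a minimal sketch of that fix, reusing the names from the snippet above (this is not the full app code): mark the upload thread as non-daemonic so the interpreter does not kill it mid-upload, and optionally join it.
from threading import Thread

t = Thread(target=s3_upload, args=(absolute_write_path, raw_unique_key))
t.daemon = False  # daemon threads are killed abruptly when the main process exits
t.start()
# optionally block until the upload has finished:
# t.join()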
I have written two scripts in python that will be working with data in the same directory. One script will be set to run every 5 minutes and will save data to the directory, and then once a day the other script will zip all of that data, archive it and delete the original files. To avoid the archiving script deleting files which may be being saved by the worker script, I want to create a system-wide mutex so that the worker script knows not to run while the archiver is doing its thing.
I've done some searching and seen that on unix-based systems, the generally accepted method of doing this is to attempt to lock a file. If you get the lock then great, go ahead and run, if you can't get it then you know the other process is already running. I've written the following code:
import fcntl
import traceback

class ProcessLock:
    def __init__(self, path_to_file, block):
        self.file_path = path_to_file
        try:
            options = fcntl.LOCK_EX
            if not block:
                options = options | fcntl.LOCK_NB
            self.file = open(path_to_file, 'w+')
            self.lock = fcntl.flock(file, options)
        except:
            print 'caught something: {}'.format(traceback.format_exc())
            self.file = None
            self.lock = None

    def is_locked(self):
        return self.lock is not None

    def unlock(self):
        self.lock = None
        self.file = None

def aquire_lock(lock_name):
    path = '/tmp/{}.lock'.format(lock_name)
    return ProcessLock(path, False)

def aquire_lock_blocking(lock_name):
    path = '/tmp/{}.lock'.format(lock_name)
    return ProcessLock(path, True)
However, for the life of me I can't get it to actually work. I have searched, and all of the samples I've seen and other questions posted on here seem to use the same code that I have. I've also tried both flock and lockf, but neither works. The call to open works correctly, but I get the following logged out to the console:
self.lock = fcntl.flock(file, options)
TypeError: descriptor 'fileno' of 'file' object needs an argument
I don't know enough about Python to be able to tell what this error means. Hopefully someone can see if I'm doing anything wrong. I'm running this in PyCharm on macOS.
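For comparison, a minimal sketch of the flock pattern described above (Python 2, hypothetical lock path). The traceback comes from passing the built-in file type to fcntl.flock(); the lock has to be taken on the opened file object. Note also that flock() returns None on success and raises IOError when a non-blocking lock cannot be acquired:
import fcntl

lock_file = open('/tmp/example.lock', 'w+')
try:
    # lock the opened file object, not the built-in 'file' type
    fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    print 'lock acquired, safe to run'
except IOError:
    print 'another process already holds the lock'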
Foreword
Okay, I have a really complex performance issue. I'm building a content management system, and one of the features is generating tons of .docx files from different templates. I started with Webodt + Abiword, but then the templates got too complex, so I had to switch my backend to Templated-docs + LibreOffice. This is where my problems started.
I use:
Python 2.7.12
Django==1.8.2
templated-docs==0.2.9
LibreOffice 5.1.5.2
Ubuntu 16.04
The actual problem
I have an API which handles .docx rendering. I will show one of the views as an example; they are all pretty similar:
@permission_classes((permissions.IsAdminUser,))
class BookDocxViewSet(mixins.RetrieveModelMixin, viewsets.GenericViewSet):
    def retrieve(self, request, *args, **kwargs):
        queryset = Pupils.objects.get(id=kwargs['pk'])
        serializer = StudentSerializer(queryset)
        context = dict(serializer.data)
        doc = fill_template('crm/docs/book.ott', context, output_format='docx')
        p = u'docs/books/%s/%s_%s_%s.doc' % (datetime.now().date(), context[u'surname'], context[u'name'], datetime.now().date())
        with open(doc, 'rb') as f:
            content = f.read()
            path = default_storage.save(p, ContentFile(content))
        return response.Response(u'/media/' + path)
When I call it the first time, it creates a .docx file, saves it to my default_storage and then returns a download link. But when I try to do it again, or do it with another method (which works with another template and context), my server just crashes without any logs. The last thing I see is either:
Process finished with exit code 77 if I call it with a little delay (more than one second)
Process finished with exit code 139 (interrupted by signal 11: SIGSEGV) if I call my method for the second time right away (in less than one second)
I tried to use a debugger; it said that my server crashes on this line:
doc = fill_template('crm/docs/book.ott', context, output_format='docx')
I bet what happens is:
When I call my method the first time, templated_docs starts the LibreOffice backend and then does not stop it.
When I call my method the second time, templated_docs tries to start the LibreOffice backend again, but it is already busy.
Questions
How do I debug LibreOffice to prove / refute my theory? (I guess I need to debug templated_docs instead)
Why do I get different exit codes depending on the delay?
Is this enough grounds to open an issue on GitHub?
How do I fix that?
UPD
It is not an issue with REST Framework or with not using FileResponse().
I already tried to test it with a regular view.
def get_document(request, *args, **kwargs):
    context = Pupils.objects.get(id=kwargs['pk']).__dict__
    doc = fill_template('crm/docs/book.ott', context, output_format='docx')
    p = u'%s_%s_%s' % (context[u'surname'], context[u'name'], datetime.now().date())
    return FileResponse(doc, p)
And the problem is the same.
UPD 2
Okay. This line is crashing my server:
# pylokit/lokit.py
self.lokit = lo.libreofficekit_hook(six.b(lo_path))
Okay, that was a bug in templated_docs. I was right: it happens because templated_docs is trying to start LibreOffice twice. As the pylokit documentation says:
The use of _exit() instead of default exit() is required because in
some circumstances LibreOffice segfaults on process exit.
It means the process that used pylokit should be killed afterwards. But we cannot kill the Django server, so I decided to use multiprocessing:
# templated_docs/__init__.py
# (uses Pipe and Process from the multiprocessing module)
if source_extension[1:] != output_format:
    lo_path = getattr(
        settings,
        'TEMPLATED_DOCS_LIBREOFFICE_PATH',
        '/usr/lib/libreoffice/program/')

    def f(conn):
        # run the LibreOffice conversion in a child process so its exit
        # cannot take down the Django server
        with Office(lo_path) as lo:
            conv_file = NamedTemporaryFile(delete=False,
                                           suffix='.%s' % output_format)
            with lo.documentLoad(str(dest_file.name)) as doc:
                doc.saveAs(conv_file.name)
            os.unlink(dest_file.name)
            conn.send(conv_file.name)
            conn.close()

    parent_conn, child_conn = Pipe()
    p = Process(target=f, args=(child_conn,))
    p.start()
    conv_file_name = parent_conn.recv()
    p.join()
    return conv_file_name
else:
    return dest_file.name
I opened an issue and made a pull request.
I have a function that parses a file and inserts the data into MySQL using SQLAlchemy. I've been running the function sequentially on the result of os.listdir() and everything works perfectly.
Because most of the time is spent reading the file and writing to the DB, I wanted to use multiprocessing to speed things up. Here is my pseudocode, as the actual code is too long:
def parse_file(filename):
    f = open(filename, 'rb')
    data = f.read()
    f.close()
    soup = BeautifulSoup(data, features="lxml", from_encoding='utf-8')
    # parse file here
    db_record = MyDBRecord(parsed_data)
    session.add(db_record)
    session.commit()

pool = mp.Pool(processes=8)
pool.map(parse_file, ['my_dir/' + filename for filename in os.listdir("my_dir")])
The problem I'm seeing is that the script hangs and never finishes. I usually get 48 of 63 records into the database. Sometimes it's more, sometimes it's less.
I've tried using pool.close() in combination with pool.join(), and neither seems to help.
How do I get this script to complete? What am I doing wrong? I'm using Python 2.7.8 on a Linux box.
You need to put all code which uses multiprocessing inside its own function. This stops it from recursively launching new pools when multiprocessing re-imports your module in the worker processes:
def parse_file(filename):
    ...

def main():
    pool = mp.Pool(processes=8)
    pool.map(parse_file, ['my_dir/' + filename for filename in os.listdir("my_dir")])

if __name__ == '__main__':
    main()
See the documentation about making sure your module is importable, as well as the advice for running on Windows(tm).
The problem was a combination of 2 things:
my pool code being called multiple times (thanks @Peter Wood)
my DB code opening too many sessions and/or sharing sessions
I made the following changes and everything works now:
Original File
def parse_file(filename):
    f = open(filename, 'rb')
    data = f.read()
    f.close()
    soup = BeautifulSoup(data, features="lxml", from_encoding='utf-8')
    # parse file here
    db_record = MyDBRecord(parsed_data)
    session = get_session()  # see below
    session.add(db_record)
    session.commit()

pool = mp.Pool(processes=8)
pool.map(parse_file, ['my_dir/' + filename for filename in os.listdir("my_dir")])
DB File
def get_session():
    engine = create_engine('mysql://root:root@localhost/my_db')
    Base.metadata.create_all(engine)
    Base.metadata.bind = engine
    db_session = sessionmaker(bind=engine)
    return db_session()
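A possible refinement, not part of the original fix: instead of building a new engine for every file, the engine and session factory could be created once per worker process with a Pool initializer (the names below are illustrative):
import multiprocessing as mp
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

Session = None  # one session factory per worker process

def init_worker():
    global Session
    engine = create_engine('mysql://root:root@localhost/my_db')
    Session = sessionmaker(bind=engine)

pool = mp.Pool(processes=8, initializer=init_worker)
pool.map(parse_file, ['my_dir/' + f for f in os.listdir("my_dir")])
parse_file() would then call Session() instead of get_session().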
I am using the Python watchdog module on a Windows 2012 server to monitor new files appearing on a shared drive. When watchdog notices the new file it kicks off a database restore process.
However, it seems that watchdog will attempt to restore the file the second it is created and not wait till the file has finished copying to the shared drive. So I changed the event to on_modified but there are two on_modified events, one when the file is initially being copied and one when it is finished being copied.
How can I handle the two on_modified events to only fire when the file being copied to the shared drive has finished?
What happens when multiple files are copied to the shared drive at the same time?
Here is my code
import time
import subprocess

from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class NewFile(FileSystemEventHandler):

    def process(self, event):
        if event.is_directory:
            return
        if event.event_type == 'modified':
            if getext(event.src_path) == 'gz':
                load_pgdump(event.src_path)

    def on_modified(self, event):
        self.process(event)

def getext(filename):
    "Get the file extension"
    file_ext = filename.split(".", 1)[1]
    return file_ext

def load_pgdump(src_path):
    restore = 'pg_restore command ' + src_path
    subprocess.call(restore, shell=True)

def main():
    event_handler = NewFile()
    observer = Observer()
    observer.schedule(event_handler, path='Y:\\', recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()

if __name__ == '__main__':
    main()
In your on_modified event, just wait until the file has finished being copied, by watching the file size.
Offering a Simpler Loop:
historicalSize = -1
while historicalSize != os.path.getsize(filename):
    historicalSize = os.path.getsize(filename)
    time.sleep(1)
print "file copy has now finished"
I'm using the following code to wait until the file has been copied (Windows only):
from ctypes import windll
import time

def is_file_copy_finished(file_path):
    finished = False

    GENERIC_WRITE = 1 << 30
    FILE_SHARE_READ = 0x00000001
    OPEN_EXISTING = 3
    FILE_ATTRIBUTE_NORMAL = 0x80

    if isinstance(file_path, str):
        file_path_unicode = file_path.decode('utf-8')
    else:
        file_path_unicode = file_path

    h_file = windll.Kernel32.CreateFileW(file_path_unicode, GENERIC_WRITE,
                                         FILE_SHARE_READ, None, OPEN_EXISTING,
                                         FILE_ATTRIBUTE_NORMAL, None)

    if h_file != -1:
        windll.Kernel32.CloseHandle(h_file)
        finished = True

    print 'is_file_copy_finished: ' + str(finished)
    return finished

def wait_for_file_copy_finish(file_path):
    while not is_file_copy_finished(file_path):
        time.sleep(0.2)

wait_for_file_copy_finish(r'C:\testfile.txt')
The idea is to try to open the file for writing, with read sharing only. The call will fail if someone else is still writing to it.
Enjoy ;)
I would add a comment, as this isn't an answer to your question but a different approach... but I don't have enough rep yet. You could try monitoring the file size; if it stops changing, you can assume the copy has finished:
copying = True
size2 = -1
while copying:
    size = os.path.getsize('name of file being copied')
    if size == size2:
        break
    else:
        size2 = os.path.getsize('name of file being copied')
        time.sleep(2)
On Linux you also get a close event. The solution would then be to wait with processing the file until it gets closed.
My approach would be to add on_closed handling.
class Handler(FileSystemEventHandler):

    def __init__(self):
        self.files_to_process = set()

    def dispatch(self, event):
        # only dispatch the events we care about
        _method_map = {
            'created': self.on_created,
            'closed': self.on_closed,
        }
        handler = _method_map.get(event.event_type)
        if handler is not None:
            handler(event)

    def on_created(self, event):
        self.files_to_process.add(event.src_path)

    def on_closed(self, event):
        self.files_to_process.remove(event.src_path)
        actual_processing(event.src_path)  # placeholder for the real work
I had a similar issue recently with watchdog. A rather simple but not very smart workaround was to check the change in file size in a while loop, using a two-element list: one slot for 'past', one for 'now'. Once the values are equal, the copying is finished.
Edit: something like this:
past = -1
now = os.path.getsize(file_path)  # file_path: the file being copied
value = [past, now]
while True:
    time.sleep(1)
    # shift: the old 'now' becomes 'past', read a fresh size for 'now'
    value = [value[1], os.path.getsize(file_path)]
    if value[0] == value[1]:
        break  # the size stopped changing, copying has finished
This works for me. Tested on Windows as well, with Python 3.7.
size_past = -1
while True:
    size_now = os.path.getsize(event.src_path)
    if size_now == size_past:
        log.debug("file has copied completely now size: %s", size_now)
        break
        # TODO: why sleep is not working here ?
    else:
        size_past = os.path.getsize(event.src_path)
        log.debug("file copying size: %s", size_past)
Old I know, but I recently came up with a solution for this exact problem. In my case, I was only concerned with wav and mp3 files. This function will ensure that only files that are completely copied will be sent to makerCore() because the created placeholder files do not have any extension and will always end up in 'not ready'. Once the file is completed it will trigger the watchdog module again except this time with an extension. This will work on multiple files simultaneously as well.
def on_created(event):
    # print(event)
    if str(event.src_path).endswith('.mp3') or str(event.src_path).endswith('.wav'):
        makerCore(event)
    else:
        print('not ready')
I am using a different approach that might not be the most elegant one, but it is easy to do on any platform if you have control over the side that copies the file.
Just add 'in-progress' to the name of the file until the copying is complete, and then rename the file. You can then have a while loop waiting for the file without 'in-progress' in its name to exist, and you're good.
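A minimal sketch of that idea, assuming you control the copying side (the paths are made up):
import os
import shutil
import time

src = 'backup.gz'
dst = r'Y:\backup.gz'

# copying side: write under a temporary name, rename when finished
shutil.copyfile(src, dst + '.in-progress')
os.rename(dst + '.in-progress', dst)  # watchers only ever see a complete backup.gz

# watching side: wait for the final name to appear
while not os.path.exists(dst):
    time.sleep(1)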
I've tried the check-filesize, wait, check-again routine many have suggested above, but it's not very reliable. To make it work better, I've added a check for whether the file is still locked:
file_done = False
file_size = -1

# first wait until the size stops changing
while file_size != os.path.getsize(file_path):
    file_size = os.path.getsize(file_path)
    time.sleep(1)

# then wait until the file is no longer locked by the copying process
while not file_done:
    try:
        os.rename(file_path, file_path)  # renaming a file onto itself fails while it is still locked
        file_done = True
    except OSError:
        time.sleep(1)
Following up on ravenwing's answer, more details about on_closed in watchdog can be found here.
As mentioned in that issue, there is no documentation available for on_closed yet, and it can only be used on Unix.
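A quick way to check whether your installed watchdog version exposes close events at all (this is an assumption about newer watchdog releases; even then, close events are only emitted by the inotify backend on Linux):
try:
    from watchdog.events import FileClosedEvent  # only emitted by the inotify observer
    print('this watchdog version supports on_closed')
except ImportError:
    print('no close events available; fall back to size polling')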
Excuse the unhelpful variable names and unnecessarily bloated code, but I just quickly whipped this together and haven't had time to optimise or tidy up yet.
I wrote this program to dump all the images my friend and I had sent to each other using a webcam photo sharing service ( 321cheese.com ) by parsing a message log for the URLs. The problem is that my multithreading doesn't seem to work.
At the bottom of my code, you'll see my commented-out non-multithreaded download method, which consistently produces the correct results (which is 121 photos in this case). But when I try to send this action to a new thread, the program sometimes downloads 112 photos, sometimes 90, sometimes 115 photos, etc, but never gives out the correct result.
Why would this create a problem? Should I limit the number of simultaneous threads (and how)?
import urllib
import thread

def getName(input):
    l = input.split(".com/")
    m = l[1]
    return m

def parseMessages():
    theFile = open('messages.html', 'r')
    theLines = theFile.readlines()
    theFile.close()
    theNewFile = open('new321.txt', 'w')
    for z in theLines:
        if "321cheese" in z:
            theNewFile.write(z)
    theNewFile.close()

def downloadImage(inputURL):
    urllib.urlretrieve(inputURL, "./grabNew/" + d)

parseMessages()

f = open('new321.txt', 'r')
lines = f.readlines()
f.close()

g = open('output.txt', 'w')

for x in lines:
    a = x.split("<a href=\"")
    b = a[1].split("\"")
    c = b[0]
    if ".png" in c:
        d = getName(c)
        g.write(c + "\n")
        thread.start_new_thread(downloadImage, (c,))
        ##downloadImage(c)

g.close()
There are multiple issues in your code.
The main issue is the use of the global name d in multiple threads. To fix it, pass the name explicitly as an argument to downloadImage().
The easy way (code-wise) to limit the number of concurrent downloads is to use concurrent.futures (available on Python 2 as futures) or multiprocessing.Pool:
#!/usr/bin/env python
import os
import urllib

from multiprocessing import Pool
from posixpath import basename
from urllib import unquote
from urlparse import urlsplit

download_dir = "grabNew"

def url2filename(url):
    return basename(unquote(urlsplit(url).path).decode('utf-8'))

def download_image(url):
    filename = None
    try:
        filename = os.path.join(download_dir, url2filename(url))
        return urllib.urlretrieve(url, filename), None
    except Exception as e:
        return (filename, None), e

def main():
    # get_urls() should yield the image URLs parsed from the message log
    pool = Pool(processes=10)
    for (filename, headers), error in pool.imap_unordered(download_image, get_urls()):
        pass  # do something with the downloaded file or handle an error

if __name__ == "__main__":
    main()
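For completeness, a sketch of the same concurrency limit using concurrent.futures, which the answer mentions as an alternative (on Python 2 this is the futures backport):
from concurrent.futures import ThreadPoolExecutor

def download_all(urls):
    with ThreadPoolExecutor(max_workers=10) as executor:
        for (filename, headers), error in executor.map(download_image, urls):
            pass  # do something with the downloaded file or handle an error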
Did you make sure your parsing is working correctly?
Also, you are launching too many threads.
And finally... threads in Python are FAKE! Use the multiprocessing module if you want real parallelism. But since the images are probably all from the same server, if you open a hundred connections to the same server at once, its firewall will probably start dropping your connections.