Run many filesystem operations in parallel and asynchronously

Run many filesystem operations in parallel and asynchronously - python

I have the following function, which returns a filesize of a file over HTTP:
def GetFileSize(url):
" Function gets a url and returns it's filesize in bytes "
url = url.replace(' ', '%20')
u = urllib2.urlopen(url)
meta = u.info()
file_size = int(meta.getheaders("Content-Length")[0])
return file_size
I would like to get the biggest file from a given links, and I wrote the following function for it:
def GetBiggestFile(links):
" Function gets a list of links and returns the biggest file and his size in bytes "
dic = {}
for link in links:
filename = link.split('/')[-1]
filesize = GetFileSize(link)
dic[link] = filesize
print "%s | %.2f MB" % (filename, filesize / 1024.0 / 1024.0)
biggest_file = max(dic, key=dic.get)
return biggest_file, dic[biggest_file]
My lists have dozens of links, therefore this scripts takes some time to complete. Using threading I can fetch the different filesizes synchronously and shorten the running time of the code.
I'm not so sure how to do it - I've tried using a decorator that makes the function run asynchronously:
def run_async(func):
" Decorator for running functions asynchronously. "
from threading import Thread
from functools import wraps
#wraps(func)
def async_func(*args, **kwargs):
func_hl = Thread(target = func, args = args, kwargs = kwargs)
func_hl.start()
return func_hl
return async_func
But I'm not sure how to make my code wait for all the answers before trying to determine who is the biggest file.
Thanks.

You'll be happier with multiprocessing.
Start with this example: http://docs.python.org/library/multiprocessing.html#using-a-pool-of-workers
Your GetFileSize function can be run in a process pool.
Since each process is separate, you should also have an "output Queue" into which the results are put. A separate process does a simple "get" to retrieve all the answers from the Queue.

Related

How do I count the number of line in a FTP file without downloading it locally while using Python

So I need to be able to read and count the number of lines from a FTP server WITHOUT downloading it to my local machine while using Python.
I know the code to connect to the server:
ftp = ftplib.FTP('example.com') //Object ftp set as server address
ftp.login ('username' , 'password') // Login info
ftp.retrlines('LIST') // List file directories
ftp.cwd ('/parent folder/another folder/file/') //Change file directory
I also know the basic code to count the number of line If it is already downloaded/stored locally :
with open('file') as f:
... count = sum(1 for line in f)
... print (count)
I just need to know how to connect these 2 pieces of code without having to download the file to my local system.
Any help is appreciated.
Thank You

As far as i know FTP doesn't provide any kind of functionality to read the file content without actually downloading it. However you could try using something like Is it possible to read FTP files without writing them using Python?
(You haven't specified what python you are using)
#!/usr/bin/env python
from ftplib import FTP
def countLines(s):
print len(s.split('\n'))
ftp = FTP('ftp.kernel.org')
ftp.login()
ftp.retrbinary('RETR /pub/README_ABOUT_BZ2_FILES', countLines)
Please take this code as a reference only

There is a way: I adapted a piece of code that I created for processes csv files "on the fly". Is implement by producer-consumer problem approach. Apply this pattern allows us to assign each task to a thread (or process) and show partial results for huge remote files. You can adapt it for ftp requests.
Download stream is saved in queue and is consumed "on the fly". No HDD extra space is needed and memory efficient. Tested in Python 3.5.2 (vanilla) on Fedora Core 25 x86_64.
This is the source adapted for ftp (over http) retrieve:
from threading import Thread, Event
from queue import Queue, Empty
import urllib.request,sys,csv,io,os,time;
import argparse
FILE_URL = 'http://cdiac.ornl.gov/ftp/ndp030/CSV-FILES/nation.1751_2010.csv'
def download_task(url,chunk_queue,event):
CHUNK = 1*1024
response = urllib.request.urlopen(url)
event.clear()
print('%% - Starting Download - %%')
print('%% - ------------------ - %%')
'''VT100 control codes.'''
CURSOR_UP_ONE = '\x1b[1A'
ERASE_LINE = '\x1b[2K'
while True:
chunk = response.read(CHUNK)
if not chunk:
print('%% - Download completed - %%')
event.set()
break
chunk_queue.put(chunk)
def count_task(chunk_queue, event):
part = False
time.sleep(5) #give some time to producer
M=0
contador = 0
'''VT100 control codes.'''
CURSOR_UP_ONE = '\x1b[1A'
ERASE_LINE = '\x1b[2K'
while True:
try:
#Default behavior of queue allows getting elements from it and block if queue is Empty.
#In this case I set argument block=False. When queue.get() and queue Empty ocurrs not block and throws a
#queue.Empty exception that I use for show partial result of process.
chunk = chunk_queue.get(block=False)
for line in chunk.splitlines(True):
if line.endswith(b'\n'):
if part: ##for treat last line of chunk (normally is a part of line)
line = linepart + line
part = False
M += 1
else:
##if line not contains '\n' is last line of chunk.
##a part of line which is completed in next interation over next chunk
part = True
linepart = line
except Empty:
# QUEUE EMPTY
print(CURSOR_UP_ONE + ERASE_LINE + CURSOR_UP_ONE)
print(CURSOR_UP_ONE + ERASE_LINE + CURSOR_UP_ONE)
print('Downloading records ...')
if M>0:
print('Partial result: Lines: %d ' % M) #M-1 because M contains header
if (event.is_set()): #'THE END: no elements in queue and download finished (even is set)'
print(CURSOR_UP_ONE + ERASE_LINE+ CURSOR_UP_ONE)
print(CURSOR_UP_ONE + ERASE_LINE+ CURSOR_UP_ONE)
print(CURSOR_UP_ONE + ERASE_LINE+ CURSOR_UP_ONE)
print('The consumer has waited %s times' % str(contador))
print('RECORDS = ', M)
break
contador += 1
time.sleep(1) #(give some time for loading more records)
def main():
chunk_queue = Queue()
event = Event()
args = parse_args()
url = args.url
p1 = Thread(target=download_task, args=(url,chunk_queue,event,))
p1.start()
p2 = Thread(target=count_task, args=(chunk_queue,event,))
p2.start()
p1.join()
p2.join()
# The user of this module can customized one parameter:
# + URL where the remote file can be found.
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument('-u', '--url', default=FILE_URL,
help='remote-csv-file URL')
return parser.parse_args()
if __name__ == '__main__':
main()
Usage
$ python ftp-data.py -u <ftp-file>
Example:
python ftp-data-ol.py -u 'http://cdiac.ornl.gov/ftp/ndp030/CSV-FILES/nation.1751_2010.csv'
The consumer has waited 0 times
RECORDS = 16327
Csv version on Github: https://github.com/AALVAREZG/csv-data-onthefly

Concurrently downloading files in python using multi process

I have made a code below to download files using pySmartDL. I would like to download more than one file at a time. Tried to implement it using multi process. But second process starts only when first finishes. Code is below:
import time
from multiprocessing import Process
from pySmartDL import SmartDL, HashFailedException
def down():
dest='/home/faheem/Downloads'
obj = SmartDL(url_100mb_file,dest, progress_bar=False,fix_urls=True)
obj.start(blocking=False)
#cnt=1
while not obj.isFinished():
print("Speed: %s" % obj.get_speed(human=True))
print("Already downloaded: %s" % obj.get_dl_size(human=True))
print("Eta: %s" % obj.get_eta(human=True))
print("Progress: %d%%" % (obj.get_progress()*100))
print("Progress bar: %s" % obj.get_progress_bar())
print("Status: %s" % obj.get_status())
print("\n"*2+"="*50+"\n"*2)
print("SIZE=%s"%obj.filesize)
time.sleep(2)
if obj.isSuccessful():
print("downloaded file to '%s'" % obj.get_dest())
print("download task took %ss" % obj.get_dl_time(human=True))
print("File hashes:")
print(" * MD5: %s" % obj.get_data_hash('md5'))
print(" * SHA1: %s" % obj.get_data_hash('sha1'))
print(" * SHA256: %s" % obj.get_data_hash('sha256'))
data=obj.get_data()
else:
print("There were some errors:")
for e in obj.get_errors():
print(str(e))
return
if __name__ == '__main__':
#jobs=[]
#for i in range(5):
print 'Link1'
url_100mb_file = ['https://softpedia-secure-download.com/dl/45b1fc44f6bfabeddeb7ce766c97a8f0/58b6eb0f/100255033/software/office/Text%20Comparator%20(v1.2).rar']
Process(target=down()).start()
print'link2'
url_100mb_file = ['https://www.crystalidea.com/downloads/macsfancontrol_setup.exe']
Process(target=down()).start()
Here link2 starts downloading when link1 finishes, but I need both download to perform concurrently. I would like to implement this method to perform upto 10 downloads at a time. So is it good to use multiprocessing?
Is there any other better memory efficient method.
I am a beginner in these codes, so kindly define the answer easily..
Regards

You can also use python module Thread. Here is a little snippet on how it works:
import threading
import time
def func(i):
time.sleep(i)
print i
for i in range(1, 11):
thread = threading.Thread(target = func, args=(i,))
thread.start()
print "Launched thread " + str(i)
print "Done"
Run this snippet and you will get a perfect idea on how it works.
Knowing that, you can actually run your code, passing as an argument to the function the url to use in each thread.
Hope that helps

The particular library you're using appears to already support non-blocking downloads so why no just do the following? Non-blocking means it'll run in a seperate process.
from time import sleep
from pySmartDL import SmartDL
links = [['https://softpedia-secure download.com/dl/45b1fc44f6bfabeddeb7ce766c97a8f0/58b6eb0f/100255033/software/office/Text%20Comparator%20(v1.2).rar'],['https://www.crystalidea.com/downloads/macsfancontrol_setup.exe']]
objs = [SmartDL(link, progress_bar=False) for link in links]
for obj in objs:
obj.start(blocking=False)
while not all(obj.isFinished() for obj in objs):
sleep(1)

Since your program is I/O-bound, you can use multi-processing or mult-threading.
Just in case, I'd like to remind the classical pattern for problems like this. Have a queue of URLs from which worker processes / threads pull URLs for processing, and have a status queue where the workers push their progress reports or errors.
A thread pool or a process pull greatly simplifies things, compared to manual control.

Python, How to break out of multiple threads

I am following one of the examples in a book I am reading ("Violent Python"). It is to create a zip file password cracker from a dictionary. I have two questions about it. First it says to thread it as I have written in the code to increase performance but when I timed it (I know time.time() is not great for timing) there was about a twelve second difference in favor of not threading. Is this because it is taking longer to start the threads? Second if I do it without the threads I can break as soon as the correct value is found by printing the result and the entering the statement exit(0). Is there a way to get the same result using threading so that if I find the result I am looking for I can end all other threads simultaneously?
import zipfile
from threading import Thread
import time
def extractFile(z, password, starttime):
try:
z.extractall(pwd=password)
except:
pass
else:
z.close()
print('PWD IS ' + password)
print(str(time.time()-starttime))
def main():
start = time.time()
z = zipfile.ZipFile('test.zip')
pwdfile = open('words.txt')
pwds = pwdfile.read()
pwdfile.close()
for pwd in pwds.splitlines():
t = Thread(target=extractFile, args=(z, pwd, start))
t.start()
#extractFile(z, pwd, start)
print(str(time.time()-start))
if __name__ == '__main__':
main()

In CPython, the Global Interpreter Lock ("GIL") enforces the restriction that only one thread at a time can execute Python bytecode.
So in this application, it is probably better to use the map method of a multiprocessing.Pool, since every try is independant of the others;
import multiprocessing
import zipfile
def tryfile(password):
rv = passwd
with zipfile.ZipFile('test.zip') as z:
try:
z.extractall(pwd=password)
except:
rv = None
return rv
with open('words.txt') as pwdfile:
data = pwdfile.read()
pwds = data.split()
p = multiprocessing.Pool()
results = p.map(tryfile, pwds)
results = [r for r in results if r is not None]
This will start (by default) as many processes as your computer has cores. If will keep running tryfile() with a different passwords in these processes until the list pwds is exhausted, gather the results and return them. The last list comprehension is to discard the None results.
Note that this code could be improved to stop shut down the map once the password is found. You'd probably have to use map_async and a shared variable in that case. It would also be nice to load the zipfile only once and share that.

This code is slow because python has a Global Interpreter Lock, which means only one thread can execute at a time. This causes multithreaded code to run slower than serial code in Python. If you want to create a truly multithreaded application, you'd have to use the Multiprocessing Module.
To break out of the threads and get the return value, you can use os._exit(1) First, import the os module at the top of your file:
import os
Then, change your extractFile function to use os._exit(1):
def extractFile(z, password, starttime):
try:
z.extractall(pwd=password)
except:
pass
else:
z.close()
print('PWD IS ' + password)
print(str(time.time()-starttime))
os._exit(1)

tail multiple logfiles in python

This is probably a bit of a silly excercise for me, but it raises a bunch of interesting questions. I have a directory of logfiles from my chat client, and I want to be notified using notify-osd every time one of them changes.
The script that I wrote basically uses os.popen to run the linux tail command on every one of the files to get the last line, and then check each line against a dictionary of what the lines were the last time it ran. If the line changed, it used pynotify to send me a notification.
This script actually worked perfectly, except for the fact that it used a huge amount of cpu (probably because it was running tail about 16 times every time the loop ran, on files that were mounted over sshfs.)
It seems like something like this would be a great solution, but I don't see how to implement that for more than one file.
Here is the script that I wrote. Pardon my lack of comments and poor style.
Edit: To clarify, this is all linux on a desktop.

Not even looking at your source code, there are two ways you could easily do this more efficiently and handle multiple files.
Don't bother running tail unless you have to. Simply os.stat all of the files and record the last modified time. If the last modified time is different, then raise a notification.
Use pyinotify to call out to Linux's inotify facility; this will have the kernel do option 1 for you and call back to you when any files in your directory change. Then translate the callback into your osd notification.
Now, there might be some trickiness depending on how many notifications you want when there are multiple messages and whether you care about missing a notification for a message.
An approach that preserves the use of tail would be to instead use tail -f. Open all of the files with tail -f and then use the select module to have the OS tell you when there's additional input on one of the file descriptors open for tail -f. Your main loop would call select and then iterate over each of the readable descriptors to generate notifications. (You could probably do this without using tail and just calling readline() when it's readable.)
Other areas of improvement in your script:
Use os.listdir and native Python filtering (say, using list comprehensions) instead of a popen with a bunch of grep filters.
Update the list of buffers to scan periodically instead of only doing it at program boot.
Use subprocess.popen instead of os.popen.

If you're already using the pyinotify module, it's easy to do this in pure Python (i.e. no need to spawn a separate process to tail each file).
Here is an example that is event-driven by inotify, and should use very little cpu. When IN_MODIFY occurs for a given path we read all available data from the file handle and output any complete lines found, buffering the incomplete line until more data is available:
import os
import select
import sys
import pynotify
import pyinotify
class Watcher(pyinotify.ProcessEvent):
def __init__(self, paths):
self._manager = pyinotify.WatchManager()
self._notify = pyinotify.Notifier(self._manager, self)
self._paths = {}
for path in paths:
self._manager.add_watch(path, pyinotify.IN_MODIFY)
fh = open(path, 'rb')
fh.seek(0, os.SEEK_END)
self._paths[os.path.realpath(path)] = [fh, '']
def run(self):
while True:
self._notify.process_events()
if self._notify.check_events():
self._notify.read_events()
def process_default(self, evt):
path = evt.pathname
fh, buf = self._paths[path]
data = fh.read()
lines = data.split('\n')
# output previous incomplete line.
if buf:
lines[0] = buf + lines[0]
# only output the last line if it was complete.
if lines[-1]:
buf = lines[-1]
lines.pop()
# display a notification
notice = pynotify.Notification('%s changed' % path, '\n'.join(lines))
notice.show()
# and output to stdout
for line in lines:
sys.stdout.write(path + ': ' + line + '\n')
sys.stdout.flush()
self._paths[path][1] = buf
pynotify.init('watcher')
paths = sys.argv[1:]
Watcher(paths).run()
Usage:
% python watcher.py [path1 path2 ... pathN]

Simple pure python solution (not the best, but doesn't fork, spits out 4 empty lines after idle period and marks everytime the source of the chunk, if changed):
#!/usr/bin/env python
from __future__ import with_statement
'''
Implement multi-file tail
'''
import os
import sys
import time
def print_file_from(filename, pos):
with open(filename, 'rb') as fh:
fh.seek(pos)
while True:
chunk = fh.read(8192)
if not chunk:
break
sys.stdout.write(chunk)
def _fstat(filename):
st_results = os.stat(filename)
return (st_results[6], st_results[8])
def _print_if_needed(filename, last_stats, no_fn, last_fn):
changed = False
#Find the size of the file and move to the end
tup = _fstat(filename)
# print tup
if last_stats[filename] != tup:
changed = True
if not no_fn and last_fn != filename:
print '\n<%s>' % filename
print_file_from(filename, last_stats[filename][0])
last_stats[filename] = tup
return changed
def multi_tail(filenames, stdout=sys.stdout, interval=1, idle=10, no_fn=False):
S = lambda (st_size, st_mtime): (max(0, st_size - 124), st_mtime)
last_stats = dict((fn, S(_fstat(fn))) for fn in filenames)
last_fn = None
last_print = 0
while 1:
# print last_stats
changed = False
for filename in filenames:
if _print_if_needed(filename, last_stats, no_fn, last_fn):
changed = True
last_fn = filename
if changed:
if idle > 0:
last_print = time.time()
else:
if idle > 0 and last_print is not None:
if time.time() - last_print >= idle:
last_print = None
print '\n' * 4
time.sleep(interval)
if '__main__' == __name__:
from optparse import OptionParser
op = OptionParser()
op.add_option('-F', '--no-fn', help="don't print filename when changes",
default=False, action='store_true')
op.add_option('-i', '--idle', help='idle time, in seconds (0 turns off)',
type='int', default=10)
op.add_option('--interval', help='check interval, in seconds', type='int',
default=1)
opts, args = op.parse_args()
try:
multi_tail(args, interval=opts.interval, idle=opts.idle,
no_fn=opts.no_fn)
except KeyboardInterrupt:
pass

In Django, how to call a subprocess with a slow start-up time

Suppose you're running Django on Linux, and you've got a view, and you want that view to return the data from a subprocess called cmd that operates on a file that the view creates, for example likeso:
def call_subprocess(request):
response = HttpResponse()
with tempfile.NamedTemporaryFile("W") as f:
f.write(request.GET['data']) # i.e. some data
# cmd operates on fname and returns output
p = subprocess.Popen(["cmd", f.name],
stdout=subprocess.PIPE,
stderr=subprocess.PIPE)
out, err = p.communicate()
response.write(p.out) # would be text/plain...
return response
Now, suppose cmd has a very slow start-up time, but a very fast operating time, and it does not natively have a daemon mode. I would like to improve the response-time of this view.
I would like to make the whole system would run much faster by starting up a number of instances of cmd in a worker-pool, have them wait for input, and having call_process ask one of those worker pool processes handle the data.
This is really 2 parts:
Part 1. A function that calls cmd and cmd waits for input. This could be done with pipes, i.e.
def _run_subcmd():
p = subprocess.Popen(["cmd", fname],
stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = p.communicate()
# write 'out' to a tmp file
o = open("out.txt", "W")
o.write(out)
o.close()
p.close()
exit()
def _run_cmd(data):
f = tempfile.NamedTemporaryFile("W")
pipe = os.mkfifo(f.name)
if os.fork() == 0:
_run_subcmd(fname)
else:
f.write(data)
r = open("out.txt", "r")
out = r.read()
# read 'out' from a tmp file
return out
def call_process(request):
response = HttpResponse()
out = _run_cmd(request.GET['data'])
response.write(out) # would be text/plain...
return response
Part 2. A set of workers running in the background that are waiting on the data. i.e. We want to extend the above so that the subprocess is already running, e.g. when the Django instance initializes, or this call_process is first called, a set of these workers is created
WORKER_COUNT = 6
WORKERS = []
class Worker(object):
def __init__(index):
self.tmp_file = tempfile.NamedTemporaryFile("W") # get a tmp file name
os.mkfifo(self.tmp_file.name)
self.p = subprocess.Popen(["cmd", self.tmp_file],
stdout=subprocess.PIPE, stderr=subprocess.PIPE)
self.index = index
def run(out_filename, data):
WORKERS[self.index] = Null # qua-mutex??
self.tmp_file.write(data)
if (os.fork() == 0): # does the child have access to self.p??
out, err = self.p.communicate()
o = open(out_filename, "w")
o.write(out)
exit()
self.p.close()
self.o.close()
self.tmp_file.close()
WORKERS[self.index] = Worker(index) # replace this one
return out_file
#classmethod
def get_worker() # get the next worker
# ... static, incrementing index
There should be some initialization of workers somewhere, like this:
def init_workers(): # create WORKERS_COUNT workers
for i in xrange(0, WORKERS_COUNT):
tmp_file = tempfile.NamedTemporaryFile()
WORKERS.push(Worker(i))
Now, what I have above becomes something likeso:
def _run_cmd(data):
Worker.get_worker() # this needs to be atomic & lock worker at Worker.index
fifo = open(tempfile.NamedTemporaryFile("r")) # this stores output of cmd
Worker.run(fifo.name, data)
# please ignore the fact that everything will be
# appended to out.txt ... these will be tmp files, too, but named elsewhere.
out = fifo.read()
# read 'out' from a tmp file
return out
def call_process(request):
response = HttpResponse()
out = _run_cmd(request.GET['data'])
response.write(out) # would be text/plain...
return response
Now, the questions:
Will this work? (I've just typed this off the top of my head into StackOverflow, so I'm sure there are problems, but conceptually, will it work)
What are the problems to look for?
Are there better alternatives to this? e.g. Could threads work just as well (it's Debian Lenny Linux)? Are there any libraries that handle parallel process worker-pools like this?
Are there interactions with Django that I ought to be conscious of?
Thanks for reading! I hope you find this as interesting a problem as I do.
Brian

It may seem like i am punting this product as this is the second time i have responded with a recommendation of this.
But it seems like you need a Message Queing service, in particular a distributed message queue.
ere is how it will work:
Your Django App requests CMD
CMD gets added to a queue
CMD gets pushed to several works
It is executed and results returned upstream
Most of this code exists, and you dont have to go about building your own system.
Have a look at Celery which was initially built with Django.
http://www.celeryq.org/
http://robertpogorzelski.com/blog/2009/09/10/rabbitmq-celery-and-django/

Issy already mentioned Celery, but since comments doesn't work well
with code samples, I'll reply as an answer instead.
You should try to use Celery synchronously with the AMQP result store.
You could distribute the actual execution to another process or even another machine. Executing synchronously in celery is easy, e.g.:
>>> from celery.task import Task
>>> from celery.registry import tasks
>>> class MyTask(Task):
...
... def run(self, x, y):
... return x * y
>>> tasks.register(MyTask)
>>> async_result = MyTask.delay(2, 2)
>>> retval = async_result.get() # Now synchronous
>>> retval 4
The AMQP result store makes sending back the result very fast,
but it's only available in the current development version (in code-freeze to become
0.8.0)

How about "daemonizing" the subprocess call using python-daemon or its successor, grizzled.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Run many filesystem operations in parallel and asynchronously - python

Related

How do I count the number of line in a FTP file without downloading it locally while using Python

Concurrently downloading files in python using multi process

Python, How to break out of multiple threads

tail multiple logfiles in python

In Django, how to call a subprocess with a slow start-up time

Categories

Resources