Concurrently downloading files in Python using multiprocessing

I have written the code below to download files using pySmartDL. I would like to download more than one file at a time, and tried to implement this using multiprocessing, but the second process starts only when the first finishes. The code is below:
import time
from multiprocessing import Process
from pySmartDL import SmartDL, HashFailedException

def down():
    dest = '/home/faheem/Downloads'
    obj = SmartDL(url_100mb_file, dest, progress_bar=False, fix_urls=True)
    obj.start(blocking=False)
    #cnt=1
    while not obj.isFinished():
        print("Speed: %s" % obj.get_speed(human=True))
        print("Already downloaded: %s" % obj.get_dl_size(human=True))
        print("Eta: %s" % obj.get_eta(human=True))
        print("Progress: %d%%" % (obj.get_progress()*100))
        print("Progress bar: %s" % obj.get_progress_bar())
        print("Status: %s" % obj.get_status())
        print("\n"*2+"="*50+"\n"*2)
        print("SIZE=%s" % obj.filesize)
        time.sleep(2)
    if obj.isSuccessful():
        print("downloaded file to '%s'" % obj.get_dest())
        print("download task took %ss" % obj.get_dl_time(human=True))
        print("File hashes:")
        print(" * MD5: %s" % obj.get_data_hash('md5'))
        print(" * SHA1: %s" % obj.get_data_hash('sha1'))
        print(" * SHA256: %s" % obj.get_data_hash('sha256'))
        data = obj.get_data()
    else:
        print("There were some errors:")
        for e in obj.get_errors():
            print(str(e))
    return

if __name__ == '__main__':
    #jobs=[]
    #for i in range(5):
    print 'Link1'
    url_100mb_file = ['https://softpedia-secure-download.com/dl/45b1fc44f6bfabeddeb7ce766c97a8f0/58b6eb0f/100255033/software/office/Text%20Comparator%20(v1.2).rar']
    Process(target=down()).start()
    print 'link2'
    url_100mb_file = ['https://www.crystalidea.com/downloads/macsfancontrol_setup.exe']
    Process(target=down()).start()
Here, link2 starts downloading only after link1 finishes, but I need both downloads to run concurrently. I would like to extend this approach to perform up to 10 downloads at a time. Is multiprocessing a good choice for this?
Is there another, more memory-efficient method?
I am a beginner with this kind of code, so please keep the answer simple.
Regards

You can also use the Python threading module. Here is a little snippet showing how it works:
import threading
import time

def func(i):
    time.sleep(i)
    print i

for i in range(1, 11):
    thread = threading.Thread(target=func, args=(i,))
    thread.start()
    print "Launched thread " + str(i)
print "Done"
Run this snippet and you will get a good idea of how it works.
Knowing that, you can run your own code the same way, passing the URL to download as an argument to the function in each thread.
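For example, a rough sketch of that (assuming the asker's down() function is changed to accept the URL as a parameter, which is not how the original code is written):
import threading
from pySmartDL import SmartDL

def down(url, dest='/home/faheem/Downloads'):
    # Each call downloads one URL; blocking=True is fine here because
    # the blocking happens inside this thread, not in the main thread.
    obj = SmartDL(url, dest, progress_bar=False, fix_urls=True)
    obj.start(blocking=True)
    print("Downloaded to %s" % obj.get_dest())

urls = ['https://softpedia-secure-download.com/dl/45b1fc44f6bfabeddeb7ce766c97a8f0/58b6eb0f/100255033/software/office/Text%20Comparator%20(v1.2).rar',
        'https://www.crystalidea.com/downloads/macsfancontrol_setup.exe']

threads = [threading.Thread(target=down, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()   # wait for every download to finish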
Hope that helps

The particular library you're using appears to already support non-blocking downloads, so why not just do the following? Non-blocking means the download runs in the background while your code continues.
from time import sleep
from pySmartDL import SmartDL

links = [['https://softpedia-secure-download.com/dl/45b1fc44f6bfabeddeb7ce766c97a8f0/58b6eb0f/100255033/software/office/Text%20Comparator%20(v1.2).rar'],
         ['https://www.crystalidea.com/downloads/macsfancontrol_setup.exe']]
objs = [SmartDL(link, progress_bar=False) for link in links]

for obj in objs:
    obj.start(blocking=False)

while not all(obj.isFinished() for obj in objs):
    sleep(1)

Since your program is I/O-bound, you can use multiprocessing or multithreading.
Just in case, I'd like to point out the classical pattern for problems like this: have a queue of URLs from which worker processes/threads pull URLs to process, and a status queue to which the workers push their progress reports or errors.
A thread pool or a process pool greatly simplifies things, compared to manual control.
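A minimal sketch of that pattern with threads and the standard queue module (the URLs and worker count are placeholders, and the actual download call is left as a comment):
import threading
import queue

NUM_WORKERS = 10

def worker(url_q, status_q):
    # Pull URLs until a None sentinel arrives; push results to the status queue.
    while True:
        url = url_q.get()
        if url is None:
            break
        try:
            # the actual download (e.g. with pySmartDL) would go here
            status_q.put(('done', url))
        except Exception as exc:
            status_q.put(('error', url, str(exc)))

url_q = queue.Queue()
status_q = queue.Queue()

for u in ['http://example.com/a.bin', 'http://example.com/b.bin']:  # placeholder URLs
    url_q.put(u)
for _ in range(NUM_WORKERS):
    url_q.put(None)            # one sentinel per worker

workers = [threading.Thread(target=worker, args=(url_q, status_q)) for _ in range(NUM_WORKERS)]
for w in workers:
    w.start()
for w in workers:
    w.join()

while not status_q.empty():
    print(status_q.get())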

Related

How to check in Python if a Windows process is running when its service has stopped?

Some Windows processes keep running for a few minutes after their service has stopped. Is there a way in Python to detect that?
You can try the psutil package and in particular psutil.process_iter() (returns an iterator over running processes). This package is used by other profiling packages. Documentation on the Process functions can be found in the psutil documentation.
I don't know how you would find process ids for the service(s) in question if they are not apparent in parent/child pid relationships. I've not tested this on Windows.
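A hedged sketch of what checking by process name with psutil might look like (untested on Windows, as noted above; the process name is taken from the tasklist example that follows):
import psutil  # third-party: pip install psutil

def find_processes(name):
    # Return psutil.Process objects whose name contains `name`.
    matches = []
    for p in psutil.process_iter(['pid', 'name']):
        try:
            if p.info['name'] and name.lower() in p.info['name'].lower():
                matches.append(p)
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass  # process exited or is protected; skip it
    return matches

for p in find_processes('dwm.exe'):
    print('%s %s %s' % (p.pid, p.name(), p.status()))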
import os

def getTasks(name):
    r = os.popen('tasklist /v').read().strip().split('\n')
    print('# of tasks is %s' % (len(r)))
    for i in range(len(r)):
        s = r[i]
        if name in r[i]:
            print('%s in r[i]' % (name))
            return r[i]
    return []

if __name__ == '__main__':
    '''
    This code checks tasklist, and will print the status of a process
    '''
    imgName = 'dwm.exe'
    notResponding = 'Not Responding'
    r = getTasks(imgName)
    if not r:
        print('%s - No such process' % (imgName))
    elif 'Not Responding' in r:
        print('%s is Not responding' % (imgName))
    else:
        print('%s is Running or Unknown' % (imgName))
Important note: as the author of the tutorial stated, the platform used was Windows Server 2012, but this code will most likely work with other Windows products. At the very least it should give you an idea of how to do what you want.
Hope this helps!

Weird behavior from Python's multiprocessing

The code I am trying is:
from multiprocessing import Pool
from fabric.api import env, settings, put, run

def update_vm(si, vm):
    env.host_string = vm
    with settings(user=VM_USER, key_filename=inputs['ssh_key_path']):
        put(local_file, remote_zip_file)
        run('tar -zxpf %s' % remote_zip_file)
        run('sudo sh %s' % REMOTE_UPDATE_SCRIPT)
        response_msg = run('cat %s' % REMOTE_RESPONSE_FILE)
        if 'success' in response_msg:
            pass  # do stuff
        else:
            pass  # do stuff

def update_vm_wrapper(args):
    return update_vm(*args)

def main():
    try:
        si = get_connection()
        vms = [vm1, vm2, vm3...]
        update_jobs = [(si, vm) for vm in vms]
        pool = Pool(30)
        pool.map(update_vm_wrapper, update_jobs)
        pool.close()
        pool.join()
    except Exception as e:
        print e

if __name__ == "__main__":
    main()
Now the problem is that I see it trying to put the zip file onto the same VM (say vm1) three times (I guess the length of vms), and trying to execute the other ssh commands three times as well.
Using locks around the update_vm() method solves the issue, but then it is no longer really a multiprocess solution; it behaves more like iterating over a loop.
What am I doing wrong here?
Fabric has its own facilities for parallel execution of tasks - you should use those, rather than just trying to execute Fabric tasks in multiprocessing pools. The problem is that the env object is mutated when executing the tasks, so the different workers are stepping on each other (unless you put locking in).
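A hedged sketch of what that might look like with Fabric 1.x's built-in parallel execution (the credentials, paths, and hostnames below are placeholders, not values from the question):
from fabric.api import execute, parallel, put, run, settings

VM_USER = 'ubuntu'                      # placeholder credentials/paths
SSH_KEY_PATH = '/path/to/key.pem'
LOCAL_FILE = 'update.tar.gz'
REMOTE_ZIP_FILE = '/tmp/update.tar.gz'
REMOTE_UPDATE_SCRIPT = '/tmp/update.sh'
REMOTE_RESPONSE_FILE = '/tmp/response.txt'

@parallel(pool_size=30)                 # Fabric forks one worker per host, up to pool_size
def update_vm():
    # Fabric sets env.host_string for each host itself, so nothing is mutated by hand.
    with settings(user=VM_USER, key_filename=SSH_KEY_PATH):
        put(LOCAL_FILE, REMOTE_ZIP_FILE)
        run('tar -zxpf %s' % REMOTE_ZIP_FILE)
        run('sudo sh %s' % REMOTE_UPDATE_SCRIPT)
        return run('cat %s' % REMOTE_RESPONSE_FILE)

if __name__ == '__main__':
    # execute() returns a dict mapping each host to the task's return value.
    results = execute(update_vm, hosts=['vm1', 'vm2', 'vm3'])  # placeholder hostnames
    print(results)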

Python, How to break out of multiple threads

I am following one of the examples in a book I am reading ("Violent Python"). It is to create a zip-file password cracker from a dictionary. I have two questions about it. First, it says to thread it as I have written in the code to increase performance, but when I timed it (I know time.time() is not great for timing) there was about a twelve-second difference in favor of not threading. Is this because it takes longer to start the threads? Second, if I do it without the threads I can break as soon as the correct value is found, by printing the result and then calling exit(0). Is there a way to get the same result using threading, so that if I find the result I am looking for I can end all other threads simultaneously?
import zipfile
from threading import Thread
import time

def extractFile(z, password, starttime):
    try:
        z.extractall(pwd=password)
    except:
        pass
    else:
        z.close()
        print('PWD IS ' + password)
        print(str(time.time()-starttime))

def main():
    start = time.time()
    z = zipfile.ZipFile('test.zip')
    pwdfile = open('words.txt')
    pwds = pwdfile.read()
    pwdfile.close()
    for pwd in pwds.splitlines():
        t = Thread(target=extractFile, args=(z, pwd, start))
        t.start()
        #extractFile(z, pwd, start)
    print(str(time.time()-start))

if __name__ == '__main__':
    main()
In CPython, the Global Interpreter Lock ("GIL") enforces the restriction that only one thread at a time can execute Python bytecode.
So in this application, it is probably better to use the map method of a multiprocessing.Pool, since every try is independent of the others:
import multiprocessing
import zipfile

def tryfile(password):
    rv = password
    with zipfile.ZipFile('test.zip') as z:
        try:
            z.extractall(pwd=password)
        except:
            rv = None
    return rv

with open('words.txt') as pwdfile:
    data = pwdfile.read()
pwds = data.split()

p = multiprocessing.Pool()
results = p.map(tryfile, pwds)
results = [r for r in results if r is not None]
This will start (by default) as many processes as your computer has cores. It will keep running tryfile() with a different password in these processes until the list pwds is exhausted, then gather the results and return them. The last list comprehension discards the None results.
Note that this code could be improved to shut down the pool once the password is found. You'd probably have to use map_async and a shared variable in that case; a sketch of one alternative appears below. It would also be nice to load the zipfile only once and share it.
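One way to stop early, sketched here with imap_unordered instead of map_async (my substitution, chosen because it keeps the example short, not necessarily what the author had in mind): results arrive as soon as each worker finishes, so the parent can terminate the pool as soon as a password is found.
import multiprocessing
import zipfile

def tryfile(password):
    # Returns the password on success, None on failure (same idea as above).
    with zipfile.ZipFile('test.zip') as z:
        try:
            z.extractall(pwd=password.encode('utf-8'))  # zipfile expects bytes for pwd on Python 3
            return password
        except Exception:
            return None

if __name__ == '__main__':
    with open('words.txt') as pwdfile:
        pwds = pwdfile.read().split()

    pool = multiprocessing.Pool()
    found = None
    # imap_unordered yields results as workers finish, so we can stop early.
    for result in pool.imap_unordered(tryfile, pwds):
        if result is not None:
            found = result
            pool.terminate()   # stop the remaining workers immediately
            break
    else:
        pool.close()           # no password found; let workers finish normally
    pool.join()
    print('Password: %s' % found)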
This code is slow because Python has a Global Interpreter Lock, which means only one thread can execute Python bytecode at a time. For CPU-bound work like this, multithreaded code often runs no faster (or even slower) than serial code in Python. If you want true parallelism, you have to use the multiprocessing module.
To break out of the threads and get the return value, you can use os._exit(1). First, import the os module at the top of your file:
import os
Then, change your extractFile function to use os._exit(1):
def extractFile(z, password, starttime):
    try:
        z.extractall(pwd=password)
    except:
        pass
    else:
        z.close()
        print('PWD IS ' + password)
        print(str(time.time()-starttime))
        os._exit(1)

Ensuring only one instance of a long-running process is executed, using Python

I am looking for best practice for ensuring that a script executed by a cron job every minute only has one running instance. For example, if I have a cron job that executes every minute and the process takes longer than one minute, another instance should not be started until the first is done.
For now I have the function below. In essence, I get the name of the current file and do a ps grep to count how many instances of it are running. It is kind of messy, so I was looking for a more Pythonic way.
I place this code at the top of the file. It does work, but again, it is messy.
import os

def doRunCount(stop=False, min_run=1):
    import inspect
    current_file = inspect.getfile(inspect.currentframe())
    print current_file
    fn = current_file.split()
    run_check = os.popen('ps aux | grep python').read().strip().split('\n')
    run_count = 0
    for i in run_check:
        if i.find('/bin/sh') < 0:
            if i.find(current_file) >= 0:
                run_count = run_count + 1
    if run_count > min_run:
        print 'max processes already running'
        exit()
    return run_count
I don't know if you could describe this as best practice, but I would use a pid file. Here's a snippet similar to what I have used several times to ensure only one instance of a specific app is running.
import os, sys

PID_FILE = '/path/to/somewhere.pid'

if os.path.exists(PID_FILE):
    pid = int(open(PID_FILE, 'rb').read().rstrip('\n'))
    if len(os.popen('ps %i' % pid).read().split('\n')) > 2:
        print "Already Running as pid: %i" % pid
        sys.exit(1)

# If we get here, we know that the app is not running so we can start a new one...
pf = open(PID_FILE, 'wb')
pf.write('%i\n' % os.getpid())
pf.close()

if __name__ == '__main__':
    #Do something here!
    pass
Like I said, this is similar to what I have used; I just rewrote the snippet to be a little more elegant, but it should get the general concept across. Hope this helps.
Here is a slight modification which should clear up any issues arising from a process crash.
This code will not only validate that a pid file exists, but that the pid in the file is still alive and that the pid is still the same executable.
import os, sys

PID_FILE = '/path/to/somewhere.pid'

if os.path.exists(PID_FILE):
    pid = int(open(PID_FILE, 'rb').read().rstrip('\n'))
    pinfo = os.popen('ps %i' % pid).read().split('\n')
    if len(pinfo) > 2:
        # You might need to modify this to your own usage...
        if pinfo[1].count(sys.argv[0]):
            # Verify that the process found by 'ps' really is still running...
            print "Already Running as pid: %i" % pid
            sys.exit(1)

# If we get here, we know that the app is not running so we can start a new one...
pf = open(PID_FILE, 'wb')
pf.write('%i\n' % os.getpid())
pf.close()

if __name__ == '__main__':
    #Do something here!
    pass
After that I just leave the pid file, since you don't really need to worry about a false positive. Note you might need to modify the second step of validation to your own specific usage!

Run many filesystem operations in parallel and asynchronously

I have the following function, which returns the filesize of a file over HTTP:
import urllib2

def GetFileSize(url):
    " Function gets a url and returns its filesize in bytes "
    url = url.replace(' ', '%20')
    u = urllib2.urlopen(url)
    meta = u.info()
    file_size = int(meta.getheaders("Content-Length")[0])
    return file_size
I would like to get the biggest file from a given list of links, and I wrote the following function for it:
def GetBiggestFile(links):
    " Function gets a list of links and returns the biggest file and its size in bytes "
    dic = {}
    for link in links:
        filename = link.split('/')[-1]
        filesize = GetFileSize(link)
        dic[link] = filesize
        print "%s | %.2f MB" % (filename, filesize / 1024.0 / 1024.0)
    biggest_file = max(dic, key=dic.get)
    return biggest_file, dic[biggest_file]
My lists have dozens of links, so this script takes some time to complete. Using threading I could fetch the different filesizes concurrently and shorten the running time of the code.
I'm not so sure how to do it - I've tried using a decorator that makes the function run asynchronously:
def run_async(func):
    " Decorator for running functions asynchronously. "
    from threading import Thread
    from functools import wraps
    @wraps(func)
    def async_func(*args, **kwargs):
        func_hl = Thread(target=func, args=args, kwargs=kwargs)
        func_hl.start()
        return func_hl
    return async_func
But I'm not sure how to make my code wait for all the answers before trying to determine which is the biggest file.
Thanks.
You'll be happier with multiprocessing.
Start with this example: http://docs.python.org/library/multiprocessing.html#using-a-pool-of-workers
Your GetFileSize function can be run in a process pool.
Since each process is separate, you should also have an "output Queue" into which the results are put. A separate process does a simple "get" to retrieve all the answers from the Queue.
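A minimal sketch of that idea (Python 2, matching the question's urllib2 code; the URLs at the bottom are placeholders). Pool.map already gathers the results in the parent process, so the explicit output queue described above is not strictly needed in this simple form:
import multiprocessing
import urllib2

def GetFileSize(url):
    " Function gets a url and returns its filesize in bytes "
    url = url.replace(' ', '%20')
    u = urllib2.urlopen(url)
    meta = u.info()
    return int(meta.getheaders("Content-Length")[0])

def GetBiggestFile(links):
    pool = multiprocessing.Pool()           # one worker per CPU core by default
    sizes = pool.map(GetFileSize, links)    # fetch all sizes concurrently
    pool.close()
    pool.join()
    biggest_file, biggest_size = max(zip(links, sizes), key=lambda pair: pair[1])
    return biggest_file, biggest_size

if __name__ == '__main__':
    links = ['http://example.com/a.iso', 'http://example.com/b.iso']  # placeholder URLs
    print GetBiggestFile(links)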
