Threading takes longer than non-threaded program - Python

I'm working on the code below. The task is to generate 1,000,000 random numbers and save them to a .txt file. The version without any threading needs about 2.5 seconds to run, but the threaded version needs anywhere between 120 and 280 seconds. I have no clue where or why it goes wrong. Are the locks blocking each other?
Simple version
import random
import timeit
import os

start = timeit.default_timer()
f = open('file2.txt', 'a')
for i in range(0, 1000000):
    f.write(str(random.randint(0, 32576)))
    f.write('\n')
f.close()
end = timeit.default_timer()
Threading version
import threading
import queue
import random
import timeit
import os

tasks = queue.Queue()
start = timeit.default_timer()
output = open('file2.txt', 'w')
output_lock = threading.Lock()

def worker(thread_number):
    while not tasks.empty():
        tasks.get()
        f = open('file2.txt', 'a')
        with output_lock:  # block until lock is available
            f.write(str(random.randint(0, 32767)))
            f.write('\n')
        f.close()
        tasks.task_done()

for i in range(1000000):
    tasks.put(i)

for thread in range(8):
    threading.Thread(target=worker, args=(thread,)).start()

print('waiting')
tasks.join()
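The per-item work here (one randint and two tiny writes) is so cheap that the queue handling, the file being reopened for every number, and the lock acquisition dominate, and the eight threads then mostly contend for the lock and the GIL. As an illustration only (my sketch, not code from the post), keeping the threaded structure but sharing one file handle, handing out work in batches, and writing once per lock acquisition removes most of that overhead:

import threading
import queue
import random
import timeit

tasks = queue.Queue()
output_lock = threading.Lock()
BATCH = 10000  # numbers generated per task

def worker(f):
    while True:
        try:
            n = tasks.get_nowait()  # how many numbers this batch should contain
        except queue.Empty:
            return
        # build the whole batch outside the lock, then write it in a single call
        lines = ''.join('%d\n' % random.randint(0, 32767) for _ in range(n))
        with output_lock:
            f.write(lines)
        tasks.task_done()

start = timeit.default_timer()
with open('file2.txt', 'w') as f:
    for _ in range(1000000 // BATCH):
        tasks.put(BATCH)
    threads = [threading.Thread(target=worker, args=(f,)) for _ in range(8)]
    for t in threads:
        t.start()
    tasks.join()
    for t in threads:
        t.join()
print(timeit.default_timer() - start)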

Related

Define multiprocessing loops from config file

I want to read a set of sensors in parallel with concurrent loops, but each system has a different set of sensors.
I currently have:
with open(inputfilelocation, 'r') as f:
    sensorlist = json.load(f)

while True:
    for sensor in sensorlist:
        H, T = read_sensor(sensor['model'], sensor['address'])
        send_data(H, T)
    time.sleep(60)
which reads them all, then sleeps for a minute. But now I want to specify how frequently to sample each sensor.
I could do:
from multiprocessing import Process

def loop_a():
    while True:
        # Sample and send data
        time.sleep(sensor_a_delay)

def loop_b():
    while True:
        # Sample and send data
        time.sleep(sensor_b_delay)

Process(target=loop_a).start()
Process(target=loop_b).start()
but then I would need to know at the very least how many sensors I have.
Is there any way to define these loops on the fly?
Edit: I've tried this:
def loop_a(sleeptime, string):
    while 1:
        print(string)
        time.sleep(sleeptime)

Process(target=loop_a(5,"foo")).start()
Process(target=loop_a(2,"bar")).start()
but only the "foo" loop runs.
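The snippet above calls loop_a(5, "foo") immediately in the parent process: because the function loops forever, that call never returns, the Process object is never even constructed, and the second line is never reached, which is why only "foo" is printed. The callable has to be passed uncalled, with its arguments in args. A minimal sketch of the intended form (illustrative only):

from multiprocessing import Process
import time

def loop_a(sleeptime, string):
    while True:
        print(string)
        time.sleep(sleeptime)

# pass the function itself plus its arguments; Process calls it in the child
Process(target=loop_a, args=(5, "foo")).start()
Process(target=loop_a, args=(2, "bar")).start()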
OK, I've figured out what Klaus meant. This solves my problem:
import time
import json
import threading

with open('sensorlist.json', 'r') as f:
    sensorlist = json.load(f)

def sensorloop(model, address, delay):
    while True:
        H, T = read_sensor(model, address)
        send_data(H, T)
        time.sleep(delay)

for sensor in sensorlist:
    x = threading.Thread(target=sensorloop,
                         args=(sensor['model'],
                               sensor['address'],
                               sensor['delay'])
                         )
    x.start()

Asynchronous programming for calculating hashes of files

I'm trying to calculate hashes for files to check whether any changes have been made.
I have a GUI and some other observers running in the event loop.
So I decided to calculate the file hashes (MD5/SHA-1, whichever is faster) asynchronously.
Synchronous code :
import hashlib
import time

chunk_size = 4 * 1024

def getHash(filename):
    md5_hash = hashlib.md5()
    with open(filename, "rb") as f:
        for byte_block in iter(lambda: f.read(chunk_size), b""):
            md5_hash.update(byte_block)
    print("getHash : " + md5_hash.hexdigest())

start = time.time()
getHash("C:\\Users\\xxx\\video1.mkv")
getHash("C:\\Users\\xxx\\video2.mkv")
getHash("C:\\Users\\xxx\\video3.mkv")
end = time.time()
print(end - start)
Output of synchronous code : 2.4000535011291504
Asynchronous code :
import hashlib
import aiofiles
import asyncio
import time

chunk_size = 4 * 1024

async def get_hash_async(file_path: str):
    async with aiofiles.open(file_path, "rb") as fd:
        md5_hash = hashlib.md5()
        while True:
            chunk = await fd.read(chunk_size)
            if not chunk:
                break
            md5_hash.update(chunk)
    print("get_hash_async : " + md5_hash.hexdigest())

async def check():
    start = time.time()
    t1 = get_hash_async("C:\\Users\\xxx\\video1.mkv")
    t2 = get_hash_async("C:\\Users\\xxx\\video2.mkv")
    t3 = get_hash_async("C:\\Users\\xxx\\video3.mkv")
    await asyncio.gather(t1, t2, t3)
    end = time.time()
    print(end - start)

loop = asyncio.get_event_loop()
loop.run_until_complete(check())
Output of asynchronous code : 27.957366943359375
Am I doing it right? Or are there any changes to be made to improve the performance of the code?
Thanks in advance.
In the sync case, you read the files sequentially, and reading a file in sequential chunks is fast.
In the async case, your event loop blocks whenever it is calculating a hash, so only one hash can be computed at a time. See: What do the terms “CPU bound” and “I/O bound” mean?
If you want to increase the calculation speed, you need to use threads. The threads can run on the CPU in parallel here because hashlib releases the GIL while hashing each chunk. Increasing CHUNK_SIZE should also help.
import hashlib
import os
import time
from pathlib import Path
from multiprocessing.pool import ThreadPool

CHUNK_SIZE = 1024 * 1024

def get_hash(filename):
    md5_hash = hashlib.md5()
    with open(filename, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            md5_hash.update(chunk)
    return md5_hash

if __name__ == '__main__':
    directory = Path("your_dir")
    files = [path for path in directory.iterdir() if path.is_file()]
    number_of_workers = os.cpu_count()

    start = time.time()
    with ThreadPool(number_of_workers) as pool:
        files_hash = pool.map(get_hash, files)
    end = time.time()

    print(end - start)
In the case of calculating the hash of only one file: aiofiles uses a thread for each file, and the OS needs time to create a thread.
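Since the original question has a GUI and other observers running in an event loop, a minimal sketch (my own illustration under that assumption, with hypothetical file names) of keeping the loop responsive is to push the blocking hash work onto the default thread-pool executor and await the results:

import asyncio
import hashlib

CHUNK_SIZE = 1024 * 1024

def get_hash(filename):
    # plain blocking function; hashlib releases the GIL while hashing each chunk
    md5_hash = hashlib.md5()
    with open(filename, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            md5_hash.update(chunk)
    return md5_hash.hexdigest()

async def main():
    loop = asyncio.get_running_loop()
    files = ["video1.mkv", "video2.mkv", "video3.mkv"]  # hypothetical paths
    # run each blocking hash in the default ThreadPoolExecutor, off the event loop
    tasks = [loop.run_in_executor(None, get_hash, name) for name in files]
    print(await asyncio.gather(*tasks))

asyncio.run(main())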

How to stop running my threads after a period of time?

I need to stop my threads after a period of time; in this example I use only 120 seconds. I tried the method below, but it does not work.
from threading import Thread
from Queue import Queue
import os
import time

timeout = 120  # [seconds]
timeout_start = time.time()

#while True :
def OpenWSN():
    os.system("./toto")

def Wireshark():
    os.system(" tshark -i tun0 -T ek -w /home/ptl/PCAP_Brouillon/Sim_Run3rd.pcap > /dev/null ")

def wrapper1(func, queue):
    queue.put(func())

def wrapper2(func, queue):
    queue.put(func())

q = Queue()
Thread(target=wrapper1, args=(OpenWSN, q)).start()
Thread(target=wrapper2, args=(Wireshark, q)).start()

#print (time.time())
print("***************** End Simulation *************************")
os.system("quit")
I think this is what you are trying to achieve:
import threading
from queue import Queue
import os
import time

timeout = 120  # [seconds]
timeout_start = time.time()

def OpenWSN():
    print("OpenWSN:")
    os.system("echo -OpenWSN-")

def Wireshark():
    print("Wireshark:")
    os.system("echo -Wireshark-")

def wrapper1(func, queue):
    queue.put(func())

def wrapper2(func, queue):
    queue.put(func())

q = Queue()
threading.Thread(target=wrapper1, args=(OpenWSN, q)).start()
threading.Thread(target=wrapper2, args=(Wireshark, q)).start()

cv = threading.Condition()
cv.acquire()
cv.wait(timeout)

print("***************** End Simulation *************************")
print(" Simulation Time: {0}s".format(time.time() - timeout_start))
os.system("echo -exit-")
This produces the following output:
C:\temp\StackExchange\StopRunningThread>python -B stop-running-thread.py
OpenWSN:
Wireshark:
-OpenWSN-
-Wireshark-
***************** End Simulation *************************
Simulation Time: 120.04460144042969s
-exit-
What is happening there: you start two threads, each of which launches a separate process in the system. After the threads have started, you return to your main thread, allocate a condition variable (the "lock") and wait until it is signalled or the timeout elapses.
In this particular case nobody signals the condition, so the only way for the application to finish is to wait until the timeout happens.
I would extend your application so that each thread function signals the condition when it finishes; then the main thread terminates as soon as both thread functions are done instead of always waiting out the full timeout.
But that was not part of your question, so I assume you can live without signalling.
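A sketch of that extension (my own illustration, not part of the original answer): each worker notifies the condition variable when its work is done, and the main thread waits until both have signalled or the timeout expires, whichever comes first:

import threading
import time

timeout = 120  # [seconds]
cv = threading.Condition()
done_count = 0

def run_and_signal(func):
    global done_count
    func()           # the blocking work, e.g. an os.system(...) call
    with cv:
        done_count += 1
        cv.notify()  # wake the main thread so it re-checks the count

def OpenWSN():
    time.sleep(1)    # placeholder for os.system("./toto")

def Wireshark():
    time.sleep(2)    # placeholder for the tshark command

threading.Thread(target=run_and_signal, args=(OpenWSN,), daemon=True).start()
threading.Thread(target=run_and_signal, args=(Wireshark,), daemon=True).start()

with cv:
    finished = cv.wait_for(lambda: done_count == 2, timeout=timeout)

print("***************** End Simulation *************************")
print("both workers finished" if finished else "timed out")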

Kill threads after a period of time in Python

I have a Python script with threads, and I need the following behaviour: if after, for example, 1 hour the threads have not finished, kill all the threads and end the script; but as long as the hour is not up, wait for all my threads to finish.
I tried using a daemon thread that sleeps for the hour and then calls sys.exit(), but it does not work for me: the script still waits for the sleeping worker threads, so it only ends once the threads have finished and the sys.exit() has no effect.
import socket, threading, time, sys
from sys import argv
import os

acc_time = 0
transactions_ps = 5

ins = open(sys.argv[1], 'r')
msisdn_list = []
for line in ins:
    msisdn_list.append(line.strip('\n'))
    # print line
ins.close()

def worker(msisdn_list):
    semaphore.acquire()
    global transactions_ps
    print " ***** ", threading.currentThread().getName(), "Lanzado"
    count = 1
    acc_time = 0
    print "len: ", len(msisdn_list)
    for i in msisdn_list:
        try:
            init = time.time()
            time.sleep(2)
            print "sleeping...", i
            time.sleep(4)
            final = time.time()
            acc_time = acc_time + final - init
            print acc_time
        except IOError:
            print "Connection failed", sys.exc_info()[0]
    print "Deteniendo ", threading.currentThread().getName()
    semaphore.release()

def kill_process(secs_to_die):
    time.sleep(secs_to_die)
    sys.exit()

seconds_to_die = 3600
thread_kill = threading.Thread(target=kill_process, args=(seconds_to_die,))
thread_kill.start()

max_con = 5
semaphore = threading.BoundedSemaphore(max_con)
for i in range(0, 28, transactions_ps):
    w = threading.Thread(target=worker, args=(msisdn_list[i:i+transactions_ps-1],))
    w.setDaemon(True)
    w.start()
How can I do it?
A minimal change to your code that would fix the issue is threading.Barrier:
barrier = Barrier(number_of_threads, timeout=3600)
# create (number_of_threads - 1) threads, pass them barrier
# each thread calls barrier.wait() on exit
barrier.wait() # after number_of_threads .wait() calls or on timeout it returns
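A self-contained sketch of that idea (illustrative; the sleep stands in for the real work):

import threading
import time

number_of_threads = 5                 # 4 workers plus the main thread
barrier = threading.Barrier(number_of_threads, timeout=3600)

def worker(n):
    time.sleep(n)                     # placeholder for the real work
    try:
        barrier.wait()                # tell the main thread this worker is done
    except threading.BrokenBarrierError:
        pass                          # the barrier timed out; the daemon thread just dies

for n in range(number_of_threads - 1):
    threading.Thread(target=worker, args=(n,), daemon=True).start()

try:
    barrier.wait()                    # returns once every worker has arrived
    print("all threads finished")
except threading.BrokenBarrierError:
    exit("timeout: abandoning the unfinished daemon threads")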
A simpler alternative is to use multiprocessing.dummy.Pool that creates daemon threads:
from multiprocessing.dummy import Pool  # use threads

start = timer()
endtime = start + 3600
for result in pool.imap_unordered(work, args):
    if timer() > endtime:
        exit("timeout")
The code doesn't timeout until a work item is done i.e., it expects that processing a single item from the list doesn't take long.
Complete example:
#!/usr/bin/env python3
import logging
import multiprocessing as mp
from multiprocessing.dummy import Pool
from time import monotonic as timer, sleep

info = mp.get_logger().info

def work(i):
    info("start %d", i)
    sleep(1)
    info("end %d", i)

seconds_to_die = 3600
max_con = 5

mp.log_to_stderr().setLevel(logging.INFO)  # enable logging
pool = Pool(max_con)  # no more than max_con at a time
start = timer()
endtime = start + seconds_to_die
for _ in pool.imap_unordered(work, range(10000)):
    if timer() > endtime:
        exit("timeout")
You may refer to this implementation of KThread:
http://python.todaysummary.com/q_python_45717.html

The efficiency comparison between gevent and threads

Recently I have been working on a gevent demo, and I tried to compare the efficiency of gevent against threads. Generally speaking, the gevent code should be more efficient than the threaded code. But when I use the time command to profile the program, I get an unexpected result (my command is time python FILENAME.py 50 1000; the last two parameters mean the pool size or the thread count, so I vary those two numbers in the table below). The result shows that the threads are more efficient than the gevent code, so I want to know why this happens and what's wrong with my program. Thanks.
gevent VS thread
My code is below (the main idea is to use threads or gevent to send multiple HTTP requests):
******This is the thread version code******
# _*_ coding: utf-8 _*_
import sys
reload(sys)
sys.setdefaultencoding("utf8")
import requests
import threading
import time
import urllib2

finished = 0

def GetUrl(pagenum):
    url = 'http://opendata.baidu.com/zhaopin/s?p=mini&wd=%B0%D9%B6%C8&pn=' + \
        str(pagenum*20) + '&rn=20'
    return url

def setUrlSet():
    for i in xrange(requestnum):
        urlnum = i % 38
        urlset.append(GetUrl(urlnum))

def GetResponse(pagenum):
    try:
        r = requests.get(urlset[pagenum])
    except Exception, e:
        print e
        pass

def DigJobByPagenum(pagenum, requestnum):
    init_num = pagenum
    print '%d begin' % init_num
    while pagenum < requestnum:
        GetResponse(pagenum)
        pagenum += threadnum
    print '%d over' % init_num

def NormalThread(threadnum):
    startime = time.time()
    print "%s is running..." % threading.current_thread().name
    threads = []
    global finished, requestnum
    for i in xrange(threadnum):
        thread = threading.Thread(target=DigJobByPagenum, args=(i, requestnum))
        threads.append(thread)
    for t in threads:
        t.daemon = True
        t.start()
    for t in threads:
        t.join()
        finished += 1
    endtime = time.time()
    print "%s is stop.The total time is %0.2f" % \
        (threading.current_thread().name, (endtime - startime))

def GetAvageTime(array):
    alltime = 0.0
    for i in array:
        alltime += i
    avageTime = alltime/len(array)
    return avageTime

if __name__ == '__main__':
    threadnum = int(sys.argv[1])
    requestnum = int(sys.argv[2])
    print 'threadnum : %s,requestnum %s ' % (threadnum, requestnum)
    originStartTime = time.time()
    urlset = []
    setUrlSet()
    NormalThread(threadnum)
******This is the gevent version code******
# _*_ coding: utf-8 _*_
import sys
reload(sys)
sys.setdefaultencoding("utf8")
from gevent import monkey
monkey.patch_all()
import gevent
from gevent import pool
import requests
import time

finished = 0

def GetUrl(pagenum):
    url = 'http://opendata.baidu.com/zhaopin/s?p=mini&wd=%B0%D9%B6%C8&pn=' + \
        str(pagenum*20) + '&rn=20'
    return url

def setUrlSet():
    for i in xrange(requestnum):
        urlnum = i % 38
        urlset.append(GetUrl(urlnum))

def GetResponse(url):
    startime = time.time()
    r = requests.get(url)
    print url
    endtime = time.time()
    spendtime = endtime - startime
    NormalSpendTime.append(spendtime)
    global finished
    finished += 1
    print finished

def GetAvageTime(array):
    alltime = 0.0
    for i in array:
        alltime += i
    avageTime = alltime/len(array)
    return avageTime

def RunAsyncJob():
    jobpool = pool.Pool(concurrent)
    for url in urlset:
        jobpool.spawn(GetResponse, url)
    jobpool.join()
    endtime = time.time()
    allSpendTime = endtime - originStartime
    print 'Total spend time is %0.3f, total request num is %s within %s \
        seconds' % (allSpendTime, finished, timeoutNum)
    print 'Each request time is %0.3f' % (GetAvageTime(NormalSpendTime))

if __name__ == '__main__':
    concurrent = int(sys.argv[1])
    requestnum = int(sys.argv[2])
    timeoutNum = 100
    NormalSpendTime = []
    urlset = []
    urlActionList = []
    setUrlSet()
    originStartime = time.time()
    RunAsyncJob()
Try
gevent.monkey.patch_all(httplib=True)
It seems that by default gevent does not patch httplib (have a look at http://www.gevent.org/gevent.monkey.html : httplib=False), so you are actually making blocking requests and losing all the advantages of the asynchronous framework. Although I'm not sure whether requests uses httplib.
If that doesn't work, then have a look at this lib:
https://github.com/kennethreitz/grequests
Re: httplib=False
You are already using the requests library to make web calls. It has a gevent flavour called grequests:
https://github.com/kennethreitz/grequests
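A minimal sketch of how grequests is typically used (my illustration, with hypothetical URLs, not code from the answer):

import grequests  # gevent-based wrapper around requests; applies its monkey patching on import

urls = ['http://example.com/page/%d' % n for n in range(10)]  # hypothetical URLs

# build unsent requests lazily, then send them concurrently on a pool of greenlets
unsent = (grequests.get(u) for u in urls)
responses = grequests.map(unsent, size=5)  # size bounds the concurrency

for r in responses:
    if r is not None:  # failed requests come back as None by default
        print(r.status_code, r.url)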
Overall, I don't immediately see much reason to prefer one style of threading to the other if your pool is this small. Of course real threads are relatively heavy (they start with an 8 MB stack), but you have to weigh that against the size of your job.
My take: try both (done), verify you are doing both right (to do), and let the numbers do the talking.
