I have a program that reads some input text files and writes all of them into one output file. I used two threads so it would run faster!
I tried the following Python code with both one thread and two threads. Why does it run faster with one thread than with two threads?
import glob
import time
import timeit
from threading import Thread

processedFiles = []

# Define a function for the threads
def print_time(threadName, delay):
    for file in glob.glob("*.txt"):
        # check if file has been read by another thread already
        if file not in processedFiles:
            processedFiles.append(file)
            f = open(file, "r")
            lines = f.readlines()
            f.close()
            time.sleep(delay)
            f = open('myfile', 'a')
            f.write("%s \n" % lines)  # python will convert \n to os.linesep
            f.close()  # you can omit in most cases as the destructor will call it
    print "%s: %s" % (threadName, time.ctime(time.time()))
# Create two threads as follows
try:
    f = open('myfile', 'r+')
    f.truncate()
    start = timeit.default_timer()
    t1 = Thread(target=print_time, args=("Thread-1", 0,))
    t2 = Thread(target=print_time, args=("Thread-2", 0,))
    t1.start()
    t2.start()
    stop = timeit.default_timer()
    print stop - start
except:
    print "Error: unable to start thread"
You have several problems I'll get to in a moment, but generally your program is disk-bound (it can't go faster than your hard drive), so even a properly threaded version isn't any faster. It can be hard to measure disk performance because of the file system cache: you run this once with threads and you go at hard-drive speed, you run it again without threads and the files are in the cache, so it goes fast. That makes it hard to figure out how the code will perform later, when the data is no longer in the system cache.
So now for the problems.
if file not in processedFiles: isn't thread safe. Both threads could look at an empty list and decide to copy the same file. At a minimum you need a lock. Or you could do the glob once and push the filenames onto a queue that the threads read from.
Reading the file line by line and then writing the list back out is a crazy slow way to copy a file. Use shutil.copyfileobj instead - it's built to copy files efficiently.
f = open('myfile','a') - now you have multiple file descriptors to a single file, and each advances its file pointer independently... so one overwrites the other.
f.write("%s \n" %lines) is also not thread safe. You could end up with bits of the files interleaving each other in the output file.
stop = timeit.default_timer() - you didn't wait for the threads to complete their work, so you didn't really measure anything useful. Your code severely under-reports execution time.
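For example, a minimal fix for the timing alone (leaving the other problems in place) would be to join both threads before stopping the clock:
t1.start()
t2.start()
# wait for both workers to actually finish before reading the timer;
# without these joins you only measure how long it took to *start* the threads
t1.join()
t2.join()
stop = timeit.default_timer()
print stop - start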
You are much better off with a simple single-threaded script.
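As a rough sketch of that single-threaded approach (assuming the same *.txt inputs and the same output file name as above), something like:
import glob
import shutil

with open('myfile', 'wb') as out:
    for name in glob.glob('*.txt'):
        with open(name, 'rb') as src:
            shutil.copyfileobj(src, out)   # efficient block-wise copy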
Related
I am looking for a way to make my loop run faster. With the current code the calculation takes forever, so I am looking for a way to make my code more efficient.
EDIT: I don't think I explained it well. I need to create a program that generates all possible combinations of 8 characters, including uppercase letters, lowercase letters and digits, then hashes each combination with MD5 and saves the results to a file.
But I have new questions: would this process really take 63 years, and how much would the file weigh at the end of the script? I was going to buy a VPS server for this task, but if it takes 63 years I'd better not even try, haha.
I am new to coding and all help is appreciated.
import hashlib
from random import choice

longitud = 8
valores = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def enc(string):
    m = hashlib.md5()
    m.update(string.encode('utf-8'))
    return m.hexdigest()

def code():
    p = ""
    p = p.join([choice(valores) for i in xrange(longitud)])
    text = p
    return text

i = 1
for i in xrange(2000000000000000000):
    cod = code()
    md = enc(cod)
    print cod
    print md
    i += 1
    print i
    f = open('datos.txt', 'a')
    f.write("%s " % cod)
    f.write("%s" % md)
    f.write('\n')
    f.close()
You're not utilizing the full power of modern computers, which have multiple central processing units! This is by far the best optimization you can make here, since the task is CPU-bound. (Note: for I/O-bound operations, multithreading with the threading module is the suitable choice.)
So let's see how Python makes this easy with the multiprocessing module (read the comments):
import hashlib
# you're sampling a string so you need sample, not 'choice'
from random import sample
import multiprocessing
# use a thread to synchronize writing to file
import threading

# open up to 4 processes per cpu
processes_per_cpu = 4
processes = processes_per_cpu * multiprocessing.cpu_count()
print "will use %d processes" % processes

longitud = 8
valores = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
# check on smaller ranges to compare before trying your range... :-)
RANGE = 200000

def enc(string):
    m = hashlib.md5()
    m.update(string.encode('utf-8'))
    return m.hexdigest()

# we synchronize the results to be written using a queue shared by processes
q = multiprocessing.Manager().Queue()

# this is the single point where results are written to the file
# the file is opened ONCE (you open it on every iteration, that's bad)
def write_results():
    with open('datos.txt', 'w') as f:
        while True:
            msg = q.get()
            if msg == 'close':
                break
            else:
                f.write(msg)

# this is the function each process uses to calculate a single result
def calc_one(i):
    s = ''.join(sample(valores, longitud))
    md = enc(s)
    q.put("%s %s\n" % (s, md))

# we start a process pool of workers to spread work and not rely on
# a single cpu
pool = multiprocessing.Pool(processes=processes)

# this is the thread that will write the results coming from
# other processes using the queue, so its execution target is write_results
t = threading.Thread(target=write_results)
t.start()

# we use 'map_async' to not block ourselves; this is redundant here,
# but it's best practice to use this when you don't HAVE to block ('pool.map')
pool.map_async(calc_one, xrange(RANGE))

# wait for completion
pool.close()
pool.join()

# tell result-writing thread to stop
q.put('close')
t.join()
There are probably more optimizations to be done in this code, but for any CPU-bound task like the one you present, the major optimization is using multiprocessing.
Note: a trivial optimization of the file writes would be to aggregate several results from the queue and write them together (useful if you have so many CPUs that they exceed the single writing thread's speed).
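A hedged sketch of that batching idea, as a drop-in variant of the write_results() function above (BATCH is a made-up tuning knob, not something from the original code):
BATCH = 1000

def write_results():
    with open('datos.txt', 'w') as f:
        buf = []
        while True:
            msg = q.get()
            if msg == 'close':
                break
            buf.append(msg)
            if len(buf) >= BATCH:
                f.writelines(buf)   # one buffered write instead of a thousand
                buf = []
        if buf:                     # flush whatever is left when shutting down
            f.writelines(buf)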
Note 2: since the OP was looking to go over combinations/permutations of stuff, it should be noted that there is a module for doing just that, and it's called itertools.
Note that you should use
for cod in itertools.product(valores, repeat=longitud):
rather than picking the strings via random.sample, as this will only ever visit a given string once.
Also note that for your given values this loop has 62**8 = 218340105584896 iterations, and at 42 bytes per line the output file would occupy 9170284434565632 bytes, or about 8 PB.
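A minimal sketch of how that would look, reusing enc(), valores and longitud from the code above and only printing the first ten candidates (itertools.product yields tuples of characters, so they need to be joined back into strings):
import itertools

for i, combo in enumerate(itertools.product(valores, repeat=longitud)):
    cod = ''.join(combo)
    print cod, enc(cod)
    if i >= 9:          # stop early; the full loop has 62**8 iterations
        break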
Profile your program first (with the cProfile module: https://docs.python.org/2/library/profile.html and http://ymichael.com/2014/03/08/profiling-python-with-cprofile.html), but I'm willing to bet your program is IO-bound (if your CPU usage never reaches 100% on one core, it means your hard drive is too slow to keep up with the execution speed of the rest of the program).
With that in mind, start by changing your program so that:
It opens and closes the file outside of the loop (opening and closing files is super slow).
It only makes one write call in each iteration (those each translate to a syscall, which are expensive), like so: f.write("%s %s\n" % (cod, md))
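A rough sketch of the loop with both of those changes applied, keeping the rest of the original script (code(), enc(), the huge xrange) unchanged:
f = open('datos.txt', 'a')              # opened once, outside the loop
for i in xrange(2000000000000000000):
    cod = code()
    md = enc(cod)
    f.write("%s %s\n" % (cod, md))      # a single write call per iteration
f.close()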
Although it helps with debugging, I have found that printing makes a program run slower, so maybe don't print quite as much. Also, I'd move the f = open('datos.txt', 'a') out of the loop, as opening the same file over and over again causes some time issues, and then move the f.close() out of the loop as well, to the end of the program.
I have a file which contains a lot of data. Each row is a record, and I am trying to do some ETL work against the whole file. Right now I am using standard input to read the data line by line. The cool thing about this is that the script is very flexible to integrate with other scripts and shell commands. I write the result to standard output. For example:
$ cat input_file
line1
line2
line3
line4
...
My current python code looks like this - parse.py
import sys

for line in sys.stdin:
    result = ETL(line)  # ETL is some self-defined function which takes a while to execute.
    print result
This is how it is run right now:
cat input_file | python parse.py > output_file
I have looked at the threading module of Python and I am wondering whether the performance would be dramatically improved if I used it.
Question 1: How should I plan the quota (number of lines) for each thread, and why?
...
counter = 0
buffer = []
for line in sys.stdin:
    buffer.append(line)
    if counter % 5 == 0:  # maybe assign 5 rows to each thread? if not, is there a rule of thumb to determine
        counter = 0
        thread = parser(buffer)
        buffer = []
        thread.start()
Question 2: Multiple threads might print their results back to stdout at the same time. How do I organize them and avoid the situation below?
import threading
import time

class parser(threading.Thread):
    def __init__(self, data_input):
        threading.Thread.__init__(self)
        self.data_input = data_input

    def run(self):
        for elem in self.data_input:
            time.sleep(3)
            print elem + 'Finished'

work = ['a', 'b', 'c', 'd', 'e', 'f']

thread1 = parser(['a', 'b'])
thread2 = parser(['c', 'd'])
thread3 = parser(['e', 'f'])

thread1.start()
thread2.start()
thread3.start()
The output is really ugly, where one row contains the outputs from two threads.
aFinished
cFinishedeFinished
bFinished
fFinished
dFinished
Taking your second question first, this is what mutexes are for. You can get the cleaner output that you want by using a lock to coordinate among the parsers and ensure that only one thread has access to the output stream during a given period of time:
class parser(threading.Thread):
    output_lock = threading.Lock()

    def __init__(self, data_input):
        threading.Thread.__init__(self)
        self.data_input = data_input

    def run(self):
        for elem in self.data_input:
            time.sleep(3)
            with self.output_lock:
                print elem + 'Finished'
As regards your first question, note that it's probably the case that multi-threading will provide no benefit for your particular workload. It largely depends on whether the work you do with each input line (your ETL function) is primarily CPU-bound or IO-bound. If the former (which I suspect is likely), threads will be of no help, because of the global interpreter lock. In that case, you would want to use the multiprocessing module to distribute work among multiple processes instead of multiple threads.
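If ETL() does turn out to be CPU-bound, a minimal multiprocessing sketch might look like the following (this assumes ETL is defined at module level so it can be pickled; the chunksize of 100 is just a guess to amortize inter-process overhead):
import sys
import multiprocessing

def main():
    pool = multiprocessing.Pool()       # one worker process per core by default
    # imap distributes the lines to the workers but keeps the results in input order
    for result in pool.imap(ETL, sys.stdin, chunksize=100):
        print result
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()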
But you can get the same results with an easier to implement workflow: Split the input file into n pieces (using, e.g., the split command); invoke the extract-and-transform script separately on each subfile; then concatenate the resulting output files.
One nitpick: "using standard input to read the data line by line because it won't load the whole file into memory" involves a misconception. You can read a file line by line from within Python by, e.g., replacing sys.stdin with a file object in a construct like:
for line in sys.stdin:
See also the readline() method of file objects, and note that read() can take as parameter the maximum number of bytes to read.
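For example, a plain file object is just as lazy as sys.stdin here:
with open('input_file') as f:
    for line in f:              # reads one line at a time, not the whole file
        print ETL(line)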
Whether threading will be helpful to you is highly dependent on your situation. In particular, if your ETL() function involves a lot of disk access, then threading would likely give you a pretty significant speed improvement.
In response to your first question, I've always found that it just depends. There are a lot of factors at play when determining the ideal number of threads, and many of them are program-dependent. If you're doing a lot of disk access (which is pretty slow), for example, then you'll want more threads to take advantage of the downtime while waiting for disk access. If the program is CPU-bound, though, tons of threads may not be super helpful. So, while it may be possible to analyze all the factors to come up with an ideal number of threads, it's usually a lot faster to make an initial guess and then adjust from there.
More specifically, though, assigning a certain number of lines to each thread probably isn't the best way to go about divvying up the work. Consider, for example, if one line takes a particularly long time to process. It would be best if one thread could work away at that one line and the other threads could each do a few more lines in the meantime. The best way to handle this is to use a Queue. If you push each line into a Queue, then each thread can pull a line off the Queue, handle it, and repeat until the Queue is empty. This way, the work gets distributed such that no thread is ever without work to do (until the end, of course).
Now, the second question. You're definitely right that writing to stdout from multiple threads at once isn't an ideal solution. Ideally, you would arrange things so that the writing to stdout happens in only one place. One great way to do that is to use a Queue. If you have each thread write its output to a shared Queue, then you can spawn an additional thread whose sole task is to pull items out of that Queue and print them to stdout. By restricting the printing to just one thread, you'll avoid the issues inherent in multiple threads trying to print at once.
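A minimal sketch of that two-queue arrangement (the worker count, the sentinel object and the helper names are all my own choices, and ETL is the function from the question):
import sys
import threading
from Queue import Queue

NUM_WORKERS = 4
work_q = Queue()
out_q = Queue()
DONE = object()                      # sentinel telling a thread to stop

def worker():
    while True:
        line = work_q.get()
        if line is DONE:
            out_q.put(DONE)          # tell the printer this worker is finished
            break
        out_q.put(ETL(line))

def printer():
    finished = 0
    while finished < NUM_WORKERS:    # run until every worker has signalled DONE
        item = out_q.get()
        if item is DONE:
            finished += 1
        else:
            print item

workers = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
printer_thread = threading.Thread(target=printer)
for t in workers:
    t.start()
printer_thread.start()

for line in sys.stdin:               # the main thread just feeds the work queue
    work_q.put(line)
for _ in range(NUM_WORKERS):
    work_q.put(DONE)

for t in workers:
    t.join()
printer_thread.join()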
Hello, I have a program that looks through a range of data and finds anomalies in that data. To make my program faster I incorporated threads (66 in total). When my program finds an anomaly I want it written to a file, but when I try to write to the file from within multiple threads, it doesn't write everything.
class myThread(threading.Thread):
    def __init__(self, arg1, arg2, lock, output):
        threading.Thread.__init__(self)
        self.arg1 = arg1
        self.arg2 = arg2
        self.lock = lock
        self.file = output

    def run(self):
        # print "Starting " + self.name
        main(self.arg1, self.arg2, self.lock, self.file)
        # print "Exiting " + self.name

def main(START_IP, END_IP, lock, File):
    # store found DNS servers
    foundDNS = []
    # scan all the ip addresses in the range
    for i0 in range(START_IP[0], END_IP[0]+1):
        for i1 in range(START_IP[1], END_IP[1]+1):
            for i2 in range(START_IP[2], END_IP[2]+1):
                for i3 in range(START_IP[3], END_IP[3]+1):
                    # build ip address
                    ipaddr = str(i0)+"."+str(i1)+"."+str(i2)+"."+str(i3)
                    print "Scanning "+ipaddr+"...",
                    # scan address
                    ret = ScanDNS(ipaddr, 10)
                    if ret == True:
                        foundDNS.append(ipaddr)
                        print "Found!"
                        lock.acquire()
                        File.write(ipaddr)
                        File.write("\n")
                        File.flush()
                        lock.release()
                    else:
                        print
file = open("file.txt","wb")
lock = threading.Lock()
thread1 = myThread(START_IP,END_IP,lock,)
thread1.start()
This uses my exact same myThread class, just with the required arguments for main to manipulate the data. If I run my code for about a minute as it's scanning over DNS servers, I should get maybe 20-30 DNS servers saved into a file, but I generally get this:
FILE.TXT
2.2.1.2
8.8.8.8
31.40.40
31.31.40.40
31.31.41.41
I know for a fact (because I watched the scanning output) that it finds far more than it writes. So why are some written and some not?
I don't know why your code is not working, but I can hazard a guess that it is due to race conditions. Hopefully someone knowledgeable can answer that part of your question.
However, I've encountered a similar problem before, and I solved it by moving the file writing code to a single output thread. This thread read from a synchronized queue to which other threads pushed data to be written.
Also, if you happen to be working on a machine with multiple cores, it's better to use multiprocessing instead of threading. The latter only runs one thread at a time because of the GIL, while the former does not have this limitation.
Instead of providing the file, provide a Queue. Spawn a new thread to read from the Queue and do the file writing. Or use locks everywhere, around the prints too, because output from different threads can otherwise collide.
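A rough sketch of that queue-plus-writer-thread approach against the code above (result_q and the 'STOP' sentinel are my own names; the scanner threads would call result_q.put(ipaddr + "\n") instead of File.write):
import threading
from Queue import Queue

result_q = Queue()

def writer():
    with open("file.txt", "w") as f:
        for item in iter(result_q.get, "STOP"):   # loop until the sentinel arrives
            f.write(item)

writer_thread = threading.Thread(target=writer)
writer_thread.start()

# ... start the scanner threads, which put(ipaddr + "\n") onto result_q ...

# after all scanner threads have been joined:
result_q.put("STOP")
writer_thread.join()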
To avoid potential errors or misuse when accessing a file from multiple threads, you can try using logging to write down your results; its handlers are thread-safe.
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)
file_handler = logging.FileHandler("file.txt")   # FileHandler needs the output file name
formatter = logging.Formatter("%(message)s")     # your format
file_handler.setFormatter(formatter)
logger.addHandler(file_handler)

# each thread can then simply call logger.info(ipaddr)
Check the documentation for File Objects:
File.flush() is not enough to ensure that your data is written to disk; add
os.fsync(File.fileno()) just after it to make that happen.
I have a huge file and need to read and process it.
with open(source_filename) as source, open(target_filename, 'w') as target:
    for line in source:
        target.write(do_something(line))
        do_something_else()
Can this be accelerated with threads? If I spawn a thread per line, will this have a huge overhead cost?
edit: to make this question not a discussion, what should the code look like?
with open(source_filename) as source, open(target_filename) as target:
    ?
@Nicoretti: in each iteration I need to read a line of several KB of data.
update 2: the file may be a bz2, so Python may have to wait for unpacking:
$ bzip2 -d country.osm.bz2 | ./my_script.py
You could use three threads: for reading, processing and writing. The possible advantage is that the processing can take place while waiting for I/O, but you need to take some timings yourself to see if there is an actual benefit in your situation.
import threading
import Queue

QUEUE_SIZE = 1000
sentinel = object()

def read_file(name, queue):
    with open(name) as f:
        for line in f:
            queue.put(line)
    queue.put(sentinel)

def process(inqueue, outqueue):
    for line in iter(inqueue.get, sentinel):
        outqueue.put(do_something(line))
    outqueue.put(sentinel)

def write_file(name, queue):
    with open(name, "w") as f:
        for line in iter(queue.get, sentinel):
            f.write(line)

inq = Queue.Queue(maxsize=QUEUE_SIZE)
outq = Queue.Queue(maxsize=QUEUE_SIZE)

threading.Thread(target=read_file, args=(source_filename, inq)).start()
threading.Thread(target=process, args=(inq, outq)).start()
write_file(target_filename, outq)
It is a good idea to set a maxsize for the queues to prevent ever-increasing memory consumption. The value of 1000 is an arbitrary choice on my part.
Does the processing stage take a relatively long time, i.e., is it CPU-intensive? If not, then no, you don't win much by threading or multiprocessing it. If your processing is expensive, then yes. So you need to profile to know for sure.
If you spend relatively more time reading the file than processing it, i.e. it is big, then you can't win in performance by using threads; the bottleneck is just the IO, which threads don't improve.
This is the exact sort of thing which you should not try to analyse a priori, but instead should profile.
Bear in mind that threading will only help if the per-line processing is heavy. An alternative strategy would be to slurp the whole file into memory, and process it in memory, which may well obviate threading.
Whether you have a thread per line is, once again, something for fine-tuning, but my guess is that unless parsing the lines is pretty heavy, you may want to use a fixed number of worker threads.
There is another alternative: spawn sub-processes, and have them do the reading, and the processing. Given your description of the problem, I would expect this to give you the greatest speed-up. You could even use some sort of in-memory caching system to speed up the reading, such as memcached (or any of the similar-ish systems out there, or even a relational database).
In CPython, threading is limited by the global interpreter lock — only one thread at a time can actually be executing Python code. So threading only benefits you if either:
you are doing processing that doesn't require the global interpreter lock; or
you are spending time blocked on I/O.
Examples of (1) include applying a filter to an image in the Python Imaging Library, or finding the eigenvalues of a matrix in numpy. Examples of (2) include waiting for user input, or waiting for a network connection to finish sending data.
So whether your code can be accelerated using threads in CPython depends on what exactly you are doing in the do_something call. (But if you are parsing the line in Python then it is very unlikely that you can speed this up by launching threads.) You should also note that if you do start launching threads then you will face a synchronization problem when you are writing the results to the target file. There is no guarantee that threads will complete in the same order that they were started, so you will have to take care to ensure that the output comes out in the right order.
Here's a maximally threaded implementation that has threads for reading the input, writing the output, and one thread for processing each line. Only testing will tell you if this is faster or slower than the single-threaded version (or Janne's version with only three threads).
from threading import Thread
from Queue import Queue

def process_file(f, source_filename, target_filename):
    """
    Apply the function `f` to each line of `source_filename` and write
    the results to `target_filename`. Each call to `f` is evaluated in
    a separate thread.
    """
    worker_queue = Queue()
    finished = object()

    def process(queue, line):
        "Process `line` and put the result on `queue`."
        queue.put(f(line))

    def read():
        """
        Read `source_filename`, create an output queue and a worker
        thread for every line, and put that worker's output queue onto
        `worker_queue`.
        """
        with open(source_filename) as source:
            for line in source:
                queue = Queue()
                Thread(target=process, args=(queue, line)).start()
                worker_queue.put(queue)
        worker_queue.put(finished)

    Thread(target=read).start()
    with open(target_filename, 'w') as target:
        for output in iter(worker_queue.get, finished):
            target.write(output.get())
Looking for some eyeballs to verify that the following chunk of pseudo-Python makes sense. I'm looking to spawn a number of threads to implement some in-proc functions as fast as possible. The idea is to spawn the threads in the master loop, so the app will run the threads simultaneously in a parallel/concurrent manner.
chunk of code
- get the filenames from a dir
- write each filename to a queue
- spawn a thread for each filename, where each thread waits/reads value/data from the queue
- the threadParse function then handles the actual processing based on the file that's included via the "execfile" function...
# System modules
import os
import time
from Queue import Queue
from threading import Thread

# Local modules
#import feedparser

# Set up some global variables
appqueue = Queue()

# more than the app will need --
# this matches the number of files that will ever be in the urldir
num_fetch_threads = 200

def threadParse(q):
    # decompose the packet to get the various elements
    line = q.get()
    college, level, packet = decompose(line)
    # build name of included file
    fname = college + "_" + level + "_Parse.py"
    execfile(fname)
    q.task_done()

# setup the master loop
while True:
    time.sleep(2)
    # get the files from the dir, setup threads
    filelist = os.listdir("/urldir")
    if filelist:
        for file_ in filelist:
            worker = Thread(target=threadParse, args=(appqueue,))
            worker.start()

        # again, get the files from the dir, setup the queue
        filelist = os.listdir("/urldir")
        for file_ in filelist:
            # stuff the filename in the queue
            appqueue.put(file_)

    # Now wait for the queue to be empty, indicating that we have
    # processed all of the downloads.
    # don't care about this part
    #print '*** Main thread waiting'
    #appqueue.join()
    #print '*** Done'
Thoughts/comments/pointers are appreciated...
thanks
If I understand this right: You spawn lots of threads to get things done faster.
This only works if the main part of the job done in each thread is done without holding the GIL. So if there is a lot of waiting for data from network, disk or something like that, it might be a good idea.
If each of the tasks are using a lot of CPU, this will run pretty much like on a single core 1-CPU machine and you might as well do them in sequence.
I should add that what I wrote is true for CPython, but not necessarily for Jython/IronPython.
Also, I should add that if you need to utilize more CPUs/cores, there's the multiprocessing module that might help.
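As a hedged sketch of that last point (parse_one is a hypothetical stand-in for the per-file work; the execfile-based dispatch above would need reworking, since the work function has to be importable in the worker processes):
import os
import multiprocessing

def parse_one(fname):
    # placeholder for the real per-file processing
    return fname

if __name__ == '__main__':
    pool = multiprocessing.Pool()                    # one worker per core
    results = pool.map(parse_one, os.listdir('/urldir'))
    pool.close()
    pool.join()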