Python file handling within threads

Python file handling within threads - python

Hello I have a program that looks through a range of data and finds anomalies in that data. To make my program faster I incorporated the use of threads (66 in total) now when my program finds the anomalies I would want it to write it to a file but however when i try to write to the file from within multiple threads it wont write.here is a segment of it
Python:
import threading
class myThread(threading.Thread):
def __init__(self,lock,output):
threading.Thread.__init__(self)
self.lock = lock
self.file = output
def run(self):
main(self.lock,self.file)
def main(lock,file):
lock.acquire()
file.write("It wont write :(")
lock.release
if __name__ == "__main__":
lock = threading.Lock()
file = open("file.txt","wb")
thread1 = myThread(lock,file)
thread1.start()
here is my code on a much smaller scale
my error message is that file is not open for writing
EDIT:this code for some reason works but my full length code seems to not work so I am going to post it
def main(START_IP,END_IP,lock,File):
# store found DNS servers
foundDNS=[]
# scan all the ip addresses in the range
for i0 in range(START_IP[0], END_IP[0]+1):
for i1 in range(START_IP[1], END_IP[1]+1):
for i2 in range(START_IP[2], END_IP[2]+1):
for i3 in range(START_IP[3], END_IP[3]+1):
# build ip addres
ipaddr=str(i0)+"."+str(i1)+"."+str(i2)+"."+str(i3)
print "Scanning "+ipaddr+"...",
# scan address
ret=ScanDNS(ipaddr, 10)
if ret==True:
foundDNS.append(ipaddr)
print "Found!"
lock.acquire()
File.write(ipaddr)
File.write("\n")
File.flush()
lock.release()
else:
print
This uses my exact same MyThread class just with the required arguments for main to manipulate the data. If I run my code for about a minute as its scanning over DNS servers
I should get maybe 20-30 DNS servers saved into a file but I generally get this:
FILE.TXT
2.2.1.2
8.8.8.8
31.40.40
31.31.40.40
31.31.41.41
I know for a fact (because I watched the scanning output) and that it hardly all of them. So why is some writing and some not?

This may be a typo, but this:
lock.release
should have parentheses:
lock.release()
Also, your writes will be buffered until the first newline or flush().

Check the documentation for File Objects:
flush() does not necessarily write the file’s data to disk. Use
flush() followed by os.fsync() to ensure this behavior.
File.fileno() gets the file descriptor needed by os.fsync():
with lock:
File.write(ipaddr)
File.write("\n")
File.flush()
os.fsync(File.fileno())

Related

Time Comparison Multithreading

I have a program that reads some input text files and write all of them into one separate file. I used two threads so it runs faster!
I tried the following python code with both one thread and two threads! Why when I run with one thread it runs faster than when I run it with two threads?
processedFiles=[]
# Define a function for the threads
def print_time( threadName, delay):
for file in glob.glob("*.txt"):
#check if file has been read by another thread already
if file not in processedFiles:
processedFiles.append(file)
f = open(file,"r")
lines = f.readlines()
f.close()
time.sleep(delay)
f = open('myfile','a')
f.write("%s \n" %lines) # python will convert \n to os.linesep
f.close() # you can omit in most cases as the destructor will call it
print "%s: %s" % ( threadName, time.ctime(time.time()) )
# Create two threads as follows
try:
f = open('myfile', 'r+')
f.truncate()
start = timeit.default_timer()
t1 = Thread(target=print_time, args=("Thread-1", 0,))
t2 = Thread(target=print_time, args=("Thread-2", 0,))
t1.start()
t2.start()
stop = timeit.default_timer()
print stop - start
except:
print "Error: unable to start thread"

You have several problems I'll get to in a moment, but generally your program is disk-bound (it can't go faster than your hard drive) so even a properly threaded program isn't any faster. It can be hard to measure disk performance because of the file system cache: You run this once with threads and you go at hard drive speed, you run it again without threads and the files are in the system so go fast. It makes it hard to figure out how the code will perform later when the data no longer in the system cache.
So now for the problems.
if file not in processedFiles: isn't thread safe. Both threads could look at an empty list and decide to copy the same file. At a minimum you need a lock. Or you could do the glob once and pass the files to a queue that are read by the thread.
Reading the file line-by-line then joining with \n is a crazy slow way to write a file. Use shutil.copyfileobj instead - its built to copy files efficiently.
f = open('myfile','a') now you have multiple file descriptors to a single file and each will advance their file pointers independently.... so one overwrites the other.
f.write("%s \n" %lines) is also not thread safe. You could end up with bits of the files interleaving each other in the output file.
stop = timeit.default_timer() - you didn't wait for the threads to complete their work so you didn't really measure anything useful. Code code severely under-reports execution time.
You are much better off with a simple single-threaded script.

Multiprocessing: Pass file handle to Process

My program spawns multiple processes to do some time consuming calculations. The results are then collected in a queue and a writer process writes them into an output file.
Below is a simplified version of my code which should illustrate my issue. If I comment out the flush statement in the Writer class, test.out is empty at the end of the program.
What exactly is happening here? Is test.out not closed properly? Was it naive to assume that passing the file handle to an autonomous process should work in the first place?
from multiprocessing import JoinableQueue, Process
def main():
queue = JoinableQueue()
queue.put("hello world!")
with open("test.out", "w") as outhandle:
wproc = Writer(queue, outhandle)
wproc.start()
queue.join()
with open("test.out") as handle:
for line in handle:
print(line.strip())
class Writer(Process):
def __init__(self, queue, handle):
Process.__init__(self)
self.daemon = True
self.queue = queue
self.handle = handle
def run(self):
while True:
msg = self.queue.get()
print(msg, file=self.handle)
#self.handle.flush()
self.queue.task_done()
if __name__ == '__main__':
main()

The writer is a separate process. The data it writes to the file might be buffered, and because the process keeps running, it doesn't know that it should flush the buffer (write it to the file). Flushing manually is the right thing to do.
Normally, the file would be closed when you exit the with block, and this would flush the buffer. But the parent process doesn't know anything about its children's buffers, so the child has to flush it's own buffer (closing the file should work too - that doesn't close the file for the parent, at least on Unix systems).
Also, check out the Pool class from multiprocessing (https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.pool) - it might save you some work.

I had the same issue with writing the output from a processed pooled data set to a file directly.
This was sorted out by first collecting the pooled results into a list and then finally wrote to the file.
This is mainly happening due to, the hard disk can't write the speed which processor processes, the buffered content getting lost or pooled data is not in the proper order.
Best thing to do is allocate pooled output into a memory location or to a variable (string or a list) first and then write to a file, that way things should get sorted out.
Good lick!

Python threading with filehandling

Hello I have a program that looks through a range of data and finds anomalies in that data. To make my program faster I incorporated the use of threads (66 in total) now when my program finds the anomalies I would want it to write it to a file but however when i try to write to the file from within multiple threads it wont write.
class myThread(threading.Thread):
def __init__(self,arg1,arg2,lock,output):
threading.Thread.__init__(self)
self.arg1 = arg1
self.arg2 = arg2
self.lock = lock
self.file = output
def run(self):
# print "Starting " + self.name
main(self.arg1,self.arg2,self.lock,self.file)
# print "Exiting " + self.name
def main(START_IP,END_IP,lock,File):
# store found DNS servers
foundDNS=[]
# scan all the ip addresses in the range
for i0 in range(START_IP[0], END_IP[0]+1):
for i1 in range(START_IP[1], END_IP[1]+1):
for i2 in range(START_IP[2], END_IP[2]+1):
for i3 in range(START_IP[3], END_IP[3]+1):
# build ip addres
ipaddr=str(i0)+"."+str(i1)+"."+str(i2)+"."+str(i3)
print "Scanning "+ipaddr+"...",
# scan address
ret=ScanDNS(ipaddr, 10)
if ret==True:
foundDNS.append(ipaddr)
print "Found!"
lock.acquire()
File.write(ipaddr)
File.write("\n")
File.flush()
lock.release()
else:
print
file = open("file.txt","wb")
lock = threading.Lock()
thread1 = myThread(START_IP,END_IP,lock,)
thread1.start()
This uses my exact same MyThread class just with the required arguments for main to manipulate the data. If I run my code for about a minute as its scanning over DNS servers I should get maybe 20-30 DNS servers saved into a file but I generally get this:
FILE.TXT
2.2.1.2
8.8.8.8
31.40.40
31.31.40.40
31.31.41.41
I know for a fact (because I watched the scanning output) and that it hardly writes all of them. So why is some writing and some not?

I don't know why your code is not working, but I can hazard a guess that it is due to race conditions. Hopefully someone knowledgeable can answer that part of your question.
However, I've encountered a similar problem before, and I solved it by moving the file writing code to a single output thread. This thread read from a synchronized queue to which other threads pushed data to be written.
Also, if you happen to be working on a machine with multiple cores, then it's better to use multiprocess instead of threading. The latter only runs threads on a single core, while the former does not have this limitation.

instead of providing file - provide Queue. Spawn new thread to read from Queue and file write. Or Use Locks everywhere in print too because some treads can be deadlocked.

To avoid potential error or misuse for access file from multi-threads, you can try using logging to write down your result.
import logging
logger = logging.getLogger()
file_handler = logging.FileHandler()
formatter = #your formmat
file_handler.setFormatter(formatter)
logger.addHandler(file_handler)

Check the the documentation for File Objects:
File.flush() is not enough to ensure that your data is written on disk, add
os.fsync(File.fileno()) just after to make it happens.

Threadsafe printing across multiple processes python 2.x

I have experienced a very weird issue that I just can't explain when dealing with printing to a file from multiple processes (started with the subprocess module). The behavior I am seeing is that some of my output is slightly truncated and some of it is just completely missing. I am using a slightly modified version of Alex Martelli's solution for thread safe printing found here How do I get a thread safe print in Python 2.6?. The main difference is in the write method. To guarantee that output is not interleaved between the multiple processes writing to the same file I buffer the output and only write when I see a newline.
import sys
import threading
tls = threading.local()
class ThreadSafeFile(object):
"""
#author: Alex Martelli
#see: https://stackoverflow.com/questions/3029816/how-do-i-get-a-thread-safe-print-in-python-2-6
#summary: Allows for safe printing of output of multi-threaded programs to stdout.
"""
def __init__(self, f):
self.f = f
self.lock = threading.RLock()
self.nesting = 0
self.dataBuffer = ""
def _getlock(self):
self.lock.acquire()
self.nesting += 1
def _droplock(self):
nesting = self.nesting
self.nesting = 0
for i in range(nesting):
self.lock.release()
def __getattr__(self, name):
if name == 'softspace':
return tls.softspace
else:
raise AttributeError(name)
def __setattr__(self, name, value):
if name == 'softspace':
tls.softspace = value
else:
return object.__setattr__(self, name, value)
def write(self, data):
self._getlock()
self.dataBuffer += data
if data == '\n':
self.f.write(self.dataBuffer)
self.f.flush()
self.dataBuffer = ""
self._droplock()
def flush(self):
self.f.flush()
It should also be noted that to get this to behave abnormally it is going to require either a lot of time or a machine with multiple processors or cores. I ran the offending program in my test suite ~7000 times on a single processor machine before it reported a failure. This program that I've created to demonstrate the issue I've been experiencing in my test suite also seems to work on a single processor machine, but when you execute it on a multicore or multiprocessor machine it will certainly fail.
The following program shows the issue and it is somewhat more involved than I wanted it to be, but I wanted to preserve enough of the behavior of my programs as possible.
The code for process 1 main.py
import subprocess, sys, socket, time, random
from threadSafeFile import ThreadSafeFile
sys.stdout = ThreadSafeFile(sys.__stdout__)
usage = "python main.py nprocs niters"
workerFilename = "/path/to/worker.py"
def startMaster(n, iters):
host = socket.gethostname()
for i in xrange(n):
#set up ~synchronization between master and worker
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.bind((host,0))
sock.listen(1)
socketPort = sock.getsockname()[1]
cmd = 'ssh %s python %s %s %d %d %d' % \
(host, workerFilename, host, socketPort, i, iters)
proc = subprocess.Popen(cmd.split(), shell=False, stdout=None, stderr=None)
conn, addr = sock.accept()
#wait for worker process to start
conn.recv(1024)
for j in xrange(iters):
#do very bursty i/o
for k in xrange(iters):
print "master: %d iter: %d message: %d" % (n,i, j)
#sleep for some amount of time between .02s and .5s
time.sleep(1 * (random.randint(1,50) / float(100)))
#wait for worker to finish
conn.recv(1024)
sock.close()
proc.kill()
def main(nprocs, niters):
startMaster(nprocs, niters)
if __name__ == "__main__":
if len(sys.argv) != 3:
print usage
sys.exit(1)
nprocs = int(sys.argv[1])
niters = int(sys.argv[2])
main(nprocs, niters)
code for process 2 worker.py
import sys, socket,time, random, time
from threadSafeFile import ThreadSafeFile
usage = "python host port id iters"
sys.stdout = ThreadSafeFile(sys.__stdout__)
def main(host, port, n, iters):
#tell master to start
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect((host, port))
sock.send("begin")
for i in xrange(iters):
#do bursty i/o
for j in xrange(iters):
print "worker: %d iter: %d message: %d" % (n,i, j)
#sleep for some amount of time between .02s and .5s
time.sleep(1 * (random.randint(1,50) / float(100)))
#tell master we are done
sock.send("done")
sock.close()
if __name__ == "__main__":
if len(sys.argv) != 5:
print usage
sys.exit(1)
host = sys.argv[1]
port = int(sys.argv[2])
n = int(sys.argv[3])
iters = int(sys.argv[4])
main(host,port,n,iters)
When testing I ran main.py as follows:
python main.py 1 75 > main.out
The resulting file should be of length 75*75*2 = 11250 lines of the format:
(master|worker): %d iter: %d message: %d
Most of the time it is short 20-30 lines, but I have seen on occasion the program having the appropriate number of lines. After further investigation of the rare successes some of the lines are being truncated with something like:
ter: %d message: %d
Another interesting aspect to this is that when starting the ssh process using multiprocessing instead of subprocess this program behaves as intended. Some may just say why bother using subprocess when multiprocessing works fine. Unfortunately, it is the academic in me that really wants to know why this is behaving abnormally. Any thoughts and/or insights would be very appreciated. Thanks.
***edit
Ben I understand that threadSafeFile uses different locks per process, but I need it in my larger project for 2 reasons.
1) Each process may have multiple threads that will be writing to stdout even though this example does not. So I need to guarantee both safety at the thread level and at the process level.
2) If I don't make sure that when stdout gets flushed that there is a '\n' at the end of the buffer then there is going to be some potential execution trace where process 1 writes its buffer to a file without a trailing '\n' and then process 2 comes in and writes its buffer. Now we have lines interleaving and that's not what I want.
I also understand that this mechanism makes things a bit restrictive for what can be printed. Right now, in my stage of development of this project, restrictiveness is ok. When I can guarantee correctness I can start to relax the restrictions.
Your comment about locking inside of the conditional check if data == '\n' is incorrect. If the lock goes inside the conditional check then threadSafeFile is no longer thread safe in the general case. If any thread can add to the data buffer then there will be a race condition at dataBuffer += data as this is not an atomic operation. Perhaps your comment is simply related to this example in which we only have 1 thread per process, but if that's the case then we don't even need a lock at all.
In regards to OS level locks, my understanding was that multiple programs were able to safely write to the same file on a unix platform iff the number of bytes being written was smaller than the size of the internal buffer. Shouldn't the OS take care of all of the necessary locking for me in this case?

In each process you create a ThreadSafeFile for sys.stdout, each of which has a lock, but they're different locks; there's nothing connecting the locks used in all the different processes. So you're getting the same effect as if you used no locks at all; no process is ever going to be blocked by a lock held in another process, since they all have their own.
The only reason this works when run on a single processor machine is the buffering you do to queue up writes until a newline is encountered. This means that each line of output is written all in one go. On a uniprocessor, it's not unlikely that the OS will decide to switch processes in the middle of a bunch of successive calls to write, which would trash your data. But if the output is all written in chunks of a single line and you don't care about the order in which lines end up in the file, then it's very very unlikely for a context switch to happen in the middle of an operation you care about. Not theoretically impossible though, so I wouldn't call this code correct even for a uniprocessor.
ThreadSafeFile is very specifically only thread safe. It relies on the fact that the program only has a single ThreadSafeFile object for each file it's writing to. So any writes to that file are going to be going through that single object, synchronizing upon the lock.
When you have multiple processes, you don't have the shared global memory that threads in a single process do. So each process necessarily has its own separate ThreadSafeFile(sys.stdout) object. This is exactly the same mistake as if you had used threads and spawned N threads, each of which created its own ThreadSafeFile(sys.stdout).
I have no idea how this works when you use multiprocessing, because you haven't posted the code you used to do that. But my understanding is that this would still fail, for all the same reasons, if you used multiprocessing in such a way that each process created its own fresh ThreadSafeFile. Maybe you're not doing that in the version that uses multiprocessing?
What you need to do is arrange for the synchronization object (the lock) to be connected somehow. The multiprocessing module can do this for you. Note in the example here how the lock is created once and then passed in to each new process as it is created. (This still results in 10 different lock objects in 10 different processes of course, but what Python must be doing behind the scenes is creating an OS-level lock and then making each of the copied Python-level lock objects refer to the single OS-level lock).
If you want to do this with subprocessing, where you're just starting totally independent worker commands from separate scripts, then you'll need some way to get them all talking to a single OS-level lock. I don't know of anything in the standard library that helps you do that. I would just use multiprocessing.
As another thought, your buffering and locking code looks a little suspicious too. What happens if something calls sys.stdout.write("foo\n")? I'm not certain, but at a guess this is only working because the implementation of print happens to call sys.stdout.write on whatever you're printing, then call it again with a single newline. There is absolutely no reason it has to do this! It could just as easily assemble a single string of output in memory and then only call sys.stdout.write once. Plus, what happens if you need to print a block of multiple lines that need to go together in the output?
Another problem is that you acquire the lock the first time a process writes to the buffer, continue to hold it as the buffer is filled, then write the line, and finally release the lock. If your lock actually worked and a process took a long time between starting a line and finishing it it would block all other processes from even buffering up their writes! Maybe that's sort of what you want, if the intention that when a process starts writing something it gets a guarantee that its output will hit the file next. But in that case, you don't even need the buffering at all. I think you should be acquiring the lock just after if data == '\n':, and then you wouldn't need all that code tracking the nesting level either.

Multiple port scans using external tool, subprocess.Popen and threads

I am using a port scanner to scan my subnet. The port scanner unfortunately can scan one port of only one host at a time. Also the scanner has a 1 sec timeout for unreachable hosts. The scanner(being an outside program) has to be run from subprocess.Popen() and to speed it up - so that I may send multiple probes while some previous ones are waiting for replies- I use threads. The problem comes up for a complete /24 subnet scan with large number of threads. SOme of the actually open ports are displayed as closed. I suspect somehow the output gets garbled. Note that this does not occur if I scan fewer hosts or one host at a time
The following code is my attempt to create a pool of threads which take an IP address and run 'sequential' port scan for defined port. Once all specified ports are scanned, it picks up next IP from the list.
while True:
if not thread_queue.empty():
try:
hst = ip_iter.next()
except StopIteration:
break
m=thread_queue.get()
l=ThreadWork(self,hst,m)
l.start()
while open_threads != 0:
pass
Where this fragment sets up thread queue
thread_list = [x for x in range(num_threads)]
for t in thread_list:
thread_queue.put(str(t))
ip_iter=iter(self.final_target)
In the ThreadWork function I keep a tab of open threads(since thread_queue.empty proved to be unreliable, I had to use this crude way)
class ThreadWork(threading.Thread):
def __init__(self,i,hst,thread_no):
global open_threads
threading.Thread.__init__(self)
self.host = hst
self.ptr = i
self.t = thread_no
lock.acquire()
open_threads = open_threads + 1
lock.release()
def run(self):
global thread_queue
global open_threads
global lock
user_log.info("Executing sinfp for IP Address : %s"%self.host)
self.ptr.result.append(SinFpRes(self.host,self.ptr.init_ports,self.ptr.all_ports,self.ptr.options,self.ptr.cf))
lock.acquire()
open_threads = open_threads - 1
lock.release()
thread_queue.put(self.t)
The call to SinFpRes creates a result object for one IP and initiate sequential scanning of ports for that IP only.
The actual scan per port is as shown
com_string = '/usr/local/sinfp/bin/sinfp.pl '+self.options+' -ai '+str(self.ip)+' -p '+str(p)
args = shlex.split(com_string)
self.result=subprocess.Popen(args,stdout=subprocess.PIPE).communicate()[0]
self.parse(p)
The parse function then utilises result stored in self.result to store output for that PORT. An aggregate of all the ports is what constitues a scan result for an IP.
Calling this code using 10 threads gives accurate o/p (when compared to nmap output). On giving 15 threads an occasional open port is missed. On giving 20 threads, more open ports are missed. On giving 50 threads many ports are missed.
P.S. - As a first timer, this code is very convoluted. Apologies to the puritans.
P.P.S. - Even a threaded port scan takes 15 minutes for an entire class C subnet with hardly 20 ports scanned. I was wondering if I should move this code to another language and use Python to only parse the results. Could somebody suggest me a language?
Note -I am exploring Shell option as shown by S.Lott but manual processing is required before dumping it into file.

Use the shell
for h in host1 host2 host3
do
scan $h >$h.scan &
done
cat *.scan >all.scan
This will scan the entire list of hosts all at the same time, each in a separate process. No threads.
Each scan will produce a .scan file. You can then cat all the .scan files into a massive all.scan file for further processing or whatever it is you're doing.

Why don't you try it?
(Answer: No they will have their own pipe)

Use Perl instead of Python. The program (SinFP) is written in Perl, you can modify the code to suite your needs.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.