Hi guys,
I have about 4000 (1-50MB) files to sort.
I was thinking of having Python call the Linux sort command. Since I think this might be somewhat I/O bound, I would use the threading library.
So here's what I have, but when I run it and watch the system monitor I don't see 25 sort tasks pop up. It seems to be running one at a time. What am I doing wrong?
...
print "starting sort"
def sort_unique(file_path):
    """Run linux sort -ug on a file"""
    out = commands.getoutput('sort -ug -o "%s" "%s"' % (file_path, file_path))
    assert not out
pool = ThreadPool(25)
for fn in os.listdir(target_dir):
    fp = os.path.join(target_dir, fn)
    pool.add_task(sort_unique, fp)
pool.wait_completion()
Here's where ThreadPool comes from, perhaps that is broken?
You're doing everything correctly.
Python has something called the GIL (Global Interpreter Lock),
which lets only one thread execute Python bytecode at a time.
Choose subprocess instead :) Python threads do not run Python code in parallel.
Actually this does seem to be working. I spoke too soon. I'm not sure if you guys want to delete this or what? Sorry about that.
Normally people do this by spawning multiple processes. The multiprocessing module makes this easy to do.
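A minimal sketch of one way to do this. Since every external sort already runs in its own process, this version uses a thread pool from concurrent.futures (rather than multiprocessing) just to keep several of them going at once. The function names and the max_workers default are assumptions for illustration, and it expects a sort command that accepts -u, -g and -o.

```python
import shutil
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def sort_unique(path):
    """Sort one file in place with the external `sort -ug` command."""
    # check=True raises CalledProcessError if sort exits with an error.
    subprocess.run(["sort", "-ug", "-o", str(path), str(path)], check=True)
    return path


def sort_directory(target_dir, max_workers=8):
    """Run up to max_workers sort processes at a time over the files in target_dir."""
    if shutil.which("sort") is None:
        raise RuntimeError("the external 'sort' command was not found on PATH")
    files = [p for p in Path(target_dir).iterdir() if p.is_file()]
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # list() consumes the results so an exception in any worker is re-raised here.
        return list(executor.map(sort_unique, files))
```

Each worker thread spends its time waiting for its sort process to finish, so the GIL is not the limiting factor here.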
On the other hand, Python is pretty good at sorting, so why not just read the file into a list of strings with file.readlines() and then sort it in Python? You would have to write a key function for list.sort() to emulate the -g option, and you would also have to remove duplicates to emulate the -u option. The easiest (and a fast) way to remove duplicates is list(set(unsorted_lines)) before you sort.
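A rough sketch of the pure-Python route, assuming every line holds a single number that float() can parse. The names general_numeric_key and sort_unique_lines are made up for this example. Note that it only drops exact duplicate lines, whereas sort -ug, as far as I know, treats lines with equal numeric value (such as 1 and 1.0) as duplicates.

```python
def general_numeric_key(line):
    """Key for sorting: the line parsed as a float.

    Assumption: lines that cannot be parsed are placed before all numbers.
    """
    try:
        return (1, float(line))
    except ValueError:
        return (0, 0.0)


def sort_unique_lines(lines):
    """Drop exact duplicate lines, then sort the remainder numerically."""
    unique = set(line.rstrip("\n") for line in lines)
    return sorted(unique, key=general_numeric_key)
```

To sort a file, pass the open file object (or the result of readlines()) to sort_unique_lines and write the returned lines back out with newlines added.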
I have this function that saves a large list of dictionaries into files of 100 items each. It worked flawlessly in a normal environment, but after changing nothing except running it in a thread, I experienced rather significant slowdowns. To my knowledge, simply adding threading or multiprocessing shouldn't slow down I/O. If I am missing something trivial, please let me know; if not, how can I make this run faster?
def savePlayerQueueToDisk(saveCopy):
    print("="*10)
    print("Application exit signal received. Caching loaded players...")
    import pickle
    i = 0
    for l in chunks(saveCopy, 100):
        with open(CACHED_PLAYERS_DIR / f"{str(i)}", 'wb') as filehandle:
            pickle.dump(saveCopy, filehandle)
        i += 1
        print(f"saved chunk {i}")
    import sys
    print("="*10)
    sys.exit()
def mainFunction():
    # Calls a main program, where on KeyboardInterrupt, it calls the savePlayerQueueToDisk function.
    t2 = threading.Thread(target=main, kwargs={'ignore_exceptions': False})
    t2.start()
Edit: After some more testing, using multiprocessing instead of threading, with a second process in place of one of the threads and mainFunction kept on the main thread, I experienced no slowdowns. Why is this the case?
Edit: After more testing and debugging, I found that the issue is not actually tied to multiprocessing or to being I/O bound. There is a logic error on line 8 of my savePlayerQueueToDisk() function: it reads pickle.dump(saveCopy, filehandle) when it should read pickle.dump(l, filehandle). The I/O got slower and slower the more I ran the function because I saved the entire list into each of over 100 files, and when all those files were loaded back in, I saved 100 copies of each item again into over 100 files. Loading and saving these, obviously, just got out of hand.
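For reference, a small sketch of the corrected loop. The original chunks helper is not shown in the question, so the one below is an assumed implementation, and save_in_chunks is a hypothetical name used to keep the example self-contained.

```python
import pickle
from pathlib import Path


def chunks(items, size):
    """Yield successive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def save_in_chunks(items, directory, size=100):
    """Pickle each chunk (not the whole list) into its own numbered file."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    written = []
    for i, chunk in enumerate(chunks(items, size)):
        path = directory / str(i)
        with open(path, "wb") as filehandle:
            pickle.dump(chunk, filehandle)  # dump the chunk, not `items`
        written.append(path)
    return written
```

With this version the total amount written stays proportional to the size of the list, no matter how often the data is saved and reloaded.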
I wanted to use concurrency in Python for the first time. So I started reading a lot about Python concurrency (GIL, threads vs processes, multiprocessing vs concurrent.futures vs ...) and saw a lot of convoluted examples, even among those using the high-level concurrent.futures library.
So I decided to just start trying stuff and was surprised by the very, very simple code I ended up with:
from concurrent.futures import ThreadPoolExecutor
class WebHostChecker(object):
    def __init__(self, websites):
        self.webhosts = []
        for website in websites:
            self.webhosts.append(WebHost(website))
    def __iter__(self):
        return iter(self.webhosts)
    def check_all(self):
        # sequential:
        #for webhost in self:
        #    webhost.check()
        # threaded:
        with ThreadPoolExecutor(max_workers=10) as executor:
            executor.map(lambda webhost: webhost.check(), self.webhosts)
class WebHost(object):
    def __init__(self, hostname):
        self.hostname = hostname
    def check(self):
        print("Checking {}".format(self.hostname))
        self.check_dns() # only modifies internal state, i.e.: sets self.dns
        self.check_http() # only modifies internal status, i.e.: sets self.http
Using the classes looks like this:
webhostchecker = WebHostChecker(["urla.com", "urlb.com"])
webhostchecker.check_all() # -> this calls .check() on all WebHost instances in parallel
The relevant multiprocessing/threading code is only 3 lines. I barely had to modify my existing code (which I hoped to be able to do when first starting to write the code for sequential execution, but started to doubt after reading the many examples online).
And... it works! :)
It perfectly distributes the IO-waiting among multiple threads and runs in less than 1/3 of the time of the original program.
So, now, my question(s):
What am I missing here?
Could I implement this differently? (Should I?)
Why are other examples so convoluted? (Although I must say I couldn't find an exact example doing a method call on multiple objects)
Will this code get me in trouble when I expand my program with features/code I cannot predict right now?
I think I already know of one potential problem and it would be nice if someone could confirm my reasoning: if WebHost.check() also becomes CPU bound, I won't be able to simply swap ThreadPoolExecutor for ProcessPoolExecutor, because every process will get cloned versions of the WebHost instances. Would I then have to code something to sync those cloned instances back to the originals?
Any insights/comments/remarks/improvements/... that can bring me to greater understanding will be much appreciated! :)
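On the ProcessPoolExecutor concern: arguments passed to worker processes are pickled, so each worker does operate on a copy, and attributes set there never reach the parent. One possible workaround, sketched below, is to have the worker return its findings and assign them in the parent. check_host and its placeholder return values are hypothetical stand-ins for the real DNS and HTTP checks, and executor_cls is a parameter added only for this example.

```python
from concurrent.futures import ProcessPoolExecutor


def check_host(hostname):
    """Hypothetical worker: returns results instead of mutating an object.

    The values are placeholders standing in for real DNS/HTTP checks.
    """
    return {"hostname": hostname, "dns": "placeholder", "http": "placeholder"}


class WebHost(object):
    def __init__(self, hostname):
        self.hostname = hostname
        self.dns = None
        self.http = None


def check_all(webhosts, executor_cls=ProcessPoolExecutor, max_workers=4):
    """Run check_host for every host and copy the results back in the parent."""
    hostnames = [webhost.hostname for webhost in webhosts]
    with executor_cls(max_workers=max_workers) as executor:
        # map() yields results in input order, so zip pairs them up correctly.
        for webhost, result in zip(webhosts, executor.map(check_host, hostnames)):
            webhost.dns = result["dns"]
            webhost.http = result["http"]
    return webhosts
```

When using the default ProcessPoolExecutor, call check_all from under an `if __name__ == "__main__":` guard, since some platforms start workers by re-importing the main module.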
Ok, so I'll add my own first gotcha:
If webhost.check() raises an Exception, then the thread just ends and self.dns and/or self.http might NOT have been set. However, with the current code, you won't see the Exception UNLESS you also access the executor.map() results! That left me wondering why some objects raised AttributeErrors after running check_all() :)
This can easily be fixed by just evaluating every result (which is always None, because I'm not letting .check() return anything). You can do it after all threads have run or while they run. I chose to let Exceptions be raised while they run (i.e. within the with statement), so the program stops at the first unexpected error:
def check_all(self):
    with ThreadPoolExecutor(max_workers=10) as executor:
        # this alone works, but does not raise any exceptions from the threads:
        #executor.map(lambda webhost: webhost.check(), self.webhosts)
        for i in executor.map(lambda webhost: webhost.check(), self.webhosts):
            pass
I guess I could also use list(executor.map(lambda webhost: webhost.check(), self.webhosts)) but that would unnecessarily use up memory.
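If stopping at the first error is not what you want, another option is to submit each call separately and inspect every future. This is only a sketch of that alternative: the free function check_all and its return value (a dict of failed hosts) are choices made for this example, not part of the code above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def check_all(webhosts, max_workers=10):
    """Call .check() on every host and return {host: exception} for failures."""
    failures = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_host = {executor.submit(host.check): host for host in webhosts}
        for future in as_completed(future_to_host):
            error = future.exception()  # None if check() finished normally
            if error is not None:
                failures[future_to_host[future]] = error
    return failures
```

Every host gets checked even if some fail, and the caller can decide afterwards whether to log, retry or re-raise.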
I was using this answer in order to run parallel commands with multiprocessing in Python on a Linux box.
My code did something like:
import multiprocessing
import logging
def cycle(offset):
    # Do stuff
    pass
def run():
    for nprocess in process_per_cycle:
        logger.info("Start cycle with %d processes", nprocess)
        offsets = list(range(nprocess))
        pool = multiprocessing.Pool(nprocess)
        pool.map(cycle, offsets)
But I was getting this error: OSError: [Errno 24] Too many open files
So, the code was opening too many file descriptors, i.e. it was starting too many processes and not terminating them.
I fixed it by replacing the last two lines with these lines:
with multiprocessing.Pool(nprocess) as pool:
    pool.map(cycle, offsets)
But I do not know exactly why those lines fixed it.
What is happening underneath that with?
You're creating new processes inside a loop, and then forgetting to close them once you're done with them. As a result, there comes a point where you have too many open processes. This is a bad idea.
You could fix this by using a context manager which automatically calls pool.terminate, or manually call pool.terminate yourself. Alternatively, why don't you create a pool outside the loop just once, and then send tasks to the processes inside?
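pool = multiprocessing.Pool(max(process_per_cycle)) # initialise your pool once (sized here for the largest cycle)
for nprocess in process_per_cycle:
    ...
    pool.map(cycle, offsets) # delegate work inside your loop
pool.close() # no more tasks will be submitted
pool.join() # wait for the worker processes to exit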
For more information, you could peruse the multiprocessing.Pool documentation.
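To make the "underneath" part concrete: according to the multiprocessing documentation, leaving the with block calls pool.terminate(). The two functions below are therefore roughly equivalent. This is an illustrative sketch; cycle is a placeholder, and the pool_factory parameter is an addition for this example so that a different pool class can be passed in.

```python
import multiprocessing


def cycle(offset):
    """Placeholder for the real per-offset work."""
    return offset * 2


def run_with_context_manager(nprocess, pool_factory=multiprocessing.Pool):
    """The version from the question: the pool is cleaned up on leaving the block."""
    with pool_factory(nprocess) as pool:
        return pool.map(cycle, range(nprocess))


def run_with_explicit_cleanup(nprocess, pool_factory=multiprocessing.Pool):
    """Roughly what the with-statement version does under the hood."""
    pool = pool_factory(nprocess)
    try:
        return pool.map(cycle, range(nprocess))
    finally:
        # Pool.__exit__ calls terminate(), which stops the worker processes
        # so their pipes and file descriptors can be released.
        pool.terminate()
```

Without either form of cleanup, every loop iteration leaves a pool's worker processes and pipes behind, which is what eventually exhausts the file descriptor limit.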
It is a context manager. Using with ensures that you are opening and closing files properly. To understand this in detail, I'd recommend this article https://jeffknupp.com/blog/2016/03/07/python-with-context-managers/
I was already terminating and closing the pool, but there is a limit on the number of file descriptors. I changed my ulimit from 1024 to 4096 and it worked. The procedure is as follows:
Check:
ulimit -n
I updated it to 4096 and it worked.
ulimit -n 4096
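The same limit can also be inspected and raised from inside the Python process with the standard resource module (Unix only). A small sketch, where raise_open_file_limit is a made-up helper name and 4096 is just the value used above; some systems may refuse values that are allowed by the hard limit for other reasons.

```python
import resource


def raise_open_file_limit(wanted=4096):
    """Raise this process's soft limit on open files, capped at the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        wanted = min(wanted, hard)
    if soft == resource.RLIM_INFINITY or soft >= wanted:
        return soft  # already high enough, nothing to change
    resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    return wanted
```

This only affects the current process and its children, and it cannot go beyond the hard limit without extra privileges.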
This can happen when you use numpy.load too. Make sure you close those files as well, or avoid it and use pickle or torch.save / torch.load etc.
In one of my programs, I have the following method:
def set_part_content(self, part_no, block_no, data):
    with open(self.file_path, "rb+") as f:
        f.seek(part_no * self.constant1 + block_no * self.constant2)
        f.write(data)
I did it this way for the following reasons:
I have to write at different offsets (which is why the f.seek is there)
and this function is thread safe (thanks to the with statement)
My issue is that this function is called approximately 10k to 100k times, and it is really slow (it represents half of the execution time of one of my most critical sets of functionality) because of the time spent opening and closing the file.
Because of the f.seek, I can't open the file once in the __init__ function and operate on it there (if 2 threads use the function at the same time, one of them ends up writing at the wrong offset, which is critical).
Is there any module or technique that could speed up this function?
I am not familiar with Python, so I am not sure whether the with statement makes this function thread safe or not.
If it does, you won't have 2 threads using the function at the same time.
If it doesn't, your function is not thread safe. You need a lock to ensure thread safety. You may check implementing file locks using Python's with statement. Another option is to make this function async and let a single thread handle all the writing.
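A sketch of the lock-based option: open the file once and guard the seek plus write with a threading.Lock, so the open/close cost is paid only once. The class name BlockFile and the part_size/block_size attributes are stand-ins for the asker's class and its constant1/constant2, and the file is assumed to exist already.

```python
import threading


class BlockFile:
    """Keeps one file handle open and serialises seek+write with a lock."""

    def __init__(self, file_path, part_size, block_size):
        self._file = open(file_path, "r+b")  # the file must already exist
        self._lock = threading.Lock()
        self.part_size = part_size
        self.block_size = block_size

    def set_part_content(self, part_no, block_no, data):
        offset = part_no * self.part_size + block_no * self.block_size
        with self._lock:  # only one thread may seek and write at a time
            self._file.seek(offset)
            self._file.write(data)

    def close(self):
        with self._lock:
            self._file.close()
```

On Unix, os.pwrite(fd, data, offset) may be another option, since it writes at a given offset without moving a shared file position.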
I need to detect when a program crashes or is not running using python and restart it. I need a method that doesn't necessarily rely on the python module being the parent process.
I'm considering implementing a while loop that essentially does
ps -ef | grep process name
and when the process isn't found it starts another. Perhaps this isn't the most efficient method. I'm new to python so possibly there is a python module that does this already.
Why implement it yourself? An existing utility like daemon or Debian's start-stop-daemon is more likely to get the other difficult stuff right about running long-living server processes.
Anyway, when you start the service, put its pid in /var/run/<name>.pid and then make your ps command just look for that process ID, and check that it is the right process. On Linux you can simply look at /proc/<pid>/exe to check that it points to the right executable.
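A minimal sketch of that pid-file check in Python, for POSIX systems. The function names are invented for this example, and the /proc/<pid>/exe comparison only works on Linux (and only for processes you are allowed to inspect).

```python
import os


def pid_is_running(pid):
    """Return True if a process with this pid exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def service_is_running(pid_file, expected_exe=None):
    """Read a pid from pid_file and check that the process is alive.

    If expected_exe is given, also compare it with /proc/<pid>/exe (Linux only).
    """
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return False  # missing, unreadable or malformed pid file
    if not pid_is_running(pid):
        return False
    if expected_exe is not None:
        try:
            actual = os.readlink("/proc/%d/exe" % pid)
        except OSError:
            return False
        return actual == os.path.realpath(expected_exe)
    return True
```

A watchdog loop would call service_is_running periodically and start the service again when it returns False.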
Please don't reinvent init. Your OS has capabilities to do this that require nearly no system resources and will definitely do it better and more reliably than anything you can reproduce.
Classic Linux has /etc/inittab
Ubuntu has /etc/event.d (upstart)
OS X has launchd
Solaris has smf
The following code checks a given process in a given interval, and restarts it.
#Restarts a given process if it is finished.
#Compatible with Python 2.5, tested on Windows XP.
import threading
import time
import subprocess
class ProcessChecker(threading.Thread):
    def __init__(self, process_path, check_interval):
        threading.Thread.__init__(self)
        self.process_path = process_path
        self.check_interval = check_interval
    def run (self):
        while(1):
            time.sleep(self.check_interval)
            if self.is_ok():
                self.make_sure_process_is_running()
    def is_ok(self):
        ok = True
        #do the database locks, client data corruption check here,
        #and return true/false
        return ok
    def make_sure_process_is_running(self):
        #This call is blocking, it will wait for the
        #other sub process to be finished.
        retval = subprocess.call(self.process_path)
def main():
    process_path = "notepad.exe"
    check_interval = 1 #In seconds
    pm = ProcessChecker(process_path, check_interval)
    pm.start()
    print("Checker started...")
if __name__ == "__main__":
    main()
maybe you need http://supervisord.org
I haven't tried it myself, but there is a Python System Information module that can be used to find processes and get information about them. AFAIR there is a ProcessTable class that can be used to inspect the running processes, but it doesn't seem to be very well documented...
I'd go the command-line route (it's just easier imho). As long as you only check every second or two, the resource usage should be infinitesimal compared to the available processing power on any system less than 10 years old.