python multiprocessing read file cost too much time

A function in my code reads a set of files, each about 8 MB, but reading them is too slow. To speed it up I tried multiprocessing, but it seems to get blocked. Is there any way to solve this and improve the reading speed?
My code is as follows:
import multiprocessing as mp
import json
import os
def gainOneFile(filename):
    file_from = open(filename)
    json_str = file_from.read()
    temp = json.loads(json_str)
    print "load:",filename," len ",len(temp)
    file_from.close()
    return temp
def gainSortedArr(path):
    arr = []
    pool = mp.Pool(4)
    for i in xrange(1,40):
        abs_from_filename = os.path.join(path, "outputDict"+str(i))
        result = pool.apply_async(gainOneFile,(abs_from_filename,))
        arr.append(result.get())
    pool.close()
    pool.join()
    arr = sorted(arr,key = lambda dic:len(dic))
    return arr
and the call function:
whole_arr = gainSortedArr("sortKeyOut/")

You have a few problems. First, you're not parallelizing. You do:
result = pool.apply_async(gainOneFile,(abs_from_filename,))
arr.append(result.get())
over and over, dispatching a task, then immediately calling .get() which waits for it to complete before you dispatch any additional tasks; you never actually have more than one worker running at once. Store all the results without calling .get(), then call .get() later. Or just use Pool.map or related methods and save yourself some hassle from manual individual result management, e.g. (using imap_unordered to minimize overhead since you're just sorting anyway):
# Make generator of paths to load
paths = (os.path.join(path, "outputDict"+str(i)) for i in xrange(1, 40))
# Load them all in parallel, and sort the results by length (lambda is redundant)
arr = sorted(pool.imap_unordered(gainOneFile, paths), key=len)
Second, multiprocessing has to pickle and unpickle all arguments and return values sent between the main process and the workers, and it's all sent over pipes that incur system call overhead to boot. Since your file system isn't likely to gain substantial speed from parallelizing the reads, it's likely to be a net loss, not a gain.
You might be able to get a bit of a boost by switching to a thread based pool; change the import to import multiprocessing.dummy as mp and you'll get a version of Pool implemented in terms of threads; they don't work around the CPython GIL, but since this code is almost certainly I/O bound, that hardly matters, and it removes the pickling and unpickling as well as the IPC involved in worker communications.
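As a rough sketch of the thread-based variant described above, written for Python 3 (the question's code is Python 2): the names load_one_file and load_sorted, and the count and workers parameters, are made up for illustration, and the outputDict<i> naming is taken from the question.

```python
import json
import os
from multiprocessing.dummy import Pool  # thread-backed Pool with the same API


def load_one_file(filename):
    """Read one JSON file and return the decoded object."""
    with open(filename) as f:
        return json.load(f)


def load_sorted(path, count, workers=4):
    """Load outputDict1..outputDict<count> from `path`, sorted by length."""
    paths = [os.path.join(path, "outputDict" + str(i)) for i in range(1, count + 1)]
    with Pool(workers) as pool:
        # imap_unordered yields results as they finish; the order does not
        # matter here because the results are sorted afterwards.
        return sorted(pool.imap_unordered(load_one_file, paths), key=len)
```

Because the workers are threads, the loaded objects are handed back without pickling. Whether this is actually faster than a plain loop depends on the storage and on how much of the time goes into JSON decoding, so it is worth timing both.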
Lastly, if you're using Python 3.3 or higher on a UNIX-like system, you may be able to get the OS to help you out by having it pull files into the system cache more aggressively. Open the file, then call os.posix_fadvise on the file descriptor (.fileno() on file objects) with either os.POSIX_FADV_WILLNEED or os.POSIX_FADV_SEQUENTIAL. This might improve read performance when you read from the file at some later point, because the OS can prefetch file data before you request it.
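A minimal sketch of that hint, assuming a platform where os.posix_fadvise exists (the helper name read_with_prefetch_hint is made up). The call is advisory only, so the sketch ignores failures and falls back to a plain read.

```python
import os


def read_with_prefetch_hint(filename):
    """Read a whole file after hinting the kernel about the access pattern."""
    with open(filename, "rb") as f:
        # posix_fadvise is only available on some UNIX-like platforms.
        if hasattr(os, "posix_fadvise"):
            try:
                # offset=0 and length=0 mean "from the start to the end of the file".
                os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_SEQUENTIAL)
                os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_WILLNEED)
            except OSError:
                pass  # the hint is optional; reading still works without it
        return f.read()
```

The hint is likely to help most when it is issued for several files ahead of time, before the first one is read, so the kernel has a chance to prefetch while earlier files are being parsed.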

Related

Python: Pre-loading memory

I have a python program where I need to load and de-serialize a 1GB pickle file. It takes a good 20 seconds and I would like to have a mechanism whereby the content of the pickle is readily available for use. I've looked at shared_memory but all the examples of its use seem to involve numpy and my project doesn't use numpy. What is the easiest and cleanest way to achieve this using shared_memory or otherwise?
This is how I'm loading the data now (on every run):
def load_pickle(pickle_name):
    return pickle.load(open(DATA_ROOT + pickle_name, 'rb'))
I would like to be able to edit the simulation code in between runs without having to reload the pickle. I've been messing around with importlib.reload, but it really doesn't seem to work well for a large Python program with many files:
def main():
    data_manager.load_data()
    run_simulation()
    while True:
        try:
            importlib.reload(simulation)
            run_simulation()
        except:
            print(traceback.format_exc())
        print('Press enter to re-run main.py, CTRL-C to exit')
        sys.stdin.readline()

This could be an XY problem, rooted in the assumption that you must use pickles at all. They are awkward to deal with because of how they manage dependencies, which also makes them a fundamentally poor choice for any long-term data storage.
The source financial data is almost-certainly in some tabular form to begin with, so it may be possible to request it in a friendlier format
A simple middleware to deserialize and reserialize the pickles in the meantime will smooth the transition
input -> load pickle -> write -> output
Converting your workflow to use Parquet or Feather which are designed to be efficient to read and write will almost-certainly make a considerable difference to your load speed
Further relevant links
Answer to How to reversibly store and load a Pandas dataframe to/from disk
What are the pros and cons of parquet format compared to other formats?
You may also be able to achieve this with hickle, which internally uses an HDF5 format, ideally making it significantly faster than pickle while still behaving like one.

An alternative to storing the unpickled data in memory would be to store the pickle in a ramdisk, so long as most of the time overhead comes from disk reads. Example code (to run in a terminal) is below.
sudo mkdir -p /mnt/pickle
sudo mount -o size=1536M -t tmpfs none /mnt/pickle
cp path/to/pickle.pkl /mnt/pickle/pickle.pkl
Then you can access the pickle at /mnt/pickle/pickle.pkl. Note that you can change the file names and extensions to whatever you want. If disk reads are not the biggest bottleneck, you might not see a speed increase. If you run out of memory, you can try reducing the size of the ramdisk (I set it to 1536 MB, or 1.5 GB).

You can use a ShareableList.
You will have one Python program running that loads the file and keeps it in memory, and another Python program that reads the data from memory. Whatever your data is, you can put it in a dictionary, dump it as JSON, and then load the JSON again in the second program.
Program1:
import pickle
import json
from multiprocessing.managers import SharedMemoryManager
YOUR_DATA=pickle.load(open(DATA_ROOT + pickle_name, 'rb'))
data_dict={'DATA':YOUR_DATA}
data_dict_json=json.dumps(data_dict)
smm = SharedMemoryManager()
smm.start()
sl = smm.ShareableList(['alpha','beta',data_dict_json])
print (sl)
#smm.shutdown() commenting shutdown now but you will need to do it eventually
The output will look like this
#OUTPUT
>>>ShareableList(['alpha', 'beta', "your data in json format"], name='psm_12abcd')
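Now in Program2:
import json
from multiprocessing import shared_memory
load_from_mem=shared_memory.ShareableList(name='psm_12abcd')
load_from_mem[1]
#OUTPUT
'beta'
# the third item is the JSON string, so decode it to get the dictionary back
data_dict=json.loads(load_from_mem[2])
data_dict['DATA']
#OUTPUT
yourdataindictionaryformat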
You can look for more over here
https://docs.python.org/3/library/multiprocessing.shared_memory.html
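For reference, here is a minimal single-process sketch of the same round trip, including the cleanup that the two programs above leave out. The function name round_trip is made up, and in real use the producer and consumer halves would live in separate programs. The documentation also lists a size limit for each str item (under 10M bytes when UTF-8 encoded), so a payload as large as the one in the question would likely need a raw SharedMemory block instead.

```python
import json
from multiprocessing import shared_memory


def round_trip(payload):
    """Publish `payload` as JSON in a ShareableList and read it back by name."""
    # Producer side: store the JSON text as one list item.
    producer = shared_memory.ShareableList(["alpha", "beta", json.dumps(payload)])
    try:
        # Consumer side: attach by name, as a second program would.
        consumer = shared_memory.ShareableList(name=producer.shm.name)
        try:
            # The stored item is a JSON string; decode it to get the dict back.
            return json.loads(consumer[2])
        finally:
            consumer.shm.close()
    finally:
        producer.shm.close()
        producer.shm.unlink()  # free the block once nobody needs it
```

Note that the consumer still pays for json.loads on every run, so this mainly removes the disk read, not the deserialization cost.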

Adding another assumption-challenging answer: where you are reading your files from could make a big difference.
1 GB is not a great amount of data for today's systems; at 20 seconds to load, that's only 50 MB/s, which is a fraction of what even the slowest disks provide.
You may find that your real bottleneck is a slow disk or some type of network share, and that changing to a faster storage medium or compressing the data (perhaps with gzip) makes a great difference to reading and writing.
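One way to check which part dominates is to time the raw read and the deserialization separately. This is only a sketch, and time_load is a made-up helper name:

```python
import pickle
import time


def time_load(path):
    """Return (read_seconds, unpickle_seconds, obj) for a pickle file."""
    t0 = time.perf_counter()
    with open(path, "rb") as f:
        raw = f.read()           # I/O only
    t1 = time.perf_counter()
    obj = pickle.loads(raw)      # deserialization only
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1, obj
```

If the first number is small and the second is large, faster storage or a ramdisk will not help much. Keep in mind that a second run usually reads from the OS page cache, so the first run after a reboot is the more honest measurement of disk speed.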

Here are my assumptions while writing this answer:
Your Financial data is being produced after complex operations and you want the result to persist in memory
The code that consumes must be able to access that data fast
You wish to use shared memory
Here is the code (self-explanatory, I believe):
class_defs.py (data structure)
'''
Nested class definitions to simulate complex data
'''
class A:
    def __init__(self, name, value):
        self.name = name
        self.value = value
    def get_attr(self):
        return self.name, self.value
    def set_attr(self, n, v):
        self.name = n
        self.value = v
class B(A):
    def __init__(self, name, value, status):
        super(B, self).__init__(name, value)
        self.status = status
    def set_attr(self, n, v, s):
        A.set_attr(self, n,v)
        self.status = s
    def get_attr(self):
        print('\nName : {}\nValue : {}\nStatus : {}'.format(self.name, self.value, self.status))
Producer.py
from multiprocessing import shared_memory as sm
import time
import pickle as pkl
import pickletools as ptool
import sys
from class_defs import B
def main():
    # Data Creation/Processing
    obj1 = B('Sam Reagon', '2703', 'Active')
    #print(sys.getsizeof(obj1))
    obj1.set_attr('Ronald Reagon', '1023', 'INACTIVE')
    obj1.get_attr()
    ###### real deal #########
    # Create pickle string
    byte_str = pkl.dumps(obj=obj1, protocol=pkl.HIGHEST_PROTOCOL, buffer_callback=None)
    # compress the pickle
    #byte_str_opt = ptool.optimize(byte_str)
    byte_str_opt = bytearray(byte_str)
    # place data on shared memory buffer
    shm_a = sm.SharedMemory(name='datashare', create=True, size=len(byte_str_opt))#sys.getsizeof(obj1))
    buffer = shm_a.buf
    # the block may be larger than requested on some platforms, so copy into a slice
    buffer[:len(byte_str_opt)] = byte_str_opt
    #print(shm_a.name) # the string to access the shared memory
    #print(len(shm_a.buf[:]))
    # Just an infinite loop to keep the producer running, like a server
    # a better approach would be to explore use of shared memory manager
    while(True):
        time.sleep(60)
if __name__ == '__main__':
    main()
Consumer.py
from multiprocessing import shared_memory as sm
import pickle as pkl
from class_defs import B # we need this so that while unpickling, the object structure is understood
def main():
    shm_b = sm.SharedMemory(name='datashare')
    byte_str = bytes(shm_b.buf[:]) # convert the shared_memory buffer to a bytes array
    obj = pkl.loads(byte_str) # un-pickle the bytes array (as a data source)
    print(obj.name, obj.value, obj.status) # get the values of the object attributes
if __name__ == '__main__':
    main()
Run Producer.py in one terminal and Consumer.py in another terminal on the same machine. Because the producer creates the block under the fixed name 'datashare', the consumer can attach to it by that name. If you let Python choose the name instead, print shm_a.name in the producer (it will look something like wnsm_86cd09d4) and use that string in Consumer.py.
I hope this is what you wanted!
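The producer above never releases the block. As a sketch of the full lifecycle, here are three small helpers (publish, fetch and release are made-up names) that pickle an object into a block, read it back by name, and free it:

```python
import pickle
from multiprocessing import shared_memory


def publish(obj, name=None):
    """Pickle `obj` into a new shared memory block and return the block."""
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    shm = shared_memory.SharedMemory(name=name, create=True, size=len(payload))
    # Copy into a slice: the block may be larger than the requested size.
    shm.buf[:len(payload)] = payload
    return shm


def fetch(name):
    """Attach to the named block, unpickle its contents, and detach."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        # bytes() copies the data out, so the block can be closed afterwards.
        return pickle.loads(bytes(shm.buf))
    finally:
        shm.close()


def release(shm):
    """Detach from the block and remove it (producer side, when done)."""
    shm.close()
    shm.unlink()
```

In this layout the consumer still copies and unpickles the data on every run, so the saving comes from skipping the disk read rather than the deserialization.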

You can take advantage of multiprocessing to run the simulations inside of subprocesses, and leverage the copy-on-write benefits of forking to unpickle/process the data only once at the start:
import multiprocessing
import pickle
# Need to use forking to get copy-on-write benefits!
mp = multiprocessing.get_context('fork')
# Load data once, in the parent process
data = pickle.load(open(DATA_ROOT + pickle_name, 'rb'))
def _run_simulation(_):
    # Wrapper for `run_simulation` that takes one argument. The function passed
    # into `multiprocessing.Pool.map` must take one argument.
    run_simulation()
with mp.Pool() as pool:
    pool.map(_run_simulation, range(num_simulations))
If you want to parameterize each simulation run, you can do so like so:
import multiprocessing
import pickle
# Need to use forking to get copy-on-write benefits!
mp = multiprocessing.get_context('fork')
# Load data once, in the parent process
data = pickle.load(open(DATA_ROOT + pickle_name, 'rb'))
with mp.Pool() as pool:
    simulations = ('arg for simulation run', 'arg for another simulation run')
    pool.map(run_simulation, simulations)
This way the run_simulation function will be passed the values from the simulations tuple, which allows each simulation to run with different parameters, or even just assigns each run an ID number or name for logging/saving purposes.
This whole approach relies on fork being available. For more information about using fork with Python's built-in multiprocessing library, see the docs about contexts and start methods. You may also want to consider using the forkserver multiprocessing context (by using mp = multiprocessing.get_context('forkserver')) for the reasons described in the docs.
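To see the mechanism this approach relies on, here is a small sketch in which the parent fills a module-level cache and a forked child reads it without reloading anything. The names _CACHE, load_data, _report and child_sees_parent_data are made up for illustration, and load_data stands in for the slow pickle.load call.

```python
import multiprocessing

_CACHE = {}


def load_data():
    # Stand-in for the slow `pickle.load(...)`; runs once, in the parent.
    _CACHE["data"] = list(range(1000))


def _report(conn):
    # Runs in the child. With the 'fork' start method the child starts as a
    # copy of the parent, so _CACHE is already filled in without reloading.
    conn.send(len(_CACHE.get("data", [])))
    conn.close()


def child_sees_parent_data():
    """Return the item count seen by a forked child, or None without fork."""
    if "fork" not in multiprocessing.get_all_start_methods():
        return None  # e.g. on Windows
    ctx = multiprocessing.get_context("fork")
    recv_end, send_end = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_report, args=(send_end,))
    proc.start()
    count = recv_end.recv()
    proc.join()
    return count
```

Calling load_data() and then child_sees_parent_data() in the parent should report the full item count from the child on platforms that support fork. Under a start method that begins from a fresh interpreter, the child would see an empty cache unless the data were loaded again there.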
If you don't want to run your simulations in parallel, this approach can be adapted for that. The key thing is that in order to only have to process the data once, you must call run_simulation within the process that processed the data, or one of its child processes.
If, for instance, you wanted to edit what run_simulation does, and then run it again at your command, you could do it with code resembling this:
main.py:
import multiprocessing
from multiprocessing.connection import Connection
import pickle
from data import load_data
# Load/process data in the parent process
load_data()
# Now child processes can access the data nearly instantaneously
# Need to use forking to get copy-on-write benefits!
mp = multiprocessing.get_context('fork') # Consider using 'forkserver' instead
# This is only ever run in child processes
def load_and_run_simulation(result_pipe: Connection) -> None:
    # Import `run_simulation` here to allow it to change between runs
    from simulation import run_simulation
    # Ensure that simulation has not been imported in the parent process, as if
    # so, it will be available in the child process just like the data!
    try:
        run_simulation()
    except Exception as ex:
        # Send the exception to the parent process
        result_pipe.send(ex)
    else:
        # Send this because the parent is waiting for a response
        result_pipe.send(None)
def run_simulation_in_child_process() -> None:
    result_pipe_output, result_pipe_input = mp.Pipe(duplex=False)
    proc = mp.Process(
        target=load_and_run_simulation,
        args=(result_pipe_input,)
    )
    print('Starting simulation')
    proc.start()
    try:
        # The `recv` below will wait until the child process sends something, or
        # will raise `EOFError` if the child process crashes suddenly without
        # sending an exception (e.g. if a segfault occurs)
        result = result_pipe_output.recv()
        if isinstance(result, Exception):
            raise result # raise exceptions from the child process
        proc.join()
    except KeyboardInterrupt:
        print("Caught 'KeyboardInterrupt'; terminating simulation")
        proc.terminate()
    print('Simulation finished')
if __name__ == '__main__':
    while True:
        choice = input('\n'.join((
            'What would you like to do?',
            '1) Run simulation',
            '2) Exit\n',
        )))
        if choice.strip() == '1':
            run_simulation_in_child_process()
        elif choice.strip() == '2':
            exit()
        else:
            print(f'Invalid option: {choice!r}')
data.py:
from functools import lru_cache
import pickle
# <obtain 'DATA_ROOT' and 'pickle_name' here>
# Cache the result so the pickle is only read on the first call
@lru_cache(maxsize=None)
def load_data():
    with open(DATA_ROOT + pickle_name, 'rb') as f:
        return pickle.load(f)
simulation.py:
from data import load_data
# This call will complete almost instantaneously if `main.py` has been run
data = load_data()
def run_simulation():
    # Run the simulation using the data, which will already be loaded if this
    # is run from `main.py`.
    # Anything printed here will appear in the output of the parent process.
    # Exceptions raised here will be caught/handled by the parent process.
    ...
The three files detailed above should all be within the same directory, alongside an __init__.py file that can be empty. The main.py file can be renamed to whatever you'd like, and is the primary entry-point for this program. You can run simulation.py directly, but that will result in a long time spent loading/processing the data, which was the problem you ran into initially. While main.py is running, the file simulation.py can be edited, as it is reloaded every time you run the simulation from main.py.
For macOS users: forking on macOS can be a bit buggy, which is why Python defaults to using the spawn method for multiprocessing on macOS, but still supports fork and forkserver for it. If you're running into crashes or multiprocessing-related issues, try adding OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES to your environment. See https://stackoverflow.com/a/52230415/5946921 for more details.

As I understand it:
something needs to be loaded
it needs to be loaded often, because the file with the code that uses it is edited often
you don't want to wait for it to load every time
Maybe a solution like this will work for you.
You can write a script loader like this (tested on Python 3.8):
import importlib.util, traceback, sys, gc
# Example data
import pickle
something = pickle.loads(pickle.dumps([123]))
if __name__ == '__main__':
    try:
        mod_path = sys.argv[1]
    except IndexError:
        print('Usage: python3', sys.argv[0], 'PATH_TO_SCRIPT')
        exit(1)
    modules_before = list(sys.modules.keys())
    argv = sys.argv[1:]
    while True:
        MOD_NAME = '__main__'
        spec = importlib.util.spec_from_file_location(MOD_NAME, mod_path)
        mod = importlib.util.module_from_spec(spec)
        # Change to needed global name in the target module
        mod.something = something
        sys.modules[MOD_NAME] = mod
        sys.argv = argv
        try:
            spec.loader.exec_module(mod)
        except:
            traceback.print_exc()
        del mod, spec
        modules_after = list(sys.modules.keys())
        for k in modules_after:
            if k not in modules_before:
                del sys.modules[k]
        gc.collect()
        print('Press enter to re-run, CTRL-C to exit')
        sys.stdin.readline()
Example of module:
# Change 1 to some different number when first script is running and press enter
something[0] += 1
print(something)
Should work. And should reduce the reload time of pickle close to zero 🌝
UPD: added the ability to pass command line arguments along with the script name.

This is not an exact answer to the question, since the question reads as if pickle and shared memory are required, but others have gone off that path too, so I am going to share a trick of mine. It might help you. There are some fine solutions here using pickle and shared memory anyway; on that front I can only offer more of the same. Same pasta with slight sauce modifications.
Two tricks I employ in situations like yours are as follows.
The first is to use sqlite3 instead of pickle. You can even easily develop a module as a drop-in replacement using sqlite. The nice thing is that data is inserted and selected using native Python types, and you can define your own with converter and adapter functions that use the serialization method of your choice to store complex objects. It can be pickle, JSON, or whatever.
What I do is define a class whose data is passed in through *args and/or **kwargs of the constructor. It represents whatever object model I need. Then I pick up rows from "select * from table;" in my database and let Python unpack the data during the new object's initialization. Loading a large amount of data with datatype conversions, even custom ones, is surprisingly fast. sqlite will manage buffering and I/O for you and do it faster than pickle. The trick is to construct your object so that it is filled and initialized as fast as possible. I either subclass dict() or use __slots__ to speed things up.
sqlite3 comes with Python so that's a bonus too.
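A minimal sketch of the adapter/converter idea, using an in-memory database and JSON as the serialization. The Point class, the shapes table, the "point" column type and the round_trip function are all made up for illustration.

```python
import json
import sqlite3


class Point:
    """Tiny object model; __slots__ keeps instances small and quick to fill."""
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x, self.y = x, y


# Adapter: Python object -> value SQLite can store.
sqlite3.register_adapter(Point, lambda p: json.dumps([p.x, p.y]))
# Converter: stored bytes -> Python object, keyed by the declared column type.
sqlite3.register_converter("point", lambda raw: Point(*json.loads(raw)))


def round_trip(points):
    """Store the points and read them back as Point objects."""
    con = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
    try:
        con.execute("CREATE TABLE shapes (id INTEGER PRIMARY KEY, p point)")
        con.executemany("INSERT INTO shapes (p) VALUES (?)", [(p,) for p in points])
        return [row[0] for row in con.execute("SELECT p FROM shapes ORDER BY id")]
    finally:
        con.close()
```

With a file-backed database in place of ":memory:", the same SELECT is the load step. Whether it beats unpickling for a given dataset is something to measure rather than assume.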
The other method of mine is to use a ZIP file and struct module.
You construct a ZIP file with multiple files inside. E.g., for a pronunciation dictionary with more than 400000 words, I'd like a dict() object. So I use one file, let's say lengths.dat, in which I store the length of the key and the length of the value for each pair in binary format. Then I have one file of words and one file of pronunciations, with the entries stored one after the other.
When I load from the file, I read the lengths and use them to construct a dict() of words with their pronunciations from the two other files. Indexing bytes() is fast, so creating such a dictionary is very fast. You can even have it compressed if disk space is a concern, at the cost of some speed.
Both methods will take less space on disk than the pickle would.
The second method requires you to read all the data you need into RAM and then construct the objects, which takes almost double the RAM that the raw data took; after that you can discard the raw data, of course. Altogether it shouldn't require more than the pickle takes. As for RAM, the OS will manage almost anything using virtual memory/swap if needed.
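The ZIP-plus-struct layout described above could look roughly like this. Only lengths.dat is named in the answer; the member names words.dat and pronunciations.dat, the "<II" record format, and the function names are assumptions made for the sketch.

```python
import io
import struct
import zipfile

PAIR = struct.Struct("<II")  # key length, value length (in bytes), little-endian


def save_dict(mapping, target):
    """Write a str->str mapping as three members of a ZIP archive."""
    lengths, keys, values = bytearray(), bytearray(), bytearray()
    for k, v in mapping.items():
        kb, vb = k.encode("utf-8"), v.encode("utf-8")
        lengths += PAIR.pack(len(kb), len(vb))
        keys += kb
        values += vb
    with zipfile.ZipFile(target, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("lengths.dat", bytes(lengths))
        zf.writestr("words.dat", bytes(keys))
        zf.writestr("pronunciations.dat", bytes(values))


def load_dict(source):
    """Rebuild the mapping by slicing the two data members using the lengths."""
    with zipfile.ZipFile(source) as zf:
        lengths = zf.read("lengths.dat")
        keys = zf.read("words.dat")
        values = zf.read("pronunciations.dat")
    result, kpos, vpos = {}, 0, 0
    for klen, vlen in PAIR.iter_unpack(lengths):
        key = keys[kpos:kpos + klen].decode("utf-8")
        result[key] = values[vpos:vpos + vlen].decode("utf-8")
        kpos += klen
        vpos += vlen
    return result
```

Passing zipfile.ZIP_DEFLATED instead of ZIP_STORED gives the compressed variant, trading some load speed for disk space.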
Oh, yeah, there is a third trick I use. When I have a ZIP file constructed as described above, or anything else that requires additional deserialization while constructing an object, and the number of such objects is large, I introduce lazy loading. I.e., let's say we have a big file with serialized objects in it. You make the program load all the raw data and distribute it per object, which you keep in a list() or dict().
You write your classes in such a way that when an object is first asked for data, it unpacks its raw data, deserializes it and so on, removes the raw data from RAM, and then returns the result. So you don't pay the loading time until you actually need the data in question, which is much less noticeable to a user than a process taking 20 seconds to start.
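A small sketch of such a lazily decoded object, using pickle as the per-object serialization. The class name LazyRecord and its attribute names are made up.

```python
import pickle


class LazyRecord:
    """Holds raw serialized bytes and deserializes them on first access."""
    __slots__ = ("_raw", "_value")
    _MISSING = object()  # sentinel, so that None can be a legitimate value

    def __init__(self, raw):
        self._raw = raw
        self._value = LazyRecord._MISSING

    @property
    def value(self):
        if self._value is LazyRecord._MISSING:
            self._value = pickle.loads(self._raw)
            self._raw = None  # drop the raw bytes once they are decoded
        return self._value
```

Creating thousands of LazyRecord objects is cheap because nothing is decoded up front; each one pays its deserialization cost only the first time .value is read.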

I implemented the python-preloaded script, which can help you here. It will store the CPython state at an early stage after some modules are loaded, and then when you need it, you can restore from this state and load your normal Python script. Storing currently means that it will stay in memory, and restoring means that it does a fork on it, which is very fast. But these are implementation details of python-preloaded and should not matter to you.
So, to make it work for your use case:
Make a new module, data_preloaded.py or so, and in there, just this code:
preloaded_data = load_pickle(...)
Now run py-preloaded-bundle-fork-server.py data_preloaded -o python-data-preloaded.bin. This will create python-data-preloaded.bin, which can be used as a replacement for python.
I assume you have started python your_script.py before. So now run ./python-data-preloaded.bin your_script.py. Or also just python-data-preloaded.bin (no args). The first time, this will still be slow, i.e. take about 20 seconds. But now it is in memory.
Now run ./python-data-preloaded.bin your_script.py again. Now it should be extremely fast, i.e. a few milliseconds. And you can start it again and again and it will always be fast, until you restart your computer.

What is the safest method to save files generated by different processes with multiprocessing in Python?

I am totally new to using the multiprocessing package. I have built an agent-based model and would like to run a large number of simulations with different parameters in parallel. My model takes an xml file, extracts some parameters and runs a simulation, then generates two pandas dataframes and saves them as pickle files.
I'm trying to use the multiprocessing.Process() class, but the two dataframes are not saved correctly: for some simulations I get a single dataframe, for others none.
Am I using the right class for this type of work? What is the safest method to write my simulation results to disk using the multiprocessing module?
I should add that if I launch the simulations sequentially with a simple loop, I get the right outputs.
Thanks for the support
Here is an example of the code. It is not reproducible, because I cannot share the model, which is composed of many modules and XML files.
import time
import multiprocessing
from model import ProtonOC
import random
import os
import numpy as np
import sys
sys.setrecursionlimit(100000)
def load(dir):
    result = list()
    names = list()
    for filename in os.scandir(dir):
        if filename.path.endswith('.xml'):
            result.append(filename.path)
            names.append(filename.name[:-4])
    return result, names
def run(xml, name):
    model = ProtonOC()
    model.override_xml(xml)
    model.run()
    new_dir = os.path.join("C:\\", name)
    os.mkdir(new_dir)
    model.datacollector.get_agent_vars_dataframe().to_pickle(os.path.join(new_dir, "agents" + ".pkl"))
    model.datacollector.get_model_vars_dataframe().to_pickle(os.path.join(new_dir, "model" + ".pkl"))
if __name__ == '__main__':
    paths, names = load("C:\\") #The folder that contains xml files
    processes = []
    for path, name in zip(paths, names):
        p = multiprocessing.Process(target=run, args=(path, name))
        processes.append(p)
        p.start()
    for process in processes:
        process.join()

I can elaborate on my comment, but alas, looking at your code and not knowing anything about your model, I do not see an obvious cause for the problems you mentioned.
I mentioned in my comment that I would use either a thread pool or a process pool, according to whether your processing is I/O bound or CPU bound, in order to better control the number of threads or processes you create. While threads have less overhead to create, they all run within the same process, so there is no parallelism when executing Python bytecode, because the Global Interpreter Lock (GIL) has to be acquired first. That is why process pools are generally recommended for CPU-intensive jobs. However, when execution takes place in runtime libraries implemented in C, as is often the case with numpy and pandas, the GIL can be released and you can still get a high degree of parallelism with threads. I don't know the nature of the processing being done by the ProtonOC class instance, but some of it is clearly I/O related. So for now I recommend that you initially try a thread pool, for which I have arbitrarily set a maximum size of 20 (a number I pulled out of thin air). The issue here is that you are doing concurrent operations on your disk, and I don't know whether too many threads will slow down disk operations (do you have a solid-state drive, where arm movement is not an issue?).
If you run the following code example with MAX_CONCURRENCY set to 1, presumably it should work. Of course, that is not your end goal. But it demonstrates how easily you can set the concurrency.
import time
from concurrent.futures import ThreadPoolExecutor as Executor
from model import ProtonOC
import random
import os
import numpy as np
import sys
sys.setrecursionlimit(100000)
def load(dir):
    result = list()
    names = list()
    for filename in os.scandir(dir):
        if filename.path.endswith('.xml'):
            result.append(filename.path)
            names.append(filename.name[:-4])
    return result, names
def run(xml, name):
    model = ProtonOC()
    model.override_xml(xml)
    model.run()
    new_dir = os.path.join("C:\\", name)
    os.mkdir(new_dir)
    model.datacollector.get_agent_vars_dataframe().to_pickle(os.path.join(new_dir, "agents.pkl"))
    model.datacollector.get_model_vars_dataframe().to_pickle(os.path.join(new_dir, "model.pkl"))
if __name__ == '__main__':
    paths, names = load("C:\\") #The folder that contains xml files
    MAX_CONCURRENCY = 20 # never more than this
    N_WORKERS = min(len(paths), MAX_CONCURRENCY)
    with Executor(max_workers=N_WORKERS) as executor:
        executor.map(run, paths, names)
To use a process pool, change:
from concurrent.futures import ThreadPoolExecutor as Executor
to:
from concurrent.futures import ProcessPoolExecutor as Executor
You may then wish to change MAX_CONCURRENCY. But because the jobs still involve a lot of I/O and give up the processor when doing this I/O, you might benefit from this value being greater than the number of CPUs you have.
Update
An alternative to using the map method of the ThreadPoolExecutor class is to use submit. This gives you an opportunity to handle any exception on an individual job-submission basis:
if __name__ == '__main__':
    paths, names = load("C:\\") #The folder that contains xml files
    MAX_CONCURRENCY = 20 # never more than this
    N_WORKERS = min(len(paths), MAX_CONCURRENCY)
    with Executor(max_workers=N_WORKERS) as executor:
        futures = [executor.submit(run, path, name) for path, name in zip(paths, names)]
        for future in futures:
            try:
                result = future.result() # return value from run, which is None
            except Exception as e: # any exception run might have thrown
                print(e) # handle this as you see fit
You should be aware that this submits jobs one by one, whereas map, when used with the ProcessPoolExecutor, allows you to specify a chunksize parameter. When you have a pool size of N and M jobs to submit, where M is much greater than N, it is more efficient to hand each process in the pool chunksize jobs at a time rather than one at a time, which reduces the number of inter-process transfers required. But as long as you are using a thread pool, this is not relevant.
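If you want to know which job failed, one option is to keep a mapping from each future to its job name and collect results with as_completed. This is a sketch with a thread pool; run_all is a made-up helper, and run stands for any callable taking (path, name) like the one above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_all(run, jobs, max_workers=4):
    """Run `run(path, name)` for every (path, name) job.

    Returns (succeeded, failed): a list of names that finished without error
    and a dict mapping each failed name to the exception it raised.
    """
    succeeded, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_name = {
            executor.submit(run, path, name): name for path, name in jobs
        }
        for future in as_completed(future_to_name):
            name = future_to_name[future]
            try:
                future.result()  # re-raises any exception raised inside run
            except Exception as exc:
                failed[name] = exc
            else:
                succeeded.append(name)
    return succeeded, failed
```

Checking the failed dict after a run makes it obvious when a job died part-way, for example after writing one of its two output files.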

Multiprocessing -- Thread Pool Memory Leak?

I am observing memory usage that I cannot explain to myself. Below I provide a stripped down version of my actual code that still exhibits this behavior. The code is intended to accomplish the following:
Read a text file in chunks of 1000 lines. Each line is a sentence. Split these 1000 sentences into 4 generators. Pass these generators to a thread pool and run feature extraction in parallel on 250 sentences.
In my actual code I accumulate features and labels from all sentences of the entire file.
Now here comes the weird thing: Memory gets allocated but not freed again even when not accumulating these values! And it has something to do with the thread pool I think. The amount of memory taken in total is dependent on how many features are extracted for any given word. I simulate this here with range(100). Have a look:
from sys import argv
from itertools import chain, islice
from multiprocessing import Pool
from math import ceil
# dummyfied feature extraction function
# the length of the range determines how much memory is used up in total,
# even though the objects are never stored
def features_from_sentence(sentence):
    return [{'some feature': 'some value'} for i in range(100)], ['some label' for i in range(100)]
# split iterable into generator of generators of length `size`
def chunks(iterable, size=10):
    iterator = iter(iterable)
    for first in iterator:
        yield chain([first], islice(iterator, size - 1))
def features_from_sentence_meta(l):
    return list(map (features_from_sentence, l))
def make_X_and_Y_sets(sentences, i):
    print(f'start: {i}')
    pool = Pool()
    # split sentences into a generator of 4 generators
    sentence_chunks = chunks(sentences, ceil(50000/4))
    # results is a list containing the lists of pairs of X and Y of all chunks
    results = map(lambda x : x[0], pool.map(features_from_sentence_meta, sentence_chunks))
    X, Y = zip(*results)
    print(f'end: {i}')
    return X, Y
# reads file in chunks of `lines_per_chunk` lines
def line_chunks(textfile, lines_per_chunk=1000):
    chunk = []
    i = 0
    with open(textfile, 'r') as textfile:
        for line in textfile:
            if not line.split(): continue
            i+=1
            chunk.append(line.strip())
            if i == lines_per_chunk:
                yield chunk
                i = 0
                chunk = []
    yield chunk
textfile = argv[1]
for i, line_chunk in enumerate(line_chunks(textfile)):
    # stop processing file after 10 chunks to demonstrate
    # that memory stays occupied (check your system monitor)
    if i == 10:
        while True:
            pass
    X_chunk, Y_chunk = make_X_and_Y_sets(line_chunk, i)
The file I am using to debug this has 50000 nonempty lines, which is why I use the hardcoded 50000 in one place. If you want to use the same file, here is a link for your convenience:
https://www.dropbox.com/s/v7nxb7vrrjim349/de_wiki_50000_lines?dl=0
Now when you run this script and open your system monitor you will observe that memory gets used up and the usage keeps going until the 10th chunk, where I artificially go into an endless loop to demonstrate that the memory stays in use, even though I never store anything.
Can you explain to me why this happens? I seem to be missing something about how multiprocessing pools are supposed to be used.

First, let's clear up some misunderstandings—although, as it turns out, this wasn't actually the right avenue to explore in the first place.
When you allocate memory in Python, of course it has to go get that memory from the OS.
When you release memory, however, it rarely gets returned to the OS, until you finally exit. Instead, it goes into a "free list"—or, actually, multiple levels of free lists for different purposes. This means that the next time you need memory, Python already has it lying around, and can find it immediately, without needing to talk to the OS to allocate more. This usually makes memory-intensive programs much faster.
But this also means that—especially on modern 64-bit operating systems—trying to understand whether you really do have any memory pressure issues by looking at your Activity Monitor/Task Manager/etc. is next to useless.
The tracemalloc module in the standard library provides low-level tools to see what actually is going on with your memory usage. At a higher level, you can use something like memory_profiler, which (if you enable tracemalloc support—this is important) can put that information together with OS-level information from sources like psutil to figure out where things are going.
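As a starting point, a sketch like the following compares two tracemalloc snapshots taken around a piece of code and returns the lines with the largest change in allocated bytes. The helper name top_allocations and the limit parameter are made up.

```python
import tracemalloc


def top_allocations(func, limit=3):
    """Run func() under tracemalloc and return (result, largest diffs)."""
    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        result = func()  # keep the result alive so its memory is still counted
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # StatisticDiff entries, grouped by source line, biggest change first.
    stats = after.compare_to(before, "lineno")
    return result, stats[:limit]
```

Printing each returned entry shows the file and line responsible along with the size difference. This reports what Python itself has allocated, which is a different number from what the system monitor shows for the process.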
However, if you aren't seeing any actual problems—your system isn't going into swap hell, you aren't getting any MemoryError exceptions, your performance isn't hitting some weird cliff where it scales linearly up to N and then suddenly goes all to hell at N+1, etc.—you usually don't need to bother with any of this in the first place.
If you do discover a problem, then, fortunately, you're already half-way to solving it. As I mentioned at the top, most memory that you allocated doesn't get returned to the OS until you finally exit. But if all of your memory usage is happening in child processes, and those child processes have no state, you can make them exit and restart whenever you want.
Of course there's a performance cost to doing so—process teardown and startup time, and page maps and caches that have to start over, and asking the OS to allocate the memory again, and so on. And there's also a complexity cost—you can't just run a pool and let it do its thing; you have to get involved in its thing and make it recycle processes for you.
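The multiprocessing.Pool class has basic built-in support for this: the maxtasksperchild argument makes each worker process exit and be replaced by a fresh one after it has completed that many tasks.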
You can, of course, build your own Pool. If you want to get fancy, you can look at the source to multiprocessing and do what it does. Or you can build a trivial pool out of a list of Process objects and a pair of Queues. Or you can just directly use Process objects without the abstraction of a pool.
Another reason you can have memory problems is that your individual processes are fine, but you just have too many of them.
And, in fact, that seems to be the case here.
You create a Pool of 4 workers in this function:
def make_X_and_Y_sets(sentences, i):
    print(f'start: {i}')
    pool = Pool()
    # ...
… and you call this function for every chunk:
for i, line_chunk in enumerate(line_chunks(textfile)):
    # ...
    X_chunk, Y_chunk = make_X_and_Y_sets(line_chunk, i)
So, you end up with 4 new processes for every chunk. Even if each one has pretty low memory usage, having hundreds of them at once is going to add up.
Not to mention that you're probably severely hurting your time performance by having hundreds of processes competing over 4 cores, so you waste time in context switching and OS scheduling instead of doing real work.
As you pointed out in a comment, the fix for this is trivial: just make a single global pool instead of a new one for each call.
Sorry for getting all Columbo here, but… just one more thing… This code runs at the top level of your module:
for i, line_chunk in enumerate(line_chunks(textfile)):
    # ...
    X_chunk, Y_chunk = make_X_and_Y_sets(line_chunk, i)
… and that's the code that tries to spin up the pool and all the child tasks. But each child process in that pool needs to import this module, which means they're all going to end up running the same code, and spinning up another pool and a whole extra set of child tasks.
You're presumably running this on Linux or macOS with the fork start method (the traditional default there, although macOS has defaulted to spawn since Python 3.8), which means multiprocessing can avoid this import, so you don't have a problem. But with the other start methods, this code would basically be a fork bomb that eats up all of your system resources. And that includes spawn, which is the default start method on Windows. So, if there's ever any chance anyone might run this code on Windows, you should put all of that top-level code in an if __name__ == '__main__': guard.
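Putting the two fixes together (one shared pool, and top-level work kept behind a guard), a minimal sketch might look like the following. The names count_tokens, process_chunk and process_all are made up for illustration, standing in for whatever make_X_and_Y_sets really does, and the per-sentence work is just a placeholder.

```python
from multiprocessing import Pool


def count_tokens(sentence):
    # Placeholder for the real per-sentence work.
    return len(sentence.split())


def process_chunk(pool, sentences):
    """Map one chunk over an existing pool instead of creating a new pool."""
    return pool.map(count_tokens, sentences)


def process_all(chunks, pool):
    # Every chunk reuses the same worker processes.
    return [process_chunk(pool, chunk) for chunk in chunks]


def main():
    chunks = [["a b", "c"], ["d e f"]]
    # The pool is created exactly once for the whole run.
    with Pool() as pool:
        print(process_all(chunks, pool))


# In a script, call main() only from behind the guard, so that child
# processes importing this module do not start pools of their own:
#
#     if __name__ == "__main__":
#         main()
```

Passing the pool in as an argument, rather than reaching for a module-level global, is just one way to share it; it also makes the functions easy to try out with a different pool object.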

Why is reading multiple files at the same time slower than reading sequentially?

I am trying to parse many files found in a directory, however using multiprocessing slows my program.
# Calling my parsing function from the client.
# 1000 .txt files are found in this directory, ~100MB combined.
L = getParsedFiles('/home/tony/Lab/slicedFiles')
Following this example from python documentation:
from multiprocessing import Pool
def f(x):
    return x*x
if __name__ == '__main__':
    p = Pool(5)
    print(p.map(f, [1, 2, 3]))
I've written this piece of code:
from multiprocessing import Pool
from api.ttypes import *
import gc
import os
def _parse(pathToFile):
    myList = []
    with open(pathToFile) as f:
        for line in f:
            s = line.split()
            x, y = [int(v) for v in s]
            obj = CoresetPoint(x, y)
            gc.disable()
            myList.append(obj)
            gc.enable()
    return Points(myList)
def getParsedFiles(pathToFile):
    myList = []
    p = Pool(2)
    for filename in os.listdir(pathToFile):
        if filename.endswith(".txt"):
            myList.append(filename)
    return p.map(_parse, myList)
I followed the example, put all the names of the files that end with a .txt in a list, then created Pools, and mapped them to my function. Then I want to return a list of objects. Each object holds the parsed data of a file. However it amazes me that I got the following results:
#Pool 32 ---> ~162(s)
#Pool 16 ---> ~150(s)
#Pool 12 ---> ~142(s)
#Pool 2 ---> ~130(s)
Machine specification:
62.8 GiB RAM
Intel® Core™ i7-6850K CPU @ 3.60GHz × 12
What am I missing here ?
Thanks in advance!

Looks like you're I/O bound:
In computer science, I/O bound refers to a condition in which the time it takes to complete a computation is determined principally by the period spent waiting for input/output operations to be completed. This is the opposite of a task being CPU bound. This circumstance arises when the rate at which data is requested is slower than the rate it is consumed or, in other words, more time is spent requesting data than processing it.
You probably need to have your main thread do the reading and add the data to the pool when a subprocess becomes available. This will be different to using map.
As you are processing a line at a time, and the inputs are split, you can use fileinput to iterate over the lines of multiple files, and map a function that processes lines instead of files.
Passing one line at a time between processes might be too slow, so we can ask imap to ship lines to the workers in chunks, and adjust chunksize until we find a sweet spot. Note that chunksize only batches the transfer; the mapped function is still called once per line:
def _parse_coreset_point(line):
    # Each line holds two whitespace-separated integers
    s = line.split()
    x, y = [int(v) for v in s]
    return CoresetPoint(x, y)
And our main function, which returns one Points object for all the files combined:
import fileinput
import os
from multiprocessing import Pool
def getParsedFiles(directory):
    # Build full paths, since fileinput opens names relative to the current directory
    txts = [os.path.join(directory, filename) for filename in os.listdir(directory)
            if filename.endswith(".txt")]
    with Pool(2) as pool:
        # The main process reads the lines; the workers parse them in batches of 100
        points = pool.imap(_parse_coreset_point, fileinput.input(txts), chunksize=100)
        return Points(list(points))

In general it is never a good idea to read from the same physical (spinning) hard disk from different threads simultaneously, because every switch causes an extra delay of around 10ms to position the read head of the hard disk (would be different on SSD).
As @peter-wood already said, it is better to have one thread reading in the data, and have other threads processing that data.
Also, to really test the difference, I think you should do the test with some bigger files. For example: current hard disks should be able to read around 100MB/sec. So reading the data of a 100kB file in one go would take 1ms, while positioning the read head to the beginning of that file would take 10ms.
On the other hand, looking at your numbers (assuming those are for a single loop) it is hard to believe that being I/O bound is the only problem here. Total data is 100MB, which should take 1 second to read from disk plus some overhead, but your program takes 130 seconds. I don't know if that number is with the files cold on disk, or an average of multiple tests where the data is already cached by the OS (with 62 GB of RAM all that data should be cached the second time) - it would be interesting to see both numbers.
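To put rough numbers on that, here is the back-of-the-envelope estimate as a small script. It uses the figures quoted above (about 10ms per head repositioning and about 100MB/sec sequential throughput) together with the 1000 files and ~100MB from the question; these are ballpark assumptions for a spinning disk, not measurements, and the function name is made up.

```python
def estimated_read_seconds(n_files, total_bytes,
                           seek_seconds=0.010, bytes_per_second=100e6):
    """Return (seek_time, transfer_time) in seconds for a cold spinning disk.

    Assumes one head repositioning per file and sequential reads otherwise.
    """
    seek_time = n_files * seek_seconds
    transfer_time = total_bytes / bytes_per_second
    return seek_time, transfer_time


seek_time, transfer_time = estimated_read_seconds(n_files=1000, total_bytes=100e6)
print(f"seeking: {seek_time:.1f} s, transfer: {transfer_time:.1f} s, "
      f"total: {seek_time + transfer_time:.1f} s")
# prints "seeking: 10.0 s, transfer: 1.0 s, total: 11.0 s"
```

Even with a full seek charged for every one of the 1000 files, the estimate stays an order of magnitude below the measured 130 seconds.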
So there has to be something else. Let's take a closer look at your loop:
for line in f:
    s = line.split()
    x, y = [int(v) for v in s]
    obj = CoresetPoint(x, y)
    gc.disable()
    myList.append(obj)
    gc.enable()
While I don't know Python, my guess would be that the gc calls are the problem here. They are called for every line read from disk. I don't know how expensive those calls are (or whether gc.enable() triggers a garbage collection, for example) or why they would be needed around append(obj) only, but there might be other problems because this is multithreading:
Assuming the gc object is global (i.e. not thread local) you could have something like this:
thread 1 : gc.disable()
# switch to thread 2
thread 2 : gc.disable()
thread 2 : myList.append(obj)
thread 2 : gc.enable()
# gc now enabled!
# switch back to thread 1 (or one of the other threads)
thread 1 : myList.append(obj)
thread 1 : gc.enable()
And if the number of threads <= number of cores, there wouldn't even be any switching, they would all be calling this at the same time.
Also, if the gc object is thread safe (it would be worse if it isn't) it would have to do some locking in order to safely alter its internal state, which would force all other threads to wait.
For example, gc.disable() would look something like this:
def disable():
    lock()    # all other threads are blocked for gc calls now
    # ... alter internal data ...
    unlock()
And because gc.disable() and gc.enable() are called in a tight loop, this will hurt performance when using multiple threads.
So it would be better to remove those calls, or place them at the beginning and end of your program if they are really needed (or only disable gc at the beginning, no need to do gc right before quitting the program).
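A sketch of that suggestion in Python: switch the collector off once around the whole parsing loop instead of once per line, and restore its previous state afterwards. The function name parse_points is made up, and plain tuples stand in for the question's CoresetPoint objects so that the example is self-contained.

```python
import gc


def parse_points(lines):
    """Parse "x y" lines into (x, y) tuples with the collector paused once."""
    was_enabled = gc.isenabled()
    gc.disable()  # one call for the whole loop, not one per line
    try:
        return [tuple(int(v) for v in line.split()) for line in lines]
    finally:
        # Only switch the collector back on if it was on to begin with.
        if was_enabled:
            gc.enable()


if __name__ == "__main__":
    print(parse_points(["1 2", "3 4"]))  # prints [(1, 2), (3, 4)]
```

Whether pausing the collector helps at all is something to measure; dropping the gc calls entirely is the simpler variant.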
Depending on the way Python copies or moves objects, it might also be slightly better to use myList.append(CoresetPoint(x, y)).
So it would be interesting to test the same on one 100MB file with one thread and without the gc calls.
If the processing takes longer than the reading (i.e. not I/O bound), use one thread to read the data in a buffer (should take 1 or 2 seconds on one 100MB file if not already cached), and multiple threads to process the data (but still without those gc calls in that tight loop).
You don't have to split the data into multiple files in order to be able to use threads. Just let them process different parts of the same file (even with the 14GB file).
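One possible way to let workers handle different parts of the same file is to cut it into byte ranges that end on line boundaries. The sketch below does that; chunk_boundaries and read_chunk_lines are names made up for this example, and it assumes newline-terminated text that is read in binary mode.

```python
import os


def chunk_boundaries(path, n_chunks):
    """Return (start, end) byte offsets that split the file on line boundaries."""
    size = os.path.getsize(path)
    offsets = [0]
    with open(path, "rb") as f:
        for k in range(1, n_chunks):
            # Jump to the approximate cut point, then move to the next line start.
            f.seek(size * k // n_chunks)
            f.readline()
            offsets.append(min(f.tell(), size))
    offsets.append(size)
    # Two cut points can fall in the same long line; drop the empty ranges.
    return [(start, end) for start, end in zip(offsets, offsets[1:]) if start < end]


def read_chunk_lines(path, start, end):
    """Read only the bytes in [start, end) and return them as a list of lines."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start).splitlines()
```

Each (start, end) pair can then be handed to a separate worker, which opens the file itself and parses only its own range.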

A copy-paste snippet, for people who come from Google and don't like reading
The example is for JSON reading; just replace __single_json_loader with a loader for another file type to work with that.
from multiprocessing import Pool
from typing import Callable, Any, Iterable
import os
import json
def parallel_file_read(existing_file_paths: Iterable[str], map_lambda: Callable[[str], Any]):
    result = {p: None for p in existing_file_paths}
    pool = Pool()
    for i, (temp_result, path) in enumerate(zip(pool.imap(map_lambda, existing_file_paths), result.keys())):
        result[path] = temp_result
    pool.close()
    pool.join()
    return result
def __single_json_loader(f_path: str):
    with open(f_path, "r") as f:
        return json.load(f)
def parallel_json_read(existing_file_paths: Iterable[str]):
    combined_result = parallel_file_read(existing_file_paths, __single_json_loader)
    return combined_result
And usage
if __name__ == "__main__":
    def main():
        directory_path = r"/path/to/my/file/directory"
        assert os.path.isdir(directory_path)
        d: os.DirEntry
        all_files_names = [f for f in os.listdir(directory_path)]
        all_files_paths = [os.path.join(directory_path, f_name) for f_name in all_files_names]
        assert(all(os.path.isfile(p) for p in all_files_paths))
        combined_result = parallel_json_read(all_files_paths)
    main()
It is very straightforward to replace the JSON reader with any other reader, and you're done.

How to parse a large file taking advantage of threading in Python?

I have a huge file and need to read it and process.
with open(source_filename) as source, open(target_filename, "w") as target:
    for line in source:
        target.write(do_something(line))
do_something_else()
Can this be accelerated with threads? If I spawn a thread per line, will this have a huge overhead cost?
edit: To keep this question from turning into a discussion: what should the code look like?
with open(source_filename) as source, open(target_filename, "w") as target:
    ?
@Nicoretti: In an iteration I need to read a line of several KB of data.
update 2: the file may be a bz2, so Python may have to wait for unpacking:
$ bzip2 -dc country.osm.bz2 | ./my_script.py

You could use three threads: for reading, processing and writing. The possible advantage is that the processing can take place while waiting for I/O, but you need to take some timings yourself to see if there is an actual benefit in your situation.
import threading
from queue import Queue  # Python 3; the module is named Queue in Python 2
QUEUE_SIZE = 1000
sentinel = object()
def read_file(name, queue):
    with open(name) as f:
        for line in f:
            queue.put(line)
    queue.put(sentinel)
def process(inqueue, outqueue):
    for line in iter(inqueue.get, sentinel):
        outqueue.put(do_something(line))
    outqueue.put(sentinel)
def write_file(name, queue):
    with open(name, "w") as f:
        for line in iter(queue.get, sentinel):
            f.write(line)
inq = Queue(maxsize=QUEUE_SIZE)
outq = Queue(maxsize=QUEUE_SIZE)
threading.Thread(target=read_file, args=(source_filename, inq)).start()
threading.Thread(target=process, args=(inq, outq)).start()
write_file(target_filename, outq)
It is a good idea to set a maxsize for the queues to prevent ever-increasing memory consumption. The value of 1000 is an arbitrary choice on my part.

Does the processing stage take a relatively long time, i.e., is it CPU-intensive? If not, then no, you don't win much by threading or multiprocessing it. If your processing is expensive, then yes. So, you need to profile to know for sure.
If you spend relatively more time reading the file than processing it (i.e., the file is big), then you can't win in performance by using threads; the bottleneck is just the I/O, which threads don't improve.

This is the exact sort of thing which you should not try to analyse a priori, but instead should profile.
Bear in mind that threading will only help if the per-line processing is heavy. An alternative strategy would be to slurp the whole file into memory, and process it in memory, which may well obviate threading.
Whether you have a thread per line is, once again, something for fine-tuning, but my guess is that unless parsing the lines is pretty heavy, you may want to use a fixed number of worker threads.
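For the fixed-number-of-workers variant, one possible sketch uses concurrent.futures from the standard library. The names transform_lines and transform_file are made up for this example, and do_something is whatever per-line function you have; the default of 4 workers is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def transform_lines(lines, do_something, n_workers=4):
    """Apply do_something to every line using a fixed pool of worker threads.

    Executor.map yields results in input order, whatever order the threads
    finish in. It also submits every line up front, so the whole input ends
    up queued in memory as pending tasks.
    """
    with ThreadPoolExecutor(max_workers=n_workers) as executor:
        yield from executor.map(do_something, lines)


def transform_file(source_filename, target_filename, do_something, n_workers=4):
    with open(source_filename) as source, open(target_filename, "w") as target:
        target.writelines(transform_lines(source, do_something, n_workers))
```

Called as transform_file("in.txt", "out.txt", str.upper), this writes the lines out in their original order. As with everything else here, only timing it against the plain loop will tell you whether it is worth it.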
There is another alternative: spawn sub-processes, and have them do the reading, and the processing. Given your description of the problem, I would expect this to give you the greatest speed-up. You could even use some sort of in-memory caching system to speed up the reading, such as memcached (or any of the similar-ish systems out there, or even a relational database).

In CPython, threading is limited by the global interpreter lock — only one thread at a time can actually be executing Python code. So threading only benefits you if either:
(1) you are doing processing that doesn't require the global interpreter lock; or
(2) you are spending time blocked on I/O.
Examples of (1) include applying a filter to an image in the Python Imaging Library, or finding the eigenvalues of a matrix in numpy. Examples of (2) include waiting for user input, or waiting for a network connection to finish sending data.
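Case (2) is easy to see with a toy experiment. In this sketch time.sleep stands in for a blocking I/O call (an assumption made for illustration; like a blocking read, it releases the global interpreter lock while waiting), and the function names are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_io(seconds):
    # Stand-in for a blocking read or network call; the GIL is released here.
    time.sleep(seconds)
    return seconds


def timed(n_tasks, n_threads, seconds=0.2):
    """Return the wall-clock time to run n_tasks blocking calls on n_threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as executor:
        list(executor.map(blocking_io, [seconds] * n_tasks))
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"1 thread:  {timed(4, 1):.2f} s")
    print(f"4 threads: {timed(4, 4):.2f} s")
```

With four threads the waits overlap, so the total is close to a single wait rather than four. Swap the sleep for a pure-Python computation and that advantage should largely disappear, because the threads then take turns holding the lock.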
So whether your code can be accelerated using threads in CPython depends on what exactly you are doing in the do_something call. (But if you are parsing the line in Python then it is very unlikely that you can speed this up by launching threads.) You should also note that if you do start launching threads then you will face a synchronization problem when you are writing the results to the target file. There is no guarantee that threads will complete in the same order that they were started, so you will have to take care to ensure that the output comes out in the right order.
Here's a maximally threaded implementation that has threads for reading the input, writing the output, and one thread for processing each line. Only testing will tell you if this is faster or slower than the single-threaded version (or Janne's version with only three threads).
from threading import Thread
from queue import Queue  # Python 3; the module is named Queue in Python 2
def process_file(f, source_filename, target_filename):
    """
    Apply the function `f` to each line of `source_filename` and write
    the results to `target_filename`. Each call to `f` is evaluated in
    a separate thread.
    """
    worker_queue = Queue()
    finished = object()
    def process(queue, line):
        "Process `line` and put the result on `queue`."
        queue.put(f(line))
    def read():
        """
        Read `source_filename`, create an output queue and a worker
        thread for every line, and put that worker's output queue onto
        `worker_queue`.
        """
        with open(source_filename) as source:
            for line in source:
                queue = Queue()
                Thread(target = process, args=(queue, line)).start()
                worker_queue.put(queue)
        worker_queue.put(finished)
    Thread(target = read).start()
    with open(target_filename, 'w') as target:
        for output in iter(worker_queue.get, finished):
            target.write(output.get())
