Python multiprocessing - sharing large dataset

I'm trying to speed up a CPU-bound Python script (on Windows 11). Threads in Python do not seem to run on different CPU cores, so the only option I have is multiprocessing.
I have a big dictionary data structure (11GB memory footprint after loading from file), and I need to check whether calculated values are in that dictionary. The input for the calculation also comes from a file (100GB in size). I can pool-map this input to the processes in batches, no problem. But I cannot copy the dictionary to all processes because there is not enough memory for that. So I need to find a way for the processes to check whether a value (actually a string) is in the dictionary.
Any advice?
Pseudo program flow:
--main--
- load dictionary structure from file   # 11GB memory footprint
- ...
- While not all chunks loaded
    - Load chunk of calcdata from file   # (10,000 lines per chunk)
    - Distribute (map) calcdata chunk to processes
- Wait for processes to complete all chunks
--process--
- for each element in subchunk
    - perform calculation
    - check if calculation in dictionary   # here is my problem!
    - store result in file
Edit: after implementing the comments below, I am now at:
import datetime
import multiprocessing
import os
import time
from multiprocessing import Manager

def ReadDictFromFile():
    cnt = 0
    print("Reading dictionary from " + dictfilename)
    with open(dictfilename, encoding="utf-8", errors="replace") as f:
        next(f)  # skip first line (header)
        for line in f:
            s = line.rstrip("\n")
            (key, keyvalue) = s.split()
            shared_dict[str(key)] = keyvalue
            cnt = cnt + 1
            if (cnt % 1000000) == 0:  # log each 1000000 where we are
                print(cnt)
                return  # temp to speed up testing, not load whole dictionary atm
    print("Done loading dictionary")

def checkqlist(qlist):
    print(str(os.getpid()) + "-" + str(len(qlist)))
    for li in qlist:
        try:
            checkvalue = calculations(li)
            (found, keyval) = InMem(checkvalue)
            if found:
                print("FOUND!!! " + checkvalue + ' ' + keyval)
        except Exception as e:
            print("(" + str(os.getpid()) + ")Error log: %s" % repr(e))
            time.sleep(15)

def InMem(checkvalue):
    if checkvalue in shared_dict:
        return True, shared_dict[checkvalue]
    else:
        return False, ""

if __name__ == "__main__":
    start_time = time.time()
    global shared_dict
    manager = Manager()
    shared_dict = manager.dict()
    ReadDictFromFile()
    chunksize = 5
    nr_of_processes = 10
    with open(filetocheck, encoding="utf-8", errors="replace") as f:
        qlist = []
        for line in f:
            s = line.rstrip("\n")
            qlist.append(s)
            if len(qlist) >= (chunksize * nr_of_processes):
                chunked_list = [qlist[i:i+chunksize] for i in range(0, len(qlist), chunksize)]
                try:
                    with multiprocessing.Pool() as pool:
                        pool.map(checkqlist, chunked_list, nr_of_processes)  # problem: qlist is a single string, not a list of about 416 strings.
                except Exception as e:
                    print("error log: %s" % repr(e))
                    time.sleep(15)
    logit("Completed! " + datetime.datetime.now().strftime("%I:%M%p on %B %d, %Y"))
    print("--- %s seconds ---" % (time.time() - start_time))

You can use a multiprocessing.Manager dict for this; it's the fastest IPC you can use to do the check between processes in Python. To reduce its memory size, make it smaller by changing all values to None. On my PC it can do about 33k membership checks per second, which is about 400 times slower than a normal dictionary.
manager = Manager()
shared_dict = manager.dict()
shared_dict.update({x:None for x in main_dictionary})
shared_dict["new_element"] = None # to set another value
del shared_dict["new_element"] # to delete a certain value
You can also use a dedicated in-memory database for this, such as Redis, which can handle being polled by multiple processes at the same time.
@Sam Mason's suggestion to use WSL and fork may be better, but this one is the most portable.
Edit: to store it in the children's global scope, you have to pass it through the initializer.
def define_global(var):
    global shared_dict
    shared_dict = var
...
if __name__ == "__main__":
    ...
    with multiprocessing.Pool(initializer=define_global, initargs=(shared_dict,)) as pool:
        ...
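The initializer approach can be sketched end to end as follows. This is only a minimal illustration under stated assumptions: the toy keys, the candidate strings, and the helper names check_chunk and chunks are hypothetical and not part of the original answer; only define_global and the initializer/initargs arguments come from it.

```python
import itertools
import multiprocessing

shared_dict = None  # set in each worker process by the initializer


def define_global(var):
    """Pool initializer: keep the dict proxy in the worker's global scope."""
    global shared_dict
    shared_dict = var


def check_chunk(values):
    """Return the values of one chunk that are keys of the shared dict."""
    return [v for v in values if v in shared_dict]


def chunks(iterable, size):
    """Yield successive lists of at most `size` items (hypothetical helper)."""
    it = iter(iterable)
    while True:
        batch = list(itertools.islice(it, size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    with multiprocessing.Manager() as manager:
        # Toy data standing in for the real 11GB dictionary (assumption).
        proxy = manager.dict({"alpha": None, "beta": None})
        candidates = ["alpha", "gamma", "beta", "delta", "epsilon"]
        with multiprocessing.Pool(2, initializer=define_global,
                                  initargs=(proxy,)) as pool:
            # Each worker receives a small list and queries the manager process.
            for hits in pool.imap_unordered(check_chunk, chunks(candidates, 2)):
                print(hits)
```

The dictionary lives once, in the manager process; every membership test from a worker is a round trip to that process, which is why it is much slower than a local dict lookup.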

Related

how to "poll" python multiprocess pool apply_async

I have a task function like this:
def task(s):
    # do something
    return res
The original program is:
res = []
for i in data:
    res.append(task(i))
    # using pickle to save res every 30s
I need to process a lot of data and I don't care about the output order of the results. Because of the long running time, I need to save the current progress regularly. Now I want to change it to use multiprocessing:
pool = Pool(4)
status = []
res = []
for i in data:
    status.append(pool.apply_async(task, (i,)))
for i in status:
    res.append(i.get())
    # using pickle to save res every 30s
Suppose I have processes p0, p1, p2, p3 in the Pool and 10 tasks (task(0) ... task(9)), and p0 takes a very long time to finish task(0).
Will the main process be blocked at the first "res.append(i.get())"?
If p1 finishes task(1) while p0 is still working on task(0), will p1 continue with task(4) or a later task?
If the answer to the first question is yes, how can I get the other results in advance and get the result of task(0) last?
I updated my code, but the main process blocked somewhere while the other processes were still working on tasks. What's wrong? Here is the core of the code:
with concurrent.futures.ProcessPoolExecutor(4) as ex:
    for i in self.inBuffer:
        futuresList.append(ex.submit(warpper, i))
    for i in concurrent.futures.as_completed(futuresList):
        (word, r) = i.result()
        self.resDict[word] = r
        self.logger.info("{} --> {}".format(word, r))
        cur = datetime.now()
        if (cur - self.timeStmp).total_seconds() > 30:
            self.outputPickle()
            self.timeStmp = datetime.now()
The length of self.inBuffer is about 100000. self.logger.info writes the info to a log file. For some special inputs i, the warpper function prints auxiliary information with print. self.resDict is a dict that stores the results. self.outputPickle() writes a .pkl file using pickle.dump.
At first the code ran normally: both the log file and the output printed by warpper were updated. But at some point I found that the log file had not been updated for a long time (several hours, although a single warpper call should not take more than 120s), while warpper was still printing information (it printed about 100 messages without any update to the log file before I killed the process). Also, the timestamp of the output .pkl file didn't change. Here is the implementation of outputPickle():
def outputPickle(self):
    if os.path.exists(os.path.join(self.wordDir, self.outFile)):
        if os.path.exists(os.path.join(self.wordDir, "{}_backup".format(self.outFile))):
            os.remove(os.path.join(self.wordDir, "{}_backup".format(self.outFile)))
        shutil.copy(os.path.join(self.wordDir, self.outFile), os.path.join(self.wordDir, "{}_backup".format(self.outFile)))
    with open(os.path.join(self.wordDir, self.outFile), 'wb') as f:
        pickle.dump(self.resDict, f)
Then I added three print calls:
print("getting res of something")
(word, r) = i.result()
print("finishing i.result")
self.resDict[word] = r
print("finished getting res of {}".format(word))
Here is the log:
getting res of something
finishing i.result
finished getting res of CNICnanotubesmolten
getting res of something
finishing i.result
finished getting res of CNN0
getting res of something
message by warpper
message by warpper
message by warpper
message by warpper
message by warpper
The line "message by warpper" is printed at most once per call to warpper.

1. Yes.
2. Yes, because the tasks are submitted asynchronously. p1 (or another worker) will pick up the next task whenever there are more tasks in the input than worker processes.
"... how to get other results in advance"
One convenient option is concurrent.futures.as_completed, which yields the futures as they complete:
import time
import concurrent.futures

def func(x):
    time.sleep(3)
    return x ** 2

if __name__ == '__main__':
    data = range(1, 5)
    results = []
    with concurrent.futures.ProcessPoolExecutor(4) as ex:
        futures = [ex.submit(func, i) for i in data]
        # processing the earlier results: as they are completed
        for fut in concurrent.futures.as_completed(futures):
            res = fut.result()
            results.append(res)
            print(res)
Sample output:
4
1
9
16
Another option is to pass a callback to apply_async(func[, args[, kwds[, callback[, error_callback]]]]). The callback accepts a single argument, the value returned by the function. In that callback you can process the result in a minimal way (keeping in mind that it only sees one result from one function call). The general scheme looks as follows:
def res_callback(v):
    # ... processing result
    with open('test.txt', 'a') as f:  # just an example
        f.write(str(v))
    print(v, flush=True)

if __name__ == '__main__':
    data = range(1, 5)
    results = []
    with Pool(4) as pool:
        tasks = [pool.apply_async(func, (i,), callback=res_callback) for i in data]
        # wait here for the tasks to finish
But that scheme still requires you to wait somehow (for example by calling get() on the results) for the submitted tasks.
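One way to do that waiting is to close the pool and join it, which blocks until every submitted task has finished and its callback has run. The following is a minimal sketch, not the answerer's code; the names square, collect, and collected are made up for the illustration.

```python
import multiprocessing

collected = []  # filled by the callback, in the parent process


def square(x):
    """Stand-in for the real task (hypothetical)."""
    return x ** 2


def collect(value):
    """Callback: called in the parent process once per finished task."""
    collected.append(value)


if __name__ == "__main__":
    pool = multiprocessing.Pool(4)
    for i in range(1, 5):
        pool.apply_async(square, (i,), callback=collect)
    pool.close()  # no more tasks will be submitted
    pool.join()   # block until all tasks and their callbacks are done
    print(sorted(collected))  # prints [1, 4, 9, 16]
```

The results arrive in completion order, so the list is sorted only for display. Periodic saving (for example every 30 seconds) could be done inside the callback, provided it stays short, because the callback runs in the thread that handles all incoming results.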

Multiprocessing hangs after several hundred jobs

I am trying to use this question for my file processing:
Python multiprocessing safely writing to a file
This is my modification of the code:
def listener(q):
    '''listens for messages on the q, writes to file. '''
    while 1:
        reads = q.get()
        if reads == 'kill':
            #f.write('killed')
            break
        for read in reads:
            out_bam.write(read)
        out_bam.flush()
    out_bam.close()

def fetch_reads(line, q):
    parts = line[:-1].split('\t')
    print(parts)
    start,end = int(parts[1])-1,int(parts[2])-1
    in_bam = pysam.AlignmentFile(args.bam, mode='rb')
    fetched = in_bam.fetch(parts[0], start, end)
    reads = [read for read in fetched if (read.cigarstring and read.pos >= start and read.pos < end and 'S' not in read.cigarstring)]
    in_bam.close()
    q.put(reads)
    return reads

#must use Manager queue here, or will not work
manager = mp.Manager()
q = manager.Queue()
if not args.threads:
    threads = 1
else:
    threads = int(args.threads)
pool = mp.Pool(threads+1)

#put listener to work first
watcher = pool.apply_async(listener, (q,))

with open(args.bed,'r') as bed:
    jobs = []
    cnt = 0
    for line in bed:
        # Fire off the read fetchings
        job = pool.apply_async(fetch_reads, (line, q))
        jobs.append(job)
        cnt += 1
        if cnt > 10000:
            break

# collect results from the workers through the pool result queue
for job in jobs:
    job.get()
    print('get')

#now we are done, kill the listener
q.put('kill')
pool.close()
The difference is that I am opening and closing the file inside the function, since otherwise I get unusual errors from bgzip.
At first, print(parts) and print('get') output is interleaved (more or less), then 'get' is printed less and less often. Ultimately the code hangs and nothing more is printed (all the parts are printed, but 'get' simply doesn't print anymore). The output file remains at zero bytes.
Can anyone lend a hand? Cheers!

Windows Python 2.7 multiprocessing.Manager.Queue deadlocks on put from child process

I'm trying to get code similar to the following example working correctly:
from multiprocessing import Process, Queue, Manager, Pool
import time
from datetime import datetime

def results_producer(the_work, num_procs):
    results = Manager().Queue()
    ppool = Pool(num_procs)
    multiplier = 3
    #step = len(the_work)/(num_procs*multiplier)
    step = 100
    for i in xrange(0,len(the_work), step):
        batch = the_work[i:i+step]
        ppool.apply_async(do_work1, args=(i,batch,results))#,callback=results.put_nowait)
    return (ppool, results)

def results_consumer(results, total_work, num_procs, pool=None):
    current = 0
    batch_size=10
    total = total_work
    est_remaining = 0
    while current < total_work:
        size = results.qsize()
        est_remaining = total_work - (current + size)
        if current % 1000 == 0:
            print 'Attempting to retrieve item from queue that is empty? %s, with size: %d and remaining work: %d' % (results.empty(), size, est_remaining)
        item = results.get()
        results.task_done()
        current += 1
        if current % batch_size == 0 or total_work - current < batch_size:
            if pool is not None and est_remaining == 0 and size/num_procs > batch_size:
                pool.apply_async(do_work2, args=(current, item, True))
            else:
                do_work2(current,item, False)
        if current % 1000 == 0:
            print 'Queue size: %d and remaining work: %d' % (size, est_remaining)

def do_work1(i, w, results):
    time.sleep(.05)
    if i % 1000 == 0:
        print 'did work %d: from %d to %d' % (i,w[0], w[-1])
    for j in w:
        #create an increasing amount of work on the queue
        results.put_nowait(range(j*2))

def do_work2(index, item, in_parallel):
    time.sleep(1)
    if index % 50 == 0:
        print 'processed result %d with length %d in parallel %s' % (index, len(item), in_parallel)

if __name__ == "__main__":
    num_workers = 2
    start = datetime.now()
    print 'Start: %s' % start
    amount_work = 4000
    the_work = [i for i in xrange(amount_work)]
    ppool, results = results_producer(the_work, num_workers)
    results_consumer(results, len(the_work), num_workers, ppool)
    if ppool is not None:
        ppool.close()
        ppool.join()
    print 'Took: %s time' % (datetime.now() - start)
And it deadlocks on the results.put_nowait call from do_work1 even though the queue is empty! Sometimes the code is able to put all the work on the queue but the results.get call from results_consumer blocks since it is apparently empty even though the work has not been consumed yet.
Additionally, I checked the programming guidelines: https://docs.python.org/2/library/multiprocessing.html and believe the above code conforms to it. Lastly the problem in this post: Python multiprocessing.Queue deadlocks on put and get seems very similar and claims to be solved on Windows (I'm running this on Windows 8.1) however the above code doesn't block due to the parent process attempting to join the child process since the logic is similar to the suggested answer. Any suggestions about the cause of the deadlock and how to fix it? Also in general, what is the best way to enable multiple producers to provide results for a consumer to process in python?

Python/Multiprocessing : Processes does not seem to start

I have a function which reads a binary file and converts each byte into a corresponding sequence of characters. For example, 0x05 becomes 'AACC', 0x2A becomes 'AGGG' etc...The function which reads the file and converts the bytes is currently a linear one and since the files to convert are anywhere between 25kb and 2Mb, this can take quite a while.
Therefore, I'm trying to use multiprocessing to divide the task and hopefully improve speed. However, I just can't get it to work. Below is the linear function, which works, albeit slowly;
def fileToRNAString(_file):
    if (_file and os.path.isfile(_file)):
        rnaSequences = []
        blockCount = 0
        blockSize = 2048
        printAndLog("!", "Converting %s into RNA string (%d bytes/block)" % (_file, blockSize))
        with open(_file, "rb") as hFile:
            buf = hFile.read(blockSize)
            while buf:
                decSequenceToRNA(blockCount, buf, rnaSequences)
                blockCount = blockCount + 1
                buf = hFile.read(blockSize)
    else:
        printAndLog("-", "Could not find the specified file. Please verify that the file exists:" + _file)
    return rnaSequences
Note: The function 'decSequenceToRNA' takes the buffer read and converts each byte to the required string. Upon execution, the function returns a tuple which contain the block number and the string, e.g. (1, 'ACCGTAGATTA...') and at the end, I have an array of these tuples available.
I've tried to convert the function to use the multiprocessing of Python;
def fileToRNAString(_file):
    rnaSequences = []
    if (_file and os.path.isfile(_file)):
        blockCount = 0
        blockSize = 2048
        printAndLog("!", "Converting %s into RNA string (%d bytes/block)" % (_file, blockSize))
        workers = []
        with open(_file, "rb") as hFile:
            buf = hFile.read(blockSize)
            while buf:
                p = Process(target=decSequenceToRNA, args=(blockCount, buf, rnaSequences))
                p.start()
                workers.append(p)
                blockCount = blockCount + 1
                buf = hFile.read(blockSize)
        for p in workers:
            p.join()
    else:
        printAndLog("-", "Could not find the specified file. Please verify that the file exists:" + _file)
    return rnaSequences
However, no process seems to even start: when this function is run, an empty array is returned, and none of the messages that 'decSequenceToRNA' prints to the console are displayed:
>>>fileToRNAString(testfile)
[!] Converting /root/src/amino56/M1H2.bin into RNA string (2048 bytes/block).
Unlike this question here, I'm running Linux shiva 3.14-kali1-amd64 #1 SMP Debian 3.14.5-1kali1 (2014-06-07) x86_64 GNU/Linux and using PyCrust to test the functions on Python Version: 2.7.3. I'm using the following packages:
import os
import re
import sys
import urllib2
import requests
import logging
import hashlib
import argparse
import tempfile
import shutil
import feedparser
from multiprocessing import Process
I'd like help figuring out why my code does not work, or whether I'm missing something elsewhere that is needed to make Process work. I'm also open to suggestions for improving the code. Below is 'decSequenceToRNA' for reference:
def decSequenceToRNA(_idxSeq, _byteSequence, _rnaSequences):
    rnaSequence = ''
    printAndLog("!", "Processing block %d (%d bytes)" % (_idxSeq, len(_byteSequence)))
    for b in _byteSequence:
        rnaSequence = rnaSequence + base10ToRNA(ord(b))
    printAndLog("+", "Block %d completed. RNA of %d nucleotides generated." % (_idxSeq, len(rnaSequence)))
    _rnaSequences.append((_idxSeq, rnaSequence))

decSequenceToRNA is running in its own process, which means it gets its own, separate copy of every data structure in the main process. That means that when you append to _rnaSequences in decSequenceToRNA, it has no effect on rnaSequences in the parent process. That would explain why an empty list is being returned.
You have two options to address this. First, is to create a list that can be shared between processes using multiprocessing.Manager. For example:
import multiprocessing

def f(shared_list):
    shared_list.append(1)

if __name__ == "__main__":
    normal_list = []
    p = multiprocessing.Process(target=f, args=(normal_list,))
    p.start()
    p.join()
    print(normal_list)
    m = multiprocessing.Manager()
    shared_list = m.list()
    p = multiprocessing.Process(target=f, args=(shared_list,))
    p.start()
    p.join()
    print(shared_list)
Output:
[] # Normal list didn't work, the appended '1' didn't make it to the main process
[1] # multiprocessing.Manager() list works fine
Applying this to your code would just require replacing
rnaSequences = []
With
m = multiprocessing.Manager()
rnaSequences = m.list()
Alternatively, you could (and probably should) use a multiprocessing.Pool instead of creating an individual Process for each chunk. I'm not sure how large hFile is or how big the chunks you're reading are, but if there are more than multiprocessing.cpu_count() chunks, you're going to hurt performance by spawning a process for every chunk. Using a Pool, you can keep your process count constant and easily build your rnaSequences list:
def decSequenceToRNA(_idxSeq, _byteSequence):
    rnaSequence = ''
    printAndLog("!", "Processing block %d (%d bytes)" % (_idxSeq, len(_byteSequence)))
    for b in _byteSequence:
        rnaSequence = rnaSequence + base10ToRNA(ord(b))
    printAndLog("+", "Block %d completed. RNA of %d nucleotides generated." % (_idxSeq, len(rnaSequence)))
    return _idxSeq, rnaSequence

def fileToRNAString(_file):
    rnaSequences = []
    if (_file and os.path.isfile(_file)):
        blockCount = 0
        blockSize = 2048
        printAndLog("!", "Converting %s into RNA string (%d bytes/block)" % (_file, blockSize))
        results = []
        pool = multiprocessing.Pool()  # Creates a pool of cpu_count() processes
        with open(_file, "rb") as hFile:
            buf = hFile.read(blockSize)
            while buf:
                # args must be passed as a tuple
                result = pool.apply_async(decSequenceToRNA, args=(blockCount, buf))
                results.append(result)
                blockCount = blockCount + 1
                buf = hFile.read(blockSize)
        rnaSequences = [r.get() for r in results]
        pool.close()
        pool.join()
    else:
        printAndLog("-", "Could not find the specified file. Please verify that the file exists:" + _file)
    return rnaSequences
Note that we no longer pass the rnaSequences list to the child. Instead, we just return the result we would have appended back to the parent (which we can't do with Process), and build the list there.

Try writing this (comma at the end of the parameter list)
p = Process(target=decSequenceToRNA, args=(blockCount, buf, rnaSequences,))

Successive multiprocessing

I am filtering huge text files using the multiprocessing module. The code basically opens a text file, works on it, then closes it.
The thing is, I'd like to be able to launch it successively on multiple text files. So I tried to add a loop, but for some reason it doesn't work (while the code works on each file individually). I believe this is an issue with:
if __name__ == '__main__':
However, I am looking for something else. I tried to create a Launcher file and a LauncherCount file like this:
LauncherCount.py:
def setLauncherCount(n):
    global LauncherCount
    LauncherCount = n
and,
Launcher.py:
import os
import LauncherCount
LauncherCount.setLauncherCount(0)
os.system("OrientedFilterNoLoop.py")
LauncherCount.setLauncherCount(1)
os.system("OrientedFilterNoLoop.py")
...
I import LauncherCount.py, and use LauncherCount.LauncherCount as my loop index.
Of course, this doesn't work either, because it only edits the variable LauncherCount.LauncherCount locally, so the change is not visible in the imported version of LauncherCount.
Is there any way to globally edit a variable in an imported file? Or is there any other way to do this? What I need is to run the code multiple times, changing one value each time, apparently without using a loop.
Thanks!
Edit: Here is my main code if necessary. Sorry for the bad style ...
import multiprocessing
import config
import time
import LauncherCount

class Filter:
    """ Filtering methods """
    def __init__(self):
        print("launching methods")

    # Return the list: [Latitude,Longitude] (elements are floating point numbers)
    def LatLong(self,line):
        comaCount = []
        comaCount.append(line.find(','))
        comaCount.append(line.find(',',comaCount[0] + 1))
        comaCount.append(line.find(',',comaCount[1] + 1))
        Lat = line[comaCount[0] + 1 : comaCount[1]]
        Long = line[comaCount[1] + 1 : comaCount[2]]
        try:
            return [float(Lat) , float(Long)]
        except ValueError:
            return [0,0]

    # Return a boolean:
    # - True if the Lat/Long is within the Lat/Long rectangle defined by:
    #   tupleFilter = (minLat,maxLat,minLong,maxLong)
    # - False if not
    def LatLongFilter(self,LatLongList , tupleFilter) :
        if (tupleFilter[0] <= LatLongList[0] <= tupleFilter[1] and
                tupleFilter[2] <= LatLongList[1] <= tupleFilter[3]):
            return True
        else:
            return False

    def writeLine(self,key,line):
        filterDico[key][1].write(line)

def filteringProcess(dico):
    myFilter = Filter()
    while True:
        try:
            currentLine = readFile.readline()
        except ValueError:
            break
        if len(currentLine) ==0: # Breaks at the end of the file
            break
        if len(currentLine) < 35: # Deletes wrong lines (too short)
            continue
        LatLongList = myFilter.LatLong(currentLine)
        for key in dico:
            if myFilter.LatLongFilter(LatLongList,dico[key][0]):
                myFilter.writeLine(key,currentLine)

###########################################################################
# Main
###########################################################################
# Open read files:
readFile = open(config.readFileList[LauncherCount.LauncherCount][1], 'r')
# Generate writing files:
pathDico = {}
filterDico = config.filterDico
# Create outputs
for key in filterDico:
    output_Name = (config.readFileList[LauncherCount.LauncherCount][0][:-4]
                   + '_' + key +'.log')
    pathDico[output_Name] = config.writingFolder + output_Name
    filterDico[key] = [filterDico[key],open(pathDico[output_Name],'w')]
p = []
CPUCount = multiprocessing.cpu_count()
CPURange = range(CPUCount)
startingTime = time.localtime()

if __name__ == '__main__':
    ### Create and start processes:
    for i in CPURange:
        p.append(multiprocessing.Process(target = filteringProcess ,
                                         args = (filterDico,)))
        p[i].start()
    ### Kill processes:
    while True:
        if [p[i].is_alive() for i in CPURange] == [False for i in CPURange]:
            readFile.close()
            for key in config.filterDico:
                config.filterDico[key][1].close()
                print(key,"is Done!")
            endTime = time.localtime()
            break
    print("Process started at:",startingTime)
    print("And ended at:",endTime)

To process groups of files in sequence while working on the files within a group in parallel:
#!/usr/bin/env python
from multiprocessing import Pool

def work_on(args):
    """Process a single file."""
    i, filename = args
    print("working on %s" % (filename,))
    return i

def files():
    """Generate input filenames to work on."""
    #NOTE: you could read the file list from a file, get it using glob.glob, etc
    yield "inputfile1"
    yield "inputfile2"

def process_files(pool, filenames):
    """Process filenames using pool of processes.

    Wait for results.
    """
    for result in pool.imap_unordered(work_on, enumerate(filenames)):
        #NOTE: in general the files won't be processed in the original order
        print(result)

def main():
    p = Pool()
    # to do "successive" multiprocessing
    for filenames in [files(), ['other', 'bunch', 'of', 'files']]:
        process_files(p, filenames)

if __name__=="__main__":
    main()
Each process_files() call runs only after the previous one has completed, i.e., the files from different calls to process_files() are not processed in parallel.
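As a rough illustration of how the question's latitude/longitude filter might plug into a work_on-style function, here is a sketch under stated assumptions: lines are comma-separated with latitude and longitude in the second and third fields (as the question's LatLong method suggests), and the names parse_lat_long, in_rectangle, filter_lines, and the temporary sample file are hypothetical, not taken from the question or the answer.

```python
import os
import tempfile
from multiprocessing import Pool


def parse_lat_long(line):
    """Return (lat, long) from the 2nd and 3rd comma-separated fields, or None."""
    parts = line.split(",")
    try:
        return float(parts[1]), float(parts[2])
    except (IndexError, ValueError):
        return None


def in_rectangle(lat_long, rect):
    """rect = (minLat, maxLat, minLong, maxLong), as in the question."""
    min_lat, max_lat, min_long, max_long = rect
    lat, lon = lat_long
    return min_lat <= lat <= max_lat and min_long <= lon <= max_long


def filter_lines(lines, rect):
    """Keep only the lines whose coordinates fall inside rect."""
    kept = []
    for line in lines:
        lat_long = parse_lat_long(line)
        if lat_long is not None and in_rectangle(lat_long, rect):
            kept.append(line)
    return kept


def work_on(args):
    """Filter one file; return its name and the number of lines kept."""
    filename, rect = args
    with open(filename) as f:
        return filename, len(filter_lines(f, rect))


def main(filenames, rect):
    with Pool() as pool:
        jobs = [(name, rect) for name in filenames]
        for filename, kept in pool.imap_unordered(work_on, jobs):
            print(filename, kept)


if __name__ == "__main__":
    # Hypothetical sample input written to a temporary file.
    with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
        tmp.write("a,10.0,10.0,x\nb,50.0,10.0,x\n")
    try:
        main([tmp.name], (0.0, 20.0, 0.0, 20.0))
    finally:
        os.remove(tmp.name)
```

Passing the filename and the rectangle as arguments to the worker replaces the module-level LauncherCount index, so no global needs to be edited between runs.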
