Hardware: Raspberry Pi 4B (1GB) & Macbook Pro
OS: Raspbian & OSX
Python Version: 3.7.3
I'm having an issue with multiprocessing.Queue() skipping the first item placed in the queue. After some testing, I found that I can prevent this by adding any executable code (time.sleep(.0001), print(''), anything other than a comment) between successive q.put() calls. Without a delay between puts, q.get() always skips the first item and starts on the second; with a delay, it always returns the first item. Can someone explain what is happening and how to resolve it in a better way? Thanks in advance.
Here is a sample bit of code that shows the problem that I'm having*(see note).
import multiprocessing
import time

set_size = 3

def process_queueing():
    entry = 1
    data_list = []
    for i in range(1,100):
        data_list.append(i)
        if i % set_size == 0:
            data = [data_list, set_size, entry]
            q.put(data)
            #time.sleep(.001) #Uncomment to fix problem
            entry = entry + 1
            data_list.clear()

def process_data():
    while True:
        data = q.get()
        for i in data[0]:
            print('Entry: ' + str(data[2]) + ' Data: ' + str(i))

q = multiprocessing.Queue()
process = multiprocessing.Process(target=process_data, daemon=True)
process.start()
process_queueing()
*Note: In this example the data read from the queue is incomplete and incorrect: the full output is Entry: 1 Data: 4, Entry: 1 Data: 5, Entry: 1 Data: 6, instead of Entry: 1 Data: 1, Entry: 1 Data: 2, Entry: 1 Data: 3, and so on. When run on my Macbook Pro (Python 3.7.3, OSX 10.14.5), it doesn't output anything at all. Again, adding the additional code as a delay fixes all of these problems.
import multiprocessing
import time

set_size = 3

def process_queueing():
    entry = 1
    data_list = []
    for i in range(1,100):
        data_list.append(i)
        if i % set_size == 0:
            data = [list(data_list), set_size, entry]
            q.put(data)
            # time.sleep(.001) #Uncomment to fix problem
            # print(data)
            entry = entry + 1
            data_list.clear()

def process_data():
    while True:
        data = q.get()
        for i in data[0]:
            print('Entry: ' + str(data[2]) + ' Data: ' + str(i))

q = multiprocessing.Queue()
process = multiprocessing.Process(target=process_data, daemon=True)
process.start()
process_queueing()
OUTPUT
Entry: 1 Data: 1
Entry: 1 Data: 2
Entry: 1 Data: 3
Entry: 2 Data: 4
Entry: 2 Data: 5
Entry: 2 Data: 6
Entry: 3 Data: 7
Entry: 3 Data: 8
Entry: 3 Data: 9
Entry: 4 Data: 10
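I think I got it working by changing data_list to list(data_list), so that each q.put() receives its own copy of the list. I think what happens is that you keep reusing and clearing the same data_list object instead of making a new list every time. In CPython, q.put() hands the object to a background feeder thread that pickles it a little later, so data_list.clear() can run before the data has actually been serialized. You could also use something like Locks to avoid race conditions like the one you are facing.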
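The aliasing behind this can be shown without starting any processes. The sketch below uses the in-process queue.Queue as a stand-in, because it simply stores a reference to whatever is put on it, which makes the effect deterministic. With multiprocessing.Queue the outcome depends on whether the feeder thread has pickled the object yet, so treat this as an analogy rather than a reproduction of the original bug. The helper name drain is made up for the illustration.

```python
import queue

def drain(q):
    """Collect everything currently on the queue into a list."""
    items = []
    while not q.empty():
        items.append(q.get())
    return items

# Reusing one list: every queued entry refers to the same object,
# so clear() also wipes the data that was "already sent".
shared_q = queue.Queue()
data_list = []
for i in range(1, 7):
    data_list.append(i)
    if i % 3 == 0:
        shared_q.put(data_list)
        data_list.clear()
aliased = drain(shared_q)

# Snapshotting with list(...): each queued entry owns its own data.
copied_q = queue.Queue()
data_list = []
for i in range(1, 7):
    data_list.append(i)
    if i % 3 == 0:
        copied_q.put(list(data_list))
        data_list.clear()
copied = drain(copied_q)

print(aliased)  # [[], []]
print(copied)   # [[1, 2, 3], [4, 5, 6]]
```

Rebinding the name (data_list = []) instead of calling clear() has the same effect as copying, since the queued object is then never touched again.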
Changing data_list.clear() to data_list = [] seems to have solved the problem, since each queued item then refers to its own list object. I also opted to send the output through a second queue, because process_data() runs in a separate process and won't print to my main process's stdout when running in the IDLE shell on Windows (there are other solutions for that).
import multiprocessing
import time

set_size = 3

def process_queueing(q):
    entry = 1
    data_list = []
    for i in range(1,100):
        data_list.append(i)
        if i % set_size == 0:
            data = [data_list, set_size, entry]
            q.put(data)
            #time.sleep(.001) #Uncomment to fix problem
            entry = entry + 1
            ## data_list.clear()
            data_list = []
    return('Done')

def process_data(q,r):
    while True:
        data = q.get()
        for i in data[0]:
            r.put('Entry: ' + str(data[2]) + ' Data: ' + str(i))

if __name__ == '__main__':
    q = multiprocessing.Queue()
    r = multiprocessing.Queue()
    process = multiprocessing.Process(target=process_data,
                                      args=(q,r),
                                      daemon=True)
    process.start()
    print(process_queueing(q))
    print('foo')
    print(r.empty())
    #wait for process_data to put stuff on the queue
    while r.empty():
        pass
    while not r.empty():
        data = r.get()
        #hopefully print takes enough time for more things to get put on the queue
        print(data)
I don't believe the additional queue affects the outcome, although it does introduce a wait for the data to get pickled before it is put on the queue. Running py -m tmp from the PowerShell command prompt works fine without the additional queue:
tmp.py
import multiprocessing
from queue import Empty
import time

set_size = 3

def process_queueing(q):
    entry = 1
    data_list = []
    for i in range(1,100):
        data_list.append(i)
        if i % set_size == 0:
            data = [data_list, set_size, entry]
            q.put(data)
            #time.sleep(.001) #Uncomment to fix problem
            entry = entry + 1
            ## data_list.clear()
            data_list = []
    q.put('Done')
    return('Done')

def process_data(q,r):
    while True:
        try:
            data = q.get(timeout=1)
            if data == 'Done':
                print('donedone')
                break
            for i in data[0]:
                ## r.put('Entry: ' + str(data[2]) + ' Data: ' + str(i))
                print('foo Entry: ' + str(data[2]) + ' Data: ' + str(i))
        except Empty:
            break

if __name__ == '__main__':
    q = multiprocessing.Queue()
    r = multiprocessing.Queue()
    process = multiprocessing.Process(target=process_data,
                                      args=(q,r),
                                      daemon=True)
    process.start()
    print(process_queueing(q))
    while process.is_alive():
        pass
Related
I may be approaching this all wrong, but this is where I'm at. I have very large log files to search, up to 30 GB in some cases. I'm writing a script to pull info from them and have been experimenting with multiprocessing to speed it up a bit. Right now I'm testing two functions that run at the same time, one searching from the top of the file and one from the bottom, which seems to work. Is it possible to stop one function as soon as the other gets a result, so that if the top function finds a match they both stop? That way I can build it out as needed.
from file_read_backwards import FileReadBackwards
from multiprocessing import Process
import sys

z = "log.log"
#!/usr/bin/env python
rocket = 0

def top():
    target = "test"
    with open(z) as src:
        found= None
        for line in src:
            if len(line) == 0: break #happens at end of file, then stop loop
            if target in line:
                found= line
                break
    print(found)

def bottom():
    target = "text"
    with FileReadBackwards(z) as src:
        found= None
        for line in src:
            if len(line) == 0: break #happens at end of file, then stop loop
            if target in line:
                found= line
                break
    print(found)

if __name__=='__main__':
    p1 = Process(target = top)
    p1.start()
    p2 = Process(target = bottom)
    p2.start()
Here's a proof-of-concept of the approach I mentioned in the comments:
import os
import random
import sys
from multiprocessing import Process, Value

def search(proc_no, file_name, seek_to, max_size, find, flag):
    stop_at = seek_to + max_size
    with open(file_name) as f:
        if seek_to:
            f.seek(seek_to - 1)
            prev_char = f.read(1)
            if prev_char != '\n':
                # Landed in the middle of a line. Skip back one (or
                # maybe more) lines so this line isn't excluded. Start
                # by seeking back 256 bytes, then 512 if necessary, etc.
                exponent = 8
                pos = seek_to
                while pos >= seek_to:
                    pos = f.seek(max(0, pos - (2 ** exponent)))
                    f.readline()
                    pos = f.tell()
                    exponent += 1
        while True:
            if flag.value:
                break
            line = f.readline()
            if not line:
                break  # EOF
            data = line.strip()
            if data == find:
                flag.value = proc_no
                print(data)
                break
            if f.tell() > stop_at:
                break

if __name__ == '__main__':
    # list.txt contains lines with the numbers 1 to 1000001
    file_name = 'list.txt'
    info = os.stat(file_name)
    file_size = info.st_size
    if len(sys.argv) == 1:
        # Pick a random value from list.txt
        num_lines = 1000001
        choices = list(range(1, num_lines + 1))
        choices.append('XXX')
        find = str(random.choice(choices))
    else:
        find = sys.argv[1]
    num_procs = 4
    chunk_size, remainder = divmod(file_size, num_procs)
    max_size = chunk_size + remainder
    flag = Value('i', 0)
    procs = []
    print(f'Using {num_procs} processes to look for {find} in {file_name}')
    for i in range(num_procs):
        seek_to = i * chunk_size
        proc = Process(target=search, args=(i + 1, file_name, seek_to, max_size, find, flag))
        procs.append(proc)
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
    if flag.value:
        print(find, 'found by proc', flag.value)
    else:
        print(find, 'not found')
After reading various posts[1] about reading files with multiprocessing and multithreading, it seems that neither is a great approach due to potential disk thrashing and serialized reads. So here's a different, simpler approach that is way faster (at least for the file with a million lines I was trying it out on):
import mmap
import sys

def search_file(file_name, text, encoding='utf-8'):
    text = text.encode(encoding)
    # Open in binary mode; the mapping exposes the file's raw bytes.
    with open(file_name, 'rb') as f:
        # access=ACCESS_READ is the portable way to ask for a read-only
        # mapping (the flags/prot arguments exist on Unix only).
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            index = m.find(text)
            if index > -1:
                # Found a match; now find beginning of line that
                # contains match so we can grab the whole line.
                # rfind returns -1 if no newline precedes the match,
                # so adding 1 gives 0 (start of file) in that case.
                index = m.rfind(b'\n', 0, index) + 1
                m.seek(index)
                line = m.readline()
                return line.decode(encoding)

if __name__ == '__main__':
    file_name, search_string = sys.argv[1:]
    line = search_file(file_name, search_string)
    sys.stdout.write(line if line is not None else f'Not found in {file_name}: {search_string}\n')
I'm curious how this would perform with a 30GB log file.
[1] Including this one
Simple example using a multiprocessing.Pool and callback function.
Terminates remaining pool processes once a result has returned.
You could add an arbitrary number of processes to search from different offsets in the file using this approach.
import math
import time
from multiprocessing import Pool
from random import random

def search(pid, wait):
    """Sleep for wait seconds, return PID
    """
    time.sleep(wait)
    return pid

def done(result):
    """Do something with result and stop other processes
    """
    print("Process: %d done." % result)
    pool.terminate()
    print("Terminate Pool")

pool = Pool(2)
pool.apply_async(search, (1, math.ceil(random() * 3)), callback=done)
pool.apply_async(search, (2, math.ceil(random() * 3)), callback=done)
# do other stuff ...
# Wait for result
pool.close()
pool.join() # block our main thread
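This is essentially the same as Blurp's answer, but I shortened it and made it a bit more general. Left alone, top would loop forever, but bottom stops it almost immediately. Note that a plain global variable cannot carry the signal, because each process gets its own copy of the module globals; a multiprocessing.Event is shared between the processes.
from multiprocessing import Process, Event

def top(found):
    i = 0
    # Keep working until another process signals that the value was found
    while not found.is_set():
        i += 1

def bottom(found):
    # Pretend the value was found right away and tell the others to stop
    found.set()

if __name__ == '__main__':
    found = Event()
    p1 = Process(target=top, args=(found,))
    p2 = Process(target=bottom, args=(found,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()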
I'm trying to run my code with multiprocessing, but Mongo keeps returning
"MongoClient opened before fork. Create MongoClient with
connect=False, or create client after forking."
I really don't understand how I can adapt my code to this.
Basically the structure is:
db = MongoClient().database
db.authenticate('user', 'password', mechanism='SCRAM-SHA-1')
collectionW = db['words']
collectionT = db['sinMemo']
collectionL = db['sinLogic']
def findW(word):
    rows = collectionW.find({"word": word})
    ind = 0
    for row in rows:
        ind += 1
        id = row["_id"]
    if ind == 0:
        a = ind
    else:
        a = id
    return a

def trainAI(stri):
    ...
    if findW(word) == 0:
        _id = db['words'].insert(
            {"_id": getNextSequence(db.counters, "nodeid"), "word": word})
        story = _id
    else:
        story = findW(word)
    ...
def train(index):
# searching progress
progFile = "./train/progress{0}.txt".format(index)
trainFile = "./train/small_file_{0}".format(index)
if os.path.exists(progFile):
f = open(progFile, "r")
ind = f.read().strip()
if ind != "":
pprint(ind)
i = int(ind)
else:
pprint("No progress saved or progress lost!")
i = 0
f.close()
else:
i = 0
#get the number of line of the file
rangeC = rawbigcount(trainFile)
#fix unicode
non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)
files = io.open(trainFile, "r", encoding="utf8")
str1 = ""
str2 = ""
filex = open(progFile, "w")
with progressbar.ProgressBar(max_value=rangeC) as bar:
for line in files:
line = line.replace("\n", "")
if i % 2 == 0:
str1 = line.translate(non_bmp_map)
else:
str2 = line.translate(non_bmp_map)
bar.update(i)
trainAI(str1 + " " + str2)
filex.seek(0)
filex.truncate()
filex.write(str(i))
i += 1
#multiprocessing function
maxProcess = 3

def f(l, i):
    l.acquire()
    train(i + 1)
    l.release()

if __name__ == '__main__':
    lock = Lock()
    for num in range(maxProcess):
        pprint("start " + str(num))
        Process(target=f, args=(lock, num)).start()
This code is meant to read 4 different files in 4 different processes and insert the data into the database at the same time.
I copied only part of the code so that you can understand its structure.
I've tried adding connect=False to this code, but it made no difference:
db = MongoClient(connect=False).database
db.authenticate('user', 'password', mechanism='SCRAM-SHA-1')
collectionW = db['words']
collectionT = db['sinMemo']
collectionL = db['sinLogic']
Then I tried moving it into the f function (right before train()), but then the program doesn't find collectionW, collectionT and collectionL.
I'm not very experienced with Python or MongoDB, so I hope this is not a silly question.
The code is running under Ubuntu 16.04.2 with python 2.7.12
db.authenticate has to talk to the Mongo server, so it will try to open a connection. So even though connect=False is being used, db.authenticate requires a connection to be open.
Why don't you create the Mongo client instance after the fork? That looks like the easiest solution.
Since db.authenticate must open the MongoClient and connect to the server, it creates connections which won't work in the forked subprocess. Hence, the error message. Try this instead:
db = MongoClient('mongodb://user:password@localhost', connect=False).database
Also, delete the Lock l. If it is a multiprocessing.Lock, every subprocess holds it for its entire train() call, so the subprocesses run one at a time and you gain nothing from multiprocessing.
Here is how I did it for my problem:
import pathos.pools as pp
import time
import db_access

class MultiprocessingTest(object):
    def __init__(self):
        pass

    def test_mp(self):
        data = [[form,'form_number','client_id'] for form in range(5000)]
        pool = pp.ProcessPool(4)
        pool.map(db_access.insertData, data)

if __name__ == '__main__':
    time_i = time.time()
    mp = MultiprocessingTest()
    mp.test_mp()
    time_f = time.time()
    print 'Time Taken: ', time_f - time_i
Here is db_access.py:
from pymongo import MongoClient

def insertData(form):
    client = MongoClient()
    db = client['TEST_001']
    db.initialization.insert({
        "form": form[0],
        "form_number": form[1],
        "client_id": form[2]
    })
This happens in your code because you initialize MongoClient() once for all the subprocesses, and MongoClient is not fork-safe. Initializing it inside each function works. Let me know if there are other solutions.
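One possible refinement, offered only as a sketch: instead of opening a new client on every call, each process can create its client lazily the first time it needs one and then reuse it. The code below shows just that caching pattern, keyed by process id so that a forked child never reuses its parent's client. The names get_client and factory are made up for this illustration; in real code the factory would be something like lambda: MongoClient(), which is not imported here so the example runs without a database.

```python
import os

# Maps process id -> the client object created inside that process.
_clients = {}

def get_client(factory):
    """Return this process's client, creating it on first use.

    factory is any zero-argument callable that builds a client. It is
    only called once per process, and only after that process exists,
    so nothing is opened before a fork.
    """
    pid = os.getpid()
    if pid not in _clients:
        _clients[pid] = factory()
    return _clients[pid]

# Demonstration with a stand-in factory that counts how often it runs.
created = []

def fake_factory():
    created.append(os.getpid())
    return object()

first = get_client(fake_factory)
second = get_client(fake_factory)
print(first is second, len(created))  # True 1
```

A worker such as insertData could then call get_client(...) at the top of the function and pay the connection cost once per process rather than once per document.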
I'm using processes and a queue to search through data and find the rows that have the same entry in different columns. I decided to use multiprocessing so that it can be scaled to large data. The file has 1000 lines and 10 data points per line. If I read in only 80 lines of the data, the program stalls. With 70 lines it works fine, and at a decent speed too.
My question is: what am I doing wrong, or are there limitations to this approach that I haven't identified? The code isn't perfect by any means and is probably bad in itself. The code is as follows:
from multiprocessing import Process, Queue
import random

def openFile(file_name, k, division):
    i = 0
    dataSet = []
    with open(file_name) as f:
        for line in f:
            stripLine = line.strip('\n')
            splitLine = stripLine.split(division)
            dataSet += [splitLine]
            i += 1
            if(i == k):
                break
    return(dataSet)

def setCombination(q,data1,data2):
    newData = []
    for i in range(0,len(data1)):
        for j in range(0, len(data2)):
            if(data1[i][1] == data2[j][3]):
                newData += data2[j]
    q.put(newData)

if __name__ == '__main__':
    # Takes in the file, the length of the data to read in, and how the data is divided.
    data = openFile('testing.txt', 80, ' ')
    for i in range(len(data)):
        for j in range(len(data[i])):
            try:
                data[i][j] = float(data[i][j])
            except ValueError:
                pass
    #print(data)
    k = len(data)//10
    q = Queue()
    processes = [Process(target=setCombination, args=(q, data[k*x: k + k*x], data))
                 for x in range(10)]
    for p in processes:
        p.start()
    # Exit the completed processes
    for p in processes:
        p.join()
    saleSet = [q.get() for p in processes]
    print('\n', saleSet)
The data file testing.txt
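It appears that something your code does is causing a deadlock. While experimenting, I noticed that 3 out of the 10 tasks would never terminate. The likely reason is the one described in the multiprocessing documentation under "Joining processes that use queues": a process that has put items on a queue will not exit until its buffered data has been flushed to the underlying pipe, so joining it before the items have been consumed can block forever once the pipe is full.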
The good news is it's easy to fix by just removing or disabling the
    # Exit the completed processes
    for p in processes:
        p.join()
loop you have in your code.
Here's a complete version of your code with (mostly) just that modification in it:
from multiprocessing import Process, Queue

def openFile(file_name, k, division):
    i = 0
    dataSet = []
    with open(file_name) as f:
        for line in f:
            stripLine = line.strip('\n')
            splitLine = stripLine.split(division)
            dataSet += [splitLine]
            i += 1
            if i == k:
                break
    return dataSet

def setCombination(q, data1, data2):
    newData = []
    for i in range(len(data1)):
        for j in range(len(data2)):
            if data1[i][1] == data2[j][3]:
                newData += data2[j]
    q.put(newData)

if __name__ == '__main__':
    # Takes in the file, the length of the data to read in, and how the data is divided.
    data = openFile('testing.txt', 80, ' ')
    for i in range(len(data)):
        for j in range(len(data[i])):
            try:
                data[i][j] = float(data[i][j])
            except ValueError:
                pass
    k = len(data) // 10
    q = Queue()
    processes = [Process(target=setCombination, args=(q, data[k*x: k*x+k], data))
                 for x in range(10)]
    for p in processes:
        p.start()
    # NO LONGER USED (HANGS)
    # # Exit the completed processes
    # for p in processes:
    #     p.join()
    # note: this works since by default, get() will block until it can retrieve something
    saleSet = [q.get() for _ in processes]  # a queue item should be added by each Process
    print('\n', saleSet)
What I need to do is refactor a script into a method that can be added to a larger process.
STEP A
I'm working on a data processing task whose first step is to create data of the following form:
3|Victoria|[51.503378, -0.139134]|2673
52|Cubitt Town|[51.505199, -0.018848]|23
5|United Kingdom|[54.75844, -2.69531]|459
6|London|[51.50853, -0.12574]|346
296|Bucharest|[44.43225, 26.10626]|9
352|Vilich-Müldorf|[50.75024, 7.15283]|4
48|Gut Scheibenhardt|[49.001249, 8.412378]|3
314|Westerham|[48.0601, 11.62219]|9
45|Honartsdeich|[53.557429, 9.987297]|34
9779|Martinsried|[48.137418, 11.555737]|11
343|Brussels|[50.85045, 4.34878]|27
563|Russell Square|[51.519403, -0.133906]|20
2|Germany|[51.5, 10.5]|20
11|Farringdon|[51.51807, -0.10852]|154
609|Fröttmaning|[48.16652, 11.59038]|3
STEP B
So then Step B will operate on the result of the above step (described by PROCESS below) and it will output data like so (the final data representation).
This means that PROCESS creates the data displayed in STEP A.
The final data representation looks like this:
code: GB-ENG, jobs: 2673
code: GB-ENG, jobs: 23
code: GB-ENG, jobs: 459
code: GB-ENG, jobs: 346
code: RO-B, jobs: 9
code: DE-NW, jobs: 4
code: DE-BW, jobs: 3
code: DE-BY, jobs: 9
code: DE-HH, jobs: 34
code: DE-BY, jobs: 11
code: BE-BRU, jobs: 27
code: GB-ENG, jobs: 20
code: DE-TH, jobs: 20
code: GB-ENG, jobs: 154
code: DE-BY, jobs: 3
The script that performs that magic (it creates the final output data displayed above) looks like this:
import json
import requests
from collections import defaultdict
from pprint import pprint

def hasNumbers(inputString):
    return any(char.isdigit() for char in inputString)

# open up the output of 'data-processing.py'
with open('job-numbers-by-location.txt') as data_file:
    # print the output to a file
    with open('phase_ii_output.txt', 'w') as output_file_:
        for line in data_file:
            identifier, name, coords, number_of_jobs = line.split("|")
            coords = coords[1:-1]
            lat, lng = coords.split(",")
            # print("lat: " + lat, "lng: " + lng)
            response = requests.get("http://api.geonames.org/countrySubdivisionJSON?lat="+lat+"&lng="+lng+"&username=s.matthew.english").json()
            codes = response.get('codes', [])
            for code in codes:
                if code.get('type') == 'ISO3166-2':
                    country_code = '{}-{}'.format(response.get('countryCode', 'UNKNOWN'), code.get('code', 'UNKNOWN'))
                    if not hasNumbers( country_code ):
                        # print("code: " + country_code + ", jobs: " + number_of_jobs)
                        output_file_.write("code: " + country_code + ", jobs: " + number_of_jobs)
        output_file_.close()
But what I need to do is make it into a method, so I can fuse it into this process:
PROCESS
import json
from collections import defaultdict
from pprint import pprint

def process_locations_data(data):
    # processes the 'data' block
    locations = defaultdict(int)
    for item in data['data']:
        location = item['relationships']['location']['data']['id']
        locations[location] += 1
    return locations

def process_locations_included(data):
    # processes the 'included' block
    return_list = []
    for record in data['included']:
        id = record.get('id', None)
        name = record.get('attributes', {}).get('name', None)
        coord = record.get('attributes', {}).get('coord', None)
        return_list.append((id, name, coord))
    return return_list  # return list of tuples

# load the data from file once
with open('data-science.txt') as data_file:
    data = json.load(data_file)

# use the two functions on same data
locations = process_locations_data(data)
records = process_locations_included(data)

# output list to collect lines
output = []

# combine the data for printing
for record in records:
    id, name, coord = record
    references = locations[id]  # lookup the references in the dict
    line = str(id) + "|" + str(name) + "|" + str(coord) + "|" + str(references) + "\n"
    if line not in output:
        output.append(line)

# print the output to a file
with open('job-numbers-by-location.txt', 'w') as file_:
    for l in output:
        if not ("None") in l:
            if not ("[0, 0]") in l:
                file_.write(l)
    file_.close()
But I keep failing in my efforts to turn that script into a method within it.
How can I refactor that script into an independent method that can be included in the process?
All of the data related to this task can be found here on my GitHub page
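One way the Step B script might be split into functions is sketched below. It is only an illustration under some assumptions: the function names (parse_location_line, extract_region_code, summarize_locations) are invented here, the HTTP call is passed in as a lookup callable so the logic can be exercised without the network, and only the first matching ISO3166-2 code per response is kept, whereas the script above writes one line per matching code.

```python
def has_numbers(text):
    """True if any character in text is a digit."""
    return any(ch.isdigit() for ch in text)

def parse_location_line(line):
    """Split 'id|name|[lat, lng]|jobs' into (lat, lng, jobs) strings."""
    identifier, name, coords, jobs = line.rstrip("\n").split("|")
    lat, lng = (part.strip() for part in coords[1:-1].split(","))
    return lat, lng, jobs

def extract_region_code(response):
    """Return 'COUNTRY-REGION' from a response dict, or None if absent."""
    for code in response.get("codes", []):
        if code.get("type") == "ISO3166-2":
            region = "{}-{}".format(response.get("countryCode", "UNKNOWN"),
                                    code.get("code", "UNKNOWN"))
            if not has_numbers(region):
                return region
    return None

def summarize_locations(lines, lookup):
    """Turn Step A lines into Step B lines.

    lookup(lat, lng) must return the decoded JSON dict for that point;
    in the real pipeline it would wrap the requests.get(...).json() call.
    """
    result = []
    for line in lines:
        lat, lng, jobs = parse_location_line(line)
        region = extract_region_code(lookup(lat, lng))
        if region is not None:
            result.append("code: {}, jobs: {}".format(region, jobs))
    return result

# Stand-in for the web service, returning a fixed response.
def fake_lookup(lat, lng):
    return {"countryCode": "GB",
            "codes": [{"type": "ISO3166-2", "code": "ENG"}]}

step_a = ["3|Victoria|[51.503378, -0.139134]|2673\n"]
print(summarize_locations(step_a, fake_lookup))
```

With this shape, PROCESS could pass its output list straight to summarize_locations instead of writing job-numbers-by-location.txt and reading it back.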
I have some Python code that reads a file and pushes the data into a list. I then put this list on a queue and use threading to process it, say 20 items at a time. After processing, I save the results to a new file. The new file ends up in a different order than the original file. For example, given this input:
1 a
2 b
3 c
4 a
5 d
But the output looks like:
2 aa
1 ba
4 aa
5 da
3 ca
Is there any way to preserve the original order?
Here is my code:
import threading,Queue,time,sys

class eSS(threading.Thread):
    def __init__(self,queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.lock = threading.Lock()

    def ess(self,email,code,suggested,comment,reason,dlx_score):
        #do something

    def run(self):
        while True:
            info = self.queue.get()
            infolist = info.split('\t')
            email = infolist[1]
            code = infolist[2]
            suggested = infolist[3]
            comment = infolist[4]
            reason = infolist[5]
            dlx_score = (0 if infolist[6] == 'NULL' else int(infolist[6]))
            g.write(info + '\t' + self.ess(email,code,suggested,comment,reason,dlx_score) +'\r\n')
            self.queue.task_done()

if __name__ == "__main__":
    queue = Queue.Queue()
    filename = sys.argv[1]
    #Define number of threads
    threads = 20
    f = open(filename,'r')
    g = open(filename+'.eSS','w')
    lines = f.read().splitlines()
    f.close()
    start = time.time()
    for i in range(threads):
        t = eSS(queue)
        t.setDaemon(True)
        t.start()
    for line in lines:
        queue.put(line)
    queue.join()
    print time.time()-start
    g.close()
Three thoughts come to mind. Common to all is to include an index with the packet that is queued for processing.
One thought then is to use the controller/workers/output framework in which the output thread de-queues the worker-processed data, assembles, and outputs it.
The second thought is to employ a memory-mapped file for output, and use the index to calculate the offset to write into the file (assumes fixed-length writes probably).
The third is to use the index to put processed data in a new list, and when the list is completed write the items out at the end rather than on the fly.
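The third idea might look roughly like the sketch below, written for Python 3 (so queue rather than Queue). The function name process_in_order and the worker callable are placeholders for this illustration; worker stands in for whatever the ess method does and is assumed not to raise.

```python
import queue
import threading

def process_in_order(lines, worker, num_threads=4):
    """Run worker(line) on several threads, returning results in input order."""
    tasks = queue.Queue()
    # One slot per input line; each thread writes only to its item's slot.
    results = [None] * len(lines)

    def run():
        while True:
            item = tasks.get()
            if item is None:  # sentinel: no more work for this thread
                break
            index, line = item
            results[index] = worker(line)

    threads = [threading.Thread(target=run) for _ in range(num_threads)]
    for t in threads:
        t.start()
    # Queue each line together with its position in the input.
    for pair in enumerate(lines):
        tasks.put(pair)
    # One sentinel per thread so every thread eventually stops.
    for _ in threads:
        tasks.put(None)
    for t in threads:
        t.join()
    return results

lines = ["a", "b", "c", "a", "d"]
print(process_in_order(lines, lambda s: s + "a"))
# ['aa', 'ba', 'ca', 'aa', 'da']
```

The output file is then written in a single pass once everything has finished, which also removes the need for several threads to write to the same file handle.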