I'm trying to do some text processing on around 200,000 entries in a SQlite database which I'm accessing using SQLAlchemy. I'd like to parallelize it (I'm looking at Parallel Python), but I'm not sure how exactly to do it.
I want to commit the session each time an entry is processed, so that if I need to stop the script I won't lose the work it's already done. However, when I try to pass the session.commit() command to the callback function, it does not seem to work.
from assignDB import *
from sqlalchemy.orm import sessionmaker
import pp, sys, fuzzy_substring
def matchIng(rawIng, ingreds):
maxScore = 0
choice = ""
for (ingred, parentIng) in ingreds.iteritems():
score = len(ingred)/(fuzzy_substring(ingred,rawIng)+1)
if score > maxScore:
maxScore = score
choice = ingred
refIng = parentIng
return (refIng, choice, maxScore)
def callbackFunc(match, session, inputTuple):
print inputTuple
match.refIng_id = inputTuple[0]
match.refIng_name = inputTuple[1]
match.matchScore = inputTuple[2]
session.commit()
# tuple of all parallel python servers to connect with
ppservers = ()
#ppservers = ("10.0.0.1",)
if len(sys.argv) > 1:
ncpus = int(sys.argv[1])
# Creates jobserver with ncpus workers
job_server = pp.Server(ncpus, ppservers=ppservers)
else:
# Creates jobserver with automatically detected number of workers
job_server = pp.Server(ppservers=ppservers)
print "Starting pp with", job_server.get_ncpus(), "workers"
ingreds = {}
for synonym, parentIng in session.query(IngSyn.synonym, IngSyn.parentIng):
ingreds[synonym] = parentIng
jobs = []
for match in session.query(Ingredient).filter(Ingredient.refIng_id == None):
rawIng = match.ingredient
jobs.append((match, job_server.submit(matchIng,(rawIng,ingreds), (fuzzy_substring,),callback=callbackFunc,callbackargs=(match,session))))
The session is imported from assignDB. I'm not getting any error, it's just not updating the database.
Thanks for your help.
UPDATE
Here is the code for fuzzy_substring
def fuzzy_substring(needle, haystack):
"""Calculates the fuzzy match of needle in haystack,
using a modified version of the Levenshtein distance
algorithm.
The function is modified from the levenshtein function
in the bktree module by Adam Hupp"""
m, n = len(needle), len(haystack)
# base cases
if m == 1:
return not needle in haystack
if not n:
return m
row1 = [0] * (n+1)
for i in range(0,m):
row2 = [i+1]
for j in range(0,n):
cost = ( needle[i] != haystack[j] )
row2.append( min(row1[j+1]+1, # deletion
row2[j]+1, #insertion
row1[j]+cost) #substitution
)
row1 = row2
return min(row1)
which I got from here: Fuzzy Substring. In my case, "needle" is one of ~8000 possible choices, while haystack is the raw string I'm trying to match. I loop over all possible "needles" and choose the one with the best score.
Without looking at your specific code, it can be fairly said that:
Using serverless SQLite and
Seeking increased write performance through paralleism
are mutually incompatible desires. Quoth the SQLite FAQ:
… However, client/server database engines (such as PostgreSQL, MySQL,
or Oracle) usually support a higher level of concurrency and allow
multiple processes to be writing to the same database at the same
time. This is possible in a client/server database because there is
always a single well-controlled server process available to coordinate
access. If your application has a need for a lot of concurrency, then
you should consider using a client/server database. But experience
suggests that most applications need much less concurrency than their
designers imagine. …
And that's even without whatever gating and ordering SQLAlchemy uses. It is also not clear at all when — if at all — the Parallel Python jobs are completing.
My suggestion: get it working correctly first and then look for optimizations. Especially when the pp secret sauce might not be buying you much at all even if it was working perfectly.
added in response to comment:
If fuzzy_substring matching is the bottleneck it appears completely decoupled from the database access and you should keep that in mind. Without seeing what fuzzy_substring is doing, a good starting assumption is that you can make algorithmic improvements which may make the single-threaded programming computationally feasible. Approximate string matching is a very well studied problem and choosing the right algorithm is often far better than "throw more processors at it".
Far better in this sense is that you have cleaner code, don't waste the overhead of segmenting and reassembling the problem, have a more extensible and debuggable program at the end.
#msw has provided an excellent overview of the problem, giving a general way to think about parallelization.
Notwithstanding these comments, here is what I got to work in the end:
from assignDB import *
from sqlalchemy.orm import sessionmaker
import pp, sys, fuzzy_substring
def matchIng(rawIng, ingreds):
maxScore = 0
choice = ""
for (ingred, parentIng) in ingreds.iteritems():
score = len(ingred)/(fuzzy_substring(ingred,rawIng)+1)
if score > maxScore:
maxScore = score
choice = ingred
refIng = parentIng
return (refIng, choice, maxScore)
# tuple of all parallel python servers to connect with
ppservers = ()
#ppservers = ("10.0.0.1",)
if len(sys.argv) > 1:
ncpus = int(sys.argv[1])
# Creates jobserver with ncpus workers
job_server = pp.Server(ncpus, ppservers=ppservers)
else:
# Creates jobserver with automatically detected number of workers
job_server = pp.Server(ppservers=ppservers)
print "Starting pp with", job_server.get_ncpus(), "workers"
ingreds = {}
for synonym, parentIng in session.query(IngSyn.synonym, IngSyn.parentIng):
ingreds[synonym] = parentIng
rawIngredients = session.query(Ingredient).filter(Ingredient.refIng_id == None)
numIngredients = session.query(Ingredient).filter(Ingredient.refIng_id == None).count()
stepSize = 30
for i in range(0, numIngredients, stepSize):
print i
print numIngredients
if i + stepSize > numIngredients:
stop = numIngredients
else:
stop = i + stepSize
jobs = []
for match in rawIngredients[i:stop]:
rawIng = match.ingredient
jobs.append((match, job_server.submit(matchIng,(rawIng,ingreds), (fuzzy_substring,))))
job_server.wait()
for match, job in jobs:
inputTuple = job()
print match.ingredient
print inputTuple
match.refIng_id = inputTuple[0]
match.refIng_name = inputTuple[1]
match.matchScore = inputTuple[2]
session.commit()
Essentially, I've chopped the problem into chunks. After matching 30 substrings in parallel, the results are returned and committed to the database. I chose 30 somewhat arbitrarily, so there might be gains to be had in optimizing that number. It seems to have sped up a fair bit, as I'm using all 3(!) of the cores in my processor now.
Related
I'm working on an optimization problem, and you can see a simplified version of my code posted below (the origin code is too complicated for asking such a question, and I hope my simplified code has simulated the original one as much as possible).
My purpose:
use the function foo in the function optimization, but foo can take very long time due to some hard situations. So I use multiprocessing to set a time limit for execution of the function (proc.join(iter_time), the method is from an anwser from this question; How to limit execution time of a function call?).
My problem:
In the while loop, every time the generated values for extra are the same.
The list lst's length is always 1, which means in every iteration in the while loop it starts from an empty list.
My guess: possible reason can be each time I create a process the random seed is counting from the beginning, and each time the process is terminated, there could be some garbage collection mechanism to clean the memory the processused, so the list is cleared.
My question
Anyone know the real reason of such problems?
if not using multiprocessing, is there anyway else that I can realize my purpose while generate different random numbers? btw I have tried func_timeout but it has other problems that I cannot handle...
random.seed(123)
lst = [] # a global list for logging data
def foo(epoch):
...
extra = random.random()
lst.append(epoch + extra)
...
def optimization(loop_time, iter_time):
start = time.time()
epoch = 0
while time.time() <= start + loop_time:
proc = multiprocessing.Process(target=foo, args=(epoch,))
proc.start()
proc.join(iter_time)
if proc.is_alive(): # if the process is not terminated within time limit
print("Time out!")
proc.terminate()
if __name__ == '__main__':
optimization(300, 2)
You need to use shared memory if you want to share variables across processes. This is because child processes do not share their memory space with the parent. Simplest way to do this here would be to use managed lists and delete the line where you set a number seed. This is what is causing same number to be generated because all child processes will take the same seed to generate the random numbers. To get different random numbers either don't set a seed, or pass a different seed to each process:
import time, random
from multiprocessing import Manager, Process
def foo(epoch, lst):
extra = random.random()
lst.append(epoch + extra)
def optimization(loop_time, iter_time, lst):
start = time.time()
epoch = 0
while time.time() <= start + loop_time:
proc = Process(target=foo, args=(epoch, lst))
proc.start()
proc.join(iter_time)
if proc.is_alive(): # if the process is not terminated within time limit
print("Time out!")
proc.terminate()
print(lst)
if __name__ == '__main__':
manager = Manager()
lst = manager.list()
optimization(10, 2, lst)
Output
[0.2035898948744943, 0.07617925389396074, 0.6416754412198231, 0.6712193790613651, 0.419777147554235, 0.732982735576982, 0.7137712131028766, 0.22875414425414997, 0.3181113880578589, 0.5613367673646847, 0.8699685474084119, 0.9005359611195111, 0.23695341111251134, 0.05994288664062197, 0.2306562314450149, 0.15575356275408125, 0.07435292814989103, 0.8542361251850187, 0.13139055891993145, 0.5015152768477814, 0.19864873743952582, 0.2313646288041601, 0.28992667535697736, 0.6265055915510219, 0.7265797043535446, 0.9202923318284002, 0.6321511834038631, 0.6728367262605407, 0.6586979597202935, 0.1309226720786667, 0.563889613032526, 0.389358766191921, 0.37260564565714316, 0.24684684162272597, 0.5982042933298861, 0.896663326233504, 0.7884030244369596, 0.6202229004466849, 0.4417549843477827, 0.37304274232635715, 0.5442716244427301, 0.9915536257041505, 0.46278512685707873, 0.4868394190894778, 0.2133187095154937]
Keep in mind that using managers will affect performance of your code. Alternate to this, you could also use multiprocessing.Array, which is faster than managers but is less flexible in what data it can store, or Queues as well.
I am trying to do a word counter with mapreduce using concurrent.futures, previously I've done a multi threading version, but was so slow because is CPU bound.
I have done the mapping part to divide the words into ['word1',1], ['word2,1], ['word1,1], ['word3',1] and between the processes, so each process will take care of a part of the text file. The next step ("shuffling") is to put these words in a dictionary so that it looks like this: word1: [1,1], word2:[1], word3: [1], but I cannot share the dictionary between the processes because we are using multiprocessing instead of multithreading, so how can I make each process add the "1" to the dictionary shared between all the processes? I'm stuck with this, and I can't continue.
I am at this point:
import sys
import re
import concurrent.futures
import time
# Read text file
def input(index):
try:
reader = open(sys.argv[index], "r", encoding="utf8")
except OSError:
print("Error")
sys.exit()
texto = reader.read()
reader.close()
return texto
# Convert text to list of words
def splitting(input_text):
input_text = input_text.lower()
input_text = re.sub('[,.;:!¡?¿()]+', '', input_text)
words = input_text.split()
n_processes = 4
# Creating processes
with concurrent.futures.ProcessPoolExecutor() as executor:
results = []
for id_process in range(n_processes):
results.append(executor.submit(mapping, words, n_processes, id_process))
for f in concurrent.futures.as_completed(results):
print(f.result())
def mapping(words, n_processes, id_process):
word_map_result = []
for i in range(int((id_process / n_processes) * len(words)),
int(((id_process + 1) / n_processes) * len(words))):
word_map_result.append([words[i], 1])
return word_map_result
if __name__ == '__main__':
if len(sys.argv) == 1:
print("Please, specify a text file...")
sys.exit()
start_time = time.time()
for index in range(1, len(sys.argv)):
print(sys.argv[index], ":", sep="")
text = input(index)
splitting(text)
# for word in result_dictionary_words:
# print(word, ':', result_dictionary_words[word])
print("--- %s seconds ---" % (time.time() - start_time))
I've seen that when doing concurrent programming it is usually best to avoid using shared state as far as possible, so how I can implement Map reduce word count without share the dictionary between processes?
You can create a shared dictionary using a Manager from multiprocessing. I understand from your program that it is your word_map_result you need to share.
You could try something like this
from multiprocessing import Manager
...
def splitting():
...
word_map_result = Manager().dict()
with concurrent.futures.....:
...
results.append(executor.submit(mapping, words, n_processes, id_process, word_map_result)
...
...
def mapping(words, n_processes, id_process, word_map_result):
for ...
# Do not return anything - word_map_result is up to date in your main process
Basically you will remove the local copy of word_map_result from your mapping function and pass it the Manager instance as a parameter. This word_map_result is now shared between all your subprocesses and the main program. Managers add data transfer overhead, though, so this might not help you very much.
In this case you do not return anything from the workers so you do not need the for loop to process results either in your main program - your word_map_result is identical in all subprocesses and the main program.
I may have misunderstood your problem and I am not familiar with the algorithm if it is possible to re-engineer that to work so that you don't need to share anything between processes.
It seems like a misconception to be using multiprocessing at all. First, there is overhead in creating the pool and overhead in passing data to and from the processes. And if you decide to use a shared, managed dictionary that worker function mapping can use to store its results in, know that a managed dictionary uses a proxy, the accessing of which is rather slow. The alternative to using a managed dictionary would be as you currently have it, i.e. mapping returns a list and the main process uses those results to create the keys and values of the dictionary. But what then is the point of mapping returning a list where each element is always a list of two elements where the second element is always the constant value 1? Isn't that rather wasteful of time and space?
I think your performance will be no faster (probably slower) than just implementing splitting as:
# Convert text to list of words
def splitting(input_text):
input_text = input_text.lower()
input_text = re.sub('[,.;:!¡?¿()]+', '', input_text)
words = input_text.split()
results = {}
for word in words:
results[word] = [1]
return results
I'm trying to learn something a little new in each mini-project I do. I've made a Game of Life( https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life ) program.
This involves a numpy array where each point in the array (a "cell") has an integer value. To evolve the state of the game, you have to compute for each cell the sum of all its neighbour values (8 neighbours).
The relevant class in my code is as follows, where evolve() takes in one of the xxx_method methods. It works fine for conv_method and loop_method, but I want to use multiprocessing (which I've identified should work, unlike multithreading?) on loop_method to see any performance increases. I feel it should work as each calculation is independent. I've tried a naive approach, but don't really understand the multiprocessing module well enough. Could I also use it within the evolve() method, as again I feel that each calculation within the double for loops are independent.
Any help appreciated, including general code comments.
Edit - I'm getting a RuntimeError, which I'm half-expecting as my understanding of multiprocessing isnt good enough. What needs to be done to the code to get it work?
class GoL:
""" Game Engine """
def __init__(self, size):
self.size = size
self.grid = Grid(size) # Grid is another class ive defined
def evolve(self, neigbour_sum_func):
new_grid = np.zeros_like(self.grid.cells) # start with everything dead, only need to test for keeping/turning alive
neighbour_sum_array = neigbour_sum_func()
for i in range(self.size):
for j in range(self.size):
cell_sum = neighbour_sum_array[i,j]
if self.grid.cells[i,j]: # already alive
if cell_sum == 2 or cell_sum == 3:
new_grid[i,j] = 1
else: # test for dead coming alive
if cell_sum == 3:
new_grid[i,j] = 1
self.grid.cells = new_grid
def conv_method(self):
""" Uses 2D convolution across the entire grid to work out the neighbour sum at each cell """
kernel = np.array([
[1,1,1],
[1,0,1],
[1,1,1]],
dtype=int)
neighbour_sum_grid = correlate2d(self.grid.cells, kernel, mode='same')
return neighbour_sum_grid
def loop_method(self, partition=None):
""" Also works out neighbour sum for each cell, using a more naive loop method """
if partition is None:
cells = self.grid.cells # no multithreading, just work on entire grid
else:
cells = partition # just work on a set section of the grid
neighbour_sum_grid = np.zeros_like(cells) # copy
for i, row in enumerate(cells):
for j, cell_val in enumerate(row):
neighbours = cells[i-1:i+2, j-1:j+2]
neighbour_sum = np.sum(neighbours) - cell_val
neighbour_sum_grid[i,j] = neighbour_sum
return neighbour_sum_grid
def multi_loop_method(self):
cores = cpu_count()
procs = []
slices = []
if cores == 2: # for my VM, need to impliment generalised method for more cores
half_grid_point = int(SQUARES / 2)
slices.append(self.grid.cells[0:half_grid_point])
slices.append(self.grid.cells[half_grid_point:])
else:
Exception
for sl in slices:
proc = Process(target=self.loop_method, args=(sl,))
proc.start()
procs.append(proc)
for proc in procs:
proc.join()
I want to use multiprocessing (which I've identified should work, unlike multithreading?)
Multithreading would not work because it would run on a single processor which is your current bottleneck. Multithreading is for things where you are awaiting for an API to answer. In that meantime you can do other calculations. But in Conway's Game of Life your program is constantly running.
Getting multiprocessing right is hard. If you have 4 processors you can define a quadrant for each of your processor. But you need to share the result between your processors. And with this you are getting a performance hit. They need to be synchronized/running on the same clock speed/have the same tick rate for updating and the result needs to be shared.
Multiprocessing starts being feasible when your grid is very big/there is a ton to calculate.
Since the question is very broad and complicated I cannot give you a better answer. There is a paper on getting parallel processing on Conway's Game of Life: http://www.shodor.org/media/content/petascale/materials/UPModules/GameOfLife/Life_Module_Document_pdf.pdf
I try to make an python application which is responsible for reading and writting with plc through OPC.
The Problem which I am facing right now, I have to read some data from plc and according to that I have to write some data.
This is continuous process. So, I am using multithreading concept to handle this.
For an example:-
def callallsov(self):
while True:
self.allsov1sobjects.setup(self.sovelementlist)
def callallmotor(self):
while True:
self.allmotor1dobjects.setup(self.motorelementlist)
def callallanalog(self):
while True:
self.allanalogobjects.setup(self.analogelementlist)
self.writsovthread = threading.Thread(target=callallsov)
self.writemotorthread = threading.Thread(target=callallmotor)
self.writemotorthread = threading.Thread(target=callallanalog)
self.writsovthread.start()
self.writemotorthread.start()
self.writeanalogthread.start()
calling structure:
def setup(self,elementlist):
try:
n =0
self.listofmotor1D.clear()
while n< len(self.df.index):
self.df.iloc[n, 0] = Fn_Motor1D(self.com, self.df, elementlist, n)
self.listofmotor1D.append(self.df.iloc[n,0])
n = n + 1
read and write plc tag:
if tagname == self.cmd:
if tagvalue == self.gen.readgeneral.readnodevalue(self.cmd):
# sleep(self.delaytime)
self.gen.writegeneral.writenodevalue(self.runingFB, 1)
else:
self.gen.writegeneral.writenodevalue(self.runingFB, 0)
break
now each elementlist have 100 device. So I make type of each element then create runtime objects for each element.
like in case of motor, if motor command is present we need to high it's runfb.
I have to follow the same process for 100 motors device.
Thus why I use while true here to check continuously data from plc.
Due to 100 motors (large amout of data) it takes long time to write in plc.plc scan cycle is very fast.
so my fist question is I haven't use join here ....is it correct?
secondly how I can avoid this sluggishness.
We have a server that gets cranky if it gets too many users logging in at the same time (meaning less than 7 seconds apart). Once the users are logged in, there is no problem (one or two logging in at the same time is also not a problem, but when 10-20 try the entire server goes into a death spiral sigh).
I'm attempting to write a page that will hold onto users (displaying an animated countdown etc.) and let them through 7 seconds apart. The algorithm is simple
fetch the timestamp (t) when the last login happened
if t+7 is in the past start the login and store now() as the new timestamp
if t+7 is in the future, store it as the new timestamp, wait until t+7, then start the login.
A straight forward python/redis implementation would be:
import time, redis
SLOT_LENGTH = 7 # seconds
now = time.time()
r = redis.StrictRedis()
# lines below contain race condition..
last_start = float(r.get('FLOWCONTROL') or '0.0') # 0.0 == time-before-time
my_start = last_start + SLOT_LENGTH
r.set('FLOWCONTROL', max(my_start, now))
wait_period = max(0, my_start - now)
time.sleep(wait_period)
# .. login
The race condition here is obvious, many processes can be at the my_start = line simultaneously. How can I solve this using redis?
I've tried the redis-py pipeline functionality, but of course that doesn't get an actual value until in the r.get() call...
I'll document the answer in case anyone else finds this...
r = redis.StrictRedis()
with r.pipeline() as p:
while 1:
try:
p.watch('FLOWCONTROL') # --> immediate mode
last_slot = float(p.get('FLOWCONTROL') or '0.0')
p.multi() # --> back to buffered mode
my_slot = last_slot + SLOT_LENGTH
p.set('FLOWCONTROL', max(my_slot, now))
p.execute() # raises WatchError if anyone changed TCTR-FLOWCONTROL
break # break out of while loop
except WatchError:
pass # someone else got there before us, retry.
a little more complex than the original three lines...