I have two dictionaries of data, and I created a function that acts as a rules engine: it analyzes the entries in each dictionary and acts on them based on specific metrics I set (if it helps, each entry in the dictionary is a node in a graph, and if the rules match I create edges between them).
Here's the code I use (it's a for loop that passes parts of the dictionary to a rules function; I refactored my code to follow a tutorial I read):
jobs = []
def loadGraph(dayCurrent, day2Previous):
    for dayCurrentCount in graph[dayCurrent]:
        dayCurrentValue = graph[dayCurrent][dayCurrentCount]
        for day1Count in graph[day2Previous]:
            day1Value = graph[day2Previous][day1Count]
            #rules(day1Count, day1Value, dayCurrentCount, dayCurrentValue, dayCurrent, day2Previous)
            p = multiprocessing.Process(target=rules, args=(day1Count, day1Value, dayCurrentCount, dayCurrentValue, dayCurrent, day2Previous))
            jobs.append(p)
            p.start()
        print ' in rules engine for day', dayCurrentCount, ' and we are about ', ((len(graph[dayCurrent])-dayCurrentCount)/float(len(graph[dayCurrent])))
The data I'm studying could be rather large (could, because it's randomly generated). I think there are about 50,000 entries for each day. Because most of the time is spent on this stage, I was wondering if I could use the 8 cores I have available to process this faster.
Because each dictionary entry is compared to a dictionary entry from the day before, I thought the processes could be split up along those lines, but my code above is slower than running it normally. I think this is because it creates a new process for every entry it handles.
Is there a way to speed this up and use all my CPUs? My problem is that I don't want to pass the entire dictionary, because then one core will get stuck processing it; I would rather have the work split across the CPUs, or arranged in a way that makes maximum use of all free CPUs.
I'm totally new to multiprocessing, so I'm sure there's something easy I'm missing. Any advice, suggestions or reading material would be great!
What I've done in the past is to create a "worker class" that processes data entries. Then I'll spin up X number of threads that each run a copy of the worker class. Each item in the dataset gets pushed into a queue that the worker threads are watching. When there are no more items in the queue, the threads spin down.
Using this method, I was able to process 10,000+ data items using 5 threads in about 3 seconds. When the app was only single-threaded, this would take significantly longer.
Check out: http://docs.python.org/library/queue.html
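To make the worker/queue idea concrete, here is a minimal sketch using the standard library's threading and queue modules. The names process_item, run_workers and NUM_WORKERS are made up for illustration, and process_item is only a placeholder for the real per-entry work.

```python
import queue
import threading

NUM_WORKERS = 5          # assumed thread count, matching the "5 threads" above
_SENTINEL = object()     # tells a worker there is no more work


def process_item(item):
    """Hypothetical stand-in for the real per-item work."""
    return item * 2


def worker(tasks, results):
    """Pull items off the queue until the sentinel arrives."""
    while True:
        item = tasks.get()
        if item is _SENTINEL:
            break
        results.put((item, process_item(item)))


def run_workers(items, num_workers=NUM_WORKERS):
    tasks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for item in items:
        tasks.put(item)
    for _ in threads:            # one sentinel per worker so each one exits
        tasks.put(_SENTINEL)
    for t in threads:
        t.join()
    # All producers have finished, so the results queue can be drained safely.
    out = {}
    while not results.empty():
        key, value = results.get()
        out[key] = value
    return out


print(len(run_workers(range(10000))), "items processed")
```

Threads like these pay off mostly when the work waits on I/O. For CPU-heavy pure-Python rules, CPython's global interpreter lock keeps threads from running Python code in parallel, so the same pattern with processes is usually the better fit.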
I would recommend looking into MapReduce implementations in Python. Here's one: http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=mapreduce+python. Also, take a look at a python package called Celery: http://celeryproject.org/. With celery you can distribute your computation not only among cores on a single machine, but also to a server farm (cluster). You do pay for that flexibility with more involved setup/maintenance.
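Before reaching for a cluster, the map/reduce idea can be tried on one machine with multiprocessing.Pool. The sketch below is only an illustration under assumptions: rule_matches is a hypothetical placeholder for the real rules() logic, and each day's data is assumed to be a plain dict. The point is that each process receives one chunk of today's entries (one task per core) instead of one process per pair of entries.

```python
import multiprocessing


def rule_matches(prev_value, cur_value):
    """Hypothetical rule; the real rules() logic would go here."""
    return prev_value == cur_value


def find_edges(args):
    """Map step: compare one chunk of today's entries with all of yesterday's."""
    cur_chunk, prev_day = args
    return [(prev_key, cur_key)
            for cur_key, cur_value in cur_chunk
            for prev_key, prev_value in prev_day.items()
            if rule_matches(prev_value, cur_value)]


def chunked(items, n_chunks):
    """Split a list into at most n_chunks roughly equal slices."""
    size = max(1, -(-len(items) // n_chunks))   # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def load_graph(cur_day, prev_day, n_procs=None):
    n_procs = n_procs or multiprocessing.cpu_count()
    chunks = chunked(list(cur_day.items()), n_procs)
    with multiprocessing.Pool(n_procs) as pool:
        per_chunk = pool.map(find_edges, [(c, prev_day) for c in chunks])
    # Reduce step: merge the per-chunk edge lists into one list.
    return [edge for edges in per_chunk for edge in edges]


if __name__ == "__main__":
    today = {i: i % 7 for i in range(300)}       # made-up sample data
    yesterday = {i: i % 5 for i in range(300)}
    print(len(load_graph(today, yesterday)), "edges")
```

Yesterday's dictionary is still pickled once per chunk, but that is a handful of transfers rather than one process start per comparison.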
I'm running Python code in a SageMaker Processing job, specifically with SKLearnProcessor. The code runs a for loop 200 times (each iteration is independent), and each iteration takes 20 minutes.
For example, script.py:
for i in list:
    run_function(i)
I'm kicking off the job from a notebook:
import os
from sagemaker import Session
from sagemaker.processing import ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

sklearn_processor = SKLearnProcessor(
    framework_version="1.0-1", role=role,
    instance_type="ml.m5.4xlarge", instance_count=1,
    sagemaker_session=Session()
)
out_path = 's3://' + os.path.join(bucket, prefix, 'outpath')
sklearn_processor.run(
    code="script.py",
    outputs=[
        ProcessingOutput(output_name="load_training_data",
                         source='/opt/ml/processing/output',
                         destination=out_path),
    ],
    arguments=["--some-args", "args"]
)
I want to parallelize this code and make the SageMaker Processing job use its full capacity to run as many concurrent tasks as possible.
How can I do that?
There are basically 3 paths you can take, depending on the context.
Parallelising function execution
This solution has nothing to do with SageMaker. It is applicable to any python script, regardless of the ecosystem, as long as you have the necessary resources to parallelise a task.
Based on the needs of your software, you have to work out whether to parallelise multi-thread or multi-process. This question may clarify some doubts in this regard: Multiprocessing vs. Threading Python
Here is a simple example on how to parallelise:
from multiprocessing import Pool
import os

POOL_SIZE = os.cpu_count()
your_list = [...]

def run_function(i):
    # ...
    return your_result

if __name__ == '__main__':
    with Pool(POOL_SIZE) as pool:
        print(pool.map(run_function, your_list))
Splitting input data into multiple instances
This solution is dependent on the quantity and size of the data. If they are completely independent of each other and have a considerable size, it may make sense to split the data over several instances. This way, execution will be faster and there may also be a reduction in costs based on the instances chosen over the initial larger instance.
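In your case, the parameter to set is instance_count. As the documentation says:
instance_count (int or PipelineVariable) - The number of instances to run the Processing job with. Defaults to 1.
This should be combined with splitting the input through ProcessingInput (for example, its s3_data_distribution_type="ShardedByS3Key" setting), so that each instance receives a different part of the data.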
P.S.: This approach makes sense to use if the data can be retrieved before the script is executed. If the data is generated internally, the generation logic must be changed so that it is multi-instance.
Combined approach
One can undoubtedly combine the two previous approaches, i.e. create a script that parallelises the execution of a function on a list and have several parallel instances.
An example of use could be to process a number of csvs. If there are 100 csvs, we may decide to instantiate 5 instances so as to pass 20 files per instance. And in each instance decide to parallelise the reading and/or processing of the csvs and/or rows in the relevant functions.
If you pursue this approach, measure carefully whether it really improves the system or just wastes resources.
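As a rough sketch of the combined approach when the work list is generated inside the script: with instance_count greater than 1, every instance runs the same script.py, so each one has to pick its own share of the items. The code below assumes the container exposes the cluster layout in /opt/ml/config/resourceconfig.json with "hosts" and "current_host" keys; treat that path and those keys as an assumption to check against the SageMaker documentation. shard_for_host and load_cluster_layout are hypothetical helper names.

```python
import json
import os

# Assumption: the Processing container describes the cluster in this file,
# with "hosts" (all instance names) and "current_host" (this instance).
RESOURCE_CONFIG = "/opt/ml/config/resourceconfig.json"


def shard_for_host(items, hosts, current_host):
    """Return the slice of items this instance is responsible for."""
    index = sorted(hosts).index(current_host)
    return items[index::len(hosts)]      # round-robin split across instances


def load_cluster_layout(path=RESOURCE_CONFIG):
    """Fall back to a single local host when run outside SageMaker."""
    if not os.path.exists(path):
        return ["localhost"], "localhost"
    with open(path) as fh:
        config = json.load(fh)
    return config["hosts"], config["current_host"]


hosts, current_host = load_cluster_layout()
all_items = list(range(200))             # placeholder for the 200 loop inputs
my_items = shard_for_host(all_items, hosts, current_host)
print(f"{current_host} handles {len(my_items)} of {len(all_items)} items")
```

Each instance can then feed my_items to a process pool like the one in the first section, which gives parallelism both across instances and across the cores of each instance.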
I have a bit of an issue that I'm hoping you can help me with. I am writing a program that pulls some strings from HTML pages and adds them to a list. I have 50-some pages I am pulling data from. When I run the program, it takes between 45 and 55 seconds to gather the data. Not bad, but I need to be somewhere on the order of 15-20 seconds.
So here is my question: my computer has an 800MHz processor (yes, I know, it's four years old) and I am about to get a new computer. Will a faster processor help with this? If so, what processor speed should I look for to reach my desired time? Is the run time more related to processor speed or to connection speed (my internet connection is definitely fast enough for this application)? Can it be sped up at all?
Thanks!
Addition:
Here is the code used.
This function creates the list of lists that stores the data
def makesobjlist(objs, length):
    sets = [objs]
    for obj in objs:
        objlist = [obj]
        for i in range(1,length+1):
            objlist.append(0)
        sets.append(objlist)
    return sets
The following function then updates the list of lists
def update(objslist):
    for i in range(1, len(objslist)):
        objlist = objslist[i]
        objlist.append(getdata(objlist[0]))
        del(objlist[1])
Python supports threading, multiple processes and queues.
You may gain some speed simply by having multiple workers perform the job rather than a single worker that has to wait. Basically, you divide the "work" up amongst multiple workers (threads or processes) that handle the tasks at hand. While one worker waits on a page, the others keep going, which is much faster than fetching the pages one after another.
Similar post here:
Threading in python using queue
Multiprocessing vs Threading Python
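A small sketch of the multiple-workers idea, using concurrent.futures from the standard library. The getdata below is only a stand-in that sleeps to imitate network latency, and the URLs are made up; the real function would download and parse a page.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def getdata(url):
    """Stand-in for the real page fetch; the sleep imitates network latency."""
    time.sleep(0.05)
    return f"data from {url}"


def fetch_all(urls, max_workers=10):
    # map() returns results in the same order as the input URLs.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(getdata, urls))


urls = [f"http://example.com/page{i}" for i in range(50)]   # hypothetical pages
start = time.perf_counter()
pages = fetch_all(urls)
print(f"fetched {len(pages)} pages in {time.perf_counter() - start:.2f}s")
```

If most of the 45 to 55 seconds is spent waiting for responses, overlapping the requests this way is likely to help far more than a faster processor would.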
del(objlist[1])
If the objlist here can be long (more than a few dozen items), then this line has bad complexity: deleting index 1 shifts every later element of the list down by one, which takes time proportional to the list's length. You should refactor the code to avoid that. For example, you could arrange for the item to remove to be the last item of the list instead of the item at index 1; del objlist[-1] is always a constant-time operation.
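A different way to get constant-time updates, swapped in here for the reordering suggested above, is collections.deque with a maxlen: appending to a full deque drops the oldest item automatically. This sketch replaces the list of lists with a dict of deques, which is my own restructuring rather than the original layout, and it uses len() as a stand-in for the real getdata.

```python
from collections import deque


def make_window(length):
    """A fixed-size history, initially filled with zeros."""
    return deque([0] * length, maxlen=length)


def update(windows, getdata):
    """Append the newest reading; the oldest one is discarded automatically."""
    for obj, window in windows.items():
        window.append(getdata(obj))


windows = {name: make_window(3) for name in ["a", "bb"]}
for _ in range(2):
    update(windows, getdata=len)      # len() stands in for the real getdata
print(windows)
```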
I'm building an application based around a task queue: it serves a series of tasks to multiple, asynchronously connected clients. The twist is that the tasks must be served in a random order.
My problem is that the algorithm I'm using now is computationally expensive, because it relies on many large queries and transfers from the database. I have a strong hunch that there's a cheaper way to achieve the same result, but I can't quite see the solution. Can you think of a clever fix for this problem?
Here's the (computationally expensive) algorithm I'm using now:
When the client queries for a new task...
1. Query the database for "unfinished" tasks
2. Put all tasks in a list
3. Shuffle the list (using random.shuffle)
4. Flag the first task as "in progress"
5. Send the task parameters to the client for completion
When the client finishes the task...
6a. Record the result and flag the task as "finished."
If the client fails to finish the task by some deadline...
6b. Re-flag the task as "unfinished."
Seems like we could do better by replacing steps 1, 2, and 3, with pseudorandom sequences or hash functions. But I can't quite figure out the whole solution. Ideas?
Other considerations:
In case it's important, I'm using python and mongodb for all of this. (Mongodb doesn't have some clever "use find_one to efficiently return a random matching entry" usage, does it?)
The term "queue" is a little misleading. All the tasks are stored in subfields of a single collection within the mongodb. The length (total number of tasks) in the collection is known and fixed at the outset.
If it's necessary, it might be okay to let the same task be assigned multiple times, as long as the occurrence is rare. But instances of this kind would need to be very rare, because completing each task is costly.
I have identifying information on each client, so we know exactly who originates each task request.
There is an easy way to get a random document from MongoDB!
See Random record from MongoDB
If you don't want a task to be picked twice, you could mark the task as active and not select it.
Ah, based on the comments that I missed, you can do something along these lines:
import random
available = list(range(lengthofdatabase))  # a list, so that pop() works
inprogress = []
while len(available) > 0:
    taskindex = available.pop(random.randrange(0, len(available)))
    # I'm not sure of your implementation, but you said something
    # along these lines was possible
    task = GetTask(taskindex)
    inprogress.append(taskindex)
I'm not sure of any of the functions you are using - this is just an algorithm.
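Since the number of tasks is fixed at the outset, one variation is to shuffle the indices once and then hand them out from the end of the list, so each request costs a single pop instead of a query plus a shuffle. The sketch below keeps everything in memory and assumes tasks can be addressed by an integer index; in the real application this state would have to live in MongoDB. RandomTaskQueue and its methods are hypothetical names.

```python
import random
import time


class RandomTaskQueue:
    """Serve task indices in a random order fixed when the queue is created."""

    def __init__(self, n_tasks, deadline_seconds, seed=None):
        self.pending = list(range(n_tasks))
        random.Random(seed).shuffle(self.pending)   # one shuffle, up front
        self.in_progress = {}                       # task index -> start time
        self.deadline = deadline_seconds

    def next_task(self, now=None):
        now = time.time() if now is None else now
        # Step 6b: tasks past their deadline go back in at a random position.
        for task, started in list(self.in_progress.items()):
            if now - started > self.deadline:
                del self.in_progress[task]
                position = random.randrange(len(self.pending) + 1)
                self.pending.insert(position, task)
        if not self.pending:
            return None
        task = self.pending.pop()                   # constant-time pop from the end
        self.in_progress[task] = now
        return task

    def finish(self, task):
        self.in_progress.pop(task, None)            # step 6a


tasks = RandomTaskQueue(n_tasks=5, deadline_seconds=10, seed=1)
print([tasks.next_task(now=0) for _ in range(5)])
```

Because a task leaves the pending list the moment it is served, no task is handed to two clients unless its deadline has expired.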
Happy Coding!
I may be in a little over my head here, but I am working on a little bioinformatics project in Python. I am trying to parallelize a program that analyzes a large dictionary of sets of strings (~2-3GB in RAM). I find that the multiprocessing version is faster when I have smaller dictionaries, but is of little benefit, and mostly slower, with the large ones. My first theory was that running out of memory slowed everything down and the bottleneck came from swapping into virtual memory. However, I ran the program on a cluster with 4*48GB of RAM and the same slowdown occurred. My second theory is that access to certain data was being locked. If one thread is trying to access a reference currently being accessed in another thread, will that thread have to wait? I have tried creating copies of the dictionaries I want to manipulate, but that seems terribly inefficient. What else could be causing my problems?
My multiprocessing method is below:
def worker(seqDict, oQueue):
    #do stuff with the given partial dictionary
    oQueue.put(seqDict)

oQueue = multiprocessing.Queue()
chunksize = int(math.ceil(len(sdict)/4)) # 4 cores
inDict = {}
i=0
dicts = list()
for key in sdict.keys():
    i+=1
    if len(sdict[key]) > 0:
        inDict[key] = sdict[key]
    if i%chunksize==0 or i==len(sdict.keys()):
        print(str(len(inDict.keys())) + ", size")
        dicts.append(copy(inDict))
        inDict.clear()

for pdict in dicts:
    p = multiprocessing.Process(target = worker,args = (pdict, oQueue))
    p.start()

finalDict = {}
for i in range(4):
    finalDict.update(oQueue.get())

return finalDict
As I said in the comments, and as Kinch said in his answer, everything passed to a subprocess has to be pickled and unpickled to duplicate it in the local context of the spawned process. If you use multiprocessing.Manager().dict() for sDict (thereby allowing processes to share the same data through a server that proxies the objects created on it) and spawn the processes with slice indices into that shared sDict, that should cut down on the serialize/deserialize sequence involved in spawning the child processes. You still might hit bottlenecks with that, though, in the server communication step of working with the shared objects. If so, you'll have to look at simplifying your data so you can use true shared memory with multiprocessing.Array or multiprocessing.Value, or look at multiprocessing.sharedctypes to create custom data structures to share between your processes.
Seems like the data from the "large dictionary of sets of strings" could be reformatted into a something that could be stored in a file or string, allowing you to use the mmap module to share it among all the processes. Each process might incur some startup overhead if it needs to convert the data back into some other more preferable form, but that could be minimized by passing each process something indicating what subset of the whole dataset in shared memory they should do their work on and only reconstitute the part required by that process.
All data passed through the queue is serialized and deserialized using pickle. I would guess this could be a bottleneck if you pass a lot of data around.
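To get a feel for that cost, you can time pickle directly on something shaped like your data. The dictionary below is a made-up, much smaller stand-in for one chunk of the "dictionary of sets of strings", and summarize is a hypothetical example of a worker returning a small result instead of the whole chunk.

```python
import pickle
import time

# Hypothetical stand-in for one chunk of the dictionary of sets of strings.
chunk = {f"seq{i}": {f"read{i}_{j}" for j in range(20)} for i in range(20000)}


def summarize(seq_dict):
    """Example of a small result: set sizes instead of the sets themselves."""
    return {key: len(values) for key, values in seq_dict.items()}


start = time.perf_counter()
full_bytes = len(pickle.dumps(chunk))
elapsed = time.perf_counter() - start
small_bytes = len(pickle.dumps(summarize(chunk)))

print(f"whole chunk: {full_bytes / 1e6:.1f} MB, pickled in {elapsed:.2f}s")
print(f"summary only: {small_bytes / 1e6:.1f} MB")
```

In the code from the question, every chunk makes this trip twice: once on the way into the worker and once again when the worker puts the whole dictionary back on the queue.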
You could reduce the amount of data, make use of shared memory, write a multithreaded version in a C extension, or try a multithreaded version of this on a Python implementation without a global interpreter lock (maybe Jython; I don't know. PyPy still has a GIL).
Oh and by the way: You are using multiprocessing and not multithreading.
I am doing some research this summer and working on parallelizing pre-existing code. The main focus right now is finding a way to load-balance the code so that it runs more efficiently on the cluster. The current task is to build a proof of concept that creates several processes, each with its own stack of work; when a process finishes its stack, it queries the two closest processes to see whether they have any more work available in theirs.
I am having difficulties conceptualizing this in python, but was hoping someone could point me in the right direction or have some sort of example that is similar in either mpi4py or ParallelPython. Also if anyone knows of a better or easier module then that would be great to know.
Thanks.
Here's a simple way to do this.
Create a single common shared queue of work to do. This application will fill this queue with work to do.
Create an application which gets one item from the queue, and does the work.
This is the single-producer-multiple-consumer design. It works well and can swamp your machine with parallel processes.
To use the built-in queue class, you need to wrap the queue in some kind of multi-processing API: http://docs.python.org/library/queue.html. Personally, I like to create a small HTTP-based web-server that handles the queue. Each application does a GET to fetch the next piece of work.
You can use tools like RabbitMQ to create a very nice shared queue.
http://nathanborror.com/posts/2009/may/20/working-django-and-rabbitmq/
You might be able to use http://hjb.python-hosting.com/ to make use of JMS queues.
You'll need a small application to create and fill the queue with work.
Create as many copies of the application as you like. For example:
for i in 1 2 3 4 5 6 7 8 9 10
do
    python myapp.py &
done
This will run 10 concurrent copies of your application. All 10 are trying to get work from a single queue. They will use all available CPU resources and the OS will nicely schedule them for you.
Peer, node-to-node synchronization means you have O(n*(n-1)/2) communication paths among all nodes.
The "two-adjacent nodes" means you still have 2*n communication paths and work has to "somehow" trickle among the nodes. If the nodes are all initially seeded with work, then someone did a lot of planning to balance the workload. If you're going to do that much planning, why ask the nodes to synchronize at all?
If the queues are not carefully balanced to begin with, then every even node could be slow and every odd node could be fast. The odd nodes finish first and check for work from two even nodes, but those nodes are (a) not done and (b) don't have any extra work to hand over, either. What now? Half the nodes are working, half are idle. All due to poor planning in the initial distribution of work.
Master-slave means you have n communication paths. Further, the balancing is automatic since all idle nodes have equal access to work. There's no such thing as a biased initial distribution that leads to poor overall performance.
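For a single machine, the shared-queue design can be sketched with multiprocessing alone, without a web server or message broker. This is only an illustration: do_work is a hypothetical unit of work, and the None sentinel assumes None is never a real work item.

```python
import multiprocessing

STOP = None   # sentinel: one per worker marks the end of the work


def do_work(item):
    """Hypothetical unit of work."""
    return item * item


def consumer(tasks, results):
    """Worker loop: take the next item until the sentinel shows up."""
    for item in iter(tasks.get, STOP):
        results.put((item, do_work(item)))


def run(items, n_workers=4):
    """items must be a list; returns a dict mapping each item to its result."""
    tasks, results = multiprocessing.Queue(), multiprocessing.Queue()
    workers = [multiprocessing.Process(target=consumer, args=(tasks, results))
               for _ in range(n_workers)]
    for w in workers:
        w.start()
    for item in items:            # the single producer fills the shared queue
        tasks.put(item)
    for _ in workers:
        tasks.put(STOP)
    # Collect results before joining so no worker blocks on a full pipe.
    collected = dict(results.get() for _ in range(len(items)))
    for w in workers:
        w.join()
    return collected


if __name__ == "__main__":
    print(run(list(range(10))))
```

Idle workers simply take the next item, so the balancing happens on its own, which is the property argued for above. On a cluster, the same shape would need a network-visible queue such as the HTTP server or RabbitMQ options mentioned earlier.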
Use the multiprocessing module.