I have a bit of an issue that I'm hoping you can help me with. I'm writing a program that pulls some strings from HTML pages and adds them to a list. I have 50-some pages I'm pulling data from. When I run the program it takes between 45 and 55 seconds to gather the data. Not bad, but I need to be somewhere on the order of 15-20 seconds.
So here is my question: my computer has an 800 MHz processor (yes, I know, it's four years old) and I'm about to get a new computer. Will a faster processor help with this? If so, what processor speed should I look for to reach my desired time? Is this more limited by processor speed or connection speed (my internet connection is definitely fast enough for this application)? Can it be sped up at all?
Thanks!
Addition:
Here is the code used.
This function creates the list of lists that stores the data
def makesobjlist(objs, length):
    sets = [objs]
    for obj in objs:
        objlist = [obj]
        for i in range(1, length+1):
            objlist.append(0)
        sets.append(objlist)
    return sets
The following function then updates the list of lists
def update(objslist):
    for i in range(1, len(objslist)):
        objlist = objslist[i]
        objlist.append(getdata(objlist[0]))
        del(objlist[1])
Python supports threading, multiple processes and queues.
You may gain some speed by simply having multiple workers perform the job rather than a single worker that has to wait on each page in turn. Basically you divide the "work" up amongst multiple programs (workers) that process the tasks at hand. This is much faster than waiting for one long sequential process to finish.
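For example, here is a minimal sketch with a thread pool; the URL list and the fetch function are placeholders for your own download-and-parse code (e.g. whatever your getdata call does):

from concurrent.futures import ThreadPoolExecutor

# Placeholders: swap in your real page list and your own download/parse logic.
urls = ["http://example.com/page%d" % i for i in range(50)]

def fetch(url):
    # download the page and return the extracted strings
    ...

# Ten threads fetch pages concurrently instead of one after another.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(fetch, urls))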
Similar posts here:
Threading in python using queue
Multiprocessing vs Threading Python
del(objlist[1])
If the objlist here can be long (more than a few dozen items), then this line has bad complexity: it shifts the whole tail of the list down by one position. You should refactor the code to avoid that. For example, you could arrange for the item to remove to be the last item of the list instead of the item at index 1; del objlist[-1] is always a constant-time operation.
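One possible refactor (a sketch only, keeping the getdata call from the question) is to hold the rolling values in a collections.deque with a fixed maxlen, so appending a new value drops the oldest one in constant time:

from collections import deque

class ObjHistory:
    # Hypothetical wrapper: an object plus a fixed-length window of its recent data.
    def __init__(self, obj, length):
        self.obj = obj
        self.data = deque([0] * length, maxlen=length)

    def update(self):
        # Appending to a full deque discards the oldest item in O(1).
        self.data.append(getdata(self.obj))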
Related
I am new to multiprocessing, and I would really appreciate it if someone could guide/help me here. I have the following for loop which gets some data from two functions. The code looks like this:
for a in accounts:
    dl_users[a['Email']] = get_dl_users(a['Email'], adConn)
    group_users[a['Email']] = get_group_users(a['Id'], adConn)

print(f"Users part of DL - {dl_users}")
print(f"Users part of groups - {group_users}")
adConn.unbind()
This works fine and gets all the results, but recently I have noticed it takes a lot of time to get the lists of users, i.e. dl_users and group_users. It takes almost 14-15 minutes to complete. I am looking for ways to speed up the function and would like to convert this for loop to multiprocessing. get_group_users and get_dl_users make calls to LDAP, so I am not 100% sure whether I should be converting this to multiprocessing or multithreading. Any suggestion would be a big help.
As mentioned in the comments, multithreading is appropriate for I/O operations (reading/writing from/to files, sending http requests, communicating with databases), while multiprocessing is appropriate for CPU-bound tasks (such as transforming data, making calculations...). Depending on which kind of operation your functions perform, you want one or the other. If they do a mix, separate them internally and profile which of the two really needs optimisation, since both multiprocessing and -threading introduce overhead that might not be worth adding.
That said, the way to apply multiprocessing or multithreading is pretty simple in recent Python versions (including your 3.8).
Multiprocessing
from multiprocessing import Pool

# Pick the amount of processes that works best for you
processes = 4
with Pool(processes) as pool:
    processed = pool.map(your_func, your_data)
Where your_func is a function to apply to each element of your_data, which is an iterable. If you need to provide some other parameters to the callable, keep in mind that multiprocessing has to pickle the function, so a lambda will not work here; functools.partial does:
from functools import partial
processed = pool.map(partial(your_func, some_kwarg="some value"), your_data)
Multithreading
The API for multithreading is very similar:
from concurrent.futures import ThreadPoolExecutor

# Pick the amount of workers that works best for you.
# Most likely equal to the amount of threads of your machine.
workers = 4
with ThreadPoolExecutor(workers) as pool:
    processed = pool.map(your_func, your_data)
If you want to avoid storing your_data in memory, or you only need some attribute of each item rather than the item itself, you can pass a generator:
processed = pool.map(your_func, (account["Email"] for account in accounts))
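Applied to the loop from the question, a sketch could look like the following. It assumes get_dl_users and get_group_users can safely be called concurrently over the shared adConn; if the LDAP connection is not thread-safe, give each worker its own connection instead.

from concurrent.futures import ThreadPoolExecutor

dl_users, group_users = {}, {}

with ThreadPoolExecutor(max_workers=8) as pool:
    # Submit every lookup up front so the LDAP round-trips overlap.
    dl_futures = {a['Email']: pool.submit(get_dl_users, a['Email'], adConn) for a in accounts}
    group_futures = {a['Email']: pool.submit(get_group_users, a['Id'], adConn) for a in accounts}

    for email, fut in dl_futures.items():
        dl_users[email] = fut.result()
    for email, fut in group_futures.items():
        group_users[email] = fut.result()

adConn.unbind()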
I have created two versions of a program to add the numbers of an array: one version uses concurrent programming and the other is sequential. The problem I have is that I cannot make the parallel program return a faster processing time. I am currently using Windows 8 and Python 3.x. My code is:
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
import random
import time
def fun(v):
    s = 0
    for i in range(0, len(v)):
        s = s + v[i]
    return s

def sumSeq(v):
    s = 0
    start = time.time()
    for i in range(0, len(v)):
        s = s + v[i]
    start1 = time.time()
    print("time seq ", start1 - start, " sum = ", s)

def main():
    workers = 4
    vector = [random.randint(1, 101) for _ in range(1000000)]
    sumSeq(vector)
    dim = (int)(len(vector) / (workers * 10))
    s = 0
    chunks = (vector[k:k + dim] for k in range(0, len(vector), (int)(len(vector) / (workers * 10))))
    start = time.time()
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(fun, chunk) for chunk in chunks]
        start1 = time.time()
        for future in as_completed(futures):
            s = s + future.result()
    print("concurrent time ", start1 - start, " sum = ", s)
The problem is that I get the following answer:
time seq 0.048101186752319336 sum = 50998349
concurrent time 0.059157371520996094 sum = 50998349
I cannot make the concurrent version run faster. I have changed the chunk size and set the number of max workers to None, but nothing seems to work. What am I doing wrong? I have read that the problem could be the creation of the processes, so how can I fix that in a simple way?
A long-standing weakness of Python is that it can't run pure-Python code simultaneously in multiple threads; the keyword to search for is "GIL" or "global interpreter lock".
Ways around this:
This only applies to CPU-heavy operations, like addition; I/O operations and the like can happily run in parallel. You can happily continue to run Python code in one thread while others are waiting for disk, network, database etc.
This only applies to pure-Python code; several computation-heavy extension modules will release the GIL and let code in other threads run. Things like matrix operations in numpy or image operations can thus run in threads alongside a CPU-heavy Python thread.
It applies to threads (ThreadPoolExecutor) specifically; the ProcessPoolExecutor will work the way you expect — but it's more isolated, so the program will spend more time marshalling and demarshalling the data and intermediate results.
I should also note that it looks like your example isn't really well suited to this test:
It's too small; with a total time of 0.05s, a lot of that is going to be the setup and tear-down of the parallel run. In order to test this, you need at least several seconds, ideally many seconds or a couple of minutes.
The sequential version visits the array in sequence; things like CPU cache are optimised for this sort of access. The parallel version will access the chunks at random, potentially causing cache evictions and the like.
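For illustration, here is a rough sketch of the same benchmark with ProcessPoolExecutor and a much larger input. Note that the chunks still have to be pickled and sent to the worker processes, so for a plain sum the speedup stays modest:

from concurrent.futures import ProcessPoolExecutor
import random
import time

def fun(chunk):
    return sum(chunk)

def main():
    workers = 4
    vector = [random.randint(1, 101) for _ in range(20000000)]  # big enough to measure

    start = time.time()
    total = sum(vector)
    print("sequential time", time.time() - start, "sum =", total)

    dim = len(vector) // workers
    chunks = [vector[k:k + dim] for k in range(0, len(vector), dim)]

    start = time.time()
    with ProcessPoolExecutor(max_workers=workers) as executor:
        total = sum(executor.map(fun, chunks))
    print("process time   ", time.time() - start, "sum =", total)

if __name__ == "__main__":   # required on Windows, where worker processes are spawned
    main()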
I'm building an application based around a task queue: it serves a series of tasks to multiple, asynchronously connected clients. The twist is that the tasks must be served in a random order.
My problem is that the algorithm I'm using now is computationally expensive, because it relies on many large queries and transfers from the database. I have a strong hunch that there's a cheaper way to achieve the same result, but I can't quite see the solution. Can you think of a clever fix for this problem?
Here's the (computationally expensive) algorithm I'm using now:
When the client queries for a new task...
1. Query the database for "unfinished" tasks
2. Put all tasks in a list
3. Shuffle the list (using random.shuffle)
4. Flag the first task as "in progress"
5. Send the task parameters to the client for completion
When the client finishes the task...
6a. Record the result and flag the task as "finished."
If the client fails to finish the task by some deadline...
6b. Re-flag the task as "unfinished."
Seems like we could do better by replacing steps 1, 2, and 3, with pseudorandom sequences or hash functions. But I can't quite figure out the whole solution. Ideas?
Other considerations:
In case it's important, I'm using python and mongodb for all of this. (Mongodb doesn't have some clever "use find_one to efficiently return a random matching entry" usage, does it?)
The term "queue" is a little misleading. All the tasks are stored in subfields of a single collection within the mongodb. The length (total number of tasks) in the collection is known and fixed at the outset.
If it's necessary, it might be okay to let the same task be assigned multiple times, as long as the occurrence is rare. But instances of this kind would need to be very rare, because completing each task is costly.
I have identifying information on each client, so we know exactly who originates each task request.
There is an easy way to get a random document from MongoDB!
See Random record from MongoDB
If you don't want a task to be picked twice, you could mark the task as active and not select it.
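For example, a sketch with pymongo, assuming MongoDB 3.2+ (which added the $sample aggregation stage); the collection name and the status field are illustrative, not from the question:

from pymongo import MongoClient

tasks = MongoClient().mydb.tasks

def claim_random_task():
    # Pick one random unfinished task server-side.
    candidates = list(tasks.aggregate([
        {"$match": {"status": "unfinished"}},
        {"$sample": {"size": 1}},
    ]))
    if not candidates:
        return None
    # Atomically flag it as in progress; if another client grabbed it first,
    # this returns None and the caller can simply try again.
    return tasks.find_one_and_update(
        {"_id": candidates[0]["_id"], "status": "unfinished"},
        {"$set": {"status": "in progress"}},
    )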
Ah, based on the comments that I missed, you can do something along these lines:
import random

available = list(range(lengthofdatabase))   # list() so that pop() works
inprogress = []

while len(available) > 0:
    taskindex = available.pop(random.randrange(0, len(available)))
    # I'm not sure of your implementation, but you said something
    # along these lines was possible
    task = GetTask(taskindex)
    inprogress.append(taskindex)
I'm not sure of any of the functions you are using - this is just an algorithm.
Happy Coding!
I may be in a little over my head here, but I am working on a little bioinformatics project in Python. I am trying to parallelize a program that analyzes a large dictionary of sets of strings (~2-3GB in RAM). I find that the multiprocessing version is faster when I have smaller dictionaries, but it is of little benefit and mostly slower with the large ones. My first theory was that running out of memory slowed everything down and that the bottleneck was swapping into virtual memory. However, I ran the program on a cluster with 4*48GB of RAM and the same slowdown occurred. My second theory is that access to certain data was being locked. If one thread is trying to access a reference currently being accessed in another thread, will that thread have to wait? I have tried creating copies of the dictionaries I want to manipulate, but that seems terribly inefficient. What else could be causing my problems?
My multiprocessing method is below:
def worker(seqDict, oQueue):
    # do stuff with the given partial dictionary
    oQueue.put(seqDict)

oQueue = multiprocessing.Queue()
chunksize = int(math.ceil(len(sdict)/4))  # 4 cores
inDict = {}
i = 0
dicts = list()
for key in sdict.keys():
    i += 1
    if len(sdict[key]) > 0:
        inDict[key] = sdict[key]
    if i % chunksize == 0 or i == len(sdict.keys()):
        print(str(len(inDict.keys())) + ", size")
        dicts.append(copy(inDict))
        inDict.clear()

for pdict in dicts:
    p = multiprocessing.Process(target=worker, args=(pdict, oQueue))
    p.start()

finalDict = {}
for i in range(4):
    finalDict.update(oQueue.get())
return finalDict
As I said in the comments, and as Kinch said in his answer, everything passed through to a subprocess has to be pickled and unpickled to duplicate it in the local context of the spawned process. If you use multiprocessing.Manager().dict() for sdict (thereby allowing processes to share the same data through a server process that proxies the objects created on it) and spawn the processes with slice indices into that shared sdict, that should cut down on the serialize/deserialize sequence involved in spawning the child processes. You still might hit bottlenecks in the server communication step of working with the shared objects, though. If so, you'll have to look at simplifying your data so you can use true shared memory with multiprocessing.Array or multiprocessing.Value, or look at multiprocessing.sharedctypes to create custom data structures to share between your processes.
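A rough sketch of that Manager-based approach; the per-key work shown here (a simple len) and the toy sdict are stand-ins for the real analysis and data:

from multiprocessing import Manager, Process, Queue

def worker(shared_dict, keys, out_queue):
    # Stand-in for the real per-key analysis of each set of strings.
    result = {key: len(shared_dict[key]) for key in keys}
    out_queue.put(result)

if __name__ == "__main__":
    sdict = {i: {"a", "b", "c"} for i in range(1000)}   # stand-in for the real sdict
    manager = Manager()
    shared = manager.dict(sdict)      # proxied dict served by the manager process
    keys = list(sdict.keys())
    out_queue = Queue()

    step = len(keys) // 4 + 1         # split the keys four ways
    procs = [Process(target=worker, args=(shared, keys[i:i + step], out_queue))
             for i in range(0, len(keys), step)]
    for p in procs:
        p.start()

    finalDict = {}
    for _ in procs:                   # drain the queue before joining
        finalDict.update(out_queue.get())
    for p in procs:
        p.join()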
Seems like the data from the "large dictionary of sets of strings" could be reformatted into something that could be stored in a file or string, allowing you to use the mmap module to share it among all the processes. Each process might incur some startup overhead if it needs to convert the data back into some other more preferable form, but that could be minimized by passing each process something indicating which subset of the whole dataset in shared memory it should work on, and only reconstituting the part required by that process.
All data passed through the queue will be serialized and deserialized using pickle. I would guess this could be a bottleneck if you pass a lot of data around.
You could reduce the amount of data, make use of shared memory, write a multithreading version in a C extension, or try a multithreading version of this with a multithreading-safe implementation of Python (maybe Jython or PyPy; I don't know).
Oh and by the way: You are using multiprocessing and not multithreading.
I have two dictionaries of data, and I created a function that acts as a rules engine to analyze entries in each dictionary and do things based on specific metrics I set (if it helps, each entry in the dictionary is a node in a graph, and if rules match I create edges between them).
Here's the code I use (it's a for loop that passes parts of the dictionary to a rules function; I refactored my code following a tutorial I read):
jobs = []
def loadGraph(dayCurrent, day2Previous):
    for dayCurrentCount in graph[dayCurrent]:
        dayCurrentValue = graph[dayCurrent][dayCurrentCount]
        for day1Count in graph[day2Previous]:
            day1Value = graph[day2Previous][day1Count]
            #rules(day1Count, day1Value, dayCurrentCount, dayCurrentValue, dayCurrent, day2Previous)
            p = multiprocessing.Process(target=rules, args=(day1Count, day1Value, dayCurrentCount, dayCurrentValue, dayCurrent, day2Previous))
            jobs.append(p)
            p.start()
        print ' in rules engine for day', dayCurrentCount, ' and we are about ', ((len(graph[dayCurrent])-dayCurrentCount)/float(len(graph[dayCurrent])))
The data I'm studying could be rather large (could, because it's randomly generated). I think for each day there's about 50,000 entries. Because most of the time is spent on this stage, I was wondering if I could use the 8 cores I have available to help process this faster.
Because each dictionary entry is being compared to a dictionary entry from the day before, I thought the processes could be split up by that, but my code above is slower than running it normally. I think this is because it's creating a new process for every entry it handles.
Is there a way to speed this up and use all my CPUs? My problem is, I don't want to pass the entire dictionary, because then one core will get stuck processing it; I would rather have the work split across each CPU, or done in a way that makes use of all my free CPUs.
I'm totally new to multiprocessing so I'm sure there's something easy I'm missing. Any advice/suggestions or reading material would be great!
What I've done in the past is to create a "worker class" that processes data entries. Then I'll spin up X number of threads that each run a copy of the worker class. Each item in the dataset gets pushed into a queue that the worker threads are watching. When there are no more items in the queue, the threads spin down.
Using this method, I was able to process 10,000+ data items using 5 threads in about 3 seconds. When the app was only single-threaded, this would take significantly longer.
Check out: http://docs.python.org/library/queue.html
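A minimal sketch of that pattern with the standard threading and queue modules (the process function and the dataset here are placeholders):

import threading
import queue

def process(item):
    return item * 2               # stand-in for your real per-item work

def worker(task_queue, results):
    while True:
        item = task_queue.get()
        if item is None:          # sentinel: no more work for this thread
            task_queue.task_done()
            break
        results.append(process(item))
        task_queue.task_done()

task_queue = queue.Queue()
results = []
threads = [threading.Thread(target=worker, args=(task_queue, results)) for _ in range(5)]
for t in threads:
    t.start()

for item in range(10000):         # stand-in for your real dataset
    task_queue.put(item)
for _ in threads:
    task_queue.put(None)          # one sentinel per worker

task_queue.join()
for t in threads:
    t.join()

Since the rules engine sounds CPU-bound, the same structure also works with multiprocessing.Process and multiprocessing.JoinableQueue if the GIL keeps threads from giving you a speedup.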
I would recommend looking into MapReduce implementations in Python. Here's one: http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=mapreduce+python. Also, take a look at a python package called Celery: http://celeryproject.org/. With celery you can distribute your computation not only among cores on a single machine, but also to a server farm (cluster). You do pay for that flexibility with more involved setup/maintenance.