I have some code that is slow (30-60 mins by last count) that I need to optimize; it is a data extraction script for Abaqus for a structural engineering model. The worst part of the script is the loop where it iterates through the object model database, first by frame (i.e. the time in the time history of the simulation) and, nested under this, by each of the nodes. The silly thing is that there are ~100k 'nodes' but only about ~20k useful nodes. Luckily for me, the nodes are always in the same order, meaning I do not need to look up each node's uniqueLabel; I can do that in a separate loop once and then filter what I get at the end. That is why I have dumped everything into one list and then I remove all the nodes that are repeats. But as you can see from the code:
timeValues = []
peeqValues = []
for frame in frames:  # 760 loops
    setValues = frame.fieldOutputs['###fieldOutputType'].getSubset(
        region=abaqusSet, position=ELEMENT_NODAL).values
    timeValues.append(frame.frameValue)
    for value in setValues:  # 100k loops
        peeqValues.append(value.data)
It still needs to make the value.data calls unnecessarily, about ~80k times per frame. If anyone is familiar with Abaqus odb (output database) objects, they're super slow under Python. To add insult to injury, they only run in a single thread, under Abaqus, which has its own Python version (2.6.x) and packages (so e.g. numpy is available, pandas is not). Another annoyance is that you can address the objects by position, e.g. frames[-1] gives you the last frame, but you cannot slice, so e.g. you can't do this: for frame in frames[0:10]: # iterate first 10 frames.
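For what it's worth, the 'cannot slice' limitation can usually be worked around with itertools.islice, which only needs iteration support; a minimal sketch, assuming frames iterates the same way it does in the loop above:

from itertools import islice

# Sketch only: islice lazily yields the first 10 frames without
# requiring the odb sequence to support real slicing.
for frame in islice(frames, 10):
    print(frame.frameValue)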
I don't have any experience with itertools but I'd want to provide it a list of nodeIDs (or list of True/False) to map onto the setValues. The length and pattern of setValues to skip is always the same for each of the 760 frames. Maybe something like:
for frame in frames:  # still 760 calls
    setValues = frame.fieldOutputs['###fieldOutputType'].getSubset(
        region=abaqusSet, position=ELEMENT_NODAL).values
    timeValues.append(frame.frameValue)
    # nodeSet_IDs_TF = [True, True, False, False, False, ...]
    # same length as setValues
    filteredSetValues = ifilter(nodeSet_IDs_TF, setValues)
    for value in filteredSetValues:  # only 20k calls
        peeqValues.append(value.data)
Any other tips are also appreciated. After this I wanted to "avoid the dots" by hoisting the .append() attribute lookup out of the loop, and then put the whole thing in a function to see if it helps. The whole script already runs in under 1.5 hours (down from 6, and at one point 21 hours), but once you start optimizing there is no way to stop.
Memory considerations are also appreciated: I run these on a cluster, and I believe I once got away with 80 GB of RAM. The scripts definitely work with 160 GB; the issue is getting the resources allocated to me.
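Since numpy is available, a minimal sketch of one memory-side idea: preallocate the output arrays instead of growing Python lists. The 760 x 20,000 shape and the nodeSet_IDs_TF mask are assumptions taken from the numbers above, not from the actual model:

import numpy as np

n_frames, n_useful = 760, 20000  # assumed from the counts quoted above
peeq = np.empty((n_frames, n_useful), dtype=np.float64)  # allocated once
times = np.empty(n_frames, dtype=np.float64)

for i, frame in enumerate(frames):
    times[i] = frame.frameValue
    setValues = frame.fieldOutputs['###fieldOutputType'].getSubset(
        region=abaqusSet, position=ELEMENT_NODAL).values
    # assumes the mask keeps exactly n_useful values per frame
    peeq[i, :] = [v.data for v, keep in zip(setValues, nodeSet_IDs_TF) if keep]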
I've searched around for a solution but maybe I'm using the wrong keywords, I'm sure this is not an uncommon issue in looping.
EDIT 1
Here is what I ended up using:
# there is no compress under 2.6.x ... so use the equivalent recipe:
from itertools import izip

def compress(data, selectors):
    # compress('ABCDEF', [1,0,1,0,1,1]) --> ACEF
    return (d for d, s in izip(data, selectors) if s)

def iterateOdb(frames, selectors):  # minor speed up
    peeqValues = []
    timeValues = []
    append = peeqValues.append  # minor speed up
    for frame in frames:
        setValues = frame.fieldOutputs['###fieldOutputType'].getSubset(
            region=abaqusSet, position=ELEMENT_NODAL).values
        timeValues.append(frame.frameValue)
        for value in compress(setValues, selectors):  # massive speed up
            append(value.data)
    return peeqValues, timeValues

peeqValues, timeValues = iterateOdb(frames, selectors)
The biggest improvement came from using the compress(values, selectors) method: the whole script, including the odb portion, went from ~1:30 hours to 25 mins. There was also a minor improvement from append = peeqValues.append, as well as from enclosing everything in def iterateOdb(frames, selectors):.
I used tips from: https://wiki.python.org/moin/PythonSpeed/PerformanceTips
Thanks to everyone for answering & helping!
If you're not confident with itertools, try using an if statement in your for loop first, e.g.:
for index, item in enumerate(values):
    if not selectors[index]:
        continue
    ...
# where selectors is a truth array like nodeSet_IDs_TF
This way you can be more sure that you are getting the correct behaviour, and you will still get most of the performance increase you would get from using itertools.
The itertools equivalent is compress.
for item in compress(values, selectors):
    ...
I'm not familiar with Abaqus, but the best optimisation you could achieve would be to see if there is any way to give Abaqus your selectors, so it doesn't have to waste time creating each value only for it to be thrown away. If Abaqus is used for doing large array-based manipulations of data, then it's likely this is the case.
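A hedged sketch of that idea, reusing only the getSubset call already shown in the question; usefulSet is a placeholder for a node set containing just the ~20k nodes of interest (how such a set is created or looked up in the odb is not shown here):

# Assumption: usefulSet refers to a set restricted to the ~20k useful nodes.
for frame in frames:
    setValues = frame.fieldOutputs['###fieldOutputType'].getSubset(
        region=usefulSet, position=ELEMENT_NODAL).values
    timeValues.append(frame.frameValue)
    for value in setValues:  # ideally ~20k values, no Python-side filtering needed
        peeqValues.append(value.data)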
Another variant in addition to those in Dunes's solution:
for value, selector in zip(setValues, selectors):
    if selector:
        peeqValues.append(value.data)
If you want to keep the output list the same length as setValues, then add an else clause:
for value, selector in zip(setValues, selectors):
    if selector:
        peeqValues.append(value.data)
    else:
        peeqValues.append(None)
Here selectors is a vector of True/False values with the same length as setValues.
In this case it is really a matter of taste which one you prefer. If the full iteration over 76 million nodes (760 x 100,000) takes 30 minutes, the time is not being spent in Python's loops.
I tried this:
def loopit(a):
    for i in range(760):
        for j in range(100000):
            a = a + 1
    return a
IPython's %timeit reports the loop time as 3.54 s. So the looping itself accounts for roughly 0.2 % of the total time.
Related
I'm analyzing a dataset with 200 columns and 6000 rows. I computed all the possible differences between columns using itertools and added them to the dataset, so the number of columns has increased. Until now everything works fine and the kernel doesn't have problems. The kernel dies when I try to group columns with the same first value and sum them.
# difference between two columns, all possible combinations 1-2, 1-3, ..., 199-200
def sb(df):
    comb = itertools.permutations(df.columns, 2)
    N_f = pd.concat([df[a] - df[b] for a, b in comb], axis=1)
    N_f.iloc[0, :] = [abs(number) for number in N_f.iloc[0, :]]
    return N_f

# Here I turn the first row into column headers and then try to sum columns with the same header
def fg(f):
    f.columns = f.iloc[0]
    f = f.iloc[1:]
    f = f.groupby(f.columns, axis=1).sum()
    return f
Now I have tried to run the code without the groupby part, but the kernel keeps dying.
Kernel crashes often suggest a large spike in resource usage, which your machine and/or Jupyter configuration could not handle.
The question is then, "What am I doing that is using so many resources?".
That's for you to figure out, but my guess is that it has to do with your list comprehension over permutations. Permutations grow quickly (200 columns give 200 x 199 = 39,800 ordered pairs), and keeping an in-memory difference column for each one is going to hurt.
I suggest debugging like so:
# comb is an iterator; how many pairs will it yield?
# (With 200 columns, permutations of length 2 give 200 * 199 = 39,800.)
comb = itertools.permutations(df.columns, 2)
N_f = pd.DataFrame()
# Instead of doing these operations in one list comprehension,
# make a for loop and print out the memory usage at each
# iteration in the loop. How is it scaling?
N_f = pd.concat([df[a] - df[b] for a, b in comb], axis=1)
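A minimal sketch of that debugging loop (assuming a Unix-like system; resource.getrusage reports peak RSS in KB on Linux and in bytes on macOS):

import itertools
import resource

import pandas as pd

def sb_debug(df):
    # Build the difference columns one at a time and watch memory grow.
    pieces = []
    for i, (a, b) in enumerate(itertools.permutations(df.columns, 2)):
        pieces.append(df[a] - df[b])
        if i % 1000 == 0:
            peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
            print("pair %d: peak RSS %d" % (i, peak))
    return pd.concat(pieces, axis=1)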
I'm building a web app to match high school students considering a gap year to students who have taken a gap year, based on interests as denoted by tags. A prototype is up at covidgapyears.com. I have never written a matching/recommendation algorithm, so although people have suggested things like collaborative filtering, association rule mining, or adapting the stable marriage problem, I don't think any of those will work because it's a small dataset (a few hundred users right now, a few thousand soon). So I wrote my own algorithm using common sense.
It essentially takes in a list of tags that the student is interested in, then searches for an exact match of those tags with someone who has taken a gap year and registered with the site (and who also selected tags on registration). An exactMatch, as given below, is when the tags the user specifies are ALL contained by some profile (i.e., are a subset). If it can't find an exact match with ALL of the user's inputted tags, it will check all (n-1)-length subsets of the tags list to see if any less selective queries have matches. It does this recursively until at least 3 matches are found. While it works fine for small tag selections (up to 5-7), it gets slow for larger tag selections (7-13), taking several seconds to return a result. When 11-13 tags are selected, it hits a Heroku error due to worker timeout.
I did some tests by putting variables inside the algorithm to count computations, and it seems that when it goes a bit deep into the recursive stack it checks a few hundred subsets each time (to see if there's an exactMatch for that subset and, if there is, add it to the results list to output), and the total number of computations roughly doubles as you add one more tag (it went 54, 150, 270, 500, 1000, 1900, 3400 operations as tags were added). It is true that there are a few hundred subsets at each depth. But exactMatches is O(1) as I've written it (no iteration), and aside from the other O(1) operations like IF, the FOR inside the subset loop will, at most, be gone through around 10 times. This agrees with the measured result of a few thousand computations each time.
This did not surprise me, as selecting and iterating over all subsets seems like something that could get expensive, but my question is why it's so slow despite only doing a few thousand computations. I know my computer operates in GHz and I expect web servers are similar, so surely a few thousand computations would be near-instantaneous? What am I missing, and how can I improve this algorithm? Any other approaches I should look into?
# takes in a list of length n and returns a list of all combos of subsets of depth n
def arbSubsets(seq, n):
    return list(itertools.combinations(seq, len(seq) - n))

# takes in a tagsList and checks Gapper.objects.all to see if any gapper has all those tags
def exactMatches(tagsList):
    tagsSet = set(tagsList)
    exactMatches = []
    for gapper in Gapper.objects.all():
        gapperSet = set(gapper.tags.names())
        if tagsSet.issubset(gapperSet):
            exactMatches.append(gapper)
    return exactMatches
# takes in a tagsList that has been cleaned to remove any tags that NO gappers have,
# then checks gapper objects to find the optimal match
counter = 0  # global counter used to measure complexity for debugging

def matchGapper(tagsList, depth, results):
    global counter

    # handles the case where we're only given tags contained by no gappers
    if depth == len(tagsList):
        return []

    counter += 1

    # we don't want too many results or it stops feeling tailored
    upper_limit_results = 3

    # now we must check subsets for a match
    subsets = arbSubsets(tagsList, depth)
    for subset in subsets:
        counter += 1
        matches = exactMatches(subset)
        if matches:
            for match in matches:
                counter += 1
                # now we need to check because we might be adding depth-2 results to
                # depth-1 results, which we didn't do before, to make sure we have at least 3 results
                if match not in results:
                    # don't want to show too many or it doesn't feel tailored anymore
                    counter += 1
                    if len(results) > upper_limit_results:
                        break
                    results.append(match)

    # always give at least 3 results
    if len(results) > 2:
        return results
    else:
        # check one level deeper (less specific) into tags if there aren't enough matching gappers
        counter += 1
        return matchGapper(tagsList, depth + 1, results)

# this is the list of matches we then return to the user
matches = matchGapper(tagsList, 0, [])
It doesn't seem that you are doing only a few hundred computation steps. In fact, you have a few hundred options at each depth, so you should not add but multiply the number of steps at each depth to estimate the complexity of your solution.
Additionally, the statement "or adapting the stable marriage problem, I don't think any of those will work because it's a small dataset" is also obviously not true. Although these algorithms may be overkill for some very simple cases, they are still valid and will work for them.
Okay, so after much fiddling with timers I've figured it out. There are a few functions at play when matching: exactMatches, matchGapper and arbSubset. When I put the counter into a global variable and measured operations (counted as lines of my code being executed), it came in at around 2-10K for large inputs (around 10 tags).
It is true that arbSubset, which returns a list of subsets, at first seems like a plausible bottleneck. But if you look closely, we are 1) handling a small number of tags (on the order of 10-50) and, more importantly, 2) only calling arbSubset when we recurse into matchGapper, which happens a maximum of about 10 times, since tagsList can only be around 10 long (order of 10-50, as above). And when I checked the time it took to generate arbSubsets, it was on the order of 2e-5 seconds, so the total time spent generating the subsets of arbitrary size is only about 2e-4 seconds. In other words, not the source of the 5-30 second waiting time in the web app.
And so with that aside, knowing that arbSubset is only called on the order of 10 times, and is fast at that, and knowing that there are only around a max of 10K computations taking place in my code it starts to become clear that I must be using some out-of-the-box function, I don't know--like set() or .issubset() or something like that--that takes a nontrivial amount of time to compute, and is executed many times. Adding some counters in some more places, it becomes clear that exactMatch() accounts for around 95-99% of all computations that take place (as would be expected if we have to check all combinations of subsets of various sizes for exactMatches).
So the problem, at this point, is reduced to the fact that exactMatch takes around 0.02 s (empirically) as implemented, and is called several thousand times. And so we can either try to make it faster by a couple of orders of magnitude (it's already pretty optimal), or take another approach that doesn't involve finding matches using subsets. A friend of mine suggested creating a dict with all the combinations of tags (so 2^len(tagsList) keys) and setting them equal to lists of registered profiles with that exact combination. This way, querying is just traversing a (huge) dict, which can be done fast. Any other suggestions are welcome.
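One concrete way to shave time off exactMatches itself, sketched using only the Gapper.objects.all() and gapper.tags.names() calls from the code above: fetch and convert every gapper's tags to a set once, outside the recursion, instead of re-querying the database and rebuilding sets on every call.

# Build the cache once (assumption: the list of gappers fits in memory
# and can be refreshed whenever registrations change).
gapper_tag_sets = [(gapper, set(gapper.tags.names())) for gapper in Gapper.objects.all()]

def exactMatches(tagsList):
    tagsSet = set(tagsList)
    # no database round-trip or set construction inside the hot loop
    return [gapper for gapper, gapperSet in gapper_tag_sets if tagsSet.issubset(gapperSet)]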
I am monitoring an RSS feed using feedparser. The feed has 100 items, and I am extracting a time stamp from each item as a unique identifier, in the form of a list of strings. This is what a single item of the list looks like:
2017-07-25T20:41:59-04:00
Next, I am doing some python magic with the other data from the feed which is parsed into lists as well (same index, different lists though) to extract the information I want. That part works well, I love it.
Now my problem: after a time delay
import time
time.sleep(60)
I'd like to monitor the feed again and see efficiently if the time stamp was observed before. If so, I'd pass on executing code and wait some more until a unique time stamp shows up.
I have failed to implement it so far. I thought about making a second list and comparing the two; each list has 100 items.
New items appear on top of the feed and move down over time. I should be fine if I only run to the first match; that should make the code more efficient than comparing everything.
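A rough sketch of that stop-at-the-first-match idea (a sketch only; it assumes both lists are ordered newest-first and the timestamps are unique):

def new_items(current_ids, previous_ids):
    """Return only the IDs that appeared since the last check."""
    seen = set(previous_ids)  # O(1) membership tests
    fresh = []
    for feed_id in current_ids:  # newest first
        if feed_id in seen:
            break  # everything after this was already observed
        fresh.append(feed_id)
    return fresh

If fresh comes back empty, nothing new has appeared and the code can simply sleep again.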
I'd be happy if someone could point me towards the solution. I am somewhat stuck; whatever I have tried has failed.
Edit:
def compare(feed_id_l, feed_id_check_l):
    # compares items in the lists and returns the indices of the first match
    for i in range(0, len(feed_id_l)):
        for j in range(0, len(feed_id_check_l)):
            if feed_id_l[i] == feed_id_check_l[j]:
                print('match for id ' + feed_id_l[i])
                return i, j
            else:
                return -1
Works and returns 0, 0 if the first feed item is unchanged.
I will have to figure out what to do with other cases, let's say 0, 6.
Cheers!
In bioinformatics, we do the following transformation an awful lot:
>>> data = {
(90,100):1,
(91,101):1,
(92,102):2,
(93,103):1,
(94,104):1
}
>>> someFunction(data)
{
90:1,
91:2,
92:4,
93:5,
94:6,
95:6,
96:6,
97:6,
98:6,
99:6,
100:6,
101:5,
102:4,
103:2,
104:1
}
Where the tuple in data is always a unique pair.
But there are many methods for doing this transform - some significantly better than others. One I have tried is:
newData = {}
for pos, values in data.iteritems():
    A, B = pos
    for i in xrange(A, B + 1):
        try:
            newData[i] += values
        except KeyError:
            newData[i] = values
This has the benefit that it's short and sweet, but I'm not actually sure it is that efficient...
I have a feeling that somehow turning the dict into a list of lists, and then doing the xrange, would save an awful lot of time. We're talking weeks of computational work per experiment. Something like this:
>>> someFunction(data)
[
[90,90,1],
[91,91,2],
[92,92,4],
[93,93,5],
[94,100,6],
[101,101,5],
[102,102,4],
[103,103,2],
[104,104,1]
]
and THEN do the for/xrange loop.
People on #Python have recommended bisect and heapy, but after struggling with bisect all day, I can't come up with a nice algorithm which I can be 100% sure will work all the time. If anyone on here could help, or even point me in the right direction, I'd be massively grateful :)
I worked out a solution last night that takes the total run time for one file from roughly 400 minutes to 251 minutes. I would post the code, but it's pretty long and likely to have bugs in the edge cases. For that reason I'll say the 'working' code can be found in the program 'rawSeQL', but the algorithmic improvements that helped the most were:
Looping over the overlapping arrays and flattening them into non-overlapping arrays with a multiplier value made an enormous difference, as xrange() no longer needs to repeat itself.
Using collections.defaultdict(int) made a big difference over the try/except loop above. collections.Counter() and OrderedDict were a LOT slower than the try/except.
I went with using bisect_left() to find where to insert the next non-overlapping piece, and it was so-so, but then adding in bisect's lo parameter to limit the range of the list it needs to check gave a sizeable reduction in compute time. If you sort the input list, your value for lo is always the last value returned by bisect, which makes this process easy :)
It is possible that heapy would provide even more benefits still - but for now the main algorithm improvements mentioned above will probably outweigh any compile-time tricks. I have 75 files to process now, which means just these three things save roughly 11,000 minutes (over a week) of compute time :)
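For reference, a minimal sketch of the defaultdict(int) variant of the original try/except loop (same result, Python 2 as in the question):

from collections import defaultdict

newData = defaultdict(int)
for (A, B), value in data.iteritems():  # Python 2 dict iteration, as above
    for i in xrange(A, B + 1):
        newData[i] += value  # missing keys start at 0 automatically

If a plain dict is needed afterwards, dict(newData) converts it back.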
Right, I'm iterating through a large binary file
I need to minimise the time of this loop:
def NB2(self, ID_LEN):
    r1 = np.fromfile(ReadFile.fid, dTypes.NB_HDR, 1)
    num_receivers = r1[0][0]
    num_channels = r1[0][1]
    num_samples = r1[0][5]
    blockReturn = np.zeros((num_samples, num_receivers, num_channels))
    for rec in range(0, num_receivers):
        for chl in range(0, num_channels):
            for smpl in range(0, num_samples):
                r2_iq = np.fromfile(ReadFile.fid, np.int16, 2)
                blockReturn[smpl, rec, chl] = np.sqrt(
                    math.fabs(r2_iq[0]) * math.fabs(r2_iq[0]) +
                    math.fabs(r2_iq[1]) * math.fabs(r2_iq[1]))
    return blockReturn
So, what's going on is as follows:
r1 is the header of the file, dTypes.NB_HDR is a type I made:
NB_HDR= np.dtype([('f3',np.uint32),('f4',np.uint32),('f5',np.uint32),('f6',np.int32),('f7',np.int32),('f8',np.uint32)])
That gets all the information about the forthcoming data block, and nicely puts us in the right position within the file (the start of the data block!).
In this data block there is:
4096 samples per channel,
4 channels per receiver,
9 receivers.
So num_receivers, num_channels, num_samples will always be the same (at the moment anyway), but as you can see this is a fairly large amount of data. Each 'sample' is a pair of int16 values that I want to find the magnitude of (hence Pythagoras).
This NB2 code is executed for each 'Block' in the file; for a 12 GB file (which is how big they are) there are about 20,900 blocks, and I've got to iterate through 1000 of these files (so, 12 TB overall). Any speed advantage, even if it's milliseconds, would be massively appreciated.
EDIT: Actually it might be of help to know how I'm moving around inside the file. I have a function as follows:
def navigateTo(self, blockNum, indexNum):
    ReadFile.fid.seek(ReadFile.fileIndex[blockNum][indexNum], 0)
    ReadFile.currentBlock = blockNum
    ReadFile.index = indexNum
Before I run all this code I scan the file and make a list of index locations at ReadFile.fileIndex that I browse using this function and then 'seek' to the absolute location - is this efficient?
Cheers
Because you know the length of a block after you read the header, read the whole block at once. Then reshape the array (very fast, only affects metadata) and use the np.hypot ufunc:
blockData = np.fromfile(ReadFile.fid, np.int16, num_receivers * num_channels * num_samples * 2)
blockData = blockData.reshape((num_receivers, num_channels, num_samples, 2))
return np.hypot(blockData[:, :, :, 0], blockData[:, :, :, 1])
On my machine it runs in 11ms per block.
import numpy as np

def NB2(self, ID_LEN):
    r1 = np.fromfile(ReadFile.fid, dTypes.NB_HDR, 1)
    num_receivers = r1[0][0]
    num_channels = r1[0][1]
    num_samples = r1[0][5]
    # first, match your array bounds to the way you are walking the file
    blockReturn = np.zeros((num_receivers, num_channels, num_samples))
    for rec in range(0, num_receivers):
        for chl in range(0, num_channels):
            # second, read in all the samples at once if you have enough memory
            r2_iq = np.fromfile(ReadFile.fid, np.int16, 2 * num_samples)
            r2_iq.shape = (-1, 2)  # tell numpy that it is an array of two values
            # widen from int16 before squaring so the products don't overflow
            r2_iq = r2_iq.astype(np.float64)
            # create the dot product vector by squaring the data elementwise, and then
            # adding those elements together; the result is of length num_samples
            r2_iq = r2_iq * r2_iq
            r2_iq = r2_iq[:, 0] + r2_iq[:, 1]
            # get the distance by performing the square root "into" blockReturn
            np.sqrt(r2_iq, out=blockReturn[rec, chl, :])
    return blockReturn
This should help your performance. There are two main ideas at work in numpy here. First, your result array's dimensions should match the way your loop dimensions are crafted, for memory locality.
Second, numpy is FAST. I've beaten hand-coded C with numpy, simply because it uses LAPACK and vector acceleration. However, to get that power you have to let it manipulate more data at a time. That is why your sample loop has been collapsed to read in all the samples for a receiver and channel in one large read. Then use the supreme vector powers of numpy to calculate your magnitude by dot product.
There is a little more optimization to be had in the magnitude calculation, but numpy recycles buffers for you, making it less important than you might think. I hope this helps!
I'd try to use as few loops and as many constants as possible. Everything that can be done in a linear fashion should be done so. If values don't change, use constants to reduce lookups and such, because that eats up CPU cycles. This is from a theoretical point of view ;-)
If possible, use highly optimised libraries. I don't exactly know what you are trying to achieve, but I'd rather use an existing FFT library than write it myself :>
One more thing: http://en.wikipedia.org/wiki/Big_O_notation (can be an eye-opener)
Most importantly, you shouldn't do file access at the lowest level of a triple nested loop, whether you do this in C or Python. You've got to read in large chunks of data at a time.
So to speed this up, read in large chunks of data at a time, and process that data using numpy indexing (that is, vectorize your code). This is particularly easy in your case since all your sample data is int16. Just read in a big chunk of data, reshape the data into an array that reflects the (receiver, channel, sample) structure, and then use the appropriate indexing to multiply and add things for Pythagoras, and the 'sum' command to add up the terms in the resulting array.
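A minimal sketch of that 'reshape, square, sum, sqrt' recipe on one block; the sizes are the ones given in the question, and the zeros array stands in for the actual np.fromfile read:

import numpy as np

num_receivers, num_channels, num_samples = 9, 4, 4096

# Stand-in for: np.fromfile(ReadFile.fid, np.int16, num_receivers * num_channels * num_samples * 2)
raw = np.zeros(num_receivers * num_channels * num_samples * 2, dtype=np.int16)

# Reflect the (receiver, channel, sample, I/Q) structure of the block
iq = raw.reshape(num_receivers, num_channels, num_samples, 2).astype(np.float64)

# Pythagoras over the last axis: square, sum the two terms, square root
magnitude = np.sqrt((iq ** 2).sum(axis=-1))  # shape (9, 4, 4096)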
This is more of an observation than a solution, but porting that function to C++ and loading it in with the Python API would get you a lot of speed gain to begin with before loop optimization.