How to speed up Python nested loops for millions of elements

I am trying to pair objects from two data sets (one contains about 0.5 million elements, the other about 2 million) that meet certain conditions, and then save information about both objects to a file. Many variables are not involved in the pairing calculation itself, but they are important for my subsequent analysis, so I need to keep track of them and save them. If there is a way to vectorize the whole analysis, it would be much faster. In the following I use random numbers as an example:
import numpy as np
from astropy import units as u
from astropy.coordinates import SkyCoord
from PyAstronomy import pyasl

RA1 = np.random.uniform(0,360,500000)
DEC1 = np.random.uniform(-90,90,500000)
d = np.random.uniform(55,2000,500000)
z = np.random.uniform(0.05,0.2,500000)
e = np.random.uniform(0.05,1.0,500000)
s = np.random.uniform(0.05,5.0,500000)

RA2 = np.random.uniform(0,360,2000000)
DEC2 = np.random.uniform(-90,90,2000000)
n = np.random.randint(10,10000,2000000)
m = np.random.randint(10,10000,2000000)

f = open('results.txt','a')
for i in range(len(RA1)):
    if i % 50000 == 0:
        print i
    ra1 = RA1[i]
    dec1 = DEC1[i]
    c1 = SkyCoord(ra=ra1*u.degree, dec=dec1*u.degree)
    for j in range(len(RA2)):
        ra2 = RA2[j]
        dec2 = DEC2[j]
        c2 = SkyCoord(ra=ra2*u.degree, dec=dec2*u.degree)
        ang = c1.separation(c2)
        sep = d[i] * ang.radian
        pa = pyasl.positionAngle(ra1, dec1, ra2, dec2)
        if sep < 1.5:
            np.savetxt(f, np.c_[ra1,dec1,sep,z[i],e[i],s[i],n[j],m[j]],
                       fmt='%1.4f %1.4f %1.4f %1.4f %1.4f %1.4f %i %i')

The basic question you need to ask yourself is: Can you reduce the dataset?
If not, I have some bad news: 500,000 * 2,000,000 = 1e12. That means you're trying to do one trillion operations.
The angular separation involves some trigonometric functions (I think cos, sin, and sqrt are involved here), so each separation will take roughly hundreds of nanoseconds up to a microsecond. Assuming each operation takes 1 µs, you'll still need about 12 days to complete this, and that assumes no Python loop or I/O overhead; I think 1 µs is a reasonable estimate for this kind of operation.
But there are certainly ways to optimize it. SkyCoord supports vectorized operations, but only in 1D:
# Create the SkyCoord for the longer array once
c2 = SkyCoord(ra=RA2*u.degree, dec=DEC2*u.degree)
# and calculate the separation from each coordinate of the shorter list
for idx, (ra, dec) in enumerate(zip(RA1, DEC1)):
    c1 = SkyCoord(ra=ra*u.degree, dec=dec*u.degree)
    # x will be the angular separation, with the length of your RA2 and DEC2 arrays
    x = c1.separation(c2)
This will already yield a speedup of several orders of magnitude:
# note that I made these MUCH shorter
RA1 = np.random.uniform(0,360,5)
DEC1 = np.random.uniform(-90,90,5)
RA2 = np.random.uniform(0,360,10)
DEC2 = np.random.uniform(-90,90,10)

def test(RA1, DEC1, RA2, DEC2):
    """Version with vectorized inner loop."""
    c2 = SkyCoord(ra=RA2*u.degree, dec=DEC2*u.degree)
    for idx, (ra, dec) in enumerate(zip(RA1, DEC1)):
        c1 = SkyCoord(ra=ra*u.degree, dec=dec*u.degree)
        x = c1.separation(c2)

def test2(RA1, DEC1, RA2, DEC2):
    """Double loop."""
    for ra, dec in zip(RA1, DEC1):
        c1 = SkyCoord(ra=ra*u.degree, dec=dec*u.degree)
        for ra, dec in zip(RA2, DEC2):
            c2 = SkyCoord(ra=ra*u.degree, dec=dec*u.degree)
            x = c1.separation(c2)
%timeit test(RA1, DEC1, RA2, DEC2) # 1 loop, best of 3: 225 ms per loop
%timeit test2(RA1, DEC1, RA2, DEC2) # 1 loop, best of 3: 2.71 s per loop
This is already 10 times as fast and it scales MUCH better:
RA1 = np.random.uniform(0,360,5)
DEC1 = np.random.uniform(-90,90,5)
RA2 = np.random.uniform(0,360,2000000)
DEC2 = np.random.uniform(-90,90,2000000)
%timeit test(RA1, DEC1, RA2, DEC2) # 1 loop, best of 3: 2.8 s per loop
# test2 scales so badly that I only use 50 elements here
RA2 = np.random.uniform(0,360,50)
DEC2 = np.random.uniform(-90,90,50)
%timeit test2(RA1, DEC1, RA2, DEC2) # 1 loop, best of 3: 11.4 s per loop
Note that by vectorizing the inner loop I was able to calculate 40,000 times as many elements in about a quarter of the time, so vectorizing the inner loop alone should make you roughly 160,000 times faster.
Here we calculated 5 x 2 million separations in about 3 seconds, i.e. roughly 300 ns per separation. At that speed you'd still need about 3 days to complete the full task.
Even if you could also vectorize away the remaining loop, I don't think it would yield a great speedup, because at that point the loop overhead is much smaller than the computation time inside each iteration. Output from line_profiler supports this statement:
Line #  Hits        Time    Per Hit  % Time  Line Contents
==============================================================
    11                                       def test(RA1, DEC1, RA2, DEC2):
    12      1      216723   216723.0    2.6      c2 = SkyCoord(ra=RA2*u.degree, dec=DEC2*u.degree)
    13      6         222       37.0    0.0      for idx, (ra, dec) in enumerate(zip(RA1, DEC1)):
    14      5      206796    41359.2    2.5          c1 = SkyCoord(ra=ra*u.degree, dec=dec*u.degree)
    15      5     7847321  1569464.2   94.9          x = c1.separation(c2)
In case it's not obvious from the Hits column: that profile is from the 5 x 2,000,000 run. For comparison, here is the profile of a 5 x 20 run of test2:
Line #  Hits        Time    Per Hit  % Time  Line Contents
==============================================================
    17                                       def test2(RA1, DEC1, RA2, DEC2):
    18      6          80       13.3    0.0      for ra, dec in zip(RA1, DEC1):
    19      5      195030    39006.0    0.6          c1 = SkyCoord(ra=ra*u.degree, dec=dec*u.degree)
    20    105        1737       16.5    0.0          for ra, dec in zip(RA2, DEC2):
    21    100     3871427    38714.3   11.8              c2 = SkyCoord(ra=ra*u.degree, dec=dec*u.degree)
    22    100    28870724   288707.2   87.6              x = c1.separation(c2)
The reason test2 scales worse is that the c2 = SkyCoord(...) part takes about 12% of the total time instead of just 2.5%, and that each single call to separation has significant overhead. So it's not really the Python loop overhead that makes it slow, but the SkyCoord constructor and the fixed per-call cost of separation.
You obviously also need to vectorize the pa calculation and the saving to file (I haven't worked with PyAstronomy and numpy.savetxt enough to advise there).
But there is still the problem that it's simply not feasible to do one trillion trigonometric operations on a normal computer.
Some additional ideas for reducing the runtime:
Use multiprocessing so that each core of your computer works in parallel; in theory this could speed things up by the number of cores (see the sketch after this list). In practice this won't be fully reachable, and I would only recommend it if you have >= 8 cores or a cluster available. Otherwise the time spent getting multiprocessing to work correctly might exceed the single-core running time, especially since multiprocessing might not work correctly on the first try and then you have to rerun the calculation.
Preprocess the elements: remove items where the RA or DEC difference alone makes a match impossible. However, if this cannot remove a significant fraction of the elements, the additional subtractions and comparisons might actually slow things down.
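For the multiprocessing idea, here is a minimal sketch (Python 3, not tested at this scale): it splits the short catalogue across worker processes and runs the vectorized inner loop from above on each chunk. It assumes the arrays RA1, DEC1, d, RA2, DEC2 from the question; the worker count of 8 and the return format are arbitrary choices.

import numpy as np
from multiprocessing import Pool
from astropy import units as u
from astropy.coordinates import SkyCoord

N_WORKERS = 8  # assumed number of cores

def process_chunk(args):
    """Vectorized inner loop for one slice of the short catalogue."""
    ra1_chunk, dec1_chunk, d_chunk, ra2, dec2 = args
    c2 = SkyCoord(ra=ra2 * u.degree, dec=dec2 * u.degree)
    matches = []
    for ra, dec, dist in zip(ra1_chunk, dec1_chunk, d_chunk):
        c1 = SkyCoord(ra=ra * u.degree, dec=dec * u.degree)
        sep = dist * c1.separation(c2).radian      # projected separation, as in the question
        matches.append(np.nonzero(sep < 1.5)[0])   # indices into RA2/DEC2 that pass the cut
    return matches

if __name__ == '__main__':
    parts = zip(np.array_split(RA1, N_WORKERS),
                np.array_split(DEC1, N_WORKERS),
                np.array_split(d, N_WORKERS))
    jobs = [(ra1, dec1, dd, RA2, DEC2) for ra1, dec1, dd in parts]
    with Pool(N_WORKERS) as pool:
        per_chunk_matches = pool.map(process_chunk, jobs)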

Here is an implementation using an in-memory buffer to reduce I/O. Note: I prefer using the io module for file input/output in order to be more compatible with Python 3; I think it is a best practice, and you won't get lower performance with it.
import io

with io.open('results.txt', 'ab') as f:  # binary mode, since the buffer holds bytes
    buf = io.BytesIO()
    for i in xrange(len(RA1)):
        if i % 50000 == 0:
            print(i)
            # flush the in-memory buffer to disk periodically
            f.write(buf.getvalue())
            buf.truncate(0)
            buf.seek(0)  # reset the position too, so the next write starts at the beginning
        ra1 = RA1[i]
        dec1 = DEC1[i]
        c1 = SkyCoord(ra=ra1 * u.degree, dec=dec1 * u.degree)
        for j in xrange(len(RA2)):
            ra2 = RA2[j]
            dec2 = DEC2[j]
            c2 = SkyCoord(ra=ra2 * u.degree, dec=dec2 * u.degree)
            ang = c1.separation(c2)
            sep = d[i] * ang.radian
            pa = pyasl.positionAngle(ra1, dec1, ra2, dec2)
            if sep < 1.5:
                np.savetxt(buf, np.c_[ra1, dec1, sep, z[i], e[i], s[i], n[j], m[j]],
                           fmt='%1.4f %1.4f %1.4f %1.4f %1.4f %1.4f %i %i')
    f.write(buf.getvalue())
Note: In Python 2, I use xrange instead of range to reduce memory usage.
The buf.truncate(0) / buf.seek(0) pair could be replaced by creating a fresh instance with buf = io.BytesIO(), which might even be more efficient.

A first way to speed things up: c2 = SkyCoord(...) is recalculated for every pair in (ra2, dec2), len(RA1) times. You can speed this up by making buffer arrays of SkyCoord objects:
f = open('results.txt','a')
C1 = [SkyCoord(ra=ra1*u.degree, dec=DEC1[i]*u.degree)
      for i, ra1 in enumerate(RA1)]
C2 = [SkyCoord(ra=ra2*u.degree, dec=DEC2[i]*u.degree)
      for i, ra2 in enumerate(RA2)]  # buffer coords

for i, c1 in enumerate(C1):  # we only need enumerate() to get i
    for j, c2 in enumerate(C2):
        ang = c1.separation(c2)  # note we don't have to construct c2 here
        if d[i] < 1.5 / ang.radian:
            # same condition as sep = d[i] * ang.radian < 1.5, just rearranged
            # the next line is only executed if the objects are close enough
            pa = pyasl.positionAngle(RA1[i], DEC1[i], RA2[j], DEC2[j])
            np.savetxt('...whatever')
You can speed things up even more by reading the SkyCoord.separation code and vectorizing it to replace SkyCoord, but I'm too lazy to do it myself. I assume that if we had two 2xN coordinate matrices x1, x2, it would look similar to this (Matlab/Octave):
cos = pdist2(x1, x2) / (sqrt(dot(x1, x1)) * sqrt(dot(x2, x2)))
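For illustration, a hand-rolled NumPy version of the angular separation via the haversine formula (a sketch only; this is my assumption of what such a vectorized replacement could look like, not SkyCoord's exact implementation):

import numpy as np

def angular_separation(ra1, dec1, ra2, dec2):
    """Angular separation in radians between one point (ra1, dec1) and an
    array of points (ra2, dec2), all given in degrees, via the haversine formula."""
    ra1, dec1, ra2, dec2 = map(np.radians, (ra1, dec1, ra2, dec2))
    h = (np.sin((dec2 - dec1) / 2.0)**2
         + np.cos(dec1) * np.cos(dec2) * np.sin((ra2 - ra1) / 2.0)**2)
    return 2.0 * np.arcsin(np.sqrt(h))

# one source at a time against the whole second catalogue:
# ang = angular_separation(RA1[i], DEC1[i], RA2, DEC2)   # shape (len(RA2),)
# sep = d[i] * ang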

Assuming you want to reduce your dataset to pairs with less than 2 degrees difference (as per your comment), you can make a mask by broadcasting (you may need to do it in chunks, but the method is the same; see the sketch at the end of this answer):
aMask=(abs(RA1[:,None]-RA2[None,:])<2)&(abs(DEC1[:,None]-DEC2[None,:])<2)
In some smaller-scale testing, this reduces the dataset to about 1/5000 of the full cross product. Then make a location array from the mask.
locs=np.where(aMask)
(array([ 0, 2, 4, ..., 4998, 4999, 4999], dtype=int32),
array([3575, 1523, 1698, ..., 4869, 1801, 2792], dtype=int32))
(from my 5k x 5k test). Dump all your other variables through this, for example d[locs[0]], to create 1D arrays that you can push through SkyCoord as per @MSeifert's answer.
When you get your outputs and compare them to 1.5, you'll get a boolean mask bmask; you can then take outlocs = locs[0][bmask] and output RA1[outlocs], etc.
I've done similar things trying to do spatial derivatives on shells for FEM analysis, where taking the full rank of comparison between all the datapoints is similarly inefficient.
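For the "may need to do it in chunks" part above, here is a minimal sketch of how the chunking could look (the chunk size is an assumption to be tuned to available memory, and it ignores RA wrap-around at 0/360):

import numpy as np

CHUNK = 100  # assumed; each chunk allocates a CHUNK x len(RA2) temporary
idx1, idx2 = [], []
for start in range(0, len(RA1), CHUNK):
    stop = start + CHUNK
    mask = np.abs(RA1[start:stop, None] - RA2[None, :]) < 2
    mask &= np.abs(DEC1[start:stop, None] - DEC2[None, :]) < 2
    ii, jj = np.nonzero(mask)
    idx1.append(ii + start)   # shift back to global indices into RA1/DEC1
    idx2.append(jj)
locs = (np.concatenate(idx1), np.concatenate(idx2))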

Used this way, savetxt is essentially:
astr = fmt % (ra1,dec1,sep,z[i],e[i],s[i],n[j],m[j])
astr += '\n' # or include in fmt
f.write(astr)
that is, it just writes a formatted line to the file.


Recurrent sequence task

Given the sequence f0, f1, f2, ... defined by the recurrence relations f0 = 0, f1 = 1, f2 = 2 and fk = f(k-1) + f(k-3),
write a program that calculates the elements of this sequence for the n given indices k1, k2, ..., kn.
Input format
The first line of the input contains an integer n (1 <= n <= 1000).
The second line contains n non-negative integers ki (0 <= ki <= 16000), separated by spaces.
Output format
Output the values fk1, fk2, ..., fkn, separated by spaces.
Memory limit: 10 MB
Time limit: 1 second
The problem is that for large values the recursive function exceeds these limits.
def f(a):
    if a <= 2:
        return a
    return f(a - 1) + f(a - 3)

n = int(input())
nums = list(map(int, input().split()))
for i in range(len(nums)):
    if i < len(nums) - 1:
        print(f(nums[i]), end=' ')
    else:
        print(f(nums[i]))
I also tried an iterative solution (a loop), but it does not fit within the time limit (1 second):
fk1 = 0
fk2 = 0
fk3 = 0
n = int(input())
nums = list(map(int, input().split()))
a = []
for i in range(len(nums)):
    itog = 0
    for j in range(1, nums[i] + 1):
        if j <= 2:
            itog = j
        else:
            if j == 3:
                itog = 0 + 2
                fk1 = itog
                fk2 = 2
                fk3 = 1
            else:
                itog = fk1 + fk3
                fk1, fk2, fk3 = itog, fk1, fk2
    if i < len(nums) - 1:
        print(itog, end=' ')
    else:
        print(itog)
How else can you solve this problem so that it is optimal in time and memory?
Concerning memory, the best solution probably is the iterative one; I think you are not far from the answer. The idea is to first check for the simple cases f(k) = k (i.e., k <= 2); for all other cases k > 2 you can simply compute f(i) from (f(i-3), f(i-2), f(i-1)) until i = k. All you need to do during this process is keep track of the last three values (similar to what you did in the line fk1, fk2, fk3 = itog, fk1, fk2).
On the other hand, there is one thing you need to change. If you perform the computations of fk1, fk2, ..., fkn independently, you are in trouble (unless you use a very fast machine or a Cython implementation). But there is no reason to perform n independent computations: you can just compute f(x) for x = max(k1, k2, ..., kn), and along the way store every requested answer fk1, fk2, ..., fkn (this slows down the single computation of f(x) a little, but instead of doing the work n times you do it only once). This way it can be solved in under 1 second even for n = 1000.
On my machine, independent calculations of f(15000), f(15001), ..., f(16000) take roughly 30 s; the "all at once" solution takes roughly 0.035 s.
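A minimal sketch of that "all at once" idea (my own illustration, not the poster's code): advance the recurrence once up to max(k1, ..., kn) and record only the requested indices along the way.

def solve(ks):
    wanted = set(ks)
    memo = {k: k for k in (0, 1, 2) if k in wanted}   # base cases: f(k) = k
    f0, f1, f2 = 0, 1, 2                              # f(k-2), f(k-1), f(k) with k = 2
    k, kmax = 2, max(ks)
    while k < kmax:
        f0, f1, f2 = f1, f2, f2 + f0                  # f(k+1) = f(k) + f(k-2)
        k += 1
        if k in wanted:
            memo[k] = f2
    return [memo[k] for k in ks]

n = int(input())      # n is given by the input format but not needed beyond reading the line
nums = list(map(int, input().split()))
print(' '.join(map(str, solve(nums))))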
Honestly, this is not such an easy exercise; once you have a working solution, it would be interesting to post it on a site like Code Review to get some feedback :).
First, sort the numbers. Then calculate the values of the sequence one by one, for example four terms per pass:

while True:
    # assuming a0, a1, a2 hold three consecutive terms; each pass advances by four terms
    a3 = a2 + a0
    a0 = a3 + a1
    a1 = a0 + a2
    a2 = a1 + a3

Lastly, return the values in the original order. To do that you have to remember the position of every number: from [45, 22, 14, 33] make [[45,0], [22,1], [14,2], [33,3]], then sort it, calculate the values, and replace each argument with its value, giving [[f45,0], [f22,1], [f14,2], [f33,3]]; finally sort by the second element to restore the original order.
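A rough sketch of this sort-and-restore idea (my own illustration; the function name and the three-variable update are my choices):

def solve_sorted(ks):
    order = sorted(range(len(ks)), key=lambda i: ks[i])  # remember original positions
    answers = [0] * len(ks)
    f0, f1, f2 = 0, 1, 2                                 # f(k-2), f(k-1), f(k) with k = 2
    k = 2
    for pos in order:
        target = ks[pos]
        if target <= 2:
            answers[pos] = target                        # base cases: f(k) = k
            continue
        while k < target:                                # advance f(k+1) = f(k) + f(k-2)
            f0, f1, f2 = f1, f2, f2 + f0
            k += 1
        answers[pos] = f2
    return answers

n = int(input())
nums = list(map(int, input().split()))
print(' '.join(map(str, solve_sorted(nums))))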

What is the Big O Complexity of Reversing the Order of Columns in Pandas DataFrame?

So let's say I have a DataFrame in pandas with m rows and n columns. Let's also say that I want to reverse the order of the columns, which can be done with the following code:
df_reversed = df[df.columns[::-1]]
What is the Big O complexity of this operation? I'm assuming this would depend on the number of columns, but would it also depend on the number of rows?
I don't know how Pandas implements this, but I did test it empirically. I ran the following code (in a Jupyter notebook) to test the speed of the operation:
def get_dummy_df(n):
    return pd.DataFrame({'a': [1,2]*n, 'b': [4,5]*n, 'c': [7,8]*n})
df = get_dummy_df(100)
print df.shape
%timeit df_r = df[df.columns[::-1]]
df = get_dummy_df(1000)
print df.shape
%timeit df_r = df[df.columns[::-1]]
df = get_dummy_df(10000)
print df.shape
%timeit df_r = df[df.columns[::-1]]
df = get_dummy_df(100000)
print df.shape
%timeit df_r = df[df.columns[::-1]]
df = get_dummy_df(1000000)
print df.shape
%timeit df_r = df[df.columns[::-1]]
df = get_dummy_df(10000000)
print df.shape
%timeit df_r = df[df.columns[::-1]]
The output was:
(200, 3)
1000 loops, best of 3: 419 µs per loop
(2000, 3)
1000 loops, best of 3: 425 µs per loop
(20000, 3)
1000 loops, best of 3: 498 µs per loop
(200000, 3)
100 loops, best of 3: 2.66 ms per loop
(2000000, 3)
10 loops, best of 3: 25.2 ms per loop
(20000000, 3)
1 loop, best of 3: 207 ms per loop
As you can see, in the first 3 cases the overhead of the operation is what takes most of the time (400-500 µs), but from the 4th case onwards the time starts to be proportional to the amount of data, increasing by an order of magnitude each time.
So, assuming the time is also proportional to the number of columns n, it seems that we are dealing with O(m*n).
The Big O complexity (as of Pandas 0.24) is m*n, where m is the number of columns and n is the number of rows. Note that this is when using the DataFrame.__getitem__ method (aka []) with an Index (see the relevant code; other key types would also trigger a copy).
Here is a helpful stack trace:
<ipython-input-4-3162cae03863>(2)<module>()
1 columns = df.columns[::-1]
----> 2 df_reversed = df[columns]
pandas/core/frame.py(2682)__getitem__()
2681 # either boolean or fancy integer index
-> 2682 return self._getitem_array(key)
2683 elif isinstance(key, DataFrame):
pandas/core/frame.py(2727)_getitem_array()
2726 indexer = self.loc._convert_to_indexer(key, axis=1)
-> 2727 return self._take(indexer, axis=1)
2728
pandas/core/generic.py(2789)_take()
2788 axis=self._get_block_manager_axis(axis),
-> 2789 verify=True)
2790 result = self._constructor(new_data).__finalize__(self)
pandas/core/internals.py(4539)take()
4538 return self.reindex_indexer(new_axis=new_labels, indexer=indexer,
-> 4539 axis=axis, allow_dups=True)
4540
pandas/core/internals.py(4421)reindex_indexer()
4420 new_blocks = self._slice_take_blocks_ax0(indexer,
-> 4421 fill_tuple=(fill_value,))
4422 else:
pandas/core/internals.py(1254)take_nd()
1253 new_values = algos.take_nd(values, indexer, axis=axis,
-> 1254 allow_fill=False)
1255 else:
> pandas/core/algorithms.py(1658)take_nd()
1657 import ipdb; ipdb.set_trace()
-> 1658 func = _get_take_nd_function(arr.ndim, arr.dtype, out.dtype, axis=axis,
1659 mask_info=mask_info)
1660 func(arr, indexer, out, fill_value)
The func call on L1660 in pandas/core/algorithms.py ultimately calls a Cython function with O(m * n) complexity. This is where the original data is copied into out, which ends up holding a copy of the data with the columns reversed.
inner_take_2d_axis0_template = """\
    cdef:
        Py_ssize_t i, j, k, n, idx
        %(c_type_out)s fv

    n = len(indexer)
    k = values.shape[1]

    fv = fill_value

    IF %(can_copy)s:
        cdef:
            %(c_type_out)s *v
            %(c_type_out)s *o

        #GH3130
        if (values.strides[1] == out.strides[1] and
            values.strides[1] == sizeof(%(c_type_out)s) and
            sizeof(%(c_type_out)s) * n >= 256):

            for i from 0 <= i < n:
                idx = indexer[i]
                if idx == -1:
                    for j from 0 <= j < k:
                        out[i, j] = fv
                else:
                    v = &values[idx, 0]
                    o = &out[i, 0]
                    memmove(o, v, <size_t>(sizeof(%(c_type_out)s) * k))
            return

    for i from 0 <= i < n:
        idx = indexer[i]
        if idx == -1:
            for j from 0 <= j < k:
                out[i, j] = fv
        else:
            for j from 0 <= j < k:
                out[i, j] = %(preval)svalues[idx, j]%(postval)s
"""
Note that in the above template function there is a path that uses memmove (which is the path taken in this case, because we are mapping from int64 to int64 and the output dimensions are identical, since we are just reordering the indexes). Note that memmove is still O(n), proportional to the number of bytes it has to copy, although it is likely faster than writing to the indexes one by one.
I ran an empirical test using the big_O fitting library.
Note: all tests were conducted by sweeping the independent variable over 6 orders of magnitude (rows from 10 to 10^6 with a constant column count of 3, and columns from 10 to 10^6 with a constant row count of 10).
The result reports the complexity of the column-reverse operation df[df.columns[::-1]] as:
Cubic: O(n^3), where n is the number of rows
Cubic: O(n^3), where n is the number of columns
Prerequisites: you will need to install big_o using the terminal command pip install big_o
Code
import big_o
import pandas as pd
import numpy as np

SWEAP_LOG10 = 6
COLUMNS = 3
ROWS = 10

def build_df(rows, columns):
    # To isolate the creation of the DataFrame from the inversion operation.
    narray = np.zeros(rows*columns).reshape(rows, columns)
    df = pd.DataFrame(narray)
    return df

def flip_columns(df):
    return df[df.columns[::-1]]

def get_row_df(n, m=COLUMNS):
    return build_df(1*10**n, m)

def get_column_df(n, m=ROWS):
    return build_df(m, 1*10**n)

# infer the big_o on the columns[::-1] operation vs. rows
best, others = big_o.big_o(flip_columns, get_row_df, min_n=1, max_n=SWEAP_LOG10,
                           n_measures=SWEAP_LOG10, n_repeats=10)

# print results
print('Measuring .columns[::-1] complexity against rapid increase in # rows')
print('-'*80 + '\nBig O() fits: {}\n'.format(best) + '-'*80)
for class_, residual in others.items():
    print('{:<60s} (res: {:.2G})'.format(str(class_), residual))
print('-'*80)

# infer the big_o on the columns[::-1] operation vs. columns
best, others = big_o.big_o(flip_columns, get_column_df, min_n=1, max_n=SWEAP_LOG10,
                           n_measures=SWEAP_LOG10, n_repeats=10)

# print results
print()
print('Measuring .columns[::-1] complexity against rapid increase in # columns')
print('-'*80 + '\nBig O() fits: {}\n'.format(best) + '-'*80)
for class_, residual in others.items():
    print('{:<60s} (res: {:.2G})'.format(str(class_), residual))
print('-'*80)
Results
Measuring .columns[::-1] complexity against rapid increase in # rows
--------------------------------------------------------------------------------
Big O() fits: Cubic: time = -0.017 + 0.00067*n^3
--------------------------------------------------------------------------------
Constant: time = 0.032 (res: 0.021)
Linear: time = -0.051 + 0.024*n (res: 0.011)
Quadratic: time = -0.026 + 0.0038*n^2 (res: 0.0077)
Cubic: time = -0.017 + 0.00067*n^3 (res: 0.0052)
Polynomial: time = -6.3 * x^1.5 (res: 6)
Logarithmic: time = -0.026 + 0.053*log(n) (res: 0.015)
Linearithmic: time = -0.024 + 0.012*n*log(n) (res: 0.0094)
Exponential: time = -7 * 0.66^n (res: 3.6)
--------------------------------------------------------------------------------
Measuring .columns[::-1] complexity against rapid increase in # columns
--------------------------------------------------------------------------------
Big O() fits: Cubic: time = -0.28 + 0.009*n^3
--------------------------------------------------------------------------------
Constant: time = 0.38 (res: 3.9)
Linear: time = -0.73 + 0.32*n (res: 2.1)
Quadratic: time = -0.4 + 0.052*n^2 (res: 1.5)
Cubic: time = -0.28 + 0.009*n^3 (res: 1.1)
Polynomial: time = -6 * x^2.2 (res: 16)
Logarithmic: time = -0.39 + 0.71*log(n) (res: 2.8)
Linearithmic: time = -0.38 + 0.16*n*log(n) (res: 1.8)
Exponential: time = -7 * 1^n (res: 9.7)
--------------------------------------------------------------------------------

Function speed improvement: Convert ints to list of 32bit ints

I'm looking for speedy alternatives to my function. The goal is to produce a list of 32-bit integers from integers of arbitrary length; the bit length of each value is given explicitly as a tuple of (value, bitlength). This is part of a bit-banging procedure for an asynchronous interface that takes four 32-bit integers per bus transaction.
All ints are unsigned (positive or zero), and the length can vary between 0 and 2000.
My input is a list of these tuples; the output should be integers with an implicit 32-bit length, with the bits in sequential order. The remaining bits that do not fit into 32 should also be returned.
input:  [(0,128),(1,12),(0,32)]
output: [0, 0, 0, 0, 0x100000], 0, 12
I've spent a day or two profiling with cProfile and trying different methods, but I seem to be stuck with functions that process ~100k tuples per second, which is rather slow. Ideally I would like a 10x speedup, but I don't have enough experience to know where to start. The ultimate speed goal is to handle more than 4M tuples per second.
Thanks for any help or suggestions.
The fastest I can do is:
def foo(tuples):
    '''make a list of tuples of (int, length) into a list of 32 bit integers [1,2,3]'''
    length = 0
    remlen = 0
    remint = 0
    i32list = []
    for a, b in tuples:
        n = (remint << (32-remlen)) | a  # n = (a << (remlen)) | remint
        length += b
        if length > 32:
            len32 = int(length/32)
            for i in range(len32):
                i32list.append((n >> i*32) & 0xFFFFFFFF)
            remint = n >> (len32*32)
            remlen = length - len32*32
            length = remlen
        elif length == 32:
            appint = n & 0xFFFFFFFF
            remint = 0
            remlen = 0
            length -= 32
            i32list.append(appint)
        else:
            remint = n
            remlen = length
    return i32list, remint, remlen
This has very similar performance:

def tpli_2_32ili(tuples):
    '''make a list of tuples of (int, length) into a list of 32 bit integers [1,2,3]'''
    # binarylist = "".join([np.binary_repr(a, b) for a, b in inp])  # bin(a)[2:].rjust(b, '0')
    binarylist = "".join([bin(a)[2:].rjust(b, '0') for a, b in tuples])
    totallength = len(binarylist)
    tot32 = int(totallength/32)
    i32list = [int(binarylist[i:i+32],2) for i in range(0, tot32*32, 32)]
    remlen = totallength - tot32*32
    remint = int(binarylist[-remlen:],2)
    return i32list, remint, remlen
The best I could come up with so far is a 25% speed-up
from functools import reduce

intMask = 0xffffffff

def f(x, y):
    return (x[0] << y[1]) + y[0], x[1] + y[1]

def jens(input):
    n, length = reduce(f, input, (0, 0))
    remainderBits = length % 32
    intBits = length - remainderBits
    remainder = ((n & intMask) << (32 - remainderBits)) >> (32 - remainderBits)
    n >>= remainderBits
    ints = [(n >> i) & intMask for i in range(intBits-32, -32, -32)]  # shift each 32-bit word down before masking
    return ints, remainderBits, remainder

print([hex(x) for x in jens([(0,128),(1,12),(0,32)])[0]])
It uses a single arbitrary-precision integer to accumulate the tuple values according to their bit lengths, and then extracts the 32-bit words and the remaining bits from this number. The computation of the overall length (summing the length values of the input tuples) and of the accumulated value are done in a single pass with reduce, so the loop itself runs inside reduce rather than as an explicit Python loop.
Running matineau's benchmark harness (see the answer below), the best numbers I have seen are:
Fastest to slowest execution speeds using Python 3.6.5
(1,000 executions, best of 3 repetitions)
jens : 0.004151 secs, rel speed 1.00x, 0.00% slower
First snippet : 0.005259 secs, rel speed 1.27x, 26.70% slower
Second snippet : 0.008328 secs, rel speed 2.01x, 100.64% slower
You could probably gain a better speed-up if you use some C extension implementing a bit array.
This isn't an answer with a faster implementation. Instead, it's the code from the two snippets in your question placed within an extensible benchmarking framework that makes comparing different approaches very easy.
Comparing just those two test cases indicates that your second approach does not have very similar performance to the first, based on the output shown; in fact, it's almost twice as slow.
from collections import namedtuple
import sys
from textwrap import dedent
import timeit
import traceback

N = 1000  # Number of executions of each "algorithm".
R = 3     # Number of repetitions of those N executions.

# Common setup for all testcases (executed before any algorithm specific setup).
COMMON_SETUP = dedent("""
    # Import any resources needed defined in outer benchmarking script.
    #from __main__ import ???  # Not needed at this time
""")

class TestCase(namedtuple('CodeFragments', ['setup', 'test'])):
    """ A test case is composed of separate setup and test code fragments. """
    def __new__(cls, setup, test):
        """ Dedent code fragment in each string argument. """
        return tuple.__new__(cls, (dedent(setup), dedent(test)))

testcases = {
    "First snippet": TestCase("""
        def foo(tuples):
            '''make a list of tuples of (int, length) into a list of 32 bit integers [1,2,3]'''
            length = 0
            remlen = 0
            remint = 0
            i32list = []
            for a, b in tuples:
                n = (remint << (32-remlen)) | a  # n = (a << (remlen)) | remint
                length += b
                if length > 32:
                    len32 = int(length/32)
                    for i in range(len32):
                        i32list.append((n >> i*32) & 0xFFFFFFFF)
                    remint = n >> (len32*32)
                    remlen = length - len32*32
                    length = remlen
                elif length == 32:
                    appint = n & 0xFFFFFFFF
                    remint = 0
                    remlen = 0
                    length -= 32
                    i32list.append(appint)
                else:
                    remint = n
                    remlen = length
            return i32list, remint, remlen
        """, """
        foo([(0,128),(1,12),(0,32)])
        """
    ),
    "Second snippet": TestCase("""
        def tpli_2_32ili(tuples):
            '''make a list of tuples of (int, length) into a list of 32 bit integers [1,2,3]'''
            binarylist = "".join([bin(a)[2:].rjust(b, '0') for a, b in tuples])
            totallength = len(binarylist)
            tot32 = int(totallength/32)
            i32list = [int(binarylist[i:i+32],2) for i in range(0, tot32*32, 32)]
            remlen = totallength - tot32*32
            remint = int(binarylist[-remlen:],2)
            return i32list, remint, remlen
        """, """
        tpli_2_32ili([(0,128),(1,12),(0,32)])
        """
    ),
}

# Collect timing results of executing each testcase multiple times.
try:
    results = [
        (label,
         min(timeit.repeat(testcases[label].test,
                           setup=COMMON_SETUP + testcases[label].setup,
                           repeat=R, number=N)),
        ) for label in testcases
    ]
except Exception:
    traceback.print_exc(file=sys.stdout)  # direct output to stdout
    sys.exit(1)

# Display results.
major, minor, micro = sys.version_info[:3]
print('Fastest to slowest execution speeds using Python {}.{}.{}\n'
      '({:,d} executions, best of {:d} repetitions)'.format(major, minor, micro, N, R))
print()

longest = max(len(result[0]) for result in results)  # length of longest label
ranked = sorted(results, key=lambda t: t[1])         # ascending sort by execution time
fastest = ranked[0][1]
for result in ranked:
    print('{:>{width}} : {:9.6f} secs, rel speed {:5,.2f}x, {:8,.2f}% slower '
          ''.format(
              result[0], result[1], round(result[1]/fastest, 2),
              round((result[1]/fastest - 1) * 100, 2),
              width=longest))
Output:
Fastest to slowest execution speeds using Python 3.6.5
(1,000 executions, best of 3 repetitions)
First snippet : 0.003024 secs, rel speed 1.00x, 0.00% slower
Second snippet : 0.005085 secs, rel speed 1.68x, 68.13% slower

Can a Cython/Numba compiled function improve on numpy.max(numpy.abs(a-b))?

I am optimizing a bottleneck section of my code: iterating a function a' = f(a), where a and a' are N-by-1 vectors, until max(abs(a' - a)) is sufficiently small.
I have put a Numba wrapper on f(a) and got a nice speedup over the most optimized pure NumPy version I was able to produce (it cut the runtime by about 50%).
I tried writing a C-compatible version of numpy.max(numpy.abs(aprime - a)), but it turns out this is slower! I actually lose back ALL of the gains I got from Numba-fying the first portion of the iteration!
Is there likely to be a way for Numba or Cython to improve upon numpy.max(numpy.abs(aprime - a))? I reproduce my code below for reference, where a is P0 and a' is Pprime:
EDIT: For me, it seems that it is important to "flatten()" the inputs to "maxabs()". When I do this, the performance is no worse than NumPy. Then, when I do a "dry run" of the function outside the timing brackets as JoshAdel suggested, the loop with "maxabs" does slightly better than the loop with numpy.max(numpy.abs()).
from numba import jit
import numpy as np

### Preliminaries, to make the working example fully functional
n = 1200
Gammer = np.exp(-np.random.rand(n,n))
alpher = np.ones((n,1))
xxer = 10000*np.random.rand(n,1)
chii = 6.5
varkappa = 6.5
phi3 = 1.5
A = .5
sig = .2
mmer = np.dot(Gammer,xxer**phi3)
totalprod = A*alpher + (1-A)*mmer
Gammerchii = Gammer**chii
Gammerrats = Gammerchii[:,0].flatten()/Gammerchii[0,:].flatten()
Gammerrats[(Gammerchii[0,:].flatten() == 0) | (Gammerchii[:,0].flatten() == 0)] = 1.
P0 = (Gammerrats*(xxer[0]/totalprod[0])*(totalprod/xxer).flatten())**(1/(1+2*chii))
P0 *= n/np.sum(P0)
### End of preliminaries

### This is the function to produce a' = f(a)
@jit
def Piteration(P0, chii, sig, n, xxer, totalprod, Gammerrats, Gammerchii):
    Mac = np.zeros((n,))
    Pprime = np.zeros((n,))
    themacpow = 1-(1/chii)*(sig/(1-sig))
    specialchiipow = 1/(1+2*chii)
    Psum = 0.
    for i in range(n):
        for j in range(n):
            Mac[j] += ((P0[i]/P0[j])**chii)*Gammerchii[i,j]*totalprod[j]
    for i in range(n):
        Pprime[i] = (Gammerrats[i]*(xxer[0]/totalprod[0])*(totalprod[i]/xxer[i])*((Mac[i]/Mac[0])**themacpow))**specialchiipow
        Psum += Pprime[i]
    Psum = n/Psum
    for i in range(n):
        Pprime[i] *= Psum
    return Pprime

### This is the function to find max(abs(aprime - a))
@jit
def maxabs(vec1, vec2, n):
    themax = 0.
    curdiff = 0.
    for i in range(n):
        curdiff = vec1[i] - vec2[i]
        if curdiff < 0:
            curdiff *= -1
        if curdiff > themax:
            themax = curdiff
    return themax

### This is the main loop
diff = 1000.
while diff > 1e-2:
    Pprime = Piteration(P0.flatten(), chii, sig, n, xxer.flatten(), totalprod.flatten(),
                        Gammerrats.flatten(), Gammerchii)
    diff = maxabs(P0.flatten(), Pprime.flatten(), n)
    P0 = 1.*Pprime
When I time your maxabs function vs np.max(np.abs(vec1 - vec2)) for an array of shape (1200,), the numba version is ~2.6x faster using numba 0.32.0.
When you time the code, make sure you run your function once before timing it, so that you don't include the time it takes to jit-compile the code, which you only pay on the first call. In general, using timeit and running it multiple times takes care of this. I'm not sure how you did the timing, though, since I see almost no difference between using maxabs and the numpy call; most of the runtime seems to be in the call to Piteration.
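For example, a timing sketch along those lines (my own illustration; it assumes maxabs as defined in the question, and the array contents and repeat count are arbitrary):

import timeit
import numpy as np

vec1 = np.random.rand(1200)
vec2 = np.random.rand(1200)

maxabs(vec1, vec2, 1200)  # warm-up call so the jit compilation is not included in the timing

t_numba = timeit.timeit(lambda: maxabs(vec1, vec2, 1200), number=10000)
t_numpy = timeit.timeit(lambda: np.max(np.abs(vec1 - vec2)), number=10000)
print("numba: %.3g s, numpy: %.3g s" % (t_numba, t_numpy))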

Python: optimising loops

I wish to optimise some Python code consisting of two nested loops. I am not so familiar with numpy, but I understand it should enable me to improve the efficiency of such a task. Below is a test code that reflects what happens in the actual code. Currently, using the numpy range and iterator is slower than the usual Python one. What am I doing wrong? What is the best solution to this problem?
Thanks for your help!
import numpy
import time

# set up a problem analogous to that in the real code
npoints_per_plane = 1000
nplanes = 64
naxis = 1000
npoints3d = naxis + npoints_per_plane * nplanes
npoints = naxis + npoints_per_plane
specres = 1000

# this is where the data is being mapped to
sol = dict()
sol["ems"] = numpy.zeros(npoints3d)
sol["abs"] = numpy.zeros(npoints3d)

# this would normally be non-random input data
data = dict()
data["ems"] = numpy.zeros((npoints,specres))
data["abs"] = numpy.zeros((npoints,specres))
for ip in range(npoints):
    data["ems"][ip,:] = numpy.random.random(specres)[:]
    data["abs"][ip,:] = numpy.random.random(specres)[:]
ems_mod = numpy.random.random(1)[0]
abs_mod = numpy.random.random(1)[0]
ispec = numpy.random.randint(specres)

# this is the code I want to optimize
t0 = time.time()

# usual python range and iterator
for ip in range(npoints_per_plane):
    jp = naxis + ip
    for ipl in range(nplanes):
        ip3d = jp + npoints_per_plane * ipl
        sol["ems"][ip3d] = data["ems"][jp,ispec] * ems_mod
        sol["abs"][ip3d] = data["abs"][jp,ispec] * abs_mod

t1 = time.time()

# numpy ranges and iterator
ip_vals = numpy.arange(npoints_per_plane)
ipl_vals = numpy.arange(nplanes)
for ip in numpy.nditer(ip_vals):
    jp = naxis + ip
    for ipl in numpy.nditer(ipl_vals):
        ip3d = jp + npoints_per_plane * ipl
        sol["ems"][ip3d] = data["ems"][jp,ispec] * ems_mod
        sol["abs"][ip3d] = data["abs"][jp,ispec] * abs_mod

t2 = time.time()

print "plain python: %0.3f seconds" % ( t1 - t0 )
print "numpy: %0.3f seconds" % ( t2 - t1 )
Edit: put "jp = naxis + ip" in the first (outer) for loop only.
Additional note:
I worked out how to get numpy to quickly do the inner loop, but not the outer loop:
# numpy vectorization
for ip in xrange(npoints_per_plane):
    jp = naxis + ip
    sol["ems"][jp:jp+npoints_per_plane*nplanes:npoints_per_plane] = data["ems"][jp,ispec] * ems_mod
    sol["abs"][jp:jp+npoints_per_plane*nplanes:npoints_per_plane] = data["abs"][jp,ispec] * abs_mod
Joe's solution below shows how to do both together, thanks!
The best way of writing loops in numpy is not writing loops and instead using vectorized operations. For example:
c = 0
for i in range(len(a)):
    c += a[i] + b[i]

becomes

c = np.sum(a + b, axis=0)
For a and b with a shape of (100000, 100) this takes 0.344 seconds in the first variant, and 0.062 seconds in the second.
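For reference, a sketch of the kind of setup behind those numbers (the array contents and repeat count here are my assumptions):

import numpy as np
import timeit

a = np.random.rand(100000, 100)
b = np.random.rand(100000, 100)

def loop_sum(a, b):
    c = 0
    for i in range(len(a)):
        c += a[i] + b[i]
    return c

print(timeit.timeit(lambda: loop_sum(a, b), number=10) / 10)          # per-run time, loop version
print(timeit.timeit(lambda: np.sum(a + b, axis=0), number=10) / 10)   # per-run time, vectorized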
In the case presented in your question the following does what you want:
sol['ems'][naxis:] = numpy.ravel(
numpy.repeat(
data['ems'][naxis:,ispec,numpy.newaxis] * ems_mod,
nplanes,
axis=1
),
order='F'
)
This could be further optimized with some tricks, but that would reduce clarity and is probably premature optimization because:
plain python: 0.064 seconds
numpy: 0.002 seconds
The solution works as follows:
Your original version contains jp = naxis + ip, which merely skips the first naxis elements; [naxis:] selects all but the first naxis elements. Your inner loop repeats the value of data[jp, ispec] nplanes times and writes it to multiple locations ip3d = jp + npoints_per_plane * ipl, which is equivalent to writing into a flattened 2D array offset by naxis. Therefore a second dimension is added via numpy.newaxis to the (previously 1D) data['ems'][naxis:, ispec], the values are repeated nplanes times along this new dimension via numpy.repeat, and the resulting 2D array is flattened again via numpy.ravel (in Fortran order, i.e. with the lowest axis having the smallest stride) and written into the appropriate subarray of sol['ems']. If the target array were actually 2D, the repeat could be skipped entirely by using automatic array broadcasting.
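To illustrate that last point about broadcasting, a small hypothetical sketch (sol2d is an assumed 2D stand-in for the per-plane part of sol['ems'], not a variable from the question):

import numpy

# shape (npoints_per_plane, nplanes): one column per plane
sol2d = numpy.empty((npoints_per_plane, nplanes))
# broadcasting the (npoints_per_plane, 1) column over all nplanes columns
sol2d[:, :] = (data['ems'][naxis:, ispec] * ems_mod)[:, numpy.newaxis]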
If you run into a situation where you cannot avoid using loops, you could use Cython (which supports efficient buffer views on numpy arrays).
