Fast way to update integer indexed slice in numpy - python

I have a rather big numpy array, and I'm proceeding to clean up the data (this is pseudocode):
import numpy as np

arr = np.ones(shape=(int(5E5), 1000), dtype=float)
to_zero = np.arange(500, 1000, 2)  # Normally this is a function that finds the
                                   # columns to be zeroed; the indexes are not contiguous
arr[:, to_zero] = 0
The problem is that arr[:, to_zero] = 0 takes a very long time: 4 s in this example, while arr[:, :500] = 0 takes about 500 ms.
Is there any way to make this faster?
Example numpy notebooks for numpy 1.8.1 and numpy 1.9 beta (see timing results).
As @Jaime pointed out, using a newer numpy would be a good bet in the long term.

From numpy's internal storage point of view it is very fast to reset a large chunk of contiguous memory. With your example the problem is that you are jumping around the memory like a mad rabbit, as you are zeroing columns.
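As a quick illustration of why column writes jump around (a minimal sketch I am adding here, with a much smaller array than yours): in a C-ordered array the elements of a row sit next to each other in memory, while consecutive elements of a column are a whole row apart, which the strides attribute makes visible.

import numpy as np

arr = np.ones((4, 1000), dtype=float)  # small stand-in for the big array
print(arr.flags['C_CONTIGUOUS'])       # True: each row is one contiguous block
print(arr.strides)                     # (8000, 8): moving down a column jumps 8000 bytes, along a row only 8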
I tried turning your array the other way round, and the speed increases by almost a factor of 3: I get 4.08 s for your version and 1.57 s for the transposed version. So that at least is a good optimisation, if you can arrange your data that way.
There may be something fishy in numpy with this indexing, because actually doing:
for c in to_zero:
    arr[:, c] = 0
is faster than using the list notation.
So, I ran a few different alternatives:
to_zero = numpy.arange(0, 500, 2)
# 1. table in the original orientation
arr = numpy.ones((500000, 1000), dtype='float')
# 1.1. basic case: 4.08 s
arr[:, to_zero] = 0
# 1.2. a bit different notation: 234 ms
arr[:, 0:500:2] = 0
# 1.3. for loop: 2.75 s
for c in to_zero:
    arr[:, c] = 0
# 2. transposed table
arr = numpy.ones((1000, 500000), dtype='float')
# 2.1. basic case: 1.47 s
arr[to_zero,:] = 0
# 2.2. a bit different notation: 105 ms
arr[0:500:2,:] = 0
# 2.3. for loop: 112 ms
for r in to_zero:
    arr[r, :] = 0
These have been timed with IPython's %timeit, so the results may vary a bit from one run to another, but there is a clear pattern: transpose your table and use a loop.

The problem has to do with the way the data is laid out in memory.
To work around it, store the rows as columns and the columns as rows:
arr = np.ones(shape=(1000, int(5E5)), dtype=float)
to_zero = np.arange(500, 1000, 2)  # Normally this is a function that finds the
                                   # columns to be zeroed; the indexes are not contiguous
arr[to_zero, :] = 0
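If the rest of your code expects the original orientation, a minimal sketch of how the transposed storage could still be used (an assumption about your workflow, not part of the answer above): keep the data stored transposed, zero whole rows cheaply, and take the .T view, which copies nothing, wherever column-oriented access is needed.

import numpy as np

store = np.ones(shape=(1000, int(5E5)), dtype=float)  # transposed storage
to_zero = np.arange(500, 1000, 2)
store[to_zero, :] = 0   # contiguous row writes, fast

arr = store.T           # a view with the original (5E5, 1000) shape, no copy is made
print(arr.shape)        # (500000, 1000); the columns listed in to_zero are now zero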

Related

Speeding up numpy operations

Using a 2D numpy array, I want to create a new array that expands the original one using a moving window. Let me explain what I mean using an example code:
# Simulate some data
import numpy as np
np.random.seed(1)
t = 20000 # total observations
location = np.random.randint(1, 5, (t,1))
var_id = np.random.randint(1, 8, (t,1))
hour = np.repeat(np.arange(0, (t/5)), 5).reshape(-1,1)
value = np.random.rand(t,1)
df = np.concatenate((location,var_id,hour,value),axis = 1)
Having "df" I want to create a new array "results" like below:
# length of moving window
window = 10
hours = df[:,2]
# create an empty array to store the results
results = np.empty((0,4))
for i in range(len(set(hours))-window+1):
    obs_data = df[(hours >= i) & (hours <= i+window)]
    results = np.concatenate((results, obs_data), axis=0)
My problem is that the concatenation is very slow (on my system the loop takes 1.4 s without the concatenation and 16 s with it). I have over a million data points and I want to speed up this code. Does anyone know a faster way to create the new array (possibly without using np.concatenate)?
If you need to iterate, make the results array big enough to hold all the values.
# create an empty array to store the results
n = len(set(hours))-window+1
results = np.empty((n,4))
for i in range(n):
    obs_data = df[(hours >= i) & (hours <= i+...
    results[i,:] = obs_data
Repeated concatenate is slow; list append is faster.
It may be possible to get all obs_data from df with one indexing call, but I won't try to explore that now.
Not a completely free answer either, but a working one:
window = 10
hours = df[:,2]
# collect the windows in a list and stack them once at the end
lr = []
for i in range(len(set(hours))-window+1):
    obs_data = df[(hours >= i) & (hours <= i+window)]
    lr.append(obs_data)
results = np.vstack(lr)
It is way faster, for the reason already given: calling concatenate in a loop is awfully slow, whereas a Python list can be extended cheaply.
I would have preferred something like hpaulj's answer, with an array created up front and then filled. Even if obs_data is not a single row (as they seem to assume) but several rows, that is not really a problem. Something like
p=0
for i in range(n):
    obs_data = df[(hours >= i) & (hours <= i+...
    results[p:p+len(obs_data),:] = obs_data
    p += len(obs_data)
would do.
But the problem here is estimating the size of results. With your example, with uniformly distributed hours, it is quite easy: (len(set(hours))-window+1) * window * (len(hours)/len(set(hours))).
But I guess in reality, each obs_data has a different size.
So the only way to compute the size of results in advance would be to do a first pass just to sum the len(obs_data) values, and then a second pass to store them. So vstack, even if not entirely satisfying, is probably the best option.
Anyway, it is a very visible improvement over your version (22 seconds vs. less than 1 on my computer).
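For completeness, a minimal sketch of that two-pass idea (reusing the names window, hours and df from the question; the first pass only counts, the second fills a preallocated array):

import numpy as np

n_windows = len(set(hours)) - window + 1

# first pass: count the total number of rows that will be collected
total = 0
for i in range(n_windows):
    total += np.count_nonzero((hours >= i) & (hours <= i + window))

# second pass: fill a preallocated array
results = np.empty((total, 4))
p = 0
for i in range(n_windows):
    obs_data = df[(hours >= i) & (hours <= i + window)]
    results[p:p + len(obs_data), :] = obs_data
    p += len(obs_data)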

Vectorizing an iterative function on Pandas DataFrame

I have a dataframe where the first row is the initial condition.
import numpy as np
import pandas as pd

df = pd.DataFrame({"Year": np.arange(4),
                   "Pop": [0.4] + [np.nan] * 3})
and a function f(x,r) = r*x*(1-x), where r = 2 is a constant and 0 <= x <= 1.
I want to produce the following dataframe by applying the function to column Pop row-by-row iteratively. I.e., df.Pop[i] = f(df.Pop[i-1], r=2)
df = pd.DataFrame({"Year": np.arange(4),
                   "Pop": [0.4, 0.48, 0.4992, 0.49999872]})
Question: Is it possible to do this in a vectorized way?
I can achieve the desired result by using a loop to build lists for the x and y values, but this is not vectorized.
I have also tried this, but all nan places are filled with 0.48.
df.loc[1:, "Pop"] = R * df.Pop[:-1] * (1 - df.Pop[:-1])
It is IMPOSSIBLE to do this in a vectorized way.
By definition, vectorization makes use of parallel processing to reduce execution time, but the desired values in your question must be computed sequentially, each from the previous one, not in parallel. See this answer for a detailed explanation. Things like df.expanding(2).apply(f) and df.rolling(2).apply(f) won't work.
However, gaining more efficiency is possible. You can do the iteration using a generator. This is a very common construct for implementing iterative processes.
def gen(x_init, n, R=2):
    x = x_init
    for _ in range(n):
        x = R * x * (1-x)
        yield x

# execute
df.loc[1:, "Pop"] = list(gen(df.at[0, "Pop"], len(df) - 1))
Result:
print(df)
   Year       Pop
0     0  0.400000
1     1  0.480000
2     2  0.499200
3     3  0.499999
It is completely OK to stop here for small-sized data. If the function is going to be performed a lot of times, however, you can consider optimizing the generator with numba.
pip install numba or conda install numba in the console first
import numba
Add the decorator @numba.njit in front of the generator.
Change the number of np.nans to 10^6 and check the difference in execution time yourself. An improvement from 468 ms to 217 ms was achieved on my Core i5-8250U 64-bit laptop.
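Putting those steps together, a minimal sketch of the numba variant (the same generator as above with the decorator added; assumes numba is installed):

import numba

@numba.njit
def gen(x_init, n, R=2.0):
    x = x_init
    for _ in range(n):
        x = R * x * (1 - x)
        yield x

# execute exactly as before
df.loc[1:, "Pop"] = list(gen(df.at[0, "Pop"], len(df) - 1))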

tile operation to create a csr_matrix from one row of another csr_matrix

I have a sparse matrix 'a' of type csr_matrix. I want to create a new csr_matrix 'b' where every row of 'b' is the same, namely the ith row of 'a'.
For normal numpy arrays this is possible with the 'tile' operation, but I am not able to find an equivalent for csr_matrix.
Building a dense numpy matrix first and converting it to csr_matrix is not an option, as the size of the matrix is 10000 x 10000.
I actually found an answer that doesn't require creating the full numpy matrix and is quite fast for my purpose, so I'm adding it here in case it's useful for people in the future:
import numpy as np
import scipy.sparse

rows, cols = a.shape
b = scipy.sparse.csr_matrix((np.tile(a[2].data, rows), np.tile(a[2].indices, rows),
                             np.arange(0, rows*a[2].nnz + 1, a[2].nnz)), shape=a.shape)
This takes 2nd row of 'a' and tiles it to create 'b'.
Following is the timing test, seems quite fast for 10000x10000 matrix:
100 loops, best of 3: 2.24 ms per loop
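Wrapped up as a small helper for reuse (the name tile_csr_row is mine, just a convenience around the same construction; it assumes the chosen row has at least one stored element):

import numpy as np
import scipy.sparse

def tile_csr_row(a, i):
    """Return a csr_matrix of the same shape as a, every row of which is row i of a."""
    row = a[i]
    n_rows = a.shape[0]
    data = np.tile(row.data, n_rows)
    indices = np.tile(row.indices, n_rows)
    indptr = np.arange(0, n_rows * row.nnz + 1, row.nnz)  # row.nnz must be > 0
    return scipy.sparse.csr_matrix((data, indices, indptr), shape=a.shape)

b = tile_csr_row(a, 2)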
There is a block format, scipy.sparse.bmat, that lets you create a new sparse matrix from a (nested) list of other matrices.
So for a start you could
a1 = a[i, :]
ll = [[a1], [a1], [a1], [a1]]
sparse.bmat(ll)
I don't have a shell running to test this.
Internally this function turns all the input matrices into coo format and collects their coo attributes into three large lists (or arrays). In your case of tiled rows, the data and col (j) values would just repeat; the row (i) values would step.
Another way to approach it would be to construct a small test matrix and look at the attributes. What kinds of repetition do you see? It's easy to see patterns in the coo format. lil might also be easy to replicate, maybe with the list * n operation; csr is trickier to understand.
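A tiny sketch of that kind of inspection (a made-up 3x4 example, not your data, just to show the patterns in the coo attributes):

from scipy import sparse

small = sparse.random(3, 4, density=0.5, format='csr', random_state=0)
tiled = sparse.vstack([small[1]] * 3).tocoo()

print(tiled.data)  # the data of row 1, repeated 3 times
print(tiled.col)   # the column indices of row 1, repeated 3 times
print(tiled.row)   # 0,...,0, 1,...,1, 2,...,2 -- the stepping row values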
One can do:
import scipy.sparse

row = a.getrow(row_idx)
n_rows = a.shape[0]
b = scipy.sparse.vstack([row] * n_rows)

Fill a Pytables array with random values: horizontally vs vertically

So I am using Pytables to store a numpy array of size (10,000 x 100). My goal is to fill it with random values.
import numpy
import numpy.random as nr
import tables as tb

h5File = '/Users/me/tmp0/test0.h5'
f = tb.openFile( h5File, 'w')
atom = tb.Atom.from_dtype( numpy.dtype('float32'))
x = f.createCArray( f.root, 'prices', atom=atom, shape=(10000, 100) )
In this example I could simply do x[:] = nr.random((10000, 100)), but in reality my matrix is much bigger, more like (100,000,000 x 500), so I need to do it in chunks. First I tried filling it vertically, column by column:
%%timeit
for k in xrange(100):
    x[:, k] = nr.random(10000)
1 loops, best of 3: 255 ms per loop
Then I tried horizontally:
%%timeit
for k in xrange(0, 10000, 100):
    x[k:k+100, :] = nr.random((100, 100))
100 loops, best of 3: 22.4 ms per loop
Why is the horizontal one 10 times faster? Also, is there a simpler way to do that?
As for the speed, it's because of how computers keep memory organized.
Internally, the entire matrix is kept in linear memory. To keep it easy to wrap our heads around: if you had a 2x2 matrix:
1 2
3 4
Internally, that would be stored as
memAddr1: 1
memAddr2: 2
memAddr3: 3
memAddr4: 4
So if you write this in rows, you get very efficient use of consecutive memory addresses (1-4). If you write in columns, you're forcing frequent random accesses (1 then 3 then 2 then 4).
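A minimal sketch of that layout (mine, not from the original answer): ravelling a small array in C order shows the rows stored back to back, and strides shows how far apart row and column neighbours are.

import numpy as np

m = np.array([[1, 2],
              [3, 4]])
print(m.ravel(order='C'))  # [1 2 3 4] -- row by row, exactly the order in memory
print(m.strides)           # e.g. (16, 8) for int64: the next row is 16 bytes away, the next column only 8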
The reason has already been explained: how the data is laid out in memory has a big impact on the performance you get. To learn more about the issue, look at slide 19 (and its neighbours) of this presentation:
http://www.pytables.org/docs/StarvingCPUs-PyTablesUsages.pdf

Numpy array get the subset/slice of an array which is not NaN

I have an array of size: (50, 50). Within this array there is a slice of size (20,10).
Only this slice contains data, the remainder is all set to nan.
How do I cut this slice out of my large array?
You can get this using fancy indexing to collect the items that are not NaN:
a = a[ np.logical_not( np.isnan(a) ) ].reshape(20,10)
or, alternatively, as suggested by Joe Kington:
a = a[ ~np.isnan(a) ]
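A quick sketch of how that plays out (the location of the 20x10 block here is made up):

import numpy as np

a = np.full((50, 50), np.nan)
a[5:25, 30:40] = np.random.rand(20, 10)   # the only non-NaN region

block = a[~np.isnan(a)].reshape(20, 10)   # works because the valid values form one rectangle
print(block.shape)                        # (20, 10)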
Do you know where the NaNs are? If so, something like this should work:
newarray = np.copy(oldarray[xstart:xend,ystart:yend])
where xstart and xend are the beginning and end of the slice you want in the x dimension and similarly for y. You can then delete the old array to free up memory if you don't need it anymore.
If you don't know where the NaNs are, this should do the trick:
# in this example, the starting array is A, numpy is imported as np
boolA = np.isnan(A) #get a boolean array of where the nans are
nonnanidxs = list(zip(*np.where(boolA == False))) # all the indices which are non-NaN
#slice out the nans
corner1 = nonnanidxs[0]
corner2 = nonnanidxs[-1]
xdist = corner2[0] - corner1[0] + 1
ydist = corner2[1] - corner1[1] + 1
B = np.copy(A[corner1[0]:corner1[0]+xdist, corner1[1]:corner1[1]+ydist])
#B is now the array you want
Note that this would be pretty slow for large arrays because np.where looks through the whole thing. There's an open issue on the numpy bug tracker for a method that finds the first index equal to some value and then stops. There might be a more elegant way to do this; this is just the first thing that came to my head.
EDIT: ignore, sgpc's answer is much better.
