Fill a PyTables array with random values: horizontally vs vertically - python

So I am using PyTables to store a numpy array of size (10,000 x 100). My goal is to fill it with random values.
import numpy
import numpy.random as nr
import tables as tb

h5File = '/Users/me/tmp0/test0.h5'
f = tb.openFile(h5File, 'w')
atom = tb.Atom.from_dtype(numpy.dtype('float32'))
x = f.createCArray(f.root, 'prices', atom=atom, shape=(10000, 100))
In this example I could simply do x[:] = nr.random((10000, 100)), but in reality my matrix is much bigger, more like (100,000,000 x 500), so I need to fill it in chunks. First I tried vertically:
%%timeit
for k in xrange(100):
    x[:, k] = nr.random(10000)
1 loops, best of 3: 255 ms per loop
Then I tried horizontally:
%%timeit
for k in xrange(0, 10000, 100):
    x[k:k+100, :] = nr.random((100, 100))
100 loops, best of 3: 22.4 ms per loop
Why is the horizontal one 10 times faster? Also, is there a simpler way to do that?

For the speed, it's because of how computers keep memory organized.
Internally, the entire matrix is kept in linear memory. To keep it easy to wrap our heads around, suppose you had a 2x2 matrix:
1 2
3 4
Internally, that would be stored as
memAddr1: 1
memAddr2: 2
memAddr3: 3
memAddr4: 4
So if you write this in rows, you get very efficient use of consecutive memory addresses (1-4). If you write in columns, you're forcing frequent random accesses (1 then 3 then 2 then 4).
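You can see this row-major (C-order) layout directly from numpy (a small check of my own, not from the original answer):
import numpy as np

a = np.zeros((10000, 100), dtype='float32')
print(a.strides)   # (400, 4): one step along a row moves 4 bytes, one step down a column moves 400
Writing a block of whole rows therefore touches one contiguous stretch of memory, while writing a column touches one float out of every 400 bytes.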

The reason has already been explained: how you lay your data out in memory affects the performance you get a great deal. To learn more about the issue, look at slide 19 (and the surrounding slides) of this presentation:
http://www.pytables.org/docs/StarvingCPUs-PyTablesUsages.pdf
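As for a simpler way to do the fill, a compact row-chunked version of the faster loop above could look like the sketch below. It uses the newer snake_case PyTables names (open_file/create_carray, which replaced the camelCase ones in PyTables 3.x); the chunk size is an assumption to tune for your memory:
import numpy as np
import tables as tb

with tb.open_file('/Users/me/tmp0/test0.h5', 'w') as f:
    atom = tb.Atom.from_dtype(np.dtype('float32'))
    x = f.create_carray(f.root, 'prices', atom=atom, shape=(10000, 100))
    chunk = 1000                                   # rows written per iteration
    for k in range(0, 10000, chunk):
        x[k:k + chunk, :] = np.random.random((chunk, 100))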

Related

What is the most efficient way to loop through a 2D array in Python

I am new to Python and machine learning, and I can't find the best way to do this on the internet. I have a big 2D array (distance_matrix.shape = (47, 1328624)). I wrote the code below, but it takes too long to run; the nested for loops are what take so much time.
import numpy as np
import pandas as pd

# a small sample of the (47, 1328624) distance matrix
distance_matrix = np.array([
    [0.21218192, 0.12845819, 0.54545613, 0.92464129, 0.12051526, 0.0870853 ],
    [0.2168166 , 0.11174682, 0.58193855, 0.93949729, 0.08060061, 0.11963891],
    [0.23996999, 0.17554854, 0.60833433, 0.93914766, 0.11631545, 0.2036373 ]])
iskeleler = pd.DataFrame({
    'lat': [40.992752, 41.083202, 41.173462],
    'lon': [29.023165, 29.066652, 29.088163],
    'name': ['Kadıköy', 'AnadoluHisarı', 'AnadoluKavağı']
}, dtype=str)

for i in range(len(distance_matrix)):
    for j in range(len(distance_matrix[0])):
        if distance_matrix[i][j] < 1:
            iskeleler.loc[i, 'Address'] = distance_matrix[i][j]
print(iskeleler)
To explain, I am sharing the first 5 rows of my array and showing my dataframe.
[image: İskeleler dataframe]
[image: distance_matrix]
The "İskeleler" dataframe has 47 rows. I want to add them to the 'Address' column in row i in the "İskeleler" by looking at all the values in row i in the distance_matrix and adding the ones less than 1. I mean if we look at the first row in the distance_matrix photo, I want to add the numbers like 0.21218192 + 0.12845819 + 0.54545613 .... and put them in the 'address' column in the i'th row in the İskeleler dataframe.
My intend is to loop through distance_matrix and find some values which smaller than 1. The code takes too long. How can i do this with faster way?
I think you mean this:
import numpy as np
# Set up some dummy data in range 0..100
distance = np.random.rand(47,1328624) * 100.0
# Boolean mask of all values < 1
mLessThan1 = distance<1
# Sum elements <1 across rows
result = np.sum(distance*mLessThan1, axis=1)
That takes 168ms on my Mac.
In [47]: %timeit res = np.sum(distance*mLessThan1, axis=1)
168 ms ± 914 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
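If the end goal is the 'Address' column from the question, that result vector can be written straight back (assuming, as described, that iskeleler has one row per row of the 47-row distance matrix):
iskeleler['Address'] = result   # one sum of the values < 1 for each row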

Optimized computation of pairwise correlations in Python

Given a set of discrete locations (e.g. "sites") that are pairwise related in some categorical way (e.g. general proximity) and that contain local-level data (e.g. population size), I wish to efficiently compute the mean correlation coefficients between the local-level data of pairwise locations characterized by the same relationship.
For example, I assumed 100 sites and randomized their pairwise relations using values 1 to 25, yielding the triangular matrix relations:
import numpy as np
sites = 100
categ = 25
relations = np.random.randint(low=1, high=categ+1, size=(sites, sites))
relations = np.triu(relations) # set relation_ij = relation_ji
np.fill_diagonal(relations, 0) # ignore self-relation
I also have 5000 replicates of simulation results on each site:
sims = 5000
res = np.round(np.random.rand(sites, sims),1)
To compute the mean pairwise correlation for each specific relation category, I first calculated, for each relation category i, the correlation coefficient rho[j] between the simulation results res of each unique site pair j, and then took the average across all possible pairs with relation i:
rho_list = np.ones(categ)*99
for i in range(1, categ+1):
    idr = np.transpose(np.where(relations == i))   # pairwise site indices of the same relation category
    comp = np.vstack([res[idr[:,0]].ravel(), res[idr[:,1]].ravel()])   # pairwise comparisons of simulation results from the same relation category
    comp_uniq = np.reshape(comp.T, (len(idr), res.shape[1], -1))   # reshape the above into pairwise comparisons of simulation results between unique site pairs
    rho = np.ones(len(idr))*99   # correlation coefficients of all unique site pairs of the current relation category
    for j in range(len(idr)):   # loop through unique site pairs
        comp_uniq_s = comp_uniq[j][np.all(comp_uniq!=0, axis=2)[j]].T   # shorten comparisons by removing pairs with a zero-valued result
        rho[j] = np.corrcoef(comp_uniq_s[0], comp_uniq_s[1])[0,1]
    rho_list[i-1] = np.nanmean(rho)
Although this script works, once I increase sites = 400 the entire computation can take more than 6 hours to finish, which leads me to question my use of array functions. What is the reason for the poor performance, and how can I optimize the algorithm?
We can vectorize the innermost loop (the j iterator) with some masking to take care of the ragged nature of the data processed at each iteration of that loop. We can also replace the slowish np.corrcoef with an einsum-based equivalent (inspired by this post). Additionally, we can optimize a few steps at the start of the outer loop, especially the stacking steps, which could be the bottlenecks.
Thus, the complete code would reduce to something like this -
for i in range(1, categ+1):
    r, c = np.where(relations == i)
    A = res[r]
    B = res[c]
    mask0 = ~((A != 0) & (B != 0))           # positions where either result is zero
    A[mask0] = 0
    B[mask0] = 0
    count = mask0.shape[-1] - mask0.sum(-1, keepdims=1)
    A_mA = A - A.sum(-1, keepdims=1)/count   # subtract the mean computed over the kept entries only
    B_mB = B - B.sum(-1, keepdims=1)/count
    A_mA[mask0] = 0
    B_mB[mask0] = 0
    ssA = np.einsum('ij,ij->i', A_mA, A_mA)
    ssB = np.einsum('ij,ij->i', B_mB, B_mB)
    rho = np.einsum('ij,ij->i', A_mA, B_mB)/np.sqrt(ssA*ssB)
    rho_list[i-1] = np.nanmean(rho)
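A quick sanity check on the masked formula (my own snippet, not part of the original answer): for a single pair, the einsum-style computation should match np.corrcoef restricted to the entries where both results are nonzero.
import numpy as np

a = np.round(np.random.rand(5000), 1)
b = np.round(np.random.rand(5000), 1)
keep = (a != 0) & (b != 0)                    # same mask the vectorized code builds per row
expected = np.corrcoef(a[keep], b[keep])[0, 1]

n = keep.sum()
am = np.where(keep, a - np.where(keep, a, 0).sum()/n, 0)   # de-mean over kept entries, zero elsewhere
bm = np.where(keep, b - np.where(keep, b, 0).sum()/n, 0)
rho = (am*bm).sum() / np.sqrt((am*am).sum() * (bm*bm).sum())
print(np.isclose(rho, expected))              # True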
Runtime tests
Case # 1: On given sample data, with sites = 100
In [381]: %timeit loopy_app()
1 loop, best of 3: 7.45 s per loop
In [382]: %timeit vectorized_app()
1 loop, best of 3: 479 ms per loop
15x+ speedup.
Case # 2: With sites = 200
In [387]: %timeit loopy_app()
1 loop, best of 3: 1min 56s per loop
In [388]: %timeit vectorized_app()
1 loop, best of 3: 1.86 s per loop
In [390]: 116/1.86
Out[390]: 62.36559139784946
62x+ speedup.
Case # 3: Finally with sites = 400
In [392]: %timeit vectorized_app()
1 loop, best of 3: 7.64 s per loop
This took 6hrs+ at OP's end with their loopy method.
From the timings, it became clear that vectorizing the inner loop was the key to getting noticeable speedups for large sites.

Fastest way to sort a large number of arrays in python

I am trying to sort a large number of arrays in python. I need to perform the sorting for over 11 million arrays at once.
Also, it would be nice if I could directly get the indices that would sort the array.
That is why, as of now, I'm using numpy.argsort(), but that's too slow on my machine (it takes over an hour to run).
The same operation in R takes about 15 minutes on the same machine.
Can anyone tell me a faster way to do this in Python?
Thanks
EDIT:
Adding an example
If I have the following dataframe :
agg:
x y w z
1 2 2 5
1 2 6 7
3 4 3 3
5 4 7 8
3 4 2 5
5 9 9 9
I am running the following function and command on it:
def function(group):
    z = group['z'].values
    w = group['w'].values
    func = w[np.argsort(z)[::-1]][:7]   # I need the top 7 in case there are many
    return np.array_str(func)[1:-1]

output = agg.groupby(['x', 'y']).apply(function).reset_index()
so my output dataframe will look like this:
output:
x y w
1 2 6,2
3 4 2,3
5 4 7
5 9 9
Well, for cases like these where you are interested in partially sorted indices, there's NumPy's np.argpartition.
You have the troublesome np.argsort in w[np.argsort(z)[::-1]][:7], which is essentially w[idx], where idx = np.argsort(z)[::-1][:7].
So, idx could be calculated with np.argpartition, like so -
idx = np.argpartition(-z,np.arange(7))[:7]
That -z is needed because by default np.argpartition tries to get sorted indices in ascending order. So, to reverse it, we have negated the elements.
Thus, the proposed change in the original code would be :
func = w[np.argpartition(-z,np.arange(7))[:7]]
Runtime test -
In [162]: z = np.random.randint(0,10000000,(1100000)) # Random int array
In [163]: idx1 = np.argsort(z)[::-1][:7]
...: idx2 = np.argpartition(-z,np.arange(7))[:7]
...:
In [164]: np.allclose(idx1,idx2) # Verify results
Out[164]: True
In [165]: %timeit np.argsort(z)[::-1][:7]
1 loops, best of 3: 264 ms per loop
In [166]: %timeit np.argpartition(-z,np.arange(7))[:7]
10 loops, best of 3: 36.5 ms per loop
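Folding that back into the grouped function from the question could look something like this (my adaptation, not from the original answer; the length guard is there because np.argpartition raises an error when a group has fewer than 7 rows, and the speedup only matters for large groups):
def function(group):
    z = group['z'].values
    w = group['w'].values
    k = min(7, len(z))
    if len(z) > 7:
        idx = np.argpartition(-z, np.arange(k))[:k]   # top-k indices, ordered by descending z
    else:
        idx = np.argsort(z)[::-1][:k]                 # tiny group: plain argsort is cheap enough
    return np.array_str(w[idx])[1:-1]

output = agg.groupby(['x', 'y']).apply(function).reset_index()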
The reason Python is so much slower than R here is that Python does not statically type variables (i.e. int, string, float), so part of each comparison is spent determining the variable type.
You can't fix this problem using Python alone, but you can include type definitions using Cython (ctypes and psyco can also perform the same function, but I prefer Cython). A simple example of how this works is at http://docs.cython.org/src/quickstart/cythonize.html
Cython compiles a .c version of your Python file that can be imported instead of the .py to reduce the runtime. All the possible ways to compile using Cython are shown at http://docs.cython.org/src/reference/compilation.html
Your input and output are a bit confusing. Please provide some sample data.
But look into: http://pandas.pydata.org/pandas-docs/stable/api.html#reshaping-sorting-transposing
Pandas sorting is about as optimized as it gets. Focus on the Series sort, as each column of the DataFrame is more accurately represented as a Series.

Fast way to update integer indexed slice in numpy

I have a rather big numpy array, and I'm proceeding to clean up the data (this is pseudocode):
import numpy as np

arr = np.ones(shape=(int(5E5), 1000), dtype=float)
to_zero = np.arange(500, 1000, 2)   # Normally this comes from a function that finds the
                                    # columns to be zeroed; the indexes are not contiguous
arr[:, to_zero] = 0
The problem is that arr[:, to_zero] = 0 takes a very long time: in this example it takes 4 s, while arr[:, :500] = 0 takes 500 ms.
Is there any way to make this faster?
Example numpy notebooks for numpy 1.8.1 and numpy 1.9 beta (see timing results).
As @Jaime pointed out, using a newer numpy would be a good bet in the long term.
From numpy's internal storage point of view it is very fast to reset a large chunk of contiguous memory; the problem in your example is that you are jumping around memory like a mad rabbit, because you are zeroing columns.
I tried turning your array the other way round, and the speed increased by almost a factor of 3: I get 4.08 s for your version and 1.57 s for the transposed version. So that at least is a good optimisation, if you can arrange your data that way.
There may be something fishy in numpy with this indexing, because actually doing:
for c in to_zero:
    arr[:, c] = 0
is faster than using the list notation.
So, I ran a few different alternatives:
to_zero = numpy.arange(0, 500, 2)
# 1. table in the original orientation
arr = numpy.ones((500000, 1000), dtype='float')
# 1.1. basic case: 4.08 s
arr[:, to_zero] = 0
# 1.2. a bit different notation: 234 ms
arr[:, 0:500:2] = 0
# 1.3. for loop: 2.75 s
for c in to_zero:
    arr[:, c] = 0
# 2. transposed table
arr = numpy.ones((1000, 500000), dtype='float')
# 2.1. basic case: 1.47 s
arr[to_zero,:] = 0
# 2.2. a bit different notation: 105 ms
arr[0:500:2,:] = 0
# 2.3. for loop: 112 ms
for r in to_zero:
    arr[r, :] = 0
These have been timed with IPython %timeit, so the results may vary a bit from one run to another, but there seems to be a pattern: transpose your table and use a loop.
The problem has to do with the way the data is laid out in memory.
To solve it, treat the rows as columns and the columns as rows:
arr = np.ones(shape=(1000, int(5E5)), dtype=float)
to_zero = np.arange(500, 1000, 2)   # Normally this comes from a function that finds the
                                    # columns to be zeroed; the indexes are not contiguous
arr[to_zero, :] = 0
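If the rest of your code still expects the original orientation, note that transposing back gives a view without copying the data; a small check of my own:
arr_view = arr.T                      # shape (500000, 1000); a view, no data is copied
print(arr_view[:, to_zero].any())     # False -- those "columns" are already zeroed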

Constructing a waterfall algorithm from multiple columns in a Pandas Data Frame

Suppose I have a multi-column data frame and I wish to implement a waterfall-style algorithm: take the value in the first column if it is present, look at the second column if it is not, take the value in the third column if that is also missing, and so on, falling back to a default value (say zero) if the value is missing in the last column. I have a way of doing this by adding up a series of vector operations (see below), but it doesn't seem to scale to more columns very well. And, of course, I could do it with nested loops through the rows (very unpythonic -- right?).
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(15).reshape((5,3)), index=['a','b','c','d','e'], columns=['X','Y','Z'])
#Make some missing values
frame['X'].ix[0:2] = None
frame['Y'].ix[1:4] = None
frame['Z'].ix[3:5] = None
#This is my kludgy waterfall for the three column case.
frame['Waterfall'] = frame['X'].fillna(0) + frame['Y'].fillna(0) * frame['X'].isnull() + frame['Z'].fillna(0) * (frame['X'].isnull() & frame['Y'].isnull())
I am hoping for a solution to this problem that scales well to waterfalls of arbitrary length. If it could be more Pythonic, that would be great. Ideally, it would be a function that takes an ordered list of column labels and a dataframe as arguments and returns the desired values.
Thank you for your help.
First of all, don't use None as your missing data value. That forces all your columns to the object dtype, which will be slow. Use nan instead (this makes everything doubles, so just be careful with floating-point issues).
I'd use the bfill method for fillna():
In [26]: frame.fillna(method='bfill', axis=1)['X'].fillna(0)
Out[26]:
a 1
b 5
c 6
d 9
e 12
Name: X, dtype: float64
performance:
In [27]: %timeit frame['X'].fillna(0) + frame['Y'].fillna(0) * frame['X'].isnull() + frame['Z'].fillna(0) * (frame['X'].isnull() & frame['Y'].isnull())
1000 loops, best of 3: 776 µs per loop
In [28]: %timeit frame.fillna(method='bfill', axis=1)['X']
10000 loops, best of 3: 138 µs per loop
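To get the arbitrary-length waterfall the question asks for, the same bfill idea can be wrapped in a small helper (a sketch; waterfall is just a name I made up, not a pandas function):
def waterfall(df, cols, default=0):
    # Coalesce cols left to right: first non-NaN value per row, else default.
    return df[cols].bfill(axis=1).iloc[:, 0].fillna(default)

frame['Waterfall'] = waterfall(frame, ['X', 'Y', 'Z'])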
