I have a dataframe of all float columns. For example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(12.0).reshape(3,4), columns=list('ABCD'))
# A B C D
# 0 0.0 1.0 2.0 3.0
# 1 4.0 5.0 6.0 7.0
# 2 8.0 9.0 10.0 11.0
I would like to calculate column-wise differences for all combinations of columns (e.g., A-B, A-C, B-C, etc.).
E.g., the desired output would be something like:
A_B A_C A_D B_C B_D C_D
-1.0 -2.0 -3.0 -1.0 -2.0 -1.0
-1.0 -2.0 -3.0 -1.0 -2.0 -1.0
-1.0 -2.0 -3.0 -1.0 -2.0 -1.0
Since the number of columns may be large, I'd like to do the calculations as efficiently as possible. I assume I'll get a big speed bump by converting the dataframe to a NumPy array first, so I'll do that, but I'm wondering if there are any other strategies that might result in large performance gains - maybe some matrix algebra or multidimensional data trick that avoids looping through all unique combinations. Any suggestions are welcome. This project is in Python 3.
Listed in this post are two NumPy approaches for performance - one fully vectorized and another with one loop.
Approach #1
def numpy_triu1(df):
    a = df.values
    r, c = np.triu_indices(a.shape[1], 1)
    cols = df.columns
    nm = [cols[i] + "_" + cols[j] for i, j in zip(r, c)]
    return pd.DataFrame(a[:, r] - a[:, c], columns=nm)
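For reference (my illustration, not part of the original answer), np.triu_indices(n, 1) returns the row/column indices of the strict upper triangle, i.e. every unique (i, j) pair with i < j - exactly the set of column pairs we need:
r, c = np.triu_indices(4, 1)
list(zip(r, c))
# [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]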
Sample run -
In [72]: df
Out[72]:
A B C D
0 0.0 1.0 2.0 3.0
1 4.0 5.0 6.0 7.0
2 8.0 9.0 10.0 11.0
In [78]: numpy_triu1(df)
Out[78]:
A_B A_C A_D B_C B_D C_D
0 -1.0 -2.0 -3.0 -1.0 -2.0 -1.0
1 -1.0 -2.0 -3.0 -1.0 -2.0 -1.0
2 -1.0 -2.0 -3.0 -1.0 -2.0 -1.0
Approach #2
If we are okay with an array as output, or a dataframe without specialized column names, here's another -
def pairwise_col_diffs(a): # a would be df.values
    n = a.shape[1]
    N = n*(n-1)//2
    idx = np.concatenate(( [0], np.arange(n-1,0,-1).cumsum() ))
    start, stop = idx[:-1], idx[1:]
    out = np.empty((a.shape[0], N), dtype=a.dtype)
    for j, i in enumerate(range(n-1)):
        out[:, start[j]:stop[j]] = a[:, i, None] - a[:, i+1:]
    return out
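A minimal usage sketch (my addition): the loop writes the pairs in the same order as np.triu_indices, so the A_B-style names can be reattached afterwards if a dataframe is wanted -
out = pairwise_col_diffs(df.values)
r, c = np.triu_indices(df.shape[1], 1)
names = [df.columns[i] + "_" + df.columns[j] for i, j in zip(r, c)]
pd.DataFrame(out, columns=names)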
Runtime test
Since the OP has mentioned that a multi-dim array output would work for them as well, here are the array-based approaches from the other authors -
import itertools

# @Allen's soln
def Allen(arr):
    n = arr.shape[1]
    idx = np.asarray(list(itertools.combinations(range(n), 2))).T
    return arr[:, idx[0]] - arr[:, idx[1]]

# @DYZ's soln
def DYZ(arr):
    result = np.concatenate([(arr.T - arr.T[x])[x+1:]
                             for x in range(arr.shape[1])]).T
    return result
The pandas-based solution from @Gerges Dib's post wasn't included, as it came out very slow compared to the others.
Timings -
We will use three dataset sizes - 100, 500 and 1000 columns:
In [118]: df = pd.DataFrame(np.random.randint(0,9,(3,100)))
...: a = df.values
...:
In [119]: %timeit DYZ(a)
...: %timeit Allen(a)
...: %timeit pairwise_col_diffs(a)
...:
1000 loops, best of 3: 258 µs per loop
1000 loops, best of 3: 1.48 ms per loop
1000 loops, best of 3: 284 µs per loop
In [121]: df = pd.DataFrame(np.random.randint(0,9,(3,500)))
...: a = df.values
...:
In [122]: %timeit DYZ(a)
...: %timeit Allen(a)
...: %timeit pairwise_col_diffs(a)
...:
100 loops, best of 3: 2.56 ms per loop
10 loops, best of 3: 39.9 ms per loop
1000 loops, best of 3: 1.82 ms per loop
In [123]: df = pd.DataFrame(np.random.randint(0,9,(3,1000)))
...: a = df.values
...:
In [124]: %timeit DYZ(a)
...: %timeit Allen(a)
...: %timeit pairwise_col_diffs(a)
...:
100 loops, best of 3: 8.61 ms per loop
10 loops, best of 3: 167 ms per loop
100 loops, best of 3: 5.09 ms per loop
I think you can do it with NumPy. Let arr=df.values. First, let's find all two-column combinations:
from itertools import combinations
column_combos = combinations(range(arr.shape[1]), 2)
Now, subtract columns pairwise and convert a list of arrays back to a 2D array:
result = np.array([(arr[:,x[1]] - arr[:,x[0]]) for x in column_combos]).T
#array([[1., 2., 3., 1., 2., 1.],
# [1., 2., 3., 1., 2., 1.],
# [1., 2., 3., 1., 2., 1.]])
Another solution is somewhat (~15%) faster because it subtracts whole 2D arrays rather than columns, and has fewer Python-side iterations:
result = np.concatenate([(arr.T - arr.T[x])[x+1:] for x in range(arr.shape[1])]).T
#array([[ 1., 2., 3., 1., 2., 1.],
# [ 1., 2., 3., 1., 2., 1.],
# [ 1., 2., 3., 1., 2., 1.]])
You can convert the result back to a DataFrame if you want:
columns = list(map(lambda x: x[1]+x[0], combinations(df.columns, 2)))
#['BA', 'CA', 'DA', 'CB', 'DB', 'DC']
pd.DataFrame(result, columns=columns)
# BA CA DA CB DB DC
#0 1.0 2.0 3.0 1.0 2.0 1.0
#1 1.0 2.0 3.0 1.0 2.0 1.0
#2 1.0 2.0 3.0 1.0 2.0 1.0
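Note that the columns above hold later-minus-earlier differences (B-A, C-A, ...). A small variation (my addition, not part of the original answer) gives the A_B = A-B orientation shown in the question:
result_ab = np.array([arr[:, i] - arr[:, j] for i, j in combinations(range(arr.shape[1]), 2)]).T
columns_ab = ['_'.join(pair) for pair in combinations(df.columns, 2)]
pd.DataFrame(result_ab, columns=columns_ab)
#    A_B  A_C  A_D  B_C  B_D  C_D
# 0 -1.0 -2.0 -3.0 -1.0 -2.0 -1.0
# 1 -1.0 -2.0 -3.0 -1.0 -2.0 -1.0
# 2 -1.0 -2.0 -3.0 -1.0 -2.0 -1.0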
import itertools
df = pd.DataFrame(np.arange(12.0).reshape(3,4), columns=list('ABCD'))
df_cols = df.columns.tolist()
# build an index array of all the pairs that need the subtraction
idx = np.asarray(list(itertools.combinations(range(len(df_cols)), 2))).T
# build a new DF from the pairwise differences and the paired column names
df_new = pd.DataFrame(data=df.values[:, idx[0]] - df.values[:, idx[1]],
                      columns=[''.join(e) for e in itertools.combinations(df_cols, 2)])
df_new
Out[43]:
AB AC AD BC BD CD
0 -1.0 -2.0 -3.0 -1.0 -2.0 -1.0
1 -1.0 -2.0 -3.0 -1.0 -2.0 -1.0
2 -1.0 -2.0 -3.0 -1.0 -2.0 -1.0
I am not sure how fast this is compared to other possible methods, but here it is:
df = pd.DataFrame(np.arange(12.0).reshape(3,4), columns=list('ABCD'))
# get the columns as list
cols = list(df.columns)
# define output dataframe
out = pd.DataFrame()
# loop over possible periods
for period in range(1, df.shape[1]):
    names = [l1 + l2 for l1, l2 in zip(cols, cols[period:])]
    out[names] = df.diff(periods=period, axis=1).dropna(axis=1, how='all')
print(out)
# column name shows which two columns are subtracted
AB BC CD AC BD AD
0 1.0 1.0 1.0 2.0 2.0 3.0
1 1.0 1.0 1.0 2.0 2.0 3.0
2 1.0 1.0 1.0 2.0 2.0 3.0
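If the A_B-style orientation and ordering from the question are wanted, a hedged follow-up (my addition, not from the answer) is to negate and reorder the diff-based result, since df.diff gave B-A, C-B, etc.:
from itertools import combinations
pair_names = [l1 + l2 for l1, l2 in combinations(cols, 2)]   # ['AB', 'AC', 'AD', 'BC', 'BD', 'CD']
out_ab = -out[pair_names]                                    # negate to turn B-A into A-B
out_ab.columns = [l1 + '_' + l2 for l1, l2 in combinations(cols, 2)]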
I have a multindex dataframe with 3 index levels and 2 numerical columns.
A 1 2017-04-01 14.0 87.346878
2017-06-01 4.0 87.347504
2 2014-08-01 1.0 123.110001
2015-01-01 4.0 209.612503
B 3 2014-07-01 1.0 68.540001
2014-12-01 1.0 64.370003
4 2015-01-01 3.0 75.000000
I want to replace the values in the first row of the 3rd index level wherever a new second-level index begins.
For ex: every first row
(A,1,2017-04-01)->0.0 0.0
(A,2,2014-08-01)->0.0 0.0
(B,3,2014-07-01)->0.0 0.0
(B,4,2015-01-01)->0.0 0.0
The dataframe is too big, and doing it slice by slice like df.xs(('A', 1)) ... df.xs(('A', 2)) gets time consuming. Is there some way I can get a mask and replace the values at these positions?
Use DataFrame.reset_index on level=2, then use DataFrame.groupby on level=[0, 1] and aggregate level_2 using first. Then create a MultiIndex with pd.MultiIndex.from_arrays, and finally use this index to set the values in the dataframe:
idx = df.reset_index(level=2).groupby(level=[0, 1])['level_2'].first()
idx = pd.MultiIndex.from_arrays(idx.reset_index().to_numpy().T)
df.loc[idx, :] = 0
Result:
# print(df)
col1 col2
A 1 2017-04-01 0.0 0.000000
2017-06-01 4.0 87.347504
2 2014-08-01 0.0 0.000000
2015-01-01 4.0 209.612503
B 3 2014-07-01 0.0 0.000000
2014-12-01 1.0 64.370003
4 2015-01-01 0.0 0.000000
We can extract a series of the second-level index with:
df.index.get_level_values(1)
# output: Int64Index([1, 1, 2, 2, 3, 3, 4], dtype='int64')
And check where it changes with:
idx = df.index.get_level_values(1)
np.where(idx != np.roll(idx, 1))[0]
# output: array([0, 2, 4, 6])
So we can simply use the returned value of the second statement with iloc to get the first row of every second-level index and modify their values like this:
idx = df.index.get_level_values(1)
df.iloc[np.where(idx != np.roll(idx, 1))[0]] = 0
output:
value1 value2
A 1 2017-04-01 0.0 0.000000
2017-06-01 4.0 87.347504
2 2014-08-01 0.0 0.000000
2015-01-01 4.0 209.612503
B 3 2014-07-01 0.0 0.000000
2014-12-01 1.0 64.370003
4 2015-01-01 0.0 0.000000
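One caveat worth adding (my note, not from the answer): np.roll wraps around, so position 0 is compared against the last label; if the first and last second-level labels ever happened to be equal, the very first group would be missed. A sketch without the wrap-around:
idx = df.index.get_level_values(1).to_numpy()
# True at position 0 and wherever the second-level label changes from the previous row
first_rows = np.flatnonzero(np.r_[True, idx[1:] != idx[:-1]])
df.iloc[first_rows] = 0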
You can use the grouper indices in a simple iloc:
df.iloc[[a[0] for a in df.groupby(level=[0, 1]).indices.values()]] = 0
Example:
df = pd.DataFrame({'col1': [14., 4., 1., 4., 1., 1., 3.],
                   'col2': [87.346878, 87.347504, 123.110001, 209.612503, 68.540001, 64.370003, 75.]},
                  index=pd.MultiIndex.from_tuples([('A', 1, '2017-04-01'), ('A', 1, '2017-06-01'),
                                                   ('A', 2, '2014-08-01'), ('A', 2, '2015-01-01'),
                                                   ('B', 3, '2014-07-01'), ('B', 3, '2014-12-01'),
                                                   ('B', 4, '2015-01-01')]))
Result:
col1 col2
A 1 2017-04-01 0.0 0.000000
2017-06-01 4.0 87.347504
2 2014-08-01 0.0 0.000000
2015-01-01 4.0 209.612503
B 3 2014-07-01 0.0 0.000000
2014-12-01 1.0 64.370003
4 2015-01-01 0.0 0.000000
Timings:
%%timeit
idx = df.reset_index(level=2).groupby(level=[0, 1])['level_2'].first()
idx = pd.MultiIndex.from_arrays(idx.reset_index().to_numpy().T)
df.loc[idx, :] = 0
#6.7 ms ± 40 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df.iloc[[a[0] for a in df.groupby(level=[0, 1]).indices.values()]] = 0
#897 µs ± 6.99 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So this is about 7 times faster than the accepted answer.
I think you can use something like this:
import pandas as pd
import numpy as np
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 4), index=index)
df
You can create a list of the unique first-level values of your index, get the position of the first row for each of them, and then replace the values at those row positions:
lst = ['bar', 'foo', 'qux']
ls = []
for i in lst:
    base = df.index.get_loc(i)    # slice of rows for first-level label i
    a = base.indices(len(df))[0]  # position of the first row in that slice
    ls.append(a)
for ii in ls:
    df.iloc[ii, 0] = 0            # positional assignment instead of chained df[0][ii]
df
Hopefully this helps you.
Cheers!
I was wondering if there is any pandas equivalent to cumsum() or cummax() etc. for median: e.g. cummedian().
So that if I have, for example this dataframe:
a
1 5
2 7
3 6
4 4
what I want is something like:
df['a'].cummedian()
which should output:
5
6
6
5.5
You can use expanding.median -
df.a.expanding().median()
1 5.0
2 6.0
3 6.0
4 5.5
Name: a, dtype: float64
Timings
df = pd.DataFrame({'a' : np.arange(1000000)})
%timeit df['a'].apply(cummedian())
1 loop, best of 3: 1.69 s per loop
%timeit df.a.expanding().median()
1 loop, best of 3: 838 ms per loop
The winner is expanding.median by a huge margin. Divakar's method is memory intensive and suffers memory blowout at this size of input.
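As a side note (my addition, not from the original answer), the same expanding pattern covers other cumulative-from-the-start statistics should you need them:
df.a.expanding().quantile(0.75)                      # expanding quantile
df.a.expanding().apply(lambda s: s.max() - s.min())  # any custom statistic (slower)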
We could create NaN-filled subarrays as rows with a strides-based function, like so -
def nan_concat_sliding_windows(x):
    n = len(x)
    add_arr = np.full(n-1, np.nan)
    x_ext = np.concatenate((add_arr, x))
    strided = np.lib.stride_tricks.as_strided
    nrows = len(x_ext) - n + 1
    s = x_ext.strides[0]
    return strided(x_ext, shape=(nrows, n), strides=(s, s))
Sample run -
In [56]: x
Out[56]: array([5, 6, 7, 4])
In [57]: nan_concat_sliding_windows(x)
Out[57]:
array([[ nan, nan, nan, 5.],
[ nan, nan, 5., 6.],
[ nan, 5., 6., 7.],
[ 5., 6., 7., 4.]])
Thus, to get sliding median values for an array x, we would have a vectorized solution, like so-
np.nanmedian(nan_concat_sliding_windows(x), axis=1)
Hence, the final solution would be -
In [54]: df
Out[54]:
a
1 5
2 7
3 6
4 4
In [55]: pd.Series(np.nanmedian(nan_concat_sliding_windows(df.a.values), axis=1))
Out[55]:
0 5.0
1 6.0
2 6.0
3 5.5
dtype: float64
A faster solution for the specific cumulative median
In [1]: import timeit
In [2]: setup = """import bisect
...: import pandas as pd
...: def cummedian():
...:     l = []
...:     info = [0, True]
...:     def inner(n):
...:         bisect.insort(l, n)
...:         info[0] += 1
...:         info[1] = not info[1]
...:         median = info[0] // 2
...:         if info[1]:
...:             return (l[median] + l[median - 1]) / 2
...:         else:
...:             return l[median]
...:     return inner
...: df = pd.DataFrame({'a': range(20)})"""
In [3]: timeit.timeit("df['cummedian'] = df['a'].apply(cummedian())",setup=setup,number=100000)
Out[3]: 27.11604686321956
In [4]: timeit.timeit("df['expanding'] = df['a'].expanding().median()",setup=setup,number=100000)
Out[4]: 48.457676260100335
In [5]: 48.4576/27.116
Out[5]: 1.7870482372031273
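For completeness, a small usage sketch of the closure above on the question's data (my addition); note the closure keeps state between calls, so build a fresh cummedian() per column:
df = pd.DataFrame({'a': [5, 7, 6, 4]})
df['a'].apply(cummedian())
# 0    5.0
# 1    6.0
# 2    6.0
# 3    5.5
# Name: a, dtype: float64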
What is the most compact way to make a matrix from a table with numpy?
I have a table of values, where the 1st column is x, the 2nd is y and the 3rd is z. The z values are all unique, and each (x, y) pair comes from the combinations of the x and y values. Here is an example:
0.0 0.0 949219540.0
0.0 0.5 944034910.0
0.0 1.0 938508543.0
0.0 1.5 930093905.0
0.0 2.0 922076484.0
50.0 0.0 911497861.0
50.0 0.5 903224763.0
50.0 1.0 900406431.0
50.0 1.5 890658529.0
50.0 2.0 880907404.0
100.0 0.0 883527077.0
100.0 0.5 911683042.0
........ # and so on
basically this is a 9x5 matrix (x values down the side, y values across the top):
0.0 0.0 0.5 1.0 1.5 2.0
0.0 0.949 0.944 0.939 0.93 0.922
50.0 0.911 0.903 0.9 0.891 0.881
100.0 0.884 0.912 0.84 0.839 0.851
150.0 0.85 0.84 0.799 0.844 0.863
200.0 0.84 0.79 0.806 0.847 0.745
250.0 0.789 0.78 0.748 0.719 0.759
300.0 0.761 0.783 0.714 0.766 0.698
350.0 0.737 0.757 0.792 0.705 0.665
400.0 0.801 0.797 0.57 0.628 0.532
Now, for this I take set(x) and set(y) to get rid of duplicates, reshape z using the lengths of x and y, and then vstack and hstack to concatenate x, y, z. I believe this is quite a common operation in data processing, and maybe it has a one-step solution. Moreover, my way is not good when x and y are not in order, since set() can break the matrix logic.
This is basically the opposite of numpy.meshgrid.
For a one-liner, you can use scipy.interpolate.griddata:
grid = griddata(list(zip(x, y)), z,
                (x.reshape((len(set(y)), len(set(x)))),
                 y.reshape((len(set(y)), len(set(x))))),
                method='nearest')
Longer demonstration: let's say that we have a list of entries that completely covers a matrix. In NumPy, this is obtained by meshgrid:
In [1]: import numpy as np
In [2]: a = np.arange(0, 5)
In [3]: b = np.arange(6, 9)
In [4]: aa, bb = np.meshgrid(a, b)
And assign values to each element of the mesh:
In [5]: x, y = aa.flatten(), bb.flatten()
In [6]: z = np.ones(len(x))
These are the starting x, y, and z of the OP.
Now let's use griddata to get all values into a matrix. griddata is much more powerful than this, but with only one point per grid cell and a clearly equally spaced grid, the matrix comes out exact.
In [7]: points = list(zip(x, y))
In [8]: from scipy.interpolate import griddata
In [9]: grid = griddata(points, z,
                        (x.reshape((len(set(y)), len(set(x)))),
                         y.reshape((len(set(y)), len(set(x))))),
                        method='nearest')
In [10]: grid
Out[10]:
array([[ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.]])
In [11]: a, b = sorted(set(x)), sorted(set(y))
In [12]: np.hstack((np.concatenate(([0], b)).reshape((1, len(b) + 1)).T, np.vstack((a, grid))))
Out[12]:
array([[ 0., 0., 1., 2., 3., 4.],
[ 6., 1., 1., 1., 1., 1.],
[ 7., 1., 1., 1., 1., 1.],
[ 8., 1., 1., 1., 1., 1.]])
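For completeness, here is a pure-NumPy alternative (my addition, not from the answer above) that also copes with unordered x/y, since np.unique both deduplicates and sorts while reporting each row's position in the grid:
def table_to_matrix(x, y, z):
    xs, xi = np.unique(x, return_inverse=True)   # sorted unique x values and row positions
    ys, yi = np.unique(y, return_inverse=True)   # sorted unique y values and column positions
    grid = np.full((len(xs), len(ys)), np.nan)
    grid[xi, yi] = z                             # scatter each z into its (x, y) cell
    return xs, ys, grid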
I'm using pandas 0.13.0 and I'm trying to create a new column using apply() and a function named foo().
My dataframe is as follow:
df = pandas.DataFrame({
    'a': [ 0.0,  0.1,  0.2,  0.3],
    'b': [10.0, 20.0, 30.0, 40.0],
    'c': [ 1.0,  2.0,  3.0,  4.0]
})
df.set_index(df['a'], inplace=True)
So my dataframe is:
in: print df
out:
a b c
a
0.0 0.0 10.0 1.0
0.1 0.1 20.0 2.0
0.2 0.2 30.0 3.0
0.3 0.3 40.0 4.0
My function is as follow:
def foo(arg1, arg2):
    return arg1 * arg2
Now I want to create a column named 'd' using foo():
df['d'] = df.apply(foo(df['b'], df['c']), axis=1)
But I get the following error:
TypeError: ("'Series' object is not callable", u'occurred at index 0.0')
How can I use pandas.apply() with foo() when the index is made of floats?
Thanks
The problem here is that you are trying to process this row-wise, but you are passing whole Series as arguments, which is wrong. You could do it this way:
In [7]:
df['d'] = df.apply(lambda row: foo(row['b'], row['c']), axis=1)
df
Out[7]:
a b c d
a
0.0 0.0 10 1 10
0.1 0.1 20 2 40
0.2 0.2 30 3 90
0.3 0.3 40 4 160
A better way would be to just call your function directly:
In [8]:
df['d'] = foo(df['b'], df['c'])
df
Out[8]:
a b c d
a
0.0 0.0 10 1 10
0.1 0.1 20 2 40
0.2 0.2 30 3 90
0.3 0.3 40 4 160
The advantage with the above method is that it is vectorised and will perform the operation on the whole series rather than a row at a time.
In [15]:
%timeit df['d'] = df.apply(lambda row: foo(row['b'], row['c']), axis=1)
%timeit df['d'] = foo(df['b'], df['c'])
1000 loops, best of 3: 270 µs per loop
1000 loops, best of 3: 214 µs per loop
Not much difference here, now compare with a 400,000 row df:
In [18]:
%timeit df['d'] = df.apply(lambda row: foo(row['b'], row['c']), axis=1)
%timeit df['d'] = foo(df['b'], df['c'])
1 loops, best of 3: 5.84 s per loop
100 loops, best of 3: 8.68 ms per loop
So you see a ~672x speed-up here.
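If foo could not take whole Series (say it branched on scalar values, like the hypothetical foo_scalar below), a hedged alternative to row-wise apply is to express the branch with numpy.where so the work stays vectorised:
import numpy as np

def foo_scalar(a, b):
    # hypothetical scalar-only variant of foo with a branch
    return a * b if a > 0 else 0.0

# vectorised equivalent of applying foo_scalar(row['b'], row['c']) row by row
df['d'] = np.where(df['b'] > 0, df['b'] * df['c'], 0.0)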
I am looking for a succinct way to go from:
a = numpy.array([1,4,1,numpy.nan,2,numpy.nan])
to:
b = numpy.array([1,5,6,numpy.nan,8,numpy.nan])
The best I can do currently is:
b = numpy.insert(numpy.cumsum(a[numpy.isfinite(a)]), (numpy.argwhere(numpy.isnan(a)) - numpy.arange(len(numpy.argwhere(numpy.isnan(a))))), numpy.nan)
Is there a shorter way to accomplish the same? What about doing a cumsum along an axis of a 2D array?
Pandas is a library built on top of NumPy. Its Series class has a cumsum method, which preserves the NaNs and is considerably faster than the solution proposed by DSM:
In [15]: a = np.arange(10000.0)
In [16]: a[1] = np.nan
In [17]: %timeit a*0 + np.nan_to_num(a).cumsum()
1000 loops, best of 3: 465 us per loop
In [18]: s = pd.Series(a)
In [19]: s.cumsum()
Out[19]:
0 0
1 NaN
2 2
3 5
...
9996 49965005
9997 49975002
9998 49985000
9999 49994999
Length: 10000
In [20]: %timeit s.cumsum()
10000 loops, best of 3: 175 us per loop
How about (for not-too-big arrays):
In [34]: import numpy as np
In [35]: a = np.array([1,4,1,np.nan,2,np.nan])
In [36]: a*0 + np.nan_to_num(a).cumsum()
Out[36]: array([ 1., 5., 6., nan, 8., nan])
Masked arrays are for just this type of situation.
>>> import numpy as np
>>> from numpy import ma
>>> a = np.array([1,4,1,np.nan,2,np.nan])
>>> b = ma.masked_array(a,mask = (np.isnan(a) | np.isinf(a)))
>>> b
masked_array(data = [1.0 4.0 1.0 -- 2.0 --],
mask = [False False False True False True],
fill_value = 1e+20)
>>> c = b.cumsum()
>>> c
masked_array(data = [1.0 5.0 6.0 -- 8.0 --],
mask = [False False False True False True],
fill_value = 1e+20)
>>> c.filled(np.nan)
array([ 1., 5., 6., nan, 8., nan])
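As for the 2D part of the question (my addition), MaskedArray.cumsum accepts an axis argument, so the same idea extends directly:
>>> a2 = np.array([[1, 4, np.nan], [np.nan, 2, 3]])
>>> ma.masked_invalid(a2).cumsum(axis=1).filled(np.nan)
array([[  1.,   5.,  nan],
       [ nan,   2.,   5.]])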