pandas get index of highest dot product - python

I have a dataframe like this:
df1 = pd.DataFrame({'a':[1,2,3,4],'b':[5,6,7,8],'c':[9,10,11,12]})
a b c
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
And I would like to create another column in this dataframe which stores, for every row, the index of the other row that yields the highest score when a dot product is performed against it.
For instance for the first row we'll compute the dot products against the other rows:
df1.drop(0).dot(df1.loc[0]).idxmax()
output: 3
Therefore I can create a function:
def get_highest(dataframe):
    lis = []
    for row in dataframe.index:
        temp = dataframe.drop(row).dot(dataframe.loc[row])
        lis.append(temp.idxmax())
    return lis
And I get what I want with:
df1['highest'] = get_highest(df1)
output:
a b c highest
0 1 5 9 3
1 2 6 10 3
2 3 7 11 3
3 4 8 12 2
Ok that's working, but the problem is that it doesn't scale AT ALL. Here are the timeit outputs for different numbers of rows:
4 rows: 2.87 ms
40 rows: 77.1 ms
400 rows: 700 ms
4000 rows: 10.4s
And I have to perform this on a dataframe which has roughly 240k rows and 3.3k columns. Therefore here is my question: Is there a way to optimize this calculation? (likely by addressing it in another way)
Thank you in advance.

Do a matrix multiplication with the transpose:
import numpy as np

mat_mul = np.dot(df.values, df.values.T)
Fill the diagonal with a small number so a row can never be its own maximum (I assumed all values are positive, so I filled with -1, but you can change this):
np.fill_diagonal(mat_mul, -1)
Now take the argmax of the array:
df['highest'] = mat_mul.argmax(axis=1)
Timings on a 10k by 4 df:
%%timeit
mat_mul = np.dot(df.values, df.values.T)
np.fill_diagonal(mat_mul, -1)
df['highest'] = mat_mul.argmax(axis=1)
1 loop, best of 3: 782 ms per loop
%timeit df['highest'] = get_highest(df)
1 loop, best of 3: 9.8 s per loop
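Note that for the ~240k-row case in the question, the full n x n product matrix would need roughly 240,000^2 x 8 bytes, i.e. about 460 GB, which won't fit in memory. A rough sketch of the same idea processed in row blocks (my own addition; get_highest_chunked is a hypothetical helper, and the chunk size is an arbitrary knob to trade memory for speed):
import numpy as np

def get_highest_chunked(values, chunk=1024):
    # values: 2D float array of shape (n_rows, n_cols)
    n = len(values)
    out = np.empty(n, dtype=np.int64)
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        # dot products of this block of rows against all rows: shape (stop-start, n)
        block = values[start:stop] @ values.T
        # mask the self-products so a row can never pick itself
        block[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        out[start:stop] = block.argmax(axis=1)
    return out

# df['highest'] = get_highest_chunked(df.values.astype(np.float64))
Each block still costs chunk x n x 8 bytes (about 2 GB per 1024-row block at n = 240k), so pick the chunk size to fit your RAM.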

Since the dot product of row i with row j is the same as that of row j with row i, the full pairwise dot-product array is symmetric. So, we can compute just the lower or upper triangular dot-product elements and then get the full form using SciPy's squareform. Thus, we would have an implementation like so -
from scipy.spatial.distance import squareform
arr = df1.values
R,C = np.triu_indices(arr.shape[0],1)
df1['highest'] = squareform(np.einsum('ij,ij->i',arr[R],arr[C])).argmax(1)
Output for sample case -
In [145]: df1
Out[145]:
a b c highest
0 1 5 9 3
1 2 6 10 3
2 3 7 11 3
3 4 8 12 2
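As a side note on the einsum call: 'ij,ij->i' multiplies the two arrays elementwise and sums over j, i.e. it computes the dot product of each row of the first array with the corresponding row of the second. A quick sanity check:
import numpy as np

X = np.array([[1, 2], [3, 4]])
Y = np.array([[5, 6], [7, 8]])
np.einsum('ij,ij->i', X, Y)  # array([17, 53])
# same as the row-wise dot products: 1*5 + 2*6 = 17, 3*7 + 4*8 = 53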

Index and save last N points from a list that meets conditions from dataframe Python

I have a DataFrame that contains gas concentrations and the corresponding valve number. This data was taken continuously where we switched the valves back and forth (valves=1 or 2) for a certain amount of time to get 10 cycles for each valve value (20 cycles total). A snippet of the data looks like this (I have 2,000+ points and each valve stayed on for about 90 seconds each cycle):
gas1 valveW time
246.9438 2 1
247.5367 2 2
246.7167 2 3
246.6770 2 4
245.9197 1 5
245.9518 1 6
246.9207 1 7
246.1517 1 8
246.9015 1 9
246.3712 2 10
247.0826 2 11
... ... ...
My goal is to save the last N points of each valve's cycle. For example, the first cycle where valve=1, I want to index and save the last N points from the end before the valve switches to 2. I would then save the last N points and average them to find one value to represent that first cycle. Then I want to repeat this step for the second cycle when valve=1 again.
I am currently converting from Matlab to Python so here is the Matlab code that I am trying to translate:
% NOAA high
n2o_noaaHigh = [];
co2_noaaHigh = [];
co_noaaHigh = [];
h2o_noaaHigh = [];
ind_noaaHigh_end = zeros(1,length(t_c));
numPoints = 40;
for i = 1:length(valveW_c)-1
    if (valveW_c(i) == 1 && valveW_c(i+1) ~= 1)
        test = (i-numPoints):i;
        ind_noaaHigh_end(test) = 1;
        n2o_noaaHigh = [n2o_noaaHigh mean(n2o_c(test))];
        co2_noaaHigh = [co2_noaaHigh mean(co2_c(test))];
        co_noaaHigh = [co_noaaHigh mean(co_c(test))];
        h2o_noaaHigh = [h2o_noaaHigh mean(h2o_c(test))];
    end
end
ind_noaaHigh_end = logical(ind_noaaHigh_end);
This is what I have so far for Python:
# NOAA high
n2o_noaaHigh = []
co2_noaaHigh = []
co_noaaHigh = []
h2o_noaaHigh = []
t_c_High = []  # time
for i in range(len(valveW_c)):
    # NOAA HIGH
    if valveW_c[i] == 1:
        t_c_High.append(t_c[i])
        n2o_noaaHigh.append(n2o_c[i])
        co2_noaaHigh.append(co2_c[i])
        co_noaaHigh.append(co_c[i])
        h2o_noaaHigh.append(h2o_c[i])
Thanks in advance!
I'm not sure if I understood correctly, but I guess this is what you are looking for:
# First we create a column to show cycles:
df['cycle'] = (df.valveW.diff() != 0).cumsum()
print(df)
gas1 valveW time cycle
0 246.9438 2 1 1
1 247.5367 2 2 1
2 246.7167 2 3 1
3 246.677 2 4 1
4 245.9197 1 5 2
5 245.9518 1 6 2
6 246.9207 1 7 2
7 246.1517 1 8 2
8 246.9015 1 9 2
9 246.3712 2 10 3
10 247.0826 2 11 3
Now you can use the groupby method to get the average of the last n points of each cycle:
n = 3 #we assume this is n
df.groupby('cycle').apply(lambda x: x.iloc[-n:, 0].mean())
Output:
cycle
1    246.9768
2    246.6579
3    246.7269
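Equivalently, groupby has a tail method, so you can avoid the lambda (same assumptions as above, reusing the cycle column we just created):
n = 3
# keep the last n rows of each cycle, then average gas1 per cycle
cycle_means = df.groupby('cycle').tail(n).groupby('cycle')['gas1'].mean()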
Let's call your DataFrame df; then you could do:
results = {}
for k, v in df.groupby((df['valveW'].shift() != df['valveW']).cumsum()):
    results[k] = v
    print(f'[group {k}]')
    print(v)
shift(), as its name suggests, shifts the valve column by one row, which lets us detect where the value changes. Then, cumsum() gives a unique number to each run of identical values. We can then do a groupby() on this column (which was not possible before, because the groups were either all the ones or all the twos!).
which gives e.g. for your code snippet (saved in results):
[group 1]
gas1 valveW time
0 246.9438 2 1
1 247.5367 2 2
2 246.7167 2 3
3 246.6770 2 4
[group 2]
gas1 valveW time
4 245.9197 1 5
5 245.9518 1 6
6 246.9207 1 7
7 246.1517 1 8
8 246.9015 1 9
[group 3]
gas1 valveW time
9 246.3712 2 10
10 247.0826 2 11
Then, to get the mean for each cycle, you could e.g. do:
df.groupby((df['valveW'].shift() != df['valveW']).cumsum()).mean()
which gives (again for your code snippet):
gas1 valveW time
valveW
1 246.96855 2.0 2.5
2 246.36908 1.0 7.0
3 246.72690 2.0 10.5
where you wouldn't care much about the time mean but the gas1 one!
Then, based on results you could e.g. do:
n = 3
mean_n_last = []
for k, v in results.items():
    if len(v) < n:
        mean_n_last.append(np.nan)
    else:
        mean_n_last.append(np.nanmean(v.iloc[len(v) - n:, 0]))
which gives [246.9768, 246.65796666666665, nan] for n = 3!
If your dataframe is sorted by time you could get the last N records for each valve like this.
N=2
valve1 = df[df['valveW']==1].iloc[-N:,:]
valve2 = df[df['valveW']==2].iloc[-N:,:]
If it isn't currently sorted, you can easily sort it first (note that sort_values returns a new DataFrame rather than sorting in place):
df = df.sort_values(by=['time'])
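Putting the pieces from the answers above together, here is a minimal sketch of the full Matlab logic (mean of the last numPoints rows of every valveW == 1 cycle); the column names are taken from the snippet in the question:
numPoints = 40

# label consecutive runs of the same valve value as cycles
df['cycle'] = (df['valveW'].diff() != 0).cumsum()

# keep only the valve-1 rows, take the last numPoints rows of each cycle,
# and average the gas column per cycle
gas1_noaaHigh = (df[df['valveW'] == 1]
                 .groupby('cycle')
                 .tail(numPoints)
                 .groupby('cycle')['gas1']
                 .mean())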

Rolling sum on a dynamic window

I am new to python and the last time I coded was in the mid-80's so I appreciate your patient help.
It seems .rolling(window) requires the window to be a fixed integer. I need a rolling window where the window or lookback period is dynamic and given by another column.
In the table below, I seek the Lookbacksum which is the rolling sum of Data as specified by the Lookback column.
d={'Data':[1,1,1,2,3,2,3,2,1,2],
'Lookback':[0,1,2,2,1,3,3,2,3,1],
'LookbackSum':[1,2,3,4,5,8,10,7,8,3]}
df=pd.DataFrame(data=d)
eg:
Data Lookback LookbackSum
0 1 0 1
1 1 1 2
2 1 2 3
3 2 2 4
4 3 1 5
5 2 3 8
6 3 3 10
7 2 2 7
8 1 3 8
9 2 1 3
You can create a custom function for use with df.apply, eg:
def lookback_window(row, values, lookback, method='sum', *args, **kwargs):
    loc = values.index.get_loc(row.name)
    lb = lookback.loc[row.name]
    return getattr(values.iloc[loc - lb: loc + 1], method)(*args, **kwargs)
Then use it as:
df['new_col'] = df.apply(lookback_window, values=df['Data'], lookback=df['Lookback'], axis=1)
There may be some corner cases, but as long as your indices align and are unique, it should fulfil what you're trying to do.
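Since every window ends at the current row, the same result also falls out of a cumulative sum, with no per-row slicing at all; a sketch of that idea (my own addition, assuming a default RangeIndex and that Lookback never reaches back before row 0):
import numpy as np
import pandas as pd

d = {'Data': [1, 1, 1, 2, 3, 2, 3, 2, 1, 2],
     'Lookback': [0, 1, 2, 2, 1, 3, 3, 2, 3, 1]}
df = pd.DataFrame(data=d)

# prepend a 0 so csum[i] is the sum of the first i data points
csum = np.concatenate(([0], df['Data'].cumsum().to_numpy()))
idx = np.arange(len(df))
# the window of length Lookback+1 ending at row i is csum[i+1] - csum[i-Lookback]
df['LookbackSum'] = csum[idx + 1] - csum[idx - df['Lookback'].to_numpy()]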
Here is one with a list comprehension, which enumerates the index and value of the column df['Lookback'] and then gets each slice by reversing the values seen so far and slicing according to the column value:
df['LookbackSum'] = [sum(df.loc[:e,'Data'][::-1].to_numpy()[:i+1])
for e,i in enumerate(df['Lookback'])]
print(df)
Data Lookback LookbackSum
0 1 0 1
1 1 1 2
2 1 2 3
3 2 2 4
4 3 1 5
5 2 3 8
6 3 3 10
7 2 2 7
8 1 3 8
9 2 1 3
An exercise in pain, if you want to try an almost fully vectorized approach. Sidenote: I don't think it's worth it here. At all.
Inspired by Divakar's answer here
Given:
import numpy as np
import pandas as pd
d={'Data':[1,1,1,2,3,2,3,2,1,2],
'Lookback':[0,1,2,2,1,3,3,2,3,1],
'LookbackSum':[1,2,3,4,5,8,10,7,8,3]}
df=pd.DataFrame(data=d)
Using the function from Divakar's answer, but slightly modified
from skimage.util.shape import view_as_windows as viewW
def strided_indexing_roll(a, r, fill_value=np.nan):
    # Concatenate with sliced to cover all rolls
    p = np.full((a.shape[0], a.shape[1]-1), fill_value)
    a_ext = np.concatenate((p, a, p), axis=1)
    # Get sliding windows; use advanced-indexing to select appropriate ones
    n = a.shape[1]
    return viewW(a_ext, (1, n))[np.arange(len(r)), -r + (n-1), 0]
Now, we just need to prepare a 2d array for the data and independently shift the rows according to our desired lookback values.
arr = df['Data'].to_numpy().reshape(1, -1).repeat(len(df), axis=0)
shifter = np.arange(len(df) - 1, -1, -1) #+ d['Lookback'] - 1
temp = strided_indexing_roll(arr, shifter, fill_value=0)
out = strided_indexing_roll(temp, (len(df) - 1 - df['Lookback'])*-1, 0).sum(-1)
Output:
array([ 1, 2, 3, 4, 5, 8, 10, 7, 8, 3], dtype=int64)
We can then just assign it back to the dataframe as needed and check.
df['out'] = out
#output:
Data Lookback LookbackSum out
0 1 0 1 1
1 1 1 2 2
2 1 2 3 3
3 2 2 4 4
4 3 1 5 5
5 2 3 8 8
6 3 3 10 10
7 2 2 7 7
8 1 3 8 8
9 2 1 3 3

Filtering rows from dataframe based on the values of the previous rows

I have a dataframe like the following:
A
1 1000
2 1000
3 1001
4 1001
5 10
6 1000
7 1010
8 9
9 10
10 6
11 999
12 10110
13 10111
14 1000
I am trying to clean my dataframe in the following way:
For every row whose value is more than 1.5 times the previous row's value, or less than 0.5 times it, drop that row.
But if the previous row is itself a to-drop row, the comparison must be made with the nearest previous NON-to-drop row (for example, indices 9, 10 and 13 in my dataframe).
So the final dataframe should be like:
A
1 1000
2 1000
3 1001
4 1001
6 1000
7 1010
11 999
14 1000
My dataframe is really huge so performance is appreciated.
You can't get away from looping through each row
Tips:
Avoid creating new (expensive to create) objects for each row
Use a memory-efficient iteration
I'd use a generator: I'll pass a Series to a function and yield the index values of the rows that satisfy the conditions.
def f(s):
    it = s.items()  # use items(); iteritems() was removed in modern pandas
    i, v = next(it)
    yield i  # yield the first index unconditionally
    for j, x in it:
        if .5 * v <= x <= 1.5 * v:
            yield j  # yield the indices that satisfy the condition
            v = x    # update the comparison value
df.loc[list(f(df.A))]  # use `loc` with the index values yielded by the generator
A
1 1000
2 1000
3 1001
4 1001
6 1000
7 1010
11 999
14 1000
One alternative could be to use itertools.accumulate to push forward the last valid value and then filter out the values that are different from the original, e.g:
from itertools import accumulate

def change(x, y, pct=0.5):
    if pct * x <= y <= (1 + pct) * x:
        return y
    return x
# create a mask filtering out the values that are different from the original A
mask = (df.A == list(accumulate(df.A, change)))
print(df[mask])
Output
A
1 1000
2 1000
3 1001
4 1001
6 1000
7 1010
11 999
14 1000
Just to get an idea, see how the accumulated column (change) compares to the original side-by-side:
A change
1 1000 1000
2 1000 1000
3 1001 1001
4 1001 1001
5 10 1001
6 1000 1000
7 1010 1010
8 9 1010
9 10 1010
10 6 1010
11 999 999
12 10110 999
13 10111 999
14 1000 1000
Update
To set the percentage in the function call, do:
mask = (df.A == list(accumulate(df.A, lambda x, y : change(x, y, pct=0.5))))
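If the frame really is huge, the same single pass can also be compiled, e.g. with Numba (my own sketch, not from the answers above; keep_mask is a hypothetical helper name):
import numpy as np
from numba import njit

@njit
def keep_mask(a, lo=0.5, hi=1.5):
    # single pass: keep a value if it lies within [lo*v, hi*v] of the
    # last kept value v; the first value is always kept
    mask = np.zeros(len(a), dtype=np.bool_)
    mask[0] = True
    v = a[0]
    for i in range(1, len(a)):
        if lo * v <= a[i] <= hi * v:
            mask[i] = True
            v = a[i]
    return mask

df_clean = df[keep_mask(df['A'].to_numpy(dtype=np.float64))]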

Comparing values from Pandas data frame column using offset values from another column

I have a data frame as:
Time InvInstance
5 5
8 4
9 3
19 2
20 1
3 3
8 2
13 1
The Time variable is sorted and the InvInstance variable denotes the number of rows to the end of a Time block. I want to create another column showing whether a crossover condition is met within the Time column. I can do it with a for loop like this:
import pandas as pd
import numpy as np
df = pd.read_csv("test.csv")
df["10mMark"] = 0
for i in range(1, len(df)):
    r = int(df.InvInstance.iloc[i])
    rprev = int(df.InvInstance.iloc[i-1])
    m = df['Time'].iloc[i+r-1] - df['Time'].iloc[i]
    mprev = df['Time'].iloc[i-1+rprev-1] - df['Time'].iloc[i-1]
    df["10mMark"].iloc[i] = np.where((m < 10) & (mprev >= 10), 1, 0)
And the desired output is:
Time InvInstance 10mMark
5 5 0
8 4 0
9 3 0
19 2 1
20 1 0
3 3 0
8 2 1
13 1 0
To be more specific: there are 2 sorted time blocks in the Time column, and going row by row we know the distance (in terms of rows) to the end of each block from the value of InvInstance. The question is whether the time difference between a row and the end of its block is less than 10 minutes while in the previous row it was at least 10. Is it possible to do this without an explicit loop, using vectorized operations such as shift(), so that it runs much faster?
I don't see how to shift a Series/array by a non-scalar (vector) step using the internal vectorized Pandas/NumPy methods, but we can use Numba here:
from numba import jit

@jit
def dyn_shift(s, step):
    assert len(s) == len(step), "[s] and [step] should have the same length"
    assert isinstance(s, np.ndarray), "[s] should have [numpy.ndarray] dtype"
    assert isinstance(step, np.ndarray), "[step] should have [numpy.ndarray] dtype"
    N = len(s)
    res = np.empty(N, dtype=s.dtype)
    for i in range(N):
        res[i] = s[i+step[i]-1]
    return res
mask1 = dyn_shift(df.Time.values, df.InvInstance.values) - df.Time < 10
mask2 = (dyn_shift(df.Time.values, df.InvInstance.values) - df.Time).shift() >= 10
df['10mMark'] = np.where(mask1 & mask2,1,0)
result:
In [6]: df
Out[6]:
Time InvInstance 10mMark
0 5 5 0
1 8 4 0
2 9 3 0
3 19 2 1
4 20 1 0
5 3 3 0
6 8 2 1
7 13 1 0
Timing for an 8,000-row DF:
In [13]: df = pd.concat([df] * 10**3, ignore_index=True)
In [14]: df.shape
Out[14]: (8000, 3)
In [15]: %%timeit
    ...: df["10mMark"] = 0
    ...: for i in range(1,len(df)):
    ...:     r = int(df.InvInstance.iloc[i])
    ...:     rprev = int(df.InvInstance.iloc[i-1])
    ...:     m = df['Time'].iloc[i+r-1] - df['Time'].iloc[i]
    ...:     mprev = df['Time'].iloc[i-1+rprev-1] - df['Time'].iloc[i-1]
    ...:     df["10mMark"].iloc[i] = np.where((m < 10) & (mprev >= 10),1,0)
    ...:
3.06 s ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [16]: %%timeit
...: mask1 = dyn_shift(df.Time.values, df.InvInstance.values) - df.Time < 10
...: mask2 = (dyn_shift(df.Time.values, df.InvInstance.values) - df.Time).shift() >= 10
...: df['10mMark'] = np.where(mask1 & mask2,1,0)
...:
1.02 ms ± 21.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
speed-up factor:
In [17]: 3.06 * 1000 / 1.02
Out[17]: 3000.0
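For what it's worth, dyn_shift is just a gather, so plain fancy indexing can do the same thing without the Numba dependency (my own sketch; it assumes a default RangeIndex, and that i + InvInstance[i] - 1 stays in bounds, which holds here because InvInstance counts the rows to the end of the block):
import numpy as np

time = df['Time'].to_numpy()
# position of the last row of each block, per row: i + InvInstance[i] - 1
end_pos = np.arange(len(df)) + df['InvInstance'].to_numpy() - 1
m = time[end_pos] - time          # time remaining in the block
mprev = np.roll(m, 1)             # m of the previous row
mprev[0] = 0                      # the first row has no predecessor
df['10mMark'] = ((m < 10) & (mprev >= 10)).astype(int)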
Actually, your m is the time delta between the time of a row and the time at the end of the 'block', and mprev is the same thing but for the previous row (so it's effectively a shift of m). My idea is to create a column containing the time at the end of each block: first identify each block, then merge with the last time obtained from a groupby on block. Then compute the difference to create a column 'm', and use np.where and shift to finally fill the 10mMark column.
# a column with an incremental value at each block end
df['block'] = df.InvInstance[df.InvInstance == 1].cumsum()
# back-fill the number so all rows of a block share the same value
df['block'] = df['block'].bfill()
# now merge to create a column Time_last with the time at the end of the block
df = df.merge(df.groupby('block', as_index=False)['Time'].last(), on='block', suffixes=('', '_last'), how='left')
# create column m as the difference
df['m'] = df['Time_last'] - df['Time']
# now use np.where and shift on this column to create the 10mMark column
df['10mMark'] = np.where((df['m'] < 10) & (df['m'].shift() >= 10), 1, 0)
# just drop the helper columns
df = df.drop(['block', 'Time_last', 'm'], axis=1)
Your final result before dropping, to see what has been created, looks like:
Time InvInstance block Time_last m 10mMark
0 5 5 1.0 20 15 0
1 8 4 1.0 20 12 0
2 9 3 1.0 20 11 0
3 19 2 1.0 20 1 1
4 20 1 1.0 20 0 0
5 3 3 2.0 13 10 0
6 8 2 2.0 13 5 1
7 13 1 2.0 13 0 0
in which the column 10mMark has the expected result
It is not as efficient as the Numba solution of @MaxU, but with a df of 8000 rows as he used, I get a speed-up factor of about 350.

compare whether value in one column is between two values in another column python pandas

I have two data frames as follow:
A = pd.DataFrame({"value":[3, 7, 5 ,18,23,27,21,29]})
B = pd.DataFrame({"low":[1, 6, 11 ,16,21,26], "high":[5,10,15,20,25,30], "name":["one","two","three","four","five", "six"]})
I want to find whether "value" in A is between "low" and "high" in B, and if so, I want to copy the column name from B to A.
The output should look like this:
A = pd.DataFrame({"value":[3, 7, 5 ,18,23,27,21,29], "name":["one","two","one","four","five", "six", "five", "six"]})
My function uses iterrows as follows:
def func1(row):
    x = row['value']
    for index, value in B.iterrows():
        if (value['low'] <= x) & (x <= value['high']):
            return value['name']
But it doesn't yet achieve what I want to do.
Thank you,
You can use a list comprehension to iterate through the values in A, and then use loc to get the relevant mapped values. le is less than or equal to, and ge is greater than or equal to.
For example, v = 3 in the first row. Using simple boolean indexing:
>>> B[(B['low'].le(v)) & (B['high'].ge(v))]
high low name
0 5 1 one
Assuming that DataFrame B does not have any overlapping ranges, you will get back one row, as above. One then uses loc to get the name column, as below. Because each returned name is a Series, you need to get the first (and only) scalar value (using iat, for example).
A['name'] = [B.loc[(B['low'].le(v)) & (B['high'].ge(v)), 'name'].iat[0]
for v in A['value']]
>>> A
value name
0 3 one
1 7 two
2 5 one
3 18 four
4 23 five
5 27 six
6 21 five
7 29 six
I believe you are looking for something like this:
In [1]: import pandas as pd
In [2]: A = pd.DataFrame({"value":[3, 7, 5 ,18,23,27,21,29]})
In [3]:
In [3]: B = pd.DataFrame({"low":[1, 6, 11 ,16,21,26], "high":[5,10,15,20,25,30], "name":["one","two","three","four","five", "six"]})
In [4]: A
Out[4]:
value
0 3
1 7
2 5
3 18
4 23
5 27
6 21
7 29
In [5]: B
Out[5]:
high low name
0 5 1 one
1 10 6 two
2 15 11 three
3 20 16 four
4 25 21 five
5 30 26 six
In [6]: def func1(x):
   ...:     for row in B.itertuples():
   ...:         if row.low <= x <= row.high:
   ...:             return row.name
   ...:
In [7]: A.value.map(func1)
Out[7]:
0 one
1 two
2 one
3 four
4 five
5 six
6 five
7 six
Name: value, dtype: object
In [8]: A['name'] = A['value'].map(func1)
In [9]: A
Out[9]:
value name
0 3 one
1 7 two
2 5 one
3 18 four
4 23 five
5 27 six
6 21 five
7 29 six
I use itertuples because it should be a little bit faster, but in general this will not be very efficient. This is a solution, but there might be better ones.
Edited to Add:
In [8]: timeit A['value'].map(func1)
100 loops, best of 3: 10.5 ms per loop
In [9]: timeit [B.loc[(B['low'].le(v)) & (B['high'].ge(v)), 'name'].tolist()[0] for v in A['value']]
100 loops, best of 3: 9.06 ms per loop
Quick and dirty test shows that Alexander's approach is faster. I wonder how it scales.
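For what it's worth, pandas also has an IntervalIndex that turns this into a direct lookup with no Python-level loop over B at all; a sketch (my own addition), assuming the ranges are inclusive on both ends, don't overlap, and cover every value in A:
import pandas as pd

A = pd.DataFrame({"value": [3, 7, 5, 18, 23, 27, 21, 29]})
B = pd.DataFrame({"low": [1, 6, 11, 16, 21, 26],
                  "high": [5, 10, 15, 20, 25, 30],
                  "name": ["one", "two", "three", "four", "five", "six"]})

# map each closed interval [low, high] to its name ...
iv = pd.IntervalIndex.from_arrays(B['low'], B['high'], closed='both')
lookup = pd.Series(B['name'].to_numpy(), index=iv)
# ... then .loc on an IntervalIndex finds the interval containing each point
A['name'] = lookup.loc[A['value'].to_numpy()].to_numpy()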
