Get Trend/Streak in Each Row of Pandas DataFrame - python

I have a Pandas DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame([['A', 0.1, 2.0, 1.0, 0.5, 0.3],
                   ['B', -0.3, -0.4, 0.1, 0.2, -1.0],
                   ['C', 0.1, -1.0, 4.0, -3.3, 1.0],
                   ['D', -0.1, -1.0, -4.0, -3.3, -1.0],
                   ['E', np.nan, np.nan, np.nan, np.nan, np.nan],
                   ['F', 4.0, np.nan, np.nan, np.nan, np.nan]
                   ], columns=['Group', '1', '2', '3', '4', '5'])
Group 1 2 3 4 5
0 A 0.1 2.0 1.0 0.5 0.3
1 B -0.3 -0.4 0.1 0.2 -1.0
2 C 0.1 -1.0 4.0 -3.3 1.0
3 D -0.1 -1.0 -4.0 -3.3 -1.0
4 E NaN NaN NaN NaN NaN
5 F 4.0 NaN NaN NaN NaN
For each row, I'd like to return the streak of consecutive positive or negative values, reading from left to right. So, the final DataFrame should be:
Group 1 2 3 4 5 Streak
0 A 0.1 2.0 1.0 0.5 0.3 5
1 B -0.3 -0.4 0.1 0.2 -1.0 -2
2 C 0.1 -1.0 4.0 -3.3 1.0 1
3 D -0.1 -1.0 -4.0 -3.3 -1.0 -5
4 E NaN NaN NaN NaN NaN 0
5 F 4.0 NaN NaN NaN NaN 1
The first row has a streak of +5 because all five values are positive. The second row has a streak of -2 because the first two columns are negative and the streak ends with a positive value in column 3. The third row has a streak of +1 because the second column has the opposite sign from the first. The fourth row is all negative, so its streak is -5. The fifth row is all NaN, so its streak is 0, and the last row has a single positive value followed by NaNs, so its streak is +1.

This is a bit long-winded, but it seems to do everything you need:
def streak(row):
    cols = row.keys()
    n_cols = len(cols)
    neg_streak = 0
    pos_streak = 0
    i_neg_streak = n_cols
    i_pos_streak = n_cols
    for icol_1 in range(n_cols - 1):
        for icol_2 in range(icol_1, n_cols):
            if (row.iloc[icol_1: icol_2 + 1] < 0).all():
                streak = icol_1 - icol_2 - 1
                if streak < neg_streak:
                    neg_streak = streak
                    i_neg_streak = icol_1
            elif (row.iloc[icol_1: icol_2 + 1] > 0).all():
                streak = 1 + icol_2 - icol_1
                if streak > pos_streak:
                    pos_streak = streak
                    i_pos_streak = icol_1
    if pos_streak == abs(neg_streak):
        if i_pos_streak < i_neg_streak:
            return pos_streak
        else:
            return neg_streak
    elif pos_streak > abs(neg_streak):
        return pos_streak
    else:
        return neg_streak
df = pd.DataFrame([['A', 0.1, 2.0, 1.0, 0.5, 0.3],
                   ['B', -0.3, -0.4, 0.1, 0.2, -1.0],
                   ['C', 0.1, -1.0, 4.0, -3.3, 1.0]
                   ], columns=['Group', '1', '2', '3', '4', '5'])
df = df.set_index('Group')
df['Streak'] = df.apply(lambda row: streak(row), axis=1)
df = df.reset_index()
print(df)

This did the trick and is more intuitive/vectorized:
a = (df[['1', '2', '3', '4', '5']] >= 0).values # Get True/False values
diff = a[:, :-1] == a[:, 1:] # Compare values from neighboring columns
So diff looks like this:
[[ True  True  True  True]
 [ True False  True False]
 [False False False False]
 [ True  True  True  True]
 [ True  True  True  True]
 [False  True  True  True]]
Then,
false_col = np.zeros((a.shape[0], 1), dtype=bool) # Create a column of False
diff = np.concatenate((diff, false_col), axis=1) # Add False column to end of diff
[[ True  True  True  True False]
 [ True False  True False False]
 [False False False False False]
 [ True  True  True  True False]
 [ True  True  True  True False]
 [False  True  True  True False]]
Next, we look for streaks of True by looking for the first occurrence of False:
df['Streak'] = np.argmin(diff, axis=1) + 1 # Add 1 to the index to get the streak
Finally, we adjust the sign of the streak value according to the sign of the first column:
df['Sign'] = df['1']
df['Sign'] = np.where(df['Sign'] > 0, 1, df['Sign'])
df['Sign'] = np.where(df['Sign'] < 0, -1, df['Sign'])
df['Sign'] = np.where(df['Sign'].isnull(), 0, df['Sign'])
df['Streak'] = df['Streak'] * df['Sign']
df['Streak'] = df['Streak'].astype(int)
df.drop('Sign', axis=1, inplace=True)
The final DataFrame looks like this:
Group 1 2 3 4 5 Streak
0 A 0.1 2.0 1.0 0.5 0.3 5
1 B -0.3 -0.4 0.1 0.2 -1.0 -2
2 C 0.1 -1.0 4.0 -3.3 1.0 1
3 D -0.1 -1.0 -4.0 -3.3 -1.0 -5
4 E NaN NaN NaN NaN NaN 0
5 F 4.0 NaN NaN NaN NaN 1
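
For reuse, the same steps can be wrapped in a small helper. This is just a sketch that repackages the code above; the value_cols default assumes the column names from the question, and (as above) the sign of the first column decides the sign of the streak:

import numpy as np
import pandas as pd

def add_streak(df, value_cols=('1', '2', '3', '4', '5')):
    vals = df[list(value_cols)]
    a = (vals >= 0).values                              # sign pattern; NaN compares as False
    diff = a[:, :-1] == a[:, 1:]                        # True where neighbouring columns share a sign
    false_col = np.zeros((a.shape[0], 1), dtype=bool)   # padding so every row has at least one False
    diff = np.concatenate((diff, false_col), axis=1)
    length = np.argmin(diff, axis=1) + 1                # position of the first sign change, plus 1
    sign = np.sign(vals.iloc[:, 0]).fillna(0).values    # -1, 0 or 1 taken from the first value column
    df['Streak'] = (length * sign).astype(int)
    return df

add_streak(df)   # adds the Streak column shown above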

I'm assuming you want the longest streak. Can't make any promises about ties... This answer uses itertools.groupby. First, a look under the hood so you can see what groupby is doing:
In [4]: from itertools import groupby
   ...: b = [-0.3, -0.4, 0.1, 0.2, -1.0]
   ...: for k, g in groupby(b, key=lambda x: x > 0.0):
   ...:     print(k, list(g))
False [-0.3, -0.4]
True [0.1, 0.2]
False [-1.0]
Now wrap that in a function, taking advantage of the groupings:
def streak(dfrow):
    longest = 0
    for k, g in groupby(dfrow, key=lambda x: False if x < 0 else True if x > 0 else np.nan):
        cur_streak = len(list(g))
        if np.isnan(k):
            continue
        if k:  # group is positive
            if abs(longest) < cur_streak:
                longest = cur_streak
        else:  # group is negative
            if abs(longest) < cur_streak:
                longest = -1 * cur_streak  # multiply by -1
    return longest
Use df.apply to apply function to each row:
In [6]: df.set_index('Group',inplace=True)
df['LongestStreak'] = df.apply(streak, axis=1)
Result:
In [281]: df
Out[281]: 1 2 3 4 5 LongestStreak
Group
A 0.1 2.0 1.0 0.5 0.3 5
B -0.3 -0.4 0.1 0.2 -1.0 -2
C 0.1 -1.0 4.0 -3.3 1.0 1
EDIT
Updated to address your new DataFrame and added a benchmark. Yours probably scales better, but I don't know how to modify your code to generate the results.
Results:
%%timeit
df['LongestStreak'] = df.apply(streak, axis=1)
1000 loops, best of 3: 473 µs per loop
%%timeit
a = (df[['1', '2', '3', '4', '5']] >= 0).values # Get True/False values
diff = a[:, :-1] == a[:, 1:]
false_col = np.zeros((a.shape[0], 1), dtype=bool) # Create a column of False
diff = np.concatenate((diff, false_col), axis=1)
df['Streak'] = np.argmin(diff, axis=1) + 1
df['Sign'] = df['1']
df['Sign'] = np.where(df['Sign'] > 0, 1, df['Sign'])
df['Sign'] = np.where(df['Sign'] < 0, -1, df['Sign'])
df['Sign'] = np.where(df['Sign'].isnull(), 0, df['Sign'])
df['Streak'] = df['Streak'] * df['Sign']
df['Streak'] = df['Streak'].astype(int)
df.drop('Sign', axis=1, inplace=True)
100 loops, best of 3: 2.94 ms per loop

Related

How do you calculate the sum based on certain numbers in the dataframe?

I have variables like this
import numpy as np
import pandas as pd

a = pd.DataFrame(np.array([[1, 1, 2, 3, 2], [2, 2, 3, 3, 2], [1, 2, 3, 2, 3]]))
b = np.array([0.1, 0.3, 0.5, 0.6, 0.2])
Display a
0 1 2 3 4
0 1 1 2 3 2
1 2 2 3 3 2
2 1 2 3 2 3
Display b
[0.1 0.3 0.5 0.6 0.2]
The result I want is the sum of the values in b based on the values of a, where for each row of a and each distinct value, the column positions at which that value appears serve as indices into b.
The final result that I want is like this.
0.4 0.7 0.6
0 0.6 1.1
0.1 0.9 0.7
How to obtain the first row in detail
0.4 0.7 0.6
so 0.4 is obtained from 0.1 + 0.3, based on the number 1 in the first row of a, i.e. since the indices are 0 and 1, we add b[0] and b[1]
0.7 is obtained from 0.5 + 0.2, based on the number 2 where the indices are 2 and 4, so we added b[2] + b[4]
0.6 based on the number 3 which is just b[3] because the index is 3
You can create one-hot encoded matrices to use in a dot product:
from pandas.api.types import CategoricalDtype

n = a.max().max()                                    # number of distinct values (1..n)
cat = CategoricalDtype(categories=np.arange(1, n + 1))
dummies = pd.get_dummies(a.T.astype(cat))            # one-hot encode each value, one block of n columns per row of a
b.dot(dummies).reshape(n, n)                         # weighted sums, one result row per row of a (here a.shape[0] == n == 3)
yields
array([[0.4, 0.7, 0.6],
[0. , 0.6, 1.1],
[0.1, 0.9, 0.7]])
This is one way you can do it. It is not optimized, but I think it follows your logic in a clear way:
df = pd.DataFrame(columns=range(1, a.max().max() + 1))
for i, r in a.iterrows():
    for c in list(df):
        df.loc[i, c] = np.sum(b[r[r == c].index.values])
df
1 2 3
0 0.4 0.7 0.6
1 0 0.6 1.1
2 0.1 0.9 0.7
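
For comparison, here is a compact alternative sketch (not from either answer above); it assumes a and b are defined exactly as in the question:

result = pd.DataFrame({v: (a.eq(v) * b).sum(axis=1)          # mask each value, weight by b, sum per row
                       for v in range(1, a.max().max() + 1)})
print(result)
#      1    2    3
# 0  0.4  0.7  0.6
# 1  0.0  0.6  1.1
# 2  0.1  0.9  0.7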

performance of a multi-class classifier using the 3 highest probabilities

I have a pandas dataframe as below
predictions.head()
Out[22]:
A B C D E G H L N \
0 0.718363 0.5 0.403466 0.5 0.5 0.458989 0.5 0.850190 0.620878
1 0.677776 0.5 0.366128 0.5 0.5 0.042405 0.5 0.894200 0.510644
2 0.682019 0.5 0.074347 0.5 0.5 0.562217 0.5 0.417786 0.539949
3 0.482981 0.5 0.065436 0.5 0.5 0.112383 0.5 0.743659 0.604382
4 0.700207 0.5 0.515825 0.5 0.5 0.078089 0.5 0.437839 0.249892
P R S U V LABEL
0 0.182169 0.483631 0.432915 0.328495 0.5 A
1 0.015789 0.523462 0.547838 0.691239 0.5 L
2 0.799223 0.603212 0.620806 0.335204 0.5 G
3 0.246766 0.399070 0.341081 0.229407 0.5 P
4 0.064734 0.822834 0.769277 0.512239 0.5 U
Each row contains the predicted probabilities for the different classes (columns).
The last column is the label (correct class).
I would like to evaluate the performance of the classifier while allowing 2 errors.
What I mean is that if the correct label is among the 3 highest probabilities, I consider the prediction correct.
Is there a smart way to do it in scikit-learn?
Try this approach:
In [57]: x = df.drop('LABEL',1).T.apply(lambda x: x.nlargest(3).index).T
In [58]: x
Out[58]:
0 1 2
0 L A N
1 L U A
2 P A S
3 L N B
4 R S A
In [59]: x.eq(df.LABEL, axis=0).any(1)
Out[59]:
0 True
1 True
2 False
3 False
4 False
dtype: bool
similar solution, which uses one transpose less:
In [66]: x = df.drop('LABEL',1).T.apply(lambda x: x.nlargest(3).index)
In [67]: x
Out[67]:
0 1 2 3 4
0 L L P L R
1 A U A N S
2 N A S B A
In [68]: x.eq(df.LABEL).any()
Out[68]:
0 True
1 True
2 False
3 False
4 False
dtype: bool
Source DF:
In [70]: df
Out[70]:
A B C D E G H L N P R S U V LABEL
0 0.718363 0.5 0.403466 0.5 0.5 0.458989 0.5 0.850190 0.620878 0.182169 0.483631 0.432915 0.328495 0.5 A
1 0.677776 0.5 0.366128 0.5 0.5 0.042405 0.5 0.894200 0.510644 0.015789 0.523462 0.547838 0.691239 0.5 L
2 0.682019 0.5 0.074347 0.5 0.5 0.562217 0.5 0.417786 0.539949 0.799223 0.603212 0.620806 0.335204 0.5 G
3 0.482981 0.5 0.065436 0.5 0.5 0.112383 0.5 0.743659 0.604382 0.246766 0.399070 0.341081 0.229407 0.5 P
4 0.700207 0.5 0.515825 0.5 0.5 0.078089 0.5 0.437839 0.249892 0.064734 0.822834 0.769277 0.512239 0.5 U
UPDATE: trying to reproduce the error (from comments):
In [81]: df
Out[81]:
a b c d e LABEL
0 1 2 3 4 5 c
1 3 4 5 6 7 d
In [82]: x = df.drop('LABEL',1).T.apply(lambda x: x.nlargest(3).index)
In [83]: x
Out[83]:
0 1
0 e e
1 d d
2 c c
In [84]: x.eq(df.LABEL).any()
Out[84]:
0 True
1 True
dtype: bool
PS I'm using Pandas 0.23.0
If performance is important, use numpy.argsort, removing the last column with iloc:
print (np.argsort(-df.iloc[:, :-1].values, axis=1)[:,:3])
[[ 7 0 8]
[ 7 12 0]
[ 9 0 11]
[ 7 8 1]
[10 11 0]]
v = df.columns[np.argsort(-df.iloc[:, :-1].values, axis=1)[:,:3]]
print (v)
Index([['L', 'A', 'N'], ['L', 'U', 'A'], ['P', 'A', 'S'], ['L', 'N', 'B'],
['R', 'S', 'A']],
dtype='object')
a = pd.DataFrame(v).eq(df['LABEL'], axis=0).any(axis=1)
print (a)
0 True
1 True
2 False
3 False
4 False
dtype: bool
Thanks, @MaxU, for another similar solution with numpy.argpartition (it only partially sorts, so the three columns come back in arbitrary order, which is fine for a membership check):
v = df.columns[np.argpartition(-df.iloc[:, :-1].values, 3, axis=1)[:,:3]]
Sample data:
df = pd.DataFrame({'A': [0.718363, 0.677776, 0.6820189999999999, 0.48298100000000005, 0.700207], 'B': [0.5, 0.5, 0.5, 0.5, 0.5], 'C': [0.403466, 0.366128, 0.074347, 0.06543600000000001, 0.515825], 'D': [0.5, 0.5, 0.5, 0.5, 0.5], 'E': [0.5, 0.5, 0.5, 0.5, 0.5], 'G': [0.45898900000000004, 0.042405, 0.562217, 0.112383, 0.07808899999999999], 'H': [0.5, 0.5, 0.5, 0.5, 0.5], 'L': [0.85019, 0.8942, 0.417786, 0.7436590000000001, 0.43783900000000003], 'N': [0.6208779999999999, 0.510644, 0.539949, 0.604382, 0.249892], 'P': [0.182169, 0.015788999999999997, 0.7992229999999999, 0.24676599999999999, 0.064734], 'R': [0.48363100000000003, 0.523462, 0.603212, 0.39907, 0.8228340000000001], 'S': [0.43291499999999994, 0.547838, 0.6208060000000001, 0.34108099999999997, 0.769277], 'U': [0.328495, 0.691239, 0.335204, 0.22940700000000003, 0.512239], 'V': [0.5, 0.5, 0.5, 0.5, 0.5], 'LABEL': ['A', 'L', 'G', 'P', 'U']})
print (df)
A B C D E G H L N \
0 0.718363 0.5 0.403466 0.5 0.5 0.458989 0.5 0.850190 0.620878
1 0.677776 0.5 0.366128 0.5 0.5 0.042405 0.5 0.894200 0.510644
2 0.682019 0.5 0.074347 0.5 0.5 0.562217 0.5 0.417786 0.539949
3 0.482981 0.5 0.065436 0.5 0.5 0.112383 0.5 0.743659 0.604382
4 0.700207 0.5 0.515825 0.5 0.5 0.078089 0.5 0.437839 0.249892
P R S U V LABEL
0 0.182169 0.483631 0.432915 0.328495 0.5 A
1 0.015789 0.523462 0.547838 0.691239 0.5 L
2 0.799223 0.603212 0.620806 0.335204 0.5 G
3 0.246766 0.399070 0.341081 0.229407 0.5 P
4 0.064734 0.822834 0.769277 0.512239 0.5 U
I can't think of a solution in sklearn, so here's one in pandas:
# Data
predictions
Out[]:
A B C D E G H L N P R S U V LABEL
0 0.718363 0.5 0.403466 0.5 0.5 0.458989 0.5 0.850190 0.620878 0.182169 0.483631 0.432915 0.328495 0.5 A
1 0.677776 0.5 0.366128 0.5 0.5 0.042405 0.5 0.894200 0.510644 0.015789 0.523462 0.547838 0.691239 0.5 L
2 0.682019 0.5 0.074347 0.5 0.5 0.562217 0.5 0.417786 0.539949 0.799223 0.603212 0.620806 0.335204 0.5 G
3 0.482981 0.5 0.065436 0.5 0.5 0.112383 0.5 0.743659 0.604382 0.246766 0.399070 0.341081 0.229407 0.5 P
4 0.700207 0.5 0.515825 0.5 0.5 0.078089 0.5 0.437839 0.249892 0.064734 0.822834 0.769277 0.512239 0.5 U
# Check if the label is in the top 3 (one line solution)
predictions.apply(lambda row: row['LABEL'] in list(row.drop('LABEL').sort_values().tail(3).index), axis=1)
Out[]:
0 True
1 True
2 False
3 False
4 False
Here is what is happening:
# List the top 3 results:
predictions.apply(lambda row: list(row.drop('LABEL').sort_values().tail(3).index), axis=1)
Out[]:
0 [N, A, L]
1 [A, U, L]
2 [S, A, P]
3 [V, N, L]
4 [A, S, R]
# Then check if the 'LABEL' is inside this list
You could also ask this question on Cross Validated, as people there use sklearn extensively.
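
As a side note, not from the answers above: the True/False Series these approaches produce can be averaged with .mean() to get a single top-3 accuracy figure, and newer scikit-learn versions (0.24 and later) ship a metric that does this directly. A sketch, assuming the predictions DataFrame from the question:

from sklearn.metrics import top_k_accuracy_score   # available in scikit-learn >= 0.24

classes = predictions.columns.drop('LABEL')
acc = top_k_accuracy_score(predictions['LABEL'],           # true labels
                           predictions[classes].values,    # one score column per class
                           k=3,
                           labels=list(classes))           # maps score columns to class names
print(acc)   # fraction of rows where the label is in the top 3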

Calculating a new column in pandas

I have a dataframe of historical election results and want to calculate an additional column that applies a basic math formula to the records for winning candidates and copies the value over for the rest of them.
Here is the code I tried:
va2 = va1[['contest_id', 'year', 'district', 'office', 'party_code',
           'pct_vote', 'winner']].drop_duplicates()
va2['vote_waste'] = va2['winner'].map(lambda x: (-.5) + va2['pct_vote']
                                      if x == 'w' else va2['pct_vote'])
This gave me a new column where every cell contained the result of the calculation for the entire pct_vote column (a whole Series), rather than a single value per row.
You can use numpy.where() to achieve what you want:
import pandas as pd
import numpy as np
data = {
'winner': pd.Series(['w', 'l', 'l', 'w', 'l']),
'pct_vote': pd.Series([0.4, 0.9, 0.9, 0.4, 0.9]),
'party_code': pd.Series([10, 20, 30, 40, 50])
}
df = pd.DataFrame(data)
print(df)
party_code pct_vote winner
0 10 0.4 w
1 20 0.9 l
2 30 0.9 l
3 40 0.4 w
4 50 0.9 l
df['vote_waste'] = np.where(
df['winner'] == 'w',
df['pct_vote'] - 0.5, #if condition is true, use this value
df['pct_vote'] #if condition is false, use this value
)
print(df)
party_code pct_vote winner vote_waste
0 10 0.4 w -0.1
1 20 0.9 l 0.9
2 30 0.9 l 0.9
3 40 0.4 w -0.1
4 50 0.9 l 0.9
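
Applied back to the asker's frame, the same pattern would look like this (a sketch assuming va2 has the winner and pct_vote columns from the question):

va2['vote_waste'] = np.where(va2['winner'] == 'w',
                             va2['pct_vote'] - 0.5,    # winners: apply the formula
                             va2['pct_vote'])          # everyone else: copy the value over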
This is because you are operating on a single element x against the whole series va2['pct_vote']. What you need is an element-wise operation between va2['winner'] and va2['pct_vote']. You could use apply to achieve that.
Consider a as winner and b as pct_vote:
df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'])
df
Out[23]:
a b c
0 1 2 3
1 4 5 6
df['new'] = df[['a', 'b']].apply(lambda x: (-0.5) + x[1] if x[0] == 1 else x[1], axis=1)
df
Out[42]:
a b c new
0 1 2 3 1.5
1 4 5 6 5.0

Fastest way to compute a function on DataFrame slices by column value (Python pandas)

I am trying to create a column on a data frame which contains, for each row, the minimum of column A (the value column) over all rows for which column B (the id column) has the same value. My code is really slow. I'm looking for a faster way to do this. Here is my little function:
def apply_by_id_value(df, id_col="id_col", val_col="val_col", offset_col="offset", f=min):
    for rid in set(df[id_col].values):
        df.loc[df[id_col] == rid, offset_col] = f(df[df[id_col] == rid][val_col])
    return df
And example usage:
import pandas as pd
import numpy as np
# create data frame
df = pd.DataFrame({"id_col":[0, 0, 0, 1, 1, 1, 2, 2, 2],
"val_col":[0.1, 0.2, 0.3, 0.6, 0.4, 0.5, 0.2, 0.1, 0.0]})
print df.head(10)
# output
id_col val_col
0 0 0.1
1 0 0.2
2 0 0.3
3 1 0.6
4 1 0.4
5 1 0.5
6 2 0.2
7 2 0.1
8 2 0.0
df = apply_by_id_value(df)
print df.head(10)
# output
id_col val_col offset
0 0 0.1 0.1
1 0 0.2 0.1
2 0 0.3 0.1
3 1 0.6 0.4
4 1 0.4 0.4
5 1 0.5 0.4
6 2 0.2 0.0
7 2 0.1 0.0
8 2 0.0 0.0
Some more context: In my real data, the "id_col" column has some 30000 or more unique values. This means that the data frame has to be sliced 30000 times. I imagine this is the bottleneck.
Perform a groupby on 'id_col' and then a transform, passing the function 'min'. This will return a Series aligned to your original df, so you can add it as a new column:
In [13]:
df = pd.DataFrame({"id_col":[0, 0, 0, 1, 1, 1, 2, 2, 2],
"val_col":[0.1, 0.2, 0.3, 0.6, 0.4, 0.5, 0.2, 0.1, 0.0]})
df['offset'] = df.groupby('id_col').transform('min')
df
Out[13]:
id_col val_col offset
0 0 0.1 0.1
1 0 0.2 0.1
2 0 0.3 0.1
3 1 0.6 0.4
4 1 0.4 0.4
5 1 0.5 0.4
6 2 0.2 0.0
7 2 0.1 0.0
8 2 0.0 0.0
timings
In [15]:
def apply_by_id_value(df, id_col="id_col", val_col="val_col", offset_col="offset", f=min):
    for rid in set(df[id_col].values):
        df.loc[df[id_col] == rid, offset_col] = f(df[df[id_col] == rid][val_col])
    return df
%timeit apply_by_id_value(df)
%timeit df.groupby('id_col').transform('min')
100 loops, best of 3: 8.12 ms per loop
100 loops, best of 3: 5.99 ms per loop
So the groupby and transform is faster on this dataset; I expect it to be significantly faster on your real dataset, as it will scale better.
For an 800,000 row df I get the following timings:
1 loops, best of 3: 611 ms per loop
1 loops, best of 3: 438 ms per loop
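
One small addition, not from the answer above: if the real frame has more than one value column, selecting the column explicitly before the transform keeps the result unambiguous:

df['offset'] = df.groupby('id_col')['val_col'].transform('min')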

Interpolate on the fly to get previous valid entry from pandas DataFrame

If I have an indexed pandas.DataFrame like this:
>>> Dxz = pandas.DataFrame({"x": [False, False, True], "z": [0, 2, 0], "p": [0.4, 0.2, 1]})
>>> Dxz.set_index(["x","z"], inplace=True)
>>> Dxz
p
x z
False 0 0.4
2 0.2
True 0 1.0
How do I get it to return the value of p for a valid index tuple, and the value at the previous existing index tuple when the requested index is not present? For example, assuming there were a method "lookup_or_interpolate", I'd like to see something like this:
>>> Dxz.lookup_or_interpolate((False, 0))["p"]
0.4
>>> Dxz.lookup_or_interpolate((False, 1))["p"]
0.4
>>> Dxz.lookup_or_interpolate((True, 23))["p"]
1.0
>>> Dxz
p
x z
False 0 0.4
2 0.2
True 0 1.0
use reindex:
import pandas as pd
Dxz = pd.DataFrame({"x": [False,False,True], "z": [0,2,0], "p": [0.4,0.2,1]})
Dxz.set_index(["x","z"], inplace=True)
idx = pd.MultiIndex.from_tuples([(False, 0), (False, 1), (False, 100), (True, 23)])
print(Dxz.reindex(idx, method="ffill"))
output:
p
False 0 0.4
1 0.4
100 0.2
True 23 1.0
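
If you want the lookup_or_interpolate helper the question sketches, it can be built on the same reindex/ffill idea. A minimal sketch, assuming the index is sorted so that forward-fill picks the previous existing entry:

import pandas as pd

def lookup_or_interpolate(df, key):
    idx = pd.MultiIndex.from_tuples([key])            # wrap the single lookup key in a MultiIndex
    return df.reindex(idx, method="ffill").iloc[0]    # exact match, or the previous existing entry

print(lookup_or_interpolate(Dxz, (False, 1))["p"])    # 0.4
print(lookup_or_interpolate(Dxz, (True, 23))["p"])    # 1.0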
