Get Trend/Streak in Each Row of Pandas DataFrame - python

I have a Pandas DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame([['A', 0.1, 2.0, 1.0, 0.5, 0.3],
                   ['B', -0.3, -0.4, 0.1, 0.2, -1.0],
                   ['C', 0.1, -1.0, 4.0, -3.3, 1.0],
                   ['D', -0.1, -1.0, -4.0, -3.3, -1.0],
                   ['E', np.nan, np.nan, np.nan, np.nan, np.nan],
                   ['F', 4.0, np.nan, np.nan, np.nan, np.nan]
                   ], columns=['Group', '1', '2', '3', '4', '5'])
Group 1 2 3 4 5
0 A 0.1 2.0 1.0 0.5 0.3
1 B -0.3 -0.4 0.1 0.2 -1.0
2 C 0.1 -1.0 4.0 -3.3 1.0
3 D -0.1 -1.0 -4.0 -3.3 -1.0
4 E NaN NaN NaN NaN NaN
5 F 4.0 NaN NaN NaN NaN
For each row, I'd like to return the streak of consecutive positive or negative values, reading from left to right. So, the final DataFrame should be:
Group 1 2 3 4 5 Streak
0 A 0.1 2.0 1.0 0.5 0.3 5
1 B -0.3 -0.4 0.1 0.2 -1.0 -2
2 C 0.1 -1.0 4.0 -3.3 1.0 1
3 D -0.1 -1.0 -4.0 -3.3 -1.0 -5
4 E NaN NaN NaN NaN NaN 0
5 F 4.0 NaN NaN NaN NaN 1
The first row has a streak of +5 because all five values are positive. The second row has a streak of -2 because the first two columns are negative and the streak ends with a positive value in column 3. The third row has a streak of +1 because the second column has the opposite sign from the first. The fourth row is all negative, so its streak is -5. The fifth row is all NaN, so its streak is 0, and the last row has a single positive value followed by NaNs, so its streak is +1.

This is a bit long-winded, but it seems to do everything you need:
def streak(row):
    cols = row.keys()
    n_cols = len(cols)
    neg_streak = 0
    pos_streak = 0
    i_neg_streak = n_cols
    i_pos_streak = n_cols
    for icol_1 in range(n_cols - 1):
        for icol_2 in range(icol_1, n_cols):
            if (row.iloc[icol_1: icol_2 + 1] < 0).all():
                streak = icol_1 - icol_2 - 1
                if streak < neg_streak:
                    neg_streak = streak
                    i_neg_streak = icol_1
            elif (row.iloc[icol_1: icol_2 + 1] > 0).all():
                streak = 1 + icol_2 - icol_1
                if streak > pos_streak:
                    pos_streak = streak
                    i_pos_streak = icol_1
    if pos_streak == abs(neg_streak):
        if i_pos_streak < i_neg_streak:
            return pos_streak
        else:
            return neg_streak
    elif pos_streak > abs(neg_streak):
        return pos_streak
    else:
        return neg_streak
df = pd.DataFrame([['A', 0.1, 2.0, 1.0, 0.5, 0.3],
                   ['B', -0.3, -0.4, 0.1, 0.2, -1.0],
                   ['C', 0.1, -1.0, 4.0, -3.3, 1.0]
                   ], columns=['Group', '1', '2', '3', '4', '5'])
df = df.set_index('Group')
df['Streak'] = df.apply(lambda row: streak(row), axis=1)
df = df.reset_index()
print(df)

This did the trick and is more intuitive/vectorized:
a = (df[['1', '2', '3', '4', '5']] >= 0).values # Get True/False values
diff = a[:, :-1] == a[:, 1:] # Compare values from neighboring columns
So diff looks like this:
[[ True  True  True  True]
 [ True False  True False]
 [False False False False]
 [ True  True  True  True]
 [ True  True  True  True]
 [False  True  True  True]]
Then,
false_col = np.zeros((a.shape[0], 1), dtype=bool) # Create a column of False
diff = np.concatenate((diff, false_col), axis=1) # Add False column to end of diff
[[ True  True  True  True False]
 [ True False  True False False]
 [False False False False False]
 [ True  True  True  True False]
 [ True  True  True  True False]
 [False  True  True  True False]]
Next, we look for streaks of True by looking for the first occurrence of False:
df['Streak'] = np.argmin(diff, axis=1) + 1 # Add 1 to the index to get the streak
Finally, we adjust the sign of the streak value according to the sign of the first column:
df['Sign'] = df['1']
df['Sign'] = np.where(df['Sign'] > 0, 1, df['Sign'])
df['Sign'] = np.where(df['Sign'] < 0, -1, df['Sign'])
df['Sign'] = np.where(df['Sign'].isnull(), 0, df['Sign'])
df['Streak'] = df['Streak'] * df['Sign']
df['Streak'] = df['Streak'].astype(int)
df.drop('Sign', axis=1, inplace=True)
The final DataFrame looks like this:
Group 1 2 3 4 5 Streak
0 A 0.1 2.0 1.0 0.5 0.3 5
1 B -0.3 -0.4 0.1 0.2 -1.0 -2
2 C 0.1 -1.0 4.0 -3.3 1.0 1
3 D -0.1 -1.0 -4.0 -3.3 -1.0 -5
4 E NaN NaN NaN NaN NaN 0
5 F 4.0 NaN NaN NaN NaN 1
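
For reuse, the same steps can be wrapped in a small helper. This is just a sketch that repackages the code above; the value_cols default assumes the column names from the question, and (as above) the sign of the first column decides the sign of the streak:

import numpy as np
import pandas as pd

def add_streak(df, value_cols=('1', '2', '3', '4', '5')):
    vals = df[list(value_cols)]
    a = (vals >= 0).values                              # sign pattern; NaN compares as False
    diff = a[:, :-1] == a[:, 1:]                        # True where neighbouring columns share a sign
    false_col = np.zeros((a.shape[0], 1), dtype=bool)   # padding so every row has at least one False
    diff = np.concatenate((diff, false_col), axis=1)
    length = np.argmin(diff, axis=1) + 1                # position of the first sign change, plus 1
    sign = np.sign(vals.iloc[:, 0]).fillna(0).values    # -1, 0 or 1 taken from the first value column
    df['Streak'] = (length * sign).astype(int)
    return df

add_streak(df)   # adds the Streak column shown above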

I'm assuming you want the longest streak. Can't make any promises about ties... This answer uses itertools.groupby. First, a look under the hood so you can see what groupby is doing:
In [4]: from itertools import groupby
   ...: b = [-0.3, -0.4, 0.1, 0.2, -1.0]
   ...: for k, g in groupby(b, key=lambda x: x > 0.0):
   ...:     print(k, list(g))
False [-0.3, -0.4]
True [0.1, 0.2]
False [-1.0]
Now wrap that in a function, taking advantage of the groupings:
def streak(dfrow):
    longest = 0
    for k, g in groupby(dfrow, key=lambda x: False if x < 0 else True if x > 0 else np.nan):
        cur_streak = len(list(g))
        if np.isnan(k):
            continue
        if k:  # group is positive
            if abs(longest) < cur_streak:
                longest = cur_streak
        else:  # group is negative
            if abs(longest) < cur_streak:
                longest = -1 * cur_streak  # multiply by -1
    return longest
Use df.apply to apply function to each row:
In [6]: df.set_index('Group',inplace=True)
df['LongestStreak'] = df.apply(streak, axis=1)
Result:
In [281]: df
Out[281]: 1 2 3 4 5 LongestStreak
Group
A 0.1 2.0 1.0 0.5 0.3 5
B -0.3 -0.4 0.1 0.2 -1.0 -2
C 0.1 -1.0 4.0 -3.3 1.0 1
EDIT
Updated to address your new DataFrame and added a benchmark. Yours probably scales better, but I don't know how to modify your code to generate the results.
Results:
%%timeit
df['LongestStreak'] = df.apply(streak, axis=1)
1000 loops, best of 3: 473 µs per loop
%%timeit
a = (df[['1', '2', '3', '4', '5']] >= 0).values # Get True/False values
diff = a[:, :-1] == a[:, 1:]
false_col = np.zeros((a.shape[0], 1), dtype=bool) # Create a column of False
diff = np.concatenate((diff, false_col), axis=1)
df['Streak'] = np.argmin(diff, axis=1) + 1
df['Sign'] = df['1']
df['Sign'] = np.where(df['Sign'] > 0, 1, df['Sign'])
df['Sign'] = np.where(df['Sign'] < 0, -1, df['Sign'])
df['Sign'] = np.where(df['Sign'].isnull(), 0, df['Sign'])
df['Streak'] = df['Streak'] * df['Sign']
df['Streak'] = df['Streak'].astype(int)
df.drop('Sign', axis=1, inplace=True)
100 loops, best of 3: 2.94 ms per loop

Related

How do you calculate the sum based on certain numbers in the dataframe?

I have variables like this
import numpy as np
import pandas as pd

a = pd.DataFrame(np.array([[1, 1, 2, 3, 2], [2, 2, 3, 3, 2], [1, 2, 3, 2, 3]]))
b = np.array([0.1, 0.3, 0.5, 0.6, 0.2])
Display a
0 1 2 3 4
0 1 1 2 3 2
1 2 2 3 3 2
2 1 2 3 2 3
Display b
[0.1 0.3 0.5 0.6 0.2]
The result I want is the sum of the values in b based on the values of a, where for each row of a and each distinct value, the column positions at which that value appears serve as indices into b.
The final result that I want is like this.
0.4 0.7 0.6
0 0.6 1.1
0.1 0.9 0.7
How to obtain the first row in detail
0.4 0.7 0.6
so 0.4 is obtained from 0.1 + 0.3, based on the number 1 in the first row of a, i.e. since the indices are 0 and 1, we add b[0] and b[1]
0.7 is obtained from 0.5 + 0.2, based on the number 2 where the indices are 2 and 4, so we added b[2] + b[4]
0.6 based on the number 3 which is just b[3] because the index is 3
You can create one-hot encoded matrices to use in a dot product:
from pandas.api.types import CategoricalDtype

n = a.max().max()                                    # number of distinct values (1..n)
cat = CategoricalDtype(categories=np.arange(1, n + 1))
dummies = pd.get_dummies(a.T.astype(cat))            # one-hot encode each value, one block of n columns per row of a
b.dot(dummies).reshape(n, n)                         # weighted sums, one result row per row of a (here a.shape[0] == n == 3)
yields
array([[0.4, 0.7, 0.6],
[0. , 0.6, 1.1],
[0.1, 0.9, 0.7]])
This is one way you can do it. It is not optimized, but I think it follows your logic in a clear way:
df = pd.DataFrame(columns=range(1, a.max().max() + 1))
for i, r in a.iterrows():
    for c in list(df):
        df.loc[i, c] = np.sum(b[r[r == c].index.values])
df
1 2 3
0 0.4 0.7 0.6
1 0 0.6 1.1
2 0.1 0.9 0.7
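
For comparison, here is a compact alternative sketch (not from either answer above); it assumes a and b are defined exactly as in the question:

result = pd.DataFrame({v: (a.eq(v) * b).sum(axis=1)          # mask each value, weight by b, sum per row
                       for v in range(1, a.max().max() + 1)})
print(result)
#      1    2    3
# 0  0.4  0.7  0.6
# 1  0.0  0.6  1.1
# 2  0.1  0.9  0.7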

performance of a multi-class classifier using the 3 highest probabilities

I have a pandas dataframe as below
predictions.head()
Out[22]:
A B C D E G H L N \
0 0.718363 0.5 0.403466 0.5 0.5 0.458989 0.5 0.850190 0.620878
1 0.677776 0.5 0.366128 0.5 0.5 0.042405 0.5 0.894200 0.510644
2 0.682019 0.5 0.074347 0.5 0.5 0.562217 0.5 0.417786 0.539949
3 0.482981 0.5 0.065436 0.5 0.5 0.112383 0.5 0.743659 0.604382
4 0.700207 0.5 0.515825 0.5 0.5 0.078089 0.5 0.437839 0.249892
P R S U V LABEL
0 0.182169 0.483631 0.432915 0.328495 0.5 A
1 0.015789 0.523462 0.547838 0.691239 0.5 L
2 0.799223 0.603212 0.620806 0.335204 0.5 G
3 0.246766 0.399070 0.341081 0.229407 0.5 P
4 0.064734 0.822834 0.769277 0.512239 0.5 U
Each row contains the predicted probabilities for the different classes (columns).
The last column is the label (correct class).
I would like to evaluate the performance of the classifier while allowing 2 errors.
What I mean is that if the correct label is among the 3 highest probabilities, I consider the prediction correct.
Is there a smart way to do it in scikit-learn?
Try this approach:
In [57]: x = df.drop('LABEL',1).T.apply(lambda x: x.nlargest(3).index).T
In [58]: x
Out[58]:
0 1 2
0 L A N
1 L U A
2 P A S
3 L N B
4 R S A
In [59]: x.eq(df.LABEL, axis=0).any(1)
Out[59]:
0 True
1 True
2 False
3 False
4 False
dtype: bool
similar solution, which uses one transpose less:
In [66]: x = df.drop('LABEL',1).T.apply(lambda x: x.nlargest(3).index)
In [67]: x
Out[67]:
0 1 2 3 4
0 L L P L R
1 A U A N S
2 N A S B A
In [68]: x.eq(df.LABEL).any()
Out[68]:
0 True
1 True
2 False
3 False
4 False
dtype: bool
Source DF:
In [70]: df
Out[70]:
A B C D E G H L N P R S U V LABEL
0 0.718363 0.5 0.403466 0.5 0.5 0.458989 0.5 0.850190 0.620878 0.182169 0.483631 0.432915 0.328495 0.5 A
1 0.677776 0.5 0.366128 0.5 0.5 0.042405 0.5 0.894200 0.510644 0.015789 0.523462 0.547838 0.691239 0.5 L
2 0.682019 0.5 0.074347 0.5 0.5 0.562217 0.5 0.417786 0.539949 0.799223 0.603212 0.620806 0.335204 0.5 G
3 0.482981 0.5 0.065436 0.5 0.5 0.112383 0.5 0.743659 0.604382 0.246766 0.399070 0.341081 0.229407 0.5 P
4 0.700207 0.5 0.515825 0.5 0.5 0.078089 0.5 0.437839 0.249892 0.064734 0.822834 0.769277 0.512239 0.5 U
UPDATE: trying to reproduce the error (from comments):
In [81]: df
Out[81]:
a b c d e LABEL
0 1 2 3 4 5 c
1 3 4 5 6 7 d
In [82]: x = df.drop('LABEL',1).T.apply(lambda x: x.nlargest(3).index)
In [83]: x
Out[83]:
0 1
0 e e
1 d d
2 c c
In [84]: x.eq(df.LABEL).any()
Out[84]:
0 True
1 True
dtype: bool
PS I'm using Pandas 0.23.0
If performance is important, use numpy.argsort, removing the last column with iloc:
print (np.argsort(-df.iloc[:, :-1].values, axis=1)[:,:3])
[[ 7 0 8]
[ 7 12 0]
[ 9 0 11]
[ 7 8 1]
[10 11 0]]
v = df.columns[np.argsort(-df.iloc[:, :-1].values, axis=1)[:,:3]]
print (v)
Index([['L', 'A', 'N'], ['L', 'U', 'A'], ['P', 'A', 'S'], ['L', 'N', 'B'],
['R', 'S', 'A']],
dtype='object')
a = pd.DataFrame(v).eq(df['LABEL'], axis=0).any(axis=1)
print (a)
0 True
1 True
2 False
3 False
4 False
dtype: bool
Thanks, @MaxU, for another similar solution with numpy.argpartition (it only partially sorts, so the three columns come back in arbitrary order, which is fine for a membership check):
v = df.columns[np.argpartition(-df.iloc[:, :-1].values, 3, axis=1)[:,:3]]
Sample data:
df = pd.DataFrame({'A': [0.718363, 0.677776, 0.6820189999999999, 0.48298100000000005, 0.700207], 'B': [0.5, 0.5, 0.5, 0.5, 0.5], 'C': [0.403466, 0.366128, 0.074347, 0.06543600000000001, 0.515825], 'D': [0.5, 0.5, 0.5, 0.5, 0.5], 'E': [0.5, 0.5, 0.5, 0.5, 0.5], 'G': [0.45898900000000004, 0.042405, 0.562217, 0.112383, 0.07808899999999999], 'H': [0.5, 0.5, 0.5, 0.5, 0.5], 'L': [0.85019, 0.8942, 0.417786, 0.7436590000000001, 0.43783900000000003], 'N': [0.6208779999999999, 0.510644, 0.539949, 0.604382, 0.249892], 'P': [0.182169, 0.015788999999999997, 0.7992229999999999, 0.24676599999999999, 0.064734], 'R': [0.48363100000000003, 0.523462, 0.603212, 0.39907, 0.8228340000000001], 'S': [0.43291499999999994, 0.547838, 0.6208060000000001, 0.34108099999999997, 0.769277], 'U': [0.328495, 0.691239, 0.335204, 0.22940700000000003, 0.512239], 'V': [0.5, 0.5, 0.5, 0.5, 0.5], 'LABEL': ['A', 'L', 'G', 'P', 'U']})
print (df)
A B C D E G H L N \
0 0.718363 0.5 0.403466 0.5 0.5 0.458989 0.5 0.850190 0.620878
1 0.677776 0.5 0.366128 0.5 0.5 0.042405 0.5 0.894200 0.510644
2 0.682019 0.5 0.074347 0.5 0.5 0.562217 0.5 0.417786 0.539949
3 0.482981 0.5 0.065436 0.5 0.5 0.112383 0.5 0.743659 0.604382
4 0.700207 0.5 0.515825 0.5 0.5 0.078089 0.5 0.437839 0.249892
P R S U V LABEL
0 0.182169 0.483631 0.432915 0.328495 0.5 A
1 0.015789 0.523462 0.547838 0.691239 0.5 L
2 0.799223 0.603212 0.620806 0.335204 0.5 G
3 0.246766 0.399070 0.341081 0.229407 0.5 P
4 0.064734 0.822834 0.769277 0.512239 0.5 U
I can't think of a solution in sklearn, so here's one in pandas:
# Data
predictions
Out[]:
A B C D E G H L N P R S U V LABEL
0 0.718363 0.5 0.403466 0.5 0.5 0.458989 0.5 0.850190 0.620878 0.182169 0.483631 0.432915 0.328495 0.5 A
1 0.677776 0.5 0.366128 0.5 0.5 0.042405 0.5 0.894200 0.510644 0.015789 0.523462 0.547838 0.691239 0.5 L
2 0.682019 0.5 0.074347 0.5 0.5 0.562217 0.5 0.417786 0.539949 0.799223 0.603212 0.620806 0.335204 0.5 G
3 0.482981 0.5 0.065436 0.5 0.5 0.112383 0.5 0.743659 0.604382 0.246766 0.399070 0.341081 0.229407 0.5 P
4 0.700207 0.5 0.515825 0.5 0.5 0.078089 0.5 0.437839 0.249892 0.064734 0.822834 0.769277 0.512239 0.5 U
# Check if the label is in the top 3 (one line solution)
predictions.apply(lambda row: row['LABEL'] in list(row.drop('LABEL').sort_values().tail(3).index), axis=1)
Out[]:
0 True
1 True
2 False
3 False
4 False
Here is what is happening:
# List the top 3 results:
predictions.apply(lambda row: list(row.drop('LABEL').sort_values().tail(3).index), axis=1)
Out[]:
0 [N, A, L]
1 [A, U, L]
2 [S, A, P]
3 [V, N, L]
4 [A, S, R]
# Then check if the 'LABEL' is inside this list
You could also ask this question on Cross Validated, as people there use sklearn extensively.
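
As a side note, not from the answers above: the True/False Series these approaches produce can be averaged with .mean() to get a single top-3 accuracy figure, and newer scikit-learn versions (0.24 and later) ship a metric that does this directly. A sketch, assuming the predictions DataFrame from the question:

from sklearn.metrics import top_k_accuracy_score   # available in scikit-learn >= 0.24

classes = predictions.columns.drop('LABEL')
acc = top_k_accuracy_score(predictions['LABEL'],           # true labels
                           predictions[classes].values,    # one score column per class
                           k=3,
                           labels=list(classes))           # maps score columns to class names
print(acc)   # fraction of rows where the label is in the top 3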

Calculating a new column in pandas

I have a dataframe of historical election results and want to calculate an additional column that applies a basic math formula to the records for winning candidates and copies the value over for the rest of them.
Here is the code I tried:
va2 = va1[['contest_id', 'year', 'district', 'office', 'party_code',
           'pct_vote', 'winner']].drop_duplicates()
va2['vote_waste'] = va2['winner'].map(lambda x: (-.5) + va2['pct_vote']
                                      if x == 'w' else va2['pct_vote'])
This gave me a new column where every cell contained the result of the calculation for the entire pct_vote column (a whole Series), rather than a single value per row.
You can use numpy.where() to achieve what you want:
import pandas as pd
import numpy as np
data = {
'winner': pd.Series(['w', 'l', 'l', 'w', 'l']),
'pct_vote': pd.Series([0.4, 0.9, 0.9, 0.4, 0.9]),
'party_code': pd.Series([10, 20, 30, 40, 50])
}
df = pd.DataFrame(data)
print(df)
party_code pct_vote winner
0 10 0.4 w
1 20 0.9 l
2 30 0.9 l
3 40 0.4 w
4 50 0.9 l
df['vote_waste'] = np.where(
df['winner'] == 'w',
df['pct_vote'] - 0.5, #if condition is true, use this value
df['pct_vote'] #if condition is false, use this value
)
print(df)
party_code pct_vote winner vote_waste
0 10 0.4 w -0.1
1 20 0.9 l 0.9
2 30 0.9 l 0.9
3 40 0.4 w -0.1
4 50 0.9 l 0.9
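
Applied back to the asker's frame, the same pattern would look like this (a sketch assuming va2 has the winner and pct_vote columns from the question):

va2['vote_waste'] = np.where(va2['winner'] == 'w',
                             va2['pct_vote'] - 0.5,    # winners: apply the formula
                             va2['pct_vote'])          # everyone else: copy the value over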
This is because you are operating on a single element x against the whole series va2['pct_vote']. What you need is an element-wise operation between va2['winner'] and va2['pct_vote']. You could use apply to achieve that.
Consider a as winner and b as pct_vote:
df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'])
df
Out[23]:
a b c
0 1 2 3
1 4 5 6
df['new'] = df[['a', 'b']].apply(lambda x: (-0.5) + x[1] if x[0] == 1 else x[1], axis=1)
df
Out[42]:
a b c new
0 1 2 3 1.5
1 4 5 6 5.0

Fastest way to compute a function on DataFrame slices by column value (Python pandas)

I am trying to create a column on a data frame which contains, for each row, the minimum of column A (the value column) over all rows for which column B (the id column) has the same value. My code is really slow. I'm looking for a faster way to do this. Here is my little function:
def apply_by_id_value(df, id_col="id_col", val_col="val_col", offset_col="offset", f=min):
    for rid in set(df[id_col].values):
        df.loc[df[id_col] == rid, offset_col] = f(df[df[id_col] == rid][val_col])
    return df
And example usage:
import pandas as pd
import numpy as np
# create data frame
df = pd.DataFrame({"id_col":[0, 0, 0, 1, 1, 1, 2, 2, 2],
"val_col":[0.1, 0.2, 0.3, 0.6, 0.4, 0.5, 0.2, 0.1, 0.0]})
print df.head(10)
# output
id_col val_col
0 0 0.1
1 0 0.2
2 0 0.3
3 1 0.6
4 1 0.4
5 1 0.5
6 2 0.2
7 2 0.1
8 2 0.0
df = apply_by_id_value(df)
print df.head(10)
# output
id_col val_col offset
0 0 0.1 0.1
1 0 0.2 0.1
2 0 0.3 0.1
3 1 0.6 0.4
4 1 0.4 0.4
5 1 0.5 0.4
6 2 0.2 0.0
7 2 0.1 0.0
8 2 0.0 0.0
Some more context: In my real data, the "id_col" column has some 30000 or more unique values. This means that the data frame has to be sliced 30000 times. I imagine this is the bottleneck.
Perform a groupby on 'id_col' and then a transform, passing the function 'min'. This will return a Series aligned to your original df, so you can add it as a new column:
In [13]:
df = pd.DataFrame({"id_col":[0, 0, 0, 1, 1, 1, 2, 2, 2],
"val_col":[0.1, 0.2, 0.3, 0.6, 0.4, 0.5, 0.2, 0.1, 0.0]})
df['offset'] = df.groupby('id_col').transform('min')
df
Out[13]:
id_col val_col offset
0 0 0.1 0.1
1 0 0.2 0.1
2 0 0.3 0.1
3 1 0.6 0.4
4 1 0.4 0.4
5 1 0.5 0.4
6 2 0.2 0.0
7 2 0.1 0.0
8 2 0.0 0.0
timings
In [15]:
def apply_by_id_value(df, id_col="id_col", val_col="val_col", offset_col="offset", f=min):
    for rid in set(df[id_col].values):
        df.loc[df[id_col] == rid, offset_col] = f(df[df[id_col] == rid][val_col])
    return df
%timeit apply_by_id_value(df)
%timeit df.groupby('id_col').transform('min')
100 loops, best of 3: 8.12 ms per loop
100 loops, best of 3: 5.99 ms per loop
So the groupby and transform is faster on this dataset; I expect it to be significantly faster on your real dataset, as it will scale better.
For an 800,000 row df I get the following timings:
1 loops, best of 3: 611 ms per loop
1 loops, best of 3: 438 ms per loop
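
One small addition, not from the answer above: if the real frame has more than one value column, selecting the column explicitly before the transform keeps the result unambiguous:

df['offset'] = df.groupby('id_col')['val_col'].transform('min')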

Interpolate on the fly to get previous valid entry from pandas DataFrame

If I have an indexed pandas.DataFrame like this:
>>> Dxz = pandas.DataFrame({"x": [False, False, True], "z": [0, 2, 0], "p": [0.4, 0.2, 1]})
>>> Dxz.set_index(["x","z"], inplace=True)
>>> Dxz
p
x z
False 0 0.4
2 0.2
True 0 1.0
How do I get it to return the value of p for a valid index tuple, and the value at the previous existing index tuple when the requested index is not present? For example, assuming there were a method "lookup_or_interpolate", I'd like to see something like this:
>>> Dxz.lookup_or_interpolate((False, 0))["p"]
0.4
>>> Dxz.lookup_or_interpolate((False, 1))["p"]
0.4
>>> Dxz.lookup_or_interpolate((True, 23))["p"]
1.0
>>> Dxz
p
x z
False 0 0.4
2 0.2
True 0 1.0
use reindex:
import pandas as pd
Dxz = pd.DataFrame({"x": [False,False,True], "z": [0,2,0], "p": [0.4,0.2,1]})
Dxz.set_index(["x","z"], inplace=True)
idx = pd.MultiIndex.from_tuples([(False, 0), (False, 1), (False, 100), (True, 23)])
print(Dxz.reindex(idx, method="ffill"))
output:
p
False 0 0.4
1 0.4
100 0.2
True 23 1.0
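
If you want the lookup_or_interpolate helper the question sketches, it can be built on the same reindex/ffill idea. A minimal sketch, assuming the index is sorted so that forward-fill picks the previous existing entry:

import pandas as pd

def lookup_or_interpolate(df, key):
    idx = pd.MultiIndex.from_tuples([key])            # wrap the single lookup key in a MultiIndex
    return df.reindex(idx, method="ffill").iloc[0]    # exact match, or the previous existing entry

print(lookup_or_interpolate(Dxz, (False, 1))["p"])    # 0.4
print(lookup_or_interpolate(Dxz, (True, 23))["p"])    # 1.0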
