Interpolate on the fly to get previous valid entry from pandas DataFrame - python

If I have an indexed pandas.DataFrame like this:
>>> Dxz = pandas.DataFrame({"x": [False,False,True], "z": [0,2,0], "p": [0.4,0.2,1]})
>>> Dxz.set_index(["x","z"], inplace=True)
>>> Dxz
           p
x     z
False 0  0.4
      2  0.2
True  0  1.0
How do I get it to return me the value for p given a valid index tuple, and the value of the previous present index tuple if the index is not valid? For example, assuming it was a method “lookup_or_interpolate”, I'd like to see something like this:
>>> Dxz.lookup_or_interpolate((False, 0))["p"]
0.4
>>> Dxz.lookup_or_interpolate((False, 1))["p"]
0.4
>>> Dxz.lookup_or_interpolate((True, 23))["p"]
1.0
>>> Dxz
           p
x     z
False 0  0.4
      2  0.2
True  0  1.0

Use reindex with method="ffill", which fills each missing index tuple from the previous existing one:
import pandas as pd
Dxz = pd.DataFrame({"x": [False,False,True], "z": [0,2,0], "p": [0.4,0.2,1]})
Dxz.set_index(["x","z"], inplace=True)
print(Dxz.reindex(pd.MultiIndex.from_tuples([(False, 0), (False, 1), (False, 100), (True, 23)]), method="ffill"))
Output:
p
False 0 0.4
1 0.4
100 0.2
True 23 1.0
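If you only need single lookups, the same "previous present tuple" rule can also be sketched with the standard-library bisect module. The name lookup_or_interpolate comes from the question and is a hypothetical helper, not a pandas method. The sketch assumes the index is sorted and raises KeyError when nothing sits at or before the requested tuple.

```python
from bisect import bisect_right

import pandas as pd

Dxz = pd.DataFrame({"x": [False, False, True], "z": [0, 2, 0], "p": [0.4, 0.2, 1]})
Dxz = Dxz.set_index(["x", "z"]).sort_index()


def lookup_or_interpolate(frame, key):
    """Return the row at `key`, or at the closest preceding index tuple.

    Hypothetical helper; assumes `frame` has a sorted index.
    """
    keys = frame.index.tolist()          # sorted list of index tuples
    pos = bisect_right(keys, key) - 1    # position of the last tuple <= key
    if pos < 0:
        raise KeyError(key)              # nothing at or before `key`
    return frame.iloc[pos]


p_exact = lookup_or_interpolate(Dxz, (False, 0))["p"]
p_filled = lookup_or_interpolate(Dxz, (False, 1))["p"]
p_far = lookup_or_interpolate(Dxz, (False, 100))["p"]
p_last = lookup_or_interpolate(Dxz, (True, 23))["p"]
print(p_exact, p_filled, p_far, p_last)
```

The frame itself is never modified, which matches the question's requirement that Dxz looks the same after the lookups.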

Related

Pandas new column based on multiple criteria and columns

I want to create a new column for a big table using several criteria and columns, and was not sure of the best way to approach it.
df = pd.DataFrame({'a': ['A', "B", "B", "C", "D"],
'b':['y','n','y','n', np.nan], 'c':[10,20,10,40,30], 'd':[.3,.1,.4,.2, .1]})
df.head()
def fun(df=df):
    df=df.copy()
    if df.a=='A' & df.b =='n':
        df['new_Col'] = df.c+df.d
    if df.a=='A' & df.b =='y':
        df['new_Col'] = df.d *2
    else:
        df['new_Col'] = 0
    return df
fun()
OR
def fun(df=df):
    df=df.copy()
    if df.a=='A' & df.b =='n':
        return df.c+df.d
    if df.a=='A' & df.b =='y':
        return df.d *2
    else:
        return 0
df['new_Col'] = df.apply(fun)
OR using np.where:
df['new_Col'] = np.where(df.a=='A' & df.b =='n', df.c+df.d,0 )
df['new_Col'] = np.where(df.a=='A' & df.b =='y', df.d *2,0 )
Looks like you need np.select
a, n, y = df.a.eq('A'), df.b.eq('n'), df.b.eq('y')
df['result'] = np.select([a & n, a & y], [df.c + df.d, df.d*2], default=0)
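As a self-contained sketch of the np.select approach, here it is on the question's sample data plus one extra row with a = 'A' and b = 'n'. The extra row and the column name new_col are assumptions added only so that both conditions fire.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': ['A', 'B', 'B', 'C', 'D', 'A'],
    'b': ['y', 'n', 'y', 'n', np.nan, 'n'],
    'c': [10, 20, 10, 40, 30, 50],
    'd': [.3, .1, .4, .2, .1, .9],
})

# Boolean masks; .eq() returns False for the NaN in column b
is_a, is_n, is_y = df.a.eq('A'), df.b.eq('n'), df.b.eq('y')

# The first matching condition wins; rows matching none get the default
df['new_col'] = np.select(
    [is_a & is_n, is_a & is_y],
    [df.c + df.d, df.d * 2],
    default=0,
)
print(df)
```

The np.where attempts in the question fail for a separate reason: & binds more tightly than ==, so each comparison needs its own parentheses, e.g. (df.a == 'A') & (df.b == 'n').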
This is an arithmetic way (I added one more row to your sample for case a = 'A' and b = 'n'):
sample
Out[1369]:
a b c d
0 A y 10 0.3
1 B n 20 0.1
2 B y 10 0.4
3 C n 40 0.2
4 D NaN 30 0.1
5 A n 50 0.9
nc = df.a.eq('A') & df.b.eq('y')
mc = df.a.eq('A') & df.b.eq('n')
nr = df.d * 2
mr = df.c + df.d
df['new_col'] = nc*nr + mc*mr
Out[1371]:
a b c d new_col
0 A y 10 0.3 0.6
1 B n 20 0.1 0.0
2 B y 10 0.4 0.0
3 C n 40 0.2 0.0
4 D NaN 30 0.1 0.0
5 A n 50 0.9 50.9

Pandas conditional map/fill/replace

d1=pd.DataFrame({'x':['a','b','c','c'],'y':[-1,-2,-3,0]})
d2=pd.DataFrame({'x':['d','c','a','b'],'y':[0.1,0.2,0.3,0.4]})
I want to replace d1.y where y<0 with the corresponding y in d2. It's something like VLOOKUP in Excel. The core problem is to replace y according to x rather than simply manipulating y. What I want is
Out[40]:
x y
0 a 0.3
1 b 0.4
2 c 0.2
3 c 0.0
Use Series.map with condition:
s = d2.set_index('x')['y']
d1.loc[d1.y < 0, 'y'] = d1['x'].map(s)
print (d1)
x y
0 a 0.3
1 b 0.4
2 c 0.2
3 c 0.0
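A variant of the same idea, sketched with Series.where: it builds a new float column instead of writing floats into the integer y column in place. The name lookup is just illustrative.

```python
import pandas as pd

d1 = pd.DataFrame({'x': ['a', 'b', 'c', 'c'], 'y': [-1, -2, -3, 0]})
d2 = pd.DataFrame({'x': ['d', 'c', 'a', 'b'], 'y': [0.1, 0.2, 0.3, 0.4]})

# x -> y lookup table built from d2
lookup = d2.set_index('x')['y']

# Keep y where it is non-negative, otherwise take the value mapped from x
d1['y'] = d1['y'].where(d1['y'] >= 0, d1['x'].map(lookup))
print(d1)
```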
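You can try this, but note that it aligns d1 and d2 on the row index rather than on x, so it only gives the desired result when both frames list x in the same order (they do not in the sample data above):
d1.loc[d1.y < 0, 'y'] = d2.loc[d1.y < 0, 'y']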

How to unstack a dictionary of dataframe with multiple entries?

Hi I have this dictionary below
str1: x y
a 1.0 -3.0
b 2.0 -2.5
str2: x y
a 3.0 -2.0
b 4.0 -1.5
str3: x y
a 5.0 -1.0
b 6.0 -0.5
The result I would like is to be able to unstack it so I get a dataframe with index=[str1,str2,str3] and columns=[a,b]. To choose whether I use values on columns x or y to fill the row of my expected dataframe, I use the integer N.
You can think of N as a limit: the first N rows use x values and the rows below use y values.
If N=1, I use x values for str 1, y values for str 2 and str 3.
If N=2, I use x values for str 1 and str 2 , y values for str 3.
If N=3, I use x values for str 1, str 2 and str 3.
Which will give me for N = 1:
a b
str1 1.0 2.0 (x values)
str2 -2.0 -1.5 (y values)
str3 -1.0 -0.5 (y values)
I know that I can get two data frames, unstacking on x and y, then concatenating rows that I want to keep but I wanted to know if there were a faster way.
To better resolve the question in a Pythonic way, you could first translate your rule (using x or y values) to a dictionary (probably with dictionary comprehension):
# replicate the dictionary in the post
>>> d = {'str1':{'a':{'x':1, 'y':-3}, 'b':{'x':2,'y':-2.5}}, 'str2':{'a':{'x':3, 'y':-2}, 'b':{'x':4,'y':-1.5}}, 'str3':{'a':{'x':5, 'y':-1}, 'b':{'x':6,'y':-0.5}}}
>>> indexes = ['str1', 'str2', 'str3']
>>> N_map = {1:{'str1':'x', 'str2':'y', 'str3':'y'}, 2:{'str1':'x', 'str2':'x', 'str3':'y'}}
Then we could loop through N=1,... and construct the dataframe with list/dictionary comprehension:
# only take the first two rules as an example
>>> for i in range(1, 3):
...     df_d = {col:[d[index][col][N_map[i][index]] for index in indexes] for col in ['a', 'b']}
...     pd.DataFrame(df_d, index=indexes)
a b
str1 1 2.0
str2 -2 -1.5
str3 -1 -0.5
a b
str1 1 2.0
str2 3 4.0
str3 -1 -0.5
Here is some code using a dict comprehension over an OrderedDict (a bit more Pythonic):
import collections
def N_unstack(d,N):
    d = collections.OrderedDict(d)
    idx = list('x'*N+'y'*(len(d)-N))
    return pd.DataFrame({k:v[idx[i]] for i,(k,v) in enumerate(d.items())}).T
Output for N_unstack(d,1) where d is the dictionary of dataframes:
a b
str1 1.0 2.0
str2 -2.0 -1.5
str3 -1.0 -0.5
Here is how I would do it (using pd.concat). It's a bit verbose:
def N_unstack(d,N):
    idx = list('x'*N+'y'*(len(d)-N))
    df = pd.concat([d['str1'][idx[0]],d['str2'][idx[1]],d['str3'][idx[2]]], axis=1).T
    df.index = ['str1','str2','str3']
    return df
Edit: made the code a bit more pythonic
With this dictionary of Dataframes :
d2
"""
{'str1': a b
x 1.0 2.0
y -3.0 -2.5,
'str2': a b
x 3.0 4.0
y -2.0 -1.5,
'str3': a b
x 5.0 6.0
y -1.0 -0.5}
"""
Define
df2 = pd.concat(d2)
df2.set_index(df2.index.droplevel(1),inplace=True) # remove 'x','y' labels
select = { N:[ 2*i + (i>=N) for i in range(3)] for N in range(1,4) }
Then with for example N = 1
In [3]: df2.iloc[select[N]]
Out[3]:
a b
str1 1.0 2.0
str2 -2.0 -1.5
str3 -1.0 -0.5
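For reference, a self-contained sketch that rebuilds the question's dictionary (each value a DataFrame with columns x, y and index a, b) and picks the column per entry by position. The names frames and unstack_with_limit are hypothetical, and the sketch relies on dicts preserving insertion order (Python 3.7+).

```python
import pandas as pd

# Dictionary of DataFrames laid out as in the question
frames = {
    'str1': pd.DataFrame({'x': [1.0, 2.0], 'y': [-3.0, -2.5]}, index=['a', 'b']),
    'str2': pd.DataFrame({'x': [3.0, 4.0], 'y': [-2.0, -1.5]}, index=['a', 'b']),
    'str3': pd.DataFrame({'x': [5.0, 6.0], 'y': [-1.0, -0.5]}, index=['a', 'b']),
}


def unstack_with_limit(frames, n):
    """First n entries contribute their 'x' column, the rest their 'y' column."""
    rows = {
        name: frame['x' if i < n else 'y']
        for i, (name, frame) in enumerate(frames.items())
    }
    # Each Series becomes a column, so transpose to get one row per key
    return pd.DataFrame(rows).T


result = unstack_with_limit(frames, 1)
print(result)
```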

Get Trend/Streak in Each Row of Pandas DataFrame

I have a Pandas DataFrame:
df = pd.DataFrame([['A', 0.1, 2.0, 1.0, 0.5, 0.3],
['B', -0.3, -0.4, 0.1, 0.2, -1.0],
['C', 0.1, -1.0, 4.0, -3.3, 1.0],
['D', -0.1, -1.0, -4.0, -3.3, -1.0],
['E', np.nan, np.nan, np.nan, np.nan, np.nan],
['F', 4.0, np.nan, np.nan, np.nan, np.nan]
], columns=['Group', '1', '2', '3', '4', '5'])
Group 1 2 3 4 5
0 A 0.1 2.0 1.0 0.5 0.3
1 B -0.3 -0.4 0.1 0.2 -1.0
2 C 0.1 -1.0 4.0 -3.3 1.0
3 D -0.1 -1.0 -4.0 -3.3 -1.0
4 E NaN NaN NaN NaN NaN
5 F 4.0 NaN NaN NaN NaN
For each row, I'd like to return the trend/streak of either consecutive positive/negative values going from left to right. So, the final DataFrame should be:
Group 1 2 3 4 5 Streak
0 A 0.1 2.0 1.0 0.5 0.3 5
1 B -0.3 -0.4 0.1 0.2 -1.0 -2
2 C 0.1 -1.0 4.0 -3.3 1.0 1
3 D -0.1 -1.0 -4.0 -3.3 -1.0 -5
4 E NaN NaN NaN NaN NaN 0
5 F 4.0 NaN NaN NaN NaN 1
The first row has a streak of +5 because the values are all positive going from left to right. The second row has a streak of -2 because the first two columns have negative values and the streak ends with a positive value in column 3. The third row has a streak of +1 because the second column has the opposite sign from the first column. The fifth row is all NaN, so the streak is zero.
This is a bit long-winded, but it seems to do everything you need:
def streak(row):
    cols = row.keys()
    n_cols = len(cols)
    neg_streak = 0
    pos_streak = 0
    i_neg_streak = n_cols
    i_pos_streak = n_cols
    for icol_1 in range(n_cols - 1):
        for icol_2 in range(icol_1, n_cols):
            # positional slice of the row from icol_1 to icol_2 inclusive
            if (row.iloc[icol_1: icol_2 + 1] < 0).all():
                streak = icol_1 - icol_2 - 1
                if streak < neg_streak:
                    neg_streak = streak
                    i_neg_streak = icol_1
            elif (row.iloc[icol_1: icol_2 + 1] > 0).all():
                streak = 1 + icol_2 - icol_1
                if streak > pos_streak:
                    pos_streak = streak
                    i_pos_streak = icol_1
    if pos_streak == abs(neg_streak):
        if i_pos_streak < i_neg_streak:
            return pos_streak
        else:
            return neg_streak
    elif pos_streak > abs(neg_streak):
        return pos_streak
    else:
        return neg_streak
df = pd.DataFrame([['A', 0.1, 2.0, 1.0, 0.5, 0.3],
['B', -0.3, -0.4, 0.1, 0.2, -1.0],
['C', 0.1, -1.0, 4.0, -3.3, 1.0]
], columns=['Group', '1', '2', '3', '4', '5'])
df = df.set_index('Group')
df['Streak'] = df.apply(lambda row: streak(row), axis = 1)
df = df.reset_index()
print(df)
This did the trick and is more intuitive/vectorized
a = (df[['1', '2', '3', '4', '5']] >= 0).values # Get True/False values
diff = a[:, :-1] == a[:, 1:] # Compare values from neighboring columns
So diff looks like this:
[[ True True True True]
[ True False True False]
[False False False False]
[ True True True True]]
Then,
false_col = np.zeros((a.shape[0], 1), dtype=bool) # Create a column of False
diff = np.concatenate((diff, false_col), axis=1) # Add False column to end of diff
[[ True True True True False]
[ True False True False False]
[False False False False False]
[ True True True True False]]
Next, we look for streaks of True by looking for the first occurrence of False:
df['Streak'] = np.argmin(diff, axis=1) + 1 # Add 1 to the index get the streak
Finally, we adjust the sign of the streak value according to the sign of the first column:
df['Sign'] = df['1']
df['Sign'] = np.where(df['Sign'] > 0, 1, df['Sign'])
df['Sign'] = np.where(df['Sign'] < 0, -1, df['Sign'])
df['Sign'] = np.where(df['Sign'].isnull(), 0, df['Sign'])
df['Streak'] = df['Streak'] * df['Sign']
df['Streak'] = df['Streak'].astype(int)
df.drop('Sign', axis=1, inplace=True)
The final DataFrame looks like this:
Group 1 2 3 4 5 Streak
0 A 0.1 2.0 1.0 0.5 0.3 5
1 B -0.3 -0.4 0.1 0.2 -1.0 -2
2 C 0.1 -1.0 4.0 -3.3 1.0 1
3 D -0.1 -1.0 -4.0 -3.3 -1.0 -5
4 E NaN NaN NaN NaN NaN 0
5 F 4.0 NaN NaN NaN NaN 1
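The same left-to-right streak can also be sketched more compactly with np.sign and a cumulative product. This is an alternative formulation rather than the code above; it treats NaN (and exact zeros) as sign 0, which is an assumption about how such cells should count.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['A', 0.1, 2.0, 1.0, 0.5, 0.3],
                   ['B', -0.3, -0.4, 0.1, 0.2, -1.0],
                   ['C', 0.1, -1.0, 4.0, -3.3, 1.0],
                   ['D', -0.1, -1.0, -4.0, -3.3, -1.0],
                   ['E', np.nan, np.nan, np.nan, np.nan, np.nan],
                   ['F', 4.0, np.nan, np.nan, np.nan, np.nan]],
                  columns=['Group', '1', '2', '3', '4', '5'])

vals = df[['1', '2', '3', '4', '5']].to_numpy(dtype=float)
signs = np.nan_to_num(np.sign(vals))   # -1, 0 or 1; NaN becomes 0
first = signs[:, [0]]                  # sign of the first column, kept 2-D

# True while a cell has the same sign as the first column
same = signs == first
# cumprod turns everything after the first mismatch into 0, so the row sum
# is the length of the leading run
run = np.cumprod(same, axis=1).sum(axis=1)

# Signed streak; rows starting with NaN get sign 0 and therefore streak 0
df['Streak'] = (run * first[:, 0]).astype(int)
print(df)
```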
I'm assuming you want the longest streak. Can't make any promises about ties... This answer uses itertools.groupby. First, under the hood so you can see what groupby is doing:
In [4]: from itertools import groupby
b = [-0.3, -0.4, 0.1, 0.2, -1.0]
for k,g in groupby(b, key=lambda x: x > 0.0):
    print(k, list(g))
False [-0.3, -0.4]
True [0.1, 0.2]
False [-1.0]
Now wrap that in a function, taking advantage of the groupings:
def streak(dfrow):
    longest= 0
    for k,g in groupby(dfrow, key=lambda x: False if x<0 else True if x>0 else np.nan):
        cur_streak = len(list(g))
        if np.isnan(k):
            continue
        if k: #group is positive
            if abs(longest) < cur_streak:
                longest= cur_streak
        else: #group is negative
            if abs(longest) < cur_streak:
                longest= -1*cur_streak #multiply by -1
    return longest
Use df.apply to apply function to each row:
In [6]: df.set_index('Group',inplace=True)
df['LongestStreak'] = df.apply(streak, axis=1)
Result:
In [281]: df
Out[281]: 1 2 3 4 5 LongestStreak
Group
A 0.1 2.0 1.0 0.5 0.3 5
B -0.3 -0.4 0.1 0.2 -1.0 -2
C 0.1 -1.0 4.0 -3.3 1.0 1
EDIT
Updated to address your new DataFrame and added a benchmark. Yours probably scales better, but I don't know how to modify your code to generate the results.
Results:
%%timeit
df['LongestStreak'] = df.apply(streak, axis=1)
1000 loops, best of 3: 473 µs per loop
%%timeit
a = (df[['1', '2', '3', '4', '5']] >= 0).values # Get True/False values
diff = a[:, :-1] == a[:, 1:]
false_col = np.zeros((a.shape[0], 1), dtype=bool) # Create a column of False
diff = np.concatenate((diff, false_col), axis=1)
df['Streak'] = np.argmin(diff, axis=1) + 1
df['Sign'] = df['1']
df['Sign'] = np.where(df['Sign'] > 0, 1, df['Sign'])
df['Sign'] = np.where(df['Sign'] < 0, -1, df['Sign'])
df['Sign'] = np.where(df['Sign'].isnull(), 0, df['Sign'])
df['Streak'] = df['Streak'] * df['Sign']
df['Streak'] = df['Streak'].astype(int)
df.drop('Sign', axis=1, inplace=True)
100 loops, best of 3: 2.94 ms per loop

Calculating a new column in pandas

I have a dataframe of historical election results and want to calculate an additional column that applies a basic math formula for records for winning candidates and copies a value over for the rest of them.
Here is the code I tried:
va2 = va1[['contest_id', 'year', 'district', 'office', 'party_code',
'pct_vote', 'winner']].drop_duplicates()
va2['vote_waste'] = va2['winner'].map(lambda x: (-.5) + va2['pct_vote']
if x == 'w' else va2['pct_vote'])
This gave me a new column in which every row contained the calculated values for all rows.
You can use numpy.where() to achieve what you want:
import pandas as pd
import numpy as np
data = {
'winner': pd.Series(['w', 'l', 'l', 'w', 'l']),
'pct_vote': pd.Series([0.4, 0.9, 0.9, 0.4, 0.9]),
'party_code': pd.Series([10, 20, 30, 40, 50])
}
df = pd.DataFrame(data)
print(df)
party_code pct_vote winner
0 10 0.4 w
1 20 0.9 l
2 30 0.9 l
3 40 0.4 w
4 50 0.9 l
df['vote_waste'] = np.where(
df['winner'] == 'w',
df['pct_vote'] - 0.5, #if condition is true, use this value
df['pct_vote'] #if condition is false, use this value
)
print(df)
party_code pct_vote winner vote_waste
0 10 0.4 w -0.1
1 20 0.9 l 0.9
2 30 0.9 l 0.9
3 40 0.4 w -0.1
4 50 0.9 l 0.9
This happens because the lambda combines a single element x with the whole Series va2['pct_vote']. What you need is an element-wise operation on va2['winner'] and va2['pct_vote'], which you can get with apply.
In the example below, treat a as winner and b as pct_vote:
df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'])
df
Out[23]:
a b c
0 1 2 3
1 4 5 6
df['new'] = df[['a','b']].apply(lambda x : (-0.5)+x['b'] if x['a'] ==1 else x['b'],axis=1)
df
Out[42]:
a b c new
0 1 2 3 1.5
1 4 5 6 5.0
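Applied to the question's own column names, the row-wise version might look like the sketch below. The sample values are made up for illustration, and vote_waste is a hypothetical helper name.

```python
import pandas as pd

# Made-up sample with the question's column names
df = pd.DataFrame({
    'winner': ['w', 'l', 'l', 'w', 'l'],
    'pct_vote': [0.4, 0.9, 0.9, 0.4, 0.9],
})


def vote_waste(row):
    """Winners get pct_vote - 0.5; everyone else keeps pct_vote."""
    if row['winner'] == 'w':
        return row['pct_vote'] - 0.5
    return row['pct_vote']


# axis=1 passes one row at a time, so each cell gets a single scalar
df['vote_waste'] = df.apply(vote_waste, axis=1)
print(df)
```

For large frames the np.where version is usually the faster choice, since apply with axis=1 calls the Python function once per row.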
