Pandas - lambda function with conditional based on row index - python

I am trying to apply a lambda function to a dataframe by referencing three columns. I want to update one of the columns, Cumulative Total, based on the following logic:
If it's the first row, then Cumulative Total should equal the value in Total.
If it's not the first row, then apply the following formula, which references the prior row:
x.shift()['Cumulative Total'] - (x.shift()['Total'] * (x.shift()['Annualized Rate'] / 1200))
I want the Cumulative Total column to look like so:
Total Annualized Rate Cumulative Total
869 11.04718067 869
868 5.529953917 861
871 8.266360505 857
873 6.872852234 851
873 8.24742268 846
874 9.610983982 840
870 5.517241379 833
871 8.266360505 829
868 2.764976959 823
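(To make the arithmetic concrete: the second row's value is the prior row's Cumulative Total minus the prior row's deduction, i.e. 869 - (869 * 11.04718067 / 1200) ≈ 869 - 8.0 = 861.)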
What is throwing me off is how to determine whether or not I'm on the first row. This sounds rather trivial, but I'm very new to Pandas and am totally stumped. iloc doesn't seem to work, since it is for grabbing the row at a given position, not for telling me which row I'm on.
The code is currently as follows:
df['Cumulative Total'] = df.apply(lambda x: x['Total'] if x.iloc[0] else x.shift()['Cumulative Total']-(x.shift()['Total']*(x.shift()['Annualized Rate']/1200)),axis=1)
The statement if x.iloc[0] is wrong. Any idea on how I can determine if it's the first row?
Edit: thank you all for your answers. Alexander's answer is on the right track, but I've noticed that the results strayed somewhat from what was to be expected. These differences became more pronounced the larger the dataframe used.
Alexander - can you address this issue with an edit to your answer? Using vanilla Python, I've arrived at the results below. The differences are largely trivial, but as stated, can get more pronounced with larger datasets.
total = (869, 868, 871, 873, 873, 874, 870, 871, 868)
rate = (11.047181, 5.529954, 8.266361, 6.872852, 8.247423, 9.610984, 5.517241, 8.266361, 2.764977)

def f(total, rate):
    cum = []
    for i in range(len(total)):
        if i == 0:
            cum.append(total[i])
        else:
            cum.append(float(cum[i-1]) - (float(total[i-1]) * (rate[i-1] / 1200.0)))
    return cum

f(total, rate)
Returns:
869
860.9999997591667
856.9999996991667
850.99999934
845.9999995100001
839.9999992775
832.9999992641667
828.9999995391668
822.9999991800001

Perhaps this? Expanding your recurrence, each row's value is just the first row's Total minus the running sum of the prior rows' deductions (Total * Annualized Rate / 1200), so the whole column can be built from a shifted cumulative sum:
df = df.assign(
    Cumulative_Total=df['Total'].iat[0]
    - ((df['Total'] * df['Annualized Rate'].div(1200))
       .shift()
       .fillna(0)
       .cumsum())
)
>>> df
Total Annualized Rate Cumulative_Total
0 869 11.047181 869
1 868 5.529954 861
2 871 8.266361 857
3 873 6.872852 851
4 873 8.247423 846
5 874 9.610984 840
6 870 5.517241 833
7 871 8.266361 829
8 868 2.764977 823

Would this work? In this solution, I used x.name to get the row index.
df['Cumulative Total'] = df.apply(lambda x: x['Total'] if x.name == 0 else x.shift()['Cumulative Total']-(x.shift()['Total']*(x.shift()['Annualized Rate']/1200)),axis=1)
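A caveat worth noting: x.name == 0 does identify the first row, but inside apply with axis=1 each x is a single row, so x.shift() shifts values within that row rather than reaching the previous row, and values assigned to 'Cumulative Total' are not visible to later rows. Since each row depends on the previous one, a plain loop over positions is a safe fallback; a minimal sketch (not from the original thread) using the column names from the question:
cum = [df['Total'].iat[0]]  # first row: Cumulative Total equals Total
for i in range(1, len(df)):
    # prior row's cumulative total minus the prior row's monthly deduction
    cum.append(cum[-1] - df['Total'].iat[i - 1] * df['Annualized Rate'].iat[i - 1] / 1200)
df['Cumulative Total'] = cum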

Related

Finding the bug in a function with while loops

I have this function:
def same_price(df=df):
    df = df.sort_values(by='Ticket')
    nucleus = dict()
    k = 0
    while df.shape[0] >= 2:
        if df.Price.iloc[0] == df.Price.iloc[1]:
            value = df.Price.iloc[0]
            n = 0
            nucleus[k] = []
            while df.Price.iloc[n] == value:
                nucleus[k].append(df.index[n])
                n += 1
                if n > df.shape[0]:
                    df.drop(nucleus[k], axis=0, inplace=True)
                    break
            else:
                df.drop(nucleus[k], axis=0, inplace=True)
            k += 1
        else:
            if df.shape[0] >= 3:
                df.drop(df.index[0], axis=0, inplace=True)
            else:
                break
    return nucleus
The objective of the function is to go through the ordered dataframe and list together the persons who paid the same price GIVEN the sequence of the 'Ticket' id. (I do not just want to list together ALL the people who paid the same price, no matter the sequence!)
The dataframe:
Price Ticket
Id
521 93.5000 12749
821 93.5000 12749
584 40.1250 13049
648 35.5000 13213
633 30.5000 13214
276 77.9583 13502
628 77.9583 13502
766 77.9583 13502
435 55.9000 13507
578 55.9000 13507
457 26.5500 13509
588 79.2000 13567
540 49.5000 13568
48 7.7500 14311
574 7.7500 14312
369 7.7500 14313
When I test it, same_price(df[:11]) works just fine and the output is: {0: [521, 821], 1: [276, 628, 766], 2: [435, 578]}
but same_price(df[:10]) throws: IndexError: single positional indexer is out-of-bounds.
I'd like to know what is wrong with this function.
Thanks
I found what's wrong, if anyone is interested...
df.iloc[n] gets the (n+1)-th row of the dataframe, but a dataframe with shape[0] == n has only n rows, with valid positions 0 through n-1. So when n reaches df.shape[0], the inner while condition evaluates df.Price.iloc[n] and goes out of bounds.
Hence we use if n+1>df.shape[0]:, instead of if n>df.shape[0]:
Cheers :)
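For reference, a shorter route to the same grouping (a sketch, not part of the original fix, though it reproduces the example output above): after sorting by Ticket, mark each row where the price differs from the previous one, cumulatively sum the marks to get run ids, then keep the runs of two or more rows.
df = df.sort_values(by='Ticket')
# a new run starts wherever Price differs from the previous row
run_id = df['Price'].ne(df['Price'].shift()).cumsum()
runs = df.groupby(run_id).apply(lambda g: list(g.index))
# keep only the runs where at least two persons paid the same price
nucleus = {k: ids for k, ids in enumerate(r for r in runs if len(r) > 1)}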

Python dataframe slicing doesn't work in a function but works stand-alone

I checked similar questions posted about slicing DFs in Python but they didn't explain the inconsistency I'm seeing in my exercise.
The code works with the well-known diamonds dataframe. The top lines of the dataframe are:
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
I have to create a slicing function that takes a DataFrame 'df', a column of that DataFrame
'col', the label of another column 'label', and two values 'val1' and 'val2'. The function takes the frame and outputs the entries of the column indicated by the 'label' argument for the rows in which the values of the column 'col' are greater than 'val1' and less than 'val2'.
The following stand-alone piece of code gives me the correct answer:
diamonds.loc[(diamonds.carat > 1.1) & (diamonds.carat < 1.4),['price']]
and I get the price from the rows where the carat value is between 1.1 and 1.4.
However, when I try to use this syntax in a function, it doesn't work and I get an error.
Function:
def slice2(df, col, output_label, val1, val2):
    res = df.loc[(col > val1) & (col < val2), ['output_label']]
    return res
Function call:
slice2(diamonds,diamonds.carat,'price',1.1,1.4)
Error:
"None of [['output_label']] are in the [columns]"
Full traceback message:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-64-adc582faf6cc> in <module>()
----> 1 exercise2(test_df,test_df.carat,'price',1.1,1.4)
<ipython-input-63-556b71ba172d> in exercise2(df, col, output_label, val1, val2)
1 def exercise2(df,col,output_label,val1,val2):
----> 2 res = df.loc[(col > val1) & (col < val2), ['output_label']]
3 return res
/Users/jojo/Library/Enthought/Canopy/edm/envs/User/lib/python3.5/site-packages/pandas/core/indexing.py in __getitem__(self, key)
1323 except (KeyError, IndexError):
1324 pass
-> 1325 return self._getitem_tuple(key)
1326 else:
1327 key = com._apply_if_callable(key, self.obj)
/Users/jojo/Library/Enthought/Canopy/edm/envs/User/lib/python3.5/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
839
840 # no multi-index, so validate all of the indexers
--> 841 self._has_valid_tuple(tup)
842
843 # ugly hack for GH #836
/Users/jojo/Library/Enthought/Canopy/edm/envs/User/lib/python3.5/site-packages/pandas/core/indexing.py in _has_valid_tuple(self, key)
187 if i >= self.obj.ndim:
188 raise IndexingError('Too many indexers')
--> 189 if not self._has_valid_type(k, i):
190 raise ValueError("Location based indexing can only have [%s] "
191 "types" % self._valid_types)
/Users/jojo/Library/Enthought/Canopy/edm/envs/User/lib/python3.5/site-packages/pandas/core/indexing.py in _has_valid_type(self, key, axis)
1416
1417 raise KeyError("None of [%s] are in the [%s]" %
-> 1418 (key, self.obj._get_axis_name(axis)))
1419
1420 return True
KeyError: "None of [['output_label']] are in the [columns]"
I'm not very advanced in Python, and after looking at this code for a while I haven't been able to figure out what the problem is. Maybe I'm blind to something obvious here; I would appreciate any pointers on how to get the function to work, or on how to redo it so that it gives the same result as the single-line code.
Thanks
In your function
def slice2(df, col, output_label, val1, val2):
    res = df.loc[(col > val1) & (col < val2), ['output_label']]
    return res
you are looking up a column literally named 'output_label' instead of using your parameter (you are passing the string itself rather than the parameter's value!).
This should work:
def slice2(df, col, output_label, val1, val2):
    res = df.loc[(col > val1) & (col < val2), [output_label]]  # notice: no quotes around output_label
    return res
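Called exactly as before, the fixed function returns the 'price' entries for the rows where carat lies strictly between the two values:
slice2(diamonds, diamonds.carat, 'price', 1.1, 1.4)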

Iterate over a list of columns to print out the .value_counts()

I have a list of columns I want to iterate over to get the .value_counts() for each column. I was getting errors before, and with the code posted below I get no printing at all:
x = ['call_type', 'date_time', 'FullAddress', 'priority']
for i in range(len(x)):
    df[x[i]].value_counts()
This is the output with one single column name:
df["call_type"].value_counts()
415 22303
459A 21045
1150 17070
1151 12884
911 11094
CW 9458
586 9405
5150 7109
415V 6922
1016 6453
MCTSTP 5818
1185 5682
FU 5179
1186 5101
415N 5066
SELENF 4787
FD 4435
SLEEPER 3885
INFO 3511
REPORT 3390
1153 3264
PARTY 3170
10851R 2923
602 2877
242 2831
459R 2825
AU2 2802
CC 2776
415PP 2528
488R 2525
Your solution should work; it is also possible to simplify it:
for i in x:
    print(df[i].value_counts())
You are just generating the data, but not telling Python to print it to the console. Add the print() function:
x = ['call_type', 'date_time', 'FullAddress', 'priority']
for i in range(len(x)):
    print(df[x[i]].value_counts())
Also,
for col in df.columns:
    print(df[col].value_counts())
or,
# one Series indexed by (column, value) holding all the counts at once
df.apply(lambda x: x.value_counts()).T.stack()

Pandas Calculate Median of Group over Columns

I am trying to calculate the median of groups over columns. I found a very clear example at
Pandas: Calculate Median of Group over Columns
That question and its answer are exactly what I needed, so I recreated the exact example posted to work through the details on my own:
import pandas
import numpy

data_3 = [2, 3, 4, 5, 4, 2]
data_4 = [0, 1, 2, 3, 4, 2]
df = pandas.DataFrame({'COL1': ['A', 'A', 'A', 'A', 'B', 'B'],
                       'COL2': ['AA', 'AA', 'BB', 'BB', 'BB', 'BB'],
                       'COL3': data_3,
                       'COL4': data_4})
m = df.groupby(['COL1', 'COL2'])[['COL3', 'COL4']].apply(numpy.median)
When I try to calculate the median of the groups over the columns, I encounter the error
TypeError: Series.name must be a hashable type
If I run the exact same code with median replaced by a different statistic (mean, min, max, std), everything works just fine.
I don't understand the cause of this error, or why it only occurs for median, which is what I really need to calculate.
Thanks in advance for your help,
Bob
Here is the full error message. I am using Python 3.5.2.
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-12-af0ef7da3347> in <module>()
----> 1 m = df.groupby(['COL1', 'COL2'])[['COL3','COL4']].apply(numpy.median)
/Applications/anaconda3/lib/python3.5/site-packages/pandas/core/groupby.py in apply(self, func, *args, **kwargs)
649 # ignore SettingWithCopy here in case the user mutates
650 with option_context('mode.chained_assignment', None):
--> 651 return self._python_apply_general(f)
652
653 def _python_apply_general(self, f):
/Applications/anaconda3/lib/python3.5/site-packages/pandas/core/groupby.py in _python_apply_general(self, f)
658 keys,
659 values,
--> 660 not_indexed_same=mutated or self.mutated)
661
662 def _iterate_slices(self):
/Applications/anaconda3/lib/python3.5/site-packages/pandas/core/groupby.py in _wrap_applied_output(self, keys, values, not_indexed_same)
3373 coerce = True if any([isinstance(x, Timestamp)
3374 for x in values]) else False
-> 3375 return (Series(values, index=key_index, name=self.name)
3376 ._convert(datetime=True,
3377 coerce=coerce))
/Applications/anaconda3/lib/python3.5/site-packages/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
231 generic.NDFrame.__init__(self, data, fastpath=True)
232
--> 233 self.name = name
234 self._set_axis(0, index, fastpath=True)
235
/Applications/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py in __setattr__(self, name, value)
2692 object.__setattr__(self, name, value)
2693 elif name in self._metadata:
-> 2694 object.__setattr__(self, name, value)
2695 else:
2696 try:
/Applications/anaconda3/lib/python3.5/site-packages/pandas/core/series.py in name(self, value)
307 def name(self, value):
308 if value is not None and not com.is_hashable(value):
--> 309 raise TypeError('Series.name must be a hashable type')
310 object.__setattr__(self, '_name', value)
311
TypeError: Series.name must be a hashable type
Somehow the Series name at this stage is being interpreted as unhashable, despite supposedly being a tuple. I think it may be the same bug as the one fixed and closed here:
Apply on selected columns of a groupby object - stopped working with 0.18.1 #13568
Basically, single scalar values in groups (as you have in your example) were causing the name of the Series to not be passed through. It is fixed in 0.19.2.
In any case, it shouldn't be a practical concern since you can (and should) call mean, median, etc. on GroupBy objects directly.
>>> df.groupby(['COL1', 'COL2'])[['COL3', 'COL4']].median()
COL3 COL4
COL1 COL2
A AA 2.5 0.5
BB 4.5 2.5
B BB 3.0 3.0
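And if several statistics are needed at once, the same direct-on-GroupBy approach extends via agg (a small sketch using the same dataframe):
df.groupby(['COL1', 'COL2'])[['COL3', 'COL4']].agg(['median', 'mean'])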

Use two indexers to access values in a pandas DataFrame

I have a DataFrame (df) with many columns and rows.
What I'd like to do is access the values in one column for which the values in two other columns match my indexer.
This is what my code looks like now:
df.loc[df.delays == curr_d, df.prev_delay == prev_d, 'd_stim']
In case it isn't clear, my goal is to select the values in the column 'd_stim' for which other values in the same row are curr_d (in the 'delays' column) and prev_d (in the 'prev_delay' column).
This use of loc does not work. It raises the following error:
/home/despo/dbliss/dopa_net/behavioral_experiments/analysis_code/behavior_analysis.py in plot_prev_curr_interaction(data_frames, labels)
2061 for k, prev_d in enumerate(delays):
2062 diff = np.array(df.loc[df.delays == curr_d,
-> 2063 df.prev_delay == prev_d, 'd_stim'])
2064 ind = ~np.isnan(diff)
2065 diff_rad = np.deg2rad(diff[ind])
/usr/local/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
1292
1293 if type(key) is tuple:
-> 1294 return self._getitem_tuple(key)
1295 else:
1296 return self._getitem_axis(key, axis=0)
/usr/local/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
787
788 # no multi-index, so validate all of the indexers
--> 789 self._has_valid_tuple(tup)
790
791 # ugly hack for GH #836
/usr/local/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py in _has_valid_tuple(self, key)
139 for i, k in enumerate(key):
140 if i >= self.obj.ndim:
--> 141 raise IndexingError('Too many indexers')
142 if not self._has_valid_type(k, i):
143 raise ValueError("Location based indexing can only have [%s] "
IndexingError: Too many indexers
What is the appropriate way to access the data I need?
Your logic isn't working because pandas doesn't know what to do with comma-separated conditions:
df.delays == curr_d, df.prev_delay == prev_d
Assuming you meant and, you need to wrap each condition in parentheses and join them with &. This is @MaxU's solution from the comments and should work, unless you haven't given us everything:
df.loc[(df.delays == curr_d) & (df.prev_delay == prev_d), 'd_stim']
However, I think this looks prettier (query refers to local variables with the @ prefix):
df.query('delays == @curr_d and prev_delay == @prev_d').d_stim
If this works, then so should @MaxU's. If neither works, I suggest you post some sample data, because most folks don't like guessing what your data is.
