Finding the bug in a function with while loops - python

I have this function:
def same_price(df=df):
    df = df.sort_values(by='Ticket')
    nucleus = dict()
    k = 0
    while df.shape[0] >= 2:
        if df.Price.iloc[0] == df.Price.iloc[1]:
            value = df.Price.iloc[0]
            n = 0
            nucleus[k] = []
            while df.Price.iloc[n] == value:
                nucleus[k].append(df.index[n])
                n += 1
                if n > df.shape[0]:
                    df.drop(nucleus[k], axis=0, inplace=True)
                    break
            else:
                df.drop(nucleus[k], axis=0, inplace=True)
                k += 1
        else:
            if df.shape[0] >= 3:
                df.drop(df.index[0], axis=0, inplace=True)
            else:
                break
    return nucleus
The objective of the function is to walk through the ordered dataframe and group together the people who paid the same price, given the sequence of the 'Ticket' id. (I do not want to simply group ALL the people who paid the same price, regardless of sequence!)
The dataframe:
Price Ticket
Id
521 93.5000 12749
821 93.5000 12749
584 40.1250 13049
648 35.5000 13213
633 30.5000 13214
276 77.9583 13502
628 77.9583 13502
766 77.9583 13502
435 55.9000 13507
578 55.9000 13507
457 26.5500 13509
588 79.2000 13567
540 49.5000 13568
48 7.7500 14311
574 7.7500 14312
369 7.7500 14313
When I test it:
same_price(df[:11]) works just fine, and the output is: {0: [521, 821], 1: [276, 628, 766], 2: [435, 578]}
but same_price(df[:10]) throws: IndexError: single positional indexer is out-of-bounds.
I'd like to know what is wrong with this function.
Thanks

I found what's wrong, if anyone is interested...
df.iloc[n] gets the (n+1)-th row of the dataframe, but df.shape[0] == n means the dataframe only has n rows.
Hence the check should be if n+1 > df.shape[0]: instead of if n > df.shape[0]:
Cheers :)
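For reference, here is a sketch of the function with that off-by-one fix applied and the indentation restored, run against the sample rows from the question (non-destructive df.drop reassignment is used in place of inplace=True, but the logic is otherwise unchanged):

```python
import pandas as pd

def same_price(df):
    df = df.sort_values(by='Ticket')
    nucleus = dict()
    k = 0
    while df.shape[0] >= 2:
        if df.Price.iloc[0] == df.Price.iloc[1]:
            value = df.Price.iloc[0]
            n = 0
            nucleus[k] = []
            while df.Price.iloc[n] == value:
                nucleus[k].append(df.index[n])
                n += 1
                if n + 1 > df.shape[0]:  # the fix: n + 1, not n
                    df = df.drop(nucleus[k])
                    break
            else:
                df = df.drop(nucleus[k])
                k += 1
        else:
            if df.shape[0] >= 3:
                df = df.drop(df.index[0])
            else:
                break
    return nucleus

# sample rows from the question
prices = [93.5, 93.5, 40.125, 35.5, 30.5,
          77.9583, 77.9583, 77.9583, 55.9, 55.9, 26.55]
tickets = [12749, 12749, 13049, 13213, 13214,
           13502, 13502, 13502, 13507, 13507, 13509]
ids = [521, 821, 584, 648, 633, 276, 628, 766, 435, 578, 457]
df = pd.DataFrame({'Price': prices, 'Ticket': tickets}, index=ids)

result_10 = same_price(df[:10])  # no longer raises IndexError
result_11 = same_price(df[:11])
```

Both slices now return the expected grouping instead of the ten-row slice raising.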

Related

Python dataframe slicing doesn't work in a function but works stand-alone

I checked similar questions posted about slicing DFs in Python but they didn't explain the inconsistency I'm seeing in my exercise.
The code works with the known diamonds data frame. Top lines of the data frame are:
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
I have to create a slicing function which takes five arguments: a DataFrame 'df', a column of that DataFrame 'col', the label of another column 'label', and two values 'val1' and 'val2'. The function takes the frame and outputs the entries of the column indicated by 'label' for which the rows of column 'col' are greater than 'val1' and less than 'val2'.
The following stand-alone piece of code gives me the correct answer:
diamonds.loc[(diamonds.carat > 1.1) & (diamonds.carat < 1.4),['price']]
and I get the price from the rows where the carat value is between 1.1 and 1.4.
However, when I try to use this syntax in a function, it doesn't work and I get an error.
Function:
def slice2(df,col,output_label,val1,val2):
    res = df.loc[(col > val1) & (col < val2), ['output_label']]
    return res
Function call:
slice2(diamonds,diamonds.carat,'price',1.1,1.4)
Error:
"None of [['output_label']] are in the [columns]"
Full traceback message:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-64-adc582faf6cc> in <module>()
----> 1 exercise2(test_df,test_df.carat,'price',1.1,1.4)
<ipython-input-63-556b71ba172d> in exercise2(df, col, output_label, val1, val2)
1 def exercise2(df,col,output_label,val1,val2):
----> 2 res = df.loc[(col > val1) & (col < val2), ['output_label']]
3 return res
/Users/jojo/Library/Enthought/Canopy/edm/envs/User/lib/python3.5/site-packages/pandas/core/indexing.py in __getitem__(self, key)
1323 except (KeyError, IndexError):
1324 pass
-> 1325 return self._getitem_tuple(key)
1326 else:
1327 key = com._apply_if_callable(key, self.obj)
/Users/jojo/Library/Enthought/Canopy/edm/envs/User/lib/python3.5/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
839
840 # no multi-index, so validate all of the indexers
--> 841 self._has_valid_tuple(tup)
842
843 # ugly hack for GH #836
/Users/jojo/Library/Enthought/Canopy/edm/envs/User/lib/python3.5/site-packages/pandas/core/indexing.py in _has_valid_tuple(self, key)
187 if i >= self.obj.ndim:
188 raise IndexingError('Too many indexers')
--> 189 if not self._has_valid_type(k, i):
190 raise ValueError("Location based indexing can only have [%s] "
191 "types" % self._valid_types)
/Users/jojo/Library/Enthought/Canopy/edm/envs/User/lib/python3.5/site-packages/pandas/core/indexing.py in _has_valid_type(self, key, axis)
1416
1417 raise KeyError("None of [%s] are in the [%s]" %
-> 1418 (key, self.obj._get_axis_name(axis)))
1419
1420 return True
KeyError: "None of [['output_label']] are in the [columns]"
I'm not very advanced in Python, and after looking at this code for a while I haven't been able to figure out what the problem is. Maybe I'm blind to something obvious here; I would appreciate any pointers on how to get the function to work, or how to redo it so that it gives the same result as the single-line code.
Thanks
In your function
def slice2(df,col,output_label,val1,val2):
    res = df.loc[(col > val1) & (col < val2), ['output_label']]
    return res
you are searching for a column literally named 'output_label' instead of using your parameter (the quoted string is used directly instead of the variable's value!)
This should work:
def slice2(df,col,output_label,val1,val2):
    res = df.loc[(col > val1) & (col < val2), [output_label]]  # note: no quotes
    return res
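A quick sanity check of the corrected function with made-up data (the carat and price values below are invented, not the real diamonds frame):

```python
import pandas as pd

def slice2(df, col, output_label, val1, val2):
    # the variable output_label, not the string 'output_label'
    res = df.loc[(col > val1) & (col < val2), [output_label]]
    return res

# toy stand-in for the diamonds frame
diamonds = pd.DataFrame({
    'carat': [0.23, 1.20, 1.30, 2.00],
    'price': [326, 5000, 6000, 15000],
})

result = slice2(diamonds, diamonds.carat, 'price', 1.1, 1.4)
```

Only the rows with carat strictly between 1.1 and 1.4 survive, and the output frame has the single requested column.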

Error: all arrays must be same length. But they ARE the same length

I am doing some work on sentiment analysis. Here I have three arrays: the content of the sentences, the sentiment scores, and the key words.
I want to display them as a pandas dataframe, but I get:
"ValueError: arrays must all be same length"
Here are some of my codes:
print(len(text_sentences),len(score_list),len(keyinfo_list))
df = pd.DataFrame(text_sentences,score_list,keyinfo_list)
print(df)
Here are the results:
182 182 182
ValueError Traceback (most recent call last)
<ipython-input-15-cfb70aca07d1> in <module>()
21 print(len(text_sentences),len(score_list),len(keyinfo_list))
22
---> 23 df = pd.DataFrame(text_sentences,score_list,keyinfo_list)
24
25 print(df)
E:\learningsoft\anadonda\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
328 else:
329 mgr = self._init_ndarray(data, index, columns, dtype=dtype,
--> 330 copy=copy)
331 else:
332 mgr = self._init_dict({}, index, columns, dtype=dtype)
E:\learningsoft\anadonda\lib\site-packages\pandas\core\frame.py in _init_ndarray(self, values, index, columns, dtype, copy)
472 raise_with_traceback(e)
473
--> 474 index, columns = _get_axes(*values.shape)
475 values = values.T
476
E:\learningsoft\anadonda\lib\site-packages\pandas\core\frame.py in _get_axes(N, K, index, columns)
439 columns = _default_index(K)
440 else:
--> 441 columns = _ensure_index(columns)
442 return index, columns
443
E:\learningsoft\anadonda\lib\site-packages\pandas\core\indexes\base.py in _ensure_index(index_like, copy)
4015 if len(converted) > 0 and all_arrays:
4016 from .multi import MultiIndex
-> 4017 return MultiIndex.from_arrays(converted)
4018 else:
4019 index_like = converted
E:\learningsoft\anadonda\lib\site-packages\pandas\core\indexes\multi.py in from_arrays(cls, arrays, sortorder, names)
1094 for i in range(1, len(arrays)):
1095 if len(arrays[i]) != len(arrays[i - 1]):
-> 1096 raise ValueError('all arrays must be same length')
1097
1098 from pandas.core.categorical import _factorize_from_iterables
ValueError: all arrays must be same length
You can see all my three arrays contain 182 elements, so I don't understand why it said "all arrays must be same length".
You're passing the wrong data into pandas.DataFrame's initializer.
The way you're using it, you're essentially running:
pandas.DataFrame(data=text_sentences, index=score_list, columns=keyinfo_list)
This isn't what you want. You probably want to do something like this instead:
pd.DataFrame(data={
    'sentences': text_sentences,
    'scores': score_list,
    'keyinfo': keyinfo_list
})
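To make the failure mode concrete: passed positionally, the three arrays land in the data, index, and columns slots of the constructor, so pandas tries to build a column index out of keyinfo_list rather than a third column. A minimal sketch with invented sentences and scores:

```python
import pandas as pd

text_sentences = ['good film', 'weak plot', 'great cast']
score_list = [0.9, -0.4, 0.8]
keyinfo_list = ['film', 'plot', 'cast']

# pd.DataFrame(text_sentences, score_list, keyinfo_list) is read as
# pd.DataFrame(data=..., index=..., columns=...) -- not as three columns.
# Passing a dict names each column explicitly:
df = pd.DataFrame({
    'sentences': text_sentences,
    'scores': score_list,
    'keyinfo': keyinfo_list,
})
```

The result is one row per sentence with three properly named columns.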

Pandas - lambda function with conditional based on row index

I am trying to apply a lambda function to a dataframe by referencing three columns. I want to update one of the columns, Cumulative Total, based on the following logic:
If it's on the first row, then Cumulative Total should equal the value in Total.
If it's not the first row, then apply the following formula that references the prior row:
x.shift()['Cumulative Total']
- (x.shift()['Total'] * (x.shift()['Annualized Rate'] / 1200))
I want the Cumulative Total column to look like so:
Total Annualized Rate Cumulative Total
869 11.04718067 869
868 5.529953917 861
871 8.266360505 857
873 6.872852234 851
873 8.24742268 846
874 9.610983982 840
870 5.517241379 833
871 8.266360505 829
868 2.764976959 823
What is throwing me off is how I can determine whether or not I'm on the first row. This sounds rather trivial, but I'm very new to Pandas and am totally stumped. iloc doesn't seem to work, as it seems to only be used for grabbing a row of a given index.
The code is currently as follows:
df['Cumulative Total'] = df.apply(lambda x: x['Total'] if x.iloc[0] else x.shift()['Cumulative Total']-(x.shift()['Total']*(x.shift()['Annualized Rate']/1200)),axis=1)
The statement if x.iloc[0] is wrong. Any idea on how I can determine if it's the first row?
Edit: thank you all for your answers. Alexander's answer is on the right track, but I've noticed that the results strayed somewhat from what was to be expected. These differences became more pronounced the larger the dataframe used.
Alexander - can you address this issue with an edit to your answer? Using vanilla Python, I've arrived at the results below. The differences are largely trivial, but as stated, can get more pronounced with larger datasets.
total = (869, 868, 871, 873, 873, 874, 870, 871, 868)
rate = (11.047181, 5.529954, 8.266361, 6.872852, 8.247423, 9.610984, 5.517241, 8.266361, 2.764977)
def f(total, rate):
    cum = []
    for i in range(len(total)):
        if i == 0:
            cum.append(total[i])
        else:
            cum.append(float(cum[i-1]) - (float(total[i-1]) * (rate[i-1] / 1200.0)))
    return cum
f(total, rate)
Returns:
869
860.9999997591667
856.9999996991667
850.99999934
845.9999995100001
839.9999992775
832.9999992641667
828.9999995391668
822.9999991800001
Perhaps this?
df = df.assign(
    Cumulative_Total=df['Total'].iat[0]
    - ((df['Total'] * df['Annualized Rate'].div(1200))
       .shift()
       .fillna(0)
       .cumsum())
)
>>> df
Total Annualized Rate Cumulative_Total
0 869 11.047181 869
1 868 5.529954 861
2 871 8.266361 857
3 873 6.872852 851
4 873 8.247423 846
5 874 9.610984 840
6 870 5.517241 833
7 871 8.266361 829
8 868 2.764977 823
Would this work? In this solution, I used x.name to get the row index.
df['Cumulative Total'] = df.apply(lambda x: x['Total'] if x.name == 0 else x.shift()['Cumulative Total']-(x.shift()['Total']*(x.shift()['Annualized Rate']/1200)),axis=1)
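For what it's worth, the loop in the question and Alexander's cumsum formulation compute the same quantity (total[0] minus a running sum of total[i] * rate[i] / 1200), so any remaining differences should be floating-point rounding only. A quick check of the two against each other:

```python
import pandas as pd

total = (869, 868, 871, 873, 873, 874, 870, 871, 868)
rate = (11.047181, 5.529954, 8.266361, 6.872852, 8.247423,
        9.610984, 5.517241, 8.266361, 2.764977)

# reference loop from the question
cum = [total[0]]
for i in range(1, len(total)):
    cum.append(cum[i - 1] - total[i - 1] * rate[i - 1] / 1200.0)

# vectorized version in the spirit of Alexander's answer
df = pd.DataFrame({'Total': total, 'Annualized Rate': rate})
vectorized = (df['Total'].iat[0]
              - (df['Total'] * df['Annualized Rate'] / 1200)
                .shift()
                .fillna(0)
                .cumsum())

max_diff = (vectorized - pd.Series(cum)).abs().max()
```

max_diff comes out vanishingly small, so the two agree up to rounding.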

Python: "in" does not recognize values in DataFrame column

I have an excerpt from a DataFrame "IRAData" and a Column called 'Labels':
380 u'itator-Research'
381 u'itator-OnSystem'
382 u'itator-QueryClient'
383 u'itator-OnSystem'
384 u'itator-OnSystem'
385 u'itator-OnSystem'
386 u'itator-OnSystem'
387 u'itator-OnSystem'
388 u'itator-OnSystem'
Name: Labels, dtype: object
But when I run the following code, I get "False" output:
print(u'itator-QueryClient' in IRAData['Labels'])
Same goes for the other values in the column and when I remove the unicode 'u'.
Anyone have an idea as to why?
EDIT: The solution that I placed in a comment below worked. Did not need to attempt the answer to the suggested duplicate question.
I think the best way to avoid this problem is to import the data correctly.
You are storing the literal text "u'itator-QueryClient'" (where u'...' is the unicode-string marker), when 'itator-QueryClient' is the value you actually want stored.
For example, select and copy lines 381 to 384 from this page and invoke:
In [498]: import ast
In [499]: pd.read_clipboard(names=['value'], index_col=0, header=None,
                            converters={'value': ast.literal_eval})
Out[499]:
value
381 itator-OnSystem
382 itator-QueryClient
383 itator-OnSystem
384 itator-OnSystem
Then 'itator-QueryClient' in IRAData['value'] will be evaluated to True.
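A further gotcha worth knowing here (independent of the import-cleanup above): `in` on a pandas Series tests membership in the index, not in the values, which alone can make such a check print False. A small sketch with toy labels:

```python
import pandas as pd

labels = pd.Series(['itator-Research', 'itator-OnSystem', 'itator-QueryClient'],
                   index=[380, 381, 382], name='Labels')

# `in` on a Series consults the index labels, not the values
in_series = 'itator-QueryClient' in labels   # False: not an index label
in_index = 382 in labels                     # True: 382 is an index label

# test against .values instead, or use .isin for an element-wise mask
in_values = 'itator-QueryClient' in labels.values
mask = labels.isin(['itator-QueryClient'])
```

So even with clean string values, value-membership checks should go through .values or .isin.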

Use two indexers to access values in a pandas DataFrame

I have a DataFrame (df) with many columns and rows.
What I'd like to do is access the values in one column for which the values in two other columns match my indexer.
This is what my code looks like now:
df.loc[df.delays == curr_d, df.prev_delay == prev_d, 'd_stim']
In case it isn't clear, my goal is to select the values in the column 'd_stim' for which other values in the same row are curr_d (in the 'delays' column) and prev_d (in the 'prev_delay' column).
This use of loc does not work. It raises the following error:
/home/despo/dbliss/dopa_net/behavioral_experiments/analysis_code/behavior_analysis.py in plot_prev_curr_interaction(data_frames, labels)
2061 for k, prev_d in enumerate(delays):
2062 diff = np.array(df.loc[df.delays == curr_d,
-> 2063 df.prev_delay == prev_d, 'd_stim'])
2064 ind = ~np.isnan(diff)
2065 diff_rad = np.deg2rad(diff[ind])
/usr/local/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
1292
1293 if type(key) is tuple:
-> 1294 return self._getitem_tuple(key)
1295 else:
1296 return self._getitem_axis(key, axis=0)
/usr/local/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
787
788 # no multi-index, so validate all of the indexers
--> 789 self._has_valid_tuple(tup)
790
791 # ugly hack for GH #836
/usr/local/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py in _has_valid_tuple(self, key)
139 for i, k in enumerate(key):
140 if i >= self.obj.ndim:
--> 141 raise IndexingError('Too many indexers')
142 if not self._has_valid_type(k, i):
143 raise ValueError("Location based indexing can only have [%s] "
IndexingError: Too many indexers
What is the appropriate way to access the data I need?
Your logic isn't working because pandas doesn't know what to do with comma-separated conditions:
df.delays == curr_d, df.prev_delay == prev_d
Assuming you meant and, you need to wrap each condition in parentheses and join them with &. This is @MaxU's solution from the comments and should work unless you haven't given us everything.
df.loc[(df.delays == curr_d) & (df.prev_delay == prev_d), 'd_stim']
However, I think this looks prettier:
df.query('delays == @curr_d and prev_delay == @prev_d').d_stim
If this works then so should @MaxU's. If neither works, I suggest you post some sample data, because most folk don't like guessing what your data is.
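Both forms can be checked on a toy frame (column names from the question, values invented):

```python
import pandas as pd

df = pd.DataFrame({
    'delays': [1, 1, 2, 2],
    'prev_delay': [0, 1, 0, 1],
    'd_stim': [10.0, 20.0, 30.0, 40.0],
})
curr_d, prev_d = 2, 1

# boolean masks combined with & inside a single .loc
loc_result = df.loc[(df.delays == curr_d) & (df.prev_delay == prev_d), 'd_stim']

# equivalent query() form; @ refers to local variables
query_result = df.query('delays == @curr_d and prev_delay == @prev_d').d_stim
```

Both select only the row where delays == 2 and prev_delay == 1.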
