I am trying to calculate the median of groups over columns. I found a very clear example at
Pandas: Calculate Median of Group over Columns
That question and answer is exactly what I needed, so I recreated the exact example posted to work through the details on my own:
import pandas
import numpy
data_3 = [2,3,4,5,4,2]
data_4 = [0,1,2,3,4,2]
df = pandas.DataFrame({'COL1': ['A','A','A','A','B','B'],
                       'COL2': ['AA','AA','BB','BB','BB','BB'],
                       'COL3': data_3,
                       'COL4': data_4})
m = df.groupby(['COL1', 'COL2'])[['COL3','COL4']].apply(numpy.median)
When I try to calculate the median of the groups over the columns, I encounter the error
TypeError: Series.name must be a hashable type
If I run the exact same code with the only difference being a different statistic in place of median (mean, min, max, std), everything works just fine.
I don't understand the cause of this error and why it only occurs for median, which is what I really need to calculate.
Thanks in advance for your help,
Bob
Here is the full error message. I am using Python 3.5.2:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-12-af0ef7da3347> in <module>()
----> 1 m = df.groupby(['COL1', 'COL2'])[['COL3','COL4']].apply(numpy.median)
/Applications/anaconda3/lib/python3.5/site-packages/pandas/core/groupby.py in apply(self, func, *args, **kwargs)
649 # ignore SettingWithCopy here in case the user mutates
650 with option_context('mode.chained_assignment', None):
--> 651 return self._python_apply_general(f)
652
653 def _python_apply_general(self, f):
/Applications/anaconda3/lib/python3.5/site-packages/pandas/core/groupby.py in _python_apply_general(self, f)
658 keys,
659 values,
--> 660 not_indexed_same=mutated or self.mutated)
661
662 def _iterate_slices(self):
/Applications/anaconda3/lib/python3.5/site-packages/pandas/core/groupby.py in _wrap_applied_output(self, keys, values, not_indexed_same)
3373 coerce = True if any([isinstance(x, Timestamp)
3374 for x in values]) else False
-> 3375 return (Series(values, index=key_index, name=self.name)
3376 ._convert(datetime=True,
3377 coerce=coerce))
/Applications/anaconda3/lib/python3.5/site-packages/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
231 generic.NDFrame.__init__(self, data, fastpath=True)
232
--> 233 self.name = name
234 self._set_axis(0, index, fastpath=True)
235
/Applications/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py in __setattr__(self, name, value)
2692 object.__setattr__(self, name, value)
2693 elif name in self._metadata:
-> 2694 object.__setattr__(self, name, value)
2695 else:
2696 try:
/Applications/anaconda3/lib/python3.5/site-packages/pandas/core/series.py in name(self, value)
307 def name(self, value):
308 if value is not None and not com.is_hashable(value):
--> 309 raise TypeError('Series.name must be a hashable type')
310 object.__setattr__(self, '_name', value)
311
TypeError: Series.name must be a hashable type
Somehow the Series name at this stage is being interpreted as unhashable, despite supposedly being a tuple. I think it may be the same bug as the one fixed and closed in:
Apply on selected columns of a groupby object - stopped working with 0.18.1 #13568
Basically, single scalar values in groups (as you have in your example) were causing the name of the Series to not be passed through. It is fixed in 0.19.2.
In any case, it shouldn't be a practical concern since you can (and should) call mean, median, etc. on GroupBy objects directly.
>>> df.groupby(['COL1', 'COL2'])[['COL3', 'COL4']].median()
COL3 COL4
COL1 COL2
A AA 2.5 0.5
BB 4.5 2.5
B BB 3.0 3.0
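If you need several statistics per group at once, GroupBy.agg accepts a list of function names; a minimal sketch using the same df as above:
# Compute median and mean for both columns in one pass;
# 'median' and 'mean' are built-in aggregation names.
stats = df.groupby(['COL1', 'COL2'])[['COL3', 'COL4']].agg(['median', 'mean'])
print(stats)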
Could you please help me solve this issue in my code? The spatial join using pandas (groupby(), agg()) gives me the error below.
I have a data frame df and I use several columns from it to groupby:
In the way shown below I almost get the table (data frame) that I need. What is missing is an additional column that contains the number of rows in each group. In other words, I have the mean, but I would also like to know how many numbers were used to get those means.
In short: How do I get group-wise statistics for a dataframe?
Code:
def bin_the_midpoints(bins, midpoints):
    b = bins.copy()
    m = midpoints.copy()
    reindexed = b.reset_index().rename(columns={'index': 'bins_index'})
    joined = gpd.tools.sjoin(reindexed, m)
    bin_stats = joined.groupby('bins_index')['offset'] \
        .agg({'fold': len, 'min_offset': np.min})
    return gpd.GeoDataFrame(b.join(bin_stats))
bin_stats = bin_the_midpoints(bins, midpoints)
Error:
---------------------------------------------------------------------------
SpecificationError Traceback (most recent call last)
Input In [103], in <cell line: 9>()
6 bin_stats = joined.groupby('bins_index')['offset']\
7 .agg({'fold': len, 'min_offset': np.min})
8 return gpd.GeoDataFrame(b.join(bin_stats))
----> 9 bin_stats = bin_the_midpoints(bins, midpoints)
Input In [103], in bin_the_midpoints(bins, midpoints)
4 reindexed = b.reset_index().rename(columns={'index':'bins_index'})
5 joined = gpd.tools.sjoin(reindexed, m)
----> 6 bin_stats = joined.groupby('bins_index')['offset']\
7 .agg({'fold': len, 'min_offset': np.min})
8 return gpd.GeoDataFrame(b.join(bin_stats))
File ~\anaconda3\envs\GeoSynapps\lib\site-packages\pandas\core\groupby\generic.py:271, in SeriesGroupBy.aggregate(self, func, engine, engine_kwargs, *args, **kwargs)
267 elif isinstance(func, abc.Iterable):
268 # Catch instances of lists / tuples
269 # but not the class list / tuple itself.
270 func = maybe_mangle_lambdas(func)
--> 271 ret = self._aggregate_multiple_funcs(func)
272 if relabeling:
273 # error: Incompatible types in assignment (expression has type
274 # "Optional[List[str]]", variable has type "Index")
275 ret.columns = columns # type: ignore[assignment]
File ~\anaconda3\envs\GeoSynapps\lib\site-packages\pandas\core\groupby\generic.py:307, in SeriesGroupBy._aggregate_multiple_funcs(self, arg)
301 def _aggregate_multiple_funcs(self, arg) -> DataFrame:
302 if isinstance(arg, dict):
303
304 # show the deprecation, but only if we
305 # have not shown a higher level one
306 # GH 15931
--> 307 raise SpecificationError("nested renamer is not supported")
309 elif any(isinstance(x, (tuple, list)) for x in arg):
310 arg = [(x, x) if not isinstance(x, (tuple, list)) else x for x in arg]
SpecificationError: nested renamer is not supported
You should read more about the agg method in pandas; you can easily add many calculations to it. The dict-of-name-to-function form used in the question ({'fold': len, ...}) was deprecated and later removed, which is why pandas raises "nested renamer is not supported".
For example you can write:
df.groupby(by=[...]).agg({'col1': ['count', 'sum', 'min']})
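Applied to the code in the question, the removed dict-based renaming can be rewritten with named aggregation (available since pandas 0.25); a hedged sketch of the failing lines:
# Named aggregation: each keyword becomes an output column name.
# 'size' matches len (it counts rows), 'min' replaces np.min.
bin_stats = joined.groupby('bins_index')['offset'] \
    .agg(fold='size', min_offset='min')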
I am new to Python and I can't figure out why I get this error: ValueError: Incompatible indexer with Series.
I am trying to add a date to my data frame.
The date I am trying to add:
date = (chec[(chec['Día_Sem']=='Thursday') & (chec['ID']==2011957)]['Entrada'])
date
Date output:
56 1900-01-01 07:34:00
Name: Entrada, dtype: datetime64[ns]
Then I try to add 'date' to my data frame using loc:
rep.loc[2039838,'Thursday'] = date
rep
And I get this error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-347-3e0678b0fdbf> in <module>
----> 1 rep.loc[2039838,'Thursday'] = date
2 rep
~/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py in __setitem__(self, key, value)
188 key = com.apply_if_callable(key, self.obj)
189 indexer = self._get_setitem_indexer(key)
--> 190 self._setitem_with_indexer(indexer, value)
191
192 def _validate_key(self, key, axis):
~/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value)
640 # setting for extensionarrays that store dicts. Need to decide
641 # if it's worth supporting that.
--> 642 value = self._align_series(indexer, Series(value))
643
644 elif isinstance(value, ABCDataFrame):
~/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py in _align_series(self, indexer, ser, multiindex_indexer)
781 return ser.reindex(ax)._values
782
--> 783 raise ValueError('Incompatible indexer with Series')
784
785 def _align_frame(self, indexer, df):
ValueError: Incompatible indexer with Series
I was also facing a similar issue, but in a different scenario. I came across threads about duplicate indices, but of course that was not the case for me. What worked for me was to use .at in place of .loc, so you can try it and see if it works:
rep['Thursday'].at[2039838] = date.values[0]
Try date.iloc[0] instead of date:
rep.loc[2039838,'Thursday'] = date.iloc[0]
This is because date is actually a Series (basically a list/array of values), and .iloc[0] selects the first value out of it.
You are using loc to set one specific value, but date is a Series (or DataFrame), so the two types cannot match. Instead, you can assign the value inside date to rep.loc[2039838,'Thursday']. For example, if date is a non-empty Series, you can do this:
rep.loc[2039838,'Thursday'] = date.values[0]
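For illustration, a self-contained sketch of the failure and the fix (the frames here are made-up stand-ins for the question's rep and date):
import pandas as pd
# One-row frame with a datetime column, mimicking rep.
rep = pd.DataFrame({'Thursday': [pd.NaT]}, index=[2039838])
# One-element Series, mimicking the selected date.
date = pd.Series(pd.to_datetime(['1900-01-01 07:34:00']),
                 index=[56], name='Entrada')
# rep.loc[2039838, 'Thursday'] = date        # ValueError: Incompatible indexer with Series
rep.loc[2039838, 'Thursday'] = date.iloc[0]  # assigning the extracted scalar works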
df.dtypes
name object
rating object
genre object
year int64
released object
score float64
votes float64
director object
writer object
star object
country object
budget float64
gross float64
company object
runtime float64
dtype: object
Then when I try to convert using:
df['budget'] = df['budget'].astype("int64")
it says:
ValueError Traceback (most recent call last)
<ipython-input-23-6ced5964af60> in <module>
1 # Change Datatype for Columns
----> 2 df['budget'] = df['budget'].astype("int64")
3
4 #df['column_name'].astype(np.float).astype("Int32")
5 #df['gross'] = df['gross'].astype('int64')
~\anaconda3\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors)
5696 else:
5697 # else, only a single dtype is given
-> 5698 new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors)
5699 return self._constructor(new_data).__finalize__(self)
5700
~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in astype(self, dtype, copy, errors)
580
581 def astype(self, dtype, copy: bool = False, errors: str = "raise"):
--> 582 return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
583
584 def convert(self, **kwargs):
~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in apply(self, f, filter, **kwargs)
440 applied = b.apply(f, **kwargs)
441 else:
--> 442 applied = getattr(b, f)(**kwargs)
443 result_blocks = _extend_blocks(applied, result_blocks)
444
~\anaconda3\lib\site-packages\pandas\core\internals\blocks.py in astype(self, dtype, copy, errors)
623 vals1d = values.ravel()
624 try:
--> 625 values = astype_nansafe(vals1d, dtype, copy=True)
626 except (ValueError, TypeError):
627 # e.g. astype_nansafe can fail on object-dtype of strings
~\anaconda3\lib\site-packages\pandas\core\dtypes\cast.py in astype_nansafe(arr, dtype, copy, skipna)
866
867 if not np.isfinite(arr).all():
--> 868 raise ValueError("Cannot convert non-finite values (NA or inf) to integer")
869
870 elif is_object_dtype(arr):
ValueError: Cannot convert non-finite values (NA or inf) to integer
Assuming that the budget does not contain infinite values, the problem is probably that you have NaN values. NaN is allowed in float columns but not in integer columns.
You can either:
drop the NA values before converting, or
if you still want to keep the NA values and have a recent version of pandas, convert to the nullable integer type that accepts NaN values (note the capital I):
df['budget'] = df['budget'].astype("Int64")
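A small sketch of the difference, assuming a column with one missing value:
import pandas as pd
import numpy as np
s = pd.Series([100.0, np.nan, 250.0])
# s.astype("int64") raises: Cannot convert non-finite values (NA or inf) to integer
print(s.astype("Int64"))  # nullable integer dtype; the NaN becomes <NA>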
Try this (notice the capital "I" in Int64):
df['budget'] = df['budget'].astype("Int64")
You might have some NaN values in this column, which is likely the reason for this issue.
From pandas docs:
Changed in version 1.0.0: Now uses pandas.NA as the missing value rather than numpy.nan
Follow the link to find out more:
https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
Or you could fill the NaN/NA values with 0 and then do .astype("int64"):
df['budget'] = df['budget'].fillna(0).astype("int64")
Check for any null values present in the column.
If there are no null values, try using apply() instead of astype():
df['budget'] = df['budget'].apply("int64")
I checked similar questions posted about slicing DFs in Python but they didn't explain the inconsistency I'm seeing in my exercise.
The code works with the known diamonds data frame. Top lines of the data frame are:
   carat      cut color clarity  depth  table  price     x     y     z
0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
I have to create a slicing function which takes five arguments: a DataFrame 'df', a column of that DataFrame
'col', the label of another column 'label', and two values 'val1' and 'val2'. The function takes the frame and outputs the entries of the column indicated by the 'label' argument for those rows where the values of the column 'col' are greater than 'val1' and less than 'val2'.
The following stand-alone piece of code gives me the correct answer:
diamonds.loc[(diamonds.carat > 1.1) & (diamonds.carat < 1.4),['price']]
and I get the price from the rows where the carat value is between 1.1 and 1.4.
However, when I try to use this syntax in a function, it doesn't work and I get an error.
Function:
def slice2(df,col,output_label,val1,val2):
    res = df.loc[(col > val1) & (col < val2), ['output_label']]
    return res
Function call:
slice2(diamonds,diamonds.carat,'price',1.1,1.4)
Error:
"None of [['output_label']] are in the [columns]"
Full traceback message:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-64-adc582faf6cc> in <module>()
----> 1 slice2(diamonds,diamonds.carat,'price',1.1,1.4)
<ipython-input-63-556b71ba172d> in slice2(df, col, output_label, val1, val2)
1 def slice2(df,col,output_label,val1,val2):
----> 2 res = df.loc[(col > val1) & (col < val2), ['output_label']]
3 return res
/Users/jojo/Library/Enthought/Canopy/edm/envs/User/lib/python3.5/site-packages/pandas/core/indexing.py in __getitem__(self, key)
1323 except (KeyError, IndexError):
1324 pass
-> 1325 return self._getitem_tuple(key)
1326 else:
1327 key = com._apply_if_callable(key, self.obj)
/Users/jojo/Library/Enthought/Canopy/edm/envs/User/lib/python3.5/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
839
840 # no multi-index, so validate all of the indexers
--> 841 self._has_valid_tuple(tup)
842
843 # ugly hack for GH #836
/Users/jojo/Library/Enthought/Canopy/edm/envs/User/lib/python3.5/site-packages/pandas/core/indexing.py in _has_valid_tuple(self, key)
187 if i >= self.obj.ndim:
188 raise IndexingError('Too many indexers')
--> 189 if not self._has_valid_type(k, i):
190 raise ValueError("Location based indexing can only have [%s] "
191 "types" % self._valid_types)
/Users/jojo/Library/Enthought/Canopy/edm/envs/User/lib/python3.5/site-packages/pandas/core/indexing.py in _has_valid_type(self, key, axis)
1416
1417 raise KeyError("None of [%s] are in the [%s]" %
-> 1418 (key, self.obj._get_axis_name(axis)))
1419
1420 return True
KeyError: "None of [['output_label']] are in the [columns]"
I'm not very advanced in Python, and after looking at this code for a while I haven't been able to figure out what the problem is. Maybe I'm blind to something obvious here; I would appreciate any pointers on how to get the function to work, or how to redo it so that it gives the same result as the single-line code.
Thanks
In your function
def slice2(df,col,output_label,val1,val2):
    res = df.loc[(col > val1) & (col < val2), ['output_label']]
    return res
you are searching for a column literally named 'output_label' instead of using your parameter (you hard-coded the string instead of using the parameter's value!).
This should work:
def slice2(df,col,output_label,val1,val2):
    res = df.loc[(col > val1) & (col < val2), [output_label]]  # note: no quotes
    return res
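Calling it the same way as in the question should then return the expected slice:
slice2(diamonds, diamonds.carat, 'price', 1.1, 1.4)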
This script:
for x in df.index:
    if df.loc[x,'medicament1'] in dicoprix:
        df.loc[x,'coutmed1'] = dicoprix[df.loc[x,'medicament1']]
gives this error:
File "<ipython-input-35-097fdb2220b8>", line 3, in <module>
df.loc[x,'coutmed1'] = dicoprix[df.loc[x,'medicament1']]
File "//anaconda/lib/python2.7/site-packages/pandas/core/indexing.py", line 115, in __setitem__
self._setitem_with_indexer(indexer, value)
File "//anaconda/lib/python2.7/site-packages/pandas/core/indexing.py", line 346, in _setitem_with_indexer
value = self._align_series(indexer, value)
File "//anaconda/lib/python2.7/site-packages/pandas/core/indexing.py", line 613, in _align_series
raise ValueError('Incompatible indexer with Series')
ValueError: Incompatible indexer with Series
But the script does work, meaning df.loc[x,'coutmed1'] takes the value that I want.
I don't understand what I am doing wrong.
I think the problem comes from this:
dicoprix[df.loc[x,'medicament1']]
This problem occurs when a key in the lookup refers to more than one value, i.e. when the series index contains duplicates!
Solution: remove the duplicate indexes from the series (i.e. dicoprix) so they are unique.
You got it: the problem is in dicoprix[df.loc[x,'medicament1']].
There are duplicate labels in the index of the series dicoprix, so the lookup returns more than one value, which cannot be stored in a single cell of the dataframe.
Below is a demonstration:
In [1]:
import pandas as pd
dum_ser = pd.Series(index=['a','b','b','c'], data=['apple', 'balloon', 'ball', 'cat'])
dum_ser
Out [1]:
a      apple
b    balloon
b       ball
c        cat
dtype: object
In [2]:
df = pd.DataFrame({'letter':['a','b','c','d'], 'full_form':['aley', 'byue', 'case', 'cible']}, index=[0,1,2,3])
df
Out [2]:
  letter full_form
0      a      aley
1      b      byue
2      c      case
3      d     cible
The following command runs fine because 'a' is not a duplicated index label in the dum_ser series:
In [3]:
df.loc[0,'full_form'] = dum_ser['a']
df
Out [3]:
  letter full_form
0      a     apple
1      b      byue
2      c      case
3      d     cible
The error occurs when the command tries to insert two records from the series into a single cell of the DataFrame (there are two records for the index 'b' in dum_ser; to check, run dum_ser['b']). See below:
In [4]:
df.loc[1,'full_form'] = dum_ser['b']
Out [4]:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-4-af11b9b3a776> in <module>()
----> 1 df.loc[1,'full_form'] = dum_ser['b']
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py in __setitem__(self, key, value)
187 key = com._apply_if_callable(key, self.obj)
188 indexer = self._get_setitem_indexer(key)
--> 189 self._setitem_with_indexer(indexer, value)
190
191 def _validate_key(self, key, axis):
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py in _setitem_with_indexer(self, indexer, value)
635 # setting for extensionarrays that store dicts. Need to decide
636 # if it's worth supporting that.
--> 637 value = self._align_series(indexer, Series(value))
638
639 elif isinstance(value, ABCDataFrame):
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py in _align_series(self, indexer, ser, multiindex_indexer)
775 return ser.reindex(ax)._values
776
--> 777 raise ValueError('Incompatible indexer with Series')
778
779 def _align_frame(self, indexer, df):
ValueError: Incompatible indexer with Series
The line of code above corresponds to one iteration of the for loop from the question, i.e. x = 1.
Solution: remove the duplicate index labels from the series (dicoprix in the question, dum_ser here) so that each label is unique.
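One way to do that, using the demo series from above (a sketch; keep='first' keeps the first of each duplicated label):
# Drop all but the first occurrence of each index label.
dum_ser_unique = dum_ser[~dum_ser.index.duplicated(keep='first')]
df.loc[1, 'full_form'] = dum_ser_unique['b']  # now yields a single scalar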
Use indexing like this:
dicoprix[df.loc[x,'medicament1']][0]
It worked for me (the trailing [0] simply takes the first of the duplicated values).
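As an aside, once the lookup series has a unique index, the whole loop from the question can usually be replaced with a vectorized map; a sketch, assuming dicoprix is a pandas Series:
# Deduplicate the index, then look up all prices in one shot.
dicoprix = dicoprix[~dicoprix.index.duplicated(keep='first')]
df['coutmed1'] = df['medicament1'].map(dicoprix)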