pandas.DataFrame.agg does not work with np.std?

I am trying to use pandas.DataFrame.agg on the first column of a dataframe, with numpy.std as the aggregation function.
I don't know why it works with numpy.mean but not numpy.std. Can someone tell me under what circumstances that happens? This is very strange.
The following describes what I am facing.
My source is like this:
import numpy as np

print(type(dataframe))
print(dataframe.head(5))
first_col = dataframe.columns.values[0]
agg_df = dataframe.agg({first_col: [np.mean]})
print(agg_df)
It prints the following result:
<class 'pandas.core.frame.DataFrame'>
ax
0 -98.06
1 -97.81
2 -96.00
3 -93.44
4 -92.94
ax
mean -98.06
Now I change the function from np.mean to np.std (without changing anything else):
print(type(dataframe))
print(dataframe.head(5))
first_col = dataframe.columns.values[0]
agg_df = dataframe.agg({first_col: [np.std]})
print(agg_df)
and it raises the following error:
Traceback (most recent call last):
File "C:\prediction_framework_django\predictions\predictor.py", line 112, in pre_aggregated_unseen_data
agg_df = dataframe.agg({axis: [np.std]})
File "C:\prediction_framework_django\env\lib\site-packages\pandas\core\frame.py", line 7578, in aggregate
result, how = self._aggregate(func, axis, *args, **kwargs)
File "C:\prediction_framework_django\env\lib\site-packages\pandas\core\frame.py", line 7609, in _aggregate
return aggregate(self, arg, *args, **kwargs)
File "C:\prediction_framework_django\env\lib\site-packages\pandas\core\aggregation.py", line 582, in aggregate
return agg_dict_like(obj, arg, _axis), True
File "C:\prediction_framework_django\env\lib\site-packages\pandas\core\aggregation.py", line 768, in agg_dict_like
results = {key: obj._gotitem(key, ndim=1).agg(how) for key, how in arg.items()}
File "C:\prediction_framework_django\env\lib\site-packages\pandas\core\aggregation.py", line 768, in <dictcomp>
results = {key: obj._gotitem(key, ndim=1).agg(how) for key, how in arg.items()}
File "C:\prediction_framework_django\env\lib\site-packages\pandas\core\series.py", line 3974, in aggregate
result, how = aggregate(self, func, *args, **kwargs)
File "C:\prediction_framework_django\env\lib\site-packages\pandas\core\aggregation.py", line 586, in aggregate
return agg_list_like(obj, arg, _axis=_axis), None
File "C:\prediction_framework_django\env\lib\site-packages\pandas\core\aggregation.py", line 672, in agg_list_like
raise ValueError("no results")
ValueError: no results
So the error is raised in agg_list_like: raise ValueError("no results"), i.e. ValueError: no results.
Thank you for your time and help.

Simply use the pandas builtin:
# Note the use of string to denote the function here
df.agg({first_col: ['mean', 'std']})
# You can also simply use the following
df[first_col].mean()
df[first_col].std()
[EDIT]: The error you are getting probably results from mixed types. You can check that all dtypes are float by looking at df.dtypes. If one of them is object, convert the problematic values (probably empty strings) into whatever you need, and both np.std and pandas' builtin std should work.
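A minimal sketch of that check with made-up data, assuming the problematic values are empty strings:
import pandas as pd
# Hypothetical data: an empty string mixed into an otherwise float column.
dataframe = pd.DataFrame({"ax": [-98.06, -97.81, "", -93.44, -92.94]})
print(dataframe.dtypes)  # 'ax' shows as object because of the empty string
first_col = dataframe.columns.values[0]
# Coerce non-numeric values to NaN, then aggregate; mean and std skip NaN.
dataframe[first_col] = pd.to_numeric(dataframe[first_col], errors="coerce")
print(dataframe.agg({first_col: ["mean", "std"]}))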

Related

Python3 Pandas - handle overflow when casting to number greater than data type int64

I am writing a standard script where I fetch data from a database, do some manipulation, and insert the data back into another table.
I am facing an overflow issue while converting a column's type in a DataFrame.
Here's an example:
import numpy as np
import pandas as pd
d = {'col1': ['66666666666666666666666666666']}
df = pd.DataFrame(data=d)
df['col1'] = df['col1'].astype('int64')
print(df)
Error:
Traceback (most recent call last):
File "HelloWorld.py", line 6, in <module>
df['col1'] = df['col1'].astype('int64')
File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 5548, in astype
new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors,)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 604, in astype
return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 409, in apply
applied = getattr(b, f)(**kwargs)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/blocks.py", line 595, in astype
values = astype_nansafe(vals1d, dtype, copy=True)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/dtypes/cast.py", line 974, in astype_nansafe
return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
File "pandas/_libs/lib.pyx", line 615, in pandas._libs.lib.astype_intsafe
OverflowError: Python int too large to convert to C long
I cannot control the values inside d['col1'] because in the actual code they are generated by another function.
How can I solve this problem if I want to keep the final data type as 'int64'?
I was thinking of catching the exception and then assigning the largest int64 value to the whole column, but then the rows of the column which are not overflowing might also lead to inconsistent results.
Can you advise me on an elegant solution here?
Building on your idea, you can use np.iinfo:
ii64 = np.iinfo(np.int64)
df['col1'] = df['col1'].astype('float128').clip(ii64.min, ii64.max).astype('int64')
print(df)
# Output
col1
0 9223372036854775807
Take care of the limit of float128 too :-D
>>> np.finfo(np.float128)
finfo(resolution=1e-18, min=-1.189731495357231765e+4932, max=1.189731495357231765e+4932, dtype=float128)
>>> np.iinfo('int64')
iinfo(min=-9223372036854775808, max=9223372036854775807, dtype=int64)
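Note that np.float128 is not available on every platform (notably standard Windows builds of NumPy). A sketch of an alternative that clips at the Python-int level instead, assuming the column holds numeric strings as in the example:
import numpy as np
import pandas as pd
d = {'col1': ['66666666666666666666666666666', '123']}
df = pd.DataFrame(data=d)
ii64 = np.iinfo(np.int64)
def clip_to_int64(s):
    # Python ints have arbitrary precision, so int(s) never overflows;
    # saturate at the int64 bounds before the final cast.
    return max(ii64.min, min(ii64.max, int(s)))
df['col1'] = df['col1'].map(clip_to_int64).astype('int64')
print(df)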

Pandas – ValueError: Cannot setitem on a Categorical with a new category, set the categories first

I've been searching for a solution to this for the past few hours now. The relevant pandas documentation is unhelpful, and this solution gives me the same error.
I am trying to order my dataframe using a categorical in the following manner:
from pandas.api.types import CategoricalDtype

metabolites_order = CategoricalDtype(['Header', 'Metabolite', 'Unknown'], ordered=True)
df2['Feature type'] = df2['Feature type'].astype(metabolites_order)
df2 = df2.sort_values('Feature type')
The "Feature type" column is populated with the categories correctly. This code runs perfectly in Jupyter Notebooks, but when I run it in Pycharm, I get the following error:
Traceback (most recent call last):
File "/Users/wasim.sandhu/Documents/MSDIALPostProcessor/postprocessor.py", line 138, in process_alignment_file
df2.loc[4] = list(df2.columns)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/indexing.py", line 692, in __setitem__
iloc._setitem_with_indexer(indexer, value, self.name)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/indexing.py", line 1635, in _setitem_with_indexer
self._setitem_with_indexer_split_path(indexer, value, name)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/indexing.py", line 1700, in _setitem_with_indexer_split_path
self._setitem_single_column(loc, v, pi)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/indexing.py", line 1813, in _setitem_single_column
ser._mgr = ser._mgr.setitem(indexer=(pi,), value=value)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 568, in setitem
return self.apply("setitem", indexer=indexer, value=value)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 427, in apply
applied = getattr(b, f)(**kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/internals/blocks.py", line 1846, in setitem
self.values[indexer] = value
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/arrays/_mixins.py", line 211, in __setitem__
value = self._validate_setitem_value(value)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/arrays/categorical.py", line 1898, in _validate_setitem_value
raise ValueError(
ValueError: Cannot setitem on a Categorical with a new category, set the categories first
Process finished with exit code 134 (interrupted by signal 6: SIGABRT)
What could be causing this? I believe that I've set the categories correctly...
I'd suggest just mapping these categories to integers, then sorting on that column instead.
categories = ['Header', 'Metabolite', 'Unknown']
feature_map = {cat: i for i, cat in enumerate(categories)}
df['Feature order'] = df['Feature type'].map(feature_map)
df = df.sort_values('Feature order')
Figured it out literally minutes after I posted the question. The header row of this dataset is written back into the 5th row (df2.loc[4] = list(df2.columns) in the traceback), so "Feature type" itself appears as a value in the "Feature type" column; since it was not one of the defined categories, it threw this error.
Solved by adding the column header name to the categories:
metabolites_order = CategoricalDtype(['Header', 'Feature type', 'Metabolite', 'Unknown'], ordered=True)
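The same situation can be reproduced in isolation; a minimal sketch (with made-up values) showing the error and the cat.add_categories escape hatch, which avoids redefining the whole dtype:
import pandas as pd
from pandas.api.types import CategoricalDtype
metabolites_order = CategoricalDtype(['Header', 'Metabolite', 'Unknown'], ordered=True)
s = pd.Series(['Metabolite', 'Unknown']).astype(metabolites_order)
try:
    s[0] = 'Feature type'  # not a defined category
except (TypeError, ValueError) as e:  # exception class varies by pandas version
    print(e)  # Cannot setitem on a Categorical with a new category...
s = s.cat.add_categories(['Feature type'])
s[0] = 'Feature type'  # now a legal assignment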

Issue when computing/merging dask dataframe(s) when index is categorical

I'm trying to use dask to process a dataset which does not fit into memory. It's time-series data for various "IDs". After reading the dask documentation, I chose the parquet file format, partitioned by "ID".
However, while reading from parquet and setting the index, I encountered a "TypeError: to union ordered Categoricals, all categories must be the same" which I did not manage to solve by myself.
This code replicates the issue I'm having:
import dask.dataframe as dd
import numpy as np
import pandas as pd
import traceback
# create ids
ids = ["AAA", "BBB", "CCC", "DDD"]
# create data
df = pd.DataFrame(index=np.random.choice(ids, 50), data=np.random.rand(50, 1), columns=["FOO"]).reset_index().rename(columns={"index": "ID"})
# serialize to parquet
f = r"C:/temp/foo.pq"
df.to_parquet(f, compression='gzip', engine='fastparquet', partition_cols=["ID"])
# read with dask
df = dd.read_parquet(f)
try:
    df = df.set_index("ID")
except Exception as ee:
    print(traceback.format_exc())
At this point I get the following error:
~\.conda\envs\env_dask_py37\lib\site-packages\pandas\core\arrays\categorical.py in check_for_ordered(self, op)
1492 if not self.ordered:
1493 raise TypeError(
-> 1494 f"Categorical is not ordered for operation {op}\n"
1495 "you can use .as_ordered() to change the "
1496 "Categorical to an ordered one\n"
TypeError: Categorical is not ordered for operation max
you can use .as_ordered() to change the Categorical to an ordered one
I then did:
# we order the categorical
df.ID = df.ID.cat.as_ordered()
df = df.set_index("ID")
And when I try to use df.compute(scheduler="processes"), I get the TypeError I mentioned before:
try:
    schd_str = 'processes'
    aa = df.compute(scheduler=schd_str)
    print(f"{schd_str}: OK")
except Exception:
    print(f"{schd_str}: KO")
    print(traceback.format_exc())
gives:
Traceback (most recent call last):
File "<ipython-input-6-e15c4e86fee2>", line 3, in <module>
aa = df.compute(scheduler=schd_str)
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\base.py", line 166, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\base.py", line 438, in compute
return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\base.py", line 438, in <listcomp>
return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\core.py", line 103, in finalize
return _concat(results)
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\core.py", line 98, in _concat
else methods.concat(args2, uniform=True, ignore_index=ignore_index)
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 383, in concat
ignore_index=ignore_index,
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 431, in concat_pandas
ind = concat([df.index for df in dfs])
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 383, in concat
ignore_index=ignore_index,
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 400, in concat_pandas
return pd.CategoricalIndex(union_categoricals(dfs), name=dfs[0].name)
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\pandas\core\dtypes\concat.py", line 352, in union_categoricals
raise TypeError("Categorical.ordered must be the same")
TypeError: Categorical.ordered must be the same
Surprisingly enough, using df.compute(scheduler="threads"), df.compute(scheduler="synchronous"), or not setting the index at all works properly.
However, that does not seem to be something I should settle for, since I'm actually trying to merge several of these datasets, and I thought that setting the index would result in a speed-up over not setting any. (I get the same exact error when trying to merge two dataframes indexed this way.)
I inspected df._meta, and it turns out my categories are "known", as they should be according to the dask documentation on categoricals.
I also read this GitHub post about something that looks similar, but somehow did not find a solution.
Thanks for your help.
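For reference, the pandas-level constraint behind the final TypeError can be reproduced directly; a minimal sketch:
import pandas as pd
from pandas.api.types import union_categoricals
a = pd.Categorical(["AAA", "BBB"], ordered=True)
b = pd.Categorical(["CCC", "DDD"], ordered=False)
# union_categoricals refuses to combine categoricals whose `ordered`
# flags differ, which is what dask hits when concatenating partitions:
try:
    union_categoricals([a, b])
except TypeError as e:
    print(e)  # Categorical.ordered must be the same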

Can use dataframe ix for assignment, but not retrieval

I am looping through the rows of a pandas df with loop index i.
I am able to assign several columns using the ix function, with the loop index as the first parameter and the column name as the second.
However, when I try to retrieve/print using the same method,
print(df.ix[i,"Run"])
I get the following TypeError: 'str' object cannot be interpreted as an integer,
apparently related to KeyError: 'Run'.
Not quite sure why this is occurring, as Run is indeed a column in the dataframe.
Any suggestions?
Traceback (most recent call last):
File "C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexes\base.py", line 3124, in get_value
return libindex.get_value_box(s, key)
File "pandas\_libs\index.pyx", line 55, in pandas._libs.index.get_value_box
File "pandas\_libs\index.pyx", line 63, in pandas._libs.index.get_value_box
TypeError: 'str' object cannot be interpreted as an integer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\...", line 365, in <module>
print(df.ix[i,"Run"])
File "C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexing.py", line 116, in __getitem__
return self._getitem_tuple(key)
File "C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexing.py", line 870, in _getitem_tuple
return self._getitem_lowerdim(tup)
File "C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexing.py", line 1027, in _getitem_lowerdim
return getattr(section, self.name)[new_key]
File "C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexing.py", line 122, in __getitem__
return self._getitem_axis(key, axis=axis)
File "C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexing.py", line 1116, in _getitem_axis
return self._get_label(key, axis=axis)
File "C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexing.py", line 136, in _get_label
return self.obj[label]
File "C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\series.py", line 767, in __getitem__
result = self.index.get_value(self, key)
File "C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexes\base.py", line 3132, in get_value
raise e1
File "C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexes\base.py", line 3118, in get_value
tz=getattr(series.dtype, 'tz', None))
File "pandas\_libs\index.pyx", line 106, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 114, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Run'
Upon changing the name of the column I print to any other column, it does work correctly. Earlier in the code, I "compressed" the rows (there were multiple rows per unique string in the 'Run' column) using the following:
df = df.groupby('Run').max()
Did this last line somehow remove the column/column name from the table?
ix has been deprecated. ix has always been ambiguous: does ix[10] refer to the row with the label 10, or the row at position 10?
Use loc or iloc instead:
df.loc[i, "Run"] = ...  # by label
df.iloc[i, df.columns.get_loc("Run")] = ...  # by position (avoids chained assignment, which may not write to the original frame)
As for the groupby removing Run: it moves Run to the index of the data frame. To get it back as a column, call reset_index:
df=df.groupby('Run').max().reset_index()
Differences between indexing by label and position:
Suppose you have a series like this:
s = pd.Series(['a', 'b', 'c', 'd', 'e'], index=np.arange(0,9,2))
0 a
2 b
4 c
6 d
8 e
The first column is the labels (aka the index). The second column is the values of the series.
Label based indexing:
s.loc[2] --> b
s.loc[3] --> error. The label doesn't exist
Position based indexing:
s.iloc[2] --> c, since `a` has position 0, `b` has position 1, and so on
s.iloc[3] --> d
According to the documentation, s.ix[3] would have returned d since it first searches for the label 3. When that fails, it falls back to the position 3. On my machine (Pandas 0.24.2), it returns an error, along with a deprecation warning, so I guess the developers changed it to behave like loc.
If you want to use mixed indexing, you have to be explicit about that:
s.loc[3] if 3 in s.index else s.iloc[3]
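To see the groupby point concretely, a minimal sketch with made-up data:
import pandas as pd
df = pd.DataFrame({'Run': ['a', 'a', 'b'], 'val': [1, 2, 3]})
g = df.groupby('Run').max()
print('Run' in g.columns)  # False: 'Run' became the index
g = g.reset_index()
print('Run' in g.columns)  # True: 'Run' is a regular column again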

Pandas DataFrame: query with variables

I'm working on a DataFrame query using two variables.
The first variable is the column label and the second is a list of values.
What I want to do is select all rows where that column has a value contained in that list. The strange thing is that if I write the column label as a string there is no error, while referencing the variable containing the column label gives the following error:
Traceback (most recent call last):
File "C:\Python\Python36\lib\site-packages\pandas\indexes\base.py", line 2134, in get_loc
return self._engine.get_loc(key)
File "pandas\index.pyx", line 132, in pandas.index.IndexEngine.get_loc (pandas\index.c:4433)
File "pandas\index.pyx", line 151, in pandas.index.IndexEngine.get_loc (pandas\index.c:4238)
File "pandas\index.pyx", line 388, in pandas.index.Int64Engine._check_type (pandas\index.c:8171)
KeyError: False
This is the working code:
rhs_values_list = df1["RHS"].tolist()
query = "shoe_size in @rhs_values_list"
result_set = df2.query(query)
while this raises the above error:
rhs_values_list = df1["RHS"].tolist()
col = "shoe_size"
query = "@col in @rhs_values_list"
result_set = df2.query(query)
Is there something wrong with the second version of the query?
With "@col in @rhs_values_list", query substitutes the value of col, so the expression becomes the string "shoe_size" tested for membership in the list: a plain boolean, not a filter on the shoe_size column (hence the KeyError: False). To inject a column name you need string interpolation instead, e.g.:
rhs_values_list = df1["RHS"].tolist()
col = "shoe_size"
query = "{} in @rhs_values_list".format(col)
result_set = df2.query(query)
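Equivalently with an f-string; the backticks are query's syntax (pandas 0.25+) for column names that contain spaces or other special characters:
query = f"`{col}` in @rhs_values_list"
result_set = df2.query(query)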
