pandas: how to select by date

Why can I do a selection by month in this case, but not a selection by a specific date?
dates = pd.date_range( start = "01/01/1931" , end = "01/02/1941" )
new_df_4 = new_df_3.reindex(dates)
new_df_4["1931-10"][![enter image description here][1]][1]
But this doesn't work:
new_df_4["1931-10-02"]
KeyError Traceback (most recent call last)
in ()
----> 1 new_df_4["1931-10-02"]
/Users/romain/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in __getitem__(self, key)
1990 return self._getitem_multilevel(key)
1991 else:
-> 1992 return self._getitem_column(key)
1993
1994 def _getitem_column(self, key):
/Users/romain/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in _getitem_column(self, key)
2002 result = self._constructor(self._data.get(key))
2003 if result.columns.is_unique:
-> 2004 result = result[key]
2005
2006 return result
/Users/romain/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in __getitem__(self, key)
1990 return self._getitem_multilevel(key)
1991 else:
-> 1992 return self._getitem_column(key)
1993
1994 def _getitem_column(self, key):
/Users/romain/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in _getitem_column(self, key)
1997 # get column
1998 if self.columns.is_unique:
-> 1999 return self._get_item_cache(key)
2000
2001 # duplicate columns & possible reduce dimensionality
/Users/romain/anaconda/lib/python2.7/site-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
1343 res = cache.get(item)
1344 if res is None:
-> 1345 values = self._data.get(item)
1346 res = self._box_item_values(item, values)
1347 cache[item] = res
/Users/romain/anaconda/lib/python2.7/site-packages/pandas/core/internals.pyc in get(self, item, fastpath)
3223
3224 if not isnull(item):
-> 3225 loc = self.items.get_loc(item)
3226 else:
3227 indexer = np.arange(len(self.items))[isnull(self.items)]
/Users/romain/anaconda/lib/python2.7/site-packages/pandas/indexes/base.pyc in get_loc(self, key, method, tolerance)
1876 return self._engine.get_loc(key)
1877 except KeyError:
-> 1878 return self._engine.get_loc(self._maybe_cast_indexer(key))
1879
1880 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4027)()
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3891)()
pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12408)()
pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12359)()
KeyError: '1931-10-02'

To select by month, use partial string indexing:
print (new_df_4["1931-10"])
This won't work when the string has the same resolution as the index (from the same docs):
Warning
However if the string is treated as an exact match, the
selection in DataFrameā€˜s [] will be column-wise and not row-wise, see
Indexing Basics. For example dft_minute['2011-12-31 23:59'] will raise
KeyError as '2012-12-31 23:59' has the same resolution as index and
there is no column with such name: To always have unambiguous
selection, whether the row is treated as a slice or a single
selection, use .loc.
In [95]: dft_minute.loc['2011-12-31 23:59']
Out[95]:
a 1
b 4
Name: 2011-12-31 23:59:00, dtype: int64
You can use .loc if you need to select by an exact date:
new_df_4.loc["1931-10-02"]
Sample:
np.random.seed(10)
dates = pd.date_range( start = "01/01/1931" , end = "01/02/1941" )
new_df_4 = pd.DataFrame({'a':np.random.randint(10, size=len(dates))}, index=dates)
print (new_df_4.head())
a
1931-01-01 9
1931-01-02 4
1931-01-03 0
1931-01-04 1
1931-01-05 9
print (new_df_4["1931-10"])
a
1931-10-01 9
1931-10-02 6
1931-10-03 9
1931-10-04 7
1931-10-05 8
1931-10-06 0
1931-10-07 9
1931-10-08 6
1931-10-09 0
1931-10-10 1
1931-10-11 0
...
print (new_df_4.loc["1931-10-02"])
a 6
Name: 1931-10-02 00:00:00, dtype: int32
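Note that a string slice against a DatetimeIndex is always treated row-wise, so an exact-date slice should also work; a small sketch against the sample above (it returns a one-row DataFrame rather than a Series):
print (new_df_4["1931-10-02":"1931-10-02"])
            a
1931-10-02  6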

Related

How to replace nan in a column with the median of the column

Using pandas, I've been working on Kaggle's Titanic problem and have tried different variants of groupby/apply to fill in the NaN entries of the training data's train['Age'] column.
import pandas as pd
import numpy as np
train = pd.DataFrame({'ID': [887, 888, 889, 890], 'Age': [19.0, np.nan, 26.0, 32.0]})
ID Age
0 887 19.0
1 888 NaN
2 889 26.0
3 890 32.0
How would I go through the elements and change the NaN entries to something like the median age?
I've tried variations of
train.Age = train.Age.apply(lambda x: x.fillna(x.median()))
Which results in
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Input In [249], in <cell line: 1>()
----> 1 train.Age = train.Age.apply(lambda x: x.fillna(x.median()))
File ~\anaconda3\envs\py10\lib\site-packages\pandas\core\series.py:4433, in Series.apply(self, func, convert_dtype, args, **kwargs)
4323 def apply(
4324 self,
4325 func: AggFuncType,
(...)
4328 **kwargs,
4329 ) -> DataFrame | Series:
4330 """
4331 Invoke function on values of Series.
4332
(...)
4431 dtype: float64
4432 """
-> 4433 return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
File ~\anaconda3\envs\py10\lib\site-packages\pandas\core\apply.py:1088, in SeriesApply.apply(self)
1084 if isinstance(self.f, str):
1085 # if we are a string, try to dispatch
1086 return self.apply_str()
-> 1088 return self.apply_standard()
File ~\anaconda3\envs\py10\lib\site-packages\pandas\core\apply.py:1143, in SeriesApply.apply_standard(self)
1137 values = obj.astype(object)._values
1138 # error: Argument 2 to "map_infer" has incompatible type
1139 # "Union[Callable[..., Any], str, List[Union[Callable[..., Any], str]],
1140 # Dict[Hashable, Union[Union[Callable[..., Any], str],
1141 # List[Union[Callable[..., Any], str]]]]]"; expected
1142 # "Callable[[Any], Any]"
-> 1143 mapped = lib.map_infer(
1144 values,
1145 f, # type: ignore[arg-type]
1146 convert=self.convert_dtype,
1147 )
1149 if len(mapped) and isinstance(mapped[0], ABCSeries):
1150 # GH#43986 Need to do list(mapped) in order to get treated as nested
1151 # See also GH#25959 regarding EA support
1152 return obj._constructor_expanddim(list(mapped), index=obj.index)
File ~\anaconda3\envs\py10\lib\site-packages\pandas\_libs\lib.pyx:2870, in pandas._libs.lib.map_infer()
Input In [249], in <lambda>(x)
----> 1 train.Age = train.Age.apply(lambda x: x.fillna(x.median()))
AttributeError: 'float' object has no attribute 'fillna'
Could someone lead me in the right direction? I don't even need the code; just some tips/hints. I've been reading through the pandas documentation without any progress. Can it be done with just apply? or some kind of groupby method?
You can use fillna directly, without apply:
train.Age = train.Age.fillna(train.Age.median())
train
Out[561]:
ID Age
0 887 19.0
1 888 26.0
2 889 26.0
3 890 32.0
The above works when you simply want to fill the NaN/NA values in a column. To change values based on a condition on the row values instead, you can use loc:
train.loc[train['Age'].isna(),'Age'] = train['Age'].median()
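Since the question also asks about a groupby method: a minimal sketch of a per-group median fill, assuming a grouping column such as 'Pclass' (present in the full Titanic data, but not in the toy frame above):
import pandas as pd
import numpy as np

train = pd.DataFrame({'Pclass': [1, 1, 3, 3],
                      'Age': [19.0, np.nan, 26.0, np.nan]})
# fill each NaN with the median Age of that row's group;
# transform('median') returns a Series aligned with the original index
train['Age'] = train['Age'].fillna(train.groupby('Pclass')['Age'].transform('median'))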

Dynamic top 3 and percentage total using pandas groupby

I have a dataframe like as shown below
id,Name,country,amount,qty
1,ABC,USA,123,4500
1,ABC,USA,156,3210
1,BCE,USA,687,2137
1,DEF,UK,456,1236
1,ABC,nan,216,324
1,DEF,nan,12678,11241
1,nan,nan,637,213
1,BCE,nan,213,543
1,XYZ,KOREA,432,321
1,XYZ,AUS,231,321
sf = pd.read_clipboard(sep=',')
I would like to do the below
a) Get the top 3 based on amount for each id and other selected columns such as Name and country. Meaning, we first get the top 3 grouped by id and Name, and then again the top 3 grouped by id and country.
b) Find out how much each of the top 3 items contributes to the total amount for each unique id.
So, I tried the below
sf_name = sf.groupby(['id','Name'],dropna=False)['amount'].sum().nlargest(3).reset_index().rename(columns={'amount':'Name_amount'})
sf_country = sf.groupby(['id','country'],dropna=False)['amount'].sum().nlargest(3).reset_index().rename(columns={'amount':'country_amount'})
sf_name['total'] = sf.groupby('id')['amount'].sum()
sf_country['total'] = sf.groupby('id')['amount'].sum()
sf_name['name_pct_total'] = (sf_name['Name_amount']/sf_name['total'])*100
sf_country['country_pct_total'] = (sf_country['country_amount']/sf_country['total'])*100
As you can see, I am repeating the same operation for each column.
But in my real dataframe, I have to group by id, find the top 3, and compute the pct_total % for another 8 columns (along with Name and country).
Is there any efficient, elegant and scalable solution that you can share?
I expect my output to be as below.
Update: full error
KeyError Traceback (most recent call last)
C:\Users\Test\AppData\Local\Temp/ipykernel_8720/1850446854.py in <module>
----> 1 df_new.groupby(['unique_key','Resale Customer'],dropna=False)['Revenue Resale EUR'].sum().nlargest(3).reset_index(level=1, name=f'{c}_revenue')
~\Anaconda3\lib\site-packages\pandas\core\series.py in nlargest(self, n, keep)
3834 dtype: int64
3835 """
-> 3836 return algorithms.SelectNSeries(self, n=n, keep=keep).nlargest()
3837
3838 def nsmallest(self, n: int = 5, keep: str = "first") -> Series:
~\Anaconda3\lib\site-packages\pandas\core\algorithms.py in nlargest(self)
1135 @final
1136 def nlargest(self):
-> 1137 return self.compute("nlargest")
1138
1139 @final
~\Anaconda3\lib\site-packages\pandas\core\algorithms.py in compute(self, method)
1181
1182 dropped = self.obj.dropna()
-> 1183 nan_index = self.obj.drop(dropped.index)
1184
1185 if is_extension_array_dtype(dropped.dtype):
~\Anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
309 stacklevel=stacklevel,
310 )
--> 311 return func(*args, **kwargs)
312
313 return wrapper
~\Anaconda3\lib\site-packages\pandas\core\series.py in drop(self, labels, axis, index, columns, level, inplace, errors)
4769 dtype: float64
4770 """
-> 4771 return super().drop(
4772 labels=labels,
4773 axis=axis,
~\Anaconda3\lib\site-packages\pandas\core\generic.py in drop(self, labels, axis, index, columns, level, inplace, errors)
4277 for axis, labels in axes.items():
4278 if labels is not None:
-> 4279 obj = obj._drop_axis(labels, axis, level=level, errors=errors)
4280
4281 if inplace:
~\Anaconda3\lib\site-packages\pandas\core\generic.py in _drop_axis(self, labels, axis, level, errors, consolidate, only_slice)
4321 new_axis = axis.drop(labels, level=level, errors=errors)
4322 else:
-> 4323 new_axis = axis.drop(labels, errors=errors)
4324 indexer = axis.get_indexer(new_axis)
4325
~\Anaconda3\lib\site-packages\pandas\core\indexes\multi.py in drop(self, codes, level, errors)
2234 for level_codes in codes:
2235 try:
-> 2236 loc = self.get_loc(level_codes)
2237 # get_loc returns either an integer, a slice, or a boolean
2238 # mask
~\Anaconda3\lib\site-packages\pandas\core\indexes\multi.py in get_loc(self, key, method)
2880 if keylen == self.nlevels and self.is_unique:
2881 try:
-> 2882 return self._engine.get_loc(key)
2883 except TypeError:
2884 # e.g. test_partial_slicing_with_multiindex partial string slicing
~\Anaconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.BaseMultiIndexCodesEngine.get_loc()
~\Anaconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
~\Anaconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.UInt64HashTable.get_item()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.UInt64HashTable.get_item()
KeyError: 8937472
Simplest is to loop over the column names in a list; for the pct_total, use GroupBy.transform with sum per id and divide the amount column:
dfs = []
cols = ['Name','country']
for c in cols:
    df = (sf.groupby(['id',c],dropna=False)['amount'].sum()
            .nlargest(3)
            .reset_index(level=1, name=f'{c}_amount'))
    df[f'{c}_pct_total'] = (df[f'{c}_amount'].div(df.groupby('id',dropna=False)[f'{c}_amount']
                                                    .transform('sum'))*100)
    dfs.append(df)
df = pd.concat(dfs, axis=1)
print (df)
Name Name_amount Name_pct_total country country_amount \
id
1 DEF 13134 89.365177 NaN 13744
1 BCE 900 6.123699 USA 966
1 XYZ 663 4.511125 UK 456
country_pct_total
id
1 90.623764
1 6.369511
1 3.006726
Testing with the Resale Customer column name:
print (sf)
id Resale Customer country amount qty
0 1 ABC USA 123 4500
1 1 ABC USA 156 3210
2 1 BCE USA 687 2137
3 1 DEF UK 456 1236
4 1 ABC NaN 216 324
5 1 DEF NaN 12678 11241
6 1 NaN NaN 637 213
7 1 BCE NaN 213 543
8 1 XYZ KOREA 432 321
9 1 XYZ AUS 231 321
Test the column names:
print (sf.columns)
Index(['id', 'Resale Customer', 'country', 'amount', 'qty'], dtype='object')
dfs = []
cols = ['Resale Customer','country']
for c in cols:
    df = (sf.groupby(['id',c],dropna=False)['amount'].sum()
            .nlargest(3)
            .reset_index(level=1, name=f'{c}_amount'))
    df[f'{c}_pct_total'] = (df[f'{c}_amount'].div(df.groupby('id',dropna=False)[f'{c}_amount']
                                                    .transform('sum'))*100)
    dfs.append(df)
df = pd.concat(dfs, axis=1)
print (df)
Resale Customer Resale Customer_amount Resale Customer_pct_total country \
id
1 DEF 13134 89.365177 NaN
1 BCE 900 6.123699 USA
1 XYZ 663 4.511125 UK
country_amount country_pct_total
id
1 13744 90.623764
1 966 6.369511
1 456 3.006726
A solution with melt is also possible, but more complicated:
df = sf.melt(id_vars=['id', 'amount'], value_vars=['Name','country'])
df = (df.groupby(['id','variable', 'value'],dropna=False)['amount']
        .sum()
        .sort_values(ascending=False)
        .groupby(level=[0,1],dropna=False)
        .head(3)
        .to_frame()
        .assign(pct_total=lambda x: x['amount'].div(x.groupby(level=[0,1],dropna=False)['amount'].transform('sum')).mul(100),
                g=lambda x: x.groupby(level=[0,1],dropna=False).cumcount())
        .set_index('g', append=True)
        .reset_index('value')
        .unstack(1)
        .sort_index(level=1, axis=1)
        .droplevel(1)
      )
df.columns = df.columns.map(lambda x: f'{x[1]}_{x[0]}')
print (df)
Name_amount Name_pct_total Name_value country_amount country_pct_total \
id
1 13134 89.365177 DEF 13744 90.623764
1 900 6.123699 BCE 966 6.369511
1 663 4.511125 XYZ 456 3.006726
country_value
id
1 NaN
1 USA
1 UK

Why am I getting nan as a string when using np.nan, but a missing value when using pd.NA?

Sorry, I cannot share the data. I tried to make test data, but it does not give the same error or the different missing values described below.
Added more info about pd.NA at the bottom.
I am loading the data with this code:
df = pd.read_csv("C:/data.csv")
When loading the data I get this warning:
C:\Users\User1\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3063: DtypeWarning: Columns (162,247,274,292,304,316,321,335,345,347,357,379,389,390,393,395,400,401,420,424,447,462,465,467,478,481,534,536,538,570,616,632,653,666,675,691,707,754,758,762,766,770,774,778,782,784,785,786,788,789,790,792,793,794,796,797,798,800,801,802,804,805,806,808,809,810,812,813,814,815,816,817,818,819,820,821,822,823,824,825,826,827,828,829,830,831,832,833,834,835,836,837,838,839,840,841,842,843,844,845,846,847,848,849,850,851,852,853,854,855,856,857,858,859,860,861,862,863,864,865,867,868,871,872,875,876,880,1367,1368,1370,1371,1373,1374,1376,1377,1379,1380,1382,1383,1385,1386,1388,1389,1391,1392,1394,1395,1397,1398,1400,1401,1403,1404,1406,1407,1409,1410,1412,1413,1415,1416,1418,1419,1421,1422,1424,1425,2681) have mixed types.Specify dtype option on import or set low_memory=False.
interactivity=interactivity, compiler=compiler, result=result)
As I understood from this question, this warning is not a problem and I can ignore it.
Then I run this code from here:
# from: https://stackoverflow.com/questions/60101845/compare-multiple-pandas-columns-1st-and-2nd-after-3rd-and-4rth-after-etc-wit
# from: https://stackoverflow.com/questions/27474921/compare-two-columns-using-pandas?answertab=oldest#tab-top
# from: https://stackoverflow.com/questions/60099141/negation-in-np-select-condition
import pandas as pd
import numpy as np
col1 = ["var1", "var3", "var5"]
col2 = ["var2", "var4", "var6"]
colR = ["Result1", "Result2", "Result3"]
s1 = df[col1].isnull().to_numpy()
s2 = df[col2].isnull().to_numpy()
conditions = [~s1 & ~s2, s1 & s2, ~s1 & s2, s1 & ~s2]
choices = ["Both values", np.nan, df[col1], df[col2]]
df = pd.concat([df, pd.DataFrame(np.select(conditions, choices), columns=colR, index=df.index)], axis=1)
The newly created columns from the code above contain nan, but the columns loaded from the CSV file contain NaN.
After running df['var1'].value_counts(dropna=False), I get this output:
NaN 3453
0.0 3002
1.0 314
Name: var1, dtype: int64
After running df['Result1'].value_counts(dropna=False), I get this output:
0.0 3655
nan 2665
1.0 407
Both values 42
Name: Result1, dtype: int64
Notice that var1 contains NaN values but Result1 contains nan values.
When I run df['var1'].value_counts(dropna=False).loc[[np.nan]], I get this output:
NaN 3453
Name: weeklyivr_q1, dtype: int64
When I run df['Result1'].value_counts(dropna=False).loc[[np.nan]], I get an error (the variable names in the error are different, but the key point is that there are no missing values):
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-52-0daeac75fdb4> in <module>
27 #combined_IVR["weeklyivr_q1"].value_counts(dropna=False)
28 #combined_IVR["my_weekly_ivr_1"].value_counts(dropna=False).loc[["Both values"]]
---> 29 combined_IVR["my_weekly_ivr_1"].value_counts(dropna=False).loc[[np.nan]]
30 #combined_IVR["weeklyivr_q1"].value_counts(dropna=False).loc[[np.nan]]
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
1764
1765 maybe_callable = com.apply_if_callable(key, self.obj)
-> 1766 return self._getitem_axis(maybe_callable, axis=axis)
1767
1768 def _is_scalar_access(self, key: Tuple):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
1950 raise ValueError("Cannot index with multidimensional key")
1951
-> 1952 return self._getitem_iterable(key, axis=axis)
1953
1954 # nested tuple slicing
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_iterable(self, key, axis)
1591 else:
1592 # A collection of keys
-> 1593 keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
1594 return self.obj._reindex_with_indexers(
1595 {axis: [keyarr, indexer]}, copy=True, allow_dups=True
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
1549
1550 self._validate_read_indexer(
-> 1551 keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
1552 )
1553 return keyarr, indexer
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
1636 if missing == len(indexer):
1637 axis_name = self.obj._get_axis_name(axis)
-> 1638 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
1639
1640 # We (temporarily) allow for some missing keys with .loc, except in
KeyError: "None of [Float64Index([nan], dtype='float64')] are in the [index]"
When I run df['Result1'].value_counts(dropna=False).loc[['nan']], I get:
nan 2665
Name: my_weekly_ivr_1, dtype: int64
So nan in the 'Result1' column is a string.
If I replace choices = ["Both values", np.nan, df[col1], df[col2]] with choices = ["Both values", pd.NA, df[col1], df[col2]] and then run:
df['Result1'].value_counts(dropna=False).loc[[np.nan]]
I get this output:
NaN 2665
Name: Result1, dtype: int64
So in this case np.nan produces a string and pd.NA a missing value.
Question:
Why am I getting nan in the 'Result1' column when using np.nan? What can be the reason, and how do I fix this?
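The coercion can be reproduced in isolation; a minimal sketch with toy data (not the original df), assuming a recent NumPy/pandas:
import numpy as np
import pandas as pd

cond = np.array([True, False])
# np.select promotes the mixed str/float choices to a string dtype,
# so np.nan is coerced to the literal string 'nan':
print (np.select([cond, ~cond], ["Both values", np.nan]))
# ['Both values' 'nan']
# pd.NA is a plain Python object, so the result stays object dtype
# and pandas still treats it as a missing value:
print (pd.Series(np.select([cond, ~cond], ["Both values", pd.NA])).isna())
# 0    False
# 1     True
# dtype: bool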

Impute missing values using apply and lambda functions

I am trying to impute the missing values in the "Item_Weight" variable by taking the average of the variable according to the different "Item_Types", as per the code below. But when I run it, I get the KeyError added below. Is it the pandas version that does not allow this, or is something wrong with the code?
Item_Weight_Average = train.dropna(subset=['Item_Weight']).pivot_table(values='Item_Weight', index='Item_Type')
missing = train['Item_Weight'].isnull()
train.loc[missing,'Item_Weight']= train.loc[missing,'Item_Type'].apply(lambda x: Item_Weight_Average[x])
KeyError Traceback (most recent call last)
C:\Users\m1013523\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2441 try:
-> 2442 return self._engine.get_loc(key)
2443 except KeyError:
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5280)()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5126)()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:20523)()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:20477)()
KeyError: 'Snack Foods'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-25-c9971d0bdaf7> in <module>()
1 Item_Weight_Average = train.dropna(subset=['Item_Weight']).pivot_table(values='Item_Weight',index='Item_Type')
2 missing = train['Item_Weight'].isnull()
----> 3 train.loc[missing,'Item_Weight'] = train.loc[missing,'Item_Type'].apply(lambda x: Item_Weight_Average[x])
C:\Users\m1013523\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
2353 else:
2354 values = self.asobject
-> 2355 mapped = lib.map_infer(values, f, convert=convert_dtype)
2356
2357 if len(mapped) and isinstance(mapped[0], Series):
pandas\_libs\src\inference.pyx in pandas._libs.lib.map_infer (pandas\_libs\lib.c:66645)()
<ipython-input-25-c9971d0bdaf7> in <lambda>(x)
1 Item_Weight_Average = train.dropna(subset=['Item_Weight']).pivot_table(values='Item_Weight',index='Item_Type')
2 missing = train['Item_Weight'].isnull()
----> 3 train.loc[missing,'Item_Weight'] = train.loc[missing,'Item_Type'].apply(lambda x: Item_Weight_Average[x])
C:\Users\m1013523\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
1962 return self._getitem_multilevel(key)
1963 else:
-> 1964 return self._getitem_column(key)
1965
1966 def _getitem_column(self, key):
C:\Users\m1013523\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\frame.py in _getitem_column(self, key)
1969 # get column
1970 if self.columns.is_unique:
-> 1971 return self._get_item_cache(key)
1972
1973 # duplicate columns & possible reduce dimensionality
C:\Users\m1013523\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\generic.py in _get_item_cache(self, item)
1643 res = cache.get(item)
1644 if res is None:
-> 1645 values = self._data.get(item)
1646 res = self._box_item_values(item, values)
1647 cache[item] = res
C:\Users\m1013523\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py in get(self, item, fastpath)
3588
3589 if not isnull(item):
-> 3590 loc = self.items.get_loc(item)
3591 else:
3592 indexer = np.arange(len(self.items))[isnull(self.items)]
C:\Users\m1013523\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2442 return self._engine.get_loc(key)
2443 except KeyError:
-> 2444 return self._engine.get_loc(self._maybe_cast_indexer(key))
2445
2446 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5280)()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5126)()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:20523)()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:20477)()
KeyError: 'Snack Foods'
Any ideas or workarounds for this one?
If I understand what you're trying to do, then there's an easier way to solve your problem. Instead of making a new series of averages, you can calculate the average item_weight by item_type using groupby, transform, and np.mean(), and fill in the missing spots in item_weight using fillna().
# Setting up some toy data
import pandas as pd
import numpy as np
df = pd.DataFrame({'item_type': [1,1,1,2,2,2],
                   'item_weight': [2,4,np.nan,10,np.nan,np.nan]})
# The solution
df.item_weight.fillna(df.groupby('item_type').item_weight.transform(np.mean), inplace=True)
The result:
item_type item_weight
0 1 2.0
1 1 4.0
2 1 3.0
3 2 10.0
4 2 10.0
5 2 10.0
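For what it's worth, the original lookup failed because pivot_table returns a DataFrame, so Item_Weight_Average[x] looks for a column named x (hence KeyError: 'Snack Foods'). A minimal sketch of a fix that keeps the pivot-table approach, assuming the question's variable names:
# select the 'Item_Weight' column to get a Series indexed by Item_Type,
# then map each missing row's Item_Type to its average weight
Item_Weight_Average = (train.dropna(subset=['Item_Weight'])
                            .pivot_table(values='Item_Weight', index='Item_Type')['Item_Weight'])
missing = train['Item_Weight'].isnull()
train.loc[missing, 'Item_Weight'] = train.loc[missing, 'Item_Type'].map(Item_Weight_Average)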

Odd behaviour indexing Pandas dataframe on date

I've just been working through the Pandas tutorial and am a little confused by the following behaviour.
In [28]: d
Out[28]:
Status CustomerCount
StatusDate
2009-01-05 9 2519
2009-01-12 10 3351
2009-01-19 10 2188
2009-01-26 10 2301
2009-02-02 7 2204
2009-02-09 9 1538
2009-02-16 9 1983
2009-02-23 9 1960
2009-03-02 11 2887
2009-03-09 9 2927
getting the records for a particular month via a string works nicely:
In [31]: d['2009-02']
Out[31]:
Status CustomerCount
StatusDate
2009-02-02 7 2204
2009-02-09 9 1538
2009-02-16 9 1983
2009-02-23 9 1960
slicing a date range also works nicely:
In [33]: d['2009-02-09':'2009-02-10']
Out[33]:
Status CustomerCount
StatusDate
2009-02-09 9 1538
getting records for a particular day with the same method does not:
In [32]: d['2009-02-09']
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-32-b78c7ec0d497> in <module>()
----> 1 d['2009-02-09']
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in __getitem__(self, key)
1676 return self._getitem_multilevel(key)
1677 else:
-> 1678 return self._getitem_column(key)
1679
1680 def _getitem_column(self, key):
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in _getitem_column(self, key)
1683 # get column
1684 if self.columns.is_unique:
-> 1685 return self._get_item_cache(key)
1686
1687 # duplicate columns & possible reduce dimensionaility
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/core/generic.pyc in _get_item_cache(self, item)
1050 res = cache.get(item)
1051 if res is None:
-> 1052 values = self._data.get(item)
1053 res = self._box_item_values(item, values)
1054 cache[item] = res
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/core/internals.pyc in get(self, item, fastpath)
2563
2564 if not isnull(item):
-> 2565 loc = self.items.get_loc(item)
2566 else:
2567 indexer = np.arange(len(self.items))[isnull(self.items)]
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/core/index.pyc in get_loc(self, key)
1179 loc : int if unique index, possibly slice or mask if not
1180 """
-> 1181 return self._engine.get_loc(_values_from_object(key))
1182
1183 def get_value(self, series, key):
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/index.so in pandas.index.IndexEngine.get_loc (pandas/index.c:3572)()
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/index.so in pandas.index.IndexEngine.get_loc (pandas/index.c:3452)()
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/hashtable.so in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:11343)()
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/hashtable.so in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:11296)()
KeyError: '2009-02-09'
neither does the following:
In [36]: d[d.first_valid_index()]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-36-071dd1d3c77c> in <module>()
----> 1 d[d.first_valid_index()]
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in __getitem__(self, key)
1676 return self._getitem_multilevel(key)
1677 else:
-> 1678 return self._getitem_column(key)
1679
1680 def _getitem_column(self, key):
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in _getitem_column(self, key)
1683 # get column
1684 if self.columns.is_unique:
-> 1685 return self._get_item_cache(key)
1686
1687 # duplicate columns & possible reduce dimensionaility
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/core/generic.pyc in _get_item_cache(self, item)
1050 res = cache.get(item)
1051 if res is None:
-> 1052 values = self._data.get(item)
1053 res = self._box_item_values(item, values)
1054 cache[item] = res
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/core/internals.pyc in get(self, item, fastpath)
2563
2564 if not isnull(item):
-> 2565 loc = self.items.get_loc(item)
2566 else:
2567 indexer = np.arange(len(self.items))[isnull(self.items)]
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/core/index.pyc in get_loc(self, key)
1179 loc : int if unique index, possibly slice or mask if not
1180 """
-> 1181 return self._engine.get_loc(_values_from_object(key))
1182
1183 def get_value(self, series, key):
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/index.so in pandas.index.IndexEngine.get_loc (pandas/index.c:3572)()
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/index.so in pandas.index.IndexEngine.get_loc (pandas/index.c:3452)()
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/hashtable.so in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:11343)()
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/hashtable.so in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:11296)()
KeyError: Timestamp('2009-01-05 00:00:00')
but this does:
In [37]: d.loc[d.first_valid_index()]
Out[37]:
Status 9
CustomerCount 2519
Name: 2009-01-05 00:00:00, dtype: int64
Is this behaviour buggy or have I misunderstood something?
d is a DataFrame, so the primary indexer in df[key] selects columns (see the indexing basics in the docs).
However, an exception is made when key is a slice: for convenience, slicing a DataFrame slices the rows.
In your example, d['2009-02-09':'2009-02-10'] is a slice, so it correctly slices the rows. In d['2009-02-09'] you give a single key, so pandas looks at the columns, and you get a KeyError because '2009-02-09' is not a column name.
d['2009-02'] is a special case, which can be a bit confusing in the beginning. It is a single string, but actually represents a slice (this feature is called partial string indexing, see the docs here).
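A minimal sketch of the distinction, using toy data rather than the tutorial's frame:
import pandas as pd

idx = pd.date_range('2009-02-02', periods=4, freq='7D')
d = pd.DataFrame({'Status': [7, 9, 9, 9]}, index=idx)

print (d['2009-02-09':'2009-02-10'])   # a slice, so rows are selected
print (d.loc['2009-02-09'])            # .loc always selects rows, unambiguously
# d['2009-02-09']                      # single key -> column lookup -> KeyError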
