Odd behaviour indexing Pandas dataframe on date - python

I've just been working through the Pandas tutorial and am a little confused by the following behaviour.
In [28]: d
Out[28]:
Status CustomerCount
StatusDate
2009-01-05 9 2519
2009-01-12 10 3351
2009-01-19 10 2188
2009-01-26 10 2301
2009-02-02 7 2204
2009-02-09 9 1538
2009-02-16 9 1983
2009-02-23 9 1960
2009-03-02 11 2887
2009-03-09 9 2927
Getting the records for a particular month via a string works nicely:
In [31]: d['2009-02']
Out[31]:
Status CustomerCount
StatusDate
2009-02-02 7 2204
2009-02-09 9 1538
2009-02-16 9 1983
2009-02-23 9 1960
Slicing a date range also works nicely:
In [33]: d['2009-02-09':'2009-02-10']
Out[33]:
Status CustomerCount
StatusDate
2009-02-09 9 1538
Getting records for a particular day with the same method does not:
In [32]: d['2009-02-09']
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-32-b78c7ec0d497> in <module>()
----> 1 d['2009-02-09']
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in __getitem__(self, key)
1676 return self._getitem_multilevel(key)
1677 else:
-> 1678 return self._getitem_column(key)
1679
1680 def _getitem_column(self, key):
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in _getitem_column(self, key)
1683 # get column
1684 if self.columns.is_unique:
-> 1685 return self._get_item_cache(key)
1686
1687 # duplicate columns & possible reduce dimensionaility
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/core/generic.pyc in _get_item_cache(self, item)
1050 res = cache.get(item)
1051 if res is None:
-> 1052 values = self._data.get(item)
1053 res = self._box_item_values(item, values)
1054 cache[item] = res
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/core/internals.pyc in get(self, item, fastpath)
2563
2564 if not isnull(item):
-> 2565 loc = self.items.get_loc(item)
2566 else:
2567 indexer = np.arange(len(self.items))[isnull(self.items)]
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/core/index.pyc in get_loc(self, key)
1179 loc : int if unique index, possibly slice or mask if not
1180 """
-> 1181 return self._engine.get_loc(_values_from_object(key))
1182
1183 def get_value(self, series, key):
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/index.so in pandas.index.IndexEngine.get_loc (pandas/index.c:3572)()
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/index.so in pandas.index.IndexEngine.get_loc (pandas/index.c:3452)()
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/hashtable.so in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:11343)()
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/hashtable.so in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:11296)()
KeyError: '2009-02-09'
Neither does the following:
In [36]: d[d.first_valid_index()]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-36-071dd1d3c77c> in <module>()
----> 1 d[d.first_valid_index()]
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in __getitem__(self, key)
1676 return self._getitem_multilevel(key)
1677 else:
-> 1678 return self._getitem_column(key)
1679
1680 def _getitem_column(self, key):
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in _getitem_column(self, key)
1683 # get column
1684 if self.columns.is_unique:
-> 1685 return self._get_item_cache(key)
1686
1687 # duplicate columns & possible reduce dimensionaility
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/core/generic.pyc in _get_item_cache(self, item)
1050 res = cache.get(item)
1051 if res is None:
-> 1052 values = self._data.get(item)
1053 res = self._box_item_values(item, values)
1054 cache[item] = res
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/core/internals.pyc in get(self, item, fastpath)
2563
2564 if not isnull(item):
-> 2565 loc = self.items.get_loc(item)
2566 else:
2567 indexer = np.arange(len(self.items))[isnull(self.items)]
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/core/index.pyc in get_loc(self, key)
1179 loc : int if unique index, possibly slice or mask if not
1180 """
-> 1181 return self._engine.get_loc(_values_from_object(key))
1182
1183 def get_value(self, series, key):
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/index.so in pandas.index.IndexEngine.get_loc (pandas/index.c:3572)()
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/index.so in pandas.index.IndexEngine.get_loc (pandas/index.c:3452)()
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/hashtable.so in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:11343)()
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/hashtable.so in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:11296)()
KeyError: Timestamp('2009-01-05 00:00:00')
But this does:
In [37]: d.loc[d.first_valid_index()]
Out[37]:
Status 9
CustomerCount 2519
Name: 2009-01-05 00:00:00, dtype: int64
Is this behaviour buggy or have I misunderstood something?

d is a DataFrame, so the principal indexer when using df[key] is indexing the columns (see the indexing basics in the docs).
However, an exception is made when the key is a slice: for convenience, slicing a DataFrame slices the rows.
In your example, d['2009-02-09':'2009-02-10'] is a slice, so it correctly slices the rows. In d['2009-02-09'] you pass a single key, so pandas looks at the columns, and you get a KeyError because '2009-02-09' is not a column name.
d['2009-02'] is a special case, which can be a bit confusing at first. It is a single string, but it actually represents a slice (this feature is called partial string indexing; see the docs here).
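For a single date, the unambiguous form is .loc. A minimal sketch, using a small frame reconstructed from the example above:
import pandas as pd

d = pd.DataFrame(
    {'Status': [9, 7, 9], 'CustomerCount': [2519, 2204, 1538]},
    index=pd.to_datetime(['2009-01-05', '2009-02-02', '2009-02-09']),
)
d.index.name = 'StatusDate'

d.loc['2009-02-09']               # single day: the row, as a Series
d.loc['2009-02']                  # partial string: all rows in February 2009
d.loc['2009-02-09':'2009-02-10']  # slice: rows in the date range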

Related

How to replace nan in a column with the median of the column

Using pandas, I've been working on Kaggle's Titanic problem, and have tried different variants of groupby/apply to fill in the NaN entries of the training data's train['Age'] column.
import pandas as pd
import numpy as np
train = pd.DataFrame({'ID': [887, 888, 889, 890], 'Age': [19.0, np.nan, 26.0, 32.0]})
ID Age
0 887 19.0
1 888 NaN
2 889 26.0
3 890 32.0
How would I go through the elements and change these NaN entries to something like the median age?
I've tried variations of
train.Age = train.Age.apply(lambda x: x.fillna(x.median()))
Which results in
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Input In [249], in <cell line: 1>()
----> 1 train.Age = train.Age.apply(lambda x: x.fillna(x.median()))
File ~\anaconda3\envs\py10\lib\site-packages\pandas\core\series.py:4433, in Series.apply(self, func, convert_dtype, args, **kwargs)
4323 def apply(
4324 self,
4325 func: AggFuncType,
(...)
4328 **kwargs,
4329 ) -> DataFrame | Series:
4330 """
4331 Invoke function on values of Series.
4332
(...)
4431 dtype: float64
4432 """
-> 4433 return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
File ~\anaconda3\envs\py10\lib\site-packages\pandas\core\apply.py:1088, in SeriesApply.apply(self)
1084 if isinstance(self.f, str):
1085 # if we are a string, try to dispatch
1086 return self.apply_str()
-> 1088 return self.apply_standard()
File ~\anaconda3\envs\py10\lib\site-packages\pandas\core\apply.py:1143, in SeriesApply.apply_standard(self)
1137 values = obj.astype(object)._values
1138 # error: Argument 2 to "map_infer" has incompatible type
1139 # "Union[Callable[..., Any], str, List[Union[Callable[..., Any], str]],
1140 # Dict[Hashable, Union[Union[Callable[..., Any], str],
1141 # List[Union[Callable[..., Any], str]]]]]"; expected
1142 # "Callable[[Any], Any]"
-> 1143 mapped = lib.map_infer(
1144 values,
1145 f, # type: ignore[arg-type]
1146 convert=self.convert_dtype,
1147 )
1149 if len(mapped) and isinstance(mapped[0], ABCSeries):
1150 # GH#43986 Need to do list(mapped) in order to get treated as nested
1151 # See also GH#25959 regarding EA support
1152 return obj._constructor_expanddim(list(mapped), index=obj.index)
File ~\anaconda3\envs\py10\lib\site-packages\pandas\_libs\lib.pyx:2870, in pandas._libs.lib.map_infer()
Input In [249], in <lambda>(x)
----> 1 train.Age = train.Age.apply(lambda x: x.fillna(x.median()))
AttributeError: 'float' object has no attribute 'fillna'
Could someone point me in the right direction? I don't even need the code; just some tips/hints. I've been reading through the pandas documentation without any progress. Can it be done with just apply, or with some kind of groupby method?
You can use fillna directly, without apply:
train.Age = train.Age.fillna(train.Age.median())
train
Out[561]:
ID Age
0 887 19.0
1 888 26.0
2 889 26.0
3 890 32.0
The above code only fills NaN/NA values in a specific column. To change values based on a condition on a row's value in a column, you can use loc:
train.loc[train['Age'].isna(),'Age'] = train['Age'].median()
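Since the question also asks about a groupby approach: the median can be computed per group and used to fill only that group's missing values. A sketch, assuming the full Titanic frame has a grouping column such as 'Pclass' (not present in the toy frame above):
# 'Pclass' is an assumed grouping column from the full Titanic data
train['Age'] = train['Age'].fillna(
    train.groupby('Pclass')['Age'].transform('median')
)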

I got a KeyError while accessing a column by its position in a DataFrame

df1.head()
# this works and I get output
# but I get the following error when I run the next cell
subset = df1[[1]]
# error:
KeyError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_14940/431015837.py in <module>
----> 1 subset = df1[[1]]
~\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
3462 if is_iterator(key):
3463 key = list(key)
-> 3464 indexer = self.loc._get_listlike_indexer(key, axis=1)[1]
3465
3466 # take() does not accept boolean indexers
~\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\indexing.py in _get_listlike_indexer(self, key, axis)
1312 keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
1313
-> 1314 self._validate_read_indexer(keyarr, indexer, axis)
1315
1316 if needs_i8_conversion(ax.dtype) or isinstance(
~\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\indexing.py in _validate_read_indexer(self, key, indexer, axis)
1372 if use_interval_msg:
1373 key = list(key)
-> 1374 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
1375
1376 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
KeyError: "None of [Int64Index([1], dtype='int64')] are in the [columns]"
You can use the iloc method for position-based selection.
For example:
import pandas as pd
df = pd.DataFrame({'a': [1, 2]})
subset = df.iloc[:, [0]]
Output will be:
a
0 1
1 2
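Applied to the frame from the question (assuming the intent was the column at position 1, i.e. the second column):
subset = df1.iloc[:, [1]]   # DataFrame containing only the column at position 1
# or, if a Series is fine:
col = df1.iloc[:, 1]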

Filtering a dataframe with datetime index using .loc (rows & columns) [duplicate]

This question already has answers here:
How to slice a Pandas Dataframe based on datetime index
(3 answers)
Closed 3 years ago.
I am trying to slice both rows and columns of a dataframe using the .loc method, but I am having trouble slicing the rows of the df (it has a datetime index).
The dataframe I am working with has 537 rows and 10 columns. The first date is 2018-01-01, but I want to slice it so that it only shows dates in 2019.
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 536 entries, 2018-01-01 00:00:00 to 2019-06-20 00:00:00
Data columns (total 10 columns):
link_clicks 536 non-null int64
customer_count 536 non-null int64
transaction_count 536 non-null int64
customers_per_click 536 non-null float64
transactions_per_click 536 non-null float64
14_day_ma 523 non-null float64
14_day_std 523 non-null float64
Upper14 523 non-null float64
Lower14 523 non-null float64
lower_flag 536 non-null bool
dtypes: bool(1), float64(6), int64(3)
memory usage: 42.4+ KB
df.loc['2019-01-01':'2019-06-01', ['customers_per_click', '14_day_ma', 'Upper14', 'Lower14']]
The expected result is a filtered dataframe within that date range. However, when I execute that line of code it gives me the following error (clearly it is an issue with the index, but I am just not sure what the proper syntax is and am having trouble finding a solution online):
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_slice_bound(self, label, side, kind)
4804 try:
-> 4805 return self._searchsorted_monotonic(label, side)
4806 except ValueError:
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in _searchsorted_monotonic(self, label, side)
4764
-> 4765 raise ValueError('index must be monotonic increasing or decreasing')
4766
ValueError: index must be monotonic increasing or decreasing
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-599-5bdb485482ff> in <module>
----> 1 merge2.loc['2019-11-01':'2019-02-01', ['customers_per_click', '14_day_ma', 'Upper14', 'Lower14']].plot(figsize=(15,5))
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
1492 except (KeyError, IndexError, AttributeError):
1493 pass
-> 1494 return self._getitem_tuple(key)
1495 else:
1496 # we by definition only have the 0th axis
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
886 continue
887
--> 888 retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
889
890 return retval
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
1865 if isinstance(key, slice):
1866 self._validate_key(key, axis)
-> 1867 return self._get_slice_axis(key, axis=axis)
1868 elif com.is_bool_indexer(key):
1869 return self._getbool_axis(key, axis=axis)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexing.py in _get_slice_axis(self, slice_obj, axis)
1531 labels = obj._get_axis(axis)
1532 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop,
-> 1533 slice_obj.step, kind=self.name)
1534
1535 if isinstance(indexer, slice):
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in slice_indexer(self, start, end, step, kind)
4671 """
4672 start_slice, end_slice = self.slice_locs(start, end, step=step,
-> 4673 kind=kind)
4674
4675 # return a slice
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in slice_locs(self, start, end, step, kind)
4870 start_slice = None
4871 if start is not None:
-> 4872 start_slice = self.get_slice_bound(start, 'left', kind)
4873 if start_slice is None:
4874 start_slice = 0
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_slice_bound(self, label, side, kind)
4806 except ValueError:
4807 # raise the original KeyError
-> 4808 raise err
4809
4810 if isinstance(slc, np.ndarray):
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_slice_bound(self, label, side, kind)
4800 # we need to look up the label
4801 try:
-> 4802 slc = self._get_loc_only_exact_matches(label)
4803 except KeyError as err:
4804 try:
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in _get_loc_only_exact_matches(self, key)
4770 get_slice_bound.
4771 """
-> 4772 return self.get_loc(key)
4773
4774 def get_slice_bound(self, label, side, kind):
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2657 return self._engine.get_loc(key)
2658 except KeyError:
-> 2659 return self._engine.get_loc(self._maybe_cast_indexer(key))
2660 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
2661 if indexer.ndim > 1 or indexer.size > 1:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: '2019-11-01'
If your index is of type "datetime", try:
from datetime import datetime
df.loc[(df.index>=datetime(2019,1,1)) & (df.index<= datetime(2019,6,1)), ['customers_per_click', '14_day_ma', 'Upper14', 'Lower14']]
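Note that df.info() reports a plain Index rather than a DatetimeIndex, and the first traceback complains that the index is not monotonic, so the original slice syntax can be made to work by converting and sorting the index first. A sketch, assuming the index labels are date-like:
import pandas as pd

df.index = pd.to_datetime(df.index)  # turn the labels into a real DatetimeIndex
df = df.sort_index()                 # partial-string slicing needs a sorted index

df.loc['2019-01-01':'2019-06-01',
       ['customers_per_click', '14_day_ma', 'Upper14', 'Lower14']]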
Without all the details, I propose the following code:
import pandas as pd
import numpy as np

index = pd.date_range('1/1/2018', periods=1100)
ts = pd.Series(np.random.normal(0.5, 2, 1100), index)
grouped = ts.groupby(lambda x: x.year)
grouped.size()
2018 365
2019 365
2020 366
2021 4
dtype: int64
You can select a year (a group) using:
grouped.get_group(2019)
len(grouped.get_group(2019))
365
Do you need something more specific?
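For completeness, with a sorted DatetimeIndex the year can also be selected directly via partial string indexing, which avoids the groupby. A sketch using the synthetic series above:
ts.loc['2019']               # all entries from 2019
ts.loc['2019-01':'2019-06']  # or a month-range slice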

Impute missing values using apply and lambda functions

I am trying to impute the missing values in the "Item_Weight" variable by taking the average of the variable for each "Item_Type", as per the code below. But when I run it, I get the KeyError shown below. Is it the pandas version that does not allow this, or is something wrong with the code?
Item_Weight_Average = train.dropna(subset=['Item_Weight']).pivot_table(values='Item_Weight', index='Item_Type')
missing = train['Item_Weight'].isnull()
train.loc[missing,'Item_Weight']= train.loc[missing,'Item_Type'].apply(lambda x: Item_Weight_Average[x])
KeyError Traceback (most recent call last)
C:\Users\m1013523\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2441 try:
-> 2442 return self._engine.get_loc(key)
2443 except KeyError:
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5280)()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5126)()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:20523)()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:20477)()
KeyError: 'Snack Foods'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-25-c9971d0bdaf7> in <module>()
1 Item_Weight_Average = train.dropna(subset=['Item_Weight']).pivot_table(values='Item_Weight',index='Item_Type')
2 missing = train['Item_Weight'].isnull()
----> 3 train.loc[missing,'Item_Weight'] = train.loc[missing,'Item_Type'].apply(lambda x: Item_Weight_Average[x])
C:\Users\m1013523\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
2353 else:
2354 values = self.asobject
-> 2355 mapped = lib.map_infer(values, f, convert=convert_dtype)
2356
2357 if len(mapped) and isinstance(mapped[0], Series):
pandas\_libs\src\inference.pyx in pandas._libs.lib.map_infer (pandas\_libs\lib.c:66645)()
<ipython-input-25-c9971d0bdaf7> in <lambda>(x)
1 Item_Weight_Average = train.dropna(subset=['Item_Weight']).pivot_table(values='Item_Weight',index='Item_Type')
2 missing = train['Item_Weight'].isnull()
----> 3 train.loc[missing,'Item_Weight'] = train.loc[missing,'Item_Type'].apply(lambda x: Item_Weight_Average[x])
C:\Users\m1013523\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
1962 return self._getitem_multilevel(key)
1963 else:
-> 1964 return self._getitem_column(key)
1965
1966 def _getitem_column(self, key):
C:\Users\m1013523\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\frame.py in _getitem_column(self, key)
1969 # get column
1970 if self.columns.is_unique:
-> 1971 return self._get_item_cache(key)
1972
1973 # duplicate columns & possible reduce dimensionality
C:\Users\m1013523\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\generic.py in _get_item_cache(self, item)
1643 res = cache.get(item)
1644 if res is None:
-> 1645 values = self._data.get(item)
1646 res = self._box_item_values(item, values)
1647 cache[item] = res
C:\Users\m1013523\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py in get(self, item, fastpath)
3588
3589 if not isnull(item):
-> 3590 loc = self.items.get_loc(item)
3591 else:
3592 indexer = np.arange(len(self.items))[isnull(self.items)]
C:\Users\m1013523\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2442 return self._engine.get_loc(key)
2443 except KeyError:
-> 2444 return self._engine.get_loc(self._maybe_cast_indexer(key))
2445
2446 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5280)()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5126)()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:20523)()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:20477)()
KeyError: 'Snack Foods'
Any ideas or workarounds for this one?
If I understand what you're trying to do, then there's an easier way to solve your problem. Instead of making a new series of averages, you can calculate the average item_weight by item_type using groupby, transform, and np.mean(), and fill in the missing spots in item_weight using fillna().
# Setting up some toy data
import pandas as pd
import numpy as np
df = pd.DataFrame({'item_type': [1, 1, 1, 2, 2, 2],
                   'item_weight': [2, 4, np.nan, 10, np.nan, np.nan]})
# The solution
df.item_weight.fillna(df.groupby('item_type').item_weight.transform(np.mean), inplace=True)
The result:
item_type item_weight
0 1 2.0
1 1 4.0
2 1 3.0
3 2 10.0
4 2 10.0
5 2 10.0
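As for why the original code raised the KeyError: pivot_table(values='Item_Weight', index='Item_Type') returns a DataFrame, so Item_Weight_Average[x] is a column lookup, and 'Snack Foods' is a row label rather than a column. A minimal sketch of a fix that keeps the pivot-table approach (untested against the actual data):
Item_Weight_Average = train.dropna(subset=['Item_Weight']).pivot_table(
    values='Item_Weight', index='Item_Type')

missing = train['Item_Weight'].isnull()
# look up the row label and the 'Item_Weight' column explicitly
train.loc[missing, 'Item_Weight'] = train.loc[missing, 'Item_Type'].apply(
    lambda x: Item_Weight_Average.loc[x, 'Item_Weight'])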

python - pandas : how to select by date

Why can I do a selection by month in this case, but not a selection by date?
dates = pd.date_range( start = "01/01/1931" , end = "01/02/1941" )
new_df_4 = new_df_3.reindex(dates)
new_df_4["1931-10"][![enter image description here][1]][1]
But this doesn't work :
new_df_4["1931-10-02"]
KeyError Traceback (most recent call last)
in ()
----> 1 new_df_4["1931-10-02"]
/Users/romain/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in __getitem__(self, key)
1990 return self._getitem_multilevel(key)
1991 else:
-> 1992 return self._getitem_column(key)
1993
1994 def _getitem_column(self, key):
/Users/romain/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in _getitem_column(self, key)
2002 result = self._constructor(self._data.get(key))
2003 if result.columns.is_unique:
-> 2004 result = result[key]
2005
2006 return result
/Users/romain/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in __getitem__(self, key)
1990 return self._getitem_multilevel(key)
1991 else:
-> 1992 return self._getitem_column(key)
1993
1994 def _getitem_column(self, key):
/Users/romain/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in _getitem_column(self, key)
1997 # get column
1998 if self.columns.is_unique:
-> 1999 return self._get_item_cache(key)
2000
2001 # duplicate columns & possible reduce dimensionality
/Users/romain/anaconda/lib/python2.7/site-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
1343 res = cache.get(item)
1344 if res is None:
-> 1345 values = self._data.get(item)
1346 res = self._box_item_values(item, values)
1347 cache[item] = res
/Users/romain/anaconda/lib/python2.7/site-packages/pandas/core/internals.pyc in get(self, item, fastpath)
3223
3224 if not isnull(item):
-> 3225 loc = self.items.get_loc(item)
3226 else:
3227 indexer = np.arange(len(self.items))[isnull(self.items)]
/Users/romain/anaconda/lib/python2.7/site-packages/pandas/indexes/base.pyc in get_loc(self, key, method, tolerance)
1876 return self._engine.get_loc(key)
1877 except KeyError:
-> 1878 return self._engine.get_loc(self._maybe_cast_indexer(key))
1879
1880 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4027)()
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3891)()
pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12408)()
pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12359)()
KeyError: '1931-10-02'
To select by month, use partial string indexing:
print (new_df_4["1931-10"])
This won't work if the string has the same resolution as the index (from the same docs):
Warning
However, if the string is treated as an exact match, the selection in DataFrame's [] will be column-wise and not row-wise, see Indexing Basics. For example, dft_minute['2011-12-31 23:59'] will raise a KeyError as '2011-12-31 23:59' has the same resolution as the index and there is no column with such a name. To always have unambiguous selection, whether the row is treated as a slice or a single selection, use .loc.
In [95]: dft_minute.loc['2011-12-31 23:59']
Out[95]:
a 1
b 4
Name: 2011-12-31 23:59:00, dtype: int64
You can use .loc if you need to select by date:
new_df_4.loc["1931-10-02"]
Sample:
import pandas as pd
import numpy as np

np.random.seed(10)
dates = pd.date_range( start = "01/01/1931" , end = "01/02/1941" )
new_df_4 = pd.DataFrame({'a':np.random.randint(10, size=len(dates))}, index=dates)
print (new_df_4.head())
a
1931-01-01 9
1931-01-02 4
1931-01-03 0
1931-01-04 1
1931-01-05 9
print (new_df_4["1931-10"])
a
1931-10-01 9
1931-10-02 6
1931-10-03 9
1931-10-04 7
1931-10-05 8
1931-10-06 0
1931-10-07 9
1931-10-08 6
1931-10-09 0
1931-10-10 1
1931-10-11 0
...
print (new_df_4.loc["1931-10-02"])
a 6
Name: 1931-10-02 00:00:00, dtype: int32
