Impute missing values using apply and lambda functions

I am trying to impute the missing values in the "Item_Weight" variable with the average weight for each "Item_Type", using the code below. When I run it, I get the KeyError shown below. Does my pandas version not allow this, or is something wrong with the code?
Item_Weight_Average = train.dropna(subset=['Item_Weight']).pivot_table(values='Item_Weight',index='Item_Type')
missing = train['Item_Weight'].isnull()
train.loc[missing,'Item_Weight']= train.loc[missing,'Item_Type'].apply(lambda x: Item_Weight_Average[x])
KeyError Traceback (most recent call last)
C:\Users\m1013523\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2441 try:
-> 2442 return self._engine.get_loc(key)
2443 except KeyError:
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5280)()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5126)()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:20523)()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:20477)()
KeyError: 'Snack Foods'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-25-c9971d0bdaf7> in <module>()
1 Item_Weight_Average = train.dropna(subset=['Item_Weight']).pivot_table(values='Item_Weight',index='Item_Type')
2 missing = train['Item_Weight'].isnull()
----> 3 train.loc[missing,'Item_Weight'] = train.loc[missing,'Item_Type'].apply(lambda x: Item_Weight_Average[x])
C:\Users\m1013523\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
2353 else:
2354 values = self.asobject
-> 2355 mapped = lib.map_infer(values, f, convert=convert_dtype)
2356
2357 if len(mapped) and isinstance(mapped[0], Series):
pandas\_libs\src\inference.pyx in pandas._libs.lib.map_infer (pandas\_libs\lib.c:66645)()
<ipython-input-25-c9971d0bdaf7> in <lambda>(x)
1 Item_Weight_Average = train.dropna(subset=['Item_Weight']).pivot_table(values='Item_Weight',index='Item_Type')
2 missing = train['Item_Weight'].isnull()
----> 3 train.loc[missing,'Item_Weight'] = train.loc[missing,'Item_Type'].apply(lambda x: Item_Weight_Average[x])
C:\Users\m1013523\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
1962 return self._getitem_multilevel(key)
1963 else:
-> 1964 return self._getitem_column(key)
1965
1966 def _getitem_column(self, key):
C:\Users\m1013523\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\frame.py in _getitem_column(self, key)
1969 # get column
1970 if self.columns.is_unique:
-> 1971 return self._get_item_cache(key)
1972
1973 # duplicate columns & possible reduce dimensionality
C:\Users\m1013523\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\generic.py in _get_item_cache(self, item)
1643 res = cache.get(item)
1644 if res is None:
-> 1645 values = self._data.get(item)
1646 res = self._box_item_values(item, values)
1647 cache[item] = res
C:\Users\m1013523\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py in get(self, item, fastpath)
3588
3589 if not isnull(item):
-> 3590 loc = self.items.get_loc(item)
3591 else:
3592 indexer = np.arange(len(self.items))[isnull(self.items)]
C:\Users\m1013523\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2442 return self._engine.get_loc(key)
2443 except KeyError:
-> 2444 return self._engine.get_loc(self._maybe_cast_indexer(key))
2445
2446 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5280)()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5126)()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:20523)()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:20477)()
KeyError: 'Snack Foods'
Any ideas or workarounds for this one?

If I understand what you're trying to do, there's an easier way to solve your problem. Instead of building a separate table of averages, you can calculate the average item_weight per item_type using groupby and transform('mean'), and fill in the missing spots in item_weight using fillna().
# Setting up some toy data
import pandas as pd
import numpy as np
df = pd.DataFrame({'item_type': [1,1,1,2,2,2],
                   'item_weight': [2,4,np.nan,10,np.nan,np.nan]})
# The solution: transform returns each group's mean aligned to the original rows;
# assigning the result back avoids relying on inplace=True on a selected column
df['item_weight'] = df['item_weight'].fillna(df.groupby('item_type')['item_weight'].transform('mean'))
The result:
   item_type  item_weight
0          1          2.0
1          1          4.0
2          1          3.0
3          2         10.0
4          2         10.0
5          2         10.0
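
For reference, the KeyError in the question most likely arises because pivot_table returns a DataFrame, so Item_Weight_Average[x] looks for a column named 'Snack Foods' (the traceback passes through _getitem_column in frame.py). Below is a minimal sketch of the original approach with that lookup corrected. The toy train frame and its values are made up for illustration, not the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's `train` DataFrame
train = pd.DataFrame({
    'Item_Type': ['Snack Foods', 'Snack Foods', 'Dairy', 'Dairy', 'Snack Foods'],
    'Item_Weight': [10.0, 12.0, 5.0, np.nan, np.nan],
})

# pivot_table returns a DataFrame indexed by Item_Type with a single
# 'Item_Weight' column holding the per-type mean (NaN values are ignored)
item_weight_average = train.pivot_table(values='Item_Weight', index='Item_Type')

missing = train['Item_Weight'].isnull()

# Select the column first to get a Series keyed by Item_Type,
# then map each missing row's type to its mean weight
train.loc[missing, 'Item_Weight'] = (
    train.loc[missing, 'Item_Type'].map(item_weight_average['Item_Weight'])
)

print(train)
```

Inside the original lambda, Item_Weight_Average.loc[x, 'Item_Weight'] would be the equivalent row-then-column lookup.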

Related

Filtering a dataframe with datetime index using .loc (rows & columns) [duplicate]

This question already has answers here:
How to slice a Pandas Dataframe based on datetime index
(3 answers)
Closed 3 years ago.
I am trying to slice both the rows and columns of a dataframe using the .loc method, but I am having trouble slicing the rows (the dataframe has a datetime index).
The dataframe I am working with has 537 rows and 10 columns. The first date is 2018-01-01, but I want to slice it so that it only shows dates in 2019.
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 536 entries, 2018-01-01 00:00:00 to 2019-06-20 00:00:00
Data columns (total 10 columns):
link_clicks 536 non-null int64
customer_count 536 non-null int64
transaction_count 536 non-null int64
customers_per_click 536 non-null float64
transactions_per_click 536 non-null float64
14_day_ma 523 non-null float64
14_day_std 523 non-null float64
Upper14 523 non-null float64
Lower14 523 non-null float64
lower_flag 536 non-null bool
dtypes: bool(1), float64(6), int64(3)
memory usage: 42.4+ KB
df.loc['2019-01-01':'2019-06-01', ['customers_per_click', '14_day_ma', 'Upper14', 'Lower14']]
The expected result is a filtered dataframe within that date range. However, when I execute that line of code, it gives me the following error.
(Clearly it is an issue with the index, but I am not sure what the proper syntax is, and I am having trouble finding a solution online.)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_slice_bound(self, label, side, kind)
4804 try:
-> 4805 return self._searchsorted_monotonic(label, side)
4806 except ValueError:
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in _searchsorted_monotonic(self, label, side)
4764
-> 4765 raise ValueError('index must be monotonic increasing or decreasing')
4766
ValueError: index must be monotonic increasing or decreasing
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-599-5bdb485482ff> in <module>
----> 1 merge2.loc['2019-11-01':'2019-02-01', ['customers_per_click', '14_day_ma', 'Upper14', 'Lower14']].plot(figsize=(15,5))
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
1492 except (KeyError, IndexError, AttributeError):
1493 pass
-> 1494 return self._getitem_tuple(key)
1495 else:
1496 # we by definition only have the 0th axis
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
886 continue
887
--> 888 retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
889
890 return retval
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
1865 if isinstance(key, slice):
1866 self._validate_key(key, axis)
-> 1867 return self._get_slice_axis(key, axis=axis)
1868 elif com.is_bool_indexer(key):
1869 return self._getbool_axis(key, axis=axis)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexing.py in _get_slice_axis(self, slice_obj, axis)
1531 labels = obj._get_axis(axis)
1532 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop,
-> 1533 slice_obj.step, kind=self.name)
1534
1535 if isinstance(indexer, slice):
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in slice_indexer(self, start, end, step, kind)
4671 """
4672 start_slice, end_slice = self.slice_locs(start, end, step=step,
-> 4673 kind=kind)
4674
4675 # return a slice
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in slice_locs(self, start, end, step, kind)
4870 start_slice = None
4871 if start is not None:
-> 4872 start_slice = self.get_slice_bound(start, 'left', kind)
4873 if start_slice is None:
4874 start_slice = 0
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_slice_bound(self, label, side, kind)
4806 except ValueError:
4807 # raise the original KeyError
-> 4808 raise err
4809
4810 if isinstance(slc, np.ndarray):
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_slice_bound(self, label, side, kind)
4800 # we need to look up the label
4801 try:
-> 4802 slc = self._get_loc_only_exact_matches(label)
4803 except KeyError as err:
4804 try:
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in _get_loc_only_exact_matches(self, key)
4770 get_slice_bound.
4771 """
-> 4772 return self.get_loc(key)
4773
4774 def get_slice_bound(self, label, side, kind):
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2657 return self._engine.get_loc(key)
2658 except KeyError:
-> 2659 return self._engine.get_loc(self._maybe_cast_indexer(key))
2660 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
2661 if indexer.ndim > 1 or indexer.size > 1:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: '2019-11-01'

If your index is of type "datetime", try:
from datetime import datetime
df.loc[(df.index>=datetime(2019,1,1)) & (df.index<= datetime(2019,6,1)), ['customers_per_click', '14_day_ma', 'Upper14', 'Lower14']]
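
The df.info() output in the question reports a plain "Index" rather than a "DatetimeIndex", and the first error says the index is not monotonic, which suggests the labels are unsorted generic objects. Here is a minimal sketch of one way to make label slicing with .loc work, assuming the labels can be parsed by pd.to_datetime. The toy frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical frame: date strings as labels, deliberately out of order
df = pd.DataFrame(
    {'customers_per_click': [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]},
    index=['2019-01-02', '2018-12-30', '2019-01-04',
           '2018-12-31', '2019-01-01', '2019-01-03'],
)

# Convert the labels to a DatetimeIndex and sort so the index is monotonic
df.index = pd.to_datetime(df.index)
df = df.sort_index()

# Label slicing on a sorted DatetimeIndex includes both endpoints
subset = df.loc['2019-01-01':'2019-01-03', ['customers_per_click']]
print(subset)
```

Note that the start of the slice has to come before the end on a sorted index; the traceback in the question slices from '2019-11-01' to '2019-02-01'.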

Without all the details I propose the following code:
import numpy as np
import pandas as pd
index = pd.date_range('1/1/2018', periods=1100)
ts = pd.Series(np.random.normal(0.5, 2, 1100), index)
grouped = ts.groupby(lambda x: x.year)
grouped.size()
2018 365
2019 365
2020 366
2021 4
dtype: int64
You can select a year (a group) using:
grouped.get_group(2019)
len(grouped.get_group(2019))
365
Do you need something more specific?
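
As an alternative to grouping, a year can also be selected directly from a DatetimeIndex. A small sketch on the same kind of toy series follows; the seeded generator is an addition here, used only so the random values are reproducible.

```python
import numpy as np
import pandas as pd

index = pd.date_range('1/1/2018', periods=1100)
rng = np.random.default_rng(0)  # seeded only for reproducibility
ts = pd.Series(rng.normal(0.5, 2, 1100), index=index)

# Partial string indexing: every entry whose date falls in 2019
year_2019 = ts.loc['2019']

# Equivalent boolean mask built from the index's year attribute
same_rows = ts[ts.index.year == 2019]

# Date-range slice, inclusive on both ends
first_half = ts.loc['2019-01-01':'2019-06-01']

print(len(year_2019), len(first_half))
```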

How to plot a histogram of one column colored by another in Python?

I have a dataset that contains, among other columns, 3 columns titled Gender (either M or F), House (either A or B or C), and Indicator (either 0 or 1). I want to plot the histogram of House A colored by Gender. This is my code to do this:
import pandas as pd
df = pd.read_csv('dataset.csv', usecols=['House','Gender','Indicator'])
A = df[df['House']=='A']
A = pd.DataFrame(A, columns=['Indicator', 'Gender'])
This imports the values of House A for the respective genders correctly, as shown by its contents:
print(A)
Indicator Gender
0 1 Male
1 1 Male
2 1 Male
4 1 Female
7 1 Male
8 1 Male
11 1 Male
14 1 Male
17 1 Male
18 1 Female
19 1 Female
20 1 Female
21 1 Male
24 1 Male
26 1 Female
27 1 Male
... ... ...
Now when I want to plot the histogram of A colored by gender the way I did in MATLAB, it gives an error:
import matplotlib.pyplot as plt
plt.hist(A)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-130-81c3aef1748b> in <module>()
----> 1 plt.hist(A)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\matplotlib\pyplot.py in hist(x, bins, range, density, weights, cumulative, bottom, histtype, align, orientation, rwidth, log, color, label, stacked, normed, hold, data, **kwargs)
3130 histtype=histtype, align=align, orientation=orientation,
3131 rwidth=rwidth, log=log, color=color, label=label,
-> 3132 stacked=stacked, normed=normed, data=data, **kwargs)
3133 finally:
3134 ax._hold = washold
~\AppData\Local\Continuum\anaconda3\lib\site-packages\matplotlib\__init__.py in inner(ax, *args, **kwargs)
1853 "the Matplotlib list!)" % (label_namer, func.__name__),
1854 RuntimeWarning, stacklevel=2)
-> 1855 return func(ax, *args, **kwargs)
1856
1857 inner.__doc__ = _add_data_doc(inner.__doc__,
~\AppData\Local\Continuum\anaconda3\lib\site-packages\matplotlib\axes\_axes.py in hist(***failed resolving arguments***)
6512 for xi in x:
6513 if len(xi) > 0:
-> 6514 xmin = min(xmin, xi.min())
6515 xmax = max(xmax, xi.max())
6516 bin_range = (xmin, xmax)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\_methods.py in _amin(a, axis, out, keepdims)
27
28 def _amin(a, axis=None, out=None, keepdims=False):
---> 29 return umr_minimum(a, axis, None, out, keepdims)
30
31 def _sum(a, axis=None, dtype=None, out=None, keepdims=False):
TypeError: '<=' not supported between instances of 'int' and 'str'
It seems we need to specify the exact column we want to make a histogram of. Unlike MATLAB, matplotlib can't automatically work out that it should color according to the other column. So the following plots the histogram, but with no color indicating the Gender:
plt.hist(A['Indicator'])
So, how do I make either a stacked histogram, or a side-by-side one colored by gender? Something like this, except there'll be only 2 bars for each Indicator, at x=0 and x=1:
import numpy as np
x = np.random.randn(1000, 2)
colors = ['red', 'green']
plt.hist(x, color=colors)
plt.legend(['Male', 'Female'])
plt.title('Male and Female indicator by gender')
I have tried to imitate the above by copying the 2 dataframe columns into 2 columns of a list, and then trying to plot the histogram:
y=[]
y[0] = A[A['Gender'=='M']].tolist()
y[1] = A[A['Gender'=='F']].tolist()
plt.hist(y)
But this gives the following error:
KeyError Traceback (most recent call last)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
3062 try:
-> 3063 return self._engine.get_loc(key)
3064 except KeyError:
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: False
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-152-138cb74b6e00> in <module>()
2 A= pd.DataFrame(A, columns=['Indicator', 'Gender'])
3 y=[]
----> 4 y[0] = A[A['Gender'=='M']].tolist()
5 y[1] = A[A['Gender'=='F']].tolist()
6 plt.hist(y)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
2683 return self._getitem_multilevel(key)
2684 else:
-> 2685 return self._getitem_column(key)
2686
2687 def _getitem_column(self, key):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in _getitem_column(self, key)
2690 # get column
2691 if self.columns.is_unique:
-> 2692 return self._get_item_cache(key)
2693
2694 # duplicate columns & possible reduce dimensionality
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in _get_item_cache(self, item)
2484 res = cache.get(item)
2485 if res is None:
-> 2486 values = self._data.get(item)
2487 res = self._box_item_values(item, values)
2488 cache[item] = res
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in get(self, item, fastpath)
4113
4114 if not isna(item):
-> 4115 loc = self.items.get_loc(item)
4116 else:
4117 indexer = np.arange(len(self.items))[isna(self.items)]
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
3063 return self._engine.get_loc(key)
3064 except KeyError:
-> 3065 return self._engine.get_loc(self._maybe_cast_indexer(key))
3066
3067 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: False

The following should work, not tested with your data though.
genders = A.Gender.unique()
plt.hist([A.loc[A.Gender == x, 'Indicator'] for x in genders], label=genders)
Your code fails on A[A['Gender'=='M']] because it should be A[A['Gender'] == 'M'] to get the Male elements, but you also need to select the column that you want.
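
To make the corrected mask concrete, here is a small sketch on made-up data standing in for the House A subset. It builds the per-gender Indicator series and the table of counts that a side-by-side or stacked histogram would display. The values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the House A subset
A = pd.DataFrame({
    'Indicator': [1, 1, 0, 1, 0, 1, 1, 0],
    'Gender': ['Male', 'Male', 'Male', 'Female',
               'Female', 'Female', 'Male', 'Female'],
})

# Correct boolean mask: compare the column, then pick the column to plot
male = A.loc[A['Gender'] == 'Male', 'Indicator']
female = A.loc[A['Gender'] == 'Female', 'Indicator']

# Rows are Indicator values (0 and 1), columns are genders, cells are counts
counts = pd.crosstab(A['Indicator'], A['Gender'])
print(counts)
```

Passing [male, female] to plt.hist with label=['Male', 'Female'] should then draw the side-by-side bars, and counts.plot(kind='bar', stacked=True) is one way to get the stacked variant.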

Basic json and pandas DataFrame build

I am very new to Python and learning my way up. My task is to crawl data from the web and write it to an xlsx file using json and pandas (among others). I have been looking through examples of converting a JSON dict to a pandas DataFrame, but I can't seem to find the one that I need.
I'm guessing this is very basic, but please help me out.
Below is my code:
js = '''{"startDate":"2017-01-01","endDate":"2017-10-31","timeUnit":"month","results":
[{"title":"fruit","keywords":["apple","banana"],"data":
[{"period":"2017-01-01","ratio":19.35608},
{"period":"2017-02-01","ratio":17.33902},
{"period":"2017-03-01","ratio":22.30411},
{"period":"2017-04-01","ratio":20.94646},
{"period":"2017-05-01","ratio":23.8557},
{"period":"2017-06-01","ratio":22.38169},
{"period":"2017-07-01","ratio":27.38557},
{"period":"2017-08-01","ratio":19.16214},
{"period":"2017-09-01","ratio":32.07913},
{"period":"2017-10-01","ratio":41.89293}]},
{"title":"veg","keywords":["carrot","onion"],"data":
[{"period":"2017-01-01","ratio":100.0},
{"period":"2017-02-01","ratio":80.41117},
{"period":"2017-03-01","ratio":89.29402},
{"period":"2017-04-01","ratio":74.32118},
{"period":"2017-05-01","ratio":69.82156},
{"period":"2017-06-01","ratio":66.52444},
{"period":"2017-07-01","ratio":67.84328},
{"period":"2017-08-01","ratio":74.43754},
{"period":"2017-09-01","ratio":65.82621},
{"period":"2017-10-01","ratio":65.55469}]}]}'''
I have tried the following:
df = pd.DataFrame.from_dict(json_normalize(js), orient='columns')
df
and
df = pd.read_json(js)
results = df['results'].head()
dd = results['data']
results.to_json(orient='split')
and
data = json.loads(js)
data["results"]
data["startDate"]
data2 = json.loads(data["results"])
data2["data"]
I want my DataFrame to look like this:
         Date     Fruit        Veg
0  2017-01-01  19.35608      100.0
1  2017-02-01  17.33902   80.41117
2  2017-03-01  22.30411   89.29402
3  2017-04-01  20.94646   74.32118
4  2017-05-01   23.8557   69.82156
Edit
The code (from @COLDSPEED) worked perfectly up to a point. I used it in my new crawler ("Crawler: Combining DataFrame per each loop Python") and it ran fine until my DNA reached 170. The error message is below:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2441 try:
-> 2442 return self._engine.get_loc(key)
2443 except KeyError:
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'period'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-30-2a1de403b285> in <module>()
47 d = json.loads(js)
48 lst = [pd.DataFrame.from_dict(r['data']).set_index('period').rename(columns={'ratio' : r['title']})
---> 49 for r in d['results']]
50 df = pd.concat(lst, 1)
51 dfdfdf = Data.join(df)
<ipython-input-30-2a1de403b285> in <listcomp>(.0)
47 d = json.loads(js)
48 lst = [pd.DataFrame.from_dict(r['data']).set_index('period').rename(columns={'ratio' : r['title']})
---> 49 for r in d['results']]
50 df = pd.concat(lst, 1)
51 dfdfdf = Data.join(df)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in set_index(self, keys, drop, append, inplace, verify_integrity)
2828 names.append(None)
2829 else:
-> 2830 level = frame[col]._values
2831 names.append(col)
2832 if drop:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
1962 return self._getitem_multilevel(key)
1963 else:
-> 1964 return self._getitem_column(key)
1965
1966 def _getitem_column(self, key):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in _getitem_column(self, key)
1969 # get column
1970 if self.columns.is_unique:
-> 1971 return self._get_item_cache(key)
1972
1973 # duplicate columns & possible reduce dimensionality
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in _get_item_cache(self, item)
1643 res = cache.get(item)
1644 if res is None:
-> 1645 values = self._data.get(item)
1646 res = self._box_item_values(item, values)
1647 cache[item] = res
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in get(self, item, fastpath)
3588
3589 if not isnull(item):
-> 3590 loc = self.items.get_loc(item)
3591 else:
3592 indexer = np.arange(len(self.items))[isnull(self.items)]
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2442 return self._engine.get_loc(key)
2443 except KeyError:
-> 2444 return self._engine.get_loc(self._maybe_cast_indexer(key))
2445
2446 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'period'
I found that this happens when the js has no values in 'data', as shown below (please disregard the Korean title):
{"startDate":"2016-01-01","endDate":"2017-12-03","timeUnit":"date","results":[{"title":"황금뿔나팔버섯","keywords":["황금뿔나팔버섯"],"data":[]}]}
So I want to check whether there is any 'data' before using your code. Please take a look at the code below and tell me what is wrong with it.
if ([pd.DataFrame.from_dict(r['data']) for r in d['results']] == []):
#want to put only the column name as 'title' and move on
else:
lst = [pd.DataFrame.from_dict(r['data']).set_index('period').rename(columns={'ratio' : r['title']})
for r in d['results']]
df = pd.concat(lst, 1)

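Assuming your structure is consistent, use a list comprehension and then concatenate:
import json
import pandas as pd

d = json.loads(js)
# One single-column frame per result, indexed by period and named after its title
lst = [
    pd.DataFrame.from_dict(r['data'])
      .set_index('period')
      .rename(columns={'ratio': r['title']})
    for r in d['results']
]
# axis must be passed by keyword in recent pandas versions
df = pd.concat(lst, axis=1)
df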
               fruit        veg
period
2017-01-01  19.35608  100.00000
2017-02-01  17.33902   80.41117
2017-03-01  22.30411   89.29402
2017-04-01  20.94646   74.32118
2017-05-01  23.85570   69.82156
2017-06-01  22.38169   66.52444
2017-07-01  27.38557   67.84328
2017-08-01  19.16214   74.43754
2017-09-01  32.07913   65.82621
2017-10-01  41.89293   65.55469
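
The KeyError: 'period' described in the question's edit appears when a result's 'data' list is empty, because the resulting frame has no 'period' column to index on. Here is a minimal sketch of one way to handle that, assuming the goal is to keep the title as an all-NaN column. The shortened js payload and the "empty" title are made up for illustration.

```python
import json
import pandas as pd

# Hypothetical, shortened payload: the second result has no data
js = '''{"results": [
  {"title": "fruit", "data": [{"period": "2017-01-01", "ratio": 19.35608},
                              {"period": "2017-02-01", "ratio": 17.33902}]},
  {"title": "empty", "data": []}
]}'''

d = json.loads(js)
titles = [r['title'] for r in d['results']]

# Build frames only for results that actually contain rows
frames = [
    pd.DataFrame(r['data']).set_index('period').rename(columns={'ratio': r['title']})
    for r in d['results']
    if r['data']
]

# pd.concat raises on an empty list, so fall back to an empty frame
df = pd.concat(frames, axis=1) if frames else pd.DataFrame()

# Titles without data become columns filled with NaN
df = df.reindex(columns=titles)
print(df)
```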

Numpy TypeError: an integer is required

This may be a rather specific question, but I don't know who else to ask, so I hope somebody can help. I installed Python using Anaconda and am using Jupyter Notebook. I have two CSV files of data.
products.head()
ID_FUPID FUPID
0 1 674563
1 2 674597
2 3 674606
3 4 694776
4 5 694788
products contains the product ID and the product number.
ratings.head()
ID_CUSTOMER ID_FUPID RATING
0 1 216 1
1 2 390 1
2 3 851 5
3 4 5897 1
4 5 9341 1
ratings contains the customer ID, the product ID, and the rating the customer gave to the product.
I have created table as:
M = ratings.pivot_table(index=['ID_CUSTOMER'],columns=['ID_FUPID'],values='RATING')
This correctly shows the data as a matrix with product IDs as columns and customer IDs as rows.
I wanted to compute the Pearson correlation between products, so here is the pearson function:
import numpy as np

def pearson(s1, s2):
    """Take two pd.Series objects and return their Pearson correlation."""
    # Center each series on its own mean
    s1_c = s1 - s1.mean()
    s2_c = s2 - s2.mean()
    return np.sum(s1_c * s2_c) / np.sqrt(np.sum(s1_c ** 2) * np.sum(s2_c ** 2))
When I try to compute pearson(M['17'], M['21']), I get the following errors:
TypeError Traceback (most recent call last)
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
TypeError: an integer is required
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2441 try:
-> 2442 return self._engine.get_loc(key)
2443 except KeyError:
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
KeyError: '17'
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
TypeError: an integer is required
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-277-d4ead225b6ab> in <module>()
----> 1 pearson(M['17'], M['21'])
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
1962 return self._getitem_multilevel(key)
1963 else:
-> 1964 return self._getitem_column(key)
1965
1966 def _getitem_column(self, key):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in _getitem_column(self, key)
1969 # get column
1970 if self.columns.is_unique:
-> 1971 return self._get_item_cache(key)
1972
1973 # duplicate columns & possible reduce dimensionality
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in _get_item_cache(self, item)
1643 res = cache.get(item)
1644 if res is None:
-> 1645 values = self._data.get(item)
1646 res = self._box_item_values(item, values)
1647 cache[item] = res
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in get(self, item, fastpath)
3588
3589 if not isnull(item):
-> 3590 loc = self.items.get_loc(item)
3591 else:
3592 indexer = np.arange(len(self.items))[isnull(self.items)]
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2442 return self._engine.get_loc(key)
2443 except KeyError:
-> 2444 return self._engine.get_loc(self._maybe_cast_indexer(key))
2445
2446 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
KeyError: '17'
I would really appreciate any help! Thanks a million.

The error message contains the following line in two places:
KeyError: '17'
This indicates there is no key '17' in M. This is likely because the column labels of M are integers (they come from ID_FUPID), whereas you are accessing the DataFrame M with a string. The call to pearson might look like this:
pearson(M[17], M[21])
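
A small sketch of the difference between string and integer column labels follows. The ratings values are made up for illustration, and every customer rates both products so the pivot contains no NaN.

```python
import numpy as np
import pandas as pd

def pearson(s1, s2):
    """Take two pd.Series objects and return their Pearson correlation."""
    s1_c = s1 - s1.mean()
    s2_c = s2 - s2.mean()
    return np.sum(s1_c * s2_c) / np.sqrt(np.sum(s1_c ** 2) * np.sum(s2_c ** 2))

# Hypothetical ratings: three customers, two products (IDs 17 and 21)
ratings = pd.DataFrame({
    'ID_CUSTOMER': [1, 1, 2, 2, 3, 3],
    'ID_FUPID':    [17, 21, 17, 21, 17, 21],
    'RATING':      [1, 2, 3, 3, 5, 4],
})

M = ratings.pivot_table(index=['ID_CUSTOMER'], columns=['ID_FUPID'], values='RATING')

# The column labels come from ID_FUPID, so they are integers, not strings
has_str_key = '17' in M.columns
has_int_key = 17 in M.columns

r = pearson(M[17], M[21])
print(has_str_key, has_int_key, r)
```

If the columns did need to be addressed as strings, M.columns = M.columns.astype(str) would be one way to convert them.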

Odd behaviour indexing Pandas dataframe on date

I've just been working through the Pandas tutorial and am a little confused by the following behaviour.
In [28]: d
Out[28]:
Status CustomerCount
StatusDate
2009-01-05 9 2519
2009-01-12 10 3351
2009-01-19 10 2188
2009-01-26 10 2301
2009-02-02 7 2204
2009-02-09 9 1538
2009-02-16 9 1983
2009-02-23 9 1960
2009-03-02 11 2887
2009-03-09 9 2927
getting the records for a particular month via a string works nicely:
In [31]: d['2009-02']
Out[31]:
Status CustomerCount
StatusDate
2009-02-02 7 2204
2009-02-09 9 1538
2009-02-16 9 1983
2009-02-23 9 1960
slicing a date range also works nicely:
In [33]: d['2009-02-09':'2009-02-10']
Out[33]:
Status CustomerCount
StatusDate
2009-02-09 9 1538
getting records for a particular day with the same method does not:
In [32]: d['2009-02-09']
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-32-b78c7ec0d497> in <module>()
----> 1 d['2009-02-09']
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in __getitem__(self, key)
1676 return self._getitem_multilevel(key)
1677 else:
-> 1678 return self._getitem_column(key)
1679
1680 def _getitem_column(self, key):
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in _getitem_column(self, key)
1683 # get column
1684 if self.columns.is_unique:
-> 1685 return self._get_item_cache(key)
1686
1687 # duplicate columns & possible reduce dimensionaility
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/core/generic.pyc in _get_item_cache(self, item)
1050 res = cache.get(item)
1051 if res is None:
-> 1052 values = self._data.get(item)
1053 res = self._box_item_values(item, values)
1054 cache[item] = res
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/core/internals.pyc in get(self, item, fastpath)
2563
2564 if not isnull(item):
-> 2565 loc = self.items.get_loc(item)
2566 else:
2567 indexer = np.arange(len(self.items))[isnull(self.items)]
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/core/index.pyc in get_loc(self, key)
1179 loc : int if unique index, possibly slice or mask if not
1180 """
-> 1181 return self._engine.get_loc(_values_from_object(key))
1182
1183 def get_value(self, series, key):
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/index.so in pandas.index.IndexEngine.get_loc (pandas/index.c:3572)()
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/index.so in pandas.index.IndexEngine.get_loc (pandas/index.c:3452)()
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/hashtable.so in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:11343)()
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/hashtable.so in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:11296)()
KeyError: '2009-02-09'
neither does the following:
In [36]: d[d.first_valid_index()]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-36-071dd1d3c77c> in <module>()
----> 1 d[d.first_valid_index()]
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in __getitem__(self, key)
1676 return self._getitem_multilevel(key)
1677 else:
-> 1678 return self._getitem_column(key)
1679
1680 def _getitem_column(self, key):
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in _getitem_column(self, key)
1683 # get column
1684 if self.columns.is_unique:
-> 1685 return self._get_item_cache(key)
1686
1687 # duplicate columns & possible reduce dimensionaility
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/core/generic.pyc in _get_item_cache(self, item)
1050 res = cache.get(item)
1051 if res is None:
-> 1052 values = self._data.get(item)
1053 res = self._box_item_values(item, values)
1054 cache[item] = res
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/core/internals.pyc in get(self, item, fastpath)
2563
2564 if not isnull(item):
-> 2565 loc = self.items.get_loc(item)
2566 else:
2567 indexer = np.arange(len(self.items))[isnull(self.items)]
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/core/index.pyc in get_loc(self, key)
1179 loc : int if unique index, possibly slice or mask if not
1180 """
-> 1181 return self._engine.get_loc(_values_from_object(key))
1182
1183 def get_value(self, series, key):
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/index.so in pandas.index.IndexEngine.get_loc (pandas/index.c:3572)()
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/index.so in pandas.index.IndexEngine.get_loc (pandas/index.c:3452)()
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/hashtable.so in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:11343)()
/usr/local/lib/python2.7/site-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/hashtable.so in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:11296)()
KeyError: Timestamp('2009-01-05 00:00:00')
but this does:
In [37]: d.loc[d.first_valid_index()]
Out[37]:
Status 9
CustomerCount 2519
Name: 2009-01-05 00:00:00, dtype: int64
Is this behaviour buggy or have I misunderstood something?

d is a DataFrame, so df[key] primarily indexes the columns (see the indexing basics in the docs).
The one exception is when key is a slice: for convenience, slicing a DataFrame slices the rows.
In your example, d['2009-02-09':'2009-02-10'] is a slice, so it slices the rows correctly. In d['2009-02-09'], you give a single key, so it looks at the columns, and you get a KeyError because '2009-02-09' is not a column name.
d['2009-02'] is a special case, which can be a bit confusing in the beginning. It is a single string, but it actually represents a slice (this feature is called partial string indexing; see the docs).
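
Using .loc makes the intent to select rows explicit in every case. Below is a minimal sketch on a shortened copy of the data from the question. As far as I know, recent pandas releases no longer accept d['2009-02'] for row selection, so .loc is also the more portable spelling.

```python
import pandas as pd

# Shortened copy of the frame from the question
d = pd.DataFrame(
    {'Status': [9, 10, 7, 9, 9],
     'CustomerCount': [2519, 3351, 2204, 1538, 1983]},
    index=pd.to_datetime(['2009-01-05', '2009-01-12', '2009-02-02',
                          '2009-02-09', '2009-02-16']),
)
d.index.name = 'StatusDate'

february = d.loc['2009-02']                   # partial string: all rows in Feb 2009
one_week = d.loc['2009-02-09':'2009-02-10']   # row slice, both ends inclusive
one_day = d.loc[pd.Timestamp('2009-02-09')]   # single row, returned as a Series

# d['2009-02-09'] looks among the column labels, where the date is absent
is_column = '2009-02-09' in d.columns

print(february, one_week, one_day, is_column, sep='\n\n')
```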
