I have a pandas dataframe very similar to this one:
In:
import pandas as pd
df = pd.DataFrame({'blobs':['6-Feb- 1 4 Startup ZestFinance says it has built a machine-learning system that’s smart enough to find new borrowers and keep bias out of its credit analysis. 17-Feb-2014',
'Credit ratings have long been the key measure of how likely a U.S. 29—Oct-2012 consumer is to repay any loan, from mortgages to 18-0ct-12 credit cards. But the factors that FICO and other companies that create credit scores rely on—things like credit history and credit card balances—often depend on having credit already. ',
'November 22, 2012 In recent years, a crop of startup companies have launched on the premise 6—Feb- ] 4 that borrowers without such histories might still be quite likely to repay, and that their likelihood of doing so could be determined by analyzing large amounts of data, especially data that has traditionally not been part of the credit evaluation. These companies use algorithms and machine learning to find meaningful patterns in the data, alternative signs that a borrower is a good or bad credit risk.',
'March 1“, 2012 Los Angeles-based ZestFinance, founded by former Google CIO Douglas Merrill, claims to have solved this problem with a new credit-scoring platform, called ZAML. 06—Fcb—2012 The company sells the machine-learning software to lenders and also offers consulting 19—Jan— ] 2 services. Zest does not lend money itself. January 2, 1990']})
Each row contains some noisy dates in different formats. My objective is to extract the dates into other columns; for example, from the above data frame I would like to extract dates like this:
(*)
Out:
dates1 dates2
0 6-Feb-14, 17-Feb-2014 NaN
1 29—Oct-2012, 18-0ct-12 NaN
2 6—Feb- ]4 November 22, 2012
3 06—Fcb—2012, 19—Jan— ] 2 March 1“, 2012 | January 2, 1990
So far, I have tried defining the following regex and extracting the dates from the column with extract():
In:
df['col1'] = df['blobs'].str.extract(r"^(?:(?:31(\/|-|\.)(?:0?[13578]|1[02]|(?:Jan|Mar|May|Jul|Aug|Oct|Dec)))\1|(?:(?:29|30)(\/|-|\.)(?:0?[1,3-9]|1[0-2]|(?:Jan|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:29(\/|-|\.)(?:0?2|(?:Feb))\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\d|2[0-8])(\/|-|\.)(?:(?:0?[1-9]|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep))|(?:1[0-2]|(?:Oct|Nov|Dec)))\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$")
df
However, I am getting the following error, and I would also like a more robust method. So, what is the best way to get the output shown in (*)?
Out:
KeyError Traceback (most recent call last)
/usr/local/lib/python3.5/site-packages/pandas/indexes/base.py in get_loc(self, key, method, tolerance)
2133 try:
-> 2134 return self._engine.get_loc(key)
2135 except KeyError:
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4433)()
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4279)()
pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13742)()
pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)()
KeyError: 'date_format_3'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
/usr/local/lib/python3.5/site-packages/pandas/core/internals.py in set(self, item, value, check)
3667 try:
-> 3668 loc = self.items.get_loc(item)
3669 except KeyError:
/usr/local/lib/python3.5/site-packages/pandas/indexes/base.py in get_loc(self, key, method, tolerance)
2135 except KeyError:
-> 2136 return self._engine.get_loc(self._maybe_cast_indexer(key))
2137
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4433)()
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4279)()
pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13742)()
pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)()
KeyError: 'date_format_3'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-3-ec929aa5341c> in <module>()
----> 1 df['date_format_3'] = df['text'].str.extract(r"^(?:(?:31(\/|-|\.)(?:0?[13578]|1[02]|(?:Jan|Mar|May|Jul|Aug|Oct|Dec)))\1|(?:(?:29|30)(\/|-|\.)(?:0?[1,3-9]|1[0-2]|(?:Jan|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:29(\/|-|\.)(?:0?2|(?:Feb))\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\d|2[0-8])(\/|-|\.)(?:(?:0?[1-9]|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep))|(?:1[0-2]|(?:Oct|Nov|Dec)))\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$", expand = True)
2 df
/usr/local/lib/python3.5/site-packages/pandas/core/frame.py in __setitem__(self, key, value)
2417 else:
2418 # set column
-> 2419 self._set_item(key, value)
2420
2421 def _setitem_slice(self, key, value):
/usr/local/lib/python3.5/site-packages/pandas/core/frame.py in _set_item(self, key, value)
2484 self._ensure_valid_index(value)
2485 value = self._sanitize_column(key, value)
-> 2486 NDFrame._set_item(self, key, value)
2487
2488 # check if we are modifying a copy
/usr/local/lib/python3.5/site-packages/pandas/core/generic.py in _set_item(self, key, value)
1498
1499 def _set_item(self, key, value):
-> 1500 self._data.set(key, value)
1501 self._clear_item_cache()
1502
/usr/local/lib/python3.5/site-packages/pandas/core/internals.py in set(self, item, value, check)
3669 except KeyError:
3670 # This item wasn't present, just insert at end
-> 3671 self.insert(len(self.items), item, value)
3672 return
3673
/usr/local/lib/python3.5/site-packages/pandas/core/internals.py in insert(self, loc, item, value, allow_duplicates)
3770
3771 block = make_block(values=value, ndim=self.ndim,
-> 3772 placement=slice(loc, loc + 1))
3773
3774 for blkno, count in _fast_count_smallints(self._blknos[loc:]):
/usr/local/lib/python3.5/site-packages/pandas/core/internals.py in make_block(values, placement, klass, ndim, dtype, fastpath)
2683 placement=placement, dtype=dtype)
2684
-> 2685 return klass(values, ndim=ndim, fastpath=fastpath, placement=placement)
2686
2687 # TODO: flexible with index=None and/or items=None
/usr/local/lib/python3.5/site-packages/pandas/core/internals.py in __init__(self, values, ndim, fastpath, placement, **kwargs)
1815
1816 super(ObjectBlock, self).__init__(values, ndim=ndim, fastpath=fastpath,
-> 1817 placement=placement, **kwargs)
1818
1819 #property
/usr/local/lib/python3.5/site-packages/pandas/core/internals.py in __init__(self, values, placement, ndim, fastpath)
107 raise ValueError('Wrong number of items passed %d, placement '
108 'implies %d' % (len(self.values),
--> 109 len(self.mgr_locs)))
110
111 #property
ValueError: Wrong number of items passed 4, placement implies 1
You have to define exactly one capture group in str.extract() with parentheses; your pattern contained four groups, which is why four columns came back and the single-column assignment failed.
I extended/corrected the regex pattern, but it still needs more work.
(You can review and experiment with this regular expression on regex101)
df['blobs'].str.extract(r"(\d{1,2}[\s-]\w+[\s-]\d{2,4}|\w+[\s-]\d{1,2}[\",\s-]+\d{2,4})", expand=True)
Out:
0
0 17-Feb-2014
1 18-0ct-12
2 November 22, 2012
3 January 2, 1990
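Note that str.extract() returns only the first match per row. If the goal is every date-like substring in each blob, as in (*), a findall-based sketch with the same (still rough) pattern could look like the following; the column name all_dates is just for illustration:
# Collect all matches per row as a list, then join them into one string per row
pattern = r"\d{1,2}[\s-]\w+[\s-]\d{2,4}|\w+[\s-]\d{1,2}[\",\s-]+\d{2,4}"
df['all_dates'] = df['blobs'].str.findall(pattern).str.join(', ')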
Wrong number of items passed 2, placement implies 1
I want to calculate the 'overnight return' and 'intraday return' of AMZN and AAPL, and I get an error:
df['intradayReturn'] = (df1["adj_close"]/df1["open"])-1
df['overnightReturn'] = (df1["open_shift"]/df1["adj_close"])-1
In Python, below is my code:
import numpy as np
import pandas as pd
import datetime
import matplotlib.pyplot as plt
import warnings
import yfinance as yf   # needed for yf.download below

df = yf.download(['AAPL','AMZN'], start='2006-01-01', end='2020-12-31')
df.head()
Adj Close Close High Low Open Volume
AAPL AMZN AAPL AMZN AAPL AMZN AAPL AMZN AAPL AMZN AAPL AMZN
Date
2006-01-03 2.299533 47.580002 2.669643 47.580002 2.669643 47.849998 2.580357 46.250000 2.585000 47.470001 807234400 7582200
2006-01-04 2.306301 47.250000 2.677500 47.250000 2.713571 47.730000 2.660714 46.689999 2.683214 47.490002 619603600 7440900
2006-01-05 2.288151 47.650002 2.656429 47.650002 2.675000 48.200001 2.633929 47.110001 2.672500 47.160000 449422400 5417200
2006-01-06 2.347216 47.869999 2.725000 47.869999 2.739286 48.580002 2.662500 47.320000 2.687500 47.970001 704457600 6152900
2006-01-09 2.339525 47.080002 2.716071 47.080002 2.757143 47.099998 2.705000 46.400002 2.740357 46.549999 675040800 8943100
Up to this point it was working fine, but when I used this formula:
df['intradayReturn'] = (df["Adj Close"]/df["Open"])-1
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
3079 try:
-> 3080 return self._engine.get_loc(casted_key)
3081 except KeyError as err:
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'intradayReturn'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
~\Anaconda3\lib\site-packages\pandas\core\generic.py in _set_item(self, key, value)
3824 try:
-> 3825 loc = self._info_axis.get_loc(key)
3826 except KeyError:
~\Anaconda3\lib\site-packages\pandas\core\indexes\multi.py in get_loc(self, key, method)
2875 if not isinstance(key, tuple):
-> 2876 loc = self._get_level_indexer(key, level=0)
2877 return _maybe_to_slice(loc)
~\Anaconda3\lib\site-packages\pandas\core\indexes\multi.py in _get_level_indexer(self, key, level, indexer)
3157
-> 3158 idx = self._get_loc_single_level_index(level_index, key)
3159
~\Anaconda3\lib\site-packages\pandas\core\indexes\multi.py in _get_loc_single_level_index(self, level_index, key)
2808 else:
-> 2809 return level_index.get_loc(key)
2810
~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
3081 except KeyError as err:
-> 3082 raise KeyError(key) from err
3083
KeyError: 'intradayReturn'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-24-b2a4b46270a9> in <module>
----> 1 df['intradayReturn'] = (df["Adj Close"]/df["Open"])-1
~\Anaconda3\lib\site-packages\pandas\core\frame.py in __setitem__(self, key, value)
3161 else:
3162 # set column
-> 3163 self._set_item(key, value)
3164
3165 def _setitem_slice(self, key: slice, value):
~\Anaconda3\lib\site-packages\pandas\core\frame.py in _set_item(self, key, value)
3238 self._ensure_valid_index(value)
3239 value = self._sanitize_column(key, value)
-> 3240 NDFrame._set_item(self, key, value)
3241
3242 # check if we are modifying a copy
~\Anaconda3\lib\site-packages\pandas\core\generic.py in _set_item(self, key, value)
3826 except KeyError:
3827 # This item wasn't present, just insert at end
-> 3828 self._mgr.insert(len(self._info_axis), key, value)
3829 return
3830
~\Anaconda3\lib\site-packages\pandas\core\internals\managers.py in insert(self, loc, item, value, allow_duplicates)
1201 value = safe_reshape(value, (1,) + value.shape)
1202
-> 1203 block = make_block(values=value, ndim=self.ndim, placement=slice(loc, loc + 1))
1204
1205 for blkno, count in _fast_count_smallints(self.blknos[loc:]):
~\Anaconda3\lib\site-packages\pandas\core\internals\blocks.py in make_block(values, placement, klass, ndim, dtype)
2730 values = DatetimeArray._simple_new(values, dtype=dtype)
2731
-> 2732 return klass(values, ndim=ndim, placement=placement)
2733
2734
~\Anaconda3\lib\site-packages\pandas\core\internals\blocks.py in __init__(self, values, placement, ndim)
141 if self._validate_ndim and self.ndim and len(self.mgr_locs) != len(self.values):
142 raise ValueError(
--> 143 f"Wrong number of items passed {len(self.values)}, "
144 f"placement implies {len(self.mgr_locs)}"
145 )
it shows this error:
ValueError: Wrong number of items passed 2, placement implies 1
It's easy: you can do this with the code below.
# Compute the intraday return for each ticker separately
gf = (df['Adj Close']['AAPL'] / df['Open']['AAPL'] - 1).to_frame()
sd = (df['Adj Close']['AMZN'] / df['Open']['AMZN'] - 1).to_frame()
# Assign each result to its own (label, ticker) slot in the MultiIndex columns
df['intradayReturn', 'AAPL'] = gf
df['intradayReturn', 'AMZN'] = sd
df.head()
Your output will then include the new intradayReturn columns.
You need to assign separately when it comes to MultiIndex columns.
Now it's your turn for the overnight return 😉.
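If you would rather not assign ticker by ticker, here is a sketch that computes the ratio for all tickers at once and attaches it under a new top-level column label (assuming df still has the yfinance MultiIndex columns shown above):
import pandas as pd

# One division yields one column per ticker; relabel them under 'intradayReturn'
intraday = df['Adj Close'] / df['Open'] - 1
intraday.columns = pd.MultiIndex.from_product([['intradayReturn'], intraday.columns])
df = df.join(intraday)   # adds ('intradayReturn', 'AAPL') and ('intradayReturn', 'AMZN')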
I'm sorry, I'm a beginner, so this might be something simple, but I don't know what might be wrong.
import pandas as pd
from pandas_datareader import data as wb
tickers = ['F','MSFT','BP']
new_data = pd.DataFrame()
for t in tickers:
    new_data[t] = wb.DataReader(t, data_source='yahoo', start='2015-1-1')
This is the error I get:
ValueError Traceback (most recent call last)
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py in _ensure_valid_index(self, value)
3524 try:
-> 3525 value = Series(value)
3526 except (ValueError, NotImplementedError, TypeError):
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py in __init__(self, data, index, dtype, name, copy, fastpath)
312
--> 313 data = SingleBlockManager(data, index, fastpath=True)
314
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\managers.py in __init__(self, block, axis, do_integrity_check, fastpath)
1515 if not isinstance(block, Block):
-> 1516 block = make_block(block, placement=slice(0, len(axis)), ndim=1)
1517
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\blocks.py in make_block(values, placement, klass, ndim, dtype, fastpath)
3266
-> 3267 return klass(values, ndim=ndim, placement=placement)
3268
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\blocks.py in __init__(self, values, placement, ndim)
2774
-> 2775 super().__init__(values, ndim=ndim, placement=placement)
2776
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\blocks.py in __init__(self, values, placement, ndim)
127 "Wrong number of items passed {val}, placement implies "
--> 128 "{mgr}".format(val=len(self.values), mgr=len(self.mgr_locs))
129 )
ValueError: Wrong number of items passed 6, placement implies 1318
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-11-6f48653a5e0e> in <module>
2 new_data = pd.DataFrame()
3 for t in tickers:
----> 4 new_data[t] = wb.DataReader(t, data_source='yahoo', start='2015-1-1')
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py in __setitem__(self, key, value)
3470 else:
3471 # set column
-> 3472 self._set_item(key, value)
3473
3474 def _setitem_slice(self, key, value):
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py in _set_item(self, key, value)
3546 """
3547
-> 3548 self._ensure_valid_index(value)
3549 value = self._sanitize_column(key, value)
3550 NDFrame._set_item(self, key, value)
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py in _ensure_valid_index(self, value)
3526 except (ValueError, NotImplementedError, TypeError):
3527 raise ValueError(
-> 3528 "Cannot set a frame with no defined index "
3529 "and a value that cannot be converted to a "
3530 "Series"
ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series
I imagine this has got something to do with missing information? Reading int as text or something? Can someone help?
This is actually from the answer key of an exercise I am doing from an online course, and it's returning this error.
You can only assign data of the same size to a dataframe column: a single column, i.e. a Series in pandas terms. Let's look at what DataReader returns:
wb.DataReader('F', data_source='yahoo', start='2015-1-1').head()
            High    Low   Open  Close      Volume  Adj Close
Date
2015-01-02 15.65 15.18 15.59 15.36 24777900.0 11.452041
2015-01-05 15.13 14.69 15.12 14.76 44079700.0 11.004693
2015-01-06 14.90 14.38 14.88 14.62 32981600.0 10.900313
2015-01-07 15.09 14.77 14.78 15.04 26065300.0 11.213456
2015-01-08 15.48 15.23 15.40 15.42 33943400.0 11.496773
It returns a dataframe. If you'd like to pick one column from here - say, Open - you can assign that to your new table.
tickers = ['F','MSFT','BP']
new_data = pd.DataFrame()
for t in tickers:
    new_data[t] = wb.DataReader(t, data_source='yahoo', start='2015-1-1')['Open']
new_data.head()
Date F MSFT BP
2015-01-02 15.59 46.660000 38.209999
2015-01-05 15.12 46.369999 36.590000
2015-01-06 14.88 46.380001 36.009998
2015-01-07 14.78 45.980000 36.000000
2015-01-08 15.40 46.750000 36.430000
You could assign to a dictionary instead:
import pandas as pd
from pandas_datareader import data as wb
tickers = ['F','MSFT','BP']
new_data = {}
for t in tickers:
    new_data[t] = wb.DataReader(t, data_source='yahoo', start='2015-1-1')
print(new_data['F'])
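If you want a single dataframe afterwards, a sketch using pd.concat on that dictionary (the dict keys become the top level of a column MultiIndex):
combined = pd.concat(new_data, axis=1)    # columns like ('F', 'Open'), ('MSFT', 'Close'), ...
print(combined[('F', 'Open')].head())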
You are trying to assign an entire dataframe to each column of a dataframe.
This is a rather generic question: I would like to know in what cases df.groupby(level = 'id').agg(['count']) can result in the following type error:
TypeError Traceback (most recent call last)
<ipython-input-116-b422fc0967b1> in <module>()
----> 4 df.groupby(level = 'id', sort = False).agg(['count'])
/home/user/anaconda3/lib/python3.6/site-packages/pandas/core/groupby.py in aggregate(self, func_or_funcs, *args, **kwargs)
2798 if hasattr(func_or_funcs, '__iter__'):
2799 ret = self._aggregate_multiple_funcs(func_or_funcs,
-> 2800 (_level or 0) + 1)
2801 else:
2802 cyfunc = self._is_cython_func(func_or_funcs)
/home/user/anaconda3/lib/python3.6/site-packages/pandas/core/groupby.py in _aggregate_multiple_funcs(self, arg, _level)
2867 # reset the cache so that we
2868 # only include the named selection
-> 2869 if name in self._selected_obj:
2870 obj = copy.copy(obj)
2871 obj._reset_cache()
/home/user/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py in __contains__(self, key)
905 def __contains__(self, key):
906 """True if the key is in the info axis"""
--> 907 return key in self._info_axis
908
909 #property
/home/user/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/multi.py in __contains__(self, key)
1326 hash(key)
1327 try:
-> 1328 self.get_loc(key)
1329 return True
1330 except LookupError:
/home/user/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/multi.py in get_loc(self, key, method)
1964
1965 if not isinstance(key, tuple):
-> 1966 loc = self._get_level_indexer(key, level=0)
1967 return _maybe_to_slice(loc)
1968
/home/user/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/multi.py in _get_level_indexer(self, key, level, indexer)
2227 else:
2228
-> 2229 loc = level_index.get_loc(key)
2230 if isinstance(loc, slice):
2231 return loc
/home/user/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2391 key = _values_from_object(key)
2392 try:
-> 2393 return self._engine.get_loc(key)
2394 except KeyError:
2395 return self._engine.get_loc(self._maybe_cast_indexer(key))
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5239)()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:4881)()
pandas/_libs/index.pyx in pandas._libs.index._bin_search (pandas/_libs/index.c:8637)()
TypeError: '<' not supported between instances of 'numpy.ndarray' and 'str'
I'm asking this because I have a dataframe where this error occurs in some cases but not always. For example, if I use df.head(n).groupby(level = 'id').agg(['count']), then depending on n I get this error: it runs fine for head up to 2523475 and fails from 2523476 onward, but I can't find anything wrong with the rows at those positions. They look the same as all the other rows in the dataframe and there are no null values.
I figure there must be something wrong with the data, but in order to find out what, I need to know when this error can happen.
The data is a series in this format:
id date
3531364 2017-04-13 1
1550725 2017-02-21 1
1819411 2017-04-19 1
1636629 2016-12-28 1
3123152 2017-05-03 1
dtype: int64
In case it matters, the dates are periods. Converting with to_dict() gives me this:
{(1550725, Period('2017-02-21', 'D')): 1,
(1636629, Period('2016-12-28', 'D')): 1,
(1819411, Period('2017-04-19', 'D')): 1,
(3123152, Period('2017-05-03', 'D')): 1,
(3531364, Period('2017-04-13', 'D')): 1}
Some extra bits that might help somebody diagnose the problem:
Once I get the error, if I do aux = df.head(5) and then aux.groupby(level = 'id').agg(['count']), it fails, BUT aux.groupby(level = 'id').count() works fine. This is also puzzling because, with the original dataframe, df.head(5).groupby(level = 'id').agg(['count']) works perfectly. Maybe something is cached? How can this happen?
If you try it with the Series I pasted here, it will work for you and it will work for me, but if I first get the error and then take the head(5) of the df, it fails (!). I guess something eventually makes agg(['count']) fail, and that something is somehow cached by pandas, but I really don't know what is going on under the hood.
I would appreciate any help with this. I know it is not easy.
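One purely diagnostic sketch, assuming df is the int64 Series shown above with 'id' and 'date' index levels: the TypeError comes from an ordering comparison between incompatible types, so it is worth checking whether the 'id' level holds a single consistent type.
# Inspect the 'id' index level for mixed Python types
ids = df.index.get_level_values('id')
print(ids.dtype)                        # object dtype can hide mixed types
print(ids.map(type).value_counts())     # count of each element type in the level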
I'm struggling with a filtering operation on a Stata file in Python 3: I was asked to keep only the households without kids from a dataset/table.
I used a filtering condition to filter these rows out of the table:
filtering_condition = df["kids"] > 0
df_nokids = df.loc[filtering_condition,"kids"]
This, however, gives me an error I don't understand:
KeyError Traceback (most recent call last)
/opt/anaconda/anaconda3/lib/python3.5/site-packages/pandas/indexes/base.py in get_loc(self, key, method, tolerance)
1944 try:
-> 1945 return self._engine.get_loc(key)
1946 except KeyError:
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4154)()
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4018)()
pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12368)()
pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12322)()
KeyError: 'kids'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-321-e72cd8a67065> in <module>()
1 #keep only the households without kids and use this dataset for the rest of the assignment
----> 2 filtering_condition = df["kids"] > 0
3 df_nokids = df.loc[filtering_condition,"kids"]
/opt/anaconda/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py in __getitem__(self, key)
1995 return self._getitem_multilevel(key)
1996 else:
-> 1997 return self._getitem_column(key)
1998
1999 def _getitem_column(self, key):
/opt/anaconda/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py in _getitem_column(self, key)
2002 # get column
2003 if self.columns.is_unique:
-> 2004 return self._get_item_cache(key)
2005
2006 # duplicate columns & possible reduce dimensionality
/opt/anaconda/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
1348 res = cache.get(item)
1349 if res is None:
-> 1350 values = self._data.get(item)
1351 res = self._box_item_values(item, values)
1352 cache[item] = res
/opt/anaconda/anaconda3/lib/python3.5/site-packages/pandas/core/internals.py in get(self, item, fastpath)
3288
3289 if not isnull(item):
-> 3290 loc = self.items.get_loc(item)
3291 else:
3292 indexer = np.arange(len(self.items)) [isnull(self.items)]
/opt/anaconda/anaconda3/lib/python3.5/site-packages/pandas/indexes/base.py in get_loc(self, key, method, tolerance)
1945 return self._engine.get_loc(key)
1946 except KeyError:
-> 1947 return self._engine.get_loc(self._maybe_cast_indexer(key))
1948
1949 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4154)()
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4018)()
pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12368)()
pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12322)()
KeyError: 'kids'
Any explanations of what I am doing wrong?
Thanks!
Datafile:
Do you mean something like this:
df_kids = df[df['kids']>0]
This selects the rows where the 'kids' column is not zero.
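The KeyError: 'kids' in the traceback means the column itself is not being found, so before filtering it may help to confirm the exact column name (a small sketch; the name 'kids' and the zero comparison are taken from the question):
print(df.columns.tolist())          # confirm the exact spelling/case of the column
df_nokids = df[df['kids'] == 0]     # households without kids, once the name is confirmed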
Why do I keep getting a KeyError?
[edit] Here is the data:
GEO,LAT,LON
AALBORG DENMARK,57.0482206,9.9193939
AARHUS DENMARK,56.1496278,10.2134046
ABBOTSFORD BC CANADA,49.0519047,-122.3290473
ABEOKUTA NIGERIA,7.161,3.348
ABERDEEN SCOTLAND,57.1452452,-2.0913745
[end edit]
I can't find a row by its index, but it's clearly there:
geocache = pd.read_csv('geolog.csv',index_col=['GEO']) # index_col=['GEO']
geocache.head()
This shows:
LAT LON
GEO
AALBORG DENMARK 57.048221 9.919394
AARHUS DENMARK 56.149628 10.213405
ABBOTSFORD BC CANADA 49.051905 -122.329047
ABEOKUTA NIGERIA 7.161000 3.348000
ABERDEEN SCOTLAND 57.145245 -2.091374
So then I test it:
x = 'AARHUS DENMARK'
print(x)
geocache[x]
And this is what I get:
AARHUS DENMARK
KeyError Traceback (most recent call last)
in ()
2 x = u'AARHUS DENMARK'
3 print(x)
----> 4 geocache[x]
C:\Users\g\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
1785 return self._getitem_multilevel(key)
1786 else:
-> 1787 return self._getitem_column(key)
1788
1789 def _getitem_column(self, key):
C:\Users\g\Anaconda3\lib\site-packages\pandas\core\frame.py in _getitem_column(self, key)
1792 # get column
1793 if self.columns.is_unique:
-> 1794 return self._get_item_cache(key)
1795
1796 # duplicate columns & possible reduce dimensionaility
C:\Users\g\Anaconda3\lib\site-packages\pandas\core\generic.py in _get_item_cache(self, item)
1077 res = cache.get(item)
1078 if res is None:
-> 1079 values = self._data.get(item)
1080 res = self._box_item_values(item, values)
1081 cache[item] = res
C:\Users\g\Anaconda3\lib\site-packages\pandas\core\internals.py in get(self, item, fastpath)
2841
2842 if not isnull(item):
-> 2843 loc = self.items.get_loc(item)
2844 else:
2845 indexer = np.arange(len(self.items))[isnull(self.items)]
C:\Users\g\Anaconda3\lib\site-packages\pandas\core\index.py in get_loc(self, key, method)
1435 """
1436 if method is None:
-> 1437 return self._engine.get_loc(_values_from_object(key))
1438
1439 indexer = self.get_indexer([key], method=method)
pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:3824)()
pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:3704)()
pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12349)()
pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12300)()
KeyError: 'AARHUS DENMARK'
There are no extra spaces or non-visible characters. I tried putting r and u prefixes before the string assignment with no change in behavior.
Ok, what am I missing?
As you didn't pass a sep (separator) argument to read_csv, the default is comma-separated. Because your csv contains spaces/tabs after the commas, these get treated as part of the data, and hence your index values contain embedded spaces.
So you need to pass additional params to read_csv:
pd.read_csv('geolog.csv', index_col=['GEO'], sep=r',\s+', engine='python')
The sep argument is a regex that matches a comma followed by one or more whitespace characters; we pass engine='python' because the C engine does not accept regular expressions as separators.
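If you'd rather avoid a regex separator, an alternative sketch uses skipinitialspace, which strips whitespace immediately after each delimiter and keeps the default C engine (assuming the file only has spaces after the commas):
geocache = pd.read_csv('geolog.csv', index_col=['GEO'], skipinitialspace=True)
geocache.loc['AARHUS DENMARK']      # .loc looks up rows by index label; geocache[label] looks up columns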