I'm having trouble addressing values in a DataFrame, but I don't seem to have any problems with the Series object.
>>> df=DataFrame([0.5,1.5,2.5,3.5,4.5], index=[['a','a','b','b','b'],[1,2,1,2,3]])
>>> series=Series([0.5,1.5,2.5,3.5,4.5], index=[['a','a','b','b','b'],[1,2,1,2,3]])
>>> series['a']
1 0.5
2 1.5
dtype: float64
>>> df['a']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Anaconda\lib\site-packages\pandas\core\frame.py", line 2003, in __getitem__
return self._get_item_cache(key)
File "C:\Anaconda\lib\site-packages\pandas\core\generic.py", line 667, in _get_item_cache
values = self._data.get(item)
File "C:\Anaconda\lib\site-packages\pandas\core\internals.py", line 1655, in get
_, block = self._find_block(item)
File "C:\Anaconda\lib\site-packages\pandas\core\internals.py", line 1935, in _find_block
self._check_have(item)
File "C:\Anaconda\lib\site-packages\pandas\core\internals.py", line 1942, in _check_have
raise KeyError('no item named %s' % com.pprint_thing(item))
KeyError: u'no item named a'
I'm definitely misunderstanding something, if someone could help me out it would be very much appreciated!
You are trying to select a column, and there is indeed no column named 'a'. Try df.loc['a'] instead.
I recommend looking at the basic indexing docs: http://pandas.pydata.org/pandas-docs/stable/indexing.html#basics
In summary:
series[label] selects the element of the series at index label
dataframe[label] selects the column with name label
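To make the distinction concrete, here is a minimal runnable sketch using the same MultiIndex frame as the question:

```python
import pandas as pd

# Same data as in the question: a two-level (MultiIndex) row index
df = pd.DataFrame(
    [0.5, 1.5, 2.5, 3.5, 4.5],
    index=[['a', 'a', 'b', 'b', 'b'], [1, 2, 1, 2, 3]],
)

# df['a'] raises KeyError because [] looks for a COLUMN named 'a'.
# .loc selects along the row index instead:
print(df.loc['a'])
```

`df.loc['a']` returns the sub-frame of rows under the outer label 'a', with the inner level (1 and 2) as its index.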
Related
I have a dataframe df_params. It contains parameters for the stored procedure.
PurchaseOrderID OrderDate SupplierReference DF_Name
0 1 2013-01-01 B2084020 dataframe1
1 2 2013-01-01 293092 dataframe2
2 3 2013-01-01 08803922 dataframe3
3 4 2013-01-01 BC0280982 dataframe4
4 5 2013-01-01 ML0300202 dataframe5
I simply want to access the elements of the dataframe in a loop:
for i in range(len(df_params)):
    print(df_params[i][0])
But it gives me an error without any real explanation:
Traceback (most recent call last):
File "C:my\path\site-packages\pandas\core\indexes\base.py", line 2897, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "Test3.py", line 35, in <module>
print(df_params[i][0])
File "C:\Users\my\path\Python37\lib\site-packages\pandas\core\frame.py", line 2995, in __getitem__
indexer = self.columns.get_loc(key)
File "C:\Users\my\path\Python37\lib\site-packages\pandas\core\indexes\base.py", line 2899, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 0
The goal is to supply value to the stored procedure:
for i in range(len(df_params)):
    query = "EXEC Purchasing.GetPurchaseOrder " + df_params[i][0] + "," + str(df_params[i][1]) + "," + df_params[i][2]
    df = pd.read_sql(query, conn)
desired outcome from print(query):
EXEC Purchasing.GetPurchaseOrder 1, '2013-01-01', 'B2084020'
EXEC Purchasing.GetPurchaseOrder 2, '2013-01-01', '293092'
EXEC Purchasing.GetPurchaseOrder 3, '2013-01-01', '08803922'
EXEC Purchasing.GetPurchaseOrder 4, '2013-01-01', 'BC0280982'
EXEC Purchasing.GetPurchaseOrder 5, '2013-01-01', 'ML0300202'
pandas.DataFrames don't behave exactly like numpy.ndarrays. There are basically three options:
option 1: the iterrows method:
You can iterate over the rows of a pandas.DataFrame with
for idx, row in df_params.iterrows():
print(row['PurchaseOrderID'])
This is a particularly readable way, so personally I prefer it.
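Applied to the original goal, option 1 can build the query strings directly. This is a sketch using a subset of the df_params shown in the question; in real code, prefer parameterized queries over string concatenation:

```python
import pandas as pd

# A subset of the df_params frame from the question
df_params = pd.DataFrame({
    'PurchaseOrderID': [1, 2],
    'OrderDate': ['2013-01-01', '2013-01-01'],
    'SupplierReference': ['B2084020', '293092'],
})

queries = []
for idx, row in df_params.iterrows():
    # One EXEC statement per row, matching the desired print(query) output
    queries.append("EXEC Purchasing.GetPurchaseOrder "
                   f"{row['PurchaseOrderID']}, '{row['OrderDate']}', '{row['SupplierReference']}'")

for q in queries:
    print(q)
```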
option 2: indexing:
If you want to index a pandas.DataFrame just like a numpy.ndarray object, go with the method .iat[]:
for i in range(len(df_params)):
    print(df_params.iat[i, 0])
This indexes purely by position and ignores the index of the dataframe! So even if you have a different index (in the extreme, some strings, or a table with a pandas.DatetimeIndex), this still works, just as if you had done df_params.to_numpy()[i, 0].
Note: There is a similar accessor that uses the row label and column name instead of positions: .at[]
There is a second way to index a pandas.DataFrame object, and it is a little safer with regard to columns: .loc[]. It takes an index label and column name(s):
for idx in df_params.index:
    print(df_params.loc[idx, 'PurchaseOrderID'])
option 3: slicing a pandas.Series object:
Every column in a pandas.DataFrame is a pandas.Series object, which you can index much like a numpy.ndarray (you actually index the series by its labels, as described above):
col = df_params['PurchaseOrderID']
for idx in col.index:
    print(col[idx])
So what went wrong in your case?
The double indexing is almost the same as the last example, but the first [] on a DataFrame selects a column, so it expects a column name and not a row number (that would have been the .iloc[] method). In other words, it expects to see the column first and then the row.
So if you really want, you could go column-first like this:
for i in range(len(df_params)):
    print(df_params['PurchaseOrderID'][i])
but this only works because your pandas.DataFrame has contiguous numeric indices starting from 0! So please don't do this and use the actual indices of your table (actually, use one of the options above and not this last one ;) )
On a DataFrame there are other ways to access values; for example, you can use apply with a lambda.
The lambda is called once per row, so it has access to every row:
df.apply(lambda row: print(row['DF_Name']), axis=1)
Inside the lambda, the variable row is one row of the dataframe, and you can access each of its properties. Note the axis=1: without it, apply passes columns to the lambda instead of rows.
I've been searching for a solution to this for the past few hours now. Relevant pandas documentation is unhelpful and this solution gives me the same error.
I am trying to order my dataframe using a categorical in the following manner:
metabolites_order = CategoricalDtype(['Header', 'Metabolite', 'Unknown'], ordered=True)
df2['Feature type'] = df2['Feature type'].astype(metabolites_order)
df2 = df2.sort_values('Feature type')
The "Feature type" column is populated with the categories correctly. This code runs perfectly in Jupyter Notebooks, but when I run it in PyCharm, I get the following error:
Traceback (most recent call last):
File "/Users/wasim.sandhu/Documents/MSDIALPostProcessor/postprocessor.py", line 138, in process_alignment_file
df2.loc[4] = list(df2.columns)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/indexing.py", line 692, in __setitem__
iloc._setitem_with_indexer(indexer, value, self.name)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/indexing.py", line 1635, in _setitem_with_indexer
self._setitem_with_indexer_split_path(indexer, value, name)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/indexing.py", line 1700, in _setitem_with_indexer_split_path
self._setitem_single_column(loc, v, pi)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/indexing.py", line 1813, in _setitem_single_column
ser._mgr = ser._mgr.setitem(indexer=(pi,), value=value)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 568, in setitem
return self.apply("setitem", indexer=indexer, value=value)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 427, in apply
applied = getattr(b, f)(**kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/internals/blocks.py", line 1846, in setitem
self.values[indexer] = value
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/arrays/_mixins.py", line 211, in __setitem__
value = self._validate_setitem_value(value)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/arrays/categorical.py", line 1898, in _validate_setitem_value
raise ValueError(
ValueError: Cannot setitem on a Categorical with a new category, set the categories first
Process finished with exit code 134 (interrupted by signal 6: SIGABRT)
What could be causing this? I believe that I've set the categories correctly...
I'd suggest just mapping these categories to integers, then sorting on that column instead.
categories = ['Header', 'Metabolite', 'Unknown']
feature_map = {cat: i for i, cat in enumerate(categories)}
df['Feature Order'] = df['Feature type'].map(feature_map)
df = df.sort_values('Feature Order')
Figured it out literally minutes after I posted the question. The header row in this dataset is the 5th row. I checked the "Feature type" column, and "Feature type" itself is one of its values, which is what threw this error.
Solved by adding the column header name into the categories.
metabolites_order = CategoricalDtype(['Header', 'Feature type', 'Metabolite', 'Unknown'], ordered=True)
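With the header name included in the categories, the ordered sort then behaves as expected. A small sketch with made-up rows:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Ordered categories, including the stray header value
metabolites_order = CategoricalDtype(
    ['Header', 'Feature type', 'Metabolite', 'Unknown'], ordered=True)

df2 = pd.DataFrame({'Feature type': ['Unknown', 'Metabolite', 'Header']})
df2['Feature type'] = df2['Feature type'].astype(metabolites_order)
df2 = df2.sort_values('Feature type')

print(df2['Feature type'].tolist())
```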
I have a dataframe from which I have to retrieve the unique values in order to create some partitioning. I have that part working, and I can get a small dataframe with each row representing a certain partition. The challenge is that I then need to filter the original dataframe to only the appropriate data (without modifying the original frame, so I can filter all the values) so I can send it to S3.
I am having trouble filtering the dataframe based on the series from the small dataframe.
here is my code:
df_partitions = df.groupby(['grid_id', 'case_id', 'snapshot_year', 'snapshot_month', 'snapshot_day']).size().reset_index()
df_parts = df_partitions[['grid_id', 'case_id', 'snapshot_year', 'snapshot_month', 'snapshot_day']]
for index, row in df_parts.iterrows():
    dest_key_name = '/rec/{}/{}/{}/{}/{}/{}/{}'.format(row['grid_id'], row['case_id'],
                                                       row['snapshot_year'], row['snapshot_month'],
                                                       row['snapshot_day'], file_partition_time,
                                                       'df.csv')
    df_test = df
    filter_df = df_test[(df_test['grid_id'] == row['grid_id'] &
                         df_test['case_id'] == row['case_id'] &
                         df_test['snapshot_year'] == row['snapshot_year'] &
                         df_test['snapshot_month'] == row['snapshot_month'] &
                         df_test['snapshot_day'] == row['snapshot_day'])]
    print(filter_df)
here is the error:
Traceback (most recent call last):
File "<input>", line 8, in <module>
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/ops.py", line 954, in wrapper
na_op(self.values, other),
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/ops.py", line 924, in na_op
raise TypeError(msg)
TypeError: cannot compare a dtyped [object] array with a scalar of type [bool]
I also tried
filters_df = df[row]
here is the error:
KeyError: "['pjm' 'base' 2020 2 21] not in index"
and
df_test = df
i1 = df_test.set_index(row).index
i2 = df_parts.set_index(row).index
filter_df = df_test[~i1.isin(i2)]
here is the error:
Traceback (most recent call last):
File "<input>", line 7, in <module>
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/frame.py", line 3164, in set_index
frame.index = index
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/generic.py", line 3627, in __setattr__
return object.__setattr__(self, name, value)
File "pandas/_libs/properties.pyx", line 69, in pandas._libs.properties.AxisProperty.__set__
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/generic.py", line 559, in _set_axis
self._data.set_axis(axis, labels)
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/internals.py", line 3074, in set_axis
(old_len, new_len))
ValueError: Length mismatch: Expected axis has 130 elements, new values have 5 elements
Very simple solution here. The format for filtering on multiple criteria is df[(...)&(...)], while you are trying df[(... & ... )]. Close out those parentheses where you're setting filter_df.
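A runnable sketch of the corrected filter, with toy versions of df_test and row (column names taken from the question, data made up):

```python
import pandas as pd

df_test = pd.DataFrame({
    'grid_id': ['pjm', 'pjm', 'ercot'],
    'case_id': ['base', 'high', 'base'],
})
row = {'grid_id': 'pjm', 'case_id': 'base'}

# Each comparison gets its own parentheses: & binds tighter than ==,
# so df['a'] == x & df['b'] == y is parsed as df['a'] == (x & df['b']) == y.
filter_df = df_test[(df_test['grid_id'] == row['grid_id']) &
                    (df_test['case_id'] == row['case_id'])]
print(filter_df)
```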
I am looping through rows of a pandas df, loop index i.
I am able to assign several columns using the ix function with the loop index as first parameter, column name as second.
However, when I try to retrieve/print using this method,
print(df.ix[i,"Run"])
I get the following TypeError: 'str' object cannot be interpreted as an integer,
somehow related to KeyError: 'Run'.
Not quite sure why this is occurring, as Run is indeed a column in the dataframe.
Any suggestions?
Traceback (most recent call last):
File "C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexes\base.py", line 3124, in get_value
return libindex.get_value_box(s, key)
File "pandas\_libs\index.pyx", line 55, in pandas._libs.index.get_value_box
File "pandas\_libs\index.pyx", line 63, in pandas._libs.index.get_value_box
TypeError: 'str' object cannot be interpreted as an integer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\...", line 365, in <module>
print(df.ix[i,"Run"])
File "C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexing.py", line 116, in __getitem__
return self._getitem_tuple(key)
File "C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexing.py", line 870, in _getitem_tuple
return self._getitem_lowerdim(tup)
File "C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexing.py", line 1027, in _getitem_lowerdim
return getattr(section, self.name)[new_key]
File "C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexing.py", line 122, in __getitem__
return self._getitem_axis(key, axis=axis)
File "C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexing.py", line 1116, in _getitem_axis
return self._get_label(key, axis=axis)
File "C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexing.py", line 136, in _get_label
return self.obj[label]
File "C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\series.py", line 767, in __getitem__
result = self.index.get_value(self, key)
File "C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexes\base.py", line 3132, in get_value
raise e1
File "C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexes\base.py", line 3118, in get_value
tz=getattr(series.dtype, 'tz', None))
File "pandas\_libs\index.pyx", line 106, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 114, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Run'
Upon changing the column I print to any other column, it does work correctly. Earlier in the code, I "compressed" the rows, which had multiple rows per unique string in the 'Run' column, using the following:
df=df.groupby('Run').max()
Did this last line somehow remove the column/column name from the table?
ix has been deprecated. ix has always been ambiguous: does ix[10] refer to the row with the label 10, or the row at position 10?
Use loc or iloc instead:
df.loc[i, "Run"] = ...                       # by label
df.iloc[i, df.columns.get_loc("Run")] = ...  # by position (avoids chained assignment)
As for the groupby removing Run: it moves Run to the index of the data frame. To get it back as a column, call reset_index:
df=df.groupby('Run').max().reset_index()
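A quick check of that behavior with a toy frame (several rows per unique 'Run' value):

```python
import pandas as pd

df = pd.DataFrame({'Run': ['r1', 'r1', 'r2'], 'score': [1, 3, 2]})

grouped = df.groupby('Run').max()   # 'Run' is now the index, not a column
restored = grouped.reset_index()    # 'Run' is a regular column again

print('Run' in grouped.columns, 'Run' in restored.columns)
```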
Differences between indexing by label and position:
Suppose you have a series like this:
s = pd.Series(['a', 'b', 'c', 'd', 'e'], index=np.arange(0,9,2))
0 a
2 b
4 c
6 d
8 e
The first column is the labels (aka the index). The second column is the values of the series.
Label based indexing:
s.loc[2] --> b
s.loc[3] --> error. The label doesn't exist
Position based indexing:
s.iloc[2] --> c. since `a` has position 0, `b` has position 1, and so on
s.iloc[3] --> d
According to the documentation, s.ix[3] would have returned d since it first searches for the label 3. When that fails, it falls back to the position 3. On my machine (Pandas 0.24.2), it returns an error, along with a deprecation warning, so I guess the developers changed it to behave like loc.
If you want to use mixed indexing, you have to be explicit about that:
s.loc[3] if 3 in s.index else s.iloc[3]
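A runnable check of the claims above, using the same series:

```python
import numpy as np
import pandas as pd

# Labels 0, 2, 4, 6, 8 with values 'a' through 'e'
s = pd.Series(['a', 'b', 'c', 'd', 'e'], index=np.arange(0, 9, 2))

print(s.loc[2])    # label 2 -> 'b'
print(s.iloc[2])   # position 2 -> 'c'

# Explicit mixed indexing: try the label first, fall back to the position
value = s.loc[3] if 3 in s.index else s.iloc[3]
print(value)       # label 3 doesn't exist -> position 3 -> 'd'
```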
I want to delete any rows containing a specific string in a dataframe. Concretely, I want to delete the rows with an abnormal email address (ending in .jpg).
Here's my code, what's wrong with it?
df = pd.DataFrame({'email':['abc@gmail.com', 'cde@gmail.com', 'ghe@ss.jpg', 'sldkslk@sss.com']})
df
email
0 abc@gmail.com
1 cde@gmail.com
2 ghe@ss.jpg
3 sldkslk@sss.com
for i, r in df.iterrows():
    if df.loc[i,'email'][-3:] == 'com':
        df.drop(df.index[i], inplace=True)
Traceback (most recent call last):
File "<ipython-input-84-4f12d22e5e4c>", line 2, in <module>
if df.loc[i,'email'][-3:] == 'com':
File "C:\Anaconda\lib\site-packages\pandas\core\indexing.py", line 1472, in __getitem__
return self._getitem_tuple(key)
File "C:\Anaconda\lib\site-packages\pandas\core\indexing.py", line 870, in _getitem_tuple
return self._getitem_lowerdim(tup)
File "C:\Anaconda\lib\site-packages\pandas\core\indexing.py", line 998, in _getitem_lowerdim
section = self._getitem_axis(key, axis=i)
File "C:\Anaconda\lib\site-packages\pandas\core\indexing.py", line 1911, in _getitem_axis
self._validate_key(key, axis)
File "C:\Anaconda\lib\site-packages\pandas\core\indexing.py", line 1798, in _validate_key
error()
File "C:\Anaconda\lib\site-packages\pandas\core\indexing.py", line 1785, in error
axis=self.obj._get_axis_name(axis)))
KeyError: 'the label [2] is not in the [index]'
IIUC, you can do this rather than iterating through your frame with iterrows:
df = df[df.email.str.endswith('.com')]
which returns:
>>> df
email
0 abc@gmail.com
1 cde@gmail.com
3 sldkslk@sss.com
Or, for larger dataframes, it's sometimes faster not to use the str methods provided by pandas, but to do it in a plain list comprehension with Python's built-in string methods:
df = df[[i.endswith('.com') for i in df.email]]
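Since the stated goal was to drop the .jpg addresses, the inverse mask expresses that intent directly (a sketch with the same addresses, written with a standard '@'):

```python
import pandas as pd

df = pd.DataFrame({'email': ['abc@gmail.com', 'cde@gmail.com',
                             'ghe@ss.jpg', 'sldkslk@sss.com']})

# Keep every row whose address does NOT end in '.jpg'
df = df[~df.email.str.endswith('.jpg')]
print(df.email.tolist())
```

This also avoids the original pitfall of dropping rows while iterating with iterrows, which invalidates the positions used by df.index[i].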