Pandas IndexingError - python

I'm following a tutorial about bitcoin and pandas where I receive data from a websocket and store it in a dataframe. Everything works fine, but my script randomly throws an error:
Traceback (most recent call last):
File "/home/user/Desktop/BTC/price.py", line 89, in <module>
df = df.loc[df.date >= start_time]
File "/home/user/.local/lib/python3.7/site-packages/pandas/core/indexing.py", line 879, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/home/user/.local/lib/python3.7/site-packages/pandas/core/indexing.py", line 1090, in _getitem_axis
return self._getbool_axis(key, axis=axis)
File "/home/user/.local/lib/python3.7/site-packages/pandas/core/indexing.py", line 896, in _getbool_axis
key = check_bool_indexer(labels, key)
File "/home/user/.local/lib/python3.7/site-packages/pandas/core/indexing.py", line 2183, in check_bool_indexer
"Unalignable boolean Series provided as "
pandas.core.indexing.IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
This is what my code snippet looks like:
df = price['BTCGBP']
start_time = df.date.iloc[-1] - pd.Timedelta(minutes=5)
df = df.loc[df.date >= start_time]
max_price = df.price.max()
I think this is related to the websocket data because it happens totally at random.
I have changed the window from 5 minutes to 1 minute, and the result of the comparison is:
print(df.loc[df.date >= start_time])
                         date     price
0  2021-01-19 18:50:51.724977  27078.59
...
15 2021-01-19 18:51:51.723815  27113.82

df.date >= start_time
This comparison does not return a single True or False; it returns a boolean Series with one value per row (try printing it and you'll see). df.loc[] does accept such a Series as a mask, but it first aligns the mask's index with the DataFrame's index. The error means the two indexes no longer match, which can happen if df is replaced or re-indexed (for example by the websocket handler) between building the mask and using it.
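A minimal sketch that reproduces the alignment failure; the two-row frame and the stale slice are hypothetical, just to trigger the same error:

import pandas as pd

df = pd.DataFrame(
    {'date': pd.to_datetime(['2021-01-19 18:50:51', '2021-01-19 18:51:51']),
     'price': [27078.59, 27113.82]})

mask = df['date'] >= pd.Timestamp('2021-01-19 18:51:00')
print(df.loc[mask])  # fine: mask.index matches df.index

stale = df.iloc[1:]  # a slice whose index covers only part of df's
bad_mask = stale['date'] >= pd.Timestamp('2021-01-19 18:51:00')
df.loc[bad_mask]     # IndexingError: Unalignable boolean Series ...

If the mask can come from an older snapshot of the frame, reindexing it first, e.g. df.loc[bad_mask.reindex(df.index, fill_value=False)], avoids the mismatch.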

Related

Why does a pandas DataFrame allow setting a column using a too-large Series?

Is there a reason why pandas raises a ValueError when setting a DataFrame column using a list, but doesn't do the same when using a Series, resulting in the superfluous Series values being silently ignored (e.g. the 7 in the example below)?
>>> import pandas as pd
>>> df = pd.DataFrame([[1],[2]])
>>> df
   0
0  1
1  2
>>> df[0] = [5,6,7]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "D:\Python310\lib\site-packages\pandas\core\frame.py", line 3655, in __setitem__
self._set_item(key, value)
File "D:\Python310\lib\site-packages\pandas\core\frame.py", line 3832, in _set_item
value = self._sanitize_column(value)
File "D:\Python310\lib\site-packages\pandas\core\frame.py", line 4529, in _sanitize_column
com.require_length_match(value, self.index)
File "D:\Python310\lib\site-packages\pandas\core\common.py", line 557, in require_length_match
raise ValueError(
ValueError: Length of values (3) does not match length of index (2)
>>>
>>> df[0] = pd.Series([5,6,7])
>>> df
   0
0  5
1  6
Tested using python 3.10.6 and pandas 1.5.3 on Windows 10.
You're right that the behaviour is different between a list and a Series, but it's expected.
If you take a look at the source code in the frame.py module, you will see that if the value is a list then pandas checks its length, whereas a Series is not length-checked: it is aligned (reindexed) to the DataFrame's index, and as you observed, when the Series is larger the values at the extra labels are dropped.
NOTE: The details of the truncation are covered here
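A small sketch of that alignment step (illustrative only):

import pandas as pd

df = pd.DataFrame([[1], [2]])
s = pd.Series([5, 6, 7])

# Assigning a Series aligns on the index: only the labels present in
# df.index (here 0 and 1) survive; the value at label 2 is dropped.
print(s.reindex(df.index))  # what effectively gets assigned

df[0] = s  # no length check, alignment instead
print(df)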

Python3 Pandas - handle overflow when casting values larger than int64 can hold

I am writing a standard script where I fetch data from a database, do some manipulation, and insert the data back into another table.
I am facing an overflow issue while converting a column's type in the DataFrame.
Here's an example:
import numpy as np
import pandas as pd
d = {'col1': ['66666666666666666666666666666']}
df = pd.DataFrame(data=d)
df['col1'] = df['col1'].astype('int64')
print(df)
Error:
Traceback (most recent call last):
File "HelloWorld.py", line 6, in <module>
df['col1'] = df['col1'].astype('int64')
File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 5548, in astype
new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors,)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 604, in astype
return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 409, in apply
applied = getattr(b, f)(**kwargs)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/blocks.py", line 595, in astype
values = astype_nansafe(vals1d, dtype, copy=True)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/dtypes/cast.py", line 974, in astype_nansafe
return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
File "pandas/_libs/lib.pyx", line 615, in pandas._libs.lib.astype_intsafe
OverflowError: Python int too large to convert to C long
I cannot control the values inside d['col1'] because in the actual code they are generated by another function.
How can I solve this problem if I want to keep the final data type as 'int64'?
I was thinking of catching the exception and then assigning the largest int64 value to the whole column, but then the rows that are not overflowing would give inconsistent results.
Can you advise me on an elegant solution here?
Building on your idea, you can use np.iinfo:
ii64 = np.iinfo(np.int64)
df['col1'] = df['col1'].astype('float128').clip(ii64.min, ii64.max).astype('int64')
print(df)
# Output
                  col1
0  9223372036854775807
Take care with the limits of float128 too :-D
>>> np.finfo(np.float128)
finfo(resolution=1e-18, min=-1.189731495357231765e+4932, max=1.189731495357231765e+4932, dtype=float128)
>>> np.iinfo('int64')
iinfo(min=-9223372036854775808, max=9223372036854775807, dtype=int64)
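If float128 is not available on your platform (NumPy does not provide it on Windows, for example), here is a sketch of an alternative that clamps with Python's arbitrary-precision integers before converting; the extra '42' row is only there to show that non-overflowing values survive intact:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['66666666666666666666666666666', '42']})

ii64 = np.iinfo(np.int64)
# Parse with Python ints (which cannot overflow), clamp to the int64
# range, and only then convert, so the astype is safe.
df['col1'] = (df['col1']
              .map(int)
              .map(lambda v: max(ii64.min, min(ii64.max, v)))
              .astype('int64'))
print(df)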

Error with date_range using datetime subselection

I need to create a vector of dates with pd.date_range, specifying the min and the max of the date values.
The date values come from a subselection performed on a DataFrame object ds.
This is the code I wrote.
Note that Date in ds is obtained from:
ds = pd.read_excel("data.xlsx",sheet_name='all') # Read the Excel file
ds['Date'] = pd.to_datetime(ds['Date'], infer_datetime_format=True)
This is the part inside a for loop where x iterates over a list of names:
for x in lofNames:
    date_tmp = ds.loc[ds['Security Name']==x, ['Date']]
    mindate = date_tmp.min()
    maxdate = date_tmp.max()
    date = pd.date_range(start=mindate, end=maxdate, freq='D')
This is the error I get:
Traceback (most recent call last):
File "<ipython-input-8-1f56d07b5a74>", line 4, in <module>
date = pd.date_range(start=mindate, end=maxdate, freq='D')
File "/Users/marco/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/datetimes.py", line 1180, in date_range
**kwargs,
File "/Users/marco/opt/anaconda3/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 365, in _generate_range
start = Timestamp(start)
File "pandas/_libs/tslibs/timestamps.pyx", line 418, in pandas._libs.tslibs.timestamps.Timestamp.__new__
File "pandas/_libs/tslibs/conversion.pyx", line 329, in pandas._libs.tslibs.conversion.convert_to_tsobject
TypeError: Cannot convert input [Date 2007-01-09
dtype: datetime64[ns]] of type <class 'pandas.core.series.Series'> to Timestamp
What's wrong?
Thank you.
Here a one-column DataFrame is returned instead of a Series, so the subsequent min and max return one-item Series instead of scalars, and that is why the error is raised:
date_tmp = ds.loc[ds['Security Name']==x,['Date']]
The correct way is to remove the []:
date_tmp = ds.loc[ds['Security Name']==x,'Date']
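A quick sketch of the difference, with toy data just to show the types involved:

import pandas as pd

ds = pd.DataFrame({'Security Name': ['A', 'A'],
                   'Date': pd.to_datetime(['2007-01-09', '2007-03-01'])})

as_frame = ds.loc[ds['Security Name'] == 'A', ['Date']]  # DataFrame
print(type(as_frame.min()))   # a one-item Series, not a scalar

as_series = ds.loc[ds['Security Name'] == 'A', 'Date']   # Series
print(type(as_series.min()))  # a Timestamp, which date_range accepts

date = pd.date_range(start=as_series.min(), end=as_series.max(), freq='D')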

Filter pandas df multiple columns from a pandas series

I have a dataframe from which I have to retrieve the unique values in order to create some partitioning. I have that part working, and I can get a small dataframe in which each row is a certain partition. The challenge is that I then need to filter the original dataframe down to only the appropriate data for each partition (without modifying the original frame, so I can filter it for all the values) so I can send it to S3.
I am having trouble filtering the dataframe based on the series from the small dataframe.
Here is my code:
df_partitions = df.groupby(['grid_id', 'case_id', 'snapshot_year', 'snapshot_month', 'snapshot_day']).size().reset_index()
df_parts = df_partitions[['grid_id', 'case_id', 'snapshot_year', 'snapshot_month', 'snapshot_day']]
for index, row in df_parts.iterrows():
    dest_key_name = '/rec/{}/{}/{}/{}/{}/{}/{}'.format(row['grid_id'], row['case_id'],
                                                       row['snapshot_year'], row['snapshot_month'],
                                                       row['snapshot_day'], file_partition_time,
                                                       'df.csv')
    df_test = df
    filter_df = df_test[(df_test['grid_id'] == row['grid_id'] &
                         df_test['case_id'] == row['case_id'] &
                         df_test['snapshot_year'] == row['snapshot_year'] &
                         df_test['snapshot_month'] == row['snapshot_month'] &
                         df_test['snapshot_day'] == row['snapshot_day'])]
    print(filter_df)
Here is the error:
Traceback (most recent call last):
File "<input>", line 8, in <module>
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/ops.py", line 954, in wrapper
na_op(self.values, other),
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/ops.py", line 924, in na_op
raise TypeError(msg)
TypeError: cannot compare a dtyped [object] array with a scalar of type [bool]
I also tried:
filters_df = df[row]
Here is the error:
KeyError: "['pjm' 'base' 2020 2 21] not in index"
and:
df_test = df
i1 = df_test.set_index(row).index
i2 = df_parts.set_index(row).index
filter_df = df_test[~i1.isin(i2)]
Here is the error:
Traceback (most recent call last):
File "<input>", line 7, in <module>
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/frame.py", line 3164, in set_index
frame.index = index
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/generic.py", line 3627, in __setattr__
return object.__setattr__(self, name, value)
File "pandas/_libs/properties.pyx", line 69, in pandas._libs.properties.AxisProperty.__set__
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/generic.py", line 559, in _set_axis
self._data.set_axis(axis, labels)
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/internals.py", line 3074, in set_axis
(old_len, new_len))
ValueError: Length mismatch: Expected axis has 130 elements, new values have 5 elements
Very simple solution here. The format for filtering on multiple criteria is df[(...) & (...)], while you are trying df[(... & ...)]. Because & binds more tightly than ==, each comparison needs its own parentheses; close out those parentheses where you're setting filter_df.
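Applied to the loop above, the assignment would become:

filter_df = df_test[(df_test['grid_id'] == row['grid_id']) &
                    (df_test['case_id'] == row['case_id']) &
                    (df_test['snapshot_year'] == row['snapshot_year']) &
                    (df_test['snapshot_month'] == row['snapshot_month']) &
                    (df_test['snapshot_day'] == row['snapshot_day'])]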

Index out of bounds when using iterrows() - how is this possible?

I got this error message:
5205
(5219, 25)
5221
(5219, 25)
Traceback (most recent call last):
File "/Users/Chu/Documents/dssg2018/sa4.py", line 44, in <module>
df.loc[idx,word]=len(df.iloc[indices[idx]][df[word]==1])/\
IndexError: index 5221 is out of bounds for axis 0 with size 5219
When I'm traversing the data frame, the index comes from the iterator, so I don't understand how this is even possible: idx comes directly from the dataframe.
bt = BallTree(df[['lat','lng']], metric="haversine")
indices = bt.query_radius(df[['lat','lng']], r=(float(10)/40000)*360)
for idx, row in df.iterrows():
    for word in bag_of_words:
        if word in row['caption']:
            print(idx)
            print(df.shape)
            df.loc[idx,word] = len(df.iloc[indices[idx]][df[word]==1])/\
                np.max([1,len(df.iloc[indices[idx]][df[word]!=1])])
Changing iloc to loc gives:
/Library/Frameworks/Python.framework/Versions/3.6/bin/python3.6 /Users/Chu/Documents/dssg2018/sa4.py
(-124.60334244261675, 49.36453144316216, -121.67106179949566, 50.863501888419826)
27
(5219, 25)
/Users/Chu/Documents/dssg2018/sa4.py:42: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
df.loc[idx,word]=len(df.loc[indices[idx]][df[word]==1])/\
/Users/Chu/Documents/dssg2018/sa4.py:42: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
df.loc[idx,word]=len(df.loc[indices[idx]][df[word]==1])/\
Traceback (most recent call last):
File "/Users/Chu/Documents/dssg2018/sa4.py", line 42, in <module>
df.loc[idx,word]=len(df.loc[indices[idx]][df[word]==1])/\
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 2133, in __getitem__
return self._getitem_array(key)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 2173, in _getitem_array
key = check_bool_indexer(self.index, key)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexing.py", line 2023, in check_bool_indexer
raise IndexingError('Unalignable boolean Series provided as '
pandas.core.indexing.IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match
Your index does not run from 0 to len(df)-1, which makes df.iloc[idx] go out of bounds.
For example:
df = pd.DataFrame({'a': [0, 1]}, index=[1, 100])
for idx, row in df.iterrows():
    print(idx)
    print(row)

1
a    0
Name: 1, dtype: int64
100
a    1
Name: 100, dtype: int64
Then when you do
df.iloc[100]
IndexError: single positional indexer is out-of-bounds
But when you do .loc you get the expected output:
df.loc[100]
Out[23]:
a    1
Name: 100, dtype: int64
From the docs:
.iloc[] is primarily integer position based
.loc[] is primarily label based
Solution: use .loc, or reset the index first with df = df.reset_index(drop=True)
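A minimal sketch of both fixes, continuing the toy frame from above:

import pandas as pd

df = pd.DataFrame({'a': [0, 1]}, index=[1, 100])

# Fix 1: label-based access matches the labels that iterrows() yields
for idx, row in df.iterrows():
    print(df.loc[idx, 'a'])

# Fix 2: reset the index so positions and labels coincide; iloc is then safe
df = df.reset_index(drop=True)
for idx, row in df.iterrows():
    print(df.iloc[idx]['a'])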
