Error with date_range using datetime subselection - python

I need to create a vector of dates with pd.date_range specifying the min and the max for date values.
Date values come from a subselection performed on a dataframe object ds.
This is the code I wrote:
Note that Date in ds are obtained from
ds = pd.read_excel("data.xlsx",sheet_name='all') # Read the Excel file
ds['Date'] = pd.to_datetime(ds['Date'], infer_datetime_format=True)
This is the part inside a for loop where x loops on a list of Names.
for x in lofNames:
date_tmp = ds.loc[ds['Security Name']==x,['Date']]
mindate = date_tmp.min()
maxdate = date_tmp.max()
date = pd.date_range(start=mindate, end=maxdate, freq='D')
This is the error I get:
Traceback (most recent call last):
File "<ipython-input-8-1f56d07b5a74>", line 4, in <module>
date = pd.date_range(start=mindate, end=maxdate, freq='D')
File "/Users/marco/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/datetimes.py", line 1180, in date_range
**kwargs,
File "/Users/marco/opt/anaconda3/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 365, in _generate_range
start = Timestamp(start)
File "pandas/_libs/tslibs/timestamps.pyx", line 418, in pandas._libs.tslibs.timestamps.Timestamp.__new__
File "pandas/_libs/tslibs/conversion.pyx", line 329, in pandas._libs.tslibs.conversion.convert_to_tsobject
TypeError: Cannot convert input [Date 2007-01-09
dtype: datetime64[ns]] of type <class 'pandas.core.series.Series'> to Timestamp
What's wrong?
thank you

Here is returned one column DataFrame instead Series, so next min and max returned one item Series instead scalar, so error is raised:
date_tmp = ds.loc[ds['Security Name']==x,['Date']]
Correct way is removed []:
date_tmp = ds.loc[ds['Security Name']==x,'Date']

Related

python - Getting error while taking difference between two dates in columns

this is my code, I am trying to get business days between two dates
the number of days is saved in a new column 'nd'
import numpy as np
df1 = pd.DataFrame(pd.date_range('2020-01-01',periods=26,freq='D'),columns=['A'])
df2 = pd.DataFrame(pd.date_range('2020-02-01',periods=26,freq='D'),columns=['B'])
df = pd.concat([df1,df2],axis=1)
# Iterate over each row of the DataFrame
for index , row in df.iterrows():
bc = np.busday_count(row['A'],row['B'])
df['nd'] = bc
I am getting this error.
Traceback (most recent call last):
File "<input>", line 35, in <module>
File "<__array_function__ internals>", line 5, in busday_count
TypeError: Iterator operand 0 dtype could not be cast from dtype('<M8[us]') to dtype('<M8[D]') according to the rule 'safe'
Is there a way to fix it or another way to get the solution?
busday_count only accepts dates, not datetimes change
bc = np.busday_count(row['A'],row['B'])
to
np.busday_count(row['A'].date(), row['B'].date())

Pandas IndexingError

I'm following a tutorial about bitcoin and pandas where I'm receiving a data from websocket and storing in a dataframe. Everything is working fine but randomly my script is throwing an error:
Traceback (most recent call last):: 26561.29| MIN: 26530.0 | MAX: 26582.691
File "/home/user/Desktop/BTC/price.py", line 89, in <module>
df = df.loc[df.date >= start_time]
File "/home/user/.local/lib/python3.7/site-packages/pandas/core/indexing.py", line 879, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/home/user/.local/lib/python3.7/site-packages/pandas/core/indexing.py", line 1090, in _getitem_axis
return self._getbool_axis(key, axis=axis)
File "/home/user/.local/lib/python3.7/site-packages/pandas/core/indexing.py", line 896, in _getbool_axis
key = check_bool_indexer(labels, key)
File "/home/user/.local/lib/python3.7/site-packages/pandas/core/indexing.py", line 2183, in check_bool_indexer
"Unalignable boolean Series provided as "
pandas.core.indexing.IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
This how my code snippet looks like:
df = price['BTCGBP']
start_time = df.date.iloc[-1] - pd.Timedelta(minutes=5)
df = df.loc[df.date >= start_time]
max_price = df.price.max()
I think this is related to websocket data because is totally random.
I have changed from 5 minutes to 1 min. and the result of this comparison is:
print(df.loc[df.date >= start_time])
date price
0 2021-01-19 18:50:51.724977 27078.59
until
15 2021-01-19 18:51:51.723815 27113.82
df.date >= start_time
This part is a boolean comparison and returns either True or False. Try printing the result and you'll see it's a boolean value. However, df.loc[]
expects a row number in the form of an integer. What is the intended output for this comparison?

Object of type 'float' has no len() error when slicing pandas dataframe json column

I have data that looks like this. In each column, there are value/keys of varying different lengths. Some rows are also NaN.
like match
0 [{'timestamp', 'type'}] [{'timestamp', 'type'}]
1 [{'timestamp', 'comment', 'type'}] [{'timestamp', 'type'}]
2 NaN NaN
I want to split these lists into their own columns. I want to keep all the data (and make it NaN if it is missing). I've tried following this tutorial and doing this:
df1 = pd.DataFrame(df['like'].values.tolist())
df1.columns = 'like_'+ df1.columns
df2 = pd.DataFrame(df['match'].values.tolist())
df2.columns = 'match_'+ df2.columns
col = df.columns.difference(['like','match'])
df = pd.concat([df[col], df1, df2],axis=1)
I get this error.
Traceback (most recent call last):
File "link to my file", line 12, in <module>
df1 = pd.DataFrame(df['like'].values.tolist())
File "/usr/local/lib/python3.9/site-packages/pandas/core/frame.py", line 509, in __init__
arrays, columns = to_arrays(data, columns, dtype=dtype)
File "/usr/local/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 524, in to_arrays
return _list_to_arrays(data, columns, coerce_float=coerce_float, dtype=dtype)
File "/usr/local/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 561, in _list_to_arrays
content = list(lib.to_object_array(data).T)
File "pandas/_libs/lib.pyx", line 2448, in pandas._libs.lib.to_object_array
TypeError: object of type 'float' has no len()
Can someone help me understand what I'm doing wrong?
You can't perform values.tolist() on NaN. If you delete that row of NaNs, you can get past this issue. but then your prefix line fails. See this for prefixes.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.add_prefix.html

Filter pandas df multiple columns from a pandas series

I have a dataframe that I have to retrieve the unique values out of in order to create some partitioning. I have that part and I can get a small dataframe with each row being a certain partition. The challenge I have is that I then need to filter the original dataframe to only the appropriate data (without modifying the original frame so I can filter all the values) so I can send it to S3.
I am having trouble filtering the dataframe based on the series from the small dataframe.
here is my code:
df_partitions = df.groupby(['grid_id', 'case_id', 'snapshot_year', 'snapshot_month', 'snapshot_day']).size().reset_index()
df_parts = df_partitions[['grid_id', 'case_id', 'snapshot_year', 'snapshot_month', 'snapshot_day']]
for index, row in df_parts.iterrows() :
dest_key_name = '/rec/{}/{}/{}/{}/{}/{}/{}'.format(row['grid_id'], row['case_id'],
row['snapshot_year'], row['snapshot_month'],
row['snapshot_day'], file_partition_time,
'df.csv')
df_test = df
filter_df = df_test[(df_test['grid_id'] == row['grid_id'] &
df_test['case_id'] == row['case_id'] &
df_test['snapshot_year'] == row['snapshot_year'] &
df_test['snapshot_month'] == row['snapshot_month'] &
df_test['snapshot_day'] == row['snapshot_day'])]
print(filter_df)
here is the error:
Traceback (most recent call last):
File "<input>", line 8, in <module>
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/ops.py", line 954, in wrapper
na_op(self.values, other),
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/ops.py", line 924, in na_op
raise TypeError(msg)
TypeError: cannot compare a dtyped [object] array with a scalar of type [bool]
I also tried
filters_df = df[row]
here is the error:
KeyError: "['pjm' 'base' 2020 2 21] not in index"
and
df_test = df
i1 = df_test.set_index(row).index
i2 = df_parts.set_index(row).index
filter_df = df_test[~i1.isin(i2)]
here is the error:
Traceback (most recent call last):
File "<input>", line 7, in <module>
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/frame.py", line 3164, in set_index
frame.index = index
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/generic.py", line 3627, in __setattr__
return object.__setattr__(self, name, value)
File "pandas/_libs/properties.pyx", line 69, in pandas._libs.properties.AxisProperty.__set__
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/generic.py", line 559, in _set_axis
self._data.set_axis(axis, labels)
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/internals.py", line 3074, in set_axis
(old_len, new_len))
ValueError: Length mismatch: Expected axis has 130 elements, new values have 5 elements
Very simple solution here. The format for filtering on multiple criteria is df[(...)&(...)], while you are trying df[(... & ... )]. Close out those parentheses where you're setting filter_df.

panda tseries convertion not working

I am new to python and I am trying to build Time Series through this. I am trying to convert this csv data into time series, however by the internet and stack research, 'result' should have had
<class 'pandas.tseries.index.DatetimeIndex'>,
but my output is not converted time series. Why is it not converting? How do I convert it? Thanks for the help in advance.
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
data = pd.read_csv('somedata.csv')
print data.head()
#selecting specific columns by column name
df1 = data[['a','b']]
#converting the data to time series
dates = pd.date_range('2015-01-01', '2015-12-31', freq='H')
dates #preview
results:
DatetimeIndex(['2015-01-01 00:00:00', '2015-01-01 01:00:00',
...
'2015-12-31 23:00:00', '2015-12-31 00:00:00'],
dtype='datetime64[ns]', length=2161, freq='H')
Above is working, however I get error below:
df1 = Series(df1[:,2], index=dates)
output:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'Series' is not defined
After attempting the pd.Series...
df1 = pd.Series(df1[:,2], index=dates)
Error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/someid/miniconda2/lib/python2.7/site- packages/pandas/core/frame.py", line 1992, in __getitem__
return self._getitem_column(key)
File "/home/someid/miniconda2/lib/python2.7/site- packages/pandas/core/frame.py", line 1999, in _getitem_column
return self._get_item_cache(key)
File "/home/someid/miniconda2/lib/python2.7/site- packages/pandas/core/generic.py", line 1343, in _get_item_cache
res = cache.get(item)
TypeError: unhashable type
you do need to have the pd.Series. However, you were also doing something else wrong. I'm assuming you want to get all rows, 2nd column of df1 and return a pd.Series with an index of dates.
Solution
df1 = pd.Series(df1.iloc[:, 1], index=dates)
Explanation
df1.iloc is used to return the slice of df1 by row/column postitioning
[:, 1] gets all rows, 2nd columns
Also, df1.iloc[:, 1] returns a pd.Series and can be passed into the pd.Series constructor.

Categories

Resources