python - Getting error while taking difference between two dates in columns

This is my code; I am trying to get the number of business days between two dates.
The count is saved in a new column 'nd'.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(pd.date_range('2020-01-01',periods=26,freq='D'),columns=['A'])
df2 = pd.DataFrame(pd.date_range('2020-02-01',periods=26,freq='D'),columns=['B'])
df = pd.concat([df1,df2],axis=1)
# Iterate over each row of the DataFrame
for index, row in df.iterrows():
    bc = np.busday_count(row['A'], row['B'])
    df['nd'] = bc
I am getting this error.
Traceback (most recent call last):
File "<input>", line 35, in <module>
File "<__array_function__ internals>", line 5, in busday_count
TypeError: Iterator operand 0 dtype could not be cast from dtype('<M8[us]') to dtype('<M8[D]') according to the rule 'safe'
Is there a way to fix it or another way to get the solution?

np.busday_count only accepts dates, not datetimes. Change
bc = np.busday_count(row['A'],row['B'])
to
bc = np.busday_count(row['A'].date(), row['B'].date())
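A vectorized alternative (a sketch built on the question's own data, not part of the original answer) is to cast both columns to datetime64[D] and let np.busday_count process whole arrays, avoiding iterrows entirely:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(pd.date_range('2020-01-01', periods=26, freq='D'), columns=['A'])
df2 = pd.DataFrame(pd.date_range('2020-02-01', periods=26, freq='D'), columns=['B'])
df = pd.concat([df1, df2], axis=1)

# Cast datetime64[ns] down to datetime64[D] so busday_count accepts it,
# then compute the whole 'nd' column in a single vectorized call.
df['nd'] = np.busday_count(df['A'].values.astype('datetime64[D]'),
                           df['B'].values.astype('datetime64[D]'))
```

This also fixes a bug in the original loop: assigning df['nd'] = bc inside the loop overwrote the entire column with a scalar on every iteration.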

Related

Why does pandas DataFrame allow setting a column using a too-large Series?

Is there a reason why pandas raises a ValueError exception when setting a DataFrame column using a list, but doesn't do the same when using a Series, resulting in the superfluous Series values being ignored (e.g. 7 in the example below)?
>>> import pandas as pd
>>> df = pd.DataFrame([[1],[2]])
>>> df
   0
0  1
1  2
>>> df[0] = [5,6,7]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "D:\Python310\lib\site-packages\pandas\core\frame.py", line 3655, in __setitem__
self._set_item(key, value)
File "D:\Python310\lib\site-packages\pandas\core\frame.py", line 3832, in _set_item
value = self._sanitize_column(value)
File "D:\Python310\lib\site-packages\pandas\core\frame.py", line 4529, in _sanitize_column
com.require_length_match(value, self.index)
File "D:\Python310\lib\site-packages\pandas\core\common.py", line 557, in require_length_match
raise ValueError(
ValueError: Length of values (3) does not match length of index (2)
>>>
>>> df[0] = pd.Series([5,6,7])
>>> df
   0
0  5
1  6
Tested using python 3.10.6 and pandas 1.5.3 on Windows 10.
You're right that the behaviour is different between a list and a Series, but it's expected.
If you take a look at the source code in the frame.py module, you will see that if the value is a list, its length is checked; a Series is not length-checked, and as you observed, when the Series is larger it is truncated.
NOTE: The details of the Series truncation are here
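A minimal sketch of that difference, using the question's own toy frame:

```python
import pandas as pd

df = pd.DataFrame([[1], [2]])

# A plain list must match the length of the index exactly.
try:
    df[0] = [5, 6, 7]
except ValueError as e:
    print(e)  # Length of values (3) does not match length of index (2)

# A Series is aligned on df's index instead; label 2 (value 7) has no
# matching row in df, so it is silently dropped.
df[0] = pd.Series([5, 6, 7])
print(df[0].tolist())  # [5, 6]
```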

Object of type 'float' has no len() error when slicing pandas dataframe json column

I have data that looks like this. In each column, there are value/keys of varying different lengths. Some rows are also NaN.
                                  like                     match
0              [{'timestamp', 'type'}]  [{'timestamp', 'type'}]
1  [{'timestamp', 'comment', 'type'}]   [{'timestamp', 'type'}]
2                                  NaN                      NaN
I want to split these lists into their own columns. I want to keep all the data (and make it NaN if it is missing). I've tried following this tutorial and doing this:
df1 = pd.DataFrame(df['like'].values.tolist())
df1.columns = 'like_'+ df1.columns
df2 = pd.DataFrame(df['match'].values.tolist())
df2.columns = 'match_'+ df2.columns
col = df.columns.difference(['like','match'])
df = pd.concat([df[col], df1, df2],axis=1)
I get this error.
Traceback (most recent call last):
File "link to my file", line 12, in <module>
df1 = pd.DataFrame(df['like'].values.tolist())
File "/usr/local/lib/python3.9/site-packages/pandas/core/frame.py", line 509, in __init__
arrays, columns = to_arrays(data, columns, dtype=dtype)
File "/usr/local/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 524, in to_arrays
return _list_to_arrays(data, columns, coerce_float=coerce_float, dtype=dtype)
File "/usr/local/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 561, in _list_to_arrays
content = list(lib.to_object_array(data).T)
File "pandas/_libs/lib.pyx", line 2448, in pandas._libs.lib.to_object_array
TypeError: object of type 'float' has no len()
Can someone help me understand what I'm doing wrong?
You can't perform values.tolist() on NaN. If you delete that row of NaNs, you can get past this issue, but then your prefix line fails. See this for prefixes:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.add_prefix.html
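A sketch of that workflow, with dict-valued cells standing in for the question's list data (the cell contents here are invented for illustration):

```python
import pandas as pd

# Toy frame shaped like the question: dict-valued cells plus a NaN row.
df = pd.DataFrame({
    'like': [{'timestamp': 1, 'type': 'a'}, None],
    'match': [{'timestamp': 2, 'type': 'b'}, None],
})

# Drop the NaN rows before calling tolist(), keep the original index so
# the expanded columns still line up with the source rows, then prefix.
like = df['like'].dropna()
df1 = pd.DataFrame(like.tolist(), index=like.index).add_prefix('like_')
print(list(df1.columns))  # ['like_timestamp', 'like_type']
```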

Error with date_range using datetime subselection

I need to create a vector of dates with pd.date_range specifying the min and the max for date values.
Date values come from a subselection performed on a dataframe object ds.
This is the code I wrote:
Note that the Date values in ds are obtained from:
ds = pd.read_excel("data.xlsx",sheet_name='all') # Read the Excel file
ds['Date'] = pd.to_datetime(ds['Date'], infer_datetime_format=True)
This is the part inside a for loop where x loops on a list of Names.
for x in lofNames:
    date_tmp = ds.loc[ds['Security Name']==x,['Date']]
    mindate = date_tmp.min()
    maxdate = date_tmp.max()
    date = pd.date_range(start=mindate, end=maxdate, freq='D')
This is the error I get:
Traceback (most recent call last):
File "<ipython-input-8-1f56d07b5a74>", line 4, in <module>
date = pd.date_range(start=mindate, end=maxdate, freq='D')
File "/Users/marco/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/datetimes.py", line 1180, in date_range
**kwargs,
File "/Users/marco/opt/anaconda3/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 365, in _generate_range
start = Timestamp(start)
File "pandas/_libs/tslibs/timestamps.pyx", line 418, in pandas._libs.tslibs.timestamps.Timestamp.__new__
File "pandas/_libs/tslibs/conversion.pyx", line 329, in pandas._libs.tslibs.conversion.convert_to_tsobject
TypeError: Cannot convert input [Date 2007-01-09
dtype: datetime64[ns]] of type <class 'pandas.core.series.Series'> to Timestamp
What's wrong?
Thank you.
Here a one-column DataFrame is returned instead of a Series, so min and max then return a one-item Series instead of a scalar, which raises the error:
date_tmp = ds.loc[ds['Security Name']==x,['Date']]
The correct way is to remove the []:
date_tmp = ds.loc[ds['Security Name']==x,'Date']
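Put together, the loop then works because min()/max() return scalar Timestamps. A sketch with invented toy data standing in for ds:

```python
import pandas as pd

# Toy stand-in for ds; column names match the question.
ds = pd.DataFrame({
    'Security Name': ['X', 'X', 'Y'],
    'Date': pd.to_datetime(['2020-01-01', '2020-01-05', '2020-03-01']),
})

for x in ['X']:
    # 'Date' without brackets yields a Series, so min()/max() are scalars
    date_tmp = ds.loc[ds['Security Name'] == x, 'Date']
    date = pd.date_range(start=date_tmp.min(), end=date_tmp.max(), freq='D')
```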

Filter pandas df multiple columns from a pandas series

I have a dataframe that I have to retrieve the unique values out of in order to create some partitioning. I have that part and I can get a small dataframe with each row being a certain partition. The challenge I have is that I then need to filter the original dataframe to only the appropriate data (without modifying the original frame so I can filter all the values) so I can send it to S3.
I am having trouble filtering the dataframe based on the series from the small dataframe.
here is my code:
df_partitions = df.groupby(['grid_id', 'case_id', 'snapshot_year', 'snapshot_month', 'snapshot_day']).size().reset_index()
df_parts = df_partitions[['grid_id', 'case_id', 'snapshot_year', 'snapshot_month', 'snapshot_day']]
for index, row in df_parts.iterrows():
    dest_key_name = '/rec/{}/{}/{}/{}/{}/{}/{}'.format(row['grid_id'], row['case_id'],
                                                       row['snapshot_year'], row['snapshot_month'],
                                                       row['snapshot_day'], file_partition_time,
                                                       'df.csv')
    df_test = df
    filter_df = df_test[(df_test['grid_id'] == row['grid_id'] &
                         df_test['case_id'] == row['case_id'] &
                         df_test['snapshot_year'] == row['snapshot_year'] &
                         df_test['snapshot_month'] == row['snapshot_month'] &
                         df_test['snapshot_day'] == row['snapshot_day'])]
    print(filter_df)
here is the error:
Traceback (most recent call last):
File "<input>", line 8, in <module>
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/ops.py", line 954, in wrapper
na_op(self.values, other),
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/ops.py", line 924, in na_op
raise TypeError(msg)
TypeError: cannot compare a dtyped [object] array with a scalar of type [bool]
I also tried
filters_df = df[row]
here is the error:
KeyError: "['pjm' 'base' 2020 2 21] not in index"
and
df_test = df
i1 = df_test.set_index(row).index
i2 = df_parts.set_index(row).index
filter_df = df_test[~i1.isin(i2)]
here is the error:
Traceback (most recent call last):
File "<input>", line 7, in <module>
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/frame.py", line 3164, in set_index
frame.index = index
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/generic.py", line 3627, in __setattr__
return object.__setattr__(self, name, value)
File "pandas/_libs/properties.pyx", line 69, in pandas._libs.properties.AxisProperty.__set__
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/generic.py", line 559, in _set_axis
self._data.set_axis(axis, labels)
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/internals.py", line 3074, in set_axis
(old_len, new_len))
ValueError: Length mismatch: Expected axis has 130 elements, new values have 5 elements
Very simple solution here. The format for filtering on multiple criteria is df[(...)&(...)], while you are trying df[(... & ... )]. Close out those parentheses where you're setting filter_df.
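For example, with invented values for a subset of the partition columns, each comparison gets its own parentheses before being combined with &:

```python
import pandas as pd

# Minimal frame with some of the question's partition columns (values invented).
df = pd.DataFrame({
    'grid_id': ['pjm', 'pjm', 'ercot'],
    'case_id': ['base', 'high', 'base'],
    'snapshot_year': [2020, 2020, 2020],
})
row = {'grid_id': 'pjm', 'case_id': 'base', 'snapshot_year': 2020}

# Wrap each comparison in its own parentheses; & binds tighter than ==,
# which is why the original df[(... & ...)] form raised a TypeError.
filter_df = df[(df['grid_id'] == row['grid_id']) &
               (df['case_id'] == row['case_id']) &
               (df['snapshot_year'] == row['snapshot_year'])]
print(len(filter_df))  # 1
```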

pandas tseries conversion not working

I am new to Python and I am trying to build a time series. I am trying to convert this CSV data into a time series; based on internet and Stack Overflow research, 'result' should have had
<class 'pandas.tseries.index.DatetimeIndex'>,
but my output is not a converted time series. Why is it not converting? How do I convert it? Thanks for the help in advance.
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
data = pd.read_csv('somedata.csv')
print data.head()
#selecting specific columns by column name
df1 = data[['a','b']]
#converting the data to time series
dates = pd.date_range('2015-01-01', '2015-12-31', freq='H')
dates #preview
results:
DatetimeIndex(['2015-01-01 00:00:00', '2015-01-01 01:00:00',
...
'2015-12-31 23:00:00', '2015-12-31 00:00:00'],
dtype='datetime64[ns]', length=2161, freq='H')
Above is working, however I get error below:
df1 = Series(df1[:,2], index=dates)
output:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'Series' is not defined
After attempting the pd.Series...
df1 = pd.Series(df1[:,2], index=dates)
Error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/someid/miniconda2/lib/python2.7/site- packages/pandas/core/frame.py", line 1992, in __getitem__
return self._getitem_column(key)
File "/home/someid/miniconda2/lib/python2.7/site- packages/pandas/core/frame.py", line 1999, in _getitem_column
return self._get_item_cache(key)
File "/home/someid/miniconda2/lib/python2.7/site- packages/pandas/core/generic.py", line 1343, in _get_item_cache
res = cache.get(item)
TypeError: unhashable type
You do need to use pd.Series. However, you were also doing something else wrong. I'm assuming you want to get all rows of the 2nd column of df1 and return a pd.Series with an index of dates.
Solution
df1 = pd.Series(df1.iloc[:, 1], index=dates)
Explanation
df1.iloc is used to return a slice of df1 by row/column positioning
[:, 1] gets all rows, 2nd column
Also, df1.iloc[:, 1] returns a pd.Series and can be passed into the pd.Series constructor.
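A runnable sketch of the positional selection, with invented toy data. Note one caveat: in recent pandas versions, passing a Series to the pd.Series constructor aligns it on the new index (yielding NaN here), so taking the raw values first is safer:

```python
import pandas as pd

# Tiny stand-in for df1 (column names invented for illustration).
dates = pd.date_range('2015-01-01', periods=3, freq='D')
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# .iloc[:, 1] selects all rows of the 2nd column by position. Take the
# underlying array before rebuilding, so the values are paired with the
# new DatetimeIndex instead of being aligned against it.
s = pd.Series(df1.iloc[:, 1].to_numpy(), index=dates)
print(s.iloc[0])  # 10
```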
