I'm trying to build a computed column in dask: a datetime assembled from the separate fields year, month, day and hour.
I can't find a way to make it work.
The method below creates a column whose declared type is datetime, but the values inside are not datetimes.
I've tried different formulas, but none of them work.
Python 3.8.10
pandas==1.5.3
dask==2023.1.1
aiohttp==3.8.3
Get the data
# data processing
import dask.dataframe as dd
# web data source
url = "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv"
# read demo data
dtable = dd.read_csv(url)
# print list of columns
print('demo data list of fields : ',dtable.columns)
result:
demo data list of fields : Index(['year', 'month', 'day',
'dep_delay', 'arr_delay', 'carrier', 'origin', 'dest', 'air_time',
'distance', 'hour'], dtype='object')
Then create the field. It looks like it works, but it doesn't:
# create datetime column from the 'year','month','day','hour' fields
dtable['flight_datetime'] = dd.to_datetime(
    (dtable.year * 1000000
     + dtable.month * 10000
     + dtable.day * 100
     + dtable.hour).astype(str),
    format='%Y%m%d%H', errors='ignore')
print('demo data list of fields : ',dtable.columns)
print('demo data fields types : ',dtable.dtypes)
print(dtable.flight_datetime.head())
print(dtable.flight_datetime.dt.year.head())
result:
demo data list of fields : Index(['year', 'month', 'day',
'dep_delay', 'arr_delay', 'carrier', 'origin', 'dest', 'air_time',
'distance', 'hour', 'flight_datetime'], dtype='object')
demo data fields types :
year int64
month int64
day int64
dep_delay int64
arr_delay int64
carrier object
origin object
dest object
air_time int64
distance int64
hour int64
flight_datetime datetime64[ns]
dtype: object
0 2014010109
1 2014010111
2 2014010119
3 2014010107
4 2014010113
Name: flight_datetime, dtype: object
AttributeError: 'Series' object has no attribute 'year'
As @RomanPerekhrest says in the comments, you're not using the correct syntax for dd.to_datetime. The following works for me:
dtable_time = dtable[['year','month','day','hour']]
dtable['flight_datetime'] = dd.to_datetime(dtable_time)
print('demo data list of fields : ', dtable.columns)
print('demo data fields types : ', dtable.dtypes)
print(dtable.flight_datetime.head())
print(dtable.flight_datetime.dt.year.head())
outputs:
demo data list of fields : Index(['year', 'month', 'day', 'dep_delay', 'arr_delay', 'carrier', 'origin',
'dest', 'air_time', 'distance', 'hour', 'flight_datetime'],
dtype='object')
demo data fields types : year int64
month int64
day int64
dep_delay int64
arr_delay int64
carrier object
origin object
dest object
air_time int64
distance int64
hour int64
flight_datetime datetime64[ns]
dtype: object
0 2014-01-01 09:00:00
1 2014-01-01 11:00:00
2 2014-01-01 19:00:00
3 2014-01-01 07:00:00
4 2014-01-01 13:00:00
Name: flight_datetime, dtype: datetime64[ns]
0 2014
1 2014
2 2014
3 2014
4 2014
Name: flight_datetime, dtype: int64
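For reference, dd.to_datetime behaves like pandas.to_datetime here: given a DataFrame whose columns are named year, month and day (optionally plus hour, minute, second), it assembles those components into a single datetime. A minimal pandas sketch of the same pattern, on toy data rather than the flights file:
import pandas as pd

df = pd.DataFrame({'year': [2014, 2014], 'month': [1, 1], 'day': [1, 2], 'hour': [9, 11]})
# the column names tell to_datetime which datetime component each column holds
df['flight_datetime'] = pd.to_datetime(df[['year', 'month', 'day', 'hour']])
print(df['flight_datetime'])
# 0   2014-01-01 09:00:00
# 1   2014-01-02 11:00:00
# Name: flight_datetime, dtype: datetime64[ns]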
Related
I have the data frame below, with dates ranging from 2016-01-01 to 2021-03-27:
timestamp close circulating_supply issuance_native
0 2016-01-01 0.944695 7.389026e+07 26070.31250
1 2016-01-02 0.931646 7.391764e+07 27383.90625
2 2016-01-03 0.962863 7.394532e+07 27675.78125
3 2016-01-04 0.944515 7.397274e+07 27420.62500
4 2016-01-05 0.950312 7.400058e+07 27839.21875
I'm looking to filter this dataframe by Month & Day to look at the circulating supply on December 31st for each year.
Here is an output of the datatypes of the data frame:
timestamp datetime64[ns]
close float64
circulating_supply float64
issuance_native float64
dtype: object
I'm able to pull single rows using this:
ts = pd.to_datetime('2016-12-31')
df.loc[df['timestamp'] == ts]
but I've had no luck passing a list of datetimes inside df.loc[].
The result should look like this, showing the rows for December 31st of each year:
timestamp close circulating_supply issuance_native
0 2016-12-31 0.944695 7.389026e+07 26070.31250
1 2017-12-31 0.931646 7.391764e+07 27383.90625
2 2018-12-31 0.962863 7.394532e+07 27675.78125
3 2019-12-31 0.944515 7.397274e+07 27420.62500
4 2020-12-31 0.950312 7.400058e+07 27839.21875
This is the closest I've gotten, but I get this error:
#query dataframe for the circulating supply at the end of the year
circulating_supply = df.query("timestamp == '2016-12-31' or timestamp =='2017-12-31' or timestamp =='2018-12-31' or timestamp =='2019-12-31' or timestamp =='2020-12-31' or timestamp =='2021-03-01'")
circulating_supply.drop(columns=['close', 'issuance_native'], inplace=True)
circulating_supply.copy()
circulating_supply.head()
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/frame.py:4308: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
return super().drop(
Try something like this:
end_of_year = [
pd.to_datetime(ts)
for ts in [
"2016-12-31",
"2017-12-31",
"2018-12-31",
"2019-12-31",
"2020-12-31",
"2021-03-01",
]
]
end_of_year_df = df.loc[df["timestamp"].isin(end_of_year), :]
circulating_supply = end_of_year_df.drop(columns=["close", "issuance_native"])
circulating_supply.head()
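If the goal is really "December 31st of each year" rather than a fixed list of dates, a sketch using the .dt accessor avoids spelling the dates out (assuming df is the question's dataframe with a datetime64 timestamp column):
# keep rows whose timestamp falls on December 31, whatever the year
eoy_mask = (df["timestamp"].dt.month == 12) & (df["timestamp"].dt.day == 31)
circulating_supply = df.loc[eoy_mask, ["timestamp", "circulating_supply"]]
circulating_supply.head()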
I was able to solve this by ignoring the warning I got when using the .drop() function on my df.query result:
#query dataframe for the circulating supply at the end of the year
circulating_supply = df.query("timestamp == '2016-12-31' or timestamp =='2017-12-31' or timestamp =='2018-12-31' or timestamp =='2019-12-31' or timestamp =='2020-12-31' or timestamp =='2021-03-01'")
circulating_supply.drop(columns=['close', 'issuance_native'], inplace=True)
circulating_supply.copy() #not sure if this did anything
circulating_supply.head()
#add the column
yearly_issuance['EOY Supply'] = circulating_supply['circulating_supply'].values
yearly_issuance.head()
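Incidentally, the SettingWithCopyWarning can be avoided by copying the query result before dropping, so that drop() operates on an independent DataFrame rather than on a possible view. A minimal sketch (same query as above, shortened to two dates):
circulating_supply = df.query("timestamp == '2016-12-31' or timestamp == '2017-12-31'").copy()
# drop on the copy raises no warning
circulating_supply.drop(columns=['close', 'issuance_native'], inplace=True)
circulating_supply.head()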
The DataFrame below contains a housing price dataset from 1996 to 2016.
Other than the first 6 columns, the remaining columns need to be converted to datetime type.
I tried to run the following code:
HousingPrice.columns[6:] = pd.to_datetime(HousingPrice.columns[6:])
but I got the error:
TypeError: Index does not support mutable operations
I wish to convert some columns in the columns Index to Datetime type, but not all columns.
The pandas Index is immutable, so you can't assign to a slice of it directly.
However, you can access and modify the underlying values of the column index through its array attribute (see the pandas documentation on Index.array).
HousingPrice.columns.array[6:] = pd.to_datetime(HousingPrice.columns[6:])
should work.
Note that this would change the column index only. In order to convert the column values, you can do this:
date_cols = HousingPrice.columns[6:]
HousingPrice[date_cols] = HousingPrice[date_cols].apply(pd.to_datetime, errors='coerce', axis=1)
EDIT
Illustrated example:
data = {'0ther_col': [1,2,3], '1996-04': ['1996-04','1996-05','1996-06'], '1995-05':['1996-02','1996-08','1996-10']}
print('ORIGINAL DATAFRAME')
df = pd.DataFrame.from_records(data)
print(df)
print("\nDATE COLUMNS")
date_cols = df.columns[-2:]
print(df.dtypes)
print('\nCASTING DATE COLUMNS TO DATETIME')
df[date_cols] = df[date_cols].apply(pd.to_datetime, errors='coerce', axis=1)
print(df.dtypes)
print('\nCASTING DATE COLUMN INDEXES TO DATETIME')
print("OLD INDEX -", df.columns)
df.columns.array[-2:] = pd.to_datetime(df[date_cols].columns)
print("NEW INDEX -",df.columns)
print('\nFINAL DATAFRAME')
print(df)
yields:
ORIGINAL DATAFRAME
0ther_col 1995-05 1996-04
0 1 1996-02 1996-04
1 2 1996-08 1996-05
2 3 1996-10 1996-06
DATE COLUMNS
0ther_col int64
1995-05 object
1996-04 object
dtype: object
CASTING DATE COLUMNS TO DATETIME
0ther_col int64
1995-05 datetime64[ns]
1996-04 datetime64[ns]
dtype: object
CASTING DATE COLUMN INDEXES TO DATETIME
OLD INDEX - Index(['0ther_col', '1995-05', '1996-04'], dtype='object')
NEW INDEX - Index(['0ther_col', 1995-05-01 00:00:00, 1996-04-01 00:00:00], dtype='object')
FINAL DATAFRAME
0ther_col 1995-05-01 00:00:00 1996-04-01 00:00:00
0 1 1996-02-01 1996-04-01
1 2 1996-08-01 1996-05-01
2 3 1996-10-01 1996-06-01
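Since the Index is immutable, an alternative to mutating its backing array is to rebuild the column index entirely, keeping the first labels and parsing the rest. A sketch against the question's frame (assuming the first 6 columns should stay untouched):
import pandas as pd

# rebuild the columns Index: keep the first 6 labels, convert the rest to Timestamps
HousingPrice.columns = list(HousingPrice.columns[:6]) + list(pd.to_datetime(HousingPrice.columns[6:]))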
I have two date columns:
PRIMARY CHILD diff
05-19-1945 01-13-1994 some value in years
03-01-1963
05-33-1933 03-01-1955 some value in years
05-19-1944 06-11-1967 some value in years
04-22-2020
I want to show the difference in years if and only if a value is present in both columns.
(driver_data_new['ASGN_BRTH_DT_PRIMARY']-driver_data_new['ASGN_BRTH_DT_CHILD'])/np.timedelta64(1,'Y')
I'm getting the following error:
---> 36 driver_data_new['ASGN_BRTH_DT_PRIMARY'].dt.date
37 driver_data_new['ASGN_BRTH_DT_CHILD'].dt.date
38 driver_data_new['N_range']=(driver_data_new['ASGN_BRTH_DT_PRIMARY']-driver_data_new['ASGN_BRTH_DT_CHILD'])/np.timedelta64(1,'Y')
AttributeError: Can only use .dt accessor with datetimelike values
Your error AttributeError: Can only use .dt accessor with datetimelike values has nothing to do with only subtracting dates where both values are available. Rather, it has to do with the data types in the columns you're using. At least one of them is not a "datetimelike" object – therefore, the .dt accessor just isn't available. Use df.dtypes to see which columns are not datetime, and pandas.to_datetime to convert. Once you've done that, you'll see how the difference you're trying to calculate is already handled:
>>> df = pd.DataFrame({'a_dt': pd.to_datetime([np.nan, '2019-01-01', '2020-02-04']), 'b_dt': pd.to_datetime([np.nan, np.nan, '2020-03-17'])})
>>> df
a_dt b_dt
0 NaT NaT
1 2019-01-01 NaT
2 2020-02-04 2020-03-17
# Both are datetime types
>>> df.dtypes
a_dt datetime64[ns]
b_dt datetime64[ns]
dtype: object
>>> df['b_dt'] - df['a_dt']
0 NaT
1 NaT
2 42 days
dtype: timedelta64[ns]
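Once both columns really are datetime64, the subtraction from the question works as written: rows where either value is missing give NaT, and dividing by a one-year timedelta turns the valid differences into fractional years and the NaT rows into NaN. Continuing the example above (using an explicit 365.2425-day year instead of np.timedelta64(1, 'Y'), since calendar years are not a fixed length):
>>> (df['b_dt'] - df['a_dt']) / pd.Timedelta(days=365.2425)
0         NaN
1         NaN
2    0.114992
dtype: float64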
I have a merged dataframe as follows:
>>> merged_df.dtypes
Jurisdiction object
AdjustedVolume float64
EffectiveStartDate datetime64[ns]
VintageYear int64
ProductType object
Rate float32
Obligation float32
Demand float64
Cost float64
dtype: object
The below groupby statement returns the correct AdjustedVolume values by Jurisdiction/Year:
>>> merged_df.groupby(['Jurisdiction', 'VintageYear'])['AdjustedVolume'].sum()
When including ProductType:
>>> merged_df.groupby(['Jurisdiction', 'VintageYear','ProductType'])['AdjustedVolume'].sum()
AdjustedVolume by Year is correct if the Jurisdiction contains only one ProductType, but for any Jurisdiction with two or more ProductTypes, the AdjustedVolumes are getting split up such that they sum to the correct value. I was expecting each row to have the total AdjustedVolume, and am unclear on why it's being split up.
example:
>>> merged_df.groupby(['Jurisdiction', 'VintageYear'])['AdjustedVolume'].sum()
Jurisdiction VintageYear AdjustedVolume
CA 2017 3.529964e+05
>>> merged_df.groupby(['Jurisdiction', 'VintageYear','ProductType'])['AdjustedVolume'].sum()
Jurisdiction VintageYear ProductType AdjustedVolume
CA 2017 Bucket1 7.584832e+04
CA 2017 Bucket2 1.308454e+05
CA 2017 Bucket3 1.463026e+05
I suspect the merge_asof is being done incorrectly:
>>> df1.dtypes
Jurisdiction object
ProductType object
VintageYear int64
EffectiveStartDate datetime64[ns]
Rate float32
Obligation float32
dtype: object
>>> df2.dtypes
Jurisdiction object
AdjustedVolume float64
EffectiveStartDate datetime64[ns]
VintageYear int64
dtype: object
Because df2 has no ProductType field, the below merge is breaking up the total volume into whatever ProductTypes are under each Jurisdiction. Can I modify the below merge so each ProductType has the total AdjustedVolume?
merged_df = pd.merge_asof(df2, df1, on='EffectiveStartDate', by=['Jurisdiction','VintageYear'])
You could use both versions of the group by and merge the two tables.
The first table is a group by with the ProductType, which would break out your AdjustedVolume by ProductType.
df = df.groupby(['Jurisdiction','VintageYear','ProductType']).agg({'AdjustedVolume':'sum'}).reset_index(drop = False)
Then create another table without including the ProductType (This is where the total amount will come from).
df1 = df.groupby(['Jurisdiction','VintageYear']).agg({'AdjustedVolume':'sum'}).reset_index(drop = False)
Now create an ID column, in both tables, in order for the merge to work correctly.
df['ID'] = df['Jurisdiction'].astype(str)+'_' +df['VintageYear'].astype(str)
df1['ID'] = df1['Jurisdiction'].astype(str)+'_'+ df1['VintageYear'].astype(str)
Now merge on IDs to get the total adjusted volume.
df = pd.merge(df, df1, left_on = ['ID'], right_on = ['ID'], how = 'inner')
Last step is to clean up your columns.
df = df.rename(columns = {'AdjustedVolume_x':'AdjustedVolume',
'AdjustedVolume_y':'TotalAdjustedVolume',
'Jurisdiction_x':'Jurisdiction',
'VintageYear_x':'VintageYear'})
del df['Jurisdiction_y']
del df['VintageYear_y']
Consider also transform to retrieve the grouped aggregate inline with the other records, akin to a subquery aggregate in SQL.
grpdf = merged_df.groupby(['Jurisdiction', 'VintageYear', 'ProductType'])['AdjustedVolume']\
                 .sum().reset_index()

# transform('sum') returns a result aligned row-for-row with grpdf, so each
# ProductType row picks up its Jurisdiction/VintageYear total without a merge
grpdf['TotalAdjVolume'] = grpdf.groupby(['Jurisdiction', 'VintageYear'])['AdjustedVolume']\
                               .transform('sum')
I'm trying to read in a file with dates in the (UK) format 13/01/1800, but some of the dates are before 1677, which cannot be represented by a nanosecond timestamp (see http://pandas.pydata.org/pandas-docs/stable/gotchas.html#gotchas-timestamp-limits). I understand from that page that I need to create my own PeriodIndex to cover the range I need (see http://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries-oob), but I can't work out how to convert the strings in the csv reader into dates in this PeriodIndex.
So far I have:
span = pd.period_range('1000-01-01', '2100-01-01', freq='D')
df_earliest= pd.read_csv("objects.csv", index_col=0, names=['Object Id', 'Earliest Date'], parse_dates=[1], infer_datetime_format=True, dayfirst=True)
How do I apply the span to the date reader/converter so I can create a PeriodIndex / DatetimeIndex column in the dataframe?
You can try to do it this way:
fn = r'D:\temp\.data\36987699.csv'
def dt_parse(s):
    # split the UK-style day/month/year string and build a daily Period,
    # which is not subject to the Timestamp nanosecond range limits
    d, m, y = s.split('/')
    return pd.Period(year=int(y), month=int(m), day=int(d), freq='D')
df = pd.read_csv(fn, parse_dates=[0], date_parser=dt_parse)
Input file:
Date,col1
13/01/1800,aaa
25/12/1001,bbb
01/03/1267,ccc
Test:
In [16]: df
Out[16]:
Date col1
0 1800-01-13 aaa
1 1001-12-25 bbb
2 1267-03-01 ccc
In [17]: df.dtypes
Out[17]:
Date object
col1 object
dtype: object
In [18]: df['Date'].dt.year
Out[18]:
0 1800
1 1001
2 1267
Name: Date, dtype: int64
PS: you may want to add a try/except block in the dt_parse() function to catch the ValueError exceptions that int() can raise...
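A minimal sketch of dt_parse() with that error handling added (returning NaT for unparseable values is an assumption about how bad rows should be handled):
def dt_parse(s):
    # day/month/year string -> daily Period; malformed or missing values -> NaT
    try:
        d, m, y = s.split('/')
        return pd.Period(year=int(y), month=int(m), day=int(d), freq='D')
    except (AttributeError, ValueError):
        return pd.NaT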