Create a dataframe from a date range in python - python

Given an interval from two dates, which will be a Python TimeStamp.
create_interval('2022-01-12', '2022-01-17', 'Holidays')
Create the following dataframe:
date
interval_name
2022-01-12 00:00:00
Holidays
2022-01-13 00:00:00
Holidays
2022-01-14 00:00:00
Holidays
2022-01-15 00:00:00
Holidays
2022-01-16 00:00:00
Holidays
2022-01-17 00:00:00
Holidays
If it can be in a few lines of code I would appreciate it. Thank you very much for your help.

If you're open to using Pandas, this should accomplish what you've requested
import pandas as pd
def create_interval(start, end, field_val):
#setting up index date range
idx = pd.date_range(start, end)
#create the dataframe using the index above, and creating the empty column for interval_name
df = pd.DataFrame(index = idx, columns = ['interval_name'])
#set the index name
df.index.names = ['date']
#filling out all rows in the 'interval_name' column with the field_val parameter
df.interval_name = field_val
return df
create_interval('2022-01-12', '2022-01-17', 'holiday')

I hope I coded exactly what you need.
import pandas as pd
def create_interval(ts1, ts2, interval_name):
ts_list_dt = pd.date_range(start=ts1, end=ts2).to_pydatetime().tolist()
ts_list = list(map(lambda x: ''.join(str(x)), ts_list_dt))
d = {'date': ts_list, 'interval_name': [interval_name]*len(ts_list)}
df = pd.DataFrame(data=d)
return df
df = create_interval('2022-01-12', '2022-01-17', 'Holidays')
print(df)
output:
date interval_name
0 2022-01-12 00:00:00 Holidays
1 2022-01-13 00:00:00 Holidays
2 2022-01-14 00:00:00 Holidays
3 2022-01-15 00:00:00 Holidays
4 2022-01-16 00:00:00 Holidays
5 2022-01-17 00:00:00 Holidays
If you want DataFrame without Index column, use df = df.set_index('date') after creating DataFrame df = pd.DataFrame(data=d). And then you will get:
date interval_name
2022-01-12 00:00:00 Holidays
2022-01-13 00:00:00 Holidays
2022-01-14 00:00:00 Holidays
2022-01-15 00:00:00 Holidays
2022-01-16 00:00:00 Holidays
2022-01-17 00:00:00 Holidays

Related

Reshape dataframe into several columns based on date column

I want to rearrange my example dataframe (df.csv) below based on the date column. Each row represents an hour's data for instance for both dates 2002-01-01 and 2002-01-02, there is 5 rows respectively, each representing 1 hour.
date,symbol
2002-01-01,A
2002-01-01,A
2002-01-01,A
2002-01-01,B
2002-01-01,A
2002-01-02,B
2002-01-02,B
2002-01-02,A
2002-01-02,A
2002-01-02,A
My expected output is as below .
date,hour1, hour2, hour3, hour4, hour5
2002-01-01,A,A,A,B,A
2002-01-02,B,B,A,A,A
I have tried the below as explained here: https://pandas.pydata.org/docs/user_guide/reshaping.html, but it doesnt work in my case because the symbol column contains duplicates.
import pandas as pd
import numpy as np
df = pd.read_csv('df.csv')
pivoted = df.pivot(index="date", columns="symbol")
print(pivoted)
The data does not have the timestamps but only the date. However, each row for the same date represents an hourly interval, for instance the output could also be represented as below:
date,01:00, 02:00, 03:00, 04:00, 05:00
2002-01-01,A,A,A,B,A
2002-01-02,B,B,A,A,A
where the hour1 represent 01:00, hour2 represent 02:00...etc
You had the correct pivot approach, but you were missing a column 'time', so let's split the datetime into date and time:
s = pd.to_datetime(df['date'])
df['date'] = s.dt.date
df['time'] = s.dt.time
df2 = df.pivot(index='date', columns='time', values='symbol')
output:
time 01:00:00 02:00:00 03:00:00 04:00:00 05:00:00
date
2002-01-01 A A A B A
2002-01-02 B B A A A
Alternatively for having a HH:MM time, use df['time'] = s.dt.strftime('%H:%M')
used input:
date,symbol
2002-01-01 01:00,A
2002-01-01 02:00,A
2002-01-01 03:00,A
2002-01-01 04:00,B
2002-01-01 05:00,A
2002-01-02 01:00,B
2002-01-02 02:00,B
2002-01-02 03:00,A
2002-01-02 04:00,A
2002-01-02 05:00,A
not time as input!
If really you have no time in the input dates and need to 'invent' increasing ones, you could use groupby.cumcount:
df['time'] = pd.to_datetime(df.groupby('date').cumcount(), format='%H').dt.strftime('%H:%M')
df2 = df.pivot(index='date', columns='time', values='symbol')
output:
time 01:00 02:00 03:00 04:00 05:00
date
2002-01-01 A A A B A
2002-01-02 B B A A A
For each entry as an hour:
k = df.groupby("date").cumcount().add(1).astype(str).radd("hour")
out = df.pivot_table('symbol','date',k,aggfunc=min)
print(out)
hour1 hour2 hour3 hour4 hour5
date
2002-01-01 A A A B A
2002-01-02 B B A A A
I'd have an approach for you, I guess it not the most elegant way since I have to rename both index and columns but it does the job.
new_cols = ['01:00', '02:00', '03:00', '04:00', '05:00']
df1 = df.loc[df['date']=='2002-01-01', :].T.drop('date').set_axis(new_cols, axis=1).set_axis(['2002-01-01'])
df2 = df.loc[df['date']=='2002-01-02', :].T.drop('date').set_axis(new_cols, axis=1).set_axis(['2002-01-02'])
result = pd.concat([df1,df2])
print(result)
Output:
01:00 02:00 03:00 04:00 05:00
2002-01-01 A A A B A
2002-01-02 B B A A A

Get range of dates between specified start and end date from csv using python

I have a problem in which i have a CSV file with StartDate and EndDate, Consider 01-02-2020 00:00:00 and 01-03-2020 00:00:00
And I want a python program that finds the dates in between the dates and append in next rows like
So here instead of dot , it should increment Startdate and keep End date as it is.
import pandas as pd
df = pd.read_csv('MyData.csv')
df['StartDate'] = pd.to_datetime(df['StartDate'])
df['EndDate'] = pd.to_datetime(df['EndDate'])
df['Dates'] = [pd.date_range(x, y) for x , y in zip(df['StartDate'],df['EndDate'])]
df = df.explode('Dates')
df
So for example , if i have StartDate as 01-02-2020 00:00:00 and EndDate as 05-02-2020 00:00:00
As result i should get
All the result DateTime should be in same format as in MyData.Csv StartDate and EndDate
Only the StartDate will change , rest should be same
I tried doing it with date range. But am not getting any result. Can anyone please help me with this.
Thanks
My two cents: a very simple solution based only on functions from pandas:
import pandas as pd
# Format of the dates in 'MyData.csv'
DT_FMT = '%m-%d-%Y %H:%M:%S'
df = pd.read_csv('MyData.csv')
# Parse dates with the provided format
for c in ('StartDate', 'EndDate'):
df[c] = pd.to_datetime(df[c], format=DT_FMT)
# Create the DataFrame with the ranges of dates
date_df = pd.DataFrame(
data=[[d] + list(row[1:])
for row in df.itertuples(index=False, name=None)
for d in pd.date_range(row[0], row[1])],
columns=df.columns.copy()
)
# Convert dates to strings in the same format of 'MyData.csv'
for c in ('StartDate', 'EndDate'):
date_df[c] = date_df[c].dt.strftime(DT_FMT)
If df is:
StartDate EndDate A B C
0 2020-01-02 2020-01-06 ME ME ME
1 2021-05-15 2021-05-18 KI KI KI
then date_df will be:
StartDate EndDate A B C
0 01-02-2020 00:00:00 01-06-2020 00:00:00 ME ME ME
1 01-03-2020 00:00:00 01-06-2020 00:00:00 ME ME ME
2 01-04-2020 00:00:00 01-06-2020 00:00:00 ME ME ME
3 01-05-2020 00:00:00 01-06-2020 00:00:00 ME ME ME
4 01-06-2020 00:00:00 01-06-2020 00:00:00 ME ME ME
5 05-15-2021 00:00:00 05-18-2021 00:00:00 KI KI KI
6 05-16-2021 00:00:00 05-18-2021 00:00:00 KI KI KI
7 05-17-2021 00:00:00 05-18-2021 00:00:00 KI KI KI
8 05-18-2021 00:00:00 05-18-2021 00:00:00 KI KI KI
Then you can save back the result to a CSV file with the to_csv method.
Does something like this achieve what you want?
from datetime import datetime, timedelta
date_list = []
for base, end in zip(df['StartDate'], df['EndDate']):
d1 = datetime.strptime(base, "%d-%m-%Y %H:%M:%S")
d2 = datetime.strptime(end, "%d-%m-%Y %H:%M:%S")
numdays = abs((d2 - d1).days)
basedate = datetime.strptime(base, "%d-%m-%Y %H:%M:%S")
date_list += [basedate - timedelta(days=x) for x in range(numdays)]
df['Dates'] = date_list
Actually the code you provided is working for me. I guess the only thing you need to change is the date formatting in reading and writing operations to make sure that is consistent with your requirements. In particular, you should leverage the dayfirst argument when reading and date_format when writing the output file. A toy example below:
Toy data
StartDate
EndDate
A
B
C
01-02-2020 00:00:00
06-02-2020 00:00:00
ME
ME
ME
01-04-2020 00:00:00
04-04-2020 00:00:00
PE
PE
PE
Sample code
import pandas as pd
s_dates = ['01-02-2020', '01-03-2020']
e_dates = ['01-04-2020', '01-05-2020']
df = pd.read_csv('dataSO.csv', parse_dates=[0,1], dayfirst=True)
cols = df.columns
df['Dates'] = [pd.date_range(x, y) for x , y in zip(df['StartDate'],df['EndDate'])]
df1 = df.explode('Dates')[cols]
df1.to_csv('resSO.csv', date_format="%d-%m-%Y %H:%M:%S", index=False)
And the output is what you described except for the fact that StartDate is also in datetime format. Does this answer you question?

Dataframe from Series grouped by weekday and hour of day

I have a Series with a DatetimeIndex, as such :
time my_values
2017-12-20 09:00:00 0.005611
2017-12-20 10:00:00 -0.004704
2017-12-20 11:00:00 0.002980
2017-12-20 12:00:00 0.001497
...
2021-08-20 13:00:00 -0.001084
2021-08-20 14:00:00 -0.001608
2021-08-20 15:00:00 -0.002182
2021-08-20 16:00:00 -0.012891
2021-08-20 17:00:00 0.002711
I would like to create a dataframe of average values with the weekdays as columns names and hour of the day as index, resulting in this :
hour Monday Tuesday ... Sunday
0 0.005611 -0.001083 -0.003467
1 -0.004704 0.003362 -0.002357
2 0.002980 0.019443 0.009814
3 0.001497 -0.002967 -0.003466
...
19 -0.001084 0.009822 0.003362
20 -0.001608 -0.002967 -0.003567
21 -0.002182 0.035600 -0.003865
22 -0.012891 0.002945 -0.002345
23 0.002711 -0.002458 0.006467
How can do this in Python ?
Do as follows
# Coerce time to datetime
df['time'] = pd.to_datetime(df['time'])
# Extract day and hour
df = df.assign(day=df['time'].dt.strftime('%A'), hour=df['time'].dt.hour)
# Pivot
pd.pivot_table(df, values='my_values', index=['hour'],
columns=['day'], aggfunc=np.mean)
Since you asked for a solution that returns the average values, I propose this groupby solution
df["weekday"] = df.time.dt.strftime('%A')
df["hour"] = df.time.dt.strftime('%H')
df = df.drop(["time"], axis=1)
# calculate averages by weekday and hour
df2 = df.groupby(["hour", "weekday"]).mean()
# put it in the right format
df2.unstack()

How to use Pandas to get date_range from some timestamp?

I need to split a year in enumerated 20-minute chunks and then find the sequece number of corresponding time range chunk for randomly distributed timestamps in a year for further processing.
I tried to use pandas for this, but I can't find a way to index timestamp in date_range:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import pandas as pd
from datetime import timedelta
if __name__ == '__main__':
date_start = pd.to_datetime('2018-01-01')
date_end = date_start + timedelta(days=365)
index = pd.date_range(start=date_start, end=date_end, freq='20min')
data = range(len(index))
df = pd.DataFrame(data, index=index, columns=['A'])
print(df)
event_ts = pd.to_datetime('2018-10-14 02:17:43')
# How to find the corresponding df['A'] for event_ts?
# print(df.loc[event_ts])
Output:
A
2018-01-01 00:00:00 0
2018-01-01 00:20:00 1
2018-01-01 00:40:00 2
2018-01-01 01:00:00 3
2018-01-01 01:20:00 4
... ...
2018-12-31 22:40:00 26276
2018-12-31 23:00:00 26277
2018-12-31 23:20:00 26278
2018-12-31 23:40:00 26279
2019-01-01 00:00:00 26280
[26281 rows x 1 columns]
What is the best practice to do it in python? I imagine how to find the range "by hand" converting date_range to integers and comparing it, but may be there are some elegant pandas/python-style ways to do it?
First of all, I've worked with a small interval, one week:
date_end = date_start + timedelta(days=7)
Then I've followed your steps, and got a portion of your dataframe.
My event_ts is this:
event_ts = pd.to_datetime('2018-01-04 02:17:43')
And I've chosen to reset the index, and have a dataframe easy to manipulate:
df = df.reset_index()
With this code I found the last value where event_ts belongs:
for i in df['index']:
if i <= event_ts:
run.append(i)
print(max(run))
#2018-01-04 02:00:00
or:
top = max(run)
Finally:
df.loc[df['index'] == top].index[0]
222
event_ts belongs to index df[222]

pandas period_range - gaining access to the

pd.period_range(start='2017-01-01', end='2017-01-01', freq='Q') gives me the following:
PeriodIndex(['2017Q1'], dtype='period[Q-DEC]', freq='Q-DEC')
I would like to gain access to '2017Q1' to put it into a different column in the dataframe.
I have a date column with dates, e.g., 1/1/2017. I have other column where I'd like to put the string calculated by the period range. It seems like that would be an efficient way to update the column with the fiscal quarter. I can't seem to access it. It isn't subscriptable, and I can't get at it even when I assign it to a variable. Just wondering what I'm missing.
pandas.PeriodIndex:
You were almost there. It just needed to be assigned to a column
import numpy as np
from datetime import timedelta, datetime
import pandas as pd
# list of dates - test data
first_date = datetime(2017, 1, 1)
last_date = datetime(2019, 9, 20)
x = 4
list_of_dates = [date for date in np.arange(first_date, last_date, timedelta(days=x)).astype(datetime)]
# create the dataframe
df = pd.DataFrame({'dates': list_of_dates})
dates
2017-01-01
2017-01-05
2017-01-09
2017-01-13
2017-01-17
df['Quarters'] = pd.PeriodIndex(df.dates, freq='Q-DEC')
Output:
print(df.head())
dates Quarters
2017-01-01 2017Q1
2017-01-05 2017Q1
2017-01-09 2017Q1
2017-01-13 2017Q1
2017-01-17 2017Q1
print(df.tail())
dates Quarters
2019-08-31 2019Q3
2019-09-04 2019Q3
2019-09-08 2019Q3
2019-09-12 2019Q3
2019-09-16 2019Q3

Categories

Resources