Counting each day in a dataframe (Not resetting on new year)

Counting each day in a dataframe (Not resetting on new year) - python

I have two years worth of data in a Dataframe called df, with an additional column called dayNo which labels what day it is in the year. See below:
Code which handles dayNo:
df['dayNo'] = pd.to_datetime(df['TradeDate'], dayfirst=True).dt.day_of_year
I would like to amened dayNo so that when 2023 begins, dayNo doesn't reset to 1, but changes to 366, 367 and so on. Expected output below:
Maybe a completely different approach will have to be taken to what I've done above. Any help greatly appreciated, Thanks!

You could define a start day to start counting days from, and use the number of days from that point forward as your column. An example using self generated data to illustrate the point:
df = pd.DataFrame({"dates": pd.date_range("2022-12-29", "2023-01-03", freq="8H")})
start = pd.Timestamp("2021-12-31")
df["dayNo"] = df["dates"].sub(start).dt.days
dates dayNo
0 2022-12-29 00:00:00 363
1 2022-12-29 08:00:00 363
2 2022-12-29 16:00:00 363
3 2022-12-30 00:00:00 364
4 2022-12-30 08:00:00 364
5 2022-12-30 16:00:00 364
6 2022-12-31 00:00:00 365
7 2022-12-31 08:00:00 365
8 2022-12-31 16:00:00 365
9 2023-01-01 00:00:00 366
10 2023-01-01 08:00:00 366
11 2023-01-01 16:00:00 366
12 2023-01-02 00:00:00 367
13 2023-01-02 08:00:00 367
14 2023-01-02 16:00:00 367
15 2023-01-03 00:00:00 368

You are nearly there with your solution just do Apply for final result as
df['dayNo'] = df['dayNo'].apply(lambda x : x if x>= df.loc[0].dayNo else x+df.loc[0].dayNo)
df
Out[108]:
dates TradeDate dayNo
0 2022-12-31 00:00:00 2022-12-31 365
1 2022-12-31 01:00:00 2022-12-31 365
2 2022-12-31 02:00:00 2022-12-31 365
3 2022-12-31 03:00:00 2022-12-31 365
4 2022-12-31 04:00:00 2022-12-31 365
.. ... ... ...
68 2023-01-02 20:00:00 2023-01-02 367
69 2023-01-02 21:00:00 2023-01-02 367
70 2023-01-02 22:00:00 2023-01-02 367
71 2023-01-02 23:00:00 2023-01-02 367
72 2023-01-03 00:00:00 2023-01-03 368

Let's suppose we have a pandas dataframe as follows with this script (inspired by Chrysophylaxs dataframe) :
import pandas as pd
df = pd.DataFrame({'TradeDate': pd.date_range("2022-12-29", "2030-01-03", freq="8H")})
The dataframe has then dates from 2022 to 2030 :
TradeDate
0 2022-12-29 00:00:00
1 2022-12-29 08:00:00
2 2022-12-29 16:00:00
3 2022-12-30 00:00:00
4 2022-12-30 08:00:00
... ...
7682 2030-01-01 16:00:00
7683 2030-01-02 00:00:00
7684 2030-01-02 08:00:00
7685 2030-01-02 16:00:00
7686 2030-01-03 00:00:00
[7687 rows x 1 columns]
I propose you the following commented-inside code to aim our target :
import pandas as pd
df = pd.DataFrame({'TradeDate': pd.date_range("2022-12-29", "2030-01-03", freq="8H")})
# Initialize Days counter
dyc = df['TradeDate'].iloc[0].dayofyear
# Initialize Previous day of Year
prv_dof = dyc
def func(row):
global dyc, prv_dof
# Get the day of the year
dof = row.iloc[0].dayofyear
# If New day then increment days counter
if dof != prv_dof:
dyc+=1
prv_dof = dof
return dyc
df['dayNo'] = df.apply(func, axis=1)
Resulting dataframe :
TradeDate dayNo
0 2022-12-29 00:00:00 363
1 2022-12-29 08:00:00 363
2 2022-12-29 16:00:00 363
3 2022-12-30 00:00:00 364
4 2022-12-30 08:00:00 364
... ... ...
7682 2030-01-01 16:00:00 2923
7683 2030-01-02 00:00:00 2924
7684 2030-01-02 08:00:00 2924
7685 2030-01-02 16:00:00 2924
7686 2030-01-03 00:00:00 2925

Related

Create a new DataFrame using pandas date_range

I have the following DataFrame:
date_start date_end
0 2023-01-01 16:00:00 2023-01-01 17:00:00
1 2023-01-02 16:00:00 2023-01-02 17:00:00
2 2023-01-03 16:00:00 2023-01-03 17:00:00
3 2023-01-04 17:00:00 2023-01-04 19:00:00
4 NaN NaN
and I want to create a new DataFrame which will contain values starting from the date_start and ending at the date_end of each row.
So for the first row by using the code below:
new_df = pd.Series(pd.date_range(start=df['date_start'][0], end=df['date_end'][0], freq= '15min'))
I get the following:
0 2023-01-01 16:00:00
1 2023-01-01 16:15:00
2 2023-01-01 16:30:00
3 2023-01-01 16:45:00
4 2023-01-01 17:00:00
How can I get the same result for all the rows of the df combined in a new df?

You can use a list comprehension and concat:
out = pd.concat([pd.DataFrame({'date': pd.date_range(start=start, end=end,
freq='15min')})
for start, end in zip(df['date_start'], df['date_end'])],
ignore_index=True))
Output:
date
0 2023-01-01 16:00:00
1 2023-01-01 16:15:00
2 2023-01-01 16:30:00
3 2023-01-01 16:45:00
4 2023-01-01 17:00:00
5 2023-01-02 16:00:00
6 2023-01-02 16:15:00
7 2023-01-02 16:30:00
8 2023-01-02 16:45:00
9 2023-01-02 17:00:00
10 2023-01-03 16:00:00
11 2023-01-03 16:15:00
12 2023-01-03 16:30:00
13 2023-01-03 16:45:00
14 2023-01-03 17:00:00
15 2023-01-04 17:00:00
16 2023-01-04 17:15:00
17 2023-01-04 17:30:00
18 2023-01-04 17:45:00
19 2023-01-04 18:00:00
20 2023-01-04 18:15:00
21 2023-01-04 18:30:00
22 2023-01-04 18:45:00
23 2023-01-04 19:00:00
handling NAs:
out = pd.concat([pd.DataFrame({'date': pd.date_range(start=start, end=end,
freq='15min')})
for start, end in zip(df['date_start'], df['date_end'])
if pd.notna(start) and pd.notna(end)
],
ignore_index=True)

Adding to the previous answer that date_range has a to_series() method and that you could proceed like this as well:
pd.concat(
[
pd.date_range(start=row['date_start'], end=row['date_end'], freq= '15min').to_series()
for _, row in df.iterrows()
], ignore_index=True
)

Create regular time series from irregular interval with python

I wonder if is it possible to convert irregular time series interval to regular one without interpolating value from other column like this :
Index count
2018-01-05 00:00:00 1
2018-01-07 00:00:00 4
2018-01-08 00:00:00 15
2018-01-11 00:00:00 2
2018-01-14 00:00:00 5
2018-01-19 00:00:00 5
....
2018-12-26 00:00:00 6
2018-12-29 00:00:00 7
2018-12-30 00:00:00 8
And I expect the result to be something like this:
Index count
2018-01-01 00:00:00 0
2018-01-02 00:00:00 0
2018-01-03 00:00:00 0
2018-01-04 00:00:00 0
2018-01-05 00:00:00 1
2018-01-06 00:00:00 0
2018-01-07 00:00:00 4
2018-01-08 00:00:00 15
2018-01-09 00:00:00 0
2018-01-10 00:00:00 0
2018-01-11 00:00:00 2
2018-01-12 00:00:00 0
2018-01-13 00:00:00 0
2018-01-14 00:00:00 5
2018-01-15 00:00:00 0
2018-01-16 00:00:00 0
2018-01-17 00:00:00 0
2018-01-18 00:00:00 0
2018-01-19 00:00:00 5
....
2018-12-26 00:00:00 6
2018-12-27 00:00:00 0
2018-12-28 00:00:00 0
2018-12-29 00:00:00 7
2018-12-30 00:00:00 8
2018-12-31 00:00:00 0
So, far I just try resample from pandas but it only partially solved my problem.
Thanks in advance

Use DataFrame.reindex with date_range:
#if necessary
df.index = pd.to_datetime(df.index)
df = df.reindex(pd.date_range('2018-01-01','2018-12-31'), fill_value=0)
print (df)
count
2018-01-01 0
2018-01-02 0
2018-01-03 0
2018-01-04 0
2018-01-05 1
...
2018-12-27 0
2018-12-28 0
2018-12-29 7
2018-12-30 8
2018-12-31 0
[365 rows x 1 columns]

Why pandas .last('1W') don't show last 7 days?

Trying to extract a max value from a Pandas Dataframe with a daytime as index, I'm using .last('1W').
My data goes from the first day of month (2020-09-01 00:00:00). It seems to work properly until I reach today (monday 07/09/2020). At first I supposed that .last() takes the last days of week from starting value (sunday I guess) instead of the last 7 days (as I assumed) but, what confuses me is that if I extend the hours, the resulting dataframe shifts the first sample too...
I'm try to simulate this with:
import pandas as pd
i = pd.date_range('2020-09-01', periods=24*6+5, freq='1H')
values = range(0, 24*6+5 )
df = pd.DataFrame({'A': values}, index=i)
print(df)
print(df.last('1W'))
With output:
A
2020-09-01 00:00:00 0
2020-09-01 01:00:00 1
2020-09-01 02:00:00 2
2020-09-01 03:00:00 3
2020-09-01 04:00:00 4
... ...
2020-09-07 00:00:00 144
2020-09-07 01:00:00 145
2020-09-07 02:00:00 146
2020-09-07 03:00:00 147
2020-09-07 04:00:00 148
[149 rows x 1 columns]
A
2020-09-06 05:00:00 125
2020-09-06 06:00:00 126
2020-09-06 07:00:00 127
2020-09-06 08:00:00 128
2020-09-06 09:00:00 129
2020-09-06 10:00:00 130
2020-09-06 11:00:00 131
2020-09-06 12:00:00 132
2020-09-06 13:00:00 133
2020-09-06 14:00:00 134
2020-09-06 15:00:00 135
2020-09-06 16:00:00 136
2020-09-06 17:00:00 137
2020-09-06 18:00:00 138
2020-09-06 19:00:00 139
2020-09-06 20:00:00 140
2020-09-06 21:00:00 141
2020-09-06 22:00:00 142
2020-09-06 23:00:00 143
2020-09-07 00:00:00 144
2020-09-07 01:00:00 145
2020-09-07 02:00:00 146
2020-09-07 03:00:00 147
2020-09-07 04:00:00 148
Process finished with exit code 0
The first value in df is 0 at 2020-09-01 00:00:00
But,
When I try to apply last('1W'), the selection goes from 2020-09-06 05:00:00, to the last value, instead of the last 7 days... as I assumed, nor from 2020-09-06 00:00:00 if the operator works from sunday to sunday.

If you're looking for an offset of 7 days, why not use the Day offset, rather than the Week?
"1W" offset isn't the same as "7D" because "1W" starting on a Monday in a two-week dataset where the last row is Tuesday will have only 2 days. "2W" will include previous week (Monday-Sunday) + (Monday-Tuesday).
You can see the effects of changing the start day of the week by calling the offset class directly, like so:
week_offset = pd.tseries.offsets.Week(n=1, weekday=0) # week starting Monday
day_offset = pd.tseries.offsets.Day(n=7) # or simply "7D"
df.last(day_offset)

add timedelta data within a group in pandas dataframe

I am working on a dataframe in pandas with four columns of user_id, time_stamp1, time_stamp2, and interval. Time_stamp1 and time_stamp2 are of type datetime64[ns] and interval is of type timedelta64[ns].
I want to sum up interval values for each user_id in the dataframe and I tried to calculate it in many ways as:
1)df["duration"]= df.groupby('user_id')['interval'].apply (lambda x: x.sum())
2)df ["duration"]= df.groupby('user_id').aggregate (np.sum)
3)df ["duration"]= df.groupby('user_id').agg (np.sum)
but none of them work and the value of the duration will be NaT after running the codes.

UPDATE: you can use transform() method:
In [291]: df['duration'] = df.groupby('user_id')['interval'].transform('sum')
In [292]: df
Out[292]:
a user_id b interval duration
0 2016-01-01 00:00:00 0.01 2015-11-11 00:00:00 51 days 00:00:00 838 days 08:00:00
1 2016-03-10 10:39:00 0.01 2015-12-08 18:39:00 NaT 838 days 08:00:00
2 2016-05-18 21:18:00 0.01 2016-01-05 13:18:00 134 days 08:00:00 838 days 08:00:00
3 2016-07-27 07:57:00 0.01 2016-02-02 07:57:00 176 days 00:00:00 838 days 08:00:00
4 2016-10-04 18:36:00 0.01 2016-03-01 02:36:00 217 days 16:00:00 838 days 08:00:00
5 2016-12-13 05:15:00 0.01 2016-03-28 21:15:00 259 days 08:00:00 838 days 08:00:00
6 2017-02-20 15:54:00 0.02 2016-04-25 15:54:00 301 days 00:00:00 1454 days 00:00:00
7 2017-05-01 02:33:00 0.02 2016-05-23 10:33:00 342 days 16:00:00 1454 days 00:00:00
8 2017-07-09 13:12:00 0.02 2016-06-20 05:12:00 384 days 08:00:00 1454 days 00:00:00
9 2017-09-16 23:51:00 0.02 2016-07-17 23:51:00 426 days 00:00:00 1454 days 00:00:00
OLD answer:
Demo:
In [260]: df
Out[260]:
a b interval user_id
0 2016-01-01 00:00:00 2015-11-11 00:00:00 51 days 00:00:00 1
1 2016-03-10 10:39:00 2015-12-08 18:39:00 NaT 1
2 2016-05-18 21:18:00 2016-01-05 13:18:00 134 days 08:00:00 1
3 2016-07-27 07:57:00 2016-02-02 07:57:00 176 days 00:00:00 1
4 2016-10-04 18:36:00 2016-03-01 02:36:00 217 days 16:00:00 1
5 2016-12-13 05:15:00 2016-03-28 21:15:00 259 days 08:00:00 1
6 2017-02-20 15:54:00 2016-04-25 15:54:00 301 days 00:00:00 2
7 2017-05-01 02:33:00 2016-05-23 10:33:00 342 days 16:00:00 2
8 2017-07-09 13:12:00 2016-06-20 05:12:00 384 days 08:00:00 2
9 2017-09-16 23:51:00 2016-07-17 23:51:00 426 days 00:00:00 2
In [261]: df.dtypes
Out[261]:
a datetime64[ns]
b datetime64[ns]
interval timedelta64[ns]
user_id int64
dtype: object
In [262]: df.groupby('user_id')['interval'].sum()
Out[262]:
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
Name: interval, dtype: timedelta64[ns]
In [263]: df.groupby('user_id')['interval'].apply(lambda x: x.sum())
Out[263]:
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
Name: interval, dtype: timedelta64[ns]
In [264]: df.groupby('user_id').agg(np.sum)
Out[264]:
interval
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
So check your data...

Separate by threshold

I am trying to take the value of c_med one value as threshold from input:1 and separate the above and below values in two different outputs from input:2. Write above.csv & below.csv with reference to column c_total.
Read the above.csv as input and categorize them with percentage as mentioned in point 2 written in pure python.
Input: 1
date_count,all_hours,c_min,c_max,c_med,c_med_med,u_min,u_max,u_med,u_med_med
2,12,2309,19072,12515,13131,254,785,686,751
Input: 2 ['date','startTime','endTime','day','c_total','u_total']
2004-01-05,22:00:00,23:00:00,Mon,18944,790
2004-01-05,23:00:00,00:00:00,Mon,17534,750
2004-01-06,00:00:00,01:00:00,Tue,17262,747
2004-01-06,01:00:00,02:00:00,Tue,19072,777
2004-01-06,02:00:00,03:00:00,Tue,18275,785
2004-01-06,03:00:00,04:00:00,Tue,13589,757
2004-01-06,04:00:00,05:00:00,Tue,16053,735
2004-01-06,05:00:00,06:00:00,Tue,11440,636
2004-01-06,06:00:00,07:00:00,Tue,5972,513
2004-01-06,07:00:00,08:00:00,Tue,3424,382
2004-01-06,08:00:00,09:00:00,Tue,2696,303
2004-01-06,09:00:00,10:00:00,Tue,2350,262
2004-01-06,10:00:00,11:00:00,Tue,2309,254
I am trying to read a threshold value from another input csv c_med
I am getting following error:
Traceback (most recent call last):
File "class_med.py", line 10, in <module>
above_median = df_data['c_total'] > df_med['c_med']
File "/usr/local/lib/python2.7/dist-packages/pandas/core/ops.py", line 735, in wrapper
raise ValueError('Series lengths must match to compare')
ValueError: Series lengths must match to compare
filter the separated data column c_total with percentage. Pure python solution given below but I am looking for a pandas solution. like in Reference one
for row in csv.reader(inp):
if int(row[1])<(.20 * max_value):
val = 'viewers'
elif int(row[1])>=(0.20*max_value) and int(row[1])<(0.40*max_value):
val= 'event based'
elif int(row[1])>=(0.40*max_value) and int(row[1])<(0.60*max_value):
val= 'situational'
elif int(row[1])>=(0.60*max_value) and int(row[1])<(0.80*max_value):
val = 'active'
else:
val= 'highly active'
writer.writerow([row[0],row[1],val])
Code:
import pandas as pd
import numpy as np
df_med = pd.read_csv('stat_result.csv')
df_med.columns = ['date_count', 'all_hours', 'c_min', 'c_max', 'c_med', 'c_med_med', 'u_min', 'u_max', 'u_med', 'u_med_med']
df_data = pd.read_csv('mini_out.csv')
df_data.columns = ['date', 'startTime', 'endTime', 'day', 'c_total', 'u_total']
above = df_data['c_total'] > df_med['c_med']
#print above_median
above.to_csv('above.csv', index=None, header=None)
df_above = pd.readcsv('above_median.csv')
df_above.columns = ['date', 'startTime', 'endTime', 'day', 'c_total', 'u_total']
#Percentage block should come here
Edit: In case of single column value the qcut is the simplest solution. But when it comes to using two values from two different columns how to achieve that in pandas ?
for row in csv.reader(inp):
if int(row[1])>(0.80*max_user) and int(row[2])>(0.80*max_key):
val='highly active'
elif int(row[1])>=(0.60*max_user) and int(row[2])<=(0.60*max_key):
val='active'
elif int(row[1])<=(0.40*max_user) and int(row[2])>=(0.40*max_key):
val='event based'
elif int(row[1])<(0.20*max_user) and int(row[2])<(0.20*max_key):
val ='situational'
else:
val= 'viewers'

assuming you have the following DFs:
In [7]: df1
Out[7]:
date_count all_hours c_min c_max c_med c_med_med u_min u_max u_med u_med_med
0 2 12 2309 19072 12515 13131 254 785 686 751
In [8]: df2
Out[8]:
date startTime endTime day c_total u_total
0 2004-01-05 22:00:00 23:00:00 Mon 18944 790
1 2004-01-05 23:00:00 00:00:00 Mon 17534 750
2 2004-01-06 00:00:00 01:00:00 Tue 17262 747
3 2004-01-06 01:00:00 02:00:00 Tue 19072 777
4 2004-01-06 02:00:00 03:00:00 Tue 18275 785
5 2004-01-06 03:00:00 04:00:00 Tue 13589 757
6 2004-01-06 04:00:00 05:00:00 Tue 16053 735
7 2004-01-06 05:00:00 06:00:00 Tue 11440 636
8 2004-01-06 06:00:00 07:00:00 Tue 5972 513
9 2004-01-06 07:00:00 08:00:00 Tue 3424 382
10 2004-01-06 08:00:00 09:00:00 Tue 2696 303
11 2004-01-06 09:00:00 10:00:00 Tue 2350 262
12 2004-01-06 10:00:00 11:00:00 Tue 2309 254
separate by threshold (you can compare two series with the same length or with a scalar value - i assume you will to separate your second data set, comparing it to the scalar value (c_med column) from the first of your first data set:
In [22]: above = df2[df2.c_total > df1.ix[0, 'c_med']]
In [23]: above
Out[23]:
date startTime endTime day c_total u_total
0 2004-01-05 22:00:00 23:00:00 Mon 18944 790
1 2004-01-05 23:00:00 00:00:00 Mon 17534 750
2 2004-01-06 00:00:00 01:00:00 Tue 17262 747
3 2004-01-06 01:00:00 02:00:00 Tue 19072 777
4 2004-01-06 02:00:00 03:00:00 Tue 18275 785
5 2004-01-06 03:00:00 04:00:00 Tue 13589 757
6 2004-01-06 04:00:00 05:00:00 Tue 16053 735
you can use qcut() method in order to categorize your data:
In [29]: df2['cat'] = pd.qcut(df2.c_total,
....: q=[0, .2, .4, .6, .8, 1.],
....: labels=['viewers','event based','situational','active','highly active'])
In [30]: df2
Out[30]:
date startTime endTime day c_total u_total cat
0 2004-01-05 22:00:00 23:00:00 Mon 18944 790 highly active
1 2004-01-05 23:00:00 00:00:00 Mon 17534 750 active
2 2004-01-06 00:00:00 01:00:00 Tue 17262 747 active
3 2004-01-06 01:00:00 02:00:00 Tue 19072 777 highly active
4 2004-01-06 02:00:00 03:00:00 Tue 18275 785 highly active
5 2004-01-06 03:00:00 04:00:00 Tue 13589 757 situational
6 2004-01-06 04:00:00 05:00:00 Tue 16053 735 situational
7 2004-01-06 05:00:00 06:00:00 Tue 11440 636 situational
8 2004-01-06 06:00:00 07:00:00 Tue 5972 513 event based
9 2004-01-06 07:00:00 08:00:00 Tue 3424 382 event based
10 2004-01-06 08:00:00 09:00:00 Tue 2696 303 viewers
11 2004-01-06 09:00:00 10:00:00 Tue 2350 262 viewers
12 2004-01-06 10:00:00 11:00:00 Tue 2309 254 viewers
check:
In [32]: df2.assign(pct=df2.c_total/df2.c_total.max())[['c_total','pct','cat']]
Out[32]:
c_total pct cat
0 18944 0.993289 highly active
1 17534 0.919358 active
2 17262 0.905096 active
3 19072 1.000000 highly active
4 18275 0.958211 highly active
5 13589 0.712510 situational
6 16053 0.841705 situational
7 11440 0.599832 situational
8 5972 0.313129 event based
9 3424 0.179530 event based
10 2696 0.141359 viewers
11 2350 0.123217 viewers
12 2309 0.121068 viewers

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Counting each day in a dataframe (Not resetting on new year) - python

Related

Create a new DataFrame using pandas date_range

Create regular time series from irregular interval with python

Why pandas .last('1W') don't show last 7 days?

add timedelta data within a group in pandas dataframe

Separate by threshold

Categories

Resources