I have the following DataFrame:
date_start date_end
0 2023-01-01 16:00:00 2023-01-01 17:00:00
1 2023-01-02 16:00:00 2023-01-02 17:00:00
2 2023-01-03 16:00:00 2023-01-03 17:00:00
3 2023-01-04 17:00:00 2023-01-04 19:00:00
4 NaN NaN
and I want to create a new DataFrame which will contain values starting from the date_start and ending at the date_end of each row.
So for the first row by using the code below:
new_df = pd.Series(pd.date_range(start=df['date_start'][0], end=df['date_end'][0], freq= '15min'))
I get the following:
0 2023-01-01 16:00:00
1 2023-01-01 16:15:00
2 2023-01-01 16:30:00
3 2023-01-01 16:45:00
4 2023-01-01 17:00:00
How can I get the same result for all the rows of the df combined in a new df?
You can use a list comprehension and concat:
out = pd.concat([pd.DataFrame({'date': pd.date_range(start=start, end=end,
freq='15min')})
for start, end in zip(df['date_start'], df['date_end'])],
ignore_index=True)
Output:
date
0 2023-01-01 16:00:00
1 2023-01-01 16:15:00
2 2023-01-01 16:30:00
3 2023-01-01 16:45:00
4 2023-01-01 17:00:00
5 2023-01-02 16:00:00
6 2023-01-02 16:15:00
7 2023-01-02 16:30:00
8 2023-01-02 16:45:00
9 2023-01-02 17:00:00
10 2023-01-03 16:00:00
11 2023-01-03 16:15:00
12 2023-01-03 16:30:00
13 2023-01-03 16:45:00
14 2023-01-03 17:00:00
15 2023-01-04 17:00:00
16 2023-01-04 17:15:00
17 2023-01-04 17:30:00
18 2023-01-04 17:45:00
19 2023-01-04 18:00:00
20 2023-01-04 18:15:00
21 2023-01-04 18:30:00
22 2023-01-04 18:45:00
23 2023-01-04 19:00:00
Handling NAs:
out = pd.concat([pd.DataFrame({'date': pd.date_range(start=start, end=end,
freq='15min')})
for start, end in zip(df['date_start'], df['date_end'])
if pd.notna(start) and pd.notna(end)
],
ignore_index=True)
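For reference, here is a self-contained sketch of the NA-safe version built on the question's sample data. The `row` column is an addition for illustration only (it is not part of the original answer) and records which source row each timestamp came from.

```python
import pandas as pd

# Sample data from the question; the last row has missing dates.
df = pd.DataFrame({
    "date_start": pd.to_datetime([
        "2023-01-01 16:00:00", "2023-01-02 16:00:00",
        "2023-01-03 16:00:00", "2023-01-04 17:00:00", None,
    ]),
    "date_end": pd.to_datetime([
        "2023-01-01 17:00:00", "2023-01-02 17:00:00",
        "2023-01-03 17:00:00", "2023-01-04 19:00:00", None,
    ]),
})

# Drop rows where either bound is missing, then expand each remaining row.
valid = df.dropna(subset=["date_start", "date_end"])
out = pd.concat(
    [
        # "row" is an illustrative extra column: the index of the source row.
        pd.DataFrame({"row": i, "date": pd.date_range(start=s, end=e, freq="15min")})
        for i, s, e in zip(valid.index, valid["date_start"], valid["date_end"])
    ],
    ignore_index=True,
)
print(out)
```

Filtering with dropna up front keeps the comprehension free of conditions; which of the two styles reads better is a matter of taste.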
Adding to the previous answer: the DatetimeIndex returned by date_range has a to_series() method, so you could also proceed like this:
pd.concat(
[
pd.date_range(start=row['date_start'], end=row['date_end'], freq= '15min').to_series()
for _, row in df.dropna().iterrows()  # skip rows with missing dates, which date_range rejects
], ignore_index=True
)
Related
I have the following DataFrame:
datetime day_fetched col_a col_b
0 2023-01-02 12:00:00 2023-01-01 12:00:00 100 200
1 2023-01-02 12:00:00 2023-01-02 12:00:00 120 400
2 2023-01-03 12:00:00 2023-01-02 12:00:00 140 500
3 2023-01-03 12:00:00 2023-01-03 12:00:00 160 700
4 2023-01-04 12:00:00 2023-01-03 12:00:00 200 300
5 2023-01-04 12:00:00 2023-01-04 12:00:00 430 200
And I want to create a new column that will take the value 2 if there is a difference in the date between datetime and day_fetched and value 1 if there is no difference.
So my new Dataframe should look like this:
datetime day_fetched col_a col_b day_ahead
0 2023-01-02 12:00:00 2023-01-01 12:00:00 100 200 2
1 2023-01-02 12:00:00 2023-01-02 12:00:00 120 400 1
2 2023-01-03 12:00:00 2023-01-02 12:00:00 140 500 2
3 2023-01-03 12:00:00 2023-01-03 12:00:00 160 700 1
4 2023-01-04 12:00:00 2023-01-03 12:00:00 200 300 2
5 2023-01-04 12:00:00 2023-01-04 12:00:00 430 200 1
Then based on the column['day_ahead'], I want to split the col_a and col_b, into col_a_1 and col_a_2 and col_b_1 and col_b_2.
So the final DataFrame will look like this:
datetime day_fetched col_a_1 col_a_2 col_b_1 col_b_2 day_ahead
0 2023-01-02 12:00:00 2023-01-01 12:00:00 NaN 200 NaN 200 2
1 2023-01-02 12:00:00 2023-01-02 12:00:00 120 NaN 100 NaN 1
2 2023-01-03 12:00:00 2023-01-02 12:00:00 NaN 500 NaN 200 2
3 2023-01-03 12:00:00 2023-01-03 12:00:00 160 NaN 100 NaN 1
4 2023-01-04 12:00:00 2023-01-03 12:00:00 NaN 300 NaN 200 2
5 2023-01-04 12:00:00 2023-01-04 12:00:00 430 NaN 100 NaN 1
One solution is to use np.where:
import pandas as pd
import numpy as np
df = pd.DataFrame(data=
[["2023-01-02 12:00:00", "2023-01-01 12:00:00", 100, 200],
["2023-01-02 12:00:00", "2023-01-02 12:00:00", 120, 400],
["2023-01-03 12:00:00", "2023-01-02 12:00:00", 140, 500],
["2023-01-03 12:00:00", "2023-01-03 12:00:00", 160, 700],
["2023-01-04 12:00:00", "2023-01-03 12:00:00", 200, 300],
["2023-01-04 12:00:00", "2023-01-04 12:00:00", 430, 200]],
columns=["datetime","day_fetched","col_a","col_b"])
# days ahead
df["day_ahead"] = np.where(df["datetime"] == df["day_fetched"], 1, 2)
# column of None's for next section
df["na"] = None
# overwrite dataframe with new df
df = pd.DataFrame(data=np.where(df["day_ahead"] == 1,
[df["datetime"], df["day_fetched"],
df["col_a"], df["na"],
df["col_b"], df["na"],
df["day_ahead"]],
[df["datetime"], df["day_fetched"],
df["na"], df["col_a"],
df["na"], df["col_b"],
df["day_ahead"]]).T,
columns=["datetime", "day_fetched",
"col_a_1", "col_a_2",
"col_b_1", "col_b_2",
"day_ahead"])
df
# datetime day_fetched col_a_1 ... col_b_1 col_b_2 day_ahead
# 0 2023-01-02 12:00:00 2023-01-01 12:00:00 None ... None 200 2
# 1 2023-01-02 12:00:00 2023-01-02 12:00:00 120 ... 400 None 1
# 2 2023-01-03 12:00:00 2023-01-02 12:00:00 None ... None 500 2
# 3 2023-01-03 12:00:00 2023-01-03 12:00:00 160 ... 700 None 1
# 4 2023-01-04 12:00:00 2023-01-03 12:00:00 None ... None 300 2
# 5 2023-01-04 12:00:00 2023-01-04 12:00:00 430 ... 200 None 1
# [6 rows x 7 columns]
When asking a question please provide data that can be easily copied, such as by using df.to_dict().
EDIT - Generalised for many columns
Here is a (more complicated) bit of code that uses a list comprehension to pivot based on the value of day_ahead for each col_ and concatenates these to produce the same result:
df = pd.concat(
[df.pivot_table(index=[df.index, "datetime", "day_fetched"],
columns=["day_ahead"],
values=x).add_prefix(x+"_") for x in \
df.columns[df.columns.str.startswith("col_")]] + \
[df.set_index([df.index, "datetime", "day_fetched"])["day_ahead"]],
axis=1).reset_index(level=[1, 2])
The pivot_table call creates the pivot table, and add_prefix adds the column name and "_" as a prefix; the list comprehension repeats this for each column in df whose name starts with "col_". The set_index(...)["day_ahead"] term appends the day_ahead column at the end of the DataFrame. Finally, reset_index turns datetime and day_fetched back into columns.
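As a possibly simpler alternative to the pivot, the same split can be sketched with Series.where, which keeps the values where a condition holds and sets the rest to NaN. This is only a sketch under the assumption that the date columns are first parsed with pd.to_datetime; the resulting column order differs slightly from the output above.

```python
import pandas as pd

df = pd.DataFrame(
    [["2023-01-02 12:00:00", "2023-01-01 12:00:00", 100, 200],
     ["2023-01-02 12:00:00", "2023-01-02 12:00:00", 120, 400],
     ["2023-01-03 12:00:00", "2023-01-02 12:00:00", 140, 500],
     ["2023-01-03 12:00:00", "2023-01-03 12:00:00", 160, 700],
     ["2023-01-04 12:00:00", "2023-01-03 12:00:00", 200, 300],
     ["2023-01-04 12:00:00", "2023-01-04 12:00:00", 430, 200]],
    columns=["datetime", "day_fetched", "col_a", "col_b"],
)
df["datetime"] = pd.to_datetime(df["datetime"])
df["day_fetched"] = pd.to_datetime(df["day_fetched"])

# 1 when both timestamps fall on the same calendar date, otherwise 2.
same_date = df["datetime"].dt.normalize() == df["day_fetched"].dt.normalize()
df["day_ahead"] = (~same_date).astype(int) + 1

# Split every col_* column; the list of names is fixed before the loop starts,
# so the newly created col_*_1 / col_*_2 columns are not split again.
for col in [c for c in df.columns if c.startswith("col_")]:
    for k in (1, 2):
        df[f"{col}_{k}"] = df[col].where(df["day_ahead"] == k)
    df = df.drop(columns=col)

print(df)
```

Because where introduces NaN, the split columns come out as floats rather than integers.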
I have two years' worth of data in a DataFrame called df, with an additional column called dayNo which labels the day of the year.
Code which handles dayNo:
df['dayNo'] = pd.to_datetime(df['TradeDate'], dayfirst=True).dt.day_of_year
I would like to amend dayNo so that when 2023 begins, dayNo doesn't reset to 1 but continues with 366, 367 and so on.
Maybe a completely different approach will have to be taken to what I've done above. Any help greatly appreciated, Thanks!
You could define a start day to count from, and use the number of days elapsed since that point as your column. An example using self-generated data to illustrate the point:
df = pd.DataFrame({"dates": pd.date_range("2022-12-29", "2023-01-03", freq="8H")})
start = pd.Timestamp("2021-12-31")
df["dayNo"] = df["dates"].sub(start).dt.days
dates dayNo
0 2022-12-29 00:00:00 363
1 2022-12-29 08:00:00 363
2 2022-12-29 16:00:00 363
3 2022-12-30 00:00:00 364
4 2022-12-30 08:00:00 364
5 2022-12-30 16:00:00 364
6 2022-12-31 00:00:00 365
7 2022-12-31 08:00:00 365
8 2022-12-31 16:00:00 365
9 2023-01-01 00:00:00 366
10 2023-01-01 08:00:00 366
11 2023-01-01 16:00:00 366
12 2023-01-02 00:00:00 367
13 2023-01-02 08:00:00 367
14 2023-01-02 16:00:00 367
15 2023-01-03 00:00:00 368
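If you would rather not hard-code the start date, it can be derived from the data. The sketch below assumes a TradeDate column of day-first strings, as in the question; the four sample dates are made up for illustration.

```python
import pandas as pd

# Hypothetical sample in the question's day-first format.
df = pd.DataFrame({"TradeDate": ["30/12/2022", "31/12/2022", "01/01/2023", "02/01/2023"]})
dates = pd.to_datetime(df["TradeDate"], dayfirst=True)

# Anchor on the day before 1 January of the earliest year in the data,
# so that 1 January of that year gets dayNo 1.
start = pd.Timestamp(year=dates.min().year, month=1, day=1) - pd.Timedelta(days=1)
df["dayNo"] = dates.sub(start).dt.days

print(df)
```

Unlike a counter, this keeps gaps in the data visible: a missing trading day still advances dayNo.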
You are nearly there with your solution; just use apply to get the final result:
df['dayNo'] = df['dayNo'].apply(lambda x : x if x>= df.loc[0].dayNo else x+df.loc[0].dayNo)
df
Out[108]:
dates TradeDate dayNo
0 2022-12-31 00:00:00 2022-12-31 365
1 2022-12-31 01:00:00 2022-12-31 365
2 2022-12-31 02:00:00 2022-12-31 365
3 2022-12-31 03:00:00 2022-12-31 365
4 2022-12-31 04:00:00 2022-12-31 365
.. ... ... ...
68 2023-01-02 20:00:00 2023-01-02 367
69 2023-01-02 21:00:00 2023-01-02 367
70 2023-01-02 22:00:00 2023-01-02 367
71 2023-01-02 23:00:00 2023-01-02 367
72 2023-01-03 00:00:00 2023-01-03 368
Suppose we have a pandas DataFrame created by the following script (inspired by Chrysophylaxs' DataFrame):
import pandas as pd
df = pd.DataFrame({'TradeDate': pd.date_range("2022-12-29", "2030-01-03", freq="8H")})
The DataFrame then has dates from 2022 to 2030:
TradeDate
0 2022-12-29 00:00:00
1 2022-12-29 08:00:00
2 2022-12-29 16:00:00
3 2022-12-30 00:00:00
4 2022-12-30 08:00:00
... ...
7682 2030-01-01 16:00:00
7683 2030-01-02 00:00:00
7684 2030-01-02 08:00:00
7685 2030-01-02 16:00:00
7686 2030-01-03 00:00:00
[7687 rows x 1 columns]
I propose the following code, with inline comments, to reach our goal:
import pandas as pd
df = pd.DataFrame({'TradeDate': pd.date_range("2022-12-29", "2030-01-03", freq="8H")})
# Initialize Days counter
dyc = df['TradeDate'].iloc[0].dayofyear
# Initialize Previous day of Year
prv_dof = dyc
def func(row):
global dyc, prv_dof
# Get the day of the year
dof = row.iloc[0].dayofyear
# If New day then increment days counter
if dof != prv_dof:
dyc+=1
prv_dof = dof
return dyc
df['dayNo'] = df.apply(func, axis=1)
Resulting DataFrame:
TradeDate dayNo
0 2022-12-29 00:00:00 363
1 2022-12-29 08:00:00 363
2 2022-12-29 16:00:00 363
3 2022-12-30 00:00:00 364
4 2022-12-30 08:00:00 364
... ... ...
7682 2030-01-01 16:00:00 2923
7683 2030-01-02 00:00:00 2924
7684 2030-01-02 08:00:00 2924
7685 2030-01-02 16:00:00 2924
7686 2030-01-03 00:00:00 2925
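The same counter (increment by one whenever the calendar day changes) can probably be expressed without globals or a row-wise apply. A minimal sketch, assuming the rows are sorted by TradeDate and using a shorter date range than above:

```python
import pandas as pd

df = pd.DataFrame(
    {"TradeDate": pd.date_range("2022-12-29", "2023-01-03", freq=pd.Timedelta(hours=8))}
)

# Calendar day of each row (time set to midnight).
day = df["TradeDate"].dt.normalize()

# True on the first row and wherever the day differs from the previous row.
new_day = day.ne(day.shift())

# The running count of day changes, offset so the first row keeps its day of year.
df["dayNo"] = new_day.cumsum() + df["TradeDate"].iloc[0].dayofyear - 1

print(df)
```

Like the function above, this counts distinct days present in the data, so a skipped calendar day advances the counter by one only.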
I have this df:
Index Dates
0 2017-01-01 23:30:00
1 2017-01-12 22:30:00
2 2017-01-20 13:35:00
3 2017-01-21 14:25:00
4 2017-01-28 22:30:00
5 2017-08-01 13:00:00
6 2017-09-26 09:39:00
7 2017-10-08 06:40:00
8 2017-10-04 07:30:00
9 2017-12-13 07:40:00
10 2017-12-31 14:55:00
The goal was to create a new df that says "morning" for times between 5:00 and 11:59. To achieve this I converted those hours to booleans:
hour_morning=(pd.to_datetime(df['Dates']).dt.strftime('%H:%M:%S').between('05:00:00','11:59:00'))
and then passed them to a list with the string "morning":
text_morning=[str('morning') for x in hour_morning if x==True]
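The problem is in the last line: it only returns 'morning' string values, as if x ignored the if condition. Why is this happening and how do I fix it?
A trailing if in a list comprehension filters elements out, so only the True entries remain. Put the condition in the expression instead: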
text_morning=[str('morning') if x==True else 'not_morning' for x in hour_morning ]
You can also use np.where:
text_morning = np.where(hour_morning, 'morning', 'not morning')
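To make the difference concrete, here is a small sketch comparing the three forms on four of the question's timestamps. The variable names filtered, labelled and labels_np are illustrative only.

```python
import numpy as np
import pandas as pd

dates = pd.Series(pd.to_datetime([
    "2017-01-01 23:30:00", "2017-09-26 09:39:00",
    "2017-10-08 06:40:00", "2017-12-31 14:55:00",
]))
hour_morning = dates.dt.strftime("%H:%M:%S").between("05:00:00", "11:59:00")

# Trailing `if`: a filter, so False entries are dropped and the list gets shorter.
filtered = ["morning" for x in hour_morning if x]

# Conditional expression: one label per input element.
labelled = ["morning" if x else "not morning" for x in hour_morning]

# Vectorised equivalent of the conditional expression.
labels_np = np.where(hour_morning, "morning", "not morning")

print(filtered)
print(labelled)
print(labels_np)
```

Only the last two keep the same length as the input, which is what you need to assign the result back as a column.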
Given:
Dates values
0 2017-01-01 23:30:00 0
1 2017-01-12 22:30:00 1
2 2017-01-20 13:35:00 2
3 2017-01-21 14:25:00 3
4 2017-01-28 22:30:00 4
5 2017-08-01 13:00:00 5
6 2017-09-26 09:39:00 6
7 2017-10-08 06:40:00 7
8 2017-10-04 07:30:00 8
9 2017-12-13 07:40:00 9
10 2017-12-31 14:55:00 10
Doing:
# df.Dates = pd.to_datetime(df.Dates)
df = df.set_index("Dates")
Now we can use pd.DataFrame.between_time:
new_df = df.between_time('05:00:00','11:59:00')
print(new_df)
Output:
values
Dates
2017-09-26 09:39:00 6
2017-10-08 06:40:00 7
2017-10-04 07:30:00 8
2017-12-13 07:40:00 9
Or use it to update the original dataframe:
df.loc[df.between_time('05:00:00','11:59:00').index, 'morning'] = 'morning'
# Output:
values morning
Dates
2017-01-01 23:30:00 0 NaN
2017-01-12 22:30:00 1 NaN
2017-01-20 13:35:00 2 NaN
2017-01-21 14:25:00 3 NaN
2017-01-28 22:30:00 4 NaN
2017-08-01 13:00:00 5 NaN
2017-09-26 09:39:00 6 morning
2017-10-08 06:40:00 7 morning
2017-10-04 07:30:00 8 morning
2017-12-13 07:40:00 9 morning
2017-12-31 14:55:00 10 NaN
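If you would rather leave the index untouched, DatetimeIndex.indexer_between_time returns the integer positions of the matching rows. A minimal sketch on four of the rows above; the "not morning" default label is an assumption, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "Dates": pd.to_datetime([
        "2017-01-01 23:30:00", "2017-09-26 09:39:00",
        "2017-10-08 06:40:00", "2017-12-31 14:55:00",
    ]),
    "values": [0, 6, 7, 10],
})

# Integer positions of rows whose time of day lies in the interval (both ends included).
positions = pd.DatetimeIndex(df["Dates"]).indexer_between_time("05:00:00", "11:59:00")

# Start from a default label, then overwrite the matching positions.
df["morning"] = "not morning"
df.iloc[positions, df.columns.get_loc("morning")] = "morning"

print(df)
```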
I'm trying to filter my dataframe to keep only rows at a 3-hourly frequency, i.e. 0000hr, 0300hr, 0600hr, 0900hr, 1200hr, 1500hr, 1800hr, 2100hr, and so on.
A sample of my dataframe would look like this:
Time A
2019-05-25 03:54:00 1
2019-05-25 03:57:00 2
2019-05-25 04:00:00 3
...
2020-05-25 03:54:00 4
2020-05-25 03:57:00 5
2020-05-25 04:00:00 6
Desired output:
Time A
2019-05-25 06:00:00 1
2019-05-25 09:00:00 2
2019-05-25 12:00:00 3
...
2020-05-25 00:00:00 4
2020-05-25 03:00:00 5
2020-05-25 06:00:00 6
2020-05-25 09:00:00 6
2020-05-25 12:00:00 6
2020-05-25 15:00:00 6
2020-05-25 18:00:00 6
2020-05-25 21:00:00 6
2020-05-26 00:00:00 6
...
You can define a date range with a 3-hour interval using pd.date_range() and then filter your dataframe with .loc and isin(), as follows:
date_rng_3H = pd.date_range(start=df['Time'].dt.date.min(), end=df['Time'].dt.date.max() + pd.DateOffset(days=1), freq='3H')
df_out = df.loc[df['Time'].isin(date_rng_3H)]
Input data:
date_rng = pd.date_range(start='2019-05-25 03:54:00', end='2020-05-25 04:00:00', freq='3T')
np.random.seed(123)
df = pd.DataFrame({'Time': date_rng, 'A': np.random.randint(1, 6, len(date_rng))})
Time A
0 2019-05-25 03:54:00 3
1 2019-05-25 03:57:00 5
2 2019-05-25 04:00:00 3
3 2019-05-25 04:03:00 2
4 2019-05-25 04:06:00 4
... ... ...
175678 2020-05-25 03:48:00 2
175679 2020-05-25 03:51:00 1
175680 2020-05-25 03:54:00 2
175681 2020-05-25 03:57:00 2
175682 2020-05-25 04:00:00 1
175683 rows × 2 columns
Output:
print(df_out)
Time A
42 2019-05-25 06:00:00 4
102 2019-05-25 09:00:00 2
162 2019-05-25 12:00:00 1
222 2019-05-25 15:00:00 3
282 2019-05-25 18:00:00 5
... ... ...
175422 2020-05-24 15:00:00 1
175482 2020-05-24 18:00:00 5
175542 2020-05-24 21:00:00 2
175602 2020-05-25 00:00:00 3
175662 2020-05-25 03:00:00 3
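An alternative that avoids building the helper date range is to test the time components directly. This is a sketch on a shortened version of the input data, where column A is just a running counter rather than random values:

```python
import pandas as pd

times = pd.date_range("2019-05-25 03:54:00", "2019-05-25 12:00:00", freq="3min")
df = pd.DataFrame({"Time": times, "A": range(len(times))})

# Keep rows that fall exactly on 00:00, 03:00, 06:00, ... of any day.
t = df["Time"].dt
on_3h_mark = (t.hour % 3 == 0) & (t.minute == 0) & (t.second == 0)
df_out = df.loc[on_3h_mark]

print(df_out)
```

The mask does not depend on the date span of the data, at the cost of ignoring sub-second components, which would need an extra check if the timestamps carry them.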
I've looked around (e.g. Python - Locating the closest timestamp) but can't find anything on this.
I have a list of datetimes, and a dataframe containing 10k+ rows of start and end times (formatted as datetimes).
The dataframe is effectively listing parameters for runs of an instrument.
The list describes times from an alarm event.
The datetime list items are all within a row (i.e. between a start and end time) in the dataframe. Is there an easy way to locate the rows which would contain the timeframe within which the alarm time would be? (sorry for poor wording there!)
e.g.
for i in alarms:
df.loc[(df.start_time < i) & (df.end_time > i), 'Flag'] = 'Alarm'
(this didn't work but shows my approach)
Example datasets
# making list of datetimes for the alarms
df = pd.DataFrame({'Alarms':["18/07/19 14:56:21", "19/07/19 15:05:15", "20/07/19 15:46:00"]})
df['Alarms'] = pd.to_datetime(df['Alarms'])
alarms = list(df.Alarms.unique())
# dataframe of runs containing start and end times
n=33
rng1 = pd.date_range('2019-07-18', '2019-07-22', periods=n)
rng2 = pd.date_range('2019-07-18 03:00:00', '2019-07-22 03:00:00', periods=n)
df = pd.DataFrame({ 'start_date': rng1, 'end_Date': rng2})
Herein a flag would go against line (well, index) 4, 13 and 21.
You can use pandas.IntervalIndex here:
# Create and set IntervalIndex
intervals = pd.IntervalIndex.from_arrays(df.start_date, df.end_Date)
df = df.set_index(intervals)
# Update using loc
df.loc[alarms, 'flag'] = 'alarm'
# Finally, reset_index
df = df.reset_index(drop=True)
[out]
start_date end_Date flag
0 2019-07-18 00:00:00 2019-07-18 03:00:00 NaN
1 2019-07-18 03:00:00 2019-07-18 06:00:00 NaN
2 2019-07-18 06:00:00 2019-07-18 09:00:00 NaN
3 2019-07-18 09:00:00 2019-07-18 12:00:00 NaN
4 2019-07-18 12:00:00 2019-07-18 15:00:00 alarm
5 2019-07-18 15:00:00 2019-07-18 18:00:00 NaN
6 2019-07-18 18:00:00 2019-07-18 21:00:00 NaN
7 2019-07-18 21:00:00 2019-07-19 00:00:00 NaN
8 2019-07-19 00:00:00 2019-07-19 03:00:00 NaN
9 2019-07-19 03:00:00 2019-07-19 06:00:00 NaN
10 2019-07-19 06:00:00 2019-07-19 09:00:00 NaN
11 2019-07-19 09:00:00 2019-07-19 12:00:00 NaN
12 2019-07-19 12:00:00 2019-07-19 15:00:00 NaN
13 2019-07-19 15:00:00 2019-07-19 18:00:00 alarm
14 2019-07-19 18:00:00 2019-07-19 21:00:00 NaN
15 2019-07-19 21:00:00 2019-07-20 00:00:00 NaN
16 2019-07-20 00:00:00 2019-07-20 03:00:00 NaN
17 2019-07-20 03:00:00 2019-07-20 06:00:00 NaN
18 2019-07-20 06:00:00 2019-07-20 09:00:00 NaN
19 2019-07-20 09:00:00 2019-07-20 12:00:00 NaN
20 2019-07-20 12:00:00 2019-07-20 15:00:00 NaN
21 2019-07-20 15:00:00 2019-07-20 18:00:00 alarm
22 2019-07-20 18:00:00 2019-07-20 21:00:00 NaN
23 2019-07-20 21:00:00 2019-07-21 00:00:00 NaN
24 2019-07-21 00:00:00 2019-07-21 03:00:00 NaN
25 2019-07-21 03:00:00 2019-07-21 06:00:00 NaN
26 2019-07-21 06:00:00 2019-07-21 09:00:00 NaN
27 2019-07-21 09:00:00 2019-07-21 12:00:00 NaN
28 2019-07-21 12:00:00 2019-07-21 15:00:00 NaN
29 2019-07-21 15:00:00 2019-07-21 18:00:00 NaN
30 2019-07-21 18:00:00 2019-07-21 21:00:00 NaN
31 2019-07-21 21:00:00 2019-07-22 00:00:00 NaN
32 2019-07-22 00:00:00 2019-07-22 03:00:00 NaN
You named your columns start_date and end_Date, but in your for loop you use start_time and end_time.
Try this:
import pandas as pd
df = pd.DataFrame({'Alarms': ["18/07/19 14:56:21", "19/07/19 15:05:15", "20/07/19 15:46:00"]})
df['Alarms'] = pd.to_datetime(df['Alarms'])
alarms = list(df.Alarms.unique())
# dataframe of runs containing start and end times
n = 33
rng1 = pd.date_range('2019-07-18', '2019-07-22', periods=n)
rng2 = pd.date_range('2019-07-18 03:00:00', '2019-07-22 03:00:00', periods=n)
df = pd.DataFrame({'start_date': rng1, 'end_Date': rng2})
for i in alarms:
df.loc[(df.start_date < i) & (df.end_Date > i), 'Flag'] = 'Alarm'
print(df[df['Flag']=='Alarm']['Flag'])
Output:
4 Alarm
13 Alarm
21 Alarm
Name: Flag, dtype: object
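With many alarms, the Python loop can probably be replaced by a single broadcast comparison. The sketch below uses the same example data; the explicit format string passed to pd.to_datetime is an assumption about the alarm strings (day first, two-digit year).

```python
import pandas as pd

alarms = pd.to_datetime(
    ["18/07/19 14:56:21", "19/07/19 15:05:15", "20/07/19 15:46:00"],
    format="%d/%m/%y %H:%M:%S",  # assumed layout of the alarm strings
)

n = 33
df = pd.DataFrame({
    "start_date": pd.date_range("2019-07-18", "2019-07-22", periods=n),
    "end_Date": pd.date_range("2019-07-18 03:00:00", "2019-07-22 03:00:00", periods=n),
})

# Column vectors of shape (n_rows, 1) broadcast against the (n_alarms,) array,
# giving an (n_rows, n_alarms) boolean matrix.
starts = df["start_date"].to_numpy()[:, None]
ends = df["end_Date"].to_numpy()[:, None]
alarm_values = alarms.to_numpy()

# A row is flagged if any alarm falls strictly between its start and end.
contains_alarm = ((starts < alarm_values) & (alarm_values < ends)).any(axis=1)
df.loc[contains_alarm, "Flag"] = "Alarm"

print(df[df["Flag"] == "Alarm"])
```

The intermediate matrix has one cell per (row, alarm) pair, so for very large inputs the loop or the IntervalIndex approach may use less memory.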