Combining dataframes with differing dates column - python

I have a dataset of hourly prices where I have produced a dataframe that contains the minimum price from the previous day using:
df_min = df_hour_0[['Price_REG1', 'Price_REG2', 'Price_REG3',
'Price_REG4']].between_time('00:00', '23:00').resample('d').min()
This gives me:
Price_REG1 Price_REG2 Price_REG3 Price_REG4
date
2020-01-01 00:00:00 25.07 25.07 25.07 25.07
2020-01-02 00:00:00 12.07 12.07 12.07 12.07
2020-01-03 00:00:00 0.14 0.14 0.14 0.14
2020-01-04 00:00:00 3.83 3.83 3.83 3.83
2020-01-05 00:00:00 25.77 25.77 25.77 25.77
Now, I want to combine this df with 24 other df's, one for each hour (hour_0 below):
Price_REG1 Price_REG2 ... Price_24_3 Price_24_4
date ...
2020-01-01 00:00:00 30.83 30.83 ... NaN NaN
2020-01-02 00:00:00 24.81 24.81 ... 25.88 25.88
2020-01-03 00:00:00 24.39 24.39 ... 27.69 27.69
2020-01-04 00:00:00 22.04 22.04 ... 25.70 25.70
2020-01-05 00:00:00 25.77 25.77 ... 27.37 27.37
Which I do this way:
df_hour_0 = pd.concat([df_hour_0, df_min, df_max], axis=1)
This works fine for the df from the first hour, since the dates match. But for the other df's the date is "2020-01-01 01:00:00", "2020-01-01 02:00:00" etc.
Since the dates don't match, pd.concat gives me twice as many observations, where every other observation is null:
Price_REG1 Price_REG2 ... Price_3_min Price_4_min
date ...
2020-01-01 00:00:00 NaN NaN ... NaN NaN
2020-01-01 01:00:00 28.78 28.78 ... NaN NaN
2020-01-02 00:00:00 NaN NaN ... 30.83 30.83
2020-01-02 01:00:00 12.07 12.07 ... NaN NaN
2020-01-03 00:00:00 NaN NaN ... 31.20 31.20
I tried to fix this by:
df_max = df_max.reset_index()
df_max = df_max.drop(['date'], axis=1)
But this gives me the same issue in another form: instead of every other row being null, the whole df_min df is just inserted at the bottom of the first df.
I want to keep the date. Otherwise I guess it would be possible to reset the index in both df's and combine them by position instead of by date.
Thank you.

One option could be to normalize to the date:
dfs = [df_hour_0, df_min, df_max]
pd.concat([d.set_axis(d.index.normalize()) for d in dfs], axis=1)
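A minimal, self-contained sketch of that idea. The two small frames and the names df_hour_1 and Price_1_min are made up for illustration; only the normalize/concat line comes from the answer above:

```python
import pandas as pd

# Hypothetical data: an hourly slice stamped 01:00 and a daily minimum stamped 00:00.
df_hour_1 = pd.DataFrame(
    {"Price_REG1": [28.78, 12.07]},
    index=pd.to_datetime(["2020-01-01 01:00", "2020-01-02 01:00"]),
)
df_min = pd.DataFrame(
    {"Price_1_min": [25.07, 12.07]},
    index=pd.to_datetime(["2020-01-01", "2020-01-02"]),
)

# normalize() resets the time part of every timestamp to midnight,
# so both indexes share the same labels and concat aligns row by row.
dfs = [df_hour_1, df_min]
combined = pd.concat([d.set_axis(d.index.normalize()) for d in dfs], axis=1)
print(combined)
```

Note that the hour is dropped from the index this way. If it is still needed, it could be stored in a regular column before normalizing.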

Related

Using Pandas to filter 2 specific day of year

I have a big CSV dataset that I want to filter with pandas and save to a new CSV file.
The aim is to find all the records dated the 1st or the 15th.
When I use the following code, it works:
print (df[(df['data___date_time'].dt.day == 1)])
and the result looks as follows:
data___date_time NO2 SO2 PM10
26 2020-07-01 00:00:00 1.591616 0.287604 NaN
27 2020-07-01 01:00:00 1.486401 NaN NaN
28 2020-07-01 02:00:00 1.362056 NaN NaN
29 2020-07-01 03:00:00 1.295101 0.194399 NaN
30 2020-07-01 04:00:00 1.260667 0.362168 NaN
... ... ... ...
17054 2022-07-01 19:00:00 2.894369 2.077140 19.34
17055 2022-07-01 20:00:00 3.644265 1.656386 23.09
17056 2022-07-01 21:00:00 2.907760 1.291555 23.67
17057 2022-07-01 22:00:00 2.974715 1.318185 27.68
17058 2022-07-01 23:00:00 2.858022 1.169057 25.18
However, when I use the following code, nothing comes out:
print (df[(df['data___date_time'].dt.day == 1) & (df['data___date_time'].dt.day == 15)])
This just gives me:
Empty DataFrame
Columns: [data___date_time, NO2, SO2, PM10]
Index: []
Any idea what the problem could be?
The problem is logical: the same row cannot be both day 1 and day 15, so the & condition is never true. You need | (element-wise OR) instead. If you need to test multiple values, it is simpler to use Series.isin:
df = pd.DataFrame({'data___date_time': pd.date_range('2000-01-01', periods=20)})
print (df[df['data___date_time'].dt.day.isin([1,15])])
data___date_time
0 2000-01-01
14 2000-01-15
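As a quick check, here is a small sketch comparing the | form with the isin form on the same sample frame. The variable names day, with_or and with_isin are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({"data___date_time": pd.date_range("2000-01-01", periods=20)})
day = df["data___date_time"].dt.day

# Element-wise OR: a row is kept if it is day 1 or day 15.
with_or = df[(day == 1) | (day == 15)]

# Equivalent membership test, easier to extend to more days.
with_isin = df[day.isin([1, 15])]

print(with_or.equals(with_isin))
```

Both selections should return the same rows. The parentheses around each comparison matter, because | binds more tightly than ==.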

Pandas sort_index only the given timeframe

I have a pandas Series with a DatetimeIndex and some values, which looks like the following:
df
2020-01-01 00:00:00 39.6
2020-01-01 00:15:00 35.6
2020-01-01 00:30:00 35.6
2020-01-01 00:45:00 39.2
2020-01-01 01:00:00 56.7
...
2020-12-31 23:45:00 56.3
I am adding some values to this df with .append(). Since the result is not sorted, I sort its index via .sort_index(). However, what I would like to achieve is to sort only a given day.
So for example, I add some values to day 2020-01-01, and since the added values end up after the end of day 2020-01-01, I just need to sort the first day of the year. NOT ALL THE DF.
Here is an example, NaN value is added with .append():
df
2020-01-01 00:00:00 39.6
2020-01-01 00:15:00 35.6
...
2020-01-01 23:45:00 34.3
2020-01-01 15:00:00 NaN
...
2020-12-31 23:45:00 56.3
Now I cannot df.sort_index(), because it breaks other days. That is why I just want to apply .sort_index() to the day 2020-01-01. How do I do that?
WHAT I TRIED SO FAR AND DOES NOT WORK:
df.loc['2020-01-01'] = df.loc['2020-01-01'].sort_index()
Filter the rows for 2020-01-01, sort them, and join them back with the rows that did not match:
mask = df.index.normalize() == '2020-01-01'
df = pd.concat([df[mask].sort_index(), df[~mask]])
print (df)
2020-01-01 00:00:00 39.6
2020-01-01 00:15:00 35.6
2020-01-01 15:00:00 NaN
2020-01-01 23:45:00 34.3
2020-12-31 23:45:00 56.3
Name: a, dtype: float64
Another idea:
df1 = df.loc['2020-01-01'].sort_index()
df = pd.concat([df1, df.drop(df1.index)])
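For reference, a reproducible sketch of the mask-based approach. The sample values are taken from the question, but the Series name s and the way the late value is inserted are assumptions. pd.concat is used in place of .append(), which was removed in pandas 2.0:

```python
import numpy as np
import pandas as pd

s = pd.Series(
    [39.6, 35.6, 34.3, 56.3],
    index=pd.to_datetime(
        ["2020-01-01 00:00", "2020-01-01 00:15",
         "2020-01-01 23:45", "2020-12-31 23:45"]
    ),
    name="a",
)

# Place a late value for 2020-01-01 after that day's last row (out of order).
late = pd.Series([np.nan], index=pd.to_datetime(["2020-01-01 15:00"]), name="a")
s = pd.concat([s.iloc[:3], late, s.iloc[3:]])

# Sort only the rows of the chosen day, then put the untouched rows back.
mask = s.index.normalize() == pd.Timestamp("2020-01-01")
s = pd.concat([s[mask].sort_index(), s[~mask]])
print(s)
```

This moves the selected day's rows to the front of the result. That matches the example because 2020-01-01 is the first day. For a day in the middle of the data, the pieces before and after it would need to be concatenated separately to preserve their positions.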

How to extract hourly data from a df in python?

I have the following df
dates Final
2020-01-01 00:15:00 94.7
2020-01-01 00:30:00 94.1
2020-01-01 00:45:00 94.1
2020-01-01 01:00:00 95.0
2020-01-01 01:15:00 96.6
2020-01-01 01:30:00 98.4
2020-01-01 01:45:00 99.8
2020-01-01 02:00:00 99.8
2020-01-01 02:15:00 98.0
2020-01-01 02:30:00 95.1
2020-01-01 02:45:00 91.9
2020-01-01 03:00:00 89.5
The entire dataset runs until 2021-01-01 00:00:00 (value 95.6) at 15-minute intervals.
Since the frequency is 15 minutes, I would like to change it to 1 hour and maybe drop the values in between.
Expected output
dates Final
2020-01-01 01:00:00 95.0
2020-01-01 02:00:00 99.8
2020-01-01 03:00:00 89.5
With the last row being 2021-01-01 00:00:00 95.6
How can this be done?
Thanks
Use Series.dt.minute to perform boolean indexing:
df_filtered = df.loc[df['dates'].dt.minute.eq(0)]
#if necessary
#df_filtered = df.loc[pd.to_datetime(df['dates']).dt.minute.eq(0)]
print(df_filtered)
dates Final
3 2020-01-01 01:00:00 95.0
7 2020-01-01 02:00:00 99.8
11 2020-01-01 03:00:00 89.5
If you're doing data analysis or data science, I don't think dropping the values in between is a good approach at all! You should aggregate them instead, for example by summing them, I guess (I don't know about your use case, but I know some stuff about time series data).
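If aggregating is preferred over dropping, one possible sketch uses resample. The choice of mean() is an assumption here; sum(), last() or another aggregate may fit the use case better. closed='right' and label='right' make each hourly row cover the four readings ending at that hour:

```python
import pandas as pd

df = pd.DataFrame({
    "dates": pd.date_range("2020-01-01 00:15", periods=12, freq="15min"),
    "Final": [94.7, 94.1, 94.1, 95.0, 96.6, 98.4,
              99.8, 99.8, 98.0, 95.1, 91.9, 89.5],
})

# Each bin is (hh-1:00, hh:00] and is labelled with its right edge,
# so 00:15, 00:30, 00:45 and 01:00 are averaged into the 01:00 row.
hourly = df.resample("h", on="dates", closed="right", label="right")["Final"].mean()
print(hourly)
```

With the sample rows from the question this gives three hourly values, labelled 01:00, 02:00 and 03:00.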

Merge dataframes with different datetime frequencies without NaNs

I'm trying to merge two dataframes with different datetime frequencies and also filling up missing values with duplicates.
Dataframe df1 with minute frequency:
time
0 2017-06-01 00:00:00
1 2017-06-01 00:01:00
2 2017-06-01 00:02:00
3 2017-06-01 00:03:00
4 2017-06-01 00:04:00
Dataframe df2 with hourly frequency:
time2 temp hum
1 2017-06-01 00:00:00 13.5 90.0
2 2017-06-01 01:00:00 12.2 95.0
3 2017-06-01 02:00:00 11.7 96.0
4 2017-06-01 03:00:00 11.5 96.0
5 2017-06-01 04:00:00 11.1 97.0
So far I have merged these dataframes, but I get NaNs:
m2o_merge = df1.merge(df2, left_on= 'time', right_on= 'time2', how='outer')
m2o_merge.head()
time time2 temp hum
0 2017-06-01 00:00:00 2017-06-01 13.5 90.0
1 2017-06-01 00:01:00 NaT NaN NaN
2 2017-06-01 00:02:00 NaT NaN NaN
3 2017-06-01 00:03:00 NaT NaN NaN
4 2017-06-01 00:04:00 NaT NaN NaN
My desired dataframe should look like this (NaN filled up with hourly value df2):
time temp hum
0 2017-06-01 00:00:00 13.5 90.0
1 2017-06-01 00:01:00 13.5 90.0
2 2017-06-01 00:02:00 13.5 90.0
3 2017-06-01 00:03:00 13.5 90.0
4 2017-06-01 00:04:00 13.5 90.0
...
So far I have found this solution: merge series/dataframe with different time frequencies in python, but the datetime column is not my index. Does anyone know how to get there?
As suggested by Ben Pap, I used the following two steps as a solution:
import pandas as pd
# 'min' and 'h' are used because the aliases 'T' and 'H' are deprecated in recent pandas versions
data1 = {'time':pd.date_range('2017-06-01 00:00:00','2017-06-01 00:09:00', freq='min')}
data2 = {'time2':pd.date_range('2017-06-01 00:00:00','2017-06-01 04:00:00', freq='h'), 'temp':[13.5,12.2,11.7,11.5,11.1], 'hum':[90.0,95.0,96.0,96.0,97.0]}
# Create DataFrame
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
m2o_merge = df1.merge(df2, left_on= 'time', right_on= 'time2', how='outer')
m2o_merge.head()
# forward-fill the hourly values into the minute rows (fillna(method='ffill') is deprecated)
m2o_merge = m2o_merge.ffill()
filled_df = m2o_merge.drop(['time2'], axis=1)
filled_df.head()
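An alternative that avoids the separate fill step is pd.merge_asof, which matches each minute row to the most recent hourly row. This is a different technique from the merge-then-ffill answer above, shown only as a sketch on the same sample data. Both frames must be sorted by their time columns:

```python
import pandas as pd

df1 = pd.DataFrame({
    "time": pd.date_range("2017-06-01 00:00", periods=10, freq="min"),
})
df2 = pd.DataFrame({
    "time2": pd.date_range("2017-06-01 00:00", periods=5, freq="h"),
    "temp": [13.5, 12.2, 11.7, 11.5, 11.1],
    "hum": [90.0, 95.0, 96.0, 96.0, 97.0],
})

# The default direction is 'backward': each row of df1 takes the last row
# of df2 whose time2 is less than or equal to its time.
filled = pd.merge_asof(df1, df2, left_on="time", right_on="time2")
filled = filled.drop(columns="time2")
print(filled.head())
```

Because only the rows of df1 are kept, hourly rows without a matching minute row do not end up appended at the bottom of the result.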

Choosing time from 2300-0000 for different days

So I'm having an issue with the 23:00-00:00 times for different days in Python.
times A B C D
2003-01-08 00:00:00 NaN 0.086215 0.086135 0.090659
2003-01-08 23:00:00 NaN 0.060930 0.059008 0.057293
2003-01-09 23:00:00 NaN 0.102374 0.101441 0.100743
2003-01-10 00:00:00 NaN 0.078799 0.077739 0.076138
2003-01-10 23:00:00 NaN 0.207653 0.205911 0.202886
2003-01-11 00:00:00 NaN 0.203436 0.201588 0.197515
...
What I'm mainly looking for is to select the 00:00:00 hour, which is why I've applied df = df.reset_index().groupby(df.index.date).first().set_index('times'). But if that hour doesn't exist, it should use the 23:00:00 of the previous day as the 00:00:00 of the next day. The following result is wrong:
times A B C D
2003-01-08 00:00:00 NaN 0.086215 0.086135 0.090659
2003-01-09 23:00:00 NaN 0.102374 0.101441 0.100743
2003-01-10 00:00:00 NaN 0.078799 0.077739 0.076138
2003-01-11 00:00:00 NaN 0.203436 0.201588 0.197515
...
How do I get it to use the 23:00:00 of the previous day in place of a missing 00:00:00 of the next day, to achieve this result?
2003-01-08 00:00:00 NaN 0.086215 0.086135 0.090659
2003-01-08 23:00:00 NaN 0.060930 0.059008 0.057293
2003-01-10 00:00:00 NaN 0.078799 0.077739 0.076138
2003-01-11 00:00:00 NaN 0.203436 0.201588 0.197515
...
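One possible approach, sketched under the assumption that times is a DatetimeIndex: keep every 00:00 row, and keep a 23:00 row only when the index has no row exactly one hour later. Only column B from the sample is reproduced, and the names next_midnight_missing and result are illustrative:

```python
import pandas as pd

idx = pd.to_datetime([
    "2003-01-08 00:00", "2003-01-08 23:00", "2003-01-09 23:00",
    "2003-01-10 00:00", "2003-01-10 23:00", "2003-01-11 00:00",
])
df = pd.DataFrame(
    {"B": [0.086215, 0.060930, 0.102374, 0.078799, 0.207653, 0.203436]},
    index=idx,
)
df.index.name = "times"

hour = df.index.hour

# True where no row exists exactly one hour later (the next day's 00:00 for a 23:00 row).
next_midnight_missing = ~(df.index + pd.Timedelta(hours=1)).isin(df.index)

# Keep all midnight rows, plus 23:00 rows that stand in for a missing midnight.
result = df[(hour == 0) | ((hour == 23) & next_midnight_missing)]
print(result)
```

On the sample rows this keeps 2003-01-08 00:00, 2003-01-08 23:00, 2003-01-10 00:00 and 2003-01-11 00:00. A 23:00 row on the very last day of the data would also be kept, since no later midnight exists for it.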
