I am working with a dataset where the first column holds dates in datetime format and the hours are separate float columns, like this:
date 1.0 2.0 3.0 ... 21.0 22.0 23.0 24.0
0 2021-01-01 24.95 24.35 23.98 ... 27.32 26.98 26.44 25.64
1 2021-01-02 25.59 24.91 24.74 ... 27.38 26.96 26.85 25.94
and what I want to achieve is this:
Date Price
0 2021-01-01 01:00:00 24.95
1 2021-01-01 02:00:00 24.35
2 2021-01-01 03:00:00 23.98
3 2021-01-01 04:00:00 ...
So I have been figuring that the first step should be to convert the hours into datetime format. I have been trying this code, for example:
df[1.0] = pd.to_datetime(df[1.0], format='%h')
which gives: "ValueError: 'h' is a bad directive in format '%h'"
And then I would rearrange the columns and rows. I have been thinking about doing this with pandas pivot_table and transform. Any help would be appreciated. Thank you.
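Side note on the error: '%h' is not a valid strptime directive (the hour directive is '%H'), and even with '%H' you would only get a time of day on a dummy date, not an offset you can add to your date column. Converting the hour columns to Timedelta is the more useful step here, which is what the answer below does. A minimal sketch of the difference, using standalone values rather than your actual columns:
import pandas as pd

pd.to_datetime('01', format='%H')   # Timestamp('1900-01-01 01:00:00') - a time of day on a dummy date
pd.to_timedelta(1.0, unit='h')      # Timedelta('0 days 01:00:00') - an offset that can be added to a date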
Use DataFrame.set_index first, convert all columns to timedeltas, reshape with DataFrame.unstack, and finally add the timedeltas to the dates:
df['date'] = pd.to_datetime(df['date'])
f = lambda x: pd.to_timedelta(float(x), unit='h')
df1 = (df.set_index('date')
         .rename(columns=f)
         .unstack()
         .reset_index(name='Price')
         .assign(date=lambda x: x['date'] + x.pop('level_0')))
print (df1)
date Price
0 2021-01-01 01:00:00 24.95
1 2021-01-02 01:00:00 25.59
2 2021-01-01 02:00:00 24.35
3 2021-01-02 02:00:00 24.91
4 2021-01-01 03:00:00 23.98
5 2021-01-02 03:00:00 24.74
6 2021-01-01 21:00:00 27.32
7 2021-01-02 21:00:00 27.38
8 2021-01-01 22:00:00 26.98
9 2021-01-02 22:00:00 26.96
10 2021-01-01 23:00:00 26.44
11 2021-01-02 23:00:00 26.85
12 2021-01-02 00:00:00 25.64
13 2021-01-03 00:00:00 25.94
Or use DataFrame.melt and then add the variable column, converted to timedeltas, to the dates:
df['date'] = pd.to_datetime(df['date'])
df1 = (df.melt('date', value_name='Price')
         .assign(date=lambda x: x['date'] +
                 pd.to_timedelta(x.pop('variable').astype(float), unit='h'))
         .sort_values('date', ignore_index=True))
print (df1)
date Price
0 2021-01-01 01:00:00 24.95
1 2021-01-01 02:00:00 24.35
2 2021-01-01 03:00:00 23.98
3 2021-01-01 21:00:00 27.32
4 2021-01-01 22:00:00 26.98
5 2021-01-01 23:00:00 26.44
6 2021-01-02 00:00:00 25.64
7 2021-01-02 01:00:00 25.59
8 2021-01-02 02:00:00 24.91
9 2021-01-02 03:00:00 24.74
10 2021-01-02 21:00:00 27.38
11 2021-01-02 22:00:00 26.96
12 2021-01-02 23:00:00 26.85
13 2021-01-03 00:00:00 25.94
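Note that the 24.0 column lands on midnight of the following day (2021-01-01 plus 24 hours is 2021-01-02 00:00:00), which is why the rows 2021-01-02 00:00:00 and 2021-01-03 00:00:00 appear in both outputs.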
I have a time series in a Pandas DataFrame for which I want to add a new column with the minimum values. Specifically, imagine I have the following values in my time series:
time_stamp price da_price min hour_of_day day_of_year
0 2021-01-01 00:00:00 64.84 50.87 50.87 0 1
1 2021-01-01 00:15:00 13.96 50.87 13.96 0 1
2 2021-01-01 00:30:00 12.40 50.87 12.40 0 1
3 2021-01-01 00:45:00 7.70 50.87 7.70 0 1
4 2021-01-01 01:00:00 64.25 48.19 48.19 1 1
5 2021-01-01 01:15:00 14.07 48.19 14.07 1 1
6 2021-01-01 01:30:00 13.25 48.19 13.25 1 1
7 2021-01-01 01:45:00 10.47 48.19 10.47 1 1
I would like to find the minimum values, which is straightforward with pandas functions. However, the only constraint I have is that da_price is composed of the same value for one full hour. So if the average of price over an hour is smaller than da_price, then we report the values giving the lower average as MIN. Here above, (64.84 + 13.96 + 12.40 + 7.70) / 4 ≈ 24.7 < 50.87, so the price values should be reported.
So in substance:
if price gives the MIN average, no problem, we report the values as they are as MIN.
if da_price gives the minimum value, then we report the values as MIN.
Any ideas how I can do this efficiently with Pandas and/or Numpy? Thanks!
If I understand correctly, this should solve the issue (I added some data to the input data to verify your constraint):
Input data
time_stamp price da_price min hour_of_day day_of_year
0 2021-01-01 00:00:00 64.84 50.87 50.87 0 1
1 2021-01-01 00:15:00 13.96 50.87 13.96 0 1
2 2021-01-01 00:30:00 12.40 50.87 12.40 0 1
3 2021-01-01 00:45:00 7.70 50.87 7.70 0 1
4 2021-01-01 01:00:00 64.25 48.19 48.19 1 1
5 2021-01-01 01:15:00 14.07 48.19 14.07 1 1
6 2021-01-01 01:30:00 13.25 48.19 13.25 1 1
7 2021-01-01 01:45:00 10.47 48.19 10.47 1 1
8 2021-01-01 02:00:00 64.25 22.19 48.19 2 1
9 2021-01-01 02:15:00 14.07 22.19 14.07 2 1
10 2021-01-01 02:30:00 13.25 22.19 13.25 2 1
11 2021-01-01 02:45:00 10.47 22.19 10.47 2 1
df['time_stamp'] = pd.to_datetime(df['time_stamp'])
df = df.set_index('time_stamp', drop=True)
# Means - for the third hour the da_price mean is lower
# print(df.resample('1H').mean()[['price','da_price']])
time_stamp price da_price
2021-01-01 00:00:00 24.725 50.87
2021-01-01 01:00:00 25.510 48.19
2021-01-01 02:00:00 25.510 22.19
def check(x):
    if x.da_price.mean() < x.price.mean():
        x.loc[:, 'min'] = x.da_price.mean()
    else:
        x.loc[:, 'min'] = [min(i, j) for i, j in zip(x.price.values, x.da_price.values)]
    return x
df = df.resample('1H').apply(check)
Output:
price da_price hour_of_day day_of_year min
time_stamp
2021-01-01 00:00:00 64.84 50.87 0 1 50.87
2021-01-01 00:15:00 13.96 50.87 0 1 13.96
2021-01-01 00:30:00 12.40 50.87 0 1 12.40
2021-01-01 00:45:00 7.70 50.87 0 1 7.70
2021-01-01 01:00:00 64.25 48.19 1 1 48.19
2021-01-01 01:15:00 14.07 48.19 1 1 14.07
2021-01-01 01:30:00 13.25 48.19 1 1 13.25
2021-01-01 01:45:00 10.47 48.19 1 1 10.47
2021-01-01 02:00:00 64.25 22.19 2 1 22.19
2021-01-01 02:15:00 14.07 22.19 2 1 22.19
2021-01-01 02:30:00 13.25 22.19 2 1 22.19
2021-01-01 02:45:00 10.47 22.19 2 1 22.19
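If resample('1H').apply with a function that operates on the whole group ever gives trouble on your pandas version, df.groupby(pd.Grouper(freq='1H')).apply(check) is an alternative way to express the same grouping. A vectorized sketch of the same logic, assuming the column names and time_stamp index from above, could look like this:
import numpy as np
import pandas as pd

hour = df.index.floor('1H')                               # hour bucket of each row
price_mean = df.groupby(hour)['price'].transform('mean')  # hourly mean of price
da_mean = df.groupby(hour)['da_price'].transform('mean')  # hourly mean of da_price

# if da_price has the lower hourly average, report it; otherwise take the row-wise minimum
df['min'] = np.where(da_mean < price_mean, da_mean, df[['price', 'da_price']].min(axis=1))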
I have a dataframe with time data in the format:
date values
0 2013-01-01 00:00:00 0.0
1 2013-01-01 01:00:00 0.0
2 2013-01-01 02:00:00 -9999
3 2013-01-01 03:00:00 -9999
4 2013-01-01 04:00:00 0.0
.. ... ...
8754 2016-12-31 18:00:00 427.5
8755 2016-12-31 19:00:00 194.9
8756 2016-12-31 20:00:00 -9999
8757 2016-12-31 21:00:00 237.6
8758 2016-12-31 22:00:00 -9999
8759 2016-12-31 23:00:00 0.0
Suppose the value -9999 is repeated 200 times in the month of January and the threshold is 150. Then practically the entire month of January must be deleted, i.e. all of its rows must be dropped. The repeats per month look like this:
date values repeated
1 2013-02 0
2 2013-03 2
4 2013-05 0
5 2013-06 0
6 2013-07 66
7 2013-08 0
8 2013-09 7
With this I think I can drop the rows that repeat, but I want to drop the whole month.
import numpy as np
# 'n_missing' is meant to hold the per-month count of -9999 values (not computed here)
df['month'] = df['date'].dt.to_period('M')
df['new_value'] = np.where((df['values'] == -9999) & (df['n_missing'] > 150), np.nan, df['values'])
df = df.dropna()
How can I do that?
One way is to use pandas.to_datetime with pandas.DataFrame.groupby.filter.
Here's a sample with months that have -9999 repeated 2, 1, 0, 2 times each:
date values
0 2013-01-01 00:00:00 0.0
1 2013-01-01 01:00:00 0.0
2 2013-01-01 02:00:00 -9999.0
3 2013-01-01 03:00:00 -9999.0
4 2013-01-01 04:00:00 0.0
5 2013-02-01 12:00:00 -9999.0
6 2013-03-01 12:00:00 0.0
8754 2016-12-31 18:00:00 427.5
8755 2016-12-31 19:00:00 194.9
8756 2016-12-31 20:00:00 -9999.0
8757 2016-12-31 21:00:00 237.6
8758 2016-12-31 22:00:00 -9999.0
8759 2016-12-31 23:00:00 0.0
Then we do filtering:
date = pd.to_datetime(df["date"]).dt.strftime("%Y-%m")
new_df = df.groupby(date).filter(lambda x: x["values"].eq(-9999).sum() < 2)
print(new_df)
Output:
date values
5 2013-02-01 12:00:00 -9999.0
6 2013-03-01 12:00:00 0.0
You can see the months with 2 or more repeats are deleted.
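For the real data, the same filter with the threshold of 150 would be (a sketch, keeping months that have at most 150 occurrences of -9999):
date = pd.to_datetime(df["date"]).dt.strftime("%Y-%m")
new_df = df.groupby(date).filter(lambda x: x["values"].eq(-9999).sum() <= 150)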
I have the following data frame with hourly resolution
day_ahead_DK1
Out[27]:
DateStamp DK1
0 2017-01-01 20.96
1 2017-01-01 20.90
2 2017-01-01 18.13
3 2017-01-01 16.03
4 2017-01-01 16.43
... ...
8756 2017-12-31 25.56
8757 2017-12-31 11.02
8758 2017-12-31 7.32
8759 2017-12-31 1.86
type(day_ahead_DK1)
Out[28]: pandas.core.frame.DataFrame
But the DateStamp column is missing the hours. How can I add 00:00:00 to 2017-01-01 at index 0 (so it becomes 2017-01-01 00:00:00), then 01:00:00 to 2017-01-01 at index 1 (so it becomes 2017-01-01 01:00:00), and so on, so that all my days have hours from 0 to 23? Thank you!
The expected output:
day_ahead_DK1
Out[27]:
DateStamp DK1
0 2017-01-01 00:00:00 20.96
1 2017-01-01 01:00:00 20.90
2 2017-01-01 02:00:00 18.13
3 2017-01-01 03:00:00 16.03
4 2017-01-01 04:00:00 16.43
... ...
8756 2017-12-31 20:00:00 25.56
8757 2017-12-31 21:00:00 11.02
8758 2017-12-31 22:00:00 7.32
8759 2017-12-31 23:00:00 1.86
Use GroupBy.cumcount as a counter, convert it to hours with to_timedelta, and add it to the DateStamp column:
df['DateStamp'] = pd.to_datetime(df['DateStamp'])
df['DateStamp'] += pd.to_timedelta(df.groupby('DateStamp').cumcount(), unit='H')
print (df)
DateStamp DK1
0 2017-01-01 00:00:00 20.96
1 2017-01-01 01:00:00 20.90
2 2017-01-01 02:00:00 18.13
3 2017-01-01 03:00:00 16.03
4 2017-01-01 04:00:00 16.43
8756 2017-12-31 00:00:00 25.56
8757 2017-12-31 01:00:00 11.02
8758 2017-12-31 02:00:00 7.32
8759 2017-12-31 03:00:00 1.86
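Two notes: cumcount simply numbers the rows within each date in their existing order, so this assumes the rows of each day are already sorted from hour 0 to 23. Also, the tail above shows 00:00:00 to 03:00:00 only because the code was run on the truncated sample, where 2017-12-31 has just four rows; with the full 24 rows per day the last rows come out as 20:00:00 to 23:00:00, matching the expected output.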
I have a dataframe that's indexed by datetime and has one column of integers and another column that I want to put in a string if a condition of the integers is met. I need the condition to assess the integer in row X against the integer in row X-1, but only if both rows are on the same day.
I am currently using the condition:
df.loc[(df['IntCol'] > df['IntCol'].shift(periods=1)), 'StringCol'] = 'Success'
This successfully applies my condition, however if the shifted row is on a different day then the condition will still use it and I want it to ignore any rows that are on a different day. I've tried various iterations of groupby(df.index.date) but can't seem to figure out if that will work or not.
Not sure if this is the best way to do it, but it gets you the answer:
import numpy as np

df['StringCol'] = np.where(df['IntCol'] > df.groupby(df.index.date)['IntCol'].shift(1), 'Success', 'Failure')
I think this is what you want. You were probably closer to the answer than you thought...
Here are two dataframes to show that the logic you have works whether the integers are a sorted range or random numbers.
You will need to import random to reproduce the data.
import random
import pandas as pd

dates = list(pd.date_range(start='2021/1/1', periods=16, freq='4H'))
def compare(x):
    x.loc[(x['IntCol'] > x['IntCol'].shift(periods=1)), 'StringCol'] = 'Success'
    return x
#### Will show success in all rows except where dates change because it's a range in numerical order
df = pd.DataFrame({'IntCol': range(10,26)}, index=dates)
df.groupby(df.index.date).apply(compare)
IntCol StringCol
2021-01-01 00:00:00 10 NaN
2021-01-01 04:00:00 11 Success
2021-01-01 08:00:00 12 Success
2021-01-01 12:00:00 13 Success
2021-01-01 16:00:00 14 Success
2021-01-01 20:00:00 15 Success
2021-01-02 00:00:00 16 NaN
2021-01-02 04:00:00 17 Success
2021-01-02 08:00:00 18 Success
2021-01-02 12:00:00 19 Success
2021-01-02 16:00:00 20 Success
2021-01-02 20:00:00 21 Success
2021-01-03 00:00:00 22 NaN
2021-01-03 04:00:00 23 Success
2021-01-03 08:00:00 24 Success
2021-01-03 12:00:00 25 Success
### random numbers to show that it works here too
df = pd.DataFrame({'IntCol': [random.randint(3, 500) for x in range(0,16)]}, index=dates)
df.groupby(df.index.date).apply(compare)
IntCol StringCol
2021-01-01 00:00:00 386 NaN
2021-01-01 04:00:00 276 NaN
2021-01-01 08:00:00 143 NaN
2021-01-01 12:00:00 144 Success
2021-01-01 16:00:00 10 NaN
2021-01-01 20:00:00 343 Success
2021-01-02 00:00:00 424 NaN
2021-01-02 04:00:00 362 NaN
2021-01-02 08:00:00 269 NaN
2021-01-02 12:00:00 35 NaN
2021-01-02 16:00:00 278 Success
2021-01-02 20:00:00 268 NaN
2021-01-03 00:00:00 58 NaN
2021-01-03 04:00:00 169 Success
2021-01-03 08:00:00 85 NaN
2021-01-03 12:00:00 491 Success
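A shorter sketch of the same per-day condition, without apply (assuming the same IntCol column and datetime index as above):
# previous IntCol value within the same calendar day (NaN for the first row of each day)
prev = df.groupby(df.index.date)['IntCol'].shift(1)
df.loc[df['IntCol'] > prev, 'StringCol'] = 'Success'
The comparison with NaN is False, so the first row of each day is left untouched, as in the outputs above.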
I am looking for a way to interpolate only over short gaps in a Pandas DataFrame that has a DatetimeIndex. Long gaps should be kept as they are.
df = pd.DataFrame(
{ "value": [ 1, np.nan, 3, np.nan, np.nan, 5, np.nan, 11, np.nan, 21, np.nan, 41 ] },
index=pd.to_datetime( [
"2021-01-01 00:00", "2021-01-01 00:05", "2021-01-01 00:10",
"2021-01-01 00:11", "2021-01-01 00:13", "2021-01-01 00:14",
"2021-01-01 00:15", "2021-01-01 01:30", "2021-01-01 03:00",
"2021-01-01 04:00", "2021-01-01 05:45", "2021-01-01 06:45",
] )
)
value
2021-01-01 00:00:00 1.0
2021-01-01 00:05:00 NaN
2021-01-01 00:10:00 3.0
2021-01-01 00:11:00 NaN
2021-01-01 00:13:00 NaN
2021-01-01 00:14:00 5.0
2021-01-01 00:15:00 NaN
2021-01-01 01:30:00 11.0
2021-01-01 03:00:00 NaN
2021-01-01 04:00:00 21.0
2021-01-01 05:45:00 NaN
2021-01-01 06:45:00 41.0
The idea is to keep gaps that are longer than a certain time (>5 minutes in this case), but interpolate all missing values within shorter gaps.
interpolate() has a limit argument that limits the number of missing values to be interpolated, but this does not respect the time delta between the rows, only the number of rows.
I would like the result to be like this:
value
2021-01-01 00:00:00 1.000000
2021-01-01 00:05:00 2.000000
2021-01-01 00:10:00 3.000000
2021-01-01 00:11:00 3.500000
2021-01-01 00:13:00 4.500000
2021-01-01 00:14:00 5.000000
2021-01-01 00:15:00 NaN
2021-01-01 01:30:00 11.000000
2021-01-01 03:00:00 NaN
2021-01-01 04:00:00 21.000000
2021-01-01 05:45:00 NaN
2021-01-01 06:45:00 41.000000
This solution fills value gaps whose time span is no longer than a specified length. The filled values are set proportionally to the entry's position within the gap's time span (time-interpolated values). Julian dates are used for easier computation.
Set the maximum time-span gap to fill with time-interpolated values (5 minutes):
jd_max_gap_fill = 5/(60*24)
Calculate the value gap:
df['ffill'] = df['value'].ffill()
df['value_gap'] = df['value'].bfill() - df['value'].ffill()
Get the Julian date for the entry:
df['jd'] = df.index.to_julian_date()
Calculate the time gap:
df['jd_nan'] = np.where(~df['value'].isna(), df['jd'], np.nan)
df['jd_gap'] = df['jd_nan'].bfill() - df['jd_nan'].ffill()
Time-wise, calculate how far into the value gap we are:
df['jd_start'] = df['jd_nan'].ffill()
df['jd_prp'] = np.where(df['jd_gap'] != 0, (df['jd'] - df['jd_start'])/df['jd_gap'], 0)
Calculate time-interpolated values:
df['filled_value'] = np.where(df['jd_gap'] <= jd_max_gap_fill, df['ffill'] + df['value_gap'] * df['jd_prp'], np.nan)
df['filled_value']
2021-01-01 00:00:00 1.0
2021-01-01 00:05:00 NaN
2021-01-01 00:10:00 3.0
2021-01-01 00:11:00 3.5
2021-01-01 00:13:00 4.5
2021-01-01 00:14:00 5.0
2021-01-01 00:15:00 NaN
2021-01-01 01:30:00 11.0
2021-01-01 03:00:00 NaN
2021-01-01 04:00:00 21.0
2021-01-01 05:45:00 NaN
2021-01-01 06:45:00 41.0
Note that my output is different from your expected output: the first NaN (at 00:05) sits in a 10-minute gap, which is longer than the 5-minute limit, so it is not filled.
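An alternative sketch of the same idea, assuming the df from the question: time-interpolate everything, then re-mask the entries whose surrounding gap is longer than 5 minutes.
import pandas as pd

filled = df['value'].interpolate(method='time')

# timestamp of each row that has a value, NaT inside gaps
valid_times = pd.Series(df.index, index=df.index).where(df['value'].notna())
# length of the gap each row sits in (zero for rows that already have a value)
gap = valid_times.bfill() - valid_times.ffill()

df['filled_value'] = filled.mask(df['value'].isna() & (gap > pd.Timedelta('5min')))
This should reproduce the filled_value column above.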