Problem adding column to Pandas DataFrame - python

I have a DataFrame of raw data:
df
Out:
Date_time 10a 10b 10c 40a 40b 40c 100a 100b 100c
120 2019-02-04 16:00:00 26.7 26.9 NaN 26.7 NaN NaN 24.9 NaN NaN
121 2019-02-04 17:00:00 23.4 24.0 23.5 24.3 24.1 24.0 25.1 24.8 25.1
122 2019-02-04 18:00:00 23.1 24.0 23.3 24.3 24.1 24.0 25.1 24.8 25.1
123 2019-02-04 19:00:00 22.8 23.8 22.9 24.3 24.1 24.0 25.1 24.8 25.1
124 2019-02-04 20:00:00 NaN 23.5 22.6 24.3 24.1 24.0 25.1 24.8 25.1
I wish to create a DataFrame containing the 'Date_time' column and several columns of means. In this instance there will be 3 means for each row, one each for 10, 40, and 100, averaging the a, b, and c values for each of these numbered groups.
means
Out:
Date_time 10cm 40cm 100cm
120 2019-02-04 16:00:00 26.800000 26.700000 24.9
121 2019-02-04 17:00:00 23.633333 24.133333 25.0
122 2019-02-04 18:00:00 23.466667 24.133333 25.0
123 2019-02-04 19:00:00 23.166667 24.133333 25.0
124 2019-02-04 20:00:00 23.050000 24.133333 25.0
I have tried the following (taken from this answer):
means = df['Date_time'].copy()
means['10cm'] = df.loc[:, '10a':'10c'].mean(axis=1)
But this results in all the mean values being clumped together in one cell at the bottom of the 'Date_time' column with '10cm' being given as the cell's index.
means
Out:
120 2019-02-04 16:00:00
121 2019-02-04 17:00:00
122 2019-02-04 18:00:00
123 2019-02-04 19:00:00
124 2019-02-04 20:00:00
10cm 120 26.800000
121 23.633333
122 23.46...
Name: Date_time, dtype: object
I believe this is something to do with means being a Series object rather than a DataFrame object when I copy across the 'Date_time' column, but I'm not sure. Any pointers would be greatly appreciated!

It was the Series issue. Turns out writing out the question helped me spot it! My solution was to alter the initial creation of means using to_frame():
means = df['Date_time'].copy().to_frame()
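For completeness, the remaining mean columns can then be added the same way as the 10cm attempt above (a sketch, using the column names from the question):
means = df['Date_time'].copy().to_frame()
means['10cm'] = df.loc[:, '10a':'10c'].mean(axis=1)
means['40cm'] = df.loc[:, '40a':'40c'].mean(axis=1)
means['100cm'] = df.loc[:, '100a':'100c'].mean(axis=1)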
I'll leave the question up in case anyone else is having a similar issue, to save them having to spend time writing it all up!


How to calculate time difference in minutes and populate the dataframe accordingly

I have time series data converted to a dataframe. It has multiple columns: the first column holds timestamps, and the remaining column names are themselves timestamps with values beneath them.
The dataframe looks like:
date 2022-01-02 10:20:00 2022-01-02 10:25:00 2022-01-02 10:30:00 2022-01-02 10:35:00 2022-01-02 10:40:00 2022-01-02 10:45:00 2022-01-02 10:50:00 2022-01-02 10:55:00 2022-01-02 11:00:00
2022-01-02 10:30:00 25.5 26.3 26.9 NaN NaN NaN NaN NaN NaN
2022-01-02 10:45:00 60.3 59.3 59.2 58.4 56.9 58.0 NaN NaN NaN
2022-01-02 11:00:00 43.7 43.9 48 48 48.1 48.9 49 49.5 49.5
Note that when the value in the date column matches a column name, the cells after that intersecting column are NaN.
The dataframe I am trying to achieve is as below, where the column names are the minutes before date (40, 35, 30, 25, 20, 15, 10, 5, 0) and the same values are populated accordingly:
For example: 1) 2022-01-02 10:30:00 - 2022-01-02 10:30:00 = 0 mins, hence the corresponding value there should be 26.9. 2) 2022-01-02 10:30:00 - 2022-01-02 10:25:00 = 5 mins, hence the value there should be 26.3, and so on.
Note: values marked with * are dummy placeholders. (The real dataframe has many more columns.)
date 40mins 35mins 30mins 25mins 20mins 15mins 10mins 5mins 0mins
2022-01-02 10:30:00 24* 24* 24.8* 24.8* 25* 25* 25.5 26.3 26.9
2022-01-02 10:45:00 59* 58* 60* 60.3 59.3 59.2 58.4 56.9 58.0
2022-01-02 11:00:00 43.7 43.9 48 48 48.1 48.9 49 49.5 49.5
I would highly appreciate some help here. Apologies if I have not framed the question well. Please ask for clarification if needed.
IIUC, you can melt, compute the timedelta and filter, then pivot back:
(df.melt('date', var_name='date2')            # reshape the columns to rows
   # convert the date strings to datetime
   # and compute the timedelta in minutes
   .assign(date=lambda d: pd.to_datetime(d['date']),
           date2=lambda d: pd.to_datetime(d['date2']),
           delta=lambda d: d['date'].sub(d['date2'])
                            .dt.total_seconds().floordiv(60)
          )
   # filter out negative timedeltas
   .loc[lambda d: d['delta'].ge(0)]
   # reshape the rows back to columns
   # (keyword arguments are required for pivot since pandas 2.0)
   .pivot(index='date', columns='delta', values='value')
   # rename columns from integer to "Xmins"
   .rename(columns=lambda x: f'{x:.0f}mins')
   # remove the columns axis label
   .rename_axis(columns=None)
)
output:
0mins 5mins 10mins 15mins 20mins 25mins 30mins 35mins 40mins
date
2022-01-02 10:30:00 26.9 26.3 25.5 NaN NaN NaN NaN NaN NaN
2022-01-02 10:45:00 58.0 56.9 58.4 59.2 59.3 60.3 NaN NaN NaN
2022-01-02 11:00:00 49.5 49.5 49.0 48.9 48.1 48.0 48.0 43.9 43.7
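If you want the 40mins-to-0mins column order from your desired output, you can reverse the columns afterwards (a one-liner, assuming the pivoted result above is assigned to a hypothetical variable out):
out = out.iloc[:, ::-1]  # reverse column order: 40mins ... 0mins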

ValueError: Invalid fill method. Expecting pad (ffill) or backfill (bfill). Got nearest

I have this df:
Week U.S. 30 yr FRM U.S. 15 yr FRM
0 2014-12-31 3.87 3.15
1 2015-01-01 NaN NaN
2 2015-01-02 NaN NaN
3 2015-01-03 NaN NaN
4 2015-01-04 NaN NaN
... ... ... ...
2769 2022-07-31 NaN NaN
2770 2022-08-01 NaN NaN
2771 2022-08-02 NaN NaN
2772 2022-08-03 NaN NaN
2773 2022-08-04 4.99 4.26
And when I try to run this interpolation:
pmms_df.interpolate(method = 'nearest', inplace = True)
I get ValueError: Invalid fill method. Expecting pad (ffill) or backfill (bfill). Got nearest
I read in this post that pandas interpolate doesn't do well with time columns, so I tried this:
pmms_df[['U.S. 30 yr FRM', 'U.S. 15 yr FRM']].interpolate(method = 'nearest', inplace = True)
but the output is exactly the same as before the interpolation.
It may not work great with date columns, but it works well with a datetime index, which is probably what you should be using here. (Your second attempt changed nothing because inplace=True on a column selection operates on a temporary copy, not on pmms_df itself.)
df = df.set_index('Week')
df = df.interpolate(method='nearest')
print(df)
# Output:
U.S. 30 yr FRM U.S. 15 yr FRM
Week
2014-12-31 3.87 3.15
2015-01-01 3.87 3.15
2015-01-02 3.87 3.15
2015-01-03 3.87 3.15
2015-01-04 3.87 3.15
2022-07-31 4.99 4.26
2022-08-01 4.99 4.26
2022-08-02 4.99 4.26
2022-08-03 4.99 4.26
2022-08-04 4.99 4.26
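Alternatively, a sketch that keeps the default integer index: interpolate only the numeric columns and assign the result back, since inplace=True on a column selection only modifies a temporary copy (note that method='nearest' requires SciPy to be installed):
cols = ['U.S. 30 yr FRM', 'U.S. 15 yr FRM']
pmms_df[cols] = pmms_df[cols].interpolate(method='nearest')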

Resampling pandas index with datetime.time type

I have a data set that has a date-agnostic datetime.time (hour:minute) index like this:
Time   11/22/16  11/23/16  11/24/16
00:00      50.9      51.3        49
00:01      50.8      51.8      49.9
00:02      51.4      52.6      48.3
I'm trying to do various stats (e.g. avg, stddev) on date-agnostic time slots, e.g. for the time 00:01, so having them aligned like this helps with that, but it seems a date-agnostic index makes the data much harder to use with other parts of pandas. Does anyone have recommendations for how to deal with date-agnostic indexes like this, or for how to do date-agnostic stats if I were to reorganize the data set so that the three date columns become one continuous column and the index is a true datetime timestamp?
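For instance, the kind of per-slot stats I mean (a sketch, using the date columns named above):
import pandas as pd

# mean and stddev for each date-agnostic time slot, across the day columns
date_cols = ['11/22/16', '11/23/16', '11/24/16']
slot_stats = pd.DataFrame({'mean': df[date_cols].mean(axis=1),
                           'std': df[date_cols].std(axis=1)})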
The particular problem I am facing right now is that I want to upsample the dataset with forward fill such that it has 60 samples per minute (second accuracy).
So the resulting df would look something like:
Time      11/22/16  11/23/16  11/24/16
00:00:00      50.9      51.3        49
00:00:01      50.9      51.3        49
00:00:02      50.9      51.3        49
00:00:03      50.9      51.3        49
…                …         …         …
00:01:00      50.8      51.8      49.9
…                …         …         …
00:02:00      51.4      52.6      48.3
…                …         …         …
What I'm having trouble with is that datetime.time is not accepted by resample as a 'DateTime-like index'. I can add an artificial date to the initial dataset like:
Time                 11/22/16  11/23/16  11/24/16
1899-12-30 00:00:00      50.9      51.3        49
1899-12-30 00:01:00      50.8      51.8      49.9
1899-12-30 00:02:00      51.4      52.6      48.3
but this seems a little absurd. I also came up with a way of using explode to do this:
df['ListOfTimes'] = pd.Series(
    [np.full((1, 60), df['Time'][x]).tolist()[0] for x in range(len(df.index))])
df = df.explode('ListOfTimes')
But things like this are SO much more painful to create and debug than just df.resample('60S').ffill(). I'm looking for the most pandas-centric way of dealing with date-agnostic time indexes.
One way is to create a generator that yields the time intervals using an f-string, and then reindex:
s = (f"{i}:{n:02}" for i in df["Time"] for n in range(0, 60))
print(df.assign(Time=df["Time"] + ":00").set_index("Time").reindex(s).ffill())
11/22/16 11/23/16 11/24/16
Time
00:00:00 50.9 51.3 49.0
00:00:01 50.9 51.3 49.0
00:00:02 50.9 51.3 49.0
00:00:03 50.9 51.3 49.0
00:00:04 50.9 51.3 49.0
... ... ... ...
00:02:55 51.4 52.6 48.3
00:02:56 51.4 52.6 48.3
00:02:57 51.4 52.6 48.3
00:02:58 51.4 52.6 48.3
00:02:59 51.4 52.6 48.3
[180 rows x 3 columns]
You can rearrange your DataFrame into a time series, resample, and go back to your one-day-per-column format like this:
import pandas as pd

# One day per column. We will upsample from four to two hours.
df = pd.DataFrame({'2020-01-01': [1, 2, 3, 4, 5, 6],
                   '2020-01-02': [7, 8, 9, 10, 11, 12]},
                  index=['0:00', '4:00', '8:00', '12:00', '16:00', '20:00'])

# One-liner
df.stack() \
  .reset_index() \
  .apply(lambda x: (pd.Timestamp(x['level_1'] + 'T' + x['level_0']), x[0]),
         axis='columns', result_type='expand') \
  .set_index(0) \
  .resample('2H').ffill() \
  .reset_index() \
  .apply(lambda x: (str(x[0].date()), str(x[0].time()), x[1]),
         axis='columns', result_type='expand') \
  .set_index([1, 0]) \
  .unstack() \
  .ffill()
The result is:
2
0 2020-01-01 2020-01-02
1
00:00:00 1.0 7.0
02:00:00 1.0 7.0
04:00:00 2.0 8.0
06:00:00 2.0 8.0
08:00:00 3.0 9.0
10:00:00 3.0 9.0
12:00:00 4.0 10.0
14:00:00 4.0 10.0
16:00:00 5.0 11.0
18:00:00 5.0 11.0
20:00:00 6.0 12.0
22:00:00 6.0 12.0
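A slightly more readable variant of the same idea, as a sketch on the same toy data (build one datetime series, resample, then group back into the time-by-date shape):
import pandas as pd

df = pd.DataFrame({'2020-01-01': [1, 2, 3, 4, 5, 6],
                   '2020-01-02': [7, 8, 9, 10, 11, 12]},
                  index=['0:00', '4:00', '8:00', '12:00', '16:00', '20:00'])

s = df.stack()  # MultiIndex of (time, date) -> value
ts = pd.Series(s.to_numpy(),
               index=pd.to_datetime(s.index.get_level_values(1) + ' '
                                    + s.index.get_level_values(0)))
up = ts.sort_index().resample('2H').ffill()        # upsample to 2-hour steps
out = (up.groupby([up.index.time, up.index.date])  # regroup: time rows, date columns
         .first()
         .unstack()
         .ffill())                                 # fill the trailing slot on the last day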

How to loop through dates column and assign values according to a certain condition?

I have a df as follows
dates winter summer rest Final
2020-01-01 00:15:00 65.5 71.5 73.0 NaN
2020-01-01 00:30:00 62.6 69.0 70.1 NaN
2020-01-01 00:45:00 59.6 66.3 67.1 NaN
2020-01-01 01:00:00 57.0 63.5 64.5 NaN
2020-01-01 01:15:00 54.8 60.9 62.3 NaN
2020-01-01 01:30:00 53.1 58.6 60.6 NaN
2020-01-01 01:45:00 51.7 56.6 59.2 NaN
2020-01-01 02:00:00 50.5 55.1 57.9 NaN
2020-01-01 02:15:00 49.4 54.2 56.7 NaN
2020-01-01 02:30:00 48.5 53.7 55.6 NaN
2020-01-01 02:45:00 47.9 53.4 54.7 NaN
2020-01-01 03:00:00 47.7 53.3 54.2 NaN
2020-01-01 03:15:00 47.9 53.1 54.1 NaN
2020-01-01 03:30:00 48.7 53.2 54.6 NaN
2020-01-01 03:45:00 50.2 54.1 55.8 NaN
2020-01-01 04:00:00 52.3 56.1 57.9 NaN
2020-04-28 12:30:00 225.1 200.0 209.8 NaN
2020-04-28 12:45:00 215.7 193.8 201.9 NaN
2020-04-28 13:00:00 205.6 186.9 193.4 NaN
2020-04-28 13:15:00 195.7 179.9 185.0 NaN
2020-04-28 13:30:00 186.7 173.4 177.4 NaN
2020-04-28 13:45:00 179.2 168.1 170.9 NaN
2020-04-28 14:00:00 173.8 164.4 166.3 NaN
2020-04-28 14:15:00 171.0 163.0 163.9 NaN
2020-04-28 14:30:00 170.7 163.5 163.6 NaN
2020-12-31 21:15:00 88.5 90.2 89.2 NaN
2020-12-31 21:30:00 85.2 88.5 87.2 NaN
2020-12-31 21:45:00 82.1 86.3 85.0 NaN
2020-12-31 22:00:00 79.4 84.1 83.2 NaN
2020-12-31 22:15:00 77.6 82.4 82.1 NaN
2020-12-31 22:30:00 76.4 81.2 81.7 NaN
2020-12-31 22:45:00 75.6 80.3 81.6 NaN
2020-12-31 23:00:00 74.7 79.4 81.3 NaN
2020-12-31 23:15:00 73.7 78.4 80.6 NaN
2020-12-31 23:30:00 72.3 77.2 79.5 NaN
2020-12-31 23:45:00 70.5 75.7 77.9 NaN
2021-01-01 00:00:00 68.2 73.8 75.7 NaN
The dates column has dates starting from 2020-01-01 00:15:00 till 2021-01-01 00:00:00, at 15-minute intervals.
I also have the following date range conditions:
Winter: 01.11 - 20.03
Summer: 15.05 - 14.09
Rest: 21.03 - 14.05 & 15.09 - 31.10
What I want to do is create a new column named season that checks every date in the dates column and assigns winter if the date is in the Winter range, summer if it is in the Summer range, and rest if it is in the Rest range.
Then, based on the value in the season column, the Final column must be filled. If the value in season column is 'winter', then the values from winter column must be placed, if the value in season column is 'summer', then the values from summer column must be placed and so on.
How can this be done?
The idea is to normalize the datetimes to the same year, then filter with Series.between and set the new column with numpy.select:
import numpy as np
import pandas as pd

d = pd.to_datetime(df['dates'].dt.strftime('%m-%d-2020'))
m1 = d.between('2020-11-01', '2020-12-31') | d.between('2020-01-01', '2020-03-20')
m2 = d.between('2020-05-15', '2020-09-14')
df['Final'] = np.select([m1, m2], ['Winter', 'Summer'], default='Rest')
print(df)
dates winter summer rest Final
0 2020-01-01 00:15:00 65.5 71.5 73.0 Winter
1 2020-06-15 00:30:00 62.6 69.0 70.1 Summer
2 2020-12-25 00:45:00 59.6 66.3 67.1 Winter
3 2020-10-10 01:00:00 57.0 63.5 64.5 Rest
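If you also need the second step from the question (filling Final with the values from the matching season column rather than the label), one possible follow-up sketch stores the label in a season column and then does a row-wise column lookup with factorize/reindex (column names assumed as in the question):
df['season'] = np.select([m1, m2], ['winter', 'summer'], default='rest')
# row-wise lookup: pick each row's value from the column named in `season`
idx, cols = pd.factorize(df['season'])
df['Final'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]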

Python pandas rolling mean while retaining index and column

I have a pandas DataFrame of statistics for NBA games. Here's a sample of the data for away teams:
away_team away_efg away_drb away_score
date
2000-10-31 19:00:00 Los Angeles Clippers 0.522 74.4 94
2000-10-31 19:00:00 Milwaukee Bucks 0.434 63.0 93
2000-10-31 19:30:00 Minnesota Timberwolves 0.523 73.8 106
2000-10-31 19:30:00 Charlotte Hornets 0.605 77.1 106
2000-10-31 19:30:00 Seattle SuperSonics 0.429 73.1 88
There are many more numeric columns other than the away_score column, and also analogous columns for the home team.
What I would like is, for each row, replace the numeric columns (other than score) with the mean of the previous three observations, partitioned by team. I can almost get what I want by doing the following:
home_df.groupby("team").apply(lambda x: x.rolling(window=3).mean())
This returns, for example,
>>> home_avg[home_avg["team"]=="Utah Jazz"].head()
3par ast blk drb efg ftr orb
0 NaN NaN NaN NaN NaN NaN NaN
50 NaN NaN NaN NaN NaN NaN NaN
81 0.146667 71.600000 9.4 74.666667 0.512000 0.347667 25.833333
Taking this, along with
>>> home_df[home_df["team"]=="Utah Jazz"].head()
3par ast blk drb efg ftr orb stl team tov trb
0 0.118 76.7 7.1 64.7 0.535 0.365 25.6 11.5 Utah Jazz 10.8 42.9
50 0.100 63.9 9.1 80.5 0.536 0.414 27.6 2.2 Utah Jazz 20.2 58.6
81 0.222 74.2 12.0 78.8 0.465 0.264 24.3 7.3 Utah Jazz 13.9 50.0
122 0.119 81.8 11.3 75.0 0.515 0.642 25.0 12.2 Utah Jazz 21.8 52.5
135 0.129 76.7 17.8 75.9 0.650 0.400 37.9 5.7 Utah Jazz 18.8 62.7
demonstrates that it is including the current row in the calculation of the mean. I want to avoid this. More specifically, the desired output for row 81 would be all NaNs (because there haven't been three games yet), and the entry in the 3par column for row 122 would be .146667 (the average of the values in that column for rows 0, 50, and 81).
So, my question is, how can I exclude the current row in the rolling mean calculation?
You can use shift, which shifts the values by a given number of rows, so that your rolling window uses the last three values and excludes the current one:
import numpy as np
import pandas as pd

# create a dummy data frame with numeric values
df = pd.DataFrame({"numeric_col": np.random.randint(0, 100, size=5)})
print(df)
numeric_col
0 66
1 60
2 74
3 41
4 83
df["mean"] = df["numeric_col"].shift(1).rolling(window=3).mean()
print(df)
numeric_col mean
0 66 NaN
1 60 NaN
2 74 NaN
3 41 66.666667
4 83 58.333333
Accordingly, change your apply function to lambda x: x.shift(1).rolling(window=3).mean() to make it work in your specific example.
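Applied back to the grouped setting from the question, a sketch (column names assumed; group_keys=False keeps the original index so the result aligns with home_df):
num_cols = home_df.select_dtypes('number').columns
home_avg = (home_df.groupby('team', group_keys=False)[num_cols]
                   .apply(lambda x: x.shift(1).rolling(window=3).mean()))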
