I have two Series with a datetime index. I want to edit the (float) values of the first one at the timestamp of each daily minimum of the second one.
I tried
ser_1.loc[ser_2.groupby(ser_2.index.day_of_year).idxmin()] += 1
But I get this error:
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
Series 2 and 1, respectively, look like this:
2019-01-01 00:00:00 0.04980
2019-01-01 01:00:00 0.04426
2019-01-01 02:00:00 0.05100
2019-01-01 03:00:00 0.04627
2019-01-01 04:00:00 0.03978
...
2019-12-31 19:00:00 0.04773
2019-12-31 20:00:00 0.04600
2019-12-31 21:00:00 0.04220
2019-12-31 22:00:00 0.03974
2019-12-31 23:00:00 0.03888
Name: 0, Length: 8760, dtype: float64
2019-01-01 23:00:00 0.000
2019-01-02 00:00:00 0.000
2019-01-02 01:00:00 0.000
2019-01-02 02:00:00 0.000
2019-01-02 03:00:00 0.000
...
2019-12-13 06:00:00 1.534
2019-12-13 07:00:00 2.425
2019-12-13 08:00:00 1.622
2019-12-13 09:00:00 1.974
2019-12-13 10:00:00 1.729
Freq: H, Name: 1, Length: 8292, dtype: float64
Could it be a non-corresponding index format, or just bad use of the function?
Found my problem.
The line I used is correct, but ser_1 is incomplete, so some of the daily-minimum timestamps from ser_2 have no corresponding entry in ser_1's index.
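One way to guard against that, as a minimal sketch (assuming ser_1 and ser_2 are the hourly Series shown above), is to keep only the daily-minimum timestamps that actually exist in ser_1's index before incrementing:
idx = ser_2.groupby(ser_2.index.day_of_year).idxmin()
idx = idx[idx.isin(ser_1.index)]  # drop timestamps missing from ser_1
ser_1.loc[idx] += 1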
Related
Say I have a DataFrame called Data with shape (71067, 4):
StartTime EndDateTime TradeDate Values
0 2018-12-31 23:00:00 2018-12-31 23:30:00 2019-01-01 -44.676
1 2018-12-31 23:30:00 2019-01-01 00:00:00 2019-01-01 -36.113
2 2019-01-01 00:00:00 2019-01-01 00:30:00 2019-01-01 -19.229
3 2019-01-01 00:30:00 2019-01-01 01:00:00 2019-01-01 -23.606
4 2019-01-01 01:00:00 2019-01-01 01:30:00 2019-01-01 -25.899
... ... ... ... ...
2023-01-30 20:30:00 2023-01-30 21:00:00 2023-01-30 -27.198
2023-01-30 21:00:00 2023-01-30 21:30:00 2023-01-30 -13.221
2023-01-30 21:30:00 2023-01-30 22:00:00 2023-01-30 -12.034
2023-01-30 22:00:00 2023-01-30 22:30:00 2023-01-30 -16.464
2023-01-30 22:30:00 2023-01-30 23:00:00 2023-01-30 -25.441
71067 rows × 4 columns
When running Data.isna().sum().sum() I realise I have some NaN values in the dataset:
Data.isna().sum().sum()
> 1391
Shown here:
Data[Data['Values'].isna()].reset_index(drop = True).sort_values(by = 'StartTime')
0 2019-01-01 03:30:00 2019-01-01 04:00:00 2019-01-01 NaN
1 2019-01-04 02:30:00 2019-01-04 03:00:00 2019-01-04 NaN
2 2019-01-04 03:00:00 2019-01-04 03:30:00 2019-01-04 NaN
3 2019-01-04 03:30:00 2019-01-04 04:00:00 2019-01-04 NaN
4 2019-01-04 04:00:00 2019-01-04 04:30:00 2019-01-04 NaN
... ... ... ... ...
1386 2022-12-06 13:00:00 2022-12-06 13:30:00 2022-12-06 NaN
1387 2022-12-06 13:30:00 2022-12-06 14:00:00 2022-12-06 NaN
1388 2022-12-22 11:00:00 2022-12-22 11:30:00 2022-12-22 NaN
1389 2023-01-25 11:00:00 2023-01-25 11:30:00 2023-01-25 NaN
1390 2023-01-25 11:30:00 2023-01-25 12:00:00 2023-01-25 NaN
Is there any way of replacing each NaN value in the dataset with the mean value of the corresponding half hour across the 70,000-plus rows? See below:
Data['HH'] = pd.to_datetime(Data['StartTime']).dt.time
Data.groupby(['HH'], as_index=False)[['Values']].mean().head(10)
# Only showing first 10 means
HH Values
0 00:00:00 5.236811
1 00:30:00 2.056571
2 01:00:00 4.157455
3 01:30:00 2.339253
4 02:00:00 2.658238
5 02:30:00 0.230557
6 03:00:00 0.217599
7 03:30:00 -0.630243
8 04:00:00 -0.989919
9 04:30:00 -0.494372
For example, if a value is missing against 04:00, can it be replaced with the 04:00 mean value (-0.989919) as per the above table of means?
Any help greatly appreciated.
Let's group the dataframe by HH, then transform the Values column with mean to broadcast the mean values back to the original column shape, then use fillna to fill the null values:
avg = Data.groupby('HH')['Values'].transform('mean')
Data['Values'] = Data['Values'].fillna(avg)
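Equivalently, the two steps can be collapsed into one line (a sketch, assuming the HH column has already been created as above):
Data['Values'] = Data['Values'].fillna(Data.groupby('HH')['Values'].transform('mean'))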
So I have a dataset that has electricity load over 24 hours:
Time_of_Day = loadData.groupby(loadData.index.hour).mean()
Time_of_Day
Time Load
2019-01-01 01:00:00 38.045
2019-01-01 02:00:00 30.675
2019-01-01 03:00:00 22.570
2019-01-01 04:00:00 22.153
2019-01-01 05:00:00 21.085
... ...
2019-12-31 20:00:00 65.565
2019-12-31 21:00:00 53.513
2019-12-31 22:00:00 49.096
2019-12-31 23:00:00 44.409
2020-01-01 00:00:00 45.744
How do I plot a random day (24 hours) from the 8,760 hours, please?
With the following toy dataframe:
import pandas as pd
import random
df = pd.DataFrame({"Time": pd.date_range(start="1/1/2019", end="12/31/2019", freq="H")})
df["Load"] = [round(random.random() * 100, 2) for _ in range(df.shape[0])]
Time Load
0 2019-01-01 00:00:00 53.36
1 2019-01-01 01:00:00 34.20
2 2019-01-01 02:00:00 64.19
3 2019-01-01 03:00:00 89.18
4 2019-01-01 04:00:00 27.82
... ... ...
8732 2019-12-30 20:00:00 38.26
8733 2019-12-30 21:00:00 49.66
8734 2019-12-30 22:00:00 64.15
8735 2019-12-30 23:00:00 23.97
8736 2019-12-31 00:00:00 3.72
[8737 rows x 2 columns]
Here is one way to do it, using the choice function from the Python standard library's random module:
# In Jupyter cell
df[
(df["Time"].dt.month == random.choice(df["Time"].dt.month))
& (df["Time"].dt.day == random.choice(df["Time"].dt.day))
].plot(x="Time")
Output:
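Note that the month and the day are drawn independently above, so the pair can occasionally be a date with no rows (e.g. February 30th). A sketch that avoids this by sampling one of the dates actually present in the data:
random_date = random.choice(df["Time"].dt.date.unique())  # one of the existing dates
df[df["Time"].dt.date == random_date].plot(x="Time", y="Load")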
I made predictions with an ARIMA model that predicts the next 168 hours (one week) of cars on the road. I also want to add a column called "datetime" that starts at 00:00 01-01-2021 and increases by one hour for each row.
Is there an intelligent way of doing this?
You can do:
x = pd.to_datetime('2021-01-01 00:00')
y = pd.to_datetime('2021-01-07 23:59')
pd.Series(pd.date_range(x, y, freq='H'))
output:
pd.Series(pd.date_range(x,y,freq='H'))
Out[153]:
0 2021-01-01 00:00:00
1 2021-01-01 01:00:00
2 2021-01-01 02:00:00
3 2021-01-01 03:00:00
4      2021-01-01 04:00:00
               ...
163 2021-01-07 19:00:00
164 2021-01-07 20:00:00
165 2021-01-07 21:00:00
166 2021-01-07 22:00:00
167 2021-01-07 23:00:00
Length: 168, dtype: datetime64[ns]
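If the predictions live in a DataFrame, the periods argument saves computing the end timestamp by hand (a sketch; preds is a hypothetical name for the 168-row forecast frame):
preds["datetime"] = pd.date_range("2021-01-01 00:00", periods=168, freq="H")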
I have hourly time series data stored in a pandas series. Similar to this example:
import pandas as pd
import numpy as np
date_rng = pd.date_range(start='1/1/2019', end='1/2/2019', freq='H')
data = np.random.uniform(180,182,size=(len(date_rng)))
timeseries = pd.Series(data, index=date_rng)
timeseries.iloc[4:12] = 181.911
At three decimal places, it is highly unlikely the data will be exactly the same for more than, say, 3 hours in a row. When this flatlining occurs, it indicates an issue with the sensor. So I want to detect repeated data and replace it with nan values (i.e., detect the repeated values 181.911 in the above and replace with nan)
I assume I can iterate over the time series and detect/replace that way, but is there a more efficient way to do this?
You can do it with diff, but the first occurrence is retained in the series:
timeseries.where(timeseries.diff(1)!=0.0,np.nan)
2019-01-01 00:00:00 180.539278
2019-01-01 01:00:00 181.509729
2019-01-01 02:00:00 180.740326
2019-01-01 03:00:00 181.736425
2019-01-01 04:00:00 181.911000
2019-01-01 05:00:00 NaN
2019-01-01 06:00:00 NaN
2019-01-01 07:00:00 NaN
2019-01-01 08:00:00 NaN
2019-01-01 09:00:00 NaN
2019-01-01 10:00:00 NaN
2019-01-01 11:00:00 NaN
2019-01-01 12:00:00 180.093216
2019-01-01 13:00:00 180.623440
The first occurrence can also be removed by combining diff(-1) and diff(1):
np.c_[timeseries.where(timeseries.diff(-1)!=0.0,np.nan), timeseries.where(timeseries.diff(1)!=0.0,np.nan)].mean(axis=1)
This works when the repetitions are consecutive in the series.
With the following reasonably efficient function, one can choose the minimum number of repeated values to consider a flatline:
import numpy as np

def remove_flatlines(ts, threshold):
    # get start and end indices of each flatline as an n x 2 array
    isflat = np.concatenate(([False], np.isclose(ts.diff(), 0), [False]))
    isedge = isflat[1:] != isflat[:-1]
    flatrange = np.where(isedge)[0].reshape(-1, 2)
    # include also the first value of each flatline
    flatrange[:, 0] -= 1
    # remove flatlines with at least threshold number of equal values
    ts = ts.copy()
    for j in range(len(flatrange)):
        if flatrange[j][1] - flatrange[j][0] >= threshold:
            ts.iloc[flatrange[j][0]:flatrange[j][1]] = np.nan
    return ts
Applied to the example:
remove_flatlines(timeseries, threshold=3)
2019-01-01 00:00:00 181.447940
2019-01-01 01:00:00 180.142692
2019-01-01 02:00:00 180.994674
2019-01-01 03:00:00 180.116489
2019-01-01 04:00:00 NaN
2019-01-01 05:00:00 NaN
2019-01-01 06:00:00 NaN
2019-01-01 07:00:00 NaN
2019-01-01 08:00:00 NaN
2019-01-01 09:00:00 NaN
2019-01-01 10:00:00 NaN
2019-01-01 11:00:00 NaN
2019-01-01 12:00:00 180.972644
2019-01-01 13:00:00 181.969759
2019-01-01 14:00:00 181.008693
2019-01-01 15:00:00 180.769328
2019-01-01 16:00:00 180.576061
2019-01-01 17:00:00 181.562315
2019-01-01 18:00:00 181.978567
2019-01-01 19:00:00 181.928330
2019-01-01 20:00:00 180.773995
2019-01-01 21:00:00 180.475290
2019-01-01 22:00:00 181.460028
2019-01-01 23:00:00 180.220693
2019-01-02 00:00:00 181.630176
Freq: H, dtype: float64
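An alternative, fully vectorized sketch labels consecutive runs of identical values with shift and cumsum, then masks any run of at least threshold values (it uses exact equality rather than np.isclose, which matches the example where the repeated value is identical):
def mask_flatlines(ts, threshold=3):
    # label consecutive runs of identical values
    run_id = (ts != ts.shift()).cumsum()
    # length of the run each value belongs to
    run_len = ts.groupby(run_id).transform('size')
    # NaN-out every value that sits in a run of at least `threshold` values
    return ts.mask(run_len >= threshold)

mask_flatlines(timeseries, threshold=3)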
I have three columns in a time series.
The time series is hourly and serves as the index.
I have multiple categories that are being measured hourly.
I have arbitrary lists of levels: these are usually odd names, and I may pull anywhere between 40 and 40,000 at a time.
I also have their varying values: scores from 0 to 100.
So:
I want to make each Level have its own DataFrame:
(FULL DataFrame):
df =
date levels score
2019-01-01 00:00:00 1005 99.438851
2019-01-01 01:00:00 1005 92.081975
2019-01-01 02:00:00 1005 93.032991
2019-01-01 03:00:00 1005 1.991615
2019-01-01 04:00:00 1005 12.723531
2019-01-01 05:00:00 1005 74.443313
(One of hundreds of individual DataFrames I want generated, but NOT in a DICT)
df_is_1005 =
date score
2019-01-01 00:00:00 99.438851
2019-01-01 01:00:00 92.081975
2019-01-01 02:00:00 93.032991
2019-01-01 03:00:00 1.991615
2019-01-01 04:00:00 12.723531
2019-01-01 05:00:00 74.443313
... but for ALL THE LEVELS.
And I have a bit of a problem!
I've done quite a lot of digging and have tried making a dict of the dataframes. How do I extract each of these?
Also, how do I name them individually as: df_of_{levels}?
This is the time series data I'll create for a toy model. (BUT there should be multiple datetimes for each and every level, unlike here.)
import pandas as pd
from datetime import datetime
import numpy as np
date_rng = pd.date_range(start='1/1/2019', end='3/30/2019', freq='H')
df = pd.DataFrame(date_rng, columns=['date'])
df['levels'] = np.random.randint(1000,1033,size=(len(date_rng)))
df['score'] = np.random.uniform(0,100,size=(len(date_rng)))
Keep in mind, the levels I may deal with could number in the hundreds, and they are named bizarre things.
I'll have the time stamps for each of these as separate rows.
My desired goal is to have each of the possible levels, of which there may well be more than the small number here, dynamically create its own dataframe.
NOW: I know I can create a Dictionary of dataframes.
BUT how do I extract each of the dataframes with INDIVIDUAL numbers?
I want, for example
df =
date levels score
2019-01-01 00:00:00 1005 99.438851
2019-01-01 01:00:00 1005 92.081975
2019-01-01 02:00:00 1005 93.032991
2019-01-01 03:00:00 1005 1.991615
2019-01-01 04:00:00 1005 12.723531
2019-01-01 05:00:00 1005 74.443313
2019-01-01 06:00:00 1005 12.154499
2019-01-01 07:00:00 1005 96.439228
2019-01-01 08:00:00 1005 64.283731
2019-01-01 09:00:00 1005 83.165093
2019-01-01 10:00:00 1005 75.740610
2019-01-01 11:00:00 1005 25.721404
2019-01-01 12:00:00 1005 37.493829
2019-01-01 13:00:00 1005 51.783549
2019-01-01 14:00:00 1005 7.223582
2019-01-01 15:00:00 1005 0.932651
2019-01-01 16:00:00 1005 95.916686
2019-01-01 17:00:00 1005 11.579450
and the same df, much further down:
date levels score
2019-01-01 00:00:00 1027 99.438851
2019-01-01 01:00:00 1027 92.081975
2019-01-01 02:00:00 1027 93.032991
2019-01-01 03:00:00 1027 1.991615
2019-01-01 04:00:00 1027 12.723531
2019-01-01 05:00:00 1027 74.443313
2019-01-01 06:00:00 1027 12.154499
2019-01-01 07:00:00 1027 96.439228
2019-01-01 08:00:00 1027 64.283731
2019-01-01 09:00:00 1027 83.165093
2019-01-01 10:00:00 1027 75.740610
2019-01-01 11:00:00 1027 25.721404
2019-01-01 12:00:00 1027 37.493829
2019-01-01 13:00:00 1027 51.783549
2019-01-01 14:00:00 1027 7.223582
2019-01-01 15:00:00 1027 0.932651
2019-01-01 16:00:00 1027 95.916686
2019-01-01 17:00:00 1027 11.579450
2019-01-01 18:00:00 1027 91.226938
2019-01-01 19:00:00 1027 31.564530
2019-01-01 20:00:00 1027 39.511358
2019-01-01 21:00:00 1027 59.787468
2019-01-01 22:00:00 1027 4.666549
2019-01-01 23:00:00 1027 92.197337
...etcetera...
EACH level individually, whatever it may be called (and there may be hundreds of them with random values):
To be converted to
df_{level_value_generated} =
date score
2019-01-01 00:00:00 8.040233
2019-01-01 01:00:00 55.736688
2019-01-01 02:00:00 37.910143
2019-01-01 03:00:00 22.907763
2019-01-01 04:00:00 4.586205
2019-01-01 05:00:00 88.090652
2019-01-01 06:00:00 50.474533
2019-01-01 07:00:00 92.890208
2019-01-01 08:00:00 70.949978
2019-01-01 09:00:00 23.191488
2019-01-01 10:00:00 60.506870
2019-01-01 11:00:00 25.689149
2019-01-01 12:00:00 49.234296
2019-01-01 13:00:00 65.369771
2019-01-01 14:00:00 55.550065
2019-01-01 15:00:00 35.112297
2019-01-01 16:00:00 45.989587
2019-01-01 17:00:00 76.829787
2019-01-01 18:00:00 5.982378
2019-01-01 19:00:00 83.603115
2019-01-01 20:00:00 5.995648
2019-01-01 21:00:00 95.658097
2019-01-01 22:00:00 21.877945
2019-01-01 23:00:00 30.428798
2019-01-02 00:00:00 72.450284
2019-01-02 01:00:00 91.947018
2019-01-02 02:00:00 66.741502
2019-01-02 03:00:00 77.535416
2019-01-02 04:00:00 29.624868
2019-01-02 05:00:00 89.652003
So I can then list these DataFrames that are created DYNAMICALLY.
From here, I'd like to add them to a dictionary, the reason being that I want to train a time-series model on each and every one of the individual DataFrames, so I can have a different model for each of them, each with its own training and outputs.
If possible, can I train on multiple DataFrames from inside a dictionary individually?
If I just do a pivot table or groupby, I will have a large DataFrame whose columns I'll have to call out individually to train the time series. So I'd rather not do that.
So, how do I dynamically create:
Newly named DataFrames from levels whose values are not all known in advance,
each named:
df_{level_name}:
DateTime Column: Score_Column:
some dates... scores 0-100
each of which drops the 'level_name' column in its own DataFrame, so that I can have as many dataframes as necessary, each named uniquely and programmatically, and can then take each of these and plug it into a new model or whatever?
If I've understood your problem correctly, a MultiIndex should do exactly what you want.
To do this on your dataframe:
df.reset_index(inplace=True)
df.set_index(['levels', 'date'], inplace=True)
# in the case of your example above, this will produce:
df =
levels date score
1005 2019-01-01 00:00:00 99.438851
2019-01-01 01:00:00 92.081975
2019-01-01 02:00:00 93.032991
2019-01-01 03:00:00 1.991615
2019-01-01 04:00:00 12.723531
2019-01-01 05:00:00 74.443313
2019-01-01 06:00:00 12.154499
2019-01-01 07:00:00 96.439228
2019-01-01 08:00:00 64.283731
2019-01-01 09:00:00 83.165093
2019-01-01 10:00:00 75.740610
2019-01-01 11:00:00 25.721404
2019-01-01 12:00:00 37.493829
2019-01-01 13:00:00 51.783549
2019-01-01 14:00:00 7.223582
2019-01-01 15:00:00 0.932651
2019-01-01 16:00:00 95.916686
2019-01-01 17:00:00 11.579450
1027 2019-01-01 00:00:00 99.438851
2019-01-01 01:00:00 92.081975
2019-01-01 02:00:00 93.032991
2019-01-01 03:00:00 1.991615
2019-01-01 04:00:00 12.723531
2019-01-01 05:00:00 74.443313
2019-01-01 06:00:00 12.154499
2019-01-01 07:00:00 96.439228
2019-01-01 08:00:00 64.283731
2019-01-01 09:00:00 83.165093
2019-01-01 10:00:00 75.740610
2019-01-01 11:00:00 25.721404
2019-01-01 12:00:00 37.493829
2019-01-01 13:00:00 51.783549
2019-01-01 14:00:00 7.223582
2019-01-01 15:00:00 0.932651
2019-01-01 16:00:00 95.916686
2019-01-01 17:00:00 11.579450
2019-01-01 18:00:00 91.226938
2019-01-01 19:00:00 31.564530
2019-01-01 20:00:00 39.511358
2019-01-01 21:00:00 59.787468
2019-01-01 22:00:00 4.666549
2019-01-01 23:00:00 92.197337
#... etc
You can then access each level of the data using these indices:
df.loc[1005, :]
>
date score
2019-01-01 00:00:00 99.438851
2019-01-01 01:00:00 92.081975
2019-01-01 02:00:00 93.032991
2019-01-01 03:00:00 1.991615
2019-01-01 04:00:00 12.723531
2019-01-01 05:00:00 74.443313
2019-01-01 06:00:00 12.154499
2019-01-01 07:00:00 96.439228
2019-01-01 08:00:00 64.283731
2019-01-01 09:00:00 83.165093
2019-01-01 10:00:00 75.740610
2019-01-01 11:00:00 25.721404
2019-01-01 12:00:00 37.493829
2019-01-01 13:00:00 51.783549
2019-01-01 14:00:00 7.223582
2019-01-01 15:00:00 0.932651
2019-01-01 16:00:00 95.916686
2019-01-01 17:00:00 11.579450
You can also loop through all the 'levels' of the data using:
for level, data in df.groupby(level=0):
    # 'level' is the level value and 'data' is its sub-DataFrame; do something with them
And, if needed, get a list of all 'levels' contained in the data:
df.index.levels[0]
> [1005, 1027, ...]
This might prove to be more flexible than creating numerous individually named dataframes, and is closer to the intended use of pandas.
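If you do still want a plain dictionary of per-level DataFrames (for example, to fit a separate model to each), here is a sketch built on the MultiIndex above; fit_model is a hypothetical stand-in for your own training function:
frames = {level: data.droplevel('levels') for level, data in df.groupby(level=0)}
models = {level: fit_model(frame) for level, frame in frames.items()}  # fit_model is hypothetical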