I have hourly time series data stored in a pandas series. Similar to this example:
import pandas as pd
import numpy as np
date_rng = pd.date_range(start='1/1/2019', end='1/2/2019', freq='H')
data = np.random.uniform(180,182,size=(len(date_rng)))
timeseries = pd.Series(data, index=date_rng)
timeseries.iloc[4:12] = 181.911
At three decimal places, it is highly unlikely the data will be exactly the same for more than, say, 3 hours in a row. When this flatlining occurs, it indicates an issue with the sensor. So I want to detect repeated data and replace it with NaN values (i.e., detect the repeated 181.911 values above and replace them with NaN).
I assume I can iterate over the time series and detect/replace that way, but is there a more efficient way to do this?
You can do it with diff, but the first occurrence is retained in the series.
timeseries.where(timeseries.diff(1)!=0.0,np.nan)
2019-01-01 00:00:00 180.539278
2019-01-01 01:00:00 181.509729
2019-01-01 02:00:00 180.740326
2019-01-01 03:00:00 181.736425
2019-01-01 04:00:00 181.911000
2019-01-01 05:00:00 NaN
2019-01-01 06:00:00 NaN
2019-01-01 07:00:00 NaN
2019-01-01 08:00:00 NaN
2019-01-01 09:00:00 NaN
2019-01-01 10:00:00 NaN
2019-01-01 11:00:00 NaN
2019-01-01 12:00:00 180.093216
2019-01-01 13:00:00 180.623440
The first occurrence can also be removed by combining diff(-1) and diff(1):
np.c_[timeseries.where(timeseries.diff(-1)!=0.0,np.nan), timeseries.where(timeseries.diff(1)!=0.0,np.nan)].mean(axis=1)
This works when the repetitions are consecutive in the series.
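If you prefer to stay in pandas and keep the DatetimeIndex (the np.c_ expression returns a plain NumPy array), a small variation of the same idea is this sketch of mine, not part of the original answer:
mask = (timeseries.diff(1) != 0.0) & (timeseries.diff(-1) != 0.0)
cleaned = timeseries.where(mask, np.nan)
Here a value is kept only if it differs from both its previous and its next neighbour, so every occurrence in a run, including the first, becomes NaN.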
With the following reasonably efficient function, you can choose the minimum number of repeated values to treat as a flatline:
import numpy as np
def remove_flatlines(ts, threshold):
    # get start and end indices of each flatline as an n x 2 array
    isflat = np.concatenate(([False], np.isclose(ts.diff(), 0), [False]))
    isedge = isflat[1:] != isflat[:-1]
    flatrange = np.where(isedge)[0].reshape(-1, 2)
    # include also first value of each flatline
    flatrange[:, 0] -= 1
    # remove flatlines with at least threshold number of equal values
    ts = ts.copy()
    for j in range(len(flatrange)):
        if flatrange[j][1] - flatrange[j][0] >= threshold:
            ts.iloc[flatrange[j][0]:flatrange[j][1]] = np.nan
    return ts
Applied to the example:
remove_flatlines(timeseries, threshold=3)
2019-01-01 00:00:00 181.447940
2019-01-01 01:00:00 180.142692
2019-01-01 02:00:00 180.994674
2019-01-01 03:00:00 180.116489
2019-01-01 04:00:00 NaN
2019-01-01 05:00:00 NaN
2019-01-01 06:00:00 NaN
2019-01-01 07:00:00 NaN
2019-01-01 08:00:00 NaN
2019-01-01 09:00:00 NaN
2019-01-01 10:00:00 NaN
2019-01-01 11:00:00 NaN
2019-01-01 12:00:00 180.972644
2019-01-01 13:00:00 181.969759
2019-01-01 14:00:00 181.008693
2019-01-01 15:00:00 180.769328
2019-01-01 16:00:00 180.576061
2019-01-01 17:00:00 181.562315
2019-01-01 18:00:00 181.978567
2019-01-01 19:00:00 181.928330
2019-01-01 20:00:00 180.773995
2019-01-01 21:00:00 180.475290
2019-01-01 22:00:00 181.460028
2019-01-01 23:00:00 180.220693
2019-01-02 00:00:00 181.630176
Freq: H, dtype: float64
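A more vectorized sketch of the same idea (my own variation, not part of the answer above) labels each run of equal consecutive values and masks runs whose length reaches the threshold:
def remove_flatlines_vectorized(ts, threshold):
    # a new run id starts whenever the value changes from the previous one
    run_id = (~np.isclose(ts.diff(), 0)).cumsum()
    # length of the run each element belongs to
    run_len = ts.groupby(run_id).transform("size")
    # blank out runs with at least `threshold` equal values
    return ts.mask(run_len >= threshold)
remove_flatlines_vectorized(timeseries, threshold=3) should blank the same 181.911 run as above.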
I have some issues with my data resampling in pandas. I'm trying to upsample 15-minute values to 1-minute values. The resampled dataframe values should contain the sum split equally between the two values of the original dataframe. This code generates a minimal example of the problem.
import pandas as pd
import numpy as np
dates = pd.DataFrame(pd.date_range(start="20190101",end="20200101", freq="15min"))
values = pd.DataFrame(np.random.randint(0,10,size=(35041, 1)))
df = pd.concat([dates,values], axis = 1)
df = df.set_index(pd.DatetimeIndex(df.iloc[:,0]))
print(df.resample("min").agg("sum").head(16))
This is an example output:
2019-01-01 00:00:00 3
2019-01-01 00:01:00 0
2019-01-01 00:02:00 0
2019-01-01 00:03:00 0
2019-01-01 00:04:00 0
2019-01-01 00:05:00 0
2019-01-01 00:06:00 0
2019-01-01 00:07:00 0
2019-01-01 00:08:00 0
2019-01-01 00:09:00 0
2019-01-01 00:10:00 0
2019-01-01 00:11:00 0
2019-01-01 00:12:00 0
2019-01-01 00:13:00 0
2019-01-01 00:14:00 0
2019-01-01 00:15:00 3
The values shown as 0 should be replaced by the sum of the two values (in this example: 3 at 2019-01-01 00:00:00 and 3 at 2019-01-01 00:15:00), which equals 6, and this sum should be evenly distributed over the time span.
2019-01-01 00:00:00 6/15
2019-01-01 00:01:00 6/15
2019-01-01 00:02:00 6/15
2019-01-01 00:03:00 6/15
2019-01-01 00:04:00 6/15
2019-01-01 00:05:00 6/15
2019-01-01 00:06:00 6/15
2019-01-01 00:07:00 6/15
2019-01-01 00:08:00 6/15
2019-01-01 00:09:00 6/15
2019-01-01 00:10:00 6/15
2019-01-01 00:11:00 6/15
2019-01-01 00:12:00 6/15
2019-01-01 00:13:00 6/15
2019-01-01 00:14:00 6/15
2019-01-01 00:15:00 6/15
This should be done for each resampled group over the whole DataFrame.
In other words, the sum of the original dataframe and the resampled dataframe should be equal.
Thanks for your help.
First of all, I would personally recommend working with a Series if there is only one column.
series = pd.Series(index=pd.date_range(start="20190101", end="20200101", freq="15min"),
                   data=np.random.randint(0, 10, size=(35041,)).tolist())
Then I would create a new index with minutely values, calculate the cumulative sum of the values, and interpolate between these values. In your use case, "linear" is the suggested interpolation method:
beginning = series.index[0]
end = series.index[-1]
new_index = pd.date_range(beginning, end, freq="1T")
cumsum = series.cumsum()
cumsum = cumsum.reindex(new_index)
cumsum = cumsum.interpolate("linear")
Afterwards, you get an interpolated cumulative sum, which you can convert back to the values you are after via:
series_upsampled = cumsum.diff()
If you want, you can shift series_upsampled by one position:
series_upsampled = series_upsampled.shift(-1)
Pay attention to the NaN value at the beginning (or, if you shift your series, at the end).
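Putting the pieces together, a minimal end-to-end sketch (the toy series and variable names here are my own, not from the question):
import numpy as np
import pandas as pd

# toy 15-minute series standing in for the original data
series = pd.Series(np.random.randint(0, 10, size=97),
                   index=pd.date_range(start="2019-01-01", periods=97, freq="15min"))

# minutely index spanning the same range
new_index = pd.date_range(series.index[0], series.index[-1], freq="1min")

# interpolate the cumulative sum on the minutely grid, then difference it back
cumsum = series.cumsum().reindex(new_index).interpolate("linear")
series_upsampled = cumsum.diff().shift(-1)

# the minutely steps between two original stamps sum back to the original value at
# the later stamp; shift(-1) leaves a single NaN at the very end of the series
print(series_upsampled.head(16))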
I have a df:
dates values
2020-01-01 00:15:00 38.61487
2020-01-01 00:30:00 36.905204
2020-01-01 00:45:00 35.136584
2020-01-01 01:00:00 33.60378
2020-01-01 01:15:00 32.306791999999994
2020-01-01 01:30:00 31.304574
I am creating a new column named start as follows:
df = df.rename(columns={'dates': 'end'})
df['start']= df['end'].shift(1)
When I do this, I get the following:
end values start
2020-01-01 00:15:00 38.61487 NaT
2020-01-01 00:30:00 36.905204 2020-01-01 00:15:00
2020-01-01 00:45:00 35.136584 2020-01-01 00:30:00
2020-01-01 01:00:00 33.60378 2020-01-01 00:45:00
2020-01-01 01:15:00 32.306791999999994 2020-01-01 01:00:00
2020-01-01 01:30:00 31.304574 2020-01-01 01:15:00
I want to fill that NaT value with
2020-01-01 00:00:00
How can this be done?
Use Series.fillna with a datetime, e.g. a Timestamp:
df['start']= df['end'].shift().fillna(pd.Timestamp('2020-01-01'))
Or, with pandas 0.24+, use the fill_value parameter:
df['start']= df['end'].shift(fill_value=pd.Timestamp('2020-01-01'))
If all datetimes are regular, always 15 minutes apart, you can instead subtract an offsets.DateOffset:
df['start']= df['end'] - pd.offsets.DateOffset(minutes=15)
print(df)
end values start
0 2020-01-01 00:15:00 38.614870 2020-01-01 00:00:00
1 2020-01-01 00:30:00 36.905204 2020-01-01 00:15:00
2 2020-01-01 00:45:00 35.136584 2020-01-01 00:30:00
3 2020-01-01 01:00:00 33.603780 2020-01-01 00:45:00
4 2020-01-01 01:15:00 32.306792 2020-01-01 01:00:00
5 2020-01-01 01:30:00 31.304574 2020-01-01 01:15:00
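A plain Timedelta subtraction behaves the same way for a fixed 15-minute spacing (a small variation, not from the answer above):
df['start'] = df['end'] - pd.Timedelta(minutes=15)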
How about this?
df = pd.DataFrame(columns = ['end'])
df.loc[:, 'end'] = pd.date_range(start=pd.Timestamp(2019,1,1,0,15), end=pd.Timestamp(2019,1,2), freq='15min')
df.loc[:, 'start'] = df.loc[:, 'end'].shift(1)
delta = df.loc[df.index[3], 'end'] - df.loc[df.index[2], 'end']
df.loc[df.index[0], 'start'] = df.loc[df.index[1], 'start'] - delta
df
end start
0 2019-01-01 00:15:00 2019-01-01 00:00:00
1 2019-01-01 00:30:00 2019-01-01 00:15:00
2 2019-01-01 00:45:00 2019-01-01 00:30:00
3 2019-01-01 01:00:00 2019-01-01 00:45:00
4 2019-01-01 01:15:00 2019-01-01 01:00:00
... ... ...
91 2019-01-01 23:00:00 2019-01-01 22:45:00
92 2019-01-01 23:15:00 2019-01-01 23:00:00
93 2019-01-01 23:30:00 2019-01-01 23:15:00
94 2019-01-01 23:45:00 2019-01-01 23:30:00
95 2019-01-02 00:00:00 2019-01-01 23:45:00
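If you would rather not read the spacing off neighbouring rows, pd.infer_freq can derive it; a sketch of mine that assumes the 'end' column is perfectly regular (infer_freq returns None otherwise):
freq = pd.infer_freq(df['end'])                      # e.g. '15T'
df['start'] = df['end'] - pd.tseries.frequencies.to_offset(freq)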
I have a dataframe as follows
import pandas as pd
import numpy as np
IDs = ['A','A','A','B','B']
times = pd.date_range(start='01/01/2019',end='01/02/2019',freq='h')
times_2 = pd.date_range(start='01/01/2019',end='01/02/2019',freq='h') + pd.Timedelta('15min')
Vals = [np.random.randint(15,250) for x in enumerate(times)]
df = pd.DataFrame({'id': IDs * 5,
                   'Start': times,
                   'End': times_2,
                   'Value': Vals}, columns=['id', 'Start', 'End', 'Value'])
this gives me a df as follows.
print(df.head(5))
id Start End Value
0 A 2019-01-01 00:00:00 2019-01-01 00:15:00 52
1 A 2019-01-01 01:00:00 2019-01-01 01:15:00 69
2 A 2019-01-01 02:00:00 2019-01-01 02:15:00 209
3 B 2019-01-01 03:00:00 2019-01-01 03:15:00 163
4 B 2019-01-01 04:00:00 2019-01-01 04:15:00 70
Now what I'm trying to do is apply a groupby to my data frame to get the sum of the value column; however, whilst doing this, I would like to retain the min start and max end time of my df.
So my example output would be as follows:
id Start End Value
0 A 2019-01-01 00:00:00 2019-01-01 22:15:00 2007
1 B 2019-01-01 03:00:00 2019-01-02 00:15:00 1385
The only way I've sort of made this work is to take the min and max of each unique ID by start and end time, pass these to a list, and then manually create the start and end times, but it was slow, messy, and prone to error... hoping someone here can guide me as to what I'm missing.
Using groupby with agg
df.groupby('id').agg({'Start': 'min', 'End': 'max', 'Value': 'sum'})  # chain .reset_index() to turn 'id' back into a column
Out[92]:
Start End Value
id
A 2019-01-01 00:00:00 2019-01-01 22:15:00 2152
B 2019-01-01 03:00:00 2019-01-02 00:15:00 972
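With pandas 0.25+ the same aggregation can be written with named aggregation, which also keeps id as a regular column; a sketch using the column names from the question:
out = (df.groupby('id', as_index=False)
         .agg(Start=('Start', 'min'), End=('End', 'max'), Value=('Value', 'sum')))
print(out)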
I have three columns in a time series.
The time series is hourly and serves as the index.
I have multiple categories that are being measured hourly.
I have arbitrary lists of levels: these usually have odd names, and I may pull anywhere between 40 and 40,000 at a time.
I also have their varying values: scores from 0 to 100.
So:
I want to make each Level have its own DataFrame:
(FULL DataFrame):
df =
date levels score
2019-01-01 00:00:00 1005 99.438851
2019-01-01 01:00:00 1005 92.081975
2019-01-01 02:00:00 1005 93.032991
2019-01-01 03:00:00 1005 1.991615
2019-01-01 04:00:00 1005 12.723531
2019-01-01 05:00:00 1005 74.443313
(One of hundreds of individual DataFrames I want generated, but NOT in a DICT)
df_is_1005 =
date score
2019-01-01 00:00:00 99.438851
2019-01-01 01:00:00 92.081975
2019-01-01 02:00:00 93.032991
2019-01-01 03:00:00 1.991615
2019-01-01 04:00:00 12.723531
2019-01-01 05:00:00 74.443313
.... but for ALL THE LEVELS.
And I have a bit of a problem!
I've done quite a lot of digging and have tried making a dict of the dataframes. How do I extract each of these?
Also, how do I name them individually as: df_of_{levels}?
This is the time series data I'll create for a toy model. (BUT there should be multiple datetimes for each and every level, unlike here.)
import pandas as pd
from datetime import datetime
import numpy as np
date_rng = pd.date_range(start='1/1/2019', end='3/30/2019', freq='H')
df = pd.DataFrame(date_rng, columns=['date'])
df['level'] = np.random.randint(1000,1033,size=(len(date_rng)))
df['score'] = np.random.uniform(0,100,size=(len(date_rng)))
Keep in mind, the levels I may deal with could be hundreds and they are named bizarre things.
I'll have the time stamps for each of these as separate rows.
My desired goal is to have each of the possible levels, of which there may well be more than the small number here, dynamically create DataFrames.
NOW: I know I can create a Dictionary of dataframes.
BUT how do I extract each of the dataframes with INDIVIDUAL numbers?
I want, for example
df =
date levels score
2019-01-01 00:00:00 1005 99.438851
2019-01-01 01:00:00 1005 92.081975
2019-01-01 02:00:00 1005 93.032991
2019-01-01 03:00:00 1005 1.991615
2019-01-01 04:00:00 1005 12.723531
2019-01-01 05:00:00 1005 74.443313
2019-01-01 06:00:00 1005 12.154499
2019-01-01 07:00:00 1005 96.439228
2019-01-01 08:00:00 1005 64.283731
2019-01-01 09:00:00 1005 83.165093
2019-01-01 10:00:00 1005 75.740610
2019-01-01 11:00:00 1005 25.721404
2019-01-01 12:00:00 1005 37.493829
2019-01-01 13:00:00 1005 51.783549
2019-01-01 14:00:00 1005 7.223582
2019-01-01 15:00:00 1005 0.932651
2019-01-01 16:00:00 1005 95.916686
2019-01-01 17:00:00 1005 11.579450
and same df, far later... :
date levels score
2019-01-01 00:00:00 1027 99.438851
2019-01-01 01:00:00 1027 92.081975
2019-01-01 02:00:00 1027 93.032991
2019-01-01 03:00:00 1027 1.991615
2019-01-01 04:00:00 1027 12.723531
2019-01-01 05:00:00 1027 74.443313
2019-01-01 06:00:00 1027 12.154499
2019-01-01 07:00:00 1027 96.439228
2019-01-01 08:00:00 1027 64.283731
2019-01-01 09:00:00 1027 83.165093
2019-01-01 10:00:00 1027 75.740610
2019-01-01 11:00:00 1027 25.721404
2019-01-01 12:00:00 1027 37.493829
2019-01-01 13:00:00 1027 51.783549
2019-01-01 14:00:00 1027 7.223582
2019-01-01 15:00:00 1027 0.932651
2019-01-01 16:00:00 1027 95.916686
2019-01-01 17:00:00 1027 11.579450
2019-01-01 18:00:00 1027 91.226938
2019-01-01 19:00:00 1027 31.564530
2019-01-01 20:00:00 1027 39.511358
2019-01-01 21:00:00 1027 59.787468
2019-01-01 22:00:00 1027 4.666549
2019-01-01 23:00:00 1027 92.197337
...etcetera...
EACH level individually, whatever it may be called (and there may be hundreds of them with random values):
To be converted to
df_{level_value_generated} =
date score
2019-01-01 00:00:00 8.040233
2019-01-01 01:00:00 55.736688
2019-01-01 02:00:00 37.910143
2019-01-01 03:00:00 22.907763
2019-01-01 04:00:00 4.586205
2019-01-01 05:00:00 88.090652
2019-01-01 06:00:00 50.474533
2019-01-01 07:00:00 92.890208
2019-01-01 08:00:00 70.949978
2019-01-01 09:00:00 23.191488
2019-01-01 10:00:00 60.506870
2019-01-01 11:00:00 25.689149
2019-01-01 12:00:00 49.234296
2019-01-01 13:00:00 65.369771
2019-01-01 14:00:00 55.550065
2019-01-01 15:00:00 35.112297
2019-01-01 16:00:00 45.989587
2019-01-01 17:00:00 76.829787
2019-01-01 18:00:00 5.982378
2019-01-01 19:00:00 83.603115
2019-01-01 20:00:00 5.995648
2019-01-01 21:00:00 95.658097
2019-01-01 22:00:00 21.877945
2019-01-01 23:00:00 30.428798
2019-01-02 00:00:00 72.450284
2019-01-02 01:00:00 91.947018
2019-01-02 02:00:00 66.741502
2019-01-02 03:00:00 77.535416
2019-01-02 04:00:00 29.624868
2019-01-02 05:00:00 89.652003
So I can then list these DataFrames that are created DYNAMICALLY.
From here, I'd like to add them to a dictionary; the reason being that I want to train a time series model on each and every one of the individual DataFrames, so I can have a different model for each of them, each with its own training and outputs.
If possible, can I train on multiple DataFrames from inside a dictionary individually?
If I just do a pivot table or groupby, I will have a large Dataframe that I'll have to individually call out columns of to train on the time series. So I'd rather not do that.
So, how do I dynamically create:
Newly named DataFrames from levels that are not all known in value,
each named:
df_{level_name}:
DateTime Column: Score_Column:
some dates... scores 0-100
that will then drop the 'level_name' column in their own DataFrame, so that I can have as many dataframes as necessary, each named uniquely, programmatically, so I can then take each of these and then plug them into a new model or whatever?
If I've understood your problem correctly, a MultiIndex should do exactly what you want.
To do this on your dataframe:
df.reset_index(inplace=True)
df.set_index(['levels', 'date'], inplace=True)
# in the case of your example above, this will produce:
df =
levels date score
1005 2019-01-01 00:00:00 99.438851
2019-01-01 01:00:00 92.081975
2019-01-01 02:00:00 93.032991
2019-01-01 03:00:00 1.991615
2019-01-01 04:00:00 12.723531
2019-01-01 05:00:00 74.443313
2019-01-01 06:00:00 12.154499
2019-01-01 07:00:00 96.439228
2019-01-01 08:00:00 64.283731
2019-01-01 09:00:00 83.165093
2019-01-01 10:00:00 75.740610
2019-01-01 11:00:00 25.721404
2019-01-01 12:00:00 37.493829
2019-01-01 13:00:00 51.783549
2019-01-01 14:00:00 7.223582
2019-01-01 15:00:00 0.932651
2019-01-01 16:00:00 95.916686
2019-01-01 17:00:00 11.579450
1027 2019-01-01 00:00:00 99.438851
2019-01-01 01:00:00 92.081975
2019-01-01 02:00:00 93.032991
2019-01-01 03:00:00 1.991615
2019-01-01 04:00:00 12.723531
2019-01-01 05:00:00 74.443313
2019-01-01 06:00:00 12.154499
2019-01-01 07:00:00 96.439228
2019-01-01 08:00:00 64.283731
2019-01-01 09:00:00 83.165093
2019-01-01 10:00:00 75.740610
2019-01-01 11:00:00 25.721404
2019-01-01 12:00:00 37.493829
2019-01-01 13:00:00 51.783549
2019-01-01 14:00:00 7.223582
2019-01-01 15:00:00 0.932651
2019-01-01 16:00:00 95.916686
2019-01-01 17:00:00 11.579450
2019-01-01 18:00:00 91.226938
2019-01-01 19:00:00 31.564530
2019-01-01 20:00:00 39.511358
2019-01-01 21:00:00 59.787468
2019-01-01 22:00:00 4.666549
2019-01-01 23:00:00 92.197337
#... etc
You can then access each level of the data using these indices:
df.loc[1005, :]
>
date score
2019-01-01 00:00:00 99.438851
2019-01-01 01:00:00 92.081975
2019-01-01 02:00:00 93.032991
2019-01-01 03:00:00 1.991615
2019-01-01 04:00:00 12.723531
2019-01-01 05:00:00 74.443313
2019-01-01 06:00:00 12.154499
2019-01-01 07:00:00 96.439228
2019-01-01 08:00:00 64.283731
2019-01-01 09:00:00 83.165093
2019-01-01 10:00:00 75.740610
2019-01-01 11:00:00 25.721404
2019-01-01 12:00:00 37.493829
2019-01-01 13:00:00 51.783549
2019-01-01 14:00:00 7.223582
2019-01-01 15:00:00 0.932651
2019-01-01 16:00:00 95.916686
2019-01-01 17:00:00 11.579450
You can also loop through all the 'levels' of the data using:
for level, data in df.groupby(level=0):
    # do something with 'data', the sub-frame for this 'level'
And, if needed, get a list of all 'levels' contained in the data:
df.index.levels[0]
> [1005, 1027, ...]
This might prove to be more flexible than creating numerous individually named dataframes, and is closer to the intended use of pandas.
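If you really do want one object per level, e.g. to feed each one into its own model, a dictionary keyed by the level value is usually cleaner than dynamically named variables. A sketch of mine, assuming the flat toy df from the question (columns 'date', 'level', 'score') before any set_index call; fit_my_model is a hypothetical placeholder for whatever model you train:
# one sub-frame per level, keyed by the level value; no dynamic variable names needed
frames = {lvl: grp.drop(columns='level').set_index('date')
          for lvl, grp in df.groupby('level')}
frames[1005]  # e.g. the frame for level 1005
# fit_my_model is a hypothetical placeholder, not a real function
models = {lvl: fit_my_model(sub_df) for lvl, sub_df in frames.items()}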
I have temperature time series data in 15 minute intervals.
If a temp value is missing, I want to take the mean of the temp values from the last/next 10 days at the same time and put it in place of the NaN.
This is my code.
This returns a pandas Series with the values I want to use for the NaN values:
pd.Series(df.index[(df.Temp.isna())]).apply(last10daysmean)
How do I put the above into this one below?
df.Temp[df.Temp.isna()]
This returns the NaN slots.
I don't have the function last10daysmean from your question, so I'll substitute it with this:
def last10daysmean(x):
    return "TenDaysMeanPlaceholder"
You should try to include sample data when you post a question, but I can just make some temp data now:
df = pd.DataFrame({
    "Temp": [2, 3, 4, 5, 6, np.nan, 3, 4, np.nan]
})
This fills the isna rows with the output of our dummy version for your last10daysmean function:
df.Temp[df.Temp.isna()] = df.Temp[df.Temp.isna()].apply(last10daysmean)
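The same assignment written with .loc avoids the chained-assignment pitfall; a small variation on the line above:
df.loc[df.Temp.isna(), 'Temp'] = df.Temp[df.Temp.isna()].apply(last10daysmean)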
You can try writing the row values one by one with an apply function:
df = pd.DataFrame()
df['value'] = np.random.random(len(pd.date_range(start='2019-1-1',end='2019-1-2',freq='15Min')))*10
df.index = pd.date_range(start='2019-1-1',end='2019-1-2',freq='15Min')
df.loc[df['value'] < 2, 'value'] = np.nan
Sample Dataframe
value
2019-01-01 00:00:00 NaN
2019-01-01 00:15:00 6.100087
2019-01-01 00:30:00 7.953615
2019-01-01 00:45:00 7.214069
2019-01-01 01:00:00 3.697723
2019-01-01 01:15:00 5.772333
2019-01-01 01:30:00 NaN
2019-01-01 01:45:00 2.827144
Function that fills a missing value with the mean of the following 10-day slice of the dataframe:
import math

def last10daysmean(x, ind):
    df.loc[ind, 'value'] = x.mean()

temp = df.index.map(lambda x: last10daysmean(df['value'].loc[x:x + pd.Timedelta(days=10)], x)
                    if math.isnan(df.loc[x, 'value']) else df.loc[x, 'value'])
Out:
value
2019-01-01 00:00:00 5.901569
2019-01-01 00:15:00 6.100087
2019-01-01 00:30:00 7.953615
2019-01-01 00:45:00 7.214069
2019-01-01 01:00:00 3.697723
2019-01-01 01:15:00 5.772333
2019-01-01 01:30:00 5.594577
2019-01-01 01:45:00 2.827144
2019-01-01 02:00:00 6.409086