grouping multiple time values into a start and finish time - python

I have a dataframe as follows
import pandas as pd
import numpy as np
IDs = ['A','A','A','B','B']
times = pd.date_range(start='01/01/2019',end='01/02/2019',freq='h')
times_2 = pd.date_range(start='01/01/2019',end='01/02/2019',freq='h') + pd.Timedelta('15min')
Vals = [np.random.randint(15,250) for x in enumerate(times)]
df = pd.DataFrame({'id': IDs * 5,
                   'Start': times,
                   'End': times_2,
                   'Value': Vals},
                  columns=['id', 'Start', 'End', 'Value'])
This gives me a DataFrame as follows.
print(df.head(5))
id Start End Value
0 A 2019-01-01 00:00:00 2019-01-01 00:15:00 52
1 A 2019-01-01 01:00:00 2019-01-01 01:15:00 69
2 A 2019-01-01 02:00:00 2019-01-01 02:15:00 209
3 B 2019-01-01 03:00:00 2019-01-01 03:15:00 163
4 B 2019-01-01 04:00:00 2019-01-01 04:15:00 70
Now what I'm trying to do is apply a groupby to my DataFrame to get the sum of the Value column; however, whilst doing this I would like to retain the minimum Start time and maximum End time of my df.
So my example output would be as follows:
id Start End Value
0 A 2019-01-01 00:00:00 2019-01-01 22:15:00 2007
1 B 2019-01-01 03:00:00 2019-01-02 00:15:00 1385
The only way I've sort of made this work is to take the min and max Start/End times for each unique id, pass these to a list and then manually rebuild the start and end times, but it was slow, messy and prone to error... hoping someone here can guide me as to what I'm missing.

Using groupby with agg
df.groupby('id').agg({'Start': 'min', 'End': 'max', 'Value': 'sum'})  # optionally chain .reset_index()
Out[92]:
Start End Value
id
A 2019-01-01 00:00:00 2019-01-01 22:15:00 2152
B 2019-01-01 03:00:00 2019-01-02 00:15:00 972
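
If you would rather have id back as a normal column straight away, a small variant of the same groupby using named aggregation is sketched below (this assumes pandas 0.25 or later and is not part of the answer above):
out = (df.groupby('id', as_index=False)
         .agg(Start=('Start', 'min'),
              End=('End', 'max'),
              Value=('Value', 'sum')))
print(out)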

Related

Python Pandas Upsampling on average values between data points (15min to 1min)

I have some issues with my data resampling in pandas. I'm trying to upsample 15-minute values to 1-minute values. The resampled DataFrame values should contain the sum split equally between the two values of the original DataFrame. This code generates an extract of the problem.
import pandas as pd
import numpy as np
dates = pd.DataFrame(pd.date_range(start="20190101",end="20200101", freq="15min"))
values = pd.DataFrame(np.random.randint(0,10,size=(35041, 1)))
df = pd.concat([dates,values], axis = 1)
df = df.set_index(pd.DatetimeIndex(df.iloc[:,0]))
print(df.resample("min").agg("sum").head(16))
This is an example output:
2019-01-01 00:00:00 3
2019-01-01 00:01:00 0
2019-01-01 00:02:00 0
2019-01-01 00:03:00 0
2019-01-01 00:04:00 0
2019-01-01 00:05:00 0
2019-01-01 00:06:00 0
2019-01-01 00:07:00 0
2019-01-01 00:08:00 0
2019-01-01 00:09:00 0
2019-01-01 00:10:00 0
2019-01-01 00:11:00 0
2019-01-01 00:12:00 0
2019-01-01 00:13:00 0
2019-01-01 00:14:00 0
2019-01-01 00:15:00 3
The values shown as 0 should be replaced by the sum of the two values (in this example: 2019-01-01 00:00:00 3; and 2019-01-01 00:15:00 3), which equals 6, and this sum should be evenly distributed over the time range.
2019-01-01 00:00:00 6/15
2019-01-01 00:01:00 6/15
2019-01-01 00:02:00 6/15
2019-01-01 00:03:00 6/15
2019-01-01 00:04:00 6/15
2019-01-01 00:05:00 6/15
2019-01-01 00:06:00 6/15
2019-01-01 00:07:00 6/15
2019-01-01 00:08:00 6/15
2019-01-01 00:09:00 6/15
2019-01-01 00:10:00 6/15
2019-01-01 00:11:00 6/15
2019-01-01 00:12:00 6/15
2019-01-01 00:13:00 6/15
2019-01-01 00:14:00 6/15
2019-01-01 00:15:00 6/15
This should be done for each resampled group over the whole Dataframe.
In other words, the sum of the original DataFrame and the resampled DataFrame should be equal.
Thanks for your help.
First of all, I would personally recommend working with a Series if there is only one column.
series = pd.Series(index=pd.date_range(start="20190101", end="20200101", freq="15min"),
                   data=np.random.randint(0, 10, size=(35041,)).tolist())
Then, I would create a new index with minutely values, calculate the cumulative sum of the values, and interpolate between these values. In your use case, "linear" is the suggested interpolation method:
beginning = series.index[0]
end = series.index[-1]
new_index = pd.date_range(beginning, end, freq="1min")
cumsum = series.cumsum()
cumsum = cumsum.reindex(new_index)
cumsum = cumsum.interpolate("linear")
Afterwards, you get an interpolated cumulative sum, which you can convert back to the values you are after via:
series_upsampled = cumsum.diff()
If you want, you can shift the series_upsampled by 1, doing
series_upsampled = series_upsampled.shift(-1)
Pay attention to the NaN value at the beginning (or, if you shift your series, at the end).
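Putting the answer's steps together, a minimal end-to-end sketch (variable names partly mine) that also shows how the two totals relate:
import numpy as np
import pandas as pd

series = pd.Series(np.random.randint(0, 10, size=97),
                   index=pd.date_range("2019-01-01", "2019-01-02", freq="15min"))
# interpolate the cumulative sum on a minutely index, then undo it with diff
new_index = pd.date_range(series.index[0], series.index[-1], freq="1min")
cumsum = series.cumsum().reindex(new_index).interpolate("linear")
series_upsampled = cumsum.diff()
# diff drops the very first value, so add it back when comparing totals
print(series.sum(), series.iloc[0] + series_upsampled.sum())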

Pandas: How to get the most immediately preceding row that fulfills a condition? Something like a variable-length shift

I have a table, indexed by date, that has values of price that I want to use when creating a new column, previous_close.
date | price
2019-01-01 00:00:00 | 2
2019-01-01 04:00:00 | 3
2019-01-02 00:00:00 | 4
2019-01-02 04:00:00 | 5
I want to generate a column previous_close that, for each row, contains the previous day's last price, so the output will be as follows:
date | price | previous_close
2019-01-01 00:00:00 | 2 | NaN
2019-01-01 04:00:00 | 3 | NaN
2019-01-02 00:00:00 | 4 | 3
2019-01-02 04:00:00 | 5 | 3
So far the only way I've figured out how to do this is with df.apply, which iterates row-wise and for every row filters the index for the last row of the latest preceding day. However, even though the DataFrame is date-indexed this takes a lot of time; for a table with a hundred thousand rows it takes several minutes to populate.
I was wondering if there was any way to create the new series in a vectorized form; something like df.shift(num_periods) but with the num_periods adjusted according to the row's date value.
For the reindexing part, I suggest using resample followed by reindex:
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({"date": pd.date_range("2019-01-01 22:00:00", periods=10, freq="H"),
"price": np.random.randint(1, 100, 10)})
df = df.set_index("date")
df = pd.concat([df.price,
df.resample("d").last().shift().rename(columns={"price":"close"}).reindex(df.index, method='ffill')],
axis = 1)
And you get the result:
price close
date
2019-01-01 22:00:00 67 NaN
2019-01-01 23:00:00 93 NaN
2019-01-02 00:00:00 99 93.0
2019-01-02 01:00:00 18 93.0
2019-01-02 02:00:00 84 93.0
2019-01-02 03:00:00 58 93.0
2019-01-02 04:00:00 87 93.0
2019-01-02 05:00:00 98 93.0
2019-01-02 06:00:00 97 93.0
2019-01-02 07:00:00 48 93.0
EDIT:
If your business day ends at 02:00 and you want the close taken at that hour, I suggest applying a DateOffset to the date and then using the same method:
df = pd.DataFrame({"date": pd.date_range("2019-01-01 22:00:00", periods=10, freq="H"),
"price": np.random.randint(1, 100, 10)})
df["proxy"] = df.date + pd.DateOffset(hours=-3)
df = df.set_index("proxy")
df = pd.concat([df[["price", "date"]],
(df.price.resample("d").last().shift()
.rename({"price":"close"})
.reindex(df.index, method='ffill'))],
axis = 1).reset_index(drop=True).set_index("date")
You get the result:
price close
date
2019-01-01 22:00:00 67 NaN
2019-01-01 23:00:00 93 NaN
2019-01-02 00:00:00 99 NaN
2019-01-02 01:00:00 18 NaN
2019-01-02 02:00:00 84 NaN
2019-01-02 03:00:00 58 84.0
2019-01-02 04:00:00 87 84.0
2019-01-02 05:00:00 98 84.0
2019-01-02 06:00:00 97 84.0
2019-01-02 07:00:00 48 84.0
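
As a possible alternative sketch (not part of the answer above; the names are mine): with a DatetimeIndex you can take each day's last price once, shift it by one day, and map it back onto every row via the normalized date.
import pandas as pd
import numpy as np

np.random.seed(123)
df = pd.DataFrame({"price": np.random.randint(1, 100, 10)},
                  index=pd.date_range("2019-01-01 22:00:00", periods=10, freq="H"))
# last price of each calendar day, shifted so each day sees the previous day's close
daily_close = df["price"].resample("D").last().shift()
# map every row's normalized date (midnight) to that shifted close
df["previous_close"] = df.index.normalize().map(daily_close)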

combine columns with different data types to make a single dateTime column in pandas data frames

I have imported data from some source that has the date as dtype object and the hour as an integer, and it looks something like:
Date Hour Val
2019-01-01 1 0
2019-01-01 2 0
2019-01-01 3 0
2019-01-01 4 0
2019-01-01 5 0
2019-01-01 6 0
2019-01-01 7 0
2019-01-01 8 0
I need a single column that has the date-time in a column that looks like this:
DATETIME
2019-01-01 01:00:00
2019-01-01 02:00:00
2019-01-01 03:00:00
2019-01-01 04:00:00
2019-01-01 05:00:00
2019-01-01 06:00:00
2019-01-01 07:00:00
2019-01-01 08:00:00
I tried to convert the date column to dateTime format using
pd.datetime(df.Date)
and then using
df.Date.dt.hour = df.Hour
I get the error
ValueError: modifications to a property of a datetimelike object are not supported. Change values on the original.
Is there an easy way to do this?
Use pandas.to_timedelta and pandas.to_datetime:
# if needed
df['Date'] = pd.to_datetime(df['Date'])
df['Datetime'] = df['Date'] + pd.to_timedelta(df['Hour'], unit='H')
[out]
Date Hour Val Datetime
0 2019-01-01 1 0 2019-01-01 01:00:00
1 2019-01-01 2 0 2019-01-01 02:00:00
2 2019-01-01 3 0 2019-01-01 03:00:00
3 2019-01-01 4 0 2019-01-01 04:00:00
4 2019-01-01 5 0 2019-01-01 05:00:00
5 2019-01-01 6 0 2019-01-01 06:00:00
6 2019-01-01 7 0 2019-01-01 07:00:00
7 2019-01-01 8 0 2019-01-01 08:00:00
Since you asked for a method combining the columns and using a single pd.to_datetime call, you could do:
df['Datetime'] = pd.to_datetime(df['Date'].astype(str) + ' ' + df['Hour'].astype(str),
                                format='%Y-%m-%d %H')
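A minimal self-contained sketch of the same idea (the data here is made up), which also shows why the 24-hour %H directive is used once hours go above 12:
import pandas as pd

df = pd.DataFrame({'Date': ['2019-01-01', '2019-01-01'],
                   'Hour': [8, 17]})
# concatenate the two columns as strings and parse them in one to_datetime call
df['Datetime'] = pd.to_datetime(df['Date'] + ' ' + df['Hour'].astype(str),
                                format='%Y-%m-%d %H')
print(df)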

Replace "flatline" repeated data in Pandas series with nan

I have hourly time series data stored in a pandas series. Similar to this example:
import pandas as pd
import numpy as np
date_rng = pd.date_range(start='1/1/2019', end='1/2/2019', freq='H')
data = np.random.uniform(180,182,size=(len(date_rng)))
timeseries = pd.Series(data, index=date_rng)
timeseries.iloc[4:12] = 181.911
At three decimal places, it is highly unlikely the data will be exactly the same for more than, say, 3 hours in a row. When this flatlining occurs, it indicates an issue with the sensor. So I want to detect repeated data and replace it with nan values (i.e., detect the repeated values 181.911 in the above and replace with nan)
I assume I can iterate over the time series and detect/replace that way, but is there a more efficient way to do this?
You can do it with diff, but the first occurrence is retained in the series.
timeseries.where(timeseries.diff(1)!=0.0,np.nan)
2019-01-01 00:00:00 180.539278
2019-01-01 01:00:00 181.509729
2019-01-01 02:00:00 180.740326
2019-01-01 03:00:00 181.736425
2019-01-01 04:00:00 181.911000
2019-01-01 05:00:00 NaN
2019-01-01 06:00:00 NaN
2019-01-01 07:00:00 NaN
2019-01-01 08:00:00 NaN
2019-01-01 09:00:00 NaN
2019-01-01 10:00:00 NaN
2019-01-01 11:00:00 NaN
2019-01-01 12:00:00 180.093216
2019-01-01 13:00:00 180.623440
The first occurrence can also be removed by combining diff(-1) and diff(1):
np.c_[timeseries.where(timeseries.diff(-1)!=0.0,np.nan), timeseries.where(timeseries.diff(1)!=0.0,np.nan)].mean(axis=1)
This works when the repetitions are consecutive in the series.
With the following reasonably efficient function, one can choose the minimum number of repeated values to treat as a flatline:
import numpy as np

def remove_flatlines(ts, threshold):
    # get start and end indices of each flatline as an n x 2 array
    isflat = np.concatenate(([False], np.isclose(ts.diff(), 0), [False]))
    isedge = isflat[1:] != isflat[:-1]
    flatrange = np.where(isedge)[0].reshape(-1, 2)
    # include also the first value of each flatline
    flatrange[:, 0] -= 1
    # remove flatlines with at least threshold number of equal values
    ts = ts.copy()
    for j in range(len(flatrange)):
        if flatrange[j][1] - flatrange[j][0] >= threshold:
            ts.iloc[flatrange[j][0]:flatrange[j][1]] = np.nan
    return ts
Applied to example:
remove_flatlines(timeseries, threshold=3)
2019-01-01 00:00:00 181.447940
2019-01-01 01:00:00 180.142692
2019-01-01 02:00:00 180.994674
2019-01-01 03:00:00 180.116489
2019-01-01 04:00:00 NaN
2019-01-01 05:00:00 NaN
2019-01-01 06:00:00 NaN
2019-01-01 07:00:00 NaN
2019-01-01 08:00:00 NaN
2019-01-01 09:00:00 NaN
2019-01-01 10:00:00 NaN
2019-01-01 11:00:00 NaN
2019-01-01 12:00:00 180.972644
2019-01-01 13:00:00 181.969759
2019-01-01 14:00:00 181.008693
2019-01-01 15:00:00 180.769328
2019-01-01 16:00:00 180.576061
2019-01-01 17:00:00 181.562315
2019-01-01 18:00:00 181.978567
2019-01-01 19:00:00 181.928330
2019-01-01 20:00:00 180.773995
2019-01-01 21:00:00 180.475290
2019-01-01 22:00:00 181.460028
2019-01-01 23:00:00 180.220693
2019-01-02 00:00:00 181.630176
Freq: H, dtype: float64
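
If you prefer to stay in pandas, a possible alternative sketch (not from the answers above; it uses exact equality rather than np.isclose): label each run of consecutive equal values, count the run lengths with transform, and mask the long runs.
import pandas as pd

def mask_flatlines(ts, threshold=3):
    # label consecutive runs of identical values
    run_id = ts.ne(ts.shift()).cumsum()
    # size of the run each value belongs to
    run_len = ts.groupby(run_id).transform('size')
    # blank out values that sit in runs of at least `threshold` repeats
    return ts.mask(run_len >= threshold)

# applied to the question's data: mask_flatlines(timeseries, threshold=3)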

how to set a series into a slice of a dataframe?

I have temperature time series data in 15 minute intervals.
If a temp value is missing, I want to take the mean of the temp values of the last/next 10 days at the same time and put it in place of the NaN.
This is my code
This returns a pandas series with the values i want to keep for na values.
pd.Series(df.index[(df.Temp.isna())]).apply(last10daysmean)
How do I put the above into this one below?
df.Temp[df.Temp.isna()]
This returns the na slots.
I don't have the function last10daysmean from your question, so I will substitute it with this:
def last10daysmean(x):
    return "TenDaysMeanPlaceholder"
You should try to include sample data when you post a question, but I can just make up some temp data now:
df = pd.DataFrame({
    "Temp": [2, 3, 4, 5, 6, np.nan, 3, 4, np.nan]
})
This fills the isna rows with the output of our dummy version for your last10daysmean function:
df.loc[df['Temp'].isna(), 'Temp'] = df.loc[df['Temp'].isna(), 'Temp'].apply(last10daysmean)
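To answer the original question more directly, a minimal sketch (last10daysmean here is only a placeholder for the real function from the question, which takes a timestamp): compute the fill values once for the missing timestamps and write them back with .loc.
import numpy as np
import pandas as pd

# placeholder standing in for the question's last10daysmean(timestamp) -> float
def last10daysmean(ts):
    return 0.0

idx = pd.date_range('2019-01-01', periods=6, freq='15min')
df = pd.DataFrame({'Temp': [2.0, np.nan, 4.0, np.nan, 6.0, 7.0]}, index=idx)

na_index = df.index[df['Temp'].isna()]
fill_values = pd.Series([last10daysmean(ts) for ts in na_index], index=na_index)
# .loc aligns the Series on its index, so only the NaN slots are written
df.loc[na_index, 'Temp'] = fill_values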
You can try writing the values row by row with an apply-style function over the index:
df = pd.DataFrame()
df['value'] = np.random.random(len(pd.date_range(start='2019-1-1', end='2019-1-2', freq='15Min'))) * 10
df.index = pd.date_range(start='2019-1-1', end='2019-1-2', freq='15Min')
df.loc[df['value'] < 2, 'value'] = np.nan
Sample Dataframe
value
2019-01-01 00:00:00 NaN
2019-01-01 00:15:00 6.100087
2019-01-01 00:30:00 7.953615
2019-01-01 00:45:00 7.214069
2019-01-01 01:00:00 3.697723
2019-01-01 01:15:00 5.772333
2019-01-01 01:30:00 NaN
2019-01-01 01:45:00 2.827144
Function that takes a slice of the dataframe and writes its mean back in place:
import math

def last10daysmean(x, ind):
    df.loc[ind, 'value'] = x.mean()

temp = df.index.map(lambda x: last10daysmean(df['value'].loc[x:x + pd.Timedelta(days=10)], x) if math.isnan(df.loc[x, 'value']) else df.loc[x, 'value'])
Out:
value
2019-01-01 00:00:00 5.901569
2019-01-01 00:15:00 6.100087
2019-01-01 00:30:00 7.953615
2019-01-01 00:45:00 7.214069
2019-01-01 01:00:00 3.697723
2019-01-01 01:15:00 5.772333
2019-01-01 01:30:00 5.594577
2019-01-01 01:45:00 2.827144
2019-01-01 02:00:00 6.409086
