how to run regression on groups with dates - python

I am trying to calculate the regression coefficient of weight for every animal_id and cycle_nr in my df:
animal_id  cycle_nr  feed_date   weight
1003       8         2020-02-06  221
1003       8         2020-02-10  226
1003       8         2020-02-14  230
1004       1         2020-02-20  231
1004       1         2020-02-21  243
What I tried, using this source:
import pandas as pd
import statsmodels.api as sm

def GroupRegress(data, yvar, xvars):
    Y = data[yvar]
    X = data[xvars]
    X['intercept'] = 1.
    result = sm.OLS(Y, X).fit()
    return result.params

result = df.groupby(['animal_id', 'cycle_nr']).apply(GroupRegress, 'feed_date', ['weight'])
This code fails because my variable includes a date.
What I tried next:
I figured I could create a numeric column to use instead of my date column. I created a simple count_id column:
animal_id  cycle_nr  feed_date   weight  id
1003       8         2020-02-06  221     1
1003       8         2020-02-10  226     2
1003       8         2020-02-14  230     3
1004       1         2020-02-20  231     4
1004       1         2020-02-21  243     5
Then I ran my regression on this column:
result = df.groupby(['animal_id', 'cycle_nr']).apply(GroupRegress, 'id', ['weight'])
The slope calculation looks good, but of course the intercept makes no sense.
Then I realized that this method is only usable when the interval between measurements is regular. In most cases the interval is 7 days, but sometimes it is 10, 14 or 21 days.
I dropped records where the interval was not 7 days and re-ran my regression. It works, but I hate that I have to throw away perfectly fine data.
I'm wondering if there is a better approach where I can either include the date in my regression or can correct for the varying intervals of my dates. Any suggestions?

If the feed dates are strings, make a datetime Series using pandas.to_datetime.
Use that new Series to calculate the actual time difference between feedings.
Use the resulting Timedeltas in your regression instead of a fabricated linear sequence. Timedeltas have different attributes (e.g. days, microseconds) that can be used depending on the resolution you need.
My first instinct would be to produce the Timedeltas for each group separately. The first feeding in each group would of course be time zero.
Making the Timedeltas may not even be necessary - there are probably datetime-aware regression methods in NumPy or SciPy or maybe even pandas. I imagine there would have to be; it is a common enough application.
Instead of Timedeltas, the datetime Series could be converted to ordinal values for use in the regression.
df = pd.DataFrame(
    {
        "feed_date": [
            "2020-02-06",
            "2020-02-10",
            "2020-02-14",
            "2020-02-20",
            "2020-02-21",
        ]
    }
)
>>> q = pd.to_datetime(df.feed_date)
>>> q
0   2020-02-06
1   2020-02-10
2   2020-02-14
3   2020-02-20
4   2020-02-21
Name: feed_date, dtype: datetime64[ns]
>>> q.apply(pd.Timestamp.toordinal)
0    737461
1    737465
2    737469
3    737475
4    737476
Name: feed_date, dtype: int64
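Putting those pieces together, here is a minimal sketch of the group-wise regression using days since the first feeding in each group, so the irregular intervals need no special handling and no data has to be dropped. It regresses weight on time, which is presumably the intended direction; group_regress is a hypothetical helper, not code from the question.
import pandas as pd
import statsmodels.api as sm

def group_regress(data, yvar, datevar):
    # time zero is the first feeding within the group; irregular gaps are fine
    days = (data[datevar] - data[datevar].min()).dt.days
    X = sm.add_constant(days.rename("days"))
    return sm.OLS(data[yvar], X).fit().params

df["feed_date"] = pd.to_datetime(df["feed_date"])
result = df.groupby(["animal_id", "cycle_nr"]).apply(group_regress, "weight", "feed_date")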

Related

Integration of pandas timeframe

I want to integrate the following dataframe, such that I have the integrated value for every hour. I have roughly a 10 s sampling rate, but if an even time interval is necessary, I guess I can just use df.resample().
Timestamp Power [W]
2022-05-05 06:00:05+02:00 2.0
2022-05-05 06:00:15+02:00 1.2
2022-05-05 06:00:25+02:00 0.3
2022-05-05 06:00:35+02:00 4.3
2022-05-05 06:00:45+02:00 1.1
...
2022-05-06 20:59:19+02:00 1.4
2022-05-06 20:59:29+02:00 2.0
2022-05-06 20:59:39+02:00 4.1
2022-05-06 20:59:49+02:00 1.3
2022-05-06 20:59:59+02:00 0.8
So I want to be able to integrate over both hours and days, so my output could look like:
Timestamp Energy [Wh]
2022-05-05 07:00:00+02:00 some values
2022-05-05 08:00:00+02:00 .
2022-05-05 09:00:00+02:00 .
2022-05-05 10:00:00+02:00 .
2022-05-05 11:00:00+02:00
...
2022-05-06 20:00:00+02:00
2022-05-06 21:00:00+02:00
(hour 07:00 is to include values between 06:00-07:00, and so on...)
and
Timestamp Energy [Wh]
2022-05-05 .
2022-05-06 .
So how do I achieve this? I was thinking I could use scipy.integrate, but my outputs look a bit weird.
Thank you.
You could create a new column representing your Timestamp truncated to hours:
df['Timestamp_hour'] = df['Timestamp'].dt.floor('h')
Please note that in that case, the rows from hour 6:00 to 6:59 will be grouped into hour 6, not hour 7.
Then you can group your rows by your new column before applying your integration computation:
df_integrated_hour = (
    df
    .groupby('Timestamp_hour')
    .agg({'Power [W]': YOUR_INTEGRATION_FUNCTION})
    .rename(columns={'Power [W]': 'Energy [Wh]'})
    .reset_index()
)
Hope this will help you.
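For YOUR_INTEGRATION_FUNCTION, a minimal sketch could be a rectangle rule that assumes the roughly regular 10-second sampling rate from the question (rect_integrate is a hypothetical name, not a library function):
def rect_integrate(power):
    # each sample represents ~10 s of power in W; W*s / 3600 gives Wh
    return power.sum() * 10 / 3600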
Here's a very simple solution using rectangle integration, with rectangles spaced at 10-second intervals starting at zero and therefore NOT centered exactly on the data points (assuming the data arrives at regular intervals and none is missing), a.k.a. a simple average.
from numpy import random
import pandas as pd

times = pd.date_range('2022-05-05 06:00:04+02:00', '2022-05-06 21:00:00+02:00', freq='10S')
watts = random.rand(len(times)) * 5
df = pd.DataFrame(index=times, data=watts, columns=["Power [W]"])

# mean power over one hour is numerically equal to the energy in Wh
hourly = df.groupby([df.index.date, df.index.hour]).mean()
hourly.columns = ["Energy [Wh]"]
print(hourly)

hours_in_a_day = 24  # add special casing for leap days here, if required
daily = df.groupby(df.index.date).mean() * hours_in_a_day
daily.columns = ["Energy [Wh]"]
print(daily)
Output:
Energy [Wh]
2022-05-05 6 2.625499
7 2.365678
8 2.579349
9 2.569170
10 2.543611
11 2.742332
12 2.478145
13 2.444210
14 2.507821
15 2.485770
16 2.414057
17 2.567755
18 2.393725
19 2.609375
20 2.525746
21 2.421578
22 2.520466
23 2.653466
2022-05-06 0 2.559110
1 2.519032
2 2.472282
3 2.436023
4 2.378289
5 2.549572
6 2.558478
7 2.470721
8 2.429454
9 2.390543
10 2.538194
11 2.537564
12 2.492308
13 2.387632
14 2.435582
15 2.581616
16 2.389549
17 2.461523
18 2.576084
19 2.523577
20 2.572270
Energy [Wh]
2022-05-05 60.597007
2022-05-06 59.725029
Trapezoidal integration should give a slightly better approximation, but it's harder to implement correctly. You'd have to deal carefully with the hour boundaries: essentially inserting interpolated values twice at each full hour (at 09:59:59.999 and 10:00:00). You'd also have to figure out a way to extrapolate to the start and end of the range, i.e. in your example going from 06:00:05 back to 06:00:00. But be careful: what should happen if your measurements only start somewhere in the middle, like 06:17:23?
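A minimal sketch of that idea, using scipy.integrate's cumulative_trapezoid and assuming a DatetimeIndex with crude flat extrapolation at the range edges (hourly_energy is a hypothetical helper; it sidesteps the duplicated-sample trick by evaluating a cumulative integral at the hour boundaries):
import numpy as np
import pandas as pd
from scipy.integrate import cumulative_trapezoid

def hourly_energy(power):
    # insert the exact hour boundaries into the index and interpolate there
    hours = pd.date_range(power.index.min().floor("h"),
                          power.index.max().ceil("h"), freq="h")
    s = power.reindex(power.index.union(hours)).interpolate(method="time")
    s = s.ffill().bfill()  # crude flat extrapolation at the range edges
    # cumulative trapezoidal integral, with time measured in hours -> Wh
    x = s.index.asi8 / 3.6e12
    cum = pd.Series(cumulative_trapezoid(s.to_numpy(), x, initial=0.0),
                    index=s.index)
    # per-hour energy is the difference of the cumulative integral,
    # labeled by the end of the hour as in the question
    return cum.reindex(hours).diff().dropna()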
This solution uses a package called staircase, which is part of the pandas ecosystem and exists to make working with step functions (i.e. piecewise constant) easier.
It will create a Stairs object (which represents a step function) from a pandas.Series, then bin across arbitrary DatetimeIndex values, then integrate.
This solution requires staircase 2.4.2 or above
setup
df = pd.DataFrame(
    {
        "Timestamp": pd.to_datetime(
            [
                "2022-05-05 06:00:05+02:00",
                "2022-05-05 06:00:15+02:00",
                "2022-05-05 06:00:25+02:00",
                "2022-05-05 06:00:35+02:00",
                "2022-05-05 06:00:45+02:00",
            ]
        ),
        "Power [W]": [2.0, 1.2, 0.3, 4.3, 1.1],
    }
)
solution
import staircase as sc
# create step function
sf = sc.Stairs.from_values(
    initial_value=0,
    values=df.set_index("Timestamp")["Power [W]"],
)
# optional: plot
sf.plot(style="hlines")
# create the bins (datetime index) over which you want to integrate
# using 20s intervals in this example
bins = pd.date_range(
    "2022-05-05 06:00:00+02:00", "2022-05-05 06:01:00+02:00", freq="20s"
)
# slice into bins and integrate
result = sf.slice(bins).integral()
result will be a pandas.Series with an IntervalIndex and Timedelta values. The IntervalIndex retains timezone info, it just doesn't display it:
[2022-05-05 06:00:00, 2022-05-05 06:00:20) 0 days 00:00:26
[2022-05-05 06:00:20, 2022-05-05 06:00:40) 0 days 00:00:30.500000
[2022-05-05 06:00:40, 2022-05-05 06:01:00) 0 days 00:00:38
dtype: timedelta64[ns]
You can change the index to be the "left" values (and see this timezone info) like this:
result.index = result.index.left
You can change values to a float with division by an appropriate Timedelta. Eg to convert to minutes:
result/pd.Timedelta("1min")
note:
I am the creator of staircase. Please feel free to reach out with feedback or questions if you have any.

Complicated function with groupby and between? Python

Here is a sample dataset.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'VipNo': np.repeat(range(3), 2),
    'Quantity': np.random.randint(200, size=6),
    'OrderDate': np.random.choice(pd.date_range('1/1/2020', periods=365, freq='D'), 6, replace=False)})
print(df)
So I have a couple of steps to do. I want to create a new column named qtywithin1mon/totalqty. First I want to group by VipNo (each number represents an individual), because a person may have made multiple purchases. Then I want to see if the order date is within a certain range (let's say 2020/03/01 - 2020/03/31). If so, I want to divide the quantity ordered on that day by the total quantity this customer purchased. My dataset is big, so a customer may have ordered twice within the time range; in that case I would want the sum of the two orders divided by the total quantity. How can I achieve this goal? I really have no idea where to start...
Thank you so much!
You can create a new column masking the quantity within the given date range (multiplying by the boolean mask zeroes out quantities that fall outside it), then groupby:
start, end = pd.to_datetime(['2020/03/01','2020/03/31'])
(df.assign(QuantitySub=df['OrderDate'].between(start, end)*df.Quantity)
   .groupby('VipNo')[['Quantity', 'QuantitySub']]
   .sum()
   .assign(output=lambda x: x['QuantitySub']/x['Quantity'])
   .drop('QuantitySub', axis=1)
)
With a data frame:
VipNo Quantity OrderDate
0 0 105 2020-01-07
1 0 56 2020-03-04
2 1 167 2020-09-05
3 1 18 2020-05-08
4 2 151 2020-11-01
5 2 14 2020-03-17
The output is:
Quantity output
VipNo
0 161 0.347826
1 185 0.000000
2 165 0.084848
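If you instead want the ratio as a per-row column named qtywithin1mon/totalqty, as the question describes, an equivalent transform-based sketch under the same assumptions would be:
# per-customer totals, broadcast back to the original rows
total = df.groupby('VipNo')['Quantity'].transform('sum')
# quantities outside the date range are zeroed by the boolean mask
within = df['OrderDate'].between(start, end) * df['Quantity']
df['qtywithin1mon/totalqty'] = within.groupby(df['VipNo']).transform('sum') / total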

Multiply pandas dataframe by vlookup

I have a very large dataframe with multiple years of sales data and tens of thousands of skew_ids (e.g.):
date skew_id units_sold
0 2001-01-01 123 1
1 2001-01-02 123 2
2 2001-01-03 123 3
3 2001-01-01 456 4
4 2001-01-02 456 5
...
I have another dataframe that maps skew_ids to skew_price (e.g.):
skew_id skew_price
0 123 100.00
1 456 10.00
...
My first dataframe is so large that I cannot merge without hitting my memory limit.
I'd like to calculate the daily revenues (e.g.):
date revenue
0 2001-01-01 140
1 2001-01-02 250
2 2001-01-03 300
...
I think it depends on the number of rows, the number of unique skew_id values, and the size of your RAM.
One possible solution with map:
df1['revenue'] = df1['skew_id'].map(df2.set_index('skew_id')['skew_price']) * df1['units_sold']
df2 = df1.groupby('date', as_index=False)['revenue'].sum()
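As a quick sanity check of the map approach against the sample data in the question (df1 and df2 recreated by hand here):
import pandas as pd

df1 = pd.DataFrame({
    "date": ["2001-01-01", "2001-01-02", "2001-01-03", "2001-01-01", "2001-01-02"],
    "skew_id": [123, 123, 123, 456, 456],
    "units_sold": [1, 2, 3, 4, 5],
})
df2 = pd.DataFrame({"skew_id": [123, 456], "skew_price": [100.0, 10.0]})

df1['revenue'] = df1['skew_id'].map(df2.set_index('skew_id')['skew_price']) * df1['units_sold']
print(df1.groupby('date', as_index=False)['revenue'].sum())
# matches the expected daily revenues of 140, 250 and 300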
You could achieve this with a groupby, looking up each row's price within each date group (note the lookup has to be weighted by units_sold to give revenue):
price = df2.set_index('skew_id')['skew_price']
df.groupby('date').apply(lambda gr: (gr['skew_id'].map(price) * gr['units_sold']).sum())
Or if you run into memory problems you could loop over all dates yourself. This is slower, but might need less memory.
price = df2.set_index('skew_id')['skew_price']
revenue = []
for d in df.date.unique():
    day = df.loc[df.date == d]
    r = (day['skew_id'].map(price) * day['units_sold']).sum()
    revenue.append({'date': d, 'revenue': r})
pd.DataFrame(revenue)

Statsmodels operands could not be broadcast together in pandas series

I am currently using the Statsmodels library for forecasting. I am trying to run triple exponential smoothing with seasonality, trend, and a smoothing factor, but keep getting errors. Here is a sample pandas series with the frequency already set:
2018-01-01 25
2018-02-01 30
2018-03-01 40
2018-04-01 38
2018-05-01 33
2018-06-01 36
2018-07-01 34
2018-08-01 35
2018-09-01 37
2018-10-01 41
2018-11-01 36
2018-12-01 32
2019-01-01 31
2019-02-01 29
2019-03-01 28
2019-04-01 29
2019-05-01 30
Freq: MS, dtype: float64
Here is my code:
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing, SimpleExpSmoothing
def triple_expo(input_data_set, output_list1):
    data = input_data_set
    model1 = ExponentialSmoothing(data, trend='add', seasonal='mul', damped=False, seasonal_periods=12, freq='M').fit(smoothing_level=0.1, smoothing_slope=0.1, smoothing_seasonal=0.1, optimized=False)
    fcast1 = model1.forecast(12)
    fcast_list1 = list(fcast1)
    output_list1.append(fcast_list1)
for product in unique_products:
    product_slice = sorted_product_df["Product"] == product
    unique_slice = sorted_product_df[product_slice]
    amount = unique_slice["Amount"].tolist()
    index = unique_slice["Date"].tolist()
    unique_series = pd.Series(amount, index)
    unique_series.index = pd.DatetimeIndex(unique_series.index, freq=pd.infer_freq(unique_series.index))
    triple_expo(unique_series, triple_out_one)
I originally did not have a frequency argument at all because I was following the example on the statsmodels website which is here: http://www.statsmodels.org/stable/examples/notebooks/generated/exponential_smoothing.html
They did not pass a frequency argument at all, since they inferred it from the pandas DatetimeIndex. When I have no frequency argument, my error is "operands could not be broadcast together with shapes (5,) (12,)". I have 17 months of data and pandas recognizes the frequency as 'MS'. The 5 and the 12 refer to the 17 (17 = 12 + 5). I then pass freq='M' like I have in the code sample, and I get "The given frequency argument is incompatible with the given index". I then tried setting it to everything from len(data) to len(data)-1 and always got errors. I tried len(data)-1 because I was originally referencing this stack post: seasonal_decompose: operands could not be broadcast together with shapes on a series
In that post, he said it would work if you set the frequency to one less than the length of the data set. It does not work for me though. Any help would be appreciated. Thanks!
EDIT: A comment suggested I include more than one year's worth of data, and it worked when I did that (with seasonal_periods=12, the model apparently needs at least two full seasonal cycles to initialize).
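For reference, a minimal sketch of the same model on two full years of monthly data (the values are invented purely for illustration):
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

idx = pd.date_range("2018-01-01", periods=24, freq="MS")
data = pd.Series(30 + 5 * np.sin(np.arange(24) * 2 * np.pi / 12) + 0.2 * np.arange(24), index=idx)

model = ExponentialSmoothing(
    data, trend='add', seasonal='mul', seasonal_periods=12
).fit(
    smoothing_level=0.1,
    smoothing_trend=0.1,  # named smoothing_slope in statsmodels < 0.12
    smoothing_seasonal=0.1,
    optimized=False,
)
print(model.forecast(12))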

Grouping records with close DateTimes in Python pandas DataFrame

I have been spinning my wheels with this problem and was wondering if anyone has any insight on how best to approach it. I have a pandas DataFrame with a number of columns, including one datetime64[ns]. I would like to find some way to 'group' records together which have datetimes which are very close to one another. For example, I might be interested in grouping the following transactions together if they occur within two seconds of each other by assigning a common ID called Grouped ID:
Transaction ID    Time      Grouped ID
1                 08:10:02  1
2                 08:10:03  1
3                 08:10:50
4                 08:10:55
5                 08:11:00  2
6                 08:11:01  2
7                 08:11:02  2
8                 08:11:03  3
9                 08:11:04  3
10                08:15:00
Note that I am not looking to have the time window expand ad infinitum if transactions continue to occur at quick intervals - once a full 2-second window has passed, a new window begins with the next transaction (as shown in transactions 5-9). Additionally, I will ultimately be performing this analysis at the millisecond level (i.e. combining transactions within 50 ms), but I stuck with seconds for ease of presentation above.
Thanks very much for any insight you can offer!
The solution I suggest requires you to reindex your data with your Time data.
You can use a list of datetimes with the desired frequency, use searchsorted to find the nearest datetimes in your index, and then use the result for slicing (as suggested in the questions python pandas dataframe slicing by date conditions and Python pandas, how to truncate DatetimeIndex and fill missing data only in certain interval).
I'm using pandas 0.14.1 and the DateOffset object (http://pandas.pydata.org/pandas-docs/dev/timeseries.html?highlight=dateoffset). I didn't check with datetime64, but I guess you might adapt the code. DateOffset goes down to the microsecond level.
Using the following code,
import pandas as pd
import pandas.tseries.offsets as pto
import numpy as np

# Create some test data
d_size = 15
df = pd.DataFrame({"value": np.arange(d_size)}, index=pd.date_range("2014/11/03", periods=d_size, freq=pto.Milli()))

# Define periods to define groups (ticks)
ticks = pd.date_range("2014/11/03", periods=d_size // 3, freq=5 * pto.Milli())

# find nearest indexes matching the ticks
index_ticks = np.unique(df.index.searchsorted(ticks))

# make a dataframe with the group ids
dgroups = pd.DataFrame(index=df.index, columns=['Group id'])

# set the group ids (positional slicing, so iloc rather than loc)
for i, (mini, maxi) in enumerate(zip(index_ticks[:-1], index_ticks[1:])):
    dgroups.iloc[mini:maxi] = i

# update original dataframe
df['Group id'] = dgroups['Group id']
I was able to obtain this kind of dataframe:
value Group id
2014-11-03 00:00:00 0 0
2014-11-03 00:00:00.001000 1 0
2014-11-03 00:00:00.002000 2 0
2014-11-03 00:00:00.003000 3 0
2014-11-03 00:00:00.004000 4 0
2014-11-03 00:00:00.005000 5 1
2014-11-03 00:00:00.006000 6 1
2014-11-03 00:00:00.007000 7 1
2014-11-03 00:00:00.008000 8 1
2014-11-03 00:00:00.009000 9 1
2014-11-03 00:00:00.010000 10 2
2014-11-03 00:00:00.011000 11 2
2014-11-03 00:00:00.012000 12 2
2014-11-03 00:00:00.013000 13 2
2014-11-03 00:00:00.014000 14 2
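For what it's worth, a direct implementation of the windowing rule in the question is also straightforward on modern pandas. This is a sketch: window_groups is a hypothetical helper, and the times are parsed as Timedeltas purely for illustration.
import pandas as pd

def window_groups(times, window):
    # open a new window whenever a transaction falls outside the current one
    ids, gid, anchor = [], 0, None
    for t in times:  # assumes times are sorted ascending
        if anchor is None or t - anchor > window:
            gid += 1
            anchor = t  # the new window starts at this transaction
        ids.append(gid)
    return pd.Series(ids, index=times.index)

times = pd.to_timedelta(pd.Series([
    "08:10:02", "08:10:03", "08:10:50", "08:10:55", "08:11:00",
    "08:11:01", "08:11:02", "08:11:03", "08:11:04", "08:15:00",
]))
print(window_groups(times, pd.Timedelta(seconds=2)))
# -> 1 1 2 3 4 4 4 5 5 6; singleton groups (transactions 3, 4 and 10)
#    could then be masked out to reproduce the blanks in the question.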
