How do I aggregate rows with an upper bound on column value? - python

I have a pd.DataFrame I'd like to transform:
id values days time value_per_day
0 1 15 15 1 1
1 1 20 5 2 4
2 1 12 12 3 1
I'd like to aggregate these into equal buckets of 10 days. Since days at time 1 is larger than 10, this should spill into the next row, having the value/day of the 2nd row an average of the 1st and the 2nd.
Here is the resulting output, where (values, 0) = 15*(10/15) = 10 and (values, 1) = (5+20)/2:
id values days value_per_day
0 1 10 10 1.0
1 1 25 10 2.5
2 1 10 10 1.0
3 1 2 2 1.0
I've tried pd.Grouper:
df.set_index('days').groupby([pd.Grouper(freq='10D', label='right'), 'id']).agg({'values': 'mean'})
Out[146]:
values
days id
5 days 1 16
15 days 1 10
But I'm clearly using it incorrectly.
csv for convenience:
id,values,days,time
1,10,15,1
1,20,5,2
1,12,12,3

Notice: this is a time cost solution
newdf=df.reindex(df.index.repeat(df.days))
v=np.arange(sum(df.days))//10
dd=pd.DataFrame({'value_per_day': newdf.groupby(v).value_per_day.mean(),'days':np.bincount(v)})
dd
Out[102]:
days value_per_day
0 10 1.0
1 10 2.5
2 10 1.0
3 2 1.0
dd.assign(value=dd.days*dd.value_per_day)
Out[103]:
days value_per_day value
0 10 1.0 10.0
1 10 2.5 25.0
2 10 1.0 10.0
3 2 1.0 2.0
I did not include groupby id here, if you need that for your real data, you can do for loop with df.groupby(id) , then apply above steps within the for loop

Related

Closest non equal row in a column in Pandas dataframe

I have this df
d={}
d['id']=['1','1','1','1','1','1','1','1','2','2','2','2','2','2','2','2']
d['qty']=[5,5,5,5,5,6,5,5,1,1,2,2,2,3,5,8]
I would like to create a column that is going to have the following non-equal value of column qty. Meaning that if qty is equal to 5 and its next row is 5 I am going to skip it and look until I find next value not equal to 5, In my case it is 6. And all this should be grouped by id
Here is the desired dataframe.
d['id']=['1','1','1','1','1','1','1','1','2','2','2','2','2','2','2','2']
d['qty']=[5,5,5,5,5,6,5,5,1,1,2,2,2,3,5,8]
d['qty2']=[6,6,6,6,6,5,'NAN','NAN',2,2,3,3,3,5,8,'NAN']
Any help is very much appreciated
You can groupby.shift, mask the identical values, and groupby.bfill:
# shift up per group
s = df.groupby('id')['qty'].shift(-1)
# keep only the different values and bfill per group
df['qty2'] = s.where(df['qty'].ne(s)).groupby(df['id']).bfill()
output:
id qty qty2
0 1 5 6.0
1 1 5 6.0
2 1 5 6.0
3 1 5 6.0
4 1 5 6.0
5 1 6 5.0
6 1 5 NaN
7 1 5 NaN
8 2 1 2.0
9 2 1 2.0
10 2 2 3.0
11 2 2 3.0
12 2 2 3.0
13 2 3 5.0
14 2 5 8.0
15 2 8 NaN

Conditional sum from rows into a new column in pandas

I am looking to create a new column in panda based on the value in the row. My sample data:
df=pd.DataFrame({"A":['a','a','a','a','a','a','b','b','b'],
"Sales":[2,3,7,1,4,3,5,6,9,10,11,8,7,13,14],
"Week":[1,2,3,4,5,11,1,2,3,4])
I want a new column "Last3WeekSales" corresponding to each week, having the sum of sales for the previous 3 weeks.
NOTE: Shift() won't work here as data for some weeks is missing.
Logic which I thought:
Checking the week no. in each row, then summing up the data from w-1, w-2, w-3.
Output required:
A Week Last3WeekSales
0 a 1 0
1 a 2 2
2 a 3 5
3 a 4 12
4 a 5 11
5 a 11 0
6 b 1 0
7 b 2 5
8 b 3 11
9 b 4 20
Use groupby, shift and rolling:
df['Last3WeekSales'] = df.groupby('A')['Sales']\
.apply(lambda x: x.shift(1)
.rolling(3, min_periods=1)
.sum())\
.fillna(0)
Output:
A Sales Week Last3WeekSales
0 a 2 1 0.0
1 a 3 2 2.0
2 a 7 3 5.0
3 a 1 4 12.0
4 a 4 5 11.0
5 a 3 6 12.0
6 b 5 1 0.0
7 b 6 2 5.0
8 b 9 3 11.0
you can use pandas.rolling_sum to sum over 3 last values, and shift(n) to shift your column by n times (1 in your case).
if we suppose you a column 'sales' with the sales of each week, the code would be :
df["Last3WeekSales"] = df.groupby("A")["sales"].apply(lambda x: pd.rolling_sum(x.shoft(1),3))

Replacing values in dataframe with 0s and 1s based on conditions

I would like to filter and replace. For the columns with are lower or higher than zero and not NaN's, I would like to set for one and the others, set to zero.
mask = ((ts[x] > 0)
| (ts[x] < 0))
ts[mask]=1
ts[ts[x]==1]
I did this and is working but I have to deal with the values that do not attend this condition replacing with zero.
Any recommendations? I am quite confusing, and also would be better to use where function in this case?
Thanks all!
Sample Data
asset.relativeSetpoint.350
0 -60.0
1 0.0
2 NaN
3 100.0
4 0.0
5 NaN
6 -120.0
7 -245.0
8 0.0
9 123.0
10 0.0
11 -876.0
Expected result
asset.relativeSetpoint.350
0 1
1 0
2 0
3 1
4 0
5 0
6 1
7 1
8 0
9 1
10 0
11 1
You can do this by applying a logical AND on the two conditions and converting the resultant mask to integer.
df
asset.relativeSetpoint.350
0 -60.0
1 0.0
2 NaN
3 100.0
4 0.0
5 NaN
6 -120.0
7 -245.0
8 0.0
9 123.0
10 0.0
11 -876.0
(df['asset.relativeSetpoint.350'].ne(0)
& df['asset.relativeSetpoint.350'].notnull()).astype(int)
0 1
1 0
2 0
3 1
4 0
5 0
6 1
7 1
8 0
9 1
10 0
11 1
Name: asset.relativeSetpoint.350, dtype: int64
The first condition df['asset.relativeSetpoint.350'].ne(0) gets a boolean mask of all elements that are not equal to 0 (this would include <0, >0, and NaN).
The second condition df['asset.relativeSetpoint.350'].notnull() will get a boolean mask of elements that are not NaNs.
The two masks are ANDed, and converted to integer.
How about using apply?
df[COLUMN_NAME] = df[COLUMN_NAME].apply(lambda x: 1 if x != 0 else 0)

Create a new column based on the state of another column in pandas

Suppose I want to create a new column that counts the number of days since the state was 1. As an example, the current columns would be the first three below. The forth column is what I'm trying to get.
Index State Days Since_Days
1 1 0 0
2 0 20 20
3 0 40 40
4 1 55 55
5 1 60 5
6 1 70 10
Without resorting to for-loop, what is a pandas way to approach this?
You can also try following where first you group by State and for those that have State == 1, fill by difference. Then, for those which has State == 0 will be na which can be filled by corresponding Days column value
df.loc[df.State == 1, 'Since_Days'] = df.groupby('State')['Days'].diff().fillna(0)
df['Since_Days'].fillna(df['Days'],inplace=True)
print(df)
Result:
Index State Days Since_Days
0 1 1 0 0.0
1 2 0 20 20.0
2 3 0 40 40.0
3 4 1 55 55.0
4 5 1 60 5.0
5 6 1 70 10.0
The values to be subtracted can be formed with:
ser = df['Days'].where(df['State']==1, np.nan).ffill().shift()
If you subtract this from the original Days column, you'll have:
df['Days'].sub(ser, fill_value=0).astype('int')
Out:
0 0
1 20
2 40
3 55
4 5
5 10
Name: Days, dtype: int64

How to group numeric values by hours?

I have the following data:
df =
MONTH DAY HOUR DURATION
1 1 7 20
1 1 7 21
1 2 7 20
1 2 8 22
2 1 7 19
2 1 8 25
2 1 8 29
2 2 8 27
I want to get the mean DURATION grouped by HOUR and averaged over MONTH and DAY. In other words, I want to know what is the average DURATION per HOUR.
This is my current code. If I delete 'MONTH','DAY' from df.groupby(['MONTH','DAY','HOUR','DURATION']), then I get higher values of DURATION, which are not correct. Therefore I decided to keep 'MONTH','DAY'.
grouped = df.groupby(['MONTH','DAY','HOUR','DURATION']).size() \
.groupby(level=['HOUR','DURATION']).mean().reset_index()
grouped
However, anyway, it gives me incorrect output. This is an example for some random data (it can be seen that the hour 8 is repeated many times, also the column 0 appears).
HOUR DURATION 0
0 7 122.0 1.0
1 8 77.0 1.0
2 8 82.0 1.0
3 8 83.0 1.0
Have you tried:
df.groupby("HOUR").agg({'DURATION_1' : 'mean', 'DURATION_2' : 'mean'})

Categories

Resources