Add column and group by other column - python

How can I get the sum of a column, grouped by another column? E.g. for the following DataFrame:
Paid Week
5 1
2 1
7 2
7 2
How would I get the following output?
Paid Week
7 1
14 2
I have tried this, but it doesn't seem to actually add the values. It also prints some other columns along with it.
print(df.groupby(['Paid','Week']).sum())
Week Paid
1 0.0
0.5
2.4
3.0
3.8
3.9
6.6
2 0.0
0.9
2.4

Use:
df.groupby('Week', as_index=False)['Paid'].sum()
Output:
Week Paid
0 1 7
1 2 14
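A minimal, runnable sketch of the above, building the sample frame from the question:
import pandas as pd

df = pd.DataFrame({'Paid': [5, 2, 7, 7], 'Week': [1, 1, 2, 2]})
# sum 'Paid' within each 'Week'; as_index=False keeps 'Week' as a regular column
print(df.groupby('Week', as_index=False)['Paid'].sum())
#    Week  Paid
# 0     1     7
# 1     2    14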

You can use:
df.groupby(by=['Week']).sum()
Would give the following output:
Paid
Week
1 7
2 14
Use the following if you do not want the group labels as the index:
df.groupby(by=['Week'], as_index=False).sum()
Output:
Week Paid
0 1 7
1 2 14

df.groupby(['Week'])['Paid'].sum()

Related

Weekly Average In Pandas

I have a large year-long dataframe of occurrences with month (1-12), week (1-52), day_of_week (0-6), and hour (0-23).
Below is just a snippet of the dataset. Each row is an occurrence.
The first part of the snippet below shows multiple occurrences captured with a date/timestamp of 2018-04-01 00:00:00 (Sunday). The second part of the snippet below (after the first ellipses) shows multiple occurrences in the following hour and the third part is the next hour, and so on.
month week day_of_week hour
0 4 13 6 0
1 4 13 6 0
2 4 13 6 0
3 4 13 6 0
4 4 13 6 0
...
100 4 13 6 1
101 4 13 6 1
102 4 13 6 1
...
...
300 4 13 6 2
301 4 13 6 2
302 4 13 6 2
...
I would like to be able to display a summary of this dataset showing the weekly average count of occurrences for each of the hours (0-23) as well as for each month.
For example:
month hour weekly_ave
4 0 100
4 1 175
4 2 250
...
4 23 500
5 0 90
How do I do this using pandas groupby and aggregate functions?
Thanks!
df.groupby(['month','hour'])['hour'].count()
Then, if you need this formatted a little bit nicer:
df.groupby(['month','hour'])['hour'].count().rename("weekly_ave").reset_index()
I was able to figure it out. I had to do a second groupby:
df.groupby(['month', 'hour', 'week']) \
  .agg({'day_of_week': 'count'}) \
  .groupby(['month', 'hour']).mean() \
  .rename(columns={"day_of_week": "weekly_ave"}).reset_index()
This gave me what I needed but is there a more elegant way of doing this?
Thanks.
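For reference, a self-contained sketch of that two-step idea (count occurrences per week, then average the weekly counts); the data here is illustrative, not the original dataset:
import pandas as pd

# hypothetical occurrences: one row per event
df = pd.DataFrame({'month': [4, 4, 4, 4, 4, 4],
                   'week':  [13, 13, 14, 14, 13, 14],
                   'hour':  [0, 0, 0, 0, 1, 1]})

weekly_ave = (df.groupby(['month', 'hour', 'week'])
                .size()                                   # occurrences per (month, hour, week)
                .groupby(level=['month', 'hour']).mean()  # average across the weeks
                .rename('weekly_ave')
                .reset_index())
print(weekly_ave)
#    month  hour  weekly_ave
# 0      4     0         2.0
# 1      4     1         1.0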

Python - Adding rows to timeseries dataset

I have a pandas dataframe containing retail sales data which shows the total number of a product sold each week and the stock left at the end of the week. Unfortunately, the dataset only shows a row when a product has been sold and the stock left changes.
I would like to bulk out the dataset so that for each week there is a line for each product being sold. I've shown an example of this below - how can this be done?
As-Is:
Week Product Sold Stock
1 1 1 10
1 2 1 10
1 3 1 10
2 1 2 8
2 3 3 7
To-Be:
Week Product Sold Stock
1 1 1 10
1 2 1 10
1 3 1 10
2 1 2 8
2 2 0 10
2 3 3 7
Create a dataframe using product from itertools with all the combinations of both columns 'Week' and 'Product' and use merge with your original data. Let's say your dataframe is called dfp:
from itertools import product
new_dfp = (pd.DataFrame(list(product(dfp.Week.unique(), dfp.Product.unique())),
                        columns=['Week', 'Product'])
           .merge(dfp, how='left'))
You get the missing row in new_dfp:
Week Product Sold Stock
0 1 1 1.0 10.0
1 1 2 1.0 10.0
2 1 3 1.0 10.0
3 2 1 2.0 8.0
4 2 2 NaN NaN
5 2 3 3.0 7.0
Now you fillna on both columns with different values:
new_dfp['Sold'] = new_dfp['Sold'].fillna(0).astype(int)  # nothing was sold in the missing rows
new_dfp['Stock'] = new_dfp.groupby('Product')['Stock'].ffill().astype(int)
To fill 'Stock', you need to group by product and use the method 'ffill' to carry forward the value from the last week. At the end, you get:
Week Product Sold Stock
0 1 1 1 10
1 1 2 1 10
2 1 3 1 10
3 2 1 2 8
4 2 2 0 10
5 2 3 3 7
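Put together, a runnable version of the steps above on the As-Is table from the question (using .ffill(), the current spelling of the forward fill):
import pandas as pd
from itertools import product

dfp = pd.DataFrame({'Week':    [1, 1, 1, 2, 2],
                    'Product': [1, 2, 3, 1, 3],
                    'Sold':    [1, 1, 1, 2, 3],
                    'Stock':   [10, 10, 10, 8, 7]})

# all Week x Product combinations, then pull in the rows we actually have
new_dfp = (pd.DataFrame(list(product(dfp.Week.unique(), dfp.Product.unique())),
                        columns=['Week', 'Product'])
           .merge(dfp, how='left'))

new_dfp['Sold'] = new_dfp['Sold'].fillna(0).astype(int)                     # nothing sold in missing weeks
new_dfp['Stock'] = new_dfp.groupby('Product')['Stock'].ffill().astype(int)  # carry last stock forward
print(new_dfp)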

Conditional sum from rows into a new column in pandas

I am looking to create a new column in pandas based on the value in the row. My sample data:
df = pd.DataFrame({"A": ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                   "Sales": [2, 3, 7, 1, 4, 3, 5, 6, 9, 10],
                   "Week": [1, 2, 3, 4, 5, 11, 1, 2, 3, 4]})
I want a new column "Last3WeekSales" corresponding to each week, having the sum of sales for the previous 3 weeks.
NOTE: Shift() won't work here as data for some weeks is missing.
Logic which I thought:
Checking the week no. in each row, then summing up the data from w-1, w-2, w-3.
Output required:
A Week Last3WeekSales
0 a 1 0
1 a 2 2
2 a 3 5
3 a 4 12
4 a 5 11
5 a 11 0
6 b 1 0
7 b 2 5
8 b 3 11
9 b 4 20
Use groupby, shift and rolling:
df['Last3WeekSales'] = df.groupby('A')['Sales'] \
                         .apply(lambda x: x.shift(1)
                                           .rolling(3, min_periods=1)
                                           .sum()) \
                         .fillna(0)
Output:
A Sales Week Last3WeekSales
0 a 2 1 0.0
1 a 3 2 2.0
2 a 7 3 5.0
3 a 1 4 12.0
4 a 4 5 11.0
5 a 3 6 12.0
6 b 5 1 0.0
7 b 6 2 5.0
8 b 9 3 11.0
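A self-contained version of the above on the question's sample data; group_keys=False is added here so the result stays aligned with the original index and can be assigned straight back. Note that, as the question points out, this rolls over the previous 3 rows rather than the previous 3 calendar weeks:
import pandas as pd

df = pd.DataFrame({"A": ['a'] * 6 + ['b'] * 4,
                   "Sales": [2, 3, 7, 1, 4, 3, 5, 6, 9, 10],
                   "Week": [1, 2, 3, 4, 5, 11, 1, 2, 3, 4]})

# within each group, sum the 3 rows before the current one
df['Last3WeekSales'] = (df.groupby('A', group_keys=False)['Sales']
                          .apply(lambda x: x.shift(1).rolling(3, min_periods=1).sum())
                          .fillna(0))
print(df)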
You can use a rolling window of 3 to sum over the last 3 values, and shift(n) to shift your column by n rows (1 in your case).
If we suppose you have a column 'sales' with the sales of each week, the code would be:
df["Last3WeekSales"] = df.groupby("A")["sales"].apply(lambda x: x.shift(1).rolling(3).sum())

How do I aggregate rows with an upper bound on column value?

I have a pd.DataFrame I'd like to transform:
id values days time value_per_day
0 1 15 15 1 1
1 1 20 5 2 4
2 1 12 12 3 1
I'd like to aggregate these into equal buckets of 10 days. Since days at time 1 is larger than 10, the excess should spill over into the next row, so that the value/day of the 2nd bucket is an average of the 1st and 2nd rows.
Here is the resulting output, where (values, 0) = 15*(10/15) = 10 and (values, 1) = 5 + 20 = 25 (the 5 left over from the first row plus the second row):
id values days value_per_day
0 1 10 10 1.0
1 1 25 10 2.5
2 1 10 10 1.0
3 1 2 2 1.0
I've tried pd.Grouper:
df.set_index('days').groupby([pd.Grouper(freq='10D', label='right'), 'id']).agg({'values': 'mean'})
Out[146]:
values
days id
5 days 1 16
15 days 1 10
But I'm clearly using it incorrectly.
csv for convenience:
id,values,days,time
1,15,15,1
1,20,5,2
1,12,12,3
Note: this is a computationally expensive solution.
import numpy as np

newdf = df.reindex(df.index.repeat(df.days))   # one row per day
v = np.arange(sum(df.days)) // 10              # 10-day bucket label for each day
dd = pd.DataFrame({'value_per_day': newdf.groupby(v).value_per_day.mean(),
                   'days': np.bincount(v)})
dd
Out[102]:
days value_per_day
0 10 1.0
1 10 2.5
2 10 1.0
3 2 1.0
dd.assign(value=dd.days*dd.value_per_day)
Out[103]:
days value_per_day value
0 10 1.0 10.0
1 10 2.5 25.0
2 10 1.0 10.0
3 2 1.0 2.0
I did not include a groupby on id here; if you need that for your real data, you can loop over df.groupby('id') and apply the above steps within the loop.
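As a sketch of that per-id loop (the frame below just reproduces the question's first table, with value_per_day = values / days; variable names are illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1],
                   'values': [15, 20, 12],
                   'days': [15, 5, 12],
                   'time': [1, 2, 3],
                   'value_per_day': [1, 4, 1]})

pieces = []
for key, grp in df.groupby('id'):
    expanded = grp.reindex(grp.index.repeat(grp.days))   # one row per day
    bucket = np.arange(grp.days.sum()) // 10              # 10-day bucket label for each day
    piece = pd.DataFrame({'value_per_day': expanded.groupby(bucket).value_per_day.mean(),
                          'days': np.bincount(bucket)})
    piece['id'] = key
    piece['values'] = piece.days * piece.value_per_day
    pieces.append(piece)

print(pd.concat(pieces, ignore_index=True))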

How to group numeric values by hours?

I have the following data:
df =
MONTH DAY HOUR DURATION
1 1 7 20
1 1 7 21
1 2 7 20
1 2 8 22
2 1 7 19
2 1 8 25
2 1 8 29
2 2 8 27
I want to get the mean DURATION grouped by HOUR and averaged over MONTH and DAY. In other words, I want to know what is the average DURATION per HOUR.
This is my current code. If I delete 'MONTH','DAY' from df.groupby(['MONTH','DAY','HOUR','DURATION']), then I get higher values of DURATION, which are not correct. Therefore I decided to keep 'MONTH','DAY'.
grouped = df.groupby(['MONTH','DAY','HOUR','DURATION']).size() \
.groupby(level=['HOUR','DURATION']).mean().reset_index()
grouped
However, it still gives me incorrect output. Here is an example on some random data (hour 8 is repeated several times, and an extra column 0 appears):
HOUR DURATION 0
0 7 122.0 1.0
1 8 77.0 1.0
2 8 82.0 1.0
3 8 83.0 1.0
Have you tried:
df.groupby("HOUR").agg({'DURATION_1' : 'mean', 'DURATION_2' : 'mean'})
