How to sum by grouping on a specific column using Python?

I'm not able to sum by each group. The idea is to create a new column in this data set that holds the sum of ForecastSUM per "store":
PNO store ForecastSUM
17 20054706 WITZ 0.0
8 8007536 WITZ 0.0
2 8007205 WITZ 0.0
12 8601965 WITZ 0.0
5 8007239 WITZ 0.0
14 20054706 ROT 1.0
1 8007205 ROT 7.0
9 8601965 ROT 2.0
6 8007536 ROT 3.0
3 8007239 ROT 2.0
15 20054706 MAR 1.0
7 8007536 MAEG 6.0
10 8601965 MAEG 4.0
4 8007239 MAEG 3.0
0 8007205 MAEG 6.0
13 20054706 BUD 1.0
11 8601965 AYC 0.0
16 20054706 AYC 0.0
I am trying to apply this code:
copiedDataWHSE['sumWHSE'] = copiedDataWHSE.groupby(['ForecastSUM']).agg({'ForecastSUM': "sum"})
and the result I am getting is:
PNO store ForecastSUM sumWHSE
17 20054706 WITZ 0.0 NaN
8 8007536 WITZ 0.0 NaN
2 8007205 WITZ 0.0 4.0
12 8601965 WITZ 0.0 NaN
5 8007239 WITZ 0.0 NaN
14 20054706 ROT 1.0 NaN
1 8007205 ROT 7.0 3.0
9 8601965 ROT 2.0 NaN
6 8007536 ROT 3.0 12.0
3 8007239 ROT 2.0 6.0
15 20054706 MAR 1.0 NaN
7 8007536 MAEG 6.0 7.0
10 8601965 MAEG 4.0 NaN
4 8007239 MAEG 3.0 4.0
0 8007205 MAEG 6.0 0.0
13 20054706 BUD 1.0 NaN
11 8601965 AYC 0.0 NaN
16 20054706 AYC 0.0 NaN
This is wrong: every row of a store should receive that store's total. For example, wherever the store is ROT, the sumWHSE column should contain 15.0 (1 + 7 + 2 + 3 + 2).

As @sammywemmy mentions, you need to group on store, not on ForecastSUM:
store_groupby = df.groupby(['store']).agg({'ForecastSUM': "sum"})
However, since the result has only 6 rows (one per store), you can't assign it directly back to the 18-row dataframe as a new column.
What I would do is turn the grouped sums into a dictionary keyed by store, then assign() it to a new column with a lambda function:
store_groupby_dict = store_groupby['ForecastSUM'].to_dict()
df = df.assign(store_total=lambda x: x['store'].map(store_groupby_dict))
Doing the same thing with apply() makes it a little more readable:
df['store_total'] = df['store'].apply(lambda x: store_groupby_dict[x])
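The dictionary step can probably be skipped altogether: groupby(...).transform("sum") returns a result aligned with the original rows, so it can be assigned straight to a new column. A minimal sketch, assuming a small made-up subset of the question's data (the rows below are illustrative, not the full table):

```python
import pandas as pd

# Illustrative subset of the question's data (not the full table).
df = pd.DataFrame({
    "PNO": [20054706, 8007205, 8601965, 8007536, 8007205],
    "store": ["ROT", "ROT", "MAEG", "MAEG", "BUD"],
    "ForecastSUM": [1.0, 7.0, 4.0, 6.0, 1.0],
})

# transform("sum") broadcasts each store's total back onto its own rows,
# so the result has the same length and index as df.
df["sumWHSE"] = df.groupby("store")["ForecastSUM"].transform("sum")
print(df)
```

Every row of a store then carries that store's total, which is the shape the question asks for.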

Related

Convert column vector into multi-column matrix

I have a column vector with, say, 30 values (1-30). I would like to manipulate this vector so that it becomes a matrix with 5 values in the first column, 10 values in the second, and 15 values in the third. How would I implement this using pandas or NumPy?
import pandas as pd
import numpy as np
# Create data
df = pd.DataFrame(np.linspace(1, 30, 30))
print(df)
1
2
:
28
29
30
In order to get something like this:
# Manipulate the column vector to make columns where the first column has 5
# the second column has 10 and the last column has 15 values
'T1' 'T2' 'T3'
1 6 16
2 7 17
3 8 18
4 9 19
5 10 20
NA 11 21
NA 12 22
NA 13 23
NA 14 24
NA 15 25
NA NA 26
NA NA 27
NA NA 28
NA NA 29
NA NA 30

It took a little while to work out which series this is: it is the triangular series, just slightly modified.
tri = lambda x:int((0.25+2*x)**0.5-0.5)
This would give results like:
0 1 1 2 2 2 3 3 3 3 4 4 4 4 4 5 5 5 5 5 5 ...
And after the modification:
modtri = lambda x:int((0.25+2*(x//5))**0.5-0.5)
0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 ...
So each value of the normal triangular series is repeated 5 times.
The modtri function above maps a 0-based index directly to the appropriate group id,
so after that, this does the job:
df[0].groupby(modtri).apply(lambda x: pd.Series(x.values)).unstack().T
Full execution:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.linspace(1,30,30))
N = 5 #the increment value
modtri = lambda x:int((0.25+2*(x//N))**0.5-0.5)
df2 = df[0].groupby(modtri).apply(lambda x: pd.Series(x.values)).unstack().T
df2.rename(columns={0: "T1", 1: "T2",2:"T3"},inplace=True)
print(df2)
Output:
T1 T2 T3
0 1.0 6.0 16.0
1 2.0 7.0 17.0
2 3.0 8.0 18.0
3 4.0 9.0 19.0
4 5.0 10.0 20.0
5 NaN 11.0 21.0
6 NaN 12.0 22.0
7 NaN 13.0 23.0
8 NaN 14.0 24.0
9 NaN 15.0 25.0
10 NaN NaN 26.0
11 NaN NaN 27.0
12 NaN NaN 28.0
13 NaN NaN 29.0
14 NaN NaN 30.0
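If the column sizes are known up front, the square-root formula is not strictly needed. A minimal sketch of an alternative, assuming the sizes 5, 10 and 15 from the question are given explicitly (the names sizes and chunks are mine): split the values at the cumulative boundaries and let pd.concat pad the shorter columns with NaN.

```python
import numpy as np
import pandas as pd

values = np.arange(1, 31)   # the 30 values from the question
sizes = [5, 10, 15]         # assumed: desired number of values per column

# Split at the cumulative boundaries (after 5 and after 15 values).
chunks = np.split(values, np.cumsum(sizes)[:-1])

# Each chunk becomes a Series with its own 0-based index; concatenating
# along axis=1 aligns on that index and fills the gaps with NaN.
result = pd.concat(
    [pd.Series(chunk, name=f"T{i + 1}") for i, chunk in enumerate(chunks)],
    axis=1,
)
print(result)
```

This avoids floating-point arithmetic on the index and works for arbitrary column sizes, at the cost of having to list the sizes by hand.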

Try slicing the column and resetting the index of each slice:
df['T1'] = df[0][0:5]
df['T2'] = df[0][5:15].reset_index(drop=True)
df['T3'] = df[0][15:].reset_index(drop=True)
Original data before operation:
df = pd.DataFrame(np.linspace(1,30,30))
print(df)
0
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
7 8.0
8 9.0
9 10.0
10 11.0
11 12.0
12 13.0
13 14.0
14 15.0
15 16.0
16 17.0
17 18.0
18 19.0
19 20.0
20 21.0
21 22.0
22 23.0
23 24.0
24 25.0
25 26.0
26 27.0
27 28.0
28 29.0
29 30.0
Running the new code:
df['T1'] = df[0][0:5]
df['T2'] = df[0][5:15].reset_index(drop=True)
df['T3'] = df[0][15:].reset_index(drop=True)
print(df)
0 T1 T2 T3
0 1.0 1.0 6.0 16.0
1 2.0 2.0 7.0 17.0
2 3.0 3.0 8.0 18.0
3 4.0 4.0 9.0 19.0
4 5.0 5.0 10.0 20.0
5 6.0 NaN 11.0 21.0
6 7.0 NaN 12.0 22.0
7 8.0 NaN 13.0 23.0
8 9.0 NaN 14.0 24.0
9 10.0 NaN 15.0 25.0
10 11.0 NaN NaN 26.0
11 12.0 NaN NaN 27.0
12 13.0 NaN NaN 28.0
13 14.0 NaN NaN 29.0
14 15.0 NaN NaN 30.0
15 16.0 NaN NaN NaN
16 17.0 NaN NaN NaN
17 18.0 NaN NaN NaN
18 19.0 NaN NaN NaN
19 20.0 NaN NaN NaN
20 21.0 NaN NaN NaN
21 22.0 NaN NaN NaN
22 23.0 NaN NaN NaN
23 24.0 NaN NaN NaN
24 25.0 NaN NaN NaN
25 26.0 NaN NaN NaN
26 27.0 NaN NaN NaN
27 28.0 NaN NaN NaN
28 29.0 NaN NaN NaN
29 30.0 NaN NaN NaN

Subtraction between columns of two dataframes

I have two datasets: total product data and selling data. I need to find the remaining quantity of each product by comparing the product data with the selling data. I have done some general preprocessing and both dataframes are ready to use, but I can't work out how to compare them.
DataFrame 1:
Item Qty
0 BUDS2 1.0
1 C100 4.0
2 CK1 5.0
3 DM10 10.0
4 DM7 2.0
5 DM9 9.0
6 HM12 6.0
7 HM13 4.0
8 HOCOX25(CTYPE) 1.0
9 HOCOX30USB 1.0
10 RM510 8.0
11 RM512 8.0
12 RM569 1.0
13 RM711 2.0
14 T2C 1.0
and
DataFrame 2 :
Item Name Quantity
0 BUDS2 2.0
1 C100 5.0
2 C101CABLE 1.0
3 CK1 8.0
4 DM10 12.0
5 DM7 5.0
6 DM9 10.0
7 G20CTYPE 1.0
8 G20NORMAL 1.0
9 HM12 9.0
10 HM13 8.0
11 HM9 3.0
12 HOCOX25CTYPE 3.0
13 HOCOX30USB 3.0
14 M45 1.0
15 REMAXRC080M 2.0
16 RM510 11.0
17 RM512 10.0
18 RM569 2.0
19 RM711 3.0
20 T2C 1.0
21 Y1 3.0
22 ZIRCON 1.0
I want to see the available quantity for each item. The output should look like DataFrame 2, but with the Quantity column holding the values after the subtraction. How can I do that?
Expected Output:
Item Name Quantity
0 BUDS2 1.0
1 C100 1.0
2 C101CABLE 1.0
3 CK1 3.0
4 DM10 2.0
5 DM7 3.0
6 DM9 1.0
7 G20CTYPE 1.0
8 G20NORMAL 1.0
9 HM12 3.0
10 HM13 4.0
11 HM9 3.0
12 HOCOX25CTYPE 2.0
13 HOCOX30USB 2.0
14 M45 1.0
15 REMAXRC080M 2.0
16 RM510 3.0
17 RM512 2.0
18 RM569 1.0
19 RM711 1.0
20 T2C 0.0
21 Y1 3.0
22 ZIRCON 1.0

You can do this by merging the two dataframes:
df_new = df_2.merge(df_1, how='left', left_on='Item Name', right_on='Item').fillna(0)
df_new.Quantity = df_new.Quantity - df_new.Qty
df_new = df_new.drop(['Item', 'Qty'], axis=1)
df_new output:
Item Name Quantity
0 BUDS2 1.0
1 C100 1.0
2 C101CABLE 1.0
3 CK1 3.0
4 DM10 2.0
5 DM7 3.0
6 DM9 1.0
7 G20CTYPE 1.0
8 G20NORMAL 1.0
9 HM12 3.0
10 HM13 4.0
11 HM9 3.0
12 HOCOX25CTYPE 3.0
13 HOCOX30USB 2.0
14 M45 1.0
15 REMAXRC080M 2.0
16 RM510 3.0
17 RM512 2.0
18 RM569 1.0
19 RM711 1.0
20 T2C 0.0
21 Y1 3.0
22 ZIRCON 1.0
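Note that HOCOX25CTYPE stays at 3.0 in the output above while the expected output shows 2.0: DataFrame 1 spells the item HOCOX25(CTYPE), so the merge finds no match for it. A minimal sketch of one way to handle this, assuming it is acceptable to compare names after stripping every non-alphanumeric character (the normalize helper and the small sample frames are hypothetical):

```python
import pandas as pd

# Small illustrative samples shaped like the question's two dataframes.
sold = pd.DataFrame({"Item": ["BUDS2", "HOCOX25(CTYPE)", "T2C"],
                     "Qty": [1.0, 1.0, 1.0]})
stock = pd.DataFrame({"Item Name": ["BUDS2", "HOCOX25CTYPE", "T2C", "Y1"],
                      "Quantity": [2.0, 3.0, 1.0, 3.0]})

def normalize(names: pd.Series) -> pd.Series:
    """Hypothetical helper: drop punctuation and upper-case the names."""
    return names.str.replace(r"[^0-9A-Za-z]", "", regex=True).str.upper()

# Total sold quantity per normalized item name.
sold_by_key = sold.assign(key=normalize(sold["Item"])).groupby("key")["Qty"].sum()

# Look up the sold quantity for each stock item; unsold items count as 0.
remaining = stock.copy()
remaining["Quantity"] = (
    stock["Quantity"] - normalize(stock["Item Name"]).map(sold_by_key).fillna(0)
)
print(remaining)
```

Whether stripping punctuation is safe depends on the real item catalogue; two distinct items could in principle collapse to the same key.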

How to add the sum of some column at the end of dataframe

I have a pandas dataframe with 11 columns. I want to add the sum of all values of columns 9 and column 10 to the end of table. So far I tried 2 methods:
Assigning the data to the cell with dataframe.iloc[rownumber, 8]. This results in an out of bound error.
Creating a vector with some blank: ' ' by using the following code:
total = ['', '', '', '', '', '', '', '', dataframe['Column 9'].sum(), dataframe['Column 10'].sum(), '']
dataframe = dataframe.append(total)
The result was not nice as it added the total vector as a vertical vector at the end rather than a horizontal one. What can I do to solve the issue?

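You need to use pandas.DataFrame.append with ignore_index=True. (DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, where pandas.concat is the replacement.)
So use: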
dataframe=dataframe.append(dataframe[['Column 9','Column 10']].sum(),ignore_index=True).fillna('')
Example:
import pandas as pd
import numpy as np
df=pd.DataFrame()
df['col1']=[1,2,3,4]
df['col2']=[2,3,4,5]
df['col3']=[5,6,7,8]
df['col4']=[5,6,7,8]
Using Append:
df=df.append(df[['col2','col3']].sum(),ignore_index=True)
print(df)
col1 col2 col3 col4
0 1.0 2.0 5.0 5.0
1 2.0 3.0 6.0 6.0
2 3.0 4.0 7.0 7.0
3 4.0 5.0 8.0 8.0
4 NaN 14.0 26.0 NaN
Without NaN values:
df=df.append(df[['col2','col3']].sum(),ignore_index=True).fillna('')
print(df)
col1 col2 col3 col4
0 1 2.0 5.0 5
1 2 3.0 6.0 6
2 3 4.0 7.0 7
3 4 5.0 8.0 8
4 14.0 26.0
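On pandas 2.0 and later, where DataFrame.append no longer exists, the same total row can be built with pd.concat. A minimal sketch reusing the col1..col4 example (the names totals and result are mine):

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": [2, 3, 4, 5],
    "col3": [5, 6, 7, 8],
    "col4": [5, 6, 7, 8],
})

# sum() returns a Series indexed by column name; to_frame().T turns it
# into a one-row DataFrame so that it can be concatenated as a row.
totals = df[["col2", "col3"]].sum().to_frame().T

# Columns missing from the totals row (col1, col4) are filled with NaN.
result = pd.concat([df, totals], ignore_index=True)
print(result)
```

Calling .fillna('') on the result blanks out the NaN cells as in the answer above, but it also turns those columns into object dtype, so it is probably best left for display only.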

Create a new DataFrame with the sums. This example DataFrame has columns 'a' and 'b'. df1 is the DataFrame that needs to be summed up, and df3 is a one-row DataFrame containing only the sums:
data = [[df1.a.sum(),df1.b.sum()]]
df3 = pd.DataFrame(data,columns=['a','b'])
Then append it to end:
df1.append(df3)

Simply try this (replace test with your dataframe name).
Row-wise sum (which is what you asked for):
test['Total'] = test[['col9','col10']].sum(axis=1)
print(test)
Column-wise sum:
test.loc['Total'] = test[['col9','col10']].sum()
test.fillna('',inplace=True)
print(test)

IIUC, this is what you need (change the numbers 8 and 9 to suit your needs):
df['total'] = df.iloc[:, [8, 9]].sum(axis=1)  # horizontal sum, one value per row
df['total1'] = df.iloc[:, [8, 9]].sum().sum()  # grand total of both columns, repeated in every row
df.loc['total2'] = df.iloc[:, [8, 9]].sum()  # vertical sum as a new row, for columns 8 and 9 only
Example:
import numpy as np
import pandas as pd
a = np.arange(0, 11, 1)
b = np.random.randint(10, size=(5, 11))
df = pd.DataFrame(columns=a, data=b)
0 1 2 3 4 5 6 7 8 9 10
0 0 5 1 3 4 8 6 6 8 1 0
1 9 9 8 9 9 2 3 8 9 3 6
2 5 7 9 0 8 7 8 8 7 1 8
3 0 7 2 8 8 3 3 0 4 8 2
4 9 9 2 5 2 2 5 0 3 4 1
Output:
0 1 2 3 4 5 6 7 8 9 10 total total1
0 0.0 5.0 1.0 3.0 4.0 8.0 6.0 6.0 8.0 1.0 0.0 9.0 48.0
1 9.0 9.0 8.0 9.0 9.0 2.0 3.0 8.0 9.0 3.0 6.0 12.0 48.0
2 5.0 7.0 9.0 0.0 8.0 7.0 8.0 8.0 7.0 1.0 8.0 8.0 48.0
3 0.0 7.0 2.0 8.0 8.0 3.0 3.0 0.0 4.0 8.0 2.0 12.0 48.0
4 9.0 9.0 2.0 5.0 2.0 2.0 5.0 0.0 3.0 4.0 1.0 7.0 48.0
total2 NaN NaN NaN NaN NaN NaN NaN NaN 31.0 17.0 NaN NaN NaN

Sorting by two columns in pandas series

Putting a slight variation on a question I previously asked. I managed to get a solution to sorting values by a particular column in my pandas series. However, the problem is that sorting purely by time doesn't allow me to factor in different dates in which the time occurred. I understand that I could potentially hard code the order and use .loc to apply the order but wanted to find out if there was a simpler solution to sort primarily by week (earliest week first) and by time (0-23hours for each week).
Here is a sample of the dataframe I have again:
weeknum time_hour
16-22Jun 0.0 5
2-8Jun 0.0 3
23-29Jun 0.0 11
9-15Jun 0.0 3
16-22Jun 1.0 3
2-8Jun 1.0 6
23-29Jun 1.0 3
9-15Jun 1.0 8
16-22Jun 2.0 3
2-8Jun 2.0 6
23-29Jun 2.0 3
16-22Jun 3.0 4
2-8Jun 3.0 2
23-29Jun 3.0 3
9-15Jun 3.0 4
16-22Jun 4.0 2
2-8Jun 4.0 7
23-29Jun 4.0 1
9-15Jun 4.0 7
16-22Jun 5.0 2
2-8Jun 5.0 9
23-29Jun 5.0 9
9-15Jun 5.0 12
16-22Jun 6.0 5
2-8Jun 6.0 12
23-29Jun 6.0 6
9-15Jun 6.0 14
16-22Jun 7.0 12
2-8Jun 7.0 17
23-29Jun 7.0 19
This is my code:
merged_clean.groupby('weeknum')['time_hour'].value_counts().sort_index(level=['time_hour'])

Use sorted() with a multi-part key to order the MultiIndex: convert the number before the "-" in weeknum to an integer, then sort by time_hour. To apply the new order, use Series.reindex:
s = merged_clean.groupby('weeknum')['time_hour'].value_counts()
idx = sorted(s.index, key = lambda x: (int(x[0].split('-')[0]), x[1]))
s = s.reindex(idx)
print (s)
weeknum time_hour
2-8Jun 0.0 3
1.0 6
2.0 6
3.0 2
4.0 7
5.0 9
6.0 12
7.0 17
9-15Jun 0.0 3
1.0 8
3.0 4
4.0 7
5.0 12
6.0 14
16-22Jun 0.0 5
1.0 3
2.0 3
3.0 4
4.0 2
5.0 2
6.0 5
7.0 12
23-29Jun 0.0 11
1.0 3
2.0 3
3.0 3
4.0 1
5.0 9
6.0 6
7.0 19
Name: a, dtype: int64
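Sorting on the leading day number works here because all weeks fall in June; once the labels span several months the day alone is no longer enough. A minimal sketch of a key that also uses the month, assuming labels always look like "2-8Jun" with an English three-letter month at the end and that all weeks belong to the same year (the week_start helper and the sample counts are hypothetical):

```python
from datetime import datetime

import pandas as pd

# Small illustrative sample shaped like the value_counts() result.
counts = pd.Series(
    [5, 3, 11, 3, 3, 6],
    index=pd.MultiIndex.from_tuples(
        [("16-22Jun", 0.0), ("2-8Jun", 0.0), ("23-29Jun", 0.0),
         ("9-15Jun", 0.0), ("16-22Jun", 1.0), ("2-8Jun", 1.0)],
        names=["weeknum", "time_hour"],
    ),
)

def week_start(label: str) -> datetime:
    """Hypothetical helper: parse '2-8Jun' into the date of its first day."""
    day = label.split("-")[0]
    month = label[-3:]
    # No year in the label, so strptime falls back to 1900; fine for ordering.
    return datetime.strptime(day + month, "%d%b")

# Sort by week start first, then by hour within the week.
order = sorted(counts.index, key=lambda key: (week_start(key[0]), key[1]))
sorted_counts = counts.reindex(
    pd.MultiIndex.from_tuples(order, names=counts.index.names)
)
print(sorted_counts)
```

The %b directive depends on the current locale, so this assumes English month abbreviations.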

Easy pythonic way to classify columns in groups and store it in Dictionary?

Machine_number Machine_Running_Hours
0 1.0 424.0
1 2.0 458.0
2 3.0 465.0
3 4.0 446.0
4 5.0 466.0
5 6.0 466.0
6 7.0 445.0
7 8.0 466.0
8 9.0 447.0
9 10.0 469.0
10 11.0 467.0
11 12.0 449.0
12 13.0 436.0
13 14.0 465.0
14 15.0 463.0
15 16.0 372.0
16 17.0 460.0
17 18.0 450.0
18 19.0 467.0
19 20.0 463.0
20 21.0 205.0
I am trying to classify the rows according to machine number: Machine_number 1 to 5 should form one group, 6 to 10 the next group, and so on.

I think you need to subtract 1 with sub and then use floordiv:
df['g'] = df.Machine_number.sub(1).floordiv(5)
#same as //
#df['g'] = df.Machine_number.sub(1) // 5
print (df)
Machine_number Machine_Running_Hours g
0 1.0 424.0 -0.0
1 2.0 458.0 0.0
2 3.0 465.0 0.0
3 4.0 446.0 0.0
4 5.0 466.0 0.0
5 6.0 466.0 1.0
6 7.0 445.0 1.0
7 8.0 466.0 1.0
8 9.0 447.0 1.0
9 10.0 469.0 1.0
10 11.0 467.0 2.0
11 12.0 449.0 2.0
12 13.0 436.0 2.0
13 14.0 465.0 2.0
14 15.0 463.0 2.0
15 16.0 372.0 3.0
16 17.0 460.0 3.0
17 18.0 450.0 3.0
18 19.0 467.0 3.0
19 20.0 463.0 3.0
20 21.0 205.0 4.0
If you need to store the groups in a dictionary, use groupby with a dict comprehension:
dfs = {i:g for i, g in df.groupby(df.Machine_number.astype(int).sub(1).floordiv(5))}
print (dfs)
{0: Machine_number Machine_Running_Hours
0 1.0 424.0
1 2.0 458.0
2 3.0 465.0
3 4.0 446.0
4 5.0 466.0, 1: Machine_number Machine_Running_Hours
5 6.0 466.0
6 7.0 445.0
7 8.0 466.0
8 9.0 447.0
9 10.0 469.0, 2: Machine_number Machine_Running_Hours
10 11.0 467.0
11 12.0 449.0
12 13.0 436.0
13 14.0 465.0
14 15.0 463.0, 3: Machine_number Machine_Running_Hours
15 16.0 372.0
16 17.0 460.0
17 18.0 450.0
18 19.0 467.0
19 20.0 463.0, 4: Machine_number Machine_Running_Hours
20 21.0 205.0}
print (dfs[0])
Machine_number Machine_Running_Hours
0 1.0 424.0
1 2.0 458.0
2 3.0 465.0
3 4.0 446.0
4 5.0 466.0
print (dfs[1])
Machine_number Machine_Running_Hours
5 6.0 466.0
6 7.0 445.0
7 8.0 466.0
8 9.0 447.0
9 10.0 469.0
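If readable keys are preferred over 0, 1, 2, ..., the same floor-division idea can produce range labels. A minimal sketch, assuming the first 12 machines from the question and a group size of 5 (group_size, labels and groups are my own names):

```python
import pandas as pd

df = pd.DataFrame({
    "Machine_number": [float(n) for n in range(1, 13)],
    "Machine_Running_Hours": [424.0, 458.0, 465.0, 446.0, 466.0, 466.0,
                              445.0, 466.0, 447.0, 469.0, 467.0, 449.0],
})

group_size = 5  # assumed: machines per group

# Same arithmetic as above: machines 1-5 -> 0, 6-10 -> 1, 11-15 -> 2, ...
group_id = (df["Machine_number"].astype(int) - 1) // group_size

# Turn each group id into a label such as "1-5" or "6-10".
labels = group_id.map(
    lambda g: f"{g * group_size + 1}-{(g + 1) * group_size}"
).rename("machine_range")

# sort=False keeps the groups in order of first appearance instead of
# sorting the string labels alphabetically.
groups = {label: frame for label, frame in df.groupby(labels, sort=False)}
print(list(groups))
```

Each value in groups is the sub-DataFrame for that range, for example groups["6-10"].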
