I have two datasets: total product data and selling data. I need to find the remaining products by comparing the product data with the selling data. I have done some general preprocessing and both DataFrames are ready to use, but I can't figure out how to compare them.
DataFrame 1:
Item Qty
0 BUDS2 1.0
1 C100 4.0
2 CK1 5.0
3 DM10 10.0
4 DM7 2.0
5 DM9 9.0
6 HM12 6.0
7 HM13 4.0
8 HOCOX25(CTYPE) 1.0
9 HOCOX30USB 1.0
10 RM510 8.0
11 RM512 8.0
12 RM569 1.0
13 RM711 2.0
14 T2C 1.0
and
DataFrame 2:
Item Name Quantity
0 BUDS2 2.0
1 C100 5.0
2 C101CABLE 1.0
3 CK1 8.0
4 DM10 12.0
5 DM7 5.0
6 DM9 10.0
7 G20CTYPE 1.0
8 G20NORMAL 1.0
9 HM12 9.0
10 HM13 8.0
11 HM9 3.0
12 HOCOX25CTYPE 3.0
13 HOCOX30USB 3.0
14 M45 1.0
15 REMAXRC080M 2.0
16 RM510 11.0
17 RM512 10.0
18 RM569 2.0
19 RM711 3.0
20 T2C 1.0
21 Y1 3.0
22 ZIRCON 1.0
I want to see the available quantity for each item: an output shaped like DataFrame 2, but with the Quantity column values updated by the subtraction. How can I do that?
Expected Output:
Item Name Quantity
0 BUDS2 1.0
1 C100 1.0
2 C101CABLE 1.0
3 CK1 3.0
4 DM10 2.0
5 DM7 3.0
6 DM9 1.0
7 G20CTYPE 1.0
8 G20NORMAL 1.0
9 HM12 3.0
10 HM13 4.0
11 HM9 3.0
12 HOCOX25CTYPE 2.0
13 HOCOX30USB 2.0
14 M45 1.0
15 REMAXRC080M 2.0
16 RM510 3.0
17 RM512 2.0
18 RM569 1.0
19 RM711 1.0
20 T2C 0.0
21 Y1 3.0
22 ZIRCON 1.0
You can do this by merging the two DataFrames:
df_new = df_2.merge(df_1, how='left', left_on='Item Name', right_on='Item').fillna(0)
df_new['Quantity'] = df_new['Quantity'] - df_new['Qty']
df_new = df_new.drop(['Item', 'Qty'], axis=1)
Note that HOCOX25(CTYPE) in DataFrame 1 does not match HOCOX25CTYPE in DataFrame 2, so its quantity is never subtracted; that is why row 12 below shows 3.0 instead of the expected 2.0. Normalize the item names before merging if they should match.
df_new output:
Item Name Quantity
0 BUDS2 1.0
1 C100 1.0
2 C101CABLE 1.0
3 CK1 3.0
4 DM10 2.0
5 DM7 3.0
6 DM9 1.0
7 G20CTYPE 1.0
8 G20NORMAL 1.0
9 HM12 3.0
10 HM13 4.0
11 HM9 3.0
12 HOCOX25CTYPE 3.0
13 HOCOX30USB 2.0
14 M45 1.0
15 REMAXRC080M 2.0
16 RM510 3.0
17 RM512 2.0
18 RM569 1.0
19 RM711 1.0
20 T2C 0.0
21 Y1 3.0
22 ZIRCON 1.0
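For reference, here is a minimal, self-contained sketch of the same approach. The df_1/df_2 construction below is illustrative only, rebuilt from the first few items in the question:
import pandas as pd

# Selling data (DataFrame 1) and total product data (DataFrame 2),
# shortened to a handful of items for illustration
df_1 = pd.DataFrame({'Item': ['BUDS2', 'C100', 'CK1'],
                     'Qty': [1.0, 4.0, 5.0]})
df_2 = pd.DataFrame({'Item Name': ['BUDS2', 'C100', 'C101CABLE', 'CK1'],
                     'Quantity': [2.0, 5.0, 1.0, 8.0]})

# Left merge keeps every item of df_2; items never sold get Qty = 0
df_new = df_2.merge(df_1, how='left', left_on='Item Name', right_on='Item').fillna(0)
df_new['Quantity'] = df_new['Quantity'] - df_new['Qty']
df_new = df_new.drop(['Item', 'Qty'], axis=1)
print(df_new)  # BUDS2 1.0, C100 1.0, C101CABLE 1.0, CK1 3.0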
I want to know how I can replace each NaN in my dataset with the average of the 5 preceding values.
Column A  Column B
1         2
2         5
3         5
4         2
5         2
NaN       2
NaN       2
1         2
1         2
1         2
1         NaN
1         2
1         2
For example, in this case the first NaN will be the average of (1, 2, 3, 4, 5) and the second NaN will be the average of (2, 3, 4, 5, and the value filled in for the first NaN).
I have tried
df.fillna(df.mean())
As mentioned, this has been answered before, but an updated version for recent pandas is as follows:
import numpy as np
import pandas as pd

data = {'col1': [1, 2, 3, 4, 5, np.nan, np.nan, 1, 1, 1, 1, 1, 1],
        'col2': [2, 5, 5, 2, 2, 2, 2, 2, 2, 2, np.nan, 2, 2]}
df = pd.DataFrame(data)

window_size = 5
# mean over the current value and the previous `window_size` values;
# min_periods=1 keeps early rows from producing NaN
df = df.fillna(df.rolling(window_size + 1, min_periods=1).mean())
outputs:
col1 col2
0 1.0 2.0
1 2.0 5.0
2 3.0 5.0
3 4.0 2.0
4 5.0 2.0
5 3.0 2.0
6 3.5 2.0
7 1.0 2.0
8 1.0 2.0
9 1.0 2.0
10 1.0 2.0
11 1.0 2.0
12 1.0 2.0
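One caveat: rolling() averages only the original values, so the second NaN in col1 becomes 3.5 rather than the 3.4 you would get by averaging in the freshly filled 3.0. If each fill must feed into the next, as the question asks, a plain-loop sketch (assuming the same df as above):
import numpy as np
import pandas as pd

def fill_with_running_mean(s, window=5):
    # Fill each NaN with the mean of up to `window` preceding values,
    # reusing fills made earlier in the same pass
    out = s.copy()
    for i in np.flatnonzero(out.isna().to_numpy()):
        prev = out.iloc[max(0, i - window):i].dropna()
        if len(prev):
            out.iloc[i] = prev.mean()
    return out

df = df.apply(fill_with_running_mean)  # second NaN in col1 becomes 3.4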
This is the base DataFrame:
g_accessor number_opened number_closed
0 49 - 20 3.0 1.0
1 50 - 20 2.0 14.0
2 51 - 20 1.0 6.0
3 52 - 20 0.0 6.0
4 1 - 21 1.0 4.0
5 2 - 21 3.0 5.0
6 3 - 21 4.0 11.0
7 4 - 21 2.0 7.0
8 5 - 21 6.0 10.0
9 6 - 21 2.0 8.0
10 7 - 21 4.0 9.0
11 8 - 21 2.0 3.0
12 9 - 21 2.0 1.0
13 10 - 21 1.0 11.0
14 11 - 21 6.0 3.0
15 12 - 21 3.0 3.0
16 13 - 21 2.0 6.0
17 14 - 21 5.0 9.0
18 15 - 21 9.0 13.0
19 16 - 21 7.0 7.0
20 17 - 21 9.0 4.0
21 18 - 21 3.0 8.0
22 19 - 21 6.0 3.0
23 20 - 21 6.0 1.0
24 21 - 21 3.0 5.0
25 22 - 21 5.0 3.0
26 23 - 21 1.0 0.0
I want to add a new calculated column, number_active, which relies on previous values. For this I'm trying to use pd.DataFrame.shift(), like this:
# Creating new column and setting all rows to 0
df['number_active'] = 0
# Active from previous period
PREVIOUS_PERIOD_ACTIVE = 22
# Calculating active value for first period in the DataFrame, based on `PREVIOUS_PERIOD_ACTIVE`
df.iat[0,3] = (df.iat[0,1] + PREVIOUS_PERIOD_ACTIVE) - df.iat[0,2]
# Calculating the whole column using DataFrame.shift()
df['number_active'] = (df['number_opened'] + df['number_active'].shift(1)) - df['number_closed']
# Recalculating first active value as it was overwritten in the previous step.
df.iat[0,3] = (df.iat[0,1] + PREVIOUS_PERIOD_ACTIVE) - df.iat[0,2]
The result:
g_accessor number_opened number_closed number_active
0 49 - 20 3.0 1.0 24.0
1 50 - 20 2.0 14.0 12.0
2 51 - 20 1.0 6.0 -5.0
3 52 - 20 0.0 6.0 -6.0
4 1 - 21 1.0 4.0 -3.0
5 2 - 21 3.0 5.0 -2.0
6 3 - 21 4.0 11.0 -7.0
7 4 - 21 2.0 7.0 -5.0
8 5 - 21 6.0 10.0 -4.0
9 6 - 21 2.0 8.0 -6.0
10 7 - 21 4.0 9.0 -5.0
11 8 - 21 2.0 3.0 -1.0
12 9 - 21 2.0 1.0 1.0
13 10 - 21 1.0 11.0 -10.0
14 11 - 21 6.0 3.0 3.0
15 12 - 21 3.0 3.0 0.0
16 13 - 21 2.0 6.0 -4.0
17 14 - 21 5.0 9.0 -4.0
18 15 - 21 9.0 13.0 -4.0
19 16 - 21 7.0 7.0 0.0
20 17 - 21 9.0 4.0 5.0
21 18 - 21 3.0 8.0 -5.0
22 19 - 21 6.0 3.0 3.0
23 20 - 21 6.0 1.0 5.0
24 21 - 21 3.0 5.0 -2.0
25 22 - 21 5.0 3.0 2.0
26 23 - 21 1.0 0.0 1.0
Oddly, it seems that only the first shifted value (index 1) is calculated correctly (the value at index 0 is calculated independently, via df.iat). For the rest of the rows it seems that number_closed is treated as a negative value, for some reason.
What am I missing/doing wrong?
You are assuming that the result for the previous row is available when the current row is calculated. This is not how pandas calculations work. Pandas calculations treat each row in isolation, unless you are applying multi-row operations like cumsum and shift.
I would calculate the number active with a minimal example as:
import pandas as pd

df = pd.DataFrame({'ignore': ['a', 'b', 'c', 'd', 'e'],
                   'number_opened': [3, 4, 5, 4, 3],
                   'number_closed': [1, 2, 2, 1, 2]})
# running total of opens minus closes, on top of the 22 carried over
df['number_active'] = df['number_opened'].cumsum() + 22 - df['number_closed'].cumsum()
This gives a result of:
  ignore  number_opened  number_closed  number_active
0      a              3              1             24
1      b              4              2             26
2      c              5              2             29
3      d              4              1             32
4      e              3              2             33
The code in your question with my minimal example gave:
  ignore  number_opened  number_closed  number_active
0      a              3              1             24
1      b              4              2             26
2      c              5              2              3
3      d              4              1              3
4      e              3              2              1
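Applied back to the DataFrame in the question, the same cumulative idea is a one-liner (using the PREVIOUS_PERIOD_ACTIVE constant defined there):
df['number_active'] = (PREVIOUS_PERIOD_ACTIVE
                       + (df['number_opened'] - df['number_closed']).cumsum())
This gives 22 + (3 - 1) = 24 for the first row, matching the df.iat calculation, and every later row builds on the previous one.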
I'm not able to sum by each group. The idea is to create a new column on this data set with the sum by store:
PNO store ForecastSUM
17 20054706 WITZ 0.0
8 8007536 WITZ 0.0
2 8007205 WITZ 0.0
12 8601965 WITZ 0.0
5 8007239 WITZ 0.0
14 20054706 ROT 1.0
1 8007205 ROT 7.0
9 8601965 ROT 2.0
6 8007536 ROT 3.0
3 8007239 ROT 2.0
15 20054706 MAR 1.0
7 8007536 MAEG 6.0
10 8601965 MAEG 4.0
4 8007239 MAEG 3.0
0 8007205 MAEG 6.0
13 20054706 BUD 1.0
11 8601965 AYC 0.0
16 20054706 AYC 0.0
I am trying to apply this code:
copiedDataWHSE['sumWHSE'] = copiedDataWHSE.groupby(['ForecastSUM']).agg({'ForecastSUM': "sum"})
and the result I am getting is:
PNO store ForecastSUM sumWHSE
17 20054706 WITZ 0.0 NaN
8 8007536 WITZ 0.0 NaN
2 8007205 WITZ 0.0 4.0
12 8601965 WITZ 0.0 NaN
5 8007239 WITZ 0.0 NaN
14 20054706 ROT 1.0 NaN
1 8007205 ROT 7.0 3.0
9 8601965 ROT 2.0 NaN
6 8007536 ROT 3.0 12.0
3 8007239 ROT 2.0 6.0
15 20054706 MAR 1.0 NaN
7 8007536 MAEG 6.0 7.0
10 8601965 MAEG 4.0 NaN
4 8007239 MAEG 3.0 4.0
0 8007205 MAEG 6.0 0.0
13 20054706 BUD 1.0 NaN
11 8601965 AYC 0.0 NaN
16 20054706 AYC 0.0 NaN
This is wrong; for example, when the store is ROT, the sumWHSE column should receive 19.
As @sammywemmy mentions, you need to group on store, not on ForecastSUM:
store_totals = df.groupby('store')['ForecastSUM'].sum()
However, since that result has only six rows (one per store), you can't assign it back to the DataFrame as a new column directly.
What I would do is turn it into a dictionary, then assign() it to a new column with a lambda function:
store_totals_dict = store_totals.to_dict()
df = df.assign(store_total=lambda x: x['store'].map(store_totals_dict))
Doing the same thing with apply() makes it a little more readable:
df['store_total'] = df['store'].apply(lambda s: store_totals_dict[s])
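For what it's worth, groupby().transform broadcasts the per-group sums back onto the original rows in one step, so the intermediate dictionary isn't needed at all:
df['store_total'] = df.groupby('store')['ForecastSUM'].transform('sum')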
I have a pandas DataFrame with 11 columns. I want to add the sum of all values of columns 9 and 10 to the end of the table. So far I have tried two methods:
Assigning the data to the cell with dataframe.iloc[rownumber, 8], which results in an out-of-bounds error.
Creating a vector padded with blanks ('') using the following code:
total = ['', '', '', '', '', '', '', '', dataframe['Column 9'].sum(), dataframe['Column 10'].sum(), '']
dataframe = dataframe.append(total)
The result was not nice: the total vector was appended as a vertical vector at the end rather than a horizontal one. What can I do to solve this?
You need to use pandas.DataFrame.append with ignore_index=True:
dataframe = dataframe.append(dataframe[['Column 9', 'Column 10']].sum(), ignore_index=True).fillna('')
Example:
import pandas as pd
import numpy as np

df = pd.DataFrame()
df['col1'] = [1, 2, 3, 4]
df['col2'] = [2, 3, 4, 5]
df['col3'] = [5, 6, 7, 8]
df['col4'] = [5, 6, 7, 8]
Using append:
df = df.append(df[['col2', 'col3']].sum(), ignore_index=True)
print(df)
col1 col2 col3 col4
0 1.0 2.0 5.0 5.0
1 2.0 3.0 6.0 6.0
2 3.0 4.0 7.0 7.0
3 4.0 5.0 8.0 8.0
4 NaN 14.0 26.0 NaN
Without NaN values:
df = df.append(df[['col2', 'col3']].sum(), ignore_index=True).fillna('')
print(df)
col1 col2 col3 col4
0 1 2.0 5.0 5
1 2 3.0 6.0 6
2 3 4.0 7.0 7
3 4 5.0 8.0 8
4 14.0 26.0
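One caveat: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. On newer versions the same result can be had with pd.concat, for example (a sketch using the df built above):
import pandas as pd

totals = df[['col2', 'col3']].sum().to_frame().T  # one-row DataFrame holding the sums
df = pd.concat([df, totals], ignore_index=True).fillna('')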
Create a new DataFrame with the sums. This example DataFrame has columns 'a' and 'b'; df1 is the DataFrame that needs to be summed up and df3 is a one-line DataFrame holding only the sums:
data = [[df1.a.sum(), df1.b.sum()]]
df3 = pd.DataFrame(data, columns=['a', 'b'])
Then append it to the end (append returns a new DataFrame, so assign the result):
df1 = df1.append(df3, ignore_index=True)
Simply try this (replace test with your DataFrame's name).
Row-wise sum (which is what you asked for):
test['Total'] = test[['col9', 'col10']].sum(axis=1)
print(test)
Column-wise sum:
test.loc['Total'] = test[['col9', 'col10']].sum()
test.fillna('', inplace=True)
print(test)
IIUC, this is what you need (change the numbers 8 and 9 to suit your columns):
df['total'] = df.iloc[:, [8, 9]].sum(axis=1)   # horizontal (row-wise) sum of columns 8 and 9
df['total1'] = df.iloc[:, [8, 9]].sum().sum()  # grand total of columns 8 and 9, repeated on every row
df.loc['total2'] = df.iloc[:, [8, 9]].sum()    # vertical (column-wise) sums in a new row, for columns 8 and 9 only
Example:
import numpy as np
import pandas as pd

a = np.arange(0, 11, 1)
b = np.random.randint(10, size=(5, 11))
df = pd.DataFrame(columns=a, data=b)
0 1 2 3 4 5 6 7 8 9 10
0 0 5 1 3 4 8 6 6 8 1 0
1 9 9 8 9 9 2 3 8 9 3 6
2 5 7 9 0 8 7 8 8 7 1 8
3 0 7 2 8 8 3 3 0 4 8 2
4 9 9 2 5 2 2 5 0 3 4 1
Output:
0 1 2 3 4 5 6 7 8 9 10 total total1
0 0.0 5.0 1.0 3.0 4.0 8.0 6.0 6.0 8.0 1.0 0.0 9.0 48.0
1 9.0 9.0 8.0 9.0 9.0 2.0 3.0 8.0 9.0 3.0 6.0 12.0 48.0
2 5.0 7.0 9.0 0.0 8.0 7.0 8.0 8.0 7.0 1.0 8.0 8.0 48.0
3 0.0 7.0 2.0 8.0 8.0 3.0 3.0 0.0 4.0 8.0 2.0 12.0 48.0
4 9.0 9.0 2.0 5.0 2.0 2.0 5.0 0.0 3.0 4.0 1.0 7.0 48.0
total2 NaN NaN NaN NaN NaN NaN NaN NaN 31.0 17.0 NaN NaN NaN
Putting a slight variation on a question I previously asked: I managed to get a solution for sorting values by a particular column in my pandas Series. The problem is that sorting purely by time doesn't account for the different dates on which the time occurred. I know I could hard-code the order and apply it with .loc, but I wanted to find out whether there is a simpler way to sort primarily by week (earliest week first) and then by time (0-23 hours within each week).
Here is a sample of the dataframe I have again:
weeknum time_hour
16-22Jun 0.0 5
2-8Jun 0.0 3
23-29Jun 0.0 11
9-15Jun 0.0 3
16-22Jun 1.0 3
2-8Jun 1.0 6
23-29Jun 1.0 3
9-15Jun 1.0 8
16-22Jun 2.0 3
2-8Jun 2.0 6
23-29Jun 2.0 3
16-22Jun 3.0 4
2-8Jun 3.0 2
23-29Jun 3.0 3
9-15Jun 3.0 4
16-22Jun 4.0 2
2-8Jun 4.0 7
23-29Jun 4.0 1
9-15Jun 4.0 7
16-22Jun 5.0 2
2-8Jun 5.0 9
23-29Jun 5.0 9
9-15Jun 5.0 12
16-22Jun 6.0 5
2-8Jun 6.0 12
23-29Jun 6.0 6
9-15Jun 6.0 14
16-22Jun 7.0 12
2-8Jun 7.0 17
23-29Jun 7.0 19
This is my code:
merged_clean.groupby('weeknum')['time_hour'].value_counts().sort_index(level=['time_hour'])
Use the built-in sorted with a key function over multiple keys to sort the MultiIndex, converting the number before the - to an integer, and then apply the new order with Series.reindex:
s = merged_clean.groupby('weeknum')['time_hour'].value_counts()
idx = sorted(s.index, key = lambda x: (int(x[0].split('-')[0]), x[1]))
s = s.reindex(idx)
print(s)
weeknum time_hour
2-8Jun 0.0 3
1.0 6
2.0 6
3.0 2
4.0 7
5.0 9
6.0 12
7.0 17
9-15Jun 0.0 3
1.0 8
3.0 4
4.0 7
5.0 12
6.0 14
16-22Jun 0.0 5
1.0 3
2.0 3
3.0 4
4.0 2
5.0 2
6.0 5
7.0 12
23-29Jun 0.0 11
1.0 3
2.0 3
3.0 3
4.0 1
5.0 9
6.0 6
7.0 19
Name: time_hour, dtype: int64
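On pandas 1.1+ the hand-built index list can also be avoided: Series.sort_index accepts a key callable that is applied per index level. A sketch, assuming every weeknum value starts with the week number before the -:
s = merged_clean.groupby('weeknum')['time_hour'].value_counts()
s = s.sort_index(key=lambda level: (level.str.split('-').str[0].astype(int)
                                    if level.name == 'weeknum' else level))
The time_hour level is already numeric, so it passes through unchanged, giving the same ordering as the reindex approach above.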