How to calculate cumulative sum and average on file data in python

How to calculate cumulative sum and average on file data in python - python

I have a below data in file
NAME,AGE,MARKS
A1,12,40
B1,13,54
C1,15,67
D1,11,41
E1,16,59
F1,10,60
If the data was in database table , I would have used Sum and Average function to get the cumulative sum and average
But How to get it with python is a bit challenging , As i am learner
Expected output :
NAME,AGE,MARKS,CUM_SUM,AVG
A1,12,40,40,40
B1,13,54,94,47
C1,15,67,161,53.66
D1,11,41,202,50.5
E1,16,59,261,43.5
F1,10,60,321,45.85

IIUC use:
df = pd.read_csv('file')
df['CUM_SUM'] = df['MARKS'].cumsum()
df['AVG'] = df['MARKS'].expanding().mean()
print (df)
NAME AGE MARKS CUM_SUM AVG
0 A1 12 40 40 40.000000
1 B1 13 54 94 47.000000
2 C1 15 67 161 53.666667
3 D1 11 41 202 50.500000
4 E1 16 59 261 52.200000
5 F1 10 60 321 53.500000
Last use:
df.to_csv('file.csv', index=False)
Or:
out = df.to_string(index=False)

Related

summing two columns in a dataframe

My df looks as follows:
Roll Name Age Physics English Maths
0 A1 Max 16 87 79 90
1 A2 Lisa 15 47 75 60
2 A3 Luna 17 83 49 95
3 A4 Ron 16 86 79 93
4 A5 Silvia 15 57 99 91
I'd like to add the columns Physics, English, and Maths and display the results in a separate column 'Grade'.
I've tried the code:
df['Physics'] + df['English'] + df['Maths']
But it just concatenates. I am not taught about the lambda function yet.
How do I go about this?

df['Grade'] = df['Physics'] + df['English'] + df['Maths']
it concatenates maybe your data is in **String** just convert into float or integer.
Check Data Types First by using df.dtypes

Try:
df["total"] = df[["Physics", "English", "Maths"]].sum(axis=1)
df

Check Below code, Its is possible you columns are in string format, belwo will solve that:
import pandas as pd
df = pd.DataFrame({"Physics":['1','2','3'],"English":['1','2','3'],"Maths":['1','2','3']})
df['Total'] = df['Physics'].astype('int') +df['English'].astype('int') +df['Maths'].astype('int')
df
Output:

Calculate Multiple Column Growth in Python Dataframe

The data I used look like this
data
Subject 2000_X1 2000_X2 2001_X1 2001_X2 2002_X1 2002_X2
1 100 50 120 45 110 50
2 95 40 100 45 105 50
3 110 45 100 45 110 40
I want to calculate each variable growth for each year so the result will look like this
Subject 2001_X1_gro 2001_X2_gro 2002_X1_gro 2002_X2_gro
1 0.2 -0.1 -0.08333 0.11111
2 0.052632 0.125 0.05 0.11111
3 -0.09091 0 0.1 -0.11111
I already do it manually for each variable for each year with code like this
data[2001_X1_gro]= (data[2001_X1]-data[2000_X1])/data[2000_X1]
data[2002_X1_gro]= (data[2002_X1]-data[2001_X1])/data[2001_X1]
data[2001_X2_gro]= (data[2001_X2]-data[2000_X2])/data[2000_X2]
data[2002_X2_gro]= (data[2002_X2]-data[2001_X2])/data[2001_X2]
Is there a way to do it more efficient escpecially if I have more year and/or more variable?

import pandas as pd
df = pd.read_csv('data.txt', sep=',', header=0)
Input
Subject 2000_X1 2000_X2 2001_X1 2001_X2 2002_X1 2002_X2
0 1 100 50 120 45 110 50
1 2 95 40 100 45 105 50
2 3 110 45 100 45 110 40
Next, a loop is created and the columns are filled:
qqq = '_gro'
new_name = ''
year = ''
for i in range(1, len(df.columns) - 2):
year = str(int(df.columns[i][:4]) + 1) + df.columns[i][4:]
new_name = year + qqq
df[new_name] = (df[year] - df[df.columns[i]])/df[df.columns[i]]
print(df)
Output
Subject 2000_X1 2000_X2 2001_X1 2001_X2 2002_X1 2002_X2 2001_X1_gro \
0 1 100 50 120 45 110 50 0.200000
1 2 95 40 100 45 105 50 0.052632
2 3 110 45 100 45 110 40 -0.090909
2001_X2_gro 2002_X1_gro 2002_X2_gro
0 -0.100 -0.083333 0.111111
1 0.125 0.050000 0.111111
2 0.000 0.100000 -0.111111
In the loop, the year is extracted from the column name, converted to int, 1 is added to it. The value is again converted to a string, the prefix '_Xn' is added. A new_name variable is created, to which the string '_gro ' is also appended. A column is created and filled with calculated values.
If you want to count, for example, for three years, then you need to add not 1, but 3. This is with the condition that your data will be ordered. And note that the loop does not go through all the elements: for i in range(1, len(df.columns) - 2):. In this case, it skips the Subject column and stops short of the last two values. That is, you need to know where to stop it.

Pandas matching elements

i have a database named df1 and a sheet named df2。
i want to use df1 filling df2 by pandas。
DF1:
name SCORE height weight
1 JACK 66 150 100
2 PAUL 50 165 22
3 MLKE 30 132 33
4 Meir 20 110 20
5 Payne 10 175 21
DF2:
name SCORE height weight
1 JACK
2 PAUL
3 MLKE
*name maybe mess up the order
my misktake code :
import openpyxl
import pandas as pd
df1 = pd.DataFrame(pd.read_excel('df1.xlsx',sheet_name =0))
df2 = pd.DataFrame(pd.read_excel('df2.xlsx',sheet_name = 0))
result = df1.merge(df2,on = ['NAME'],how="left")
DF1:
Expected result:
DF2:
name SCORE height weight
1 JACK 66 150 100
2 PAUL 50 165 22
3 MLKE 30 132 33

As you mentioned, name maybe mess up the order, therefore, if you want to use df1 to fill-up df2, you can try setting name as index in both df1 and df2 and then use .update(), as follows:
df1a = df1.set_index('name')
df2a = df2.set_index('name')
df2a.update(df1a)
df2 = df2a.reset_index()
Result:
(Using df1 data based on the picture near the bottom):
print(df2)
name SCORE height weight
0 JACK 66 150 100
1 PAUL 50 165 22
2 MLKE 30 132 33
If you want to keep the original row index of df2, you can save the index and then restore it later, as follows:
df1a = df1.set_index('name')
df2a = df2.set_index('name')
df2a.update(df1a)
df2_idx = df2.index
df2 = df2a.reset_index()
df2.index = df2_idx
Result:
print(df2)
name SCORE height weight
1 JACK 66 150 100
2 PAUL 50 165 22
3 MLKE 30 132 33

Modify and round numbers in a pandas dataframe in Python

Long story short, I have a csv file which I read as a pandas dataframe. The file contains a weather report, but all of the measurements for temperature are in Fahrenheit. I've figured out how to convert them:
import pandas as np
df = np.read_csv('report.csv')
df['average temperature'] = (df['average temperature'] - 32) * 5/9
But then the data for this column is in decimals up to 6 points.
I've found code that will round up all the data in the dataframe, but I need only this column.
df.round(2)
I don't like how it has to be a separate piece of code on a separate line and how it modifies all of my data. Is there a way to go about this problem more elegantly? Is there a way to apply this to other columns in my dataframe, such as maximum temperature and minimum temperature without having to copy the above piece of code?

For round only some columns use subset:
cols = ['maximum temperature','minimum temperature','average temperature']
df[cols] = df[cols].round(2)
If want convert only some columns from list:
cols = ['maximum temperature','minimum temperature','average temperature']
df[cols] = ((df[cols] - 32) * 5/9).round(2)
If want round each column separately:
df['average temperature'] = df['average temperature'].round(2)
df['maximum temperature'] = df['maximum temperature'].round(2)
df['minimum temperature'] = df['minimum temperature'].round(2)
Sample:
df = (pd.DataFrame(np.random.randint(30, 100, (10, 3)),
columns=['maximum temperature','minimum temperature','average temperature'])
.assign(a='m', b=range(10)))
print (df)
maximum temperature minimum temperature average temperature a b
0 97 60 98 m 0
1 64 86 64 m 1
2 32 64 95 m 2
3 60 56 93 m 3
4 43 89 64 m 4
5 40 62 86 m 5
6 37 40 70 m 6
7 61 33 46 m 7
8 36 44 46 m 8
9 63 30 33 m 9
cols = ['maximum temperature','minimum temperature','average temperature']
df[cols] = ((df[cols] - 32) * 5/9).round(2)
print (df)
maximum temperature minimum temperature average temperature a b
0 36.11 15.56 36.67 m 0
1 17.78 30.00 17.78 m 1
2 0.00 17.78 35.00 m 2
3 15.56 13.33 33.89 m 3
4 6.11 31.67 17.78 m 4
5 4.44 16.67 30.00 m 5
6 2.78 4.44 21.11 m 6
7 16.11 0.56 7.78 m 7
8 2.22 6.67 7.78 m 8
9 17.22 -1.11 0.56 m 9

Here's a single line solution with apply and a conversion function.
def convert_to_celsius (f):
return 5.0/9.0*(f-32)
df[['Column A','Column B']] = df[['Column A','Column B']].apply(convert_to_celsius).round(2)

calculate values between two pandas dataframe based on a column value

EDITED: let me copy the whole data set
df is the store sales/inventory data
branch daqu store store_name style color size stocked sold in_stock balance
0 huadong wenning C301 EE #��#��##�� EEBW52301M 39 160 7 4 3 -5
1 huadong wenning C301 EE #��#��##�� EEBW52301M 39 165 1 0 1 1
2 huadong wenning C301 EE #��#��##�� EEBW52301M 39 170 6 3 3 -3
dh is the transaction (move 'amount' from store 'from' to 'to')
branch daqu from to style color size amount box_sum
8 huadong shanghai C306 C30C EEOM52301M 59 160 1 162
18 huadong shanghai C306 C30C EEOM52301M 39 160 1 162
25 huadong shanghai C306 C30C EETJ52301M 52 160 9 162
26 huadong shanghai C306 C30C EETJ52301M 52 155 1 162
32 huadong shanghai C306 C30C EEOW52352M 19 160 2 162
What I want is the store inventory data after the transaction, which would look exactly the same format as the df, but only 'in_stock' numbers would have changed from the original df according to numbers in dh.
below is what I tried:
df['full_code'] = df['store']+df['style']+df['color'].astype(str)+df['size'].astype(str)
dh['from_code'] = dh['from']+dh['style']+dh['color'].astype(str)+dh['size'].astype(str)
dh['to_code'] = dh['to']+dh['style']+dh['color'].astype(str)+dh['size'].astype(str)
# subtract from 'from' store
dh_from = pd.DataFrame(dh.groupby('from_code')['amount'].sum())
for code, stock in dh_from.iterrows() :
df.loc[df['full_code'] == code, 'in_stock'] = df.loc[df['full_code'] == code, 'in_stock'] - stock
# add to 'to' store
dh_to = pd.DataFrame(dh.groupby('to_code')['amount'].sum())
for code, stock in dh_to.iterrows() :
df.loc[df['full_code'] == code, 'in_stock'] = df.loc[df['full_code'] == code, 'in_stock'] + stock
df.to_csv('d:/after_dh.csv')
But when I open the csv file then the 'in_stock' values for those which transaction occured are all blanks.
I think df.loc[df['full_code'] == code, 'in_stock'] = df.loc[df['full_code'] == code, 'in_stock'] + stock this has some problem. What's the correct way of updating the value?
ORIGINAL: I have two pandas dataframe: df1 is for the inventory, df2 is for the transaction
df1 look something like this:
full_code in_stock
1 AAA 200
2 BBB 150
3 CCC 150
df2 look something like this:
from to full_code amount
1 XX XY AAA 30
2 XX XZ AAA 35
3 ZY OI BBB 50
4 AQ TR AAA 15
What I want is the inventory after all transactions are done.
In this case,
full_code in_stock
1 AAA 120
2 BBB 100
3 CCC 150
Note that full_code is unique in df1, but not unique in df2.
Is there any pandas way of doing this? I got messed up with the original dataframe and a view of the dataframe and got it solved by turning them into numpy array and finding matching full_codes. But the resulting code is also a mess and wonder if there is a simpler way of doing this not turning everything into a numpy array.

What I would do is to set the index in df1 to the 'full_code' column and then call sub to subtract the other df.
What we pass for the values is the result of grouping on 'full_code' and calling sum on 'amount' column.
An additional param for sub is fill_values this is because product 'CCC' does not exist on the rhs so we want this value to be preserved, otherwise it becomes NaN:
In [25]:
total = df1.set_index('full_code')['in_stock'].sub(df2.groupby('full_code')['amount'].sum(), fill_value=0)
total.reset_index()

Out[25]:
full_code in_stock
0 AAA 120
1 BBB 100
2 CCC 150

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to calculate cumulative sum and average on file data in python - python

Related

summing two columns in a dataframe

Calculate Multiple Column Growth in Python Dataframe

Pandas matching elements

Modify and round numbers in a pandas dataframe in Python

calculate values between two pandas dataframe based on a column value

Categories

Resources