dataframe get average of a column based on other columns [duplicate] - python

This question already has answers here:
Compute row average in pandas
(5 answers)
Closed 2 years ago.
have a df with values
df
name maths english chemistry
mark 10 0 20
tom 10 20 30
hall 0 25 15
How can I take the average marks of each user without considering the 0 values?
expected output
name maths english chemistry average marks
mark 10 0 20 15
tom 10 20 30 20
hall 0 25 15 20

You can replace the value you want to ignore with NA and then calculate the mean, since missing values are skipped by default. This can be done with df.replace({0: pd.NA}), as in the following code:
import pandas as pd
df = pd.DataFrame({
    "math": {"mark": 10, "tom": 10, "hall": 0},
    "english": {"mark": 0, "tom": 20, "hall": 25},
    "chemistry": {"mark": 20, "tom": 30, "hall": 15}
})
df["average_marks"] = df.replace({0: pd.NA}).mean(axis=1)
df
Outputs:
math english chemistry average_marks
mark 10 0 20 15.0
tom 10 20 30 20.0
hall 0 25 15 20.0
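A small note that is not part of the original answer: if a row contained only zeros, every value in it would become NA and its mean would be NaN; if you would rather have 0 in that case, you can append fillna:
df["average_marks"] = df.replace({0: pd.NA}).mean(axis=1).fillna(0)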

You can mask the zero values, before computing your average :
df.assign(average_marks=df.mask(df.eq(0)).select_dtypes("number").mean(axis=1))
name maths english chemistry average_marks
0 mark 10 0 20 15.0
1 tom 10 20 30 20.0
2 hall 0 25 15 20.0
@trimvi's solution is simpler though; this is only an alternative.

Related

How do i increase an element value from column in Pandas?

Hello, I have this Pandas code (see below), but it turns out it gives me this error: TypeError: can only concatenate str (not "int") to str
import pandas as pd
import numpy as np
import os
_data0 = pd.read_excel("C:\\Users\\HP\\Documents\\DataScience task\\Gender_Age.xlsx")
_data0['Age' + 1]
I want to change the element values in the 'Age' column. For example, if I wanted to increase the elements of 'Age' by 1, how do I do that? (And the same for 'Number of Children'.)
The output I wanted:
First Name Last Name Age Number of Children
0 Kimberly Watson 36 2
1 Victor Wilson 35 6
2 Adrian Elliott 35 2
3 Richard Bailey 36 5
4 Blake Roberts 35 6
Original output:
First Name Last Name Age Number of Children
0 Kimberly Watson 24 1
1 Victor Wilson 23 5
2 Adrian Elliott 23 1
3 Richard Bailey 24 4
4 Blake Roberts 23 5
Try (your expected output is 12 higher than the original for Age and 1 higher for Number of Children):
df['Age'] = df['Age'] + 12
df['Number of Children'] = df['Number of Children'] + 1
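The TypeError itself comes from _data0['Age' + 1], which adds 1 to the column label string rather than to the column's values. A minimal sketch of the intended pattern, reusing the file path from the question:
import pandas as pd

_data0 = pd.read_excel("C:\\Users\\HP\\Documents\\DataScience task\\Gender_Age.xlsx")
# select the column first, then add to its values (not to its name)
_data0['Age'] = _data0['Age'] + 1
_data0['Number of Children'] = _data0['Number of Children'] + 1
print(_data0.head())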

How to print the sorted excel ( which is done using .sort_value() ) in pandas

import pandas as pd
grade = pd.read_excel('data1.xlsx')
Total=grade['Total(48)']
print(Total)
Total.sort_value()
You can sort the dataframe by a column name like Total(48) using the pandas sort_values method (note the trailing s: there is no sort_value, and the result has to be assigned back, because sort_values returns a new object by default).
Code:
import pandas as pd
grade = pd.read_excel('data1.xlsx')
print("Grade before sorting")
print(grade)
grade = grade.sort_values(by=['Total(48)'])
print("Grade after sorting")
print(grade)
Output:
Grade before sorting
Name Total(48)
0 Shovon 5
1 arsho 89
2 Ahmedur -54
3 Rahman 10
4 Sho 1
5 john 6
6 ken 87
Grade after sorting
Name Total(48)
2 Ahmedur -54
4 Sho 1
0 Shovon 5
5 john 6
3 Rahman 10
6 ken 87
1 arsho 89
References:
Documentation on sort_values method
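A small usage note, not part of the original answer: if you only need the Total(48) column, or a descending order, the same method applies; a sketch assuming the same data1.xlsx:
grade = pd.read_excel('data1.xlsx')
# highest totals first; assign the result back, since sort_values returns a copy
top = grade.sort_values(by='Total(48)', ascending=False)
print(top['Total(48)'])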

Splitting Pandas Row Data into Multiple Rows without Adding Columns

I have some American Football data in a DataFrame like below:
df = pd.DataFrame({'Green Bay Packers' : ['30-18-0', '5-37', '10-71' ],
'Chicago Bears' : ['45-26-1', '5-20', '10-107']},
index=['Att - Comp - Int', 'Sacked - Yds Lost', 'Penalties - Yards'])
Green Bay Packers Chicago Bears
Att - Comp - Int 30-18-0 45-26-1
Sacked - Yds Lost 5-37 5-20
Penalties - Yards 10-71 10-107
You can see above that each row contains multiple data points that need to be split off.
What I'd like to do is find some way to split the rows up so that each data point is its own row. The final output would look like:
Green Bay Packers Chicago Bears
Att 30 45
Comp 18 26
Int 0 1
Sacked 5 5
Yds Lost 37 20
Penalties 10 10
Yards 71 107
Is there a way to do this efficiently? I tried some regex but it just turned into a mess. Sorry if my formatting isn't perfect; this is only my 2nd question posted here.
Try:
df = df.reset_index().apply(lambda x: x.str.split("-"))
df = pd.DataFrame(
    {c: df[c].explode().str.strip() for c in df.columns},
).set_index("index")
df.index.name = None
print(df)
Prints:
Green Bay Packers Chicago Bears
Att 30 45
Comp 18 26
Int 0 1
Sacked 5 5
Yds Lost 37 20
Penalties 10 10
Yards 71 107
First reset the index, then stack all the columns and split them on -. You can additionally use apply to remove any leftover whitespace after splitting, then unstack again, apply pd.Series.explode, and finally reset the index and drop the leftover unneeded column.
out = (df.reset_index()
         .stack().str.split('-').apply(lambda x: [i.strip() for i in x])
         .unstack()
         .apply(pd.Series.explode)
         .reset_index()
         .drop(columns='level_0'))
index Green Bay Packers Chicago Bears
0 Att 30 45
1 Comp 18 26
2 Int 0 1
3 Sacked 5 5
4 Yds Lost 37 20
5 Penalties 10 10
6 Yards 71 107
Assuming you have the same number of splits in every row, with pandas >= 1.3.0 you can explode multiple columns at the same time:
df = df.reset_index().apply(lambda s: s.str.split(' *- *'))
df.explode(df.columns.tolist()).set_index('index')
Green Bay Packers Chicago Bears
index
Att 30 45
Comp 18 26
Int 0 1
Sacked 5 5
Yds Lost 37 20
Penalties 10 10
Yards 71 107
Use .apply() on each column (including the index) and, for each column:
use .str.split() to split the data points, and
use .explode() to create a row for each element of the split.
df_out = (df.reset_index()
            .apply(lambda x: x.str.split(r'\s*-\s*').explode())
            .set_index('index').rename_axis(index=None)
          )
Result:
print(df_out)
Green Bay Packers Chicago Bears
Att 30 45
Comp 18 26
Int 0 1
Sacked 5 5
Yds Lost 37 20
Penalties 10 10
Yards 71 107
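One follow-up worth noting, which is not in the original answers: after splitting, every value is still a string. If you need numbers for further work, you can cast the result, e.g. with the df_out from the last answer:
df_out = df_out.astype(int)
print(df_out.dtypes)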

Set multiple columns to zero based on a value in another column [duplicate]

This question already has answers here:
Change one value based on another value in pandas
(7 answers)
Closed 2 years ago.
I have a sample dataset here. In the real case there are a train and a test dataset, both with around 300 columns and 800 rows. I want to select rows based on a certain value in one column and then set all values in those rows from, e.g., column 3 to column 50 to zero. How can I do it?
Sample dataset:
import pandas as pd
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Princi', 'Anuj', 'Nancy'],
        'Age': [27, 24, 22, 32, 66, 43],
        'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj', 'Katauj', 'vbinauj'],
        'Payment': [15, 20, 40, 50, 3, 23],
        'Qualification': ['Msc', 'MA', 'MCA', 'Phd', 'MA', 'MS']}
df = pd.DataFrame(data)
df
Here is the output of sample dataset:
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 Kanpur 20 MA
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 Kannauj 50 Phd
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS
As you can see, the first column has values == "Princi". So wherever the Name column value == "Princi", I want to set the "Address" and "Payment" columns in those rows to zero.
Here is the expected output:
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 0 0 MA #this row
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 0 0 Phd #this row
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS
In my real dataset, I tried:
train.loc[:, 'got':'tod']  # I could select all those columns
and train.loc[df['column_wanted'] == "that value"]  # I got all those rows
But how can I combine them? Thanks for your help!
Use the loc accessor; df.loc[boolean selection, columns]
df.loc[df['Name'].eq('Princi'),'Address':'Payment']=0
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 0 0 MA
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 0 0 Phd
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS
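Applied to the real dataset described in the question, the same pattern combines the boolean row filter with the 'got':'tod' column slice (the column names and filter value below are the placeholders from the question, not real names):
train.loc[train['column_wanted'] == "that value", 'got':'tod'] = 0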

merge multiple rows in panadas for all columns

Suppose we have a data frame with 1000 rows and 100 columns. The first column holds the names and the rest are values or empty. Many rows have the same name. How can I add them up so that each name appears only once, with the sum of its values?
For example, the name Alex in the first row has the values 20, 30, 40, and in 2 other rows Alex appears again with values 10, 10, 20 each. So my new data frame should have the row Alex just once, with values 40, 50, 80.
EDIT : First of all thank you all for your feedback. Sorry if I was not clear. Imagine I have the following matrix
Names Last name price1 price2 price3 (no named column)
-------------------------------------------------------------------------
Alex Robinson 10 20 30 (a string)
Bill Towns 10 40 50 (empty)
Alex Robinson 30 10 20 (empty)
George Leopold 10 10 10 (empty)
Alex Robinson 20 20 20 (empty)
and I want it to become:
Names Last name price1 price2 price3 (no named column)
(no named row)
---------------------------------------------------------------------------
Alex Robinson 60 50 70 (a string)
Bill Towns 10 40 50 (empty)
George Leopold 10 10 10 (empty)
But instead of 3 columns imagine I have 100, so I cannot refer to them explicitly by name.
EDIT2 : I forgot to mention that some rows also contain strings. Unfortunately I get an error for this command:
df8 = data.groupby('Name').sum()
I have already sorted the dataframe with this command
data2 = data.sort_values('Name',ascending=True).reset_index(drop=True)
Here's the code that will sum your score:
import pandas as pd
data = [['alan',10],['tom',23],['nick',22],['alan',11]]
df = pd.DataFrame(data,columns=['name','score'])
df = df.groupby(['name'], as_index=False)['score'].sum()
print(df)
The results:
Before:
name score
0 alan 10
1 tom 23
2 nick 22
3 alan 11
And after:
name score
0 alan 21
1 nick 22
2 tom 23
You can do it with df.groupby
df = df.groupby('Names').sum().reset_index()
Output
Names price1 price2 price3
0 Alex 60 50 70
1 Bill 10 40 50
2 George 10 10 10
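Regarding EDIT2: the error from data.groupby('Name').sum() typically comes from the string columns, which recent pandas versions no longer drop silently. A sketch under the assumption that the identifier columns are named 'Names' and 'Last name' as in the example table: group on both of them so the last name is kept, and sum only the numeric columns:
out = df.groupby(['Names', 'Last name'], as_index=False).sum(numeric_only=True)
print(out)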
