Suppose we have a data frame with 1000 rows and 100 columns. The first column holds names and the rest hold values or are empty. Many rows share the same name. How can I sum them so that each name appears only once, with the summed values?
For example, the name Alex on the first row has the values 20, 30, 40, and two other rows also have Alex, each with values 10, 10, 20. So my new data frame should have the row Alex just once, with values 40, 50, 80.
EDIT: First of all, thank you all for your feedback. Sorry if I was not clear. Imagine I have the following matrix (the last column has no name and sometimes holds a string):
Names    Last name   price1   price2   price3   (unnamed column)
Alex     Robinson    10       20       30       (a string)
Bill     Towns       10       40       50       (empty)
Alex     Robinson    30       10       20       (empty)
George   Leopold     10       10       10       (empty)
Alex     Robinson    20       20       20       (empty)

and I want to end up with:

Names    Last name   price1   price2   price3   (unnamed column)
Alex     Robinson    60       50       70       (a string)
Bill     Towns       10       40       50       (empty)
George   Leopold     10       10       10       (empty)
But instead of 3 price columns imagine I have 100, so I cannot reference each of them explicitly by name.
EDIT2: I forgot to mention that some rows also contain a string, which is why I get an error with this command:
df8 = data.groupby('Name').sum()
I have already sorted the dataframe with this command:
data2 = data.sort_values('Name',ascending=True).reset_index(drop=True)
Here's the code that will sum your score:
import pandas as pd

data = [['alan', 10], ['tom', 23], ['nick', 22], ['alan', 11]]
df = pd.DataFrame(data, columns=['name', 'score'])
# Collapse duplicate names, summing their scores.
df = df.groupby(['name'], as_index=False)['score'].sum()
print(df)
The results:
Before:
name score
0 alan 10
1 tom 23
2 nick 22
3 alan 11
And after:
name score
0 alan 21
1 nick 22
2 tom 23
You can do it with df.groupby:
df = df.groupby('Names').sum().reset_index()
Output
Names price1 price2 price3
0 Alex 60 50 70
1 Bill 10 40 50
2 George 10 10 10
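Note that EDIT2 says a plain .sum() errors out because some rows contain a string. One way around that (a sketch, assuming a recent pandas; the 'note' column is hypothetical and stands in for the unnamed string column) is to restrict the aggregation to numeric columns with numeric_only=True:

import pandas as pd

# Hypothetical frame mirroring the edited example; 'note' stands in
# for the unnamed column that sometimes holds a string.
df = pd.DataFrame({
    'Names': ['Alex', 'Bill', 'Alex', 'George', 'Alex'],
    'Last name': ['Robinson', 'Towns', 'Robinson', 'Leopold', 'Robinson'],
    'price1': [10, 10, 30, 10, 20],
    'price2': [20, 40, 10, 10, 20],
    'price3': [30, 50, 20, 10, 20],
    'note': ['a string', None, None, None, None],
})

# numeric_only=True sums only the numeric columns, so the string
# column no longer raises; this works for 3 price columns or 100.
out = df.groupby(['Names', 'Last name'], as_index=False).sum(numeric_only=True)
print(out)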
Related
I am trying to modify some cells' values by multiplying them by 10, and it doesn't work. Here is a simple code example:
import pandas as pd

a = {'name': ['john', 'eric', 'kate'], 'buy': [100, 50, 200], 'sell': [20, 30, 40]}
df = pd.DataFrame(a)
df
name buy sell
0 john 100 20
1 eric 50 30
2 kate 200 40
df[df['name']=='eric'].iloc[:,2:]=df[df['name']=='eric'].iloc[:,2:]*10
df
name buy sell
0 john 100 20
1 eric 50 30
2 kate 200 40
But if I modify all the rows, it works fine. So what is the problem with the code above when using row filtering? Thank you so much for your help.
df.iloc[:,2:]=df.iloc[:,2:]*10
df
name buy sell
0 john 100 200
1 eric 50 300
2 kate 200 400
Let's try
df.loc[df['name']=='eric',['buy','sell']] *=10
How it works
The original line fails because df[df['name']=='eric'] returns a copy, so the chained .iloc assignment writes to that copy and df itself is left unchanged; a single .loc call selects and assigns in one operation, so it writes through to df. .iloc is an integer-position accessor: it references rows and columns by their integer positions, so df.iloc[:,2:] selects the columns from position 2 onward, which here is only sell.
You can achieve the same using (this relies on the default RangeIndex, where index labels coincide with integer positions):
df.iloc[df[df['name']=='eric'].index,2] *=10
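To see the copy-versus-write-through difference directly, a minimal sketch (assuming the df defined above):

import pandas as pd

df = pd.DataFrame({'name': ['john', 'eric', 'kate'],
                   'buy': [100, 50, 200],
                   'sell': [20, 30, 40]})

# Chained indexing: df[...] returns a copy, so this assignment mutates
# the temporary copy and df is unchanged (pandas may warn about it).
df[df['name'] == 'eric'].iloc[:, 2:] = 999
print(df.loc[df['name'] == 'eric', 'sell'].item())  # still 30

# Single .loc call: one indexing operation, so it writes into df.
df.loc[df['name'] == 'eric', ['buy', 'sell']] *= 10
print(df.loc[df['name'] == 'eric', 'sell'].item())  # now 300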
I have a sample dataset here. In the real case, there are a train and a test dataset, both with around 300 columns and 800 rows. I want to filter rows based on a certain value in one column and then set all the values in those rows from, e.g., column 3 to column 50 to zero. How can I do it?
Sample dataset:
import pandas as pd
data = {'Name':['Jai', 'Princi', 'Gaurav','Princi','Anuj','Nancy'],
'Age':[27, 24, 22, 32,66,43],
'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj', 'Katauj', 'vbinauj'],
'Payment':[15,20,40,50,3,23],
'Qualification':['Msc', 'MA', 'MCA', 'Phd','MA','MS']}
df = pd.DataFrame(data)
df
Here is the output of sample dataset:
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 Kanpur 20 MA
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 Kannauj 50 Phd
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS
As you can see, some rows have Name == "Princi". For those rows, I want to set the "Address" and "Payment" columns to zero.
Here is the expected output:
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 0 0 MA #this row
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 0 0 Phd #this row
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS
In my real dataset, I tried:
train.loc[:, 'got':'tod']  # selects those columns
and
train.loc[df['column_wanted'] == "that value"]  # selects those rows
But how can I combine them? Thanks for your help!
Use the loc accessor; df.loc[boolean selection, columns]
df.loc[df['Name'].eq('Princi'),'Address':'Payment']=0
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 0 0 MA
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 0 0 Phd
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS
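Applied to the real dataset from the question, the same pattern combines the two attempts in one call (a sketch, assuming 'got':'tod' is the real column slice and 'column_wanted' the real filter column):

# Row filter and column slice in a single .loc, then assign zero.
train.loc[train['column_wanted'] == "that value", 'got':'tod'] = 0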
A question about choosing values based on two dataframes.
>>> df[['age','name']]
age name
0 44 Anna
1 22 Bob
2 33 Cindy
3 44 Danis
4 55 Cindy
5 66 Danis
6 11 Anna
7 43 Bob
8 12 Cindy
9 19 Danis
10 11 Anna
11 32 Anna
12 55 Anna
13 33 Anna
14 32 Anna
>>> df2[['age','name']]
age name
5 66 Danis
4 55 Cindy
0 44 Anna
7 43 Bob
The expected result is every row of df whose 'age' value is higher than the 'age' in df2 for the same 'name':
age name
12 55 Anna
Per the comments, use merge and then filter the dataframe:
df.merge(df2, on='name', suffixes=('', '_y')).query('age > age_y')[['name', 'age']]
Output:
name age
4 Anna 55
IIUC, you can use this to find the max age of all names:
pd.concat([df,df2]).groupby('name')['age'].max()
Output:
name
Anna 55
Bob 43
Cindy 55
Danis 66
Name: age, dtype: int64
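If the goal is then to keep the df rows that exceed df2's age for the same name, a sketch (assuming names are unique in df2, as above) using a per-name lookup instead of a merge:

# Map each name to its df2 age and keep the df rows that exceed it.
limits = df2.set_index('name')['age']
print(df[df['age'] > df['name'].map(limits)])  # index 12: 55 Anna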
Try this (assuming age is a scalar holding the age threshold to compare against):
index = df[df['age'] > age].index
df.loc[index]
There are a few edge cases you don't say how you would like to resolve, but generally you can iterate down the frames, compare ages position by position, and keep the larger. You could do so in the following manner:
rows = []
for x in range(len(df)):
    # Keep whichever frame has the larger age at this position.
    if df['age'][x] > df2['age'][x]:
        rows.append({'age': df['age'][x], 'name': df['name'][x]})
    else:
        rows.append({'age': df2['age'][x], 'name': df2['name'][x]})
df3 = pd.DataFrame(rows, columns=['age', 'name'])
Although you will need to modify this to reflect how you want to resolve names that are only in one list, or if the lists are of different sizes.
One solution that comes to my mind is merge and query (engine='python' is needed because the default numexpr engine does not support method calls like .gt):
df.merge(df2, on='name', suffixes=('', '_y')).query('age.gt(age_y)', engine='python')[['age','name']]
Out[175]:
age name
4 55 Anna
So I have a dataframe like the following:
Name Age City
A 21 NY
A 20 DC
A 35 OR
B 18 DC
B 19 PA
I need to keep all the rows for every Name where a specific target value appears among that name's City values. For example, if my target city is NY, then my desired output would be:
Name Age City
A 21 NY
A 20 DC
A 35 OR
Edit1: I am not necessarily looking for a single value. There might be cases where there are multiple cities that I am looking for. For example: NY and DC at the same time.
Edit2: I have tried the following, which (of course) does not return the correct output:
df = df[df['City'] == 'NY']
and
df = df[df['City'].isin('NY')]
You can create a function: first test City for equality with the value, get all unique matching names, then filter by those names with isin:
def get_df_by_val(df, val):
return df[df['Name'].isin(df.loc[df['City'].eq(val), 'Name'].unique())]
print (get_df_by_val(df, 'NY'))
Name Age City
0 A 21 NY
1 A 20 DC
2 A 35 OR
print (get_df_by_val(df, 'PA'))
Name Age City
3 B 18 DC
4 B 19 PA
print (get_df_by_val(df, 'OR'))
Name Age City
0 A 21 NY
1 A 20 DC
2 A 35 OR
EDIT:
If you need to check multiple values per group, use GroupBy.transform and compare sets with issubset:
vals = ['NY', 'DC']
df1 = df[df.groupby('Name')['City'].transform(lambda x: set(vals).issubset(x))]
print (df1)
Name Age City
0 A 21 NY
1 A 20 DC
2 A 35 OR
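If instead you want the names matching any of the target cities rather than all of them (an alternative reading of Edit1; a sketch assuming the same df), combine isin with any inside the transform:

vals = ['NY', 'DC']
# Keep every row whose Name has at least one City among vals.
mask = df.groupby('Name')['City'].transform(lambda x: x.isin(vals).any())
print(df[mask])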
I want to compare the name column in two dataframes df1 and df2, output the matching rows from df1, and store the result in a new dataframe df3. How do I do this in pandas?
df1
place name qty unit
NY Tom 2 10
TK Ron 3 15
Lon Don 5 90
Hk Sam 4 49
df2
place name price
PH Tom 7
TK Ron 5
Result:
df3
place name qty unit
NY Tom 2 10
TK Ron 3 15
Option 1
Using df.isin:
In [1362]: df1[df1.name.isin(df2.name)]
Out[1362]:
place name qty unit
0 NY Tom 2 10
1 TK Ron 3 15
Option 2
Performing an inner-join with df.merge:
In [1365]: df1.merge(df2.name.to_frame())
Out[1365]:
place name qty unit
0 NY Tom 2 10
1 TK Ron 3 15
Option 3
Using df.eq (note: eq compares element-wise by index alignment, so this works here only because the matching rows happen to share index labels; prefer isin for general matching):
In [1374]: df1[df1.name.eq(df2.name)]
Out[1374]:
place name qty unit
0 NY Tom 2 10
1 TK Ron 3 15
You want something called an inner join.
df1.merge(df2,on = 'name')
place_x name qty unit place_y price
NY Tom 2 10 PH 7
TK Ron 3 15 TK 5
The _x and _y suffixes appear when a column exists in both data frames being merged.
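To recover exactly the df3 from the question, a sketch (assuming the frames above): keep df1's columns unsuffixed so they can be selected back out after the join.

# Leave df1's columns unsuffixed, then restrict the result to them.
df3 = df1.merge(df2, on='name', suffixes=('', '_y'))[df1.columns]
print(df3)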