Hello, I have the Pandas code below, but it gives me this error: TypeError: can only concatenate str (not "int") to str
import pandas as pd
import numpy as np
import os
_data0 = pd.read_excel("C:\\Users\\HP\\Documents\\DataScience task\\Gender_Age.xlsx")
_data0['Age' + 1]
I want to change the values in the 'Age' column. Suppose I want to increase every element of 'Age' by 1; how do I do that? (And likewise for 'Number of Children'.)
The output I wanted:
First Name Last Name Age Number of Children
0 Kimberly Watson 36 2
1 Victor Wilson 35 6
2 Adrian Elliott 35 2
3 Richard Bailey 36 5
4 Blake Roberts 35 6
Original output:
First Name Last Name Age Number of Children
0 Kimberly Watson 24 1
1 Victor Wilson 23 5
2 Adrian Elliott 23 1
3 Richard Bailey 24 4
4 Blake Roberts 23 5
Try:
# df is your DataFrame (_data0 in the question)
df['Age'] = df['Age'] + 12
df['Number of Children'] = df['Number of Children'] + 1
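For context, the TypeError in the question comes from `'Age' + 1` being evaluated inside the brackets, so Python tries to add an integer to a string before pandas ever sees the column name. A minimal sketch of the arithmetic, assuming a small stand-in frame (the real data comes from the Excel file) and the +12 / +1 increments implied by the desired output:

```python
import pandas as pd

# Stand-in for the Excel data: first three rows of the question's original output.
df = pd.DataFrame({
    'First Name': ['Kimberly', 'Victor', 'Adrian'],
    'Last Name': ['Watson', 'Wilson', 'Elliott'],
    'Age': [24, 23, 23],
    'Number of Children': [1, 5, 1],
})

# Select the column first, then do the arithmetic on the resulting Series.
df['Age'] = df['Age'] + 12
df['Number of Children'] += 1  # augmented assignment works as well

print(df)
```

The operation is vectorized, so no loop over the rows is needed.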
I have a sample dataset here. In the real case there are train and test datasets, each with around 300 columns and 800 rows. I want to select all rows where one column has a certain value, and then set the values in those rows from, e.g., column 3 to column 50 to zero. How can I do it?
Sample dataset:
import pandas as pd
data = {'Name':['Jai', 'Princi', 'Gaurav','Princi','Anuj','Nancy'],
'Age':[27, 24, 22, 32,66,43],
'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj', 'Katauj', 'vbinauj'],
'Payment':[15,20,40,50,3,23],
'Qualification':['Msc', 'MA', 'MCA', 'Phd','MA','MS']}
df = pd.DataFrame(data)
df
Here is the output of sample dataset:
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 Kanpur 20 MA
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 Kannauj 50 Phd
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS
As you can see, some values in the first column are "Princi". For the rows where the Name column equals "Princi", I want to set the "Address" and "Payment" columns to zero.
Here is the expected output:
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 0 0 MA #this row
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 0 0 Phd #this row
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS
In my real dataset, I tried:
train.loc[:, 'got':'tod']  # selects all those columns
and
train.loc[df['column_wanted'] == "that value"]  # selects all those rows
But how can I combine them? Thanks for your help!
Use the loc accessor: df.loc[boolean selection, columns]
df.loc[df['Name'].eq('Princi'),'Address':'Payment']=0
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 0 0 MA
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 0 0 Phd
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS
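If the target columns are easier to identify by position than by name (the question mentions "column 3 to column 50"), one option is to slice df.columns and pass the result to loc. A sketch under assumptions: `train` here is a small hypothetical numeric frame standing in for the real 300-column dataset, and positions 2 to 4 stand in for the real range.

```python
import pandas as pd

# Hypothetical frame standing in for the real training data.
train = pd.DataFrame({
    'Name': ['Jai', 'Princi', 'Gaurav', 'Princi'],
    'Age': [27, 24, 22, 32],
    'f1': [1, 2, 3, 4],
    'f2': [5, 6, 7, 8],
    'f3': [9, 10, 11, 12],
})

mask = train['Name'].eq('Princi')  # rows to modify
cols = train.columns[2:5]          # labels at positions 2, 3, 4 -> f1, f2, f3

# Row selection and column selection combined in a single loc assignment.
train.loc[mask, cols] = 0

print(train)
```

Unlike a label slice such as 'Address':'Payment', a positional slice of df.columns excludes its end position.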
A question about selecting rows based on values in two DataFrames.
>>> df[['age','name']]
age name
0 44 Anna
1 22 Bob
2 33 Cindy
3 44 Danis
4 55 Cindy
5 66 Danis
6 11 Anna
7 43 Bob
8 12 Cindy
9 19 Danis
10 11 Anna
11 32 Anna
12 55 Anna
13 33 Anna
14 32 Anna
>>> df2[['age','name']]
age name
5 66 Danis
4 55 Cindy
0 44 Anna
7 43 Bob
The expected result is all rows of df whose 'age' is higher than the 'age' in df2 for the same 'name'.
Expected result:
age name
12 55 Anna
Per comments, use merge and filter dataframe:
df.merge(df2, on='name', suffixes=('', '_y')).query('age > age_y')[['name','age']]
Output:
name age
4 Anna 55
IIUC, you can use this to find the max age of all names:
pd.concat([df,df2]).groupby('name')['age'].max()
Output:
name
Anna 55
Bob 43
Cindy 55
Danis 66
Name: age, dtype: int64
Try this:
index = df[df['age'] > age].index
df.loc[index]
There are a few edge cases you don't say how you would like to resolve, but generally what you want to do is iterate down the df, compare ages, and keep the larger. You could do so in the following manner:
rows = []
# Compare by position; stop at the shorter frame so the lookup never runs out of range.
for x in range(min(len(df), len(df2))):
    if df['age'].iloc[x] > df2['age'].iloc[x]:
        rows.append({'age': df['age'].iloc[x], 'name': df['name'].iloc[x]})
    else:
        rows.append({'age': df2['age'].iloc[x], 'name': df2['name'].iloc[x]})
df3 = pd.DataFrame(rows, columns=['age', 'name'])
You will still need to modify this to reflect how you want to resolve names that are only in one list, or lists of different sizes.
One solution that comes to mind is merge and drop
df.merge(df2, on='name', suffixes=('', '_y')).query('age.gt(age_y)', engine='python')[['age','name']]
Out[175]:
age name
4 55 Anna
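Note that merge returns a freshly numbered index, so the row labels in the merged result no longer match those of df. If the original labels matter, a possible alternative is to look up each name's threshold with Series.map and filter df directly. A sketch, assuming every name appears at most once in df2:

```python
import pandas as pd

df = pd.DataFrame({
    'age': [44, 22, 33, 44, 55, 66, 11, 43, 12, 19, 11, 32, 55, 33, 32],
    'name': ['Anna', 'Bob', 'Cindy', 'Danis', 'Cindy', 'Danis', 'Anna', 'Bob',
             'Cindy', 'Danis', 'Anna', 'Anna', 'Anna', 'Anna', 'Anna'],
})
df2 = df.loc[[5, 4, 0, 7]]  # same four rows as df2 in the question

# For each row of df, fetch the age recorded in df2 for the same name.
threshold = df['name'].map(df2.set_index('name')['age'])

# Keep only rows whose age exceeds that per-name threshold.
result = df[df['age'] > threshold]

print(result)
```

Names missing from df2 map to NaN, and a comparison against NaN is False, so such rows are simply dropped.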
I come from a SQL background and am new to Python. I have been trying to figure out how to solve this particular problem for a while now and have been unable to come up with anything.
Here are my dataframes
from pandas import DataFrame
import numpy as np
Names1 = {'First_name': ['Jon','Bill','Billing','Maria','Martha','Emma']}
df = DataFrame(Names1,columns=['First_name'])
print(df)
names2 = {'name': ['Jo', 'Bi', 'Ma']}
df_2 = DataFrame(names2,columns=['name'])
print(df_2)
This results in:
First_name
0 Jon
1 Bill
2 Billing
3 Maria
4 Martha
5 Emma
name
0 Jo
1 Bi
2 Ma
This code helps me identify which First_name values in df start with one of the strings in df_2:
df['like_flg'] = np.where(df['First_name'].str.startswith(tuple(list(df_2['name']))), 'true', df['First_name'])
which results in this:
First_name like_flg
0 Jon true
1 Bill true
2 Billing true
3 Maria true
4 Martha true
5 Emma Emma
I would like the final output of the dataframe to set the like_flg to the value of the tuple in which the First_name field is being conditionally compared against. See below for final desired output:
First_name like_flg
0 Jon Jo
1 Bill Bi
2 Billing Bi
3 Maria Ma
4 Martha Ma
5 Emma Emma
Here's what I've tried so far
df['like_flg'] = np.where(df['First_name'].str.startswith(tuple(list(df_2['name']))), tuple(list(df_2['name'])), df['First_name'])
which results in this error:
`ValueError: operands could not be broadcast together with shapes (6,) (3,) (6,)`
I've also tried aligning both dataframes, however, that won't work for the use case that I'm trying to achieve.
Is there a way to conditionally align dataframes to fill in the columns that start with the tuple?
I believe the issue I'm facing is that the tuple or dataframe that I'm using as a comparison is not the same size as the dataframe that I want to append the tuple to. Please see above for the desired output.
Thank you all in advance!
If your starting strings differ in length, you can use .str.extract
df['like_flag'] = df['First_name'].str.extract('^('+'|'.join(df_2.name)+')')
df['like_flag'] = df['like_flag'].fillna(df.First_name) # Fill non matches.
I modified df_2 to be
name
0 Jo
1 Bi
2 Mar
which leads to:
First_name like_flag
0 Jon Jo
1 Bill Bi
2 Billing Bi
3 Maria Mar
4 Martha Mar
5 Emma Emma
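One caveat with building the pattern by joining raw strings: if a prefix contains a regex metacharacter (such as "." or "+"), it will be interpreted as regex syntax. A hedged variant of the same idea that escapes each prefix with re.escape and passes expand=False so that str.extract returns a Series:

```python
import re
import pandas as pd

df = pd.DataFrame({'First_name': ['Jon', 'Bill', 'Billing', 'Maria', 'Martha', 'Emma']})
df_2 = pd.DataFrame({'name': ['Jo', 'Bi', 'Mar']})  # the modified df_2 from this answer

# Escape each prefix so it is matched literally, then anchor at the start of the string.
pattern = '^(' + '|'.join(re.escape(p) for p in df_2['name']) + ')'

# expand=False returns a Series for a single capture group; non-matches are NaN.
matched = df['First_name'].str.extract(pattern, expand=False)
df['like_flag'] = matched.fillna(df['First_name'])

print(df)
```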
You can use np.where,
df['like_flg'] = np.where(df.First_name.str[:2].isin(df_2.name), df.First_name.str[:2], df.First_name)
First_name like_flg
0 Jon Jo
1 Bill Bi
2 Billing Bi
3 Maria Ma
4 Martha Ma
5 Emma Emma
You can do this with NumPy's np.char.find:
v=df.First_name.values.astype(str)
s=df_2.name.values.astype(str)
df_2.name.dot((np.char.find(v,s[:,None])==0))
array(['Jo', 'Bi', 'Bi', 'Ma', 'Ma', ''], dtype=object)
Then we just assign it back
df['New']=df_2.name.dot((np.char.find(v,s[:,None])==0))
df.loc[df['New']=='','New']=df.First_name
df
First_name New
0 Jon Jo
1 Bill Bi
2 Billing Bi
3 Maria Ma
4 Martha Ma
5 Emma Emma
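The approaches above differ in their assumptions: slicing with str[:2] only works when every prefix is two characters long. If prefixes can have different lengths or overlap, a plain-Python sketch that picks the longest matching prefix may be easier to reason about. Here `longest_prefix` is a hypothetical helper written for illustration, not a pandas function:

```python
import pandas as pd

df = pd.DataFrame({'First_name': ['Jon', 'Bill', 'Billing', 'Maria', 'Martha', 'Emma']})
df_2 = pd.DataFrame({'name': ['Jo', 'Bi', 'Ma']})

# Try longer prefixes first so the most specific match wins.
prefixes = sorted(df_2['name'], key=len, reverse=True)

def longest_prefix(value):
    """Return the longest prefix that value starts with, or value itself if none match."""
    for p in prefixes:
        if value.startswith(p):
            return p
    return value

df['like_flg'] = df['First_name'].map(longest_prefix)

print(df)
```

This runs a Python-level loop per row, so it trades speed for clarity on large frames.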
I have a data frame below that shows the price of wood and steel from two different suppliers.
I would like to add a column that shows the highest price for the opposite item (i.e. if line is wood, it would pull steel) from the same supplier.
For example, the "Steel" row for "Tom" would show his highest wood price which is 42.
The code I have so far simply returns the highest price for the original item (i.e. not the opposite, so for Tom's steel row returns 24 but I would have wanted it to return 42).
I think this is an issue with pulling the max value for a multi-group. I have tried a number of different ways but just cannot seem to get it.
Any thoughts would be greatly appreciated.
import pandas as pd
import numpy as np
data = {'Supplier':['Tom', 'Tom', 'Tom', 'Bill','Bill','Bill'],'Item':['Wood','Wood','Steel','Steel','Steel','Wood'],'Price':[42,33,24,16,12,18]}
df = pd.DataFrame(data)
df['Opp_Item'] = np.where(df['Item']=="Wood", "Steel", "Wood")
df['Opp_Item_Max'] = df.groupby(['Supplier','Opp_Item'])['Price'].transform(max)
print(df)
Supplier Item Price Opp_Item Opp_Item_Max
0 Tom Wood 42 Steel 42
1 Tom Wood 33 Steel 42
2 Tom Steel 24 Wood 24
3 Bill Steel 16 Wood 16
4 Bill Steel 12 Wood 16
5 Bill Wood 18 Steel 18
If you can find the per supplier+item maximum, then you can just swap the values and assign them back through a join:
v = df.groupby(['Supplier', 'Item']).Price.max().unstack(-1)
# This reversal operation works under the assumption that
# there are only two items and that they are opposites of each other.
v[:] = v.values[:, ::-1]
df = (df.set_index(['Supplier', 'Item'])
.join(v.stack().to_frame('Opp_Item_Max'), how='left')
.reset_index())
print(df)
Supplier Item Price Opp_Item_Max
0 Bill Steel 16 18
1 Bill Steel 12 18
2 Bill Wood 18 16
3 Tom Steel 24 42
4 Tom Wood 42 24
5 Tom Wood 33 24
Note: Ordering of your data will not be preserved after the join.
You could map to the opposite values before a groupby, and then merge this back to the original DataFrame.
d = {'Steel': 'Wood', 'Wood': 'Steel'}
df.merge(df.assign(Item = df.Item.map(d))
.groupby(['Supplier', 'Item'], as_index=False).max(),
on=['Supplier', 'Item'],
how='left',
suffixes=['', '_Opp_Item'])
Supplier Item Price Price_Opp_Item
0 Tom Wood 42 24
1 Tom Wood 33 24
2 Tom Steel 24 42
3 Bill Steel 16 18
4 Bill Steel 12 18
5 Bill Wood 18 16
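If the original row order and a single new column are both wanted, another possible route is to compute the per-(Supplier, Item) maximum once and then look it up with the opposite item as the key. A sketch, assuming (as in the answers above) that there are exactly two items that are opposites of each other:

```python
import pandas as pd

data = {'Supplier': ['Tom', 'Tom', 'Tom', 'Bill', 'Bill', 'Bill'],
        'Item': ['Wood', 'Wood', 'Steel', 'Steel', 'Steel', 'Wood'],
        'Price': [42, 33, 24, 16, 12, 18]}
df = pd.DataFrame(data)

# Highest price per supplier and item, indexed by (Supplier, Item).
max_price = df.groupby(['Supplier', 'Item'])['Price'].max()

# For each row, build the lookup key (Supplier, opposite Item).
opposite = df['Item'].map({'Wood': 'Steel', 'Steel': 'Wood'})
keys = pd.MultiIndex.from_arrays([df['Supplier'], opposite])

# reindex looks up one value per key; to_numpy() assigns by position, keeping df's row order.
df['Opp_Item_Max'] = max_price.reindex(keys).to_numpy()

print(df)
```

A supplier with no rows for the opposite item would get NaN in the new column.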
I'm working on a dataset called gradedata.csv in Python Pandas where I've created a new binned column called 'Status' as 'Pass' if grade > 70 and 'Fail' if grade <= 70. Here is the listing of first five rows of the dataset:
fname lname gender age exercise hours grade \
0 Marcia Pugh female 17 3 10 82.4
1 Kadeem Morrison male 18 4 4 78.2
2 Nash Powell male 18 5 9 79.3
3 Noelani Wagner female 14 2 7 83.2
4 Noelani Cherry female 18 4 15 87.4
address status
0 9253 Richardson Road, Matawan, NJ 07747 Pass
1 33 Spring Dr., Taunton, MA 02780 Pass
2 41 Hill Avenue, Mentor, OH 44060 Pass
3 8839 Marshall St., Miami, FL 33125 Pass
4 8304 Charles Rd., Lewis Center, OH 43035 Pass
Now, how do I compute the mean hours of exercise of female students with a 'status' of 'Pass'?
I've used the below code, but it isn't working.
print(df.groupby('gender', 'status')['exercise'].mean())
I'm new to Pandas. Could anyone please help me solve this?
You are very close. Note that your groupby key must be one of mapping, function, label, or list of labels. In this case, you want a list of labels. For example:
res = df.groupby(['gender', 'status'])['exercise'].mean()
You can then extract your desired result via pd.Series.get:
query = res.get(('female', 'Pass'))
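To see the pieces working together, here is a small sketch that uses only the five rows shown in the question (so the numbers are illustrative, not the result for the full gradedata.csv). It also shows a boolean-mask alternative that skips the groupby when only one group is needed:

```python
import pandas as pd

# The five sample rows from the question, reduced to the relevant columns.
df = pd.DataFrame({
    'gender': ['female', 'male', 'male', 'female', 'female'],
    'exercise': [3, 4, 5, 2, 4],
    'grade': [82.4, 78.2, 79.3, 83.2, 87.4],
})
df['status'] = df['grade'].gt(70).map({True: 'Pass', False: 'Fail'})

# Grouped means, then pick out the (female, Pass) group.
res = df.groupby(['gender', 'status'])['exercise'].mean()
by_group = res.get(('female', 'Pass'))

# Alternative: filter the rows directly with a boolean mask.
mask = (df['gender'] == 'female') & (df['status'] == 'Pass')
by_mask = df.loc[mask, 'exercise'].mean()

print(by_group, by_mask)
```

Series.get returns None rather than raising if the requested group does not exist, which can be convenient when a combination might be absent.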