I have two dataframes, namely 'df' and 'df1':
df
Out[14]:
first country Rating
0 Robert US 100
1 Chris Aus 99
2 Scarlett US 100
df1
Out[17]:
last Role
0 Downey IronMan
1 Hemsworth Thor
2 Johansson BlackWidow
Expected output:
first last Role Rating
0 Robert Downey IronMan 100
1 Chris Hemsworth Thor 99
2 Scarlett Johansson BlackWidow 100
I need to drop the 'country' column and replace it with the columns of another dataframe (i.e. 'df1').
I understand that I can join the dataframes and drop the 'country' column, but I need the columns in exactly this order.
IIUC:
# df has no 'Role' column, so merge on the shared row index instead
new_df = df.merge(df1, left_index=True, right_index=True).drop('country', axis=1)
new_df = new_df[['first', 'last', 'Role', 'Rating']]
Could you give this a shot?
df.join(df1, lsuffix='', rsuffix='r_')[["first", "last", "Role", "Rating"]]
Output:
first last Role Rating
0 Robert Downey IronMan 100
1 Chris Hemsworth Thor 99
2 Scarlett Johansson BlackWidow 100
@Moahmed, you can try the approach below:
df2 = pd.concat([df,df1], axis = 1)
df2 = df2[['first','last','Role','Rating']]
df2.head()
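The join and concat answers above rely on the two frames sharing the same row index, since there is no common key column. A minimal sketch of that idea, rebuilding the sample data from the question (the variable name `result` is only illustrative):

```python
import pandas as pd

# Sample data as shown in the question
df = pd.DataFrame({'first': ['Robert', 'Chris', 'Scarlett'],
                   'country': ['US', 'Aus', 'US'],
                   'Rating': [100, 99, 100]})
df1 = pd.DataFrame({'last': ['Downey', 'Hemsworth', 'Johansson'],
                    'Role': ['IronMan', 'Thor', 'BlackWidow']})

# join() aligns rows on the index; selecting a column list both drops
# 'country' and fixes the column order in one step.
result = df.join(df1)[['first', 'last', 'Role', 'Rating']]
print(result)
```

If the two frames were not in the same row order, an index-based join would pair the wrong rows, so this only applies when the indexes really do correspond.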
Related
I have two dataframes, one consists of people and scores, the other consists of each time one of the people did a thing.
df_people = pd.DataFrame(
{
'Name' : ['Angie', 'John', 'Joanne', 'Shivangi'],
'ID' : ['0021', '0022', '0023', '0024'],
'Code' : ['BHU', 'JNU', 'DU', 'BHU'],
}
)
df_actions = pd.DataFrame(
{
'ID' : ['0023', '0021', '0022', '0021'],
'Act' : ['Enter', 'Enter', 'Enter', 'Leave'],
}
)
I would like to create a column in df_people that represents the count of each time they appear in df_actions based on the shared 'ID' column.
It would look like:
   Name    ID    Code  Count
0  Angie   0023  BHU   1
1  John    0021  JNU   2
2  Joanne  0022  DU    1
3  Shivan  0024  BHU   0
I have tried just taking a value count and inserting that as a new column into df_people, but it seems very clunky.
Any advice would be much appreciated.
Another option is Series.map with Series.value_counts:
# IDs that never appear in df_actions map to NaN, so fill with 0 and cast back to int
new_df = df_people.assign(Count=df_people['ID'].map(df_actions['ID'].value_counts())
.fillna(0).astype(int))
print(new_df)
Name ID Code Count
0 Angie 0021 BHU 2
1 John 0022 JNU 1
2 Joanne 0023 DU 1
3 Shivangi 0024 BHU 0
First use GroupBy.agg to compute the counts, then merge:
(df_people
.merge(df_actions.groupby('ID')['Act'].agg(count='count'),
left_on='ID', right_index=True, how='left')
.fillna({'count': 0}).astype({'count': int})
)
output:
Name ID Code count
0 Angie 0021 BHU 2
1 John 0022 JNU 1
2 Joanne 0023 DU 1
3 Shivangi 0024 BHU 0
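Both answers boil down to counting the IDs in df_actions and aligning those counts with df_people. A self-contained sketch, using the data from the question, that computes the column both ways so the results can be compared (the names `counts`, `per_id`, `via_map` and `via_merge` are illustrative):

```python
import pandas as pd

df_people = pd.DataFrame({
    'Name': ['Angie', 'John', 'Joanne', 'Shivangi'],
    'ID': ['0021', '0022', '0023', '0024'],
    'Code': ['BHU', 'JNU', 'DU', 'BHU'],
})
df_actions = pd.DataFrame({
    'ID': ['0023', '0021', '0022', '0021'],
    'Act': ['Enter', 'Enter', 'Enter', 'Leave'],
})

# Route 1: count occurrences per ID, then look each person's ID up in the counts
counts = df_actions['ID'].value_counts()
via_map = df_people['ID'].map(counts).fillna(0).astype(int)

# Route 2: build a per-ID count table and left-merge it onto df_people
per_id = df_actions.groupby('ID').size().rename('Count').reset_index()
via_merge = (df_people.merge(per_id, on='ID', how='left')['Count']
             .fillna(0).astype(int))

print(df_people.assign(Count=via_map))
```

The left merge and the map both leave NaN for people with no actions, which is why each route ends with `fillna(0)` before casting to int.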
I am trying to map owners to an IP address through the use of two tables, df1 & df2. df1 contains the IP list to be mapped and df2 contains an IP, an alias, and the owner. After running a join on the IP column, it gives me a half joined dataframe. Most of the remaining data can be joined by replacing the NaN values with a join on the Alias column, but I can’t figure out how to do it.
My initial thoughts were to try nesting pd.merge inside fillna(), but it won't accept a dataframe. Any help would be greatly appreciated.
df1 = pd.DataFrame({'IP' : ['192.18.0.100', '192.18.0.101', '192.18.0.102', '192.18.0.103', '192.18.0.104']})
df2 = pd.DataFrame({'IP' : ['192.18.0.100', '192.18.0.101', '192.18.1.206', '192.18.1.218', '192.18.1.118'],
'Alias' : ['192.18.1.214', '192.18.1.243', '192.18.0.102', '192.18.0.103', '192.18.1.180'],
'Owner' : ['Smith, Jim', 'Bates, Andrew', 'Kline, Jenny', 'Hale, Fred', 'Harris, Robert']})
new_df = pd.DataFrame(pd.merge(df1, df2[['IP', 'Owner']], on='IP', how= 'left'))
Expected output is:
IP Owner
192.18.0.100 Smith, Jim
192.18.0.101 Bates, Andrew
192.18.0.102 Kline, Jenny
192.18.0.103 Hale, Fred
192.18.0.104 nan
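No need to merge; just pull the data where the condition is satisfied. This avoids a merge, but note that it compares df1 and df2 row by row, so it only works when both frames have the same length and every match sits at the same row position, as in the sample data.
import numpy as np
condition = (df1['IP'] == df2['IP']) | (df1['IP'] == df2['Alias'])
df1['Owner'] = np.where(condition, df2['Owner'], np.nan)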
print(df1)
IP Owner
0 192.18.0.100 Smith, Jim
1 192.18.0.101 Bates, Andrew
2 192.18.0.102 Kline, Jenny
3 192.18.0.103 Hale, Fred
4 192.18.0.104 NaN
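The np.where answer compares df1 and df2 position by position, which works for the sample because the matching rows happen to line up. A small sketch, assuming a hypothetical reordering of df2 (here simply reversed), contrasts that positional comparison with a value-based lookup; the names `shuffled`, `positional`, `lookup` and `by_value` are illustrative:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'IP': ['192.18.0.100', '192.18.0.101', '192.18.0.102',
                           '192.18.0.103', '192.18.0.104']})
df2 = pd.DataFrame({
    'IP': ['192.18.0.100', '192.18.0.101', '192.18.1.206',
           '192.18.1.218', '192.18.1.118'],
    'Alias': ['192.18.1.214', '192.18.1.243', '192.18.0.102',
              '192.18.0.103', '192.18.1.180'],
    'Owner': ['Smith, Jim', 'Bates, Andrew', 'Kline, Jenny',
              'Hale, Fred', 'Harris, Robert'],
})

# Hypothetical scenario: the same rows of df2, but in reverse order
shuffled = df2.iloc[::-1].reset_index(drop=True)

# Positional comparison: row i of df1 is only checked against row i of df2
condition = (df1['IP'] == shuffled['IP']) | (df1['IP'] == shuffled['Alias'])
positional = pd.Series(np.where(condition, shuffled['Owner'], np.nan))

# Value-based lookup: one Series mapping every IP and every Alias to its owner
lookup = pd.concat([shuffled.set_index('IP')['Owner'],
                    shuffled.set_index('Alias')['Owner']])
by_value = df1['IP'].map(lookup)

print(pd.DataFrame({'IP': df1['IP'], 'positional': positional, 'by_value': by_value}))
```

With the rows reversed, the positional version only finds the one owner whose row did not move, while the lookup still resolves the first four addresses.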
Try this one:
new_df = pd.DataFrame(pd.merge(df1, pd.concat([df2[['IP', 'Owner']], df2[['Alias', 'Owner']].rename(columns={"Alias": "IP"})]).drop_duplicates(), on='IP', how= 'left'))
The result:
>>> new_df
IP Owner
0 192.18.0.100 Smith, Jim
1 192.18.0.101 Bates, Andrew
2 192.18.0.102 Kline, Jenny
3 192.18.0.103 Hale, Fred
4 192.18.0.104 NaN
Let's melt then use map:
df1['IP'].map(df2.melt('Owner').set_index('value')['Owner'])
Output:
0 Smith, Jim
1 Bates, Andrew
2 Kline, Jenny
3 Hale, Fred
4 NaN
Name: IP, dtype: object
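The asker's idea of filling the NaN values from a second join can be sketched with two map calls: look each address up as an IP first, then fill the gaps by looking it up as an Alias. This is an illustrative variant rather than one of the answers above, and `by_ip`, `by_alias` and `owner` are hypothetical names:

```python
import pandas as pd

df1 = pd.DataFrame({'IP': ['192.18.0.100', '192.18.0.101', '192.18.0.102',
                           '192.18.0.103', '192.18.0.104']})
df2 = pd.DataFrame({
    'IP': ['192.18.0.100', '192.18.0.101', '192.18.1.206',
           '192.18.1.218', '192.18.1.118'],
    'Alias': ['192.18.1.214', '192.18.1.243', '192.18.0.102',
              '192.18.0.103', '192.18.1.180'],
    'Owner': ['Smith, Jim', 'Bates, Andrew', 'Kline, Jenny',
              'Hale, Fred', 'Harris, Robert'],
})

# Two lookup Series: owner keyed by IP and owner keyed by Alias
by_ip = df2.set_index('IP')['Owner']
by_alias = df2.set_index('Alias')['Owner']

# fillna accepts a Series, so the second lookup only fills rows the first one missed
owner = df1['IP'].map(by_ip).fillna(df1['IP'].map(by_alias))
new_df = df1.assign(Owner=owner)
print(new_df)
```

This assumes the IP and Alias values in df2 are unique; map raises an error if the lookup Series has duplicate index entries.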
If I have a dataset like this:
id person_name salary
0 [alexander, william, smith] 45000
1 [smith, robert, gates] 65000
2 [bob, alexander] 56000
3 [robert, william] 80000
4 [alexander, gates] 70000
If we sum the salary column, we get 316000.
I want to know how much each distinct name ('alexander', 'smith', etc.) makes in total, summing the salary of every row whose person_name list contains that name.
output:
group sum_salary
alexander 171000 #sum from id 0 + 2 + 4 (which contain 'alexander')
william 125000 #sum from id 0 + 3
smith 110000 #sum from id 0 + 1
robert 145000 #sum from id 1 + 3
gates 135000 #sum from id 1 + 4
bob 56000 #sum from id 2
As we can see, the total of the sum_salary column is not the same as in the initial dataset, because each salary is counted once for every name in its row.
It seems similar to a string count, but what confuses me is how to apply the aggregation function. I've tried creating a new list of the distinct values in the person_name column, but then I got stuck.
Any help is appreciated. Thank you very much.
Solutions working with lists in column person_name:
#if necessary
#df['person_name'] = df['person_name'].str.strip('[]').str.split(', ')
print (type(df.loc[0, 'person_name']))
<class 'list'>
The first idea is to use a defaultdict to store the summed values in a loop:
from collections import defaultdict
d = defaultdict(int)
# add each row's salary to the running total of every name in that row
for p, s in zip(df['person_name'], df['salary']):
    for x in p:
        d[x] += int(s)
print (d)
defaultdict(<class 'int'>, {'alexander': 171000,
'william': 125000,
'smith': 110000,
'robert': 145000,
'gates': 135000,
'bob': 56000})
And then:
df1 = pd.DataFrame({'group':list(d.keys()),
'sum_salary':list(d.values())})
print (df1)
group sum_salary
0 alexander 171000
1 william 125000
2 smith 110000
3 robert 145000
4 gates 135000
5 bob 56000
Another solution repeats each salary by the length of its list and then aggregates with sum:
from itertools import chain
df1 = pd.DataFrame({
'group' : list(chain.from_iterable(df['person_name'].tolist())),
'sum_salary' : df['salary'].values.repeat(df['person_name'].str.len())
})
df2 = df1.groupby('group', as_index=False, sort=False)['sum_salary'].sum()
print (df2)
group sum_salary
0 alexander 171000
1 william 125000
2 smith 110000
3 robert 145000
4 gates 135000
5 bob 56000
Another solution:
df_new=(pd.DataFrame({'person_name':np.concatenate(df.person_name.values),
'salary':df.salary.repeat(df.person_name.str.len())}))
print(df_new.groupby('person_name')['salary'].sum().reset_index())
person_name salary
0 alexander 171000
1 bob 56000
2 gates 135000
3 robert 145000
4 smith 110000
5 william 125000
Can be done concisely with dummies though performance will suffer due to all of the .str methods:
df.person_name.str.join('*').str.get_dummies('*').multiply(df.salary, 0).sum()
#alexander 171000
#bob 56000
#gates 135000
#robert 145000
#smith 110000
#william 125000
#dtype: int64
I parsed this as strings of lists by copying OP's data and using pandas.read_clipboard(). If the column really is a series of strings that look like lists, this solution would work:
df = df.merge(df.person_name.str.split(',', expand=True), left_index=True, right_index=True)
df = df[[0, 1, 2, 'salary']].melt(id_vars = 'salary').drop(columns='variable')
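# Some cleaning up, then a simple groupby
# regex=False treats '[' and ']' as literal characters rather than regex syntax
df.value = df.value.str.replace('[', '', regex=False)
df.value = df.value.str.replace(']', '', regex=False)
df.value = df.value.str.replace(' ', '', regex=False)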
df.groupby('value')['salary'].sum()
Output:
value
alexander 171000
bob 56000
gates 135000
robert 145000
smith 110000
william 125000
Another way to do this is with iterrows(). This will not be as fast as jezrael's solution, but it works:
ids = []
names = []
salarys = []
# Iterate over the rows and extract the names from the lists in person_name column
for ix, row in df.iterrows():
    for name in row['person_name']:
        ids.append(row['id'])
        names.append(name)
        salarys.append(row['salary'])
# Create a new 'unnested' dataframe
df_new = pd.DataFrame({'id':ids,
'names':names,
'salary':salarys})
# Groupby on person_name and get the sum
print(df_new.groupby('names').salary.sum().reset_index())
Output
names salary
0 alexander 171000
1 bob 56000
2 gates 135000
3 robert 145000
4 smith 110000
5 william 125000
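All of the answers above unnest the lists so that each name gets its own row before grouping. A minimal sketch of the same idea using DataFrame.explode (available in pandas 0.25 and later), assuming person_name holds real Python lists as in the first answer; this is a swapped-in technique rather than one of the posted solutions, and `totals` is an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [0, 1, 2, 3, 4],
    'person_name': [['alexander', 'william', 'smith'],
                    ['smith', 'robert', 'gates'],
                    ['bob', 'alexander'],
                    ['robert', 'william'],
                    ['alexander', 'gates']],
    'salary': [45000, 65000, 56000, 80000, 70000],
})

# explode() gives every list element its own row and repeats the salary,
# so a plain groupby-sum then counts each salary once per name.
totals = df.explode('person_name').groupby('person_name')['salary'].sum()
print(totals)
```

Because each salary is repeated once per name in its row, the grand total of `totals` is larger than the sum of the original salary column, which is the double counting the question describes.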
I have 3 dataframes as below
df1
id first_name surname state
1
88
190
2509
....
df2
id given_name surname state street_num
17 John Doe NY 5
88 Tom Murphy CA 423
190 Dave Casey KY 250
....
df3
id first_name family_name state car
1 John Woods NY ford
74 Tom Kite FL vw
2509 Mike Johnson KY toyota
Some id's from df1 are in df2 and others are in df3. There are also id's in df2 and df3 that are not in df1.
EDIT: there are also some ids in df1 that are not in either df2 or df3.
I want to fill the columns in df1 with the values from whichever dataframe contains the id. However, I do not want all columns (so I think merge is not suitable). I have tried to use the isin function, but that way I could not update records individually and got an error. This was my attempt using isin:
df1.loc[df1.index.isin(df2.index), 'first_name'] = df2.given_name
Is there an easy way to do this without iterating through the dataframes checking if index matches?
I think you first need to rename your columns to align the DataFrames in concat and then reindex to filter by df1.index and df1.columns:
df21 = df2.rename(columns={'given_name':'first_name'})
df31 = df3.rename(columns={'family_name':'surname'})
df = pd.concat([df21, df31]).reindex(index=df1.index, columns=df1.columns)
print (df)
first_name surname state
id
1 John Woods NY
88 Tom Murphy CA
190 Dave Casey KY
2509 Mike Johnson KY
EDIT: If you need the intersection of the indices only:
df4 = pd.concat([df21, df31])
df = df4.reindex(index=df1.index.intersection(df4.index), columns=df1.columns)
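The concat-then-reindex answer assumes that id is the index of all three frames. A runnable sketch under that assumption, rebuilding the sample rows from the question; the id 9999 is a hypothetical entry added to show what happens to an id that appears in neither df2 nor df3, and `combined`, `filled` and `common` are illustrative names:

```python
import pandas as pd

# df1 holds only the ids to fill; 9999 is a hypothetical id found nowhere else
df1 = pd.DataFrame(index=pd.Index([1, 88, 190, 2509, 9999], name='id'),
                   columns=['first_name', 'surname', 'state'])
df2 = pd.DataFrame({'given_name': ['John', 'Tom', 'Dave'],
                    'surname': ['Doe', 'Murphy', 'Casey'],
                    'state': ['NY', 'CA', 'KY'],
                    'street_num': [5, 423, 250]},
                   index=pd.Index([17, 88, 190], name='id'))
df3 = pd.DataFrame({'first_name': ['John', 'Tom', 'Mike'],
                    'family_name': ['Woods', 'Kite', 'Johnson'],
                    'state': ['NY', 'FL', 'KY'],
                    'car': ['ford', 'vw', 'toyota']},
                   index=pd.Index([1, 74, 2509], name='id'))

# Align the column names, stack both sources, then keep only df1's ids and columns
combined = pd.concat([df2.rename(columns={'given_name': 'first_name'}),
                      df3.rename(columns={'family_name': 'surname'})])
filled = combined.reindex(index=df1.index, columns=df1.columns)

# Variant from the EDIT: keep only ids that exist in at least one source
common = df1.index.intersection(combined.index)
only_found = combined.reindex(index=common, columns=df1.columns)
print(filled)
```

Note that reindex needs the combined index to be unique, so this sketch assumes no id appears in both df2 and df3.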
I am trying to 'broadcast' a date column from df1 to df2.
In df1 I have the names of all the users and their basic information.
In df2 I have a list of purchases made by the users.
[image: df1 and df2 code]
Assuming I have a much bigger dataset (the above created for sample) how can I add just(!) the df1['DoB'] column to df2?
I have tried both concat() and merge(), but neither seems to work:
[image: code and error]
The only way that seems to work is to merge df1 and df2 together and then delete the columns I don't need. But if I have tens of unwanted columns, that is going to be very problematic.
The full code (including the lines that throw an error):
import pandas as pd
df1 = pd.DataFrame(columns=['Name','Age','DoB','HomeTown'])
df1['Name'] = ['John', 'Jack', 'Wendy','Paul']
df1['Age'] = [25,23,30,31]
df1['DoB'] = pd.to_datetime(['04-01-2012', '03-02-1991', '04-10-1986', '06-03-1985'], dayfirst=True)
df1['HomeTown'] = ['London', 'Brighton', 'Manchester', 'Jersey']
df2 = pd.DataFrame(columns=['Name','Purchase'])
df2['Name'] = ['John','Wendy','John','Jack','Wendy','Jack','John','John']
df2['Purchase'] = ['fridge','coffee','washingmachine','tickets','iPhone','stove','notebook','laptop']
df2 = df2.concat(df1) # error
df2 = df2.merge(df1['DoB'], on='Name', how='left') #error
df2 = df2.merge(df1, on='Name', how='left')
del df2['Age'], df2['HomeTown']
df2 # that's how I want it to look
Any help would be much appreciated. Thank you :)
I think you need merge with the subset [['Name','DoB']]; the Name column is needed for matching:
print (df1[['Name','DoB']])
Name DoB
0 John 2012-01-04
1 Jack 1991-02-03
2 Wendy 1986-10-04
3 Paul 1985-03-06
df2 = df2.merge(df1[['Name','DoB']], on='Name', how='left')
print (df2)
Name Purchase DoB
0 John fridge 2012-01-04
1 Wendy coffee 1986-10-04
2 John washingmachine 2012-01-04
3 Jack tickets 1991-02-03
4 Wendy iPhone 1986-10-04
5 Jack stove 1991-02-03
6 John notebook 2012-01-04
7 John laptop 2012-01-04
Another solution uses map with the Series s:
s = df1.set_index('Name')['DoB']
print (s)
Name
John 2012-01-04
Jack 1991-02-03
Wendy 1986-10-04
Paul 1985-03-06
Name: DoB, dtype: datetime64[ns]
df2['DoB'] = df2.Name.map(s)
print (df2)
Name Purchase DoB
0 John fridge 2012-01-04
1 Wendy coffee 1986-10-04
2 John washingmachine 2012-01-04
3 Jack tickets 1991-02-03
4 Wendy iPhone 1986-10-04
5 Jack stove 1991-02-03
6 John notebook 2012-01-04
7 John laptop 2012-01-04
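Both approaches assume that each Name appears only once in df1. A hedged sketch of the merge variant that makes this assumption explicit through merge's `validate` argument, using the data from the question (`result` is an illustrative name):

```python
import pandas as pd

df1 = pd.DataFrame({
    'Name': ['John', 'Jack', 'Wendy', 'Paul'],
    'Age': [25, 23, 30, 31],
    'DoB': pd.to_datetime(['04-01-2012', '03-02-1991', '04-10-1986', '06-03-1985'],
                          dayfirst=True),
    'HomeTown': ['London', 'Brighton', 'Manchester', 'Jersey'],
})
df2 = pd.DataFrame({
    'Name': ['John', 'Wendy', 'John', 'Jack', 'Wendy', 'Jack', 'John', 'John'],
    'Purchase': ['fridge', 'coffee', 'washingmachine', 'tickets',
                 'iPhone', 'stove', 'notebook', 'laptop'],
})

# Pass only the key column plus the one column to bring over.
# validate='many_to_one' makes pandas check that 'Name' is unique in the
# right-hand frame, so a duplicated name cannot silently multiply rows.
result = df2.merge(df1[['Name', 'DoB']], on='Name', how='left',
                   validate='many_to_one')
print(result)
```

If a name were duplicated in df1, the merge would raise a MergeError instead of returning extra rows, which can be easier to debug on a larger dataset.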