Merge two dataframes based on a column - python

I want to compare the name column in two dataframes, df1 and df2, output the matching rows from dataframe df1, and store the result in a new dataframe df3. How do I do this in pandas?
df1
place name qty unit
NY Tom 2 10
TK Ron 3 15
Lon Don 5 90
Hk Sam 4 49
df2
place name price
PH Tom 7
TK Ron 5
Result:
df3
place name qty unit
NY Tom 2 10
TK Ron 3 15

Option 1
Using df.isin:
In [1362]: df1[df1.name.isin(df2.name)]
Out[1362]:
place name qty unit
0 NY Tom 2 10
1 TK Ron 3 15
Option 2
Performing an inner-join with df.merge:
In [1365]: df1.merge(df2.name.to_frame())
Out[1365]:
place name qty unit
0 NY Tom 2 10
1 TK Ron 3 15
Option 3
Using df.eq (note: this compares row-by-row on the index, so it only works here because the matching rows happen to share index positions; prefer isin or merge for a general lookup):
In [1374]: df1[df1.name.eq(df2.name)]
Out[1374]:
place name qty unit
0 NY Tom 2 10
1 TK Ron 3 15
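The options above can be checked with a small runnable sketch; the frames below reconstruct the example data from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'place': ['NY', 'TK', 'Lon', 'Hk'],
                    'name': ['Tom', 'Ron', 'Don', 'Sam'],
                    'qty': [2, 3, 5, 4],
                    'unit': [10, 15, 90, 49]})
df2 = pd.DataFrame({'place': ['PH', 'TK'],
                    'name': ['Tom', 'Ron'],
                    'price': [7, 5]})

# Option 1: keep the df1 rows whose name appears anywhere in df2
df3 = df1[df1['name'].isin(df2['name'])]
print(df3)
```

Bracket access (`df1['name']`) is used instead of attribute access (`df1.name`) since it also works for column names that clash with DataFrame attributes.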

You want something called an inner join.
df1.merge(df2,on = 'name')
place_x name qty unit place_y price
NY Tom 2 10 PH 7
TK Ron 3 15 TK 5
The _x and _y suffixes appear when a column (here, place) exists in both data frames being merged.
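If the default _x/_y suffixes are not wanted, merge accepts a suffixes parameter; a short sketch with the example frames (the suffix strings here are arbitrary choices):

```python
import pandas as pd

df1 = pd.DataFrame({'place': ['NY', 'TK'], 'name': ['Tom', 'Ron'],
                    'qty': [2, 3], 'unit': [10, 15]})
df2 = pd.DataFrame({'place': ['PH', 'TK'], 'name': ['Tom', 'Ron'],
                    'price': [7, 5]})

# 'place' exists in both frames, so it gets suffixed; 'name' is the join key
merged = df1.merge(df2, on='name', suffixes=('_df1', '_df2'))
print(merged.columns.tolist())
```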

Related

In-place update in pandas: update the value of the cell based on a condition

DOB Name
0 1956-10-30 Anna
1 1993-03-21 Jerry
2 2001-09-09 Peter
3 1993-01-15 Anna
4 1999-05-02 James
5 1962-12-17 Jerry
6 1972-05-04 Kate
In a dataframe like the one above, I have duplicate names. I want to add the suffix '_0' to a name if its DOB is before 1990 and the name is a duplicate.
I am expecting a result like this
DOB Name
0 1956-10-30 Anna_0
1 1993-03-21 Jerry
2 2001-09-09 Peter
3 1993-01-15 Anna
4 1999-05-02 James
5 1962-12-17 Jerry_0
6 1972-05-04 Kate
I am using the following
df['Name'] = df[(df['DOB'] < '01-01-1990') & (df['Name'].isin(['Anna','Jerry']))].Name.apply(lambda x: x+'_0')
But I am getting this result
DOB Name
0 1956-10-30 Anna_0
1 1993-03-21 NaN
2 2001-09-09 NaN
3 1993-01-15 NaN
4 1999-05-02 NaN
5 1962-12-17 Jerry_0
6 1972-05-04 NaN
How can I add a suffix to a Name that is a duplicate and whose DOB is before 1990?
The problem in your df['Name'] = df[(df['DOB'] < '01-01-1990') & (df['Name'].isin(['Anna','Jerry']))].Name.apply(lambda x: x+'_0') is that df[(df['DOB'] < '01-01-1990') & (df['Name'].isin(['Anna','Jerry']))] is a filtered dataframe with fewer rows than the original. When you assign it back, the rows that were filtered out have no corresponding value in the filtered dataframe, so they become NaN.
You can try mask instead
m = (df['DOB'] < '1990-01-01') & df['Name'].duplicated(keep=False)
df['Name'] = df['Name'].mask(m, df['Name']+'_0')
You can use masks and boolean indexing:
# is the year before 1990?
m1 = pd.to_datetime(df['DOB']).dt.year.lt(1990)
# is the name duplicated?
m2 = df['Name'].duplicated(keep=False)
# if both conditions are True, add '_0' to the name
df.loc[m1&m2, 'Name'] += '_0'
output:
DOB Name
0 1956-10-30 Anna_0
1 1993-03-21 Jerry
2 2001-09-09 Peter
3 1993-01-15 Anna
4 1999-05-02 James
5 1962-12-17 Jerry_0
6 1972-05-04 Kate
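Both answers can be verified end to end with a runnable sketch; the dates and names below are copied from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'DOB': ['1956-10-30', '1993-03-21', '2001-09-09', '1993-01-15',
            '1999-05-02', '1962-12-17', '1972-05-04'],
    'Name': ['Anna', 'Jerry', 'Peter', 'Anna', 'James', 'Jerry', 'Kate'],
})

# condition 1: born before 1990
m1 = pd.to_datetime(df['DOB']).dt.year.lt(1990)
# condition 2: the name occurs more than once
m2 = df['Name'].duplicated(keep=False)
# only rows satisfying both get the suffix; all other rows are untouched
df.loc[m1 & m2, 'Name'] += '_0'
print(df['Name'].tolist())
```

Because .loc assigns only to the selected rows, the unselected rows keep their original values instead of becoming NaN.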

How do I increase an element value from a column in pandas?

Hello, I have this pandas code (see below), but it turns out it gives me this error: TypeError: can only concatenate str (not "int") to str
import pandas as pd
import numpy as np
import os
_data0 = pd.read_excel("C:\\Users\\HP\\Documents\\DataScience task\\Gender_Age.xlsx")
_data0['Age' + 1]
I want to change the element values in the column 'Age'. Imagine I wanted to increase the column elements of 'Age' by some amount; how do I do that? (With 'Number of Children' as well.)
The output I wanted:
First Name Last Name Age Number of Children
0 Kimberly Watson 36 2
1 Victor Wilson 35 6
2 Adrian Elliott 35 2
3 Richard Bailey 36 5
4 Blake Roberts 35 6
Original output:
First Name Last Name Age Number of Children
0 Kimberly Watson 24 1
1 Victor Wilson 23 5
2 Adrian Elliott 23 1
3 Richard Bailey 24 4
4 Blake Roberts 23 5
Try:
df['Age'] = df['Age'] + 12
df['Number of Children'] = df['Number of Children'] + 1
(Your desired output is 12 higher for Age and 1 higher for Number of Children than the original, so the constants are added, not subtracted.)
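Column arithmetic in pandas is vectorized, so adding a scalar to a column updates every element at once; a minimal sketch using the values from the question's original output:

```python
import pandas as pd

df = pd.DataFrame({'Age': [24, 23, 23, 24, 23],
                   'Number of Children': [1, 5, 1, 4, 5]})

# adding a scalar applies element-wise to the whole column
df['Age'] = df['Age'] + 12
df['Number of Children'] += 1
print(df)
```

The original error came from 'Age' + 1: that is string-plus-int concatenation on the column *label*, not arithmetic on the column's values.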

Merge multiple rows in pandas for all columns

Suppose we have a data frame with 1000 rows and 100 columns. The first column is the names and the rest are values or empty. Many rows have the same name. How can I add them and have each name once with the summation of the values?
For example the name Alex on the first row has the values 20, 30, 40 and on 2 other rows again we have Alex with values 10,10,20 respectively. So my new data frame should only have the row Alex just once with values 40, 50, 80
EDIT : First of all thank you all for your feedback. Sorry if I was not clear. Imagine I have the following matrix
Names Last name price1 price2 price3 (no named column)
-------------------------------------------------------------------------
Alex Robinson 10 20 30 (a string)
Bill Towns 10 40 50 (empty)
Alex Robinson 30 10 20 (empty)
George Leopold 10 10 10 (empty)
Alex Robinson 20 20 20 (empty)
The result I want:
Names Last name price1 price2 price3 (no named column)
---------------------------------------------------------------------------
Alex Robinson 60 50 70 (a string)
Bill Towns 10 40 50 (empty)
George Leopold 10 10 10 (empty)
But instead of 3 columns imagine I have 100. Thus I cannot do them explicitly by their name for example
EDIT2: I forgot to mention that some rows also contain a string. Unfortunately, I get an error with this command:
df8 = data.groupby('Name').sum()
I have already sorted the dataframe with this command
data2 = data.sort_values('Name',ascending=True).reset_index(drop=True)
Here's the code that will sum your score:
import pandas as pd
data = [['alan',10],['tom',23],['nick',22],['alan',11]]
df = pd.DataFrame(data,columns=['name','score'])
df = df.groupby(['name'], as_index=False)['score'].sum()
print(df)
The results:
Before:
name score
0 alan 10
1 tom 23
2 nick 22
3 alan 11
And after:
name score
0 alan 21
1 nick 22
2 tom 23
You can do it with df.groupby
df = df.groupby('Names').sum().reset_index()
Output
Names price1 price2 price3
0 Alex 60 50 70
1 Bill 10 40 50
2 George 10 10 10
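Regarding the error in EDIT2: non-numeric columns can make .sum() fail (or silently concatenate strings). One hedged workaround is to restrict the sum to numeric columns; a sketch using the example matrix from the question:

```python
import pandas as pd

data = pd.DataFrame({
    'Names': ['Alex', 'Bill', 'Alex', 'George', 'Alex'],
    'Last name': ['Robinson', 'Towns', 'Robinson', 'Leopold', 'Robinson'],
    'price1': [10, 10, 30, 10, 20],
    'price2': [20, 40, 10, 10, 20],
    'price3': [30, 50, 20, 10, 20],
})

# numeric_only=True sums only the numeric columns;
# string columns like 'Last name' are dropped from the result
df8 = data.groupby('Names', as_index=False).sum(numeric_only=True)
print(df8)
```

This works regardless of how many price columns there are, since no column is named explicitly.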

Get latest value looked up from other dataframe

My first data frame
product=pd.DataFrame({
'Product_ID':[101,102,103,104,105,106,107,101],
'Product_name':['Watch','Bag','Shoes','Smartphone','Books','Oil','Laptop','New Watch'],
'Category':['Fashion','Fashion','Fashion','Electronics','Study','Grocery','Electronics','Electronics'],
'Price':[299.0,1350.50,2999.0,14999.0,145.0,110.0,79999.0,9898.0],
'Seller_City':['Delhi','Mumbai','Chennai','Kolkata','Delhi','Chennai','Bengalore','New York']
})
My 2nd data frame has transactions
customer=pd.DataFrame({
'id':[1,2,3,4,5,6,7,8,9],
'name':['Olivia','Aditya','Cory','Isabell','Dominic','Tyler','Samuel','Daniel','Jeremy'],
'age':[20,25,15,10,30,65,35,18,23],
'Product_ID':[101,0,106,0,103,104,0,0,107],
'Purchased_Product':['Watch','NA','Oil','NA','Shoes','Smartphone','NA','NA','Laptop'],
'City':['Mumbai','Delhi','Bangalore','Chennai','Chennai','Delhi','Kolkata','Delhi','Mumbai']
})
I want the Price from the 1st data frame to come into the merged dataframe, the common element being 'Product_ID'. Note that against Product_ID 101 there are 2 prices: 299.00 and 9898.00. I want the latter one in the merged data set, i.e. 9898.0, since this is the latest price.
Currently my code is not giving the right answer; it is returning both rows:
customerpur = pd.merge(customer,product[['Price','Product_ID']], on="Product_ID", how = "left")
customerpur
id name age Product_ID Purchased_Product City Price
0 1 Olivia 20 101 Watch Mumbai 299.0
1 1 Olivia 20 101 Watch Mumbai 9898.0
There is no explicit timestamp so I assume the index is the order of the dataframe. You can drop duplicates at the end:
customerpur.drop_duplicates(subset = ['id'], keep = 'last')
result:
id name age Product_ID Purchased_Product City Price
1 1 Olivia 20 101 Watch Mumbai 9898.0
2 2 Aditya 25 0 NA Delhi NaN
3 3 Cory 15 106 Oil Bangalore 110.0
4 4 Isabell 10 0 NA Chennai NaN
5 5 Dominic 30 103 Shoes Chennai 2999.0
6 6 Tyler 65 104 Smartphone Delhi 14999.0
7 7 Samuel 35 0 NA Kolkata NaN
8 8 Daniel 18 0 NA Delhi NaN
9 9 Jeremy 23 107 Laptop Mumbai 79999.0
Please note the keep = 'last' argument, since we are keeping only the last price registered.
Deduplication should be done before merging if you care about performance or the dataset is huge:
product = product.drop_duplicates(subset = ['Product_ID'], keep = 'last')
In your data frame there is no indicator of the latest entry, so you might need to first remove the first entry for Product_ID 101 from the product dataframe, as follows:
result_product = product.drop_duplicates(subset=['Product_ID'], keep='last')
It will keep the last entry based on Product_ID and you can do the merge as:
pd.merge(result_product, customer, on='Product_ID')
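Putting the two steps together (dedupe first, then a left merge), a sketch with the question's frames trimmed to the columns relevant to the lookup:

```python
import pandas as pd

product = pd.DataFrame({
    'Product_ID': [101, 102, 103, 104, 105, 106, 107, 101],
    'Price': [299.0, 1350.50, 2999.0, 14999.0, 145.0, 110.0, 79999.0, 9898.0],
})
customer = pd.DataFrame({
    'id': [1, 5, 9],
    'Product_ID': [101, 103, 107],
})

# keep only the last (assumed latest) price per product,
# then look it up for each customer with a left merge
latest = product.drop_duplicates(subset=['Product_ID'], keep='last')
merged = customer.merge(latest, on='Product_ID', how='left')
print(merged)
```

Deduplicating before the merge also avoids producing the duplicate rows in the first place, rather than dropping them afterwards.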

get the distinct column values and union the dataframes

I am trying to convert this SQL statement:
SELECT distinct table1.[Name],table1.[Phno]
FROM table1
union
select distinct table2.[Name],table2.[Phno] from table2
UNION
select distinct table3.[Name],table3.[Phno] from table3;
Now I have three dataframes: table1, table2, table3.
table1
Name Phno
0 Andrew 6175083617
1 Andrew 6175083617
2 Frank 7825942358
3 Jerry 3549856785
4 Liu 9659875695
table2
Name Phno
0 Sandy 7859864125
1 Nikhil 9526412563
2 Sandy 7859864125
3 Tina 7459681245
4 Surat 9637458725
table3
Name Phno
0 Patel 9128257489
1 Mary 3679871478
2 Sandra 9871359654
3 Mary 3679871478
4 Hali 9835167465
Now I need to get the distinct values of these dataframes and union them, so the output would be:
sample output
Name Phno
0 Andrew 6175083617
1 Frank 7825942358
2 Jerry 3549856785
3 Liu 9659875695
4 Sandy 7859864125
5 Nikhil 9526412563
6 Tina 7459681245
7 Surat 9637458725
8 Patel 9128257489
9 Mary 3679871478
10 Sandra 9871359654
11 Hali 9835167465
I tried to get the unique values for one dataframe table1 as shown below:
table1_unique = pd.unique(table1.values.ravel()) #which gives me
table1_unique
array(['Andrew', 6175083617L, 'Frank', 7825942358L, 'Jerry', 3549856785L,
'Liu', 9659875695L], dtype=object)
But I get them as an array. I even tried converting them to a dataframe using:
table1_unique1 = pd.DataFrame(table1_unique)
table1_unique1
0
0 Andrew
1 6175083617
2 Frank
3 7825942358
4 Jerry
5 3549856785
6 Liu
7 9659875695
How do I get unique values in a dataframe so that I can concat them as per my sample output? Hope this is clear. Thanks!!
a = table1[['Name','Phno']].drop_duplicates()
b = table2[['Name','Phno']].drop_duplicates()
c = table3[['Name','Phno']].drop_duplicates()
result = pd.concat([a, b, c]).drop_duplicates().reset_index(drop=True)
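Note that SQL UNION removes duplicates across all tables, not just within each, so deduplicating after the concat is what guarantees UNION semantics; a runnable sketch with shortened sample data (Frank appears in two tables to show the cross-table case):

```python
import pandas as pd

table1 = pd.DataFrame({'Name': ['Andrew', 'Andrew', 'Frank'],
                       'Phno': [6175083617, 6175083617, 7825942358]})
table2 = pd.DataFrame({'Name': ['Sandy', 'Sandy', 'Frank'],
                       'Phno': [7859864125, 7859864125, 7825942358]})
table3 = pd.DataFrame({'Name': ['Patel', 'Mary'],
                       'Phno': [9128257489, 3679871478]})

# concat stacks the frames; drop_duplicates then mimics SQL UNION
result = (pd.concat([table1, table2, table3])
            .drop_duplicates()
            .reset_index(drop=True))
print(result)
```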
