I have two pandas DataFrames of different sizes, each with over 1 million records.
I am looking to compare these two DataFrames and identify the differences.
DataFrameA
ID Name Age Sex
1A1 Cling 21 M
1B2 Roger 22 M
1C3 Stew 23 M
DataFrameB
ID FullName Gender Age
1B2 Roger M 21
1C3 Rick M 23
1D4 Ash F 21
DataFrameB will always have more records than DataFrameA, but records found in DataFrameA may no longer be in DataFrameB.
The column names in DataFrameA and DataFrameB are different. I have the mapping stored in a separate dataframe.
MappingDataFrame
DataFrameACol DataFrameBCol
ID ID
Name FullName
Age Age
Sex Gender
I am looking to compare the mapped columns and add a result column next to each pair.
Column name suffix for DataFrameA = "_A_Txt"
Column name suffix for DataFrameB = "_B_Txt"
ExpectedOutput
ID Name_A_Txt FullName_B_Txt Result_Name Age_A_Txt Age_B_Txt Result_Age
1B2 Roger Roger Match ... ...
1C3 Stew Rick No Match ... ...
Each column name should have the corresponding suffix text appended, as shown above.
I am using a for loop at the moment to build this logic, but 1 million records take ages to complete. I left the program running for more than 50 minutes and it had not finished; in the real data I am building this for more than 100 columns.
I will open a bounty for this question and award it even if the question was answered before the bounty opened, as a reward, because I have really been struggling with the performance of the for-loop iteration.
To start with DataFrameA and DataFrameB, use the code below:
import pandas as pd

d = {
    'ID': ['1A1', '1B2', '1C3'],
    'Name': ['Cling', 'Roger', 'Stew'],
    'Age': [21, 22, 23],
    'Sex': ['M', 'M', 'M']
}
DataFrameA = pd.DataFrame(d)

d = {
    'ID': ['1B2', '1C3', '1D4'],
    'FullName': ['Roger', 'Rick', 'Ash'],
    'Gender': ['M', 'M', 'F'],
    'Age': [21, 23, 21]
}
DataFrameB = pd.DataFrame(d)
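The answer below refers to the mapping as mapping_df; a minimal construction of it from the MappingDataFrame table above (the variable name is taken from the answer's code) would be:
mapping_df = pd.DataFrame({
    'DataFrameACol': ['ID', 'Name', 'Age', 'Sex'],
    'DataFrameBCol': ['ID', 'FullName', 'Age', 'Gender']
})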
I believe this question is a bit different from the suggestion (the reference on joins) provided by Coldspeed, as this also involves looking up different column names and adding a new result column alongside. The column names also need to be transformed on the result side.
The output DataFrame looks as below. For the readers' better understanding, I am putting the column names in rows, in order:
Col 1 - ID (coming from DataFrameA)
Col 2 - Name_X (coming from DataFrameA)
Col 3 - FullName_Y (coming from DataFrameB)
Col 4 - Result_Name (Name is the column name in DataFrameA; this is a comparison between Name_X and FullName_Y)
Col 5 - Age_X (coming from DataFrameA)
Col 6 - Age_Y (coming from DataFrameB)
Col 7 - Result_Age (Age is the column name in DataFrameA; this is a comparison between Age_X and Age_Y)
Col 8 - Sex_X (coming from DataFrameA)
Col 9 - Gender_Y (coming from DataFrameB)
Col 10 - Result_Sex (Sex is the column name in DataFrameA; this is a comparison between Sex_X and Gender_Y)
First, turn the mapping into a list of (DataFrameA, DataFrameB) column pairs, dropping the join key ID (note that Series.iteritems was removed in pandas 2.0; items does the same thing):
m = list(mapping_df.set_index('DataFrameACol')['DataFrameBCol']
             .drop('ID')
             .items())
# The merge below suffixes the shared 'Age' column with _x/_y, so adjust the pair.
m[m.index(('Age', 'Age'))] = ('Age_x', 'Age_y')
m
# [('Name', 'FullName'), ('Age_x', 'Age_y'), ('Sex', 'Gender')]
Start with an inner merge (here df1 and df2 stand for DataFrameA and DataFrameB):
df3 = (df1.merge(df2, how='inner', on=['ID'])
          .reindex(columns=['ID', *(col for pair in m for col in pair)]))
df3
ID Name FullName Age_x Age_y Sex Gender
0 1B2 Roger Roger 22 21 M M
1 1C3 Stew Rick 23 23 M M
Now, compare the columns and map the boolean result to 'Match'/'No Match' labels:
l, r = map(list, zip(*m))  # left (df1) and right (df2) column names
matches = (df3[l].eq(df3[r].rename(dict(zip(r, l)), axis=1))
               .add_prefix('Result_')
               .replace({True: 'Match', False: 'No Match'}))
# Insert each result column directly after the pair of columns it compares.
for k, v in m:
    name = f'Result_{k}'
    df3.insert(df3.columns.get_loc(v) + 1, name, matches[name])
df3.columns
# Index(['ID', 'Name', 'FullName', 'Result_Name', 'Age_x', 'Age_y',
# 'Result_Age_x', 'Sex', 'Gender', 'Result_Sex'],
# dtype='object')
df3.filter(like='Result_')
Result_Name Result_Age_x Result_Sex
0 Match No Match Match
1 No Match Match Match
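The inner merge keeps only the IDs present in both frames. If you also need to know which records exist on one side only, a minimal sketch using merge's indicator flag (standard pandas, applied to the same df1/df2) could look like this:
# Outer merge on ID with an indicator column that marks each ID's origin:
# 'both', 'left_only' (only in DataFrameA) or 'right_only' (only in DataFrameB).
presence = df1[['ID']].merge(df2[['ID']], on='ID', how='outer', indicator=True)
only_in_a = presence.loc[presence['_merge'] == 'left_only', 'ID']
only_in_b = presence.loc[presence['_merge'] == 'right_only', 'ID']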
Related
I have a dataframe:
df1 = pd.DataFrame({'id': ['1','2','2','3','3','4','4'],
'name': ['James','Jim','jimy','Daniel','Dane','Ash','Ash'],
'event': ['Basket','Soccer','Soccer','Basket','Soccer','Basket','Soccer']})
I want to count unique values of id, but together with the name; the result I expect is:
id name count
1 James 1
2 Jim, jimy 2
3 Daniel, Dane 2
4 Ash 2
I tried to group by id and name but it doesn't count as I expected.
You could try:
df1.groupby('id').agg(
    name=('name', lambda x: ', '.join(x.unique())),
    count=('name', 'count')
)
We are basically grouping by id and then joining the unique names into a comma-separated list!
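On the sample df1 this produces:
            name  count
id
1          James      1
2      Jim, jimy      2
3   Daniel, Dane      2
4            Ash      2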
Here is a solution:
groups = df1[["id", "name"]].groupby("id")
a = groups.agg(lambda x: ", ".join(set(x)))
b = groups.size().rename("count")
c = pd.concat([a, b], axis=1)
I'm not an expert when it comes to pandas but I thought I might as well post my solution because I think that it's straightforward and readable.
In your example, the groupby is done on the id column and not by id and name. The name column you see in your expected DataFrame is the result of an aggregation done after a groupby.
Here, it is obvious that the groupby was done on the id column.
My solution is maybe not the most straightforward but I still find it to be more readable:
Create a groupby object groups by grouping by id
Create a DataFrame a from groups by aggregating it using commas (you also need to remove the duplicates using set(...) ): lambda x: ", ".join( set(x) )
The DataFrame a will thus have the following data:
name
id
1 James
2 Jim, jimy
3 Daniel, Dane
4 Ash
Create another DataFrame b by computing the size of each group in groups: groups.size() (you should also rename your column)
id
1 1
2 2
3 2
4 2
Name: count, dtype: int64
Concat a and b horizontally and you get what you wanted
name count
id
1 James 1
2 Jim, jimy 2
3 Daniel, Dane 2
4 Ash 2
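If you want id back as a regular column (as in the expected output above), one extra call does it:
c = c.reset_index()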
I am using the code below to search a .csv file, match a column in both files, and grab a different column I want, adding it as a new column. However, I am trying to make the match based on two columns instead of one. Is there a way to do this?
import pandas as pd
df1 = pd.read_csv("matchone.csv")
df2 = pd.read_csv("comingfrom.csv")
def lookup_prod(ip):
    for row in df2.itertuples():
        if ip in row[1]:
            return row[3]
        else:
            return '0'

df1['want'] = df1['name'].apply(lookup_prod)
df1[df1.want != '0']
print(df1)
# df1.to_csv('file_name.csv')
The code above matches on the same-named column ('name') in both files and gets the column I request (row[3]) from df2. I want the code to match on both the name column and another column, price, and only take the value in row[3] when both columns match in df1 and df2.
df 1 :
name price value
a 10 35
b 10 21
c 10 33
d 10 20
e 10 88
df 2 :
name price want
a 10 123
b 5 222
c 10 944
d 10 104
e 5 213
When the code is run (asking for the want column from df2, based only on df1 name = df2 name), the produced result is:
name price value want
a 10 35 123
b 10 21 222
c 10 33 944
d 10 20 104
e 10 88 213
However, what I want is: if both df1 name = df2 name and df1 price = df2 price, then take the value from df2 want. The desired result is:
name price value want
a 10 35 123
b 10 21 0
c 10 33 944
d 10 20 104
e 10 88 0
You need to use pandas.DataFrame.merge() method with multiple keys:
df1.merge(df2, on=['name','price'], how='left').fillna(0)
The merge represents missing values as NaN, so the column's dtype changes to float64, but you can change it back after filling the missing values with 0.
Also, please be aware that duplicated combinations of name and price in df2 will appear several times in the result.
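A small sketch of that dtype fix, using the df1/df2 from this question:
merged = df1.merge(df2, on=['name', 'price'], how='left')
merged['want'] = merged['want'].fillna(0).astype(int)  # back to an integer dtype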
If you are matching the two dataframes based on the name and the price, you can use df.where and df.isin
df1['want'] = df2['want'].where(df1[['name','price']].isin(df2).all(axis=1)).fillna('0')
df1
name price value want
0 a 10 35 123.0
1 b 10 21 0
2 c 10 33 944.0
3 d 10 20 104.0
4 e 10 88 0
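Note that this approach relies on df1 and df2 having the same index and row order: when DataFrame.isin is given a DataFrame, it compares element-wise on matching index and column labels. For arbitrary row order, prefer the merge-based approaches.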
Expanding on https://stackoverflow.com/a/73830294/20110802:
You can add the validate option to the merge in order to avoid duplication on one side (or both):
pd.merge(df1, df2, on=['name','price'], how='left', validate='1:1').fillna(0)
Also, if the float conversion is a problem for you, one option is to do an inner join first and then pd.concat the result with the "leftover" part of df1, to which you add a constant-valued want column. It would look something like:
df_inner = pd.merge(df1, df2, on=['name', 'price'], how='inner', validate='1:1')
merged_pairs = set(zip(df_inner.name, df_inner.price))
# .copy() avoids a SettingWithCopyWarning on the assignment below
df_anti = df1.loc[~pd.Series(zip(df1.name, df1.price)).isin(merged_pairs)].copy()
df_anti['want'] = 0
df_result = pd.concat([df_inner, df_anti])  # perhaps ignore_index=True ?
It looks complicated, but it should be quite performant because it filters by set. I think there might be a possibility to set name and price as the index, merge on the index, and then filter by index to avoid the zip-set shenanigans, but I'm no expert on multiindex handling.
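For what it's worth, a sketch of that index-based variant (an assumption on my part, not something the answer tested):
# Join on a ('name', 'price') MultiIndex instead of zipping tuples.
keyed = df1.set_index(['name', 'price']).join(
    df2.set_index(['name', 'price']), how='left')
keyed['want'] = keyed['want'].fillna(0).astype(int)  # sidesteps the float conversion
df_result = keyed.reset_index()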
Try this code; it will give you the expected results:
import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e'],
                    'price': [10, 10, 10, 10, 10],
                    'value': [35, 21, 33, 20, 88]})
df2 = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e'],
                    'price': [10, 5, 10, 10, 5],
                    'want': [123, 222, 944, 104, 213]})

new = pd.merge(df1, df2, how='left', left_on=['name', 'price'], right_on=['name', 'price'])
print(new.fillna(0))
My pandas dataframe currently has a column titled BinLocation that contains the location of a material in a warehouse. For example:
If a part is located in column A02, row 33, and then level B21 then the BinLocation ID is A02033B21.
For some entries, the format may be A0233B21. The naming convention is not consistent, but that was not up to me, and now I have to clean the data up.
I want to split the string such that for any given input for the BinLocation, I can return the column, row and level. Ultimately, I want to create 3 new columns for the dataframe (column, row, level).
In case it is not clear, the general structure of the ID is ColumnChar_ColumnInt_RowInt_LevelChar_LevelInt.
Now,for some BinLocations, the ID is separated by a hyphen so I wrote this code for those:
def forHyphenRow(s):
    return s.split('-')[1]

def forHyphenColumn(s):
    return s.split('-')[0]

def forHyphenLevel(s):
    return s.split('-')[2]
How do I do the same but for the other IDs?
Also, is there any way to group the entries in the dataframe together (so all the A02 bins are grouped together, all the CB-22 bins are grouped together, etc.)?
Here is an answer that:
uses Python regular expression syntax to parse your ID (handles cases with and without hyphens and can be tweaked to accommodate other quirks of historical IDs if needed)
puts the ID in a regularized format
adds columns for the ID components
sorts based on the ID components so rows are "grouped" together (though not in the "groupby" sense of pandas)
import re
import pandas as pd

df = pd.DataFrame({'BinLocation': ['A0233B21', 'A02033B21', 'A02-033-B21', 'A02-33-B21',
                                   'A02-33-B15', 'A02-30-B21', 'A01-33-B21']})
print(df)
print()
df['RawBinLocation'] = df['BinLocation']

def parse(s):
    m = re.match('^([A-Z])([0-9]{2})-?([0-9]+)-?([A-Z])([0-9]{2})$', s)
    if not m:
        return None
    tup = m.groups()
    colChar, colInt, rowInt, levelChar, levelInt = tup[0], int(tup[1]), int(tup[2]), tup[3], int(tup[4])
    return pd.Series((colChar, colInt, rowInt, levelChar, levelInt))

df[['ColChar', 'ColInt', 'RowInt', 'LevChar', 'LevInt']] = df['BinLocation'].apply(parse)
df['BinLocation'] = df.apply(lambda x: f"{x.ColChar}{x.ColInt:02}-{x.RowInt:03}-{x.LevChar}{x.LevInt:02}", axis=1)
df.sort_values(by=['ColChar', 'ColInt', 'RowInt', 'LevChar', 'LevInt'], inplace=True, ignore_index=True)
print(df)
Output:
BinLocation
0 A0233B21
1 A02033B21
2 A02-033-B21
3 A02-33-B21
4 A02-33-B15
5 A02-30-B21
6 A01-33-B21
BinLocation RawBinLocation ColChar ColInt RowInt LevChar LevInt
0 A01-033-B21 A01-33-B21 A 1 33 B 21
1 A02-030-B21 A02-30-B21 A 2 30 B 21
2 A02-033-B15 A02-33-B15 A 2 33 B 15
3 A02-033-B21 A0233B21 A 2 33 B 21
4 A02-033-B21 A02033B21 A 2 33 B 21
5 A02-033-B21 A02-033-B21 A 2 33 B 21
6 A02-033-B21 A02-33-B21 A 2 33 B 21
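As a side note, the same regular expression can be applied without a Python-level apply by using str.extract with named groups; a sketch under the same assumptions (all IDs match the pattern):
pattern = (r'^(?P<ColChar>[A-Z])(?P<ColInt>[0-9]{2})'
           r'-?(?P<RowInt>[0-9]+)-?(?P<LevChar>[A-Z])(?P<LevInt>[0-9]{2})$')
parts = df['RawBinLocation'].str.extract(pattern)  # non-matching IDs become NaN rows
parts[['ColInt', 'RowInt', 'LevInt']] = parts[['ColInt', 'RowInt', 'LevInt']].astype(int)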
If there will always be the first three characters of a string as Column, and last three as Level (and therefore Row as everything in-between):
def forNotHyphenColumn(s):
    return s[:3]

def forNotHyphenLevel(s):
    return s[-3:]

def forNotHyphenRow(s):
    return s[3:-3]
Then, you could sort your DataFrame by Column by creating separate DataFrame columns for the BinLocation items and using df.sort_values():
df = pd.DataFrame(data={"BinLocation": ["A02033B21", "C02044C12", "A0233B21"]})
# Create dataframe columns for BinLocation items
df["Column"] = df["BinLocation"].apply(lambda x: forNotHyphenColumn(x))
df["Row"] = df["BinLocation"].apply(lambda x: forNotHyphenRow(x))
df["Level"] = df["BinLocation"].apply(lambda x: forNotHyphenLevel(x))
# Sort values
df.sort_values(by=["Column"], ascending=True, inplace=True)
df
#Out:
# BinLocation Column Row Level
#0 A02033B21 A02 033 B21
#2 A0233B21 A02 33 B21
#1 C02044C12 C02 044 C12
EDIT:
To also use the hyphen functions in the apply():
df = pd.DataFrame(data={"BinLocation": ["A02033B21", "C02044C12", "A0233B21", "A01-33-C13"]})
# Create dataframe columns for BinLocation items
df["Column"] = df["BinLocation"].apply(lambda x: forHyphenColumn(x) if "-" in x else forNotHyphenColumn(x))
df["Row"] = df["BinLocation"].apply(lambda x: forHyphenRow(x) if "-" in x else forNotHyphenRow(x))
df["Level"] = df["BinLocation"].apply(lambda x: forHyphenLevel(x) if "-" in x else forNotHyphenLevel(x))
# Sort values
df.sort_values(by=["Column"], ascending=True, inplace=True)
df
#Out:
# BinLocation Column Row Level
#3 A01-33-C13 A01 33 C13
#0 A02033B21 A02 033 B21
#2 A0233B21 A02 33 B21
#1 C02044C12 C02 044 C12
I have the following Data Frame:
import pandas as pd
df = {'Country': ['A', 'A', 'B', 'B', 'B'],
      'MY_Product': ['NS_1', 'SY_1', 'BMX_3', 'NS_5', 'NK'],
      'Cost': [5, 35, 34, 45, 9],
      'Competidor_Country_2': ['A', 'A', 'B', 'B', 'B'],
      'Competidor_Product_2': ['BMX_2', 'TM_0', 'NS_6', 'SY_8', 'NA'],
      'Competidor_Cost_2': [35, 20, 65, 67, 90]}
df_new = pd.DataFrame(df, columns=['Country', 'MY_Product', 'Cost', 'Competidor_Country_2',
                                   'Competidor_Product_2', 'Competidor_Cost_2'])
print(df_new)
Information:
My products must start with "NS", "SY", "NK" or "NA";
The first three columns hold information about my products and the last three about the competitor's product;
I did not include all cases, to keep the exercise simple.
Problem:
As you can see in the third row, there is a product that is not mine ("BMX_3") while the competitor's product is one of mine... So I would like to swap not only the product but the other competitor columns too, leaving the first three columns with my product and the last three with the competitor's.
Considerations:
if the two products in the line are both my products (last row for example), I don't need to do anything (but if possible, leave a code comment on how to drop this comparison, just in case)
If I understand you right, you want to swap the values of the three columns whenever the product in MY_Product isn't yours:
# create a mask
mask = ~df_new.MY_Product.str.contains(r"^(?:NS|SY|NK|NA)")
# swap the values of the three columns:
vals = df_new.loc[mask, ["Country", "MY_Product", "Cost"]].values
df_new.loc[mask, ["Country", "MY_Product", "Cost"]] = df_new.loc[
    mask, ["Competidor_Country_2", "Competidor_Product_2", "Competidor_Cost_2"]
].values
df_new.loc[
    mask, ["Competidor_Country_2", "Competidor_Product_2", "Competidor_Cost_2"]
] = vals
# print the dataframe
print(df_new)
Prints:
Country MY_Product Cost Competidor_Country_2 Competidor_Product_2 Competidor_Cost_2
0 A NS_1 5 A BMX_2 35
1 A SY_1 35 A TM_0 20
2 B NS_6 65 B BMX_3 34
3 B NS_5 45 B SY_8 67
4 B NK 9 B NA 90
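Note that the .values on the right-hand side is what makes the swap work: it strips the column labels, so pandas assigns positionally instead of trying to align Country with Competidor_Country_2 (which would produce NaNs).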
I want to cut a certain part of a string (applied to multiple columns, and it differs for each column) when one column contains a particular substring.
Example:
Assume the following dataframe
import pandas as pd
df = pd.DataFrame({'name': ['Allan2', 'Mike39', 'Brenda4', 'Holy5'],
                   'Age': [30, 20, 25, 18],
                   'Zodiac': ['Aries', 'Leo', 'Virgo', 'Libra'],
                   'Grade': ['A', 'AB', 'B', 'AA'],
                   'City': ['Aura', 'Somerville', 'Hendersonville', 'Gannon'],
                   'pahun': ['a_b_c', 'c_d_e', 'f_g', 'h_i_j']})
print(df)
Out:
name Age Zodiac Grade City pahun
0 Allan2 30 Aries A Aura a_b_c
1 Mike39 20 Leo AB Somerville c_d_e
2 Brenda4 25 Virgo B Hendersonville f_g
3 Holy5 18 Libra AA Gannon h_i_j
For example, if an entry of column City ends with 'e', cut the last three letters of column City and the last two letters of column name.
What I tried so far is something like this:
df['City'] = df['City'].apply(lambda x: df['City'].str[:-3] if df.City.str.endswith('e'))
That doesn't work, and I also don't really know how to cut letters in the other columns while using the same if clause.
I'm happy for any help I get.
Thank you
You can mark the rows whose City ends with 'e', then update them with loc:
mask = df['City'].str[-1] == 'e'
df.loc[mask, 'City'] = df.loc[mask, 'City'].str[:-3]
df.loc[mask, 'name'] = df.loc[mask, 'name'].str[:-2]
Output:
name Age Zodiac Grade City pahun
0 Allan2 30 Aries A Aura a_b_c
1 Mike 20 Leo AB Somervi c_d_e
2 Brend 25 Virgo B Hendersonvi f_g
3 Holy5 18 Libra AA Gannon h_i_j
import pandas as pd

df = pd.DataFrame({'name': ['Allan2', 'Mike39', 'Brenda4', 'Holy5'],
                   'Age': [30, 20, 25, 18],
                   'Zodiac': ['Aries', 'Leo', 'Virgo', 'Libra'],
                   'Grade': ['A', 'AB', 'B', 'AA'],
                   'City': ['Aura', 'Somerville', 'Hendersonville', 'Gannon'],
                   'pahun': ['a_b_c', 'c_d_e', 'f_g', 'h_i_j']})

def func(row):
    index = row.name  # row.name is the row's index label, not the 'name' column
    if row['City'][-1] == 'e':  # check the last letter of City; adapt the condition here
        df.at[index, 'City'] = df['City'][index][:-3]   # drop the last three letters
        df.at[index, 'name'] = df['name'][index][:-2]   # drop the last two letters

df.apply(func, axis=1)
print(df)
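As a side note, this variant mutates df from inside apply, relying on a per-row side effect; the loc-based answer above performs the same updates vectorized, which is usually clearer and faster on larger frames.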