How to map categorical columns from two dataframes? - python

I have two dataframes. In df1 I have a 'categories' column, and in df2 I have a 'names' column with "sub-categorical" data; I want to match each name to its category.
I listed the matching categories below in the matching_cat variable. I don't know if it is the right way to associate my variable with a category.
I want to create a new dataframe which compares these two dataframes on a common id and on the columns categories and names.
import pandas as pd

data1 = {'key_col': ['12563', '12564', '12565', '12566'],
         'categories': ['bird', 'dog', 'insect', 'insect'],
         'column3': ['some', 'other', 'data', 'there']}
df1 = pd.DataFrame(data1)
df1

data2 = {'key_col': ['12563', '12564', '12565', '12566', '12567'],
         'names': ['falcon', 'golden retriever', 'doberman', 'spider', 'spider'],
         'column_randomn': ['some', 'data', 'here', 'here', 'here']}
df2 = pd.DataFrame(data2)
df2

matching_cat = {'bird': ['falcon', 'toucan', 'eagle'],
                'dog': ['golden retriever', 'doberman'],
                'insect': ['spider', 'mosquito']}
So, here are my two dataframes (df1 and df2 above), and I want to be able to "map" the names to the categories and output a comparison that flags whether each row's name belongs to its category.

OK, using your example, here is what I came up with:
import pandas as pd

data1 = {'key_col': ['12563', '12564', '12565', '12566'],
         'categories': ['bird', 'dog', 'insect', 'insect'],
         'column3': ['some', 'other', 'data', 'there']}
df1 = pd.DataFrame(data1)

data2 = {'key_col': ['12563', '12564', '12565', '12566', '12567'],
         'names': ['falcon', 'golden retriever', 'doberman', 'spider', 'spider'],
         'column_randomn': ['some', 'data', 'here', 'here', 'here']}
df2 = pd.DataFrame(data2)

matching_category = {'bird': ['falcon', 'toucan', 'eagle'],
                     'dog': ['golden retriever', 'doberman'],
                     'insect': ['spider', 'mosquito']}

# Function to compare each row's name against matching_category
def test(row):
    try:
        return row['names'] in matching_category[row['categories']]
    except KeyError:
        # the category is missing from the dict (e.g. NaN after the outer merge)
        return False
# Merge the two dataframes based on 'key_col'
df3 = df1.merge(df2, how='outer', on='key_col')

# Call the test function on each row in the merged dataframe (df3)
df3['new_col'] = df3.apply(test, axis=1)

# Drop unwanted columns
df3 = df3.drop(['column3', 'column_randomn'], axis=1)

# Create new dataframes based on whether the names match or not
output_matches = df3[df3['new_col']]
output_mismatches = df3[~df3['new_col']]

# Display the dataframes
print('OUTPUT MATCHES:')
print('================================================')
print(output_matches)
print("")
print('OUTPUT MIS-MATCHES:')
print('================================================')
print(output_mismatches)
OUTPUT:

OUTPUT MATCHES:
================================================
  key_col categories             names  new_col
0   12563       bird            falcon     True
1   12564        dog  golden retriever     True
3   12566     insect            spider     True

OUTPUT MIS-MATCHES:
================================================
  key_col categories     names  new_col
2   12565     insect  doberman    False
4   12567        NaN    spider    False
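As a possible vectorized alternative (my own sketch, not from the original answer; it assumes each name appears under exactly one category, as in the example), you can invert matching_category into a name-to-category lookup and compare it with Series.map:

# Invert the category -> names mapping into a name -> category lookup
reverse_map = {name: cat
               for cat, names in matching_category.items()
               for name in names}

df3 = df1.merge(df2, how='outer', on='key_col')
# a row matches when the category looked up from its name equals its 'categories' value
df3['new_col'] = df3['names'].map(reverse_map) == df3['categories']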

Related

How to create a new column conditional on two other columns in python?

I want to create a new column conditional on two other columns in python.
Below is the dataframe:

name    address
apple   hello1234
banana  happy111
apple   str3333
pie     diary5144

I want to create a new column "want", conditional on the columns "name" and "address".
The rules are as follows:
(1) If the value in "name" is apple, then the value in "want" should be the first five letters in column "address".
(2) If the value in "name" is banana, then the value in "want" should be the first four letters in column "address".
(3) If the value in "name" is pie, then the value in "want" should be the first three letters in column "address".
The dataframe I want looks like this:

name    address    want
apple   hello1234  hello
banana  happy111   happ
apple   str3333    str33
pie     diary5144  dia

How do I address such a problem? Thanks!
I hope you are well,
import pandas as pd

# Initialize data of lists.
data = {'Name': ['Apple', 'Banana', 'Apple', 'Pie'],
        'Address': ['hello1234', 'happy111', 'str3333', 'diary5144']}

# Create DataFrame
df = pd.DataFrame(data)

# Add an empty column
df['Want'] = ''

for i in range(len(df)):
    if df['Name'].iloc[i] == "Apple":
        df.loc[i, 'Want'] = df['Address'].iloc[i][:5]
    if df['Name'].iloc[i] == "Banana":
        df.loc[i, 'Want'] = df['Address'].iloc[i][:4]
    if df['Name'].iloc[i] == "Pie":
        df.loc[i, 'Want'] = df['Address'].iloc[i][:3]

# Print the DataFrame
print(df)
I hope it helps,
Have a lovely day
I think a broader way of doing this is by creating a conditional map dict and applying it with lambda functions on your dataset.
Creating the dataset:
import pandas as pd
data = {
    'name': ['apple', 'banana', 'apple', 'pie'],
    'address': ['hello1234', 'happy111', 'str3333', 'diary5144']
}
df = pd.DataFrame(data)
Defining the conditional dict:
conditionalMap = {
    'apple': lambda s: s[:5],
    'banana': lambda s: s[:4],
    'pie': lambda s: s[:3]
}
Applying the map:
df.loc[:, 'want'] = df.apply(lambda row: conditionalMap[row['name']](row['address']), axis=1)
With the resulting df:

     name    address   want
0   apple  hello1234  hello
1  banana   happy111   happ
2   apple    str3333  str33
3     pie  diary5144    dia
You could do the following:
for string, length in {"apple": 5, "banana": 4, "pie": 3}.items():
    mask = df["name"].eq(string)
    df.loc[mask, "want"] = df.loc[mask, "address"].str[:length]
Iterate over the 3 conditions: string is the string on which the length requirement depends, and the length requirement is stored in length.
Build a mask via df["name"].eq(string) which selects the rows with value string in column name.
Then set column want at those rows to the adequately clipped column address values.
Result for the sample dataframe:
name address want
0 apple hello1234 hello
1 banana happy111 happ
2 apple str3333 str33
3 pie diary5144 dia
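For completeness, the same conditional column can be built without an explicit Python loop using numpy.select (a sketch of my own, not part of the original answers, reusing the df defined above):

import numpy as np

conditions = [df['name'].eq('apple'), df['name'].eq('banana'), df['name'].eq('pie')]
choices = [df['address'].str[:5], df['address'].str[:4], df['address'].str[:3]]
# rows matching no condition fall back to an empty string
df['want'] = np.select(conditions, choices, default='')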

Compare df's including detailed insight in data

I have a python project:
df_testR with columns={'Name', 'City','Licence', 'Amount'}
df_testF with columns={'Name', 'City','Licence', 'Amount'}
I want to compare both df's. The result should be a df where I see the Name, City, Licence and the Amount. Normally, df_testR and df_testF should be exactly the same.
In case they are not, I want to see the difference as Amount_R vs Amount_F.
I referred to: Diff between two dataframes in pandas
But I receive a table with TRUE and FALSE only:

Name  City  Licence  Amount
True  True  True     False
But I'd like to get a table that lists ONLY the lines where differences occur, and that shows the differences in the data, like this:

Name  City  Licence  Amount_R  Amount_F
Paul  NY    YES      200       500
Here, both tables contain Paul, NY and Licence = YES, but table R contains 200 as Amount and table F contains 500. I want to receive a table from my analysis that captures only the lines where such differences occur.
Could someone help?
import copy
import pandas as pd

data1 = {'Name': ['A', 'B', 'C'], 'City': ['SF', 'LA', 'NY'],
         'Licence': ['YES', 'NO', 'NO'], 'Amount': [100, 200, 300]}
data2 = copy.deepcopy(data1)
data2.update({'Amount': [500, 200, 300]})

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df2.drop(1, inplace=True)
First find the missing rows and print them:
matching = df1.isin(df2)
meta_data_columns = ['Name', 'City', 'Licence']
metadata_match = matching[meta_data_columns].copy()
metadata_match['check'] = metadata_match.apply(all, axis=1, raw=True)
missing_rows = list(metadata_match.index[~metadata_match['check']])
if missing_rows:
    print('Some rows are missing from df2:')
    print(df1.iloc[missing_rows, :])
Then drop these rows and merge:
df3 = pd.merge(df2, df1.drop(missing_rows), on=meta_data_columns)
Now remove the rows that have the same amount:
df_different_amounts = df3.loc[df3['Amount_x'] != df3['Amount_y'], :]
I assumed the DFs are sorted.
If you're dealing with very large DFs it might be better to first filter the DFs to make the merge faster.
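For the side-by-side Amount_R/Amount_F output the asker described, a direct merge with suffixes is a common pattern (a minimal sketch reusing the df1/df2 built above; the suffix names are my choice):

key_cols = ['Name', 'City', 'Licence']
compared = df1.merge(df2, on=key_cols, suffixes=('_R', '_F'))
# keep only the rows where the two Amounts disagree
print(compared.loc[compared['Amount_R'] != compared['Amount_F']])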

Pandas: Create a new column with column name and cell of matching string

I am searching through a large spreadsheet with 300 columns and over 200k rows. I would like to create a column that has the column header and matching cell value, something that looks like "Column||Value". I have the search term and the join aggregator. I can get the row index name, but I'm struggling to get the matching column and specific cell. Here's my code so far:
df = pd.read_excel(r"Test_file")
mask = df.astype(str).applymap(lambda x: any(y in x for y in ['Chann', 'Midm'])).any(axis=1)
df['extract'] = df.loc[mask]  # This only gives me the index name. I would like the actual matched cell contents.
df['extract2'] = Column name  # pseudocode: this should be the matching column's header
df['Match'] = df[['extract', 'extract2']].agg('||'.join, axis=1)
df.drop(['extract', 'extract2'], axis=1)
Final output should look something like "Column||Value" for each matched cell (screenshot omitted).
You can create a mask for a specific column first (I edited your 2nd line a bit), then create a new 'Match' column with all values initialized to 'No Match', and finally, change the values to your desired format ("Column||Value") for rows that are returned after applying the mask. I implemented this in the following sample code:
import pandas as pd

def match_column(df, column_name):
    column_mask = df.astype(str).applymap(lambda x: any(y in x for y in ['Chann', 'Midm']))[column_name]
    df['Match'] = 'No Match'
    df.loc[column_mask, 'Match'] = column_name + ' || ' + df[column_name]
    return df

df = {
    'Segment': ['Government', 'Government', 'Midmarket', 'Midmarket', 'Government', 'Channel Partners'],
    'Country': ['Canada', 'Germany', 'France', 'Canada', 'France', 'France']
}
df = pd.DataFrame(df)
display(df)

df = match_column(df, 'Segment')
display(df)
Output:

            Segment  Country                        Match
0        Government   Canada                     No Match
1        Government  Germany                     No Match
2         Midmarket   France         Segment || Midmarket
3         Midmarket   Canada         Segment || Midmarket
4        Government   France                     No Match
5  Channel Partners   France  Segment || Channel Partners
However, this only works for a single column. I don't know what output you want for cases when there are matches in multiple columns (if you can, please specify).
UPDATE:
If you want to use a list of columns as input and match with the first instance, you can use this instead:
def match_first_column(df, column_list):
    df['Match'] = 'No Match'
    # iterate over rows
    for index, row in df.iterrows():
        # iterate over column names
        for column_name in column_list:
            column_value = row[column_name]
            substrings = ['Chann', 'Midm', 'Fran']
            # if a match is found
            if any(x in column_value for x in substrings):
                # add the match string
                df.loc[index, 'Match'] = column_name + ' || ' + column_value
                # stop iterating and move to the next row
                break
    return df

df = {
    'Segment': ['Government', 'Government', 'Midmarket', 'Midmarket', 'Government', 'Channel Partners'],
    'Country': ['Canada', 'Germany', 'France', 'Canada', 'France', 'France']
}
df = pd.DataFrame(df)
display(df)

column_list = df.columns.tolist()
match_first_column(df, column_list)
Output:

            Segment  Country                        Match
0        Government   Canada                     No Match
1        Government  Germany                     No Match
2         Midmarket   France         Segment || Midmarket
3         Midmarket   Canada         Segment || Midmarket
4        Government   France            Country || France
5  Channel Partners   France  Segment || Channel Partners
You can try:
mask = df.astype(str).applymap(lambda x: any(y in x for y in ['Chann', 'Midm'])).any(axis=1)
df.loc[mask, 'Match'] = df.loc[mask, ['extract', 'extract2']].agg('||'.join, axis=1)
df['Match'].fillna('No Match', inplace=True)
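A self-contained variant without the helper columns (my own sketch, assuming string data and the same search terms as above): find the first matching column per row with apply:

import pandas as pd

df = pd.DataFrame({
    'Segment': ['Government', 'Government', 'Midmarket', 'Midmarket', 'Government', 'Channel Partners'],
    'Country': ['Canada', 'Germany', 'France', 'Canada', 'France', 'France']
})
terms = ['Chann', 'Midm']

def first_match(row):
    # return "column||value" for the first cell containing a search term
    for col, val in row.astype(str).items():
        if any(t in val for t in terms):
            return f"{col}||{val}"
    return 'No Match'

df['Match'] = df.apply(first_match, axis=1)
print(df)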

Add a new column containing the difference between EACH TWO ROWS of another column of a data frame

I would like to get the difference between each 2 rows of the column Duration and then fill the values into a new column difference, or print it.
So basically I want: row(1)-row(2)=difference1, row(3)-row(4)=difference2, row(5)-row(6)=difference3, ...
Example code:
import pandas as pd

data = {'Profession': ['Teacher', 'Banker', 'Teacher', 'Judge', 'lawyer', 'Teacher'],
        'Gender': ['Male', 'Male', 'Female', 'Male', 'Male', 'Female'],
        'Size': ['M', 'M', 'L', 'S', 'S', 'M'], 'Duration': ['5', '6', '2', '3', '4', '7']}
data2 = {'Profession': ['Doctor', 'Scientist', 'Scientist', 'Banker', 'Judge', 'Scientist'],
         'Gender': ['Male', 'Male', 'Female', 'Female', 'Male', 'Male'],
         'Size': ['L', 'M', 'L', 'M', 'L', 'L'], 'Duration': ['1', '2', '9', '10', '1', '17']}
data3 = {'Profession': ['Banker', 'Banker', 'Doctor', 'Doctor', 'lawyer', 'Teacher'],
         'Gender': ['Male', 'Male', 'Female', 'Female', 'Female', 'Male'],
         'Size': ['S', 'M', 'S', 'M', 'L', 'S'], 'Duration': ['15', '8', '5', '2', '11', '10']}
data4 = {'Profession': ['Judge', 'Judge', 'Scientist', 'Banker', 'Judge', 'Scientist'],
         'Gender': ['Female', 'Female', 'Female', 'Female', 'Female', 'Female'],
         'Size': ['M', 'S', 'L', 'S', 'M', 'S'], 'Duration': ['1', '2', '9', '10', '1', '17']}

df = pd.DataFrame(data)
df2 = pd.DataFrame(data2)
df3 = pd.DataFrame(data3)
df4 = pd.DataFrame(data4)

DATA = pd.concat([df, df2, df3, df4])
DATA.groupby(['Profession', 'Size', 'Gender']).agg('sum')
D = DATA.reset_index()
D['difference'] = D['Duration'].diff(-1)
I tried using diff(-1), but it's not exactly what I'm looking for. Any ideas?
Is this what you wanted?

D["Neighbour"] = D["Duration"].shift(-1)

# fill empty lines with 0
D["Neighbour"] = D["Neighbour"].fillna(0)

# convert columns "Neighbour" and "Duration" to numeric
D["Neighbour"] = pd.to_numeric(D["Neighbour"])
D["Duration"] = pd.to_numeric(D["Duration"])

# get the difference
D["difference"] = D["Duration"] - D["Neighbour"]

# remove the "Neighbour" column
D = D.drop(columns=["Neighbour"])

# blank out the odd lines, keeping row(1)-row(2), row(3)-row(4), ...
D.loc[1::2, "difference"] = None

# print D
D
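Building on the asker's own diff(-1) attempt, a shorter route to the same pairwise values (a sketch of my own): make Duration numeric, then keep every second difference:

D["Duration"] = pd.to_numeric(D["Duration"])
# diff(-1) gives row(n) - row(n+1); positions 0, 2, 4, ... hold row(1)-row(2), row(3)-row(4), ...
differences = D["Duration"].diff(-1).iloc[::2]
print(differences)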

Check whether all unique values of column B are mapped to all unique values of column A

I need a little help. I know it's very easy; I tried but didn't reach the goal.
# Import pandas library
import pandas as pd

data1 = [['India', 350], ['India', 600], ['Bangladesh', 350], ['Bangladesh', 600]]
df1 = pd.DataFrame(data1, columns=['Country', 'Bottle_Weight'])

data2 = [['India', 350], ['India', 600], ['India', 200], ['Bangladesh', 350], ['Bangladesh', 600]]
df2 = pd.DataFrame(data2, columns=['Country', 'Bottle_Weight'])

data3 = [['India', 350], ['India', 600], ['Bangladesh', 350], ['Bangladesh', 600], ['Bangladesh', 200]]
df3 = pd.DataFrame(data3, columns=['Country', 'Bottle_Weight'])
So basically I want to create a function which checks the mapping by comparing every other unique country's bottle weights with the first country's.
According to the 1st dataframe, it should return text like: all unique values of Bottle_Weight are mapped to all unique countries.
According to the 2nd dataframe, it should return text like: 'Country_name' not mapped to 'column name' 'value'.
In this case: 'Bangladesh' not mapped with 'Bottle_Weight' 200.
According to the 3rd dataframe, it should return text like: all unique values of Bottle_Weight are mapped to all unique countries, and on a new line: 'Country_name' mapped with new value '200'.
It is not a particularly efficient algorithm, but I think this should get you the results you are looking for.
import numpy as np

def check_weights(df):
    success = True
    countries = df['Country'].unique()
    first_weights = df.loc[df['Country'] == countries[0], 'Bottle_Weight'].unique()
    for country in countries[1:]:
        weights = df.loc[df['Country'] == country, 'Bottle_Weight'].unique()
        for weight in first_weights:
            if not np.any(weights == weight):
                success = False
                print(f"{country} does not have bottle weight {weight}")
    if success:
        print("All bottle weights are shared with another country")
