Create new column based on conditions in other dataframes - python

I have two input dataframes I'm attempting to use to create a column in another dataframe (target column = 'Country' in df1). The issue is that one of the input dataframes (i.e., df2) has fewer rows than my target dataframe (i.e., df1). I want to iterate through all possible countries in df2 but when that is exhausted (i.e., after 'KOR'), I want the column to revert to df3 and return the countries starting from the beginning, hence the target result in df1['Country'].
I thought I had found a way of doing this using numpy.where, but I ended up with 'ESP' instead of 'US' when reverting to df3, which is not what I want.
Also, these are just examples; the actual dataframes in question are much larger.
Hopefully someone can help me figure this out. Appreciate any help. Thanks.
import pandas as pd
data1 = {'Name': ['tom', 'nick', 'krish', 'jack'],
'Age': [20, 21, 19, 18],
'Country': ['IND', 'KOR', 'US', 'UK']}
data2 = {'Country2': ['IND', 'KOR']}
data3 = {'Country': ['US', 'UK', 'ESP', 'MEX']}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df3 = pd.DataFrame(data3)
EDIT: The easiest way I can think of to explain my goal more clearly is with Excel, where df1 begins in A1, df2 begins in E1, and df3 begins in H1. In that case, the formula in C1 (the beginning of the country column in df1) would be something like =IF(LEN(E1)>0,E1,H1). If I drag that formula down the cells, I'd basically have something similar to "if df2 has a country available, provide the name of that country; otherwise, revert to the beginning of df3 and provide the country (beginning at the start of the list)".
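One way to read the description above, as a minimal sketch rather than a definitive answer: stack the countries from df2 on top of those from df3 and keep as many values as df1 has rows. The name `countries` is introduced here for illustration, and the sketch assumes df2 and df3 together hold at least as many rows as df1.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['tom', 'nick', 'krish', 'jack'],
                    'Age': [20, 21, 19, 18]})
df2 = pd.DataFrame({'Country2': ['IND', 'KOR']})
df3 = pd.DataFrame({'Country': ['US', 'UK', 'ESP', 'MEX']})

# Put df2's countries first, followed by df3's countries from the top.
countries = pd.concat([df2['Country2'], df3['Country']], ignore_index=True)

# Keep only as many values as df1 has rows; to_numpy() sidesteps index alignment.
# Assumption: len(df2) + len(df3) >= len(df1), otherwise the assignment raises.
df1['Country'] = countries.iloc[:len(df1)].to_numpy()

print(df1)
```

Because df3 is appended from its first row, the value after 'KOR' is 'US' rather than 'ESP'.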

Related

Python Pandas - How to find rows where a column value is different from two data frame

I am trying to get the rows where the values in a column differ between two data frames.
For example, let's say we have the two data frames below:
import pandas as pd
data1 = {'date' : [20210701, 20210704, 20210703, 20210705, 20210705],
'name': ['Dave', 'Dave', 'Sue', 'Sue', 'Ann'],
'a' : [1,0,1,1,0]}
data2 = {'date' : [20210701, 20210702, 20210704, 20210703, 20210705, 20210705],
'name': ['Dave', 'Dave', 'Dave', 'Sue', 'Sue', 'Ann'],
'a' : [1,0,1,1,0,0]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
As you can see, Dave has different values in column 'a' on 20210704, and Sue has different values in column 'a' on 20210705. Therefore, my desired output should be something like:
import pandas as pd
output = {'date' : [20210704, 20210705],
'name': ['Dave', 'Sue'],
'a_from_old' : [0,1]}
df_output = pd.DataFrame(output)
If I am not mistaken, what I am asking for is pretty much the same thing as the MINUS statement in SQL, unless I am missing some edge cases.
How can I find those rows with the same date and name but different values in a column?
Update
I found an edge case: some rows are not in the other data frame at all. I only want the rows that are in both data frames but whose value in column 'a' is different.
I edited the sample data set to take the edge case into account.
(Notice that Dave on 20210702 does not appear in the final output because that row is not in the first data frame.)
Another option is an inner merge that keeps only the rows where a from df1 does not equal a from df2:
df3 = (
df1.merge(df2, on=['date', 'name'], suffixes=('_from_old', '_df2'))
.query('a_from_old != a_df2') # df1 `a` != df2 `a`
.drop('a_df2', axis=1) # Remove column with df2 `a` values
)
df3:
date name a_from_old
2 20210704 Dave 0
4 20210705 Sue 1
Try a left merge() with indicator=True, filter the result with query(), drop the extra column with drop(), and rename 'a' to 'a_from_old' with rename():
out=(df1.merge(df2,on=['date','name','a'],how='left',indicator=True)
.query("_merge=='left_only'").drop(columns='_merge')  # positional axis was removed in pandas 2.0
.rename(columns={'a':'a_from_old'}))
output of out:
date name a_from_old
2 20210704 Dave 0
4 20210705 Sue 1
Note: If there are many more columns that you want to rename, pass
suffixes=('_from_old', '') as a parameter to the merge() method instead.
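As a self-contained check, the two approaches can be run side by side on the sample data from the question. This is only a sketch: the names `keys`, `inner`, and `anti` are introduced here for illustration, and `reset_index(drop=True)` is added so the two results are directly comparable.

```python
import pandas as pd

df1 = pd.DataFrame({'date': [20210701, 20210704, 20210703, 20210705, 20210705],
                    'name': ['Dave', 'Dave', 'Sue', 'Sue', 'Ann'],
                    'a': [1, 0, 1, 1, 0]})
df2 = pd.DataFrame({'date': [20210701, 20210702, 20210704, 20210703, 20210705, 20210705],
                    'name': ['Dave', 'Dave', 'Dave', 'Sue', 'Sue', 'Ann'],
                    'a': [1, 0, 1, 1, 0, 0]})

keys = ['date', 'name']

# Inner merge on the keys, then keep rows where the two 'a' values differ.
inner = (df1.merge(df2, on=keys, suffixes=('_from_old', '_df2'))
            .query('a_from_old != a_df2')
            .drop(columns='a_df2')
            .reset_index(drop=True))

# Left merge on keys plus 'a'; rows found only in df1 are flagged 'left_only'.
anti = (df1.merge(df2, on=keys + ['a'], how='left', indicator=True)
           .query("_merge == 'left_only'")
           .drop(columns='_merge')
           .rename(columns={'a': 'a_from_old'})
           .reset_index(drop=True))

print(inner)
print(anti)
```

One difference worth keeping in mind: the left-merge variant would also return a df1 row whose date and name do not occur in df2 at all, while the inner merge only compares rows present in both frames. The sample data has no such row, so both give the same result here.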

Check if value from one dataframe exists in another dataframe and show another column in result

I have 2 dataframes:
In df1 I have two columns: keyword and father.
In df2 I have a column called Name.
If any keyword is present in df2, I want to show the corresponding value from df1['father'].
df1 = pd.DataFrame({'keyword': ['Marc', 'Jake', 'Sam', 'Brad', 'Vinicius', 'Alexandre'],
'father': ['Minga', 'Maria', 'Cida', 'Neide', 'Carla', 'Nil']})
df2 = pd.DataFrame({'Name': ['Jake', 'John', 'Marc', 'Tony2', 'Bob', 'ALEXANDRE', 'AGNES', 'Bianca']})
I want to loop over every row in df1['keyword'], check whether each name is somewhere in df2['Name'], and show the corresponding row in df1['father'].
Can anyone help me?
Thank you
You can use .isin() to check whether each keyword value appears in the Name column of df2, then use .loc to select the rows that match and the father column to display:
df1.loc[df1['keyword'].isin(df2['Name']), 'father']
Result:
0 Minga
1 Maria
Name: father, dtype: object
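Note that .isin() compares strings exactly, so 'Alexandre' does not match 'ALEXANDRE' in the result above. If a case-insensitive match is wanted (an assumption, since the question does not say so), a possible variant lowercases both sides first. The names `mask` and `fathers` are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'keyword': ['Marc', 'Jake', 'Sam', 'Brad', 'Vinicius', 'Alexandre'],
                    'father': ['Minga', 'Maria', 'Cida', 'Neide', 'Carla', 'Nil']})
df2 = pd.DataFrame({'Name': ['Jake', 'John', 'Marc', 'Tony2', 'Bob', 'ALEXANDRE', 'AGNES', 'Bianca']})

# Compare lowercase copies so that 'Alexandre' matches 'ALEXANDRE'.
mask = df1['keyword'].str.lower().isin(df2['Name'].str.lower())
fathers = df1.loc[mask, 'father']

print(fathers)
```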

Joining dataframes based on values, pandas

I have two data frames, let's say A and B. A has the columns ['Name', 'Age', 'Mobile_number'] and B has the columns ['Cell_number', 'Blood_group', 'Location'], with 'Mobile_number' and 'Cell_number' having common values. I want to join only the 'Location' column onto A based on the common values in 'Mobile_number' and 'Cell_number', so the final DataFrame would have A={'Name':,'Age':,'Mobile_number':,'Location':}
a = {'Name': ['Jake', 'Paul', 'Logan', 'King'], 'Age': [33,43,22,45], 'Mobile_number':[332,554,234, 832]}
A = pd.DataFrame(a)
b = {'Cell_number': [832,554,123,333], 'Blood_group': ['O', 'A', 'AB', 'AB'], 'Location': ['TX', 'AZ', 'MO', 'MN']}
B = pd.DataFrame(b)
Please suggest an approach. A colleague suggested using pd.Join, but I don't understand how.
Thank you for your time.
The way I see it, you want to merge a dataframe with part of another dataframe, based on a common column.
First, make sure the common column has the same name in both:
B['Mobile_number'] = B['Cell_number']
Then create a dataframe that contains only the relevant columns (the key column and the data column you need):
B1 = B[['Mobile_number', 'Location']]
Finally, merge them:
merged_df = pd.merge(A, B1, on='Mobile_number')
Note that this usage of pd.merge keeps only the rows whose Mobile_number value exists in both dataframes (an inner join, which is the default).
See the documentation of pd.merge for how to change the way the merge is done, what to include, and so on.
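As an alternative sketch, pd.merge also accepts left_on and right_on, so the key column does not have to be copied under a new name first. The example below additionally uses how='left' to keep every row of A, which is an assumption about what is wanted; rows with no match get NaN in Location.

```python
import pandas as pd

A = pd.DataFrame({'Name': ['Jake', 'Paul', 'Logan', 'King'],
                  'Age': [33, 43, 22, 45],
                  'Mobile_number': [332, 554, 234, 832]})
B = pd.DataFrame({'Cell_number': [832, 554, 123, 333],
                  'Blood_group': ['O', 'A', 'AB', 'AB'],
                  'Location': ['TX', 'AZ', 'MO', 'MN']})

# Match A.Mobile_number against B.Cell_number and bring over only Location.
merged_df = (A.merge(B[['Cell_number', 'Location']],
                     left_on='Mobile_number', right_on='Cell_number',
                     how='left')            # keep all rows of A
               .drop(columns='Cell_number'))  # the duplicate key is no longer needed

print(merged_df)
```

Dropping how='left' gives the same inner-join behaviour as the answer above.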

How to merge multiple dataframes with unique field names into a single dataframe?

I'm trying to merge 5 dataframes into a single dataframe. Each individual dataframe has the same format, the only variation is the column name.
# Input Dataframes
df1 = df[['id', 'num', 'type_1', 'object_1', 'notes_1']]
df2 = df[['id', 'num', 'type_2', 'object_2', 'notes_2']]
df3 = df[['id', 'num', 'type_3', 'object_3', 'notes_3']]
df4 = df[['id', 'num', 'type_4', 'object_4', 'notes_4']]
df5 = df[['id', 'num', 'type_5', 'object_5', 'notes_5']]
Each time I try to combine them, I accidentally combine them as columns instead of rows. My goal is to generate a df with 5 rows.
# my attempt
df = pd.concat([df1, df2, df3, df4, df5], axis=0, ignore_index=True)
outputs: [type_1, type_2, type_3, type_4, type_5, note_1,notes_2...]
# Desired Output Dataframe
final_df = df[['id', 'num', 'type', 'object', 'notes']]
It's mildly embarrassing that I don't know how to solve this with concat(), since exactly what I want to do is the very first example in the pandas .concat() documentation. Can anyone provide guidance? I feel like I'm almost there.
Thank you to #scott-boston and #alollz. I think you both are right but I was able to get it to work with Scott's suggestion. Thank you all.
# rename columns so every frame uses the same names
d1 = df1.rename(columns={'type_1': 'type',
                         'object_1': 'object',
                         'notes_1': 'notes'})
# d2 through d5 are built the same way from df2 through df5
# concatenate
frames = [d1, d2, d3, d4, d5]
result = pd.concat(frames)
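The rename-then-concat idea can also be written as a loop so the mapping is not typed out five times. This is a sketch on made-up data: the wide frame `df`, its values, and the helper names `fields` and `frames` are hypothetical, and only three suffixes are used to keep it short.

```python
import pandas as pd

# Hypothetical wide frame with one row and three numbered column groups.
df = pd.DataFrame({'id': [1], 'num': [10],
                   'type_1': ['a'], 'object_1': ['x'], 'notes_1': ['n1'],
                   'type_2': ['b'], 'object_2': ['y'], 'notes_2': ['n2'],
                   'type_3': ['c'], 'object_3': ['z'], 'notes_3': ['n3']})

fields = ['type', 'object', 'notes']
frames = []
for i in range(1, 4):
    numbered = [f'{name}_{i}' for name in fields]
    # Select one group and strip its suffix so all frames share column names.
    part = df[['id', 'num'] + numbered].rename(columns=dict(zip(numbered, fields)))
    frames.append(part)

# With identical column names, concat stacks the frames as rows.
result = pd.concat(frames, ignore_index=True)
print(result)
```

pd.concat aligns on column names, which is why the original attempt produced extra columns: each frame had differently named columns, so nothing lined up.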

Python pandas grouping issue

Am I doing something wrong here, or is this a bug?
df2 is a copy/slice of df1. But the minute I attempt to group it by column A and get the last value of the grouping from column C, creating a new column 'NewMisteryColumn', df1 also gets a new 'NewMisteryColumn'.
The end result in df2 is correct. I also have other ways of doing this; I am not looking for a different method, just wondering whether I have stumbled upon a bug.
My question is: isn't df1 separate from df2? Why is df1 also getting the same column?
df1 = pd.DataFrame({'A':['some value','some value', 'another value'],
'B':['rthyuyu','truyruyru', '56564'],
'C':['tryrhyu','tryhyteru', '54676']})
df2 = df1
df2['NewMisteryColumn'] = df2.groupby(['A'])['C'].tail(1)
The problem is that df2 is just another reference to the same DataFrame; plain assignment does not copy it. Use .copy() to get an independent object:
df2 = df1
df3 = df1.copy()
df1 is df2 # True
df1 is df3 # False
You can also verify the ids...
id(df1)
id(df2) # Same as id(df1)
id(df3) # Different!
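Applied to the code from the question, a minimal sketch of the fix is to replace the plain assignment with .copy(), after which the new column appears only on df2:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['some value', 'some value', 'another value'],
                    'B': ['rthyuyu', 'truyruyru', '56564'],
                    'C': ['tryrhyu', 'tryhyteru', '54676']})

# .copy() creates a separate DataFrame instead of a second name for df1.
df2 = df1.copy()
df2['NewMisteryColumn'] = df2.groupby(['A'])['C'].tail(1)

print(list(df1.columns))  # df1 keeps its original three columns
print(list(df2.columns))  # only df2 has the new column
```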
