How to merge two DataFrames using specific conditions in Python Pandas? - python

I have two Data Frames:
DataFrame 1
df1 = pd.DataFrame()
df1["ID1"] = [np.nan, 1, np.nan, 3]
df1["ID2"] =[np.nan, np.nan , 2, 3]
df1
DataFrame 2
df2 = pd.DataFrame()
df2["ID"] = [1, 2, 3, 4]
df2
And I need to merge these two DataFrames using the conditions below:
If in df1 ID1 == ID2, then I can merge df1 with df2 using df1.ID1 = df2.ID or df1.ID2 = df2.ID.
If in df1 ID1 != ID2, then I have to merge df1 with df2 using both of the conditions mentioned in point 1, i.e. df1.ID1 = df2.ID and df1.ID2 = df2.ID.
I have the logic described in points 1 and 2, but I honestly do not know how to write it in Python Pandas. Any suggestions?

If I understood correctly, this will fix your problem:
df1 = pd.DataFrame()
df1["ID1"] = [np.nan, 1, np.nan, 3]
df1["ID2"] =[np.nan, np.nan , 2, 3]
df2 = pd.DataFrame()
df2["ID"] = [1, 2, 3, 4]
if df1["ID1"].equals(df1["ID2"]) == True:
pass #do your merging here
else:
df1["ID1"],df1["ID2"] = df2["ID"],df2["ID"]
df1
output:
ID1 ID2
0 1 1
1 2 2
2 3 3
3 4 4
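If you also need the actual merge against df2 (the answer above only fills the ID columns), one hedged reading of the two conditions is to run two left merges, one per key column, and keep both lookups side by side; where ID1 == ID2 the two lookups agree, and where they differ you get both. A minimal sketch (the match_ID1/match_ID2 names are made up for illustration; with a real df2 you would carry its payload columns instead):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"ID1": [np.nan, 1, np.nan, 3], "ID2": [np.nan, np.nan, 2, 3]})
df2 = pd.DataFrame({"ID": [1, 2, 3, 4]})

# merge on ID1, then on ID2; unmatched (or NaN) keys stay NaN
m1 = df1.merge(df2, left_on="ID1", right_on="ID", how="left").rename(columns={"ID": "match_ID1"})
m2 = m1.merge(df2, left_on="ID2", right_on="ID", how="left").rename(columns={"ID": "match_ID2"})
print(m2)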

Related

DataFrame fastest way to update rows without a loop

Creating a scenario:
Assuming a dataframe with two series, where A is the input and B is the result of A[index]*2:
df = pd.DataFrame({'A': [1, 2, 3],
'B': [2, 4, 6]})
Let's say I am receiving a 100k-row dataframe and searching for errors in it (here B->0 is invalid):
df = pd.DataFrame({'A': [1, 2, 3],
'B': [2, 0, 6]})
Searching the invalid rows by using
invalid_rows = df.loc[df['A']*2 != df['B']]
I have the invalid_rows now, but I am not sure what would be the fastest way to overwrite the invalid rows in the original df with the result of A[index]*2?
Iterating over the df using iterrows() is an option but slow if the df grows. Can I use df.update() for this somehow?
Working solution with a loop:
for row_index, my_series in df.iterrows():
    if my_series['A'] * 2 != my_series['B']:
        df.loc[row_index, 'B'] = my_series['A'] * 2
But is there a faster way to do this?
Using mul, ne and loc:
m = df['A'].mul(2).ne(df['B'])
# same as: m = df['A'] * 2 != df['B']
df.loc[m, 'B'] = df['A'].mul(2)
A B
0 1 2
1 2 4
2 3 6
m is a boolean Series which marks the rows where A * 2 != B:
print(m)
0 False
1 True
2 False
dtype: bool
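To answer the df.update() part of the question: you don't need update here. A vectorised alternative (just a common pattern, not part of the answer above) is np.where, which recomputes B only where the check fails:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 0, 6]})

# keep B where it already equals A*2, otherwise overwrite it with A*2
df['B'] = np.where(df['A'] * 2 != df['B'], df['A'] * 2, df['B'])
print(df)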

get unique column values from multiple PySpark dataframes using a for loop condition

I have 2 PySpark dataframes (DF1 and DF2) and would like to loop over some of the columns (colA, colB from DF1; colZ from DF2) in the two dataframes and get distinct values.
DF1:
colA colB colC
1 1 A
3 1 Y
DF2:
colX colY colZ
1 1 A21
3 4 Y33
Output:
column value
colA 1
colA 3
colB 1
colZ A21
colZ Y33
This method works, but trying to create a for loop to collect the resulting distinct values doesn't work (and I have more than 50 dataframes):
df_combined = DF1.select('colA').dropDuplicates(['colA']).withColumn("new_column", lit("colA")) \
    .union(DF1.select('colB').dropDuplicates(['colB']).withColumn("new_column", lit("colB"))) \
    .union(DF2.select('colZ').dropDuplicates(['colZ']).withColumn("new_column", lit("colZ")))
df_combined.withColumnRenamed("colA", "column").withColumnRenamed("new_column", "value").show()
I am not quite clear as to what you're trying to achieve here but this is how I would do it.
import pandas as pd
DF1 = pd.DataFrame(data={'colA': [1, 3], 'colB': [1, 1], 'colC': ['A', 'Y']})
DF2 = pd.DataFrame(data={'colX': [1, 3], 'colY': [1, 4], 'colZ': ['A21', 'Y33']})
DF1 = DF1.stack().reset_index()[['level_1',0]].rename(columns={'level_1':'column',0:'value'}).drop_duplicates(subset=['column', 'value'])
def transformAndAppend(df):
    df = df.stack().reset_index()[['level_1', 0]].rename(columns={'level_1': 'column', 0: 'value'}).drop_duplicates(subset=['column', 'value'])
    return DF1.append(df)
DF1 = transformAndAppend(DF2)
DF1 = DF1.loc[(DF1['column'] == 'colA') | (DF1['column'] == 'colB') | (DF1['column'] == 'colZ')]
print(DF1)
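Since the question is about PySpark and about scaling to many dataframes, a loop-based sketch that stays in Spark might look like the following. This is an assumption-laden illustration: it presumes DF1 and DF2 are already Spark DataFrames in scope, and the cast to string is only there so columns of different types can be unioned.
from functools import reduce
from pyspark.sql import functions as F

# hypothetical mapping of which column to pull from which dataframe
pairs = [(DF1, "colA"), (DF1, "colB"), (DF2, "colZ")]

parts = [
    df.select(F.col(c).cast("string").alias("value")).distinct().withColumn("column", F.lit(c))
    for df, c in pairs
]
df_combined = reduce(lambda a, b: a.unionByName(b), parts).select("column", "value")
df_combined.show()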

python pandas for loop assign column based on which dataframe it came from

I am using a for loop to go through two frames to eventually concat them.
data_frames = []
data_frames.append(df1)
data_frames.append(df2)
For data_frame in data_frames:
data_frame['col1'] = 'Test'
if date_frame.name = df1:
data_frame['col2'] = 'Apple'
else:
data_frame['col2'] = 'Orange'
The above fails, but in essence I want data_frame['col2']'s value to depend on which dataframe the row came from. So if the row is from df1, the value for that column should be 'Apple', and if not it should be 'Orange'.
There are quite a few syntax errors in your code, but I believe this is what you're trying to do:
# Example Dataframes
df1 = pd.DataFrame({
    'a': [1, 1, 1],
})
# With names!
df1.name = 'df1'
df2 = pd.DataFrame({
    'a': [2, 2, 2],
})
df2.name = 'df2'
# Create a list of df1 & df2
data_frames = [df1, df2]
# For each data frame in list
for data_frame in data_frames:
    # Set col1 to Test
    data_frame['col1'] = 'Test'
    # If the data_frame.name is 'df1'
    if data_frame.name == 'df1':
        # Set col2 to 'Apple'
        data_frame['col2'] = 'Apple'
    else:
        # Else set 'col2' to 'Orange'
        data_frame['col2'] = 'Orange'
# Print dataframes
for data_frame in data_frames:
    print("{name}:\n{value}\n\n".format(name=data_frame.name, value=data_frame))
Output:
df1:
a col1 col2
0 1 Test Apple
1 1 Test Apple
2 1 Test Apple
df2:
a col1 col2
0 2 Test Orange
1 2 Test Orange
2 2 Test Orange
Let's use pd.concat with keys, using @AaronNBrock's setup:
df1 = pd.DataFrame({
'a': [1, 1, 1],
})
df2 = pd.DataFrame({
'a': [2, 2, 2],
})
list_of_dfs = ['df1','df2']
df_out = pd.concat([eval(i) for i in list_of_dfs], keys=list_of_dfs)\
    .rename_axis(['Source', None]).reset_index()\
    .drop('level_1', axis=1)
print(df_out)
Output:
Source a
0 df1 1
1 df1 1
2 df1 1
3 df2 2
4 df2 2
5 df2 2
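A small variation on the same idea, if you would rather avoid eval: hand pd.concat a dict, whose keys then become the Source level. A sketch, equivalent in spirit to the answer above:
import pandas as pd

df1 = pd.DataFrame({'a': [1, 1, 1]})
df2 = pd.DataFrame({'a': [2, 2, 2]})

# dict keys become the outer index level, no eval needed
df_out = (pd.concat({'df1': df1, 'df2': df2})
            .rename_axis(['Source', None])
            .reset_index(level='Source')
            .reset_index(drop=True))
print(df_out)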

If pandas merge finds several matches, write values rows into one field

I had no real good idea how to formulate a good header here.
The situation is that I have two data frames I want to merge:
df1 = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'ID'])
df2 = pd.DataFrame([[3, 2], [3, 3], [4, 6]], columns=['ID', 'values'])
so I do a:
pd.merge(df1, df2, on="ID", how="left")
which results in:
A ID values
0 1 2 NaN
1 1 3 2.0
2 1 3 3.0
3 4 6 NaN
What I would like though is that any combination of A and ID only appear once. If there were several ones, like in the example above, it should take the respective values and merge them into a list(?) of values. So the result should look like this:
A ID values
0 1 2 NaN
1 1 3 2.0, 3.0
2 4 6 NaN
I do not have the slightest idea how to approach this.
Once you've got your merged dataframe, you can groupby columns A and ID and then simply apply list to your values column to aggregate the results into a list for each group:
import pandas as pd
df1 = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'ID'])
df2 = pd.DataFrame([[3, 2], [3, 3], [4, 6]], columns=['ID', 'values'])
merged = pd.merge(df1, df2, on="ID", how="left") \
    .groupby(['A', 'ID'])['values'] \
    .apply(list) \
    .reset_index()
print(merged)
prints:
A ID values
0 1 2 [nan]
1 1 3 [2.0, 3.0]
2 4 6 [nan]
You could use
merged = pd.merge(df1, df2, on="ID", how="left") \
    .groupby(['A', 'ID'])['values'] \
    .apply(list) \
    .reset_index()
as in asongtoruin's fine answer, but you might want to treat the all-NaN case (which the left merge produces for unmatched keys) as special, in which case you can use
>>> df['values'].groupby([df.A, df.ID]).apply(lambda g: [] if g.isnull().all() else list(g)).reset_index()
A ID values
0 1 2 []
1 1 3 [2.0, 3.0]
2 4 6 []
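If what you actually want is the comma-separated string shown in the expected output ("2.0, 3.0") rather than a Python list, a small tweak on the same groupby does it. A sketch, with the empty-match groups mapped back to NaN:
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'ID'])
df2 = pd.DataFrame([[3, 2], [3, 3], [4, 6]], columns=['ID', 'values'])

out = (pd.merge(df1, df2, on="ID", how="left")
         .groupby(['A', 'ID'])['values']
         .apply(lambda g: ', '.join(map(str, g.dropna())) or np.nan)
         .reset_index())
print(out)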

Look for value in df1('col1') is equal to any value in df2('col3') and remove row from df1 if True [Python] [duplicate]

I've two pandas data frames that have some rows in common.
Suppose dataframe2 is a subset of dataframe1.
How can I get the rows of dataframe1 which are not in dataframe2?
df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 'col2' : [10, 11, 12, 13, 14]})
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3], 'col2' : [10, 11, 12]})
df1
col1 col2
0 1 10
1 2 11
2 3 12
3 4 13
4 5 14
df2
col1 col2
0 1 10
1 2 11
2 3 12
Expected result:
col1 col2
3 4 13
4 5 14
The currently selected solution produces incorrect results. To correctly solve this problem, we can perform a left-join from df1 to df2, making sure to first get just the unique rows for df2.
First, we need to modify the original DataFrame to add the row with data [3, 10].
df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3],
'col2' : [10, 11, 12, 13, 14, 10]})
df2 = pd.DataFrame(data = {'col1' : [1, 2, 3],
'col2' : [10, 11, 12]})
df1
col1 col2
0 1 10
1 2 11
2 3 12
3 4 13
4 5 14
5 3 10
df2
col1 col2
0 1 10
1 2 11
2 3 12
Perform a left-join, eliminating duplicates in df2 so that each row of df1 joins with exactly 1 row of df2. Use the parameter indicator to return an extra column indicating which table the row was from.
df_all = df1.merge(df2.drop_duplicates(), on=['col1','col2'],
how='left', indicator=True)
df_all
col1 col2 _merge
0 1 10 both
1 2 11 both
2 3 12 both
3 4 13 left_only
4 5 14 left_only
5 3 10 left_only
Create a boolean condition:
df_all['_merge'] == 'left_only'
0 False
1 False
2 False
3 True
4 True
5 True
Name: _merge, dtype: bool
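The mask then selects the rows that exist only in df1; drop the helper column afterwards (a short completion of the step above):
df1_only = df_all[df_all['_merge'] == 'left_only'].drop(columns='_merge')
df1_only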
Why other solutions are wrong
A few solutions make the same mistake - they only check that each value is independently in each column, not together in the same row. Adding the last row, which is unique but has values drawn from both columns of df2, exposes the mistake:
common = df1.merge(df2,on=['col1','col2'])
(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))
0 False
1 False
2 False
3 True
4 True
5 False
dtype: bool
This solution gets the same wrong result:
df1.isin(df2.to_dict('l')).all(1)
One method would be to store the result of an inner merge of both dfs; then we can simply select the rows whose column values are not in this common result:
In [119]:
common = df1.merge(df2,on=['col1','col2'])
print(common)
df1[(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))]
col1 col2
0 1 10
1 2 11
2 3 12
Out[119]:
col1 col2
3 4 13
4 5 14
EDIT
Another method, as you've found, is to use isin, which will produce NaN rows that you can drop:
In [138]:
df1[~df1.isin(df2)].dropna()
Out[138]:
col1 col2
3 4 13
4 5 14
However, if df2's rows are not aligned on the same index labels as df1 (DataFrame.isin matches on both index and columns), then this won't work:
df2 = pd.DataFrame(data = {'col1' : [2, 3,4], 'col2' : [11, 12,13]})
will produce the entire df:
In [140]:
df1[~df1.isin(df2)].dropna()
Out[140]:
col1 col2
0 1 10
1 2 11
2 3 12
3 4 13
4 5 14
Assuming that the indexes are consistent in the dataframes (not taking into account the actual col values):
df1[~df1.index.isin(df2.index)]
As already hinted at, isin requires columns and indices to be the same for a match. If match should only be on row contents, one way to get the mask for filtering the rows present is to convert the rows to a (Multi)Index:
In [77]: df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3], 'col2' : [10, 11, 12, 13, 14, 10]})
In [78]: df2 = pandas.DataFrame(data = {'col1' : [1, 3, 4], 'col2' : [10, 12, 13]})
In [79]: df1.loc[~df1.set_index(list(df1.columns)).index.isin(df2.set_index(list(df2.columns)).index)]
Out[79]:
col1 col2
1 2 11
4 5 14
5 3 10
If index should be taken into account, set_index has keyword argument append to append columns to existing index. If columns do not line up, list(df.columns) can be replaced with column specifications to align the data.
pandas.MultiIndex.from_tuples(df<N>.to_records(index = False).tolist())
could alternatively be used to create the indices, though I doubt this is more efficient.
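For the append keyword mentioned a little earlier, a hedged illustration using the same df1/df2: with append=True the original index stays as an extra level, so a row only counts as present in df2 if both its index label and its contents match.
# keep the existing index and stack the columns on top of it
idx1 = df1.set_index(list(df1.columns), append=True).index
idx2 = df2.set_index(list(df2.columns), append=True).index
df1.loc[~idx1.isin(idx2)]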
Suppose you have two dataframes, df_1 and df_2, with multiple fields (columns), and you want to find only those entries in df_1 that are not in df_2 on the basis of some fields (e.g. field_x, field_y). Then follow these steps.
Step 1. Add a column key1 to df_1 and a column key2 to df_2.
Step 2. Merge the dataframes as shown below. field_x and field_y are our desired columns.
Step 3. Select only those rows from df_1 where key1 is not equal to key2.
Step 4. Drop key1 and key2.
This method will solve your problem and works fast even with big data sets. I have tried it for dataframes with more than 1,000,000 rows.
df_1['key1'] = 1
df_2['key2'] = 1
df_1 = pd.merge(df_1, df_2, on=['field_x', 'field_y'], how = 'left')
df_1 = df_1[~(df_1.key2 == df_1.key1)]
df_1 = df_1.drop(['key1','key2'], axis=1)
A bit late, but it might be worth checking the "indicator" parameter of pd.merge.
See this other question for an example:
Compare PandaS DataFrames and return rows that are missing from the first one
This is the best way to do it:
df = df1.drop_duplicates().merge(df2.drop_duplicates(), on=df2.columns.to_list(),
                                 how='left', indicator=True)
df.loc[df._merge == 'left_only', df.columns != '_merge']
Note that drop_duplicates is used to minimize the comparisons; it would work without it as well. The best way is to compare the row contents themselves rather than the index or one or two columns, and the same code can be used for other filters like 'both' and 'right_only' to achieve similar results. With this syntax the dataframes can have any number of columns and even different indices; only the columns need to occur in both dataframes.
Why is this the best way?
index.difference only works for unique index based comparisons
pandas.concat() coupled with drop_duplicates() is not ideal because it will also get rid of rows which may be only in the dataframe you want to keep but are duplicated there for valid reasons.
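To make that second point concrete, here is a small made-up example (df1_dup and df2_sub are hypothetical names): the (4, 13) row exists only in the frame you want to keep, but because it is duplicated there, concat + drop_duplicates(keep=False) silently discards it.
import pandas as pd

df1_dup = pd.DataFrame({'col1': [4, 4, 5], 'col2': [13, 13, 14]})
df2_sub = pd.DataFrame({'col1': [5], 'col2': [14]})

# both copies of (4, 13) cancel each other out, so the result is empty
print(pd.concat([df1_dup, df2_sub]).drop_duplicates(keep=False))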
I think the answers involving merging are extremely slow. Therefore I would suggest another way of getting those rows which are different between the two dataframes:
df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 'col2' : [10, 11, 12, 13, 14]})
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3], 'col2' : [10, 11, 12]})
DISCLAIMER: My solution works if you're interested in one specific column where the two dataframes differ. If you are interested only in those rows where all columns are equal, do not use this approach.
Let's say, col1 is a kind of ID, and you only want to get those rows, which are not contained in both dataframes:
ids_in_df2 = df2.col1.unique()
not_found_ids = df1[~df1['col1'].isin(ids_in_df2)]
And that's it. You get a dataframe containing only those rows where col1 doesn't appear in both dataframes.
You can also concat df1, df2:
x = pd.concat([df1, df2])
and then remove all duplicates:
y = x.drop_duplicates(keep=False, inplace=False)
I have an easier way in 2 simple steps:
As the OP mentioned, suppose dataframe2 is a subset of dataframe1 and the columns in the two dataframes are the same:
df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3],
'col2' : [10, 11, 12, 13, 14, 10]})
df2 = pd.DataFrame(data = {'col1' : [1, 2, 3],
'col2' : [10, 11, 12]})
### Step 1: just append the 2nd df at the end of the 1st df
df_both = df1.append(df2)
### Step 2: drop all rows that are duplicated (keep=False drops every copy)
df_dif = df_both.drop_duplicates(keep=False)
## mission accomplished!
df_dif
Out[20]:
col1 col2
3 4 13
4 5 14
5 3 10
You can do it using the isin(dict) method:
In [74]: df1[~df1.isin(df2.to_dict('l')).all(1)]
Out[74]:
col1 col2
3 4 13
4 5 14
Explanation:
In [75]: df2.to_dict('l')
Out[75]: {'col1': [1, 2, 3], 'col2': [10, 11, 12]}
In [76]: df1.isin(df2.to_dict('l'))
Out[76]:
col1 col2
0 True True
1 True True
2 True True
3 False False
4 False False
In [77]: df1.isin(df2.to_dict('l')).all(1)
Out[77]:
0 True
1 True
2 True
3 False
4 False
dtype: bool
Here is another way of solving this:
df1[~df1.index.isin(df1.merge(df2, how='inner', on=['col1', 'col2']).index)]
Or:
df1.loc[df1.index.difference(df1.merge(df2, how='inner', on=['col1', 'col2']).index)]
Extract the dissimilar rows using the merge function:
df = df1.merge(df2.drop_duplicates(), on=['col1','col2'],
how='left', indicator=True)
Save the dissimilar rows to CSV:
df[df['_merge'] == 'left_only'].to_csv('output.csv')
My way of doing this involves adding a new column that is unique to one dataframe and using this to choose whether to keep an entry
df_2['Empt'] = 1
df_1 = pd.merge(df_1, df_2, on=['field_x', 'field_y'], how='outer')
df_1['Empt'].fillna(0, inplace=True)
This makes it so every entry in df_1 has a code - 0 if it is unique to df_1, 1 if it is in both dataframes. You then use this to restrict to what you want:
answer = df_1[df_1['Empt'] == 0]
How about this:
df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5],
'col2' : [10, 11, 12, 13, 14]})
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3],
'col2' : [10, 11, 12]})
records_df2 = set([tuple(row) for row in df2.values])
in_df2_mask = np.array([tuple(row) in records_df2 for row in df1.values])
result = df1[~in_df2_mask]
Easier, simpler and elegant
uncommon_indices = np.setdiff1d(df1.index.values, df2.index.values)
new_df = df1.loc[uncommon_indices,:]
pd.concat([df1, df2]).drop_duplicates(keep=False) will concatenate the two DataFrames together, and then drop all the duplicates, keeping only the unique rows. By default it will keep the first occurrence of the duplicate, but setting keep=False will drop all the duplicates.
Keep in mind that if you need to compare the DataFrames with columns with different names, you will have to make sure the columns have the same name before concatenating the dataframes.
Also, if the dataframes have a different order of columns, it will also affect the final result.
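A sketch of the renaming/reordering step mentioned above, assuming a df2 whose columns happen to use different names and a different order (the names id and val are made up for illustration):
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14]})
df2 = pd.DataFrame({'val': [12, 11, 10], 'id': [3, 2, 1]})

# rename and reorder df2's columns to match df1 before concatenating
df2_aligned = df2.rename(columns={'id': 'col1', 'val': 'col2'})[df1.columns]
print(pd.concat([df1, df2_aligned]).drop_duplicates(keep=False))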
