I have two dataframes as follows:
DF1
A B C
1 2 3
4 5 6
7 8 9
DF2
Match Values
1 a,d
7 b,c
I want to match DF1['A'] with DF2['Match'] and append DF2['Values'] to DF1 wherever a match exists.
So my result will be:
A B C Values
1 2 3 a,d
7 8 9 b,c
Now I can use the following code to match the values, but it returns an empty dataframe.
df1 = df1[df1['A'].isin(df2['Match'])]
Any help would be appreciated.
Instead of doing a lookup, you can do this in one step by merging the dataframes:
pd.merge(df1, df2, how='inner', left_on='A', right_on='Match')
Specify how='inner' if you only want records that appear in both, how='left' if you want all of df1's data.
If you want to keep only the Values column:
pd.merge(df1, df2.set_index('Match')['Values'].to_frame(), how='inner', left_on='A', right_index=True)
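For reference, a minimal runnable sketch of the merge with the sample data from the question (the dataframe construction is mine; the names match the example above):

import pandas as pd

df1 = pd.DataFrame({'A': [1, 4, 7], 'B': [2, 5, 8], 'C': [3, 6, 9]})
df2 = pd.DataFrame({'Match': [1, 7], 'Values': ['a,d', 'b,c']})

# inner merge keeps only the rows of df1 whose 'A' appears in df2['Match']
result = pd.merge(df1, df2, how='inner', left_on='A', right_on='Match')
result = result.drop(columns='Match')  # drop the redundant key column

>>> result
   A  B  C Values
0  1  2  3    a,d
1  7  8  9    b,c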
Related
I have 3 dataframes that share an ID column, and I want to combine them into a single dataframe using inner-join logic as in SQL. When I try the code below, the first two dataframes are joined correctly, but the values from the last one come out wrong even though the ID column matches. How can I fix this? Thank you for your help in advance.
from functools import reduce

dfs = [DF1, DF2, DF3]
df_final = reduce(lambda left, right: pd.merge(left, right, on=["ID"], how="outer"), dfs)
SOLVED: The data type of the ID column in DF1 was int, while in the others it was str. Before asking the question I had converted the ID column of DF1 to str and still got a wrong result. When I converted all of them to the int dtype instead, I got the result I wanted.
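A minimal sketch of that dtype fix, assuming the dataframes are named DF1, DF2, DF3 as in the question:

# make sure the join keys share a single dtype before merging
for df in (DF1, DF2, DF3):
    df['ID'] = df['ID'].astype(int)  # .astype(str) works too, as long as it is consistent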
Your IDs are not the same dtype:
>>> DF1
ID A
0 10 1
1 20 2
2 30 3
>>> DF2
ID K
0 30 3
1 10 1
2 20 2
>>> DF3
ID P
0 20 2
1 30 3
2 10 1
Your code:
dfs = [DF1, DF2, DF3]
df_final = reduce(lambda left, right: pd.merge(left, right, on=["ID"], how="outer"), dfs)
The output:
>>> df_final
ID A K P
0 10 1 1 1
1 20 2 2 2
2 30 3 3 3
Use join:
# use set index to add 'join' key into the index and
# create a list of dataframes using list comprehension
l = [df.set_index('ID') for df in [DF1, DF2, DF3]]
# pd.DataFrame.join accepts a list of dataframes as 'other'
l[0].join(l[1:])
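If you want ID back as a regular column afterwards, you can chain reset_index(); a small usage sketch with the data above:

>>> l[0].join(l[1:]).reset_index()
   ID  A  K  P
0  10  1  1  1
1  20  2  2  2
2  30  3  3  3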
I am trying to merge multiple DataFrames on the same DocID and then sum up the weights, but the merge creates Weight_x and Weight_y columns. That would be fine for only two DataFrames, but the number of DataFrames to merge changes based on user input, so merging creates Weight_x, Weight_y, and so on, over and over. How can I merge more than 2 DataFrames so that they are merged on DocID and Weight is summed?
Example:
df1:
DocID  Weight
1      4
2      7
3      8

df2:
DocID  Weight
1      5
2      9
8      1

finalDf:
DocID  Weight
1      9
2      16
You can merge, set the 'DocID' column as the index, then sum the remaining columns together. Then you can reindex and rename the columns in the resulting final_df as needed:
df_final = pd.merge(df1, df2, on=['DocID']).set_index(['DocID']).sum(axis=1)
df_final = pd.DataFrame({"DocID": df_final.index, "Weight": df_final}).reset_index(drop=True)
Output:
>>> df_final
DocID Weight
0 1 9
1 2 16
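A slightly more compact variant of the same idea, renaming the summed Series instead of rebuilding the frame (purely a stylistic alternative):

df_final = (pd.merge(df1, df2, on='DocID')
              .set_index('DocID')
              .sum(axis=1)          # sum Weight_x and Weight_y row-wise
              .rename('Weight')
              .reset_index())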
df1.set_index('DocID').add(df2.set_index('DocID')).dropna()
Weight
DocID
1 9.0
2 16.0
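If you instead want to keep DocIDs that appear in only one of the frames, treating the missing weight as 0, add accepts a fill_value argument:

>>> df1.set_index('DocID').add(df2.set_index('DocID'), fill_value=0)
       Weight
DocID
1         9.0
2        16.0
3         8.0
8         1.0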
You can also try: pd.merge(df1, df2, on=['DocID']).set_index('DocID').sum(axis=1)
You can then give any name you like to the resulting sum column.
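Since the number of DataFrames depends on user input, a concat-and-groupby approach may generalize more easily than pairwise merges; a minimal sketch, where dfs stands for the user-supplied list of frames:

import pandas as pd

dfs = [df1, df2]  # any number of frames, each with DocID and Weight columns
summed = pd.concat(dfs).groupby('DocID', as_index=False)['Weight'].sum()

Note that, unlike the inner merges above, this keeps DocIDs that appear in only some of the frames; filter afterwards if you only want the intersection.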
Similar to this topic: Add default values while merging tables in pandas
The answer to that topic fills all NaNs in the resulting DataFrame, which is not what I want to do.
Let's imagine the following situation: I have two dataframes, df1 and df2, each of which might contain some NaNs. The columns of df1 are 'a' plus col1, and the columns of df2 are 'a' plus col2, where col1 and col2 are disjoint lists of column names (for example, df1 and df2 could have 'a', 'b', 'c' and 'a', 'd', 'e' as columns, respectively). I want to perform a left merge of df1 and df2 and fill all the missing values created by that merge (any row of df1 whose value in column 'a' does not appear in column 'a' of df2) with a default value. Assume I have a dict default_values that maps each element of col2 to a default value.
To give you a concrete example :
df1
a b c
0 0 0.038108 0.961687
1 1 0.107457 0.616689
2 2 0.661485 0.240353
3 3 0.457169 0.560912
4 5 5.000000 5.000000
df2
a d e
0 0 0.405170 0.934776
1 1 0.684532 0.168738
2 2 0.729693 0.967310
3 3 0.844770 NaN
4 4 0.842673 0.941324
default_values = {'d':42, 'e':43}
Expected Output :
a b c d e
0 0 0.038108 0.961687 0.405170 0.934776
1 1 0.107457 0.616689 0.684532 0.168738
2 2 0.661485 0.240353 0.729693 0.967310
3 3 0.457169 0.560912 0.844770 NaN
4 5 5.000000 5.000000 42 43
While writing this question, I found a working solution. I still think it's an interesting question. Here's a solution to get the expected output :
df3 = pd.DataFrame(default_values,
                   index=df1.set_index('a').index.difference(df2.a))
df3['a'] = df3.index
df1.merge(pd.concat((df2, df3), sort=False))
This solution works for a left/right merge, and it can be extended to work for an outer merge (by completing the first dataframe as well).
Edit: The how='left' argument is not specified in my merge because the DataFrame I'm merging with is constructed to contain every value of column 'a' of df1 in its own column 'a'. We could add how='left' to this call of merge, and it would give the same output.
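Wrapped up as a reusable helper, the same approach might look like this (a sketch; merge_with_defaults, key, and how are names introduced here for illustration):

import pandas as pd

def merge_with_defaults(df1, df2, key, default_values, how='left'):
    # build default rows only for keys present in df1 but missing from df2,
    # so pre-existing NaNs inside df2 are left untouched
    missing = df1.set_index(key).index.difference(df2[key])
    df3 = pd.DataFrame(default_values, index=missing).rename_axis(key).reset_index()
    return df1.merge(pd.concat((df2, df3), sort=False), on=key, how=how)

Calling merge_with_defaults(df1, df2, 'a', default_values) on the data above reproduces the expected output.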
I have two pandas dataframes with names df1 and df2 such that
df1: a b c d
1 2 3 4
5 6 7 8
and
df2: b c
12 13
I want the result to be like
result: b c
2 3
6 7
It should be noted that a, b, c, d are the column names of the pandas dataframes, and that the shapes and values of the two dataframes differ. I want to match the column names of df2 against those of df1 and select, from df1, all rows but only the columns whose names also appear in df2; df2 is used solely to pick the specific columns of df1 while keeping all of its rows. I tried the code given below, but it only returns the matching headers, with no values.
df1.columns.intersection(df2.columns)
The above code does not give me my result; it returns only the index of matching column headers, with no values. I want to write code that takes my two dataframes as input and compares their column headers to make the selection, so I don't have to hard-code column names.
I believe you need:
df = df1[df1.columns.intersection(df2.columns)]
Or, as @Zero pointed out in the comments:
df = df1[df1.columns & df2.columns]
(Note that the & set operation on Index objects is deprecated in newer pandas versions; prefer .intersection there.)
Or, use reindex
In [594]: df1.reindex(columns=df2.columns)
Out[594]:
b c
0 2 3
1 6 7
Also as
In [595]: df1.reindex(df2.columns, axis=1)
Out[595]:
b c
0 2 3
1 6 7
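One caveat with reindex: if df2 had a column that does not exist in df1, reindex would silently create it filled with NaN rather than dropping it, unlike intersection. For example (where 'zz' is a hypothetical column missing from df1):

>>> df1.reindex(columns=['b', 'c', 'zz'])
   b  c  zz
0  2  3 NaN
1  6  7 NaN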
Alternatively to intersection, you can build a boolean mask with isin (note the .loc — plain df1[...] with this mask would try to filter rows rather than columns):
df = df1.loc[:, df1.columns.isin(df2.columns)]
There appears to be a quirk with the pandas merge function. It considers NaN values to be equal, and will merge NaNs with other NaNs:
>>> foo = DataFrame([
['a',1,2],
['b',4,5],
['c',7,8],
[np.NaN,10,11]
], columns=['id','x','y'])
>>> bar = DataFrame([
['a',3],
['c',9],
[np.NaN,12]
], columns=['id','z'])
>>> pd.merge(foo, bar, how='left', on='id')
Out[428]:
id x y z
0 a 1 2 3
1 b 4 5 NaN
2 c 7 8 9
3 NaN 10 11 12
[4 rows x 4 columns]
This is unlike any RDB I've seen: normally missing values are treated as unknown and are not merged together as if they were equal. This is especially problematic for datasets with sparse data (every NaN will be merged to every other NaN, resulting in a huge DataFrame!).
Is there a way to ignore missing values during a merge without first slicing them out?
You could exclude values from bar (and indeed foo if you wanted) where id is null during the merge. Not sure it's what you're after, though, as they are sliced out.
(I've assumed from your left join that you're interested in retaining all of foo, but only want to merge the parts of bar that match and are not null.)
foo.merge(bar[pd.notnull(bar.id)], how='left', on='id')
Out[11]:
id x y z
0 a 1 2 3
1 b 4 5 NaN
2 c 7 8 9
3 NaN 10 11 NaN
If you want to keep the unmatched rows from both tables (while still dropping bar's null ids), you could use an outer join as follows:
pd.merge(foo, bar.dropna(subset=['id']), how='outer', on='id')
It basically returns the union of foo and the non-null rows of bar.
If you do not need the NaN rows in either the left or the right DF, use:
pd.merge(foo.dropna(subset=['id']), bar.dropna(subset=['id']), how='left', on='id')
Otherwise, if you need to keep the NaN rows of the left DF, use:
pd.merge(foo, bar.dropna(subset=['id']), how='left', on='id')
Another approach, which also keeps all rows if performing an outer join:
foo['id'] = foo.id.fillna('missing')
pd.merge(foo, bar, how='left', on='id')
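If the placeholder could collide with a real id, or you want the NaNs back afterwards, a hedged variation is to use a unique sentinel and restore NaN after the merge (sentinel is a name introduced here for illustration):

import numpy as np

sentinel = '__missing__'  # assumed not to occur among the real ids
foo['id'] = foo['id'].fillna(sentinel)
merged = pd.merge(foo, bar, how='left', on='id')
merged['id'] = merged['id'].replace(sentinel, np.nan)  # put the NaNs back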