Similar to this topic: Add default values while merging tables in pandas
The answer to that topic fills all NaNs in the resulting DataFrame, and that's not what I want to do.
Let's imagine the following situation: I have two DataFrames, df1 and df2. Each of them may contain some NaN. The columns of df1 are 'a' plus col1, and the columns of df2 are 'a' plus col2, where col1 and col2 are disjoint lists of column names (for example, df1 and df2 could have 'a', 'b', 'c' and 'a', 'd', 'e' as column names, respectively). I want to perform a left merge of df1 and df2 and fill all the values missing after that merge (any row of df1 whose value in column 'a' does not appear in column 'a' of df2) with a default value. Assume I have a dict default_values that maps each element of col2 to a default value.
To give you a concrete example:
df1
a b c
0 0 0.038108 0.961687
1 1 0.107457 0.616689
2 2 0.661485 0.240353
3 3 0.457169 0.560912
4 5 5.000000 5.000000
df2
a d e
0 0 0.405170 0.934776
1 1 0.684532 0.168738
2 2 0.729693 0.967310
3 3 0.844770 NaN
4 4 0.842673 0.941324
default_values = {'d':42, 'e':43}
Expected output:
a b c d e
0 0 0.038108 0.961687 0.405170 0.934776
1 1 0.107457 0.616689 0.684532 0.168738
2 2 0.661485 0.240353 0.729693 0.967310
3 3 0.457169 0.560912 0.844770 NaN
4 5 5.000000 5.000000 42 43
While writing this question, I found a working solution. I still think it's an interesting question. Here's a solution that produces the expected output:
# Build a frame of default rows for the 'a' values present in df1 but missing from df2
df3 = pd.DataFrame(default_values,
                   index=df1.set_index('a').index.difference(df2.a))
df3['a'] = df3.index
# Completing df2 with these default rows makes the plain merge behave like a left merge
df1.merge(pd.concat((df2, df3), sort=False))
This solution works for a left/right merge, and it can be extended to work for an outer merge (by completing the first dataframe as well).
Edit: the how='left' argument is not specified in my merge because the DataFrame I'm merging with is constructed so that its own column 'a' contains all the values of column 'a' in df1. We could add how='left' to this call of merge and it would give the same output.
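For completeness, here's a rough sketch of that outer-merge extension with the example data above (the df1_defaults values and the helper name complete are made up for illustration, not part of the original solution):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [0, 1, 2, 3, 5],
                    'b': [0.038108, 0.107457, 0.661485, 0.457169, 5.0],
                    'c': [0.961687, 0.616689, 0.240353, 0.560912, 5.0]})
df2 = pd.DataFrame({'a': [0, 1, 2, 3, 4],
                    'd': [0.405170, 0.684532, 0.729693, 0.844770, 0.842673],
                    'e': [0.934776, 0.168738, 0.967310, np.nan, 0.941324]})
default_values = {'d': 42, 'e': 43}   # defaults for df2's columns
df1_defaults = {'b': -1, 'c': -1}     # made-up defaults for df1's columns

def complete(df, other, defaults, key='a'):
    # Append rows for key values present in `other` but missing from `df`,
    # filled with the given default values.
    missing = other.set_index(key).index.difference(df[key])
    filler = pd.DataFrame(defaults, index=missing)
    filler[key] = filler.index
    return pd.concat((df, filler), sort=False)

# Completing both sides before merging makes the plain merge behave like an outer merge.
result = complete(df1, df2, df1_defaults).merge(complete(df2, df1, default_values))
print(result)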
I have two dataframes.
feelingsDF with columns 'feeling', 'count', 'code'.
countryDF with columns 'feeling', 'countryCount'.
How do I make another dataframe that takes the columns from countryDF and combines it with the code column in feelingsDF?
I'm guessing you would need to somehow use the shared 'feeling' column to combine them and make sure the same code matches the same feeling.
I want the three columns to appear as:
[feeling][countryCount][code]
You are joining the two dataframes by the column 'feeling'. Assuming you only want the entries in 'feeling' that are common to both dataframes, you would want to do an inner join.
Here is a similar example with two dfs:
x = pd.DataFrame({'feeling': ['happy', 'sad', 'angry', 'upset', 'wow'], 'col1': [1,2,3,4,5]})
y = pd.DataFrame({'feeling': ['okay', 'happy', 'sad', 'not', 'wow'], 'col2': [20,23,44,10,15]})
x.merge(y,how='inner', on='feeling')
Output:
feeling col1 col2
0 happy 1 23
1 sad 2 44
2 wow 5 15
To drop the 'count' column, select the other columns of feelingsDF, and then sort by the 'countryCount' column. Note that this will leave your index out of order, but you can reindex the combined_df afterwards.
combined_df = feelingsDF[['feeling', 'code']].merge(countryDF, how='inner', on='feeling').sort_values('countryCount')
# To reset the index after sorting:
combined_df = combined_df.reset_index(drop=True)
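For illustration, here's a self-contained sketch with made-up sample data that uses the question's column names (the actual values are just placeholders):
import pandas as pd

feelingsDF = pd.DataFrame({'feeling': ['happy', 'sad', 'angry'],
                           'count': [10, 5, 1],
                           'code': ['X', 'Y', 'Z']})
countryDF = pd.DataFrame({'feeling': ['sad', 'happy'],
                          'countryCount': [7, 3]})

# Drop 'count' by selecting only 'feeling' and 'code', inner-join on 'feeling',
# sort by 'countryCount', rebuild a clean index, and order the columns as requested.
combined_df = (feelingsDF[['feeling', 'code']]
               .merge(countryDF, how='inner', on='feeling')
               .sort_values('countryCount')
               .reset_index(drop=True)
               [['feeling', 'countryCount', 'code']])
print(combined_df)
#   feeling  countryCount code
# 0   happy             3    X
# 1     sad             7    Y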
You can join two dataframes using pd.merge. Assuming that you want to join on the feeling column, you can use:
df= pd.merge(feelingsDF, countryDF, on='feeling', how='left')
See documentation for pd.merge to understand how to use the on and how parameters.
feelingsDF = pd.DataFrame([{'feeling': 1, 'count': 10, 'code': 'X'},
                           {'feeling': 2, 'count': 5, 'code': 'Y'},
                           {'feeling': 3, 'count': 1, 'code': 'Z'}])
feeling count code
0 1 10 X
1 2 5 Y
2 3 1 Z
countryDF = pd.DataFrame([{'feeling':1,'country':'US'},{'feeling':2,'country':'UK'},{'feeling':3,'country':'DE'}])
feeling country
0 1 US
1 2 UK
2 3 DE
df= pd.merge(feelingsDF, countryDF, on='feeling', how='left')
feeling count code country
0 1 10 X US
1 2 5 Y UK
2 3 1 Z DE
I have two DataFrames built with pandas in Python.
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Index': ['0', '1', '2'], 'number': [3, 'dd', 1], 'people': [3, 's', 3]})
df1 = df1.set_index('Index')
df2 = pd.DataFrame({'Index': ['0', '1', '2'], 'quantity': [3, 2, 'hi'], 'persons': [1, 5, np.nan]})
I would like to sum the columns' values by Index. The columns do not have the same names and may contain strings (I actually have 50 columns in each df). I want to treat NaN as 0. The result should look like this:
df3
Index column 1 column 2
0 6 4
1 nan nan
2 nan nan
I was wondering how this could be done.
Note:
For sure, a double while or for loop would do the trick, it's just not very elegant...
# (sketch: df3 would need to be pre-allocated and numbercolumns set to the number of columns)
indices = 0
while indices < len(df1.index):
    columna = 0
    while columna < numbercolumns:
        df3.iloc[indices, columna] = df1.iloc[indices, columna] + df2.iloc[indices, columna]
        columna += 1
    indices += 1
Thank you.
You can try concatenating both DataFrames and then summing based on the index groups:
# Align df2's column names with df1's, coerce non-numeric entries to NaN, then sum per index value
df2 = df2.set_index('Index')
df2.columns = df1.columns
df1 = df1.apply(pd.to_numeric, errors='coerce')
df2 = df2.apply(pd.to_numeric, errors='coerce')
pd.concat([df1, df2]).groupby('Index').sum()
Out:
       number  people
Index
0         6.0     4.0
1         2.0     5.0
2         1.0     3.0
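A slightly different, self-contained sketch (not part of the answer above) that avoids the concat: coerce both frames to numeric and use DataFrame.add, which aligns rows on the index. Note that, like the grouped sum, this treats non-numeric entries as 0 rather than producing the NaN rows shown in the expected output:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Index': ['0', '1', '2'],
                    'number': [3, 'dd', 1],
                    'people': [3, 's', 3]}).set_index('Index')
df2 = pd.DataFrame({'Index': ['0', '1', '2'],
                    'quantity': [3, 2, 'hi'],
                    'persons': [1, 5, np.nan]}).set_index('Index')

a = df1.apply(pd.to_numeric, errors='coerce')   # strings become NaN
b = df2.apply(pd.to_numeric, errors='coerce')
b.columns = a.columns                           # align column names so add() pairs the columns
result = a.add(b, fill_value=0)                 # index-aligned sum, treating a one-sided NaN as 0
print(result)
#        number  people
# Index
# 0         6.0     4.0
# 1         2.0     5.0
# 2         1.0     3.0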
I have two pandas dataframes with names df1 and df2 such that
df1: a b c d
1 2 3 4
5 6 7 8
and
df2: b c
12 13
I want the result be like
result: b c
2 3
6 7
It should be noted here that a, b, c, d are the column names of the DataFrames. The shapes and values of the two DataFrames are different. I want to match the column names of df2 against the column names of df1 and select all rows of df1 for the columns whose names appear in df2. df2 is only used to select specific columns of df1 while keeping all of its rows. I tried the code below, but it gives me an empty index.
df1.columns.intersection(df2.columns)
The above code does not give me my result; it returns the column headers with no values. I want to write code that takes my two DataFrames as input and compares their column headers to make the selection, so I don't have to hard-code column names.
I believe you need:
df = df1[df1.columns.intersection(df2.columns)]
Or, as #Zero pointed out in the comments:
df = df1[df1.columns & df2.columns]
(Note: in recent pandas versions the & set operation on Index objects is deprecated; .intersection is the supported spelling.)
Or, use reindex
In [594]: df1.reindex(columns=df2.columns)
Out[594]:
b c
0 2 3
1 6 7
Also as
In [595]: df1.reindex(df2.columns, axis=1)
Out[595]:
b c
0 2 3
1 6 7
Alternatively to intersection, build a boolean mask over the columns with isin and select with loc:
df = df1.loc[:, df1.columns.isin(df2.columns)]
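Putting it together with the frames from the question, a minimal reproduction could look like this:
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame([[12, 13]], columns=['b', 'c'])

# Keep only the columns of df1 that also appear in df2, preserving every row of df1.
result = df1[df1.columns.intersection(df2.columns)]
print(result)
#    b  c
# 0  2  3
# 1  6  7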
I have two dataframes as follows:
DF1
A B C
1 2 3
4 5 6
7 8 9
DF2
Match Values
1 a,d
7 b,c
I want to match DF1['A'] with DF2['Match'] and append DF2['Values'] to DF1 if the value exists
So my result will be:
A B C Values
1 2 3 a,d
7 8 9 b,c
I tried using the following code to match the values, but it returns an empty dataframe.
df1 = df1[df1['A'].isin(df2['Match'])]
Any help would be appreciated.
Instead of doing a lookup, you can do this in one step by merging the dataframes:
pd.merge(df1, df2, how='inner', left_on='A', right_on='Match')
Specify how='inner' if you only want records that appear in both, how='left' if you want all of df1's data.
If you want to keep only the Values column:
pd.merge(df1, df2.set_index('Match')['Values'].to_frame(), how='inner', left_on='A', right_index=True)
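A self-contained sketch with the sample frames from the question, dropping the now-redundant 'Match' column after the merge:
import pandas as pd

df1 = pd.DataFrame({'A': [1, 4, 7], 'B': [2, 5, 8], 'C': [3, 6, 9]})
df2 = pd.DataFrame({'Match': [1, 7], 'Values': ['a,d', 'b,c']})

# Inner merge keeps only the rows of df1 whose 'A' appears in df2['Match'].
result = pd.merge(df1, df2, how='inner', left_on='A', right_on='Match').drop(columns='Match')
print(result)
#    A  B  C Values
# 0  1  2  3    a,d
# 1  7  8  9    b,c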
I am merging two data frames using pandas.merge. Even after specifying the how='left' option, the merged data frame has more rows than the original. Why does this happen?
panel = pd.read_csv(file1, encoding ='cp932')
before_len = len(panel)
prof_2000 = pd.read_csv(file2, encoding ='cp932').drop_duplicates()
temp_2000 = pd.merge(panel, prof_2000, left_on='Candidate_u', right_on="name2", how="left")
after_len = len(temp_2000)
print(before_len, after_len)
> 12661 13915
This sounds like the right frame has more than one row under 'name2' matching a given key from the left. Using the option how='left' with pandas.DataFrame.merge() only means that:
left: use only keys from left frame
However, the actual number of rows in the result object is not necessarily going to be the same as the number of rows in the left object.
Example:
In [359]: df_1
Out[359]:
A B
0 a AAA
1 b BBA
2 c CCF
and then another DF that looks like this (notice that there is more than one entry for the desired key 'a'):
In [360]: df_3
Out[360]:
key value
0 a 1
1 a 2
2 b 3
3 a 4
If I merge these two on left.A, here's what happens:
In [361]: df_1.merge(df_3, how='left', left_on='A', right_on='key')
Out[361]:
A B key value
0 a AAA a 1.0
1 a AAA a 2.0
2 a AAA a 4.0
3 b BBA b 3.0
4 c CCF NaN NaN
This happened even though I merged with how='left'. As you can see above, there was simply more than one matching row on the right, and as shown here the resulting pd.DataFrame has more rows than the pd.DataFrame on the left.
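To check whether this is what is happening with your data, you could count how often each key occurs on the right (using the variable names from your snippet, assuming prof_2000 is loaded as shown there):
# Any key that appears more than once in 'name2' will multiply the matching
# rows of the left frame in a how='left' merge.
dup_counts = prof_2000['name2'].value_counts()
print(dup_counts[dup_counts > 1])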
I hope this helps!
The problem of rows doubling after a merge() of any kind ('inner', 'left', etc.) is usually caused by duplicates in either of the keys, so we need to drop them first:
left_df.drop_duplicates(subset=left_key, inplace=True)
right_df.drop_duplicates(subset=right_key, inplace=True)
If you do not have any duplicates as described in the answer above, you should double-check the names of the unmatched entries. In my case, I discovered that those names were written inconsistently between df1 and df2, and I solved the problem with:
df1["col1"] = df2["col2"]