Is there any problem with pandas merge currently? - python

Using outer join to merge two tables. Let's say
df1 = ['productID', 'Name']
df2 = ['userID', 'productID', 'usage']
I tried to use outer join with merge function in pandas.
pd.merge(df1, df2[['userID','productID', 'usage']], on='productID', how = 'outer')
However, the error message I got is
'productID' is both an index level and a column label, which is ambiguous.
I googled this error message and found an open issue: https://github.com/facebook/prophet/issues/891
Any solution to my problem?

The error means the DataFrame's index has the same name as a column, here productID:
#check it
print (df2.index.name)
The solution is to remove or rename the index name, e.g. with DataFrame.rename_axis:
pd.merge(df1, df2.rename_axis(None)[['userID','productID', 'usage']],
on='productID', how = 'outer')
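A minimal, self-contained sketch of the problem and the fix, using made-up data that matches the column layout described in the question:

```python
import pandas as pd

# Hypothetical data matching the question's columns.
df1 = pd.DataFrame({'productID': [1, 2], 'Name': ['A', 'B']})
df2 = pd.DataFrame({'userID': [10, 11], 'productID': [1, 3], 'usage': [5, 7]})

# Recreate the ambiguity: give df2's index the same name as a column.
df2.index.name = 'productID'

# pd.merge(df1, df2, on='productID', how='outer')  # raises:
# ValueError: 'productID' is both an index level and a column label ...

# Dropping the index name with rename_axis resolves the ambiguity.
merged = pd.merge(df1, df2.rename_axis(None)[['userID', 'productID', 'usage']],
                  on='productID', how='outer')
print(merged)
```

The outer join keeps all productID values from both frames, filling the non-matching side with NaN.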

Related

Receiving a TypeError when using the merge() method with Pandas

I'm trying to merge two data frames on a column with an int data type
df3 = df2.merge('df1', how = 'inner', on = 'ID')
But I receive this error
TypeError: Can only merge Series or DataFrame objects, a <class 'str'> was passed
I do not understand what is causing this, so any help would be appreciated!
The way you have written it, you are calling merge with the string 'df1'; to the interpreter this looks like trying to merge a DataFrame with the literal text 'df1'. Remove the quotation marks and pass df1 as an object.
You need to pass the variable 'df1' reference directly, not as a string:
df3 = df2.merge(df1, how = 'inner', on = 'ID')
Alternatively you can pass both dataframes as a parameter:
df3 = pd.merge(df1, df2, how = 'inner', on = 'ID')
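A short sketch with made-up data (the column 'ID' and frame contents are assumed) showing both the failing string call and the working object call:

```python
import pandas as pd

# Toy frames sharing the key column 'ID'.
df1 = pd.DataFrame({'ID': [1, 2], 'x': ['a', 'b']})
df2 = pd.DataFrame({'ID': [1, 2], 'y': [10, 20]})

try:
    df2.merge('df1', how='inner', on='ID')  # string, not the DataFrame
except TypeError as e:
    print(e)  # Can only merge Series or DataFrame objects ...

# Pass the DataFrame object itself, not its name in quotes.
df3 = df2.merge(df1, how='inner', on='ID')
print(df3)
```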

Pandas Merge Vlookup, KeyError: "['Value'] not in index"

I am trying to perform a vlookup/merge between two dataframes. I get the error
KeyError: "['Player'] not in index"
I've tried to reindex the columns, but it doesn't seem to work.
df1= df1.reindex(columns = ['Player','Category'])
My current code is:
missingnames = pd.merge(df1,df2[['Player','Player Name']],on='Player',how = 'left')
My dataframes (screenshots of df1, df2, and the expected output are not reproduced here): df1 has columns 'Player' and 'Category'; df2 has columns 'Player Name', 'Height' and 'Weight'.
Can anyone help with this?
Thanks.
You can do it like this:
df1['Exists'] = df1['Player'].str.lower().isin(df2['Player Name'].str.lower())
Look closer at your merge arguments and the columns in each dataframe:
df1 includes "Player" and "Category"
df2 includes "Player Name", "Height" and "Weight"
Your merge argument says that the column "Player" is in df2, but it is not:
missingnames = pd.merge(df1,df2[['Player','Player Name']],on='Player',how = 'left')
===============================^
missingnames needs to be changed to:
missingnames = df1.merge(df2,left_on='Player',right_on='Player Name',how = 'left')
And then from there ascertain if there are any missing values
Use numpy.where (requires importing numpy):
import numpy as np
df1['Exists'] = np.where(df1['Player'].str.upper().isin(df2['Player Name'].str.upper()), 'Exists', '')
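A runnable sketch of the left_on/right_on fix, with invented player names standing in for the question's screenshots:

```python
import pandas as pd

# Hypothetical frames matching the columns described in the question.
df1 = pd.DataFrame({'Player': ['Messi', 'Salah'], 'Category': ['FW', 'FW']})
df2 = pd.DataFrame({'Player Name': ['Messi', 'Kane'],
                    'Height': [170, 188], 'Weight': [72, 86]})

# The key column names differ between frames, so use left_on/right_on.
missingnames = df1.merge(df2, left_on='Player', right_on='Player Name',
                         how='left')
print(missingnames)

# Rows where 'Player Name' is NaN had no match in df2.
unmatched = missingnames[missingnames['Player Name'].isna()]
print(unmatched['Player'].tolist())
```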

Merge values in next row pandas

I am trying to merge two dataframes:
Here are the tables I am trying to merge and the result I want to achieve (screenshots not reproduced here). I want the merged values not to end up on the same row.
When I try to merge like that:
pd.merge(a,b,how='inner', on=['date_install','device_os'])
it merges the values into one row.
Please suggest how this can be resolved.
You can use append() (deprecated since pandas 1.4 and removed in 2.0, so prefer concat()):
df1.append(df2, ignore_index = True)
or by using concat():
pd.concat([df,df1], ignore_index = True)
or using merge():
df.merge(df1, how = 'outer')
Reference taken from: https://stackoverflow.com/a/24391268/3748167 (probable duplicate)
The only difference here is concatenating both dataframes first and then applying rest of the logic.
def sjoin(x):
    return ';'.join(x[x.notnull()].astype(str))

pd.concat([a,b]).groupby(level=0, axis=1).apply(lambda x: x.apply(sjoin, axis=1))
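A minimal demonstration of the concat approach, using invented column names based on the merge keys mentioned in the question (date_install, device_os):

```python
import pandas as pd

# Toy frames sharing key columns; the metric columns are assumptions.
a = pd.DataFrame({'date_install': ['2020-01-01'], 'device_os': ['ios'],
                  'installs': [5]})
b = pd.DataFrame({'date_install': ['2020-01-01'], 'device_os': ['ios'],
                  'revenue': [9.5]})

# concat stacks the rows vertically instead of merging them side by side,
# so each source frame's values stay on their own row.
stacked = pd.concat([a, b], ignore_index=True)
print(stacked)
```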

Key error when joining dfs in Pandas

I have a dataframe with these columns:
df1:
Index(['cnpj', '#CNAE', 'Estado', 'Capital_Social', '#CNAEpai', '#CNAEvo',
'#CNAEbisavo', 'Porte'],
dtype='object')
I have another dataframe with these columns:
df2:
Index(['#CNAEpai', 'ROA_t12_Peers_CNAEpai', 'MgBruta_t12_Peers_CNAEpai',
'MgEBITDA_t12_Peers_CNAEpai', 'LiqCorrente_t12_Peers_CNAEpai',
'Crescimento_t12_Peers_CNAEpai', 'MgLucro_t12_Peers_CNAEpai',
'Custo/Receita_t12_Peers_CNAEpai', 'Passivo/EBITDA_t12_Peers_CNAEpai',
'ROE_t12_Peers_CNAEpai', 'RFinanceiro/Receita_t12_Peers_CNAEpai',
'cnpj_t12_Peers_CNAEpai', 'LiqGeral_t12_Peers_CNAEpai'],
dtype='object')
I'm trying to join them, using this line:
df1=df1.join(df2,on=['#CNAEpai'],how='left',rsuffix='_bbb')
But I'm getting this error:
KeyError: '#CNAEpai'
Since #CNAEpai is a column in both dfs that shouldn't be happening right?
What's going on?
As #root indicated, pd.DataFrame.join joins index-on-index or index-on-column, but not column-on-column.
To join on column(s), use pd.DataFrame.merge; note that merge takes a suffixes pair rather than join's lsuffix/rsuffix:
df1 = df1.merge(df2, on='#CNAEpai', how='left', suffixes=('', '_bbb'))
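A sketch of the column-on-column merge with two of the question's columns and made-up values (the other columns are omitted):

```python
import pandas as pd

# '#CNAEpai' is a regular column in both frames, not an index level.
df1 = pd.DataFrame({'#CNAEpai': [1, 2], 'cnpj': ['a', 'b']})
df2 = pd.DataFrame({'#CNAEpai': [1, 3], 'ROA_t12_Peers_CNAEpai': [0.1, 0.2]})

# join() would look for '#CNAEpai' in df2's index and raise KeyError;
# merge joins column-on-column directly.
out = df1.merge(df2, on='#CNAEpai', how='left', suffixes=('', '_bbb'))
print(out)
```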

Outer join Spark dataframe with non-identical join column and then merge join column

Suppose I have the following dataframes in pySpark:
df1 = sqlContext.createDataFrame([Row(name='john', age=50), Row(name='james', age=25)])
df2 = sqlContext.createDataFrame([Row(name='john', weight=150), Row(name='mike', weight=115)])
df3 = sqlContext.createDataFrame([Row(name='john', age=50, weight=150), Row(name='james', age=25, weight=None), Row(name='mike', age=None, weight=115)])
Now suppose I want to create df3 from joining/merging df1 and df2.
I tried doing
df1.join(df2, df1.name == df2.name, 'outer')
This doesn't quite work exactly because it produces two name columns. I need to then somehow combine the two name columns so that missing names from one name column are filled in by the missing name from the other name column.
How would I do that? Or is there a better way to create df3 from df1 and df2?
You can use the coalesce function, which returns the first non-null argument.
from pyspark.sql.functions import coalesce
df1 = df1.alias("df1")
df2 = df2.alias("df2")
(df1.join(df2, df1.name == df2.name, 'outer')
.withColumn("name_", coalesce("df1.name", "df2.name"))
.drop("name")
.withColumnRenamed("name_", "name"))
This is a little late, but there is a simpler solution if someone needs it. Just a simple change from original poster's solution:
df1.join(df2, 'name', 'outer')
df3 = df1.join(df2, ['name'], 'outer')
Joining in this way will prevent the duplication of the name column. https://kb.databricks.com/data/join-two-dataframes-duplicated-columns.html
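For readers without a Spark session at hand, the same idea can be sketched in pandas: merging on the shared key column keeps a single name column, just like passing the column name as a list to Spark's join.

```python
import pandas as pd

# pandas analogue of the Spark example, same toy data.
df1 = pd.DataFrame({'name': ['john', 'james'], 'age': [50, 25]})
df2 = pd.DataFrame({'name': ['john', 'mike'], 'weight': [150, 115]})

# Merging on the shared 'name' column yields one name column,
# analogous to df1.join(df2, ['name'], 'outer') in pySpark.
df3 = pd.merge(df1, df2, on='name', how='outer')
print(df3)
```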
