import pandas as pd
df1 = pd.read_csv("sdvsdvsvsd.csv")
df2 = pd.read_csv("dsvsdvdv.csv")
df3 = df1.join(df2, how='inner', left_on = 'TIME', right_on = 'TIME')
I created a join, but when I run it I get an "unexpected keyword argument" error. I checked it multiple times and can't see any mistake.
Beginner here, please help.
Use pd.merge(df1, df2, how='inner', left_on='TIME', right_on='TIME') instead.
DataFrame.join doesn't accept left_on or right_on keyword arguments; it always joins against the index of the other DataFrame.
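If it helps, here is a minimal sketch of two equivalent fixes, assuming both files really do share a TIME column (the file names are just the placeholders from the question):
import pandas as pd

df1 = pd.read_csv("sdvsdvsvsd.csv")
df2 = pd.read_csv("dsvsdvdv.csv")

# Option 1: merge on the shared column (same name on both sides, so on= is enough).
df3 = pd.merge(df1, df2, how='inner', on='TIME')

# Option 2: .join works against the index, so set TIME as the index first.
df3_alt = df1.set_index('TIME').join(df2.set_index('TIME'), how='inner')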
Solved: one column name was missing its quotation marks (" "). Thanks everyone.
Related
I'm trying to merge two data frames on a column with an int data type:
df3 = df2.merge('df1', how = 'inner', on = 'ID')
But I receive this error
TypeError: Can only merge Series or DataFrame objects, a <class 'str'> was passed
I do not understand what is causing this, so any help would be appreciated!
The way you have written it, you are asking merge to combine df2 with the string 'df1'. To the interpreter that looks like merging a DataFrame with the literal text 'df1'. Remove the quotation marks and pass df1 as an object.
You need to pass the df1 variable directly, not as a string:
df3 = df2.merge(df1, how='inner', on='ID')
Alternatively, you can pass both DataFrames to pd.merge:
df3 = pd.merge(df1, df2, how='inner', on='ID')
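For a quick sanity check, here is a runnable sketch with made-up data (the ID values and columns are hypothetical):
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'name': ['a', 'b', 'c']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'score': [10, 20, 30]})

# Passing df1 as an object (not the string 'df1') merges normally.
df3 = df2.merge(df1, how='inner', on='ID')
print(df3)  # rows with ID 2 and 3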
I'm using an outer join to merge two tables. Let's say the columns are:
df1 = ['productID', 'Name']
df2 = ['userID', 'productID', 'usage']
I tried to use outer join with merge function in pandas.
pd.merge(df1, df2[['userID','productID', 'usage']], on='productID', how = 'outer')
However, the error message I got is
'productID' is both an index level and a column label, which is ambiguous.
I googled this error message and found an open issue: https://github.com/facebook/prophet/issues/891
Any solution to my problem?
The error means the index has the same name as the column productID:
#check it
print (df2.index.name)
The solution is to remove or rename the index name, e.g. with DataFrame.rename_axis:
pd.merge(df1, df2.rename_axis(None)[['userID','productID', 'usage']],
on='productID', how = 'outer')
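Here is a minimal sketch of how the error arises and how rename_axis(None) clears it, using made-up data:
import pandas as pd

# Hypothetical data: df2's index is (accidentally) named 'productID',
# and it also has a 'productID' column, which is what triggers the error.
df1 = pd.DataFrame({'productID': [1, 2], 'Name': ['pen', 'ink']})
df2 = pd.DataFrame({'userID': [10, 11], 'productID': [1, 2], 'usage': [5, 7]})
df2.index.name = 'productID'

# pd.merge(df1, df2, on='productID', how='outer')  # raises the ambiguity error

# Dropping the index name resolves the ambiguity.
result = pd.merge(df1, df2.rename_axis(None), on='productID', how='outer')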
I have two dataframes and I'd like to join them on a couple of columns. However, my join logic has an 'OR' in it, e.g. I want to join on columns ['A','B','C'] OR ['A','B','D']. I have the following code to join on one set of columns, but how can I add the second set?
pd.merge(df1,df2, how='inner',left_on = ['A','B','C'], right_on = ['A','B','C'])
Try this: since left_on and right_on are the same, just use on for each key set and concatenate the results:
d_1 = pd.merge(df1,df2, how='inner', on = ['A','B','C'])
d_2 = pd.merge(df1,df2, how='inner', on = ['A','B','D'])
d_3 = pd.concat([d_1,d_2]).drop_duplicates()
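A small self-contained sketch of that pattern with hypothetical frames:
import pandas as pd

df1 = pd.DataFrame({'A': [1, 1], 'B': [2, 2], 'C': [3, 9], 'D': [4, 4], 'x': ['p', 'q']})
df2 = pd.DataFrame({'A': [1, 1], 'B': [2, 2], 'C': [3, 8], 'D': [4, 4], 'y': ['r', 's']})

d_1 = pd.merge(df1, df2, how='inner', on=['A', 'B', 'C'])  # matches on A, B, C
d_2 = pd.merge(df1, df2, how='inner', on=['A', 'B', 'D'])  # matches on A, B, D
d_3 = pd.concat([d_1, d_2]).drop_duplicates()              # keep each matched row once
Note that columns not used as keys in a given merge pick up _x/_y suffixes, so you may want to rename or drop them before the concat.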
Why does the following fail with KeyError 'NUM'?
result = pandas.merge(sdf_subset, dfgeom, how='inner', on=['ID', 'NUM'])
The column 'ID' exists in sdf_subset and 'NUM' exists in dfgeom. I have checked the datatype and both are Int64.
Any ideas?
# You need to use left_on and right_on when the joining key has a different name in each DataFrame; on= requires the column to exist in both frames.
result = pandas.merge(sdf_subset, dfgeom, how='inner', left_on='ID', right_on='NUM')
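A minimal sketch with hypothetical data showing why on=['ID', 'NUM'] raises the KeyError and how left_on/right_on work instead:
import pandas as pd

# Hypothetical frames: the key column is called ID on one side and NUM on the other.
sdf_subset = pd.DataFrame({'ID': [1, 2, 3], 'a': ['x', 'y', 'z']})
dfgeom = pd.DataFrame({'NUM': [2, 3, 4], 'b': [10, 20, 30]})

# on=['ID', 'NUM'] fails because each column exists in only one frame.
result = pd.merge(sdf_subset, dfgeom, how='inner', left_on='ID', right_on='NUM')
# Both key columns are kept; drop one if you don't need it.
result = result.drop(columns='NUM')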
Suppose I have the following dataframes in pySpark:
from pyspark.sql import Row
df1 = sqlContext.createDataFrame([Row(name='john', age=50), Row(name='james', age=25)])
df2 = sqlContext.createDataFrame([Row(name='john', weight=150), Row(name='mike', weight=115)])
df3 = sqlContext.createDataFrame([Row(name='john', age=50, weight=150), Row(name='james', age=25, weight=None), Row(name='mike', age=None, weight=115)])
Now suppose I want to create df3 from joining/merging df1 and df2.
I tried doing
df1.join(df2, df1.name == df2.name, 'outer')
This doesn't quite work because it produces two name columns. I then need to somehow combine them so that missing names in one column are filled in from the other.
How would I do that? Or is there a better way to create df3 from df1 and df2?
You can use the coalesce function, which returns the first non-null argument:
from pyspark.sql.functions import coalesce
df1 = df1.alias("df1")
df2 = df2.alias("df2")
(df1.join(df2, df1.name == df2.name, 'outer')
.withColumn("name_", coalesce("df1.name", "df2.name"))
.drop("name")
.withColumnRenamed("name_", "name"))
This is a little late, but there is a simpler solution if someone needs it: just a small change to the original poster's code:
df1.join(df2, 'name', 'outer')
df3 = df1.join(df2, ['name'], 'outer')
Joining in this way will prevent the duplication of the name column. https://kb.databricks.com/data/join-two-dataframes-duplicated-columns.html
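For completeness, a small self-contained sketch of the single-key join using the SparkSession API (the data is taken from the question; the SparkSession setup is an assumption):
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([Row(name='john', age=50), Row(name='james', age=25)])
df2 = spark.createDataFrame([Row(name='john', weight=150), Row(name='mike', weight=115)])

# Passing the key as a list keeps a single 'name' column in the result.
df3 = df1.join(df2, ['name'], 'outer')
df3.show()  # john/james/mike rows, with nulls where one side is missing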