How can I write an equivalent statement to the below SQL query?
SELECT ISNULL(df2.id, df1.id) as new_id
FROM dataframe1 df1
LEFT JOIN dataframe2 df2
ON df1.id = df2.id
One such equivalent would be:
df2.set_index('id').combine_first(df1.set_index('id')).reset_index()
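For instance, with small illustrative frames (the val1/val2 column names are made up), the snippet keeps every id from both frames and fills df2's gaps from df1:
import pandas as pd
df1 = pd.DataFrame({'id': [1, 2, 3], 'val1': ['a', 'b', 'c']})
df2 = pd.DataFrame({'id': [1, 2], 'val2': ['x', 'y']})
# id 3 exists only in df1, so its row comes entirely from df1
out = df2.set_index('id').combine_first(df1.set_index('id')).reset_index()
print(out)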
I am trying to join two dataframes with the following data:
df1
df2
I want to join these two dataframes so that if 'col2' of df2 is blank/NULL, the join happens only on 'column1' of df1 and 'col1' of df2; if it is not NULL/blank, the join happens on two conditions, i.e. 'column1' and 'column2' of df1 against 'col1' and 'col2' of df2 respectively.
For reference, the final dataframe that I wish to obtain is:
My current approach is to slice these two dataframes into four and then join them separately based on the condition. Is there a way to do this without slicing them, or maybe a better way that I'm missing?
The idea is to rename the columns, do a left join on both columns first, and then fill the missing values by matching on column1 alone. It is necessary to remove duplicates with DataFrame.drop_duplicates before Series.map so that the values in col1 are unique:
df22 = df2.rename(columns={'col1':'column1','col2':'column2'})
df = df1.merge(df22, on=['column1','column2'], how='left')
s = df2.drop_duplicates('col1').set_index('col1')['col3']
df['col3'] = df['col3'].fillna(df['column1'].map(s))
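As a rough illustration with made-up data (the question's actual tables aren't shown here), this is how the fallback fill behaves:
import pandas as pd
# hypothetical data: col2 of df2 is blank (NaN) for 'B'
df1 = pd.DataFrame({'column1': ['A', 'B'], 'column2': ['x', 'y']})
df2 = pd.DataFrame({'col1': ['A', 'B'], 'col2': ['x', None], 'col3': [1, 2]})
df22 = df2.rename(columns={'col1': 'column1', 'col2': 'column2'})
df = df1.merge(df22, on=['column1', 'column2'], how='left')   # col3 is NaN for 'B'
s = df2.drop_duplicates('col1').set_index('col1')['col3']
df['col3'] = df['col3'].fillna(df['column1'].map(s))          # 'B' is filled via col1 alone
print(df)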
EDIT: A general solution that works with multiple columns - the first part is the same left join; the second part merges on one column and uses DataFrame.combine_first to replace the missing values:
df22 = df2.rename(columns={'col1':'column1','col2':'column2'})
df = df1.merge(df22, on=['column1','column2'], how='left')
df23 = df22.drop_duplicates('column1').drop('column2', axis=1)
df = df.merge(df23, on='column1', how='left', suffixes=('','_'))
cols = df.columns[df.columns.str.endswith('_')]
df = df.combine_first(df[cols].rename(columns=lambda x: x.strip('_'))).drop(cols, axis=1)
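The replacement step can be seen in isolation on a toy frame (illustrative column names; col3_ stands in for any suffixed column produced by the second merge):
import pandas as pd
df = pd.DataFrame({'col3': [1.0, None], 'col3_': [1.0, 2.0]})
cols = df.columns[df.columns.str.endswith('_')]
# fill NaNs in col3 from col3_, then drop the helper column
df = df.combine_first(df[cols].rename(columns=lambda x: x.strip('_'))).drop(cols, axis=1)
print(df)  # col3 becomes [1.0, 2.0]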
I was trying to merge two dataframes using a less-than operator. But I ended up using pandasql.
Is it possible to do the same query below using pandas functions?
(Records may be duplicated, but that is fine, as I'm looking for something similar to a cumulative total later.)
sql = '''select A.Name,A.Code,B.edate from df1 A
inner join df2 B on A.Name = B.Name
and A.Code=B.Code
where A.edate < B.edate '''
df4 = sqldf(sql)
The suggested answer seems similar, but I couldn't get the expected result. Also, the answer below looks very crisp.
Use:
df = df1.merge(df2, on=['Name','Code']).query('edate_x < edate_y')[['Name','Code','edate_y']]
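A minimal sketch with made-up data (column names taken from the query; the real frames aren't shown):
import pandas as pd
df1 = pd.DataFrame({'Name': ['a', 'a'], 'Code': [1, 1],
                    'edate': pd.to_datetime(['2020-01-01', '2020-03-01'])})
df2 = pd.DataFrame({'Name': ['a'], 'Code': [1],
                    'edate': pd.to_datetime(['2020-02-01'])})
# inner merge on the equality keys, then keep only rows where df1's date precedes df2's
out = (df1.merge(df2, on=['Name', 'Code'])
          .query('edate_x < edate_y')[['Name', 'Code', 'edate_y']])
print(out)  # only the 2020-01-01 row survives the filter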
I am wondering if there is a way in Python (within or outside pandas) to do the equivalent of a SQL join of two tables based on multiple complex conditions, such as the value in table 1 being more than 10 less than the value in table 2, or some field in table 1 satisfying certain conditions, etc.
This is for combining some fundamental tables into a joined table with more fields and information. I know that in pandas we can merge two dataframes on some column names, but that mechanism seems too simple to give the desired results.
For example, the equivalent SQL code could be like:
SELECT
a.*,
b.*
FROM Table1 AS a
JOIN Table2 AS b
ON
a.id = b.id AND
a.sales - b.sales > 10 AND
a.country IN ('US', 'MX', 'GB', 'CA')
I would like an equivalent way to achieve the same joined table in Python on two dataframes. Can anyone share insights?
Thanks!
In principle, your query can be rewritten as a plain equi-join followed by a WHERE-clause filter.
SELECT a.*, b.*
FROM Table1 AS a
JOIN Table2 AS b
ON a.id = b.id
WHERE a.sales - b.sales > 10 AND a.country IN ('US', 'MX', 'GB', 'CA')
Assuming the dataframes are gigantic and you don't want a big intermediate table, we can filter DataFrame A first.
import pandas as pd
df_a, df_b = pd.DataFrame(...), pd.DataFrame(...)
# since A.country has nothing to do with the join, we can filter it first.
df_a = df_a[df_a["country"].isin(['US', 'MX', 'GB', 'CA'])]
# join
merged = pd.merge(df_a, df_b, on='id', how='inner')
# filter
merged = merged[merged["sales_x"] - merged["sales_y"] > 10]
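As a rough end-to-end sketch with invented data:
import pandas as pd
df_a = pd.DataFrame({'id': [1, 2], 'sales': [120, 50], 'country': ['US', 'FR']})
df_b = pd.DataFrame({'id': [1, 2], 'sales': [100, 45]})
# pre-filter A on country, join on id, then apply the non-equi condition
df_a = df_a[df_a['country'].isin(['US', 'MX', 'GB', 'CA'])]
merged = pd.merge(df_a, df_b, on='id', how='inner')
merged = merged[merged['sales_x'] - merged['sales_y'] > 10]
print(merged)  # only id 1 passes both the country filter and the sales gap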
Off-topic: depending on the use case, you may want to take abs() of the difference.
I have the following Spark DataFrames:
df1 with columns (id, name, age)
df2 with columns (id, salary, city)
df3 with columns (name, dob)
I want to join all of these Spark data frames using Python. This is the SQL statement I need to replicate.
SQL:
select df1.*,df2.salary,df3.dob
from df1
left join df2 on df1.id=df2.id
left join df3 on df1.name=df3.name
I tried something like the below in PySpark using Python, but I am receiving an error.
joined_df = df1.join(df2,df1.id=df2.id,'left')\
.join(df3,df1.name=df3.name)\
.select(df1.(*),df2(name),df3(dob)
My question: can we join all three DataFrames in one go and select the required columns?
If you have a SQL query that works, why not use pyspark-sql?
First, use pyspark.sql.DataFrame.createOrReplaceTempView() to register your DataFrames as temporary tables:
df1.createOrReplaceTempView('df1')
df2.createOrReplaceTempView('df2')
df3.createOrReplaceTempView('df3')
Now you can access these DataFrames as tables with the names you provided in the argument to createOrReplaceTempView(). Use pyspark.sql.SparkSession.sql() to execute your query:
query = "select df1.*, df2.salary, df3.dob " \
"from df1 " \
"left join df2 on df1.id=df2.id "\
"left join df3 on df1.name=df3.name"
joined_df = spark.sql(query)
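As a minimal, self-contained sketch (assuming an active SparkSession named spark and toy data shaped like the question's columns):
from pyspark.sql import SparkSession, Row
spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([Row(id=1, name='ann', age=30)])
df2 = spark.createDataFrame([Row(id=1, salary=1000, city='NYC')])
df3 = spark.createDataFrame([Row(name='ann', dob='1990-01-01')])
df1.createOrReplaceTempView('df1')
df2.createOrReplaceTempView('df2')
df3.createOrReplaceTempView('df3')
joined_df = spark.sql(
    "select df1.*, df2.salary, df3.dob "
    "from df1 "
    "left join df2 on df1.id = df2.id "
    "left join df3 on df1.name = df3.name"
)
joined_df.show()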
You can leverage col and alias to get the SQL-like syntax to work. Ensure your DataFrames are aliased:
df1 = df1.alias('df1')
df2 = df2.alias('df2')
df3 = df3.alias('df3')
Then the following should work:
from pyspark.sql.functions import col
joined_df = df1.join(df2, col('df1.id') == col('df2.id'), 'left') \
.join(df3, col('df1.name') == col('df3.name'), 'left') \
.select('df1.*', 'df2.salary', 'df3.dob')
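The aliases matter here: string references such as 'df1.*' and col('df1.id') only resolve because each DataFrame was registered under that name via alias(); without the alias() calls, Spark would raise an analysis error about unresolved columns.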
Suppose I have the following dataframes in pySpark:
df1 = sqlContext.createDataFrame([Row(name='john', age=50), Row(name='james', age=25)])
df2 = sqlContext.createDataFrame([Row(name='john', weight=150), Row(name='mike', weight=115)])
df3 = sqlContext.createDataFrame([Row(name='john', age=50, weight=150), Row(name='james', age=25, weight=None), Row(name='mike', age=None, weight=115)])
Now suppose I want to create df3 from joining/merging df1 and df2.
I tried doing
df1.join(df2, df1.name == df2.name, 'outer')
This doesn't quite work, because it produces two name columns. I then need to somehow combine the two name columns so that a missing name in one column is filled in by the value from the other column.
How would I do that? Or is there a better way to create df3 from df1 and df2?
You can use the coalesce function, which returns the first non-null argument.
from pyspark.sql.functions import coalesce
df1 = df1.alias("df1")
df2 = df2.alias("df2")
(df1.join(df2, df1.name == df2.name, 'outer')
.withColumn("name_", coalesce("df1.name", "df2.name"))
.drop("name")
.withColumnRenamed("name_", "name"))
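Exercising this on the question's df1 and df2 would look something like this (a sketch; row order may differ):
result = (df1.join(df2, df1.name == df2.name, 'outer')
    .withColumn("name_", coalesce("df1.name", "df2.name"))
    .drop("name")
    .withColumnRenamed("name_", "name"))
result.show()
# expected: john with age 50 and weight 150, james with weight null, mike with age null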
This is a little late, but there is a simpler solution if someone needs it - just a simple change from the original poster's solution:
df3 = df1.join(df2, 'name', 'outer')
Joining in this way will prevent the duplication of the name column. https://kb.databricks.com/data/join-two-dataframes-duplicated-columns.html
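Passing the join key as a column name (or a list of names) tells Spark that the two columns are the same, so the result keeps a single name column instead of one from each side, and the coalesce/drop steps above are no longer needed.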