Join based on multiple complex conditions in Python

I am wondering whether there is a way in Python (within or outside Pandas) to join two tables the way SQL can, based on multiple complex conditions, such as a value in table 1 being more than 10 less than the corresponding value in table 2, or the join only applying where some field in table 1 satisfies certain conditions, and so on.
The goal is to combine several base tables into a single joined table with more fields and information. I know that in Pandas we can merge two dataframes on some column names, but that mechanism seems too simple to produce the desired result.
For example, the equivalent SQL code could be like:
SELECT
a.*,
b.*
FROM Table1 AS a
JOIN Table2 AS b
ON
a.id = b.id AND
a.sales - b.sales > 10 AND
a.country IN ('US', 'MX', 'GB', 'CA')
I would like an equivalent way to achieve the same joined table in Python from two data frames. Can anyone share insights?
Thanks!

In principle, your query can be rewritten as a plain equi-join followed by a filtering WHERE clause.
SELECT a.*, b.*
FROM Table1 AS a
JOIN Table2 AS b
ON a.id = b.id
WHERE a.sales - b.sales > 10 AND a.country IN ('US', 'MX', 'GB', 'CA')
Assuming the dataframes are gigantic and you don't want a big intermediate table, we can filter Dataframe A first.
import pandas as pd
df_a, df_b = pd.DataFrame(...), pd.DataFrame(...)
# since A.country has nothing to do with the join, we can filter it first.
df_a = df_a[df_a["country"].isin(['US', 'MX', 'GB', 'CA'])]
# join
merged = pd.merge(df_a, df_b, on='id', how='inner')
# filter
merged = merged[merged["sales_x"] - merged["sales_y"] > 10]
Off-topic: depending on the use case, you may want to take the abs() of the difference.
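For instance, a minimal sketch of that variant on the merged frame from above, keeping rows whose sales figures differ by more than 10 in either direction:
# absolute difference instead of a one-sided comparison
merged = merged[(merged["sales_x"] - merged["sales_y"]).abs() > 10]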

Related

Using SQL Minus Operator in python

I want to perform a minus operation like the code below on two tables.
SELECT
column_list_1
FROM
T1
MINUS
SELECT
column_list_2
FROM
T2;
This is after a migration has happened. I have these two databases that I have connected like this:
import cx_Oracle
import pandas as pd
import pypyodbc
source = cx_Oracle.connect(user, password, name)
df = pd.read_sql(""" SELECT * from some_table """, source)
target = pypyodbc.connect(blah, blah, db)
df_2 = pd.read_sql(""" SELECT * from some_table """, target)
How can I run a minus operation on the source and target databases in python using a query?
Choose either one:
Use Python in order to perform a "manual" MINUS operation between the two result sets.
Use Oracle by means of a dblink. In this case, you won't need to open two connections from Python.
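A minimal sketch of the second option, reusing the source connection from the question and assuming a database link named remote_db pointing at the target has been created on the Oracle side (remote_db and the column names col1, col2 are hypothetical):
# The MINUS runs entirely inside Oracle; only the difference rows come back to Python.
diff_df = pd.read_sql(
    """SELECT col1, col2 FROM some_table
       MINUS
       SELECT col1, col2 FROM some_table@remote_db""",
    source,
)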
If you have a DB link then you can do a MINUS, or you can use merge from Pandas.
df = pd.read_sql(""" SELECT * from some_table """, source)
df_2 = pd.read_sql(""" SELECT * from some_table """, target)
df_combine = df.merge(df_2.drop_duplicates(), how='right', indicator=True)
print(df_combine)
A new column _merge will be created in df_combine, containing the values both (row present in both data frames) and right_only (row present only in df_2).
In the same way, you can do a left merge.
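As a sketch of the "manual" MINUS idea with the same two frames: keeping only the left_only rows of a left merge yields the rows of df that are missing from df_2, which is the closest pandas analogue of T1 MINUS T2 (this assumes the column lists of the two frames match):
# rows in df (source) that do not appear in df_2 (target)
df_minus = (
    df.drop_duplicates()
      .merge(df_2.drop_duplicates(), how='left', indicator=True)
      .query("_merge == 'left_only'")
      .drop(columns='_merge')
)
print(df_minus)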

pandas dataframe merge expression using a less-than operator?

I was trying to merge two dataframes using a less-than condition, but I ended up using pandasql.
Is it possible to do the same query below using pandas functions?
(Records may be duplicated, but that is fine, as I'm looking for something similar to a cumulative total later.)
sql = '''select A.Name,A.Code,B.edate from df1 A
inner join df2 B on A.Name = B.Name
and A.Code=B.Code
where A.edate < B.edate '''
df4 = sqldf(sql)
The suggested answer seems similar, but I couldn't get the expected result. Also, the answer below looks very crisp.
Use:
df = df1.merge(df2, on=['Name','Code']).query('edate_x < edate_y')[['Name','Code','edate_y']]
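A small usage sketch with made-up data (the values below are hypothetical) showing how the merge-then-query line reproduces the SQL: the equi-join runs on Name and Code, and the non-equi condition edate_x < edate_y is applied afterwards.
import pandas as pd

df1 = pd.DataFrame({'Name': ['a', 'a'], 'Code': [1, 1], 'edate': ['2020-01-01', '2020-03-01']})
df2 = pd.DataFrame({'Name': ['a', 'a'], 'Code': [1, 1], 'edate': ['2020-02-01', '2020-04-01']})

# equi-join first, then filter on the less-than condition, then keep the wanted columns
out = (df1.merge(df2, on=['Name', 'Code'])
          .query('edate_x < edate_y')[['Name', 'Code', 'edate_y']])
print(out)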

How to avoid duplicated columns after join operation?

In Scala it's easy to avoid duplicate columns after a join operation:
df1.join(df2, Seq("id"), "left").show()
However, is there a similar solution in PySpark? If I do df1.join(df2, df1["id"] == df2["id"], "left").show() in PySpark, I get two id columns...
You have 3 options:
1. Use an outer join
aDF.join(bDF, "id", "outer").show()
2. Use aliasing: you will lose the data related to B-specific ids with this.
from pyspark.sql.functions import col
aDF.alias("a").join(bDF.alias("b"), aDF.id == bDF.id, "outer").drop(col("b.id")).show()
3. Use drop to remove the duplicated columns after the join
columns_to_drop = ['ida', 'idb']  # names of the duplicate columns to remove
df = df.drop(*columns_to_drop)
Let me know if that helps.
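As another sketch (not part of the answer above): passing the join key as a column name, as in the Scala Seq("id") form, also leaves a single id column in PySpark:
# joining on the column name (or a list of names) keeps only one id column
joined = aDF.join(bDF, on="id", how="left")
joined.show()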

Joining multiple data frames in one statement and selecting only required columns

I have the following Spark DataFrames:
df1 with columns (id, name, age)
df2 with columns (id, salary, city)
df3 with columns (name, dob)
I want to join all of these Spark data frames using Python. This is the SQL statement I need to replicate.
SQL:
select df1.*,df2.salary,df3.dob
from df1
left join df2 on df1.id=df2.id
left join df3 on df1.name=df3.name
I tried something like the code below in PySpark using Python, but I am receiving an error.
joined_df = df1.join(df2,df1.id=df2.id,'left')\
.join(df3,df1.name=df3.name)\
.select(df1.(*),df2(name),df3(dob)
My question: can we join all three DataFrames in one go and select only the required columns?
If you have a SQL query that works, why not use pyspark-sql?
First use pyspark.sql.DataFrame.createOrReplaceTempView() to register your DataFrame as a temporary table:
df1.createOrReplaceTempView('df1')
df2.createOrReplaceTempView('df2')
df3.createOrReplaceTempView('df3')
Now you can access these DataFrames as tables with the names you provided in the argument to createOrReplaceTempView(). Use pyspark.sql.SparkSession.sql() to execute your query:
query = "select df1.*, df2.salary, df3.dob " \
"from df1 " \
"left join df2 on df1.id=df2.id "\
"left join df3 on df1.name=df3.name"
joined_df = spark.sql(query)
You can leverage col and alias to get the SQL-like syntax to work. Ensure your DataFrames are aliased:
df1 = df1.alias('df1')
df2 = df2.alias('df2')
df3 = df3.alias('df3')
Then the following should work:
from pyspark.sql.functions import col
joined_df = df1.join(df2, col('df1.id') == col('df2.id'), 'left') \
.join(df3, col('df1.name') == col('df3.name'), 'left') \
.select('df1.*', 'df2.salary', 'df3.dob')

How do you perform basic joins of two RDD tables in Spark using Python?

How would you perform basic joins in Spark using Python? In R you could use merge() to do this. What is the syntax, using Python on Spark, for:
Inner Join
Left Outer Join
Cross Join
Assume two tables (RDDs), each with a single value column and a common key:
RDD(1):(key,U)
RDD(2):(key,V)
I think an inner join is something like this:
rdd1.join(rdd2).map(case (key, u, v) => (key, ls ++ rs));
Is that right? I have searched the internet and can't find a good example of joins. Thanks in advance.
It can be done using either PairRDDFunctions or Spark DataFrames. Since DataFrame operations benefit from the Catalyst optimizer, the second option is worth considering.
Assuming your data looks as follows:
rdd1 = sc.parallelize([("foo", 1), ("bar", 2), ("baz", 3)])
rdd2 = sc.parallelize([("foo", 4), ("bar", 5), ("bar", 6)])
With PairRDDs:
Inner join:
rdd1.join(rdd2)
Left outer join:
rdd1.leftOuterJoin(rdd2)
Cartesian product (doesn't require RDD[(T, U)]):
rdd1.cartesian(rdd2)
Broadcast join (doesn't require RDD[(T, U)]):
see Spark: what's the best strategy for joining a 2-tuple-key RDD with single-key RDD?
Finally there is cogroup which has no direct SQL equivalent but can be useful in some situations:
cogrouped = rdd1.cogroup(rdd2)
cogrouped.mapValues(lambda x: (list(x[0]), list(x[1]))).collect()
## [('foo', ([1], [4])), ('bar', ([2], [5, 6])), ('baz', ([3], []))]
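For comparison, collecting the plain inner join on the same sample RDDs shows the (key, (value1, value2)) pair structure (a sketch; the ordering of the pairs may differ):
rdd1.join(rdd2).collect()
## e.g. [('foo', (1, 4)), ('bar', (2, 5)), ('bar', (2, 6))]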
With Spark DataFrames
You can use either the DataFrame DSL or execute raw SQL using spark.sql.
df1 = spark.createDataFrame(rdd1, ('k', 'v1'))
df2 = spark.createDataFrame(rdd2, ('k', 'v2'))
# Register temporary tables to be able to use `sparkSession.sql`
df1.createOrReplaceTempView('df1')
df2.createOrReplaceTempView('df2')
Inner join:
# inner is a default value so it could be omitted
df1.join(df2, df1.k == df2.k, how='inner')
spark.sql('SELECT * FROM df1 JOIN df2 ON df1.k = df2.k')
Left outer join:
df1.join(df2, df1.k == df2.k, how='left_outer')
spark.sql('SELECT * FROM df1 LEFT OUTER JOIN df2 ON df1.k = df2.k')
Cross join (in Spark 2.x an explicit cross join is required, or spark.sql.crossJoin.enabled must be set):
df1.crossJoin(df2)
spark.sql('SELECT * FROM df1 CROSS JOIN df2')
The older, implicit form (pre-2.0) is:
df1.join(df2)
sqlContext.sql('SELECT * FROM df1 JOIN df2')
Since 1.6 (1.5 in Scala), each of these can be combined with the broadcast function:
from pyspark.sql.functions import broadcast
df1.join(broadcast(df2), df1.k == df2.k)
to perform a broadcast join. See also: Why my BroadcastHashJoin is slower than ShuffledHashJoin in Spark
