Spark Driver stuck when using different windows - python

I'm having problems with Spark (Spark 3.0.1) when trying to use window functions over different windows.
There are no errors, but the driver gets stuck when trying to show or write the dataframe (no tasks start in the Spark UI).
The problem happens both in YARN and local mode.
Example:
select
col_1,
col_2,
col_3,
sum(x) over (partition by col_1, col_2) s,
max(x) over (partition by col_1, col_3) m
from table
My work-around so far has been to split the window functions into separate queries and persist the intermediate results to Parquet. Something like this:
df = spark.sql("select col_1, col_2, col_3, sum(x) over (partition by col_1, col_2) s from table")
df.write.parquet(path)
df = spark.read.parquet(path)
df.createOrReplaceTempView("table")
df = spark.sql("select *, max(x) over (partition by col_1, col_3) m from table")
But I don't want to keep repeating this work-around.
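A small helper that packages this persist-and-reload workaround could look like the sketch below (the checkpoint paths and the helper name are made up):
def run_with_checkpoint(spark, query, path, view_name="table"):
    # run one window query, persist it to parquet, and re-register the result
    df = spark.sql(query)
    df.write.mode("overwrite").parquet(path)
    df = spark.read.parquet(path)
    df.createOrReplaceTempView(view_name)
    return df

df = run_with_checkpoint(spark,
    "select col_1, col_2, col_3, sum(x) over (partition by col_1, col_2) s from table",
    "/tmp/window_step_1")
df = run_with_checkpoint(spark,
    "select *, max(x) over (partition by col_1, col_3) m from table",
    "/tmp/window_step_2")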

Related

Using SQL Minus Operator in python

I want to perform a minus operation like the code below on two tables.
SELECT
column_list_1
FROM
T1
MINUS
SELECT
column_list_2
FROM
T2;
This is after a migration has happened. I have two databases that I have connected like this:
import cx_Oracle
import pandas as pd
import pypyodbc
source = cx_Oracle.connect(user, password, name)
df = pd.read_sql(""" SELECT * from some_table """, source)
target = pypyodbc.connect(blah, blah, db)
df_2 = pd.read_sql(""" SELECT * from some_table """, target)
How can I run a minus operation on the source and target databases in python using a query?
Choose either one:
Use Python in order to perform a "manual" MINUS operation between the two result sets.
Use Oracle by means of a dblink. In this case, you won't need to open two connections from Python.
If you have a DB link then you can do a MINUS, or you can use merge from pandas.
df = pd.read_sql(""" SELECT * from some_table """, source)
df_2 = pd.read_sql(""" SELECT * from some_table """, target)
df_combine = df.merge(df_2.drop_duplicates(), how='right', indicator=True)
print(df_combine)
There will be a new column _merge created in df_combine which will contain the values both (row present in both data frames) and right_only (row present only in df_2).
In the same way, you can do a left merge and look for left_only rows, which are present only in df.
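For the actual MINUS semantics (rows that are in the source but missing from the target), the same indicator trick can be turned around; a minimal sketch, assuming the two frames have the same comparable columns:
# rows present in df (source) but not in df_2 (target)
df_minus = df.merge(df_2.drop_duplicates(), how='left', indicator=True)
df_minus = df_minus[df_minus['_merge'] == 'left_only'].drop(columns='_merge')
print(df_minus)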

How to select all columns except 2 of them from a large table on pyspark sql?

When joining two tables, I would like to select all columns except 2 of them from a large table with many columns, using PySpark SQL on Databricks.
My pyspark sql:
%sql
set hive.support.quoted.identifiers=none;
select a.*, '?!(b.year|b.month)$).+'
from MY_TABLE_A as a
left join
MY_TABLE_B as b
on a.year = b.year and a.month = b.month
I followed
hive:select all column exclude two
Hive How to select all but one column?
but it does not work for me; all columns are in the results.
I would like to remove the duplicated columns (year and month in the result).
Thanks.
set hive.support.quoted.identifiers=none is not supported in Spark.
Instead, in Spark set spark.sql.parser.quotedRegexColumnNames=true to get the same behavior as Hive.
Example:
df = spark.createDataFrame([(1, 2, 3, 4)], ['id', 'a', 'b', 'c'])
df.createOrReplaceTempView("tmp")
spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true")
# select all columns except a, b
spark.sql("select `(a|b)?+.+` from tmp").show()
#+---+---+
#| id| c|
#+---+---+
#| 1| 4|
#+---+---+
As of Databricks Runtime 9.0, you can use the * except() syntax like this:
df = spark.sql("select a.* except(col1, col2, col3) from my_table_a...")
or if just using %sql as in your example
select a.* except(col1, col2, col3) from my_table_a...
In pyspark, you can do something like this:
df.select([col for col in df.columns if col not in {'col1', 'col2', 'col3'}])
where df is the resulting dataframe after the join operation is performed.
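Putting that together with the join from the question, one way to avoid the ambiguity of the duplicated year/month names is to alias both sides and build the column list explicitly; a rough sketch, assuming the tables are registered as in the question:
a = spark.table("MY_TABLE_A").alias("a")
b = spark.table("MY_TABLE_B").alias("b")
joined = a.join(b, (a.year == b.year) & (a.month == b.month), "left")
# keep everything from a, plus every b column except the duplicated join keys
cols = ["a.*"] + ["b." + c for c in b.columns if c not in {"year", "month"}]
joined.select(*cols).show()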

Join based on multiple complex conditions in Python

I am wondering if there is a way in Python (within or outside pandas) to do the equivalent of a SQL join of two tables on multiple complex conditions, such as the value in table 1 differing from the value in table 2 by more than 10, or joining only on rows of table 1 that satisfy some condition, etc.
This is for combining some fundamental tables to achieve a joint table with more fields and information. I know in Pandas, we can merge two dataframes on some column names, but such a mechanism seems to be too simple to give the desired results.
For example, the equivalent SQL code could be like:
SELECT
a.*,
b.*
FROM Table1 AS a
JOIN Table2 AS b
ON
a.id = b.id AND
a.sales - b.sales > 10 AND
a.country IN ('US', 'MX', 'GB', 'CA')
I would like an equivalent way to achieve the same joined table in Python on two data frames. Can anyone share insights?
Thanks!
In principle, your query could be rewritten as a join plus a filtering WHERE clause.
SELECT a.*, b.*
FROM Table1 AS a
JOIN Table2 AS b
ON a.id = b.id
WHERE a.sales - b.sales > 10 AND a.country IN ('US', 'MX', 'GB', 'CA')
Assuming the dataframes are gigantic and you don't want a big intermediate table, we can filter DataFrame A first.
import pandas as pd
df_a, df_b = pd.DataFrame(...), pd.DataFrame(...)
# since A.country has nothing to do with the join, we can filter it first.
df_a = df_a[df_a["country"].isin(['US', 'MX', 'GB', 'CA'])]
# join
merged = pd.merge(df_a, df_b, on='id', how='inner')
# filter
merged = merged[merged["sales_x"] - merged["sales_y"] > 10]
Off-topic: depending on the use case, you may want to take the abs() of the difference.
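The sales_x and sales_y names above come from pandas' default merge suffixes; if you want clearer names, you can set the suffixes explicitly (same frames as above):
# explicit suffixes instead of the default _x / _y
merged = pd.merge(df_a, df_b, on='id', how='inner', suffixes=('_a', '_b'))
merged = merged[merged['sales_a'] - merged['sales_b'] > 10]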

Joining multiple data frames in one statement and selecting only required columns

I have the following Spark DataFrames:
df1 with columns (id, name, age)
df2 with columns (id, salary, city)
df3 with columns (name, dob)
I want to join all of these Spark data frames using Python. This is the SQL statement I need to replicate.
SQL:
select df1.*,df2.salary,df3.dob
from df1
left join df2 on df1.id=df2.id
left join df3 on df1.name=df3.name
I tried something like the code below in PySpark using Python, but I am receiving an error.
joined_df = df1.join(df2,df1.id=df2.id,'left')\
.join(df3,df1.name=df3.name)\
.select(df1.(*),df2(name),df3(dob)
My question: Can we join all the three DataFrames in one go and select the required columns?
If you have a SQL query that works, why not use pyspark-sql?
First use pyspark.sql.DataFrame.createOrReplaceTempView() to register your DataFrames as temporary tables:
df1.createOrReplaceTempView('df1')
df2.createOrReplaceTempView('df2')
df3.createOrReplaceTempView('df3')
Now you can access these DataFrames as tables with the names you provided in the argument to createOrReplaceTempView(). Use pyspark.sql.SparkSession.sql() to execute your query:
query = "select df1.*, df2.salary, df3.dob " \
"from df1 " \
"left join df2 on df1.id=df2.id "\
"left join df3 on df1.name=df3.name"
joined_df = spark.sql(query)
You can leverage col and alias to get the SQL-like syntax to work. Ensure your DataFrames are aliased:
df1 = df1.alias('df1')
df2 = df2.alias('df2')
df3 = df3.alias('df3')
Then the following should work:
from pyspark.sql.functions import col
joined_df = df1.join(df2, col('df1.id') == col('df2.id'), 'left') \
.join(df3, col('df1.name') == col('df3.name'), 'left') \
.select('df1.*', 'df2.salary', 'df3.dob')

How to use Pandas.DataFrame as input of SQL query?

I am trying to use a Pandas.DataFrame as the intermediate result dataset between two consecutive SQL queries.
I imagine it looks like:
import pandas.io.sql as pisql
import pyodbc
SQL_command1 = """
select * from tab_A
"""
result = pisql.read_frame(SQL_command1)
SQL_command2 = """
select *
from ? A
inner join B
on A.id = B.id
"""
pyodbc.cursor.execute(SQL_command2, result)
The SQL_command2 above is just pseudo code, where ? takes the result as input and is given the alias A.
This is my first time using Pandas, so I'm not confident whether my idea is feasible or efficient. Can anyone enlighten me, please?
Many thanks.
The equivalent pandas code would look like this:
import pandas as pd
df_a = pd.read_csv('tab_a.csv') #or read_sql or other read engine
df_b = pd.read_csv('tab_b.csv')
result = pd.merge(left=df_a,
right=df_b,
how='inner',
on='id') # assuming 'id' is in both tables
And to select columns of a pandas dataframe, it would be something like df_a[['col1','col2','col3']].
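If the two inputs really come from a database, as in the question, the same merge works on frames loaded with read_sql; a sketch, where the connection string and the second table name (B) are assumptions:
import pandas as pd
import pyodbc

# hypothetical connection; substitute your own DSN or connection string
conn = pyodbc.connect("DSN=mydsn")

# load both tables, then do the join in pandas rather than in SQL
df_a = pd.read_sql("select * from tab_A", conn)
df_b = pd.read_sql("select * from B", conn)
result = pd.merge(left=df_a, right=df_b, how='inner', on='id')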
