Change Groupby with Join Spark SQL query to Spark Dataframe - python

I initially wrote my scripts using Spark SQL, but now, for performance and other reasons, I am trying to convert the SQL queries to PySpark DataFrames.
I have an Orders table (OrderID, CustomerID, EmployeeID, OrderDate, ShipperID)
and a Shippers table (ShipperID, ShipperName).
My Spark SQL query lists the number of orders sent by each shipper:
sqlContext.sql("""SELECT Shippers.ShipperName, COUNT(Orders.ShipperID) AS NumberOfOrders
FROM Orders LEFT JOIN Shippers ON Orders.ShipperID = Shippers.ShipperID
GROUP BY ShipperName""")
Now, when I try to replace the above SQL query with the Spark DataFrame API, I write this:
Shippers.join(Orders,["ShipperID"],'left').select(Shippers.ShipperName).groupBy(Shippers.ShipperName).agg(count(Orders.ShipperID).alias("NumberOfOrders"))
But I get an error here, mostly, I feel, because my use of the aggregate count function on the Orders table is wrong.
Below is the error that I get:
"An error occurred while calling {0}{1}{2}.\n".format(target_id, ".", name), value)"
Can someone please help me refactor the above SQL query into a Spark DataFrame expression?

Below is the PySpark operation for your question. The problem in your attempt is most likely that .select(Shippers.ShipperName) drops every other column before the aggregation, so count(Orders.ShipperID) refers to a column that is no longer in the DataFrame:
import pyspark.sql.functions as F
Shippers.alias("s").join(
    Orders.alias("o"),
    on="ShipperID",
    how="left"
).groupby(
    "s.ShipperName"
).agg(
    F.count(F.col("o.OrderID")).alias("NumberOfOrders")
).show()
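If you prefer a translation that keeps the join direction of the original SQL (Orders LEFT JOIN Shippers, counting Orders.ShipperID), a sketch with an explicit join condition could look like this, reusing the pyspark.sql.functions import above and the column names from your schema:
Orders.alias("o").join(
    Shippers.alias("s"),
    F.col("o.ShipperID") == F.col("s.ShipperID"),  # explicit condition keeps both ShipperID columns addressable
    "left"
).groupBy(
    "s.ShipperName"
).agg(
    F.count(F.col("o.ShipperID")).alias("NumberOfOrders")
).show()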

Related

Describe Snowflake table from Azure Databricks

I want to issue a DESC TABLE SQL command for a Snowflake table from Azure Databricks, but I can't quite figure it out! I'm not getting any errors, but I'm not getting any results either. Here's the Python code I'm using:
options_vcp = {
    "sfUrl": snowflake_url,
    "sfUser": user,
    "sfPassword": password,
    "sfDatabase": db,
    "sfWarehouse": wh,
    "sfSchema": sch
}
sfUtils = sc._jvm.net.snowflake.spark.snowflake.Utils
sfUtils.runQuery(options_vcp, "DESC TABLE myTable")
I can download the Snowflake table using the "sfDatabase", "sfWarehouse", etc. values, so they seem to be correct. I can run the DESC TABLE command in Snowflake and get correct results. But the only output I'm getting from Databricks is this:
Out[1]: JavaObject id=o315
Does anyone know how to display this JavaObject or know of a different method to run DESC TABLE from Databricks?
From the docs, Executing DDL/DML SQL Statements:
The runQuery method returns only TRUE or FALSE. It is intended for statements that do not return a result set, for example DDL statements like CREATE TABLE and DML statements like INSERT, UPDATE, and DELETE. It is not useful for statements that return a result set, such as SELECT or SHOW.
An alternative approach is to use the INFORMATION_SCHEMA.COLUMNS view:
df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sfOptions) \
    .option("query", "SELECT * FROM information_schema.columns WHERE table_name ILIKE 'myTable'") \
    .load()
Related: Moving Data from Snowflake to Spark:
When using DataFrames, the Snowflake connector supports SELECT queries only.
Usage Notes
Currently, the connector does not support other types of queries (e.g. SHOW or DESC, or DML statements) when using DataFrames.
I suggest using get_ddl() in your select statement to get the object definition:
https://docs.snowflake.com/en/sql-reference/functions/get_ddl.html
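For example, a minimal sketch reusing the connector's query option and the same source name and connection options as above (GET_DDL returns a single-column result containing the CREATE statement):
ddl_df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
    .options(**options_vcp) \
    .option("query", "SELECT GET_DDL('TABLE', 'myTable') AS ddl") \
    .load()

ddl_df.show(truncate=False)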

reading and converting SQL QUERY to Azure Databricks/Spark python

I am using Azure Databricks and I have the following SQL query that I would like to convert into Spark Python code:
SELECT DISTINCT
    personID,
    SUM(quantity) AS total_shipped
FROM (
    SELECT p.personID,
           p.systemID,
           s.quantity
    FROM shipped s
    LEFT JOIN ordered p
        ON (s.OrderId = p.OrderNumber OR
            substr(s.OrderId, 1, 6) = p.OrderNumber)
        AND p.ndcnum = s.ndc
    WHERE s.Dateshipped <= "2022-04-07"
        AND personID IS NOT NULL
)
GROUP BY personID
I intend to merge the Spark DataFrames first, then perform the aggregated sum. However, I think I am making it more complicated than it is. So far, this is what I have, but I am getting an invalid syntax error:
ordered.join(shipped, ((ordered("OrderId").or(ordered.select(substring(ordered.OrderId, 1, 6)))) === ordered("ORDERNUMBER")) &&
(ordered("ndcnumber") === ordered("ndc")),"left")
.show()
The part I am getting confused is on the OR statement from the SQL query, how do I convert that into a spark python statement?
There is beauty in using Databricks: you can directly use the same SQL by calling spark.sql(""" {your sql query here} """) and you will still get the same results. You can assign it to a variable and you will have a DataFrame.
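For example, a sketch assuming the shipped and ordered DataFrames from your question are registered as temp views:
shipped.createOrReplaceTempView("shipped")
ordered.createOrReplaceTempView("ordered")

total_shipped_df = spark.sql("""
    SELECT personID, SUM(quantity) AS total_shipped
    FROM (
        SELECT p.personID, p.systemID, s.quantity
        FROM shipped s
        LEFT JOIN ordered p
            ON (s.OrderId = p.OrderNumber OR substr(s.OrderId, 1, 6) = p.OrderNumber)
            AND p.ndcnum = s.ndc
        WHERE s.Dateshipped <= '2022-04-07'
            AND personID IS NOT NULL
    ) sub
    GROUP BY personID
""")
total_shipped_df.show()
If you still prefer the DataFrame API, the OR in the join condition can be expressed with the | operator between two column expressions.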

Query based on spark dataframe cell value

I want to run many queries on a table leveraging the Spark framework with python by running them in parallel rather than in sequence.
When I run queries with a for loop, it performs very slowly as (I believe) it's not able to break the job in parallel. For example:
for fieldName in fieldList:
    result = spark.sql(
        "select cast({0} as string) as value, count({0}) as FREQ "
        "from {1} group by {0} order by FREQ desc limit 5".format(fieldName, tableName)
    )
I tried to make a dataframe with a column called 'queryStr' to hold the query, then have a 'RESULTS' column to hold the results with a command:
inputDF = inputDF.withColumn('RESULTS', queryUDF(inputDF.queryStr))
The UDF reads:
resultSchema = ArrayType(StructType([
    StructField('VALUE', StringType(), True),
    StructField('FREQ', IntegerType(), True)
]), True)
queryUDF = udf(lambda queryStr: spark.sql(queryStr).collect(), resultSchema)
I'm using spark version 2.4.0.
My error is:
PicklingError: Could not serialize object: TypeError: 'JavaPackage' object is not callable
So, how do I run these queries in parallel? Or, is there a better way for me to iterate through a large number of queries?
Similar issue as: Trying to execute a spark sql query from a UDF
In short, you cannot perform SQL queries within a UDF: the SparkSession only exists on the driver, not on the executors where UDFs run.
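A common workaround, sketched here assuming fieldList and tableName are defined as in your loop, is to keep spark.sql on the driver and submit the per-field queries from a thread pool so Spark can schedule the resulting jobs concurrently:
from concurrent.futures import ThreadPoolExecutor

def top_values(fieldName):
    # Runs on the driver; each call triggers an independent Spark job.
    return spark.sql(
        "select cast({0} as string) as value, count({0}) as FREQ "
        "from {1} group by {0} order by FREQ desc limit 5".format(fieldName, tableName)
    ).collect()

with ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(zip(fieldList, pool.map(top_values, fieldList)))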

How can I use "where not exists" SQL condition in pyspark?

I have a table on Hive and I am trying to insert data into that table. I am taking data from SQL, but I don't want to insert IDs that already exist in the Hive table. I am trying to use a condition like WHERE NOT EXISTS. I am using PySpark on Airflow.
The EXISTS operator doesn't exist in Spark, but there are two join types that can replace it: left_anti and left_semi.
If, for example, you want to insert a DataFrame df into a Hive table target, you can do:
new_df = df.join(
    spark.table("target"),
    how='left_anti',
    on='id'
)
then you write new_df in your table.
left_anti lets you keep only the rows that do not meet the join condition (the equivalent of NOT EXISTS). The equivalent of EXISTS is left_semi.
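For the EXISTS case, a symmetric sketch with left_semi keeps only the rows of df whose id is already present in target:
existing_df = df.join(
    spark.table("target"),
    how='left_semi',
    on='id'
)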
You can use NOT EXISTS directly with Spark SQL on the DataFrames through temp views:
table_withNull_df.createOrReplaceTempView("table_withNull")
tblA_NoNull_df.createOrReplaceTempView("tblA_NoNull")
result_df = spark.sql("""
    select * from table_withNull
    where not exists (
        select 1 from tblA_NoNull
        where table_withNull.id = tblA_NoNull.id
    )
""")
This method can be preferred to left anti joins since they can cause unexpected BroadcastNestedLoopJoin resulting in a broadcast timeout (even without explicitly requesting the broadcast in the anti join).
After that you can do write.mode("append") to insert the previously not encountered data.
Example taken from here
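For that final append, a minimal sketch (the table name here is only a placeholder for your Hive target table):
result_df.write.mode("append").saveAsTable("my_hive_table")  # placeholder table name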
IMHO, I don't think such an operator exists in Spark. I think you can use two approaches:
A workaround with a UNIQUE constraint (typical of relational DBs): this way, when you try to insert (in append mode) an already existing record, you get an exception that you can handle properly.
Read the table you want to write to, outer join it with the data you want to add, and then write the result in overwrite mode (but I think the first solution may perform better).
For more details feel free to ask

Spark sql version of the same query does not work whereas the normal sql query does

The normal SQL query:
SELECT DISTINCT(county_geoid), state_geoid, sum(PredResponse), sum(prop_count) FROM table_a GROUP BY county_geoid;
gives me an output. However, the Spark SQL version of the same query used in PySpark gives me an error.
result_county_performance_alpha = spark.sql("SELECT distinct(county_geoid), sum(PredResponse), sum(prop_count), state_geoid FROM table_a group by county_geoid")
This gives an error:
AnalysisException: u"expression 'tract_alpha.`state_geoid`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
How do I resolve this issue?
Your "normal" query should not work anywhere. The correct way to write the query is:
SELECT county_geoid, state_geoid, sum(PredResponse), sum(prop_count)
FROM table_a
GROUP BY county_geoid, state_geoid;
This should work on any database (where the columns and tables are defined and of the right types).
Your version has state_geoid in the SELECT, but it is not being aggregated. That is not correct SQL. It might happen to work in MySQL, but that is due to a (mis)feature in the database (that is finally being fixed).
Also, you almost never want to use SELECT DISTINCT with GROUP BY. And the parentheses after DISTINCT make no difference: the construct is SELECT DISTINCT; DISTINCT is not a function.
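In PySpark, the corrected query can then be passed unchanged to spark.sql; a sketch, with the output aliases chosen here only for illustration:
result_county_performance_alpha = spark.sql("""
    SELECT county_geoid,
           state_geoid,
           SUM(PredResponse) AS sum_pred_response,
           SUM(prop_count) AS sum_prop_count
    FROM table_a
    GROUP BY county_geoid, state_geoid
""")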
