I am using Azure Databricks and I have the following SQL query that I would like to convert into Spark Python code:
SELECT DISTINCT
    personID,
    SUM(quantity) AS total_shipped
FROM (
    SELECT p.personID,
           p.systemID,
           s.quantity
    FROM shipped s
    LEFT JOIN ordered p
      ON (s.OrderId = p.OrderNumber OR
          substr(s.OrderId, 1, 6) = p.OrderNumber)
     AND p.ndcnum = s.ndc
    WHERE s.Dateshipped <= "2022-04-07"
      AND personID IS NOT NULL
)
GROUP BY personID
I intend to merge the Spark DataFrames first, then perform the aggregated sum. However, I think I am making it more complicated than it is. So far this is what I have, but I am getting a syntax error:
ordered.join(shipped, ((ordered("OrderId").or(ordered.select(substring(ordered.OrderId, 1, 6)))) === ordered("ORDERNUMBER")) &&
(ordered("ndcnumber") === ordered("ndc")),"left")
.show()
The part I am getting confused about is the OR condition in the SQL query: how do I convert that into a Spark Python statement?
There is beauty in using Databricks: you can directly run the same code by calling spark.sql(""" {your sql query here} """) and you will get the same results. Assign it to a variable and you will have a DataFrame.
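For example, a minimal sketch of both approaches (assuming the shipped and ordered DataFrames are already loaded, and taking the column names from the query above):

from pyspark.sql import functions as F

# Option 1: register the DataFrames as temp views and run the SQL as-is
shipped.createOrReplaceTempView("shipped")
ordered.createOrReplaceTempView("ordered")
result = spark.sql("""
    SELECT personID, SUM(quantity) AS total_shipped
    FROM shipped s
    LEFT JOIN ordered p
      ON (s.OrderId = p.OrderNumber OR substr(s.OrderId, 1, 6) = p.OrderNumber)
     AND p.ndcnum = s.ndc
    WHERE s.Dateshipped <= '2022-04-07' AND personID IS NOT NULL
    GROUP BY personID
""")

# Option 2: the DataFrame API equivalent. In PySpark, OR is | and AND is &,
# and each comparison must be wrapped in its own parentheses.
joined = shipped.join(
    ordered,
    ((shipped["OrderId"] == ordered["OrderNumber"]) |
     (F.substring(shipped["OrderId"], 1, 6) == ordered["OrderNumber"])) &
    (ordered["ndcnum"] == shipped["ndc"]),
    "left",
)
result = (joined
          .where((F.col("Dateshipped") <= "2022-04-07") & F.col("personID").isNotNull())
          .groupBy("personID")
          .agg(F.sum("quantity").alias("total_shipped")))
result.show()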
I want to issue a DESC TABLE SQL command for a Snowflake table from Azure Databricks, but I can't quite figure it out! I'm not getting any errors, but I'm not getting any results either. Here's the Python code I'm using:
options_vcp = {
"sfUrl": snowflake_url,
"sfUser": user,
"sfPassword": password,
"sfDatabase": db,
"sfWarehouse": wh,
"sfSchema": sch
}
sfUtils = sc._jvm.net.snowflake.spark.snowflake.Utils
sfUtils.runQuery(options_vcp, "DESC TABLE myTable")
I can download the Snowflake table using the "sfDatabase", "sfWarehouse", etc. values, so they seem to be correct. I can run the DESC TABLE command in Snowflake and get correct results. But the only output I'm getting from Databricks is this:
Out[1]: JavaObject id=o315
Does anyone know how to display this JavaObject or know of a different method to run DESC TABLE from Databricks?
From the docs on Executing DDL/DML SQL Statements:
The runQuery method returns only TRUE or FALSE. It is intended for statements that do not return a result set, for example DDL statements like CREATE TABLE and DML statements like INSERT, UPDATE, and DELETE. It is not useful for statements that return a result set, such as SELECT or SHOW.
An alternative approach is to use the INFORMATION_SCHEMA.COLUMNS view:
df = (spark.read.format(SNOWFLAKE_SOURCE_NAME)
      .options(**sfOptions)
      .option("query", "SELECT * FROM information_schema.columns WHERE table_name ILIKE 'myTable'")
      .load())
Related: Moving Data from Snowflake to Spark:
When using DataFrames, the Snowflake connector supports SELECT queries only.
Usage Notes
Currently, the connector does not support other types of queries (e.g. SHOW or DESC, or DML statements) when using DataFrames.
I suggest using get_ddl() in your select statement to get the object definition:
https://docs.snowflake.com/en/sql-reference/functions/get_ddl.html
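A sketch of that approach, reusing the options_vcp dict and SNOWFLAKE_SOURCE_NAME from above (the ddl alias is just for readability):
ddl_df = (spark.read.format(SNOWFLAKE_SOURCE_NAME)
          .options(**options_vcp)
          .option("query", "SELECT GET_DDL('TABLE', 'myTable') AS ddl")
          .load())
ddl_df.show(truncate=False)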
I am trying to join a second table (PageLikes) onto a first table (PageVisits) after selecting only distinct values in one column of the first table, using the Python ORM peewee.
In pure SQL I can do this:
SELECT DISTINCT(pagevisits.visitor_id), pagelikes.liked_item FROM pagevisits
INNER JOIN pagelikes on pagevisits.visitor_id = pagelikes.user_id
In peewee with Python I have tried:
query = (Pagevisits.select(
             fn.Distinct(Pagevisits.visitor_id),
             PageLikes.liked_item)
         .join(PageLikes))
This gives me an error:
distinct() takes 1 positional argument but 2 were given
The only way I have been able to use distinct with peewee is like this:
query = (Pagevisits.select(
             Pagevisits.visitor_id,
             PageLikes.liked_item)
         .distinct())
which does not seem to work for my scenario.
So how can I select only distinct values in one table based on one column before I join another table with peewee?
I don't believe you should be encountering an error using fn.DISTINCT() in that way. I'm curious to see the full traceback. In my testing locally, I have no problems running something like:
query = (PageVisits
.select(fn.DISTINCT(PageVisits.visitor_id), PageLikes.liked_item)
.join(PageLikes))
Which produces SQL equivalent to what you're after. I'm using the latest peewee code btw.
As Papooch suggested, calling distinct on the Model seems to work:
distinct_visitors = (Pagevisits
.select(
Pagevisits.visitor_id.distinct().alias("visitor")
)
.where(Pagevisits.page_id == "Some specifc page")
.alias('distinct_visitors')
)
query = (Pagelikes
.select(fn.Count(Pagelikes.liked_item),
)
.join(distinct_visitors, on=(distinct_visitors.c.visitor == Pagelikes.user_id))
.group_by(Pagelikes.liked_item)
)
I am querying tables, but I get different results using two approaches, and I would like to understand the reason.
I created a table using a Delta location. I want to query the data that I stored in that location. I'm using Amazon S3.
I created the table like this:
spark.sql("CREATE TABLE bronze_client_trackingcampaigns.TRACKING_BOUNCES (ClientID INT, SendID INT, SubscriberKey STRING) USING DELTA LOCATION 's3://example/bronze/client/trackingcampaigns/TRACKING_BOUNCES/delta'")
I want to query the data using the following line:
spark.sql("SELECT count(*) FROM bronze_client_trackingcampaigns.TRACKING_BOUNCES")
But the result is not right: it should be 41832, but it returns 1.
When I ran the same query a different way:
spark.read.option("header", True).option("inferSchema", True).format("delta").table("bronze_client_trackingcampaigns.TRACKING_BOUNCES").count()
I obtained the result 41832.
I want both approaches to return the same result.
The 1 you got back is the number of rows in the returned DataFrame (a single row containing the count), not the count itself. Change the SQL statement to:
df = spark.sql("SELECT COUNT(*) FROM bronze_client_trackingcampaigns.TRACKING_BOUNCES")
df.show()
You should now get the same result.
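If you want the count as a plain Python value rather than a displayed table, a small follow-up (first() returns the single result row):
count_value = df.first()[0]
print(count_value)  # should be 41832 per the question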
I initially wrote my scripts using Spark SQL, but now, for performance and other reasons, I am trying to convert the SQL queries to PySpark DataFrames.
I have an Orders table (OrderID, CustomerID, EmployeeID, OrderDate, ShipperID)
and a Shippers table (ShipperID, ShipperName).
My Spark SQL query lists the number of orders sent by each shipper:
sqlContext.sql("SELECT Shippers.ShipperName, COUNT(Orders.ShipperID) AS NumberOfOrders
FROM Orders LEFT JOIN Shippers ON Orders.ShipperID = Shippers.ShipperID
GROUP BY ShipperName")
Now when I try to replace the above SQL query with a Spark DataFrame operation, I write this:
Shippers.join(Orders,["ShipperID"],'left').select(Shippers.ShipperName).groupBy(Shippers.ShipperName).agg(count(Orders.ShipperID).alias("NumberOfOrders"))
But I get an error here, mostly, I feel, because the aggregate count of ShipperID from the Orders table is wrong.
Below is the error that I get:
"An error occurred while calling {0}{1}{2}.\n".format(target_id, ".", name), value)"
Can someone please help me refactor the above SQL query into a Spark DataFrame operation?
Below is the PySpark operation for your question:
import pyspark.sql.functions as F

# Left-join Orders onto Shippers on the shared ShipperID column,
# then count order rows per shipper name
Shippers.alias("s").join(
    Orders.alias("o"),
    on="ShipperID",
    how="left"
).groupby(
    "s.ShipperName"
).agg(
    F.count(F.col("o.OrderID")).alias("NumberOfOrders")
).show()
I created a dataframe of type pyspark.sql.dataframe.DataFrame by executing the following line:
dataframe = sqlContext.sql("select * from my_data_table")
How can I convert this back to a sparksql table that I can run sql queries on?
You can create your table by using createOrReplaceTempView. In your case it would be:
dataframe.createOrReplaceTempView("mytable")
After this you can query your mytable using SQL.
If your Spark version is ≤ 1.6.2, you can use registerTempTable instead.
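For example, a quick end-to-end sketch on a recent Spark version (my_data_table and mytable are just the names from this question):
dataframe = sqlContext.sql("select * from my_data_table")
dataframe.createOrReplaceTempView("mytable")
sqlContext.sql("SELECT COUNT(*) FROM mytable").show()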