I want to run many queries on a table leveraging the Spark framework with python by running them in parallel rather than in sequence.
When I run queries with a for loop, it performs very slowly as (I believe) it's not able to break the job in parallel. For example:
for fieldName in fieldList:
    result = spark.sql(
        "select cast({0} as string) as value, count({0}) as FREQ "
        "from {1} group by {0} order by FREQ desc limit 5".format(fieldName, tableName)
    )
I tried to make a dataframe with a column called 'queryStr' to hold the query, then have a 'RESULTS' column to hold the results with a command:
inputDF = inputDF.withColumn('RESULTS', queryUDF(inputDF.queryStr))
The UDF reads:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StructType, StructField, StringType, IntegerType

resultSchema = ArrayType(StructType([
    StructField('VALUE', StringType(), True),
    StructField('FREQ', IntegerType(), True)
]), True)
queryUDF = udf(lambda queryStr: spark.sql(queryStr).collect(), resultSchema)
I'm using spark version 2.4.0.
My error is:
PicklingError: Could not serialize object: TypeError: 'JavaPackage' object is not callable
So, how do I run these queries in parallel? Or, is there a better way for me to iterate through a large number of queries?
Similar issue as: Trying to execute a spark sql query from a UDF
In short, you cannot perform SQL queries within a UDF: the SparkSession only exists on the driver, so spark.sql cannot be called from code that runs on the executors.
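One driver-side workaround, sketched here under the assumption that spark, fieldList and tableName are defined as in the question (this is not part of the linked answer), is to keep every spark.sql call on the driver and overlap the jobs with a thread pool; Spark's scheduler can run jobs submitted from separate threads concurrently:

from concurrent.futures import ThreadPoolExecutor

def top_values(fieldName):
    # Build and run one query; spark.sql works here because this runs on the driver.
    return spark.sql(
        "select cast({0} as string) as value, count({0}) as FREQ "
        "from {1} group by {0} order by FREQ desc limit 5".format(fieldName, tableName)
    ).collect()

# Each call blocks in its own thread while Spark schedules the underlying jobs concurrently.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(zip(fieldList, pool.map(top_values, fieldList)))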
I am using Azure Databricks and I have the following SQL query that I would like to convert into Spark Python code:
SELECT DISTINCT
    personID,
    SUM(quantity) AS total_shipped
FROM (
    SELECT p.personID,
           p.systemID,
           s.quantity
    FROM shipped s
    LEFT JOIN ordered p
        ON (s.OrderId = p.OrderNumber OR
            substr(s.OrderId, 1, 6) = p.OrderNumber)
        AND p.ndcnum = s.ndc
    WHERE s.Dateshipped <= "2022-04-07"
      AND personID IS NOT NULL
)
GROUP BY personID
I intend to merge the Spark DataFrames first, then perform the aggregated sum. However, I think I am making it more complicated than it is. So far, this is what I have, but I am getting an InvalidSyntax error:
ordered.join(shipped, ((ordered("OrderId").or(ordered.select(substring(ordered.OrderId, 1, 6)))) === ordered("ORDERNUMBER")) &&
(ordered("ndcnumber") === ordered("ndc")),"left")
.show()
The part I am getting confused about is the OR statement from the SQL query: how do I convert that into a Spark Python statement?
That is the beauty of using Databricks: you can directly use the same code by calling spark.sql(""" {your sql query here} """) and you will still get the same results. You can assign it to a variable and you will have a DataFrame.
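If you do want the DataFrame version, the OR in the join condition is just the | operator between two column comparisons. A minimal sketch, assuming the DataFrames are named shipped and ordered and the column names match the query (each comparison needs its own parentheses when combined with | and &):

from pyspark.sql import functions as F

result = (
    shipped.alias("s")
    .join(
        ordered.alias("p"),
        ((F.col("s.OrderId") == F.col("p.OrderNumber"))
         | (F.substring(F.col("s.OrderId"), 1, 6) == F.col("p.OrderNumber")))
        & (F.col("p.ndcnum") == F.col("s.ndc")),
        "left",
    )
    .where((F.col("s.Dateshipped") <= "2022-04-07") & F.col("p.personID").isNotNull())
    .groupBy("p.personID")
    .agg(F.sum("s.quantity").alias("total_shipped"))
)
result.show()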
I'd like to know the difference between query()'s return value and query().result()'s.
In the BigQuery Python client library:
bigquery_client = bigquery.Client()
myQuery = "SELECT * FROM `mytable`"
## NOTE: This query result has just 1 row.

job = bigquery_client.query(myQuery)
for row in job:
    val1 = row

result = job.result()
for row in result:
    val2 = row

print(job == result)  # False. I know QueryJob object is different to RowIterator object.
print(val1 == val2)   # True
Why are val1 and val2 equivalent?
Can the values be different for a very large query?
This is a self-answer, written a year later.
Basically, 'job' and 'result' are different in my code.
bigquery_client.query() returns QueryJob instance.
( See https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html#google.cloud.bigquery.client.Client.query )
But the QueryJob class has its own __iter__ method, which returns iter(self.result()).
( See https://github.com/googleapis/python-bigquery/blob/main/google/cloud/bigquery/job/query.py#L1778 )
So, for the for-in loop, 'job' becomes an iterator over result().
Thus, job != result but val1 == val2.
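A minimal sketch of that equivalence, reusing the job object from the question:

# Iterating the job goes through QueryJob.__iter__, which calls result() internally;
# iterating job.result() uses the RowIterator directly. Both yield the same rows.
rows_via_job = [dict(row) for row in job]
rows_via_result = [dict(row) for row in job.result()]
assert rows_via_job == rows_via_result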
The query() method is used to execute a SQL query.
BigQuery saves all query results to a table, which is either permanent or temporary. After a query finishes, the temporary table exists for up to 24 hours; if you write the results to a new table, it becomes a permanent table.
When writing query results to a permanent table, you can use the Python code that contains the result() method, i.e. query().result(), which is used to write the query data to a new permanent table.
So basically query() and query().result() give the same output, but with query().result() the fetched data gets stored in a new table, while with query() the data resides in a temporary table.
As per your question, val1 == val2 is True because the data fetched by query() and query().result() is the same; it is just stored differently.
I am providing the public documentation link related to this.
writing query results
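For reference, a minimal sketch based on that documentation: writing the results to a permanent table just means passing a destination table in the job config (the table name below is a placeholder):

from google.cloud import bigquery

bigquery_client = bigquery.Client()
# Placeholder destination table in project.dataset.table form.
job_config = bigquery.QueryJobConfig(destination="my-project.my_dataset.my_table")
job = bigquery_client.query("SELECT * FROM `mytable`", job_config=job_config)
job.result()  # wait for completion; the rows are now stored in the destination table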
I followed the Cosmos DB example using the SQL API, but getting the data is quite slow. I'm trying to get data for one week (around 1M records). Sample code below.
client = cosmos_client.CosmosClient(HOST, {'masterKey': KEY})
database = client.get_database_client(DB_ID)
container = database.get_container_client(COLLECTION_ID)
query = """
SELECT some columns
FROM c
WHERE columna = 'a'
and columnb >= '100'
"""
result = list(container.query_items(
query=query, enable_cross_partition_query=True))
My question is: is there any other way to query the data faster? Does putting the query result into a list make it slow? What am I doing wrong here?
There are a couple of things you could do.
Model your data such that you don't have to do a cross-partition query. These will always take more time because your query needs to touch more partitions to find the data. You can learn more here: Model and partition data in Cosmos DB
When you only need a single item, you can make this even faster by doing a point read (read_item) instead of a query.
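For example, a minimal point-read sketch, assuming you know the item's id and partition key value (both values below are placeholders):

# read_item fetches exactly one document by id and partition key,
# avoiding the query engine and cross-partition fan-out entirely.
item = container.read_item(item="item-id", partition_key="partition-key-value")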
I have written my scripts initially using Spark SQL, but now, for performance and other reasons, I am trying to convert the SQL queries to PySpark DataFrames.
I have Orders Table (OrderID,CustomerID,EmployeeID,OrderDate,ShipperID)
and Shippers Table (ShipperID, ShipperName)
My Spark SQL query lists the number of orders sent by each shipper:
sqlContext.sql("SELECT Shippers.ShipperName, COUNT(Orders.ShipperID) AS NumberOfOrders
FROM Orders LEFT JOIN Shippers ON Orders.ShipperID = Shippers.ShipperID
GROUP BY ShipperName")
Now when I try to replace the above SQL query with the Spark DataFrame API, I write this:
Shippers.join(Orders,["ShipperID"],'left').select(Shippers.ShipperName).groupBy(Shippers.ShipperName).agg(count(Orders.ShipperID).alias("NumberOfOrders"))
But I get an error here, mostly because I feel the aggregate count function, used to count OrderId from the Orders table, is wrong.
Below is the error that I get:
"An error occurred while calling {0}{1}{2}.\n".format(target_id, ".", name), value)"
Can someone please help me refactor the above SQL query into a Spark DataFrame?
Below is the pyspark operation for your question:
import pyspark.sql.functions as F
Shippers.alias("s").join(
Orders.alias("o"),
on = "ShipperID",
how = "left"
).groupby(
"s.ShipperName"
).agg(
F.count(F.col("o.OrderID")).alias("NumberOfOrders")
).show()
I'm writing a UDF in Python for a Hive query on Hadoop. My table has several bigint fields, and several string fields.
My UDF modifies the bigint fields, subtracts the modified versions into a new column (should also be numeric), and leaves the string fields as is.
When I run my UDF in a query, the results are all string columns.
How can I preserve or specify types in my UDF output?
More details:
My Python UDF:
import sys

for line in sys.stdin:
    # pre-process row
    line = line.strip()
    inputs = line.split('\t')
    # modify numeric fields, calculate new field
    inputs[0], inputs[1], new_field = process(int(inputs[0]), int(inputs[1]))
    # leave rest of inputs as is; they are string fields.
    # output row
    outputs = [new_field]
    outputs.extend(inputs)
    print '\t'.join([str(i) for i in outputs])  # doesn't preserve types!
I saved this UDF as myudf.py and added it to Hive.
My Hive query:
CREATE TABLE calculated_tbl AS
SELECT TRANSFORM(bigintfield1, bigintfield2, stringfield1, stringfield2)
USING 'python myudf.py'
AS (calculated_int, modified_bif1, modified_bif2, stringfield1, stringfield2)
FROM original_tbl;
Streaming sends everything through stdout; it is really just a wrapper on top of Hadoop streaming under the hood. All types get converted to strings, which you handled accordingly in your Python UDF, and they come back into Hive as strings. A Python transform in Hive will never return anything but strings. You could try to do the transform in a subquery, and then cast the results to a type:
SELECT cast(calculated_int as bigint)
,cast( modified_bif1 as bigint)
,cast( modified_bif2 as bigint)
,stringfield1
,stringfield2
FROM (
SELECT TRANSFORM(bigintfield1, bigintfield2, stringfield1, stringfield2)
USING 'python myudf.py'
AS (calculated_int, modified_bif1, modified_bif2, stringfield1, stringfield2)
FROM original_tbl) A ;
Hive might let you get away with this; if it does not, you will need to save the results to a table, and then you can convert (cast) them to a different type in another query.
The final option is to just use a Java UDF. Map-only UDFs are not too bad, and they allow you to specify return types.
Update (from asker):
The above answer works really well. A more elegant solution I found reading the "Programming Hive" O'Reilly book a few weeks later is this:
CREATE TABLE calculated_tbl AS
SELECT TRANSFORM(bigintfield1, bigintfield2, stringfield1, stringfield2)
USING 'python myudf.py'
AS (calculated_int BIGINT, modified_bif1 BIGINT, modified_bif2 BIGINT, stringfield1 STRING, stringfield2 STRING)
FROM original_tbl;
Rather than casting, you can specify types right in the AS(...) line.