I have two DataFrames in a PySpark script.
DF1 has this data:
+-----+--------------+
| id | keyword |
+-----+--------------+
| 1 | banana |
| 2 | apple |
| 3 | orange |
+-----+--------------+
DF2 has this data:
+----+---------------+
| id | tokens |
+----+---------------+
| 13 | ['abc', 'def']|
| 14 | ['ghi', 'jkl']|
| 15 | ['mno', 'pqr']|
+----+---------------+
I'm looking to build a third DataFrame as the result of combining both of the DataFrames above and performing some complex calculations (the calculations themselves are not important) between the keyword and the tokens, defined by a Python function:
def complex_calculation(keyword, tokens):
    # some computation that produces a numeric result from the keyword and the tokens
    # e.g. result = 0.7768756
    return result
The final result should look something like this:
+-------------+---------+--------+--------+
| keyword | 13 | 14 | 15 |
+-------------+---------+--------+--------+
| banana | 0.5345 | 0.4325 | 0.6543 |
| apple | 0.2435 | 0.7865 | 0.9123 |
| orange | 0.3765 | 0.6942 | 0.2765 |
+-------------+---------+--------+--------+
Your complex calculation function is actually quite important in this context, because what you're looking to do is the following:
Create a Cartesian product of your two tables:
table1 = spark.sparkContext.parallelize([[1, "banana"],
                                         [2, "apple"],
                                         [3, "orange"]]).toDF(["id", "keyword"])
table2 = spark.sparkContext.parallelize([[13, ['abc', 'def']],
                                         [14, ['ghi', 'jkl']],
                                         [15, ['mno', 'pqr']]]).toDF(["id", "token"])
Pivot with an aggregation function. Now this is where your function comes into play. As you can see, I am using f.count() as my aggregation function.
import pyspark.sql.functions as f

(
    table1.select("keyword")
    .crossJoin(table2)
    .groupBy("keyword")
    .pivot("id")
    .agg(f.count("token"))
).show()
+-------+---+---+---+
|keyword| 13| 14| 15|
+-------+---+---+---+
| orange| 1| 1| 1|
| apple| 1| 1| 1|
| banana| 1| 1| 1|
+-------+---+---+---+
If you want to use some custom, clever calculation, you really have two options. If you're comfortable with Scala, you can write a UDAF (user-defined aggregate function) and register that jar with your Spark cluster. Alternatively, you can have a look at pandas UDFs, with something such as:
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.functions import PandasUDFType

@pandas_udf("struct<agg_key: string, parameter1: parameter1_type>", PandasUDFType.GROUPED_MAP)
def my_agg_function(df):
    df = pd.DataFrame(
        df.groupby(agg_key).apply(lambda x: (...)))
    df.reset_index(inplace=True, drop=False)
    return df
And then you use your pandas UDF like so:
spark_df.groupBy("keyword").pivot("id").apply(my_agg_function(...))
However, despite best attempts at being vectorized, pandas UDFs are still not great and can have a significant performance impact. Hope this helps. More on pandas UDFs here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf
Ideally, you should try to do your complex aggregations using Spark functions as much as you can, because Tungsten can then optimise this under the hood and give you the best performance possible.
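For example, if the complex calculation can be expressed with built-in Spark functions, you can compute it before the pivot and aggregate it directly; a minimal sketch (the score formula here is purely hypothetical):
import pyspark.sql.functions as f

(
    table1.select("keyword")
    .crossJoin(table2)
    .withColumn("score", f.size("token") / f.length("keyword"))  # hypothetical score
    .groupBy("keyword")
    .pivot("id")
    .agg(f.first("score"))
).show()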
I need to check a large amount of data (GBs) spread across 2 CSV files. The CSV files have no headers and contain only a single column, which holds a complex string mixing numbers and letters, like this:
+------------------------------+
|_c0                           |
+------------------------------+
|Hello | world | 1.3123.412 | B|
+------------------------------+
So far, I am able to convert them into DataFrames, but I'm not sure: is there any way to get the row numbers and rows of df1 that are not found in df2?
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
file1 = 'file_path'
file2 = 'file_path'
df1 = spark.read.csv(file1)
df2 = spark.read.csv(file2)
df1.show(truncate=False)
Let's go step by step, since you are still learning.
df1
+------------------------------+
|_c0 |
+------------------------------+
|Hello | world | 1.3123.412 | B|
|Hello | world | 1.3123.412 | C|
+------------------------------+
df2
+------------------------------+
|_c0 |
+------------------------------+
|Hello | world | 1.3123.412 | D|
|Hello | world | 1.3123.412 | C|
+------------------------------+
Generate row numbers using a window function:
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

df1 = df1.withColumn('id', row_number().over(Window.orderBy('_c0')))
df2 = df2.withColumn('id', row_number().over(Window.orderBy('_c0')))
Use a left semi join. These joins do not keep any values from the right dataframe; they only compare values and keep the rows of the left dataframe that are also found in the right dataframe:
df1.join(df2, how='left_semi', on='_c0').show(truncate=False)
+------------------------------+---+
|_c0 |id |
+------------------------------+---+
|Hello | world | 1.3123.412 | C|2 |
+------------------------------+---+
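To get the rows of df1 that are not found in df2 (which is what the question asks for), you could similarly use the complementary left_anti join, which keeps only the rows of the left dataframe that have no match in the right dataframe; a minimal sketch:
df1.join(df2, how='left_anti', on='_c0').show(truncate=False)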
I have a dataframe like this:
+--------------------+------------------------+
| category|count(DISTINCT category)|
+--------------------+------------------------+
| FINANCE| 1|
| ARCADE| 1|
| AUTO & VEHICLES| 1|
And I would like to transform it into a dataframe like this:
+---------+--------+-----------------+
| FINANCE | ARCADE | AUTO & VEHICLES |
+---------+--------+-----------------+
|       1 |      1 |               1 |
+---------+--------+-----------------+
But I can't think of any way of doing that except a brute-force Python approach, which I am sure would be very inefficient. Is there a smarter way of doing that using PySpark operators?
You can use the pivot() function and then use first for aggregation:
from pyspark.sql.functions import *
df.groupby().pivot("category").agg(first("count(DISTINCT category)")).show()
+------+---------------+-------+
|ARCADE|AUTO & VEHICLES|FINANCE|
+------+---------------+-------+
| 1| 1| 1|
+------+---------------+-------+
I have data of the following kind:
+----+------+--------------------------------------------------------------+
| id | point| data                                                         |
+----+------+--------------------------------------------------------------+
| dfb|     6| [{"key1":"124", "key2": "345"},{"key3":"324", "key1":"wfe"}]|
| bgd|     7| [{"key3":"324", "key1":"wfe"},{"key1":"777", "key2":"888"}]  |
| 34d|     6| [{"key1":"111", "key4": "788", "key2":"dfef"}]               |
+----+------+--------------------------------------------------------------+
and I want to convert it to
+----+------+------+
| id | point| key1 |
+----+------+------+
| dfb|     6|  124 |
| bgd|     7|  777 |
| 34d|     6|  111 |
+----+------+------+
Each row contains a list of JSON objects that may share common keys, but I want to extract the value of key1 from the JSON object that also has key2.
This can easily be achieved in plain Python. In PySpark I have seen solutions (How to split a list to multiple columns in Pyspark?) that are based on a fixed schema, but how can I achieve this without a fixed schema, as in this case?
Another approach using higher-order functions (Spark 2.4+), combining filter with transform, can be:
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, MapType, StringType

schema = ArrayType(MapType(StringType(), StringType()))
(df.withColumn("data", F.from_json(F.col("data"), schema))
   .withColumn("Key1", F.expr('''transform(filter(data, x ->
                array_contains(map_keys(x), "key2")), y -> y["key1"])''')[0])).show()
+---+-----+--------------------+----+
| id|point| data|Key1|
+---+-----+--------------------+----+
|dfb| 6|[[key1 -> 124, ke...| 124|
|bgd| 7|[[key3 -> 324, ke...| 777|
|34d| 6|[[key1 -> 111, ke...| 111|
+---+-----+--------------------+----+
Check the code below:
from pyspark.sql import functions as F
from pyspark.sql.types import *
df.show()
+---+-----+---------------------------------------------------------+
|id |point|data |
+---+-----+---------------------------------------------------------+
|dfb|6 |[{"key1":"124","key2":"345"},{"key3":"324","key1":"wfe"}]|
|bgd|7 |[{"key3":"324","key1":"wfe"},{"key1":"777","key2":"888"}]|
|34d|6 |[{"key1":"111","key4":"788","key2":"dfef"}] |
+---+-----+---------------------------------------------------------+
schema = ArrayType(MapType(StringType(),StringType()))
df.withColumn("data",F.explode(F.from_json(F.col("data"),schema))).withColumn("data",F.when(F.col("data")["key1"].cast("long").isNotNull(),F.col("data")["key1"])).filter(F.col("data").isNotNull()).show()
+---+-----+----+
| id|point|data|
+---+-----+----+
|dfb| 6| 124|
|bgd| 7| 777|
|34d| 6| 111|
+---+-----+----+
I have a data frame that looks like
+-------+-------+
| Code1 | Code2 |
+-------+-------+
| A | 1 |
| B | 1 |
| A | 2 |
| B | 2 |
| C | 2 |
| D | 2 |
| D | 3 |
| F | 3 |
| G | 3 |
+-------+-------+
I then want to apply a unique set of filters like so:
Scenario 1 -> filter on Code1 IN (A,B)
Scenario 2 -> filter on Code1 IN (A,D) and Code2 IN (2,3)
Scenario 3 -> filter on Code2 = 2
The result of applying the filters should be a data frame that looks like this:
+-------+-------+----------+
| Code1 | Code2 | Scenario |
+-------+-------+----------+
| A | 1 | 1 |
| B | 1 | 1 |
| A | 2 | 1 |
| B | 2 | 1 |
| A | 2 | 2 |
| D | 2 | 2 |
| D | 3 | 2 |
| A | 2 | 3 |
| B | 2 | 3 |
| C | 2 | 3 |
| D | 2 | 3 |
+-------+-------+----------+
QUESTION: What is the most efficient way to do this with spark via python?
I am new to Spark, so I am really asking at a conceptual level and don't need an explicit solution. I am aiming to achieve as much parallelism as possible in the operation. My real-life example involves an initial data frame with 38 columns that is on the order of 100 MB to a couple of GB as a CSV file, and I typically have at most 100-150 scenarios.
The original design of the solution was to process each scenario filter sequentially and union the resulting filtered data frames together, but I feel like that negates the whole point of using spark.
EDIT: Does it though? For each scenario, I would filter and then union, which are both transformations (lazy eval). Would the eventual execution plan be smart enough to automatically parallelize the multiple unique filters?
Isn't there a way we can apply the filters in parallel, e.g., apply scenario filter 1 at the same time as applying filters 2 and 3? Would we have to "blow up" the initial dataframe N times, where N = # of scenario filters, append a Scenario # column to the new data frame, and apply one big filter that looks something like:
WHERE (Scenario = 1 AND Code1 IN (A,B)) OR
(Scenario = 2 AND Code1 IN (A,D) AND Code2 IN (2,3)) OR
(Scenario = 3 AND Code2 = 2)
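In PySpark, that might look roughly like the sketch below (assuming the original data is in a DataFrame named df, and a small scenarios DataFrame is cross-joined in to replicate the data once per scenario):
import pyspark.sql.functions as F

# Hypothetical: one row per scenario id
scenarios = spark.createDataFrame([(1,), (2,), (3,)], ["Scenario"])
blown_up = df.crossJoin(scenarios)  # N copies of the data, one per scenario
result = blown_up.filter(
    ((F.col("Scenario") == 1) & F.col("Code1").isin("A", "B")) |
    ((F.col("Scenario") == 2) & F.col("Code1").isin("A", "D") & F.col("Code2").isin(2, 3)) |
    ((F.col("Scenario") == 3) & (F.col("Code2") == 2))
)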
And if that does end up being the most efficient way, isn't it also dependent on how much memory the "blown up" data frame takes? If the "blown up" data frame takes up more memory than what my cluster has, am I going to have to process only as many scenarios as can fit in memory?
You can apply all filters at once:
data.withColumn("scenario",
when('code1.isin("A", "B"), 1).otherwise(
when('code1.isin("A", "D") && 'code2.isin("2","3"), 2).otherwise(
when('code2==="2",3)
)
)
).show()
But you have one more problem: for example, the value (A, 2) could fall into all of your scenarios 1, 2 and 3. In this case you could try something like this:
data.withColumn("s1", when('code1.isin("A", "B"), 1).otherwise(0))
.withColumn("s2",when('code1.isin("A", "D") && 'code2.isin("2","3"), 1).otherwise(0))
.withColumn("s3",when('code2==="2",1).otherwise(0))
.show()
output:
+-----+-----+---+---+---+
|code1|code2| s1| s2| s3|
+-----+-----+---+---+---+
| A| 1| 1| 0| 0|
| B| 1| 1| 0| 0|
| A| 2| 1| 1| 1|
| B| 2| 1| 0| 1|
| A| 2| 1| 1| 1|
| D| 2| 0| 1| 1|
| D| 3| 0| 1| 0|
| A| 2| 1| 1| 1|
| B| 2| 1| 0| 1|
| C| 2| 0| 0| 1|
| D| 2| 0| 1| 1|
+-----+-----+---+---+---+
In the EDIT to my question, I questioned whether lazy evaluation is the key to the problem. After doing some research in the Spark UI, I concluded that even though my original solution looks like it applies the transformations (filter then union) sequentially for each scenario, it actually applies all the transformations simultaneously once an action (e.g., dataframe.count()) is called. The screenshot here represents the Event Timeline from the transformation phase of the dataframe.count() job.
The job includes 96 scenarios, each with a unique filter on the original data frame. You can see that my local machine is running 8 tasks simultaneously, where each task represents a filter from one of the scenarios.
In conclusion, Spark takes care of optimizing the filters to run in parallel, once an action has been called on the resulting dataframe.
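For reference, a minimal sketch of that sequential filter-then-union approach (the predicate list here is hypothetical, and df stands for the original data frame):
from functools import reduce
import pyspark.sql.functions as F

# Hypothetical (scenario_id, predicate) pairs; in my real case there are ~96 of them
scenario_filters = [
    (1, F.col("Code1").isin("A", "B")),
    (2, F.col("Code1").isin("A", "D") & F.col("Code2").isin(2, 3)),
    (3, F.col("Code2") == 2),
]
# Build one filtered DataFrame per scenario, tagged with its scenario id, then union them.
# All of this stays lazy until an action is called.
filtered = [df.filter(pred).withColumn("Scenario", F.lit(sid)) for sid, pred in scenario_filters]
result = reduce(lambda a, b: a.union(b), filtered)
result.count()  # the action that triggers the (parallelised) execution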
I have a Pyspark Dataframe in the following format:
+------------+---------+
| date | query |
+------------+---------+
| 2011-08-11 | Query 1 |
| 2011-08-11 | Query 1 |
| 2011-08-11 | Query 2 |
| 2011-08-12 | Query 3 |
| 2011-08-12 | Query 3 |
| 2011-08-13 | Query 1 |
+------------+---------+
And I need to transform it so that each unique query becomes a column, grouped by date, with the count of each query filling the rows of the dataframe. I expect the output to look like this:
+------------+---------+---------+---------+
| date | Query 1 | Query 2 | Query 3 |
+------------+---------+---------+---------+
| 2011-08-11 | 2 | 1 | 0 |
| 2011-08-12 | 0 | 0 | 2 |
| 2011-08-13 | 1 | 0 | 0 |
+------------+---------+---------+---------+
I am trying to use this answer as an example, but I don't quite understand the code, especially the return statement in the make_row function.
Is there a way to count the queries while transforming the DataFrame?
Maybe something like
import pyspark.sql.functions as func
grouped = (df
.map(lambda row: (row.date, (row.query, func.count(row.query)))) # Just an example. Not sure how to do this.
.groupByKey())
It is a dataframe with potentially hundreds of thousands of rows and queries, so I prefer the RDD version over the options that use a .collect()
Thank you!
You can use groupBy.pivot with count as the aggregation function:
from pyspark.sql.functions import count
df.groupBy('date').pivot('query').agg(count('query')).na.fill(0).orderBy('date').show()
+--------------------+-------+-------+-------+
| date|Query 1|Query 2|Query 3|
+--------------------+-------+-------+-------+
|2011-08-11 00:00:...| 2| 1| 0|
|2011-08-12 00:00:...| 0| 0| 2|
|2011-08-13 00:00:...| 1| 0| 0|
+--------------------+-------+-------+-------+