Transforming distinct value quantities into columns in pyspark - python

I have a dataframe like this:
+--------------------+------------------------+
|            category|count(DISTINCT category)|
+--------------------+------------------------+
|             FINANCE|                       1|
|              ARCADE|                       1|
|     AUTO & VEHICLES|                       1|
And would like to transform it in a dataframe like this:
+-------+------+---------------+
|FINANCE|ARCADE|AUTO & VEHICLES|
+-------+------+---------------+
|      1|     1|              1|
+-------+------+---------------+
But I can't think of any way of doing that except a brute-force Python approach, which I am sure will be very inefficient. Is there a smarter way of doing that using pyspark operators?

You can use the pivot() function and then use first() for aggregation:
from pyspark.sql.functions import first
df.groupby().pivot("category").agg(first("count(DISTINCT category)")).show()
+------+---------------+-------+
|ARCADE|AUTO & VEHICLES|FINANCE|
+------+---------------+-------+
| 1| 1| 1|
+------+---------------+-------+
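To make the reshaping concrete, here is a minimal plain-Python model of what the pivot does to the three sample rows. It does not use Spark; the names `rows` and `pivot_first` are made up for this illustration, and the sorted column order simply mirrors the output shown above.

```python
# Plain-Python model of groupBy().pivot("category").agg(first(...)).
# `rows` and `pivot_first` are hypothetical names used only for this sketch.
rows = [("FINANCE", 1), ("ARCADE", 1), ("AUTO & VEHICLES", 1)]


def pivot_first(pairs):
    """Turn (key, value) pairs into one wide row, keeping the first value per key."""
    wide = {}
    for key, value in pairs:
        wide.setdefault(key, value)  # "first": later duplicates are ignored
    # Sort the keys so the columns come out in the same order as the output above.
    return dict(sorted(wide.items()))


wide_row = pivot_first(rows)
print(wide_row)
```

Each distinct category becomes a key (a column), and the single dict plays the role of the one-row result.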

Related

How to create multiple count columns in Pyspark?

I have a dataframe of title and bin:
+---------------------+-------------+
| Title| bin|
+---------------------+-------------+
| Forrest Gump (1994)| 3|
| Pulp Fiction (1994)| 2|
| Matrix, The (1999)| 3|
| Toy Story (1995)| 1|
| Fight Club (1999)| 3|
+---------------------+-------------+
How do I count the bin into each individual column of a new dataframe using Pyspark? For instance:
+------------+------------+------------+
| count(bin1)| count(bin2)| count(bin3)|
+------------+------------+------------+
| 1| 1 | 3|
+------------+------------+------------+
Is this possible? Would someone please help me with this if you know how?
Group by bin and count, then pivot the bin column and, if you want, rename the columns of the resulting dataframe:
import pyspark.sql.functions as F
df1 = df.groupBy("bin").count().groupBy().pivot("bin").agg(F.first("count"))
df1 = df1.toDF(*[f"count_bin{c}" for c in df1.columns])
df1.show()
#+----------+----------+----------+
#|count_bin1|count_bin2|count_bin3|
#+----------+----------+----------+
#| 1| 1| 3|
#+----------+----------+----------+
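The two steps in this answer (count per bin, then turn the bins into columns) can be sketched without Spark as follows. This is only a plain-Python model of the logic on the question's sample rows; `movies`, `bin_counts` and `count_row` are names invented for the sketch.

```python
from collections import Counter

# Sample (title, bin) rows from the question.
movies = [
    ("Forrest Gump (1994)", 3),
    ("Pulp Fiction (1994)", 2),
    ("Matrix, The (1999)", 3),
    ("Toy Story (1995)", 1),
    ("Fight Club (1999)", 3),
]

# Step 1: group by bin and count (the groupBy("bin").count() part).
bin_counts = Counter(bin_ for _title, bin_ in movies)

# Step 2: pivot the counts into one wide row and rename the columns.
count_row = {f"count_bin{b}": n for b, n in sorted(bin_counts.items())}
print(count_row)
```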

Combine two DataFrames in PySpark into matrix

I have 2 DataFrames in PySpark script.
DF1 has this data:
+-----+--------------+
| id | keyword |
+-----+--------------+
| 1 | banana |
| 2 | apple |
| 3 | orange |
+-----+--------------+
DF2 has this data:
+----+---------------+
| id | tokens |
+----+---------------+
| 13 | ['abc', 'def']|
| 14 | ['ghi', 'jkl']|
| 15 | ['mno', 'pqr']|
+----+---------------+
I'm looking to build a third DataFrame by a result of combining both of the DataFrames above and performing some complex calculations (the calculations are not important) between the keyword and the tokens defined by a python function:
def complex_calculation(keyword, tokens):
    # some various stuff that produces a numeric result between the keyword and the tokens
    # e.g. result = 0.7768756
    return result
The final result should look something like this:
+-------------+---------+--------+--------+
| keyword | 13 | 14 | 15 |
+-------------+---------+--------+--------+
| banana | 0.5345 | 0.4325 | 0.6543 |
| apple | 0.2435 | 0.7865 | 0.9123 |
| orange | 0.3765 | 0.6942 | 0.2765 |
+-------------+---------+--------+--------+
Your complex calculation function is actually quite important in this context, because what you're looking to do is the following:
Create a cartesian product of your two tables:
table1 = spark.createDataFrame([[1, "banana"],
                                [2, "apple"],
                                [3, "orange"]], ["id", "keyword"])
table2 = spark.createDataFrame([[13, ['abc', 'def']],
                                [14, ['ghi', 'jkl']],
                                [15, ['mno', 'pqr']]], ["id", "token"])
Pivot with an aggregation function. This is where your function comes into play. Here I am using f.count() as the aggregation function:
import pyspark.sql.functions as f
(
    table1.select("keyword")
    .crossJoin(table2)
    .groupBy('keyword')
    .pivot('id')
    .agg(f.count("token"))
).show()
+-------+---+---+---+
|keyword| 13| 14| 15|
+-------+---+---+---+
| orange| 1| 1| 1|
| apple| 1| 1| 1|
| banana| 1| 1| 1|
+-------+---+---+---+
If you want to use some custom, clever calculation, you really have two options. If you're competent in Scala, you can write a UDAF (user-defined aggregate function) and register this jar to your Spark cluster. Alternatively, you can have a look at pandas udfs with something such as:
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.functions import PandasUDFType
# agg_key and the (...) body below are placeholders to fill in
@pandas_udf("struct<agg_key: string, parameter1: parameter1_type>", PandasUDFType.GROUPED_MAP)
def my_agg_function(df):
    df = pd.DataFrame(
        df.groupby(agg_key).apply(lambda x: (...))
    )
    df.reset_index(inplace=True, drop=False)
    return df
And then you use your pandas udf like this:
spark_df.groupBy("keyword").pivot("id").apply(my_agg_function(...))
However, despite best attempts at being vectorized, pandas udfs are still not great and can have a significant performance impact. Hope this helps. More on pandas udfs here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf
Ideally, you should try to do your complex aggregations using spark functions as much as you can, because Tungsten can then optimise this under the hood and give you best performance possible.
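The cross-product-then-pivot shape described in this answer can be modeled in plain Python, which may help when prototyping the scoring function before moving it into Spark. This is a sketch only: the body of `complex_calculation` below is a hypothetical stand-in (the question never specifies the real logic), and `tokens_by_id` and `matrix` are names invented here.

```python
from itertools import product

keywords = ["banana", "apple", "orange"]
tokens_by_id = {13: ["abc", "def"], 14: ["ghi", "jkl"], 15: ["mno", "pqr"]}


def complex_calculation(keyword, tokens):
    # Hypothetical stand-in for the real scoring logic, which the question
    # does not specify: the fraction of tokens sharing a letter with the keyword.
    hits = sum(1 for tok in tokens if set(tok) & set(keyword))
    return hits / len(tokens)


# Cross product of keywords and token rows, then one matrix row per keyword.
matrix = {kw: {} for kw in keywords}
for kw, (row_id, tokens) in product(keywords, tokens_by_id.items()):
    matrix[kw][row_id] = complex_calculation(kw, tokens)

for kw, scores in matrix.items():
    print(kw, scores)
```

Every (keyword, id) pair is visited exactly once, which is what the crossJoin followed by groupBy("keyword").pivot("id") arranges on the Spark side.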

How to apply multiple filters to a dataframe?

I have a data frame that looks like
+-------+-------+
| Code1 | Code2 |
+-------+-------+
| A | 1 |
| B | 1 |
| A | 2 |
| B | 2 |
| C | 2 |
| D | 2 |
| D | 3 |
| F | 3 |
| G | 3 |
+-------+-------+
I then want to apply a unique set of filters like so:
Scenario 1 -> filter on Code1 IN (A,B)
Scenario 2 -> filter on Code1 IN (A,D) and Code2 IN (2,3)
Scenario 3 -> filter on Code2 = 2
The result of applying the filter should be a data frame that looks like:
+-------+-------+----------+
| Code1 | Code2 | Scenario |
+-------+-------+----------+
| A | 1 | 1 |
| B | 1 | 1 |
| A | 2 | 1 |
| B | 2 | 1 |
| A | 2 | 2 |
| D | 2 | 2 |
| D | 3 | 2 |
| A | 2 | 3 |
| B | 2 | 3 |
| C | 2 | 3 |
| D | 2 | 3 |
+-------+-------+----------+
QUESTION: What is the most efficient way to do this with spark via python?
I am new to spark, so I am really asking from a conceptual level and don't need an explicit solution. I am aiming to achieve as much parallelism as possible in the operation. My real-life example involves using an initial data frame with 38 columns that is on the order of 100MB to a couple GB as a csv file and I typically have at most 100-150 scenarios.
The original design of the solution was to process each scenario filter sequentially and union the resulting filtered data frames together, but I feel like that negates the whole point of using spark.
EDIT: Does it though? For each scenario, I would filter and then union, which are both transformations (lazy eval). Would the eventual execution plan be smart enough to automatically parallelize the multiple unique filters?
Isn't there a way we can apply the filters in parallel, e.g., apply scenario filter 1 at the same time as applying filters 2 and 3? Would we have to "blow up" the initial dataframe N times, where N = # of scenario filters, append a Scenario # column to the new data frame, and apply one big filter that looks something like:
WHERE (Scenario = 1 AND Code1 IN (A,B)) OR
(Scenario = 2 AND Code1 IN (A,D) AND Code2 IN (2,3)) OR
(Scenario = 3 AND Code2 = 2)
And if that does end up being the most efficient way, isn't it also dependent on how much memory the "blown up" data frame takes? If the "blown up" data frame takes up more memory than what my cluster has, am I going to have to process only as many scenarios as can fit in memory?
You can apply all filters at once (this snippet is Scala):
data.withColumn("scenario",
  when('code1.isin("A", "B"), 1).otherwise(
    when('code1.isin("A", "D") && 'code2.isin("2","3"), 2).otherwise(
      when('code2==="2",3)
    )
  )
).show()
But there is one more problem: a row such as (A,2) can match all of your scenarios 1, 2 and 3. In that case you could try something like this:
data.withColumn("s1", when('code1.isin("A", "B"), 1).otherwise(0))
  .withColumn("s2",when('code1.isin("A", "D") && 'code2.isin("2","3"), 1).otherwise(0))
  .withColumn("s3",when('code2==="2",1).otherwise(0))
  .show()
output:
+-----+-----+---+---+---+
|code1|code2| s1| s2| s3|
+-----+-----+---+---+---+
| A| 1| 1| 0| 0|
| B| 1| 1| 0| 0|
| A| 2| 1| 1| 1|
| B| 2| 1| 0| 1|
| A| 2| 1| 1| 1|
| D| 2| 0| 1| 1|
| D| 3| 0| 1| 0|
| A| 2| 1| 1| 1|
| B| 2| 1| 0| 1|
| C| 2| 0| 0| 1|
| D| 2| 0| 1| 1|
+-----+-----+---+---+---+
In the EDIT to my question, I questioned whether the lazy evaluation is the key to the problem. After doing some research in the Spark UI, I have concluded that even though my original solution looks like it is applying transformations (filter then union) sequentially for each scenario, it is actually applying all the transformations simultaneously, once an action (e.g., dataframe.count()) is called. The screenshot here represents the Event Timeline from the transformation phase of the dataframe.count() job.
The job includes 96 scenarios each with a unique filter on the original data frame. You can see that my local machine is running 8 tasks simultaneously, where each task represents a filter from one of the scenarios.
In conclusion, Spark takes care of optimizing the filters to run in parallel, once an action has been called on the resulting dataframe.
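For reference, the filter-then-union logic itself (one filter per scenario, tag the surviving rows, concatenate the results) can be sketched in plain Python on the sample data. This models only the logic, not Spark's lazy evaluation or task scheduling; the name `scenarios` and the tuple layout are assumptions made for the sketch.

```python
# Sample rows from the question as (Code1, Code2) tuples.
data = [("A", 1), ("B", 1), ("A", 2), ("B", 2), ("C", 2),
        ("D", 2), ("D", 3), ("F", 3), ("G", 3)]

# One predicate per scenario; the dict name `scenarios` is hypothetical.
scenarios = {
    1: lambda code1, code2: code1 in ("A", "B"),
    2: lambda code1, code2: code1 in ("A", "D") and code2 in (2, 3),
    3: lambda code1, code2: code2 == 2,
}

# Filter once per scenario, tag each surviving row, and concatenate ("union").
tagged = [
    (code1, code2, scenario)
    for scenario, keep in scenarios.items()
    for code1, code2 in data
    if keep(code1, code2)
]

for row in tagged:
    print(row)
```

The 11 tagged rows match the expected result table in the question, including rows such as (A, 2) that appear once per matching scenario.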

Find column names of interconnected row values - Spark

I have a Spark dataframe that adheres to the following structure:
+------+-----------+-----------+-----------+------+
|ID | Name1 | Name2 | Name3 | Y |
+------+-----------+-----------+-----------+------+
| 1 | A,1 | B,1 | C,4 | B |
| 2 | D,2 | E,2 | F,8 | D |
| 3 | G,5 | H,2 | I,3 | H |
+------+-----------+-----------+-----------+------+
For every row I want to find in which column the value of Y is denoted as the first element. So, ideally I want to retrieve a list like: [Name2,Name1,Name2].
I am not sure how and whether it works to convert first to a RDD, then use a map function and convert the result back to DataFrame.
Any ideas are welcome.
You can try this piece of code:
df.show()
+---+-----+-----+-----+---+
| ID|Name1|Name2|Name3| Y|
+---+-----+-----+-----+---+
| 1| A,1| B,1| C,4| B|
| 2| D,2| E,2| F,8| D|
| 3| G,5| H,2| I,3| H|
+---+-----+-----+-----+---+
from pyspark.sql import functions as F
name_cols = ["Name1", "Name2", "Name3"]
# start from the functions module so the first call is F.when and later calls chain Column.when
cond = F
for col in name_cols:
    cond = cond.when(F.split(F.col(col),',').getItem(0) == F.col("Y"), col)
df.withColumn("whichName", cond).show()
+---+-----+-----+-----+---+---------+
| ID|Name1|Name2|Name3| Y|whichName|
+---+-----+-----+-----+---+---------+
| 1| A,1| B,1| C,4| B| Name2|
| 2| D,2| E,2| F,8| D| Name1|
| 3| G,5| H,2| I,3| H| Name2|
+---+-----+-----+-----+---+---------+
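The chained when() expression above checks the columns in order and keeps the first match. A minimal plain-Python model of that rule, run on the question's sample rows, looks like this; `rows` and `which_name` are hypothetical names used only here.

```python
# Sample rows from the question, one dict per row.
rows = [
    {"ID": 1, "Name1": "A,1", "Name2": "B,1", "Name3": "C,4", "Y": "B"},
    {"ID": 2, "Name1": "D,2", "Name2": "E,2", "Name3": "F,8", "Y": "D"},
    {"ID": 3, "Name1": "G,5", "Name2": "H,2", "Name3": "I,3", "Y": "H"},
]
name_cols = ["Name1", "Name2", "Name3"]


def which_name(row):
    """Return the first column whose leading element equals Y, else None."""
    for col in name_cols:
        if row[col].split(",")[0] == row["Y"]:
            return col
    return None  # mirrors a when-chain without otherwise(), which yields null


result = [which_name(row) for row in rows]
print(result)
```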

get unique values when concatenating two columns pyspark data frame

I have a data frame in pyspark like below.
+---+-------------+------------+
| id| device| model|
+---+-------------+------------+
| 3| mac pro| mac|
| 1| iphone| iphone5|
| 1|android phone| android|
| 1| windows pc| windows|
| 1| spy camera| spy camera|
| 2| | camera|
| 3| cctv| cctv|
| 2| iphone|apple iphone|
| 3| spy camera| |
+---+-------------+------------+
I want to create a column by concatenating the unique values in the device and model columns for each id.
This is what I have done so far.
First, I concatenated the device and model columns:
from pyspark.sql.functions import col, concat, lit
df1 = df.select(col("id"), concat(col("model"), lit(","), col("device")).alias('con'))
+---+--------------------+
| id| con|
+---+--------------------+
| 3| mac,mac pro|
| 1| iphone5,iphone|
| 1|android,android p...|
| 1| windows,windows pc|
| 1|spy camera,spy ca...|
| 2| camera,|
| 3| cctv,cctv|
| 2| apple iphone,iphone|
| 3| ,spy camera|
+---+--------------------+
Then I grouped by id:
import pyspark.sql.functions as f
df2 = df1.groupBy("id").agg(f.concat_ws(",", f.collect_set(df1.con)).alias('Group_con'))
+---+-----------------------------------------------------------------------------+
| id| Group_con|
+---+-----------------------------------------------------------------------------+
| 1|iphone5,iphone,android,android phone,windows,windows pc,spy camera,spy camera|
| 2| camera,,apple iphone,iphone|
| 3| mac,mac pro,cctv,cctv,,spy camera|
+---+-----------------------------------------------------------------------------+
But I am getting duplicate values in the result. How can I avoid duplicate values in the final data frame?
Use F.array(), F.explode() and F.collect_set():
from pyspark.sql import functions as F
df.withColumn('con', F.explode(F.array('device', 'model'))) \
.groupby('id').agg(F.collect_set('con').alias('Group_con')) \
.show(3,0)
# +---+--------------------------------------------------------------------------+
# |id |Group_con |
# +---+--------------------------------------------------------------------------+
# |3 |[cctv, mac pro, spy camera, mac] |
# |1 |[windows pc, iphone5, windows, iphone, android phone, spy camera, android]|
# |2 |[apple iphone, camera, iphone] |
# +---+--------------------------------------------------------------------------+
(tested on spark version 2.2.1)
You can remove the duplicates by using collect_set and a udf function as follows:
from pyspark.sql import functions as f
from pyspark.sql import types as t
def uniqueStringUdf(device, model):
    return ','.join(set(filter(bool, device + model)))
uniqueStringUdfCall = f.udf(uniqueStringUdf, t.StringType())
df.groupBy("id").agg(uniqueStringUdfCall(f.collect_set("device"), f.collect_set("model")).alias("con")).show(truncate=False)
which should give you
+---+------------------------------------------------------------------+
|id |con |
+---+------------------------------------------------------------------+
|3 |spy camera,mac,mac pro,cctv |
|1 |spy camera,windows,iphone5,windows pc,iphone,android phone,android|
|2 |camera,iphone,apple iphone |
+---+------------------------------------------------------------------+
where,
device + model is concatenation for two collected sets
filter(bool, device + model) is removing empty strings from concatenated list
set(filter(bool, device + model)) removes the duplicate strings in the concatenated list
','.join(set(filter(bool, device + model))) concats all the elements of concatenated list to a comma separated string.
I hope the answer is helpful
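Because set() has no defined order, the joined string from the udf above can come out in a different order from run to run. If a stable order matters, one possible variant is sketched below. It is an illustration rather than part of the original answer: `unique_string` is a hypothetical name, and it assumes the blank cells are empty strings rather than nulls or spaces.

```python
def unique_string(device, model):
    """Join the non-empty, de-duplicated values, keeping first-seen order."""
    # dict.fromkeys keeps insertion order, unlike set(), so the output is stable.
    return ",".join(dict.fromkeys(v for v in device + model if v))


# Values for id 2 from the question (one device entry is blank).
joined = unique_string(["", "iphone"], ["camera", "apple iphone"])
print(joined)
```

The function takes the same two list arguments as uniqueStringUdf, so it could be wrapped with f.udf in the same way.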
Not sure if this is going to be very helpful, but one solution I could think of is to check for duplicate values in the column and then delete them by their position/index.
Or
Split all values at the comma "," into a list and remove the duplicates by comparing each value. Or count() the occurrences of each value and, if it is more than 1, delete all the duplicates other than the first one.
Sorry if this wasn't helpful. These are the 2 ways I could think of solving your problem.
