I have a Pyspark Dataframe in the following format:
+------------+---------+
| date | query |
+------------+---------+
| 2011-08-11 | Query 1 |
| 2011-08-11 | Query 1 |
| 2011-08-11 | Query 2 |
| 2011-08-12 | Query 3 |
| 2011-08-12 | Query 3 |
| 2011-08-13 | Query 1 |
+------------+---------+
And I need to transform it to turn each unique query into a column, grouped by date, and insert the count of each query in the rows of the dataframe. I expect the output to be like this:
+------------+---------+---------+---------+
| date | Query 1 | Query 2 | Query 3 |
+------------+---------+---------+---------+
| 2011-08-11 | 2 | 1 | 0 |
| 2011-08-12 | 0 | 0 | 2 |
| 2011-08-13 | 1 | 0 | 0 |
+------------+---------+---------+---------+
I am trying to use this answer as an example, but I don't quite understand the code, especially the return statement in the make_row function.
Is there a way to count the queries while transforming the DataFrame?
Maybe something like
import pyspark.sql.functions as func
grouped = (df
    .map(lambda row: (row.date, (row.query, func.count(row.query))))  # Just an example. Not sure how to do this.
    .groupByKey())
The dataframe has potentially hundreds of thousands of rows and queries, so I would prefer an RDD-based version over the options that use a .collect().
Thank you!
You can use groupBy.pivot with count as the aggregation function:
from pyspark.sql.functions import count
df.groupBy('date').pivot('query').agg(count('query')).na.fill(0).orderBy('date').show()
+--------------------+-------+-------+-------+
| date|Query 1|Query 2|Query 3|
+--------------------+-------+-------+-------+
|2011-08-11 00:00:...| 2| 1| 0|
|2011-08-12 00:00:...| 0| 0| 2|
|2011-08-13 00:00:...| 1| 0| 0|
+--------------------+-------+-------+-------+
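If the set of distinct queries is known up front, you can also pass it explicitly to pivot, which lets Spark skip the extra job it otherwise runs to collect the distinct pivot values. A minimal sketch (the list of queries here is an assumption for illustration):

from pyspark.sql.functions import count

# assumed to be known in advance; otherwise Spark computes the distinct values itself
queries = ['Query 1', 'Query 2', 'Query 3']

(df.groupBy('date')
   .pivot('query', queries)   # explicit pivot values avoid an extra pass over the data
   .agg(count('query'))
   .na.fill(0)
   .orderBy('date')
   .show())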
I have a pyspark dataframe, df, as below:
| D1 | D2 | D3 |Out|
| 2 | 4 | 5 |D2 |
| 5 | 8 | 4 |D3 |
| 3 | 7 | 8 |D1 |
For each row, I would like to look up the column whose name is stored in the "Out" column and put that column's value into a new "Result" column:
| D1 | D2 | D3 |Out|Result|
| 2 | 4 | 5 |D2 |4 |
| 5 | 8 | 4 |D3 |4 |
| 3 | 7 | 8 |D1 |3 |
df_lag=df.rdd.map(lambda row: row + (row[row.Out],)).toDF(df.columns + ["Result"])
I have tried the code above and it produces the result, but when I try to save to CSV it keeps failing with the error "Job aborted due to......", so I would like to ask if there is any other method that could obtain the same result. Thanks!
You can use chained when statements generated dynamically from the column names using reduce:
from functools import reduce
import pyspark.sql.functions as F
df2 = df.withColumn(
    'Result',
    reduce(
        lambda x, y: x.when(F.col('Out') == y, F.col(y)),
        df.columns[:-1],  # every column except 'Out'
        F                 # start from the functions module so the first call is F.when(...)
    )
)
df2.show()
+---+---+---+---+------+
| D1| D2| D3|Out|Result|
+---+---+---+---+------+
| 2| 4| 5| D2| 4|
| 5| 8| 4| D3| 4|
| 3| 7| 8| D1| 3|
+---+---+---+---+------+
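As an alternative sketch (not part of the original answer), you could build a column-name-to-value map with create_map and look it up with the name stored in Out; this assumes all lookup columns come before Out, as in df.columns[:-1]:

from itertools import chain
import pyspark.sql.functions as F

# map each column name literal to that column's value, e.g. 'D1' -> D1, 'D2' -> D2, 'D3' -> D3
mapping = F.create_map(*chain.from_iterable(
    (F.lit(c), F.col(c)) for c in df.columns[:-1]
))

# look up, per row, the entry whose key equals the name stored in 'Out'
df3 = df.withColumn('Result', mapping[F.col('Out')])
df3.show()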
I want to calculate, for each unique id/name, the number of days it took for the status to go from order to arrived.
The input dataframe looks like this:
+------------+----+-------+------------+
| Date       | id | name  | staus      |
+------------+----+-------+------------+
| 1986/10/15 | A  | john  | order      |
| 1986/10/16 | A  | john  | dispatched |
| 1986/10/18 | A  | john  | arrived    |
| 1986/10/15 | B  | peter | order      |
| 1986/10/16 | B  | peter | dispatched |
| 1986/10/17 | B  | peter | arrived    |
| 1986/10/16 | C  | raul  | order      |
| 1986/10/17 | C  | raul  | dispatched |
| 1986/10/18 | C  | raul  | arrived    |
+------------+----+-------+------------+
The expected output should look similar to this:
+----+-------+------------------------------------------+
| id | name  | time_difference_from_order_to_delivered  |
+----+-------+------------------------------------------+
| A  | john  | 3 days                                    |
| B  | peter | 2 days                                    |
| C  | raul  | 2 days                                    |
+----+-------+------------------------------------------+
I am stuck on what logic to implement.
You can group by and calculate the date diff using a conditional aggregation:
import pyspark.sql.functions as F
df2 = df.groupBy('id', 'name').agg(
    F.datediff(
        F.to_date(F.max(F.when(F.col('staus') == 'arrived', F.col('Date'))), 'yyyy/MM/dd'),
        F.to_date(F.min(F.when(F.col('staus') == 'order', F.col('Date'))), 'yyyy/MM/dd')
    ).alias('time_diff')
)
df2.show()
+---+-----+---------+
| id| name|time_diff|
+---+-----+---------+
| A| john| 3|
| C| raul| 2|
| B|peter| 2|
+---+-----+---------+
You can also directly subtract the dates, which will return an interval type column:
import pyspark.sql.functions as F
df2 = df.groupBy('id', 'name').agg(
    (
        F.to_date(F.max(F.when(F.col('staus') == 'arrived', F.col('Date'))), 'yyyy/MM/dd') -
        F.to_date(F.min(F.when(F.col('staus') == 'order', F.col('Date'))), 'yyyy/MM/dd')
    ).alias('time_diff')
)
df2.show()
+---+-----+---------+
| id| name|time_diff|
+---+-----+---------+
| A| john| 3 days|
| C| raul| 2 days|
| B|peter| 2 days|
+---+-----+---------+
Assuming order is the earliest date and arrived is the last, just use aggregation and datediff():
select id, name, datediff(max(date), min(date)) as num_days
from t
group by id, name;
For more precision, you can use conditional aggregation:
select id, name,
       datediff(max(case when staus = 'arrived' then date end),
                min(case when staus = 'order' then date end)
               ) as num_days
from t
group by id, name;
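If you want to run this SQL from PySpark, a rough sketch would be to register the dataframe as a temp view first (the column names follow the sample data above, and to_date handles the yyyy/MM/dd strings):

# register the dataframe under the table name used in the SQL above
df.createOrReplaceTempView('t')

spark.sql("""
    SELECT id, name,
           datediff(
               max(CASE WHEN staus = 'arrived' THEN to_date(Date, 'yyyy/MM/dd') END),
               min(CASE WHEN staus = 'order'   THEN to_date(Date, 'yyyy/MM/dd') END)
           ) AS num_days
    FROM t
    GROUP BY id, name
""").show()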
I have a dataframe like this:
+--------------------+------------------------+
| category|count(DISTINCT category)|
+--------------------+------------------------+
| FINANCE| 1|
| ARCADE| 1|
| AUTO & VEHICLES| 1|
And would like to transform it in a dataframe like this:
+---------+--------+-----------------+
| FINANCE | ARCADE | AUTO & VEHICLES |
+---------+--------+-----------------+
|       1 |      1 |               1 |
+---------+--------+-----------------+
But I can't think of any way of doing that except a very brute-force Python approach that I am sure will be very inefficient. Is there a smarter way of doing that using pyspark operators?
You can use the pivot() function and then use first for aggregation:
from pyspark.sql.functions import *
df.groupby().pivot("category").agg(first("count(DISTINCT category)")).show()
+------+---------------+-------+
|ARCADE|AUTO & VEHICLES|FINANCE|
+------+---------------+-------+
| 1| 1| 1|
+------+---------------+-------+
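If the auto-generated column name is awkward to reference, you could rename it before pivoting; a small sketch, assuming the column really is named count(DISTINCT category):

from pyspark.sql.functions import first

df2 = df.withColumnRenamed('count(DISTINCT category)', 'cnt')
df2.groupby().pivot('category').agg(first('cnt')).show()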
I have a data frame that looks like
+-------+-------+
| Code1 | Code2 |
+-------+-------+
| A | 1 |
| B | 1 |
| A | 2 |
| B | 2 |
| C | 2 |
| D | 2 |
| D | 3 |
| F | 3 |
| G | 3 |
+-------+-------+
I then want to apply a unique set of filters like so:
Scenario 1 -> filter on Code1 IN (A,B)
Scenario 2 -> filter on Code1 IN (A,D) and Code2 IN (2,3)
Scenario 3 -> filter on Code2 = 2
The result of applying the filter should be a data frame that looks like:
+-------+-------+----------+
| Code1 | Code2 | Scenario |
+-------+-------+----------+
| A | 1 | 1 |
| B | 1 | 1 |
| A | 2 | 1 |
| B | 2 | 1 |
| A | 2 | 2 |
| D | 2 | 2 |
| D | 3 | 2 |
| A | 2 | 3 |
| B | 2 | 3 |
| C | 2 | 3 |
| D | 2 | 3 |
+-------+-------+----------+
QUESTION: What is the most efficient way to do this with spark via python?
I am new to Spark, so I am really asking at a conceptual level and don't need an explicit solution. I am aiming to achieve as much parallelism as possible in the operation. My real-life example involves an initial data frame with 38 columns, on the order of 100 MB to a couple of GB as a CSV file, and I typically have at most 100-150 scenarios.
The original design of the solution was to process each scenario filter sequentially and union the resulting filtered data frames together, but I feel like that negates the whole point of using spark.
EDIT: Does it though? For each scenario, I would filter and then union, which are both transformations (lazy eval). Would the eventual execution plan be smart enough to automatically parallelize the multiple unique filters?
Isn't there a way we can apply the filters in parallel, e.g., apply scenario filter 1 at the same time as applying filters 2 and 3? Would we have to "blow up" the initial dataframe N times, where N = # of scenario filters, append a Scenario # column to the new data frame, and apply one big filter that looks something like:
WHERE (Scenario = 1 AND Code1 IN (A,B)) OR
(Scenario = 2 AND Code1 IN (A,D) AND Code2 IN (2,3)) OR
(Scenario = 3 AND Code2 = 2)
And if that does end up being the most efficient way, isn't it also dependent on how much memory the "blown up" data frame takes? If the "blown up" data frame takes up more memory than what my cluster has, am I going to have to process only as many scenarios as can fit in memory?
You can apply all the filters at once:
data.withColumn("scenario",
when('code1.isin("A", "B"), 1).otherwise(
when('code1.isin("A", "D") && 'code2.isin("2","3"), 2).otherwise(
when('code2==="2",3)
)
)
).show()
But there is one more problem: values such as (A, 2) could match all of your scenarios 1, 2 and 3. In that case you could try something like this:
data.withColumn("s1", when('code1.isin("A", "B"), 1).otherwise(0))
.withColumn("s2",when('code1.isin("A", "D") && 'code2.isin("2","3"), 1).otherwise(0))
.withColumn("s3",when('code2==="2",1).otherwise(0))
.show()
output:
+-----+-----+---+---+---+
|code1|code2| s1| s2| s3|
+-----+-----+---+---+---+
| A| 1| 1| 0| 0|
| B| 1| 1| 0| 0|
| A| 2| 1| 1| 1|
| B| 2| 1| 0| 1|
| A| 2| 1| 1| 1|
| D| 2| 0| 1| 1|
| D| 3| 0| 1| 0|
| A| 2| 1| 1| 1|
| B| 2| 1| 0| 1|
| C| 2| 0| 0| 1|
| D| 2| 0| 1| 1|
+-----+-----+---+---+---+
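Since the question asks for Python, a rough PySpark translation of the flag-column approach above might look like this (assuming the dataframe is called data and the column names are lowercase, as in the Scala snippet):

import pyspark.sql.functions as F

flags = (
    data
    .withColumn('s1', F.when(F.col('code1').isin('A', 'B'), 1).otherwise(0))
    .withColumn('s2', F.when(F.col('code1').isin('A', 'D') & F.col('code2').isin('2', '3'), 1).otherwise(0))
    .withColumn('s3', F.when(F.col('code2') == '2', 1).otherwise(0))
)
flags.show()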
In the EDIT to my question, I asked whether lazy evaluation was the key to the problem. After doing some research in the Spark UI, I have concluded that even though my original solution looks like it applies the transformations (filter then union) sequentially for each scenario, it actually applies all of them at once when an action (e.g., dataframe.count()) is called. The Event Timeline for the transformation phase of the dataframe.count() job shows this.
The job includes 96 scenarios, each with a unique filter on the original data frame. My local machine runs 8 tasks simultaneously, where each task represents a filter from one of the scenarios.
In conclusion, Spark takes care of optimizing the filters to run in parallel, once an action has been called on the resulting dataframe.
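For reference, the filter-then-union approach described above could be sketched like this in PySpark (the three scenario conditions are just the ones from the example, not the real 96):

from functools import reduce
import pyspark.sql.functions as F

scenarios = [
    (1, F.col('Code1').isin('A', 'B')),
    (2, F.col('Code1').isin('A', 'D') & F.col('Code2').isin(2, 3)),
    (3, F.col('Code2') == 2),
]

# one filtered frame per scenario, tagged with its number, then unioned together
parts = [df.filter(cond).withColumn('Scenario', F.lit(sid)) for sid, cond in scenarios]
result = reduce(lambda a, b: a.union(b), parts)

result.count()  # the action that triggers all of the filters in a single job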
I have a Spark dataframe that adheres to the following structure:
+------+-----------+-----------+-----------+------+
|ID | Name1 | Name2 | Name3 | Y |
+------+-----------+-----------+-----------+------+
| 1 | A,1 | B,1 | C,4 | B |
| 2 | D,2 | E,2 | F,8 | D |
| 3 | G,5 | H,2 | I,3 | H |
+------+-----------+-----------+-----------+------+
For every row I want to find in which column the value of Y is denoted as the first element. So, ideally I want to retrieve a list like: [Name2,Name1,Name2].
I am not sure whether it would work to first convert to an RDD, then use a map function, and convert the result back to a DataFrame.
Any ideas are welcome.
You can probably try this piece of code:
df.show()
+---+-----+-----+-----+---+
| ID|Name1|Name2|Name3| Y|
+---+-----+-----+-----+---+
| 1| A,1| B,1| C,4| B|
| 2| D,2| E,2| F,8| D|
| 3| G,5| H,2| I,3| H|
+---+-----+-----+-----+---+
from pyspark.sql import functions as F

name_cols = ["Name1", "Name2", "Name3"]

# build a chained when(): whichName is the first Name column whose
# first comma-separated element equals Y
cond = F
for col in name_cols:
    cond = cond.when(F.split(F.col(col), ',').getItem(0) == F.col("Y"), col)

df.withColumn("whichName", cond).show()
+---+-----+-----+-----+---+---------+
| ID|Name1|Name2|Name3| Y|whichName|
+---+-----+-----+-----+---+---------+
| 1| A,1| B,1| C,4| B| Name2|
| 2| D,2| E,2| F,8| D| Name1|
| 3| G,5| H,2| I,3| H| Name2|
+---+-----+-----+-----+---+---------+
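If you actually need the Python list mentioned in the question, you could collect the new column afterwards; note that collect() pulls the values to the driver, so this is only reasonable for small results:

rows = df.withColumn("whichName", cond).select("whichName").collect()
which_names = [row["whichName"] for row in rows]
print(which_names)  # ['Name2', 'Name1', 'Name2']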