PySpark: Pivot / Flip table with column picked using a condition - python

I am trying to solve a sort of pivoting problem that seems to be more complex than expected.
I have a table with this schema:
+-------+-------------+------------+------+------+------+--------+--------+--------+
| Place | ReportMonth | MetricName | VolA | VolB | VolC | ValueA | ValueB | ValueC |
+-------+-------------+------------+------+------+------+--------+--------+--------+
| ABC | 2020-01-01 | M1 | 10 | 15 | 13 | 3.3 | 4.5 | 4.1 |
| ABC | 2020-01-01 | M2 | 9 | 34 | 12 | 3.2 | 10.1 | 4.0 |
| ABC | 2020-02-01 | M2 | 8 | 5 | 65 | 3.0 | 2.3 | 12.3 |
| DEF | 2020-01-01 | M1 | 11 | 13 | 24 | 3.4 | 4.3 | 3.1 |
| DEF | 2020-02-01 | M1 | 5 | 45 | 9 | 2.1 | 11.1 | 3.0 |
| DEF | 2020-02-01 | M2 | 7 | 8 | 53 | 2.6 | 5.3 | 25.3 |
+-------+-------------+------------+------+------+------+--------+--------+--------+
So, I have N metrics reported monthly per place (some months might be missing for some places).
For each metric there is a corresponding letter I need to report:
metrics_dct = {
    'M1': 'C',
    'M2': 'A'
}
I need to pivot this table (sdf_mtr) to have the following (sdf_agg):
+-------+-------------+--------+----------+--------+----------+
| Place | ReportMonth | M1VolC | M1ValueC | M2VolA | M2ValueA |
+-------+-------------+--------+----------+--------+----------+
| ABC | 2020-01-01 | 13 | 4.1 | 9 | 3.2 |
| ABC | 2020-02-01 | null | null | 8 | 3.0 |
| DEF | 2020-01-01 | 24 | 3.1 | null | null |
| DEF | 2020-02-01 | 9 | 3.0 | 7 | 2.6 |
+-------+-------------+--------+----------+--------+----------+
Essentially, based on the name of the metric I have to pick the correct volume and value columns, since A, B and C are different types of measures that I also have to include in the new column name.
If all columns were the same for every metric I could have used a normal pivot, but I have a condition on the metric name to pick the desired columns. I am currently using joins, but it is very inefficient:
from pyspark.sql import functions as F

grp_cols = ['Place', 'ReportMonth']

# Distinct (Place, ReportMonth) pairs as the spine to join onto
sdf_agg = sdf_mtr.groupBy(grp_cols).count().drop('count')

for mtr_k, mtr_v in metrics_dct.items():
    s_c_v = [
        F.col(f'Vol{mtr_v}').alias(f'{mtr_k}Vol{mtr_v}'),
        F.col(f'Value{mtr_v}').alias(f'{mtr_k}Value{mtr_v}')
    ]
    sdf_agg = sdf_agg.join(
        sdf_mtr.filter(F.col('MetricName') == mtr_k).select(grp_cols + s_c_v),
        grp_cols,
        'left'
    )
Does anyone have an idea of how to do this while avoiding joins? I managed it with joins, but it takes ages even on a small table. I was thinking of a broadcast join, but I would like to avoid it.

You can pivot the dataframe and aggregate the values according to your requirements. Here I take the max value when more than one value is available:
import pyspark.sql.functions as f

df.groupBy('Place', 'ReportMonth').\
    pivot('MetricName').\
    agg(f.max('VolA').alias('VolA'), f.max('ValueA').alias('ValueA'),
        f.max('VolB').alias('VolB'), f.max('ValueB').alias('ValueB')).\
    orderBy('Place', 'ReportMonth').show()
+-----+-----------+-------+---------+-------+---------+-------+---------+-------+---------+
|Place|ReportMonth|M1_VolA|M1_ValueA|M1_VolB|M1_ValueB|M2_VolA|M2_ValueA|M2_VolB|M2_ValueB|
+-----+-----------+-------+---------+-------+---------+-------+---------+-------+---------+
| ABC| 2020-01-01| 10| 3.3| 15| 4.5| 9| 3.2| 34| 10.1|
| ABC| 2020-02-01| null| null| null| null| 8| 3.0| 5| 2.3|
| DEF| 2020-01-01| 11| 3.4| 13| 4.3| null| null| null| null|
| DEF| 2020-02-01| 5| 2.1| 45| 11.1| 7| 2.6| 8| 5.3|
+-----+-----------+-------+---------+-------+---------+-------+---------+-------+---------+
If you don't want to miss any values, you can use collect_list instead of max and you are good to go.
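If the goal is the exact layout from the question (only the Vol/Value columns matching each metric's letter), one option is to pivot once over all measure columns and then select just the columns each metric needs. A minimal sketch, assuming sdf_mtr and metrics_dct from the question; column names such as M1_VolC come from Spark's pivotValue_alias naming:
import pyspark.sql.functions as f

measure_cols = ['VolA', 'ValueA', 'VolB', 'ValueB', 'VolC', 'ValueC']

# Pivot once, aggregating every measure column per metric.
pivoted = (sdf_mtr
           .groupBy('Place', 'ReportMonth')
           .pivot('MetricName', list(metrics_dct.keys()))
           .agg(*[f.max(c).alias(c) for c in measure_cols]))

# Keep only the columns the metric's letter calls for, renaming on the way.
selection = [f.col('Place'), f.col('ReportMonth')]
for mtr, letter in metrics_dct.items():
    selection += [
        f.col(f'{mtr}_Vol{letter}').alias(f'{mtr}Vol{letter}'),
        f.col(f'{mtr}_Value{letter}').alias(f'{mtr}Value{letter}'),
    ]

sdf_agg = pivoted.select(selection)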

Related

Get Min and Max from values of another column after a Groupby in PySpark

I'm trying to get the min and max values of a column after doing a groupby on two other columns in pyspark.
The dataset looks like:
| country | company | value |
|-------------------|----------------|-----------|
| arg | hh | 3 |
| arg | hh | 2 |
| arg | go | 4 |
| arg | go | 3 |
| bra | go | 1 |
| bra | go | 2 |
| bra | hh | 3 |
| bra | hh | 2 |
My current implementation is this one:
from pyspark.sql.functions import col, first, min, max

new_df = df.groupBy("country", "company").agg(
    first("value").alias("value"),
    min("value").alias("min_value"),
    max("value").alias("max_value")
)
But the result I'm getting is not correct, since I get this:
| country | company | value | min_value | max_value |
|-------------------|----------------|-----------|---------------|---------------|
| arg | hh | 3 | 3 | 3 |
| arg | hh | 2 | 2 | 2 |
| arg | go | 4 | 4 | 4 |
| arg | go | 3 | 3 | 3 |
| bra | go | 1 | 1 | 1 |
| bra | go | 2 | 2 | 2 |
| bra | hh | 3 | 3 | 3 |
| bra | hh | 2 | 2 | 2 |
And I wish to get something like:
| country | company | value | min_value | max_value |
|-------------------|----------------|-----------|---------------|---------------|
| arg | hh | 3 | 2 | 3 |
| arg | hh | 2 | 2 | 3 |
| arg | go | 4 | 3 | 4 |
| arg | go | 3 | 3 | 4 |
| bra | go | 1 | 1 | 2 |
| bra | go | 2 | 1 | 2 |
| bra | hh | 3 | 2 | 3 |
| bra | hh | 2 | 2 | 3 |
Do a join with the grouped dataframe
from pyspark.sql.functions import min, max

df.join(
    df.groupby('country', 'company').agg(min('value').alias('min_value'),
                                         max('value').alias('max_value')),
    on=['country', 'company']
)
which gives the (unordered) result you are looking for:
+-------+-------+-----+---------+---------+
|country|company|value|min_value|max_value|
+-------+-------+-----+---------+---------+
| bra| go| 1| 1| 2|
| bra| go| 2| 1| 2|
| bra| hh| 3| 2| 3|
| bra| hh| 2| 2| 3|
| arg| hh| 3| 2| 3|
| arg| hh| 2| 2| 3|
| arg| go| 4| 3| 4|
| arg| go| 3| 3| 4|
+-------+-------+-----+---------+---------+
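If you'd rather avoid the join entirely, the same columns can be added with window aggregates over the grouping keys; a minimal sketch on the same df:
from pyspark.sql import Window
from pyspark.sql.functions import min, max

# Every row in a (country, company) group gets the group's min and max.
w = Window.partitionBy('country', 'company')
df.withColumn('min_value', min('value').over(w)) \
  .withColumn('max_value', max('value').over(w)) \
  .show()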

Ordering by specific field value first pyspark

id | name | priority
--------------------
1 | core | 10
2 | core | 9
3 | other | 8
4 | board | 7
5 | board | 6
6 | core | 4
I want to order the result set by priority, but with the rows that have name=core first, even if they have a lower priority. The result should look like this:
id | name | priority
--------------------
6 | core | 4
2 | core | 9
1 | core | 10
5 | board | 6
4 | board | 7
3 | other | 8
You can order by a boolean that checks whether the name is equal to core:
import pyspark.sql.functions as F
df.orderBy(F.col('name') != 'core', 'priority').show()
+---+-----+--------+
| id| name|priority|
+---+-----+--------+
| 6| core| 4|
| 2| core| 9|
| 1| core| 10|
| 5|board| 6|
| 4|board| 7|
| 3|other| 8|
+---+-----+--------+
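If more than one name needs special treatment, the same idea generalizes with a chained when() expression; the ranks below (core first, then board, then the rest) are just an illustrative assumption:
import pyspark.sql.functions as F

# Assumed ordering: 'core' first, then 'board', then everything else,
# each group sorted by priority ascending.
df.orderBy(
    F.when(F.col('name') == 'core', 0)
     .when(F.col('name') == 'board', 1)
     .otherwise(2),
    'priority'
).show()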

Pyspark Group and Order by Sum for Group Divide by parts

I am new to pyspark and am confused about how to group some data by a couple of columns, order it by another column, then sum up a column for each group, and then use that total as the denominator for each row to calculate a weight within its group.
This is being done in jupyterlab using a pyspark3 notebook. There's no way to get around that.
Here is an example of the data...
+-------+-----+-----------+------------+------+--------+
| ntwrk | zip | zip-ntwrk | event-date | hour | counts |
+-------+-----+-----------+------------+------+--------+
| A | 1 | 1-A | 2019-10-10 | 1 | 12362 |
| B | 3 | 3-B | 2019-10-10 | 1 | 100 |
| C | 5 | 5-C | 2019-10-10 | 1 | 17493 |
| B | 3 | 3-B | 2019-10-10 | 4 | 4873 |
| A | 2 | 2-A | 2019-10-11 | 1 | 28730 |
| C | 6 | 6-C | 2019-10-11 | 1 | 728 |
| C | 5 | 5-C | 2019-10-10 | 2 | 9827 |
| A | 1 | 1-A | 2019-10-10 | 9 | 13245 |
| B | 4 | 4-B | 2019-10-11 | 1 | 3765 |
+-------+-----+-----------+------------+------+--------+
I'd like to group this by ntwrk, zip, zip-ntwrk, event-date and then order it by event-date desc and hour desc. There are 24 hours for each date, so for each zip-ntwrk combo I would want to see the date and the hour in order. Something like this...
+-------+-----+-----------+------------+------+--------+
| ntwrk | zip | zip-ntwrk | event-date | hour | counts |
+-------+-----+-----------+------------+------+--------+
| A | 1 | 1-A | 2019-10-10 | 1 | 12362 |
| A | 1 | 1-A | 2019-10-10 | 9 | 3765 |
| A | 2 | 2-A | 2019-10-11 | 1 | 28730 |
| B | 3 | 3-B | 2019-10-10 | 1 | 100 |
| B | 3 | 3-B | 2019-10-10 | 4 | 4873 |
| B | 4 | 4-B | 2019-10-11 | 1 | 3765 |
| C | 5 | 5-C | 2019-10-10 | 1 | 17493 |
| C | 5 | 5-C | 2019-10-10 | 2 | 9827 |
| C | 6 | 6-C | 2019-10-11 | 1 | 728 |
+-------+-----+-----------+------------+------+--------+
Now that everything is in order, I need to compute, for each day, the total counts across its hours, and use that total as the denominator for each row: dividing the hourly count by the day total gives the ratio of each hour's count to the day total. So something like this...
+-------+-----+-----------+------------+------+--------+-------+
| ntwrk | zip | zip-ntwrk | event-date | hour | counts | total |
+-------+-----+-----------+------------+------+--------+-------+
| A | 1 | 1-A | 2019-10-10 | 1 | 12362 | 16127 |
| A | 1 | 1-A | 2019-10-10 | 9 | 3765 | 16127 |
| A | 2 | 2-A | 2019-10-11 | 1 | 28730 | 28730 |
| B | 3 | 3-B | 2019-10-10 | 1 | 100 | 4973 |
| B | 3 | 3-B | 2019-10-10 | 4 | 4873 | 4973 |
| B | 4 | 4-B | 2019-10-11 | 1 | 3765 | 3765 |
| C | 5 | 5-C | 2019-10-10 | 1 | 17493 | 27320 |
| C | 5 | 5-C | 2019-10-10 | 2 | 9827 | 27320 |
| C | 6 | 6-C | 2019-10-11 | 1 | 728 | 728 |
+-------+-----+-----------+------------+------+--------+-------+
And now that we have the denominator, we can divide counts by total for each row to get the factor counts/total=factor and this would end up looking like...
+-------+-----+-----------+------------+------+--------+-------+--------+
| ntwrk | zip | zip-ntwrk | event-date | hour | counts | total | factor |
+-------+-----+-----------+------------+------+--------+-------+--------+
| A | 1 | 1-A | 2019-10-10 | 1 | 12362 | 16127 | .766 |
| A | 1 | 1-A | 2019-10-10 | 9 | 3765 | 16127 | .233 |
| A | 2 | 2-A | 2019-10-11 | 1 | 28730 | 28730 | 1 |
| B | 3 | 3-B | 2019-10-10 | 1 | 100 | 4973 | .02 |
| B | 3 | 3-B | 2019-10-10 | 4 | 4873 | 4973 | .979 |
| B | 4 | 4-B | 2019-10-11 | 1 | 3765 | 3765 | 1 |
| C | 5 | 5-C | 2019-10-10 | 1 | 17493 | 27320 | .64 |
| C | 5 | 5-C | 2019-10-10 | 2 | 9827 | 27320 | .359 |
| C | 6 | 6-C | 2019-10-11 | 1 | 728 | 728 | 1 |
+-------+-----+-----------+------------+------+--------+-------+--------+
That's what I'm trying to do, and any advice on how to get this done would be greatly appreciated.
Thanks
Use the window sum function: sum counts over a window partitioned by ntwrk, zip and event-date.
Finally, divide counts by that total.
Example:
from pyspark.sql.functions import *
from pyspark.sql import Window

w = Window.partitionBy("ntwrk", "zip", "event-date")

df1.withColumn("total", sum(col("counts")).over(w).cast("int")).\
    orderBy("ntwrk", "zip", "event-date", "hour").\
    withColumn("factor", format_number(col("counts") / col("total"), 3)).\
    show()
#+-----+---+---------+----------+----+------+-----+------+
#|ntwrk|zip|zip-ntwrk|event-date|hour|counts|total|factor|
#+-----+---+---------+----------+----+------+-----+------+
#| A| 1| 1-A|2019-10-10| 1| 12362|25607| 0.483|
#| A| 1| 1-A|2019-10-10| 9| 13245|25607| 0.517|#input 13245 not 3765
#| A| 2| 2-A|2019-10-11| 1| 28730|28730| 1.000|
#| B| 3| 3-B|2019-10-10| 1| 100| 4973| 0.020|
#| B| 3| 3-B|2019-10-10| 4| 4873| 4973| 0.980|
#| B| 4| 4-B|2019-10-11| 1| 3765| 3765| 1.000|
#| C| 5| 5-C|2019-10-10| 1| 17493|27320| 0.640|
#| C| 5| 5-C|2019-10-10| 2| 9827|27320| 0.360|
#| C| 6| 6-C|2019-10-11| 1| 728| 728| 1.000|
#+-----+---+---------+----------+----+------+-----+------+
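One thing to keep in mind: format_number returns a string column, which is fine for display. If the factor is needed for further calculations, a plain division (optionally rounded) keeps it numeric; a small sketch on the same df1 and window w:
# Keep factor numeric instead of a formatted string.
df1.withColumn("total", sum(col("counts")).over(w)) \
   .withColumn("factor", round(col("counts") / col("total"), 3))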
Pyspark works on a distributed architecture and hence may not retain row order, so you should always order the data the way you need before showing the records.
Now, on your point about getting the % of records at various levels: you can achieve that using a window function, partitioned by the levels at which you want the data. Like:
from pyspark.sql import Window, functions as F

w = Window.partitionBy("zip-ntwrk", "hour")
df = df.withColumn("hourly_recs", F.count("*").over(w))
Also, you can refer to this tutorial in YouTube - https://youtu.be/JEBd_4wWyj0

Efficiently transpose/explode spark dataframe columns into rows in a new table/dataframe format [pyspark]

How to efficiently explode a pyspark dataframe in this way:
+----+-------+------+------+
| id |sport |travel| work |
+----+-------+------+------+
| 1 | 0.2 | 0.4 | 0.6 |
+----+-------+------+------+
| 2 | 0.7 | 0.9 | 0.5 |
+----+-------+------+------+
and my desired output is this:
+------+--------+
| c_id | score |
+------+--------+
| 1 | 0.2 |
+------+--------+
| 1 | 0.4 |
+------+--------+
| 1 | 0.6 |
+------+--------+
| 2 | 0.7 |
+------+--------+
| 2 | 0.9 |
+------+--------+
| 2 | 0.5 |
+------+--------+
First you could put your 3 columns in an array, then arrays_zip them, explode and unpack them with .*, and finally select and rename the unzipped column.
df.withColumn("zip", F.explode(F.arrays_zip(F.array("sport","travel","work"))))\
.select("id", F.col("zip.*")).withColumnRenamed("0","score").show()
+---+-----+
| id|score|
+---+-----+
| 1| 0.2|
| 1| 0.4|
| 1| 0.6|
| 2| 0.7|
| 2| 0.9|
| 2| 0.5|
+---+-----+
You can also do this without arrays_zip (as mentioned by cPak). arrays_zip is used for combining arrays from different dataframe columns into struct form, so that you can explode them all together and then select with .*. For this case you could just use:
df.withColumn("score", F.explode((F.array(*(x for x in df.columns if x!="id"))))).select("id","score").show()

PySpark, kind of groupby, considering sequence

I have this kind of dataset:
+------+------+------+
| Time | Tool | Hole |
+------+------+------+
| 1 | A | H1 |
| 2 | A | H2 |
| 3 | B | H3 |
| 4 | A | H4 |
| 5 | A | H5 |
| 6 | B | H6 |
+------+------+------+
The expected result is the following: It's a kind of temporal aggregation of my data, where the sequence is important.
+------+-----------+---------+
| Tool | Time_From | Time_To |
+------+-----------+---------+
| A | 1 | 2 |
| B | 3 | 3 |
| A | 4 | 5 |
| B | 6 | 6 |
+------+-----------+---------+
The current result, with a groupby statement, doesn't match my expectation, as the sequence is not considered.
+------+-----------+---------+
| Tool | Time_From | Time_To |
+------+-----------+---------+
| A | 1 | 5 |
| B | 3 | 6 |
+------+-----------+---------+
rdd = rdd.groupby(['tool']).agg(min(rdd.time).alias('minTMSP'),
                                max(rdd.time).alias('maxTMSP'))
I tried to use a window function, but without any result so far... Any idea how I could handle this use case in pyspark?
We can use the lag function and Window class to check if the entry in each row has changed with regard to its previous row. We can then calculate the cumulative sum using this same Window to find our column to group by. From that point on it is straightforward to find the minimum and maximum times per group.
Hope this helps!
import pyspark.sql.functions as F
from pyspark.sql.window import Window

df = spark.createDataFrame([(1, 'A'), (2, 'A'), (3, 'B'), (4, 'A'), (5, 'A'), (6, 'B')],
                           schema=['Time', 'Tool'])

w = Window.partitionBy().orderBy('Time')
df2 = (df.withColumn('Tool_lag', F.lag(df['Tool']).over(w))
         .withColumn('equal', F.when(F.col('Tool') == F.col('Tool_lag'), F.lit(0)).otherwise(F.lit(1)))
         .withColumn('group', F.sum(F.col('equal')).over(w))
         .groupBy('Tool', 'group')
         .agg(F.min(F.col('Time')).alias('start'),
              F.max(F.col('Time')).alias('end'))
         .drop('group'))
df2.show()
Output:
+----+-----+---+
|Tool|start|end|
+----+-----+---+
| A| 1| 2|
| B| 3| 3|
| A| 4| 5|
| B| 6| 6|
+----+-----+---+
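One caveat: Window.partitionBy() with no columns moves every row into a single partition, and Spark warns about this on larger data. If the rows carry a natural grouping key (the machine_id column below is hypothetical, not in the original data), add it to the window so the lag/cumulative-sum trick stays distributed:
# Hypothetical partition key; adjust to whatever separates independent Tool sequences.
w = Window.partitionBy('machine_id').orderBy('Time')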
