I have a Spark dataframe which shows (daily) how many times a product has been used. It looks like this:
| x_id | product | usage | yyyy_mm_dd | status |
|------|---------|-------|------------|--------|
| 10 | prod_go | 15 | 2020-10-10 | i |
| 10 | prod_rv | 7 | 2020-10-10 | fc |
| 10 | prod_mb | 0 | 2020-10-10 | n |
| 15 | prod_go | 0 | 2020-10-10 | n |
| 15 | prod_rv | 5 | 2020-10-10 | fc |
| 15 | prod_mb | 1 | 2020-10-10 | fc |
| 10 | prod_go | 20 | 2020-10-11 | i |
| 10 | prod_rv | 11 | 2020-10-11 | i |
| 10 | prod_mb | 3 | 2020-10-11 | fc |
| 15 | prod_go | 0 | 2020-10-11 | n |
| 15 | prod_rv | 5 | 2020-10-11 | fc |
| 15 | prod_mb | 1 | 2020-10-11 | fc |
The status column is based on usage: when usage is 0, the status is n; when usage is between 1 and 9, the status is fc; and when usage is >= 10, the status is i.
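For reference, that rule could be expressed as a when/otherwise chain (a hypothetical sketch; the actual pipeline that produces status may differ):
from pyspark.sql import functions as F

# Hypothetical derivation of status from usage (0 -> n, 1-9 -> fc, >= 10 -> i)
df = df.withColumn(
    'status',
    F.when(F.col('usage') == 0, 'n')
     .when(F.col('usage') < 10, 'fc')
     .otherwise('i')
)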
I would like to introduce two additional columns to this Spark dataframe, date_reached_fc and date_reached_i. These columns should hold the min(yyyy_mm_dd) on which an x_id first reached the respective status for a product.
Based on the sample data, the output would look like this:
| x_id | product | usage | yyyy_mm_dd | status | date_reached_fc | date_reached_i |
|------|---------|-------|------------|--------|-----------------|----------------|
| 10 | prod_go | 15 | 2020-10-10 | i | null | 2020-10-10 |
| 10 | prod_rv | 7 | 2020-10-10 | fc | 2020-10-10 | null |
| 10 | prod_mb | 0 | 2020-10-10 | n | null | null |
| 15 | prod_go | 0 | 2020-10-10 | n | null | null |
| 15 | prod_rv | 5 | 2020-10-10 | fc | 2020-10-10 | null |
| 15 | prod_mb | 1 | 2020-10-10 | fc | 2020-10-10 | null |
| 10 | prod_go | 20 | 2020-10-11 | i | null | 2020-10-10 |
| 10 | prod_rv | 11 | 2020-10-11 | i | 2020-10-10 | 2020-10-11 |
| 10 | prod_mb | 3 | 2020-10-11 | fc | 2020-10-11 | null |
| 15 | prod_go | 0 | 2020-10-11 | n | null | null |
| 15 | prod_rv | 5 | 2020-10-11 | fc | 2020-10-10 | null |
| 15 | prod_mb | 1 | 2020-10-11 | fc | 2020-10-10 | null |
The ordering is a bit different from your question, but the results should be correct. Basically, take the min over a window and use when to keep only the dates with the relevant status. Because the window is ordered, the default frame runs from the start of the partition up to the current row, which is why a row shows null until the status has actually been reached.
from pyspark.sql import functions as F, Window

w = Window.partitionBy('x_id', 'product').orderBy('yyyy_mm_dd', 'usage')

df2 = df.withColumn(
    'date_reached_fc',
    F.min(F.when(F.col('status') == 'fc', F.col('yyyy_mm_dd'))).over(w)
).withColumn(
    'date_reached_i',
    F.min(F.when(F.col('status') == 'i', F.col('yyyy_mm_dd'))).over(w)
).orderBy('x_id', 'product', 'yyyy_mm_dd', 'usage')
df2.show()
+----+-------+-----+----------+------+---------------+--------------+
|x_id|product|usage|yyyy_mm_dd|status|date_reached_fc|date_reached_i|
+----+-------+-----+----------+------+---------------+--------------+
| 10|prod_go| 15|2020-10-10| i| null| 2020-10-10|
| 10|prod_go| 20|2020-10-11| i| null| 2020-10-10|
| 10|prod_mb| 0|2020-10-10| n| null| null|
| 10|prod_mb| 3|2020-10-11| fc| 2020-10-11| null|
| 10|prod_rv| 7|2020-10-10| fc| 2020-10-10| null|
| 10|prod_rv| 11|2020-10-11| i| 2020-10-10| 2020-10-11|
| 15|prod_go| 0|2020-10-10| n| null| null|
| 15|prod_go| 0|2020-10-11| n| null| null|
| 15|prod_mb| 1|2020-10-10| fc| 2020-10-10| null|
| 15|prod_mb| 1|2020-10-11| fc| 2020-10-10| null|
| 15|prod_rv| 5|2020-10-10| fc| 2020-10-10| null|
| 15|prod_rv| 5|2020-10-11| fc| 2020-10-10| null|
+----+-------+-----+----------+------+---------------+--------------+
I'm trying to get the min and max values of a column after doing a groupBy on two other columns in PySpark.
The dataset looks like:
| country | company | value |
|-------------------|----------------|-----------|
| arg | hh | 3 |
| arg | hh | 2 |
| arg | go | 4 |
| arg | go | 3 |
| bra | go | 1 |
| bra | go | 2 |
| bra | hh | 3 |
| bra | hh | 2 |
My current implementation is this one:
from pyspark.sql.functions import col, first, min, max

new_df = df.groupBy("country", "company").agg(
    first("value").alias("value"),
    min("value").alias("min_value"),
    max("value").alias("max_value")
)
But the result I'm getting is not correct, since I get this:
| country | company | value | min_value | max_value |
|-------------------|----------------|-----------|---------------|---------------|
| arg | hh | 3 | 3 | 3 |
| arg | hh | 2 | 2 | 2 |
| arg | go | 4 | 4 | 4 |
| arg | go | 3 | 3 | 3 |
| bra | go | 1 | 1 | 1 |
| bra | go | 2 | 2 | 2 |
| bra | hh | 3 | 3 | 3 |
| bra | hh | 2 | 2 | 2 |
And I wish to get something like:
| country | company | value | min_value | max_value |
|-------------------|----------------|-----------|---------------|---------------|
| arg | hh | 3 | 2 | 3 |
| arg | hh | 2 | 2 | 3 |
| arg | go | 4 | 3 | 4 |
| arg | go | 3 | 3 | 4 |
| bra | go | 1 | 1 | 2 |
| bra | go | 2 | 1 | 2 |
| bra | hh | 3 | 2 | 3 |
| bra | hh | 2 | 2 | 3 |
Do a join with the grouped dataframe:
from pyspark.sql.functions import min, max

df.join(
    df.groupby('country', 'company').agg(min('value').alias('min_value'),
                                         max('value').alias('max_value')),
    on=['country', 'company']
)
which gives the (unordered) result you are looking for:
+-------+-------+-----+---------+---------+
|country|company|value|min_value|max_value|
+-------+-------+-----+---------+---------+
| bra| go| 1| 1| 2|
| bra| go| 2| 1| 2|
| bra| hh| 3| 2| 3|
| bra| hh| 2| 2| 3|
| arg| hh| 3| 2| 3|
| arg| hh| 2| 2| 3|
| arg| go| 4| 3| 4|
| arg| go| 3| 3| 4|
+-------+-------+-----+---------+---------+
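If you'd rather avoid the join entirely, the same min/max can be computed with window functions over the group columns, which keeps every original row. A minimal sketch, assuming df is the input dataframe:
from pyspark.sql import functions as F, Window

# Without an orderBy, the window frame spans the whole partition,
# so min/max are taken over all rows of each (country, company) group.
w = Window.partitionBy('country', 'company')
new_df = df.withColumn('min_value', F.min('value').over(w)) \
           .withColumn('max_value', F.max('value').over(w))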
id | name | priority
--------------------
1 | core | 10
2 | core | 9
3 | other | 8
4 | board | 7
5 | board | 6
6 | core | 4
I want to order the result set by priority, but put the rows with name=core first even if they have a lower priority. The result should look like this:
id | name | priority
--------------------
6 | core | 4
2 | core | 9
1 | core | 10
5 | board | 6
4 | board | 7
3 | other | 8
You can order by a boolean that checks whether the name is equal to core:
import pyspark.sql.functions as F
df.orderBy(F.col('name') != 'core', 'priority').show()
+---+-----+--------+
| id| name|priority|
+---+-----+--------+
| 6| core| 4|
| 2| core| 9|
| 1| core| 10|
| 5|board| 6|
| 4|board| 7|
| 3|other| 8|
+---+-----+--------+
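This works because ascending order on a boolean puts false (name equal to core) before true. The same ordering can be written with an explicit when expression, a sketch on the same df:
import pyspark.sql.functions as F

# 0 for core rows, 1 for everything else, then sort by priority within each group
df.orderBy(F.when(F.col('name') == 'core', 0).otherwise(1), 'priority').show()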
I am trying to solve a sort of pivoting problem that seems to be more complex than expected.
I have a table with this schema:
+-------+-------------+------------+------+------+------+--------+--------+--------+
| Place | ReportMonth | MetricName | VolA | VolB | VolC | ValueA | ValueB | ValueC |
+-------+-------------+------------+------+------+------+--------+--------+--------+
| ABC | 2020-01-01 | M1 | 10 | 15 | 13 | 3.3 | 4.5 | 4.1 |
| ABC | 2020-01-01 | M2 | 9 | 34 | 12 | 3.2 | 10.1 | 4.0 |
| ABC | 2020-02-01 | M2 | 8 | 5 | 65 | 3.0 | 2.3 | 12.3 |
| DEF | 2020-01-01 | M1 | 11 | 13 | 24 | 3.4 | 4.3 | 3.1 |
| DEF | 2020-02-01 | M1 | 5 | 45 | 9 | 2.1 | 11.1 | 3.0 |
| DEF | 2020-02-01 | M2 | 7 | 8 | 53 | 2.6 | 5.3 | 25.3 |
+-------+-------------+------------+------+------+------+--------+--------+--------+
So, I have N metrics reported monthly per place (might be missing some months for some places).
For each metric there is a corresponding letter I need to report:
metrics_dct = {
    'M1': 'C',
    'M2': 'A'
}
I need to pivot this table (sdf_mtr) to have the following (sdf_agg):
+-------+-------------+--------+----------+--------+----------+
| Place | ReportMonth | M1VolC | M1ValueC | M2VolA | M2ValueA |
+-------+-------------+--------+----------+--------+----------+
| ABC | 2020-01-01 | 13 | 4.1 | 9 | 3.2 |
| ABC | 2020-02-01 | null | null | 8 | 3.0 |
| DEF   | 2020-01-01  | 24     | 3.1      | null   | null     |
| DEF   | 2020-02-01  | 9      | 3.0      | 7      | 2.6      |
+-------+-------------+--------+----------+--------+----------+
Essentially, based on the name of the metric, I have to pick the correct volume and value columns, since A, B, and C are different types of measures that I also need to include in the new column name.
If every metric used the same columns I could have used a normal pivot, but I have a condition on the metric name to pick the desired columns. I am currently using joins, but it is very inefficient:
grp_cols = ['Place', 'ReportMonth']
sdf_agg = sdf_mtr.groupBy(grp_cols).count().drop('count')

for mtr_k, mtr_v in metrics_dct.items():
    s_c_v = [
        F.col(f'Vol{mtr_v}').alias(f'{mtr_k}Vol{mtr_v}'),
        F.col(f'Value{mtr_v}').alias(f'{mtr_k}Value{mtr_v}')
    ]
    sdf_agg = sdf_agg.join(
        sdf_mtr.filter(F.col('MetricName') == mtr_k).select(grp_cols + s_c_v),
        grp_cols,
        'left'
    )
Does anyone have an idea of how to do this without joins? I managed with joins, but it takes ages even with a small table. I was thinking of a broadcast join, but I would like to avoid it.
You can pivot the dataframe and aggregate the values according to your requirements; here I take the max value when more than one value is available:
import pyspark.sql.functions as f

df.groupBy('Place', 'ReportMonth') \
  .pivot('MetricName') \
  .agg(f.max('VolA').alias('VolA'), f.max('ValueA').alias('ValueA'),
       f.max('VolB').alias('VolB'), f.max('ValueB').alias('ValueB')) \
  .orderBy('Place', 'ReportMonth') \
  .show()
+-----+-----------+-------+---------+-------+---------+-------+---------+-------+---------+
|Place|ReportMonth|M1_VolA|M1_ValueA|M1_VolB|M1_ValueB|M2_VolA|M2_ValueA|M2_VolB|M2_ValueB|
+-----+-----------+-------+---------+-------+---------+-------+---------+-------+---------+
| ABC| 2020-01-01| 10| 3.3| 15| 4.5| 9| 3.2| 34| 10.1|
| ABC| 2020-02-01| null| null| null| null| 8| 3.0| 5| 2.3|
| DEF| 2020-01-01| 11| 3.4| 13| 4.3| null| null| null| null|
| DEF| 2020-02-01| 5| 2.1| 45| 11.1| 7| 2.6| 8| 5.3|
+-----+-----------+-------+---------+-------+---------+-------+---------+-------+---------+
If you don't want to miss any values, you can use collect_list instead of max and you are good to go.
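If you specifically need only the letter mapped to each metric (M1 -> C, M2 -> A) and want to avoid the joins altogether, one option is to pick the right Vol/Value per row before pivoting, then rename the pivoted columns. A sketch, assuming sdf_mtr and metrics_dct from the question:
from pyspark.sql import functions as F

# Pick the letter-specific Vol/Value for each row based on its MetricName.
vol_expr = F.coalesce(*[
    F.when(F.col('MetricName') == mtr, F.col(f'Vol{letter}'))
    for mtr, letter in metrics_dct.items()
])
val_expr = F.coalesce(*[
    F.when(F.col('MetricName') == mtr, F.col(f'Value{letter}'))
    for mtr, letter in metrics_dct.items()
])

sdf_agg = (sdf_mtr
           .withColumn('Vol', vol_expr)
           .withColumn('Value', val_expr)
           .groupBy('Place', 'ReportMonth')
           .pivot('MetricName', list(metrics_dct))
           .agg(F.max('Vol').alias('Vol'), F.max('Value').alias('Value')))

# Rename e.g. M1_Vol -> M1VolC so the measure letter shows up in the column name.
for mtr, letter in metrics_dct.items():
    sdf_agg = (sdf_agg
               .withColumnRenamed(f'{mtr}_Vol', f'{mtr}Vol{letter}')
               .withColumnRenamed(f'{mtr}_Value', f'{mtr}Value{letter}'))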
I am new to PySpark and am confused about how to group some data by a couple of columns, order it by another column, sum a counts column for each group, and then use that sum as the denominator for each row in the group to calculate a weight.
This is being done in jupyterlab using a pyspark3 notebook. There's no way to get around that.
Here is an example of the data...
+-------+-----+-----------+------------+------+--------+
| ntwrk | zip | zip-ntwrk | event-date | hour | counts |
+-------+-----+-----------+------------+------+--------+
| A | 1 | 1-A | 2019-10-10 | 1 | 12362 |
| B | 3 | 3-B | 2019-10-10 | 1 | 100 |
| C | 5 | 5-C | 2019-10-10 | 1 | 17493 |
| B | 3 | 3-B | 2019-10-10 | 4 | 4873 |
| A | 2 | 2-A | 2019-10-11 | 1 | 28730 |
| C | 6 | 6-C | 2019-10-11 | 1 | 728 |
| C | 5 | 5-C | 2019-10-10 | 2 | 9827 |
| A | 1 | 1-A | 2019-10-10 | 9 | 13245 |
| B | 4 | 4-B | 2019-10-11 | 1 | 3765 |
+-------+-----+-----------+------------+------+--------+
I'd like to group this together by ntwrk, zip, zip-ntwrk, and event-date, and then order it by event-date desc and hour desc. There are 24 hours for each date, so for each zip-ntwrk combo I would want to see the date and the hour in order. Something like this...
+-------+-----+-----------+------------+------+--------+
| ntwrk | zip | zip-ntwrk | event-date | hour | counts |
+-------+-----+-----------+------------+------+--------+
| A | 1 | 1-A | 2019-10-10 | 1 | 12362 |
| A | 1 | 1-A | 2019-10-10 | 9 | 3765 |
| A | 2 | 2-A | 2019-10-11 | 1 | 28730 |
| B | 3 | 3-B | 2019-10-10 | 1 | 100 |
| B | 3 | 3-B | 2019-10-10 | 4 | 4873 |
| B | 4 | 4-B | 2019-10-11 | 1 | 3765 |
| C | 5 | 5-C | 2019-10-10 | 1 | 17493 |
| C | 5 | 5-C | 2019-10-10 | 2 | 9827 |
| C | 6 | 6-C | 2019-10-11 | 1 | 728 |
+-------+-----+-----------+------------+------+--------+
Now that everything is in order, I need to compute, for each day, the total of counts across its hours. That total will be the denominator I divide each hourly count by, giving the ratio of each hour's count to the day total. So something like this...
+-------+-----+-----------+------------+------+--------+-------+
| ntwrk | zip | zip-ntwrk | event-date | hour | counts | total |
+-------+-----+-----------+------------+------+--------+-------+
| A | 1 | 1-A | 2019-10-10 | 1 | 12362 | 16127 |
| A | 1 | 1-A | 2019-10-10 | 9 | 3765 | 16127 |
| A | 2 | 2-A | 2019-10-11 | 1 | 28730 | 28730 |
| B | 3 | 3-B | 2019-10-10 | 1 | 100 | 4973 |
| B | 3 | 3-B | 2019-10-10 | 4 | 4873 | 4973 |
| B | 4 | 4-B | 2019-10-11 | 1 | 3765 | 3765 |
| C | 5 | 5-C | 2019-10-10 | 1 | 17493 | 27320 |
| C | 5 | 5-C | 2019-10-10 | 2 | 9827 | 27320 |
| C | 6 | 6-C | 2019-10-11 | 1 | 728 | 728 |
+-------+-----+-----------+------------+------+--------+-------+
And now that we have the denominator, we can divide counts by total for each row to get the factor (counts / total = factor), and this would end up looking like...
+-------+-----+-----------+------------+------+--------+-------+--------+
| ntwrk | zip | zip-ntwrk | event-date | hour | counts | total | factor |
+-------+-----+-----------+------------+------+--------+-------+--------+
| A | 1 | 1-A | 2019-10-10 | 1 | 12362 | 16127 | .766 |
| A | 1 | 1-A | 2019-10-10 | 9 | 3765 | 16127 | .233 |
| A | 2 | 2-A | 2019-10-11 | 1 | 28730 | 28730 | 1 |
| B | 3 | 3-B | 2019-10-10 | 1 | 100 | 4973 | .02 |
| B | 3 | 3-B | 2019-10-10 | 4 | 4873 | 4973 | .979 |
| B | 4 | 4-B | 2019-10-11 | 1 | 3765 | 3765 | 1 |
| C | 5 | 5-C | 2019-10-10 | 1 | 17493 | 27320 | .64 |
| C | 5 | 5-C | 2019-10-10 | 2 | 9827 | 27320 | .359 |
| C | 6 | 6-C | 2019-10-11 | 1 | 728 | 728 | 1 |
+-------+-----+-----------+------------+------+--------+-------+--------+
That's what I'm trying to do, and any advice on how to get this done would be greatly appreciated.
Thanks
Use the window sum function: sum counts over a window partitioned by ntwrk, zip, and event-date, then divide counts by total.
Example:
from pyspark.sql import functions as F
from pyspark.sql import Window

w = Window.partitionBy("ntwrk", "zip", "event-date")

df1.withColumn("total", F.sum(F.col("counts")).over(w).cast("int")) \
   .orderBy("ntwrk", "zip", "event-date", "hour") \
   .withColumn("factor", F.format_number(F.col("counts") / F.col("total"), 3)) \
   .show()
#+-----+---+---------+----------+----+------+-----+------+
#|ntwrk|zip|zip-ntwrk|event-date|hour|counts|total|factor|
#+-----+---+---------+----------+----+------+-----+------+
#| A| 1| 1-A|2019-10-10| 1| 12362|25607| 0.483|
#| A| 1| 1-A|2019-10-10| 9| 13245|25607| 0.517|#input 13245 not 3765
#| A| 2| 2-A|2019-10-11| 1| 28730|28730| 1.000|
#| B| 3| 3-B|2019-10-10| 1| 100| 4973| 0.020|
#| B| 3| 3-B|2019-10-10| 4| 4873| 4973| 0.980|
#| B| 4| 4-B|2019-10-11| 1| 3765| 3765| 1.000|
#| C| 5| 5-C|2019-10-10| 1| 17493|27320| 0.640|
#| C| 5| 5-C|2019-10-10| 2| 9827|27320| 0.360|
#| C| 6| 6-C|2019-10-11| 1| 728| 728| 1.000|
#+-----+---+---------+----------+----+------+-----+------+
PySpark works on a distributed architecture and hence may not retain row order, so you should always order the data the way you need before showing the records.
As for getting the % of records at various levels, you can achieve that with a window function, partitioning by the levels at which you want the data.
Like:
from pyspark.sql import functions as F, Window

w = Window.partitionBy("zip-ntwrk", "hour")
df = df.withColumn("hourly_recs", F.count("*").over(w))
Also, you can refer to this tutorial on YouTube: https://youtu.be/JEBd_4wWyj0
I'm trying to create a new column which has a value based on 2 indices of that row. I have 2 dataframes with equivalent multi-index on the levels I'm querying (but not of equal size). For each row in the 1st dataframe, I want the value of the 2nd df that matches the row's indices.
I originally thought perhaps I could use a .loc[] and filter off the index values, but I cannot seem to get this to change the output row-by-row. If I wasn't using a dataframe object, I'd loop over the whole thing to do it.
I have tried to use the .apply() method, but I can't figure out what function to pass to it.
Creating some toy data with the same structure:
import pandas as pd
import numpy as np

np.random.seed(1)

df = pd.DataFrame({'Aircraft': np.ones(15),
                   'DC': np.append(np.repeat(['A', 'B'], 7), 'C'),
                   'Test': np.array([10, 10, 10, 10, 10, 10, 20, 10, 10, 10, 10, 10, 10, 20, 10]),
                   'Record': np.array([1, 2, 3, 4, 5, 6, 1, 1, 2, 3, 4, 5, 6, 1, 1]),
                   # There are multiple "value" columns in my data, but I have simplified here
                   'Value': np.random.random(15)
                   })
df.set_index(['Aircraft', 'DC', 'Test', 'Record'], inplace=True)
df.sort_index(inplace=True)

v = pd.DataFrame({'Aircraft': np.ones(7),
                  'DC': np.repeat('v', 7),
                  'Test': np.array([10, 10, 10, 10, 10, 10, 20]),
                  'Record': np.array([1, 2, 3, 4, 5, 6, 1]),
                  'Value': np.random.random(7)
                  })
v.set_index(['Aircraft', 'DC', 'Test', 'Record'], inplace=True)
v.sort_index(inplace=True)
df['v'] = df.apply(lambda x: v.loc[df.iloc[x]])
Returns error for indexing on multi-index.
To set all values to a single "v" value:
df['v'] = float(v.loc[(slice(None), 'v', 10, 1), 'Value'])
So inputs look like this:
--------------------------------------------
| Aircraft | DC | Test | Record | Value |
|----------|----|------|--------|----------|
| 1.0 | A | 10 | 1 | 0.847576 |
| | | | 2 | 0.860720 |
| | | | 3 | 0.017704 |
| | | | 4 | 0.082040 |
| | | | 5 | 0.583630 |
| | | | 6 | 0.506363 |
| | | 20 | 1 | 0.844716 |
| | B | 10 | 1 | 0.698131 |
| | | | 2 | 0.112444 |
| | | | 3 | 0.718316 |
| | | | 4 | 0.797613 |
| | | | 5 | 0.129207 |
| | | | 6 | 0.861329 |
| | | 20 | 1 | 0.535628 |
| | C | 10 | 1 | 0.121704 |
--------------------------------------------
--------------------------------------------
| Aircraft | DC | Test | Record | Value |
|----------|----|------|--------|----------|
| 1.0 | v | 10 | 1 | 0.961791 |
| | | | 2 | 0.046681 |
| | | | 3 | 0.913453 |
| | | | 4 | 0.495924 |
| | | | 5 | 0.149950 |
| | | | 6 | 0.708635 |
| | | 20 | 1 | 0.874841 |
--------------------------------------------
And after the operation, I want this:
| Aircraft | DC | Test | Record | Value | v |
|----------|----|------|--------|----------|----------|
| 1.0 | A | 10 | 1 | 0.847576 | 0.961791 |
| | | | 2 | 0.860720 | 0.046681 |
| | | | 3 | 0.017704 | 0.913453 |
| | | | 4 | 0.082040 | 0.495924 |
| | | | 5 | 0.583630 | 0.149950 |
| | | | 6 | 0.506363 | 0.708635 |
| | | 20 | 1 | 0.844716 | 0.874841 |
| | B | 10 | 1 | 0.698131 | 0.961791 |
| | | | 2 | 0.112444 | 0.046681 |
| | | | 3 | 0.718316 | 0.913453 |
| | | | 4 | 0.797613 | 0.495924 |
| | | | 5 | 0.129207 | 0.149950 |
| | | | 6 | 0.861329 | 0.708635 |
| | | 20 | 1 | 0.535628 | 0.874841 |
| | C | 10 | 1 | 0.121704 | 0.961791 |
Edit:
Since you are on pandas 0.23.4, just change droplevel to reset_index with the option drop=True:
df_result = (df.reset_index('DC').assign(v=v.reset_index('DC', drop=True))
               .set_index('DC', append=True)
               .reorder_levels(v.index.names))
Original:
One way is to move the DC index level of df into the columns, use assign to create the new column on it, then set_index and reorder_levels:
df_result = (df.reset_index('DC').assign(v=v.droplevel('DC'))
               .set_index('DC', append=True)
               .reorder_levels(v.index.names))
Out[1588]:
Value v
Aircraft DC Test Record
1.0 A 10 1 0.847576 0.961791
2 0.860720 0.046681
3 0.017704 0.913453
4 0.082040 0.495924
5 0.583630 0.149950
6 0.506363 0.708635
20 1 0.844716 0.874841
B 10 1 0.698131 0.961791
2 0.112444 0.046681
3 0.718316 0.913453
4 0.797613 0.495924
5 0.129207 0.149950
6 0.861329 0.708635
20 1 0.535628 0.874841
C 10 1 0.121704 0.961791
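Another way to get the same result, if you prefer not to shuffle index levels around, is to join on the shared index levels directly. A minimal sketch against the toy frames above (index-level joins need pandas >= 0.23):
# Drop the DC level from v, rename its values to 'v', and join on the remaining levels.
v_series = v['Value'].reset_index('DC', drop=True).rename('v')
df_result = df.join(v_series, on=['Aircraft', 'Test', 'Record'])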