id | name | priority
--------------------
1 | core | 10
2 | core | 9
3 | other | 8
4 | board | 7
5 | board | 6
6 | core | 4
I want to order the result set by priority, but with rows that have name=core coming first even if they have a lower priority. The result should look like this:
id | name | priority
--------------------
6 | core | 4
2 | core | 9
1 | core | 10
5 | board | 6
4 | board | 7
3 | other | 8
You can order by a boolean that checks whether the name is equal to core:
import pyspark.sql.functions as F
df.orderBy(F.col('name') != 'core', 'priority').show()
+---+-----+--------+
| id| name|priority|
+---+-----+--------+
| 6| core| 4|
| 2| core| 9|
| 1| core| 10|
| 5|board| 6|
| 4|board| 7|
| 3|other| 8|
+---+-----+--------+
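This works because the comparison yields a boolean column and False sorts before True, so the core rows come first; priority then orders the rows within each group. For reference, a roughly equivalent Spark SQL sketch (the temp view name t is only illustrative):
df.createOrReplaceTempView('t')
spark.sql("SELECT * FROM t ORDER BY name != 'core', priority").show()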
I'm trying to get the min and max values from a column's values after doing a groupby on two other columns in pyspark.
The dataset looks like:
| country | company | value |
|-------------------|----------------|-----------|
| arg | hh | 3 |
| arg | hh | 2 |
| arg | go | 4 |
| arg | go | 3 |
| bra | go | 1 |
| bra | go | 2 |
| bra | hh | 3 |
| bra | hh | 2 |
My current implementation is this one:
from pyspark.sql.functions import col, first, min, max
new_df = df.groupBy("country", "company").agg(first("value").alias("value"),
min("value").alias("min_value"),
max("value").alias("max_value")
)
But the result I'm getting is not correct, since I get this:
| country | company | value | min_value | max_value |
|-------------------|----------------|-----------|---------------|---------------|
| arg | hh | 3 | 3 | 3 |
| arg | hh | 2 | 2 | 2 |
| arg | go | 4 | 4 | 4 |
| arg | go | 3 | 3 | 3 |
| bra | go | 1 | 1 | 1 |
| bra | go | 2 | 2 | 2 |
| bra | hh | 3 | 3 | 3 |
| bra | hh | 2 | 2 | 2 |
And I wish to get something like:
| country | company | value | min_value | max_value |
|-------------------|----------------|-----------|---------------|---------------|
| arg | hh | 3 | 2 | 3 |
| arg | hh | 2 | 2 | 3 |
| arg | go | 4 | 3 | 4 |
| arg | go | 3 | 3 | 4 |
| bra | go | 1 | 1 | 2 |
| bra | go | 2 | 1 | 2 |
| bra | hh | 3 | 2 | 3 |
| bra | hh | 2 | 2 | 3 |
Do a join with the grouped dataframe:
from pyspark.sql.functions import min, max
df.join(df.groupby('country', 'company').agg(min('value').alias('min_value'),
max('value').alias('max_value')),
on=['country', 'company'])
which gives the (unordered) result you are looking for:
+-------+-------+-----+---------+---------+
|country|company|value|min_value|max_value|
+-------+-------+-----+---------+---------+
| bra| go| 1| 1| 2|
| bra| go| 2| 1| 2|
| bra| hh| 3| 2| 3|
| bra| hh| 2| 2| 3|
| arg| hh| 3| 2| 3|
| arg| hh| 2| 2| 3|
| arg| go| 4| 3| 4|
| arg| go| 3| 3| 4|
+-------+-------+-----+---------+---------+
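If you prefer to avoid the join, the same result can be obtained with window functions; a minimal sketch using the standard API (not part of the original answer):
from pyspark.sql import Window
from pyspark.sql.functions import min as min_, max as max_

# One window per (country, company) group; no ordering is needed for min/max.
w = Window.partitionBy('country', 'company')
df.withColumn('min_value', min_('value').over(w))\
  .withColumn('max_value', max_('value').over(w))\
  .show()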
I have a table like this:
+----+---------------------+
| id | word                |
+----+---------------------+
| 1  | today is a nice day |
| 2  | hello world         |
| 3  | he is good          |
| 4  | is it raining?      |
+----+---------------------+
I want to get the position of a substring (is) in the word column, but only if it occurs after the 3rd position:
+----+---------------------+-----------------+
| id | word                | substr_position |
+----+---------------------+-----------------+
| 1  | today is a nice day | 7               |
| 2  | hello world         | 0               |
| 3  | he is good          | 4               |
| 4  | is it raining?      | 0               |
+----+---------------------+-----------------+
Any help?
You can use the locate function in Spark.
It returns the position of the first occurrence of a substring in a string column, starting the search from a given position, and returns 0 if the substring is not found.
from pyspark.sql.functions import locate, col
df.withColumn("substr_position", locate("is", col("word"), pos=3)).show()
+---+-------------------+---------------+
| id| word|substr_position|
+---+-------------------+---------------+
| 1|today is a nice day| 7|
| 2| hello world| 0|
| 3| he is good| 4|
| 4| is it raining?| 0|
+---+-------------------+---------------+
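If you prefer the SQL form, the same call can be expressed through expr (a sketch assuming the column names above):
from pyspark.sql.functions import expr

df.withColumn("substr_position", expr("locate('is', word, 3)")).show()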
I am trying to solve a sort of pivoting problem that seems to be more complex than expected.
I have a table with this schema:
+-------+-------------+------------+------+------+------+--------+--------+--------+
| Place | ReportMonth | MetricName | VolA | VolB | VolC | ValueA | ValueB | ValueC |
+-------+-------------+------------+------+------+------+--------+--------+--------+
| ABC | 2020-01-01 | M1 | 10 | 15 | 13 | 3.3 | 4.5 | 4.1 |
| ABC | 2020-01-01 | M2 | 9 | 34 | 12 | 3.2 | 10.1 | 4.0 |
| ABC | 2020-02-01 | M2 | 8 | 5 | 65 | 3.0 | 2.3 | 12.3 |
| DEF | 2020-01-01 | M1 | 11 | 13 | 24 | 3.4 | 4.3 | 3.1 |
| DEF | 2020-02-01 | M1 | 5 | 45 | 9 | 2.1 | 11.1 | 3.0 |
| DEF | 2020-02-01 | M2 | 7 | 8 | 53 | 2.6 | 5.3 | 25.3 |
+-------+-------------+------------+------+------+------+--------+--------+--------+
So, I have N metrics reported monthly per place (might be missing some months for some places).
For each metric there is a corresponding letter I need to report:
metrics_dct = {
'M1': 'C',
'M2': 'A'
}
I need to pivot this table (sdf_mtr) to have the following (sdf_agg):
+-------+-------------+--------+----------+--------+----------+
| Place | ReportMonth | M1VolC | M1ValueC | M2VolA | M2ValueA |
+-------+-------------+--------+----------+--------+----------+
| ABC | 2020-01-01 | 13 | 4.1 | 9 | 3.2 |
| ABC | 2020-02-01 | null | null | 8 | 3.0 |
| DEF   | 2020-01-01  | 24     | 3.1      | null   | null     |
| DEF   | 2020-02-01  | 9      | 3.0      | 7      | 2.6      |
+-------+-------------+--------+----------+--------+----------+
Essentially, based on the name of the metric I have to pick the correct volume and value columns, since A, B, and C are different types of measures that I also have to include in the new column name.
If all columns were the same for every metric I could have used a normal pivot, but I have a condition on the metric name to pick the desired column. I am currently using joins, but it is very inefficient:
import pyspark.sql.functions as F

grp_cols = ['Place', 'ReportMonth']
sdf_agg = sdf_mtr.groupBy(grp_cols).count().drop('count')
for mtr_k, mtr_v in metrics_dct.items():
    s_c_v = [
        F.col(f'Vol{mtr_v}').alias(f'{mtr_k}Vol{mtr_v}'),
        F.col(f'Value{mtr_v}').alias(f'{mtr_k}Value{mtr_v}')
    ]
    sdf_agg = sdf_agg.join(
        sdf_mtr.filter(F.col('MetricName') == mtr_k).select(grp_cols + s_c_v),
        grp_cols,
        'left'
    )
Does anyone have any idea how to do this while avoiding joins? I managed with joins but it takes ages, even with a small table. I was thinking of a broadcast join, but I would like to avoid it.
You can pivot the dataframe and aggregate the values according to your requirements; here I take the max value when more than one value is available:
import pyspark.sql.functions as f

df.groupBy('Place', 'ReportMonth').\
    pivot('MetricName').\
    agg(f.max('VolA').alias('VolA'), f.max('ValueA').alias('ValueA'),
        f.max('VolB').alias('VolB'), f.max('ValueB').alias('ValueB')).\
    orderBy('Place', 'ReportMonth').show()
+-----+-----------+-------+---------+-------+---------+-------+---------+-------+---------+
|Place|ReportMonth|M1_VolA|M1_ValueA|M1_VolB|M1_ValueB|M2_VolA|M2_ValueA|M2_VolB|M2_ValueB|
+-----+-----------+-------+---------+-------+---------+-------+---------+-------+---------+
| ABC| 2020-01-01| 10| 3.3| 15| 4.5| 9| 3.2| 34| 10.1|
| ABC| 2020-02-01| null| null| null| null| 8| 3.0| 5| 2.3|
| DEF| 2020-01-01| 11| 3.4| 13| 4.3| null| null| null| null|
| DEF| 2020-02-01| 5| 2.1| 45| 11.1| 7| 2.6| 8| 5.3|
+-----+-----------+-------+---------+-------+---------+-------+---------+-------+---------+
If you don't want to miss any value you can use collect_list instead of max and you are good to go.
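If you only need the letter-specific columns listed in metrics_dct, one possible refinement (a sketch, not part of the original answer; it reuses sdf_mtr and metrics_dct from the question) is to aggregate every Vol/Value column in the pivot and then select and rename just the ones the mapping asks for:
import pyspark.sql.functions as F

# Aggregate every letter variant so each metric's desired letter is available after the pivot.
aggs = [F.max(c).alias(c)
        for letter in ('A', 'B', 'C')
        for c in (f'Vol{letter}', f'Value{letter}')]

pivoted = sdf_mtr.groupBy('Place', 'ReportMonth').pivot('MetricName').agg(*aggs)

# Keep only the columns requested by the metric-to-letter mapping,
# renamed to the M1VolC / M2ValueA style from the question.
keep = ['Place', 'ReportMonth'] + [
    F.col(f'{m}_{kind}{letter}').alias(f'{m}{kind}{letter}')
    for m, letter in metrics_dct.items()
    for kind in ('Vol', 'Value')
]
sdf_agg = pivoted.select(*keep)
sdf_agg.show()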
I am new to pyspark and am confused about how to group some data by a couple of columns, order it by another column, then add up a column for each group, and finally use that sum as the denominator for each row to calculate a weight for each row within its group.
This is being done in jupyterlab using a pyspark3 notebook. There's no way to get around that.
Here is an example of the data...
+-------+-----+-----------+------------+------+--------+
| ntwrk | zip | zip-ntwrk | event-date | hour | counts |
+-------+-----+-----------+------------+------+--------+
| A | 1 | 1-A | 2019-10-10 | 1 | 12362 |
| B | 3 | 3-B | 2019-10-10 | 1 | 100 |
| C | 5 | 5-C | 2019-10-10 | 1 | 17493 |
| B | 3 | 3-B | 2019-10-10 | 4 | 4873 |
| A | 2 | 2-A | 2019-10-11 | 1 | 28730 |
| C | 6 | 6-C | 2019-10-11 | 1 | 728 |
| C | 5 | 5-C | 2019-10-10 | 2 | 9827 |
| A | 1 | 1-A | 2019-10-10 | 9 | 13245 |
| B | 4 | 4-B | 2019-10-11 | 1 | 3765 |
+-------+-----+-----------+------------+------+--------+
I'd like to group this together by ntwrk, zip, zip-ntwrk, and event-date, and then order it by event-date desc and hour desc. There are 24 hours for each date, so for each zip-ntwrk combo I would want to see the date and the hour in order. Something like this...
+-------+-----+-----------+------------+------+--------+
| ntwrk | zip | zip-ntwrk | event-date | hour | counts |
+-------+-----+-----------+------------+------+--------+
| A | 1 | 1-A | 2019-10-10 | 1 | 12362 |
| A | 1 | 1-A | 2019-10-10 | 9 | 3765 |
| A | 2 | 2-A | 2019-10-11 | 1 | 28730 |
| B | 3 | 3-B | 2019-10-10 | 1 | 100 |
| B | 3 | 3-B | 2019-10-10 | 4 | 4873 |
| B | 4 | 4-B | 2019-10-11 | 1 | 3765 |
| C | 5 | 5-C | 2019-10-10 | 1 | 17493 |
| C | 5 | 5-C | 2019-10-10 | 2 | 9827 |
| C | 6 | 6-C | 2019-10-11 | 1 | 728 |
+-------+-----+-----------+------------+------+--------+
Now that everything is in order, I need to calculate, for each hour, the ratio of its count to the total of counts for that day (combining the hours). That day total will be the denominator: dividing the hourly count by it gives the share of the day's count that falls in each hour. So something like this...
+-------+-----+-----------+------------+------+--------+-------+
| ntwrk | zip | zip-ntwrk | event-date | hour | counts | total |
+-------+-----+-----------+------------+------+--------+-------+
| A | 1 | 1-A | 2019-10-10 | 1 | 12362 | 16127 |
| A | 1 | 1-A | 2019-10-10 | 9 | 3765 | 16127 |
| A | 2 | 2-A | 2019-10-11 | 1 | 28730 | 28730 |
| B | 3 | 3-B | 2019-10-10 | 1 | 100 | 4973 |
| B | 3 | 3-B | 2019-10-10 | 4 | 4873 | 4973 |
| B | 4 | 4-B | 2019-10-11 | 1 | 3765 | 3765 |
| C | 5 | 5-C | 2019-10-10 | 1 | 17493 | 27320 |
| C | 5 | 5-C | 2019-10-10 | 2 | 9827 | 27320 |
| C | 6 | 6-C | 2019-10-11 | 1 | 728 | 728 |
+-------+-----+-----------+------------+------+--------+-------+
And now that we have the denominator, we can divide counts by total for each row to get the factor (counts/total = factor), and this would end up looking like...
+-------+-----+-----------+------------+------+--------+-------+--------+
| ntwrk | zip | zip-ntwrk | event-date | hour | counts | total | factor |
+-------+-----+-----------+------------+------+--------+-------+--------+
| A | 1 | 1-A | 2019-10-10 | 1 | 12362 | 16127 | .766 |
| A | 1 | 1-A | 2019-10-10 | 9 | 3765 | 16127 | .233 |
| A | 2 | 2-A | 2019-10-11 | 1 | 28730 | 28730 | 1 |
| B | 3 | 3-B | 2019-10-10 | 1 | 100 | 4973 | .02 |
| B | 3 | 3-B | 2019-10-10 | 4 | 4873 | 4973 | .979 |
| B | 4 | 4-B | 2019-10-11 | 1 | 3765 | 3765 | 1 |
| C | 5 | 5-C | 2019-10-10 | 1 | 17493 | 27320 | .64 |
| C | 5 | 5-C | 2019-10-10 | 2 | 9827 | 27320 | .359 |
| C | 6 | 6-C | 2019-10-11 | 1 | 728 | 728 | 1 |
+-------+-----+-----------+------------+------+--------+-------+--------+
That's what I'm trying to do, and any advice on how to get this done would be greatly appreciated.
Thanks
Use a window sum: sum counts over a window partitioned by ntwrk, zip, and event-date, and finally divide counts by the total.
Example:
from pyspark.sql.functions import *
from pyspark.sql import Window
w = Window.partitionBy("ntwrk","zip","event-date")
df1.withColumn("total",sum(col("counts")).over(w).cast("int")).orderBy("ntwrk","zip","event-date","hour").\
withColumn("factor",format_number(col("counts")/col("total"),3)).show()
#+-----+---+---------+----------+----+------+-----+------+
#|ntwrk|zip|zip-ntwrk|event-date|hour|counts|total|factor|
#+-----+---+---------+----------+----+------+-----+------+
#| A| 1| 1-A|2019-10-10| 1| 12362|25607| 0.483|
#| A| 1| 1-A|2019-10-10| 9| 13245|25607| 0.517|#input 13245 not 3765
#| A| 2| 2-A|2019-10-11| 1| 28730|28730| 1.000|
#| B| 3| 3-B|2019-10-10| 1| 100| 4973| 0.020|
#| B| 3| 3-B|2019-10-10| 4| 4873| 4973| 0.980|
#| B| 4| 4-B|2019-10-11| 1| 3765| 3765| 1.000|
#| C| 5| 5-C|2019-10-10| 1| 17493|27320| 0.640|
#| C| 5| 5-C|2019-10-10| 2| 9827|27320| 0.360|
#| C| 6| 6-C|2019-10-11| 1| 728| 728| 1.000|
#+-----+---+---------+----------+----+------+-----+------+
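Note that format_number returns a string; if factor is needed as a numeric column for further calculations, round keeps it a double (a small variant of the code above, reusing the same window w):
df1.withColumn("total", sum(col("counts")).over(w).cast("int"))\
   .withColumn("factor", round(col("counts")/col("total"), 3))\
   .orderBy("ntwrk", "zip", "event-date", "hour").show()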
Pyspark works on a distributed architecture and hence it may not retain the order, so you should always order the data the way you need it before showing the records.
Now, on your point about getting the % of records at various levels: you can achieve this using a window function, partitioned by the levels at which you want the data.
Like:
import pyspark.sql.functions as F
from pyspark.sql import Window

w = Window.partitionBy("zip-ntwrk", "hour")
df = df.withColumn("hourly_recs", F.count("*").over(w))
Also, you can refer to this tutorial on YouTube: https://youtu.be/JEBd_4wWyj0
I have this kind of dataset:
+------+------+------+
| Time | Tool | Hole |
+------+------+------+
| 1 | A | H1 |
| 2 | A | H2 |
| 3 | B | H3 |
| 4 | A | H4 |
| 5 | A | H5 |
| 6 | B | H6 |
+------+------+------+
The expected result is the following; it's a kind of temporal aggregation of my data, where the sequence is important.
+------+-----------+---------+
| Tool | Time_From | Time_To |
+------+-----------+---------+
| A | 1 | 2 |
| B | 3 | 3 |
| A | 4 | 5 |
| B | 6 | 6 |
+------+-----------+---------+
The current result, with a groupby statement, doesn't match my expectation, as the sequence is not considered.
+------+-----------+---------+
| Tool | Time_From | Time_To |
+------+-----------+---------+
| A | 1 | 5 |
| B    | 3         | 6       |
+------+-----------+---------+
rdd = rdd.groupby(['tool']).agg(min(rdd.time).alias('minTMSP'),
max(rdd.time).alias('maxTMSP'))
I tried to pass through a window function, but without any result so far... Any idea how I could handle this use case in pyspark?
We can use the lag function and Window class to check if the entry in each row has changed with regard to its previous row. We can then calculate the cumulative sum using this same Window to find our column to group by. From that point on it is straightforward to find the minimum and maximum times per group.
Hope this helps!
import pyspark.sql.functions as F
from pyspark.sql.window import Window
df = spark.createDataFrame([(1,'A'), (2,'A'), (3,'B'),(4,'A'),(5,'A'),(6,'B')],
schema=['Time','Tool'])
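# Note: partitionBy() with no columns moves all rows to a single partition,
# which is fine for this small example but can be a bottleneck on large data.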
w = Window.partitionBy().orderBy('Time')
df2 = (df.withColumn('Tool_lag',F.lag(df['Tool']).over(w))
.withColumn('equal',F.when(F.col('Tool')==F.col('Tool_lag'), F.lit(0)).otherwise(F.lit(1)))
.withColumn('group', F.sum(F.col('equal')).over(w))
.groupBy('Tool','group').agg(
F.min(F.col('Time')).alias('start'),
F.max(F.col('Time')).alias('end'))
.drop('group'))
df2.show()
Output:
+----+-----+---+
|Tool|start|end|
+----+-----+---+
| A| 1| 2|
| B| 3| 3|
| A| 4| 5|
| B| 6| 6|
+----+-----+---+
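Since groupBy may return the segments in any order, an explicit sort by the start time (a one-line follow-up, not in the original answer) makes the displayed order deterministic:
df2.orderBy('start').show()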