I have this kind of dataset:
+------+------+------+
| Time | Tool | Hole |
+------+------+------+
| 1    | A    | H1   |
| 2    | A    | H2   |
| 3    | B    | H3   |
| 4    | A    | H4   |
| 5    | A    | H5   |
| 6    | B    | H6   |
+------+------+------+
The expected result is the following; it is a kind of temporal aggregation of my data, where the sequence matters.
+------+-----------+---------+
| Tool | Time_From | Time_To |
+------+-----------+---------+
| A    | 1         | 2       |
| B    | 3         | 3       |
| A    | 4         | 5       |
| B    | 6         | 6       |
+------+-----------+---------+
The current result, produced with a groupBy statement, doesn't match my expectation, since the sequence is not taken into account.
+------+-----------+---------+
| Tool | Time_From | Time_To |
+------+-----------+---------+
| A    | 1         | 5       |
| B    | 3         | 6       |
+------+-----------+---------+
rdd = rdd.groupby(['tool']).agg(min(rdd.time).alias('minTMSP'),
                                max(rdd.time).alias('maxTMSP'))
I tried using a window function, but without any result so far... Any idea how I could handle this use case in PySpark?
We can use the lag function and the Window class to check whether the Tool entry in each row has changed with respect to the previous row. We can then compute a cumulative sum over the same window to obtain a column to group by. From that point on it is straightforward to find the minimum and maximum times per group.
Hope this helps!
import pyspark.sql.functions as F
from pyspark.sql.window import Window
df = spark.createDataFrame([(1, 'A'), (2, 'A'), (3, 'B'), (4, 'A'), (5, 'A'), (6, 'B')],
                           schema=['Time', 'Tool'])

w = Window.partitionBy().orderBy('Time')
df2 = (df.withColumn('Tool_lag', F.lag(df['Tool']).over(w))
         # 1 whenever the tool differs from the previous row (the first row has no lag, so it is also 1)
         .withColumn('equal', F.when(F.col('Tool') == F.col('Tool_lag'), F.lit(0)).otherwise(F.lit(1)))
         # running sum of the change flags gives one group id per consecutive run
         .withColumn('group', F.sum(F.col('equal')).over(w))
         .groupBy('Tool', 'group')
         .agg(F.min(F.col('Time')).alias('start'),
              F.max(F.col('Time')).alias('end'))
         .drop('group'))
df2.show()
Output:
+----+-----+---+
|Tool|start|end|
+----+-----+---+
| A| 1| 2|
| B| 3| 3|
| A| 4| 5|
| B| 6| 6|
+----+-----+---+
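If you also want the output to use the column names from the question and to be ordered by the start of each run, a small follow-up sketch (assuming df2 from above is still available):

# Rename to the requested column names and sort by the start of each run.
(df2.withColumnRenamed('start', 'Time_From')
    .withColumnRenamed('end', 'Time_To')
    .orderBy('Time_From')
    .show())

Note that Window.partitionBy() with no columns moves all rows into a single partition, which is fine for small data but worth keeping in mind at scale.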
I'm trying to get the min and max values of a column after grouping by two other columns in PySpark.
The dataset looks like:
| country | company | value |
|-------------------|----------------|-----------|
| arg | hh | 3 |
| arg | hh | 2 |
| arg | go | 4 |
| arg | go | 3 |
| bra | go | 1 |
| bra | go | 2 |
| bra | hh | 3 |
| bra | hh | 2 |
My current implementation is this one:
from pyspark.sql.functions import col, first, min, max
new_df = df.groupBy("country", "company").agg(first("value").alias("value"),
                                              min("value").alias("min_value"),
                                              max("value").alias("max_value"))
But the result I'm getting is not correct, since I get this:
| country | company | value | min_value | max_value |
|-------------------|----------------|-----------|---------------|---------------|
| arg | hh | 3 | 3 | 3 |
| arg | hh | 2 | 2 | 2 |
| arg | go | 4 | 4 | 4 |
| arg | go | 3 | 3 | 3 |
| bra | go | 1 | 1 | 1 |
| bra | go | 2 | 2 | 2 |
| bra | hh | 3 | 3 | 3 |
| bra | hh | 2 | 2 | 2 |
And I wish to get something like:
| country | company | value | min_value | max_value |
|-------------------|----------------|-----------|---------------|---------------|
| arg | hh | 3 | 2 | 3 |
| arg | hh | 2 | 2 | 3 |
| arg | go | 4 | 3 | 4 |
| arg | go | 3 | 3 | 4 |
| bra | go | 1 | 1 | 2 |
| bra | go | 2 | 1 | 2 |
| bra | hh | 3 | 2 | 3 |
| bra | hh | 2 | 2 | 3 |
Do a join with the grouped dataframe:
from pyspark.sql.functions import min, max
df.join(df.groupby('country', 'company').agg(min('value').alias('min_value'),
                                              max('value').alias('max_value')),
        on=['country', 'company'])
which gives the (unordered) result you are looking for:
+-------+-------+-----+---------+---------+
|country|company|value|min_value|max_value|
+-------+-------+-----+---------+---------+
| bra| go| 1| 1| 2|
| bra| go| 2| 1| 2|
| bra| hh| 3| 2| 3|
| bra| hh| 2| 2| 3|
| arg| hh| 3| 2| 3|
| arg| hh| 2| 2| 3|
| arg| go| 4| 3| 4|
| arg| go| 3| 3| 4|
+-------+-------+-----+---------+---------+
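If you prefer to avoid the join, the same per-group minimum and maximum can also be attached to every row with window functions; a sketch along those lines:

from pyspark.sql.functions import min, max
from pyspark.sql.window import Window

# Attach the per-(country, company) min and max to each row without a join.
w = Window.partitionBy('country', 'company')
(df.withColumn('min_value', min('value').over(w))
   .withColumn('max_value', max('value').over(w))
   .show())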
id | name | priority
--------------------
1 | core | 10
2 | core | 9
3 | other | 8
4 | board | 7
5 | board | 6
6 | core | 4
I want to order the result set by priority, but with the rows that have name=core first, even if they have a lower priority. The result should look like this:
id | name | priority
--------------------
6 | core | 4
2 | core | 9
1 | core | 10
5 | board | 6
4 | board | 7
3 | other | 8
You can order by a boolean expression that checks whether the name is different from core: False sorts before True, so the core rows come first, and priority then orders the rows within each group:
import pyspark.sql.functions as F
df.orderBy(F.col('name') != 'core', 'priority').show()
+---+-----+--------+
| id| name|priority|
+---+-----+--------+
| 6| core| 4|
| 2| core| 9|
| 1| core| 10|
| 5|board| 6|
| 4|board| 7|
| 3|other| 8|
+---+-----+--------+
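The same idea can be written with an explicit helper column if that reads more clearly (the is_not_core name below is just illustrative):

import pyspark.sql.functions as F

# False sorts before True, so the rows with name == 'core' come first;
# priority then orders the rows within each bucket.
(df.withColumn('is_not_core', F.col('name') != 'core')
   .orderBy('is_not_core', 'priority')
   .drop('is_not_core')
   .show())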
I have a table like this:
+----+---------------------+
| id | word                |
+----+---------------------+
| 1  | today is a nice day |
| 2  | hello world         |
| 3  | he is good          |
| 4  | is it raining?      |
+----+---------------------+
I want to get the position of the substring "is" in the word column, but only if it occurs after the 3rd position:
+----+---------------------+-----------------+
| id | word                | substr_position |
+----+---------------------+-----------------+
| 1  | today is a nice day | 7               |
| 2  | hello world         | 0               |
| 3  | he is good          | 4               |
| 4  | is it raining?      | 0               |
+----+---------------------+-----------------+
Any help?
You can use the locate function in Spark.
It returns the 1-based position of the first occurrence of a substring in a string column, starting the search at a given position, and returns 0 if the substring is not found:
from pyspark.sql.functions import locate, col
df.withColumn("substr_position", locate("is", col("word"), pos=3)).show()
+---+-------------------+---------------+
| id| word|substr_position|
+---+-------------------+---------------+
| 1|today is a nice day| 7|
| 2| hello world| 0|
| 3| he is good| 4|
| 4| is it raining?| 0|
+---+-------------------+---------------+
I am trying to solve a sort of pivoting problem that seems to be more complex than expected.
I have a table with this schema:
+-------+-------------+------------+------+------+------+--------+--------+--------+
| Place | ReportMonth | MetricName | VolA | VolB | VolC | ValueA | ValueB | ValueC |
+-------+-------------+------------+------+------+------+--------+--------+--------+
| ABC | 2020-01-01 | M1 | 10 | 15 | 13 | 3.3 | 4.5 | 4.1 |
| ABC | 2020-01-01 | M2 | 9 | 34 | 12 | 3.2 | 10.1 | 4.0 |
| ABC | 2020-02-01 | M2 | 8 | 5 | 65 | 3.0 | 2.3 | 12.3 |
| DEF | 2020-01-01 | M1 | 11 | 13 | 24 | 3.4 | 4.3 | 3.1 |
| DEF | 2020-02-01 | M1 | 5 | 45 | 9 | 2.1 | 11.1 | 3.0 |
| DEF | 2020-02-01 | M2 | 7 | 8 | 53 | 2.6 | 5.3 | 25.3 |
+-------+-------------+------------+------+------+------+--------+--------+--------+
So, I have N metrics reported monthly per place (might be missing some months for some places).
For each metric there is a corresponding letter I need to report:
metrics_dct = {
    'M1': 'C',
    'M2': 'A'
}
I need to pivot this table (sdf_mtr) to have the following (sdf_agg):
+-------+-------------+--------+----------+--------+----------+
| Place | ReportMonth | M1VolC | M1ValueC | M2VolA | M2ValueA |
+-------+-------------+--------+----------+--------+----------+
| ABC   | 2020-01-01  | 13     | 4.1      | 9      | 3.2      |
| ABC   | 2020-02-01  | null   | null     | 8      | 3.0      |
| DEF   | 2020-01-01  | 24     | 3.1      | null   | null     |
| DEF   | 2020-02-01  | 9      | 3.0      | 7      | 2.6      |
+-------+-------------+--------+----------+--------+----------+
Essentially, based on the name of the metric I have to pick the correct volume and value columns, since A, B and C are different types of measures that I also have to include in the new column name.
If all columns were the same for every metric, I could have used a normal pivot. But I have a condition on the metric name for picking the desired columns. I am currently using joins, but it is very inefficient:
grp_cols = ['Place', 'ReportMonth']
sdf_agg = sdf_mtr.groupBy(grp_cols).count().drop('count')

for mtr_k, mtr_v in metrics_dct.items():
    s_c_v = [
        F.col(f'Vol{mtr_v}').alias(f'{mtr_k}Vol{mtr_v}'),
        F.col(f'Value{mtr_v}').alias(f'{mtr_k}Value{mtr_v}')
    ]
    sdf_agg = sdf_agg.join(
        sdf_mtr.filter(F.col('MetricName') == mtr_k).select(grp_cols + s_c_v),
        grp_cols,
        'left'
    )
Does anyone have an idea how to do this while avoiding joins? I managed it with joins, but it takes ages even with a small table. I was thinking of a broadcast join, but I would like to avoid joins altogether.
You can pivot the dataframe and aggregate the values according to your requirements. Here I am taking the max value whenever more than one value is available:
import pyspark.sql.functions as f
(df.groupBy('Place', 'ReportMonth')
   .pivot('MetricName')
   .agg(f.max('VolA').alias('VolA'), f.max('ValueA').alias('ValueA'),
        f.max('VolB').alias('VolB'), f.max('ValueB').alias('ValueB'))
   .orderBy('Place', 'ReportMonth')
   .show())
+-----+-----------+-------+---------+-------+---------+-------+---------+-------+---------+
|Place|ReportMonth|M1_VolA|M1_ValueA|M1_VolB|M1_ValueB|M2_VolA|M2_ValueA|M2_VolB|M2_ValueB|
+-----+-----------+-------+---------+-------+---------+-------+---------+-------+---------+
| ABC| 2020-01-01| 10| 3.3| 15| 4.5| 9| 3.2| 34| 10.1|
| ABC| 2020-02-01| null| null| null| null| 8| 3.0| 5| 2.3|
| DEF| 2020-01-01| 11| 3.4| 13| 4.3| null| null| null| null|
| DEF| 2020-02-01| 5| 2.1| 45| 11.1| 7| 2.6| 8| 5.3|
+-----+-----------+-------+---------+-------+---------+-------+---------+-------+---------+
If you don't want to miss any values, you can use collect_list instead of max and you are good to go.
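For illustration, a quick sketch of that collect_list variant (same grouping and pivot as above, just a different aggregate):

import pyspark.sql.functions as f
# collect_list keeps every value instead of collapsing to the max,
# so nothing is lost when a group contains several rows per metric.
(df.groupBy('Place', 'ReportMonth')
   .pivot('MetricName')
   .agg(f.collect_list('VolA').alias('VolA'), f.collect_list('ValueA').alias('ValueA'),
        f.collect_list('VolB').alias('VolB'), f.collect_list('ValueB').alias('ValueB'))
   .orderBy('Place', 'ReportMonth')
   .show(truncate=False))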
I am stuck solving my problem.
What I have: a PySpark dataframe that looks like this:
+----+---------+---------+
| id | country | counter |
+====+=========+=========+
| A  | RU      | 1       |
+----+---------+---------+
| B  | EN      | 2       |
+----+---------+---------+
| A  | IQ      | 1       |
+----+---------+---------+
| C  | RU      | 3       |
+----+---------+---------+
| D  | FR      | 5       |
+----+---------+---------+
| B  | FR      | 5       |
+----+---------+---------+
For each id I need to keep the row with the maximum counter value (or any one of them if they are equal) and delete all other duplicates.
So it should look like:
+----+---------+---------+
| id | country | counter |
+====+=========+=========+
| A  | RU      | 1       |
+----+---------+---------+
| C  | RU      | 3       |
+----+---------+---------+
| D  | FR      | 5       |
+----+---------+---------+
| B  | FR      | 5       |
+----+---------+---------+
Can anyone help me?
You can first drop duplicates based on id and counter, then take the max of counter over a window partitioned by id, and finally filter the rows where counter equals that maximum.
If the order of the ids is to be retained, we need to assign a monotonically increasing id first so we can sort on it later:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('id')
out = (df.withColumn('idx', F.monotonically_increasing_id())
         .drop_duplicates(['id', 'counter'])
         .withColumn('Maximum', F.max(F.col('counter')).over(w))
         .filter('counter == Maximum')
         .orderBy('idx')
         .drop('idx', 'Maximum'))
out.show()
+---+-------+-------+
| id|country|counter|
+---+-------+-------+
| A| RU| 1|
| C| RU| 3|
| D| FR| 5|
| B| FR| 5|
+---+-------+-------+
If the order of the ids is not a concern, the same logic applies but no additional id is required:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('id')
out1 = (df.drop_duplicates(['id', 'counter'])
          .withColumn('Maximum', F.max(F.col('counter')).over(w))
          .filter('counter == Maximum')
          .drop('Maximum'))
out1.show()
+---+-------+-------+
| id|country|counter|
+---+-------+-------+
| B| FR| 5|
| D| FR| 5|
| C| RU| 3|
| A| RU| 1|
+---+-------+-------+
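For comparison, the same result (ignoring the original row order) can also be obtained by ranking the rows within each id with row_number and keeping the top row; a sketch, with ties broken arbitrarily as the question allows:

import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Rank the rows of each id by descending counter and keep only the first one.
w = Window.partitionBy('id').orderBy(F.desc('counter'))
(df.withColumn('rn', F.row_number().over(w))
   .filter('rn == 1')
   .drop('rn')
   .show())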