I'm stuck on a problem.
What I have:
A PySpark dataframe that looks like this:
+----+---------+---------+
| id | country | counter |
+====+=========+=========+
| A | RU | 1 |
+----+---------+---------+
| B | EN | 2 |
+----+---------+---------+
| A | IQ | 1 |
+----+---------+---------+
| C | RU | 3 |
+----+---------+---------+
| D | FR | 5 |
+----+---------+---------+
| B | FR | 5 |
+----+---------+---------+
For each id, I need to keep the country with the maximum counter (or any one of them if they are equal) and delete all other duplicates.
So it should look like this:
+----+---------+---------+
| id | country | counter |
+====+=========+=========+
| A | RU | 1 |
+----+---------+---------+
| C | RU | 3 |
+----+---------+---------+
| D | FR | 5 |
+----+---------+---------+
| B | FR | 5 |
+----+---------+---------+
Can anyone help me?
You can first drop duplicates based on id and counter, then take the max of counter over a window partitioned by id, and finally keep the rows where counter equals that maximum.
If the original order of id is to be retained, we need to assign a monotonically increasing index first so we can sort on it later:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('id')
out = (df.withColumn('idx', F.monotonically_increasing_id())
         .drop_duplicates(['id', 'counter'])
         .withColumn('Maximum', F.max(F.col('counter')).over(w))
         .filter('counter == Maximum')
         .orderBy('idx')
         .drop('idx', 'Maximum'))
out.show()
+---+-------+-------+
| id|country|counter|
+---+-------+-------+
| A| RU| 1|
| C| RU| 3|
| D| FR| 5|
| B| FR| 5|
+---+-------+-------+
If the order of id is not a concern, the same logic works without the extra index column:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('id')
out1 = (df.drop_duplicates(['id', 'counter'])
          .withColumn('Maximum', F.max(F.col('counter')).over(w))
          .filter('counter == Maximum')
          .drop('Maximum'))
out1.show()
+---+-------+-------+
| id|country|counter|
+---+-------+-------+
| B| FR| 5|
| D| FR| 5|
| C| RU| 3|
| A| RU| 1|
+---+-------+-------+
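An equivalent alternative, not from the answer above, is to rank the rows of each id by descending counter and keep the first one. A minimal sketch, with ties broken arbitrarily just like the window-max approach:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Rank rows within each id from highest to lowest counter, keep rank 1.
w2 = Window.partitionBy('id').orderBy(F.desc('counter'))
out2 = (df.withColumn('rn', F.row_number().over(w2))
          .filter(F.col('rn') == 1)
          .drop('rn'))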
I need to add a new column to the following dataframe so that rows with the same value in column "column_1" get the same numerical value, starting at 1, in a new column called "group".
```
# +---+--------+
# | id|column_1|
# +---+--------+
# |  0|       a|
# |  7|       a|
# |  1|       c|
# |  2|       d|
# |  3|       e|
# |  4|       a|
# | 10|       c|
# | 12|       b|
# +---+--------+
```
And I want:
```
# +---+--------+-----+
# | id|column_1|group|
# +---+--------+-----+
# |  0|       a|    1|
# |  7|       a|    1|
# |  1|       c|    3|
# |  2|       d|    4|
# |  3|       e|    5|
# |  4|       a|    1|
# | 10|       c|    3|
# | 12|       b|    2|
# +---+--------+-----+
```
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# A constant "aux" column puts every row in one window partition, so dense_rank over column_1 gives equal values the same group number.
df_expl = df_expl.withColumn("aux", F.lit(1))
windowSpec = Window.partitionBy("aux").orderBy("column_1")
df_expl = df_expl.withColumn("group", F.dense_rank().over(windowSpec)).drop("aux")
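The constant partition forces every row through a single task. A more scalable variant, sketched here and not part of the original answer, is to rank only the distinct column_1 values and join that small mapping back:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Rank the (small) set of distinct values, then broadcast-join it back onto the full dataframe.
mapping = (df_expl.select("column_1").distinct()
           .withColumn("group", F.dense_rank().over(Window.orderBy("column_1"))))
df_expl = df_expl.join(F.broadcast(mapping), on="column_1", how="left")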
I'm trying to get the min and max of a column's values after grouping by two other columns in PySpark.
The dataset looks like:
| country | company | value |
|-------------------|----------------|-----------|
| arg | hh | 3 |
| arg | hh | 2 |
| arg | go | 4 |
| arg | go | 3 |
| bra | go | 1 |
| bra | go | 2 |
| bra | hh | 3 |
| bra | hh | 2 |
My current implementation is this one:
from pyspark.sql.functions import col, first, min, max

new_df = df.groupBy("country", "company").agg(first("value").alias("value"),
                                              min("value").alias("min_value"),
                                              max("value").alias("max_value"))
But the result I'm getting is not correct, since I get this:
| country | company | value | min_value | max_value |
|-------------------|----------------|-----------|---------------|---------------|
| arg | hh | 3 | 3 | 3 |
| arg | hh | 2 | 2 | 2 |
| arg | go | 4 | 4 | 4 |
| arg | go | 3 | 3 | 3 |
| bra | go | 1 | 1 | 1 |
| bra | go | 2 | 2 | 2 |
| bra | hh | 3 | 3 | 3 |
| bra | hh | 2 | 2 | 2 |
And I wish to get something like:
| country | company | value | min_value | max_value |
|-------------------|----------------|-----------|---------------|---------------|
| arg | hh | 3 | 2 | 3 |
| arg | hh | 2 | 2 | 3 |
| arg | go | 4 | 3 | 4 |
| arg | go | 3 | 3 | 4 |
| bra | go | 1 | 1 | 2 |
| bra | go | 2 | 1 | 2 |
| bra | hh | 3 | 2 | 3 |
| bra | hh | 2 | 2 | 3 |
Do a join with the grouped dataframe
from pyspark.sql.functions import min, max

df.join(df.groupby('country', 'company').agg(min('value').alias('min_value'),
                                              max('value').alias('max_value')),
        on=['country', 'company'])
which gives the (unordered) result you are looking for:
+-------+-------+-----+---------+---------+
|country|company|value|min_value|max_value|
+-------+-------+-----+---------+---------+
| bra| go| 1| 1| 2|
| bra| go| 2| 1| 2|
| bra| hh| 3| 2| 3|
| bra| hh| 2| 2| 3|
| arg| hh| 3| 2| 3|
| arg| hh| 2| 2| 3|
| arg| go| 4| 3| 4|
| arg| go| 3| 3| 4|
+-------+-------+-----+---------+---------+
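If you would rather avoid the self-join, the same aggregates can be computed directly over a window partitioned by the two grouping columns. A sketch of that alternative (not from the answer above):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# min/max per (country, company), attached to every original row.
w = Window.partitionBy('country', 'company')
new_df = (df.withColumn('min_value', F.min('value').over(w))
            .withColumn('max_value', F.max('value').over(w)))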
In PySpark, I have a dataframe in this format:
CODE | TITLE | POSITION
A | per | 1
A | eis | 3
A | fon | 4
A | dat | 5
B | jem | 2
B | neu | 3
B | tri | 5
B | nok | 6
and I want to get this:
CODE | TITLE | POSITION
A | per | 1
A | eis | 2
A | fon | 3
A | dat | 4
B | jem | 1
B | neu | 2
B | tri | 3
B | nok | 4
The idea is that the POSITION column should start at 1 for each CODE and increase without gaps. For example, for CODE A it starts at 1 but position 2 is missing, so 3 should become 2, 4 should become 3, and 5 should become 4.
How can we do that in PySpark?
Thank you for your help.
With a slightly simpler dataframe
df.show()
+----+-----+--------+
|CODE|TITLE|POSITION|
+----+-----+--------+
| A| AA| 1|
| A| BB| 3|
| A| CC| 4|
| A| DD| 5|
| B| EE| 2|
| B| FF| 3|
| B| GG| 5|
| B| HH| 6|
+----+-----+--------+
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window
df.withColumn('POSITION', row_number().over(Window.partitionBy('CODE').orderBy('POSITION'))).show()
+----+-----+--------+
|CODE|TITLE|POSITION|
+----+-----+--------+
| B| EE| 1|
| B| FF| 2|
| B| GG| 3|
| B| HH| 4|
| A| AA| 1|
| A| BB| 2|
| A| CC| 3|
| A| DD| 4|
+----+-----+--------+
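A side note, not part of the answer above: row_number gives every row a distinct new position even if two rows share the same old POSITION within a CODE. If such ties should keep the same new number, dense_rank is the drop-in replacement:
from pyspark.sql.functions import dense_rank
from pyspark.sql.window import Window

df.withColumn('POSITION', dense_rank().over(Window.partitionBy('CODE').orderBy('POSITION'))).show()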
I have a CSV file which has been imported as a dataframe with the following code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("name of file.csv", inferSchema = True, header = True)
df.show()
Output:
+-----+------+-----+
|col1 | col2 | col3|
+-----+------+-----+
| A | 2 | 4 |
+-----+------+-----+
| A | 4 | 5 |
+-----+------+-----+
| A | 7 | 7 |
+-----+------+-----+
| A | 3 | 8 |
+-----+------+-----+
| A | 7 | 3 |
+-----+------+-----+
| B | 8 | 9 |
+-----+------+-----+
| B | 10 | 10 |
+-----+------+-----+
| B | 8 | 9 |
+-----+------+-----+
| B | 20 | 15 |
+-----+------+-----+
I want to create another column, col4, which contains col2[n+3] / col2[n] - 1 for each group in col1 separately.
The output should be
+-----+------+-----+-----+
|col1 | col2 | col3| col4|
+-----+------+-----+-----+
| A | 2 | 4 | 0.5| #(3/2-1)
+-----+------+-----+-----+
| A | 4 | 5 | 0.75| #(7/4-1)
+-----+------+-----+-----+
| A | 7 | 7 | NA |
+-----+------+-----+-----+
| A | 3 | 8 | NA |
+-----+------+-----+-----+
| A | 7 | 3 | NA |
+-----+------+-----+-----+
| B | 8 | 9 | 1.5 |
+-----+------+-----+-----+
| B | 10 | 10 | NA |
+-----+------+-----+-----+
| B | 8 | 9 | NA |
+-----+------+-----+-----+
| B | 20 | 15 | NA |
+-----+------+-----+-----+
I know how to do this in pandas but I am not sure how to do some computation on the grouped column in PySpark.
At the moment, my PySpark version is 2.4
My Spark version is 2.2. lead() and Window have been used; for reference:
from pyspark.sql.window import Window
from pyspark.sql.functions import lead, col

my_window = Window.partitionBy('col1').orderBy('col1')
df = (df.withColumn('col2_lead_3', lead(col('col2'), 3).over(my_window))
        .withColumn('col4', (col('col2_lead_3') / col('col2')) - 1)
        .drop('col2_lead_3'))
df.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| B| 8| 9| 1.5|
| B| 10| 10|null|
| B| 8| 9|null|
| B| 20| 15|null|
| A| 2| 4| 0.5|
| A| 4| 5|0.75|
| A| 7| 7|null|
| A| 3| 8|null|
| A| 7| 3|null|
+----+----+----+----+
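One caveat, not raised in the answer itself: the window is ordered by col1, which is also the partition key, so the order of rows within each group is not actually defined and the lead values depend on how Spark happens to read the file. If the original row order matters, a common workaround (sketched here under that assumption) is to capture the read order explicitly and order the window by it:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Record the order in which rows were read, then use it to order the window.
df = df.withColumn('row_order', F.monotonically_increasing_id())
my_window = Window.partitionBy('col1').orderBy('row_order')
df = (df.withColumn('col2_lead_3', F.lead('col2', 3).over(my_window))
        .withColumn('col4', F.col('col2_lead_3') / F.col('col2') - 1)
        .drop('col2_lead_3', 'row_order'))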
I have this kind of dataset:
+------+------+------+
| Time | Tool | Hole |
+------+------+------+
| 1 | A | H1 |
| 2 | A | H2 |
| 3 | B | H3 |
| 4 | A | H4 |
| 5 | A | H5 |
| 6 | B | H6 |
+------+------+------+
The expected result is the following; it's a kind of temporal aggregation of my data, where the sequence matters:
+------+-----------+---------+
| Tool | Time_From | Time_To |
+------+-----------+---------+
| A | 1 | 2 |
| B | 3 | 3 |
| A | 4 | 5 |
| B | 6 | 6 |
+------+-----------+---------+
My current result, using a groupby statement, doesn't match my expectation, because the sequence is not taken into account:
+------+-----------+---------+
| Tool | Time_From | Time_To |
+------+-----------+---------+
| A    | 1         | 5       |
| B    | 3         | 6       |
+------+-----------+---------+
rdd = rdd.groupby(['tool']).agg(min(rdd.time).alias('minTMSP'),
                                max(rdd.time).alias('maxTMSP'))
I tried using a window function, but without any result so far. Any idea how I could handle this use case in PySpark?
We can use the lag function and Window class to check if the entry in each row has changed with regard to its previous row. We can then calculate the cumulative sum using this same Window to find our column to group by. From that point on it is straightforward to find the minimum and maximum times per group.
Hope this helps!
import pyspark.sql.functions as F
from pyspark.sql.window import Window
df = spark.createDataFrame([(1, 'A'), (2, 'A'), (3, 'B'), (4, 'A'), (5, 'A'), (6, 'B')],
                           schema=['Time', 'Tool'])

# One window over the whole frame, ordered by Time.
w = Window.partitionBy().orderBy('Time')

df2 = (df.withColumn('Tool_lag', F.lag(df['Tool']).over(w))                # previous row's Tool
         .withColumn('equal', F.when(F.col('Tool') == F.col('Tool_lag'), F.lit(0)).otherwise(F.lit(1)))
         .withColumn('group', F.sum(F.col('equal')).over(w))               # cumulative sum -> run id
         .groupBy('Tool', 'group')
         .agg(F.min(F.col('Time')).alias('start'),
              F.max(F.col('Time')).alias('end'))
         .drop('group'))
df2.show()
Output:
+----+-----+---+
|Tool|start|end|
+----+-----+---+
| A| 1| 2|
| B| 3| 3|
| A| 4| 5|
| B| 6| 6|
+----+-----+---+
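A practical note that goes beyond the answer: Window.partitionBy() with no keys pulls the whole dataframe into a single partition, and Spark will warn about this on large data. If the real dataset has a natural grouping key, say a hypothetical machine_id column, partitioning the window by it keeps the run detection distributed:
from pyspark.sql.window import Window

# 'machine_id' is a hypothetical key; runs are then detected independently per machine.
w = Window.partitionBy('machine_id').orderBy('Time')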