I have a PySpark data table that looks like the following
shouldMerge | number
true | 1
true | 1
true | 2
false | 3
false | 1
I want to combine all of the rows where shouldMerge is true and add up the numbers.
so the final output would look like
shouldMerge | number
true | 4
false | 3
false | 1
How can I select all the ones with shouldMerge == true, add up the numbers, and generate a new row in PySpark?
Edit: Here is an alternate, slightly more complicated scenario closer to what I'm trying to solve, where we only aggregate the rows with a positive mergeId:
mergeId | number
1 | 1
2 | 1
1 | 2
-1 | 3
-1 | 1
so the final output would look like
mergeId | number
1 | 3
2 | 1
-1 | 3
-1 | 1
IIUC, you want to do a groupBy but only on the positive mergeIds.
One way is to filter your DataFrame for the positive ids, group, aggregate, and union this back with the negative ids (similar to @shanmuga's answer below).
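A minimal sketch of that first approach (assuming the DataFrame from the edit is named df) might look like:
import pyspark.sql.functions as f

# assumption: df has the mergeId/number columns from the edited question
positive = df.filter(f.col("mergeId") > 0)\
    .groupBy("mergeId")\
    .agg(f.sum("number").alias("number"))
negative = df.filter(f.col("mergeId") <= 0)
result = positive.union(negative)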
The other way is to use when to dynamically create a grouping key. If the mergeId is positive, use the mergeId to group. Otherwise, use a monotonically_increasing_id to ensure that the row does not get aggregated.
Here is an example:
import pyspark.sql.functions as f

df.withColumn("uid", f.monotonically_increasing_id())\
    .groupBy(
        f.when(
            f.col("mergeId") > 0,
            f.col("mergeId")
        ).otherwise(f.col("uid")).alias("mergeKey"),
        f.col("mergeId")
    )\
    .agg(f.sum("number").alias("number"))\
    .drop("mergeKey")\
    .show()
#+-------+------+
#|mergeId|number|
#+-------+------+
#| -1| 1.0|
#| 1| 3.0|
#| 2| 1.0|
#| -1| 3.0|
#+-------+------+
This can easily be generalized by changing the when condition (in this case it's f.col("mergeId") > 0) to match your specific requirements.
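For instance, a rough sketch for the original boolean shouldMerge scenario (assuming that DataFrame is also named df, and reusing the f import from above) could be:
df.withColumn("uid", f.monotonically_increasing_id())\
    .groupBy(
        # all "true" rows share the sentinel key -1; each "false" row keeps its unique uid
        f.when(f.col("shouldMerge"), f.lit(-1)).otherwise(f.col("uid")).alias("mergeKey"),
        f.col("shouldMerge")
    )\
    .agg(f.sum("number").alias("number"))\
    .drop("mergeKey")\
    .show()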
Explanation:
First we create a temporary column uid, which is a unique ID for each row. Next, we call groupBy: if the mergeId is positive, we use the mergeId to group; otherwise we use the uid as the mergeKey. I also passed the mergeId in as a second groupBy column as a way to keep that column in the output.
To demonstrate what is going on, take a look at the intermediate result:
df.withColumn("uid", f.monotonically_increasing_id())\
.withColumn(
"mergeKey",
f.when(
f.col("mergeId") > 0,
f.col("mergeId")
).otherwise(f.col("uid")).alias("mergeKey")
)\
.show()
#+-------+------+-----------+-----------+
#|mergeId|number| uid| mergeKey|
#+-------+------+-----------+-----------+
#| 1| 1| 0| 1|
#| 2| 1| 8589934592| 2|
#| 1| 2|17179869184| 1|
#| -1| 3|25769803776|25769803776|
#| -1| 1|25769803777|25769803777|
#+-------+------+-----------+-----------+
As you can see, the mergeKey remains the unique value for the negative mergeIds.
From this intermediate step, the desired result is just a trivial group by and sum, followed by dropping the mergeKey column.
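As a sketch, assuming the intermediate result shown above is assigned to a variable named intermediate (an assumed name), that last step would be:
intermediate.groupBy("mergeKey", "mergeId")\
    .agg(f.sum("number").alias("number"))\
    .drop("mergeKey")\
    .show()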
You will have to filter out only the rows where shouldMerge is true and aggregate, then union this with all the remaining rows.
import pyspark.sql.functions as functions

# First scenario: boolean shouldMerge
df = sqlContext.createDataFrame([
    (True, 1),
    (True, 1),
    (True, 2),
    (False, 3),
    (False, 1),
], ("shouldMerge", "number"))

false_df = df.filter("shouldMerge = false")
true_df = df.filter("shouldMerge = true")

result = true_df.groupBy("shouldMerge")\
    .agg(functions.sum("number").alias("number"))\
    .unionAll(false_df)

# Second scenario (from the edit): aggregate only the positive mergeIds
df = sqlContext.createDataFrame([
    (1, 1),
    (2, 1),
    (1, 2),
    (-1, 3),
    (-1, 1),
], ("mergeId", "number"))

merge_condition = df["mergeId"] > -1
remaining = ~merge_condition
groupby_field = "mergeId"

false_df = df.filter(remaining)
true_df = df.filter(merge_condition)

result = true_df.groupBy(groupby_field)\
    .agg(functions.sum("number").alias("number"))\
    .unionAll(false_df)
result.show()
For the first problem posted by the OP:
# Create the DataFrame
valuesCol = [(True,1),(True,1),(True,2),(False,3),(False,1)]
df = sqlContext.createDataFrame(valuesCol,['shouldMerge','number'])
df.show()
+-----------+------+
|shouldMerge|number|
+-----------+------+
| true| 1|
| true| 1|
| true| 2|
| false| 3|
| false| 1|
+-----------+------+
# Packages to be imported
from pyspark.sql.window import Window
from pyspark.sql.functions import when, col, lag
# Register the dataframe as a view
df.registerTempTable('table_view')
df = sqlContext.sql(
    'select shouldMerge, number, sum(number) over (partition by shouldMerge) as sum_number from table_view'
)
df = df.withColumn('number', when(col('shouldMerge') == True, col('sum_number')).otherwise(col('number')))
df.show()
+-----------+------+----------+
|shouldMerge|number|sum_number|
+-----------+------+----------+
| true| 4| 4|
| true| 4| 4|
| true| 4| 4|
| false| 3| 4|
| false| 1| 4|
+-----------+------+----------+
df = df.drop('sum_number')
my_window = Window.partitionBy().orderBy('shouldMerge')
df = df.withColumn('shouldMerge_lag', lag(col('shouldMerge'),1).over(my_window))
df.show()
+-----------+------+---------------+
|shouldMerge|number|shouldMerge_lag|
+-----------+------+---------------+
| false| 3| null|
| false| 1| false|
| true| 4| false|
| true| 4| true|
| true| 4| true|
+-----------+------+---------------+
df = df.where(~((col('shouldMerge')==True) & (col('shouldMerge_lag')==True))).drop('shouldMerge_lag')
df.show()
+-----------+------+
|shouldMerge|number|
+-----------+------+
| false| 3|
| false| 1|
| true| 4|
+-----------+------+
For the second problem posted by the OP:
# Create the DataFrame
valuesCol = [(1,2),(1,1),(2,1),(1,2),(-1,3),(-1,1)]
df = sqlContext.createDataFrame(valuesCol,['mergeId','number'])
df.show()
+-------+------+
|mergeId|number|
+-------+------+
| 1| 2|
| 1| 1|
| 2| 1|
| 1| 2|
| -1| 3|
| -1| 1|
+-------+------+
# Packages to be imported
from pyspark.sql.window import Window
from pyspark.sql.functions import when, col, lag
# Register the dataframe as a view
df.registerTempTable('table_view')
df = sqlContext.sql(
    'select mergeId, number, sum(number) over (partition by mergeId) as sum_number from table_view'
)
df = df.withColumn('number', when(col('mergeId') > 0, col('sum_number')).otherwise(col('number')))
df.show()
+-------+------+----------+
|mergeId|number|sum_number|
+-------+------+----------+
| 1| 5| 5|
| 1| 5| 5|
| 1| 5| 5|
| 2| 1| 1|
| -1| 3| 4|
| -1| 1| 4|
+-------+------+----------+
df = df.drop('sum_number')
my_window = Window.partitionBy('mergeId').orderBy('mergeId')
df = df.withColumn('mergeId_lag', lag(col('mergeId'),1).over(my_window))
df.show()
+-------+------+-----------+
|mergeId|number|mergeId_lag|
+-------+------+-----------+
| 1| 5| null|
| 1| 5| 1|
| 1| 5| 1|
| 2| 1| null|
| -1| 3| null|
| -1| 1| -1|
+-------+------+-----------+
df = df.where(~((col('mergeId') > 0) & (col('mergeId_lag').isNotNull()))).drop('mergeId_lag')
df.show()
+-------+------+
|mergeId|number|
+-------+------+
| 1| 5|
| 2| 1|
| -1| 3|
| -1| 1|
+-------+------+
Documentation: lag() - Returns the value that is offset rows before the current row.
Related
I have a dataframe like this
+---+---------------------+
| id| csv|
+---+---------------------+
| 1|a,b,c\n1,2,3\n2,3,4\n|
| 2|a,b,c\n3,4,5\n4,5,6\n|
| 3|a,b,c\n5,6,7\n6,7,8\n|
+---+---------------------+
and I want to explode the string-type csv column; in fact I'm only interested in this column. So I'm looking for a method to obtain the following dataframe from the above.
+--+--+--+
| a| b| c|
+--+--+--+
| 1| 2| 3|
| 2| 3| 4|
| 3| 4| 5|
| 4| 5| 6|
| 5| 6| 7|
| 6| 7| 8|
+--+--+--+
Looking at the from_csv documentation it seems that the input csv string can contain only one row of data, which I found stated more clearly here. So that's not an option.
I guess I could loop over the individual rows of the input dataframe, extract and parse the csv string from each row and then stitch everything together:
rows = df.collect()
for (i, row) in enumerate(rows):
    data = row['csv']
    data = data.split('\\n')
    rdd = spark.sparkContext.parallelize(data)
    df_row = (spark.read
              .option('header', 'true')
              .schema('a int, b int, c int')
              .csv(rdd))
    if i == 0:
        df_new = df_row
    else:
        df_new = df_new.union(df_row)
df_new.show()
But that seems awfully inefficient. Is there a better way to achieve the desired result?
Using split + from_csv functions along with transform you can do something like:
from pyspark.sql import functions as F

df = spark.createDataFrame([
    (1, r"a,b,c\n1,2,3\n2,3,4\n"),
    (2, r"a,b,c\n3,4,5\n4,5,6\n"),
    (3, r"a,b,c\n5,6,7\n6,7,8\n")
], ["id", "csv"])

df1 = df.withColumn(
    "csv",
    F.transform(
        F.split(F.regexp_replace("csv", r"^a,b,c\\n|\\n$", ""), r"\\n"),
        lambda x: F.from_csv(x, "a int, b int, c int")
    )
).selectExpr("inline(csv)")

df1.show()
# +---+---+---+
# | a| b| c|
# +---+---+---+
# | 1| 2| 3|
# | 2| 3| 4|
# | 3| 4| 5|
# | 4| 5| 6|
# | 5| 6| 7|
# | 6| 7| 8|
# +---+---+---+
I have the following pyspark dataframe:
import pandas as pd
foo = pd.DataFrame({'id': [1,1,1,1,1, 2,2,2,2,2],
                    'time': [1,2,3,4,5, 1,2,3,4,5],
                    'value': ['a','a','a','b','b', 'b','b','c','c','c']})
foo_df = spark.createDataFrame(foo)
foo_df.show()
+---+----+-----+
| id|time|value|
+---+----+-----+
| 1| 1| a|
| 1| 2| a|
| 1| 3| a|
| 1| 4| b|
| 1| 5| b|
| 2| 1| b|
| 2| 2| b|
| 2| 3| c|
| 2| 4| c|
| 2| 5| c|
+---+----+-----+
I would like, for a rolling time window of 3, to calculate the percentage of appearances of each value in the value column. The operation should be done per id.
The output dataframe would look something like this:
+---+------------------+------------------+------------------+
| id| perc_a| perc_b| perc_c|
+---+------------------+------------------+------------------+
| 1| 1.0| 0.0| 0.0|
| 1|0.6666666666666666|0.3333333333333333| 0.0|
| 1|0.3333333333333333|0.6666666666666666| 0.0|
| 2| 0.0|0.6666666666666666|0.3333333333333333|
| 2| 0.0|0.3333333333333333|0.6666666666666666|
| 2| 0.0| 0.0| 1.0|
+---+------------------+------------------+------------------+
Explanation of result:
for id=1 and the first window (time=[1,2,3]), the value column contains only a's, so perc_a equals 100 and the rest are 0.
for id=1 and the second window (time=[2,3,4]), the value column contains 2 a's and 1 b, so perc_a equals 66.6, perc_b is 33.3, and perc_c equals 0.
etc.
How could I achieve that in pyspark ?
EDIT
I am using pyspark 2.4
You can use count with a window function.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('id').orderBy('time').rowsBetween(Window.currentRow, 2)
df = (df.select('id', F.col('time').alias('window'),
                *[(F.count(F.when(F.col('value') == x, 'value')).over(w)
                   / F.count('value').over(w) * 100).alias(f'perc_{x}')
                  for x in ['a', 'b', 'c']])
      .filter(F.col('time') < 4))
Clever answer by @Emma. Expanding the answer with a SparkSQL implementation.
The approach is to collect the values over the intended sliding row range (ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) and filter on time < 4, then explode the collected list to count the individual frequencies, and finally pivot the result to the intended format.
SparkSQL - Collect List
import pandas as pd

foo = pd.DataFrame({'id': [1,1,1,1,1, 2,2,2,2,2],
                    'time': [1,2,3,4,5, 1,2,3,4,5],
                    'value': ['a','a','a','b','b', 'b','b','c','c','c']})

# `sql` is the SparkSession (or SQLContext) handle
sparkDF = sql.createDataFrame(foo)
sparkDF.registerTempTable("INPUT")
sql.sql("""
SELECT
id,
time,
value,
ROW_NUMBER() OVER(PARTITION BY id ORDER BY time
) as window_map,
COLLECT_LIST(value) OVER(PARTITION BY id ORDER BY time
ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING
) as collected_list
FROM INPUT
""").show()
+---+----+-----+----------+--------------+
| id|time|value|window_map|collected_list|
+---+----+-----+----------+--------------+
| 1| 1| a| 1| [a, a, a]|
| 1| 2| a| 2| [a, a, b]|
| 1| 3| a| 3| [a, b, b]|
| 1| 4| b| 4| [b, b]|
| 1| 5| b| 5| [b]|
| 2| 1| b| 1| [b, b, c]|
| 2| 2| b| 2| [b, c, c]|
| 2| 3| c| 3| [c, c, c]|
| 2| 4| c| 4| [c, c]|
| 2| 5| c| 5| [c]|
+---+----+-----+----------+--------------+
SparkSQL - Explode - Frequency Calculation
immDF = sql.sql("""
SELECT
    id,
    time,
    exploded_value,
    COUNT(*) as value_count
FROM (
    SELECT
        id,
        time,
        value,
        window_map,
        EXPLODE(collected_list) as exploded_value
    FROM (
        SELECT
            id,
            time,
            value,
            ROW_NUMBER() OVER (PARTITION BY id ORDER BY time) as window_map,
            COLLECT_LIST(value) OVER (PARTITION BY id ORDER BY time
                                      ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) as collected_list
        FROM INPUT
    )
    WHERE window_map < 4  -- keep only the full 3-row windows (times 1, 2 and 3)
)
GROUP BY 1, 2, 3
ORDER BY id, time
""")
immDF.registerTempTable("IMM_RESULT")
immDF.show()
+---+----+--------------+-----------+
| id|time|exploded_value|value_count|
+---+----+--------------+-----------+
| 1| 1| a| 3|
| 1| 2| b| 1|
| 1| 2| a| 2|
| 1| 3| a| 1|
| 1| 3| b| 2|
| 2| 1| b| 2|
| 2| 1| c| 1|
| 2| 2| b| 1|
| 2| 2| c| 2|
| 2| 3| c| 3|
+---+----+--------------+-----------+
SparkSQL - Pivot
sql.sql("""
SELECT
id,
time,
ROUND(NVL(a,0),2) as perc_a,
ROUND(NVL(b,0),2) as perc_b,
ROUND(NVL(c,0),2) as perc_c
FROM IMM_RESULT
PIVOT (
MAX(value_count)/3 * 100.0
FOR exploded_value IN ('a'
,'b'
,'c'
)
)
""").show()
+---+----+------+------+------+
| id|time|perc_a|perc_b|perc_c|
+---+----+------+------+------+
| 1| 1| 100.0| 0.0| 0.0|
| 1| 2| 66.67| 33.33| 0.0|
| 1| 3| 33.33| 66.67| 0.0|
| 2| 1| 0.0| 66.67| 33.33|
| 2| 2| 0.0| 33.33| 66.67|
| 2| 3| 0.0| 0.0| 100.0|
+---+----+------+------+------+
I have two data frames, df1:
+---+---------+
| id| col_name|
+---+---------+
| 0| a |
| 1| b |
| 2| null|
| 3| null|
| 4| e |
| 5| f |
| 6| g |
| 7| h |
| 8| null|
| 9| j |
+---+---------+
and df2:
+---+---------+
| id| col_name|
+---+---------+
| 0| null|
| 1| null|
| 2| c|
| 3| d|
| 4| null|
| 5| null|
| 6| null|
| 7| null|
| 8| i|
| 9| null|
+---+---------+
and I want to merge them so I get
+---+---------+
| id| col_name|
+---+---------+
| 0| a|
| 1| b|
| 2| c|
| 3| d|
| 4| e|
| 5| f|
| 6| g|
| 7| h|
| 8| i|
| 9| j|
+---+---------+
I know for sure that they aren't overlapping (i.e. when the df2 entry is null the df1 entry isn't, and vice versa).
I know that if I use join I won't get them in the same column and will instead get two "col_name" columns. I just want it in one column. How do I do this? Thanks
Try this-
df1.alias("a").join(df2.alias("b"), "id").selectExpr("id", "coalesce(a.col_name, b.col_name) as col_name")
You could do this:
import numpy as np

# Note: this assumes pandas DataFrames (e.g. after toPandas()) with nulls stored as the string 'null'
mydf = df1.copy()                                    # make a copy of the first frame
idx = np.where(df1['col_name'].values == 'null')[0]  # get indices where df1 is null
val = df2['col_name'].values[idx]                    # get values from df2 at those indices
mydf['col_name'][idx] = val                          # assign those values in mydf
mydf                                                 # print mydf
You should be able to utilize the coalesce function to achieve this.
from pyspark.sql.functions import coalesce

df1 = df1.withColumnRenamed("col_name", "col_name_a")
df2 = df2.withColumnRenamed("col_name", "col_name_b")

joinedDF = df1.join(df2, "id")
joinedDF = joinedDF.withColumn(
    "col_name",
    coalesce(joinedDF["col_name_a"], joinedDF["col_name_b"])
).drop("col_name_a", "col_name_b")
This question already has answers here:
Retrieve top n in each group of a DataFrame in pyspark
(6 answers)
Closed 4 years ago.
I have a dataframe grouped by 'id' and 'type':
+---+----+-----+
| id|type|count|
+---+----+-----+
| 0| A| 2|
| 0| B| 3|
| 0| C| 1|
| 0| D| 3|
| 0| G| 1|
| 1| A| 0|
| 1| C| 1|
| 1| D| 1|
| 1| G| 2|
+---+----+-----+
I would now like to group by 'id' and get the sum of the 3 largest values:
+---+-----+
| id|count|
+---+-----+
| 0| 8|
| 1| 4|
+---+-----+
How can I do it in pyspark, so that the computation is relatively efficient?
Found solution here
You can use the following code to perform this:
from pyspark.sql.functions import *
from pyspark.sql.window import Window
df = spark.createDataFrame([
    (0, "A", 2),
    (0, "B", 3),
    (0, "C", 1),
    (0, "D", 3),
    (1, "A", 0),
    (1, "C", 1),
    (1, "D", 1),
    (1, "G", 2)
], ("id", "type", "count"))
my_window = Window.partitionBy("id").orderBy("count")
df.withColumn("last_3", lead("count").over(my_window)).groupBy("id").agg(sum("last_3")).show()
Output:
+---+-----------+
| id|sum(last_3)|
+---+-----------+
| 0| 8|
| 1| 4|
+---+-----------+
Details: the Window partitions your data by id and orders it by count. Then you create a new column where lead uses this window and returns the next value in that group (the group created by the window). So (0,C,1) is the lowest tuple in the group for id=0 and receives the value 2, since that is the next highest count in the group (from the tuple (0,A,2)), and so on. The highest tuple does not have a following value and is assigned null. Finally, perform the group operation and the sum.
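To see the intermediate column the explanation refers to, a quick sketch (reusing df and my_window from the snippet above) is:
df.withColumn("last_3", lead("count").over(my_window)).show()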
I have a dataframe in pyspark.
Say it has some columns a, b, c...
I want to group the data into groups as the value of a column changes. Say:
A B
1 x
1 y
0 x
0 y
0 x
1 y
1 x
1 y
There would be 3 groups: (1x,1y), (0x,0y,0x), (1y,1x,1y)
and the corresponding row data.
If I understand correctly you want to create a distinct group every time column A changes values.
First we'll create a monotonically increasing id to keep the row order as it is:
import pyspark.sql.functions as psf
df = sc.parallelize([[1,'x'],[1,'y'],[0,'x'],[0,'y'],[0,'x'],[1,'y'],[1,'x'],[1,'y']])\
    .toDF(['A', 'B'])\
    .withColumn("rn", psf.monotonically_increasing_id())
df.show()
+---+---+----------+
| A| B| rn|
+---+---+----------+
| 1| x| 0|
| 1| y| 1|
| 0| x| 2|
| 0| y| 3|
| 0| x|8589934592|
| 1| y|8589934593|
| 1| x|8589934594|
| 1| y|8589934595|
+---+---+----------+
Now we'll use a window function to create a column that contains 1 every time column A changes:
from pyspark.sql import Window
w = Window.orderBy('rn')
df = df.withColumn("changed", (df.A != psf.lag('A', 1, 0).over(w)).cast('int'))
+---+---+----------+-------+
| A| B| rn|changed|
+---+---+----------+-------+
| 1| x| 0| 1|
| 1| y| 1| 0|
| 0| x| 2| 1|
| 0| y| 3| 0|
| 0| x|8589934592| 0|
| 1| y|8589934593| 1|
| 1| x|8589934594| 0|
| 1| y|8589934595| 0|
+---+---+----------+-------+
Finally we'll use another window function to allocate different numbers to each group:
df = df.withColumn("group_id", psf.sum("changed").over(w)).drop("rn").drop("changed")
+---+---+--------+
| A| B|group_id|
+---+---+--------+
| 1| x| 1|
| 1| y| 1|
| 0| x| 2|
| 0| y| 2|
| 0| x| 2|
| 1| y| 3|
| 1| x| 3|
| 1| y| 3|
+---+---+--------+
Now you can build your groups from the group_id column.
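For example, a minimal sketch of collecting the row data per group (assuming you want the (A, B) pairs gathered into a list) could be:
df.groupBy("group_id")\
    .agg(psf.collect_list(psf.struct("A", "B")).alias("rows"))\
    .show(truncate=False)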