Suppose there is a pyspark dataframe of the form:
id col1 col2 col3 col4
------------------------
as1 4 10 4 6
as2 6 3 6 1
as3 6 0 2 1
as4 8 8 6 1
as5 9 6 6 9
Is there a way to search col2 through col4 of the pyspark dataframe for the values in col1 and to return the matching (id, column name) pairs?
For instance:
In col1, 4 is found in (as1, col3)
In col1, 6 is found in (as2,col3),(as1,col4),(as4, col3) (as5,col3)
In col1, 8 is found in (as4,col2)
In col1, 9 is found in (as5,col4)
Hint: Assume that col1 will be a set {4,6,8,9} i.e. unique
Yes, you can leverage the Spark SQL .isin operator.
Let's first create the DataFrame in your example
Part 1- Creating the DataFrame
from pyspark.sql.types import StructType, StructField, IntegerType

cSchema = StructType([StructField("id", IntegerType()),
                      StructField("col1", IntegerType()),
                      StructField("col2", IntegerType()),
                      StructField("col3", IntegerType()),
                      StructField("col4", IntegerType())])
test_data = [[1,4,10,4,6],[2,6,3,6,1],[3,6,0,2,1],[4,8,8,6,1],[5,9,6,6,9]]
df = spark.createDataFrame(test_data, schema=cSchema)
df.show()
+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
| 1| 4| 10| 4| 6|
| 2| 6| 3| 6| 1|
| 3| 6| 0| 2| 1|
| 4| 8| 8| 6| 1|
| 5| 9| 6| 6| 9|
+---+----+----+----+----+
Part 2 -Function To Search for Matching Values
isin: A boolean expression that is evaluated to true if the value of this expression is contained by the evaluated values of the arguments.
http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html
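def search(col1, col3):
    # collect the values of col1 to the driver
    col1_list = df.select(col1).rdd \
        .map(lambda x: x[0]).collect()
    # keep the rows whose col3 value appears among the col1 values
    search_results = df[df[col3].isin(col1_list)]
    return search_results

search("col1", "col3").show()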
+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
| 1| 4| 10| 4| 6|
| 2| 6| 3| 6| 1|
| 4| 8| 8| 6| 1|
| 5| 9| 6| 6| 9|
+---+----+----+----+----+
This should point you in the right direction. From the result you can select just the id column, or whatever you are attempting to return, and the function can easily be changed to search through more columns. Hope this helps!
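The search function above returns whole rows for one searched column. To see the (id, column name) pairs the question asks for, here is a plain-Python sketch of the intended result. It involves no Spark at all, and the `rows` list and the `find_matches` helper are names made up for this illustration.

```python
# Plain-Python model of the example data: (id, col1, col2, col3, col4).
rows = [
    ("as1", 4, 10, 4, 6),
    ("as2", 6, 3, 6, 1),
    ("as3", 6, 0, 2, 1),
    ("as4", 8, 8, 6, 1),
    ("as5", 9, 6, 6, 9),
]
search_cols = ["col2", "col3", "col4"]


def find_matches(rows, search_cols):
    """Map each distinct col1 value to the (id, column name) pairs where it occurs."""
    col1_values = {row[1] for row in rows}
    matches = {value: [] for value in sorted(col1_values)}
    for row in rows:
        row_id, cells = row[0], row[2:]
        for name, cell in zip(search_cols, cells):
            if cell in col1_values:
                matches[cell].append((row_id, name))
    return matches


matches = find_matches(rows, search_cols)
for value, pairs in matches.items():
    print(f"In col1, {value} is found in {pairs}")
```

On this data the sketch also reports (as5, col2) for the value 6, a match that the hand-written list in the question leaves out.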
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# define the schema as a list of StructFields
cSchema = StructType([StructField("id", StringType()),
StructField("col1", IntegerType()),
StructField("col2", IntegerType()),
StructField("col3", IntegerType()),
StructField("col4", IntegerType())])
test_data = [['as1', 4, 10, 4, 6],
['as2', 6, 3, 6, 1],
['as3', 6, 0, 2, 1],
['as4', 8, 8, 6, 1],
['as5', 9, 6, 6, 9]]
# create pyspark dataframe
df = spark.createDataFrame(test_data, schema=cSchema)
df.show()
# obtain the distinct items for col 1
distinct_list = [i.col1 for i in df.select("col1").distinct().collect()]
# rest columns
col_list = ['id', 'col2', 'col3', 'col4']
# implement the search of values in rest columns found in col 1
def search(distinct_list):
    for i in distinct_list:
        print(str(i) + ' found in: ')
        # for col in df.columns:
        for col in col_list:
            # rows where the current column equals the current col1 value
            df_search = df.select(*col_list) \
                .filter(df[str(col)] == str(i))
            # only print the result when at least one row matched
            if len(df_search.head(1)) > 0:
                df_search.show()

search(distinct_list)
Output:
+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
|as1| 4| 10| 4| 6|
|as2| 6| 3| 6| 1|
|as3| 6| 0| 2| 1|
|as4| 8| 8| 6| 1|
|as5| 9| 6| 6| 9|
+---+----+----+----+----+
6 found in:
+---+----+----+----+
| id|col2|col3|col4|
+---+----+----+----+
|as5| 6| 6| 9|
+---+----+----+----+
+---+----+----+----+
| id|col2|col3|col4|
+---+----+----+----+
|as2| 3| 6| 1|
|as4| 8| 6| 1|
|as5| 6| 6| 9|
+---+----+----+----+
+---+----+----+----+
| id|col2|col3|col4|
+---+----+----+----+
|as1| 10| 4| 6|
+---+----+----+----+
9 found in:
+---+----+----+----+
| id|col2|col3|col4|
+---+----+----+----+
|as5| 6| 6| 9|
+---+----+----+----+
4 found in:
+---+----+----+----+
| id|col2|col3|col4|
+---+----+----+----+
|as1| 10| 4| 6|
+---+----+----+----+
8 found in:
+---+----+----+----+
| id|col2|col3|col4|
+---+----+----+----+
|as4| 8| 6| 1|
+---+----+----+----+
I have the following pyspark dataframe:
import pandas as pd
foo = pd.DataFrame({'id': [1,1,1,1,1, 2,2,2,2,2],
'time': [1,2,3,4,5, 1,2,3,4,5],
'value': ['a','a','a','b','b', 'b','b','c','c','c']})
foo_df = spark.createDataFrame(foo)
foo_df.show()
+---+----+-----+
| id|time|value|
+---+----+-----+
| 1| 1| a|
| 1| 2| a|
| 1| 3| a|
| 1| 4| b|
| 1| 5| b|
| 2| 1| b|
| 2| 2| b|
| 2| 3| c|
| 2| 4| c|
| 2| 5| c|
+---+----+-----+
I would like, for a rolling time window of 3, to calculate the percentage of appearances of all the values, in the value column. The operation should be done by id.
The output dataframe would look something like this:
+---+------------------+------------------+------------------+
| id| perc_a| perc_b| perc_c|
+---+------------------+------------------+------------------+
| 1| 1.0| 0.0| 0.0|
| 1|0.6666666666666666|0.3333333333333333| 0.0|
| 1|0.3333333333333333|0.6666666666666666| 0.0|
| 2| 0.0|0.6666666666666666|0.3333333333333333|
| 2| 0.0|0.3333333333333333|0.6666666666666666|
| 2| 0.0| 0.0| 1.0|
+---+------------------+------------------+------------------+
Explanation of result:
For id=1 and the first window (time=[1,2,3]), the value column contains only a's, so perc_a is 100% (1.0 in the table above) and the rest are 0.
For id=1 and the second window (time=[2,3,4]), the value column contains two a's and one b, so perc_a is 66.6%, perc_b is 33.3% and perc_c is 0.
And so on for the remaining windows.
How could I achieve that in pyspark ?
EDIT
I am using pyspark 2.4
You can use count with a window function.
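from pyspark.sql import functions as F
from pyspark.sql.window import Window
# each window covers the current row and the next 2 rows of the same id
w = Window.partitionBy('id').orderBy('time').rowsBetween(Window.currentRow, 2)
df = (foo_df.select('id', F.col('time').alias('window'),
                    *[(F.count(F.when(F.col('value') == x, 'value')).over(w)
                       /
                       F.count('value').over(w) * 100).alias(f'perc_{x}')
                      for x in ['a', 'b', 'c']])
      # keep only the windows that hold a full 3 rows
      .filter(F.col('window') < 4))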
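To check the numbers this window should produce, here is a small plain-Python model of the same computation. It is not Spark code, and `window_percentages` is a name invented for the illustration. For each row it looks at the current row and the next two rows of the same id, and it keeps only the windows that are full.

```python
from collections import defaultdict

# Same data as foo: (id, time, value)
records = [(1, t, v) for t, v in enumerate("aaabb", start=1)] + \
          [(2, t, v) for t, v in enumerate("bbccc", start=1)]


def window_percentages(records, size=3, labels=("a", "b", "c")):
    """Percentage of each label in every full forward-looking window, per id."""
    by_id = defaultdict(list)
    for rec_id, time, value in sorted(records):
        by_id[rec_id].append((time, value))
    result = {}
    for rec_id, series in by_id.items():
        # stop early so that every window holds exactly `size` rows
        for start in range(len(series) - size + 1):
            window = [value for _, value in series[start:start + size]]
            time = series[start][0]
            result[(rec_id, time)] = {
                label: window.count(label) / size * 100 for label in labels
            }
    return result


percentages = window_percentages(records)
for key in sorted(percentages):
    print(key, percentages[key])
```

The sketch derives the cut-off from the series length instead of hard-coding time < 4, which is one possible way to generalize the filter when ids have different numbers of rows.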
Clever answer by @Emma. Expanding on it with a SparkSQL implementation.
The approach is to collect the values over the intended sliding row range (ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING), keep only the first three windows of each id, explode the collected list to count how often each value occurs, and finally pivot the counts into the intended format.
SparkSQL - Collect List
foo = pd.DataFrame({'id': [1,1,1,1,1, 2,2,2,2,2],
'time': [1,2,3,4,5, 1,2,3,4,5],
'value': ['a','a','a','b','b', 'b','b','c','c','c']})
sparkDF = sql.createDataFrame(foo)
sparkDF.registerTempTable("INPUT")
sql.sql("""
SELECT
id,
time,
value,
ROW_NUMBER() OVER(PARTITION BY id ORDER BY time
) as window_map,
COLLECT_LIST(value) OVER(PARTITION BY id ORDER BY time
ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING
) as collected_list
FROM INPUT
""").show()
+---+----+-----+----------+--------------+
| id|time|value|window_map|collected_list|
+---+----+-----+----------+--------------+
| 1| 1| a| 1| [a, a, a]|
| 1| 2| a| 2| [a, a, b]|
| 1| 3| a| 3| [a, b, b]|
| 1| 4| b| 4| [b, b]|
| 1| 5| b| 5| [b]|
| 2| 1| b| 1| [b, b, c]|
| 2| 2| b| 2| [b, c, c]|
| 2| 3| c| 3| [c, c, c]|
| 2| 4| c| 4| [c, c]|
| 2| 5| c| 5| [c]|
+---+----+-----+----------+--------------+
SparkSQL - Explode - Frequency Calculation
immDF = sql.sql(
"""
SELECT
id,
time,
exploded_value,
COUNT(*) as value_count
FROM (
SELECT
id,
time,
value,
window_map,
EXPLODE(collected_list) as exploded_value
FROM (
SELECT
id,
time,
value,
ROW_NUMBER() OVER(PARTITION BY id ORDER BY time
) as window_map,
COLLECT_LIST(value) OVER(PARTITION BY id ORDER BY time
ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING
) as collected_list
FROM INPUT
)
WHERE window_map < 4 -- keep only the first 3 windows per id, which hold a full 3 rows
)
GROUP BY 1,2,3
ORDER BY id,time
"""
)
immDF.registerTempTable("IMM_RESULT")
immDF.show()
+---+----+--------------+-----------+
| id|time|exploded_value|value_count|
+---+----+--------------+-----------+
| 1| 1| a| 3|
| 1| 2| b| 1|
| 1| 2| a| 2|
| 1| 3| a| 1|
| 1| 3| b| 2|
| 2| 1| b| 2|
| 2| 1| c| 1|
| 2| 2| b| 1|
| 2| 2| c| 2|
| 2| 3| c| 3|
+---+----+--------------+-----------+
SparkSQL - Pivot
sql.sql("""
SELECT
id,
time,
ROUND(NVL(a,0),2) as perc_a,
ROUND(NVL(b,0),2) as perc_b,
ROUND(NVL(c,0),2) as perc_c
FROM IMM_RESULT
PIVOT (
MAX(value_count)/3 * 100.0
FOR exploded_value IN ('a'
,'b'
,'c'
)
)
""").show()
+---+----+------+------+------+
| id|time|perc_a|perc_b|perc_c|
+---+----+------+------+------+
| 1| 1| 100.0| 0.0| 0.0|
| 1| 2| 66.67| 33.33| 0.0|
| 1| 3| 33.33| 66.67| 0.0|
| 2| 1| 0.0| 66.67| 33.33|
| 2| 2| 0.0| 33.33| 66.67|
| 2| 3| 0.0| 0.0| 100.0|
+---+----+------+------+------+
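The three SQL stages (collect, explode and count, pivot) can be mirrored in plain Python to see what each one contributes. This is only a sketch of the logic for a single id, not Spark code, and the variable names are invented for the illustration.

```python
from collections import Counter

# values of one id, already ordered by time (id = 1 in the example)
values = ["a", "a", "a", "b", "b"]
size = 3

# Stage 1: collect the current row and the 2 following rows (COLLECT_LIST ... 2 FOLLOWING)
collected = [values[i:i + size] for i in range(len(values))]

# Stage 2: keep the first 3 windows (window_map < 4), then count each exploded value
counts = [Counter(window) for window in collected[:3]]

# Stage 3: pivot the counts into one column per value; missing values become 0 (NVL)
pivoted = [
    {f"perc_{label}": round(count[label] / size * 100.0, 2) for label in "abc"}
    for count in counts
]

for time, row in enumerate(pivoted, start=1):
    print(time, row)
```

The truncated windows at time 4 and 5 (`['b', 'b']` and `['b']` in `collected`) show why the filter matters: dividing their counts by a fixed 3 would understate the percentages.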
So I have a question regarding pyspark. I have a dataframe that looks like this:
+---+------------+
| id| list|
+---+------------+
| 2|[3, 5, 4, 2]|
+---+------------+
| 3|[4, 5, 3, 2]|
+---+------------+
I would like to explode the lists into multiple rows while keeping, in a separate column, the position each element had in its list. The result should look like this:
+---+------------+------------+
| id| listitem| rank|
+---+------------+------------+
| 2| 3| 1|
+---+------------+------------+
| 2| 5| 2|
+---+------------+------------+
| 2| 4| 3|
+---+------------+------------+
| 2| 2| 4|
+---+------------+------------+
| 3| 4| 1|
+---+------------+------------+
| 3| 5| 2|
+---+------------+------------+
| 3| 3| 3|
+---+------------+------------+
| 3| 2| 4|
+---+------------+------------+
The rank column holds the position (index + 1) each element had in the list. Any suggestions for the most efficient code to achieve this?
You can use the posexplode() or posexplode_outer() function to get the desired result.
from pyspark.sql.functions import posexplode_outer, col
df = spark.createDataFrame([(2, [3, 5, 4, 2]), (3, [4, 5, 3, 2])], ["id", "list"])
# posexplode_outer yields a 0-based position, so add 1 to get the rank
df.select('id', posexplode_outer('list').alias('rank', 'listitem')) \
    .withColumn('rank', col('rank') + 1).show()
+---+----+--------+
| id|rank|listitem|
+---+----+--------+
| 2| 1| 3|
| 2| 2| 5|
| 2| 3| 4|
| 2| 4| 2|
| 3| 1| 4|
| 3| 2| 5|
| 3| 3| 3|
| 3| 4| 2|
+---+----+--------+
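For intuition, the same transformation on a plain Python list is a nested loop with enumerate. This is a sketch of the logic only, not Spark code; `data` and `exploded` are illustrative names.

```python
data = [(2, [3, 5, 4, 2]), (3, [4, 5, 3, 2])]

# enumerate(..., start=1) plays the role of posexplode's position + 1
exploded = [
    (row_id, rank, item)
    for row_id, items in data
    for rank, item in enumerate(items, start=1)
]
for row in exploded:
    print(row)
```

One difference the sketch does not model: posexplode drops rows whose array is null or empty, while posexplode_outer keeps them with null position and value.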
I have the following dataframe
dataframe - columnA, columnB, columnC, columnD, columnE
I want to groupBy columnC and then consider max value of columnE
dataframe.select('*').groupBy('columnC').max('columnE')
expected output
dataframe - columnA, columnB, columnC, columnD, columnE
Real output
dataframe - columnC, columnE
Why are not all columns of the dataframe displayed as expected?
For Spark version >= 3.0.0 you can use max_by to select the additional columns.
import random
from pyspark.sql import functions as F
#create some testdata
df = spark.createDataFrame(
[[random.randint(1,3)] + random.sample(range(0, 30), 4) for _ in range(10)],
schema=["columnC", "columnB", "columnA", "columnD", "columnE"]) \
.select("columnA", "columnB", "columnC", "columnD", "columnE")
df.groupBy("columnC") \
.agg(F.max("columnE"),
F.expr("max_by(columnA, columnE) as columnA"),
F.expr("max_by(columnB, columnE) as columnB"),
F.expr("max_by(columnD, columnE) as columnD")) \
.show()
For the testdata
+-------+-------+-------+-------+-------+
|columnA|columnB|columnC|columnD|columnE|
+-------+-------+-------+-------+-------+
| 25| 20| 2| 0| 2|
| 14| 2| 2| 24| 6|
| 26| 13| 3| 2| 1|
| 5| 24| 3| 19| 17|
| 22| 5| 3| 14| 21|
| 24| 5| 1| 8| 4|
| 7| 22| 3| 16| 20|
| 6| 17| 1| 5| 7|
| 24| 22| 2| 8| 3|
| 4| 14| 1| 16| 11|
+-------+-------+-------+-------+-------+
the result is
+-------+------------+-------+-------+-------+
|columnC|max(columnE)|columnA|columnB|columnD|
+-------+------------+-------+-------+-------+
| 1| 11| 4| 14| 16|
| 3| 21| 22| 5| 14|
| 2| 6| 14| 2| 24|
+-------+------------+-------+-------+-------+
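What max_by does here can be modelled in plain Python: for each columnC group, keep the row that carries the largest columnE and read the other columns from that row. The sketch below reuses the test data shown above and is not Spark code. It keeps the first row it sees when columnE ties, which is a choice of this sketch; how Spark breaks such ties is not covered here.

```python
# (columnA, columnB, columnC, columnD, columnE), copied from the test data above
rows = [
    (25, 20, 2, 0, 2),
    (14, 2, 2, 24, 6),
    (26, 13, 3, 2, 1),
    (5, 24, 3, 19, 17),
    (22, 5, 3, 14, 21),
    (24, 5, 1, 8, 4),
    (7, 22, 3, 16, 20),
    (6, 17, 1, 5, 7),
    (24, 22, 2, 8, 3),
    (4, 14, 1, 16, 11),
]

best = {}
for a, b, c, d, e in rows:
    # keep the row with the largest columnE seen so far for this columnC
    if c not in best or e > best[c][4]:
        best[c] = (a, b, c, d, e)

for c in sorted(best):
    a, b, _, d, e = best[c]
    print(c, e, a, b, d)
```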
What you want can be achieved with a window function rather than groupBy:
Partition your data by columnC.
Order the rows within each partition by columnE in descending order and rank them.
Filter for rank 1 to get your desired result.
from pyspark.sql.window import Window
from pyspark.sql.functions import rank
from pyspark.sql.functions import col
windowSpec = Window.partitionBy("columnC").orderBy(col("columnE").desc())
expectedDf = df.withColumn("rank", rank().over(windowSpec)) \
.filter(col("rank") == 1)
You might want to restructure your question.
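One thing to keep in mind with rank(): rows that tie on the maximum columnE all receive rank 1, so a group can return more than one row. The plain-Python sketch below models that behaviour on made-up data (the rows and names are hypothetical, not taken from the question).

```python
rows = [
    {"columnC": 1, "columnA": "x", "columnE": 7},
    {"columnC": 1, "columnA": "y", "columnE": 7},  # ties with the row above
    {"columnC": 1, "columnA": "z", "columnE": 3},
    {"columnC": 2, "columnA": "w", "columnE": 5},
]

# highest columnE per columnC partition
top = {}
for row in rows:
    key = row["columnC"]
    top[key] = max(top.get(key, row["columnE"]), row["columnE"])

# rank() == 1 holds for every row that carries the partition maximum, ties included
rank_one = [row for row in rows if row["columnE"] == top[row["columnC"]]]
for row in rank_one:
    print(row)
```

If exactly one row per group is needed, using row_number() in place of rank() is the usual choice.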
I need to add a large number of columns (4000) to a data frame in pyspark. I am using the withColumn function, but I am getting an assertion error.
df3 = df2.withColumn("['ftr' + str(i) for i in range(0, 4000)]", [expr('ftr[' + str(x) + ']') for x in range(0, 4000)])
Not sure what is wrong.
We can use .select() instead of .withColumn() and pass it a list of columns, which gives the same result as chaining multiple .withColumn() calls. The ["*"] entry also selects every existing column of the dataframe.
import pyspark.sql.functions as F
df2:
+---+
|age|
+---+
| 10|
| 11|
| 13|
+---+
df3 = df2.select(["*"] + [F.lit(f"{x}").alias(f"ftr{x}") for x in range(0,10)])
Results in:
+---+----+----+----+----+----+----+----+----+----+----+
|age|ftr0|ftr1|ftr2|ftr3|ftr4|ftr5|ftr6|ftr7|ftr8|ftr9|
+---+----+----+----+----+----+----+----+----+----+----+
| 10| 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|
| 11| 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|
| 13| 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|
+---+----+----+----+----+----+----+----+----+----+----+
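The expressions in the question read elements out of a column named ftr rather than adding literals. Assuming ftr is an array (or map) column on df2, which the question implies but does not show, the 4000 expressions can be built as SQL strings first and passed to a single selectExpr call. The string-building part below runs without Spark; the selectExpr call is left as a comment because it needs the asker's dataframe.

```python
n_features = 4000

# One SQL expression per new column: element i of the column `ftr`, named ftr<i>
exprs = [f"ftr[{i}] AS ftr{i}" for i in range(n_features)]

print(exprs[:3])

# With a real dataframe this would be used as follows
# (assumption: df2 has an array or map column called `ftr`):
# df3 = df2.selectExpr("*", *exprs)
```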
Try something like this:
from pyspark.sql.functions import lit
df3 = df2
for i in range(0, 4000):
    df3 = df3.withColumn(f"ftr{i}", lit(f"ftr{i}"))
I have a PySpark data table that looks like the following
shouldMerge | number
true | 1
true | 1
true | 2
false | 3
false | 1
I want to combine all of the rows with shouldMerge as true and add up the numbers.
so the final output would look like
shouldMerge | number
true | 4
false | 3
false | 1
How can I select all the ones with shouldMerge == true, add up the numbers, and generate a new row in PySpark?
Edit: Alternate, slightly more complicated scenario closer to what I'm trying to solve, where we only aggregate positive numbers:
mergeId | number
1 | 1
2 | 1
1 | 2
-1 | 3
-1 | 1
with the expected output:
mergeId | number
1 | 3
2 | 1
-1 | 3
-1 | 1
IIUC, you want to do a groupBy but only on the positive mergeIds.
One way is to filter your DataFrame for the positive ids, group, aggregate, and union this back with the negative ids (similar to @shanmuga's answer).
Another way is to use when to dynamically create a grouping key. If the mergeId is positive, use the mergeId to group. Otherwise, use a monotonically_increasing_id to ensure that the row does not get aggregated.
Here is an example:
import pyspark.sql.functions as f
df.withColumn("uid", f.monotonically_increasing_id())\
.groupBy(
f.when(
f.col("mergeId") > 0,
f.col("mergeId")
).otherwise(f.col("uid")).alias("mergeKey"),
f.col("mergeId")
)\
.agg(f.sum("number").alias("number"))\
.drop("mergeKey")\
.show()
#+-------+------+
#|mergeId|number|
#+-------+------+
#| -1| 1.0|
#| 1| 3.0|
#| 2| 1.0|
#| -1| 3.0|
#+-------+------+
This can easily be generalized by changing the when condition (in this case it's f.col("mergeId") > 0) to match your specific requirements.
Explanation:
First we create a temporary column uid which is a unique ID for each row. Next, we call groupBy and if the mergeId is positive use the mergeId to group. Otherwise we use the uid as the mergeKey. I also passed in the mergeId as a second group by column as a way to keep that column for the output.
To demonstrate what is going on, take a look at the intermediate result:
df.withColumn("uid", f.monotonically_increasing_id())\
.withColumn(
"mergeKey",
f.when(
f.col("mergeId") > 0,
f.col("mergeId")
).otherwise(f.col("uid")).alias("mergeKey")
)\
.show()
#+-------+------+-----------+-----------+
#|mergeId|number| uid| mergeKey|
#+-------+------+-----------+-----------+
#| 1| 1| 0| 1|
#| 2| 1| 8589934592| 2|
#| 1| 2|17179869184| 1|
#| -1| 3|25769803776|25769803776|
#| -1| 1|25769803777|25769803777|
#+-------+------+-----------+-----------+
As you can see, the mergeKey remains the unique value for the negative mergeIds.
From this intermediate step, the desired result is just a trivial group by and sum, followed by dropping the mergeKey column.
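The idea of a conditional grouping key can be tried out in plain Python before running it on a cluster. This sketch is not Spark code; tagging the key with "merge" or "row" is a choice made here to rule out collisions between ids and row numbers, whereas the Spark version gets the same effect by also grouping on mergeId.

```python
rows = [(1, 1), (2, 1), (1, 2), (-1, 3), (-1, 1)]  # (mergeId, number)

totals = {}
for uid, (merge_id, number) in enumerate(rows):
    # positive ids share a key and get summed; other rows get a key of their own
    key = ("merge", merge_id) if merge_id > 0 else ("row", uid)
    prev_id, prev_sum = totals.get(key, (merge_id, 0))
    totals[key] = (prev_id, prev_sum + number)

result = list(totals.values())
print(result)  # [(1, 3), (2, 1), (-1, 3), (-1, 1)]
```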
You will have to filter out only the rows where shouldMerge is true and aggregate them, then union the result with all the remaining rows.
import pyspark.sql.functions as functions
df = sqlContext.createDataFrame([
(True, 1),
(True, 1),
(True, 2),
(False, 3),
(False, 1),
], ("shouldMerge", "number"))
false_df = df.filter("shouldMerge = false")
true_df = df.filter("shouldMerge = true")
result = true_df.groupBy("shouldMerge")\
.agg(functions.sum("number").alias("number"))\
.unionAll(false_df)
df = sqlContext.createDataFrame([
(1, 1),
(2, 1),
(1, 2),
(-1, 3),
(-1, 1),
], ("mergeId", "number"))
# rows that satisfy merge_condition are grouped and summed; the rest pass through unchanged
merge_condition = df["mergeId"] > -1
remaining = ~merge_condition
groupby_field = "mergeId"
false_df = df.filter(remaining)
true_df = df.filter(merge_condition)
result = true_df.groupBy(groupby_field)\
    .agg(functions.sum("number").alias("number"))\
    .unionAll(false_df)
result.show()
For the first problem posted by the OP:
# Create the DataFrame
valuesCol = [(True,1),(True,1),(True,2),(False,3),(False,1)]
df = sqlContext.createDataFrame(valuesCol,['shouldMerge','number'])
df.show()
+-----------+------+
|shouldMerge|number|
+-----------+------+
| true| 1|
| true| 1|
| true| 2|
| false| 3|
| false| 1|
+-----------+------+
# Packages to be imported
from pyspark.sql.window import Window
from pyspark.sql.functions import when, col, lag
# Register the dataframe as a view
df.registerTempTable('table_view')
df=sqlContext.sql(
'select shouldMerge, number, sum(number) over (partition by shouldMerge) as sum_number from table_view'
)
df = df.withColumn('number',when(col('shouldMerge')==True,col('sum_number')).otherwise(col('number')))
df.show()
+-----------+------+----------+
|shouldMerge|number|sum_number|
+-----------+------+----------+
| true| 4| 4|
| true| 4| 4|
| true| 4| 4|
| false| 3| 4|
| false| 1| 4|
+-----------+------+----------+
df = df.drop('sum_number')
my_window = Window.partitionBy().orderBy('shouldMerge')
df = df.withColumn('shouldMerge_lag', lag(col('shouldMerge'),1).over(my_window))
df.show()
+-----------+------+---------------+
|shouldMerge|number|shouldMerge_lag|
+-----------+------+---------------+
| false| 3| null|
| false| 1| false|
| true| 4| false|
| true| 4| true|
| true| 4| true|
+-----------+------+---------------+
df = df.where(~((col('shouldMerge')==True) & (col('shouldMerge_lag')==True))).drop('shouldMerge_lag')
df.show()
+-----------+------+
|shouldMerge|number|
+-----------+------+
| false| 3|
| false| 1|
| true| 4|
+-----------+------+
For the second problem posted by the OP
# Create the DataFrame
valuesCol = [(1,2),(1,1),(2,1),(1,2),(-1,3),(-1,1)]
df = sqlContext.createDataFrame(valuesCol,['mergeId','number'])
df.show()
+-------+------+
|mergeId|number|
+-------+------+
| 1| 2|
| 1| 1|
| 2| 1|
| 1| 2|
| -1| 3|
| -1| 1|
+-------+------+
# Packages to be imported
from pyspark.sql.window import Window
from pyspark.sql.functions import when, col, lag
# Register the dataframe as a view
df.registerTempTable('table_view')
df=sqlContext.sql(
'select mergeId, number, sum(number) over (partition by mergeId) as sum_number from table_view'
)
df = df.withColumn('number',when(col('mergeId') > 0,col('sum_number')).otherwise(col('number')))
df.show()
+-------+------+----------+
|mergeId|number|sum_number|
+-------+------+----------+
| 1| 5| 5|
| 1| 5| 5|
| 1| 5| 5|
| 2| 1| 1|
| -1| 3| 4|
| -1| 1| 4|
+-------+------+----------+
df = df.drop('sum_number')
my_window = Window.partitionBy('mergeId').orderBy('mergeId')
df = df.withColumn('mergeId_lag', lag(col('mergeId'),1).over(my_window))
df.show()
+-------+------+-----------+
|mergeId|number|mergeId_lag|
+-------+------+-----------+
| 1| 5| null|
| 1| 5| 1|
| 1| 5| 1|
| 2| 1| null|
| -1| 3| null|
| -1| 1| -1|
+-------+------+-----------+
df = df.where(~((col('mergeId') > 0) & (col('mergeId_lag').isNotNull()))).drop('mergeId_lag')
df.show()
+-------+------+
|mergeId|number|
+-------+------+
| 1| 5|
| 2| 1|
| -1| 3|
| -1| 1|
+-------+------+
Documentation: lag() - Returns the value that is offset rows before the current row.
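To make the lag-based de-duplication concrete, here is a plain-Python model of the last step for the first problem. It is a sketch rather than Spark code: the list is assumed to be already ordered by shouldMerge as in my_window, and None stands in for the null that lag returns on the first row.

```python
# rows after the windowed sum, ordered by shouldMerge as in my_window
rows = [(False, 3), (False, 1), (True, 4), (True, 4), (True, 4)]

flags = [should_merge for should_merge, _ in rows]
# lag(..., 1): the previous row's value, None (null) for the first row
lagged = [None] + flags[:-1]

# drop a row when it and the row before it are both True, so one merged row survives
deduped = [
    row for row, previous in zip(rows, lagged)
    if not (row[0] is True and previous is True)
]
print(deduped)  # [(False, 3), (False, 1), (True, 4)]
```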