Pyspark create combinations from list - python

Say I have a DataFrame:
df = spark.createDataFrame([['some_string', 'A'],['another_string', 'B']],['a','b'])
a                          | b
---------------------------+------------
some_string                | A
another_string             | B
And I have a list of ints like [1,2,3].
What I want is to add the list as a column to my DataFrame, so that I get every combination:
a                          | b         | c
---------------------------+-----------+------------
some_string                | A         | 1
some_string                | A         | 2
some_string                | A         | 3
another_string             | B         | 1
another_string             | B         | 2
another_string             | B         | 3
Is there any way to do it without a UDF?

Use crossJoin. Please check the code below.
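Here dfa is the original DataFrame and dfb is the list of ints turned into a one-column DataFrame; a minimal sketch of how dfb could be built (the column name id is only an assumption, chosen to match the output below):
>>> dfa = spark.createDataFrame([['some_string', 'A'], ['another_string', 'B']], ['a', 'b'])
>>> dfb = spark.createDataFrame([(x,) for x in [1, 2, 3]], ['id'])  # one row per list element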
>>> dfa.show()
+--------------+---+
| a| b|
+--------------+---+
| some_string| A|
|another_string| B|
+--------------+---+
>>> dfb.show()
+---+
| id|
+---+
| 1|
| 2|
| 3|
+---+
>>> dfa.crossJoin(dfb).show()
+--------------+---+---+
| a| b| id|
+--------------+---+---+
| some_string| A| 1|
| some_string| A| 2|
| some_string| A| 3|
|another_string| B| 1|
|another_string| B| 2|
|another_string| B| 3|
+--------------+---+---+

You could also just use explode, and avoid the unnecessary shuffle caused by a join.
from pyspark.sql import functions as F

ints = [1, 2, 3]
df.withColumn("c", F.explode(F.array(*[F.lit(x) for x in ints]))).show()
#+--------------+---+---+
#| a| b| c|
#+--------------+---+---+
#| some_string| A| 1|
#| some_string| A| 2|
#| some_string| A| 3|
#|another_string| B| 1|
#|another_string| B| 2|
#|another_string| B| 3|
#+--------------+---+---+

Related

Calculate percentages of occurrences by rolling window in pyspark

I have the following pyspark dataframe:
import pandas as pd
foo = pd.DataFrame({'id': [1,1,1,1,1, 2,2,2,2,2],
                    'time': [1,2,3,4,5, 1,2,3,4,5],
                    'value': ['a','a','a','b','b', 'b','b','c','c','c']})
foo_df = spark.createDataFrame(foo)
foo_df.show()
+---+----+-----+
| id|time|value|
+---+----+-----+
| 1| 1| a|
| 1| 2| a|
| 1| 3| a|
| 1| 4| b|
| 1| 5| b|
| 2| 1| b|
| 2| 2| b|
| 2| 3| c|
| 2| 4| c|
| 2| 5| c|
+---+----+-----+
For a rolling time window of 3, I would like to calculate the percentage of appearances of each of the values in the value column. The operation should be done per id.
The output dataframe would look something like this:
+---+------------------+------------------+------------------+
| id| perc_a| perc_b| perc_c|
+---+------------------+------------------+------------------+
| 1| 1.0| 0.0| 0.0|
| 1|0.6666666666666666|0.3333333333333333| 0.0|
| 1|0.3333333333333333|0.6666666666666666| 0.0|
| 2| 0.0|0.6666666666666666|0.3333333333333333|
| 2| 0.0|0.3333333333333333|0.6666666666666666|
| 2| 0.0| 0.0| 1.0|
+---+------------------+------------------+------------------+
Explanation of result:
For id=1 and the first window (time=[1,2,3]), the value column contains only a's, so perc_a equals 100 and the rest is 0.
For id=1 and the second window (time=[2,3,4]), the value column contains 2 a's and 1 b, so perc_a equals 66.6, perc_b is 33.3, and perc_c equals 0.
And so on.
How could I achieve that in pyspark ?
EDIT
I am using pyspark 2.4
You can use count with a window function.
from pyspark.sql import functions as F
from pyspark.sql import Window

w = Window.partitionBy('id').orderBy('time').rowsBetween(Window.currentRow, 2)
df = (foo_df.select('id', F.col('time').alias('window'),
                    *[(F.count(F.when(F.col('value') == x, 'value')).over(w)
                       / F.count('value').over(w) * 100).alias(f'perc_{x}')
                      for x in ['a', 'b', 'c']])
      .filter(F.col('time') < 4))
Clever answer by @Emma. Expanding on it with a SparkSQL implementation.
The approach is to collect the values over the intended sliding row range, i.e. ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING, filter on time < 4 so that only complete windows remain, then explode the collected list to count the individual frequencies, and finally pivot it into the intended format.
SparkSQL - Collect List
foo = pd.DataFrame({'id': [1,1,1,1,1, 2,2,2,2,2],
                    'time': [1,2,3,4,5, 1,2,3,4,5],
                    'value': ['a','a','a','b','b', 'b','b','c','c','c']})

sparkDF = sql.createDataFrame(foo)
sparkDF.registerTempTable("INPUT")

sql.sql("""
SELECT
    id,
    time,
    value,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY time) AS window_map,
    COLLECT_LIST(value) OVER (PARTITION BY id ORDER BY time
                              ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) AS collected_list
FROM INPUT
""").show()
+---+----+-----+----------+--------------+
| id|time|value|window_map|collected_list|
+---+----+-----+----------+--------------+
| 1| 1| a| 1| [a, a, a]|
| 1| 2| a| 2| [a, a, b]|
| 1| 3| a| 3| [a, b, b]|
| 1| 4| b| 4| [b, b]|
| 1| 5| b| 5| [b]|
| 2| 1| b| 1| [b, b, c]|
| 2| 2| b| 2| [b, c, c]|
| 2| 3| c| 3| [c, c, c]|
| 2| 4| c| 4| [c, c]|
| 2| 5| c| 5| [c]|
+---+----+-----+----------+--------------+
SparkSQL - Explode - Frequency Calculation
immDF = sql.sql("""
SELECT
    id,
    time,
    exploded_value,
    COUNT(*) AS value_count
FROM (
    SELECT
        id,
        time,
        value,
        window_map,
        EXPLODE(collected_list) AS exploded_value
    FROM (
        SELECT
            id,
            time,
            value,
            ROW_NUMBER() OVER (PARTITION BY id ORDER BY time) AS window_map,
            COLLECT_LIST(value) OVER (PARTITION BY id ORDER BY time
                                      ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) AS collected_list
        FROM INPUT
    )
    WHERE window_map < 4 -- keep only the complete 3-row windows
)
GROUP BY 1, 2, 3
ORDER BY id, time
""")
immDF.registerTempTable("IMM_RESULT")
immDF.show()
+---+----+--------------+-----------+
| id|time|exploded_value|value_count|
+---+----+--------------+-----------+
| 1| 1| a| 3|
| 1| 2| b| 1|
| 1| 2| a| 2|
| 1| 3| a| 1|
| 1| 3| b| 2|
| 2| 1| b| 2|
| 2| 1| c| 1|
| 2| 2| b| 1|
| 2| 2| c| 2|
| 2| 3| c| 3|
+---+----+--------------+-----------+
SparkSQL - Pivot
sql.sql("""
SELECT
    id,
    time,
    ROUND(NVL(a, 0), 2) AS perc_a,
    ROUND(NVL(b, 0), 2) AS perc_b,
    ROUND(NVL(c, 0), 2) AS perc_c
FROM IMM_RESULT
PIVOT (
    MAX(value_count) / 3 * 100.0
    FOR exploded_value IN ('a', 'b', 'c')
)
""").show()
+---+----+------+------+------+
| id|time|perc_a|perc_b|perc_c|
+---+----+------+------+------+
| 1| 1| 100.0| 0.0| 0.0|
| 1| 2| 66.67| 33.33| 0.0|
| 1| 3| 33.33| 66.67| 0.0|
| 2| 1| 0.0| 66.67| 33.33|
| 2| 2| 0.0| 33.33| 66.67|
| 2| 3| 0.0| 0.0| 100.0|
+---+----+------+------+------+

extract substring before first occurrence and substring after last occurrence of a delimiter in Pyspark

I have a data frame like below in pyspark
df = spark.createDataFrame(
    [('14_100_00', 'A', 25),
     ('13_100_00', 'B', 24),
     ('15_100_00', 'A', 20),
     ('150_100', 'C', 21),
     ('16', 'A', 20),
     ('1634_100_00_01', 'B', 22),
     ('1_100_00', 'C', 23),
     ('18_100_00', 'D', 24)], ("rust", "name", "value"))
df.show()
+--------------+----+-----+
| rust|name|value|
+--------------+----+-----+
| 14_100_00| A| 25|
| 13_100_00| B| 24|
| 15_100_00| A| 20|
| 150_100| C| 21|
| 16| A| 20|
|1634_100_00_01| B| 22|
| 1_100_00| C| 23|
| 18_100_00| D| 24|
+--------------+----+-----+
I am trying to create a new column from the rust column using the conditions below:
1) extract anything before the 1st underscore
2) extract anything after the last underscore
3) concatenate the above two values using a tilde (~)
If there are no underscores in the column, then keep the column as is.
I have tried the following:
from pyspark.sql import functions as f
df1 = df.select("*", f.concat(f.substring_index(df.rust, '_', 1), f.lit('~'), f.substring_index(df.rust, '_', -1)).alias("extract"))
df1.show()
+--------------+----+-----+-------+
| rust|name|value|extract|
+--------------+----+-----+-------+
| 14_100_00| A| 25| 14~00|
| 13_100_00| B| 24| 13~00|
| 15_100_00| A| 20| 15~00|
| 150_100| C| 21|150~100|
| 16| A| 20| 16~16|
|1634_100_00_01| B| 22|1634~01|
| 1_100_00| C| 23| 1~00|
| 18_100_00| D| 24| 18~00|
+--------------+----+-----+-------+
expected result:
+--------------+----+-----+-------+
| rust|name|value|extract|
+--------------+----+-----+-------+
| 14_100_00| A| 25| 14~00|
| 13_100_00| B| 24| 13~00|
| 15_100_00| A| 20| 15~00|
| 150_100| C| 21|150~100|
| 16| A| 20| 16|
|1634_100_00_01| B| 22|1634~01|
| 1_100_00| C| 23| 1~00|
| 18_100_00| D| 24| 18~00|
+--------------+----+-----+-------+
How can I achieve what I want?
Use the instr function to determine whether the rust column contains _, and then use when/otherwise to handle the two cases.
from pyspark.sql import functions as f

df1 = df.select("*",
                f.when(f.instr(df.rust, '_') > 0,
                       f.concat(f.substring_index(df.rust, '_', 1),
                                f.lit('~'),
                                f.substring_index(df.rust, '_', -1)))
                 .otherwise(df.rust)
                 .alias("extract"))

Merge 2 spark dataframes with non overlapping columns

I have two data frames, df1:
+---+---------+
| id| col_name|
+---+---------+
| 0| a |
| 1| b |
| 2| null|
| 3| null|
| 4| e |
| 5| f |
| 6| g |
| 7| h |
| 8| null|
| 9| j |
+---+---------+
and df2:
+---+---------+
| id| col_name|
+---+---------+
| 0| null|
| 1| null|
| 2| c|
| 3| d|
| 4| null|
| 5| null|
| 6| null|
| 7| null|
| 8| i|
| 9| null|
+---+---------+
and I want to merge them so I get
+---+---------+
| id| col_name|
+---+---------+
| 0| a|
| 1| b|
| 2| c|
| 3| d|
| 4| e|
| 5| f|
| 6| g|
| 7| h|
| 8| i|
| 9| j|
+---+---------+
I know for sure that they aren't overlapping (i.e. when the df2 entry is null the df1 entry isn't, and vice versa).
I know that if I use a join I won't get them in the same column and will instead get two "col_name" columns. I just want it in one column. How do I do this? Thanks
Try this-
df1.alias("a").join(df2.alias("b"), "id").selectExpr("id", "coalesce(a.col_name, b.col_name) as col_name")
You could also do this with pandas/NumPy, assuming df1 and df2 are pandas DataFrames (e.g. after calling toPandas() on the Spark DataFrames):
import numpy as np

mydf = df1.copy()                          # make a copy of the first frame
idx = np.where(df1['col_name'].isna())[0]  # get positions where df1 is null
val = df2['col_name'].values[idx]          # get values from df2 where df1 is null
mydf.loc[idx, 'col_name'] = val            # assign those values in mydf
mydf                                       # print mydf
You should be able to utilize the coalesce function to achieve this.
from pyspark.sql.functions import coalesce

renamedDF1 = df1.withColumnRenamed("col_name", "col_name_a")
renamedDF2 = df2.withColumnRenamed("col_name", "col_name_b")

joinedDF = renamedDF1.join(renamedDF2, "id")
joinedDF = joinedDF.withColumn(
    "col_name",
    coalesce(joinedDF["col_name_a"], joinedDF["col_name_b"])
)

Pyspark | Separate the string / int values from the dataframe

I have a Spark Dataframe as below:
+---------+
|col_str_1|
+---------+
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
| a|
| b|
| c|
| d|
| e|
| f|
| g|
| h|
| 1|
| 2|
| 3.0|
+---------+
I want to separate the string / int / float values based on the request.
For example:
If the request is for STRING, the returned DF must be like below
+---------+
|col_str_1|
+---------+
| a|
| b|
| c|
| d|
| e|
| f|
| g|
| h|
+---------+
If the request is for INTEGER, the returned DF must be like below
+---------+
|col_str_1|
+---------+
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
| 1|
| 2|
+---------+
I tried the steps below:
>> df = sqlContext.sql('select * from --db--.vt_prof_test')
>> columns = df.columns[0]
>> df.select(columns).????
How do I proceed further, using either filter or map? Can anyone help me out?
You can go for a udf:
import pyspark.sql.functions as F

df = sqlContext.sql('select * from --db--.vt_prof_test')

REQUEST = 'STRING'
request_bc = sc.broadcast(REQUEST)

def check_value(val):
    if request_bc.value == 'STRING':
        try:
            val = int(val)
            return None   # parses as an int, so it is not a string -> drop it
        except:
            return val
    if request_bc.value == 'INTEGER':
        try:
            val = int(val)
            return val
        except:
            return None

check_udf = F.udf(lambda x: check_value(x))

df = df.select(check_udf(F.col('col_str_1')).alias('col_str_1')).dropna()
Set the REQUEST parameter according to the need.
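For example (a follow-up sketch, not part of the original answer), switching the request to INTEGER follows the same pattern; the udf is rebuilt so the executors pick up the new broadcast value:
REQUEST = 'INTEGER'
request_bc = sc.broadcast(REQUEST)           # broadcast the new request value
check_udf = F.udf(lambda x: check_value(x))  # rebuild the udf so it captures the new broadcast
df_int = sqlContext.sql('select * from --db--.vt_prof_test') \
    .select(check_udf(F.col('col_str_1')).alias('col_str_1')) \
    .dropna()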

Partition pyspark dataframe based on the change in column value

I have a dataframe in pyspark.
Say it has some columns a, b, c...
I want to split the data into groups whenever the value of column A changes. Say
A B
1 x
1 y
0 x
0 y
0 x
1 y
1 x
1 y
There will be 3 groups: (1x,1y), (0x,0y,0x), (1y,1x,1y),
and I want the corresponding row data.
If I understand correctly, you want to create a distinct group every time column A changes value.
First we'll create a monotonically increasing id to keep the row order as it is:
import pyspark.sql.functions as psf
df = sc.parallelize([[1,'x'],[1,'y'],[0,'x'],[0,'y'],[0,'x'],[1,'y'],[1,'x'],[1,'y']])\
    .toDF(['A', 'B'])\
    .withColumn("rn", psf.monotonically_increasing_id())
df.show()
+---+---+----------+
| A| B| rn|
+---+---+----------+
| 1| x| 0|
| 1| y| 1|
| 0| x| 2|
| 0| y| 3|
| 0| x|8589934592|
| 1| y|8589934593|
| 1| x|8589934594|
| 1| y|8589934595|
+---+---+----------+
Now we'll use a window function to create a column that contains 1 every time column A changes:
from pyspark.sql import Window
w = Window.orderBy('rn')
df = df.withColumn("changed", (df.A != psf.lag('A', 1, 0).over(w)).cast('int'))
+---+---+----------+-------+
| A| B| rn|changed|
+---+---+----------+-------+
| 1| x| 0| 1|
| 1| y| 1| 0|
| 0| x| 2| 1|
| 0| y| 3| 0|
| 0| x|8589934592| 0|
| 1| y|8589934593| 1|
| 1| x|8589934594| 0|
| 1| y|8589934595| 0|
+---+---+----------+-------+
Finally we'll use another window function to allocate different numbers to each group:
df = df.withColumn("group_id", psf.sum("changed").over(w)).drop("rn").drop("changed")
+---+---+--------+
| A| B|group_id|
+---+---+--------+
| 1| x| 1|
| 1| y| 1|
| 0| x| 2|
| 0| y| 2|
| 0| x| 2|
| 1| y| 3|
| 1| x| 3|
| 1| y| 3|
+---+---+--------+
Now you can build your groups.
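For example (a sketch going one step further than the original answer), you could collect the rows of each group:
grouped = df.groupBy("group_id") \
    .agg(psf.collect_list(psf.struct("A", "B")).alias("rows"))
grouped.show(truncate=False)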
