I have a PySpark dataframe like this:
A | B
---------------------
1 | abc_value
2 | abc_value
3 | some_other_value
4 | anything_else
I have a mapping dictionary:
d = {
    "abc": "X",
    "some_other": "Y",
    "anything": "Z"
}
I need to create a new column in my original DataFrame, which should look like this:
A | B                | C
------------------------
1 | abc_value        | X
2 | abc_value        | X
3 | some_other_value | Y
4 | anything_else    | Z
I tried mapping like this:
mapping_expr = f.create_map([f.lit(x) for x in chain(*d.items())])
and then applied it with withColumn, but that does exact matching, whereas I need partial (regex) matching, as you can see.
How can I accomplish this, please?
I'm afraid PySpark has no built-in function that extracts substrings according to a defined dictionary; you probably need to resort to a workaround.
In this case, you can first create a search string which includes all your dictionary keys to be searched:
keys = list(d.keys())
keys_expr = '|'.join(keys)
keys_expr
# 'abc|some_other|anything'
Then you can use regexp_extract to extract the first key from keys_expr that we encounter in column B, if present (that's the reason for the | operator).
Finally, you can use dictionary d to replace the values in the new column.
import pyspark.sql.functions as F
df = df\
    .withColumn('C', F.regexp_extract('B', keys_expr, 0))\
    .replace(d, subset=['C'])
df.show()
+---+----------------+---+
| A| B| C|
+---+----------------+---+
| 1| abc_value| X|
| 2| abc_value| X|
| 3|some_other_value| Y|
| 4| anything_else| Z|
+---+----------------+---+
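One caveat worth adding (my own note, not part of the original answer): if a value in B matches none of the keys, regexp_extract returns an empty string, so C stays empty rather than null; and if the keys could contain regex metacharacters, it is safer to escape them when building keys_expr. A minimal sketch of both adjustments:

import re
import pyspark.sql.functions as F

# escape regex metacharacters in the dictionary keys
keys_expr = '|'.join(re.escape(k) for k in d.keys())

df = df\
    .withColumn('C', F.regexp_extract('B', keys_expr, 0))\
    .withColumn('C', F.when(F.col('C') == '', F.lit(None)).otherwise(F.col('C')))\
    .replace(d, subset=['C'])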
Related
I have 2 PySpark dataframes.
Both usually have the same number of columns.
The column names are different in the two dataframes.
The values in the first column of each dataframe may be the same or different.
I want to compare the dataframes column by column: the first column of both dfs, then the second column of both dfs, and so on.
For each comparison I want to create a new column showing match/no_match/null_in_1st/null_in_2nd and so on.
I want one final df that has all the column names and the compared result side by side for each column.
NOTE: I have a lot of columns, so I have to tackle this with a loop instead of typing each column in code.
e.g. dataframe1
Column A | Column B
-------------------
abc      | 2
def      | 4
xyz      | null
mno      | 5
e.g. dataframe2
Column C | Column D
-------------------
abc      | 2
def      | 3
xyz      | 4
mno      | null
Result dataframe
Column X | Column X_result
--------------------------
abc      | match
def      | no_match
xyz      | dataframe1_null
mno      | dataframe2_null
I have used a very simple approach based on CASE statements.
Data:
val df1 = Seq(("abc","2"),
              ("def","4"),
              ("xyz",null),
              ("mno","5")
             ).toDF("colA","colb")

val df2 = Seq(("abc","2"),
              ("def","3"),
              ("xyz","4"),
              ("mno",null)
             ).toDF("colC","colD")
The data looks like this:
+----+----+
|colA|colb|
+----+----+
| abc| 2|
| def| 4|
| xyz|null|
| mno| 5|
+----+----+
+----+----+
|colC|colD|
+----+----+
| abc| 2|
| def| 3|
| xyz| 4|
| mno|null|
+----+----+
Method 1: Join + Case Statements
val df_result = df1.join(df2, df1("colA") === df2("colC"), "left")
  .select(df1("colA"), df1("colB"), df2("colD"))
  .withColumn("Column X_result",
    when(df1("colB") === df2("colD"), "match")
      .when(df1("colB").isNull, "dataframe1_null")
      .when(df2("colD").isNull, "dataframe2_null")
      .otherwise("no_match")
  )
  .drop("colB", "colD")
  .withColumnRenamed("colA", "Column X")

display(df_result)
Method 2: Join + iterative method with map + if-else
val joined_df = df1.join(df2, df1("colA") === df2("colC"))
  .select(df1("colA"), df1("colB"), df2("colD"))

val df_results = joined_df.map(row => {
  val colx =
    if (row.getString(1) == null) "dataframe1_null"
    else if (row.getString(2) == null) "dataframe2_null"
    else if (row.getString(1) == row.getString(2)) "match"
    else "no_match"
  (row.getString(0), colx)
}).toDF("Column X", "Column X_result")

display(df_results)
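Since the question is about PySpark and explicitly asks for a loop over many column pairs, here is a rough PySpark sketch of the same join + CASE idea generalised (my own addition, assuming the first column of each dataframe is the join key and the remaining columns line up positionally):

from pyspark.sql import functions as F

joined = df1.join(df2, df1["colA"] == df2["colC"], "left")

# keep the key once, then build one result column per remaining column pair
cols = [df1["colA"].alias("Column X")]
for c1, c2 in zip(df1.columns[1:], df2.columns[1:]):
    cols.append(
        F.when(df1[c1] == df2[c2], "match")
         .when(df1[c1].isNull(), "dataframe1_null")
         .when(df2[c2].isNull(), "dataframe2_null")
         .otherwise("no_match")
         .alias(c1 + "_result")
    )

result = joined.select(*cols)
result.show()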
So, I have a PySpark dataframe of the type:
Group | Value
-------------
A     | 12
B     | 10
A     | 1
B     | 0
B     | 1
A     | 6
and I'd like to perform an operation that generates a DataFrame with the values standardised with respect to their group.
In short, I should have:
Group | Value
-------------
A     | 1.26012384
B     | 1.4083737
A     | -1.18599891
B     | -0.81537425
B     | -0.59299945
A     | -0.07412493
I think this should be done using a groupBy and then some agg operation, but honestly I'm not really sure how to do it.
You can calculate the mean and stddev in each group using Window functions:
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
    'Value',
    (F.col('Value') - F.mean('Value').over(Window.partitionBy('Group'))) /
    F.stddev_pop('Value').over(Window.partitionBy('Group'))
)
df2.show()
+-----+--------------------+
|Group| Value|
+-----+--------------------+
| B| 1.4083737016560922|
| B| -0.8153742483272112|
| B| -0.5929994533288808|
| A| 1.2601238383238722|
| A| -1.1859989066577619|
| A|-0.07412493166611006|
+-----+--------------------+
Note that the rows may come back in a different order, because Spark DataFrames have no intrinsic row order.
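If you prefer the groupBy + agg route mentioned in the question, an equivalent sketch (my own, same result as above, just a different execution plan) is to aggregate the per-group statistics and join them back:

from pyspark.sql import functions as F

# per-group mean and population standard deviation
stats = df.groupBy('Group').agg(
    F.mean('Value').alias('mean'),
    F.stddev_pop('Value').alias('std')
)

df2 = df.join(stats, on='Group')\
    .withColumn('Value', (F.col('Value') - F.col('mean')) / F.col('std'))\
    .drop('mean', 'std')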
My Data looks like this:
id | duration | action1 | action2 | ...
---------------------------------------------
1 | 10 | A | D
1 | 10 | B | E
2 | 25 | A | E
1 | 7 | A | G
I want to group it by ID (which works great!):
df.rdd.groupBy(lambda x: x['id']).mapValues(list).collect()
And now I would like to group values within each group by duration to get something like this:
[(id=1,
  [(duration=10, [(action1=A, action2=D), (action1=B, action2=E)]),
   (duration=7,  [(action1=A, action2=G)])]),
 (id=2,
  [(duration=25, [(action1=A, action2=E)])])]
And here is where I don't know how to do a nested group by. Any tips?
There is no need to serialize to rdd. Here's a generalized way to group by multiple columns and aggregate the rest of the columns into lists without hard-coding all of them:
from pyspark.sql.functions import collect_list
grouping_cols = ["id", "duration"]
other_cols = [c for c in df.columns if c not in grouping_cols]
df.groupBy(grouping_cols).agg(*[collect_list(c).alias(c) for c in other_cols]).show()
#+---+--------+-------+-------+
#| id|duration|action1|action2|
#+---+--------+-------+-------+
#| 1| 10| [A, B]| [D, E]|
#| 2| 25| [A]| [E]|
#| 1| 7| [A]| [G]|
#+---+--------+-------+-------+
Update
If you need to preserve the order of the actions, the best way is to use a pyspark.sql.Window with an orderBy(). This is because there seems to be some ambiguity as to whether or not a groupBy() following an orderBy() maintains that order.
Suppose your timestamps are stored in a column "ts". You should be able to do the following:
from pyspark.sql import Window
w = Window.partitionBy(grouping_cols).orderBy("ts")
grouped_df = df.select(
    *(grouping_cols + [collect_list(c).over(w).alias(c) for c in other_cols])
).distinct()
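If you really need the nested shape shown in the question (one row per id containing its per-duration groups), one possible extra step on top of the grouped result, reusing other_cols from above, is to collect structs per id. This is only a sketch, not part of the original answer:

from pyspark.sql.functions import collect_list, struct

# first group by (id, duration), then collect one struct per duration for each id
nested = df.groupBy("id", "duration")\
    .agg(*[collect_list(c).alias(c) for c in other_cols])\
    .groupBy("id")\
    .agg(collect_list(struct("duration", *other_cols)).alias("durations"))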
I have a PySpark DataFrame with an A field, a few B fields that depend on A (A -> B), and C fields that I want to aggregate per A. For example:
A | B | C
----------
A | 1 | 6
A | 1 | 7
B | 2 | 8
B | 2 | 4
I wish to group by A, present any of B, and run an aggregation (let's say SUM) on C.
The expected result would be:
A | B | C
----------
A | 1 | 13
B | 2 | 12
SQL-wise I would do:
SELECT A, COALESCE(B) as B, SUM(C) as C
FROM T
GROUP BY A
What is the PySpark way to do that?
I can group by A and B together, or select MIN(B) for each A, for example:
df.groupBy('A').agg(F.min('B').alias('B'),F.sum('C').alias('C'))
or
df.groupBy(['A','B']).agg(F.sum('C').alias('C'))
but that seems inefficient. Is there anything similar to SQL COALESCE in PySpark?
Thanks
You'll just need to use first instead:
from pyspark.sql.functions import first, sum, col
from pyspark.sql import Row
array = [Row(A="A", B=1, C=6),
         Row(A="A", B=1, C=7),
         Row(A="B", B=2, C=8),
         Row(A="B", B=2, C=4)]
df = sqlContext.createDataFrame(sc.parallelize(array))
results = df.groupBy(col("A")).agg(first(col("B")).alias("B"), sum(col("C")).alias("C"))
Let's now check the results:
results.show()
# +---+---+---+
# | A| B| C|
# +---+---+---+
# | B| 2| 12|
# | A| 1| 13|
# +---+---+---+
From the comments:
Is first here computationally equivalent to any?
groupBy causes a shuffle, so non-deterministic behaviour is to be expected.
This is confirmed in the documentation of first:
Aggregate function: returns the first value in a group.
The function by default returns the first values it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
Note: the function is non-deterministic because its results depend on the order of rows, which may be non-deterministic after a shuffle.
So yes, computationally they are the same, and that's one of the reasons you need to use sorting if you need deterministic behaviour.
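For reference, the same aggregation can also be written in Spark SQL with the FIRST aggregate, registering the dataframe under the view name T used in the question's query (a sketch, my own addition):

df.createOrReplaceTempView("T")
sqlContext.sql("SELECT A, FIRST(B) AS B, SUM(C) AS C FROM T GROUP BY A").show()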
I hope this helps!
Hi, I have a DataFrame as shown below:
ID X Y
1 1234 284
1 1396 179
2 8620 178
3 1620 191
3 8820 828
I want to split this DataFrame into multiple DataFrames based on ID. So for this example there would be 3 DataFrames. One way to achieve this is to run a filter operation in a loop. However, I would like to know whether it can be done in a more efficient way.
#initialize spark dataframe
df = sc.parallelize([ (1,1234,282),(1,1396,179),(2,8620,178),(3,1620,191),(3,8820,828) ] ).toDF(["ID","X","Y"])
#get the list of unique ID values ; there's probably a better way to do this, but this was quick and easy
listids = [x.asDict().values()[0] for x in df.select("ID").distinct().collect()]
#create list of dataframes by IDs
dfArray = [df.where(df.ID == x) for x in listids]
dfArray[0].show()
+---+----+---+
| ID| X| Y|
+---+----+---+
| 1|1234|282|
| 1|1396|179|
+---+----+---+
dfArray[1].show()
+---+----+---+
| ID| X| Y|
+---+----+---+
| 2|8620|178|
+---+----+---+
dfArray[2].show()
+---+----+---+
| ID| X| Y|
+---+----+---+
| 3|1620|191|
| 3|8820|828|
+---+----+---+
The answer of @James Tobin needs to be altered a tiny bit if you are working with Python 3.x, as dict.values returns a dict_values object instead of a list. A quick workaround is just adding the list function:
listids = [list(x.asDict().values())[0]
           for x in df.select("ID").distinct().collect()]
Posting as a separate answer as I do not have the reputation required to put a comment on his answer.
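Alternatively (a sketch of the "better way" hinted at in the first answer's comment), you can read the field by name and avoid the asDict/values indexing entirely; this works on both Python 2 and 3:

# collect the distinct IDs by accessing the Row field by name
listids = [row["ID"] for row in df.select("ID").distinct().collect()]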