spark extract columns from string - python

Need help parsing a string that contains values for each attribute. Below is my sample string:
otherPartofString Name=<Series VR> Type=<1Ac4> SqVal=<34> conn ID=<2>
Sometimes the string can include other values with a different delimiter, like:
otherPartofString Name=<Series X> Type=<1B3> SqVal=<34> conn ID=<2> conn Loc=sfo dest=chc bridge otherpartofString..
The output columns will be:
Name | Type | SqVal | ID | Loc | dest
-------------------------------------------
Series VR | 1Ac4 | 34 | 2 | null | null
Series X | 1B3 | 34 | 2 | sfo | chc

As we discussed, to use the str_to_map function on your sample data, we can set up pairDelim and keyValueDelim as follows:
pairDelim: '(?i)>? *(?=Name|Type|SqVal|conn ID|conn Loc|dest|$)'
keyValueDelim: '=<?'
Here pairDelim is case-insensitive ((?i)) with an optional > followed by zero or more spaces, then a lookahead for one of the pre-defined keys (we use '|'.join(keys) to generate that alternation dynamically) or the end-of-string anchor $. keyValueDelim is an '=' with an optional <.
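For reference, the pairDelim literal above is exactly what the '|'.join(keys) call in the snippet below produces; a quick sketch of how it is assembled:
keys = ["Name", "Type", "SqVal", "conn ID", "conn Loc", "dest"]
pair_delim = '(?i)>? *(?={}|$)'.format('|'.join(keys))
key_value_delim = '=<?'
print(pair_delim)
# (?i)>? *(?=Name|Type|SqVal|conn ID|conn Loc|dest|$)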
from pyspark.sql import functions as F
df = spark.createDataFrame([
    ("otherPartofString Name=<Series VR> Type=<1Ac4> SqVal=<34> conn ID=<2>",),
    ("otherPartofString Name=<Series X> Type=<1B3> SqVal=<34> conn ID=<2> conn Loc=sfo dest=chc bridge otherpartofString..",)
], ["value"])
keys = ["Name", "Type", "SqVal", "conn ID", "conn Loc", "dest"]
# add the following conf for Spark 3.0 to overcome duplicate map key ERROR
#spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")
df.withColumn("m", F.expr("str_to_map(value, '(?i)>? *(?={}|$)', '=<?')".format('|'.join(keys)))) \
.select([F.col('m')[k].alias(k) for k in keys]) \
.show()
+---------+----+-----+-------+--------+--------------------+
| Name|Type|SqVal|conn ID|conn Loc| dest|
+---------+----+-----+-------+--------+--------------------+
|Series VR|1Ac4| 34| 2| null| null|
| Series X| 1B3| 34| 2| sfo|chc bridge otherp...|
+---------+----+-----+-------+--------+--------------------+
We will need to do some post-processing on the value of the last mapped key, since there is no anchor or pattern to distinguish it from the unrelated trailing text (this could be a problem for any key, not just the last one). Please let me know if you can specify such a pattern.
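For example, if we can assume the dest value itself never contains spaces (an assumption, not something the sample data guarantees), one possible post-processing step is to keep only the first whitespace-separated token of that column:
# keep only the first token of the last mapped key, assuming its real value has no spaces
df.withColumn("m", F.expr("str_to_map(value, '(?i)>? *(?={}|$)', '=<?')".format('|'.join(keys)))) \
    .select([F.col('m')[k].alias(k) for k in keys]) \
    .withColumn('dest', F.split('dest', ' ')[0]) \
    .show()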
Edit: if the map approach is less efficient because the case-insensitive search requires some expensive pre-processing, try the following:
ptn = '|'.join(keys)
df.select("*", *[F.regexp_extract('value', r'(?i)\b{0}=<?([^=>]+?)>? *(?={1}|$)'.format(k,ptn), 1).alias(k) for k in keys]).show()
In case the angle brackets < and > are used only when a value or its adjacent key contains non-word characters, this can be simplified with some pre-processing:
df.withColumn('value', F.regexp_replace('value', r'=(\w+)', '=<$1>')) \
    .select("*", *[F.regexp_extract('value', r'(?i)\b{0}=<([^>]+)>'.format(k), 1).alias(k) for k in keys]) \
    .show()
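For reference, after this regexp_replace step the second sample value ends in ... conn ID=<2> conn Loc=<sfo> dest=<chc> bridge otherpartofString.., so every field can then be extracted with the same key=<...> pattern.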
Edit-2: added a dictionary to handle key aliases:
keys = ["Name", "Type", "SqVal", "ID", "Loc", "dest"]
# aliases are case-insensitive and added only if they exist
key_aliases = {
    'Type': ['ThisType', 'AnyName'],
    'ID': ['conn ID'],
    'Loc': ['conn Loc']
}
# set up regex pattern for each key differently
key_ptns = [ (k, '|'.join([k, *key_aliases[k]]) if k in key_aliases else k) for k in keys ]
#[('Name', 'Name'),
# ('Type', 'Type|ThisType|AnyName'),
# ('SqVal', 'SqVal'),
# ('ID', 'ID|conn ID'),
# ('Loc', 'Loc|conn Loc'),
# ('dest', 'dest')]
df.withColumn('value', F.regexp_replace('value', r'=(\w+)', '=<$1>')) \
    .select("*", *[F.regexp_extract('value', r'(?i)\b(?:{0})=<([^>]+)>'.format(p), 1).alias(k) for k,p in key_ptns]) \
    .show()
+--------------------+---------+----+-----+---+---+----+
| value| Name|Type|SqVal| ID|Loc|dest|
+--------------------+---------+----+-----+---+---+----+
|otherPartofString...|Series VR|1Ac4| 34| 2| | |
|otherPartofString...| Series X| 1B3| 34| 2|sfo| chc|
+--------------------+---------+----+-----+---+---+----+

Related

Calculate relative frequency of bigrams in PySpark

I'm trying to count word pairs in a text file. First, I've done some pre-processing on the text, and then I counted word pairs as shown below:
((Aspire, to), 1) ; ((to, inspire), 4) ; ((inspire, before), 38)...
Now, I want to report the 1000 most frequent pairs, sorted by:
Word (second word of the pair)
Relative frequency (pair occurrences / 2nd word total occurrences)
Here's what I've done so far
from pyspark.sql import SparkSession
import re
spark = SparkSession.builder.appName("Bigram occurences and relative frequencies").master("local[*]").getOrCreate()
sc = spark.sparkContext
text = sc.textFile("big.txt")
tokens = text.map(lambda x: x.lower()).map(lambda x: re.split(r"[\s,.;:!?]+", x))
pairs = tokens.flatMap(lambda xs: (tuple(x) for x in zip(xs, xs[1:]))).map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
frame = pairs.toDF(['pair', 'count'])
# Dataframe ordered by the most frequent pair to the least
most_frequent = frame.sort(frame['count'].desc())
# For each row, trying to add a column with the relative frequency, but I'm getting an error
with_rf = frame.withColumn("rf", frame['count'] / (frame.pair._2.sum()))
I think I'm relatively close to the result I want but I can't figure it out. I'm new to Spark and DataFrames in general.
I also tried
import pyspark.sql.functions as F
frame.groupBy(frame['pair._2']).agg((F.col('count') / F.sum('count')).alias('rf')).show()
Any help would be appreciated.
EDIT: here's a sample of the frame dataframe
+--------------------+-----+
| pair|count|
+--------------------+-----+
|{project, gutenberg}| 69|
| {gutenberg, ebook}| 14|
| {ebook, of}| 5|
| {adventures, of}| 6|
| {by, sir}| 12|
| {conan, doyle)}| 1|
| {changing, all}| 2|
| {all, over}| 24|
+--------------------+-----+
root
|-- pair: struct (nullable = true)
| |-- _1: string (nullable = true)
| |-- _2: string (nullable = true)
|-- count: long (nullable = true)
The relative frequency can be computed using a window function that partitions by the second word in the pair and applies a sum.
Then we limit the entries in the df to the top x based on count, and finally order by the second word in the pair and the relative frequency.
from pyspark.sql import functions as F
from pyspark.sql import Window as W
data = [(("project", "gutenberg"), 69,),
(("gutenberg", "ebook"), 14,),
(("ebook", "of"), 5,),
(("adventures", "of"), 6,),
(("by", "sir"), 12,),
(("conan", "doyle"), 1,),
(("changing", "all"), 2,),
(("all", "over"), 24,), ]
df = spark.createDataFrame(data, ("pair", "count", ))
ws = W.partitionBy(F.col("pair")._2).rowsBetween(W.unboundedPreceding, W.unboundedFollowing)
(df.withColumn("relative_freq", F.col("count") / F.sum("count").over(ws))
.orderBy(F.col("count").desc())
.limit(3) # change here to select top 1000
.orderBy(F.desc(F.col("pair")._2), F.col("relative_freq").desc())
).show()
"""
+--------------------+-----+-------------+
| pair|count|relative_freq|
+--------------------+-----+-------------+
| {all, over}| 24| 1.0|
|{project, gutenberg}| 69| 1.0|
| {gutenberg, ebook}| 14| 1.0|
+--------------------+-----+-------------+
"""

Remove words from pyspark dataframe based on words from another pyspark dataframe

I want to remove the words that appear in the secondary data frame from the main data frame.
This is the main data frame:
+----------+--------------------+
| event_dt| cust_text|
+----------+--------------------+
|2020-09-02|hi fine i want to go|
|2020-09-02|i need a line hold |
|2020-09-02|i have the 60 packs|
|2020-09-02|hello want you teach|
+----------+--------------------+
Below is the single-column secondary data frame. The words in the secondary data frame need to be removed from the cust_text column of the main data frame wherever they occur. For example, 'want' will be removed from every row in which it shows up (in this example, the 1st and 4th rows).
+-------+
|column1|
+-------+
| want|
|because|
| need|
| hello|
| a|
| have|
| go|
+-------+
The event_dt column and each row will remain as is; only the secondary data frame words are removed from the main data frame, giving the result data frame shown below:
+----------+--------------------+
| event_dt| cust_text|
+----------+--------------------+
|2020-09-02|hi fine i to |
|2020-09-02|i line hold |
|2020-09-02|i the 60 packs |
|2020-09-02|you teach |
+----------+--------------------+
Help is appreciated!!
This should be a working solution for you. Use array_except() to eliminate the unwanted strings; however, to do that we need a little bit of preparation.
Create the DataFrame:
from pyspark.sql import functions as F
from pyspark.sql import types as T
df = spark.createDataFrame([("2020-09-02","hi fine i want to go"),("2020-09-02","i need a line hold"), ("2020-09-02", "i have the 60 packs"), ("2020-09-02", "hello want you teach")],[ "col1","col2"])
Convert the column to an array for later use:
df = df.withColumn("col2", F.split("col2", " "))
df.show(truncate=False)
df_lookup = spark.createDataFrame([(1,"want"),(1,"because"), (1, "need"), (1, "hello"),(1, "a"),(1, "give"), (1, "go")],[ "col1","col2"])
df_lookup.show()
Output
+----------+---------------------------+
|col1 |col2 |
+----------+---------------------------+
|2020-09-02|[hi, fine, i, want, to, go]|
|2020-09-02|[i, need, , a, line, hold] |
|2020-09-02|[i, have, the, , 60, packs]|
|2020-09-02|[hello, want, you, teach] |
+----------+---------------------------+
+----+-------+
|col1| col2|
+----+-------+
| 1| want|
| 1|because|
| 1| need|
| 1| hello|
| 1| a|
| 1| give|
| 1| go|
+----+-------+
Now, just groupBy the lookup dataframe and collect all the lookup values into a variable, as below:
df_lookup_var = df_lookup.groupBy("col1").agg(F.collect_set("col2").alias("col2")).collect()[0][1]
print(df_lookup_var)
x = ",".join(df_lookup_var)
print(x)
df = df.withColumn("filter_col", F.lit(x))
df = df.withColumn("filter_col", F.split("filter_col", ","))
df.show(truncate=False)
This does the trick
df = df.withColumn("ArrayColumn", F.array_except("col2", "filter_col"))
df.show(truncate = False)
+----------+---------------------------+-----------------------------------------+---------------------------+
|col1 |col2 |filter_col |ArrayColumn |
+----------+---------------------------+-----------------------------------------+---------------------------+
|2020-09-02|[hi, fine, i, want, to, go]|[need, want, a, because, hello, give, go]|[hi, fine, i, to] |
|2020-09-02|[i, need, , a, line, hold] |[need, want, a, because, hello, give, go]|[i, , line, hold] |
|2020-09-02|[i, have, the, , 60, packs]|[need, want, a, because, hello, give, go]|[i, have, the, , 60, packs]|
|2020-09-02|[hello, want, you, teach] |[need, want, a, because, hello, give, go]|[you, teach] |
+----------+---------------------------+-----------------------------------------+---------------------------+
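If the final result needs to be a plain string column again, as in the question's expected output, a concat_ws step can rebuild it from the filtered array; a minimal sketch (empty strings left over from the split stay in the array, so they show up as extra spaces):
# rebuild a single string column from the filtered array
df = df.withColumn("cust_text", F.concat_ws(" ", "ArrayColumn"))
df.select("col1", "cust_text").show(truncate=False)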

Pyspark mapping regex

I have a PySpark dataframe with a text column.
I want to map the values using regex expressions.
df = df.withColumn('mapped_col', regexp_replace('mapped_col', '.*-RH', 'RH'))
df = df.withColumn('mapped_col', regexp_replace('mapped_col', '.*-FI', 'FI'))
Plus, I want to map specific values according to a dictionary. I did the following (mapper is from create_map()):
df = df.withColumn("mapped_col",mapper.getItem(F.col("action")))
Finally, the values which have not been mapped by the dictionary or the regex expressions will be set to null. I do not know how to do this part in combination with the two others.
Is it possible to have something like a dictionary of regex expressions so I can combine the two 'functions'?
{".*-RH": "RH", ".*FI" : "FI"}
Original Output Example
+-----------------------------+
|message |
+-----------------------------+
|GDF2009 |
|GDF2014 |
|ADS-set |
|ADS-set |
|XSQXQXQSDZADAA5454546a45a4-FI|
|dadaccpjpifjpsjfefspolamml-FI|
|dqdazdaapijiejoajojp565656-RH|
|kijipiadoa                   |
+-----------------------------+
Expected Output Example
+-----------------------------+----------+
|message                      |status    |
+-----------------------------+----------+
|GDF2009                      |GDF       |
|GDF2014                      |GDF       |
|ADS/set                      |ADS       |
|ADS-set                      |ADS       |
|XSQXQXQSDZADAA5454546a45a4-FI|FI        |
|dadaccpjpifjpsjfefspolamml-FI|FI        |
|dqdazdaapijiejoajojp565656-RH|RH        |
|kijipiadoa                   |null or ??|
+-----------------------------+----------+
So the first four lines are mapped with a dict, and the others are mapped using regex. Unmapped values are null or ??
Thank you,
You can achieve it using the contains function:
import pyspark.sql.functions as f
from pyspark.sql.types import StringType
df = spark.createDataFrame(
    ["GDF2009", "GDF2014", "ADS-set", "ADS-set", "XSQXQXQSDZADAA5454546a45a4-FI", "dadaccpjpifjpsjfefspolamml-FI",
     "dqdazdaapijiejoajojp565656-RH", "kijipiadoa"], StringType()).toDF("message")
df.show()
names = ("GDF", "ADS", "FI", "RH")
def c(col, names):
    # one column per candidate label: the label if the message contains it, else ""
    return [f.when(f.col(col).contains(i), i).otherwise("") for i in names]
df.select("message", f.concat_ws("", f.array_remove(f.array(*c("message", names)), "")).alias("status")).show()
output:
+--------------------+
| message|
+--------------------+
| GDF2009|
| GDF2014|
| ADS-set|
| ADS-set|
|XSQXQXQSDZADAA545...|
|dadaccpjpifjpsjfe...|
|dqdazdaapijiejoaj...|
| kijipiadoa|
+--------------------+
+--------------------+------+
| message|status|
+--------------------+------+
| GDF2009| GDF|
| GDF2014| GDF|
| ADS-set| ADS|
| ADS-set| ADS|
|XSQXQXQSDZADAA545...| FI|
|dadaccpjpifjpsjfe...| FI|
|dqdazdaapijiejoaj...| RH|
| kijipiadoa| |
+--------------------+------+
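If you specifically want the dictionary-of-regex idea from the question, with null (rather than an empty string) for unmapped rows, one possible sketch is to fold a dict of patterns into a chained when/rlike expression. The patterns below are only illustrative; in particular the GDF rule is an assumption, since the question never states how GDF2009 maps to GDF:
import pyspark.sql.functions as F
from functools import reduce
# illustrative pattern -> label mapping; adjust to your real rules
regex_map = {r'^GDF': 'GDF', r'^ADS': 'ADS', r'-FI$': 'FI', r'-RH$': 'RH'}
# chain one when(rlike) branch per entry; rows matching nothing stay null
status_col = reduce(
    lambda acc, item: acc.when(F.col('message').rlike(item[0]), item[1]),
    regex_map.items(),
    F.when(F.lit(False), None)
)
df.withColumn('status', status_col).show()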

Use spark function result as input of another function

In my Spark application I have a dataframe with information like:
+------------------+---------------+
| labels | labels_values |
+------------------+---------------+
| ['l1','l2','l3'] | 000 |
| ['l3','l4','l5'] | 100 |
+------------------+---------------+
What I am trying to achieve is to create, given a label name as input, a single_label_value column that takes the value for that label from the labels_values column.
For example, for label='l3' I would like to retrieve this output:
+------------------+---------------+--------------------+
| labels | labels_values | single_label_value |
+------------------+---------------+--------------------+
| ['l1','l2','l3'] | 000 | 0 |
| ['l3','l4','l5'] | 100 | 1 |
+------------------+---------------+--------------------+
Here's what I am attempting to use:
selected_label='l3'
label_position = F.array_position(my_df.labels, selected_label)
my_df= my_df.withColumn(
"single_label_value",
F.substring(my_df.labels_values, label_position, 1)
)
But I am getting an error because the substring function does not like the label_position argument.
Is there any way to combine these function outputs without writing an udf?
Hope this will work for you.
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark = SparkSession.builder.getOrCreate()
mydata = [[['l1','l2','l3'], '000'], [['l3','l4','l5'], '100']]
df = spark.createDataFrame(mydata, schema=["lebels", "lebel_values"])
selected_label = 'l3'
df2 = df.select(
    "*",
    (array_position(df.lebels, selected_label) - 1).alias("pos_val"))
df2.createOrReplaceTempView("temp_table")
df3 = spark.sql("select *, substring(lebel_values, pos_val, 1) as val_pos from temp_table")
df3.show()
+------------+------------+-------+-------+
| lebels|lebel_values|pos_val|val_pos|
+------------+------------+-------+-------+
|[l1, l2, l3]| 000| 2| 0|
|[l3, l4, l5]| 100| 0| 1|
+------------+------------+-------+-------+
This gives the position of the value. If you want the exact (zero-based) index, you can subtract 1 from this value.
Edited answer -> worked with a temp view. Still looking for a solution using the withColumn option. I hope it will help you for now.
Edit 2 -> Answer using the DataFrame API.
df2 = df.select(
    "*",
    (array_position(df.lebels, selected_label) - 1).astype("int").alias("pos_val")
)
df3 = df2.withColumn("asked_col", expr("substring(lebel_values, pos_val, 1)"))
df3.show()
Try maybe:
import pyspark.sql.functions as f
from pyspark.sql.functions import *
selected_label='l3'
df=df.withColumn('single_label_value', f.substring(f.col('labels_values'), array_position(f.col('labels'), lit(selected_label))-1, 1))
df.show()
(for spark version >=2.4)
I think lit() was the function you were missing - you can use it to pass constant values to spark dataframes.
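Since both array_position and SQL's substring are 1-based, another possible variant (a sketch, assuming the selected label is always present in labels) does the whole lookup inside a single expr and avoids the -1 adjustment:
import pyspark.sql.functions as F
selected_label = 'l3'
# array_position and substring are both 1-based, so the position can be used directly
df = df.withColumn(
    'single_label_value',
    F.expr("substring(labels_values, array_position(labels, '{}'), 1)".format(selected_label))
)
df.show()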

Fill null values with next incrementing number | PySpark | Python

I have the below dataframe with values. I want to add the next consecutive id in the id column, which must be unique as well as incrementing in nature.
+----------------+----+--------------------+
|local_student_id| id| last_updated|
+----------------+----+--------------------+
| 610931|null| null|
| 599768| 3|2020-02-26 15:47:...|
| 633719|null| null|
| 612949| 2|2020-02-26 15:47:...|
| 591819| 1|2020-02-26 15:47:...|
| 595539| 4|2020-02-26 15:47:...|
| 423287|null| null|
| 641322| 5|2020-02-26 15:47:...|
+----------------+----+--------------------+
I want the below expected output. Can anybody help me? I am new to PySpark. I also want to add the current timestamp in the last_updated column.
+----------------+----+--------------------+
|local_student_id| id| last_updated|
+----------------+----+--------------------+
| 610931| 6|2020-02-26 16:00:...|
| 599768| 3|2020-02-26 15:47:...|
| 633719| 7|2020-02-26 16:00:...|
| 612949| 2|2020-02-26 15:47:...|
| 591819| 1|2020-02-26 15:47:...|
| 595539| 4|2020-02-26 15:47:...|
| 423287| 8|2020-02-26 16:00:...|
| 641322| 5|2020-02-26 15:47:...|
+----------------+----+--------------------+
Actually I tried:
final_data = final_data.withColumn(
    'id', when(col('id').isNull(), row_number() + max(col('id'))).otherwise(col('id')))
but it gives the below error:
: org.apache.spark.sql.AnalysisException: grouping expressions sequence is empty, and '`local_student_id`' is not an aggregate function. Wrap '(CASE WHEN (`id` IS NULL) THEN (CAST(row_number() AS BIGINT) + max(`id`)) ELSE `id` END AS `id`)' in windowing function(s) or wrap '`local_student_id`' in first() (or first_value) if you don't care which value you get.;;
Here is the code you need:
from pyspark.sql import functions as F, Window
max_id = final_data.groupBy().max("id").collect()[0][0]
final_data.withColumn(
    "id",
    F.coalesce(
        F.col("id"),
        F.row_number().over(Window.orderBy("id")) + F.lit(max_id)
    )
).withColumn(
    "last_updated",
    F.coalesce(
        F.col("last_updated"),
        F.current_timestamp()
    )
)
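Note that the snippet above only builds the new DataFrame; assign it back (final_data = final_data.withColumn(...)) and call show() to see the result. Also, Window.orderBy without a partitionBy moves all rows to a single partition to compute row_number, which is fine for a small table like this but can become a bottleneck on a large one.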
