Extracting URL parameters using Python and PySpark

Say I have a column filled with URLs like in the following:
+------------------------------------------+
|url |
+------------------------------------------+
|https://www.example1.com?param1=1&param2=a|
|https://www.example2.com?param1=2&param2=b|
|https://www.example3.com?param1=3&param2=c|
+------------------------------------------+
What would be the best way of extracting the URL parameters from this column and adding them as columns to the dataframe to produce the below?
+-------------------------------------------+-------+-------+
|url                                        |param1 |param2 |
+-------------------------------------------+-------+-------+
|https://www.example1.com?param1=1&param2=a |1      |a      |
|https://www.example2.com?param1=2&param2=b |2      |b      |
|https://www.example3.com?param1=3&param2=c |3      |c      |
|etc...                                     |etc... |etc... |
+-------------------------------------------+-------+-------+
My Attempts
I can think of two possible methods of doing this: using functions.regexp_extract from the pyspark library, or using urllib.parse.parse_qs and urllib.parse.urlparse from the standard library. The former relies on regex, which is a finicky way of extracting parameters from strings, while the latter would need to be wrapped in a UDF to be used.
from pyspark.sql import *
from pyspark.sql import functions as fn

df = spark.createDataFrame(
    [
        ("https://www.example.com?param1=1&param2=a",),
        ("https://www.example2.com?param1=2&param2=b",),
        ("https://www.example3.com?param1=3&param2=c",)
    ],
    ["url"]
)
Regex solution:
df2 = df.withColumn("param1", fn.regexp_extract('url', r'param1=(\d)', 1))
df2 = df2.withColumn("param2", fn.regexp_extract('url', r'param2=([a-z])', 1))
df2.show()
>> +------------------------------------------+------+------+
>> |url |param1|param2|
>> +------------------------------------------+------+------+
>> |https://www.example1.com?param1=1&param2=a|1 |a |
>> |https://www.example2.com?param1=2&param2=b|2 |b |
>> |https://www.example3.com?param1=3&param2=c|3 |c |
>> +------------------------------------------+------+------+
UDF solution:
from urllib.parse import urlparse, parse_qs
from pyspark.sql.functions import udf
from pyspark.sql.types import MapType, StringType

extract_params = udf(
    lambda x: {k: v[0] for k, v in parse_qs(urlparse(x).query).items()},
    MapType(StringType(), StringType())
)

df3 = df.withColumn("params", extract_params(df.url))

df3.withColumn(
    "param1", df3.params['param1']
).withColumn(
    "param2", df3.params['param2']
).drop("params").show()
>> +------------------------------------------+------+------+
>> |url |param1|param2|
>> +------------------------------------------+------+------+
>> |https://www.example1.com?param1=1&param2=a|1 |a |
>> |https://www.example2.com?param1=2&param2=b|2 |b |
>> |https://www.example3.com?param1=3&param2=c|3 |c |
>> +------------------------------------------+------+------+
I'd like the versatility of a library like urllib, but also the optimisations that come from staying within native pyspark functions. Is there a better method than the two I've tried so far?

You can use parse_url within a SQL expression via expr or selectExpr.
Extract specific query parameter
parse_url can take a third parameter to specify the key (param) we want to extract from the URL:
df.selectExpr("*", "parse_url(url, 'QUERY', 'param1') as param1").show()
+------------------------------------------+------+
|url |param1|
+------------------------------------------+------+
|https://www.example2.com?param1=2&param2=b|2 |
|https://www.example.com?param1=1&param2=a |1 |
|https://www.example3.com?param1=3&param2=c|3 |
+------------------------------------------+------+
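If several specific parameters are wanted and their names are known up front, the same parse_url call can simply be repeated with an alias for each one. A minimal sketch (the parameter list here is an assumption, not part of the original question):
import pyspark.sql.functions as F

# Hypothetical list of parameter names, assumed to be known up front.
known_params = ["param1", "param2"]

# One parse_url expression per parameter, aliased so each parameter
# becomes its own column.
df.select(
    "url",
    *[F.expr(f"parse_url(url, 'QUERY', '{p}')").alias(p) for p in known_params]
).show(truncate=False)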
Extract all query parameters to columns
If you want to extract all query parameters as new columns without having to specify their names, you can first parse the URL, then split and explode to get the parameter/value pairs, and finally pivot to get each parameter as a column.
import pyspark.sql.functions as F

df.withColumn("parsed_url", F.explode(F.split(F.expr("parse_url(url, 'QUERY')"), "&"))) \
  .withColumn("parsed_url", F.split("parsed_url", "=")) \
  .select("url",
          F.col("parsed_url").getItem(0).alias("param_name"),
          F.col("parsed_url").getItem(1).alias("value")
  ) \
  .groupBy("url") \
  .pivot("param_name") \
  .agg(F.first("value")) \
  .show()
Gives:
+------------------------------------------+------+------+
|url |param1|param2|
+------------------------------------------+------+------+
|https://www.example2.com?param1=2&param2=b|2 |b |
|https://www.example.com?param1=1&param2=a |1 |a |
|https://www.example3.com?param1=3&param2=c|3 |c |
+------------------------------------------+------+------+
Another solution, as suggested by @jxc in the comments, is to use the str_to_map function:
df.selectExpr("*", "explode(str_to_map(split(url,'[?]')[1],'&','='))") \
  .groupBy('url') \
  .pivot('key') \
  .agg(F.first('value'))
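As a variant of the str_to_map idea (my own sketch), when the parameter names are known you can read them straight out of the map column and skip the groupBy/pivot, and therefore the shuffle, entirely:
import pyspark.sql.functions as F

# str_to_map yields a map<string,string> of parameter -> value; known
# parameter names can be read straight out of the map.
df.withColumn(
    "qs", F.expr("str_to_map(split(url, '[?]')[1], '&', '=')")
).select(
    "url",
    F.col("qs")["param1"].alias("param1"),  # missing keys come back as null
    F.col("qs")["param2"].alias("param2"),
).show(truncate=False)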

I'll go with a UDF and a more generic output format using the map type.
from urllib.parse import urlparse, parse_qs
from pyspark.sql import functions as F, types as T

@F.udf(T.MapType(T.StringType(), T.ArrayType(T.StringType())))
def url_param_pars(url):
    parsed = urlparse(url)
    return parse_qs(parsed.query)

df_params = df.withColumn("params", url_param_pars(F.col('url')))
df_params.show(truncate=False)
+------------------------------------------+------------------------------+
|url |params |
+------------------------------------------+------------------------------+
|https://www.example.com?param1=1&param2=a |[param1 -> [1], param2 -> [a]]|
|https://www.example2.com?param1=2&param2=b|[param1 -> [2], param2 -> [b]]|
|https://www.example3.com?param1=3&param2=c|[param1 -> [3], param2 -> [c]]|
+------------------------------------------+------------------------------+
df_params.printSchema()
root
|-- url: string (nullable = true)
|-- params: map (nullable = true)
| |-- key: string
| |-- value: array (valueContainsNull = true)
| | |-- element: string (containsNull = true)
With this method, you can have any number of params.
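If you then want dedicated columns for a couple of known parameters, you can read them out of the map column. A minimal follow-up sketch (the parameter names here are assumptions):
from pyspark.sql import functions as F

# Each map value is an array, because parse_qs allows repeated parameters,
# so take the first element when flattening into plain string columns.
df_params.select(
    "url",
    F.col("params")["param1"].getItem(0).alias("param1"),
    F.col("params")["param2"].getItem(0).alias("param2"),
).show(truncate=False)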

You can use the split function as follows.
from pyspark.sql import functions as f
df3 = df3.withColumn("param1", f.split(f.split(df3.url, "param1=")[1], "&")[0])
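If the same treatment is needed for more than one parameter, the nested-split pattern can be applied in a loop. A small sketch (the parameter list here is assumed, not taken from the question):
from pyspark.sql import functions as f

# Apply the same nested-split pattern for each known parameter name.
param_names = ["param1", "param2"]  # assumed to be known up front

result = df
for p in param_names:
    # split on "<param>=" first, then cut the value at the next "&"
    result = result.withColumn(p, f.split(f.split(result.url, p + "=")[1], "&")[0])

result.show(truncate=False)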

Here is one more solution, which works for Spark >= 2.4 since it uses the higher-order function filter.
The next solution is based on the assumption that all the records have an identical set of query parameters:
from pyspark.sql.functions import expr, col

# get the query string for the first non-null url
query = df.filter(df["url"].isNotNull()).first()["url"].split("?")[1]

# extract the parameter names (assumed to be identical for all the records)
params = list(map(lambda p: p.split("=")[0], query.split("&")))

# you can also omit the two previous lines (query parameter autodiscovery)
# and replace them with: params = ['param1', 'param2']
# when you know the query parameters beforehand

cols = [col('url')] + [
    expr(f"""split(
               filter(
                 split(split(url,'\\\?')[1], '&'),
                 p -> p like '{qp}=%'
               )[0],
               '=')[1]""").alias(qp)
    for qp in params
]

df.select(*cols).show(10, False)
# +------------------------------------------+------+------+
# |url |param1|param2|
# +------------------------------------------+------+------+
# |https://www.example.com?param1=1&param2=a |1 |a |
# |https://www.example2.com?param1=2&param2=b|2 |b |
# |https://www.example3.com?param1=3&param2=c|3 |c |
# +------------------------------------------+------+------+
Explanation:
split(split(url,'\\\?')[1], '&') -> [param1=1, param2=a]: first split by ? to retrieve the query string, then by &. As a result we get the array [param1=1, param2=a].
filter(..., p -> p like '{qp}=%')[0] -> param1=1, param2=a, ...: apply the filter function to the items of the array from the previous step, using the predicate p -> p like '{qp}=%', where {qp} is the parameter name, i.e. param1=%. qp iterates over the items of the params list. filter returns an array, so we just access its first item, since we know that the particular parameter should always exist. For the first parameter this returns param1=1, for the second param2=a, and so on.
split(..., '=')[1] -> 1, a, ...: finally split by = to retrieve the value of the query parameter. Here we take the second element, since the first one is the parameter name.
The basic idea here is that we divide the problem into two sub-problems: first get all the possible query parameters, then extract the values for all the urls.
Why is that? Well, you could indeed use pivot as @blackbishop already brilliantly implemented, although I believe that wouldn't work well when the cardinality of the query parameters is very high, i.e. 500 or more unique params. That would require a big shuffle, which could consequently cause an OOM exception. On the other hand, if you already know that the cardinality of the data is low, then @blackbishop's solution should be considered the ideal one for all cases.
To handle that case, it is better to first find all the query params (here I just made the assumption that all the queries have identical params, but the implementation of this part should be similar to the previous one) and then apply the above expression for each param to extract its value. This generates a select expression containing multiple expr expressions, although it shouldn't cause any performance issues since select is a narrow transformation and does not cause a shuffle.
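For completeness, a possible sketch of that autodiscovery step when the records do not all share the same parameters (my own addition, under the assumption that the number of distinct parameter names is small enough to collect to the driver):
from pyspark.sql import functions as F

# Explode every parameter name once, collect the distinct names to the
# driver, and reuse them to build the `cols` list above.
params = [
    row["name"]
    for row in (
        df.select(F.explode(F.split(F.expr("parse_url(url, 'QUERY')"), "&")).alias("kv"))
          .select(F.split("kv", "=").getItem(0).alias("name"))
          .distinct()
          .collect()
    )
]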

Related

Extract multiple substrings from column in pyspark

I have a pyspark DataFrame with only one column as follows:
df = spark.createDataFrame(["This is AD185E000834", "U1JG97297 And ODNO926902 etc.","DIHK2975290;HI22K2390279; DSM928HK08", "there is nothing here."], "string").toDF("col1")
I would like to extract the codes in col1 to other columns like:
df.col2 = ["AD185E000834", "U1JG97297", "DIHK2975290", None]
df.col3 = [None, "ODNO926902", "HI22K2390279", None]
df.col4 = [None, None, "DSM928HK08", None]
Does anyone know how to do this? Thank you very much.
I believe this can be shortened. I went long-hand to show you my logic. It would have been easier if you had laid out your logic in the question.
from pyspark.sql import functions as F

# split string into an array of tokens (on whitespace or semicolons)
df1 = df.withColumn('k', F.split(F.col('col1'), r'\s|\;')).withColumn('j', F.size('k'))

# compute maximum array length
s = df1.agg(F.max('j').alias('max')).distinct().collect()[0][0]

df1 = (df1
       # keep only tokens made of uppercase letters and digits
       .withColumn('k', F.expr("filter(k, x -> x rlike('^[A-Z0-9]+$'))"))
       # convert the resulting array into a struct to allow splitting into columns
       .withColumn(
           "k",
           F.struct(*[
               F.col("k")[i].alias(f"col{i+2}") for i in range(s)
           ])
       ))

# split the struct column in df1 and join back to df
df.join(df1.select('col1', 'k.*'), how='left', on='col1').show()
+--------------------+------------+------------+----------+----+
| col1| col2| col3| col4|col5|
+--------------------+------------+------------+----------+----+
|DIHK2975290;HI22K...| DIHK2975290|HI22K2390279|DSM928HK08|null|
|This is AD185E000834|AD185E000834| null| null|null|
|U1JG97297 And ODN...| U1JG97297| ODNO926902| null|null|
|there is nothing ...| null| null| null|null|
+--------------------+------------+------------+----------+----+
As you said in your comment, here we are assuming that your "codes" are strings of at least two characters composed only of uppercase letters and numbers.
That being said, as of Spark 3.1+, you can use regexp_extract_all inside an expr to create a temporary array column with all the codes, then dynamically create one column for each entry of the arrays.
import pyspark.sql.functions as F
# create an array with all the identified "codes"
new_df = df.withColumn('myarray', F.expr("regexp_extract_all(col1, '([A-Z0-9]{2,})', 1)"))
# find the maximum amount of codes identified in a single string
max_array_length = new_df.withColumn('array_length', F.size('myarray')).agg({'array_length': 'max'}).collect()[0][0]
print('Max array length: {}'.format(max_array_length))
# explode the array in multiple columns
new_df.select('col1', *[new_df.myarray[i].alias('col' + str(i+2)) for i in range(max_array_length)]) \
.show(truncate=False)
Max array length: 3
+------------------------------------+------------+------------+----------+
|col1 |col2 |col3 |col4 |
+------------------------------------+------------+------------+----------+
|This is AD185E000834 |AD185E000834|null |null |
|U1JG97297 And ODNO926902 etc. |U1JG97297 |ODNO926902 |null |
|DIHK2975290;HI22K2390279; DSM928HK08|DIHK2975290 |HI22K2390279|DSM928HK08|
|there is nothing here. |null |null |null |
+------------------------------------+------------+------------+----------+
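For Spark versions before 3.1, where regexp_extract_all is not available, one possible fallback (my own sketch, not from the original answer) is a plain Python UDF with re.findall to build the same array column:
import re
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StringType

# Build the same temporary array column with a plain Python UDF and re.findall.
find_codes = F.udf(lambda s: re.findall(r'[A-Z0-9]{2,}', s or ''), ArrayType(StringType()))

new_df = df.withColumn('myarray', find_codes('col1'))
# ...then reuse the same max-length / select logic shown above.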

How to replace any instances of an integer with NULL in a column meant for strings using PySpark?

Notice: this is for Spark version 2.1.1.2.6.1.0-129
I have a Spark dataframe. One of the columns has states as type string (e.g. Illinois, California, Nevada). There are some instances of numbers in this column (e.g. 12, 24, 01, 2). I would like to replace any instance of an integer with a NULL.
The following is some code that I have written:
my_df = my_df.selectExpr(
" regexp_replace(states, '^-?[0-9]+$', '') AS states ",
"someOtherColumn")
This regex expression replaces any instance of an integer with an empty string. I would like to replace it with None in python to designate it as a NULL value in the DataFrame.
I strongly suggest you look at the PySpark SQL functions and try to use them properly instead of selectExpr:
from pyspark.sql import functions as F

(df
    .withColumn('states_fixed', F
        .when(F.regexp_replace(F.col('states'), '^-?[0-9]+$', '') == '', None)
        .otherwise(F.col('states'))
    )
    .show()
)
# Output
# +----------+------------+
# | states|states_fixed|
# +----------+------------+
# | Illinois| Illinois|
# | 12| null|
# |California| California|
# | 01| null|
# | Nevada| Nevada|
# +----------+------------+
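A slightly more direct variant of the same idea (my own sketch) skips the empty-string round trip and uses rlike directly in the when condition:
from pyspark.sql import functions as F

# Mark the value as null whenever the whole string is an optionally-signed
# integer, otherwise keep the original state name.
df.withColumn(
    'states_fixed',
    F.when(F.col('states').rlike('^-?[0-9]+$'), None).otherwise(F.col('states'))
).show()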

Pyspark DataFrame - Escaping &

I have some large (~150 GB) csv files using semicolon as the separator character. I have found that some of the fields contain an HTML-encoded ampersand (&amp;). The semicolon is getting picked up as a column separator, so I need a way to escape it or replace &amp; with & while loading the dataframe.
As an example, I have the following csv file:
ID;FirstName;LastName
1;Chandler;Bing
2;Ross & Monica;Geller
I load it using the following notebook:
df = spark.read.option("delimiter", ";").option("header","true").csv('/mnt/input/AMP test.csv')
df.show()
The result I'm getting is:
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
| 1| Chandler| Bing|
| 2|Ross &amp| Monica|
+---+---------+--------+
Whereas what I'm looking for is:
+---+-------------+--------+
| ID| FirstName|LastName|
+---+-------------+--------+
| 1| Chandler| Bing|
| 2|Ross & Monica| Geller|
+---+-------------+--------+
I have tried using .option("escape", "&") but that escaping only works on a single character.
Update
I have a hacky workaround using RDDs that works at least for small test files, but I'm still looking for a proper solution that escapes the string while loading the dataframe.
rdd = sc.textFile('/mnt/input/AMP test.csv')
rdd = rdd.map(lambda x: x.replace('&amp;', '&'))
rdd.coalesce(1).saveAsTextFile("/mnt/input/AMP test escaped.csv")
df = spark.read.option("delimiter", ";").option("header","true").csv('/mnt/input/AMP test escaped.csv')
df.show()
You can do that with dataframes directly. It helps if you have at least one file you know does not contain any &amp;, to retrieve the schema.
Let's assume such a file exists and its path is "valid.csv".
from pyspark.sql import functions as F

# I acquire a valid file without the &amp; issue to get a nice schema
schm = spark.read.csv("valid.csv", header=True, inferSchema=True, sep=";").schema

df = spark.read.text("/mnt/input/AMP test.csv")

# I assume you have several files, so I remove all the headers.
# I do not need them as I already have my schema in schm.
header = df.first().value
df = df.where(F.col("value") != header)

# I replace "&amp;" with "&", and split the column
df = df.withColumn(
    "value", F.regexp_replace(F.col("value"), "&amp;", "&")
).withColumn(
    "value", F.split("value", ";")
)

# I explode the array into several columns and add types based on schm defined previously
df = df.select(
    *(
        F.col("value").getItem(i).cast(col.dataType).alias(col.name)
        for i, col in enumerate(schm)
    )
)
Here is the result:
df.show()
+---+-------------+--------+
| ID| FirstName|LastName|
+---+-------------+--------+
| 1| Chandler| Bing|
| 2|Ross & Monica| Geller|
+---+-------------+--------+
df.printSchema()
root
|-- ID: integer (nullable = true)
|-- FirstName: string (nullable = true)
|-- LastName: string (nullable = true)
I think there isn't a way to escape this multi-character sequence &amp; using only spark.read.csv, so the solution is like your "workaround":
rdd.map: this function already replaces, in all columns, the value &amp; with &.
It is not necessary to save your rdd to a temporary path; just pass it as the csv parameter:
rdd = sc.textFile("your_path").map(lambda x: x.replace("&amp;", "&"))
df = spark.read.csv(rdd, header=True, sep=";")
df.show()
+---+-------------+--------+
| ID| FirstName|LastName|
+---+-------------+--------+
| 1| Chandler| Bing|
| 2|Ross & Monica| Geller|
+---+-------------+--------+

Replacing last two characters in PySpark column

In a Spark dataframe with a column containing date-based integers (like 20190200, 20180900), I would like to replace all those ending in 00 so that they end in 01, so that I can convert them afterwards to readable timestamps.
I have the following code:
from pyspark.sql.types import StringType
import pyspark.sql.functions as sf
udf = sf.udf(lambda x: x.replace("00","01"), StringType())
sdf.withColumn('date_k', udf(sf.col("date_k"))).show()
I also tried:
sdf.withColumn('date_k',sf.regexp_replace(sf.col('date_k').cast('string').substr(1, 9),'00','01'))
The problem is that this doesn't work when there is, for instance, a value of 20200100, as it produces 20201101.
I also tried with '\\00', '01', but it does not work. What is the right way to use a regex for this purpose?
Try this. You can use $ to match strings ending in 00 and use regexp_replace to replace that ending with 01:
# Input DF
df.show(truncate=False)
# +--------+
# |value |
# +--------+
# |20190200|
# |20180900|
# |20200100|
# |20200176|
# +--------+
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

df.withColumn("value", F.col('value').cast(StringType())) \
  .withColumn("value", F.when(F.col('value').rlike("(00$)"),
                              F.regexp_replace(F.col('value'), r'(00$)', '01'))
                        .otherwise(F.col('value'))).show()
# +--------+
# | value|
# +--------+
# |20190201|
# |20180901|
# |20200101|
# |20200176|
# +--------+
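As a follow-up (my own sketch, since the question mentions converting to readable timestamps afterwards), the corrected strings can be parsed with to_date using the yyyyMMdd pattern:
import pyspark.sql.functions as F

# Fix the 00 endings, then parse the result with the yyyyMMdd pattern.
df_fixed = df.withColumn(
    "value", F.regexp_replace(F.col("value").cast("string"), "00$", "01")
)
df_fixed.withColumn("as_date", F.to_date(F.col("value"), "yyyyMMdd")).show()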
This is how I solved it.
Explanation: first take the number excluding the last two digits, then run the regex replacement on the last two digits only, and finally concatenate both parts.
import pyspark.sql.functions as f
df = spark.sql("""
select 20200100 as date
union
select 20311100 as date
""")
df.show()
"""
+--------+
| date|
+--------+
|20311100|
|20200100|
+--------+
"""
df.withColumn("date_k", f.expr("""concat(substring(cast(date as string), 0,length(date)-2),
regexp_replace(substring(cast(date as string), length(date)-1,length(date)),'00','01'))""")).show()
"""
+--------+--------+
| date| date_k|
+--------+--------+
|20311100|20311101|
|20200100|20200101|
+--------+--------+
"""

String to array in spark

I have a dataframe in PySpark with a string column whose value is [{"AppId":"APACON","ExtId":"141730"}] (the string is exactly like that in my column; it is a string, not an array).
I want to convert this to an array of struct.
Can I do that simply with native Spark functions, or do I have to parse the string or use a UDF?
sqlContext.createDataFrame(
    [ (1,'[{"AppId":"APACON","ExtId":"141730"}]'),
      (2,'[{"AppId":"APACON","ExtId":"141793"}]'),
    ],
    ['idx','txt']
).show()
+---+--------------------+
|idx| txt|
+---+--------------------+
| 1|[{"AppId":"APACON...|
| 2|[{"AppId":"APACON...|
+---+--------------------+
With Spark 2.1 or above
You have the following data :
import pyspark.sql.functions as F
from pyspark.sql.types import *

df = sqlContext.createDataFrame(
    [ (1,'[{"AppId":"APACON","ExtId":"141730"}]'),
      (2,'[{"AppId":"APACON","ExtId":"141793"}]'),
    ],
    ['idx','txt']
)
you can indeed use pyspark.sql.functions.from_json as follows:
schema = StructType([StructField("AppId", StringType()),
                     StructField("ExtId", StringType())])
df = df.withColumn('array',F.from_json(F.col('txt'), schema))
df.show()
+---+--------------------+---------------+
|idx| txt| array|
+---+--------------------+---------------+
| 1|[{"AppId":"APACON...|[APACON,141730]|
| 2|[{"AppId":"APACON...|[APACON,141793]|
+---+--------------------+---------------+
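Since the question asks for an array of struct, here is a small additional sketch (my own, assuming Spark 2.4 or later), where from_json can target an ArrayType schema directly:
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

# from_json targeting an array-of-struct schema (assumes Spark 2.4+).
array_schema = ArrayType(StructType([
    StructField("AppId", StringType()),
    StructField("ExtId", StringType()),
]))

df.withColumn("parsed", F.from_json(F.col("txt"), array_schema)).show(truncate=False)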
Version < Spark 2.1
One way to bypass the issue would be to first slightly modify your input string, removing the square brackets:
# Use regexp_extract to ignore square brackets
df = df.withColumn('txt_parsed', F.regexp_extract(F.col('txt'), '[^\\[\\]]+', 0))
df.show(truncate=False)
+---+-------------------------------------+-----------------------------------+
|idx|txt |txt_parsed |
+---+-------------------------------------+-----------------------------------+
|1 |[{"AppId":"APACON","ExtId":"141730"}]|{"AppId":"APACON","ExtId":"141730"}|
|2 |[{"AppId":"APACON","ExtId":"141793"}]|{"AppId":"APACON","ExtId":"141793"}|
+---+-------------------------------------+-----------------------------------+
Then you could use pyspark.sql.functions.get_json_object to parse the txt_parsed column
df = df.withColumn('AppId', F.get_json_object(df.txt_parsed, '$.AppId'))
df = df.withColumn('ExtId', F.get_json_object(df.txt_parsed, '$.ExtId'))
df.show()
+---+--------------------+------+------+
|idx| txt| AppId| ExtId|
+---+--------------------+------+------+
| 1|{"AppId":"APACON"...|APACON|141730|
| 2|{"AppId":"APACON"...|APACON|141793|
+---+--------------------+------+------+
