Replacing last two characters in PySpark column - python

In a Spark dataframe with a column containing date-based integers (like 20190200, 20180900), I would like to replace all values ending in 00 so that they end in 01, so that I can convert them afterwards to readable timestamps.
I have the following code:
from pyspark.sql.types import StringType
import pyspark.sql.functions as sf
udf = sf.udf(lambda x: x.replace("00","01"), StringType())
sdf.withColumn('date_k', udf(sf.col("date_k"))).show()
I also tried:
sdf.withColumn('date_k',sf.regexp_replace(sf.col('date_k').cast('string').substr(1, 9),'00','01'))
The problem is that this doesn't work for a value such as 20200100, as it produces 20201101.
I also tried '\\00', '01', but it does not work. What is the right way to use a regex for this purpose?

Try this. You can use $ to match strings ending in 00 and regexp_replace to replace that ending with 01:
# Input DF
df.show(truncate=False)
# +--------+
# |value |
# +--------+
# |20190200|
# |20180900|
# |20200100|
# |20200176|
# +--------+
df.withColumn("value", F.col('value').cast( StringType()))\
.withColumn("value", F.when(F.col('value').rlike("(00$)"), F.regexp_replace(F.col('value'),r'(00$)','01')).otherwise(F.col('value'))).show()
# +--------+
# | value|
# +--------+
# |20190201|
# |20180901|
# |20200101|
# |20200176|
# +--------+
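Once the trailing 00 has been replaced, the readable timestamp the question asks for can be obtained with to_date and the yyyyMMdd pattern. A minimal sketch, reusing the fixed value column from the snippet above (the format string is an assumption based on the sample data):
import pyspark.sql.functions as F

# Parse the cleaned yyyyMMdd strings into a proper DateType column
df.withColumn("value_date", F.to_date(F.col("value"), "yyyyMMdd")).show()
# e.g. 20190201 -> 2019-02-01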

This is how I solved it.
Explanation: first cut the number into a first part excluding the last two digits and a second part containing the last two digits, regex-replace the second part, then concat both parts.
import pyspark.sql.functions as f
df = spark.sql("""
select 20200100 as date
union
select 20311100 as date
""")
df.show()
"""
+--------+
| date|
+--------+
|20311100|
|20200100|
+--------+
"""
df.withColumn("date_k", f.expr("""concat(substring(cast(date as string), 0,length(date)-2),
regexp_replace(substring(cast(date as string), length(date)-1,length(date)),'00','01'))""")).show()
"""
+--------+--------+
| date| date_k|
+--------+--------+
|20311100|20311101|
|20200100|20200101|
+--------+--------+
"""

Related

How to replace any instances of an integer with NULL in a column meant for strings using PySpark?

Notice: this is for Spark version 2.1.1.2.6.1.0-129
I have a Spark dataframe. One of the columns contains states as type string (e.g. Illinois, California, Nevada). There are some instances of numbers in this column (e.g. 12, 24, 01, 2). I would like to replace any instance of an integer with a NULL.
The following is some code that I have written:
my_df = my_df.selectExpr(
    "regexp_replace(states, '^-?[0-9]+$', '') AS states",
    "someOtherColumn")
This regex expression replaces any instance of an integer with an empty string. I would like to replace it with None in python to designate it as a NULL value in the DataFrame.
I strongly suggest you look at the PySpark SQL functions and try to use them properly instead of selectExpr:
from pyspark.sql import functions as F

(df
 .withColumn('states_fixed',
             F.when(F.regexp_replace(F.col('states'), '^-?[0-9]+$', '') == '', None)
              .otherwise(F.col('states')))
 .show()
)
# Output
# +----------+------------+
# | states|states_fixed|
# +----------+------------+
# | Illinois| Illinois|
# | 12| null|
# |California| California|
# | 01| null|
# | Nevada| Nevada|
# +----------+------------+
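A slightly shorter variant of the same idea is to test the pattern directly with rlike and null out the matching rows, instead of comparing the result of regexp_replace to an empty string. A sketch, assuming the same DataFrame and column names as above:
from pyspark.sql import functions as F

# Rows whose value is purely an (optionally signed) integer become null
df = df.withColumn(
    'states_fixed',
    F.when(F.col('states').rlike('^-?[0-9]+$'), None).otherwise(F.col('states'))
)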

Extracting URL parameters using Python and PySpark

Say I have a column filled with URLs like in the following:
+------------------------------------------+
|url |
+------------------------------------------+
|https://www.example1.com?param1=1&param2=a|
|https://www.example2.com?param1=2&param2=b|
|https://www.example3.com?param1=3&param2=c|
+------------------------------------------+
What would be the best way of extracting the URL parameters from this column and adding them as columns to the dataframe to produce the below?
+-------------------------------------------+-------+-------+
|url                                        |param1 |param2 |
+-------------------------------------------+-------+-------+
|https://www.example1.com?param1=1&param2=a |1      |a      |
|https://www.example2.com?param1=2&param2=b |2      |b      |
|https://www.example3.com?param1=3&param2=c |3      |c      |
|etc...                                     |etc... |etc... |
+-------------------------------------------+-------+-------+
My Attempts
I can think of two possible methods of doing this: using functions.regexp_extract from the pyspark library, or using urllib.parse.parse_qs and urllib.parse.urlparse from the standard library. The former solution uses regex, which is a finicky way of extracting parameters from strings, but the latter would need to be wrapped in a UDF to be used.
from pyspark.sql import *
from pyspark.sql import functions as fn
df = spark.createDataFrame(
    [
        ("https://www.example.com?param1=1&param2=a",),
        ("https://www.example2.com?param1=2&param2=b",),
        ("https://www.example3.com?param1=3&param2=c",)
    ],
    ["url"]
)
Regex solution:
df2 = df.withColumn("param1", fn.regexp_extract('url', r'param1=(\d)', 1))
df2 = df2.withColumn("param2", fn.regexp_extract('url', r'param2=([a-z])', 1))
df2.show()
>> +------------------------------------------+------+------+
>> |url |param1|param2|
>> +------------------------------------------+------+------+
>> |https://www.example1.com?param1=1&param2=a|1 |a |
>> |https://www.example2.com?param1=2&param2=b|2 |b |
>> |https://www.example3.com?param1=3&param2=c|3 |c |
>> +------------------------------------------+------+------+
UDF solution:
from urllib.parse import urlparse, parse_qs
from pyspark.sql.functions import udf
from pyspark.sql.types import MapType, StringType

extract_params = udf(lambda x: {k: v[0] for k, v in parse_qs(urlparse(x).query).items()},
                     MapType(StringType(), StringType()))
df3 = df.withColumn("params", extract_params(df.url))
df3.withColumn(
    "param1", df3.params['param1']
).withColumn(
    "param2", df3.params['param2']
).drop("params").show()
>> +------------------------------------------+------+------+
>> |url |param1|param2|
>> +------------------------------------------+------+------+
>> |https://www.example1.com?param1=1&param2=a|1 |a |
>> |https://www.example2.com?param1=2&param2=b|2 |b |
>> |https://www.example3.com?param1=3&param2=c|3 |c |
>> +------------------------------------------+------+------+
I'd like to use the versatility of a library like urllib but would also like the optimisability of writing it in pyspark functions. Is there a better method than the two I've tried so far?
You can use parse_url within a SQL expression (via expr or selectExpr).
Extract specific query parameter
parse_url can take a third parameter to specify the key (param) we want to extract from the URL:
df.selectExpr("*", "parse_url(url,'QUERY', 'param1')").show()
+------------------------------------------+------+
|url |param1|
+------------------------------------------+------+
|https://www.example2.com?param1=2&param2=b|2 |
|https://www.example.com?param1=1&param2=a |1 |
|https://www.example3.com?param1=3&param2=c|3 |
+------------------------------------------+------+
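Note that without an alias the new column is named after the full parse_url expression; to get a clean header like param1 shown above, alias it inside the SQL expression (a small usage note):
df.selectExpr("*", "parse_url(url, 'QUERY', 'param1') AS param1").show(truncate=False)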
Extract all query parameters to columns
If you want to extract all query parameters as new columns without having to specify their names, you can first parse the URL, then split and explode to get the parameters and their values, and finally pivot to get each parameter as a column.
import pyspark.sql.functions as F
df.withColumn("parsed_url", F.explode(F.split(F.expr("parse_url(url, 'QUERY')"), "&"))) \
.withColumn("parsed_url", F.split("parsed_url", "=")) \
.select("url",
F.col("parsed_url").getItem(0).alias("param_name"),
F.col("parsed_url").getItem(1).alias("value")
) \
.groupBy("url") \
.pivot("param_name") \
.agg(F.first("value")) \
.show()
Gives:
+------------------------------------------+------+------+
|url |param1|param2|
+------------------------------------------+------+------+
|https://www.example2.com?param1=2&param2=b|2 |b |
|https://www.example.com?param1=1&param2=a |1 |a |
|https://www.example3.com?param1=3&param2=c|3 |c |
+------------------------------------------+------+------+
Another solution, as suggested by @jxc in the comments, is to use the str_to_map function:
df.selectExpr("*", "explode(str_to_map(split(url,'[?]')[1],'&','='))") \
.groupBy('url') \
.pivot('key') \
.agg(F.first('value'))
I'll go with a UDF and a more generic output format using a map type.
from urllib.parse import urlparse, parse_qs
from pyspark.sql import functions as F, types as T

@F.udf(T.MapType(T.StringType(), T.ArrayType(T.StringType())))
def url_param_pars(url):
    parsed = urlparse(url)
    return parse_qs(parsed.query)
df_params = df.withColumn("params", url_param_pars(F.col('url')))
df_params.show(truncate=False)
+------------------------------------------+------------------------------+
|url |params |
+------------------------------------------+------------------------------+
|https://www.example.com?param1=1&param2=a |[param1 -> [1], param2 -> [a]]|
|https://www.example2.com?param1=2&param2=b|[param1 -> [2], param2 -> [b]]|
|https://www.example3.com?param1=3&param2=c|[param1 -> [3], param2 -> [c]]|
+------------------------------------------+------------------------------+
df_params.printSchema()
root
|-- url: string (nullable = true)
|-- params: map (nullable = true)
| |-- key: string
| |-- value: array (valueContainsNull = true)
| | |-- element: string (containsNull = true)
With this method, you can have any number of params.
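As a follow-up, individual parameters can then be pulled out of the map on demand; each value is an array of strings, so take its first element. A sketch using the df_params frame above:
df_params.select(
    'url',
    F.col('params').getItem('param1').getItem(0).alias('param1'),
    F.col('params').getItem('param2').getItem(0).alias('param2')
).show(truncate=False)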
You can use the split function, as follows.
from pyspark.sql import functions as f
df3 = df3.withColumn("param1", f.split(f.split(df3.url, "param1=")[1], "&")[0])
Here is one more solution which works for Spark >= 2.4, since it uses the higher-order function filter.
The next solution is based on the assumption that all the records have an identical set of query parameters:
from pyspark.sql.functions import expr, col
# get the query string for the first non null url
query = df.filter(df["url"].isNotNull()).first()["url"].split("?")[1]
# extract parameters (this should remain the same for all the records)
params = list(map(lambda p: p.split("=")[0], query.split("&")))
# you can also omit the two previous lines (query parameters autodiscovery)
# and replace them with: params = ['param1', 'param2']
# when you know beforehand the query parameters
cols = [col('url')] + [expr(f"split( \
filter( \
split(split(url,'\\\?')[1], '&'), \
p -> p like '{qp}=%' \
)[0], \
'=')[1]").alias(qp)
for qp in params]
df.select(*cols).show(10, False)
# +------------------------------------------+------+------+
# |url |param1|param2|
# +------------------------------------------+------+------+
# |https://www.example.com?param1=1&param2=a |1 |a |
# |https://www.example2.com?param1=2&param2=b|2 |b |
# |https://www.example3.com?param1=3&param2=c|3 |c |
# +------------------------------------------+------+------+
Explanation:
split(split(url,'\\\?')[1], '&') -> [param1=1, param2=a]: first split on ? to retrieve the query string, then on &. As a result we get the array [param1=1, param2=a].
filter(..., p -> p like '{qp}=%')[0] -> param1=1, param2=a, ...: apply the filter function on the items of the array from the previous step, with the predicate p -> p like '{qp}=%', where {qp} is the param name, i.e. param1=%. qp stands for the items of the params array. filter returns an array, so we just access the first item, since we know that the particular param should always exist. For the first parameter this returns param1=1, for the second param2=a, etc.
split(..., '=')[1] -> 1, a, ...: finally split on = to retrieve the value of the query parameter. Here we take the second element, since the first one is the query parameter name.
The basic idea here is that we divide the problem into two sub-problems, first get all the possible query parameters and then we extract the values for all the urls.
Why is that? You could indeed use pivot, as @blackbishop already brilliantly implemented, although I believe that wouldn't work well when the cardinality of the query parameters is very high, i.e. 500 or more unique params. That would require a big shuffle, which could consequently cause an OOM exception. On the other hand, if you already know that the cardinality of the data is low, @blackbishop's solution should be considered the ideal one for all cases.
To address that problem, it is better to first find all the query params (here I just assumed that all the queries have identical params, but the implementation for this part should be similar to the previous one) and then apply the above expression to each param to extract its value. This generates a select expression containing multiple expr expressions, although that shouldn't cause any performance issues, since select is a narrow transformation and will not cause any shuffle.

String to array in spark

I have a dataframe in PySpark with a string column containing the value [{"AppId":"APACON","ExtId":"141730"}] (the string is exactly like that in my column; it is a string, not an array).
I want to convert this to an array of struct.
Can I do that simply with native Spark functions, or do I have to parse the string or use a UDF?
sqlContext.createDataFrame(
    [(1, '[{"AppId":"APACON","ExtId":"141730"}]'),
     (2, '[{"AppId":"APACON","ExtId":"141793"}]')],
    ['idx', 'txt']
).show()
+---+--------------------+
|idx| txt|
+---+--------------------+
| 1|[{"AppId":"APACON...|
| 2|[{"AppId":"APACON...|
+---+--------------------+
With Spark 2.1 or above
You have the following data :
import pyspark.sql.functions as F
from pyspark.sql.types import *
df = sqlContext.createDataFrame(
    [(1, '[{"AppId":"APACON","ExtId":"141730"}]'),
     (2, '[{"AppId":"APACON","ExtId":"141793"}]')],
    ['idx', 'txt']
)
you can indeed use pyspark.sql.functions.from_json as follows :
schema = StructType([StructField("AppId", StringType()),
                     StructField("ExtId", StringType())])
df = df.withColumn('array',F.from_json(F.col('txt'), schema))
df.show()
+---+--------------------+---------------+
|idx| txt| array|
+---+--------------------+---------------+
| 1|[{"AppId":"APACON...|[APACON,141730]|
| 2|[{"AppId":"APACON...|[APACON,141793]|
+---+--------------------+---------------+
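On newer Spark versions (2.4+ at least), from_json also accepts an array schema, which returns a true array of structs rather than a single struct, arguably closer to what the question asks for. A minimal sketch on the same df (the version note is my own assumption, not part of the original answer):
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

array_schema = ArrayType(StructType([StructField("AppId", StringType()),
                                     StructField("ExtId", StringType())]))
df = df.withColumn('parsed', F.from_json(F.col('txt'), array_schema))
df.printSchema()
# parsed shows up as array<struct<AppId:string,ExtId:string>>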
Version < Spark 2.1
One way to bypass the issue would be to first slightly modify your input string by stripping the square brackets:
# Use regexp_extract to drop the square brackets
df = df.withColumn('txt_parsed', F.regexp_extract(F.col('txt'), '[^\\[\\]]+', 0))
df.show(truncate=False)
+---+-------------------------------------+-----------------------------------+
|idx|txt |txt_parsed |
+---+-------------------------------------+-----------------------------------+
|1 |[{"AppId":"APACON","ExtId":"141730"}]|{"AppId":"APACON","ExtId":"141730"}|
|2 |[{"AppId":"APACON","ExtId":"141793"}]|{"AppId":"APACON","ExtId":"141793"}|
+---+-------------------------------------+-----------------------------------+
Then you could use pyspark.sql.functions.get_json_object to parse the txt_parsed column:
df = df.withColumn('AppId', F.get_json_object(df.txt_parsed, '$.AppId'))
df = df.withColumn('ExtId', F.get_json_object(df.txt_parsed, '$.ExtId'))
df.select('idx', 'txt_parsed', 'AppId', 'ExtId').show()
+---+--------------------+------+------+
|idx|          txt_parsed| AppId| ExtId|
+---+--------------------+------+------+
|  1|{"AppId":"APACON"...|APACON|141730|
|  2|{"AppId":"APACON"...|APACON|141793|
+---+--------------------+------+------+
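If the array-of-struct shape from the original question is really needed, it can be rebuilt from the extracted columns. A sketch reusing the AppId and ExtId columns above:
# Wrap the two extracted fields back into an array containing a single struct
df = df.withColumn('parsed', F.array(F.struct(F.col('AppId'), F.col('ExtId'))))
df.printSchema()  # parsed: array of struct<AppId:string, ExtId:string>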

How to convert date to the first day of month in a PySpark Dataframe column?

I have the following DataFrame:
+----------+
|      date|
+----------+
|2017-11-25|
|2017-12-21|
|2017-09-12|
+----------+
Here is the code to create the above DataFrame:
import pyspark.sql.functions as f
rdd = sc.parallelize([("2017/11/25",), ("2017/12/21",), ("2017/09/12",)])
df = sqlContext.createDataFrame(rdd, ["date"]).withColumn("date", f.to_date(f.col("date"), "yyyy/MM/dd"))
df.show()
I want a new column with the first date of the month for each row, i.e. just replace the day with "01" in all the dates:
+----------+----------+
|      date|first_date|
+----------+----------+
|2017-11-25|2017-11-01|
|2017-12-21|2017-12-01|
|2017-09-12|2017-09-01|
+----------+----------+
There is a last_day function in pyspark.sql.functions; however, there is no first_day function.
I tried using date_sub to do this but it did not work: I get a 'Column is not iterable' error because the second argument to date_sub cannot be a column and has to be an integer.
f.date_sub(f.col('date'), f.dayofmonth(f.col('date')) - 1 )
You can use trunc:
import pyspark.sql.functions as f
df.withColumn("first_date", f.trunc("date", "month")).show()
+----------+----------+
| date|first_date|
+----------+----------+
|2017-11-25|2017-11-01|
|2017-12-21|2017-12-01|
|2017-09-12|2017-09-01|
+----------+----------+
You can get the beginning of the month with the trunc function (as Alper mentioned) or with the date_trunc method. The trunc function returns a date column and the date_trunc function returns a timestamp column. Suppose you have the following DataFrame:
+----------+
| some_date|
+----------+
|2017-11-25|
|2017-12-21|
|2017-09-12|
| null|
+----------+
Run the trunc and date_trunc functions:
from pyspark.sql.functions import col, date_trunc, trunc

datesDF \
    .withColumn("beginning_of_month_date", trunc(col("some_date"), "month")) \
    .withColumn("beginning_of_month_time", date_trunc("month", col("some_date"))) \
    .show()
Observe the result:
+----------+-----------------------+-----------------------+
| some_date|beginning_of_month_date|beginning_of_month_time|
+----------+-----------------------+-----------------------+
|2017-11-25| 2017-11-01| 2017-11-01 00:00:00|
|2017-12-21| 2017-12-01| 2017-12-01 00:00:00|
|2017-09-12| 2017-09-01| 2017-09-01 00:00:00|
| null| null| null|
+----------+-----------------------+-----------------------+
Print the schema to confirm the column types:
root
|-- some_date: date (nullable = true)
|-- beginning_of_month_date: date (nullable = true)
|-- beginning_of_month_time: timestamp (nullable = true)
Scala users should use the beginningOfMonthDate and beginningOfMonthTime functions defined in spark-daria.
PySpark users should use the beginning_of_month_date and beginning_of_month_time functions defined in quinn.
Notice how the trunc function takes the column argument first and the date_trunc takes the column argument second. The trunc method is poorly named - it's part of the functions package, so it's easy to mistakenly think this function is for string truncation. It's surprising that date_trunc is returning a timestamp result... sounds like it should return a date result.
Just make sure to wrap these functions with descriptive function / UDF names so your code is readable. See here for more info.
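For example, a thin wrapper with a descriptive name is enough if you don't want to pull in a helper library (a minimal sketch, names chosen to mirror the quinn helpers):
from pyspark.sql.functions import date_trunc, trunc

def beginning_of_month_date(col):
    # DateType result, wraps trunc(col, "month")
    return trunc(col, "month")

def beginning_of_month_time(col):
    # TimestampType result, wraps date_trunc("month", col)
    return date_trunc("month", col)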
I suppose it is a syntax error. Can you change f.dayofmonth -> dayofmonth and try? The expression looks fine.
import pyspark.sql.functions as f
from pyspark.sql.functions import dayofmonth

f.date_sub(f.col('Match_date'), dayofmonth(f.col('Match_date')) - 1)
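If your version of date_sub only accepts an integer as the second argument (older PySpark), a workaround is to push the whole expression into SQL with expr, where a column expression is accepted; newer Spark versions also accept a column directly. A sketch, assuming the date column from the question:
import pyspark.sql.functions as f

df.withColumn('first_date', f.expr("date_sub(date, dayofmonth(date) - 1)")).show()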

Truncate a string with pyspark

I am currently working with PySpark on Databricks and I was looking for a way to truncate a string, just like the Excel RIGHT function does.
For example, I would like to change an ID value in a DataFrame column from 8841673_3 into 8841673.
Does anybody know how I should proceed?
Regular expressions with regexp_extract:
from pyspark.sql.functions import regexp_extract
df = spark.createDataFrame([("8841673_3", )], ("id", ))
df.select(regexp_extract("id", r"^(\d+)_.*", 1)).show()
# +--------------------------------+
# |regexp_extract(id, ^(\d+)_.*, 1)|
# +--------------------------------+
# | 8841673|
# +--------------------------------+
regexp_replace:
from pyspark.sql.functions import regexp_replace
df.select(regexp_replace("id", "_.*$", "")).show()
# +--------------------------+
# |regexp_replace(id, _.*$, )|
# +--------------------------+
# | 8841673|
# +--------------------------+
or just split:
from pyspark.sql.functions import split
df.select(split("id", "_")[0]).show()
# +---------------+
# |split(id, _)[0]|
# +---------------+
# | 8841673|
# +---------------+
You can use the pyspark.sql.Column.substr method:
import pyspark.sql.functions as F

def left(x, n):
    # Column.substr is 1-based
    return x.substr(1, n)

def right(x, n):
    x_len = F.length(x)
    # startPos and length must be the same type, so wrap n in a literal Column
    return x.substr(x_len - n + 1, F.lit(n))
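A quick usage sketch on the example ID from the question (assuming the df with the id column created in the first answer):
df.select(
    left(F.col("id"), 7).alias("left_7"),    # 8841673
    right(F.col("id"), 1).alias("right_1")   # 3
).show()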
