replace single quotes with double quotes in pyspark dataframe - python

I am writing a DataFrame to a CSV file with the code below.
My DataFrame contains "" where values are missing, so I added replace("", None), because null values are supposed to be represented as None rather than "" (an empty pair of double quotes).
newDf.coalesce(1).replace("", None).replace("'", "\"").write.format('csv').option('nullValue', None).option('header', 'true').option('delimiter', '|').mode('overwrite').save(destination_csv)
I tried adding .replace("'", "\""), but it does not work.
Some of the values contain single quotes, for example:
Survey No. 123, 'Anjanadhri Godowns', CityName
I need to replace the single quotes in the DataFrame with double quotes.
How can this be achieved?

You can use regexp_replace to replace single quotes with double quotes in all columns before writing the output:
import pyspark.sql.functions as F
df2 = df.select([F.regexp_replace(c, "'", '"').alias(c) for c in df.columns])
# then write output
# df2.coalesce(1).write(...)
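The original .replace("'", "\"") call has no visible effect because DataFrame.replace compares whole cell values, whereas regexp_replace rewrites matches inside each string. The plain-Python sketch below models that difference on a list of strings. The helper names replace_whole_value and replace_substring are hypothetical and only stand in for the two Spark behaviours.

```python
import re

def replace_whole_value(values, to_replace, value):
    # Models DataFrame.replace: a cell changes only if it equals to_replace exactly.
    return [value if v == to_replace else v for v in values]

def replace_substring(values, pattern, replacement):
    # Models regexp_replace: every match inside the string is rewritten.
    return [re.sub(pattern, replacement, v) for v in values]

cells = ["Survey No. 123, 'Anjanadhri Godowns', CityName", "'"]

whole = replace_whole_value(cells, "'", '"')
inside = replace_substring(cells, "'", '"')

print(whole)   # the address is unchanged; only the cell that is exactly ' changes
print(inside)  # every single quote in every cell becomes a double quote
```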

Using translate
from pyspark.sql.functions import translate
data_list = [(1, "'Name 1'"), (2, "'Name 2' and 'Something'")]
df = spark.createDataFrame(data=data_list, schema=["ID", "my_col"])
# +---+--------------------+
# | ID|              my_col|
# +---+--------------------+
# |  1|            'Name 1'|
# |  2|'Name 2' and 'Som...|
# +---+--------------------+
df.withColumn('my_col', translate('my_col', "'", '"')).show()
# +---+--------------------+
# | ID|              my_col|
# +---+--------------------+
# |  1|            "Name 1"|
# |  2|"Name 2" and "Som...|
# +---+--------------------+
This will replace all occurrences of the single quote character with a double quotation mark in the column my_col.
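translate maps characters one-to-one, so no regex escaping is involved. As a rough analogy only, Python's built-in str.translate does the same on a single string; the sample value is taken from the example above.

```python
# str.maketrans builds a character-to-character table: ' becomes "
quote_table = str.maketrans("'", '"')

value = "'Name 2' and 'Something'"
converted = value.translate(quote_table)
print(converted)
```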

Related

How to replace any instances of an integer with NULL in a column meant for strings using PySpark?

Notice: this is for Spark version 2.1.1.2.6.1.0-129
I have a Spark DataFrame. One of the columns holds states as strings (e.g. Illinois, California, Nevada). There are some numbers in this column (e.g. 12, 24, 01, 2). I would like to replace any instance of an integer with NULL.
The following is some code that I have written:
my_df = my_df.selectExpr(
" regexp_replace(states, '^-?[0-9]+$', '') AS states ",
"someOtherColumn")
This regex expression replaces any instance of an integer with an empty string. I would like to replace it with None in python to designate it as a NULL value in the DataFrame.
I strongly suggest looking at the PySpark SQL functions and using them instead of selectExpr:
from pyspark.sql import functions as F
# The result goes into a new column, states_fixed, so it can be compared with the original
(df
    .withColumn('states_fixed', F
        .when(F.regexp_replace(F.col('states'), '^-?[0-9]+$', '') == '', None)
        .otherwise(F.col('states'))
    )
    .show()
)
# Output
# +----------+------------+
# |    states|states_fixed|
# +----------+------------+
# |  Illinois|    Illinois|
# |        12|        null|
# |California|  California|
# |        01|        null|
# |    Nevada|      Nevada|
# +----------+------------+
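The pattern ^-?[0-9]+$ matches only when the whole value is an optionally signed integer, which is why names such as Illinois pass through untouched. A minimal plain-Python sketch of the same rule, using the re module rather than Spark (the function name null_if_integer is hypothetical):

```python
import re

INTEGER_ONLY = re.compile(r"^-?[0-9]+$")

def null_if_integer(value):
    # Return None (the stand-in for NULL) when the entire string is an integer.
    if value is not None and INTEGER_ONLY.match(value):
        return None
    return value

states = ["Illinois", "12", "California", "01", "Nevada"]
states_fixed = [null_if_integer(s) for s in states]
print(states_fixed)
```

A value that merely contains digits, such as "Route 66", is kept because the anchors require the match to span the whole string.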

Pyspark DataFrame - Escaping &

I have some large (~150 GB) CSV files that use a semicolon as the separator character. Some of the fields contain an HTML-encoded ampersand, &amp;. Its semicolon is picked up as a column separator, so I need a way to escape it or to replace &amp; with & while loading the DataFrame.
As an example, I have the following csv file:
ID;FirstName;LastName
1;Chandler;Bing
2;Ross &amp; Monica;Geller
I load it using the following notebook:
df = spark.read.option("delimiter", ";").option("header","true").csv('/mnt/input/AMP test.csv')
df.show()
The result I'm getting is:
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1| Chandler|    Bing|
|  2|Ross &amp|  Monica|
+---+---------+--------+
Whereas what I'm looking for is:
+---+-------------+--------+
| ID|    FirstName|LastName|
+---+-------------+--------+
|  1|     Chandler|    Bing|
|  2|Ross & Monica|  Geller|
+---+-------------+--------+
I have tried using .option("escape", "&") but that escaping only works on a single character.
Update
I have a hacky workaround using RDDs that works at least for small test files, but I'm still looking for a proper solution that escapes the string while loading the DataFrame.
rdd = sc.textFile('/mnt/input/AMP test.csv')
rdd = rdd.map(lambda x: x.replace('&amp;', '&'))
rdd.coalesce(1).saveAsTextFile("/mnt/input/AMP test escaped.csv")
df = spark.read.option("delimiter", ";").option("header","true").csv('/mnt/input/AMP test escaped.csv')
df.show()
You can do that with DataFrames directly. It helps to have at least one file that you know does not contain any &amp;, so that you can retrieve the schema from it.
Let's assume such a file exists and its path is "valid.csv".
from pyspark.sql import functions as F
# Read a valid file (one without the problematic &amp;) to get a clean schema
schm = spark.read.csv("valid.csv", header=True, inferSchema=True, sep=";").schema
df = spark.read.text("/mnt/input/AMP test.csv")
# Assuming there are several files, remove all the header lines.
# They are not needed because the schema is already in schm.
header = df.first().value
df = df.where(F.col("value") != header)
# Replace "&amp;" with "&", then split each line on the separator
df = df.withColumn(
    "value", F.regexp_replace(F.col("value"), "&amp;", "&")
).withColumn(
    "value", F.split("value", ";")
)
# Expand the array into separate columns, casting each one to the type from schm
df = df.select(
    *(
        F.col("value").getItem(i).cast(col.dataType).alias(col.name)
        for i, col in enumerate(schm)
    )
)
Here is the result:
df.show()
+---+-------------+--------+
| ID|    FirstName|LastName|
+---+-------------+--------+
|  1|     Chandler|    Bing|
|  2|Ross & Monica|  Geller|
+---+-------------+--------+
df.printSchema()
root
 |-- ID: integer (nullable = true)
 |-- FirstName: string (nullable = true)
 |-- LastName: string (nullable = true)
I don't think there is a way to escape a multi-character sequence such as &amp; using spark.read.csv alone, so the solution is along the lines of your workaround:
rdd.map: this already replaces &amp; with & in every column.
There is no need to save the RDD to a temporary path; just pass it to spark.read.csv directly:
rdd = sc.textFile("your_path").map(lambda x: x.replace("&amp;", "&"))
df = spark.read.csv(rdd, header=True, sep=";")
df.show()
+---+-------------+--------+
| ID|    FirstName|LastName|
+---+-------------+--------+
|  1|     Chandler|    Bing|
|  2|Ross & Monica|  Geller|
+---+-------------+--------+
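Both answers rely on the same idea: decode the entity on the raw line before the line is split on the separator. The plain-Python sketch below shows why the order matters, using the sample rows from the question. It models the splitting only and is not Spark's CSV parser.

```python
raw_lines = [
    "ID;FirstName;LastName",
    "1;Chandler;Bing",
    "2;Ross &amp; Monica;Geller",
]

# Splitting first treats the ';' inside '&amp;' as a column separator.
naive = [line.split(";") for line in raw_lines]

# Decoding the entity first leaves exactly three fields per row.
cleaned = [line.replace("&amp;", "&").split(";") for line in raw_lines]

print(naive[2])
print(cleaned[2])
```

If other HTML entities can occur in the data, html.unescape from the Python standard library could replace the single str.replace call, although any entity that decodes to a semicolon would then need separate handling.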

Replacing last two characters in PySpark column

In a Spark DataFrame with a column containing date-based integers (like 20190200, 20180900), I would like to change all values ending in 00 so that they end in 01, so that I can afterwards convert them to readable timestamps.
I have the following code:
from pyspark.sql.types import StringType
import pyspark.sql.functions as sf
udf = sf.udf(lambda x: x.replace("00","01"), StringType())
sdf.withColumn('date_k', udf(sf.col("date_k"))).show()
I also tried:
sdf.withColumn('date_k',sf.regexp_replace(sf.col('date_k').cast('string').substr(1, 9),'00','01'))
The problem is that this fails for a value such as 20200100, which becomes 20201101.
I also tried '\\00', '01', but that does not work either. What is the right regex for this purpose?
Try this: use $ to anchor the match to the end of the string, so that regexp_replace only replaces a trailing 00 with 01.
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
# Input DF
df.show(truncate=False)
# +--------+
# |value   |
# +--------+
# |20190200|
# |20180900|
# |20200100|
# |20200176|
# +--------+
df.withColumn("value", F.col('value').cast(StringType()))\
    .withColumn("value", F.when(F.col('value').rlike("(00$)"), F.regexp_replace(F.col('value'),r'(00$)','01')).otherwise(F.col('value'))).show()
# +--------+
# |   value|
# +--------+
# |20190201|
# |20180901|
# |20200101|
# |20200176|
# +--------+
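The $ anchor is what stops the replacement from touching a 00 in the middle of the value. A plain-Python sketch with the re module, which treats this pattern the same way for these inputs, contrasts the unanchored and anchored versions:

```python
import re

values = ["20190200", "20180900", "20200100", "20200176"]

# Unanchored: every "00" is replaced, including the one inside 20200100.
unanchored = [v.replace("00", "01") for v in values]

# Anchored: only a "00" at the very end of the string is replaced.
anchored = [re.sub(r"00$", "01", v) for v in values]

print(unanchored)
print(anchored)
```

Because an anchored replacement leaves non-matching strings unchanged, the when/rlike guard in the answer above is a readability choice rather than a requirement.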
This is how I solved it.
Explanation: first take the leading part of the number, excluding the last two digits; then apply the regex replacement to the last two digits only; finally concatenate both parts.
import pyspark.sql.functions as f
df = spark.sql("""
select 20200100 as date
union
select 20311100 as date
""")
df.show()
"""
+--------+
|    date|
+--------+
|20311100|
|20200100|
+--------+
"""
df.withColumn("date_k", f.expr("""concat(substring(cast(date as string), 0,length(date)-2),
regexp_replace(substring(cast(date as string), length(date)-1,length(date)),'00','01'))""")).show()
"""
+--------+--------+
|    date|  date_k|
+--------+--------+
|20311100|20311101|
|20200100|20200101|
+--------+--------+
"""

Replace string in PySpark

I have a DataFrame with numbers in European format, which I imported as strings: the comma is the decimal separator and the dot is the thousands separator.
from pyspark.sql.functions import regexp_replace,col
from pyspark.sql.types import FloatType
df = spark.createDataFrame([('-1.269,75',)], ['revenue'])
df.show()
+---------+
|  revenue|
+---------+
|-1.269,75|
+---------+
df.printSchema()
root
 |-- revenue: string (nullable = true)
Output desired:
df.show()
+---------+
|  revenue|
+---------+
| -1269.75|
+---------+
df.printSchema()
root
 |-- revenue: float (nullable = true)
I am using regexp_replace to first replace the dot with an empty string, then replace the comma with a dot, and finally cast the column to float.
df = df.withColumn('revenue', regexp_replace(col('revenue'), ".", ""))
df = df.withColumn('revenue', regexp_replace(col('revenue'), ",", "."))
df = df.withColumn('revenue', df['revenue'].cast("float"))
But when I run the first replacement below, I get an empty string. Why? I was expecting -1269,75.
df = df.withColumn('revenue', regexp_replace(col('revenue'), ".", ""))
+-------+
|revenue|
+-------+
|       |
+-------+
You need to escape . to match it literally, as . is a special character that matches almost any character in regex:
df = df.withColumn('revenue', regexp_replace(col('revenue'), "\\.", ""))
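The effect is easy to reproduce outside Spark. The sketch below uses Python's re module, which treats an unescaped dot the same way for this input, and then carries the value through the remaining two steps from the question:

```python
import re

revenue = "-1.269,75"

# An unescaped dot matches every character, so nothing is left.
unescaped = re.sub(".", "", revenue)

# An escaped dot matches only the literal thousands separator.
no_thousands = re.sub(r"\.", "", revenue)

# Swap the decimal comma for a dot, then convert to a float.
as_float = float(no_thousands.replace(",", "."))

print(repr(unescaped), no_thousands, as_float)
```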

Spark dataframe update column where other colum is like with PySpark

This creates my example dataframe:
from pyspark.sql.functions import lit
df = sc.parallelize([('abc',),('def',)]).toDF()
df = df.selectExpr("_1 as one",)
df = df.withColumn("two", lit('z'))
df.show()
which looks like this:
+---+---+
|one|two|
+---+---+
|abc|  z|
|def|  z|
+---+---+
Now I want to run a series of SQL-style "where ... like" checks and append a letter to column two whenever column one matches.
In pseudocode it looks like this:
for letter in ['a','b','c','d']:
    df = df['two'].where(col('one').like("%{}%".format(letter))) += letter
finally resulting in a DataFrame that looks like this:
+---+----+
|one| two|
+---+----+
|abc|zabc|
|def|  zd|
+---+----+
If you are using a list of strings to subset your string column, broadcast variables are a good fit. Let's start with a more realistic example in which the strings contain spaces:
from pyspark.sql.functions import lit
df = sc.parallelize([('a b c',),('d e f',)]).toDF()
df = df.selectExpr("_1 as one",)
df = df.withColumn("two", lit('z'))
Then we create a broadcast variable from the list of letters and define a UDF that uses it to subset a list of strings and concatenates the result with the value in another column, returning one string:
from pyspark.sql.functions import udf
letters = ['a','b','c','d']
letters_bd = sc.broadcast(letters)
def subs(col1, col2):
    # Keep only the tokens that appear in the broadcast list
    l_subset = [x for x in col1 if x in letters_bd.value]
    return col2 + ' ' + ' '.join(l_subset)
subs_udf = udf(subs)
To apply the above, the string we are subsetting needs to be converted to a list, so we first use split() and then apply our UDF:
from pyspark.sql.functions import col, split
df.withColumn("three", split(col('one'), r'\W+')) \
    .withColumn("three", subs_udf("three", "two")) \
    .show()
+-----+---+-------+
|  one|two|  three|
+-----+---+-------+
|a b c|  z|z a b c|
|d e f|  z|    z d|
+-----+---+-------+
Or, without a UDF, use regexp_replace and concat, provided your letters fit comfortably into a regular expression:
from pyspark.sql.functions import regexp_replace, col, concat, lit
df.withColumn("three", concat(col('two'), lit(' '),
                              regexp_replace(col('one'), '[^abcd]', ' ')))
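If the goal is the exact output from the question (zabc and zd, with no spaces), one option is to replace the non-matching characters with an empty string instead of a space. The plain-Python sketch below shows the per-row logic only. The function name append_matching_letters is hypothetical, and the result keeps the letters in the order and multiplicity in which they occur in column one.

```python
import re

def append_matching_letters(one, two, letters="abcd"):
    # Drop every character of `one` that is not in `letters`, then append the rest to `two`.
    kept = re.sub("[^" + re.escape(letters) + "]", "", one)
    return two + kept

rows = [("abc", "z"), ("def", "z")]
result = [(one, append_matching_letters(one, two)) for one, two in rows]
print(result)
```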
