Replace string in PySpark

I have a dataframe with numbers in European format, which I imported as a string: comma as the decimal separator and dot as the thousands separator.
from pyspark.sql.functions import regexp_replace,col
from pyspark.sql.types import FloatType
df = spark.createDataFrame([('-1.269,75',)], ['revenue'])
df.show()
+---------+
|  revenue|
+---------+
|-1.269,75|
+---------+
df.printSchema()
root
|-- revenue: string (nullable = true)
Output desired:
df.show()
+--------+
| revenue|
+--------+
|-1269.75|
+--------+
df.printSchema()
root
|-- revenue: float (nullable = true)
I am using the function regexp_replace to first replace the dot with an empty string, then replace the comma with a dot, and finally cast to FloatType.
df = df.withColumn('revenue', regexp_replace(col('revenue'), ".", ""))
df = df.withColumn('revenue', regexp_replace(col('revenue'), ",", "."))
df = df.withColumn('revenue', df['revenue'].cast("float"))
But when I attempt the replacement below, I get an empty string. Why? I was expecting -1269,75.
df = df.withColumn('revenue', regexp_replace(col('revenue'), ".", ""))
+-------+
|revenue|
+-------+
|       |
+-------+

You need to escape . to match it literally, as . is a special character that matches almost any character in regex:
df = df.withColumn('revenue', regexp_replace(col('revenue'), "\\.", ""))
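Putting the three steps together with the escaped dot, a minimal end-to-end sketch using the question's own data:
from pyspark.sql.functions import regexp_replace, col

df = spark.createDataFrame([('-1.269,75',)], ['revenue'])
# Drop the thousands separator (escaped literal dot), then turn the decimal comma into a dot
df = df.withColumn('revenue', regexp_replace(col('revenue'), "\\.", ""))
df = df.withColumn('revenue', regexp_replace(col('revenue'), ",", "."))
df = df.withColumn('revenue', col('revenue').cast("float"))
df.show()
+--------+
| revenue|
+--------+
|-1269.75|
+--------+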

Related

Pyspark DataFrame - Escaping &amp;

I have some large (~150 GB) csv files using semicolon as the separator character. I have found that some of the fields contain an HTML-encoded ampersand, &amp; The semicolon in it is getting picked up as a column separator, so I need a way to escape it or replace &amp; with & while loading the dataframe.
As an example, I have the following csv file:
ID;FirstName;LastName
1;Chandler;Bing
2;Ross &amp; Monica;Geller
I load it using the following notebook:
df = spark.read.option("delimiter", ";").option("header","true").csv('/mnt/input/AMP test.csv')
df.show()
The result I'm getting is:
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1| Chandler|    Bing|
|  2|Ross &amp|  Monica|
+---+---------+--------+
Whereas what I'm looking for is:
+---+-------------+--------+
| ID|    FirstName|LastName|
+---+-------------+--------+
|  1|     Chandler|    Bing|
|  2|Ross & Monica|  Geller|
+---+-------------+--------+
I have tried using .option("escape", "&amp;"), but that escaping only works on a single character.
Update
I have a hacky workaround using RDDs that works at least for small test files, but I'm still looking for a proper solution that escapes the string while loading the dataframe.
rdd = sc.textFile('/mnt/input/AMP test.csv')
rdd = rdd.map(lambda x: x.replace('&amp;', '&'))
rdd.coalesce(1).saveAsTextFile("/mnt/input/AMP test escaped.csv")
df = spark.read.option("delimiter", ";").option("header","true").csv('/mnt/input/AMP test escaped.csv')
df.show()
You can do that with dataframes directly. It helps if you have at least one file that you know does not contain any &amp;, to retrieve the schema.
Let's assume such a file exists and that its path is "valid.csv".
from pyspark.sql import functions as F
# I acquire a valid file without the broken &amp; data to get a nice schema
schm = spark.read.csv("valid.csv", header=True, inferSchema=True, sep=";").schema
df = spark.read.text("/mnt/input/AMP test.csv")
# I assume you have several files, so I remove all the headers.
# I do not need them as I already have my schema in schm.
header = df.first().value
df = df.where(F.col("value") != header)
# I replace "&amp;" with "&", then split the column
df = df.withColumn(
    "value", F.regexp_replace(F.col("value"), "&amp;", "&")
).withColumn(
    "value", F.split("value", ";")
)
# I explode the array into several columns and add types based on schm defined previously
df = df.select(
    *(
        F.col("value").getItem(i).cast(col.dataType).alias(col.name)
        for i, col in enumerate(schm)
    )
)
Here is the result:
df.show()
+---+-------------+--------+
| ID|    FirstName|LastName|
+---+-------------+--------+
|  1|     Chandler|    Bing|
|  2|Ross & Monica|  Geller|
+---+-------------+--------+
df.printSchema()
root
|-- ID: integer (nullable = true)
|-- FirstName: string (nullable = true)
|-- LastName: string (nullable = true)
I think there isn't a way to escape this complex char &amp; using only spark.read.csv, so the solution is like your workaround:
rdd.map: this function replaces the value &amp; with & in all columns.
It is not necessary to save your RDD to a temporary path; just pass it as the csv parameter:
rdd = sc.textFile("your_path").map(lambda x: x.replace("&amp;", "&"))
df = spark.read.csv(rdd, header=True, sep=";")
df.show()
+---+-------------+--------+
| ID|    FirstName|LastName|
+---+-------------+--------+
|  1|     Chandler|    Bing|
|  2|Ross & Monica|  Geller|
+---+-------------+--------+

How can I turn off rounding in Spark?

I have a dataframe and I'm doing this:
df = dataframe.withColumn("test", lit(0.4219759403))
I want to get just the first four numbers after the dot, without rounding.
When I cast to DecimalType with .cast(DataTypes.createDecimalType(20,4)), or even with the round function, this number is rounded to 0.4220.
The only way that I found without rounding is applying the function format_number(), but this function gives me a string, and when I cast this string to DecimalType(20,4), the framework rounds the number again to 0.4220.
I need to convert this number to DecimalType(20,4) without rounding, and I expect to see 0.4219.
If you have numbers with more than one digit before the decimal point, a substring-based approach (like the one in the answer further down) is not suitable. Instead, you can use a regex to always extract the first 4 decimal digits (if present).
You can do this using regexp_extract
df = dataframe.withColumn('rounded', F.regexp_extract(F.col('test'), r'\d+\.\d{0,4}', 0))
Example
import pyspark.sql.functions as F
dataframe = spark.createDataFrame([
    (0.4219759403, ),
    (0.4, ),
    (1.0, ),
    (0.5431293, ),
    (123.769859, )
], ['test'])
df = dataframe.withColumn('rounded', F.regexp_extract(F.col('test'), r'\d+\.\d{0,4}', 0))
df.show()
+------------+--------+
|        test| rounded|
+------------+--------+
|0.4219759403|  0.4219|
|         0.4|     0.4|
|         1.0|     1.0|
|   0.5431293|  0.5431|
|  123.769859|123.7698|
+------------+--------+
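regexp_extract returns a string; if you still need DecimalType(20,4) as in the question, casting the already-truncated value will not round it any further. A small follow-up sketch:
from pyspark.sql.types import DecimalType

# "0.4219" -> 0.4219: the 4 decimal digits are already exact, so no rounding occurs
df = df.withColumn('rounded', F.col('rounded').cast(DecimalType(20, 4)))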
Hi, welcome to Stack Overflow. Next time, please try to provide a reproducible example with the code you tried. Anyway, this works for me:
import pyspark.sql.functions as F
from pyspark.sql.functions import lit
from pyspark.sql.types import DecimalType

df = spark.createDataFrame([
    (1, "a"),
    (2, "b"),
    (3, "c"),
], ["ID", "Text"])
df = df.withColumn("test", lit(0.4219759403))
# Take the first 6 characters of the string form ("0.4219"), then cast without further rounding
df = df.withColumn("test_string", F.substring(df["test"].cast("string"), 0, 6))
df = df.withColumn("test_string_decimaltype", df["test_string"].cast(DecimalType(20,4)))
df.show()
df.printSchema()
+---+----+------------+-----------+-----------------------+
| ID|Text|        test|test_string|test_string_decimaltype|
+---+----+------------+-----------+-----------------------+
|  1|   a|0.4219759403|     0.4219|                 0.4219|
|  2|   b|0.4219759403|     0.4219|                 0.4219|
|  3|   c|0.4219759403|     0.4219|                 0.4219|
+---+----+------------+-----------+-----------------------+
root
|-- ID: long (nullable = true)
|-- Text: string (nullable = true)
|-- test: double (nullable = false)
|-- test_string: string (nullable = false)
|-- test_string_decimaltype: decimal(20,4) (nullable = true)
Of course, if you want, you can overwrite the same column by always using "test"; I chose different names to let you see the steps.

replace single quotes with double quotes in pyspark dataframe

With the code below I am writing a dataframe to a csv file.
As my dataframe contains "" for None, I have added replace("", None), because null values are supposed to be represented as None instead of "" (double quotes):
newDf.coalesce(1).replace("", None).replace("'", "\"").write.format('csv').option('nullValue', None).option('header', 'true').option('delimiter', '|').mode('overwrite').save(destination_csv)
I tried adding .replace("'", "\""), but it doesn't work.
The data also contains values with single quotes, e.g.:
Survey No. 123, 'Anjanadhri Godowns', CityName
I need to replace the single quotes in the dataframe with double quotes.
How can it be achieved?
You can use regexp_replace to replace single quotes with double quotes in all columns before writing the output:
import pyspark.sql.functions as F
df2 = df.select([F.regexp_replace(c, "'", '"').alias(c) for c in df.columns])
# then write output
# df2.coalesce(1).write(...)
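For instance, with a one-column frame holding the sample value from the question (the column name address here is just for illustration):
import pyspark.sql.functions as F

df = spark.createDataFrame([("Survey No. 123, 'Anjanadhri Godowns', CityName",)], ['address'])
# Apply the same replacement to every column of the frame
df2 = df.select([F.regexp_replace(c, "'", '"').alias(c) for c in df.columns])
df2.show(truncate=False)
+----------------------------------------------+
|address                                       |
+----------------------------------------------+
|Survey No. 123, "Anjanadhri Godowns", CityName|
+----------------------------------------------+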
Using translate
from pyspark.sql.functions import *
data_list = [(1, "'Name 1'"), (2, "'Name 2' and 'Something'")]
df = spark.createDataFrame(data = data_list, schema = ["ID", "my_col"])
# +---+--------------------+
# | ID|              my_col|
# +---+--------------------+
# |  1|            'Name 1'|
# |  2|'Name 2' and 'Som...|
# +---+--------------------+
df.withColumn('my_col', translate('my_col', "'", '"')).show()
# +---+--------------------+
# | ID|              my_col|
# +---+--------------------+
# |  1|            "Name 1"|
# |  2|"Name 2" and "Som...|
# +---+--------------------+
This will replace all occurrences of the single quote character with a double quotation mark in the column my_col.

Change datatype but return null value for Dataframe

I'm new to Pyspark 3.0, and I have this homework where I need to change the string column (geolocation) to a numeric tuple data type (geolocation1).
Here is my code:
from pyspark.sql.functions import *
from pyspark.sql.types import *
df = df2.withColumn('geolocation1', col('geolocation').cast('double'))
Output:
| geolocation                  | geolocation1 |
| ---------------------------- | ------------ |
| (-37.80899950, 140.96004459) | null         |
| (-37.80899952, 140.96004451) | null         |
What have I done wrong here?
If you have a string like that, you can remove the parentheses and split by comma, then cast to array<double>:
import pyspark.sql.functions as F
df = df2.withColumn(
    'geolocation1',
    F.split(
        F.regexp_replace('geolocation', r'[\( \)]', ''),
        ','
    ).cast('array<double>')
)
df.show(truncate=False)
+----------------------------+---------------------------+
|geolocation                 |geolocation1               |
+----------------------------+---------------------------+
|(-37.80899950, 140.96004459)|[-37.8089995, 140.96004459]|
+----------------------------+---------------------------+
df.printSchema()
root
|-- geolocation: string (nullable = false)
|-- geolocation1: array (nullable = false)
| |-- element: double (containsNull = true)
I would like to give some suggestions before answering this.
First, you need to understand what a double type is. Here you are blindly converting a string that also contains non-numeric characters to a numeric format, so internally Spark throws an exception, which is caught, and null is populated as the output.
And as I understand from the name of the field, it's a geolocation, which is a combination of latitude and longitude. So I assume whoever gave you this homework wants these two values as new columns. If my assumption is correct, below is one of the ways to achieve it.
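A minimal sketch of that approach (the latitude and longitude column names are assumptions, not from the original):
import pyspark.sql.functions as F

# Strip parentheses and spaces, split on the comma, then cast each part to double
df = df2.withColumn(
    'coords', F.split(F.regexp_replace('geolocation', r'[\( \)]', ''), ',')
).withColumn(
    'latitude', F.col('coords')[0].cast('double')
).withColumn(
    'longitude', F.col('coords')[1].cast('double')
).drop('coords')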

Pyspark - struct to string to multiple columns

I have a dataframe with a schema as follows:
root
|-- column: struct (nullable = true)
| |-- column-string: string (nullable = true)
|-- count: long (nullable = true)
What I want to do is:
Get rid of the struct, by which I mean "promote" column-string, so my dataframe has only 2 columns: column-string and count.
I then want to split column-string into 3 different columns, so I end up with a schema of three string columns plus count.
The text within column-string always fits the format:
Some-Text,Text,MoreText
Does anyone know how this is possible?
I'm using Pyspark Python.
PS. I am new to Pyspark & I don't know much about the struct format and couldn't find how to write an example into my post to make it reproducible - sorry.
You can also use from_csv to convert the comma-delimited string into a struct, and then star expand the struct:
import pyspark.sql.functions as F
df2 = df.withColumn(
    'col',
    F.from_csv(
        'column.column-string',
        '`column-string` string, `column-string2` string, `column-string3` string'
    )
).select('col.*', 'count')
df2.show()
+-------------+--------------+--------------+-----+
|column-string|column-string2|column-string3|count|
+-------------+--------------+--------------+-----+
|     SomeText|          Text|      MoreText|    1|
+-------------+--------------+--------------+-----+
Note that it's better not to have hyphens in column names because they are reserved for subtraction. Underscores are better.
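As a quick illustration of why hyphens are awkward, referencing such a column in an expression requires backticks:
# Without backticks, Spark would parse this as "column minus string"
df2.select(F.expr('`column-string`')).show()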
You can select the column-string field from the struct using column.column-string, then simply split by a comma to get three columns:
from pyspark.sql import functions as F
df1 = df.withColumn(
    "column_string", F.split(F.col("column.column-string"), ",")
).select(
    F.col("column_string")[0].alias("column-string"),
    F.col("column_string")[1].alias("column-string2"),
    F.col("column_string")[2].alias("column-string3"),
    F.col("count")
)
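With the same sample row as in the previous answer, this should produce the same three columns plus count:
df1.show()
+-------------+--------------+--------------+-----+
|column-string|column-string2|column-string3|count|
+-------------+--------------+--------------+-----+
|     SomeText|          Text|      MoreText|    1|
+-------------+--------------+--------------+-----+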
