I have a CSV file in which commas (,) inside the name field are escaped with a backslash (\):
id,name
"10","Ashraful\, Islam"
I am reading the CSV file with PySpark:
test = spark.read.format("csv").option("sep", ",").option("escape", "\\").option("inferSchema", "true").option("header", "true").load("test.csv")
test.show()
The name should be Ashraful, Islam, but I am getting this output:
+---+----------------+
| id|            name|
+---+----------------+
| 10|Ashraful\, Islam|
+---+----------------+
Simply use:
df = spark.read.csv('file:///mypath.../myFile.csv', sep=',', header=True)
df.show()
This gives the output:
+---+---------------+
| id|           name|
+---+---------------+
| 10|Ashraful, Islam|
+---+---------------+
EDIT: I could not replicate your problem with the input file you provided, but if it persists you can solve it with a workaround: simply replace "\," (or any other escaped special character) in the dataframe.
You can do the following:
from pyspark.sql.functions import *
df = spark.read.csv('file:///home/perfman/todel.csv', sep=',', header=True)
df.withColumn('nameClean', regexp_replace('name', '\\\,', ',')).show()
>>>
+---+----------------+---------------+
| id|            name|      nameClean|
+---+----------------+---------------+
| 10|Ashraful\, Islam|Ashraful, Islam|
+---+----------------+---------------+
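If you would rather overwrite the original column than add nameClean, the same regexp_replace can be applied in place. A minimal sketch, assuming the same file as above:
from pyspark.sql.functions import regexp_replace
df = spark.read.csv('file:///home/perfman/todel.csv', sep=',', header=True)
# strip the backslash that escapes the comma, keeping the column name unchanged
df = df.withColumn('name', regexp_replace('name', r'\\,', ','))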
I'm using the following function to parse strings to dates in PySpark:
func = udf(lambda x: parser.parse(x), DateType())
My date format is:
"22-Jan-2021 00:00"
However, this function does not work with None values. I have the following Spark dataframe:
date
----
"22-Jan-2021 00:00"
""
"10-Feb-2020 14:00"
When I apply my func to the date column I get an error on the second row of the DF saying that it can't parse NoneType. Any tips on solving this problem using PySpark and the above func?
An MVCE:
date = None
date_parsed = parser.parse(date)
It seems like you could just use the to_timestamp function.
ex.
df.show()
+---+-----------------+
| id|             date|
+---+-----------------+
|  1|22-Jan-2021 00:00|
|  2|             null|
|  3|10-Feb-2020 14:00|
+---+-----------------+
You can simply use the following code to convert the string in the date column to a timestamp type.
from pyspark.sql import functions
df = df.withColumn("date", functions.to_timestamp("date", "dd-MMM-yyyy HH:mm"))
df.show()
+---+-------------------+
| id|               date|
+---+-------------------+
|  1|2021-01-22 00:00:00|
|  2|               null|
|  3|2020-02-10 14:00:00|
+---+-------------------+
You can also verify that the conversion is done correctly with df.schema
print(df.schema)
StructType(List(StructField(id,StringType,true),StructField(date,TimestampType,true)))
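If you need to keep the UDF approach from the original question (for example because parser.parse handles irregular formats), a minimal null-safe sketch, assuming dateutil's parser as in the question, is to guard against empty values inside the lambda:
from dateutil import parser
from pyspark.sql.functions import udf
from pyspark.sql.types import DateType
# return None for null or empty strings instead of passing them to parser.parse
func = udf(lambda x: parser.parse(x) if x else None, DateType())
df = df.withColumn("date_parsed", func("date"))
Here date_parsed is just an illustrative column name; nulls and empty strings both come through as null in the result.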
I have some large (~150 GB) csv files using semicolon as the separator character. I have found that some of the fields contain an HTML-encoded ampersand, &amp;. The semicolon is getting picked up as a column separator, so I need a way to escape it or to replace &amp; with & while loading the dataframe.
As an example, I have the following csv file:
ID;FirstName;LastName
1;Chandler;Bing
2;Ross &amp; Monica;Geller
I load it using the following notebook:
df = spark.read.option("delimiter", ";").option("header","true").csv('/mnt/input/AMP test.csv')
df.show()
The result I'm getting is:
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1| Chandler|    Bing|
|  2|Ross &amp|  Monica|
+---+---------+--------+
Whereas what I'm looking for is:
+---+-------------+--------+
| ID|    FirstName|LastName|
+---+-------------+--------+
|  1|     Chandler|    Bing|
|  2|Ross & Monica|  Geller|
+---+-------------+--------+
I have tried using .option("escape", "&amp;") but that escaping only works on a single character.
Update
I have a hacky workaround using RDDs that works at least for small test files, but I'm still looking for a proper solution that escapes the string while loading the dataframe.
rdd = sc.textFile('/mnt/input/AMP test.csv')
rdd = rdd.map(lambda x: x.replace('&amp;', '&'))
rdd.coalesce(1).saveAsTextFile("/mnt/input/AMP test escaped.csv")
df = spark.read.option("delimiter", ";").option("header","true").csv('/mnt/input/AMP test escaped.csv')
df.show()
You can do that with dataframes directly. It helps if you have at least one file that you know does not contain any &amp;, so you can retrieve the schema from it.
Let's assume such a file exists and its path is "valid.csv".
from pyspark.sql import functions as F
# I acquire a valid file without the &amp; problem to get a nice schema
schm = spark.read.csv("valid.csv", header=True, inferSchema=True, sep=";").schema
df = spark.read.text("/mnt/input/AMP test.csv")
# I assume you have several files, so I remove all the headers.
# I do not need them as I already have my schema in schm.
header = df.first().value
df = df.where(F.col("value") != header)
# I replace "&amp;" with "&", and split the column
df = df.withColumn(
    "value", F.regexp_replace(F.col("value"), "&amp;", "&")
).withColumn(
    "value", F.split("value", ";")
)
# I explode the array in several columns and add types based on schm defined previously
df = df.select(
    *(
        F.col("value").getItem(i).cast(col.dataType).alias(col.name)
        for i, col in enumerate(schm)
    )
)
Here is the result:
df.show()
+---+-------------+--------+
| ID|    FirstName|LastName|
+---+-------------+--------+
|  1|     Chandler|    Bing|
|  2|Ross & Monica|  Geller|
+---+-------------+--------+
df.printSchema()
root
|-- ID: integer (nullable = true)
|-- FirstName: string (nullable = true)
|-- LastName: string (nullable = true)
I think there isn't a way to escape this complex sequence &amp; using only spark.read.csv, so the solution is like your "workaround":
rdd.map: this function replaces the value &amp; with & in all columns.
It is not necessary to save your RDD to a temporary path; just pass it as the csv parameter:
rdd = sc.textFile("your_path").map(lambda x: x.replace("&amp;", "&"))
df = spark.read.csv(rdd, header=True, sep=";")
df.show()
+---+-------------+--------+
| ID|    FirstName|LastName|
+---+-------------+--------+
|  1|     Chandler|    Bing|
|  2|Ross & Monica|  Geller|
+---+-------------+--------+
I have created a dataframe as shown
import ast
from pyspark.sql.functions import udf, col
values = [(u"['2','4','713']",10),(u"['12','245']",20),(u"['101','12']",30)]
df = sqlContext.createDataFrame(values,['list','A'])
df.show()
+---------------+---+
|           list|  A|
+---------------+---+
|['2','4','713']| 10|
|   ['12','245']| 20|
|   ['101','12']| 30|
+---------------+---+
How can I convert the above dataframe so that each element in the list is a float and is inside a proper list?
I tried the following:
def df_amp_conversion(df_modelamp):
    string_list_to_list = udf(lambda row: ast.literal_eval(str(row)))
    df_modelamp = df_modelamp.withColumn('float_list', string_list_to_list(col("list")))

df2 = df_amp_conversion(df)
But the data remains the same without a change.
I don't want to convert the dataframe to pandas or use collect, as it is memory intensive.
If possible, give me an optimal solution. I am using PySpark.
That's because you forgot about the return type:
udf(lambda row: ast.literal_eval(str(row)), "array<integer>")
Though something like this would be more efficient:
from pyspark.sql.functions import regexp_replace, split
df = spark.createDataFrame(["""u'[23,4,77,890,4]"""], "string").toDF("list")
df.select(split(
    regexp_replace("list", "^u'\\[|\\]$", ""), ","
).cast("array<integer>").alias("list")).show()
# +-------------------+
# |               list|
# +-------------------+
# |[23, 4, 77, 890, 4]|
# +-------------------+
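Since the question asks for float elements and its list values are quoted (e.g. ['2','4','713']), a hedged variation of the same built-in approach strips the quotes as well and casts to array<float>:
from pyspark.sql.functions import regexp_replace, split
df = spark.createDataFrame([(u"['2','4','713']", 10)], ['list', 'A'])
df.select(split(
    # drop the brackets and the single quotes around each element
    regexp_replace("list", "[\\[\\]']", ""), ","
).cast("array<float>").alias("float_list")).show()
# +-----------------+
# |       float_list|
# +-----------------+
# |[2.0, 4.0, 713.0]|
# +-----------------+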
I can produce the correct result in Python 3 with a small change in the definition of the df_amp_conversion function. You didn't return the value of df_modelamp! This code works properly for me:
import ast
from pyspark.sql.functions import udf, col
values = [(u"['2','4','713']",10),(u"['12','245']",20),(u"['101','12']",30)]
df = sqlContext.createDataFrame(values,['list','A'])
def df_amp_conversion(df_modelamp):
    string_list_to_list = udf(lambda row: ast.literal_eval(str(row)))
    df_modelamp = df_modelamp.withColumn('float_list', string_list_to_list(col("list")))
    return df_modelamp
df2 = df_amp_conversion(df)
df2.show()
# +---------------+---+-----------+
# |           list|  A| float_list|
# +---------------+---+-----------+
# |['2','4','713']| 10|[2, 4, 713]|
# |   ['12','245']| 20|  [12, 245]|
# |   ['101','12']| 30|  [101, 12]|
# +---------------+---+-----------+
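Note that string_list_to_list above has no declared return type, so it defaults to StringType and float_list is not a real array of floats. If you actually need floats, one possible tweak (my assumption, not part of the original answer) is to declare the type and convert inside the lambda:
string_list_to_list = udf(
    lambda row: [float(v) for v in ast.literal_eval(str(row))],
    "array<float>"
)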
I have the following file, which was supposed to be a JSON file, but it has a string slapped right before the actual JSON content (they are separated by a tab!):
string_smth\t{id:"str", num:0}
string_smth1\t{id:"str2", num:1}
string_smth2\t{id:"str3", num:2}
string_smth3\t{id:"str4", num:3}
Doing the following returns null for all columns:
import pyspark.sql
from pyspark.sql.types import *
schema = StructType([
    StructField("id", StringType()),
    StructField("num", IntegerType())
])
df = spark.read.json("hdfs:///path/files.json/*", schema=schema)
df.show()
+----+----+
|  id| num|
+----+----+
|null|null|
|null|null|
|null|null|
|null|null|
+----+----+
Any way of fixing that during the spark.read.json call? If not, what are my options?
I can see several issues in your file, but maybe it is just a problem related to your example.
I created an RDD:
a = sc.parallelize(['string_smth\t{"id":"str","num":0}',
                    'string_smth1\t{"id":"str2","num":1}',
                    'string_smth2\t{"id":"str3","num":2}',
                    'string_smth3\t{"id":"str4","num":3}'])
In your case, replace this sc.parallelize with sc.textFile(path_to_file) to acquire the file you need.
As you can see, the id is enclosed in double quotes; that is how JSON is supposed to look in string form. Also, technically, there is no space after the comma. What does your original file look like, exactly?
Then, just do this:
import json
schema = StructType([
    StructField("id", StringType()),
    StructField("num", IntegerType())
])
a.map(lambda x: json.loads(x.split('\t')[1])).toDF(schema).show()
+----+---+
|  id|num|
+----+---+
| str|  0|
|str2|  1|
|str3|  2|
|str4|  3|
+----+---+
JSON, structs, and case classes don't need a schema to be created explicitly; Spark can infer it.
You can use sparkContext's textFile API to read the text file and parse the lines to get valid JSON strings:
rdd = sc.textFile("path to the csv file")\
    .map(lambda line: line.split("\t", 1)[1].replace("id:", "\"id\":").replace("num:", "\"num\":"))
Then finally convert the valid JSON RDD to a dataframe:
df = sqlContext.read.json(rdd)
which should give you
+----+---+
|id |num|
+----+---+
|str |0 |
|str2|1 |
|str3|2 |
|str4|3 |
+----+---+
A potential solution would be to split on the '{' character for each line:
json_lin = '{' + 'string_smth {id:"str", num:0}'.split('{')[-1]
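To apply that idea at dataframe scale, here is a rough sketch (assuming Spark 2.4+ for element_at, and that each line contains a single '{'): recover the JSON part after the first '{' and parse it with from_json, enabling allowUnquotedFieldNames because the keys are not quoted:
from pyspark.sql import functions as F
raw = spark.read.text("hdfs:///path/files.json/*")
# keep everything from the first '{' onwards, mirroring the split('{') idea
json_str = F.concat(F.lit("{"), F.element_at(F.split("value", "\\{"), -1))
df = raw.select(
    F.from_json(json_str, "id STRING, num INT",
                {"allowUnquotedFieldNames": "true"}).alias("j")
).select("j.*")
df.show()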
I have the following udf function applied to a PySpark dataframe. The code works fine, except that when myFun1('oldColumn') is null I want the output to be an empty string instead of null.
myFun1 = udf(lambda x: myModule.myFunction1(x), StringType())
myDF = myDF.withColumn('newColumn', myFun1('oldColumn'))
Is it possible to do this in place instead of creating another udf function? Thanks!
Using df.fillna() or df.na.fill() to replace null values with an empty string worked for me.
You can do replacements by column by supplying the column and value you want to replace nulls with as a parameter:
myDF = myDF.na.fill({'oldColumn': ''})
The PySpark docs have an example:
>>> df4.na.fill({'age': 50, 'name': 'unknown'}).show()
+---+------+-------+
|age|height|   name|
+---+------+-------+
| 10|    80|  Alice|
|  5|  null|    Bob|
| 50|  null|    Tom|
| 50|  null|unknown|
+---+------+-------+
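Alternatively, if you want the empty string produced in the same withColumn call rather than in a second fillna pass, one possible sketch (not from the answer above) wraps the UDF in coalesce:
from pyspark.sql.functions import coalesce, lit
# nulls returned by the UDF are replaced with '' in the same expression
myDF = myDF.withColumn('newColumn', coalesce(myFun1('oldColumn'), lit('')))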