I have some large (~150 GB) CSV files that use a semicolon as the separator character. I have found that some of the fields contain an HTML-encoded ampersand, &amp; . The semicolon is getting picked up as a column separator, so I need a way to escape it or replace &amp; with & while loading the dataframe.
As an example, I have the following csv file:
ID;FirstName;LastName
1;Chandler;Bing
2;Ross &amp; Monica;Geller
I load it using the following notebook:
df = spark.read.option("delimiter", ";").option("header","true").csv('/mnt/input/AMP test.csv')
df.show()
The result I'm getting is:
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1| Chandler|    Bing|
|  2|Ross &amp|  Monica|
+---+---------+--------+
Whereas what I'm looking for is:
+---+-------------+--------+
| ID|    FirstName|LastName|
+---+-------------+--------+
|  1|     Chandler|    Bing|
|  2|Ross & Monica|  Geller|
+---+-------------+--------+
I have tried using .option("escape", "&amp;") but that escape option only works with a single character.
Update
I have a hacky workaround using RDDs that works, at least for small test files, but I'm still looking for a proper solution that escapes the string while loading the dataframe.
rdd = sc.textFile('/mnt/input/AMP test.csv')
rdd = rdd.map(lambda x: x.replace('&amp;', '&'))
rdd.coalesce(1).saveAsTextFile("/mnt/input/AMP test escaped.csv")
df = spark.read.option("delimiter", ";").option("header","true").csv('/mnt/input/AMP test escaped.csv')
df.show()
You can do that with dataframes directly. It helps if you have at least one file that you know does not contain any &amp;, so you can retrieve the schema from it.
Let's assume such a file exists and its path is "valid.csv".
from pyspark.sql import functions as F
# I read a valid file without any &amp; to get a nice schema
schm = spark.read.csv("valid.csv", header=True, inferSchema=True, sep=";").schema
df = spark.read.text("/mnt/input/AMP test.csv")
# I assume you have several files, so I remove all the headers.
# I do not need them as I already have my schema in schm.
header = df.first().value
df = df.where(F.col("value") != header)
# I replace "&amp;" with "&", and split the column
df = df.withColumn(
    "value", F.regexp_replace(F.col("value"), "&amp;", "&")
).withColumn(
    "value", F.split("value", ";")
)
# I explode the array into several columns and add types based on the schm defined previously
df = df.select(
    *(
        F.col("value").getItem(i).cast(col.dataType).alias(col.name)
        for i, col in enumerate(schm)
    )
)
Here is the result:
df.show()
+---+-------------+--------+
| ID|    FirstName|LastName|
+---+-------------+--------+
|  1|     Chandler|    Bing|
|  2|Ross & Monica|  Geller|
+---+-------------+--------+
df.printSchema()
root
|-- ID: integer (nullable = true)
|-- FirstName: string (nullable = true)
|-- LastName: string (nullable = true)
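If you are on Spark 3.0 or later, where from_csv is available, a sketch along the same lines might let you skip the manual getItem/cast loop entirely (the DDL schema string below is just the example's schema written out by hand, so adapt it to your real columns):
df = spark.read.text("/mnt/input/AMP test.csv")
header = df.first().value
df = df.where(F.col("value") != header)

df = (df
      .withColumn("value", F.regexp_replace("value", "&amp;", "&"))
      .withColumn("parsed", F.from_csv("value", "ID INT, FirstName STRING, LastName STRING", {"sep": ";"}))
      .select("parsed.*"))
df.show()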
I think there isn't a way to escape this complex character sequence &amp; using only spark.read.csv, and the solution is like your workaround:
rdd.map: this function already replaces &amp; with & in all columns.
It is not necessary to save your RDD to a temporary path; just pass it directly as the csv parameter:
rdd = sc.textFile("your_path").map(lambda x: x.replace("&amp;", "&"))
df = spark.read.csv(rdd, header=True, sep=";")
df.show()
+---+-------------+--------+
| ID|    FirstName|LastName|
+---+-------------+--------+
|  1|     Chandler|    Bing|
|  2|Ross & Monica|  Geller|
+---+-------------+--------+
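I have not verified how the header option behaves when many files end up concatenated in one RDD, so if your 150 GB input is split across multiple files you can drop the header lines explicitly before handing the RDD to the reader. A rough sketch, reusing the schm schema idea from the other answer:
rdd = sc.textFile("your_path").map(lambda x: x.replace("&amp;", "&"))
header = rdd.first()
data = rdd.filter(lambda line: line != header)
df = spark.read.csv(data, header=False, sep=";", schema=schm)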
Related
With the code below I am writing a dataframe to a CSV file.
Because my dataframe contains "" for None, I have added replace("", None), since null values are supposed to be represented as None instead of "" (double quotes).
newDf.coalesce(1).replace("", None).replace("'", "\"").write.format('csv').option('nullValue', None).option('header', 'true').option('delimiter', '|').mode('overwrite').save(destination_csv)
I tried adding .replace("'", "\"") but it doesn't work.
The data also contains values with single quotes, e.g.:
Survey No. 123, 'Anjanadhri Godowns', CityName
I need to replace the single quotes in the dataframe with double quotes.
How can it be achieved?
You can use regexp_replace to replace single quotes with double quotes in all columns before writing the output:
import pyspark.sql.functions as F
df2 = df.select([F.regexp_replace(c, "'", '"').alias(c) for c in df.columns])
# then write output
# df2.coalesce(1).write(...)
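For completeness, a sketch of how this could plug into your original write chain, keeping your own options and your destination_csv variable (untested, written against the code in the question):
df2 = newDf.select([F.regexp_replace(c, "'", '"').alias(c) for c in newDf.columns])

(df2.coalesce(1)
    .replace("", None)
    .write.format('csv')
    .option('nullValue', None)
    .option('header', 'true')
    .option('delimiter', '|')
    .mode('overwrite')
    .save(destination_csv))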
Using translate
from pyspark.sql.functions import *
data_list = [(1, "'Name 1'"), (2, "'Name 2' and 'Something'")]
df = spark.createDataFrame(data = data_list, schema = ["ID", "my_col"])
# +---+--------------------+
# | ID|              my_col|
# +---+--------------------+
# |  1|            'Name 1'|
# |  2|'Name 2' and 'Som...|
# +---+--------------------+
df.withColumn('my_col', translate('my_col', "'", '"')).show()
# +---+--------------------+
# | ID|              my_col|
# +---+--------------------+
# |  1|            "Name 1"|
# |  2|"Name 2" and "Som...|
# +---+--------------------+
This will replace all occurrences of the single quote character with a double quotation mark in the column my_col.
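If more than one column can contain single quotes, the same idea can be applied across all columns before writing, along the lines of the regexp_replace answer above (a sketch, assuming string-typed columns):
df2 = df.select([translate(c, "'", '"').alias(c) for c in df.columns])
df2.show()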
I have a dataframe with numbers in European format, which I imported as a string: comma as the decimal separator and dot as the thousands separator.
from pyspark.sql.functions import regexp_replace,col
from pyspark.sql.types import FloatType
df = spark.createDataFrame([('-1.269,75',)], ['revenue'])
df.show()
+---------+
|  revenue|
+---------+
|-1.269,75|
+---------+
df.printSchema()
root
|-- revenue: string (nullable = true)
Output desired:
df.show()
+---------+
|  revenue|
+---------+
| -1269.75|
+---------+
df.printSchema()
root
|-- revenue: float (nullable = true)
I am using regexp_replace to first replace the dot with an empty string, then replace the comma with a dot, and finally cast to FloatType.
df = df.withColumn('revenue', regexp_replace(col('revenue'), ".", ""))
df = df.withColumn('revenue', regexp_replace(col('revenue'), ",", "."))
df = df.withColumn('revenue', df['revenue'].cast("float"))
But when I attempt the replacement below, I get an empty string. Why? I was expecting -1269,75.
df = df.withColumn('revenue', regexp_replace(col('revenue'), ".", ""))
+-------+
|revenue|
+-------+
|       |
+-------+
You need to escape . to match it literally, as . is a special character that matches almost any character in regex:
df = df.withColumn('revenue', regexp_replace(col('revenue'), "\\.", ""))
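Putting the three steps together with the escaped dot, the whole conversion might look like this (same logic as above, just chained):
from pyspark.sql.functions import regexp_replace, col

df = (df
      .withColumn('revenue', regexp_replace(col('revenue'), "\\.", ""))
      .withColumn('revenue', regexp_replace(col('revenue'), ",", "."))
      .withColumn('revenue', col('revenue').cast("float")))
df.show()
df.printSchema()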
I have a dataframe in PySpark with a string column whose value is [{"AppId":"APACON","ExtId":"141730"}] (the string is exactly like that in my column; it is a string, not an array).
I want to convert this to an array of struct.
Can I do that simply with native Spark functions, or do I have to parse the string or use a UDF?
sqlContext.createDataFrame(
    [(1, '[{"AppId":"APACON","ExtId":"141730"}]'),
     (2, '[{"AppId":"APACON","ExtId":"141793"}]'),
    ],
    ['idx', 'txt']
).show()
+---+--------------------+
|idx|                 txt|
+---+--------------------+
|  1|[{"AppId":"APACON...|
|  2|[{"AppId":"APACON...|
+---+--------------------+
With Spark 2.1 or above
You have the following data:
import pyspark.sql.functions as F
from pyspark.sql.types import *
df = sqlContext.createDataFrame(
    [(1, '[{"AppId":"APACON","ExtId":"141730"}]'),
     (2, '[{"AppId":"APACON","ExtId":"141793"}]'),
    ],
    ['idx', 'txt']
)
you can indeed use pyspark.sql.functions.from_json as follows:
schema = StructType([StructField("AppId", StringType()),
                     StructField("ExtId", StringType())])
df = df.withColumn('array',F.from_json(F.col('txt'), schema))
df.show()
+---+--------------------+---------------+
|idx|                 txt|          array|
+---+--------------------+---------------+
|  1|[{"AppId":"APACON...|[APACON,141730]|
|  2|[{"AppId":"APACON...|[APACON,141793]|
+---+--------------------+---------------+
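Since the question asks for an array of structs: on more recent Spark versions (2.2 or later, if I remember correctly) from_json also accepts an ArrayType schema, which keeps the surrounding square brackets meaningful instead of ignoring them. A sketch:
from pyspark.sql.types import ArrayType

array_schema = ArrayType(StructType([StructField("AppId", StringType()),
                                     StructField("ExtId", StringType())]))

df = df.withColumn('array', F.from_json(F.col('txt'), array_schema))
df.select('idx', F.col('array').getItem(0)['AppId'].alias('AppId')).show()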
Version < Spark 2.1
One way to bypass the issue would be to first slightly modify your input string to strip the square brackets:
# Use regexp_extract to ignore the square brackets
df = df.withColumn('txt_parsed', F.regexp_extract(F.col('txt'), '[^\\[\\]]+', 0))
df.show(truncate=False)
+---+-------------------------------------+-----------------------------------+
|idx|txt                                  |txt_parsed                         |
+---+-------------------------------------+-----------------------------------+
|1  |[{"AppId":"APACON","ExtId":"141730"}]|{"AppId":"APACON","ExtId":"141730"}|
|2  |[{"AppId":"APACON","ExtId":"141793"}]|{"AppId":"APACON","ExtId":"141793"}|
+---+-------------------------------------+-----------------------------------+
Then you could use pyspark.sql.functions.get_json_object to parse the txt_parsed column:
df = df.withColumn('AppId', F.get_json_object(df.txt_parsed, '$.AppId'))
df = df.withColumn('ExtId', F.get_json_object(df.txt_parsed, '$.ExtId'))
df.select('idx', 'txt_parsed', 'AppId', 'ExtId').show()
+---+--------------------+------+------+
|idx|          txt_parsed| AppId| ExtId|
+---+--------------------+------+------+
|  1|{"AppId":"APACON"...|APACON|141730|
|  2|{"AppId":"APACON"...|APACON|141793|
+---+--------------------+------+------+
I have a CSV file that contains a name field with a comma (,) escaped with \:
id,name
"10","Ashraful\, Islam"
I am reading the csv file from pyspark
test = spark.read.format("csv").option("sep", ",").option("escape", "\\").option("inferSchema", "true").option("header", "true").load("test.csv")
test.show()
The name should be Ashraful, Islam, but I am getting this output:
+---+----------------+
| id|            name|
+---+----------------+
| 10|Ashraful\, Islam|
+---+----------------+
Simply use:
df = spark.read.csv('file:///mypath.../myFile.csv', sep=',', header=True)
df.show()
This gives the output:
+---+---------------+
| id|           name|
+---+---------------+
| 10|Ashraful, Islam|
+---+---------------+
EDIT: I could not replicate your problem with the input file you have, but if it persists you can solve it with a workaround: simply replace "\," (or any other escaped special character) in the dataframe. You can do it like this:
from pyspark.sql.functions import *
df = spark.read.csv('file:///home/perfman/todel.csv', sep=',', header=True)
df.withColumn('nameClean', regexp_replace('name', '\\\,', ',')).show()
>>>
+---+----------------+---------------+
| id|            name|      nameClean|
+---+----------------+---------------+
| 10|Ashraful\, Islam|Ashraful, Islam|
+---+----------------+---------------+
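If you prefer to overwrite the column in place instead of adding nameClean, a small variant of the same pattern (raw string used for the regex):
df_clean = df.withColumn('name', regexp_replace('name', r'\\,', ','))
df_clean.show()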
I have the following file, which was supposed to be a JSON file, but it has a string slapped right before the actual JSON content (they are separated by a tab!):
string_smth\t{id:"str", num:0}
string_smth1\t{id:"str2", num:1}
string_smth2\t{id:"str3", num:2}
string_smth3\t{id:"str4", num:3}
Doing the following returns null for all columns:
import pyspark.sql
from pyspark.sql.types import *
schema = StructType([
    StructField("id", StringType()),
    StructField("num", IntegerType())
])
df = spark.read.json("hdfs:///path/files.json/*", schema=schema)
df.show()
+----+----+
|  id| num|
+----+----+
|null|null|
|null|null|
|null|null|
|null|null|
+----+----+
Any way of fixing that during the spark.read.json call? If not, what are my options?
I can see several issues in your file, but maybe it is just a problem related to your example.
I created an RDD:
a = sc.parallelize(['string_smth\t{"id":"str","num":0}',
                    'string_smth1\t{"id":"str2","num":1}',
                    'string_smth2\t{"id":"str3","num":2}',
                    'string_smth3\t{"id":"str4","num":3}'])
In your case, replace this sc.parallelize with sc.textFile(path_to_file) to acquire the file you need.
As you can see, the id is enclosed in double quotes. That is how JSON is supposed to look in string format. Also, technically, you do not have a space after the comma. What does your original file look like exactly?
Then, just do this:
import json
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("id", StringType()),
    StructField("num", IntegerType())
])
a.map(lambda x: json.loads(x.split('\t')[1])).toDF(schema).show()
+----+---+
| id|num|
+----+---+
| str| 0|
|str2| 1|
|str3| 2|
|str4| 3|
+----+---+
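For reference, a DataFrame-only variant of the same idea (Spark 2.1+), assuming the JSON part really has quoted keys as in the parallelized example above; spark.read.text puts each raw line in a column named value:
from pyspark.sql import functions as F

parsed = (spark.read.text("hdfs:///path/files.json/*")
          .withColumn("json_str", F.split("value", "\t").getItem(1))
          .withColumn("parsed", F.from_json("json_str", schema))
          .select("parsed.*"))
parsed.show()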
You don't need to create a schema when reading JSON this way; Spark can infer it.
You can use sparkContext's textFile API to read the text file and parse the lines to get valid JSON strings:
rdd = sc.textFile("path to the csv file")\
    .map(lambda line: line.split("\t", 1)[1].replace("id:", "\"id\":").replace("num:", "\"num\":"))
Then finally convert the valid JSON RDD to a dataframe:
df = sqlContext.read.json(rdd)
which should give you
+----+---+
|id |num|
+----+---+
|str |0 |
|str2|1 |
|str3|2 |
|str4|3 |
+----+---+
A potential solution would be to split on the '{' character for each line:
json_lin = '{' + 'string_smth {id:"str", num:0}'.split('{')[-1]
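A sketch completing that idea: rebuild each JSON line with the split, then let the JSON reader cope with the non-standard unquoted keys via the allowUnquotedFieldNames option of the JSON data source (schema as defined in the question; untested against your real files):
rdd = sc.textFile("hdfs:///path/files.json/*") \
        .map(lambda line: '{' + line.split('{')[-1])

df = spark.read.option("allowUnquotedFieldNames", "true").json(rdd, schema=schema)
df.show()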