I am using the following code to export my data frame to csv:
data.write.format('com.databricks.spark.csv').options(delimiter="\t", codec="org.apache.hadoop.io.compress.GzipCodec").save('s3a://myBucket/myPath')
Note that I use delimiter="\t", as I don't want to add additional quotation marks around each field. However, when I checked the output csv file, there are still some fields enclosed in quotation marks, e.g.
abcdABCDAAbbcd ....
1234_3456ABCD ...
"-12345678AbCd" ...
It seems that the quotation mark appears when the leading character of a field is "-". Why is this happening and is there a way to avoid this? Thanks!
You aren't using all the options provided by the CSV writer. It has a quoteMode parameter which takes one of four values (descriptions from the org.apache.commons.csv documentation):
ALL - quotes all fields
MINIMAL (default) - quotes only fields which contain special characters such as the delimiter, the quote character, or any of the characters in the line separator
NON_NUMERIC - quotes all non-numeric fields
NONE - never quotes fields
If you want to avoid quoting, the last option looks like a good choice, doesn't it?
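A minimal sketch of how that could look, reusing the writer from the question and assuming quoteMode is passed through options() like the other settings:

# Assumption: the spark-csv writer accepts quoteMode alongside the other options
data.write.format('com.databricks.spark.csv').options(
    delimiter="\t",
    quoteMode="NONE",  # never wrap fields in quotation marks
    codec="org.apache.hadoop.io.compress.GzipCodec"
).save('s3a://myBucket/myPath')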
Related
I've read that the delimiter of a CSV is the comma (,) and the escape character is the double quote ("). What I don't understand is why or how double quotes (") are also used to preserve spaces in field values... What I mean is this claim: "CSV files use double-quote marks to delimit field values that have spaces, so a value like Santa Claus gets saved as "Santa Claus"". Why would the space not be preserved otherwise? Also, if the double quote (") is the escape character, how is the S in "Santa Claus" not escaped? What about the comma that will inevitably come after "Santa Claus"? Wouldn't it be escaped by the double quote after the s in Claus? Obviously not... because it doesn't get escaped and acts as a delimiter, as it is supposed to, when parsed. So my question is essentially about the role of double quotes (") in CSV files, because it seems to be twofold (almost acting like both a delimiter and an escape character), and yet I cannot find an explanation of this anywhere. Thanks in advance.
This is my first question, so I'm sorry if this isn't quite the right place for a question that doesn't deal with code itself, but I am currently parsing lots of CSV files in Python that are given in text format, where these double quotes do not surround every field value. I was using the split() method to... well... to split the string up by line, and then by values, but then I realized I needed to account for escape characters, and that's when I started to think hard about how a CSV file works. I thought I understood the concept of escape characters in general, and still think I do, but observing the role(s) double quotes play in CSVs, I realized I was missing some kind of intuition. Just hoping for some clarity.
Consider what happens when you want a string containing a comma as a field in your row. You need some way to let the CSV parser know that this is not a delimiting comma but a 'data comma', so you have to denote it in some special form. Usually this is done by enclosing the field that contains the comma in double quotes. From the RFC https://datatracker.ietf.org/doc/html/rfc4180:
Within the header and each record, there may be one or more
fields, separated by commas. Each line should contain the same
number of fields throughout the file. Spaces are considered part
of a field and should not be ignored. The last field in the
record must not be followed by a comma. For example:
aaa,bbb,ccc
Each field may or may not be enclosed in double quotes (however
some programs, such as Microsoft Excel, do not use double quotes
at all). If fields are not enclosed with double quotes, then
double quotes may not appear inside the fields. For example:
"aaa","bbb","ccc" CRLF
zzz,yyy,xxx
Fields containing line breaks (CRLF), double quotes, and commas
should be enclosed in double-quotes. For example:
"aaa","b CRLF
bb","ccc" CRLF
zzz,yyy,xxx
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"
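To make those rules concrete, here is a small illustration using Python's built-in csv module (the values are made up for the example): a space alone does not force quoting, a comma does, and an embedded double quote is doubled.

import csv, io

buf = io.StringIO()
# three made-up fields: text with a space, text with a comma, text with quotes
csv.writer(buf).writerow(['Santa Claus', 'a,b', 'he said "hi"'])
print(buf.getvalue())
# Santa Claus,"a,b","he said ""hi"""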
I'm running this code below and outputting the results into a csv file:
df['Post Town'] = '"' + df['Post Town'].astype(str) + '"'
df.to_csv('filename.csv', index=False)
However, I've noticed that in Notepad++ my strings come back with triple quotes. Is there a way around this, as I only want a single pair of ASCII double quotes?
Desired: "string"
Current: """string"""
to_csv() already inserts the needed double quotes around your field, but because the field itself contains double quotes (you inserted them manually), those have to be escaped.
The CSV format is described in RFC-4180, which states "If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote."
So 'your' double quotes get escaped by doubling them, and then another pair of double quotes is put around the whole field by to_csv(). And since you put 'your' double quotes at the start and end of the field, you end up with triple double quotes.
Solutions:
If you want the CSV reading process to produce a string with double quotes around it: the triple double quotes are correct.
If you want the CSV reading process to produce a string without quotes around it: drop the manual quotes and let to_csv() handle the quoting of the field.
If you need a different variant of the CSV format (there are a lot of options), adjust the options of to_csv().
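A minimal sketch of the second and third options, assuming a hypothetical df with a 'Post Town' column and made-up file names:

import csv
import pandas as pd

# hypothetical stand-in for the data frame from the question
df = pd.DataFrame({'Post Town': ['London', 'St. Albans']})
df['Post Town'] = df['Post Town'].astype(str)

# default (csv.QUOTE_MINIMAL): fields are quoted only when they need to be
df.to_csv('filename.csv', index=False)

# csv.QUOTE_ALL: every field is written with one pair of double quotes in the
# file and reads back as a plain string without quotes
df.to_csv('filename_all.csv', index=False, quoting=csv.QUOTE_ALL)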
Try changing
df['Post Town'] = '"' + df['Post Town'].astype(str) + '"'
to
df['Post Town'] = df['Post Town'].astype(str)
I have data that looks like this
id,receiver_id,date,name,age
123,"a,b,c",2012-03-05,"john",32
456,"x,y,z",2012-06-05,"max",49
789,"abc",2012-07-05,"sam",19
In a nutshell, the delimiter is a comma (,). However, when a particular field is surrounded by quotes, there might be commas within the value.
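Assuming the goal is to read this data back with the embedded commas kept inside their fields, a minimal sketch with Python's built-in csv module, which already treats a quoted field as a single value:

import csv, io

data = '''id,receiver_id,date,name,age
123,"a,b,c",2012-03-05,"john",32
456,"x,y,z",2012-06-05,"max",49
789,"abc",2012-07-05,"sam",19'''

for row in csv.reader(io.StringIO(data)):
    print(row)
# ['id', 'receiver_id', 'date', 'name', 'age']
# ['123', 'a,b,c', '2012-03-05', 'john', '32']
# ...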
I am currently trying to write a dataframe to a csv using to_csv(). The input and output for the data are below. How do I write the to_csv() call to ensure that the fields with commas still get double quoted, but the row with Katie doesn't get additional double quotes?
Input:
Title
Johnny,Appleseed
Lauren,Appleseed
Katie_"Appleseed"
Output:
Title
"Johnny,Appleseed"
"Lauren,Appleseed"
Katie_""Appleseed""
code
df.to_csv(r"filelocation", sep=',')
Escaping quotes with double quotes is part of the CSV standard:
"7. If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote."
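If the doubled quotes are not acceptable downstream, one option (a sketch with made-up file names, not necessarily what the question needs) is to switch off quote doubling and escape embedded quotes with a backslash instead; note that the escaped output deviates from RFC 4180, so any reader has to be told about the escape character.

import pandas as pd

# hypothetical stand-in for the data frame from the question
df = pd.DataFrame({'Title': ['Johnny,Appleseed', 'Lauren,Appleseed', 'Katie_"Appleseed"']})

# default behaviour (RFC 4180): fields with commas are quoted and embedded
# quotes are doubled
df.to_csv('rfc4180.csv', index=False)

# escape embedded quotes with a backslash instead of doubling them; fields
# with commas are still quoted, Katie's field is not
df.to_csv('escaped.csv', index=False, doublequote=False, escapechar='\\')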
TL;DR:
line = "one|two|three\|four\|five"
fields = line.split(whatever)
for what value of whatever does:
fields == ['one', 'two', 'three\|four\|five']
I have a file delimited by pipe characters. Some of the fields in that file also include pipes, escaped by a leading backslash.
For example, a single row of data in this file might have an array representation of ['one', 'two', 'three\|four\|five'], and this will be represented in the file as one|two|three\|four\|five
I have no control over the file. I cannot preprocess the file. I have to do it in a single split.
I ultimately need to split each row of this file into the separate fields, but that leading backslash is proving to be all sorts of trouble. I initially tried using a negative look-ahead, but there's some sort of arcana surrounding python strings and double-escaped characters which I don't understand, and this is stopping me from figuring it out.
An explanation of the solution is appreciated but optional.
You can use a regex like
re.split(r'([^|]+[^\\])\|', line)
which uses a character class to say that a run of characters ending in something other than \, followed by a |, marks where to split
Because the pattern has a capturing group, re.split() keeps the captured fields but also leaves empty strings in the result (one at the start and one between adjacent matches), but hopefully you can work around that by filtering them out, e.g.
[f for f in re.split(r'([^|]+[^\\])\|', line) if f]
This is still subject to the parsing issues that Wiktor raised though, of course
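An alternative sketch, not from the answer above: a negative lookbehind splits only on pipes that are not preceded by a backslash (it still does not handle an escaped backslash sitting right before a real delimiter, which is one of those parsing issues):

import re

line = "one|two|three\\|four\\|five"  # the literal text one|two|three\|four\|five
fields = re.split(r'(?<!\\)\|', line)
print(fields)  # ['one', 'two', 'three\\|four\\|five']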
Maybe you can use something like this:
[^\\]\|
where [^\\] matches any character other than \.