Python - to_csv

I am currently trying to write a dataframe to a CSV using to_csv. The input and the current output are shown below. How do I write the to_csv call so that the fields with commas still get double quoted, but the row with Katie doesn't get additional double quotes?
Input:
Title
Johnny,Appleseed
Lauren,Appleseed
Katie_"Appleseed"
Output:
Title
"Johnny,Appleseed"
"Lauren,Appleseed"
Katie_""Appleseed""
Code:
df.to_csv(r"filelocation", sep=',')

Escaping quotes with double quotes is part of the CSV standard (RFC 4180):
"7. If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote."
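If the doubled quotes really must be suppressed, quoting=csv.QUOTE_NONE together with an escapechar will do it, at the cost of producing a file that no longer follows RFC 4180. A minimal sketch, reconstructing the dataframe from the question:

```python
import csv
import pandas as pd

df = pd.DataFrame(
    {"Title": ["Johnny,Appleseed", "Lauren,Appleseed", 'Katie_"Appleseed"']}
)

# Default QUOTE_MINIMAL: comma fields are quoted, embedded quotes are doubled.
out_minimal = df.to_csv(index=False)
print(out_minimal)

# QUOTE_NONE: nothing is quoted; special characters are backslash-escaped instead.
out_none = df.to_csv(index=False, quoting=csv.QUOTE_NONE, escapechar="\\")
print(out_none)
```

Note that most CSV consumers will not understand the backslash escapes, so the doubled quotes from the default mode are usually the right answer.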

Related

Do double quotes (") act like field delimiters or as an escape character, or both in CSV files?

I've read that the delimiter of a CSV is the comma (,) and the escape character is the double quote ("). What I don't understand is why or how double quotes (") are also used to preserve spaces in field values. What I mean is this claim: "CSV files use double-quote marks to delimit field values that have spaces, so a value like Santa Claus gets saved as "Santa Claus"". Why would the space not be preserved otherwise? Also, if the double quote (") is the escape character, how is the S in "Santa Claus" not escaped? What about the comma that will inevitably come after "Santa Claus"? Wouldn't that be escaped by the double quote after the s in Claus? Obviously not, because it doesn't get escaped and acts as a delimiter, as it is supposed to, when parsed. So my question is essentially about the role of double quotes (") in CSV files, because it seems to be two-fold (almost acting like both a delimiter and an escape character), and yet I cannot find an explanation of this anywhere. Thanks in advance.
This is my first question, so I'm sorry if this isn't quite the right place for a question that doesn't deal with code itself, but I am currently dealing with parsing lots of csv files in python that are given in text format, where these double quotes do not surround every field value. I was using the split() method to.. well.. to split the string up by line, and then by values, but then I realized I needed to account for escape characters and that's when I started to think hard about how a CSV file works. I thought I understood the concept of escape characters in general, and still think I do, but observing the role(s) double quotes play in CSVs, I realized I was missing some kind of intuition. Just hoping for some clarity.
Consider what happens when you want a string with a comma as a field in your row. You need some way to let the csv parser know that this is not a delimiting comma but a 'data comma', so you have to denote it in some special form. Usually this is done by enclosing the field containing the comma in double quotes. From the RFC https://datatracker.ietf.org/doc/html/rfc4180:
Within the header and each record, there may be one or more
fields, separated by commas. Each line should contain the same
number of fields throughout the file. Spaces are considered part
of a field and should not be ignored. The last field in the
record must not be followed by a comma. For example:
aaa,bbb,ccc
Each field may or may not be enclosed in double quotes (however
some programs, such as Microsoft Excel, do not use double quotes
at all). If fields are not enclosed with double quotes, then
double quotes may not appear inside the fields. For example:
"aaa","bbb","ccc" CRLF
zzz,yyy,xxx
Fields containing line breaks (CRLF), double quotes, and commas
should be enclosed in double-quotes. For example:
"aaa","b CRLF
bb","ccc" CRLF
zzz,yyy,xxx
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"
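The rules quoted above are exactly what Python's csv module implements with its default dialect; a quick sketch:

```python
import csv
from io import StringIO

buf = StringIO()
# Default dialect: comma delimiter, minimal quoting, embedded quotes doubled.
writer = csv.writer(buf)
writer.writerow(["aaa", "b,bb", 'c"cc'])
print(buf.getvalue())  # aaa,"b,bb","c""cc"
```

Only the fields containing a delimiter or a quote character get wrapped in quotes; the plain field is left alone.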

Adding commas to a string in python but in notepad++, the string has three quotes around it

I'm running this code below and outputting the results into a csv file:
df['Post Town'] = '"' + df['Post Town'].astype(str) + '"'
df.to_csv('filename.csv', index=False)
However I've noticed that in notepad++ my strings are coming back with triple quotes. Is there a way around this as I only want ASCII double quotes?
Desired: "string"
Current: """string"""
to_csv() already inserts the needed double quotes around your field, but because the field contains double quotes (you inserted them manually), those need to be escaped.
The CSV format is described in RFC-4180, which states "If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote."
So 'your' double quotes get escaped as doubled double quotes, and then another pair of double quotes is put around the field by to_csv(). And since you put 'your' double quotes at the start and end of the field, you end up with triple double quotes.
Solutions:
If you want the CSV reading process to produce a string with double quotes around it: the triple double quotes are correct.
If you want the CSV reading process to produce a string without quotes around it: let to_csv() handle the outer quotes around the field.
If you need a different variant of the CSV format (there are a lot of options), adjust the options of to_csv().
Try changing
df['Post Town'] = '"' + df['Post Town'].astype(str) + '"'
to
df['Post Town'] = df['Post Town'].astype(str)
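A short sketch of the difference, using hypothetical town names:

```python
import pandas as pd

df = pd.DataFrame({"Post Town": ["London", "Stoke-on-Trent"]})

# Manually wrapped quotes get escaped and re-quoted, yielding triple quotes:
wrapped = ('"' + df["Post Town"].astype(str) + '"').to_frame().to_csv(index=False)
print(wrapped)  # each value appears as """London""" etc.

# Leaving the quoting to to_csv() gives plain, unquoted values:
plain = df.to_csv(index=False)
print(plain)
```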

Multiple separators and mixed quotations in CSV and Pandas

I have a specific CSV file, I think this is a standard how PHP works because it's coming from PHP code.
I'm trying to use pandas to remove certain columns (200+ columns), but need to preserve the quotations in both header line and all other lines.
shorted header line:
name, "Full Name", "Suggested Name", id
(so spaces are escaped with double quotes in the header line)
And data:
blah, "Very, Blah Line", "Not Suggested", 2
So have commas and spaces within the column, and such is escaped with quotes.
If I use pandas read_csv, it reads the data correctly, but then saves everything with quotes, meaning it changes the header line to:
"name", "Full Name", "Suggested Name", "id"
And same with data.
This breaks some of our environments, and I can't have that in CSV.
If I use no quoting, it takes all the quotation marks out of the header line and the other lines, and then the spaces become a problem.
Any suggestion welcome here.
Use the correct quoting constant from the csv module in your to_csv(...) call (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html).
You most probably need either QUOTE_MINIMAL or QUOTE_NONNUMERIC:
QUOTE_MINIMAL : quotes only where needed
QUOTE_NONNUMERIC : quotes all non-numerics
You probably need QUOTE_MINIMAL (because blah is not quoted):
your_df.to_csv('some.txt', quoting=csv.QUOTE_MINIMAL)
It seems it was easier than I thought; I was focusing on the delimiter instead of the escape characters.
This worked in my case:
new_f.to_csv("output.csv", sep=',', escapechar=' ', quotechar='"', quoting=csv.QUOTE_MINIMAL, index=False)
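A sketch of what QUOTE_MINIMAL does with data shaped like the question's (hypothetical values); only fields that contain the delimiter, a quote, or a newline get quoted, so fields that merely contain spaces stay unquoted:

```python
import csv
import pandas as pd

df = pd.DataFrame({
    "name": ["blah"],
    "Full Name": ["Very, Blah Line"],
    "Suggested Name": ["Not Suggested"],
    "id": [2],
})

# QUOTE_MINIMAL quotes only the field with the embedded comma.
out = df.to_csv(index=False, quoting=csv.QUOTE_MINIMAL)
print(out)
```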

Saved data has undesired quotation marks

I am using the following code to export my data frame to csv:
data.write.format('com.databricks.spark.csv').options(delimiter="\t", codec="org.apache.hadoop.io.compress.GzipCodec").save('s3a://myBucket/myPath')
Note that I use delimiter="\t", as I don't want to add additional quotation marks around each field. However, when I checked the output csv file, there are still some fields which are enclosed by quotation marks. e.g.
abcdABCDAAbbcd ....
1234_3456ABCD ...
"-12345678AbCd" ...
It seems that the quotation mark appears when the leading character of a field is "-". Why is this happening and is there a way to avoid this? Thanks!
You aren't using all the options provided by the CSV writer. It has a quoteMode parameter which takes one of four values (descriptions from the org.apache.commons.csv documentation):
ALL - quotes all fields
MINIMAL (default) - quotes fields which contain special characters such as a delimiter, quotes character or any of the characters in line separator
NON_NUMERIC - quotes all non-numeric fields
NONE - never quotes fields
If you want to avoid quoting, the last option looks like a good choice, doesn't it?
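These four modes correspond closely to the quoting constants in Python's stdlib csv module; a sketch of how each behaves on the same (hypothetical) row. Note that Python's MINIMAL mode does not quote a leading "-", so the behaviour seen in the question appears to be a peculiarity of the commons-csv writer:

```python
import csv
from io import StringIO

row = ["-12345678AbCd", "plain", "with,comma"]
results = {}
for mode in ("QUOTE_ALL", "QUOTE_MINIMAL", "QUOTE_NONNUMERIC", "QUOTE_NONE"):
    buf = StringIO()
    # escapechar is only needed for QUOTE_NONE, but is harmless elsewhere.
    csv.writer(buf, quoting=getattr(csv, mode), escapechar="\\").writerow(row)
    results[mode] = buf.getvalue().strip()
    print(mode, "->", results[mode])
```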

How to handle double quotes inside field values with csv module?

I'm trying to parse CSV files from an external system which I have no control of.
comma is used as a separator
when a cell contains a comma, it's wrapped in quotes and all other quotes are escaped with another quote character.
(my problem) when a cell is not wrapped in quotes, all quote characters are still escaped with another quote nonetheless.
Example CSV:
qw""erty,"a""b""c""d,ef""""g"
Should be parsed as:
[['qw"erty', 'a"b"c"d,ef""g']]
However, I think that Python's csv module does not expect quote characters to be escaped when cell was not wrapped in quote chars in the first place.
csv.reader(my_file) (with default doublequote=True) returns:
['qw""erty', 'a"b"c"d,ef""g']
Is there any way to parse this with python csv module ?
Following on @JackManey's comment, where he suggested replacing all instances of '""' inside double-quoted cells with '\\"'.
It turned out to be unnecessary to track whether we are currently inside a double-quoted cell; we can simply replace all instances of '""' with '\\"'.
Python documentation says:
On reading, the escapechar removes any special meaning from the following character
However this would still break in the case where original cell already contains escape characters, example: 'qw\\\\""erty' producing [['qw\\"erty']]. So we have to escape the escape characters before parsing too.
Final solution:
import csv
from io import StringIO

with open(file_path) as f:  # text mode, so str.replace works on Python 3
    content = f.read().replace('\\', '\\\\').replace('""', '\\"')
reader = csv.reader(StringIO(content), doublequote=False, escapechar='\\')
rows = [row for row in reader]
As @JackManey suggests, after reading the file you can replace each doubled double-quote with a single double quote:
my_file_onequote = [[col.replace('""', '"') for col in row] for row in my_file]
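Putting the escape-and-reparse trick together on the example string from the question:

```python
import csv
from io import StringIO

raw = 'qw""erty,"a""b""c""d,ef""""g"'
# Escape backslashes first, then turn doubled quotes into backslash-escaped quotes.
content = raw.replace('\\', '\\\\').replace('""', '\\"')
rows = list(csv.reader(StringIO(content), doublequote=False, escapechar='\\'))
print(rows)  # [['qw"erty', 'a"b"c"d,ef""g']]
```

The unquoted first cell and the quoted second cell both come out with their quote characters unescaped, matching the desired parse.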
