Multiple separators and mixed quotations in CSV and Pandas - python

I have a specific CSV file; I believe this is the standard format PHP produces, because the file comes from PHP code.
I'm trying to use pandas to remove certain columns (200+ columns), but I need to preserve the quoting in both the header line and all data lines.
Shortened header line:
name, "Full Name", "Suggested Name", id
(so values containing spaces are wrapped in double quotes in the header line)
And the data:
blah, "Very, Blah Line", "Not Suggested", 2
So columns can contain commas and spaces, and such values are escaped with quotes.
If I use pandas read_csv, it reads the data correctly, but then saves everything with quotes, changing the header line to:
"name", "Full Name", "Suggested Name", "id"
And the same with the data.
This breaks some of our environments, and I can't have that in the CSV.
If I use no quoting, it strips all the quotes from the header line and the other lines, and then the spaces become a problem.
Any suggestion is welcome here.

Use the correct quoting constant from the csv module in your [pd.to_csv(...)](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html) call.
You most probably need either QUOTE_MINIMAL or QUOTE_NONNUMERIC:
QUOTE_MINIMAL : quotes only where needed
QUOTE_NONNUMERIC : quotes all non-numerics
You probably need QUOTE_MINIMAL (because blah is not quoted):
your_df.to_csv('some.txt', quoting=csv.QUOTE_MINIMAL)
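A quick round-trip sketch (with a tiny DataFrame standing in for the real data) shows what QUOTE_MINIMAL does: only the field containing the delimiter gets quoted, and plain fields and the header stay bare:

```python
import csv
import io

import pandas as pd

# Hypothetical stand-in for the real data: one value contains a comma.
df = pd.DataFrame(
    {"name": ["blah"], "Full Name": ["Very, Blah Line"], "id": [2]}
)

buf = io.StringIO()
df.to_csv(buf, quoting=csv.QUOTE_MINIMAL, index=False)

# Only the value containing the delimiter is quoted:
#   name,Full Name,id
#   blah,"Very, Blah Line",2
print(buf.getvalue())
```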

It seems it was easier than I thought; I was focusing on the delimiter instead of the escape characters.
This worked in my case:
new_f.to_csv("output.csv", sep=',', escapechar=' ', quotechar='"', quoting=csv.QUOTE_MINIMAL, index=False)

Related

Python - to_csv

I am currently trying to write a dataframe to a csv using to_csv. The input and output for the data are below. How do I write the to_csv call so that fields containing commas still get double-quoted, but the row with Katie doesn't get additional double quotes?
Input:
Title
Johnny,Appleseed
Lauren,Appleseed
Katie_"Appleseed"
Output:
Title
"Johnny,Appleseed"
"Lauren,Appleseed"
Katie_""Appleseed""
Code:
df.to_csv(r"filelocation", sep=',')
Escaping quotes with double quotes is part of the CSV standard (RFC 4180):
"7. If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote."
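The doubling is purely an encoding artifact: any conforming CSV reader undoes it when the file is read back. A quick sketch with the csv module:

```python
import csv
import io

row = ["Johnny,Appleseed", 'Katie_"Appleseed"']

# Writing: the field with an embedded quote is wrapped in quotes
# and each inner " is doubled to "" per the standard.
buf = io.StringIO()
csv.writer(buf).writerow(row)
encoded = buf.getvalue()
print(encoded)  # "Johnny,Appleseed","Katie_""Appleseed"""

# Reading it back with a conforming parser restores the original values.
decoded = next(csv.reader(io.StringIO(encoded)))
print(decoded == row)  # True
```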

Removing quotes from text files

I need to read a pipe(|)-separated text file.
One of the fields contains a description that may contain double-quotes.
I noticed that all lines containing a " are missing from the receiving dict.
To avoid this, I tried to read the entire line and use str.replace() to remove the quotes, as shown below, but it looks like their presence creates a problem at the line-reading stage, i.e. before the str.replace() call.
The code is below, and the question is: how do I force Python not to use any separator and keep each line whole?
with open(fileIn) as txtextract:
    readlines = csv.reader(txtextract, delimiter="µ")
    for line in readlines:
        (...)
        LI_text = newline[107:155]
        LI_text.replace("|", "/")
        LI_text.replace("\"", "")  # use of escape char doesn't work
Note: I am using version 3.6
You may use a regex:
In [1]: import re
In [2]: re.sub(r"\"", "", '"remove all "double quotes" from text"')
Out[2]: 'remove all double quotes from text'
In [3]: re.sub(r"(^\"|\"$)", "", '"remove all "only surrounding quotes" from text"')
Out[3]: 'remove all "only surrounding quotes" from text'
or add the quotechar='"' and quoting=csv.QUOTE_MINIMAL options to csv.reader(), like:
with open(fileIn) as txtextract:
    readlines = csv.reader(txtextract, delimiter="µ", quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for line in readlines:
        (...)
Lesson: str.replace() does not change the string in place. The modified text must be stored back (string = string.replace(...)).
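A quick illustration of the pitfall:

```python
s = 'a "quoted" value'

s.replace('"', '')      # result thrown away; s is unchanged
print(s)                # a "quoted" value

s = s.replace('"', '')  # store the result back
print(s)                # a quoted value
```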

Saved data has undesired quotation marks

I am using the following code to export my data frame to csv:
data.write.format('com.databricks.spark.csv').options(delimiter="\t", codec="org.apache.hadoop.io.compress.GzipCodec").save('s3a://myBucket/myPath')
Note that I use delimiter="\t", as I don't want to add additional quotation marks around each field. However, when I checked the output csv file, there are still some fields which are enclosed by quotation marks. e.g.
abcdABCDAAbbcd ....
1234_3456ABCD ...
"-12345678AbCd" ...
It seems that the quotation mark appears when the leading character of a field is "-". Why is this happening and is there a way to avoid this? Thanks!
You aren't using all the options provided by the CSV writer. It has a quoteMode parameter which takes one of four values (descriptions from the org.apache.commons.csv documentation):
ALL - quotes all fields
MINIMAL (default) - quotes fields which contain special characters such as a delimiter, quotes character or any of the characters in line separator
NON_NUMERIC - quotes all non-numeric fields
NONE - never quotes fields
If you want to avoid quoting, the last option looks like a good choice, doesn't it? Pass quoteMode="NONE" in the options(...) call.

How to handle double quotes inside field values with csv module?

I'm trying to parse CSV files from an external system which I have no control of.
comma is used as a separator
when a cell contains a comma, it's wrapped in quotes and all quotes inside it are escaped with another quote character.
(my problem) when a cell was not wrapped in quotes, all quote characters are still escaped with another quote nonetheless.
Example CSV:
qw""erty,"a""b""c""d,ef""""g"
Should be parsed as:
[['qw"erty', 'a"b"c"d,ef""g']]
However, I think that Python's csv module does not expect quote characters to be escaped when cell was not wrapped in quote chars in the first place.
csv.reader(my_file) (with default doublequote=True) returns:
['qw""erty', 'a"b"c"d,ef""g']
Is there any way to parse this with python csv module ?
Following up on #JackManey's comment, where he suggested replacing all instances of '""' inside double-quoted cells with '\\"'.
Recognizing whether we are currently inside a double-quoted cell turned out to be unnecessary, and we can replace all instances of '""' with '\\"'.
Python documentation says:
On reading, the escapechar removes any special meaning from the following character
However this would still break in the case where original cell already contains escape characters, example: 'qw\\\\""erty' producing [['qw\\"erty']]. So we have to escape the escape characters before parsing too.
Final solution:
with open(file_path) as f:  # text mode, so the replaces below work on str
    content = f.read().replace('\\', '\\\\').replace('""', '\\"')
reader = csv.reader(StringIO(content), doublequote=False, escapechar='\\')
return [row for row in reader]
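Applied to the example row from the question (with io.StringIO standing in for the file), this produces exactly the expected parse:

```python
import csv
import io

raw = 'qw""erty,"a""b""c""d,ef""""g"'

# Escape any real escape characters first, then turn every "" into \".
content = raw.replace('\\', '\\\\').replace('""', '\\"')

rows = list(csv.reader(io.StringIO(content), doublequote=False, escapechar='\\'))
print(rows)  # [['qw"erty', 'a"b"c"d,ef""g']]
```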
As #JackManey suggests, after reading the file you can replace each doubled double-quote with a single one:
my_file_onequote = [[col.replace('""', '"') for col in row] for row in my_file]

Access tab-separated rows with quoted fields containing newlines and more quotes

I have a long file with rows ending in newlines and fields separated by tabs. Fields are quoted with double quotes ("). A single quoted field may also contain newlines and, as an added twist, may additionally contain quoted strings.
Here is an example illustrating all cases:
"FieldA" "FieldB" "FieldC"
"AnotherOne" "May contain
newlines" "FieldC"
"Here is one more row" "FieldB" "FieldC"
"And here is a twist" "Some fields with newlines may contain or end with "quotes and"
continue on next line" "FieldC"
I tried the csv module in this way:
with open(sys.argv[1], 'rU') as csvfile:
    a = csv.reader(csvfile, delimiter='\t', quotechar='"')
    for row in a:
        print(len(row))
...but this gives me variable row lengths, so I cannot access fields reliably. How can I access the values in such a file reliably from Python?
