I'm trying to parse CSV files from an external system that I have no control over.
A comma is used as the separator.
When a cell contains a comma, it is wrapped in quotes and all other quotes are escaped with another quote character.
(My problem) When a cell is not wrapped in quotes, all quote characters are still escaped with another quote.
Example CSV:
qw""erty,"a""b""c""d,ef""""g"
Should be parsed as:
[['qw"erty', 'a"b"c"d,ef""g']]
However, I think that Python's csv module does not expect quote characters to be escaped when the cell was not wrapped in quote chars in the first place.
csv.reader(my_file) (with default doublequote=True) returns:
['qw""erty', 'a"b"c"d,ef""g']
Is there any way to parse this with the Python csv module?
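For reference, the behaviour described above can be reproduced with a minimal sketch. It assumes Python 3 and holds the example line in a string instead of a file; the variable names are illustrative only.

```python
import csv

# The example line from the question, held in memory instead of a file.
sample = 'qw""erty,"a""b""c""d,ef""""g"'

# csv.reader accepts any iterable of lines; doublequote=True is the default.
row = next(csv.reader([sample]))
print(row)
```

The doubled quotes are collapsed only inside the quoted second cell; the unquoted first cell keeps both quote characters.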
Following up on @JackManey's comment, which suggested replacing all instances of '""' inside double quotes with '\\"':
Recognizing whether we are currently inside a double-quoted cell turned out to be unnecessary; we can simply replace all instances of '""' with '\\"'.
Python documentation says:
On reading, the escapechar removes any special meaning from the following character
However, this would still break when the original cell already contains escape characters, for example 'qw\\\\""erty' producing [['qw\\"erty']]. So we also have to escape the escape characters before parsing.
Final solution:
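import csv
from io import StringIO

def parse_csv(file_path):  # wrapper added so that `return` is valid
    with open(file_path, newline='') as f:  # text mode (Python 3)
        content = f.read().replace('\\', '\\\\').replace('""', '\\"')
    reader = csv.reader(StringIO(content), doublequote=False, escapechar='\\')
    return list(reader)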
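A self-contained sketch of the same approach, run against the example line from the question, may make the effect easier to check. It assumes Python 3 and works on an in-memory string; `parse_doubled_quotes` is a hypothetical helper name, not part of the csv module.

```python
import csv
from io import StringIO

def parse_doubled_quotes(text):
    # Escape existing backslashes first, then rewrite each '""' as '\"'.
    escaped = text.replace('\\', '\\\\').replace('""', '\\"')
    reader = csv.reader(StringIO(escaped), doublequote=False, escapechar='\\')
    return list(reader)

# Example line from the question.
rows = parse_doubled_quotes('qw""erty,"a""b""c""d,ef""""g"\n')
print(rows)

# A cell that already contains a backslash survives unchanged.
backslash_rows = parse_doubled_quotes('a\\b,c\n')
print(backslash_rows)
```

One caveat of the blanket replacement: an empty quoted cell (written as two quote characters) would also be rewritten, so this sketch assumes the input contains no such cells.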
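As @JackManey suggests, after reading the file you can replace each pair of double quotes with a single double quote. Note that this also collapses legitimately doubled quotes inside quoted cells, so 'ef""g' in the example would become 'ef"g'.
my_file_onequote = [[col.replace('""', '"') for col in row] for row in my_file]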
I'm facing an issue parsing the following CSV file row:
UPDATED,464,**"{\"node-id\":\"\",\"change-type\":\"UPDATED\",\"object-type\":\"service\",\"internalgeneratedepoch\":1674472915591000,\"topic-name\":\"Service\",\"object-id\":\"wdm_tpdr_service1\",\"changed-attributes\":{\"lifecycle-state\":{\"old-value\":\" \",\"new-value\":\"planned\"},\"administrative-state\":{\"old-value\":\" \",\"new-value\":\"outOfService\"}},\"internaleventid\":464}"**,1674472915591000,,wdm_tpdr_service1,service
The issue is with the data in column 3 (highlighted in bold), which has commas and double quotes inside the curly braces. I am not able to read this column as a single data point; pandas splits it at the commas, which are read as separators. Can someone help, please?
Want to read the following string as a single data point:
"{"node-id":"","change-type":"UPDATED","object-type":"service","internalgeneratedepoch":1674472915591000,"topic-name":"Service","object-id":"wdm_tpdr_service1","changed-attributes":{"lifecycle-state":{"old-value":" ","new-value":"planned"},"administrative-state":{"old-value":" ","new-value":"outOfService"}},"internaleventid":464}"
Tried this code:
csv_input = pd.read_csv(file_name, delimiter=',(?![^{]*})',engine="python",index_col=False)
But it's not working for all the rows.
Any help will be appreciated.
The code you have provided doesn't work for all rows, but not because the regular expression is invalid: the pattern is syntactically valid, and a regex delimiter is accepted with engine="python". The problem is the lookahead (?![^{]*}), which only treats a comma as being inside braces when a closing curly brace follows before the next opening one. In your data the braces are nested, so a comma that sits inside the outer object but before a nested { is still treated as a separator. To fix this, you can either drop the regex and use a simple comma as the delimiter, letting the parser's quotechar and escapechar options keep the third column together, or use a more specific pattern as the delimiter.
You can try using the json library to parse the string that is in the third column:
import json
import pandas as pd

# Assumption: inner quotes are backslash-escaped, as in the sample row,
# so the escape character is declared explicitly
csv_input = pd.read_csv(file_name, escapechar='\\')
# read the third column in the csv (selected by position)
third_column = csv_input.iloc[:, 2]
# parse each string as json and store the results in a new column
csv_input['parsed_data'] = [json.loads(x) for x in third_column]
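The effect of declaring an escape character can be sketched with the standard library alone. This uses Python's csv module rather than pandas (pandas.read_csv also accepts an escapechar argument), and the row is a shortened, hypothetical version of the one in the question with the bold markers left out.

```python
import csv
import json

# Shortened version of the row from the question; a raw string keeps the backslashes.
line = (r'UPDATED,464,"{\"node-id\":\"\",\"change-type\":\"UPDATED\",'
        r'\"internaleventid\":464}",1674472915591000,,wdm_tpdr_service1,service')

# escapechar tells the reader that \" inside a quoted field is a literal quote.
fields = next(csv.reader([line], escapechar='\\'))
payload = json.loads(fields[2])

print(len(fields))
print(payload['change-type'])
```

With the escape character set, the commas inside the JSON text stay within the quoted third field, so the row splits into seven fields and the third one parses as JSON.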
I'm running this code below and outputting the results into a csv file:
df['Post Town'] = '"' + df['Post Town'].astype(str) + '"'
df.to_csv('filename.csv', index=False)
However, I've noticed that in Notepad++ my strings come back with triple quotes. Is there a way around this, as I only want ASCII double quotes?
Desired: "string"
Current: """string"""
to_csv() inserts the needed double quotes around your field already, but as the field contains double quotes (you insert them manually), those need to be escaped.
The CSV format is described in RFC 4180, which states: "If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote."
So, 'your' double quotes get escaped with double double quotes, and then another pair of double quotes is put around the field by to_csv(). And since you put 'your' double quotes at start and end of the field, you'll end up with triple double quotes.
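The escaping can be seen in isolation with a small sketch. It uses Python's csv.writer as a stand-in for to_csv(), on the assumption that both follow the same quoting rule; 'London' is a made-up value and `write_row` is a hypothetical helper.

```python
import csv
import io

def write_row(fields):
    # Write one row to an in-memory buffer and return the resulting text.
    buf = io.StringIO()
    csv.writer(buf, lineterminator='\n').writerow(fields)
    return buf.getvalue()

manual = write_row(['"London"'])  # quotes added by hand, as in the question
plain = write_row(['London'])     # the writer decides whether quotes are needed

print(manual)
print(plain)
```

The hand-added quotes are doubled and the field is then wrapped in one more pair, which gives the triple quotes; the plain value needs no quotes at all.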
Solutions:
If you want the CSV reading process to produce a string with single double quotes around it: the triple double quotes are correct.
If you want the CSV reading process to produce a string without quotes around it: let to_csv() handle the outer quotes around the field.
If you need a different variant of the CSV format (there are a lot of options), you need to edit the options of to_csv().
Try changing
df['Post Town'] = '"' + df['Post Town'].astype(str) + '"'
to
df['Post Town'] = df['Post Town'].astype(str)
I have a specific CSV file; I think its format is standard for PHP, because it comes from PHP code.
I'm trying to use pandas to remove certain columns (200+ columns), but I need to preserve the quotes in both the header line and all other lines.
Shortened header line:
name, "Full Name", "Suggested Name", id
(so spaces are escaped with double quotes in the header line)
And data:
blah, "Very, Blah Line", "Not Suggested", 2
So there are commas and spaces within the columns, and these are escaped with quotes.
If I use pandas read_csv, it reads the data correctly but then saves everything with quotes, meaning it changes the header line to:
"name", "Full Name", "Suggested Name", "id"
And same with data.
This breaks some of our environments, and I can't have that in CSV.
If I use no quoting, it removes all the quotes from the header line and the other lines, and then the spaces become a problem.
Any suggestions are welcome.
Use the correct quoting constant from the csv module in your [pd.to_csv(...)](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html) call.
You most probably need either QUOTE_MINIMAL or QUOTE_NONNUMERIC:
QUOTE_MINIMAL : quotes only where needed
QUOTE_NONNUMERIC : quotes all non-numerics
You probably need QUOTE_MINIMAL (because blah is not quoted):
your_df.to_csv('some.txt', quoting=csv.QUOTE_MINIMAL)
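To compare what the quoting constants produce, here is a small sketch using csv.writer from the standard library as a stand-in for to_csv(), which accepts the same constants. The row is the data line from the question without the spaces after the commas, and `render` is a hypothetical helper.

```python
import csv
import io

def render(row, quoting):
    # Write one row with the given quoting mode and return it without the newline.
    buf = io.StringIO()
    csv.writer(buf, quoting=quoting, lineterminator='\n').writerow(row)
    return buf.getvalue().rstrip('\n')

row = ['blah', 'Very, Blah Line', 'Not Suggested', 2]

minimal = render(row, csv.QUOTE_MINIMAL)
nonnumeric = render(row, csv.QUOTE_NONNUMERIC)
quote_all = render(row, csv.QUOTE_ALL)

print(minimal)
print(nonnumeric)
print(quote_all)
```

Note that QUOTE_MINIMAL quotes only the field containing a comma; a field that merely contains a space, such as 'Not Suggested', is written without quotes.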
It seems it was easier than I thought; I was focusing on the delimiter instead of on escape chars.
This worked in my case:
new_f.to_csv("output.csv", sep=',', escapechar=' ', quotechar='"', quoting=csv.QUOTE_MINIMAL, index=False)
I am currently trying to write a dataframe to a csv using to_csv. The input and output for the data is below. How do I write the to_csv to ensure that the fields with commas still get double quoted but the row with Katie doesn't get additional double quotes?
Input:
Title
Johnny,Appleseed
Lauren,Appleseed
Katie_"Appleseed"
Output:
Title
"Johnny,Appleseed"
"Lauren,Appleseed"
Katie_""Appleseed""
Code:
df.to_csv(r"filelocation", sep=',')
Escaping quotes with double quotes is part of the CSV standard:
"7. If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote."
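A short round-trip sketch shows that the doubled quotes are only a file-level encoding. It uses the standard csv module on an in-memory buffer instead of a DataFrame, which is an assumption made to keep the example self-contained; the titles are the ones from the question.

```python
import csv
import io

titles = ['Johnny,Appleseed', 'Lauren,Appleseed', 'Katie_"Appleseed"']

# Write each title as a one-column row.
buf = io.StringIO()
writer = csv.writer(buf, lineterminator='\n')
for title in titles:
    writer.writerow([title])
written = buf.getvalue()
print(written)

# Read the text back; the reader undoes the quoting and the doubling.
read_back = [row[0] for row in csv.reader(io.StringIO(written))]
print(read_back)
```

Any CSV reader that follows the same rule recovers the original values, so the doubled quotes never reach the parsed data.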
I have seen a number of related questions but none have directly addressed what I am trying to do.
I am reading in lines of text from a CSV file.
All the items are in quotes and some have additional commas within the quotes.
I would like to split the line along commas, but ignore the commas within quotes.
Is there a way to do this within Python that does not require a number of regex statements?
An example is:
"114111","Planes,Trains,and Automobiles","50","BOOK"
which I would like parsed into 4 separate variables of values:
"114111" "Planes,Trains,and Automobiles" "50" "BOOK"
Is there a simple option in line.split() that I am missing?
If you want to read lines from a CSV file, use Python's csv module from the standard library, which will handle the quoted comma separated values.
Example
# cat test.py
import csv
with open('some.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
# cat some.csv
"114111","Planes,Trains,and Automobiles","50","BOOK"
# python test.py
['114111', 'Planes,Trains,and Automobiles', '50', 'BOOK']
[]
You can probably split on "," (that is, [quote][comma][quote]).
The other option is to come up with an escape character, so if somebody wants to embed a comma in the string they write \c, and if they want a backslash they write \\. Then you split the string and unescape each piece before processing.
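The quote-comma-quote split can be sketched as follows, with the csv module as a cross-check. It is only valid under the assumptions that every field is quoted and that no field contains the "," sequence itself; the variable names are illustrative.

```python
import csv

line = '"114111","Planes,Trains,and Automobiles","50","BOOK"'

# Drop the outermost quotes, then split on the quote-comma-quote sequence.
by_split = line[1:-1].split('","')

# Cross-check against the csv module, which handles the general case.
by_csv = next(csv.reader([line]))

print(by_split)
print(by_csv)
```

Both approaches agree on this line, but the split breaks as soon as a field is unquoted or contains an escaped quote, which the csv module handles.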