How to parse a string using a CSV parser in Python? - python

I need to parse a string using a CSV parser. I've found this solution in many places, but it doesn't work for me. I was using Python 3.4, now I changed it to 2.7.9 and still nothing...
import csv
import StringIO
csv_file = StringIO.StringIO(line)
csv_reader = csv.reader(csv_file)
for data in csv_reader:
    # do something
Could anyone suggest another way to parse this string using a CSV parser? Or how can I make this work?
Note: I have a string in CSV format with fields that contain commas, which is why I can't split it the standard way.

You need to put double quotes around elements that contain commas.
The CSV format is described by RFC 4180, which states:
Fields containing line breaks (CRLF), double quotes, and commas
should be enclosed in double-quotes.
So for instance:
import StringIO
import csv

# the text between double quotes will be treated
# as a single element and not parsed by commas
line = '1,2,3,"1,2,3",4'
csv_file = StringIO.StringIO(line)
csv_reader = csv.reader(csv_file)
for data in csv_reader:
    # output: ['1', '2', '3', '1,2,3', '4']
    print data
As another option, you can change the delimiter. The default for csv.reader is delimiter=',' and quotechar='"' but both of these can be changed depending on your needs.
Semicolon Delimiter:
line = '1;2;3;1,2,3;4'
csv_file = StringIO.StringIO(line)
csv_reader = csv.reader(csv_file, delimiter=';')
for data in csv_reader:
    # output: ['1', '2', '3', '1,2,3', '4']
    print data
Vertical Bar Quotechar:
line = '1,2,3,|1,2,3|,4'
csv_file = StringIO.StringIO(line)
csv_reader = csv.reader(csv_file, quotechar='|')
for data in csv_reader:
    # output: ['1', '2', '3', '1,2,3', '4']
    print data
Also, the Python csv module works on Python 2.6 through 3.x, so the version itself shouldn't be the problem.
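As a side note, the original snippet fails on Python 3 because the StringIO module was moved to io there. A version-agnostic sketch (and note that csv.reader accepts any iterable of strings, so a plain list works too):

```python
import csv

try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO        # Python 3

line = '1,2,3,"1,2,3",4'
for data in csv.reader(StringIO(line)):
    print(data)  # ['1', '2', '3', '1,2,3', '4']

# csv.reader accepts any iterable of strings, so StringIO is optional:
row = next(csv.reader([line]))
```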

The obvious solution, rather than reimplementing CSV parsing, is to preprocess the data first: replace every comma inside a string with some never-used token character (or even the word COMMA), feed the result to the CSV parser, and then walk back through the parsed data replacing the tokens with commas.
Sorry, I haven't tried this myself in Python, but I had issues with quotes in my data in another language, and that's how I solved it.
Also, Bcorso's answer is much more complete; mine is just a quick hack to get around a common limitation.

Related

Python CSV Reader splitting on comma inside of quotes

import csv
csv_reader_results = csv.reader(["办公室弥漫着\"女红\"缝扣子.编蝴蝶结..手绣花...呵呵..原来做 些也会有幸福的感觉,,,,用心做东西感觉真好!!!"],
                                escapechar='\\',
                                quotechar='"',
                                delimiter=',',
                                quoting=csv.QUOTE_ALL,
                                skipinitialspace=True)
for result in csv_reader_results:
    print result[0]
What I'm expecting is:
办公室弥漫着"女红"缝扣子.编蝴蝶结..手绣花...呵呵..原来做 些也会有幸福的感觉,,,,用心做东西感觉真好!!!
But what I'm getting is:
办公室弥漫着"女红"缝扣子.编蝴蝶结..手绣花...呵呵..原来做 些也会有幸福的感觉
Because it splits on the four commas inside the sentence.
I'm escaping the quotes inside of the sentence. I've set the quotechar and escapechar for csv.reader. What am I doing wrong here?
Edit:
I used the answer by j6m8 https://stackoverflow.com/a/19881343/3945463 as a workaround. But it would be preferable to learn the correct way to do this with csv reader.
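For reference, one likely culprit: with the default dialect the csv module expects embedded quotes to be escaped by doubling them (doublequote=True), not with a backslash, so data written the RFC 4180 way parses without any escapechar. A small sketch with made-up ASCII data:

```python
import csv

# Embedded quotes are doubled, per RFC 4180; the field is wrapped in quotes.
line = '"He said ""hi"", then left",next'
row = next(csv.reader([line]))
print(row)  # ['He said "hi", then left', 'next']
```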

csv module automatically writing unwanted carriage returns

When using Python's csv module to create a CSV, it automatically puts carriage return characters at the end of strings if the string has a comma inside it, e.g.:
['this one will have a carriage return, at the end','this one wont']
in an Excel sheet this will turn out like:
| |this on|
because of the extra carriage return. It will also surround the string containing the comma with double quotes, as expected.
The code I am using is:
with open(oldfile, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    for row in data:
        writer.writerow(row)
How do I create a CSV from the same data which won't have carriage returns when the strings have commas inside? I don't mind the strings being surrounded by double quotes, though.
Here's a link to the diagnosis of the problem with the output .csv:
Excel showing empty cells when importing file created with csv module
It's the accepted answer.
I have changed my code to:
with open(oldfile, 'w', newline='', quoting=csv.QUOTE_MINIMAL) as csvfile:
    writer = csv.writer(csvfile)
    for row in data:
        writer.writerow(row)
I am now getting the error:
TypeError: 'quoting' is an invalid keyword argument for this function
The built-in csv module of Python has the option csv.QUOTE_MINIMAL. When this option is passed as an argument to the writer (not to open, which is what causes the TypeError), it adds quote marks when the delimiter is in the given string: "your text, with comma", "other field". This eliminates the need for carriage returns.
The code is:
with open(oldfile, 'w') as csvfile:
    writer = csv.writer(csvfile, quoting=csv.QUOTE_MINIMAL)
    for row in data:
        writer.writerow(row)

Reading ASCII with field delimiter as ctrl A and line delimiting as \n into python

I have an ASCII dataset that has ctrl A field delimiting and \n as the line delimiter. I am looking to read this into Python and am wondering how to deal with it. In particular I would like to be able to read this information into a pandas dataframe.
I currently have;
import pandas as pd
input = pd.read_csv('000000_0', sep='^A')
The error I then get is
_main__:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does
not support regex separators; you can avoid this warning by specifying engine='python'.
I also don't know how to specify the line delimiter.
Any ideas?
Thanks in advance!
Instead of mentioning "^A", mention the hex code. It works like a charm:
import pandas as pd
data = pd.read_csv('000000_0', sep='\x01')
Use pd.read_csv with parameter sep=chr(1)
from io import StringIO
import pandas as pd

mycsv = """a{0}b{0}c
d{0}e{0}f""".format(chr(1))
pd.read_csv(StringIO(mycsv), sep=chr(1))

   a  b  c
0  d  e  f
If by CTRL+A you mean the ASCII-Code for SOH (start of heading), try splitting your data on newline first to get the rows, and split these on "\x01", which is the hex code for SOH. But without any code, data, expected result or error message, this is mostly guessing.
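That splitting suggestion could look like this (with made-up sample data):

```python
# "\x01" is the SOH control character produced by Ctrl-A.
data = "a\x01b\x01c\nd\x01e\x01f"
rows = [line.split("\x01") for line in data.split("\n")]
print(rows)  # [['a', 'b', 'c'], ['d', 'e', 'f']]
```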
Try this
reader = csv.reader(open("/Users/778123/Documents/Splunk/data/DMS3^idms_core^20200723140421.csv",newline=None), delimiter=',')
print(reader)
writer = csv.writer(open("/Users/778123/Documents/Splunk/data/DMS3^idms_core^test.csv", 'w'), delimiter=chr(1), quoting=csv.QUOTE_NONNUMERIC)
writer.writerows(reader)
Python's csv library is pretty good at reading delimited files ;-)
Taking an example from the docs linked above:
import csv
with open('eggs.csv', 'rb') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in spamreader:
        print ', '.join(row)
This will automatically iterate over the lines in the file (thus handle the newline characters), and you can set the delimiter as shown.

Writing to a CSV without getting quote marks, escapechar error

I have an output I am writing to a CSV. I need to add csv.QUOTE_NONE but I can't seem to find the right location without it producing an error.
variable:
variable = ['20', '10', '30,30']
Note: some of the variables I am using will contain strings, e.g. ['Test', 'Output', '100']
code:
with open('file.csv', 'w') as csv_file:
    writerc = csv.writer(csv_file)
    writerc.writerow(variable)
When using the above code, it produces the following line in the CSV.
20,10,"30,30"
The required write is:
20,10,30,30
If I use quoting=csv.QUOTE_NONE I get an escapechar error _csv.Error: need to escape, but no escapechar set - this is resolved if I set an escapechar but this then adds a character in place of the quotation marks.
Any ideas?
You could try further splitting your data before writing it. This would avoid it needing to use quote characters automatically.
It works by creating a new list of values, possibly containing multiple new split entries; for example your '30,30' would become ['30', '30']. Next it uses Python's chain function to flatten these sub-lists back into a single list, which can then be written to your output CSV file.
import itertools
import csv

data = [['20', '10', '30,30'], ['Test', 'Output', '100']]

with open('file.csv', 'wb') as f_output:
    csv_output = csv.writer(f_output)
    for line in data:
        csv_output.writerow(list(itertools.chain.from_iterable(v.split(',') for v in line)))
This would give you the following file.csv:
20,10,30,30
Test,Output,100
I think the problem lies in the fact that .csv means Comma-Separated Values, so it treats "," inside a value as the separator or delimiter. That's why the double quotes are added automatically as escaping.
I suggest you use the pandas library which makes it easier to deal with this issue.
Code
import pandas as pd
df = pd.DataFrame({'variable' : ['20', '10', '30,30']})
# Note that I use '\t' as the separator instead of ',' and get a .tsv file which
# is essentially the same as .csv file except the separator.
df.to_csv(sep = '\t', path_or_buf='file.tsv', index=False)
You can see the differences between using these two separators in the full script. One other thing: your code suggests that you use variable as the name of the column, but your output suggests you use it as the name of the row (or index). Anyway, my answer is based on the assumption that variable is the name of the column. Hope it helps ;)

How to convert tab separated, pipe separated to CSV file format in Python

I have a text file (.txt) which could be in tab-separated or pipe-separated format, and I need to convert it into CSV file format. I am using Python 2.6. Can anyone suggest how to identify the delimiter in the text file, read the data, and then convert it into a comma-separated file?
Thanks in advance
I fear that you can't identify the delimiter without knowing what it is. The problem with CSV is, that, quoting ESR:
the Microsoft version of CSV is a textbook example of how not to design a textual file format.
The delimiter needs to be escaped in some way if it can appear in fields. Without knowing, how the escaping is done, automatically identifying it is difficult. Escaping could be done the UNIX way, using a backslash '\', or the Microsoft way, using quotes which then must be escaped, too. This is not a trivial task.
So my suggestion is to get full documentation from whoever generates the file you want to convert. Then you can use one of the approaches suggested in the other answers or some variant.
Edit:
Python provides csv.Sniffer that can help you deduce the format of your DSV. If your input looks like this (note the quoted delimiter in the first field of the second row):
a|b|c
"a|b"|c|d
foo|"bar|baz"|qux
You can do this:
import csv
csvfile = open("csvfile.csv")
dialect = csv.Sniffer().sniff(csvfile.read(1024))
csvfile.seek(0)
reader = csv.DictReader(csvfile, dialect=dialect)
for row in reader:
    print row,
    # => {'a': 'a|b', 'c': 'd', 'b': 'c'} {'a': 'foo', 'c': 'qux', 'b': 'bar|baz'}
# write records using other dialect
Your strategy could be the following:
parse the file with BOTH a tab-separated csv reader and a pipe-separated csv reader
calculate some statistics on the resulting rows to decide which result set is the one you want to write. One idea is counting the total number of fields in the two result sets (expecting that tabs and pipes are not so common inside fields). Another (if your data is strongly structured and you expect the same number of fields in each line) is measuring the standard deviation of the number of fields per line and taking the result set with the smallest standard deviation.
In the following example you find the simpler statistic (total number of fields)
import csv

piperows = []
tabrows = []

# parsing | delimiter
f = open("file", "rb")
readerpipe = csv.reader(f, delimiter="|")
for row in readerpipe:
    piperows.append(row)
f.close()

# parsing TAB delimiter
f = open("file", "rb")
readertab = csv.reader(f, delimiter="\t")
for row in readertab:
    tabrows.append(row)
f.close()

# in this example, we use the total number of fields as the indicator
# (but it's not guaranteed to work! it depends on the nature of your data)
totfieldspipe = reduce(lambda x, y: x + y, [len(f) for f in piperows])
totfieldstab = reduce(lambda x, y: x + y, [len(f) for f in tabrows])

if totfieldspipe > totfieldstab:
    yourrows = piperows
else:
    yourrows = tabrows
# the var yourrows contains the rows, now just write them in any format you like
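The standard-deviation variant mentioned above could be sketched like this (the helper name and sample lines are my own invention, using the Python 3 statistics module):

```python
import csv
import statistics

def field_count_stdev(rows):
    # For strongly structured data, the right delimiter yields a steady
    # number of fields per line, i.e. a small spread.
    counts = [len(row) for row in rows]
    return statistics.pstdev(counts) if counts else float('inf')

lines = ["a|b|c", "d|e\tf|g", "h|i|j"]
piperows = list(csv.reader(lines, delimiter="|"))
tabrows = list(csv.reader(lines, delimiter="\t"))

# pipe parsing gives 3 fields on every line (stdev 0); tab parsing is uneven
winner = "|" if field_count_stdev(piperows) < field_count_stdev(tabrows) else "\t"
print(winner)  # |
```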
Like this:
from __future__ import with_statement
import csv
import re

with open(input, "r") as source:
    with open(output, "wb") as destination:
        writer = csv.writer(destination)
        for line in source:
            writer.writerow(re.split('[\t|]', line))
I would suggest taking some of the example code from the existing answers (or, perhaps better, using Python's csv module) and changing it to first assume tab-separated, then pipe-separated, producing two comma-separated output files. Then you visually examine both files, determine which one you want, and pick that.
If you actually have lots of files, then you need to try to find a way to detect which file is which.
One of the examples has this:
if "|" in line:
This may be enough: if the first line of a file contains a pipe, then maybe the whole file is pipe-separated; otherwise assume a tab-separated file.
Alternatively fix the file to contain a key field in the first line which is easily identified - or maybe the first line contains column headers which can be detected.
for line in open("file"):
    line = line.strip()
    if "|" in line:
        print ','.join(line.split("|"))
    else:
        print ','.join(line.split("\t"))
