I need to parse a large CSV file (e.g. converting it to a pandas DataFrame). It is an unquoted CSV with a comma as the delimiter. I received the file as .txt and changed the extension to .csv. I now see that some of the fields contain free text with commas in it. I was thinking of using a heuristic: a delimiter comma is never followed by a space, while a free-text comma will, in most cases, be followed by one.
The problem is that escapechar=' ' marks the character following the escape, while I need it to escape the preceding character.
Is there a way to mark a reverse-escape char?
I was considering the alternative of replacing every ", " with "#$#$#$#", but the file is 3 GB and that feels super inefficient.
Another option is to send the file back, complaining that it's malformed. Problem is that it will hurt my pride.
Thanks!
You can add a wrapper to massage each line given to the csv reader:
# foo.csv:
col1,col2,col3, with, commas,col4

# python file:
import csv

def escape_commas(filelike):
    for line in filelike:
        yield line.replace(', ', '\\, ')

with open('foo.csv', newline='') as csvfile:
    reader = csv.reader(escape_commas(csvfile), escapechar='\\')
    for row in reader:
        print('|'.join(row))

# result:
col1|col2|col3, with, commas|col4
Edit: for pandas you might want to make a wrapper for the file that implements the read method:
import pandas as pd

class EscapeCommas():
    def __init__(self, file):
        self.file = file

    def read(self, size=-1, /):
        text = self.file.read(size)
        return text.replace(', ', '\\, ')

with open('foo.csv', newline='') as csvfile:
    df = pd.read_csv(EscapeCommas(csvfile), escapechar='\\')
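One caveat with the chunked read (my note, not part of the original answer): pd.read_csv may call read(size) with a fixed size, and a ', ' pair that straddles two chunks would slip past the replace. Here is a sketch that guards against that, assuming pandas only ever calls read():

import pandas as pd

# Hypothetical variant (my addition): extends a chunk past a trailing comma
# so a ', ' pair can never be split across two read() calls.
class EscapeCommasChunked():
    def __init__(self, file):
        self.file = file

    def read(self, size=-1, /):
        text = self.file.read(size)
        # If the chunk ends in ',', the matching ' ' may be in the next
        # chunk; keep reading one character until it no longer does.
        while size != -1 and text.endswith(','):
            nxt = self.file.read(1)
            if not nxt:  # end of file
                break
            text += nxt
        return text.replace(', ', '\\, ')

with open('foo.csv', newline='') as csvfile:
    df = pd.read_csv(EscapeCommasChunked(csvfile), escapechar='\\')

Alternatively, pandas interprets a multi-character sep as a regular expression (using the Python parsing engine), so the question's heuristic can be expressed directly: pd.read_csv('foo.csv', sep=r',(?!\s)', engine='python') splits only on commas that are not followed by whitespace.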
I'm trying to solve a problem from the pyschools website that asks me to write a script that reads a CSV file with commas ',' as the delimiter and returns a list of records. When I run my script on their website it is marked incorrect, using a test case of:
csvReader('books.csv')[0], which returns:
['"Pete,Zelle","Intro to HTML, CSS",2011']
when the expected result is:
['Pete,Zelle', 'Intro to HTML, CSS', '2011']
I've noticed that the problem has to do with the quotation marks " and ', but I still haven't come up with the right answer. Using replace('"','') on the line variable to remove the double quotes does not fix it, as it returns:
['Pete,Zelle,Intro to HTML, CSS,2011']
where all the quotation marks are stripped and the fields run together, e.g. Zelle, instead of Zelle',.
Below I'll provide a link to the exercise, the problem, and my current script. Any explanation or help is greatly appreciated.
link:
http://www.pyschools.com/quiz/view_question/s13-q8
problem:
Write a function to read a CSV file with ',' as delimiter and returns a list of records.
The function must be able to ignore ',' that are within a pair of double quotes '"'.
script:
def csvReader(filename):
    records = []
    for line in open(filename):
        line = line.rstrip()  # strip '\n'
        if line == '","':
            continue  # ignore empty line
        records.append([line.replace('"', '')])
    return records
It sounds as though you need to separate the fields while ignoring any delimiters that fall in between quotation marks.
In this case I would recommend the csv library and setting the quote character.
import csv
record = '"Pete,Zelle","Intro to HTML, CSS",2011'
newStr = [ '"{}"'.format(x) for x in list(csv.reader([record], delimiter=',', quotechar='"'))[0] ]
print(newStr)
Will return ['"Pete,Zelle"', '"Intro to HTML, CSS"', '"2011"']
In your function you could incorporate this as below
import csv

def csvReader(filename):
    records = []
    for line in open(filename):
        line = line.rstrip()  # strip '\n'
        if line == '","':
            continue  # ignore empty line
        newLine = ['"{}"'.format(x) for x in list(csv.reader([line], delimiter=',', quotechar='"'))[0]]
        records.append(newLine)
    return records
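Note that the pyschools test expects the fields without the re-added quotes, so a variant that skips the '"{}"'.format(x) step returns the expected result directly (a sketch, assuming blank lines are the only lines to skip):

import csv

def csvReader(filename):
    records = []
    with open(filename) as f:
        for row in csv.reader(f, delimiter=',', quotechar='"'):
            if row:  # csv.reader yields [] for blank lines
                records.append(row)
    return records

# csvReader('books.csv')[0] -> ['Pete,Zelle', 'Intro to HTML, CSS', '2011']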
Batteries are included, as usual, with Python. Here's using the standard lib csv module:
import csv

with open(path, "r") as f:
    csv_reader = csv.reader(f, delimiter=",")
    for row_number, row in enumerate(csv_reader):
        print(f"{row_number} => {row}")
If the stdlib isn't available for some strange reason, you'll need to tokenize each line yourself into 'delimiters', 'separators', and 'cell values'. Again, this would be trivial with the stdlib (import re).
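Here is a minimal sketch of that re-based version (my reading of the hint; it assumes no escaped quotes inside cells and no empty unquoted fields):

import re

# a cell is either a quoted run (commas allowed inside) or a run of non-commas
CELL = re.compile(r'"([^"]*)"|([^,]+)')

def split_line(line):
    return [quoted if quoted else plain
            for quoted, plain in CELL.findall(line.rstrip('\n'))]

print(split_line('"Pete,Zelle","Intro to HTML, CSS",2011'))
# -> ['Pete,Zelle', 'Intro to HTML, CSS', '2011']

But let's pretend you have no batteries at all, just plain Python.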
You'll need to realize that how you treat each character of each line depends on a "context", and that that context is built up by all the previous characters. Using a stack is advised here: you push and pop states (aka contexts) depending on the current context (the top of your stack) and the current character you're handling. Given a context, you may then process each character differently:
class State:
    IN_NON_DELIMITED_CELL = 1
    IN_DELIMITED_CELL = 2

def get_cell_values(line, quotechar='"', separator=','):
    stack = []
    stack.append(State.IN_NON_DELIMITED_CELL)
    cell_values = [""]
    for character in line:
        current_state = stack[-1]
        if current_state == State.IN_NON_DELIMITED_CELL:
            if character == quotechar:
                stack.append(State.IN_DELIMITED_CELL)
            elif character == separator:
                cell_values.append("")
            else:
                cell_values[-1] += character
        elif current_state == State.IN_DELIMITED_CELL:
            if character == quotechar:
                stack.pop()
            else:
                cell_values[-1] += character
    return cell_values

with open(path, "r") as f:
    for line in f:
        cell_values = get_cell_values(line.rstrip('\n'), quotechar='"', separator=',')
        print(cell_values)
This is a good starting point:
print(get_cell_values('"this","is",an,example,of,"doing things, the hard way?"'))
# prints:
['this', 'is', 'an', 'example', 'of', 'doing things, the hard way?']
For taking this (MUCH) further, look into these topics: tokenizing strings, LL+LR parsers, recursive descent, shift-reduce parsers.
I want to export some data from a DB to a CSV file, and I need to add a '|' delimiter to specific fields. At the moment when I export the file I use something like this:
- To specific fields (at the beginning and end) I add '|':
....
if response.value_display.startswith('|'):
    sheets[response.sheet.session][response.input.id] = response.value_display
else:
    sheets[response.sheet.session][response.input.id] = '|' + response.value_display + '|'
....
And I have the CSV writer configured like this:
self.writer = csv.writer(self.queue, dialect=dialect,
                         lineterminator='\n',
                         quotechar='',
                         quoting=csv.QUOTE_NONE,
                         escapechar=' ',
                         **kwargs)
Now it works, but when I have DateTime fields (which contain a space) the writer adds an extra space. And when I use the default settings, the CSV writer sometimes adds double quotes at the beginning and end, but I don't know why or what it depends on.
To remove your extra spaces I would just do something like:

file = open("the_file.csv", "r+")  # open your csv file
file.write(file.readline().replace("  ", " "))  # finds any two spaces and replaces with one
file.close()
With the delimiter it is specific to the situation. If you want to add it at the end:
delimiter = "|"
my_str = my_str + delimiter
or at the beginning:
delimiter = "|"
my_str = delimiter + my_str
If you want to add the delimiter somewhere else you may have to get creative as it would be based on the context.
I'm not sure about the double quotes; I'd replace them like the spaces.

file = open("the_file.csv", "r+")  # open your csv file
file.write(file.readline().replace("\"", "'"))  # swap double quotes for single quotes
file.close()
Assuming you wanted to replace the double quote with a single quote.
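On the double quotes (a note of mine, not from the answer above): csv.writer defaults to quoting=csv.QUOTE_MINIMAL, which wraps a field in the quotechar whenever the field contains the delimiter, the quotechar, or a newline; that is what it depends on. A quick demonstration:

import csv
import io

buf = io.StringIO()
# only the field containing the delimiter gets quoted; spaces alone do not trigger quoting
csv.writer(buf).writerow(['plain', 'has,comma', '2021-01-01 12:00'])
print(buf.getvalue())  # plain,"has,comma",2021-01-01 12:00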
I have my CSV file formatted with all columns nicely aligned by using one or more tabs in between the different values.
I know it is possible to use a single tab as delimiter with csv.register_dialect("tab_delimiter", delimiter="\t"), but this only works with exactly one tab between the values. I would like to process the file keeping its format, i.e., not deleting duplicate tabs. Each field (row, column) contains a value.
Is it possible to use a run of one or more tabs as the delimiter, or to ignore additional tabs without affecting the numbering of the values in a row? row[1] should be the second value independent of how many tabs are in between it and row[0].
##Sample.txt
##ID name Age
##1 11 111
##2 22 222
import pandas as pd

df = pd.read_csv('Sample.txt', sep=r'\t+')
print(df)
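A side note of mine: the default C engine of pandas does not support regex separators, so with sep=r'\t+' pandas falls back to the Python engine and emits a ParserWarning; passing engine='python' makes the choice explicit and silences the warning:

import pandas as pd

# engine='python' is required for regex separators such as r'\t+'
df = pd.read_csv('Sample.txt', sep=r'\t+', engine='python')
print(df)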
Assuming that there will never be empty fields, you can use a generator to collapse the duplicate tabs in the incoming CSV file and then use the csv module as usual:
import csv

def de_dup(f, delimiter='\t'):
    for line in f:
        yield delimiter.join(field for field in line.split(delimiter) if field)

with open('data.csv') as f:
    for row in csv.reader(de_dup(f), delimiter='\t'):
        print(row)
An alternative way is to use re.sub() in the generator:
import re

def de_dup(f, delimiter='\t'):
    for line in f:
        yield re.sub(r'{}{{2,}}'.format(delimiter), delimiter, line)
but this still has the limitation that all fields must contain a value.
The most convenient way for me to deal with the multiple tabs was an additional function that takes the row and removes the empty values/fields that are created by multiple tabs in a row. This doesn't affect the formatting of the CSV file, and I can access the second value in the row with row[1] even with multiple tabs before it.
def remove_empty(line):
    result = []
    for i in range(len(line)):
        if line[i] != "":
            result.append(line[i])
    return result
And in the code where I read the file and process the values:
for row in reader:
    row = remove_empty(row)
    # ... continue processing normally
I think this solution is similar to mhawke's, but with his solution I couldn't access the same values with row[i] as before (i.e., with only one delimiter in between each value).
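For what it's worth, the same helper can be written as a one-line list comprehension (equivalent behavior, just more compact):

for row in reader:
    row = [field for field in row if field]  # drop the empty strings left by repeated tabs
    # ... continue processing normally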
Alternatively, the completely general solution for any type of repeated separator is to repeatedly replace each double separator with a single separator until none remain, and write to a new file (although it is slow for gigabyte-sized CSV files):
def replaceMultipleSeparators(fileName, oldSeparator, newSeparator):
    linesOfCsvInputFile = open(fileName, encoding='utf-8', mode='r').readlines()
    csvNewFileName = fileName + ".new"
    print('Writing: %s replacing %s with %s' % (csvNewFileName, oldSeparator, newSeparator))
    outputFileStream = open(csvNewFileName, 'w')
    for line in linesOfCsvInputFile:
        newLine = line.rstrip()
        processedLine = ""
        # collapse runs of the old separator down to a single occurrence
        while newLine != processedLine:
            processedLine = newLine
            newLine = processedLine.replace(oldSeparator + oldSeparator, oldSeparator)
        newLine = newLine.replace(oldSeparator, newSeparator)
        outputFileStream.write(newLine + '\n')
    outputFileStream.close()
which, given input testFile.csv, will generate testFile.csv.new with TABs replaced by PIPEs if you run:

replaceMultipleSeparators('testFile.csv', '\t', '|')
Sometimes you will need to replace the 'utf-8' encoding with 'latin-1' for some Microsoft US-generated CSV files; see errors mentioning byte 0xe4 for this issue.
There seems to be something on this topic already (How to replace all those Special Characters with white spaces in python?), but I can't figure this simple task out for the life of me.
I have a .CSV file with 75 columns and almost 4000 rows. I need to replace all the 'special characters' ($ # & * etc.) with '_' and write to a new file. Here's what I have so far:
import csv

input = open('C:/Temp/Data.csv', 'rb')
lines = csv.reader(input)
output = open('C:/Temp/Data_out1.csv', 'wb')
writer = csv.writer(output)
conversion = '-"/.$'
text = input.read()
newtext = '_'
for c in text:
    newtext += '_' if c in conversion else c
    writer.writerow(c)
input.close()
output.close()
All this succeeds in doing is writing everything to the output file as a single column, producing over 65K rows. Additionally, the special characters are still present!
Sorry for the redundant question.
Thank you in advance!
I might do something like
import csv

with open("special.csv", "rb") as infile, open("repaired.csv", "wb") as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    conversion = set('_"/.$')
    for row in reader:
        newrow = [''.join('_' if c in conversion else c for c in entry) for entry in row]
        writer.writerow(newrow)
which turns
$ cat special.csv
th$s,2.3/,will-be
fixed.,even.though,maybe
some,"shoul""dn't",be
(note that I have a quoted value) into
$ cat repaired.csv
th_s,2_3_,will-be
fixed_,even_though,maybe
some,shoul_dn't,be
Right now, your code is reading the entire text into one big string:
text = input.read()
Starting from a _ character:
newtext = '_'
Looping over every single character in text:
for c in text:
Add the corrected character to newtext (very slowly):
newtext += '_' if c in conversion else c
And then write the original character (?), as a column, to a new csv:
writer.writerow(c)
.. which is unlikely to be what you want. :^)
This doesn't seem to need to deal with CSVs in particular (as long as the special characters aren't your column delimiters).
lines = []
with open('C:/Temp/Data.csv', 'r') as input:
    lines = input.readlines()

conversion = '-"/.$'
newtext = '_'
outputLines = []
for line in lines:
    temp = line[:]
    for c in conversion:
        temp = temp.replace(c, newtext)
    outputLines.append(temp.rstrip('\n'))  # drop the trailing newline so it isn't doubled on write

with open('C:/Temp/Data_out1.csv', 'w') as output:
    for line in outputLines:
        output.write(line + "\n")
In addition to the bug pointed out by @Nisan.H and the valid point made by @dckrooney that you may not need to treat the file in a special way in this case just because it is a CSV file (but see my comment below):
writer.writerow() should take a sequence of strings, each of which would be written out separated by commas (see here). In your case you are writing a single string.
This code is setting up to read from 'C:/Temp/Data.csv' in two ways - through input and through lines but it only actually reads from input (therefore the code does not deal with the file as a CSV file anyway).
The code appends characters to newtext and writes out each version of that variable. Thus, the first version of newtext would be 1 character long, the second 2 characters long, the third 3 characters long, etc.
Finally, given that a CSV file can have quote marks in it, it may actually be necessary to deal with the input file specifically as a CSV to avoid replacing quote marks that you want to keep, e.g. quote marks that are there to protect commas that exist within fields of the CSV file. In that case, it would be necessary to process each field of the CSV file individually, then write each row out to the new CSV file.
Maybe try
s = open('myfile.csv', 'r').read()
chars = ('$', '%', '^', '*')  # etc
for c in chars:
    s = '_'.join(s.split(c))

out_file = open('myfile_new.csv', 'w')
out_file.write(s)
out_file.close()
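A small aside (mine, not the answerer's): '_'.join(s.split(c)) is equivalent to a plain str.replace, which reads more directly:

s = open('myfile.csv', 'r').read()
for c in ('$', '%', '^', '*'):  # etc
    s = s.replace(c, '_')  # same effect as '_'.join(s.split(c))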
I am trying to read a bunch of data in a .csv file into an array in the format:
[ [a,b,c,d], [e,f,g,h], ...]
Running the code below, when I print an entry that contains a space (' '), the way I'm accessing the element isn't correct because it stops at the first space (' ').
For example, if Business, Fast Company, Youtube, fastcompany is the 10th entry, when I print it as below I get, on separate lines:
Business,Fast
Company,YouTube,FastCompany
Any advice on how to get as the result: [ [a,b,c,d], [Business, Fast Company, Youtube, fastcompany], [e,f,g,h], ...]?
import csv

partners = []
partner_dict = {}
i = 9
with open('partners.csv', 'rb') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in spamreader:
        partners.append(row)
    print len(partners)
    for entry in partners[i]:
        print entry
The delimiter argument specifies which character to use to split each row of the file into separate values. Since you're passing ' ' (a space), the reader is splitting on spaces.
If this is really a comma-separated file, use ',' as the delimiter (or just leave the delimiter argument out and it will default to ',').
Also, the pipe character is an unusual value for the quote character. Is it really true that your input file contains pipes in place of quotes? The sample data you supplied contains neither pipes nor quotes.
There are a few issues with your code:
The "correct" syntax for iterating over a list is for entry in partners:, not for entry in partners[i]:
The partners_dict variable in your code seems to be unused, I assume you'll use it later, so I'll ignore it for now
You're opening a text file as binary (use open(file_name, "r") instead of open(file_name, "rb"))
Your handling of the processed data is still done inside of the context manager (with ... [as ...]:-block)
Your input text seems to delimit by ", ", but you delimit by " " when parsing
If I understood your question right, your problem seems to be caused by the last one. The "obvious solution" would probably be to change the delimiter argument to ", ", but only single-character strings are allowed as delimiters by the module. So what do we do? Well, since "," is really the "true" delimiter (it's never supposed to be inside actual unquoted data, contrary to spaces), that seems like a good choice. However, now all your values start with " ", which is probably not what you want. Here all strings have a pretty neat strip() method, which by default removes all whitespace at the beginning and end of the string. So, to strip() all the values, use a "list comprehension" (it evaluates an expression on every item in a list and returns a new list with the results), which should look somewhat like [i.strip() for i in row], before appending to partners.
In the end your code should hopefully look somewhat like this:
import csv

partners = []
with open('partners.csv', 'r') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',', quotechar='|')
    for row in spamreader:
        partners.append([i.strip() for i in row])

print(len(partners))
for entry in partners:
    print(entry)