How to add a text delimiter correctly in Python

I want to export some data from a DB to a CSV file, and I need to add a '|' delimiter to specific fields. At the moment, when I export the file, I do something like this:
To specific fields I add '|' at the beginning and end:
....
if response.value_display.startswith('|'):
    sheets[response.sheet.session][response.input.id] = response.value_display
else:
    sheets[response.sheet.session][response.input.id] = '|' + response.value_display + '|'
....
And my CSV writer is configured like this:
self.writer = csv.writer(self.queue, dialect=dialect,
                         lineterminator='\n',
                         quotechar='',
                         quoting=csv.QUOTE_NONE,
                         escapechar=' ',
                         **kwargs)
It works for now, but when a field is a DateTime (which contains a space), the writer adds an extra space. And with the default settings the writer sometimes adds double quotes at the beginning and end of a field, but I don't know why or what it depends on.

To remove the extra spaces I would just do something like this:
with open("the_file.csv", "r+") as f:       # open your csv file
    text = f.read().replace("  ", " ")      # finds any two spaces and replaces with one
    f.seek(0)                               # rewind and overwrite in place
    f.write(text)
    f.truncate()
Adding the delimiter is specific to the situation. If you want to add it at the end:
delimiter = "|"
my_str = my_str + delimiter
or at the beginning:
delimiter = "|"
my_str = delimiter + my_str
If you want to add the delimiter somewhere else you may have to get creative, as it depends on the context.
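For the pattern in the question (wrap a field in pipes only if it isn't wrapped already), a small helper keeps the logic in one place; a minimal sketch, where the name wrap_in_pipes is my own:
def wrap_in_pipes(value, delimiter='|'):
    # Wrap value in the delimiter unless it is already wrapped.
    if value.startswith(delimiter) and value.endswith(delimiter):
        return value
    return delimiter + value + delimiter

print(wrap_in_pipes('hello'))    # |hello|
print(wrap_in_pipes('|hello|'))  # |hello|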
I'm not sure about the double quotes; I'd replace them the same way as the spaces.
with open("the_file.csv", "r+") as f:   # open your csv file
    text = f.read().replace("\"", "'")
    f.seek(0)
    f.write(text)
    f.truncate()
Assuming you wanted to replace the double quotes with single quotes.
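For what it's worth, the double quotes with the default settings likely come from quoting=csv.QUOTE_MINIMAL (the writer's default): a field is quoted only when it contains the delimiter, the quotechar, or a newline. A minimal sketch demonstrating this:
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)  # default dialect: delimiter=',', quoting=csv.QUOTE_MINIMAL
writer.writerow(['plain', 'has,comma', 'has "quote"', '2021-01-01 12:00'])
print(buf.getvalue())
# plain,"has,comma","has ""quote""",2021-01-01 12:00
Note that the DateTime-like field is not quoted: a plain space never triggers quoting by itself.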

Related

Parsing a csv using a reverse escape char

I need to parse a large CSV file (e.g. converting it to a pandas df). It is an unquoted CSV with a comma as the delimiter. I received the file as txt and changed the extension to csv. I now see that some of the fields contain free text, with commas as part of it. I was thinking of using a heuristic: a delimiter comma is never followed by a space, while a free-text comma will, in most cases, be followed by one.
The problem is that escapechar=' ' escapes the character that follows it, while I need it to escape the preceding character.
Is there a way to mark a reverse-escape char?
I was considering the alternative of replacing all ", " with "#$#$#$#", but the file is 3 gb and it feels super inefficient.
Another option is to send the file back, complaining that it's malformed. Problem is that it will hurt my pride.
Thanks!
You can add a wrapper to massage each line given to the csv reader:
# foo.csv:
col1,col2,col3, with, commas,col4
# python file:
import csv

def escape_commas(filelike):
    for line in filelike:
        yield line.replace(', ', '\\, ')

with open('foo.csv', newline='') as csvfile:
    reader = csv.reader(escape_commas(csvfile), escapechar='\\')
    for row in reader:
        print('|'.join(row))
# result:
col1|col2|col3, with, commas|col4
Edit: for pandas you might want to make a wrapper for the file that implements the read method:
import pandas as pd

class EscapeCommas():
    def __init__(self, file):
        self.file = file

    def read(self, size=-1, /):
        text = self.file.read(size)
        return text.replace(', ', '\\, ')

with open('foo.csv', newline='') as csvfile:
    df = pd.read_csv(EscapeCommas(csvfile), escapechar='\\')

Python: Read csv file with an arbitrary number of tabs as delimiter

I have my csv file formatted with all columns nicely aligned by using one or more tabs between the different values.
I know it is possible to use a single tab as delimiter with csv.register_dialect("tab_delimiter", delimiter="\t"), but this only works with exactly one tab between the values. I would like to process the file keeping its format, i.e., not deleting duplicate tabs. Each field (row, column) contains a value.
Is it possible to use 1+ tabs as the delimiter, or to ignore additional tabs without affecting the numbering of the values in a row? row[1] should be the second value independent of how many tabs follow row[0].
# Sample.txt
# ID    name    Age
# 1     11      111
# 2     22      222

import pandas as pd

# A regex separator forces pandas to use the Python parsing engine, so name it explicitly.
df = pd.read_csv('Sample.txt', sep=r'\t+', engine='python')
print(df)
Assuming that there will never be empty fields, you can use a generator to remove duplicates from the incoming CSV file and then use the csv module as usual:
import csv

def de_dup(f, delimiter='\t'):
    for line in f:
        yield delimiter.join(field for field in line.split(delimiter) if field)

with open('data.csv') as f:
    for row in csv.reader(de_dup(f), delimiter='\t'):
        print(row)
An alternative way is to use re.sub() in the generator:
import re

def de_dup(f, delimiter='\t'):
    for line in f:
        yield re.sub(r'{}{{2,}}'.format(delimiter), delimiter, line)
but this still has the limitation that all fields must contain a value.
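A quick demonstration of that limitation, using made-up rows: a genuinely empty field is indistinguishable from padding tabs, so it is silently dropped:
import csv
import re

def de_dup(f, delimiter='\t'):
    for line in f:
        yield re.sub(r'{}{{2,}}'.format(delimiter), delimiter, line)

lines = ['a\t\tb\t\tc\n',  # padded with extra tabs: three fields
         'a\t\t\tc\n']     # middle field genuinely empty
for row in csv.reader(de_dup(lines), delimiter='\t'):
    print(row)
# ['a', 'b', 'c']
# ['a', 'c']   <- the empty field is gone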
The most convenient way for me to deal with the multiple tabs was using an additional function that takes the row and removes the empty values/fields that are created by multiple tabs in a row. This doesn't affect the formatting of the csv file, and I can access the second value in the row with row[1] - even with multiple tabs before it.
def remove_empty(line):
    result = []
    for field in line:
        if field != "":
            result.append(field)
    return result
And in the code where I read the file and process the values:
for row in reader:
    row = remove_empty(row)
    # continue processing normally
I think this solution is similar to mhawke's, but with his solution I couldn't access the same values with row[i] as before (i.e., with only one delimiter in between each value).
Or the completely general solution for any type of repeated separator is to repeatedly replace each multiple separator with a single separator and write the result to a new file (although this is slow for gigabyte-sized CSV files):
def replaceMultipleSeparators(fileName, oldSeparator, newSeparator):
    linesOfCsvInputFile = open(fileName, encoding='utf-8', mode='r').readlines()
    csvNewFileName = fileName + ".new"
    print('Writing: %s replacing %s with %s' % (csvNewFileName, oldSeparator, newSeparator), end='')
    outputFileStream = open(csvNewFileName, 'w')
    for line in linesOfCsvInputFile:
        newLine = line.rstrip()
        processedLine = ""
        while newLine != processedLine:   # collapse runs of oldSeparator down to one
            processedLine = newLine
            newLine = processedLine.replace(oldSeparator + oldSeparator, oldSeparator)
        newLine = newLine.replace(oldSeparator, newSeparator)
        outputFileStream.write(newLine + '\n')
    outputFileStream.close()
which, given input testFile.csv, will generate testFile.csv.new with TABs replaced by PIPEs if you run:
replaceMultipleSeparators('testFile.csv', '\t', '|')
Sometimes you will need to replace the 'utf-8' encoding with 'latin-1' for some Microsoft-generated CSV files; look for errors about reading byte 0xe4 for this issue.
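A hedged sketch of that fallback (the helper name is mine): try UTF-8 first and retry with Latin-1 only when decoding fails:
def read_lines_with_fallback(fileName):
    # Try UTF-8 first; fall back to Latin-1 for files with bytes like 0xe4.
    try:
        with open(fileName, encoding='utf-8') as f:
            return f.readlines()
    except UnicodeDecodeError:
        with open(fileName, encoding='latin-1') as f:
            return f.readlines()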

How to remove more than one space when reading text file

Problem: I cannot seem to parse the information in a text file because Python reads each line as one full string rather than as separate strings. The separators between the values are not \t characters, which is why the line does not split. Is there a way for Python to flexibly remove the spaces and use a comma or \t instead?
Example DATA:
MOR125-1 MOR129-1 0.587
MOR125-1 MOR129-3 0.598
MOR129-1 MOR129-3 0.115
The code I am using:
with open("Distance_Data_No_Bootstrap_RAW.txt","rb") as f:
    reader = csv.reader(f, delimiter="\t")
    d = list(reader)

for i in range(3):
    print d[i]
Output:
['MOR125-1 MOR129-1 0.587']
['MOR125-1 MOR129-3 0.598']
['MOR129-1 MOR129-3 0.115']
Desired Output:
['MOR125-1', 'MOR129-1', '0.587']
['MOR125-1', 'MOR129-3', '0.598']
['MOR129-1', 'MOR129-3', '0.115']
You can simply declare the delimiter to be a space and ask csv to skip initial spaces after a delimiter. That way, your effective separator is one or more spaces, like the regular expression ' +'.
rd = csv.reader(fd, delimiter=' ', skipinitialspace=True)
for row in rd:
    print row
['MOR125-1', 'MOR129-1', '0.587']
['MOR125-1', 'MOR129-3', '0.598']
['MOR129-1', 'MOR129-3', '0.115']
You can instruct csv.reader to use a space as the delimiter and skip all the extra spaces:
reader = csv.reader(f, delimiter=" ", skipinitialspace=True)
For detailed information about the available parameters, check the Python docs:
Dialect.delimiter
A one-character string used to separate fields. It defaults to ','.
Dialect.skipinitialspace
When True, whitespace immediately following the delimiter is ignored. The default is False.
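If you don't need the csv module at all, plain str.split() with no arguments also collapses any run of whitespace; a minimal alternative sketch, assuming no field ever contains a space itself:
with open("Distance_Data_No_Bootstrap_RAW.txt") as f:
    for line in f:
        print(line.split())  # splits on any run of whitespace
# ['MOR125-1', 'MOR129-1', '0.587']
# ['MOR125-1', 'MOR129-3', '0.598']
# ['MOR129-1', 'MOR129-3', '0.115']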

How can I quote escape characters in csv writer in Python

I am writing the csv file like this:
for a in products:
    mylist = []
    for h in headers['product']:
        mylist.append(a.get(h))
    writer.writerow(mylist)
A few of my fields are text fields that can contain any characters, like , " ' \n or anything else. What is the safest way to write them to a CSV file? The file will also have integers and floats.
You should use the QUOTE_ALL quoting option:
import io
import csv

row = ["AAA \n BBB ,222 \n CCC;DDD \" EEE ' FFF 111"]
output = io.StringIO()
wr = csv.writer(output, quoting=csv.QUOTE_ALL)
wr.writerow(row)

# Test:
contents = output.getvalue()
parsedRow = list(csv.reader([contents]))[0]
if parsedRow == row:
    print("BINGO!")
Using csv.QUOTE_ALL will ensure that all of your entries are quoted like so: "value1","value2","value3", while using csv.QUOTE_NONE will give you: value1,value2,value3.
Additionally, QUOTE_ALL will double any quotes inside your entries, so somedata"user"somemoredata will become "somedata""user""somemoredata" in your written .csv.
However, if you use QUOTE_NONE and set your escapechar to the backslash character (for example), every quote in an entry will be written as \" instead:
create = csv.writer(open("test.csv", "w", newline=""), quoting=csv.QUOTE_NONE, escapechar='\\', quotechar='"')
for element in file:
    create.writerow(element)
and the previous example will become somedata\"user\"somemoredata, which is clean. It will also escape any commas in your elements the same way.
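To check the round trip, you can read the file back with the same escapechar; a short sketch, assuming test.csv was written as above:
import csv

with open("test.csv", newline="") as f:
    reader = csv.reader(f, quoting=csv.QUOTE_NONE, escapechar='\\')
    for row in reader:
        print(row)  # escape characters are stripped, original values restored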

Remove special characters from csv file using python

There seems to be something on this topic already (How to replace all those Special Characters with white spaces in python?), but I can't figure this simple task out for the life of me.
I have a .CSV file with 75 columns and almost 4000 rows. I need to replace all the 'special characters' ($ # & * etc.) with '_' and write to a new file. Here's what I have so far:
import csv
input = open('C:/Temp/Data.csv', 'rb')
lines = csv.reader(input)
output = open('C:/Temp/Data_out1.csv', 'wb')
writer = csv.writer(output)
conversion = '-"/.$'
text = input.read()
newtext = '_'
for c in text:
    newtext += '_' if c in conversion else c
    writer.writerow(c)
input.close()
output.close()
All this succeeds in doing is writing everything to the output file as a single column, producing over 65K rows. Additionally, the special characters are still present!
Sorry for the redundant question.
Thank you in advance!
I might do something like
import csv

with open("special.csv", newline="") as infile, open("repaired.csv", "w", newline="") as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    conversion = set('_"/.$')
    for row in reader:
        newrow = [''.join('_' if c in conversion else c for c in entry) for entry in row]
        writer.writerow(newrow)
which turns
$ cat special.csv
th$s,2.3/,will-be
fixed.,even.though,maybe
some,"shoul""dn't",be
(note that I have a quoted value) into
$ cat repaired.csv
th_s,2_3_,will-be
fixed_,even_though,maybe
some,shoul_dn't,be
Right now, your code is reading the entire text in as one big string:
text = input.read()
Starting from a _ character:
newtext = '_'
Looping over every single character in text:
for c in text:
Add the corrected character to newtext (very slowly):
newtext += '_' if c in conversion else c
And then write the original character (?), as a column, to a new csv:
writer.writerow(c)
.. which is unlikely to be what you want. :^)
This doesn't seem to need to deal with CSVs in particular (as long as the special characters aren't your column delimiters).
lines = []
with open('C:/Temp/Data.csv', 'r') as input:
    lines = input.readlines()

conversion = '-"/.$'
newtext = '_'
outputLines = []
for line in lines:
    temp = line[:]
    for c in conversion:
        temp = temp.replace(c, newtext)
    outputLines.append(temp)

with open('C:/Temp/Data_out1.csv', 'w') as output:
    for line in outputLines:
        output.write(line)   # readlines() keeps the trailing newline
In addition to the bug pointed out by @Nisan.H and the valid point made by @dckrooney that you may not need to treat the file as CSV in this case (but see my comment below):
writer.writerow() should take a sequence of strings, each of which will be written out separated by commas. In your case you are writing a single string.
This code is setting up to read from 'C:/Temp/Data.csv' in two ways - through input and through lines but it only actually reads from input (therefore the code does not deal with the file as a CSV file anyway).
The code appends characters to newtext and writes out each version of that variable. Thus, the first version of newtext would be 1 character long, the second 2 characters long, the third 3 characters long, etc.
Finally, given that a CSV file can have quote marks in it, it may actually be necessary to deal with the input file specifically as a CSV to avoid replacing quote marks that you want to keep, e.g. quote marks that are there to protect commas that exist within fields of the CSV file. In that case, it would be necessary to process each field of the CSV file individually, then write each row out to the new CSV file.
Maybe try
s = open('myfile.csv', 'r').read()
chars = ('$', '%', '^', '*')  # etc.
for c in chars:
    s = '_'.join(s.split(c))

out_file = open('myfile_new.csv', 'w')
out_file.write(s)
out_file.close()
