I need to modify some columns of a CSV file to add some text to them. Once I've modified those columns I write the whole row, with the modified columns, to a new CSV file, but it does not keep the original format: it adds "" in the empty columns.
The original CSV is a special dialect that I've registered as:
csv.register_dialect('puntocoma', delimiter=';', quotechar='"', quoting=csv.QUOTE_ALL)
And it is part of my code:
with open(fileName, 'rt', newline='', encoding='ISO8859-1') as fdata, \
     open(r'SampleFiles\Servergiro\fout.csv',
          'wt', newline='', encoding='ISO8859-1') as fout:
    reader = csv.DictReader(fdata, dialect='puntocoma')
    writer = csv.writer(fout, dialect='puntocoma')
I am reading the CSV with DictReader and writing it back out with csv.writer from the csv module.
Then I modify the column that I need:
for row in reader:
    for (key, value) in row.items():
        if key == 'C' or key == 'D' or key == 'E':
            if row[key] != "":
                row[key] = '<something>' + value + '</something>'
And I write the modified content as follows:
content = list(row[i] for i in fields)
writer.writerow(content)
The original CSV has content like (header included):
"A";"B";"C";"D";"E";"F";"G";"H";"I";"J";"K";"L";"Ma";"No";"O 3";"O 4";"O 5"
"3123131";"Lorem";;;;;;;;;;;"Ipsum";"Ar";"Maquina Lorem";;;
"3003321";"HD 2.5' EP";;"asät 600 MB<br />Ere qweqsI (SAS)<br />tre qwe 15000 RPM<br />sasd ty 2.5 Zor<br />Areämis tyn<br />Ser Ja<br />Ütr ewas/s";;;;;;;;;"rew";"Asert ";"Trebol";"Casa";;
"3026273";"Sertro 5 M";;;;;;;;;;;"Rese";"Asert ";"Trebol";"Casa";;
But my modified CSV writes the following:
"3123131";"<something>Lorem</something>";"";"";"";"";"";"";"";"";"";"";"<something>Ipsum</something>";"<something>Ar</something>";"<something>Maquina Lorem</something>";"";"";""
I've edited the original question to add the CSV headers. (The header names are not the original ones.)
How can I write the new CSV without quotes around the empty columns? My guess is that it's about the dialect, but what I really need is a quote-all dialect that leaves empty columns unquoted.
It seems that you either get quotes everywhere (QUOTE_ALL) or quotes only where strictly needed (QUOTE_MINIMAL); the other, more exotic options are useless here.
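For instance, on a row with an empty field and a field containing the delimiter, the two built-in modes behave like this (a quick sketch, not your data):
import csv, sys

row = ["3123131", "Lorem", "", "a;b"]
for mode in (csv.QUOTE_ALL, csv.QUOTE_MINIMAL):
    # QUOTE_ALL prints:     "3123131";"Lorem";"";"a;b"
    # QUOTE_MINIMAL prints: 3123131;Lorem;;"a;b"
    csv.writer(sys.stdout, delimiter=';', quotechar='"', quoting=mode).writerow(row)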
I first posted a solution that wrote to a file buffer and then replaced the double quotes by nothing, but it was really a hack and could not properly handle strings containing quotes.
A better solution is to manage the quoting manually: force quotes if the string is not empty, and write none at all if it is empty:
with open("input.csv") as fr, open("output.csv","w") as fw:
csv.register_dialect('puntocoma', delimiter=';', quotechar='"')
cr = csv.reader(fr,dialect="puntocoma")
cw = csv.writer(fw,delimiter=';',quotechar='',escapechar="\\",quoting=csv.QUOTE_NONE)
cw.writerows(['"{}"'.format(x.replace('"','""')) if x else "" for x in row] for row in cr)
Here we tell csv to write no quotes at all (we even pass an empty quote char). The manual quoting consists in generating the rows with a list comprehension, quoting a field only if it is not empty and doubling any quotes inside the string.
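To see the trick in isolation, here is a minimal self-contained run on made-up values (only the dialect parameters come from the code above):
import csv, io

rows = [["3123131", "Lorem", "", "Ipsum"]]
buf = io.StringIO()
cw = csv.writer(buf, delimiter=';', quotechar='', escapechar='\\', quoting=csv.QUOTE_NONE)
cw.writerows(['"{}"'.format(x.replace('"', '""')) if x else "" for x in row] for row in rows)
print(buf.getvalue())  # "3123131";"Lorem";;"Ipsum"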
I have the following problem.
I would like to save a list into a CSV file (in the first column).
See example here:
import csv

mylist = ["Hallo", "der Pixer", "Glas", "Telefon", "Der Kühlschrank brach kaputt."]

def list_na_csv(file, mylist):
    with open(file, "w", newline="") as csv_file:
        csv_writer = csv.writer(csv_file)
        csv_writer.writerows(mylist)

list_na_csv("example.csv", mylist)
My output in excel looks like this (one character per column):
H,a,l,l,o
d,e,r, ,P,i,x,e,r
...
Desired output is (everything in the first column, one element per row):
Hallo
der Pixer
...
You can see that I have two issues: firstly, each character is followed by a comma; secondly, I don't know how to use an encoding such as UTF-8 or cp1250. How can I fix this, please?
I tried to search similar question, but nothing worked for me. Thank you.
You have two problems here.
writerows expects a list of rows, said differently a list of iterables. As a string is iterable, you write each word on a different row, one character per field. If you want one row with one word per field, you should use writerow:
csv_writer.writerow(mylist)
by default, the csv module uses the comma as the delimiter (this is the most common one). But Excel is a pain in the ass with it: it expects the delimiter to be the locale's one, which is the semicolon (;) in many Western European countries, including Germany. If you want to use your file easily with Excel, you should change the delimiter:
csv_writer = csv.writer(csv_file, delimiter=';')
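Putting both fixes together, a minimal sketch (cp1250 is just one possible choice for the encoding you mention):
import csv

mylist = ["Hallo", "der Pixer", "Glas", "Telefon", "Der Kühlschrank brach kaputt."]

with open("example.csv", "w", newline="", encoding="cp1250") as csv_file:
    csv_writer = csv.writer(csv_file, delimiter=';')
    csv_writer.writerow(mylist)  # one row, one word per field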
After your edit, you want all the data in the first column, one element per row. This is kind of a degenerate csv file, because it only has one value per record and no separator. If the fields can never contain a semicolon or a newline, you could just write a plain text file:
...
with open(file, "w", newline="") as csv_file:
    for row in mylist:
        print(row, file=csv_file)
...
If you want to be safe and prevent future problems in case you later have to process more corner-case values, you could still use the csv module and write one element per row by wrapping it in another iterable:
...
with open(file, "w", newline="") as csv_file:
    csv_writer = csv.writer(csv_file, delimiter=';')
    csv_writer.writerows([elt] for elt in mylist)
...
l = ["Hallo", "der Pixer", "Glas", "Telefon", "Der Kühlschrank brach kaputt."]
with open("file.csv", "w") as msg:
msg.write(",".join(l))
For less trivial examples:
l = ["Hallo", "der, Pixer", "Glas", "Telefon", "Der Kühlschrank, brach kaputt."]
with open("file.csv", "w") as msg:
msg.write(",".join([ '"'+x+'"' for x in l]))
Here you basically put every list element between quotes to guard against the intra-field comma problem.
Try this, it will work:
import csv

mylist = ["Hallo", "der Pixer", "Glas", "Telefon", "Der Kühlschrank brach kaputt."]

def list_na_csv(file, mylist):
    with open(file, "w", newline="") as csv_file:
        csv_writer = csv.writer(csv_file)
        csv_writer.writerow(mylist)

list_na_csv("example.csv", mylist)
If you want to write the entire list of strings to a single row, use csv_writer.writerow(mylist) as mentioned in the comments.
If you want to write each string to a new row, as I believe your reference to writing them in the first column implies, you'll have to format your data as the class expects: "A row must be an iterable of strings or numbers for Writer objects". On this data that would look something like:
csv_writer.writerows((entry,) for entry in mylist)
There, I'm using a generator expression to wrap each word in a tuple, thus making each row an iterable of strings. Without something like that, your strings are themselves the iterables, and the writer delimits between each character, as you've seen.
Using csv to write a single entry per line is almost pointless, but it does have the advantage that it will escape your delimiter if it appears in the data.
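For example, a value that happens to contain the delimiter gets quoted automatically (a small sketch):
import csv, sys

w = csv.writer(sys.stdout, delimiter=';')
w.writerows([[s] for s in ["plain", "with;semicolon"]])
# prints:
# plain
# "with;semicolon"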
To specify an encoding, the docs say:
Since open() is used to open a CSV file for reading, the file will by
default be decoded into unicode using the system default encoding (see
locale.getpreferredencoding()). To decode a file using a different
encoding, use the encoding argument of open:
import csv

with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
The same applies to writing in something other than the system default encoding: specify the encoding argument when
opening the output file.
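A minimal sketch of that writing side (the file name and data here are made up):
import csv

rows = [["Kühlschrank", "Telefon"]]
with open('some_output.csv', 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows(rows)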
try split("\n")
example:
counter = 0
amazing_list = ["hello", "hi"]
for x in amazing_list:
    ok = amazing_list[counter].split("\n")
    writer.writerow(ok)  # writer is an existing csv.writer
    counter += 1
I'm trying to make the csv module parse lines containing quoted strings and quoted separators. Unfortunately I'm not able to achieve the desired result with any dialect/format parameters. Is there any way to parse this:
'"AAA", BBB, "CCC, CCC"'
and get this:
['"AAA"', 'BBB', '"CCC, CCC"'] # 3 elements, one quoted separator
?
Two fundamental requirements:
Quotations have to be preserved
Quoted, and not escaped separators have to be copied as regular characters
Is it possible?
There are 2 issues to overcome:
spaces around the comma separator: skipinitialspace=True does the job (see also Python parse CSV ignoring comma with double-quotes)
preserving the quoting when reading: replacing each quote with tripled quotes preserves the quotes
That second part is described in the documentation as:
Dialect.doublequote
Controls how instances of quotechar appearing inside a field should themselves be quoted. When True, the character is doubled. When False, the escapechar is used as a prefix to the quotechar. It defaults to True.
standalone example, without file:
import csv

data = ['"AAA", BBB, "CCC, CCC"'.replace('"', '"""')]
cr = csv.reader(data, skipinitialspace=True)
row = next(cr)
print(row)
result:
['"AAA"', 'BBB', '"CCC, CCC"']
with a file as input:
import csv

with open("input.csv", newline='') as f:
    cr = csv.reader((l.replace('"', '"""') for l in f), skipinitialspace=True)
    for row in cr:
        print(row)
Have you tried this?
import csv

with open('file.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        print(row)
I have my csv file formatted with all columns nicely aligned by using one or more tabs in between different values.
I know it is possible to use a single tab as delimiter with csv.register_dialect("tab_delimiter", delimiter="\t"). But this only works with exactly one tab between the values. I would like to process the file keeping its format, i.e., not deleting duplicate tabs. Each field (row, column) contains a value.
Is it possible to use a run of one or more tabs as the delimiter, or to ignore the additional tabs without affecting the numbering of the values in a row? row[1] should be the second value no matter how many tabs separate it from row[0].
##Sample.txt
##ID name Age
##1 11 111
##2 22 222
import pandas as pd

df = pd.read_csv('Sample.txt', sep=r'\t+', engine='python')
print(df)
Assuming that there will never be empty fields, you can use a generator to remove duplicates from the incoming CSV file and then use the csv module as usual:
import csv

def de_dup(f, delimiter='\t'):
    for line in f:
        yield delimiter.join(field for field in line.split(delimiter) if field)

with open('data.csv') as f:
    for row in csv.reader(de_dup(f), delimiter='\t'):
        print(row)
An alternative way is to use re.sub() in the generator:
import re

def de_dup(f, delimiter='\t'):
    for line in f:
        yield re.sub(r'{}{{2,}}'.format(delimiter), delimiter, line)
but this still has the limitation that all fields must contain a value.
The most convenient way for me to deal with the multiple tabs was using an additional function that takes the row and removes the empty values/fields that are created by multiple tabs in a row. This doesn't affect the formatting of the csv file, and I can access the second value in the row with row[1] - even with multiple tabs before it.
def remove_empty(line):
    result = []
    for value in line:
        if value != "":
            result.append(value)
    return result
And in the code where I read the file and process the values:
for row in reader:
    row = remove_empty(row)
    # continue processing normally
I think this solution is similar to mhawke's, but with his solution I couldn't access the same values with row[i] as before (i.e., with only one delimiter in between each value).
Or the completely general solution for any type of repeated separator is to repeatedly replace each multiple separator with a single one and write the result to a new file (although this is slow for gigabyte-sized CSV files):
def replaceMultipleSeparators(fileName, oldSeparator, newSeparator):
    with open(fileName, encoding='utf-8', mode='r') as inputFileStream:
        linesOfCsvInputFile = inputFileStream.readlines()
    csvNewFileName = fileName + ".new"
    print('Writing: %s replacing %s with %s' % (csvNewFileName, oldSeparator, newSeparator))
    with open(csvNewFileName, 'w') as outputFileStream:
        for line in linesOfCsvInputFile:
            newLine = line.rstrip()
            processedLine = ""
            # collapse runs of the old separator down to a single one
            while newLine != processedLine:
                processedLine = newLine
                newLine = processedLine.replace(oldSeparator + oldSeparator, oldSeparator)
            newLine = newLine.replace(oldSeparator, newSeparator)
            outputFileStream.write(newLine + '\n')
which, given the input testFile.csv, will generate testFile.csv.new with TABs replaced by PIPEs if you run:
replaceMultipleSeparators( 'testFile.csv', '\t', '|' )
Sometimes you will need to replace the 'utf-8' encoding with 'latin-1' for some Microsoft US-generated CSV files. See errors related to reading the byte 0xe4 for this issue.
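A common pattern is to try utf-8 first and fall back to latin-1 (a sketch; the correct encoding ultimately depends on the program that produced the file):
try:
    lines = open('testFile.csv', encoding='utf-8').readlines()
except UnicodeDecodeError:
    lines = open('testFile.csv', encoding='latin-1').readlines()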
My csv file is created using values from a table as follows:
with open(fullpath, 'wb') as csvFile:
    writer = csv.writer(csvFile, delimiter='|',
                        escapechar=' ',
                        quoting=csv.QUOTE_NONE)
    string_values = 'abc|xyz|mno'
    for obj in queryset.iterator():
        writer.writerow([
            smart_str(obj.code),  # code
            smart_str(string_values),  # string of values
        ])
The csv generated should output result as:
code1|abc|xyz|mno
but what is generated is:
code1|abc |xyz |mno
I am unable to remove the space from the strings while creating the csv, in whatever way I try. The string values are generated dynamically and can contain one or more values. I cannot use quoting=csv.QUOTE_NONE without specifying escapechar=' ', i.e. a space. Please suggest.
You can't just write strings containing the delimiter, unescaped!
Use, instead:
string_values = 'abc|xyz|mno'.split('|')

for obj in queryset.iterator():
    writer.writerow([smart_str(obj.code)] + string_values)
IOW, writerow a list of strings, and let the writer insert the delimiters between the strings in the list!
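A quick standalone check of the idea (a sketch; plain strings stand in for smart_str here):
import csv, sys

writer = csv.writer(sys.stdout, delimiter='|', quoting=csv.QUOTE_NONE)
writer.writerow(['code1'] + 'abc|xyz|mno'.split('|'))
# prints: code1|abc|xyz|mno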
I have a bunch of CSV files. In some of them, missing data are represented by empty cells, but in others there is a period. I want to loop over all my files, open them, delete any periods that occur alone, and then save and close the file.
I've read a bunch of other questions about doing whole-word-only searches using re.sub(). That is what I want to do (delete . when it occurs alone but not the . in 3.5), but I can't get the syntax right for a whole-word-only search where the whole word is a special character ('.'). Also, I'm worried those answers might be a little different in the case where a whole word can be delimited by tabs and newlines too. That is, does \b work in my CSV file case?
UPDATE: Here is a function I wound up writing after seeing the help below. Maybe it will be useful to someone else.
import csv, re

def clean(infile, outfile, chars):
    '''
    Open a file, remove all specified special characters used to
    represent missing data, and save.

    infile:  an input file path
    outfile: an output file path
    chars:   a list of strings representing missing values to get rid of
    '''
    with open(infile, newline='') as in_temp, \
         open(outfile, 'w', newline='') as out_temp:
        csvin = csv.reader(in_temp)
        csvout = csv.writer(out_temp)
        for row in csvin:
            # the data is tab-delimited, so re-split the single
            # comma-parsed field on tabs
            row = re.split('\t', row[0])
            for colno, col in enumerate(row):
                for char in chars:
                    if col.strip() == char:
                        row[colno] = ''
            csvout.writerow(row)
Something like this should do the trick... This data wouldn't happen to be coming out of SAS, would it? IIRC, it quite often used '.' as missing for numeric values.
import csv

with open('input.csv', newline='') as fin, open('output.csv', 'w', newline='') as fout:
    csvin = csv.reader(fin)
    csvout = csv.writer(fout)
    for row in csvin:
        for colno, col in enumerate(row):
            if col.strip() == '.':
                row[colno] = ''
        csvout.writerow(row)
Why not just use the csv module?
#!/usr/bin/env python
import csv

with open(somefile, newline='') as infile:
    r = csv.reader(infile)
    rows = []
    for row in r:
        rows.append(['' if f == "." else f for f in row])

with open(newfile, 'w', newline='') as outfile:
    w = csv.writer(outfile)
    w.writerows(rows)
The safest way would be to use the CSV module to process the file, then identify any fields that only contain ., delete those and write the new CSV file back to disk.
A brittle workaround would be to search and replace a dot that is not surrounded by alphanumerics: \B\.\B is the regex that would find those dots. But that might also find other dots like the middle dot in "...".
So, to find a dot that is surrounded by commas, you could search for (?<=,)\.(?=,).
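Extending that idea to dots at the start or end of a line as well (a sketch; note it still assumes unquoted fields, so the csv-module approach above remains the safest):
import re

line = '1.5,.,3,.'
cleaned = re.sub(r'(?:^|(?<=,))\.(?=,|$)', '', line)
print(cleaned)  # 1.5,,3,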