I have my CSV file formatted with all columns nicely aligned by using one or more tabs between the different values.
I know it is possible to use a single tab as delimiter with csv.register_dialect("tab_delimiter", delimiter="\t"). But this only works with exactly one tab between the values. I would like to process the file keeping its format, i.e., not deleting duplicate tabs. Each field (row, column) contains a value.
Is it possible to use a run of one or more tabs as the delimiter, or to ignore the additional tabs without affecting the indexing of the values in a row? row[1] should be the second value regardless of how many tabs come after row[0].
Sample.txt:
ID      name    Age
1       11      111
2       22      222
import pandas as pd
df = pd.read_csv('Sample.txt', sep=r'\t+')
print(df)
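Because the separator is a regular expression, pandas falls back to the slower Python parsing engine and emits a warning; passing the engine explicitly avoids that (a minimal sketch, same file as above):
df = pd.read_csv('Sample.txt', sep=r'\t+', engine='python')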
Assuming that there will never be empty fields, you can use a generator to collapse the duplicate delimiters in the incoming CSV file and then use the csv module as usual:
import csv
def de_dup(f, delimiter='\t'):
    for line in f:
        yield delimiter.join(field for field in line.split(delimiter) if field)
with open('data.csv') as f:
    for row in csv.reader(de_dup(f), delimiter='\t'):
        print(row)
An alternative way is to use re.sub() in the generator:
import re
def de_dup(f, delimiter='\t'):
    for line in f:
        yield re.sub(r'{}{{2,}}'.format(delimiter), delimiter, line)
but this still has the limitation that all fields must contain a value.
The most convenient way for me to deal with the multiple tabs was using an additional function that takes the row and removes the empty values/fields that are created by multiple tabs in a row. This doesn't affect the formatting of the CSV file, and I can access the second value in the row with row[1] - even with multiple tabs before it.
def remove_empty(line):
    result = []
    for i in range(len(line)):
        if line[i] != "":
            result.append(line[i])
    return result
And in the code where I read the file and process the values:
for row in reader:
    row = remove_empty(row)
    # ... continue processing normally
I think this solution is similar to mhawke's, but with his solution I couldn't access the same values with row[i] as before (i.e., with only one delimiter in between each value).
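Equivalently, the filtering could be written as a one-line list comprehension (a small sketch with the same behaviour as remove_empty above):
row = [field for field in row if field != ""]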
Or the completely general solution for any type of repeated separator is to repeatedly replace each run of separators with a single separator and write the result to a new file (although it is slow for gigabyte-sized CSV files):
def replaceMultipleSeparators(fileName, oldSeparator, newSeparator):
    linesOfCsvInputFile = open(fileName, encoding='utf-8', mode='r').readlines()
    csvNewFileName = fileName + ".new"
    print('Writing: %s replacing %s with %s' % (csvNewFileName, oldSeparator, newSeparator))
    outputFileStream = open(csvNewFileName, 'w')
    for line in linesOfCsvInputFile:
        newLine = line.rstrip()
        processedLine = ""
        # collapse runs of the old separator until nothing changes any more
        while newLine != processedLine:
            processedLine = newLine
            newLine = processedLine.replace(oldSeparator + oldSeparator, oldSeparator)
        newLine = newLine.replace(oldSeparator, newSeparator)
        outputFileStream.write(newLine + '\n')
    outputFileStream.close()
which, given the input testFile.csv, will generate testFile.csv.new with TABs replaced by PIPEs if you run:
replaceMultipleSeparators( 'testFile.csv', '\t', '|' )
Sometimes you will need to replace the 'utf-8' encoding with 'latin-1' for some Microsoft US-generated CSV files. See errors related to reading byte 0xe4 for this issue.
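In that case only the open() call in the function above changes (a small sketch):
linesOfCsvInputFile = open(fileName, encoding='latin-1', mode='r').readlines()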
I'm trying to solve a problem from the pyschools website that asks for a script that reads a CSV file with commas "," as the delimiter and returns a list of records. When I run my script on their website it is marked as incorrect, using a test case of:
csvReader('books.csv')[0] which returns:
['"Pete,Zelle","Intro to HTML, CSS",2011']
when the expected result is:
['Pete,Zelle', 'Intro to HTML, CSS', '2011']
I've noticed that the problem has to do with the quotation marks " and ', but I still haven't come up with the right answer; using replace('"','') on the line variable to remove the double quotes does not fix it, as it returns:
['Pete,Zelle,Intro to HTML, CSS,2011']
where it removes the closing quotation mark from some of the words, e.g. Zelle, instead of Zelle',.
Below I'll provide a link to the exercise, the problem and my current script. Any explanation or help is greatly appreciated.
link:
http://www.pyschools.com/quiz/view_question/s13-q8
problem:
Write a function to read a CSV file with ',' as delimiter and returns a list of records.
The function must be able to ignore ',' that are within a pair of double quotes '"'.
script:
def csvReader(filename):
    records = []
    for line in open(filename):
        line = line.rstrip()  # strip '\n'
        if line == '","':
            continue  # ignore empty line
        records.append([line.replace('"','')])
    return records
It would help to see the CSV file you are trying to read. It sounds as though you need to separate the fields whilst ignoring any delimiters that fall between quotation marks.
In this case I would recommend the CSV Library and setting the quotation character.
import csv
record = '"Pete,Zelle","Intro to HTML, CSS",2011'
newStr = [ '"{}"'.format(x) for x in list(csv.reader([record], delimiter=',', quotechar='"'))[0] ]
print(newStr)
Will return ['"Pete,Zelle"', '"Intro to HTML, CSS"', '"2011"']
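If you don't actually want the surrounding double quotes added back, note that csv.reader on its own already yields the fields the exercise expects (a small sketch using the same record as above):
fields = list(csv.reader([record], delimiter=',', quotechar='"'))[0]
print(fields)  # ['Pete,Zelle', 'Intro to HTML, CSS', '2011']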
In your function you could incorporate this as below
import csv
def csvReader(filename):
    records = []
    for line in open(filename):
        line = line.rstrip()  # strip '\n'
        if line == '","':
            continue  # ignore empty line
        newLine = ['"{}"'.format(x) for x in list(csv.reader([line], delimiter=',', quotechar='"'))[0]]
        records.append(newLine)
    return records
Batteries are included, as usual, with python. Here's using the standard lib csv module:
import csv
with open(path, "r") as f:
    csv_reader = csv.reader(f, delimiter=",")
    for row_number, row in enumerate(csv_reader):
        print(f"{row_number} => {row}")
If the stdlib isn't available for some strange reason... you'll need to tokenize each line into 'delimiters', 'separators', and 'cell values'. Again, this would be trivial with the stdlib (import re). Let's pretend you have no batteries at all, just plain Python.
You'll need to realize that how you treat each character of each line depends on a "context", and that that context is built up by all previous characters. Using a stack is advised here. You push and pop states (aka contexts) onto and off a stack
depending on what the current context is (the top of your stack) and the current character you're handling. Now, given a context, you may process each character differently depending on that context:
class State:
    IN_NON_DELIMITED_CELL = 1
    IN_DELIMITED_CELL = 2

def get_cell_values(line, quotechar='"', separator=','):
    stack = []
    stack.append(State.IN_NON_DELIMITED_CELL)
    cell_values = [""]
    for character in line:
        current_state = stack[-1]
        if current_state == State.IN_NON_DELIMITED_CELL:
            if character == quotechar:
                stack.append(State.IN_DELIMITED_CELL)
            elif character == separator:
                cell_values.append("")
            else:
                cell_values[-1] += character
        if current_state == State.IN_DELIMITED_CELL:
            if character == quotechar:
                stack.pop()
            else:
                cell_values[-1] += character
    return cell_values
with open(path, "r") as f:
    for line in f:
        cell_values = get_cell_values(line, quotechar='"', separator=',')
        print(cell_values)
This is a good starting point:
print(get_cell_values('"this","is",an,example,of,"doing things, the hard way?"'))
# prints:
['this', 'is', 'an', 'example', 'of', 'doing things, the hard way?']
For taking this (MUCH) further, look into these topics: tokenizing strings, LL+LR parsers, recursive descent, shift-reduce parsers.
I need to modify some columns of a CSV file to add some text in them. Once I've modified that columns I write the whole row, with the modified column to a new CSV file, but it does not keep the original format, as it adds "" in the empty columns.
The original CSV is a special dialect that I've registered as:
csv.register_dialect('puntocoma', delimiter=';', quotechar='"', quoting=csv.QUOTE_ALL)
And this is part of my code:
with open(fileName, 'rt', newline='', encoding='ISO8859-1') as fdata, \
     open(r'SampleFiles\Servergiro\fout.csv',
          'wt', newline='', encoding='ISO8859-1') as fout:
    reader = csv.DictReader(fdata, dialect='puntocoma')
    writer = csv.writer(fout, dialect='puntocoma')
I am reading the CSV with DictReader from the csv module. Then I modify the columns that I need:
for row in reader:
    for (key, value) in row.items():
        if key == 'C' or key == 'D' or key == 'E':
            if row[key] != "":
                row[key] = '<something>' + value + '</something>'
And I write the modified content as follows:
content = list(row[i] for i in fields)
writer.writerow(content)
The original CSV has content like (header included):
"A";"B";"C";"D";"E";"F";"G";"H";"I";"J";"K";"L";"Ma";"No";"O 3";"O 4";"O 5"
"3123131";"Lorem";;;;;;;;;;;"Ipsum";"Ar";"Maquina Lorem";;;
"3003321";"HD 2.5' EP";;"asät 600 MB<br />Ere qweqsI (SAS)<br />tre qwe 15000 RPM<br />sasd ty 2.5 Zor<br />Areämis tyn<br />Ser Ja<br />Ütr ewas/s";;;;;;;;;"rew";"Asert ";"Trebol";"Casa";;
"3026273";"Sertro 5 M";;;;;;;;;;;"Rese";"Asert ";"Trebol";"Casa";;
But my modified CSV writes the following:
"3123131";"<something>Lorem</something>";"";"";"";"";"";"";"";"";"";"";"<something>Ipsum</something>";"<something>Ar</something>";"<something>Maquina Lorem</something>";"";"";""
I've modified the original question to add the headers of the CSV. (The names of the headers are not the original ones.)
How can I write the new CSV without quotes around the empty columns? My guess is that it is about the dialect, but what I really need is a quote-all dialect, except for columns that are empty.
It seems that you either have quotes everywhere (QUOTE_ALL) or no quotes (QUOTE_MINIMAL) (and other exotic options useless here).
I first posted a solution which wrote in a file buffer, then replaced the double quotes by nothing, but it was really a hack and could not manage strings containing quotes properly.
A better solution is to manage the quoting manually: force it if the string is not empty, and don't add any quotes if it is empty:
with open("input.csv") as fr, open("output.csv", "w") as fw:
    csv.register_dialect('puntocoma', delimiter=';', quotechar='"')
    cr = csv.reader(fr, dialect="puntocoma")
    cw = csv.writer(fw, delimiter=';', quotechar='', escapechar="\\", quoting=csv.QUOTE_NONE)
    cw.writerows(['"{}"'.format(x.replace('"', '""')) if x else "" for x in row] for row in cr)
Here we tell csv to write no quotes at all (we even pass an empty quote char). The manual quoting consists of generating the rows with a list comprehension, quoting a field only if it is not empty and doubling any quotes within the string.
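For example, a row read back as ['3123131', 'Lorem', '', 'Casa'] should be written out as (a sketch of the expected output, matching the original format):
"3123131";"Lorem";;"Casa"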
I have an excel file that contains data with multiple columns of varying width that I need to work with on my PC. However, the file contains SOH and STX characters as delimiting characters, since they were from TextEdit on a Mac. The SOH is record delimiter and the STX is row delimiter. On my PC, both these characters are shown as a rectangle (in screenshot). I can't use the fixed width delimited option since I would lose data. I tried writing a Python script, but Python doesn't recognize the SOH and STX either, just displays it as a rectangle too. How do I delimit these records appropriately? I would appreciate any possible method.
Thanks!
This should work
SOH = '\x01'
STX = '\x02'

# As it is, this function returns the values as strings, not as integers
def read_lines(filename):
    rawdata = open(filename, "rb").read()
    for l in rawdata.split(SOH + STX):
        if not l:
            continue
        yield l.split(SOH)

# Rows is a list. Each element in the list is a row of values
# (either a list or a tuple, for example)
def write_lines(filename, rows):
    with open(filename, "wb") as f:
        for row in rows:
            f.write(SOH.join([str(x) for x in row]) + SOH + STX)
Edit: Example use...
for row in read_lines("myfile.csv"):
    print ", ".join(row)
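Writing works the same way; for example, with some hypothetical rows (write_lines converts each value to a string):
rows = [["ID", "Name"], [1, "Alice"], [2, "Bob"]]
write_lines("myfile_out.csv", rows)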
There seems to be something on this topic already (How to replace all those Special Characters with white spaces in python?), but I can't figure this simple task out for the life of me.
I have a .CSV file with 75 columns and almost 4000 rows. I need to replace all the 'special characters' ($ # & * ect) with '_' and write to a new file. Here's what I have so far:
import csv
input = open('C:/Temp/Data.csv', 'rb')
lines = csv.reader(input)
output = open('C:/Temp/Data_out1.csv', 'wb')
writer = csv.writer(output)
conversion = '-"/.$'
text = input.read()
newtext = '_'
for c in text:
    newtext += '_' if c in conversion else c
    writer.writerow(c)
input.close()
output.close()
All this succeeds in doing is to write everything to the output file as a single column, producing over 65K rows. Additionally, the special characters are still present!
Sorry for the redundant question.
Thank you in advance!
I might do something like
import csv
with open("special.csv", "rb") as infile, open("repaired.csv", "wb") as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    conversion = set('_"/.$')
    for row in reader:
        newrow = [''.join('_' if c in conversion else c for c in entry) for entry in row]
        writer.writerow(newrow)
which turns
$ cat special.csv
th$s,2.3/,will-be
fixed.,even.though,maybe
some,"shoul""dn't",be
(note that I have a quoted value) into
$ cat repaired.csv
th_s,2_3_,will-be
fixed_,even_though,maybe
some,shoul_dn't,be
Right now, your code is reading in the entire text into one big line:
text = input.read()
Starting from a _ character:
newtext = '_'
Looping over every single character in text:
for c in text:
Add the corrected character to newtext (very slowly):
newtext += '_' if c in conversion else c
And then write the original character (?), as a column, to a new csv:
writer.writerow(c)
.. which is unlikely to be what you want. :^)
This doesn't seem to need to deal with CSV's in particular (as long as the special characters aren't your column delimiters).
lines = []
with open('C:/Temp/Data.csv', 'r') as input:
    lines = input.readlines()

conversion = '-"/.$'
newtext = '_'
outputLines = []
for line in lines:
    temp = line.rstrip('\n')  # readlines() keeps the newline; strip it so it isn't doubled on output
    for c in conversion:
        temp = temp.replace(c, newtext)
    outputLines.append(temp)

with open('C:/Temp/Data_out1.csv', 'w') as output:
    for line in outputLines:
        output.write(line + "\n")
In addition to the bug pointed out by @Nisan.H and the valid point made by @dckrooney that you may not need to treat the file in a special way in this case just because it is a CSV file (but see my comment below):
writer.writerow() should take a sequence of strings, each of which would be written out separated by commas (see here). In your case you are writing a single string.
This code is setting up to read from 'C:/Temp/Data.csv' in two ways - through input and through lines but it only actually reads from input (therefore the code does not deal with the file as a CSV file anyway).
The code appends characters to newtext and writes out each version of that variable. Thus, the first version of newtext would be 1 character long, the second 2 characters long, the third 3 characters long, etc.
Finally, given that a CSV file can have quote marks in it, it may actually be necessary to deal with the input file specifically as a CSV to avoid replacing quote marks that you want to keep, e.g. quote marks that are there to protect commas that exist within fields of the CSV file. In that case, it would be necessary to process each field of the CSV file individually, then write each row out to the new CSV file.
Maybe try
s = open('myfile.csv', 'r').read()
chars = ('$', '%', '^', '*')  # etc
for c in chars:
    s = '_'.join(s.split(c))
out_file = open('myfile_new.csv', 'w')
out_file.write(s)
out_file.close()
I have a bunch of CSV files. In some of them, missing data are represented by empty cells, but in others there is a period. I want to loop over all my files, open them, delete any periods that occur alone, and then save and close the file.
I've read a bunch of other questions about doing whole-word-only searches using re.sub(). That is what I want to do (delete . when it occurs alone but not the . in 3.5), but I can't get the syntax right for a whole-word-only search where the whole word is a special character ('.'). Also, I'm worried those answers might be a little different in the case where a whole word can be delimited by tabs and newlines too. That is, does \b work in my CSV file case?
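For what it's worth, a quick test suggests \b is not the right tool here, since it only matches between a word character and a non-word character (a small sketch):
import re
# \b\.\b matches the dot inside 3.5 (word characters on both sides) but NOT a lone dot between commas
print(re.sub(r'\b\.\b', 'X', '3.5,.,7'))   # -> 3X5,.,7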
UPDATE: Here is a function I wound up writing after seeing the help below. Maybe it will be useful to someone else.
import csv, re
def clean(infile, outfile, chars):
    '''
    Open a file, remove all specified special characters used to represent missing data, and save.

    infile:   An input file path
    outfile:  An output file path
    chars:    A list of strings representing missing values to get rid of
    '''
    in_temp = open(infile)
    out_temp = open(outfile, 'wb')
    csvin = csv.reader(in_temp)
    csvout = csv.writer(out_temp)
    for row in csvin:
        row = re.split('\t', row[0])
        for colno, col in enumerate(row):
            for char in chars:
                if col.strip() == char:
                    row[colno] = ''
        csvout.writerow(row)
    in_temp.close()
    out_temp.close()
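For example (hypothetical file names and missing-value markers):
clean('raw_data.csv', 'cleaned_data.csv', ['.', 'NA'])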
Something like this should do the trick... This data wouldn't happen to be coming out of SAS, would it? IIRC, SAS quite often uses '.' as the missing value for numeric fields.
import csv
with open('input.csv') as fin, open('output.csv', 'wb') as fout:
    csvin = csv.reader(fin)
    csvout = csv.writer(fout)
    for row in csvin:
        for colno, col in enumerate(row):
            if col.strip() == '.':
                row[colno] = ''
        csvout.writerow(row)
Why not just use the csv module?
#!/usr/bin/env python
import csv

with open(somefile) as infile:
    r = csv.reader(infile)
    rows = []
    for row in r:
        rows.append(['' if f == "." else f for f in row])

with open(newfile, 'w') as outfile:
    w = csv.writer(outfile)
    w.writerows(rows)
The safest way would be to use the CSV module to process the file, then identify any fields that only contain ., delete those and write the new CSV file back to disk.
A brittle workaround would be to search and replace a dot that is not surrounded by alphanumerics: \B\.\B is the regex that would find those dots. But that might also find other dots like the middle dot in "...".
So, to find a dot that is surrounded by commas, you could search for (?<=,)\.(?=,).
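For example (a small sketch on a single line of input):
import re
line = '3.5,.,7,.'
print(re.sub(r'(?<=,)\.(?=,)', '', line))   # -> 3.5,,7,.
# the trailing lone dot survives because it is not followed by a comma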