Error reading csv file unicodeescape - python

I have this program:

import csv

with open("C:\Users\frederic\Desktop\WinPython-64bit-3.4.4.3Qt5\notebooks\scores.txt", "r") as scoreFile:
    # write = w, read = r, append = a
    scoreFileReader = csv.reader(scoreFile)
    scoreList = []
    for row in scoreFileReader:
        if len(row) != 0:
            scoreList = scoreList + [row]
    scoreFile.close()

print(scoreList)
Why do I get this error?

with open("C:\Users\frederic\Desktop\WinPython-64bit-3.4.4.3Qt5\notebooks\scores.txt","r") as scoreFile:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

You need to use raw strings with Windows-style filenames:
with open(r"C:\Users\frederic\Desktop\WinPython-64bit-3.4.4.3Qt5\notebooks\scores.txt", 'r') as scoreFile:
^^
Otherwise, Python's string engine thinks that \U is the start of a Unicode escape sequence - which of course it isn't in this case.
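A raw string is only one of three equivalent ways to spell a Windows path, all of which avoid the escape problem (a quick sketch, with the directory shortened for readability):

path1 = r"C:\Users\frederic\scores.txt"    # raw string: backslashes stay literal
path2 = "C:\\Users\\frederic\\scores.txt"  # doubled backslashes escape themselves
path3 = "C:/Users/frederic/scores.txt"     # forward slashes: open() accepts them on Windows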
Also be careful: your scoreFile.close() is useless. The with statement replaces a try/finally block and automatically closes the file after the block, which means you can delete the scoreFile.close() line.
Also, you can change the line if len(row) != 0. According to PEP 8:

    For sequences (strings, lists, tuples), use the fact that empty sequences are false.

    Yes: if not seq:
         if seq:

    No:  if len(seq):
         if not len(seq):
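Applied to your loop, the check becomes a plain truthiness test (a minimal sketch):

for row in scoreFileReader:
    if row:  # an empty row is falsy, so this replaces len(row) != 0
        scoreList.append(row)  # append also avoids rebuilding the list each time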
One last thing: your for loop isn't right either. To read CSV, you are better off starting from the example in the docs (adapted here to Python 3, since you are on 3.4 -- the Python 2 version of the docs opens the file with 'rb' and uses print as a statement):

import csv

with open('eggs.csv', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in spamreader:
        print(', '.join(row))

You may try this:

import csv
import io
import locale

def guess_encoding(csv_file):
    """Guess the encoding of the given file."""
    with io.open(csv_file, "rb") as f:
        data = f.read(5)
    if data.startswith(b"\xEF\xBB\xBF"):  # UTF-8 with a "BOM"
        return ["utf-8-sig"]
    elif data.startswith(b"\xFF\xFE") or data.startswith(b"\xFE\xFF"):
        return ["utf-16"]
    else:
        # On Windows, guessing utf-8 doesn't work, so we have to try it:
        try:
            with io.open(csv_file, encoding="utf-8") as f:
                preview = f.read(222222)
            return ["utf-8"]
        except UnicodeDecodeError:
            return [locale.getdefaultlocale()[1], "utf-8"]

encodings = guess_encoding(r"C:\Users\frederic\Desktop\WinPython-64bit-3.4.4.3Qt5\notebooks\scores.txt")

# then your code with
with open(r"C:\Users\frederic\Desktop\WinPython-64bit-3.4.4.3Qt5\notebooks\scores.txt", "r", encoding=encodings[0]) as scoreFile:
    ...

Related

How can I decode a string read from file?

I read a file into a string in Python, and it shows up as encoded (I'm not sure of the encoding).

query = ""
with open(file_path) as f:
    for line in f.readlines():
        print(line)
        query += line
query

The lines all print out in English as expected:

select * from table

but the query at the end shows up like:

ÿþd\x00r\x00o\x00p\x00 \x00t\x00a\x00b\x00l\x00e\x00

What's going on?
Agreed with Carlos, the encoding seems to be UTF-16LE. There seems to be a BOM present, so encoding="utf-16" would be able to autodetect whether it's little- or big-endian.

Idiomatic Python would be:

with open(file_path, encoding="...") as f:
    for line in f:
        # do something with this line

In your case, you append each line to query, so the entire code can be reduced to:

query = open(file_path, encoding="...").read()
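To see why plain "utf-16" works here, note that the ÿþ prefix in your output is the UTF-16 little-endian BOM. A minimal sketch:

data = "drop table".encode("utf-16")  # bytes begin with the LE BOM b'\xff\xfe'
print(data[:2])                       # b'\xff\xfe' - the ÿþ you saw
print(data.decode("utf-16"))          # 'drop table' - the BOM picks the byte order and is stripped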
It seems like UTF-16 data. Can you try decoding it with utf-16? Note that a file object has no decode method, so read the bytes first in binary mode:

with open(file_path, 'rb') as f:
    query = f.read().decode('utf-16')
    print(query)
with open(filePath) as f:
    fileContents = f.read()

if isinstance(fileContents, str):
    fileContents = fileContents.decode('ascii', 'ignore').encode('ascii')  # note: this removes the character and encodes back to string
elif isinstance(fileContents, unicode):
    fileContents = fileContents.encode('ascii', 'ignore')

Python Split not working properly

I have the following code to read the lines in a file and split them on a specified delimiter. After splitting, I have to write some specific fields into another file.
Sample Data:
Week49_A_60002000;Mar;FY14;Actual;Working;E_1000;PC_000000;4287.63
Code:
import os
import codecs

sfilename = "WEEK_RPT_1108" + os.extsep + "dat"
sfilepath = "Club" + "/" + sfilename
sbackupname = "Club" + "/" + sfilename + os.extsep + "bak"

try:
    os.unlink(sbackupname)
except OSError:
    pass
os.rename(sfilepath, sbackupname)

try:
    inputfile = codecs.open(sbackupname, "r", "utf-16-le")
    outputfile = codecs.open(sfilepath, "w", "utf-16-le")
    sdelimdatfile = ";"
    for line in inputfile:
        record = line.split(';')
        outputfile.write(record[1])
except IOError, err:
    pass
I can see that the 0th array position contains the whole line instead of just the first field:

record[0] = Week49_A_60002000;Mar;FY14;Actual;Working;E_1000;PC_000000;4287.63

while on printing record[1], it says the index is out of range.

I need help, as I am new to Python. Thanks!
After your comment saying that print line outputs u'\u6557\u6b65\u3934\u415f\u365f\u3030\u3230\u3030\u3b30\u614d\u3b72\u5946\u3431\u413b\u7463\u6175\u3b6c\u6f57\u6b72\u6e69\u3b67\u5f45\u3031\u3030\u503b\u5f43\u3030\u3030\u3030\u343b\u3832\u2e37\u3336', I can explain what happens and how to fix it.

What happens: you have a normal 8-bit character file, and the line you show is even plain ASCII, but you try to decode it as if it were UTF-16 little-endian. So you wrongly combine every two bytes into a single 16-bit Unicode character! If your system had been able to display them correctly, and if you had printed line directly instead of repr(line), you would have got 敗步㤴䅟㙟〰㈰〰㬰慍㭲奆㐱䄻瑣慵㭬潗歲湩㭧彅〱〰倻彃〰〰〰㐻㠲⸷㌶. Of course, none of those Unicode characters is the semicolon (;, i.e. \x3b or \u003b), so the line cannot be split on it.

But as you encode it back before writing record[0], you find the whole line in the new file, which leads you to believe, erroneously, that the problem is in the split function.
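A minimal Python 2 sketch of that byte-pairing effect, using the start of your sample line:

line = "Week49;Mar".decode("utf-16-le")  # ten ASCII bytes fuse into five 16-bit characters
print repr(line)                         # u'\u6557\u6b65\u3934\u4d3b\u7261'
print u";" in line                       # False - so split(';') finds nothing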
How to fix: just open the file normally, or use the correct encoding if it contains non-ASCII characters. As you are using Python 2, I would just do:

try:
    inputfile = open(sbackupname, "r")
    outputfile = open(sfilepath, "w")
    sdelimdatfile = ";"
    for line in inputfile:
        record = line.split(sdelimdatfile)
        outputfile.write(record[1])
except IOError, err:
    pass
If you really need to use the codecs module, for example if the file contains UTF-8 or Latin-1 characters, you can replace the open part with:

encoding = "utf8"  # or "latin1" or whatever the actual encoding is...
inputfile = codecs.open(sbackupname, "r", encoding)
outputfile = codecs.open(sfilepath, "w", encoding)
Then, for the case where there is no index [1], either skip the line with continue if len(record) < 2, or just don't write to the file (like here):

for line in inputfile:
    record = line.split(';')
    if len(record) > 1:
        outputfile.write(record[1])

Python TypeError: coercing to Unicode: need string or buffer, file found

I am learning Python, coming from a C background. Sorry if my problem is naive, too simple, or if I didn't work on it enough.

In the code below, I want to practice, for future problems, the removal of specific rows using the 'set' data structure. But first of all, it fails to match the removal set contents. The second issue is the error in the output; this can be checked by making the indented block work instead.
The trimmed data file is marks_trim.csv:
"Anaconda Systems Campus Placement",,,,,,
"Conducted on:",,,"30 Feb 2011",,,
"Sno","Math","CS","GK","Prog","Comm","Sel"
1,"NA","NA","NA",4,0,0
import csv, sys, re, random, os, time, io, StringIO

datfile = sys.argv[1]
outfileName = sys.argv[2]
outfile = open(outfileName, "w")
count = 0
removal_list = set()
tmp = list()
i = 0
re_pattern = "\d+"

with open(datfile, 'r') as fp:
    reader1 = csv.reader(fp)
    for row in reader1:
        if re.match(re_pattern, row[0]):
            for cols in row:
                removal_list.add(tuple(cols))  # as tuple is hashable
            print "::row>>>>>>", row
print "::removal_list>>>>>>>>", removal_list

convert = list(removal_list)
print "<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>"
print convert

f = open(datfile, 'r')
reader2 = csv.reader(f)
print ""
print "Removal List Starts"
print removal_list
print "Removal List Ends\n"

new_a_buf = io.BytesIO()  # StringIO.StringIO(): both 'io' & 'StringIO' work
writer = csv.writer(new_a_buf)
rr = ""
j = 0
for row in reader2:
    if row not in convert:  # removal_list not used, as a list is not hashable
        writer.writerow(row)  # outfile.write(new_a_buf)
'''
# Below code using a char array isn't used, as it doesn't copy the structure of the csv file.
    for cols in row:             # at indentation level of "if row not in convert", stated above
        if cols not in convert:  # removal_list not used, as a list is not hashable
            for j in range(0, len(cols)):
                rr += cols[j]    # at indentation level of "if cols not in convert:"
    outfile.write(rr)            # at the indentation level of 'if'
print "<<<<<<<<<<<<<<<<", rr
f = open(outfile, 'r')
reader2 = csv.reader(f)
'''
new_a_buf.seek(0)
reader2 = csv.reader(new_a_buf)
for row in reader2:
    print row
Problem/Issue:

The common error (with both the char array and the csv.writer approaches): the output also contains the rows that were meant to be deleted, i.e. those that occur in removal_list.

However, in the approach using the char array for retrieving the left-out rows, the error is:

Traceback (most recent call last):
  File "test_list_in_set.py", line 51, in <module>
    f = open(outfile, 'r')
TypeError: coercing to Unicode: need string or buffer, file found
I didn't read through all that code - but it mostly doesn't seem relevant. The error is to do with opening a file: open takes a filename, but you are passing it outfile, which is already a file. You should close that file first then pass outfileName to open.
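A minimal sketch of that fix:

outfile.close()             # finish writing before re-reading
f = open(outfileName, 'r')  # pass the name (a string), not the file object
reader2 = csv.reader(f)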
Got it, solved it myself, sadly.

Apart from the change of not storing individual cols, I changed removal_list to a list and appended to it using removal_list.append(row).

What is the most efficient way to get first and last line of a text file?

I have a text file which contains a time stamp on each line. My goal is to find the time range. All the times are in order so the first line will be the earliest time and the last line will be the latest time. I only need the very first and very last line. What would be the most efficient way to get these lines in python?
Note: These files are relatively large in length, about 1-2 million lines each and I have to do this for several hundred files.
To read both the first and final line of a file you could...
open the file, ...
... read the first line using built-in readline(), ...
... seek (move the cursor) to the end of the file, ...
... step backwards until you encounter EOL (line break) and ...
... read the last line from there.
def readlastline(f):
    f.seek(-2, 2)              # Jump to the second last byte.
    while f.read(1) != b"\n":  # Until EOL is found ...
        f.seek(-2, 1)          # ... jump back, over the read byte plus one more.
    return f.read()            # Read all data from this point on.

with open(file, "rb") as f:
    first = f.readline()
    last = readlastline(f)
Jump to the second last byte directly to prevent trailing newline characters from causing empty lines to be returned*.

The current offset is pushed ahead by one every time a byte is read, so the stepping backwards is done two bytes at a time: past the recently read byte and the byte to read next.

The whence parameter passed to f.seek(offset, whence=0) indicates that seek should move to a position offset bytes relative to...

0 or os.SEEK_SET = the beginning of the file.
1 or os.SEEK_CUR = the current position.
2 or os.SEEK_END = the end of the file.

* This is to be expected, as the default behavior of most applications, including print and echo, is to append a newline to every line written; it has no effect on lines missing a trailing newline character.
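For reference, a short sketch of the three whence modes (assuming a file opened in binary mode, as above):

import os
with open(file, "rb") as f:
    f.seek(10, os.SEEK_SET)  # 10 bytes from the beginning
    f.seek(-4, os.SEEK_CUR)  # 4 bytes back from the current position
    f.seek(-1, os.SEEK_END)  # the last byte of the file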
Efficiency

    "1-2 million lines each and I have to do this for several hundred files."

I timed this method and compared it against the top answer.

10k iterations processing a file of 6k lines totalling 200kB: 1.62s vs 6.92s.
100 iterations processing a file of 6k lines totalling 1.3GB: 8.93s vs 86.95s.

Millions of lines would increase the difference a lot more.

Exact code used for timing the top answer's approach:
with open(file, "rb") as f:
    first = f.readline()  # Read and store the first line.
    for last in f: pass   # Read all lines, keep final value.
Amendment
A more complex, and harder to read, variation to address comments and issues raised since.
Return empty string when parsing empty file, raised by comment.
Return all content when no delimiter is found, raised by comment.
Avoid relative offsets to support text mode, raised by comment.
UTF16/UTF32 hack, noted by comment.
Also adds support for multibyte delimiters, readlast(b'X<br>Y', b'<br>', fixed=False).
Please note that this variation is really slow for large files because of the non-relative offsets needed in text mode. Modify to your need, or do not use it at all as you're probably better off using f.readlines()[-1] with files opened in text mode.
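That simpler text-mode alternative would look like this (a sketch; it reads the entire file into memory):

with open(path) as f:
    first = f.readline()
    last = f.readlines()[-1]  # reads everything after the first line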
#!/bin/python3
from os import SEEK_END

def readlast(f, sep, fixed=True):
    r"""Read the last segment from a file-like object.

    :param f: File to read last line from.
    :type  f: file-like object
    :param sep: Segment separator (delimiter).
    :type  sep: bytes, str
    :param fixed: Treat data in ``f`` as a chain of fixed size blocks.
    :type  fixed: bool
    :returns: Last line of file.
    :rtype: bytes, str
    """
    bs = len(sep)
    step = bs if fixed else 1
    if not bs:
        raise ValueError("Zero-length separator.")
    try:
        o = f.seek(0, SEEK_END)
        o = f.seek(o - bs - step)     # - Ignore trailing delimiter 'sep'.
        while f.read(bs) != sep:      # - Until reaching 'sep': Read sep-sized block
            o = f.seek(o - step)      #   and then seek to the block to read next.
    except (OSError, ValueError):     # - Beginning of file reached.
        f.seek(0)
    return f.read()

def test_readlast():
    from io import BytesIO, StringIO
    # Text mode.
    f = StringIO("first\nlast\n")
    assert readlast(f, "\n") == "last\n"
    # Bytes.
    f = BytesIO(b'first|last')
    assert readlast(f, b'|') == b'last'
    # Bytes, UTF-8.
    f = BytesIO("X\nY\n".encode("utf-8"))
    assert readlast(f, b'\n').decode() == "Y\n"
    # Bytes, UTF-16.
    f = BytesIO("X\nY\n".encode("utf-16"))
    assert readlast(f, b'\n\x00').decode('utf-16') == "Y\n"
    # Bytes, UTF-32.
    f = BytesIO("X\nY\n".encode("utf-32"))
    assert readlast(f, b'\n\x00\x00\x00').decode('utf-32') == "Y\n"
    # Multichar delimiter.
    f = StringIO("X<br>Y")
    assert readlast(f, "<br>", fixed=False) == "Y"
    # Make sure you use the correct delimiters.
    seps = { 'utf8': b'\n', 'utf16': b'\n\x00', 'utf32': b'\n\x00\x00\x00' }
    assert "\n".encode('utf8' )     == seps['utf8']
    assert "\n".encode('utf16')[2:] == seps['utf16']
    assert "\n".encode('utf32')[4:] == seps['utf32']
    # Edge cases.
    edges = (
        # Text , Match
        (""    , ""  ),  # Empty file, empty string.
        ("X"   , "X" ),  # No delimiter, full content.
        ("\n"  , "\n"),
        ("\n\n", "\n"),
        # UTF16/32 encoded U+270A (b"\n\x00\n'\n\x00"/utf16)
        (b'\n\xe2\x9c\x8a\n'.decode(), b'\xe2\x9c\x8a\n'.decode()),
    )
    for txt, match in edges:
        for enc, sep in seps.items():
            assert readlast(BytesIO(txt.encode(enc)), sep).decode(enc) == match

if __name__ == "__main__":
    import sys
    for path in sys.argv[1:]:
        with open(path) as f:
            print(f.readline() , end="")
            print(readlast(f, "\n"), end="")
See the docs for the io module.
with open(fname, 'rb') as fh:
    first = next(fh).decode()
    fh.seek(-1024, 2)
    last = fh.readlines()[-1].decode()
The variable value here is 1024: it represents the average string length. I chose 1024 only as an example. If you have an estimate of the average line length, you could just use that value times 2.
Since you have no idea whatsoever about the possible upper bound for the line length, the obvious solution would be to loop over the file:

for line in fh:
    pass
last = line

You don't need to bother with the binary flag; you could just use open(fname).
ETA: Since you have many files to work on, you could create a sample of a couple dozen files using random.sample and run this code on them to determine the length of the last line, with an a priori large value for the position shift (say, 1 MB). This will help you estimate the value for the full run.
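A sketch of that sampling idea, assuming filenames holds your several hundred paths and each file is larger than the 1 MB shift:

import random
for fname in random.sample(filenames, 24):  # a couple dozen files
    with open(fname, 'rb') as fh:
        fh.seek(-1024 * 1024, 2)            # a priori large shift: 1 MB
        print max(len(line) for line in fh.readlines()[1:])  # skip the first, partial line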
Here's a modified version of SilentGhost's answer that will do what you want.

with open(fname, 'rb') as fh:
    first = next(fh)
    offs = -100
    while True:
        fh.seek(offs, 2)
        lines = fh.readlines()
        if len(lines) > 1:
            last = lines[-1]
            break
        offs *= 2
print first
print last

No need for an upper bound for line length here.
Can you use unix commands? I think using head -1 and tail -n 1 are probably the most efficient methods. Alternatively, you could use a simple fid.readline() to get the first line and fid.readlines()[-1] for the last, but that may take too much memory.
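If shelling out is acceptable, here is a sketch of calling those tools from Python (assuming fname holds the path and the tools are on the PATH):

import subprocess
first = subprocess.check_output(['head', '-1', fname])
last = subprocess.check_output(['tail', '-n', '1', fname])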
This is my solution, also compatible with Python 3. It also manages border cases, but it misses UTF-16 support:

import os

def tail(filepath):
    """
    #author Marco Sulla (marcosullaroma#gmail.com)
    #date May 31, 2016
    """
    try:
        filepath.is_file
        fp = str(filepath)
    except AttributeError:
        fp = filepath

    with open(fp, "rb") as f:
        size = os.stat(fp).st_size
        start_pos = 0 if size - 1 < 0 else size - 1

        if start_pos != 0:
            f.seek(start_pos)
            char = f.read(1)
            if char == b"\n":
                start_pos -= 1
                f.seek(start_pos)

        if start_pos == 0:
            f.seek(start_pos)
        else:
            char = ""
            for pos in range(start_pos, -1, -1):
                f.seek(pos)
                char = f.read(1)
                if char == b"\n":
                    break

        return f.readline()
It's inspired by Trasp's answer and AnotherParker's comment.
First open the file in read mode. Then use the readlines() method to read it line by line; all the lines are stored in a list. Now you can use list indexing to get the first and last lines of the file:

a = open('file.txt', 'rb')
lines = a.readlines()
if lines:
    first_line = lines[0]
    last_line = lines[-1]
w = open('file.txt', 'r')
print('first line is : ', w.readline())
for line in w:
    x = line
print('last line is : ', x)
w.close()

The for loop runs through the lines, and x gets the last line on the final iteration.
with open("myfile.txt") as f:
lines = f.readlines()
first_row = lines[0]
print first_row
last_row = lines[-1]
print last_row
Here is an extension of @Trasp's answer that has additional logic for handling the corner case of a file that has only one line. It may be useful to handle this case if you repeatedly want to read the last line of a file that is continuously being updated. Without this, if you try to grab the last line of a file that has just been created and has only one line, IOError: [Errno 22] Invalid argument will be raised.

def tail(filepath):
    with open(filepath, "rb") as f:
        first = f.readline()       # Read the first line.
        f.seek(-2, 2)              # Jump to the second last byte.
        while f.read(1) != b"\n":  # Until EOL is found...
            try:
                f.seek(-2, 1)      # ...jump back the read byte plus one more.
            except IOError:
                f.seek(-1, 1)
                if f.tell() == 0:
                    break
        last = f.readline()        # Read last line.
    return last
Nobody mentioned using reversed:

f = open(file, "r")
r = reversed(f.readlines())
last_line_of_file = r.next()
Getting the first line is trivially easy. For the last line, presuming you know an approximate upper bound on the line length, os.lseek some amount back from SEEK_END, find the second-to-last line ending, and then readline() the last line.
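A sketch of that idea, assuming lines never exceed, say, 200 bytes:

import os
fd = os.open(fname, os.O_RDONLY)
os.lseek(fd, -400, os.SEEK_END)           # jump back twice the assumed upper bound
last = os.read(fd, 400).splitlines()[-1]  # everything after the last newline
os.close(fd)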
def tail(filename):
    with open(filename, "rb") as f:  # Needs to be in binary mode for the seek from the end to work
        first = f.readline()
        if f.read(1) == b'':         # Only one line in the file.
            return first
        f.seek(-2, 2)                # Jump to the second last byte.
        while f.read(1) != b"\n":    # Until EOL is found...
            f.seek(-2, 1)            # ...jump back the read byte plus one more.
        last = f.readline()          # Read last line.
        return last

This is a modified version of the answers above; it handles the case where there is only one line in the file.

Python - Finding unicode/ascii problems

I am using csv.reader to pull in info from a very long sheet. I am doing work on that data set and then I am using the xlwt package to give me a workable Excel file.
However, I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 34: ordinal not in range(128)
My question to you all is, how can I find exactly where that error is in my data set? Also, is there some code that I can write which will look through my data set and find out where the issues lie (because some data sets run without the above error and others have problems)?
The answer is quite simple actually: as soon as you read your data from your file, convert it to unicode using the encoding of your file, and handle the UnicodeDecodeError exception:

try:
    # decode using utf-8 (use ascii if you want)
    unicode_data = str_data.decode("utf-8")
except UnicodeDecodeError, e:
    print "The error is there!"
this will save you from many troubles; you won't have to worry about multibyte character encoding, and external libraries (including xlwt) will just do The Right Thing if they need to write it.
Python 3.0 will make it mandatory to specify the encoding of a string, so it's a good idea to do it now.
The csv module doesn't support unicode and null characters. You might be able to replace them by doing something like this though (replace 'utf-8' with the encoding your CSV data is encoded in):

import codecs
import csv

class AsciiFile:
    def __init__(self, path):
        self.f = codecs.open(path, 'rb', 'utf-8')

    def close(self):
        self.f.close()

    def __iter__(self):
        for line in self.f:
            # 'replace' for unicode characters -> ?, 'ignore' to ignore them
            y = line.encode('ascii', 'replace')
            y = y.replace('\0', '?')  # Can't handle null characters!
            yield y

f = AsciiFile(PATH)
r = csv.reader(f)
...
f.close()
If you want to find the positions of the characters which can't be handled by the CSV module, you could do e.g.:

import codecs

lineno = 0
f = codecs.open(PATH, 'rb', 'utf-8')
for line in f:
    for x, c in enumerate(line):
        if not c.encode('ascii', 'ignore') or c == '\0':
            print "Character ordinal %s line %s character %s is unicode or null!" % (ord(c), lineno, x)
    lineno += 1
f.close()
Alternatively again, you could use this CSV opener which I wrote, which can handle Unicode characters:

import codecs

def OpenCSV(Path, Encoding, Delims, StartAtRow, Qualifier, Errors):
    infile = codecs.open(Path, "rb", Encoding, errors=Errors)
    for Line in infile:
        Line = Line.strip('\r\n')
        if (StartAtRow - 1) and StartAtRow > 0:
            StartAtRow -= 1
        elif Qualifier != '(None)':
            # Take a note of the chars 'before', just
            # in case of excel-style """ quoting.
            cB41 = ''; cB42 = ''
            L = ['']
            qMode = False
            for c in Line:
                if c == Qualifier and c == cB41 == cB42 and qMode:
                    # Triple qualifiers, so allow it with one
                    L[-1] = L[-1][:-2]
                    L[-1] += c
                elif c == Qualifier:
                    # A qualifier, so reverse qual mode
                    qMode = not qMode
                elif c in Delims and not qMode:
                    # Not in qual mode and delim
                    L.append('')
                else:
                    # Nothing to see here, move along
                    L[-1] += c
                cB42 = cB41
                cB41 = c
            yield L
        else:
            # There aren't any qualifiers.
            cB41 = ''; cB42 = ''
            L = ['']
            for c in Line:
                cB42 = cB41; cB41 = c
                if c in Delims:
                    # Delim
                    L.append('')
                else:
                    # Nothing to see here, move along
                    L[-1] += c
            yield L

for listItem in OpenCSV(PATH, Encoding='utf-8', Delims=[','], StartAtRow=0, Qualifier='"', Errors='replace'):
    ...
You can refer to code snippets in the question below to get a csv reader with unicode encoding support:
General Unicode/UTF-8 support for csv files in Python 2.6
PLEASE give the full traceback that you got along with the error message. When we know where you are getting the error (reading CSV file, "doing work on that data set", or in writing an XLS file using xlwt), then we can give a focused answer.
It is very possible that your input data is not all plain old ASCII. What produces it, and in what encoding?
To find where the problems (not necessarily errors) are, try a little script like this (untested):
import sys, glob

for pattern in sys.argv[1:]:
    for filepath in glob.glob(pattern):
        for linex, line in enumerate(open(filepath, 'r')):
            if any(c >= '\x80' for c in line):
                print "Non-ASCII in line %d of file %r" % (linex + 1, filepath)
                print repr(line)
It would be useful if you showed some samples of the "bad" lines that you find, so that we can judge what the encoding might be.
I'm curious about using "csv.reader to pull in info from a very long sheet" -- what kind of "sheet"? Do you mean that you are saving an XLS file as CSV, then reading the CSV file? If so, you could use xlrd to read directly from the input XLS file, getting unicode text which you can give straight to xlwt, avoiding any encode/decode problems.
Have you worked through the tutorial from the python-excel.org site?
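If the input really is an XLS file, the xlrd route is a sketch like this (the filename here is hypothetical):

import xlrd
book = xlrd.open_workbook("input.xls")
sheet = book.sheet_by_index(0)
for rowx in range(sheet.nrows):
    row = sheet.row_values(rowx)  # cell text arrives as unicode, no decode step needed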
