How to (literally) read a file character by character? [duplicate] - python

I have to read a text file into Python. The file encoding is:
file -bi test.csv
text/plain; charset=us-ascii
This is a third-party file, and I get a new one every day, so I would rather not change it. The file has non-ASCII characters, such as Ö. I need to read the lines using Python, and I can afford to ignore any line that has a non-ASCII character.
My problem is that when I read the file in Python, I get a UnicodeDecodeError when reaching a line with a non-ASCII character, and I cannot read the rest of the file.
Is there a way to avoid this? If I try this:
import codecs

fileHandle = codecs.open("test.csv", encoding='utf-8')
try:
    for line in fileHandle:
        print(line, end="")
except UnicodeDecodeError:
    pass
then when the error is reached the for loop ends and I cannot read the rest of the file. I want to skip the line that causes the error and go on. I would rather not make any changes to the input file, if possible.
Is there any way to do this?
Thank you very much.

Your file doesn't appear to use the UTF-8 encoding. It is important to use the correct codec when opening a file.
You can tell open() how to treat decoding errors, with the errors keyword:
errors is an optional string that specifies how encoding and decoding errors are to be handled; this cannot be used in binary mode. A variety of standard error handlers are available, though any error handling name that has been registered with codecs.register_error() is also valid. The standard names are:
'strict' to raise a ValueError exception if there is an encoding error. The default value of None has the same effect.
'ignore' ignores errors. Note that ignoring encoding errors can lead to data loss.
'replace' causes a replacement marker (such as '?') to be inserted where there is malformed data.
'surrogateescape' will represent any incorrect bytes as low surrogate code units ranging from U+DC80 to U+DCFF. These surrogate code units will then be turned back into the same bytes when the surrogateescape error handler is used when writing data. This is useful for processing files in an unknown encoding.
'xmlcharrefreplace' is only supported when writing to a file. Characters not supported by the encoding are replaced with the appropriate XML character reference &#nnn;.
'backslashreplace' (also only supported when writing) replaces unsupported characters with Python’s backslashed escape sequences.
Opening the file with anything other than 'strict' ('ignore', 'replace', etc.) will then let you read the file without exceptions being raised.
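For example, a minimal sketch using 'ignore' (undecodable bytes are simply dropped, so affected lines are kept but shortened rather than skipped):
with open("test.csv", encoding="utf-8", errors="ignore") as f:
    for line in f:
        print(line, end="")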
Note that decoding takes place per buffered block of data, not per textual line. If you must detect errors on a line-by-line basis, use the surrogateescape handler and test each line read for codepoints in the surrogate range:
import re

_surrogates = re.compile(r"[\uDC80-\uDCFF]")

def detect_decoding_errors_line(l, _s=_surrogates.finditer):
    """Return decoding errors in a line of text

    Works with text lines decoded with the surrogateescape
    error handler.

    Returns a list of (pos, byte) tuples
    """
    # DC80 - DCFF encode bad bytes 80-FF
    return [(m.start(), bytes([ord(m.group()) - 0xDC00]))
            for m in _s(l)]
E.g.
with open("test.csv", encoding="utf8", errors="surrogateescape") as f:
for i, line in enumerate(f, 1):
errors = detect_decoding_errors_line(line)
if errors:
print(f"Found errors on line {i}:")
for (col, b) in errors:
print(f" {col + 1:2d}: {b[0]:02x}")
Take into account that not all decoding errors can be recovered from gracefully. While UTF-8 is designed to be robust in the face of small errors, other multi-byte encodings such as UTF-16 and UTF-32 can't cope with dropped or extra bytes, which will then affect how accurately line separators can be located. The above approach can then result in the remainder of the file being treated as one long line. If the file is big enough, that can then in turn lead to a MemoryError exception if the 'line' is large enough.

how can I replace characters which give me UnicodeDecodeErrors?

I am extracting the content from different file types into a csv file.
I am currently trying to extract from the file type 'm'.
That's my extraction function:
def extract_m(f):  # f is the file
    with open(f, encoding="utf8") as text:
        lines = text.read()
        lines = cleaning(lines)
    return lines
This code works until some specific characters are in the document. Then my program throws UnicodeDecodeError: 'utf-8' codec can't decode byte.
With some other file types the program was crashing when I tried to write the extracted data into the csv file. To fix that I used the cleaning() function, which replaced the troublesome characters.
But now the program crashes at the line lines = text.read(),
so the program never gets into the cleaning() function.
I tried
text = f.read().decode(errors='replace')
But then I get the error AttributeError: 'str' object has no attribute 'decode'.
I don't know why my function cannot open the file anymore.
Edit: You were all correct. One of the files is encoded in cp1252.
When I pass errors='replace' to open(), the file opens without errors, but with '�' symbols.
Try this instead:
def extract_m(f):  # f is the file
    with open(f, encoding="utf8", errors='replace') as text:
        lines = text.read()
        lines = cleaning(lines)
    return lines
Where errors can be the following ones (from the docs):
'strict' to raise a ValueError exception if there is an encoding error. The default value of None has the same effect.
'ignore' ignores errors. Note that ignoring encoding errors can lead to data loss.
'replace' causes a replacement marker (such as '?') to be inserted where there is malformed data.
'surrogateescape' will represent any incorrect bytes as low surrogate code units ranging from U+DC80 to U+DCFF. These surrogate code units will then be turned back into the same bytes when the surrogateescape error handler is used when writing data. This is useful for processing files in an unknown encoding.
'xmlcharrefreplace' is only supported when writing to a file. Characters not supported by the encoding are replaced with the appropriate XML character reference &#nnn;.
'backslashreplace' replaces malformed data by Python’s backslashed escape sequences.
'namereplace' (also only supported when writing) replaces unsupported characters with \N{...} escape sequences.
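Since the edit says one of the files is actually cp1252, another option is to try UTF-8 first and fall back to cp1252 instead of inserting replacement characters; a sketch, assuming cleaning() is the same helper used in the question:
def extract_m(f):  # f is the file path
    # Try UTF-8 first, then cp1252; replace bad bytes only as a last resort.
    for enc in ("utf8", "cp1252"):
        try:
            with open(f, encoding=enc) as text:
                return cleaning(text.read())
        except UnicodeDecodeError:
            continue
    with open(f, encoding="utf8", errors="replace") as text:
        return cleaning(text.read())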

Why do I keep getting a UnicodeDecodeError in the pandas read_csv() function even though I specified the correct encoding parameter? [duplicate]

Why is the below item failing? Why does it succeed with "latin-1" codec?
o = "a test of \xe9 char" #I want this to remain a string as this is what I am receiving
v = o.decode("utf-8")
Which results in:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 10: invalid continuation byte
I had the same error when I tried to open a CSV file with the pandas.read_csv method.
The solution was to change the encoding to latin-1:
pd.read_csv('ml-100k/u.item', sep='|', names=m_cols , encoding='latin-1')
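If you are not sure which encoding to pass, one option (a sketch, assuming the third-party chardet package is installed; m_cols is the column list from the snippet above) is to let chardet guess the encoding from a sample of the raw bytes first:
import chardet
import pandas as pd

with open('ml-100k/u.item', 'rb') as f:
    guess = chardet.detect(f.read(100000))  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
df = pd.read_csv('ml-100k/u.item', sep='|', names=m_cols,
                 encoding=guess['encoding'])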
In binary, 0xE9 looks like 1110 1001. If you read about UTF-8 on Wikipedia, you’ll see that such a byte must be followed by two of the form 10xx xxxx. So, for example:
>>> b'\xe9\x80\x80'.decode('utf-8')
u'\u9000'
But that’s just the mechanical cause of the exception. In this case, you have a string that is almost certainly encoded in latin 1. You can see how UTF-8 and latin 1 look different:
>>> u'\xe9'.encode('utf-8')
b'\xc3\xa9'
>>> u'\xe9'.encode('latin-1')
b'\xe9'
(Note, I'm using a mix of Python 2 and 3 representation here. The input is valid in any version of Python, but your Python interpreter is unlikely to actually show both unicode and byte strings in this way.)
It is invalid UTF-8. That character is the e-acute character in ISO-Latin1, which is why it succeeds with that codeset.
If you don't know the codeset you're receiving strings in, you're in a bit of trouble. It would be best if a single codeset (hopefully UTF-8) would be chosen for your protocol/application and then you'd just reject ones that didn't decode.
If you can't do that, you'll need heuristics.
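A minimal sketch of such a heuristic (just a fixed preference list, not a real charset detector):
def decode_best_effort(raw):
    """Try a few likely encodings in order; latin-1 is last because it never fails."""
    for enc in ('utf-8', 'cp1252'):
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            pass
    return raw.decode('latin-1')  # maps every byte to a character, so it always succeeds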
Because UTF-8 is multibyte and there is no character corresponding to your combination of \xe9 plus a following space.
Why should it succeed in both utf-8 and latin-1?
Here is how the same sentence would look in utf-8:
>>> o.decode('latin-1').encode("utf-8")
'a test of \xc3\xa9 char'
If this error arises when manipulating a file that was just opened, check to see if you opened it in 'rb' mode
Use this if you get a UTF-8 error:
pd.read_csv('File_name.csv',encoding='latin-1')
A UTF-8 decode error usually appears when byte values fall outside the 0 to 127 ASCII range.
The reason this exception is raised:
1) If the code point is < 128, each byte is the same as the value of the code point.
2) If the code point is 128 or greater, the Unicode string can't be represented in this (ASCII) encoding. (Python raises a UnicodeEncodeError exception in this case.)
To overcome this we have a set of other encodings; the most widely used is Latin-1, also known as ISO-8859-1.
Unicode code points 0–255 are identical to the Latin-1 values, so converting to this encoding simply requires converting code points to byte values; if a code point larger than 255 is encountered, the string can't be encoded into Latin-1.
When this exception occurs while you are trying to load a data set, try this format:
df=pd.read_csv("top50.csv",encoding='ISO-8859-1')
Adding the encoding argument at the end then lets the data set load.
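A quick demonstration of that property, for what it's worth: every possible byte value decodes under Latin-1 and round-trips back to the same bytes.
all_bytes = bytes(range(256))
text = all_bytes.decode('latin-1')          # never raises
assert text.encode('latin-1') == all_bytes  # round-trips exactly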
Well, this type of error comes when you are reading a particular file or data set into pandas, such as:
data=pd.read_csv('/kaggle/input/fertilizers-by-product-fao/FertilizersProduct.csv')
Then the error is displayed like this:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf4 in position 1: invalid continuation byte
This type of error can be avoided by adding an encoding argument:
data=pd.read_csv('/kaggle/input/fertilizers-by-product-fao/FertilizersProduct.csv', encoding='ISO-8859-1')
This happened to me too, while I was reading text containing Hebrew from a .txt file.
I clicked File -> Save As and saved the file with UTF-8 encoding.
TLDR: I would recommend investigating the source of the problem in depth before switching encoders to silence the error.
I got this error as I was processing a large number of zip files with additional zip files in them.
My workflow was the following:
Read zip
Read child zip
Read text from child zip
At some point I was hitting the encoding error above. Upon closer inspection, it turned out that some child zips erroneously contained further zips. Reading these zips as text led to some funky character representation that I could silence with encoding="latin-1", but which in turn caused issues further down the line. Since I was working with international data it was not completely foolish to assume it was an encoding problem (I had problems with 0xc2: Â), but in the end it was not the actual issue.
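A hedged sketch of the kind of sanity check that would have caught this, assuming the entries are read with the standard zipfile module (read_entry_text is a hypothetical helper, not part of the original workflow):
import io
import zipfile

def read_entry_text(zf, name, encoding='utf-8'):
    """Return the entry's text, or None if the entry is itself a nested zip archive."""
    raw = zf.read(name)
    if zipfile.is_zipfile(io.BytesIO(raw)):  # nested archive, not text
        return None
    return raw.decode(encoding)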
In this case, I tried to execute a .py which runs a path/file.sql.
My solution was to change the encoding of the file.sql to "UTF-8 without BOM" and it works!
You can do it with Notepad++.
I will leave part of my code.
con = psycopg2.connect(host=sys.argv[1], port=sys.argv[2],
                       dbname=sys.argv[3], user=sys.argv[4],
                       password=sys.argv[5])
cursor = con.cursor()
sqlfile = open(path, 'r')
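An alternative that avoids re-saving the file in Notepad++ is to open it with the 'utf-8-sig' codec, which transparently strips a leading BOM if one is present; a sketch based on the snippet above:
with open(path, 'r', encoding='utf-8-sig') as sqlfile:  # drops the BOM, if any
    sql = sqlfile.read()
cursor.execute(sql)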
I encountered this problem, and it turned out that I had saved my CSV directly from a Google Sheets file. In other words, I was in a Google Sheets file, I chose "Save a copy", and when my browser downloaded it, I chose Open. Then I DIRECTLY saved the CSV. This was the wrong move.
What fixed it for me was first saving the sheet as an .xlsx file on my local computer, and from there exporting a single sheet as .csv. Then the error went away for pd.read_csv('myfile.csv').
The solution was to change to "UTF-8 without BOM".

How to manipulate multibyte string in python?

I have a log file with multibyte data in it. I want to write a script that does some data manipulation on it.
with open(fo, encoding="cp1252") as file:
    for line in file:
        print(line)
        if "WINDOWS" in line:
            print("found")
print(line) gives the following output:
there is one extra byte after every character.
This is not working because "WINDOWS" is not multibyte. I am unable to find a solution for this. Can someone help me here?
cp1252 is not a multibyte encoding. If the file in fact contains UTF-16, but most of it is in the very lowest range of Unicode, using cp1252 will yield roughly the correct characters except there will be zero (null) bytes between them. Without an unambiguous sample of the bytes in the file, we can only speculate; but try opening the file with encoding='utf-16le'. (If this fails, please edit your question to include a hex dump or repr() of the binary bytes in the file; see also Problematic questions about decoding errors.)
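For example, a minimal sketch of that suggestion (fo is the path variable from the question):
with open(fo, encoding="utf-16le") as file:  # assumes UTF-16 little-endian without a BOM
    for line in file:
        if "WINDOWS" in line:
            print("found")
            print(line)
If the file does start with a BOM, plain encoding="utf-16" will detect the byte order itself.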

Removing unknown characters from a text file

I have a large number of files containing data I am trying to process using a Python script.
The files are in an unknown encoding, and if I open them in Notepad++ they contain numerical data separated by a load of 'null' characters (represented as NULL in white on black background in Notepad++).
In order to handle this, I separate the file by the null character \x00 and retrieve only numerical values using the following script:
import os

stripped_data = []
for root, dirs, files in os.walk(PATH):
    for rawfile in files:
        (dirName, fileName) = os.path.split(rawfile)
        (fileBaseName, fileExtension) = os.path.splitext(fileName)
        h = open(os.path.join(root, rawfile), 'r')
        line = h.read()
        for raw_value in line.split('\x00'):
            try:
                test = float(raw_value)
                stripped_data.append(raw_value.strip())
            except ValueError:
                pass
However, there are sometimes other unrecognised characters in the file (as far as I have found, always at the very beginning) - these show up in Notepad++ as 'EOT', 'SUB' and 'ETX'. They seem to interfere with the processing of the file in Python - the file appears to end at those characters, even though there is clearly more data visible in Notepad++.
How can I remove all non-ASCII characters from these files prior to processing?
You are opening the file in text mode. That means that the first Ctrl-Z character is considered as an end-of-file character. Specify 'rb' instead of 'r' in open().
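A minimal sketch of that change applied to the question's loop, assuming Python 3 (binary reads return bytes, so each chunk is decoded before testing it):
stripped_data = []
with open(os.path.join(root, rawfile), 'rb') as h:  # binary mode: 0x1A no longer ends the read
    data = h.read()
for raw_value in data.split(b'\x00'):
    try:
        text = raw_value.decode('ascii').strip()
        float(text)                  # keep only values that parse as numbers
        stripped_data.append(text)
    except (UnicodeDecodeError, ValueError):
        pass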
I don't know if this will work for sure, but you could try using the IO methods in the codecs module:
import codecs

inFile = codecs.open(<SAME ARGS AS 'OPEN'>, 'utf-8')
for line in inFile:
    do_stuff()
You can treat inFile just like a normal FILE object.
This may or may not help you, but it probably will.
[EDIT]
Basically you'll replace h=open(os.path.join(root, rawfile),'r') with h=codecs.open(os.path.join(root, rawfile),'r','utf-8')
The file.read() function will read until EOF.
As you said it stops too early, you want to continue reading the file even after hitting an EOF character.
Make sure to stop when you have read the entire file. You can do this by checking the position in the file via file.tell() when hitting an EOF and stopping once you reach the file size (read the file size before reading).
As this is rather complex, you may want to use file.next and iterate over bytes.
To remove non-ASCII characters you can either use a whitelist of specific characters or check each byte you read against a range you define.
E.g. if the byte is between 0x30 and 0x39 (a digit), keep it / save it somewhere / append it to a string.
See an ASCII table.
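A small sketch of the whitelist idea described above, applied to the question's files (keeps digits, signs, decimal points, whitespace and the NUL separators the script already splits on; drops EOT, SUB, ETX and anything else):
def drop_unwanted_bytes(raw):
    """Whitelist filter over raw bytes."""
    allowed = set(b'0123456789+-.eE \t\r\n\x00')
    return bytes(b for b in raw if b in allowed)

with open(os.path.join(root, rawfile), 'rb') as h:
    cleaned = drop_unwanted_bytes(h.read()).decode('ascii')
# cleaned.split('\x00') can now be fed to the existing float() loop unchanged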
