Removing unknown characters from a text file - python

I have a large number of files containing data I am trying to process using a Python script.
The files are in an unknown encoding, and if I open them in Notepad++ they contain numerical data separated by a load of 'null' characters (represented as NULL in white on black background in Notepad++).
In order to handle this, I separate the file by the null character \x00 and retrieve only numerical values using the following script:
import os

stripped_data = []
for root, dirs, files in os.walk(PATH):
    for rawfile in files:
        (dirName, fileName) = os.path.split(rawfile)
        (fileBaseName, fileExtension) = os.path.splitext(fileName)
        h = open(os.path.join(root, rawfile), 'r')
        line = h.read()
        for raw_value in line.split('\x00'):
            try:
                test = float(raw_value)
                stripped_data.append(raw_value.strip())
            except ValueError:
                pass
However, there are sometimes other unrecognised characters in the file (as far as I have found, always at the very beginning) - these show up in Notepad++ as 'EOT', 'SUB' and 'ETX'. They seem to interfere with the processing of the file in Python - the file appears to end at those characters, even though there is clearly more data visible in Notepad++.
How can I remove all non-ASCII characters from these files prior to processing?

You are opening the file in text mode. That means that the first Ctrl-Z character is considered as an end-of-file character. Specify 'rb' instead of 'r' in open().
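For example, a minimal sketch of that change applied to the question's loop (Python 3 style, where the file contents are bytes; under Python 2 it is enough to change 'r' to 'rb' and leave the rest alone):
import os

stripped_data = []
for root, dirs, files in os.walk(PATH):
    for rawfile in files:
        with open(os.path.join(root, rawfile), 'rb') as h:   # binary mode: no early EOF at Ctrl-Z
            data = h.read()
        for raw_value in data.split(b'\x00'):                # split on NUL bytes
            try:
                float(raw_value)                             # keep only fields that parse as numbers
                stripped_data.append(raw_value.strip())
            except ValueError:
                pass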

I don't know if this will work for sure, but you could try using the IO methods in the codecs module:
import codecs

inFile = codecs.open(os.path.join(root, rawfile), 'r', 'utf-8')
for line in inFile:
    do_stuff()
You can treat inFile just like a normal file object.
This may or may not help you, but it probably will.
[EDIT]
Basically you'll replace h = open(os.path.join(root, rawfile), 'r') with h = codecs.open(os.path.join(root, rawfile), 'r', 'utf-8').

The file.read() function will read until EOF.
As you said it stops too early, you want to continue reading the file even after hitting an apparent EOF.
Make sure to stop when you have read the entire file. You can do this by checking the position in the file via file.tell() when hitting an EOF and stopping once you reach the file size (read the file size prior to reading).
As this is rather complex, you may want to use file.next and iterate over bytes instead.
To remove non-ASCII characters you can either use a whitelist of specific characters or check each byte you read against a range you define, as in the sketch below.
E.g. if the byte is between 0x30 and 0x39 (a digit), keep it / save it somewhere / add it to a string.
See an ASCII table.
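For example, a rough sketch of the byte-range idea (the file name is hypothetical, and NUL separators are turned into whitespace first so adjacent values don't run together):
# Whitelist of bytes that may appear in a numeric value, plus whitespace separators.
ALLOWED = set(b'0123456789+-.eE \t\r\n')

with open('datafile.raw', 'rb') as f:          # hypothetical file name
    raw = f.read().replace(b'\x00', b' ')      # treat NUL separators as whitespace

cleaned = bytes(b for b in raw if b in ALLOWED)

numbers = []
for token in cleaned.split():                  # split on runs of whitespace
    try:
        numbers.append(float(token))
    except ValueError:
        pass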

Related

How to (literally) read a file character by character? [duplicate]

I have to read a text file into Python. The file encoding is:
file -bi test.csv
text/plain; charset=us-ascii
This is a third-party file, and I get a new one every day, so I would rather not change it. The file has non-ASCII characters, such as Ö, for example. I need to read the lines using Python, and I can afford to ignore a line which has a non-ASCII character.
My problem is that when I read the file in Python, I get a UnicodeDecodeError when reaching the line where a non-ASCII character exists, and I cannot read the rest of the file.
Is there a way to avoid this? If I try this:
import codecs

fileHandle = codecs.open("test.csv", encoding='utf-8')
try:
    for line in fileHandle:
        print(line, end="")
except UnicodeDecodeError:
    pass
then when the error is reached the for loop ends and I cannot read the rest of the file. I want to skip the line that causes the error and go on. I would rather not make any changes to the input file, if possible.
Is there any way to do this?
Thank you very much.
Your file doesn't appear to use the UTF-8 encoding. It is important to use the correct codec when opening a file.
You can tell open() how to treat decoding errors, with the errors keyword:
errors is an optional string that specifies how encoding and decoding errors are to be handled; this cannot be used in binary mode. A variety of standard error handlers are available, though any error handling name that has been registered with codecs.register_error() is also valid. The standard names are:
'strict' to raise a ValueError exception if there is an encoding error. The default value of None has the same effect.
'ignore' ignores errors. Note that ignoring encoding errors can lead to data loss.
'replace' causes a replacement marker (such as '?') to be inserted where there is malformed data.
'surrogateescape' will represent any incorrect bytes as code points in the Unicode Private Use Area ranging from U+DC80 to U+DCFF. These private code points will then be turned back into the same bytes when the surrogateescape error handler is used when writing data. This is useful for processing files in an unknown encoding.
'xmlcharrefreplace' is only supported when writing to a file. Characters not supported by the encoding are replaced with the appropriate XML character reference &#nnn;.
'backslashreplace' (also only supported when writing) replaces unsupported characters with Python’s backslashed escape sequences.
Opening the file with anything other than 'strict' ('ignore', 'replace', etc.) will then let you read the file without exceptions being raised.
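For example, a quick sketch using the 'replace' handler (reusing the test.csv name from the question):
with open("test.csv", encoding="utf8", errors="replace") as f:
    for line in f:
        # undecodable bytes show up as U+FFFD replacement characters
        print(line, end="")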
Note that decoding takes place per buffered block of data, not per textual line. If you must detect errors on a line-by-line basis, use the surrogateescape handler and test each line read for codepoints in the surrogate range:
import re

_surrogates = re.compile(r"[\uDC80-\uDCFF]")

def detect_decoding_errors_line(l, _s=_surrogates.finditer):
    """Return decoding errors in a line of text

    Works with text lines decoded with the surrogateescape
    error handler.

    Returns a list of (pos, byte) tuples
    """
    # DC80 - DCFF encode bad bytes 80-FF
    return [(m.start(), bytes([ord(m.group()) - 0xDC00]))
            for m in _s(l)]
E.g.
with open("test.csv", encoding="utf8", errors="surrogateescape") as f:
for i, line in enumerate(f, 1):
errors = detect_decoding_errors_line(line)
if errors:
print(f"Found errors on line {i}:")
for (col, b) in errors:
print(f" {col + 1:2d}: {b[0]:02x}")
Take into account that not all decoding errors can be recovered from gracefully. While UTF-8 is designed to be robust in the face of small errors, other multi-byte encodings such as UTF-16 and UTF-32 can't cope with dropped or extra bytes, which will then affect how accurately line separators can be located. The above approach can then result in the remainder of the file being treated as one long line, which, if the file is big enough, can in turn lead to a MemoryError exception.

How to address: Python import of file with .csv Dictreader fails on undefined character

First of all, I found the following which is basically the same as my question, but it is closed and I'm not sure I understand the reason for closing vs. the content of the post. I also don't really see a working answer.
I have 20+ input files from 4 apps. All files are exported as .csv files. The first 19 files worked (4 others exported from the same app work) and then I ran into a file that gives me this error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 5762: character maps to <undefined>
If I looked that up right, it is a <ctrl> character. The code below shows the relevant lines:
with open(file, newline='') as f:
    reader = csv.DictReader(f, dialect='excel')
    for line in reader:
I know I'm going to be getting a file. I know it will be a .csv. There may be some variance in what I get due to the manual generation/export of the source files. There may also be some strange characters in some of the files (e.g. Japanese, Russian, etc.). I provide this information because going back to the source to get a different file might just kick the can down the road until I have to pull updated data (or worse, someone else does).
So the question is probably multi-part:
1) Is there a way to tell the csv.DictReader to ignore undefined characters? (Hint for the codec: if I can't see it, it is of no value to me.)
2) If I do have "crazy" characters, what should I do? I've considered opening each input as a binary file, filtering out offending hex characters, writing the file back to disk and then opening the new file, but that seems like a lot of overhead for the program and even more for me. It's also a few JCL statements from being 1977 again.
3) How do I figure out what I'm getting as an input if it crashes while I'm reading it in?
4) I chose dialect = 'excel' because many of the inputs are Excel files that can be downloaded from one of the source applications. From the docs on DictReader, my impression is that this just defines the delimiter, quote character and EOL characters to expect/use. Therefore, I don't think this is my issue, but I'm also a Python noob, so I'm not 100% sure.
I posted the solution I went with in the comments above; it was to set the errors argument of open() to 'ignore':
with open(file, newline = '', errors='ignore') as f:
This is exactly what I was looking for in my first question in the original post above (i.e. whether there is a way to tell the csv.DictReader to ignore undefined characters).
Update: Later I did need to work with some of the Unicode characters and couldn't ignore them. The correct answer for that case, given an Excel-produced Unicode .csv file, was to use the 'utf_8_sig' codec. That codec skips the byte order mark (the UTF-8 BOM) that Excel/Windows writes at the top of the file to signal that it contains Unicode characters.
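A minimal sketch of that later approach (reusing the file variable from the question; 'utf-8-sig' is the spelling accepted by open()):
import csv

with open(file, newline='', encoding='utf-8-sig') as f:
    reader = csv.DictReader(f, dialect='excel')
    for row in reader:
        pass  # process each row dict here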

Remove all characters which cannot be decoded in Python

I try to parse an HTML file with a Python script using the xml.etree.ElementTree module. The charset should be UTF-8 according to the header, but there is a strange character in the file, so the parser can't parse it. I opened the file in Notepad++ to see the character. I tried to open it with several encodings but I can't find the correct one.
As I have many files to parse, I would like to know how to remove all bytes which can't be decoded. Is there a solution?
I would like to know how to remove all bytes which can't be decoded. Is there a solution?
This is simple:
with open('filename', 'r', encoding='utf8', errors='ignore') as f:
    ...
The errors='ignore' tells Python to drop unrecognized characters. It can also be passed to bytes.decode() and most other places which take an encoding argument.
Since this decodes the bytes into unicode, it may not be suitable for an XML parser that wants to consume bytes. In that case, you should write the data back to disk (e.g. using shutil.copyfileobj()) and then re-open in 'rb' mode.
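A rough sketch of that round trip (the file names are placeholders):
import shutil

# Decode with undecodable bytes dropped, then write clean UTF-8 back to disk
# so a bytes-oriented parser can re-open the new file in 'rb' mode.
with open('page.html', 'r', encoding='utf8', errors='ignore') as src, \
     open('page_clean.html', 'w', encoding='utf8') as dst:
    shutil.copyfileobj(src, dst)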
In Python 2, these arguments to the built-in open() don't exist, but you can use io.open() instead. Alternatively, you can decode your 8-bit strings into unicode strings after reading them, but this is more error-prone in my opinion.
But it turns out OP doesn't have invalid UTF-8. OP has valid UTF-8 which happens to include control characters. Control characters are mildly annoying to filter out since you have to run them through a function like this, meaning you can't just use copyfileobj():
import unicodedata

def strip_control_chars(data: str) -> str:
    return ''.join(c for c in data if unicodedata.category(c) != 'Cc')
Cc is the Unicode category for "Other, control character", as described on the Unicode website. To include a slightly broader array of "bad characters," we could strip the entire "other" category (which mostly contains useless stuff anyway):
def strip_control_chars(data: str) -> str:
    return ''.join(c for c in data if not unicodedata.category(c).startswith('C'))
This will filter out line breaks, so it's probably a good idea to process the file a line at a time and add the line breaks back in at the end.
In principle, we could create a codec for doing this incrementally, and then we could use copyfileobj(), but that's like using a sledgehammer to swat a fly.
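A rough line-at-a-time sketch using strip_control_chars from above (file names are placeholders):
with open('page.html', 'r', encoding='utf8', errors='ignore') as src, \
     open('page_stripped.html', 'w', encoding='utf8') as dst:
    for line in src:
        # strip control characters, then put the line break back
        dst.write(strip_control_chars(line.rstrip('\n')) + '\n')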

Remove byte order mark from objects in a list

I am using Python (3.4, on Windows 7) to download a set of text files, and when I read (and write, after modifications) these files appear to have a few byte order marks (BOM) among the values that are retained, primarily UTF-8 BOM. Eventually I use each text file as a list (or a string) and I cannot seem to remove these BOM. So I ask whether it is possible to remove the BOM?
For more context, the text files were downloaded from a public ftp source where users upload their own documents, and thus the original encoding is highly variable and unknown to me. To allow the download to run without error, I specified encoding as UTF-8 (using latin-1 would give errors). So it's not a mystery to me that I have the BOM, and I don't think an up-front encoding/decoding solution is likely to be answer for me (Convert UTF-8 with BOM to UTF-8 with no BOM in Python) - it actually appears to make the frequency of other BOM increase.
When I modify the files after download, I use the following syntax:
with open(t, "w", encoding='utf-8') as outfile:
    with open(f, "r", encoding='utf-8') as infile:
        text = infile.read()
        # Arguments to make modifications follow
Later on, after the "outfiles" are read in as a list I see that some words have the UTF-8 BOM, like \ufeff. I try to remove the BOM using the following list comprehension:
g = list_outfile #Outfiles now stored as list
g = [i.replace(r'\ufeff','') for i in g]
While this argument will run, unfortunately the BOM remain when, for example, I print the list (I believe I would have a similar issue even if I tried to remove BOM from strings and not lists: How to remove this special character?). If I put a normal word (non-BOM) in the list comprehension, that word will be replaced.
I do understand that if I print the list object by object that the BOM will not appear (Special national characters won't .split() in Python). And the BOM is not in the raw text files. But I worry that those BOM will remain when running later arguments for text analysis and thus any object that appears in the list as \ufeffword rather than word will be analyzed as \ufeffword.
Again, is it possible to remove the BOM after the fact?
The problem is that you are replacing specific bytes, while the representation of your byte order mark might be different, depending on the encoding of your file.
Actually checking for the presence of a BOM is pretty straightforward with the codecs library. Codecs has the specific byte order marks for different UTF encodings. Also, you can get the encoding automatically from an opened file, no need to specify it.
Suppose you are reading a csv file with utf-8 encoding, which may or may not use a byte order mark. Then you could go about like this:
import codecs

with open("testfile.csv", "r") as csvfile:
    line = csvfile.readline()
    if codecs.BOM_UTF8.decode(csvfile.encoding) in line:
        # A Byte Order Mark is present
        line = line.strip(codecs.BOM_UTF8.decode(csvfile.encoding))
    print(line)
In the output resulting from the code above you will see the output without byte order mark. To further improve on this, you could also restrict this check to be only done on the first line of a file (because that is where the byte order mark always resides, it is the first few bytes of the file).
Using strip instead of replace won't replace anything and won't actually do anything if the indicated byte order mark is not present. So you may even skip the manual check for byte-order-mark altogether and just run the strip method on the entire contents of the file:
import codecs

with open("testfile.csv", "r") as csvfile:
    with open("outfile.csv", "w") as outfile:
        outfile.write(csvfile.read().strip(codecs.BOM_UTF8.decode(csvfile.encoding)))
Voila, you end up with 'outfile.csv' containing the exact contents of the original (testfile.csv) without the Byte Order Mark.

utf-16 file seeking in python. how?

For some reason I cannot seek in my UTF-16 file. It produces 'UnicodeException: UTF-16 stream does not start with BOM'. My code:
f = codecs.open(ai_file, 'r', 'utf-16')
seek = self.ai_map[self._cbClass.Text]  # seek is a valid int
f.seek(seek)
while True:
    ln = f.readline().strip()
I tried random stuff like reading something from the stream first, but it didn't help. I checked the offset being seeked to with a hex editor - the string starts at a character, not a null byte (I guess that's a good sign, right?)
So how do I seek a UTF-16 file in Python?
Well, the error message is telling you why: it's not reading a byte order mark. The byte order mark is at the beginning of the file. Without having read the byte order mark, the UTF-16 decoder can't know what order the bytes are in. Apparently it does this lazily, the first time you read, instead of when you open the file -- or else it is assuming that the seek() is starting a new UTF-16 stream.
If your file doesn't have a BOM, that's definitely the problem and you should specify the byte order when opening the file (see #2 below). Otherwise, I see two potential solutions:
1) Read the first two bytes of the file to get the BOM before you seek. You seem to say this didn't work, indicating that perhaps it's expecting a fresh UTF-16 stream after the seek, so:
2) Specify the byte order explicitly by using utf-16-le or utf-16-be as the encoding when you open the file, as in the sketch below.
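A minimal sketch of option 2, assuming a little-endian file and that the stored offset is a byte offset aligned to a 2-byte UTF-16 code unit:
import codecs

# 'utf-16-le' fixes the byte order up front, so the decoder never looks for a BOM
# and seeking into the middle of the file is allowed.
f = codecs.open(ai_file, 'r', 'utf-16-le')
f.seek(seek)                     # byte offset; must land on a code-unit boundary
ln = f.readline().strip()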
