Remove byte order mark from objects in a list - python

I am using Python (3.4, on Windows 7) to download a set of text files, and when I read (and write, after modifications) these files appear to have a few byte order marks (BOM) among the values that are retained, primarily UTF-8 BOM. Eventually I use each text file as a list (or a string) and I cannot seem to remove these BOM. So I ask whether it is possible to remove the BOM?
For more context, the text files were downloaded from a public ftp source where users upload their own documents, and thus the original encoding is highly variable and unknown to me. To allow the download to run without error, I specified the encoding as UTF-8 (using latin-1 would give errors). So it's not a mystery to me that I have the BOM, and I don't think an up-front encoding/decoding solution is likely to be the answer for me (Convert UTF-8 with BOM to UTF-8 with no BOM in Python) - it actually appears to make the frequency of other BOM increase.
When I modify the files after download, I use the following syntax:
with open(t, "w", encoding='utf-8') as outfile:
    with open(f, "r", encoding='utf-8') as infile:
        text = infile.read()
        # Arguments to make modifications follow
Later on, after the "outfiles" are read in as a list I see that some words have the UTF-8 BOM, like \ufeff. I try to remove the BOM using the following list comprehension:
g = list_outfile #Outfiles now stored as list
g = [i.replace(r'\ufeff','') for i in g]
While this argument will run, unfortunately the BOM remain when, for example, I print the list (I believe I would have a similar issue even if I tried to remove BOM from strings and not lists: How to remove this special character?). If I put a normal word (non-BOM) in the list comprehension, that word will be replaced.
I do understand that if I print the list object by object that the BOM will not appear (Special national characters won't .split() in Python). And the BOM is not in the raw text files. But I worry that those BOM will remain when running later arguments for text analysis and thus any object that appears in the list as \ufeffword rather than word will be analyzed as \ufeffword.
Again, is it possible to remove the BOM after the fact?

The problem is that you are replacing specific bytes, while the representation of your byte order mark might be different, depending on the encoding of your file.
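Checking for the presence of a BOM is pretty straightforward with the codecs library, which defines the byte order marks for the different UTF encodings as constants. Also, an opened file object reports the encoding it was opened with in its encoding attribute (the platform default if you did not specify one), so you can decode the BOM constant the same way the file contents were decoded.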
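As a rough sketch of how those constants can be used, the snippet below checks the first bytes of some raw data against the BOM constants in codecs. The function name sniff_bom and the returned codec names are illustrative choices, not something from the question; the UTF-32 marks are checked first because the little-endian UTF-32 BOM begins with the same two bytes as the UTF-16 one.

```python
import codecs

# (BOM bytes, codec name that consumes that BOM when decoding)
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(raw):
    """Return a codec name if raw starts with a known BOM, else None."""
    for bom, codec_name in BOMS:
        if raw.startswith(bom):
            return codec_name
    return None


print(sniff_bom(codecs.BOM_UTF8 + b"abc"))
print(sniff_bom(b"abc"))
```

This only tells you that a BOM is present; a file without a BOM can still be in any encoding.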
Suppose you are reading a csv file with utf-8 encoding, which may or may not use a byte order mark. Then you could go about like this:
import codecs

with open("testfile.csv", "r") as csvfile:
    line = csvfile.readline()
    # Decode the BOM bytes the same way the file itself was decoded
    bom = codecs.BOM_UTF8.decode(csvfile.encoding)
    if bom in line:
        # A Byte Order Mark is present
        line = line.strip(bom)
    print(line)
The code above prints the first line without the byte order mark. The check only needs to be done on the first line of a file, because that is where the byte order mark resides: it is the first few bytes of the file.
The strip method does nothing if the indicated byte order mark is not present. So you may even skip the manual check for the byte order mark altogether and just run the strip method on the entire contents of the file:
import codecs

with open("testfile.csv", "r") as csvfile:
    with open("outfile.csv", "w") as outfile:
        outfile.write(csvfile.read().strip(codecs.BOM_UTF8.decode(csvfile.encoding)))
Voila, you end up with 'outfile.csv' containing the exact contents of the original (testfile.csv) without the Byte Order Mark.
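For the list from the question, where the text has already been decoded to str, a minimal sketch might look like the following. The sample list is made up for illustration. Note that r'\ufeff' is a raw string, so it is the six literal characters backslash, u, f, e, f, f rather than the single character U+FEFF, which is why the original list comprehension runs but changes nothing.

```python
# Hypothetical sample data standing in for the list read from the outfiles
words = ["\ufeffword", "plain", "mid\ufeffdle"]

# Raw string: searches for six literal characters, so nothing is replaced
raw_attempt = [w.replace(r"\ufeff", "") for w in words]

# Normal string: '\ufeff' is the single BOM character, so it is removed
cleaned = [w.replace("\ufeff", "") for w in words]

print(len(r"\ufeff"), len("\ufeff"))
print(cleaned)
```

If the BOM only ever occurs at the very start of a file, opening it with encoding='utf-8-sig' should also drop it while reading.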

Related

How to merge multiple CSV files with different languages into one CSV file?

I have a lot of CSV files and I want to merge them into one CSV file. The thing is that the CSV files contain data in different languages like Russian, English, Croatian, Spanish, etc. Some of the CSV files even have their data written in multiple languages.
When I open the CSV files, the data looks perfectly fine, written properly in their languages and I want to read all the CSV files in their language, and write them to one big CSV file as they are.
The code I use is this:
import glob
import os

import pandas as pd

directory_path = os.getcwd()
all_files = glob.glob(os.path.join(directory_path, "DR_BigData_*.csv"))
print(all_files)
merge_file = 'data_5.csv'
df_from_each_file = (pd.read_csv(f, encoding='latin1') for f in all_files)
df_merged = pd.concat(df_from_each_file, ignore_index=True)
df_merged.to_csv(merge_file, index=False)
If I use "encoding='latin1'", it successfully writes all the CSV files into one but as you might guess, the characters are so messed up.
Here is a part of the output as an example:
I also tried to write them into .xlsx using encoding='latin1', but I still encountered the same issue. In addition to these, I tried many different encodings, but those gave me decoding errors.
When you force the input encoding to Latin-1, you are basically wrecking any input files which are not actually Latin-1. For example, a Russian text file containing the text привет in code page 1251 will silently be translated to ïðèâåò. (The same text in the UTF-8 encoding would map to the similarly bogus but completely different string пÑивеÑ.)
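To see this for yourself, here is a small sketch that reproduces the damage in memory. It encodes the sample word with each encoding and then decodes the bytes as Latin-1, which never fails because every byte value maps to some character.

```python
text = "привет"

# Bytes written as cp1251, then wrongly read as Latin-1
from_cp1251 = text.encode("cp1251").decode("latin-1")

# Bytes written as UTF-8, then wrongly read as Latin-1
from_utf8 = text.encode("utf-8").decode("latin-1")

print(from_cp1251)
print(len(text), len(from_cp1251), len(from_utf8))

# Latin-1 decoding loses no bytes, so the damage is reversible
# as long as you know the real encoding
recovered = from_cp1251.encode("latin-1").decode("cp1251")
```

The UTF-8 variant comes out twice as long because each Cyrillic letter occupies two bytes in UTF-8, and some of the resulting characters are invisible control characters.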
The sustainable solution is to, first, correctly identify the input encoding of each file, and then, second, choose an output encoding which can accommodate all of the input encodings correctly.
I would choose UTF-8 for output, but any Unicode variant will technically work. If you need to pass the result to something more or less braindead (cough Microsoft cough Java) maybe UTF-16 will be more convenient for your use case.
import glob

import pandas as pd

data = dict()
for file in glob.glob("DR_BigData_*.csv"):
    if 'ru' in file:
        enc = 'cp1251'
    elif 'it' in file:
        enc = 'latin-1'
    # ... add more here
    else:
        raise KeyError("I don't know the encoding for %s" % file)
    data[file] = pd.read_csv(file, encoding=enc)
# ... merge data[] as previously
The if statement is really just a placeholder for something more useful; without access to your files, I have no idea how your files are named, or which encodings to use for which ones. This simplistically assumes that files in Russian would all have the substring "ru" in their names, and that you want to use a specific encoding for all of those.
If you only have two encodings, and one of them is UTF-8, this is actually quite easy; try to decode as UTF-8, then if that doesn't work, fall back to the other encoding:
for file in glob.glob("DR_BigData_*.csv"):
    try:
        data[file] = pd.read_csv(file, encoding='utf-8')
    except UnicodeDecodeError:
        data[file] = pd.read_csv(file, encoding='latin-1')
This is likely to work simply because text which is not valid UTF-8 will typically raise a UnicodeDecodeError very quickly. The encoding is designed so that bytes with the 8th bit set have to adhere to a very specific pattern. This is a useful feature, not something you should feel frustrated about. Not getting the correct data from the file is much worse.
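The same idea can be sketched without pandas, working on raw bytes. The helper name decode_with_fallback and its fallback parameter are made up for this illustration; the point is only that valid UTF-8 decodes cleanly, while typical legacy 8-bit text fails fast and drops through to the second encoding.

```python
def decode_with_fallback(raw, fallback="latin-1"):
    """Try UTF-8 first; if the bytes are not valid UTF-8, use the fallback."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)


# Valid UTF-8 is decoded as UTF-8
print(decode_with_fallback("привет".encode("utf-8")))

# b'caf\xe9' is not valid UTF-8, so the fallback is used
print(decode_with_fallback("café".encode("latin-1")))
```

Keep in mind that the fallback itself never fails with Latin-1, so it only gives correct text when the file really is in that encoding.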
If you don't know what encodings are, now would be a good time to finally read Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
As an aside, your computer already knows which directory it's in; you basically never need to call os.getcwd() unless you need to find out the absolute path of the current directory.
If I understood your question correctly, you can easily merge all your csv files (as they are) using cat command:
cat file1.csv file2.csv file3.csv ... > Merged.csv

UTF-8 issue - Is there a way to convert strange looking characters like Ã¤ to the proper German character ä in Python?

I have a .txt file, which should contain German Umlauts like ä,ö,ß,ü. But these characters don't appear as such; instead what appears is Ã¤ instead of ä, Ãœ instead of Ü and so on. It happens because the .txt file is stored in ANSI encoding. Now, when I import this file, with respective columns as Strings, in either SAS (DataStep) or Python (with .read_csv), then these strange characters appear in the .sas7bdat and the Python DataFrame as such, instead of proper characters like ä,ö,ü,ß.
One work around to solve this issue is -
Open the file in standard Notepad.
Press 'Save As' and then a window appears.
Then in the drop down, change encoding to UTF-8.
Now, when you import the files, in SAS or Python, then everything is imported correctly.
But, sometimes the .txt files that I have are very big (in GBs), so I cannot open them and do this hack to solve this issue.
I could use .replace() function, to replace these strange characters with the real ones, but there could be some combinations of strange characters that I am not aware of, that's why I wish to avoid that.
Is there any Python library which can automatically translate these strange characters into their proper characters - like Ã¤ gets translated to ä and so on?
Did you try to use the codecs library?
import codecs
# 'r' to read the existing file; 'w' would truncate it
your_file = codecs.open('your_file.extension', 'r', 'encoding_type')
If the file contains the correct code points, you just have to specify the correct encoding. Python 3 will default to UTF-8 on most sane platforms, but if you need your code to also run on Windows, you probably want to spell out the encoding.
with open(filename, 'r', encoding='utf-8') as f:
    # do things with f
If the file actually contains mojibake, there is no simple way in the general case to revert every possible way to screw up text, but a common mistake is assuming the text was in Latin-1 and converting it to UTF-8 when in fact the input was already UTF-8. What you can do then is say you want Latin-1, and make sure you save it in the correct format as soon as you have read it.
with open(filename, 'r', encoding='latin-1') as inp, \
        open('newfile', 'w', encoding='utf-8') as outp:
    for line in inp:
        outp.write(line)
The ftfy library claims to be able to identify and correct a number of common mojibake problems.
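For the single most common case, UTF-8 bytes that were decoded as Windows code page 1252, a hand-rolled sketch could look like this. The function name is hypothetical, and it assumes exactly one round of mis-decoding; anything messier is where a library like ftfy would come in.

```python
def undo_utf8_as_cp1252(garbled):
    """Reverse one round of 'UTF-8 bytes decoded as cp1252'.

    Returns the input unchanged if it does not fit that pattern.
    """
    try:
        return garbled.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return garbled


print(undo_utf8_as_cp1252("Ã¤"))
print(undo_utf8_as_cp1252("Ãœ"))
```

Text that is already correct, such as a plain "ä", fails the round trip and is returned as-is, so the function can be applied line by line to a large file without loading it all at once.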

Remove all characters which cannot be decoded in Python

I try to parse a html file with a Python script using the xml.etree.ElementTree module. The charset should be UTF-8 according to the header. But there is a strange character in the file. Therefore, the parser can't parse it. I opened the file in Notepad++ to see the character . I tried to open it with several encodings but I don't find the correct one.
As I have many files to parse, I would like to know how to remove all bytes which can't be decoded. Is there a solution?
I would like to know how to remove all bytes which can't be decoded. Is there a solution?
This is simple:
with open('filename', 'r', encoding='utf8', errors='ignore') as f:
    ...
The errors='ignore' tells Python to silently drop any bytes that cannot be decoded. It can also be passed to bytes.decode() and most other places which take an encoding argument.
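A quick way to see the effect is to decode a small byte string that is deliberately not valid UTF-8. The sample bytes here are made up; errors='replace' is shown next to it for comparison, since it leaves a visible marker instead of dropping the byte.

```python
# b'\xe9' is 'é' in Latin-1 but an incomplete sequence in UTF-8
raw = b"caf\xe9 ok"

ignored = raw.decode("utf-8", errors="ignore")    # bad byte dropped
replaced = raw.decode("utf-8", errors="replace")  # bad byte -> U+FFFD

print(repr(ignored))
print(repr(replaced))
```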
Since this decodes the bytes into unicode, it may not be suitable for an XML parser that wants to consume bytes. In that case, you should write the data back to disk (e.g. using shutil.copyfileobj()) and then re-open in 'rb' mode.
In Python 2, these arguments to the built-in open() don't exist, but you can use io.open() instead. Alternatively, you can decode your 8-bit strings into unicode strings after reading them, but this is more error-prone in my opinion.
But it turns out OP doesn't have invalid UTF-8. OP has valid UTF-8 which happens to include control characters. Control characters are mildly annoying to filter out since you have to run them through a function like this, meaning you can't just use copyfileobj():
import unicodedata

def strip_control_chars(data: str) -> str:
    return ''.join(c for c in data if unicodedata.category(c) != 'Cc')
Cc is the Unicode category for "Other, control", as described on the Unicode website. To include a slightly broader array of "bad characters," we could strip the entire "other" category (which mostly contains useless stuff anyway):
def strip_control_chars(data: str) -> str:
    return ''.join(c for c in data if not unicodedata.category(c).startswith('C'))
This will filter out line breaks, so it's probably a good idea to process the file a line at a time and add the line breaks back in at the end.
In principle, we could create a codec for doing this incrementally, and then we could use copyfileobj(), but that's like using a sledgehammer to swat a fly.
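One possible shape for the line-at-a-time processing mentioned above is sketched here. The generator name clean_lines is an invention for this example; it sets the line ending aside, filters the rest of the line, and puts the ending back.

```python
import unicodedata


def strip_control_chars(data: str) -> str:
    # Drop every character whose Unicode category starts with 'C'
    return "".join(c for c in data if not unicodedata.category(c).startswith("C"))


def clean_lines(lines):
    """Yield each line with control characters removed but its line ending kept."""
    for line in lines:
        body = line.rstrip("\r\n")
        ending = line[len(body):]
        yield strip_control_chars(body) + ending


sample = ["a\x04b\n", "c\x1ad\r\n", "e"]
print(list(clean_lines(sample)))
```

Since an open text file is an iterable of lines, something like outfile.writelines(clean_lines(infile)) should work for files too large to hold in memory.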

python utf-8-sig BOM in the middle of the file when appending to the end

I've noticed recently that Python behaves in a non-obvious way when appending to a file using the utf-8-sig encoding. See below:
>>> import codecs, os
>>> os.path.isfile('123')
False
>>> codecs.open('123', 'a', encoding='utf-8-sig').write('123\n')
>>> codecs.open('123', 'a', encoding='utf-8-sig').write('123\n')
The following text ends up in the file:
<BOM>123
<BOM>123
Isn't that a bug? This is so not logical.
Could anyone explain to me why it was done so?
Why didn't they manage to prepend BOM only when file doesn't exist and needs to be created?
No, it's not a bug; that's perfectly normal, expected behavior. The codec cannot detect how much was already written to a file; you could use it to append to a pre-created but empty file for example. The file would not be new, but it would not contain a BOM either.
Then there are other use-cases where the codec is used on a stream or bytestring (e.g. not with codecs.open()) where there is no file at all to test, or where the developer wants to enforce a BOM at the start of the output, always.
Only use utf-8-sig on a new file; the codec will always write the BOM out whenever you use it.
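The effect can be reproduced in memory without touching a file. This is only a sketch of the byte-level result: each independent encode call stands in for a freshly created codec instance, as in the two codecs.open() calls from the question.

```python
import codecs

# Each fresh use of the codec emits its own BOM
chunk = "123\n".encode("utf-8-sig")
data = chunk + chunk  # what two separate appends leave behind

print(data)

# Decoding with utf-8-sig only strips a BOM at the very start
text = data.decode("utf-8-sig")
print(repr(text))

# Any later BOM survives as U+FEFF and has to be removed explicitly
cleaned = text.replace("\ufeff", "")
```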
If you are working directly with files, you can test for the start yourself; use utf-8 instead and write the BOM manually, which is just an encoded U+FEFF ZERO WIDTH NO-BREAK SPACE:
import io

with io.open(filename, 'a', encoding='utf8') as outfh:
    if outfh.tell() == 0:
        # start of file
        outfh.write(u'\ufeff')
I used the newer io.open() instead of codecs.open(); io is the new I/O framework developed for Python 3, and is more robust than codecs for handling encoded files, in my experience.
Note that the UTF-8 BOM is next to useless, really. UTF-8 has no variable byte order, so there is only one Byte Order Mark. UTF-16 or UTF-32, on the other hand, can be written with one of two distinct byte orders, which is why a BOM is needed.
The UTF-8 BOM is mostly used by Microsoft products to auto-detect the encoding of a file (e.g. not one of the legacy code pages).

Removing unknown characters from a text file

I have a large number of files containing data I am trying to process using a Python script.
The files are in an unknown encoding, and if I open them in Notepad++ they contain numerical data separated by a load of 'null' characters (represented as NULL in white on black background in Notepad++).
In order to handle this, I separate the file by the null character \x00 and retrieve only numerical values using the following script:
stripped_data=[]
for root,dirs,files in os.walk(PATH):
    for rawfile in files:
        (dirName, fileName)= os.path.split(rawfile)
        (fileBaseName, fileExtension)=os.path.splitext(fileName)
        h=open(os.path.join(root, rawfile),'r')
        line=h.read()
        for raw_value in line.split('\x00'):
            try:
                test=float(raw_value)
                stripped_data.append(raw_value.strip())
            except ValueError:
                pass
However, there are sometimes other unrecognised characters in the file (as far as I have found, always at the very beginning) - these show up in Notepad++ as 'EOT', 'SUB' and 'ETX'. They seem to interfere with the processing of the file in Python - the file appears to end at those characters, even though there is clearly more data visible in Notepad++.
How can I remove all non-ASCII characters from these files prior to processing?
You are opening the file in text mode. That means that the first Ctrl-Z character is considered as an end-of-file character. Specify 'rb' instead of 'r' in open().
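A minimal sketch of the binary-mode approach, assuming the numbers themselves are plain ASCII text between the null bytes as described in the question. The function name extract_numbers and the sample bytes are made up for illustration.

```python
def extract_numbers(raw):
    """Split raw bytes on NUL and keep the chunks that parse as numbers."""
    values = []
    for chunk in raw.split(b"\x00"):
        # Drop any non-ASCII bytes, then trim surrounding whitespace
        token = chunk.decode("ascii", errors="ignore").strip()
        try:
            float(token)
        except ValueError:
            continue  # not a number (control characters, empty chunk, text)
        values.append(token)
    return values


# EOT, SUB (Ctrl-Z) and ETX at the start no longer end the read
sample = b"\x04\x1a\x03\x001.5\x00\x0042\x00abc\x00"
print(extract_numbers(sample))
```

With a real file, the bytes would come from something like open(path, 'rb').read() in place of the sample.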
I don't know if this will work for sure, but you could try using the I/O functions in the codecs module:
import codecs
inFile = codecs.open(<SAME ARGS AS 'OPEN'>, 'utf-8')
for line in inFile:
    do_stuff()
You can treat inFile just like a normal file object.
This may or may not help you, but it probably will.
[EDIT]
Basically you'll replace: h=open(os.path.join(root, rawfile),'r') with h=codecs.open(os.path.join(root, rawfile),'r','utf-8')
The file.read() function will read until EOF.
Since you said it stops too early, you want to continue reading the file even when hitting an EOF.
Make sure to stop when you have read the entire file. You can do this by checking the position in the file via file.tell() when hitting an EOF and stopping when you hit the file size (read the file size prior to reading).
As this is rather complex you may want to use file.next and iterate over bytes.
To remove non-ASCII characters you can either use a white-list of specific characters or check each byte you read against a range you define.
E.g. is the byte between 0x30 and 0x39 (a digit) -> keep it / save it somewhere / add it to a string.
See an ASCII table.
