I have a dictionary stored in a UTF-8 file, and I read a word from the command line and search for it among the dictionary keys. But my file contains Turkish and Arabic characters.
word = 'şüyûh'
mydictionary[word]
My program gives me a KeyError for the word 'şüyûh'. How can I fix it?
Handle everything as unicode.
See the talk "Unicode in Python, Completely Demystified".
If you're reading from a file, you'll need to tell Python how to decode the bytes in the file (files can only contain bytes) into the characters as you understand them.
The most basic way of doing so is to open the file using codecs.open instead of the built-in open function. When you pull data out of the file in this way, it will already be decoded:
import codecs
with codecs.open("something.txt", encoding="utf-8") as myfile:
    # do something with the file.
Note that you must tell python what encoding the file is in.
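Applied to the original question, a minimal sketch (assuming Python 2, a hypothetical dict.txt laid out as one "word,meaning" pair per line, and a terminal that reports its encoding) could look like this; the key point is that the command-line argument is decoded so that the lookup key and the dictionary keys are both unicode:
# -*- coding: utf-8 -*-
import sys
import codecs

# Build the dictionary from the UTF-8 file; codecs.open yields unicode lines.
mydictionary = {}
with codecs.open("dict.txt", encoding="utf-8") as myfile:  # hypothetical name/layout
    for line in myfile:
        word, _, meaning = line.strip().partition(",")
        mydictionary[word] = meaning

# In Python 2, sys.argv holds bytes; decode them with the terminal's encoding.
query = sys.argv[1].decode(sys.stdin.encoding or "utf-8")
print mydictionary[query]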
When I read a file in Python and print it to the screen, it does not read certain characters properly; however, those same characters hard-coded into a variable print just fine. Here is an example where "test.html" contains the text "Hallå":
with open('test.html','r') as file:
    Str = file.read()
    print(Str)
Str = "Hallå"
print(Str)
This generates the following output:
hallå
Hallå
My guess is that there is something wrong with how the data in the file is being interpreted when it is read into Python; however, I am uncertain what it is, since Python 3.8.5 already uses UTF-8 encoding by default.
The built-in open function does not use UTF-8 by default. As the documentation says:
In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.
So, it depends, and to be certain, you have to specify the encoding yourself. If the file is saved in UTF-8, you should do this:
with open('test.html', 'r', encoding='utf-8') as file:
On the other hand, it is not clear whether the file is or is not saved in UTF-8 encoding. If it is not, you'll have to choose a different one.
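To see which encoding open() falls back to on a given machine, you can check the locale default yourself (a quick illustration, not part of the original answer):
import locale
print(locale.getpreferredencoding(False))  # e.g. 'cp1252' on many Windows systems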
I have a .txt file which should contain German umlauts like ä, ö, ß, ü. But these characters don't appear as such; instead, Ã¤ appears instead of ä, Ãœ instead of Ü, and so on. It happens because the .txt file is stored in ANSI encoding. Now, when I import this file, with the respective columns as strings, into either SAS (DATA step) or Python (with .read_csv), these strange characters appear in the .sas7bdat dataset and the Python DataFrame as such, instead of the proper characters like ä, ö, ü, ß.
One workaround to solve this issue is:
Open the file in standard Notepad.
Press 'Save As' and then a window appears.
Then in the drop down, change encoding to UTF-8.
Now, when you import the files, in SAS or Python, then everything is imported correctly.
But, sometimes the .txt files that I have are very big (in GBs), so I cannot open them and do this hack to solve this issue.
I could use the .replace() function to replace these strange characters with the real ones, but there could be some combinations of strange characters that I am not aware of, which is why I wish to avoid that.
Is there any Python library which can automatically translate these strange characters into their proper characters - like Ã¤ getting translated to ä, and so on?
Did you try to use the codecs library?
import codecs
your_file = codecs.open('your_file.extension', 'r', 'encoding_type')
If the file contains the correct code points, you just have to specify the correct encoding. Python 3 will default to UTF-8 on most sane platforms, but if you need your code to also run on Windows, you probably want to spell out the encoding.
with open(filename, 'r', encoding='utf-8') as f:
    # do things with f
If the file actually contains mojibake, there is no simple way in the general case to revert every possible way to screw up text, but a common mistake is to assume the text was in Latin-1 and convert it to UTF-8 when in fact the input was already UTF-8. What you can do then is say you want Latin-1, and probably make sure you save the result in the correct format as soon as you have read it:
with open(filename, 'r', encoding='latin-1') as inp, \
        open('newfile', 'w', encoding='utf-8') as outp:
    for line in inp:
        outp.write(line)
The ftfy library claims to be able to identify and correct a number of common mojibake problems.
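A minimal sketch of that approach, assuming the ftfy package is installed (pip install ftfy); its fix_text function tries to detect and undo common mojibake:
import ftfy

print(ftfy.fix_text("Ã¤, Ã¶, ÃŸ"))  # ideally prints: ä, ö, ß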
So, I would like to make a program that does 2 things:
Reads a word
Reads the translation in Greek
Then I build a new line that looks like this: "word,translation" and write it into a file.
So the test.txt file should contain "Hello,Γεια", and in case I read again, the next line should go under this one.
word=raw_input("Word:\n") #The Word
translation=raw_input("Translation:\n").decode("utf-8") #The Translation in UTF-8
format=word+","+translation+"\n"
file=open("dict.txt","w")
file.write(format.encode("utf-8"))
file.close()
The Error I get:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x82 in position 0: invalid start byte
EDIT: This is Python 2.
Although Python 2 supports unicode, its input is not automatically decoded into unicode for you. raw_input returns a byte string, and if something other than ASCII is piped in, you get the encoded bytes. The trick is to figure out what that encoding is, and that depends on whatever is pumping data into the program. If it's a terminal, then sys.stdin.encoding should tell you what encoding to use. If it's piped in from, say, a file, then sys.stdin.encoding is None and you just kinda have to know what it is.
A solution to your problem follows. Note that even though your method of writing the file (encode then write) works, the codecs module provides a file object that does the encoding for you.
import sys
import codecs
# just randomly picking an encoding.... a command line param may be
# useful if you want to get input from files
_stdin_encoding = sys.stdin.encoding or 'utf-8'
def unicode_input(prompt):
    return raw_input(prompt).decode(_stdin_encoding)
word=unicode_input("Word:\n") #The Word
translation=unicode_input("Translation:\n")
format=word+","+translation+"\n"
with codecs.open("dict.txt","w") as myfile:
myfile.write(format)
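Since the question mentions that each new entry should go on the next line, note that mode "w" overwrites dict.txt on every run; a small variation (my suggestion, not part of the answer above) is to open in append mode instead:
# Append mode keeps previous entries and adds the new "word,translation" line.
with codecs.open("dict.txt", "a", encoding="utf-8") as myfile:
    myfile.write(format)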
I would like to know how I can write, on the fly, the lines which are UTF-8 encoded to another file. I have a folder containing a number of files, and I cannot go and check each and every file for UTF-8 characters by hand.
I have tried this code:
import codecs
try:
    f = codecs.open(filename, encoding='utf-8', errors='strict')
    for line in f:
        pass
    print "Valid utf-8"
except UnicodeDecodeError:
    print "invalid utf-8"
This checks whether the whole file is valid UTF-8 or not. But I am trying to check each and every line of the files in a folder and write out those lines which are UTF-8 encoded.
I would like to delete the lines in my file which are not UTF-8 encoded. While reading a line, if the program finds that the line is UTF-8, it should move on to the next line; otherwise it should delete the line which is not UTF-8. I think it is clear now.
I would like to know how I can do it with the help of Python. Kindly let me know.
I am not looking to convert the lines, but to delete them, or to write the UTF-8 satisfying lines from the files to another file.
This article on how to process text files in Python 3 will be of help.
Basically, if you use:
open(fname, encoding="utf-8", errors="strict")
it will raise an exception if the file is not UTF-8 encoded, but you can change the errors handling parameter to read the file anyway and apply your own algorithm to exclude lines.
For example:
open(fname, encoding="utf-8", errors="replace")
will replace the bytes that are not valid UTF-8 with the U+FFFD replacement character (�).
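A quick illustration of that behaviour at the byte level (my example, not from the answer above):
# 0x80 is not a valid UTF-8 start byte; "replace" substitutes U+FFFD for it.
print(b"\x80abc".decode("utf-8", errors="replace"))  # prints: �abc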
As @Leon says, you need to consider that Chinese and/or Arabic characters can be perfectly valid UTF-8.
If you want a more restrictive character set, you can try to open your file using a latin-1 or an ascii encoding (taking into account that UTF-8 and latin-1 are ASCII compatible).
You need to take into account that there are many character encoding types, and they may not be ASCII compatible. It is very difficult to read text files properly if you don't know their encoding type; the chardet module can help with that, but it is not 100% reliable.
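A minimal sketch of the line-by-line filtering the question asks for (my sketch, with hypothetical file names): read each file in binary mode and copy only the lines that decode cleanly as UTF-8:
# Keep only the lines that decode cleanly as UTF-8; skip the rest.
with open("input.txt", "rb") as src, open("utf8_only.txt", "wb") as dst:
    for raw_line in src:
        try:
            raw_line.decode("utf-8")
        except UnicodeDecodeError:
            continue  # not valid UTF-8, drop this line
        dst.write(raw_line)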
I am using Python (3.4, on Windows 7) to download a set of text files, and when I read (and write, after modifications) these files, they appear to have a few byte order marks (BOM) among the values that are retained, primarily the UTF-8 BOM. Eventually I use each text file as a list (or a string), and I cannot seem to remove these BOMs. So I ask whether it is possible to remove the BOM?
For more context, the text files were downloaded from a public ftp source where users upload their own documents, and thus the original encoding is highly variable and unknown to me. To allow the download to run without error, I specified encoding as UTF-8 (using latin-1 would give errors). So it's not a mystery to me that I have the BOM, and I don't think an up-front encoding/decoding solution is likely to be answer for me (Convert UTF-8 with BOM to UTF-8 with no BOM in Python) - it actually appears to make the frequency of other BOM increase.
When I modify the files after download, I use the following syntax:
with open(t, "w", encoding='utf-8') as outfile:
    with open(f, "r", encoding='utf-8') as infile:
        text = infile.read()
        # Arguments to make modifications follow
Later on, after the "outfiles" are read in as a list I see that some words have the UTF-8 BOM, like \ufeff. I try to remove the BOM using the following list comprehension:
g = list_outfile #Outfiles now stored as list
g = [i.replace(r'\ufeff','') for i in g]
While this statement will run, unfortunately the BOMs remain when, for example, I print the list (I believe I would have a similar issue even if I tried to remove the BOM from strings and not lists: How to remove this special character?). If I put a normal word (non-BOM) in the list comprehension, that word will be replaced.
I do understand that if I print the list object by object that the BOM will not appear (Special national characters won't .split() in Python). And the BOM is not in the raw text files. But I worry that those BOM will remain when running later arguments for text analysis and thus any object that appears in the list as \ufeffword rather than word will be analyzed as \ufeffword.
Again, is it possible to remove the BOM after the fact?
The problem is that you are replacing specific bytes, while the representation of your byte order mark might be different, depending on the encoding of your file.
Actually, checking for the presence of a BOM is pretty straightforward with the codecs library. The codecs module has constants for the specific byte order marks of the different UTF encodings. Also, you can get the encoding automatically from an opened file, no need to specify it.
Suppose you are reading a csv file with utf-8 encoding, which may or may not use a byte order mark. Then you could go about like this:
import codecs
with open("testfile.csv", "r") as csvfile:
line = csvfile.readline()
if line.__contains__(codecs.BOM_UTF8.decode(csvfile.encoding)):
# A Byte Order Mark is present
line = line.strip(codecs.BOM_UTF8.decode(csvfile.encoding))
print(line)
In the output resulting from the code above, you will see the output without a byte order mark. To further improve on this, you could restrict this check to the first line of a file only (because that is where the byte order mark always resides: it is the first few bytes of the file).
Using strip instead of replace only removes the mark from the start (or end) of the string, and it won't actually do anything if the indicated byte order mark is not present. So you may even skip the manual check for the byte order mark altogether and just run the strip method on the entire contents of the file:
import codecs
with open("testfile.csv", "r") as csvfile:
with open("outfile.csv", "w") as outfile:
outfile.write(csvfile.read().strip(codecs.BOM_UTF8.decode(csvfile.encoding)))
Voila, you end up with 'outfile.csv' containing the exact contents of the original (testfile.csv) without the Byte Order Mark.
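For completeness, a common alternative (my addition, not something the answer above uses) is to open the file with the utf-8-sig codec, which strips a leading UTF-8 BOM automatically when reading:
# "utf-8-sig" transparently skips a UTF-8 BOM at the start of the file, if any.
with open("testfile.csv", "r", encoding="utf-8-sig") as csvfile, \
        open("outfile.csv", "w", encoding="utf-8") as outfile:
    outfile.write(csvfile.read())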