Read an Arabic File in Python - python

I'm working with an Arabic text file which is a corpus.
What should I do to be able to import the file in python so I can easily access the file and be able to analyze it instead of copying and pasting the content in the interpreter every time. It's an Arabic file, not English.

The most important thing when reading and writing plain text is to know and specify the plain text encoding. You shouldn't let Python guesses the encoding for you, especially in real world program (The encoding should be either configurable or you ask the user for the encoding).
Many people don't have an issue with English text because ASCII is a subset of most encodings. The issue is there and they will run into it as soon as the program tries to read or write texts in different encodings.
Most Arabic texts are encoded in (ordered by popularity1) Windows-1256, UTF-8, CP720, or ISO 8859-6. You should know ahead of time what encoding your plain text is using, for example when most text editors allow you to select the encoding when you save the file.
I have three files with your name طارق but in 3 different encodings. Reading the files as raw binary data show you how different these files are, though it's the the same text:
>>> f = open('file-utf8.txt', 'rb')
>>> f.read()
b'\xd8\xb7\xd8\xa7\xd8\xb1\xd9\x82'
>>>
>>> f = open('file-cp720.txt', 'rb')
>>> f.read()
b'\xe1\x9f\xa9\xe7'
>>>
>>> f = open('file-windows1256.txt', 'rb')
>>> f.read()
b'\xd8\xc7\xd1\xde'
>>>
The right way to read these files is by telling Python what encoding it should use so it decodes it to its internal Unicode representation (Using the mapping tables in /Python33/Lib/encodings/):
>>> f = open('file-utf8.txt', encoding='utf-8')
>>> f.read()
'طارق'
>>>
>>> f = open('file-cp720.txt', encoding='cp720')
>>> f.read()
'طارق'
>>>
>>> f = open('file-windows1256.txt', encoding='windows-1256')
>>> f.read()
'طارق'
>>>
The issue of encoding is not only related to files. Whenever you read texts from external source to the program, e.g. file, console, network socket, you must know the encoding. Also when you write to external source you have to encode the text to the right encoding.
The encoding have to be consistent, if your console is using Latin-1 and you tried to write to the console, i.e. print, you will get some meaningless word or, if you are lucky, you will get UnicodeEncodeError exception.
There are ways for guessing the encoding, but I won't bother to use them as they only mask the problem. It will come sooner or later.
1 If it's up to you, always go with UTF-8 because it's well supported.

The right encodings to read an Arabic text file are utf_8 and utf_16. But you have to try both and see which one is the right encoding for your file. You can do this by using the codecs package and setting the right encoding argument.
import codecs, sys
# pass your file as a command-line argument
# try "utf-16" encoding if this does not work
for line in codecs.open(sys.argv[1], encoding = "utf_8"):
print(line.strip()

Arabic is generally represented in Unicode.
Generally, you can read the file in and then convert to Unicode:
import codecs
f = codecs.open('unicode.rst', encoding='utf-8')
for line in f:
print repr(line)
For more information, refer to https://docs.python.org/2/howto/unicode.html#reading-and-writing-unicode-data

File = open("Infixes.txt",encoding = "utf-16")
print(File.read())
This works for me in Windows 8.1 and 64 bit processor

Use this for Urdu file reading in Python:
File = open("Infixes.txt",encoding = "utf-8")
print(File.read())

Related

How to read languages in Python [duplicate]

When I read a file in python and print it to the screen, it does not read certain characters properly, however, those same characters hard coded into a variable print just fine. Here is an example where "test.html" contains the text "Hallå":
with open('test.html','r') as file:
Str = file.read()
print(Str)
Str = "Hallå"
print(Str)
This generates the following output:
hallå
Hallå
My guess is that there is something wrong with how the data in the file is being interpreted when it is read into Python, however I am uncertain of what it is since Python 3.8.5 already uses UTF-8 encoding by default.
Function open does not use UTF-8 by default. As the documentation says:
In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.
So, it depends, and to be certain, you have to specify the encoding yourself. If the file is saved in UTF-8, you should do this:
with open('test.html', 'r', encoding='utf-8') as file:
On the other hand, it is not clear whether the file is or is not saved in UTF-8 encoding. If it is not, you'll have to choose a different one.

FPDF encoding error when reading a UTF8 txt file in Python [duplicate]

I'm having some brain failure in understanding reading and writing text to a file (Python 2.4).
# The string, which has an a-acute in it.
ss = u'Capit\xe1n'
ss8 = ss.encode('utf8')
repr(ss), repr(ss8)
("u'Capit\xe1n'", "'Capit\xc3\xa1n'")
print ss, ss8
print >> open('f1','w'), ss8
>>> file('f1').read()
'Capit\xc3\xa1n\n'
So I type in Capit\xc3\xa1n into my favorite editor, in file f2.
Then:
>>> open('f1').read()
'Capit\xc3\xa1n\n'
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'
>>> open('f1').read().decode('utf8')
u'Capit\xe1n\n'
>>> open('f2').read().decode('utf8')
u'Capit\\xc3\\xa1n\n'
What am I not understanding here? Clearly there is some vital bit of magic (or good sense) that I'm missing. What does one type into text files to get proper conversions?
What I'm truly failing to grok here, is what the point of the UTF-8 representation is, if you can't actually get Python to recognize it, when it comes from outside. Maybe I should just JSON dump the string, and use that instead, since that has an asciiable representation! More to the point, is there an ASCII representation of this Unicode object that Python will recognize and decode, when coming in from a file? If so, how do I get it?
>>> print simplejson.dumps(ss)
'"Capit\u00e1n"'
>>> print >> file('f3','w'), simplejson.dumps(ss)
>>> simplejson.load(open('f3'))
u'Capit\xe1n'
Rather than mess with .encode and .decode, specify the encoding when opening the file. The io module, added in Python 2.6, provides an io.open function, which allows specifying the file's encoding.
Supposing the file is encoded in UTF-8, we can use:
>>> import io
>>> f = io.open("test", mode="r", encoding="utf-8")
Then f.read returns a decoded Unicode object:
>>> f.read()
u'Capit\xe1l\n\n'
In 3.x, the io.open function is an alias for the built-in open function, which supports the encoding argument (it does not in 2.x).
We can also use open from the codecs standard library module:
>>> import codecs
>>> f = codecs.open("test", "r", "utf-8")
>>> f.read()
u'Capit\xe1l\n\n'
Note, however, that this can cause problems when mixing read() and readline().
In the notation u'Capit\xe1n\n' (should be just 'Capit\xe1n\n' in 3.x, and must be in 3.0 and 3.1), the \xe1 represents just one character. \x is an escape sequence, indicating that e1 is in hexadecimal.
Writing Capit\xc3\xa1n into the file in a text editor means that it actually contains \xc3\xa1. Those are 8 bytes and the code reads them all. We can see this by displaying the result:
# Python 3.x - reading the file as bytes rather than text,
# to ensure we see the raw data
>>> open('f2', 'rb').read()
b'Capit\\xc3\\xa1n\n'
# Python 2.x
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'
Instead, just input characters like á in the editor, which should then handle the conversion to UTF-8 and save it.
In 2.x, a string that actually contains these backslash-escape sequences can be decoded using the string_escape codec:
# Python 2.x
>>> print 'Capit\\xc3\\xa1n\n'.decode('string_escape')
Capitán
The result is a str that is encoded in UTF-8 where the accented character is represented by the two bytes that were written \\xc3\\xa1 in the original string. To get a unicode result, decode again with UTF-8.
In 3.x, the string_escape codec is replaced with unicode_escape, and it is strictly enforced that we can only encode from a str to bytes, and decode from bytes to str. unicode_escape needs to start with a bytes in order to process the escape sequences (the other way around, it adds them); and then it will treat the resulting \xc3 and \xa1 as character escapes rather than byte escapes. As a result, we have to do a bit more work:
# Python 3.x
>>> 'Capit\\xc3\\xa1n\n'.encode('ascii').decode('unicode_escape').encode('latin-1').decode('utf-8')
'Capitán\n'
Now all you need in Python3 is open(Filename, 'r', encoding='utf-8')
[Edit on 2016-02-10 for requested clarification]
Python3 added the encoding parameter to its open function. The following information about the open function is gathered from here: https://docs.python.org/3/library/functions.html#open
open(file, mode='r', buffering=-1,
encoding=None, errors=None, newline=None,
closefd=True, opener=None)
Encoding is the name of the encoding used to decode or encode the
file. This should only be used in text mode. The default encoding is
platform dependent (whatever locale.getpreferredencoding()
returns), but any text encoding supported by Python can be used.
See the codecs module for the list of supported encodings.
So by adding encoding='utf-8' as a parameter to the open function, the file reading and writing is all done as utf8 (which is also now the default encoding of everything done in Python.)
So, I've found a solution for what I'm looking for, which is:
print open('f2').read().decode('string-escape').decode("utf-8")
There are some unusual codecs that are useful here. This particular reading allows one to take UTF-8 representations from within Python, copy them into an ASCII file, and have them be read in to Unicode. Under the "string-escape" decode, the slashes won't be doubled.
This allows for the sort of round trip that I was imagining.
This works for reading a file with UTF-8 encoding in Python 3.2:
import codecs
f = codecs.open('file_name.txt', 'r', 'UTF-8')
for line in f:
print(line)
# -*- encoding: utf-8 -*-
# converting a unknown formatting file in utf-8
import codecs
import commands
file_location = "jumper.sub"
file_encoding = commands.getoutput('file -b --mime-encoding %s' % file_location)
file_stream = codecs.open(file_location, 'r', file_encoding)
file_output = codecs.open(file_location+"b", 'w', 'utf-8')
for l in file_stream:
file_output.write(l)
file_stream.close()
file_output.close()
Aside from codecs.open(), io.open() can be used in both 2.x and 3.x to read and write text files. Example:
import io
text = u'á'
encoding = 'utf8'
with io.open('data.txt', 'w', encoding=encoding, newline='\n') as fout:
fout.write(text)
with io.open('data.txt', 'r', encoding=encoding, newline='\n') as fin:
text2 = fin.read()
assert text == text2
To read in an Unicode string and then send to HTML, I did this:
fileline.decode("utf-8").encode('ascii', 'xmlcharrefreplace')
Useful for python powered http servers.
Well, your favorite text editor does not realize that \xc3\xa1 are supposed to be character literals, but it interprets them as text. That's why you get the double backslashes in the last line -- it's now a real backslash + xc3, etc. in your file.
If you want to read and write encoded files in Python, best use the codecs module.
Pasting text between the terminal and applications is difficult, because you don't know which program will interpret your text using which encoding. You could try the following:
>>> s = file("f1").read()
>>> print unicode(s, "Latin-1")
Capitán
Then paste this string into your editor and make sure that it stores it using Latin-1. Under the assumption that the clipboard does not garble the string, the round trip should work.
You have stumbled over the general problem with encodings: How can I tell in which encoding a file is?
Answer: You can't unless the file format provides for this. XML, for example, begins with:
<?xml encoding="utf-8"?>
This header was carefully chosen so that it can be read no matter the encoding. In your case, there is no such hint, hence neither your editor nor Python has any idea what is going on. Therefore, you must use the codecs module and use codecs.open(path,mode,encoding) which provides the missing bit in Python.
As for your editor, you must check if it offers some way to set the encoding of a file.
The point of UTF-8 is to be able to encode 21-bit characters (Unicode) as an 8-bit data stream (because that's the only thing all computers in the world can handle). But since most OSs predate the Unicode era, they don't have suitable tools to attach the encoding information to files on the hard disk.
The next issue is the representation in Python. This is explained perfectly in the comment by heikogerlach. You must understand that your console can only display ASCII. In order to display Unicode or anything >= charcode 128, it must use some means of escaping. In your editor, you must not type the escaped display string but what the string means (in this case, you must enter the umlaut and save the file).
That said, you can use the Python function eval() to turn an escaped string into a string:
>>> x = eval("'Capit\\xc3\\xa1n\\n'")
>>> x
'Capit\xc3\xa1n\n'
>>> x[5]
'\xc3'
>>> len(x[5])
1
As you can see, the string "\xc3" has been turned into a single character. This is now an 8-bit string, UTF-8 encoded. To get Unicode:
>>> x.decode('utf-8')
u'Capit\xe1n\n'
Gregg Lind asked: I think there are some pieces missing here: the file f2 contains: hex:
0000000: 4361 7069 745c 7863 335c 7861 316e Capit\xc3\xa1n
codecs.open('f2','rb', 'utf-8'), for example, reads them all in a separate chars (expected) Is there any way to write to a file in ASCII that would work?
Answer: That depends on what you mean. ASCII can't represent characters > 127. So you need some way to say "the next few characters mean something special" which is what the sequence "\x" does. It says: The next two characters are the code of a single character. "\u" does the same using four characters to encode Unicode up to 0xFFFF (65535).
So you can't directly write Unicode to ASCII (because ASCII simply doesn't contain the same characters). You can write it as string escapes (as in f2); in this case, the file can be represented as ASCII. Or you can write it as UTF-8, in which case, you need an 8-bit safe stream.
Your solution using decode('string-escape') does work, but you must be aware how much memory you use: Three times the amount of using codecs.open().
Remember that a file is just a sequence of bytes with 8 bits. Neither the bits nor the bytes have a meaning. It's you who says "65 means 'A'". Since \xc3\xa1 should become "à" but the computer has no means to know, you must tell it by specifying the encoding which was used when writing the file.
The \x.. sequence is something that's specific to Python. It's not a universal byte escape sequence.
How you actually enter in UTF-8-encoded non-ASCII depends on your OS and/or your editor. Here's how you do it in Windows. For OS X to enter a with an acute accent you can just hit option + E, then A, and almost all text editors in OS X support UTF-8.
You can also improve the original open() function to work with Unicode files by replacing it in place, using the partial function. The beauty of this solution is you don't need to change any old code. It's transparent.
import codecs
import functools
open = functools.partial(codecs.open, encoding='utf-8')
I was trying to parse iCal using Python 2.7.9:
from icalendar import Calendar
But I was getting:
Traceback (most recent call last):
File "ical.py", line 92, in parse
print "{}".format(e[attr])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 7: ordinal not in range(128)
and it was fixed with just:
print "{}".format(e[attr].encode("utf-8"))
(Now it can print liké á böss.)
I found the most simple approach by changing the default encoding of the whole script to be 'UTF-8':
import sys
reload(sys)
sys.setdefaultencoding('utf8')
any open, print or other statement will just use utf8.
Works at least for Python 2.7.9.
Thx goes to https://markhneedham.com/blog/2015/05/21/python-unicodeencodeerror-ascii-codec-cant-encode-character-uxfc-in-position-11-ordinal-not-in-range128/ (look at the end).

Converting character codes to unicode [Python]

So I have a large csv of french verbs that I am using to make a program, in the csv, verbs with accent characters contain codes instead of the actual accents:
être is être, for example (atleast when I open the file in Excel)
Here is the csv:
https://raw.githubusercontent.com/ianmackinnon/inflect/master/french-verb-conjugation.csv
In Chrome and Firefox atleast, the codes are converted to the correct accents. I was wondering if once the string is imported in python into a given a variable, ie.
...
for row in reader:
inf_lst.append(row[0])
verb = inf_lst[2338]
#(verb = être)
if there was a straightforward/built in method for printing it out with correct unicode to give "être"?
I am aware that you could do this by replacing the ê's with ê's in each string but since this would have to be done for each different possible accent, I was wondering if there was an easier way.
Thanks,
You can use unicode encoding by prefixing a string with 'u'.
>>> foo = u'être' >>> print foo être
It all comes down to the character encoding of the data. Its possible that it is utf-8 encoded and you are viewing it in a Windows tool that is using your local code page, which gives a different display for the stream. How to read/write with files is covered in the csv doc examples.
You've given us a zipped, utf-8 encoded web page and the requests modules is good at handling that sort of thing. So, you could read the csv with:
>>> import requests
>>> import csv
>>> resp=requests.get("https://raw.githubusercontent.com/ianmackinnon/inflect/master/french-verb-conjugation.csv",
... stream=True)
>>> try:
... inf_lst = list(csv.reader(resp.iter_lines(decode_unicode=True)))
... finally:
... del resp
...
>>> len(inf_list)
5362
You have a UTF-8-encoded file. Excel likes that encoding to start with a byte order mark character (U+FEFF) or it assumes the default ANSI encoding for your version of Windows instead. To get UTF-8 with BOM, use a tool like Notepad++. Open the file in Notepad++. On the Encoding menu, select "Encode in UTF-8-BOM" and save. Now it will display correctly in Excel.
To write a file that Excel can open, use the encoding utf-8-sig and write Unicode strings:
import io
with io.open('out.csv','w',encoding='utf-8-sig') as f:
f.write(u'être')

How do I write utf-8 characters( '\xe7\x8e\xa9' ) into another file as Chinese characters?

I got some strings from the database which look like '\xe7\x8e\xa9'.
I think it's utf-8. I can print them out by using:
print '\xe7\x8e\xa9'
玩
The things is, I need write them into another file as Chinese Character(e.g. 玩) together with other alphanumeric data.
I tried encode, decode but I didn't get the results I was hoping for.
Here are my attempts:
f = open('a','w')
name = u.name #.encode('utf8') # I commented it to get raw
f.write('\t$$%r$$many_other_data' % name)
f.close()
When I open the output file with vim7.4:
`$$u'\u7aef\u5e84\u7684\u9a6c\u6b47\u5c14$$many_other_data'`
Here is code sample working for me:
with open('foo', 'w+') as f:
f.write('\xe7\x8e\xa9')
and in foo file a have:
玩
but, I've open foo with utf-8 encoding, so it's displays chines character instead of Unicode value.
I've tested it with both vim and gedit and it works just fine.
Perhaps you should provide type of your output file, so we can be more specific.
EDIT
I see the problem now. You used %r flag in writing your string. You should use %s (and enable encoding again).
Here is working example:
>>> a = u'\u7aef\u5e84\u7684\u9a6c\u6b47\u5c14'
>>> f = open('tmp', 'w')
>>> a = a.encode('utf-8')
>>> f.write('\t$$%r$$other_data\n'%a)
>>> f.write('\t$$%s$$other_data\n'%a)
>>> f.close
results being:
$$'\xe7\xab\xaf\xe5\xba\x84\xe7\x9a\x84\xe9\xa9\xac\xe6\xad\x87\xe5\xb0\x94'$$other_data
$$端庄的马歇尔$$other_data
Please ready this answer for reference about difference between %r and %s.
Hope that helped.
Files are bytes. You can't store characters in them.
A particularly common encoding is ASCII. It's an encoding just like all those different unicode ones.
The bytes are meaniningless (as text) on their own without an associated encoding to give them meaning.
You'll need to view the file with an editor or viewer that is using the same encoding that you used to write the file.
Since you have bytes, you need to know your encoding. There are multiple ways to turn bytes into unicode (str.decode), and it's depending on what encoding the bytes are in.
You can't get this from the bytes themselves, someone has to tell you the encoding.
Although, sometimes you can make an educated guess:
>>> import chardet
>>> s = '\xe7\x8e\xa9'
>>> chardet.detect(s)
{'confidence': 0.505, 'encoding': 'utf-8'}
>>> s.decode(chardet.detect(s)['encoding'])
u'\u73a9'
>>> print _
玩
Now, you should convert any strings from db to unicode as soon as they enter your python program so that your code is working entirely in unicode, not bytes.
Then, you can write your file like this:
import io
with io.open('/tmp/myfile.txt', 'wb', encoding='utf-8') as f:
f.write(u'\u73a9')
f.write('\n')
f.write('random other data 12345...')

Python Check for unicode files

I'm very new to python and I have a mix of both ansi and unicode (utf-16-le) text based files in a series of directories. I've got some code which reads the text files okay until it hits a unicode file which at the mo, I've written in the code to skip. . I'm wondering if there's anyway I can get python to run a
with codecs.open
type of thing when it hits a unicode file as part of one prog? With my currrent level of python experience, the only way I could see of doing this is to write two separate progs; one to process the ANSI stuff and one for the Unicode.
Thanks in advance for any help you can provide
use unicode by default(which is a good programming discipline) and switch to ansi only if necessary.
import codecs
def opener(filename):
try:
f = codecs.open(filename, encoding='utf-8')
except UnicodeError:
f = open(filename)
return f
Just open all files using UTF-8.
f = codecs.open(file_name, "r", "utf-8")

Categories

Resources