Python unicode output [duplicate]

My code is:
import codecs

f = codecs.open(r'C:\Users\Admin\Desktop\nepali.txt', 'r', 'UTF-8')
nepali = f.read().split()
for i in nepali:
    print i
It displays the words in the file:
यो
किताब
टेबुल
मा
छ
यो
एक
किताब
हो
केटा
But when I try to create a list of the words with this code:
file=codecs.open(r'C:\Users\Admin\Desktop\nepali.txt', 'r', 'UTF-8')
nepali = list(file.read().split())
print nepali
The output is now displayed like this:
[u'\ufeff\u092f\u094b', u'\u0915\u093f\u0924\u093e\u092c', u'\u091f\u0947\u092c\u0941\u0932', u'\u092e\u093e', u'\u091b', u'\u092f\u094b', u'\u090f\u0915', u'\u0915\u093f\u0924\u093e\u092c', u'\u0939\u094b']
The output should look like:
[यो, किताब, टेबुल, मा, छ, यो, एक, किताब, हो]

You are looking at the output of the repr() function, which is always used for displaying the contents of containers. The output is meant for debugging, not end-user displays; any non-printable or non-ASCII codepoint is represented by an escape sequence (which can, depending on the codepoint, be a single-character escape like \t or \n, or use 2, 4, or 8 hex digits, like \xe5, \u2603 or \U0001f4e2).
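For instance, a quick Python 2 sketch of that difference (assuming a terminal that can render Devanagari):
>>> word = u'\u092f\u094b'
>>> print word           # print uses the str()/unicode form
यो
>>> print repr(word)     # repr() escapes the non-ASCII codepoints
u'\u092f\u094b'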
You'll have to produce the output manually:
print u'[{}]'.format(u', '.join(nepali))
This produces a unicode string formatted to look like a list object, but without using repr(), simply by adding square brackets around the strings, joined with ', ' (comma and space).
Demo:
>>> nepali = [u'\ufeff\u092f\u094b', u'\u0915\u093f\u0924\u093e\u092c', u'\u091f\u0947\u092c\u0941\u0932', u'\u092e\u093e', u'\u091b', u'\u092f\u094b', u'\u090f\u0915', u'\u0915\u093f\u0924\u093e\u092c', u'\u0939\u094b',]
>>> print u'[{}]'.format(u', '.join(nepali))
[यो, किताब, टेबुल, मा, छ, यो, एक, किताब, हो]
However, if you want to show this to an end-user, why use the square brackets at all?
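For an end-user display you could, for example, drop the brackets entirely; a sketch reusing the nepali list from the demo:
>>> print u' '.join(nepali)   # space-separated, no brackets or quotes
यो किताब टेबुल मा छ यो एक किताब हो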


How to deal with variations in apostrophe ’ and ' [duplicate]

On my website people can post news, and quite a few editors use MS Word and similar tools to write the text and then copy & paste it into my site's editor (a simple textarea, no WYSIWYG etc.).
Those texts usually contain "nice" quotes instead of the plain ASCII ones ("). They also sometimes contain longer dashes like – instead of -.
Now I want to replace all those characters with their ASCII counterparts. However, I do not want to remove umlauts and other non-ASCII characters. I'd also highly prefer a proper solution that does not involve creating a mapping dict for all those characters.
All my strings are unicode objects.
What about this?
It creates a translation table first, but honestly I don't think you can do this without one.
transl_table = dict([(ord(x), ord(y)) for x, y in zip(u"‘’´“”–-", u"'''\"\"--")])
with open("a.txt", "w", encoding="utf-8") as f_out:
    a_str = u" ´funny single quotes´ long–-and–-short dashes ‘nice single quotes’ “nice double quotes” "
    print(" a_str = " + a_str, file=f_out)
    fixed_str = a_str.translate(transl_table)
    print(" fixed_str = " + fixed_str, file=f_out)
I wasn't able to print this to a console (on Windows), so I had to write to a txt file.
The output in the a.txt file looks as follows:
a_str =  ´funny single quotes´ long–-and–-short dashes ‘nice single quotes’ “nice double quotes”
fixed_str =  'funny single quotes' long--and--short dashes 'nice single quotes' "nice double quotes"
By the way, the code above works in Python 3. If you need it for Python 2, it might need some fixes due to the differences in handling Unicode strings between the two versions of the language.
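For what it's worth, a minimal Python 2 sketch of the same idea (assuming the input is a unicode object, as in the question); unicode.translate() accepts the ordinal-to-ordinal table directly:
# -*- coding: utf-8 -*-
transl_table = dict((ord(x), ord(y)) for x, y in zip(u"‘’´“”–-", u"'''\"\"--"))
a_str = u"‘nice single quotes’ “nice double quotes” long–-and–-short dashes"
fixed_str = a_str.translate(transl_table)
print fixed_str.encode('utf-8')   # 'nice single quotes' "nice double quotes" long--and--short dashes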
There is no such "proper" solution, because for any given Unicode character there is no "ASCII counterpart" defined.
For example, take the seemingly easy characters that you might want to map to ASCII single and double quotes and hyphens. First, let's generate all the Unicode characters with their official names. Second, let's find all the quotation marks, hyphens and dashes according to those names:
#!/usr/bin/env python3
import unicodedata

def unicode_character_name(char):
    try:
        return unicodedata.name(char)
    except ValueError:
        return None

# Generate all Unicode characters with their names
all_unicode_characters = []
for n in range(0, 0x10ffff):    # Unicode planes 0-16
    char = chr(n)               # Python 3
    #char = unichr(n)           # Python 2
    name = unicode_character_name(char)
    if name:
        all_unicode_characters.append((char, name))

# Find all Unicode quotation marks
print(' '.join([char for char, name in all_unicode_characters if 'QUOTATION MARK' in name]))
# " « » ‘ ’ ‚ ‛ “ ” „ ‟ ‹ › ❛ ❜ ❝ ❞ ❟ ❠ ❮ ❯ ⹂ 〝 〞 〟 " 🙶 🙷 🙸

# Find all Unicode hyphens
print(' '.join([char for char, name in all_unicode_characters if 'HYPHEN' in name]))
# - ­ ֊ ᐀ ᠆ ‐ ‑ ‧ ⁃ ⸗ ⸚ ⹀ ゠ ﹣ - 󠀭

# Find all Unicode dashes
print(' '.join([char for char, name in all_unicode_characters if 'DASH' in name and 'DASHED' not in name]))
# ‒ – — ⁓ ⊝ ⑈ ┄ ┅ ┆ ┇ ┈ ┉ ┊ ┋ ╌ ╍ ╎ ╏ ⤌ ⤍ ⤎ ⤏ ⤐ ⥪ ⥫ ⥬ ⥭ ⩜ ⩝ ⫘ ⫦ ⬷ ⸺ ⸻ ⹃ 〜 〰 ︱ ︲ ﹘ 💨
As you can see, as easy as this example is, there are many problems. There are many quotation marks in Unicode that don't look anything like the quotation marks in US-ASCII and there are many hyphens in Unicode that don't look anything like the hyphen-minus sign in US-ASCII.
And there are many questions. For example:
should the "SWUNG DASH" (⁓) symbol be replaced with an ASCII hyphen (-) or a tilde (~)?
should the "CANADIAN SYLLABICS HYPHEN" (᐀) be replaced with an ASCII hyphen (-) or an equals sign (=)?
should the "SINGLE LEFT-POINTING ANGLE QUOTATION MARK" (‹) be replaces with an ASCII quotation mark ("), an apostrophe (') or a less-than sign (<)?
To establish a "correct" ASCII counterpart, somebody needs to answer these questions based on the use context. That's why all the solutions to your problem are based on a mapping dictionary in one way or another. And all these solutions will provide different results.
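For illustration only, one arbitrary way those decisions could be encoded as a mapping of characters to chosen replacements (the specific choices here are assumptions, not a standard):
# Every entry is a judgement call for a particular use context.
ascii_counterparts = {
    u'\u2053': u'~',   # SWUNG DASH -> tilde (could just as well be '-')
    u'\u1400': u'-',   # CANADIAN SYLLABICS HYPHEN -> hyphen (could be '=')
    u'\u2039': u"'",   # SINGLE LEFT-POINTING ANGLE QUOTATION MARK -> apostrophe (could be '<')
}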
You can build on top of the unidecode package.
This is pretty slow, since we normalize all the Unicode to the combined (NFC) form first and then check what unidecode turns each character into. If unidecode gives us back a Latin letter, we keep the original NFC character. If not, we yield whatever degarbling unidecode has suggested. This leaves accented letters alone, but will convert everything else.
import unidecode
import unicodedata
import re

def char_filter(string):
    latin = re.compile('[a-zA-Z]+')
    for char in unicodedata.normalize('NFC', string):
        decoded = unidecode.unidecode(char)
        if latin.match(decoded):
            yield char
        else:
            yield decoded

def clean_string(string):
    return "".join(char_filter(string))

print(clean_string(u"vis-à-vis “Beyoncé”’s naïve papier–mâché résumé"))
# prints vis-à-vis "Beyoncé"'s naïve papier-mâché résumé
You can use the str.translate() method (http://docs.python.org/library/stdtypes.html#str.translate). However, read the part of the docs related to Unicode -- there the translation table has another form: Unicode ordinal number --> Unicode string (usually a single character) or None.
Well, but it requires the dict. You have to capture the replacements somewhere anyway; how would you do that without a table or arrays? You could use str.replace() for the single characters, but that would be inefficient.
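A small sketch of that table form (the example characters are arbitrary; the same dict works for unicode objects in Python 2 and str in Python 3):
table = {
    ord(u'\u2026'): u'...',   # HORIZONTAL ELLIPSIS -> three ASCII dots (one codepoint becomes several)
    ord(u'\u2013'): u'-',     # EN DASH -> hyphen-minus
    ord(u'\u00ad'): None,     # SOFT HYPHEN -> deleted entirely
}
print(u'wait\u2026 an en\u2013dash and a soft\u00adhyphen'.translate(table))
# wait... an en-dash and a softhyphen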
This tool will normalize punctuation in markdown: http://johnmacfarlane.net/pandoc/README.html
-S, --smart    Produce typographically correct output, converting straight
               quotes to curly quotes, --- to em-dashes, -- to en-dashes, and
               ... to ellipses. Nonbreaking spaces are inserted after certain
               abbreviations, such as “Mr.” (Note: This option is significant
               only when the input format is markdown or textile. It is
               selected automatically when the input format is textile or the
               output format is latex or context.)
It's Haskell, so you'd have to figure out the interface.

How to remove a set of characters when a string contains "\" and special characters in Python

a = "\Virtual Disks\DG2_ASM04\ACTIVE"
From the above string I would like to get the part "DG2_ASM04" alone. I cannot split or strip as it has the special characters "\", "\D" and "\A" in it.
I have tried the below and can't get the desired output:
a.lstrip("\Virtual Disks\\").rstrip("\ACTIVE")
The output I got is 'G2_ASM04' instead of 'DG2_ASM04'.
Simply split on the escaped backslash (\\) and take the second-to-last element:
>>> a.split("\\")[-2]
'DG2_ASM04'
In your case the D is also removed because lstrip() and rstrip() strip any characters from the set you pass, not a literal prefix or suffix, and D occurs in "\Virtual Disks\\". If you tweak your string you will see what is happening:
>>> a = "\Virtual Disks\XG2_ASM04\ACTIVE"
>>> a.lstrip('\\Virtual Disks\\').rstrip("\\ACTIVE")
'XG2_ASM04'
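A quick illustration of that set-based behaviour, and of why split() is the safer tool here:
>>> "mississippi".rstrip("ip")   # removes every trailing 'i' and 'p', not the suffix "ip"
'mississ'
>>> a = "\Virtual Disks\DG2_ASM04\ACTIVE"
>>> a.split("\\")
['', 'Virtual Disks', 'DG2_ASM04', 'ACTIVE']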

Dealing with doubly escaped unicode string

I have a database of badly formatted strings. The data looks like this:
"street"=>"\"\\u4e2d\\u534e\\u8def\""
when it should be like this:
"street"=>"中华路"
The problem I have is that when those doubly escaped strings come from the database they are not decoded to the Chinese characters as they should be. So suppose I have this variable: street = "\"\\u4e2d\\u534e\\u8def\"". If I print it with print(street), the result is the string of escape sequences "\u4e2d\u534e\u8def".
What can I do at this point to convert "\u4e2d\u534e\u8def" to the actual Unicode characters?
First encode this string as UTF-8 and then decode it with unicode-escape, which will handle the \\ for you:
>>> line = "\"\\u4e2d\\u534e\\u8def\""
>>> line.encode('utf8').decode('unicode-escape')
'"中华路"'
You can then strip the " characters if necessary.
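For example, continuing the snippet above:
>>> line.encode('utf8').decode('unicode-escape').strip('"')
'中华路'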
You could remove the quotation marks with strip and split at every '\\u'. This would give you the characters as strings of hex digits. Then you could convert each string to an int and back to a character with chr():
>>> street = "\"\\u4e2d\\u534e\\u8def\""
>>> ''.join(chr(int(x, 16)) for x in street.strip('"').split('\\u') if x)
'中华路'
Based on what you wrote, the database appears to be storing an eval-uable ASCII representation of a string containing non-ASCII chars.
>>> eval("\"\\u4e2d\\u534e\\u8def\"")
'中华路'
Python has a built-in function for this.
>>> ascii('中华路')
"'\\u4e2d\\u534e\\u8def'"
The only difference is the use of \" instead of ' for the needed internal quote.
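If you'd rather not run eval() on database contents, ast.literal_eval can parse the same string literal (a Python 3 sketch):
>>> import ast
>>> ast.literal_eval("\"\\u4e2d\\u534e\\u8def\"")
'中华路'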

How to encode a python list

I'm having a hard time trying to encode a Python list. I already did it with a text file in order to count specific words inside it, using the re module.
This is the code:
import re
import codecs
from collections import Counter

# encoding text file
with codecs.open('projectsinline.txt', 'r', encoding="utf-8") as f:
    for line in f:
        # Using the re module to extract specific words
        unicode_pattern = re.compile(r'\b\w{4,20}\b', re.UNICODE)
        result = unicode_pattern.findall(line)
        word_counts = Counter(result)  # It creates a dictionary of word counts

Allwords = []
for clave in word_counts:
    if word_counts[clave] >= 10:  # We look for the most repeated words
        word = clave
        Allwords.append(word)
print Allwords
Part of the output looks like this:
[...u'recursos', u'Partidos', u'Constituci\xf3n', u'veh\xedculos', u'investigaci\xf3n', u'Pol\xedticos']
If I print the variable word, the output looks as it should. However, when I append the words to the list, they break again, as in the example above.
I used this example:
[x.encode("utf-8") for x in Allwords]
The output looks exactly the same as before.
I also used this example:
Allwords.append(str(word.encode("utf-8")))
The output changes, but the words still don't look as they should:
[...'recursos', 'Partidos', 'Constituci\xc3\xb3n', 'veh\xc3\xadculos', 'investigaci\xc3\xb3n', 'Pol\xc3\xadticos']
Some of the answers have given this example:
print('[' + ', '.join(Allwords) + ']')
The output looks like this:
[...recursos, Partidos, Constitución, vehículos, investigación, Políticos]
To be honest I do not want to print the list, just encode it, so that all items (words) are recognized.
I'm looking for something like this:
[...'recursos', 'Partidos', 'Constitución', 'vehículos', 'investigación', 'Políticos']
Any suggestions to solve the problem are appreciated
Thanks,
You might want to try:
print('[' + ', '.join(Allwords) + ']')
Your Unicode string list is correct. When you print a list, the items in it are displayed using their repr(). When you print the items themselves, they are displayed using their str(). It is only a display option, similar to printing integers as decimal or hexadecimal.
So print the individual words if you want to see them correctly, but for comparisons the content is correct.
It's worth noting that Python 3 changes the behavior of repr(): it now displays printable non-ASCII characters without escape codes (as long as your terminal can show them), and the ascii() function reproduces the Python 2 repr() behavior.
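A short Python 2 demonstration of that display difference, using one of your own words:
>>> word = u'Constituci\xf3n'
>>> print word              # str()-style display
Constitución
>>> print repr(word)        # what you see inside a printed list
u'Constituci\xf3n'
>>> print [word]
[u'Constituci\xf3n']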

Similar C string format in Python

I need to read a file with some strange string lines like: \x72\xFE\x20TEST_STRING\0\0\0
but when I print this string (with repr()) it prints this: r\xfe TEST_STRING\x00\x00\x00
Example :
>>> test = '\x72\xFE\x20TEST_STRING\0\0\0'
>>> print test
r? TEST_STRING
>>> print repr(test)
'r\xfe TEST_STRING\x00\x00\x00'
How can I get the same line from a file in Python and in my editor?
Is Python changing the encoding during string manipulation?
You should use Python's raw strings, like this (note the r in front of the string):
test = r'\x72\xFE\x20TEST_STRING\0\0\0'
Then it won't try to interpret the escapes as special characters.
When reading from a text file, Python shouldn't try to interpret the string as containing multi-byte Unicode characters. You should get exactly what's in the file:
In [22]: fp = open("test.txt", "r")
In [23]: s = fp.read()
In [24]: s
Out[24]: '\\x72\\xFE\\x20TEST_STRING\\0\\0\\0\n\n'
In [25]: print s
\x72\xFE\x20TEST_STRING\0\0\0
\x20 is a space. When you put that into a Python string it is stored exactly the same way as a space.
If you have printable characters in a string it does not matter whether they were typed as the actual character or some escape sequence, they will be represented the same way because they are in fact the same value.
Consider the following examples:
>>> ' ' == '\x20'
True
>>> hex(ord('a'))
'0x61'
>>> '\x61'
'a'
Python did not change the encoding:
When printing, Python just resolved the printable chars in your string: chr(0x72) is an "r", chr(0xfe) is not printable, so you get the "?", chr(0x20) is chr(32), which is a space " ", and zero bytes are not printed at all.
repr() keeps the printable "r" and the space, and shows the non-printable chr(0xfe) and chr(0x00) in hexadecimal escape notation (\xfe and \x00).
So if you want the same line in your editor and from repr(), you have to type your string in your editor in the same notation repr() uses, that is, you write
test = 'r\xfe TEST_STRING\x00\x00\x00'
and repr(test) should print the same string.
To avoid having Python interpret the backslashes as escape sequences, prefix your string with an "r" character:
>>> test = r'\x72\xFE\x20TEST_STRING\0\0\0'
>>> print test
\x72\xFE\x20TEST_STRING\0\0\0
