I want to create a program that counts the "value" of a word by adding up values assigned to its letters, where each letter's value is the position at which it first appears in the word (as an exercise; I'm new to Python).
E.g. "foo" returns 5 (as 'f' = 1, 'o' = 2, and the second 'o' reuses the 2) and "bar" returns 6 (as 'b' = 1, 'a' = 2, 'r' = 3).
Here's my code so far:
# -*- coding: utf-8 -*-
def ppn(word):
    word = list(word)
    cipher = dict()
    i = 1
    e = 0
    for letter in word:
        if letter not in cipher:
            cipher[letter] = i
            e += i
            i += 1
        else:
            e += cipher[letter]
    return ''.join(word) + ": " + str(e)

if __name__ == "__main__":
    print ppn(str(raw_input()))
It works well; however, for words containing characters like 'ł', 'ą', etc. it doesn't return the correct value (I would guess it's because it translates these letters to Unicode codes first). Is there a way to bypass that and make the interpreter treat all the letters as single letters?
Decode your input into unicode, then use unicode everywhere, then encode when you output.
Specifically you will need to change
print ppn(str(raw_input()))
To
print ppn(raw_input().decode(sys.stdin.encoding))
This will decode your input (note that it requires import sys at the top of the script). You will then also need to change
''.join(word) + ": " + str(e)
To
u''.join(word) + u': ' + unicode(e)
This is making all your code use unicode objects internally.
Print will encode the unicode properly to whatever encoding your terminal is using, but you can also specify it if you need to.
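Putting both changes together, a minimal sketch of the corrected Python 2 script (assuming a terminal whose encoding is reported correctly by sys.stdin.encoding):

# -*- coding: utf-8 -*-
import sys

def ppn(word):
    # word is a unicode object here, so iteration is per character
    cipher = dict()
    i = 1
    e = 0
    for letter in word:
        if letter not in cipher:
            cipher[letter] = i
            e += i
            i += 1
        else:
            e += cipher[letter]
    return word + u': ' + unicode(e)

if __name__ == "__main__":
    # decode the raw input bytes into unicode before processing
    print ppn(raw_input().decode(sys.stdin.encoding))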
Alternatively you can do exactly what you have already, but run it with Python 3.
For more information, please see this very useful talk on the subject.
Decode with the encoding of your shell:
if __name__ == "__main__":
    import sys
    print ppn(raw_input().decode(sys.stdin.encoding))
On Unix systems UTF-8 typically works; on Windows things can be different. To be safe, use sys.stdin.encoding: you never know where your script is going to run.
Or, even better, switch to Python 3:
# -*- coding: utf-8 -*-
import sys
assert sys.version_info.major > 2

def ppn(word):
    word = list(word)
    cipher = dict()
    i = 1
    e = 0
    for letter in word:
        if letter not in cipher:
            cipher[letter] = i
            e += i
            i += 1
        else:
            e += cipher[letter]
    return ''.join(word) + ": " + str(e)

if __name__ == "__main__":
    print(ppn(input()))
In Python 3, strings are unicode by default, so there is no need for the decoding business.
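A quick illustration of the difference, assuming a UTF-8 terminal (interactive sessions):

# Python 2: a str is a byte string; 'ł' and 'ą' are two UTF-8 bytes each
>>> len('łąka')
6

# Python 3: a str is a sequence of characters
>>> len('łąka')
4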
All the answers so far have explained what to do, but not what's going on, so here are some hints.
When you use raw_input() with Python 2 you are returned a string of bytes (input() on Python 3 behaves differently). Most unicode characters cannot be represented as a single byte, simply because there are more unicode characters than values that fit in one byte.
Characters like ł or ą, when encoded with utf-8 or other encodings, can take two bytes or more:
>>> 'ł'
'\xc5\x82'
>>> 'ą'
'\xc4\x85'
Your original program is interpreting those two bytes as distinct characters, leading to incorrect results.
Python offers an alternative to byte strings: unicode strings. With a unicode string, one character is exactly one character (the internal representation of the string is opaque), and the problem you are experiencing cannot occur.
Therefore decoding the bytestring into a unicode string is the way to go.
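You can see the effect in a Python 2 session (the bytes below are the UTF-8 encoding of "łą"):

>>> s = '\xc5\x82\xc4\x85'
>>> len(s)        # four bytes...
4
>>> u = s.decode('utf-8')
>>> len(u)        # ...but only two characters
2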
Related
I'm working with a .txt file. I want a string of the text from the file with no non-ASCII characters. However, I want to leave spaces and periods. At present, I'm stripping those too. Here's the code:
def onlyascii(char):
    if ord(char) < 48 or ord(char) > 127: return ''
    else: return char

def get_my_string(file_path):
    f = open(file_path, 'r')
    data = f.read()
    f.close()
    filtered_data = filter(onlyascii, data)
    filtered_data = filtered_data.lower()
    return filtered_data
How should I modify onlyascii() to leave spaces and periods? I imagine it's not too complicated but I can't figure it out.
You can filter all characters from the string that are not printable using string.printable, like this:
>>> s = "some\x00string. with\x15 funny characters"
>>> import string
>>> printable = set(string.printable)
>>> filter(lambda x: x in printable, s)
'somestring. with funny characters'
string.printable on my machine contains:
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~ \t\n\r\x0b\x0c
EDIT: On Python 3, filter will return an iterable. The correct way to obtain a string back would be:
''.join(filter(lambda x: x in printable, s))
An easy way to change to a different codec is by using encode() or decode(). In your case, you want to convert to ASCII and ignore all symbols that are not supported. For example, the Swedish letter å is not an ASCII character:
>>> s = u'Good bye in Swedish is Hej d\xe5'
>>> s = s.encode('ascii', 'ignore')
>>> print s
Good bye in Swedish is Hej d
Edit:
Python 3: str -> bytes -> str
>>> "Hej då".encode("ascii", errors="ignore").decode()
'Hej d'
Python 2: unicode -> str -> unicode (note that in Python 2 the error handler must be passed positionally; encode()/decode() do not accept keyword arguments there)
>>> u"hej då".encode("ascii", "ignore").decode()
u'hej d'
Python 2: str -> unicode -> str (decode and encode in reverse order)
>>> "hej d\xe5".decode("ascii", "ignore").encode()
'hej d'
According to @artfulrobot, this should be faster than filter and lambda:
import re
re.sub(r'[^\x00-\x7f]', r'', your_string)
See more examples here Replace non-ASCII characters with a single space
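If you want to verify that speed claim on your own data, a rough micro-benchmark sketch (Python 3; the sample string and repeat count are arbitrary, and the numbers will vary by machine):

import re
import string
import timeit

s = "naïve café résumé " * 1000
printable = set(string.printable)

t_filter = timeit.timeit(
    lambda: ''.join(filter(lambda x: x in printable, s)), number=100)
t_regex = timeit.timeit(
    lambda: re.sub(r'[^\x00-\x7f]', '', s), number=100)
print("filter: %.4fs  regex: %.4fs" % (t_filter, t_regex))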
You may use the following code to remove the non-ASCII characters:
import re
text = "123456790 ABC#%? .(朱惠英)"
result = re.sub(r'[^\x00-\x7f]', r'', text)
print(result)
This will return
123456790 ABC#%? .()
Your question is ambiguous; the first two sentences taken together imply that you believe that space and "period" are non-ASCII characters. This is incorrect. All chars such that ord(char) <= 127 are ASCII characters. For example, your function excludes these characters !"#$%&'()*+,-./ but includes several others, e.g. []{}.
Please step back, think a bit, and edit your question to tell us what you are trying to do, without mentioning the word ASCII, and why you think that chars such that ord(char) >= 128 are ignorable. Also: which version of Python? What is the encoding of your input data?
Please note that your code reads the whole input file as a single string, and your comment ("great solution") to another answer implies that you don't care about newlines in your data. If your file contains two lines like this:
this is line 1
this is line 2
the result would be 'this is line 1this is line 2' ... is that what you really want?
A better solution would include:
a better name for the filter function than onlyascii
recognition that a filter function merely needs to return a truthy value if the argument is to be retained:
def filter_func(char):
    return char == '\n' or 32 <= ord(char) <= 126

# and later:
filtered_data = filter(filter_func, data).lower()
Working my way through Fluent Python (Ramalho) - highly recommended.
List comprehension one-ish-liners inspired by Chapter 2:
onlyascii = ''.join([s for s in data if ord(s) < 127])
onlymatch = ''.join([s for s in data if s in
                     'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'])
If you want printable ASCII characters, you probably should correct your code to:
if ord(char) < 32 or ord(char) > 126: return ''
This is equivalent to string.printable (answer from @jterrace), except for the absence of returns and tabs ('\t', '\n', '\x0b', '\x0c' and '\r'), but it doesn't correspond to the range in your question.
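Dropped into the asker's original function, the corrected range keeps spaces and periods while stripping high bytes (Python 2 session; the sample string is hypothetical):

>>> def onlyascii(char):
...     # keep printable ASCII: space (32) through tilde (126)
...     if ord(char) < 32 or ord(char) > 126: return ''
...     else: return char
...
>>> filter(onlyascii, 'Space and period . survive; na\xc3\xafve bytes do not')
'Space and period . survive; nave bytes do not'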
Here is a way to get only the ASCII characters that also handles both str and bytes input:
from string import printable
def getOnlyCharacters(texts):
    _type = None
    result = ''
    if isinstance(texts, bytes):
        _type = 'bytes'
        texts = texts.decode('utf-8', 'ignore')
    else:
        _type = 'str'
        texts = bytes(texts, 'utf-8').decode('utf-8', 'ignore')
        texts = str(texts)
    for text in texts:
        if text in printable:
            result += text
    if _type == 'bytes':
        result = result.encode('utf-8')
    return result
text = '�Ahm�����ed Sheri��'
result = getOnlyCharacters(text)
print(result)
#input --> �Ahm�����ed Sheri��
#output --> Ahmed Sheri
I have a unicode string like "Tanım" which has somehow been encoded as "Tan%u0131m". How can I convert this encoded string back to the original unicode?
Apparently urllib.unquote does not support unicode.
%uXXXX is a non-standard encoding scheme that has been rejected by the W3C, despite the fact that an implementation continues to live on in JavaScript land.
The more common technique seems to be to UTF-8 encode the string and then %-escape the resulting bytes using %XX. This scheme is supported by urllib.unquote:
>>> urllib.unquote("%0a")
'\n'
Unfortunately, if you really need to support %uXXXX, you will probably have to roll your own decoder. Otherwise, it is likely to be far preferable to simply UTF-8 encode your unicode and then %-escape the resulting bytes.
A more complete example:
>>> u"Tanım"
u'Tan\u0131m'
>>> url = urllib.quote(u"Tanım".encode('utf8'))
>>> urllib.unquote(url).decode('utf8')
u'Tan\u0131m'
import re

def unquote(text):
    def unicode_unquoter(match):
        return unichr(int(match.group(1), 16))
    return re.sub(r'%u([0-9a-fA-F]{4})', unicode_unquoter, text)
This will do it if you absolutely have to have this (I really do agree with the cries of "non-standard"):
from urllib import unquote

def unquote_u(source):
    result = unquote(source)
    if '%u' in result:
        result = result.replace('%u', '\\u').decode('unicode_escape')
    return result
print unquote_u('Tan%u0131m')
> Tanım
There is a bug in the above version: it freaks out sometimes when there are both ASCII-encoded and unicode-encoded characters in the string. I think it's specifically when there are characters from the upper 128 range, like '\xab', in addition to unicode.
e.g. "%5B%AB%u03E1%BB%5D" causes this error.
I found that if you just did the unicode ones first, the problem went away:
from urllib import unquote

def unquote_u(source):
    result = source
    if '%u' in result:
        result = result.replace('%u', '\\u').decode('unicode_escape')
    result = unquote(result)
    return result
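A quick check with the string that triggered the error (Python 2 session; the repr below is what I would expect, but verify on your own data):

>>> unquote_u("%5B%AB%u03E1%BB%5D")
u'[\xab\u03e1\xbb]'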
You have a URL using a non-standard encoding scheme, rejected by standards bodies but still being produced by some encoders. The Python urllib.parse.unquote() function can't handle these.
Creating your own decoder is not that hard, luckily. %uhhhh entries are meant to be UTF-16 codepoints here, so we need to take surrogate pairs into account. I've also seen %hh codepoints mixed in, for added confusion.
With that in mind, here is a decoder which works in both Python 2 and Python 3, provided you pass in a str object in Python 3 (Python 2 cares less):
try:
    # Python 3
    from urllib.parse import unquote
    unichr = chr
except ImportError:
    # Python 2
    from urllib import unquote

def unquote_unicode(string, _cache={}):
    string = unquote(string)  # handle two-digit %hh components first
    parts = string.split(u'%u')
    if len(parts) == 1:
        return string  # no %u escapes present
    r = [parts[0]]
    append = r.append
    for part in parts[1:]:
        try:
            digits = part[:4].lower()
            if len(digits) < 4:
                raise ValueError
            ch = _cache.get(digits)
            if ch is None:
                ch = _cache[digits] = unichr(int(digits, 16))
            if (
                not r[-1] and
                u'\uDC00' <= ch <= u'\uDFFF' and
                u'\uD800' <= r[-2] <= u'\uDBFF'
            ):
                # UTF-16 surrogate pair, replace with single non-BMP codepoint
                r[-2] = (r[-2] + ch).encode(
                    'utf-16', 'surrogatepass').decode('utf-16')
            else:
                append(ch)
            append(part[4:])
        except ValueError:
            append(u'%u')
            append(part)
    return u''.join(r)
The function is heavily inspired by the current standard-library implementation.
Demo:
>>> print(unquote_unicode('Tan%u0131m'))
Tanım
>>> print(unquote_unicode('%u05D0%u05D9%u05DA%20%u05DE%u05DE%u05D9%u05E8%u05D9%u05DD%20%u05D0%u05EA%20%u05D4%u05D8%u05E7%u05E1%u05D8%20%u05D4%u05D6%u05D4'))
איך ממירים את הטקסט הזה
>>> print(unquote_unicode('%ud83c%udfd6')) # surrogate pair
🏖
>>> print(unquote_unicode('%ufoobar%u666')) # incomplete
%ufoobar%u666
The function works on Python 2 (tested on 2.4 - 2.7) and Python 3 (tested on 3.3 - 3.8).
I have a string that I got from reading an HTML webpage, and it contains bullet symbols like "•" from a bulleted list. Note that the text is HTML source read from a webpage with Python 2.7's urllib2.urlopen(webaddress).read().
I know the unicode code point for the bullet character is U+2022, but how do I actually replace that unicode character with something else?
I tried doing
str.replace("•", "something")
but it does not appear to work... how do I do this?
Decode the string to Unicode. Assuming it's UTF-8-encoded:
str.decode("utf-8")
Call the replace method and be sure to pass it a Unicode string as its first argument:
str.decode("utf-8").replace(u"\u2022", "*")
Encode back to UTF-8, if needed:
str.decode("utf-8").replace(u"\u2022", "*").encode("utf-8")
(Fortunately, Python 3 puts a stop to this mess. Step 3 should really only be performed just prior to I/O. Also, mind you that calling a string str shadows the built-in type str.)
Work with unicode strings:
>>> special = u"\u2022"
>>> abc = u'ABC•def'
>>> abc.replace(special,'X')
u'ABCXdef'
import re
regex = re.compile(u"\u2022", re.UNICODE)
newstring = re.sub(regex, something, yourstring)
Try this one; you will get the output as a normal string:
str.encode().decode('unicode-escape')
and after that you can perform any replacement:
str.replace('•', 'something')
str1 = "This is Python\u500cPool"
Encode the string to ASCII, replacing all non-ASCII characters with '?'.
str1 = str1.encode("ascii", "replace")
Decode the byte stream to string.
str1 = str1.decode(encoding="utf-8", errors="ignore")
Replace the question mark with the desired character.
str1 = str1.replace("?"," ")
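The steps above, combined (note that any '?' already present in the original text will be replaced too):

str1 = "This is Python\u500cPool"
str1 = str1.encode("ascii", "replace").decode().replace("?", " ")
print(str1)  # This is Python Pool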
Funny enough, the answer is hidden among the answers.
str.replace("•", "something")
would work if you use the right semantics:
str.replace(u"\u2022", "something")
works wonders ;), thanks to RParadox for the hint.
If you want to remove all \u characters, the code below may help:
def replace_unicode_character(content: str):
    content = content.encode('utf-8')
    if "\\x80" in str(content):
        count_unicode = 0
        i = 0
        while i < len(content):
            if "\\x" in str(content[i:i + 1]):
                if count_unicode % 3 == 0:
                    content = content[:i] + b'\x80\x80\x80' + content[i + 3:]
                    i += 2
                count_unicode += 1
            i += 1
        content = content.replace(b'\x80\x80\x80', b'')
    return content.decode('utf-8')
Suppose I have a string which is a backslash-escaped version of another string. Is there an easy way, in Python, to unescape the string? I could, for example, do:
>>> escaped_str = '"Hello,\\nworld!"'
>>> raw_str = eval(escaped_str)
>>> print raw_str
Hello,
world!
>>>
However that involves passing a (possibly untrusted) string to eval() which is a security risk. Is there a function in the standard lib which takes a string and produces a string with no security implications?
>>> print '"Hello,\\nworld!"'.decode('string_escape')
"Hello,
world!"
You can use ast.literal_eval, which is safe:
Safely evaluate an expression node or a string containing a Python expression. The string or node provided may only consist of the following Python literal structures: strings, numbers, tuples, lists, dicts, booleans, and None.
Like this:
>>> import ast
>>> escaped_str = '"Hello,\\nworld!"'
>>> print ast.literal_eval(escaped_str)
Hello,
world!
All the answers given so far will break on general Unicode strings. The following works for Python 3 in all cases, as far as I can tell:
from codecs import encode, decode
sample = u'mon€y\\nröcks'
result = decode(encode(sample, 'latin-1', 'backslashreplace'), 'unicode-escape')
print(result)
In recent Python versions, this also works without the import:
sample = u'mon€y\\nröcks'
result = sample.encode('latin-1', 'backslashreplace').decode('unicode-escape')
As suggested by obataku, you can also use the literal_eval method from the ast module like so:
import ast
sample = u'mon€y\\nröcks'
print(ast.literal_eval(F'"{sample}"'))
Or like this when your string really contains a string literal (including the quotes):
import ast
sample = u'"mon€y\\nröcks"'
print(ast.literal_eval(sample))
However, if you are uncertain whether the input string uses double or single quotes as delimiters, or when you cannot assume it to be properly escaped at all, then literal_eval may raise a SyntaxError while the encode/decode method will still work.
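For instance (a hypothetical sample; literal_eval chokes on the unescaped inner quotes, while the codec round trip does not):

import ast

broken = '"mon€y "quoted" badly"'
try:
    ast.literal_eval(broken)
except (SyntaxError, ValueError):
    print("literal_eval failed")

# the encode/decode method still works on the same input
print(broken.encode('latin-1', 'backslashreplace').decode('unicode-escape'))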
In Python 3, str objects don't have a decode method, so you have to go through a bytes object. ChristopheD's answer covers Python 2.
# create a `bytes` object from a `str`
my_str = "Hello,\\nworld"
# (pick an encoding suitable for your str, e.g. 'latin1')
my_bytes = my_str.encode("utf-8")
# or directly
my_bytes = b"Hello,\\nworld"
print(my_bytes.decode("unicode_escape"))
# "Hello,
# world"
For Python 3, consider:
my_string.encode('raw_unicode_escape').decode('unicode_escape')
The 'raw_unicode_escape' codec encodes to latin1, but first replaces all other Unicode code points with an escaped '\uXXXX' or '\UXXXXXXXX' form. Importantly, it differs from the normal 'unicode_escape' codec in that it does not touch existing backslashes.
So when the normal 'unicode_escape' decoder is applied, both the newly-escaped code points and the originally-escaped elements are treated equally, and the result is an unescaped native Unicode string.
(The 'raw_unicode_escape' decoder appears to pay attention only to the '\uXXXX' and '\UXXXXXXXX' forms, ignoring all other escapes.)
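A quick look at the intermediate bytes (Python 3 session, reusing the sample string from the earlier answer):

>>> sample = 'mon€y\\nröcks'
>>> sample.encode('raw_unicode_escape')    # € escaped, ö and the backslash left alone
b'mon\\u20acy\\nr\xf6cks'
>>> sample.encode('raw_unicode_escape').decode('unicode_escape')
'mon€y\nröcks'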
Documentation:
https://docs.python.org/3/library/codecs.html?highlight=codecs#text-encodings
A custom string parser to decode only some backslash-escapes, in this case \" and \':
def backslash_decode(src):
    "decode backslash-escapes"
    slashes = 0  # count backslashes
    dst = ""
    for loc in range(0, len(src)):
        char = src[loc]
        if char == "\\":
            slashes += 1
            if slashes == 2:
                dst += char  # decode backslash
                slashes = 0
        elif slashes == 0:
            dst += char  # normal char
        else:  # slashes == 1
            if char == '"':
                dst += char  # decode double-quote
            elif char == "'":
                dst += char  # decode single-quote
            else:
                dst += "\\" + char  # keep backslash-escapes like \n or \t
            slashes = 0
    return dst
src = "a" + "\\\\" + r"\'" + r'\"' + r"\n" + r"\t" + r"\x" + "z" # input
exp = "a" + "\\" + "'" + '"' + r"\n" + r"\t" + r"\x" + "z" # expected output
res = backslash_decode(src)
print(res)
assert res == exp
I'm using a small Python script to generate some binary data that will be used in a C header.
This data should be declared as a char[], and it would be nice if it could be encoded as a string (with the pertinent escape sequences for bytes outside the range of printable ASCII chars) to keep the header more compact than a decimal or hexadecimal array encoding would.
The problem is that when I print the repr of a Python string, it is delimited by single quotes, and C doesn't like that. The naive solution is to do:
'"%s"'%repr(data)[1:-1]
but that doesn't work when one of the bytes in the data happens to be a double quote, so I'd need them to be escaped too.
I think a simple replace('"', '\\"') could do the job, but maybe there's a better, more pythonic solution out there.
Extra point:
It would also be convenient to split the data into lines of approximately 80 characters, but the simple approach of splitting the source string into chunks of size 80 won't work, as each non-printable character takes 2 or 3 characters in the escape sequence. Splitting the repr into chunks of 80 won't help either, as it could divide an escape sequence.
Any suggestions?
You could try json.dumps:
>>> import json
>>> print(json.dumps("hello world"))
"hello world"
>>> print(json.dumps('hëllo "world"!'))
"h\u00ebllo \"world\"!"
I don't know for sure whether json strings are compatible with C but at least they have a pretty large common subset and are guaranteed to be compatible with javascript;).
Better not to hack the repr() but to use the right encoding from the beginning. You can get the repr's encoding directly with the string_escape codec:
>>> "naïveté".encode("string_escape")
'na\\xc3\\xafvet\\xc3\\xa9'
>>> print _
na\xc3\xafvet\xc3\xa9
For escaping the "-quotes I think using a simple replace after escape-encoding the string is a completely unambiguous process:
>>> '"%s"' % 'data:\x00\x01 "like this"'.encode("string_escape").replace('"', r'\"')
'"data:\\x00\\x01 \\"like this\\""'
>>> print _
"data:\x00\x01 \"like this\""
If you're asking a python str for its repr, I don't think the type of quote is really configurable. From the PyString_Repr function in the python 2.6.4 source tree:
/* figure out which quote to use; single is preferred */
quote = '\'';
if (smartquotes &&
    memchr(op->ob_sval, '\'', Py_SIZE(op)) &&
    !memchr(op->ob_sval, '"', Py_SIZE(op)))
    quote = '"';
So, I guess use double quotes if there is a single quote in the string, but don't even then if there is a double quote in the string.
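You can see that behaviour in any CPython REPL:

>>> print(repr("has a ' quote"))
"has a ' quote"
>>> print(repr('has a " quote'))
'has a " quote'
>>> print(repr('has both \' and "'))
'has both \' and "'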
I would try something like writing my own class to contain the string data instead of using the built in string to do it. One option would be deriving a class from str and writing your own repr:
class MyString(str):
    __slots__ = []
    def __repr__(self):
        return '"%s"' % self.replace('"', r'\"')

print repr(MyString(r'foo"bar'))
Or, don't use repr at all:
def ready_string(string):
    return '"%s"' % string.replace('"', r'\"')

print ready_string(r'foo"bar')
This simplistic quoting might not do the "right" thing if there's already an escaped quote in the string.
repr() isn't what you want. There's a fundamental problem: repr() can use any representation of the string that can be evaluated as Python to produce the string. That means, in theory, that it might decide to use any number of other constructs which wouldn't be valid in C, such as """long strings""".
This code is probably the right direction. I've used a default of wrapping at 140, which is a sensible value for 2009, but if you really want to wrap your code to 80 columns, just change it.
If unicode=True, it outputs a L"wide" string, which can store Unicode escapes meaningfully. Alternatively, you might want to convert Unicode characters to UTF-8 and output them escaped, depending on the program you're using them in.
def string_to_c(s, max_length=140, unicode=False):
    ret = []

    # Try to split on whitespace, not in the middle of a word.
    split_at_space_pos = max_length - 10
    if split_at_space_pos < 10:
        split_at_space_pos = None

    position = 0
    if unicode:
        position += 1
        ret.append('L')

    ret.append('"')
    position += 1

    for c in s:
        newline = False
        if c == "\n":
            to_add = "\\\n"
            newline = True
        elif ord(c) < 32 or 0x80 <= ord(c) <= 0xff:
            to_add = "\\x%02x" % ord(c)
        elif ord(c) > 0xff:
            if not unicode:
                raise ValueError, "string contains unicode character but unicode=False"
            to_add = "\\u%04x" % ord(c)
        elif "\\\"".find(c) != -1:
            to_add = "\\%c" % c
        else:
            to_add = c

        ret.append(to_add)
        position += len(to_add)
        if newline:
            position = 0

        if split_at_space_pos is not None and position >= split_at_space_pos and " \t".find(c) != -1:
            ret.append("\\\n")
            position = 0
        elif position >= max_length:
            ret.append("\\\n")
            position = 0

    ret.append('"')
    return "".join(ret)
print string_to_c("testing testing testing testing testing testing testing testing testing testing testing testing testing testing testing testing testing", max_length = 20)
print string_to_c("Escapes: \"quote\" \\backslash\\ \x00 \x1f testing \x80 \xff")
print string_to_c(u"Unicode: \u1234", unicode=True)
print string_to_c("""New
lines""")