I am building a Cypher query string to execute against a Neo4j database.
I need to concatenate some strings, but I am having trouble with encoding.
I am trying to build a unicode string.
# -*- coding: utf-8 -*-
value = u"D'Santana Carlos Lãnez"
key = u"Name"
line = key + u" = "+ repr(value)
print line.encode("utf-8")
I expected to have:
Name = "D'Santana Carlos Lãnez"
But I am getting:
Name = u"D'Santana Carlos L\xe3nez"
I imagine that repr() is returning a unicode string, or probably I am not using the right function.
Python literal (repr) syntax is not a valid substitute for Cypher string literal syntax. The leading u is only one of the differences between them; notably, Cypher string literals don't have \x escapes, which Python will use for characters between U+0080–U+00FF.
If you need to create Cypher string literals from Python strings you would need to write your own string escaping function that writes output matching that syntax. But you should generally avoid creating queries from variable input. As with SQL databases, the better answer is query parameterisation.
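For illustration, a minimal escaping sketch might look like the following. The escaping rules assumed here (backslash first, then single quotes, per the openCypher string-literal grammar) should be verified against your server version; the parameterised form is shown alongside as the preferred approach:

```python
# -*- coding: utf-8 -*-

def cypher_quote(s):
    # Sketch of a Cypher single-quoted string literal:
    # escape backslashes first, then single quotes, then wrap.
    return u"'" + s.replace(u"\\", u"\\\\").replace(u"'", u"\\'") + u"'"

value = u"D'Santana Carlos L\u00e3nez"
query = u"MATCH (p {Name: " + cypher_quote(value) + u"}) RETURN p"
print(query)

# Preferred: query parameters, e.g. with the official neo4j driver
# (the session object here is hypothetical, for illustration only):
#   session.run("MATCH (p {Name: $name}) RETURN p", name=value)
```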
value = u"D'Santana Carlos Lãnez"
key = u"Name"
line = key + u" = "+ value
print(line)
value is already unicode because you used the u prefix in u"...", so you don't need repr() (or unicode() or decode()).
Besides, repr() doesn't convert to unicode. It returns a string that is very useful for debugging, because it shows hex codes for non-ASCII characters, among other things.
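(For Python 3 readers: repr() there no longer escapes non-ASCII characters; the ascii() built-in is what produces the old Python 2-style debugging form.)

```python
name = "L\u00e3nez"  # "Lãnez"
print(name)          # Lãnez -- the readable text
print(ascii(name))   # 'L\xe3nez' -- escaped debugging form, like Python 2's repr()
```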
Related
Is there an elegant way to convert "test\207\128" into "testπ" in python?
My issue stems from using avahi-browse on Linux, which has a -p flag to output information in an easy-to-parse format. The problem is that it outputs non-alphanumeric characters as escaped sequences. So a service published as "name#id" gets output by avahi-browse as "name\035id". This can be dealt with by splitting on the \, dropping the leading zero, and using chr(35) to recover the #. That solution breaks on multi-byte UTF-8 characters such as "π", which gets output as "\207\128".
The input string you have is an encoding of a UTF-8 string, in a format that Python can't deal with natively. This means you'll need to write a simple decoder, then use Python to translate the UTF-8 string to a string object:
import re
value = r"test\207\128"
# First, encode to bytes: the \### escapes stand for raw
# UTF-8 bytes, not characters
value = value.encode("utf-8")
# Now replace any "\###" with a byte character based off
# the decimal number captured
value = re.sub(b"\\\\([0-9]{3})", lambda m: bytes([int(m.group(1))]), value)
# And now that we have a normal UTF-8 string, decode it back to a string
value = value.decode("utf-8")
print(value)
# Outputs: testπ
So I have a database with a lot of names. The names have bad characters. For example, a name in a record is José Florés
I wanted to clean this to get José Florés
I tried the following
name = " José Florés "
print(name.encode('iso-8859-1',errors='ignore').decode('utf8',errors='backslashreplace'))
The output messes the last name to ' José Flor\\xe9s '
What is the best way to solve this? The names can have any kind of unicode or hex escape sequences.
ftfy is a python library which fixes unicode text broken in different ways with a function named fix_text.
from ftfy import fix_text
def convert_iso_name_to_string(name):
    result = []
    for word in name.split():
        result.append(fix_text(word))
    return ' '.join(result)
name = "José Florés"
assert convert_iso_name_to_string(name) == "José Florés"
Using the fix_text method the names can be standardized, which is an alternate way to solve the problem.
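For the common Latin-1/UTF-8 mix-up specifically, the repair that fix_text automates can be sketched with the standard library alone. This only works when you know the exact pair of encodings involved; ftfy exists because real-world data usually isn't that tidy:

```python
# "José" stored as UTF-8 but read back as Latin-1 shows up as "JosÃ©"
broken = "Jos\u00c3\u00a9"

# Reverse the wrong decode, then decode correctly
fixed = broken.encode("latin-1").decode("utf-8")
print(fixed)  # José
```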
We'll start with an example string containing a non-ASCII character (here "é", e-acute):
s = 'Florés'
Now if we reference and print the string, it gives us essentially the same result:
>>> s
'Florés'
>>> print(s)
Florés
In contrast to the same string s in Python 2.x, in this case s is already a Unicode string; all strings in Python 3.x are Unicode by default. The visible difference is that s wasn't changed after we instantiated it.
You can find the same here Encoding and Decoding Strings
This question already has answers here:
Process escape sequences in a string in Python
I have a variable which stores the string "u05e2" (The value is constantly changing because I set it within a loop). I want to print the Hebrew letter with that Unicode value. I tried the following but it didn't work:
>>> a = 'u05e2'
>>> print(u'\{}'.format(a))
I got \u05e2 instead of ע (in this case).
I also tried to do:
>>> a = 'u05e2'
>>> b = '\\' + a
>>> print(u'{}'.format(b))
Neither one worked. How can I fix this?
Thanks in advance!
This seems like an X-Y Problem. If you want the Unicode character for a code point, use an integer variable and the function chr (or unichr on Python 2) instead of trying to format an escape code:
>>> for a in range(0x5e0,0x5eb):
... print(hex(a),chr(a))
...
0x5e0 נ
0x5e1 ס
0x5e2 ע
0x5e3 ף
0x5e4 פ
0x5e5 ץ
0x5e6 צ
0x5e7 ק
0x5e8 ר
0x5e9 ש
0x5ea ת
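If the value genuinely arrives as text like 'u05e2' (as in the question), you can parse the hex digits yourself and feed the resulting integer to chr():

```python
a = 'u05e2'            # the value as it arrives in the loop
code = int(a[1:], 16)  # drop the leading 'u', parse the hex digits -> 1506
print(chr(code))       # ע
```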
All you need is a \ before u05e2, but it has to be inside the string literal, so that Python processes the \u escape when it parses the literal.
a = '\u05e2'
print(u'{}'.format(a))
#Output
ע
When you instead try to add the \ at print() time, the backslash stays a literal character: \u escapes are only processed inside string literals, not when strings are built at runtime, so Python does not show the desired result.
a = 'u05e2'
print(u'\{}'.format(a))
#Output
\u05e2
A way to verify the validity of a Unicode escape is to use the ord() built-in function from the Python standard library. This returns the Unicode code point (an integer) of the character passed to it. The function expects a string of length one, i.e. a single character.
a = '\u05e2'
print(ord(a)) #1506, the Unicode code point for the Unicode string stored in a
To print the Unicode character for the above Unicode code value(1506), use the character type formatting with c. This is explained in the Python docs.
print('{0:c}'.format(1506))
#Output
ע
If we pass the unescaped five-character string to ord(), we get an error, because ord() only accepts a single character.
a = 'u05e2'
print(ord(a))
#Error
TypeError: ord() expected a character, but string of length 5 found
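Another option, if the backslash and the digits only meet at runtime, is to process the escape sequence after the fact with the unicode_escape codec. Note that this codec treats the input as Latin-1, so it is only safe when the rest of the text is ASCII:

```python
import codecs

a = 'u05e2'
s = codecs.decode('\\' + a, 'unicode_escape')
print(s)  # ע
```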
This is happening because you have to add the prefix u before the string.
a = u'\u05e2'
print(a)
ע
Hope this helps you.
I tried the following on Codecademy's Python lesson
hobbies = []
# Add your code below!
for i in range(3):
    Hobby = str(raw_input("Enter a hobby:"))
    hobbies.append(Hobby)
print hobbies
With this, it works fine but if instead I try
Hobby = raw_input("Enter a hobby:")
I get [u'Hobby1', u'Hobby2', u'Hobby3']. Where are the extra us coming from?
The question's subject line might be a bit misleading: Python 2's raw_input() normally returns a byte string, NOT a Unicode string.
However, it could return a Unicode string if it or sys.stdin has been altered or replaced (by an application, or as part of an alternative implementation of Python).
Therefore, I believe @ByteCommander is on the right track with his comment:
Maybe this has something to do with the console it's running in?
The Python used by Codecademy is ostensibly 2.7, but (a) it was implemented by compiling the Python interpreter to JavaScript using Emscripten and (b) it's running in the browser; so between those factors, there could very well be some string encoding and decoding injected by Codecademy that isn't present in plain-vanilla CPython.
Note: I have not used Codecademy myself nor do I have any inside knowledge of its inner workings.
'u' means it's unicode. You can also call .encode('utf8') on the result of raw_input() to convert it to a byte string.
Edited:
I checked in python 2.7 it returns byte string not unicode string. So problem is something else here.
Edited:
raw_input() returns unicode if sys.stdin.encoding is a Unicode encoding.
In the Codecademy Python environment, sys.stdin.encoding and sys.stdout.encoding are both None, and the default encoding scheme is ascii.
Python uses this default encoding only if it is unable to find a proper encoding scheme from the environment.
Where are the extra us coming from?
raw_input() returns Unicode strings in your environment
repr() is called for each item of a list if you print it (convert to string)
the text representation (repr()) of a Unicode string is the same as a Unicode literal in Python: u'abc'.
that is why print [raw_input()] may produce: [u'abc'].
You don't see u'' in the first code example because str(unicode_string) calls the equivalent of unicode_string.encode(sys.getdefaultencoding()) i.e., it converts Unicode strings to bytestrings—don't do it unless you mean it.
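The list-versus-item difference is easy to see in Python 3 as well: printing a string shows its characters, while printing a list shows each element's repr(), quotes included (in Python 2, that repr() is exactly where the u'' prefix comes from):

```python
s = 'abc'
print(s)    # abc
print([s])  # ['abc'] -- list elements are shown via repr()
```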
Can raw_input() return unicode?
Yes:
#!/usr/bin/env python2
"""Demonstrate that raw_input() can return Unicode."""
import sys

class UnicodeFile:
    def readline(self, n=-1):
        return u'\N{SNOWMAN}'

sys.stdin = UnicodeFile()
s = raw_input()
print type(s)
print s
Output:
<type 'unicode'>
☃
The practical example is win-unicode-console package which can replace raw_input() to support entering Unicode characters outside of the range of a console codepage on Windows. Related: here's why sys.stdout should be replaced.
May raw_input() return unicode?
Yes.
raw_input() is documented to return a string:
The function then reads a line from input, converts it to a string
(stripping a trailing newline), and returns that.
A string in Python 2 is either a bytestring or a Unicode string (isinstance(s, basestring)).
CPython implementation of raw_input() supports Unicode strings explicitly: builtin_raw_input() can call PyFile_GetLine() and PyFile_GetLine() considers bytestrings and Unicode strings to be strings—it raises TypeError("object.readline() returned non-string") otherwise.
You could encode the strings before appending them to your list:
hobbies = []
# Add your code below!
for i in range(3):
    Hobby = raw_input("Enter a hobby:")
    hobbies.append(Hobby.encode('utf-8'))
print hobbies
I've looked around for a custom-made solution, but I couldn't find a solution for a use case that I am facing.
Use Case
I'm building a 'website' QA test where the script will go through a bulk of HTML documents and identify any rogue characters. I cannot use a pure non-ASCII check, since the HTML documents legitimately contain characters such as ">" and other minor characters. Therefore, I am building up a unicode rainbow dictionary that identifies some of the common non-ASCII characters that my team and I frequently see. The following is my Python code.
# -*- coding: utf-8 -*-
import re
unicode_rainbow_dictionary = {
    u'\u00A0': ' ',
    u'\uFB01': 'fi',
}
strings = ["This contains the annoying non-breaking space","This is fine!","This is not ﬁne!"]
for string in strings:
    for regex in unicode_rainbow_dictionary:
        result = re.search(regex, string)
        if result:
            print "Epic fail! There is a rogue character in '"+string+"'"
        else:
            print string
The issue here is that the last string in the strings array contains a non-ASCII ligature character (the combined ﬁ). When I run this script, it doesn't capture the ligature character, but it does capture the non-breaking-space character in the first case.
What is leading to the false negative?
Use Unicode strings for all text, as @jgfoot points out. The easiest way to do this is to use from __future__ to default to Unicode literals for strings. Additionally, using print as a function will make the code Python 2/3 compatible:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals,print_function
import re
unicode_rainbow_dictionary = {
    '\u00A0': ' ',
    '\uFB01': 'fi',
}

strings = ["This contains the\xa0annoying non-breaking space", "This is fine!", "This is not \ufb01ne!"]

for string in strings:
    for regex in unicode_rainbow_dictionary:
        result = re.search(regex, string)
        if result:
            print("Epic fail! There is a rogue character in '" + string + "'")
        else:
            print(string)
If you have the possibility then switch to Python 3 as soon as possible! Python 2 is not good at handling unicode whereas Python 3 does it natively.
for string in strings:
    for character in unicode_rainbow_dictionary:
        if character in string:
            print("Rogue character '" + character + "' in '" + string + "'")
I couldn't get the non-breaking space to occur in my test. I got around that by using "This contains the annoying" + chr(160) + "non-breaking space" after which it matched.
Your code doesn't work as expected because your "strings" variable holds unicode characters inside non-unicode (byte) strings. You forgot to put the "u" in front of them to signal that they should be treated as unicode strings, so searching for a unicode pattern inside a byte string doesn't behave as expected.
If you change this to:
strings = [u"This contains the annoying non-breaking space",u"This is fine!",u"This is not ﬁne!"]
It works as expected.
Solving unicode headaches like this is a major benefit of Python 3.
Here's an alternative approach to your problem. How about just trying to encode the string as ASCII, and catching errors if it doesn't work?:
def is_this_ascii(s):
    try:
        ignore = unicode(s).encode("ascii")
        return True
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False
strings = [u"This contains the annoying non-breaking space",u"This is fine!",u"This is not ﬁne!"]
for s in strings:
    print(repr(is_this_ascii(s)))
##False
##True
##False
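In Python 3.7+ the same check exists as a built-in, str.isascii(), with no exception handling needed:

```python
strings = ["This contains the\xa0annoying non-breaking space",
           "This is fine!",
           "This is not \ufb01ne!"]
for s in strings:
    print(s.isascii())
# False, True, False
```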