Decode bad escape characters in python - python

So I have a database with a lot of names. The names have bad characters. For example, a name in a record is José Florés
I wanted to clean this to get José Florés
I tried the following
name = " José Florés "
print(name.encode('iso-8859-1', errors='ignore').decode('utf8', errors='backslashreplace'))
The output mangles the last name: ' José Flor\\xe9s '
What is the best way to solve this? The names can have any kind of unicode or hex escape sequences.

ftfy is a Python library that fixes Unicode text broken in various ways, via a function named fix_text.
from ftfy import fix_text

def convert_iso_name_to_string(name):
    result = []
    for word in name.split():
        result.append(fix_text(word))
    return ' '.join(result)

name = "José Florés"
assert convert_iso_name_to_string(name) == "José Florés"
Using the fix_text method, the names can be standardized, which is an alternative way to solve the problem.
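For the common special case where the text is UTF-8 that was wrongly decoded as Latin-1 (classic mojibake), the repair can be sketched with the standard library alone. This is a minimal sketch and fix_mojibake is a hypothetical helper; ftfy additionally handles mixed and more exotic breakage:

```python
def fix_mojibake(text):
    # Re-encode the mis-decoded text back to the original bytes,
    # then decode those bytes the way they were meant to be read.
    return text.encode('latin-1').decode('utf-8')

broken = 'JosÃ© FlorÃ©s'   # UTF-8 bytes that were wrongly decoded as Latin-1
assert fix_mojibake(broken) == 'José Florés'
```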

We'll start with an example string containing a non-ASCII character (here, “é”):
s = 'Florés'
Now if we reference and print the string, it gives us essentially the same result:
>>> s
'Florés'
>>> print(s)
Florés
In contrast to Python 2.x, here s is already a Unicode string: all strings in Python 3.x are automatically Unicode. The visible difference is that s wasn't changed after we instantiated it.
You can find the same here: Encoding and Decoding Strings


regex windows path incomplete escape '\U' [duplicate]

Is there a way to declare a string variable in Python such that everything inside of it is automatically escaped, or has its literal character value?
I'm not asking how to escape the quotes with slashes, that's obvious. What I'm asking for is a general purpose way for making everything in a string literal so that I don't have to manually go through and escape everything for very large strings.
Raw string literals:
>>> r'abc\dev\t'
'abc\\dev\\t'
If you're dealing with very large strings, specifically multiline strings, be aware of the triple-quote syntax:
a = r"""This is a multiline string
with more than one line
in the source code."""
There is no such thing. It looks like you want something like "here documents" in Perl and the shells, but Python doesn't have that.
Using raw strings or multiline strings only means that there are fewer things to worry about. If you use a raw string then you still have to work around a terminal "\" and with any string solution you'll have to worry about the closing ", ', ''' or """ if it is included in your data.
That is, there's no way to have the string
' ''' """ " \
properly stored in any Python string literal without internal escaping of some sort.
You will find Python's string literal documentation here:
http://docs.python.org/tutorial/introduction.html#strings
and here:
http://docs.python.org/reference/lexical_analysis.html#literals
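The terminal-backslash caveat mentioned above can be made concrete (a small sketch):

```python
# A raw string literal cannot end with an odd number of backslashes:
# r'C:\new\dir\' is a syntax error. Append the final backslash separately.
path = r'C:\new\dir' + '\\'

assert path == 'C:\\new\\dir\\'   # three single backslashes in the value
assert len(path) == 11
```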
The simplest example would be using the 'r' prefix:
ss = r'Hello\nWorld'
print(ss)
Hello\nWorld
(Assuming you are not required to input the string directly within Python code.)
To get around the issue Andrew Dalke pointed out, simply type the literal string into a text file and then read it back:
input_ = '/directory_of_text_file/your_text_file.txt'
with open(input_) as input_open:
    input_string = input_open.read()
print(input_string)
This will print the literal text of whatever is in the text file, even if it is:
' ''' """ " \
Not fun or optimal, but it can be useful, especially if you have 3 pages of code that would've needed character escaping.
Use print and repr:
>>> s = '\tgherkin\n'
>>> s
'\tgherkin\n'
>>> print(s)
gherkin
>>> repr(s)
"'\\tgherkin\\n'"
# print(repr(..)) gets literal
>>> print(repr(s))
'\tgherkin\n'
>>> repr('\tgherkin\n')
"'\\tgherkin\\n'"
>>> print('\tgherkin\n')
gherkin
>>> print(repr('\tgherkin\n'))
'\tgherkin\n'
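A practical use of this: repr exposes characters that print invisibly (a small sketch):

```python
a = 'gherkin'
b = 'gherkin\xa0'   # trailing non-breaking space, invisible when printed

assert a != b                        # the strings really differ
assert repr(b) == "'gherkin\\xa0'"   # repr makes the difference visible
```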

Customized non-ascii characters flagger

I've looked around for a ready-made solution, but I couldn't find one for the use case I'm facing.
Use Case
I'm building a 'website' QA test where the script will go through a bulk of HTML documents, and identify any rogue characters. I cannot use pure non-ascii method since the HTML documents contain characters such as ">" and other minor characters. Therefore, I am building up a unicode rainbow dictionary that identifies some of the common non-ascii characters that my team and I frequently see. The following is my Python code.
# -*- coding: utf-8 -*-
import re

unicode_rainbow_dictionary = {
    u'\u00A0': ' ',
    u'\uFB01': 'fi',
}

strings = ["This contains the annoying non-breaking space", "This is fine!", "This is not ﬁne!"]

for string in strings:
    for regex in unicode_rainbow_dictionary:
        result = re.search(regex, string)
        if result:
            print "Epic fail! There is a rogue character in '" + string + "'"
        else:
            print string
The issue here is that the last string in the strings array contains a non-ASCII ligature character (the combined ﬁ). When I run this script, it doesn't capture the ligature character, but it does capture the non-breaking space character in the first case.
What is leading to the false positive?
Use Unicode strings for all text, as @jgfoot points out. The easiest way to do this is to use from __future__ to default to Unicode literals for strings. Additionally, using print as a function will make the code Python 2/3 compatible:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals, print_function
import re

unicode_rainbow_dictionary = {
    '\u00A0': ' ',
    '\uFB01': 'fi',
}

strings = ["This contains the\xa0annoying non-breaking space", "This is fine!", "This is not ﬁne!"]

for string in strings:
    for regex in unicode_rainbow_dictionary:
        result = re.search(regex, string)
        if result:
            print("Epic fail! There is a rogue character in '" + string + "'")
        else:
            print(string)
If you have the possibility, switch to Python 3 as soon as possible! Python 2 is not good at handling Unicode, whereas Python 3 handles it natively.
for string in strings:
    for character in unicode_rainbow_dictionary:
        if character in string:
            print("Rogue character '" + character + "' in '" + string + "'")
I couldn't get the non-breaking space to occur in my test. I got around that by using "This contains the annoying" + chr(160) + "non-breaking space" after which it matched.
Your code doesn't work as expected because, in your "strings" variable, you have unicode characters inside non-unicode strings. You forgot to put the "u" in front of them to signal that they should be treated as unicode strings, so searching for a unicode string inside a non-unicode string doesn't work as expected.
If you change this to:
strings = [u"This contains the annoying non-breaking space",u"This is fine!",u"This is not fine!"]
It works as expected.
Solving unicode headaches like this is a major benefit of Python 3.
Here's an alternative approach to your problem. How about just trying to encode the string as ASCII, and catching errors if it doesn't work?:
def is_this_ascii(s):
    try:
        ignore = unicode(s).encode("ascii")
        return True
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False
strings = [u"This contains the annoying non-breaking space",u"This is fine!",u"This is not fine!"]
for s in strings:
print(repr(is_this_ascii(s)))
##False
##True
##False
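On Python 3 the same encode-and-catch idea no longer needs the unicode builtin; a minimal sketch (is_ascii is a hypothetical name):

```python
def is_ascii(s):
    # str.encode raises UnicodeEncodeError for any character above U+007F
    try:
        s.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

assert is_ascii("This is fine!")
assert not is_ascii("This contains the\xa0annoying non-breaking space")
```

On Python 3.7+, the built-in str.isascii() performs the same check directly.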

Python unicode vs utf-8

I am building a string query (cypher query) to execute it against a database (Neo4J).
I need to concatenate some strings but I am having trouble with encoding.
I am trying to build a unicode string.
# -*- coding: utf-8 -*-
value = u"D'Santana Carlos Lãnez"
key = u"Name"
line = key + u" = "+ repr(value)
print line.encode("utf-8")
I expected to have:
Name = "D'Santana Carlos Lãnez"
But i getting:
Name = u"D'Santana Carlos L\xe3nez"
I imagine that repr is returning a unicode representation, or perhaps I am not using the right function.
Python literal (repr) syntax is not a valid substitute for Cypher string literal syntax. The leading u is only one of the differences between them; notably, Cypher string literals don't have \x escapes, which Python will use for characters between U+0080–U+00FF.
If you need to create Cypher string literals from Python strings you would need to write your own string escaping function that writes output matching that syntax. But you should generally avoid creating queries from variable input. As with SQL databases, the better answer is query parameterisation.
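A sketch of what parameterisation looks like, assuming the official neo4j Python driver (the query text and the session call are illustrative, not taken from the question):

```python
value = u"D'Santana Carlos Lãnez"
query = "MERGE (p:Person {Name: $name}) RETURN p"   # $name is Cypher parameter syntax
params = {"name": value}

# With a real driver session this would be run as:
#   session.run(query, **params)
# The driver handles quoting, so no escaping of the value is needed.
assert "$name" in query
assert params["name"] == value
```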
value = u"D'Santana Carlos Lãnez"
key = u"Name"
line = key + u" = "+ value
print(line)
value is already unicode because of the u prefix in u"...", so you don't need repr() (or unicode() or decode()).
Besides, repr() doesn't convert to unicode. It returns a string that is very useful for debugging: it shows the hex codes of non-ASCII characters, among other things.

latexcodec stripping slashes but not translating characters (Python)

I'm trying to process some Bibtex entries converted to an XML tree via Pybtex. I'd like to go ahead and process all the special characters from their LaTeX specials to unicode characters, via latexcodec. Via question Does pybtex support accent/special characters in .bib file? and the documentation I have checked the syntax, however, I am not getting the correct output.
>>> import latexcodec
>>> name = 'Br\"{u}derle'
>>> name.decode('latex')
u'Br"{u}derle'
I have tested this across different strings and special characters, and it always just strips off the first slash without translating the character. Should I be using latexcodec differently to get the correct output?
Your backslash is not included in the string at all because it is treated as a string escape, so the codec never sees it:
>>> print 'Br\"{u}derle'
Br"{u}derle
Use a raw string:
name = r'Br\"{u}derle'
Alternatively, try reading actual data from a file, in which case the raw/non-raw distinction will not matter. (The distinction only applies to literal strings in Python source code.)
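The point that the backslash never reaches the codec can be checked directly, independent of latexcodec (a small sketch):

```python
cooked = 'Br\"{u}derle'    # \" is just an escaped quote: no backslash survives
raw = r'Br\"{u}derle'      # the raw literal keeps the backslash

assert '\\' not in cooked
assert '\\' in raw
assert len(raw) == len(cooked) + 1
```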

How do I convert unicode escape sequences to unicode characters in a python string

When I tried to get the content of a tag using "unicode(head.contents[3])", I got output similar to this: "Christensen Sk\xf6ld". I want the escape sequence to be converted to the corresponding character. How do I do that in Python?
Assuming Python sees the name as a normal string, you'll first have to decode it to unicode:
>>> name
'Christensen Sk\xf6ld'
>>> unicode(name, 'latin-1')
u'Christensen Sk\xf6ld'
Another way of achieving this:
>>> name.decode('latin-1')
u'Christensen Sk\xf6ld'
Note the "u" in front of the string, signalling it is unicode. If you print this, the accented letter is shown properly:
>>> print name.decode('latin-1')
Christensen Sköld
BTW: when necessary, you can use the "encode" method to turn the unicode into e.g. a UTF-8 string:
>>> name.decode('latin-1').encode('utf-8')
'Christensen Sk\xc3\xb6ld'
I suspect that it's actually working correctly. By default, Python displays strings in ASCII encoding, since not all terminals support Unicode. If you actually print the string, though, it should work. See the following example:
>>> u'\xcfa'
u'\xcfa'
>>> print u'\xcfa'
Ïa
Given a byte string with Unicode escapes, b"\N{SNOWMAN}".decode('unicode-escape') will produce the expected Unicode string u'\u2603'.
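In Python 3, if the data really does contain the six literal characters \xf6 rather than the byte 0xF6, the same unicode-escape trick reads as follows (a minimal sketch; the latin-1 step turns the text back into bytes without altering the escape sequence):

```python
name = 'Christensen Sk\\xf6ld'   # literal backslash, x, f, 6 in the data
fixed = name.encode('latin-1').decode('unicode_escape')

assert fixed == 'Christensen Sköld'
```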
