Unicode being re-unicoded

I'm scraping info from Facebook which compiles weirdly. The source for a page returns the name "Trentemøller" as a regular string with a unicode character:
Trentem\u00f8ller
When I try to print it or add it to a list (print u'%s' % name or print unicode(name)), the backslash itself gets escaped:
u'Trentem\\u00f8ller'
['foo', 'bar', u'Trentem\u00f8ller']
What is the proper way to treat this string? Ideally it would be stored in the list as a u'' string containing the real character, without the added backslash.

If you're in control of forming the unicode string, then use just one backslash:
>>> print u'Trentem\u00f8ller'
Trentemøller
If the regular string has already been formed by the screen scraper, you will need to re-evaluate the string to turn the backslash escape sequences into real unicode characters. The eval builtin would be tempting, but it is safer to use ast.literal_eval instead:
>>> import ast
>>> s = 'Trentem\u00f8ller' # a regular string
>>> print ast.literal_eval('u"""' + s + '"""')
Trentemøller
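On Python 3 the u"""...""" trick is less natural, so here is a minimal sketch of two alternatives. It assumes the scraped value is a plain ASCII string holding a literal backslash-u sequence; the variable names are made up for illustration, and the JSON route additionally assumes the page source is JSON (it would fail on a value containing an unescaped double quote).

```python
import json

# Hypothetical scraped value: a plain string with a literal backslash + "u00f8".
name = "Trentem\\u00f8ller"

# Option 1 (assumes the source is JSON): let the JSON parser decode the escape.
from_json = json.loads('"' + name + '"')

# Option 2 (assumes the text is pure ASCII): decode the escape sequences directly.
from_codec = name.encode("ascii").decode("unicode_escape")

# Both yield the 12-character name with a real U+00F8 character.
assert from_json == from_codec == "Trentem\u00f8ller"
assert len(name) == 17 and len(from_json) == 12
```

If the whole response is JSON, parsing the complete document with json.loads is likely the cleaner choice, since it handles every escape in one pass.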

Related

Read regexes from file and avoid or undo escaping

I want to read regular expressions from a file, where each line contains a regex:
lorem.*
dolor\S*
The following code is supposed to read each and append it to a list of regex strings:
vocabulary=[]
with open(path, "r") as vocabularyFile:
for term in vocabularyFile:
term = term.rstrip()
vocabulary.append(term)
This code seems to escape the \ special character in the file as \\. How can I either avoid the escaping or unescape the string so that it behaves as if I had written this?
regex = r"dolor\S*"

You are getting confused by echoing the value. The Python interpreter echoes values by printing the repr() function result, and this makes sure to escape any meta characters:
>>> regex = r"dolor\S*"
>>> regex
'dolor\\S*'
regex is still an 8-character string, not 9, and the character at index 5 is a single backslash:
>>> regex[4]
'r'
>>> regex[5]
'\\'
>>> regex[6]
'S'
Printing the string writes out all characters verbatim, so no escaping takes place:
>>> print(regex)
dolor\S*
The same process is applied to the contents of containers, like a list or a dict:
>>> container = [regex, 'foo\nbar']
>>> print(container)
['dolor\\S*', 'foo\nbar']
Note that I didn't echo there, I printed. str(list_object) produces the same output as repr(list_object) here.
If you were to print individual elements from the list, you get the same unescaped result again:
>>> print(container[0])
dolor\S*
>>> print(container[1])
foo
bar
Note how the \n in the second element was written out as a newline now. It is for that reason that containers use repr() for contents; to make otherwise hard-to-detect or non-printable data visible.
In other words, your strings do not contain escaped backslashes here; the doubled backslash only appears in the repr() display.
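To check this against the file-reading code from the question, here is a small Python 3 sketch. io.StringIO stands in for the real vocabulary file so the example is self-contained; the sample text "dolores" is an assumption chosen for illustration.

```python
import io
import re

# Stand-in for the vocabulary file: two lines, the second holding one backslash.
vocabulary_file = io.StringIO("lorem.*\ndolor\\S*\n")

vocabulary = [line.rstrip() for line in vocabulary_file]

# The term read from the "file" equals the raw literal and has 8 characters.
assert vocabulary[1] == r"dolor\S*"
assert len(vocabulary[1]) == 8

# The terms work as regexes without any unescaping step.
matches = [bool(re.match(term, "dolores")) for term in vocabulary]
print(matches)  # [False, True]
```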

how to properly use a unicode string in python regex

I am getting an input regular expression from a user which is saved as a unicode string. Do I have to turn the input string into a raw string before compiling it as a regex object? Or is it unnecessary? Am I converting it to a raw string properly?
import re
input_regex_as_unicode = u"^(.){1,36}$"
string_to_check = "342342dedsfs"
# leave as unicode
compiled_regex = re.compile(input_regex_as_unicode)
match_string = re.match(compiled_regex, string_to_check)
# convert to raw
compiled_regex = re.compile(r'' + input_regex_as_unicode)
match_string = re.match(compiled_regex, string_to_check)
@Ahsanul Haque: my question is more regular-expression specific, namely whether the re module handles the unicode string properly when compiling it into a regex object.

The re module handles both unicode strings and normal strings properly, you do not need to convert them to anything (but you should be consistent in your use of strings).
There is no such thing as a "raw string" type; there is only raw string notation for literals. You can use that notation in your code if it helps you with strings containing backslashes. For instance, to match a newline character you could use '\\n', u'\\n', r'\n' or ur'\n'.
Your use of the raw string notation in your example does nothing since r'' and '' evaluate to the same string.
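A short Python 3 sketch of both points, using the pattern and test string from the question (in Python 3 every str is unicode, so the u prefix is omitted here):

```python
import re

pattern_text = "^(.){1,36}$"

# Concatenating with r'' changes nothing: r'' is just the empty string.
assert r"" + pattern_text == pattern_text

# Raw notation only affects how a literal is written, not the resulting value.
assert "\\n" == r"\n"

# The pattern can be compiled directly, with no conversion step.
compiled = re.compile(pattern_text)
result = compiled.match("342342dedsfs")
print(result is not None)  # True
```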

How to assign '\'(-inf-24.5]\'' to a python string?

s='\'(-inf-24.5]\'' # this is not working
What should be put before \ to include it?
We have to assign '\'(-inf-24.5]\'' to s.
The last two characters are two single quotes, not one double quote.
The string should literally contain the given single backslashes, as it is to be inserted as-is into a column.

You can try this:
>>> s="\\'(-inf-24.5]\\'"
>>> print s
\'(-inf-24.5]\'
or
>>> s="'\\'(-inf-24.5]\\''"
>>> print s
'\'(-inf-24.5]\''
Basically, you need to escape the backslash: when you write \' normally, Python treats it as an escaped '. Also, Python string literals can be delimited with either "" or '', so you can mix them together to get the desired result.

>>> s = r"'\'(-inf-24.5]\''"
>>> s
"'\\'(-inf-24.5]\\''"
>>> print(s)
'\'(-inf-24.5]\''
Prefixing a string literal with r makes it a raw string literal, which tells the interpreter to take backslashes literally instead of treating them as the start of an escape sequence. The one thing a raw literal can't do is end with a single backslash (such a backslash would have to be concatenated from a separate string).
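The two answers produce the same value, which a quick Python 3 check can confirm. The path-like string at the end is a made-up example used only to illustrate the trailing-backslash workaround.

```python
# Escaped notation and raw notation for the same 17-character string.
s_escaped = "'\\'(-inf-24.5]\\''"
s_raw = r"'\'(-inf-24.5]\''"

assert s_escaped == s_raw
assert s_raw.count("\\") == 2  # exactly two real backslashes
print(s_raw)  # '\'(-inf-24.5]\''

# A raw literal cannot end in a single backslash, so append it separately.
trailing = r"C:\temp" + "\\"
assert trailing.endswith("\\") and len(trailing) == 8
```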

Python putting r before unicode string variable

For static strings, putting an r in front of the string would give the raw string (e.g. r'some \' string'). Since it is not possible to put r in front of a unicode string variable, what is the minimal approach to dynamically convert a string variable to its raw form? Should I manually substitute all backslashes with double backslashes?
str_var = u"some text with escapes e.g. \( \' \)"
raw_str_var = ???

If you really need to escape a string, let's say you want to print a newline as \n, you can use the encode method with the Python specific string_escape encoding:
>>> s = "hello\nworld"
>>> e = s.encode("string_escape")
>>> e
'hello\\nworld'
>>> print s
hello
world
>>> print e
hello\nworld
You didn't mention anything about unicode, or which Python version you are using, but if you are dealing with unicode strings you should use unicode_escape instead.
>>> u = u"föö\nbär"
>>> print u
föö
bär
>>> print u.encode('unicode_escape')
f\xf6\xf6\nb\xe4r
Your post originally had the regex tag, maybe re.escape is what you're actually looking for?
>>> import re
>>> re.escape(u"foo\nbar\'baz")
u"foo\\\nbar\\'baz"
Note the "double escapes"; i.e. printing the above string yields:
foo\
bar\'baz
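The string_escape codec exists only in Python 2. As a rough Python 3 counterpart (a sketch, not a drop-in replacement), unicode_escape can be used for both directions; the decode step back assumes the escaped text is pure ASCII.

```python
s = "hello\nworld"

# Turn the real newline into the two characters backslash + n.
escaped = s.encode("unicode_escape").decode("ascii")
assert escaped == "hello\\nworld"
assert len(s) == 11 and len(escaped) == 12
print(escaped)  # hello\nworld

# Reverse the process to get the original string back.
restored = escaped.encode("ascii").decode("unicode_escape")
assert restored == s

# Non-ASCII characters in the Latin-1 range come out as \xNN escapes.
u_escaped = "f\u00f6\u00f6\nb\u00e4r".encode("unicode_escape").decode("ascii")
print(u_escaped)  # f\xf6\xf6\nb\xe4r
```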

There is nothing to convert - the r prefix is only significant in source code notation, not for program logic.
As a rule, if you use a single backslash in a normal string literal and it doesn't start a valid escape sequence, the backslash is kept as a literal character; the interpreter then displays it as a double backslash:
>>> "\n \("
'\n \\('
Since it may be difficult to remember all the valid/invalid escape sequences, raw string notation was introduced. But there is no way and no need to convert a string after it has been defined.
In your case, the correct approach would be to use
str_var = ur"some text with escapes e.g. \( \' \)"
which happens to result in the same string here, but is more explicit.
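For comparison, a small Python 3 sketch of the same idea. Note that Python 3 does not accept the ur prefix, and every str is already unicode, so a plain r prefix is used; the escaped variant doubles each backslash to avoid relying on unrecognized escape sequences.

```python
# Two notations for the same value: doubled backslashes vs. raw notation.
escaped = "some text with escapes e.g. \\( \\' \\)"
raw = r"some text with escapes e.g. \( \' \)"

# There is no separate "raw" type or state; both are ordinary str objects.
assert escaped == raw
assert type(raw) is str
assert raw.count("\\") == 3
print(raw)  # some text with escapes e.g. \( \' \)
```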

Multiple Quotes in String

In Python how would I write the string '"['BOS']"'.
I tried entering "\"['BOS']\"" but this gives the output '"[\'BOS\']"' with added backslashes in front of the '.

You can use triple quotes:
'''"['BOS']"'''
What you did ("\"['BOS']\"") is fine too. You get the backslashes on output, but they aren't part of the string:
>>> a = "\"['BOS']\""
>>> a
'"[\'BOS\']"' # this is the representation of the string
>>> print a
"['BOS']" # this is the actual content
When you type an expression such as a into the console, it's the same as writing print repr(a). repr(a) returns a string that can be used to reconstruct the original value, hence the quotes around the string and the backslashes.
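The difference between the content and its representation can be checked directly. This is a Python 3 sketch; ast.literal_eval is used here only to show that the repr() output parses back to the original value.

```python
import ast

a = "\"['BOS']\""

# The escaped form and the triple-quoted form are the same 9-character string.
assert a == '''"['BOS']"'''
assert len(a) == 9

# repr() adds two surrounding quotes and two backslashes for display.
shown = repr(a)
assert len(shown) == len(a) + 4

# The representation evaluates back to the original content.
assert ast.literal_eval(shown) == a
print(a)  # "['BOS']"
```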

You should use triple quotes so that you don't need to use backslashes.
'''"['BOS']"'''
The reason you got backslashes in your output is that the Python console displays the repr() of the string, which adds them:
>>> s = '''"['BOS']"'''
>>> s
'"[\'BOS\']"'
>>>

Enclose the entire string with """ or ''' (you would use ''' if the outermost quotation marks were ") in cases like these to make things simpler.
"""'"['BOS']"'"""

You can build it dynamically as well:
>>> print('"{}"'.format("['BOS']"))
"['BOS']"
>>> print('"' + "['BOS']" + '"')
"['BOS']"
