Character encoding in Python to replace '\u2019' with ' - python

I have tried numerous ways to encode this string to get the end result "BACK RUSHIN'", the most important character being the right apostrophe '.
I would like a way of getting to this end result using Python's built-in functions, one that does not discriminate between a normal string and a unicode string.
This was the code I was using to retrieve the string: str(unicode(etree.tostring(root.xpath('path')[0],method='text', encoding='utf-8'),errors='ignore')).strip()
With the result being: 'BACK RUSHIN', where the apostrophe ' is missing.
Another way was: root.xpath('path/text()')
And that result was: u'BACK RUSHIN\u2019' in python.
Lastly if I try: u'BACK RUSHIN\u2019'.encode('ascii', 'replace')
The result is: 'BACK RUSHIN?'
Please no replace functions; I would like to make use of Python's codec libraries.
Also no printing the string because it is being held in a variable.
Thanks

>>> import unidecode
>>> unidecode.unidecode(u'BACK RUSHIN\u2019')
"BACK RUSHIN'"
(unidecode is a third-party package, not part of the standard library.)
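Since the question asks for a solution built on Python's codec machinery, here is a minimal Python 3 sketch that registers a custom error handler with codecs.register_error. The handler name "ascii_fallback" and the ASCII_FALLBACKS table are made up for illustration, and the table only covers a handful of punctuation characters; anything else falls back to "?".

```python
import codecs

# Hypothetical lookup table: extend it with whatever characters you expect.
ASCII_FALLBACKS = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
}


def ascii_fallback(error):
    """Codec error handler: substitute an ASCII stand-in for unencodable text."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    bad = error.object[error.start:error.end]
    replacement = "".join(ASCII_FALLBACKS.get(ch, "?") for ch in bad)
    # Return the replacement and the position where encoding should resume.
    return replacement, error.end


# "ascii_fallback" is an arbitrary name chosen for this sketch.
codecs.register_error("ascii_fallback", ascii_fallback)

result = "BACK RUSHIN\u2019".encode("ascii", "ascii_fallback").decode("ascii")
# result == "BACK RUSHIN'"
```

Once registered, the handler can be passed as the errors argument of any encode call, in the same position where 'replace' or 'ignore' would go.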

Related

Can't replace a string with multiple escape characters

I am having trouble with the replace() method. I want to replace part of a string, and the part I want to replace consists of multiple escape characters. It looks something like this:
['<div class=\"content\">rn
To remove it, I have a block of code;
garbage_text = "[\'<div class=\\\"content\\\">rn "
entry = entry.replace(garbage_text,"")
However, it does not work. Nothing is removed from my complete string. Can anybody point out where exactly my thinking goes wrong? Thanks in advance.
Addition:
The complete string looks like this;
"['<div class=\"content\">rn gitar calmak icin kullanilan minik plastik garip nesne.rn </div>']"
You could use the triple quote format for your replacement string so that you don't have to bother with escaping at all:
garbage_text = """['<div class="content">rn """
Perhaps your 'entry' is not formatted correctly?
With an extra variable 'text', the following worked in Python 3.6.7:
>>> garbage_text
'[\'<div class=\\\'content\'\\">rn '
>>> text
'[\'<div class=\\\'content\'\\">rn And then there were none'
>>> entry = text.replace(garbage_text, "")
>>> entry
'And then there were none'
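One possible reason the original replace() call finds nothing can be checked directly: the over-escaped literal contains real backslash characters, while the complete string from the question (taken here exactly as posted) does not. The sketch below is Python 3; the variable names over_escaped, plain and cleaned are only illustrative.

```python
# The complete string, as posted in the question.
entry = "['<div class=\"content\">rn gitar calmak icin kullanilan minik plastik garip nesne.rn </div>']"

# The search string from the question: each \\ becomes a literal backslash.
over_escaped = "[\'<div class=\\\"content\\\">rn "

# The triple-quoted version needs no escaping at all.
plain = """['<div class="content">rn """

has_backslash = "\\" in over_escaped      # True: literal backslashes present
found_escaped = over_escaped in entry     # False: entry has no backslashes
found_plain = plain in entry              # True

cleaned = entry.replace(plain, "")
# cleaned == "gitar calmak icin kullanilan minik plastik garip nesne.rn </div>']"
```

If the real data does contain backslashes (for example because it was read from a file rather than typed as a literal), the comparison would come out the other way, so it is worth inspecting repr(entry) first.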

Insert invisible unicode into MySQL using python3 but encountered duplicate

When I insert device data into MySQL (v5.5.6) using Python (v3.2), I run into a problem.
This is device A (it contains three Unicode characters and a blank space):
'\u202d\u202d \u202d'
And device B (It is only a blank space):
' '
The problem is that when I insert all the device data into MySQL, I get this error:
Duplicate entry 'activate_device-20151201-1-5740-01000P---‭‭ ‭--' for key 'PRIMARY'
I guess MySQL handles '\u202d' specially (perhaps it is a Unicode character for reversing a string?).
How can I simulate MySQL's handling in python3?
How can I avoid the duplicate?
The expected result is to translate '\u202d\u202d \u202d' to ' ' in python3.
Please help me.
There are some ambiguities here. Do you want to keep only the visible ASCII characters, or visible Unicode characters as well?
If you want to keep only visible ASCII characters, the simple way is to use Python's built-in string module.
import string
new_string = "".join(filter(lambda x:x in string.printable, original_string))
For your specific use case, a space is part of visible ASCII, so the above will convert both '\u202d\u202d \u202d' and ' ' to ' '.
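For the other branch of that question (keeping visible Unicode characters too), one option is to drop only characters in the Unicode "Cf" (format) category, which includes U+202D LEFT-TO-RIGHT OVERRIDE. This is a Python 3 sketch; strip_format_chars is a made-up helper name, and it is not claimed to reproduce MySQL's comparison rules exactly.

```python
import unicodedata


def strip_format_chars(text):
    """Remove invisible formatting characters (Unicode category Cf)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


device_a = "\u202d\u202d \u202d"
device_b = " "

normalized_a = strip_format_chars(device_a)   # " "
normalized_b = strip_format_chars(device_b)   # " "
collides = normalized_a == normalized_b       # True: same key after stripping
```

Normalizing keys this way before the insert makes the collision visible in Python, so duplicates can be detected or merged before MySQL rejects them.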

Decoding text in python

I want to know how to decode certain text, and have found some text like this which I want to decode:
\xe2\x80\x93
I know that printing it will solve it, but I am building a web crawler hence I need to build an index (dictionary) containing words with a list of URLs where the word appears.
Hence I want to do something like this:
dic = {}
dic['\xe2\x80\x93'] = 'http://example.com' #this is the url where the word appears
... but when I do:
print dic
I get:
'\xe2\x80\x93'
... instead of –.
But when I do print dic['\xe2\x80\x93'] I successfully get –.
How can I also get – from print dic?
When you see \xhh, that is a character escape sequence. In this case, it is showing you the hex value of each byte (see: lexical analysis: string-literals).
The reason you see \xhh sometimes, and you see the actual characters when you use print is related to the difference between __str__ and __repr__ in Python.
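The str/repr distinction can be seen in a short Python 3 sketch. Note that this is an assumption-laden translation of the question's Python 2 session: in Python 3 the escaped form shows up for bytes objects, while decoded text displays as the character itself. The names raw, word and index are illustrative.

```python
# The three bytes from the question are the UTF-8 encoding of an en dash.
raw = b"\xe2\x80\x93"
word = raw.decode("utf-8")          # the single character U+2013

# Containers display their elements with repr(), so bytes stay escaped.
shown_bytes = str([raw])            # "[b'\\xe2\\x80\\x93']"

# Decoding first gives a readable key, both alone and inside a dict.
index = {word: ["http://example.com"]}
shown_index = str(index)            # "{'\u2013': ['http://example.com']}"
```

For a crawler index this suggests decoding the fetched bytes to text once, at the boundary, and using the decoded words as dictionary keys.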

splitting unicode string into words

I am trying to split a Unicode string into words (simplistic), like this:
print re.findall(r'(?u)\w+', "раз два три")
What I expect to see is:
['раз','два','три']
But what I really get is:
['\xd1', '\xd0', '\xd0', '\xd0', '\xd0\xb2\xd0', '\xd1', '\xd1', '\xd0']
What am I doing wrong?
Edit:
If I use u in front of the string:
print re.findall(r'(?u)\w+', u"раз два три")
I get:
[u'\u0440\u0430\u0437', u'\u0434\u0432\u0430', u'\u0442\u0440\u0438']
Edit 2:
Aaaaand it seems like I should have read docs first:
print re.findall(r'(?u)\w+', u"раз два три")[0].encode('utf-8')
Will give me:
раз
Just to make sure though, does that sound like a proper way of approaching it?
You're actually getting what you expect in the unicode case. You only think you are not because of the escaping: you are looking at the reprs of the strings, not printing their unescaped values. (This is just how lists are displayed.)
>>> words = [u'\u0440\u0430\u0437', u'\u0434\u0432\u0430', u'\u0442\u0440\u0438']
>>> for w in words:
...     print w  # This uses the terminal encoding -- _only_ utilize interactively
...
раз
два
три
>>> u'раз' == u'\u0440\u0430\u0437'
True
Don't miss my remark about printing these unicode strings. Normally if you were going to send them to screen, a file, over the wire, etc. you need to manually encode them into the correct encoding. When you use print, Python tries to leverage your terminal's encoding, but it can only do that if there is a terminal. Because you don't generally know if there is one, you should only rely on this in the interactive interpreter, and always encode to the right encoding explicitly otherwise.
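As a rough illustration of encoding explicitly instead of relying on the terminal, the Python 3 sketch below writes the words through a text wrapper with a stated encoding. An in-memory io.BytesIO stands in for a real file or socket, and the variable names are illustrative.

```python
import io

words = ["раз", "два", "три"]

# Stand-in for a file or network stream that only accepts bytes.
buffer = io.BytesIO()

# State the encoding explicitly rather than inheriting it from the terminal.
writer = io.TextIOWrapper(buffer, encoding="utf-8", newline="\n")
for word in words:
    writer.write(word + "\n")
writer.flush()

encoded = buffer.getvalue()                 # UTF-8 bytes, one word per line
decoded = encoded.decode("utf-8").split()   # round-trips back to the words
```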
For simple splitting on whitespace like this, you might not need a regex at all; the unicode.split method is enough.
>>> u"раз два три".split()
[u'\u0440\u0430\u0437', u'\u0434\u0432\u0430', u'\u0442\u0440\u0438']
Your top (bytestring) example does not work because re basically assumes all bytestrings are ASCII for its semantics, but yours was not. Using unicode strings allows you to get the right semantics for your alphabet and locale. As much as possible, textual data should always be represented using unicode rather than str.
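For readers on Python 3, where str is already Unicode, the same contrast can be sketched as follows. The behaviour differs slightly from the Python 2 output in the question: a bytes pattern there matches only ASCII word characters, so it finds nothing in UTF-8 encoded Cyrillic.

```python
import re

text = "раз два три"

# Text pattern on text: \w is Unicode-aware by default in Python 3.
words = re.findall(r"\w+", text)

# Bytes pattern on UTF-8 bytes: \w only covers ASCII letters, digits and "_".
byte_words = re.findall(rb"\w+", text.encode("utf-8"))

# Plain whitespace splitting needs no regex at all.
split_words = text.split()
```

Here words and split_words both hold the three Cyrillic words, while byte_words is empty.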

[Python]How to deal with a string ending with one backslash?

I'm getting some content from the Twitter API and I have a small problem: I sometimes get a tweet ending with a single backslash.
More precisely, I'm using simplejson to parse the Twitter stream.
How can I escape this backslash?
From what I have read, such a raw string shouldn't exist...
Even if I add one backslash (two, in fact) I still get an error, as I suspected (since I have an odd number of backslashes).
Any ideas?
I can just forget about these tweets too, but I'm still curious about that.
Thanks : )
Prefixing a string literal with r (which stands for "raw") makes Python keep the backslashes inside it literally instead of treating them as escape sequences. For example:
print r'\b\n\\'
will output
\b\n\\
Have I understood the question correctly?
I guess you are looking for a method similar to stripslashes in PHP. So, here you go:
Python version of PHP's stripslashes
You can try using raw strings by prepending an r (so nothing has to be escaped) to the string or re.escape().
I'm not really sure what you need considering I haven't seen the text of the response. If none of the methods you come up with on your own or get from here work, you may have to forget about those tweets.
Unless you update your question and come back with a real problem, I'm asserting that you don't have an issue except confusion.
You get the string from the Twitter API, ergo the string does not show up in your code. “Raw strings” exist only in your code, and it is only “raw string” literals in code that can't end in a backslash.
Consider this:
def some_obscure_api():
    "This exists in a library, so you don't know what it does"
    return r"hello" + "\\"  # addition just for fun

my_string = some_obscure_api()
print(my_string)
See? my_string happily ends in a backslash and your code couldn't care less.
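The same point can be checked against parsed JSON. The sketch below uses the standard-library json module rather than simplejson, and the payload is a made-up example, not real Twitter output. Inside the JSON text the backslash is escaped (doubled), but the parsed Python string ends in exactly one backslash.

```python
import json

# Hypothetical payload: in JSON text, a backslash inside a string is written \\.
payload = r'{"text": "path is C:\\"}'

tweet = json.loads(payload)["text"]

ends_with_backslash = tweet.endswith("\\")   # True
trailing_count = len(tweet) - len(tweet.rstrip("\\"))   # 1

# Serializing again re-escapes the backslash automatically.
round_trip = json.dumps(tweet)               # '"path is C:\\\\"' as a literal
```

No manual escaping is needed at any step; the backslash only has to be doubled when typing the string as a literal in source code.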
