Python: Encoding issues?

In my HTML file, the word "Schilde­rung" looks normal and doesn't seem to have an (encoding?) problem.
But when I copy the word, I get "Schilde rung", and when I check its length with Python, I get 13 (instead of 11).
What's the problem here, and how can I handle it?
Thanks a lot for any help!
EDIT:
At the moment, I use the following: output.write(text.decode("utf-8"))
This handles umlauts and other special characters correctly, but the problem above is still present. print(repr(txt)) gives: Schilde\xc2\xadrung
How can we solve this problem? Thanks a lot!

There is a U+00AD SOFT HYPHEN before the "r" in the string:
>>> "Schilde­rung".decode('utf-8')
u'Schilde\xadrung'
To remove non-ascii characters:
>>> s = u'Schilde\xadrung'
>>> s.encode('ascii', 'ignore').decode()
u'Schilderung'
>>> len(_)
11
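Note that encoding to ASCII with 'ignore' also throws away umlauts and every other non-ASCII character. A minimal Python 3 sketch that removes only the soft hyphen, or more broadly every invisible format character, might look like this. The helper names strip_soft_hyphens and strip_format_chars are made up for illustration:

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN


def strip_soft_hyphens(text):
    """Remove only soft hyphens; umlauts and other letters are kept."""
    return text.replace(SOFT_HYPHEN, "")


def strip_format_chars(text):
    """Remove every character in Unicode category Cf (invisible format
    characters such as the soft hyphen or the zero width space)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


word = "Schilde\u00adrung"
print(len(word))                      # 12: 11 letters plus the soft hyphen
print(len(strip_soft_hyphens(word)))  # 11
print(strip_format_chars("Gr\u00fc\u00ad\u00dfe"))  # Grüße, umlaut kept
```

The broader variant also removes joiners that carry meaning in some scripts, so the plain replace() is the safer choice when only soft hyphens are the problem.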

Seems like there is a non-ASCII character before the "r":
>>> u'Schilde­rung'
u'Schilde\xadrung'

Related

Why Python hex() function is not working on a 64 bit binary digit that starts with 1? [duplicate]

I'm using Python to script some operations on specific locations in memory (32 bit addresses) in an embedded system.
When I'm converting these addresses to and from strings, integers and hex values a trailing L seems to appear. This can be a real pain, for example the following seemingly harmless code won't work:
int(hex(4220963601))
Or this:
int('0xfb96cb11L',16)
Does anyone know how to avoid this?
So far I've come up with this method to strip the trailing L off of a string, but it doesn't seem very elegant:
if longNum[-1] == "L":
    longNum = longNum[:-1]
If you do the conversion to hex using
"%x" % 4220963601
there will be neither the 0x nor the trailing L.
Calling str() on those values should omit the trailing 'L'.
Consider using rstrip. For example:
result.rstrip("L")
This is what I did: int(variable_which_is_printing_as_123L) and it worked for me. This will work for normal integers.
This could help somebody:
>>> n = 0xaefa5ba7b32881bf7d18d18e17d3961097548d7cL
>>> print "n=", "%0s" % format(n, 'x').upper()
n= AEFA5BA7B32881BF7D18D18E17D3961097548D7C
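In Python 3 there is a single int type and hex() no longer appends an L, but strings produced by Python 2 may still carry one. The suggestions above could be combined into two small helpers, sketched here for Python 3. The names parse_hex and to_hex and the default width of 8 digits (for 32 bit addresses) are assumptions, not part of the original answers:

```python
def parse_hex(text):
    """Parse a hex string such as '0xfb96cb11' or '0xfb96cb11L' into an int."""
    # 'L' and 'l' are not hex digits, so stripping them from the right is safe.
    return int(text.strip().rstrip("Ll"), 16)


def to_hex(value, width=8):
    """Format an int as zero-padded lowercase hex without '0x' or 'L'."""
    return format(value, "0{}x".format(width))


address = parse_hex("0xfb96cb11L")
print(address)          # 4220963601
print(to_hex(address))  # fb96cb11
print(to_hex(0x1f))     # 0000001f
```

Passing base 16 to int() is what makes the first example from the question work: int() accepts the 0x prefix only when the base is given explicitly (or is 0).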

Can't replace a string with multiple escape characters

I am having trouble with the replace() method. I want to remove part of a string, and the part I want to remove contains multiple escape characters. It looks something like this:
['<div class=\"content\">rn
To remove it, I have this block of code:
garbage_text = "[\'<div class=\\\"content\\\">rn "
entry = entry.replace(garbage_text,"")
However, it does not work. Nothing is removed from my complete string. Can anybody point out where exactly my thinking goes wrong? Thanks in advance.
Addition:
The complete string looks like this:
"['<div class=\"content\">rn gitar calmak icin kullanilan minik plastik garip nesne.rn </div>']"
You could use the triple quote format for your replacement string so that you don't have to bother with escaping at all:
garbage_text = """['<div class="content">rn """
Perhaps your 'entry' is not formatted correctly?
With an extra variable 'text', the following worked in Python 3.6.7:
>>> garbage_text
'[\'<div class=\\\'content\'\\">rn '
>>> text
'[\'<div class=\\\'content\'\\">rn And then there were none'
>>> entry = text.replace(garbage_text, "")
>>> entry
'And then there were none'
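One way to see why nothing is removed is to compare the two strings with repr() and the in operator. The sketch below (Python 3) assumes that entry holds exactly the text shown in the question, with plain double quotes and no literal backslashes; if the real data differs, the comparison will show that too:

```python
# Assumption: this literal is the actual content of entry.
entry = "['<div class=\"content\">rn gitar calmak icin kullanilan minik plastik garip nesne.rn </div>']"

# The pattern from the question: \\\" yields a literal backslash plus a quote.
asker_garbage = "[\'<div class=\\\"content\\\">rn "
# The triple-quoted pattern from the answer: no backslashes at all.
garbage_text = """['<div class="content">rn """

print(repr(asker_garbage))     # shows the extra backslashes
print(asker_garbage in entry)  # False, so replace() changes nothing
print(garbage_text in entry)   # True

cleaned = entry.replace(garbage_text, "")
print(cleaned)
```

replace() returns the original string unchanged when the search text does not occur, so checking with in first makes a silent mismatch visible.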

Arabic regex giving TypeError

I have this simple regex:
text = re.sub("[إأٱآا]", "ا", text)
However, I get this (Python 2.7) error:
TypeError: expected string or buffer
I'm a regex newbie. I imagine this is a simple thing to fix, but I'm not sure how. Thanks.
Define all your strings as unicode and don't forget to add the encoding line in the header of the file:
#coding: utf-8
import re
text = re.sub(u"[إأٱآا]", u"ا", u"الآلهة")
print text
To get:
الالهة
re.sub expects regex as first parameter. You need to escape the left bracket in your patterns. Use \[ instead of [
Sorry I couldn't fit this in the comments section. There is nothing wrong in the re.sub as far as I understand. Because if you code the chars back to unicode you get the below verbatim.
text = re.sub("[\u0625\u0623\u0671\u0622\u0627]", "\u0627", text)
Because the text is Arabic and written right to left, the characters only look jumbled on screen.
The call replaces any character from a set with a single character.
Why \u0627 is included in the set that gets replaced by \u0627, I do not know.
I believe the issue is with text. If you run print(text), we can see whether it contains any of the characters in "[إأٱآا]" == "[\u0625\u0623\u0671\u0622\u0627]".
As an aside, \u0627 is the small vertical stroke on the left ;-)
To see what each character actually is, copy the string from the question into mystr and run:
for x in mystr: print(x + '-' + str(ord(x)))
http://www.fileformat.info/info/unicode/char/0627/index.htm
EDITED
>>> re.sub(myset,myrep,text)
u'\u0627\u0627\u0627abc'
>>> res=re.sub(myset,myrep,text)
>>> res
u'\u0627\u0627\u0627abc'
>>> myrep
u'\u0627'
>>> myset
u'[\u0625\u0623\u0671\u0622\u0627]'
>>> text
u'\u0625\u0623\u0623abc'
>>> print(res)
اااabc
>>> print(myrep)
ا
>>> print(myset)
[إأٱآا]
>>> print(text)
إأأabc
>>>
So in essence, everything works and the error is elsewhere.
I think I reproduced the error that is occurring elsewhere; here it is:
>>> print(u'\u0625'+ord(u'\u0625'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: coercing to Unicode: need string or buffer, int found
Cheers!
This is how I eventually did it:
sText = re.sub(ur"[\u0625|\u0623|\u0671|\u0622|\u0627]", ur"\u0627", sText)
Thank you all for your help.
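For reference, "expected string or buffer" is what re.sub raises when the value being searched is not a string at all (for example None), which fits the suspicion above that the problem lies in text rather than in the pattern. A minimal Python 3 sketch of the same normalization follows; the function name normalize_alef and the explicit type check are illustrative assumptions:

```python
import re

# Alef variants to be folded into the bare alef U+0627.
# Inside [...] no "|" is needed; a "|" there would match a literal pipe.
ALEF_VARIANTS = re.compile("[\u0625\u0623\u0671\u0622]")
BARE_ALEF = "\u0627"


def normalize_alef(text):
    """Replace hamza, wasla and madda forms of alef with the bare alef."""
    if not isinstance(text, str):
        # Fail early with a message that names the real culprit.
        raise TypeError("expected str, got %s" % type(text).__name__)
    return ALEF_VARIANTS.sub(BARE_ALEF, text)


print(normalize_alef("\u0625\u0623\u0623abc"))  # three bare alefs, then abc
print(normalize_alef("\u0627\u0644\u0622\u0644\u0647\u0629"))
```

In Python 3 every str is Unicode, so neither the u prefix nor the coding line is needed when the pattern is written with \u escapes.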

How to fix broken utf-8 encoding in Python?

My string is Niệm Bồ Tát (Thiá»n sÆ° Nhất Hạnh) and I want to decode it to Niệm Bồ Tát (Thiền sư Nhất Hạnh). I see in that site can do that http://www.enderminh.com/minh/utf8-to-unicode-converter.aspx
and I tried to do the same in Python:
mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
mystr.decode('utf-8')
but the result is not correct: the original string is UTF-8, yet the string shown is not what I expect.
Note: the text is Vietnamese.
How can I resolve this? Is it a Windows encoding or something else? How can I detect the encoding here?
The only thing that helped me with a broken Cyrillic string was ftfy: https://github.com/LuminosoInsight/python-ftfy
This module fixes pretty much everything and works much better than online decoders.
>>> from ftfy import fix_encoding
>>> mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
>>> fix_encoding(mystr)
'09. Bát Nhã Tâm Kinh'
It can be easily installed using pip install ftfy
I'm not sure what you can do with this kind of data, but for the example in your original post, this works (Python 3.x):
>>> mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
>>> s = mystr.encode('latin1').decode('utf8')
>>> s
'09. Bát Nhã Tâm Kinh'
>>> print(s)
09. Bát Nhã Tâm Kinh
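The round trip works because this kind of damage usually comes from UTF-8 bytes that were decoded as Latin-1 somewhere along the way. A small Python 3 sketch that reproduces the damage and undoes it, leaving already-correct text alone, could look like this. The function name repair_mojibake and the default wrong_encoding are assumptions; real data may have been mangled through a different encoding such as cp1252:

```python
def repair_mojibake(text, wrong_encoding="latin-1"):
    """Undo 'UTF-8 bytes decoded with the wrong encoding'.

    Returns the input unchanged if it cannot be round-tripped, which is
    the case for text that was never damaged in this way.
    """
    try:
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


original = "09. B\u00e1t Nh\u00e3 T\u00e2m Kinh"
# Reproduce the damage: UTF-8 bytes misread as Latin-1.
broken = original.encode("utf-8").decode("latin-1")

print(broken)
print(repair_mojibake(broken))
print(repair_mojibake(original) == original)  # True: intact text is kept
```

Swallowing the exceptions keeps the helper safe to apply to mixed input, at the cost of silently skipping strings that are broken in some other way.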
Try:
str.encode('ascii', 'ignore').decode('utf-8')
You're encoding the string in ASCII format / ignoring the errors and decoding in UTF-8. This may remove the accents, but it's one approach.
The correct method in python 3.9.6 is:
"string".encode('utf-8').decode('latin-1')
"string".encode('latin1').decode('utf8')
