I have this simple regex:
text = re.sub("[إأٱآا]", "ا", text)
However, I get this (Python 2.7) error:
TypeError: expected string or buffer
I'm a regex newbie. I imagine this is simple to fix, but I'm
not sure how. Thanks.
Define all your strings as unicode and don't forget to add the encoding line in the header of the file:
#coding: utf-8
import re
text = re.sub(u"[إأٱآا]", u"ا", u"الآلهة")
print text
To get:
الالهة
re.sub expects regex as first parameter. You need to escape the left bracket in your patterns. Use \[ instead of [
Sorry, I couldn't fit this in the comments section. As far as I can tell, there is nothing wrong with the re.sub call itself. If you write the characters as Unicode escapes, you get the equivalent statement below.
text = re.sub("[\u0625\u0623\u0671\u0622\u0627]", "\u0627", text)
Because the text is Arabic, which is written right to left, the characters can look a bit jumbled on screen, that's all.
The call is simply trying to replace any character from a set with one character.
Although why one would replace \u0627 with \u0627, I do not know.
The issue, I believe, is with text. If you can do print(text), then we can see whether it contains any of the characters in "[إأٱآا]" == "[\u0625\u0623\u0671\u0622\u0627]"
Just a quip: \u0627 is the smallest vertical line on the left ;-)
To help understand what each character actually is, copy the whole statement from the question into mystr and run the following:
for x in mystr: print(x + '-' + str(ord(x)))
http://www.fileformat.info/info/unicode/char/0627/index.htm
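To see exactly which code points a string contains, the loop above can be extended with the standard unicodedata module. This is a minimal Python 3 sketch; the helper name describe_chars is made up for illustration.

```python
import unicodedata

def describe_chars(s):
    """Return (character, code point, Unicode name) for each character in s."""
    return [(ch, "U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>")) for ch in s]

# The character set from the question, written as escapes to avoid
# right-to-left display confusion.
for ch, code, name in describe_chars("\u0625\u0623\u0671\u0622\u0627"):
    print(code, name)
```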
EDITED
>>> re.sub(myset,myrep,text)
u'\u0627\u0627\u0627abc'
>>> res=re.sub(myset,myrep,text)
>>> res
u'\u0627\u0627\u0627abc'
>>> myrep
u'\u0627'
>>> myset
u'[\u0625\u0623\u0671\u0622\u0627]'
>>> text
u'\u0625\u0623\u0623abc'
>>> print(res)
اااabc
>>> print(myrep)
ا
>>> print(myset)
[إأٱآا]
>>> print(text)
إأأabc
>>>
So in essence everything works, and the error is elsewhere.
I think I reproduced a similar error that can occur elsewhere; here it is:
>>> print(u'\u0625'+ord(u'\u0625'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: coercing to Unicode: need string or buffer, int found
Cheers!
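One plausible cause of the original TypeError is that text is not a string at all (for example None or a list). This is an assumption, since the question does not show how text is produced. A small Python 3 sketch with a hypothetical helper, normalize_alef, that checks the type before substituting:

```python
import re

# Same character class as in the question, written with escapes.
ALEF_VARIANTS = re.compile("[\u0625\u0623\u0671\u0622\u0627]")

def normalize_alef(text):
    """Replace every alef variant with a bare alef (U+0627)."""
    if not isinstance(text, str):
        raise TypeError("expected a str, got %s" % type(text).__name__)
    return ALEF_VARIANTS.sub("\u0627", text)

print(normalize_alef("\u0625\u0623\u0623abc"))

# Passing a non-string as the third argument of re.sub raises TypeError.
try:
    re.sub("[\u0625\u0623]", "\u0627", None)
except TypeError as exc:
    print("TypeError:", exc)
```

Checking type(text) right before the re.sub call in the real script should show quickly whether this is what is happening.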
This is how I eventually did it:
sText = re.sub(ur"[\u0625|\u0623|\u0671|\u0622|\u0627]", ur"\u0627", sText)
Thank you all for your help.
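One detail worth knowing about the pattern above: inside square brackets, | is an ordinary character rather than the alternation operator, so that character class also matches a literal pipe. A short Python 3 sketch comparing the two forms (the sample string is made up):

```python
import re

with_pipes = re.compile("[\u0625|\u0623|\u0671|\u0622|\u0627]")
without_pipes = re.compile("[\u0625\u0623\u0671\u0622\u0627]")

sample = "a|b \u0623"

# The first pattern also rewrites the literal "|"; the second leaves it alone.
print(ascii(with_pipes.sub("\u0627", sample)))
print(ascii(without_pipes.sub("\u0627", sample)))
```

If the text never contains a pipe the two behave the same, but dropping the pipes avoids the surprise.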
Related
When I receive JSON from an OCR server, the encoding seems to be broken. The image includes some characters that are not encoded(?) properly: when displayed in the console, they appear as \uXXXX.
For example processing an image like this:
ends up with output:
"some text \u0141\u00f3\u017a"
It's confusing because if I add some code like this:
mystr = mystr.replace(r'\u0141', '\u0141')
mystr = mystr.replace(r'\u00d3', '\u00d3')
mystr = mystr.replace(r'\u0142', '\u0142')
mystr = mystr.replace(r'\u017c', '\u017c')
mystr = mystr.replace(r'\u017a', '\u017a')
The output is ok:
"some text Ółźż"
What is more, if I try to replace them with a regex:
mystr = re.sub(r'(\\u[0-9|abcdef|ABCDEF]{4})', r'\g<1>', mystr)
The output remains "broken":
"some text \u0141\u00f3\u017a"
This OCR is processing image to MathML / Latex prepared for use in Python. Full documentation can be found here. So for example:
Will produce the following RAW output:
"\\(\\Delta=b^{2}-4 a c\\)"
Note that the quotes are included in the string; maybe this matters here.
Why are the characters not displayed properly in the first place, while after this silly mystr.replace(x, x) everything is fine?
Why does the first method work while re.sub fails? The code seems to be okay and it works fine in another script. What am I missing?
Python 3 strings are Unicode by default, but the string you have contains literal backslash escapes rather than the characters themselves, so it differs from the string you expect to output.
>>> txt = r"some text \u0141\u00f3\u017a"
>>> txt
'some text \\u0141\\u00f3\\u017a'
>>> print(txt)
some text \u0141\u00f3\u017a
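The regex doesn't fix anything because the string contains a literal backslash followed by uXXXX, and the replacement \g<1> just puts the matched text back unchanged. In the replace() version, Python converts the '\u0141' in the second (non-raw) argument into the actual symbol and inserts it, which is why it works. To see the literal backslash: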
>>> txt[-5:]
'u017a'
>>> txt[-6:]
'\\u017a'
>>> txt[-6:-5]
'\\'
What you should do to resolve it:
Make sure your response is received in the correct encoding and not as a raw string (e.g. use response.text instead of response.body).
Otherwise:
>>> txt.encode("raw-unicode-escape").decode('unicode-escape')
'some text Łóź'
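If the escapes have to be decoded by hand, a regex can do it as long as the replacement is computed from the match instead of being copied back. A minimal Python 3 sketch; decode_u_escapes is a hypothetical helper, and it only handles four-digit \uXXXX escapes (no surrogate pairs or other backslash escapes):

```python
import re

# A literal backslash, the letter u, then exactly four hex digits.
U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

def decode_u_escapes(s):
    """Turn each literal backslash-u-XXXX sequence into the character it names."""
    return U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), s)

txt = r"some text \u0141\u00f3\u017a"
print(decode_u_escapes(txt))
```

For this sample it gives the same result as the encode/decode round trip shown above.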
I am having trouble with the replace() method. I want to replace part of a string, and the part I want to replace consists of multiple escape characters. It looks something like this:
['<div class=\"content\">rn
To remove it, I have a block of code;
garbage_text = "[\'<div class=\\\"content\\\">rn "
entry = entry.replace(garbage_text,"")
However, it does not work. Nothing is removed from my complete string. Can anybody point out where exactly my thinking goes wrong? Thanks in advance.
Addition:
The complete string looks like this;
"['<div class=\"content\">rn gitar calmak icin kullanilan minik plastik garip nesne.rn </div>']"
You could use the triple quote format for your replacement string so that you don't have to bother with escaping at all:
garbage_text = """['<div class="content">rn """
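To check why the original replace() call found nothing, it can help to compare the two literals with repr(). This Python 3 sketch assumes the real entry contains plain double quotes with no backslashes, which the question does not confirm:

```python
# Assumed content of the full string: plain double quotes, no backslashes.
entry = """['<div class="content">rn gitar calmak icin kullanilan minik plastik garip nesne.rn </div>']"""

# The literal from the question: each \\ becomes a real backslash in the string.
with_backslashes = "[\'<div class=\\\"content\\\">rn "
# The triple-quoted literal: no backslashes at all.
plain = """['<div class="content">rn """

print(repr(with_backslashes))
print(repr(plain))

# Only the literal without backslashes occurs in entry, so only it is removed.
unchanged = entry.replace(with_backslashes, "")
cleaned = entry.replace(plain, "")
print(cleaned)
```

If the real entry does contain backslashes, the situation is reversed, so printing repr(entry) first is the quickest way to find out.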
Perhaps your 'entry' is not formatted correctly?
With an extra variable 'text', the following worked in Python 3.6.7:
>>> garbage_text
'[\'<div class=\\\'content\'\\">rn '
>>> text
'[\'<div class=\\\'content\'\\">rn And then there were none'
>>> entry = text.replace(garbage_text, "")
>>> entry
'And then there were none'
I was curious about how ASCII worked in python, so I decided to find out more. I learnt quite a bit, before I began to try to print letters using ASCII numbers. I'm not sure if I am doing it correctly, as I am using the string module, but I keep picking up an error
print(string.ascii_lowercase(104))
This should print out "h", as far as I know, but all that happens is that I receive an error.
Traceback (most recent call last):
File "<pyshell#4>", line 1, in <module>
string.ascii_lowercase(104)
TypeError: 'str' object is not callable
If someone could help me solve this, or tell me a better way, I would be ever grateful. Thanks in advance! :)
ascii_lowercase is a string, not a function. Use chr(104).
I guess what you want is chr
>>> chr(104)
'h'
The chr() function returns the character corresponding to the ASCII value (more generally, the Unicode code point) you pass in.
The ord() function returns the ASCII value (code point) of the character you pass in.
Example:
chr(104) == 'h'
ord('h') == 104
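Since string.ascii_lowercase is just the string 'abcdefghijklmnopqrstuvwxyz', it can be indexed but not called. A short Python 3 sketch putting the pieces together:

```python
import string

# chr and ord convert between code points and characters.
print(chr(104))   # the character with code point 104
print(ord("h"))   # the code point of "h"

# ascii_lowercase is a plain string, so use [] with a 0-based position.
position = ord("h") - ord("a")
print(position, string.ascii_lowercase[position])
```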
In my HTML file, the word "Schilderung" looks normal and doesn't seem to have an (encoding?) problem.
But when I copy the word, I get the following: "Schilde rung", and if I check the length with Python, I get 13 (instead of 12...).
What's the problem here, and how can I handle this?
Thanks a lot for any help!
EDIT:
At the moment, I use the following: output.write(text.decode("utf-8"))
This correctly handles all umlauts and other special characters, but the above problem is still present. print(repr(txt)) gives: Schilde\xc2\xadrung
How can we solve this problem? Thanks a lot!
There is a U+00AD SOFT HYPHEN before the "r" in the string (written here as explicit UTF-8 bytes, since the character is invisible):
>>> "Schilde\xc2\xadrung".decode('utf-8')
u'Schilde\xadrung'
To remove non-ascii characters:
>>> s = u'Schilde\xadrung'
>>> s.encode('ascii', 'ignore').decode()
u'Schilderung'
>>> len(_)
11
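Encoding to ASCII with 'ignore' also drops umlauts and every other non-ASCII character. If only the soft hyphen should go, a narrower approach is to remove that one code point. A minimal sketch in Python 3 syntax (where str is already Unicode); strip_soft_hyphens is a hypothetical helper name:

```python
SOFT_HYPHEN = "\u00ad"

def strip_soft_hyphens(text):
    """Remove U+00AD SOFT HYPHEN and leave every other character untouched."""
    return text.replace(SOFT_HYPHEN, "")

word = "Schilde\u00adrung"
print(len(word), len(strip_soft_hyphens(word)))

# Umlauts survive, unlike with encode('ascii', 'ignore').
print(strip_soft_hyphens("Erkl\u00e4\u00adrung"))
```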
It seems there is a non-ASCII character hidden just before the "r":
>>> u'Schilderung'
u'Schilde\xadrung'
I want to know the result format of xlrd.
See the code:
>>> sh.cell_value(rowx=2, colx=1)
u'Adam Gilchrist xxxxxxxxxxxxxxxxxxxxx'
Now when I try running re.search:
>>> temp1=sh.cell_value(rowx=2, colx=1)
>>> x=re.search("Adam",'temp1')
>>> x.group()
Traceback (most recent call last):
File "<pyshell#58>", line 1, in <module>
x.group()
AttributeError: 'NoneType' object has no attribute 'group'
I get nothing.
First, I want to know what the 'u' in the result is.
What are the result formats returned by sh.cell_value? Is it an integer, a string, etc.?
Can we run regular expressions on them?
Answering your question first
First, I want to know what the 'u' in the result is. u is the prefix for a unicode string, so u'Adam Gilchrist xxxxxxxxxxxxxxxxxxxxx' means the text is in unicode.
What are the result formats returned by sh.cell_value? Is it an integer, a string, etc.? It's a unicode string.
Can we run regular expressions on them? Yes, you can, and this is how you do it:
temp1=u'Adam Gilchrist xxxxxxxxxxxxxxxxxxxxx'
x=re.search(u'Adam',temp1)
x.group()
u'Adam'
Note that temp1 is passed as a variable, not as the quoted string 'temp1'; it is also good practice to specify the pattern in unicode.
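The AttributeError in the question comes from calling .group() on None, which is what re.search returns when nothing matches. A minimal Python 3 sketch (no u prefix is needed there) contrasting the quoted name 'temp1' with the variable temp1, and guarding against a failed match:

```python
import re

temp1 = "Adam Gilchrist xxxxxxxxxxxxxxxxxxxxx"

no_match = re.search("Adam", "temp1")   # searches the literal text "temp1"
match = re.search("Adam", temp1)        # searches the cell value

print(no_match)
if match is not None:
    print(match.group())
```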
It's a Unicode string
cell_value returns the value of the cell. The type depends on the type of the cell.
Yes. You can use regular expressions on Unicode strings, but your code isn't right.
Your code passes "temp1" to re.search as a string. It does not pass the variable temp1. You want:
>>> x=re.search(u"Adam",temp1)