python UTF-8 str.replace() vs re.sub()

When receiving JSON from an OCR server, the encoding seems to be broken. The image includes some characters that are not encoded(?) properly. Displayed in the console they are represented by \uXXXX.
For example, processing an image ends up with the output:
"some text \u0141\u00f3\u017a"
It's confusing because if I add some code like this:
mystr = mystr.replace(r'\u0141', '\u0141')
mystr = mystr.replace(r'\u00d3', '\u00d3')
mystr = mystr.replace(r'\u0142', '\u0142')
mystr = mystr.replace(r'\u017c', '\u017c')
mystr = mystr.replace(r'\u017a', '\u017a')
The output is ok:
"some text Ółźż"
What is more, if I try to replace them with a regex:
mystr = re.sub(r'(\\u[0-9|abcdef|ABCDEF]{4})', r'\g<1>', mystr)
The output remains "broken":
"some text \u0141\u00f3\u017a"
This OCR processes an image to MathML / LaTeX prepared for use in Python. Full documentation can be found here. So, for example, one input image will produce the following raw output:
"\\(\\Delta=b^{2}-4 a c\\)"
Note that the quotes are included in the string; maybe this has some bearing on the case.
Why are the characters not displayed properly in the first place, while after this silly mystr.replace(x, x) they come out just fine?
Why does the first method work while re.sub fails? The code seems to be okay and it works fine in another script. What am I missing?

Python 3 strings are Unicode by default, so the string you actually have is different from the string you see when it is printed: it contains literal \uXXXX escape sequences, not the characters themselves.
>>> txt = r"some text \u0141\u00f3\u017a"
>>> txt
'some text \\u0141\\u00f3\\u017a'
>>> print(txt)
some text \u0141\u00f3\u017a
The regex doesn't work because it only replaces each match with itself (\g<1>): the string contains a single backslash followed by uXXXX, and the substitution does nothing to change that. The .replace() calls, on the other hand, work because their second argument is not a raw string: Python converts '\uXXXX' into the actual symbol at parse time and inserts that, which obviously works. To reproduce:
>>> txt[-5:]
'u017a'
>>> txt[-6:]
'\\u017a'
>>> txt[-6:-5]
'\\'
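The same distinction explains why the .replace() calls work: the raw literal r'\u0141' is six separate characters (a backslash, a u and four hex digits), while the non-raw literal '\u0141' is the single character Ł, so those calls effectively swap the escape sequence for the real symbol. A quick check:
>>> len(r'\u0141'), len('\u0141')
(6, 1)
>>> r'\u0141'
'\\u0141'
>>> '\u0141'
'Ł'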
What you should do to resolve it:
Make sure your response is received in the correct encoding and not as a raw string (e.g. use response.text instead of response.body).
Otherwise:
>>> txt.encode("raw-unicode-escape").decode('unicode-escape')
'some text Łóź'
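If the payload really is JSON, another option is to let the json module do the unescaping, since the JSON parser turns \uXXXX sequences into the actual characters. A minimal sketch (the payload below is made up to resemble the OCR response):

import json

# a made-up payload resembling the OCR response
payload = r'{"text": "some text \u0141\u00f3\u017a"}'

data = json.loads(payload)
print(data["text"])   # some text Łóź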

Related

Can't replace a string with multiple escape characters

I am having trouble with the replace() method. I want to replace some part of a string, and the part which I want to replace consists of multiple escape characters. It looks something like this:
['<div class=\"content\">rn
To remove it, I have a block of code:
garbage_text = "[\'<div class=\\\"content\\\">rn "
entry = entry.replace(garbage_text,"")
However, it does not work. Nothing is removed from my complete string. Can anybody point out where exactly my thinking goes wrong? Thanks in advance.
Addition:
The complete string looks like this:
"['<div class=\"content\">rn gitar calmak icin kullanilan minik plastik garip nesne.rn </div>']"
You could use the triple-quote format for the string you want to remove so that you don't have to bother with escaping at all:
garbage_text = """['<div class="content">rn """
Perhaps your 'entry' is not formatted correctly?
With an extra variable 'text', the following worked in Python 3.6.7:
>>> garbage_text
'[\'<div class=\\"content\\">rn '
>>> text
'[\'<div class=\\"content\\">rn And then there were none'
>>> entry = text.replace(garbage_text, "")
>>> entry
'And then there were none'
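If the replacement still seems to do nothing, comparing repr(entry) with repr(garbage_text) is a quick way to check that the substring really occurs character for character, since str.replace() only removes exact matches. A small sketch with made-up strings:

# made-up strings that mirror the question's layout
entry = """['<div class="content">rn gitar calmak icin kullanilan minik plastik garip nesne.rn </div>']"""
garbage_text = """['<div class="content">rn """

# repr() shows the exact characters, escapes and all,
# so mismatches (extra backslashes, missing spaces) become visible
print(repr(entry))
print(repr(garbage_text))

print(entry.replace(garbage_text, ""))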

PyQt QString masks special characters and does not display correctly

I cannot get PyQt to display a string with special characters correctly. From a drag-and-drop action I end up with filenames as QStrings that may contain a blank or one of the ugly German umlauts.
For simplicity, let's say this is the filename I'd like to handle: 'abc defä.ghi'; the resulting QString I get is 'abc%20.def%C3%A4.ghi'. I now just want to print the original string:
from PyQt4.QtCore import QString, QTextCodec, QTextDecoder
s = QString('abc%20.def%C3%A4.ghi')
print s, unicode(s), s.toUtf8()
Nothing seems to work and I'm afraid I'm missing the obvious.
Not sure where you're getting the data from, but it's obviously not plain UTF-8. It's percent-encoded, so presumably it came from a URL somehow?
Anyway, it should be decoded like this in Python 2:
>>> b = QtCore.QByteArray.fromPercentEncoding('abc%20.def%C3%A4.ghi')
>>> b.data()
'abc .def\xc3\xa4.ghi'
>>> s = b.data().decode('utf8')
>>> print s
abc .defä.ghi
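For reference, in Python 3 (or without Qt at all) the same percent-decoding can be done with urllib.parse.unquote, which decodes as UTF-8 by default; a minimal sketch:

from urllib.parse import unquote

name = 'abc%20.def%C3%A4.ghi'
print(unquote(name))   # abc .defä.ghi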

python 3 regex not finding confirmed matches

So I'm trying to parse a bunch of citations from a text file using the re module in Python 3.4 (on, if it matters, a Mac running Mavericks). Here's some minimal code. Note that there are two commented lines: they represent two alternative searches. (Obviously, the little one, r'Rawls', is the one that works.)
def makeRefList(reffile):
    print(reffile)
    # namepattern = r'(^[A-Z1][A-Za-z1]*-?[A-Za-z1]*),.*( \(?\d\d\d\d[a-z]?[.)])'
    # namepattern = r'Rawls'
    refsTuplesList = re.findall(namepattern, reffile, re.MULTILINE)
    print(refsTuplesList)
The string in question is ugly, and so I stuck it in a gist: https://gist.github.com/paultopia/6c48c398a42d4834f2ae
As noted, the search string r'Rawls' produces expected output ['Rawls', 'Rawls']. However, the other search string just produces an empty list.
I've confirmed this regex (partially) works using the regex101 tester. Confirmation here: https://regex101.com/r/kP4nO0/1 -- it matches what I expect it to match. Since it works in the tester, it should work in the code, right?
(n.b. I copied the text from terminal output from the first print command, then manually replaced \n characters in the string with carriage returns for regex101.)
One possible issue is that Python has appended the bytes flag (is the little b called a "flag"?) to the string. This is an artifact of my attempt to convert the text from UTF-8 to ASCII, and I haven't figured out how to make it go away.
Yet re clearly is able to parse strings in that form. I know this because I'm converting two text files from utf-8 to ascii, and the following code works perfectly fine on the other string, converted from the other text file, which also has a little b in front of it:
def makeCiteList(citefile):
    print(citefile)
    citepattern = r'[\s(][A-Z1][A-Za-z1]*-?[A-Za-z1]*[ ,]? \(?\d\d\d\d[a-z]?[\s.,)]'
    rawCitelist = re.findall(citepattern, citefile)
    cleanCitelist = cleanup(rawCitelist)
    finalCiteList = list(set(cleanCitelist))
    print(finalCiteList)
    return(finalCiteList)
The other chunk of text, which the code immediately above matches correctly: https://gist.github.com/paultopia/a12eba2752638389b2ee
The only hypothesis I can come up with is that the first, broken, regex is puking on the combination of newline characters and the string being treated as a bytes object, even though a) I know the regex handles newlines correctly (because of the confirmation from the linked regex101), and b) I know re can match strings in that form (because of the successful match on the other string).
If that's true, though, I don't know what to do about it.
Thus, questions:
1) Is my hypothesis right that it's the combination of newlines and b that blows up my regex? If not, what is?
2) How do I fix that?
a) replace the newlines with something in the string?
b) rewrite the regex somehow?
c) somehow get rid of that b and make it into a normal string again? (how?)
thanks!
Addition
In case this is a problem I need to fix upstream, here's the code I'm using to get the text files and convert to ASCII, replacing non-ASCII characters.
This function gets called on UTF-8 .txt files saved by TextWrangler in Mavericks:
def makeCorpoi(citefile, reffile):
    citebox = open(citefile, 'r')
    refbox = open(reffile, 'r')
    citecorpus = citebox.read()
    refcorpus = refbox.read()
    citebox.close()
    refbox.close()
    corpoi = [str(citecorpus), str(refcorpus)]
    return corpoi
and then this function gets called on each element of the list the above function returns.
def conv2ASCII(bigstring):
    def convHandler(error):
        return ('1FOREIGN', error.start + 1)
    codecs.register_error('foreign', convHandler)
    bigstring = bigstring.encode('ascii', 'foreign')
    stringstring = str(bigstring)
    return stringstring
Aah. I've tracked it down and answered my own question. Apparently one needs to call a decode method on the encoded thing. The following code produces an actual string, with newlines and everything, out the other end (though now I have to fix a bunch of other bugs before I can figure out if the final output is as expected):
def conv2ASCII(bigstring):
    def convHandler(error):
        return ('1FOREIGN', error.start + 1)
    codecs.register_error('foreign', convHandler)
    bigstring = bigstring.encode('ascii', 'foreign')
    newstring = bigstring.decode('ascii', 'foreign')
    return newstring
Apparently the str() function doesn't do the same job, for reasons that are mysterious to me. This is despite an answer here, How to make new line commands work in a .txt file opened from the internet?, which suggests that it does.
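For what it's worth, the difference is easy to see in isolation: str() on a bytes object produces its repr, complete with the b prefix and a literal backslash-n, while .decode() gives back a real str. A minimal sketch:

data = "line one\nline two".encode('ascii')

# str() on bytes yields the repr, with a b prefix and a literal backslash-n
as_repr = str(data)

# .decode() yields an actual str containing a real newline
as_text = data.decode('ascii')

print(as_repr)
print(as_text)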

stripping away code in python using "re.sub"

I read this:
Stripping everything but alphanumeric chars from a string in Python
and this:
Python: Strip everything but spaces and alphanumeric
Didn't quite understand but I tried a bit on my own code, which now looks like this:
import re
decrypt = str(open("crypt.txt"))
crypt = re.sub(r'([^\s\w]|_)+', '', decrypt)
print(crypt)
When I run the script It comes back with this answer:
C:\Users\Adrian\Desktop\python>python tick.py
ioTextIOWrapper namecrypttxt moder encodingcp1252
I am trying to strip away all the extra code from the document and just keep numbers and letters; inside the document the following text can be found: http://pastebin.com/Hj3SjhxC
I am trying to solve the assignment here:
http://www.pythonchallenge.com/pc/def/ocr.html
Does anyone know what "ioTextIOWrapper namecrypttxt moder encodingcp1252" means?
And how should I format the code to properly strip out everything except letters and numbers?
Sincerely
str(open("file.txt")) doesn't do what you think it does. open() returns a file object. str gives you the string representation of that file object, not the contents of the file. If you want to read the contents of the file use open("file.txt").read().
Or, more safely, use a with statement:
with open("file.txt") as f:
decrypt = f.read()
crypt = ...
# etc.
You could just search for the characters you want to keep instead. Like this:
print(''.join(re.findall('[A-Za-z]', decrypt)))
And you also want:
decrypt = open("crypt.txt").read()
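Putting the two fixes together, a minimal sketch (assuming crypt.txt sits next to the script):

import re

# read the file's contents, not the repr of the file object
with open("crypt.txt") as f:
    decrypt = f.read()

# keep only letters and digits
crypt = re.sub(r'[^A-Za-z0-9]+', '', decrypt)
print(crypt)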

Python - Display string containing entity references as normal text

I have a Python string "&#39;&#39;Grassmere&#39;&#39;" as retrieved from a website.
I would like to have the &#39; displayed as the correct ASCII symbol (') but for some reason Python insists on just printing the ASCII code.
Batteries are included for this one
>>> import xmllib
>>> X=xmllib.XMLParser()
>>> X.translate_references("&#39;&#39;Grassmere&#39;&#39;")
"''Grassmere''"
Or without additional modules:
re.sub("&#(\d+);", lambda m: chr(int(m.group(1))), "''Grassmere''")
