SyntaxError: Non-UTF-8 code starting with '\x91' - python

I am trying to write a binary search program for a class, and I am pretty sure that my logic is right, but I keep getting a non-UTF-8 error. I have never seen this error and any help/clarification would be great! Thanks a bunch.
Here's the code.
def main():
    str names = [‘Ava Fischer’, ‘Bob White’, ‘Chris Rich’, ‘Danielle Porter’, ‘Gordon Pike’, ‘Hannah Beauregard’, ‘Matt Hoyle’, ‘Ross Harrison’, ‘Sasha Ricci’, ‘Xavier Adams’]
    binarySearch(names, input(str("Please Enter a Name.")))
    print("That name is at position "+position)
def binarySearch(array, searchedValue):
    begin = 0
    end = len(array) - 1
    position = -1
    found = False
    while !=found & begin<=end:
        middle=(begin+end)/2
        if array[middle]== searchedValue:
            found=True
            position = middle
        elif array[middle] >value:
            end=middle-1
        else:
            first =middle+1
    return position

Add this line at the top of your code. It may work.
# coding=utf8

Your editor replaced ' (ASCII 39) with U+2018 LEFT SINGLE QUOTATION MARK characters, usually a sign that you used Word or a similar word processor instead of a plain text editor; a word processor tries to make your text 'prettier' and auto-replaces things like simple quotes with fancy ones. The file was then saved in the Windows 1252 codepage, where the fancy quotes are stored as hex 91 bytes.
Python is having none of it. It wants source code saved in UTF-8 and using ' or " for quotation marks. Use Notepad, or better still, IDLE to edit your Python code instead.
You have numerous other errors in your code: you cannot use spaces in your variable names, for example, and Python uses and, not &, as the boolean AND operator. != is an operator requiring 2 operands (it means 'not equal', the opposite of ==); the boolean NOT operator is spelled not.
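For reference, here is one way the function could look once those problems are fixed (plain ASCII quotes, not/and spelled out, // for integer division so the index stays a whole number). This is a sketch, not the only possible cleanup:

def binarySearch(array, searchedValue):
    begin = 0
    end = len(array) - 1
    position = -1
    found = False
    while not found and begin <= end:
        middle = (begin + end) // 2   # // keeps middle an integer index
        if array[middle] == searchedValue:
            found = True
            position = middle
        elif array[middle] > searchedValue:
            end = middle - 1
        else:
            begin = middle + 1
    return position

def main():
    names = ['Ava Fischer', 'Bob White', 'Chris Rich', 'Danielle Porter',
             'Gordon Pike', 'Hannah Beauregard', 'Matt Hoyle',
             'Ross Harrison', 'Sasha Ricci', 'Xavier Adams']
    position = binarySearch(names, input("Please Enter a Name."))
    print("That name is at position " + str(position))

main()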

If you're using Notepad++, click Encoding at the top and choose Encode in UTF-8.

The character you are using to begin your string constants is not a valid string delimiter. You are using
‘Ava Fischer’ # ‘ and ’ as string delimiters
when it should have been either
'Ava Fischer' # ASCII 39 as string delimiter
or maybe
"Ava Fischer" # ASCII 34 as string delimiter

Add this line at the top of your code; it might help:
# -*- coding:utf-8 -*-

Related

replacing string from '\' into '/' Python

I've been struggling with some code where I need to change a simple \ into / in Python. It's a file path; Python doesn't read the path the Windows way, so I simply want to convert the Windows path so Python can read the file correctly.
I want to parse some text from a game to count statistics. I'm doing it this way:
import re
pathNumbers = "D:\Gry\Tibia\packages\TibiaExternal\log\test server.txt"
pathNumbers = re.sub(r"\\", r"/",pathNumbers)
fileNumbers = open (pathNumbers, "r")
print(fileNumbers.readline())
fileNumbers.close()
But the error I get back is
----> 6 fileNumbers = open (pathNumbers, "r") OSError: [Errno 22] Invalid argument: 'D:/Gry/Tibia/packages/TibiaExternal\test server.txt'
The problem is that re.sub() and .replace() give the same result: almost the whole path is replaced, but the last character that should change always stays untouched.
Do you have any solution for this? It seems like changing those characters is a sensitive point for Python.
Simple answer:
If you want to use paths on different platforms, join them with
os.path.join(path, *paths)
This way you don't have to work with the different separators at all.
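A short sketch of that approach, reusing the path components from the question:

import os

# Let os.path.join insert whichever separator is right for the platform.
pathNumbers = os.path.join("D:\\", "Gry", "Tibia", "packages",
                           "TibiaExternal", "log", "test server.txt")

with open(pathNumbers, "r") as fileNumbers:
    print(fileNumbers.readline())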
Answer to what you intended to do:
The actual problem is that your pathNumbers string is not a raw string (no leading r in the literal), which means the backslashes act as escape characters. In most cases this changes nothing, because the backslash-plus-letter combination has no special meaning; but \t is the tab character and \n would be the newline character, so those are no longer simple backslash characters by the time re.sub sees the string.
So simply write
pathNumbers = r"D:\Gry\Tibia\packages\TibiaExternal\log\test server.txt"

Python UTF-8 REGEX

I have a problem while trying to find text specified by a regex.
Everything worked perfectly fine, but when I added "\£" to my regex it started causing problems. I get SyntaxError: Non-ASCII character '\xc2' in file (...) but no encoding declared...
I've tried to solve this problem by using
import sys
reload(sys) # to enable `setdefaultencoding` again
sys.setdefaultencoding("UTF-8")
but it doesn't help. The re.UNICODE flag doesn't help, and saving the pattern as a unicode string (pat) doesn't help either. Is there any solution to fix this regex? I just want to build a regular expression that uses the pound sign. Thanks for the help.
k = text.encode('utf-8')
pat = u'salar.{1,6}?([0-9\-,\. \tkFFRroOMmTtAanNuUMm\$\&\;\£]{2,})'
pattern = re.compile(pat, flags = re.DOTALL|re.I|re.UNICODE)
salary = pattern.search(k).group(1)
print (salary)
The error is still there even if I comment out (put "#" in front of) all of those lines and skip them. Maybe it's not connected with the re library but with my settings?
The error message means Python cannot guess which character set you are using. It also tells you that you can fix it by telling it the encoding of your script.
# coding: utf-8
string = "£"
or equivalently
string = u"\u00a3"
Without an encoding declaration, Python sees a bunch of bytes which mean different things in different encodings. Rather than guess, it forces you to tell it what they mean. This is codified in PEP 263.
(ASCII is unambiguous [except if your system is EBCDIC I guess] so it knows what you mean if you use a pure-ASCII representation for everything.)
The encoding settings you were fiddling with affect how files and streams are read, and program I/O generally, but not how the program source is interpreted.
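Putting it together, a minimal sketch of a script that compiles once the encoding is declared (the sample text and the simplified pattern are illustrative, not taken from your data):

# coding: utf-8
import re

# The pound sign can now appear directly in the pattern source.
pat = u"salar.{1,6}?([0-9,.\\s£$]{2,})"
pattern = re.compile(pat, re.DOTALL | re.I | re.UNICODE)

match = pattern.search(u"Salary: £25,000 per year")
if match:
    print(match.group(1))  # ' £25,000 '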

python 3 regex not finding confirmed matches

So I'm trying to parse a bunch of citations from a text file using the re module in Python 3.4 (on, if it matters, a Mac running Mavericks). Here's some minimal code. Note that there are two commented lines: they represent two alternative searches. (Obviously, the little one, r'Rawls', is the one that works)
def makeRefList(reffile):
    print(reffile)
    # namepattern = r'(^[A-Z1][A-Za-z1]*-?[A-Za-z1]*),.*( \(?\d\d\d\d[a-z]?[.)])'
    # namepattern = r'Rawls'
    refsTuplesList = re.findall(namepattern, reffile, re.MULTILINE)
    print(refsTuplesList)
The string in question is ugly, and so I stuck it in a gist: https://gist.github.com/paultopia/6c48c398a42d4834f2ae
As noted, the search string r'Rawls' produces expected output ['Rawls', 'Rawls']. However, the other search string just produces an empty list.
I've confirmed this regex (partially) works using the regex101 tester. Confirmation here: https://regex101.com/r/kP4nO0/1 -- it matches what I expect it to match. Since it works in the tester, it should work in the code, right?
(n.b. I copied the text from terminal output from the first print command, then manually replaced \n characters in the string with carriage returns for regex101.)
One possible issue is that Python has prepended the little b (bytes) prefix to the string (is that called a "flag"?). This is an artifact of my attempt to convert the text from UTF-8 to ASCII, and I haven't figured out how to make it go away.
Yet re clearly is able to parse strings in that form. I know this because I'm converting two text files from utf-8 to ascii, and the following code works perfectly fine on the other string, converted from the other text file, which also has a little b in front of it:
def makeCiteList(citefile):
    print(citefile)
    citepattern = r'[\s(][A-Z1][A-Za-z1]*-?[A-Za-z1]*[ ,]? \(?\d\d\d\d[a-z]?[\s.,)]'
    rawCitelist = re.findall(citepattern, citefile)
    cleanCitelist = cleanup(rawCitelist)
    finalCiteList = list(set(cleanCitelist))
    print(finalCiteList)
    return(finalCiteList)
The other chunk of text, which the code immediately above matches correctly: https://gist.github.com/paultopia/a12eba2752638389b2ee
The only hypothesis I can come up with is that the first, broken, regex expression is puking on the combination of newline characters and the string being treated as a byte object, even though a) I know the regex is correct for newlines (because, confirmation from the linked regex101), and b) I know it's matching the strings (because, confirmation from the successful match on the other string).
If that's true, though, I don't know what to do about it.
Thus, questions:
1) Is my hypothesis right that it's the combination of newlines and b that blows up my regex? If not, what is?
2) How do I fix that?
a) replace the newlines with something in the string?
b) rewrite the regex somehow?
c) somehow get rid of that b and make it into a normal string again? (how?)
thanks!
Addition
In case this is a problem I need to fix upstream, here's the code I'm using to get the text files and convert them to ASCII, replacing non-ASCII characters.
This function gets called on UTF-8 .txt files saved by TextWrangler on Mavericks:
def makeCorpoi(citefile, reffile):
    citebox = open(citefile, 'r')
    refbox = open(reffile, 'r')
    citecorpus = citebox.read()
    refcorpus = refbox.read()
    citebox.close()
    refbox.close()
    corpoi = [str(citecorpus), str(refcorpus)]
    return corpoi
and then this function gets called on each element of the list the above function returns.
def conv2ASCII(bigstring):
    def convHandler(error):
        return ('1FOREIGN', error.start + 1)
    codecs.register_error('foreign', convHandler)
    bigstring = bigstring.encode('ascii', 'foreign')
    stringstring = str(bigstring)
    return stringstring
Aah. I've tracked it down and answered my own question. Apparently one needs to call a decode method on the encoded bytes to get a real string back. The following code produces an actual string, with newlines and everything, out the other end (though now I have to fix a bunch of other bugs before I can figure out if the final output is as expected):
def conv2ASCII(bigstring):
    def convHandler(error):
        return ('1FOREIGN', error.start + 1)
    codecs.register_error('foreign', convHandler)
    bigstring = bigstring.encode('ascii', 'foreign')
    newstring = bigstring.decode('ascii', 'foreign')
    return newstring
Apparently the str() function doesn't do the same job, for reasons that are mysterious to me. This is despite an answer here, How to make new line commands work in a .txt file opened from the internet?, which suggests that it does.
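A small illustration of the difference (not from the original code): in Python 3, calling str() on a bytes object returns its repr, b prefix and escaped \n included, whereas .decode() gives a real str that the regex can search:

import re

data = "Rawls\n".encode("ascii")

print(str(data))             # prints b'Rawls\n'  -- the repr, with a literal backslash-n
print(data.decode("ascii"))  # prints Rawls followed by a real newline
print(re.findall(r"Rawls", data.decode("ascii")))  # ['Rawls']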

How can I determine a Unicode character from its name in Python, even if that character is a control character?

I'd like to create an array of the Unicode code points which constitute white space in JavaScript (minus the Unicode-white-space code points, which I address separately). These characters are horizontal tab, vertical tab, form feed, space, non-breaking space, and BOM. I could do this with magic numbers:
whitespace = [0x9, 0xb, 0xc, 0x20, 0xa0, 0xfeff]
That's a little bit obscure; names would be better. The unicodedata.lookup method passed through ord helps some:
>>> ord(unicodedata.lookup("NO-BREAK SPACE"))
160
But this doesn't work for 0x9, 0xb, or 0xc -- I think because they're control characters, and the "names" FORM FEED and such are just alias names. Is there any way to map these "names" to the characters, or their code points, in standard Python? Or am I out of luck?
Kerrek SB's comment is a good one: just put the names in a comment.
BTW, Python also supports a named unicode literal:
>>> u"\N{NO-BREAK SPACE}"
u'\xa0'
But it uses the same unicode name database, and the control characters are not in it.
You could roll your own "database" for the control characters by parsing a few lines of the UCD files in the Unicode public directory. In particular, see the UnicodeData-6.1.0d3 file (or see the parent directory for earlier versions).
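A hypothetical sketch of that approach, assuming a local copy of UnicodeData.txt (field 10 of each record holds the old Unicode 1.0 name, which is where "FORM FEED" and friends live; some entries carry a parenthesized abbreviation such as "FORM FEED (FF)"):

def load_control_names(path="UnicodeData.txt"):
    # Map old-style control-character names to the characters themselves.
    names = {}
    with open(path) as ucd:
        for line in ucd:
            fields = line.split(";")
            if fields[1] == "<control>" and fields[10]:
                names[fields[10]] = chr(int(fields[0], 16))
    return names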
I don't think it can be done in standard Python. The unicodedata module uses the UnicodeData.txt v5.2.0 Unicode database. Notice that the control characters are all assigned the name <control> (the second field, semicolon-delimited).
The script Tools/unicode/makeunicodedata.py in the Python source distribution is used to generate the table used by the Python runtime. The makeunicodename function looks like this:
def makeunicodename(unicode, trace):
    FILE = "Modules/unicodename_db.h"
    print "--- Preparing", FILE, "..."
    # collect names
    names = [None] * len(unicode.chars)
    for char in unicode.chars:
        record = unicode.table[char]
        if record:
            name = record[1].strip()
            if name and name[0] != "<":
                names[char] = name + chr(0)
    ...
Notice that it skips over entries whose name begins with "<". Hence, there is no name that can be passed to unicodedata.lookup that will give you back one of those control characters.
Just hardcode the code points for horizontal tab, line feed, and carriage return, and leave a descriptive comment. As the Zen of Python goes, "practicality beats purity".
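For example, something along these lines keeps the names visible without relying on unicodedata (the names in the comments are taken from the Unicode code charts):

whitespace = [
    0x0009,  # CHARACTER TABULATION (horizontal tab)
    0x000B,  # LINE TABULATION (vertical tab)
    0x000C,  # FORM FEED
    0x0020,  # SPACE
    0x00A0,  # NO-BREAK SPACE
    0xFEFF,  # ZERO WIDTH NO-BREAK SPACE (byte order mark)
]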
A few points:
(1) "BOM" is not a character. BOM is a byte sequence that appears at the start of a file to indicate the byte order of a file that is encoded in UTF-nn. BOM is u'\uFEFF'.encode('UTF-nn'). Reading a file with the appropriate codec will slurp up the BOM; you don't see it as a Unicode character. A BOM is not data. If you do see u'\uFEFF' in your data, treat it as a (deprecated) ZERO-WIDTH NO-BREAK SPACE.
(2) "minus the Unicode-white-space code points, which I address separately"?? Isn't NO-BREAK SPACE a "Unicode-white-space" code point?
(3) Your Python appears to be broken; mine does this:
>>> ord(unicodedata.lookup("NO-BREAK SPACE"))
160
(4) You could use escape sequences for the first three.
>>> map(hex, map(ord, "\t\v\f"))
['0x9', '0xb', '0xc']
(5) You could use " " for the fourth one.
(6) Even if you could use names, the readers of your code would still be applying blind faith that e.g. "FORM FEED" is a whitespace character.
(7) What happened to \r and \n?
Assuming you're working with Unicode strings, the first five items in your list, plus all other Unicode space characters, will be matched by the \s option when using a regular expression. Using Python 3.1.2:
>>> import re
>>> s = '\u0009,\u000b,\u000c,\u0020,\u00a0,\ufeff'
>>> s
'\t,\x0b,\x0c, ,\xa0,\ufeff'
>>> re.findall(r'\s', s)
['\t', '\x0b', '\x0c', ' ', '\xa0']
And as for the byte-order mark, the one given can be referred to as codecs.BOM_BE or codecs.BOM_UTF16_BE (though in Python 3+, it's returned as a bytes object rather than str).
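A quick check of that in a Python 3 session:
>>> import codecs
>>> codecs.BOM_UTF16_BE
b'\xfe\xff'
>>> codecs.BOM_UTF16_BE.decode('utf-16-be')
'\ufeff'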
The official Unicode recommendation for newlines may or may not be at odds with the way the Python codecs module handles newlines. Since u'\n' is often said to mean "new line", one might expect based on this recommendation for the Python string u'\n' to represent character U+2028 LINE SEPARATOR and to be encoded as such, rather than as the semantic-less control character U+000A. But I can only imagine the confusion that would result if the codecs module actually implemented that policy, and there are valid counter-arguments besides. Ditto for horizontal/vertical tab and form feed, which are probably not really characters but controls anyway. (I would certainly consider backspace to be a control, not a character.)
Your question seems to assume that treating U+000A as a control character (instead of a line separator) is wrong; but that is not at all certain. Perhaps it is more wrong for text processing applications everywhere to assume that a legacy printer-platen-scrolling control signal is really a true "line separator".
You can extend the lookup function to handle the characters that aren't included.
import unicodedata

def unicode_lookup(x):
    try:
        ch = unicodedata.lookup(x)
    except KeyError:
        control_chars = {'LINE FEED': unichr(0x0a), 'CARRIAGE RETURN': unichr(0x0d)}
        if x in control_chars:
            ch = control_chars[x]
        else:
            raise
    return ch
>>> unicode_lookup('SPACE')
u' '
>>> unicode_lookup('LINE FEED')
u'\n'
>>> unicode_lookup('FORM FEED')
Traceback (most recent call last):
  File "<pyshell#17>", line 1, in <module>
    unicode_lookup('FORM FEED')
  File "<pyshell#13>", line 3, in unicode_lookup
    ch = unicodedata.lookup(x)
KeyError: "undefined character name 'FORM FEED'"
