SyntaxError: Non-ASCII character in Python

Could somebody tell me which character is a non-ASCII character in the following:
Columns(str) – comma-seperated list of values. Works only if format is tab or xls. For UnitprotKB, some possible columns are: id, entry name, length, organism. Some column names must be followed by a database name (i.e. ‘database(PDB)’). Again see uniprot website for more details. See also _valid_columns for the full list of column keyword.
Essentially I am defining a class and trying to give it a comment to define how it works:
def test(self, uniprot_id):
    '''
    Same as the UniProt.search() method arguments:
    search(query, frmt='tab', columns=None, include=False, sort='score', compress=False, limit=None, offset=None, maxTrials=10)
    query (str) -- query must be a valid uniprot query. See http://www.uniprot.org/help/text-search, http://www.uniprot.org/help/query-fields See also example below
    frmt (str) -- a valid format amongst html, tab, xls, fasta, gff, txt, xml, rdf, list, rss. If tab or xls, you can also provide the columns argument. (default is tab)
    include (bool) -- include isoform sequences when the frmt parameter is fasta. Include description when frmt is rdf.
    sort (str) -- by score by default. Set to None to bypass this behaviour
    compress (bool) -- gzip the results
    limit (int) -- Maximum number of results to retrieve.
    offset (int) -- Offset of the first result, typically used together with the limit parameter.
    maxTrials (int) -- this request is unstable, so we may want to try several time.
    Columns(str) -- comma-seperated list of values. Works only if format is tab or xls. For UnitprotKB, some possible columns are: id, entry name, length, organism. Some column names must be followed by a database name (i.e. ‘database(PDB)’). Again see uniprot website for more details. See also _valid_columns for the full list of column keyword.
    '''
    u = UniProt()
    uniprot_entry = u.search(uniprot_id)
    return uniprot_entry
Without line 52, i.e. the one beginning with 'Columns' in the docstring, this works as expected, but as soon as I describe what 'columns' is I get the following error:
SyntaxError: Non-ASCII character '\xe2' in file /home/cw00137/Documents/Python/Identify_gene.py on line 52, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
Does anybody know what is going on?

You are using 'fancy' curly quotes in that line:
>>> u'‘database(PDB)’'
u'\u2018database(PDB)\u2019'
That's a U+2018 LEFT SINGLE QUOTATION MARK at the start and U+2019 RIGHT SINGLE QUOTATION MARK at the end.
Use ASCII quotes (U+0027 APOSTROPHE or U+0022 QUOTATION MARK) or declare an encoding other than ASCII for your source.
You are also using an U+2013 EN DASH:
>>> u'Columns(str) –'
u'Columns(str) \u2013'
Replace that with a U+002D HYPHEN-MINUS.
All three characters encode to UTF-8 with a leading E2 byte:
>>> u'\u2013 \u2018 \u2019'.encode('utf8')
'\xe2\x80\x93 \xe2\x80\x98 \xe2\x80\x99'
which you then see reflected in the SyntaxError exception message.
You may want to avoid using these characters in the first place. It could be that your OS is replacing these as you type, or you are using a word processor instead of a plain text editor to write your code and it is replacing these for you. You probably want to switch that feature off.
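If you want to pinpoint the offenders, here is a minimal sketch that reports the position of every non-ASCII byte in a source file (the filename is the one from the traceback; reading in binary mode avoids decoding errors, and iterating a bytearray yields ints on both Python 2 and 3):
with open('Identify_gene.py', 'rb') as f:
    for lineno, line in enumerate(f, 1):
        for col, byte in enumerate(bytearray(line), 1):
            if byte > 0x7f:  # outside the ASCII range
                print('line %d, column %d: byte 0x%02x' % (lineno, col, byte))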

I previously encountered the same problem and the same error; Python 2 defaults to ASCII as the source encoding.
You can try declaring the following comment on the first or second line of the .py file:
# -*- coding: utf-8 -*-
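For example, with that declaration in place, Python 2 will happily compile a file containing the curly quotes from the question (a minimal sketch):
# -*- coding: utf-8 -*-

def test():
    '''Some column names must be followed by a database name (i.e. ‘database(PDB)’).'''
    pass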

Related

How to create variables for substitution based on user for unique filepath in python?

I'm writing code that I want to make generic to whoever needs to follow it.
Part of the code is reading in an Excel file that the user has to download. I know that each user has a specific 6-digit unique ID, and the folder and name of the file remain the same. Is there some way for me to modify the pd.read_csv call so that it works like this:
USERID = '123abc'
pd.read_csv(r'C:\Users\USERID\Documents\Dataset.csv')
I keep getting stuck because there is an ' right next to the r, so concatenation with a constant does not seem to work.
Similarly, is there a method for code for exporting that would insert the current date in the title?
What you want to use are formatted strings. The r preceding the string literal in your code denotes that you are creating a raw string, which only changes how backslashes are treated; no string literal, raw or not, substitutes variables, so USERID inside the quotes stays literal text. Python's docs explain what these raw strings are:
Both string and bytes literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings and treat backslashes as literal characters. (3.10.4 Python Language Reference, Lexical Analysis)
Like Fredericka mentions in her comment, the formatted string literal (f-string) is a great way to accomplish what you're trying to do; it requires Python 3.6 or greater. You can also use the format method on the string, which works on older versions and does the same thing.
import pandas as pd

# set the user ID
user_id = "PythonUser1"

# print the full file path
print("C:\\Users\\{}\\Documents\\Dataset.csv".format(user_id))

# read the CSV file using a formatted string literal (note the f prefix)
my_csv = pd.read_csv(f"C:\\Users\\{user_id}\\Documents\\Dataset.csv")

# read the CSV file using the format method
my_csv = pd.read_csv("C:\\Users\\{}\\Documents\\Dataset.csv".format(user_id))
For more information, I'd recommend checking out the official Python docs on input and output.
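As for the second part of the question (inserting the current date into an export's title), one common approach is to build the filename with datetime; a minimal sketch, with the path pattern borrowed from above:
import datetime

import pandas as pd

user_id = "PythonUser1"
today = datetime.date.today().isoformat()  # e.g. '2024-01-31'

df = pd.read_csv(f"C:\\Users\\{user_id}\\Documents\\Dataset.csv")
df.to_csv(f"C:\\Users\\{user_id}\\Documents\\Dataset_{today}.csv", index=False)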

Parsing Email Headers Tabs

I am parsing emails with the Python email module.
When I parse a message, the parser does not remove the tab in front of folded header values:
from email.parser import Parser
from email.policy import default
testmail = """Date: Wed, 26 Jan 2022 10:45:29 +0100
Message-ID:
<123123123123123123123123123123123123123.testinst.themultiverse.com>
Subject:
=?iso-8859-1?Q?Auftragsbest=E4tigung_blablabla?=
=?iso-8859-1?Q?_one nice thing?=
Content Body Whatnot"""
message = Parser(policy=default).parsestr(testmail)
print(repr(message["Message-Id"]))
print(repr(message["Subject"]))
results in:
'\t<123123123123123123123123123123123123123.testinst.themultiverse.com>'
'\tAuftragsbestätigung blablabla one nice thing'
I have tried the different policies of the email parser, but I have not managed to remove the tab at the beginning. I saw that the header_source_parse method of the EmailPolicy class does strip leading whitespace, but only from the first source line.
<pythonlib>/email/policy.py:
[...]
value = value.lstrip(' \t') + ''.join(sourcelines[1:])
[...]
Not sure if that is intended behavior or a bug.
My question now: Is there a way in the standard library to do this, or do I need to write a custom policy? The E-Mails are unchanged from an IMAP Server (exchange) and it feels strange that the standard tools do not cover this.
Something makes me think that the message is not strictly conformant to RFC 5322.
We can see at 3.2.2. Folding White Space and Comments:
However, where CFWS occurs in this specification, it MUST NOT
be inserted in such a way that any line of a folded header field is
made up entirely of WSP characters and nothing else.
But for the Subject and Message-ID fields here, the first line contains only whitespace before the first newline.
IIUC, this corresponds to an obsolete syntax, because we find at 4. Obsolete Syntax:
Another key difference between the obsolete and the current syntax is
that the rule in section 3.2.2 regarding lines composed entirely of
white space in comments and folding white space does not apply.
The doc for EmailPolicy from Python Standard Library is even more explicit on what happens:
header_source_parse(sourcelines)
The name is parsed as everything up to the ‘:’ and returned unmodified. The value is determined by stripping leading whitespace off the remainder of the first line, joining all subsequent lines together, and stripping any trailing carriage return or linefeed characters.
As the tab occurs on the second line, it is not stripped.
I am unsure whether this interpretation is correct, but a possible workaround is to specialize a subclass of EmailPolicy to strip that initial whitespace:
import email.policy


class ObsoletePolicy(email.policy.EmailPolicy):
    def header_source_parse(self, sourcelines):
        # let the default policy split name/value, then strip whatever
        # leading whitespace survived from the folded continuation lines
        header, value = super().header_source_parse(sourcelines)
        value = value.lstrip(' \t\r\n')
        return header, value
If you use:
message = Parser(policy=ObsoletePolicy()).parsestr(testmail)
you will now get for print(repr(message['Subject'])):
'Auftragsbestätigung blablabla one nice thing'
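and print(repr(message['Message-Id'])) likewise loses its leading tab (a quick check, same session as above):
'<123123123123123123123123123123123123123.testinst.themultiverse.com>'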

Unable to kick out unwanted characters from a long string

How can I remove unwanted characters from a long text using .replace() or anything of that sort? The symbols I wish to kick out of the text are ' { } [ ] (commas are not included). My existing text is:
{'SearchText':'319 lizzie','ResultList':[{'PropertyQuickRefID':'R016698','PropertyType':'Real'}],'TaxYear':2018}
I tried with the below code:
content='''
{'SearchText':'319 lizzie','ResultList':[{'PropertyQuickRefID':'R016698','PropertyType':'Real'}],'TaxYear':2018}
'''
print(content.replace("'",""))
Output I got (by the way, if I keep chaining .replace().replace() with different symbols it does work, but I wish to do the same in a single pass if possible):
{SearchText:319 lizzie,ResultList:[{PropertyQuickRefID:R016698,PropertyType:Real}],TaxYear:2018}
I wish I could use the replace function like .replace("',{,},[,]", ""). However, I'm not after any solution derived from regex. String manipulation is what I expected. Thanks in advance.
content=r"{'SearchText':'319 lizzie','ResultList':[{'PropertyQuickRefID':'R016698','PropertyType':'Real'}],'TaxYear':2018}"
igno = "{}[]''´´``''"
cleaned = ''.join([x for x in content if x not in igno])
print(cleaned)
PyFiddle 3.6:
SearchText:319 lizzie,ResultList:PropertyQuickRefID:R016698,PropertyType:Real,TaxYear:2018
In 2.7 I get an error:
Non-ASCII character '\xc2' in file main.py on line 3, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
which can be fixed by adding # This Python file uses the following encoding: utf-8 as the first line of the source code - which then gives identical output.
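If you are on Python 3, a single-pass alternative that stays within plain string manipulation is str.translate; a minimal sketch:
content = r"{'SearchText':'319 lizzie','ResultList':[{'PropertyQuickRefID':'R016698','PropertyType':'Real'}],'TaxYear':2018}"

# str.maketrans('', '', deletechars) builds a table that deletes every
# character listed in the third argument, all in one pass
cleaned = content.translate(str.maketrans('', '', "'{}[]"))
print(cleaned)
# SearchText:319 lizzie,ResultList:PropertyQuickRefID:R016698,PropertyType:Real,TaxYear:2018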

Python YAML preserving newline without adding extra newline

I have a problem similar to this question, in that I need to insert newlines in a YAML mapping value string and prefer not to insert \n myself. The answer there suggests using:
Data: |
  Some data, here and a special character like ':'
  Another line of data on a separate line
instead of
Data: "Some data, here and a special character like ':'\n
Another line of data on a separate line"
which also adds a newline at the end, which is unacceptable.
I tried using Data: > but that gave completely different results (folding turns the inner newlines into spaces). I have been stripping the final newline after reading in the YAML file, and of course that works, but it is not elegant. Is there a better way to preserve newlines without adding an extra one at the end?
I am using Python 2.7, FWIW.
Using | makes a scalar into a literal block style scalar, but the default behaviour of | is clipping, and that doesn't get you the string you wanted (it leaves the final newline).
You can "modify" the behaviour of | by attaching a block chomping indicator:
Strip
Stripping is specified by the “-” chomping indicator. In this case, the final line break and any trailing empty lines are excluded from the scalar’s content.
Clip
Clipping is the default behavior used if no explicit chomping indicator is specified. In this case, the final line break character is preserved in the scalar’s content. However, any trailing empty lines are excluded from the scalar’s content.
Keep
Keeping is specified by the “+” chomping indicator. In this case, the final line break and any trailing empty lines are considered to be part of the scalar’s content. These additional lines are not subject to folding.
By adding the strip chomping indicator '-' to '|', you can prevent/strip the final newline:¹
import ruamel.yaml as yaml

yaml_str = """\
Data: |-
  Some data, here and a special character like ':'
  Another line of data on a separate line
"""

data = yaml.load(yaml_str)
print(data)
gives:
{'Data': "Some data, here and a special character like ':'\nAnother line of data on a separate line"}
¹ This was done using ruamel.yaml, of which I am the author. You should get the same result with PyYAML (ruamel.yaml is a superset of it that preserves comments and literal scalar blocks on round-trip).
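For completeness, a quick cross-check sketch with PyYAML (assuming it is installed; safe_load is its standard entry point):
import yaml  # PyYAML

yaml_str = """\
Data: |-
  Some data, here and a special character like ':'
  Another line of data on a separate line
"""

print(yaml.safe_load(yaml_str))
# {'Data': "Some data, here and a special character like ':'\nAnother line of data on a separate line"}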

How can I determine a Unicode character from its name in Python, even if that character is a control character?

I'd like to create an array of the Unicode code points which constitute white space in JavaScript (minus the Unicode-white-space code points, which I address separately). These characters are horizontal tab, vertical tab, form feed, space, non-breaking space, and BOM. I could do this with magic numbers:
whitespace = [0x9, 0xb, 0xc, 0x20, 0xa0, 0xfeff]
That's a little bit obscure; names would be better. The unicodedata.lookup method passed through ord helps some:
>>> ord(unicodedata.lookup("NO-BREAK SPACE"))
160
But this doesn't work for 0x9, 0xb, or 0xc -- I think because they're control characters, and the "names" FORM FEED and such are just alias names. Is there any way to map these "names" to the characters, or their code points, in standard Python? Or am I out of luck?
Kerrek SB's comment is a good one: just put the names in a comment.
BTW, Python also supports a named unicode literal:
>>> u"\N{NO-BREAK SPACE}"
u'\xa0'
But it uses the same unicode name database, and the control characters are not in it.
You could roll your own "database" for the control characters by parsing a few lines of the UCD files in the Unicode public directory. In particular, see the UnicodeData-6.1.0d3 file (or see the parent directory for earlier versions).
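A rough sketch of that approach, assuming a local copy of UnicodeData.txt (per UAX #44, field 1 is the Name, which is just '<control>' for control characters, and field 10 is the obsolete Unicode 1.0 name that carries the familiar labels):
def load_control_names(path='UnicodeData.txt'):
    """Map the old Unicode 1.0 names of control characters to the characters."""
    names = {}
    with open(path) as f:
        for line in f:
            fields = line.split(';')
            # field 0 is the code point, field 10 the Unicode 1.0 name
            if fields[1] == '<control>' and fields[10]:
                names[fields[10]] = unichr(int(fields[0], 16))  # chr() on Python 3
    return names
The exact labels vary between versions of the file; and as an aside, Python 3.3+ accepts these aliases directly, so unicodedata.lookup('FORM FEED') just works there.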
I don't think it can be done in standard Python. The unicodedata module uses the UnicodeData.txt v5.2.0 Unicode database. Notice that the control characters are all assigned the name <control> (the second field, semicolon-delimited).
The script Tools/unicode/makeunicodedata.py in the Python source distribution is used to generate the table used by the Python runtime. The makeunicodename function looks like this:
def makeunicodename(unicode, trace):
    FILE = "Modules/unicodename_db.h"
    print "--- Preparing", FILE, "..."
    # collect names
    names = [None] * len(unicode.chars)
    for char in unicode.chars:
        record = unicode.table[char]
        if record:
            name = record[1].strip()
            if name and name[0] != "<":
                names[char] = name + chr(0)
    ...
Notice that it skips over entries whose name begins with "<". Hence, there is no name that can be passed to unicodedata.lookup that will give you back one of those control characters.
Just hardcode the code points for horizontal tab, line feed, and carriage return, and leave a descriptive comment. As the Zen of Python goes, "practicality beats purity".
A few points:
(1) "BOM" is not a character. BOM is a byte sequence that appears at the start of a file to indicate the byte order of a file that is encoded in UTF-nn. BOM is u'\uFEFF'.encode('UTF-nn'). Reading a file with the appropriate codec will slurp up the BOM; you don't see it as a Unicode character. A BOM is not data. If you do see u'\uFEFF' in your data, treat it as a (deprecated) ZERO-WIDTH NO-BREAK SPACE.
(2) "minus the Unicode-white-space code points, which I address separately"?? Isn't NO-BREAK SPACE a "Unicode-white-space" code point?
(3) Your Python appears to be broken; mine does this:
>>> ord(unicodedata.lookup("NO-BREAK SPACE"))
160
(4) You could use escape sequences for the first three.
>>> map(hex, map(ord, "\t\v\f"))
['0x9', '0xb', '0xc']
(5) You could use " " for the fourth one.
(6) Even if you could use names, the readers of your code would still be applying blind faith that e.g. "FORM FEED" is a whitespace character.
(7) What happened to \r and \n?
Assuming you're working with Unicode strings, the first five items in your list, plus all other Unicode space characters, will be matched by the \s option when using a regular expression. Using Python 3.1.2:
>>> import re
>>> s = '\u0009,\u000b,\u000c,\u0020,\u00a0,\ufeff'
>>> s
'\t,\x0b,\x0c, ,\xa0,\ufeff'
>>> re.findall(r'\s', s)
['\t', '\x0b', '\x0c', ' ', '\xa0']
And as for the byte-order mark, the one given can be referred to as codecs.BOM_BE or codecs.BOM_UTF16_BE (though in Python 3+, it's returned as a bytes object rather than str).
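For example (an interactive check in Python 3; the constants come straight from the codecs module):
>>> import codecs
>>> codecs.BOM_UTF16_BE
b'\xfe\xff'
>>> codecs.BOM_UTF16_BE.decode('utf-16-be')
'\ufeff'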
The official Unicode recommendation for newlines may or may not be at odds with the way the Python codecs module handles newlines. Since u'\n' is often said to mean "new line", one might expect based on this recommendation for the Python string u'\n' to represent character U+2028 LINE SEPARATOR and to be encoded as such, rather than as the semantic-less control character U+000A. But I can only imagine the confusion that would result if the codecs module actually implemented that policy, and there are valid counter-arguments besides. Ditto for horizontal/vertical tab and form feed, which are probably not really characters but controls anyway. (I would certainly consider backspace to be a control, not a character.)
Your question seems to assume that treating U+000A as a control character (instead of a line separator) is wrong; but that is not at all certain. Perhaps it is more wrong for text processing applications everywhere to assume that a legacy printer-platen-scrolling control signal is really a true "line separator".
You can extend the lookup function to handle the characters that aren't included.
import unicodedata

def unicode_lookup(x):
    try:
        ch = unicodedata.lookup(x)
    except KeyError:
        # fall back to a hand-rolled table for names the database lacks
        control_chars = {'LINE FEED': unichr(0x0a), 'CARRIAGE RETURN': unichr(0x0d)}
        if x in control_chars:
            ch = control_chars[x]
        else:
            raise
    return ch
>>> unicode_lookup('SPACE')
u' '
>>> unicode_lookup('LINE FEED')
u'\n'
>>> unicode_lookup('FORM FEED')
Traceback (most recent call last):
File "<pyshell#17>", line 1, in <module>
unicode_lookup('FORM FEED')
File "<pyshell#13>", line 3, in unicode_lookup
ch = unicodedata.lookup(x)
KeyError: "undefined character name 'FORM FEED'"
