Python, Windows and a Unicode file - python

I'm trying to execute a simple utility I wrote for Linux, which I thought it would be executed without problem on Windows. Wrong.
The script parses a simple file using the "re" module for regex. The problem is that the expression fails every time because Windows doesn't treat well the text file, which is UTF-8, because it contains things like áéíóú or ñ (it's in Spanish).
I've found a lot of stuff about printing text in Unicode format, but have found nothing about reading an Unicode line from a text file or using regex with Unicode on Windows. Thought you might shed some light on the subject.

open() uses locale.getpreferredencoding(False) encoding to decode a file. It is likely to be utf-8 on POSIX systems and it is something else on Windows e.g., cp1252.
If you know the text in the file is stored using utf-8 character encoding then pass it explicitly:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import io
import re
with io.open("filename.txt", encoding='utf-8') as file:
for line in file:
if re.search(u"(?u)unicode\s+pattern…", line):
# found..

Related

Python not picking up UTF-8 encoding

I am having some trouble encoding ascii characters to UTF-8, or a string is not picking up the encoding.
import unicodecsv as csv
import re
import pyodbc
import sys
import unicodedata
#!/usr/bin/python
# -*- coding: UTF-8 -*-
def remove_non_ascii_1(text):
text.encode('utf-8')
for i in text:
return ''.join(i for i in text if i=='£')
In Python 2.7 I get the error
SyntaxError: Non-ASCII character '\xc2' in file on line 16, but no encoding declared; see SyntaxError: Non-ASCII character '\xc2' in file.
With the Unicode replacement
return ''.join(i for i in text if i=='\xc2')
the error is
UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
Sample text :
row from a csv file reading in
[u'06/11/2020', u'ABC', u'32154E', u'3214', u'DEF', u'Cash Purchase', u'Final', u'', u'20.00%', u'ABC', u'Sold From Pickup', u'New ', u'10.00%', u'0', u'15%', u'\xa3469.84', u'Jonathan Jerome', u'3', u'\xa3140.95', u'2%', u'\xa393.97', u'\xa39,396.83', u'', u'\xa35,638.00', u'30/06/2020', u'4', u'Boiler-Amended']
I want to remove the \xa3 or £ in the currency fields.
First 2 things ahead:
Don't use Python 2 any more because of reason mentioned here!
Don't use different encodings in Python 2.
TL;DR Python 3 just improved so many things regarding encodings that it simply isn't worth it.
Whole story: read here
Ok this out of the way let's start fixing your code.
As Klaus D. already mentioned you do not save the result of text. This leads to an encoding warning when comparing seamingly equal characters (£ and £) but one is encoded in the encoding coming from the file you read the other one is encoded in ascii (despite you encoding your code in -*- coding: UTF-8 -*-. This is just to show what your code-file is encoded in, this does not change the behaviour of the interpreter regarding str-parsing).
Edit: Also when comparing to the character you will need to compare to a unicode char so you could either convert it or simply tell the interpreter to encode it as unicode in the first place (that's why I added a leading 'u' in front of your '£')
To fix this simply safe your result into text again after you called text.encode('utf-8').
Additionally the "shebang" and the encoding info should always be on the very top of a file that as soon as you open the file you know what you are dealing with.
Something else I would correct is the first for-loop. this one is unnecessary because you return out of this function anyway after you handled the first element.
This means the completely "corrected" code is this.
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import unicodecsv as csv
import re
import pyodbc
import sys
import unicodedata
def remove_non_ascii_1(text):
text.encode('utf-8')
return ''.join(i for i in text if i==u'£')
PS: You should really think again about whether the def remove_non_ascii_1(text) is really necessary. By the looks of it you already input a list of unicode encoded strings which you probably directly read from the file. This means you don't need to correct encoding though the comparison for '£' could stay. You might just want to rename the method ;)
Hope this helped and fixed possible unclarities about Python 2 encodings :D
If you print your list as a whole now you will see it still contains \xca and not the actual '£' but if you print the elements seperately it works fine. This is because the __str__() method of list does not encode unicodes directly but uses the standard ascii encoding...
Python 3 greatly improved unicode text handling. If you have to use Python 2.7, I would recommend using the codecs library when reading text files since it helps you with Pyhton 2.7 unicode issues:
import codecs
fp = codecs.open("file", "r"; encoding="utf-8")
In your case I noticed that you are using unicodecsv as a drop-in csv replacement. In this case, you can hand the parameter encoding="utf-8" when reading the csv file into a list:
r = csv.reader(f, encoding='utf-8')
For just removing non-Ascii characters I would recommend checking this good answer on StackOverflow

write chinese command line window using python

Im new in python
I'm trying to print some chineses word to command line windows 10 and file but got a problem:
Here is my code:
fh = open("hello.txt", "w")
str="欢迎大家加入自由职业者群体。谢谢大家"
print(str)
fh.write(str)
fh.close()
The default encoding of files is locale.getpreferredencoding(False), which seems to be cp1252 on your system. Specify the encoding when opening the file.
Also use with and the file will be closed for you when it exits the block:
#!python3.6
with open('hello.txt','w',encoding='utf8') as fh:
str="欢迎大家加入自由职业者群体。谢谢大家"
print(str)
fh.write(str)
To see the Chinese characters on the console you'll need to install a Chinese language pack, and change the console font to one that supports Chinese. Using an IDE that supports UTF-8 will also work. The "boxed question mark" characters are what is displayed when the font doesn't support the characters. If you cut-n-paste those characters to an application like Notepad that has support for Chinese fonts you should see the correct characters.
Here's my US Windows system with the Chinese Language Pack. The console is configured with the SimHei font.
Couple of issues:
There shouldn't be an identation after the fh variable declaration. You shouldn't name a string "str" because that's a builtin function. If you want to use characters outside the latin alphabet you need to declare that you're using UTF-8 like so: # -*- coding: utf-8 -*- (put that at the top of your file). Then it should work. Although terminal does sometimes have issues with foreign characters.
# -*- coding: utf-8 -*-
fh = open("hello.txt", "w")
str1="欢迎大家加入自由职业者群体。谢谢大家"
print(str1)
fh.write(str1)
fh.close()
Edit
Official solution is, use PyCharm!
For me, changing the font in command line worked.
System: Windows 10
Font that worked: NSimSun

SyntaxError: Non-ASCII character - Scrapy [duplicate]

Say I have a function:
def NewFunction():
return '£'
I want to print some stuff with a pound sign in front of it and it prints an error when I try to run this program, this error message is displayed:
SyntaxError: Non-ASCII character '\xa3' in file 'blah' but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details
Can anyone inform me how I can include a pound sign in my return function? I'm basically using it in a class and it's within the '__str__' part that the pound sign is included.
I'd recommend reading that PEP the error gives you. The problem is that your code is trying to use the ASCII encoding, but the pound symbol is not an ASCII character. Try using UTF-8 encoding. You can start by putting # -*- coding: utf-8 -*- at the top of your .py file. To get more advanced, you can also define encodings on a string by string basis in your code. However, if you are trying to put the pound sign literal in to your code, you'll need an encoding that supports it for the entire file.
Adding the following two lines at the top of my .py script worked for me (first line was necessary):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
First add the # -*- coding: utf-8 -*- line to the beginning of the file and then use u'foo' for all your non-ASCII unicode data:
def NewFunction():
return u'£'
or use the magic available since Python 2.6 to make it automatic:
from __future__ import unicode_literals
The error message tells you exactly what's wrong. The Python interpreter needs to know the encoding of the non-ASCII character.
If you want to return U+00A3 then you can say
return u'\u00a3'
which represents this character in pure ASCII by way of a Unicode escape sequence. If you want to return a byte string containing the literal byte 0xA3, that's
return b'\xa3'
(where in Python 2 the b is implicit; but explicit is better than implicit).
The linked PEP in the error message instructs you exactly how to tell Python "this file is not pure ASCII; here's the encoding I'm using". If the encoding is UTF-8, that would be
# coding=utf-8
or the Emacs-compatible
# -*- encoding: utf-8 -*-
If you don't know which encoding your editor uses to save this file, examine it with something like a hex editor and some googling. The Stack Overflow character-encoding tag has a tag info page with more information and some troubleshooting tips.
In so many words, outside of the 7-bit ASCII range (0x00-0x7F), Python can't and mustn't guess what string a sequence of bytes represents. https://tripleee.github.io/8bit#a3 shows 21 possible interpretations for the byte 0xA3 and that's only from the legacy 8-bit encodings; but it could also very well be the first byte of a multi-byte encoding. But in fact, I would guess you are actually using Latin-1, so you should have
# coding: latin-1
as the first or second line of your source file. Anyway, without knowledge of which character the byte is supposed to represent, a human would not be able to guess this, either.
A caveat: coding: latin-1 will definitely remove the error message (because there are no byte sequences which are not technically permitted in this encoding), but might produce completely the wrong result when the code is interpreted if the actual encoding is something else. You really have to know the encoding of the file with complete certainty when you declare the encoding.
Adding the following two lines in the script solved the issue for me.
# !/usr/bin/python
# coding=utf-8
Hope it helps !
You're probably trying to run Python 3 file with Python 2 interpreter. Currently (as of 2019), python command defaults to Python 2 when both versions are installed, on Windows and most Linux distributions.
But in case you're indeed working on a Python 2 script, a not yet mentioned on this page solution is to resave the file in UTF-8+BOM encoding, that will add three special bytes to the start of the file, they will explicitly inform the Python interpreter (and your text editor) about the file encoding.

Python: Can't write to file - UnicodeEncodeError

This code should write some text to file.
When I'm trying to write my text to console, everything works. But when I try to write the text into the file, I get UnicodeEncodeError. I know, that this is a common problem which can be solved using proper decode or encode, but I tried it and still getting the same UnicodeEncodeError. What am I doing wrong?
I've attached an example.
print "(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)".decode("utf-8")%(dict.get('name'),dict.get('description'),dict.get('ico'),dict.get('city'),dict.get('ulCislo'),dict.get('psc'),dict.get('weby'),dict.get('telefony'),dict.get('mobily'),dict.get('faxy'),dict.get('emaily'),dict.get('dic'),dict.get('ic_dph'),dict.get('kategorie')[0],dict.get('kategorie')[1],dict.get('kategorie')[2])
(StarBuy s.r.o.,Inzertujte s foto, auto-moto, oblečenie, reality, prácu, zvieratá, starožitnosti, dovolenky, nábytok, všetko pre deti, obuv, stroj....
with open("test.txt","wb") as f:
f.write("(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)".decode("utf-8")%(dict.get('name'),dict.get('description'),dict.get('ico'),dict.get('city'),dict.get('ulCislo'),dict.get('psc'),dict.get('weby'),dict.get('telefony'),dict.get('mobily'),dict.get('faxy'),dict.get('emaily'),dict.get('dic'),dict.get('ic_dph'),dict.get('kategorie')[0],dict.get('kategorie')[1],dict.get('kategorie')[2]))
UnicodeEncodeError: 'ascii' codec can't encode character u'\u010d' in position 50: ordinal not in range(128)
Where could be the problem?
To write Unicode text to a file, you could use io.open() function:
#!/usr/bin/env python
from io import open
with open('utf8.txt', 'w', encoding='utf-8') as file:
file.write(u'\u010d')
It is default on Python 3.
Note: you should not use the binary file mode ('b') if you want to write text.
# coding: utf8 that defines the source code encoding has nothing to do with it.
If you see sys.setdefaultencoding() outside of site.py or Python tests; assume the code is broken.
#ned-batchelder is right. You have to declare that the system default encoding is "utf-8". The coding comment # -*- coding: utf-8 -*- doesn't do this.
To declare the system default encoding, you have to import the module sys, and call sys.setdefaultencoding('utf-8'). However, sys was previously imported by the system and its setdefaultencoding method was removed. So you have to reload it before you call the method.
So, you will need to add the following codes at the beginning:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
You may need to explicitly declare that python use UTF-8 encoding.
The answer to this SO question explains how to do that: Declaring Encoding in Python
For Python 2:
Declare document encoding on top of the file (if not done yet):
# -*- coding: utf-8 -*-
Replace .decode with .encode:
with open("test.txt","wb") as f:
f.write("(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)".encode("utf-8")%(dict.get('name'),dict.get('description'),dict.get('ico'),dict.get('city'),dict.get('ulCislo'),dict.get('psc'),dict.get('weby'),dict.get('telefony'),dict.get('mobily'),dict.get('faxy'),dict.get('emaily'),dict.get('dic'),dict.get('ic_dph'),dict.get('kategorie')[0],dict.get('kategorie')[1],dict.get('kategorie')[2]))

Hebrew characters in Python code on eclipse

I'm writing python code on eclipse and whenver I use hebrew characters I get the following syntax error:
SyntaxError: Non-ASCII character '\xfa' in file ... on line 66, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
How do I declare unicode/utf-8 encoding?
I tried adding
-*- coding: Unicode -*-
or
-*- coding: utf-8 -*-
in the commented section in the beginnning of the py file. It didn't work.
I'm running eclipse with pydev, python 2.6 on windows 7.
I tried it also and here is my conclusion:
You should add
# -*- coding: utf-8 -*-
at the very first line in your file.
And yes, I work with windows...
If I got it right, you are missing the #
Ensure that the encoding the editor is using to enter data matches the declared encoding in the file metadata.
This isn't something unique to Eclipse or Python; it applies to all character data formats and text editors.
Python has a number of options for dealing with string literals in both the str and unicode types via escape sequences. I believe there were changes to string literals between Python 2 and 3.
Python 2.7 string literals
Python 3.2 string literals
I had the same thing and it was because I'd tried to do:
a='言語版の記事'
When I should have done:
a=u'言語版の記事'
I think it's python/pydev complaining when trying to parse the source, rather than eclipse as such.
"Unicode" is certainly wrong, and \xfa is not UTF-8. Figure out which encoding is actually being used and declare that instead.

Categories

Resources