Unicode portability with PyQT4/numpy on linux - python

I'm developing a multiplatform application with PyQt4 and numpy, and it doesn't work on my Linux system (Xubuntu 12.04) even though it seems to work great on Windows 7. The problem comes from my file-import method (it is in a PyQt4 class):
def import_folder(self, abs_path, folder_list):
    for folder_i in folder_list:
        filenames = glob.glob("%s/%s/*.txt" % (abs_path, folder_i))
        aa = list(filenames)[1]
        print filenames
        data_fichier = np.genfromtxt("%s" % (aa), delimiter=';', skip_header=35, usecols=[8])
        data_fichier2 = np.zeros((data_fichier.shape[0],))
And this is the error I get :
data_fichier = np.genfromtxt("%s" %aa,delimiter=';',skip_header=35,usecols=[8])
  File "/usr/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 1241, in genfromtxt
    fhd = iter(np.lib._datasource.open(fname, 'rbU'))
  File "/usr/lib/python2.7/dist-packages/numpy/lib/_datasource.py", line 145, in open
    return ds.open(path, mode)
  File "/usr/lib/python2.7/dist-packages/numpy/lib/_datasource.py", line 472, in open
    found = self._findfile(path)
  File "/usr/lib/python2.7/dist-packages/numpy/lib/_datasource.py", line 315, in _findfile
    filelist += self._possible_names(self.abspath(path))
  File "/usr/lib/python2.7/dist-packages/numpy/lib/_datasource.py", line 364, in abspath
    splitpath = path.split(self._destpath, 2)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 14: ordinal not in range(128)
I have printed my "filenames" variable:
[u'/home/*****/Documents/Plate_test/Track-Arousal_experiment_24_06_13_track_1-Trial 2-7-Subject 1.txt', u'/home/*****/Documents/Plate_test/Track-Arousal_experiment_24_06_13_track_1-Trial 2-5-Subject 1.txt']
Therefore, the problem comes from the unicode strings (the "u" prefix at the beginning of the list elements). I don't know at all why I get unicode strings on my Linux system. Do you have any idea how I can convert them back to "regular" strings (sorry for the terminology, I'm not an expert)? (Or other ideas about my problem.)
(Just so you know, when I launch the method as a simple function outside the PyQt class, it works great, so I suspect the class is involved.)
Thanks,
Lem

It seems that genfromtxt expects a str and not unicode, since it tries to decode the path. glob in your case is returning unicode strings, so try encoding your filenames:
filenames = map(str, filenames)
Or
filenames = map(lambda f: f.encode('utf-8'), filenames)
Or even:
def safe_str(s):
    if isinstance(s, unicode):
        return s.encode('utf-8')
    return s

filenames = map(safe_str, filenames)
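A version-agnostic variant of the same idea (the helper name `to_bytes` is mine) coerces any unicode name to UTF-8 bytes while leaving byte strings alone, so it behaves identically whichever string type glob hands back:

```python
def to_bytes(name, encoding="utf-8"):
    # Byte strings pass through untouched; unicode names are
    # encoded to bytes with the given codec (UTF-8 by default)
    if isinstance(name, bytes):
        return name
    return name.encode(encoding)
```

Then `filenames = map(to_bytes, filenames)` works the same on Windows and Linux regardless of what glob returned.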

Related

Stuck translating an FTP upload script from Python 2.x to Python 3.x

This is a Python script for FTP upload of various types of files from a local Raspberry Pi to a remote webserver: the original runs on several Raspberries under Python 2.x and Raspbian Buster (and earlier Raspbian versions) without any problems.
The txt file for this upload is generated by a Lua script setup like the one below:
file = io.open("/home/pi/PVOutput_Info.txt", "w+")
-- Opens a file named PVOutput_Info.txt (stored under the designated sub-folder of Domoticz)
file:write(" === PV-generatie & Consumptie === \n")
file:write(" Datum = " .. WSDatum .. "\n")
file:write(" Tijd = " .. WSTijd .. "\n")
file:close() -- closes the open file
os.execute("chmod a+rw /home/pi/PVTemp_Info.txt")
I'm trying to upgrade this simplest version for use with Python 3.x and Raspbian Bullseye, but I'm stuck solving the reported error.
It looks as if the codec now has a problem with byte 0xb0 in the txt file.
Any remedy or hint to circumvent this problem?
#!/usr/bin/python3
# (c)2017 script compiled by Toulon7559 from various material from forums, version 0.1 for upload of *.txt to /
# Original script running under Python2.x and Raspian_Buster
# Version 0165P3 of 20230201 is an experimental adaptation towards Python3.x and Raspian_Bullseye
# --------------------------------------------------
# Line006 = Function for FTP_UPLOAD to Server
# --------------------------------------------------
# Imports for script-operation
import ftplib
import os
# Definition of Upload_function
def upload(ftp, file):
    ext = os.path.splitext(file)[1]
    if ext in (".txt", ".htm", ".html"):
        ftp.storlines("STOR " + file, open(file))
    else:
        ftp.storbinary("STOR " + file, open(file, "rb"), 1024)
# --------------------------------------------------
# Line020 = Actual FTP-Login & -Upload
# --------------------------------------------------
ftp = ftplib.FTP("<FTP_server>")
ftp.login("<Login_UN>", "<login_PW>")
# set path to destination directory
ftp.cwd('/')
# set path to source directory
os.chdir("/home/pi/")
# upload of TXT-files
upload(ftp, "PVTemp_Info.txt")
upload(ftp, "PVOutput_Info.txt")
# reset path to root
ftp.cwd('/')
print ('End of script Misc_Upload_0165P3')
print()
Putty_CLI_Command
sudo python3 /home/pi/domoticz/scripts/python/Misc_upload_0165P3a.py
Resulting report at Putty's CLI
Start of script Misc_Upload_0165P3
Traceback (most recent call last):
  File "/home/pi/domoticz/scripts/python/Misc_upload_0165P3a.py", line 39, in <module>
    upload(ftp, "PVTemp_Info.txt")
  File "/home/pi/domoticz/scripts/python/Misc_upload_0165P3a.py", line 25, in upload
    ftp.storlines("STOR " + file, open(file))
  File "/usr/lib/python3.9/ftplib.py", line 519, in storlines
    buf = fp.readline(self.maxline + 1)
  File "/usr/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 175: invalid start byte
I'm afraid there's no easy mapping to Python 3. Two simple, but not 1:1, solutions for Python 3 would be:
Consider uploading all files using a binary mode. I.e. get rid of the
if ext in (".txt", ".htm", ".html"):
    ftp.storlines("STOR " + file, open(file))
else:
Or open the text file using the actual encoding that the files use (you have to find out):
open(file, encoding='cp1252')
See Error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
If you really need the exact functionality you had in Python 2 (that is: upload any text file, in whatever encoding, using FTP text transfer mode), it would be more complicated. Python 2 basically just translates any CR/LF EOL sequences in the file to CRLF (which is what the FTP specification requires), keeping the rest of the file intact.
You can copy the FTP.storbinary code and implement the above translation of buf byte-wise (without the decoding/encoding that Python 3's FTP.storlines/readline does).
If the files are not huge, a simple implementation is to load the whole file into memory, convert it there, and upload. This is not difficult if you know that all your files use the same EOL sequence; if not, the translation might be harder.
Or you may even give up on the translation, as most FTP servers do not care (they can handle any common EOL sequence). Just use the FTP.storbinary code as it is, only change TYPE I to TYPE A (which you need to do even if you implement the translation as per the previous point).
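A sketch of the load-convert-upload approach (the function names are mine, and this assumes the whole file fits in memory): normalize any mix of CR, LF, or CRLF to CRLF, then upload the bytes in binary mode so no codec is ever involved.

```python
import io


def normalize_eol(data):
    # Collapse CRLF and lone CR down to LF first,
    # then expand every LF to the CRLF that FTP text mode expects
    return (data.replace(b"\r\n", b"\n")
                .replace(b"\r", b"\n")
                .replace(b"\n", b"\r\n"))


def upload_text_as_binary(ftp, filename):
    # Read raw bytes so nothing is decoded, then upload in binary mode
    with open(filename, "rb") as f:
        data = normalize_eol(f.read())
    ftp.storbinary("STOR " + filename, io.BytesIO(data))
```

Because nothing is decoded, the offending 0xb0 byte (a degree sign in most single-byte encodings) passes through untouched.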
Btw, you also need to close the file in any case, so the correct code would be like:
with open(file) as f:
    ftp.storlines("STOR " + file, f)
Likewise for storbinary.

Transposing files from columns to rows for multiple files

I have approximately 200 files (with more coming in the future) whose data I need to transpose from columns into rows. I'm a microbiologist, so coding isn't my forte (I have worked with Linux and R in the past). One of my computer science friends was trying to help me write code in Python, but I had never used it before today.
The files are in .lvm format, and I'm working on a Mac. Items with 2 stars on either side are paths that I've hidden to protect my privacy.
The for loop is where I've been getting the error, but I'm not sure if that's where my problem lies or if it's something else.
This is the Python code I've been working on:
import os

lvm_directory = "/Users/**path**"
output_file = "/Users/**path**/Transposed.lvm"
newFile = True
output_delim = "\t"

for filename in os.listdir(lvm_directory):
    header = []
    data = []
    f = open(lvm_directory + "/" + filename)
    for l in f:
        sl = l.split()
        if (newFile):
            header += [sl[1]]
    f.close()
This is the error message I've been getting and I can't figure out how to work through it:
  File "<pyshell#97>", line 5, in <module>
    for l in f:
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd6 in position 345: invalid continuation byte
The rest of the code after this error is as follows, but I haven't worked through it yet due to the above error:
        f = open(output_file, 'w')
        f.write(output_delim.join(header))
        newFile = False
    else:
        f = open(output_file, 'a')
        f.write("\n" + output_delim.join(data))
    f.close()
Looks like your files have an encoding other than the default utf-8. It won't be plain ASCII either, since byte 0xd6 is outside the ASCII range; a single-byte encoding such as latin-1 or cp1252 (where 0xd6 is 'Ö') is more likely. You'd use something like:
with open(lvm_directory + "/" + filename, encoding="latin-1") as f:
    for l in f:
        # rest of your code here
It's generally more "pythonic" to use a with statement to handle resource management (i.e. opening and closing a file), hence the with approach demonstrated above. If latin-1 doesn't yield sensible characters, see if other encodings work. Tools like chardet can help you identify a file's encoding.
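For the transposition itself, once the files decode cleanly, `zip(*rows)` is the usual idiom; a minimal sketch (the reading logic from your script is omitted):

```python
def transpose(rows):
    # zip(*rows) pairs up the i-th element of every row,
    # turning columns into rows
    return [list(col) for col in zip(*rows)]
```

For example, `transpose([[1, 2, 3], [4, 5, 6]])` gives `[[1, 4], [2, 5], [3, 6]]`.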

Program (twitter bot) works on Windows machine, but not on Linux machine [duplicate]

I was trying to read a file in Python 2.7, and it was read perfectly. The problem is that when I execute the same program in Python 3.4, this error appears:
'utf-8' codec can't decode byte 0xf2 in position 424: invalid continuation byte'
Also, when I run the program on Windows (with Python 3.4), the error doesn't appear. The first line of the document is:
Codi;Codi_lloc_anonim;Nom
and the code of my program is:
def lectdict(filename, colkey, colvalue):
    f = open(filename, 'r')
    D = dict()
    for line in f:
        if line == '\n': continue
        D[line.split(';')[colkey]] = D.get(line.split(';')[colkey], []) + [line.split(';')[colvalue]]
    f.close()
    return D

Traduccio = lectdict('Noms_departaments_centres.txt', 1, 2)
In Python2,
f = open(filename,'r')
for line in f:
reads lines from the file as bytes.
In Python 3, the same code reads lines from the file as strings. Python 3 strings are what Python 2 calls unicode objects: bytes decoded according to some encoding. The default encoding in Python 3 is utf-8.
The error message
'utf-8' codec can't decode byte 0xf2 in position 424: invalid continuation byte'
shows Python3 is trying to decode the bytes as utf-8. Since there is an error, the file apparently does not contain utf-8 encoded bytes.
To fix the problem you need to specify the correct encoding of the file:
with open(filename, encoding=enc) as f:
    for line in f:
If you do not know the correct encoding, you could run this program to simply
try all the encodings known to Python. If you are lucky there will be an
encoding which turns the bytes into recognizable characters. Sometimes more
than one encoding may appear to work, in which case you'll need to check and
compare the results carefully.
# Python3
import pkgutil
import os
import encodings

def all_encodings():
    modnames = set(
        [modname for importer, modname, ispkg in pkgutil.walk_packages(
            path=[os.path.dirname(encodings.__file__)], prefix='')])
    aliases = set(encodings.aliases.aliases.values())
    return modnames.union(aliases)

filename = '/tmp/test'
encodings = all_encodings()
for enc in encodings:
    try:
        with open(filename, encoding=enc) as f:
            # print the encoding and the first 500 characters
            print(enc, f.read(500))
    except Exception:
        pass
OK, I did the same as @unutbu told me. The result was a lot of encodings; one of them was cp1250, so I changed:
f = open(filename, 'r')
to
f = open(filename, 'r', encoding='cp1250')
as @triplee suggested. And now I can read my files.
In my case I can't change the encoding because my file really is UTF-8 encoded, but some rows are corrupted and cause the same error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 7092: invalid continuation byte
My solution was to open the file in binary mode:
open(filename, 'rb')
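If you need text rather than raw bytes, another option for a mostly-UTF-8 file with a few corrupt rows is `errors="replace"`, which substitutes U+FFFD for undecodable bytes instead of raising (the helper name here is mine):

```python
def read_lenient(path):
    # Decode as UTF-8, but replace each undecodable byte with the
    # U+FFFD replacement character instead of raising UnicodeDecodeError
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read()
```

The corrupt positions become visible as "\ufffd" markers, so you can locate and repair the bad rows afterwards.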

Python: Special characters encoding

This is the code I am using to replace special characters in text files and concatenate them into a single file.
# -*- coding: utf-8 -*-
import os
import codecs

dirpath = "C:\\Users\\user\\path\\to\\textfiles"
filenames = os.listdir(dirpath)
with codecs.open(r'C:\Users\user\path\to\output.txt', 'w', encoding='utf8') as outfile:
    for fname in filenames:
        currentfile = dirpath + "\\" + fname
        with codecs.open(currentfile, encoding='utf8') as infile:
            #print currentfile
            outfile.write(fname)
            outfile.write('\n')
            outfile.write('\n')
            for line in infile:
                line = line.replace(u"´ı", "i")
                line = line.replace(u"ï¬", "fi")
                line = line.replace(u"fl", "fl")
                outfile.write(line)
The first line.replace works fine while the others do not (which makes sense), and since no errors were generated, I thought there might be a problem of "visibility" (if that's the term). And so I made this:
import codecs

currentfile = 'textfile.txt'
with codecs.open('C:\\Users\\user\\path\\to\\output2.txt', 'w', encoding='utf-8') as outfile:
    with open(currentfile) as infile:
        for line in infile:
            if "ï¬" not in line: print "not found!"
which always prints "not found!", suggesting that those characters aren't being read.
When changing to with codecs.open('C:\Users\user\path\to\output.txt', 'w', encoding='utf-8') as outfile: in the first script, I get this error:
Traceback (most recent call last):
  File "C:\path\to\concat.py", line 30, in <module>
    outfile.write(line)
  File "C:\Python27\codecs.py", line 691, in write
    return self.writer.write(data)
  File "C:\Python27\codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 0: ordinal not in range(128)
Since I am not really experienced in Python, I can't figure it out from the sources already available: the Python documentation (1, 2) and relevant questions on Stack Overflow (1, 2).
I am stuck here. Any suggestions? All answers are welcome!
There is no point in using codecs.open() if you don't use an encoding. Either use codecs.open() with an encoding specified for both reading and writing, or forgo it completely. Without an encoding, codecs.open() is an alias for just open().
Here you really do want to specify the codec of the file you are opening, to process Unicode values. You should also use unicode literal values when straying beyond ASCII characters; specify a source file encoding or use unicode escape codes for your data:
# -*- coding: utf-8 -*-
import os
import codecs

dirpath = u"C:\\Users\\user\\path\\to\\textfiles"
filenames = os.listdir(dirpath)
with codecs.open(r'C:\Users\user\path\to\output.txt', 'w', encoding='utf8') as outfile:
    for fname in filenames:
        currentfile = os.path.join(dirpath, fname)
        with codecs.open(currentfile, encoding='utf8') as infile:
            outfile.write(fname + '\n\n')
            for line in infile:
                line = line.replace(u"´ı", u"i")
                line = line.replace(u"ï¬", u"fi")
                line = line.replace(u"fl", u"fl")
                outfile.write(line)
This tells the interpreter that you used the UTF-8 codec to save your source files, ensuring that the u"´ı" code points are correctly decoded to Unicode values. Passing encoding when opening files with codecs.open() makes sure that the lines you read are decoded to Unicode values, and that your Unicode values are written out to the output file as UTF-8.
Note that the dirpath value is a Unicode value as well. If you use a Unicode path, then os.listdir() returns Unicode filenames, which is essential if you have any non-ASCII characters in those filenames.
If you do not do all this, chances are your source code encoding does not match the data you read from the file, and you are trying to replace the wrong set of encoded bytes with a few ASCII characters.
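As an aside, mangled pairs like "ï¬" are usually UTF-8 bytes that were mis-decoded as Latin-1/cp1252 somewhere upstream. When that is what happened, a round trip repairs them in one step instead of enumerating replacements (a sketch, not a drop-in fix for the script above):

```python
def fix_mojibake(text):
    # Re-encode the mis-decoded text back to its original bytes,
    # then decode those bytes as the UTF-8 they really were.
    # Only valid if every character fits in Latin-1.
    return text.encode("latin-1").decode("utf-8")
```

For instance, the "fi" ligature "ﬁ" (U+FB01) encodes in UTF-8 as the bytes EF AC 81, which read back as Latin-1 start with "ï¬"; the round trip recovers the ligature.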

UnicodeDecodeError while processing filenames

I'm using Python 2.7.3 on Ubuntu 12 x64.
I have about 200,000 files in a folder on my filesystem. The file names of some of the files contain html encoded and escaped characters because the files were originally downloaded from a website. Here are examples:
Jamaica%2008%20114.jpg
thai_trip_%E8%B0%83%E6%95%B4%E5%A4%A7%E5%B0%8F%20RAY_5313.jpg
I wrote a simple Python script that goes through the folder and renames all of the files with encoded characters in the filename. The new filename is achieved by simply decoding the string that makes up the filename.
The script works for most of the files, but, for some of the files Python chokes and spits out the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 11: ordinal not in range(128)
Traceback (most recent call last):
  File "./download.py", line 53, in downloadGalleries
    numDownloaded = downloadGallery(opener, galleryLink)
  File "./download.py", line 75, in downloadGallery
    filePathPrefix = getFilePath(content)
  File "./download.py", line 90, in getFilePath
    return cleanupString(match.group(1).strip()) + '/' + cleanupString(match.group(2).strip())
  File "/home/abc/XYZ/common.py", line 22, in cleanupString
    return HTMLParser.HTMLParser().unescape(string)
  File "/usr/lib/python2.7/HTMLParser.py", line 472, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
Here is the contents of my cleanupString function:
def cleanupString(string):
    string = urllib2.unquote(string)
    return HTMLParser.HTMLParser().unescape(string)
And here's the snippet of code that calls the cleanupString function (this code is not the same code in the traceback above but it produces the same error):
rootFolder = sys.argv[1]
pattern = r'.*\.jpg\s*$|.*\.jpeg\s*$'
reobj = re.compile(pattern, re.IGNORECASE)
imgs = []
for root, dirs, files in os.walk(rootFolder):
    for filename in files:
        foundFile = os.path.join(root, filename)
        if reobj.match(foundFile):
            imgs.append(foundFile)

for img in imgs:
    print 'Checking file: ' + img
    newImg = cleanupString(img)  # Code blows up here for some files
Can anyone provide me with a way to get around this error? I've already tried adding
# -*- coding: utf-8 -*-
to the top of the script, but that has no effect.
Thanks.
Your filenames are byte strings that contain UTF-8 bytes representing unicode characters. The HTML parser normally works with unicode data instead of byte strings, particularly when it encounters an ampersand escape, so Python automatically tries to decode the value for you, but by default it uses ASCII for that decoding. This fails for UTF-8 data, which contains bytes that fall outside the ASCII range.
You need to explicitly decode your string to a unicode object:
def cleanupString(string):
    string = urllib2.unquote(string).decode('utf8')
    return HTMLParser.HTMLParser().unescape(string)
Your next problem will be that you now have unicode filenames, but your filesystem will need some kind of encoding to work with these filenames. You can check what that encoding is with sys.getfilesystemencoding(); use this to re-encode your filenames:
def cleanupString(string):
    string = urllib2.unquote(string).decode('utf8')
    return HTMLParser.HTMLParser().unescape(string).encode(sys.getfilesystemencoding())
You can read up on how Python deals with Unicode in the Unicode HOWTO.
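For reference, under Python 3 the same cleanup collapses to two stdlib calls, and everything stays str throughout, so the ASCII-decode trap never arises (sketched here; not part of the original Python 2 script):

```python
from html import unescape
from urllib.parse import unquote


def cleanup_name(name):
    # Percent-decode first (UTF-8 escapes handled by default),
    # then resolve any HTML entities
    return unescape(unquote(name))
```

For example, `cleanup_name("Jamaica%2008%20114.jpg")` returns `"Jamaica 08 114.jpg"`.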
Looks like you're bumping into this issue. I would try reversing the order you call unescape and unquote, since unquote would be adding non-ASCII characters into your filenames, although that may not fix the problem.
What is the actual filename it is choking on?
