For some school assignments I've been trying to get pyplot to plot scientific graphs based on data from Logger Pro, but I'm met with the error
ValueError: could not convert string to float: '0'
This is the program:
plot.py
-------------------------------
import matplotlib.pyplot as plt
import numpy as np
infile = open('text', 'r')
xs = []
ys = []
for line in infile:
    print(type(line))
    x, y = line.split()
    # print(x, y)
    # print(type(line), type(x), type(y))
    xs.append(float(x))
    ys.append(float(y))
xs.sort()
ys.sort()
plt.plot(xs, ys, 'bo')
plt.grid(True)
# print (xs, ys)
plt.show()
infile.close()
And the input file is containing this:
text
-------------------------------
0 1.33
1 1.37
2 1.43
3 1.51
4 1.59
5 1.67
6 1.77
7 1.86
8 1.98
9 2.1
This is the error message I receive when I run the program:
Traceback (most recent call last):
File "\route\to\the\file\plot01.py", line 36, in <module>
xs.append(float(x))
ValueError: could not convert string to float: '0'
You have a UTF-8 BOM in your data file; this is what my Python 2 interactive session shows is actually being converted to a float:
>>> x
'\xef\xbb\xbf0'
The \xef\xbb\xbf bytes are the UTF-8 encoding of U+FEFF ZERO WIDTH NO-BREAK SPACE, commonly used as a byte-order mark, especially by Microsoft products. UTF-8 has no byte-order issues, so the mark isn't required to record the byte ordering the way it is for UTF-16 or UTF-32; instead, Microsoft uses it as an aid to detect encodings.
On Python 3, you could open the file using the utf-8-sig codec; this codec expects the BOM at the start and will remove it:
infile = open('text', 'r', encoding='utf-8-sig')
On Python 2, you could use the codecs.BOM_UTF8 constant to detect and strip it:
import codecs

for line in infile:
    if line.startswith(codecs.BOM_UTF8):
        line = line[len(codecs.BOM_UTF8):]
    x, y = line.split()
As the codecs documentation explains:
As UTF-8 is an 8-bit encoding no BOM is required and any U+FEFF character in the decoded string (even if it’s the first character) is treated as a ZERO WIDTH NO-BREAK SPACE.
Without external information it’s impossible to reliably determine which encoding was used for encoding a string. Each charmap encoding can decode any random byte sequence. However that’s not possible with UTF-8, as UTF-8 byte sequences have a structure that doesn’t allow arbitrary byte sequences. To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written. As it’s rather improbable that any charmap encoded file starts with these byte values (which would e.g. map to
LATIN SMALL LETTER I WITH DIAERESIS
RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
INVERTED QUESTION MARK
in iso-8859-1), this increases the probability that a utf-8-sig encoding can be correctly guessed from the byte sequence. So here the BOM is not used to be able to determine the byte order used for generating the byte sequence, but as a signature that helps in guessing the encoding. On encoding the utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes to the file. On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.
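As a quick self-contained illustration of the difference between the two codecs (the file name and sample lines here are made up for the demo, not taken from the question):

```python
import codecs
import os
import tempfile

# Write a hypothetical data file that starts with a UTF-8 BOM,
# as e.g. Windows Notepad would produce.
path = os.path.join(tempfile.gettempdir(), 'bom_demo.txt')
with open(path, 'wb') as f:
    f.write(codecs.BOM_UTF8 + b'0 1.33\n1 1.37\n')

# Plain utf-8 leaves the BOM glued to the first token...
with open(path, 'r', encoding='utf-8') as f:
    first_plain = f.readline().split()[0]
print(repr(first_plain))   # '\ufeff0' -- this is what float() chokes on

# ...while utf-8-sig strips it transparently.
with open(path, 'r', encoding='utf-8-sig') as f:
    first_sig = f.readline().split()[0]
print(float(first_sig))    # 0.0
```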
Related
I have Unicode Code Point of an emoticon represented as U+1F498:
emoticon = u'\U0001f498'
I would like to get utf-16 decimal groups of this character, which according to this website are 55357 and 56472.
I tried print emoticon.encode("utf16"), but that did not help because it prints other characters.
Trying to decode from UTF-8 before encoding to UTF-16, as in print str(int("0001F498", 16)).decode("utf-8").encode("utf16"), does not help either.
How do I correctly get the utf-16 decimal groups of a unicode character?
You can encode the character with the utf-16 encoding, and then convert every 2 bytes of the encoded data to integers with int.from_bytes (or struct.unpack in python 2).
Python 3
def utf16_decimals(char, chunk_size=2):
    # encode the character as big-endian utf-16
    encoded_char = char.encode('utf-16-be')

    # convert every `chunk_size` bytes to an integer
    decimals = []
    for i in range(0, len(encoded_char), chunk_size):
        chunk = encoded_char[i:i+chunk_size]
        decimals.append(int.from_bytes(chunk, 'big'))

    return decimals
Python 2 + Python 3
import struct

def utf16_decimals(char):
    # encode the character as big-endian utf-16
    encoded_char = char.encode('utf-16-be')

    # convert every 2 bytes to an integer
    decimals = []
    for i in range(0, len(encoded_char), 2):
        chunk = encoded_char[i:i+2]
        decimals.append(struct.unpack('>H', chunk)[0])

    return decimals
Result:
>>> utf16_decimals(u'\U0001f498')
[55357, 56472]
In a Python 2 "narrow" build, it is as simple as:
>>> emoticon = u'\U0001f498'
>>> map(ord,emoticon)
[55357, 56472]
This works in Python 2 (narrow and wide builds) and Python 3:
from __future__ import print_function
import struct
emoticon = u'\U0001f498'
print(struct.unpack('<2H',emoticon.encode('utf-16le')))
Output:
(55357, 56472)
This is a more general solution that prints the UTF-16 code points for any length of string:
from __future__ import print_function, division
import struct

def utf16words(s):
    encoded = s.encode('utf-16le')
    num_words = len(encoded) // 2
    return struct.unpack('<{}H'.format(num_words), encoded)

emoticon = u'ABC\U0001f498'
print(utf16words(emoticon))
Output:
(65, 66, 67, 55357, 56472)
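For reference, the two numbers 55357 and 56472 are just the UTF-16 surrogate-pair arithmetic applied to the code point. A minimal sketch of that calculation (the helper name surrogate_pair is my own, not from any of the answers above):

```python
def surrogate_pair(codepoint):
    # Split a code point above U+FFFF into a UTF-16 high/low surrogate pair.
    assert codepoint > 0xFFFF
    offset = codepoint - 0x10000        # 20 significant bits remain
    high = 0xD800 + (offset >> 10)      # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits -> low surrogate
    return high, low

print(surrogate_pair(0x1F498))  # (55357, 56472)
```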
I am running into an issue reading a file that contains both UTF-8 and ASCII characters. The problem is that I am using seek to read only part of the data, and I have no idea whether I am reading from the "middle" of a UTF-8 sequence.
osx
python 3.6.6
To simplify it, my issue can be demonstrated with the following code.
# write some utf-8 to a file
open('/tmp/test.txt', 'w').write(chr(12345)+chr(23456)+chr(34567)+'\n')
data = open('/tmp/test.txt')
data.read() # this works fine. to just demo I can read the file as whole
data.seek(1)
data.read(1) # UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
# I can read by seek 3 by 3
data.seek(3)
data.read(1) # this works fine.
I know I can open the file in binary then read it without issue by seeking to any position, however, I need to process the string, so I will end up with same issue when decode into string.
data = open('/tmp/test.txt', 'rb')
data.seek(1)
z = data.read(3)
z.decode() # will hit same error
Without using seek, I can read it correctly, even when calling just read(1).
data = open('/tmp/test.txt')
data.tell() # 0
data.read(1)
data.tell() # shows 3 even calling read(1)
One thing I can think of is: after seeking to a location, try to read; on UnicodeDecodeError, set position = position - 1, seek(position), and retry until the read succeeds.
Is there a better (right) way to handle it?
As the documentation explains, when you seek on text files:
offset must either be a number returned by TextIOBase.tell(), or zero. Any other offset value produces undefined behaviour.
In practice, what seek(1) actually does is seek 1 byte into the file, which puts the position in the middle of a character. So, what ends up happening is similar to this:
>>> s = chr(12345)+chr(23456)+chr(34567)+'\n'
>>> b = s.encode()
>>> b
b'\xe3\x80\xb9\xe5\xae\xa0\xe8\x9c\x87\n'
>>> b[1:]
b'\x80\xb9\xe5\xae\xa0\xe8\x9c\x87\n'
>>> b[1:].decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
So, seek(3) happens to work, even though it's not legal, because you happen to be seeking to the start of a character. It's equivalent to this:
>>> b[3:].decode()
'宠蜇\n'
If you want to rely on that undocumented behavior to try to seek randomly into the middle of a UTF-8 text file, you can usually get away with it by doing what you suggested. For example:
def readchar(f, pos):
    # try successive positions until one decodes cleanly
    for i in range(pos, pos + 5):
        try:
            f.seek(i)
            return f.read(1)
        except UnicodeDecodeError:
            pass
    raise ValueError('Unable to find a UTF-8 start byte')
Or you could use knowledge of the UTF-8 encoding to manually scan for a valid start byte in a binary file:
def readchar(f, pos):
    # f must be opened in binary mode
    f.seek(pos)
    for _ in range(5):
        byte = f.read(1)
        if not byte:
            break
        lead = byte[0]
        # 0x00-0x7F is ASCII, 0xC0 and up starts a multi-byte sequence;
        # 0x80-0xBF is a continuation byte, so skip it
        if lead < 0x80 or lead >= 0xC0:
            follow = (lead >= 0xC0) + (lead >= 0xE0) + (lead >= 0xF0)
            return (byte + f.read(follow)).decode('utf-8')
    raise ValueError('Unable to find a UTF-8 start byte')
However, if you're actually just looking for the next complete line before or after some arbitrary point, that's a whole lot easier.
In UTF-8, the newline character is encoded as a single byte, the same byte as in ASCII; that is, '\n' encodes to b'\n'. (If you have Windows-style line endings, the same is true for carriage return, so '\r\n' also encodes to b'\r\n'.) This is by design, to make it easier to handle this kind of problem.
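This property can be checked directly; a small sketch (the sample string is chosen arbitrarily):

```python
# Every byte of a multi-byte UTF-8 sequence has its high bit set,
# so the single byte 0x0A can only ever mean '\n'.
s = '0:\u3039\u5ba0\u8707\nnext line\n'
encoded = s.encode('utf-8')

# The newline count is identical whether you count characters or bytes.
assert encoded.count(b'\n') == s.count('\n')

# And no multi-byte character's encoding contains the newline byte.
for ch in s:
    if ord(ch) > 0x7F:
        assert b'\n' not in ch.encode('utf-8')
print('ok')
```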
So, if you open the file in binary mode, you can seek forward or backward until you find a newline byte. And then, you can just use the (binary-file) readline method to read from there until the next newline.
The exact details depend on exactly what rule you want to use here. Also, I'm going to show a simple, completely unoptimized version that scans one byte at a time; in real life you probably want to back up, read, and scan (e.g., with rfind), say, 80 bytes at a time, but this is hopefully simpler to understand:
def getline(f, pos, maxpos):
    # scan backwards from pos for the preceding newline byte
    for start in range(pos - 1, -1, -1):
        f.seek(start)
        if f.read(1) == b'\n':
            break
    else:
        f.seek(0)
    return f.readline().decode()
Here it is in action:
>>> import io
>>> s = ''.join(f'{i}:\u3039\u5ba0\u8707\n' for i in range(5))
>>> b = s.encode()
>>> f = io.BytesIO(b)
>>> maxlen = len(b)
>>> print(getline(f, 0, maxlen))
0:〹宠蜇
>>> print(getline(f, 1, maxlen))
0:〹宠蜇
>>> print(getline(f, 10, maxlen))
0:〹宠蜇
>>> print(getline(f, 11, maxlen))
0:〹宠蜇
>>> print(getline(f, 12, maxlen))
1:〹宠蜇
>>> print(getline(f, 59, maxlen))
4:〹宠蜇
I'm wondering how I can convert ISO-8859-2 (Latin-2) characters (I mean integer or hex values that represent ISO-8859-2-encoded characters) to UTF-8 characters.
What I need to do in my project in Python:
Receive hex values from the serial port, which are characters encoded in ISO-8859-2.
Decode them, that is, get "standard" Python unicode strings from them.
Prepare and write an XML file.
Using Python 3.4.3
txt_str = "ąęłóźć"
txt_str.decode('ISO-8859-2')
Traceback (most recent call last): File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'
The main problem is still to prepare valid input for the decode method (it works in Python 2.7.10, and that's the one I'm using in this project). How do I prepare a valid string from decimal values, which are Latin-2 code numbers?
Note that it would be very complicated to receive UTF-8 characters from the serial port, due to the devices I'm using and communication protocol limitations.
Sample data, on request:
68632057
62206A75
7A647261
B364206F
20616775
777A616E
616A2061
6A65696B
617A20B6
697A7970
6A65B361
70697020
77F36469
62202C79
6E647572
75206A65
7963696C
72656D75
6A616E20
73726F67
206A657A
65647572
77207972
73772065
00000069
This is some sample data: ISO-8859-2 characters pushed into uint32 values, 4 chars per int. Here is the bit of code that manages the unboxing:
l = l[7:].replace(",", "").replace(".", "").replace("\n","").replace("\r","") # crop string from uart, only data left
vl = [l[0:2], l[2:4], l[4:6], l[6:8]] # list of bytes
vl = vl[::-1] # reverse them - now in actual order
To get the integer values out of the hex strings I can simply use:
int_vals = [int(hs, 16) for hs in vl]
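In Python 3 (mentioned at the top of the question), such integer values can be turned back into text by building a bytes object and decoding it. A minimal sketch with made-up sample byte values (not taken from the sample data above):

```python
# Made-up sample byte values; 0xB3 is from the Latin-2 extended range.
int_vals = [0x57, 0x20, 0x63, 0x68, 0xB3]
text = bytes(int_vals).decode('iso-8859-2')
print(text)  # 'W chł' -- 0xB3 is LATIN SMALL LETTER L WITH STROKE in Latin-2
```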
Your example doesn't work because you've tried to use a str to hold bytes. In Python 3 you must use byte strings.
In reality, if you're using PySerial then you'll be reading byte strings anyway, which you can convert as required:
with serial.Serial('/dev/ttyS1', 19200, timeout=1) as ser:
    s = ser.read(10)
    # Py3: s == bytes
    # Py2.x: s == str
    my_unicode_string = s.decode('iso-8859-2')
If your iso-8859-2 data is actually transmitted as an ASCII hex representation of the bytes, then you have to undo an extra layer of encoding:
import codecs

with serial.Serial('/dev/ttyS1', 19200, timeout=1) as ser:
    hex_repr = ser.read(10)
    # Py3: hex_repr == bytes
    # Py2.x: hex_repr == str

    # Decode the hex representation to bytes,
    # e.g. b"A3" -> b'\xa3'
    hex_decoded = codecs.decode(hex_repr, 'hex')
    my_unicode_string = hex_decoded.decode('iso-8859-2')
Now you can pass my_unicode_string to your favourite XML library.
Interesting sample data. Ideally your sample data would be a direct print of the raw data received from PySerial. If you actually are receiving the raw bytes as 8-digit hexadecimal values, then:
#!python3
from binascii import unhexlify
data = b''.join(unhexlify(x)[::-1] for x in b'''\
68632057
62206A75
7A647261
B364206F
20616775
777A616E
616A2061
6A65696B
617A20B6
697A7970
6A65B361
70697020
77F36469
62202C79
6E647572
75206A65
7963696C
72656D75
6A616E20
73726F67
206A657A
65647572
77207972
73772065
00000069'''.splitlines())
print(data.decode('iso-8859-2'))
Output:
W chuj bardzo długa nazwa jakiejś zapyziałej pipidówy, brudnej ulicyumer najgorszej rudery we wsi
Google Translate of Polish to English:
The dick very long name some zapyziałej Small Town , dirty ulicyumer worst hovel in the village
This topic is closed. Working code that handles what needs to be done:
x = 177
x.to_bytes(1, byteorder='big').decode('ISO-8859-2')
So I have two arrays of tuples, each arranged as a restaurant name and an int:
("Restaurant Name", 0)
One is called arrayForInitialSpots, and the other is called arrayForChosenSpots. What I want to do is write the tuples from both arrays side by side into a csv file, like this:
"First Restaurant in arrayForInitialSpots",0,"First Restaurant in arrayForChosenSpots",1
"Second Restaurant in arrayForInitialSpots",0,"Second Restaurant in arrayForChosenSpots",0
So far I've tried this:
import csv

with open('data.csv', 'w') as out:
    csv_out = csv.writer(out)
    csv_out.writerow(['Restaurant Name', 'Change'])
    for x, y in zip(arrayForInitialSpots, arrayForChosenSpots):
        csv_out.writerow(x + y)
        #csv_out.writerow(y)
But I get:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 3-6: ordinal not in range(128)
If I remove the zip function, I get "too many values to unpack". Any suggestions? Thank you very much in advance.
There are two things that you could use to handle extended ASCII characters while writing to files in Python 2.
Set the default encoding to utf-8:
import sys
reload(sys).setdefaultencoding("utf-8")
Or use the unicodecsv writer to write data to files:
import unicodecsv
With mhawke's help, here is my solution:
with open('data.csv', 'w') as out:
    csv_out = csv.writer(out)
    csv_out.writerow(['Restaurant Name', 'Change'])
    for x, y in zip(arrayForInitialSpots, arrayForChosenSpots):
        list_ = [str(word).decode("utf8") for word in (x + y)]
        counter = 0
        while counter < len(list_):
            s = ""
            for i in range(counter, counter + 4):
                s += list_[i].encode('utf-8')
                s += ","
            counter = counter + 4
            csv_out.writerow(s[:-1])
The problem is not due to your use of zip() - that looks OK, but instead it is an encoding issue. Probably the restaurant names are unicode strings or in some encoding other than ASCII or UTF8? ISO-8859-1 perhaps?
The Python 2 csv module does not handle unicode; other encodings might work, but it depends. The module does handle 8-bit values OK (except ASCII NUL), so you should be able to encode the values as UTF-8 like this:
ENCODING = 'iso-8859-1'  # assume strings are encoded in this encoding

def to_utf8(item, from_encoding):
    if isinstance(item, str):
        # byte strings are first decoded to unicode
        item = unicode(item, from_encoding)
    return unicode(item).encode('utf8')

with open('data.csv', 'w') as out:
    csv_out = csv.writer(out)
    csv_out.writerow(['Restaurant Name', 'Change'] * 2)
    for x, y in zip(arrayForInitialSpots, arrayForChosenSpots):
        csv_out.writerow([to_utf8(item, ENCODING) for item in x + y])
This works by converting each element of the tuple formed by x+y into a UTF-8 string. This includes byte strings in other encodings, as well as other objects such as integers that can be converted to a unicode string via unicode(). If your strings are unicode, just set ENCODING to None.
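For completeness, in Python 3 the csv module accepts str values directly, so none of this re-encoding is needed if you open the file with an explicit encoding. A sketch with made-up restaurant names (io.StringIO stands in for open('data.csv', 'w', encoding='utf-8', newline='')):

```python
import csv
import io

# Hypothetical sample data standing in for the question's arrays.
arrayForInitialSpots = [('Café Niño', 0), ('Deuxième Crêperie', 0)]
arrayForChosenSpots = [('Taquería Uno', 1), ('Other Restaurant', 0)]

out = io.StringIO()  # stands in for the real file object
csv_out = csv.writer(out)
csv_out.writerow(['Restaurant Name', 'Change'] * 2)
for x, y in zip(arrayForInitialSpots, arrayForChosenSpots):
    csv_out.writerow(list(x) + list(y))
print(out.getvalue())
```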
I'd suggest using numpy:
import numpy as np

IniSpots = [("Restaurant Name0a", 0), ("Restaurant Name1a", 1)]
ChoSpots = [("Restaurant Name0b", 0), ("Restaurant Name1b", 0)]
c = np.hstack((IniSpots, ChoSpots))
np.savetxt("data.csv", c, fmt='%s', delimiter=",")
I have some files which contain a bunch of different kinds of binary data, and I'm writing a module to deal with them.
Amongst other things, they contain UTF-8 encoded strings in the following format: a 2-byte big-endian stringLength (which I parse using struct.unpack()), and then the string. Since it's UTF-8, the byte length of the string may be greater than stringLength, so doing read(stringLength) will come up short if the string contains multi-byte characters (not to mention messing up all the other data in the file).
How do I read n UTF-8 characters (distinct from n bytes) from a file, being aware of the multi-byte properties of UTF-8? I've been googling for half an hour and all the results I've found are either not relevant or makes assumptions that I cannot make.
Given a file object, and a number of characters, you can use:
# build a table mapping lead byte to expected follow-byte count
# bytes 00-BF have 0 follow bytes, F5-FF is not legal UTF8
# C0-DF: 1, E0-EF: 2 and F0-F4: 3 follow bytes.
# leave F5-FF set to 0 to minimize reading broken data.
_lead_byte_to_count = []
for i in range(256):
    _lead_byte_to_count.append(
        1 + (i >= 0xe0) + (i >= 0xf0) if 0xbf < i < 0xf5 else 0)
def readUTF8(f, count):
    """Read `count` UTF-8 characters from file `f`, return as unicode"""
    # Assumes the UTF-8 data is valid; leaves it to the `.decode()` call to validate
    res = []
    while count:
        count -= 1
        lead = f.read(1)
        res.append(lead)
        readcount = _lead_byte_to_count[ord(lead)]
        if readcount:
            res.append(f.read(readcount))
    return (''.join(res)).decode('utf8')
Result of a test:
>>> from StringIO import StringIO
>>> test = StringIO(u'This is a test containing Unicode data: \ua000'.encode('utf8'))
>>> readUTF8(test, 41)
u'This is a test containing Unicode data: \ua000'
In Python 3, it is of course much, much easier to just wrap the file object in a io.TextIOWrapper() object and leave decoding to the native and efficient Python UTF-8 implementation.
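A minimal sketch of that Python 3 approach, reusing the test string from above (io.BytesIO stands in for a real binary file object):

```python
import io

# Wrap a binary stream so that decoding happens transparently.
raw = io.BytesIO(u'This is a test containing Unicode data: \ua000'.encode('utf8'))
text = io.TextIOWrapper(raw, encoding='utf-8')
decoded = text.read(41)  # read() counts characters, not bytes
print(decoded)
```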
One character in UTF-8 can be 1, 2, 3, or 4 bytes.
If you have to read your file byte by byte, you have to follow the UTF-8 encoding rules: http://en.wikipedia.org/wiki/UTF-8
Most of the time, though, you can just set the encoding to utf-8 and read the input stream.
Then you do not need to care how many bytes you have read.