'utf-8' codec can't decode byte: invalid continuation byte - python

I am trying to retrain a model with TensorFlow. I used the code from TensorFlow Models while following a tutorial on YouTube. The process ends with the following error:
File "C:\Users\Jeff\anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\util\compat.py", line 109, in as_text
return bytes_or_text.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd3 in position 9: invalid continuation byte
The error occurs in the script compat.py, in the highlighted line of the method "as_text":
def as_text(bytes_or_text, encoding='utf-8'):
  if isinstance(bytes_or_text, _six.text_type):
    return bytes_or_text
  elif isinstance(bytes_or_text, bytes):
    return bytes_or_text.decode(encoding)  # <-- line raising the UnicodeDecodeError
  else:
    raise TypeError('Expected binary or unicode string, got %r' % bytes_or_text)
I googled for hours but couldn't find anything. I tried different pre-trained models and started the installation and training configuration all over again. I am out of ideas and hope someone can help me here.
Can anybody tell me what I can do to make it work? Thanks in advance!
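For reference, the failure mode can be reproduced in isolation, independent of TensorFlow: the byte 0xd3 announces a two-byte UTF-8 sequence, so when the next byte is not a valid continuation byte the decode fails exactly as in the traceback above (a minimal sketch with made-up bytes):

```python
# 0xd3 starts a 2-byte UTF-8 sequence; 'a' is not a valid continuation byte
raw = b'\xd3a'
try:
    raw.decode('utf-8')
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xd3 in position 0: invalid continuation byte

# A single-byte codec such as latin-1 never fails, which can help inspect the bad data
print(raw.decode('latin-1'))
```

One common cause (a guess, not a confirmed diagnosis) is that some file read by the training pipeline, such as a config or label file, was saved in a non-UTF-8 encoding; re-saving it as UTF-8 often resolves this class of error.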
Here the whole script compat.py :
# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Compatibility functions.
The `tf.compat` module contains two sets of compatibility functions.
## Tensorflow 1.x and 2.x APIs
The `compat.v1` and `compat.v2` submodules provide a complete copy of both the
`v1` and `v2` APIs for backwards and forwards compatibility across TensorFlow
versions 1.x and 2.x. See the
[migration guide](https://www.tensorflow.org/guide/migrate) for details.
## Utilities for writing compatible code
Aside from the `compat.v1` and `compat.v2` submodules, `tf.compat` also contains
a set of helper functions for writing code that works in both:
* TensorFlow 1.x and 2.x
* Python 2 and 3
## Type collections
The compatibility module also provides the following aliases for common
sets of python types:
* `bytes_or_text_types`
* `complex_types`
* `integral_types`
* `real_types`
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numbers as _numbers
import numpy as _np
import six as _six
from tensorflow.python.util.tf_export import tf_export
try:
  # This import only works on python 3.3 and above.
  import collections.abc as collections_abc  # pylint: disable=unused-import
except ImportError:
  import collections as collections_abc  # pylint: disable=unused-import
def as_bytes(bytes_or_text, encoding='utf-8'):
  """Converts `bytearray`, `bytes`, or unicode python input types to `bytes`.

  Uses utf-8 encoding for text by default.

  Args:
    bytes_or_text: A `bytearray`, `bytes`, `str`, or `unicode` object.
    encoding: A string indicating the charset for encoding unicode.

  Returns:
    A `bytes` object.

  Raises:
    TypeError: If `bytes_or_text` is not a binary or unicode string.
  """
  if isinstance(bytes_or_text, bytearray):
    return bytes(bytes_or_text)
  elif isinstance(bytes_or_text, _six.text_type):
    return bytes_or_text.encode(encoding)
  elif isinstance(bytes_or_text, bytes):
    return bytes_or_text
  else:
    raise TypeError('Expected binary or unicode string, got %r' %
                    (bytes_or_text,))
def as_text(bytes_or_text, encoding='utf-8'):
  """Converts any string-like python input types to unicode.

  Returns the input as a unicode string. Uses utf-8 encoding for text
  by default.

  Args:
    bytes_or_text: A `bytes`, `str`, or `unicode` object.
    encoding: A string indicating the charset for decoding unicode.

  Returns:
    A `unicode` (Python 2) or `str` (Python 3) object.

  Raises:
    TypeError: If `bytes_or_text` is not a binary or unicode string.
  """
  if isinstance(bytes_or_text, _six.text_type):
    return bytes_or_text
  elif isinstance(bytes_or_text, bytes):
    return bytes_or_text.decode(encoding)
  else:
    raise TypeError('Expected binary or unicode string, got %r' % bytes_or_text)
def as_str(bytes_or_text, encoding='utf-8'):
  if _six.PY2:
    return as_bytes(bytes_or_text, encoding)
  else:
    return as_text(bytes_or_text, encoding)
tf_export('compat.as_text')(as_text)
tf_export('compat.as_bytes')(as_bytes)
tf_export('compat.as_str')(as_str)
@tf_export('compat.as_str_any')
def as_str_any(value):
  """Converts input to `str` type.

  Uses `str(value)`, except for `bytes` typed inputs, which are converted
  using `as_str`.

  Args:
    value: A object that can be converted to `str`.

  Returns:
    A `str` object.
  """
  if isinstance(value, bytes):
    return as_str(value)
  else:
    return str(value)
@tf_export('compat.path_to_str')
def path_to_str(path):
  r"""Converts input which is a `PathLike` object to `str` type.

  Converts from any python constant representation of a `PathLike` object to
  a string. If the input is not a `PathLike` object, simply returns the input.

  Args:
    path: An object that can be converted to path representation.

  Returns:
    A `str` object.

  Usage:
    In case a simplified `str` version of the path is needed from an
    `os.PathLike` object.

  Examples:
  ```python
  $ tf.compat.path_to_str('C:\XYZ\tensorflow\./.././tensorflow')
  'C:\XYZ\tensorflow\./.././tensorflow' # Windows OS
  $ tf.compat.path_to_str(Path('C:\XYZ\tensorflow\./.././tensorflow'))
  'C:\XYZ\tensorflow\..\tensorflow' # Windows OS
  $ tf.compat.path_to_str(Path('./corpus'))
  'corpus' # Linux OS
  $ tf.compat.path_to_str('./.././Corpus')
  './.././Corpus' # Linux OS
  $ tf.compat.path_to_str(Path('./.././Corpus'))
  '../Corpus' # Linux OS
  $ tf.compat.path_to_str(Path('./..////../'))
  '../..' # Linux OS
  ```
  """
  if hasattr(path, '__fspath__'):
    path = as_str_any(path.__fspath__())
  return path
def path_to_bytes(path):
  r"""Converts input which is a `PathLike` object to `bytes`.

  Converts from any python constant representation of a `PathLike` object
  or `str` to bytes.

  Args:
    path: An object that can be converted to path representation.

  Returns:
    A `bytes` object.

  Usage:
    In case a simplified `bytes` version of the path is needed from an
    `os.PathLike` object.
  """
  if hasattr(path, '__fspath__'):
    path = path.__fspath__()
  return as_bytes(path)
# Numpy 1.8 scalars don't inherit from numbers.Integral in Python 3, so we
# need to check them specifically. The same goes for Real and Complex.
integral_types = (_numbers.Integral, _np.integer)
tf_export('compat.integral_types').export_constant(__name__, 'integral_types')
real_types = (_numbers.Real, _np.integer, _np.floating)
tf_export('compat.real_types').export_constant(__name__, 'real_types')
complex_types = (_numbers.Complex, _np.number)
tf_export('compat.complex_types').export_constant(__name__, 'complex_types')
# Either bytes or text.
bytes_or_text_types = (bytes, _six.text_type)
tf_export('compat.bytes_or_text_types').export_constant(__name__,
                                                        'bytes_or_text_types')

Related

TypeError: 'bytes' object cannot be interpreted as an integer

I use OpenCV to read a barcode and send it to an Arduino via serial communication using the pyserial package.
The goal is to make a robotic arm move the object (much like in Amazon warehouses).
When sending the bytes, I get this error:
C:\Users\arcco\venv\Scripts\python.exe D:\Python\pythonProject\test.py
Traceback (most recent call last):
File "D:\Python\pythonProject\test.py", line 45, in <module>
img = decode(img)
^^^^^^^^^^^
File "D:\Python\pythonProject\test.py", line 19, in decode
ser.write (a)
File "C:\Users\arcco\venv\Lib\site-packages\serial\serialwin32.py", line 310, in write
data = to_bytes(data)
^^^^^^^^^^^^^^
File "C:\Users\arcco\venv\Lib\site-packages\serial\serialutil.py", line 68, in to_bytes
return bytes(bytearray(seq))
^^^^^^^^^^^^^^
TypeError: 'bytes' object cannot be interpreted as an integer
detected barcode: Decoded(data=b'43770929851162', type='I25', rect=Rect(left=62, top=0, width=694, height=180), polygon=[Point(x=62, y=1), Point(x=62, y=179), Point(x=756, y=180), Point(x=756, y=0)], quality=181, orientation='UP')
Type: I25
Data: b'43770929851162'
Process finished with exit code 1
Code
Here is the code where I try to send the byte array over serial.
from pyzbar import pyzbar
import cv2
import serial
ser = serial.Serial("COM3", 9600)
def decode(image):
    # decodes all barcodes from an image
    decoded_objects = pyzbar.decode(image)
    for obj in decoded_objects:
        # draw the barcode
        print("detected barcode:", obj)
        image = draw_barcode(obj, image)
        # print barcode type & data
        print("Type:", obj.type)
        print("Data:", obj.data)
        print()
        a = (bytes([obj.data]))
        ser.write(bytes(a))
    return image
def draw_barcode(decoded, image):
    # n_points = len(decoded.polygon)
    # for i in range(n_points):
    #     image = cv2.line(image, decoded.polygon[i], decoded.polygon[(i+1) % n_points], color=(0, 255, 0), thickness=5)
    # uncomment above and comment below if you want to draw a polygon and not a rectangle
    image = cv2.rectangle(image, (decoded.rect.left, decoded.rect.top),
                          (decoded.rect.left + decoded.rect.width, decoded.rect.top + decoded.rect.height),
                          color=(0, 255, 0),
                          thickness=5)
    return image
if __name__ == "__main__":
    from glob import glob

    barcodes = glob("barcode*.png")
    for barcode_file in barcodes:
        # load the image to opencv
        img = cv2.imread(barcode_file)
        # decode detected barcodes & get the image that is drawn
        img = decode(img)
        # show the image
        cv2.imshow("img", img)
        cv2.waitKey(0)
Debug and consult the docs
Let's do 3 things to solve your issue:
Clarify the output expected from the pyzbar docs for the decode function.
Debug the existing code by adding print statements for the outputs and their types.
Clarify the input expected by the pyserial docs for the write function.
Expected output from pyzbar.decode(image)
pyzbar.decode(image) returns a list of Decoded named tuples. Each tuple has an attribute named data which contains bytes like b'some bytes'. An example of the returned objects can be seen in the project's GitHub README, in the section Example Usage.
See also What does the 'b' character do in front of a string literal?.
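As a minimal stand-in (with the field values copied from your output; the real Decoded tuple has more fields), the result can be modelled like this to show that .data is already bytes:

```python
from collections import namedtuple

# Simplified stand-in for pyzbar's Decoded result
Decoded = namedtuple("Decoded", ["data", "type"])
obj = Decoded(data=b"43770929851162", type="I25")

print(type(obj.data))  # <class 'bytes'>
assert isinstance(obj.data, bytes)
```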
Debugging details
So in your code, the debug output of obj.data should look similar to b'...' (bytes), and type(obj.data) should print as <class 'bytes'>:
# debug output
print(obj.data)
print(type(obj.data))
For the obj of type Decoded, your output shows the attribute data to have type bytes (denoted by the prefix b). So the decoded data contains a sequence of digits, from a barcode of type I25 (Interleaved 2 of 5):
Decoded(data=b'43770929851162', type='I25',
Expected input for pyserial's write(data)
The method write(data) accepts bytes; the docs say:
Write the bytes data to the port. This should be of type bytes (or compatible such as bytearray or memoryview). Unicode strings must be encoded (e.g. 'hello'.encode('utf-8')).
Changed in version 2.5: Accepts instances of bytes and bytearray when available (Python 2.6 and newer) and str otherwise.
So the following example should work, since it passes bytes to the function:
ser.write(b'bytes')
Issues
Let's analyze your faulty lines:
a = (bytes([obj.data])) # redundant conversion
# your first attempt
ser.write(bytes(a)) # raises TypeError
# also your second attempt
ser.write(a.encode()) # raises AttributeError
Given that obj.data is already of the needed type bytes:
[obj.data] is a list containing one bytes object
bytes([obj.data]) raises:
TypeError: 'bytes' object cannot be interpreted as an integer
in your first attempt, bytes(a) is equivalent to bytes(bytes([obj.data])), which is redundant
in your second attempt, passing a.encode() as argument to the write function raises an error: the encode function only works on strings, but a is a list, so a.encode() raises the AttributeError.
In Python's interactive REPL you can reproduce each error, try out:
bytes([bytes()]) for the first TypeError
list().encode() for the second AttributeError
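Both one-liners above can be wrapped in try/except to confirm the exception types (plain Python, no pyserial needed):

```python
# First attempt: a bytes object inside the list passed to bytes()
try:
    bytes([bytes()])
except TypeError as e:
    print(e)  # 'bytes' object cannot be interpreted as an integer

# Second attempt: .encode() does not exist on lists
try:
    list().encode()
except AttributeError as e:
    print(e)  # 'list' object has no attribute 'encode'
```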
Solution
for obj in pyzbar.decode(image):
ser.write(obj.data) # assume obj.data is of type bytes
See also
Related questions:
python - pySerial write() won't take my string

How to compress the string using gzip and encode in python 2.7

I'm trying to compress and encode my string using gzip in Python 2.7. I'm able to do that in Python 3, but I'm not getting the same output with my Python 2.7 code:
Python 3:
import sys
import redis
import io
import gzip

redisConn = redis.StrictRedis(host="127.0.0.1", port=6379, db=0)
myValue = "This is test data"
result = gzip.compress(myValue.encode())
redisConn.set("myKey", result)
Python 2.7:
import sys
import redis
import StringIO
import gzip
redisConn = redis.StrictRedis(host="127.0.0.1", port=6379, db=0)
myValue = "This is test data"
out = StringIO.StringIO()
gzip_s = gzip.GzipFile(fileobj=out, mode="w")
result = gzip_s.write(myValue.encode())
redisConn.set("myKey", result)
But my Python 2.7 code is breaking with an error:
'int' object has no attribute 'encode'
Can someone please help with the equivalent Python 2.7 code? My Python 3 version works as expected.
Thanks for your help in advance.
Python 2 doesn't make the distinction between strings and bytes (even if the gzip stream is opened as binary, like here). You can write strings to a binary stream without encoding them. This has some drawbacks, but in your case just remove the .encode() call:
gzip_s.write(myValue)
for a Python 2/3 agnostic code, I would simply do:
if bytes is str:
# python 2, no need to do anything
pass
else:
# python 3+: encode string as bytes
myValue = myValue.encode()
gzip_s.write(myValue)
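On Python 3 the bytes is str test above takes the else branch; the check can be exercised standalone (sample value assumed):

```python
value = "This is test data"

if bytes is str:
    # Python 2: str is already a byte string, nothing to do
    payload = value
else:
    # Python 3: encode the text explicitly before writing
    payload = value.encode()

assert isinstance(payload, bytes)
print(payload)  # b'This is test data'
```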
EDIT: since you seem to issue a command redisConn.set("myKey", result), don't forget to call:
gzip_s.close()
before that, or it's not guaranteed that the file is fully flushed.
Here's a complete working example for Python 2.7. Note that gzip_s.write() returns the number of bytes written, so your code is passing an int to redisConn.set("myKey", result). Also, you should probably encode and decode the data explicitly, to avoid unexpected encoding/decoding errors if you ever need to store non-ASCII data.
# -*- coding: utf-8 -*-
import redis
import StringIO
import gzip
redis_cxn = redis.StrictRedis(host='127.0.0.1', port=6379, db=0)
test_data = u'This is some test data with a non-ASCII character: ñ'
print 'Test data:\n ', test_data
out_file = StringIO.StringIO()
gzip_file = gzip.GzipFile(fileobj=out_file, mode='wb')
# If you don't encode the data yourself, it will be implicitly encoded as
# ASCII.
gzip_file.write(test_data.encode('utf-8'))
gzip_file.close()
# Get the bytes written to the underlying file object
value = out_file.getvalue()
print 'Setting value in Redis'
redis_cxn.set('key', value)
print 'Getting value from Redis'
retrieved_value = redis_cxn.get('key')
assert retrieved_value == value
in_file = StringIO.StringIO()
in_file.write(retrieved_value)
in_file.seek(0)
gzip_file = gzip.GzipFile(fileobj=in_file, mode='rb')
retrieved_data = gzip_file.read()
retrieved_data = retrieved_data.decode('utf-8')
gzip_file.close()
assert retrieved_data == test_data
print 'Data retrieved from Redis and unzipped:\n ', test_data
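For comparison, on Python 3 the round trip from the question's first snippet needs no file objects at all; gzip.compress and gzip.decompress are inverses:

```python
import gzip

original = b"This is test data"
compressed = gzip.compress(original)   # what would be stored in Redis
restored = gzip.decompress(compressed)
assert restored == original
```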

md5 not support in python 3.6 and django 1.10

I want to decrypt a response received from CCAvenue. Their reference code uses the md5 library, which is not supported with Django 1.10 and Python 3.6.
import md5
ModuleNotFoundError: No module named 'md5'
In Python 3.x, you should use this:
from hashlib import md5
You can now feed this object with bytes-like objects (normally
bytes) using the update() method.
e.g.
from hashlib import md5
m = md5()
m.update(b"Nobody")
print(m.hexdigest())
Module name: md5 .
Rationale: Replaced by the 'hashlib' module.
Date: 15-May-2007 .
Documentation: Documented as deprecated as of Python 2.5, but its listing
in this PEP was neglected. DeprecationWarning raised as of Python 2.6.
See more details from hashlib.
Thanks to McGrady's answer. Here is my detailed explanation:
Python 2.x:
md5 is a standalone module, not merged into hashlib
update() supports str, not bytes
Special case: in Python 2.7, md5 is also available from hashlib, but update() still supports str
code:
# Note: the original fallback imported md5 from hashlib but then called
# md5.new(), which fails; normalizing to a single factory fixes that.
try:
    # Python 2: standalone md5 module
    import md5 as _md5_module
    md5_new = _md5_module.new
except ImportError:
    # Python 2.5+ / 3.x: md5 lives in hashlib
    from hashlib import md5 as md5_new

def generateMd5(strToMd5):
    md5Instance = md5_new()
    # <md5 HASH object @ 0x1062af738>
    md5Instance.update(strToMd5)
    encrptedMd5 = md5Instance.hexdigest()
    # af0230c7fcc75b34cbb268b9bf64da79
    return encrptedMd5
Python 3.x:
md5 has been merged into hashlib
update() only supports bytes, not str
code:
from hashlib import md5  # only for Python 3.x

def generateMd5(strToMd5):
    """
    generate md5 string from input string
    eg:
        xxxxxxxx -> af0230c7fcc75b34cbb268b9bf64da79
    :param strToMd5: input string
    :return: md5 string of 32 chars
    """
    encrptedMd5 = ""
    md5Instance = md5()
    # print("type(md5Instance)=%s" % type(md5Instance))  # <class '_hashlib.HASH'>
    # print("type(strToMd5)=%s" % type(strToMd5))  # <class 'str'>
    bytesToMd5 = bytes(strToMd5, "UTF-8")
    # print("type(bytesToMd5)=%s" % type(bytesToMd5))  # <class 'bytes'>
    md5Instance.update(bytesToMd5)
    encrptedMd5 = md5Instance.hexdigest()
    # print("type(encrptedMd5)=%s" % type(encrptedMd5))  # <class 'str'>
    # print("encrptedMd5=%s" % encrptedMd5)  # 3a821616bec2e86e3e232d0c7f392cf5
    return encrptedMd5
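In Python 3 the helper above collapses to a couple of lines; "hello" is used here only as a sample input, and its md5 is a well-known reference digest:

```python
from hashlib import md5

def generate_md5(s):
    # encode str to bytes first; hashlib rejects str input on Python 3
    return md5(s.encode("utf-8")).hexdigest()

print(generate_md5("hello"))  # 5d41402abc4b2a76b9719d911017c592
```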

Tracking down implicit unicode conversions in Python 2

I have a large project where, at various places, problematic implicit Unicode conversions (coercions) were used, in the form of e.g.:
someDynamicStr = "bar" # could come from various sources
# works
u"foo" + someDynamicStr
u"foo{}".format(someDynamicStr)
someDynamicStr = "\xff" # uh-oh
# raises UnicodeDecodeError
u"foo" + someDynamicStr
u"foo{}".format(someDynamicStr)
(Possibly other forms as well.)
Now I would like to track down those usages, especially those in actively used code.
It would be great if I could easily replace the unicode constructor with a wrapper which checks whether the input is of type str and whether the encoding/errors parameters are set to their default values, and then notifies me (prints a traceback or such).
/edit:
While not directly related to what I am looking for, I came across this gloriously horrible hack for making the decode exception go away altogether (the decode direction only, i.e. str to unicode, but not the other way around; see https://mail.python.org/pipermail/python-list/2012-July/627506.html).
I don't plan on using it but it might be interesting for those battling problems with invalid Unicode input and looking for a quick fix (but please think about the side effects):
import codecs
codecs.register_error("strict", codecs.ignore_errors)
codecs.register_error("strict", lambda x: (u"", x.end)) # alternatively
(An internet search for codecs.register_error("strict" revealed that apparently it's used in some real projects.)
/edit #2:
For explicit conversions I made a snippet with the help of a SO post on monkeypatching:
class PatchedUnicode(unicode):
    def __init__(self, obj=None, encoding=None, *args, **kwargs):
        if encoding in (None, "ascii", "646", "us-ascii"):
            print("Problematic unicode() usage detected!")
        super(PatchedUnicode, self).__init__(obj, encoding, *args, **kwargs)

import __builtin__
__builtin__.unicode = PatchedUnicode
This only affects explicit conversions using the unicode() constructor directly so it's not something I need.
/edit #3:
The thread "Extension method for python built-in types!" makes me think that it might actually not be easily possible (in CPython at least).
/edit #4:
It's nice to see many good answers here, too bad I can only give out the bounty once.
In the meantime I came across a somewhat similar question, at least in the sense of what the person tried to achieve: Can I turn off implicit Python unicode conversions to find my mixed-strings bugs?
Please note though that throwing an exception would not have been OK in my case. Here I was looking for something which might point me to the different locations of problematic code (e.g. by printing something), but not something which might exit the program or change its behavior (because this way I can prioritize what to fix).
On another note, the people working on the Mypy project (which include Guido van Rossum) might also come up with something similar helpful in the future, see the discussions at https://github.com/python/mypy/issues/1141 and more recently https://github.com/python/typing/issues/208.
/edit #5
I also came across the following but didn't have yet the time to test it: https://pypi.python.org/pypi/unicode-nazi
You can register a custom encoding which prints a message whenever it's used:
Code in ourencoding.py:
import sys
import codecs
import traceback

# Define a function to print out a stack frame and a message:
def printWarning(s):
    sys.stderr.write(s)
    sys.stderr.write("\n")
    l = traceback.extract_stack()
    # cut off the frames pointing to printWarning and our_encode
    l = traceback.format_list(l[:-2])
    sys.stderr.write("".join(l))

# Define our encoding:
originalencoding = sys.getdefaultencoding()

def our_encode(s, errors='strict'):
    printWarning("Default encoding used")
    return (codecs.encode(s, originalencoding, errors), len(s))

def our_decode(s, errors='strict'):
    printWarning("Default encoding used")
    return (codecs.decode(s, originalencoding, errors), len(s))

def our_search(name):
    if name == 'our_encoding':
        return codecs.CodecInfo(
            name='our_encoding',
            encode=our_encode,
            decode=our_decode)
    return None

# register our search and set the default encoding:
codecs.register(our_search)
reload(sys)
sys.setdefaultencoding('our_encoding')
If you import this file at the start of your script, then you'll see warnings for implicit conversions:
#!python2
# coding: utf-8
import ourencoding
print("test 1")
a = "hello " + u"world"
print("test 2")
a = "hello ☺ " + u"world"
print("test 3")
b = u" ".join(["hello", u"☺"])
print("test 4")
c = unicode("hello ☺")
output:
test 1
test 2
Default encoding used
File "test.py", line 10, in <module>
a = "hello ☺ " + u"world"
test 3
Default encoding used
File "test.py", line 13, in <module>
b = u" ".join(["hello", u"☺"])
test 4
Default encoding used
File "test.py", line 16, in <module>
c = unicode("hello ☺")
It's not perfect, as test 1 shows: if the converted string only contains ASCII characters, sometimes you won't see a warning.
What you can do is the following:
First create a custom encoding. I will call it "lascii" for "logging ASCII":
import codecs
import traceback

def lascii_encode(input, errors='strict'):
    print("ENCODED:")
    traceback.print_stack()
    return codecs.ascii_encode(input)

def lascii_decode(input, errors='strict'):
    print("DECODED:")
    traceback.print_stack()
    return codecs.ascii_decode(input)

class Codec(codecs.Codec):
    def encode(self, input, errors='strict'):
        return lascii_encode(input, errors)

    def decode(self, input, errors='strict'):
        return lascii_decode(input, errors)

class IncrementalEncoder(codecs.IncrementalEncoder):
    def encode(self, input, final=False):
        print("Incremental ENCODED:")
        traceback.print_stack()
        return codecs.ascii_encode(input)

class IncrementalDecoder(codecs.IncrementalDecoder):
    def decode(self, input, final=False):
        print("Incremental DECODED:")
        traceback.print_stack()
        return codecs.ascii_decode(input)

class StreamWriter(Codec, codecs.StreamWriter):
    pass

class StreamReader(Codec, codecs.StreamReader):
    pass

def getregentry():
    return codecs.CodecInfo(
        name='lascii',
        encode=lascii_encode,
        decode=lascii_decode,
        incrementalencoder=IncrementalEncoder,
        incrementaldecoder=IncrementalDecoder,
        streamwriter=StreamWriter,
        streamreader=StreamReader,
    )
What this does is basically the same as the ASCII codec, except that it prints a message and the current stack trace every time it encodes or decodes between unicode and lascii.
Now you need to make it available to the codecs module so that it can be found by the name "lascii". For this you need to create a search function that returns the lascii-codec when it's fed with the string "lascii". This is then registered to the codecs module:
def searchFunc(name):
    if name == "lascii":
        return getregentry()
    else:
        return None

codecs.register(searchFunc)
The last thing that is now left to do is to tell the sys module to use 'lascii' as default encoding:
import sys
reload(sys) # necessary, because sys.setdefaultencoding is deleted on start of Python
sys.setdefaultencoding('lascii')
Warning:
This uses some deprecated or otherwise unrecommended features. It might not be efficient or bug-free. Do not use in production, only for testing and/or debugging.
Just add:
from __future__ import unicode_literals
at the beginning of your source code files. It has to be the first import, and it has to be in all affected source code files; then the headache of using unicode in Python 2.7 goes away. If you didn't do anything super weird with strings, it should get rid of the problem out of the box.
Check out the following Copy&Paste from my console - I tried with the sample from your question:
user@linux2:~$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> someDynamicStr = "bar" # could come from various sources
>>>
>>> # works
... u"foo" + someDynamicStr
u'foobar'
>>> u"foo{}".format(someDynamicStr)
u'foobar'
>>>
>>> someDynamicStr = "\xff" # uh-oh
>>>
>>> # raises UnicodeDecodeError
... u"foo" + someDynamicStr
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
>>> u"foo{}".format(someDynamicStr)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
>>>
And now with __future__ magic:
user@linux2:~$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from __future__ import unicode_literals
>>> someDynamicStr = "bar" # could come from various sources
>>>
>>> # works
... u"foo" + someDynamicStr
u'foobar'
>>> u"foo{}".format(someDynamicStr)
u'foobar'
>>>
>>> someDynamicStr = "\xff" # uh-oh
>>>
>>> # raises UnicodeDecodeError
... u"foo" + someDynamicStr
u'foo\xff'
>>> u"foo{}".format(someDynamicStr)
u'foo\xff'
>>>
I see you have a lot of edits relating to solutions you may have encountered. I'm just going to address your original post, which I believe to be: "I want to create a wrapper around the unicode constructor that checks its input".
The unicode constructor is a Python built-in. You can decorate it to add the checks:
def add_checks(fxn):
    def resulting_fxn(*args, **kwargs):
        # this is where you check whether the input is of type str
        if type(args[0]) is str:
            # do something, e.g. print information (a traceback or such)
            # TODO: for a traceback, you'll want to use the traceback module
            pass
        # this is where the encoding/errors parameters are set to the default values
        encoding = 'utf-8'
        # set default error behavior
        error = 'ignore'
        return fxn(args[0], encoding, error)
    return resulting_fxn
Using it looks like this:
unicode = add_checks(unicode)
We overwrite the existing function name so that you don't have to change all the calls in the large project. You'll want to do this very early in the runtime so that subsequent calls get the new behavior.

Example of use libextractor3 from python

I'm using python-extractor to work with libextractor3. I can't find any examples of it. Does anyone have any documentation or examples?
The source package of python-extractor contains a file named extract.py, which has a small demo of how to use the libextractor Python binding.
Content from extract.py
import extractor
import sys
from ctypes import *
import struct

xtract = extractor.Extractor()

def print_k(xt, plugin, type, format, mime, data, datalen):
    mstr = cast(data, c_char_p)
    # FIXME: this ignores 'datalen', not that great...
    # (in general, depending on the mime type and format, only
    # the first 'datalen' bytes in 'data' should be used).
    if (format == extractor.EXTRACTOR_METAFORMAT_UTF8):
        print "%s - %s" % (xtract.keywordTypes()[type], mstr.value)
    return 0

for arg in sys.argv[1:]:
    print "Keywords from %s:" % arg
    xtract.extract(print_k, None, arg)
To get a better understanding of python-extractor, go through the source code in extractor.py.
