Unicode string in Python

I have
(Pdb) email
'\x00t\x00e\x00s\x00t\x00#\x00g\x00m\x00a\x00i\x00l\x00.\x00c\x00o\x00m\x00'
(Pdb) print email
test#gmail.com
I need to validate whether this value is in email format. How can I convert this string to an actual ASCII string?

It looks like UTF-16, but the value has an odd number of bytes (29), so decoding it as-is fails:
>>> '\x00t\x00e\x00s\x00t\x00#\x00g\x00m\x00a\x00i\x00l\x00.\x00c\x00o\x00m\x00'.decode('utf-16')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x00 in position 28: truncated data
Dropping the stray leading byte, or ignoring the dangling trailing byte, makes it decode:
>>> '\x00t\x00e\x00s\x00t\x00#\x00g\x00m\x00a\x00i\x00l\x00.\x00c\x00o\x00m\x00'[1:].decode('utf-16')
u'test#gmail.com'
>>> '\x00t\x00e\x00s\x00t\x00#\x00g\x00m\x00a\x00i\x00l\x00.\x00c\x00o\x00m\x00'[1:].decode('utf-16-le')
u'test#gmail.com'
>>> '\x00t\x00e\x00s\x00t\x00#\x00g\x00m\x00a\x00i\x00l\x00.\x00c\x00o\x00m\x00'.decode('utf-16-be', 'ignore')
u'test#gmail.com'

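Converting your email to an ASCII string can be done like this (calling .decode('utf-16le') on the raw 29-byte value raises the same "truncated data" error, so the dangling byte has to be ignored or sliced off):
str(email.decode('utf-16-be', 'ignore'))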
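For readers on Python 3, the same repair could be sketched as below. This is a minimal sketch that assumes the value is big-endian UTF-16 with one stray trailing byte, as in the question; the helper name decode_utf16_be_lenient is made up for illustration.

```python
def decode_utf16_be_lenient(data):
    """Decode big-endian UTF-16 bytes, dropping one dangling byte (hypothetical helper)."""
    if len(data) % 2:
        # UTF-16 code units are 2 bytes wide; an odd length means one stray byte.
        data = data[:-1]
    return data.decode('utf-16-be')


raw = b'\x00t\x00e\x00s\x00t\x00#\x00g\x00m\x00a\x00i\x00l\x00.\x00c\x00o\x00m\x00'
email = decode_utf16_be_lenient(raw)
# Encoding to ASCII raises UnicodeEncodeError if any non-ASCII character is present.
ascii_email = email.encode('ascii')
```

Slicing off the odd byte explicitly, rather than passing 'ignore', keeps genuine decoding errors elsewhere in the data visible.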

Related

"01"-string representing bytes to unicode conversion in python 2

If I have a byte such as 11001010 or 01001010, how can I convert it back to Unicode if it is a valid code point?
I could take the input and check it with a regex, but that would be a crude approach, and it would be limited to UTF-8. If I want to support other encodings in the future, how can I improve the solution?
The input is a string of 0s and 1s:
11001010 This is invalid
or 01001010 This is valid
or 11010010 11001110 This is invalid
If there is no other text, split the strings on whitespace, convert each to an integer and feed the result to a bytearray() object to decode:
as_binary = bytearray(int(b, 2) for b in inputtext.split())
as_unicode = as_binary.decode('utf8')
By putting the integer values into a bytearray() we avoid having to concatenate individual characters and get a convenient .decode() method as a bonus.
Note that this does expect the input to contain valid UTF-8. You could add an error handler to replace bad bytes rather than raise an exception, e.g. as_binary.decode('utf8', 'replace').
Wrapped up as a function that takes a codec and error handler:
def to_text(inputtext, encoding='utf8', errors='strict'):
    as_binary = bytearray(int(b, 2) for b in inputtext.split())
    return as_binary.decode(encoding, errors)
Most of your samples are not actually valid UTF-8, so the demo sets errors to 'replace':
>>> to_text('11001010', errors='replace')
u'\ufffd'
>>> to_text('01001010', errors='replace')
u'J'
>>> to_text('11001010', errors='replace')
u'\ufffd'
>>> to_text('11010010 11001110', errors='replace')
u'\ufffd\ufffd'
Leave errors to the default if you want to detect invalid data; just catch the UnicodeDecodeError exception thrown:
>>> to_text('11010010 11001110')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in to_text
  File "/Users/mjpieters/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd2 in position 0: invalid continuation byte
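To turn that exception into the valid/invalid answer the question asks for, something like the following could work. It is a sketch for Python 3 that repeats to_text from above; is_valid is a hypothetical name, and it catches ValueError because UnicodeDecodeError is a subclass of it, so malformed tokens (non-binary digits, values above 255) are also reported as invalid.

```python
def to_text(inputtext, encoding='utf8', errors='strict'):
    # Each whitespace-separated token is one byte written in base 2.
    as_binary = bytearray(int(b, 2) for b in inputtext.split())
    return as_binary.decode(encoding, errors)


def is_valid(inputtext, encoding='utf8'):
    """Return True if the bit strings decode cleanly in the given codec (hypothetical helper)."""
    try:
        to_text(inputtext, encoding)
    except ValueError:
        # Covers UnicodeDecodeError as well as bad tokens such as '102'.
        return False
    return True
```

Because the codec is a parameter, checking a different encoding later only means passing another name, for example is_valid('11001010', 'latin-1').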

What happens if you seek() to the middle of a multi-byte UTF-8 character and call read(1)?

In Python 3, read(size) has the following documentation:
Read and return at most size characters from the stream as a single str. If size is negative or None, reads until EOF.
But suppose that you seek() to the middle of a multi-byte UTF-8 character. What will read(1) return?
The partial character can't be decoded, so Python will typically raise a UnicodeDecodeError. You can recover from the problem, though. UTF-8 is self-synchronizing: the first byte of a character (0x00-0x7f, or 0xc0-0xfd in the original design; 0xc2-0xf4 in current UTF-8) never falls in the continuation-byte range (0x80-0xbf), so you just need to keep seeking backwards by 1 byte until the decode works.
>>> def read_unicode(fp, position, count):
...     while position >= 0:
...         fp.seek(position)
...         try:
...             return fp.read(count)
...         except UnicodeDecodeError:
...             position -= 1
...     # UnicodeDecodeError needs five constructor arguments, so raise ValueError instead
...     raise ValueError("File not decodable")
...
>>> open('test.txt', 'w', encoding='utf-8').write("学"*10000)
10000
>>> f=open('test.txt', 'r', encoding='utf-8')
>>> f.seek(32)
32
>>> f.read(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa6 in position 0: invalid start byte
>>> read_unicode(f, 32, 1)
'学'
Text streams in Python 3 don't support arbitrary seek offsets. With a whence of SEEK_SET, you're only supposed to use an offset of 0 or a value returned by tell(). Everything else is undefined or unsupported behavior. See the docs for TextIOBase.seek.
Sure, in practice, you might get UnicodeDecodeError, but that is not a guarantee. As soon as you violate the API contractual requirements, it can do whatever it wants.
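One way to stay inside the documented behavior is to open the file in binary mode, where byte offsets are well defined, and do the resynchronization by hand. The following is a sketch under that assumption; read_char_at is a hypothetical helper, and io.BytesIO stands in for a file opened with mode 'rb'.

```python
import io


def read_char_at(binary_fp, position):
    """Return the UTF-8 character covering byte offset `position` (hypothetical helper)."""
    # Back up while the byte under the cursor is a continuation byte (0b10xxxxxx).
    while position > 0:
        binary_fp.seek(position)
        byte = binary_fp.read(1)
        if not byte or (byte[0] & 0xC0) != 0x80:
            break
        position -= 1
    binary_fp.seek(position)
    lead = binary_fp.read(1)
    if not lead:
        return ''  # at or past the end of the stream
    first = lead[0]
    # The high bits of the lead byte give the length of the sequence.
    if first < 0x80:
        length = 1
    elif first >> 5 == 0b110:
        length = 2
    elif first >> 4 == 0b1110:
        length = 3
    elif first >> 3 == 0b11110:
        length = 4
    else:
        raise ValueError("not a UTF-8 lead byte at offset %d" % position)
    return (lead + binary_fp.read(length - 1)).decode('utf-8')


# Same data as the demo above: offset 32 lands on the last byte of a 3-byte character.
buf = io.BytesIO(('学' * 20).encode('utf-8'))
char = read_char_at(buf, 32)
```

The final decode still raises UnicodeDecodeError if the bytes after the lead byte are not a well-formed sequence, so corrupt data is not silently accepted.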

How to base64 encode/decode a variable with string type in Python 3?

I get an error saying that the value being encoded needs to be bytes, not str/dict.
I know that adding a "b" prefix to a literal fixes that and prints the encoded value:
import base64
s = base64.b64encode(b'12345')
print(s)
>>b'MTIzNDU='
But how do I encode a variable, such as:
import base64
s = "12345"
s2 = base64.b64encode(s)
print(s2)
It gives me an error both with and without the b, and I don't understand why.
I'm also trying to encode/decode a dictionary with base64.
You need to encode the unicode string. If it's just normal characters, you can use ASCII. If it might have other characters in it, or just for general safety, you probably want utf-8.
>>> import base64
>>> s = "12345"
>>> s2 = base64.b64encode(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ". . . /lib/python3.3/base64.py", line 58, in b64encode
    raise TypeError("expected bytes, not %s" % s.__class__.__name__)
TypeError: expected bytes, not str
>>> s2 = base64.b64encode(s.encode('ascii'))
>>> print(s2)
b'MTIzNDU='
>>>
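The question also asks about decoding and about dictionaries, which the session above does not cover. A possible round trip is sketched below for Python 3; the helper names are invented for illustration, and serializing the dictionary with json is an assumption (it only works for JSON-compatible values).

```python
import base64
import json


def b64_encode_text(text, encoding='utf-8'):
    """str -> Base64 str: encode to bytes, Base64-encode, then decode the ASCII result."""
    return base64.b64encode(text.encode(encoding)).decode('ascii')


def b64_decode_text(token, encoding='utf-8'):
    """Base64 str -> original str."""
    return base64.b64decode(token).decode(encoding)


def b64_encode_dict(data):
    """dict -> Base64 str, going through a JSON string (assumes JSON-compatible values)."""
    return b64_encode_text(json.dumps(data))


def b64_decode_dict(token):
    """Base64 str -> dict, reversing b64_encode_dict."""
    return json.loads(b64_decode_text(token))


s = "12345"
s2 = b64_encode_text(s)
restored = b64_decode_text(s2)
```

Decoding the Base64 output to str with 'ascii' is safe because the Base64 alphabet contains only ASCII characters.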

How to convert bytearray('b\x9e\x18K\x9a') to a plain str like '\x9e\x18K\x9a', not an array

How can I convert my bytearray('b\x9e\x18K\x9a') to something like '\x9e\x18K\x9a', that is, a plain str and not an array?
>> uidar = bytearray()
>> uidar.append(tag.nti.nai.uid[0])
>> uidar.append(tag.nti.nai.uid[1])
>> uidar.append(tag.nti.nai.uid[2])
>> uidar.append(tag.nti.nai.uid[3])
>> uidar
bytearray('b\x9e\x18K\x9a')
I tried to decode my bytearray with
uid = uidar.decode('utf-8')
but it fails:
Traceback (most recent call last):
  File "<pyshell#42>", line 1, in <module>
    uid = uidar.decode("utf-8")
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9e in position 0: invalid start byte
Help me Please ...
In 2.x, strings are bytestrings.
>>> str(bytearray('b\x9e\x18K\x9a'))
'b\x9e\x18K\x9a'
Latin-1 maps the first 256 characters to their bytevalue equivalents, so in Python 3.x:
3>> bytearray(b'b\x9e\x18K\x9a').decode('latin-1')
'b\x9e\x18K\x9a'
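The Latin-1 approach is lossless in both directions, which the following Python 3 sketch tries to show. The four UID byte values are taken from the question; the helper names are hypothetical, and the hex form is an extra suggestion that is often easier to read for an identifier like this.

```python
def bytes_to_text(data):
    """Map each byte 0-255 to the code point with the same number (Latin-1)."""
    return bytes(data).decode('latin-1')


def text_to_bytes(text):
    """Inverse of bytes_to_text; raises UnicodeEncodeError for code points above 255."""
    return text.encode('latin-1')


uidar = bytearray([0x9e, 0x18, 0x4b, 0x9a])  # the four UID bytes from the question
uid = bytes_to_text(uidar)
uid_hex = bytes(uidar).hex()  # printable alternative to a str full of control characters
```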

QString: Unicode encoding-decoding problem

I am trying to do a simple conversion from a Unicode string to a standard string, but without success.
I have: PyQt4.QtCore.QString(u'\xc5\x9f')
I want: '\xc5\x9f' (note the str type, not unicode), because the library I am using does not accept unicode.
Here is what I tried; you can see how hopeless I am :) :
>>> s = QtCore.QString(u'\xc5\x9f')
>>> str(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
>>> s.toUtf8()
PyQt4.QtCore.QByteArray('\xc3\x85\xc2\x9f')
>>> s.toUtf8().decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'QByteArray' object has no attribute 'decode'
>>> str(s.toUtf8()).decode("utf-8")
u'\xc5\x9f'
>>> str(str(s.toUtf8()).decode("utf-8"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
I know there are a lot of questions related to Unicode, but I can't find this answer.
What should I do?
Edit:
I found a hacky way:
>>> unicoded = str(s.toUtf8()).decode("utf-8")
>>> unicoded
u'\xc5\x9f'
>>> eval(repr(unicoded)[1:])
'\xc5\x9f'
Do you know a better way?
If you have a Unicode string of type QString and need to convert it to a Python string, just call:
unicode(YOUR_QSTRING_STRING)
Is this what you are after?
In [23]: a
Out[23]: u'\xc5\x9f'
In [24]: a.encode('latin-1')
Out[24]: '\xc5\x9f'
In [25]: type(a.encode('latin-1'))
Out[25]: <type 'str'>
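Encoding with Latin-1 works here because the two code points U+00C5 and U+009F are really the two bytes of a UTF-8 sequence that was stored as if each byte were a character. A Python 3 sketch of the same idea, without PyQt (the helper name is made up for illustration):

```python
def code_points_to_bytes(text):
    """Turn a str whose code points are really byte values back into bytes (hypothetical helper)."""
    # Latin-1 maps U+0000..U+00FF one-to-one onto the byte values 0x00..0xFF.
    return text.encode('latin-1')


s = '\xc5\x9f'
raw = code_points_to_bytes(s)   # the byte string the library expects
decoded = raw.decode('utf-8')   # what those bytes mean when read as UTF-8
double_encoded = s.encode('utf-8')
```

Here double_encoded reproduces the four bytes shown by s.toUtf8() in the question, which is why that route did not give the expected result.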
