If I have a byte such as 11001010 or 01001010, how can I convert it back to Unicode if it is a valid code point?
I can take the input and do a regex check on it, but that would be a crude way of doing it, and it would be limited to UTF-8. If I want to support other encodings in future, how can I generalise the solution?
The input is a string of 0s and 1s:
11001010 This is invalid
or 01001010 This is valid
or 11010010 11001110 This is invalid
If there is no other text, split the strings on whitespace, convert each to an integer and feed the result to a bytearray() object to decode:
as_binary = bytearray(int(b, 2) for b in inputtext.split())
as_unicode = as_binary.decode('utf8')
By putting the integer values into a bytearray() we avoid having to concatenate individual characters and get a convenient .decode() method as a bonus.
Note that this does expect the input to contain valid UTF-8. You could add an error handler to replace bad bytes rather than raise an exception, e.g. as_binary.decode('utf8', 'replace').
Wrapped up as a function that takes a codec and error handler:
def to_text(inputtext, encoding='utf8', errors='strict'):
    as_binary = bytearray(int(b, 2) for b in inputtext.split())
    return as_binary.decode(encoding, errors)
Most of your samples are not actually valid UTF-8, so the demo sets errors to 'replace':
>>> to_text('11001010', errors='replace')
u'\ufffd'
>>> to_text('01001010', errors='replace')
u'J'
>>> to_text('11010010 11001110', errors='replace')
u'\ufffd\ufffd'
Leave errors at its default ('strict') if you want to detect invalid data; just catch the UnicodeDecodeError that is raised:
>>> to_text('11010010 11001110')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in to_text
  File "/Users/mjpieters/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd2 in position 0: invalid continuation byte
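For the "extend in future" part of the question, here is a minimal Python 3 sketch. The name decode_bits is hypothetical (it is not part of the answer above); it takes the codec as a parameter and returns None for invalid input instead of raising:

```python
def decode_bits(bit_text, encoding="utf-8"):
    """Return the decoded text, or None if the bytes are invalid in `encoding`."""
    # Each whitespace-separated chunk is one byte written in binary.
    raw = bytes(int(chunk, 2) for chunk in bit_text.split())
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


print(decode_bits("01001010"))             # J
print(decode_bits("11001010"))             # None (lone UTF-8 lead byte)
print(decode_bits("11010010 11001110"))    # None (bad continuation byte)
print(decode_bits("11001010", "latin-1"))  # Ê (every byte is valid Latin-1)
```

Switching encodings is then just a matter of passing a different codec name, so no per-encoding regex is needed.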
It's possible to print an emoji from its hex code with the u'\uXXXX' pattern in Python, e.g.
>>> print(u'\u231B')
⌛
However, if I have a list of hex codes like 231B, just "adding" the strings won't work:
>>> print(u'\u' + ' 231B')
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
The chr() fails too:
>>> chr('231B')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: an integer is required (got type str)
The first part of my question: given a hex code, e.g. 231A, how do I get the emoji as a str?
My goal is to get the list of emojis from https://unicode.org/Public/emoji/13.0/emoji-sequences.txt by reading the hex codes in the first column.
Some entries are ranges such as 231A..231B. The second part of my question: given a hex code range, how do I iterate through it to get each emoji as a str? For 2648..2653 I could write range(2648, 2653+1), but when the hex code contains a letter, e.g. 1F232..1F236, using range() like that is not possible.
Thanks @amadan for the solutions!!
TL;DR
To write the list of emojis from https://unicode.org/Public/emoji/13.0/emoji-sequences.txt to a file:
import requests

response = requests.get('https://unicode.org/Public/emoji/13.0/emoji-sequences.txt')
with open('emoji.txt', 'w', encoding='utf8') as fout:
    for line in response.content.decode('utf8').split('\n'):
        # Skip blank lines and comments.
        if line.strip() and not line.startswith('#'):
            # The first ';'-separated column holds the code point(s).
            hexa = line.split(';')[0]
            hexa = hexa.split('..')
            if len(hexa) == 1:
                # A single code point or a space-separated sequence.
                ch = ''.join([chr(int(h, 16)) for h in hexa[0].strip().split(' ')])
                print(ch, end='\n', file=fout)
            else:
                # An inclusive range such as 231A..231B.
                start, end = hexa
                for ch in range(int(start, 16), int(end, 16)+1):
                    print(chr(ch), end='\n', file=fout)
Convert hex string to number, then use chr:
chr(int('231B', 16))
# => '⌛'
or directly use a hex literal:
chr(0x231B)
To use a range, again, you need an int, either converted from a string or using a hex literal:
''.join(chr(c) for c in range(0x2648, 0x2654))
# => '♈♉♊♋♌♍♎♏♐♑♒♓'
or
''.join(chr(c) for c in range(int('2648', 16), int('2654', 16)))
(NOTE: you'd get something very different from range(2648, 2654)!)
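Putting the two cases together, here is a small sketch of a helper that expands one first-column field from emoji-sequences.txt. The name expand_field is hypothetical, and it assumes the field is either a START..END range or a space-separated sequence of hex code points:

```python
def expand_field(field):
    """Expand one code-point field into a list of strings."""
    field = field.strip()
    if ".." in field:
        # Inclusive range: one single-character string per code point.
        start, end = (int(part, 16) for part in field.split(".."))
        return [chr(cp) for cp in range(start, end + 1)]
    # Single code point or multi-code-point sequence: one combined string.
    return ["".join(chr(int(cp, 16)) for cp in field.split())]


print(expand_field("231A..231B"))         # ['⌚', '⌛']
print(len(expand_field("1F232..1F236")))  # 5
```

Because both ends go through int(..., 16), ranges containing letters such as 1F232..1F236 work the same way as purely numeric ones.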
In Python 3, read(size) has the following documentation:
Read and return at most size characters from the stream as a single str. If size is negative or None, reads until EOF.
But suppose that you seek() to the middle of a multi-byte UTF-8 character. What will read(1) return?
The partial Unicode character can't be decoded, so Python will typically raise a UnicodeDecodeError. But you can recover from the problem. UTF-8 is self-synchronizing: a lead byte (0x00-0x7F or 0xC2-0xF4) never falls in the continuation-byte range (0x80-0xBF), so you just need to keep seeking backwards one byte at a time until the decode works.
>>> def read_unicode(fp, position, count):
...     while position >= 0:
...         fp.seek(position)
...         try:
...             return fp.read(count)
...         except UnicodeDecodeError:
...             position -= 1
...     raise UnicodeDecodeError("utf-8", b"", 0, 0, "File not decodable")
...
>>> open('test.txt', 'w', encoding='utf-8').write("学"*10000)
10000
>>> f=open('test.txt', 'r', encoding='utf-8')
>>> f.seek(32)
32
>>> f.read(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa6 in position 0: invalid start byte
>>> read_unicode(f, 32, 1)
'学'
Text streams in Python 3 don't support arbitrary seek offsets. You're only supposed to use an offset of 0, or a value returned by tell(), with whence set to SEEK_SET. Everything else is undefined or unsupported behavior. See the docs for TextIOBase.seek.
Sure, in practice you might get a UnicodeDecodeError, but that is not guaranteed. As soon as you violate the API's contract, it can do whatever it wants.
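If you do need arbitrary byte offsets, one way to stay within the documented API is to open the file in binary mode and do the resynchronization yourself. Here is a sketch, assuming the data is valid UTF-8; read_chars_at is a hypothetical helper, and io.BytesIO stands in for a file opened with mode 'rb':

```python
import codecs
import io


def read_chars_at(binary_fp, position, count):
    """Read `count` characters starting at the character containing `position`."""
    # Step back while the byte at `position` is a continuation byte (0b10xxxxxx).
    while position > 0:
        binary_fp.seek(position)
        byte = binary_fp.read(1)
        if not byte or (byte[0] & 0xC0) != 0x80:
            break
        position -= 1
    binary_fp.seek(position)
    # Feed bytes to an incremental decoder until enough characters come out.
    decoder = codecs.getincrementaldecoder("utf-8")()
    chars = []
    while len(chars) < count:
        byte = binary_fp.read(1)
        if not byte:
            break
        chars.extend(decoder.decode(byte))
    return "".join(chars)


sample = io.BytesIO(("学" * 20).encode("utf-8"))
print(read_chars_at(sample, 32, 1))  # 学
```

Reading one byte at a time keeps the sketch short; a real implementation would probably read in larger chunks.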
The program works for a short time and then hits an error and I have no idea what it means or how to fix it.
Here is the code:
from bs4 import BeautifulSoup
import urllib

BASE_URL = "https://en.wikipedia.org/wiki/List_of_national_capitals_in_alphabetical_order"
capitals_countries = []

html = urllib.urlopen(BASE_URL).read()
soup = BeautifulSoup(html, "html.parser")
country_table = soup.find('table', {"class" : "wikitable sortable"})

for row in country_table.find_all('tr'):
    cols = row.find_all('td')
    if len(cols) == 3:
        capitals_countries.append((cols[0].text.strip(), cols[1].text.strip()))

for capital, country in capitals_countries:
    print('{:35} {}'.format(capital, country))
Here is the error:
Traceback (most recent call last):
  File "/Users/Kyle/Documents/scraper.py", line 19, in <module>
    print('{:35} {}'.format(capital, country))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 5: ordinal not in range(128)
I have a rather basic understanding of HTML and scraping in general. I would appreciate any clarity that anyone can provide on what is going on here.
You already have a unicode string, so trying capital.decode('utf-8') is going to give you:
In [13]: s = u'\xe1'
In [14]: print s
á
In [15]: s.decode("utf-8")
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-15-61efbeae9c77> in <module>()
----> 1 s.decode("utf-8")
/usr/lib/python2.7/encodings/utf_8.pyc in decode(input, errors)
14
15 def decode(input, errors='strict'):
---> 16 return codecs.utf_8_decode(input, errors, True)
17
18 class IncrementalEncoder(codecs.IncrementalEncoder):
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)
The reason you see it in your own code is that you are effectively doing the same thing with str.format: when you call format on a byte string and pass it a unicode argument, Python tries to encode that argument to ASCII, which fails because it contains non-ASCII characters:
In [16]: print "{}".format(s)
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-16-1119d22adcca> in <module>()
----> 1 print "{}".format(s)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)
All you need is to make the format string a unicode string with a leading u; do not decode anything:
In [17]: print u"{}".format(s)
á
So in your own code you need a leading u on your format string, nothing else.
for capital, country in capitals_countries:
    print(u'{:35} {}'.format(capital, country))
You can verify that you have a unicode string by adding a print type(capital), which would output <type 'unicode'>.
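For comparison, in Python 3 every str is already Unicode, so the same format call needs no u prefix. Here is a small sketch with hard-coded rows (assumed sample data standing in for the scraped table), plus a check that it is the ASCII encode step that fails:

```python
# Assumed sample data; the real script would fill this list from the table.
rows = [("Bogotá", "Colombia"), ("Brasília", "Brazil")]

# Formatting never encodes anything in Python 3, so this is safe.
lines = ["{:35} {}".format(capital, country) for capital, country in rows]
for line in lines:
    print(line)

# Explicitly encoding to ASCII reproduces the failure from the question.
try:
    "Bogotá".encode("ascii")
except UnicodeEncodeError as exc:
    reason = exc.reason
    print(reason)  # ordinal not in range(128)
```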
Looking at the list, I see that the code broke at Bogotá.
I think it breaks because of the special character á.
When I change the print statement to
print(u'{:35} {}'.format(capital, country))
it works perfectly fine.
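If capital and country were UTF-8 byte strings, decoding them first would fix the issue; on values that are already unicode, as they are here, the implicit ASCII encode makes this fail with the same UnicodeEncodeError: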
print('{:35} {}'.format(capital.decode('utf-8'), country.decode('utf-8')))
Or, as suggested by @karlson in the comments, we can use a unicode format string:
print(u'{:35} {}'.format(capital, country))
Now the u'{:35} {}' part is unicode, so you don't need to decode it any more.
I have
(Pdb) email
'\x00t\x00e\x00s\x00t\x00#\x00g\x00m\x00a\x00i\x00l\x00.\x00c\x00o\x00m\x00'
(Pdb) print email
test#gmail.com
I need to validate whether this value is in email format. How can I convert this string to an actual ASCII string?
It looks like it is encoded as UTF-16, but with one stray extra byte, so decoding it as-is fails:
>>> '\x00t\x00e\x00s\x00t\x00#\x00g\x00m\x00a\x00i\x00l\x00.\x00c\x00o\x00m\x00'.decode('utf-16')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x00 in position 28: truncated data
Dropping the leading byte, or ignoring the trailing one, works:
>>> '\x00t\x00e\x00s\x00t\x00#\x00g\x00m\x00a\x00i\x00l\x00.\x00c\x00o\x00m\x00'[1:].decode('utf-16')
u'test#gmail.com'
>>> '\x00t\x00e\x00s\x00t\x00#\x00g\x00m\x00a\x00i\x00l\x00.\x00c\x00o\x00m\x00'[1:].decode('utf-16-le')
u'test#gmail.com'
>>> '\x00t\x00e\x00s\x00t\x00#\x00g\x00m\x00a\x00i\x00l\x00.\x00c\x00o\x00m\x00'.decode('utf-16-be', 'ignore')
u'test#gmail.com'
Given the stray leading byte in the value shown, converting your email to an ASCII string can be done like this:
str(email[1:].decode('utf-16-le'))
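The same idea in Python 3, as a sketch. It assumes the value is big-endian UTF-16 with one stray trailing byte, which is one of the two readings consistent with the pdb output; decode_padded_utf16 is a hypothetical helper name:

```python
def decode_padded_utf16(raw):
    """Decode big-endian UTF-16, dropping one stray trailing byte if present."""
    if len(raw) % 2:
        raw = raw[:-1]
    return raw.decode("utf-16-be")


# Rebuild the 29-byte value from the question: 14 characters plus one extra NUL.
raw = "test#gmail.com".encode("utf-16-be") + b"\x00"
email = decode_padded_utf16(raw)
ascii_email = email.encode("ascii")  # raises UnicodeEncodeError if not pure ASCII
print(email)  # test#gmail.com
```

Once the value is a normal str, the email validation can run on it directly.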
How can I convert my bytearray('b\x9e\x18K\x9a') to something like this --> \x9e\x18K\x9a <-- that is, a plain str, not an array?
>> uidar = bytearray()
>> uidar.append(tag.nti.nai.uid[0])
>> uidar.append(tag.nti.nai.uid[1])
>> uidar.append(tag.nti.nai.uid[2])
>> uidar.append(tag.nti.nai.uid[3])
>> uidar
bytearray('b\x9e\x18K\x9a')
I tried to decode my bytearray with
uid = uidar.decode('utf-8')
but it fails:
Traceback (most recent call last):
File "<pyshell#42>", line 1, in <module>
uid = uidar.decode("utf-8")
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9e in position 0: invalid start byte
Please help.
In Python 2.x, str is a byte string, so calling str() on the bytearray is enough:
>>> str(bytearray('b\x9e\x18K\x9a'))
'b\x9e\x18K\x9a'
Latin-1 maps the first 256 code points to the byte values of the same number, so in Python 3.x:
3>> bytearray(b'b\x9e\x18K\x9a').decode('latin-1')
'b\x9e\x18K\x9a'
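A short Python 3 sketch of the options, assuming the UID is the four bytes appended in the question (the variable names are illustrative only):

```python
# Assumed sample value: the four UID bytes from the question.
uid = bytearray(b"\x9e\x18K\x9a")

as_bytes = bytes(uid)            # immutable bytes copy, no decoding involved
as_text = uid.decode("latin-1")  # str with one character per byte
hex_text = uid.hex()             # printable hex form

# Latin-1 round-trips every byte value, so nothing is lost.
assert as_text.encode("latin-1") == as_bytes
print(hex_text)  # 9e184b9a
```

If the UID only needs to be displayed or stored as text, the hex form is often the more convenient choice, since the Latin-1 string contains unprintable characters.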