I have a string containing utf-8 encoded text. I need to remove the last utf-8 character.
So far I did
msg = msg[:-1]
but this only removes the last byte. It works as long as the last character is an ASCII code. It doesn't work anymore when the last character is a multibyte character.
The simplest way is to decode your UTF-8 bytes to Unicode text:
without_last = msg.decode('utf8')[:-1]
You can always encode it again.
The alternative would be for you to search for a UTF-8 start byte; UTF-8 byte sequences always start with a byte with the most significant bit set to 0, or the two most significant bits set to 1, while continuation bytes always start with 10:
# find starting byte of last codepoint
pos = len(msg) - 1
while pos > -1 and ord(msg[pos]) & 0xC0 == 0x80:
# character at pos is a continuation byte (bit 7 set, bit 6 not)
pos -= 1
msg = msg[:pos]
Related
print(bytes('ba', 'utf-16'))
Result :
b'\xff\xfeb\x00a\x00'
I understand utf-16 means every character will take 16 bits means 00000000 00000000 in binary and i understand there are 16 bits here x00a means x00 = 00000000 and a = 01000001 so both gives x00a it is clear to my mind like this but here is the confusion:
\xff\xfeb
1 - What is this ?????????
2 - Why fe ??? it should be x00
i have read a lot of wikipedia articles but it is still not clear
You have,
b'\xff\xfeb\x00a\x00'
This is what you asked for, it has three characters.
b'\xff\xfe' # 0xff 0xfe
b'b\x00' # 0x62 0x00
b'a\x00' # 0x61 0x00
The first is U+FEFF (byte order mark), the second is U+0062 (b), and the third is U+0061 (a). The byte order mark is there to distinguish between little-endian UTF-16 and big-endian UTF-16. It is normal to find a BOM at the beginning of a UTF-16 document.
It is just confusing to read because the 'b' and 'a' look like they're hexadecimal digits, but they're not.
If you don't want the BOM, you can use utf-16le or utf-16be.
>>> bytes('ba', 'utf-16le')
b'b\x00a\x00'
>>> bytes('ba', 'utf-16be')
b'\x00b\x00a'
The problem is that you can get some garbage if you decode as the wrong one. If you use UTF-16 with BOM, you're more likely to get the right result when decoding.
I think you are misinterpreting the printout.
You have 3 16-bit words:
FFFE: This is the byte-order mark required in UTF-16 (Byte order mark - Wikipedia).
00, followed by the 8-bit encoding of 'b' (that is shown as the character 'b' instead of using an \x escape sequence): This is the 16-bit representation of 'b'.
00, followed by the 8-bit encoding of 'a': This is the 16-bit representation of 'a'.
You already got your answer I just wanted to explain it in my own words for future readers.
In UTF-16 encoding, It seems that 'a' should occupy 16 bits or 2 bytes. The 'a' itself needs 8 bits. The question is should I put the remaining zeroes before the value of 'a' or after it? There are two possible ways:
First: 01100001|00000000
Second: 00000000|01100001
If I don't tell you anything and just hand you these, this would happen:
First = b"0110000100000000"
print(hex(int(First, 2))) # 0x6100
print(chr(int(First, 2))) # 愀
Second = b"0000000001100001"
print(hex(int(Second, 2))) # 0x61
print(chr(int(Second, 2))) # a
So you can't say anything just by looking at these bytes. Did I mean to send you 愀 or a ?
First Solution:
I myself tell you about this verbally. About the "Ordering"! Here is where "big-endian" and "little-endian" come into play:
bytes_ = b"a\x00" # >>>>>> Please decode it with "Little-Endian"!
print(bytes_.decode("utf-16-le")) # a - Correct.
print(bytes_.decode("utf-16-be")) # 愀
So If I tell you about the endianness, you can get to the correct character.
You see, without any extra character we were able to achieve this.
Second Solution
I can "embed" the byte ordering into the bytes itself without explicitly telling you! It is called BOM(Byte Order Mark).
ordering1 = b"\xfe\xff"
ordering2 = b"\xff\xfe"
print((ordering1 + b"\x00a").decode("utf-16")) # a
print((ordering2 + b"a\x00").decode("utf-16")) # a
Now just passing "utf-16" to .decode() is enough. It can figure the correct byte out correctly. There is no need to tell about le or be it's already there.
I have been writing a code using the unireedsolomon package. The package adds parity bytes which are mostly extended ASCII characters. I am applying bit-level errors after converting the 'special character' parities using the following code:
def str_to_byte(padded):
byte_array = padded.encode()
binary_int = int.from_bytes(byte_array, "big")
binary_string = bin(binary_int)
without_b = binary_string[2:]
return without_b
def byte_to_str(without_b):
binary_int = int(without_b, 2)
byte_number = binary_int.bit_length() + 7 // 8
binary_array = binary_int.to_bytes(byte_number, "big")
ascii_text = binary_array.decode()
padded_char = ascii_text[:]
return padded_char
After conversion from string to a bit-stream I try to apply errors randomly and there are instances when I am not able to retrieve those special-characters (or characters) back and I encounter the 'utf' error before I could even decode the message.
If I flip a bit or so it has to be inside the 255 ASCII character values but somehow I am getting errors. Is there any way to rectify this ?
It's a bit odd that the encryption package works with Unicode strings. Better to encrypt byte data since it may not be only text that is encrypted/decrypted. Also no need for working with actual binary strings (Unicode 1s and 0s). Flip bits in the byte strings.
Below I've wrapped the encode/decode routines so they take either Unicode text and return byte strings or vice versa. There is also a corrupt function that will flip bits in the encoded result to see the error correction in action:
import unireedsolomon as rs
import random
def corrupt(encoded):
'''Flip up to 3 bits (might pick the same bit more than once).
'''
b = bytearray(encoded) # convert to writable bytes
for _ in range(3):
index = random.randrange(len(b)) # pick random byte
bit = random.randrange(8) # pic random bit
b[index] ^= 1 << bit # flip it
return bytes(b) # back to read-only bytes, but not necessary
def encode(coder,msg):
'''Convert the msg to UTF-8-encoded bytes and encode with "coder". Return as bytes.
'''
return coder.encode(msg.encode('utf8')).encode('latin1')
def decode(coder,encoded):
'''Decode the encoded message with "coder", convert result to bytes and decode UTF-8.
'''
return coder.decode(encoded)[0].encode('latin1').decode('utf8')
coder = rs.RSCoder(20,13)
msg = 'hello(你好)' # 9 Unicode characters, but 13 (maximum) bytes when encoded to UTF-8.
encoded = encode(coder,msg)
print(encoded)
corrupted = corrupt(encoded)
print(corrupted)
decoded = decode(coder,corrupted)
print(decoded)
Output. Note that the first l in hello (ASCII 0x6C) corrupted to 0xEC, then second l changed to an h (ASCII 0x68) and another byte changed from 0xE5 to 0xF5. You can actually randomly change any 3 bytes (not just bits) including error-correcting bytes and the message will still decode.
b'hello(\xe4\xbd\xa0\xe5\xa5\xbd)8\xe6\xd3+\xd4\x19\xb8'
b'he\xecho(\xe4\xbd\xa0\xf5\xa5\xbd)8\xe6\xd3+\xd4\x19\xb8'
hello(你好)
A note about .encode('latin1'): The encoder is using Unicode strings and the Unicode code points U+0000 to U+00FF. Because Latin-1 is the first 256 Unicode code points, the 'latin1' codec will convert a Unicode string made up of those code points 1:1 to their byte values, resulting in a byte string with values ranging from 0-255.
UTF-8 uses a variable length encoding that ranges from 1 to 4 bytes. As you're already found, flipping random bits can result in invalid encodings. Take a look at
https://en.wikipedia.org/wiki/UTF-8#Encoding
Reed Solomon normally uses fixed size elements, in this case probably 8 bit elements, in a bit string. For longer messages, it could use 10 bit, 12 bit, or 16 bit elements. It would make more sense to convert the UTF-8 message into a bit string, zero padded to an element boundary, and then perform Reed Solomon encoding to append parity elements to the bit string. When reading, the bit string should be corrected (or uncorrectable error detected) via Reed Solomon before attempting to convert the bit string back to UTF-8.
How do I get bytes this line of hex ASCII to a hex address as shown below?
I've managed to separate them to each line, but am having issues with the python 3 hex() as it throws errors such as str type when I try to do something like say line.hex().
Input
\xAA\xBB\xCC\xDD\xEE\xFF\x11\x22\x33\x44\x55\x66\x77\x88\x99\x00
Output
0xDDCCBBAA
0x2211FFEE
0x66554433
0x00998877
My Code
import re
a=r"\xAA\xBB\xCC\xDD\xEE\xFF\x11\x22\x33\x44\x55\x66\x77\x88\x99\x00"
r = '\n'.join(re.findall('................|.$', a))
for line in r.splitlines():
print(line)
s = "\xAA\xBB\xCC\xDD\xEE\xFF\x11\x22\x33\x44\x55\x66\x77\x88\x99\x00"
print("\n".join([hex(ord(c)) for c in s]))
Will output:
0xaa
0xbb
0xcc
0xdd
0xee
0xff
0x11
0x22
0x33
0x44
0x55
0x66
0x77
0x88
0x99
0x0
Is this what you needed? This assumes the input string s is a string consisting of Unicode characters. It would be quite helpful to know what you actually want to achieve.
I always verify that my input is correct before I start doing anything. So using regexp, I came up with the following:
import re
a=r"\xAA\xBB\xCC\xDD\xEE\xFF\x11\x22\x33\x44\x55\x66\x77\x88\x99\x00"
# matches a word consisting of 4 bytes encoded as \xHH where H is any hex character
listOfFourHexBytes = re.findall(r'(?:\\x[0-9a-fA-F]{2}){4}', a)
if (len(a) != 4 * 16 or len(listOfFourHexBytes) != 4):
raise Exception("Input does not consist of 16 hexadecimal bytes")
for fourHexbytes in listOfFourHexBytes:
justFourHexBytes = re.sub(r'\\x', '', fourHexbytes)
print("0x" + justFourHexBytes)
# not asked for, but generate number from hex value first
n = int(justFourHexBytes, 16)
print("0x%08X" % n)
It first looks for actual hex bytes in the format you asked for, in groups of 4 hex encoded bytes. Then it checks if there are indeed 4 of those, making sure that the input string is of the correct size (no spurious skipped characters). Then it iterates over those, removing the \x from the strings. The next step is of course easy as pie, just add 0x to the result and print it out.
Usually you need to do something with the values, so I added the int conversion for free.
This question already has answers here:
Split unicode string into 300 byte chunks without destroying characters
(5 answers)
Closed 8 years ago.
I want to split unicode string to max 255 byte characters and return the result as unicode:
# s = arbitrary-length-unicode-string
s.encode('utf-8')[:255].decode('utf-8')
Problem with this snippet, is that if 255-th byte character is part of 2-byte unicode character, I'll get error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd0 in position 254: unexpected end of data
Even if I handle the error I'll get unwanted garbage at the string end.
How to solve this more elegantly?
One very nice property of UTF-8 is that trailing bytes can easily be differentiated from starting bytes. Just work backwards until you've deleted a starting byte.
trunc_s = s.encode('utf-8')[:256]
if len(trunc_s) > 255:
final = -1
while ord(trunc_s[final]) & 0xc0 == 0x80:
final -= 1
trunc_s = trunc_s[:final]
trunc_s = trunc_s.decode('utf-8')
Edit: Check out the answers in the question identified as a duplicate, too.
I'm using Python and Django, but I'm having a problem caused by a limitation of MySQL. According to the MySQL 5.1 documentation, their utf8 implementation does not support 4-byte characters. MySQL 5.5 will support 4-byte characters using utf8mb4; and, someday in future, utf8 might support it as well.
But my server is not ready to upgrade to MySQL 5.5, and thus I'm limited to UTF-8 characters that take 3 bytes or less.
My question is: How to filter (or replace) unicode characters that would take more than 3 bytes?
I want to replace all 4-byte characters with the official \ufffd (U+FFFD REPLACEMENT CHARACTER), or with ?.
In other words, I want a behavior quite similar to Python's own str.encode() method (when passing 'replace' parameter). Edit: I want a behavior similar to encode(), but I don't want to actually encode the string. I want to still have an unicode string after filtering.
I DON'T want to escape the character before storing at the MySQL, because that would mean I would need to unescape all strings I get from the database, which is very annoying and unfeasible.
See also:
"Incorrect string value" warning when saving some unicode characters to MySQL (at Django ticket system)
‘𠂉’ Not a valid unicode character, but in the unicode character set? (at Stack Overflow)
[EDIT] Added tests about the proposed solutions
So I got good answers so far. Thanks, people! Now, in order to choose one of them, I did a quick testing to find the simplest and fastest one.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# vi:ts=4 sw=4 et
import cProfile
import random
import re
# How many times to repeat each filtering
repeat_count = 256
# Percentage of "normal" chars, when compared to "large" unicode chars
normal_chars = 90
# Total number of characters in this string
string_size = 8 * 1024
# Generating a random testing string
test_string = u''.join(
unichr(random.randrange(32,
0x10ffff if random.randrange(100) > normal_chars else 0x0fff
)) for i in xrange(string_size) )
# RegEx to find invalid characters
re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)
def filter_using_re(unicode_string):
return re_pattern.sub(u'\uFFFD', unicode_string)
def filter_using_python(unicode_string):
return u''.join(
uc if uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' else u'\ufffd'
for uc in unicode_string
)
def repeat_test(func, unicode_string):
for i in xrange(repeat_count):
tmp = func(unicode_string)
print '='*10 + ' filter_using_re() ' + '='*10
cProfile.run('repeat_test(filter_using_re, test_string)')
print '='*10 + ' filter_using_python() ' + '='*10
cProfile.run('repeat_test(filter_using_python, test_string)')
#print test_string.encode('utf8')
#print filter_using_re(test_string).encode('utf8')
#print filter_using_python(test_string).encode('utf8')
The results:
filter_using_re() did 515 function calls in 0.139 CPU seconds (0.138 CPU seconds at the sub() built-in)
filter_using_python() did 2097923 function calls in 3.413 CPU seconds (1.511 CPU seconds at the join() call and 1.900 CPU seconds evaluating the generator expression)
I did no test using itertools because... well... that solution, although interesting, was quite big and complex.
Conclusion
The RegEx solution was, by far, the fastest one.
Unicode characters in the ranges \u0000-\uD7FF and \uE000-\uFFFF will have 3 byte (or less) encodings in UTF8. The \uD800-\uDFFF range is for multibyte UTF16. I do not know python, but you should be able to set up a regular expression to match outside those ranges.
pattern = re.compile("[\uD800-\uDFFF].", re.UNICODE)
pattern = re.compile("[^\u0000-\uFFFF]", re.UNICODE)
Edit adding Python from Denilson Sá's script in the question body:
re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)
filtered_string = re_pattern.sub(u'\uFFFD', unicode_string)
You may skip the decoding and encoding steps and directly detect the value of the first byte (8-bit string) of each character. According to UTF-8:
#1-byte characters have the following format: 0xxxxxxx
#2-byte characters have the following format: 110xxxxx 10xxxxxx
#3-byte characters have the following format: 1110xxxx 10xxxxxx 10xxxxxx
#4-byte characters have the following format: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
According to that, you only need to check the value of only the first byte of each character to filter out 4-byte characters:
def filter_4byte_chars(s):
i = 0
j = len(s)
# you need to convert
# the immutable string
# to a mutable list first
s = list(s)
while i < j:
# get the value of this byte
k = ord(s[i])
# this is a 1-byte character, skip to the next byte
if k <= 127:
i += 1
# this is a 2-byte character, skip ahead by 2 bytes
elif k < 224:
i += 2
# this is a 3-byte character, skip ahead by 3 bytes
elif k < 240:
i += 3
# this is a 4-byte character, remove it and update
# the length of the string we need to check
else:
s[i:i+4] = []
j -= 4
return ''.join(s)
Skipping the decoding and encoding parts will save you some time and for smaller strings that mostly have 1-byte characters this could even be faster than the regular expression filtering.
And just for the fun of it, an itertools monstrosity :)
import itertools as it, operator as op
def max3bytes(unicode_string):
# sequence of pairs of (char_in_string, u'\N{REPLACEMENT CHARACTER}')
pairs= it.izip(unicode_string, it.repeat(u'\ufffd'))
# is the argument less than or equal to 65535?
selector= ft.partial(op.le, 65535)
# using the character ordinals, return 0 or 1 based on `selector`
indexer= it.imap(selector, it.imap(ord, unicode_string))
# now pick the correct item for all pairs
return u''.join(it.imap(tuple.__getitem__, pairs, indexer))
Encode as UTF-16, then reencode as UTF-8.
>>> t = u'𝐟𝐨𝐨'
>>> e = t.encode('utf-16le')
>>> ''.join(unichr(x).encode('utf-8') for x in struct.unpack('<' + 'H' * (len(e) // 2), e))
'\xed\xa0\xb5\xed\xb0\x9f\xed\xa0\xb5\xed\xb0\xa8\xed\xa0\xb5\xed\xb0\xa8'
Note that you can't encode after joining, since the surrogate pairs may be decoded before reencoding.
EDIT:
MySQL (at least 5.1.47) has no problem dealing with surrogate pairs:
mysql> create table utf8test (t character(128)) collate utf8_general_ci;
Query OK, 0 rows affected (0.12 sec)
...
>>> cxn = MySQLdb.connect(..., charset='utf8')
>>> csr = cxn.cursor()
>>> t = u'𝐟𝐨𝐨'
>>> e = t.encode('utf-16le')
>>> v = ''.join(unichr(x).encode('utf-8') for x in struct.unpack('<' + 'H' * (len(e) // 2), e))
>>> v
'\xed\xa0\xb5\xed\xb0\x9f\xed\xa0\xb5\xed\xb0\xa8\xed\xa0\xb5\xed\xb0\xa8'
>>> csr.execute('insert into utf8test (t) values (%s)', (v,))
1L
>>> csr.execute('select * from utf8test')
1L
>>> r = csr.fetchone()
>>> r
(u'\ud835\udc1f\ud835\udc28\ud835\udc28',)
>>> print r[0]
𝐟𝐨𝐨
According to the MySQL 5.1 documentation: "The ucs2 and utf8 character sets do not support supplementary characters that lie outside the BMP." This indicates that there might be a problem with surrogate pairs.
Note that the Unicode standard 5.2 chapter 3 actually forbids encoding a surrogate pair as two 3-byte UTF-8 sequences instead of one 4-byte UTF-8 sequence ... see for example page 93 """Because surrogate code points are not Unicode scalar values, any UTF-8 byte sequence that would otherwise map to code points D800..DFFF is ill-formed.""" However this proscription is as far as I know largely unknown or ignored.
It may well be a good idea to check what MySQL does with surrogate pairs. If they are not to be retained, this code will provide a simple-enough check:
all(uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' for uc in unicode_string)
and this code will replace any "nasties" with u\ufffd:
u''.join(
uc if uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' else u'\ufffd'
for uc in unicode_string
)
I'm guessing it's not the fastest, but quite straightforward (“pythonic” :) :
def max3bytes(unicode_string):
return u''.join(uc if uc <= u'\uffff' else u'\ufffd' for uc in unicode_string)
NB: this code does not take into account the fact that Unicode has surrogate characters in the ranges U+D800-U+DFFF.
This does more than filtering out just 3+ byte UTF-8 unicode characters. It removes unicode but tries to do that in a gentle way and replace it with relevant ASCII characters if possible. It can be a blessing in the future if you don't have for example a dozen of various unicode apostrophes and unicode quotation marks in your text (usually coming from Apple handhelds) but only the regular ASCII apostrophes and quotations.
unicodedata.normalize("NFKD", sentence).encode("ascii", "ignore")
This is robust, I use it with some more guards:
import unicodedata
def neutralize_unicode(value):
"""
Taking care of special characters as gently as possible
Args:
value (string): input string, can contain unicode characters
Returns:
:obj:`string` where the unicode characters are replaced with standard
ASCII counterparts (for example en-dash and em-dash with regular dash,
apostrophe and quotation variations with the standard ones) or taken
out if there's no substitute.
"""
if not value or not isinstance(value, basestring):
return value
if isinstance(value, str):
return value
return unicodedata.normalize("NFKD", value).encode("ascii", "ignore")
This is Python 2 BTW.