I have encountered a case where I need to convert a string of escape sequences into the characters they represent in Python.
s = "\\x80\\x78\\x07\\x00\\x75\\xb3"
print s #gives: \x80\x78\x07\x00\x75\xb3
What I want is, given the string s, to get the real characters stored in s, which in this case are "\x80, \x78, \x07, \x00, \x75, and \xb3" (something like this).
You can use string-escape encoding (Python 2.x):
>>> s = "\\x80\\x78\\x07\\x00\\x75\\xb3"
>>> s.decode('string-escape')
'\x80x\x07\x00u\xb3'
Or use unicode-escape encoding (in Python 3.x, you need to convert the string to bytes first):
>>> s.encode().decode('unicode-escape')
'\x80x\x07\x00u³'
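If you're on Python 3 and actually need a bytes object rather than a str, one common sketch (not part of the answer above) is to round-trip through latin-1, which maps code points 0-255 to bytes one-to-one:

```python
s = "\\x80\\x78\\x07\\x00\\x75\\xb3"
# latin-1 keeps a 1:1 byte/character mapping, so the escape values survive intact
raw = s.encode('latin-1').decode('unicode_escape').encode('latin-1')
print(raw)  # b'\x80x\x07\x00u\xb3'
```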
You can simply write a function that takes the string and returns the converted form, something like this:
def str_to_chr(s):
    res = ""
    for part in s.split("\\")[1:]:  # "\\x33\\x45" -> ["x33", "x45"]
        res += chr(int('0' + part, 16))  # parse the hex value, then take chr()
    return res
Remember to print the function's return value. To find out what each line does, run that line; if you still have questions, comment and I'll answer.
Or you can build a string from the byte values, but those might not all be printable characters, depending on your encoding. Example:
# -*- coding: utf-8 -*-
s = "\\x80\\x78\\x07\\x00\\x75\\xb3"
r = ''
for byte in s.split('\\x'):
    if byte:  # skip the empty string at the front
        r += chr(int(byte, 16))  # convert from hex string to int first
print(r)  # given the example, not all bytes are printable characters in utf-8
HTH, Edwin
Related
I want to convert a hex string to utf-8
a = '0xb3d9'
to
동 (http://www.unicodemap.org/details/0xB3D9/index.html)
First, obtain the integer value from the string a, noting that it is expressed in hexadecimal:
a_int = int(a, 16)
Next, convert this int to a character. In Python 2 you need to use the unichr function for this, because chr can only deal with ASCII characters:
a_chr = unichr(a_int)
Whereas in Python 3 you can just use chr for any character:
a_chr = chr(a_int)
So, in python 3, the full command is:
a_chr = chr(int(a, 16))
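Since the goal is UTF-8, here is a short sketch of the full round trip in Python 3 (the byte values shown are simply what U+B3D9 encodes to):

```python
a = '0xb3d9'
a_chr = chr(int(a, 16))                 # the character 동
utf8 = a_chr.encode('utf-8')            # its UTF-8 bytes: b'\xeb\x8f\x99'
back = hex(ord(utf8.decode('utf-8')))   # '0xb3d9' again
```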
I need to build a python encoder so that I can reformat strings like this:
import codecs
codecs.encode("Random 🐍 UTF-8 String ☑⚠⚡", 'name_of_my_encoder')
The reason this is even something I'm asking Stack Overflow is that the encoded strings need to pass this validation function. This is a hard constraint; there is no flexibility on this, as it's due to how the strings have to be stored.
from string import ascii_letters
from string import digits
valid_characters = set(ascii_letters + digits + '_')
def validation_function(characters):
    for char in characters:
        if char not in valid_characters:
            raise Exception
Making an encoder seemed easy enough, but I'm not sure if this encoder is making it harder to build a decoder. Here's the encoder I've written.
from codecs import encode
from string import ascii_letters
from string import digits
ALPHANUMERIC_SET = set(ascii_letters + digits)
def underscore_encode(chars_in):
    chars_out = list()
    for char in chars_in:
        if char not in ALPHANUMERIC_SET:
            chars_out.append('_{}_'.format(encode(char.encode(), 'hex').decode('ascii')))
        else:
            chars_out.append(char)
    return ''.join(chars_out)
This is the encoder I've written. I've only included it for example purposes; there's probably a better way to do this.
Edit 1 - Someone has wisely pointed out just using base32 on the entire string, which I can definitely use. However, it would be preferable to have something that is 'somewhat readable', so an escaping system like https://en.wikipedia.org/wiki/Quoted-printable or https://en.wikipedia.org/wiki/Percent-encoding would be preferred.
Edit 2 - Proposed solutions must work on Python 3.4 or newer, working in Python 2.7 as well is nice, but not required. I've added the python-3.x tag to help clarify this a little.
This seems to do the trick. Basically, alphanumeric letters are left alone. Any non-alphanumeric character in the ASCII set is encoded as a \xXX escape code. All other Unicode characters are encoded using the \uXXXX escape code. However, you've said you can't use \, but you can use _, so all escape sequences are translated to start with a _. This makes decoding extremely simple: just replace the _ with \ and then use the unicode-escape codec. Encoding is slightly more difficult, as the unicode-escape codec leaves ASCII characters alone. So first you have to escape the relevant ASCII characters, then run the string through the unicode-escape codec, before finally translating all \ to _.
Code:
from string import ascii_letters, digits
# non-translating characters
ALPHANUMERIC_SET = set(ascii_letters + digits)
# mapping all bytes to themselves, except '_' maps to '\'
ESCAPE_CHAR_DECODE_TABLE = bytes(bytearray(range(256)).replace(b"_", b"\\"))
# reverse mapping -- maps `\` back to `_`
ESCAPE_CHAR_ENCODE_TABLE = bytes(bytearray(range(256)).replace(b"\\", b"_"))
# encoding table for ASCII characters not in ALPHANUMERIC_SET
ASCII_ENCODE_TABLE = {i: u"_x{:02x}".format(i) for i in set(range(128)) ^ set(map(ord, ALPHANUMERIC_SET))}
def encode(s):
    s = s.translate(ASCII_ENCODE_TABLE)  # translate ascii chars not in your set
    bytes_ = s.encode("unicode-escape")
    bytes_ = bytes_.translate(ESCAPE_CHAR_ENCODE_TABLE)
    return bytes_

def decode(s):
    s = s.translate(ESCAPE_CHAR_DECODE_TABLE)
    return s.decode("unicode-escape")
s = u"Random UTF-8 String ☑⚠⚡"
#s = '北亰'
print(s)
b = encode(s)
print(b)
new_s = decode(b)
print(new_s)
Which outputs:
Random UTF-8 String ☑⚠⚡
b'Random_x20UTF_x2d8_x20String_x20_u2611_u26a0_u26a1'
Random UTF-8 String ☑⚠⚡
This works on both Python 3.4 and Python 2.7, which is why the ESCAPE_CHAR_{DE,EN}CODE_TABLE construction is a bit messy: bytes on Python 2.7 is an alias for str, which works differently from bytes on Python 3.4. This is why the table is constructed using a bytearray. For Python 2.7, the encode method expects a unicode object, not str.
Use base32! Its alphabet is just the 26 letters of the alphabet and the digits 2-7. You can't use base64 because its output can contain +, / and the = character, which won't pass your validator. (Note that base32 also pads with = when the input length isn't a multiple of 5 bytes, so you may need to strip the padding.)
>>> import base64
>>>
>>> print base64.b32encode('Random 🐍 UTF-8 String ☑⚠⚡"')
KJQW4ZDPNUQPBH4QRUQFKVCGFU4CAU3UOJUW4ZZA4KMJDYU2UDRJVIJC
>>>
>>> print base64.b32decode('KJQW4ZDPNUQPBH4QRUQFKVCGFU4CAU3UOJUW4ZZA4KMJDYU2UDRJVIJC')
Random 🐍 UTF-8 String ☑⚠⚡"
>>>
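A sketch of the same idea on Python 3 (which the question requires), where b32encode needs bytes. Stripping the = padding keeps the validator happy, and it can be restored before decoding since base32 works in 8-character blocks:

```python
import base64

s = 'Random 🐍 UTF-8 String ☑⚠⚡'
# after stripping '=', only A-Z and 2-7 remain, so this passes the validator
encoded = base64.b32encode(s.encode('utf-8')).decode('ascii').rstrip('=')
# restore padding to a multiple of 8 characters before decoding
padded = encoded + '=' * (-len(encoded) % 8)
decoded = base64.b32decode(padded).decode('utf-8')
assert decoded == s
```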
Despite several good answers, I ended up with a solution that seems cleaner and more understandable, so I'm posting the code of my eventual solution as an answer to my own question.
from string import ascii_letters
from string import digits
from base64 import b16decode
from base64 import b16encode
ALPHANUMERIC_SET = set(ascii_letters + digits)
def utf8_string_to_hex_string(s):
    return ''.join(chr(i) for i in b16encode(s.encode('utf-8')))

def hex_string_to_utf8_string(s):
    return b16decode(bytes(ord(i) for i in s)).decode('utf-8')

def underscore_encode(chars_in):
    chars_out = list()
    for char in chars_in:
        if char not in ALPHANUMERIC_SET:
            chars_out.append('_{}_'.format(utf8_string_to_hex_string(char)))
        else:
            chars_out.append(char)
    return ''.join(chars_out)
def underscore_decode(chars_in):
    chars_out = list()
    decoding = False
    for char in chars_in:
        if char == '_':
            if not decoding:
                hex_chars = list()
                decoding = True
            elif decoding:
                decoding = False
                chars_out.append(hex_string_to_utf8_string(hex_chars))
        else:
            if not decoding:
                chars_out.append(char)
            elif decoding:
                hex_chars.append(char)
    return ''.join(chars_out)
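As a side note, since every escaped group has the form _HEX_, the state machine in underscore_decode could also be written as a single regex substitution. This is a hypothetical variant, not part of the solution above:

```python
import re
from base64 import b16decode

def underscore_decode_re(chars_in):
    # each _..._ group holds the base16-encoded UTF-8 bytes of one character
    return re.sub(r'_([0-9A-F]+)_',
                  lambda m: b16decode(m.group(1)).decode('utf-8'),
                  chars_in)
```

For example, underscore_decode_re('a_20_b') returns 'a b'.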
You could abuse URL quoting to get a format that is both readable and easy to decode in other languages, and that passes your validation function:
#!/usr/bin/env python3
import urllib.parse
def alnum_encode(text):
    return urllib.parse.quote(text, safe='')\
        .replace('-', '%2d').replace('.', '%2e').replace('_', '%5f')\
        .replace('%', '_')

def alnum_decode(underscore_encoded):
    return urllib.parse.unquote(underscore_encoded.replace('_', '%'), errors='strict')
s = alnum_encode("Random 🐍 UTF-8 String ☑⚠⚡")
print(s)
print(alnum_decode(s))
Output
Random_20_F0_9F_90_8D_20UTF_2d8_20String_20_E2_98_91_E2_9A_A0_E2_9A_A1
Random 🐍 UTF-8 String ☑⚠⚡
Here's an implementation using a bytearray() (so it can be moved to C later if necessary):
#!/usr/bin/env python3.5
from string import ascii_letters, digits
def alnum_encode(text, alnum=bytearray(ascii_letters + digits, 'ascii')):
    result = bytearray()
    for byte in bytearray(text, 'utf-8'):
        if byte in alnum:
            result.append(byte)
        else:
            result += b'_%02x' % byte
    return result.decode('ascii')
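A matching decoder is not shown above, but one could be sketched like this (assuming the _hh escapes produced by the bytearray encoder; consecutive escapes are collected in one match so multi-byte UTF-8 sequences decode together). The name alnum_decode_re is hypothetical:

```python
import re

def alnum_decode_re(text):
    # each run of _hh escapes is one or more UTF-8 bytes
    return re.sub(r'(?:_[0-9a-f]{2})+',
                  lambda m: bytes.fromhex(m.group(0).replace('_', '')).decode('utf-8'),
                  text)
```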
If you want a transliteration of Unicode to ASCII (e.g. ç --> c), then check out the Unidecode package. Here are their examples:
>>> from unidecode import unidecode
>>> unidecode(u'ko\u017eu\u0161\u010dek')
'kozuscek'
>>> unidecode(u'30 \U0001d5c4\U0001d5c6/\U0001d5c1')
'30 km/h'
>>> unidecode(u"\u5317\u4EB0")
'Bei Jing '
Here's my example:
# -*- coding: utf-8 -*-
from unidecode import unidecode
print unidecode(u'快樂星期天')
Gives as an output*
Kuai Le Xing Qi Tian
*may be nonsense, but at least it's ASCII
To remove punctuation, see this answer.
Although Python 3.x solved the problem of uppercasing/lowercasing for some locales (for example tr_TR.utf8), the Python 2.x branch lacks this. There are several workarounds for this issue, like https://github.com/emre/unicode_tr/, but I did not like this kind of solution.
So I am implementing new upper/lower/capitalize/title methods to monkey-patch the unicode class, using the string.maketrans method.
The problem with maketrans is that the two strings passed to it must have the same length.
The nearest solution that came to my mind is "How can I convert a 1-byte char to 2 bytes?"
Note: the translate method only works with ASCII encoding; when I pass u'İ' (\u0130) as an argument, translate gives an ASCII encoding error.
from string import maketrans
import unicodedata
c1 = unicodedata.normalize('NFKD', u'i').encode('utf-8')
c2 = unicodedata.normalize('NFKD', u'İ').encode('utf-8')
c1, len(c1)
# ('\xc4\xb1', 2)
# c2, len(c2)
# ('I', 1)
'istanbul'.translate(maketrans(c1, c2))
# ValueError: maketrans arguments must have same length
Unicode objects allow multicharacter translation via a dictionary instead of two byte strings mapped through maketrans.
#!python2
#coding:utf8
D = {ord(u'i'):u'İ'}
print u'istanbul'.translate(D)
Output:
İstanbul
If you start with an ASCII byte string and want the result in UTF-8, simply decode/encode around the translation:
#!python2
#coding:utf8
D = {ord(u'i'):u'İ'}
s = 'istanbul'.decode('ascii')
t = s.translate(D)
s = t.encode('utf8')
print repr(s)
Output:
'\xc4\xb0stanbul'
The following technique can do the job of maketrans. Note that the dictionary keys must be Unicode ordinals, but the value can be Unicode ordinals, Unicode strings or None. If None, the character is deleted when translated.
#!python2
#coding:utf8
def maketrans(a, b):
    return dict(zip(map(ord, a), b))
D = maketrans(u'àáâãäå',u'ÀÁÂÃÄÅ')
print u'àbácâdãeäfåg'.translate(D)
Output:
ÀbÁcÂdÃeÄfÅg
Reference: str.translate
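For completeness, on Python 3 the built-in str.maketrans already accepts two equal-length strings (or a dict of code points), so no monkey-patching is needed; a minimal sketch:

```python
# Python 3: str.maketrans builds the translation dict for you
table = str.maketrans('i', '\u0130')  # map 'i' to Turkish dotted capital İ
print('istanbul'.translate(table))    # İstanbul
```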
I have a long string and I want to represent it as a long number.
I tried:
l = [ord(i) for i in str1]
but this is not what I need.
I need to make it one long number, not numbers as items in a list.
This line gives me [23, 21, 45, 34, 242, 32],
and I want to make it one long number that I can change back into the same string.
Any idea?
Here is a translation of Paulo Bu's answer (with base64 encoding) into Python 3:
>>> import base64
>>> s = 'abcde'
>>> e = base64.b64encode(s.encode('utf-8'))
>>> print(e)
b'YWJjZGU='
>>> base64.b64decode(e).decode('utf-8')
'abcde'
Basically the difference is that your workflow has gone from:
string -> base64
base64 -> string
To:
string -> bytes
bytes -> base64
base64 -> bytes
bytes -> string
Is this what you are looking for :
>>> str = 'abcdef'
>>> ''.join([chr(y) for y in [ord(x) for x in str]])
'abcdef'
>>>
#! /usr/bin/python2
# coding: utf-8
def encode(s):
    result = 0
    for ch in s.encode('utf-8'):
        result *= 256
        result += ord(ch)
    return result

def decode(i):
    result = []
    while i:
        result.append(chr(i % 256))
        i //= 256
    result = reversed(result)
    result = ''.join(result)
    result = result.decode('utf-8')
    return result

orig = u'Once in Persia reigned a king …'
cipher = encode(orig)
clear = decode(cipher)
print '%s -> %s -> %s' % (orig, cipher, clear)
This is code that I found that works.
str = 'sdfsdfsdfdsfsdfcxvvdfvxcvsdcsdcs sdcsdcasd'
I = int.from_bytes(bytes([ord(i) for i in str]), byteorder='big')
print(I)
print(I.to_bytes(len(str), byteorder='big'))
What about using base 64 encoding? Are you fine with it? Here's an example:
>>> import base64
>>> s = 'abcde'
>>> e = base64.b64encode(s)
>>> print e
YWJjZGU=
>>> base64.b64decode(e)
'abcde'
The encoding is not pure numbers but you can go back and forth from string without much trouble.
You can also try encoding the string to hexadecimal. This will yield numbers, although I'm not sure you can always get back from the encoded string to the original string:
>>> s = 'abc'
>>> n = s.encode('hex')
>>> print n
'616263'
>>> n.decode('hex')
'abc'
If you need it to be actual integers then you can extend the trick:
>>> s = 'abc'
>>> n = int(s.encode('hex'), 16)  # convert it to integer
>>> print n
6382179
>>> hex(n)[2:].decode('hex')  # return from integer to string
'abc'
Note: I'm not sure this work out of the box in Python 3
UPDATE: To make it work with Python 3 I suggest using the binascii module this way:
>>> import binascii
>>> s = 'abcd'
>>> n = int(binascii.hexlify(s.encode()), 16)  # encode is needed to convert unicode to bytes
>>> print(n)
1633837924
>>> binascii.unhexlify(hex(n)[2:].encode()).decode()
'abcd'
The encode and decode methods are needed to convert between bytes and strings. If you plan to include special (non-ASCII) characters, you'll probably need to specify encodings.
Hope this helps!
For analysis I have to unescape URL-encoded binary strings (most likely non-printable characters). The strings sadly come in the extended URL-encoding form, e.g. "%u616f". I want to store them in a file that then contains the raw binary values, e.g. 0x61 0x6f here.
How do I get this into binary data in Python? (urllib.unquote only handles the "%HH" form.)
The strings sadly come in the extended URL-encoding form, e.g. "%u616f"
Incidentally that's not anything to do with URL-encoding. It's an arbitrary made-up format produced by the JavaScript escape() function and pretty much nothing else. If you can, the best thing to do would be to change the JavaScript to use the encodeURIComponent function instead. This will give you a proper, standard URL-encoded UTF-8 string.
e.g. "%u616f". I want to store them in a file that then contains the raw binary values, eg. 0x61 0x6f here.
Are you sure 0x61 0x6f (the letters "ao") is the byte stream you want to store? That would imply UTF-16BE encoding; are you treating all your strings that way?
Normally you'd want to turn the input into Unicode then write it out using an appropriate encoding, such as UTF-8 or UTF-16LE. Here's a quick way of doing it, relying on the hack of making Python read '%u1234' as the string-escaped format u'\u1234':
>>> ex= 'hello %e9 %u616f'
>>> ex.replace('%u', r'\u').replace('%', r'\x').decode('unicode-escape')
u'hello \xe9 \u616f'
>>> print _
hello é 慯
>>> _.encode('utf-8')
'hello \xc3\xa9 \xe6\x85\xaf'
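On Python 3 the same trick needs a bytes round trip, since the unicode_escape codec decodes from bytes; a sketch under the same caveats as above:

```python
ex = 'hello %e9 %u616f'
# rewrite %uXXXX -> \uXXXX and %XX -> \xXX, then let the codec do the work
decoded = (ex.replace('%u', '\\u').replace('%', '\\x')
             .encode('ascii').decode('unicode_escape'))
print(decoded)                  # hello é 慯
print(decoded.encode('utf-8'))  # b'hello \xc3\xa9 \xe6\x85\xaf'
```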
I guess you will have to write the decoder function by yourself. Here is an implementation to get you started:
def decode(file):
    while True:
        c = file.read(1)
        if c == "":
            # End of file
            break
        if c != "%":
            # Not an escape sequence
            yield c
            continue
        c = file.read(1)
        if c != "u":
            # One hex-byte
            yield chr(int(c + file.read(1), 16))
            continue
        # Two hex-bytes
        yield chr(int(file.read(2), 16))
        yield chr(int(file.read(2), 16))
Usage:
input = open("/path/to/input-file", "r")
output = open("/path/to/output-file", "wb")
output.writelines(decode(input))
output.close()
input.close()
Here is a regex-based approach:
import re

# the replace function concatenates the two matches after
# converting them from hex to ascii
repfunc = lambda m: chr(int(m.group(1), 16)) + chr(int(m.group(2), 16))
# the last parameter is the text you want to convert
result = re.sub('%u(..)(..)', repfunc, '%u616f')
print result
gives
ao
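The same idea extends to inputs that mix the plain %HH form with %uHHHH, keeping the answer's two-raw-bytes convention for %uXXXX. This combined version (the name unescape_js is hypothetical) does both in one pass so a produced byte that happens to be '%' can't be re-matched:

```python
import re

def unescape_js(text):
    # %uXXXX becomes two raw byte values (big-endian), %XX becomes one
    return re.sub(r'%u(..)(..)|%(..)',
                  lambda m: (chr(int(m.group(1), 16)) + chr(int(m.group(2), 16))
                             if m.group(1) else chr(int(m.group(3), 16))),
                  text)
```

For example, unescape_js('hello %e9 %u616f') handles both escape forms in a single call.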