It's possible to print an emoji from its hex code with the u'\uXXXX' escape pattern in Python, e.g.
>>> print(u'\u231B')
⌛
However, if I have the hex code as a string, e.g. '231B', simply concatenating it to the escape prefix won't work:
>>> print(u'\u' + '231B')
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
chr() fails too:
>>> chr('231B')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: an integer is required (got type str)
The first part of my question: given a hex code, e.g. 231A, how do I get the str type of the emoji?
My goal is to get the list of emojis from https://unicode.org/Public/emoji/13.0/emoji-sequences.txt by reading the hex codes in the first column.
There are also cases where the first column is a range, e.g. 231A..231B. So the second part of my question: given a hex code range, how do I iterate through it to get the emoji str? For a range like 2648..2653 it is possible to do range(2648, 2653+1), but if there's a letter in the hex, e.g. 1F232..1F236, using range() directly is not possible.
Thanks @Amadan for the solutions!!
TL;DR
To get a list of emojis from https://unicode.org/Public/emoji/13.0/emoji-sequences.txt into a file.
import requests

response = requests.get('https://unicode.org/Public/emoji/13.0/emoji-sequences.txt')
with open('emoji.txt', 'w') as fout:
    for line in response.content.decode('utf8').split('\n'):
        # Skip blank lines and comments.
        if line.strip() and not line.startswith('#'):
            hexa = line.split(';')[0]
            hexa = hexa.split('..')
            if len(hexa) == 1:
                # A single code point, or a space-separated sequence of code points.
                ch = ''.join([chr(int(h, 16)) for h in hexa[0].strip().split(' ')])
                print(ch, end='\n', file=fout)
            else:
                # A range of code points, e.g. "2648..2653".
                start, end = hexa
                for ch in range(int(start, 16), int(end, 16) + 1):
                    print(chr(ch), end='\n', file=fout)
Convert hex string to number, then use chr:
chr(int('231B', 16))
# => '⌛'
or directly use a hex literal:
chr(0x231B)
To use a range, again, you need an int, either converted from a string or using a hex literal:
''.join(chr(c) for c in range(0x2648, 0x2654))
# => '♈♉♊♋♌♍♎♏♐♑♒♓'
or
''.join(chr(c) for c in range(int('2648', 16), int('2654', 16)))
(NOTE: you'd get something very different from range(2648, 2654)!)
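The same int(..., 16) conversion also covers the second part of the question: hex strings containing letters, like 1F232..1F236, parse just fine. A short sketch, assuming the range string comes straight from the first column of the data file:
start, end = '1F232..1F236'.split('..')
print(''.join(chr(c) for c in range(int(start, 16), int(end, 16) + 1)))
# => '🈲🈳🈴🈵🈶'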
Related
Is it possible to concatenate bytes to str?
>>> b = b'this is bytes'
>>> s = 'this is string'
>>> b + s
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can't concat str to bytes
Based on the simple example above, it is not possible.
The reason I'm asking is that I've seen code where bytes appear to be concatenated to a str.
Here is a snippet of that code:
buf = ""
buf += "\xdb\xd1\xd9\x74\x24\xf4\x5a\x2b\xc9\xbd\x0e\x55\xbd"
buffer = "TRUN /.:/" + "A" * 2003 + "\xcd\x73\xa3\x77" + "\x90" * 16 + buf + "C" * (5060 - 2003 - 4 - 16 - len(buf))
You can see the full code here.
http://sh3llc0d3r.com/vulnserver-trun-command-buffer-overflow-exploit/
Either encode the string to bytes to get a result in bytes:
print(b'byte' + 'string'.encode())
# b'bytestring'
Or decode the bytes into a string to get a result as str:
print(b'byte'.decode() + 'string')
# bytestring
The second code snippet shows strings being concatenated. You will need to convert the bytes to a string (as shown in the question Convert bytes to a string). Try this: b.decode("utf-8") + s. It should give you the output you need.
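For reference, the snippet above is Python 2, where both "\xdb..." and "A" * 2003 are plain str, so the concatenation just works. A sketch of what it would have to look like in Python 3, where the shellcode must be bytes throughout (same variable names as the snippet above; not a drop-in replacement for the original exploit):
# Python 3 version: every piece is a bytes literal; any str would need .encode()
buf = b""
buf += b"\xdb\xd1\xd9\x74\x24\xf4\x5a\x2b\xc9\xbd\x0e\x55\xbd"
buffer = (b"TRUN /.:/" + b"A" * 2003 + b"\xcd\x73\xa3\x77"
          + b"\x90" * 16 + buf + b"C" * (5060 - 2003 - 4 - 16 - len(buf)))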
Here is what I am trying:
import struct

#binary_data = open("your_binary_file.bin","rb").read()
#your binary data would show up as a big string like this one when you .read()
binary_data = '\x44\x69\x62\x65\x6e\x7a\x6f\x79\x6c\x70\x65\x72\x6f\x78\x69\x64\x20\x31\x32\x30\x20\x43\x20\x30\x33\x2e\x30\x35\x2e\x31\x39\x39\x34\x20\x31\x34\x3a\x32\x34\x3a\x33\x30'

def search(text):
    #convert the text to binary first
    s = ""
    for c in text:
        s += struct.pack("b", ord(c))
    results = binary_data.find(s)
    if results == -1:
        print("no results found")
    else:
        print("the string [%s] is found at position %s in the binary data" % (text, results))

search("Dibenzoylperoxid")
search("03.05.1994")
And this is the error I am getting:
Traceback (most recent call last):
File "dec_new.py", line 22, in <module>
search("Dibenzoylperoxid")
File "dec_new.py", line 14, in search
s+=struct.pack("b", ord(c))
TypeError: Can't convert 'bytes' object to str implicitly
Kindly let me know what I can do to make it function properly.
I am using Python 3.5.0.
s = ""
for c in text:
s+=struct.pack("b", ord(c))
This won't work because s is a str, struct.pack returns a bytes object, and you can't add a str and a bytes object.
One possible solution is to make s a bytes.
s = b""
... But it seems like a lot of work to convert a string to a bytes this way. Why not just use encode()?
def search(text):
    #convert the text to binary first
    s = text.encode()
    results = binary_data.find(s)
    #etc
Also, "your binary data would show up as a big string like this one when you .read()" is not, strictly speaking, true. The binary data won't show up as a big string, because it is a bytes, not a string. If you want to create a bytes literal that resembles what might be returned by open("your_binary_file.bin","rb").read(), use the bytes literal syntax binary_data = b'\x44\x69<...etc...>\x33\x30'
I have a piece of code that works well in Python 3:
def encode_test(filepath, char_to_int):
    with open(filepath, "r", encoding="latin-1") as f:
        dat = [line.rstrip() for line in f]
    string_to_int = [[char_to_int[char] if char != 'ó' else char_to_int['ò'] for char in line] for line in dat]
However, when I try to do this in Python2.7, I first got the error
SyntaxError: Non-ASCII character '\xc3' in file languageIdentification.py on line 30, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
Then I realized that I may need to add # coding=utf-8 at the top of the code. However, after doing this, I encountered another error:
UnicodeWarning: Unicode unequal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
string_to_int = [[char_to_int[char] if char != 'ó' else char_to_int['ò'] for char in line] for line in dat]
Traceback (most recent call last):
File "languageIdentification.py", line 190, in <module>
test_string = encode_test(sys.argv[3], char_to_int)
File "languageIdentification.py", line 32, in encode_test
string_to_int = [[char_to_int[char] if char != 'ó' else char_to_int['ò'] for char in line] for line in dat]
KeyError: u'\xf3'
So could anyone tell me what I can do to solve this problem in Python 2.7?
Thank you!
The problem is that you are trying to compare a unicode string with a byte string:
char != 'ó'
where char is a unicode and 'ó' is a byte string (or just str).
When Python 2 encounters such a comparison, it tries to decode the byte string to unicode.
The conversion uses the default encoding, which is ASCII in Python 2.
Since the byte value of 'ó' is higher than 127, the decode fails, which leads to the error (UnicodeWarning).
By the way, for literals whose byte values are in the ASCII range, the comparison will succeed.
Examples:
print u'ó' == 'ó' # UnicodeWarning: ...
print u'z' == 'z' # True
So, in the comparison you need to convert your byte string to unicode manually.
For example, you can do that with the built-in unicode() function:
u = unicode('ó', 'utf-8') # note that you can specify the encoding
Or just with a u-literal:
u = u'ó'
But be aware: with this option the conversion will be performed using the encoding you specified at the top of the source file.
So your actual source encoding and the encoding declared at the top should match.
As I see from the SyntaxError message, in your source 'ó' starts with the byte '\xc3'. Therefore it should be '\xc3\xb3', which is UTF-8:
print '\xc3\xb3'.decode('utf-8') # ó
So, # coding: utf-8 + char != u'ó' should solve your problem.
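A minimal Python 2 sketch of the fixed comparison (assuming the source file really is saved as UTF-8):
# -*- coding: utf-8 -*-
# With the coding declaration above, u'ó' is a proper unicode literal,
# so both sides of the comparison are unicode and no implicit ASCII
# decoding happens.
char = u'\xf3'        # u'ó', as produced by decoding the file with latin-1
print char != u'ó'    # False: they compare equal, no UnicodeWarning
print char != u'ò'    # True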
Update:
As I see from the traceback, there is a second problem: a KeyError.
This error occurs in the statement:
char_to_int[char]
because u'\xf3' (which actually is u'ó') is not a valid key.
This unicode comes from decoding your file (with latin-1).
And I suppose that there are no unicode keys in your dict char_to_int at all.
So, try encoding such a key back to its byte value with:
char_to_int[char.encode('latin-1')]
Summarizing, try changing the last line of the provided code to:
string_to_int = [[char_to_int[char.encode('latin-1')] if char != u'ó' else char_to_int['ò'] for char in line] for line in dat]
If you want to convert a character to its integer value, you may use the ord function; it works for Unicode too.
line = u'some Unicode line with ò and ó'
string_to_int = [ord(char) if char != u'ó' else ord(u'ò') for char in line]
If I have a byte, e.g. 11001010 or 01001010, how can I convert it back to Unicode if it is a valid code point?
I can take the input and do a regex check on it, but that would be a crude way of doing it, and it would be limited to UTF-8. If I want to extend this in the future, how can I improve the solution?
The input is a string of 0s and 1s:
11001010 This is invalid
or 01001010 This is valid
or 11010010 11001110 This is invalid
If there is no other text, split the strings on whitespace, convert each to an integer and feed the result to a bytearray() object to decode:
as_binary = bytearray(int(b, 2) for b in inputtext.split())
as_unicode = as_binary.decode('utf8')
By putting the integer values into a bytearray() we avoid having to concatenate individual characters and get a convenient .decode() method as a bonus.
Note that this does expect the input to contain valid UTF-8. You could add an error handler to replace bad bytes rather than raise an exception, e.g. as_binary.decode('utf8', 'replace').
Wrapped up as a function that takes a codec and error handler:
def to_text(inputtext, encoding='utf8', errors='strict'):
as_binary = bytearray(int(b, 2) for b in inputtext.split())
return as_binary.decode(encoding, errors)
Most of your samples are not actually valid UTF-8, so the demo sets errors to 'replace':
>>> to_text('11001010', errors='replace')
u'\ufffd'
>>> to_text('01001010', errors='replace')
u'J'
>>> to_text('11010010 11001110', errors='replace')
u'\ufffd\ufffd'
Leave errors to the default if you want to detect invalid data; just catch the UnicodeDecodeError exception thrown:
>>> to_text('11010010 11001110')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 3, in to_text
File "/Users/mjpieters/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd2 in position 0: invalid continuation byte
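If you would rather test validity than catch the exception at each call site, a small wrapper along the same lines (try_decode is just an illustrative name, not a library function):
def try_decode(inputtext, encoding='utf8'):
    # Return the decoded text, or None when the bits are not valid
    # for the given codec.
    as_binary = bytearray(int(b, 2) for b in inputtext.split())
    try:
        return as_binary.decode(encoding)
    except UnicodeDecodeError:
        return None

>>> try_decode('01001010')
u'J'
>>> try_decode('11001010') is None
True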
I have a file which is mostly UTF-8, but some Windows-1252 characters have also found their way in.
I created a table to map from the Windows-1252 (cp1252) characters to their Unicode counterparts, and would like to use it to fix the mis-encoded characters, e.g.
cp1252_to_unicode = {
    "\x85": u'\u2026', # …
    "\x91": u'\u2018', # ‘
    "\x92": u'\u2019', # ’
    "\x93": u'\u201c', # “
    "\x94": u'\u201d', # ”
    "\x97": u'\u2014'  # —
}

for l in open('file.txt'):
    for c, u in cp1252_to_unicode.items():
        l = l.replace(c, u)
But attempting to do the replace this way results in a UnicodeDecodeError being raised, e.g.:
"\x85".replace("\x85", u'\u2026')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x85 in position 0: ordinal not in range(128)
Any ideas for how to deal with this?
If you try to decode this string as utf-8, as you already know, you will get a UnicodeDecodeError, as these spurious cp1252 characters are invalid utf-8.
However, Python codecs allow you to register a callback to handle encoding/decoding errors with the codecs.register_error function. It gets the UnicodeDecodeError as a parameter, so you can write a handler that attempts to decode the data as "cp1252" and continues the decoding in utf-8 for the rest of the string.
In my utf-8 terminal, I can build a mixed incorrect string like this:
>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> print a
maçã ma��
>>> a.decode("utf-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 9-11: invalid data
I wrote the said callback function here, and found a catch: even if you increment the position from which to decode the string by 1 so that it starts on the next character, if the next character is also not utf-8 and is out of range(128), the error is raised again at the first out-of-range(128) character. That means the decoding "walks back" if consecutive non-ascii, non-utf-8 chars are found.
The workaround for this is to have a state variable in the error handler which detects this "walking back" and resumes decoding from the last call to it. In this short example, I implemented it as a global variable (it will have to be manually reset to -1 before each call to the decoder):
import codecs

last_position = -1

def mixed_decoder(unicode_error):
    global last_position
    string = unicode_error[1]
    position = unicode_error.start
    if position <= last_position:
        position = last_position + 1
    last_position = position
    new_char = string[position].decode("cp1252")
    #new_char = u"_"
    return new_char, position + 1

codecs.register_error("mixed", mixed_decoder)
And on the console:
>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> last_position = -1
>>> print a.decode("utf-8", "mixed")
maçã maçã
With thanks to jsbueno, a whack of other Google searches, and other pounding, I solved it this way.
#The following works very well but it does not allow for any attempts to FIX the data.
xmlText = unicode(xmlText, errors='replace').replace(u"\uFFFD", "?")
This version allows for a limited opportunity to repair invalid characters. Unknown characters are replaced with a safe value.
import codecs

replacement = {
    '85' : '...', # u'\u2026' ... character.
    '96' : '-',   # u'\u2013' en-dash
    '97' : '-',   # u'\u2014' em-dash
    '91' : "'",   # u'\u2018' left single quote
    '92' : "'",   # u'\u2019' right single quote
    '93' : '"',   # u'\u201C' left double quote
    '94' : '"',   # u'\u201D' right double quote
    '95' : "*"    # u'\u2022' bullet
}

#This is more complex but allows for the data to be fixed.
def mixed_decoder(unicodeError):
    errStr = unicodeError[1]
    errLen = unicodeError.end - unicodeError.start
    nextPosition = unicodeError.start + errLen
    errHex = errStr[unicodeError.start:unicodeError.end].encode('hex')
    if errHex in replacement:
        return u'%s' % replacement[errHex], nextPosition
    return u'%s' % errHex, nextPosition  # Comment this line out to get a question mark
    return u'?', nextPosition

codecs.register_error("mixed", mixed_decoder)

xmlText = xmlText.decode("utf-8", "mixed")
Basically I attempt to turn it into utf-8. For any characters that fail, I just convert them to hex so I can display them or look them up in a table of my own.
This is not pretty, but it does allow me to make sense of messed-up data.
Good solution, that of @jsbueno, but there is no need for the global variable last_position; see:
import codecs

def mixed_decoder(error: UnicodeError) -> (str, int):
    bs: bytes = error.object[error.start: error.end]
    return bs.decode("cp1252"), error.start + 1

codecs.register_error("mixed", mixed_decoder)

a = "maçã".encode("utf-8") + "maçã".encode("cp1252")
# a = b"ma\xc3\xa7\xc3\xa3ma\xe7\xe3"

s = a.decode("utf-8", "mixed")
# s = "maçãmaçã"
This is usually called Mojibake.
There's a nice Python library that might solve these issues for you called ftfy.
Example:
>>> from ftfy import fix_text
>>> fix_text("Ð¨ÐµÐ¿Ð¾Ñ (напоминалки)")
'Шепот (напоминалки)'
Just ran into this today, so here is my problem and my own solution:
# Note: the backslashes here are literal (raw string), i.e. the text itself
# contains sequences like "\xe7" rather than the already-decoded characters.
original_string = r'Notifica\xe7\xe3o de Emiss\xe3o de Nota Fiscal Eletr\xf4nica.'

def mixed_decoding(s):
    output = ''
    ii = 0
    while ii <= len(s) - 1:
        if s[ii] == '\\' and s[ii+1] == 'x':
            # Interpret the 4-character escape sequence, e.g. r"\xe7" -> "ç".
            b = s[ii:ii+4].encode('ascii').decode('unicode-escape')
            output = output + b
            ii += 3
        else:
            output = output + s[ii]
        ii += 1
    print(output)
    return output

decoded_string = mixed_decoding(original_string)
Now it prints:
>>> Notificação de Emissão de Nota Fiscal Eletrônica.