I'd like to print emojis from python(3) src
I'm working on a project that analyzes Facebook Message histories, and in the raw .htm data file I downloaded, a lot of emojis are displayed as boxes with question marks, as happens when a value can't be rendered. If I copy and paste these symbols into the terminal as strings, I get values such as \U000fe328. This is also the output I get when I run the .htm files through BeautifulSoup.
I Googled this string (and others), and one of the only sites that consistently comes up is iemoji.com; in the case of the string above, this page lists it as a "Python Src" value. I want to be able to print these strings as their corresponding emojis (after all, they were originally emojis when they were sent). After looking around, I found a mapping of src encodings at this page that maps strings like the one above to emoji names, and then this emoji-name-to-Unicode list that, for the most part, maps the emoji names to Unicode. If I try printing those values, I get good output, like the following:
>>> print(u'\U0001F624')
😤
Is there a way to map these "Python src" encodings to their Unicode values? Chaining both mappings would work, were it not for the fact that the original src mapping is missing around 50% of the Unicode values found in the Unicode list. And if I do end up having to do that, is there a good way to find the "Python src" value of a given emoji? From my testing, an emoji as a string equals its Unicode escape, e.g. '😤' == u'\U0001F624', but I can't seem to get any such relation to \U000fe328.
This has nothing to do with Python. An escape like \U000fe328 just contains the hexadecimal representation of a code point, so this one is U+FE328 (which is a private use character).
These days a lot of emoji are assigned standard code points, e.g. 😤 is U+1F624 FACE WITH LOOK OF TRIUMPH.
Before these were assigned, various programs used various code points in the private use ranges to represent emoji. Facebook apparently used the private use character U+FE328. The mapping from these code points to the standard code points is arbitrary, and some of them may not have a standard equivalent at all.
So what you have to look for is a table that tells you which of these old assignments correspond to which standard code points.
There's php-emoji on GitHub, which appears to contain these mappings. But note that this is PHP code, and the characters are represented as UTF-8 (e.g. the character above would be "\xf3\xbe\x8c\xa8").
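Once you have such a table, the translation itself is only a couple of lines of Python. A minimal sketch follows; the single table entry is illustrative only (the pairing of U+FE328 with U+1F624 is an assumption, not a verified mapping), and a real table would be extracted from php-emoji or a similar resource:
# Hypothetical mapping from Facebook-era private use code points to
# standard emoji code points. The one entry below is illustrative only.
PUA_TO_STANDARD = {
    0xFE328: 0x1F624,  # assumed pairing; verify against a real table
}

def translate_pua(text):
    """Swap known private-use emoji code points for their standard equivalents."""
    return ''.join(chr(PUA_TO_STANDARD.get(ord(ch), ord(ch))) for ch in text)

print(translate_pua('\U000fe328'))  # prints the standard emoji if the mapping is right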
Related
I have a set of strings that need to be decoded. The string format varies with products on the site, so it's pretty unpredictable. A few examples of the format are given below:
1. longDescription":"\u003cul\u003e \u003cli\u003eTender grill’d bites made " (unicode and symbol combination)
2. longDescription":"Goodness You Can Seeâ„¢" (all decoded, to be picked as is)
3. longDescription":"With a wide variety of headphones, \u003cbr /\u003e \u003cb\u003e\u003cbr /\u003eBlackWeb Flat CAT6 Network Cable:\u003c/b\u003e \u003cbr /\u003e \u003cul\u003e \u003cli\u003eFlat CAT6 Network Cable\u003c/li\u003e \u003cli\u003eLength: 14'\u003c/li\u003e \u003cli\u003eUltra-slim design\u003c/li\u003e \u003cli\u003e1GBPS" (all unicode)
Basically, I want to extract this longDescription key (the back-end name; it appears as the bulleted list on the front end) from products like https://www.walmart.com/ip/Friskies-Gravy-Wet-Cat-Food-Warm-d-Serv-d-Grill-d-Bites-With-Shrimp-3-5-oz-Pouch/842464118
I have tried the code below:
if '\\u' in longdescription:
    try:
        #temp['Key_Features'] = longdescription
        temp['Key_Features'] = longdescription.decode("unicode-escape").encode()
    except Exception as e:
        temp['Key_Features'] = HTMLParser.HTMLParser().unescape(longdescription)
else:
    temp['Key_Features'] = longdescription
I have tried all of the above approaches separately, and the version above combines them. They work for most cases, but in cases like the first one, the ’ symbol (and other non-ASCII symbols) gets mangled by the extra encode/decode, and my output becomes:
Tender grillâd bites (see the change in grill'd)
We have a dependency on Python 2 for this code, so I'm requesting a solution in Python 2. I'm also OK with HTML tags appearing in the output; I just need code that works for all three cases. Thanks.
This is fixed by moving to Python 3. I used the code below to convert:
temp['Key_Features'] = longDescription.encode().decode('unicode-escape').encode('latin1').decode('utf8').replace('&amp;', '&').replace('Â ', '').replace('&quot;', '"')
This happened because the data mixed different encodings and couldn't be handled by a single encode/decode step. The logic above works for all three cases.
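For readers trying to follow what that one-liner does, here is a hedged Python 3 sketch of the same idea broken into steps. The cp1252 round trip and the HTML-entity step are assumptions about how the source data was mangled (cp1252 rather than latin1, because characters such as ’ and ™ only round-trip through cp1252):
import html

def clean_long_description(raw):
    # 1. Turn literal \u003c-style escapes into the characters they name.
    if '\\u' in raw:
        raw = raw.encode('ascii', 'backslashreplace').decode('unicode-escape')
    # 2. Undo the "UTF-8 bytes read as cp1252" mojibake (e.g. â€™ -> ’, â„¢ -> ™).
    try:
        raw = raw.encode('cp1252').decode('utf8')
    except UnicodeError:
        pass  # the string was already clean, leave it alone
    # 3. Unescape any HTML entities such as &amp; or &quot; (assumed present).
    return html.unescape(raw)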
I'm trying to build a way to find emojis on Twitter and relate them to the Unicode table that one can find at unicode.org, but I'm finding it hard to identify them because of what I think are encoding problems, or simply my misunderstanding of this topic. In short, what I did was build a "library" of emojis from the table found at http://www.unicode.org/emoji/charts/full-emoji-list.html, which contains the title and the code point of each emoji. I scraped this in R with the rvest library.
The problem comes when I grab the information from Twitter with the twitteR API in R: the codes for the emojis do not look at all like the ones in this table.
Let's take as an example the emoji for 100 (one hundred points, the red icon). This is number 1468 in the table linked above, and its code point is:
U+1F4AF
Now, when I grab it from Twitter, it is first shown like this in the status class that the API provides for working with tweets:
\xed��\xed��
Then, when I convert the tweet to a data frame, also with a built-in function from the twitteR API, for example:
tweet$toDataFrame()
The emoji becomes this:
<ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>
I tried to convert it with the iconv function in R, with the following code:
iconv(tweet$text, from="UTF-8", to="ASCII", "byte")
and I only manage to make it look like this:
<ed><a0><bd><ed><b2><af>
So, wrapping up, at the end of my tests I have the following representations:
<ed><a0><bd><ed><b2><af>
<ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>
\xed��\xed��
None of which look like the code point specified by the table:
U+1F4AF
Is there any way to transform between these two representations?
What am I missing? Why is Twitter returning this information for emojis?
I didn't know anything about encoding before, but after days of reading I think I know what is going on. I don't understand perfectly how encoding works for emoji, but I stumbled upon the same problem and solved it.
You want to map \xed��\xed�� to its name-decoded version: hundred points. A sensible way would be to scrape a dictionary online and use a key, such as the Unicode code point, to replace it. In this case that would be U+1F4AF.
The conversions you show are not different encodings but different notation for the same encoded emoji:
as.data.frame(tweet) returns <ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>.
iconv(tweet, from="UTF-8", to="ASCII", "byte") returns <ed><a0><bd><ed><b2><af>.
So using the Unicode code point directly isn't feasible. Another way would be to use a dictionary that already encodes emoji in the <ed>...<ed>... way, like the one here: emoji list. Voilà! Except that list is incomplete, because it comes from a dictionary that contains fewer emoticons.
The fast solution is to simply scrape a more complete dictionary and map the <ed>...<ed>... form to its corresponding English text translation. I have done that already and posted it here.
Still, the fact that nobody else had posted a list with the proper encoding bugged me. In fact, most dictionaries I found use a UTF-8 representation that starts with <f0>... rather than <ed>...<ed>.... It turns out both are byte sequences for the same Unicode code point U+1F4AF; the bytes are just produced differently.
Long answer: the tweet is stored as UTF-16 and then converted to UTF-8, and this is where the conversions diverge. When each UTF-16 surrogate is converted to UTF-8 on its own, the result is the <ed>...<ed>... form (a scheme sometimes called CESU-8); when the surrogate pair is first decoded to a single code point, the result is the ordinary four-byte UTF-8 <f0>... form.
So a slower (but more deliberate) way to solve your problem is to take the <f0>... dictionary, convert it to UTF-16, and convert it back to UTF-8 pair by pair, and you'll end up with the two <ed>... units. Those two <ed>... units are the surrogate pair (high and low) representation of the code point U+xxxxx.
As an example:
unicode <- 0x1F4Af
# Multibyte Version
intToUtf8(unicode)
# Byte-pair Version
hilo <- unicode2hilo(unicode)
intToUtf8(hilo)
Returns:
[1] "\xf0\u009f\u0092�"
[1] "\xed��\xed��"
Which, again, using iconv(..., 'utf-8', 'latin1', 'byte'), is the same as:
[1] "<f0><9f><92><af>"
[1] "<ed><a0><bd><ed><b2><af>"
PS1:
The function unicode2hilo is a simple linear transformation from a code point to its hi-lo surrogate pair, and hilo2unicode is the inverse:
unicode2hilo <- function(unicode){
    hi = floor((unicode - 0x10000)/0x400) + 0xd800
    lo = (unicode - 0x10000) + 0xdc00 - (hi - 0xd800)*0x400
    hilo = paste('0x', as.hexmode(c(hi, lo)), sep = '')
    return(hilo)
}

hilo2unicode <- function(hi, lo){
    unicode = (hi - 0xD800) * 0x400 + lo - 0xDC00 + 0x10000
    unicode = paste('0x', as.hexmode(unicode), sep = '')
    return(unicode)
}
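For readers coming at this from the Python threads above, the same surrogate arithmetic can be cross-checked in a few lines of Python 3 (a sketch; the 'surrogatepass' error handler is what allows Python to emit the two-surrogate byte form at all):
cp = 0x1F4AF  # U+1F4AF HUNDRED POINTS SYMBOL

# The hi/lo arithmetic from the R functions above:
hi = (cp - 0x10000) // 0x400 + 0xD800
lo = (cp - 0x10000) % 0x400 + 0xDC00
print(hex(hi), hex(lo))  # 0xd83d 0xdcaf

# <ed>... form: each surrogate encoded to UTF-8 on its own.
pair = chr(hi) + chr(lo)
print(pair.encode('utf-8', 'surrogatepass').hex())  # 'eda0bdedb2af'

# <f0>... form: ordinary UTF-8 encoding of the code point itself.
print(chr(cp).encode('utf-8').hex())  # 'f09f92af'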
PS2:
I would recommend using iconv(tweet, 'UTF-8', 'latin1', 'byte') to preserve special characters like áäà.
PS3:
To replace an emoji with its English text, a tag, a hash, or anything else you want to map it to, I would suggest using DFS over a graph of emojis, because some emojis' byte sequences are concatenations of other, simpler ones. For example, <f0><9f><a4><b8><e2><80><8d><e2><99><82><ef><b8><8f> is "man cartwheeling", while independently <f0><9f><a4><b8> is "person cartwheeling", <e2><80><8d> is an invisible joiner, <e2><99><82> is the male sign, and <ef><b8><8f> is an invisible variation selector. While "man cartwheeling" and "person cartwheeling" plus "male sign" are obviously semantically related, I prefer the more faithful translation.
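A quick way to see that structure is to split the sequence into its code points. A small Python sketch (Python only because that is the language of the other threads here; utf8ToInt does the same job in R):
import unicodedata

# "Man cartwheeling" is a ZWJ sequence: person cartwheeling + joiner + male sign + variation selector.
seq = '\U0001F938\u200D\u2642\uFE0F'
for ch in seq:
    print('U+{:04X}  {}'.format(ord(ch), unicodedata.name(ch, '<no name>')))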
The answer provided by Felipe Suárez Colmenares is excellent because it describes the mechanics of this issue, but I wanted to point you here, to a dictionary I made with the <ed>... R encoding specifically for Twitter. I also have code for going through tweets and identifying prose versions of emojis. I thought this might be easier for people who stumble into this problem in the future. The dictionary is up to date with the most recent Unicode version (9), and once a newer one comes out I'll update it too.
Please try this: iconv(tweet$text, "latin1", "ASCII", sub="")
There is also a similar discussion here:
Emoticons in Twitter Sentiment Analysis in r
Regards,
Magda
I'm currently writing a chat bot in Python and I would like to be able to use special characters like emoji. My first attempt was just to place the literal character in the code:
add_reaction('🇦')
Unfortunately, not many editors support these characters, so they mostly appear as random gibberish, which isn't great for readability either.
To solve the gibberish issue I used chr(charcode) instead, which also made the characters copy-paste safe.
Then I put all of them in a separate file, special_chars.py, so I could give the characters names:
thumbs_up = chr(...)
smiley_face = chr(...)
regional_a_z = [chr(127462 + i) for i in range(26)]  # A through Z
...
However, this file started to grow really long really quickly.
So is there a better way to do this?
Some things to keep in mind:
if a long file isn't avoidable, could the character codes be moved to a non-Python file?
a list could be useful for consecutive characters or character groups, e.g. thumbs up and down, or the regional indicators
The unicodedata module of the standard library already contains names for the special characters:
>>> unicodedata.lookup('THUMBS UP SIGN')
'\U0001f44d'
>>> unicodedata.lookup("REGIONAL INDICATOR SYMBOL LETTER A")
'\U0001f1e6'
You can also get the official name of a character from the character itself:
>>> unicodedata.name('\U0001F600')
'GRINNING FACE'
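Building on that, the whole special_chars.py file can shrink to a handful of name lookups resolved at import time. A minimal sketch (the names used here are official Unicode names, but which ones you need is up to you; the name list could equally live in a plain-text or JSON file that you read and pass to unicodedata.lookup):
import unicodedata

# special_chars.py -- readable ASCII source, characters resolved by name.
thumbs_up = unicodedata.lookup('THUMBS UP SIGN')
smiley_face = unicodedata.lookup('SMILING FACE WITH OPEN MOUTH')

# Consecutive groups stay as comprehensions over the alphabet:
regional_a_z = [unicodedata.lookup('REGIONAL INDICATOR SYMBOL LETTER ' + c)
                for c in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ']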
I know more about bicycle repair, chainsaw use and trench safety than I do Python or text encoding; with that in mind...
Python text encoding seems to be a perennial issue (my own question: Searching text files' contents with various encodings with Python?, and others I've read: 1, 2). I've taken a crack at writing some code to guess the encoding, below.
In limited testing this code seems to work for my purposes* without my having to know much about the first few bytes of a text encoding (the byte order mark) and the situations where those bytes aren't informative.
*My purposes are:
to have a dependency-free snippet I can use with a moderately high degree of success,
to scan a local workstation for text-based log files of any encoding and identify them as files I am interested in based on their contents (which requires the file to be opened with the proper encoding), and
the challenge of getting this to work.
Question: what are the pitfalls of using what I assume is a klutzy method of comparing and counting characters, as I do below? Any input is greatly appreciated.
def guess_encoding_debug(file_path):
    """
    DEBUG - returns many 2 value tuples
    Will return list of all possible text encodings with a count of the number of chars
    read that are common characters, which might be a symptom of success.
    SEE warnings in sister function
    """
    import codecs
    import string
    from operator import itemgetter

    READ_LEN = 1000
    ENCODINGS = ['ascii', 'cp1252', 'mac_roman', 'utf_8', 'utf_16', 'utf_16_le',
                 'utf_16_be', 'utf_32', 'utf_32_le', 'utf_32_be']

    #chars in the regular ascii printable set are BY FAR the most common
    #in most files written in English, so their presence suggests the file
    #was decoded correctly.
    nonsuspect_chars = string.printable

    #to be a list of 2 value tuples
    results = []

    for e in ENCODINGS:
        #some encodings will cause an exception with an incompatible file,
        #they are invalid encodings, so use try to exclude them from results[]
        try:
            with codecs.open(file_path, 'r', e) as f:
                #sample from the beginning of the file
                data = f.read(READ_LEN)

                nonsuspect_sum = 0
                #count the number of printable ascii chars in the
                #READ_LEN sized sample of the file
                for n in nonsuspect_chars:
                    nonsuspect_sum += data.count(n)

                #if there are more chars than READ_LEN
                #the encoding is wrong and bloating the data
                if nonsuspect_sum <= READ_LEN:
                    results.append([e, nonsuspect_sum])
        except:
            pass

    #sort results descending based on nonsuspect_sum portion of
    #tuple (itemgetter index 1).
    results = sorted(results, key=itemgetter(1), reverse=True)

    return results
def guess_encoding(file_path):
    """
    Stupid, simple, slow, brute and yet slightly accurate text file encoding guessing.
    Will return one likely text encoding, though there may be others just as likely.
    WARNING: DO NOT use if your file uses any significant number of characters
             outside the standard ASCII printable characters!
    WARNING: DO NOT use for critical applications, this code will fail you.
    """
    results = guess_encoding_debug(file_path)

    #return the encoding string (second 0 index) from the first
    #result in descending list of encodings (first 0 index)
    return results[0][0]
I am assuming it would be slow compared to chardet, which I am not particularly familiar with, and also less accurate. The way it is designed, any Roman-character-based language that uses accents, umlauts, etc. will not work, at least not well, and it will be hard to know when it fails. However, most text in English, including most programming code, is largely written with the characters in string.printable, which this code depends on.
External libraries may be an option in the future, but for now I want to avoid them because:
This script will be run on multiple company computers, on and off the network, with various versions of Python, so the fewer complications the better. When I say 'company' I mean a small non-profit of social scientists.
I am in charge of collecting the logs from GPS data processing, but I am not the systems administrator; she is not a Python programmer, and the less of her time I take the better.
The installation of Python that is generally available at my company comes with a GIS software package and is generally better left alone.
My requirements aren't too strict: I just want to identify the files I am interested in and use other methods to copy them to an archive. I am not reading the full contents into memory to manipulate, append to, or rewrite them.
It seems like a high-level programming language should have some way of accomplishing this on its own. While "seems like" is a shaky foundation for any endeavor, I wanted to try and see if I could get it to work.
Probably the simplest way to find out how well your code works is to take the test suites of the other existing libraries and use them as a base to create your own comprehensive test suite. Then you will know whether your code works for all of those cases, and you can also test for all of the cases you care about.
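As a starting point before borrowing anyone else's suites, a small Python 3 self-check in the same dependency-free spirit might look like this; the sample strings and the chosen encodings are only illustrative, and the harness reports mismatches rather than asserting, since near-ties between encodings are expected:
import os
import tempfile

SAMPLES = [
    ('plain ASCII log line 42\n' * 50, 'ascii'),
    ('Straße, déjà vu, naïve\n' * 50, 'utf_8'),
    ('wide characters ahead\n' * 50, 'utf_16'),
]

def run_self_check(guess_encoding):
    """Write known text in known encodings and report where the guesser disagrees."""
    failures = []
    for text, encoding in SAMPLES:
        with tempfile.NamedTemporaryFile('w', encoding=encoding,
                                         suffix='.log', delete=False) as f:
            f.write(text)
            path = f.name
        try:
            guessed = guess_encoding(path)
            if guessed != encoding:
                failures.append((encoding, guessed))
        finally:
            os.remove(path)
    return failures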
I'm working with large files in French and German: basically, writing strings of characters from one file to another, collecting data from them, and so forth. Unfortunately, I have no idea what to import in order to let Python handle these characters.
Even when collecting data from files that Python has already converted (in French you get weird things like écouteur ça), I get key errors when checking dicts for things that I know have already been placed in them, but only when the items contain special characters, as in the example of écouteur ça.
For example, when the tuple ('écouteur', 'ça') has been added to a dict that counts how often a given pair of words occurs together, I get a KeyError when probing that dict for the tuple ('écouteur', 'ça'), but not when probing it for tuples that don't contain such characters.
Does anyone know a quick way to get around this issue at every level?
Best,
Georgina
"Unicode in Python, Completely Demystified"