I'm trying to find emojis in tweets and relate them to the Unicode table that one can find at unicode.org, but I'm finding it hard to identify them because of what I think are encoding problems, or simply my misunderstanding of this topic. In short, what I did is build a "library" of emojis from the table found at http://www.unicode.org/emoji/charts/full-emoji-list.html, which contains the title and the code point (code) of each emoji. I scraped this in R with the library rvest.
The problem comes when I grab the information from Twitter with the twitteR API in R: the codes for the emojis do not look at all like the ones in this table.
Let's take as an example the emoji of the 100 (one hundred points) red icon. This is number 1468 in the table linked above, and its code point is:
U+1F4AF
Now, when I grab it from Twitter, it is first shown like this in the status class that the API provides for working with the tweets:
\xed��\xed��
Then I convert the tweet to a data frame, also with a built-in function from the twitteR API. For example:
tweet$toDataFrame()
The emoji becomes this:
<ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>
I tried to convert it with the function iconv in R, with the following code:
iconv(tweet$text, from="UTF-8", to="ASCII", "byte")
and I only manage to make it look like this:
<ed><a0><bd><ed><b2><af>
So, wrapping up, at the end of my tests I got the following results:
<ed><a0><bd><ed><b2><af>
<ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>
\xed��\xed��
None of which look like the code point specified by the table:
U+1F4AF
Is there any way to transform between these two representations?
What am I missing? Why does Twitter return this information for emojis?
I didn't know anything about encoding before, but after days of reading I think I know what is going on. I don't understand perfectly how the encoding for emoji works, but I stumbled upon the same problem and solved it.
You want to map \xed��\xed�� to its name-decoded version: hundred points. A sensible way would be to scrape a dictionary online and use a key, such as the Unicode code point, to replace it. In this case it would be U+1F4AF.
The conversions you show are not different encodings but different notations for the same encoded emoji:
as.data.frame(tweet) returns <ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>.
iconv(tweet, from="UTF-8", to="ASCII", "byte") returns <ed><a0><bd><ed><b2><af>.
So using the Unicode code point directly isn't feasible. Another way would be to use a dictionary that already encodes emoji in the <ed>...<ed>... form, like the one here: emoji list. Voilà! The only problem is that her list is incomplete, because it comes from a dictionary that contains fewer emoticons.
The fast solution is to simply scrape a more complete dictionary and map each <ed>...<ed>... sequence to its corresponding English text translation. I have done that already and posted it here.
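For illustration only, here is a minimal sketch of that kind of lookup-and-replace in Python; the one-entry dictionary below is a made-up stand-in for the scraped one, and the function name is mine:
# Hypothetical stand-in for the scraped dictionary: raw surrogate-pair
# UTF-8 bytes mapped to a plain-text name.
emoji_names = {
    b'\xed\xa0\xbd\xed\xb2\xaf': 'hundred points',   # U+1F4AF
}

def replace_emojis(raw_bytes, table):
    # Replace every known byte sequence with its text name.
    for seq, name in table.items():
        raw_bytes = raw_bytes.replace(seq, name.encode('ascii'))
    return raw_bytes

print(replace_emojis(b'I got \xed\xa0\xbd\xed\xb2\xaf today', emoji_names))
# b'I got hundred points today'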
Still, the fact that nobody else had posted a list with the proper encoding bugged me. In fact, most dictionaries I found used a UTF-8 encoding with not an <ed>...<ed>... representation but rather an <f0>... one. It turns out they are both correct UTF-8 encodings of the same Unicode code point U+1F4AF; the bytes are simply read differently.
Long answer: the tweet is read in UTF-16 and then converted to UTF-8, and this is where the conversions diverge. When the read is done in pairs of bytes the result is the UTF-8 <ed>...<ed>... form; when it is read in chunks of four bytes the result is the UTF-8 <f0>... form. (Why is this? I don't fully understand, but I suspect it has something to do with the architecture of your processor.)
So a slower (but more conscious) way to solve your problem is to scrape the <f0>... dictionary, convert it to UTF-16, and convert it back to UTF-8 pair by pair; you'll end up with two <ed>... sequences. These two <ed>... sequences are known as the low-high surrogate pair representation of the Unicode code point U+xxxxx.
As an example:
unicode <- 0x1F4AF
# Multibyte Version
intToUtf8(unicode)
# Byte-pair Version
hilo <- unicode2hilo(unicode)
intToUtf8(hilo)
Returns:
[1] "\xf0\u009f\u0092�"
[1] "\xed��\xed��"
Which, again, using iconv(..., 'utf-8', 'latin1', 'byte'), is the same as:
[1] "<f0><9f><92><af>"
[1] "<ed><a0><bd><ed><b2><af>"
PS1.:
The function unicode2hilo is a simple linear transformation from a Unicode code point to its hi-lo surrogate pair (and hilo2unicode is its inverse):
unicode2hilo <- function(unicode){
  # Split a code point above U+FFFF into its high and low surrogates
  hi = floor((unicode - 0x10000)/0x400) + 0xd800
  lo = (unicode - 0x10000) + 0xdc00 - (hi - 0xd800)*0x400
  hilo = paste('0x', as.hexmode(c(hi, lo)), sep = '')
  return(hilo)
}

hilo2unicode <- function(hi, lo){
  # Recombine a high-low surrogate pair into the original code point
  unicode = (hi - 0xD800) * 0x400 + lo - 0xDC00 + 0x10000
  unicode = paste('0x', as.hexmode(unicode), sep = '')
  return(unicode)
}
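For comparison, here is roughly the same arithmetic in Python 3, showing how the two byte forms come out of the same code point (a sketch; the helper name is mine):
def unicode2hilo(cp):
    # Split a code point above U+FFFF into its high and low surrogates
    hi = 0xD800 + ((cp - 0x10000) >> 10)
    lo = 0xDC00 + ((cp - 0x10000) & 0x3FF)
    return hi, lo

hi, lo = unicode2hilo(0x1F4AF)
print(hex(hi), hex(lo))                          # 0xd83d 0xdcaf
# Four-byte UTF-8 of the code point itself (the <f0>... form):
print(chr(0x1F4AF).encode('utf-8'))              # b'\xf0\x9f\x92\xaf'
# Each surrogate encoded separately (the <ed>...<ed>... form):
print((chr(hi) + chr(lo)).encode('utf-8', 'surrogatepass'))
# b'\xed\xa0\xbd\xed\xb2\xaf'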
PS2.:
I would recommend using iconv(tweet, 'UTF-8', 'latin1', 'byte') to preserve special characters like áäà.
PS3.:
To replace each emoji with its English text, tag, hash, or anything else you want to map it to, I would suggest using DFS on a graph of emojis, because some emojis' Unicode is the concatenation of other, simpler code points. For example, <f0><9f><a4><b8><e2><80><8d><e2><99><82><ef><b8><8f> is "man cartwheeling", while taken independently <f0><9f><a4><b8> is "person cartwheeling", <e2><80><8d> prints as nothing, <e2><99><82> is a male sign, and <ef><b8><8f> also prints as nothing. And while "man cartwheeling" and "person cartwheeling male sign" are obviously semantically related, I prefer the more faithful translation. A sketch of the decomposition follows.
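To make the concatenation concrete, here is a small Python 3 sketch (my own illustration, not part of the answer) that lists the code points inside the man-cartwheeling emoji:
# Man cartwheeling is a sequence of simpler code points joined together.
man_cartwheeling = '\U0001F938\u200D\u2642\uFE0F'
for ch in man_cartwheeling:
    print(hex(ord(ch)), ch.encode('utf-8'))
# 0x1f938 b'\xf0\x9f\xa4\xb8'   person cartwheeling
# 0x200d  b'\xe2\x80\x8d'       zero width joiner (prints as nothing)
# 0x2642  b'\xe2\x99\x82'       male sign
# 0xfe0f  b'\xef\xb8\x8f'       variation selector-16 (prints as nothing)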
The answer provided by Felipe Suárez Colmenares is excellent because it describes the mechanics of this issue, but I wanted to point you here, to a dictionary I made with the <ed> R encoding specifically for Twitter. I also have code on how to go through and identify prose versions of emojis. I thought this might be easier for people who stumble into this problem in the future. The dictionary is up to date with the most recent Unicode version (9), and once an even newer one comes out I'll update it too.
Please try typing this: iconv(tweet$text, "latin1", "ASCII", sub="")
There is also a similar discussion here:
Emoticons in Twitter Sentiment Analysis in r
Regards,
Magda
When using CherryPy, I ran into this comment: "strings get wrapped in a list because iterating over a single item list is much faster than iterating over every character in a long string."
It is located at
https://github.com/cherrypy/cherrypy/blob/master/cherrypy/lib/encoding.py#L223
I have done some research online, but I still don't fully understand the reason for wrapping response.body as [response.body]. Can anyone explain the details behind this design?
I think that code only makes sense if you recognize that, prior to the code with that comment, self.body could be either a single string or an iterable sequence containing many strings. Other code uses it as the latter (iterating over it and doing string operations with the items).
While it would technically work to let that later code loop over the characters of the single string, processing the data character by character is likely inefficient. So the code below the comment wraps a list around the single string, letting it be processed all at once.
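A toy illustration of the difference (just a sketch, not CherryPy's actual code):
import timeit

body = 'x' * 1000000  # pretend this is a one-megabyte response body

def consume(chunks):
    # Stand-in for downstream code that iterates and "sends" each chunk.
    total = 0
    for chunk in chunks:
        total += len(chunk)
    return total

# Iterating over the bare string visits every character individually...
print(timeit.timeit(lambda: consume(body), number=10))
# ...while wrapping it in a one-item list yields the whole string in one step.
print(timeit.timeit(lambda: consume([body]), number=10))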
Given a character like "✮" (\xe2\x9c\xae), which could just as well be one like "Σ", "д" or "Λ", I want to find the "actual" length that the character takes up when printed on screen.
For example:
len("✮")
len("\xe2\x9c\xae")
Both return 3, but the result should be 1.
You may try something like this:
import unicodedata
unicodedata.normalize('NFC', u'✮')
len(u"✮")
UTF-8 is a Unicode encoding which uses more than one byte for special characters. Check unicodedata.normalize().
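In other words, the length you are seeing counts UTF-8 bytes rather than characters; decoding first gives the count you expect. A quick sketch (Python 3 syntax; in Python 2 the plain "\xe2\x9c\xae" literal is already the byte string):
data = b'\xe2\x9c\xae'            # the UTF-8 bytes of "✮"
print(len(data))                  # 3 -- counts bytes
print(len(data.decode('utf-8')))  # 1 -- counts characters (code points)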
My answer to a similar question:
You are looking for the rendering width in the current output context. For graphical UIs there is usually a method to directly query this information; for text environments, all you can do is guess what a conformant rendering engine would probably do, and hope that the actual engine matches your expectations.
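If you do have to guess for a terminal, one common heuristic (my own addition, not from the answer above) is the Unicode East Asian Width property:
import unicodedata

def guess_terminal_width(text):
    # Count fullwidth ('F') and wide ('W') characters as two cells, others as one.
    # This is only a heuristic; the real width depends on the terminal and font.
    return sum(2 if unicodedata.east_asian_width(ch) in ('F', 'W') else 1
               for ch in text)

print(guess_terminal_width(u'✮'))    # 1
print(guess_terminal_width(u'漢字'))  # 4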
I have a massive string I'm trying to parse as a series of tokens in string form, and I found a problem: because many of the strings are alike, sometimes doing string.replace() will cause previously replaced characters to be replaced again.
Say the string being replaced is 'goto' and it gets replaced by '41' (hex), which then gets converted into ASCII ('A'). Later on, the string 'A' is also to be replaced, so that already converted token gets replaced again, causing problems.
What would be the best way to get the strings replaced only once? Breaking each token off the original string and searching for them one at a time takes very long.
This is the code I have now. Although it more or less works, it's not very fast:
# The largest token is 8 ASCII chars long
# 'out' is the string with the final outputs
while len(data) != 0:
    length = 8
    while reverse_search(data[:length]) is None:  # sorry THC4k, I used your code
                                                  # at first, but it didn't work out
                                                  # for this and I was too lazy to
                                                  # change it
        length -= 1
    out += reverse_search(data[:length])
    data = data[length:]
If you're trying to substitute all the strings at once, you can use a dictionary:
translation = {'PRINT': '32', 'GOTO': '41'}
code = ' '.join(translation[i] if i in translation else i for i in code.split(' '))
which is basically one pass to split the string plus one constant-time dictionary lookup per token, so it runs in roughly linear time. Very fast, although memory usage could be quite substantial. Keeping track of which positions have already been substituted would also make the original replace-based approach workable, but only if you exclude the cost of looking up previous substitutions. Either way, each input token here is substituted exactly once, so previously substituted output never gets replaced again.
Unless there is a function in Python to translate strings via dictionaries that I don't know about, this seems to be the simplest way of putting it.
It turns
10 PRINT HELLO
20 GOTO 10
into
10 32 HELLO
20 41 10
I hope this helps with your problem.
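If the tokens are not always separated by single spaces, a single regular-expression pass gives the same one-shot behaviour; this is my own variant, not part of the answer above:
import re

translation = {'PRINT': '32', 'GOTO': '41'}

# One alternation over all keys, longest first, so each source token is
# matched and replaced exactly once; replaced output is never rescanned.
pattern = re.compile('|'.join(sorted(map(re.escape, translation),
                                     key=len, reverse=True)))
code = '10 PRINT HELLO\n20 GOTO 10'
print(pattern.sub(lambda m: translation[m.group(0)], code))
# 10 32 HELLO
# 20 41 10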
I'm having a problem passing strings that exceed 80 characters in JSON. When I pass a string that's exactly 80 characters long it works like magic, but once I add the 81st letter it craps out. I've tried looking at the JSON object in Firebug, and it seems to think the string is an array because it has an expander next to it; clicking the expander does nothing, though. I've tried searching online for caps on JSON string sizes and workarounds but am coming up empty. Does anybody know anything about this?
edit:
It actually doesn't matter what the string is... using "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz" yields the same results.
Here's my code (I'm using Python):
result = {"test": "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz"}
self.response.out.write(simplejson.dumps(result))
Would you happen to know the class that encodes strings properly for Python? Thanks so much :)
What is the 81st character? It sounds like the string isn't properly escaped, making the JSON decoder think it is an array. If you could post the string here, or at least the 20 or so characters around position 80, I could probably tell you what is wrong. Also, it would help to know how the JSON string was made. In most languages you can get a class that will make proper JSON strings out of objects and arrays; for example, PHP has json_encode().
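For Python specifically, the standard json module (and simplejson, which the question uses) already does that escaping; a quick sanity check might look like this (a sketch, not the original App Engine handler):
import json  # simplejson exposes the same dumps/loads interface

long_string = 'abcdefghijklmnopqrstuvwxyz' * 5  # 130 characters, well past 80
encoded = json.dumps({'test': long_string})
# The round trip should return exactly the same string regardless of length.
assert json.loads(encoded)['test'] == long_string
print(encoded[:60] + '...')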