These days I'm struggling with geographical dataframes, which I manage with geopandas. My problem comes from the odd encoding of the special characters in the names of regions and towns. I have never seen this format before. Fortunately, there are not many of them.
I tried all kinds of encodings, from latin-1 to several ISO-xxx variants, but the only approach that seems to work properly is a manual replacement with a dictionary. I don't like it, because the dictionary is built only from the examples I can find in the dataframe itself: if the data changes in the future, new characters will be missed.
Here's an example of how I approached the replacement. Since I couldn't find an encoding that let me read the dataframe properly, I passed 'utf-8' as the encoding parameter of the geopandas reader.
df1 = gpd.read_file('path/to/my/file.shp', encoding='utf-8')
The result is the same as the data in the example below, anyway. For the sake of the example I included only two rows, but my original dataframe contains at least one instance of each pair in the dictionary.
import pandas as pd
df = pd.DataFrame([[b"Pr\x8e-Saint-Didier", b"Vall\x8e d'Aoste"], [b"Bozen", b"Trentino Alto Adige - S\x9ddtirol"]], columns=['town', 'region'])
special_chars = {
'\x9f':'ü',
'\x93':'ì',
'\xed':'ì',
'\x8e':'é',
'\x8f':'è',
'\x8d':'ç',
'\x90':'ê',
'\x98':'ò',
'\x9d':'ù',
'\x88':'à',
}
df['town'] = df['town'].str.decode('latin-1').replace(special_chars, regex=True)
df['region'] = df['region'].str.decode('latin-1').replace(special_chars, regex=True)
Does anybody have any idea on how to solve this problem?
How to handle it?
It is probably an existing encoding, so you have several possibilities. One is to look up a few of these characters on Wikipedia: the articles on some accented characters list their byte values in various encodings. In this case, I found that an old Mac OS code page had some of your characters right, so I checked the other Mac encodings, and I think I found it.
Alternatively (do this if you have many different files and encodings), you can write a Python script with a short conversion table and iterate over all encodings. Select the three encodings with the best score (and maybe also print the characters as decoded by each of them). This takes longer the first time, but if you run into this problem often it will help you (especially because it seems you are dealing with old data).
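A minimal sketch of such a script, under a few assumptions: the function name rank_encodings, the list of candidate encodings and the set of expected characters are all made up for illustration, and the two byte strings are the ones from the question.

```python
def rank_encodings(samples, expected_chars, candidates, top=3):
    """Score each candidate encoding by how many decoded non-ASCII
    characters belong to the set we expect to see in the data."""
    scores = []
    for enc in candidates:
        hits = 0
        try:
            for raw in samples:
                text = raw.decode(enc)
                hits += sum(1 for ch in text
                            if ord(ch) > 127 and ch in expected_chars)
        except (UnicodeDecodeError, LookupError):
            # The bytes are invalid in this encoding, or the codec is unknown.
            continue
        scores.append((enc, hits))
    # Highest score first; keep only the best `top` candidates.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top]


# Byte strings taken from the question; the expected characters are a guess
# at what Italian, French and German place names are likely to contain.
samples = [b"Pr\x8e-Saint-Didier", b"Vall\x8e d'Aoste"]
expected = set("àèéìòùüçê")
candidates = ["utf-8", "latin-1", "cp1252", "cp437", "cp850", "mac_roman"]

best = rank_encodings(samples, expected, candidates)
print(best)
```

The score is deliberately crude: it only counts characters you already expect, so it works best when you know the language of the data.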
Note: it seems that a few of your guesses may be wrong (wrong case?).
What I found?
I think it is Mac OS Roman, or maybe a related Mac OS encoding. Now it is your task to check carefully whether my guess is correct (I didn't check all the characters).
Note: This encoding is known as mac_roman in Python.
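A quick way to check the guess, using the byte strings from the question. This is only a sketch: whether mac_roman is right for every value in the real file still has to be verified against the data.

```python
# Byte strings copied from the example dataframe in the question.
raw_names = [b"Pr\x8e-Saint-Didier", b"Vall\x8e d'Aoste"]

decoded = [name.decode("mac_roman") for name in raw_names]
print(decoded)

# 0xED is one of the entries in the replacement dictionary; in mac_roman
# it is an upper-case letter, which may be the "wrong case" noted above.
capital_i_grave = b"\xed".decode("mac_roman")
print(capital_i_grave)

# With geopandas, the same codec name would go in the encoding parameter
# already used in the question:
# df1 = gpd.read_file('path/to/my/file.shp', encoding='mac_roman')
```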
Related
I'm trying to build a way to find emojis in tweets and relate them to the Unicode table at unicode.org, but I'm finding it hard to identify them because of what I think are encoding problems, or simply my misunderstanding of the topic. In short, I built a "library" of emojis from the table at http://www.unicode.org/emoji/charts/full-emoji-list.html that contains the title and the code point (code) of each emoji. I scraped it in R with the rvest library.
The problem comes when I grab the information from Twitter with the twitteR API in R, as the codes for the emojis do not look at all like the ones in this table.
Let's take as an example the red "100" (hundred points) emoji. It is number 1468 in the table linked above and its code point is:
U+1F4AF
Now, when I grab it from twitter, first of all it is shown like this in the status class that the API has builtin to work with the tweets.
\xed��\xed��
Then, once I convert it to a dataframe, I do it also with a builtin function from the twitter API. For example:
tweet$toDataFrame()
The emoji becomes this:
<ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>
I tried to convert it with the function iconv in R, with the following code:
iconv(tweet$text, from="UTF-8", to="ASCII", "byte")
and I only manage to make it look like this:
<ed><a0><bd><ed><b2><af>
So, wrapping up and at the end of my tests, I got to the following results:
<ed><a0><bd><ed><b2><af>
<ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>
\xed��\xed��
None of which look like the code point specified by the table:
U+1F4AF
Is there any possibility to transform between the two strings?
What am I missing? Why is twitter returning this information for emojis?
I didn't know anything about encoding before, but after days of reading I think I know what is going on. I don't understand perfectly how the encoding of emoji works, but I stumbled upon the same problem and solved it.
You want to map \xed��\xed�� to its name-decoded version: hundred points. A sensible way could be to scrape a dictionary online and use a key, such as the Unicode code point, to replace it. In this case it would be U+1F4AF.
The conversions you show are not different encodings but different notations for the same encoded emoji:
as.data.frame(tweet) returns <ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>.
iconv(tweet, from="UTF-8", to="ASCII", "byte") returns <ed><a0><bd><ed><b2><af>.
So using Unicode directly isn't feasible. Another way could be to use a dictionary that already encodes emoji in the <ed>...<ed>... way, like the one here: emoji list. Voilà! Only her list is incomplete, because it comes from a dictionary that contains fewer emoticons.
The fast solution is to simply scrape a more complete dictionary and map the <ed>...<ed>... with its corresponding english text translation. I have done that already and posted here.
What bugged me, though, is that nobody else had posted a list with the proper encoding. In fact, most dictionaries I found encoded the emoji not in the <ed>...<ed>... representation but as <f0>.... It turns out both are byte sequences for the same code point U+1F4AF. Only the <f0>... form is standard UTF-8; the <ed>...<ed>... form encodes each half of a UTF-16 surrogate pair separately (a variant known as CESU-8).
Long answer. The tweet is read in UTF-16 and then converted to UTF-8, and this is where the conversions diverge. UTF-16 stores U+1F4AF as two 16-bit units, a surrogate pair. When each unit is converted on its own, the result is <ed>...<ed>... (three bytes per surrogate); when the pair is first combined into a single code point, the result is the four-byte <f0>.... The difference comes from the conversion routine, not from the machine it runs on.
So a slower (but more conscious) way to solve your problem is to scrape the <f0>... dictionary, convert it to UTF-16, convert it back to UTF-8 unit by unit, and you'll end up with two <ed>.... These two <ed>... are known as the high-low surrogate pair representation of the Unicode code point U+xxxxx.
As an example:
unicode <- 0x1F4Af
# Multibyte Version
intToUtf8(unicode)
# Byte-pair Version
hilo <- unicode2hilo(unicode)
intToUtf8(hilo)
Returns:
[1] "\xf0\u009f\u0092�"
[1] "\xed��\xed��"
Which, again, using iconv(..., 'utf-8', 'latin1', 'byte'), is the same as:
[1] "<f0><9f><92><af>"
[1] "<ed><a0><bd><ed><b2><af>"
PS1.:
Function unicode2hilo is a simple linear transformation from a Unicode code point to its hi-lo pair, and hilo2unicode is the inverse:
unicode2hilo <- function(unicode){
hi = floor((unicode - 0x10000)/0x400) + 0xd800
lo = (unicode - 0x10000) + 0xdc00 - (hi-0xd800)*0x400
hilo = paste('0x', as.hexmode(c(hi,lo)), sep = '')
return(hilo)
}
hilo2unicode <- function(hi,lo){
unicode = (hi - 0xD800) * 0x400 + lo - 0xDC00 + 0x10000
unicode = paste('0x', as.hexmode(unicode), sep = '')
return(unicode)
}
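For readers who work in Python rather than R, the same arithmetic can be sketched as follows. The function names to_surrogates and from_surrogates are made up for this example, and Python 3 is assumed; the "surrogatepass" error handler is part of the standard codecs machinery.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into its UTF-16 high/low surrogates."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogates(hi, lo):
    """Inverse of to_surrogates."""
    return ((hi - 0xD800) << 10) + (lo - 0xDC00) + 0x10000


hi, lo = to_surrogates(0x1F4AF)

# Standard UTF-8: one four-byte sequence, the <f0>... form.
four_byte = chr(0x1F4AF).encode("utf-8")

# Each surrogate encoded on its own, the <ed>...<ed>... form.
# "surrogatepass" is needed because strict UTF-8 refuses lone surrogates.
byte_pairs = (chr(hi) + chr(lo)).encode("utf-8", "surrogatepass")

# Going back: decode the two surrogates, then let UTF-16 recombine the pair.
halves = byte_pairs.decode("utf-8", "surrogatepass")
repaired = halves.encode("utf-16-le", "surrogatepass").decode("utf-16-le")

print(hex(hi), hex(lo), four_byte, byte_pairs, repaired == chr(0x1F4AF))
```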
PS2.:
I would recommend using iconv(tweet, 'UTF-8', 'latin1', 'byte') to preserve special characters like áäà.
PS3.:
To replace the emoji with its English text, tag, hash, or anything else you want to map it to, I would suggest using DFS in a graph of emojis, because some emojis are the concatenation of other, simpler ones. For example, <f0><9f><a4><b8><e2><80><8d><e2><99><82><ef><b8><8f> is man cartwheeling, while independently <f0><9f><a4><b8> is person cartwheeling, <e2><80><8d> is an invisible zero-width joiner, <e2><99><82> is a male sign, and <ef><b8><8f> is an invisible variation selector. While man cartwheeling and person cartwheeling male sign are obviously semantically related, I prefer the more faithful translation.
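One simple way to get the more faithful translation, sketched here in Python, is a greedy longest-match replacement. Note that this is a simpler stand-in for the DFS idea, not an implementation of it; the function name replace_emoji and the two-entry dictionary are hypothetical, and a real dictionary would be scraped as described above.

```python
def replace_emoji(text, names):
    """Replace emoji with their names, always trying the longest sequence first."""
    keys = sorted(names, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(names[key])
                i += len(key)
                break
        else:
            # No emoji starts here; copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


# Hypothetical mini-dictionary with the two sequences discussed above.
names = {
    "\U0001F938": "person cartwheeling",
    "\U0001F938\u200D\u2642\uFE0F": "man cartwheeling",
}

long_form = replace_emoji("go \U0001F938\u200D\u2642\uFE0F!", names)
short_form = replace_emoji("\U0001F938", names)
print(long_form, "/", short_form)
```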
The answer provided by Felipe Suárez Colmenares is excellent because it describes the mechanics of this issue, but I wanted to point you here, which is a dictionary I made with the <ed> R encoding specifically for Twitter. I also have code that goes through tweets and identifies prose versions of emojis. I thought this might be easier for people who stumble into this problem in the future. The dictionary is up to date with the most recent Unicode version (9), and once an even newer one comes out I'll update it too.
Please try typing this: iconv(tweet$text, "latin1", "ASCII", sub="")
There is also a similar discussion here:
Emoticons in Twitter Sentiment Analysis in r
Regards,
Magda
I am faced with the following problem: when I generate .csv files in python using django-import-export even though the field is a string, when I open it in Excel the leading zeros are omitted. E.g. 000123 > 123.
This is a problem, because if I'd like to display a zipcode I need the zeros the way they are. I can wrap it in quotes, but that's not desirable since it will grab unnecessary attention and it just looks bad. I'm also aware that you can do it in Excel files manually by changing the data type, but I don't want to explain that to people who are using my software.
Any suggestions?
Thanks in advance.
I've tried this solution, the one suggested by @jquijano, but it hasn't worked.
After generating the CSV, I opened it with OpenOffice and Excel, and in both cases I could see the (') character at the beginning of each string. However, if I added a new value to the CSV in the editor, for example '0895, the (') disappeared and the leading 0 wasn't removed.
Luckily, I found a workaround. I just added a non-printing control character at the beginning.
value = chr(24) + unidecode('00123')
An easy fix would be to add an apostrophe (') at the beginning of each number when exporting with import-export. This way Excel will recognize those numbers as text.
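Both workarounds above come down to putting a prefix in front of the value before it is written. A minimal sketch with the standard csv module, assuming a hypothetical helper called protect_leading_zeros; how a spreadsheet displays the prefix is up to the application, and as reported above the apostrophe may stay visible.

```python
import csv
import io


def protect_leading_zeros(value, prefix="'"):
    """Prefix digit strings that start with 0 so they are not read as numbers."""
    if value.isdigit() and value.startswith("0"):
        return prefix + value
    return value


buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow([protect_leading_zeros(v) for v in ["000123", "4567", "Bozen"]])

csv_text = buffer.getvalue()
print(repr(csv_text))
```

Passing prefix=chr(24) instead reproduces the control-character workaround.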
I have a large set of strings, and am looking to extract a certain part of each of the strings. Each string contains a sub string like this:
my_token:[
"key_of_interest"
],
This is the only part in each string it says my_token. I was thinking about getting the end index position of ' my_token:[" ' and after that getting the beginning index position of ' "], ' and getting all the text between those two index positions.
Is there a better or more efficient way of doing this? I'll be doing this for string of length ~10,000 and sets of size 100,000.
Edit: The file is a .ion file. From my understanding it can be treated as a flat file - as it is text based and used for describing metadata.
How can this can possibly be done the "dumbest and simplest way"?
find the starting position
look on for the ending position
grab everything indiscriminately between the two
This is indeed what you're doing. Thus any further improvement can only come from optimizing each step. Possible ways include:
narrow down the search region (requires additional constraints/assumptions as per comment56995056)
speed up the search operation bits, which include:
extracting raw data from the format
you already did this by disregarding the format altogether - so you have to make sure there'll never be any incorrect parsing (e.g. your search terms embedded in strings elsewhere or matching a part of a token) as per comment56995034
elementary pattern comparison operation
unlikely to attain in pure Python since str.index is implemented in C already and the implementation is probably already as simple as can possibly be
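The three steps above can be sketched with str.index alone. This assumes the markers look exactly as in the question and appear only once per string; the function name extract_between is made up for the example.

```python
def extract_between(text, start_marker="my_token:[", end_marker="],"):
    """Return the text between the two markers, without whitespace and quotes."""
    # Step 1: find where the payload starts (just after the start marker).
    start = text.index(start_marker) + len(start_marker)
    # Step 2: look for the end marker, searching only from that position on.
    end = text.index(end_marker, start)
    # Step 3: grab everything in between and trim the decoration.
    return text[start:end].strip().strip('"')


sample = 'header: 1\nmy_token:[\n"key_of_interest"\n],\nfooter: 2'
key = extract_between(sample)
print(key)
```

str.index raises ValueError when a marker is missing, which is usually preferable to silently returning a wrong slice.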
The underlying requirement shows through when you clarify:
I was thinking about getting the end index position of ' my_token:[" ' and after that getting the beginning index position of ' "], ' and getting all the text between those two index positions.
That sounds like you're trying to avoid the correct approach: use a parser for whatever language is in the string.
There is no good reason to build directly on top of string primitives for parsing, unless you are interested in writing yet another parsing framework.
So, use libraries written by people who have dealt with the issues before you.
If it's JSON, use the standard library json module; ditto if it's some other language with a parser already in the Python standard library.
If it's some other widely-implemented standard: get whichever already-existing third-party Python library knows how to parse that properly.
If it's not already implemented: write a custom parser using pyparsing or some other well-known solid library.
So to make a good choice you need to know what is the data format (this is not answered by “what are the file names”; rather, you need to know what is the data format of the content of those files). Then you'll be able to search for a parser library that knows about that data format.
Well, as already mentioned - a parser seems the best option.
But to answer your question without all this extra advice: if you're just looking at speed, a parser isn't really the best method. If you already have a string like this, the faster method would be to use a regex.
import re
matches = re.search(r'my_token:\[\s*"(.*?)"\s*\],', text)
key_of_interest = matches.group(1)
There are other issues that come up. For example, what if your key has a " inside it? Stringified JSON will automatically use an escape character there, and that will be captured by the regex too. So this gets a bit too complicated.
And JSON is not regex-parsable in itself (is-json-a-regular-language). So, use at your own risk. But with the appropriate restrictions and assumptions, a regex would be faster than a JSON parser.
I know more about bicycle repair, chainsaw use and trench safety than I do Python or text encoding; with that in mind...
Python text encoding seems to be a perennial issue (my own question: Searching text files' contents with various encodings with Python?, and others I've read: 1, 2). I've taken a crack at writing some code to guess the encoding below.
In limited testing this code seems to work for my purposes* without me having to know an excess about the first three bytes of text encoding and the situations where those data aren't informative.
*My purposes are:
Have a dependency-free snippet I can use with a moderate-high degree of success,
Scan a local workstation for text based log files of any encoding and identify them as a file I am interested in based on their contents (which requires the file to be opened with the proper encoding)
for the challenge of getting this to work.
Question: What are the pitfalls of using what I assume to be a klutzy method of comparing and counting characters, like I do below? Any input is greatly appreciated.
def guess_encoding_debug(file_path):
    """
    DEBUG - returns many 2 value tuples
    Will return list of all possible text encodings with a count of the number of chars
    read that are common characters, which might be a symptom of success.
    SEE warnings in sister function
    """
    import codecs
    import string
    from operator import itemgetter
    READ_LEN = 1000
    ENCODINGS = ['ascii','cp1252','mac_roman','utf_8','utf_16','utf_16_le',
                 'utf_16_be','utf_32','utf_32_le','utf_32_be']
    #chars in the regular ascii printable set are BY FAR the most common
    #in most files written in English, so their presence suggests the file
    #was decoded correctly.
    nonsuspect_chars = string.printable
    #to be a list of 2 value tuples
    results = []
    for e in ENCODINGS:
        #some encodings will cause an exception with an incompatible file,
        #they are invalid encoding, so use try to exclude them from results[]
        try:
            with codecs.open(file_path, 'r', e) as f:
                #sample from the beginning of the file
                data = f.read(READ_LEN)
                nonsuspect_sum = 0
                #count the number of printable ascii chars in the
                #READ_LEN sized sample of the file
                for n in nonsuspect_chars:
                    nonsuspect_sum += data.count(n)
                #if there are more chars than READ_LEN
                #the encoding is wrong and bloating the data
                if nonsuspect_sum <= READ_LEN:
                    results.append([e, nonsuspect_sum])
        except UnicodeError:
            #decoding failed, so this encoding does not fit the file
            pass
    #sort results descending based on nonsuspect_sum portion of
    #tuple (itemgetter index 1).
    results = sorted(results, key=itemgetter(1), reverse=True)
    return results

def guess_encoding(file_path):
    """
    Stupid, simple, slow, brute and yet slightly accurate text file encoding guessing.
    Will return one likely text encoding, though there may be others just as likely.
    WARNING: DO NOT use if your file uses any significant number of characters
    outside the standard ASCII printable characters!
    WARNING: DO NOT use for critical applications, this code will fail you.
    """
    results = guess_encoding_debug(file_path)
    #return the encoding string (second 0 index) from the first
    #result in descending list of encodings (first 0 index)
    return results[0][0]
I am assuming it would be slow compared to chardet, which I am not particularly familiar with, and also less accurate. The way it is designed, any Roman-character-based language that uses accents, umlauts, etc. will not work, at least not well. It will be hard to know when it fails. However, most text in English, including most programming code, is largely written with the characters in string.printable, on which this code depends.
External libraries may be an option in the future, but for now I want to avoid them because:
This script will be run on multiple company computers on and off the network with various versions of python, so the fewer complications the better. When I say 'company' I mean small non-profit of social scientists.
I am in charge of collecting the logs from GPS data processing, but I am not the systems administrator - she is not a python programmer and the less time I take of hers the better.
The installation of Python that is generally available at my company is installed with a GIS software package, and is generally better when left alone.
My requirements aren't too strict, I just want to identify the files I am interested in and use other methods to copy them to an archive. I am not reading the full contents to memory to manipulate, appending or to rewriting the contents.
It seems like a high-level programming language should have some way of accomplishing this on its own. While "seems like" is a shaky foundation for any endeavor, I wanted to try and see if I could get it to work.
Probably the simplest way to find out how well your code works is to take the test suites of the existing libraries and use them as a base to create your own comprehensive test suite. Then you will know whether your code works for all of those cases, and you can also test all of the cases you care about.
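Before borrowing a full test suite, a small round-trip harness already tells you a lot: write the same text in several known encodings and see what the guesser says. A sketch, assuming Python 3; check_guesser is a hypothetical helper, and guess is any function that takes a file path and returns an encoding name.

```python
import os
import tempfile


def check_guesser(guess, text, encodings):
    """Write `text` in each encoding and record what `guess` returns for it."""
    outcome = {}
    for enc in encodings:
        fd, path = tempfile.mkstemp(suffix=".txt")
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(text.encode(enc))
            outcome[enc] = guess(path)
        finally:
            os.remove(path)
    return outcome


# Stand-in guesser so the sketch runs on its own; pass the real
# guessing function here instead.
def always_utf8(path):
    return "utf_8"


report = check_guesser(always_utf8, "GPS log line 1\n", ["ascii", "utf_16"])
print(report)
```

Keep in mind that a correct guess need not be identical to the encoding used for writing: pure ASCII text, for example, is decoded equally well by ascii, cp1252 and utf_8.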
I have a Django app that takes tweet data from Twitter's API and saves it in a MySQL database. As far as I know (I'm still getting my head around the finer points of character encoding) I'm using UTF-8 everywhere, including MySQL encoding and collation, which works fine except when a tweet contains Emoji characters, which I understand use a four-byte encoding. Trying to save them produces the following warnings from Django:
/home/biggleszx/.virtualenvs/myvirtualenv/lib/python2.6/site-packages/django/db/backends/mysql/base.py:86: Warning: Incorrect string value: '\xF0\x9F\x98\xAD I...' for column 'text' at row 1
return self.cursor.execute(query, args)
I'm using MySQL 5.1, so using utf8mb4 isn't an option unless I upgrade to 5.5, which I'd rather not just yet (also from what I've read, Django's support for this isn't quite production-ready, though this might no longer be accurate). I've also seen folks advising the use of BLOB instead of TEXT on affected columns, which I'd also rather not do as I figure it would harm performance.
My question is, then, assuming I'm not too bothered about 100% preservation of the tweet contents, is there a way I can filter out all Emoji characters and replace them with a non-multibyte character, such as the venerable WHITE MEDIUM SMALL SQUARE (U+25FD)? I figure this is the easiest way to save that data given my current setup, though if I'm missing another obvious solution, I'd love to hear it!
FYI, I'm using the stock Python 2.6.5 on Ubuntu 10.04.4 LTS. sys.maxunicode is 1114111, so it's a UCS-4 build.
Thanks for reading.
So it turns out this has been answered a few times, I just hadn't quite got the right Google-fu to find the existing questions.
Python, convert 4-byte char to avoid MySQL error "Incorrect string value:"
Warning raised by inserting 4-byte unicode to mysql
Thanks to Martijn Pieters, the solution came from the world of regular expressions, specifically this code (based on his answer to the first link above):
import re
try:
    # UCS-4
    highpoints = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:
    # UCS-2
    highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
# mytext = u'<some string containing 4-byte chars>'
mytext = highpoints.sub(u'\u25FD', mytext)
The character I'm replacing with is the WHITE MEDIUM SMALL SQUARE (U+25FD), FYI, but could be anything.
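For those unfamiliar with UCS, like me: a given build of Python 2 stores Unicode strings internally in one of two ways. A UCS-4 ("wide") build stores every code point as a single unit, while a UCS-2 ("narrow") build uses 16-bit units and represents code points above U+FFFF as surrogate pairs, which is why the two branches need different patterns.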
With the addition of this code, the strings seem to persist in MySQL 5.1 just fine.
Hope this helps anyone else in the same situation!
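For readers on Python 3.3 or later, where the narrow/wide build distinction no longer exists, the same idea needs only the first branch. A sketch under that assumption; the function name strip_astral is made up, and the sample string reuses the character from the MySQL warning in the question.

```python
import re

# Every code point above U+FFFF; these are the ones that take
# four bytes in UTF-8.
ASTRAL = re.compile("[\U00010000-\U0010FFFF]")


def strip_astral(text, replacement="\u25FD"):
    """Replace each character outside the Basic Multilingual Plane."""
    return ASTRAL.sub(replacement, text)


crying = "\U0001F62D"
encoded = crying.encode("utf-8")      # the bytes MySQL complained about
cleaned = strip_astral(crying + " I...")
print(encoded, cleaned)
```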
I tried the solution by BigglesZX and it wasn't working for the heart emoji (❤). After reading the Wikipedia article on emoji, I saw that the regular expression does not cover all emoji, while it also covers other Unicode ranges that are not emoji.
The following code creates five regular expressions that cover the five emoji blocks in the standard:
emoji_symbols_pictograms = re.compile(u'[\U0001f300-\U0001f5fF]')
emoji_emoticons = re.compile(u'[\U0001f600-\U0001f64F]')
emoji_transport_maps = re.compile(u'[\U0001f680-\U0001f6FF]')
emoji_symbols = re.compile(u'[\U00002600-\U000026FF]')
emoji_dingbats = re.compile(u'[\U00002700-\U000027BF]')
Those blocks could be merged in three blocks (UCS-4):
emoji_block0 = re.compile(u'[\U00002600-\U000027BF]')
emoji_block1 = re.compile(u'[\U0001f300-\U0001f64F]')
emoji_block2 = re.compile(u'[\U0001f680-\U0001f6FF]')
Their equivalents in UCS-2 are:
emoji_block0 = re.compile(u'[\u2600-\u27BF]')
emoji_block1 = re.compile(u'[\uD83C][\uDF00-\uDFFF]')
emoji_block1b = re.compile(u'[\uD83D][\uDC00-\uDE4F]')
emoji_block2 = re.compile(u'[\uD83D][\uDE80-\uDEFF]')
So finally we can define a single regular expression with all the cases together:
import re
try:
    # UCS-4
    highpoints = re.compile(u'([\U00002600-\U000027BF])|([\U0001f300-\U0001f64F])|([\U0001f680-\U0001f6FF])')
except re.error:
    # UCS-2
    highpoints = re.compile(u'([\u2600-\u27BF])|([\uD83C][\uDF00-\uDFFF])|([\uD83D][\uDC00-\uDE4F])|([\uD83D][\uDE80-\uDEFF])')
# mytext = u'<some string containing 4-byte chars>'
mytext = highpoints.sub(u'\u25FD', mytext)
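On Python 3.3 or later the three merged blocks fit in a single character class. This is only a sketch with the same ranges as above, so it inherits their limits: emoji added in later blocks are not matched. The names EMOJI_BLOCKS and replace_emoji_blocks are made up for the example.

```python
import re

# Same three merged blocks as above, written as one character class.
EMOJI_BLOCKS = re.compile(
    "[\u2600-\u27BF\U0001F300-\U0001F64F\U0001F680-\U0001F6FF]"
)


def replace_emoji_blocks(text, replacement="\u25FD"):
    """Replace characters that fall inside the three emoji ranges."""
    return EMOJI_BLOCKS.sub(replacement, text)


# Heart (U+2764), crying face (U+1F62D) and rocket (U+1F680).
sample = "I \u2764 \U0001F62D \U0001F680 ok"
result = replace_emoji_blocks(sample)

# A character above U+FFFF that is not in these blocks is left alone.
untouched = replace_emoji_blocks("\U00010400")
print(result)
```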
I found another regular expression that is able to identify emoji.
This regex is provided by the team behind the Instagram Engineering blog:
u"(?<!&)#(\w|(?:[\xA9\xAE\u203C\u2049\u2122\u2139\u2194-\u2199\u21A9\u21AA\u231A\u231B\u2328\u2388\u23CF\u23E9-\u23F3\u23F8-\u23FA\u24C2\u25AA\u25AB\u25B6\u25C0\u25FB-\u25FE\u2600-\u2604\u260E\u2611\u2614\u2615\u2618\u261D\u2620\u2622\u2623\u2626\u262A\u262E\u262F\u2638-\u263A\u2648-\u2653\u2660\u2663\u2665\u2666\u2668\u267B\u267F\u2692-\u2694\u2696\u2697\u2699\u269B\u269C\u26A0\u26A1\u26AA\u26AB\u26B0\u26B1\u26BD\u26BE\u26C4\u26C5\u26C8\u26CE\u26CF\u26D1\u26D3\u26D4\u26E9\u26EA\u26F0-\u26F5\u26F7-\u26FA\u26FD\u2702\u2705\u2708-\u270D\u270F\u2712\u2714\u2716\u271D\u2721\u2728\u2733\u2734\u2744\u2747\u274C\u274E\u2753-\u2755\u2757\u2763\u2764\u2795-\u2797\u27A1\u27B0\u27BF\u2934\u2935\u2B05-\u2B07\u2B1B\u2B1C\u2B50\u2B55\u3030\u303D\u3297\u3299]|\uD83C[\uDC04\uDCCF\uDD70\uDD71\uDD7E\uDD7F\uDD8E\uDD91-\uDD9A\uDE01\uDE02\uDE1A\uDE2F\uDE32-\uDE3A\uDE50\uDE51\uDF00-\uDF21\uDF24-\uDF93\uDF96\uDF97\uDF99-\uDF9B\uDF9E-\uDFF0\uDFF3-\uDFF5\uDFF7-\uDFFF]|\uD83D[\uDC00-\uDCFD\uDCFF-\uDD3D\uDD49-\uDD4E\uDD50-\uDD67\uDD6F\uDD70\uDD73-\uDD79\uDD87\uDD8A-\uDD8D\uDD90\uDD95\uDD96\uDDA5\uDDA8\uDDB1\uDDB2\uDDBC\uDDC2-\uDDC4\uDDD1-\uDDD3\uDDDC-\uDDDE\uDDE1\uDDE3\uDDEF\uDDF3\uDDFA-\uDE4F\uDE80-\uDEC5\uDECB-\uDED0\uDEE0-\uDEE5\uDEE9\uDEEB\uDEEC\uDEF0\uDEF3]|\uD83E[\uDD10-\uDD18\uDD80-\uDD84\uDDC0]|(?:0\u20E3|1\u20E3|2\u20E3|3\u20E3|4\u20E3|5\u20E3|6\u20E3|7\u20E3|8\u20E3|9\u20E3|#\u20E3|\\*\u20E3|\uD83C(?:\uDDE6\uD83C(?:\uDDEB|\uDDFD|\uDDF1|\uDDF8|\uDDE9|\uDDF4|\uDDEE|\uDDF6|\uDDEC|\uDDF7|\uDDF2|\uDDFC|\uDDE8|\uDDFA|\uDDF9|\uDDFF|\uDDEA)|\uDDE7\uD83C(?:\uDDF8|\uDDED|\uDDE9|\uDDE7|\uDDFE|\uDDEA|\uDDFF|\uDDEF|\uDDF2|\uDDF9|\uDDF4|\uDDE6|\uDDFC|\uDDFB|\uDDF7|\uDDF3|\uDDEC|\uDDEB|\uDDEE|\uDDF6|\uDDF1)|\uDDE8\uD83C(?:\uDDF2|\uDDE6|\uDDFB|\uDDEB|\uDDF1|\uDDF3|\uDDFD|\uDDF5|\uDDE8|\uDDF4|\uDDEC|\uDDE9|\uDDF0|\uDDF7|\uDDEE|\uDDFA|\uDDFC|\uDDFE|\uDDFF|\uDDED)|\uDDE9\uD83C(?:\uDDFF|\uDDF0|\uDDEC|\uDDEF|\uDDF2|\uDDF4|\uDDEA)|\uDDEA\uD83C(?:\uDDE6|\uDDE8|\uDDEC|\uDDF7|\uDDEA|\uDDF9|\uDDFA|\uDDF8|\uD
DED)|\uDDEB\uD83C(?:\uDDF0|\uDDF4|\uDDEF|\uDDEE|\uDDF7|\uDDF2)|\uDDEC\uD83C(?:\uDDF6|\uDDEB|\uDDE6|\uDDF2|\uDDEA|\uDDED|\uDDEE|\uDDF7|\uDDF1|\uDDE9|\uDDF5|\uDDFA|\uDDF9|\uDDEC|\uDDF3|\uDDFC|\uDDFE|\uDDF8|\uDDE7)|\uDDED\uD83C(?:\uDDF7|\uDDF9|\uDDF2|\uDDF3|\uDDF0|\uDDFA)|\uDDEE\uD83C(?:\uDDF4|\uDDE8|\uDDF8|\uDDF3|\uDDE9|\uDDF7|\uDDF6|\uDDEA|\uDDF2|\uDDF1|\uDDF9)|\uDDEF\uD83C(?:\uDDF2|\uDDF5|\uDDEA|\uDDF4)|\uDDF0\uD83C(?:\uDDED|\uDDFE|\uDDF2|\uDDFF|\uDDEA|\uDDEE|\uDDFC|\uDDEC|\uDDF5|\uDDF7|\uDDF3)|\uDDF1\uD83C(?:\uDDE6|\uDDFB|\uDDE7|\uDDF8|\uDDF7|\uDDFE|\uDDEE|\uDDF9|\uDDFA|\uDDF0|\uDDE8)|\uDDF2\uD83C(?:\uDDF4|\uDDF0|\uDDEC|\uDDFC|\uDDFE|\uDDFB|\uDDF1|\uDDF9|\uDDED|\uDDF6|\uDDF7|\uDDFA|\uDDFD|\uDDE9|\uDDE8|\uDDF3|\uDDEA|\uDDF8|\uDDE6|\uDDFF|\uDDF2|\uDDF5|\uDDEB)|\uDDF3\uD83C(?:\uDDE6|\uDDF7|\uDDF5|\uDDF1|\uDDE8|\uDDFF|\uDDEE|\uDDEA|\uDDEC|\uDDFA|\uDDEB|\uDDF4)|\uDDF4\uD83C\uDDF2|\uDDF5\uD83C(?:\uDDEB|\uDDF0|\uDDFC|\uDDF8|\uDDE6|\uDDEC|\uDDFE|\uDDEA|\uDDED|\uDDF3|\uDDF1|\uDDF9|\uDDF7|\uDDF2)|\uDDF6\uD83C\uDDE6|\uDDF7\uD83C(?:\uDDEA|\uDDF4|\uDDFA|\uDDFC|\uDDF8)|\uDDF8\uD83C(?:\uDDFB|\uDDF2|\uDDF9|\uDDE6|\uDDF3|\uDDE8|\uDDF1|\uDDEC|\uDDFD|\uDDF0|\uDDEE|\uDDE7|\uDDF4|\uDDF8|\uDDED|\uDDE9|\uDDF7|\uDDEF|\uDDFF|\uDDEA|\uDDFE)|\uDDF9\uD83C(?:\uDDE9|\uDDEB|\uDDFC|\uDDEF|\uDDFF|\uDDED|\uDDF1|\uDDEC|\uDDF0|\uDDF4|\uDDF9|\uDDE6|\uDDF3|\uDDF7|\uDDF2|\uDDE8|\uDDFB)|\uDDFA\uD83C(?:\uDDEC|\uDDE6|\uDDF8|\uDDFE|\uDDF2|\uDDFF)|\uDDFB\uD83C(?:\uDDEC|\uDDE8|\uDDEE|\uDDFA|\uDDE6|\uDDEA|\uDDF3)|\uDDFC\uD83C(?:\uDDF8|\uDDEB)|\uDDFD\uD83C\uDDF0|\uDDFE\uD83C(?:\uDDF9|\uDDEA)|\uDDFF\uD83C(?:\uDDE6|\uDDF2|\uDDFC))))[\ufe00-\ufe0f\u200d]?)+
Source:
http://instagram-engineering.tumblr.com/post/118304328152/emojineering-part-2-implementing-hashtag-emoji
Note: I am adding this as a separate answer because it is not complementary to my previous answer here.
I am using a JSON encoder function to encode the input.
json.dumps uses this function to encode dict contents (to convert them to strings), so we need to edit the result a little: remove the surrounding ' " ' characters.
This enabled me to save emoji to MySQL and display them on the web:
# encode input
from json.encoder import py_encode_basestring_ascii
name = py_encode_basestring_ascii(name)[1:-1]
# save (instance is assumed to be an existing YourModel object)
instance.name = name
instance.save()
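The same round trip can be sketched with the public json API, assuming Python 3. The helper names to_ascii_escapes and from_ascii_escapes are made up; json.dumps escapes non-ASCII characters by default, and json.loads turns the escapes back into the original text when the value is read for display.

```python
import json


def to_ascii_escapes(text):
    """Encode text as pure ASCII, using \\uXXXX escapes for everything else."""
    # json.dumps wraps the result in double quotes; drop them.
    return json.dumps(text)[1:-1]


def from_ascii_escapes(escaped):
    """Reverse to_ascii_escapes."""
    return json.loads('"' + escaped + '"')


stored = to_ascii_escapes("100 \U0001F4AF")
restored = from_ascii_escapes(stored)
print(stored, restored == "100 \U0001F4AF")
```

The stored value is longer than the original, so the column length should allow for six ASCII characters per escaped 16-bit unit.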