I am trying to take unicode and clean it to be used for URLs.
Examples : "Bird's Milk" Cake OR Pão com Ovo
In converting these, my goal is to make them as human readable as possible so, the urls following those examples would be - /birds-milk-cake/ or /pao-com-ovo/
To get the ASCII of the accented characters,
title = 'Pão com Ovo'
title = unicodedata.normalize('NFKD', title).encode('ascii','ignore')
However I am wondering what the best solution is for removing characters like # ! ' " ( ) &. Normalize() errors on those characters so is there a proper way for removing those characters while retaining the accented characters?
There is an old, unmaintained but working code that extends django.template.defaultfilters.slugify() by adding support for all characters you can imagine. If you need to support all kinds of languages then this may be a good solution. It's called slughifi.
>>> from django.template.defaultfilters import slugify
>>> slugify("Pão com Ovo")
u'pao-com-ovo'
>>> slugify(""""Bird's Milk" Cake""")
u'birds-milk-cake'
Related
import re
string="b'#DerkGently #seanferg85 #Umbertobaggio #EL4JC and he already had Popular support.. most people know this already. A\xe2\x80\xa6 '"
print(re.findall(r"\x[0-9a-z]{2}",string))
The the list returned by the findall() function is empty :(
The problem here is that your string is the Python representation of a Python bytes object, which is pretty much useless.
Most likely, you had a bytes object, like this:
b = b'#DerkGently #seanferg85 #Umbertobaggio #EL4JC and he already had Popular support.. most people know this already. A\xe2\x80\xa6 '
… and you converted it to a string, like this:
s = str(b)
Don't do that. Instead, decode it:
s = b.decode('utf-8')
That will get you the actual characters, which you can then match easily, instead of trying to match the characters in the string representation of the bytes representation and then reconstructing the actual characters laboriously from the results.
However, it's worth noting that \xe2\x80\xa6 is not an emoji, it's a horizontal ellipsis character, …. If that isn't what you wanted, you already corrupted the data before this point.
Not a regexp per se, but might help you out any way.
def emojis(s):
return [c for c in s if ord(c) in range(0x1F600, 0x1F64F)]
print(emojis("hello world 😊")) # sample usage
You need to re.compile(ur'A\xe2\x80\xa6',re.UNICODE)
Compile a Unicode regex and use that pattern matching for your find,find all’s,subs,etc.
Try this. I joined the string in your question with that in your title to make the final search string
import re
k = r"#DerkGently #seanferg85 #Umbertobaggio #EL4JC and he already had Popular support.. most people know this already. A\xe2\x80\xa6 for a string like \x60\xe2\x4b(indicating a emoticon) using regular expression in python"
print(k)
print()
p = re.findall(r"((\\x[a-z0-9]{1,}){1,})", k)
for each in p:
print(each[0])
Output
#DerkGently #seanferg85 #Umbertobaggio #EL4JC and he already had Popular support.. most people know this already. A\xe2\x80\xa6 for a string like \x60\xe2\x4b(indicating a emoticon) using regular expression in python
\xe2\x80\xa6
\x60\xe2\x4b
I'm currently writting a chat bot in python and I would like to be able to type special characters like emoji, etc. my first attempt was just to place the literal character in the code.
add_reaction('🇦')
Unfortunately not many editors support these characters, so they appear mostly as random gibberish. For readability this isn't very good either.
To solve the gibberish issue I used chr(charcode:{int}) which also made them more copy paste save.
Then I put all of them to a separate file special_chars.py so i could give the characters a name
thumbs_up = chr(...)
smiley_face = chr(...)
regional_a_z = [chr(127462+i) for i in range(0,25)]
...
However this file started to grow really long really quickly.
So is there a better way to do this?
Something to keep in mind:
if a long file isn't avoidable could the character codes be moved to a non-python file
potential list for consecutive characters or character groups ex: thumb-up and down, list of regional indicators
The unicodedata module of the standard library already contains names for the special characters:
>>> unicodedata.lookup('THUMBS UP SIGN')
'\U0001f44d'
>>> unicodedata.lookup("REGIONAL INDICATOR SYMBOL LETTER A")
'\U0001f1e6'
You can get the official name of a character by its code:
>>> unicodedata.name('\U0001F600')
'GRINNING FACE'
I have a Django app that takes tweet data from Twitter's API and saves it in a MySQL database. As far as I know (I'm still getting my head around the finer points of character encoding) I'm using UTF-8 everywhere, including MySQL encoding and collation, which works fine except when a tweet contains Emoji characters, which I understand use a four-byte encoding. Trying to save them produces the following warnings from Django:
/home/biggleszx/.virtualenvs/myvirtualenv/lib/python2.6/site-packages/django/db/backends/mysql/base.py:86: Warning: Incorrect string value: '\xF0\x9F\x98\xAD I...' for column 'text' at row 1
return self.cursor.execute(query, args)
I'm using MySQL 5.1, so using utf8mb4 isn't an option unless I upgrade to 5.5, which I'd rather not just yet (also from what I've read, Django's support for this isn't quite production-ready, though this might no longer be accurate). I've also seen folks advising the use of BLOB instead of TEXT on affected columns, which I'd also rather not do as I figure it would harm performance.
My question is, then, assuming I'm not too bothered about 100% preservation of the tweet contents, is there a way I can filter out all Emoji characters and replace them with a non-multibyte character, such as the venerable WHITE MEDIUM SMALL SQUARE (U+25FD)? I figure this is the easiest way to save that data given my current setup, though if I'm missing another obvious solution, I'd love to hear it!
FYI, I'm using the stock Python 2.6.5 on Ubuntu 10.04.4 LTS. sys.maxunicode is 1114111, so it's a UCS-4 build.
Thanks for reading.
So it turns out this has been answered a few times, I just hadn't quite got the right Google-fu to find the existing questions.
Python, convert 4-byte char to avoid MySQL error "Incorrect string value:"
Warning raised by inserting 4-byte unicode to mysql
Thanks to Martijn Pieters, the solution came from the world of regular expressions, specifically this code (based on his answer to the first link above):
import re
try:
# UCS-4
highpoints = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:
# UCS-2
highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
# mytext = u'<some string containing 4-byte chars>'
mytext = highpoints.sub(u'\u25FD', mytext)
The character I'm replacing with is the WHITE MEDIUM SMALL SQUARE (U+25FD), FYI, but could be anything.
For those unfamiliar with UCS, like me, this is a system for Unicode conversion and a given build of Python will include support for either the UCS-2 or UCS-4 variant, each of which has a different upper bound on character support.
With the addition of this code, the strings seem to persist in MySQL 5.1 just fine.
Hope this helps anyone else in the same situation!
I tryied the solution by BigglesZX and its wasn't woring for the emoji of the heart (❤) after reading the [emoji's wikipedia article][1] I've seen that the regular expression is not covering all the emojis while also covering other range of unicode that are not emojis.
The following code create the 5 regular expressions that cover the 5 emoji blocks in the standard:
emoji_symbols_pictograms = re.compile(u'[\U0001f300-\U0001f5fF]')
emoji_emoticons = re.compile(u'[\U0001f600-\U0001f64F]')
emoji_transport_maps = re.compile(u'[\U0001f680-\U0001f6FF]')
emoji_symbols = re.compile(u'[\U00002600-\U000026FF]')
emoji_dingbats = re.compile(u'[\U00002700-\U000027BF]')
Those blocks could be merged in three blocks (UCS-4):
emoji_block0 = re.compile(u'[\U00002600-\U000027BF]')
emoji_block1 = re.compile(u'[\U0001f300-\U0001f64F]')
emoji_block2 = re.compile(u'[\U0001f680-\U0001f6FF]')
Their equivalents in UCS-2 are:
emoji_block0 = re.compile(u'[\u2600-\u27BF]')
emoji_block1 = compile(u'[\uD83C][\uDF00-\uDFFF]')
emoji_block1b = compile(u'[\uD83D][\uDC00-\uDE4F]')
emoji_block2 = re.compile(u'[\uD83D][\uDE80-\uDEFF]')
So finally we can define a single regular expression with all the cases together:
import re
try:
# UCS-4
highpoints = re.compile(u'([\U00002600-\U000027BF])|([\U0001f300-\U0001f64F])|([\U0001f680-\U0001f6FF])')
except re.error:
# UCS-2
highpoints = re.compile(u'([\u2600-\u27BF])|([\uD83C][\uDF00-\uDFFF])|([\uD83D][\uDC00-\uDE4F])|([\uD83D][\uDE80-\uDEFF])')
# mytext = u'<some string containing 4-byte chars>'
mytext = highpoints.sub(u'\u25FD', mytext)
I found out there another regular expresion that is able to identify the emojis.
This the regex is provided by the team at instagram-enginnering blog
u"(?<!&)#(\w|(?:[\xA9\xAE\u203C\u2049\u2122\u2139\u2194-\u2199\u21A9\u21AA\u231A\u231B\u2328\u2388\u23CF\u23E9-\u23F3\u23F8-\u23FA\u24C2\u25AA\u25AB\u25B6\u25C0\u25FB-\u25FE\u2600-\u2604\u260E\u2611\u2614\u2615\u2618\u261D\u2620\u2622\u2623\u2626\u262A\u262E\u262F\u2638-\u263A\u2648-\u2653\u2660\u2663\u2665\u2666\u2668\u267B\u267F\u2692-\u2694\u2696\u2697\u2699\u269B\u269C\u26A0\u26A1\u26AA\u26AB\u26B0\u26B1\u26BD\u26BE\u26C4\u26C5\u26C8\u26CE\u26CF\u26D1\u26D3\u26D4\u26E9\u26EA\u26F0-\u26F5\u26F7-\u26FA\u26FD\u2702\u2705\u2708-\u270D\u270F\u2712\u2714\u2716\u271D\u2721\u2728\u2733\u2734\u2744\u2747\u274C\u274E\u2753-\u2755\u2757\u2763\u2764\u2795-\u2797\u27A1\u27B0\u27BF\u2934\u2935\u2B05-\u2B07\u2B1B\u2B1C\u2B50\u2B55\u3030\u303D\u3297\u3299]|\uD83C[\uDC04\uDCCF\uDD70\uDD71\uDD7E\uDD7F\uDD8E\uDD91-\uDD9A\uDE01\uDE02\uDE1A\uDE2F\uDE32-\uDE3A\uDE50\uDE51\uDF00-\uDF21\uDF24-\uDF93\uDF96\uDF97\uDF99-\uDF9B\uDF9E-\uDFF0\uDFF3-\uDFF5\uDFF7-\uDFFF]|\uD83D[\uDC00-\uDCFD\uDCFF-\uDD3D\uDD49-\uDD4E\uDD50-\uDD67\uDD6F\uDD70\uDD73-\uDD79\uDD87\uDD8A-\uDD8D\uDD90\uDD95\uDD96\uDDA5\uDDA8\uDDB1\uDDB2\uDDBC\uDDC2-\uDDC4\uDDD1-\uDDD3\uDDDC-\uDDDE\uDDE1\uDDE3\uDDEF\uDDF3\uDDFA-\uDE4F\uDE80-\uDEC5\uDECB-\uDED0\uDEE0-\uDEE5\uDEE9\uDEEB\uDEEC\uDEF0\uDEF3]|\uD83E[\uDD10-\uDD18\uDD80-\uDD84\uDDC0]|(?:0\u20E3|1\u20E3|2\u20E3|3\u20E3|4\u20E3|5\u20E3|6\u20E3|7\u20E3|8\u20E3|9\u20E3|#\u20E3|\\*\u20E3|\uD83C(?:\uDDE6\uD83C(?:\uDDEB|\uDDFD|\uDDF1|\uDDF8|\uDDE9|\uDDF4|\uDDEE|\uDDF6|\uDDEC|\uDDF7|\uDDF2|\uDDFC|\uDDE8|\uDDFA|\uDDF9|\uDDFF|\uDDEA)|\uDDE7\uD83C(?:\uDDF8|\uDDED|\uDDE9|\uDDE7|\uDDFE|\uDDEA|\uDDFF|\uDDEF|\uDDF2|\uDDF9|\uDDF4|\uDDE6|\uDDFC|\uDDFB|\uDDF7|\uDDF3|\uDDEC|\uDDEB|\uDDEE|\uDDF6|\uDDF1)|\uDDE8\uD83C(?:\uDDF2|\uDDE6|\uDDFB|\uDDEB|\uDDF1|\uDDF3|\uDDFD|\uDDF5|\uDDE8|\uDDF4|\uDDEC|\uDDE9|\uDDF0|\uDDF7|\uDDEE|\uDDFA|\uDDFC|\uDDFE|\uDDFF|\uDDED)|\uDDE9\uD83C(?:\uDDFF|\uDDF0|\uDDEC|\uDDEF|\uDDF2|\uDDF4|\uDDEA)|\uDDEA\uD83C(?:\uDDE6|\uDDE8|\uDDEC|\uDDF7|\uDDEA|\uDDF9|\uDDFA|\uDDF8|\uDDED)|\uDDEB\uD83C(?:\uDDF0|\uDDF4|\uDDEF|\uDDEE|\uDDF7|\uDDF2)|\uDDEC\uD83C(?:\uDDF6|\uDDEB|\uDDE6|\uDDF2|\uDDEA|\uDDED|\uDDEE|\uDDF7|\uDDF1|\uDDE9|\uDDF5|\uDDFA|\uDDF9|\uDDEC|\uDDF3|\uDDFC|\uDDFE|\uDDF8|\uDDE7)|\uDDED\uD83C(?:\uDDF7|\uDDF9|\uDDF2|\uDDF3|\uDDF0|\uDDFA)|\uDDEE\uD83C(?:\uDDF4|\uDDE8|\uDDF8|\uDDF3|\uDDE9|\uDDF7|\uDDF6|\uDDEA|\uDDF2|\uDDF1|\uDDF9)|\uDDEF\uD83C(?:\uDDF2|\uDDF5|\uDDEA|\uDDF4)|\uDDF0\uD83C(?:\uDDED|\uDDFE|\uDDF2|\uDDFF|\uDDEA|\uDDEE|\uDDFC|\uDDEC|\uDDF5|\uDDF7|\uDDF3)|\uDDF1\uD83C(?:\uDDE6|\uDDFB|\uDDE7|\uDDF8|\uDDF7|\uDDFE|\uDDEE|\uDDF9|\uDDFA|\uDDF0|\uDDE8)|\uDDF2\uD83C(?:\uDDF4|\uDDF0|\uDDEC|\uDDFC|\uDDFE|\uDDFB|\uDDF1|\uDDF9|\uDDED|\uDDF6|\uDDF7|\uDDFA|\uDDFD|\uDDE9|\uDDE8|\uDDF3|\uDDEA|\uDDF8|\uDDE6|\uDDFF|\uDDF2|\uDDF5|\uDDEB)|\uDDF3\uD83C(?:\uDDE6|\uDDF7|\uDDF5|\uDDF1|\uDDE8|\uDDFF|\uDDEE|\uDDEA|\uDDEC|\uDDFA|\uDDEB|\uDDF4)|\uDDF4\uD83C\uDDF2|\uDDF5\uD83C(?:\uDDEB|\uDDF0|\uDDFC|\uDDF8|\uDDE6|\uDDEC|\uDDFE|\uDDEA|\uDDED|\uDDF3|\uDDF1|\uDDF9|\uDDF7|\uDDF2)|\uDDF6\uD83C\uDDE6|\uDDF7\uD83C(?:\uDDEA|\uDDF4|\uDDFA|\uDDFC|\uDDF8)|\uDDF8\uD83C(?:\uDDFB|\uDDF2|\uDDF9|\uDDE6|\uDDF3|\uDDE8|\uDDF1|\uDDEC|\uDDFD|\uDDF0|\uDDEE|\uDDE7|\uDDF4|\uDDF8|\uDDED|\uDDE9|\uDDF7|\uDDEF|\uDDFF|\uDDEA|\uDDFE)|\uDDF9\uD83C(?:\uDDE9|\uDDEB|\uDDFC|\uDDEF|\uDDFF|\uDDED|\uDDF1|\uDDEC|\uDDF0|\uDDF4|\uDDF9|\uDDE6|\uDDF3|\uDDF7|\uDDF2|\uDDE8|\uDDFB)|\uDDFA\uD83C(?:\uDDEC|\uDDE6|\uDDF8|\uDDFE|\uDDF2|\uDDFF)|\uDDFB\uD83C(?:\uDDEC|\uDDE8|\uDDEE|\uDDFA|\uDDE6|\uDDEA|\uDDF3)|\uDDFC\uD83C(?:\uDDF8|\uDDEB)|\uDDFD\uD83C\uDDF0|\uDDFE\uD83C(?:\uDDF9|\uDDEA)|\uDDFF\uD83C(?:\uDDE6|\uDDF2|\uDDFC))))[\ufe00-\ufe0f\u200d]?)+
Source:
http://instagram-engineering.tumblr.com/post/118304328152/emojineering-part-2-implementing-hashtag-emoji
note: I add another answer as this one is not complemetary to my previous answer here.
i am using json encoder function that encode the input.
this function is used for dict encoding (to convert it to string) on json.dumps. (so we need to do some edit to the response - remove the ' " ')
this enabled me to save emoji to mysql, and present it (at web):
# encode input
from json.encoder import py_encode_basestring_ascii
name = py_encode_basestring_ascii(name)[1:-1]
# save
YourModel.name = name
name.save()
I try to localize OSQA (django + python) for russian language. A lot of string I can translate with locale-folder. But in OSQA some string was hard-coded (put in code in simple text).
I try simply replace english text to russian, but get an error.
For example:
class WordpressAuthContext(ConsumerTemplateContext):
mode = 'SMALLICON'
type = 'SIMPLE_FORM'
simple_form_context = {
'your_what': 'Wordpress blog name'
}
weight = 270
human_name = 'Wordpress'
icon = '/media/images/openid/wordpress.png'
In this code I need to replace 'Wordpress blog name' on russian text.
I try replace english characters with unicode \uXXXX characters, but on web-page I will see these characters in original view \uXXXX.
Then I try this code:
'your_what': 'Wordpress blog name'.encode('utf-8')
And it's not work too.
What can I try?
Try:
simple_form_context = {
'your_what': u'whätévèr wéird chars you w#ñt to ûse'
}
I really don't know if django is prepared to handle Unicode strings through all framework, but's worth trying.
I don't think it would be prepared to handle byte strings (like the one generated with .encode()), so forget about that.
This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Single quotes vs. double quotes in Python
I have seen that when i have to work with string in Python both of the following sintax are accepted:
mystring1 = "here is my string 1"
mystring2 = 'here is my string 2'
Is anyway there any difference?
Is it by any reason better use one solution rather than the other?
Cheers,
No, there isn't. When the string contains a single quote, it's easier to enclose it in double quotes, and vice versa. Other than this, my advice would be to pick a style and stick to it.
Another useful type of string literals are triple-quoted strings that can span multiple lines:
s = """string literal...
...continues on second line...
...and ends here"""
Again, it's up to you whether to use single or double quotes for this.
Lastly, I'd like to mention "raw string literals". These are enclosed in r"..." or r'...' and prevent escape sequences (such as \n) from being parsed as such. Among other things, raw string literals are very handy for specifying regular expressions.
Read more about Python string literals here.
While it's true that there is no difference between one and the other, I encountered a lot of the following behavior in the opensource community:
" for text that is supposed to be read (email, feeback, execption, etc)
' for data text (key dict, function arguments, etc)
triple " for any docstring or text that includes " and '
No. A matter of style only. Just be consistent.
I tend to using " simply because that's what most other programming languages use.
So, habit, really.
There's no difference.
What's better is arguable. I use "..." for text strings and '...' for characters, because that's consistent with other languages and may save you some keypresses when porting to/from different language. For regexps and SQL queries, I always use r'''...''', because they frequently end up containing backslashes and both types of quotes.
Python is all about the least amount of code to get the most effect. The shorter the better. And ' is, in a way, one dot shorter than " which is why I prefer it. :)
As everyone's pointed out, they're functionally identical. However, PEP 257 (Docstring Conventions) suggests always using """ around docstrings just for the purposes of consistency. No one's likely to yell at you or think poorly of you if you don't, but there it is.