Inserting invisible Unicode characters into MySQL with Python 3 raises a duplicate-entry error - python

When I insert device data into MySQL (v5.5.6) using Python (v3.2), I run into a problem.
This is device A (it contains three Unicode characters and a blank space):
'\u202d\u202d \u202d'
And this is device B (only a blank space):
' '
When I insert all the device data into MySQL, I get this error:
Duplicate entry 'activate_device-20151201-1-5740-01000P---‭‭ ‭--' for key 'PRIMARY'
I guess MySQL handles '\u202d' specially (a Unicode character that changes the text direction, maybe?).
How can I reproduce in Python 3 what MySQL does?
How can I avoid the duplicate?
The expected result is to translate '\u202d\u202d \u202d' to ' ' in Python 3.
Please help me.

There is some ambiguity here. Do you want to keep only the visible ASCII characters, or visible Unicode characters as well?
If you want to keep only visible ASCII characters, the simplest way is to use Python's built-in string module.
import string
new_string = "".join(filter(lambda x: x in string.printable, original_string))
For your specific use case, a space is part of string.printable, so the above converts both '\u202d\u202d \u202d' and ' ' to ' '.
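If visible non-ASCII characters have to survive, one option is to drop only the invisible formatting characters instead of everything outside ASCII. The sketch below assumes that "invisible" means Unicode general category Cf (format), which is the category of U+202D; the function name is made up for illustration.

```python
import unicodedata

def strip_invisible(text):
    """Drop characters in Unicode category Cf (format), such as U+202D."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

device_a = "\u202d\u202d \u202d"
device_b = " "

# Both devices normalize to the same key, so the collision can be
# detected in Python before the INSERT is attempted.
print(strip_invisible(device_a) == strip_invisible(device_b))
```

Normalizing the key this way makes the duplicate visible on the Python side; whether the two devices should then be merged or rejected is a separate decision.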

Related

Force python to use ' instead of " for strings

I have a script, written in Python and SQL using the psycopg2 library, that migrates data from one database to another.
I retrieve a string from the first database and store it in a list so that I can put it into the second database once I have gathered all the data I need.
If the string has an apostrophe in it, Python represents the string using " ". The problem is that SQL interprets " " as quoting a column name and ' ' as quoting a string, whereas Python treats both as string delimiters. I would like to force Python to use apostrophes to represent the string (or find another suitable workaround).
Google has not turned up anything. I can't even find a mention of the fact that Python uses " " when a string contains an apostrophe. I have considered replacing the apostrophes in my string with a different character and converting them back later, but that seems like a clumsy solution.
For example:
MyString = 'it\'s'
MyList = [MyString]
print(MyList) # prints ["it's"]
print(MyList[0]) # prints it's
When I insert the new values into the database, I insert the whole list as the values.
INSERT INTO table VALUES MyList
This is where the error crops up, because the string is using " " instead of ' '.
A solution on either the Python or the SQL side would work.
Found a fix. It's a bit janky, but it works. Convert the list into a string, then use the replace function like so:
MyString = MyString.replace('"',"'")
And then use that string instead.
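A less fragile alternative to rewriting quotes is to let the database driver do the quoting through a parameterized query. The sketch below uses the standard-library sqlite3 module as a stand-in so that it runs without a server; the table and column names are made up for illustration. With psycopg2 the placeholder is %s rather than ?, but the idea is the same.

```python
import sqlite3

# In-memory database as a stand-in for the real target database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (label TEXT)")  # hypothetical table

my_list = ["it's", 'say "hi"']

# The driver handles quoting; the values are never spliced into the SQL text.
conn.executemany("INSERT INTO items (label) VALUES (?)", [(s,) for s in my_list])

rows = [row[0] for row in conn.execute("SELECT label FROM items ORDER BY rowid")]
print(rows)
```

Because the values travel separately from the SQL statement, it no longer matters how Python chooses to display the string in a list.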

How to get rid of the nested double quote in 'name' subfield?

I am trying to read the following string into a dictionary using Python's json package.
However, the 'name' subfield contains a description with nested double quotes, and json cannot parse the string that way.
import json
string1 = '{"id":17033,"project_id":17033,"state":"active","state_changed_at":1488054590,"name":"a.k.a.:\xa0"The Sunshine Makers""'
json.loads(string1)
An error was raised:
JSONDecodeError: Expecting ',' delimiter: line 1 column 96 (char 95)
I know that the error is caused by the nested double quotes around "The Sunshine Makers".
How do I get rid of these double quotes?
More examples of strings that cause the error:
string2 = '{"id":960066,"project_id":960066,"state":"active","state_changed_at":1502049940,"name":"New J. Lye Album - Behind The Lyes","blurb":"I am working on my new project titled "Behind The Lyes" which is coming out fall of 2017."'
# The problem with this string comes from the nested double quotes around the phrase "Behind The Lyes" inside the 'blurb' subfield
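Note that your string has more than one issue:
It contains \xa0 (a non-breaking space) right before the nested quote. That character is legal inside a JSON string, so the reported position (char 95) actually points at the text following the nested quote, but you will probably want to normalize it to a regular space anyway.
Your string is missing a closing }.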
That said, for the string you've cited first, one approach to fixing your issues would be to use .replace():
string1 = '{"id":17033,"project_id":17033,"state":"active","state_changed_at":1488054590,"name":"a.k.a.:\xa0"The Sunshine Makers""'.replace('\xa0"', "'").replace('""', "'\"") + '}'
For example, the following handles the double quoting and other issues in your two sample strings:
import json
fixes = [('\xa0', ' '),('"',"'"),("{'",'{"'),("','", '","'),(",'", ',"'),("':'", '":"'),("':", '":'),("''", '\'\"'), ("'}",'"}')]
print(fixes)
string1 = '{"id":17033,"project_id":17033,"state":"active","state_changed_at":1488054590,"name":"a.k.a.:\xa0"The Sunshine Makers""'
string2 = '{"id":960066,"project_id":960066,"state":"active","state_changed_at":1502049940,"name":"New J. Lye Album - Behind The Lyes","blurb":"I am working on my new project titled "Behind The Lyes" which is coming out fall of 2017."'
strings = [string1, string2]
for string in strings:
    print(string)
    string = string + '}'
    for fix in fixes:
        string = string.replace(*fix)
    print(string)
    print(json.loads(string)['name'])
It would be helpful if you could fill out your question with the code or file from which you are retrieving these strings. That would make it possible to give a more comprehensive answer.
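If the inner quotes should be preserved rather than turned into apostrophes, a regex heuristic can escape them instead. This is only a sketch under a strong assumption: a double quote counts as structural when it directly touches JSON punctuation ({ [ : , } ]), and every other double quote is treated as part of the text. Text that itself contains sequences such as ", or :" would be misclassified. The names INNER_QUOTE and escape_inner_quotes are made up for illustration.

```python
import json
import re

# A quote is treated as structural only if it touches JSON punctuation;
# every other quote is assumed to belong to the text and gets escaped.
INNER_QUOTE = re.compile(r'(?<![{\[:,])"(?![:,}\]])')

def escape_inner_quotes(broken):
    """Backslash-escape double quotes that look like part of a value."""
    return INNER_QUOTE.sub(lambda match: '\\"', broken)

string1 = '{"id":17033,"project_id":17033,"state":"active","state_changed_at":1488054590,"name":"a.k.a.:\xa0"The Sunshine Makers""'

# The closing brace is missing in the sample, so it is appended first.
fixed = escape_inner_quotes(string1 + '}')
record = json.loads(fixed)
print(record['name'])
```

The more reliable fix is still to repair whatever produces these strings, since no pattern can tell a structural quote from a textual one in every case.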

Python project strings migrating from ascii to unicode

I've got a tedious problem with Python's strings.
I have a Python 2.x project in which all strings are written as 'blabla'.
Now we want to move these strings to unicode without relying on extras like __future__, moving to Python 3, or using sys.setdefaultencoding.
So I have to click through the whole project to change '' to u''. But I don't need to change every string; for example, I do not want to change the field names of objects:
obj = {'field': field}
My question: is there a way to make it automatic? I am also stuck on another problem: my regex [^u]([\'][^\'\"]*[\']) catches the middle section of ' ' ' ', which is not a string.
For now I have the following replacement: (\'.*\') --> u$1
is there a way to make it automatic?
If you mean: is there a program that can decide which type of string (Unicode (u''), bytestring (b''), or native ('')) should be used at a specific place in an arbitrary program, then no, there is no such program. You should inspect each and every case very carefully. See Text versus binary data.
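What can be automated is finding the candidates, and the tokenizer does that more reliably than a regex because it knows where string literals start and end. The sketch below is an aid for manual review only, not a converter: it assumes the source is read as text and tokenized under Python 3 without being executed, and the function name is made up for illustration.

```python
import io
import tokenize

def unprefixed_string_literals(source):
    """Yield (line, column, literal) for string literals that have no prefix."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    for tok in tokens:
        # A literal that starts directly with a quote has no u/b/r prefix.
        if tok.type == tokenize.STRING and tok.string[0] in "'\"":
            yield tok.start[0], tok.start[1], tok.string

sample = "obj = {'field': field}\nname = u'caf\\xe9'\nlabel = 'blabla'\n"
for line, col, literal in unprefixed_string_literals(sample):
    print(line, col, literal)
```

Each reported location still needs a human decision about whether it should become u'', b'' or stay as it is.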

Character encoding in python to replace 'u2019' with '

I have tried numerous ways to encode this to get the end result "BACK RUSHIN'", the most important character being the right apostrophe '.
I would like a way of getting to this end result using Python's built-in functions, without having to distinguish between a normal string and a unicode string.
This was the code I was using to retrieve the string: str(unicode(etree.tostring(root.xpath('path')[0],method='text', encoding='utf-8'),errors='ignore')).strip()
The result was 'BACK RUSHIN', with the apostrophe ' missing.
Another way was: root.xpath('path/text()')
And that result was: u'BACK RUSHIN\u2019' in python.
Lastly if I try: u'BACK RUSHIN\u2019'.encode('ascii', 'replace')
The result is: 'BACK RUSHIN?'
Please no replace functions; I would like to make use of Python's codec libraries.
Also, no printing the string, because it is being held in a variable.
Thanks
>>> import unidecode
>>> unidecode.unidecode(u'BACK RUSHIN\u2019')
"BACK RUSHIN'"
This uses the third-party unidecode package.
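unidecode is a third-party package. If installing it is not an option, the standard codecs module allows a custom error handler to be registered, which keeps the conversion inside the encode call as the question asks. The sketch below is written for Python 3; the handler name "ascii_punctuation" and the mapping table are assumptions made for illustration and cover only a few quotation marks.

```python
import codecs

# Hypothetical mapping; extend it with whatever punctuation you need.
ASCII_FALLBACKS = {"\u2019": "'", "\u2018": "'", "\u201c": '"', "\u201d": '"'}

def ascii_punctuation(error):
    """Error handler: substitute known punctuation, '?' for anything else."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    bad = error.object[error.start:error.end]
    replacement = "".join(ASCII_FALLBACKS.get(ch, "?") for ch in bad)
    # Return the replacement text and the position where encoding resumes.
    return replacement, error.end

codecs.register_error("ascii_punctuation", ascii_punctuation)

encoded = "BACK RUSHIN\u2019".encode("ascii", "ascii_punctuation")
print(encoded)
```

Unlike unidecode, this only handles the characters listed in the table, so it suits cases where the set of problem characters is small and known.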

How can I filter Emoji characters from my input so I can save in MySQL <5.5?

I have a Django app that takes tweet data from Twitter's API and saves it in a MySQL database. As far as I know (I'm still getting my head around the finer points of character encoding) I'm using UTF-8 everywhere, including MySQL encoding and collation, which works fine except when a tweet contains Emoji characters, which I understand use a four-byte encoding. Trying to save them produces the following warnings from Django:
/home/biggleszx/.virtualenvs/myvirtualenv/lib/python2.6/site-packages/django/db/backends/mysql/base.py:86: Warning: Incorrect string value: '\xF0\x9F\x98\xAD I...' for column 'text' at row 1
return self.cursor.execute(query, args)
I'm using MySQL 5.1, so using utf8mb4 isn't an option unless I upgrade to 5.5, which I'd rather not just yet (also from what I've read, Django's support for this isn't quite production-ready, though this might no longer be accurate). I've also seen folks advising the use of BLOB instead of TEXT on affected columns, which I'd also rather not do as I figure it would harm performance.
My question is, then, assuming I'm not too bothered about 100% preservation of the tweet contents, is there a way I can filter out all Emoji characters and replace them with a non-multibyte character, such as the venerable WHITE MEDIUM SMALL SQUARE (U+25FD)? I figure this is the easiest way to save that data given my current setup, though if I'm missing another obvious solution, I'd love to hear it!
FYI, I'm using the stock Python 2.6.5 on Ubuntu 10.04.4 LTS. sys.maxunicode is 1114111, so it's a UCS-4 build.
Thanks for reading.
So it turns out this has been answered a few times, I just hadn't quite got the right Google-fu to find the existing questions.
Python, convert 4-byte char to avoid MySQL error "Incorrect string value:"
Warning raised by inserting 4-byte unicode to mysql
Thanks to Martijn Pieters, the solution came from the world of regular expressions, specifically this code (based on his answer to the first link above):
import re
try:
    # UCS-4
    highpoints = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:
    # UCS-2
    highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
# mytext = u'<some string containing 4-byte chars>'
mytext = highpoints.sub(u'\u25FD', mytext)
The character I'm replacing with is the WHITE MEDIUM SMALL SQUARE (U+25FD), FYI, but could be anything.
For those unfamiliar with UCS, like me: a given build of Python 2 stores unicode strings internally as either UCS-2 (a "narrow" build) or UCS-4 (a "wide" build), and each has a different upper bound on the code points that a single character can hold.
With the addition of this code, the strings seem to persist in MySQL 5.1 just fine.
Hope this helps anyone else in the same situation!
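For readers on Python 3.3 or later, the narrow/wide distinction no longer applies, so the try/except is not needed. A minimal sketch of the same idea under that assumption (the function name is made up for illustration):

```python
import re

# Python 3.3+ strings always hold full code points, so one range suffices.
ASTRAL = re.compile("[\U00010000-\U0010ffff]")

def replace_four_byte_chars(text, replacement="\u25fd"):
    """Replace every character outside the Basic Multilingual Plane."""
    return ASTRAL.sub(replacement, text)

cleaned = replace_four_byte_chars("so sad \U0001f62d I...")
print(cleaned.encode("utf-8"))
```

Every character left in the result encodes to at most three bytes in UTF-8, which is what MySQL's 3-byte utf8 character set can store.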
I tried the solution by BigglesZX and it wasn't working for the heart emoji (❤). After reading the Wikipedia article on emoji, I saw that the regular expression does not cover all emoji, while also covering other Unicode ranges that are not emoji.
The following code creates the 5 regular expressions that cover the 5 emoji blocks in the standard:
emoji_symbols_pictograms = re.compile(u'[\U0001f300-\U0001f5fF]')
emoji_emoticons = re.compile(u'[\U0001f600-\U0001f64F]')
emoji_transport_maps = re.compile(u'[\U0001f680-\U0001f6FF]')
emoji_symbols = re.compile(u'[\U00002600-\U000026FF]')
emoji_dingbats = re.compile(u'[\U00002700-\U000027BF]')
Those blocks can be merged into three blocks (UCS-4):
emoji_block0 = re.compile(u'[\U00002600-\U000027BF]')
emoji_block1 = re.compile(u'[\U0001f300-\U0001f64F]')
emoji_block2 = re.compile(u'[\U0001f680-\U0001f6FF]')
Their equivalents in UCS-2 are:
emoji_block0 = re.compile(u'[\u2600-\u27BF]')
emoji_block1 = re.compile(u'[\uD83C][\uDF00-\uDFFF]')
emoji_block1b = re.compile(u'[\uD83D][\uDC00-\uDE4F]')
emoji_block2 = re.compile(u'[\uD83D][\uDE80-\uDEFF]')
So finally we can define a single regular expression with all the cases together:
import re
try:
    # UCS-4
    highpoints = re.compile(u'([\U00002600-\U000027BF])|([\U0001f300-\U0001f64F])|([\U0001f680-\U0001f6FF])')
except re.error:
    # UCS-2
    highpoints = re.compile(u'([\u2600-\u27BF])|([\uD83C][\uDF00-\uDFFF])|([\uD83D][\uDC00-\uDE4F])|([\uD83D][\uDE80-\uDEFF])')
# mytext = u'<some string containing 4-byte chars>'
mytext = highpoints.sub(u'\u25FD', mytext)
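The two approaches answer different questions, which a quick check makes visible. The sketch below (Python 3, with made-up names) compares the "anything outside the BMP" pattern with the emoji-block pattern on three sample characters: the heart U+2764, the crying face U+1F62D, and U+10348, a Gothic letter that is not an emoji.

```python
import re

astral = re.compile("[\U00010000-\U0010ffff]")
emoji_blocks = re.compile("[\u2600-\u27bf\U0001f300-\U0001f64f\U0001f680-\U0001f6ff]")

def classify(ch):
    """Return (UTF-8 byte length, matched by astral?, matched by emoji_blocks?)."""
    return (len(ch.encode("utf-8")),
            bool(astral.match(ch)),
            bool(emoji_blocks.match(ch)))

for ch in ("\u2764", "\U0001f62d", "\U00010348"):
    print("U+%04X" % ord(ch), classify(ch))
```

The heart needs only three bytes in UTF-8, so it does not trigger the MySQL warning in the first place, while the Gothic letter needs four bytes but is missed by the emoji-block pattern. If the goal is only to avoid the "Incorrect string value" warning, the broader pattern is the one that matches the storage limit.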
I found another regular expression that is able to identify emoji.
This regex is provided by the team behind the Instagram Engineering blog:
u"(?<!&)#(\w|(?:[\xA9\xAE\u203C\u2049\u2122\u2139\u2194-\u2199\u21A9\u21AA\u231A\u231B\u2328\u2388\u23CF\u23E9-\u23F3\u23F8-\u23FA\u24C2\u25AA\u25AB\u25B6\u25C0\u25FB-\u25FE\u2600-\u2604\u260E\u2611\u2614\u2615\u2618\u261D\u2620\u2622\u2623\u2626\u262A\u262E\u262F\u2638-\u263A\u2648-\u2653\u2660\u2663\u2665\u2666\u2668\u267B\u267F\u2692-\u2694\u2696\u2697\u2699\u269B\u269C\u26A0\u26A1\u26AA\u26AB\u26B0\u26B1\u26BD\u26BE\u26C4\u26C5\u26C8\u26CE\u26CF\u26D1\u26D3\u26D4\u26E9\u26EA\u26F0-\u26F5\u26F7-\u26FA\u26FD\u2702\u2705\u2708-\u270D\u270F\u2712\u2714\u2716\u271D\u2721\u2728\u2733\u2734\u2744\u2747\u274C\u274E\u2753-\u2755\u2757\u2763\u2764\u2795-\u2797\u27A1\u27B0\u27BF\u2934\u2935\u2B05-\u2B07\u2B1B\u2B1C\u2B50\u2B55\u3030\u303D\u3297\u3299]|\uD83C[\uDC04\uDCCF\uDD70\uDD71\uDD7E\uDD7F\uDD8E\uDD91-\uDD9A\uDE01\uDE02\uDE1A\uDE2F\uDE32-\uDE3A\uDE50\uDE51\uDF00-\uDF21\uDF24-\uDF93\uDF96\uDF97\uDF99-\uDF9B\uDF9E-\uDFF0\uDFF3-\uDFF5\uDFF7-\uDFFF]|\uD83D[\uDC00-\uDCFD\uDCFF-\uDD3D\uDD49-\uDD4E\uDD50-\uDD67\uDD6F\uDD70\uDD73-\uDD79\uDD87\uDD8A-\uDD8D\uDD90\uDD95\uDD96\uDDA5\uDDA8\uDDB1\uDDB2\uDDBC\uDDC2-\uDDC4\uDDD1-\uDDD3\uDDDC-\uDDDE\uDDE1\uDDE3\uDDEF\uDDF3\uDDFA-\uDE4F\uDE80-\uDEC5\uDECB-\uDED0\uDEE0-\uDEE5\uDEE9\uDEEB\uDEEC\uDEF0\uDEF3]|\uD83E[\uDD10-\uDD18\uDD80-\uDD84\uDDC0]|(?:0\u20E3|1\u20E3|2\u20E3|3\u20E3|4\u20E3|5\u20E3|6\u20E3|7\u20E3|8\u20E3|9\u20E3|#\u20E3|\\*\u20E3|\uD83C(?:\uDDE6\uD83C(?:\uDDEB|\uDDFD|\uDDF1|\uDDF8|\uDDE9|\uDDF4|\uDDEE|\uDDF6|\uDDEC|\uDDF7|\uDDF2|\uDDFC|\uDDE8|\uDDFA|\uDDF9|\uDDFF|\uDDEA)|\uDDE7\uD83C(?:\uDDF8|\uDDED|\uDDE9|\uDDE7|\uDDFE|\uDDEA|\uDDFF|\uDDEF|\uDDF2|\uDDF9|\uDDF4|\uDDE6|\uDDFC|\uDDFB|\uDDF7|\uDDF3|\uDDEC|\uDDEB|\uDDEE|\uDDF6|\uDDF1)|\uDDE8\uD83C(?:\uDDF2|\uDDE6|\uDDFB|\uDDEB|\uDDF1|\uDDF3|\uDDFD|\uDDF5|\uDDE8|\uDDF4|\uDDEC|\uDDE9|\uDDF0|\uDDF7|\uDDEE|\uDDFA|\uDDFC|\uDDFE|\uDDFF|\uDDED)|\uDDE9\uD83C(?:\uDDFF|\uDDF0|\uDDEC|\uDDEF|\uDDF2|\uDDF4|\uDDEA)|\uDDEA\uD83C(?:\uDDE6|\uDDE8|\uDDEC|\uDDF7|\uDDEA|\uDDF9|\uDDFA|\uDDF8|\uDDED)|\uDDEB\uD83C(?:\uDDF0|\uDDF4|\uDDEF|\uDDEE|\uDDF7|\uDDF2)|\uDDEC\uD83C(?:\uDDF6|\uDDEB|\uDDE6|\uDDF2|\uDDEA|\uDDED|\uDDEE|\uDDF7|\uDDF1|\uDDE9|\uDDF5|\uDDFA|\uDDF9|\uDDEC|\uDDF3|\uDDFC|\uDDFE|\uDDF8|\uDDE7)|\uDDED\uD83C(?:\uDDF7|\uDDF9|\uDDF2|\uDDF3|\uDDF0|\uDDFA)|\uDDEE\uD83C(?:\uDDF4|\uDDE8|\uDDF8|\uDDF3|\uDDE9|\uDDF7|\uDDF6|\uDDEA|\uDDF2|\uDDF1|\uDDF9)|\uDDEF\uD83C(?:\uDDF2|\uDDF5|\uDDEA|\uDDF4)|\uDDF0\uD83C(?:\uDDED|\uDDFE|\uDDF2|\uDDFF|\uDDEA|\uDDEE|\uDDFC|\uDDEC|\uDDF5|\uDDF7|\uDDF3)|\uDDF1\uD83C(?:\uDDE6|\uDDFB|\uDDE7|\uDDF8|\uDDF7|\uDDFE|\uDDEE|\uDDF9|\uDDFA|\uDDF0|\uDDE8)|\uDDF2\uD83C(?:\uDDF4|\uDDF0|\uDDEC|\uDDFC|\uDDFE|\uDDFB|\uDDF1|\uDDF9|\uDDED|\uDDF6|\uDDF7|\uDDFA|\uDDFD|\uDDE9|\uDDE8|\uDDF3|\uDDEA|\uDDF8|\uDDE6|\uDDFF|\uDDF2|\uDDF5|\uDDEB)|\uDDF3\uD83C(?:\uDDE6|\uDDF7|\uDDF5|\uDDF1|\uDDE8|\uDDFF|\uDDEE|\uDDEA|\uDDEC|\uDDFA|\uDDEB|\uDDF4)|\uDDF4\uD83C\uDDF2|\uDDF5\uD83C(?:\uDDEB|\uDDF0|\uDDFC|\uDDF8|\uDDE6|\uDDEC|\uDDFE|\uDDEA|\uDDED|\uDDF3|\uDDF1|\uDDF9|\uDDF7|\uDDF2)|\uDDF6\uD83C\uDDE6|\uDDF7\uD83C(?:\uDDEA|\uDDF4|\uDDFA|\uDDFC|\uDDF8)|\uDDF8\uD83C(?:\uDDFB|\uDDF2|\uDDF9|\uDDE6|\uDDF3|\uDDE8|\uDDF1|\uDDEC|\uDDFD|\uDDF0|\uDDEE|\uDDE7|\uDDF4|\uDDF8|\uDDED|\uDDE9|\uDDF7|\uDDEF|\uDDFF|\uDDEA|\uDDFE)|\uDDF9\uD83C(?:\uDDE9|\uDDEB|\uDDFC|\uDDEF|\uDDFF|\uDDED|\uDDF1|\uDDEC|\uDDF0|\uDDF4|\uDDF9|\uDDE6|\uDDF3|\uDDF7|\uDDF2|\uDDE8|\uDDFB)|\uDDFA\uD83C(?:\uDDEC|\uDDE6|\uDDF8|\uDDFE|\uDDF2|\uDDFF)|\uDDFB\uD83C(?:\uDDEC|\uDDE8|\uDDEE|\uDDFA|\uDDE6|\uDDEA|\uDDF3)|\uDDFC\uD83C(?:\uDDF8|\uDDEB)|\uDDFD\uD83C\uDDF0|\uDDFE\uD83C(?:\uDDF9|\uDDEA)|\uDDFF\uD83C(?:\uDDE6|\uDDF2|\uDDFC))))[\ufe00-\ufe0f\u200d]?)+"
Source:
http://instagram-engineering.tumblr.com/post/118304328152/emojineering-part-2-implementing-hashtag-emoji
Note: I am adding this as a separate answer because it is not complementary to my previous answer here.
I am using a JSON encoder function to encode the input.
json.dumps uses this function to encode a dict (to convert it to a string), so we need to edit the result a little: remove the surrounding " ".
This enabled me to save emoji to MySQL and to display them on the web:
# encode input
from json.encoder import py_encode_basestring_ascii
name = py_encode_basestring_ascii(name)[1:-1]
# save (YourModel stands for your own Django model)
obj = YourModel(name=name)
obj.save()
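Reading the value back requires the reverse step. The sketch below shows a round trip using only the public json.dumps and json.loads functions instead of the internal encoder helper; the function names are made up for illustration, and it assumes the stored column only ever contains values produced this way.

```python
import json

def to_ascii_escapes(text):
    """Encode text as JSON unicode escapes and drop the surrounding quotes."""
    return json.dumps(text, ensure_ascii=True)[1:-1]

def from_ascii_escapes(stored):
    """Reverse to_ascii_escapes by re-adding the quotes and parsing as JSON."""
    return json.loads('"' + stored + '"')

original = 'so sad \U0001f62d "quoted"'
stored = to_ascii_escapes(original)
restored = from_ascii_escapes(stored)
print(stored)
```

The stored form is pure ASCII, so it fits a 3-byte utf8 column, but it is longer than the original and can no longer be searched as plain text in SQL.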
