Removing all invalid characters (e.g. \uf0b7) from text

Removing all invalid characters (e.g. \uf0b7) from text - python

I currently have several text coming in which sometimes contains the character 'invalid character' e.g. \uf0b7 or \uf077. I don't have a way of knowing which of the invalid character codes a specific text might contain and I wondered if there was a way to make sure that a string is cleaned of all types of 'invalid character', since a process later on (which is dependent on a third party package) can not receive a string which contains it.
I've tried searching for a solution, but all I get it is answers regarding regular characters which people want removed (e.g. '^%$&*') which they have classified as invalid characters, however I want to remove/replace the actual character 'invalid character' in all its forms

The Python library codecs might be of help. Take a look at the documentation here: https://docs.python.org/2/library/codecs.htm
In my use case, I was doing some analysis of documents that had non-ASCII text. For my purposes, ignoring the invalid characters was acceptable. I opened the files with the following line and was able to parse the corpus.
for filename in os.listdir(ROOT_DIR):
with codecs.open(os.path.join(ROOT_DIR, filename), encoding = 'UTF8', errors ='replace' ) as f:

I had a similar issue. It turns out private use areas characters are in the Co general category, as returned by category() in unicodedata.
I fixed my problem as follow:
import unicodedata
def is_pua(c):
return unicodedata.category(c) == 'Co'
content = "This\uf0b7 is a \uf0b7string \uf0c7with private \uf0b7use are\uf0a7as blocks\uf0d7."
"".join([char for char in content if not is_pua(char)])
This outputs:
'This is a string with private use areas blocks.'

Related

Python character constants for chat bot

I'm currently writting a chat bot in python and I would like to be able to type special characters like emoji, etc. my first attempt was just to place the literal character in the code.
add_reaction('ðŸ‡¦')
Unfortunately not many editors support these characters, so they appear mostly as random gibberish. For readability this isn't very good either.
To solve the gibberish issue I used chr(charcode:{int}) which also made them more copy paste save.
Then I put all of them to a separate file special_chars.py so i could give the characters a name
thumbs_up = chr(...)
smiley_face = chr(...)
regional_a_z = [chr(127462+i) for i in range(0,25)]
...
However this file started to grow really long really quickly.
So is there a better way to do this?
Something to keep in mind:
if a long file isn't avoidable could the character codes be moved to a non-python file
potential list for consecutive characters or character groups ex: thumb-up and down, list of regional indicators

The unicodedata module of the standard library already contains names for the special characters:
>>> unicodedata.lookup('THUMBS UP SIGN')
'\U0001f44d'
>>> unicodedata.lookup("REGIONAL INDICATOR SYMBOL LETTER A")
'\U0001f1e6'
You can get the official name of a character by its code:
>>> unicodedata.name('\U0001F600')
'GRINNING FACE'

Some sensitive filenames cause failure of loading data

In using keras.model.load_weights, by the way, the weight file is saved in a hdf5 format, I come across some situations where the folder names that have initial r or t, cause the error: errno = 22, error message = 'invalid argument', flags = 0, o_flags = 0.
I want to know if there are some specified rules on the filenames which should be avoided and otherwise would lead to such reading error in python, or the situation I encountered is only specific to keras.

It would greatly help debug this if you include examples of such filenames that give you trouble. However, I have a good idea on what is probably happening here.
This problems seem to appear on folders that start with r or t on their names. Also, as they are folders, on their full path name they are preceded by a \ character (for example "\thisFolder", or similar). This is true in the case of a Windows environment, as they use \ for separating paths contrary to *nix systems that use the regular slash /.
Considering these things, seems that perhaps you are experiencing this as \r and \t are both special characters that mean Carriage Return and Tabulation, respectively. If this is the case many file openers will have trouble processing such file name.
Even more, I would not be surprised if you got the same errors on folders that begin with n or other letters that when concatenated to a backslash give special characters (\n is new line, \s is a white space, etc.).
To overcome this seems that you will need to escape your backslash character before passing it as a filename. In python, an escaped backslash is "\\"
. In addition, you can also opt to pass a Raw string instead, by adding the r prefix to your string, something like r"\a\raw\string". More information on escaping and raw string can be found on this question and answers.
I want to know if there are some specified rules on the filenames which should be avoided and otherwise would lead to such reading error in python,
As mentioned, you should avoid this with characters that have a special meaning with a backslash. I suggest you check here to see the characters Python accepts like this, so you can refrain from using such characters (or well use raw strings and forget about this problem).

Python: Replacing character error with read in file

Goal: I just want to take the comma away as that is the only character that will screw up my (course required) file parsing for bayesian analysis (i.e word,2,4) instead of say (i.e. word,,2,4)
So I'm currently trying to read in an email in the form of a text file from the Enron public corpus online and building a bayesian spam filter.
I've noticed that reading in some of the files are raising errors when trying to manipulate the strings that are present. I am fully aware that some of theses files contain viruses so the encoding of some of the characters might not be valid. However, I'm trying to simply replace a comma within a string and I'm getting the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc1 in position 1169: ordinal not in range(128)
I have tried everything that this forum has to offer and i've searched everywhere for a solution such as:
with open(file+file_path_stings[i],'r') as filehandle:
words = str(filehandle.read())
words = words.replace(',','')
words = words.split()
I've also tried many regex attempts... this is one of the versions:
with open(file+file_path_stings[i],'r') as filehandle:
words = str(filehandle.read())
words = re.sub(',','',words)
words = words.split()
Now, I can simply just regex a version that only lets A-Za-z through but I'm noticing that spam accuracy is heavily being affected by the fact that a lot of the spam files have such special characters.
Any suggestion would be most appreciated. Thanks.
-Robert

If you just want to remove the extra comma and as you said nothing is working out you can use the simple split and join (assuming comma is the only delimiter here)
','.join([s for s in 'word,,2,4'.split(',') if s])

So I ended up using another implementation that I found useful as well. It turns out, for some reason, python retains any prior information it had for any previous strings that were originally present. So i've learned its always a good idea to just re-assign it to a different(new) variable as follows:
with open(file+file_path_stings[i],'r') as filehandle:
words = str(filehandle.read()).split()
new_array = []
for word in words:
new_array.append(word.replace(',','').lower())
return new_array
Its a little more expensive as far as storing and assigning data to a whole other variable. However, I've noticed its a lot safer in terms of your string not getting casted to a unicode string. The original problem was this output
print words
[u'hello,',u'what?',u'is',u'going',u'on?']
The comma in 'hello' would not get replaced. With the code above you're guaranteed that the comma will be stripped from each word and not casted into a unicode string
print new_array
['hello','what?',u'is',u'going',u'on?']
As far as performance of the code goes, I'm still training massive files at a decent speed. So it should effect you that much.
Hope this helps!
-Robert

How can I filter Emoji characters from my input so I can save in MySQL <5.5?

I have a Django app that takes tweet data from Twitter's API and saves it in a MySQL database. As far as I know (I'm still getting my head around the finer points of character encoding) I'm using UTF-8 everywhere, including MySQL encoding and collation, which works fine except when a tweet contains Emoji characters, which I understand use a four-byte encoding. Trying to save them produces the following warnings from Django:
/home/biggleszx/.virtualenvs/myvirtualenv/lib/python2.6/site-packages/django/db/backends/mysql/base.py:86: Warning: Incorrect string value: '\xF0\x9F\x98\xAD I...' for column 'text' at row 1
return self.cursor.execute(query, args)
I'm using MySQL 5.1, so using utf8mb4 isn't an option unless I upgrade to 5.5, which I'd rather not just yet (also from what I've read, Django's support for this isn't quite production-ready, though this might no longer be accurate). I've also seen folks advising the use of BLOB instead of TEXT on affected columns, which I'd also rather not do as I figure it would harm performance.
My question is, then, assuming I'm not too bothered about 100% preservation of the tweet contents, is there a way I can filter out all Emoji characters and replace them with a non-multibyte character, such as the venerable WHITE MEDIUM SMALL SQUARE (U+25FD)? I figure this is the easiest way to save that data given my current setup, though if I'm missing another obvious solution, I'd love to hear it!
FYI, I'm using the stock Python 2.6.5 on Ubuntu 10.04.4 LTS. sys.maxunicode is 1114111, so it's a UCS-4 build.
Thanks for reading.

So it turns out this has been answered a few times, I just hadn't quite got the right Google-fu to find the existing questions.
Python, convert 4-byte char to avoid MySQL error "Incorrect string value:"
Warning raised by inserting 4-byte unicode to mysql
Thanks to Martijn Pieters, the solution came from the world of regular expressions, specifically this code (based on his answer to the first link above):
import re
try:
# UCS-4
highpoints = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:
# UCS-2
highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
# mytext = u'<some string containing 4-byte chars>'
mytext = highpoints.sub(u'\u25FD', mytext)
The character I'm replacing with is the WHITE MEDIUM SMALL SQUARE (U+25FD), FYI, but could be anything.
For those unfamiliar with UCS, like me, this is a system for Unicode conversion and a given build of Python will include support for either the UCS-2 or UCS-4 variant, each of which has a different upper bound on character support.
With the addition of this code, the strings seem to persist in MySQL 5.1 just fine.
Hope this helps anyone else in the same situation!

I tryied the solution by BigglesZX and its wasn't woring for the emoji of the heart (❤) after reading the [emoji's wikipedia article][1] I've seen that the regular expression is not covering all the emojis while also covering other range of unicode that are not emojis.
The following code create the 5 regular expressions that cover the 5 emoji blocks in the standard:
emoji_symbols_pictograms = re.compile(u'[\U0001f300-\U0001f5fF]')
emoji_emoticons = re.compile(u'[\U0001f600-\U0001f64F]')
emoji_transport_maps = re.compile(u'[\U0001f680-\U0001f6FF]')
emoji_symbols = re.compile(u'[\U00002600-\U000026FF]')
emoji_dingbats = re.compile(u'[\U00002700-\U000027BF]')
Those blocks could be merged in three blocks (UCS-4):
emoji_block0 = re.compile(u'[\U00002600-\U000027BF]')
emoji_block1 = re.compile(u'[\U0001f300-\U0001f64F]')
emoji_block2 = re.compile(u'[\U0001f680-\U0001f6FF]')
Their equivalents in UCS-2 are:
emoji_block0 = re.compile(u'[\u2600-\u27BF]')
emoji_block1 = compile(u'[\uD83C][\uDF00-\uDFFF]')
emoji_block1b = compile(u'[\uD83D][\uDC00-\uDE4F]')
emoji_block2 = re.compile(u'[\uD83D][\uDE80-\uDEFF]')
So finally we can define a single regular expression with all the cases together:
import re
try:
# UCS-4
highpoints = re.compile(u'([\U00002600-\U000027BF])|([\U0001f300-\U0001f64F])|([\U0001f680-\U0001f6FF])')
except re.error:
# UCS-2
highpoints = re.compile(u'([\u2600-\u27BF])|([\uD83C][\uDF00-\uDFFF])|([\uD83D][\uDC00-\uDE4F])|([\uD83D][\uDE80-\uDEFF])')
# mytext = u'<some string containing 4-byte chars>'
mytext = highpoints.sub(u'\u25FD', mytext)

I found out there another regular expresion that is able to identify the emojis.
This the regex is provided by the team at instagram-enginnering blog
u"(?<!&)#(\w|(?:[\xA9\xAE\u203C\u2049\u2122\u2139\u2194-\u2199\u21A9\u21AA\u231A\u231B\u2328\u2388\u23CF\u23E9-\u23F3\u23F8-\u23FA\u24C2\u25AA\u25AB\u25B6\u25C0\u25FB-\u25FE\u2600-\u2604\u260E\u2611\u2614\u2615\u2618\u261D\u2620\u2622\u2623\u2626\u262A\u262E\u262F\u2638-\u263A\u2648-\u2653\u2660\u2663\u2665\u2666\u2668\u267B\u267F\u2692-\u2694\u2696\u2697\u2699\u269B\u269C\u26A0\u26A1\u26AA\u26AB\u26B0\u26B1\u26BD\u26BE\u26C4\u26C5\u26C8\u26CE\u26CF\u26D1\u26D3\u26D4\u26E9\u26EA\u26F0-\u26F5\u26F7-\u26FA\u26FD\u2702\u2705\u2708-\u270D\u270F\u2712\u2714\u2716\u271D\u2721\u2728\u2733\u2734\u2744\u2747\u274C\u274E\u2753-\u2755\u2757\u2763\u2764\u2795-\u2797\u27A1\u27B0\u27BF\u2934\u2935\u2B05-\u2B07\u2B1B\u2B1C\u2B50\u2B55\u3030\u303D\u3297\u3299]|\uD83C[\uDC04\uDCCF\uDD70\uDD71\uDD7E\uDD7F\uDD8E\uDD91-\uDD9A\uDE01\uDE02\uDE1A\uDE2F\uDE32-\uDE3A\uDE50\uDE51\uDF00-\uDF21\uDF24-\uDF93\uDF96\uDF97\uDF99-\uDF9B\uDF9E-\uDFF0\uDFF3-\uDFF5\uDFF7-\uDFFF]|\uD83D[\uDC00-\uDCFD\uDCFF-\uDD3D\uDD49-\uDD4E\uDD50-\uDD67\uDD6F\uDD70\uDD73-\uDD79\uDD87\uDD8A-\uDD8D\uDD90\uDD95\uDD96\uDDA5\uDDA8\uDDB1\uDDB2\uDDBC\uDDC2-\uDDC4\uDDD1-\uDDD3\uDDDC-\uDDDE\uDDE1\uDDE3\uDDEF\uDDF3\uDDFA-\uDE4F\uDE80-\uDEC5\uDECB-\uDED0\uDEE0-\uDEE5\uDEE9\uDEEB\uDEEC\uDEF0\uDEF3]|\uD83E[\uDD10-\uDD18\uDD80-\uDD84\uDDC0]|(?:0\u20E3|1\u20E3|2\u20E3|3\u20E3|4\u20E3|5\u20E3|6\u20E3|7\u20E3|8\u20E3|9\u20E3|#\u20E3|\\*\u20E3|\uD83C(?:\uDDE6\uD83C(?:\uDDEB|\uDDFD|\uDDF1|\uDDF8|\uDDE9|\uDDF4|\uDDEE|\uDDF6|\uDDEC|\uDDF7|\uDDF2|\uDDFC|\uDDE8|\uDDFA|\uDDF9|\uDDFF|\uDDEA)|\uDDE7\uD83C(?:\uDDF8|\uDDED|\uDDE9|\uDDE7|\uDDFE|\uDDEA|\uDDFF|\uDDEF|\uDDF2|\uDDF9|\uDDF4|\uDDE6|\uDDFC|\uDDFB|\uDDF7|\uDDF3|\uDDEC|\uDDEB|\uDDEE|\uDDF6|\uDDF1)|\uDDE8\uD83C(?:\uDDF2|\uDDE6|\uDDFB|\uDDEB|\uDDF1|\uDDF3|\uDDFD|\uDDF5|\uDDE8|\uDDF4|\uDDEC|\uDDE9|\uDDF0|\uDDF7|\uDDEE|\uDDFA|\uDDFC|\uDDFE|\uDDFF|\uDDED)|\uDDE9\uD83C(?:\uDDFF|\uDDF0|\uDDEC|\uDDEF|\uDDF2|\uDDF4|\uDDEA)|\uDDEA\uD83C(?:\uDDE6|\uDDE8|\uDDEC|\uDDF7|\uDDEA|\uDDF9|\uDDFA|\uDDF8|\uDDED)|\uDDEB\uD83C(?:\uDDF0|\uDDF4|\uDDEF|\uDDEE|\uDDF7|\uDDF2)|\uDDEC\uD83C(?:\uDDF6|\uDDEB|\uDDE6|\uDDF2|\uDDEA|\uDDED|\uDDEE|\uDDF7|\uDDF1|\uDDE9|\uDDF5|\uDDFA|\uDDF9|\uDDEC|\uDDF3|\uDDFC|\uDDFE|\uDDF8|\uDDE7)|\uDDED\uD83C(?:\uDDF7|\uDDF9|\uDDF2|\uDDF3|\uDDF0|\uDDFA)|\uDDEE\uD83C(?:\uDDF4|\uDDE8|\uDDF8|\uDDF3|\uDDE9|\uDDF7|\uDDF6|\uDDEA|\uDDF2|\uDDF1|\uDDF9)|\uDDEF\uD83C(?:\uDDF2|\uDDF5|\uDDEA|\uDDF4)|\uDDF0\uD83C(?:\uDDED|\uDDFE|\uDDF2|\uDDFF|\uDDEA|\uDDEE|\uDDFC|\uDDEC|\uDDF5|\uDDF7|\uDDF3)|\uDDF1\uD83C(?:\uDDE6|\uDDFB|\uDDE7|\uDDF8|\uDDF7|\uDDFE|\uDDEE|\uDDF9|\uDDFA|\uDDF0|\uDDE8)|\uDDF2\uD83C(?:\uDDF4|\uDDF0|\uDDEC|\uDDFC|\uDDFE|\uDDFB|\uDDF1|\uDDF9|\uDDED|\uDDF6|\uDDF7|\uDDFA|\uDDFD|\uDDE9|\uDDE8|\uDDF3|\uDDEA|\uDDF8|\uDDE6|\uDDFF|\uDDF2|\uDDF5|\uDDEB)|\uDDF3\uD83C(?:\uDDE6|\uDDF7|\uDDF5|\uDDF1|\uDDE8|\uDDFF|\uDDEE|\uDDEA|\uDDEC|\uDDFA|\uDDEB|\uDDF4)|\uDDF4\uD83C\uDDF2|\uDDF5\uD83C(?:\uDDEB|\uDDF0|\uDDFC|\uDDF8|\uDDE6|\uDDEC|\uDDFE|\uDDEA|\uDDED|\uDDF3|\uDDF1|\uDDF9|\uDDF7|\uDDF2)|\uDDF6\uD83C\uDDE6|\uDDF7\uD83C(?:\uDDEA|\uDDF4|\uDDFA|\uDDFC|\uDDF8)|\uDDF8\uD83C(?:\uDDFB|\uDDF2|\uDDF9|\uDDE6|\uDDF3|\uDDE8|\uDDF1|\uDDEC|\uDDFD|\uDDF0|\uDDEE|\uDDE7|\uDDF4|\uDDF8|\uDDED|\uDDE9|\uDDF7|\uDDEF|\uDDFF|\uDDEA|\uDDFE)|\uDDF9\uD83C(?:\uDDE9|\uDDEB|\uDDFC|\uDDEF|\uDDFF|\uDDED|\uDDF1|\uDDEC|\uDDF0|\uDDF4|\uDDF9|\uDDE6|\uDDF3|\uDDF7|\uDDF2|\uDDE8|\uDDFB)|\uDDFA\uD83C(?:\uDDEC|\uDDE6|\uDDF8|\uDDFE|\uDDF2|\uDDFF)|\uDDFB\uD83C(?:\uDDEC|\uDDE8|\uDDEE|\uDDFA|\uDDE6|\uDDEA|\uDDF3)|\uDDFC\uD83C(?:\uDDF8|\uDDEB)|\uDDFD\uD83C\uDDF0|\uDDFE\uD83C(?:\uDDF9|\uDDEA)|\uDDFF\uD83C(?:\uDDE6|\uDDF2|\uDDFC))))[\ufe00-\ufe0f\u200d]?)+
Source:
http://instagram-engineering.tumblr.com/post/118304328152/emojineering-part-2-implementing-hashtag-emoji
note: I add another answer as this one is not complemetary to my previous answer here.

i am using json encoder function that encode the input.
this function is used for dict encoding (to convert it to string) on json.dumps. (so we need to do some edit to the response - remove the ' " ')
this enabled me to save emoji to mysql, and present it (at web):
# encode input
from json.encoder import py_encode_basestring_ascii
name = py_encode_basestring_ascii(name)[1:-1]
# save
YourModel.name = name
name.save()

Cleaning an XML file in Python before parsing

I'm using minidom to parse an xml file and it threw an error indicating that the data is not well formed. I figured out that some of the pages have characters like à¹„à¸à¹€à¸Ÿà¸¥ &, causing the parser to hiccup. Is there an easy way to clean the file before I start parsing it? Right now I'm using a regular expressing to throw away anything that isn't an alpha numeric character and the </> characters, but it isn't quite working.

Try
xmltext = re.sub(u"[^\x20-\x7f]+",u"",xmltext)
It will get rid of everything except 0x20-0x7F range.
You may start from \x01, if you want want to keep control characters like tab, line breaks.
xmltext = re.sub(u"[^\x01-\x7f]+",u"",xmltext)

Take a look at µTidyLib, a Python wrapper to TidyLib.

If you do need the data with the strange characters you could, in stead of just stripping them, convert them to codes the XML parser can understand.
You could have a look at the unicodedata package, especially the normalize method.
I haven't used it myself, so I can't tell you all that much, but you could ask again here on SO if you decide you're going to convert and keep that data.
>>> import unicodedata
>>> unicodedata.normalize("NFKD" , u"à¹„à¸ à¹€à¸Ÿà¸¥ &")
u'a\u03001\u201ea\u0300 \u0327 a\u03001\u20aca\u0300 \u0327Y\u0308a\u0300 \u0327\xa5 &'

It looks like you're dealing with data which are saved with some kind of encoding "as if" they were ASCII. XML file should normally be UTF8, and SAX (the underlying parser used by minidom) should handle that, so it looks like something's wrong in that part of the processing chain. Instead of focusing on "cleaning up" I'd first try to make sure the encoding is correct and correctly recognized. Maybe a broken XML directive? Can you edit your Q to show the first few lines of the file, especially the <?xml ... directive at the very start?

I'd throw out all non-ASCII characters which can be identified by having the 8th bit (0x80) set (128 .. 255 respectively 0x80 .. 0xff).
You could read in the file into a Python string named old_str
Then perform a filter call in conjunction with a lambda statement:
new_str = filter(lambda x: x in string.ascii_letters, old_str)
Parse new_str
Many ways exist to accomplish stripping non-ASCII characters from a string.
This question might be related: How to check if a string in Python is in ASCII?

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.