I have a long text which I need to be as clean as possible.
I have collapsed multiple spaces into a single space, removed \n and \t, and stripped the resulting string.
I then found characters like \u2003 and \u2019.
What are these? How do I make sure that I have removed all special characters from my text?
Besides \n, \t and \u2003, should I check for more characters to remove?
I am using python 3.6
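For reference, the standard library can tell you what these code points are; this is a quick check using only the built-in unicodedata module:

```python
import unicodedata

# Look up the official Unicode names of the mystery characters
for ch in ['\u2003', '\u2019']:
    print(f'U+{ord(ch):04X} is {unicodedata.name(ch)}')
# U+2003 is EM SPACE
# U+2019 is RIGHT SINGLE QUOTATION MARK
```

So \u2003 is just another kind of whitespace, and \u2019 is a "curly" apostrophe.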
Try this:
import re
# string contains the \u2019 character
string = u'This is a test string ’'
# this regex replaces each run of non-word characters with a space
re.sub(r'\W+', ' ', string).strip()
Result
'This is a test string'
If you want to preserve ASCII special characters:
re.sub(r'[^!-~]+', ' ', string).strip()
This regex reads: match [anything that is not a character in the range 33-126] one or more times, where 33-126 is the visible range of ASCII.
In regex, the ^ inside a character class means "not" and the - indicates a range. Looking at an ASCII table, 32 is the space character and everything below it is either a control character or another form of whitespace such as tab and newline. Character 33 is the ! mark, and the last displayable ASCII character is 126, or ~.
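A quick illustration of the [^!-~]+ pattern; the input string here is my own example, not from the question:

```python
import re

# \u2003 (em space) and \u2019 (curly quote) are both outside !-~,
# so they collapse into single spaces along with the regular spaces
s = 'This\u2003is a test string \u2019'
print(re.sub('[^!-~]+', ' ', s).strip())
# This is a test string
```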
Thank you Mike Peder, this solution worked for me. However, I had to apply it to both sides of the comparison:
if((re.sub('[^!-~]+',' ',date).strip())==(re.sub('[^!-~]+',' ',calendarData[i]).strip())):
I have a string containing sub-strings like these:
\ud83d\ude80
\ud83c\udfb0
\ud83d\udd25
All of them start with \ud83 (Telegram emoji) and have 7 different characters after the 3, so I am trying to remove them with
text = re.sub(r'\\ud83\w{7}', '', text, flags=re.MULTILINE)
but with no success. What am I doing wrong? Thanks!
You are not dealing with 12 characters here. Each of these seems to be only 2 Unicode characters, which are not printable by Python and are therefore displayed in their escaped form.
re.sub(r"[\ud83d\ud83c]\S", "", text)
You could create the character class [\ud83d\ud83c] manually (adding every allowed starting character), or find a way to build it programmatically.
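A sketch of what that looks like in practice; the sample text is my own, and it contains two real unpaired surrogate code points rather than backslash sequences:

```python
import re

# The string holds 2 actual code points, U+D83D and U+DE80,
# written here with escape sequences in the source
text = 'launch \ud83d\ude80 now'
cleaned = re.sub('[\ud83d\ud83c].', '', text)
print(cleaned)
# launch  now
```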
I think that, if you're trying to remove everything after your telegram emoji code, \w won't catch the \ character.
Try
text = re.sub(r'\\ud83[\w\\]{7}', '', text, flags=re.MULTILINE)
which tells the regex to look for 7 characters that could be either alphanumeric or the \.
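If the text really does contain literal backslash sequences, the amended pattern removes them; the sample string below is my own:

```python
import re

# The r-prefix means the string holds a literal backslash, 'u', 'd', ...
text = r'rocket \ud83d\ude80 end'
cleaned = re.sub(r'\\ud83[\w\\]{7}', '', text)
print(cleaned)
# rocket  end
```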
I need to add an additional \ to escape some characters. For instance, BC \ BS needs to become BC \\ BS. The following line solves the issue:
txt.encode('unicode_escape').replace("'", "\\'")
However, it messes up other characters. For example, ^#? becomes \x00?. In such a situation I would need to remove \x00 as a subsequent step, but other characters might show up.
What is the most Pythonic way to add the escape character to a set of characters such as \, \t, \n etc. without causing other characters to break? I have tried using translate but ran into issues because the length of \\t is not equal to that of \t.
Are you attempting to escape certain characters in text strings, raw text strings, or byte strings? What is the full set of characters you need to escape? If it's text, then there is no need to encode it first... Some things you can try:
test = "Please escape this: ' and this \n"
for char in ["'", "\n"]:
    test = test.replace(char, f"\\{char}")
or, with re.escape:
import re
complicated_test = "This \tstring \n is \\ full of ' things to \t escape"
print(re.escape(complicated_test))
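Since translate came up: in Python 3, str.translate accepts a dict mapping code points to replacement strings of any length, so the unequal sizes of \t and \\t are not a problem. A minimal sketch, where the mapping below is my own assumption about which characters you want escaped:

```python
# Map each character's code point to its escaped spelling;
# replacements may be longer than the originals
table = {ord('\\'): '\\\\', ord('\t'): '\\t', ord('\n'): '\\n'}
escaped = 'BC \\ BS'.translate(table)
print(escaped)
# BC \\ BS
```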
I have the following string:
word = u'Buffalo,\xa0IL\xa060625'
I don't want the "\xa0" in there. How can I get rid of it? The string I want is:
word = 'Buffalo, IL 60625'
The most robust way would be to use the unidecode module to convert all non-ASCII characters to their closest ASCII equivalent automatically.
The character \xa0 (not \xa as you stated) is a NO-BREAK SPACE, and the closest ASCII equivalent would of course be a regular space.
import unidecode
word = unidecode.unidecode(word)
If you know for sure that is the only character you don't want, you can .replace it:
>>> word.replace(u'\xa0', ' ')
u'Buffalo, IL 60625'
If you need to handle all non-ascii characters, encoding and replacing bad characters might be a good start...:
>>> word.encode('ascii', 'replace')
'Buffalo,?IL?60625'
There is no \xa there. If you try to put that into a string literal, you're going to get a syntax error if you're lucky, or it's going to swallow up the next attempted character if you're not, because \x sequences always have to be followed by two hexadecimal digits.
What you have is \xa0, which is an escape sequence for the character U+00A0, aka "NO-BREAK SPACE".
I think you want to replace them with spaces, but whatever you want to do is pretty easy to write:
word.replace(u'\xa0', u' ') # replaced with space
word.replace(u'\xa0', u'0') # closest to what you were literally asking for
word.replace(u'\xa0', u'') # removed completely
You can use unicodedata to normalize compatibility characters such as \xa0 into their regular equivalents:
>>> from unicodedata import normalize
>>> normalize('NFKD', word)
u'Buffalo, IL 60625'
This seems to work for getting rid of non-ASCII characters (note that in Python 3 this returns a bytes object, so you may want to call .decode('ascii') on the result):
fixedword = word.encode('ascii', 'ignore')
I am processing a large number of CSV files in python. The files are received from external organizations and are encoded with a range of encodings. I would like to find an automated method to remove the following:
Non-ASCII Characters
Control characters
Null (ASCII 0) Characters
I have a product called 'Find and Replace It!' that would use regular expressions so a way to solve the above with a regular expression would be very helpful.
Thank you
An alternative you might be interested in would be:
import string
clean = lambda dirty: ''.join(filter(string.printable.__contains__, dirty))
It simply filters out all non-printable characters from the dirty string it receives.
>>> len(clean(map(chr, range(0x110000))))
100
Try this:
clean = re.sub('[\0\200-\377]', '', dirty)
The idea is to match each NUL or "high ASCII" character (i.e. \0 and those that do not fit in 7 bits) and remove them. You could add more characters as you find them, such as ASCII ESC or BEL.
Or this:
clean = re.sub('[^\040-\176]', '', dirty)
The idea being to only permit the limited range of "printable ASCII," but note that this also removes newlines. If you want to keep newlines or tabs or the like, just add them into the brackets.
Replace anything that isn't a desirable character with a blank (delete it):
clean = re.sub('[^\s!-~]', '', dirty)
This allows all whitespace (spaces, newlines, tabs etc), and all "normal" characters (! is the first ascii printable and ~ is the last ascii printable under decimal 128).
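To illustrate, here is my own sample line mixing a NUL, a BEL control character, and a high byte:

```python
import re

# \x00 (NUL), \x07 (BEL) and \x80 are removed;
# \t, \r and \n survive because \s is allowed by the class
dirty = 'name\x00,\tage\r\nAnn\x80,\x0730'
clean = re.sub(r'[^\s!-~]', '', dirty)
print(repr(clean))
# 'name,\tage\r\nAnn,30'
```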
Since this shows up on Google and we're no longer targeting Python 2.x, I should probably mention the isprintable method on strings.
It's not perfect, since it sees spaces as printables but newlines and tabs as non-printable, but I'd probably do something like this:
whitespace_normalizer = re.compile('\s+', re.UNICODE)
cleaner = lambda instr: ''.join(x for x in whitespace_normalizer.sub(' ', instr) if x.isprintable())
The regex does HTML-like whitespace collapsing (i.e. it converts arbitrary spans of whitespace as defined by Unicode into single spaces) and then the lambda strips any characters other than space that are classified by Unicode as "Separator" or "Other".
Then you get a result like this:
>>> cleaner('foo\0bar\rbaz\nquux\tspam eggs')
'foobar baz quux spam eggs'
I know similar questions have been asked here on StackOverflow. I tried to adapt some of the approaches, but I couldn't get anything to work that fits my needs:
Given a Python string, I want to strip every non-alphanumeric character, but leave special characters like µ æ Å Ç ß... Is this even possible? With regexes I tried variations of this
re.sub(r'[^a-zA-Z0-9: ]', '', x) # x is my string to sanitize
but it strips more than I want. An example of what I want would be:
Input: "A string, with characters µ, æ, Å, Ç, ß,... Some whitespace confusion ?"
Output: "A string with characters µ æ Å Ç ß Some whitespace confusion"
Is this even possible without getting complicated?
Use \w with the UNICODE flag set. This will match the underscore also, so you might need to take care of that separately.
Details on http://docs.python.org/library/re.html.
EDIT: Here is some actual code. It will keep unicode letters, unicode digits, and spaces.
import re
x = u'$a_bßπ7: ^^#p'
pattern = re.compile(r'[^\w\s]', re.U)
re.sub(r'_', '', re.sub(pattern, '', x))
If you did not use re.U then the ß and π characters would have been stripped.
Sorry I can't figure out a way to do this with one regex. If you can, can you post a solution?
Eliminate characters in "Punctuation, Other" Unicode category.
# -*- coding: utf-8 -*-
import unicodedata

# This removes punctuation characters ("Punctuation, Other" category).
def strip_po(s):
    return ''.join(x for x in s if unicodedata.category(x) != 'Po')

# This reduces multiple whitespace characters into a single space.
def fix_space(s):
    return ' '.join(s.split())

s = u'A string, with characters µ, æ, Å, Ç, ß,... Some whitespace confusion ?'
print(fix_space(strip_po(s)))
You'll have to better define what you mean by special characters. There are certain flags that will group things like whitespace, non-whitespace, digits, etc. and do it specific to a locale. See http://docs.python.org/library/re.html for more details.
However, since this is a character by character operation, you may find it easier to simply explicitly specify every character, or, if the number of characters you want to exclude is smaller, writing an expression that only excludes those.
If you're ok with the Unicode Consortium's classification of what's a letter or a digit, an easy way to do this without RegEx or importing anything outside the built-ins:
filter(unicode.isalnum, u"A string, with characters µ, æ, Å, Ç, ß,... Some whitespace confusion ?")
If you have a str instead of a unicode (this is Python 2 code), you'll need to decode it first.
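On Python 3, where every str is Unicode, the same idea works with str.isalnum; note that this drops the spaces as well, so you may want to combine it with a whitespace fix-up:

```python
# Keep only characters the Unicode database classifies as letters or digits
s = 'A string, with characters µ, æ, Å, Ç, ß'
print(''.join(filter(str.isalnum, s)))
# AstringwithcharactersµæÅÇß
```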