Removing substring in string using python? - python

I have a string containing substrings like these:
\ud83d\ude80
\ud83c\udfb0
\ud83d\udd25
All of them start with \ud83 (Telegram emoji) and have 7 different characters after the 3, so I am trying to remove them with
text = re.sub(r'\\ud83\w{7}', '', text, flags=re.MULTILINE)
with no success. What am I doing wrong? Thanks!

You are not dealing with 12 characters here. These seem to be only 2 unicode characters, which are not printable by python and therefore displayed in their escaped form.
re.sub(r"[\ud83d\ud83c]\S", "", text)
You could create the character class [\ud83d\ud83c] manually (adding every allowed starting character) or you find a way to do this programmatically.
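A minimal sketch of the "two Unicode characters" case, assuming the text arrived as JSON in which each emoji is written as an escaped surrogate pair (the sample input is hypothetical). Once the JSON is decoded, each pair becomes a single character above U+FFFF, so a code point range removes it:

```python
import json
import re

# Hypothetical input: JSON text in which an emoji is written as an escaped surrogate pair.
raw = '"launch \\ud83d\\ude80 now"'

# json.loads joins the surrogate pair into the single character U+1F680.
text = json.loads(raw)

# Remove every character above U+FFFF, which is where the three emoji in the question live.
cleaned = re.sub(r'[\U00010000-\U0010FFFF]', '', text)
print(repr(cleaned))
```

This drops all characters outside the Basic Multilingual Plane, not only emoji, so narrow the range if other such characters need to survive.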

I think that, if you're trying to remove everything after your telegram emoji code, \w won't catch the \ character.
Try
text = re.sub(r'\\ud83[\w\\]{7}', '', text, flags=re.MULTILINE)
This tells the regex to look for 7 characters, each of which can be either alphanumeric or a backslash.
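If the escapes really are literal text (a backslash, a "u" and hex digits), a stricter pattern can spell out the surrogate pair instead of accepting any 7 word characters. This is only a sketch; the sample string and the variable names are made up for illustration:

```python
import re

# Hypothetical input: the escapes are literal text (backslash, 'u', hex digits), not emoji.
text = r'a\ud83d\ude80b\ud83c\udfb0c'

# A high surrogate \ud83X followed by a low surrogate in \udc00-\udfff, both spelled out.
pattern = re.compile(r'\\ud83[0-9a-f]\\ud[c-f][0-9a-f]{2}', re.IGNORECASE)

cleaned = pattern.sub('', text)
print(cleaned)
```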

Related

find non English characters in python string [duplicate]

I am collecting strings that may have writing of other languages in it and I want to find all strings that contain non English characters.
for example
lst = ['english1234!', 'Engl1sh', 'not english 行中ワ']
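Depends on what you mean by "non-English" characters. If you are only allowing letters, you could use the string method isalpha(), but note that it also accepts letters from other scripts, so combine it with isascii() if only a-z and A-Z should pass.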
lst = ['english1234!', 'Engl1sh', 'not english 行中ワ']
allowed_strings = [string for string in lst if string.isalpha()]
If alphanumeric is allowed, use string.isalnum()
If alphanumeric + standard special characters, you could use string.isascii()
If any other specific scenarios is allowed, use regex.
E.g. in your example, using isascii() in the list comprehension above would remove the last string but keep the first two.
If you also want to allow special characters, you cannot use isalpha() alone, but perhaps that's a start (it won't accept "hi!" or "hi here").
First you need to decide what English character means. Do you want to reject words like café or naïve?
If you want only A-Z or A-Z and numbers you can use str.isalpha() or str.isalnum(). You can't use str.isascii() in your case, as the 7-bit US-ASCII range doesn't include any accented characters, just some extra symbols.
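To include accented characters you can use a regular expression using the regex package and match against specific Unicode scripts or character blocks. For example, \p{IsLatin} will match all characters in the Latin script.
To find strings with non-Latin characters you can use [^\p{IsLatin}]. Use search rather than match, because match only looks at the start of the string. Note that this class also matches spaces, digits and punctuation, which are not part of the Latin script:
regex.search(r'[^\p{IsLatin}]', 'not english 行中ワ')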
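For the strict reading of "English" (ASCII only), a standard-library sketch along the lines of the answers above could look like this. The name non_english is just illustrative, and str.isascii() requires Python 3.7 or later:

```python
lst = ['english1234!', 'Engl1sh', 'not english 行中ワ']

# str.isalpha() is Unicode-aware, so it accepts non-Latin letters too.
print('行中ワ'.isalpha())

# str.isascii() is True only when every character is below U+0080.
non_english = [s for s in lst if not s.isascii()]
print(non_english)
```

Words such as café would also be reported here, which is why the definition of "English" has to be settled first.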

I need to remove all invisible characters in python

i have a long text which i need to be as clean as possible.
I have collapsed multiple spaces in one space only. I have removed \n and \t. I stripped the resulting string.
I then found characters like \u2003 and \u2019
What are these? How do I make sure that in my text I will have removed all special characters?
Besides the \n \t and the \u2003, should I check for more characters to remove?
I am using python 3.6
Try this:
import re
# string contains the \u2003 (em space) and \u2019 (right single quotation mark) characters
string = 'This is a test string\u2003\u2019'
# this regex replaces every run of non-word characters with a space
re.sub(r'\W+', ' ', string).strip()
Result
'This is a test string'
If you want to preserve ascii special characters:
re.sub('[^!-~]+',' ',string).strip()
This regex reads: select [not characters 33-126] one or more times, where characters 33-126 are the visible range of ASCII.
In a regex character class, the ^ says not and the - indicates a range. Looking at an ASCII table, 32 is space and all characters below are either a control character or another form of white space like tab and newline. Character 33 is the ! mark and the last displayable character in ASCII is 126 or ~.
Thank you Mike Peder, this solution worked for me. However I had to do it for both sides of the comparison
if((re.sub('[^!-~]+',' ',date).strip())==(re.sub('[^!-~]+',' ',calendarData[i]).strip())):
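To answer "what are these?" for any unfamiliar character, the unicodedata module can report its official name. The sketch below combines that lookup with the visible-ASCII regex from the answer; the sample text is made up:

```python
import re
import unicodedata

text = 'This is\u2003a test string \u2019 (ok!)'

# Look up the official Unicode names of the non-ASCII characters.
names = [unicodedata.name(c) for c in text if ord(c) > 127]
print(names)

# Replace every run of characters outside '!'..'~' (33-126) with one space.
cleaned = re.sub(r'[^!-~]+', ' ', text).strip()
print(cleaned)
```

Keep in mind that \u2019 is a typographic apostrophe, so deleting it can change words such as "don’t"; replacing it with ' may be preferable.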

Regex replace multiple punctuation in python

I would like to find multiple occurrences of exclamation marks, question marks and periods (such as !!?!, ...?, ...!) and replace them with just the final punctuation.
i.e. !?!?!? would become ?
and ....! would become !
Is this possible?
text = re.sub(r'[\?\.\!]+(?=[\?\.\!])', '', text)
That is, remove any sequence of ?!. characters that are going to be followed by another ?!. character.
[...] is a character class. It matches any character inside the brackets.
+ means "1 or more of these".
(?=...) is a lookahead. It looks to see what is going to come next in the string.
text = re.search('[.?!]*([.?!])', text).group(1)
The way this works is that the parentheses create a capture group, allowing you to access the matched text via the group function.
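Both ideas can be applied to a whole text with re.sub. The sketch below uses a made-up sentence and shows the lookahead version next to a capture-group version that keeps only the last mark of each run:

```python
import re

text = 'What!?!?!? Really....! Ok.'

# Lookahead version: delete punctuation that is followed by more punctuation.
with_lookahead = re.sub(r'[?.!]+(?=[?.!])', '', text)

# Capture-group version: replace each run with its final character.
with_group = re.sub(r'[?.!]*([?.!])', r'\1', text)

print(with_lookahead)
print(with_group)
```

Inside a character class ?, . and ! need no backslash, so [?.!] is equivalent to [\?\.\!].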

Removing non-printable "gremlin" chars from text files

I am processing a large number of CSV files in python. The files are received from external organizations and are encoded with a range of encodings. I would like to find an automated method to remove the following:
Non-ASCII Characters
Control characters
Null (ASCII 0) Characters
I have a product called 'Find and Replace It!' that would use regular expressions so a way to solve the above with a regular expression would be very helpful.
Thank you
An alternative you might be interested in would be:
import string
clean = lambda dirty: ''.join(filter(string.printable.__contains__, dirty))
It simply filters out all non-printable characters from the dirty string it receives.
>>> len(clean(map(chr, range(0x110000))))
100
Try this:
clean = re.sub('[\0\200-\377]', '', dirty)
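The idea is to match each NUL or "high ASCII" character (i.e. \0 and those that do not fit in 7 bits) and remove them. On Python 3 str objects this range covers only U+0080 to U+00FF, so characters above U+00FF are left in place. You could add more characters as you find them, such as ASCII ESC or BEL.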
Or this:
clean = re.sub('[^\040-\176]', '', dirty)
The idea being to only permit the limited range of "printable ASCII," but note that this also removes newlines. If you want to keep newlines or tabs or the like, just add them into the brackets.
Replace anything that isn't a desirable character with a blank (delete it):
clean = re.sub(r'[^\s!-~]', '', dirty)
This allows all whitespace (spaces, newlines, tabs etc), and all "normal" characters (! is the first ascii printable and ~ is the last ascii printable under decimal 128).
Since this shows up on Google and we're no longer targeting Python 2.x, I should probably mention the isprintable method on strings.
It's not perfect, since it sees spaces as printables but newlines and tabs as non-printable, but I'd probably do something like this:
whitespace_normalizer = re.compile(r'\s+', re.UNICODE)
cleaner = lambda instr: ''.join(x for x in whitespace_normalizer.sub(' ', instr) if x.isprintable())
The regex does HTML-like whitespace collapsing (i.e. it converts arbitrary spans of whitespace as defined by Unicode into single spaces) and then the lambda strips any characters other than space that are classified by Unicode as "Separator" or "Other".
Then you get a result like this:
>>> cleaner('foo\0bar\rbaz\nquux\tspam eggs')
'foobar baz quux spam eggs'
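For the three requirements in the question (non-ASCII, control and NUL characters), a single whitelist pattern may be enough. This is a sketch under the assumption that tabs and line breaks should survive so the CSV structure stays intact; strip_gremlins is a hypothetical helper name:

```python
import re

# Keep tab, newline, carriage return and printable ASCII (0x20-0x7E); drop everything else.
GREMLINS = re.compile(r'[^\t\n\r\x20-\x7e]')

def strip_gremlins(dirty):
    """Remove NUL, other control characters and all non-ASCII characters."""
    return GREMLINS.sub('', dirty)

print(repr(strip_gremlins('a\x00b\x07c\u00e9d\tE\n')))
```

The character class itself uses only basic regex syntax, so it may carry over to other regex-based tools, though that depends on the escapes the tool supports.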

How to remove escape sequence like '\xe2' or '\x0c' in python

I am working on a project (content based search), for that I am using 'pdftotext' command line utility in Ubuntu which writes all the text from pdf to some text file.
But it also writes bullets. Now when I'm reading the file to index each word, some escape sequences (like '\x01') get indexed as well. I know it's because of the bullets (•).
I want only text, so is there any way to remove these escape sequences? I have done something like this:
escape_char = re.compile('\+x[0123456789abcdef]*')
re.sub(escape_char, " ", string)
But this does not remove the escape sequences.
Thanks in advance.
The problem is that \xXX is just a representation of a control character, not the character itself. Therefore, you can't literally match \x unless you're working with the repr of the string.
You can remove nonprintable characters using a character class:
re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff]', '', text)
Example:
>>> re.sub(r'[\x00-\x1f\x7f-\xff]', '', ''.join(map(chr, range(256))))
' !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~'
Your only real problem is that backslashes are tricky. In a string, a backslash might be treated specially; for example \t would turn into a tab. Since \+ isn't special in strings, the string was actually what you expected. So then the regular expression compiler looked at it, and \+ in a regular expression would just be a plain + character. Normally the + has a special meaning ("1 or more instances of the preceding pattern") and the backslash escapes it.
The solution is just to double the backslash, which makes a pattern that matches a single backslash.
I put the pattern into r'', to make it a "raw string" where Python leaves backslashes alone. If you don't do that, Python's string parser will turn the two backslashes into a single backslash; just as \t turns into a tab, \\ turns into a single backslash. So, use a raw string and put exactly what you want the regular expression compiler to see.
Also, a better pattern would be: backslash, then an x, then 1 or more instances of the character class matching a hex character. I rewrote the pattern to this.
import re
s = r'+\x01+'
escape_char = re.compile(r'\\x[0123456789abcdef]+')
s = re.sub(escape_char, " ", s)
Instead of using a raw string, you could use a normal string and just be very careful with backslashes. In this case we would have to put four backslashes! The string parser would turn each doubled backslash into a single backslash, and we want the regular expression compiler to see two backslashes. It's easier to just use the raw string!
Also, your original pattern would remove zero or more hex digits. My pattern removes one or more. But I think it is likely that there will always be exactly two hex digits, or perhaps with Unicode maybe there will be four. You should figure out how many there can be and put a pattern that ensures this. Here's a pattern that matches 2, 3, or 4 hex digits:
escape_char = re.compile(r'\\x[0123456789abcdef]{2,4}')
And here is one that matches exactly four or exactly two. We have to use a vertical bar to make two alternatives, and we need to make a group with parentheses. I'm using a non-capturing group here, with (?:pattern) instead of just (pattern) (where pattern means a pattern, not literally the word pattern). Alternatives are tried from left to right, so the four-digit alternative has to come first; otherwise the two-digit one would always win.
escape_char = re.compile(r'\\x(?:[0123456789abcdef]{4}|[0123456789abcdef]{2})')
Here is example code. The bullet sequence is immediately followed by a 1 character, and this pattern leaves it alone.
import re
s = r'+\x011+'
pat = re.compile(r'\\x(?:[0123456789abcdef]{4}|[0123456789abcdef]{2})')
s = pat.sub("#", s)
print("Result: '%s'" % s)
This prints: Result: '+#1+'
NOTE: all of this is assuming that you actually are trying to match a backslash character followed by hex chars. If you are actually trying to match character byte values that might or might not be "printable" chars, then use the answer by @nneonneo instead of this one.
If you're working with 8-bit character values, it's possible to forgo regexes by building a simple table beforehand and then using it in conjunction with the str.translate() method to remove unwanted characters from strings very quickly and easily:
import random
import string
allords = list(range(256))
printableords = {ord(ch) for ch in string.printable}
# Map every non-printable code point below 256 to None so translate() deletes it
deletetable = {i: None for i in allords if i not in printableords}
test = ''.join(chr(random.choice(allords)) for _ in range(10, 40)) # random string
print(test.translate(deletetable))
not enough reputation to comment, but the accepted answer removes printable characters as well.
s = "pörféct änßwer"
re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff]', '', s)
'prfct nwer'
For non-English strings, please use answer https://stackoverflow.com/a/62530464/3021668
import unicodedata
''.join(c for c in s if not unicodedata.category(c).startswith('C'))
'pörféct änßwer'
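Wrapped in a function, the category-based approach might look like the sketch below. The keep parameter is an assumption added here so that newlines and tabs, which are also in category Cc, are not removed along with the other control characters:

```python
import unicodedata

def remove_control_chars(s, keep='\n\t'):
    """Drop characters in the Unicode 'Other' categories (C*), except those in keep."""
    return ''.join(
        c for c in s
        if c in keep or not unicodedata.category(c).startswith('C')
    )

print(repr(remove_control_chars('pörféct\x0c änßwer\x01\n')))
```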
