Combining multiple regex substitutions - python

I'm trying to delete some things from a block of text using regex. I have all of my patterns ready, but I can't seem to be able to remove two (or more) that overlap.
For example:
import re
r1 = r'I am'
r2 = r'am foo'
text = 'I am foo'
re.sub(r1, '', text) # Returns ' foo'
re.sub(r2, '', text) # Returns 'I '
How do I replace both of the occurrences simultaneously and end up with an empty string?
I ended up using a slightly modified version of Ned Batchelder's answer:
def clean(self, text):
mask = bytearray(len(text))
for pattern in patterns:
for match in re.finditer(pattern, text):
r = range(match.start(), match.end())
mask[r] = 'x' * len(r)
return ''.join(character for character, bit in zip(text, mask) if not bit)

You can't do it with consecutive re.sub calls as you have shown. You can use re.finditer to find them all. Each match will provide you with a match object, which has .start and .end attributes indicating their positions. You can collect all those together, and then remove characters at the end.
Here I use a bytearray as a mutable string, used as a mask. It's initialized to zero bytes, and I mark with an 'x' all the bytes that match any regex. Then I use the bit mask to select the characters to keep in the original string, and build a new string with only the unmatched characters:
bits = bytearray(len(text))
for pat in patterns:
for m in re.finditer(pat, text):
bits[m.start():m.end()] = 'x' * (m.end()-m.start())
new_string = ''.join(c for c,bit in zip(text, bits) if not bit)

Not to be a downer, but the short answer is that I'm pretty sure you can't. Can you change your regex so that it doesn't require overlapping?
If you still want to do this, I would try keeping track of the start and stop indices of each match made on the original string. Then go through the string and only keep characters not in any deletion range?

Quite efficient too is a solution coming from ... Perl combine the regexps in one:
# aptitude install regexp-assemble
$ regexp-assemble
I am
I am foo
Ctrl + D
I am(?: foo)?
regexp-assemble takes all the variants of regexps or string you want to match and then
combine them in one. And yes it changes the initial problem to another one since it is not about matching overlapping regexp anymore, but combining regexp for a match
And Then you can use it in your code:
$ python
Python 2.7.3 (default, Aug 1 2012, 05:14:39)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.sub("I am foo","I am(?: foo)?","")
''
A port of Regexp::Assemble in python would be nice :)

Here is an alternative that filters the strings on the fly using itertools.compress on the text with a selector iterator. The selector returns True if the character should be kept. selector_for_patterns creates one selector for every pattern. The selector are combined with the all function (only when all pattern want to keep a character it should be in the resulting string).
import itertools
import re
def selector_for_pattern(text, pattern):
i = 0
for m in re.finditer(pattern, text):
for _ in xrange(i, m.start()):
yield True
for _ in xrange(m.start(), m.end()):
yield False
i = m.end()
for _ in xrange(i, len(text)):
yield True
def clean(text, patterns):
gen = [selector_for_pattern(text, pattern) for pattern in patterns]
selector = itertools.imap(all, itertools.izip(* gen))
return "".join(itertools.compress(text, selector))

Related

How can I implement isalnum() into this Python web scraper to remove special characters? [duplicate]

I'm trying to remove specific characters from a string using Python. This is the code I'm using right now. Unfortunately it appears to do nothing to the string.
for char in line:
if char in " ?.!/;:":
line.replace(char,'')
How do I do this properly?
Strings in Python are immutable (can't be changed). Because of this, the effect of line.replace(...) is just to create a new string, rather than changing the old one. You need to rebind (assign) it to line in order to have that variable take the new value, with those characters removed.
Also, the way you are doing it is going to be kind of slow, relatively. It's also likely to be a bit confusing to experienced pythonators, who will see a doubly-nested structure and think for a moment that something more complicated is going on.
Starting in Python 2.6 and newer Python 2.x versions *, you can instead use str.translate, (see Python 3 answer below):
line = line.translate(None, '!##$')
or regular expression replacement with re.sub
import re
line = re.sub('[!##$]', '', line)
The characters enclosed in brackets constitute a character class. Any characters in line which are in that class are replaced with the second parameter to sub: an empty string.
Python 3 answer
In Python 3, strings are Unicode. You'll have to translate a little differently. kevpie mentions this in a comment on one of the answers, and it's noted in the documentation for str.translate.
When calling the translate method of a Unicode string, you cannot pass the second parameter that we used above. You also can't pass None as the first parameter. Instead, you pass a translation table (usually a dictionary) as the only parameter. This table maps the ordinal values of characters (i.e. the result of calling ord on them) to the ordinal values of the characters which should replace them, or—usefully to us—None to indicate that they should be deleted.
So to do the above dance with a Unicode string you would call something like
translation_table = dict.fromkeys(map(ord, '!##$'), None)
unicode_line = unicode_line.translate(translation_table)
Here dict.fromkeys and map are used to succinctly generate a dictionary containing
{ord('!'): None, ord('#'): None, ...}
Even simpler, as another answer puts it, create the translation table in place:
unicode_line = unicode_line.translate({ord(c): None for c in '!##$'})
Or, as brought up by Joseph Lee, create the same translation table with str.maketrans:
unicode_line = unicode_line.translate(str.maketrans('', '', '!##$'))
* for compatibility with earlier Pythons, you can create a "null" translation table to pass in place of None:
import string
line = line.translate(string.maketrans('', ''), '!##$')
Here string.maketrans is used to create a translation table, which is just a string containing the characters with ordinal values 0 to 255.
Am I missing the point here, or is it just the following:
string = "ab1cd1ef"
string = string.replace("1", "")
print(string)
# result: "abcdef"
Put it in a loop:
a = "a!b#c#d$"
b = "!##$"
for char in b:
a = a.replace(char, "")
print(a)
# result: "abcd"
>>> line = "abc##!?efg12;:?"
>>> ''.join( c for c in line if c not in '?:!/;' )
'abc##efg12'
With re.sub regular expression
Since Python 3.5, substitution using regular expressions re.sub became available:
import re
re.sub('\ |\?|\.|\!|\/|\;|\:', '', line)
Example
import re
line = 'Q: Do I write ;/.??? No!!!'
re.sub('\ |\?|\.|\!|\/|\;|\:', '', line)
'QDoIwriteNo'
Explanation
In regular expressions (regex), | is a logical OR and \ escapes spaces and special characters that might be actual regex commands. Whereas sub stands for substitution, in this case with the empty string ''.
The asker almost had it. Like most things in Python, the answer is simpler than you think.
>>> line = "H E?.LL!/;O:: "
>>> for char in ' ?.!/;:':
... line = line.replace(char,'')
...
>>> print line
HELLO
You don't have to do the nested if/for loop thing, but you DO need to check each character individually.
For the inverse requirement of only allowing certain characters in a string, you can use regular expressions with a set complement operator [^ABCabc]. For example, to remove everything except ascii letters, digits, and the hyphen:
>>> import string
>>> import re
>>>
>>> phrase = ' There were "nine" (9) chick-peas in my pocket!!! '
>>> allow = string.letters + string.digits + '-'
>>> re.sub('[^%s]' % allow, '', phrase)
'Therewerenine9chick-peasinmypocket'
From the python regular expression documentation:
Characters that are not within a range can be matched by complementing
the set. If the first character of the set is '^', all the characters
that are not in the set will be matched. For example, [^5] will match
any character except '5', and [^^] will match any character except
'^'. ^ has no special meaning if it’s not the first character in the
set.
line = line.translate(None, " ?.!/;:")
>>> s = 'a1b2c3'
>>> ''.join(c for c in s if c not in '123')
'abc'
Strings are immutable in Python. The replace method returns a new string after the replacement. Try:
for char in line:
if char in " ?.!/;:":
line = line.replace(char,'')
This is identical to your original code, with the addition of an assignment to line inside the loop.
Note that the string replace() method replaces all of the occurrences of the character in the string, so you can do better by using replace() for each character you want to remove, instead of looping over each character in your string.
I was surprised that no one had yet recommended using the builtin filter function.
import operator
import string # only for the example you could use a custom string
s = "1212edjaq"
Say we want to filter out everything that isn't a number. Using the filter builtin method "...is equivalent to the generator expression (item for item in iterable if function(item))" [Python 3 Builtins: Filter]
sList = list(s)
intsList = list(string.digits)
obj = filter(lambda x: operator.contains(intsList, x), sList)))
In Python 3 this returns
>> <filter object # hex>
To get a printed string,
nums = "".join(list(obj))
print(nums)
>> "1212"
I am not sure how filter ranks in terms of efficiency but it is a good thing to know how to use when doing list comprehensions and such.
UPDATE
Logically, since filter works you could also use list comprehension and from what I have read it is supposed to be more efficient because lambdas are the wall street hedge fund managers of the programming function world. Another plus is that it is a one-liner that doesnt require any imports. For example, using the same string 's' defined above,
num = "".join([i for i in s if i.isdigit()])
That's it. The return will be a string of all the characters that are digits in the original string.
If you have a specific list of acceptable/unacceptable characters you need only adjust the 'if' part of the list comprehension.
target_chars = "".join([i for i in s if i in some_list])
or alternatively,
target_chars = "".join([i for i in s if i not in some_list])
Using filter, you'd just need one line
line = filter(lambda char: char not in " ?.!/;:", line)
This treats the string as an iterable and checks every character if the lambda returns True:
>>> help(filter)
Help on built-in function filter in module __builtin__:
filter(...)
filter(function or None, sequence) -> list, tuple, or string
Return those items of sequence for which function(item) is true. If
function is None, return the items that are true. If sequence is a tuple
or string, return the same type, else return a list.
Try this one:
def rm_char(original_str, need2rm):
''' Remove charecters in "need2rm" from "original_str" '''
return original_str.translate(str.maketrans('','',need2rm))
This method works well in Python 3
Here's some possible ways to achieve this task:
def attempt1(string):
return "".join([v for v in string if v not in ("a", "e", "i", "o", "u")])
def attempt2(string):
for v in ("a", "e", "i", "o", "u"):
string = string.replace(v, "")
return string
def attempt3(string):
import re
for v in ("a", "e", "i", "o", "u"):
string = re.sub(v, "", string)
return string
def attempt4(string):
return string.replace("a", "").replace("e", "").replace("i", "").replace("o", "").replace("u", "")
for attempt in [attempt1, attempt2, attempt3, attempt4]:
print(attempt("murcielago"))
PS: Instead using " ?.!/;:" the examples use the vowels... and yeah, "murcielago" is the Spanish word to say bat... funny word as it contains all the vowels :)
PS2: If you're interested on performance you could measure these attempts with a simple code like:
import timeit
K = 1000000
for i in range(1,5):
t = timeit.Timer(
f"attempt{i}('murcielago')",
setup=f"from __main__ import attempt{i}"
).repeat(1, K)
print(f"attempt{i}",min(t))
In my box you'd get:
attempt1 2.2334518376057244
attempt2 1.8806643818474513
attempt3 7.214925774955572
attempt4 1.7271184513757465
So it seems attempt4 is the fastest one for this particular input.
Here's my Python 2/3 compatible version. Since the translate api has changed.
def remove(str_, chars):
"""Removes each char in `chars` from `str_`.
Args:
str_: String to remove characters from
chars: String of to-be removed characters
Returns:
A copy of str_ with `chars` removed
Example:
remove("What?!?: darn;", " ?.!:;") => 'Whatdarn'
"""
try:
# Python2.x
return str_.translate(None, chars)
except TypeError:
# Python 3.x
table = {ord(char): None for char in chars}
return str_.translate(table)
#!/usr/bin/python
import re
strs = "how^ much for{} the maple syrup? $20.99? That's[] ricidulous!!!"
print strs
nstr = re.sub(r'[?|$|.|!|a|b]',r' ',strs)#i have taken special character to remove but any #character can be added here
print nstr
nestr = re.sub(r'[^a-zA-Z0-9 ]',r'',nstr)#for removing special character
print nestr
You can also use a function in order to substitute different kind of regular expression or other pattern with the use of a list. With that, you can mixed regular expression, character class, and really basic text pattern. It's really useful when you need to substitute a lot of elements like HTML ones.
*NB: works with Python 3.x
import re # Regular expression library
def string_cleanup(x, notwanted):
for item in notwanted:
x = re.sub(item, '', x)
return x
line = "<title>My example: <strong>A text %very% $clean!!</strong></title>"
print("Uncleaned: ", line)
# Get rid of html elements
html_elements = ["<title>", "</title>", "<strong>", "</strong>"]
line = string_cleanup(line, html_elements)
print("1st clean: ", line)
# Get rid of special characters
special_chars = ["[!##$]", "%"]
line = string_cleanup(line, special_chars)
print("2nd clean: ", line)
In the function string_cleanup, it takes your string x and your list notwanted as arguments. For each item in that list of elements or pattern, if a substitute is needed it will be done.
The output:
Uncleaned: <title>My example: <strong>A text %very% $clean!!</strong></title>
1st clean: My example: A text %very% $clean!!
2nd clean: My example: A text very clean
My method I'd use probably wouldn't work as efficiently, but it is massively simple. I can remove multiple characters at different positions all at once, using slicing and formatting.
Here's an example:
words = "things"
removed = "%s%s" % (words[:3], words[-1:])
This will result in 'removed' holding the word 'this'.
Formatting can be very helpful for printing variables midway through a print string. It can insert any data type using a % followed by the variable's data type; all data types can use %s, and floats (aka decimals) and integers can use %d.
Slicing can be used for intricate control over strings. When I put words[:3], it allows me to select all the characters in the string from the beginning (the colon is before the number, this will mean 'from the beginning to') to the 4th character (it includes the 4th character). The reason 3 equals till the 4th position is because Python starts at 0. Then, when I put word[-1:], it means the 2nd last character to the end (the colon is behind the number). Putting -1 will make Python count from the last character, rather than the first. Again, Python will start at 0. So, word[-1:] basically means 'from the second last character to the end of the string.
So, by cutting off the characters before the character I want to remove and the characters after and sandwiching them together, I can remove the unwanted character. Think of it like a sausage. In the middle it's dirty, so I want to get rid of it. I simply cut off the two ends I want then put them together without the unwanted part in the middle.
If I want to remove multiple consecutive characters, I simply shift the numbers around in the [] (slicing part). Or if I want to remove multiple characters from different positions, I can simply sandwich together multiple slices at once.
Examples:
words = "control"
removed = "%s%s" % (words[:2], words[-2:])
removed equals 'cool'.
words = "impacts"
removed = "%s%s%s" % (words[1], words[3:5], words[-1])
removed equals 'macs'.
In this case, [3:5] means character at position 3 through character at position 5 (excluding the character at the final position).
Remember, Python starts counting at 0, so you will need to as well.
In Python 3.5
e.g.,
os.rename(file_name, file_name.translate({ord(c): None for c in '0123456789'}))
To remove all the number from the string
How about this:
def text_cleanup(text):
new = ""
for i in text:
if i not in " ?.!/;:":
new += i
return new
Below one.. with out using regular expression concept..
ipstring ="text with symbols!##$^&*( ends here"
opstring=''
for i in ipstring:
if i.isalnum()==1 or i==' ':
opstring+=i
pass
print opstring
Recursive split:
s=string ; chars=chars to remove
def strip(s,chars):
if len(s)==1:
return "" if s in chars else s
return strip(s[0:int(len(s)/2)],chars) + strip(s[int(len(s)/2):len(s)],chars)
example:
print(strip("Hello!","lo")) #He!
You could use the re module's regular expression replacement. Using the ^ expression allows you to pick exactly what you want from your string.
import re
text = "This is absurd!"
text = re.sub("[^a-zA-Z]","",text) # Keeps only Alphabets
print(text)
Output to this would be "Thisisabsurd". Only things specified after the ^ symbol will appear.
# for each file on a directory, rename filename
file_list = os.listdir (r"D:\Dev\Python")
for file_name in file_list:
os.rename(file_name, re.sub(r'\d+','',file_name))
Even the below approach works
line = "a,b,c,d,e"
alpha = list(line)
while ',' in alpha:
alpha.remove(',')
finalString = ''.join(alpha)
print(finalString)
output: abcde
The string method replace does not modify the original string. It leaves the original alone and returns a modified copy.
What you want is something like: line = line.replace(char,'')
def replace_all(line, )for char in line:
if char in " ?.!/;:":
line = line.replace(char,'')
return line
However, creating a new string each and every time that a character is removed is very inefficient. I recommend the following instead:
def replace_all(line, baddies, *):
"""
The following is documentation on how to use the class,
without reference to the implementation details:
For implementation notes, please see comments begining with `#`
in the source file.
[*crickets chirp*]
"""
is_bad = lambda ch, baddies=baddies: return ch in baddies
filter_baddies = lambda ch, *, is_bad=is_bad: "" if is_bad(ch) else ch
mahp = replace_all.map(filter_baddies, line)
return replace_all.join('', join(mahp))
# -------------------------------------------------
# WHY `baddies=baddies`?!?
# `is_bad=is_bad`
# -------------------------------------------------
# Default arguments to a lambda function are evaluated
# at the same time as when a lambda function is
# **defined**.
#
# global variables of a lambda function
# are evaluated when the lambda function is
# **called**
#
# The following prints "as yellow as snow"
#
# fleece_color = "white"
# little_lamb = lambda end: return "as " + fleece_color + end
#
# # sometime later...
#
# fleece_color = "yellow"
# print(little_lamb(" as snow"))
# --------------------------------------------------
replace_all.map = map
replace_all.join = str.join
If you want your string to be just allowed characters by using ASCII codes, you can use this piece of code:
for char in s:
if ord(char) < 96 or ord(char) > 123:
s = s.replace(char, "")
It will remove all the characters beyond a....z even upper cases.

How to extract date from a filename using regular expression [duplicate]

This question already has an answer here:
How to write word boundary inside character class in python without losing its meaning? I wish to add underscore(_) in definition of word boundary(\b)
(1 answer)
Closed 2 years ago.
I have to handle a long filename in a specific format that contains two dates and someone's full name. Here is a template that describes this format:
firstname_middlename_lastname_yyyy-mm-dd_text1_text2_yyyy-mm-dd.xls
How to extract the fullname, first date, and second date from that filename using regular expression?
I've tried to extract the first date like:
string1 = 'CHEN_MOU_MOU_1999-04-11_Scientific_Report_2020-03-14.xlsx'
ptn = re.compile('\b(\d{4}-\d{2}-\d{2})\b')
print(ptn.match(string1))
But it doesn't seem to work. The output I get is None.
Any help will be appreciated.
The reason your solution does not work is because _ is considered an alphanumeric character in Python.
From docs:
\w
Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].
So \b does not match _ in your string. But it'll match -.
From docs:
\b
This is a zero-width assertion that matches only at the beginning or end of a word. A word is defined as a sequence of alphanumeric characters, so the end of a word is indicated by whitespace or a non-alphanumeric character.
But if you replace _ around your dates with a - (hyphen), then your solution works just fine.
>>> import re
>>> string1 = 'CHEN_MOU_MOU-1999-04-11-Scientific Report-2020-03-14.xlsx'
>>> ptn = re.compile(r'\b(\d{4}-\d{2}-\d{2})\b')
>>> ptn.findall(string1)
['1999-04-11', '2020-03-14']
Following is a solution that should work for your task:
$ python
Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 21:26:53) [MSC v.1916 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> string1 = 'CHEN_MOU_MOU_1999-04-11_Scientific_Report_2020-03-14.xlsx'
>>> fullnamepattern = r'[a-zA-Z]+_[a-zA-Z]+_[a-zA-Z]+'
>>> datepattern = r'\d{4}-\d{2}-\d{2}'
>>> re.search(fullnamepattern, string1).group()
'CHEN_MOU_MOU'
>>> re.findall(datepattern, string1)
['1999-04-11', '2020-03-14']
My two cents...
import re
pattern = re.compile(r'^(.*?)_(?=\d)(.*?)_(.*)_(.*?)\.(.*)$')
string = 'CHEN_MOU_MOU_1999-04-11_Scientific_Report_2020-03-14.xlsx'
split_filename = pattern.findall(string)
split_filename
Output:
[('CHEN_MOU_MOU', '1999-04-11', 'Scientific_Report', '2020-03-14', 'xlsx')]
Running...
split_filename[0][3]
shows...
'2020-03-14'
At least this way you can pick what you would like to read/check/record. Just change the [3] above to another number to get a different part...
split_filename[0][0]
'CHEN_MOU_MOU'
split_filename[0][4]
'xlsx'
This will work for you:
\d[^_\.]+
\d match a digit
[^_\.]+ match all except "." or "_"
Try this to extract all dates with the format yyyy-mm-dd in the string :
string1 = 'CHEN_MOU_MOU_1999-04-11_Scientific_Report_2020-03-14.xlsx'
ptn = re.compile("\d{4}-\d{2}-\d{2}")
all_dates = ptn.findall(string1)
# ['1999-04-11', '2020-03-14']
full_name = " ".join(string1.split('_')[:3])
# 'CHEN MOU MOU'
You can extract dates using reguler expression but you can extract full name easily using split function try following code to get more idea
string1 = 'CHEN_MOU_MOU_1999-04-11_Scientific_Report_2020-03-14.xlsx'
pattern = re.compile("\d{4}-\d{2}-\d{2}")
dates=pattern.findall(string1) # it will return first date and last date in dates's array
fullname = string1.split("_") # split data using _ character and it stored in array
fullname = " ".join(fullname[:3]) #join with fullname blank space (first three data)
Or you can join last two line in one
fullname = " ".join(string1.split("_")[:3])
Please let me know what you think about it

Python: any way to perform this "hybrid" split() on multi-lingual (e.g. Chinese & English) strings?

I have strings that are multi-lingual consist of both languages that use whitespace as word separator (English, French, etc) and languages that don't (Chinese, Japanese, Korean).
Given such a string, I want to separate the English/French/etc part into words using whitespace as separator, and to separate the Chinese/Japanese/Korean part into individual characters.
And I want to put of all those separated components into a list.
Some examples would probably make this clear:
Case 1: English-only string. This case is easy:
>>> "I love Python".split()
['I', 'love', 'Python']
Case 2: Chinese-only string:
>>> list(u"我爱蟒蛇")
[u'\u6211', u'\u7231', u'\u87d2', u'\u86c7']
In this case I can turn the string into a list of Chinese characters. But within the list I'm getting unicode representations:
[u'\u6211', u'\u7231', u'\u87d2', u'\u86c7']
How do I get it to display the actual characters instead of the unicode? Something like:
['我', '爱', '蟒', '蛇']
??
Case 3: A mix of English & Chinese:
I want to turn an input string such as
"我爱Python"
and turns it into a list like this:
['我', '爱', 'Python']
Is it possible to do something like that?
I thought I'd show the regex approach, too. It doesn't feel right to me, but that's mostly because all of the language-specific i18n oddnesses I've seen makes me worried that a regular expression might not be flexible enough for all of them--but you may well not need any of that. (In other words--overdesign.)
# -*- coding: utf-8 -*-
import re
def group_words(s):
regex = []
# Match a whole word:
regex += [ur'\w+']
# Match a single CJK character:
regex += [ur'[\u4e00-\ufaff]']
# Match one of anything else, except for spaces:
regex += [ur'[^\s]']
regex = "|".join(regex)
r = re.compile(regex)
return r.findall(s)
if __name__ == "__main__":
print group_words(u"Testing English text")
print group_words(u"我爱蟒蛇")
print group_words(u"Testing English text我爱蟒蛇")
In practice, you'd probably want to only compile the regex once, not on each call. Again, filling in the particulars of character grouping is up to you.
In Python 3, it also splits the number if you needed.
def spliteKeyWord(str):
regex = r"[\u4e00-\ufaff]|[0-9]+|[a-zA-Z]+\'*[a-z]*"
matches = re.findall(regex, str, re.UNICODE)
return matches
print(spliteKeyWord("Testing English text我爱Python123"))
=> ['Testing', 'English', 'text', '我', '爱', 'Python', '123']
Formatting a list shows the repr of its components. If you want to view the strings naturally rather than escaped, you'll need to format it yourself. (repr should not be escaping these characters; repr(u'我') should return "u'我'", not "u'\\u6211'. Apparently this does happen in Python 3; only 2.x is stuck with the English-centric escaping for Unicode strings.)
A basic algorithm you can use is assigning a character class to each character, then grouping letters by class. Starter code is below.
I didn't use a doctest for this because I hit some odd encoding issues that I don't want to look into (out of scope). You'll need to implement a correct grouping function.
Note that if you're using this for word wrapping, there are other per-language considerations. For example, you don't want to break on non-breaking spaces; you do want to break on hyphens; for Japanese you don't want to split apart きゅ; and so on.
# -*- coding: utf-8 -*-
import itertools, unicodedata
def group_words(s):
# This is a closure for key(), encapsulated in an array to work around
# 2.x's lack of the nonlocal keyword.
sequence = [0x10000000]
def key(part):
val = ord(part)
if part.isspace():
return 0
# This is incorrect, but serves this example; finding a more
# accurate categorization of characters is up to the user.
asian = unicodedata.category(part) == "Lo"
if asian:
# Never group asian characters, by returning a unique value for each one.
sequence[0] += 1
return sequence[0]
return 2
result = []
for key, group in itertools.groupby(s, key):
# Discard groups of whitespace.
if key == 0:
continue
str = "".join(group)
result.append(str)
return result
if __name__ == "__main__":
print group_words(u"Testing English text")
print group_words(u"我爱蟒蛇")
print group_words(u"Testing English text我爱蟒蛇")
Modified Glenn's solution to drop symbols and work for Russian, French, etc alphabets:
def rec_group_words():
regex = []
# Match a whole word:
regex += [r'[A-za-z0-9\xc0-\xff]+']
# Match a single CJK character:
regex += [r'[\u4e00-\ufaff]']
regex = "|".join(regex)
return re.compile(regex)
The following works for python3.7:
import re
def group_words(s):
return re.findall(u'[\u4e00-\u9fff]|[a-zA-Z0-9]+', s)
if __name__ == "__main__":
print(group_words(u"Testing English text"))
print(group_words(u"我爱蟒蛇"))
print(group_words(u"Testing English text我爱蟒蛇"))
['Testing', 'English', 'text']
['我', '爱', '蟒', '蛇']
['Testing', 'English', 'text', '我', '爱', '蟒', '蛇']
For some reason, I cannot adapt Glenn Maynard's answer to python3.

Case insensitive replace

What's the easiest way to do a case-insensitive string replacement in Python?
The string type doesn't support this. You're probably best off using the regular expression sub method with the re.IGNORECASE option.
>>> import re
>>> insensitive_hippo = re.compile(re.escape('hippo'), re.IGNORECASE)
>>> insensitive_hippo.sub('giraffe', 'I want a hIPpo for my birthday')
'I want a giraffe for my birthday'
import re
pattern = re.compile("hello", re.IGNORECASE)
pattern.sub("bye", "hello HeLLo HELLO")
# 'bye bye bye'
In a single line:
import re
re.sub("(?i)hello","bye", "hello HeLLo HELLO") #'bye bye bye'
re.sub("(?i)he\.llo","bye", "he.llo He.LLo HE.LLO") #'bye bye bye'
Or, use the optional "flags" argument:
import re
re.sub("hello", "bye", "hello HeLLo HELLO", flags=re.I) #'bye bye bye'
re.sub("he\.llo", "bye", "he.llo He.LLo HE.LLO", flags=re.I) #'bye bye bye'
Continuing on bFloch's answer, this function will change not one, but all occurrences of old with new - in a case insensitive fashion.
def ireplace(old, new, text):
idx = 0
while idx < len(text):
index_l = text.lower().find(old.lower(), idx)
if index_l == -1:
return text
text = text[:index_l] + new + text[index_l + len(old):]
idx = index_l + len(new)
return text
Like Blair Conrad says string.replace doesn't support this.
Use the regex re.sub, but remember to escape the replacement string first. Note that there's no flags-option in 2.6 for re.sub, so you'll have to use the embedded modifier '(?i)' (or a RE-object, see Blair Conrad's answer). Also, another pitfall is that sub will process backslash escapes in the replacement text, if a string is given. To avoid this one can instead pass in a lambda.
Here's a function:
import re
def ireplace(old, repl, text):
return re.sub('(?i)'+re.escape(old), lambda m: repl, text)
>>> ireplace('hippo?', 'giraffe!?', 'You want a hiPPO?')
'You want a giraffe!?'
>>> ireplace(r'[binfolder]', r'C:\Temp\bin', r'[BinFolder]\test.exe')
'C:\\Temp\\bin\\test.exe'
This function uses both the str.replace() and re.findall() functions.
It will replace all occurences of pattern in string with repl in a case-insensitive way.
def replace_all(pattern, repl, string) -> str:
occurences = re.findall(pattern, string, re.IGNORECASE)
for occurence in occurences:
string = string.replace(occurence, repl)
return string
This doesn't require RegularExp
def ireplace(old, new, text):
"""
Replace case insensitive
Raises ValueError if string not found
"""
index_l = text.lower().index(old.lower())
return text[:index_l] + new + text[index_l + len(old):]
An interesting observation about syntax details and options:
# Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 23:09:28) [MSC v.1916 64 bit (AMD64)] on win32
>>> import re
>>> old = "TREEROOT treeroot TREerOot"
>>> re.sub(r'(?i)treeroot', 'grassroot', old)
'grassroot grassroot grassroot'
>>> re.sub(r'treeroot', 'grassroot', old)
'TREEROOT grassroot TREerOot'
>>> re.sub(r'treeroot', 'grassroot', old, flags=re.I)
'grassroot grassroot grassroot'
>>> re.sub(r'treeroot', 'grassroot', old, re.I)
'TREEROOT grassroot TREerOot'
Using the (?i) prefix in the match expression or adding flags=re.I as a fourth argument will result in a case-insensitive match - however using just re.I as the fourth argument does not result in case-insensitive match.
For comparison:
>>> re.findall(r'treeroot', old, re.I)
['TREEROOT', 'treeroot', 'TREerOot']
>>> re.findall(r'treeroot', old)
['treeroot']
I was having \t being converted to the escape sequences (scroll a bit down), so I noted that re.sub converts backslashed escaped characters to escape sequences.
To prevent that I wrote the following:
Replace case insensitive.
import re
def ireplace(findtxt, replacetxt, data):
return replacetxt.join( re.compile(findtxt, flags=re.I).split(data) )
Also, if you want it to replace with the escape characters, like the other answers here that are getting the special meaning bashslash characters converted to escape sequences, just decode your find and, or replace string. In Python 3, might have to do something like .decode("unicode_escape") # python3
findtxt = findtxt.decode('string_escape') # python2
replacetxt = replacetxt.decode('string_escape') # python2
data = ireplace(findtxt, replacetxt, data)
Tested in Python 2.7.8
i='I want a hIPpo for my birthday'
key='hippo'
swp='giraffe'
o=(i.lower().split(key))
c=0
p=0
for w in o:
o[c]=i[p:p+len(w)]
p=p+len(key+w)
c+=1
print(swp.join(o))

Stripping non printable characters from a string in python

I use to run
$s =~ s/[^[:print:]]//g;
on Perl to get rid of non printable characters.
In Python there's no POSIX regex classes, and I can't write [:print:] having it mean what I want. I know of no way in Python to detect if a character is printable or not.
What would you do?
EDIT: It has to support Unicode characters as well. The string.printable way will happily strip them out of the output.
curses.ascii.isprint will return false for any unicode character.
Iterating over strings is unfortunately rather slow in Python. Regular expressions are over an order of magnitude faster for this kind of thing. You just have to build the character class yourself. The unicodedata module is quite helpful for this, especially the unicodedata.category() function. See Unicode Character Database for descriptions of the categories.
import unicodedata, re, itertools, sys
all_chars = (chr(i) for i in range(sys.maxunicode))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(chr, itertools.chain(range(0x00,0x20), range(0x7f,0xa0))))
control_char_re = re.compile('[%s]' % re.escape(control_chars))
def remove_control_chars(s):
return control_char_re.sub('', s)
For Python2
import unicodedata, re, sys
all_chars = (unichr(i) for i in xrange(sys.maxunicode))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(unichr, range(0x00,0x20) + range(0x7f,0xa0)))
control_char_re = re.compile('[%s]' % re.escape(control_chars))
def remove_control_chars(s):
return control_char_re.sub('', s)
For some use-cases, additional categories (e.g. all from the control group might be preferable, although this might slow down the processing time and increase memory usage significantly. Number of characters per category:
Cc (control): 65
Cf (format): 161
Cs (surrogate): 2048
Co (private-use): 137468
Cn (unassigned): 836601
Edit Adding suggestions from the comments.
As far as I know, the most pythonic/efficient method would be:
import string
filtered_string = filter(lambda x: x in string.printable, myStr)
You could try setting up a filter using the unicodedata.category() function:
import unicodedata
printable = {'Lu', 'Ll'}
def filter_non_printable(str):
return ''.join(c for c in str if unicodedata.category(c) in printable)
See Table 4-9 on page 175 in the Unicode database character properties for the available categories
The following will work with Unicode input and is rather fast...
import sys
# build a table mapping all non-printable characters to None
NOPRINT_TRANS_TABLE = {
i: None for i in range(0, sys.maxunicode + 1) if not chr(i).isprintable()
}
def make_printable(s):
"""Replace non-printable characters in a string."""
# the translate method on str removes characters
# that map to None from the string
return s.translate(NOPRINT_TRANS_TABLE)
assert make_printable('Café') == 'Café'
assert make_printable('\x00\x11Hello') == 'Hello'
assert make_printable('') == ''
My own testing suggests this approach is faster than functions that iterate over the string and return a result using str.join.
In Python 3,
def filter_nonprintable(text):
import itertools
# Use characters of control category
nonprintable = itertools.chain(range(0x00,0x20),range(0x7f,0xa0))
# Use translate to remove all non-printable characters
return text.translate({character:None for character in nonprintable})
See this StackOverflow post on removing punctuation for how .translate() compares to regex & .replace()
The ranges can be generated via nonprintable = (ord(c) for c in (chr(i) for i in range(sys.maxunicode)) if unicodedata.category(c)=='Cc') using the Unicode character database categories as shown by #Ants Aasma.
This function uses list comprehensions and str.join, so it runs in linear time instead of O(n^2):
from curses.ascii import isprint
def printable(input):
return ''.join(char for char in input if isprint(char))
Yet another option in python 3:
re.sub(f'[^{re.escape(string.printable)}]', '', my_string)
Based on #Ber's answer, I suggest removing only control characters as defined in the Unicode character database categories:
import unicodedata
def filter_non_printable(s):
return ''.join(c for c in s if not unicodedata.category(c).startswith('C'))
The best I've come up with now is (thanks to the python-izers above)
def filter_non_printable(str):
return ''.join([c for c in str if ord(c) > 31 or ord(c) == 9])
This is the only way I've found out that works with Unicode characters/strings
Any better options?
In Python there's no POSIX regex classes
There are when using the regex library: https://pypi.org/project/regex/
It is well maintained and supports Unicode regex, Posix regex and many more. The usage (method signatures) is very similar to Python's re.
From the documentation:
[[:alpha:]]; [[:^alpha:]]
POSIX character classes are supported. These
are normally treated as an alternative form of \p{...}.
(I'm not affiliated, just a user.)
An elegant pythonic solution to stripping 'non printable' characters from a string in python is to use the isprintable() string method together with a generator expression or list comprehension depending on the use case ie. size of the string:
''.join(c for c in my_string if c.isprintable())
str.isprintable()
Return True if all characters in the string are printable or the string is empty, False otherwise. Nonprintable characters are those characters defined in the Unicode character database as “Other” or “Separator”, excepting the ASCII space (0x20) which is considered printable. (Note that printable characters in this context are those which should not be escaped when repr() is invoked on a string. It has no bearing on the handling of strings written to sys.stdout or sys.stderr.)
The one below performs faster than the others above. Take a look
''.join([x if x in string.printable else '' for x in Str])
Adapted from answers by Ants Aasma and shawnrad:
nonprintable = set(map(chr, list(range(0,32)) + list(range(127,160))))
ord_dict = {ord(character):None for character in nonprintable}
def filter_nonprintable(text):
return text.translate(ord_dict)
#use
str = "this is my string"
str = filter_nonprintable(str)
print(str)
tested on Python 3.7.7
To remove 'whitespace',
import re
t = """
\n\t<p> </p>\n\t<p> </p>\n\t<p> </p>\n\t<p> </p>\n\t<p>
"""
pat = re.compile(r'[\t\n]')
print(pat.sub("", t))
Error description
Run the copied and pasted python code report:
Python invalid non-printable character U+00A0
The cause of the error
The space in the copied code is not the same as the format in Python;
Solution
Delete the space and re-enter the space. For example, the red part in the above picture is an abnormal space. Delete and re-enter the space to run;
Source : Python invalid non-printable character U+00A0
I used this:
import sys
import unicodedata
# the test string has embedded characters, \u2069 \u2068
test_string = """"ABC⁩.⁨ 6", "}"""
nonprintable = list((ord(c) for c in (chr(i) for i in range(sys.maxunicode)) if
unicodedata.category(c) in ['Cc','Cf']))
translate_dict = {character: None for character in nonprintable}
print("Before translate, using repr()", repr(test_string))
print("After translate, using repr()", repr(test_string.translate(translate_dict)))

Categories

Resources