python replace multiple words retaining case - python

I'm trying to write a filter in django that highlights words based on a search query. For example, if my string contains this is a sample string that I want to highlight using my filter and my search stubs are sam and ring, my desired output would be:
this is a <mark>sam</mark>ple st<mark>ring</mark> that I want to highlight using my filter
I'm using the answer from here to replace multiple words. I've presented the code below:
import re
words = search_stubs.split()
rep = dict((re.escape(k), '<mark>%s</mark>'%(k)) for k in words)
pattern = re.compile('|'.join(rep.keys()))
text = pattern.sub(lambda m : rep[re.escape(m.group(0))], text_to_replace)
However, when there is case sensitivity, this breaks. For example, if I have the string Check highlight function, and my search stub contains check, this breaks.
The desired output in this case would naturally be:
<mark>Check</mark> highlight function

You don't need to go for dictionary here. (?i) called case-insensitive modifier helps to do a case-insensitive match.
>>> s = "this is a sample string that I want to highlight using my filter"
>>> l = ['sam', 'ring']
>>> re.sub('(?i)(' + '|'.join(map(re.escape, l)) + ')', r'<mark>\1</mark>', s)
'this is a <mark>sam</mark>ple st<mark>ring</mark> that I want to highlight using my filter'
EXample 2:
>>> s = 'Check highlight function'
>>> l = ['check']
>>> re.sub('(?i)(' + '|'.join(map(re.escape, l)) + ')', r'<mark>\1</mark>', s)
'<mark>Check</mark> highlight function'

The simple way to do this is to not try to build a dict mapping every single word to its marked-up equivalent, and just use a capturing group and a reference to it. Then you can just use the IGNORECASE flag to do a case-insensitive search.
pattern = re.compile('({})'.format('|'.join(map(re.escape, words))),
re.IGNORECASE)
text = pattern.sub(r'<mark>\1</mark>', text_to_replace)
For example, if text_to_replace were:
I am Sam. Sam I am. I will not eat green eggs and spam.
… then text will be:
I am <mark>Sam</mark>. <mark>Sam</mark> I am. I will not eat green eggs and spam
If you really did want to do it your way, you could. For example:
text = pattern.sub(lambda m: rep[re.escape(m.group(0))].replace(m, m.group(0)),
text_to_replace)
But that would be kind of silly. You're building a dict with 'sam' embedded in the value, just so you can replace that 'sam' with the 'Sam' that you actually matched.
See Grouping in the Regular Expression HOWTO for more on groups and references, and the re.sub docs for specifics on using references in substitutions.

Related

Multiple regex substitutions using a dict with regex expressions as keys

I want to make multiple substitutions to a string using multiple regular expressions. I also want to make the substitutions in a single pass to avoid creating multiple instances of the string.
Let's say for argument that I want to make the substitutions below, while avoiding multiple use of re.sub(), whether explicitly or with a loop:
import re
text = "local foals drink cola"
text = re.sub("(?<=o)a", "w", text)
text = re.sub("l(?=a)", "co", text)
print(text) # "local fowls drink cocoa"
The closest solution I have found for this is to compile a regular expression from a dictionary of substitution targets and then to use a lambda function to replace each matched target with its value in the dictionary. However, this approach does not work when using metacharacters, thus removing the functionality needed from regular expressions in this example.
Let me demonstrate first with an example that works without metacharacters:
import re
text = "local foals drink cola"
subs_dict = {"a":"w", "l":"co"}
subs_regex = re.compile("|".join(subs_dict.keys()))
text = re.sub(subs_regex, lambda match: subs_dict[match.group(0)], text)
print(text) # "coocwco fowcos drink cocow"
Now observe that adding the desired metacharacters to the dictionary keys results in a KeyError:
import re
text = "local foals drink cola"
subs_dict = {"(?<=o)a":"w", "l(?=a)":"co"}
subs_regex = re.compile("|".join(subs_dict.keys()))
text = re.sub(subs_regex, lambda match: subs_dict[match.group(0)], text)
>>> KeyError: "a"
The reason for this is that the sub() function correctly finds a match for the expression "(?<=o)a", so this must now be found in the dictionary to return its substitution, but the value submitted for dictionary lookup by match.group(0) is the corresponding matched string "a". It also does not work to search for match.re in the dictionary (i.e. the expression that produced the match) because the value of that is the whole disjoint expression that was compiled from the dictionary keys (i.e. "(?<=o)a|l(?=a)").
EDIT: In case anyone would benefit from seeing thejonny's solution implemented with a lambda function as close to my originals as possible, it would work like this:
import re
text = "local foals drink cola"
subs_dict = {"(?<=o)a":"w", "l(?=a)":"co"}
subs_regex = re.compile("|".join("("+key+")" for key in subs_dict))
group_index = 1
indexed_subs = {}
for target, sub in subs_dict.items():
indexed_subs[group_index] = sub
group_index += re.compile(target).groups + 1
text = re.sub(subs_regex, lambda match: indexed_subs[match.lastindex], text)
print(text) # "local fowls drink cocoa"
If no expression you want to use matches an empty string (which is a valid assumption if you want to replace), you can use groups before |ing the expressions, and then check which group found a match:
(exp1)|(exp2)|(exp3)
Or maybe named groups so you don't have to count the subgroups inside the subexpressions.
The replacement function than can look which group matched, and chose the replacement from a list.
I came up with this implementation:
import re
def dictsub(replacements, string):
"""things has the form {"regex1": "replacement", "regex2": "replacement2", ...}"""
exprall = re.compile("|".join("("+x+")" for x in replacements))
gi = 1
replacements_by_gi = {}
for (expr, replacement) in replacements.items():
replacements_by_gi[gi] = replacement
gi += re.compile(expr).groups + 1
def choose(match):
return replacements_by_gi[match.lastindex]
return re.sub(exprall, choose, string)
text = "local foals drink cola"
print(dictsub({"(?<=o)a":"w", "l(?=a)":"co"}, text))
that prints local fowls drink cocoa
You could do this by keeping your key as the expected match and storing both your replace and regex in a nested dict. Given you're looking to match specific chars, this definition should work.
subs_dict = {"a": {'replace': 'w', 'regex': '(?<=o)a'}, 'l': {'replace': 'co', 'regex': 'l(?=a)'}}
subs_regex = re.compile("|".join([subs_dict[k]['regex'] for k in subs_dict.keys()]))
re.sub(subs_regex, lambda match: subs_dict[match.group(0)]['replace'], text)
'local fowls drink cocoa'

Delete substring not matching regex in Python

I have a string like:
'class="a", class="b", class="ab", class="body", class="etc"'
I want to delete everything except class="a" and class="b".
How can I do it? I think the problem is easy but I'm stuck.
Here is some one of my attempts but it didn't solve my problem:
re.sub(r'class="also"|class="etc"', '', a)
My string is a very long HTML code with a lot of classes and I want to only keep two of them and drop all the others.
Some times its good to make a break. I found solution for me with bleach
def filter_class(name, value):
if name == 'class' and value == 'aaa':
return True
attrs = {
'div': filter_class,
}
bleach.clean(html, tags=('div'), attributes=attrs, strip_comments=True)
You tried to explicitly enumerate those substrings you wanted to delete. Rather than writing such long patterns, you can just use negative lookaheads that provide a means to add exclusions to some more generic pattern.
Here is a regex you can use to remove those substrings in a clean way and disregarding order:
,? ?\bclass="(?![ab]")[^"]+"
See regex demo
Here, with (?![ab]")[^"]+, we match 1 or more characters other than " ([^"]+), but not those equal to a or b ((?![ab]")).
Here is a sample code:
import re
p = re.compile(r',? ?\bclass="(?![ab]")[^"]+"')
test_str = "class=\"a\", class=\"b\", class=\"ab\", class=\"body\", class=\"etc\"\nclass=\"b\", class=\"ab\", class=\"body\", class=\"etc\", class=\"a\"\nclass=\"b\", class=\"ab\", class=\"body\", class=\"a\", class=\"etc\""
result = re.sub(p, '', test_str)
print(result)
See IDEONE demo
NOTE: If instead of a and b you have longer sequences, use a (?!(?:a|b) non-capturing group in the look-ahead instead of a character class:
,? ?\bclass="(?!(?:arbuz|baklazhan)")[^"]+"
See another demo
another pretty simple solution.. good luck.
st = 'class="a", class="b", class="ab", class="body", class="etc"'
import re
res = re.findall(r'class="[a-b]"', st)
print res
'['class="a"', 'class="b"']'
you can use re.sub very easily
res = re.sub(r'class="[a-zA-Z][a-zA-Z].*"', "", st)
print res
class="a", class="b"
If you only wanted to keep the first two entries, one approach would be to use the split() function. This will split your string into a list at given separator points. In your case, this could be a comma. The first two list elements can then be joined back together with commas.
text = 'class="a", class="b", class="ab", class="body", class="etc"'
print ",".join(text.split(",")[:2])
Would give class="a", class="b"
If the entries can be anywhere, and for an arbitrary list of wanted classes:
def keep(text, keep_list):
keep_set = set(re.findall("class\w*=\w*[\"'](.*?)[\"']", text)).intersection(set(keep_list))
output_list = ['class="%s"' % a_class for a_class in keep_set]
return ', '.join(output_list)
print keep('class="a", class="b", class="ab", class="body", class="etc"', ["a", "b"])
print keep('class="a", class="b", class="ab", class="body", class="etc"', ["body", "header"])
This would print:
class="a", class="b"
class="body"

Regex in Python 2.7 for extraction of information from Snort log files

I'm trying to extract information from a Snort file using regular expressions. I've sucessfully got the IP's and SID, but I seem to be having trouble with extracting a specific part of the text.
How can I extract part of a Snort log file? The part I'm trying to extract can look like [Classification: example-of-attack] or [Classification: Example of Attack]. However, the first example may have any number of hyphens and whilst the second instance doesn't have any hyphens but contains some capital letters.
How could I extract just example-of-attack or Example-of-Attack?
I unfortunately only know how to search for static words such as:
test = re.search("exact-name", line)
t = test.group()
print t
I've tried many different commands on the web, but I just don't seem to get it.
You can use the following regex:
>>> m = re.search(r'\[Classification:\s*([^]]+)\]', line).group(1)
( Explanation | Working Demo )
You could use look-behinds,
>>> s = "[Classification: example-of-attack]"
>>> m = re.search(r'(?<=Classification: )[^\]]*', s)
>>> m
<_sre.SRE_Match object at 0x7ff54a954370>
>>> m.group()
'example-of-attack'
>>> s = "[Classification: Example of Attack]"
>>> m = re.search(r'(?<=Classification: )[^\]]*', s).group()
>>> m
'Example of Attack'
Use regex module if there are more than one spaces after the string Classification:,
>>> import regex
>>> s = "[Classification: Example of Attack]"
>>> regex.search(r'(?<=Classification:\s+\b)[^\]]*', s).group()
'Example of Attack
'
If you want to match any substring with the pattern [Word: Value], you could use the following regex,
ptrn = r"\[\s*(\w+):\s*([\w\s-]+)\s*\]"
Here I've used two groups, one for the first word ("Classification" in your question) and one for the second (either "example-of-attack" or "Example of Attack"). It also requires opening and closing square brackets. For example,
txt1 = '[Classification: example-of-attack]'
m = re.search( ptrn, txt1 )
>>> m.group(2)
'example-of-attack'

python regex in pyparsing

How do you make the below regex be used in pyparsing? It should return a list of tokens given the regex.
Any help would be greatly appreciated! Thank you!
python regex example in the shell:
>>> re.split("(\w+)(lab)(\d+)", "abclab1", 3)
>>> ['', 'abc', 'lab', '1', '']
I tried this in pyparsing, but I can't seem to figure out how to get it right because the first match is being greedy, i.e the first token will be 'abclab' instead of two tokens 'abc' and 'lab'.
pyparsing example (high level, i.e non working code):
name = 'abclab1'
location = Word(alphas).setResultsName('location')
lab = CaselessLiteral('lab').setResultsName('environment')
identifier = Word(nums).setResultsName('identifier')
expr = location + lab + identifier
match, start, end = expr.scanString(name).next()
print match.asDict()
Pyparsing's classes are pretty much left-to-right, with lookahead implemented using explicit expressions like FollowedBy (for positive lookahead) and NotAny or the '~' operator (for negative lookahead). This allows you to detect a terminator which would normally match an item that is being repeated. For instance, OneOrMore(Word(alphas)) + Literal('end') will never find a match in strings like "start blah blah end", because the terminating 'end' will get swallowed up in the repetition expression in OneOrMore. The fix is to add negative lookahead in the expression being repeated: OneOrMore(~Literal('end') + Word(alphas)) + Literal('end') - that is, before reading another word composed of alphas, first make sure it is not the word 'end'.
This breaks down when the repetition is within a pyparsing class, like Word. Word(alphas) will continue to read alpha characters as long as there is no whitespace to stop the word. You would have to break into this repetition using something very expensive, like Combine(OneOrMore(~Literal('lab') + Word(alphas, exact=1))) - I say expensive because composition of simple tokens using complex Combine expressions will make for a slow parser.
You might be able to compromise by using a regex wrapped in a pyparsing Regex object:
>>> labword = Regex(r'(\w+)(lab)(\d+)')
>>> print labword.parseString("abclab1").dump()
['abclab1']
This does the right kind of grouping and detection, but does not expose the groups themselves. To do that, add names to each group - pyparsing will treat these like results names, and give you access to the individual fields, just as if you had called setResultsName:
>>> labword = Regex(r'(?P<locn>\w+)(?P<env>lab)(?P<identifier>\d+)')
>>> print labword.parseString("abclab1").dump()
['abclab1']
- env: lab
- identifier: 1
- locn: abc
>>> print labword.parseString("abclab1").asDict()
{'identifier': '1', 'locn': 'abc', 'env': 'lab'}
The only other non-regex approach I can think of would be to define an expression to read the whole string, and then break up the parts in a parse action.
If you strip the subgroup sign(the parenthesis), you'll get the right answer:)
>>> re.split("\w+lab\d+", "abclab1")
['', '']

Python: Replace string with prefixStringSuffix keeping original case, but ignoring case when searching for match

So what I'm trying to do is replace a string "keyword" with
"<b>keyword</b>"
in a larger string.
Example:
myString = "HI there. You should higher that person for the job. Hi hi."
keyword = "hi"
result I would want would be:
result = "<b>HI</b> there. You should higher that person for the job.
<b>Hi</b> <b>hi</b>."
I will not know what the keyword until the user types the keyword
and won't know the corpus (myString) until the query is run.
I found a solution that works most of the time, but has some false positives,
namely it would return "<b>hi<b/>gher"which is not what I want. Also note that I
am trying to preserve the case of the original text, and the matching should take
place irrespective of case. so if the keyword is "hi" it should replace
HI with <b>HI</b> and hi with <b>hi</b>.
The closest I have come is using a slightly derived version of this:
http://code.activestate.com/recipes/576715/
but I still could not figure out how to do a second pass of the string to fix all of the false positives mentioned above.
Or using the NLTK's WordPunctTokenizer (which simplifies some things like punctuation)
but I'm not sure how I would put the sentences back together given it does not
have a reverse function and I want to keep the original punctuation of myString. Essential, doing a concatenation of all the tokens does not return the original
string. For example I would not want to replace "7 - 7" with "7-7" when regrouping the tokens into its original text if the original text had "7 - 7".
Hope that was clear enough. Seems like a simple problem, but its a turned out a little more difficult then I thought.
This ok?
>>> import re
>>> myString = "HI there. You should higher that person for the job. Hi hi."
>>> keyword = "hi"
>>> search = re.compile(r'\b(%s)\b' % keyword, re.I)
>>> search.sub('<b>\\1</b>', myString)
'<b>HI</b> there. You should higher that person for the job. <b>Hi</b> <b>hi</b>.'
The key to the whole thing is using word boundaries, groups and the re.I flag.
You should be able to do this very easily with re.sub using the word boundary assertion \b, which only matches at a word boundary:
import re
def SurroundWith(text, keyword, before, after):
regex = re.compile(r'\b%s\b' % keyword, re.IGNORECASE)
return regex.sub(r'%s\0%s' % (before, after), text)
Then you get:
>>> SurroundWith('HI there. You should hire that person for the job. '
... 'Hi hi.', 'hi', '<b>', '</b>')
'<b>HI</b> there. You should hire that person for the job. <b>Hi</b> <b>hi</b>.'
If you have more complicated criteria for what constitutes a "word boundary," you'll have to do something like:
def SurroundWith2(text, keyword, before, after):
regex = re.compile(r'([^a-zA-Z0-9])(%s)([^a-zA-Z0-9])' % keyword,
re.IGNORECASE)
return regex.sub(r'\1%s\2%s\3' % (before, after), text)
You can modify the [^a-zA-Z0-9] groups to match anything you consider a "non-word."
I think the best solution would be regular expression...
import re
def reg(keyword, myString) :
regx = re.compile(r'\b(' + keyword + r')\b', re.IGNORECASE)
return regx.sub(r'<b>\1</b>', myString)
of course, you must first make your keyword "regular expression safe" (quote any regex special characters).
Here's one suggestion, from the nitpicking committee. :-)
myString = "HI there. You should higher that person for the job. Hi hi."
myString.replace('higher','hire')

Categories

Resources