Regex in python: combining 2 regex expressions into one - python

Suppose I have the following list:
a = ['35','years','opened','7,000','churches','rev.','mr.','brandt','said','adding','denomination','national','goal','one','church','every','10,000','persons']
I want to remove all elements, that contain numbers and elements, that end with dots.
So I want to delete '35','7,000','10,000','mr.','rev.'
I can do it separately using the following regex:
regex = re.compile('[a-zA-Z\.]')
regex2 = re.compile('[0-9]')
But when I try to combine them I delete either all elements or nothing.
How can I combine two regex correctly?

This should work:
reg = re.compile('[a-zA-Z]+\.|[0-9,]+')
Note that your first regex is wrong because it deletes any string within a dot inside it.
To avoid this, I included [a-zA-Z]+\. in the combined regex.
Your second regex is also wrong as it misses a "+" and a ",", which I included in the above solution.
Here a demo.
Also, if you assume that elements which end with a dot might contain some numbers the complete solution should be:
reg = re.compile('[a-zA-Z0-9]+\.|[0-9,]+')

If you don't need to capture the result, this matches any string with a dot at the end, or any with a number in it.
\.$|\d

You could use:
(?:[^\d\n]*\d)|.*\.$
See a demo on regex101.com.

Here is a way to do the job:
import re
a = ['35','years','opened','7,000','churches','rev.','mr.','brandt','said','adding','denomination','national','goal','one','church','every','10,000','per.sons']
b = []
for s in a:
if not re.search(r'^(?:[\d,]+|.*\.)$', s):
b.append(s)
print b
Output:
['years', 'opened', 'churches', 'brandt', 'said', 'adding', 'denomination', 'national', 'goal', 'one', 'church', 'every', 'per.sons']
Demo & explanation

Related

Python - How to use regex to find multiple words and extract them at the same time

Using Regular Expression, I want to find all the match words in a sentence and extract the wanted part in the matches words at the same time.
I use the API "findall" from "re" module to find the match words and plus the brackets to extract the parts I want.
For example I have a string "0xQQ1A, 0xWW2B, 0xEE3C, 0xQQ4C".
I only want the remaining two words after "0xQQ" or "0xWW", which will result in a list ["1A", "2B, "4C"].
Here is my code:
import re
MyString = "0xQQ1A, 0xWW2B, 0xEE3C, 0xQQ4C"
MySearch = re.compile("0xQQ(\w{2})|0xWW(\w{2})")
MyList = MySearch.findall(MyString)
print MyList
So my expected result is ["1A", "2B, "4C"].
But the actual result is [('1A', ''), ('', '2B'), ('4C', '')]
I think I might have used the combination of "()" and "|" in the wrong way.
Thx for the help!
Two different capturing groups will result in two items in the output (whatever matched each).
Instead, use a single capturing group and put your | (OR) earlier:
re.compile("0x(?:QQ|WW)(\w{2})")
((?:...) is a non-capturing group that matches ... - used to limit the effects of the | to only the QQ/WW split, without adding another capture to the output.)
You can try this:
import re
string = "0xQQ1A, 0xWW2B, 0xEE3C, 0xQQ4C"
pattern = re.compile(r"(0xQQ|0xWW)(\w{2})")
result = [match[2] for match in pattern.finditer(string)]
result will be:
['1A', '2B', '4C']

Cut within a pattern using Python regex

Objective: I am trying to perform a cut in Python RegEx where split doesn't quite do what I want. I need to cut within a pattern, but between characters.
What I am looking for:
I need to recognize the pattern below in a string, and split the string at the location of the pipe. The pipe isn't actually in the string, it just shows where I want to split.
Pattern: CDE|FG
String: ABCDEFGHIJKLMNOCDEFGZYPE
Results: ['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']
What I have tried:
I seems like using split with parenthesis is close, but it doesn't keep the search pattern attached to the results like I need it to.
re.split('CDE()FG', 'ABCDEFGHIJKLMNOCDEFGZYPE')
Gives,
['AB', 'HIJKLMNO', 'ZYPE']
When I actually need,
['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']
Motivation:
Practicing with RegEx, and wanted to see if I could use RegEx to make a script that would predict the fragments of a protein digestion using specific proteases.
A non regex way would be to replace the pattern with the piped value and then split.
>>> pattern = 'CDE|FG'
>>> s = 'ABCDEFGHIJKLMNOCDEFGZYPE'
>>> s.replace('CDEFG',pattern).split('|')
['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']
You can solve it with re.split() and positive "look arounds":
>>> re.split(r"(?<=CDE)(\w+)(?=FG)", s)
['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']
Note that if one of the cut sequences is an empty string, you would get an empty string inside the resulting list. You can handle that "manually", sample (I admit, it is not that pretty):
import re
s = "ABCDEFGHIJKLMNOCDEFGZYPE"
cut_sequences = [
["CDE", "FG"],
["FGHI", ""],
["", "FGHI"]
]
for left, right in cut_sequences:
items = re.split(r"(?<={left})(\w+)(?={right})".format(left=left, right=right), s)
if not left:
items = items[1:]
if not right:
items = items[:-1]
print(items)
Prints:
['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']
['ABCDEFGHI', 'JKLMNOCDEFGZYPE']
['ABCDE', 'FGHIJKLMNOCDEFGZYPE']
To keep the splitting pattern when you split with re.split, or parts of it, enclose them in parentheses.
>>> data
'ABCDEFGHIJKLMNOCDEFGZYPE'
>>> pieces = re.split(r"(CDE)(FG)", data)
>>> pieces
['AB', 'CDE', 'FG', 'HIJKLMNO', 'CDE', 'FG', 'ZYPE']
Easy enough. All the parts are there, but as you can see they have been separated. So we need to reassemble them. That's the trickier part. Look carefully and you'll see you need to join the first two pieces, the last two pieces, and the rest in triples. I simplify the code by padding the list, but you could do it with the original list (and a bit of extra code) if performance is a problem.
>>> pieces = [""] + pieces
>>> [ "".join(pieces[i:i+3]) for i in range(0,len(pieces), 3) ]
['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']
re.split() guarantees a piece for every capturing (parenthesized) group, plus a piece for what's between. With more complex regular expressions that need their own grouping, use non-capturing groups to keep the format of the returned data the same. (Otherwise you'll need to adapt the reassembly step.)
PS. I also like Bhargav Rao's suggestion to insert a separator character in the string. If performance is not an issue, I guess it's a matter of taste.
Edit: Here's a (less transparent) way to do it without adding an empty string to the list:
pieces = re.split(r"(CDE)(FG)", data)
result = [ "".join(pieces[max(i-3,0):i]) for i in range(2,len(pieces)+2, 3) ]
A safer non-regex solution could be this:
import re
def split(string, pattern):
"""Split the given string in the place indicated by a pipe (|) in the pattern"""
safe_splitter = "####SPLIT_HERE####"
safe_pattern = pattern.replace("|", safe_splitter)
string = string.replace(pattern.replace("|", ""), safe_pattern)
return string.split(safe_splitter)
s = "ABCDEFGHIJKLMNOCDEFGZYPE"
print(split(s, "CDE|FG"))
print(split(s, "|FG"))
print(split(s, "FGH|"))
https://repl.it/C448

Delete substring not matching regex in Python

I have a string like:
'class="a", class="b", class="ab", class="body", class="etc"'
I want to delete everything except class="a" and class="b".
How can I do it? I think the problem is easy but I'm stuck.
Here is some one of my attempts but it didn't solve my problem:
re.sub(r'class="also"|class="etc"', '', a)
My string is a very long HTML code with a lot of classes and I want to only keep two of them and drop all the others.
Some times its good to make a break. I found solution for me with bleach
def filter_class(name, value):
if name == 'class' and value == 'aaa':
return True
attrs = {
'div': filter_class,
}
bleach.clean(html, tags=('div'), attributes=attrs, strip_comments=True)
You tried to explicitly enumerate those substrings you wanted to delete. Rather than writing such long patterns, you can just use negative lookaheads that provide a means to add exclusions to some more generic pattern.
Here is a regex you can use to remove those substrings in a clean way and disregarding order:
,? ?\bclass="(?![ab]")[^"]+"
See regex demo
Here, with (?![ab]")[^"]+, we match 1 or more characters other than " ([^"]+), but not those equal to a or b ((?![ab]")).
Here is a sample code:
import re
p = re.compile(r',? ?\bclass="(?![ab]")[^"]+"')
test_str = "class=\"a\", class=\"b\", class=\"ab\", class=\"body\", class=\"etc\"\nclass=\"b\", class=\"ab\", class=\"body\", class=\"etc\", class=\"a\"\nclass=\"b\", class=\"ab\", class=\"body\", class=\"a\", class=\"etc\""
result = re.sub(p, '', test_str)
print(result)
See IDEONE demo
NOTE: If instead of a and b you have longer sequences, use a (?!(?:a|b) non-capturing group in the look-ahead instead of a character class:
,? ?\bclass="(?!(?:arbuz|baklazhan)")[^"]+"
See another demo
another pretty simple solution.. good luck.
st = 'class="a", class="b", class="ab", class="body", class="etc"'
import re
res = re.findall(r'class="[a-b]"', st)
print res
'['class="a"', 'class="b"']'
you can use re.sub very easily
res = re.sub(r'class="[a-zA-Z][a-zA-Z].*"', "", st)
print res
class="a", class="b"
If you only wanted to keep the first two entries, one approach would be to use the split() function. This will split your string into a list at given separator points. In your case, this could be a comma. The first two list elements can then be joined back together with commas.
text = 'class="a", class="b", class="ab", class="body", class="etc"'
print ",".join(text.split(",")[:2])
Would give class="a", class="b"
If the entries can be anywhere, and for an arbitrary list of wanted classes:
def keep(text, keep_list):
keep_set = set(re.findall("class\w*=\w*[\"'](.*?)[\"']", text)).intersection(set(keep_list))
output_list = ['class="%s"' % a_class for a_class in keep_set]
return ', '.join(output_list)
print keep('class="a", class="b", class="ab", class="body", class="etc"', ["a", "b"])
print keep('class="a", class="b", class="ab", class="body", class="etc"', ["body", "header"])
This would print:
class="a", class="b"
class="body"

How to capture multiple repeating patterns with regular expression?

I get some string like this: \input{{whatever}{1}}\mypath{{path1}{path2}{path3}...{pathn}}\shape{{0.2}{0.3}}
I would like to capture all the paths: path1, path2, ... pathn. I tried the re module in python. However, it does not support multiple capture.
For example: r"\\mypath\{(\{[^\{\}\[\]]*\})*\}" will only return the last matched group. Applying the pattern to search(r"\mypath{{path1}{path2}})" will only return groups() as ("{path2}",)
Then I found an alternative way to do this:
gpathRegexPat=r"(?:\\mypath\{)((\{[^\{\}\[\]]*\})*)(?:\})"
gpathRegexCp=re.compile(gpathRegexPat)
strpath=gpathRegexCp.search(r'\mypath{{sadf}{ad}}').groups()[0]
>>> strpath
'{sadf}{ad}'
p=re.compile('\{([^\{\}\[\]]*)\}')
>>> p.findall(strpath)
['sadf', 'ad']
or:
>>> gpathRegexPat=r"\\mypath\{(\{[^{}[\]]*\})*\}"
>>> gpathRegexCp=re.compile(gpathRegexPat, flags=re.I|re.U)
>>> strpath=gpathRegexCp.search(r'\input{{whatever]{1}}\mypath{{sadf}{ad}}\shape{{0.2}{0.1}}').group()
>>> strpath
'\\mypath{{sadf}{ad}}'
>>> p.findall(strpath)
['sadf', 'ad']
At this point, I thought, why not just use the findall on the original string? I may use:
gpathRegexPat=r"(?:\\mypath\{)(?:\{[^\{\}\[\]]*\})*?\{([^\{\}\[\]]*)\}(?:\{[^\{\}\[\]]*\})*?(?:\})": if the first (?:\{[^\{\}\[\]]*\})*? matches 0 time and the 2nd (?:\{[^\{\}\[\]]*\})*? matches 1 time, it will capture sadf; if the first (?:\{[^\{\}\[\]]*\})*? matches 1 time, the 2nd one matches 0 time, it will capture ad. However, it will only return ['sadf'] with this regex.
With out all those extra patterns ((?:\\mypath\{) and (?:\})), it actually works:
>>> p2=re.compile(r'(?:\{[^\{\}\[\]]*\})*?\{([^\{\}\[\]]*)\}(?:\{[^\{\}\[\]]*\})*?')
>>> p2.findall(strpath)
['sadf', 'ad']
>>> p2.findall('{adadd}{dfada}{adafadf}')
['adadd', 'dfada', 'adafadf']
Can anyone explain this behavior to me? Is there any smarter way to achieve the result I want?
re.findall("{([^{}]+)}",text)
should work
returns
['path1', 'path2', 'path3', 'pathn']
finally
my_path = r"\input{{whatever}{1}}\mypath{{path1}{path2}{path3}...{pathn}}\shape{{0.2}{0.3}}"
#get the \mypath part
my_path2 = [p for p in my_path.split("\\") if p.startswith("mypath")][0]
print re.findall("{([^{}]+)}",my_path2)
or even better
re.findall("{(path\d+)}",text) #will only return things like path<num> inside {}
You are right. It is not possible to return repeated subgroups inside a group. To do what you want, you can use a regular expression to capture the group and then use a second regular expression to capture the repeated subgroups.
In this case that would be something like: \\mypath{(?:\{.*?\})}. This will return {path1}{path2}{path3}
Then to find the repeating patterns of {pathn} inside that string, you can simply use \{(.*?)\}. This will match anything withing the braces. The .*? is a non-greedy version of .*, meaning it will return the shortest possible match instead of the longest possible match.

Regular expression split

I have inputs similar to the following:
TV-12VX
TV-14JW
TV-2JIS
VC-224X
I need to remove everything after the numbers after the dash. The result would be:
TV-12
TV-14
TV-2
TV-224
How would I do this split via regular expressions?
The following code shows how to match strings of the form "TV-" + (some number):
>>> re.match('TV-[0-9]+','TV-12VX').group(0)
'TV-12'
(Note that, because I'm using match, this only works if the string starts with the bit you want to extract.)
I think this regex is appropriate for you: (.+?-\d+?)[a-zA-Z]. You can use it with re.findall, or re.match.
import re
p = re.match('([\w]{2}-\d+)', 'TV-12VX')
print(p.group(0))
Outputs
TV-12
You can remove everything after the digits with this:
re.sub(r"^(\w+-\d+).*", r"\1", input)

Categories

Resources