python RE: Non greedy matches, repetition and grouping

I am trying to match repeated line patterns using python RE
input_string:
start_of_line: x
line 1
line 2
start_of_line: y
line 1
line 2
line 3
start_of_line: z
line 1
Basically I want to extract strings in a loop (each string starting at start_of_line and running up to the next start_of_line).
I can easily solve this using a for loop, but I am wondering if there is a Python RE to do this. I tried my best but I am getting stuck on the grouping part.
The closest thing to a solution I have is
pattern = re.compile(r"start_of_line:.*?", re.DOTALL)
for match in re.findall(pattern, input_string):
    print("Match =", match)
But it prints
Match = start_of_line:
Match = start_of_line:
Match = start_of_line:
If I do anything else to group, I lose the matches.

To do this with a regex, you must use a lookahead test:
r"start_of_line:.*?(?=start_of_line|$)"
Otherwise, since you use a lazy quantifier (*?), you will obtain the shortest possible match, i.e. nothing after start_of_line:. Keep the re.DOTALL flag with this pattern so that the dot also matches newlines.
Another way:
r"start_of_line:(?:[^\n]+|\n(?!start_of_line:))*"
Here I use a character class containing everything but a newline (\n), repeated one or more times. When the regex engine finds a newline, it tests that start_of_line: does not follow. I repeat the group zero or more times.
This pattern is more efficient than the first because the lookahead is performed only when a newline is encountered (instead of at every character).
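As a minimal sketch, both patterns can be tried on the sample input from the question. The variable names lazy_pattern and unrolled_pattern are only illustrative and do not come from the original answer.

```python
import re

input_string = (
    "start_of_line: x\nline 1\nline 2\n"
    "start_of_line: y\nline 1\nline 2\nline 3\n"
    "start_of_line: z\nline 1"
)

# Lazy dot plus lookahead; re.DOTALL lets the dot cross newlines.
lazy_pattern = re.compile(r"start_of_line:.*?(?=start_of_line|$)", re.DOTALL)
# Unrolled version: no flag needed, newlines are matched explicitly.
unrolled_pattern = re.compile(r"start_of_line:(?:[^\n]+|\n(?!start_of_line:))*")

lazy_blocks = lazy_pattern.findall(input_string)
unrolled_blocks = unrolled_pattern.findall(input_string)

for block in unrolled_blocks:
    print("Match =", repr(block))
```

The lazy version keeps the newline that precedes the next start_of_line in each block, while the unrolled version stops just before it.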

Related

I want to extract gene boundaries(like 1..234, 234..456) from a file using regex in python but every time I use this code it returns me empty list

I want to extract gene boundaries (like 1..234, 234..456) from a file using regex in Python, but every time I use this code it returns an empty list.
Below is an example file
Below is what I have so far:
import re
#with open('boundaries.txt','a') as wf:
with open('sequence.gb','r') as rf:
    for line in rf:
        x = re.findall(r"^\s+\w+\s+\d+\W\d+", line)
        print(x)

The pattern does not match because \W matches only a single non-word character after the first digits, while the boundaries contain two dots.
You can repeat that match 1 or more times with \W+.
As you want a single match from the start of the string, you can also use re.match without the anchor ^
^\s+\w+\s+\d+\W+\d+
               ^
import re
s=" gene 1..3256"
pattern = r"\s+\w+\s+\d+\W+\d+"
m = re.match(pattern, s)
if m:
    print(m.group())
Output
gene 1..3256

Maybe you used the wrong regex. Try the code below.
for line in rf:
    x = re.findall(r"g+.*\s*\d+", line)
    print(x)
You can also use regex101 to test your regex pattern online.

A more suitable pattern would be: ^\s*gene\s*(\d+\.\.\d+)
Explanation:
^ - match beginning of a line
\s* - match zero or more whitespaces
gene - match gene literally
(...) - capturing group
\d+ - match one or more digits
\.\. - match .. literally
Then it is enough to take the first capturing group of the match to get the gene boundaries.
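A minimal sketch of this pattern in use follows. The sample text is an assumption modelled on GenBank-style feature lines, since the real file is not shown in the question, and re.MULTILINE is added so the whole text can be scanned in one call.

```python
import re

# Assumed sample of GenBank-style feature lines; the real file is not shown.
sample = """\
     source          1..5028
     gene            1..206
     CDS             1..206
     gene            687..3158
"""

# re.MULTILINE makes ^ match at the start of every line.
gene_pattern = re.compile(r"^\s*gene\s*(\d+\.\.\d+)", re.MULTILINE)

# With one capturing group, findall returns only the captured text.
boundaries = gene_pattern.findall(sample)
print(boundaries)

# Optionally split each boundary into integer start and end positions.
ranges = [tuple(int(n) for n in b.split("..")) for b in boundaries]
print(ranges)
```

When reading the file line by line instead, the same pattern can be used with re.match on each line and the flag dropped.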

how to write a regular expression to match a small part of a repeating pattern?

I have the following pattern to match :
(10,'more random stuff 21325','random stuff','2014-10-26 04:50:23','','uca-default-u-kn','page')
For some context, it's part of a larger file, which contains many similar patterns separated by commas:
(10,'more random stuff 21325','random stuff','2014-10-26 04:50:23','','uca-default-u-kn','page'),
(11,'more random stuff 1nyny5','random stuff','2014-10-26 04:50:23','','uca-default-u-kn','subcat'),
(14,'more random stuff 21dd5','random stuff','2014-10-26 04:50:23','','uca-default-u-kn','page')
My goal is to ditch all patterns ending with 'page' and to keep the rest. For that, I'm trying to use
regular expressions to identify those patterns. Here is the one I have come up with for now:
"\(.*?,\'page\'\)"
However, it's not working as expected.
In the following python code, I use this regex, and replace every match with an empty string :
import re
txt = "(10,'Redirects_from_moves','*..2NN:,#2.FBHRP:D6ܽ�','2014-10-26 04:50:23','','uca-default-u-kn','page'),"
txt += "(11,'Redirects_with_old_history','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','page'),"
txt += "(12,'Unprintworthy_redirects','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),"
txt += "(13,'Anarchism','random_stuff','2020-01-23 13:27:44',' ','uca-default-u-kn','page'),"
txt += "(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),"
txt += "(15,'Anti-fascism','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),"
txt += "(16,'Articles_containing_French-language_text','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','page'),"
txt += "(17,'Articles_containing_French-language_text','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','page')"
new_txt = re.sub("\(.*?,\'page\'\)", "",txt)
I was expecting that new_txt would contain all patterns ending with 'subcat', with all
patterns ending with 'page' removed. However, I obtain:
new_txt = ,,,,
What's happening here? How can I change my regex to obtain the desired result?

We might be tempted to do a regex replacement here, but that would basically always leave open edge cases, as @Wiktor has correctly pointed out in a comment below. Instead, a more foolproof approach is to use re.findall and simply extract every tuple that does not end in 'page'. Here is an example:
parts = re.findall(r"\(\d+,'[^']*?'(?:,'[^']*?'){4},'(?!page')[^']*?'\),?", txt)
print(''.join(parts))
This prints:
(12,'Unprintworthy_redirects','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),(15,'Anti-fascism','DL.8:NB�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
The regex pattern used above matches a leading number, followed by 5 single-quoted terms, and then a sixth single-quoted term which is not 'page'. Then, we join the tuples in the list output to form a string.

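What happens is that you concatenate everything into one string, so each lazy match starts at the next ( and runs up to the first occurrence of ,'page') it can reach, swallowing any 'subcat' tuples in between and leaving only the trailing commas.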
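The effect can be reproduced on a shortened stand-in for the data. The three-tuple string below is made up for illustration; the pattern is the one from the question.

```python
import re

# Shortened stand-in for the data in the question.
txt = "(1,'a','page'),(2,'b','subcat'),(3,'c','page')"

lazy = r"\(.*?,'page'\)"
matches = re.findall(lazy, txt)
for m in matches:
    print(repr(m))
# The second match begins at "(2," and runs on to the first ,'page') it
# can find, so it swallows the whole 'subcat' tuple as well.

leftover = re.sub(lazy, "", txt)
print(repr(leftover))
```

A lazy quantifier only makes each match end as early as possible; it does not stop the match from starting inside a tuple that should have been kept.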
Another workaround might be using a list of the strings, and join them with a newline instead of concatenating them.
Then use your pattern matching an optional comma and newline at the end to remove the line, leaving the ones that end with subcat
import re
lines = [
"(10,'Redirects_from_moves','*..2NN:,#2.FBHRP:D6ܽ�','2014-10-26 04:50:23','','uca-default-u-kn','page'),",
"(11,'Redirects_with_old_history','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','page'),",
"(12,'Unprintworthy_redirects','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),",
"(13,'Anarchism','random_stuff','2020-01-23 13:27:44',' ','uca-default-u-kn','page'),",
"(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),",
"(15,'Anti-fascism','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),",
"(16,'Articles_containing_French-language_text','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','page'),",
"(17,'Articles_containing_French-language_text','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','page')"
]
new_txt = re.sub(r"\(.*,'page'\)(?:,\n)?", "", '\n'.join(lines))
print(new_txt)
Output
(12,'Unprintworthy_redirects','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),
(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
Or you can use a list comprehension to keep the lines that do not match the pattern.
result = [line for line in lines if not re.match(r"\(.*,'page'\),?$", line)]
print('\n'.join(result))
Output
(12,'Unprintworthy_redirects','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),
(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
Another option to match the parts that end with 'page') for the example data:
\(\d+,[^)]*(?:\)(?!,\s*\(\d+,)[^)]*)*,'page'\),?
The pattern matches:
\(\d+, - Match ( followed by 1+ digits and a comma
[^)]* - Optionally match any char except )
(?: - Non capture group
\)(?!,\s*\(\d+,)[^)]* - Only match a ) when not directly followed by the pattern ,\s*\(\d+, which matches the start of the parts in the example data
)* - Close group and optionally repeat
,'page'\),? - Match ,'page') with an optional comma
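As a rough sketch, the pattern above can be passed to re.sub to drop the 'page' tuples. The shortened data and the names page_tuple and kept are assumptions made for illustration.

```python
import re

# Shortened stand-in for the data in the question.
txt = "(10,'a','x','page'),(11,'b','y','subcat'),(12,'c','z','page')"

page_tuple = re.compile(r"\(\d+,[^)]*(?:\)(?!,\s*\(\d+,)[^)]*)*,'page'\),?")

# Remove every tuple that ends with 'page', including its trailing comma.
kept = page_tuple.sub("", txt)
print(kept)
```

Because a ) is only consumed when it is not followed by the start of the next tuple, a match cannot run past the end of the tuple it started in.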

Match strings with alternating characters

I want to match strings in which every second character is same.
for example 'abababababab'
I have tried this : '''(([a-z])[^/2])*'''
The output should return the complete string as it is like 'abababababab'

Backreferences are not part of regular expressions in the formal sense (regular languages are Chomsky type-3): a strictly regular expression cannot remember which characters it matched, so it would have to spell out a separate alternative for every pair of characters in the alphabet.
However, Python's regexes are not strictly regular expressions and support backreferences, so you could write your pattern as follows.
(..)\1*
(..) is a sequence of 2 characters. \1* matches that exact pair of characters an arbitrary number of times (possibly zero).
I interpreted your question as wanting every other character to be equal (ababab works, but abcbdb fails). If you needed only the 2nd, 4th, ... characters to be equal you can use a similar one.
.(.)(.\1)*
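A short sketch comparing the two patterns follows. Using re.fullmatch is my choice here, since neither pattern is anchored and the whole string should be tested; the variable names are illustrative.

```python
import re

pair_repeats = re.compile(r"(..)\1*")          # the same two characters repeated
every_second_same = re.compile(r".(.)(.\1)*")  # only positions 2, 4, ... equal

results = {}
for s in ["abababababab", "abcbdb", "ababacababab"]:
    results[s] = (
        bool(pair_repeats.fullmatch(s)),
        bool(every_second_same.fullmatch(s)),
    )
    print(s, results[s])
```

Only the first string satisfies both patterns; abcbdb satisfies just the second one, and ababacababab satisfies neither.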

You could match the first [a-z] followed by capturing ([a-z]) in a group. Then repeat 0+ times matching again a-z and a backreference to group 1 to keep every second character the same.
^[a-z]([a-z])(?:[a-z]\1)*$
Explanation
^ Start of the string
[a-z]([a-z]) Match a-z and capture in group 1 matching a-z
(?:[a-z]\1)* Repeat 0+ times matching a-z followed by a backreference to group 1
$ End of string

Though not a regex answer, you could do something like this:
def all_same(string):
    return all(c == string[1] for c in string[1::2])
string = 'abababababab'
print('All the same {}'.format(all_same(string)))
string = 'ababacababab'
print('All the same {}'.format(all_same(string)))
The string[1::2] slice starts at the 2nd character (index 1) and then pulls out every second character (the 2 part).
This returns:
All the same True
All the same False

This is a somewhat complicated expression; maybe we could start with:
^(?=^[a-z]([a-z]))([a-z]\1)+$
if I understand the problem right.

Match an occurrence starting with two or three digits but not containing a specific pattern somewhere

I have the following lines:
12(3)/FO.2-3;1-2
153/G6S.3-H;2-3;1-2
1/G13S.2-3
22/FO.2-3;1-2
12(3)2S/FO.2-3;1-2
153/SH/G6S.3-H;2-3;1-2
45/3/H/GDP6;2-3;1-2
I want to get a match if at the beginning of the line I find two or three digits but not one. Also, if the field contains the expressions FO, SH, GDP or LDP anywhere, I should not count it as an occurrence. That means, from the previous lines, I should only get 153/G6S.3-H;2-3;1-2 as a match, because the others either contain FO, SH or GDP, or have just one digit at the beginning.
I tried using
^[1-9][1-9]((?!FO|SH|GDP).)*$
I am getting the correct result but I am not sure it is correct; I am not much of an expert in regular expressions.

You need to add any other characters that might be between your starting digits and the things you want to exclude:
Simplified regex: ^[1-9]{2,3}(?!.*(?:FO|SH|GDP|LDP)).*$
will only match 153/G6S.3-H;2-3;1-2 from your given data.
Explanation:
^[1-9]{2,3} - 2 to 3 digits (1-9) at the start of the line
(?!.*(?:FO|SH|GDP|LDP)) - negative lookahead: none of FO, SH, GDP or LDP may appear anywhere in the rest of the line
.*$ - match until the end of the line
A (?!...) negative lookahead is checked only at the position where it appears. Since other characters sit between the starting digits and the tokens you want to exclude, it needs the leading .* shown above to look through the rest of the line.
See https://regex101.com/r/j4SRoQ/1 for more explanations (uses {2,}).
Full code example:
import re
regex = r"^[1-9]{2,3}(?!.*(?:FO|SH|GDP|LDP)).*$"
test_str = r"""12(3)/FO.2-3;1-2
153/G6S.3-H;2-3;1-2
1/G13S.2-3
22/FO.2-3;1-2
12(3)2S/FO.2-3;1-2
153/SH/G6S.3-H;2-3;1-2
45/3/H/GDP6;2-3;1-2"""
matches = re.finditer(regex, test_str, re.MULTILINE)
for match in matches:
    print(match.group())
Output:
153/G6S.3-H;2-3;1-2

repetition in regular expression in python

I've got a file with lines for example:
aaa$bb$ccc$ddd$eee
fff$ggg$hh$iii$jj
I need to take what is inside $$ so expected result is:
$bb$
$ddd$
$ggg$
$iii$
My result:
$bb$
$ggg$
My solution:
m = re.search(r'$(.*?)$', line)
if m is not None:
    print(m.group(0))
Any ideas how to improve my regexp? I was trying with the * and + signs, but I'm not sure how to finally create it.
I was searching for a similar post, but couldn't find one.

You can use re.findall with r'\$[^$]+\$' regex:
import re
line = """aaa$bb$ccc$ddd$eee
fff$ggg$hh$iii$jj"""
m = re.findall(r'\$[^$]+\$', line)
print(m)
# => ['$bb$', '$ddd$', '$ggg$', '$iii$']
Note that you need to escape $s and remove the capturing group for the re.findall to return the $...$ substrings, not just what is inside $s.
Pattern details:
\$ - a dollar symbol (literal)
[^$]+ - 1 or more symbols other than $
\$ - a literal dollar symbol.
NOTE: The [^$] is a negated character class that matches any char but the one(s) defined in the class. Using a negated character class here speeds up matching since .*? lazy dot pattern expands at each position in the string between two $s, thus taking many more steps to complete and return a match.
And a variation of the pattern to get only the texts inside $...$s:
re.findall(r'\$([^$]+)\$', line)
               ^     ^
Note the (...) capturing group added so that re.findall returns only what is captured, and not everything that is matched.

re.search finds only the first match. Perhaps you'd want re.findall, which returns a list of strings, or re.finditer, which returns an iterator of match objects. Additionally, you must escape $ to \$, as an unescaped $ means "end of line".
Example:
>>> re.findall(r'\$.*?\$', 'aaa$bb$ccc$ddd$eee')
['$bb$', '$ddd$']
>>> re.findall(r'\$(.*?)\$', 'aaa$bb$ccc$ddd$eee')
['bb', 'ddd']
One more improvement would be to use [^$]* instead of .*?; the former means "zero or more characters other than $", which can potentially avoid pathological backtracking behaviour.

Your regex is nearly right, but the $ signs must be escaped as \$. re.search only finds the first match in a line. You are looking for re.findall or re.finditer, which find all non-overlapping matches. That last bit is important for you since you have the same start and end delimiter.
for m in re.finditer(r'\$(.*?)\$', line):
    print(m.group(0))
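Putting these suggestions together for the line-by-line case in the question, a minimal sketch could look like this. The list named lines stands in for the file, and the names dollar_field and found are illustrative.

```python
import re

# The two sample lines from the question; a real script would iterate
# over the open file object instead of this list.
lines = ["aaa$bb$ccc$ddd$eee", "fff$ggg$hh$iii$jj"]

# Escaped dollars with a negated character class in between.
dollar_field = re.compile(r"\$[^$]+\$")

found = []
for line in lines:
    # findall returns every non-overlapping $...$ substring in the line.
    found.extend(dollar_field.findall(line))

for item in found:
    print(item)
```

Because matches do not overlap, the closing $ of one field is never reused as the opening $ of the next, which is why ccc and hh are skipped.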
