regex multiline matching in python

I want to match 'here is a sample' and all the lines after it, up to two consecutive newlines.
Here is my file (you can treat it as a log file):
here is a sample text
random line1
here is a sample text
random line2
random line3
random line4
should not match
random line 6
here is a sample
random line 5
I tried:
\r?\n?(here is a sample).*\r?\n?(.*)
With that I only get the next line. If I repeat the last part '\r?\n?(.*)' I get one more line.
My question: what regular expression do I need in order to match all lines until I see two newlines?

If you want to match everything up to two newlines, but also want to match the last occurrence when no two newlines follow it:
^here is a sample.*(?:\n(?!\n).*)*
The pattern matches:
^ Start of a line (this relies on multiline mode, re.MULTILINE in Python)
here is a sample.* Match literally and the rest of the line
(?: Non capture group to repeat as a whole part
\n(?!\n) Match a newline, and assert that it is not directly followed by a newline
.* Match the rest of the line
)* Close the non capture group and optionally repeat it
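As a minimal sketch of how this pattern might be applied in Python: the sample text below is an assumption (the position of the blank line was chosen for illustration, since the original layout is not preserved), and re.MULTILINE is passed so that ^ can match at the start of every line.

```python
import re

# Assumed sample log; the blank line after "random line4" is hypothetical.
text = (
    "here is a sample text\n"
    "random line1\n"
    "here is a sample text\n"
    "random line2\n"
    "random line3\n"
    "random line4\n"
    "\n"
    "should not match\n"
    "random line 6\n"
    "here is a sample\n"
    "random line 5"
)

# Match the marker line, then keep consuming lines as long as the
# newline is not directly followed by another newline.
pattern = re.compile(r"^here is a sample.*(?:\n(?!\n).*)*", re.MULTILINE)

blocks = pattern.findall(text)
for block in blocks:
    print(repr(block))
```

With this input, the first block runs up to the blank line, and the last block is still matched even though no blank line follows it.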
If there should be 2 newlines present, you can use a capture group for the part that you want to keep, and match the 2 newlines to make sure that they are present.
^(here is a sample.*(?:\n(?!\n).*)*)\n\n
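A short sketch of the stricter variant, again on assumed sample text: because the two newlines are now required, a block at the end of the input that is not followed by a blank line should no longer be returned. With a single capture group, re.findall returns the group contents only.

```python
import re

# Assumed sample; only the first block is followed by a blank line.
text = (
    "here is a sample text\n"
    "random line1\n"
    "\n"
    "should not match\n"
    "here is a sample\n"
    "random line 5"
)

# Group 1 holds the block; the trailing \n\n must be present but is not kept.
pattern = re.compile(r"^(here is a sample.*(?:\n(?!\n).*)*)\n\n", re.MULTILINE)

required = pattern.findall(text)
print(required)
```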

Related

Python - Edit lines in a list

I have read my text file of about 10 lines into a list. Now I want to cut off the first part of each line.
To be precise, the first 5 words should be cut off.
How exactly do I have to do this?
Edit:
This is how I read my text file:
with open("test.txt", "r") as file:
    list = []
    for line in file:
        list += [line.strip()]
print(list)
If I only have one line, this works for me:
newlist = " ".join(list.split(" ")[5:])
print(newlist)
But how can I do this with a list (10 lines)?

Python has a method split() that splits a string by a delimiter. By default, it splits at every white space. Then, to cut the first 5 words you can either copy all the other items in the list starting from index 5, or, delete the indexes 0-4 from the list.
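A minimal sketch of the slicing approach applied to every item of a list. The sample lines are made up for illustration; a line with five or fewer words simply becomes an empty string.

```python
# Hypothetical sample lines standing in for the contents of the file.
lines = [
    "one two three four five six seven",
    "a b c d e f",
    "too short line",
]

# split() with no argument splits on any run of whitespace;
# [5:] keeps everything from the sixth word onward.
trimmed = [" ".join(line.split()[5:]) for line in lines]
print(trimmed)
```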

Perhaps something along the lines of:
text = []
with open('input.txt') as f:
    for l in f.readlines():
        words = l.split()
        text.append(words[5:])
Obviously you should do all sorts of error checking here but the gist should be there.

If you want to remove the first 5 words from every line in the file, you don't have to read it line by line and then split it.
You can read the whole file, then use re.sub to remove the first 5 words together with the spaces around them.
import re
pattern = r"(?m)^[^\S\n]*(?:\S+[^\S\n]+){4}\S+[^\S\n]*"
with open('test.txt') as f:
    print(re.sub(pattern, "", f.read()))
The pattern matches:
(?m) Inline modifier for multiline
^ Start of a line (because of the multiline modifier)
[^\S\n]* Match optional leading spaces without newlines
(?:\S+[^\S\n]+){4} Repeat 4 times matching 1+ non whitespace chars followed by 1+ spaces
\S+ Match 1+ non whitespace chars
[^\S\n]* Match optional trailing spaces without a newline
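To see the effect without a file on disk, here is a small sketch that runs the same pattern on an in-memory string. The sample text is an assumption; note that a line with fewer than five words is left untouched, because the pattern cannot cross a newline.

```python
import re

pattern = r"(?m)^[^\S\n]*(?:\S+[^\S\n]+){4}\S+[^\S\n]*"

# Hypothetical input: a normal line, a line with leading spaces,
# and a line that has fewer than five words.
text = "one two three four five six seven\n  a b c d e f\nshort line"

result = re.sub(pattern, "", text)
print(result)
```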

I want to extract gene boundaries (like 1..234, 234..456) from a file using regex in Python, but every time I use this code it returns an empty list

I want to extract gene boundaries (like 1..234, 234..456) from a file using regex in Python, but every time I use this code it returns an empty list.
Below is an example file
Below is what I have so far:
import re
#with open('boundaries.txt','a') as wf:
with open('sequence.gb','r') as rf:
    for line in rf:
        x= re.findall(r"^\s+\w+\s+\d+\W\d+",line)
        print(x)

The pattern does not match because \W matches only a single non-word character after the first digits, while the boundaries are separated by two dots.
You can repeat it one or more times using \W+.
As you want a single match from the start of the string, you can also use re.match and drop the anchor ^.
^\s+\w+\s+\d+\W+\d+
               ^
import re
s=" gene 1..3256"
pattern = r"\s+\w+\s+\d+\W+\d+"
m = re.match(pattern, s)
if m:
    print(m.group())
Output
gene 1..3256

Maybe you used the wrong regex. Try the code below.
for line in rf:
    x = re.findall(r"g+.*\s*\d+",line)
    print(x)
You can also use regex101 to test your regex pattern online.

A more suitable pattern would be: ^\s*gene\s*(\d+\.\.\d+)
Explanation:
^ - match beginning of a line
\s* - match zero or more whitespaces
gene - match gene literally
(...) - capturing group
\d+ - match one or more digits
\.\. - match .. literally
Then it is enough to take the first capturing group from each match to get the gene boundaries.
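A hedged sketch of this pattern in use. The sample text only imitates the layout of a GenBank feature table and is an assumption, as is the conversion of each boundary into a pair of integers. Locations written in other forms (for example wrapped in complement(...)) would not be picked up by this pattern.

```python
import re

# Hypothetical excerpt; a real sequence.gb file may look different.
sample = (
    "     gene            1..3256\n"
    "     CDS             1..3256\n"
    "     gene            3300..4500\n"
)

# re.MULTILINE lets ^ match at the start of every line of the text.
pattern = re.compile(r"^\s*gene\s*(\d+\.\.\d+)", re.MULTILINE)

boundaries = pattern.findall(sample)  # contents of capture group 1
pairs = [tuple(int(n) for n in b.split("..")) for b in boundaries]
print(boundaries)
print(pairs)
```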

how to write a regular expression to match a small part of a repeating pattern?

I have the following pattern to match:
(10,'more random stuff 21325','random stuff','2014-10-26 04:50:23','','uca-default-u-kn','page')
For some context, it is part of a larger file, which contains many similar patterns separated by commas:
(10,'more random stuff 21325','random stuff','2014-10-26 04:50:23','','uca-default-u-kn','page'),
(11,'more random stuff 1nyny5','random stuff','2014-10-26 04:50:23','','uca-default-u-kn','subcat'),
(14,'more random stuff 21dd5','random stuff','2014-10-26 04:50:23','','uca-default-u-kn','page')
My goal is to drop all patterns ending with 'page' and keep the rest. To do that, I am trying to use
regular expressions to identify those patterns. Here is the one I have come up with so far:
"\(.*?,\'page\'\)"
However, it is not working as expected.
In the following Python code, I use this regex and replace every match with an empty string:
import re
txt = "(10,'Redirects_from_moves','*..2NN:,#2.FBHRP:D6ܽ�','2014-10-26 04:50:23','','uca-default-u-kn','page'),"
txt += "(11,'Redirects_with_old_history','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','page'),"
txt += "(12,'Unprintworthy_redirects','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),"
txt += "(13,'Anarchism','random_stuff','2020-01-23 13:27:44',' ','uca-default-u-kn','page'),"
txt += "(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),"
txt += "(15,'Anti-fascism','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),"
txt += "(16,'Articles_containing_French-language_text','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','page'),"
txt += "(17,'Articles_containing_French-language_text','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','page')"
new_txt = re.sub("\(.*?,\'page\'\)", "",txt)
I was expecting that new_txt would contain all patterns ending with 'subcat', with all
patterns ending with 'page' removed. However, I obtain:
new_txt = ,,,,
What is happening here? How can I change my regex to obtain the desired result?

We might be tempted to do a regex replacement here, but that would basically always leave edge cases open, as @Wiktor has correctly pointed out in a comment below. Instead, a more foolproof approach is to use re.findall and simply extract every tuple which does not end in 'page'. Here is an example:
parts = re.findall(r"\(\d+,'[^']*?'(?:,'[^']*?'){4},'(?!page')[^']*?'\),?", txt)
print(''.join(parts))
This prints:
(12,'Unprintworthy_redirects','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),(15,'Anti-fascism','DL.8:NB�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
The regex pattern used above just matches a leading number, followed by 5 singly quoted terms, and then a sixth singly quoted term which is not 'page'. Then, we string join the tuples in the list output to form a string.

What happens is that you concatenate everything into one string, so the lazy \(.*? starts at the first ( it can and runs to the next occurrence of ,'page'), swallowing any 'subcat' entries in between and leaving only the trailing commas.
Another workaround is to keep the strings in a list and join them with a newline instead of concatenating them.
Then use your pattern, followed by an optional comma and newline, to remove each such line, leaving the ones that end with subcat.
import re
lines = [
"(10,'Redirects_from_moves','*..2NN:,#2.FBHRP:D6ܽ�','2014-10-26 04:50:23','','uca-default-u-kn','page'),",
"(11,'Redirects_with_old_history','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','page'),",
"(12,'Unprintworthy_redirects','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),",
"(13,'Anarchism','random_stuff','2020-01-23 13:27:44',' ','uca-default-u-kn','page'),",
"(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),",
"(15,'Anti-fascism','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),",
"(16,'Articles_containing_French-language_text','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','page'),",
"(17,'Articles_containing_French-language_text','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','page')"
]
new_txt = re.sub(r"\(.*,'page'\)(?:,\n)?", "", '\n'.join(lines))
print(new_txt)
Output
(12,'Unprintworthy_redirects','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),
(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
Or you can use a list comprehension to keep the lines that do not match the pattern.
result = [line for line in lines if not re.match(r"\(.*,'page'\),?$", line)]
print('\n'.join(result))
Output
(12,'Unprintworthy_redirects','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),
(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
Another option to match the parts that end with 'page') for the example data:
\(\d+,[^)]*(?:\)(?!,\s*\(\d+,)[^)]*)*,'page'\),?
The pattern matches:
\(\d+, Match ( followed by 1+ digits and a comma
[^)]* Optionally match any char except )
(?: Non capture group
\)(?!,\s*\(\d+,)[^)]* Only match a ) when not directly followed by the pattern ,\s*\(\d+, which matches the start of the parts in the example data
)* Close group and optionally repeat
,'page'\),? Match ,'page') with an optional comma
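A small sketch of this last pattern with re.sub. The input is a simplified, made-up version of the example data (one value deliberately contains a ) to exercise the lookahead), so treat it as an illustration rather than a test against the real dump.

```python
import re

# Simplified, hypothetical data on a single line.
txt = "(10,'a','x','page'),(11,'b','y','subcat'),(12,'c','z)w','page')"

# A ) is only consumed inside a tuple when it is not followed by
# the start of the next tuple, i.e. a comma, "(" and digits.
pattern = r"\(\d+,[^)]*(?:\)(?!,\s*\(\d+,)[^)]*)*,'page'\),?"

kept = re.sub(pattern, "", txt)
print(kept)
```

Because each match has to start at ( followed by digits and cannot run past the ) that closes a tuple, the 'subcat' entry is not swallowed the way it is with the lazy \(.*? version.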

Match an occurrence starting with two or three digits but not containing a specific pattern somewhere

I have the following lines:
12(3)/FO.2-3;1-2
153/G6S.3-H;2-3;1-2
1/G13S.2-3
22/FO.2-3;1-2
12(3)2S/FO.2-3;1-2
153/SH/G6S.3-H;2-3;1-2
45/3/H/GDP6;2-3;1-2
I want to get a match if the line begins with two or three digits, but not just one. Also, if the field contains the expressions FO, SH, GDP or LDP anywhere, it should not count as an occurrence. That means that, from the previous lines, only 153/G6S.3-H;2-3;1-2 should match, because the others either contain FO, SH or GDP, or have just one digit at the beginning.
I tried using
^[1-9][1-9]((?!FO|SH|GDP).)*$
I am getting the correct result, but I am not sure the pattern is correct; I am not an expert in regular expressions.

You need to add any other characters that might be between your starting digits and the things you want to exclude:
Simplified regex: ^[1-9]{2,3}(?!.*(?:FO|SH|GDP|LDP)).*$
will only match 153/G6S.3-H;2-3;1-2 from your given data.
Explanation:
^[1-9]{2,3}(?!.*(?:FO|SH|GDP|LDP)).*$
-----------                            2 to 3 digits at start of line
^[1-9]{2,3}(?!.*(?:FO|SH|GDP|LDP)).*$
           -----------------------     any characters + not matching (FO|SH|GDP|LDP)
^[1-9]{2,3}(?!.*(?:FO|SH|GDP|LDP)).*$
                                  ---  match till end of line
A negative lookahead (?!...) is only tested at the position where it appears, so to reject FO, SH, GDP or LDP anywhere later in the line, the lookahead itself has to allow for the characters in between, which is what the leading .* inside it does.
See https://regex101.com/r/j4SRoQ/1 for more explanations (uses {2,}).
Full code example:
import re
regex = r"^[1-9]{2,3}(?!.*(?:FO|SH|GDP|LDP)).*$"
test_str = r"""12(3)/FO.2-3;1-2
153/G6S.3-H;2-3;1-2
1/G13S.2-3
22/FO.2-3;1-2
12(3)2S/FO.2-3;1-2
153/SH/G6S.3-H;2-3;1-2
45/3/H/GDP6;2-3;1-2"""
matches = re.finditer(regex, test_str, re.MULTILINE)
for match in matches:
    print(match.group())
Output:
153/G6S.3-H;2-3;1-2

python RE: Non greedy matches, repetition and grouping

I am trying to match repeated line patterns using python RE
input_string:
start_of_line: x
line 1
line 2
start_of_line: y
line 1
line 2
line 3
start_of_line: z
line 1
Basically I want to extract strings in a loop (each string starting at start_of_line and running up to, but not including, the next start_of_line).
I can easily solve this using a for loop, but I am wondering whether there is a Python regex to do this. I tried my best but got stuck on the grouping part.
The closest thing to a solution I have is:
pattern = re.compile(r"start_of_line:.*?", re.DOTALL)
for match in re.findall(pattern, input_string):
    print("Match =", match)
But it prints
Match = start_of_line:
Match = start_of_line:
Match = start_of_line:
If I do anything else to group, I lose the matches.

To do this with a regex, you must use a lookahead test:
r"start_of_line:.*?(?=start_of_line|$)"
Otherwise, since you use a lazy quantifier (*?), you obtain the shortest possible match, i.e. nothing after start_of_line:
Another way:
r"start_of_line:(?:[^\n]+|\n(?!start_of_line:))*"
Here I use a character class containing everything but a newline (\n), repeated one or more times. When the regex engine finds a newline, it tests that start_of_line: doesn't follow. The group is repeated zero or more times.
This pattern is more efficient than the first because the lookahead is performed only when a newline is encountered (versus on every character).
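A minimal sketch comparing the two patterns on the input from the question. The variable names are assumptions, and re.DOTALL is passed to the first pattern (as in the question) so that . can cross newlines. The lookahead version keeps the newline before the next start_of_line, so both results are stripped before comparing.

```python
import re

input_string = (
    "start_of_line: x\n"
    "line 1\n"
    "line 2\n"
    "start_of_line: y\n"
    "line 1\n"
    "line 2\n"
    "line 3\n"
    "start_of_line: z\n"
    "line 1"
)

# Lazy match that stops right before the next marker or at the end.
lookahead = re.compile(r"start_of_line:.*?(?=start_of_line|$)", re.DOTALL)

# Alternative: consume whole runs of non-newline characters, and accept a
# newline only when it is not followed by the marker.
unrolled = re.compile(r"start_of_line:(?:[^\n]+|\n(?!start_of_line:))*")

blocks_a = [m.strip() for m in lookahead.findall(input_string)]
blocks_b = [m.strip() for m in unrolled.findall(input_string)]

for block in blocks_b:
    print("Match =", repr(block))
```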
