Matching a simple string with regex not working? - python

I have a large txt-file and want to extract all strings with these patterns:
/m/meet_the_crr
/m/commune
/m/hann_2
Here is what I tried:
import re
with open("testfile.txt", "r") as text_file:
    contents = text_file.read().replace("\n", "")
    print(re.match(r'^\/m\/[a-zA-Z0-9_-]+$', contents))
The result I get is a simple "None". What am I doing wrong here?

Don't remove the line endings, and use the re.MULTILINE flag so that ^ and $ match at every line. re.findall then returns all matches from the larger text:
# write a demo file
with open("t.txt","w") as f:
    f.write("""
/m/meet_the_crr\n
/m/commune\n
/m/hann_2\n\n
# your text looks like this after .read().replace(\"\\n\",\"\")\n
/m/meet_the_crr/m/commune/m/hann_2""")
Program:
import re
regex = r"^\/m\/[a-zA-Z0-9_-]+$"
with open("t.txt","r") as f:
    contents = f.read()
found_all = re.findall(regex,contents,re.M)
print(found_all)
print("-")
print(open("t.txt").read())
Output:
['/m/meet_the_crr', '/m/commune', '/m/hann_2']
Filecontent:
/m/meet_the_crr
/m/commune
/m/hann_2
# your text looks like this after .read().replace("\n","")
/m/meet_the_crr/m/commune/m/hann_2
This is essentially what Wiktor Stribiżew told you in his comment, although he also suggested a better pattern: r'^/m/[\w-]+$'

There is nothing wrong with the pattern itself; it will match each of the inputs you describe when it is applied to one of them on its own:
result = re.match(r'^\/m\/[a-zA-Z0-9_-]+$', '/m/meet_the_crr')
if result:
    print(result.groups()) # this line is reached, as there is a match
Since you did not specify any capture groups, you will see () being printed to the console. You could capture the entire input, and then it would be available, e.g.
result = re.match(r'(^\/m\/[a-zA-Z0-9_-]+$)', '/m/meet_the_crr')
if result:
    print(result.group(1))
/m/meet_the_crr
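As a side note, the whole match is also available without adding a capture group. The following is a minimal sketch using the same pattern and input as above; the variable names are only illustrative:

```python
import re

# Same pattern and input as in the answer above.
m = re.match(r'^/m/[a-zA-Z0-9_-]+$', '/m/meet_the_crr')

whole = m.group(0)    # group(0) is always the entire match
groups = m.groups()   # empty tuple, because the pattern has no capture groups

print(whole)
print(groups)
```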

You are reading a whole file into a variable (into memory) using .read(). With .replace("\n", ""), you remove all newlines in the string. re.match(r'^\/m\/[a-zA-Z0-9_-]+$', contents) then requires the entire string to match the \/m\/[a-zA-Z0-9_-]+ pattern, which is impossible once all the lines have been joined together, because / is not in the character class.
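To make the failure concrete, here is a small sketch that works on an in-memory string instead of the file. The sample text is an assumption built from the three example paths in the question:

```python
import re

pattern = r'^/m/[a-zA-Z0-9_-]+$'

# Assumed file contents: the three example paths, one per line.
raw = "/m/meet_the_crr\n/m/commune\n/m/hann_2\n"
joined = raw.replace("\n", "")

print(joined)                            # all paths glued into one long string
print(re.match(pattern, joined))         # no match: "/" is not in the character class
print(re.match(pattern, "/m/commune"))   # a single path on its own does match
```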
There are at least two ways out. Either remove .replace("\n", "") (to prevent newline removal) and use re.findall(r'^/m/[\w-]+$', contents, re.M) (re.M option will enable matching whole lines rather than the whole text), or read the file line by line and use your re.match version to check each line for a match, and if it matches add to the final list.
Example:
import re
with open("testfile.txt", "r") as text_file:
    contents = text_file.read()
    print(re.findall(r'^/m/[\w-]+$', contents, re.M))
Or
import re
with open("testfile.txt", "r") as text_file:
    for line in text_file:
        if re.match(r'/m/[\w-]+\s*$', line):
            print(line.rstrip())
Note I used \w to make the pattern somewhat shorter, but if you are working in Python 3 and only want to match ASCII letters and digits, also pass the re.ASCII flag.
Also, / is not a special character in Python regex patterns, so there is no need to escape it.
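The effect of re.ASCII on \w can be seen in a short sketch. The sample text, including the non-ASCII path, is made up for illustration:

```python
import re

# Hypothetical sample: the middle path contains a non-ASCII letter (e with acute accent).
text = "/m/hann_2\n/m/caf\u00e9\n/m/commune\n"

# In Python 3, \w matches Unicode word characters for str patterns by default.
unicode_matches = re.findall(r'^/m/[\w-]+$', text, re.M)

# With re.ASCII, \w is restricted to [a-zA-Z0-9_].
ascii_matches = re.findall(r'^/m/[\w-]+$', text, re.M | re.ASCII)

print(unicode_matches)
print(ascii_matches)
```

Note that the unescaped / works in both calls.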

Related

Python does not recognise valid RegEx entry

I'm trying to write a Python script to parse through a log file. Script core is borrowed from pythonic ways.
import re
log_file_path = r"O:\ZTK log file parser\2 Parsing Log\JP"
regex = '8355371640847825590'
match_list = []
with open(log_file_path, "r") as file:
    for line in file:
        for match in re.finditer(regex, line, re.S):
            match_text = match.group()
            match_list.append(match_text)
print(match_list) # work in progress
Above example works well when parsing for plain string values. But when I try to insert regex variable:
regex = '((.*\n){2}).*8355371640847825590'
It always returns an empty list.
What bothers me is that this expression works really well in test environments such as https://regex101.com/. Each value is correctly matched. Unfortunately, I cannot replicate this in Python.
I'd be grateful if you could assist me.
You need to read the whole file into a single variable if you want your pattern to match across line breaks. Besides, you may explicitly tell the regex engine to start matching only at the start of a line:
(?m)^(?:.*\n){2}.*8355371640847825590
Details
(?m) - (the inline re.M / re.MULTILINE modifier) ^ will now match start of line positions
^ - start of a line
(?:.*\n){2} - two lines with line breaks
.*8355371640847825590 - any 0 or more chars other than line break chars as many as possible and then 8355371640847825590
Python demo:
import re
log_file_path = r"O:\ZTK log file parser\2 Parsing Log\JP"
regex = '(?m)^(?:(?:.*\n){2}).*8355371640847825590'
match_list = []
with open(log_file_path, "r") as file:
    match_list = re.findall(regex, file.read())
print(match_list)
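The difference between matching line by line and matching the whole text can be reproduced without the log file. In this sketch the log contents are invented; only the ID and the pattern come from the question and answer:

```python
import re

# Invented stand-in for the log file contents.
sample_log = "alpha\nbeta\ngamma 8355371640847825590\ndelta\n"
regex = r'(?m)^(?:.*\n){2}.*8355371640847825590'

# Line by line: no single line contains two line breaks, so nothing can match.
per_line = [
    m.group()
    for line in sample_log.splitlines(keepends=True)
    for m in re.finditer(regex, line)
]

# Whole text at once: the pattern can span the two preceding lines.
whole_text = re.findall(regex, sample_log)

print(per_line)
print(whole_text)
```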

Python Exclude Comments with re.search

I am searching for a string in a line using:
import re
myfile = "myfile.txt"
files = open(myfile, 'r').read().splitlines()
for line in files:
    if re.search("`this", line):
        print("bingo")
This works fine. However, I want to exclude any lines where the match is inside a comment. Comments in the file I am reading take the form //. I'm not sure how to exclude them, though. A comment might start anywhere in the line, not necessarily at the beginning.
Example:
I want to exclude lines like first_last = "name" //`this THAT since "`this" is in a comment
This can be done with a variable-length negative lookbehind assertion, but for that you need to use the regex package, installable with pip from the PyPI repository. The regex is:
(?<!//.*) # negative lookbehind assertion stating that what follows must not be preceded by // followed by 0 or more arbitrary characters
`this # matches `this
The code:
import regex as re
regex = re.compile(r'(?<!//.*)`this')
myfile = "myfile.txt"
with open(myfile, 'r') as f:
    for line in f: # line has a newline character at the end; call rstrip on it if you want to get rid of it
        if regex.search(line):
            print(line, end='')
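If installing a third-party package is not an option, a different technique gives a similar result with the standard library only: cut each line at the first // and search what is left. This is a sketch of that alternative, not the lookbehind approach above; the helper name is hypothetical, and like the lookbehind it treats a // inside a string literal as the start of a comment:

```python
import re

def has_token_outside_comment(line, token="`this"):
    """Return True if token occurs before the first // on the line (hypothetical helper)."""
    code_part = line.split("//", 1)[0]   # drop everything from the first // onwards
    return re.search(re.escape(token), code_part) is not None

print(has_token_outside_comment('first_last = "name" //`this THAT'))   # token only in the comment
print(has_token_outside_comment('x = `this // note'))                  # token before the comment
```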

Matching start of line in regex (^ returns empty list)

I am using Python's re module to extract some information from a .txt file.
My .txt file looks like this:
621345
21345[45]6213
421345[45]21345
21345[45]6213456
66456
21345[45]621345
I want to match the lines that begin with 21345.
My code is as follows:
import re
pattern = re.compile('^21345.+')
filename = 'myfile.txt'
with open(filename, 'r') as f:
    found = re.findall(pattern, f.read())
print(found)
This returns an empty list. It should return:
['21345[45]6213', '21345[45]6213456', '21345[45]621345']
I have tried matching just 21345, which works. When I add the ^, I start getting an empty list.
Your issue is that the ^ anchor only matches at the beginning of the string by default.
f.read() returns your entire text file as a single string, and that string does not start with the defined sequence (the first line is 621345), hence the empty list. If you want ^ to match at the beginning of each line, set the re.MULTILINE flag when compiling your pattern, e.g.
pattern = re.compile('^21345.+', re.MULTILINE)
That will return the desired list.
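Both behaviours can be checked on an in-memory copy of the sample data from the question. The variable names in this sketch are only illustrative:

```python
import re

# The sample file contents from the question, as a single string.
text = (
    "621345\n"
    "21345[45]6213\n"
    "421345[45]21345\n"
    "21345[45]6213456\n"
    "66456\n"
    "21345[45]621345\n"
)

without_flag = re.findall(r'^21345.+', text)               # ^ only at the start of the string
with_flag = re.findall(r'^21345.+', text, re.MULTILINE)    # ^ at the start of every line

print(without_flag)
print(with_flag)
```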

Delete comments in text file

I am trying to delete comments starting on new lines in a Python code file using Python code and regular expressions. For example, for this input:
first line
#description
hello my friend
I would like to get this output:
first line
hello my friend
Unfortunately this code didn't work for some reason:
with open(input_file,"r+") as f:
    string = re.sub(re.compile(r'\n#.*'),"",f.read())
    f.seek(0)
    f.write(string)
For some reason, the output I get is the same as the input.
1) There is no reason to call re.compile unless you save the result. You can always just use the regular expression text.
2) Seeking to the beginning of the file and writing there may cause problems for you if your replacement text is shorter than your original text. It is easier to re-open the file and write the data.
Here is how I would fix your program:
import re
input_file = 'in.txt'
with open(input_file,"r") as f:
    data = f.read()
data = re.sub(r'\n#.*', "", data)
with open(input_file, "w") as f:
    f.write(data)
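The substitution itself can be tried on a string before touching any file. This sketch uses the example input from the question, plus one extra made-up input to show a limitation of anchoring on \n: a comment on the very first line is not preceded by a newline, so it survives. The second pattern, which uses ^ with re.M, is an illustrative variant rather than part of the fix above:

```python
import re

source = "first line\n#description\nhello my friend\n"
cleaned = re.sub(r'\n#.*', "", source)
print(repr(cleaned))

# Made-up input with a comment on the first line.
leading_comment = "#top\nfirst line\n"
unchanged = re.sub(r'\n#.*', "", leading_comment)                # nothing is removed
multiline = re.sub(r'^#.*\n?', "", leading_comment, flags=re.M)  # ^ matches at each line start
print(repr(unchanged))
print(repr(multiline))
```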
It doesn't seem right to start the regular expression with \n, and I don't think you need to use re.compile here.
In addition, you have to pass the re.M flag so that ^ matches at the start of every line.
This will delete all lines that start with #, as well as empty lines:
with open(input_file, "r+") as f:
    text = f.read()
    string = re.sub(r'^(?:#.*)?\n', '', text, flags=re.M)
    f.seek(0)       # go back to the start before writing
    f.write(string)
    f.truncate()    # drop what is left of the old, longer content

Python regex search for string at beginning of line in file

Here's my code:
#!/usr/bin/python
import io
import re
f = open('/etc/ssh/sshd_config','r')
strings = re.search(r'.*IgnoreR.*', f.read())
print(strings)
That returns data, but I need specific regex matching: e.g.:
^\s*[^#]*IgnoreRhosts\s+yes
If I change my code to simply:
strings = re.search(r'^IgnoreR.*', f.read())
or even
strings = re.search(r'^.*IgnoreR.*', f.read())
I don't get anything back. I need to be able to use real regexes, like in Perl.
You can use multiline mode, in which ^ matches at the beginning of each line:
#!/usr/bin/python
import io
import re
f = open('/etc/ssh/sshd_config','r')
strings = re.search(r"^\s*[^#]*IgnoreRhosts\s+yes", f.read(), flags=re.MULTILINE)
print(strings.group(0))
Note that without this mode you can always replace ^ with \n (although that will not match on the first line of the file).
Note too that this file follows a very regular format (directives are not indented), thus:
^IgnoreRhosts\s+yes
is good enough for checking the parameter
EDIT: a better way
with open('/etc/ssh/sshd_config') as f:
    for line in f:
        if line.startswith('IgnoreRhosts yes'):
            print(line)
Once again, there is no reason for the file to have leading spaces. However, if you want to be sure, you can always call lstrip() on the line first.
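For completeness, here is a sketch of the line-by-line check that tolerates leading spaces and a variable amount of whitespace between the keyword and the value. The function name and the sample lines are hypothetical; it takes any iterable of lines, so an open file object would work as well:

```python
def find_directive(lines, name="IgnoreRhosts", value="yes"):
    """Return the first non-comment line that sets name to value, or None (hypothetical helper)."""
    for line in lines:
        stripped = line.lstrip()
        if stripped.startswith("#"):
            continue                     # skip commented-out directives
        parts = stripped.split()         # split on any run of whitespace
        if parts[:2] == [name, value]:
            return line.rstrip("\n")
    return None

# Made-up sample lines standing in for the config file.
sample = ["# IgnoreRhosts yes\n", "Port 22\n", "  IgnoreRhosts   yes\n"]
print(find_directive(sample))
```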
