Python does not recognise valid RegEx entry - python

I'm trying to write a Python script to parse through a log file. Script core is borrowed from pythonic ways.
import re
log_file_path = r"O:\ZTK log file parser\2 Parsing Log\JP"
regex = '8355371640847825590'
match_list = []
with open(log_file_path, "r") as file:
    for line in file:
        for match in re.finditer(regex, line, re.S):
            match_text = match.group()
            match_list.append(match_text)
print(match_list) # work in progress
The above example works well when searching for plain string values. But when I set the regex variable to an actual pattern:
regex = '((.*\n){2}).*8355371640847825590'
It always returns an empty list.
What bothers me is that this expression works really well in test environments such as https://regex101.com/. Each value is correctly matched. Unfortunately, I cannot replicate this in Python.
I'd be grateful if you could assist me.

You need to read the whole file into a single variable if you want your pattern to match across line breaks. When you iterate over the file line by line, each string contains at most one line break, so (.*\n){2} can never match. Besides, you may explicitly tell the regex engine to start matching only at the start of a line:
(?m)^(?:.*\n){2}.*8355371640847825590
See the regex demo.
Details
(?m) - (the inline re.M / re.MULTILINE modifier) ^ will now match start of line positions
^ - start of a line
(?:.*\n){2} - two lines with line breaks
.*8355371640847825590 - any 0 or more chars other than line break chars as many as possible and then 8355371640847825590
Python demo:
import re
log_file_path = r"O:\ZTK log file parser\2 Parsing Log\JP"
regex = '(?m)^(?:(?:.*\n){2}).*8355371640847825590'
match_list = []
with open(log_file_path, "r") as file:
    match_list = re.findall(regex, file.read())
print(match_list)
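
The difference between the two approaches can be reproduced without a file. This is only a minimal sketch: the sample text in log_text is made up for illustration and is not taken from the real log.

```python
import re

# Hypothetical sample data standing in for the real log file.
log_text = (
    "line A\n"
    "line B\n"
    "id 8355371640847825590\n"
    "line C\n"
)

pattern = r'(?m)^(?:.*\n){2}.*8355371640847825590'

# Line by line: each string holds at most one "\n", so {2} cannot match.
per_line = [
    m.group()
    for line in log_text.splitlines(keepends=True)
    for m in re.finditer(pattern, line)
]

# Whole text at once: the pattern can span the two preceding lines.
whole_text = re.findall(pattern, log_text)

print(per_line)
print(whole_text)
```

Searching line by line finds nothing here, while searching the whole text returns the two preceding lines together with the line containing the ID.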

Related

Python Exclude Comments with re.search

I am searching for a string in a line using:
import re
myfile = "myfile.txt"
lines = open(myfile, 'r').read().splitlines()
for line in lines:
    if re.search("`this", line):
        print("bingo")
This works fine. However, I want to exclude any matches that are inside comments. The file I am reading can contain comments in the form of //. I'm not sure how to exclude them, though. Comments might start anywhere in the line, not necessarily at the beginning of the line.
Example:
I want to exclude lines like first_last = "name" //`this THAT since "`this" is in a comment
This can be done with a variable-length negative lookbehind assertion, but for that you need to use the regex package, installable with pip from the PyPI repository. The regex is:
(?<!//.*) # negative lookbehind assertion stating that the following must not be preceded by // followed by 0 or more arbitrary characters
`this # matches `this
The code:
import regex as re
regex = re.compile(r'(?<!//.*)`this')
myfile = "myfile.txt"
with open(myfile, 'r') as f:
    for line in f: # line has a newline character at the end; call rstrip on line to remove it if you want
        if regex.search(line):
            print(line, end='')
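
If installing the regex package is not an option, a rough alternative with the standard re module is to cut each line at the first // and search only the part before it. This is a sketch under a loud assumption: it treats every // as the start of a comment, so it would misfire on a // inside a string literal. The helper name matches_outside_comment is made up for this example.

```python
import re

def matches_outside_comment(line, needle="`this"):
    """Return True if needle occurs before the first // on the line.

    Assumption: any // starts a comment (string literals are not parsed).
    """
    code_part = line.split("//", 1)[0]
    return re.search(re.escape(needle), code_part) is not None

print(matches_outside_comment('first_last = "name" //`this THAT'))
print(matches_outside_comment('value = `this; // a note'))
```

The first call inspects the example line from the question, where `this appears only inside the comment; the second has `this in the code part.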

Matching start of line in regex (^ returns empty list)

I am using Python's re module to extract some information from a .txt file.
My .txt file looks like this:
621345
21345[45]6213
421345[45]21345
21345[45]6213456
66456
21345[45]621345
I want to match the lines that begin with 21345.
My code is as follows:
import re
pattern = re.compile('^21345.+')
filename = 'myfile.txt'
with open(filename, 'r') as f:
    found = re.findall(pattern, f.read())
print(found)
This returns an empty list. It should return:
['21345[45]6213', '21345[45]6213456', '21345[45]621345']
I have tried matching just 21345, which works. When I add the ^, I start getting an empty list.
Your issue is that the ^ anchor matches the beginning of a string by default.
f.read() reads your entire text file in one go, and the resulting string does not start with the defined sequence (its first line is 621345), hence the empty list. If you want to match the beginning of each line, set the re.MULTILINE flag when compiling your pattern, e.g.
pattern = re.compile('^21345.+', re.MULTILINE)
That will return the desired list.
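
A quick way to see the effect of the flag is to run both variants on the sample data from the question, held in a string instead of a file (the variable names are just for illustration):

```python
import re

# The file content from the question, as one string.
text = (
    "621345\n"
    "21345[45]6213\n"
    "421345[45]21345\n"
    "21345[45]6213456\n"
    "66456\n"
    "21345[45]621345\n"
)

# Without re.MULTILINE, ^ only matches at the very start of the string.
without_flag = re.findall(r'^21345.+', text)

# With re.MULTILINE, ^ also matches right after every newline.
with_flag = re.findall(r'^21345.+', text, re.MULTILINE)

print(without_flag)
print(with_flag)
```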

Matching a simple string with regex not working?

I have a large txt-file and want to extract all strings with these patterns:
/m/meet_the_crr
/m/commune
/m/hann_2
Here is what I tried:
import re
with open("testfile.txt", "r") as text_file:
    contents = text_file.read().replace("\n", "")
print(re.match(r'^\/m\/[a-zA-Z0-9_-]+$', contents))
The result I get is a simple "None". What am I doing wrong here?
You need to keep the line endings and use the re.MULTILINE flag so that you get multiple results back from a bigger text:
# write a demo file
with open("t.txt","w") as f:
    f.write("""
/m/meet_the_crr\n
/m/commune\n
/m/hann_2\n\n
# your text looks like this after .read().replace(\"\\n\",\"\")\n
/m/meet_the_crr/m/commune/m/hann_2""")
Program:
import re
regex = r"^\/m\/[a-zA-Z0-9_-]+$"
with open("t.txt","r") as f:
    contents = f.read()
found_all = re.findall(regex,contents,re.M)
print(found_all)
print("-")
print(open("t.txt").read())
Output:
['/m/meet_the_crr', '/m/commune', '/m/hann_2']
Filecontent:
/m/meet_the_crr
/m/commune
/m/hann_2
# your text looks like this after .read().replace("\n","")
/m/meet_the_crr/m/commune/m/hann_2
This is roughly what Wiktor Stribiżew told you in his comment, although he also suggested using a better pattern: r'^/m/[\w-]+$'
There is nothing logically wrong with your code, and in fact your pattern will match the inputs you describe:
result = re.match(r'^\/m\/[a-zA-Z0-9_-]+$', '/m/meet_the_crr')
if result:
    print(result.groups()) # this line is reached, as there is a match
Since you did not specify any capture groups, you will see () being printed to the console. You could capture the entire input, and then it would be available, e.g.
result = re.match(r'(^\/m\/[a-zA-Z0-9_-]+$)', '/m/meet_the_crr')
if result:
    print(result.groups(1)[0])
/m/meet_the_crr
You are reading a whole file into a variable (into memory) using .read(). With .replace("\n", ""), you remove all newlines in the string. The re.match(r'^\/m\/[a-zA-Z0-9_-]+$', contents) call then requires the entire string to match the \/m\/[a-zA-Z0-9_-]+ pattern, which is impossible after all the previous manipulations.
There are at least two ways out. Either remove .replace("\n", "") (to prevent newline removal) and use re.findall(r'^/m/[\w-]+$', contents, re.M) (re.M option will enable matching whole lines rather than the whole text), or read the file line by line and use your re.match version to check each line for a match, and if it matches add to the final list.
Example:
import re
with open("testfile.txt", "r") as text_file:
    contents = text_file.read()
print(re.findall(r'^/m/[\w-]+$', contents, re.M))
Or
import re
with open("testfile.txt", "r") as text_file:
    for line in text_file:
        if re.match(r'/m/[\w-]+\s*$', line):
            print(line.rstrip())
Note I used \w to make the pattern somewhat shorter, but if you are working in Python 3 and only want to match ASCII letters and digits, also use the re.ASCII option.
Also, / is not a special char in Python regex patterns, so there is no need to escape it.
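
To illustrate the re.ASCII remark: in Python 3, \w matches Unicode word characters by default. A small sketch follows; the entry with the accented letter is invented for the demonstration and does not come from the question.

```python
import re

# Hypothetical input: the first entry contains a non-ASCII letter.
text = "/m/caf\u00e9\n/m/commune\n/m/hann_2\n"

# Default in Python 3: \w also matches non-ASCII letters.
unicode_matches = re.findall(r'^/m/[\w-]+$', text, re.M)

# With re.ASCII, \w is limited to [a-zA-Z0-9_].
ascii_matches = re.findall(r'^/m/[\w-]+$', text, re.M | re.ASCII)

print(ascii(unicode_matches))
print(ascii(ascii_matches))
```

With re.ASCII the accented entry is rejected as a whole, because the $ anchor still requires the match to reach the end of the line.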

How to get the line number in Python regex findall method [duplicate]

I'm doing a project on statistical machine translation in which I need to extract line numbers from a POS-tagged text file that match a regular expression (any non-separated phrasal verb with the particle 'out'), and write the line numbers to a file (in python).
I have this regular expression: '\w*_VB.?\sout_RP' and my POS-tagged text file: 'Corpus.txt'.
I would like to get an output file with the line numbers that match the above-mentioned regular expression, and the output file should just have one line number per line (no empty lines), e.g.:
2
5
44
So far all I have in my script is the following:
OutputLineNumbers = open('OutputLineNumbers', 'w')
with open('Corpus.txt', 'r') as textfile:
    phrase='\w*_VB.?\sout_RP'
    for phrase in textfile:
OutputLineNumbers.close()
Any idea how to solve this problem?
In advance, thanks for your help!
This should solve your problem, presuming the regex in the variable 'phrase' is correct:
import re
# the pattern from the question
phrase = r'\w*_VB.?\sout_RP'
# compile regex
regex = re.compile(phrase)
# open the files
with open('Corpus.txt','r') as inputFile:
    with open('OutputLineNumbers', 'w') as outputLineNumbers:
        # loop through each line in corpus, counting from 1
        for line_i, line in enumerate(inputFile, 1):
            # check if we have a regex match
            if regex.search( line ):
                # if so, write the line number to the output file
                outputLineNumbers.write( "%d\n" % line_i )
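
The same idea can be tried without any files. In this sketch the POS-tagged sentences and the helper name matching_line_numbers are made up for illustration; only the pattern comes from the question.

```python
import re

def matching_line_numbers(lines, pattern):
    """Return the 1-based numbers of the lines in which pattern is found."""
    regex = re.compile(pattern)
    return [number for number, line in enumerate(lines, 1) if regex.search(line)]

# Hypothetical corpus lines, invented for this example.
corpus = [
    "He_PRP went_VBD out_RP ._.",
    "She_PRP took_VBD the_DT trash_NN out_RP ._.",
    "They_PRP find_VB out_RP ._.",
]

numbers = matching_line_numbers(corpus, r'\w*_VB.?\sout_RP')
print("\n".join(str(n) for n in numbers))
```

The second sentence is skipped because the particle is separated from the verb, which is the behaviour the question asks for.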
You can do it directly in the shell with grep if your regular expression is grep-friendly. Show the line numbers using "-n".
for example:
grep -n "[1-9][0-9]" tags.txt
will output the matching lines, each prefixed with its line number:
2569:vote2012
2570:30
2574:118
2576:7248
2578:2293
2580:9594
2582:577

Python regex search for string at beginning of line in file

Here's my code:
#!/usr/bin/python
import io
import re
f = open('/etc/ssh/sshd_config','r')
strings = re.search(r'.*IgnoreR.*', f.read())
print(strings)
That returns data, but I need specific regex matching: e.g.:
^\s*[^#]*IgnoreRhosts\s+yes
If I change my code to simply:
strings = re.search(r'^IgnoreR.*', f.read())
or even
strings = re.search(r'^.*IgnoreR.*', f.read())
I don't get anything back. I need to be able to use real regex's like in perl
You can use multiline mode; then ^ matches the beginning of each line:
#!/usr/bin/python
import io
import re
f = open('/etc/ssh/sshd_config','r')
strings = re.search(r"^\s*[^#]*IgnoreRhosts\s+yes", f.read(), flags=re.MULTILINE)
print(strings.group(0))
Note that without this mode you can always replace ^ by \n
Note too that this file has a very regular layout, with directives starting at the beginning of the line, thus:
^IgnoreRhosts\s+yes
is good enough for checking the parameter
EDIT: a better way
with open('/etc/ssh/sshd_config') as f:
    for line in f:
        if line.startswith('IgnoreRhosts yes'):
            print(line)
Once again, there is no reason to have leading spaces. However, if you want to be sure, you can always use lstrip().
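
A possible shape for the lstrip() variant is sketched below. The helper name has_ignore_rhosts_yes and the sample lines are assumptions made for the example; it takes any iterable of lines, so an open file object would work as well.

```python
def has_ignore_rhosts_yes(lines):
    """Return True if any line, ignoring leading whitespace,
    starts with 'IgnoreRhosts yes'. Commented lines start with '#'
    after stripping, so they are not matched."""
    for line in lines:
        if line.lstrip().startswith('IgnoreRhosts yes'):
            return True
    return False

# Hypothetical config content, invented for this example.
sample = [
    "# IgnoreRhosts yes\n",
    "PermitRootLogin no\n",
    "   IgnoreRhosts yes\n",
]
print(has_ignore_rhosts_yes(sample))
```

Like the startswith() version above, this expects exactly one space between the keyword and the value; the regex with \s+ is more tolerant in that respect.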
