I am searching for a string in a line using:
import re
myfile = "myfile.txt"
lines = open(myfile, 'r').read().splitlines()
for line in lines:
    if re.search("`this", line):
        print("bingo")
This works fine. However, I want to exclude any matches that occur inside comments. Comments in the file I am reading take the form //. I'm not sure how to exclude them, though. A comment might start anywhere in the line, not necessarily at the beginning.
Example:
I want to exclude lines like first_last = "name" //`this THAT since "`this" is in a comment
This can be done with a variable-length negative lookbehind assertion, but for that you need the regex package, installable with pip from the PyPI repository. The regex is:
(?<!//.*) # negative lookbehind assertion stating that the following must not be preceded by // followed by 0 or more arbitrary characters
`this # matches `this
The code:
import regex as re
regex = re.compile(r'(?<!//.*)`this')
myfile = "myfile.txt"
with open(myfile, 'r') as f:
    for line in f: # line has a newline character at the end; call rstrip on it to remove that if you want
        if regex.search(line):
            print(line, end='')
Regex Demo
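If installing a third-party package is not an option, a simpler approach that needs only the standard library is to cut each line at the first // and search what remains. This is a minimal sketch, not the lookbehind method: the helper name lines_with_match and the sample lines are made up for illustration, and it assumes // never occurs inside a string literal.

```python
import re


def lines_with_match(lines, pattern=r"`this"):
    """Yield lines whose non-comment part contains the pattern.

    Hypothetical helper: everything from the first '//' onward is
    treated as a comment (assumes '//' never appears inside a string).
    """
    compiled = re.compile(pattern)
    for line in lines:
        code_part = line.split("//", 1)[0]  # drop the trailing comment, if any
        if compiled.search(code_part):
            yield line


# Made-up sample input
sample = [
    'first_last = "name" //`this THAT',  # match is only in the comment
    'x = `this // trailing comment',     # match is in the code part
    '`this',                             # no comment at all
]
matches = list(lines_with_match(sample))
print(matches)
```

The first sample line is skipped because its only occurrence of `this comes after the //.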
Related
I'm trying to write a Python script to parse through a log file. Script core is borrowed from pythonic ways.
import re
log_file_path = r"O:\ZTK log file parser\2 Parsing Log\JP"
regex = '8355371640847825590'
match_list = []
with open(log_file_path, "r") as file:
    for line in file:
        for match in re.finditer(regex, line, re.S):
            match_text = match.group()
            match_list.append(match_text)
print(match_list) # work in progress
The above example works well when searching for plain string values. But when I set the regex variable to a multi-line pattern:
regex = '((.*\n){2}).*8355371640847825590'
It always returns an empty list.
What bothers me is that this expression works really well in test environments such as https://regex101.com/. Each value is correctly matched. Unfortunately, I cannot replicate this in Python.
I'd be grateful if you could assist me.
You need to read the whole file into a single variable if you want your pattern to match across line breaks: when you iterate line by line, each string holds at most one line break, so (.*\n){2} can never match. Besides, you may explicitly tell the regex engine to start matching only at the start of a line:
(?m)^(?:.*\n){2}.*8355371640847825590
See the regex demo.
Details
(?m) - (the inline re.M / re.MULTILINE modifier) ^ will now match start of line positions
^ - start of a line
(?:.*\n){2} - two lines with line breaks
.*8355371640847825590 - any 0 or more chars other than line break chars as many as possible and then 8355371640847825590
Python demo:
import re
log_file_path = r"O:\ZTK log file parser\2 Parsing Log\JP"
regex = '(?m)^(?:(?:.*\n){2}).*8355371640847825590'
match_list = []
with open(log_file_path, "r") as file:
    match_list = re.findall(regex, file.read())
print(match_list)
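The difference between the two approaches can be reproduced without a file. The sketch below uses a made-up four-line string in place of the log; only the pattern comes from the answer.

```python
import re

# Made-up stand-in for the log file contents
text = "line a\nline b\nid 8355371640847825590 end\nline d\n"
pattern = r"(?m)^(?:.*\n){2}.*8355371640847825590"

# Line by line: each chunk holds at most one "\n", so {2} can never match
per_line = [
    m.group()
    for line in text.splitlines(keepends=True)
    for m in re.finditer(pattern, line)
]

# Whole text at once: the pattern can span the two preceding lines
whole = re.findall(pattern, text)

print(per_line)  # []
print(whole)     # ['line a\nline b\nid 8355371640847825590']
```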
I am using Python's re module to extract some information from a .txt file.
My .txt file looks like this:
621345
21345[45]6213
421345[45]21345
21345[45]6213456
66456
21345[45]621345
I want to match the lines that begin with 21345.
My code is as follows:
import re
pattern = re.compile('^21345.+')
filename = 'myfile.txt'
with open(filename, 'r') as f:
    found = re.findall(pattern, f.read())
print(found)
This returns an empty list. It should return:
['21345[45]6213', '21345[45]6213456', '21345[45]621345']
I have tried matching just 21345, which works. When I add the ^, I start getting an empty list.
Your issue is that the ^ anchor matches the beginning of a string by default.
f.read() reads your entire text file in one go, and the resulting string does not match your pattern (since its first line does not start with the defined sequence), hence the empty list. If you want to match at the beginning of each line, set the re.MULTILINE flag when compiling your pattern, e.g.
pattern = re.compile('^21345.+', re.MULTILINE)
That will return the desired list.
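To see the effect of the flag in isolation, here is a small sketch that runs the same pattern over the sample data from the question, held in a string rather than read from myfile.txt.

```python
import re

# The file contents from the question, as one string
text = (
    "621345\n"
    "21345[45]6213\n"
    "421345[45]21345\n"
    "21345[45]6213456\n"
    "66456\n"
    "21345[45]621345\n"
)

# Default mode: ^ only matches at the very start of the string
default_mode = re.findall(r"^21345.+", text)

# MULTILINE: ^ also matches right after every newline
multiline_mode = re.findall(r"^21345.+", text, re.MULTILINE)

print(default_mode)    # []
print(multiline_mode)  # ['21345[45]6213', '21345[45]6213456', '21345[45]621345']
```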
I have a large txt-file and want to extract all strings with these patterns:
/m/meet_the_crr
/m/commune
/m/hann_2
Here is what I tried:
import re
with open("testfile.txt", "r") as text_file:
    contents = text_file.read().replace("\n", "")
print(re.match(r'^\/m\/[a-zA-Z0-9_-]+$', contents))
The result I get is a simple "None". What am I doing wrong here?
You need to keep the line endings and use the re.MULTILINE flag so that you get multiple results back from a larger text:
# write a demo file
with open("t.txt","w") as f:
    f.write("""
/m/meet_the_crr\n
/m/commune\n
/m/hann_2\n\n
# your text looks like this after .read().replace(\"\\n\",\"\")\n
/m/meet_the_crr/m/commune/m/hann_2""")
Program:
import re
regex = r"^\/m\/[a-zA-Z0-9_-]+$"
with open("t.txt","r") as f:
    contents = f.read()
    found_all = re.findall(regex,contents,re.M)
    print(found_all)
    print("-")
    print(open("t.txt").read())
Output:
['/m/meet_the_crr', '/m/commune', '/m/hann_2']
Filecontent:
/m/meet_the_crr
/m/commune
/m/hann_2
# your text looks like this after .read().replace("\n","")
/m/meet_the_crr/m/commune/m/hann_2
This is roughly what Wiktor Stribiżew told you in his comment, although he also suggested a better pattern: r'^/m/[\w-]+$'
There is nothing logically wrong with your code, and in fact your pattern will match the inputs you describe:
result = re.match(r'^\/m\/[a-zA-Z0-9_-]+$', '/m/meet_the_crr')
if result:
    print(result.groups()) # this line is reached, as there is a match
Since you did not specify any capture groups, you will see () being printed to the console. You could capture the entire input, and then it would be available, e.g.
result = re.match(r'(^\/m\/[a-zA-Z0-9_-]+$)', '/m/meet_the_crr')
if result:
    print(result.group(1)) # group(1) is the text captured by the first pair of parentheses
/m/meet_the_crr
You are reading the whole file into a variable (into memory) using .read(). With .replace("\n", ""), you remove all newlines from the string. re.match(r'^\/m\/[a-zA-Z0-9_-]+$', contents) then requires the entire string to match the \/m\/[a-zA-Z0-9_-]+ pattern, which is impossible after those manipulations: the joined text contains further / characters that the character class does not allow.
There are at least two ways out. Either remove .replace("\n", "") (to prevent newline removal) and use re.findall(r'^/m/[\w-]+$', contents, re.M) (re.M option will enable matching whole lines rather than the whole text), or read the file line by line and use your re.match version to check each line for a match, and if it matches add to the final list.
Example:
import re
with open("testfile.txt", "r") as text_file:
    contents = text_file.read()
    print(re.findall(r'^/m/[\w-]+$', contents, re.M))
Or
import re
with open("testfile.txt", "r") as text_file:
    for line in text_file:
        if re.match(r'/m/[\w-]+\s*$', line):
            print(line.rstrip())
Note I used \w to make the pattern somewhat shorter, but if you are working in Python 3 and only want to match ASCII letters and digits, also pass the re.ASCII option.
Also, / is not a special character in Python regex patterns, so there is no need to escape it.
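The effect of re.ASCII on \w can be checked with a short sketch. The second path below, containing a non-ASCII letter, is invented for the demonstration and does not come from the question.

```python
import re

# Made-up sample: the second entry contains a non-ASCII letter (é, U+00E9)
text = "/m/hann_2\n/m/caf\u00e9\n"

# Python 3 str patterns: \w matches Unicode word characters by default
unicode_matches = re.findall(r"^/m/[\w-]+$", text, re.M)

# With re.ASCII, \w is restricted to [a-zA-Z0-9_]
ascii_matches = re.findall(r"^/m/[\w-]+$", text, re.M | re.ASCII)

print(unicode_matches)  # ['/m/hann_2', '/m/café']
print(ascii_matches)    # ['/m/hann_2']
```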
I've got a log file like below:
sw2 switch_has sw2_p3.
sw1 transmits sw2_p2
/* BUG: axiom too complex: SubClassOf(ObjectOneOf([NamedIndividual(#t_air_sens2)]),DataHasValue(DataProperty(#qos_type),^^(latency,http://www.xcx.org/1900/02/22-rdf-syntax-ns#PlainLiteral))) */
/* BUG: axiom too complex: SubClassOf(ObjectOneOf([NamedIndividual(#t_air_sens2)]),DataHasValue(DataProperty(#topic_type),^^(periodic,http://www.xcx.org/1901/11/22-rdf-syntax-ns#PlainLiteral))) */
...
What I'm interested in is extracting specific words from the /* BUG... lines and writing them to a separate file, something like below:
t_air_sens2 qos_type latency
t_air_sens2 topic_type periodic
...
I can do this with the help of awk and regex in shell like below:
awk -F'#|\\^\\^\\(' '{for (i=2; i<NF; i++) printf "%s%s", gensub(/[^[:alnum:]_].*/,"",1,$i), (i<(NF-1) ? OFS : ORS) }' output.txt > ./LogErrors/Properties.txt
How can I extract them using Python? (shall I use regex again, or..?)
You can of course use regex. I would read line by line, grab the lines that start with '/* BUG:', then parse those as needed.
import re
target = r'/* BUG:'
bugs = []
with open('logfile.txt', 'r') as infile, open('output.txt', 'w') as outfile:
    # loop through logfile
    for line in infile:
        if line.startswith(target):
            # add line to bug list and strip newlines
            bugs.append(line.strip())
            # or just do regex parsing here
            # create match pattern groups with parentheses, escape literal parentheses with '\'
            # the '#' sits outside each group so it is not written to the output
            match = re.search(r'NamedIndividual\(#(\w+)\)]\),DataHasValue\(DataProperty\(#(\w+)\),\^\^\((\w+),', line)
            # if matches are found
            if match:
                # join the match groups with spaces and write them to output
                outfile.write(' '.join(match.groups()) + '\n')
Python has a pretty powerful regex module built in: the re module.
You can search for a given pattern, then print out the matched groups as needed.
Note: raw strings (r'xxxx') let you write backslashes in a pattern without doubling them.
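As a quick check of the group-based approach, the sketch below runs a pattern of the same shape over the four sample log lines from the question, held in a list instead of a file. The names BUG_RE and rows are illustrative, and the pattern assumes every BUG line follows the NamedIndividual / DataProperty / ^^( layout shown in the question.

```python
import re

# Capture the three words; '#' is kept outside the groups so it is not captured
BUG_RE = re.compile(
    r"NamedIndividual\(#(\w+)\)\]\),"
    r"DataHasValue\(DataProperty\(#(\w+)\),"
    r"\^\^\((\w+),"
)

# The sample lines from the question
log_lines = [
    "sw2 switch_has sw2_p3.",
    "sw1 transmits sw2_p2",
    "/* BUG: axiom too complex: SubClassOf(ObjectOneOf([NamedIndividual(#t_air_sens2)]),DataHasValue(DataProperty(#qos_type),^^(latency,http://www.xcx.org/1900/02/22-rdf-syntax-ns#PlainLiteral))) */",
    "/* BUG: axiom too complex: SubClassOf(ObjectOneOf([NamedIndividual(#t_air_sens2)]),DataHasValue(DataProperty(#topic_type),^^(periodic,http://www.xcx.org/1901/11/22-rdf-syntax-ns#PlainLiteral))) */",
]

rows = []
for line in log_lines:
    match = BUG_RE.search(line)
    if match:
        rows.append(" ".join(match.groups()))

print("\n".join(rows))
```

For these lines the output is the two rows the question asks for: t_air_sens2 qos_type latency and t_air_sens2 topic_type periodic.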
I tried the following approach to pick out specific lines of the log file.
target = ["BUG"] # list of specific words to look for
with open('demo.log', 'r') as infile, open('output.txt', 'w') as outfile:
    for line in infile:
        for phrase in target:
            if phrase in line:
                outfile.write(line)
                break # write each matching line only once
This writes every line that contains one of the words in target to output.txt. It copies whole lines; it does not extract the individual words.
Here's my code:
#!/usr/bin/python
import io
import re
f = open('/etc/ssh/sshd_config','r')
strings = re.search(r'.*IgnoreR.*', f.read())
print(strings)
That returns data, but I need specific regex matching: e.g.:
^\s*[^#]*IgnoreRhosts\s+yes
If I change my code to simply:
strings = re.search(r'^IgnoreR.*', f.read())
or even
strings = re.search(r'^.*IgnoreR.*', f.read())
I don't get anything back. I need to be able to use real regex's like in perl
You can use multiline mode, in which ^ matches at the beginning of each line:
#!/usr/bin/python
import io
import re
f = open('/etc/ssh/sshd_config','r')
strings = re.search(r"^\s*[^#]*IgnoreRhosts\s+yes", f.read(), flags=re.MULTILINE)
print(strings.group(0))
Note that without this mode you can always replace ^ by \n
Note too that this file has a very regular format, thus:
^IgnoreRhosts\s+yes
is good enough for checking the parameter
EDIT: a better way
with open('/etc/ssh/sshd_config') as f:
    for line in f:
        if line.startswith('IgnoreRhosts yes'):
            print(line)
Once again, there is no reason for the line to have leading spaces. However, if you want to be sure, you can always call lstrip() on each line first.
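A line-by-line check that also tolerates leading whitespace, repeated spaces between keyword and value, and commented-out entries could look like the sketch below. The function name ignore_rhosts_enabled and the sample lines are made up for illustration; the sketch takes any iterable of lines, so an open file object would work the same way as the list used here.

```python
def ignore_rhosts_enabled(lines):
    """Return True if any non-comment line sets 'IgnoreRhosts yes'.

    Hypothetical helper: a line is treated as a comment when its first
    non-blank character is '#'.
    """
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # skip commented-out settings
        parts = stripped.split()  # splits on any run of whitespace
        if len(parts) >= 2 and parts[0] == "IgnoreRhosts" and parts[1] == "yes":
            return True
    return False


# Made-up sample lines standing in for a config file
enabled = ignore_rhosts_enabled([
    "# IgnoreRhosts yes\n",
    "PermitRootLogin no\n",
    "  IgnoreRhosts   yes\n",
])
disabled = ignore_rhosts_enabled([
    "#IgnoreRhosts yes\n",
    "IgnoreRhosts no\n",
])
print(enabled, disabled)  # True False
```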