Why is this Python Regex Expression not working?

I'm new to regular expressions and Python, but I created a script that uses multiple regular expressions. Two of them work when tested on Regexpal.com, but not when I run the script. The script works fine with my other regular expressions. Here are the ones that are not working. Can someone explain why they fail, and give me the correct expressions?
I tested these three variations and none of them work. I have a line with
Patient: Höler, Adam* 10.07.1920 ID-Nr: 1118111111
And I want to extract Patient: Höler, Adam.
Patient:\s.*\*
Patient:.*?([*])
Patient:.*\*
I have another line with
VCI-exsp = 20mm;
And I'm trying to extract VCI-exsp = 20mm (i.e. get rid of the ';'). This is the regex I made; it works on regexpal.com (and in Atom), but not when I run the script.
VCI-exsp =[^;]*
Here is the script I have. regexText.txt is a text file full of my regex expressions, and Realthingnotaphony.txt is the text file with the text I'm trying to extract data from. If the problem is that I'm not including the r prefix, how would I inject it into the expressions?
import re

regexarr = []
with open("regexText.txt") as fw:
    for line in fw:
        regexarr.append(re.compile(line))

matchs = []
count = 1
with open('Realthingnotaphony.txt') as f:
    for line in f:
        for regexp in regexarr:
            test = re.search(regexp, line)
        if test != None:
            matchs.append(test)
            print(test.group(0))

You are reading in from a text file but you are not stripping the newlines. This means your search criteria are not what you think they are. You can check this by using print(regexarr) after loading the first file.
[re.compile('Patient:\\s.*\\*\n'), re.compile('Patient:.*?([*])\n'), re.compile('Patient:.*\\*\n')]
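Because each compiled pattern ends with a literal newline, it can only match text that has a newline at that exact spot, which the sample line does not. A quick check (a sketch using the sample line from the question):

import re

line = "Patient: Höler, Adam* 10.07.1920 ID-Nr: 1118111111"

# the pattern exactly as it comes out of the file, trailing newline included
print(re.search("Patient:\\s.*\\*\n", line))          # None
# the same pattern with the newline stripped off
print(re.search(r"Patient:\s.*\*", line).group(0))    # Patient: Höler, Adam*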
Change your code to:
import re

with open("regexText.txt") as fw:
    # This removes the newline characters
    regexarr = fw.read().splitlines()
# print(regexarr)

matchs = []
count = 1
with open('Realthingnotaphony.txt') as f:
    for line in f:
        for regexp in regexarr:
            test = re.search(regexp, line)
        if test != None:
            matchs.append(test)
            print(test.group(0))
Then your search terms Patient:\s.*\* and VCI-exsp =[^;]* will work.
Note:
You have a logic error when adding entries to your match list: because the if check sits after the inner loop, test is overwritten by every search term, so you can only ever get a result for the last one!
You can fix this by moving the if check inside the inner loop or by moving the regex loop. Note that you can't just swap the regex loop with for line in f, because f is an iterator and you will exhaust it on the first pass. (An alternative that avoids reopening the file is sketched after the code below.)
This would make your code:
import re

with open("regexText.txt") as fw:
    regexarr = fw.read().splitlines()
# print(regexarr)

matchs = []
count = 1
for regexp in regexarr:
    with open('Realthingnotaphony.txt') as f:
        for line in f:
            test = re.search(regexp, line)
            if test != None:
                matchs.append(test)
                print(test.group(0))
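If you would rather not reopen the file for every pattern, another option (a sketch, not part of the original answer) is to read all the lines into a list once; a list, unlike a file iterator, can be looped over as many times as you like:

import re

with open("regexText.txt") as fw:
    regexarr = fw.read().splitlines()

with open('Realthingnotaphony.txt') as f:
    lines = f.readlines()   # keep the lines around so each pattern gets a full pass

matchs = []
for regexp in regexarr:
    for line in lines:
        test = re.search(regexp, line)
        if test is not None:
            matchs.append(test)
            print(test.group(0))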
You can also fix this by loading the entire file instead of reading it line by line, and using the re.findall method rather than re.search. findall returns a list of matching strings that you can then unbundle.
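A minimal sketch of that approach, using the filenames from the question (note that re.findall returns the captured groups rather than the whole match when a pattern contains capturing groups):

import re

with open("regexText.txt") as fw:
    patterns = fw.read().splitlines()

with open('Realthingnotaphony.txt') as f:
    text = f.read()                      # the whole file as one string

for pattern in patterns:
    matches = re.findall(pattern, text)  # list of matches; groups are returned if the pattern has any
    for m in matches:
        print(m)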

Related

Creating a list of all Regex Expressions

I'm trying to create a list of all regex matches. I'm looking for anything with the format name='[xxxxx]'; anything with that format should be added to a list element. See the code below.
import re

fpath = open('Netezza_twb.txt', 'r')
lines = fpath.readlines()
temp_out_lines = [line for line in lines if '<column caption' in line]
new_var = [line for line in temp_out_lines if 'param-domain-type' not in line]
for x in range(len(out_lines)):
    test_v2 = str(new_var[x])
    new_var[x] = re.findall(r"name='\[(.*?)\]'", lambda m: m.group(1).lower(), test_v2)
I previously used re.sub() to lower-case all the elements in the txt file, but now I would like to gather every element fitting the above regex format, name='[xxxxx]', into a list.
Note: re.findall() might not be the best function, as I don't have much regular expression experience. The regex itself, r"name='\[(.*?)\]'", has proven to work previously, so I believe the issue is not in its formatting.
You should not mix the syntax of re.sub with that of re.findall: re.findall takes a pattern and a string, not a replacement function.
Use
import re

results = []
fpath = open('Netezza_twb.txt', 'r')
for line in fpath:
    if '<column caption' in line and 'param-domain-type' not in line:
        results.extend(list(map(str.lower, re.findall(r"name='\[([^][]*)]'", line))))
Notes
for line in fpath: - reads the file line by line
if '<column caption' in line and 'param-domain-type' not in line: only processes a line that contains one string and not another
re.findall(r"name='\[([^][]*)]'", line) extracts the matches captured in Group 1 (the contents between name='[ and ]')
list(map(str.lower,...)) converts the matches to lower case
results.extend adds the found matches to a list.
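For example, on a hypothetical input line (the caption and name values are made up for illustration), the expression pulls out the bracketed name and lower-cases it:

import re

line = "<column caption='Order ID' name='[Order_ID]' datatype='integer'/>"
print(list(map(str.lower, re.findall(r"name='\[([^][]*)]'", line))))
# ['order_id']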

'regular expression in <string>' requires string as left operand, not list

I am new to Python and I can't figure out why the second script, which uses regular expressions, does not work.
Use case:
I want to extract entries starting with "crypto map IPSEC xx ipsec-isakmp" from a
Cisco running configuration file and print this line and the next 4.
I have managed to print the lines after the match, but not the matched line itself.
My workaround for this is to print the text "crypto map IPSEC" statically first.
The script then prints the next 4 lines using islice.
As this is not perfect, I wanted to use regular expressions instead, but that does not work at all.
from itertools import islice
import re

# This works
print('Crypto map configurations: \n')
with open('show_run.txt', 'r') as f:
    for line in f:
        if 'crypto map IPSEC' and 'ipsec-isakmp' in line:
            print('crypto map IPSEC')
            print(''.join(islice(f, 4)))
f.close()

# The following does not work.
# Here I would like to use regular expressions to fetch the lines
# with "crypto map IPSEC xx ipsec-isakmp"
#
'''
print('Crypto map configurations: \n')
with open('show_run.txt', 'r') as f:
    for line in f:
        pattern = r"crypto\smap\sIPSEC\s\d+\s.+"
        matched = re.findall(pattern, line)
        if str(matched) in line:
            print(str(matched))
            print(''.join(islice(f, 4)))
f.close()
'''
if 'crypto map IPSEC' and 'ipsec-isakmp' in line:
should be:
if 'crypto map IPSEC' in line and 'ipsec-isakmp' in line:
Another alternative (if the line looks like what you described in the question):
if line.startswith('crypto map IPSEC') and line.rstrip().endswith('ipsec-isakmp'): ...
(the rstrip() is needed because lines read from a file keep their trailing newline)
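The reason the original condition misbehaves is operator precedence: 'crypto map IPSEC' and 'ipsec-isakmp' in line is evaluated as 'crypto map IPSEC' and ('ipsec-isakmp' in line), and a non-empty string literal is always truthy, so only the second membership test actually matters. A quick illustration (the sample line is made up):

line = "something else that mentions ipsec-isakmp"

# evaluates as: 'crypto map IPSEC' and ('ipsec-isakmp' in line)  ->  True
print('crypto map IPSEC' and 'ipsec-isakmp' in line)            # True
# the corrected test really checks both substrings
print('crypto map IPSEC' in line and 'ipsec-isakmp' in line)    # False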
And in:
print(''.join(islice(f, 4)))
You probably want to parse the line not f.
As for your question about regex: there is no need to parse this with a regex (consider the previous parts of this answer), as it runs slower and is usually harder to maintain. That said, if this is for learning purposes, you can do:
import re

line = 'crypto map IPSEC 12345 ipsec-isakmp'
pattern = r'crypto map IPSEC (\d+) ipsec-isakmp'
matched = re.findall(pattern, line)
if matched:
    print(matched[0])
I want to extract entries starting with "crypto map IPSEC xx ipsec-isakmp" from a Cisco running configuration file and print this line and the next 4.
Then you're making it much more complicated than it has to be:
for line in f:
    if line.startswith("crypto map IPSEC") and "ipsec-isakmp" in line:
        print(line.strip())
        for i in range(4):
            try:
                print(next(f).strip())
            except StopIteration:
                # we've reached the end of the file and there weren't 4 lines left
                # after the last "crypto map IPSEC" line. Sh!t happens...
                break
NB: if you really insist on using regexps, replace the second line with
if re.match(r"^crypto map IPSEC \d+ ipsec-isakmp", line):
(assuming this is the correct pattern of course - hard to tell for sure without seeing your real data)

Finding strings in Text Files in Python

I need a program to find a string (S) in a file (P) and return the number of times it appears in the file. To do this I decided to create a function:
def file_reading(P, S):
    file1 = open(P, 'r')
    pattern = S
    match1 = "re.findall(pattern, P)"
    if match1 != None:
        print(pattern)
I know it doesn't look very good, but for some reason it's not outputting anything, let alone the right answer.
There are multiple problems with your code.
First of all, calling open() returns a file object. It does not read the contents of the file. For that you need to use read() or iterate through the file object.
Secondly, if your goal is to count the number of matches of a string, you don't need regular expressions. You can use the string function count(). Even still, it doesn't make sense to put the regular expression call in quotes.
match1 = "re.findall(pattern, file1.read())"
Assigns the string "re.findall(pattern, file1.read())" to the variable match1.
Here is a version that should work for you:
def file_reading(file_name, search_string):
    # this will put the contents of the file into a string
    file1 = open(file_name, 'r')
    file_contents = file1.read()
    file1.close()  # close the file
    # return the number of times the string was found
    return file_contents.count(search_string)
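Called like this (the filename and search string are just placeholders):

# count how many times the word 'error' appears in log.txt
print(file_reading('log.txt', 'error'))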
You can also read line by line instead of reading the entire file, find the number of times the pattern appears in each line, and add it to the total count c:
def file_reading(file_name, pattern):
    c = 0
    with open(file_name, 'r') as f:
        for line in f:
            c += line.count(pattern)
    if c:
        print(c)
There are a few errors; let's go through them one by one:
Anything in quotes is a string. Putting "re.findall(pattern, file1.read())" in quotes just makes a string. If you actually want to call the re.findall function, no quotes are needed :)
You check whether match1 is None or not, which is really great, but then you should print the matches, not the initial pattern.
The if-statement should not be indented.
Also:
Always close a file once you have opened it! Since most people forget to do this, it is better to use the with open(filename, action) syntax.
So, taken together, it would look like this (I've changed some variable names for clarity):
import re

def file_reading(input_file, pattern):
    with open(input_file, 'r') as text_file:
        data = text_file.read()
        matches = re.findall(pattern, data)
        if matches:
            print(matches)  # prints a list of all strings found
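A quick usage example (the filename and pattern are illustrative):

file_reading('notes.txt', r'cat')  # prints the list of matches, e.g. ['cat', 'cat'], if any are found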

Concatenate lines with previous line based on number of letters in first column

New to coding and trying to figure out how to fix a broken CSV file so that I can work with it properly.
The file has been exported from a case management system and contains fields for username, case number, time spent, notes and date.
The problem is that some notes contain newlines, and when exporting the CSV the tooling does not add quotation marks to mark the note as a single field.
See the example below:
user;case;hours;note;date;
tnn;123;4;solved problem;2017-11-27;
tnn;124;2;random comment;2017-11-27;
tnn;125;3;I am writing a comment
that contains new lines
without quotation marks;2017-11-28;
HJL;129;8;trying to concatenate lines to re form the broken csv;2017-11-29;
I would like to concatenate lines 3,4 and 5 to show the following:
tnn;125;3;I am writing a comment that contains new lines without quotation marks;2017-11-28;
Since every line starts with a username (always 3 letters) I thought I would be able to iterate the lines to find which lines do not start with a username and concatenate that with the previous line.
It is not really working as expected though.
This is what I have got so far:
import re

with open('Rapp.txt', 'r') as f:
    for line in f:
        previous = line  # keep current line in variable to join next line
        if not re.match(r'^[A-Za-z]{3}', line):  # regex to match 3 letters
            print(previous.join(line))
The script shows no output, it just finishes silently. Any thoughts?
I think I would go a slightly different way:
import re

all_the_data = ""
with open('Rapp.txt', 'r') as f:
    for line in f:
        if not re.search(r"\d{4}-\d{1,2}-\d{1,2};\n", line):
            line = re.sub("\n", "", line)
        all_the_data = "".join([all_the_data, line])
print(all_the_data)
There are several ways to do this, each with pros and cons, but I think this keeps it simple.
Loop over the file as you have done, and if the line doesn't end in a date and ';', take the newline off before stuffing the line into all_the_data. That way you don't have to play with looking back 'up' the file. Again, lots of ways to do this. If you would rather use the logic of 'starts with 3 letters and a ;' and look back, this works:
import re

all_the_data = ""
with open('Rapp.txt', 'r') as f:
    all_the_data = ""
    for line in f:
        if not re.search(r"^[A-Za-z]{3};", line):
            all_the_data = re.sub(r"\n$", "", all_the_data)
        all_the_data = "".join([all_the_data, line])
print("results:")
print(all_the_data)
Pretty much what was asked for. The logic being: if the current line doesn't start the right way, take the previous line's newline out of all_the_data.
If you need help playing with the regex itself, this site is great: http://regex101.com
The regex in your code matches every line in the txt file, because the wrapped note lines also start with three letters, so the if condition is never true and hence nothing prints.
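You can verify this with the wrapped lines from the sample data; re.match finds a match on every one of them, so the not branch never runs:

import re

for line in ["tnn;125;3;I am writing a comment",
             "that contains new lines",
             "without quotation marks;2017-11-28;"]:
    print(bool(re.match(r'^[A-Za-z]{3}', line)))  # True, True, True

The loop below sidesteps the regex entirely: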
with open('./Rapp.txt', 'r') as f:
    join_words = []
    for line in f:
        line = line.strip()
        if len(line) > 3 and ";" in line[0:4] and len(join_words) > 0:
            print(';'.join(join_words))
            join_words = []
            join_words.append(line)
        else:
            join_words.append(line)
    print(";".join(join_words))
I've tried to not use regex here to keep it a little clear if possible. But, regex is a better option.
A simple way would be to use a generator that acts as a filter on the original file. The filter concatenates a line to the previous one if it does not have a semicolon (;) in its 4th column. The code could be:
def preprocess(fd):
    previous = next(fd)
    for line in fd:
        if line[3] == ';':
            yield previous
            previous = line
        else:
            previous = previous.strip() + " " + line
    yield previous  # don't forget last line!
You could then use:
import csv

with open('test.txt') as fd:
    rd = csv.DictReader(preprocess(fd), delimiter=';')  # the sample data is ';'-separated
    for row in rd:
        ...
The trick here is that the csv module only requires an object that returns a line each time next() is called on it, so a generator is appropriate.
But this is only a workaround; the correct fix would be for the previous step to directly produce a valid CSV file.
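Putting it together on the sample data from the question (a sketch; the filename Rapp.txt and the fieldnames come from the question's example):

import csv

def preprocess(fd):
    previous = next(fd)
    for line in fd:
        if line[3] == ';':
            yield previous
            previous = line
        else:
            previous = previous.strip() + " " + line
    yield previous  # don't forget last line!

with open('Rapp.txt') as fd:
    for row in csv.DictReader(preprocess(fd), delimiter=';'):
        print(row['case'], '->', row['note'])

# the wrapped record comes out as a single row:
# 125 -> I am writing a comment that contains new lines without quotation marks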

Writing items to file on separate lines without blank line at the end

I have a file with a bunch of text that I want to tear through, match a bunch of things and then write these items to separate lines in a new file.
This is the basics of the code I have put together:
import re

f = open('this.txt', 'r')
g = open('that.txt', 'w')

text = f.read()
matches = re.findall('', text)  # do some re matching here
for i in matches:
    a = i[0] + '\n'
    g.write(a)

f.close()
g.close()
My issue is I want each matched item on a new line (hence the '\n') but I don't want a blank line at the end of the file.
I guess I need the last item in the file to not be followed by a newline character.
What is the Pythonic way of sorting this out? Also, is the way I have set this up in my code the best way of doing this, or the most Pythonic?
If you want to write out a sequence of lines with newlines between them, but no newline at the end, I'd use str.join. That is, replace your for loop with this:
output = "\n".join(i[0] for i in matches)
g.write(output)
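str.join only puts the separator between items, never after the last one, which gives exactly the "no blank line at the end" behaviour you want:

print(repr("\n".join(["first", "second", "third"])))
# 'first\nsecond\nthird'  <- separators between items, no trailing newline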
In order to avoid having to close your files explicitly, especially if your code might be interrupted by exceptions, you can use the with statement to make things simpler. The following code replaces the entire code in your question:
import re

with open('this.txt') as f, open('that.txt', 'w') as g:
    text = f.read()
    matches = re.findall('', text)  # do some re matching here
    g.write("\n".join(i[0] for i in matches))
or, since you don't need both files open at the same time:
with open('this.txt') as f:
    text = f.read()
matches = re.findall('', text)  # do some re matching here
with open('that.txt', 'w') as g:
    g.write("\n".join(i[0] for i in matches))
