Creating a list of all Regex Expressions - python

I'm trying to create a list of all regex matches. I'm looking for anything with the format name='[xxxxx]', and anything with that format should be added to a list element. See the code below.
import re

fpath = open('Netezza_twb.txt', 'r')
lines = fpath.readlines()
temp_out_lines = [line for line in lines if '<column caption' in line]
new_var = [line for line in temp_out_lines if 'param-domain-type' not in line]
for x in range(len(out_lines)):
    test_v2 = str(new_var[x])
    new_var[x] = re.findall(r"name='\[(.*?)\]'", lambda m: m.group(1).lower(), test_v2)
I previously used re.sub() to lowercase all the elements in the txt file, but now I would like to gather every element matching the above format, name='[xxxxx]', into a list.
Note that re.findall() might not be the best function here, as I don't have much regular-expression experience. The pattern r"name='\[(.*?)\]'" has worked for me before, though, so I don't believe the issue is with the pattern itself.

You should not mix syntax used by re.sub with that of re.findall.
Use:
import re

results = []
with open('Netezza_twb.txt', 'r') as fpath:
    for line in fpath:
        if '<column caption' in line and 'param-domain-type' not in line:
            results.extend(list(map(str.lower, re.findall(r"name='\[([^][]*)]'", line))))
Notes
for line in fpath: - reads the file line by line
if '<column caption' in line and 'param-domain-type' not in line: - only processes lines that contain the first string and not the second
re.findall(r"name='\[([^][]*)]'", line) - extracts the matches captured in Group 1 (the contents between name='[ and ]')
list(map(str.lower, ...)) - converts the matches to lower case
results.extend - adds the found matches to the results list.
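For instance, on a hypothetical line in that format (the sample line below is made up purely for illustration), the pattern plus str.lower behaves like this:

import re

sample = "<column caption='Sales' name='[SALES_AMT]' datatype='real'/>"  # hypothetical line
print(re.findall(r"name='\[([^][]*)]'", sample))                         # ['SALES_AMT']
print(list(map(str.lower, re.findall(r"name='\[([^][]*)]'", sample))))   # ['sales_amt']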


Regular expression: 'in <string>' requires string as left operand, not list

I am new to Python and I can't figure out why the second script, which uses regular expressions, does not work.
Use case:
I want to extract entries starting with "crypto map IPSEC xx ipsec-isakmp" from a
Cisco running configuration file and print this line and the next 4.
I have managed to print the lines after the match but not the matched line itself.
My workaround for this is to print the text "crypto map IPSEC" statically first.
The script then prints the next 4 lines using islice.
Since this is not ideal, I wanted to use a regular expression instead, but that does not work at all.
from itertools import islice
import re

# This works
print('Crypto map configurations: \n')
with open('show_run.txt', 'r') as f:
    for line in f:
        if 'crypto map IPSEC' and 'ipsec-isakmp' in line:
            print('crypto map IPSEC')
            print(''.join(islice(f, 4)))
f.close()

# The following does not work.
# Here I would like to use regular expressions to fetch the lines
# with "crypto map IPSEC xx ipsec-isakmp"
#
'''
print('Crypto map configurations: \n')
with open('show_run.txt', 'r') as f:
    for line in f:
        pattern = r"crypto\smap\sIPSEC\s\d+\s.+"
        matched = re.findall(pattern, line)
        if str(matched) in line:
            print(str(matched))
            print(''.join(islice(f, 4)))
f.close()
'''
if 'crypto map IPSEC' and 'ipsec-isakmp' in line:
should be:
if 'crypto map IPSEC' in line and 'ipsec-isakmp' in line:
Another alternative (if the line looks like what you described in the question):
if line.startswith('crypto map IPSEC') and line.endswith('ipsec-isakmp'): ...
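Note that lines read with for line in f usually keep their trailing newline, so in practice you would strip the line before matching, for example:

stripped = line.strip()
if stripped.startswith('crypto map IPSEC') and stripped.endswith('ipsec-isakmp'):
    ...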
And in:
print(''.join(islice(f, 4)))
You probably want to parse the line, not f.
As for your question about regex: there is no need to parse this with a regex (see the earlier parts of this answer), as it runs much slower and is usually harder to maintain. That said, if this question is for learning, you can do:
import re

line = 'crypto map IPSEC 12345 ipsec-isakmp'
pattern = r'crypto map IPSEC (\d+) ipsec-isakmp'
matched = re.findall(pattern, line)
if matched:
    print(matched[0])
I want to extract entries starting with "crypto map IPSEC xx ipsec-isakmp" from a Cisco running configuration file and print this line and the next 4.
Then you're making it much more complicated than it has to be:
for line in f:
    if line.startswith("crypto map IPSEC") and "ipsec-isakmp" in line:
        print(line.strip())
        for i in range(4):
            try:
                print(next(f).strip())
            except StopIteration:
                # we've reached the end of the file and there weren't 4 lines left
                # after the last "crypto map IPSEC" line. Sh!t happens...
                break
NB: if you really insist on using regexps, replace the second line with
if re.match(r"^crypto map IPSEC \d+ ipsec-isakmp", line):
(assuming this is the correct pattern of course - hard to tell for sure without seeing your real data)
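Putting those pieces together, a minimal sketch of the regex variant might look like this (it reuses show_run.txt from the question and the pattern assumed above):

import re
from itertools import islice

with open('show_run.txt', 'r') as f:
    for line in f:
        if re.match(r"^crypto map IPSEC \d+ ipsec-isakmp", line):
            print(line.strip())
            print(''.join(islice(f, 4)))  # print the next 4 lines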

Why is this Python Regex Expression not working?

I'm new to regular expressions and Python, but I created a script that uses several regexes. Two of them work when tested on Regexpal.com, but not when I run the script; the script works fine with my other regexes. Here are the two that are not working. Can someone explain why they fail and give me the correct expressions?
I tested these three different ones; none of them work. I have a line with
Patient: Höler, Adam* 10.07.1920 ID-Nr: 1118111111
And I want to extract Patient: Höler, Adam.
Patient:\s.*\*
Patient:.*?([*])
Patient:.*\*
I have another line with
VCI-exsp = 20mm;
And I'm trying to extract VCI-exsp=20mm (get rid of the ';'). This is the regex I made; it works on regexpal.com (and in Atom), but not when I run the script.
VCI-exsp =[^;]*
Here is the script I have. regexText.txt is a text file full of my regex expressions, and Realthingnotaphony.txt is the text file with the text I'm trying to extract data from. If the problem is that I'm not including the r prefix, how would I inject it into the expressions?
import re

regexarr = []
with open("regexText.txt") as fw:
    for line in fw:
        regexarr.append(re.compile(line))

matchs = []
count = 1
with open('Realthingnotaphony.txt') as f:
    for line in f:
        for regexp in regexarr:
            test = re.search(regexp, line)
        if test != None:
            matchs.append(test)
            print(test.group(0))
You are reading in from a text file but you are not stripping the newlines. This means your search criteria are not what you think they are. You can check this by using print(regexarr) after loading the first file.
[re.compile('Patient:\\s.*\\*\n'), re.compile('Patient:.*?([*])\n'), re.compile('Patient:.*\\*\n')]
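You can see the effect directly: with the trailing newline, the first pattern from the question no longer matches the sample line (the sample line is taken from the question):

import re

line = 'Patient: Höler, Adam* 10.07.1920 ID-Nr: 1118111111'
print(re.search('Patient:\\s.*\\*\n', line))  # None - the compiled pattern demands a newline right after the *
print(re.search(r'Patient:\s.*\*', line))     # matches 'Patient: Höler, Adam*'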
Change your code to:
import re

with open("regexText.txt") as fw:
    # This removes the newline character
    regexarr = fw.read().splitlines()
# print(regexarr)

matchs = []
count = 1
with open('Realthingnotaphony.txt') as f:
    for line in f:
        for regexp in regexarr:
            test = re.search(regexp, line)
        if test != None:
            matchs.append(test)
            print(test.group(0))
Then your search terms Patient:\s.*\* and VCI-exsp =[^;]* will work.
Note:
You have a logic error in adding entries to your match list: you loop over every search term and overwrite test each time, only checking it after the inner loop finishes, so you can only ever get a result for the last search term!
You can fix this by checking the result inside the inner loop, or by moving the regex loop outward. Note that you can't simply swap it with the for line in f loop, because f is an iterator and you would exhaust it on the first pass.
This would make your code:
import re

with open("regexText.txt") as fw:
    regexarr = fw.read().splitlines()
# print(regexarr)

matchs = []
count = 1
for regexp in regexarr:
    with open('Realthingnotaphony.txt') as f:
        for line in f:
            test = re.search(regexp, line)
            if test != None:
                matchs.append(test)
                print(test.group(0))
You can also fix this by loading the entire file instead of reading it line by line and using re.findall rather than re.search. This returns a list of strings that you can then unbundle.
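A minimal sketch of that whole-file variant, reusing the two file names from the question:

import re

with open("regexText.txt") as fw:
    regexarr = fw.read().splitlines()

with open('Realthingnotaphony.txt') as f:
    text = f.read()

matchs = []
for regexp in regexarr:
    # findall returns a list of matching strings (or of group contents if the pattern has groups)
    matchs.extend(re.findall(regexp, text))

print(matchs)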

Can't replace string with variable

I came up with the code below, which finds a string in a row and copies that row to a new file. I want to replace Foo23 with something more dynamic (e.g. [0-9], etc.), but I cannot get this to work with variables or regex. It doesn't fail, but I also get no results. Help? Thanks.
with open('C:/path/to/file/input.csv') as f:
    with open('C:/path/to/file/output.csv', "w") as f1:
        for line in f:
            if "Foo23" in line:
                f1.write(line)
Based on your comment, you want to match lines whenever any three letters followed by two numbers are present, e.g. foo12 and bar54. Use regex!
import re

pattern = r'([a-zA-Z]{3}\d{2})\b'
for line in f:
    if re.findall(pattern, line):
        f1.write(line)
This will match lines like 'some line foo12' and 'another foo54 line', but not 'a third line foo' or 'something bar123'.
Breaking it down:
pattern = r'(            # start capture group, not needed here, but nice if you want the actual match back
             [a-zA-Z]{3} # any three letters in a row, any case
             \d{2}       # any two digits
             )           # end capture group
             \b          # word boundary (e.g. whitespace, punctuation or end of line)
            '
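For instance, checking the pattern against the sample phrases mentioned above:

import re

pattern = r'([a-zA-Z]{3}\d{2})\b'
print(re.findall(pattern, 'some line foo12'))     # ['foo12']
print(re.findall(pattern, 'another foo54 line'))  # ['foo54']
print(re.findall(pattern, 'a third line foo'))    # []
print(re.findall(pattern, 'something bar123'))    # []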
If all you really need is to write all of the matches in the file to f1, you can use:
matches = re.findall(pattern, f.read()) # finds all matches in f
f1.write('\n'.join(matches)) # writes each match to a new line in f1
In essence, your question boils down to: "I want to determine whether the string matches pattern X, and if so, output it to the file." The best way to accomplish this is to use a regex. In Python, the standard regex library is re. So,
import re
matches = re.findall(r'([a-zA-Z]{3}\d{2})', line)
Combining this with file IO operations, we have:
data = []
with open('C:/path/to/file/input.csv', 'r') as f:
    data = list(f)

data = [x for x in data if re.findall(r'([a-zA-Z]{3}\d{2})\b', x)]

with open('C:/path/to/file/output.csv', 'w') as f1:
    for line in data:
        f1.write(line)
Notice that I split up your file IO operations to reduce nesting, and moved the filtering out of the IO blocks. In general, each portion of your code should do "one thing", for ease of testing and maintenance.

Finding strings in Text Files in Python

I need a program to find a string (S) in a file (P) and return the number of times it appears in the file. To do this I decided to create a function:
def file_reading(P, S):
    file1 = open(P, 'r')
    pattern = S
    match1 = "re.findall(pattern, P)"
    if match1 != None:
        print(pattern)
I know it doesn't look very good, but for some reason it's not outputting anything, let alone the right answer.
There are multiple problems with your code.
First of all, calling open() returns a file object. It does not read the contents of the file. For that you need to use read() or iterate through the file object.
Secondly, if your goal is to count the number of occurrences of a string, you don't need regular expressions; you can use the string method count(). Either way, it doesn't make sense to put the regular expression call in quotes.
match1 = "re.findall(pattern, file1.read())"
Assigns the string "re.findall(pattern, file1.read())" to the variable match1.
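In other words, the quoted line never runs any regex at all:

match1 = "re.findall(pattern, P)"
print(match1)        # re.findall(pattern, P)  - just a piece of text
print(type(match1))  # <class 'str'> - the function was never called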
Here is a version that should work for you:
def file_reading(file_name, search_string):
    # this will put the contents of the file into a string
    file1 = open(file_name, 'r')
    file_contents = file1.read()
    file1.close()  # close the file
    # return the number of times the string was found
    return file_contents.count(search_string)
You can read line by line instead of reading the entire file, find the number of times the pattern is repeated in each line, and add it to the total count c:
def file_reading(file_name, pattern):
    c = 0
    with open(file_name, 'r') as f:
        for line in f:
            c += line.count(pattern)
    if c:
        print(c)
There are a few errors; let's go through them one by one:
Anything in quotes is a string. Putting "re.findall(pattern, file1.read())" in quotes just makes a string. If you actually want to call the re.findall function, no quotes are needed :)
You check whether match1 is None or not, which is really great, but then you should output the matches, not the initial pattern.
The if-statement should not be indented.
Also:
Always close a file once you have opened it! Since most people forget to do this, it is better to use the with open(filename, action) syntax.
So, taken together, it would look like this (I've changed some variable names for clarity):
import re

def file_reading(input_file, pattern):
    with open(input_file, 'r') as text_file:
        data = text_file.read()
        matches = re.findall(pattern, data)
        if matches:
            print(matches)  # prints a list of all strings found
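A quick usage sketch of that final version (the file name and pattern here are made up for illustration):

# Prints e.g. ['foo12', 'bar54'] if such strings appear in example.txt
file_reading('example.txt', r'[a-zA-Z]{3}\d{2}')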

Slice strings in .txt and return only one of the new strings

I want to use the lines of strings of a .txt file as search queries in other .txt files. But before this, I need to slice those strings in the lines of my original text data. Is there a simple way to do this?
This is my original .txt data:
CHEMBL2057820|MUBD_HDAC2_ligandset|mol2|42|dock12
CHEMBL1957458|MUBD_HDAC2_ligandset|mol2|58|dock10
CHEMBL251144|MUBD_HDAC2_ligandset|mol2|41|dock98
CHEMBL269935|MUBD_HDAC2_ligandset|mol2|30|dock58
... (over thousands)
And I need to have a new file where the new lines contain only part of those strings, like:
CHEMBL2057820
CHEMBL1957458
CHEMBL251144
CHEMBL269935
Open the file, read in the lines, split each line at the | character, then take the first element:
with open("test.txt") as f:
    parts = (line.lstrip().split('|', 1)[0] for line in f)
    with open('dest.txt', 'w') as dest:
        dest.write("\n".join(parts))
Explanation:
lstrip - removes leading whitespace from the line
split('|') returns a list like ['CHEMBL2057820', 'MUBD_HDAC2_ligandset', 'mol2', '42', 'dock12'] for each line
Since we're only concerned with the first section, it's redundant to split the rest of the line on the | character, so we can pass a maxsplit argument, which stops splitting after that many splits have been made.
So split('|', 1)
gives ['CHEMBL2057820', 'MUBD_HDAC2_ligandset|mol2|42|dock12']
Since we're only interested in the first part, split('|', 1)[0] returns
the "CHEMBL..." section.
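For instance, on the first sample line from the question:

line = 'CHEMBL2057820|MUBD_HDAC2_ligandset|mol2|42|dock12'
print(line.split('|', 1))     # ['CHEMBL2057820', 'MUBD_HDAC2_ligandset|mol2|42|dock12']
print(line.split('|', 1)[0])  # CHEMBL2057820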
Use split and readlines:
with open('foo.txt') as f:
    g = open('bar.txt', 'w')
    lines = f.readlines()
    for line in lines:
        l = line.strip().split('|')[0]
        g.write(l + '\n')
    g.close()
