Regex that grabs variable number of groups - python

This is not a question asking how to use re.findall() or the global modifier (?g) or \g. This is asking how to match n groups with a single regex, with n between 3 and 5.
Rules:
needs to ignore lines whose first non-space character is # (comments)
needs to get at least three items, always in order: ITEM1, ITEM2, ITEM3
class ITEM1(stuff)
model = ITEM2
fields = (ITEM3)
needs to get any of the following matches if they exist (UNKNOWN order, and can be missing)
write_once_fields = (ITEM4)
required_fields = (ITEM5)
needs to know which match is which, so either retrieve matches in order, returning None if there is no match, or retrieve pairs.
My question is whether this is doable, and if so, how?
I've gotten this far, but it doesn't yet deal with comments, unknown order, or missing items, and it doesn't stop searching when the next class definition begins. https://www.regex101.com/r/cG5nV9/8
(?s)\nclass\s(.*?)(?=\()
.*?
model\s=\s(.*?)\n
.*?
(?=fields.*?\((.*?)\))
.*?
(?=write_once_fields.*?\((.*?)\))
.*?
(?=required_fields.*?\((.*?)\))
Do I need a conditional?
Thanks for any kind of hints.
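It is doable with a single pattern, though it gets hairy quickly. Below is an untested sketch (the group names wof and rf and the filename models.py are made up): each requirement lives in its own lookahead anchored just after the class header, so the optional fields may appear in any order or be missing, and the tempered dot (?:(?!^class\s).) keeps every lookahead from running into the next class definition. Comment lines are excluded for free, because ^\s*model cannot match a line whose first non-space character is #.
import re

source = open('models.py').read()  # hypothetical input file

pattern = re.compile(r"""
    ^class\s+(?P<name>\w+)\(                                                      # class ITEM1(stuff)
    (?=(?:(?!^class\s).)*?^\s*model\s*=\s*(?P<model>\S+))                         # ITEM2, required
    (?=(?:(?!^class\s).)*?^\s*fields\s*=\s*\((?P<fields>[^)]*)\))                 # ITEM3, required
    (?:(?=(?:(?!^class\s).)*?^\s*write_once_fields\s*=\s*\((?P<wof>[^)]*)\)))?    # ITEM4, optional
    (?:(?=(?:(?!^class\s).)*?^\s*required_fields\s*=\s*\((?P<rf>[^)]*)\)))?       # ITEM5, optional
""", re.MULTILINE | re.DOTALL | re.VERBOSE)

for m in pattern.finditer(source):
    print(m.group('name', 'model', 'fields', 'wof', 'rf'))  # None where a field is missing
The line-by-line parsing below is far easier to maintain, but the lookahead trick addresses the "variable number of groups in unknown order" part directly.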

I'd do something like:
from collections import defaultdict
import re

comment_line = re.compile(r"\s*#")
matches = defaultdict(dict)

with open('path/to/file.txt') as inf:
    d = {}  # should catch and dispose of any matching lines
            # not related to a class
    for line in inf:
        if comment_line.match(line):
            continue  # skip this line
        line = line.strip()  # attribute lines are indented, so strip before the startswith checks
        if line.startswith('class '):
            classname = line.split()[1].partition('(')[0]  # drop the "(stuff)" part of the header
            d = matches[classname]
        if line.startswith('model'):
            d['model'] = line.split('=')[1].strip()
        if line.startswith('fields'):
            d['fields'] = line.split('=')[1].strip()
        if line.startswith('write_once_fields'):
            d['write_once_fields'] = line.split('=')[1].strip()
        if line.startswith('required_fields'):
            d['required_fields'] = line.split('=')[1].strip()
You could probably do this more cleanly with regex matching.
comment_line = re.compile(r"\s*#")
class_line = re.compile(r"class (?P<classname>\w+)")  # the named group captures the class name
possible_keys = ["model", "fields", "write_once_fields", "required_fields"]
data_line = re.compile(r"\s*(?P<key>" + "|".join(possible_keys) +
                       r")\s+=\s+(?P<value>.*)")

with open( ...
    d = {}  # default catcher as above
    for line in ...
        if comment_line.match(line):
            continue
        class_match = class_line.match(line)
        if class_match:
            d = matches[class_match.group('classname')]
            continue  # there won't be more than one match per line
        data_match = data_line.match(line)
        if data_match:
            key, value = data_match.group('key'), data_match.group('value')
            d[key] = value
But this might be harder to understand. YMMV.
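As a quick sanity check, the regex version can be exercised on an inline sample (the class and field names here are invented):
import io
import re
from collections import defaultdict

comment_line = re.compile(r"\s*#")
class_line = re.compile(r"class (?P<classname>\w+)")
possible_keys = ["model", "fields", "write_once_fields", "required_fields"]
data_line = re.compile(r"\s*(?P<key>" + "|".join(possible_keys) +
                       r")\s+=\s+(?P<value>.*)")

sample = io.StringIO("""\
class FooView(Base):
    # model = NotThisOne  (a comment, skipped)
    model = Foo
    fields = (a, b, c)
    required_fields = (a,)
""")

matches = defaultdict(dict)
d = {}
for line in sample:
    if comment_line.match(line):
        continue
    class_match = class_line.match(line)
    if class_match:
        d = matches[class_match.group('classname')]
        continue
    data_match = data_line.match(line)
    if data_match:
        d[data_match.group('key')] = data_match.group('value')

print(dict(matches))
# {'FooView': {'model': 'Foo', 'fields': '(a, b, c)', 'required_fields': '(a,)'}}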

Splitting a string to find words between delimiters?

Given a certain line that looks like this:
jfdajfjlausername=Bob&djfkaak;jdskjpassword=12345&
I want to return the username and password, in this case being Bob and 12345.
I tried splitting the string by the & sign but could not figure out how to then find the individual words, and then also tried the below code:
left='password='
right='&'
userleft='username='
for x in file.readlines():
    if 'password=' and 'username=' in x:
        text=str(x)
        #password=(text[text.index(left)+len(left):text.index(right)])
        #username=(text[text.index(userleft)+len(userleft):text.index(useright)])
Without using regular expressions, you can split twice: once on & and once on =:
line = 'jfdajfjlausername=Bob&djfkaak;jdskjpassword=12345&'
items = [item.split('=') for item in line.split('&')]
Now you can extract the values:
for item in items:
    if len(item) == 2:
        if item[0].endswith('password'):
            password = item[1]
        elif item[0].endswith('username'):
            username = item[1]
If you had a bunch of keys you were looking for, like ('username', 'password'), you could write a nested loop to build dictionaries:
keys = ('username', 'password')
result = {}
for item in items:
    if len(item) == 2:
        for k in keys:
            if item[0].endswith(k):
                result[k] = item[1]
                break
This makes it a lot easier to check that you got all the values you want, e.g. with if len(keys) == len(result): ....
If you want a very simple approach, you could do this:
data = 'jfdajfjlausername=Bob&djfkaak;jdskjpassword=12345&'
#right of "username=" and left of "&"
un = data.split('username=')[1].split('&')[0]
#right of "password=" and left of "&"
pw = data.split('password=')[1].split('&')[0]
print(un, pw) #Bob, 12345
Since the process is identical except for the desired key, you could do something like the below and homogenize getting the value for any key in the query. An interesting side effect: even if your example query did not end in "&", this would still work, because everything left over would be in the result of .split('&')[0] and there simply wouldn't be a .split('&')[1]. Nothing below uses .split('&')[1], so it wouldn't matter.
query = 'jfdajfjlausername=Bob&djfkaak;jdskjpassword=12345&'
key2val = lambda q,k: q.split(f'{k}=')[1].split('&')[0]
un = key2val(query, 'username')
pw = key2val(query, 'password')
print(un, pw) #Bob, 12345
For simple fixed keys like these, this approach is likely faster than a regex, requires no imports or explicit loops, and is flexible enough to get the value for any key, regardless of order, without changing anything.
Use Regex:
import re
for x in file.readlines():
    if 'password=' in x and 'username=' in x:
        text = str(x)
        username = re.findall(r'username=(\w+)', text)
        password = re.findall(r'password=(\w+)', text)
Note the updated if statement. In the original, the if checks whether the string "password=" evaluates to True, which it always will, since it is a non-empty string.
You can use a single regular expression to parse this information out:
import re
s = "jfdajfjlausername=Bob&djfkaak;jdskjpassword=12345&"
regex = "username=(?P<username>.+)&.*password=(?P<password>.+)&"
match = re.search(regex, s)
print(match.groupdict())
{'username': 'Bob', 'password': '12345'}
Implementing this while looping over the lines in a file would look like:
regex = "username=(?P<username>.+)&.*password=(?P<password>.+)&"
with open('text') as f:
for line in f:
match = re.search(regex, line)
if match is not None:
print(match.groupdict())
Update #2
This reads a file named "text" and parses out the username and password for each line if they both exist.
This solution assumes that the username and password fields both end with a "&".
Update #3:
Note that this code will work even if the order of the username and password is reversed.
import re
with open('text') as f:
    for line in f:
        print(line.strip())
        # Note that ([^&]+) captures any characters up to the next &.
        m1 = re.search('username=([^&]+)', line)
        m2 = re.search('password=([^&]+)', line)
        if m1 and m2:
            print('username=', m1[1])
            print('password=', m2[1])
Output:
jfdajfjlausername=Bob&djfkaak;jdskjpassword=12345&
username= Bob
password= 12345
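If you eventually need more keys, one generalisation (a sketch, assuming junk characters are only ever glued onto the front of a key, as in the sample line) is to capture every key=value pair in one pass and match the keys by suffix:
import re

line = 'jfdajfjlausername=Bob&djfkaak;jdskjpassword=12345&'
wanted = ('username', 'password')

result = {}
for key, value in re.findall(r'(\w+)=([^&]+)', line):
    for k in wanted:
        if key.endswith(k):
            result[k] = value
            break

print(result)  # {'username': 'Bob', 'password': '12345'}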

Speed up fuzzy regex search Python

I am looking for suggestions on how to speed up the process described below, which involves a fuzzy regex search.
What I am trying to do
I am fuzzy searching for keywords, stored in a dictionary d (example just below; each value is always a list of two strings, and I need to keep track of which of the two was found, if any), in a set of strings stored in a file testFile (one string per line, ~150 characters each), allowing at most 4 mismatches.
d = {"kw1": ["AGCTCGATGTATGGGTATATGATCTTGAC", "GTCAAGATCATATACCCATACATCGAGCT"], "kw2": ["GGTCAGGTCAGTACGGTACGATCGATTTCGA", "TCGAAATCGATCGTACCGTACTGACCTGACC"]} #simplified to just two keywords
How I do it
For this, I first compile my regexes and store them in a dictionary compd. I then read the file line by line and search for each keyword in each line (string). I cannot stop the search once a keyword has been found, as multiple keywords may occur in one string/line, but I can skip the second element of a keyword's list if the first is found.
Here is how I am doing it:
#!/usr/bin/env python3
import argparse
import regex

parser = argparse.ArgumentParser()
parser.add_argument('file', help='file with strings')
args = parser.parse_args()

#dictionary with keywords
d = {"kw1": ["AGCTCGATGTATGGGTATATGATCTTGAC", "GTCAAGATCATATACCCATACATCGAGCT"],
     "kw2": ["GGTCAGGTCAGTACGGTACGATCGATTTCGA", "TCGAAATCGATCGTACCGTACTGACCTGACC"]}

#Compile regex (4 mismatches max)
compd = {"kw1": [], "kw2": []}  #to store regex
for k, v in d.items():  #for each keyword
    compd[k].append(regex.compile(r'(?b)(' + v[0] + '){s<=4}'))  #compile 1st elt of list
    compd[k].append(regex.compile(r'(?b)(' + v[1] + '){s<=4}'))  #compile second

#Search keywords
with open(args.file) as f:  #open file with strings
    line = f.readline()  #first line/string
    while line:  #go through each line
        for k, v in compd.items():  #for each keyword (ID, regex)
            for val in [v[0], v[1]]:  #for each elt of list
                found = val.search(line)  #regex search
                if found is not None:  #if match
                    print("Keyword " + k + " found as " + found[0])  #print match
                    if val == v[0]:  #if 1st elt of list
                        break  #don't search 2nd
        line = f.readline()  #next line
I have tested the script using the testFile:
AGCTCGATGTATGGGTATATGATCTTGACAGAGAGA
GTCGTAGCTCGTATTCGATGGCTATTCGCTATATGCTAGCTAT
and get the following expected result:
Keyword kw1 found as AGCTCGATGTATGGGTATATGATCTTGAC
Efficiency
With the current script, it takes about 3-4 minutes to process 500k strings with six keywords. There will be cases with 2 million strings, which should take 12-16 minutes, and I would like to reduce this if possible.
Having a separate regex for each keyword requires running a match against each regex separately. Instead, combine all the regexes into one using the keywords as names for named groups:
patterns = []
for k, v in d.items():  #for each keyword
    patterns.append(f'(?P<{k}>{v[0]}|{v[1]})')
pattern = '(?b)(?:' + '|'.join(patterns) + '){s<=4}'
reSeqs = regex.compile(pattern)
With this, the program can check for which named group was matched in order to get the keyword. You can replace the loop over all the regexes in compd with loops over matches in a line (in case there is more than 1 match) and dictionary items in each match (which could be implemented as a comprehension):
for matched in reSeqs.finditer(line):
    try:
        keyword = [kw for kw, val in matched.groupdict().items() if val][0]
        # perform further processing of keyword
    except IndexError:  # no named group matched
        pass
(Note that you don't need to call readline on a file object to loop over lines; instead, you can loop over the file object directly: for line in f:.)
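Putting the pieces together, a minimal end-to-end version might look like the sketch below (not benchmarked; testFile is the two-line test file from the question):
import regex

d = {"kw1": ["AGCTCGATGTATGGGTATATGATCTTGAC", "GTCAAGATCATATACCCATACATCGAGCT"],
     "kw2": ["GGTCAGGTCAGTACGGTACGATCGATTTCGA", "TCGAAATCGATCGTACCGTACTGACCTGACC"]}

patterns = [f'(?P<{k}>{v[0]}|{v[1]})' for k, v in d.items()]
reSeqs = regex.compile('(?b)(?:' + '|'.join(patterns) + '){s<=4}')

with open('testFile') as f:
    for line in f:
        for matched in reSeqs.finditer(line):
            # exactly one named group is non-None per match
            keyword = next(kw for kw, val in matched.groupdict().items() if val)
            print("Keyword " + keyword + " found as " + matched[0])
On the test file above this should print the same "Keyword kw1 found as ..." line as the original script.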
If you need further optimizations, have memory to burn and can sacrifice a little readability, also test whether replacing the loop over lines with a comprehension over matches is more performant:
with open(args.file) as f:
    contents = f.read()  # slurp entire file
matches = [
    {val: kw for kw, val in found.groupdict().items() if val}
    for found in reSeqs.finditer(contents)
]
This solution doesn't distinguish between repetitions of a given sequence; in particular, repetitions on a single line are lost. You could merge entries having the same keys into lists, or, if repetitions should be treated as a single instance, you can merge the dictionaries as-is. If you want to distinguish separate instances of a matched sequence, include file position information in the keys:
matches = [
    {(val, found.span()): kw for kw, val in found.groupdict().items() if val}
    for found in reSeqs.finditer(contents)
]
To merge:
results = {}
for match in matches:
    results.update(match)
# or:
results = {k: v for d in matches for k, v in d.items()}
If memory is an issue, another option would be to break up the file into chunks ending on line breaks (either line-based, or by reading blocks and separating partial lines at block ends) and use finditer on each chunk:
# implementation of `chunks` left as exercise
def file_chunks(path, size=2**12):
    with open(path) as file:
        yield from chunks(file, size=size)

results = {}
for block in file_chunks(args.file, size=2**20):
    for found in reSeqs.finditer(block):
        results.update({
            (val, found.span()): kw for kw, val in found.groupdict().items() if val
        })
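For completeness, one possible chunks implementation (a sketch; it reads fixed-size blocks and carries any partial trailing line over to the next block, so a sequence is never split across a chunk boundary as long as each string sits on one line):
def chunks(file, size=2**12):
    """Yield blocks of roughly `size` characters, each ending on a newline."""
    leftover = ''
    while True:
        block = file.read(size)
        if not block:            # EOF
            if leftover:
                yield leftover   # file did not end with a newline
            return
        block = leftover + block
        cut = block.rfind('\n') + 1
        if cut:                  # cut just after the last complete line
            yield block[:cut]
            leftover = block[cut:]
        else:                    # no newline yet: keep accumulating
            leftover = block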

Consolidate similar patterns into single consensus pattern

In the previous post I did not state the question properly, so I would like to start a new topic here.
I have the following items:
a sorted list of 59,000 protein patterns (ranging from 3 characters, "FFK", to 152 characters long);
some long protein sequences, aka my reference.
I am going to match these patterns against my reference and find the locations of the matches. (My friend helped write a script for that.)
import sys
import re
from itertools import chain, izip

# Read input
with open(sys.argv[1], 'r') as f:
    sequences = f.read().splitlines()
with open(sys.argv[2], 'r') as g:
    patterns = g.read().splitlines()

# Write output
with open(sys.argv[3], 'w') as outputFile:
    data_iter = iter(sequences)
    order = ['antibody name', 'epitope sequence', 'start', 'end', 'length']
    header = '\t'.join([k for k in order])
    outputFile.write(header + '\n')
    for seq_name, seq in izip(data_iter, data_iter):
        locations = [[{'antibody name': seq_name, 'epitope sequence': pattern,
                       'start': match.start() + 1, 'end': match.end(),
                       'length': len(pattern)}
                      for match in re.finditer(pattern, seq)]
                     for pattern in patterns]
        for loc in chain.from_iterable(locations):
            output = '\t'.join([str(loc[k]) for k in order])
            outputFile.write(output + '\n')
# the with blocks close all three files, so no explicit close() calls are needed
Problem is, within these 59,000 patterns (once sorted) I found that parts of some patterns match parts of others, and I would like to consolidate these into one big "consensus" pattern and keep just the consensus (see examples below):
TLYLQMNSLRAED
TLYLQMNSLRAEDT
YLQMNSLRAED
YLQMNSLRAEDT
YLQMNSLRAEDTA
YLQMNSLRAEDTAV
will yield
TLYLQMNSLRAEDTAV
another example:
APRLLIYGASS
APRLLIYGASSR
APRLLIYGASSRA
APRLLIYGASSRAT
APRLLIYGASSRATG
APRLLIYGASSRATGIP
APRLLIYGASSRATGIPD
GQAPRLLIY
KPGQAPRLLIYGASSR
KPGQAPRLLIYGASSRAT
KPGQAPRLLIYGASSRATG
KPGQAPRLLIYGASSRATGIPD
LLIYGASSRATG
LLIYGASSRATGIPD
QAPRLLIYGASSR
will yield
KPGQAPRLLIYGASSRATGIPD
PS: I am aligning them here so it's easier to visualize. The 59,000 patterns are initially unsorted, so the consensus is hard to see in the actual file.
In my particular problem, I am not just picking the longest pattern; instead, I need to take every pattern into account to find the consensus. I hope I have explained my specific problem clearly enough.
Thanks!
Here's my solution with randomized input order to improve confidence of the test.
import re
import random

data_values = """TLYLQMNSLRAED
TLYLQMNSLRAEDT
YLQMNSLRAED
YLQMNSLRAEDT
YLQMNSLRAEDTA
YLQMNSLRAEDTAV
APRLLIYGASS
APRLLIYGASSR
APRLLIYGASSRA
APRLLIYGASSRAT
APRLLIYGASSRATG
APRLLIYGASSRATGIP
APRLLIYGASSRATGIPD
GQAPRLLIY
KPGQAPRLLIYGASSR
KPGQAPRLLIYGASSRAT
KPGQAPRLLIYGASSRATG
KPGQAPRLLIYGASSRATGIPD
LLIYGASSRATG
LLIYGASSRATGIPD
QAPRLLIYGASSR"""

test_li1 = data_values.split()
#print(test_li1)
test_li2 = ["abcdefghi", "defghijklmn", "hijklmnopq", "mnopqrst", "pqrstuvwxyz"]

def aggregate_str(data_li):
    copy_data_li = data_li[:]
    while len(copy_data_li) > 0:
        remove_li = []
        len_remove_li = len(remove_li)
        longest_str = max(copy_data_li, key=len)
        copy_data_li.remove(longest_str)
        remove_li.append(longest_str)
        while len_remove_li != len(remove_li):
            len_remove_li = len(remove_li)
            for value in copy_data_li[:]:  # iterate over a copy: items are removed inside the loop
                value_pattern = "".join([x + "?" for x in value])
                longest_match = max(re.findall(value_pattern, longest_str), key=len)
                if longest_match in value:
                    longest_str_index = longest_str.index(longest_match)
                    value_index = value.index(longest_match)
                    if value_index > longest_str_index and longest_str_index > 0:
                        longest_str = value[:value_index] + longest_str
                        copy_data_li.remove(value)
                        remove_li.append(value)
                    elif value_index < longest_str_index and longest_str_index + len(longest_match) == len(longest_str):
                        longest_str += value[len(longest_str) - longest_str_index:]
                        copy_data_li.remove(value)
                        remove_li.append(value)
                    elif value in longest_str:
                        copy_data_li.remove(value)
                        remove_li.append(value)
        print(longest_str)
        print(remove_li)

random.shuffle(test_li1)
random.shuffle(test_li2)
aggregate_str(test_li1)
#aggregate_str(test_li2)
Output from print().
KPGQAPRLLIYGASSRATGIPD
['KPGQAPRLLIYGASSRATGIPD', 'APRLLIYGASS', 'KPGQAPRLLIYGASSR', 'APRLLIYGASSRAT', 'APRLLIYGASSR', 'APRLLIYGASSRA', 'GQAPRLLIY', 'APRLLIYGASSRATGIPD', 'APRLLIYGASSRATG', 'QAPRLLIYGASSR', 'LLIYGASSRATG', 'KPGQAPRLLIYGASSRATG', 'KPGQAPRLLIYGASSRAT', 'LLIYGASSRATGIPD', 'APRLLIYGASSRATGIP']
TLYLQMNSLRAEDTAV
['YLQMNSLRAEDTAV', 'TLYLQMNSLRAED', 'TLYLQMNSLRAEDT', 'YLQMNSLRAED', 'YLQMNSLRAEDTA', 'YLQMNSLRAEDT']
Edit1 - brief explanation of the code.
1.) Find longest string in list
2.) Loop through all remaining strings and find longest possible match.
3.) Make sure that the match is not a false positive. Based on the way I've written this code, it should avoid pairing single overlaps on terminal ends.
4.) Append the match to the longest string if necessary.
5.) When nothing else can be added to the longest string, repeat the process (1-4) for the next longest string remaining.
Edit2 - Corrected unwanted behavior when treating data like ["abcdefghijklmn", "ghijklmZopqrstuv"]
def main():
    #patterns = ["TLYLQMNSLRAED","TLYLQMNSLRAEDT","YLQMNSLRAED","YLQMNSLRAEDT","YLQMNSLRAEDTA","YLQMNSLRAEDTAV"]
    patterns = ["APRLLIYGASS", "APRLLIYGASSR", "APRLLIYGASSRA", "APRLLIYGASSRAT",
                "APRLLIYGASSRATG", "APRLLIYGASSRATGIP", "APRLLIYGASSRATGIPD",
                "GQAPRLLIY", "KPGQAPRLLIYGASSR", "KPGQAPRLLIYGASSRAT",
                "KPGQAPRLLIYGASSRATG", "KPGQAPRLLIYGASSRATGIPD", "LLIYGASSRATG",
                "LLIYGASSRATGIPD", "QAPRLLIYGASSR"]
    test = find_core(patterns)
    test = find_pre_and_post(test, patterns)
    #final = "YLQMNSLRAED"
    final = "KPGQAPRLLIYGASSRATGIPD"
    if test == final:
        print("worked:" + test)
    else:
        print("fail:" + test)

def find_pre_and_post(core, patterns):
    pre = ""
    post = ""
    for pattern in patterns:
        start_index = pattern.find(core)
        if len(pattern[0:start_index]) > len(pre):
            pre = pattern[0:start_index]
        if len(pattern[start_index + len(core):len(pattern)]) > len(post):
            post = pattern[start_index + len(core):len(pattern)]
    return pre + core + post

def find_core(patterns):
    test = ""
    for i in range(len(patterns)):
        for j in range(2, len(patterns[i])):
            patterncount = 0
            for pattern in patterns:
                if patterns[i][0:j] in pattern:
                    patterncount += 1
            if patterncount == len(patterns):
                test = patterns[i][0:j]
    return test

main()
So what I do first is find the main core in the find_core function, starting with a substring of length two (one character is not sufficient information) from each string in turn. I then check whether that substring is in ALL the strings, as the definition of a "core".
I then find the index of the core in each string to get the pre and post substrings around it. I keep track of their lengths and update them whenever one is longer than the current best. I didn't have time to explore edge cases, so here is my first shot.
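For comparison, the same consolidation can also be expressed as a repeated greedy overlap merge. This is an untested sketch; min_overlap=3 is an arbitrary guard against merging on a one- or two-character coincidence (mirroring the false-positive check in the first answer), and with real data it may still over- or under-merge:
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def consolidate(patterns, min_overlap=3):
    """Greedily merge patterns that contain or overlap one another."""
    merged = sorted(set(patterns), key=len, reverse=True)
    changed = True
    while changed:
        changed = False
        out = []
        for p in merged:
            for i, q in enumerate(out):
                if p in q:                    # p is already covered by q
                    break
                n1, n2 = overlap(q, p), overlap(p, q)
                if n1 >= min_overlap:         # q's tail matches p's head
                    out[i] = q + p[n1:]
                    break
                if n2 >= min_overlap:         # p's tail matches q's head
                    out[i] = p + q[n2:]
                    break
            else:
                out.append(p)                 # p stands alone for now
                continue
            changed = True
        merged = out
    return merged

print(consolidate(["TLYLQMNSLRAED", "TLYLQMNSLRAEDT", "YLQMNSLRAED",
                   "YLQMNSLRAEDT", "YLQMNSLRAEDTA", "YLQMNSLRAEDTAV"]))
# ['TLYLQMNSLRAEDTAV']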

Matching in Python lists when there are extra characters

I am trying to write Python code to match items from two lists.
One tab-delimited file looks like this:
COPB2
KLMND7
BLCA8
while the other file2 has a long list of similar-looking "names", if you will. There should be some identical matches in the file, which I have succeeded in identifying and writing out to a new file. The problem is when there are additional characters at the end of one of the "names". For example, COPB2 from above should match COPB2A in file2, but it does not. Similarly, KLMND7 should match KLMND79. Should I use regular expressions? Make them into strings? Any ideas are helpful, thank you!
What I have worked on so far, after the first response seen below:
with open(in_file1, "r") as names:
    for line in names:
        file1_list = [i.strip() for i in line.split()]
        file1_str = str(file1_list)

with open(in_file2, "r") as symbols:
    for line in symbols:
        items = line.split("\t")
        items = str(items)
        matches = items.startswith(file1_str)
        print matches
This code returns False when I know there should be some matches.
str.startswith(): no need for regex if it's only trailing characters.
>>> g = "COPB2A"
>>> f = "COPB2"
>>> g.startswith(f)
True
Here is a working piece of code:
file1_list = []
with open(in_file1, "r") as names:
    for line in names:
        line_items = line.split()
        for item in line_items:
            file1_list.append(item)

matches = []
with open(in_file2, "r") as symbols:
    for line in symbols:
        file2_items = line.split()
        for file2_item in file2_items:
            for file1_item in file1_list:
                if file2_item.startswith(file1_item):
                    matches.append(file2_item)
                    print file2_item
print matches
It may be quite slow for large files. If it's unacceptable, I could try to think about how to optimize it.
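One cheap optimization to try first: str.startswith also accepts a tuple of prefixes, which pushes the inner loop into C. A sketch of the matching half (note that, unlike the loop above, it no longer tells you which prefix matched):
prefixes = tuple(file1_list)
matches = []
with open(in_file2, "r") as symbols:
    for line in symbols:
        for item in line.split():
            if item.startswith(prefixes):
                matches.append(item)
print matches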
You might take a look at difflib if you need a more generic solution. Keep in mind it is a big import with lots of overhead so only use it if you really need to. Here is another question that is somewhat similar.
https://stackoverflow.com/questions/1209800/difference-between-two-strings-in-python-php
Assuming you loaded the files into lists X, Y.
## match if a or b is equal to or a substring of the other, in a case-sensitive way
def Match(a, b):
    # str.find() returns -1 when the substring is absent
    return a.find(b[0:min(len(a), len(b)) - 1]) != -1

common_words = {}
for a in X:
    common_words[a] = []
    for b in Y:
        if Match(a, b):
            common_words[a].append(b)
If you want to use regular expressions to do the matching, you want the beginning-of-string anchor "^".
import re

def MatchRe(a, b):
    # make sure the longer string is in 'a'
    if len(a) < len(b):
        a, b = b, a
    exp = "^" + b
    q = re.match(exp, a)
    if not q:
        return False  # no match
    return True  # access q.group(0) for matches

Help parsing text file in python

I've really been struggling with this one for some time. I have many text files with a specific format, from which I need to extract all the data and file it into different fields of a database. The struggle is tweaking the parsing parameters to make sure I get all the info correctly.
The format is shown below:
WHITESPACE HERE of unknown length.
K PA DETAILS
2 4565434 i need this sentence as one DB record
2 4456788 and this one
5 4879870 as well as this one, content will vary!
X Max - there sometimes is a line beginning with 'Max' here which i don't need
There is a Line here that i do not need!
WHITESPACE HERE of unknown length.
The tough parts were 1) getting rid of whitespace, and 2) separating the fields from each other. See my best attempt below:
dict = {}
XX = (open("XX.txt", "r")).readlines()
for line in XX:
    if line.isspace():
        pass
    elif line.startswith('There is'):
        pass
    elif line.startswith('Max', 2):
        pass
    elif line.startswith('K'):
        pass
    else:
        for word in line.split():
            if word.startswith('4'):
                tmp_PA = word
            elif word == "1" or word == "2" or word == "3" or word == "4" or word == "5":
                tmp_K = word
            else:
                tmp_DETAILS = word
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   (tmp_PA, tmp_K, tmp_DETAILS))
At the minute, I can pull the K & PA fields no problem using this; however, my DETAILS is only pulling one word, and I need the entire sentence, or at least 25 chars of it.
Thanks very much for reading and I hope you can help! :)
You are splitting the whole line into words. You need to split it into the first word, the second word, and the rest, like line.split(None, 2).
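For example, with one of the sample lines:
>>> "2 4565434 i need this sentence as one DB record".split(None, 2)
['2', '4565434', 'i need this sentence as one DB record']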
I would probably use regular expressions, and use the opposite logic: if the line starts with a number 1 through 5, use it; otherwise pass. Like:
import re

pattern = re.compile(r'([12345])\s+(\d+)\s+(.*\S)')
f = open('XX.txt', 'r')  # no readlines(); lazy iteration is better
for line in f:
    m = pattern.match(line)
    if m:
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   (m.group(2), m.group(1), m.group(3)))
Oh, and of course, you should be using prepared statements. Parsing SQL is orders of magnitude slower than executing it.
If I understand your file format correctly, you can try this script:
filename = 'bug.txt'
f = file(filename, 'r')
foundHeaders = False
records = []
for rawline in f:
    line = rawline.strip()
    if not foundHeaders:
        tokens = line.split()
        if tokens == ['K', 'PA', 'DETAILS']:
            foundHeaders = True
            continue
    else:
        tokens = line.split(None, 2)
        if len(tokens) != 3:
            break
        try:
            K = int(tokens[0])
            PA = int(tokens[1])
        except ValueError:
            break
        records.append((K, PA, tokens[2]))
f.close()
for r in records:
    print r  # replace this by your DB insertion code
This will start reading the records when it encounters the header line, and stop as soon as the format of the line is no longer (K,PA,description).
Hope this helps.
Here is my attempt using re:
import re

stuff = open("source", "r").readlines()
whitey = re.compile(r"^[\s]+$")
header = re.compile(r"K PA DETAILS")
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")
for line in stuff:
    if whitey.match(line):
        pass
    elif header.match(line):
        pass
    elif juicy_info.match(line):
        result = juicy_info.search(line)
        print result.group('third')
        print result.group('second')
        print result.group('first')
Using re I can pull the data out and manipulate it on a whim. If you only need the juicy info lines, you can actually take out all the other checks, making this a REALLY concise script.
import re

stuff = open("source", "r").readlines()
#create a regular expression using subpatterns.
#'first', 'second' and 'third' are our own tags,
# we could call them Adam, Betty, etc.
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")
for line in stuff:
    result = juicy_info.search(line)
    if result:  # do stuff with the data here, using the tags we declared earlier
        print result.group('third')
        print result.group('second')
        print result.group('first')
import re

reg = re.compile(r'K[ \t]+PA[ \t]+DETAILS[ \t]*\r?\n'
                 + 3 * r'([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*\r?\n')
with open('XX.txt') as f:
    mat = reg.search(f.read())
for tripl in ((2, 1, 3), (5, 4, 6), (8, 7, 9)):
    cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
               mat.group(*tripl))
I prefer to use [ \t] instead of \s because \s matches the following characters:
blank, '\f', '\n', '\r', '\t', '\v'
and I don't see any reason to use a symbol representing more than what is to be matched, at the risk of matching stray newlines in places where they shouldn't be.
Edit
It may be sufficient to do:
import re

reg = re.compile(r'^([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*$', re.MULTILINE)
with open('XX.txt') as f:
    for mat in reg.finditer(f.read()):
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   mat.group(2, 1, 3))
