CS50 PSET6 DNA no match using regex to count STR - python

I have been stuck at this point for quite a while and hope to get some tips.
The problem can be simplified to finding the longest run of consecutive occurrences of a pattern in a string. For the pattern AATG and the string ATAATGAATGAATGGAATG, the right result is 3. I tried to count the occurrences of the pattern using re.compile(). From the docs, I found that to match consecutive occurrences of a pattern I probably have to use the special character +. For instance, for the pattern AATG I have to use re.compile(r'(AATG)+') instead of re.compile(r'AATG'); otherwise the occurrences are overcounted. However, in this program the pattern is not a fixed string, so I have to treat it as a variable. I have tried many ways to put it into re.compile() without success. Could anyone show me the correct way to format it (in the function countSTR below)?
After that, I think finditer(sequence) should return an iterator over all matches found. I then use match.end() - match.start() to get the length of each match and compare the lengths to find the longest consecutive run of the pattern. Maybe something goes wrong there?
Code attached. Any input would be appreciated!
from sys import argv, exit
import csv
import re

def main():
    if len(argv) != 3:
        print("Usage: python dna.py data.csv sequence.txt")
        exit(1)
    # read DNA sequence
    with open(argv[2], "r") as file:
        if file.mode != 'r':
            print(f"database {argv[2]} can not be read")
            exit(1)
        sequence = file.read()
    # read database.csv
    with open(argv[1], newline='') as file:
        if file.mode != 'r':
            print(f"database {argv[1]} can not be read")
            exit(1)
        # get the heading of the csv file in order to obtain STRs
        csv_reader = csv.reader(file)
        headings = next(csv_reader)
        # dictionary to store STRs match result of DNA-sequence
        STR_counter = {}
        for STR in headings[1::]:
            # entry result accounting to the STR keys
            STR_counter[STR] = countSTR(STR, sequence)
    # read csv file as a dictionary
    with open(argv[1], newline='') as file:
        database = csv.DictReader(file)
        for row in database:
            count = 0
            for STR in STR_counter:
                # print("row in database ", row[STR], "STR in STR_counter", STR_counter[STR])
                if int(row[STR]) == int(STR_counter[STR]):
                    count += 1
            if count == len(STR_counter):
                print(row['name'])
                exit(0)
        else:
            print("No match")

# find non-overlapping occurrences of STR in DNA-sequence
def countSTR(STR, sequence):
    count = 0
    maxcount = 0
    # in order to match repeat STR. for example: "('AATG')+" as pattern
    # into re.compile() to match repeat STR
    # rewrite STR to "(STR)+"
    STR = "(" + STR + ")+"
    pattern = re.compile(r'STR')
    # matches should be a iterator object
    matches = pattern.finditer(sequence)
    # go through every repeat and find the longest one
    # by match.end() - match.start()
    for match in matches:
        count = match.end() - match.start()
        if count > maxcount:
            maxcount = count
    # return repeat times of the longest repeat
    return maxcount/len(STR)

main()

I just found a correct way to get the desired result.
I am posting it here in case others are also confused.
From what I understand, a variable named var_pattern can be matched with re.compile(rf'{var_pattern}'), whereas re.compile(r'STR') matches the literal text STR. To search for consecutive occurrences of var_pattern, use re.compile(rf'({var_pattern})+'). The length of the longest match must then be divided by the length of the original STR, not by the length of the rewritten "(STR)+" string. There may be smarter ways to implement this, but I managed to get it working.
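A minimal sketch of that idea, assuming the STR is a plain string of bases. The function name longest_run is hypothetical. re.escape is added as a precaution in case the unit ever contains regex metacharacters, and the lookahead is a variation that is not part of the answer above: it lets a run start at any position, so units that overlap with themselves are not undercounted.

```python
import re

def longest_run(sequence, unit):
    """Return the largest number of back-to-back copies of unit in sequence."""
    if not unit:
        return 0
    # (?=(...)) is a zero-width lookahead, so a candidate run is tried at
    # every position; group 1 holds the greedy run starting there.
    pattern = re.compile(rf'(?=((?:{re.escape(unit)})+))')
    longest = max((len(m.group(1)) for m in pattern.finditer(sequence)), default=0)
    # Divide by the length of the unit itself, not of the regex string.
    return longest // len(unit)

print(longest_run("ATAATGAATGAATGGAATG", "AATG"))
```

Integer division keeps the result an int, which avoids comparing a float against the counts read from the CSV file.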

Related

returns the location of the first item for the whole list instead of each item's location?

This code is supposed to read a text file of a genome and, given a pattern, return how many times the pattern occurs and the location of each occurrence.
Instead, it returns the number of occurrences and the location of the first occurrence only.
For example, in one run, instead of returning the locations of the 35 occurrences, it returned the first location 35 times.
# open the file with the original sequence
myfile = open('Vibrio_cholerae.txt')
# set the file to the variable Text to read and scan
Text = myfile.read()
# insert the pattern
Pattern = "TAATGGCT"
PatternLocations = []

def PatternCount(Text,Pattern):
    count = 0
    for i in range (len(Text)-len(Pattern)+1):
        if Text [i:i+len(Pattern)] == Pattern:
            count +=1
            PatternLocations.append(Text.index(Pattern))
    return count

# print the result of calling PatternCount on Text and Pattern.
print (f"Number of times the Pattern is repeated: {PatternCount(Text,Pattern)} time(s).")
print(f"List of Pattern locations: {PatternLocations}")
You did
    PatternLocations.append(Text.index(Pattern))
.index with a single argument does the following:
    Return the lowest index in S where substring sub is found
so it returns the first location every time. You should do
    PatternLocations.append(i)
because you already find the location yourself, without index, by using
    if Text [i:i+len(Pattern)] == Pattern:
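Put together, the corrected loop could look like the sketch below. The function name pattern_locations and the short test string are illustrative assumptions, not taken from the question; the function returns the list instead of appending to a global.

```python
def pattern_locations(text, pattern):
    """Return the start index of every (possibly overlapping) occurrence."""
    locations = []
    for i in range(len(text) - len(pattern) + 1):
        if text[i:i + len(pattern)] == pattern:
            # i is the position just tested, so record it directly
            locations.append(i)
    return locations

hits = pattern_locations("GATATATGCATATACTT", "ATAT")
print(len(hits), hits)
```

The count is simply the length of the returned list, so a separate counter is not needed.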
Instead of iterating through the text, I would suggest that you use re. Note that finditer reports non-overlapping matches only.
Here's a snippet:
from re import finditer
for match in finditer(pattern, Text):
    print(match.span(), match.group())
With a custom example I used (pattern='livraison') it returned something like this:
>>>(18, 27) livraison
>>>(80, 89) livraison
>>>(168, 177) livraison
>>>(290, 299) livraison

Why does my code give no output? (Probable logic issue)

The following is a problem from CS50 pset6. The goal is to search a .txt file that represents a DNA strand and check how many times each specific string (STR) occurs consecutively. If the consecutive counts for every STR match those of an individual in the CSV file, print that individual's name.
I am still a newbie with Python. I wrote what I intended in the comments, and I hope it is all correct. Any help is appreciated. (At the moment the code does not give any output.)
python dna.py databases/small.csv sequences/2.txt (command-line arguments: the 1st is the CSV file and the 2nd is the .txt file)
CSV files :
name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
name,AGATC,TTTTTTCT,AATG,TCTAG,GATA,TATC,GAAA,TCTG
Albus,15,49,38,5,14,44,14,12
Cedric,31,21,41,28,30,9,36,44
Draco,9,13,8,26,15,25,41,39
Fred,37,40,10,6,5,10,28,8
Ginny,37,47,10,23,5,48,28,23
Hagrid,25,38,45,49,39,18,42,30
Harry,46,49,48,29,15,5,28,40
Hermione,43,31,18,25,26,47,31,36
James,46,41,38,29,15,5,48,22
Kingsley,7,11,18,33,39,31,23,14
Lavender,22,33,43,12,26,18,47,41
Lily,42,47,48,18,35,46,48,50
Lucius,9,13,33,26,45,11,36,39
Luna,18,23,35,13,11,19,14,24
Minerva,17,49,18,7,6,18,17,30
Neville,14,44,28,27,19,7,25,20
Petunia,29,29,40,31,45,20,40,35
Remus,6,18,5,42,39,28,44,22
Ron,37,47,13,25,17,6,13,35
Severus,29,27,32,41,6,27,8,34
Sirius,31,11,28,26,35,19,33,6
Vernon,26,45,34,50,44,30,32,28
Zacharias,29,50,18,23,38,24,22,9
.txt file example :
TGGTTTAGGGCCTATAATTGCAGGACCACTGGCCCTTGTCGAGGTGTACAGGTAGGGAGCTAAGTTCGAAACGCCCCTTGGTCGGGATTACCGCCATTCTAGTAGTCTAACCCCGAACGCGCTCAGGCTTTGAGTTCGCGCAGCATTAAGAAGTCCATGCCGGCACCGAATGTCCCGACGACAGGCAACCAGCACGGATACCCGCCTTGAAGGCGCAATCAGTAGGTCGAGTTACAGAGGCTCCCCCCGAGCTTGTGCTTCCATTGAGTAGGGGCTATAGATATGTAGCACTCAGGTTTAGTAGCGCCCTTTTAACAGCGAGAGCCCGCCTGGTCAGAACCGAACGGCTGATACGCGAGCTGATGGCTAGAGGATGAACACGGTCCTTCTCTTCGCTTCGATCCGGGGTAGTTTTGTAGCGAAGGATAACGCTCTGTGGATTCTCCGAGAATAATCATCAGTACGGTGTGCGTACCCTCTCTTTGATCCACGCCTGGGGCTGGACATAGTCAGGCGCATTTCATCTACTTAACCCCGGTAAGGGCCACGGGCGCGACATCTCCTTACCAGGGTTGTCTTATGCTCGCTTTTCCCAGATGATAAGCATCTCGTTGTAATGAACAGGTACCTAAGAAAACTGAGTTTCGACGACCCGTCGGCTCGTGTTCTTATCTATTGATCTAACCGAGGTGAAGCTCGCCAAAAATTTCGTAATGTAAGAGAGAAATTGAAGGGGTGAATTTTGCACTCTCGTGCATACGTCTTGCTACAATAGCAAGAGCTGTATGCGTGCGACCACTTCACTACCTCTATAGCATGCGATCTCGCAGCCCGGATTGTCGCCTCCTTTGGGCCGCAAAAACGGTATACGGACACACTGCATCTGTGAGCACGCACCAACTCGATGCGCGTAATAGGCATCTGCTCCACCCAGGGGGCAAGGGCACCTGAGGACGATCTGTTCCGAACAGTATATTTGAGCCAATGTTCTATTAGTGAAGGCAAGGTGAGGGCAACTCACACTGGTACCTCGAGGAGTAATCCGATCCATCTAAACTGGCATGCCCGTTCGAGCTCGGGGGGTTCTTGAATTTAGCTGCGTGTACCTCAGGTCTCCTGAAGGTGACACCAAAGTACAAGATAGATAGATAGATAGATAGGTAATATCTGACGTGAATTGCTTACCGCTAGTGAGCGATTTAGGTCACGTCCTCTGAAGATGTACCGTGATCATCGATGAAATGGCTCTGCAGAGCGTTGCTCTTAACTTAGTGGGAGTGTTCCTTTGCAGTCTGATGAAGTCGCTGCATGTTAACTTTGCACTAAGGGCGTTTCCGAACCCTATAGTCATCCTTATTGATTCGCCCTGTCTTACCCAGGATACACTACCGTTCGAGGCTCTTAACGTACCAACGCATGCAGTCAGAAGATGATTCACCATCCAAGCAAATCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTGGCTTTAAAGCACCGAATGTAGTTGGCCATCGGTCTCGGTCACCAATAAGACGGCCTCGTGGGTCACTCGGTCGATGATCTAGGGTCGGGTGCATAGTGTTTCAGGTCGGCGCTCAGGGTTCTTGTCAGGGAAATCTACGGGTGAGTTGGAAAGCGCCGCCAGCGAGATGCCTGTAGGCGATTAGTGTAGAGAGAGCAACATCGGAAAATTGTCCGTGGGGCGCTACGTAAGTGTTCCCAGTATTCTCGTCCAGAGTAAGTCATGCATACCAGTATCAGGCGTCTGTGTGTTACGTTGCAGTGTATCCCGGTAGCGGGAAGCGTATAGAGCGTAACAGACCTGTCCTACAGCACGCAGGATGTCGACCCTTTCTCAGGCACGATACTTCGTGTAACAGCAGTTCCGGTGTCATCTGTAACTGTTCTGTGTTCCATAGTGAAGATATCTATCTATCTA
TCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCGGGCGACCCTGGCAATCAGGCGCCTGCTTATATGACAATTTGTCGCAATACGGGTCGCAGATGTATTGTCCGCATGAGTTTACGGTATCCGGAACTGTCACCGCCGATCCAATATGATCGCAAATCGGGGTGACACAATGGACCGCCGTAGATAAACCCTGCGATGCTGCAATAAGGATATGATATCGCGCGCGGGCCGTAAAACCGATCTTGGAAGGCGGGAAGTCTCCGGGAAAAACTCTCTGATAAAGCCTATTACAAGAAGAGCTCGAAGGCAAGATGGGCATGCCCCGTCGACCACACGGGCAAGCTCTGAGAATCGATGTGGTCGCTTAACCAACCCATACGGAGTGAACGAGACCACGCGGGCGGTTCTTGGTACGCATGATTCCTATTGGTTCTGCCGGGCGTGTGCAGGATTGTTCACTCCCCACCCTGTCGCTCACGAACGCGCTGGTTGCTTAAACCGACCGGAAATTCTGTAGCCGCCCCGTAAGTTTAACGCTTTGAAATACTCCACATGTGCGTACCGGGTCTGATCGCTTACGTGGCGCCACTATGTTAGGAGCTCATAGATATCGATGAATCAAATGTCTTTCATCGCTCCTTAAACAACCTGACGTATTCGCAAAATTGCGCGTATTGAGAAGGGAAAGTTAAAGGAACGATAACAATGAGTCTGCTTTCACCGGCTGCATAACGGGATCGCGCGCTATGGGATTTCCTAACTATAATTCGTGTCGATACTCAGACGCGTTGTACAGGTAAGAAGTCGGCGGGACAGTATTTGAGAAGGGGCTCTGCGGCACCAACGCCGAGCTGTATCAGGGGGGTTAATGTGTAGCGGGCATATAACACAATACAGCCCGCGGCGCGTCGTGGTTACCGTAGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAACACCGGAGCTAGGATCCACGACTACCAGTGGGAAACTGTGAATTGTGCATGGTAATTAAAGGATGACTGGTCAACACCGGTCTCCACGGGCGTTAAACAACCTCGCTCCAGTCAATCTCTAGCGGTGGTTGTGGCAGCTTATTCCTGGAGGTAATACTCTTCCGGGCCCACTAAAAATGTAACGAAGTCGAGGTTGGGTCAGGGGATTGAGTGGGGGCGACTCACTGATTCCACCAGGAATTGTCGTCAATCGCGACGTACTTTGAGCCTTGTATCTTGGCGTTTCTTGTTGGTACGCGGCCGTGTTCGTGAATCACGACGTCGTTCATGATTCATCCGTCCAAGCCTAGACCTAGCGTAAAAACGGTGTCGATCTGTGCTCCAACCGATGGATGGTTTTTACACAAGTGAACTTCGAGGCTGTGGGACAAACAGCACAACTTGTTCACTGCTGACCGTGGTACTAAACCACGCTTGCTTTCAGCCCTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTT
TTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTCAGATGCGGATCAAGGGTTACTCGAGCCGTTCTGAGGTCCTAAAATTTTAGCCCTTGGTGTTAGCTTCGGTTTAAGAACGTAGGTGCGACGCGGGGGTCCTAGAGCTCCGCGATCTGCACTCCCCACCTGGCACCAAAACGAATCCTGCATAACGGCTCTCTGTGCATGGGGGATGGTCGCAACAACGAGCATAGCTGGCATCACTTCGTTTGCTGTGGATTGCTGTTTTATACAGAATACGGTGGTGATCATCAAAGGAAGCATAATCCACATCGGGCACCCCGGGCCATCGTGCGTTCCCTTATAGCCGGCTTGCATGTTGGGGGAGGAGTAAGGCCGGTAACGTCTCGCAGCACTGTCGCGTAACACAGGTACATCTTTATTTCCGGTGCTGTAGAAGTGGTTTTTCGAAGGCGTAACCCAGAACGACTGATATAATAGTCCACTATTCCCTGGTTTAAGACTTCTACAAAGTTTTACGCAAAGTTACATGCACACTCGGCGACGTAAATATTAGCCTTGCTAAATTGCCACGGATATTAATCCCGAGCCAACCTGTTCCCACTAGCGGTCTACGGTCATAGTCCTTTGTGTAGAGCGTCATTGCGGTTGGGGCCCGTCCGCGGAGGTTCCCCTTATGATCTAACCGCGGTGCAGGTTGACTGAATGCCATACACTATAGAGAAGACGTCTAAGTAGAAACGTTCTTTAAAAATCTTGAACTGACGGCCGAGTATTATCAAGAGAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCATGCACAATAAGGTTAGAAGCAGCAAGCATGTATTCTTTGCATAGAGGCGGTAAAGCCGCCTTGCATACCCAGCAGCAGCCGCGAAGGCCTTACTCCAGAGGACAGAACTTCTCACACAGCGTCCGCATACACCGCGGACGTGACAAGGTTAGATAGCTCTAGTTTGCGGCAACCCTCGCATCAGGCCGACTCACCCGCGCTTGCTACCCGGAGGATGGGTCAAGGGATAAACATAGCACGTTAGTTAAGCCTAACGTCAGTTTTTAGAGTTTACATGCACGACTAAGTGCATCGAAATACACGCCGTTGACAGACCAACAGCGTGTCAACTGGGCCTTGAGAATTGTATCATAATAGCCAAATACGAGGCCAAGTAGTCCGACGAGAGGCACGTAGAGACCACTTTCCCTAAACGATCTGTCGCATTACCCTTTGACTCGCACCCTATGCCTTATGTTCCAAGCAGCACCGAAGTTAGATTTAAGGGCGTATCTATCGGTACCTCGGTTGGGCCGGTCCACAGCTCCAGCTGAATTAGTGCTCACCCCGCTTCGAGGTTGAGTAAGGGTCACTTTTAAAAATATGCTTAAGGGTGATTCACATGACAGTAATCGAATAGTGAGATATAAGTAGGTGCGCCCCGCGCACACATCAAAACTGTGCAGACTGAAACTGAATGCTGGAGGCTGAGGAAAATGAAGATCAGAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGAGAAGAGTGAATGAATAGATCTGCCGCTGAATCCCCGCGTGAGGTTTTTGCGAC
Code:
from sys import argv
import csv
from itertools import groupby
# first argument is the csv, second is the txt
# read from the CSV file: the first thing in a row is the name, then the number of strs
# read from the dna seq and read it into memory
# find how many times each str occurs consecutively
# if the number of strs equals a person's, print the person
checkstr = [] #global array that tells us what str to read

def readtxt(csvfile,seq):
    with open(f'{csvfile}','r') as p:#finding which str to read from header line of the csv
        header = csv.reader(p) # Header contains which strs to look for
        for row in header:
            checkstr = row[1:]
            break
    with open(f'{seq}','r') as f:#searching the text for strs
        s = f.read()
        for c in checkstr:
            groups = groupby(s.split(c))
            try:
                return [sum(1 for _ in group)+1 for label, group in groups if label==''][0]
            except IndexError:
                return 0

def readcsv(n):
    with open(f'{n}','r') as f:
        readed = csv.DictReader(f)
        for row in readed:
            return row

def main():
    counter = 0
    if len(argv) != 3:
        print("Please start program with cmd arguments.")
    readtxt(argv[1], argv[2])#for filling the checkstr
    for i in range(0,len(checkstr)): #Do this as many times as the number of special strings
        for j in checkstr: #For each special string in the list
            if readtxt(argv[1], argv[2]) == readcsv(argv[1])[f'{checkstr[j]}']: #If the dictionary value returned for that specific str matches the specific str
                counter += 1
        if counter == len(checkstr): # if all specific strs match, then we found our person!
            print(readcsv(argv[1])['name'])
#readtxt(argv[1], argv[2])
#readcsv(argv[1])
main()
I think the answer is quite simple: you forgot an else statement.
After you check the number of arguments, the rest of main() must go in the else branch.
def main():
    counter = 0
    if len(argv) != 3:
        print("Please use: program <csvfile> <textfile>") # give usage and exit
    else:
        readtxt(argv[1], argv[2])#for filling the checkstr
        for i in range(0,len(checkstr)): #Do this as many times as the number of special strings
            for j in checkstr: #For each special string in the list
                if readtxt(argv[1], argv[2]) == readcsv(argv[1])[f'{checkstr[j]}']: #If the dictionary value returned for that specific str matches the specific str
                    counter += 1
            if counter == len(checkstr): # if all specific strs match, then we found our person!
                print(readcsv(argv[1])['name'])
#readtxt(argv[1], argv[2])
#readcsv(argv[1])
main()
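Judging only from the code as posted, another likely cause of the silence is that checkstr is assigned inside readtxt without a global statement, which creates a local variable; the module-level list then stays empty and the loops in main() never run. The sketch below sidesteps global state by passing values around. It is one possible arrangement, not the asker's method: the names longest_run and find_match are hypothetical, and the CSV is taken as a string so the example is self-contained.

```python
import csv
import io

def longest_run(sequence, unit):
    """Largest number of back-to-back copies of unit found in sequence."""
    if not unit:
        return 0
    best = 0
    for start in range(len(sequence)):
        count = 0
        # Keep stepping one unit at a time while the next copy is present.
        while sequence.startswith(unit, start + count * len(unit)):
            count += 1
        best = max(best, count)
    return best

def find_match(csv_text, sequence):
    """Return the name whose STR counts all equal the sequence's runs, else None."""
    reader = csv.DictReader(io.StringIO(csv_text))
    units = reader.fieldnames[1:]          # every column after "name"
    profile = {unit: longest_run(sequence, unit) for unit in units}
    for row in reader:
        # CSV values are strings, so convert before comparing
        if all(int(row[unit]) == profile[unit] for unit in units):
            return row["name"]
    return None

small_csv = "name,AGATC,AATG,TATC\nAlice,2,8,3\nBob,4,1,5\nCharlie,3,2,5\n"
sample = "AGATC" * 4 + "C" + "AATG" + "G" + "TATC" * 5
print(find_match(small_csv, sample) or "No match")
```

With real files, the same functions could be fed the contents read from argv[1] and argv[2].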

Consolidate similar patterns into single consensus pattern

In my previous post I did not state the question clearly, so I would like to start a new topic here.
I have the following items:
a sorted list of 59,000 protein patterns (ranging from 3 characters, e.g. "FFK", to 152 characters long);
some long protein sequences, i.e. my reference.
I am going to match these patterns against my reference and find the locations where matches occur. (My friend helped write a script for that.)
import sys
import re
from itertools import chain
# Read input
with open(sys.argv[1], 'r') as f:
    sequences = f.read().splitlines()
with open(sys.argv[2], 'r') as g:
    patterns = g.read().splitlines()
# Write output (the with blocks close all three files automatically)
with open(sys.argv[3], 'w') as outputFile:
    data_iter = iter(sequences)
    order = ['antibody name', 'epitope sequence', 'start', 'end', 'length']
    header = '\t'.join(order)
    outputFile.write(header + '\n')
    # zipping the iterator with itself pairs each name line with the sequence line after it
    for seq_name, seq in zip(data_iter, data_iter):
        locations = [[{'antibody name': seq_name, 'epitope sequence': pattern, 'start': match.start()+1, 'end': match.end(), 'length': len(pattern)} for match in re.finditer(pattern, seq)] for pattern in patterns]
        for loc in chain.from_iterable(locations):
            output = '\t'.join([str(loc[k]) for k in order])
            outputFile.write(output + '\n')
The problem is that, after sorting these 59,000 patterns, I found that parts of some patterns match parts of other patterns. I would like to consolidate these into one big "consensus" pattern and keep only the consensus (see the examples below):
TLYLQMNSLRAED
TLYLQMNSLRAEDT
YLQMNSLRAED
YLQMNSLRAEDT
YLQMNSLRAEDTA
YLQMNSLRAEDTAV
will yield
TLYLQMNSLRAEDTAV
another example:
APRLLIYGASS
APRLLIYGASSR
APRLLIYGASSRA
APRLLIYGASSRAT
APRLLIYGASSRATG
APRLLIYGASSRATGIP
APRLLIYGASSRATGIPD
GQAPRLLIY
KPGQAPRLLIYGASSR
KPGQAPRLLIYGASSRAT
KPGQAPRLLIYGASSRATG
KPGQAPRLLIYGASSRATGIPD
LLIYGASSRATG
LLIYGASSRATGIPD
QAPRLLIYGASSR
will yield
KPGQAPRLLIYGASSRATGIPD
PS : I am aligning them here so it's easier to visualize. The 59,000 patterns initially are not sorted so it's hard to see the consensus in the actual file.
In my particular problem, I am not picking the longest patterns, instead, I need to take each pattern into account to find the consensus. I hope I have explained clearly enough for my specific problem.
Thanks!
Here's my solution with randomized input order to improve confidence of the test.
import re
import random
data_values = """TLYLQMNSLRAED
TLYLQMNSLRAEDT
YLQMNSLRAED
YLQMNSLRAEDT
YLQMNSLRAEDTA
YLQMNSLRAEDTAV
APRLLIYGASS
APRLLIYGASSR
APRLLIYGASSRA
APRLLIYGASSRAT
APRLLIYGASSRATG
APRLLIYGASSRATGIP
APRLLIYGASSRATGIPD
GQAPRLLIY
KPGQAPRLLIYGASSR
KPGQAPRLLIYGASSRAT
KPGQAPRLLIYGASSRATG
KPGQAPRLLIYGASSRATGIPD
LLIYGASSRATG
LLIYGASSRATGIPD
QAPRLLIYGASSR"""
test_li1 = data_values.split()
#print(test_li1)
test_li2 = ["abcdefghi", "defghijklmn", "hijklmnopq", "mnopqrst", "pqrstuvwxyz"]
def aggregate_str(data_li):
    copy_data_li = data_li[:]
    while len(copy_data_li) > 0:
        remove_li = []
        len_remove_li = len(remove_li)
        longest_str = max(copy_data_li, key=len)
        copy_data_li.remove(longest_str)
        remove_li.append(longest_str)
        while len_remove_li != len(remove_li):
            len_remove_li = len(remove_li)
            for value in copy_data_li:
                value_pattern = "".join([x+"?" for x in value])
                longest_match = max(re.findall(value_pattern, longest_str), key=len)
                if longest_match in value:
                    longest_str_index = longest_str.index(longest_match)
                    value_index = value.index(longest_match)
                    if value_index > longest_str_index and longest_str_index > 0:
                        longest_str = value[:value_index] + longest_str
                        copy_data_li.remove(value)
                        remove_li.append(value)
                    elif value_index < longest_str_index and longest_str_index + len(longest_match) == len(longest_str):
                        longest_str += value[len(longest_str)-longest_str_index:]
                        copy_data_li.remove(value)
                        remove_li.append(value)
                    elif value in longest_str:
                        copy_data_li.remove(value)
                        remove_li.append(value)
        print(longest_str)
        print(remove_li)
random.shuffle(test_li1)
random.shuffle(test_li2)
aggregate_str(test_li1)
#aggregate_str(test_li2)
Output from print().
KPGQAPRLLIYGASSRATGIPD
['KPGQAPRLLIYGASSRATGIPD', 'APRLLIYGASS', 'KPGQAPRLLIYGASSR', 'APRLLIYGASSRAT', 'APRLLIYGASSR', 'APRLLIYGASSRA', 'GQAPRLLIY', 'APRLLIYGASSRATGIPD', 'APRLLIYGASSRATG', 'QAPRLLIYGASSR', 'LLIYGASSRATG', 'KPGQAPRLLIYGASSRATG', 'KPGQAPRLLIYGASSRAT', 'LLIYGASSRATGIPD', 'APRLLIYGASSRATGIP']
TLYLQMNSLRAEDTAV
['YLQMNSLRAEDTAV', 'TLYLQMNSLRAED', 'TLYLQMNSLRAEDT', 'YLQMNSLRAED', 'YLQMNSLRAEDTA', 'YLQMNSLRAEDT']
Edit1 - brief explanation of the code.
1.) Find longest string in list
2.) Loop through all remaining strings and find longest possible match.
3.) Make sure that the match is not a false positive. Based on the way I've written this code, it should avoid pairing single overlaps on terminal ends.
4.) Append the match to the longest string if necessary.
5.) When nothing else can be added to the longest string, repeat the process (1-4) for the next longest string remaining.
Edit2 - Corrected unwanted behavior when treating data like ["abcdefghijklmn", "ghijklmZopqrstuv"]
def main():
    #patterns = ["TLYLQMNSLRAED","TLYLQMNSLRAEDT","YLQMNSLRAED","YLQMNSLRAEDT","YLQMNSLRAEDTA","YLQMNSLRAEDTAV"]
    patterns = ["APRLLIYGASS","APRLLIYGASSR","APRLLIYGASSRA","APRLLIYGASSRAT","APRLLIYGASSRATG","APRLLIYGASSRATGIP","APRLLIYGASSRATGIPD","GQAPRLLIY","KPGQAPRLLIYGASSR","KPGQAPRLLIYGASSRAT","KPGQAPRLLIYGASSRATG","KPGQAPRLLIYGASSRATGIPD","LLIYGASSRATG","LLIYGASSRATGIPD","QAPRLLIYGASSR"]
    test = find_core(patterns)
    test = find_pre_and_post(test, patterns)
    #final = "YLQMNSLRAED"
    final = "KPGQAPRLLIYGASSRATGIPD"
    if test == final:
        print("worked:" + test)
    else:
        print("fail:"+ test)

def find_pre_and_post(core, patterns):
    pre = ""
    post = ""
    for pattern in patterns:
        start_index = pattern.find(core)
        if len(pattern[0:start_index]) > len(pre):
            pre = pattern[0:start_index]
        if len(pattern[start_index+len(core):len(pattern)]) > len(post):
            post = pattern[start_index+len(core):len(pattern)]
    return pre+core+post

def find_core(patterns):
    test = ""
    for i in range(len(patterns)):
        for j in range(2,len(patterns[i])):
            patterncount = 0
            for pattern in patterns:
                if patterns[i][0:j] in pattern:
                    patterncount += 1
            if patterncount == len(patterns):
                test = patterns[i][0:j]
    return test

main()
First, the find_core function finds the main core. For each string it tries prefixes starting at length two (one character is not sufficient information) and checks whether that substring occurs in ALL the strings, which is my definition of a "core".
I then find the index of the core in each string, which gives the pre and post substrings around the core. I keep track of their lengths and keep the longest pre and the longest post. I didn't have time to explore edge cases, so this is my first shot.
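A different, simpler technique than either answer above is a greedy suffix/prefix merge: first drop every pattern that is contained in another, then repeatedly glue together two patterns when the end of one equals the start of the other. The sketch below is only an illustration under stated assumptions. The names drop_contained, overlap and consolidate are hypothetical, and min_overlap=3 is an arbitrary threshold chosen here; short overlaps can chain unrelated patterns, so the threshold would need tuning on the real 59,000-pattern file, where the pairwise loops may also be slow.

```python
def drop_contained(patterns):
    """Keep only patterns that are not substrings of another kept pattern."""
    kept = []
    for p in sorted(set(patterns), key=len, reverse=True):
        if not any(p in k for k in kept):
            kept.append(p)
    return kept

def overlap(a, b, min_len):
    """Length of the longest suffix of a that is a prefix of b, or 0 if below min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def consolidate(patterns, min_overlap=3):
    """Greedily merge overlapping patterns until no pair overlaps any more."""
    pool = drop_contained(patterns)
    merged = True
    while merged:
        merged = False
        for i, a in enumerate(pool):
            for j, b in enumerate(pool):
                if i == j:
                    continue
                n = overlap(a, b, min_overlap)
                if n:
                    # Replace the pair by the merged string, then re-filter.
                    rest = [p for k, p in enumerate(pool) if k not in (i, j)]
                    pool = drop_contained(rest + [a + b[n:]])
                    merged = True
                    break
            if merged:
                break
    return sorted(pool)

group = ["TLYLQMNSLRAED", "TLYLQMNSLRAEDT", "YLQMNSLRAED",
         "YLQMNSLRAEDT", "YLQMNSLRAEDTA", "YLQMNSLRAEDTAV"]
print(consolidate(group))
```

Each merge shrinks the pool by at least one entry, so the loop always terminates.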

python how to increment vars in regex replacements

I want to replace multiple patterns in a file using regex.
This is my (working) code so far:
import re
with open('test.txt', "r") as fp:
    text = fp.read()
result = re.sub(r'pattern', 'replacement', str)
result2 = re.sub(r'anotherpattern', 'anotherreplacement2', result)
...
with open('results.txt', 'w') as fp:
    fp.write(result_x)
This works, but it seems inelegant to increment the variable names manually on every new line. How can I do this better? I think it must be a for loop, but how?
You do not need the previous result once you have used it. You can store the new result in the same variable:
text = re.sub(r'pattern1', 'replacement1', text) # pass text, not str: str is the built-in string type!
text = re.sub(r'pattern2', 'replacement2', text)
You can also keep a list of patterns and replacements and loop through it:
to_replace = [('pattern1', 'replacement1'), ('pattern2', 'replacement2')]
for pattern,replacement in to_replace:
    text = re.sub(pattern, replacement, text)
Or, unpacking each pair directly into the call:
to_replace = [('pattern1', 'replacement1'), ('pattern2', 'replacement2')]
for pr in to_replace:
    text = re.sub(*pr, string=text)
I don't know Python too well, but I think that if you want to combine the patterns, you could do it in a single pass using a callback.
Example:
def repl(m):
    # sr1, sr2, sr3 stand for the replacement strings
    # a group that did not take part in the match is None, not ''
    if m.group(1) is not None:
        return sr1
    if m.group(2) is not None:
        return sr2
    if m.group(3) is not None:
        return sr3
    return m.group(0)
print(re.sub('(stuff1)|(stuff2)|(stuff3)', repl, text))
This could also be looped inside the callback: a variable holds the fixed number of patterns, and the loop tests each group of the match object.
There must be a replacement array of the same size as (and in the same order as) the groups in the regex.
How much of a performance increase will this give you?
Doing this in a single pass, the text is scanned once instead of once per pattern, so the saving grows with the number of patterns.
Note that it is almost an error to re-examine the same text over and over again. Imagine searching the Library of Congress one word at a time, starting from the beginning each time. How long would that take?
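A minimal sketch of the single-pass idea, under the assumption that the things to replace are literal strings rather than regex patterns (that is what allows a plain dictionary lookup in the callback). The function name replace_all is hypothetical.

```python
import re

def replace_all(text, replacements):
    """Replace every key of replacements with its value in one scan of text."""
    if not replacements:
        return text
    # Longest keys first, so "foobar" wins over "foo" at the same position.
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # m.group(0) is the literal that matched, which is also the dict key.
    return pattern.sub(lambda m: replacements[m.group(0)], text)

print(replace_all("cat chases dog", {"cat": "dog", "dog": "cat"}))
```

Unlike chained re.sub calls, a replacement produced here is never rescanned, so swapping two words works; with chained calls the second substitution would also rewrite the output of the first.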

Help parsing text file in python

I have been struggling with this one for some time now. I have many text files in a specific format, from which I need to extract all the data and file it into different fields of a database. The difficulty is tweaking the parsing so that I get all the information correctly.
The format is shown below:
WHITESPACE HERE of unknown length.
K PA DETAILS
2 4565434 i need this sentace as one DB record
2 4456788 and this one
5 4879870 as well as this one, content will vary!
X Max - there sometimes is a line beginning with 'Max' here which i don't need
There is a Line here that i do not need!
WHITESPACE HERE of unknown length.
The tough parts were 1) getting rid of the whitespace, and 2) separating the fields from each other. See my best attempt below:
dict = {}
XX = (open("XX.txt", "r")).readlines()
for line in XX:
    if line.isspace():
        pass
    elif line.startswith('There is'):
        pass
    elif line.startswith('Max', 2):
        pass
    elif line.startswith('K'):
        pass
    else:
        for word in line.split():
            if word.startswith('4'):
                tmp_PA = word
            elif word == "1" or word == "2" or word == "3" or word == "4" or word == "5":
                tmp_K = word
            else:
                tmp_DETAILS = word
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',(tmp_PA,tmp_K,tmp_DETAILS))
At the moment I can pull the K and PA fields with no problem using this. However, my DETAILS field only gets one word; I need the entire sentence, or at least 25 characters of it.
Thanks very much for reading and I hope you can help! :)
K
You are splitting the whole line into words. You need to split it into the first word, the second word, and the rest, as in line.split(None, 2).
I would probably use regular expressions, with the opposite logic: if the line starts with a number from 1 through 5, use it; otherwise skip it. Like:
pattern = re.compile(r'([12345])\s+(\d+)\s+(.*\S)')
f = open('XX.txt', 'r') # No calling readlines; lazy iteration is better
for line in f:
    m = pattern.match(line)
    if m:
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   (m.group(2), m.group(1), m.group(3)))
Oh, and of course, you should be using a prepared statement. Parsing SQL is orders of magnitude slower than executing it.
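The approach can be tried end to end without any files, as in the sketch below. It assumes the cursor in the question comes from sqlite3 (the ? placeholders suggest it, but the question does not say), uses an in-memory database, and allows leading whitespace before the K digit. The names SAMPLE, RECORD, parse_records and store are illustrative only.

```python
import re
import sqlite3

SAMPLE = """

K PA DETAILS
2 4565434 i need this sentence as one DB record
2 4456788 and this one
5 4879870 as well as this one, content will vary!
X Max - there sometimes is a line beginning with 'Max' here
There is a Line here that i do not need!

"""

# K is a single digit 1-5, PA is a run of digits, DETAILS is the rest
# of the line up to its last non-blank character.
RECORD = re.compile(r'\s*([1-5])\s+(\d+)\s+(.*\S)')

def parse_records(text):
    """Return (pa, k, details) tuples for every line that looks like a record."""
    records = []
    for line in text.splitlines():
        m = RECORD.match(line)
        if m:
            records.append((m.group(2), m.group(1), m.group(3)))
    return records

def store(records):
    """Insert the records into a throwaway in-memory table and return the connection."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE bugInfo2 (pa TEXT, k TEXT, details TEXT)")
    con.executemany("INSERT INTO bugInfo2 (pa, k, details) VALUES (?,?,?)", records)
    con.commit()
    return con

con = store(parse_records(SAMPLE))
for row in con.execute("SELECT pa, k, details FROM bugInfo2"):
    print(row)
```

executemany sends all rows through one parameterized statement, which is one way to get the prepared-statement behaviour mentioned above.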
If I understand your file format correctly, you can try this script:
filename = 'bug.txt'
f = open(filename,'r')
foundHeaders = False
records = []
for rawline in f:
    line = rawline.strip()
    if not foundHeaders:
        tokens = line.split()
        if tokens == ['K','PA','DETAILS']:
            foundHeaders = True
        continue
    else:
        tokens = line.split(None,2)
        if len(tokens) != 3:
            break
        try:
            K = int(tokens[0])
            PA = int(tokens[1])
        except ValueError:
            break
        records.append((K,PA,tokens[2]))
f.close()
for r in records:
    print(r) # replace this by your DB insertion code
This starts reading the records when it encounters the header line and stops as soon as a line no longer has the format (K, PA, description).
Hope this helps.
Here is my attempt using re:
import re
stuff = open("source", "r").readlines()
whitey = re.compile(r"^[\s]+$")
header = re.compile(r"K PA DETAILS")
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")
for line in stuff:
    if whitey.match(line):
        pass
    elif header.match(line):
        pass
    elif juicy_info.match(line):
        result = juicy_info.search(line)
        print(result.group('third'))
        print(result.group('second'))
        print(result.group('first'))
Using re I can pull the data out and manipulate it at will. If you only need the juicy info lines, you can take out all the other checks, which makes this a really concise script:
import re
stuff = open("source", "r").readlines()
# create a regular expression using named subpatterns.
# 'first', 'second' and 'third' are our own tags;
# we could call them Adam, Betty, etc.
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")
for line in stuff:
    result = juicy_info.search(line)
    if result: # do stuff with the data here, using the tags declared earlier
        print(result.group('third'))
        print(result.group('second'))
        print(result.group('first'))
import re
reg = re.compile(r'K[ \t]+PA[ \t]+DETAILS[ \t]*\r?\n'
                 + 3*r'([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*\r?\n')
with open('XX.txt') as f:
    mat = reg.search(f.read())
for tripl in ((2,1,3),(5,4,6),(8,7,9)):
    cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
               mat.group(*tripl))
I prefer to use [ \t] instead of \s because \s matches all of the following characters:
blank, '\f', '\n', '\r', '\t', '\v'
and I don't see any reason to use a symbol that represents more than what is to be matched, with the risk of matching stray newlines in places where they shouldn't be.
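A small illustration of that risk, on a made-up two-line string where the first line has K and PA but no details. The variable names loose and strict are hypothetical; both patterns use re.MULTILINE so that ^ and $ anchor at line boundaries.

```python
import re

text = "2 4565434\nnext line"

# \s+ is allowed to swallow the newline, so DETAILS is taken from the next line.
loose = re.search(r'^(\d)\s+(\d+)\s+(.+)$', text, re.MULTILINE)

# [ \t]+ cannot cross a line break, so the incomplete record is rejected.
strict = re.search(r'^(\d)[ \t]+(\d+)[ \t]+(.+)$', text, re.MULTILINE)

print(loose.group(3) if loose else None)
print(strict)
```

The loose pattern reports a record whose details actually belong to another line, while the strict one finds nothing.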
Edit
It may be sufficient to do:
import re
reg = re.compile(r'^([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*$',re.MULTILINE)
with open('XX.txt') as f:
    for mat in reg.finditer(f.read()):
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   mat.group(2,1,3))
