Why does my code give no output? (Probable logic issue) - Python

The following problem is from CS50 pset6. The goal is to search a .txt file that represents a DNA strand and count how many times each specific string (STR) occurs consecutively. If the consecutive counts for every STR match an individual in the CSV file, print that individual's name.
I am still a newbie with Python; the comments in the code describe what I meant each line to do. Any help is appreciated. (At the moment the code does not give any output.)
python dna.py databases/small.csv sequences/2.txt (command-line arguments: the first is the CSV file and the second is the .txt file)
CSV files :
name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
name,AGATC,TTTTTTCT,AATG,TCTAG,GATA,TATC,GAAA,TCTG
Albus,15,49,38,5,14,44,14,12
Cedric,31,21,41,28,30,9,36,44
Draco,9,13,8,26,15,25,41,39
Fred,37,40,10,6,5,10,28,8
Ginny,37,47,10,23,5,48,28,23
Hagrid,25,38,45,49,39,18,42,30
Harry,46,49,48,29,15,5,28,40
Hermione,43,31,18,25,26,47,31,36
James,46,41,38,29,15,5,48,22
Kingsley,7,11,18,33,39,31,23,14
Lavender,22,33,43,12,26,18,47,41
Lily,42,47,48,18,35,46,48,50
Lucius,9,13,33,26,45,11,36,39
Luna,18,23,35,13,11,19,14,24
Minerva,17,49,18,7,6,18,17,30
Neville,14,44,28,27,19,7,25,20
Petunia,29,29,40,31,45,20,40,35
Remus,6,18,5,42,39,28,44,22
Ron,37,47,13,25,17,6,13,35
Severus,29,27,32,41,6,27,8,34
Sirius,31,11,28,26,35,19,33,6
Vernon,26,45,34,50,44,30,32,28
Zacharias,29,50,18,23,38,24,22,9
.txt file example :
TGGTTTAGGGCCTATAATTGCAGGACCACTGGCCCTTGTCGAGGTGTACAGGTAGGGAGCTAAGTTCGAAACGCCCCTTGGTCGGGATTACCGCCATTCTAGTAGTCTAACCCCGAACGCGCTCAGGCTTTGAGTTCGCGCAGCATTAAGAAGTCCATGCCGGCACCGAATGTCCCGACGACAGGCAACCAGCACGGATACCCGCCTTGAAGGCGCAATCAGTAGGTCGAGTTACAGAGGCTCCCCCCGAGCTTGTGCTTCCATTGAGTAGGGGCTATAGATATGTAGCACTCAGGTTTAGTAGCGCCCTTTTAACAGCGAGAGCCCGCCTGGTCAGAACCGAACGGCTGATACGCGAGCTGATGGCTAGAGGATGAACACGGTCCTTCTCTTCGCTTCGATCCGGGGTAGTTTTGTAGCGAAGGATAACGCTCTGTGGATTCTCCGAGAATAATCATCAGTACGGTGTGCGTACCCTCTCTTTGATCCACGCCTGGGGCTGGACATAGTCAGGCGCATTTCATCTACTTAACCCCGGTAAGGGCCACGGGCGCGACATCTCCTTACCAGGGTTGTCTTATGCTCGCTTTTCCCAGATGATAAGCATCTCGTTGTAATGAACAGGTACCTAAGAAAACTGAGTTTCGACGACCCGTCGGCTCGTGTTCTTATCTATTGATCTAACCGAGGTGAAGCTCGCCAAAAATTTCGTAATGTAAGAGAGAAATTGAAGGGGTGAATTTTGCACTCTCGTGCATACGTCTTGCTACAATAGCAAGAGCTGTATGCGTGCGACCACTTCACTACCTCTATAGCATGCGATCTCGCAGCCCGGATTGTCGCCTCCTTTGGGCCGCAAAAACGGTATACGGACACACTGCATCTGTGAGCACGCACCAACTCGATGCGCGTAATAGGCATCTGCTCCACCCAGGGGGCAAGGGCACCTGAGGACGATCTGTTCCGAACAGTATATTTGAGCCAATGTTCTATTAGTGAAGGCAAGGTGAGGGCAACTCACACTGGTACCTCGAGGAGTAATCCGATCCATCTAAACTGGCATGCCCGTTCGAGCTCGGGGGGTTCTTGAATTTAGCTGCGTGTACCTCAGGTCTCCTGAAGGTGACACCAAAGTACAAGATAGATAGATAGATAGATAGGTAATATCTGACGTGAATTGCTTACCGCTAGTGAGCGATTTAGGTCACGTCCTCTGAAGATGTACCGTGATCATCGATGAAATGGCTCTGCAGAGCGTTGCTCTTAACTTAGTGGGAGTGTTCCTTTGCAGTCTGATGAAGTCGCTGCATGTTAACTTTGCACTAAGGGCGTTTCCGAACCCTATAGTCATCCTTATTGATTCGCCCTGTCTTACCCAGGATACACTACCGTTCGAGGCTCTTAACGTACCAACGCATGCAGTCAGAAGATGATTCACCATCCAAGCAAATCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTGGCTTTAAAGCACCGAATGTAGTTGGCCATCGGTCTCGGTCACCAATAAGACGGCCTCGTGGGTCACTCGGTCGATGATCTAGGGTCGGGTGCATAGTGTTTCAGGTCGGCGCTCAGGGTTCTTGTCAGGGAAATCTACGGGTGAGTTGGAAAGCGCCGCCAGCGAGATGCCTGTAGGCGATTAGTGTAGAGAGAGCAACATCGGAAAATTGTCCGTGGGGCGCTACGTAAGTGTTCCCAGTATTCTCGTCCAGAGTAAGTCATGCATACCAGTATCAGGCGTCTGTGTGTTACGTTGCAGTGTATCCCGGTAGCGGGAAGCGTATAGAGCGTAACAGACCTGTCCTACAGCACGCAGGATGTCGACCCTTTCTCAGGCACGATACTTCGTGTAACAGCAGTTCCGGTGTCATCTGTAACTGTTCTGTGTTCCATAGTGAAGATATCTATCTATCTA
TCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCGGGCGACCCTGGCAATCAGGCGCCTGCTTATATGACAATTTGTCGCAATACGGGTCGCAGATGTATTGTCCGCATGAGTTTACGGTATCCGGAACTGTCACCGCCGATCCAATATGATCGCAAATCGGGGTGACACAATGGACCGCCGTAGATAAACCCTGCGATGCTGCAATAAGGATATGATATCGCGCGCGGGCCGTAAAACCGATCTTGGAAGGCGGGAAGTCTCCGGGAAAAACTCTCTGATAAAGCCTATTACAAGAAGAGCTCGAAGGCAAGATGGGCATGCCCCGTCGACCACACGGGCAAGCTCTGAGAATCGATGTGGTCGCTTAACCAACCCATACGGAGTGAACGAGACCACGCGGGCGGTTCTTGGTACGCATGATTCCTATTGGTTCTGCCGGGCGTGTGCAGGATTGTTCACTCCCCACCCTGTCGCTCACGAACGCGCTGGTTGCTTAAACCGACCGGAAATTCTGTAGCCGCCCCGTAAGTTTAACGCTTTGAAATACTCCACATGTGCGTACCGGGTCTGATCGCTTACGTGGCGCCACTATGTTAGGAGCTCATAGATATCGATGAATCAAATGTCTTTCATCGCTCCTTAAACAACCTGACGTATTCGCAAAATTGCGCGTATTGAGAAGGGAAAGTTAAAGGAACGATAACAATGAGTCTGCTTTCACCGGCTGCATAACGGGATCGCGCGCTATGGGATTTCCTAACTATAATTCGTGTCGATACTCAGACGCGTTGTACAGGTAAGAAGTCGGCGGGACAGTATTTGAGAAGGGGCTCTGCGGCACCAACGCCGAGCTGTATCAGGGGGGTTAATGTGTAGCGGGCATATAACACAATACAGCCCGCGGCGCGTCGTGGTTACCGTAGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAACACCGGAGCTAGGATCCACGACTACCAGTGGGAAACTGTGAATTGTGCATGGTAATTAAAGGATGACTGGTCAACACCGGTCTCCACGGGCGTTAAACAACCTCGCTCCAGTCAATCTCTAGCGGTGGTTGTGGCAGCTTATTCCTGGAGGTAATACTCTTCCGGGCCCACTAAAAATGTAACGAAGTCGAGGTTGGGTCAGGGGATTGAGTGGGGGCGACTCACTGATTCCACCAGGAATTGTCGTCAATCGCGACGTACTTTGAGCCTTGTATCTTGGCGTTTCTTGTTGGTACGCGGCCGTGTTCGTGAATCACGACGTCGTTCATGATTCATCCGTCCAAGCCTAGACCTAGCGTAAAAACGGTGTCGATCTGTGCTCCAACCGATGGATGGTTTTTACACAAGTGAACTTCGAGGCTGTGGGACAAACAGCACAACTTGTTCACTGCTGACCGTGGTACTAAACCACGCTTGCTTTCAGCCCTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTT
TTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTCAGATGCGGATCAAGGGTTACTCGAGCCGTTCTGAGGTCCTAAAATTTTAGCCCTTGGTGTTAGCTTCGGTTTAAGAACGTAGGTGCGACGCGGGGGTCCTAGAGCTCCGCGATCTGCACTCCCCACCTGGCACCAAAACGAATCCTGCATAACGGCTCTCTGTGCATGGGGGATGGTCGCAACAACGAGCATAGCTGGCATCACTTCGTTTGCTGTGGATTGCTGTTTTATACAGAATACGGTGGTGATCATCAAAGGAAGCATAATCCACATCGGGCACCCCGGGCCATCGTGCGTTCCCTTATAGCCGGCTTGCATGTTGGGGGAGGAGTAAGGCCGGTAACGTCTCGCAGCACTGTCGCGTAACACAGGTACATCTTTATTTCCGGTGCTGTAGAAGTGGTTTTTCGAAGGCGTAACCCAGAACGACTGATATAATAGTCCACTATTCCCTGGTTTAAGACTTCTACAAAGTTTTACGCAAAGTTACATGCACACTCGGCGACGTAAATATTAGCCTTGCTAAATTGCCACGGATATTAATCCCGAGCCAACCTGTTCCCACTAGCGGTCTACGGTCATAGTCCTTTGTGTAGAGCGTCATTGCGGTTGGGGCCCGTCCGCGGAGGTTCCCCTTATGATCTAACCGCGGTGCAGGTTGACTGAATGCCATACACTATAGAGAAGACGTCTAAGTAGAAACGTTCTTTAAAAATCTTGAACTGACGGCCGAGTATTATCAAGAGAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCATGCACAATAAGGTTAGAAGCAGCAAGCATGTATTCTTTGCATAGAGGCGGTAAAGCCGCCTTGCATACCCAGCAGCAGCCGCGAAGGCCTTACTCCAGAGGACAGAACTTCTCACACAGCGTCCGCATACACCGCGGACGTGACAAGGTTAGATAGCTCTAGTTTGCGGCAACCCTCGCATCAGGCCGACTCACCCGCGCTTGCTACCCGGAGGATGGGTCAAGGGATAAACATAGCACGTTAGTTAAGCCTAACGTCAGTTTTTAGAGTTTACATGCACGACTAAGTGCATCGAAATACACGCCGTTGACAGACCAACAGCGTGTCAACTGGGCCTTGAGAATTGTATCATAATAGCCAAATACGAGGCCAAGTAGTCCGACGAGAGGCACGTAGAGACCACTTTCCCTAAACGATCTGTCGCATTACCCTTTGACTCGCACCCTATGCCTTATGTTCCAAGCAGCACCGAAGTTAGATTTAAGGGCGTATCTATCGGTACCTCGGTTGGGCCGGTCCACAGCTCCAGCTGAATTAGTGCTCACCCCGCTTCGAGGTTGAGTAAGGGTCACTTTTAAAAATATGCTTAAGGGTGATTCACATGACAGTAATCGAATAGTGAGATATAAGTAGGTGCGCCCCGCGCACACATCAAAACTGTGCAGACTGAAACTGAATGCTGGAGGCTGAGGAAAATGAAGATCAGAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGAGAAGAGTGAATGAATAGATCTGCCGCTGAATCCCCGCGTGAGGTTTTTGCGAC
Code:
from sys import argv
import csv
from itertools import groupby
#first csv cma 2nd txt
# fread from CSV file first thing in a row is name then the number of strs
# fread from dna seq and read it into a memory
#find how many times each str censequetivel
# if number of strs == with a persons print the person
checkstr = [] #global array that tells us what str to read
def readtxt(csvfile,seq):
    with open(f'{csvfile}','r') as p:#finding which str to read from header line of the csv
        header = csv.reader(p) # Header contains which strs to look for
        for row in header:
            checkstr = row[1:]
            break
    with open(f'{seq}','r') as f:#searching the text for strs
        s = f.read()
        for c in checkstr:
            groups = groupby(s.split(c))
            try:
                return [sum(1 for _ in group)+1 for label, group in groups if label==''][0]
            except IndexError:
                return 0
def readcsv(n):
    with open(f'{n}','r') as f:
        readed = csv.DictReader(f)
        for row in readed:
            return row
def main():
    counter = 0
    if len(argv) != 3:
        print("Please start program with cmd arguments.")
    readtxt(argv[1], argv[2])#for fulling the checkstr
    for i in range(0,len(checkstr)): #Do this as much as the number of special strings
        for j in checkstr: #For each special string in the list
            if readtxt(argv[1], argv[2]) == readcsv(argv[1])[f'{checkstr[j]}']: #If dictionary value that returns for that spesific str is matches to the spesific str
                counter += 1
    if counter == len(checkstr): # if all spesific strs matches, then we found our person!
        print(readcsv(argv[1])['name'])
#readtxt(argv[1], argv[2])
#readcsv(argv[1])
main()

I think the answer is quite simple: you forgot an else statement.
After you check the number of arguments, the rest of main() belongs in an else branch, so that it only runs when both file names were given.
def main():
    counter = 0
    if len(argv) != 3:
        print("Please use: program <csvfile> <textfile>") # give usage and exit
    else:
        readtxt(argv[1], argv[2])#for fulling the checkstr
        for i in range(0,len(checkstr)): #Do this as much as the number of special strings
            for j in checkstr: #For each special string in the list
                if readtxt(argv[1], argv[2]) == readcsv(argv[1])[f'{checkstr[j]}']: #If dictionary value that returns for that spesific str is matches to the spesific str
                    counter += 1
        if counter == len(checkstr): # if all spesific strs matches, then we found our person!
            print(readcsv(argv[1])['name'])
#readtxt(argv[1], argv[2])
#readcsv(argv[1])
main()
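Note that the else branch on its own may not explain the missing output. A likely cause, assuming the code is otherwise as posted, is that `checkstr = row[1:]` inside `readtxt` creates a new local variable, so the module-level `checkstr` stays empty and the loops in `main` never run. The sketch below illustrates this; `fill_local` and `read_header` are hypothetical names, not part of the original program.

```python
import csv
import io

checkstr = []  # module-level list, as in the question


def fill_local(header_row):
    # Assigning inside a function creates a NEW local name;
    # the module-level `checkstr` is left untouched.
    checkstr = header_row[1:]
    return checkstr


def read_header(csv_text):
    # Hypothetical helper: return the STR names instead of relying on a global.
    reader = csv.reader(io.StringIO(csv_text))
    return next(reader)[1:]


fill_local(["name", "AGATC", "AATG", "TATC"])
print(checkstr)  # the global list is still empty here

strs = read_header("name,AGATC,AATG,TATC\nAlice,2,8,3\n")
print(strs)
```

Returning the header from a function and passing it on as an argument avoids the global entirely; declaring `global checkstr` inside `readtxt` would be the other common fix.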

Related

Separate value that store consecutively in dictionary

I am currently working on the CS50 problem set https://cs50.harvard.edu/x/2021/psets/6/dna/
The problem asks us to find DNA sequences (STRs) that repeat consecutively in a .txt file and match the repeat counts against a person in a CSV file.
This is the code I have so far (not complete yet):
import re, csv, sys
def main(argv):
    # Open csv file
    csv_file = open(sys.argv[1], 'r')
    str_person = csv.reader(csv_file)
    nucleotide = next(str_person)[1:]
    # Open dna sequences file
    txt_file = open(sys.argv[2], 'r')
    dna_file = txt_file.read()
    str_repeat = {}
    str_list = find_STRrepeats(str_repeat, nucleotide, dna_file)
def find_STRrepeats(str_list, nucleotide, dna):
    for STR in nucleotide:
        groups = re.findall(rf'(?:{STR})+', dna)
        if len(groups) == 0:
            str_list[STR] = 0
        else:
            str_list[STR] = groups
    print(str_list)
if __name__ == "__main__":
    main(sys.argv[1:])
Output (from the print(str_list)):
{'AGATC': ['AGATCAGATCAGATCAGATC'], 'AATG': ['AATG'], 'TATC': ['TATCTATCTATCTATCTATC']}
But as you can see, each value in the dictionary is stored as one consecutive string. If I use len, as in str_list[STR] = len(groups), the result is 1 for each key in the dictionary. What I want is to find how many times each STR is repeated and store that count as the value in my dict.
So I want the repeats stored separately, kind of like this:
{'AGATC': ['AGATC', 'AGATC', 'AGATC', 'AGATC'], 'AATG': ['AATG'], 'TATC': ['TATC', 'TATC', 'TATC', 'TATC', 'TATC']}
What should I add to my code so that the repeats are separated by commas like that? Or is there some condition I can add to my regex, groups = re.findall(rf'(?:{STR})+', dna)?
I don't want to change the whole regex, because I think it is already useful for finding the longest run of a string that repeats consecutively. And I am proud that I got this far without help, because I am a beginner with Python. Please help. Thank you.
I would just store the highest number of repetitions:
...
        if len(groups) == 0:
            str_list[STR] = 0
        else:
            # each match is one run; its length divided by len(STR) is the repeat count
            str_list[STR] = max(len(i) // len(STR) for i in groups)
....
BTW, this also correctly handles the case where the sequence contains more than one run of the same STR.
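As a self-contained illustration of that idea, here is a minimal sketch. The helper name `longest_run` and the short test sequence are made up for the example, and `re.escape` is only a precaution in case an STR ever contains regex metacharacters.

```python
import re


def longest_run(STR, dna):
    # Each match is one maximal run of back-to-back copies of STR.
    runs = re.findall(rf'(?:{re.escape(STR)})+', dna)
    # Run length in characters divided by len(STR) gives the repeat count;
    # default=0 covers the case where STR does not occur at all.
    return max((len(run) // len(STR) for run in runs), default=0)


# Made-up sequence: AGATC x2, then TT, then AGATC x3, then AATG x1.
dna = "AGATCAGATCTTAGATCAGATCAGATCAATG"
counts = {STR: longest_run(STR, dna) for STR in ["AGATC", "AATG", "TATC"]}
print(counts)
```

Integer division (`//`) keeps the counts as ints, which makes the later comparison with `int(row[STR])` from the CSV straightforward.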

CS50 PSET6 DNA no match using regex to count STR

I have been stuck at this point for quite a while and hope to get some tips.
The problem can be simplified to finding the longest consecutive run of a pattern in a string. For the pattern AATG and a string like ATAATGAATGAATGGAATG, the right result is 3. I tried to count the occurrences of the pattern using re.compile(). From the docs I found that to match consecutive occurrences of a pattern I probably have to use the special character +. For instance, for the pattern AATG I have to use re.compile(r'(AATG)+') instead of re.compile(r'AATG'); otherwise the occurrences are overcounted. However, in this program the pattern is not a fixed string, so I have to treat it as a variable. I have tried many ways to put it into re.compile() without success. Could anyone show me the correct way to format it (in the function countSTR below)?
After that, I think finditer(the_string_to_be_analysed) should return an iterator over all matches found. I then use match.end() - match.start() to get the length of each match and compare them to find the longest consecutive run of the pattern. Maybe something goes wrong there?
Code attached. Any input would be appreciated!
from sys import argv, exit
import csv
import re
def main():
    if len(argv) != 3:
        print("Usage: python dna.py data.csv sequence.txt")
        exit(1)
    # read DNA sequence
    with open(argv[2], "r") as file:
        if file.mode != 'r':
            print(f"database {argv[2]} can not be read")
            exit(1)
        sequence = file.read()
    # read database.csv
    with open(argv[1], newline='') as file:
        if file.mode != 'r':
            print(f"database {argv[1]} can not be read")
            exit(1)
        # get the heading of the csv file in order to obtain STRs
        csv_reader = csv.reader(file)
        headings = next(csv_reader)
        # dictionary to store STRs match result of DNA-sequence
        STR_counter = {}
        for STR in headings[1::]:
            # entry result accounting to the STR keys
            STR_counter[STR] = countSTR(STR, sequence)
    # read csv file as a dictionary
    with open(argv[1], newline='') as file:
        database = csv.DictReader(file)
        for row in database:
            count = 0
            for STR in STR_counter:
                # print("row in database ", row[STR], "STR in STR_counter", STR_counter[STR])
                if int(row[STR]) == int(STR_counter[STR]):
                    count += 1
            if count == len(STR_counter):
                print(row['name'])
                exit(0)
            else:
                print("No match")
# find non-overlapping occurrences of STR in DNA-sequence
def countSTR(STR, sequence):
    count = 0
    maxcount = 0
    # in order to match repeat STR. for example: "('AATG')+" as pattern
    # into re.compile() to match repeat STR
    # rewrite STR to "(STR)+"
    STR = "(" + STR + ")+"
    pattern = re.compile(r'STR')
    # matches should be a iterator object
    matches = pattern.finditer(sequence)
    # go throgh every repeat and find the longest one
    # by match.end() - match.start()
    for match in matches:
        count = match.end() - match.start()
        if count > maxcount:
            maxcount = count
    # return repeat times of the longest repeat
    return maxcount/len(STR)
main()
I just found a correct way to get the desired result.
Posting it here in case anyone else is confused.
From what I understand, to match a pattern held in a variable named var_pattern you can use re.compile(rf'{var_pattern}'), an f-string that is also a raw string. To search for consecutive occurrences of var_pattern, use re.compile(rf'({var_pattern})+'); the braces are required, since without them the regex would look for the literal text var_pattern. There may be smarter ways to implement this, but with this change I got it working.
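For reference, one way the counting function could look with that fix applied is sketched below. It is not the poster's final code: the name `count_str` is hypothetical, and it also divides by the length of the original STR, because in the posted version `STR` had already been rewritten to `"(" + STR + ")+"` before `len(STR)` was taken.

```python
import re


def count_str(STR, sequence):
    # Interpolate the variable with an f-string; r'STR' would search for the
    # literal text "STR". re.escape is a precaution for special characters.
    pattern = re.compile(rf'(?:{re.escape(STR)})+')
    longest = 0
    for match in pattern.finditer(sequence):
        # Length in characters of this run of consecutive repeats.
        longest = max(longest, match.end() - match.start())
    # Divide by the length of the STR itself, not of the pattern text.
    return longest // len(STR)


print(count_str("AATG", "ATAATGAATGAATGGAATG"))  # 3
```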

Searching for how many times a word is occur consecutively in a string w/ Python (PSET6 CS50)

My goal is to read some strings (parts of DNA in this context) from a CSV file and then search a .txt file for how many times each of those strings occurs consecutively, but my current code creates an infinite loop (I wrote it that way because I could not come up with a proper condition for the while loop). Any help is appreciated, thanks.
My idea was: search for the target string; if it is found, search for it doubled; if that is found too, tripled; and keep incrementing the number until the repeated string is no longer in the text that was read.
#Header line of csv : name,AGATC,AATG,TATC
# so checkstr = [AGATC,AATG,TATC]
#Example of searched strings `GCTAAATTTGTTCAGCCAGATGTAGGCTTACAAATCAAGCTGTCCGCTCGGCACGGCCTACACACGTCGTGTAACTACAACAGCTAGTTAATCTGGATATCACCATGACCGAATCATAGATTTCGCCTTAAGGAGCTTTACCATGGCTTGGGATCCAATACTAAGGGCTCGACCTAGGCGAATGAGTTTCAGGTTGGCAATCAGCAACGCTCGCCATCCGGACGACGGCTTACAGTTAGTAGCATAGTACGCGATTTTCGGGAAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGTATCTATCTATCTATCTATCT`
For example, it should be able to find how many times AGATC occurs consecutively in that string and return the count or store it in memory.
checkstr = [] #global array that tells us what str to read
def readtxt(csvfile,seq):
    with open(f'{csvfile}','r') as p:#finding which str to read from header line of the csv
        header = csv.reader(p)
        for row in header:
            checkstr = row[1:]
            break
    with open(f'{seq}','r') as f:#searching the text for strs
        readed = f.read()
        for j in checkstr:
            n = 1
            jnew = n * j
            while True:
                if jnew in readed:
                    n += 1
                    print(f"{jnew} and {n}")
                    break
                else:
                    break
This operates on the idea that splitting a string by a substring will return an empty string on consecutive substrings. Such as:
s = 'abbcd'
s.split('b')
['a', '', 'cd']
In this case the number of consecutive b's in abbcd is the count of empty strings plus 1 (2 in this case).
Expanding on that, we can use itertools.groupby to group identical neighbouring pieces of the split string. If we count how many times '' occurs in the first such group and add one, we get the answer. The try/except statement handles the case where the split produces no empty strings, so the list of counts is empty.
from itertools import groupby
checkstr = ['AGATC', 'AATG', 'TATC']
s = 'GCTAAATTTGTTCAGCCAGATGTAGGCTTACAAATCAAGCTGTCCGCTCGGCACGGCCTACACACGTCGTGTAACTACAACAGCTAGTTAATCTGGATATCACCATGACCGAATCATAGATTTCGCCTTAAGGAGCTTTACCATGGCTTGGGATCCAATACTAAGGGCTCGACCTAGGCGAATGAGTTTCAGGTTGGCAATCAGCAACGCTCGCCATCCGGACGACGGCTTACAGTTAGTAGCATAGTACGCGATTTTCGGGAAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGTATCTATCTATCTATCTATCT'
for c in checkstr:
    groups = groupby(s.split(c))
    try:
        print(c,[sum(1 for _ in group)+1 for label, group in groups if label==''][0])
    except IndexError:
        print(c,0)
Output
AGATC 0
AATG 43
TATC 5
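A caveat on the split trick: it reads only the first group of empty strings, it reports 0 when the substring occurs exactly once (`'abcd'.split('b')` contains no `''`), and it overcounts when a run touches the start or end of the string (`'bbcd'.split('b')` gives `['', '', 'cd']`). A direct scan avoids these edge cases. The sketch below is one possible alternative, with the hypothetical name `longest_run`; it is simple rather than fast.

```python
def longest_run(sub, s):
    # Longest number of back-to-back copies of `sub` anywhere in `s`.
    if not sub:
        return 0
    best = 0
    for i in range(len(s)):
        run = 0
        # Count copies of `sub` laid end to end starting at position i.
        while s.startswith(sub, i + run * len(sub)):
            run += 1
        best = max(best, run)
    return best


print(longest_run('b', 'abbcd'))   # 2
print(longest_run('b', 'abcd'))    # 1
print(longest_run('b', 'bbcd'))    # 2
```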

Python counting occurrences across multiple lines using loops

I want a quick, Pythonic way to get a count in a loop. I am actually too embarrassed to post my attempts, which currently do not work.
Given a sample from a text file structured as follows:
script7
BLANK INTERRUPTION
script2
launch4.VBS
script3
script8
launch3.VBS
script5
launch1.VBS
script6
I want a count of every time script[Y] is immediately followed by launch[X]. launch has values in the range 1-5, while script has values in the range 1-15.
Using script3 as an example, I would need a count for each of the following in a given file:
script3
launch1
#count this
script3
launch2
#count this
script3
launch3
#count this
script3
launch4
#count this
script3
launch4
#count this
script3
launch5
#count this
I think the sheer number of loops involved here has surpassed my knowledge of Python. Any assistance would be greatly appreciated.
Why not use a multi-line regex? Then the script becomes:
import re
# read all the text of the file, and clean it up
with open('counts.txt', 'rt') as f:
    alltext = '\n'.join(line.strip() for line in f)
# find all occurrences of the script line followed by the launch line;
# inline flags such as (?mi) have to come first in the pattern on current Python versions
cont = re.findall(r'(?mi)^script(\d+)\nlaunch(\d+)\.VBS$', alltext)
# accumulate the counts of each launch number for each script number
# into nested dictionaries
scriptcounts = {}
for scriptnum,launchnum in cont:
    # if we haven't seen this scriptnumber before, create the dictionary for it
    if scriptnum not in scriptcounts:
        scriptcounts[scriptnum]={}
    # if we haven't seen this launchnumber with this scriptnumber before,
    # initialize count to 0
    if launchnum not in scriptcounts[scriptnum]:
        scriptcounts[scriptnum][launchnum] = 0
    # increment the count for this combination of script and launch number
    scriptcounts[scriptnum][launchnum] += 1
# produce the output in order of increasing scriptnum/launchnum
for scriptnum in sorted(scriptcounts, key=int):
    for launchnum in sorted(scriptcounts[scriptnum], key=int):
        print("script%s\nlaunch%s.VBS\n# count %d\n" % (scriptnum, launchnum, scriptcounts[scriptnum][launchnum]))
The output (in the format you requested) is, for example:
script2
launch1.VBS
# count 1
script2
launch4.VBS
# count 1
script5
launch1.VBS
# count 1
script8
launch3.VBS
# count 3
re.findall() returns a list of all the matches; each match is a tuple of the () groups in the pattern. The (?mi) at the start is not a group but a pair of inline flags: m (multiline) lets ^ and $ match at the start and end of every line, and i makes the match case-insensitive. As it stands, a fragment such as 'script(\d+)' captures only the number following script/launch; it could just as easily capture the whole word with '(script\d+)', and similarly '(launch\d+\.VBS)', in which case only the printing would need to change.
HTH
barny
Here is my solution using defaultdict with Counters and regex with lookahead.
import re
from collections import Counter, defaultdict
with open('in.txt', 'r') as f:
    # make sure we have only \n as lineend and no leading or trailing whitespaces
    # this makes the regex less complex
    alltext = '\n'.join(line.strip() for line in f)
# find keyword script\d+ and capture it, then lazy expand and capture everything
# with lookahead so that we stop as soon as and only if next word is 'script' or
# end of the string
scriptPattern = re.compile(r'(script\d+)(.*?)(?=script|\n?$)', re.DOTALL)
# just find everything that matches launch\d+
launchPattern = re.compile(r'launch\d+')
# create a defaultdict with a counter for every entry
scriptDict = defaultdict(Counter)
# go through all matches
for match in scriptPattern.finditer(alltext):
    script, body = match.groups()
    # update the counter of this script
    scriptDict[script].update(launchPattern.findall(body))
# print the results
for script in sorted(scriptDict):
    counter = scriptDict[script]
    if len(counter):
        print('{} launches:'.format(script))
        for launch in sorted(counter):
            count = counter[launch]
            print('\t{} {} time(s)'.format(launch, count))
    else:
        print('{} launches nothing'.format(script))
Using the string on regex101 (see link above) I get the following result:
script2 launches:
launch4 1 time(s)
script3 launches nothing
script5 launches:
launch1 1 time(s)
script6 launches nothing
script7 launches nothing
script8 launches:
launch3 1 time(s)
Here's an approach which uses nested dictionaries. Please tell me if you would like the output to be in a different format:
#!/usr/bin/env python3
import re
script_dict={}
with open('infile.txt','r') as infile:
    scriptre = re.compile(r"^script\d+$")
    for line in infile:
        line = line.rstrip()
        if scriptre.match(line) is not None:
            script_dict[line] = {}
    infile.seek(0) # go to beginning
    launchre = re.compile(r"^launch\d+\.[vV][bB][sS]$")
    current=None
    for line in infile:
        line = line.rstrip()
        if line in script_dict:
            current=line
        elif launchre.match(line) is not None and current is not None:
            if line not in script_dict[current]:
                script_dict[current][line] = 1
            else:
                script_dict[current][line] += 1
print(script_dict)
You could use the setdefault method.
code:
dic={}
with open("a.txt") as inp:
    check=0
    key_string=""
    for line in inp:
        if check:
            if line.strip().startswith("launch") and int(line.strip()[6])<6:
                print("yes")
                dic[key_string]=dic.setdefault(key_string,0)+1
            check=0
        if line.strip().startswith("script"):
            key_string=line.strip()
            check=1
For your given input the output would be
output:
{"script3":6}

Help parsing text file in python

I have been struggling with this one for some time now. I have many text files in a specific format, from which I need to extract all the data and file it into different fields of a database. The struggle is tweaking the parsing parameters to make sure I get all the info correctly.
The format is shown below:
WHITESPACE HERE of unknown length.
K PA DETAILS
2 4565434 i need this sentace as one DB record
2 4456788 and this one
5 4879870 as well as this one, content will vary!
X Max - there sometimes is a line beginning with 'Max' here which i don't need
There is a Line here that i do not need!
WHITESPACE HERE of unknown length.
The tough parts were 1) getting rid of the whitespace, and 2) separating the fields from each other. See my best attempt below:
dict = {}
XX = (open("XX.txt", "r")).readlines()
for line in XX:
    if line.isspace():
        pass
    elif line.startswith('There is'):
        pass
    elif line.startswith('Max', 2):
        pass
    elif line.startswith('K'):
        pass
    else:
        for word in line.split():
            if word.startswith('4'):
                tmp_PA = word
            elif word == "1" or word == "2" or word == "3" or word == "4" or word == "5":
                tmp_K = word
            else:
                tmp_DETAILS = word
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',(tmp_PA,tmp_K,tmp_DETAILS))
At the moment I can pull the K and PA fields with no problem, but my DETAILS field only gets one word. I need the entire sentence, or at least 25 characters of it.
Thanks very much for reading and I hope you can help! :)
K
You are splitting the whole line into words. You need to split it into the first word, the second word and the rest, like line.split(None, 2).
I would probably use regular expressions, and use the opposite logic: if the line starts with a number from 1 through 5, use it; otherwise skip it. Like:
import re
pattern = re.compile(r'([12345])\s+(\d+)\s+(.*\S)')
f = open('XX.txt', 'r') # No calling readlines; lazy iteration is better
for line in f:
    m = pattern.match(line)
    if m:
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   (m.group(2), m.group(1), m.group(3)))
Oh, and of course, you should be using a prepared statement. Parsing SQL is orders of magnitude slower than executing it.
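The `line.split(None, 2)` suggestion can be sketched without regular expressions as follows. The function name `parse_record` and the validation rules (K is a single digit 1-5, PA is all digits) are assumptions read off the sample format in the question.

```python
def parse_record(line):
    # maxsplit=2 keeps everything after the second field as one string.
    parts = line.split(None, 2)
    if len(parts) != 3:
        return None
    k, pa, details = parts
    # Assumed format: K is a digit from 1 to 5 and PA is all digits.
    if k not in {"1", "2", "3", "4", "5"} or not pa.isdigit():
        return None
    # The remainder keeps its trailing newline, so strip it.
    return pa, k, details.rstrip()


print(parse_record("2 4565434 i need this sentace as one DB record\n"))
print(parse_record("X Max - there sometimes is a line here\n"))
```

Each non-None result is already in (pa, k, details) order, so it could be passed directly as the parameter tuple of the INSERT statement.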
If I understand your file format correctly, you can try this script:
filename = 'bug.txt'
f = open(filename, 'r')
foundHeaders = False
records = []
for rawline in f:
    line = rawline.strip()
    if not foundHeaders:
        tokens = line.split()
        if tokens == ['K','PA','DETAILS']:
            foundHeaders = True
        continue
    else:
        tokens = line.split(None,2)
        if len(tokens) != 3:
            break
        try:
            K = int(tokens[0])
            PA = int(tokens[1])
        except ValueError:
            break
        records.append((K,PA,tokens[2]))
f.close()
for r in records:
    print(r) # replace this by your DB insertion code
This will start reading the records when it encounters the header line, and stop as soon as the format of the line is no longer (K,PA,description).
Hope this helps.
Here is my attempt using re:
import re
stuff = open("source", "r").readlines()
whitey = re.compile(r"^[\s]+$")
header = re.compile(r"K PA DETAILS")
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")
for line in stuff:
    if whitey.match(line):
        pass
    elif header.match(line):
        pass
    elif juicy_info.match(line):
        result = juicy_info.search(line)
        print(result.group('third'))
        print(result.group('second'))
        print(result.group('first'))
Using re I can pull the data out and manipulate it on a whim. If you only need the juicy info lines, you can actually take out all the other checks, making this a REALLY concise script.
import re
stuff = open("source", "r").readlines()
# create a regular expression using named subpatterns.
# 'first', 'second' and 'third' are our own tags,
# we could call them Adam, Betty, etc.
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")
for line in stuff:
    result = juicy_info.search(line)
    if result: # do stuff with data here, just use the tags we declared earlier.
        print(result.group('third'))
        print(result.group('second'))
        print(result.group('first'))
import re
reg = re.compile(r'K[ \t]+PA[ \t]+DETAILS[ \t]*\r?\n'
                 + 3*r'([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*\r?\n')
with open('XX.txt') as f:
    mat = reg.search(f.read())
for tripl in ((2,1,3),(5,4,6),(8,7,9)):
    cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
               mat.group(*tripl))
I prefer to use [ \t] instead of \s because \s matches all of the following characters:
blank, '\f', '\n', '\r', '\t', '\v'
and I don't see any reason to use a symbol that matches more than what needs to be matched, with the risk of matching stray newlines in places where they shouldn't be.
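A small demonstration of that risk, using made-up input in which the first record has no DETAILS text. The pattern names `loose_ws` and `strict_ws` are hypothetical, and both patterns are simplified versions of the ones discussed here.

```python
import re

# Made-up input: the first record is missing its DETAILS column.
text = "2 4565434\n\n5 4879870 as well as this one\n"

loose_ws = re.compile(r'^([1-5])\s+(\d+)\s+(.+)$', re.MULTILINE)
strict_ws = re.compile(r'^([1-5])[ \t]+(\d+)[ \t]+(.+)$', re.MULTILINE)

# \s+ happily consumes the newlines, so the NEXT record is swallowed
# as the DETAILS of the incomplete one.
loose = loose_ws.findall(text)
print(loose)

# [ \t]+ cannot cross a line break, so the incomplete record is skipped
# and the complete one is parsed on its own.
strict = strict_ws.findall(text)
print(strict)
```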
Edit
It may be sufficient to do:
import re
reg = re.compile(r'^([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*$',re.MULTILINE)
with open('XX.txt') as f:
    for mat in reg.finditer(f.read()):
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   mat.group(2,1,3))
