I'm looking at a very large set of binary data stored in a separate file. The challenge is to find the longest consecutive run of 1s and of 0s. I have already opened the file from Python and written code that counts the total number of 0s and 1s. Any help would be much appreciated, as I am a total beginner with Python. Cheers.
This is what I've done so far:
filename = "C:/01.txt"
file = open(filename, "r")
count_1 = 0
count_0 = 0
for line in file:
    count_0 = count_0 + line.count("0")
    count_1 = count_1 + line.count("1")
print("Number of 1s = " + str(count_1))
print("Number of 0s = " + str(count_0))
I have not actually started the coding to find the consecutive numbers.
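To find the longest run of a given substring, you could use a function like this one:
import re
def longest_segment(sub, string):
    # key=len picks the longest run; plain max() would compare the strings alphabetically
    return max((m.group() for m in re.finditer(r'(%s)\1*' % re.escape(sub), string)), key=len, default='')
This works by finding every run of back-to-back repetitions of the provided substring, sub, in the string, and returning the longest one (or an empty string if sub never occurs).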
Here is a simple solution: loop through the data, counting consecutive 1s as they are read; when you read a 0 (meaning you have reached the end of a segment), compare its length to the longest segment of consecutive 1s found so far.
def getMaxSegmentLength(readable):
    current_length = 0
    max_length = 0
    for x in readable:
        if x == '1':
            current_length += 1
        else:
            max_length = max(max_length, current_length)
            current_length = 0
    return max(max_length, current_length)
def main():
    # open the file located at G:/input.txt in read mode and name the file object inputf
    with open('G:/input.txt', 'r') as inputf:
        # put all the text in inputf in s
        s = inputf.read()
    # get the longest streak of 1s in string s
    n = getMaxSegmentLength(s)
    print(n)
if __name__ == '__main__':
    main()
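s = input()  # read s from the file in this case
zero = 0  # longest run of 0s seen so far
one = 0   # longest run of 1s seen so far
zz = 0    # length of the current run of 0s
oo = 0    # length of the current run of 1s
for i in s:
    if i == '1':
        if zz >= 1:
            zero = max(zero, zz)
        zz = 0
        oo += 1
    else:
        if oo >= 1:
            one = max(one, oo)
        oo = 0
        zz += 1
if oo >= 1:
    one = max(oo, one)
if zz >= 1:
    zero = max(zero, zz)
print(zero, one)
# O(n)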
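Another way to get both answers in one pass is itertools.groupby, which groups consecutive equal characters. This is only a minimal sketch: the function name longest_runs is made up for illustration, and it assumes the data has already been read into a single string. Any other character (such as a newline) ends the current run, so strip newlines first if a run may continue across lines.

```python
from itertools import groupby

def longest_runs(bits):
    """Return the longest run length of '0' and of '1' in the string bits."""
    longest = {"0": 0, "1": 0}
    # groupby yields one (character, group) pair per run of equal characters
    for char, group in groupby(bits):
        if char in longest:  # ignore characters other than 0 and 1
            run_length = sum(1 for _ in group)
            longest[char] = max(longest[char], run_length)
    return longest

print(longest_runs("0011101111000001"))  # {'0': 5, '1': 4}
```

With the file from the question you would call it as longest_runs(file.read().replace("\n", "")).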
I have a list of numbers of varying lengths stored in a file, like this...
98
132145
132324848
4435012341
1254545221
2314565447
I need a function that looks through the list and counts every number that is 10 digits in length and begins with the number 1. I have stored the list in both a .txt and a .csv with no luck. I think a big part of the problem is that the numbers are integers, not strings.
import regex
with open(r"C:\Desktop\file.csv") as file:
    data = file.read()
x = regex.findall('\d+', data)
def filterNumberOne(n):
    if(len(n)==10:
        for i in n:
            if(i.startswith(1)):
                return True
    else:
        return False
one = list(filter(filterNumberOne, x))
print(len(one))
You could simply do it like this:
# Get your file content as a single string.
with open(r"C:\Desktop\file.csv") as f:
    s = " ".join([l.strip() for l in f])
# Look for the 10-digit numbers starting with a one.
nb = [n for n in s.split(' ') if len(n)==10 and n[0]=='1']
In your case, the output will be:
['1254545221']
So I ended up using the following, which seems to work great:
def filterNumberOne(n):
    # n is one of the digit strings returned by findall
    if len(n) == 10 and str(n)[0] == '1':
        return True
    return False
one = list(filter(filterNumberOne, x))
print(len(one))
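The filtering can also be folded into the regular expression itself. The following is a sketch using the standard re module (not the third-party regex package imported in the question); the function name count_ten_digit_ones and the inline sample data are only for illustration.

```python
import re

def count_ten_digit_ones(text):
    # \b1\d{9}\b matches a standalone run of exactly ten digits
    # whose first digit is 1; the \b anchors reject longer numbers.
    return len(re.findall(r"\b1\d{9}\b", text))

sample = "98\n132145\n132324848\n4435012341\n1254545221\n2314565447\n"
print(count_ten_digit_ones(sample))  # 1
```

Because the file contents are read as text, the numbers are already strings, so no conversion from int is needed.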
This is my solution to the CS50 pset6 DNA problem in Python. It works fine on the small database, but on the large one it fails with:
IndexError: list index out of range
I added print statements to see where the error occurs, and the large database is read and printed correctly. I'm not sure what to do next.
import csv
import sys
def main():
    # TODO: Check for command-line usage
    if len(sys.argv) != 3:
        print("Usage: python dna.py database.csv sequence.txt")
        sys.exit(1)
    # TODO: Read database file into a variable
    dna_database = []
    with open(sys.argv[1], "r") as dna_data_file:
        reader = csv.DictReader(dna_data_file)
        for row in reader:
            dna_database.append(row)
    # TODO: Read DNA sequence file into a variable
    with open(sys.argv[2], "r") as load_sequence:
        sequence = load_sequence.read()
    # TODO: Find longest match of each STR in DNA sequence
    STR = list(dna_database[0].keys())[1:]
    STR_match = {}
    for i in range(len(dna_database)):
        # print(dna_database)
        STR_match[STR[i]] = longest_match(sequence,STR[i])
    # TODO: Check database for matching profiles
    for i in range(len(dna_database)):
        matches = 0
        for j in range(len(STR)):
            if int(STR_match[STR[j]]) == int(dna_database[i][STR[j]]):
                matches += 1
        if matches == len(STR):
            print(dna_database[i]['name'])
            sys.exit(0)
    print("No Match")
    return
def longest_match(sequence, subsequence):
    """Returns length of longest run of subsequence in sequence."""
    # Initialize variables
    longest_run = 0
    subsequence_length = len(subsequence)
    sequence_length = len(sequence)
    # Check each character in sequence for most consecutive runs of subsequence
    for i in range(sequence_length):
        # Initialize count of consecutive runs
        count = 0
        # Check for a subsequence match in a "substring" (a subset of characters) within sequence
        # If a match, move substring to next potential match in sequence
        # Continue moving substring and checking for matches until out of consecutive matches
        while True:
            # Adjust substring start and end
            start = i + count * subsequence_length
            end = start + subsequence_length
            # If there is a match in the substring
            if sequence[start:end] == subsequence:
                count += 1
            # If there is no match in the substring
            else:
                break
        # Update most consecutive matches found
        longest_run = max(longest_run, count)
    # After checking for runs at each character in sequence, return longest run found
    return longest_run
main()
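A likely cause of the IndexError, offered as a sketch rather than a verified fix: the loop that fills STR_match runs over range(len(dna_database)), which is the number of people, but uses i to index STR, which has one entry per STR column. Whenever the database has more rows than STR columns, STR[i] runs off the end of the list. Iterating over STR itself avoids that. The helper name compute_str_counts and the three-row database below are made up for illustration; longest_match is a condensed version of the function above.

```python
def longest_match(sequence, subsequence):
    """Length of the longest consecutive run of subsequence in sequence."""
    longest_run = 0
    k = len(subsequence)
    for i in range(len(sequence)):
        count = 0
        # Keep stepping forward one subsequence at a time while it still matches
        while sequence[i + count * k : i + (count + 1) * k] == subsequence:
            count += 1
        longest_run = max(longest_run, count)
    return longest_run

def compute_str_counts(dna_database, sequence):
    # Hypothetical helper: loop over the STR names, not over the database rows
    strs = list(dna_database[0].keys())[1:]
    return {s: longest_match(sequence, s) for s in strs}

# Three people but only two STR columns: indexing STR by row number would fail here
database = [
    {"name": "Alice", "AGATC": "2", "AATG": "1"},
    {"name": "Bob", "AGATC": "4", "AATG": "1"},
    {"name": "Charlie", "AGATC": "3", "AATG": "2"},
]
print(compute_str_counts(database, "AGATCAGATCTTAATG"))  # {'AGATC': 2, 'AATG': 1}
```

In the original main(), the equivalent change would be to loop with for s in STR and assign STR_match[s] = longest_match(sequence, s).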
I'm doing an exercise (CS50 - DNA) where I have to count specific consecutive substrings (STRs) that mimic DNA sequences. I find myself overcomplicating my code, and I'm having a hard time figuring out how to proceed.
I have a list of substrings:
strs = ['AGATC', 'AATG', 'TATC']
And a String with a random sequence of letters:
AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG
I want to find the longest consecutive run of each entry in strs.
So:
'AGATC' - AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG
'AATG' - AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG
'TATC' - AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG
resulting in [4, 1, 5]
(Note that this isn't the best example, since there are no stray repeats of the patterns scattered around, but I think it illustrates what I'm looking for.)
I know that I should be using something like re.match(rf"({strs}){2,}", string), because str.count(strs) counts ALL occurrences, consecutive or not.
My code so far:
#!/usr/bin/env python3
import csv
import sys
from cs50 import get_string
# sys.exit to terminate the program
# sys.exit(2) UNIX default for wrong args
if len(sys.argv) != 3:
    print("Usage: python dna.py data.csv sequence.txt")
    sys.exit(2)
# open file, make it into a list, get STRS, remove header
with open(sys.argv[1], "r") as database:
    data = list(csv.reader(database))
    STRS = data[0]
    data.pop(0)
    # remove "name" so the only things remaining are STRs
    STRS.pop(0)
# open file to compare against db
with open(sys.argv[2], "r") as seq:
    sequence = seq.read()
sequenceCount = []
# for each STR count the occurrences
# sequence.count(s) returns all
for s in STRS:
    sequenceCount.append(sequence.count(s))
print(STRS)
print(sequenceCount)
"""
sequenceCount = {}
# for each STR count the occurrences
for s in STRS:
    sequenceCount[s] = sequence.count(s)
for line in data:
    print(line)
    for item in line[1:]:
        continue
# rf"({STRS}){2,}"
"""
A regular expression for finding repeated strings looks like r"(AGATC)+".
For example:
import re
sequence = "AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG"
pattern = "AGATC"
r = re.search(r"({})+".format(pattern), sequence)
if r:
    print("start at", r.start())
    print("end at", r.end())
If a match is found, you can get its starting and ending positions with the .start() and .end() methods, and compute the number of repetitions as (r.end() - r.start()) // len(pattern).
Note that re.search returns only the first run. If you need every run in the sequence, use re.finditer, which yields match objects one at a time.
You can then loop over the target patterns and keep the longest run for each.
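Putting those pieces together might look like the sketch below. The function name longest_run and the short test sequence are made up for illustration. One caveat: re.finditer returns non-overlapping matches scanned left to right, so for a pattern that can overlap with itself this could undercount; that does not happen with the three STRs used here.

```python
import re

def longest_run(pattern, sequence):
    """Largest number of back-to-back copies of pattern found in sequence."""
    # (?:...)+ matches one or more consecutive copies without capturing them
    regex = re.compile(r"(?:%s)+" % re.escape(pattern))
    runs = ((m.end() - m.start()) // len(pattern) for m in regex.finditer(sequence))
    return max(runs, default=0)  # 0 when the pattern never occurs

sequence = "TTAGATCAGATCAGATCGGAATGTATCTATCAGATC"
strs = ["AGATC", "AATG", "TATC"]
print([longest_run(s, sequence) for s in strs])  # [3, 1, 2]
```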
This uses two for loops: one to take each string from strs, and another to iterate over the DNA strand and match that string against it. When a match is found, a while loop keeps looking for consecutive (back-to-back) matches. Inline comments give a brief explanation of each step.
dna = 'AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATAGATCTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG'
strs = ['AGATC', 'AATG', 'TATC']
def seq_finder(sequence, dna):
    counter = [0] * len(sequence)  # Create a list of zeros to store the longest run of each sequence
    for idx, seq in enumerate(sequence):  # Iterate over every entry in our sequence list "strs"
        k = len(seq)
        for i in range(len(dna)):  # For each sequence, iterate over our "dna" strand
            if dna[i:i+k] == seq:  # If a match is found:
                holder = 1  # Temporary holder for the number of *consecutive* occurrences
                j = i
                while dna[j:j+k] == dna[j+k:j+k*2]:  # While our match has an identical match right after it:
                    holder += 1  # Increment our holder by 1
                    j += k  # Continue from the new match
                if holder > counter[idx]:
                    counter[idx] = holder  # Only replace counter if new holder > old holder
    return counter
I need to read an input file (input.txt) which contains one line of integers (13 34 14 53 56 76) and then compute the sum of the squares of each number.
This is my code:
# define main program function
def main():
    print("\nThis is the last function: sum_of_squares")
    print("Please include the path if the input file is not in the root directory")
    fname = input("Please enter a filename : ")
    sum_of_squares(fname)
def sum_of_squares(fname):
    infile = open(fname, 'r')
    sum2 = 0
    for items in infile.readlines():
        items = int(items)
        sum2 += items**2
    print("The sum of the squares is:", sum2)
    infile.close()
# execute main program function
main()
If each number is on its own line, it works fine.
But, I can't figure out how to do it when all the numbers are on one line separated by a space. In that case, I receive the error: ValueError: invalid literal for int() with base 10: '13 34 14 53 56 76'
You can use file.read() to get the contents as a string and then use str.split to split it on whitespace.
You'll need to convert each number from a string to an int first, and then use the built-in sum function to calculate the sum.
As an aside, you should use the with statement to open and close your file for you:
def sum_of_squares(fname):
    with open(fname, 'r') as myFile:  # This closes the file for you when you are done
        contents = myFile.read()
    sumOfSquares = sum(int(i)**2 for i in contents.split())
    print("The sum of the squares is:", sumOfSquares)
Output:
The sum of the squares is: 13242
You are trying to turn a string with spaces in it, into an integer.
What you want to do is use the split method (here, items.split(' ')), which returns a list of strings, each containing a number without any spaces. You can then iterate through this list and convert each element to an int, as you are already trying to do.
I believe you will find what to do next. :)
Here is a short code example, with more pythonic methods to achieve what you are trying to do.
# The `with` statement is the proper way to open a file.
# It opens the file, and closes it accordingly when you leave it.
with open('foo.txt', 'r') as file:
    # You can directly iterate your lines through the file.
    for line in file:
        # You want a new sum number for each line.
        sum_2 = 0
        # Creating your list of numbers from your string.
        lineNumbers = line.split(' ')
        for number in lineNumbers:
            # Casting EACH number that is still a string to an integer...
            sum_2 += int(number) ** 2
        print('For this line, the sum of the squares is {}.'.format(sum_2))
You could try splitting your items on space using the split() function.
From the doc: For example, ' 1 2 3 '.split() returns ['1', '2', '3'].
def sum_of_squares(fname):
    infile = open(fname, 'r')
    sum2 = 0
    for items in infile.readlines():
        # += keeps the total correct even if the file has more than one line
        sum2 += sum(int(i)**2 for i in items.split())
    print("The sum of the squares is:", sum2)
    infile.close()
Just keep it really simple, no need for anything complicated. Here is a commented step by step solution:
def sum_of_squares(filename):
    # create a summing variable
    sum_squares = 0
    # open file
    with open(filename) as file:
        # loop over each line in file
        for line in file.readlines():
            # create a list of strings split by whitespace
            numbers = line.split()
            # loop over potential numbers
            for number in numbers:
                # check if string is a number
                if number.isdigit():
                    # add square to accumulated sum
                    sum_squares += int(number) ** 2
    # when we reach here, we're done, and exit the function
    return sum_squares
print("The sum of the squares is:", sum_of_squares("numbers.txt"))
Which outputs:
The sum of the squares is: 13242
I'm trying to find the dinucleotide counts and frequencies for a sequence in a text file, but my code only outputs single-nucleotide counts.
e = "ecoli.txt"
ecnt = {}
with open(e) as seq:
    for line in seq:
        for word in line.split():
            for i in range(len(seqr)):
                dinuc = (seqr[i] + seqr[i:i+2])
            for dinuc in seqr:
                if dinuc in ecnt:
                    ecnt[dinuc] += 1
                else:
                    ecnt[dinuc] = 1
for x,y in ecnt.items():
    print(x, y)
Sample input: "AAATTTCGTCGTTGCCC"
Sample output:
AA:2
TT:3
TC:2
CG:2
GT:2
GC:1
CC:2
Right now, I'm only getting single nucleotides in my output:
C 83550600
A 60342100
T 88192300
G 92834000
For nucleotides that repeat, e.g. "AAA", the count has to include every overlapping occurrence of 'AA', so the result should be 2 rather than 1. It doesn't matter what order the dinucleotides are listed in; I just need all combinations, with correct counts for the repeated nucleotides. I asked my TA, and she said that my only problem was getting my for loop to add the dinucleotides to my dictionary. I think my range may also be wrong. The file is really big, so the sequence is split across lines.
Thank you so much in advance!!!
I took a look at your code and found several things that you might want to take a look at.
For testing my solution, since I did not have ecoli.txt, I generated one of my own with random nucleotides with the following function:
import random
def write_random_sequence():
    out_file = open("ecoli.txt", "w")
    num_nts = 500
    nts_per_line = 80
    nts = []
    for i in range(num_nts):
        nt = random.choice(["A", "T", "C", "G"])
        nts.append(nt)
    lines = [nts[i:i+nts_per_line] for i in range(0, len(nts), nts_per_line)]
    for line in lines:
        out_file.write("".join(line) + "\n")
    out_file.close()
write_random_sequence()
Notice that this file has a single sequence of 500 nucleotides separated into lines of 80 nucleotides each. In order to count dinucleotides where you have the first nucleotide at the end of one line and the second nucleotide at the start of the next line, we need to merge all of these separate lines into a single string, without spaces. Let's do that first:
seq = ""
with open("ecoli.txt", "r") as seq_data:
    for line in seq_data:
        seq += line.strip()
Try printing out "seq" and notice that it should be one giant string containing all of the nucleotides. Next, we need to find the dinucleotides in the sequence string. We can do this using slicing, which I see you tried. So for each position in the string, we look at both the current nucleotide and the one after it.
for i in range(len(seq)-1):  # note the -1
    dinuc = seq[i:i+2]
We can then count the dinucleotides and store them in a dictionary "ecnt", very much like you had. The final code looks like this:
ecnt = {}
seq = ""
with open("ecoli.txt", "r") as seq_data:
    for line in seq_data:
        seq += line.strip()
for i in range(len(seq)-1):
    dinuc = seq[i:i+2]
    if dinuc in ecnt:
        ecnt[dinuc] += 1
    else:
        ecnt[dinuc] = 1
print(ecnt)
A perfect opportunity to use a defaultdict:
from collections import defaultdict
file_name = "ecoli.txt"
dinucleotide_counts = defaultdict(int)
sequence = ""
with open(file_name) as file:
    for line in file:
        sequence += line.strip()
for i in range(len(sequence) - 1):
    dinucleotide_counts[sequence[i:i + 2]] += 1
for key, value in sorted(dinucleotide_counts.items()):
    print(key, value)
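Since the question also asks for frequencies, here is one more sketch using collections.Counter, pairing each character with its successor via zip. The function name dinucleotide_counts is made up for illustration, and the frequency is assumed to mean the count divided by the total number of dinucleotides. It runs on the sample input from the question instead of ecoli.txt.

```python
from collections import Counter

def dinucleotide_counts(sequence):
    # zip(sequence, sequence[1:]) pairs every character with the one after it,
    # so overlapping pairs such as the two 'AA' in 'AAA' are both counted
    return Counter(a + b for a, b in zip(sequence, sequence[1:]))

counts = dinucleotide_counts("AAATTTCGTCGTTGCCC")
total = sum(counts.values())
for pair, n in sorted(counts.items()):
    print(pair, n, round(n / total, 3))
```

For the real file, build the sequence string by joining the stripped lines first, as in the answers above, and pass it to the function.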