I'm trying to write a program that counts the number of N's at the end of a string.
I have a file containing many lines of unique sequences, and I want to measure how often a sequence ends with N and how long each trailing run of N's is. For example, the file input will look like this:
NTGTGTAATAGATTTTACTTTTGCCTTTAAGCCCAAGGTCCTGGACTTGAAACATCCAAGGGATGGAAAATGCCGTATAACNN
NAAAGTCTACCAATTATACTTAGTGTGAAGAGGTGGGAGTTAAATATGACTTCCATTAATAGTTTCATTGTTTGGAAAACAGN
NTACGTTTAGTAGAGACAGTGTCTTGCTATGTTGCCCAGGCTGGTCTCAAACTCCTGAGCTCTAGCAAGCCTTCCACCTCNNN
NTAATCCAACTAACTAAAAATAAAAAGATTCAAATAGGTACAGAAAACAATGAAGGTGTAGAGGTGAGAAATCAACAGGANNN
Ideally, the code will read through the file, line by line and count how often a line ends with 'N'.
Then, if a line ends with N, it should read each character backwards to see how long the string of N's is. This information will be used to calculate the percentage of lines ending in N, as well as the mean, mode, median and range of N strings.
Here is what I have so far.
filename = 'N_strings_test.txt'
n_strings = 0
n_string_len = []
with open(filename, 'r') as in_f_obj:
    line_count = 0
    for line in in_f_obj:
        line_count += 1
        base_seq = line.rstrip()
        if base_seq[-1] == 'N':
            n_strings += 1
            if base_seq[-2] == 'N':
                n_string_len.append(int(2))
            else:
                n_string_len.append(int(1))
print(line_count)
print(n_strings)
print(n_string_len)
All I'm getting is an index out of range error, but I don't understand why. Also, what I have so far is limited to runs of 2 characters.
I want to try and write this for myself, so I don't want to import any modules.
Thanks.
You are probably getting the IndexError because your file contains empty lines: after rstrip() an empty line becomes '', and ''[-1] raises IndexError.
Two sound approaches. First the generic one: iterate the line in reverse using reversed():
line = line.rstrip()
count = 0
for c in reversed(line):
    if c != 'N':
        break
    count += 1
# count will now contain the number of N characters from the end
Another, even easier, approach, which builds stripped copies of the string (strings are immutable, so the original line is untouched), is to rstrip() all whitespace, get the length, and then rstrip() all Ns. The number of trailing Ns is the difference in lengths:
without_whitespace = line.rstrip()
without_ns = without_whitespace.rstrip('N')
count = len(without_whitespace) - len(without_ns)
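To show how the rstrip() difference could feed the percentage the question asks for, here is a minimal sketch that uses no imports. The function names count_trailing_n and summarize, and the shortened sample lines, are made up for illustration and do not come from the original post.

```python
def count_trailing_n(line):
    """Return the number of consecutive 'N' characters at the end of line."""
    stripped = line.rstrip()            # drop the newline and trailing whitespace
    return len(stripped) - len(stripped.rstrip('N'))


def summarize(lines):
    """Return (line_count, lines_ending_in_n, run_lengths) for an iterable of lines."""
    line_count = 0
    run_lengths = []
    for line in lines:
        line_count += 1
        run = count_trailing_n(line)
        if run:                         # empty lines give 0, so they never raise IndexError
            run_lengths.append(run)
    return line_count, len(run_lengths), run_lengths


# Shortened, made-up sample lines; a real run would pass an open file object instead.
sample = ["NTGTAACNN\n", "NAAACAGN\n", "\n", "ACGT\n"]
total, ending_in_n, runs = summarize(sample)
print(total, ending_in_n, runs)
print(100.0 * ending_in_n / total)      # percentage of lines ending in N
```

Because summarize only iterates over its argument, it should work equally well with a list of strings or with a file opened via with open(...).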
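This code does the following:
Reads the file line by line.
Reverses each line and lstrips it. Reversing is not necessary, but it makes the scan read naturally from index 0.
Checks the first character of the reversed line (the last character of the original) and, if it is N, counts the line.
Keeps reading along that line for as long as the run of N's continues.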
n_string_count, n_string_len, line_count = 0, [], 0
with open('file.txt', 'r') as input_file:
    for line in input_file:
        line_count += 1
        line = line[::-1].lstrip()
        if line:
            if line[0] == 'N':
                n_string_count += 1
                consecutive_n = 1
                while consecutive_n < len(line) and line[consecutive_n] == 'N':
                    consecutive_n += 1
                n_string_len.append(consecutive_n)
print(line_count)
print(n_string_count)
print(n_string_len)
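The question also asks for the mean, mode, median and range of the run lengths without importing any modules. A rough sketch of how those could be computed from a list such as n_string_len follows. The helper names are hypothetical, Python 3 division is assumed, and ties for the mode are broken in favor of the smallest value, which is an arbitrary choice.

```python
def mean(values):
    return sum(values) / len(values)


def median(values):
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:                # odd length: the middle element
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even length: average the two middle elements


def mode(values):
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    # most frequent value first; ties resolved in favor of the smallest value
    return min(counts, key=lambda v: (-counts[v], v))


def value_range(values):
    return max(values) - min(values)


n_string_len = [2, 1, 3, 3]             # the run lengths from the four sample lines in the question
print(mean(n_string_len), median(n_string_len), mode(n_string_len), value_range(n_string_len))
```

All four helpers assume a non-empty list, so check that at least one line ended in N before calling them.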
Related
I have started my code and am off to a very good start. However, I have hit a roadblock when it comes to adding sum, average, minimum, and maximum to my code. I'm sure this is a pretty easy fix for someone who knows what they are doing. Any help would be greatly appreciated. The numbers in my file are 14, 22, and -99.
Here is my code so far:
def main ():
    contents=''
    try:
        infile = openFile()
        count, sum = readFile(infile)
        closeFile(infile)
        display(count, sum)
    except IOError:
        print('Error, input file not opened properly')
    except ValueError:
        print('Error, data within the file is corrupt')
def openFile():
    infile=open('numbers.txt', 'r')
    return infile
def readFile(inf):
    count = 0
    sum = 0
    line = inf.readline()
    while line != '':
        number = int(line)
        sum += number
        count += 1
        line = inf.readline()
    return count, sum
def closeFile(inF):
    inF.close()
def display(count, total):
    print('count = ', count)
    print('Sum = ', total)
main()
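The while line != '': loop reads the file one line at a time, so it expects exactly one number per line; a blank line or a line holding several numbers makes int(line) raise ValueError. If all the numbers are on a single line separated by commas, you can use .split() with a loop (or a list comprehension) instead, and the built-ins sum(), max() and min() give the remaining statistics. Your code (assuming that all numbers are on a single line):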
def read_file():
    with open("numbers.txt", "r") as f:
        line = f.readline()
    l = [int(g) for g in line.split(",")]  # int() ignores whitespace around each number
    s = sum(l)
    avg = s / len(l)
    maximum = max(l)
    minimum = min(l)
    return s, avg, maximum, minimum
print(read_file())
Your code contains a number of antipatterns: you apparently tried to structure it OO-like but without using a class... But this:
line = inf.readline()
while line != '':
    number = int(line)
    sum += number
    count += 1
    line = inf.readline()
is the worst part and probably the culprit.
Idiomatic Python seldom uses readline and just iterates over the file object, but good practice is to strip input lines so that trailing whitespace is ignored:
for line in inf:
    if line.strip() == '':
        break
    number = int(line)
    sum += number
    count += 1
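Putting the pieces together, one possible way to get the count, sum, average, minimum and maximum in a single pass over the file is sketched below. The function name read_stats is hypothetical, and unlike the snippet above this version skips blank lines rather than stopping at the first one, which is an assumption about what the asker wants.

```python
import os
import tempfile


def read_stats(path):
    """Return (count, total, average, minimum, maximum) for a file with one integer per line."""
    numbers = []
    with open(path, 'r') as infile:
        for line in infile:
            text = line.strip()
            if text == '':              # skip blank lines instead of failing on them
                continue
            numbers.append(int(text))   # raises ValueError on corrupt data
    if not numbers:
        raise ValueError('no numbers found in ' + path)
    total = sum(numbers)
    return len(numbers), total, total / len(numbers), min(numbers), max(numbers)


# Demo with the three numbers from the question, written to a temporary file.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('14\n22\n\n-99\n')
print(read_stats(tmp.name))
os.remove(tmp.name)
```

Collecting the values in a list keeps the code short; for very large files you could track the running total, minimum and maximum instead.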
Hi, I'm having trouble with some homework. I have a list of numbers in a text file, written over several lines. My project asks me to select a specific line and then sum a given number of lines after it. For example, starting from line 4, sum the next 4 lines.
This is the code I have tried so far:
fichNbr = open("nombres.txt", "r")
ligneDepart = int(input("enter the starting line: "))
nb_lignes = int(input("enter the number of lines to read: "))
somme3 = 0
for line in fichNbr:
    line = fichNbr.readline()
    print(line)
    for i in range(ligneDepart,(ligneDepart + nb_lignes),1):
        n = fichNbr.readline().split()
        for f in n:
            somme3 += int(f)
print(somme3)
I don't really get what your code is doing wrong (I'm in a hurry, so I don't have enough time to analyse it, sorry), but if you're looking for code that does roughly what you need (I think), here it is:
f = open("test.txt", "r")
start_line = int(input("line to start ")) - 1
finish_line = int(input("line to finish ")) - 1
soma = 0
for i, line in enumerate(f):
    if i >= start_line and i <= finish_line:
        soma += int(line)
f.close()
print(soma)
Just a quick explanation: enumerate is a built-in function that iterates through the file f and yields tuples containing the line number (starting from zero) and the contents of the line. All you need to do is check that i is greater than or equal to the line you want to start reading from and less than or equal to the line you want to stop reading at.
Hope it helps :)
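As an alternative to comparing indexes by hand, the standard library's itertools.islice can select the wanted lines directly. This is only a sketch: the function name sum_lines is made up, and it assumes "from line 4 sum the next 4 lines" means lines 4 to 7 inclusive, counted from 1.

```python
from itertools import islice


def sum_lines(lines, start_line, line_count):
    """Sum line_count integers starting at the 1-based start_line (inclusive)."""
    first = start_line - 1                          # islice uses 0-based positions
    selected = islice(lines, first, first + line_count)
    return sum(int(line) for line in selected if line.strip())


# Stand-in for a file: one number per line.
sample = ["2\n", "4\n", "3\n", "7\n", "5\n", "6\n", "4\n"]
print(sum_lines(sample, 4, 4))                      # 7 + 5 + 6 + 4
```

Since islice accepts any iterable, an open file object can be passed in place of the list, and reading stops as soon as the last requested line has been consumed.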
To sum all lines from a designated line onward (including that line), subtract 1 from your line variable, because list indexes start at 0. If the sum should exclude that line, use the variable as is. Open the file using with ... as so that it is closed automatically, and choose read ('r') as the mode. Create a variable to store the lines and fill it with readlines(), which stores each line in its own slot of a list.
Create another variable to hold the sum. Using a while loop driven by your start-line variable, as long as it is less than the length of your list, iterate through the list, adding each line to your sum variable (called nums here).
Because of how the file is read, every line comes back as a string with \n at the end. Use strip('\n') to remove the \n and convert the result to an integer. Add 1 to your line variable on each pass to advance through the list and eventually end the loop.
def example(file_name, line):
    with open(file_name, 'r') as f:
        x = f.readlines()
    line = line - 1
    nums = 0
    while line < len(x):
        nums += int(x[line].strip('\n'))
        line += 1
    print(nums)
example("example.txt", 4)
# My example.txt file has a different number on each line in this order: 2, 4, 3, 7, 5, 6, 4
If you want to sum only a certain number of lines following the given line, pass this extra number to the function and add the line variable to it BEFORE subtracting 1 from the line variable. Instead of iterating up to the length of the list, use this modified extra variable as your end.
def example(file_name, line, end_line):
    with open(file_name, 'r') as f:
        x = f.readlines()
    end_line += line
    line = line - 1
    nums = 0
    while line < end_line:
        nums += int(x[line].strip('\n'))
        line += 1
    print(nums)
example("example.txt", 4, 2)
# My example.txt file has a different number on each line in this order: 2, 4, 3, 7, 5, 6, 4
Again, if you do not want the given line included, do not subtract 1.
I open a dictionary file and pull specific lines from it. The lines are specified using a list, and at the end I need to print a complete sentence on one line.
I want to open a dictionary that has one word on each line,
then print a sentence on one line with a space between the words:
N = ['19','85','45','14']
file = open("DICTIONARY", "r")
my_sentence = #?????????
print my_sentence
If your DICTIONARY is not too big (i.e. it fits in memory):
N = [19,85,45,14]
with open("DICTIONARY", "r") as f:
    words = f.readlines()
my_sentence = " ".join([words[i].strip() for i in N])
EDIT: A small clarification, the original post didn't use space to join the words, I've changed the code to include it. You can also use ",".join(...) if you need to separate the words by a comma, or any other separator you might need. Also, keep in mind that this code uses zero-based line index so the first line of your DICTIONARY would be 0, the second would be 1, etc.
UPDATE:: If your dictionary is too big for your memory, or you just want to consume as little memory as possible (if that's the case, why would you go for Python in the first place? ;)) you can only 'extract' the words you're interested in:
N = [19, 85, 45, 14]
words = {}
word_indexes = set(N)
counter = 0
with open("DICTIONARY", "r") as f:
    for line in f:
        if counter in word_indexes:
            words[counter] = line.strip()
        counter += 1
my_sentence = " ".join([words[i] for i in N])
You can use linecache.getline to get the specific line numbers you want (note that linecache counts lines from 1):
import linecache
N = [19, 85, 45, 14]
sentence = []
for line_number in N:
    word = linecache.getline('DICTIONARY', line_number)
    sentence.append(word.strip('\n'))
sentence = " ".join(sentence)
Here's a simple one with a more basic approach:
n = ['2','4','7','11']
file = open("DICTIONARY")
counter = 1                 # 1 if you're gonna count lines in DICTIONARY
                            # from 1, else 0 is used
output = ""
for line in file:
    line = line.rstrip()    # rstrip() method to delete \n character,
                            # if not used, print ends with every
                            # word from a new line
    if str(counter) in n:
        output += line + " "
    counter += 1
print output[:-1]           # slicing is used for a white space deletion
                            # after last word in string (optional)
I'm trying to find the dinucleotide counts and frequencies from a sequence in a text file, but my code is only outputting single-nucleotide counts.
e = "ecoli.txt"
ecnt = {}
with open(e) as seq:
    for line in seq:
        for word in line.split():
            for i in range(len(seqr)):
                dinuc = (seqr[i] + seqr[i:i+2])
            for dinuc in seqr:
                if dinuc in ecnt:
                    ecnt[dinuc] += 1
                else:
                    ecnt[dinuc] = 1
for x,y in ecnt.items():
    print(x, y)
Sample input: "AAATTTCGTCGTTGCCC"
Sample output:
AA:2
TT:3
TC:2
CG:2
GT:2
GC:1
CC:2
Right now, I'm only getting single nucleotides for my output:
C 83550600
A 60342100
T 88192300
G 92834000
For nucleotides that repeat, e.g. "AAA", the count has to include all overlapping occurrences of consecutive 'AA', so the output should be 2 rather than 1. It doesn't matter what order the dinucleotides are listed in; I just need all combinations, and for the code to return the correct count for the repeated nucleotides. I asked my TA and she said that my only problem was getting my 'for' loop to add the dinucleotides to my dictionary, and I think my range may or may not be wrong. The file is a really big one, so the sequence is split up into lines.
Thank you so much in advance!!!
I took a look at your code and found several things that you might want to take a look at.
For testing my solution, since I did not have ecoli.txt, I generated one of my own with random nucleotides with the following function:
import random
def write_random_sequence():
    out_file = open("ecoli.txt", "w")
    num_nts = 500
    nts_per_line = 80
    nts = []
    for i in range(num_nts):
        nt = random.choice(["A", "T", "C", "G"])
        nts.append(nt)
    lines = [nts[i:i+nts_per_line] for i in range(0, len(nts), nts_per_line)]
    for line in lines:
        out_file.write("".join(line) + "\n")
    out_file.close()
write_random_sequence()
Notice that this file has a single sequence of 500 nucleotides separated into lines of 80 nucleotides each. In order to count dinucleotides where you have the first nucleotide at the end of one line and the second nucleotide at the start of the next line, we need to merge all of these separate lines into a single string, without spaces. Let's do that first:
seq = ""
with open("ecoli.txt", "r") as seq_data:
    for line in seq_data:
        seq += line.strip()
Try printing out "seq" and notice that it should be one giant string containing all of the nucleotides. Next, we need to find the dinucleotides in the sequence string. We can do this using slicing, which I see you tried. So for each position in the string, we look at both the current nucleotide and the one after it.
for i in range(len(seq)-1):  # note the -1
    dinuc = seq[i:i+2]
We can then count the dinucleotides and store them in a dictionary "ecnt", very much like you had. The final code looks like this:
ecnt = {}
seq = ""
with open("ecoli.txt", "r") as seq_data:
    for line in seq_data:
        seq += line.strip()
for i in range(len(seq)-1):
    dinuc = seq[i:i+2]
    if dinuc in ecnt:
        ecnt[dinuc] += 1
    else:
        ecnt[dinuc] = 1
print(ecnt)
A perfect opportunity to use a defaultdict:
from collections import defaultdict
file_name = "ecoli.txt"
dinucleotide_counts = defaultdict(int)
sequence = ""
with open(file_name) as file:
    for line in file:
        sequence += line.strip()
for i in range(len(sequence) - 1):
    dinucleotide_counts[sequence[i:i + 2]] += 1
for key, value in sorted(dinucleotide_counts.items()):
    print(key, value)
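The question also asks for frequencies, which the answers above stop short of. A minimal sketch, assuming "frequency" means each dinucleotide's count divided by the total number of overlapping dinucleotides, could look like this. The function name dinucleotide_frequencies is hypothetical, and collections.Counter is swapped in for defaultdict purely for brevity.

```python
from collections import Counter


def dinucleotide_frequencies(sequence):
    """Return (counts, frequencies) for overlapping dinucleotides in sequence."""
    counts = Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))
    total = sum(counts.values())        # equals len(sequence) - 1 for sequences of length >= 2
    frequencies = {dinuc: n / total for dinuc, n in counts.items()}
    return counts, frequencies


# The sample input from the question.
counts, freqs = dinucleotide_frequencies("AAATTTCGTCGTTGCCC")
for dinuc in sorted(counts):
    print(dinuc, counts[dinuc], round(freqs[dinuc], 3))
```

For the real file, build the sequence string by joining the stripped lines as shown earlier and pass it to the function.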
I am trying to count the commas between entries in a text file so I can use the number of commas to find the number of entries to come up with the average. Unfortunately it comes up with commacount of zero.
file = open("inputs.txt", "r")
line = file.read()
commaCount = 0
for line in file:
    for char in line:
        if char == ',':
            commaCount+=1
commacount2 = (multiply(commaCount,2))
total = sum(int(num) for num in line.strip(',').split(','))
print(commaCount)
print(commacount2)
print("Your average for all inputs is" + str(divide(total,commacount2)))
You have already consumed the file iterator with line = file.read() so you are not iterating over anything. You should forget read and iterate over the file object itself:
with open("inputs.txt", "r") as f:
    count = sum(line.count(",") for line in f)
    # f.seek(0)
    # use the lines again
If you want to get the pointer back to the start to iterate again you could f.seek(0) but I am not sure what the total = sum(int(num) for num in line.strip(',').split(',')) is doing.
Once you call .read or .readlines you have moved the pointer to the end of the file, so unless you f.seek(0) you cannot iterate over all the lines again. You are basically doing:
In [8]: iterator = iter((1,2,3))
In [9]: list(iterator) # consume
Out[9]: [1, 2, 3]
In [10]: list(iterator) # empty
Out[10]: []
In [11]: list(iterator).count(1)
Out[11]: 0
If you have a comma-separated file of integers, you can use the csv module. The length of each row gives you the number of elements, and you can map the strings to ints and sum all the row values:
import csv
with open("inputs.txt") as f:
    r = csv.reader(f) # create rows split on commas
    sm = 0
    com_count = 0
    for row in r:
        com_count += len(row) # "1,2,3"
        sm += sum(map(int,row))
It would actually be com_count += len(row) - 1 to match the comma count, but if you want the number of elements then counting the commas is not the correct approach: "1,2,3".count(",") == 2, but there are three elements.
This should help you get started. It gives you the number of commas in a text file; if you wrap it in a loop, you can use it for all the files you have.
with open('inputs.txt', 'r') as f:
    numCommas = f.read().count(',')
print(numCommas)
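Since the stated goal is an average, it may be simpler to count the entries themselves rather than the commas. The following is a sketch only: the function name average_of_entries is made up, and it assumes every non-empty field is an integer.

```python
def average_of_entries(lines):
    """Average all comma-separated integers found in an iterable of lines."""
    total = 0
    entries = 0
    for line in lines:
        for field in line.split(','):
            field = field.strip()
            if field:                   # ignore empty fields such as those left by a trailing comma
                total += int(field)
                entries += 1
    if entries == 0:
        raise ValueError('no entries found')
    return total / entries


# Stand-in for the file contents; an open file object can be passed instead.
print(average_of_entries(["1,2,3\n", "4, 5,\n"]))
```

Counting entries directly avoids the off-by-one between commas and elements that the csv answer points out.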