I am writing a program that is supposed to return the minimum sequence alignment score (smaller = better). It worked with the Coursera sample inputs, but for the dataset we're given I can't type the sequences in manually, so I have to read them from a text file. There are a few things I found weird.
First things first,
pattern = 'AAA'
DNA = open('practice_data.txt')
empty = []
for lines in DNA:
    line = lines.strip().strip('\n')
    empty.append(line)
print(empty)
print(smallest_distance(pattern, DNA))
If I run this, my program outputs 0. If I comment out the for loop, it outputs 2. I didn't change DNA, so why does my program behave differently? Also, my strip('\n') seems to work (and for some reason strip('n') works just as well), but my strip() does not. Once I figure this out, I can test empty in my smallest_distance function.
Here is what my data looks like:
ACTAG
CTTAGTATCACTCTGAAAAGAGATTCCGTATCGATGACCGCCAGTTAATACGTGCGAGAAGTGGACACGGCCGCCGACGGCTTCTACACGCTATTACGATG AACCAACAATTGCTCGAATCCTTCCTCAAAATCGCACACGTCTCTCTGGTCGTAGCACGGATCGGCGACCCACGCGTGACAGCCATCACCTATGATTGCCG
TTAAGGTACTGCTTCATTGATCAACACCCCTCAGCCGGCAATCACTCTGGGTGCGGGCTGGGTTTACAGGGGTATACGGAAACCGCTGCTTGCCCAATAAT
etc...
Solution:
pattern = 'AAA'
with open('practice_data.txt') as f_dna:
    dna_list = [sequence for line in f_dna for sequence in line.split()]
print(smallest_distance(pattern, dna_list))
Explanation:
You were close to the solution, but you needed to replace strip() with split().
-> strip() removes the extra characters, so your strip('\n') was a good guess.
But since \n is at the end of the line, split() will automatically get rid of it, because it counts as a delimiter:
e.g.
>>> 'test\ntest'.split()
['test', 'test']
>>> 'test\n'.split()
['test']
Now you have to replace .append() with a simple list concatenation, since split() returns a list.
DNA = open('practice_data.txt')
empty = []
for lines in DNA:
    line = lines.split()
    empty += line
But there are still some problems in your code:
It is better to use a with statement when opening a file, because it automatically handles exceptions and closes the file at the end:
empty = []
with open('practice_data.txt') as DNA:
    for lines in DNA:
        line = lines.split()
        empty += line
Your code is now fine. You can still refactor it using a list comprehension (very common in Python):
with open('practice_data.txt') as DNA:
    empty = [sequence for line in DNA for sequence in line.split()]
If you struggle to understand this, try decomposing it back into for loops:
empty = []
with open('practice_data.txt') as DNA:
    for line in DNA:
        for sequence in line.split():
            empty.append(sequence)
Note: @MrGeek's solution works, but it has two major defects:
since it does not use a with statement, the file is never explicitly closed, which leaks the file descriptor;
using .read().splitlines() loads ALL the content of the file into memory, which could lead to a MemoryError if the file is too big.
Go further: handle a huge file
Now imagine that you have a 1 GB file filled with DNA sequences. Even if you don't load the whole file into memory, you still build a huge list; a better practice is to write the results to another file and process your DNA on the fly:
e.g.
pattern = 'AAA'
with open('practice_data.txt') as f_dna, open('result.txt', 'w') as f_result:
    for line in f_dna:
        for sequence in line.split():
            result = smallest_distance(pattern, sequence)
            f_result.write(str(result) + '\n')  # write() needs a string, so convert the score
Warning: you will have to make sure your function smallest_distance accepts a string rather than a list.
If that's not possible, you may need to process in batches instead, but since that is a little more complicated I will not cover it here.
Now you can refactor a bit, for example using a generator function to improve readability:
def extract_sequence(file, pattern):
    for line in file:
        for sequence in line.split():
            yield smallest_distance(pattern, sequence)

pattern = 'AAA'
with open('practice_data.txt') as f_dna, open('result.txt', 'w') as f_result:
    for result in extract_sequence(f_dna, pattern):
        f_result.write(str(result) + '\n')  # write() needs a string
Potential errors:
print(smallest_distance(pattern, DNA))
DNA is a file object, not a list of strings, because DNA = open('practice_data.txt').
The for loop consumes DNA. So if you iterate over it again with for lines in DNA: inside smallest_distance, it won't produce anything.
Update:
In this case, the for loop goes from the beginning of the file to the end. It does not start over like iterating a list would, unless you rewind the file with DNA.seek(0), or close it with DNA.close() and re-open it with DNA = open('practice_data.txt').
A simple example you can try:
DNA = open('text.txt')
for lines in DNA:
    line = lines.strip().strip('\n')
    print(line)  # prints everything in the file here
print('try again')
for lines in DNA:
    line = lines.strip().strip('\n')
    print(line)  # will not print anything at all
print('done')
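As a side note, you don't have to close and re-open the file: calling seek(0) on the same file object rewinds it so a second loop sees the data again. A minimal sketch using the same text.txt:
DNA = open('text.txt')
for lines in DNA:
    print(lines.strip())   # first pass prints every line

DNA.seek(0)                # rewind to the beginning of the file
for lines in DNA:
    print(lines.strip())   # second pass prints them again
DNA.close()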
Read "For loop not working twice on the same file descriptor" for more detail.
Write:
pattern = 'AAA'
DNA = open('practice_data.txt').read().splitlines()
newDNA = []
for line in DNA:
    newDNA += line.split()  # split the line into a list of strings, then concatenate it with the newDNA list
print(smallest_distance(pattern, newDNA))
Related
I have an array containing strings.
I have a text file.
I want to loop through the text file line by line and check whether each element of my array is present or not (they must be whole words, not substrings).
I am stuck because my script only checks for the presence of the first array element.
However, I would like it to return results with each array element and a note as to whether this array element is present in the entire file or not.
#!/usr/bin/python
with open("/home/all_genera.txt") as file:
    generaA = []
    for line in file:
        line = line.strip('\n')
        generaA.append(line)

with open("/home/config/config2.cnf") as config_file:
    counter = 0
    for line in config_file:
        line = line.strip('\n')
        for part in line.split():
            if generaA[counter] in part:
                print(generaA[counter], "is -----> PRESENT")
            else:
                continue
    counter += 1
If I understand correctly, you want a sequence of words that are in both files. If yes, set is your friend:
def parse(f):
    return set(word for line in f for word in line.strip().split())

with open("path/to/genera/file") as f:
    source = parse(f)

with open("path/to/conf/file") as f:
    conf = parse(f)

# elements that are common to both sets
common = conf & source
print(common)

# elements that are in `source` but not in `conf`
print(source - conf)

# elements that are in `conf` but not in `source`
print(conf - source)
So to answer "I would like it to return results with each array element and a note as to whether this array element is present in the entire file or not", you can use either common elements or the source - conf difference to annotate your source list:
# using common elements
common = conf & source
result = [(word, word in common) for word in source]
print(result)
# using difference
diff = source - conf
result = [(word, word not in diff) for word in source]
Both will yield the same result, and since set lookup is O(1) performance should be similar too, so I suggest the first solution (positive assertions are easier on the brain than negative ones).
You can of course apply further cleaning / normalisation when building the sets, e.g. if you want a case-insensitive search:
def parse(f):
    return set(word.lower() for line in f for word in line.strip().split())
from collections import Counter
import re

# first normalize the text (lowercase everything and remove punctuation, i.e. anything not alphanumeric);
# keep whitespace so words on different lines don't get joined together
normalized_text = re.sub(r"[^a-z0-9\s]", "", open("some.txt", "r").read().lower())
# note that this normalization is subject to the rules of the language/alphabet/dialect you are using,
# and English ASCII may not cover it

# Counter will collect all the words into a dictionary of [word]: count
words = Counter(normalized_text.split())

# create a new set of all the words that are in both the text and our word_list_array
set(my_word_list_array).intersection(words.keys())
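To get back to the question's goal (a note for each array element saying whether it is present in the file), you can then test each word against that intersection. A minimal sketch, assuming my_word_list_array holds the genera read from the first file:
# the text was lowercased during normalization, so compare lowercased words
present = {w.lower() for w in my_word_list_array} & set(words.keys())

for word in my_word_list_array:
    status = "PRESENT" if word.lower() in present else "ABSENT"
    print(word, "is ----->", status)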
The counter is not increasing because it's outside the for loops.
with open("/home/all_genera.txt") as myfile: # don't use 'file' as variable, is a reserved word! use myfile instead
generaA=[]
for line in myfile: # use .readlines() if you want a list of lines!
generaA.append(line)
# if you just need to know if string are present in your file, you can use .read():
with open("/home/config/config2.cnf") as config_file:
mytext = config_file.read()
for mystring in generaA:
if mystring in mytext:
print mystring, "is -----> PRESENT"
# if you want to check if your string in line N is present in your file in the same line, you can go with:
with open("/home/config/config2.cnf") as config_file:
for N, line in enumerate(config):
if generaA[N] in line:
print "{0} is -----> PRESENT in line {1}".format(generaA[N], N)
I hope that everything is clear.
This code could be improved in many ways, but I tried to keep it as close to yours as possible so it would be easier to understand.
I have the following text file:
abstract 233:1 253:1 329:2 1087:2 1272:1
game 64:1 99:1 206:1 595:1
direct 50:1 69:1 1100:1 1765:1 2147:1 3160:1
Each key pair shows how many times the string appears in a document, as [docID]:[stringFq].
How could you calculate the number of key pairs in this text file?
Your regex approach works fine. Here is an iterative approach. If you uncomment the print statements you will see some intermediate results.
Given
%%file foo.txt
abstract 233:1 253:1 329:2 1087:2 1272:1
game 64:1 99:1 206:1 595:1
direct 50:1 69:1 1100:1 1765:1 2147:1 3160:1
Code
import itertools as it
with open("foo.txt") as f:
lines = f.readlines()
#print(lines)
pred = lambda x: x.isalpha()
count = 0
for line in lines:
line = line.strip("\n")
line = "".join(it.dropwhile(pred, line))
pairs = line.strip().split(" ")
#print(pairs)
count += len(pairs)
count
# 15
Details
First we use a with statement, which is an idiom for safely opening and closing files. We then split the file into lines via readlines(). We define a conditional function (or predicate) that we will use later. The lambda expression is used for convenience and is equivalent to the following function:
def pred(x):
    return x.isalpha()
We initialize a count variable and start iterating each line. Every line may have a trailing newline character \n, so we first strip() them away before feeding the line to dropwhile.
dropwhile is a special itertools iterator. As it iterates a line, it will discard any leading characters that satisfy the predicate until it reaches the first character that fails the predicate. In other words, all letters at the start will be dropped until the first non-letter is found (which happens to be a space). We clean the new line again, stripping the leading space, and the remaining string is split() into a list of pairs.
Finally the length of each line of pairs is incrementally added to count. The final count is the sum of all lengths of pairs.
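To make the dropwhile step concrete, here is what it does to the first line of foo.txt (an interactive check; the values follow from the file shown in the Given section):
>>> import itertools as it
>>> line = "abstract 233:1 253:1 329:2 1087:2 1272:1"
>>> "".join(it.dropwhile(str.isalpha, line))
' 233:1 253:1 329:2 1087:2 1272:1'
>>> "".join(it.dropwhile(str.isalpha, line)).strip().split(" ")
['233:1', '253:1', '329:2', '1087:2', '1272:1']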
Summary
The code above shows how to tackle basic file handling with simple, iterative steps:
open the file
split the file into lines
while iterating each line, clean and process data
output a result
import re

file = open('input.txt', 'r')
file = file.read()

numbers = re.findall(r"[-+]?\d*\.\d+|\d+", file)
# finds all the numbers in the text file

numLen = len(numbers) // 2
# each key pair contains two numbers, so divide the total count by 2 to get the number of pairs
print(numLen)
I'm trying to copy one file to another with each line sorted in ASCII order, but it's giving me some bugs. For example, on the first line it adds a \n for no reason; I'm trying to understand it but I don't get it. Also, if you think this is not a good approach, please advise me on how to do it better, thanks.
demo.txt (An ascii file)
!=orIh^
-_hIdH2 !=orIh^
-_hIdH2
code.py
count = 0
try:
    fcopy = open("demo.txt", 'r')
    fdestination = open("demo2.txt", 'w')
    for line in fcopy.readlines():
        count = len(line) - 1
        list1 = ''.join(sorted(line))
        str1 = ''.join(str(e) for e in list1)
        fdestination.write(str(count) + str1)
    fcopy.close()
    fdestination.close()
except Exception as e:
    print(str(e))
Note: count is the number of letters on each line.
Output
7
!=I^hor15
!-2=HII^_dhhor6-2HI_dh
The problem is that each output line should be the number of letters followed by the line's characters in ASCII order.
Each line you read has a newline character at the end. When you sort all the characters, the newline character is sorted too and moved to the appropriate position (which is, in general, no longer at the end of the string). This causes line breaks to appear at almost random places.
What you need is to remove the line break before sorting and add it back after sorting. Also, the second join in your loop is not doing anything, and list1 is not a list but a string.
str1 = ''.join(sorted(line.strip('\n')))
fdestination.write(str(count)+str1+'\n')
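Putting that together, a minimal sketch of the corrected loop (same file names as in the question):
with open("demo.txt") as fcopy, open("demo2.txt", "w") as fdestination:
    for line in fcopy:
        line = line.strip('\n')               # drop the newline before sorting
        count = len(line)
        str1 = ''.join(sorted(line))          # sort the remaining characters
        fdestination.write(str(count) + str1 + '\n')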
I have multiple files, each with a line with, say ~10M numbers each. I want to check each file and print a 0 for each file that has numbers repeated and 1 for each that doesn't.
I am using a list for counting frequency. Because of the large number of values per line, I want to update the frequency after reading each number and break as soon as I find a repeat. While this is simple in C, I have no idea how to do it in Python.
How do I input a line in a word-by-word manner without storing (or taking as input) the whole line?
EDIT: I also need a way for doing this from live input rather than a file.
Read the line, split the line, and copy the resulting list into a set. If the size of the set is less than the size of the list, the file contains repeated elements.
with open('filename', 'r') as f:
    for line in f:
        # Here is where you do what I said above
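As a minimal sketch of that idea, printing 1 for a file with no repeats and 0 otherwise (the loop over file names and the names themselves are placeholders):
def has_no_repeats(filename):
    """Return 1 if the file's numbers are all distinct, 0 otherwise."""
    with open(filename, 'r') as f:
        for line in f:
            numbers = line.split()
            if len(set(numbers)) < len(numbers):
                return 0
    return 1

for name in ['file1.txt', 'file2.txt']:  # placeholder file names
    print(has_no_repeats(name))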
To read the file word by word, try this
import itertools

def readWords(file_object):
    word = ""
    # read one character at a time; map() is lazy in Python 3 (use itertools.imap on Python 2)
    for ch in itertools.takewhile(lambda c: bool(c), map(file_object.read, itertools.repeat(1))):
        if ch.isspace():
            if word:  # in case of multiple spaces
                yield word
                word = ""
            continue
        word += ch
    if word:
        yield word  # handles the last word before EOF
Then you can do:
with open('filename', 'r') as f:
    for num in map(int, readWords(f)):
        # store the numbers in a set, and use the set to check if the number already exists
This method should also work for streams because it only reads one character at a time and yields space-delimited words from the input stream.
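For the live-input case mentioned in the question's edit, the same generator can be pointed at sys.stdin instead of a file; this is just a sketch of that idea:
import sys

seen = set()
repeated = False
for num in map(int, readWords(sys.stdin)):  # readWords as defined above
    if num in seen:
        repeated = True
        break
    seen.add(num)
print(0 if repeated else 1)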
After giving this answer, I've updated this method quite a bit. Have a look: https://gist.github.com/smac89/bddb27d975c59a5f053256c893630cdc
The way you are asking it is not possible, I guess. You can't read word by word as such in Python. Something like this can be done:
f = open('words.txt')
for word in f.read().split():
    print(word)
I am new to Python, so please bear with me.
I can't get this little script to work properly:
genome = open('refT.txt','r')
datafile - a reference genome with a bunch (2 million) of contigs:
Contig_01
TGCAGGTAAAAAACTGTCACCTGCTGGT
Contig_02
TGCAGGTCTTCCCACTTTATGATCCCTTA
Contig_03
TGCAGTGTGTCACTGGCCAAGCCCAGCGC
Contig_04
TGCAGTGAGCAGACCCCAAAGGGAACCAT
Contig_05
TGCAGTAAGGGTAAGATTTGCTTGACCTA
The file is opened:
cont_list = open('dataT.txt','r')
a list of contigs that I want to extract from the dataset listed above:
Contig_01
Contig_02
Contig_03
Contig_05
My hopeless script:
for line in cont_list:
    if genome.readline() not in line:
        continue
    else:
        a = genome.readline()
        s = line + a
        data_out = open('output.txt', 'a')
        data_out.write("%s" % s)
        data_out.close()
input('Press ENTER to exit')
The script successfully writes the first three contigs to the output file, but for some reason it doesn't seem able to skip "Contig_04", which is not in the list, and move on to "Contig_05".
I might seem a lazy bastard for posting this, but I've spent all afternoon on this tiny bit of code -_-
I would first try to generate an iterable which gives you a tuple: (contig, genome):
def pair(file_obj):
    for line in file_obj:
        yield line, next(file_obj)
Now, I would use that to get the desired elements:
wanted = {'Contig_01', 'Contig_02', 'Contig_03', 'Contig_05'}

with open('filename') as fin:
    pairs = pair(fin)
    while wanted:
        p = next(pairs)
        if p[0].strip() in wanted:  # strip the newline before comparing
            # write to output file, store in a list, or dict, ...
            wanted.discard(p[0].strip())
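As a minimal end-to-end sketch of that idea, writing the matched pairs to an output file (refT.txt and output.txt are the names used in the question; the rest is illustrative):
wanted = {'Contig_01', 'Contig_02', 'Contig_03', 'Contig_05'}

with open('refT.txt') as fin, open('output.txt', 'w') as fout:
    for name, seq in pair(fin):
        if name.strip() in wanted:
            fout.write(name + seq)        # name and seq still carry their newlines
            wanted.discard(name.strip())
        if not wanted:                    # stop early once everything is found
            break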
I would recommend several things:
Try using with open(filename, 'r') as f instead of f = open(...)/f.close(). with will handle the closing for you. It also encourages you to handle all of your file IO in one place.
Try to read in all the contigs you want into a list or other structure. It is a pain to have many files open at once. Read all the lines at once and store them.
Here's some example code that might do what you're looking for
from itertools import zip_longest  # izip_longest on Python 2

# Read in contigs from file and store in a list
contigs = []
with open('dataT.txt', 'r') as contigfile:
    for line in contigfile:
        contigs.append(line.rstrip())  # rstrip() removes '\n' from the end of the line

# Read through genome file, open up an output file
with open('refT.txt', 'r') as genomefile, open('out.txt', 'w') as outfile:
    # Nifty way to step through fasta-like files 2 lines at a time
    for name, seq in zip_longest(*[genomefile]*2):
        # compare the contig name to your list of contigs
        if name.rstrip() in contigs:
            outfile.write(name)  # optional. remove if you only want the seq
            outfile.write(seq)
Here's a pretty compact approach to get the sequences you'd like.
def get_sequences(data_file, valid_contigs):
    sequences = []
    with open(data_file) as cont_list:
        for line in cont_list:
            if line.startswith(valid_contigs):
                sequence = next(cont_list).strip()  # cont_list.next() on Python 2
                sequences.append(sequence)
    return sequences

if __name__ == '__main__':
    valid_contigs = ('Contig_01', 'Contig_02', 'Contig_03', 'Contig_05')
    sequences = get_sequences('dataT.txt', valid_contigs)
    print(sequences)
This utilizes the ability of startswith() to accept a tuple as a parameter and check for a match against any of its elements. If the line matches what you want (a desired contig), it grabs the next line and appends it to sequences after stripping out the unwanted whitespace characters.
From there, writing the sequences grabbed to an output file is pretty straightforward.
Example output:
['TGCAGGTAAAAAACTGTCACCTGCTGGT',
'TGCAGGTCTTCCCACTTTATGATCCCTTA',
'TGCAGTGTGTCACTGGCCAAGCCCAGCGC',
'TGCAGTAAGGGTAAGATTTGCTTGACCTA']
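And for that straightforward last step, a minimal sketch of writing the grabbed sequences to an output file (using the output.txt name from the question, and the sequences list returned by get_sequences):
with open('output.txt', 'w') as outfile:
    for seq in sequences:
        outfile.write(seq + '\n')  # one sequence per line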