How to speed up this word-tuple finding algorithm? - python

I am trying to create a simple model to predict the next word in a sentence. I have a big .txt file that contains sentences separated by '\n'. I also have a vocabulary file which lists every unique word in my .txt file along with a unique ID. I used the vocabulary file to convert the words in my corpus to their corresponding IDs. Now I want to make a simple model which reads the IDs from the txt file, finds the word pairs, and counts how many times each pair was seen in the corpus. I have managed to write the code below:
tuples = [[]] # array for word tuples to be stored in
data = []     # array for tuple frequencies to be stored in
data.append(0) # tuples array starts with an empty element at the beginning for some reason.
# Adding zero to the beginning of the frequency array levels the indexes of the two arrays
with open("markovData.txt") as f:
    contentData = f.readlines()
contentData = [x.strip() for x in contentData]
lineIndex = 0
for line in contentData:
    tmpArray = line.split() # split line to array of words
    for tmpArrayIndex in range(len(tmpArray) - 1): # do this for every word except the last one, since the last word has no word after it
        if [tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]] in tuples: # if the word pair was seen before,
            data[tuples.index([tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]])] += 1 # increment the frequency of said pair
        else:
            tuples.append([tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]]) # if the word pair was never seen before,
            data.append(1) # add the pair to the list and set its frequency to 1
    # print every 1000th line to check the progress
    lineIndex += 1
    if (lineIndex % 1000) == 0:
        print(lineIndex)
with open("markovWindowSize1.txt", 'a', encoding="utf8") as markovWindowSize1File:
    # write tuples to txt file
    for pair in tuples:
        if len(pair) > 0: # if the pair is not empty
            markovWindowSize1File.write(str(pair[0]) + "," + str(pair[1]) + " ")
    markovWindowSize1File.write("\n")
    markovWindowSize1File.write("\n")
    # blank lines between the two data sections
    # write frequencies of the tuples to txt file
    for element in data:
        markovWindowSize1File.write(str(element) + " ")
    markovWindowSize1File.write("\n")
    markovWindowSize1File.write("\n")
This code seems to work well for the first couple of thousand lines. Then things start to get slower because the tuple list keeps getting bigger, and I have to search the whole list to check whether the next word pair was seen before or not. I managed to process 50k lines in 30 minutes, but I have much bigger corpora with millions of lines. Is there a way to store and search the word pairs more efficiently? Matrices would probably work a lot faster, but my unique word count is about 300,000 words, which means I would have to create a 300k*300k matrix of integers. Even after taking advantage of symmetric matrices, it would require a lot more memory than I have.
I tried using memmap from numpy to store the matrix on disk rather than in memory, but it required about 500 GB of free disk space.
Then I studied sparse matrices and found out that I can just store the non-zero values with their corresponding row and column numbers, which is what I did in my code.
Right now, this model works but it is very bad at guessing the next word correctly (about 8% success rate). I need to train with bigger corpora to get better results. What can I do to make this word pair finding code more efficient?
Thanks.
Edit: Thanks to everyone who answered, I am now able to process my corpus of ~500k lines in about 15 seconds. I am adding the final version of the code below for people with similar problems:
import numpy as np
import time

start = time.time()
myDict = {} # empty dict
with open("markovData.txt") as f:
    contentData = f.readlines()
contentData = [x.strip() for x in contentData]
lineIndex = 0
for line in contentData:
    tmpArray = line.split() # split line to array of words
    for tmpArrayIndex in range(len(tmpArray) - 1): # do this for every word except the last one, since the last word has no word after it
        if (tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]) in myDict: # if the word pair was seen before,
            myDict[tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]] += 1 # increment the frequency of said pair
        else:
            myDict[tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]] = 1 # if the word pair was never seen before,
            # add the pair to the dict and set its frequency to 1
    # print every 1000th line to check the progress
    lineIndex += 1
    if (lineIndex % 1000) == 0:
        print(lineIndex)
end = time.time()
print(end - start)
keyText = ""
valueText = ""
for key1, key2 in myDict:
    keyText += (str(key1) + "," + str(key2) + " ")
    valueText += (str(myDict[key1, key2]) + " ")
with open("markovPairs.txt", 'a', encoding="utf8") as markovPairsFile:
    markovPairsFile.write(keyText)
with open("markovFrequency.txt", 'a', encoding="utf8") as markovFrequencyFile:
    markovFrequencyFile.write(valueText)

As I understand you, you are trying to build a Hidden Markov Model, using frequencies of n-grams (word tuples of length n). Maybe just try out a more efficiently searchable data structure, for example a nested dictionary. It could be of the form
{ID_word1:{ID_word1:x1,... ID_wordk:y1}, ...ID_wordk:{ID_word1:xn, ...ID_wordk:yn}}.
This would mean that you only have at most k**2 dictionary entries for tuples of 2 words (Google uses up to 5 for automatic translation), where k is the cardinality of V, your (finite) vocabulary. This should boost your performance, since you do not have to search a growing list of tuples. x and y represent the occurrence counts, which you should increment when encountering a tuple. (Never use the built-in function count()!)
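For illustration, a minimal sketch of that nested-dictionary idea, assuming the markovData.txt format from the question; using collections.defaultdict here is my own choice so that missing inner dictionaries are created on demand:
from collections import defaultdict

# outer key: first word ID, inner key: ID of the word that follows, value: count
bigram_counts = defaultdict(lambda: defaultdict(int))

with open("markovData.txt") as f:
    for line in f:
        ids = line.split()
        for first, second in zip(ids, ids[1:]):
            bigram_counts[first][second] += 1

# e.g. bigram_counts["42"]["7"] gives how often ID 42 was followed by ID 7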

I would also look into collections.Counter, a data structure made for your task. A Counter object is like a dictionary but counts the occurrences of a key entry. You could use this by simply incrementing a word pair as you encounter it:
from collections import Counter

word_counts = Counter()
with open("markovData.txt", "r") as f:
    for line in f:
        words = line.split()
        # iterate over adjacent word pairs and count each one
        for word1, word2 in zip(words, words[1:]):
            word_counts[(word1, word2)] += 1
Alternatively, you can construct the tuple list as you have and simply pass this into a Counter as an object to compute the frequencies at the end:
word_counts = Counter(word_tuple_list)
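Once the counts are in a Counter (or a plain dict keyed by pairs), prediction is just a lookup for the most frequent pair that starts with the current word. A rough sketch of how that might look; predict_next is an illustrative helper, not part of the original post:
def predict_next(word_counts, current_word):
    # keep only the pairs whose first element is the current word
    candidates = {pair: n for pair, n in word_counts.items() if pair[0] == current_word}
    if not candidates:
        return None
    # second element of the most frequent matching pair
    return max(candidates, key=candidates.get)[1]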

Related

Finding identical numbers in large files python

I have two data files in python, each containing two-column data as below:
3023084 5764
9152549 5812
18461998 5808
45553152 5808
74141469 5753
106932238 5830
112230478 5795
135207137 5800
148813978 5802
154818883 5798
There are about 10M entries in each file (~400 MB).
I have to sort through each file and check whether any number in the first column of one file matches any number in the first column of the other file.
The code I currently have converts the files to lists:
ch1 = []
with open('ch1.txt', 'r+') as file:
    for line in file:
        if ':' not in line:
            line = line.split()
            ch1.append([line[0], line[1]])

ch2 = []
with open('ch2.txt', 'r+') as file:
    for line in file:
        if ':' not in line:
            line = line.split()
            ch2.append([line[0], line[1]])
I then iterate through both of the lists looking for a match. When a match is found, I wish to add the sum of the right-hand columns to a new list, 'coin':
coin = []
for item1 in ch1:
    for item2 in ch2:
        if item1[0] == item2[0]:
            coin.append(int(item1[1]) + int(item2[1]))
The issue is that this takes a very long time and/or crashes. Is there a more efficient way of doing this?
There are lots of ways to improve this; for example:
Since you only scan through the contents of ch1.txt once, you don't need to read it into a list; that will use less memory, but probably won't speed things up all that much.
If you sort each of your lists, you can check for matches much more efficiently. Something like:
i1, i2 = 0, 0
while i1 < len(ch1) and i2 < len(ch2):
    if ch1[i1][0] == ch2[i2][0]:
        # Do what you do for matches
        ...
        # Advance both indices
        i1 += 1
        i2 += 1
    elif ch1[i1][0] < ch2[i2][0]:
        # Advance index of the smaller value
        i1 += 1
    else:  # ch1[i1][0] > ch2[i2][0]
        i2 += 1
If the data in the files are already sorted, you can combine both ideas: instead of advancing an index, you simply read in the next line of the corresponding file. This should improve efficiency in time and space.
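A hedged sketch of that last idea, assuming both files are already sorted numerically by their first column and use the same two-column format as in the question; the helper read_pair is illustrative:
def read_pair(f):
    # return (number, value) from the next data line, or None at end of file
    for line in f:
        if ':' not in line:
            parts = line.split()
            return int(parts[0]), int(parts[1])
    return None

coin = []
with open('ch1.txt') as f1, open('ch2.txt') as f2:
    p1, p2 = read_pair(f1), read_pair(f2)
    while p1 is not None and p2 is not None:
        if p1[0] == p2[0]:
            coin.append(p1[1] + p2[1])
            p1, p2 = read_pair(f1), read_pair(f2)
        elif p1[0] < p2[0]:
            p1 = read_pair(f1)
        else:
            p2 = read_pair(f2)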
A few ideas to improve this:
store your data in dictionaries so that the first column is the key and the second column is the value,
a match is a key that lies in the intersection of the keys of the two dictionaries
Code example:
# store your data in dicts as follows
ch1_dict[line[0]] = line[1]
ch2_dict[line[0]] = line[1]
#this is what you want to achieve
coin = [int(ch1_dict[key]) + int(ch2_dict[key]) for key in ch1_dict.keys() & ch2_dict.keys()]
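Putting both ideas together, a minimal end-to-end sketch under the same assumptions about the file format; load_dict is an illustrative helper, not part of the original answer:
def load_dict(path):
    # read "number value" lines into a dict keyed by the first column
    d = {}
    with open(path) as f:
        for line in f:
            if ':' not in line:
                key, value = line.split()
                d[key] = value
    return d

ch1_dict = load_dict('ch1.txt')
ch2_dict = load_dict('ch2.txt')
coin = [int(ch1_dict[k]) + int(ch2_dict[k]) for k in ch1_dict.keys() & ch2_dict.keys()]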

String searching in text file and dict values combinations

I'm a total beginner to Python; I'm studying it at university and the professor gave us some work to do before the exam. I've been stuck on this program for almost 2 weeks now, and the rule is that we can't use any library.
Basically I have a dictionary with several possible translations from an ancient language to English, a dictionary from English to Italian (only 1 key - 1 value pairs), a text file in the ancient language and another text file in Italian. So far what I've done is scan the ancient-language file and search for corresponding strings in the dictionary (using the .strip(".,:;?!") method); I then saved the corresponding strings that contain at least 2 words in a list of strings.
Now comes the hard part: I need to try all possible combinations of translations (values from the ancient language to English), then map those translations from English to Italian with the other dictionary, and check whether the resulting string exists in the Italian file. If it does, I save the result and the paragraph where it was found (a result spread across different paragraphs doesn't count; it must be in the same one, and I've made a small piece of code to count the paragraphs).
I'm having issues here for the following reasons:
In the strings I've found, how am I supposed to replace the words while keeping the punctuation? The returned result must contain all the punctuation, otherwise the output will be wrong.
If the string is present but split across 2 different lines of the text, how should I proceed to make it work? For example, with a string of 5 words, I might find the first 2 words at the end of a line while the remaining 3 words are the first 3 words of the next line.
As mentioned before, the dict from the ancient language to English is huge and can have up to 7 values (translations) for each key (ancient-language word). Is there any efficient way to try all the combinations while checking whether the string exists in a text file? This is probably the hardest part.
Probably the best way to process this is to scan word by word every time and, in case the sequence is broken, reset it somehow and keep scanning the text file...
Any idea?
Here is the commented code of what I've managed to do so far:
k = 2 # Random value; the whole program will become a function and the "k" value will be different each time
file = [ line.strip().split(';') for line in open('lexicon-GR-EN.csv', encoding="utf8").readlines() ] # Opening CSV file with possible translations from ancient Greek to English
gr_en = { words[0]: tuple(words[1:]) for words in file } # Creating a dictionary with the several translations (values)
file = open('lexicon-EN-IT.csv', encoding="utf8") # Opening 2nd CSV file
en_it = {} # Initializing dictionary
for row in file: # Scanning each row of the CSV file (from English to Italian)
    L = row.rstrip("\n").split(';') # Clearing newline char and splitting the words
    x = L[0]
    t1 = L[1]
    en_it[x] = t1 # Since in this CSV file all the words are 1 - 1, no length check is necessary (len(L) is always 2, basically)
file = open('odyssey.txt', encoding="utf8") # Opening text file
result = () # Empty tuple
spacechecker = 0 # This variable determines whether I'm on an even or odd line; if odd, the line will be scanned normally, otherwise word order and words will be reversed
wordcount = 0 # Counter of how many words have been found
paragraph = 0 # Paragraph counter, starts at 0
paragraphspace = 0 # Another paragraph variable; I need this to prevent a double space from counting as a paragraph
string = "" # Empty string to store corresponding sequences
foundwords = [] # Empty list to store words that have been found
completed_sequences = [] # Empty list; all completed sequences of words will be stored here
completed_paragraphs = [] # Paragraph counter; this shows in which paragraph each sequence of completed_sequences has been found
for index, line in enumerate(file.readlines()): # Starting line-by-line scan of the txt file
    words = line.split() # Splitting words
    if not line.isspace() and index == 0: # Since I don't know anything about the "secret tests" that will be conducted with this program, I've set this check for the start of the first paragraph to prevent errors: if the first line is not space
        paragraph += 1 # Add +1 to paragraph counter
        spacechecker += 1 # Add +1 to spacechecker
    elif not line.isspace() and paragraphspace == 1: # Checking if the previous line was space and the current one is not
        paragraphspace = 0 # Resetting paragraphspace (previous line was space) value
        spacechecker += 1 # Increasing the spacechecker +1
        paragraph += 1 # This means we're on a new paragraph, so +1 to paragraph
    elif line.isspace() and paragraphspace == 1: # Checking if the current line is space and the previous line was space too
        continue # Do nothing and cycle again
    elif line.isspace(): # Checking if the current line is space
        paragraphspace += 1 # Increase paragraphspace (previous line was space variable) +1
        continue
    else:
        spacechecker += 1 # In any other case increase spacechecker +1
    if spacechecker % 2 == 1: # Check if spacechecker is odd
        for i in range(len(words)): # If yes, scan the words in normal order
            if words[i].strip(",.!?:;-") in gr_en != "[unavailable]": # If words[i] without any special char is in the dictionary
                currword = words[i] # If yes, we will call it "currword"
                foundwords.append(currword) # Add currword to the foundwords list
                wordcount += 1 # Increase wordcount +1
            elif (words[i].strip(",.!?:;-") in gr_en == "[unavailable]" and wordcount >= k) or (currword not in gr_en and wordcount >= k): # Elif check if it's not in the dictionary but wordcount has gone over k
                string = " ".join(foundwords) # We will put the foundwords list in a string
                completed_sequences.append(string) # And add this string to the list of strings of completed_sequences
                completed_paragraphs.append(paragraph) # Then add the paragraph of that string to the list of completed_paragraphs
                result = list(zip(completed_sequences, completed_paragraphs)) # This is the output format required: a tuple with the string and the paragraph of that string
                wordcount = 0
                foundwords.clear() # Clearing the foundwords list
            else: # If none of the above happened (word is not in dictionary and wordcount still isn't >= k)
                wordcount = 0 # Reset wordcount to 0
                foundwords.clear() # Clear foundwords list
                continue # Do nothing and cycle again
    else: # The case of spacechecker not being odd
        words = words[::-1] # Reverse the word order
        for i in range(len(words)): # Scanning the row of words
            currword = words[i][::-1] # currword in this case will be reversed, since the words in even lines are written in reverse
            if currword.strip(",.!?:;-") in gr_en != "[unavailable]": # If currword without any special char is in the dictionary
                foundwords.append(currword) # Append it to the foundwords list
                wordcount += 1 # Increase wordcount +1
            elif (currword.strip(",.!?:;-") in gr_en == "[unavailable]" and wordcount >= k) or (currword.strip(",.!?:;-") not in gr_en and wordcount >= k): # Elif check if it's not in the dictionary but wordcount has gone over k
                string = " ".join(foundwords) # Add the words that have been found to the string
                completed_sequences.append(string) # Append the string to the completed_sequences list
                completed_paragraphs.append(paragraph) # Append the paragraph of the strings to the completed_paragraphs list
                result = list(zip(completed_sequences, completed_paragraphs)) # Adding to the result the tuple combination of strings and corresponding paragraphs
                wordcount = 0 # Reset wordcount
                foundwords.clear() # Clear foundwords list
            else: # In case none of the above happened
                wordcount = 0 # Reset wordcount to 0
                foundwords.clear() # Clear foundwords list
                continue # Do nothing and cycle again
I'd probably take the following approach to solving this:
Try to collapse down the 2 word dictionaries into one (ancient_italian below), removing English from the equation. For example, if ancient->English has {"canus": ["dog","puppy", "wolf"]} and English->Italian has {"dog":"cane"} then you can create a new dictionary {"canus": "cane"}. (Of course if the English->Italian dict has all 3 English words, you need to either pick one, or display something like cane|cucciolo|lupo in the output).
Come up with a regular expression that can distinguish between words and the separators (punctuation), and output them in order into a list (word_list below), i.e. something like ['ecce', '!', ' ', 'magnus', ' ', 'canus', ' ', 'esurit', '.']
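A minimal sketch of one such regular expression, using re.findall to keep words and separators as alternating tokens; the pattern and the sample sentence are illustrative, not from the original post:
import re

# each match is either a run of word characters or a single separator character
tokenize = re.compile(r"\w+|\W")

word_list = tokenize.findall("ecce! magnus canus esurit.")
# ['ecce', '!', ' ', 'magnus', ' ', 'canus', ' ', 'esurit', '.']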
Step through this list, generating a new list. Something like:
translation = []
for item in word_list:
    if item.isalpha():
        # It's a word - translate it and add to the list
        translation.append(ancient_italian[item])
    else:
        # It's a separator - add to the list as-is
        translation.append(item)
Finally join the list back together: ''.join(translation)
I'm unable to reply to your comment on the answer by match, but this may help:
For one, it's not the most elegant approach, but it should work:
GR_IT = {}
for greek, eng in GR_EN.items():
    for word in eng:
        try:
            GR_IT[greek] = EN_IT[word]
        except KeyError:
            pass
If there's no translation for a word it will be ignored, though.
To get a list of words and punctuation split apart, try this:
def repl_punc(s):
    punct = ['.', ',', ':', ';', '?', '!']
    for p in punct:
        s = s.replace(p, ' ' + p + ' ')
    return s

repl_punc(s).split()

How to parse letter by letter and make a list with Python?

I have a text file I am attempting to parse. Fairly new to Python.
It contains an ID, a sequence, and frequency
SA1 GDNNN 12
SA2 TDGNNED 8
SA3 VGGNNN 3
Say the user wants to compare the frequency of the first two sequences. They would input the ID number. I'm having trouble figuring out how I would parse this with Python to make a list like:
GD this occurs once in the two so it = 12
DN this also occurs once =12
NN occurs 3 times = 12 + 12 + 8 =32
TD occurs once in the second sequence = 8
DG ""
NE ""
ED ""
What do you recommend for parsing letter by letter? In a sequence: GD, then DN, then NN (without repeating it in the list), TD, etc.?
I currently have:
#Read File
def main():
    file = open("clonedata.txt", "r")
    lines = file.readlines()
    file.close()

class clone_data:
    def __init__(id, seq, freq):
        id.seq = seq
        id.freq = freq
    def myfunc(id):
        id = input("Input ID number to see frequency: ")
        for line in infile:
            line = line.strip().upper()
            line.find(id)
            #print('y')
I'm not entirely sure from the example, but it sounds like you're trying to look at each line in the file and determine if the ID is in a given line. If so, you want to add the number at the end of that line to the current count.
This can be done in Python with something like this:
def get_total_from_lines_for_id(id_string, lines):
    total = 0 # record the total at the end of each line
    # now loop over the lines searching for the ID string
    for line in lines:
        if id_string in line: # this will be true if the id_string is in the line and will only match once
            split_line = line.split(" ") # split the line at each space character into an array
            number_string = split_line[-1] # get the last item in the array, the number
            number_int = int(number_string) # make the string a number so we can add it
            total = total + number_int # increase the total
    return total
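A small usage sketch, assuming the clonedata.txt format shown in the question; the call with "SA1" is purely illustrative:
with open("clonedata.txt") as f:
    lines = f.readlines()

# total frequency for the entry whose ID is SA1
print(get_total_from_lines_for_id("SA1", lines))  # 12 for the sample data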
I'm honestly not sure what part of that task seems difficult to you, in part because I'm not sure what exactly is the task you're trying to accomplish.
Unless you expect the datafile to be enormous, the simplest way to start would be to read it all into memory, recording the id, sequence and frequency in a dictionary indexed by id: [Note 1]
with open('clonedata.txt') as file:
    data = { id : (sequence, int(frequency))
             for id, sequence, frequency in (
                 line.split() for line in file)}
With the sample data provided, that gives you: (newlines added for legibility)
>>> data
{'SA1': ('GDNNN', 12),
'SA2': ('TDGNNED', 8),
'SA3': ('VGGNNN', 3)}
and you can get an individual sequence and frequency with something like:
seq, freq = data['SA2']
Apparently, you always want to count the number of digrams (instances of two consecutive letters) in a sequence of letters. You can do that easily with collections.Counter: [Note 2]
from collections import Counter
# ...
seq, freq = data['SA1']
Counter(zip(seq, seq[1:]))
which prints
Counter({('N', 'N'): 2, ('G', 'D'): 1, ('D', 'N'): 1})
It would probably be most convenient to make that into a function:
def count(seq):
    return Counter(zip(seq, seq[1:]))
Also apparently, you actually want to multiply the counted frequency by the frequency extracted from the file. Unfortunately, Counter does not support multiplication (although you can, conveniently, add two Counters to get the sum of frequencies for each key, so there's no obvious reason why they shouldn't support multiplication.) However, you can multiply the counts afterwards:
def count_freq(seq, freq):
    retval = count(seq)
    for digram in retval:
        retval[digram] *= freq
    return retval
If you find tuples of pairs of letters annoying, you can easily turn them back into strings using ''.join().
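For example, a quick conversion of the weighted digram counts into string keys, reusing data and count_freq from above (the output comment assumes the sample file):
seq, freq = data['SA1']
string_counts = {''.join(digram): n for digram, n in count_freq(seq, freq).items()}
# {'GD': 12, 'DN': 12, 'NN': 24}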
Notes:
1. That code is completely devoid of error checking; it assumes that your data file is perfect, and will throw an exception for any line with too few elements, including blank lines. You could handle the blank lines by changing for line in file to for line in file if line.strip() or some other similar test, but a fully bullet-proof version would require more work.
2. zip(a, a[1:]) is the idiomatic way of making an iterator out of overlapping pairs of elements of a list. If you want non-overlapping pairs, you can use something very similar, using the same list iterator twice:
def pairwise(a):
    it = iter(a)
    return zip(it, it)
(Or, javascript style: pairwise = lambda a: (lambda it:zip(it, it))(iter(a)).)
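A quick usage note contrasting the two (illustrative values only):
seq = 'GDNNED'
list(zip(seq, seq[1:]))  # overlapping pairs: [('G','D'), ('D','N'), ('N','N'), ('N','E'), ('E','D')]
list(pairwise(seq))      # non-overlapping pairs: [('G','D'), ('N','N'), ('E','D')]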

Is there a faster way to lookup dictionary indices?

I am trying to look up dictionary indices for thousands of strings and this process is very, very slow. There are package alternatives, like KeyedVectors from gensim.models, which does what I want to do in about a minute, but I want to do what the package does more manually and to have more control over what I am doing.
I have two objects: (1) a dictionary that contains key : values for word embeddings, and (2) my pandas dataframe with my strings that need to be transformed into the index value found for each word in object (1). Consider the code below -- is there any obvious improvement to speed or am I relegated to external packages?
I would have thought that key lookups in a dictionary would be blazing fast.
Object 1
import numpy as np

embeddings_dictionary = dict()
glove_file = open('glove.6B.200d.txt', encoding="utf8")
for line in glove_file:
    records = line.split()
    word = records[0]
    vector_dimensions = np.asarray(records[1:], dtype='float32')
    embeddings_dictionary[word] = vector_dimensions
Object 2 (The slowdown)
no_matches = []
glove_tokenized_data = []
for doc in df['body'][:5]:
    doc = doc.split()
    ints = []
    for word in doc:
        try:
            # the line below is the problem
            idx = list(embeddings_dictionary.keys()).index(word)
        except:
            idx = 400000 # unknown
            no_matches.append(word)
        ints.append(idx)
    glove_tokenized_data.append(ints)
You've got a mapping of word -> np.array. It appears you want a quick way to map word to its location in the key list. You can do that with another dict.
no_matches = []
glove_tokenized_data = []
word_to_index = dict(zip(embeddings_dictionary.keys(), range(len(embeddings_dictionary))))
for doc in df['body'][:5]:
    doc = doc.split()
    ints = []
    for word in doc:
        try:
            idx = word_to_index[word]
        except KeyError:
            idx = 400000 # unknown
            no_matches.append(word)
        ints.append(idx)
    glove_tokenized_data.append(ints)
In the line you marked as a problem, you are first creating a list from the keys and then looking up the word in that list. You're doing this inside the loop, so the first thing you could do is move that logic to the top of the block (outside the loop) to avoid repeated processing; second, you're doing all this searching on a list, not a dictionary.
Why not create another dictionary like this on top of the file:
reverse_lookup = { word: index for index, word in enumerate(embeddings_dictionary.keys()) }
and then use this dictionary to look up the index of your word. Something like this:
for word in doc:
    if word in reverse_lookup:
        ints.append(reverse_lookup[word])
    else:
        no_matches.append(word)

how do I count the characters in a group of lines separated by another kind of line?

I am currently working with a text file that has a list of DNA extraction sequences (contigs), each with a header followed by lines of nucleotides; the number of nucleotides is the length of that contig. There are 120 contigs, with each entry marked by a line that starts with ">" to denote the sequence information. After this line, the nucleotides of that sequence are given.
example:
>gi|571136972|ref|XM_006625214.1| Plasmodium chabaudi chabaudi small subunit ribosomal protein 5 (Rps5) (rps5) mRNA, complete cds
ATGAGAAATATTTTATTAAAGAAAAAATTATATAATAGTAAAAATATTTATATTTTATATTATATTTTAATAATATTTAAAAGTATTTTTATTATTTTATTTAATAGTAAATATAATGTGAATTATTATTTATATAATAAAATTTATAATTTATTTATTATATATATAAAATTATATTATATTATAAATAATATATATTATAATAATAATTATTATTATATATATAATATGAATTATATA
TATTTTTATATTTATAAATATAATAGTTTAAATAATA
>gi|571136996|ref|XM_006625226.1| Plasmodium chabaudi chabaudi small subunit ribosomal protein 2 (Rps2) (rps2) mRNA, complete cds
ATGTTTATTACATTTAAAGATTTATTAAAATCTAAAATATATATAGGAAATAATTATAAAAATATTTATATTAATAATTATAAATTTATATATAAAATAAAATATAATTATTGTATTTTAAATTTTACATTAATTATATTATATTTATATAAATTATATTTATATATTTATAATATATCTATATTTAATAATAAAATTTTATTTATTATTAATAATAATTTAATTACAAATTTAATTATT
AATATATGTAATTTAACTAATAATTTTTATATTATTA
What I would like to do is make a list of every contig. My problem is that I do not know the syntax needed to tell Python to:
find the line after the line that starts with ">"
take a count of all of the characters in the lines of that sequence
append a value to a list of all contig lengths (a list that gives the length of every contig, i.e. 126, 300, 25...)
make sure the last contig (which has no ">" to denote its end) is counted.
I would like a list of integers, so that I can calculate things like the mean length of the contigs, standard deviation, cool gene equations etc.
I am relatively new to programming. If I am unclear or further information is needed, please let me know.
Don't reinvent the wheel, use biopython as Martin has suggested. Here's a start for you that will print the sequence ID and length to terminal. You can install biopython with pip, i.e. pip install biopython
from Bio import SeqIO
import sys

FileIn = sys.argv[1]
handle = open(FileIn, 'rU')
SeqRecords = SeqIO.parse(handle, 'fasta')
for record in SeqRecords: # loop through each fasta entry
    length = len(record.seq) # get sequence length
    print "%s: %i bp" % (record.id, length) # print sequence ID: seq length
Or you could store the results in a dictionary:
handle = open(FileIn, 'rU')
sequence_lengths = {}
SeqRecords = SeqIO.parse(handle, 'fasta')
for record in SeqRecords: # loop through each fasta entry
    length = len(record.seq) # get sequence length
    sequence_lengths[record.id] = length
# access dictionary outside of loop
print sequence_lengths
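Since the stated goal is to compute statistics such as the mean contig length, a small follow-on sketch using the sequence_lengths dictionary built above; the statistics lines themselves are an illustration, not part of the original answer:
lengths = list(sequence_lengths.values())
mean_length = float(sum(lengths)) / len(lengths)  # mean contig length
longest = max(lengths)                            # longest contig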
This might work for you: it prints the number of ACGT's in the lines that follow a line that includes >:
import re

with open("input.txt") as input_file:
    data = input_file.read()
data = re.split(r">.*", data)[1:]
data = [sum(1 for ch in datum if ch in 'ACGT') for datum in data]
print(data)
Thanks for all the help. I have looked at the Biopython stuff and am excited to understand it and incorporate it. The overall goal of this assignment was to teach me how to understand Python, rather than finding the solution outright, or at least if I find the solution, I have to be able to explain it in my own words.
Anyway, I have created code incorporating that element as well as others. I have a few more things to do, and if I am confused, I will return to ask.
Here is my first working code, outside of working directly with my supervisor or tutorials, that I made and understand (woo!):
import re

with open("COPYFORTESTINGplastid.1.rna.fna") as fasta:
    contigs = 0
    for line in fasta:
        if line.strip().startswith('>'):
            contigs = contigs + 1

with open("COPYFORTESTINGplastid.1.rna.fna") as fasta:
    data = fasta.read()
data = re.split(r">.*", data)[1:]
data = [sum(1 for ch in datum if ch in 'ACGT') for datum in data]

print "Total number of contigs: %s" % contigs
total_contigs = sum(data)
N50 = sum(data)/2
print "number used to determine N50 = %s" % N50
average = 0
total = 0
for n in data:
    total = total + n
mean = total / len(data)
print "mean length of contigs: %s" % mean
print "total nucleotides in fasta = %s" % total_contigs
#print "list of contigs by length: %s" % sorted([data])
l = data
l.sort(reverse = True)
print "list of contigs by length: %s" % l
This does what I want it to do, but if you have any comments or advice, I would love to hear them.
Next up, determining N50 with this sweet, sweet list. Thanks again!
I created a function to calculate N50 and it seemed to work nicely. I can parse the command line and run any .fa file through the program:
def calc_n50(array):
    array.sort(reverse = True)
    n50 = 0 # sums lengths
    n = 0   # N50 sequence
    half = sum(array)/2
    for val in array:
        n50 += val
        if n50 >= half:
            n = val
            break # breaks loop when condition is met
    print "N50 is", n
