Background:
I'm trying to create a simple Python program that lets me take part of a transcript by its transcriptomic coordinates and get both its sequence and its genomic coordinates.
I'm not an experienced bioinformatician or programmer, more a biologist, but the way I thought about doing it is to split each transcript into its nucleotides and store, along with each nucleotide, a tuple holding both its genomic coordinates and its coordinates inside the transcript. That way I can use Python to take part of a certain transcript (say, the last 200 nucleotides) and get the sequence and the various genomic windows that construct it. The end goal is more complicated than that: the final program will receive a set of coordinates in the form of distances from the translation start site (ATG), randomly assign each coordinate to a random transcript, and output the sequence and genomic coordinates.
This is the code I wrote for this. It takes the information from a BED file containing the coordinates and sequence of each exon (along with information such as transcript length, position of the start (ATG) codon, and position of the stop codon):
from __future__ import print_function
from collections import OrderedDict
from collections import defaultdict
import time
import sys
import os

content = []          # all split lines of the BED file
all_transcripts = []  # transcript IDs, in file order

with open("canonicals_metagene_withseq.bed") as f:
    for line in f:
        content.append(line.strip().split())
        all_transcripts.append(line.strip().split()[3])

# deduplicate transcript IDs while preserving order
all_transcripts = list(OrderedDict.fromkeys(all_transcripts))
genes = dict.fromkeys(all_transcripts)

n = 0
for line in content:
    n += 1
    if genes[line[3]] is not None:
        # transcript already seen: build (nucleotide, transcript_pos, genomic_pos) tuples for this exon
        seq = []
        i = 0
        for nucleotide in line[14]:
            seq.append((nucleotide, int(line[9]) + i, int(line[1]) + i))
            i += 1
        if line[5] == '+':
            genes[line[3]][5].extend(seq)
        elif line[5] == '-':
            # '-' strand: prepend so the list stays in transcript order
            genes[line[3]][5] = seq + genes[line[3]][5]
    else:
        # first exon of this transcript
        seq = []
        i = 0
        for nucleotide in line[14]:
            seq.append((nucleotide, int(line[9]) + i, int(line[1]) + i))
            i += 1
        genes[line[3]] = [line[0], line[5], line[11], line[12], line[13], seq]
    sys.stdout.write("\r")
    sys.stdout.write(str(n))
    sys.stdout.flush()
This is an example of how to BED file looks:
Chr Start End Transcript_ID Exon_Type Strand Ex_Start Ex_End Ex_Length Ts_Start Ts_End Ts_Length ATG Stop Sequence
chr1 861120 861180 uc001abw.1 5UTR + 0 60 80 0 60 2554 80 2126 GCAGATCCCTGCGG
chr1 861301 861321 uc001abw.1 5UTR + 60 80 80 60 80 2554 80 2126 GGAAAAGTCTGAAG
chr1 861321 861393 uc001abw.1 CDS + 0 72 2046 80 152 2554 80 2126 ATGTCCAAGGGGAT
chr1 865534 865716 uc001abw.1 CDS + 72 254 2046 152 334 2554 80 2126 AACCGGGGGCGGCT
chr1 866418 866469 uc001abw.1 CDS + 254 305 2046 334 385 2554 80 2126 AGTCCACACCCACT
I wanted to create a dictionary in which each transcript ID is the key, and the values stored are the length of the transcript, the chromosome it is in, the strand, the position of the ATG, the position of the stop codon and, most importantly, a list of tuples of the sequence.
Basically, the code works. However, once the dictionary starts to get big it runs very very slowly.
So, what I would like to know is: how can I make it run faster? Currently it gets intolerably slow around the 60,000th line of the BED file. Perhaps there is a more efficient way to do what I'm trying to do, or just a better way to store the data for this.
The BED file is custom made btw using awk from UCSC tables.
EDIT:
Sharing what I learned...
I now know that the bottleneck is in the creation of a large dictionary.
If I alter the program to iterate by genes and create a new list every time, with a similar mechanism, using this code:
from itertools import groupby

for transcript in groupby(content, lambda x: x[3]):
    a = list(transcript[1])          # all exon lines of this transcript
    b = a[0]
    gene = [b[0], b[5], b[11], b[12], b[13]]
    seq = []
    n += 1
    sys.stdout.write("\r")
    sys.stdout.write(str(n))
    sys.stdout.flush()
    for exon in a:
        i = 0
        for nucleotide in exon[14]:
            seq.append((nucleotide, int(exon[9]) + i, int(exon[1]) + i))
            i += 1
    gene.append(seq)
It runs in less than 4 minutes, while the former version, which creates one big dictionary with all of the genes at once, takes an hour to run.
One thing you can do to make your code more efficient is to add to the dictionary as you read the file.
with open("canonicals_metagene_withseq.bed") as f:
for line in f:
all_transcripts.append(line.strip().split()[3])
add_to_content_dict(line)
and then your add_to_content_dict() function would look like the code inside the for line in content: loop
(see here)
Also, you have to define your defaultdicts as such; I don't see where genes or any other dict is actually defined as a defaultdict.
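For example, with a defaultdict(list) you can append exon tuples for a transcript without first checking whether the key exists. This is only a minimal sketch, reusing the content list and BED column layout from the question, and it leaves out the strand handling (for '-' transcripts you would still need to prepend rather than append, as in the original code):

from collections import defaultdict

genes = defaultdict(list)   # missing keys automatically start as an empty list

for line in content:
    transcript_id = line[3]
    for i, nucleotide in enumerate(line[14]):
        # (nucleotide, position in transcript, genomic position)
        genes[transcript_id].append((nucleotide, int(line[9]) + i, int(line[1]) + i))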
This might be a good read; it details the practice of binding the dot-notation methods you call inside a loop to local variables beforehand to enhance performance, because then you aren't looking up the attribute on every iteration of the loop. For example, instead of
for line in f:
    transcripts.append(line.strip().split()[3])
you would have
f_split = str.split
f_strip = str.strip
f_append = all_transcripts.append
for line in f:
    f_append(f_split(f_strip(line))[3])
There are other goodies in that link about local variable access and is, again, definitely worth the read.
You may also consider using Cython, PyInline, or Pyrex to push the hot loops down into C code, or running the script under PyPy, for efficiency when dealing with lots and lots of iteration and/or file I/O.
As for the data structure itself (which was your major concern), we're limited in python to how much control over a dictionary's expansion we have. Big dicts do get heavier on the memory consumption as they get bigger... but so do all data structures! You have a couple options that may make a minute difference (storing strings as bytestrings / using a translation dict for encoded integers), but you may want to consider implementing a database instead of holding all that stuff in a python dict during runtime.
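As a concrete illustration of the translation-dict idea (a sketch I'm adding here, not code from the question), each base can be stored as a small integer instead of a one-character string:

# Hypothetical encoding: one small int per base instead of a 1-character string
NUC_CODE = {'A': 0, 'C': 1, 'G': 2, 'T': 3, 'N': 4}
NUC_DECODE = {v: k for k, v in NUC_CODE.items()}

def encode_exon(sequence, ts_start, genomic_start):
    """Return (base_code, transcript_pos, genomic_pos) tuples for one exon."""
    return [(NUC_CODE[base], ts_start + i, genomic_start + i)
            for i, base in enumerate(sequence)]

exon = encode_exon("GCAGAT", 0, 861120)
print(NUC_DECODE[exon[0][0]], exon[0][1], exon[0][2])   # G 0 861120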
Related
I have a dataframe that has a small number of columns but many rows (about 900K right now, and it's going to get bigger as I collect more data). It looks like this:
   Author             Title            Date        Category            Text                                                url
0  Amira Charfeddine  Wild Fadhila 01  2019-01-01  novel               الكتاب هذا نهديه لكل تونسي حس إلي الكتاب يحكي ...   NaN
1  Amira Charfeddine  Wild Fadhila 02  2019-01-01  novel               في التزغريت، والعياط و الزمامر، ليوم نتيجة الب...   NaN
2  253826             1515368_7636953  2010-12-28  /forums/forums/91/  هذا ما ينص عليه إدوستور التونسي لا رئاسة مدى ا...   https://www.tunisia-sat.com/forums/threads/151...
3  250442             1504416_7580403  2010-12-21  /forums/sports/     \n\n\n\n\n\nاعلنت الجامعة التونسية لكرة اليد ا...   https://www.tunisia-sat.com/forums/threads/150...
4  312628             1504416_7580433  2010-12-21  /forums/sports/     quel est le résultat final\n,,,,????                https://www.tunisia-sat.com/forums/threads/150...
The "Text" Column has a string of text that may be just a few words (in the case of a forum post) or it may a portion of a novel and have tens of thousands of words (as in the two first rows above).
I have code that constructs the dataframe from various corpus files (.txt and .json), then cleans the text and saves the cleaned dataframe as a pickle file.
I'm trying to run the following code to analyze how variable the spelling of different words is in the corpus. The functions seem simple enough: one counts the occurrences of a particular spelling variant in each Text row; the other takes a list of such frequencies and computes a Gini coefficient for each lemma (which is just a numerical measure of how heterogeneous the spelling is). It references a spelling_var dictionary that has a lemma as its key and the various ways of spelling that lemma as values (like {'color': ['color', 'colour']}, except not in English).
This code works, but it uses a lot of CPU time. I'm not sure how much, but I use PythonAnywhere for my coding and this code sends me into the tarpit (in other words, it makes me exceed my daily allowance of CPU seconds).
Is there a way to do this so that it's less CPU intensive? Preferably without me having to learn another package (I've spent the past several weeks learning Pandas and am liking it, and need to just get on with my analysis). Once I have the code and have finished collecting the corpus, I'll only run it a few times; I won't be running it every day or anything (in case that matters).
Here's the code:
import pickle
import pandas as pd
import re

with open('1_raw_df.pkl', 'rb') as pickle_file:
    df = pickle.load(pickle_file)

spelling_var = {
    'illi': ["الي", "اللي"],
    'besh': ["باش", "بش"],
    ...
}

spelling_df = df.copy()

def count_word(df, word):
    pattern = r"\b" + re.escape(word) + r"\b"
    return df['Text'].str.count(pattern)

def compute_gini(freq_list):
    proportions = [f/sum(freq_list) for f in freq_list]
    squared = [p**2 for p in proportions]
    return 1-sum(squared)

for w, var in spelling_var.items():
    count_list = []
    for v in var:
        count_list.append(count_word(spelling_df, v))
        gini = compute_gini(count_list)
        spelling_df[w] = gini
I rewrote two lines in the last double loop; see the comments in the code below. Does this solve your issue?
gini_lst = []

for w, var in spelling_var.items():
    count_list = []
    for v in var:
        count_list.append(count_word(spelling_df, v))
        # gini = compute_gini(count_list)  # don't think you need to compute this at every iteration of the inner loop, right?
        # spelling_df[w] = gini  # having this inside of the loop creates a new column at each iteration, which could crash your CPU
    gini_lst.append(compute_gini(count_list))

# this creates a df with a row for each lemma with its associated gini value
df_lemma_gini = pd.DataFrame(data={"lemma_column": list(spelling_var.keys()), "gini_column": gini_lst})
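If you then want to look up the Gini value for a particular lemma, a short usage sketch on the df_lemma_gini frame built above might be:

lemma_gini = df_lemma_gini.set_index("lemma_column")["gini_column"]
print(lemma_gini["illi"])        # Gini coefficient for the lemma 'illi'
print(lemma_gini.sort_values())  # lemmas ordered from most to least consistent spelling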
I would like to go through a gene and get a list of 10 bp long sequences containing the exon/intron borders from each feature.type == 'mRNA'. It seems like I need to use CompoundLocation and the locations used in 'join', but I cannot figure out how to do it or find a tutorial.
Could anyone please give me an example or point me to a tutorial?
Assuming all the info is in the exact format you show in the comment, and that you're looking for 20 bp on either side of each intron/exon boundary, something like this might be a start:
Edit: If you're actually starting from a GenBank record, then it's not much harder. Assuming that the full junction string you're looking for is in the CDS feature info, then:
for f in record.features:
    if f.type == 'CDS':
        jct_info = str(f.location)
converts the "location" information into a string and you can continue as below.
(There are ways to work directly with the location information without converting to a string - in particular you can use "extract" to pull the spliced sequence directly out of the parent sequence -- but the steps involved in what you want to do are faster and more easily done by converting to str and then int.)
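For completeness, the extract route mentioned in that aside looks roughly like this (a hedged sketch, assuming record is the parsed SeqRecord and f is the CDS feature from the loop above):

# Pull the spliced (exon-joined) sequence straight out of the parent record
spliced_seq = f.extract(record.seq)
print(len(spliced_seq), spliced_seq[:30])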
import re

jct_info = "join{[0:229](+), [11680:11768](+), [11871:12135](+), [15277:15339](+), [16136:16416](+), [17220:17471](+), [17547:17671](+)}"
jctP = re.compile(r"\[\d+\:\d+\]")
jcts = jctP.findall(jct_info)

jcts
['[0:229]', '[11680:11768]', '[11871:12135]', '[15277:15339]', '[16136:16416]', '[17220:17471]', '[17547:17671]']
Now you can loop through the list of start:end values, pull them out of the text and convert them to ints so that you can use them as sequence indexes. Something like this:
for jct in jcts:
    (start, end) = jct.replace('[', '').replace(']', '').split(':')
    try:  # You need to account for going out of index, e.g. where start = 0
        start_20_20 = seq[int(start)-20:int(start)+20]
    except IndexError:
        # do your alternatives e.g. start = int(start)
        pass
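Tying this back to the original question (10 bp windows spanning each exon/intron boundary), a rough sketch might look like the following. Here seq is assumed to be the parent sequence from the record, and boundaries near the ends of seq are simply clamped rather than left to raise errors:

flank = 5   # 5 bp on each side gives a 10 bp window around every boundary

boundary_windows = []
for jct in jcts:
    start, end = (int(x) for x in jct.strip('[]').split(':'))
    for boundary in (start, end):
        lo = max(boundary - flank, 0)           # clamp at the start of seq
        hi = min(boundary + flank, len(seq))    # clamp at the end of seq
        boundary_windows.append(seq[lo:hi])

for window in boundary_windows:
    print(window)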
I have a file written in columns like this (I show the first rows, but it is longer):
Ncol 10 Nrow 9276
NO_POL = 2
NO_IF = 8
NO. ANTENNA SUBARRAY TSYS TANT
1 1 1 37 35
2 37 35
3 37 35
4 1 1 37 35
5 37 35
6 37 35
7 3 1 37 35
8 37 35
9 37 35
10 3 1 37 35
11 37 35
I want to copy to another file the antenna numbers that appear in this file, but I want each antenna number to appear only once in the other file. The maximum number of antennas is 10.
What I have done is read the file in columns, starting at the 5th row. Since I only want the lines where the antenna number appears, I have added the condition that the line must have more than 3 columns. This is the code I have written to do this, but nothing is written to my new_file:
with open('file') as f1:
    with open('new_file','a') as f2:
        for i in range(1,11):
            for line in f1.readlines()[4:]:
                columns = line.split()
                if len(columns) > 3 and columns[1] == i:
                    f2.write(i+'\n')
                    break
I think the problem could be in the condition that matches the antenna number with i, but I don't know why... What am I doing wrong?
There are a few things to fix. I'll start by just correcting the type/code errors, and then address the algorithm itself.
Code issues
For starters, every time you call f1.readlines(), it reads from where it left off reading. So you only get the rest of the file instead of the entire file after the first read. What you'll have to do is store the contents of the file in a list outside of the loop, and then you'll loop in the same way you currently do except with line coming from this list instead of the file.
Next, you're trying to compare a string with an integer in your columns[1] == i; you have to convert one into the other, so maybe int(columns[1]) == i in the comparison.
A similar error occurs when you try to write to the output file: you have to convert i into a string in order to add '\n' to it, so something like f2.write(str(i)+'\n') would do it.
The resulting code with these changes would be:
f1 = open('file')
contents = f1.readlines()[4:]
f1.close()  # we don't need it anymore

with open('new_file','a') as f2:
    for i in range(1,11):
        for line in contents:
            columns = line.split()
            if len(columns) > 3 and int(columns[1]) == i:
                f2.write(str(i)+'\n')
                break
It seems to work as you want on my machine.
Algorithm
What you are doing is picking an antenna number, and then looking through the entire file to see if there is a line with that antenna number present. This is certainly one approach, but if you intend to do this sort of processing for large files this algorithm will take quite some time. An alternative, more efficient approach is to use a set.
Python has a set() function which creates an empty set, and then you add elements to the set with the add() function.
So you might end up doing something like this:
antennae = set()

f1 = open('file')
lineno = 1
for line in f1:
    if lineno >= 5:
        row = line.split()
        if len(row) > 3:
            antennae.add(int(row[1]))
    lineno += 1
f1.close()

f2 = open('new_file','a')
for antenna in antennae:
    f2.write(str(antenna)+'\n')
f2.close()
This version is efficient both in memory and in time, since we're only reading lines as we need them (and we're using python's efficient reading algorithms) while also only checking each line once instead of once per antenna value.
for i in range(1,11):
    for line in f1.readlines()[4:]:
What this does is "Try to read all the lines in the file 10 times over". That does not sound correct...
if len(columns) > 3 and columns[1] == i:
So i is the row count (it doesn't work because of the first problem, but let's assume it does) and you use it to select a column? That does not sound right either.
Maybe something like this (not tested):
f1 = open('file')
f2 = open('new_file', 'a')

for line in f1.readlines()[4:]:
    columns = line.split()
    if len(columns) > 3:
        f2.write(columns[0]+'\n')
In the future I suggest adding debug printing in your code, that usually helps.
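For instance, a temporary print inside the loop would have shown straight away that the string/integer comparison never matches. This is just an illustrative snippet, reusing the contents list and loop variable i from the corrected code earlier in this thread:

for line in contents:
    columns = line.split()
    if len(columns) > 3:
        # temporary debug output: shows e.g. '1' == 1 evaluating to False
        print(repr(columns[1]), repr(i), columns[1] == i)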
I have a small program written for me in Python to help me generate all combinations of passwords from different sets of numbers and words I know, so I can recover a password I forgot. Since I know all the different words and sets of numbers I used, I just wanted to generate all possible combinations. The only problem is that the list seems to go on for hours and hours, so eventually I run out of memory and it doesn't finish.
I was told it needs to dump my memory so it can carry on, but I'm not sure if this is right. Is there any way I can get around this problem?
This is the program I am running:
#!/usr/bin/python
import itertools
gfname = "name"
tendig = "1234567890"
sixteendig = "1111111111111111"
housenum = "99"
Characterset1 = "&&&&"
Characterset2 = "££££"
daughternam = "dname"
daughtyear = "1900"
phonenum1 = "055522233"
phonenum2 = "3333333"
mylist = [gfname, tendig, sixteendig, housenum, Characterset1,
          Characterset2, daughternam, daughtyear, phonenum1, phonenum2]

for length in range(1, len(mylist)+1):
    for item in itertools.permutations(mylist, length):
        print "".join(item)
I have taken out a few sets and changed the numbers and words for obvious reasons, but this is roughly the program.
Another thing is that I may be missing a particular word but didn't want to put it in the list, because I know it might go before all the generated passwords. Does anyone know how to add a prefix to my program?
Sorry for the bad grammar, and thanks for any help given.
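One possible way to handle the prefix (a sketch added for illustration, not code from the thread): keep the known word out of mylist and prepend it to each generated candidate, emitting both variants. The word "knownword" and the shortened list below are made-up placeholders:

import itertools

prefix = "knownword"                      # hypothetical: the word you suspect comes first
mylist = ["name", "1234567890", "99"]     # shortened example list

for length in range(1, len(mylist) + 1):
    for item in itertools.permutations(mylist, length):
        candidate = "".join(item)
        print(candidate)                  # without the prefix
        print(prefix + candidate)         # with the prefix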
I used guppy to understand the memory usage; I changed the OP's code slightly (marked # !!!):
import itertools
gfname = "name"
tendig = "1234567890"
sixteendig = "1111111111111111"
housenum = "99"
Characterset1 = "&&&&"
Characterset2 = u"££££"
daughternam = "dname"
daughtyear = "1900"
phonenum1 = "055522233"
phonenum2 = "3333333"
from guppy import hpy # !!!
h=hpy() # !!!
mylist = [gfname, tendig, sixteendig, housenum, Characterset1,
          Characterset2, daughternam, daughtyear, phonenum1, phonenum2]

for length in range(1, len(mylist)+1):
    print h.heap() # !!!
    for item in itertools.permutations(mylist, length):
        print item # !!!
Guppy outputs something like this every time h.heap() is called.
Partition of a set of 25914 objects. Total size = 3370200 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 11748 45 985544 29 985544 29 str
1 5858 23 472376 14 1457920 43 tuple
2 323 1 253640 8 1711560 51 dict (no owner)
3 67 0 213064 6 1924624 57 dict of module
4 199 1 210856 6 2135480 63 dict of type
5 1630 6 208640 6 2344120 70 types.CodeType
6 1593 6 191160 6 2535280 75 function
7 199 1 177008 5 2712288 80 type
8 124 0 135328 4 2847616 84 dict of class
9 1045 4 83600 2 2931216 87 __builtin__.wrapper_descriptor
Running python code.py > code.log and then fgrep Partition code.log shows:
Partition of a set of 25914 objects. Total size = 3370200 bytes.
Partition of a set of 25924 objects. Total size = 3355832 bytes.
Partition of a set of 25924 objects. Total size = 3355728 bytes.
Partition of a set of 25924 objects. Total size = 3372568 bytes.
Partition of a set of 25924 objects. Total size = 3372736 bytes.
Partition of a set of 25924 objects. Total size = 3355752 bytes.
Partition of a set of 25924 objects. Total size = 3372592 bytes.
Partition of a set of 25924 objects. Total size = 3372760 bytes.
Partition of a set of 25924 objects. Total size = 3355776 bytes.
Partition of a set of 25924 objects. Total size = 3372616 bytes.
Which I believe shows that the memory footprint stays fairly consistent.
Granted I may be misinterpreting the results from guppy. Although during my tests I deliberately added a new string to a list to see if the object count increased and it did.
For those interested I had to install guppy like so on OSX - Mountain Lion
pip install https://guppy-pe.svn.sourceforge.net/svnroot/guppy-pe/trunk/guppy
In summary I don't think that it's a running out of memory issue although granted we're not using the full OP dataset.
How about using IronPython and Visual Studio for its debug tools (which are pretty good)? You should be able to pause execution and look at the memory (essentially a memory dump).
Your program will run pretty efficiently by itself, as you now know. But make sure you don't just run it in IDLE, for example; that will slow it down to a crawl as IDLE updates the screen with more and more lines. Save the output directly into a file.
Even better: Have you thought about what you'll do when you have the passwords? If you can log on to the lost account from the command line, try doing that immediately instead of storing all the passwords for later use:
for length in range(1, len(mylist)+1):
    for item in itertools.permutations(mylist, length):
        password = "".join(item)
        try_to_logon(command, password)
To answer the above comment from @shaun: if you want the output in a text file you can open in Notepad, just run your file like so
Myfile.py >output.txt
If the text file doesn't exist it will be created.
EDIT:
Replace the line in your code at the bottom which reads:
print "" .join(item)
with this:
with open ("output.txt","a") as f:
f.write('\n'.join(items))
f.close
which will produce a file called output.txt.
Should work (haven't tested)
I have a large 40-million-line, 3-gigabyte text file (which probably won't fit in memory) in the following format:
399.4540176 {Some other data}
404.498759292 {Some other data}
408.362737492 {Some other data}
412.832976111 {Some other data}
415.70665675 {Some other data}
419.586515381 {Some other data}
427.316825959 {Some other data}
.......
Each line starts off with a number and is followed by some other data. The numbers are in sorted order. I need to be able to:
Given a number x and a range y, find all the lines whose number is within y of x. For example, if x=20 and y=5, I need to find all lines whose number is between 15 and 25.
Store these lines into another separate file.
What would be an efficient method to do this without having to trawl through the entire file?
If you don't want to generate a database ahead of time for line lengths, you can try this:
import os
import sys

# Configuration, change these to suit your needs
maxRowOffset = 100  # increase this if some lines are being missed
fileName = 'longFile.txt'
x = 2000
y = 25

# seek backwards to the first character c before the current position
def seekTo(f, c):
    while f.read(1) != c:
        f.seek(-2, 1)

def parseRow(row):
    # the leading field is a float in the sample data shown above
    return (float(row.split(None, 1)[0]), row)

minRow = x - y
maxRow = x + y
step = os.path.getsize(fileName) / 2.

with open(fileName, 'r') as f:
    while True:
        f.seek(int(step), 1)
        seekTo(f, '\n')
        row = parseRow(f.readline())
        if row[0] < minRow:
            if minRow - row[0] < maxRowOffset:
                # close enough: switch to a linear scan and copy the matching lines
                with open('outputFile.txt', 'w') as fo:
                    for row in f:
                        row = parseRow(row)
                        if row[0] > maxRow:
                            sys.exit()
                        if row[0] >= minRow:
                            fo.write(row[1])
            else:
                step /= 2.
                step = step * -1 if step < 0 else step   # keep stepping forward
        else:
            step /= 2.
            step = step * -1 if step > 0 else step       # step backwards
It starts by performing a binary search on the file until it is near (less than maxRowOffset away from) the row to find. Then it starts reading every line until it finds one that is greater than x-y. That line, and every line after it, are written to an output file until a line is found that is greater than x+y, at which point the program exits.
I tested this on a 1,000,000 line file and it runs in 0.05 seconds. Compare this to reading every line which took 3.8 seconds.
You need random access to the lines, which you won't get with a text file unless the lines are all padded to the same length.
One solution is to dump the table into a database (such as SQLite) with two columns, one for the number and one for all the other data (assuming that the data is guaranteed to fit into whatever the maximum number of characters allowed in a single column in your database is). Then index the number column and you're good to go.
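A minimal SQLite sketch of that idea (file, table, and column names here are made up for illustration):

import sqlite3

conn = sqlite3.connect("lines.db")
conn.execute("CREATE TABLE IF NOT EXISTS lines (number REAL, data TEXT)")

# One-time load: parse the leading number, keep the whole line as the payload
with open("longFile.txt") as f:
    rows = ((float(line.split(None, 1)[0]), line) for line in f)
    conn.executemany("INSERT INTO lines VALUES (?, ?)", rows)

conn.execute("CREATE INDEX IF NOT EXISTS idx_number ON lines (number)")
conn.commit()

# Range query: every line whose number is within y of x
x, y = 20.0, 5.0
with open("outputFile.txt", "w") as out:
    query = "SELECT data FROM lines WHERE number BETWEEN ? AND ? ORDER BY number"
    for (line,) in conn.execute(query, (x - y, x + y)):
        out.write(line)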
Without a database, you could read through the file one time and create an in-memory data structure containing (number, line-offset) pairs. You calculate the line-offset by adding up the lengths of each row (including the line end). Now you can binary search these pairs on number and randomly access the lines in the file using the offset. If you need to repeat the search later, pickle the in-memory structure and reload it for later re-use.
This reads the entire file (which you said you don't want to do), but does so only once to build the index. After that you can execute as many requests against the file as you want and they will be very fast.
Note that this second solution is essentially creating a database index on your text file.
Rough code to create the index in second solution:
import pickle

offset = 0
index = []  # probably a better structure to use than a list

f = open(filename)
for row in f:
    nbr = float(row.split(' ')[0])
    index.append([nbr, offset])
    offset += len(row)   # each row already ends with its newline character
f.close()

pickle.dump(index, open('filename.idx', 'wb'))  # saves it for future use
Now, you can perform a binary search on the list. There's probably a much better data structure to use for accruing the index values than a list, but I'd have to read up on the various collection types.
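The binary search itself can lean on the standard bisect module; a rough sketch reusing the index list and filename from the snippet above:

from bisect import bisect_left, bisect_right

numbers = [pair[0] for pair in index]       # the sorted leading numbers

def write_range(x, y, inpath, outpath):
    """Copy every line whose leading number lies in [x - y, x + y]."""
    lo = bisect_left(numbers, x - y)        # first entry with number >= x - y
    hi = bisect_right(numbers, x + y)       # one past the last entry <= x + y
    with open(inpath) as f, open(outpath, 'w') as out:
        for nbr, offset in index[lo:hi]:
            f.seek(offset)                  # jump straight to that line
            out.write(f.readline())

write_range(20, 5, filename, 'outputFile.txt')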
Since you want to match the first field, you can use gawk:
$ gawk '{if ($1 >= 15 && $1 <= 25) { print }; if ($1 > 25) { exit }}' your_file
Edit: Taking a file with 261,775,557 lines that is 2.5 GiB big, searching for lines 50,010,015 to 50,010,025, this takes 27 seconds on my Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz. Sounds good enough for me.
In order to find the line that starts with the number just above your lower limit, you have to go through the file line by line until you find that line. No other way, i.e. all data in the file has to be read and parsed for newline characters.
We have to run this search up to the first line that exceeds your upper limit and stop. Hence, it helps that the file is already sorted. This code will hopefully help:
with open(outpath, 'w') as outfile:
    with open(inpath) as infile:
        for line in infile:
            t = float(line.split()[0])
            if lower_limit <= t <= upper_limit:
                outfile.write(line)
            elif t > upper_limit:
                break
I think theoretically there is no other option.