memory dump using Python

I have a small program, written for me in Python, to help me generate all combinations of passwords from the different sets of numbers and words I know, so I can recover a password I forgot. Since I know all the different words and sets of numbers I used, I just wanted to generate every possible combination. The only problem is that the list seems to go on for hours and hours, so eventually I run out of memory and it doesn't finish.
I was told it needs to dump my memory so it can carry on, but I'm not sure that's right. Is there any way I can get around this problem?
This is the program I am running:
#!/usr/bin/python
import itertools
gfname = "name"
tendig = "1234567890"
sixteendig = "1111111111111111"
housenum = "99"
Characterset1 = "&&&&"
Characterset2 = "££££"
daughternam = "dname"
daughtyear = "1900"
phonenum1 = "055522233"
phonenum2 = "3333333"
mylist = [gfname, tendig, sixteendig, housenum, Characterset1,
          Characterset2, daughternam, daughtyear, phonenum1, phonenum2]

for length in range(1, len(mylist)+1):
    for item in itertools.permutations(mylist, length):
        print "".join(item)
I have taken out a few sets and changed the numbers and words for obvious reasons, but this is roughly the program.
Another thing: I may be missing a particular word, but I didn't want to put it in the list because I know it would go before all the generated passwords. Does anyone know how to add a prefix to my program?
Sorry for the bad grammar, and thanks for any help given.

I used guppy to understand the memory usage. I changed the OP's code slightly (marked # !!!):
import itertools
gfname = "name"
tendig = "1234567890"
sixteendig = "1111111111111111"
housenum = "99"
Characterset1 = "&&&&"
Characterset2 = u"££££"
daughternam = "dname"
daughtyear = "1900"
phonenum1 = "055522233"
phonenum2 = "3333333"
from guppy import hpy # !!!
h=hpy() # !!!
mylist = [gfname, tendig, sixteendig, housenum, Characterset1,
          Characterset2, daughternam, daughtyear, phonenum1, phonenum2]

for length in range(1, len(mylist)+1):
    print h.heap() # !!!
    for item in itertools.permutations(mylist, length):
        print item # !!!
Guppy outputs something like this every time h.heap() is called.
Partition of a set of 25914 objects. Total size = 3370200 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0  11748  45   985544  29     985544  29 str
     1   5858  23   472376  14    1457920  43 tuple
     2    323   1   253640   8    1711560  51 dict (no owner)
     3     67   0   213064   6    1924624  57 dict of module
     4    199   1   210856   6    2135480  63 dict of type
     5   1630   6   208640   6    2344120  70 types.CodeType
     6   1593   6   191160   6    2535280  75 function
     7    199   1   177008   5    2712288  80 type
     8    124   0   135328   4    2847616  84 dict of class
     9   1045   4    83600   2    2931216  87 __builtin__.wrapper_descriptor
Running python code.py > code.log and then fgrep Partition code.log shows:
Partition of a set of 25914 objects. Total size = 3370200 bytes.
Partition of a set of 25924 objects. Total size = 3355832 bytes.
Partition of a set of 25924 objects. Total size = 3355728 bytes.
Partition of a set of 25924 objects. Total size = 3372568 bytes.
Partition of a set of 25924 objects. Total size = 3372736 bytes.
Partition of a set of 25924 objects. Total size = 3355752 bytes.
Partition of a set of 25924 objects. Total size = 3372592 bytes.
Partition of a set of 25924 objects. Total size = 3372760 bytes.
Partition of a set of 25924 objects. Total size = 3355776 bytes.
Partition of a set of 25924 objects. Total size = 3372616 bytes.
Which I believe shows that the memory footprint stays fairly consistent.
Granted, I may be misinterpreting the results from guppy, although during my tests I deliberately added a new string to a list to see whether the object count increased, and it did.
For those interested, I had to install guppy like so on OS X Mountain Lion:
pip install https://guppy-pe.svn.sourceforge.net/svnroot/guppy-pe/trunk/guppy
In summary, I don't think this is a running-out-of-memory issue, although granted we're not using the full OP dataset.

How about using IronPython and Visual Studio for its debug tools (which are pretty good)? You should be able to pause execution and look at the memory (essentially a memory dump).

Your program will run pretty efficiently by itself, as you now know. But make sure you don't just run it in IDLE, for example; that will slow it down to a crawl as IDLE updates the screen with more and more lines. Save the output directly into a file.
Even better: Have you thought about what you'll do when you have the passwords? If you can log on to the lost account from the command line, try doing that immediately instead of storing all the passwords for later use:
for length in range(1, len(mylist)+1):
    for item in itertools.permutations(mylist, length):
        password = "".join(item)
        try_to_logon(command, password)
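The OP also asked about putting a known word in front of every generated password; that can be done inside the same loop as above. A minimal sketch, following on from the loop just shown, where prefix stands in for that hypothetical known word:
prefix = "word"  # placeholder for the known word that should go in front of every candidate
for length in range(1, len(mylist)+1):
    for item in itertools.permutations(mylist, length):
        password = prefix + "".join(item)
        try_to_logon(command, password)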

To answer the above comment from @shaun: if you want the output in a text file, just run your script like so:
Myfile.py > output.txt
If the text file doesn't exist it will be created.
EDIT:
Replace the line at the bottom of your code which reads:
print "".join(item)
with this:
with open("output.txt", "a") as f:
    f.write("".join(item) + "\n")
which will append each password to a file called output.txt (the with block closes the file for you, so no explicit f.close() is needed).
Should work (haven't tested)

Related

pandas value_counts() returning incorrect counts

I was wondering if anyone else has ever experienced value_counts() returning incorrect counts. I have two values, Pass and Fail, and when I use value_counts() it returns the correct total but the wrong count for each one.
The data in the data frame is for samples made with different sample preparation methods (A-G) and then tested on different testing machines (numbered 1-5; they run the same test, we just have 5 of them so we can run more tests), and I am trying to compare both the prep methods and the test machines by putting the pass % into a pivot table. I would like to be able to do this for different sample materials as well, so I have been trying to write the pass % function in a separate script that I can call from each material's script, if that makes sense.
The pass % function is as follows:
def pass_percent(df_copy):
    pds = df_copy.value_counts()
    p = pds['PASS']
    try:
        f = pds['FAIL']
    except:
        f = 0
    print(pds)
    print(p)
    print(f)
    pass_pc = p/(p+f) * 100
    print(pass_pc)
    return pass_pc
And then within the individual material script (e.g. material 1A) I have (among a few other things to tidy up the data frame before this - essentially getting rid of columns I don't need from the testing outputs):
from pass_pc_function import pass_percent
mat_1A = pd.pivot_table(df_copy, index='Prep_Method', columns='Test_Machine', aggfunc=pass_percent)
An example of what is happening: for Material 1A I have 100 tests of Prep_Method A on Test_Machine 1, of which 65 passed and 35 failed, so a 65% pass rate. But value_counts() is returning 56 passes and 44 fails (so the total is still 100, which is correct, but for some reason it is counting 9 passes as fails). This is just an example; I have much larger data sets than this, but this is essentially what is happening.
I thought perhaps it could be a whitespace issue, so I also have the line:
df_copy.columns = [x.strip() for x in df_copy.columns]
in my M1A script. However, I am still getting these strange counts.
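(That line only strips the column names. Stripping the values in the counted Result column as well, in case stray spaces or case variants there are splitting the counts, would be something like the line below; this is only a guess at the cause, not a confirmed fix.)
df_copy['Result'] = df_copy['Result'].str.strip().str.upper()  # normalise 'PASS ' / 'pass' style variants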
Any advice would be appreciated. Thanks!
EDIT:
Results example as requested
PASS 31
FAIL 27
Name: Result, dtype: int64
31
27
53.44827586206896
Result
Test_Machine          1          2          3           4
Prep_Method
A             53.448276  89.655172  93.478261   97.916667
B             87.050360  90.833333  91.596639   97.468354
C             83.333333  93.150685  98.305085  100.000000
D             85.207101  94.339623  95.652174   97.163121
E             87.901701  96.310680  95.961538   98.655462
F             73.958333  82.178218  86.166008   93.750000
G             80.000000  91.743119  89.622642   98.529412

Python 3.7.4 -> How to keep memory usage low?

The following code retrieves and builds an index of uniqueCards on a given database.
for x in range(2010,2015):
    for y in range(1,13):
        index = str(x)+"-"+str("0"+str(y) if y<10 else y)
        url = urlBase.replace("INDEX",index)
        response = requests.post(url,data=query,auth=(user,pwd))
        if response.status_code != 200:
            continue
        # this is a big json, around 4MB each
        parsedJson = json.loads(response.content)["aggregations"]["uniqCards"]["buckets"]
        for z in parsedJson:
            valKey = 0
            ind = 0
            header = str(z["key"])[:8]
            if header in headers:
                ind = headers.index(header)
            else:
                headers.append(header)
            valKey = int(str(ind)+str(z["key"])[8:])
            creditCards.append(CreditCard(valKey, x*100+y))
The CreditCard object, the only one that survives the scope, is around 64 bytes each.
After running, this code was supposed to map around 10 million cards. That would translate to 640 million bytes, or around 640 megabytes.
The problem is that midway through this operation the memory consumption hits about 3GB...
My first guess is that, for some reason, the GC is not collecting parsedJson. What should I do to keep memory consumption under control? Can I dispose of that object manually?
Edit1:
The CreditCard class is defined as
class CreditCard:
    number = 0
    knownSince = 0
    def __init__(self, num, date):
        self.number = num
        self.knownSince = date
Edit2:
When I get to 3.5 million cards in creditCards.__len__(), sys.getsizeof(creditCards) reports 31MB, but the process is consuming 2GB!
The problem is json.loads: loading a 4MB response results in a 5-8x memory jump.
Edit:
I managed to work around this by using a custom mapper (object hook) for the JSON:
def object_decoder(obj):
    if obj.__contains__('key'):
        return CreditCard(obj['key'], xy)
    return obj
Now the memory grows slowly, and I've been able to parse the whole set using around 2GB of memory.
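For anyone hitting the same issue: the decoder above is meant to be plugged into json.loads via its object_hook parameter, so each bucket dict is replaced by a small CreditCard object as it is decoded. A self-contained sketch of that wiring (the payload string and the xy value are stand-ins for response.content and x*100+y from the loop above):
import json

class CreditCard:
    def __init__(self, num, date):
        self.number = num
        self.knownSince = date

xy = 2010 * 100 + 1  # stand-in for x*100+y from the loop above

def object_decoder(obj):
    # json.loads calls this for every decoded JSON object; buckets with a
    # "key" field become CreditCard instances instead of plain dicts
    if 'key' in obj:
        return CreditCard(obj['key'], xy)
    return obj

# stand-in payload; in the real code this would be response.content
payload = '{"aggregations": {"uniqCards": {"buckets": [{"key": "12345678901"}]}}}'

parsed = json.loads(payload, object_hook=object_decoder)
cards = parsed["aggregations"]["uniqCards"]["buckets"]  # a list of CreditCard objects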

Optimizing performance using a big dictionary in Python

Background:
I'm trying to create a simple Python program that allows me to take part of a transcript by its transcriptomic coordinates and get both its sequence and its genomic coordinates.
I'm not an experienced bioinformatician or programmer, more a biologist, but the way I thought about doing it is to split each transcript into its nucleotides and store, along with each nucleotide, in a tuple, both its genomic coordinates and its coordinates inside the transcript. That way I can use Python to take part of a certain transcript (say, the last 200 nucleotides) and get the sequence and the various genomic windows that construct it. The end goal is more complicated than that (the final program will receive a set of coordinates in the form of distances from the translation start site (ATG), randomly assign each coordinate to a random transcript, and output the sequence and genomic coordinates).
This is the code I wrote for this, which takes the information from a BED file containing the coordinates and sequence of each exon (along with information such as transcript length, position of the start (ATG) codon, and position of the stop codon):
from __future__ import print_function
from collections import OrderedDict
from collections import defaultdict
import time
import sys
import os

# accumulators for the parsed BED lines and the transcript IDs
content = []
all_transcripts = []

with open("canonicals_metagene_withseq.bed") as f:
    for line in f:
        content.append(line.strip().split())
        all_transcripts.append(line.strip().split()[3])

all_transcripts = list(OrderedDict.fromkeys(all_transcripts))
genes = dict.fromkeys(all_transcripts)

n = 0
for line in content:
    n += 1
    if genes[line[3]] is not None:
        seq = []
        i = 0
        for nucleotide in line[14]:
            seq.append((nucleotide, int(line[9])+i, int(line[1])+i))
            i += 1
        if line[5] == '+':
            genes[line[3]][5].extend(seq)
        elif line[5] == '-':
            genes[line[3]][5] = seq + genes[line[3]][5]
    else:
        seq = []
        i = 0
        for nucleotide in line[14]:
            seq.append((nucleotide, int(line[9])+i, int(line[1])+i))
            i += 1
        genes[line[3]] = [line[0], line[5], line[11], line[12], line[13], seq]
    sys.stdout.write("\r")
    sys.stdout.write(str(n))
    sys.stdout.flush()
This is an example of how the BED file looks:
Chr Start End Transcript_ID Exon_Type Strand Ex_Start Ex_End Ex_Length Ts_Start Ts_End Ts_Length ATG Stop Sequence
chr1 861120 861180 uc001abw.1 5UTR + 0 60 80 0 60 2554 80 2126 GCAGATCCCTGCGG
chr1 861301 861321 uc001abw.1 5UTR + 60 80 80 60 80 2554 80 2126 GGAAAAGTCTGAAG
chr1 861321 861393 uc001abw.1 CDS + 0 72 2046 80 152 2554 80 2126 ATGTCCAAGGGGAT
chr1 865534 865716 uc001abw.1 CDS + 72 254 2046 152 334 2554 80 2126 AACCGGGGGCGGCT
chr1 866418 866469 uc001abw.1 CDS + 254 305 2046 334 385 2554 80 2126 AGTCCACACCCACT
I want to create a dictionary in which each transcript ID is a key, and the values stored are the length of the transcript, the chromosome it is in, the strand, the position of the ATG, the position of the stop codon and, most importantly, a list of tuples of the sequence.
Basically, the code works. However, once the dictionary starts to get big it runs very, very slowly.
So what I would like to know is: how can I make it run faster? Currently it gets intolerably slow at around the 60,000th line of the BED file. Perhaps there is a more efficient way to do what I'm trying to do, or just a better way to store the data.
The BED file is custom made, by the way, using awk from UCSC tables.
EDIT:
Sharing what I learned...
I now know that the bottleneck is in the creation of a large dictionary.
If I alter the program to iterate by genes and create a new list each time, using a similar mechanism, with this code:
from itertools import groupby

for transcript in groupby(content, lambda x: x[3]):
    a = list(transcript[1])
    b = a[0]
    gene = [b[0], b[5], b[11], b[12], b[13]]
    seq = []
    n += 1
    sys.stdout.write("\r")
    sys.stdout.write(str(n))
    sys.stdout.flush()
    for exon in a:
        i = 0
        for nucleotide in list(exon)[14]:
            seq.append((nucleotide, int(list(exon)[9])+i, int(list(exon)[1])+i))
            i += 1
    gene.append(seq)
It runs in less than 4 minutes, while the former version, which creates one big dictionary with all of the genes at once, takes an hour to run.
One thing you can do to make your code more efficient is to add to the dictionary as you read the file.
with open("canonicals_metagene_withseq.bed") as f:
for line in f:
all_transcripts.append(line.strip().split()[3])
add_to_content_dict(line)
and then your add_to_content_dict() function would look like the code inside the for line in content: loop (see here).
Also, you have to define your defaultdicts as such; I don't see where genes or any other dict is defined as a defaultdict.
This might be a good read. It details the practice of assigning, outside your loop, the dot-notation methods used inside the loop to local variables, to enhance performance, because then you aren't looking up the attribute on every iteration of the loop. For example, instead of
for line in f:
    all_transcripts.append(line.strip().split()[3])
you would have
f_split = str.split
f_strip = str.strip
f_append = all_transcripts.append
for line in f:
    f_append(f_split(f_strip(line))[3])
There are other goodies in that link about local variable access; it is, again, definitely worth the read.
You may also consider using Cython, PyInline, or Pyrex to push the heavy loops down into C, or running the script under PyPy, for efficiency when dealing with lots and lots of iteration and/or file I/O.
As for the data structure itself (which was your major concern), we're limited in Python in how much control we have over a dictionary's expansion. Big dicts do get heavier on memory consumption as they get bigger... but so do all data structures! You have a couple of options that may make a minute difference (storing strings as bytestrings / using a translation dict for encoded integers), but you may want to consider implementing a database instead of holding all that stuff in a Python dict during runtime.
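As a rough illustration of the database idea, here is a minimal sqlite3 sketch that stores one row per nucleotide on disk instead of building the whole dictionary in RAM. The table layout and names are only illustrative, not taken from the OP's code:
import sqlite3

conn = sqlite3.connect("transcripts.db")
conn.execute("""CREATE TABLE IF NOT EXISTS nucleotides (
                    transcript_id TEXT,
                    tx_pos INTEGER,
                    genome_pos INTEGER,
                    base TEXT)""")

def add_exon(transcript_id, tx_start, genome_start, sequence):
    # one row per nucleotide, written to disk instead of held in a Python dict
    rows = [(transcript_id, tx_start + i, genome_start + i, base)
            for i, base in enumerate(sequence)]
    conn.executemany("INSERT INTO nucleotides VALUES (?, ?, ?, ?)", rows)

add_exon("uc001abw.1", 0, 861120, "GCAGATCCCTGCGG")
conn.commit()

# later: fetch, say, the last 200 nucleotides of a transcript on demand
cur = conn.execute("SELECT base, tx_pos, genome_pos FROM nucleotides "
                   "WHERE transcript_id = ? ORDER BY tx_pos DESC LIMIT 200",
                   ("uc001abw.1",))
last_200 = cur.fetchall()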

Database Compression in Python

I have hourly logs like
user1:joined
user2:log out
user1:added pic
user1:added comment
user3:joined
I want to compress all the flat files down to one file. There are around 30 million users in the logs, and I just want the latest log entry for each user.
My end result is that I want the log to look like:
user1:added comment
user2:log out
user3:joined
Now my first attempt on a small scale was to just do a dict like
log['user1'] = "added comment"
Will a dict of 30 million key/value pairs have a giant memory footprint? Or should I use something like SQLite to store them, then just put the contents of the SQLite table back into a file?
If you intern() each log entry then you'll use only one string object for each distinct log entry, regardless of the number of times it shows up, thereby lowering memory usage a lot.
>>> a = 'foo'
>>> b = 'foo'
>>> a is b
True
>>> b = 'f' + ('oo',)[0]
>>> a is b
False
>>> a = intern('foo')
>>> b = intern('f' + ('oo',)[0])
>>> a is b
True
You could also process the log lines in reverse -- then use a set to keep track of which users you've seen:
s = set()

# note, this piece is inefficient in that I'm reading all the lines
# into memory in order to reverse them... There are recipes out there
# for reading a file in reverse.
lines = open('log').readlines()
lines.reverse()

for line in lines:
    line = line.strip()
    user, op = line.split(':')
    if user not in s:
        print line
        s.add(user)
The various dbm modules (dbm in Python 3, or anydbm, gdbm, dbhash, etc. in Python 2) let you create simple databases of key to value mappings. They are stored on the disk so there is no huge memory impact. And you can store them as logs if you wish to.
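A minimal sketch of that idea using Python 3's dbm module, keeping only the latest action per user in an on-disk store and then writing it back out as a flat file (the file names are just examples):
import dbm

with dbm.open("latest_actions", "c") as db:
    with open("hourly.log") as f:
        for line in f:
            user, action = line.rstrip("\n").split(":", 1)
            db[user] = action  # later lines simply overwrite earlier ones

    # dump the compressed result back out as a flat file
    with open("compressed.log", "w") as out:
        for key in db.keys():
            out.write(key.decode() + ":" + db[key].decode() + "\n")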
This sounds like the perfect kind of problem for a MapReduce solution. See, for example:
http://en.wikipedia.org/wiki/MapReduce
Hadoop
It's pretty easy to mock up the data structure to see how much memory it would take.
Something like this, where you could change gen_string to generate data that approximates the messages:
import random
from commands import getstatusoutput as gso

def gen_string():
    return str(random.random())

d = {}
for z in range(10**6):
    d[gen_string()] = gen_string()

print gso('ps -eo %mem,cmd |grep test.py')[1]
On a one gig netbook:
0.4 vim test.py
0.1 /bin/bash -c time python test.py
11.7 /usr/bin/python2.6 test.py
0.1 sh -c { ps -eo %mem,cmd |grep test.py; } 2>&1
0.0 grep test.py
real 0m26.325s
user 0m25.945s
sys 0m0.377s
... So it's using about 10% of 1 gig for a million records.
But it would also depend on how much data redundancy you have ...
Thanks to @Ignacio for intern() -
def procLog(logName, userDict):
    inf = open(logName, 'r')
    for ln in inf.readlines():
        name, act = ln.split(':')
        userDict[name] = intern(act)
    inf.close()
    return userDict

def doLogs(logNameList):
    userDict = {}
    for logName in logNameList:
        userDict = procLog(logName, userDict)
    return userDict

def writeOrderedLog(logName, userDict):
    keylist = userDict.keys()
    keylist.sort()
    outf = open(logName, 'w')
    for k in keylist:
        outf.write(k + ':' + userDict[k])
    outf.close()

def main():
    mylogs = ['log20101214', 'log20101215', 'log20101216']
    d = doLogs(mylogs)
    writeOrderedLog('cumulativeLog', d)
The question, then, is how much memory this will consume.
import random
import sys

def makeUserName():
    ch = random.choice
    syl = ['ba','ma','ta','pre','re','cu','pro','do','tru','ho','cre','su','si','du','so','tri','be','hy','cy','ny','quo','po']
    # 22**5 is about 5.1 million potential names
    return ch(syl).title() + ch(syl) + ch(syl) + ch(syl) + ch(syl)

ch = random.choice
states = ['joined', 'added pic', 'added article', 'added comment', 'voted', 'logged out']

d = {}
t = []
for i in xrange(1000):
    for j in xrange(8000):
        d[makeUserName()] = ch(states)
    t.append((len(d), sys.getsizeof(d)))
which results in a plot of memory usage (horizontal axis = number of user names, vertical axis = memory usage in bytes) that is... slightly weird. It looks like a dictionary preallocates quite a lot of memory, then doubles it every time it gets too full?
Anyway, 4 million users takes just under 100MB of RAM - but it actually reallocates at around 3 million users (from roughly 50MB), so if the doubling holds, you will need about 800MB of RAM to process 24 to 48 million users.

Get actual disk space of a file

How do I get the actual file size on disk in Python (the actual size it takes up on the hard drive)?
UNIX only:
import os
from collections import namedtuple

_ntuple_diskusage = namedtuple('usage', 'total used free')

def disk_usage(path):
    """Return disk usage statistics about the given path.

    Returned value is a named tuple with attributes 'total', 'used' and
    'free', which are the amount of total, used and free space, in bytes.
    """
    st = os.statvfs(path)
    free = st.f_bavail * st.f_frsize
    total = st.f_blocks * st.f_frsize
    used = (st.f_blocks - st.f_bfree) * st.f_frsize
    return _ntuple_diskusage(total, used, free)
Usage:
>>> disk_usage('/')
usage(total=21378641920, used=7650934784, free=12641718272)
>>>
Edit 1 - also for Windows: https://code.activestate.com/recipes/577972-disk-usage/?in=user-4178764
Edit 2 - this is also available in Python 3.3+: https://docs.python.org/3/library/shutil.html#shutil.disk_usage
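For the Python 3.3+ route, the shutil version is a one-liner and returns the same kind of named tuple as the recipe above:
import shutil

usage = shutil.disk_usage("/")  # named tuple: usage(total=..., used=..., free=...)
print(usage.total, usage.used, usage.free)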
Here is the correct way to get a file's size on disk, on platforms where st_blocks is set:
import os

def size_on_disk(path):
    st = os.stat(path)
    return st.st_blocks * 512
Other answers that tell you to multiply by os.stat(path).st_blksize or os.statvfs(path).f_bsize are simply incorrect.
The Python documentation for os.stat_result.st_blocks very clearly states:
st_blocks
Number of 512-byte blocks allocated for file. This may be smaller than st_size/512 when the file has holes.
Furthermore, the stat(2) man page says the same thing:
blkcnt_t st_blocks; /* Number of 512B blocks allocated */
Update 2021-03-26: Previously, my answer rounded the logical size of the file up to an integer multiple of the block size. This approach only works if the file is stored in a contiguous sequence of blocks on disk (or if all the blocks are full except for one). Since this is a special case (though common for small files), I have updated my answer to make it more generally correct. However, note that unfortunately the statvfs method and the st_blocks value may not be available on some systems (e.g., Windows 10).
Call os.stat(filename).st_blocks to get the number of blocks in the file.
Call os.statvfs(filename).f_bsize to get the filesystem block size.
Then compute the correct size on disk, as follows:
num_blocks = os.stat(filename).st_blocks
block_size = os.statvfs(filename).f_bsize
sizeOnDisk = num_blocks*block_size
st = os.stat(…)
du = st.st_blocks * st.st_blksize
Practically 12 years and no answer on how to do this on Windows...
Here's how to find the 'Size on disk' on Windows via ctypes:
import ctypes

def GetSizeOnDisk(path):
    '''https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-getcompressedfilesizew'''
    filesizehigh = ctypes.c_ulonglong(0)  # not sure about this... something about files >4gb
    return ctypes.windll.kernel32.GetCompressedFileSizeW(ctypes.c_wchar_p(path), ctypes.pointer(filesizehigh))

'''
>>> os.stat(somecompressedorofflinefile).st_size
943141
>>> GetSizeOnDisk(somecompressedorofflinefile)
671744
>>>
'''
I'm not certain if this is size on disk, or the logical size:
import os
filename = "/home/tzhx/stuff.wev"
size = os.path.getsize(filename)
If it's not the droid you're looking for, you can round it up by dividing by the cluster size (as a float), then using ceil, then multiplying.
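A short sketch of that rounding idea; cluster_size here is an assumption (4096 bytes is a common allocation-unit size, but check your filesystem):
import math
import os

cluster_size = 4096  # assumed allocation-unit size; depends on the filesystem
logical_size = os.path.getsize("/home/tzhx/stuff.wev")
rounded_up = int(math.ceil(logical_size / float(cluster_size))) * cluster_size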
To get the disk usage for a given file/folder, you can do the following:
import os
def disk_usage(path):
    """Return cumulative number of bytes for a given path."""
    # get total usage of current path
    total = os.path.getsize(path)
    # if path is dir, collect children
    if os.path.isdir(path):
        for file_name in os.listdir(path):
            child = os.path.join(path, file_name)
            # recursively get byte use for children
            total += disk_usage(child)
    return total
The function recursively collects byte usage for files nested within a given path, and returns the cumulative use for the entire path.
You could also add a print("{}: {}".format(path, total)) in there if you want the information for each file to be printed.
