Python 3.7.4 -> How to keep memory usage low?

The following code retrieves the unique cards from a given database and builds an index of them.
for x in range(2010, 2015):
    for y in range(1, 13):
        index = str(x) + "-" + str("0" + str(y) if y < 10 else y)
        url = urlBase.replace("INDEX", index)
        response = requests.post(url, data=query, auth=(user, pwd))
        if response.status_code != 200:
            continue
        # this is a big JSON, around 4 MB each
        parsedJson = json.loads(response.content)["aggregations"]["uniqCards"]["buckets"]
        for z in parsedJson:
            valKey = 0
            ind = 0
            header = str(z["key"])[:8]
            if header in headers:
                ind = headers.index(header)
            else:
                headers.append(header)
            valKey = int(str(ind) + str(z["key"])[8:])
            creditCards.append(CreditCard(valKey, x * 100 + y))
Each CreditCard object, the only thing that survives the scope, is around 64 bytes.
After running, this code was supposed to map around 10 million cards. That would translate to 640 million bytes, or around 640 MB.
The problem is that midway through this operation, memory consumption hits about 3 GB...
My first guess is that, for some reason, the GC is not collecting the parsedJson. What should I do to keep memory consumption under control? Can I dispose of that object manually?
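For the "can I dispose of it manually" part, here is a minimal sketch of one option: parse each response inside a helper so the big intermediate tree goes out of scope as soon as the month is done (the header/index bookkeeping is simplified, the CreditCard class is the one from Edit 1 below, and the del/gc.collect() calls are shown only because the question asks about manual disposal; CPython normally frees unreferenced objects on its own):
import gc
import json

def extract_cards(raw_body, month_code, headers, creditCards):
    """Parse one ~4 MB response body, keep only the small CreditCard
    objects, and let the big parsed tree die with the function scope."""
    buckets = json.loads(raw_body)["aggregations"]["uniqCards"]["buckets"]
    for z in buckets:
        header = str(z["key"])[:8]
        if header not in headers:
            headers.append(header)
        ind = headers.index(header)
        creditCards.append(CreditCard(int(str(ind) + str(z["key"])[8:]), month_code))
    del buckets      # drop the only reference to the parsed tree
    gc.collect()     # usually redundant: non-cyclic garbage is freed as soon as the refcount hits zero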
Edit 1:
The CreditCard class is defined as:
class CreditCard:
    number = 0
    knownSince = 0

    def __init__(self, num, date):
        self.number = num
        self.knownSince = date
Edit 2:
When creditCards.__len__() reaches 3.5 million, sys.getsizeof(creditCards) reports 31 MB, but the process is consuming 2 GB!

The problem is json.loads: loading a 4 MB response results in a 5-8x memory jump.
Edit:
I managed to work around this using a custom mapper for the JSON:
def object_decoder(obj):
    if obj.__contains__('key'):
        return CreditCard(obj['key'], xy)
    return obj
Now the memory grows slowly, and I've been able to parse the whole set using around 2 GB of memory.
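For completeness, the decoder presumably gets wired in through json.loads' object_hook; a minimal sketch, where xy stands for the x*100+y month code from the enclosing loop:
# object_hook is called for every JSON object (dict) as it is decoded,
# so the bucket dicts are replaced by small CreditCard instances on the fly.
parsed = json.loads(response.content, object_hook=object_decoder)
buckets = parsed["aggregations"]["uniqCards"]["buckets"]
creditCards.extend(b for b in buckets if isinstance(b, CreditCard))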

Related

Python consumes excessive memory, doesn't complete run even given adequate memory

I am writing code for an information retrieval project. It reads Wikipedia pages in XML format from a file, processes the strings (I've omitted this part for the sake of simplicity), tokenizes them, and builds positional indexes for the terms found on the pages. It then saves the indexes to a file using pickle once, and reads them back from that file on later runs to save processing time (I've included the code for those parts, but it's commented out).
After that, I need to fill a 1572 x ~97000 matrix (1572 is the number of Wiki pages, and ~97000 is the number of terms found in them). Each Wiki page is treated as a vector of words, and vectors[i][j] is the number of occurrences of the j'th word of the word set on the i'th Wiki page. (Again, this has been simplified, but it doesn't matter.)
The problem is that the code takes far too much memory to run, and even then, from some point between the 350th and 400th row of the matrix onward it stops making progress (without terminating either). I thought the problem was memory, because when usage exceeded my 7.7 GiB of RAM and 1.7 GiB of swap, it stopped and printed:
Process finished with exit code 137 (interrupted by signal 9: SIGKILL)
But when I added 6 GiB of memory by making a swap file for Python 3.7 (using the script recommended here), the program didn't run out of memory, but instead got stuck (with 7.7 GiB of RAM + 3.9 GiB of swap occupied), as I said, at a point between the 350th and 400th iteration of i in the loop at the bottom. Instead of Ubuntu 18.04 I tried it on Windows 10, where the screen simply went black. I tried it on Windows 7 as well, again to no avail.
Next I thought it was a PyCharm issue, so I ran the file with the python3 file.py command, and it got stuck at the very same point it had with PyCharm. I even used the numpy.float16 datatype to save memory, but it had no effect. I asked a colleague about their matrix dimensions; they were similar to mine, but they weren't having problems with it. Is it malware or a memory leak? Or am I doing something wrong here?
import pickle
from hazm import *
from collections import defaultdict
import numpy as np
'''For each word there's one of these. It stores the word's frequency, and the positions it has occurred in on each wiki page.'''
class Positional_Index:
    def __init__(self):
        self.freq = 0
        self.title = defaultdict(list)
        self.text = defaultdict(list)

'''Here I tokenize words and construct indexes for them'''
# tree = ET.parse('Wiki.xml')
# root = tree.getroot()
# index_dict = defaultdict(Positional_Index)
# all_content = root.findall('{http://www.mediawiki.org/xml/export-0.10/}page')
#
# for page_index, pg in enumerate(all_content):
#     title = pg.find('{http://www.mediawiki.org/xml/export-0.10/}title').text
#     txt = pg.find('{http://www.mediawiki.org/xml/export-0.10/}revision') \
#         .find('{http://www.mediawiki.org/xml/export-0.10/}text').text
#
#     title_arr = word_tokenize(title)
#     txt_arr = word_tokenize(txt)
#
#     for term_index, term in enumerate(title_arr):
#         index_dict[term].freq += 1
#         index_dict[term].title[page_index] += [term_index]
#
#     for term_index, term in enumerate(txt_arr):
#         index_dict[term].freq += 1
#         index_dict[term].text[page_index] += [term_index]
#
# with open('texts/indices.txt', 'wb') as f:
#     pickle.dump(index_dict, f)

with open('texts/indices.txt', 'rb') as file:
    data = pickle.load(file)

'''Here I'm trying to keep the number of occurrences of each word on each page'''
page_count = 1572
vectors = np.array([[0 for j in range(len(data.keys()))] for i in range(page_count)], dtype=np.float16)
words = list(data.keys())
word_count = len(words)
const_log_of_d = np.log10(1572)

""" :( """
for i in range(page_count):
    for j in range(word_count):
        vectors[i][j] = (len(data[words[j]].title[i]) + len(data[words[j]].text[i]))
    if i % 50 == 0:
        print("i:", i)
Update: I tried this on a friend's computer; this time it killed the process somewhere between the 1350th and 1400th iteration.
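One thing worth noting about the snippet above (a sketch, not a full diagnosis): the nested list comprehension materialises a 1572 x ~97000 Python list of lists just to initialise the array, and that temporary structure is far larger than the float16 array itself. Allocating the ndarray directly avoids it:
import numpy as np

page_count = 1572
word_count = 97000   # approximate number of distinct terms, per the question

# The float16 array is only about 1572 * 97000 * 2 bytes (~300 MB);
# np.zeros allocates it in one step, with no intermediate Python lists.
vectors = np.zeros((page_count, word_count), dtype=np.float16)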

Optimizing speed for bulk SSL API requests

I have written a script that makes around 530 API calls, which I intend to run every 5 minutes; from these calls I store data to process in bulk later (prediction, etc.).
The API has a limit of 500 requests per second. However, when running my code I am seeing about 2 seconds per call (due to SSL, I believe).
How can I speed this up so I can run 500 requests within 5 minutes? The current time required renders the data I am collecting useless :(
Code:
def getsurge(lat, long):
    response = client.get_price_estimates(
        start_latitude=lat,
        start_longitude=long,
        end_latitude=-34.063676,
        end_longitude=150.815075
    )
    result = response.json.get('prices')
    return result

def writetocsv(database):
    database_writer = csv.writer(database)
    database_writer.writerow(HEADER)
    pool = Pool()
    # Open Estimate Database
    while True:
        for data in coordinates:
            line = data.split(',')
            long = line[3]
            lat = line[4][:-2]
            estimate = getsurge(lat, long)
            timechecked = datetime.datetime.now()
            for d in estimate:
                if d['display_name'] == 'TAXI':
                    database_writer.writerow([timechecked, [line[0], line[1]], d['surge_multiplier']])
                    database.flush()
                    print(timechecked, [line[0], line[1]], d['surge_multiplier'])
Is the API under your control? If so, create an endpoint that can give you all the data you need in one go.
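If changing the API isn't an option, a common pattern is to reuse one HTTP session (so the TLS handshake isn't paid on every call) and fan the requests out over a thread pool. A rough sketch only: fetch_estimate and the URL below are hypothetical stand-ins for the client.get_price_estimates call above.
import concurrent.futures
import requests

session = requests.Session()   # keeps the TLS connection alive between requests

def fetch_estimate(coord):
    lat, lng = coord
    # Hypothetical endpoint; substitute the real client call here.
    resp = session.get("https://api.example.com/estimates",
                       params={"start_latitude": lat, "start_longitude": lng})
    resp.raise_for_status()
    return resp.json().get("prices")

coords = [(-33.87, 151.21), (-34.06, 150.82)]   # sample coordinates
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    results = list(pool.map(fetch_estimate, coords))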

Increasing memory limit in Python?

I am currently using a function that builds extremely large dictionaries (used to compare DNA strings), and sometimes I get a MemoryError.
Is there a way to allot more memory to Python so it can deal with more data at once?
Python doesn't limit memory usage by your program. It will allocate as much memory as your program needs, until your computer runs out of memory. The most you can do is impose a fixed upper cap; that can be done with the resource module, but it isn't what you're looking for.
You'd need to look at making your code more memory/performance friendly.
Python raises MemoryError when you hit the limit of your system RAM, unless you've set a lower limit manually with the resource package.
Defining your class with __slots__ tells the Python interpreter that the attributes/members of your class are fixed, which can lead to significant memory savings!
Using __slots__ cuts down on dict creation by the interpreter: it tells the interpreter not to create a per-instance __dict__, and attribute storage is laid out once and reused.
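A minimal sketch of what that looks like (the class and attribute names are illustrative):
class DnaRecord:
    # With __slots__, instances have no per-object __dict__;
    # attribute storage is fixed, which saves a lot of memory
    # when millions of small objects are created.
    __slots__ = ('sequence', 'position')

    def __init__(self, sequence, position):
        self.sequence = sequence
        self.position = position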
If the memory consumed by your Python process continues to grow over time, it is usually a combination of:
How the C memory allocator in Python works. This is essentially memory fragmentation: the allocator cannot call free unless an entire memory chunk is unused, and chunk usage is usually not perfectly aligned with the objects you are creating and using.
Using a large number of small strings to compare data. Python interns some strings internally, but creating many small strings still puts load on the interpreter.
One approach is to do the work in a worker thread or a single-threaded pool, then discard or kill the worker to free up the resources attached to or used in it.
The code below creates a single-thread worker:
import concurrent.futures
import logging
import threading

logger = logging.getLogger(__name__)

# (Note: __slots__ only takes effect inside a class body; assigning it at
# module level, as in the original snippet, has no effect.)
lock = threading.Lock()
errorResultMap = []

def process_dna_compare(dna1, dna2):
    '''max_workers=1 creates a single-threaded pool'''
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
        futures = {executor.submit(getDnaDict, lock, dna_key): dna_key for dna_key in dna1}
        dna_differences_map = {}
        count = 0
        dna_processed = False
        for future in concurrent.futures.as_completed(futures):
            result_dict = future.result()
            if result_dict:
                count += 1
                '''Do your processing XYZ here'''
        logger.info('Total dna keys processed ' + str(count))

def getDnaDict(lock, dna_key):
    '''process dna_key here and return the resulting item'''
    try:
        dataItem = item[0]   # placeholder: look up the data for dna_key
        return dataItem
    except Exception:
        with lock:
            errorResultMap.append({'dna_key': dna_key,
                                   'error': 'No data for dna found'})
        logger.error('Error in processing dna: ' + dna_key)

if __name__ == "__main__":
    dna1 = '''get data for dna1'''
    dna2 = '''get data for dna2'''
    process_dna_compare(dna1, dna2)
    if errorResultMap:
        ''' print or write the errorResultMap to a file '''
The code below will help you understand memory usage:
import objgraph
import random
import inspect

class Dna(object):
    def __init__(self):
        self.val = None

    def __str__(self):
        return "dna - val: {0}".format(self.val)

def f():
    l = []
    for i in range(3):
        dna = Dna()
        # print("id of dna: {0}".format(id(dna)))
        # print("dna is: {0}".format(dna))
        l.append(dna)
    return l

def main():
    d = {}
    l = f()
    d['k'] = l
    print("list l has {0} objects of type Dna()".format(len(l)))
    objgraph.show_most_common_types()
    objgraph.show_backrefs(random.choice(objgraph.by_type('Dna')),
                           filename="dna_refs.png")
    objgraph.show_refs(d, filename='myDna-image.png')

if __name__ == "__main__":
    main()
Output for memory usage:
list l has 3 objects of type Dna()
function 2021
wrapper_descriptor 1072
dict 998
method_descriptor 778
builtin_function_or_method 759
tuple 667
weakref 577
getset_descriptor 396
member_descriptor 296
type 180
For more reading on slots, see: https://elfsternberg.com/2009/07/06/python-what-the-hell-is-a-slot/
Try updating your Python from 32-bit to 64-bit.
Simply type python at the command line and the startup banner will show which build you have; 32-bit Python can address comparatively little memory.
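If you'd rather check programmatically than read the banner, this standard-library one-liner reports the pointer width of the running interpreter:
import struct
print(struct.calcsize("P") * 8)   # prints 32 or 64 depending on the build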

memory dump using Python

I have a small program, written for me in Python, to help me generate all combinations of passwords from the different sets of numbers and words I know, so I can recover a password I forgot. Since I know all the words and sets of numbers I used, I just wanted to generate all possible combinations. The only problem is that the list seems to go on for hours and hours, so eventually I run out of memory and it doesn't finish.
I was told it needs to dump my memory so it can carry on, but I'm not sure if this is right. Is there any way I can get around this problem?
This is the program I am running:
#!/usr/bin/python
import itertools
gfname = "name"
tendig = "1234567890"
sixteendig = "1111111111111111"
housenum = "99"
Characterset1 = "&&&&"
Characterset2 = "££££"
daughternam = "dname"
daughtyear = "1900"
phonenum1 = "055522233"
phonenum2 = "3333333"
mylist = [gfname, tendig, sixteendig, housenum, Characterset1,
          Characterset2, daughternam, daughtyear, phonenum1, phonenum2]
for length in range(1, len(mylist)+1):
    for item in itertools.permutations(mylist, length):
        print "".join(item)
I have taken out a few sets and changed the numbers and words for obvious reasons, but this is roughly the program.
Another thing: I may be missing a particular word, but I didn't want to put it in the list because I know it would go before all the generated passwords. Does anyone know how to add a prefix to my program?
Sorry for the bad grammar, and thanks for any help.
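On the prefix question, one simple variation is to prepend the extra word when joining; a sketch, where extraword is a hypothetical placeholder for whatever was left out of the list:
extraword = "something"   # hypothetical word suspected to come first
for length in range(1, len(mylist) + 1):
    for item in itertools.permutations(mylist, length):
        print(extraword + "".join(item))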
I used guppy to understand the memory usage; I changed the OP's code slightly (marked # !!!).
import itertools
gfname = "name"
tendig = "1234567890"
sixteendig = "1111111111111111"
housenum = "99"
Characterset1 = "&&&&"
Characterset2 = u"££££"
daughternam = "dname"
daughtyear = "1900"
phonenum1 = "055522233"
phonenum2 = "3333333"
from guppy import hpy # !!!
h=hpy() # !!!
mylist = [gfname, tendig, sixteendig, housenum, Characterset1,
          Characterset2, daughternam, daughtyear, phonenum1, phonenum2]
for length in range(1, len(mylist)+1):
    print h.heap() # !!!
    for item in itertools.permutations(mylist, length):
        print item # !!!
Guppy outputs something like this every time h.heap() is called.
Partition of a set of 25914 objects. Total size = 3370200 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 11748 45 985544 29 985544 29 str
1 5858 23 472376 14 1457920 43 tuple
2 323 1 253640 8 1711560 51 dict (no owner)
3 67 0 213064 6 1924624 57 dict of module
4 199 1 210856 6 2135480 63 dict of type
5 1630 6 208640 6 2344120 70 types.CodeType
6 1593 6 191160 6 2535280 75 function
7 199 1 177008 5 2712288 80 type
8 124 0 135328 4 2847616 84 dict of class
9 1045 4 83600 2 2931216 87 __builtin__.wrapper_descriptor
Running python code.py > code.log and then fgrep Partition code.log shows:
Partition of a set of 25914 objects. Total size = 3370200 bytes.
Partition of a set of 25924 objects. Total size = 3355832 bytes.
Partition of a set of 25924 objects. Total size = 3355728 bytes.
Partition of a set of 25924 objects. Total size = 3372568 bytes.
Partition of a set of 25924 objects. Total size = 3372736 bytes.
Partition of a set of 25924 objects. Total size = 3355752 bytes.
Partition of a set of 25924 objects. Total size = 3372592 bytes.
Partition of a set of 25924 objects. Total size = 3372760 bytes.
Partition of a set of 25924 objects. Total size = 3355776 bytes.
Partition of a set of 25924 objects. Total size = 3372616 bytes.
Which I believe shows that the memory footprint stays fairly consistent.
Granted, I may be misinterpreting the results from guppy; during my tests, though, I deliberately added a new string to a list to see whether the object count increased, and it did.
For those interested, I had to install guppy like so on OS X Mountain Lion:
pip install https://guppy-pe.svn.sourceforge.net/svnroot/guppy-pe/trunk/guppy
In summary, I don't think this is a running-out-of-memory issue, although granted we're not using the full OP dataset.
How about using IronPython and Visual Studio for its debug tools (which are pretty good)? You should be able to pause execution and look at the memory (essentially a memory dump).
Your program will run pretty efficiently by itself, as you now know. But make sure you don't just run it in IDLE, for example; that will slow it down to a crawl as IDLE updates the screen with more and more lines. Save the output directly into a file.
Even better: Have you thought about what you'll do when you have the passwords? If you can log on to the lost account from the command line, try doing that immediately instead of storing all the passwords for later use:
for length in range(1, len(mylist)+1):
    for item in itertools.permutations(mylist, length):
        password = "".join(item)
        try_to_logon(command, password)
To answer the above comment from @shaun: if you want to look at the output in Notepad, just run your file like so:
Myfile.py >output.txt
If the text file doesn't exist it will be created.
EDIT:
Replace the line in your code at the bottom which reads:
print "".join(item)
with this:
with open("output.txt", "a") as f:
    f.write("".join(item) + "\n")
    # the with-block closes the file automatically
which will produce a file called output.txt.
Should work (haven't tested)

How to download huge Oracle LOB with cx_Oracle on memory constrained system?

I'm developing part of a system where processes are limited to about 350MB of RAM; we use cx_Oracle to download files from an external system for processing.
The external system stores files as BLOBs, and we can grab them doing something like this:
# ... set up Oracle connection, then
cursor.execute(u"""SELECT filename, data, filesize
FROM FILEDATA
WHERE ID = :id""", id=the_one_you_wanted)
filename, lob, filesize = cursor.fetchone()
with open(filename, "w") as the_file:
the_file.write(lob.read())
lob.read() will obviously fail with MemoryError when we hit a file larger than 300-350MB, so we've tried something like this instead of reading it all at once:
read_size = 0
chunk_size = lob.getchunksize() * 100
while read_size < filesize:
    data = lob.read(chunk_size, read_size + 1)
    read_size += len(data)
    the_file.write(data)
Unfortunately, we still get MemoryError after several iterations. From the time lob.read() is taking, and the out-of-memory condition we eventually get, it looks as if lob.read() is pulling (chunk_size + read_size) bytes from the database every time. That is, reads are taking O(n) time and O(n) memory, even though the buffer is quite a bit smaller.
To work around this, we've tried something like:
read_size = 0
while read_size < filesize:
    q = u'''SELECT dbms_lob.substr(data, 2000, %s)
            FROM FILEDATA WHERE ID = :id''' % (read_size + 1)
    cursor.execute(q, id=filedataid[0])
    row = cursor.fetchone()
    read_size += len(row[0])
    the_file.write(row[0])
This pulls 2000 bytes (argh) at a time, and takes forever (something like two hours for a 1.5GB file). Why 2000 bytes? According to the Oracle docs, dbms_lob.substr() stores its return value in a RAW, which is limited to 2000 bytes.
Is there some way I can store the dbms_lob.substr() results in a larger data object and read maybe a few megabytes at a time? How do I do this with cx_Oracle?
I think that the argument order in lob.read() is reversed in your code. The first argument should be the offset, the second argument should be the amount to read. This would explain the O(n) time and memory usage.
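If so, the chunked loop from the question would look roughly like this with the arguments swapped (cx_Oracle's LOB.read takes a 1-based offset first, then the amount); a sketch only, reusing the names from the question:
read_size = 0
chunk_size = lob.getchunksize() * 100
while read_size < filesize:
    data = lob.read(read_size + 1, chunk_size)   # offset (1-based) first, then amount
    if not data:
        break
    read_size += len(data)
    the_file.write(data)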
