I have a program which reads a text file from standard input and aggregates the lines into a dictionary. However, the input file is very large (on the order of 1 TB) and I won't have enough space to store the whole dictionary in memory (the machine has 64 GB of RAM). Currently I've got a very simple clause which outputs the dictionary once it has reached a certain length (in this case 100) and clears the memory. The output can then be aggregated at a later point.
So I want to output the dictionary once memory is full. What is the best way of managing this? Is there a function which gives me the current memory usage? Is it costly to keep checking it? Am I using the right tactic?
import sys

X_dic = dict()

# Used to print the dictionary in required format
def print_dic(dic):
    for key, value in dic.iteritems():
        print "{0}\t{1}".format(key, value)

for line in sys.stdin:
    value, key = line.strip().split(",")
    if key not in X_dic:
        X_dic[key] = []
    X_dic[key].append(value)
    # Limit size of dic.
    if len(X_dic) == 100:
        print_dic(X_dic)  # Print and clear dictionary
        X_dic = dict()

# Now output
print_dic(X_dic)
The resource module provides some information on how many resources (memory, etc.) you are using. See its documentation for a nice little usage example.
On a Linux system (I don't know where you are) you can watch the contents of the file /proc/meminfo. As part of the proc file system it is updated automatically.
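A minimal sketch of both checks (note that ru_maxrss is the peak resident set size, its unit differs between platforms, and the /proc path is Linux-only):

import resource

def peak_rss_kb():
    # peak resident set size of this process; on Linux the value is in kB
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

def available_memory_kb():
    # Linux-only: parse MemAvailable (or MemFree on older kernels) from /proc/meminfo
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1])  # reported in kB
    return None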
But I object to the whole strategy of monitoring memory and using up as much of it as possible, actually. I'd rather propose dumping the dictionary regularly (after 1M entries have been added, or so). Keeping the dict small will probably speed up your program; it presumably also has advantages for later processing if all dumps are of similar size. If you dump a huge dict that only fit into memory because nothing else was using memory at the time, you will later have trouble re-reading that dict if something else is using some of your memory. Then you would have to create a situation in which nothing else is using memory (e.g. reboot or similar). Not very convenient.
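For example, a minimal sketch of that approach applied to your code (the 1,000,000 threshold is just an illustrative batch size, and print_dic is your function from above):

import sys

BATCH_SIZE = 1000000  # dump after this many distinct keys; tune to taste
X_dic = {}

for line in sys.stdin:
    value, key = line.strip().split(",")
    X_dic.setdefault(key, []).append(value)
    if len(X_dic) >= BATCH_SIZE:
        print_dic(X_dic)  # dump a batch of roughly constant size
        X_dic = {}

print_dic(X_dic)  # flush the remainder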
Related
I have a rather large dictionary with about 40 million keys which I naively stored just by writing {key: value, key: value, ...} into a text file. I didn't consider the fact that I could never realistically access this data because python has an aversion to loading and evaluating a 1.44GB text file as a dictionary.
I know I could use something like shelve to be able to access the data without reading all of it at once, but I'm not sure how I would even convert this text file to a shelve file without regenerating all the data (which I would prefer not to do). Are there any better alternatives for storing, accessing, and potentially later changing this much data? If not, how should I go about converting this monstrosity over to a format usable by shelve?
If it matters, the dictionary is of the form {(int, int, int, int): [[int, int], bool]}
Redis is an in-memory key-value store that can be used for this kind of problem.
There are several Python clients.
The hmset operation allows you to insert multiple key-value pairs at once.
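A minimal sketch with the redis-py client (assumes a Redis server running on localhost; since Redis keys and hash fields are strings, the tuple keys from your dict would need to be serialized, e.g. with json):

import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# store several key/value pairs at once in a single Redis hash named 'mydata'
r.hmset("mydata", {
    json.dumps([1, 2, 3, 4]): json.dumps([[5, 6], True]),
    json.dumps([7, 8, 9, 10]): json.dumps([[11, 12], False]),
})

# read a single entry back without loading everything
value = json.loads(r.hget("mydata", json.dumps([1, 2, 3, 4])))

(Newer versions of redis-py deprecate hmset in favour of hset with a mapping argument.)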
https://github.com/dagnelies/pysos
It works like a normal Python dict, but has the advantage that it's much more efficient than shelve on Windows, and is also cross-platform, unlike shelve, whose data storage differs based on the OS.
To install:
pip install pysos
Usage:
import pysos
db = pysos.Dict('somefile')
db['hello'] = 'persistence!'
Just to give a ballpark figure, here is a mini benchmark (on my windows laptop):
import time
import pysos

t = time.time()
N = 100 * 1000
db = pysos.Dict("test.db")
for i in range(N):
    db["key_" + str(i)] = {"some": "object_" + str(i)}
db.close()
print('PYSOS time:', time.time() - t)
# => PYSOS time: 3.424309253692627
The resulting file was about 3.5 Mb big.
So, in your case, if a million key/value pairs take roughly a minute to insert, it would take you almost an hour to insert them all. Of course, the machine's specs can influence that a lot. It's just a very rough estimate.
I am facing the following problem: I create a big data set (several tens of GB) of Python objects and want to create an output file in YAML format containing an entry for each object with information about the object saved as a nested dictionary. However, I never hold all the data in memory at the same time.
The output data should be stored in a dictionary mapping an object name to the saved values. A simple version would look like this:
object_1:
    value_1: 42
    value_2: 23
object_2:
    value_1: 17
    value_2: 13
[...]
object_a_lot:
    value_1: 47
    value_2: 11
To keep a low memory footprint, I would like to write the entry for each object and immediately delete it after writing. My current approach is as follows:
from yaml import dump

[...]  # initialize huge_object_list. Here it is still small

with open("output.yaml", "w") as yaml_file:
    for my_object in huge_object_list:
        my_object.compute()  # this blows up the size of the object
        # create a single entry for the top level dict
        object_entry = dump(
            {my_object.name: my_object.get_yaml_data()},
            default_flow_style=False,
        )
        yaml_file.write(object_entry)
        # delete the memory consuming stuff in the object,
        # keep other information which is needed later
        my_object.delete_big_stuff()
Basically I am writing several dictionaries, but each has only one key, and since the object names are unique this does not blow up. This works, but it feels like a bit of a hack, and I would like to ask if someone knows of a way to do this better/more properly.
Is there a way to write a big dictionary to a YAML file, one entry at a time?
If you want to write out a YAML file in stages, you can do it the way you describe.
If your keys are not guaranteed to be unique, then I would recommend using a sequence (i.e. a list) at the top level, even with one item, instead of a mapping.
This doesn't solve the problem of re-reading the file, as PyYAML will try to read the file as a whole, and that is not going to load quickly; keep in mind that the memory overhead PyYAML requires to load a file can easily be over 100x (a hundred times) the file size. My ruamel.yaml is somewhat better with respect to memory, but still requires several tens of times the file size in memory.
You can of course cut up a file based on "leading" spaces; a new key (or dash for an item, in case you use sequences) can easily be found that way. You can also look at storing each key-value pair in its own document within one file; that vastly reduces the overhead during loading if you combine the key-value pairs of the single documents yourself.
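A rough sketch of that one-document-per-entry idea with PyYAML, reusing the names from the question above (explicit_start emits a '---' separator before each entry, and safe_load_all hands the documents back one at a time):

import yaml

# writing: each entry becomes its own YAML document in the same file
with open("output.yaml", "w") as yaml_file:
    for my_object in huge_object_list:
        my_object.compute()
        yaml.dump({my_object.name: my_object.get_yaml_data()}, yaml_file,
                  explicit_start=True, default_flow_style=False)
        my_object.delete_big_stuff()

# reading: iterate over the documents one at a time and merge the single-key dicts yourself
merged = {}
with open("output.yaml") as yaml_file:
    for doc in yaml.safe_load_all(yaml_file):
        merged.update(doc)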
In similar situations I stored individual YAML "objects" in different files, using the filenames as keys to the "object" values. This requires an efficient filesystem (e.g. with tail packing) and depends on what is available on the OS your system runs on.
What's going on
I'm collecting data from a few thousand network devices every few minutes in Python 2.7.8 via package netsnmp. I'm also using fastsnmpy so that I can access the (more efficient) Net-SNMP command snmpbulkwalk.
I'm trying to cut down how much memory my script uses. I'm running three instances of the same script, each of which sleeps for two minutes before re-querying all devices for the data we want. When I created the original script in bash, the instances would use less than 500MB when active simultaneously. As I've converted this over to Python, however, each instance hogs 4GB, which indicates (to me) that my data structures need to be managed more efficiently. Even when idle they're consuming a total of 4GB.
Code Activity
My script begins with creating a list where I open a file and append the hostname of our target devices as separate values. These usually contain 80 to 1200 names.
expand = []
f = open(self.deviceList, 'r')
for line in f:
    line = line.strip()
    expand.append(line)
From there I set up the SNMP sessions and execute the requests
expandsession = SnmpSession(timeout=1000000,
                            retries=1,         # I slightly modified the original fastsnmpy
                            verbose=debug,     # to reduce verbose messages and limit
                            oidlist=var,       # the number of attempts to reach devices
                            targets=expand,
                            community='expand')
expandresults = expandsession.multiwalk(mode = 'bulkwalk')
Because of how both SNMP packages behave, the device responses are parsed up into lists and stored into one giant data structure. For example,
for output in expandresults:
    print output.hostname, output.iid, output.val

# Example output:
# host1 1 1
# host1 2 2
# host1 3 3
# host2 1 4
# host2 2 5
# host2 3 6
# (The 'output' object itself cannot be printed directly; the value returned from it is obscure.)
...
I'm having to iterate through each response, combine related data, then output each device's complete response. This is a bit difficult. For example,
host1,1,2,3
host2,4,5,6
host3,7,8,9,10,11,12
host4,13,14
host5,15,16,17,18
...
Each device has a varying number of responses. I can't loop through expecting every device having a uniform arbitrary number of values to combine into a string to write out to a CSV.
How I'm handling the data
I believe it is here where I'm consuming a lot of memory but I cannot resolve how to simplify the process while simultaneously removing visited data.
expandarrays = dict()
for output in expandresults:
    if output.val is not None:
        if output.hostname in expandarrays:
            expandarrays[output.hostname] += ',' + output.val
        else:
            expandarrays[output.hostname] = ',' + output.val

for key in expandarrays:
    self.WriteOut(key, expandarrays[key])
Currently I'm creating a new dictionary, checking that the device response is not null, then appending the response value to a string that will be used to write out to the CSV file.
The problem with this is that I'm essentially cloning the existing dictionary, meaning I'm using twice as much system memory. I'd like to remove values that I've visited in expandresults when I move them to expandarrays so that I'm not using so much RAM. Is there an efficient method of doing this? Is there also a better way of reducing the complexity of my code so that it's easier to follow?
The Culprit
Thanks to those who answered. For those in the future that stumble across this thread due to experiencing similar issues: the fastsnmpy package is the culprit behind the large use of system memory. The multiwalk() function creates a thread for each host but does so all at once rather than putting some kind of upper limit. Since each instance of my script would handle up to 1200 devices that meant 1200 threads were instantiated and queued within just a few seconds. Using the bulkwalk() function was slower but still fast enough to suit my needs. The difference between the two was 4GB vs 250MB (of system memory use).
If the device responses are in order and are grouped together by host, then you don't need a dictionary, just three lists:
last_host = None
hosts = []            # the list of hosts
host_responses = []   # the list of responses for each host
responses = []

for output in expandresults:
    if output.val is not None:
        if output.hostname != last_host:   # new host
            if last_host:                  # only append host_responses after a new host
                host_responses.append(responses)
            hosts.append(output.hostname)
            responses = [output.val]       # start the new list of responses
            last_host = output.hostname
        else:                              # same host, append the response
            responses.append(output.val)
host_responses.append(responses)

for host, responses in zip(hosts, host_responses):
    self.WriteOut(host, ','.join(responses))
The memory consumption was due to instantiation of several workers in an unbound manner.
I've updated fastsnmpy (the latest is version 1.2.1) and uploaded it to PyPi. You can do a search on PyPi for 'fastsnmpy', or grab it directly from my PyPi page here at FastSNMPy
Just finished updating the docs, and posted them to the project page at fastSNMPy DOCS
What I basically did here is to replace the earlier model of unbound-workers with a process-pool from multiprocessing. This can be passed in as an argument, or defaults to 1.
You now have just 2 methods for simplicity.
snmpwalk(processes=n) and snmpbulkwalk(processes=n)
You shouldn't see the memory issue anymore. If you do, please ping me on github.
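For reference, the bounded-worker pattern looks roughly like this (a generic sketch with multiprocessing.Pool, not the actual fastsnmpy internals; walk_one_host is a placeholder for the per-host SNMP work):

from multiprocessing import Pool

def walk_one_host(hostname):
    # replace this body with the real per-host snmpbulkwalk call
    return hostname, []

if __name__ == '__main__':
    hosts = ['host1', 'host2', 'host3']   # up to ~1200 hosts in the question
    pool = Pool(processes=8)              # never more than 8 workers alive at once
    results = pool.map(walk_one_host, hosts)
    pool.close()
    pool.join()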
You might have an easier time figuring out where the memory is going by using a profiler:
https://pypi.python.org/pypi/memory_profiler
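A minimal, hypothetical usage: decorate the function you suspect and run the script as usual to get a line-by-line memory report.

from memory_profiler import profile

@profile
def collect(expandsession):
    # running the script now prints per-line memory usage for this function
    expandresults = expandsession.multiwalk(mode='bulkwalk')
    return expandresults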
Additionally, if you're already tweaking the fastsnmpy classes, you can just change the implementation to do the dictionary-based results merging for you instead of letting it construct a gigantic list first.
How long are you hanging on to the session? The result list will grow indefinitely if you reuse it.
I'm using multiprocessing with a large (~5GB) read-only dict used by the processes. I started by passing the whole dict to each process, but ran into memory constraints, so I changed to using a multiprocessing Manager dict (after reading this: How to share a dictionary between multiple processes in python without locking).
Since the change, performance has dived. What alternatives are there for a faster shared data store? The dict has 40-character string keys, and the values are tuples of two small strings.
Use a memory mapped file. While this might sound insane (performance wise), it might not be if you use some clever tricks:
Sort the keys so you can use binary search in the file to locate a record
Try to make each line of the file the same length ("fixed width records")
If you can't use fixed width records, use this pseudo code:
Read 1KB in the middle (or enough to be sure the longest line fits *twice*)
Find the first new line character
Find the next new line character
Get a line as a substring between the two positions
Check the key (first 40 bytes)
If the key read is greater than the one you are looking for, repeat with a 1KB block in the lower half of the search range; otherwise repeat in the upper half
If the performance isn't good enough, consider writing an extension in C.
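For the fixed-width case, a rough sketch of the lookup (the record length, the padding, and the file layout are assumptions for illustration; wanted_key is the 40-byte key as bytes, and the file is assumed sorted by key):

import mmap

KEY_LEN = 40
RECORD_LEN = 96  # assumed fixed width per record, including the trailing newline

def lookup(path, wanted_key):
    # binary search over fixed-width, key-sorted records in a memory-mapped file
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        lo, hi = 0, len(mm) // RECORD_LEN
        while lo < hi:
            mid = (lo + hi) // 2
            offset = mid * RECORD_LEN
            key = mm[offset:offset + KEY_LEN]
            if key == wanted_key:
                return mm[offset + KEY_LEN:offset + RECORD_LEN].strip()
            elif key < wanted_key:
                lo = mid + 1
            else:
                hi = mid
        return None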
I'm working on a Python script to go through two files: one containing a list of UUIDs, the other containing a large number of log entries, each line containing one of the UUIDs from the first file. The purpose of the program is to create a list of the UUIDs from file1 and then increment the associated value each time that UUID is found in the log file.
So long story short, count how many times each UUID appears in the log file.
At the moment, I have a list which is populated with the UUIDs as keys and 'hits' as the values. Then another loop iterates over each line of the log file and checks whether the UUID in the log matches a UUID in the UUID list. If it matches, it increments the value.
for i, logLine in enumerate(logHandle):  # start matching UUID entries in log file to UUID from rulebase
    if logFunc.progress(lineCount, logSize):  # check progress
        print logFunc.progress(lineCount, logSize)  # print progress in 10% intervals
    for uid in uidHits:
        if logLine.count(uid) == 1:  # for each UUID, check the current line of the log for a match in the UUID list
            uidHits[uid] += 1  # if matched, increment the relevant value in the uidHits list
            break  # as we've already found the match, don't process the rest
    lineCount += 1
It works as it should, but I'm sure there is a more efficient way of processing the file. I've been through a few guides and found that using 'count' is faster than using a compiled regex. I thought reading files in chunks rather than line by line would improve performance by reducing the amount of disk I/O time, but the performance difference on a ~200MB test file was negligible. If anyone has any other methods I would be very grateful :)
Think functionally!
Write a function which will take a line of the log file and return the uuid. Call it uuid, say.
Apply this function to every line of the log file. If you are using Python 3 you can use the built-in function map; otherwise, you need to use itertools.imap.
Pass this iterator to a collections.Counter.
collections.Counter(map(uuid, open("log.txt")))
This will be pretty much optimally efficient.
A couple comments:
This completely ignores the list of UUIDs and just counts the ones that appear in the log file. You will need to modify the program somewhat if you don't want this.
Your code is slow because you are using the wrong data structures. A dict is what you want here.
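A sketch of that modification, assuming the UUID sits in a comma-separated field (the field index and file name are made up; adjust uuid() to match your log format):

import collections

def uuid(line):
    # extract the UUID field from a log line; the field position is an assumption
    return line.split(",")[1]

known = set(uidHits)  # the UUIDs loaded from file1
counts = collections.Counter(u for u in map(uuid, open("log.txt")) if u in known)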
This is not a 5-line answer to your question, but there was an excellent tutorial given at PyCon'08 called Generator Tricks for System Programmers. There is also a followup tutorial called A Curious Course on Coroutines and Concurrency.
The Generator tutorial specifically uses big log file processing as its example.
Like folks above have said, with a 10GB file you'll probably hit the limits of your disk pretty quickly. For code-only improvements, the generator advice is great. In python 2.x it'll look something like
uuid_generator = (line.split(SPLIT_CHAR)[UUID_FIELD] for line in file)
It sounds like this doesn't actually have to be a python problem. If you're not doing anything more complex than counting UUIDs, Unix might be able to solve your problems faster than python can.
cut -d${SPLIT_CHAR} -f${UUID_FIELD} log_file.txt | sort | uniq -c
Have you tried mincemeat.py? It is a Python implementation of the MapReduce distributed computing framework. I'm not sure if you'll see a performance gain, since I haven't processed 10GB of data with it before, but you might explore this framework.
Try measuring where most time is spent, using a profiler http://docs.python.org/library/profile.html
Where best to optimise will depend on the nature of your data: if the list of UUIDs isn't very long, you may find, for example, that a large proportion of time is spent on the "if logFunc.progress(lineCount, logSize)". If the list is very long, it could help to save the result of uidHits.keys() to a variable outside the loop and iterate over that instead of the dictionary itself, but Rosh Oxymoron's suggestion of finding the id first and then checking for it in uidHits would probably help even more.
In any case, you can eliminate the lineCount variable, and use i instead. And find(uid) != -1 might be better than count(uid) == 1 if the lines are very long.
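Combined, those tweaks to the original loop might look like this (a sketch; the behaviour is otherwise unchanged):

uids = list(uidHits)  # snapshot the keys once, outside the loop
for i, logLine in enumerate(logHandle):
    if logFunc.progress(i, logSize):       # i replaces the separate lineCount
        print(logFunc.progress(i, logSize))
    for uid in uids:
        if logLine.find(uid) != -1:        # find stops at the first occurrence
            uidHits[uid] += 1
            break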