Understanding objgraph refs and memory leaks - python

I am trying to debug a pretty severe memory leak in Python. Memory usage grows from ~500 MB at startup to 16 GB over a few iterations spanning about 20 minutes. I have been using objgraph to plot references and backreferences. Some of the things I see in the output I do not have concrete answers for, and I am curious whether the chart is showing an actual problem or whether I am even on the right track.
My questions:
1. If a cycle is in the graph, does that indicate a problem? Should a proper (non-leaking) graph only be a tree?
2. I have a class (instance?) that contains two tuples: one with six elements and another with one element. Why is there not a single tuple with 7 elements? What do the items in the tuples represent? Inheritance? Direct inheritance?
3. Is there a simple way to find what in the tuple is pointing back to "type A"?
4. Why does type A point to two tuples (question 2) and to another type, "type D"?
5. Why does type D have attributes/inheritance in dictionaries but none of the other types do? They are themselves classes that should have such properties, no?
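For reference, a minimal sketch of the kind of objgraph calls that produce charts like mine ('TypeA' here is a placeholder for one of the suspect classes, not a name from my code):

import objgraph

objgraph.show_growth(limit=10)       # establish a baseline of object counts
# ... run one iteration of the workload ...
objgraph.show_growth(limit=10)       # show which types grew during the iteration

# pick one instance of a suspect type and draw what keeps it alive
leaked = objgraph.by_type('TypeA')[-1]
objgraph.show_backrefs([leaked], max_depth=5, filename='backrefs.png')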


Deque random access O(n) in Python while O(1) in C++, why? [duplicate]

This question already has answers here:
Why is deque implemented as a linked list instead of a circular array?
C++ deque:
Random access - constant O(1)
Python deque:
Indexed access is O(1) at both ends but slows to O(n) in the middle.
If I'm not missing anything, everything else is equally fast for deques in python and in C++, at least complexity-wise. Is there anything that makes python's deque better for some cases? If not, why don't they just switch to what C++ has?
Disclaimer: this answer is largely inspired by Jeff's comment and by the answer already posted at Why is a deque implemented as a linked list instead of a circular array?
Your question is different in nature, but the title above is an answer in itself: in Python, collections.deque (from the collections module) has linear time complexity when accessing items in the middle because it is implemented using a linked list.
From the pydoc:
A list-like sequence optimized for data accesses near its endpoints.
Now if you're wondering why this implementation was chosen, the answer is already available in the post pointed out by Jeff.
Because a deque is a data structure meant to be used in a specific way: accessed at its first or last element.
But Python sometimes does unusual things with its data structures and adds more functions to them, or uses composed data structures.
In this case Python provides the method
remove(value)
# Remove the first occurrence of value. If not found, raises a ValueError.
This allows you to reach elements in the middle of the deque, even though that isn't a "core" operation of this data structure, which is what causes the "but slows to O(n) in the middle."
Because in this case it behaves like an array (checking values one by one).
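To see the asymmetry concretely, a rough timing sketch (not from the docs, just an illustration):

from collections import deque
from timeit import timeit

d = deque(range(1_000_000))

print(timeit(lambda: d[5], number=1000))        # near an endpoint: fast
print(timeit(lambda: d[500_000], number=1000))  # in the middle: much slower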

Is there any kind of hash list in Python?

I tried to find an answer here and in the Python docs, but the only things I got were questions about hashing list objects and details about how dicts work.
Background
I'm developing a program that parses a huge graph (at the moment 44K nodes, 14K of them of any interest, connected by 15K edges) and I have performance problems, although I already optimized my algorithm as far as I could; now the last resort is to optimize the data structure:
def single_pass_build(nodes):
    for node in nodes:
        if node.__class__ in listOfRequiredClasses:
            children = get_children(node)
            for child in children:
                if child.__class__ in listOfRequiredClasses:
                    add_edge(node, child)

def get_children(node):
    return [attr for attr in node.__dict__.values()
            if attr.__class__ in listOfRequiredClasses]
I still have to care about my add_edge function, but even without it my program takes slightly over 10 minutes for nothing but this iteration. For comparison: the module I get the data from generates it from an XML document in no more than 5 seconds.
I have a total of 44K objects, each representing a node in a relation graph. The objects I get have plenty of attributes, so I could try to optimize get_children to know all relevant attributes for every class, or just speed up the lookup. A membership test on a list is linear in its length, so with n nodes, m edges, a attributes per object and k classes in my list, I get a total of O(nak + mak). Many of my attribute classes are not in that list, so I am closer to the worst case than to the average. I'd like to speed up the lookup from O(k) to O(1), or at least O(log(k)).
Question
Knowing that a dict key lookup degrades under many hash collisions but is (almost) constant time with few or no collisions, and since I don't care about the values at all, I'd like to know if there is a kind of (hash) list optimized for x in list?
I could use a dict with None values, but with a total of 70000 lookups, and larger graphs in the future, every millisecond counts. Space is not the big problem here because I expect ~50 classes in total, and in no case more than a few hundred classes. In other cases, space could be an issue too.
I don't expect the answer to be in standard Python, but maybe someone knows a common framework that can help, or can convince me that there is no reason at all why I can't use a dict for the job.
You want the built-in set type: https://docs.python.org/2/library/stdtypes.html#set
And yes, it IS in standard Python ;)
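A minimal sketch of what that looks like for the lookup above (NodeA/NodeB are placeholder classes; requiredClasses stands in for the original listOfRequiredClasses):

class NodeA: pass
class NodeB: pass

requiredClasses = frozenset({NodeA, NodeB})   # average O(1) membership test

def get_children(node):
    return [attr for attr in node.__dict__.values()
            if attr.__class__ in requiredClasses]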

What is the best data structure/algorithm for data requiring bidirectional 1 to N mappings?

I've got a reasonable (though rusty) background in algorithms and mathematics, and modest proficiency in Python and C. I can see sorta how to do this, but it's non-trivial and gets more complicated every time I prototype it. I come before the collective for its wisdom, hoping for an elegant solution I'm not seeing. I think there's some sort of network or graph variant that might be apropos, but it's not clicking. And it's not a homework assignment :-).
I have three sets of data, A, B & C. Each element in A is a string, each element in B is an int and each C is a collection of metadata (dates, times, descriptions, etc.). There will be, potentially, thousands if not millions of elements in each set (though not soon).
Every A will map to zero or more items in B. Conversely, each element in B will map to zero or more items in A. Every item in A and B will have an associated C (possibly empty) which might be shared with other A's and/or B's.
Given an A, I need to report on all B's that it maps to. I further then need to report all A's that those B's map to, as well as all C's associated with what was found. I also need to be able to do the converse (given a B, report associated A's, B's and C's).
I understand there are some fairly pathological possibilities in here, and I'll need to detect loops (depth detection should work fine), etc.
Thoughts?
What first comes to mind for me would be a graph or directed graph.
Each element in the data sets could be a node, and the edges would represent the mappings. You could write your specific implementation to provide helper methods for the things you know you're going to need to do, like getting all B's that a given A element maps to.
UPDATE: I didn't notice you had already tagged the question graph-algorithm, which I assume means you already thought of a graph data structure.
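For illustration, a minimal sketch of one way to back such a graph with plain dicts (all names here are invented): two dicts of sets, one per direction, plus a dict for the shared C metadata.

from collections import defaultdict

a_to_b = defaultdict(set)   # A element -> set of B elements
b_to_a = defaultdict(set)   # B element -> set of A elements
meta = {}                   # element -> its C metadata (possibly shared)

def link(a, b):
    a_to_b[a].add(b)
    b_to_a[b].add(a)

def report(a):
    bs = set(a_to_b[a])                              # all B's that a maps to
    related_as = {x for b in bs for x in b_to_a[b]}  # all A's those B's map to
    return bs, related_as

link("alpha", 1)
link("beta", 1)
print(report("alpha"))      # ({1}, {'alpha', 'beta'}) -- set order may vary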

Python inverted index efficiency

I am writing some Python code to implement some of the concepts I have recently been learning, related to inverted indices / postings lists. I'm quite new to Python and am having some trouble understanding its efficiencies in some cases.
Theoretically, creating an inverted index of a set of documents D, each with a unique ID doc_id, should involve:
1. Parsing / performing lexical analysis of each document in D
2. Removing stopwords, performing stemming etc.
3. Creating a list of all (word, doc_id) pairs
4. Sorting the list
5. Condensing duplicates into {word: [set_of_all_doc_ids]} (the inverted index)
Step 5 is often carried out by having a dictionary containing the word with meta-data (term frequency, byte offsets) and a pointer to the postings list (list of documents it occurs in). The postings list is often implemented as a data structure which allows efficient random insert, i.e. a linked list.
My problem is that Python is a higher-level language, and direct use of things like memory pointers (and therefore linked lists) seems to be out of scope. I am optimising before profiling because for very large data sets it is already known that efficiency must be maximised to retain any kind of ability to calculate the index in a reasonable time.
Several other posts exist here on SO about Python inverted indices and, like my current implementation, they use dictionaries mapping keys to lists (or sets). Is one to expect that this method has similar performance to a language which allows direct coding of pointers to linked lists?
There are a number of things to say:
If random access is required for a particular list implementation, a linked list is not optimal (regardless of the programming language used). To access the ith element of the list, a linked list requires you to iterate all the way from the 0th to the ith element. Instead, the list should be stored as one continuous block (or several large blocks if it is very long). Python lists [...] are stored in this way, so for a start, a Python list should be good enough.
In Python, any assignment a = b of an object b that is not a basic data type (such as int or float), is performed internally by passing a pointer and incrementing the reference count to b. So if b is a list or a dictionary (or a user-defined class, for that matter), this is in principle not much different from passing a pointer in C or C++.
However, there is obviously some overhead caused by a) reference counting and b) garbage collection. If the implementation is for study purposes, i.e. to understand the concepts of inverted indexing better, I would not worry about that. But for a serious, highly-optimized implementation, using pure Python (rather than, e.g. C/C++ embedded into Python) is not advisable.
As you optimise the implementation of your postings list further, you will probably see the need to a) make random inserts, b) keep it sorted and c) keep it compressed - all at the same time. At that point, the standard Python list won't be good enough any more, and you might want to look into implementing a more optimised list representation in C/C++ and embed it into Python. However, even then, sticking to pure Python would probably be possible. E.g. you could use a large string to implement the list and use itertools and buffer to access specific parts in a way that is, to some extent, similar to pointer arithmetic.
One thing that you should always keep in mind when dealing with strings in Python is that, despite what I said above about assignment operations, the substring operation text[i:j] involves creating an actual (deep) copy of the substring, rather than merely incrementing a reference count. This can be avoided by using the buffer data type mentioned above.
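As a minimal sketch of the dict-of-sets approach discussed above (tokenisation, stemming and stopword removal left out):

from collections import defaultdict

index = defaultdict(set)    # word -> set of doc_ids

def add_document(doc_id, text):
    for word in text.lower().split():
        index[word].add(doc_id)

add_document(1, "the quick brown fox")
add_document(2, "the lazy dog")
print(sorted(index["the"]))   # [1, 2]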
You can see the code and documentation for an inverted index in Python at: http://www.ssiddique.info/creation-of-inverted-index-and-use-of-ranking-algorithm-python-code.html
Soon I will be coding it in C++.

Use of add(), append(), update() and extend() in Python

Is there an article or forum discussion or something somewhere that explains why lists use append/extend, but sets and dicts use add/update?
I frequently find myself converting lists into sets and this difference makes that quite tedious, so for my personal sanity I'd like to know what the rationalization is.
The need to convert between these occurs regularly as we iterate on development. Over time as the structure of the program morphs, various structures gain and lose requirements like ordering and duplicates.
For example, something that starts out as an unordered bunch of stuff in a list might pick up the requirement that there be no duplicates and so need to be converted to a set.
All such changes require finding and changing all places where the relevant structure is added/appended and extended/updated.
So I'm curious to see the original discussion that led to this language choice, but unfortunately I didn't have any luck googling for it.
append has a popular definition of "add to the very end", and extend can be read similarly (in the nuance where it means "...beyond a certain point"); sets have no "end", nor any way to specify some "point" within them or "at their boundaries" (because there are no "boundaries"!), so it would be highly misleading to suggest that these operations could be performed.
x.append(y) always increases len(x) by exactly one (whether y was already in list x or not); no such assertion holds for s.add(z) (s's length may increase or stay the same). Moreover, in these snippets, y can have any value (i.e., the append operation never fails [except for the anomalous case in which you've run out of memory]) -- again no such assertion holds about z (which must be hashable, otherwise the add operation fails and raises an exception). Similar differences apply to extend vs update. Using the same name for operations with such drastically different semantics would be very misleading indeed.
"it seems pythonic to just use a list on the first pass and deal with the performance on a later iteration"
Performance is the least of it! lists support duplicate items, ordering, and any item type -- sets guarantee item uniqueness, have no concept of order, and demand item hashability. There is nothing Pythonic in using a list (plus goofy checks against duplicates, etc) to stand for a set -- performance or not, "say what you mean!" is the Pythonic Way;-). (In languages such as Fortran or C, where all you get as a built-in container type are arrays, you might have to perform such "mental mapping" if you need to avoid using add-on libraries; in Python, there is no such need).
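A quick demonstration of the semantic differences mentioned above (length growth and hashability):

lst = [1, 2]
lst.append(2)          # len always grows by one: [1, 2, 2]

s = {1, 2}
s.add(2)               # len may stay the same: {1, 2}

lst.append([3, 4])     # any value can be appended to a list
try:
    s.add([3, 4])      # but a set rejects unhashable values
except TypeError as e:
    print(e)           # unhashable type: 'list'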
Edit: the OP asserts in a comment that they don't know from the start (e.g.) that duplicates are disallowed in a certain algorithm (strange, but, whatever) -- they're looking for a painless way to make a list into a set once they do discover duplicates are bad there (and, I'll add: order doesn't matter, items are hashable, indexing/slicing unneeded, etc). To get exactly the same effect one would have if Python's sets had "synonyms" for the two methods in question:
class somewhatlistlikeset(set):
    def append(self, x): self.add(x)
    def extend(self, x): self.update(x)
Of course, if the only change is at the set creation (which used to be list creation), the code may be much more challenging to follow, having lost the useful clarity whereby using add vs append allows anybody reading the code to know "locally" whether the object is a set vs a list... but this, too, is part of the "exactly the same effect" above-mentioned!-)
set and dict are unordered. "Append" and "extend" conceptually only apply to ordered types.
It's written that way to annoy you.
Seriously. It's designed so that one can't simply convert one into the other easily. Historically, sets are based off dicts, so the two share naming conventions. While you could easily write a set wrapper to add these methods ...
class ListlikeSet(set):
    def append(self, x):
        self.add(x)
    def extend(self, xs):
        self.update(xs)
... the greater question is why you find yourself converting lists to sets with such regularity. They represent substantially different models of a collection of objects; if you have to convert between the two a lot, it suggests you may not have a very good handle on the conceptual architecture of your program.
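For what it's worth, a quick usage sketch of the ListlikeSet wrapper defined above:

items = ListlikeSet([1, 2, 3])
items.append(3)         # list-style call, delegated to set.add (no duplicate)
items.extend([4, 5])    # delegated to set.update
print(sorted(items))    # [1, 2, 3, 4, 5]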
