I have a project that relies on finding all cycles in a graph that pass through a vertex at most k times. For now I'm sticking with the case k = 1 while I develop the rest of the project. I've concluded that this algorithm, as a depth-first search, is at worst O((kn)^(kn)) for a complete graph, but in the context of the problem I rarely approach that upper bound, so I'd still like to give this approach a try.
I've implemented the following as a part of the project to achieve this end:
class Graph(object):
    ...
    def path_is_valid(self, current_path):
        """
        :param current_path:
        :return: Boolean indicating whether the given path is valid
        """
        length = len(current_path)
        if length < 3:
            # The path is too short
            return False

        # Passes through a vertex twice... sketchy for the general case
        if len(set(current_path)) != len(current_path):
            return False

        # The idea here is to take a moving window of width three along the
        # path and see if it's contained entirely in a polygon.
        arc_triplets = (current_path[i:i+3] for i in xrange(length-2))
        for triplet in arc_triplets:
            for face in self.non_fourgons:
                if set(triplet) <= set(face):
                    return False

        # There is an edge case pertaining to the beginning and end of a path
        # lying inside a polygon. The previous filter will not catch this, so
        # we cycle the path and recheck the moving-window filter.
        path_copy = list(current_path)
        for i in xrange(length):
            # Rotate the path by one vertex
            path_copy = path_copy[1:] + path_copy[:1]
            arc_triplets = (path_copy[i:i+3] for i in xrange(length-2))
            for triplet in arc_triplets:
                for face in self.non_fourgons:
                    if set(triplet) <= set(face):
                        return False

        return True
    def cycle_dfs(self, current_node, start_node, graph, current_path):
        """
        :param current_node:
        :param start_node:
        :param graph:
        :param current_path:
        :return:
        """
        if len(current_path) >= 3:
            last_three_vertices = current_path[-3:]
            previous_three_faces = [set(self.faces_containing_arcs[vertex])
                                    for vertex in last_three_vertices]
            intersection_all = set.intersection(*previous_three_faces)
            if len(intersection_all) == 2:
                return []

        if current_node == start_node:
            if self.path_is_valid(current_path):
                return [tuple(shift(list(current_path)))]
            else:
                return []
        else:
            loops = []
            for adjacent_node in set(graph[current_node]):
                current_path.append(adjacent_node)
                graph[current_node].remove(adjacent_node)
                graph[adjacent_node].remove(current_node)
                loops += list(self.cycle_dfs(adjacent_node, start_node,
                                             graph, current_path))
                graph[current_node].append(adjacent_node)
                graph[adjacent_node].append(current_node)
                current_path.pop()
            return loops
path_is_valid() aims to cut down on the number of paths produced by the depth-first search as they are found, based on filtering criteria specific to the problem. I've tried to explain the purpose of each filter in the comments, but everything is clearer in one's own head; I'd be happy to improve the comments if needed.
I'm open to any and all suggestions to improve performance, since, as the profile below shows, this is where all my time is being spent.
I'm also about to turn to Cython, but my code relies heavily on Python objects and I don't know whether that's a smart move. Can anyone shed some light on whether that route is even beneficial with this many native Python data structures involved? I can't seem to find much information on this, and any help would be appreciated.
Since I know people will ask, I have profiled my entire project and this is the source of the problem:
311 1 18668669 18668669.0 99.6 cycles = self.graph.find_cycles()
Here's the line-profiled output of cycle_dfs() and path_is_valid():
Function: cycle_dfs at line 106
Total time: 11.9584 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
106 def cycle_dfs(self, current_node, start_node, graph, current_path):
107 """
108 Naive depth first search applied to the pseudo-dual graph of the
109 reference curve. This sucker is terribly inefficient. More to come.
110 :param current_node:
111 :param start_node:
112 :param graph:
113 :param current_path:
114 :return:
115 """
116 437035 363181 0.8 3.6 if len(current_path) >= 3:
117 436508 365213 0.8 3.7 last_three_vertices = current_path[-3:]
118 436508 321115 0.7 3.2 previous_three_faces = [set(self.faces_containing_arcs[vertex])
119 1746032 1894481 1.1 18.9 for vertex in last_three_vertices]
120 436508 539400 1.2 5.4 intersection_all = set.intersection(*previous_three_faces)
121 436508 368725 0.8 3.7 if len(intersection_all) == 2:
122 return []
123
124 437035 340937 0.8 3.4 if current_node == start_node:
125 34848 1100071 31.6 11.0 if self.path_is_valid(current_path):
126 486 3400 7.0 0.0 return [tuple(shift(list(current_path)))]
127 else:
128 34362 27920 0.8 0.3 return []
129
130 else:
131 402187 299968 0.7 3.0 loops = []
132 839160 842350 1.0 8.4 for adjacent_node in set(graph[current_node]):
133 436973 388646 0.9 3.9 current_path.append(adjacent_node)
134 436973 438763 1.0 4.4 graph[current_node].remove(adjacent_node)
135 436973 440220 1.0 4.4 graph[adjacent_node].remove(current_node)
136 436973 377422 0.9 3.8 loops += list(self.cycle_dfs(adjacent_node, start_node,
137 436973 379207 0.9 3.8 graph, current_path))
138 436973 422298 1.0 4.2 graph[current_node].append(adjacent_node)
139 436973 388651 0.9 3.9 graph[adjacent_node].append(current_node)
140 436973 412489 0.9 4.1 current_path.pop()
141 402187 285471 0.7 2.9 return loops
Function: path_is_valid at line 65
Total time: 1.6726 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
65 def path_is_valid(self, current_path):
66 """
67 Aims to implicitly filter during dfs to decrease output size. Observe
68 that more complex filters are applied further along in the function.
69 We'd rather do less work to show the path is invalid rather than more,
70 so filters are applied in order of increasing complexity.
71 :param current_path:
72 :return: Boolean indicating a whether the given path is valid
73 """
74 34848 36728 1.1 2.2 length = len(current_path)
75 34848 33627 1.0 2.0 if length < 3:
76 # The path is too short
77 99 92 0.9 0.0 return False
78
79 # Passes through arcs twice... Sketchy for later.
80 34749 89536 2.6 5.4 if len(set(current_path)) != len(current_path):
81 31708 30402 1.0 1.8 return False
82
83 # The idea here is take a moving window of width three along the path
84 # and see if it's contained entirely in a polygon.
85 3041 6287 2.1 0.4 arc_triplets = (current_path[i:i+3] for i in xrange(length-2))
86 20211 33255 1.6 2.0 for triplet in arc_triplets:
87 73574 70670 1.0 4.2 for face in self.non_fourgons:
88 56404 94019 1.7 5.6 if set(triplet) <= set(face):
89 2477 2484 1.0 0.1 return False
90
91 # This is all kinds of unclear when looking at. There is an edge case
92 # pertaining to the beginning and end of a path existing inside of a
93 # polygon. The previous filter will not catch this, so we cycle the path
94 # a reasonable amount and recheck moving window filter.
95 564 895 1.6 0.1 path_copy = list(current_path)
96 8028 7771 1.0 0.5 for i in xrange(length):
97 7542 14199 1.9 0.8 path_copy = path_copy[1:] + path_copy[:1] # wtf
98 7542 11867 1.6 0.7 arc_triplets = (path_copy[i:i+3] for i in xrange(length-2))
99 125609 199100 1.6 11.9 for triplet in arc_triplets:
100 472421 458030 1.0 27.4 for face in self.non_fourgons:
101 354354 583106 1.6 34.9 if set(triplet) <= set(face):
102 78 83 1.1 0.0 return False
103
104 486 448 0.9 0.0 return True
Thanks!
EDIT: Well, after a lot of merciless profiling, I was able to bring the run time down from 12 seconds to ~1.5 seconds.
I changed this portion of cycle_dfs()
last_three_vertices = current_path[-3:]
previous_three_faces = [set(self.faces_containing_arcs[vertex])
                        for vertex in last_three_vertices]
intersection_all = set.intersection(*previous_three_faces)
if len(intersection_all) == 2: ...
to this:
# Count the number of times each face appears by incrementing the value
# stored under its face_id (defaultdict is imported from collections)
containing_faces = defaultdict(int)
for face in (self.faces_containing_arcs[v]
             for v in current_path[-3:]):
    for f in face:
        containing_faces[f] += 1

# If any face_id f has a count of three, there is one face that all three
# arcs bound. This is a trivial path, so we discard it.
if 3 in containing_faces.values(): ...
This was motivated by another post I saw benchmarking Python dictionary assignment; it turns out assigning and updating values in a dict is only a tiny bit slower than plain integer addition (which still blows my mind). Along with the two additions to self.path_is_valid(), I squeaked out a roughly 12x speedup. Further suggestions would still be appreciated, since better performance overall will only make harder problems easier as the input complexity grows.
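One more micro-optimization I'm considering (just a sketch, not yet measured against the profile above): "3 in containing_faces.values()" always scans every count, so returning as soon as any face reaches three avoids that final pass. The helper name and the example data below are made up.

from collections import defaultdict

def bounds_common_face(faces_containing_arcs, last_three):
    """Return True if a single face contains all three arcs (early exit)."""
    counts = defaultdict(int)
    for arc in last_three:
        for face_id in faces_containing_arcs[arc]:
            counts[face_id] += 1
            if counts[face_id] == 3:  # all three arcs bound this face
                return True
    return False

# Example with made-up data: arcs 0, 1 and 2 all bound face 7.
faces = {0: [7, 2], 1: [7, 3], 2: [7, 4]}
print(bounds_common_face(faces, [0, 1, 2]))  # True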
I would recommend two optimizations for path_is_valid. Of course, your main problem is in cycle_dfs, and you probably just need a better algorithm.
1) Avoid creating extra data structures:
for i in xrange(length - 2):
    for face in self.non_fourgons:
        if path[i] in face and path[i+1] in face and path[i+2] in face:
            return False
2) Create a dictionary mapping points to the non_fourgons they are members of:
for i in xrange(length - 2):
    for face in self.non_fourgons[path[i]]:
        if path[i+1] in face and path[i+2] in face:
            return False
The expression self.non_fourgons[p] should return a list of the non-fourgons which contain p as a member. This reduces the number of polygons you have to check.
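If it helps, here is a minimal sketch of building that lookup table; it assumes each face is an iterable of vertex ids and stores the faces as sets so the membership tests above stay cheap. The function name and the example faces are mine, not from the original code.

from collections import defaultdict

def build_face_index(non_fourgons):
    """Map each vertex to the list of non-fourgon faces containing it."""
    faces_by_vertex = defaultdict(list)
    for face in non_fourgons:
        face_set = set(face)  # sets keep the "in face" checks O(1)
        for vertex in face_set:
            faces_by_vertex[vertex].append(face_set)
    return faces_by_vertex

# Example with made-up faces:
faces_by_vertex = build_face_index([(1, 2, 3, 4, 5), (3, 4, 6, 7, 8)])
print(faces_by_vertex[3])  # both faces, since vertex 3 lies on each of them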
Related
I have just seen my Python process get killed again on my VPS with 1GB of RAM and am now looking into optimizing memory usage in my program.
I've got a function that downloads a web page, looks for data and then returns a Pandas Dataframe with what it found. This function is called thousands of times from within a for loop that ends up maxing out the memory on my server.
Line # Mem usage Increment Occurences Line Contents
============================================================
93 75.6 MiB 1.2 MiB 1 page = http.get(url)
94 75.6 MiB 0.0 MiB 1 if page.status_code == 200:
95 78.4 MiB 2.8 MiB 1 tree = html.fromstring(page.text)
96 78.4 MiB 0.0 MiB 1 del page
... code to search for data using xpaths and assign to data dict
117 78.4 MiB 0.1 MiB 1 df = pd.DataFrame(data)
118 78.4 MiB 0.0 MiB 1 del tree
119 78.4 MiB 0.0 MiB 1 gc.collect()
120 78.4 MiB 0.0 MiB 1 return df
The memory_profiler results above show that the lines of my code with the largest memory increments are, as expected, the http.get() and html.fromstring() calls and assignments. The actual DataFrame creation is much smaller in comparison.
Now I would expect the only overall memory increase in my program to be the size of the DataFrame returned by the function, not ALSO the size of the page and tree objects. Yet with every call to this function, my program's memory grows by the combined size of all three objects, and it never decreases.
I have tried adding del before the end of the function to attempt to de-reference the objects I don't need anymore, but this does not seem to make a difference.
I do see that for a scalable application I would need to start saving to disk, but at this point even if I do save to disk I'm not sure how to free up the memory already used.
Thanks for your help
After a lot of digging, I finally found the answer to my own question. The issue was related to the string results from my XPath expressions, which by default are "smart strings" that are known to eat up memory. Disabling them gives me the kind of memory consumption I was expecting.
More information: "lxml parser eats all memory" and https://lxml.de/xpathxslt.html#xpath-return-values
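For reference, a minimal sketch of how smart strings can be turned off with a compiled XPath expression (the expression and the HTML below are placeholders; smart_strings=False is the relevant lxml option):

from lxml import etree, html

# Compile the XPath once with smart strings disabled, so the returned strings
# don't keep a reference back to their parent tree (and thus to the whole page).
extract_prices = etree.XPath('//span[@class="price"]/text()',
                             smart_strings=False)

tree = html.fromstring('<html><body><span class="price">9.99</span></body></html>')
print(extract_prices(tree))  # ['9.99'] as plain strings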
I am trying to generate a summary of a large text file using the Gensim summarizer.
I am getting a memory error. I have been facing this issue for some time; any help would be really appreciated. Feel free to ask for more details.
from gensim.summarization.summarizer import summarize

file_read = open("xxxxx.txt", 'r')
Content = file_read.read()

def Summary_gen(content):
    print(len(Content))
    summary_r = summarize(Content, ratio=0.02)
    print(summary_r)

Summary_gen(Content)
The length of the document is:
365042
Error message:
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-6-a91bd71076d1> in <module>()
10
11
---> 12 Summary_gen(Content)
<ipython-input-6-a91bd71076d1> in Summary_gen(content)
6 def Summary_gen(content):
7 print(len(Content))
----> 8 summary_r=summarize(Content,ratio=0.02)
9 print(summary_r)
10
c:\python3.6\lib\site-packages\gensim\summarization\summarizer.py in summarize(text, ratio, word_count, split)
428 corpus = _build_corpus(sentences)
429
--> 430 most_important_docs = summarize_corpus(corpus, ratio=ratio if word_count is None else 1)
431
432 # If couldn't get important docs, the algorithm ends.
c:\python3.6\lib\site-packages\gensim\summarization\summarizer.py in summarize_corpus(corpus, ratio)
367 return []
368
--> 369 pagerank_scores = _pagerank(graph)
370
371 hashable_corpus.sort(key=lambda doc: pagerank_scores.get(doc, 0), reverse=True)
c:\python3.6\lib\site-packages\gensim\summarization\pagerank_weighted.py in pagerank_weighted(graph, damping)
57
58 """
---> 59 adjacency_matrix = build_adjacency_matrix(graph)
60 probability_matrix = build_probability_matrix(graph)
61
c:\python3.6\lib\site-packages\gensim\summarization\pagerank_weighted.py in build_adjacency_matrix(graph)
92 neighbors_sum = sum(graph.edge_weight((current_node, neighbor)) for neighbor in graph.neighbors(current_node))
93 for j in xrange(length):
---> 94 edge_weight = float(graph.edge_weight((current_node, nodes[j])))
95 if i != j and edge_weight != 0.0:
96 row.append(i)
c:\python3.6\lib\site-packages\gensim\summarization\graph.py in edge_weight(self, edge)
255
256 """
--> 257 return self.get_edge_properties(edge).setdefault(self.WEIGHT_ATTRIBUTE_NAME, self.DEFAULT_WEIGHT)
258
259 def neighbors(self, node):
c:\python3.6\lib\site-packages\gensim\summarization\graph.py in get_edge_properties(self, edge)
404
405 """
--> 406 return self.edge_properties.setdefault(edge, {})
407
408 def add_edge_attributes(self, edge, attrs):
MemoryError:
I have tried looking this error up on the internet, but couldn't find a workable solution.
From the logs, it looks like the code builds an adjacency matrix
---> 59 adjacency_matrix = build_adjacency_matrix(graph)
This probably tries to create a huge adjacency matrix with your 365042 documents, which cannot fit in your memory (i.e., RAM).
You could try:
1) Reducing the document size to fewer files (maybe start with 10000) and checking whether it works
2) Running it on a system with more RAM
Did you try to use word_count argument instead of ratio?
If the above still doesn't solve the problem, then it's down to gensim's implementation limitations. The only way to use gensim if you still get OOM errors is to split the document. That will also speed up your solution (and if the document is really big, it shouldn't be a problem anyway).
What's the problem with summarize:
gensim's summarizer uses TextRank by default, an algorithm that uses PageRank. In gensim it is unfortunately implemented using a Python list of PageRank graph nodes, so it may fail if your graph is too big.
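A minimal sketch of the splitting approach, assuming the document can simply be cut into fixed-size character chunks (chunk size and file name are placeholders; for better summaries you would split on sentence or paragraph boundaries instead):

from gensim.summarization.summarizer import summarize

def summarize_in_chunks(text, chunk_chars=50000, ratio=0.02):
    """Summarize each chunk separately and join the partial summaries."""
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    partial_summaries = []
    for chunk in chunks:
        try:
            partial_summaries.append(summarize(chunk, ratio=ratio))
        except ValueError:
            # summarize() raises ValueError on chunks with too little text
            continue
    return "\n".join(partial_summaries)

with open("xxxxx.txt") as file_read:
    print(summarize_in_chunks(file_read.read()))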
BTW is the document length measured in words, or characters?
I've read about Raymond Hettinger's new method of implementing compact dicts. This explains why dicts in Python 3.6 use less memory than dicts in Python 2.7-3.5. However, there also seems to be a difference between the memory used by dicts in Python 2.7 and in 3.3-3.5. Test code:
import sys
d = {i: i for i in range(n)}
print(sys.getsizeof(d))
Python 2.7: 12568
Python 3.5: 6240
Python 3.6: 4704
As mentioned I understand the savings between 3.5 and 3.6 but am curious about the cause of the savings between 2.7 and 3.5.
It turns out this is a red herring. The rules for increasing the size of dicts changed between CPython 2.7-3.2 and CPython 3.3, and again in CPython 3.4 (though the latter change only applies when deletions occur). We can see this using the following code to determine when the dict expands:
import sys
size_old = 0
for n in range(512):
d = {i: i for i in range(n)}
size = sys.getsizeof(d)
if size != size_old:
print(n, size_old, size)
size_old = size
Python 2.7:
(0, 0, 280)
(6, 280, 1048)
(22, 1048, 3352)
(86, 3352, 12568)
Python 3.5
0 0 288
6 288 480
12 480 864
22 864 1632
44 1632 3168
86 3168 6240
Python 3.6:
0 0 240
6 240 368
11 368 648
22 648 1184
43 1184 2280
86 2280 4704
Keeping in mind that dicts resize when they become 2/3 full, we can see that the CPython 2.7 dict implementation quadruples in size when it expands, while the CPython 3.5/3.6 implementations only double in size.
This is explained in a comment in the dict source code:
/* GROWTH_RATE. Growth rate upon hitting maximum load.
* Currently set to used*2 + capacity/2.
* This means that dicts double in size when growing without deletions,
* but have more head room when the number of deletions is on a par with the
* number of insertions.
* Raising this to used*4 doubles memory consumption depending on the size of
* the dictionary, but results in half the number of resizes, less effort to
* resize.
* GROWTH_RATE was set to used*4 up to version 3.2.
* GROWTH_RATE was set to used*2 in version 3.3.0
*/
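As a quick sanity check on the 2.7 numbers above (assuming the usual 24-byte hash/key/value entry on a 64-bit build), the reported sizes are exactly the 280-byte empty dict plus an external table that jumps from 32 to 128 to 512 slots, i.e. it quadruples each time:

# sizeof(dict) == sizeof(empty dict) + slots * 24 bytes on 64-bit Python 2.7
empty = 280
for slots, reported in [(32, 1048), (128, 3352), (512, 12568)]:
    assert empty + slots * 24 == reported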
In trying to track down the source of an unacceptable (257ms) runtime of matplotlib's plt.draw() function, I stumbled upon this article: http://bastibe.de/2013-05-30-speeding-up-matplotlib.html. In particular, this quote caught my eye:
"I am using pause() here to update the plot without blocking. The correct way to do this is to use draw() instead..."
Digging further, I found that plt.draw() can be substituted by two commands,
plt.pause(0.001)
fig.canvas.blit(ax1.bbox)
which take 256 ms and 1 ms respectively in my code.
This seemed abnormal: why would a 1 ms pause take 256 ms to complete? I took some data and found the following:
plt.pause(n):
n (s)     time (ms)     overhead = time - n (ms)
0.0001 270-246 ~246ms
0.001 270-254 ~253ms
0.01 280-265 ~255ms
0.1 398-354 ~254ms
0.2 470-451 ~251ms
0.5 779-759 ~259ms
1.0 1284-1250 ~250ms
numbers courtesy of rkern's line_profiler
This makes it very clear that plt.pause() is doing more than just pausing the program, and I was correct:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
175 def pause(interval):
176 """
177 Pause for *interval* seconds.
178
179 If there is an active figure it will be updated and displayed,
180 and the GUI event loop will run during the pause.
181
182 If there is no active figure, or if a non-interactive backend
183 is in use, this executes time.sleep(interval).
184
185 This can be used for crude animation. For more complex
186 animation, see :mod:`matplotlib.animation`.
187
188 This function is experimental; its behavior may be changed
189 or extended in a future release.
190
191 """
192 1 6 6.0 0.0 backend = rcParams['backend']
193 1 1 1.0 0.0 if backend in _interactive_bk:
194 1 5 5.0 0.0 figManager = _pylab_helpers.Gcf.get_active()
195 1 0 0.0 0.0 if figManager is not None:
196 1 2 2.0 0.0 canvas = figManager.canvas
197 1 257223 257223.0 20.4 canvas.draw()
198 1 145 145.0 0.0 show(block=False)
199 1 1000459 1000459.0 79.5 canvas.start_event_loop(interval)
200 1 2 2.0 0.0 return
201
202 # No on-screen figure is active, so sleep() is all we need.
203 import time
204 time.sleep(interval)
once again courtesy of rkern's line_profiler
This was a breakthrough, as it was suddenly clear why plt.pause() was able to replace plt.draw(): it has a draw call inside it with that same ~250 ms overhead I was seeing at the start of my program.
At this point, I decided to profile plt.draw() itself:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
551 def draw():
571 1 267174 267174.0 100.0 get_current_fig_manager().canvas.draw()
Alright, one more step down the rabbit hole:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
57 def draw_wrapper(artist, renderer, *args, **kwargs):
58 769 1798 2.3 0.7 before(artist, renderer)
59 769 242060 314.8 98.5 draw(artist, renderer, *args, **kwargs)
60 769 1886 2.5 0.8 after(artist, renderer)
Unfortunately, this was the point at which my ability to follow the profiler through the source code ended, leaving me scratching my head at this next level of draw() function and why it was being called 769 times.
It turns out the answer was right in front of me the whole time! That same article, which started this whole obsessive hunt in the first place, was actually written to study this same strange behavior. Its solution: replace plt.draw() with individual draw calls for each artist that needs to be updated, rather than redrawing every single one.
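Roughly, that technique looks like the sketch below (a sketch only, assuming an interactive Agg-based backend; line and ax stand in for whatever artists actually change each frame):

import numpy as np
import matplotlib.pyplot as plt

plt.ion()
fig, ax = plt.subplots()
x = np.linspace(0, 2 * np.pi, 200)
line, = ax.plot(x, np.sin(x))

fig.canvas.draw()                                # one full draw to lay everything out
background = fig.canvas.copy_from_bbox(ax.bbox)  # cache the static parts of the axes

for phase in np.linspace(0, 2 * np.pi, 100):
    line.set_ydata(np.sin(x + phase))            # update only the data
    fig.canvas.restore_region(background)        # restore the cached background
    ax.draw_artist(line)                         # redraw just this artist
    fig.canvas.blit(ax.bbox)                     # push the updated region to the screen
    fig.canvas.flush_events()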
I hope my chasing of this behavior can help others understand it, though currently I'm stuck with a CGContextRef is NULL error whenever I try to replicate his methods, which seems to be specific to the MacOSX backend...
More info as it comes! Please add any more relevant information in answers below, or if you can help me with my CGContextRef is NULL error.
Can somebody help me figure out how much time and how much memory a piece of Python code takes?
Use this for calculating time:
import time
time_start = time.clock()
#run your code
time_elapsed = (time.clock() - time_start)
As referenced by the Python documentation:
time.clock()
On Unix, return the current processor time as a floating
point number expressed in seconds. The precision, and in fact the very
definition of the meaning of “processor time”, depends on that of the
C function of the same name, but in any case, this is the function to
use for benchmarking Python or timing algorithms.
On Windows, this function returns wall-clock seconds elapsed since the
first call to this function, as a floating point number, based on the
Win32 function QueryPerformanceCounter(). The resolution is typically
better than one microsecond.
Reference: http://docs.python.org/library/time.html
Use this for calculating memory:
import resource
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
Reference: http://docs.python.org/library/resource.html
Based on @Daniel Li's answer, for cut & paste convenience and Python 3.x compatibility:
import time
import resource
time_start = time.perf_counter()
# insert code here ...
time_elapsed = (time.perf_counter() - time_start)
memMb=resource.getrusage(resource.RUSAGE_SELF).ru_maxrss/1024.0/1024.0
print ("%5.1f secs %5.1f MByte" % (time_elapsed,memMb))
Example:
2.3 secs 140.8 MByte
There is a really good library called jackedCodeTimerPy for timing your code. You should then use the resource package that Daniel Li suggested.
jackedCodeTimerPy gives really good reports like
label min max mean total run count
------- ----------- ----------- ----------- ----------- -----------
imports 0.00283813 0.00283813 0.00283813 0.00283813 1
loop 5.96046e-06 1.50204e-05 6.71864e-06 0.000335932 50
I like how it gives you statistics on it and the number of times the timer is run.
It's simple to use. If I want to measure the time code takes in a for loop, I just do the following:
from jackedCodeTimerPY import JackedTiming
JTimer = JackedTiming()

for i in range(50):
    JTimer.start('loop')  # 'loop' is the name of the timer
    doSomethingHere = 'This is really useful!'
    JTimer.stop('loop')

print(JTimer.report())  # prints the timing report
You can also have multiple timers running at the same time.
JTimer.start('first timer')
JTimer.start('second timer')
do_something = 'amazing'
JTimer.stop('first timer')
do_something = 'else'
JTimer.stop('second timer')
print(JTimer.report()) # prints the timing report
There are more usage examples in the repo. Hope this helps.
https://github.com/BebeSparkelSparkel/jackedCodeTimerPY
Use a memory profiler like guppy
>>> from guppy import hpy; h=hpy()
>>> h.heap()
Partition of a set of 48477 objects. Total size = 3265516 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 25773 53 1612820 49 1612820 49 str
1 11699 24 483960 15 2096780 64 tuple
2 174 0 241584 7 2338364 72 dict of module
3 3478 7 222592 7 2560956 78 types.CodeType
4 3296 7 184576 6 2745532 84 function
5 401 1 175112 5 2920644 89 dict of class
6 108 0 81888 3 3002532 92 dict (no owner)
7 114 0 79632 2 3082164 94 dict of type
8 117 0 51336 2 3133500 96 type
9 667 1 24012 1 3157512 97 __builtin__.wrapper_descriptor
<76 more rows. Type e.g. '_.more' to view.>
>>> h.iso(1,[],{})
Partition of a set of 3 objects. Total size = 176 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 1 33 136 77 136 77 dict (no owner)
1 1 33 28 16 164 93 list
2 1 33 12 7 176 100 int
>>> x=[]
>>> h.iso(x).sp
0: h.Root.i0_modules['__main__'].__dict__['x']