I wrote a script that logs MAC addresses from scapy into MySQL through SQLAlchemy. I initially used straight sqlite3, but soon realized something better was required, so this past weekend I rewrote all the database code to use SQLAlchemy. All works fine: data goes in and comes out again. I thought sessionmaker() would be very useful to manage all the sessions to the DB for me.
I see a strange occurrence with regard to memory consumption. I start the script, it collects and writes everything to the DB, but every 2-4 seconds memory consumption increases by about a megabyte. At the moment I'm talking about very few records, sub-100 rows.
Script Sequence:
1. Script starts.
2. SQLAlchemy reads the mac_addr column into maclist[].
3. scapy gets a packet; is new_mac in maclist[]?
4. If yes: write only the timestamp to the timestamp column where mac = new_mac, then go back to step 2.
5. If no: write the new MAC to the DB, clear maclist[], and go back to step 2.
After 1h30m I have a memory footprint of 1027 MB (RES) and 1198 MB (VIRT), with 124 rows in the one-table MySQL database.
Q: Could this be attributed to maclist[] being cleared and repopulated from the DB every time?
Q: What's going to happen when it reaches the system's maximum memory?
Any ideas or advice would be great, thanks.
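For readers trying to picture the database layer, here is a minimal sketch of what the SQLAlchemy side might look like. The column names (mac_addr, timestamp) and the helper populate_observed_list() come from the question and the profiler output below; the table name, column types and connection string are assumptions of mine, not the author's actual code.

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Client(Base):
    __tablename__ = 'clients'              # table name is an assumption
    id = Column(Integer, primary_key=True)
    mac_addr = Column(String(17))          # column named in the question
    timestamp = Column(String(32))         # column named in the question
    ssid = Column(String(64))

engine = create_engine('mysql://user:password@localhost/sniffer')  # placeholder DSN
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

def populate_observed_list():
    # Returns a list of one-element rows; the caller appends row[0] for each,
    # which matches the observedclients.append(i[0]) line in the profiler output.
    session = Session()
    try:
        return session.query(Client.mac_addr).all()
    finally:
        session.close()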
memory_profiler output for the segment in question, where the list gets populated from the database's mac_addr column.
Line # Mem usage Increment Line Contents
================================================
123 1025.434 MiB 0.000 MiB @profile
124 def sniffmgmt(p):
125 global __mac_reel
126 global _blacklist
127 1025.434 MiB 0.000 MiB stamgmtstypes = (0, 2, 4)
128 1025.434 MiB 0.000 MiB tmplist = []
129 1025.434 MiB 0.000 MiB matching = []
130 1025.434 MiB 0.000 MiB observedclients = []
131 1025.434 MiB 0.000 MiB tmplist = populate_observed_list()
132 1025.477 MiB 0.043 MiB for i in tmplist:
133 1025.477 MiB 0.000 MiB observedclients.append(i[0])
134 1025.477 MiB 0.000 MiB _mac_address = str(p.addr2)
135 1025.477 MiB 0.000 MiB if p.haslayer(Dot11):
136 1025.477 MiB 0.000 MiB if p.type == 0 and p.subtype in stamgmtstypes:
137 1024.309 MiB -1.168 MiB _timestamp = atimer()
138 1024.309 MiB 0.000 MiB if p.info == "":
139 1021.520 MiB -2.789 MiB _SSID = "hidden"
140 else:
141 1024.309 MiB 2.789 MiB _SSID = p.info
142
143 1024.309 MiB 0.000 MiB if p.addr2 not in observedclients:
144 1018.184 MiB -6.125 MiB db_add(_mac_address, _timestamp, _SSID)
145 1018.184 MiB 0.000 MiB greetings()
146 else:
147 1024.309 MiB 6.125 MiB add_time(_mac_address, _timestamp)
148 1024.309 MiB 0.000 MiB observedclients = [] #clear the list
149 1024.309 MiB 0.000 MiB observedclients = populate_observed_list() #repopulate the list
150 1024.309 MiB 0.000 MiB greetings()
You will see observedclients is the list in question.
I managed to find the actual cause of the memory consumption: it was scapy itself. By default scapy is set to store all the packets it captures, but you can disable that.
Disable:
sniff(iface=interface, prn=sniffmgmt, store=0)
Enable:
sniff(iface=interface, prn=sniffmgmt, store=1)
Thanks to a BitBucket ticket for the pointer.
As you can see, the profiler output suggests you use less memory by the end, so it is not representative of your situation.
Some directions to dig deeper:
1) add_time (why is it increasing memory usage?)
2) db_add (why is it decreasing memory usage? caching? closing/opening db connection? what happens in case of failure?)
3) populate_observed_list (is the return value safe for garbage collection? Maybe there are some packets for which a certain exception occurs?)
Also, what happens if you sniff more packets than your code is able to process due to performance?
I would profile these 3 functions and analyze possible exceptions/failures.
Very hard to say anything without the code; I'm assuming the leak is in your code rather than in SQLAlchemy or scapy (a library leak seems unlikely).
You seem to have an idea of where the leak might happen; do some memory profiling to see if you are right.
Once your Python process eats enough memory, you will probably get a MemoryError exception.
Thanks for the guidance, everyone. I think I managed to resolve the increasing memory consumption.
A: Code logic plays a very big role in memory consumption, as I have learnt.
If you look at the memory_profiler output in my initial question, I moved lines 131-133 into the if statement at line 136. This seems to stop the memory increasing so frequently. I now need to refine populate_observed_list() a bit more so it doesn't waste so much memory.
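For anyone following along, here is a minimal sketch of what the restructured handler could look like after the two changes described above: the observedclients population moved inside the management-frame branch, and sniffing started with store=0. The helper names populate_observed_list, db_add, add_time, atimer and greetings are taken from the profiler listing; their bodies and the exact final code are not shown in the question, so treat this as an illustration, not the author's actual script.

from scapy.all import sniff, Dot11

stamgmtstypes = (0, 2, 4)  # management subtypes of interest, as in the original

def sniffmgmt(p):
    if not p.haslayer(Dot11):
        return
    if p.type == 0 and p.subtype in stamgmtstypes:
        # Query the database only for the frames we actually act on,
        # instead of on every captured packet.
        observedclients = [row[0] for row in populate_observed_list()]
        _mac_address = str(p.addr2)
        _timestamp = atimer()
        _SSID = p.info if p.info else "hidden"
        if _mac_address not in observedclients:
            db_add(_mac_address, _timestamp, _SSID)
        else:
            add_time(_mac_address, _timestamp)
        greetings()

# interface is configured elsewhere in the script (as in the original sniff call);
# store=0 stops scapy from keeping every captured packet in memory.
sniff(iface=interface, prn=sniffmgmt, store=0)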
Related
I'm coding a Python script which makes many plots. These plots are called from a main program which calls them repeatedly (by this, I mean hundreds of times).
As the main function runs, I see my computer's RAM fill up during the execution. Furthermore, even after the main function finishes, RAM usage is still much higher than before the main program ran. Sometimes it can even completely fill the RAM.
I tried deleting the heaviest variables and using the garbage collector, but the net RAM usage is always higher. Why is this happening?
I attached a simple (and exaggerated) example of one of my functions, and I used memory_profiler to see the line-by-line memory usage.
Line # Mem usage Increment Occurrences Line Contents
=============================================================
15 100.926 MiB 100.926 MiB 1 @profile
16 def my_func():
17 108.559 MiB 7.633 MiB 1 a = [1] * (10 ** 6)
18 261.148 MiB 152.590 MiB 1 b = [2] * (2 * 10 ** 7)
19 421.367 MiB 160.219 MiB 1 c = a + b
20 428.609 MiB 7.242 MiB 1 plt.figure(dpi=10000)
21 430.328 MiB 1.719 MiB 1 plt.plot(np.random.rand(1000),np.random.rand(1000))
22 487.738 MiB 57.410 MiB 1 plt.show()
23 487.738 MiB 0.000 MiB 1 plt.close('all')
24 167.297 MiB -320.441 MiB 1 del a,b,c
25 118.922 MiB -48.375 MiB 1 print(gc.collect())
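For reference, the profiled function above corresponds to roughly the following standalone code (reconstructed from the listing, with the imports it needs added):

import gc
import numpy as np
import matplotlib.pyplot as plt
from memory_profiler import profile

@profile
def my_func():
    a = [1] * (10 ** 6)
    b = [2] * (2 * 10 ** 7)
    c = a + b
    plt.figure(dpi=10000)
    plt.plot(np.random.rand(1000), np.random.rand(1000))
    plt.show()
    plt.close('all')
    del a, b, c
    print(gc.collect())

if __name__ == '__main__':
    my_func()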
I have just seen my Python process get killed again on my VPS with 1GB of RAM and am now looking into optimizing memory usage in my program.
I've got a function that downloads a web page, looks for data, and then returns a pandas DataFrame with what it found. This function is called thousands of times from within a for loop that ends up maxing out the memory on my server.
Line # Mem usage Increment Occurences Line Contents
============================================================
93 75.6 MiB 1.2 MiB 1 page = http.get(url)
94 75.6 MiB 0.0 MiB 1 if page.status_code == 200:
95 78.4 MiB 2.8 MiB 1 tree = html.fromstring(page.text)
96 78.4 MiB 0.0 MiB 1 del page
... code to search for data using xpaths and assign to data dict
117 78.4 MiB 0.1 MiB 1 df = pd.DataFrame(data)
118 78.4 MiB 0.0 MiB 1 del tree
119 78.4 MiB 0.0 MiB 1 gc.collect()
120 78.4 MiB 0.0 MiB 1 return df
The memory_profiler results above show that the lines of my code with the largest memory increments are the ones I expected: the http.get() and html.fromstring() calls and their assignments. The actual DataFrame creation is much smaller in comparison.
Now, I would expect the only overall memory increase to my program to be the size of the DataFrame returned by the function, not ALSO the size of the page and tree objects. Yet with every call to this function, the memory increase in my program is the combination of all three objects, and it never decreases.
I have tried adding del before the end of the function to attempt to de-reference the objects I don't need anymore, but this does not seem to make a difference.
I do see that for a scalable application I would need to start saving to disk, but at this point even if I do save to disk I'm not sure how to free up the memory already used.
Thanks for your help
After a lot of digging, I finally found the answer to my own question. The issue was related to the string results from my XPath expressions, which by default use "smart strings" that are known to eat up memory. Disabling these gives me the kind of memory consumption I was expecting.
More information: lxml parser eats all memory and https://lxml.de/xpathxslt.html#xpath-return-values
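For reference, here is a minimal sketch of turning smart strings off with a precompiled XPath expression. The expression and markup below are made-up examples; smart_strings is the lxml keyword documented at the second link above.

from lxml import etree, html

# Compile the expression with smart_strings disabled so the strings it returns
# are plain Python strings that do not keep a reference back to the whole tree.
extract_headings = etree.XPath("//h1/text()", smart_strings=False)

tree = html.fromstring("<html><body><h1>example</h1></body></html>")
print(extract_headings(tree))

An etree.XPathEvaluator(tree, smart_strings=False) gives the same behaviour if you prefer evaluating ad-hoc expressions.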
I have a Python list of texts of very different sizes.
When I try to convert that list to a NumPy array, I get very big spikes in memory.
I am using np.array(list_of_texts) for the conversion.
Please find below the line-by-line memory usage returned by memory_profiler for an example list-of-texts conversion.
from memory_profiler import profile
import numpy as np
Line # Mem usage Increment Line Contents
================================================
4 58.9 MiB 58.9 MiB @profile
5 def f():
23 59.6 MiB 0.3 MiB small_texts = ['a' for i in range(100000)]
24 60.4 MiB 0.8 MiB big_texts = small_texts + [''.join(small_texts)]
26 61.4 MiB 0.0 MiB a = np.array(small_texts)
27 38208.9 MiB 38147.5 MiB b = np.array(big_texts)
I suspect the problem comes from the different sizes of the texts in the list.
Any idea why this is happening?
How can I keep in-RAM memory usage reasonable while converting a list of texts to a NumPy array?
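Not part of the original thread, but for what it's worth: np.array on a list of strings builds a fixed-width unicode array, so every element reserves room for the longest string in the list (here the dtype becomes '<U100000', i.e. roughly 400 KB for each of the ~100001 elements, which is in the ballpark of the ~37 GiB spike above). A sketch of the usual workaround, assuming you only need the strings referenced from an array, is dtype=object:

import numpy as np

small_texts = ['a' for i in range(100000)]
big_texts = small_texts + [''.join(small_texts)]

# dtype=object stores references to the existing Python strings instead of
# copying every element into a fixed-width unicode buffer sized for the
# longest string.
b = np.array(big_texts, dtype=object)
print(b.dtype)   # object
print(b.nbytes)  # ~8 bytes per element, just the references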
I am having trouble with high memory usage when performing FFTs with scipy's fftpack. Example obtained with the memory_profiler module:
Line # Mem usage Increment Line Contents
================================================
4 50.555 MiB 0.000 MiB @profile
5 def test():
6 127.012 MiB 76.457 MiB a = np.random.random(int(1e7))
7 432.840 MiB 305.828 MiB b = fftpack.fft(a)
8 891.512 MiB 458.672 MiB c = fftpack.ifft(b)
9 585.742 MiB -305.770 MiB del b, c
10 738.629 MiB 152.887 MiB b = fftpack.fft(a)
11 891.512 MiB 152.883 MiB c = fftpack.ifft(b)
12 509.293 MiB -382.219 MiB del a, b, c
13 547.520 MiB 38.227 MiB a = np.random.random(int(5e6))
14 700.410 MiB 152.891 MiB b = fftpack.fft(a)
15 929.738 MiB 229.328 MiB c = fftpack.ifft(b)
16 738.625 MiB -191.113 MiB del a, b, c
17 784.492 MiB 45.867 MiB a = np.random.random(int(6e6))
18 967.961 MiB 183.469 MiB b = fftpack.fft(a)
19 1243.160 MiB 275.199 MiB c = fftpack.ifft(b)
My attempt at understanding what is going on here:
The amount of memory allocated by both fft and ifft on lines 7 and 8 is more than what they need to allocate to return a result. For the call b = fftpack.fft(a), 305 MiB is allocated. The amount of memory needed for the b array is 16 B/value * 1e7 values = 160 MB ≈ 153 MiB (16 B per value as the code is returning complex128). It seems that fftpack is allocating some kind of workspace, and that the workspace is equal in size to the output array (?).
On lines 10 and 11 the same procedure is run again, but the memory usage is less this time, and more in line with what I expect. It therefore seems that fftpack is able to reuse the workspace.
On lines 13-15 and 17-19 ffts with different, smaller input sizes are performed. In both of these cases more memory than what is needed is allocated, and memory does not seem to be reused.
The memory usage reported above agrees with what windows task manager reports (to the accuracy I am able to read those graphs). If I write such a script with larger input sizes, I can make my (windows) computer very slow, indicating that it is swapping.
A second example to illustrate the problem of the memory allocated for workspace:
import numpy as np
from scipy import fftpack
from time import time

factor = 4.5
a = np.random.random(int(factor * 3e7))
start = time()
b = fftpack.fft(a)
c = fftpack.ifft(b)
end = time()
print("Elapsed: {:.4g}".format(end - start))
del a, b, c
print("Finished first fft")
a = np.random.random(int(factor * 2e7))
start = time()
b = fftpack.fft(a)
c = fftpack.ifft(b)
end = time()
print("Elapsed: {:.4g}".format(end - start))
del a, b, c
print("Finished first fft")
The code prints the following:
Elapsed: 17.62
Finished first fft
Elapsed: 38.41
Finished first fft
Notice how the second fft, which has the smaller input size, takes more than twice as long to compute. I noticed that my computer was very slow (likely swapping) during the execution of this script.
Questions:
Is it correct that the FFT can be calculated in place, without the need for extra workspace? If so, why does fftpack not do that?
Is there a problem with fftpack here? Even if it needs extra workspace, why does it not reuse that workspace when the FFT is rerun with different input sizes?
EDIT:
Old, but possibly related: https://mail.scipy.org/pipermail/scipy-dev/2012-March/017286.html
Is this the answer? https://github.com/scipy/scipy/issues/5986
This is a known issue, caused by fftpack caching its strategy for computing the FFT of a given size. That cache is about as large as the output of the computation, so if one does large FFTs with many different input sizes, the memory consumption can become significant.
The problem is described in detail here:
https://github.com/scipy/scipy/issues/5986
Numpy has a similar problem, which is being worked on:
https://github.com/numpy/numpy/pull/7686
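Not from the linked issues, but one practical way to keep the per-size cache from multiplying, assuming zero-padding is acceptable for your application (it does change the transform), is to pad every input up to a common length with the n= argument, so fftpack only has to cache a small number of sizes:

import numpy as np
from scipy import fftpack

def fft_padded(a):
    # Zero-pad to the next power of two so the set of distinct transform
    # sizes, and therefore of cached fftpack workspaces, stays small.
    n = 1 << (len(a) - 1).bit_length()
    return fftpack.fft(a, n=n)

a = np.random.random(int(1e6))
b = fft_padded(a)  # transform length 2**20 here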
Our game program initializes the data of all players into memory. My goal is to reduce memory that is not necessary. I traced the program and found that the "for" loop takes a lot of memory.
For example:
Line # Mem usage Increment Line Contents
================================================
52 @profile
53 11.691 MB 0.000 MB def test():
54 19.336 MB 7.645 MB a = ["1"] * (10 ** 6)
55 19.359 MB 0.023 MB print recipe.total_size(a, verbose=False)
56 82.016 MB 62.656 MB for i in a:
57 pass
The print on line 55 outputs: recipe.total_size(a, verbose=False) = 8000098 bytes
The question is: how can I release that 62.656 MB of memory?
P.S.
Sorry, I know my English is not very good. I appreciate everyone who reads this. :-)
If you are absolutely desperate to reduce memory usage in the loop, you can do it this way:
i = 0
while 1:
    try:
        a[i]  # accessing an element here
        i += 1
    except IndexError:
        break
Memory stats (if they are accurate):
12 9.215 MB 0.000 MB i = 0
13 9.215 MB 0.000 MB while 1:
14 60.484 MB 51.270 MB try:
15 60.484 MB 0.000 MB a[i]
16 60.484 MB 0.000 MB i += 1
17 60.484 MB 0.000 MB except IndexError:
18 60.484 MB 0.000 MB break
However, this code looks ugly and dangerous, and the reduction in memory usage is tiny.
1) Instead of iterating over a list, you should use a generator. Based on your sample code:
@profile
def test():
    a = ("1" for i in range(10**6))  # this returns a generator, instead of a list
    for i in a:
        pass
Now if you use the generator 'a' in the for loop, it won't take that much memory.
2) If you are given a list, first convert it into a generator.
@profile
def test():
    a = ["1"] * (10**6)  # the original list
    g = (i for i in a)   # convert the list into a generator object
    for i in g:          # use the generator object for iteration
        pass
Try this and see if it helps.