I get a significant memory leak when running a PyTorch model to evaluate images from a dataset.
Every new image evaluation is started in a new thread.
It doesn't matter whether the code waits for the thread to finish or not. When threads are not used (the evaluation function is just called directly), there is no leak. I've tried deleting the thread variable on every iteration, but that doesn't help.
Here is the code:
hidden_sizes = [6336, 1000]

class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = torch.nn.Linear(hidden_sizes[0], hidden_sizes[1])

    def forward(self, x):
        x = x.view(-1, hidden_sizes[0])
        x = torch.nn.functional.log_softmax(self.fc1(x), dim=1)
        return x

#---------------------------------------------------------------
def newThread(i):
    image = cv2.imread(pathx + filenames[i], cv2.IMREAD_GRAYSCALE)
    images = tran(image)
    images = tran1(images)
    images = images.unsqueeze(0)
    # Run model
    with torch.no_grad():
        logps = model(images)
    ps = torch.exp(logps)
    probab = list(ps.numpy()[0])
    pred_label = probab.index(max(probab))
#---------------------------------------------------------------
model = Net()
model.load_state_dict(torch.load("test_memory_leak.pt"))

# normalize image
tran = transforms.ToTensor()
tran1 = transforms.Normalize((0.5,), (0.5,))

pathx = "images\\"
filenames = os.listdir(pathx)

for i in range(len(filenames)):
    thread1 = threading.Thread(target=newThread, args=(i,))
    thread1.start()
    thread1.join()
What could be the reason for that?
UPD: I tried to detect the memory leak with guppy, but the reason still isn't clear. Here are some stats: the first table is the program's memory usage at the beginning, the second one is after the memory usage has increased 2.4-fold (up to 480 MB) following the analysis of 1000 images:
Partition of a set of 260529 objects. Total size = 33587422 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 72188 28 9511536 28 9511536 28 str
1 70921 27 5418336 16 14929872 44 tuple
2 32926 13 2526536 8 17456408 52 bytes
3 16843 6 2434008 7 19890416 59 types.CodeType
4 2384 1 2199952 7 22090368 66 type
5 14785 6 2010760 6 24101128 72 function
6 4227 2 1631384 5 25732512 77 dict (no owner)
7 794 0 1399928 4 27132440 81 dict of module
8 2384 1 1213816 4 28346256 84 dict of type
9 38 0 704064 2 29050320 86 dict of torch.tensortype
<575 more rows. Type e.g. '_.more' to view.>
-------------------------------------------
Partition of a set of 265841 objects. Total size = 34345930 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 72203 27 9523346 28 9523346 28 str
1 70924 27 5418584 16 14941930 44 tuple
2 32928 12 2530722 7 17472652 51 bytes
3 16844 6 2434152 7 19906804 58 types.CodeType
4 2384 1 2200488 6 22107292 64 type
5 14786 6 2010896 6 24118188 70 function
6 4232 2 1637736 5 25755924 75 dict (no owner)
7 794 0 1399928 4 27155852 79 dict of module
8 2384 1 1213816 4 28369668 83 dict of type
9 265 0 840672 2 29210340 85 set
<577 more rows. Type e.g. '_.more' to view.>
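For what it's worth, a hedged diagnostic sketch (not part of the original post): comparing tracemalloc snapshots around a batch of evaluations can sometimes surface the allocation sites that a type-level summary like the above hides. Note that tracemalloc only tracks allocations made through Python's allocator, so C-level tensor storage won't necessarily appear. The batch size of 1000 just mirrors the numbers above.
import tracemalloc
import threading

tracemalloc.start(25)                      # keep up to 25 frames per allocation
before = tracemalloc.take_snapshot()

for i in range(1000):                      # same batch size as the stats above
    t = threading.Thread(target=newThread, args=(i,))
    t.start()
    t.join()

after = tracemalloc.take_snapshot()
for stat in after.compare_to(before, 'traceback')[:10]:
    print(stat)                            # biggest growth, with allocation tracebacks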
I have a simple script in which a function does some calculations on a pandas.Series object, and I want to process it in parallel. I have made the pandas.Series object a shared-memory object so that different processes can use it.
My code is given below.
from multiprocessing import shared_memory
import pandas as pd
import numpy as np
import multiprocessing

s = pd.Series(np.random.randn(50))
s = s.to_numpy()

# Create a shared memory variable shm which can be accessed by other processes
shm_s = shared_memory.SharedMemory(create=True, size=s.nbytes)
b = np.ndarray(s.shape, dtype=s.dtype, buffer=shm_s.buf)
b[:] = s[:]

# create a dictionary to store the results, which can be accessed after the processes finish
mgr = multiprocessing.Manager()
pred_sales_all = mgr.dict()

forecast_period = 1000

# my pseudo function to run in parallel processes
def predict_model(b, model_list_str, forecast_period, pred_sales_all):
    c = pd.Series(b)
    temp_add = model_list_str + forecast_period
    temp_series = c.add(model_list_str)
    pred_sales_all[str(temp_add)] = temp_series

# parallel processing with shared memory
if __name__ == '__main__':
    model_list = [1, 2, 3, 4]
    all_process = []
    for model_list_str in model_list:
        # set up a process to run
        process = multiprocessing.Process(target=predict_model, args=(b, model_list_str, forecast_period, pred_sales_all))
        # start the process; we join() them separately, else each would finish before moving to the next process
        process.start()
        # Append all processes together
        all_process.append(process)
    # Finish execution of all processes
    for p in all_process:
        p.join()
This code works on Ubuntu, I checked. But when I run it on Windows I get the following error.
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
I also tried the solution mentioned here in the Stack Overflow question.
What is wrong with the code, and can someone solve the issue? Is my parallelization code wrong?
See my comments about moving statements at global scope into the if __name__ == '__main__': block. Otherwise they will be executed by each subprocess as part of its initialization, and there is no point in that. Moreover, the statement mgr = multiprocessing.Manager() has to be moved because it results in the creation of a new process.
from multiprocessing import shared_memory
import pandas as pd
import numpy as np
import multiprocessing

# my pseudo function to run in parallel processes
def predict_model(b, model_list_str, forecast_period, pred_sales_all):
    c = pd.Series(b)
    temp_add = model_list_str + forecast_period
    temp_series = c.add(model_list_str)
    pred_sales_all[str(temp_add)] = temp_series

# parallel processing with shared memory
if __name__ == '__main__':
    forecast_period = 1000

    s = pd.Series(np.random.randn(50))
    s = s.to_numpy()

    # Create a shared memory variable shm which can be accessed by other processes
    shm_s = shared_memory.SharedMemory(create=True, size=s.nbytes)
    b = np.ndarray(s.shape, dtype=s.dtype, buffer=shm_s.buf)
    b[:] = s[:]

    # create a dictionary to store the results, which can be accessed after the processes finish
    mgr = multiprocessing.Manager()
    pred_sales_all = mgr.dict()

    model_list = [1, 2, 3, 4]
    all_process = []
    for model_list_str in model_list:
        # set up a process to run
        process = multiprocessing.Process(target=predict_model, args=(b, model_list_str, forecast_period, pred_sales_all))
        # start the process; we join() them separately, else each would finish before moving to the next process
        process.start()
        # Append all processes together
        all_process.append(process)
    # Finish execution of all processes
    for p in all_process:
        p.join()
    print(pred_sales_all)
Prints:
{'1004': 0 4.084857
1 2.871219
2 5.644114
3 2.146666
4 3.395946
5 3.362894
6 2.366361
7 3.209334
8 4.226132
9 3.158135
10 4.090616
11 5.299314
12 3.155669
13 5.602719
14 3.107825
15 1.809457
16 3.938050
17 1.144159
18 3.286502
19 4.302809
20 3.917498
21 5.055629
22 2.230594
23 3.255307
24 2.459930
25 3.591691
26 3.248188
27 3.635262
28 4.547589
29 4.883547
30 2.635874
31 5.551306
32 2.434944
33 5.358516
34 4.606322
35 5.383417
36 2.886735
37 4.267562
38 2.053871
39 3.863734
40 3.233764
41 4.089593
42 4.754793
43 4.125400
44 2.174840
45 7.207996
46 2.925736
47 4.604850
48 4.067672
49 4.397330
dtype: float64, '1001': 0 1.084857
1 -0.128781
2 2.644114
3 -0.853334
4 0.395946
5 0.362894
6 -0.633639
7 0.209334
8 1.226132
9 0.158135
10 1.090616
11 2.299314
12 0.155669
13 2.602719
14 0.107825
15 -1.190543
16 0.938050
17 -1.855841
18 0.286502
19 1.302809
20 0.917498
21 2.055629
22 -0.769406
23 0.255307
24 -0.540070
25 0.591691
26 0.248188
27 0.635262
28 1.547589
29 1.883547
30 -0.364126
31 2.551306
32 -0.565056
33 2.358516
34 1.606322
35 2.383417
36 -0.113265
37 1.267562
38 -0.946129
39 0.863734
40 0.233764
41 1.089593
42 1.754793
43 1.125400
44 -0.825160
45 4.207996
46 -0.074264
47 1.604850
48 1.067672
49 1.397330
dtype: float64, '1002': 0 2.084857
1 0.871219
2 3.644114
3 0.146666
4 1.395946
5 1.362894
6 0.366361
7 1.209334
8 2.226132
9 1.158135
10 2.090616
11 3.299314
12 1.155669
13 3.602719
14 1.107825
15 -0.190543
16 1.938050
17 -0.855841
18 1.286502
19 2.302809
20 1.917498
21 3.055629
22 0.230594
23 1.255307
24 0.459930
25 1.591691
26 1.248188
27 1.635262
28 2.547589
29 2.883547
30 0.635874
31 3.551306
32 0.434944
33 3.358516
34 2.606322
35 3.383417
36 0.886735
37 2.267562
38 0.053871
39 1.863734
40 1.233764
41 2.089593
42 2.754793
43 2.125400
44 0.174840
45 5.207996
46 0.925736
47 2.604850
48 2.067672
49 2.397330
dtype: float64, '1003': 0 3.084857
1 1.871219
2 4.644114
3 1.146666
4 2.395946
5 2.362894
6 1.366361
7 2.209334
8 3.226132
9 2.158135
10 3.090616
11 4.299314
12 2.155669
13 4.602719
14 2.107825
15 0.809457
16 2.938050
17 0.144159
18 2.286502
19 3.302809
20 2.917498
21 4.055629
22 1.230594
23 2.255307
24 1.459930
25 2.591691
26 2.248188
27 2.635262
28 3.547589
29 3.883547
30 1.635874
31 4.551306
32 1.434944
33 4.358516
34 3.606322
35 4.383417
36 1.886735
37 3.267562
38 1.053871
39 2.863734
40 2.233764
41 3.089593
42 3.754793
43 3.125400
44 1.174840
45 6.207996
46 1.925736
47 3.604850
48 3.067672
49 3.397330
dtype: float64}
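One hedged addendum, not part of the original answer: the SharedMemory block is an OS-level resource, so once all processes are done it is usually worth releasing it explicitly; close() and unlink() are part of the multiprocessing.shared_memory API.
    # at the end of the if __name__ == '__main__': block, after the join() loop
    shm_s.close()    # detach this process's view of the shared block
    shm_s.unlink()   # free the underlying shared-memory segment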
The issue that I am having is a really strange one.
What I am trying to accomplish is the following: I am training a neural network using PyTorch, and I want to restart my training function if the training loss doesn't decrease, so as to re-initialize the network with a different set of weights. The training function is presented below:
def __train__(dp, i, j, net, restarts, epoch=0):
    if net == '2CH': model = TwoChannelCNN().cuda()
    elif net == 'Siam': model = SiameseCNN().cuda()
    elif net == 'Trad': model = TraditionalCNN().cuda()
    ls_fn = torch.nn.MSELoss(reduce=True)
    optim = torch.optim.SGD(model.parameters(), lr=1e-6, momentum=0.9)
    epochs = np.arange(100)
    eloss = []
    for epoch in epochs:
        model.train()
        train_loss = []
        tr_batches = np.array_split(dp.train_set, int(len(dp.train_set)/8))
        for tr_batch in tr_batches:
            if net == '2CH': loaded_batch = dp.__load2CH__(tr_batch)
            elif net == 'Siam': loaded_batch = dp.__loadSiam__(tr_batch)
            elif net == 'Trad': loaded_batch = dp.__load__(tr_batch, i)
            for x_batch, y_batch in loaded_batch:
                x_var, y_var = Variable(x_batch.cuda()), Variable(y_batch.cuda())
                y_pred = torch.clamp(model(x_var), 0, 1)
                loss = ls_fn(y_pred, y_var)
                train_loss.append(abs(loss.item()))
                optim.zero_grad()
                loss.backward()
                optim.step()
        eloss.append(np.mean(train_loss))
        print(epoch, np.mean(train_loss))
        if epoch == 10 and np.mean(train_loss) > 0.2:
            restarts += 1
            print('Number of restarts for client {} and fold {}: {}'.format(i, j, restarts))
            __train__(dp, i, j, net, restarts, epoch=0)
    __plotLoss__(epochs, eloss, 'train', str(i), str(j))
    torch.save(model.state_dict(), "Output/client_{}_fold_{}.pt".format(i, j))
So the restarting based on if epoch == 10 and np.mean(train_loss) > 0.2: works, but only sometimes, which is beyond my comprehension. Here is an example of the output:
0 0.5000133737921715
1 0.4999906486272812
2 0.464298670232296
3 0.2727506290078163
4 0.2628978116512299
5 0.2588871221542358
6 0.25728522151708605
7 0.25630473804473874
8 0.2556223524808884
9 0.25522999209165576
10 0.25467908215522767
Number of restarts for client 5 and fold 1: 3
0 0.10957609283713009
1 0.02840371729924134
2 0.021477583368030594
3 0.017759160268232682
4 0.015173796122947827
5 0.013349939693290782
6 0.011949078906879265
7 0.010810676779671655
8 0.00987362345259362
9 0.009110640348696108
10 0.008239036202623808
11 0.007680381585537574
12 0.007171026876221333
13 0.006765962297888837
14 0.006428168776848068
15 0.006133011780953467
16 0.005819878347673745
17 0.005572605537395361
18 0.00535818950227004
19 0.005159409143814457
20 0.0049763926251294235
21 0.004738794513338235
22 0.004578812885309958
23 0.004428663117960554
24 0.004282198464788351
25 0.004145324644400691
26 0.004018862769889626
27 0.0039044404603504573
28 0.0037960831121495744
29 0.0036947361258523586
30 0.0035982220717533267
31 0.0035018146670104723
32 0.0034150678806059887
33 0.0033372560733512698
34 0.003261332974241583
35 0.00318166259540763
36 0.003108531899014735
37 0.0030385089141125848
38 0.002977990984523103
39 0.0029195284016142937
40 0.002870084639441188
41 0.0028180573325994373
42 0.0027717544270049643
43 0.002719321814503495
44 0.0026704726860933194
45 0.0026204266263459316
46 0.002570544072460258
47 0.0025225681523167224
48 0.0024814611543610746
49 0.0024358948737413116
50 0.002398673941639636
51 0.0023606415423654587
52 0.002330436484101057
53 0.0022891738560574027
54 0.002260655496376241
55 0.002227568955708719
56 0.002191826719741698
57 0.0021609061182290058
58 0.0021279943092100666
59 0.0020966088490456513
60 0.002066195117003474
61 0.0020381672924407895
62 0.002009863329306995
63 0.001986304977759602
64 0.0019564831849032487
65 0.0019351609173580756
66 0.0019077356409993626
67 0.0018875047204855945
68 0.0018617453310780547
69 0.001839518720600381
70 0.001815563331498197
71 0.0017149778925132932
72 0.0016894878409248121
73 0.0016652211918212743
74 0.0016422999463582074
75 0.0016183732903472788
76 0.0015962369183098418
77 0.0015757764620279887
78 0.0015542267022799728
79 0.0015323152910759318
80 0.0014337954093957706
81 0.001410489170542867
82 0.0013871921329466962
83 0.0013641994057461773
84 0.001345829172682187
85 0.001322142209181493
86 0.00130379223035348
87 0.001282231878045458
88 0.001263879886683956
89 0.001243419097817167
90 0.0012279346547037929
91 0.001206978429649382
92 0.0011871445969959496
93 0.001172510546330841
94 0.0011529557384797045
95 0.0011350733004023273
96 0.001118382818282214
97 0.001103347793609089
98 0.0010848538354748599
99 0.0010698940242660911
11 0.2542190085053444
12 0.2538975296020508
So here you can see that the restarting is correct up to the 3rd restart, but then, once the network converges and training should be complete, the function restarts AGAIN after the 99th epoch (for an unknown reason), and somehow starts again at the 11th epoch, which also makes no sense since I am explicitly specifying epoch=0 whenever the function starts or restarts. I should also add that, SOMETIMES, the function completes correctly after epoch 99, when convergence has been achieved, and does not restart.
So my question is, why does this piece of code produce inconsistent results and outcomes? What am I missing here? Thanks in advance for any suggestions.
You are restarting the training by calling __train__ a second time in the if epoch == 10 and np.mean(train_loss) > 0.2: branch, but you never terminate the first loop.
So, after the second training has converged, the outer loop continues at epoch 11.
What you need is a break statement after the inner call to __train__.
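As a hedged illustration of that fix (the snippet is adapted from the __train__ function above; a return would work equally well, since the recursive call already plots and saves its own model):
        if epoch == 10 and np.mean(train_loss) > 0.2:
            restarts += 1
            print('Number of restarts for client {} and fold {}: {}'.format(i, j, restarts))
            __train__(dp, i, j, net, restarts, epoch=0)
            break  # stop the outer epoch loop so it does not resume at epoch 11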
My data frame looks like this. My goal is to predict event_id 3 based on the data of event_id 1 and event_id 2:
ds tickets_sold y event_id
3/12/19 90 90 1
3/13/19 40 130 1
3/14/19 13 143 1
3/15/19 8 151 1
3/16/19 13 164 1
3/17/19 14 178 1
3/20/19 10 188 1
3/20/19 15 203 1
3/20/19 13 216 1
3/21/19 6 222 1
3/22/19 11 233 1
3/23/19 12 245 1
3/12/19 30 30 2
3/13/19 23 53 2
3/14/19 43 96 2
3/15/19 24 120 2
3/16/19 3 123 2
3/17/19 5 128 2
3/20/19 3 131 2
3/20/19 25 156 2
3/20/19 64 220 2
3/21/19 6 226 2
3/22/19 4 230 2
3/23/19 63 293 2
I want to predict sales for the next 10 days of that data:
ds tickets_sold y event_id
3/24/19 20 20 3
3/25/19 30 50 3
3/26/19 20 70 3
3/27/19 12 82 3
3/28/19 12 94 3
3/29/19 12 106 3
3/30/19 12 118 3
So far my model is the one below. However, I am not telling the model that these are two separate events, and it would be useful to consider the data from different events together, since they belong to the same organizer and therefore provide more information than a single event. Is that kind of fitting possible with Prophet?
# Load data
df = pd.read_csv('event_data_prophet.csv')
df.drop(columns=['tickets_sold'], inplace=True, axis=0)
df.head()
# The important things to note are that cap must be specified for every row in the dataframe,
# and that it does not have to be constant. If the market size is growing, then cap can be an increasing sequence.
df['cap'] = 500
# growth: String 'linear' or 'logistic' to specify a linear or logistic trend.
m = Prophet(growth='linear')
m.fit(df)
# periods is the amount of days that I look in the future
future = m.make_future_dataframe(periods=20)
future['cap'] = 500
future.tail()
forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
fig1 = m.plot(forecast)
Start dates of events seem to cause peaks. You can use holidays for this by setting the starting date of each event as a holiday; this informs Prophet about the events (and their peaks). I noticed events 1 and 2 are overlapping, and I think you have multiple options to deal with this. You need to ask yourself what the predictive value of each event is relative to event 3. You don't have much data, which will be the main issue. If they have equal value, you could shift the date of one event, for example 11 days earlier. In the unequal-value scenario you could drop one event. You can pass the event start dates as holidays like this:
events = pd.DataFrame({
    'holiday': 'events',
    'ds': pd.to_datetime(['2019-03-24', '2019-03-12', '2019-03-01']),
    'lower_window': 0,
    'upper_window': 1,
})
m = Prophet(growth='linear', holidays=events)
m.fit(df)
I also noticed you forecast on the cumsum. I think your events are stationary, so Prophet probably benefits from forecasting the daily ticket sales rather than the cumsum.
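A hedged sketch of that last point, reusing the question's CSV and the events frame above (column names are the question's own):
df = pd.read_csv('event_data_prophet.csv')
daily = df[['ds', 'tickets_sold']].rename(columns={'tickets_sold': 'y'})  # daily sales, not the cumsum
m = Prophet(growth='linear', holidays=events)
m.fit(daily)
forecast = m.predict(m.make_future_dataframe(periods=20))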
I'm currently trying to improve the memory usage of a script that produces figures which become very "heavy" over time.
Before creating the figures:
('Before: heap:', Partition of a set of 337 objects. Total size = 82832 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 75 22 32520 39 32520 39 dict (no owner)
1 39 12 20904 25 53424 64 dict of guppy.etc.Glue.Interface
2 8 2 8384 10 61808 75 dict of guppy.etc.Glue.Share
3 16 5 4480 5 66288 80 dict of guppy.etc.Glue.Owner
4 84 25 4280 5 70568 85 str
5 23 7 3128 4 73696 89 list
6 39 12 2496 3 76192 92 guppy.etc.Glue.Interface
7 16 5 1152 1 77344 93 guppy.etc.Glue.Owner
8 1 0 1048 1 78392 95 dict of guppy.heapy.Classifiers.ByUnity
9 1 0 1048 1 79440 96 dict of guppy.heapy.Use._GLUECLAMP_
<15 more rows. Type e.g. '_.more' to view.>)
And after creating them:
('After : heap:', Partition of a set of 89339 objects. Total size = 32584064 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 2340 3 7843680 24 7843680 24 dict of matplotlib.lines.Line2D
1 1569 2 5259288 16 13102968 40 dict of matplotlib.text.Text
2 10137 11 3208536 10 16311504 50 dict (no owner)
3 2340 3 2452320 8 18763824 58 dict of matplotlib.markers.MarkerStyle
4 2261 3 2369528 7 21133352 65 dict of matplotlib.path.Path
5 662 1 2219024 7 23352376 72 dict of matplotlib.axis.XTick
6 1569 2 1644312 5 24996688 77 dict of matplotlib.font_manager.FontProperties
7 10806 12 856816 3 25853504 79 list
8 8861 10 708880 2 26562384 82 numpy.ndarray
9 1703 2 476840 1 27039224 83 dict of matplotlib.transforms.Affine2D
<181 more rows. Type e.g. '_.more' to view.>)
Then, I do:
figures = [manager.canvas.figure for manager in matplotlib._pylab_helpers.Gcf.get_all_fig_managers()]
for i, figure in enumerate(figures):
    figure.clf()
    plt.close(figure)
figures = [manager.canvas.figure for manager in matplotlib._pylab_helpers.Gcf.get_all_fig_managers()]  # here, figures == []
del figures
hp.heap()
This prints:
Partition of a set of 71966 objects. Total size = 23491976 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 1581 2 5299512 23 5299512 23 dict of matplotlib.lines.Line2D
1 1063 1 3563176 15 8862688 38 dict of matplotlib.text.Text
2 7337 10 2356952 10 11219640 48 dict (no owner)
3 1584 2 1660032 7 12879672 55 dict of matplotlib.path.Path
4 1581 2 1656888 7 14536560 62 dict of matplotlib.markers.MarkerStyle
5 441 1 1478232 6 16014792 68 dict of matplotlib.axis.XTick
6 1063 1 1114024 5 17128816 73 dict of matplotlib.font_manager.FontProperties
7 7583 11 619384 3 17748200 76 list
8 6500 9 572000 2 18320200 78 __builtin__.weakref
9 6479 9 518320 2 18838520 80 numpy.ndarray
<199 more rows. Type e.g. '_.more' to view.>
So apparently a number of matplotlib objects have been deleted, but not all of them.
To begin with, I want to look at all the Line2D objects that are left:
objs = [obj for obj in gc.get_objects() if isinstance(obj, matplotlib.lines.Line2D)]
#[... very long list with e.g., <matplotlib.lines.Line2D object at 0x1375ede590>, <matplotlib.lines.Line2D object at 0x1375ede4d0>, <matplotlib.lines.Line2D object at 0x1375eec390>, <matplotlib.lines.Line2D object at 0x1375ef6350>, <matplotlib.lines.Line2D object at 0x1375eece10>, <matplotlib.lines.Line2D object at 0x1375eec690>, <matplotlib.lines.Line2D object at 0x1375eec610>, <matplotlib.lines.Line2D object at 0x1375eec590>, <matplotlib.lines.Line2D object at 0x1375eecb10>, <matplotlib.lines.Line2D object at 0x1375ef6850>, <matplotlib.lines.Line2D object at 0x1375eec350>]
print len(objs)#29199 (!!!)
So now I would like to be able to access all these objects to be able to delete them and clear the memory, but I don't know how I could do that...
Thanks for your help!
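For context, a hedged diagnostic sketch (not an answer from the original thread): asking the garbage collector what still refers to one of the leftover Line2D objects can reveal the container that keeps them alive; gc.get_referrers is the standard-library call used here.
import gc
import matplotlib

leftovers = [obj for obj in gc.get_objects() if isinstance(obj, matplotlib.lines.Line2D)]
if leftovers:
    for referrer in gc.get_referrers(leftovers[0]):
        # the leftovers list itself will appear here; the interesting referrers
        # are matplotlib containers such as Axes attribute dicts
        print(type(referrer))
gc.collect()  # then run a full collection before re-querying gc.get_objects()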
I'm using Python 2.6.2 [GCC 4.3.3] running on Ubuntu 9.04. I need to read a big data file (~1 GB, >3 million lines), line by line, using a Python script.
I tried the methods below, but I find they use a very large amount of memory (~3 GB):
for line in open('datafile','r').readlines():
    process(line)
or,
for line in file(datafile):
    process(line)
Is there a better way to load a large file line by line, say
a) by explicitly specifying the maximum number of lines the file can hold in memory at any one time? Or
b) by loading it in chunks of some size, say 1024 bytes, provided the last line of each chunk loads completely without being truncated?
Several suggestions gave the methods I mentioned above and have already tried; I'm trying to see if there is a better way to handle this. My search has not been fruitful so far. I appreciate your help.
p/s I have done some memory profiling using Heapy and found no memory leaks in the Python code I am using.
Update 20 August 2012, 16:41 (GMT+1)
Tried both approaches as suggested by J.F. Sebastian, mgilson and IamChuckB (datafile is a variable):
with open(datafile) as f:
    for line in f:
        process(line)
Also,
import fileinput
for line in fileinput.input([datafile]):
    process(line)
Strangely, both of them use ~3 GB of memory; my data file in this test is 765.2 MB, consisting of 21,181,079 lines. I see the memory increase over time (in roughly 40-80 MB steps) before stabilizing at 3 GB.
An elementary doubt:
Is it necessary to flush the line after use?
I did memory profiling using Heapy to understand this better.
Level 1 Profiling
Partition of a set of 36043 objects. Total size = 5307704 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 15934 44 1301016 25 1301016 25 str
1 50 0 628400 12 1929416 36 dict of __main__.NodeStatistics
2 7584 21 620936 12 2550352 48 tuple
3 781 2 590776 11 3141128 59 dict (no owner)
4 90 0 278640 5 3419768 64 dict of module
5 2132 6 255840 5 3675608 69 types.CodeType
6 2059 6 247080 5 3922688 74 function
7 1716 5 245408 5 4168096 79 list
8 244 1 218512 4 4386608 83 type
9 224 1 213632 4 4600240 87 dict of type
<104 more rows. Type e.g. '_.more' to view.>
============================================================
Level 2 Profiling for Level 1-Index 0
Partition of a set of 15934 objects. Total size = 1301016 bytes.
Index Count % Size % Cumulative % Referred Via:
0 2132 13 274232 21 274232 21 '.co_code'
1 2132 13 189832 15 464064 36 '.co_filename'
2 2024 13 114120 9 578184 44 '.co_lnotab'
3 247 2 110672 9 688856 53 "['__doc__']"
4 347 2 92456 7 781312 60 '.func_doc', '[0]'
5 448 3 27152 2 808464 62 '[1]'
6 260 2 15040 1 823504 63 '[2]'
7 201 1 11696 1 835200 64 '[3]'
8 188 1 11080 1 846280 65 '[0]'
9 157 1 8904 1 855184 66 '[4]'
<4717 more rows. Type e.g. '_.more' to view.>
Level 2 Profiling for Level 1-Index 1
Partition of a set of 50 objects. Total size = 628400 bytes.
Index Count % Size % Cumulative % Referred Via:
0 50 100 628400 100 628400 100 '.__dict__'
Level 2 Profiling for Level 1-Index 2
Partition of a set of 7584 objects. Total size = 620936 bytes.
Index Count % Size % Cumulative % Referred Via:
0 1995 26 188160 30 188160 30 '.co_names'
1 2096 28 171072 28 359232 58 '.co_varnames'
2 2078 27 157608 25 516840 83 '.co_consts'
3 261 3 21616 3 538456 87 '.__mro__'
4 331 4 21488 3 559944 90 '.__bases__'
5 296 4 20216 3 580160 93 '.func_defaults'
6 55 1 3952 1 584112 94 '.co_freevars'
7 47 1 3456 1 587568 95 '.co_cellvars'
8 35 0 2560 0 590128 95 '[0]'
9 27 0 1952 0 592080 95 '.keys()[0]'
<189 more rows. Type e.g. '_.more' to view.>
Level 2 Profiling for Level 1-Index 3
Partition of a set of 781 objects. Total size = 590776 bytes.
Index Count % Size % Cumulative % Referred Via:
0 1 0 98584 17 98584 17 "['locale_alias']"
1 29 4 35768 6 134352 23 '[180]'
2 28 4 34720 6 169072 29 '[90]'
3 30 4 34512 6 203584 34 '[270]'
4 27 3 33672 6 237256 40 '[0]'
5 25 3 26968 5 264224 45 "['data']"
6 1 0 24856 4 289080 49 "['windows_locale']"
7 64 8 20224 3 309304 52 "['inters']"
8 64 8 17920 3 327224 55 "['galog']"
9 64 8 17920 3 345144 58 "['salog']"
<84 more rows. Type e.g. '_.more' to view.>
============================================================
Level 3 Profiling for Level 2-Index 0, Level 1-Index 0
Partition of a set of 2132 objects. Total size = 274232 bytes.
Index Count % Size % Cumulative % Referred Via:
0 2132 100 274232 100 274232 100 '.co_code'
Level 3 Profiling for Level 2-Index 0, Level 1-Index 1
Partition of a set of 50 objects. Total size = 628400 bytes.
Index Count % Size % Cumulative % Referred Via:
0 50 100 628400 100 628400 100 '.__dict__'
Level 3 Profiling for Level 2-Index 0, Level 1-Index 2
Partition of a set of 1995 objects. Total size = 188160 bytes.
Index Count % Size % Cumulative % Referred Via:
0 1995 100 188160 100 188160 100 '.co_names'
Level 3 Profiling for Level 2-Index 0, Level 1-Index 3
Partition of a set of 1 object. Total size = 98584 bytes.
Index Count % Size % Cumulative % Referred Via:
0 1 100 98584 100 98584 100 "['locale_alias']"
Still troubleshooting this.
Do share with me if you have faced this before.
Thanks for your help.
Update 21 August 2012, 01:55 (GMT+1)
mgilson, the process function is used to post-process a Network Simulator 2 (NS2) trace file. Some of the lines in the trace file are shared below. I am using numerous objects, counters, tuples, and dictionaries in the Python script to learn how a wireless network performs.
s 1.231932886 _25_ AGT --- 0 exp 10 [0 0 0 0 Y Y] ------- [25:0 0:0 32 0 0]
s 1.232087886 _25_ MAC --- 0 ARP 86 [0 ffffffff 67 806 Y Y] ------- [REQUEST 103/25 0/0]
r 1.232776108 _42_ MAC --- 0 ARP 28 [0 ffffffff 67 806 Y Y] ------- [REQUEST 103/25 0/0]
r 1.232776625 _34_ MAC --- 0 ARP 28 [0 ffffffff 67 806 Y Y] ------- [REQUEST 103/25 0/0]
r 1.232776633 _9_ MAC --- 0 ARP 28 [0 ffffffff 67 806 Y Y] ------- [REQUEST 103/25 0/0]
r 1.232776658 _0_ MAC --- 0 ARP 28 [0 ffffffff 67 806 Y Y] ------- [REQUEST 103/25 0/0]
r 1.232856942 _35_ MAC --- 0 ARP 28 [0 ffffffff 64 806 Y Y] ------- [REQUEST 100/25 0/0]
s 1.232871658 _0_ MAC --- 0 ARP 86 [13a 67 1 806 Y Y] ------- [REPLY 1/0 103/25]
r 1.233096712 _29_ MAC --- 0 ARP 28 [0 ffffffff 66 806 Y Y] ------- [REQUEST 102/25 0/0]
r 1.233097047 _4_ MAC --- 0 ARP 28 [0 ffffffff 66 806 Y Y] ------- [REQUEST 102/25 0/0]
r 1.233097050 _26_ MAC --- 0 ARP 28 [0 ffffffff 66 806 Y Y] ------- [REQUEST 102/25 0/0]
r 1.233097051 _1_ MAC --- 0 ARP 28 [0 ffffffff 66 806 Y Y] ------- [REQUEST 102/25 0/0]
r 1.233109522 _25_ MAC --- 0 ARP 28 [13a 67 1 806 Y Y] ------- [REPLY 1/0 103/25]
s 1.233119522 _25_ MAC --- 0 ACK 38 [0 1 67 0 Y Y]
r 1.233236204 _17_ MAC --- 0 ARP 28 [0 ffffffff 65 806 Y Y] ------- [REQUEST 101/25 0/0]
r 1.233236463 _20_ MAC --- 0 ARP 28 [0 ffffffff 65 806 Y Y] ------- [REQUEST 101/25 0/0]
D 1.233236694 _18_ MAC COL 0 ARP 86 [0 ffffffff 65 806 67 1] ------- [REQUEST 101/25 0/0]
The aim of doing 3-level profiling with Heapy is to help me narrow down which object(s) are eating up most of the memory. As you can see, unfortunately I could not see which one specifically needs tweaking, as it is too generic. For example, although I know that "dict of __main__.NodeStatistics" has only 50 objects out of 36043 (0.1%), yet takes up 12% of the total memory used to run the script, I am unable to find which specific dictionary I would need to look into.
I tried implementing David Eyk's suggestion as below (snippet), trying to manually garbage-collect every 500,000 lines:
import gc
for i, line in enumerate(file(datafile)):
    if (i % 500000 == 0):
        print '-----------This is line number', i
        collected = gc.collect()
        print "Garbage collector: collected %d objects." % (collected)
Unfortunately, the memory usage is still at 3 GB, and the output (snippet) is as below:
-----------This is line number 0
Garbage collector: collected 0 objects.
-----------This is line number 500000
Garbage collector: collected 0 objects.
Having implemented martineau's suggestion, I see the memory usage is now 22 MB, down from the earlier 3 GB! Something I had been looking forward to achieving. The strange thing is the following.
I did the same memory profiling as before,
Level 1 Profiling
Partition of a set of 35474 objects. Total size = 5273376 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 15889 45 1283640 24 1283640 24 str
1 50 0 628400 12 1912040 36 dict of __main__.NodeStatistics
2 7559 21 617496 12 2529536 48 tuple
3 781 2 589240 11 3118776 59 dict (no owner)
4 90 0 278640 5 3397416 64 dict of module
5 2132 6 255840 5 3653256 69 types.CodeType
6 2059 6 247080 5 3900336 74 function
7 1716 5 245408 5 4145744 79 list
8 244 1 218512 4 4364256 83 type
9 224 1 213632 4 4577888 87 dict of type
<104 more rows. Type e.g. '_.more' to view.>
Comparing the previous memory profiling output with the above: str has dropped by 45 objects (17376 bytes), tuple has dropped by 25 objects (3440 bytes), and dict (no owner), though unchanged in object count, has shed 1536 bytes. All other objects are the same, including dict of __main__.NodeStatistics. The total number of objects is 35474. The small reduction in objects (0.2%) produced 99.3% of the memory saving (22 MB from 3 GB). Very strange.
As you can see, although I know where the memory starvation is occurring, I am not yet able to narrow down which object is causing the bleed.
Will continue to investigate this.
Thanks to all the pointers, using this opportunity to learn much on python as I ain't an expert. Appreciate your time taken to assist me.
Update 23 August 2012, 00:01 (GMT+1) -- SOLVED
I continued debugging using the minimalistic code per martineau's suggestion. I began to add code to the process function and observe the memory bleeding.
I found the memory starts to bleed when I add a class such as the one below:
class PacketStatistics(object):
    def __init__(self):
        self.event_id = 0
        self.event_source = 0
        self.event_dest = 0
        ...
I am using 3 classes with 136 counters.
I discussed this issue with my friend Gustavo Carneiro; he suggested using __slots__ in place of __dict__.
I converted the class as below:
class PacketStatistics(object):
    __slots__ = ('event_id', 'event_source', 'event_dest', ...)

    def __init__(self):
        self.event_id = 0
        self.event_source = 0
        self.event_dest = 0
        ...
When I converted all 3 classes, the memory usage that was 3 GB before became 504 MB. A whopping ~80% saving in memory usage!!
Below is the memory profiling after the __dict__ to __slots__ conversion.
Partition of a set of 36157 objects. Total size = 4758960 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 15966 44 1304424 27 1304424 27 str
1 7592 21 624776 13 1929200 41 tuple
2 780 2 587424 12 2516624 53 dict (no owner)
3 90 0 278640 6 2795264 59 dict of module
4 2132 6 255840 5 3051104 64 types.CodeType
5 2059 6 247080 5 3298184 69 function
6 1715 5 245336 5 3543520 74 list
7 225 1 232344 5 3775864 79 dict of type
8 244 1 223952 5 3999816 84 type
9 166 0 190096 4 4189912 88 dict of class
<101 more rows. Type e.g. '_.more' to view.>
The dict of __main__.NodeStatistics is not in the top 10 anymore.
I am happy with the result and glad to close this issue.
Thanks for all your guidance. Truly appreciate it.
rgds
Saravanan K
with open('datafile') as f:
    for line in f:
        process(line)
This works because files are iterators yielding 1 line at a time until there are no more lines to yield.
The fileinput module will let you read it line by line without loading the entire file into memory (see the fileinput docs).
import fileinput
for line in fileinput.input(['myfile']):
    do_something(line)
Code example taken from yak.net
@mgilson's answer is correct. The simpler solution bears official mention, though (@HerrKaputt mentioned this in a comment):
file = open('datafile')
for line in file:
    process(line)
file.close()
This is simple, pythonic, and understandable. If you don't understand how with works, just use this.
As the other poster mentioned, this does not create a large list like file.readlines(). Rather, it pulls off one line at a time, in the way that is traditional for Unix files/pipes.
If the file is JSON, XML, CSV, genomics or any other well-known format, there are specialized readers which use C code directly and are far more optimized for both speed and memory than parsing in native Python; avoid parsing it natively whenever possible.
But in general, tips from my experience (a sketch combining them follows below):
Python's multiprocessing package is fantastic for managing subprocesses; all memory leaks go away when the subprocess ends.
Run the reader subprocess as a multiprocessing.Process and use a multiprocessing.Pipe(duplex=True) to communicate (send the filename and any other args, then read its stdout).
Read in small (but not tiny) chunks, say 64 KB-1 MB. Better for memory usage, and also for responsiveness with respect to other running processes/subprocesses.
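A hedged sketch of those tips taken together (this is my illustration, not code from the thread; the count_lines helper and the 64 KB chunk size are just assumptions for the example):
from multiprocessing import Process, Pipe

def count_lines(conn, filename, chunk_size=64 * 1024):
    # read the file in fixed-size chunks and count newlines; all memory used
    # here is released when the child process exits
    total = 0
    with open(filename, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b'\n')
    conn.send(total)   # report the result back through the pipe
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = Pipe(duplex=True)
    reader = Process(target=count_lines, args=(child_conn, 'datafile'))
    reader.start()
    print(parent_conn.recv())   # line count computed by the child
    reader.join()               # the child's memory is freed when it exits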