I made some weird observations: my GridSearches keep failing after a couple of hours and I initially couldn't figure out why. I then monitored the memory usage over time and saw that it started at a few gigabytes (~6 GB) and kept increasing until it crashed the node when it reached the maximum 128 GB the hardware can take.
I was experimenting with random forests for classification of a large number of text documents. For simplicity -- to figure out what's going on -- I went back to naive Bayes.
The versions I am using are
Python 3.4.2
scikit-learn 0.15.2
I found some related discussion on the scikit-learn issue tracker on GitHub about this topic: https://github.com/scikit-learn/scikit-learn/issues/565 and
https://github.com/scikit-learn/scikit-learn/pull/770
And it sounds like it was already successfully addressed!
So, the relevant code that I am using is
grid_search = GridSearchCV(pipeline,
                           parameters,
                           n_jobs=1,
                           cv=5,
                           scoring='roc_auc',
                           verbose=2,
                           pre_dispatch='2*n_jobs',
                           refit=False)  # tried both True and False

grid_search.fit(X_train, y_train)
print('Best score: {0}'.format(grid_search.best_score_))
print('Best parameters set:')
Just out of curiosity, I later decided to do the grid search the quick & dirty way via nested for loops:
for p1 in parameterset1:
    for p2 in parameterset2:
        ...
        pipeline = Pipeline([
            ('vec', CountVectorizer(
                binary=True,
                tokenizer=params_dict[i][0][0],
                max_df=params_dict[i][0][1],
                max_features=params_dict[i][0][2],
                stop_words=params_dict[i][0][3],
                ngram_range=params_dict[i][0][4],)),
            ('tfidf', TfidfTransformer(
                norm=params_dict[i][0][5],
                use_idf=params_dict[i][0][6],
                sublinear_tf=params_dict[i][0][7],)),
            ('clf', MultinomialNB())])

        scores = cross_validation.cross_val_score(
            estimator=pipeline,
            X=X_train,
            y=y_train,
            cv=5,
            scoring='roc_auc',
            n_jobs=1)

        params_dict[i][1] = '%s,%0.4f,%0.4f' % (params_dict[i][1], scores.mean(), scores.std())
        sys.stdout.write(params_dict[i][1] + '\n')
So far so good. The grid search runs and writes the results to stdout. However, after some time it exceeds the memory cap of 128 GB again, the same problem as with GridSearchCV in scikit-learn. After some experimentation, I finally found out that
gc.collect()
len(gc.get_objects()) # particularly this part!
in the for loop solves the problem, and the memory usage stays constant at 6.5 GB over the run time of ~10 hours.
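For concreteness, a minimal, self-contained sketch of where those two lines sit relative to the loop body (the work inside the loop is just a stand-in for building and evaluating one pipeline):

import gc

for i in range(1000):
    data = [object() for _ in range(10000)]  # stand-in for one grid-search iteration

    # the workaround: force a collection and enumerate the tracked objects
    gc.collect()
    len(gc.get_objects())  # particularly this part!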
Eventually, I got it to work with the above fix; however, I am curious to hear your ideas about what might be causing this issue, and your tips & suggestions!
RandomForest in 0.15.2 does not support sparse inputs.
Upgrade sklearn and try again...hopefully this will allow the multiple copies that end up being made to consume way less memory. (and speed things up)
I can't see your exact code, but I faced a similar problem recently.
It is worth a try.
A similar memory blow-up can easily happen when you assign a mutable array or list-like object to another variable: the assignment creates a reference rather than an independent copy, so when you later grow the new variable with append or something similar, you increase the size of the original object in the background at the same time.
This growth compounds quickly, so after some time you are out of memory. I was able to avoid this kind of phenomenon, and maybe you can too, by passing a deepcopy() of the original object instead.
I had a similar problem: I blew up the memory with a process like this, and afterwards I managed to stay at a 10% memory load.
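A minimal, self-contained sketch of the difference (hypothetical lists, just to illustrate the aliasing):

import copy

original = list(range(3))
alias = original             # a reference: both names point to the same list
alias.append(99)             # original grows too

independent = copy.deepcopy(original)
independent.append(123)      # this time original is unchanged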
UPDATE:
Now I see the snippet of code with the pandas DataFrame. Such a value-copy issue could easily happen there.
I'm not familiar with GridSearch, sir, but I'd suggest that when memory and huge lists are an issue, you write a small custom generator. It can be reused for all your items; just use one that takes any list. If you implement anything beyond the solution below, first read this article, the best generator article I've found. I typed it all in and went piece by piece, and any questions you have after reading it I can try to answer too:
https://www.jeffknupp.com/blog/2013/04/07/improve-your-python-yield-and-generators-explained/
Don't need:
for p1 in parameterset1:
Try
def listerator(this_list):
    i = 0
    while True:
        yield this_list[i]
        i += 1
The 'yield' keyword (anywhere in the definition) makes this a generator, not a regular function. It runs through and says: i equals 0; while True I gotta do stuff; they want me to yield this_list[0]; here you go, I'll wait for you at i += 1 if you need me again. The next time it is called, it picks up, does i += 1, notices it's still in a while loop, and gives this_list[1], then records its location again (it will wait there until called again). Notice that as I feed it the list once and make a generator (x here), it will exhaust your list.
In [141]: x = listerator([1,2,3])
In [142]: next(x)
Out[142]: 1
In [143]: next(x)
Out[143]: 2
In [144]: next(x)
Out[144]: 3
In [148]: next(x)
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-148-5e4e57af3a97> in <module>()
----> 1 next(x)
<ipython-input-139-ed3d6d61a17c> in listerator(this_list)
2 i = 0
3 while True:
----> 4 yield this_list[i]
5 i += 1
6
IndexError: list index out of range
Let's see if we can use it in a for:
In [221]: for val in listerator([1,2,3,4]):
.....: print val
.....:
1
2
3
4
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-221-fa4f59138165> in <module>()
----> 1 for val in listerator([1,2,3,4]):
2 print val
3
<ipython-input-220-263fba1d810b> in listerator(this_list, seed)
2 i = seed or 0
3 while True:
----> 4 yield this_list[i]
5 i += 1
IndexError: list index out of range
Nope. Let's try to handle that:
def listerator(this_list):
    i = 0
    while True:
        try:
            yield this_list[i]
        except IndexError:
            break
        i += 1
In [223]: for val in listerator([1,2,3,4]):
print val
.....:
1
2
3
4
That works. Now it won't blindly try to return a list element even if it isn't there. From what you said, I can almost guarantee you'll need to be able to seed it (pick up from a certain place, or start freshly from a certain place):
def listerator(this_list, seed=None):
    i = seed or 0
    while True:
        try:
            yield this_list[i]
        except IndexError:
            break
        i += 1
In [150]: l = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
In [151]: x = listerator(l, 8)
In [152]: next(x)
Out[152]: 9
In [153]: next(x)
Out[153]: 10
In [154]: next(x)
Out[154]: 11
i = seed or 0 is a little trick that looks for seed, but since seed defaults to None it will usually just start at the logical place, 0, the beginning of the list.
How can you use this beast without using (almost) any memory?
parameterset1 = [1,2,3,4]
parameterset2 = ['a','b','c','d']
In [224]: for p1 in listerator(parameterset1):
   .....:     for p2 in listerator(parameterset2):
   .....:         print p1, p2
   .....:
1 a
1 b
1 c
1 d
2 a
2 b
2 c
2 d
3 a
3 b
3 c
3 d
4 a
4 b
4 c
4 d
That looks familiar, huh? Now you can process a trillion values one by one, picking the important ones to write to disk, and never blowing up your system. Enjoy!
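As an aside (not part of the original suggestion): the standard library's itertools.product is also lazy, so it gives the same one-at-a-time behavior for the nested case without a custom generator.

import itertools

parameterset1 = [1, 2, 3, 4]
parameterset2 = ['a', 'b', 'c', 'd']

# product yields the combinations one by one; it never materializes the full cross product
for p1, p2 in itertools.product(parameterset1, parameterset2):
    print(p1, p2)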
Related
Hey guys, I have a script that compares each possible pair of users and checks how similar their text is:
dictionary = {
    t.id: (
        t.text,
        t.set,
        t.compare_string
    )
    for t in dataframe.itertuples()
}

highly_similar = []

for a, b in itertools.combinations(dictionary.items(), 2):
    if a[1][2] == b[1][2] and not a[1][1].isdisjoint(b[1][1]):
        similarity_score = fuzz.ratio(a[1][0], b[1][0])
        if (similarity_score >= 95 and len(a[1][0]) >= 10) or similarity_score == 100:
            highly_similar.append([a[0], b[0], a[1][0], b[1][0], similarity_score])
This script takes around 15 minutes to run. The dataframe contains 120k users, so comparing each possible combination takes quite a bit of time; if I just write pass in the for loop, it takes 2 minutes to loop through all the values.
I tried using filter() and map() for the if statements and the fuzzy score, but the performance was worse. I have tried improving the script as much as I could, but I don't know how I can improve it further.
Would really appreciate some help!
It is slightly complicated to reason about the data since you have not attached it, but we can see multiple places that might provide an improvement:
First, let's rewrite the code in a way which is easier to reason about than using the indices:
dictionary = {
    t.id: (
        t.text,
        t.set,
        t.compare_string
    )
    for t in dataframe.itertuples()
}

highly_similar = []

for a, b in itertools.combinations(dictionary.items(), 2):
    a_id, (a_text, a_set, a_compare_string) = a
    b_id, (b_text, b_set, b_compare_string) = b
    if (a_compare_string == b_compare_string
            and not a_set.isdisjoint(b_set)):
        similarity_score = fuzz.ratio(a_text, b_text)
        if ((similarity_score >= 95 and len(a_text) >= 10)
                or similarity_score == 100):
            highly_similar.append(
                [a_id, b_id, a_text, b_text, similarity_score])
You seem to only care about pairs having the same compare_string value. Therefore, and assuming this is not something that all pairs share, we can key by that value to cover far fewer pairs.
To put some numbers on it, let's say you have 120K inputs and 1K items for each value of compare_string: then instead of covering 120K * 120K ≈ 14 * 10^9 combinations, you would have 120 bins of size 1K (where in each bin we'd need to check all pairs) = 120 * 1K * 1K = 120 * 10^6, which is roughly 100 times fewer comparisons. And it would be even faster if each bin has fewer than 1K elements.
import collections

# Create a dictionary from compare_string to all items
# with the same compare_string
items_by_compare_string = collections.defaultdict(list)
for item in dictionary.items():
    compare_string = item[1][2]
    items_by_compare_string[compare_string].append(item)

# Iterate over each group of items that have the same
# compare string
for item_group in items_by_compare_string.values():
    # Check pairs only within that group
    for a, b in itertools.combinations(item_group, 2):
        a_id, (a_text, a_set, _) = a
        b_id, (b_text, b_set, _) = b
        # No need to compare the compare_strings!
        if not a_set.isdisjoint(b_set):
            similarity_score = fuzz.ratio(a_text, b_text)
            if ((similarity_score >= 95 and len(a_text) >= 10)
                    or similarity_score == 100):
                highly_similar.append(
                    [a_id, b_id, a_text, b_text, similarity_score])
But, what if we want more speed? Let's look at the remaining operations:
- We have a check to find whether two sets share at least one item.
  - This seems like an obvious candidate for optimization if we have any knowledge about these sets (to allow us to determine which pairs are even relevant to compare).
  - Without additional knowledge, and just looking at every pair and trying to speed this up, I doubt we can do much: this is probably highly optimized using internal details of Python sets, and I don't think it's likely we can optimize it further.
- We have a fuzz.ratio computation, which is some external function that I'm going to assume is heavy.
  - If you are using this from the FuzzyWuzzy package, make sure to install python-Levenshtein to get the speedups detailed here.
- We have some comparisons which we are unlikely to be able to speed up.
  - We might be able to cache the length of a_text by nesting the two loops, but that's negligible.
- We have appends to a list, which run in average ("amortized") constant time per operation, so we can't really speed that up.
Therefore, I don't think we can reasonably suggest any more speedups without additional knowledge. If we know something about the sets that can help optimize which pairs are relevant we might be able to speed things up further, but I think this is about it.
EDIT: As pointed out in other answers, you can obviously run the code with multi-threading. I assumed you were looking for an algorithmic change that would reduce the number of operations significantly, rather than just splitting them over more CPUs.
Essentially, from the Python programming side, I see two things that can improve your processing time: multi-threading and vectorized operations.
From the fuzzy score side, here is a list of tips you can use to improve your processing time (open in a new anonymous tab to avoid the paywall):
https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536
Using multi-threading you can speed up your operation by up to N times, N being the number of threads in your CPU. You can check it with:
import multiprocessing
multiprocessing.cpu_count()
Using vectorized operations you can process your data in parallel at a low level with SIMD (single instruction / multiple data) operations, or with GPU tensor operations (like those in TensorFlow/PyTorch).
Here is a small comparison of results for each case:
import numpy as np
import time

A = [np.random.rand(512) for i in range(2000)]
B = [np.random.rand(512) for i in range(2000)]
high_similarity = []

def measure(i, j, a, b, high_similarity):
    d = ((a - b) ** 2).sum()
    if d > 12:
        high_similarity.append((i, j, d))

start_single_thread = time.time()
for i in range(len(A)):
    for j in range(len(B)):
        if i < j:
            measure(i, j, A[i], B[j], high_similarity)
finish_single_thread = time.time()
print("single thread time:", finish_single_thread - start_single_thread)
out[0] single thread time: 147.64517450332642
Running on multiple threads:
from threading import Thread

high_similarity = []

def measure(a=None, b=None, high_similarity=None):
    d = ((a - b) ** 2).sum()
    if d > 12:
        high_similarity.append(d)

start_multi_thread = time.time()
for i in range(len(A)):
    for j in range(len(B)):
        if i < j:
            thread = Thread(target=measure,
                            kwargs={'a': A[i], 'b': B[j], 'high_similarity': high_similarity})
            thread.start()
            thread.join()
finish_multi_thread = time.time()
print("time to run on multi threads:", finish_multi_thread - start_multi_thread)
out[1] time to run on multi-threads: 11.946279764175415
A_array = np.array(A)
B_array = np.array(B)

start_vectorized = time.time()
for i in range(len(A_array)):
    # vectorized distance operation (squared distance per row pair)
    dists = ((A_array - B_array) ** 2).sum(axis=1)
    high_similarity += dists[dists > 12].tolist()
    # rotate B_array by one row so that different pairings are visited
    aux = B_array[-1]
    B_array = np.delete(B_array, -1, axis=0)
    B_array = np.insert(B_array, 0, aux, axis=0)
finish_vectorized = time.time()
print("time to run vectorized operations:", finish_vectorized - start_vectorized)
out[2] time to run vectorized operations: 2.302949905395508
Note that you can't guarantee any order of execution, so you will also need to store the indices of the results. The snippet of code is just to illustrate that you can use parallel processing, but I highly recommend using a pool of threads, dividing your dataset into N subsets (one per worker), and joining the final results, instead of creating a thread for each function call like I did.
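For example, a minimal sketch of that pool-of-workers idea (reusing the toy A/B data and the d > 12 check from above; the chunking scheme and worker count are just illustrative):

import multiprocessing
from concurrent.futures import ThreadPoolExecutor

import numpy as np

A = [np.random.rand(512) for _ in range(2000)]
B = [np.random.rand(512) for _ in range(2000)]

def measure_chunk(pairs):
    # each worker handles its own chunk of (i, j) index pairs and returns its hits
    hits = []
    for i, j in pairs:
        d = ((A[i] - B[j]) ** 2).sum()
        if d > 12:
            hits.append((i, j, d))
    return hits

n_workers = multiprocessing.cpu_count()
all_pairs = [(i, j) for i in range(len(A)) for j in range(len(B)) if i < j]
chunks = [all_pairs[k::n_workers] for k in range(n_workers)]

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    results = list(pool.map(measure_chunk, chunks))

# flatten the per-worker results; the indices are kept, so order does not matter
high_similarity = [hit for chunk_hits in results for hit in chunk_hits]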
I am using numpy arrays instead of pandas for speed purposes. However, I haven't been able to improve my code using broadcasting, indexing, etc. Instead, I am using loops within loops, as below. It works, but it seems so ugly and inefficient to me.
Basically, what I am doing is trying to imitate pandas' groupby at the step mydata[mydata[:,1]==i]. You may think of it as a firm id number. Then, with respect to the lookup data, I am checking whether it is inside the selected firm or not at the step all(np.isin(lookup[u],d[:,3])). But as I noted at the beginning, I feel quite uncomfortable about this.
out = []
for i in np.unique(mydata[:,1]):
    d = mydata[mydata[:,1]==i]
    for u in range(0,len(lookup)):
        control = all(np.isin(lookup[u],d[:,3]))
        if(control):
            out.append(d[np.isin(d[:,3],lookup[u])])
It takes about 0.27 seconds. However, there must be some cleverer alternative.
I also tried Numba jit() but it does not work.
Could anyone help me about that?
Thanks in advance!
Fake Data:
a = np.repeat(np.arange(100)+5000, np.random.randint(50, 100, 100))
b = np.random.randint(100,200,len(a))
c = np.random.randint(10,70,len(a))
index = np.arange(len(a))
mydata = np.vstack((index,a, b,c)).T
lookup = []
for i in range(0,60):
    lookup.append(np.random.randint(10,70,np.random.randint(3,6,1)))
I had some trouble understanding the goal of your program, but I got a decent performance improvement by refactoring your second for loop. I was able to compress your code to three or four lines.
f = (
    lambda lookup: out.append(d[np.isin(d[:, 3], lookup)])
    if all(np.isin(lookup, d[:, 3]))
    else None
)

out = []
for i in np.unique(mydata[:, 1]):
    d = mydata[mydata[:, 1] == i]
    list(map(f, lookup))
This yields the same output list you got previously, and the code runs almost twice as fast (at least on my machine).
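For what it's worth, a sketch of the same refactor written as a plain loop over lookup, which keeps the filtering explicit instead of relying on the side effect inside map (same mydata and lookup as in the fake data above):

import numpy as np

out = []
for i in np.unique(mydata[:, 1]):
    d = mydata[mydata[:, 1] == i]
    out.extend(
        d[np.isin(d[:, 3], lu)]
        for lu in lookup
        if all(np.isin(lu, d[:, 3]))
    )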
I was wondering if anyone else has ever experienced value_counts() returning incorrect counts. I have two values, Pass and Fail, and when I use value_counts() it returns the correct total but the wrong count for each value.
The data in the data frame is for samples made with different sample preparation methods (A-G) and then tested on different testing machines (numbered 1-5; they run the same test, we just have 5 of them so we can run more tests). I am trying to compare both the methods and the testers by putting the pass % into a pivot table. I would also like to be able to do this for different sample materials, so I have been trying to write the pass % function in a separate script that I can call from each material's script, if that makes sense.
The pass % function is as follows:
def pass_percent(df_copy):
    pds = df_copy.value_counts()
    p = pds['PASS']
    try:
        f = pds['FAIL']
    except KeyError:
        f = 0
    print(pds)
    print(p)
    print(f)
    pass_pc = p / (p + f) * 100
    print(pass_pc)
    return pass_pc
And then within the individual material script (e.g. material 1A) I have (among a few other things to tidy up the data frame before this - essentially getting rid of columns I don't need from the testing outputs):
from pass_pc_function import pass_percent
mat_1A = pd.pivot_table(df_copy, index='Prep_Method', columns='Test_Machine', aggfunc=pass_percent)
An example of what is happening: for Material 1A I have 100 tests of Prep_Method A on Test_Machine 1, of which 65 passed and 35 failed, so a 65% pass rate. But value_counts() is returning 56 passes and 44 fails (so the total is still 100, which is correct, but for some reason it is counting 9 passes as fails). This is just an example; I have much larger data sets than this, but this is essentially what is happening.
I thought perhaps it could be a white space issue so I also have the line:
df_copy.columns = [x.strip() for x in df_copy.columns]
in my M1A script. However I am still getting this strange error.
Any advice would be appreciated. Thanks!
EDIT:
Results example as requested
PASS 31
FAIL 27
Name: Result, dtype: int64
31
27
53.44827586206896
Result
Test_Machine 1 2 3 4
Prep_Method
A 53.448276 89.655172 93.478261 97.916667
B 87.050360 90.833333 91.596639 97.468354
C 83.333333 93.150685 98.305085 100.000000
D 85.207101 94.339623 95.652174 97.163121
E 87.901701 96.310680 95.961538 98.655462
F 73.958333 82.178218 86.166008 93.750000
G 80.000000 91.743119 89.622642 98.529412
I am trying to implement np.random.choice in tensorflow. Here is my implementation
import numpy as np
import tensorflow as tf

p = tf.Variable(0, tf.int32)
selection_sample = [i for i in range(10)]  # sample to select from
k = tf.convert_to_tensor(selection_sample)
samples = tf.random.categorical(tf.math.log([[1, 0.5, 0.3, 0.6]]), 1)
sample_selected = tf.cast(samples[0][0], tf.int64)
op = tf.assign(p, k[sample_selected])
# selection_sample[samples]

init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    print(sample_selected.eval())
    print(k.eval())
    print(sess.run(op))
    print(p.eval())
However, when sample_selected is, for example, 1, I expect p.eval() to be 1, i.e. k[1], but this is not the case. For example, running this code, a sample output is
3
[0 1 2 3 4 5 6 7 8 9]
1
1
yet p.eval() should be k[3] and sess.run(op) should also be k[3].
What am I doing wrong? Thanks.
When you do:
print(sample_selected.eval())
You get a random value derived from tf.random.categorical. That random value is returned by the session and not saved anywhere else.
Then, when you do:
print((sess.run(op)))
You are assigning the variable p a new random value produced in this call to run. That is the value printed, which is now saved in the variable.
Finally, when you do:
print(p.eval())
You see the value currently stored in p, which is the random value generated in the previous call to run.
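In other words, every .eval() or sess.run() call re-executes the sampling op, so each print refers to a different draw. A minimal sketch of one way to see consistent values (same TF 1.x graph as in the question; the sample and the assignment are fetched in a single run so they share one draw):

with tf.Session() as sess:
    sess.run(init)
    # one call to run: sample_selected and op are evaluated from the same random draw
    sample_val, assigned = sess.run([sample_selected, op])
    print(sample_val)   # e.g. 3
    print(assigned)     # k[sample_val], so also 3 in this example
    print(p.eval())     # the variable keeps that value until op is run again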
I need to multiply two huge vectors, (30720, 1) * (1, 30720), which gives a 30720 x 30720 matrix. I am using numpy.dot to multiply them, but it is taking a very long time.
With float64 data, the result is about 7 GB, so it doesn't fit in the RAM of many PCs. But you only have 30720² ≈ 1e9 multiplications to do, which takes a few seconds.
A way to avoid the memory issue is to cut the result into reasonably sized chunks (< 1 GB each), save the partial results to files with a binary protocol for speed, and add a bit of timing to see what happens:
import time
import pickle
import numpy as np

n = 3
div = 10240
a = np.random.rand(n * div, 1)
b = np.random.rand(1, n * div)

def calculate(i, j):
    # one (div x div) block of the full outer product
    u = np.dot(a[i * div:(i + 1) * div, :], b[:, j * div:(j + 1) * div])
    return u

def save(i, j, u):
    # dump the block to disk with a binary protocol for speed
    with open('data' + str(i) + str(j) + '.pk', 'wb') as f:
        pickle.dump(u, f)

def timecount(f, args):
    t0 = time.time()
    res = f(*args)
    return res, time.time() - t0

def multidot():
    tcalc, tsave = 0, 0
    for i in range(n):
        for j in range(n):
            print(i, j)
            u, dt = timecount(calculate, (i, j))
            tcalc += dt
            _, dt = timecount(save, (i, j, u))
            tsave += dt
    print('dot time', tcalc)
    print('save time', tsave)
Then the run :
In [64]: multidot()
0 0
0 1
0 2
1 0
1 1
1 2
2 0
2 1
2 2
dot time 4.697121858596802
save time 29.11250686645508
So you have no problem with dot, only with memory issues.
To read your data back, read it chunk by chunk, like this:
with open('data00.pk','rb') as f : u=pickle.load(f)
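And to stream every block back in turn, a small sketch reusing the same n and file-naming scheme as above:

for i in range(n):
    for j in range(n):
        with open('data' + str(i) + str(j) + '.pk', 'rb') as f:
            block = pickle.load(f)
        # work on the (div x div) block here, e.g. block.sum(),
        # without ever holding the full 30720 x 30720 result in memory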
Don't forget to delete the data*.pk files after this run; they take about 6 GB on your disk ;)