I'm fitting an LDA model with a lot of data using scikit-learn. The relevant piece of code looks like this:
lda = LatentDirichletAllocation(n_topics = n_topics,
                                max_iter = iters,
                                learning_method = 'online',
                                learning_offset = offset,
                                random_state = 0,
                                evaluate_every = 5,
                                n_jobs = 3,
                                verbose = 0)
lda.fit(X)
(I guess the only possibly relevant detail here is that I'm using multiple jobs.)
After some time I get a "No space left on device" error, even though there is plenty of space on the disk and plenty of free memory. I tried the same code several times, on two different computers (my local machine and a remote server), first using python3, then using python2, and each time I ended up with the same error.
If I run the same code on a smaller sample of data everything works fine.
The entire stack trace:
Failed to save <type 'numpy.ndarray'> to .npy file:
Traceback (most recent call last):
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 271, in save
obj, filename = self._write_array(obj, filename)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 231, in _write_array
self.np.save(filename, array)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/numpy/lib/npyio.py", line 491, in save
pickle_kwargs=pickle_kwargs)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/numpy/lib/format.py", line 584, in write_array
array.tofile(fp)
IOError: 275500 requested and 210934 written
IOErrorTraceback (most recent call last)
<ipython-input-7-6af7e7c9845f> in <module>()
7 n_jobs = 3,
8 verbose = 0)
----> 9 lda.fit(X)
/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/decomposition/online_lda.pyc in fit(self, X, y)
509 for idx_slice in gen_batches(n_samples, batch_size):
510 self._em_step(X[idx_slice, :], total_samples=n_samples,
--> 511 batch_update=False, parallel=parallel)
512 else:
513 # batch update
/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/decomposition/online_lda.pyc in _em_step(self, X, total_samples, batch_update, parallel)
403 # E-step
404 _, suff_stats = self._e_step(X, cal_sstats=True, random_init=True,
--> 405 parallel=parallel)
406
407 # M-step
/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/decomposition/online_lda.pyc in _e_step(self, X, cal_sstats, random_init, parallel)
356 self.mean_change_tol, cal_sstats,
357 random_state)
--> 358 for idx_slice in gen_even_slices(X.shape[0], n_jobs))
359
360 # merge result
/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self, iterable)
808 # consumption.
809 self._iterating = False
--> 810 self.retrieve()
811 # Make sure that we get a last message telling us we are done
812 elapsed_time = time.time() - self._start_time
/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in retrieve(self)
725 job = self._jobs.pop(0)
726 try:
--> 727 self._output.extend(job.get())
728 except tuple(self.exceptions) as exception:
729 # Stop dispatching any new job in the async callback thread
/home/ubuntu/anaconda2/lib/python2.7/multiprocessing/pool.pyc in get(self, timeout)
565 return self._value
566 else:
--> 567 raise self._value
568
569 def _set(self, i, obj):
IOError: [Errno 28] No space left on device
Had the same problem with LatentDirichletAllocation. It seems that you are running out of shared memory (/dev/shm when you run df -h). Try setting the JOBLIB_TEMP_FOLDER environment variable to something different, e.g. /tmp. In my case this solved the problem.
Or just increase the size of the shared memory, if you have the appropriate rights for the machine you are training the LDA on.
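If you want to confirm the diagnosis before changing anything, here is a minimal sketch of my own (assuming Python 3 on a Linux box; not part of the original answer) that reports how much room is left on /dev/shm, the tmpfs that joblib uses for memmapped arrays when it is available:

import shutil

# same numbers `df -h /dev/shm` would show, but usable from inside the notebook
total, used, free = shutil.disk_usage('/dev/shm')
print('shared memory free: %.1f MiB of %.1f MiB' % (free / 2**20, total / 2**20))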
This problem occurs when shared memory is exhausted and no further I/O operation is permitted. It is a frustrating problem that many Kaggle users run into while fitting machine learning models.
I overcame it by setting the JOBLIB_TEMP_FOLDER variable with the following code.
%env JOBLIB_TEMP_FOLDER=/tmp
The solution from #silterser solved the problem for me.
If you want to set the environment variable in the code, do this:
import os
os.environ['JOBLIB_TEMP_FOLDER'] = '/tmp'
This happens because you have set n_jobs=3. You could set it to 1; then shared memory will not be used, although learning will take longer. Alternatively you can pick a joblib cache directory as in the answer above, but bear in mind that this cache can quickly fill up your disk as well, depending on the dataset, and that the extra disk traffic can slow your job down.
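To make the two options concrete, here is an untested sketch that combines them; the constructor arguments are simply the ones from the question (note that n_topics has been renamed n_components in newer scikit-learn releases):

import os
os.environ['JOBLIB_TEMP_FOLDER'] = '/tmp'   # option A: point joblib's temp files at real disk

from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_topics=n_topics,
                                max_iter=iters,
                                learning_method='online',
                                learning_offset=offset,
                                random_state=0,
                                evaluate_every=5,
                                n_jobs=1,               # option B: single process, no shared memory needed
                                verbose=0)
lda.fit(X)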
I know it's kind of late, but I got over this problem by setting learning_method = 'batch'.
This could bring other issues, such as longer training times, but it alleviated the problem of not having enough space in shared memory.
Alternatively, a smaller batch_size might work, although I have not tested this myself.
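A rough, untested sketch of both alternatives (the default batch_size for the online method is 128, so anything smaller shrinks the per-batch temporaries; the value 32 below is an arbitrary choice):

from sklearn.decomposition import LatentDirichletAllocation

# Alternative 1: full batch updates instead of online mini-batches
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=iters,
                                learning_method='batch', random_state=0)

# Alternative 2: stay online but use smaller mini-batches
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=iters,
                                learning_method='online', learning_offset=offset,
                                batch_size=32, random_state=0)

lda.fit(X)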
Related
I have a Python function that reads random snippets from a large file and does some processing on them. I want the processing to happen in multiple processes, so I make use of multiprocessing. I open the file (in binary mode) in the parent process and pass the file descriptor to each child process, then use a multiprocessing.Lock() to synchronize access to the file. With a single worker things work as expected, but with more workers, even with the lock, the file reads will randomly return bad data (usually a bit from one part of the file and a bit from another part of the file). In addition, the position within the file (as returned by file.tell()) will often get messed up. This all suggests a basic race condition accessing the descriptor, but my understanding is that multiprocessing.Lock() should prevent concurrent access to it. Do file.seek() and/or file.read() have some kind of asynchronous operations that don't get contained within the lock/unlock barriers? What is going on here?
An easy workaround is to have each process open the file individually and get its own file descriptor (I've confirmed this does work), but I'd like to understand what I'm missing. Opening the file in text mode also prevents the issue from occurring, but doesn't work for my use case and doesn't explain what is happening in the binary case.
I've run the following reproducer on a number of Linux systems and OS X and on various local and remote file systems. I always get quite a few bad file positions and at least a couple of checksum errors. I know the read isn't guaranteed to read the full amount of data requested, but I've confirmed that is not what is happening here and omitted that code in an effort to keep things concise.
import argparse
import multiprocessing
import random
import string

def worker(worker, args):
    rng = random.Random(1234 + worker)
    for i in range(args.count):
        block = rng.randrange(args.blockcount)
        start = block * args.blocksize
        with args.lock:
            args.fd.seek(start)
            data = args.fd.read(args.blocksize)
            pos = args.fd.tell()
            if pos != start + args.blocksize:
                print(i, "bad file position", start, start + args.blocksize, pos)
            cksm = sum(data)
            if cksm != args.cksms[block]:
                print(i, "bad checksum", cksm, args.cksms[block])

args = argparse.Namespace()
args.file = '/tmp/text'
args.count = 1000
args.blocksize = 1000
args.blockcount = args.count
args.filesize = args.blocksize * args.blockcount
args.num_workers = 4
args.cksms = multiprocessing.Array('i', [0] * args.blockcount)

with open(args.file, 'w') as f:
    for i in range(args.blockcount):
        data = ''.join(random.choice(string.ascii_lowercase) for x in range(args.blocksize))
        args.cksms[i] = sum(data.encode())
        f.write(data)

args.fd = open(args.file, 'rb')
args.lock = multiprocessing.Lock()

procs = []
for i in range(args.num_workers):
    p = multiprocessing.Process(target=worker, args=(i, args))
    procs.append(p)
    p.start()
Example output:
$ python test.py
158 bad file position 969000 970000 741000
223 bad file position 908000 909000 13000
232 bad file position 679000 680000 960000
263 bad file position 959000 960000 205000
390 bad file position 771000 772000 36000
410 bad file position 148000 149000 42000
441 bad file position 677000 678000 21000
459 bad file position 143000 144000 636000
505 bad file position 579000 580000 731000
505 bad checksum 109372 109889
532 bad file position 962000 963000 243000
494 bad file position 418000 419000 2000
569 bad file position 266000 267000 991000
752 bad file position 732000 733000 264000
840 bad file position 801000 802000 933000
799 bad file position 332000 333000 989000
866 bad file position 150000 151000 248000
866 bad checksum 109116 109375
887 bad file position 39000 40000 974000
937 bad file position 18000 19000 938000
969 bad file position 20000 21000 24000
953 bad file position 542000 543000 767000
977 bad file position 694000 695000 782000
This seems to be caused by buffering: with open(args.file, 'rb', buffering=0) I can't reproduce the problem anymore.
https://docs.python.org/3/library/functions.html#open
buffering is an optional integer used to set the buffering policy. Pass 0 to switch buffering off [...] When no buffering argument is given, the default buffering policy works as follows: [...] Binary files are buffered in fixed-size chunks; the size of the buffer [...] will typically be 4096 or 8192 bytes long. [...]
I've checked: using only multiprocessing.Lock (without buffering=0) I still got bad data; with both multiprocessing.Lock and buffering=0, everything works fine.
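For reference, an untested sketch of the two workarounds, adapted from the reproducer above (it reuses its imports and the args namespace): disable Python's read buffering on the shared handle, or let each worker open its own handle so nothing is shared at all.

# Workaround 1: keep sharing the descriptor, but turn buffering off so the
# lock really serializes every read on the underlying file descriptor.
args.fd = open(args.file, 'rb', buffering=0)

# Workaround 2: each worker opens the file itself, giving it a private
# buffer and file position, so no lock is needed for reading.
def worker(worker_id, args):
    rng = random.Random(1234 + worker_id)
    with open(args.file, 'rb') as fd:
        for i in range(args.count):
            block = rng.randrange(args.blockcount)
            start = block * args.blocksize
            fd.seek(start)
            data = fd.read(args.blocksize)
            if sum(data) != args.cksms[block]:
                print(i, "bad checksum")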
I am trying to generate the summary of a large text file using Gensim Summarizer.
I am getting a memory error. I have been facing this issue for some time; any help would be really appreciated. Feel free to ask for more details.
from gensim.summarization.summarizer import summarize

file_read = open("xxxxx.txt", 'r')
Content = file_read.read()

def Summary_gen(content):
    print(len(Content))
    summary_r = summarize(Content, ratio=0.02)
    print(summary_r)

Summary_gen(Content)
The length of the document is:
365042
Error message:
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-6-a91bd71076d1> in <module>()
10
11
---> 12 Summary_gen(Content)
<ipython-input-6-a91bd71076d1> in Summary_gen(content)
6 def Summary_gen(content):
7 print(len(Content))
----> 8 summary_r=summarize(Content,ratio=0.02)
9 print(summary_r)
10
c:\python3.6\lib\site-packages\gensim\summarization\summarizer.py in summarize(text, ratio, word_count, split)
428 corpus = _build_corpus(sentences)
429
--> 430 most_important_docs = summarize_corpus(corpus, ratio=ratio if word_count is None else 1)
431
432 # If couldn't get important docs, the algorithm ends.
c:\python3.6\lib\site-packages\gensim\summarization\summarizer.py in summarize_corpus(corpus, ratio)
367 return []
368
--> 369 pagerank_scores = _pagerank(graph)
370
371 hashable_corpus.sort(key=lambda doc: pagerank_scores.get(doc, 0), reverse=True)
c:\python3.6\lib\site-packages\gensim\summarization\pagerank_weighted.py in pagerank_weighted(graph, damping)
57
58 """
---> 59 adjacency_matrix = build_adjacency_matrix(graph)
60 probability_matrix = build_probability_matrix(graph)
61
c:\python3.6\lib\site-packages\gensim\summarization\pagerank_weighted.py in build_adjacency_matrix(graph)
92 neighbors_sum = sum(graph.edge_weight((current_node, neighbor)) for neighbor in graph.neighbors(current_node))
93 for j in xrange(length):
---> 94 edge_weight = float(graph.edge_weight((current_node, nodes[j])))
95 if i != j and edge_weight != 0.0:
96 row.append(i)
c:\python3.6\lib\site-packages\gensim\summarization\graph.py in edge_weight(self, edge)
255
256 """
--> 257 return self.get_edge_properties(edge).setdefault(self.WEIGHT_ATTRIBUTE_NAME, self.DEFAULT_WEIGHT)
258
259 def neighbors(self, node):
c:\python3.6\lib\site-packages\gensim\summarization\graph.py in get_edge_properties(self, edge)
404
405 """
--> 406 return self.edge_properties.setdefault(edge, {})
407
408 def add_edge_attributes(self, edge, attrs):
MemoryError:
I have tried looking up this error on the internet, but couldn't find a workable solution.
From the logs, it looks like the code builds an adjacency matrix
---> 59 adjacency_matrix = build_adjacency_matrix(graph)
This probably tries to create a huge adjacency matrix from your 365042 documents, which cannot fit in your memory (i.e., RAM).
You could try:
- Reducing the document size to fewer files (maybe start with 10000) and checking if it works
- Running it on a system with more RAM
Did you try the word_count argument instead of ratio?
If the above still doesn't solve the problem, then it's down to gensim's implementation limitations. The only way to use gensim if you still get OOM errors is to split the documents. That will also speed up your solution (and if the document is really big, it shouldn't be a problem anyway).
What's the problem with summarize:
gensim's summarizer uses TextRank by default, an algorithm that uses PageRank. In gensim it is unfortunately implemented using a Python list of PageRank graph nodes, so it may fail if your graph is too big.
BTW is the document length measured in words, or characters?
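As a rough, untested illustration of the splitting idea from the answer above: cut the text into manageable pieces and summarize each piece on its own, so the TextRank graph never gets huge. The chunk size and per-chunk word count below are arbitrary, and a real split should respect sentence boundaries.

from gensim.summarization.summarizer import summarize

def summary_by_chunks(text, chunk_chars=50000, words_per_chunk=50):
    # naive character-based split; splitting on sentence boundaries would be better
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    partial = [summarize(c, word_count=words_per_chunk) for c in chunks if c.strip()]
    return '\n'.join(partial)

print(summary_by_chunks(Content))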
I have a large list of http user agent strings (taken from a pandas dataframe) that I am trying to parse using the python implementation of ua-parser. I can parse the list fine when only using a single thread, but based on some preliminary speed testing, it'd take me well over 10 hours to run the whole dataset.
I am trying to use pool.map() to decrease processing time but can't quite seem to figure out how to get it to work. I've read about a dozen 'tutorials' that I found online and have searched SO (likely a duplicate of some sort, as there are a lot of similar questions), but none of the dozens of attempts have worked for one reason or another. I'm assuming/hoping it's an easy fix.
Here is what I have so far:
import multiprocessing as mp
from ua_parser import user_agent_parser

http_str = df['user_agents'].tolist()

def uaparse(http_str):
    for i, item in enumerate(http_str):
        return user_agent_parser.Parse(http_str[i])

pool = mp.Pool(processes=10)
parsed = pool.map(uaparse, range(0, len(http_str)))
Right now I'm seeing the following error message:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-25-701fbf58d263> in <module>()
7
8 pool = mp.Pool(processes=10)
----> 9 results = pool.map(uaparse, range(0,len(http_str)))
/home/ubuntu/anaconda/lib/python2.7/multiprocessing/pool.pyc in map(self, func, iterable, chunksize)
249 '''
250 assert self._state == RUN
--> 251 return self.map_async(func, iterable, chunksize).get()
252
253 def imap(self, func, iterable, chunksize=1):
/home/ubuntu/anaconda/lib/python2.7/multiprocessing/pool.pyc in get(self, timeout)
565 return self._value
566 else:
--> 567 raise self._value
568
569 def _set(self, i, obj):
TypeError: 'int' object is not iterable
Thanks in advance for any assistance/direction you can provide.
It seems like all you need is:
http_str = df['user_agents'].tolist()
pool = mp.Pool(processes=10)
parsed = pool.map(user_agent_parser.Parse, http_str)
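A slightly fuller, untested sketch of the same idea, with an explicit chunksize (an arbitrary value, just to cut down inter-process overhead when there are many small strings) and the results written back to the dataframe; pool.map returns results in input order, so this is safe:

import multiprocessing as mp
from ua_parser import user_agent_parser

http_str = df['user_agents'].tolist()

pool = mp.Pool(processes=10)
try:
    # batches of strings are shipped to each worker instead of one at a time
    parsed = pool.map(user_agent_parser.Parse, http_str, chunksize=1000)
finally:
    pool.close()
    pool.join()

df['parsed_ua'] = parsed   # same order as the original rows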
I'm trying to write a function to be executed in several IPython engines. The function takes a pandas Series as an argument. Each element of the Series is a string, and the whole Series constitutes a corpus for TF.IDF computation.
After reading IPython parallel documentation and some tutorials, it seems to be quite straightforward to do, and I came up with the following:
import pandas as pd
from IPython.parallel import Client

def calculemus(corpus):
    from sklearn.feature_extraction.text import TfidfVectorizer
    vectorizer = TfidfVectorizer(min_df=1, stop_words='english')
    return vectorizer.fit_transform(corpus)

review = pd.read_csv('review.csv')['text']
review = review.fillna('')

client = Client()
r = client[-1].apply(calculemus, review).get()
BUT I got this error instead:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)/xxx/site-packages/IPython/zmq/serialize.pyc in unpack_apply_message(bufs, g, copy)
154 sa.data = m.bytes
155
--> 156 args = uncanSequence(map(unserialize, sargs), g)
157 kwargs = {}
158 for k in sorted(skwargs.iterkeys()):
/xxx/site-packages/IPython/utils/newserialized.pyc in unserialize(serialized)
175
176 def unserialize(serialized):
--> 177 return UnSerializeIt(serialized).getObject()
/xxx/site-packages/IPython/utils/newserialized.pyc in getObject(self)
159 buf = self.serialized.getData()
160 if isinstance(buf, (bytes, buffer, memoryview)):
--> 161 result = numpy.frombuffer(buf, dtype = self.serialized.metadata['dtype'])
162 else:
163 raise TypeError("Expected bytes or buffer/memoryview, but got %r"%type(buf))
ValueError: cannot create an OBJECT array from memory buffer
I'm not sure what the problem is, could someone enlighten me on this?
UPDATE
Apparently the error says exactly what it says. If I do this:
r = client[-1].apply(calculemus, np.array(review, dtype=str)).get()
it kinda works.
So the next question is, is this a feature or a limitation of IPython?
This is a bug in IPython 0.13 that should be fixed in master. There is a special case for serializing numpy arrays that avoids copying data, and this behavior is triggered by an isinstance(numpy.ndarray) check. This was inappropriate, because isinstance catches subclasses, which includes pandas objects, but those pandas objects (and array subclasses in general) should not be treated in the same way, as metadata will be lost, and reconstruction on the other side will often fail.
PS:
r = client[-1].apply(calculemus, np.array(review, dtype=str)).get()
is equivalent to
r = client[-1].apply_sync(calculemus, np.array(review, dtype=str))
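If upgrading isn't an option yet, a simple workaround (my own sketch, not from the original answer) is to hand the engine a plain list or plain ndarray instead of the pandas Series, so the buggy ndarray-subclass fast path is never hit:

# plain Python list: no numpy fast path, so the Series metadata problem disappears
r = client[-1].apply_sync(calculemus, review.tolist())

# or, as in the question's update, an explicit string ndarray
r = client[-1].apply_sync(calculemus, review.values.astype(str))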
Using 64-bit Python 3.3.1 with 32 GB RAM and this function to generate the target expression 1+1/(2+1/(2+1/...)):
def sqrt2Expansion(limit):
    term = "1+1/2"
    for _ in range(limit):
        i = term.rfind('2')
        term = term[:i] + '(2+1/2)' + term[i+1:]
    return term
I'm getting MemoryError when calling:
simplify(sqrt2Expansion(100))
Shorter expressions work fine, e.g:
simplify(sqrt2Expansion(50))
Is there a way to configure SymPy to complete this calculation? Below is the error message:
MemoryError Traceback (most recent call last)
<ipython-input-90-07c1e2de29d1> in <module>()
----> 1 simplify(sqrt2Expansion(100))
C:\Python33\lib\site-packages\sympy\simplify\simplify.py in simplify(expr, ratio, measure)
2878 from sympy.functions.special.bessel import BesselBase
2879
-> 2880 original_expr = expr = sympify(expr)
2881
2882 expr = signsimp(expr)
C:\Python33\lib\site-packages\sympy\core\sympify.py in sympify(a, locals, convert_xor, strict, rational)
176 try:
177 a = a.replace('\n', '')
--> 178 expr = parse_expr(a, locals or {}, rational, convert_xor)
179 except (TokenError, SyntaxError):
180 raise SympifyError('could not parse %r' % a)
C:\Python33\lib\site-packages\sympy\parsing\sympy_parser.py in parse_expr(s, local_dict, rationalize, convert_xor)
161
162 code = _transform(s.strip(), local_dict, global_dict, rationalize, convert_xor)
--> 163 expr = eval(code, global_dict, local_dict) # take local objects in preference
164
165 if not hit:
MemoryError:
EDIT:
I wrote a version using sympy expressions instead of strings:
def sqrt2Expansion(limit):
    x = Symbol('x')
    term = 1 + 1/x
    for _ in range(limit):
        term = term.subs({x: (2 + 1/x)})
    return term.subs({x: 2})
It runs better: sqrt2Expansion(100) returns a valid result, but sqrt2Expansion(200) produces a RuntimeError with many pages of traceback and hangs the IPython interpreter, with plenty of system memory left unused. I created a new question, Long expression crashes SymPy, with this issue.
SymPy is using eval along the path that turns your string into a SymPy object, and eval uses the built-in Python parser, which has a hard limit on nesting depth. This isn't really a SymPy issue.
For example, for me:
>>> eval("("*100+'3'+")"*100)
s_push: parser stack overflow
Traceback (most recent call last):
File "<ipython-input-46-1ce3bf24ce9d>", line 1, in <module>
eval("("*100+'3'+")"*100)
MemoryError
Short of modifying MAXSTACK in Parser.h and recompiling Python with a different limit, probably the best way to get where you're headed is to avoid using strings in the first place. [I should mention that the PyPy interpreter can make it up to ~1100 for me.]
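One way to avoid strings entirely, sketched here under the assumption that you only need the simplified value of the continued fraction: build it from the inside out with exact Rational arithmetic, so nothing ever goes through the Python parser.

from sympy import Rational

def sqrt2_convergent(limit):
    # innermost term is 2; each loop wraps it as 2 + 1/term, giving limit+1 twos,
    # which mirrors what sqrt2Expansion(limit) builds as a string
    term = Rational(2)
    for _ in range(limit):
        term = 2 + 1 / term
    return 1 + 1 / term

print(sqrt2_convergent(100))   # exact rational approximation of sqrt(2)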