Gensim Summarizer throws MemoryError, Any Solution? - python

I am trying to generate a summary of a large text file using the Gensim summarizer, and I am getting a MemoryError. I have been facing this issue for some time; any help would be really appreciated. Feel free to ask for more details.
from gensim.summarization.summarizer import summarize

file_read = open("xxxxx.txt", 'r')
Content = file_read.read()

def Summary_gen(content):
    print(len(content))
    summary_r = summarize(content, ratio=0.02)
    print(summary_r)

Summary_gen(Content)
The length of the document is:
365042
Error message:
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-6-a91bd71076d1> in <module>()
10
11
---> 12 Summary_gen(Content)
<ipython-input-6-a91bd71076d1> in Summary_gen(content)
6 def Summary_gen(content):
7 print(len(Content))
----> 8 summary_r=summarize(Content,ratio=0.02)
9 print(summary_r)
10
c:\python3.6\lib\site-packages\gensim\summarization\summarizer.py in summarize(text, ratio, word_count, split)
428 corpus = _build_corpus(sentences)
429
--> 430 most_important_docs = summarize_corpus(corpus, ratio=ratio if word_count is None else 1)
431
432 # If couldn't get important docs, the algorithm ends.
c:\python3.6\lib\site-packages\gensim\summarization\summarizer.py in summarize_corpus(corpus, ratio)
367 return []
368
--> 369 pagerank_scores = _pagerank(graph)
370
371 hashable_corpus.sort(key=lambda doc: pagerank_scores.get(doc, 0), reverse=True)
c:\python3.6\lib\site-packages\gensim\summarization\pagerank_weighted.py in pagerank_weighted(graph, damping)
57
58 """
---> 59 adjacency_matrix = build_adjacency_matrix(graph)
60 probability_matrix = build_probability_matrix(graph)
61
c:\python3.6\lib\site-packages\gensim\summarization\pagerank_weighted.py in build_adjacency_matrix(graph)
92 neighbors_sum = sum(graph.edge_weight((current_node, neighbor)) for neighbor in graph.neighbors(current_node))
93 for j in xrange(length):
---> 94 edge_weight = float(graph.edge_weight((current_node, nodes[j])))
95 if i != j and edge_weight != 0.0:
96 row.append(i)
c:\python3.6\lib\site-packages\gensim\summarization\graph.py in edge_weight(self, edge)
255
256 """
--> 257 return self.get_edge_properties(edge).setdefault(self.WEIGHT_ATTRIBUTE_NAME, self.DEFAULT_WEIGHT)
258
259 def neighbors(self, node):
c:\python3.6\lib\site-packages\gensim\summarization\graph.py in get_edge_properties(self, edge)
404
405 """
--> 406 return self.edge_properties.setdefault(edge, {})
407
408 def add_edge_attributes(self, edge, attrs):
MemoryError:
I have tried looking up this error on the internet but couldn't find a workable solution.

From the logs, it looks like the code builds an adjacency matrix:
---> 59 adjacency_matrix = build_adjacency_matrix(graph)
This probably tries to create a huge adjacency matrix for your 365042-character document, which cannot fit in your memory (i.e., RAM).
You could try:
Reducing the document size (maybe start with 10000) and checking if it works
Running it on a system with more RAM

Did you try using the word_count argument instead of ratio?
If that still doesn't solve the problem, it's because of gensim's implementation limitations. The only way to use gensim if you still get OOM errors is to split the document. That will also speed up your solution (and if the document is really big, splitting shouldn't be a problem anyway).
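As a rough illustration of the splitting idea, here is a minimal sketch; summarize_in_chunks is a hypothetical helper (not part of gensim), the chunk size is arbitrary, and cutting on raw character offsets can split sentences, so a real version should split on sentence or paragraph boundaries instead:

from gensim.summarization.summarizer import summarize

def summarize_in_chunks(text, chunk_chars=100000, word_count=50):
    # Summarize each chunk separately so the TextRank graph stays small.
    partial_summaries = []
    for start in range(0, len(text), chunk_chars):
        chunk = text[start:start + chunk_chars]
        try:
            partial_summaries.append(summarize(chunk, word_count=word_count))
        except ValueError:
            # summarize() raises ValueError on chunks with too few sentences.
            continue
    return "\n".join(partial_summaries)

print(summarize_in_chunks(Content))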
What's the problem with summarize:
gensim's summarizer uses TextRank by default, an algorithm that uses PageRank. In gensim it is unfortunately implemented using a Python list of PageRank graph nodes, so it may fail if your graph is too big.
BTW is the document length measured in words, or characters?

Related

list out of range when using embeddings

I have the following list:
list1 = [['brute-force', 'password-guessing', 'password-guessing', 'default-credentials', 'shell'],
         ['malware', 'ddos', 'phishing', 'spam', 'botnet', 'cryptojacking', 'xss', 'sqli', 'vulnerability'],
         ['sensitive-information']]
I am trying the example from here.
However, when I fit my list to get the embeddings:
embeddings1 = sbert_model.encode(list1, convert_to_tensor=True)
I get the following error:
IndexError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_16484/3954167634.py in <module>
----> 1 embeddings2 = sbert_model.encode(list3, convert_to_tensor=True)
~\anaconda3\envs\tensorflow_env\lib\site-packages\sentence_transformers\SentenceTransformer.py in encode(self, sentences, batch_size, show_progress_bar, output_value, convert_to_numpy, convert_to_tensor, device, normalize_embeddings)
159 for start_index in trange(0, len(sentences), batch_size, desc="Batches", disable=not show_progress_bar):
160 sentences_batch = sentences_sorted[start_index:start_index+batch_size]
--> 161 features = self.tokenize(sentences_batch)
162 features = batch_to_device(features, device)
163
~\anaconda3\envs\tensorflow_env\lib\site-packages\sentence_transformers\SentenceTransformer.py in tokenize(self, texts)
317 Tokenizes the texts
318 """
--> 319 return self._first_module().tokenize(texts)
320
321 def get_sentence_features(self, *features):
~\anaconda3\envs\tensorflow_env\lib\site-packages\sentence_transformers\models\Transformer.py in tokenize(self, texts)
101 for text_tuple in texts:
102 batch1.append(text_tuple[0])
--> 103 batch2.append(text_tuple[1])
104 to_tokenize = [batch1, batch2]
105
IndexError: list index out of range
I understand how lists work and I have read many answers to the same problem here, but I cannot figure out why it is going out of range.
Any ideas?
You need to flatten your input nested list first.
from nltk import flatten
flattened_list1 = flatten(list1)
embeddings1 = sbert_model.encode(flattened_list1, convert_to_tensor=True)
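If you would rather not pull in nltk just for flattening, a plain list comprehension does the same thing (this assumes sbert_model is an already-loaded SentenceTransformer, as in the question):

# Flatten the nested list with a comprehension instead of nltk.flatten.
flattened_list1 = [tag for sublist in list1 for tag in sublist]
embeddings1 = sbert_model.encode(flattened_list1, convert_to_tensor=True)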

I can't import the UCF101 dataset (torchvision), 'list index out of range' error

dataset = torchvision.datasets.UCF101(r'my_directory', annotation_path=r'my_directory2', frames_per_clip=16, step_between_clips=1, frame_rate=None, fold=1, train=True, transform=transforms.Compose([transforms.ToTensor()]), _precomputed_metadata=None, num_workers=1, _video_width=64, _video_height=64, _video_min_dimension=0, _audio_samples=0)
This line works; the problem is when I try to do any kind of operation using dataset, in particular this one:
data = torch.utils.data.DataLoader(dataset, batch_size=512, shuffle=True)
I get the error, so I can't work on the video data because I can't use the DataLoader. The error is:
IndexError: list index out of range
Complete error message:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-16-dc3631481cb3> in <module>
----> 1 data = torch.utils.data.DataLoader(dataset, batch_size=512, shuffle=True)
~\anaconda3\lib\site-packages\torch\utils\data\dataloader.py in __init__(self, dataset, batch_size, shuffle, sampler, batch_sampler, num_workers, collate_fn, pin_memory, drop_last, timeout, worker_init_fn, multiprocessing_context)
211 else: # map-style
212 if shuffle:
--> 213 sampler = RandomSampler(dataset)
214 else:
215 sampler = SequentialSampler(dataset)
~\anaconda3\lib\site-packages\torch\utils\data\sampler.py in __init__(self, data_source, replacement, num_samples)
90 "since a random permute will be performed.")
91
---> 92 if not isinstance(self.num_samples, int) or self.num_samples <= 0:
93 raise ValueError("num_samples should be a positive integer "
94 "value, but got num_samples={}".format(self.num_samples))
~\anaconda3\lib\site-packages\torch\utils\data\sampler.py in num_samples(self)
98 # dataset size might change at runtime
99 if self._num_samples is None:
--> 100 return len(self.data_source)
101 return self._num_samples
102
~\anaconda3\lib\site-packages\torchvision\datasets\ucf101.py in __len__(self)
96
97 def __len__(self):
---> 98 return self.video_clips.num_clips()
99
100 def __getitem__(self, idx):
~\anaconda3\lib\site-packages\torchvision\datasets\video_utils.py in num_clips(self)
241 Number of subclips that are available in the video list.
242 """
--> 243 return self.cumulative_sizes[-1]
244
245 def get_clip_location(self, idx):
IndexError: list index out of range
Problem:
This problem occurs when you run your code on Windows, because Windows paths use a backslash ("\") instead of a forward slash ("/").
As you can see in the code:
https://github.com/pytorch/vision/blob/7b9d30eb7c4d92490d9ac038a140398e0a690db6/torchvision/datasets/ucf101.py#L94
this line reads the file path from the label file as "action/video_name" and merges it with the "root" path using a backslash, so the full path becomes something like "root\action/video_name". Such paths don't match the video list built at line #97, and the indices variable ends up as an empty list.
Solution:
Two possible solutions:
Replace the forward slashes "/" in the label files with backslashes "\".
Override the _select_fold(...) function of the UCF101 class and fix the backslashes inside it, as sketched below.
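For option 2, here is a minimal sketch written against the _select_fold implementation in the torchvision version linked above (other releases may use a different signature or logic, so treat it as a starting point; the class name UCF101Fixed is just illustrative):

import os
from torchvision.datasets import UCF101

class UCF101Fixed(UCF101):
    def _select_fold(self, video_list, annotation_path, fold, train):
        name = "train" if train else "test"
        name = "{}list{:02d}.txt".format(name, fold)
        f = os.path.join(annotation_path, name)
        selected_files = []
        with open(f, "r") as fid:
            data = fid.readlines()
            data = [x.strip().split(" ")[0] for x in data]
            # Normalize the forward slashes read from the annotation file so the
            # joined paths match the backslash paths in video_list on Windows.
            data = [os.path.normpath(os.path.join(self.root, x)) for x in data]
            selected_files.extend(data)
        selected_files = set(selected_files)
        indices = [i for i in range(len(video_list))
                   if os.path.normpath(video_list[i]) in selected_files]
        return indices

You would then construct the dataset with UCF101Fixed(...) in place of torchvision.datasets.UCF101(...), using the same arguments as in the question.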

"No space left on device" error while fitting Sklearn model

I'm fitting an LDA model with lots of data using scikit-learn. The relevant piece of code looks like this:
lda = LatentDirichletAllocation(n_topics=n_topics,
                                max_iter=iters,
                                learning_method='online',
                                learning_offset=offset,
                                random_state=0,
                                evaluate_every=5,
                                n_jobs=3,
                                verbose=0)
lda.fit(X)
(I guess the only possibly relevant detail here is that I'm using multiple jobs.)
After some time I'm getting "No space left on device" error, even though there is plenty of space on the disk and plenty of free memory. I tried the same code several times, on two different computers (on my local machine and on a remote server), first using python3, then using python2, and each time I ended up with the same error.
If I run the same code on a smaller sample of data everything works fine.
The entire stack trace:
Failed to save <type 'numpy.ndarray'> to .npy file:
Traceback (most recent call last):
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 271, in save
obj, filename = self._write_array(obj, filename)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 231, in _write_array
self.np.save(filename, array)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/numpy/lib/npyio.py", line 491, in save
pickle_kwargs=pickle_kwargs)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/numpy/lib/format.py", line 584, in write_array
array.tofile(fp)
IOError: 275500 requested and 210934 written
IOErrorTraceback (most recent call last)
<ipython-input-7-6af7e7c9845f> in <module>()
7 n_jobs = 3,
8 verbose = 0)
----> 9 lda.fit(X)
/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/decomposition/online_lda.pyc in fit(self, X, y)
509 for idx_slice in gen_batches(n_samples, batch_size):
510 self._em_step(X[idx_slice, :], total_samples=n_samples,
--> 511 batch_update=False, parallel=parallel)
512 else:
513 # batch update
/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/decomposition/online_lda.pyc in _em_step(self, X, total_samples, batch_update, parallel)
403 # E-step
404 _, suff_stats = self._e_step(X, cal_sstats=True, random_init=True,
--> 405 parallel=parallel)
406
407 # M-step
/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/decomposition/online_lda.pyc in _e_step(self, X, cal_sstats, random_init, parallel)
356 self.mean_change_tol, cal_sstats,
357 random_state)
--> 358 for idx_slice in gen_even_slices(X.shape[0], n_jobs))
359
360 # merge result
/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self, iterable)
808 # consumption.
809 self._iterating = False
--> 810 self.retrieve()
811 # Make sure that we get a last message telling us we are done
812 elapsed_time = time.time() - self._start_time
/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in retrieve(self)
725 job = self._jobs.pop(0)
726 try:
--> 727 self._output.extend(job.get())
728 except tuple(self.exceptions) as exception:
729 # Stop dispatching any new job in the async callback thread
/home/ubuntu/anaconda2/lib/python2.7/multiprocessing/pool.pyc in get(self, timeout)
565 return self._value
566 else:
--> 567 raise self._value
568
569 def _set(self, i, obj):
IOError: [Errno 28] No space left on device
I had the same problem with LatentDirichletAllocation. It seems that you are running out of shared memory (/dev/shm when you run df -h). Try setting the JOBLIB_TEMP_FOLDER environment variable to something different, e.g. /tmp. In my case this solved the problem.
Or just increase the size of the shared memory, if you have the appropriate rights on the machine you are training the LDA on.
This problem occurs when shared memory is exhausted and no I/O operation is permissible. It is a frustrating problem that hits many Kaggle users when fitting machine learning models.
I overcame it by setting the JOBLIB_TEMP_FOLDER variable using the following code:
%env JOBLIB_TEMP_FOLDER=/tmp
The solution of #silterser solved the problem for me.
If you want to set the environment variable in code, do this:
import os
os.environ['JOBLIB_TEMP_FOLDER'] = '/tmp'
This is because you have set n_jobs=3. You could set it to 1; then shared memory will not be used, even though learning will take longer. You can choose to select a joblib cache dir as per the answer above, but bear in mind that this cache can quickly fill up your disk as well, depending on the dataset, and disk transactions can slow your job down.
I know it's kind of late, but I got around this problem by setting learning_method='batch'.
This could present other issues, such as longer training times, but it alleviated the problem of running out of space in shared memory.
Or maybe a smaller batch_size can be set, although I have not tested this myself.
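Putting those suggestions together, a minimal sketch (parameter values are illustrative, and newer scikit-learn releases name the first argument n_components rather than n_topics):

from sklearn.decomposition import LatentDirichletAllocation

# n_jobs=1 keeps everything in one process (no joblib temp files in shared
# memory), and learning_method='batch' was reported above to help as well.
lda = LatentDirichletAllocation(n_components=n_topics,
                                max_iter=iters,
                                learning_method='batch',
                                random_state=0,
                                n_jobs=1)
lda.fit(X)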

using pool.map to apply function to list of strings in parallel?

I have a large list of http user agent strings (taken from a pandas dataframe) that I am trying to parse using the python implementation of ua-parser. I can parse the list fine when only using a single thread, but based on some preliminary speed testing, it'd take me well over 10 hours to run the whole dataset.
I am trying to use pool.map() to decrease processing time but can't quite seem to figure out how to get it to work. I've read about a dozen 'tutorials' that I found online and have searched SO (likely a duplicate of some sort, as there are a lot of similar questions), but none of the dozens of attempts have worked for one reason or another. I'm assuming/hoping it's an easy fix.
Here is what I have so far:
import multiprocessing as mp
from ua_parser import user_agent_parser

http_str = df['user_agents'].tolist()

def uaparse(http_str):
    for i, item in enumerate(http_str):
        return user_agent_parser.Parse(http_str[i])

pool = mp.Pool(processes=10)
parsed = pool.map(uaparse, range(0, len(http_str)))
Right now I'm seeing the following error message:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-25-701fbf58d263> in <module>()
7
8 pool = mp.Pool(processes=10)
----> 9 results = pool.map(uaparse, range(0,len(http_str)))
/home/ubuntu/anaconda/lib/python2.7/multiprocessing/pool.pyc in map(self, func, iterable, chunksize)
249 '''
250 assert self._state == RUN
--> 251 return self.map_async(func, iterable, chunksize).get()
252
253 def imap(self, func, iterable, chunksize=1):
/home/ubuntu/anaconda/lib/python2.7/multiprocessing/pool.pyc in get(self, timeout)
565 return self._value
566 else:
--> 567 raise self._value
568
569 def _set(self, i, obj):
TypeError: 'int' object is not iterable
Thanks in advance for any assistance/direction you can provide.
It seems like all you need is:
http_str = df['user_agents'].tolist()
pool = mp.Pool(processes=10)
parsed = pool.map(user_agent_parser.Parse, http_str)
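For completeness, a self-contained version of that answer (it assumes df is the DataFrame from the question; the with-statement form of Pool needs Python 3):

import multiprocessing as mp
from ua_parser import user_agent_parser

if __name__ == '__main__':
    # Map Parse directly over the strings so each worker receives one
    # user-agent string rather than an integer index.
    http_str = df['user_agents'].tolist()
    with mp.Pool(processes=10) as pool:
        parsed = pool.map(user_agent_parser.Parse, http_str)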

IPython.parallel ValueError: cannot create an OBJECT array from memory buffer

I'm trying to write a function to be executed in several IPython engines. The function takes a pandas Series as an argument. Each element of the Series is a string, and the whole Series constitutes a corpus for TF.IDF computation.
After reading IPython parallel documentation and some tutorials, it seems to be quite straightforward to do, and I came up with the following:
import pandas as pd
from IPython.parallel import Client

def calculemus(corpus):
    from sklearn.feature_extraction.text import TfidfVectorizer
    vectorizer = TfidfVectorizer(min_df=1, stop_words='english')
    return vectorizer.fit_transform(corpus)

review = pd.read_csv('review.csv')['text']
review = review.fillna('')

client = Client()
r = client[-1].apply(calculemus, review).get()
BUT I got this error instead:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)/xxx/site-packages/IPython/zmq/serialize.pyc in unpack_apply_message(bufs, g, copy)
154 sa.data = m.bytes
155
--> 156 args = uncanSequence(map(unserialize, sargs), g)
157 kwargs = {}
158 for k in sorted(skwargs.iterkeys()):
/xxx/site-packages/IPython/utils/newserialized.pyc in unserialize(serialized)
175
176 def unserialize(serialized):
--> 177 return UnSerializeIt(serialized).getObject()
/xxx/site-packages/IPython/utils/newserialized.pyc in getObject(self)
159 buf = self.serialized.getData()
160 if isinstance(buf, (bytes, buffer, memoryview)):
--> 161 result = numpy.frombuffer(buf, dtype = self.serialized.metadata['dtype'])
162 else:
163 raise TypeError("Expected bytes or buffer/memoryview, but got %r"%type(buf))
ValueError: cannot create an OBJECT array from memory buffer
I'm not sure what the problem is, could someone enlighten me on this?
UPDATE
Apparently the error says exactly what it says. If I do this:
r = client[-1].apply(calculemus, np.array(review, dtype=str)).get()
it kinda works.
So the next question is, is this a feature or a limitation of IPython?
This is a bug in IPython 0.13 that should be fixed in master. There is a special case for serializing numpy arrays that avoids copying data, and this behavior is triggered by an isinstance(numpy.ndarray) check. This was inappropriate, because isinstance catches subclasses, which includes pandas objects, but those pandas objects (and array subclasses in general) should not be treated in the same way, as metadata will be lost, and reconstruction on the other side will often fail.
PS:
r = client[-1].apply(calculemus, np.array(review, dtype=str)).get()
is equivalent to
r = client[-1].apply_sync(calculemus, np.array(review, dtype=str))
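Until you are on a fixed IPython version, another way to sidestep the ndarray-subclass serialization path is to hand the engine a plain Python list, in the same spirit as the np.array(review, dtype=str) workaround above (a sketch, using the same client and calculemus as before):

# Plain lists are pickled normally, so the buffer-based numpy path is skipped.
r = client[-1].apply_sync(calculemus, review.tolist())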
