I have the following list:
list1 = [['brute-force', 'password-guessing', 'password-guessing', 'default-credentials', 'shell'],
         ['malware', 'ddos', 'phishing', 'spam', 'botnet', 'cryptojacking', 'xss', 'sqli', 'vulnerability'],
         ['sensitive-information']]
I am trying the example from here.
However, when I encode my list to get the embeddings:
embeddings1 = sbert_model.encode(list1, convert_to_tensor=True)
I get the following error:
IndexError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_16484/3954167634.py in <module>
----> 1 embeddings2 = sbert_model.encode(list3, convert_to_tensor=True)
~\anaconda3\envs\tensorflow_env\lib\site-packages\sentence_transformers\SentenceTransformer.py in encode(self, sentences, batch_size, show_progress_bar, output_value, convert_to_numpy, convert_to_tensor, device, normalize_embeddings)
159 for start_index in trange(0, len(sentences), batch_size, desc="Batches", disable=not show_progress_bar):
160 sentences_batch = sentences_sorted[start_index:start_index+batch_size]
--> 161 features = self.tokenize(sentences_batch)
162 features = batch_to_device(features, device)
163
~\anaconda3\envs\tensorflow_env\lib\site-packages\sentence_transformers\SentenceTransformer.py in tokenize(self, texts)
317 Tokenizes the texts
318 """
--> 319 return self._first_module().tokenize(texts)
320
321 def get_sentence_features(self, *features):
~\anaconda3\envs\tensorflow_env\lib\site-packages\sentence_transformers\models\Transformer.py in tokenize(self, texts)
101 for text_tuple in texts:
102 batch1.append(text_tuple[0])
--> 103 batch2.append(text_tuple[1])
104 to_tokenize = [batch1, batch2]
105
IndexError: list index out of range
I understand how lists work and I have read many answers to the same problem here, but I cannot figure out why it is going out of range.
Any ideas?
You need to flatten your nested input list first. As the traceback shows, the tokenizer treats each inner list as a (text, text_pair) tuple and indexes text_tuple[1], which fails for the single-element list ['sensitive-information'].
from nltk import flatten
flattened_list1 = flatten(list1)
embeddings1 = sbert_model.encode(flattened_list1, convert_to_tensor=True)
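If you prefer not to pull in nltk just for this, a plain list comprehension does the same thing (a minimal sketch, assuming sbert_model is the SentenceTransformer instance already loaded above):
# Flatten the nested list of tags into one flat list of strings
flattened_list1 = [item for sublist in list1 for item in sublist]
embeddings1 = sbert_model.encode(flattened_list1, convert_to_tensor=True)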
I am using ColumnTransformer for the first time and I keep getting a TypeError. Is something wrong with the code?
Code:
transformer = ColumnTransformer(('cat', OrdinalEncoder(), ['job_industry_category', 'job_title', 'wealth_segmenr', 'gender']),
                                ('numb', MinMaxScaler(), ['tenure', 'age']))
data_impute = transformer.fit_transform(data_impute)
Error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-109-ca10f9ee762b> in <module>
----> 1 data_impute = transformer.fit_transform(data_impute)
~\anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py in fit_transform(self, X, y)
512 self._feature_names_in = None
513 X = _check_X(X)
--> 514 self._validate_transformers()
515 self._validate_column_callables(X)
516 self._validate_remainder(X)
~\anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py in _validate_transformers(self)
271 return
272
--> 273 names, transformers, _ = zip(*self.transformers)
274
275 # validate names
TypeError: zip argument #2 must support iteration
A wrong construction of the ColumnTransformer object results in passing wrong arguments to the internal zip call, which produces an error message that is useless to the mere mortal.
The constructor is:
class sklearn.compose.ColumnTransformer(transformers, *, remainder='drop', sparse_threshold=0.3, n_jobs=None, transformer_weights=None, verbose=False)
The wrong construction passes transformers as ('cat', OrdinalEncoder(), ['job_industry_category', 'job_title', 'wealth_segmenr', 'gender']), then remainder as ('numb', MinMaxScaler(), ['tenure', 'age']), and it fails miserably...
From an example in the docs:
ct = ColumnTransformer(
[("norm1", Normalizer(norm='l1'), [0, 1]),
("norm2", Normalizer(norm='l1'), slice(2, 4))])
You must pass a list of tuples, not the tuples as separate arguments.
Fix:
transformer = ColumnTransformer([  # note the [ to start the list
    ('cat', OrdinalEncoder(), ['job_industry_category', 'job_title', 'wealth_segmenr', 'gender']),
    ('numb', MinMaxScaler(), ['tenure', 'age'])
])  # note the ] to end the list
Data-science modules generally lack strong type hints or type checking. Read the docs and copy/adapt the examples instead of starting from a blank page!
dataset = torchvision.datasets.UCF101(
    r'my_directory', annotation_path=r'my_directory2', frames_per_clip=16,
    step_between_clips=1, frame_rate=None, fold=1, train=True,
    transform=transforms.Compose([transforms.ToTensor()]), _precomputed_metadata=None,
    num_workers=1, _video_width=64, _video_height=64, _video_min_dimension=0,
    _audio_samples=0)
This line works; the problem is when I try to do any kind of operation using dataset, in particular this one:
data = torch.utils.data.DataLoader(dataset, batch_size=512, shuffle=True)
I get the error below, so I can't work on the video data because I can't use the DataLoader. The error is:
IndexError: list index out of range
Complete error message:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-16-dc3631481cb3> in <module>
----> 1 data = torch.utils.data.DataLoader(dataset, batch_size=512, shuffle=True)
~\anaconda3\lib\site-packages\torch\utils\data\dataloader.py in __init__(self, dataset, batch_size, shuffle, sampler, batch_sampler, num_workers, collate_fn, pin_memory, drop_last, timeout, worker_init_fn, multiprocessing_context)
211 else: # map-style
212 if shuffle:
--> 213 sampler = RandomSampler(dataset)
214 else:
215 sampler = SequentialSampler(dataset)
~\anaconda3\lib\site-packages\torch\utils\data\sampler.py in __init__(self, data_source, replacement, num_samples)
90 "since a random permute will be performed.")
91
---> 92 if not isinstance(self.num_samples, int) or self.num_samples <= 0:
93 raise ValueError("num_samples should be a positive integer "
94 "value, but got num_samples={}".format(self.num_samples))
~\anaconda3\lib\site-packages\torch\utils\data\sampler.py in num_samples(self)
98 # dataset size might change at runtime
99 if self._num_samples is None:
--> 100 return len(self.data_source)
101 return self._num_samples
102
~\anaconda3\lib\site-packages\torchvision\datasets\ucf101.py in __len__(self)
96
97 def __len__(self):
---> 98 return self.video_clips.num_clips()
99
100 def __getitem__(self, idx):
~\anaconda3\lib\site-packages\torchvision\datasets\video_utils.py in num_clips(self)
241 Number of subclips that are available in the video list.
242 """
--> 243 return self.cumulative_sizes[-1]
244
245 def get_clip_location(self, idx):
IndexError: list index out of range
Problem:
This problem occurs when you run the code on Windows, because Windows paths use a backslash ("\") instead of a forward slash ("/").
As you see in the code:
https://github.com/pytorch/vision/blob/7b9d30eb7c4d92490d9ac038a140398e0a690db6/torchvision/datasets/ucf101.py#L94
So, this line of code reads the file path from the label file as "action/video_name" and merges it with the "root" path using a backslash, so the full path becomes "root\action/video_name". Such paths don't match the entries in the video list built at line 97, so the indices variable ends up as an empty list.
Solution:
Two possible solutions:
Replace the forward slashes "/" in the label files with backslashes "\" (see the sketch after this list).
Override the _select_fold(...) function of the UCF101 class and fix the path separators inside that function.
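A minimal sketch of the first option, assuming the label files are the *.txt lists inside the annotation_path directory (back them up before rewriting):
import glob
import os

annotation_dir = r'my_directory2'  # the annotation_path passed to UCF101

# Rewrite every annotation list file so that "/" becomes the Windows separator "\"
for path in glob.glob(os.path.join(annotation_dir, '*.txt')):
    with open(path, 'r') as f:
        content = f.read()
    with open(path, 'w') as f:
        f.write(content.replace('/', '\\'))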
I am trying to generate a summary of a large text file using the Gensim summarizer.
I am getting a memory error. I have been facing this issue for some time; any help
would be really appreciated. Feel free to ask for more details.
from gensim.summarization.summarizer import summarize

file_read = open("xxxxx.txt", 'r')
Content = file_read.read()

def Summary_gen(content):
    print(len(Content))
    summary_r = summarize(Content, ratio=0.02)
    print(summary_r)

Summary_gen(Content)
The length of the document is:
365042
Error message:
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-6-a91bd71076d1> in <module>()
10
11
---> 12 Summary_gen(Content)
<ipython-input-6-a91bd71076d1> in Summary_gen(content)
6 def Summary_gen(content):
7 print(len(Content))
----> 8 summary_r=summarize(Content,ratio=0.02)
9 print(summary_r)
10
c:\python3.6\lib\site-packages\gensim\summarization\summarizer.py in summarize(text, ratio, word_count, split)
428 corpus = _build_corpus(sentences)
429
--> 430 most_important_docs = summarize_corpus(corpus, ratio=ratio if word_count is None else 1)
431
432 # If couldn't get important docs, the algorithm ends.
c:\python3.6\lib\site-packages\gensim\summarization\summarizer.py in summarize_corpus(corpus, ratio)
367 return []
368
--> 369 pagerank_scores = _pagerank(graph)
370
371 hashable_corpus.sort(key=lambda doc: pagerank_scores.get(doc, 0), reverse=True)
c:\python3.6\lib\site-packages\gensim\summarization\pagerank_weighted.py in pagerank_weighted(graph, damping)
57
58 """
---> 59 adjacency_matrix = build_adjacency_matrix(graph)
60 probability_matrix = build_probability_matrix(graph)
61
c:\python3.6\lib\site-packages\gensim\summarization\pagerank_weighted.py in build_adjacency_matrix(graph)
92 neighbors_sum = sum(graph.edge_weight((current_node, neighbor)) for neighbor in graph.neighbors(current_node))
93 for j in xrange(length):
---> 94 edge_weight = float(graph.edge_weight((current_node, nodes[j])))
95 if i != j and edge_weight != 0.0:
96 row.append(i)
c:\python3.6\lib\site-packages\gensim\summarization\graph.py in edge_weight(self, edge)
255
256 """
--> 257 return self.get_edge_properties(edge).setdefault(self.WEIGHT_ATTRIBUTE_NAME, self.DEFAULT_WEIGHT)
258
259 def neighbors(self, node):
c:\python3.6\lib\site-packages\gensim\summarization\graph.py in get_edge_properties(self, edge)
404
405 """
--> 406 return self.edge_properties.setdefault(edge, {})
407
408 def add_edge_attributes(self, edge, attrs):
MemoryError:
I have tried looking up this error on the internet but couldn't find a workable solution.
From the logs, it looks like the code builds an adjacency matrix
---> 59 adjacency_matrix = build_adjacency_matrix(graph)
This probably tries to create a huge adjacency matrix for your document of length 365042, which cannot fit in your memory (i.e., RAM).
You could try:
- Reducing the document size (maybe start with 10000) and checking whether it works
- Running it on a system with more RAM
Did you try using the word_count argument instead of ratio?
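For instance, a rough sketch of the splitting idea (chunk boundaries here are plain character offsets, so sentences may be cut mid-way; chunked_summary is a hypothetical helper, not part of gensim):
from gensim.summarization.summarizer import summarize

def chunked_summary(text, chunk_size=50000, ratio=0.02):
    # Summarize fixed-size character chunks separately so each PageRank
    # graph stays small enough to fit in RAM, then join the partial summaries.
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return "\n".join(summarize(chunk, ratio=ratio) for chunk in chunks)

print(chunked_summary(Content))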
If the above still doesn't solve the problem, then it's because of gensim's implementation limitations. The only way to use gensim if you still get OOM errors is to split the document. That will also speed things up (and if the document is really big, splitting shouldn't be a problem anyway).
What's the problem with summarize: gensim's summarizer uses TextRank by default, an algorithm that relies on PageRank. In gensim it is unfortunately implemented using a Python list of PageRank graph nodes, so it may fail if the graph is too big.
BTW, is the document length measured in words or characters?
I am creating a bag-of-words representation of each sentence. I then take the words that occur in the sentence and look them up in the file "vectors.txt" to get their embedding vectors. After getting a vector for each word in the sentence, I take the average of those vectors. This is my code:
import nltk
import numpy as np
from nltk import FreqDist
from nltk.corpus import brown

news = brown.words(categories='news')
news_sents = brown.sents(categories='news')
fdist = FreqDist(w.lower() for w in news)
vocabulary = [word for word, _ in fdist.most_common(10)]
num_sents = len(news_sents)

def averageEmbeddings(sentenceTokens, embeddingLookupTable):
    listOfEmb = []
    for token in sentenceTokens:
        embedding = embeddingLookupTable[token]
        listOfEmb.append(embedding)
    return sum(np.asarray(listOfEmb)) / float(len(listOfEmb))

embeddingVectors = {}
with open("D:\\Embedding\\vectors.txt") as file:
    for line in file:
        (key, *val) = line.split()
        embeddingVectors[key] = val

for i in range(num_sents):
    features = {}
    for word in vocabulary:
        features[word] = int(word in news_sents[i])
    print(features)
    print(list(features.values()))
    sentenceTokens = []
    for key, value in features.items():
        if value == 1:
            sentenceTokens.append(key)
    sentenceTokens.remove(".")
    print(sentenceTokens)
    print(averageEmbeddings(sentenceTokens, embeddingVectors))

print(features.keys())
Not sure why, but I get this error:
TypeError Traceback (most recent call last)
<ipython-input-4-643ccd012438> in <module>()
39 sentenceTokens.remove(".")
40 print(sentenceTokens)
---> 41 print(averageEmbeddings(sentenceTokens, embeddingVectors))
42
43 print(features.keys())
<ipython-input-4-643ccd012438> in averageEmbeddings(sentenceTokens, embeddingLookupTable)
18 listOfEmb.append(embedding)
19
---> 20 return sum(np.asarray(listOfEmb)) / float(len(listOfEmb))
21
22 embeddingVectors = {}
TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U9') dtype('<U9') dtype('<U9')
P.S. The embedding vectors file looks like this:
the 0.011384 0.010512 -0.008450 -0.007628 0.000360 -0.010121 0.004674 -0.000076
of 0.002954 0.004546 0.005513 -0.004026 0.002296 -0.016979 -0.011469 -0.009159
and 0.004691 -0.012989 -0.003122 0.004786 -0.002907 0.000526 -0.006146 -0.003058
one 0.014722 -0.000810 0.003737 -0.001110 -0.011229 0.001577 -0.007403 -0.005355
in -0.001046 -0.008302 0.010973 0.009608 0.009494 -0.008253 0.001744 0.003263
After using np.sum I get this error:
TypeError Traceback (most recent call last)
<ipython-input-13-8a7edbb9d946> in <module>()
40 sentenceTokens.remove(".")
41 print(sentenceTokens)
---> 42 print(averageEmbeddings(sentenceTokens, embeddingVectors))
43
44 print(features.keys())
<ipython-input-13-8a7edbb9d946> in averageEmbeddings(sentenceTokens, embeddingLookupTable)
18 listOfEmb.append(embedding)
19
---> 20 return np.sum(np.asarray(listOfEmb)) / float(len(listOfEmb))
21
22 embeddingVectors = {}
C:\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py in sum(a, axis, dtype, out, keepdims)
1829 else:
1830 return _methods._sum(a, axis=axis, dtype=dtype,
-> 1831 out=out, keepdims=keepdims)
1832
1833
C:\Anaconda3\lib\site-packages\numpy\core\_methods.py in _sum(a, axis, dtype, out, keepdims)
30
31 def _sum(a, axis=None, dtype=None, out=None, keepdims=False):
---> 32 return umr_sum(a, axis, dtype, out, keepdims)
33
34 def _prod(a, axis=None, dtype=None, out=None, keepdims=False):
TypeError: cannot perform reduce with flexible type
You have a numpy array of strings, not floats. That is what dtype('<U9') means: a little-endian unicode string of up to 9 characters.
try:
return sum(np.asarray(listOfEmb, dtype=float)) / float(len(listOfEmb))
However, you don't need numpy here at all; since each entry of listOfEmb is a list of number strings, you can average element-wise with plain Python:
return [sum(float(x) for x in dims) / len(listOfEmb) for dims in zip(*listOfEmb)]
Or, if you're really set on using numpy:
return np.asarray(listOfEmb, dtype=float).mean(axis=0)
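For illustration, a quick check with two made-up rows in the same format as vectors.txt shows the difference the dtype makes:
import numpy as np

# Values parsed from the text file are strings, so numpy builds a unicode array
listOfEmb = [['0.011384', '0.010512'], ['0.002954', '0.004546']]
print(np.asarray(listOfEmb).dtype)                      # <U8: unicode strings, so sum() fails
print(np.asarray(listOfEmb, dtype=float).mean(axis=0))  # [0.007169 0.007529]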
I want to build a generalized suffix tree. I am using http://www.daimi.au.dk/~mailund/suffix_tree.html for this implementation. My code is as follows:
listing = []
s1 = u'abcd'
x = 36
for i in range(x):
    listing.append(s1)
stree = GeneralisedSuffixTree(listing)
For x = 35 the code works fine, but for x = 36 or more I get the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-36-e7b83eadb302> in <module>()
6 listing.append(s1)
7
----> 8 stree = GeneralisedSuffixTree(listing)
9
10 count = []
/home/darshan/anaconda/lib/python2.7/site-packages/suffix_tree.pyc in __init__(self, sequences)
113 self.sequences += [u'']
114
--> 115 SuffixTree.__init__(self,concatString)
116 self._annotateNodes()
117
/home/darshan/anaconda/lib/python2.7/site-packages/suffix_tree.pyc in __init__(self, s, t)
60 must not contain the special symbol $.'''
61 if t in s:
---> 62 raise "The suffix tree string must not contain terminal symbol!"
63 _suffix_tree.SuffixTree.__init__(self,s,t)
64
TypeError: exceptions must be old-style classes or derived from BaseException, not str
The exceptions come from this file: https://github.com/Yacoby/suffix-tree-unicode/blob/master/suffix_tree.py
I don't understand why it works for x < 36 but not for larger values.
Please help me understand what is going on here.