I have tuples in a list of lists and would like to extract only some elements of each tuple. A sample of the input data is below.
# input
[[('ab', 0.026412873688749918), ('dc', 0.016451082731822664), ('on', 0.014278088125928066),
('qc', 0.009752817881775656), ('mn', 0.008332886637563352), ('nt', 0.008250535392602258),
('nsw', 0.006874273287824427), ('bar', 0.005878684829852004), ('tor', 0.005741627328513831),
('wds', 0.004119216502907735)],
[('nb', 0.03053649661493629), ('ns', 0.01925207174326825), ('ham', 0.016207228280183325),
('bra', 0.013390785663058102), ('nia', 0.00878166482558038), ('knxr', 0.004648856466085521),
('nwm', 0.004463444159552605), ('md', 0.004377821331080258), ('ut', 0.004165890522922745),
('va', 0.0037484060754341083)]]
What I am trying to do is get the first items in the tuples.
# output
[['ab', 'dc', 'on', 'qc', 'mn', 'nt', 'nsw', 'bar', 'tor', 'wds'],
['nb', 'ns', 'ham', 'bra', 'nia', 'knxr', 'nwm', 'md', 'ut', 'va']]
input = [
[('ab', 0.026412873688749918), ('dc', 0.016451082731822664), ('on', 0.014278088125928066),
('qc', 0.009752817881775656), ('mn', 0.008332886637563352), ('nt', 0.008250535392602258),
('nsw', 0.006874273287824427), ('bar', 0.005878684829852004), ('tor', 0.005741627328513831),
('wds', 0.004119216502907735)],
[('nb', 0.03053649661493629), ('ns', 0.01925207174326825), ('ham', 0.016207228280183325),
('bra', 0.013390785663058102), ('nia', 0.00878166482558038), ('knxr', 0.004648856466085521),
('nwm', 0.004463444159552605), ('md', 0.004377821331080258), ('ut', 0.004165890522922745),
('va', 0.0037484060754341083)]
]
As illustrated in the comments you can use list comprehensions to achieve this:
[[idx for idx, val in x] for x in input]
# Result
[['ab', 'dc', 'on', 'qc', 'mn', 'nt', 'nsw', 'bar', 'tor', 'wds'],
['nb', 'ns', 'ham', 'bra', 'nia', 'knxr', 'nwm', 'md', 'ut', 'va']]
A more complex way to achieve this would be to use zip() to separate the first elements from the second elements of the tuples as shown below:
[('ab', 'dc', 'on', 'qc', 'mn', 'nt', 'nsw', 'bar', 'tor', 'wds'),
(0.026412873688749918,0.016451082731822664,0.014278088125928066,0.009752817881775656,0.008332886637563352,0.008250535392602258,0.006874273287824427,0.005878684829852004,0.005741627328513831,0.004119216502907735)]
This approach can be done using:
[list(list(zip(*x))[0]) for x in input]
# Result
[['ab', 'dc', 'on', 'qc', 'mn', 'nt', 'nsw', 'bar', 'tor', 'wds'],
['nb', 'ns', 'ham', 'bra', 'nia', 'knxr', 'nwm', 'md', 'ut', 'va']]
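To see why zip(*x) separates the two halves of each tuple, here is a minimal sketch on a shortened, rounded version of the data. The names pairs, labels, and scores are illustrative and do not come from the question.

```python
# A shortened inner list, with values rounded for readability.
pairs = [('ab', 0.026), ('dc', 0.016), ('on', 0.014)]

# zip(*pairs) is the same as zip(('ab', 0.026), ('dc', 0.016), ('on', 0.014)):
# it groups all first elements together and all second elements together.
labels, scores = zip(*pairs)

print(list(labels))  # ['ab', 'dc', 'on']
print(list(scores))  # [0.026, 0.016, 0.014]
```

Note that the zip() form fails on an empty inner list (there is nothing to unpack or index), while the plain list comprehension simply returns an empty list for it.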
You can use a loop or a list comprehension to do this.
The input data is a list of lists that contain tuples. Access the first element of each tuple with tuple[0] and append it to an initially empty list, like this:
input_data = [
[('ab', 0.026412873688749918), ('dc', 0.016451082731822664), ('on', 0.014278088125928066),
('qc', 0.009752817881775656), ('mn', 0.008332886637563352), ('nt', 0.008250535392602258),
('nsw', 0.006874273287824427), ('bar', 0.005878684829852004), ('tor', 0.005741627328513831),
('wds', 0.004119216502907735)],
[('nb', 0.03053649661493629), ('ns', 0.01925207174326825), ('ham', 0.016207228280183325),
('bra', 0.013390785663058102), ('nia', 0.00878166482558038), ('knxr', 0.004648856466085521),
('nwm', 0.004463444159552605), ('md', 0.004377821331080258), ('ut', 0.004165890522922745),
('va', 0.0037484060754341083)]
]
data_list = []
for x in input_data:
    d_list = []
    for y in x:
        d_list.append(y[0])
    data_list.append(d_list)
# Result...
[['ab', 'dc', 'on', 'qc', 'mn', 'nt', 'nsw', 'bar', 'tor', 'wds'],
['nb', 'ns', 'ham', 'bra', 'nia', 'knxr', 'nwm', 'md', 'ut', 'va']]
Using a list comprehension:
This is a shorthand way to write the for loop above, without the append() calls and the initial empty lists.
data_list = [ [y[0] for y in x] for x in input_data ]
# Result...
[['ab', 'dc', 'on', 'qc', 'mn', 'nt', 'nsw', 'bar', 'tor', 'wds'],
['nb', 'ns', 'ham', 'bra', 'nia', 'knxr', 'nwm', 'md', 'ut', 'va']]
How do I solve this: I have a table of letters in which each grid point is labeled with its coordinates.
(0,0)(0,1)(0,2)(0,3) o l n c
(1,0)(1,1)(1,2)(1,3) e t e a
(2,0)(2,1)(2,2)(2,3) i b t a
(3,0)(3,1)(3,2)(3,3) o m m f
I am trying to find all possible lines of length 3, 4, or 5 going through the grid.
For example: PossibleSolutions = [[(0,0),(0,1),(0,2)],[(0,0),(1,1),(2,2)],[(0,0),(1,0),(2,0)]]
with each of these representing: [[o,l,n],[o,t,t],[o,e,i]]
I want all possible combinations, but only those that stay within the grid layout.
from itertools import combinations
def PossibleWords(possible, board):
    words = []
    for i in range(len(possible)):
        words.append(board[possible[i]])
    return words
def Combinations(board):
    coord = list(board.keys())
    temp = []
    temp.append(list(combinations([
        (0,0),(0,1),(0,2),(0,3),
        (1,0),(1,1),(1,2),(1,3),
        (2,0),(2,1),(2,2),(2,3),
        (3,0),(3,1),(3,2),(3,3)
    ], 3)))
    possible = temp[0]
    form = []
    temp = []
    solutions = []
    for i in range(len(possible)):
        form.append(PossibleWords(possible[i], board))
    for i in range(len(form)):
        temp.append(form[i])
    for i in range(len(temp)):
        solutions.append(''.join(temp[i]))
    return solutions
output = ['ole', 'ole', 'one', 'one', 'oaf', 'oaf', 'let', 'lee', 'lei', 'let', 'lei', 'let', 'lab', 'lam', 'lam', 'lit', 'lam', 'lam', 'net', 'nee', 'net', 'net', 'nam', 'nam', 'nam', 'nam', 'cee', 'cab', 'cat', 'cam', 'cam', 'cam', 'cam', 'eta', 'eta', 'eat', 'eta', 'tea', 'tet', 'tea', 'tab', 'tat', 'tit', 'tom', 'tom', 'eat', 'eta', 'aim', 'aim', 'bam', 'bam', 'tom', 'tom']
I've tried combinations() but since my grid is in a list it doesn't follow the grid boundaries. Any guidance would be helpful, thank you.
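One possible reading of the question is that a "line" is a straight run of cells in one of the eight compass directions, as in the three sample solutions. Under that assumption (the question does not confirm it), a minimal sketch could walk from every start cell in every direction and keep only the runs that stay inside the grid. The function name straight_lines and the way the board is built here are illustrative, not taken from the question.

```python
def straight_lines(rows, cols, lengths=(3, 4, 5)):
    """Return every straight run of grid coordinates that fits in the grid."""
    # Eight directions as (row step, column step).
    directions = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                  (0, 1), (1, -1), (1, 0), (1, 1)]
    lines = []
    for length in lengths:
        for r in range(rows):
            for c in range(cols):
                for dr, dc in directions:
                    # Only the last cell needs checking: if it is inside
                    # the grid, every cell before it is inside as well.
                    end_r = r + dr * (length - 1)
                    end_c = c + dc * (length - 1)
                    if 0 <= end_r < rows and 0 <= end_c < cols:
                        lines.append([(r + dr * k, c + dc * k)
                                      for k in range(length)])
    return lines


# Build the board from the question as a dict keyed by (row, column).
letter_rows = ["olnc", "etea", "ibta", "ommf"]
board = {(r, c): ch
         for r, row in enumerate(letter_rows)
         for c, ch in enumerate(row)}

lines = straight_lines(4, 4)
words = [''.join(board[cell] for cell in line) for line in lines]
print(len(lines))  # 68: 48 lines of length 3 and 20 of length 4
```

A 4x4 grid has no straight line of length 5, so that length contributes nothing here. If lines are meant to bend through adjacent cells instead, a different search (for example a depth-first walk over neighbours) would be needed.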
I want to remove duplicate items from a list and keep the order of the original list, as in the example I have:
['hangman', 'song', 'most', 'broadly', 'song', 'hangman', 'work', 'music', 'work', 'broadly', 'typically']
and I want this:
['hangman', 'song', 'most', 'broadly', 'work', 'music', 'typically']
How can I do this?
You can use a set for this; a set will not hold the same item more than once:
old_list = ['hangman', 'song', 'most', 'broadly', 'song', 'hangman', 'work', 'music', 'work', 'broadly', 'typically']
new_list = list(set(old_list))
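The new_list will contain each item exactly once, but a set is unordered, so the items may not come back in the order shown here:
['hangman', 'song', 'most', 'broadly', 'work', 'music', 'typically']
The following loop keeps the order of your old_list, in case you need that: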
old_list = ['hangman', 'song', 'most', 'broadly', 'song', 'hangman', 'work', 'music', 'work', 'broadly', 'typically']
new_list = []
for n in old_list:
    if n not in new_list:
        new_list.append(n)
print(new_list)
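The loop above checks membership in a list, which gets slower as the list grows. A minimal sketch of the same idea that tracks seen items in a set instead, assuming the items are hashable (the function name dedupe_keep_order is illustrative):

```python
def dedupe_keep_order(items):
    """Return items with duplicates removed, keeping first occurrences in order."""
    seen = set()      # fast membership checks
    result = []       # preserves the original order
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


old_list = ['hangman', 'song', 'most', 'broadly', 'song', 'hangman',
            'work', 'music', 'work', 'broadly', 'typically']
print(dedupe_keep_order(old_list))
# ['hangman', 'song', 'most', 'broadly', 'work', 'music', 'typically']
```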
You can do this using a set; see the Python set documentation (assuming you are using Python 3). Using the first list as input, you can add all elements to the set, and the add operation will ignore duplicates. However, when reading back from the set, the order is not guaranteed.
Example:
>>> mylist = ["a", "b", "c", "a"]
>>> myset = set(mylist)
>>> myset
set(['a', 'c', 'b'])
To remove duplicates and keep the order of the original list in Python 3.7 and later (where dicts preserve insertion order), you can use dict.fromkeys():
l = ['hangman', 'song', 'most', 'broadly', 'song', 'hangman', 'work', 'music', 'work', 'broadly', 'typically']
list(dict.fromkeys(l))
# ['hangman', 'song', 'most', 'broadly', 'work', 'music', 'typically']
On older versions, use collections.OrderedDict instead.
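For completeness, a minimal sketch of the OrderedDict variant, which works the same way but does not rely on plain dicts preserving insertion order:

```python
from collections import OrderedDict

l = ['hangman', 'song', 'most', 'broadly', 'song', 'hangman',
     'work', 'music', 'work', 'broadly', 'typically']

# fromkeys() keeps the first occurrence of each key, in insertion order.
unique = list(OrderedDict.fromkeys(l))
print(unique)
# ['hangman', 'song', 'most', 'broadly', 'work', 'music', 'typically']
```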
I am performing topic modelling and using functions to get the top keywords in the topic models as shown below.
def getTopKWords(self, K):
    results = []
    """
    returns top K discriminative words for topic t
    ie words v for which p(v|t) is maximum
    """
    index = []
    key_terms = []
    pseudocounts = np.copy(self.n_vt)
    normalizer = np.sum(pseudocounts, (0))
    pseudocounts /= normalizer[np.newaxis, :]
    for t in range(self.numTopics):
        topWordIndices = pseudocounts[:, t].argsort()[-1:-(K+1):-1]
        vocab = self.vectorizer.get_feature_names()
        print (t, [vocab[i] for i in topWordIndices])
        ## Code for storing the values in a single list
    return results
The above function prints the following output:
0 ['computer', 'laptop', 'mac', 'use', 'bought', 'like', 'warranty', 'screen', 'way', 'just']
1 ['laptop', 'computer', 'use', 'just', 'like', 'time', 'great', 'windows', 'macbook', 'months']
2 ['computer', 'great', 'laptop', 'mac', 'buy', 'just', 'macbook', 'use', 'pro', 'windows']
3 ['laptop', 'computer', 'great', 'time', 'battery', 'use', 'apple', 'love', 'just', 'work']
This comes from the 4 iterations of the loop, each printing the topic index and the keywords for that topic.
Now I want the function to return a single list like the following:
return [keyword1, keyword2, keyword3, keyword4]
where keyword1/2/3/4 are the words that occur most often across the lists with index 0, 1, 2, 3 in the output.
You can use collections.Counter:
from collections import Counter
a = ['computer', 'laptop', 'mac', 'use', 'bought', 'like',
'warranty', 'screen', 'way', 'just']
b = ['laptop', 'computer', 'use', 'just', 'like', 'time',
'great', 'windows', 'macbook', 'months']
c = ['computer', 'great', 'laptop', 'mac', 'buy', 'just',
'macbook', 'use', 'pro', 'windows']
d = ['laptop', 'computer', 'great', 'time', 'battery', 'use',
'apple', 'love', 'just', 'work']
def get_most_common(*kwargs):
    """Accepts iterables, feeds all into Counter and returns the Counter instance"""
    c = Counter()
    for k in kwargs:
        c.update(k)
    return c
# get the most common ones
mc = get_most_common(a,b,c,d).most_common()
# print top 4 keys
top4 = [k for k,v in mc[0:4]]
print (top4)
Output:
['computer', 'laptop', 'use', 'just']
Inside your function, collect each topic's keywords and count them after the loop:
some_results = [] # store stuff
for t in range(self.numTopics):
    topWordIndices = pseudocounts[:, t].argsort()[-1:-(K+1):-1]
    vocab = self.vectorizer.get_feature_names()
    print (t, [vocab[i] for i in topWordIndices])
    some_results.append( [vocab[i] for i in topWordIndices] )
mc = get_most_common(*some_results).most_common()
return [k for k,v in mc[0:4]]
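The counting step can also be written without a helper that loops over the lists. A minimal sketch, assuming the per-topic keyword lists have already been collected into one list of lists (the function name top_words_across and its parameter n are illustrative):

```python
from collections import Counter
from itertools import chain


def top_words_across(keyword_lists, n=4):
    """Return the n words that appear most often across all keyword lists."""
    # chain.from_iterable flattens the list of lists into one stream of words.
    counts = Counter(chain.from_iterable(keyword_lists))
    return [word for word, _ in counts.most_common(n)]


topics = [
    ['computer', 'laptop', 'mac', 'use', 'bought', 'like',
     'warranty', 'screen', 'way', 'just'],
    ['laptop', 'computer', 'use', 'just', 'like', 'time',
     'great', 'windows', 'macbook', 'months'],
    ['computer', 'great', 'laptop', 'mac', 'buy', 'just',
     'macbook', 'use', 'pro', 'windows'],
    ['laptop', 'computer', 'great', 'time', 'battery', 'use',
     'apple', 'love', 'just', 'work'],
]
print(top_words_across(topics))
```

In this data, four words appear in all four lists, so those are the four returned; how ties beyond that are broken depends on the order in which Counter first saw the words.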
I am trying to build an inverted index, i.e. map a piece of text to the document it came from and its position within the list/document.
In my case I have a parsed list containing lists (i.e. a list of lists).
My input looks like this:
[
['why', 'was', 'cinderella', 'late', 'for', 'the', 'ball', 'she', 'forgot', 'to', 'swing', 'the', 'bat'],
['why', 'is', 'the', 'little', 'duck', 'always', 'so', 'sad', 'because', 'he', 'always', 'sees', 'a', 'bill', 'in', 'front', 'of', 'his', 'face'],
['what', 'has', 'four', 'legs', 'and', 'goes', 'booo', 'a', 'cow', 'with', 'a', 'cold'],
['what', 'is', 'a', 'caterpillar', 'afraid', 'of', 'a', 'dogerpillar'],
['what', 'did', 'the', 'crop', 'say', 'to', 'the', 'farmer', 'why', 'are', 'you', 'always', 'picking', 'on', 'me']
]
This is my code:
def create_inverted(mylists):
    myDict = {}
    for sublist in mylists:
        for i in range(len(sublist)):
            if sublist[i] in myDict:
                myDict[sublist[i]].append(i)
            else:
                myDict[sublist[i]] = [i]
    return myDict
It does build the dictionary, but when I do a search I am not getting the correct
result. I am trying to do something like this:
documents = [['owl', 'lion'], ['lion', 'deer'], ['owl', 'leopard']]
index = {'owl': [0, 2],
'lion': [0, 1], # IDs are sorted.
'deer': [1],
'leopard': [2]}
def indexed_search(documents, index, query):
    return [documents[doc_id] for doc_id in index[query]]
print(indexed_search(documents, index, 'lion'))
I want to be able to enter search text and get back the list IDs.
Any ideas?
You're mapping each word to the positions it was found in in each document, not which document it was found in. You should store indexes into the list of documents instead of indexes into the documents themselves, or perhaps just map words to documents directly instead of to indices:
def create_inverted_index(documents):
    index = {}
    for i, document in enumerate(documents):
        for word in set(document):
            if word in index:
                index[word].append(i)
            else:
                index[word] = [i]
    return index
Most of this is the same as your code. The main differences are in these two lines:
for i, document in enumerate(documents):
    for word in set(document):
which correspond to the following part of your code:
for sublist in mylists:
    for i in range(len(sublist)):
enumerate iterates over the indices and elements of a sequence. Since enumerate is on the outer loop, i in my code is the index of the document, while i in your code is the index of a word within a document.
set(document) creates a set of the words in the document, where each word appears only once. This ensures that each word is only counted once per document, rather than having 10 occurrences of 2 in the list for 'Cheetos' if 'Cheetos' appears in document 2 10 times.
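Putting the pieces together, here is a minimal sketch of a document-level index combined with the search function from the question. It uses setdefault() as a shorter form of the if/else above, and index.get() so that an unknown query returns an empty list instead of raising KeyError; both are choices made for this sketch rather than part of the answer.

```python
def create_inverted_index(documents):
    """Map each word to the sorted list of document IDs that contain it."""
    index = {}
    for doc_id, document in enumerate(documents):
        # set() makes sure a word is recorded once per document.
        for word in set(document):
            index.setdefault(word, []).append(doc_id)
    return index


def indexed_search(documents, index, query):
    """Return every document that contains the query word."""
    return [documents[doc_id] for doc_id in index.get(query, [])]


documents = [['owl', 'lion'], ['lion', 'deer'], ['owl', 'leopard']]
index = create_inverted_index(documents)

print(index['lion'])                             # [0, 1]
print(indexed_search(documents, index, 'lion'))  # [['owl', 'lion'], ['lion', 'deer']]
print(indexed_search(documents, index, 'wolf'))  # []
```

Because the outer loop visits documents in increasing order, each list of IDs comes out already sorted.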
First I would extract all possible words and store them in one set.
Then I would look up each word in each list and collect the indexes of all the lists the word appears in:
source = [
['why', 'was', 'cinderella', 'late', 'for', 'the', 'ball', 'she', 'forgot', 'to', 'swing', 'the', 'bat'],
['why', 'is', 'the', 'little', 'duck', 'always', 'so', 'sad', 'because', 'he', 'always', 'sees', 'a', 'bill', 'in', 'front', 'of', 'his', 'face'],
['what', 'has', 'four', 'legs', 'and', 'goes', 'booo', 'a', 'cow', 'with', 'a', 'cold'],
['what', 'is', 'a', 'caterpillar', 'afraid', 'of', 'a', 'dogerpillar'],
['what', 'did', 'the', 'crop', 'say', 'to', 'the', 'farmer', 'why', 'are', 'you', 'always', 'picking', 'on', 'me']
]
allWords = set(word for lst in source for word in lst)
wordDict = { word: [
    i for i, lst in enumerate(source) if word in lst
] for word in allWords }
print(wordDict)
Out[30]:
{'a': [1, 2, 3],
'afraid': [3],
'always': [1, 4],
'and': [2],
...
This is straightforward as long as you don't need efficient code:
documents = [['owl', 'lion'], ['lion', 'deer'], ['owl', 'leopard']]
def index(docs):
    doc_index = {}
    for doc_id, doc in enumerate(docs, 1):
        for term_pos, term in enumerate(doc, 1):
            doc_index.setdefault(term, {}).setdefault(doc_id, []).append(term_pos)
    return doc_index
Now you get a two-level dictionary giving you access to the document ids, and then to the positions of the terms in this document:
>>> index(documents)
{'lion': {1: [2], 2: [1]}, 'leopard': {3: [2]}, 'deer': {2: [2]}, 'owl': {1: [1], 3: [1]}}
This is only a preliminary step for indexing; afterwards, you need to separate the term dictionary from the document postings and from the position postings. Typically, the dictionary is stored in a tree-like structure (there are Python packages for this), and the document postings and position postings are represented as arrays of unsigned integers.
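As a rough illustration of that separation, the sketch below splits the two-level dictionary into document postings and position postings stored in the standard library's array type with the unsigned-int typecode 'I'. The function name split_postings and the exact layout are assumptions made for this example; a real index would also replace the plain dict with a dedicated term dictionary.

```python
from array import array


def index(docs):
    """Two-level index: term -> {doc_id: [positions]}, both counted from 1."""
    doc_index = {}
    for doc_id, doc in enumerate(docs, 1):
        for term_pos, term in enumerate(doc, 1):
            doc_index.setdefault(term, {}).setdefault(doc_id, []).append(term_pos)
    return doc_index


def split_postings(doc_index):
    """Split the index into document postings and position postings."""
    doc_postings = {}   # term -> array of sorted document IDs
    pos_postings = {}   # term -> one array of positions per document ID
    for term, per_doc in doc_index.items():
        doc_ids = sorted(per_doc)
        doc_postings[term] = array('I', doc_ids)
        pos_postings[term] = [array('I', per_doc[d]) for d in doc_ids]
    return doc_postings, pos_postings


documents = [['owl', 'lion'], ['lion', 'deer'], ['owl', 'leopard']]
doc_postings, pos_postings = split_postings(index(documents))

print(doc_postings['lion'].tolist())                # [1, 2]
print([p.tolist() for p in pos_postings['lion']])   # [[2], [1]]
```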
I'd accumulate the indices into a set to avoid duplicates and then sort:
>>> documents = [['owl', 'lion'], ['lion', 'deer'], ['owl', 'leopard']]
>>> from collections import defaultdict
>>> D = defaultdict(set)
>>> for i, doc in enumerate(documents):
...     for word in doc:
...         D[word].add(i)
...
>>> D ## Take a look at the defaultdict
defaultdict(<class 'set'>, {'owl': {0, 2}, 'leopard': {2}, 'lion': {0, 1}, 'deer': {1}})
>>> {k:sorted(v) for k,v in D.items()}
{'lion': [0, 1], 'owl': [0, 2], 'leopard': [2], 'deer': [1]}