Determining the position of sub-string in list of strings - python

I have a list of words (strings), say:
word_lst = ['This','is','a','great','programming','language']
And a second list with sub-strings, say:
subs_lst= ['This is', 'language', 'a great']
And let's suppose each sub-string in subs_lst appears only one time in word_lst. (sub-strings can be of any length)
I want an easy way to find the hierarchical position of the sub-strings in the word_lst.
So what I want is to order subs_lst according to their appearance in word_lst.
In the previous example, the output would be:
out = ['This is', 'a great', 'language']
Does anyone know an easy way to do this?

There's probably a faster way to do this, but this works, at least:
word_lst = ['This','is','a','great','programming','language']
subs_lst= ['This is', 'language', 'a great']
substr_lst = [' '.join(word_lst[i:j]) for i in range(len(word_lst)) for j in range(i+1, len(word_lst)+1)]
sorted_subs_list = sorted(subs_lst, key=lambda x:substr_lst.index(x))
print(sorted_subs_list)
Output:
['This is', 'a great', 'language']
The idea is to build a list of every contiguous sub-phrase in word_lst, ordered so that all the entries that start with "This" come first, followed by all the entries starting with "is", etc. We store that in substr_lst.
>>> print(substr_lst)
['This', 'This is', 'This is a', 'This is a great', 'This is a great programming', 'This is a great programming language', 'is', 'is a', 'is a great', 'is a great programming', 'is a great programming language', 'a', 'a great', 'a great programming', 'a great programming language', 'great', 'great programming', 'great programming language', 'programming', 'programming language', 'language']
Once we have that list, we sort subs_lst, using the index of each entry in substr_lst as the key to sort by:
>>> substr_lst.index("This is")
1
>>> substr_lst.index("language")
20
>>> substr_lst.index("a great")
12

The intermediate step seems unneeded to me. Why not just make the word list a single string and find the substrings in that?
sorted(subs_lst, key = lambda x : ' '.join(word_lst).index(x))
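A quick check for the example above, with one caveat worth knowing: str.index matches at any character position, so a fragment like 'is' would also match inside 'This'. For whole-word sub-phrases like the ones in this question it works fine:
word_lst = ['This', 'is', 'a', 'great', 'programming', 'language']
subs_lst = ['This is', 'language', 'a great']
joined = ' '.join(word_lst)  # 'This is a great programming language'
print(sorted(subs_lst, key=lambda x: joined.index(x)))
# ['This is', 'a great', 'language']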


Merge some list elements in a Python List by input Number

I am having a hard time merging the elements within a Python list according to a given number.
I already have a solution that works for one fixed number, but I want it to work for any given number N.
i.e. When I have a list
['There', 'was', 'a', 'farmer', 'who', 'had', 'a', 'dog', 'and', 'cat', '.']
Result,
When N = 2
['There was', 'a farmer', 'who had', 'a dog', 'and cat', '.']
or N = 3
['There was a', 'farmer who had', 'a dog and', 'cat .']
I would much prefer a solution that modifies the existing list directly and does not use any module or library.
Any help is greatly appreciated!!
Here's the sensible way to do it. It does create a new list, since that is more efficient than trying to modify the original:
a = ['There', 'was', 'a', 'farmer', 'who', 'had', 'a', 'dog', 'and', 'cat', '.']
n = 2
print([' '.join(a[i:i+n]) for i in range(0, len(a), n)])
n = 3
print([' '.join(a[i:i+n]) for i in range(0, len(a), n)])
Output:
['There was', 'a farmer', 'who had', 'a dog', 'and cat', '.']
['There was a', 'farmer who had', 'a dog and', 'cat .']
This is the simplest way I can think of; slice assignment merges the first N elements in place:
my_list = ['a','b','c','d','e','f','g','h','i']
N = 3
my_list[0:N] = [''.join(my_list[0:N])]
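Since slice assignment works in place, the same idea can be repeated along the list. A minimal sketch (not from the original answer) that merges the whole list with spaces, without building a new one:
word_list = ['There', 'was', 'a', 'farmer', 'who', 'had', 'a', 'dog', 'and', 'cat', '.']
N = 2
i = 0
while i < len(word_list):
    # replace the next N elements with their space-joined form, in place
    word_list[i:i+N] = [' '.join(word_list[i:i+N])]
    i += 1
print(word_list)  # ['There was', 'a farmer', 'who had', 'a dog', 'and cat', '.']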
word_list = ['There', 'was', 'a', 'farmer', 'who', 'had', 'a', 'dog', 'and', 'cat', '.']
new_word_list = []
N = int(input())
for i in range(0, len(word_list), N):
    string = ''
    # take up to N words, fewer if we are near the end of the list
    for j in range(min(N, len(word_list) - i)):
        string = string + word_list[i + j] + ' '
    new_word_list.append(string.rstrip())  # drop the trailing space
print(new_word_list)
Here I implemented it using basic for loops to iterate over the list, although a new list had to be created.

get unmatched words after CountVectorizer transform

I am using CountVectorizer to apply string matching on a large dataset of texts. What I want is to get the words that do not match any term in the resulting matrix. For example, if the resulting terms (features) after fitting are:
{'hello world', 'world and', 'and stackoverflow', 'hello', 'world', 'stackoverflow', 'and'}
and I transformed this text:
"oh hello world and stackoverflow this is a great morning"
I want to get the string oh this is a great morning, since it matches nothing in the features. Is there an efficient method to do this?
I tried using the inverse_transform method to get the features and remove them from the text, but I ran into many problems and long running times.
Transforming a piece of text on the basis of a fitted vocabulary is going to return you a matrix with counts of the known vocabulary.
For example, if your input document is as in your example:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(ngram_range=(1, 2))
docs = ['hello world and stackoverflow']
vec.fit(docs)
Then the fitted vocabulary might look as follows:
In [522]: print(vec.vocabulary_)
{'hello': 2,
 'world': 5,
 'and': 0,
 'stackoverflow': 4,
 'hello world': 3,
 'world and': 6,
 'and stackoverflow': 1}
Which represents a token to index mapping. Transforming some new documents subsequently returns a matrix with counts of all known vocabulary tokens. Words that are not in the vocabulary are ignored!
other_docs = ['hello stackoverflow',
              'hello and hello',
              'oh hello world and stackoverflow this is a great morning']
X = vec.transform(other_docs)
In [523]: print(X.A)
[[0 0 1 0 1 0 0]
[1 0 2 0 0 0 0]
[1 1 1 1 1 1 1]]
Your vocabulary consists of 7 items, hence the matrix X contains 7 columns. And we've transformed 3 documents, so it's a 3x7 matrix. The elements of the matrix are counts of how often each vocabulary item occurred in the document. For example, for the second document "hello and hello", we have a count of 2 in column 2 (0-indexed) and a count of 1 in column 0, which refer to "hello" and "and", respectively.
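If you want to check which column belongs to which token, you can sort the vocabulary by its indices; a small sketch (the terms name is ours, not part of the original answer):
# label the columns of X by sorting the vocabulary items by their index
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print({t: int(c) for t, c in zip(terms, X.A[1])})
# {'and': 1, 'and stackoverflow': 0, 'hello': 2, 'hello world': 0, 'stackoverflow': 0, 'world': 0, 'world and': 0}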
inverse_transform maps in the other direction: given a document-term matrix, it returns, for each row, the vocabulary items whose counts are non-zero. Passing a single row in which every entry happens to be non-zero therefore returns the whole vocabulary, ordered by index:
In [534]: print(vec.inverse_transform([[1, 2, 3, 4, 5, 6, 7]]))
[array(['and', 'and stackoverflow', 'hello', 'hello world',
       'stackoverflow', 'world', 'world and'], dtype='<U17')]
Note: the input values here are interpreted as counts, not as vocabulary indices; any non-zero values would yield the same result.
Now let's get to your actual question, which is identifying all out-of-vocabulary (OOV) items in a given input document. It's fairly straightforward using sets if you're only interested in unigrams:
tokens = 'oh hello world and stackoverflow this is a great morning'.split()
In [542]: print(set(tokens) - set(vec.vocabulary_.keys()))
{'morning', 'a', 'is', 'this', 'oh', 'great'}
Things are slightly more involved if you're also interested in bigrams (or any other n-gram with n > 1), as you first need to generate all bigrams from your input document (note that there are various ways to generate all n-grams from an input document, of which the following is just one):
bigrams = list(map(lambda x: ' '.join(x), zip(tokens, tokens[1:])))
In [546]: print(bigrams)
['oh hello', 'hello world', 'world and', 'and stackoverflow', 'stackoverflow this', 'this is', 'is a', 'a great', 'great morning']
This line looks fancy, but all it does is zip two lists together (the second one starting at the second item), which yields tuples such as ('oh', 'hello'); the map then joins each tuple with a single space, turning ('oh', 'hello') into 'oh hello', and the map iterator is finally converted into a list.
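To see that intermediate step in isolation (demo_tokens is just a throwaway name for this illustration):
demo_tokens = ['oh', 'hello', 'world']
print(list(zip(demo_tokens, demo_tokens[1:])))
# [('oh', 'hello'), ('hello', 'world')]
print(list(map(' '.join, zip(demo_tokens, demo_tokens[1:]))))
# ['oh hello', 'hello world']
Now you can build the union of unigrams and bigrams: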
doc_vocab = set(tokens) | set(bigrams)
In [549]: print(doc_vocab)
{'and stackoverflow', 'hello', 'a', 'morning', 'hello world', 'great morning', 'world', 'stackoverflow', 'stackoverflow this', 'is', 'world and', 'oh hello', 'oh', 'this', 'is a', 'this is', 'and', 'a great', 'great'}
Now you can do the same as with unigrams above to retrieve all OOV items:
In [550]: print(doc_vocab - set(vec.vocabulary_.keys()))
{'morning', 'a', 'great morning', 'stackoverflow this', 'is a', 'is', 'oh hello', 'this', 'this is', 'oh', 'a great', 'great'}
This now represents all unigrams and bigrams that are not in your vectorizer's vocabulary.
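Putting it together: a small helper, as a sketch rather than part of the original answer, that returns the unmatched words of a document as a single string (what the question actually asked for). It assumes the fitted vocabulary contains unigrams, as with ngram_range=(1, 2) above, and that the input is lowercased the way CountVectorizer lowercases by default:
def oov_words(text, vectorizer):
    # keep only the words that match no vocabulary entry of their own
    vocab = set(vectorizer.vocabulary_)
    return ' '.join(w for w in text.split() if w not in vocab)

print(oov_words('oh hello world and stackoverflow this is a great morning', vec))
# oh this is a great morning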

How do I set a value to be the same as in the last item?

I have a text that I'm splitting into a list of sentences, and I want to find the subject of each sentence. For example, if the text is 'Dogs are great. They are so awesome', it has to be split into the two sentences 'Dogs are great.' and 'They are so awesome'. Then I use a for loop to find the subject of each sentence, whether it is 'cats' or 'dogs'.
sentence_list = ['Dogs are great', 'They are so awesome']
for sentence in sentence_list:
    if 'Dog' in sentence:
        subject = 'Dog'
    elif 'Cat' in sentence:
        subject = 'Cat'
Because 'They' is used as a replacement for one of these, I want to set the subject for that sentence to the same as the last sentence. So in this example, the subject would be 'Dog' for both sentences.
You already have the last value. If neither the if clause nor the elif clause is true, then subject hasn't been set this iteration, which means it still holds the value it held last iteration.
sentence_list = ['Dogs are great', 'They are so awesome']
for sentence in sentence_list:
    if 'Dog' in sentence:
        subject = 'Dog'
    elif 'Cat' in sentence:
        subject = 'Cat'
    print(subject)
Will result in:
Dog
Dog
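One caveat the answer glosses over: if the very first sentence matches neither branch, subject has never been assigned and print(subject) raises a NameError. Initializing it before the loop avoids that; a minimal sketch:
sentence_list = ['They are so awesome', 'Dogs are great']
subject = None  # default so the first iteration cannot raise a NameError
for sentence in sentence_list:
    if 'Dog' in sentence:
        subject = 'Dog'
    elif 'Cat' in sentence:
        subject = 'Cat'
    print(subject)
# None
# Dog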
This solution has a little more flexibility to handle pluralization, pronoun selection as well as finding pronouns that may not be the first word of the sentence.
This likely extends past your scope since then you will get into verb tense issues, but I thought it might be helpful for others.
sentence_list = ['Dogs are great', 'They are so awesome', 'Cats are nice', 'They can be dangerous',
                 'On the lonely roads, they can be found.', 'He is fluffy.']
pronouns = ['they', 'them', 'she', 'her', 'he', 'him', 'us', 'we', 'it']
plurals = ['they', 'them', 'us', 'we']
last_subject = 'Dog'
for i, sentence in enumerate(sentence_list):
    # update the last seen subject
    if 'Dog' in sentence:
        last_subject = 'Dog'
    elif 'Cat' in sentence:
        last_subject = 'Cat'
    if 'dog' not in sentence.lower() and 'cat' not in sentence.lower():
        # find a pronoun and substitute the last subject for it
        for pn in pronouns:
            if pn in sentence.lower():
                # if it is a plural usage, add an 's'
                if pn in plurals:
                    sentence_list[i] = sentence.lower().replace(pn, last_subject + 's')
                else:
                    sentence_list[i] = sentence.lower().replace(pn, last_subject)
                break
print(sentence_list)
Output:
['Dogs are great',
'Dogs are so awesome',
'Cats are nice',
'Cats can be dangerous',
'on the lonely roads, Cats can be found.',
'Cat is fluffy.']
I suggest using the startswith() string method to check whether a sentence starts with 'They' or 'It', for example, and then performing the substitution with the subject of the previous sentence. It is very simple and will probably fail on complex sentences, but it does the job for your question:
sentence_list = ['Dogs are great', 'They are so awesome', 'Cats are nice', 'They can be dangerous']
for i, sentence in enumerate(sentence_list):
    if sentence.startswith('Dogs'):
        subject = 'Dogs'
    elif sentence.startswith('Cats'):
        subject = 'Cats'
    if sentence.startswith('They'):
        sentence_list[i] = sentence.replace('They', subject)
print(sentence_list)
# ['Dogs are great', 'Dogs are so awesome', 'Cats are nice', 'Cats can be dangerous']

Generate bigrams with NLTK

I am trying to produce a bigram list for a given sentence. For example, if I type
To be or not to be
I want the program to generate
to be, be or, or not, not to, to be
I tried the following code, but it just gives me
<generator object bigrams at 0x0000000009231360>
This is my code:
import nltk
bigrm = nltk.bigrams(text)
print(bigrm)
So how do I get what I want? I want a list of combinations of the words like above (to be, be or, or not, not to, to be).
nltk.bigrams() returns an iterator (specifically a generator) of bigrams. If you want a list, pass the iterator to list(). It also expects a sequence of items to generate bigrams from, so you have to split the text before passing it (if you have not already done so):
bigrm = list(nltk.bigrams(text.split()))
To print them out separated by commas, you could (in Python 3):
print(*map(' '.join, bigrm), sep=', ')
If on Python 2, then for example:
print ', '.join(' '.join((a, b)) for a, b in bigrm)
Note that just for printing you do not need to generate a list, just use the iterator.
The following code produces a bigram list for a given sentence:
>>> import nltk
>>> from nltk.tokenize import word_tokenize
>>> text = "to be or not to be"
>>> tokens = nltk.word_tokenize(text)
>>> bigrm = nltk.bigrams(tokens)
>>> print(*map(' '.join, bigrm), sep=', ')
to be, be or, or not, not to, to be
Quite late, but this is another way.
>>> from nltk.util import ngrams
>>> text = "I am batman and I like coffee"
>>> _1gram = text.split(" ")
>>> _2gram = [' '.join(e) for e in ngrams(_1gram, 2)]
>>> _3gram = [' '.join(e) for e in ngrams(_1gram, 3)]
>>>
>>> _1gram
['I', 'am', 'batman', 'and', 'I', 'like', 'coffee']
>>> _2gram
['I am', 'am batman', 'batman and', 'and I', 'I like', 'like coffee']
>>> _3gram
['I am batman', 'am batman and', 'batman and I', 'and I like', 'I like coffee']
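If you would rather not depend on NLTK at all, the same lists can be built with plain slicing. A minimal sketch (ngrams_plain is our name for it, not an NLTK function):
def ngrams_plain(tokens, n):
    # pure-Python equivalent of the joined nltk.util.ngrams output above
    return [' '.join(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

tokens = "I am batman and I like coffee".split()
print(ngrams_plain(tokens, 2))
# ['I am', 'am batman', 'batman and', 'and I', 'I like', 'like coffee']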

How to find unique starts of strings?

If I have a list of strings (e.g. 'blah 1', 'blah 2', 'xyz fg', 'xyz penguin'), what would be the best way of finding the unique starts of the strings ('xyz' and 'blah' in this case)? The starts of strings can be multiple words.
Your question is confusing, as it is not clear what you really want. So I'll give three answers and hope that one of them at least partially answers your question.
To get all unique prefixes of a given list of string, you can do:
>>> l = ['blah 1', 'blah 2', 'xyz fg', 'xyz penguin']
>>> set(s[:i] for s in l for i in range(len(s) + 1))
{'', 'xyz pe', 'xyz penguin', 'b', 'xyz fg', 'xyz peng', 'xyz pengui', 'bl', 'blah 2', 'blah 1', 'blah', 'xyz f', 'xy', 'xyz pengu', 'xyz p', 'x', 'blah ', 'xyz pen', 'bla', 'xyz', 'xyz '}
This code generates all initial slices of every string in the list and passes these to a set to remove duplicates.
To get the largest initial word sequence of each string, short of the full string, you could go with:
>>> l = ['a b', 'a c', 'a b c', 'b c']
>>> set(s.rsplit(' ', 1)[0] for s in l)
{'a', 'a b', 'b'}
This code creates a set by splitting each string at its rightmost space, if there is one (otherwise the whole string is kept).
On the other hand, to get all unique initial word sequences without considering full strings, you could go for:
>>> l = ['a b', 'a c', 'a b c', 'b c']
>>> set(' '.join(w[:i]) for s in l for w in (s.split(),) for i in range(len(w)))
{'', 'a', 'b', 'a b'}
This code splits each string on whitespace and concatenates all initial slices of the resulting word list, except the largest one. One pitfall: it will e.g. convert tabs to spaces. This may or may not be an issue in your case.
If you mean the unique first words of the strings (words being separated by spaces), this would be:
arr = ['blah 1', 'blah 2', 'xyz fg', 'xyz penguin']
unique = list(set(x.split(' ')[0] for x in arr))
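A quick demo of the result (sorted here, since set order is arbitrary):
>>> arr = ['blah 1', 'blah 2', 'xyz fg', 'xyz penguin']
>>> sorted(set(x.split(' ')[0] for x in arr))
['blah', 'xyz']
Note that this only looks at the first word; the approaches above are the ones that handle multi-word starts.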
