What is the best way to check whether a combination of words, joined by logical operators (or, and), exists in a list of strings?
Say you have a list of strings:
list_of_str = ['some phrase with word1','another phrase','other phrase with word2']
I have two cases, 1) and 2), where I would like to get the strings that contain or do not contain some words. However, I would prefer not to repeat each test as I do now: if 'word1' not in i and 'word2' not in i and 'word3' not in i
I would like to get 1)
list_1 = [i for i in list_of_str if 'word1' not in i and 'word2' not in i and 'word3' not in i]
output: ['another phrase']
and 2)
list_2 = [i for i in list_of_str if 'word1' in i or 'word2' in i or 'word3' in i]
output: ['some phrase with word1', 'other phrase with word2']
I did find that I can do this for 2), but I couldn't work out how to use all for case 1)
list_2 = [i for i in list_of_str if any(word in ['word1','word2','word3'] for word in i.split())]
output: ['some phrase with word1', 'other phrase with word2']
Also, is this the most efficient way of doing things?
You can use:
words = ['word1', 'word2', 'word23']
list_1 = [i for i in list_of_str if all(w not in i for w in words)]
list_2 = [i for i in list_of_str if any(w in i for w in words)]
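The two comprehensions above are complements of each other: by De Morgan's law, all(w not in i for w in words) is the same as not any(w in i for w in words). A minimal sketch of that idea, assuming the sample list from the question; the helper name contains_any is hypothetical and only used for illustration:

```python
list_of_str = ['some phrase with word1', 'another phrase', 'other phrase with word2']
words = ['word1', 'word2', 'word23']

def contains_any(text, words):
    """Hypothetical helper: True if any of the words occurs as a substring of text."""
    return any(w in text for w in words)

# Case 1 is simply the negation of case 2.
list_1 = [s for s in list_of_str if not contains_any(s, words)]
list_2 = [s for s in list_of_str if contains_any(s, words)]

print(list_1)  # ['another phrase']
print(list_2)  # ['some phrase with word1', 'other phrase with word2']
```

Both any() and all() stop at the first word that decides the result, so neither version scans more words than it needs to.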
I think this is a good use-case for regex alternation, if efficiency matters:
>>> import re
>>> words = ['word1', 'word2', 'word23']
>>> regex = re.compile('|'.join([re.escape(w) for w in words]))
>>> regex
re.compile('word1|word2|word23')
>>> list_of_str = ['some phrase with word1','another phrase','other phrase with word2']
>>> [phrase for phrase in list_of_str if not regex.search(phrase)]
['another phrase']
>>> [phrase for phrase in list_of_str if regex.search(phrase)]
['some phrase with word1', 'other phrase with word2']
>>>
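Note that the alternation above matches substrings, so 'word1' would also be found inside 'word10'. If whole-word matching is wanted, one possible variation (a sketch, not part of the answer above) wraps the alternation in \b word boundaries; the sample phrase is a made-up input:

```python
import re

words = ['word1', 'word2', 'word23']
alternation = '|'.join(re.escape(w) for w in words)

substring = re.compile(alternation)
# \b anchors the match at word boundaries; (?:...) groups without capturing.
whole_word = re.compile(r'\b(?:' + alternation + r')\b')

phrase = 'a phrase with word10'  # hypothetical input
print(bool(substring.search(phrase)))   # True: 'word1' is a prefix of 'word10'
print(bool(whole_word.search(phrase)))  # False: 'word1' is not a whole word here
```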
If you think about it in sets, you want sentences from that list where the set of search words and the set of words in the sentence are either disjoint or intersect.
E.g.:
set('some phrase with word1'.split()).isdisjoint({'word1', 'word2', 'word23'})
not set('some phrase with word1'.split()).isdisjoint({'word1', 'word2', 'word23'})
# or:
set('some phrase with word1'.split()) & {'word1', 'word2', 'word23'}
So:
search_terms = {'word1', 'word2', 'word23'}
list1 = [i for i in list_of_str if set(i.split()).isdisjoint(search_terms)]
list2 = [i for i in list_of_str if not set(i.split()).isdisjoint(search_terms)]
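One caveat worth checking before choosing the set approach: it compares whole whitespace-separated tokens, while the in operator on strings matches substrings, so the two can disagree. A small sketch, assuming a made-up phrase with trailing punctuation:

```python
search_terms = {'word1', 'word2', 'word23'}
phrase = 'a phrase ending in word1.'  # hypothetical input with trailing punctuation

tokens = set(phrase.split())
token_match = not tokens.isdisjoint(search_terms)
substring_match = any(w in phrase for w in search_terms)

print(token_match)      # False: the token is 'word1.', not 'word1'
print(substring_match)  # True: 'word1' occurs as a substring
```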
list comprehension to check for presence of any of the items.
I have some text and would like to check on some keywords. It should return me the sentence if it contains any of the keywords.
An example:
text = [t for t in string.split('. ')
        if 'drink' in t or 'eat' in t
        or 'sleep' in t]
This works. However, I am wondering whether there is a better way, as the list of keywords may grow.
I tried putting the keywords in a list, but it would not work in this list comprehension.
Or, using any:
pattern = ['drink', 'eat', 'sleep']
[t for t in string.split('. ') if any (l in pattern for l in t)]
You were almost there:
pattern = ['drink', 'eat', 'sleep']
[t for t in string.split('. ') if any(word in t for word in pattern)]
The key is to check, for each word in pattern, whether that word is inside the sentence:
any(word in t for word in pattern)
Your use of any is backwards. This is what you want:
[t for t in string.split('. ') if any(l in t for l in pattern)]
An alternative approach is using a regex:
import re
regex = re.compile('|'.join(pattern))
[t for t in string.split('. ') if regex.search(t)]
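A runnable sketch of the regex variant, assuming a made-up sample text. The pattern is compiled with re.compile so that .search is available, and each keyword is passed through re.escape in case it contains regex metacharacters:

```python
import re

pattern = ['drink', 'eat', 'sleep']
text = 'I drink tea. She reads books. They sleep late'  # hypothetical sample text

# Build one alternation pattern: drink|eat|sleep
regex = re.compile('|'.join(re.escape(word) for word in pattern))
matches = [t for t in text.split('. ') if regex.search(t)]
print(matches)  # ['I drink tea', 'They sleep late']
```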
I am trying to collect every word that is longer than 5 characters, but I don't know how to get the words themselves rather than their lengths.
s = 'This sentence is a string'
l = list(map(len, s.split()))
l.sort()
w=[]
for i in l:
    if (i >= 5):
        w.append(i)
        print(w)
output:
[6]
[6, 8]
I can get the size of each word in the sentence, but linking the length to the word itself has been hard as it is between string and integers.
You could simply use a list-comprehension to do that:
s = 'This sentence is a string'
words = [w for w in s.split() if len(w) > 5]
print(words) # ==> ['sentence', 'string']
Alternatively, a filter with a lambda could be used as well:
s = 'This sentence is a string'
words = list(filter(lambda w: len(w) > 5, s.split()))
print(words) # ==> ['sentence', 'string']
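If the lengths are still needed alongside the words, as the question suggests, one option is to keep them together as (word, length) pairs. A minimal sketch using the same sentence:

```python
s = 'This sentence is a string'

# Pair each long word with its length, then sort the pairs by length.
pairs = sorted(((w, len(w)) for w in s.split() if len(w) > 5),
               key=lambda pair: pair[1])
print(pairs)  # [('string', 6), ('sentence', 8)]
```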
So I'm trying to write a reddit bot to find articles with certain words in the title. Here's what I have so far:
top_posts = page.hot(limit=20)
for post in top_posts:
    title = post.title
    if title.lower() in ['word1', 'word2', 'word3']:
        print(title)
If I replace the last 2 lines with...
    if 'word1' in title.lower():
        print(title)
then it'll print the titles that have word1 in them but when I put it into a list it won't. I want to use a list to match different spellings of the same word. What am I doing wrong here?
You have the operands of in the wrong way round: you are testing whether the whole title is an element of the list.
Use any to check if any of words in the list is contained in the title:
if any(wd in title.lower() for wd in ['word1', 'word2', 'word3']):
    print(title)
To check if all of the words are contained in title, use all instead.
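A short sketch contrasting the two checks, assuming a made-up title rather than a real Reddit post:

```python
title = 'Word1 and WORD2 in one title'  # hypothetical title
wordlist = ['word1', 'word2', 'word3']

lowered = title.lower()  # lower-case once instead of once per word
has_any = any(wd in lowered for wd in wordlist)
has_all = all(wd in lowered for wd in wordlist)

print(has_any)  # True
print(has_all)  # False, because 'word3' is missing
```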
title.lower() in ['word1', 'word2', 'word3']
This checks exactly what it says: Whether title.lower(), the lowercase title, is in the list of words.
This will work in cases where title is a single word, for example:
>>> title = 'Word1'
>>> title.lower() in ['word1', 'word2', 'word3']
True
But of course, this will not work when title is an actual sentence that contains multiple words. title = 'Word1 foo bar' will never be an element of that single-word list.
So, you do have to check for every word from your word list whether it is contained in the title string:
>>> title = 'Word1 foo bar'
>>> 'word1' in title.lower()
True
>>> 'word2' in title.lower()
False
>>> 'word3' in title.lower()
False
You could do that in a loop and break out of it as soon as you hit a positive result:
>>> def titleContainsWords(title, words):
...     for word in words:
...         if word in title:
...             return True
...     return False
...
>>> wordlist = ['word1', 'word2', 'word3']
>>> titleContainsWords(title.lower(), wordlist)
True
This is such a common thing, that there is also a shorter way to do the same thing, combining the any() function with generator expressions:
>>> any(word in title.lower() for word in wordlist)
True
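If it is also useful to know which word matched, the same generator expression can be handed to next() with a default. This is a sketch of one possible extension, not something the question asked for:

```python
title = 'Word1 foo bar'
wordlist = ['word1', 'word2', 'word3']

# next() returns the first item the generator yields, or None if it yields nothing.
first_hit = next((word for word in wordlist if word in title.lower()), None)
print(first_hit)  # word1
```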
I see the following script snippet from the gensim tutorial page.
What's the syntax of word for word in below Python script?
>>> texts = [[word for word in document.lower().split() if word not in stoplist]
...          for document in documents]
This is a list comprehension. For each document in documents, the code you posted loops through every element of document.lower().split() and builds a new list containing only the elements that meet the if condition. The result is a list of those lists.
Try it out...
elems = [1, 2, 3, 4]
squares = [e*e for e in elems] # square each element
big = [e for e in elems if e > 2] # keep elements bigger than 2
As you can see from your example, list comprehensions can be nested.
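To see the nesting in action, here is a small runnable sketch of the snippet from the question. The documents and stoplist values are made up for illustration and are not taken from the gensim tutorial:

```python
documents = ['The quick brown fox', 'A fox and the hound']  # hypothetical documents
stoplist = {'the', 'a', 'and'}  # hypothetical stop words

# Outer comprehension: one inner list per document.
# Inner comprehension: lower-cased tokens that are not stop words.
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]
print(texts)  # [['quick', 'brown', 'fox'], ['fox', 'hound']]
```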
That is a list comprehension. An easier example might be:
evens = [num for num in range(100) if num % 2 == 0]
I'm quite sure I saw that line in some NLP applications.
This list comprehension:
[[word for word in document.lower().split() if word not in stoplist] for document in documents]
is the same as
ending_list = []  # often known as a document stream in NLP
for document in documents:  # loop through the list of documents
    internal_list = []  # often known as a list of tokens
    for word in document.lower().split():
        if word not in stoplist:
            internal_list.append(word)  # this is the inner [word for word in ...] part
    ending_list.append(internal_list)
Basically you want a list of documents, each of which is a list of tokens. So by looping through the documents,
for document in documents:
you then split each document into tokens
list_of_tokens = []
for word in document.lower().split():
and then make a list of these tokens:
list_of_tokens.append(word)
For example:
>>> doc = "This is a foo bar sentence ."
>>> [word for word in doc.lower().split()]
['this', 'is', 'a', 'foo', 'bar', 'sentence', '.']
It's the same as:
>>> doc = "This is a foo bar sentence ."
>>> list_of_tokens = []
>>> for word in doc.lower().split():
...     list_of_tokens.append(word)
...
>>> list_of_tokens
['this', 'is', 'a', 'foo', 'bar', 'sentence', '.']
Suppose my list is ["Apple","ball","caT","dog"];
then it should give me the result 'ball' and 'dog'.
How do I do that using re.findall()?
You don't need re.findall() here, at all.
Use:
[s for s in inputlist if s.islower()]
The str.islower() method returns True if all cased characters in the string are lowercase and there is at least one cased character.
Demo:
>>> inputlist = ["Apple","ball","caT","dog"]
>>> [s for s in inputlist if s.islower()]
['ball', 'dog']
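A few edge cases are worth keeping in mind, since str.islower() only looks at cased characters. A quick sketch with made-up inputs:

```python
samples = ['ball', 'abc1', '123', '', 'caT']  # hypothetical inputs

# Digits are ignored, but a string with no cased characters at all gives False.
results = [s.islower() for s in samples]
print(results)  # [True, True, False, False, False]
```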
Use re.findall() to find lowercased text in a larger string, not in a list:
>>> import re
>>> re.findall(r'\b[a-z]+\b', 'The quick Brown Fox jumped!')
['quick', 'jumped']
re.findall is not what you want here. It was designed to work with a single string, not a list of them.
Instead, you can use filter and str.islower:
>>> lst = ["Apple", "ball", "caT", "dog"]
>>> # list(filter(str.islower, lst)) if you are on Python 3.x.
>>> filter(str.islower, lst)
['ball', 'dog']
>>>
If you ask how to kill a mosquito with a cannon, I guess you can technically do that...
import re
li = ["Apple", "ball", "caT", "dog"]
c = re.compile(r'[a-z]')
[x for x in li if len(re.findall(c,x)) == len(x)]
Out[29]: ['ball', 'dog']
(don't use regex for this)