list comprehension to check for presence of any of the items.
I have some text and would like to check it for some keywords. It should return the sentence if it contains any of the keywords.
An example:
text = [t for t in string.split('. ')
        if 'drink' in t or 'eat' in t
        or 'sleep' in t]
This works. However, I am thinking if there is a better way, as the list of keywords may grow.
I tried putting the keywords in a list but it would not work in this list comprehension.
Or using any:
pattern = ['drink', 'eat', 'sleep']
[t for t in string.split('. ') if any (l in pattern for l in t)]
You were almost there:
pattern = ['drink', 'eat', 'sleep']
[t for t in string.split('. ') if any(word in t for word in pattern)]
The key is to check, for each word in pattern, whether that word is inside the sentence:
any(word in t for word in pattern)
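For instance, a minimal sketch (the sample string here is made up):
pattern = ['drink', 'eat', 'sleep']
string = 'I like to drink coffee. The cat naps. We eat at noon.'  # hypothetical sample text
matches = [t for t in string.split('. ')
           if any(word in t for word in pattern)]
print(matches)  # ['I like to drink coffee', 'We eat at noon.']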
Your use of any is backwards. This is what you want:
[t for t in string.split('. ') if any(l in t for l in pattern)]
An alternative approach is using a regex:
import re
regex = re.compile('|'.join(pattern))  # compile the alternation so .search() can be called
[t for t in string.split('. ') if regex.search(t)]
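As a quick check (same hypothetical sample text as above):
import re
pattern = ['drink', 'eat', 'sleep']
string = 'I like to drink coffee. The cat naps. We eat at noon.'
regex = re.compile('|'.join(pattern))
print([t for t in string.split('. ') if regex.search(t)])
# ['I like to drink coffee', 'We eat at noon.']
Note that both the substring and the regex approaches also match inside longer words ('eat' matches 'theater', for example); add word boundaries (\b) to the regex if that matters.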
I would like to know how to build a very simple tokenizer. Given a dictionary (in this case a list l) and a sentence s, I would like to return all possible tokens (= words) of the sentence. Here is what I tried:
l = ["the","snow","ball","snowball","is","cold"]
sentence = "thesnowballisverycold"
def subs(string, ret=['']):
    if len(string) == 0:
        return ret
    head, tail = string[0], string[1:]
    ret = ret + list(map(lambda x: x + head, ret))
    return subs(tail, ret)

print(list(set(subs(sentence)) & set(l)))
But this returns:
["snow","ball","cold","is","snowball","the"]
I could compare substrings but there must be a better way to do this, right?
What I want:
["the","snowball","is","cold"]
You can utilize a regular expression here:
import re
l = ["the","snow","ball","snowball","is","cold"]
pattern = "|".join(sorted(l, key=len, reverse=True))
sentence = "thesnowballisverycold"
print(re.findall(pattern, sentence))
# => ['the', 'snowball', 'is', 'cold']
The pattern will look like snowball|snow|ball|cold|the|is. The trick is to make sure all alternatives are listed from longest to shortest (see Order of regular expression operator (..|.. ... ..|..)). The sorted(l, key=len, reverse=True) part sorts the items in l by length in descending order, and "|".join(...) creates the alternation pattern.
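To see why the ordering matters, compare the unsorted and sorted alternations (a small illustrative check):
import re

l = ["the", "snow", "ball", "snowball", "is", "cold"]
sentence = "thesnowballisverycold"

unsorted_pattern = "|".join(l)  # the|snow|ball|snowball|is|cold
print(re.findall(unsorted_pattern, sentence))
# ['the', 'snow', 'ball', 'is', 'cold'] -- 'snow' wins before 'snowball' is tried

sorted_pattern = "|".join(sorted(l, key=len, reverse=True))
print(re.findall(sorted_pattern, sentence))
# ['the', 'snowball', 'is', 'cold']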
I have the following code, which alternates for loops and if conditions.
lines = ['apple berry Citrus ', 34, 4.46, 'Audi Apple ']
corpus = []
for line in lines:
    # Check if the element is a string and proceed
    if isinstance(line, str):
        # Split the element and check if the first character is upper case
        for word in line.split():
            if word[0].isupper():
                # Append the word to the corpus result
                corpus.append(word)
print(corpus)
# Output: ['Citrus', 'Audi', 'Apple']
I am trying to do this in a list comprehension but failing. I have tried the following:
# corpus = [ word if word[0].isupper() for word in line.split() for line in lines if isinstance(line, str)]
How can I achieve this with a list comprehension?
The following will work:
corpus = [
    word for line in lines if isinstance(line, str)
    for word in line.split() if word[0].isupper()
]
The scope in nested comprehensions can be confusing at first, but you'll notice that the order of for and if is the same as in the nested loops.
You can do:
[word for line in lines if isinstance(line, str)
 for word in line.split() if word[0].isupper()]
An alternate approach is to prefilter the list lines with filter:
[word for line in filter(lambda e: isinstance(e, str), lines)
 for word in line.split() if word[0].isupper()]
Or, as pointed out in comments, you can eliminate the lambda with:
[word for line in filter(str.__instancecheck__, lines)
 for word in line.split() if word[0].isupper()]
Or even two filters:
[word for line in filter(str.__instancecheck__, lines)
 for word in filter(lambda w: w[0].isupper(), line.split())]
As a general rule, you can take the inner part of the nested loops, where you do corpus.append(word), and place it (the word) at the beginning of the list comprehension. Then add the nested for loops and conditions, without the ':'.
corpus = [word                      # from corpus.append(word)
          for line in lines        # ':' removed ...
          if isinstance(line, str)
          for word in line.split()
          if word[0].isupper()]

# on a single line:
corpus = [word for line in lines if isinstance(line, str) for word in line.split() if word[0].isupper()]
print(corpus)  # ['Citrus', 'Audi', 'Apple']
I want to find all the "phrases" in a list and remove them from the list, so that I have only words (without spaces) left. I'm making a hangman-type game and want the computer to choose a random word. I'm new to Python and coding, so I'm happy to hear other suggestions for my code as well.
import random

fhand = open('common_words.txt')
words = []
for line in fhand:
    line = line.strip()
    words.append(line)

for word in words:
    if ' ' in word:
        words.remove(word)

print(words)
Sets are more efficient than lists for membership tests and deduplication. Building the set while reading the file, as below, avoids a second pass and can give a significant performance boost.
# Load all words
words = set()  # note: set(), not {} -- {} creates an empty dict
with open('common_words.txt') as file:
    for line in file.readlines():
        line = line.strip()
        if " " not in line:
            words.add(line)

# Can be converted to a one-liner using the magic of Python
words = set(filter(lambda x: " " not in x, map(str.strip, open('common_words.txt').readlines())))

# Get a random word (random.choice needs a sequence, so convert the set first)
import random
print(random.choice(list(words)))
Use str.split(). It splits on any whitespace, including spaces and newlines, by default.
>>> 'some words\nsome more'.split()
['some', 'words', 'some', 'more']
>>> 'this is a sentence.'.split()
['this', 'is', 'a', 'sentence.']
>>> 'dfsonf 43 SDFd fe#2'.split()
['dfsonf', '43', 'SDFd', 'fe#2']
Read the file normally and make a list this way:
with open('filename.txt', 'r') as file:
    words = file.read().split()
That should be good.
with open('common_words.txt', 'r') as f:
    words = [word for word in filter(lambda x: len(x) > 0 and ' ' not in x,
                                     map(lambda x: x.strip(), f.readlines()))]
with is used because file objects are context managers. The strange list-like syntax is a list comprehension, so it builds a list from the statements inside of the brackets. map is a function which takes in an iterable, applying a provided function to each item in the iterable, placing each transformed result into a new list*. filter is a function which takes in an iterable, testing each item against the provided predicate, placing each item which evaluated to True into a new list*. lambda is used to define a function (with a specific signature) in-line.
*: The actual return types are generators, which function like iterators so they can still be used with for loops.
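As a small illustration of map and filter in isolation (the input values here are made up):
lines = ['  apple  ', 'two words ', 'banana\n']       # hypothetical input

stripped = map(str.strip, lines)                      # lazily strips each item
no_spaces = filter(lambda x: ' ' not in x, stripped)  # keeps single words only
print(list(no_spaces))  # ['apple', 'banana']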
I am not sure if I understand you correctly, but I guess the split() method is something for you, like:
with open('common_words.txt') as f:
    words_nested = [line.split() for line in f]
words = [word for sublist in words_nested for word in sublist]  # flatten the nested list
As mentioned, the .split() method could be a solution.
Also, the NLTK module might be useful for future language processing tasks.
Hope this helps!
I have a list of strings like so:
['happy_feet', 'happy_hats_for_cats', 'sad_fox_or_mad_banana','sad_pandas_and_happy_cats_for_people']
Given a keyword list like ['for', 'or', 'and'], I want to parse this into another list where, if any keyword occurs in a string, that string is split into multiple parts.
For example, the above set would be split into
['happy_feet', 'happy_hats', 'cats', 'sad_fox', 'mad_banana', 'sad_pandas', 'happy_cats', 'people']
Currently I've split each inner string by underscore and have a for loop looking for the index of a keyword, then recombining the strings by underscore. Is there a quicker way to do this?
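For reference, the index-based approach described above might look roughly like this (a reconstruction, not the asker's actual code):
keywords = ['for', 'or', 'and']
l = ['happy_feet', 'happy_hats_for_cats', 'sad_fox_or_mad_banana',
     'sad_pandas_and_happy_cats_for_people']

result = []
for s in l:
    current = []
    for part in s.split('_'):
        if part in keywords:        # keyword found: emit what we have so far
            if current:
                result.append('_'.join(current))
            current = []
        else:
            current.append(part)
    if current:                     # emit the trailing piece
        result.append('_'.join(current))

print(result)
# ['happy_feet', 'happy_hats', 'cats', 'sad_fox', 'mad_banana', 'sad_pandas', 'happy_cats', 'people']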
>>> import re
>>> [re.split(r"_(?:f?or|and)_", s) for s in l]
[['happy_feet'],
['happy_hats', 'cats'],
['sad_fox', 'mad_banana'],
['sad_pandas', 'happy_cats', 'people']]
To combine them into a single list, you can use
result = []
for s in l:
    result.extend(re.split(r"_(?:f?or|and)_", s))
>>> import re, itertools
>>> pat = re.compile("_(?:%s)_" % "|".join(sorted(split_list, key=len, reverse=True)))
>>> list(itertools.chain.from_iterable(pat.split(line) for line in data))
will give you the desired output for the example dataset provided.
Actually, with the _ delimiters you don't really need to sort by length, so you could just do:
>>> pat = re.compile("_(?:%s)_" % "|".join(split_list))
>>> list(itertools.chain.from_iterable(pat.split(line) for line in data))
You could use a regular expression:
from itertools import chain
import re
pattern = re.compile(r'_(?:{})_'.format('|'.join([re.escape(w) for w in keywords])))
result = list(chain.from_iterable(pattern.split(w) for w in input_list))
The pattern is dynamically created from your list of keywords. The string 'happy_hats_for_cats' is split on '_for_':
>>> re.split(r'_for_', 'happy_hats_for_cats')
['happy_hats', 'cats']
but because we actually produced a set of alternatives (using the | metacharacter) you get to split on any of the keywords:
>>> re.split(r'_(?:for|or|and)_', 'sad_pandas_and_happy_cats_for_people')
['sad_pandas', 'happy_cats', 'people']
Each split result gives you a list of strings (just one if there was nothing to split on); using itertools.chain.from_iterable() lets us treat all those lists as one long iterable.
Demo:
>>> from itertools import chain
>>> import re
>>> keywords = ['for', 'or', 'and']
>>> input_list = ['happy_feet', 'happy_hats_for_cats', 'sad_fox_or_mad_banana','sad_pandas_and_happy_cats_for_people']
>>> pattern = re.compile(r'_(?:{})_'.format('|'.join([re.escape(w) for w in keywords])))
>>> list(chain.from_iterable(pattern.split(w) for w in input_list))
['happy_feet', 'happy_hats', 'cats', 'sad_fox', 'mad_banana', 'sad_pandas', 'happy_cats', 'people']
Another way of doing this, using only built-in methods, is to replace every occurrence of the keywords from ['for', 'or', 'and'] in each string with a placeholder string, say '_1_' (it could be any string), and then, at the end of each iteration, split on that placeholder:
l = ['happy_feet', 'happy_hats_for_cats', 'sad_fox_or_mad_banana',
     'sad_pandas_and_happy_cats_for_people']
replacement_s = '_1_'
lookup = ['for', 'or', 'and']
lookup = [x.join('_' * 2) for x in lookup]  # Changing to: ['_for_', '_or_', '_and_']

results = []
for i, item in enumerate(l):
    for s in lookup:
        if s in item:
            l[i] = l[i].replace(s, replacement_s)
    results.extend(l[i].split(replacement_s))
OUTPUT:
['happy_feet', 'happy_hats', 'cats', 'sad_fox', 'mad_banana', 'sad_pandas', 'happy_cats', 'people']
At a high level, what I'm trying to accomplish is:
given a list of words, return all the words that do not consist solely of digits
My first thought of how to do that is:
import string

def f(words):
    result = []
    for word in words:
        for each_char in word:
            if each_char not in string.digits:
                result.append(word)
                break
    return result
This works fine. To be more Pythonic, I figured - list comprehensions, right? So:
return [word for word in words for char in word if not char in string.digits]
Unfortunately, this adds a copy of word to the result for each character that isn't a digit. So for f(['foo']), I end up with ['foo', 'foo', 'foo'].
Is there a clever way to do what I'm trying to do? My current solution is to just write an is_all_digits function, and say [word for word in words if not is_all_digits(word)]. My general understanding is that list comprehensions allow this sort of operation to be declarative, and the helper function is plenty declarative for me; just curious as to whether there's some clever way to make it one compound statement.
Thanks!
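For reference, the helper-function version mentioned above might look like this (a sketch):
import string

def is_all_digits(word):
    # True when every character of the word is a digit
    return all(char in string.digits for char in word)

words = ['foo', '123', 'foo123']  # hypothetical input
print([word for word in words if not is_all_digits(word)])
# ['foo', 'foo123']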
Why not just check the whole string for isdigit():
>>> words = ['foo', '123', 'foo123']
>>> [word for word in words if not word.isdigit()]
['foo', 'foo123']
Or, turn the logic other way and use any():
>>> [word for word in words if any(not char.isdigit() for char in word)]
['foo', 'foo123']
any() would stop on the first non-digit character for a word and return True.
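A quick way to see that short-circuiting (noisy_isdigit is just an illustrative helper):
def noisy_isdigit(ch):
    print('checking', ch)  # shows how far evaluation gets
    return ch.isdigit()

print(any(not noisy_isdigit(ch) for ch in '1f23'))
# checking 1
# checking f
# True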
filter(lambda w: not w.isdigit(), iterable)
Example:
>>> list(filter(lambda w: not w.isdigit(), ["hello", "good1", "1234"]))
['hello', 'good1']