What does "word for word" syntax mean in Python? - python

I see the following script snippet from the gensim tutorial page.
What's the syntax of word for word in the Python script below?
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

This is a list comprehension. The code you posted loops through every element in document.lower().split() and creates a new list that contains only the elements that meet the if condition. It does this for each document in documents.
Try it out...
elems = [1, 2, 3, 4]
squares = [e*e for e in elems] # square each element
big = [e for e in elems if e > 2] # keep elements bigger than 2
As you can see from your example, list comprehensions can be nested.
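Nesting just means the expression inside one comprehension is itself a comprehension. A small sketch with made-up data:
rows = [[1, 2], [3, 4, 5]]
squared_rows = [[e*e for e in row] for row in rows]  # [[1, 4], [9, 16, 25]]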

That is a list comprehension. An easier example might be:
evens = [num for num in range(100) if num % 2 == 0]
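Written as an explicit loop, that comprehension would be:
evens = []
for num in range(100):
    if num % 2 == 0:
        evens.append(num)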

I'm quite sure I saw that line in some NLP applications.
This list comprehension:
[[word for word in document.lower().split() if word not in stoplist] for document in documents]
is the same as
ending_list = []  # often known as a document stream in NLP.
for document in documents:  # Loop through the list of documents.
    internal_list = []  # often known as a list of tokens
    for word in document.lower().split():
        if word not in stoplist:
            internal_list.append(word)  # this is where the [[word for word...] ...] appears
    ending_list.append(internal_list)
Basically you want a list of documents, where each document is a list of tokens. So by looping through the documents,
for document in documents:
you then split each document into tokens
list_of_tokens = []
for word in document.lower().split():
and then make a list of these tokens:
list_of_tokens.append(word)
For example:
>>> doc = "This is a foo bar sentence ."
>>> [word for word in doc.lower().split()]
['this', 'is', 'a', 'foo', 'bar', 'sentence', '.']
It's the same as:
>>> doc = "This is a foo bar sentence ."
>>> list_of_tokens = []
>>> for word in doc.lower().split():
...     list_of_tokens.append(word)
...
>>> list_of_tokens
['this', 'is', 'a', 'foo', 'bar', 'sentence', '.']
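Adding the if clause then drops any stoplisted tokens; a small sketch, assuming a made-up stoplist:
>>> stoplist = set('is a .'.split())
>>> [word for word in doc.lower().split() if word not in stoplist]
['this', 'foo', 'bar', 'sentence']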

Related

How to print out only elements of a list containing certain letters?

I am working on a project and I want to write code that finds words containing certain letters in a sentence and then returns (prints) them.
sentence = "I am asking a question on Stack Overflow"
lst = []
# this gives me a list of all words in a sentence
change = sentence.split()
# NOTE: I know this isn't correct syntax, but that's basically what I want to do.
lst.append(only words containing "a")
print(lst)
Now the part I am struggling with is: how do I append only words containing the letter "a", for example?
You can do it like this:
words = sentence.split()
lst = [word for word in words if 'a' in word]
print(lst)
# ['am', 'asking', 'a', 'Stack']
Try this! I hope it's well understood!
sentence = "I am asking a question on Stack Overflow"
lst = []
change = sentence.split()
#we are going to check in every word of the sentence, if letter 'a' is in it.
for a in change:
    if 'a' in a:
        print(a + " has an a! ")
        lst.append(a)
print(lst)
This will output:
['am', 'asking', 'a', 'Stack']

nltk how to give multiple separated sentences

I have a list of sentences (each sentence is a list of words) in English and I would like to extract ngrams.
For example:
sentences = [['this', 'is', 'sentence', 'one'], ['hello','again']]
In order to run
nltk.utils.ngram
I need to flatten the list to:
sentences = ['this','is','sentence','one','hello','again']
But then I get a false bigram ('one', 'hello').
What is the best way to deal with it?
Thanks!
Try this:
from itertools import chain
sentences = list(chain(*sentences))
chain returns a chain object whose .__next__() method returns elements from the first iterable until it is exhausted, then elements from the next iterable, until all of the iterables are exhausted.
or you can do:
sentences = [i for s in sentences for i in s]
You can also use a list comprehension with extend:
f = []
[f.extend(_l) for _l in sentences]
f = ['this', 'is', 'sentence', 'one', 'hello', 'again']
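Any of these flattens the sample input, for example:
>>> from itertools import chain
>>> sentences = [['this', 'is', 'sentence', 'one'], ['hello', 'again']]
>>> list(chain(*sentences))
['this', 'is', 'sentence', 'one', 'hello', 'again']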

Comparing lists with text files

I have the following list: t = ['one', 'two', 'three']
I want to read a file and add a point for every word that exists in the list. E.g. if "one" and "two" exists in "CV.txt", points = 2. If all of them exist, then points = 3.
import nltk
from nltk import word_tokenize
t = ['one', 'two', 'three']
CV = open("cv.txt","r").read().lower()
points = 0
for words in t:
    if words in CV:
        #print(words)
        words = nltk.word_tokenize(words)
        print(words)
        li = len(words)
        print(li)
        points = li
        print(points)
Assuming 'CV.txt' contains the words "one" and "two", and it is split into words (tokenized), 2 points should be added to the variable "points".
However, this code returns:
['one']
1
1
['two']
1
1
As you can see, the length is only 1, but it should be 2. I'm sure there's a more efficient way to do this with iterating loops or something rather than len.
Any help with this would be appreciated.
I don't think you need to tokenize within the loop, so an easier way to do it would be as follows:
First, tokenize the words in the txt file.
Then check each word that is common with t.
Finally, the points would be the number of words in common_words.
import nltk
from nltk import word_tokenize
t = ['one', 'two', 'three']
CV = open("untitled.txt","r").read().lower()
points = 0
words = nltk.word_tokenize(CV)
common_words = [word for word in words if word in t]
points = len(common_words)
Note: if you want to avoid duplicates, you need a set of common words instead, as follows:
common_words = set(word for word in words if word in t)
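A minimal sketch of the same idea, using an in-memory string in place of the file contents (the text here is made up for illustration, and nltk's punkt tokenizer data is assumed to be installed):
>>> import nltk
>>> t = ['one', 'two', 'three']
>>> CV = "One and two appear in this text".lower()
>>> words = nltk.word_tokenize(CV)
>>> common_words = [word for word in words if word in t]
>>> len(common_words)
2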

How to get my definite loop to print one per line

I'm trying to process a list of words and return a new list containing only the unique words. My definite loop works; however, it will only print the words all together, instead of one per line. Can anyone help me out? This is probably a simple question but I am very new to Python. Thank you!
def get_unique_words(allWords):
    uniqueWords = []
    for word in allWords:
        if word not in uniqueWords:
            uniqueWords.append(word)
        else:
            uniqueWords.remove(word)
    return uniqueWords
You can use str.join:
>>> all_words = ['two', 'two', 'one', 'uno']
>>> print('\n'.join(get_unique_words(all_words)))
one
uno
Or plain for loop:
>>> for word in get_unique_words(all_words):
...     print(word)
...
one
uno
However, your method won't work for odd counts:
>>> get_unique_words(['three', 'three', 'three'])
['three']
If your goal is to get all words that appear exactly once, here's a shorter method that works using collections.Counter:
from collections import Counter
def get_unique_words(all_words):
    return [word for word, count in Counter(all_words).items() if count == 1]
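For example, with the list used above:
>>> all_words = ['two', 'two', 'one', 'uno']
>>> get_unique_words(all_words)
['one', 'uno']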
This code may help; it prints unique words line by line, which is what I understood from your question:
allWords = ['hola', 'hello', 'distance', 'hello', 'hola', 'yes']
uniqueWords = [ ]
for word in allWords:
    if word not in uniqueWords:
        uniqueWords.append(word)
    else:
        uniqueWords.remove(word)
for i in uniqueWords:
    print(i)
If the order of the words is not important I recommend you to create a set to store the unique words:
uniqueWords = set(allWords)
As you can see running the code below, it can be much faster, but it may depend on the original list of words:
import timeit
setup="""
word_list = [str(x) for x in range(1000, 2000)]
allWords = []
for word in word_list:
    allWords.append(word)
    allWords.append(word)
"""
smt1 = "unique = set(allWords)"
smt2 = """
uniqueWords = [ ]
for word in allWords:
    if word not in uniqueWords:
        uniqueWords.append(word)
    else:
        uniqueWords.remove(word)
"""
print("SET:", timeit.timeit(smt1, setup, number=1000))
print("LOOP:", timeit.timeit(smt2, setup, number=1000))
OUTPUT:
SET: 0.03147706200002176
LOOP: 0.12346845000001849
maybe this fits your idea:
allWords=['hola', 'hello', 'distance', 'hello', 'hola', 'yes']
uniqueWords=dict()
for word in allWords:
    if word not in uniqueWords:
        uniqueWords.update({word: 1})
    else:
        uniqueWords[word] += 1
for k, v in uniqueWords.items():
    if v == 1:
        print(k)
Prints:
distance
yes

Python check if string contains all words in Python

I want to check if all words are found in another string without any loops or iterations:
a = ['god', 'this', 'a']
sentence = "this is a god damn sentence in python"
all(a in sentence)
should return True.
You could use a set depending on your exact needs as follows:
a = ['god', 'this', 'a']
sentence = "this is a god damn sentence in python"
print set(a) <= set(sentence.split())
This would print True, where <= is issubset.
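A Python 3 sketch of the same check (print as a function, with issubset also spelled out):
a = ['god', 'this', 'a']
sentence = "this is a god damn sentence in python"
print(set(a) <= set(sentence.split()))       # True
print(set(a).issubset(sentence.split()))     # True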
It should be:
all(x in sentence for x in a)
Or:
>>> chk = list(filter(lambda x: x not in sentence, a))  # Python 3; in Python 2 there is no need to convert to list
>>> chk
[]  # empty, because all words from a are in sentence
>>> if not chk:
...     print('All words are in sentence')
...
All words are in sentence
