I have the following list: t = ['one', 'two', 'three']
I want to read a file and add a point for every word in the list that appears in the file. E.g. if "one" and "two" exist in "CV.txt", points = 2. If all of them exist, then points = 3.
import nltk
from nltk import word_tokenize
t = ['one', 'two', 'three']
CV = open("cv.txt","r").read().lower()
points = 0
for words in t:
    if words in CV:
        #print(words)
        words = nltk.word_tokenize(words)
        print(words)
        li = len(words)
        print(li)
        points = li
        print(points)
Assuming 'CV.txt' contains the words "one" and "two", and it is split by words (tokenized), 2 points should be added to the variable "points"
However, this code returns:
['one']
1
1
['two']
1
1
As you can see, the length is only 1, but it should be 2. I'm sure there's a more efficient way to do this, by iterating with a loop or something, rather than using len.
Any help with this would be appreciated.
I don't think you need to tokenize inside the loop. An easier way would be the following:
First, tokenize the text of the file.
Then collect each token that also appears in t.
Finally, the number of points is the number of words in common_words.
import nltk
from nltk import word_tokenize
t = ['one', 'two', 'three']
CV = open("untitled.txt","r").read().lower()
points = 0
words = nltk.word_tokenize(CV)
common_words = [word for word in words if word in t]
points = len(common_words)
Note: the list above counts a word every time it occurs in the file. If you want each word to count only once, build a set of the common words instead:
common_words = set(word for word in words if word in t)
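Since a point should be awarded once per listed word that occurs at all, the same idea can also be sketched as a set intersection. This is a minimal sketch, not the answer's code: it swaps nltk.word_tokenize for a plain split() plus punctuation stripping, and the helper name count_points and the sample text are made up for illustration.

```python
from string import punctuation

def count_points(text, wanted):
    """Return how many distinct words from `wanted` occur in `text`."""
    # Assumption: whitespace splitting plus stripping surrounding
    # punctuation is a good enough stand-in for nltk.word_tokenize here.
    tokens = {token.strip(punctuation) for token in text.lower().split()}
    return len(tokens & set(wanted))

t = ['one', 'two', 'three']
sample = "One of the two options. One more line."  # stand-in for the file contents
print(count_points(sample, t))  # 2
```

Repeated words in the text do not add extra points here, because both sides of the intersection are sets.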
I'm trying to process a list of words and return a new list containing only the unique words. My definite loop works, however it will only print the words all together, instead of one per line. Can anyone help me out? This is probably a simple question but I am very new to Python. Thank you!
def get_unique_words(allWords):
    uniqueWords = []
    for word in allWords:
        if word not in uniqueWords:
            uniqueWords.append(word)
        else:
            uniqueWords.remove(word)
    return uniqueWords
You can use str.join:
>>> all_words = ['two', 'two', 'one', 'uno']
>>> print('\n'.join(get_unique_words(all_words)))
one
uno
Or plain for loop:
>>> for word in get_unique_words(all_words):
...     print(word)
...
one
uno
However, your method won't work for odd counts:
>>> get_unique_words(['three', 'three', 'three'])
['three']
If your goal is to get all words that appear exactly once, here's a shorter method that works using collections.Counter:
from collections import Counter
def get_unique_words(all_words):
    return [word for word, count in Counter(all_words).items() if count == 1]
This code may help. As I understood your question, you want to print the unique words one per line:
allWords = ['hola', 'hello', 'distance', 'hello', 'hola', 'yes']
uniqueWords = []
for word in allWords:
    if word not in uniqueWords:
        uniqueWords.append(word)
    else:
        uniqueWords.remove(word)
for i in uniqueWords:
    print(i)
If the order of the words is not important, I recommend creating a set to store the unique words:
uniqueWords = set(allWords)
As the timing code below shows, it can be much faster, although the difference may depend on the original list of words:
import timeit
setup="""
word_list = [str(x) for x in range(1000, 2000)]
allWords = []
for word in word_list:
    allWords.append(word)
    allWords.append(word)
"""
smt1 = "unique = set(allWords)"
smt2 = """
uniqueWords = [ ]
for word in allWords:
    if word not in uniqueWords:
        uniqueWords.append(word)
    else:
        uniqueWords.remove(word)
"""
print("SET:", timeit.timeit(smt1, setup, number=1000))
print("LOOP:", timeit.timeit(smt2, setup, number=1000))
OUTPUT:
SET: 0.03147706200002176
LOOP: 0.12346845000001849
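One caveat when reading these timings: the two statements do not compute the same thing. set(allWords) keeps one copy of every distinct word, while the loop removes a word each time it is seen again, so a word that occurs an even number of times disappears from the result. A small sketch of the difference, plus an order-preserving alternative using dict.fromkeys (the helper name toggle_unique is made up for illustration, and relying on dict insertion order assumes Python 3.7 or later):

```python
def toggle_unique(all_words):
    """The loop from the question: append on first sight, remove on the next."""
    unique_words = []
    for word in all_words:
        if word not in unique_words:
            unique_words.append(word)
        else:
            unique_words.remove(word)
    return unique_words

all_words = ['hola', 'hello', 'distance', 'hello', 'hola', 'yes']

print(toggle_unique(all_words))        # ['distance', 'yes']
print(sorted(set(all_words)))          # ['distance', 'hello', 'hola', 'yes']
print(list(dict.fromkeys(all_words)))  # ['hola', 'hello', 'distance', 'yes']
```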
Maybe this fits your idea:
allWords=['hola', 'hello', 'distance', 'hello', 'hola', 'yes']
uniqueWords=dict()
for word in allWords:
    if word not in uniqueWords:
        uniqueWords.update({word:1})
    else:
        uniqueWords[word]+=1
for k, v in uniqueWords.items():
    if v==1:
        print(k)
Prints:
distance
yes
What is the best way to count the number of matches between a list and a string in Python?
for example if I have this list:
list = ['one', 'two', 'three']
and this string:
line = "some one long. two phrase three and one again"
I want to get 4 because I have
one 2 times
two 1 time
three 1 time
I tried the code below, based on the answers to this question, and it worked, but I get an error if I add many words (4000 words) to the list:
import re
word_list = ['one', 'two', 'three']
line = "some one long. two phrase three and one again"
words_re = re.compile("|".join(word_list))
print(len(words_re.findall(line)))
This is my error:
words_re = re.compile("|".join(word_list))
File "/usr/lib/python2.7/re.py", line 190, in compile
If you want a case-insensitive match on whole words, ignoring punctuation, split the string, strip the punctuation from each word, and use a dict to store the counts of the words you care about:
lst = ['one', 'two', 'three']
from string import punctuation
cn = dict.fromkeys(lst, 0)
line = "some one long. two phrase three and one again"
for word in line.lower().split():
    word = word.strip(punctuation)
    if word in cn:
        cn[word] += 1
print(cn)
{'three': 1, 'two': 1, 'one': 2}
If you just want the sum use a set with the same logic:
from string import punctuation
st = {'one', 'two', 'three'}
line = "some one long. two phrase three and one again"
print(sum(word.strip(punctuation) in st for word in line.lower().split()))
This does a single pass over the words after they are split, and the set lookup is O(1), so it is substantially more efficient than list.count.
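The same single-pass idea can be sketched with collections.Counter when both the per-word counts and the total are wanted. The function name count_matches is hypothetical, and words are matched only after lowercasing and stripping surrounding punctuation, as in the snippets above.

```python
from collections import Counter
from string import punctuation

def count_matches(line, wanted):
    """Count occurrences in `line` of each word in `wanted`."""
    wanted = set(wanted)  # set membership tests are O(1) on average
    tokens = (word.strip(punctuation) for word in line.lower().split())
    return Counter(token for token in tokens if token in wanted)

line = "some one long. two phrase three and one again"
counts = count_matches(line, ['one', 'two', 'three'])
print(counts['one'], counts['two'], counts['three'])  # 2 1 1
print(sum(counts.values()))                           # 4
```

Because no regular expression is built, the size of the word list only affects the size of the set.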
I see the following script snippet from the gensim tutorial page.
What does the "word for word" syntax mean in the Python snippet below?
>>> texts = [[word for word in document.lower().split() if word not in stoplist]
...          for document in documents]
This is a list comprehension. The code you posted loops through every element in document.lower().split() and creates a new list that contains only the elements that meet the if condition. It does this for each document in documents.
Try it out...
elems = [1, 2, 3, 4]
squares = [e*e for e in elems] # square each element
big = [e for e in elems if e > 2] # keep elements bigger than 2
As you can see from your example, list comprehensions can be nested.
That is a list comprehension. An easier example might be:
evens = [num for num in range(100) if num % 2 == 0]
I'm quite sure I saw that line in some NLP applications.
This list comprehension:
[[word for word in document.lower().split() if word not in stoplist] for document in documents]
is the same as
ending_list = []  # often known as a document stream in NLP
for document in documents:  # loop through the list of documents
    internal_list = []  # often known as a list of tokens
    for word in document.lower().split():
        if word not in stoplist:
            internal_list.append(word)  # this is the inner [word for word ...] part
    ending_list.append(internal_list)
Basically you want a list of documents, each of which is a list of tokens. So by looping through the documents,
for document in documents:
you then split each document into tokens
    list_of_tokens = []
    for word in document.lower().split():
and then make a list of these tokens:
        list_of_tokens.append(word)
For example:
>>> doc = "This is a foo bar sentence ."
>>> [word for word in doc.lower().split()]
['this', 'is', 'a', 'foo', 'bar', 'sentence', '.']
It's the same as:
>>> doc = "This is a foo bar sentence ."
>>> list_of_tokens = []
>>> for word in doc.lower().split():
...     list_of_tokens.append(word)
...
>>> list_of_tokens
['this', 'is', 'a', 'foo', 'bar', 'sentence', '.']
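To see the nested comprehension from the question run end to end, here is a small sketch. The documents and stoplist values are invented for illustration and are not the ones from the gensim tutorial.

```python
documents = ["The quick brown fox", "A fox and the hound"]
stoplist = set("a and the".split())

# Nested comprehension: one inner list of tokens per document.
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# Equivalent explicit loops.
texts_loop = []
for document in documents:
    tokens = []
    for word in document.lower().split():
        if word not in stoplist:
            tokens.append(word)
    texts_loop.append(tokens)

print(texts)                # [['quick', 'brown', 'fox'], ['fox', 'hound']]
print(texts == texts_loop)  # True
```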
Here is the code I have.
All I need to do is make sure the list is organized with uppercase words first and lowercase words second. I looked around but had no luck with .sort() or sorted().
string = input("Please type in a string? ")
words = string.strip().split()
for word in words:
    print(word)
The sorted() function sorts strings by character code, so words starting with an uppercase letter come before words starting with a lowercase letter:
>>> string = "Don't touch that, Zaphod Beeblebox!"
>>> words = string.split()
>>> print( sorted(words) )
['Beeblebox!', "Don't", 'Zaphod', 'that,', 'touch']
But if for some reason sorted() ignored case, you could do it manually with list comprehensions:
words = sorted([i for i in words if i[0].isupper()]) + sorted([i for i in words if i[0].islower()])
This creates two separate lists, the first with capitalized words and the second without, then sorts both individually and concatenates them to give the same result.
But in the end you should definitely just use sorted(); it's much more efficient and concise.
EDIT: Sorry, I might have misinterpreted your question. If you only want to put the capitalized words first, without sorting alphabetically, then this works:
>>> string = "ONE TWO one THREE two three FOUR"
>>> words = string.split()
>>> l = []
>>> print([i for i in [i if i[0].isupper() else l.append(i) for i in words] if i is not None] + l)
['ONE', 'TWO', 'THREE', 'FOUR', 'one', 'two', 'three']
I can't find a method that's more efficient than that, so there you go.
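An alternative sketch for the same grouping relies on sorted() being stable: sorting on a boolean key moves capitalized words to the front and leaves the original order within each group untouched. Like the code above, it assumes every word is non-empty and only looks at the first character.

```python
string = "ONE TWO one THREE two three FOUR"
words = string.split()

# False sorts before True, so words starting with an uppercase letter come first.
caps_first = sorted(words, key=lambda word: not word[0].isupper())
print(caps_first)
# ['ONE', 'TWO', 'THREE', 'FOUR', 'one', 'two', 'three']
```

This avoids appending to a second list from inside a comprehension, which works but is easy to misread.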
string = input("Please type in a string? ")
words = string.strip().split()
words.sort()
As to how to separate upper and lower case words into separate columns:
string = input("Please type in a string? ")
words = string.split()
column1 = []  # all-lowercase words
column2 = []  # everything else
for word in words:
    if word.islower():
        column1.append(word)
    else:
        column2.append(word)
The .islower() method evaluates to True if all the cased characters are lowercase and there is at least one cased character. If this doesn't match your problem's definition of upper and lower case, look into the .isupper() and .istitle() methods.
I need to take a file and shuffle the middle letters of each word, but I can't shuffle the first and last letters, and I only shuffle words longer than 3 characters. I think I can figure out a way to shuffle them if I can put each word into its own separate list where all the letters are separated. Any help would be appreciated. Thanks.
text = "Take in a file and shuffle all the middle letters in between"
words = text.split()
def shuffle(word):
    # get your word as a list
    word = list(word)
    # perform the shuffle operation
    # return the list as a string
    word = ''.join(word)
    return word
for word in words:
    if len(word) > 3:
        print(word[0] + ' ' + shuffle(word[1:-1]) + ' ' + word[-1])
    else:
        print(word)
The shuffle algorithm is intentionally not implemented.
Look at random.shuffle. It shuffles a list object in place, which seems to be what you're aiming for. You can do something like this to shuffle the letters around:
import random

def scramble(word):
    # Words of 3 characters or fewer are returned unchanged.
    if len(word) <= 3:
        return word
    output = list(word[1:-1])
    random.shuffle(output)
    output.append(word[-1])
    return word[0] + "".join(output)
#with open("words.txt",'w') as f:
#    f.write("one two three four five\nsix seven eight nine")
import random
import shutil
import tempfile

def get_words(f):
    for line in f:
        for word in line.split():
            yield word

def shuffle_word(word):
    if len(word) > 3:
        word = list(word)
        middle = word[1:-1]
        random.shuffle(middle)
        word[1:-1] = middle
        word = "".join(word)
    return word

with open("words.txt") as f:
    #print(list(get_words(f)))
    #['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine']
    #print(list(map(shuffle_word, get_words(f))))
    #['one', 'two', 'trhee', 'four', 'fvie', 'six', 'sveen', 'eihgt', 'nnie']
    # mode="w" so that text (not bytes) can be written to the temporary file
    with tempfile.NamedTemporaryFile(mode="w", delete=False) as tmp:
        tmp.write(" ".join(map(shuffle_word, get_words(f))))
        fname = tmp.name
# replace the original file once both files are closed
shutil.move(fname, "words.txt")
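Joining the shuffled words with single spaces, as above, discards the original line breaks. If the layout should survive, one possible sketch is to rewrite each run of letters in place with re.sub. The names scramble_word and scramble_text are made up for illustration, and treating a "word" as a run of ASCII letters is an assumption.

```python
import random
import re

def scramble_word(word):
    """Shuffle the middle letters; words of 3 letters or fewer are returned unchanged."""
    if len(word) <= 3:
        return word
    middle = list(word[1:-1])
    random.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

def scramble_text(text):
    """Scramble every run of letters, leaving spaces, newlines and punctuation alone."""
    return re.sub(r"[A-Za-z]+", lambda match: scramble_word(match.group()), text)

text = "one two three four five\nsix seven eight nine"
print(scramble_text(text))
```

The output varies from run to run, but each word keeps its first and last letter and its position in the text.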