I would appreciate someone's help on this probably simple matter: I have a long list of words in the form ['word', 'another', 'word', 'and', 'yet', 'another']. I want to compare these words to a list that I specify, thus looking for target words whether they are contained in the first list or not.
I would like to output which of my "search" words are contained in the first list and how many times they appear. I tried something like list(set(a).intersection(set(b))) - but it splits up the words and compares letters instead.
How can I write in a list of words to compare with the existing long list? And how can I output co-occurences and their frequencies? Thank you so much for your time and help.
>>> lst = ['word', 'another', 'word', 'and', 'yet', 'another']
>>> search = ['word', 'and', 'but']
>>> [(w, lst.count(w)) for w in set(lst) if w in search]
[('and', 1), ('word', 2)]
This code basically iterates through the unique elements of lst, and if the element is in the search list, it adds the word, along with the number of occurences, to the resulting list.
Preprocess your list of words with a Counter:
from collections import Counter
a = ['word', 'another', 'word', 'and', 'yet', 'another']
c = Counter(a)
# c == Counter({'word': 2, 'another': 2, 'and': 1, 'yet': 1})
Now you can iterate over your new list of words and check whether they are contained within this Counter-dictionary and the value gives you their number of appearance in the original list:
words = ['word', 'no', 'another']
for w in words:
print w, c.get(w, 0)
which prints:
word 2
no 0
another 2
or output it in a list:
[(w, c.get(w, 0)) for w in words]
# returns [('word', 2), ('no', 0), ('another', 2)]
Related
I am interested in the finding of the same words in two lists. I have two lists of words in the text_list I also stemmed the words.
text_list = [['i', 'am', 'interest' ,'for', 'this', 'subject'], ['this', 'is', 'a', 'second', 'sentence']]
words_list = ['a', 'word', 'sentence', 'interesting']
So I need this output:
same_words= ['a', 'sentence', 'interest']
You need to apply stemming to both the lists, There are discrepancies for example interesting and interest and if you apply stemming to only words_list then Sentence becomes sentenc so, therefore, apply stemmer to both the lists and then find the common elements:
from nltk.stem import PorterStemmer
text_list = [['i', 'am', 'interest','for', 'this', 'subject'], ['this', 'is', 'a', 'second', 'sentence']]
words_list = ['a', 'word', 'sentence', 'interesting']
ps = PorterStemmer()
words_list = [ps.stem(w) for w in words_list]
text_list = [list(map(ps.stem,i)) for i in text_list]
answer = []
for i in text_list:
answer.append(list(set(words_list).intersection(set(i))))
output = sum(answer, [])
print(output)
>>> ['interest', 'a', 'sentenc']
There is a package called fuzzywuzzy which allows you to match the string from a list with the strings from another list with approximation.
First of all, you will need to flatten your nested list to a list/set with unique strings.
from itertools import chain
newset = set(chain(*text_list))
{'sentence', 'i', 'interest', 'am', 'is', 'for', 'a', 'second', 'subject', 'this'}
Next, from the fuzzywuzzy package, we import the fuzz function.
from fuzzywuzzy import fuzz
result = [max([(fuzz.token_set_ratio(i,j),j) for j in newset]) for i in words_list]
[(100, 'a'), (57, 'for'), (100, 'sentence'), (84, 'interest')]
by looking at here, the fuzz.token_set_ratio actually helps you to match the every element from the words_list to all the elements in newset and gives the percentage of matching alphabets between the two elements. You can remove the max to see the full list of it. (Some alphabets in for is in the word, that's why it's shown in this tuple list too with 57% of matching. You can later use a for loop and a percentage tolerance to remove those matches below the percentage tolerance)
Finally, you will use map to get your desired output.
similarity_score, fuzzy_match = map(list,zip(*result))
fuzzy_match
Out[40]: ['a', 'for', 'sentence', 'interest']
Extra
If your input is not the usual ASCII standard, you can put another argument in the fuzz.token_set_ratio
a = ['У', 'вас', 'є', 'чашка', 'кави?']
b = ['ви']
[max([(fuzz.token_set_ratio(i, j, force_ascii= False),j) for j in a]) for i in b]
Out[9]: [(67, 'кави?')]
I have a big text file like this (without the blank space in between words but every word in each line):
this
is
my
text
and
it
should
be
awesome
.
And I have also a list like this:
index_list = [[1,2,3,4,5],[6,7,8][9,10]]
Now I want to replace every element of each list with the corresponding index line of my text file, so the expected answer would be:
new_list = [[this, is, my, text, and],[it, should, be],[awesome, .]
I tried a nasty workaround with two for loops with a range function that was way too complicated (so I thought). Then I tried it with linecache.getline, but that also has some issues:
import linecache
new_list = []
for l in index_list:
for j in l:
new_list.append(linecache.getline('text_list', j))
This does produce only one big list, which I don't want. Also, after every word I get a bad \n which I do not get when I open the file with b = open('text_list', 'r').read.splitlines() but I don't know how to implement this in my replace function (or create, rather) so I don't get [['this\n' ,'is\n' , etc...
You are very close. Just use a temp list and the append that to the main list. Also you can use str.strip to remove newline char.
Ex:
import linecache
new_list = []
index_list = [[1,2,3,4,5],[6,7,8],[9,10]]
for l in index_list:
temp = [] #Temp List
for j in l:
temp.append(linecache.getline('text_list', j).strip())
new_list.append(temp) #Append to main list.
You could use iter to do this as long as you text_list has exactly as many elements as sum(map(len, index_list))
text_list = ['this', 'is', 'my', 'text', 'and', 'it', 'should', 'be', 'awesome', '.']
index_list = [[1,2,3,4,5],[6,7,8],[9,10]]
text_list_iter = iter(text_list)
texts = [[next(text_list_iter) for _ in index] for index in index_list]
Output
[['this', 'is', 'my', 'text', 'and'], ['it', 'should', 'be'], ['awesome', '.']]
But I am not sure if this is what you wanted to do. Maybe I am assuming some sort of ordering of index_list. The other answer I can think of is this list comprehension
texts_ = [[text_list[i-1] for i in l] for l in index_list]
Output
[['this', 'is', 'my', 'text', 'and'], ['it', 'should', 'be'], ['awesome', '.']]
I am having a little trouble figuring out how to recreate the sentence:
"Believe in the me, that believes in you!"
(a little cringe for those who have watched Gurren Lagann...) while using the indexes that I obtained by enumerating the list of words:
['believe', 'in', 'the', 'me', 'that', 'believes', 'you']
This list was .split() and .lower() to remove punctuation in a previous bit of code I made to make the words list file and index list file.
When indexed, these are the words in their enumerated form:
(1, 'believe')
(2, 'in')
(3, 'the')
(4, 'me')
(5, 'that')
(6, 'believes')
(7, 'you')
That is all I have so far as I have been searching for a solution which none have worked for my code. Here is the whole thing so far:
with open("Words list file 2.txt", 'r') as File:
contain = File.read()
contain = contain.replace("[", "")
contain = contain.replace("]", "")
contain = contain.replace(",", "")
contain = contain.replace("'", "")
contain = contain.split()
print("The orginal file reads:")#prints to tell the user the orginal file
print(contain)
for i in enumerate(contain, start = 1):
print(i)
You may join the strings in list via using join method as:
my_list = ['believe', 'in', 'the', 'me', 'that', 'believes', 'you']
>>> ' '.join(my_list)
'believe in the me that believes you'
# ^ missing "in"
But this will result in string with missing "in" after "believes". If you want the make a new string based on index of words in previous list, you may use a temporary list to store the index and then do join on a generator expression as:
>>> temp_list = [0, 1, 2, 3, 4, 5, 1, 6]
>>> ' '.join(my_list[i] for i in temp_list)
'believe in the me that believes in you'
' '.join(['believe', 'in', 'the', 'me', 'that', 'believes', 'you'])
Not sure what the original file contains or what the "contains" variable is when loaded from the file. Please show that.
restart = True
while restart == True:
option = input("Would you like to compress or decompress this file?\nIf you would like to compress type c \nIf you would like to decompress type d.\n").lower()
if option == 'c':
text = input("Please type the text you would like to compress.\n")
text = text.split()
for count,word in enumerate(text):
if text.count(word) < 2:
order.append (max(order)+1)
else:
order.append (text.index(word)+1)
print (uniqueWords)
print (order)
break
elif option == 'd':
pass
else:
print("Sorry that was not an option")
For part of my assignment I need to identify unique words and send them to a text file. I understand how to write text to a text file I do not understand how I can order this code appropriately so it reproduces in a text file (if I was to input "the world of the flowers is a small world to be in":
the,world,of,flowers,is,a,small,to,be,in
1, 2, 3, 1, 5, 6, 7, 8, 2, 9, 10
The top line stating the unique words and the second line showing the order of the words in order to be later decompressed. I have no issue with the decompression or the sorting of the numbers but only the unique words being in order.
Any assistance would be much appreciated!
text = "the world of the flowers is a small world to be in"
words = text.split()
unique_ordered = []
for word in words:
if word not in unique_ordered:
unique_ordered.append(word)
from collections import OrderedDict
text = "the world of the flowers is a small world to be in"
words = text.split()
print list(OrderedDict.fromkeys(words))
output
['the', 'world', 'of', 'flowers', 'is', 'a', 'small', 'to', 'be', 'in']
That's an interesting problem, in fact it can be solved using a dictionary to keep the index of the first occurence and to check if it was already encountered:
string = "the world of the flowers is a small world to be in"
dct = {}
words = []
indices = []
idx = 1
for substring in string.split():
# Check if you've seen it already.
if substring in dct:
# Already seen it, so append the index of the first occurence
indices.append(dct[substring])
else:
# Add it to the dictionary with the index and just append the word and index
dct[substring] = idx
words.append(substring)
indices.append(idx)
idx += 1
>>> print(words)
['the', 'world', 'of', 'flowers', 'is', 'a', 'small', 'to', 'be', 'in']
>>> print(indices)
[1, 2, 3, 1, 4, 5, 6, 7, 2, 8, 9, 10]
If you don't want the indices there are also some external modules that have such a function to get the unique words in order of appearance:
>>> from iteration_utilities import unique_everseen
>>> list(unique_everseen(string.split()))
['the', 'world', 'of', 'flowers', 'is', 'a', 'small', 'to', 'be', 'in']
>>> from more_itertools import unique_everseen
>>> list(unique_everseen(string.split()))
['the', 'world', 'of', 'flowers', 'is', 'a', 'small', 'to', 'be', 'in']
>>> from toolz import unique
>>> list(unique(string.split()))
['the', 'world', 'of', 'flowers', 'is', 'a', 'small', 'to', 'be', 'in']
To remove the duplicate entries from list whilst preserving the order, you my check How do you remove duplicates from a list in whilst preserving order?'s answers. For example:
my_sentence = "the world of the flowers is a small world to be in"
wordlist = my_sentence.split()
# Accepted approach in linked post
def get_ordered_unique(seq):
seen = set()
seen_add = seen.add
return [x for x in seq if not (x in seen or seen_add(x))]
unique_list = get_ordered_unique(wordlist)
# where `unique_list` holds:
# ['the', 'world', 'of', 'flowers', 'is', 'a', 'small', 'to', 'be', 'in']
Then in order to print the position of word, you may list.index() with list comprehension expression as:
>>> [unique_list.index(word)+1 for word in wordlist]
[1, 2, 3, 1, 4, 5, 6, 7, 2, 8, 9, 10]
I have a list of curse words I want to match against other list to remove matches. I normally use list.remove('entry') on an individual basis, but looping through a list of entries against another list - then removing them has me stumped. Any ideas?
Using filter:
>>> words = ['there', 'was', 'a', 'ffff', 'time', 'ssss']
>>> curses = set(['ffff', 'ssss'])
>>> filter(lambda x: x not in curses, words)
['there', 'was', 'a', 'time']
>>>
It could also be done with list comprehension:
>>> [x for x in words if x not in curses]
Use sets.
a=set(["cat","dog","budgie"])
b=set(["horse","budgie","donkey"])
a-b
->set(['dog', 'cat'])