Extracting set of tokens from list of strings - python

I have a list of strings and I want to extract all tokens into one set of tokens, not a list of sets. I need every token in a single flat set.
My sentences are stored as a list of strings in "sentences"
So if I try:
words = set([])
a=set(sentences[1].split())
b=set(sentences[2].split())
a.union(b)
I get the union of a and b as one set, like this. This is what I'm searching for:
{',', '.', '2.252', '35-1/7', '37-year-old', 'B', 'Blood', 'Fred', 'G4', 'Grauman', 'O+', 'P3-5', 'pregnancy', 'product', 'rubella', 'surface', 'the', 'to', 'type', 'week', 'woman'}
But with a list comprehension:
words = set()
[words.union(set(sent.split())) for sent in sentences]
The output is a list of sets, like this:
[{'.', 'Care', 'He', 'Intensive', 'Neonatal'}, {'.', '2.252', '35-1/7', '37-year-old', 'Fred', 'G4', 'Grauman'}]
Is there a way to get what I need with a compact line of code like a list comprehension?
====
Well, I just found it. After the list comprehension that builds "words" (a list of sets):
a = set()
a.union(*words)
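For reference, both steps collapse into a single line; this sketch assumes sentences is the list of strings from the question:
words = set().union(*(sentence.split() for sentence in sentences))
set().union accepts any number of iterables, so unpacking a generator of token lists builds the flat set in one pass.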

If your sentences are strings, you can just join them and split the result.
set(" ".join(sentences).split())
turns ['A short sentence', 'A second sentence']
into {'A', 'second', 'sentence', 'short'}

How about doing:
set(' '.join(sentences).split())
Or you could use reduce from functools.
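A sketch of the reduce idea, again assuming sentences holds the strings from the question:
from functools import reduce

# Fold each sentence's tokens into the accumulator set
words = reduce(lambda acc, sent: acc | set(sent.split()), sentences, set())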

Related

Stemmer function that takes a string and returns the stems of each word in a list

I am trying to create a function which takes a string as input and returns a list containing the stem of each word in the string. The problem is that, because of a nested for loop, the words are appended to the list multiple times. Is there a way to avoid this?
def stemmer(text):
    stemmed_string = []
    res = text.split()
    suffixes = ('ed', 'ly', 'ing')
    for word in res:
        for i in range(len(suffixes)):
            if word.endswith(suffixes[i]):
                stemmed_string.append(word[:-len(suffixes[i])])
            elif len(word) > 8:
                stemmed_string.append(word[:8])
            else:
                stemmed_string.append(word)
    return stemmed_string
If I call the function on the text 'I have a dog that is barking', this is the output:
['I',
'I',
'I',
'have',
'have',
'have',
'a',
'a',
'a',
'dog',
'dog',
'dog',
'that',
'that',
'that',
'is',
'is',
'is',
'barking',
'barking',
'bark']
You are appending something in each round of the loop over suffixes. To avoid the problem, don't do that.
It's not clear if you want to add the shortest possible string out of a set of candidates, or how to handle stacked suffixes. Here's a version which always strips as much as possible.
def stemmer(text):
    stemmed_string = []
    suffixes = ('ed', 'ly', 'ing')
    for word in text.split():
        for suffix in suffixes:
            if word.endswith(suffix):
                word = word[:-len(suffix)]
        stemmed_string.append(word)
    return stemmed_string
Notice the fixed syntax for looping over a list, too.
This will reduce "sparingly" to "spar", etc.
Like every naïve stemmer, this will also do stupid things with words like "sly" and "thing".
Demo: https://ideone.com/a7FqBp
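For illustration, a quick call of the fixed version (results traced by hand, not taken from the linked demo):
print(stemmer('I have a dog that is barking'))
# ['I', 'have', 'a', 'dog', 'that', 'is', 'bark']
print(stemmer('He walked sparingly'))
# ['He', 'walk', 'spar']  -- 'ly' and 'ing' are both stripped in one pass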

What can I use for finding the same words in two lists? Python

I am interested in finding the same words in two lists. text_list contains two lists of words, and I have also stemmed the words.
text_list = [['i', 'am', 'interest' ,'for', 'this', 'subject'], ['this', 'is', 'a', 'second', 'sentence']]
words_list = ['a', 'word', 'sentence', 'interesting']
So I need this output:
same_words= ['a', 'sentence', 'interest']
You need to apply stemming to both lists. There are discrepancies: for example, 'interesting' vs. 'interest', and if you apply stemming only to words_list, then 'sentence' becomes 'sentenc' and no longer matches. So apply the stemmer to both lists and then find the common elements:
from nltk.stem import PorterStemmer

text_list = [['i', 'am', 'interest', 'for', 'this', 'subject'], ['this', 'is', 'a', 'second', 'sentence']]
words_list = ['a', 'word', 'sentence', 'interesting']

ps = PorterStemmer()
words_list = [ps.stem(w) for w in words_list]
text_list = [list(map(ps.stem, i)) for i in text_list]

answer = []
for i in text_list:
    answer.append(list(set(words_list).intersection(set(i))))
output = sum(answer, [])
print(output)
>>> ['interest', 'a', 'sentenc']
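The same idea fits in two lines if you flatten first; a sketch, assuming words_list and text_list have already been stemmed as above (sets are unordered, so the output is sorted here):
flat = {w for sent in text_list for w in sent}  # one set of all stemmed tokens
print(sorted(set(words_list) & flat))
# ['a', 'interest', 'sentenc']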
There is a package called fuzzywuzzy which lets you match strings from one list against strings from another list approximately.
First of all, you will need to flatten your nested list to a set of unique strings.
from itertools import chain
newset = set(chain(*text_list))
{'sentence', 'i', 'interest', 'am', 'is', 'for', 'a', 'second', 'subject', 'this'}
Next, we import the fuzz module from the fuzzywuzzy package.
from fuzzywuzzy import fuzz
result = [max([(fuzz.token_set_ratio(i,j),j) for j in newset]) for i in words_list]
[(100, 'a'), (57, 'for'), (100, 'sentence'), (84, 'interest')]
Looking at the result, fuzz.token_set_ratio matches every element from words_list against all the elements in newset and gives a similarity percentage between the two strings. You can remove the max to see the full list of candidates. (Some of the letters of 'word' also appear in 'for', which is why 'for' shows up in this tuple list with a 57% match. You can later use a loop and a percentage tolerance to drop the matches that fall below it.)
Finally, you will use map to get your desired output.
similarity_score, fuzzy_match = map(list,zip(*result))
fuzzy_match
Out[40]: ['a', 'for', 'sentence', 'interest']
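For the percentage tolerance mentioned above, a minimal filtering sketch (the threshold of 80 is an arbitrary choice):
threshold = 80  # hypothetical tolerance, tune to taste
fuzzy_match = [word for score, word in result if score >= threshold]
# ['a', 'sentence', 'interest']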
Extra
If your input is not plain ASCII, you can pass an extra argument to fuzz.token_set_ratio:
a = ['У', 'вас', 'є', 'чашка', 'кави?']
b = ['ви']
[max([(fuzz.token_set_ratio(i, j, force_ascii= False),j) for j in a]) for i in b]
Out[9]: [(67, 'кави?')]

How to convert a list of phrases into a list of words?

I want to convert a list which contains both phrases and words to a list which contains only words. For example, if the input is:
list_of_phrases_and_words = ['I am', 'John', 'michael and', 'I am', '16', 'years', 'old']
The expected output is:
list_of_words = ['I', 'am', 'John', 'michael', 'and', 'I', 'am', '16', 'years', 'old']
What is an efficient way to achieve this in Python?
You can use a list comprehension:
list_of_words = [
    word
    for phrase in list_of_phrases_and_words
    for word in phrase.split()
]
An alternative, which might be slightly less efficient for larger lists, is to first create one large string containing everything and then split it:
list_of_words = " ".join(list_of_phrases_and_words).split()
The trick is a nested for loop, whereby you split on the space character " ".
words = [word for phrase in list_of_phrases_and_words for word in phrase.split(" ")]
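If you prefer itertools, a sketch using chain.from_iterable does the same flattening:
from itertools import chain

# Split each phrase, then chain the resulting word lists into one flat list
list_of_words = list(chain.from_iterable(phrase.split() for phrase in list_of_phrases_and_words))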

Replacing numbers in a list of lists with corresponding lines from a text file

I have a big text file like this (there is no blank space between the words; every word is on its own line):
this
is
my
text
and
it
should
be
awesome
.
And I also have a list like this:
index_list = [[1,2,3,4,5],[6,7,8],[9,10]]
Now I want to replace every element of each sublist with the line of my text file at that index, so the expected answer would be:
new_list = [['this', 'is', 'my', 'text', 'and'], ['it', 'should', 'be'], ['awesome', '.']]
I tried a nasty workaround with two for loops and a range function that was way too complicated (or so I thought). Then I tried it with linecache.getline, but that also has some issues:
import linecache

new_list = []
for l in index_list:
    for j in l:
        new_list.append(linecache.getline('text_list', j))
This produces only one big list, which I don't want. Also, after every word I get a stray \n, which I don't get when I open the file with b = open('text_list', 'r').read().splitlines(). But I don't know how to use that here, so that I don't end up with [['this\n', 'is\n', ...].
You are very close. Just use a temp list and then append that to the main list. You can also use str.strip to remove the newline character.
Ex:
import linecache

new_list = []
index_list = [[1,2,3,4,5],[6,7,8],[9,10]]
for l in index_list:
    temp = []  # temp list for this sublist
    for j in l:
        temp.append(linecache.getline('text_list', j).strip())
    new_list.append(temp)  # append to main list
You could use iter to do this, as long as your text_list has exactly as many elements as sum(map(len, index_list)):
text_list = ['this', 'is', 'my', 'text', 'and', 'it', 'should', 'be', 'awesome', '.']
index_list = [[1,2,3,4,5],[6,7,8],[9,10]]
text_list_iter = iter(text_list)
texts = [[next(text_list_iter) for _ in index] for index in index_list]
Output
[['this', 'is', 'my', 'text', 'and'], ['it', 'should', 'be'], ['awesome', '.']]
But I am not sure if this is what you wanted to do; maybe I am assuming some ordering of index_list. The other answer I can think of is this list comprehension:
texts_ = [[text_list[i-1] for i in l] for l in index_list]
Output
[['this', 'is', 'my', 'text', 'and'], ['it', 'should', 'be'], ['awesome', '.']]
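To tie this back to the read().splitlines() idea from the question, a sketch that reads the file once and then indexes into it (assuming the file is named 'text_list' and line numbers are 1-based):
with open('text_list') as f:
    lines = f.read().splitlines()  # no trailing newlines, unlike linecache.getline
new_list = [[lines[i - 1] for i in sub] for sub in index_list]  # i - 1 converts 1-based line numbers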

Splitting python lists

I'm a newbie. I've written a tokenize function which takes in a txt file that consists of sentences and splits them based on whitespace and punctuation. The thing is, it gives me an output with sublists inside a parent list.
My code:
import re

def tokenize(document):
    file = open("document.txt")
    text = file.read()
    hey = text.lower()
    words = re.split(r'\s\s+', hey)
    print [re.findall(r'\w+', b) for b in words]
My output:
[['what', 's', 'did', 'the', 'little', 'boy', 'tell', 'the', 'game', 'eggs', 'warden'], ['his', 'dad', 'was', 'warden', 'in', 'the', 'kitchen', 'poaching', 'eggs']]
Desired Output:
['what', 's', 'did', 'the', 'little', 'boy', 'tell', 'the', 'game', 'eggs', 'warden']['his', 'dad', 'was', 'warden', 'in', 'the', 'kitchen', 'poaching', 'eggs']
How do I remove the parent list in my output? What changes do I need to make in my code in order to remove the outer list brackets?
I want them as individual lists.
A function in Python can only return one value. If you want to return two things (in your case, two lists of words), you have to return an object that can hold both, such as a list, a tuple, or a dictionary.
Do not confuse how you want to print the output with what object is returned.
To simply print the lists:
for b in words:
    print(re.findall(r'\w+', b))
If you do this, then your method doesn't return anything (it actually returns None).
To return both the lists:
return [re.findall(r'\w+', b) for b in words]
Then call your method like this:
word_lists = tokenize(document)
for word_list in word_lists:
    print(word_list)
this should work:
print '\n'.join(str(re.findall(r'\w+', b)) for b in words)
I have an example which I guess is not much different from your problem, where I only take a certain part of the list:
>>> a = [['sa', 'bbb', 'ccc'], ['dad', 'des', 'kkk']]
>>>
>>> print a[0], a[1]
['sa', 'bbb', 'ccc'] ['dad', 'des', 'kkk']
>>>
