I am interested in finding the same words in two lists. I have two lists of words in text_list, and I have also stemmed the words.
text_list = [['i', 'am', 'interest' ,'for', 'this', 'subject'], ['this', 'is', 'a', 'second', 'sentence']]
words_list = ['a', 'word', 'sentence', 'interesting']
So I need this output:
same_words = ['a', 'sentence', 'interest']
You need to apply stemming to both lists. There are discrepancies such as interesting vs. interest, and if you stem only words_list, then sentence becomes sentenc and no longer matches the unstemmed text_list. So apply the stemmer to both lists and then find the common elements:
from nltk.stem import PorterStemmer
text_list = [['i', 'am', 'interest','for', 'this', 'subject'], ['this', 'is', 'a', 'second', 'sentence']]
words_list = ['a', 'word', 'sentence', 'interesting']
ps = PorterStemmer()
words_list = [ps.stem(w) for w in words_list]
text_list = [list(map(ps.stem,i)) for i in text_list]
answer = []
for i in text_list:
    answer.append(list(set(words_list).intersection(set(i))))
output = sum(answer, [])
print(output)
>>> ['interest', 'a', 'sentenc']
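sum(answer, []) is just a quick trick to flatten the list of per-sentence lists; itertools.chain.from_iterable does the same thing more idiomatically:
from itertools import chain
output = list(chain.from_iterable(answer))
print(output)  # ['interest', 'a', 'sentenc']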
There is a package called fuzzywuzzy which lets you approximately match strings from one list against strings from another list.
First of all, you will need to flatten your nested list to a list/set with unique strings.
from itertools import chain
newset = set(chain(*text_list))
{'sentence', 'i', 'interest', 'am', 'is', 'for', 'a', 'second', 'subject', 'this'}
Next, from the fuzzywuzzy package, we import the fuzz module.
from fuzzywuzzy import fuzz
result = [max([(fuzz.token_set_ratio(i,j),j) for j in newset]) for i in words_list]
[(100, 'a'), (57, 'for'), (100, 'sentence'), (84, 'interest')]
Looking at this, fuzz.token_set_ratio matches every element from words_list against all the elements in newset and gives the percentage of matching characters between the two. You can remove the max to see the full list of scores. (Some characters of for appear in word, which is why it shows up in this tuple list with a 57% match. You can later use a percentage tolerance to drop matches below the tolerance, as sketched below.)
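For example, a minimal sketch of that filtering step (the threshold of 80 is an arbitrary assumption; tune it to your data):
threshold = 80
filtered = [(score, word) for score, word in result if score >= threshold]
# [(100, 'a'), (100, 'sentence'), (84, 'interest')]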
Finally, you will use map to get your desired output.
similarity_score, fuzzy_match = map(list,zip(*result))
fuzzy_match
Out[40]: ['a', 'for', 'sentence', 'interest']
Extra
If your input is not standard ASCII, you can pass an extra argument to fuzz.token_set_ratio:
a = ['У', 'вас', 'є', 'чашка', 'кави?']
b = ['ви']
[max([(fuzz.token_set_ratio(i, j, force_ascii=False), j) for j in a]) for i in b]
Out[9]: [(67, 'кави?')]
I have a big text file like this (with no blank lines in between, and every word on its own line):
this
is
my
text
and
it
should
be
awesome
.
And I have also a list like this:
index_list = [[1,2,3,4,5],[6,7,8],[9,10]]
Now I want to replace every element of each list with the corresponding index line of my text file, so the expected answer would be:
new_list = [['this', 'is', 'my', 'text', 'and'], ['it', 'should', 'be'], ['awesome', '.']]
I tried a nasty workaround with two for loops with a range function that was way too complicated (so I thought). Then I tried it with linecache.getline, but that also has some issues:
import linecache
new_list = []
for l in index_list:
    for j in l:
        new_list.append(linecache.getline('text_list', j))
This produces only one big list, which I don't want. Also, after every word I get an unwanted \n, which I do not get when I open the file with b = open('text_list', 'r').read().splitlines(). But I don't know how to implement this in my replace function (or rather, in the one I am creating), so that I don't get [['this\n', 'is\n', etc.
You are very close. Just use a temp list and then append that to the main list. Also, you can use str.strip to remove the newline character.
Ex:
import linecache
new_list = []
index_list = [[1,2,3,4,5],[6,7,8],[9,10]]
for l in index_list:
    temp = []  # temp list for the current sublist
    for j in l:
        temp.append(linecache.getline('text_list', j).strip())
    new_list.append(temp)  # append to main list
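Alternatively, since you already noticed that read().splitlines() gives you the words without the trailing \n, you can read the file once and index into the resulting list (a sketch; it assumes the whole file fits in memory and that the entries of index_list are 1-based line numbers):
with open('text_list') as f:
    lines = f.read().splitlines()  # no trailing '\n' on each word
index_list = [[1,2,3,4,5],[6,7,8],[9,10]]
new_list = [[lines[j - 1] for j in l] for l in index_list]
# [['this', 'is', 'my', 'text', 'and'], ['it', 'should', 'be'], ['awesome', '.']]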
You could use iter to do this, as long as your text_list has exactly as many elements as sum(map(len, index_list)):
text_list = ['this', 'is', 'my', 'text', 'and', 'it', 'should', 'be', 'awesome', '.']
index_list = [[1,2,3,4,5],[6,7,8],[9,10]]
text_list_iter = iter(text_list)
texts = [[next(text_list_iter) for _ in index] for index in index_list]
Output
[['this', 'is', 'my', 'text', 'and'], ['it', 'should', 'be'], ['awesome', '.']]
But I am not sure if this is what you wanted to do; maybe I am assuming some sort of ordering of index_list. The other answer I can think of is this list comprehension:
texts_ = [[text_list[i-1] for i in l] for l in index_list]
Output
[['this', 'is', 'my', 'text', 'and'], ['it', 'should', 'be'], ['awesome', '.']]
I have to take a large list of words in the form:
['this\n', 'is\n', 'a\n', 'list\n', 'of\n', 'words\n']
and then using the strip function, turn it into:
['this', 'is', 'a', 'list', 'of', 'words']
I thought that what I had written would work, but I keep getting an error saying:
"'list' object has no attribute 'strip'"
Here is the code that I tried:
strip_list = []
for lengths in range(1,20):
    strip_list.append(0) #longest word in the text file is 20 characters long
for a in lines:
    strip_list.append(lines[a].strip())
You can either use a list comprehension
my_list = ['this\n', 'is\n', 'a\n', 'list\n', 'of\n', 'words\n']
stripped = [s.strip() for s in my_list]
or alternatively use map():
stripped = list(map(str.strip, my_list))
In Python 2, map() directly returned a list, so you didn't need the call to list. In Python 3, the list comprehension is more concise and generally considered more idiomatic.
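A quick illustration of the Python 3 behaviour:
my_list = ['this\n', 'is\n', 'a\n']
m = map(str.strip, my_list)
print(m)        # <map object at 0x...>, a lazy iterator
print(list(m))  # ['this', 'is', 'a']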
list comprehension?
[x.strip() for x in lst]
You can use list comprehensions:
strip_list = [item.strip() for item in lines]
Or the map function (note that in Python 3, map returns a lazy iterator, so wrap it in list() if you need a list):
# with a lambda
strip_list = map(lambda it: it.strip(), lines)
# without a lambda
strip_list = map(str.strip, lines)
This can be done using list comprehensions as defined in PEP 202
[w.strip() for w in ['this\n', 'is\n', 'a\n', 'list\n', 'of\n', 'words\n']]
All the other answers, mainly the ones about list comprehensions, are great. But just to explain your error:
strip_list = []
for lengths in range(1,20):
    strip_list.append(0) #longest word in the text file is 20 characters long
for a in lines:
    strip_list.append(lines[a].strip())
a is a member of your list, not an index. What you could write is this:
[...]
for a in lines:
    strip_list.append(a.strip())
Another important comment: you can create a list pre-filled with default values this way:
strip_list = [0] * 20
But this is not useful here, as .append adds items to the end of the list. In your case there is no point pre-filling the list with default values, since you build it item by item by appending the stripped strings.
So your code should be like:
strip_list = []
for a in lines:
    strip_list.append(a.strip())
But, for sure, the best one is this one, as this is exactly the same thing:
stripped = [line.strip() for line in lines]
In case you have something more complicated than just a .strip, put this in a function, and do the same. That's the most readable way to work with lists.
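For instance, a hypothetical clean-up step that does more than strip, factored into a named function:
def clean(line):
    # strip surrounding whitespace, then lowercase; adapt to your actual needs
    return line.strip().lower()

stripped = [clean(line) for line in lines]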
If you need to remove just trailing whitespace, you could use str.rstrip(), which should be slightly more efficient than str.strip():
>>> lst = ['this\n', 'is\n', 'a\n', 'list\n', 'of\n', 'words\n']
>>> [x.rstrip() for x in lst]
['this', 'is', 'a', 'list', 'of', 'words']
>>> list(map(str.rstrip, lst))
['this', 'is', 'a', 'list', 'of', 'words']
my_list = ['this\n', 'is\n', 'a\n', 'list\n', 'of\n', 'words\n']
print([l.strip() for l in my_list])
Output:
['this', 'is', 'a', 'list', 'of', 'words']
I have a list that consists of both words and digits. Let's say:
list1 = ['1','100', 'Stack', 'over','flow']
From this list I would like to filter out all the digits and keep only the words. I have imported re and found the regex for it, namely:
[^0-9]
However, I am not sure how to implement this so that I get a list like below.
result = ['Stack', 'over', 'flow']
No need for regex, use isdigit():
list1 = ['1','100', 'Stack', 'over','flow']
print([i for i in list1 if not i.isdigit()])
returns :
['Stack', 'over', 'flow']
Use a list comprehension and the string method isdigit:
[elem for elem in list1 if not elem.isdigit()]
You can do this quite nicely with list comprehension:
list1 = ['1','100', 'Stack', 'over','flow']
list2 = [i for i in list1 if not i.isdigit()]
If, for whatever reason, you did want to use regex to do this (maybe you have more complex filtering criteria), you could do it using something like this:
import re
list1 = ['1','100', 'Stack', 'over','flow']
list2 = [i for i in list1 if re.fullmatch('[^0-9]+', i)]
Using filter + lambda:
list(filter(lambda x: not x.isdigit(), list1))
# ['Stack', 'over', 'flow']
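One caveat worth knowing: str.isdigit() only recognizes strings made up entirely of digit characters, so negative numbers and decimals slip through (the tokens below are assumed examples):
tricky = ['-1', '3.14', '100', 'flow']
print([s for s in tricky if not s.isdigit()])
# ['-1', '3.14', 'flow']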
As other answers have suggested, you don't really need regexes, but they can be more flexible if your requirements change in the future. For example:
from re import match
list1 = ['1','100', 'Stack', 'over','flow']
result = list(filter(lambda el: match(r'^[^0-9]*$', el), list1))
^: start of the string
[...]: character group
^: negates the character group
0-9: digits 0-9 (you could use \d as well)
*: zero or more times
$: end of the string
If you want all elements that don't start with a number, use ^[^0-9].* where . is any character.
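A quick sketch of that variant (the tokens 4ever and over4 are assumed examples, not from the question):
import re
tokens = ['4ever', 'over4', 'Stack']
# keep elements whose first character is not a digit
print([t for t in tokens if re.match(r'^[^0-9].*$', t)])
# ['over4', 'Stack']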
I don't know the exact pattern of your list elements, but this code should work for the given example:
import re
pattern = re.compile("([A-Za-z])")
list1 = ['1','100', 'Stack', 'over','flow']
result = []
for x in list1:
    check = pattern.match(x)
    if check is not None:
        result.append(x)
print (result)
# Python 3
olist = list(filter(lambda s: s.isalpha(), list1))
print(olist)  # ['Stack', 'over', 'flow']

# Python 2
olist = filter(lambda s: s.isalpha(), list1)
print olist  # ['Stack', 'over', 'flow']
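Note that isalpha() is stricter than not isdigit(): a token that mixes letters and digits passes the isdigit filter but fails isalpha. A quick illustration (the token over4 is an assumed example):
mixed = ['1', 'over4', 'flow']
print([s for s in mixed if not s.isdigit()])  # ['over4', 'flow']
print([s for s in mixed if s.isalpha()])      # ['flow']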
I would appreciate someone's help on this probably simple matter: I have a long list of words in the form ['word', 'another', 'word', 'and', 'yet', 'another']. I want to compare these words to a list that I specify, thus looking for target words whether they are contained in the first list or not.
I would like to output which of my "search" words are contained in the first list and how many times they appear. I tried something like list(set(a).intersection(set(b))) - but it splits up the words and compares letters instead.
How can I write in a list of words to compare with the existing long list? And how can I output co-occurences and their frequencies? Thank you so much for your time and help.
>>> lst = ['word', 'another', 'word', 'and', 'yet', 'another']
>>> search = ['word', 'and', 'but']
>>> [(w, lst.count(w)) for w in set(lst) if w in search]
[('and', 1), ('word', 2)]
This code basically iterates through the unique elements of lst, and if the element is in the search list, it adds the word, along with the number of occurences, to the resulting list.
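Note that search words absent from lst (like 'but' above) are silently dropped. If you want them reported with a zero count, iterate over search instead:
>>> [(w, lst.count(w)) for w in search]
[('word', 2), ('and', 1), ('but', 0)]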
Preprocess your list of words with a Counter:
from collections import Counter
a = ['word', 'another', 'word', 'and', 'yet', 'another']
c = Counter(a)
# c == Counter({'word': 2, 'another': 2, 'and': 1, 'yet': 1})
Now you can iterate over your new list of words and check whether each one is contained in this Counter dictionary; the value gives you its number of appearances in the original list:
words = ['word', 'no', 'another']
for w in words:
    print(w, c.get(w, 0))
which prints:
word 2
no 0
another 2
or output it in a list:
[(w, c.get(w, 0)) for w in words]
# returns [('word', 2), ('no', 0), ('another', 2)]
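If you only want the co-occurring words (dropping the zero counts), the same Counter supports a membership test:
[(w, c[w]) for w in words if w in c]
# returns [('word', 2), ('another', 2)]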