Python noob here, so sorry for the simple question, but I can't find an exact solution for my situation.
I've got a Python list and I want to remove stop words from it. My code isn't removing a stop word when it's paired with another token.
from nltk.corpus import stopwords
rawData = ['for', 'the', 'game', 'the movie']
text = [each_string.lower() for each_string in rawData]
newText = [word for word in text if word not in stopwords.words('english')]
print(newText)
current output:
['game', 'the movie']
desired output:
['game', 'movie']
I'd prefer to use list comprehension for this.
It took me a while to do this because list comprehensions are not my thing. Anyway, this is how I did it:
import functools
stopwords = ["for", "the"]
rawData = ['for', 'the', 'game', 'the movie']
lst = functools.reduce(lambda x,y: x+y, [i.split() for i in rawData])
newText = [word for word in lst if word not in stopwords]
print(newText)
Basically, the `functools.reduce` line takes the nested list produced by `[i.split() for i in rawData]` (each value split into tokens) and concatenates the sublists, turning the nested list one-dimensional.
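Since the original question asked for a list comprehension, the split-and-flatten can also be done with a single nested comprehension, no functools needed — a sketch using the same sample data:

```python
stopwords = ["for", "the"]
rawData = ['for', 'the', 'game', 'the movie']

# Split each entry and flatten in one nested comprehension,
# then drop any token that is a stop word.
newText = [word for item in rawData for word in item.split()
           if word not in stopwords]
print(newText)  # ['game', 'movie']
```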
I have a list of tuples with some data like this: [('word1','sentence1'),('word2','sentence1'),('word3','sentence1') ...], I loop on every tuple to get each word and sentence like this:
for collection in tup:
    qword = collection[0]
    sentence = collection[1]
so far, so good. I needed to remove each word from the sentence so I did this:
q_sentence_split = sentence.split()
new_sentence_split = [word.replace(qword, '.....') for word in q_sentence_split]
new_sentence = ' '.join(new_sentence_split)
but this didn't give me what I needed: it replaces the characters of the tuple's word inside every word of q_sentence_split, but what I need is to match the whole word only, not to compare its characters against the characters of each word in the sentence.
I tried putting if after the for word in q_sentence_split like this:
new_sentence_split = [word.replace(qword, '.....') for word in q_sentence_split if word == qword]
but this just removed every word from the sentence, so I don't know what is wrong with my code.
Try this:
lst = [('word1', 'The word1 is in this sentence'),
       ('word2', 'The word2blawblaw is in this sentence'),
       ('word3', 'The word1 is in this sentence')]

for word, sentence in lst:
    print(' '.join(i for i in sentence.split() if word not in i))
output:
The is in this sentence
The is in this sentence
The word1 is in this sentence
I intentionally put word1 in the third sentence.
Explanation :
First we iterate through lst; each iteration gives us a tuple, which we unpack into word and sentence.
After splitting, the sentence becomes a list such as ['The', 'word1', 'is', 'in', 'this', 'sentence'] (in the first iteration).
Then we iterate through that list, which gives us the individual words; all we have to do is check whether our word is inside any of them. If it isn't, we keep it.
Finally we ' '.join() the remaining words to rebuild the sentence.
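If the goal is to drop only exact whole-word matches (unlike the containment test above, which also removes 'word2blawblaw'), comparing for equality is one option — a sketch using the same sample data:

```python
lst = [('word1', 'The word1 is in this sentence'),
       ('word2', 'The word2blawblaw is in this sentence')]

for word, sentence in lst:
    # keep every token that is not exactly the query word,
    # so 'word2blawblaw' survives while 'word1' is removed
    print(' '.join(i for i in sentence.split() if i != word))
```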
Here is what you can do:
list1 = [('word1', 'The word1 is in this sentence'),
         ('word2', 'The word2blawblaw is in this sentence'),
         ('word3', 'The word1 is in this sentence')]

for x, y in list1:
    print(' '.join(i for i in y.split() if x not in i))
I have a list as input which contains words; these words sometimes contain non-ASCII characters, and I need to filter out the entire word if it contains letters that are not in the ASCII list.
So if the input is:
words = ['Hello', 'my','dear', 'de7ar', 'Fri?ends', 'Friends']
I need the Output:
['Hello', 'my', 'dear', 'Friends']
words = ['Hello', 'my','dear', 'de7ar', 'Fri?ends', 'Friends']
al = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
ascii_letters = [char for char in al]
filtered_words=[]
I tried it with this:
for el in words:
    try:
        words in ascii_letters
    except FALSE:
        filtered_words.append(el)
and this
filtered_words = [ele for ele in words if all(ch not in ele for ch in ascii_letters)]
but both of them do not result in what I need - I do understand why but since I have only been learning python for a week I fail to adjust them to make them do what I want them to, maybe someone knows how to handle this (without using any libraries)?
Thanks
You could check whether your alphabet is a superset of the words:
>>> [*filter(set(al).issuperset, words)]
['Hello', 'my', 'dear', 'Friends']
Btw, better don't hardcode that alphabet (I've seen quite a few people do that and forget letters) but import it:
from string import ascii_letters as al
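Putting those two points together — the set-superset check plus the imported alphabet — a complete sketch looks like this:

```python
from string import ascii_letters

words = ['Hello', 'my', 'dear', 'de7ar', 'Fri?ends', 'Friends']
letters = set(ascii_letters)

# a set is a superset of a string when every character of the
# string is an element of the set
filtered = [w for w in words if letters.issuperset(w)]
print(filtered)  # ['Hello', 'my', 'dear', 'Friends']
```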
You need to iterate through the words in the words list to check whether all letters are in ASCII, or you can use the all() function:
words = ['Hello', 'my','dear', 'de7ar', 'Fri?ends', 'Friends']
al = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
ascii_letters = [char for char in al]
out = []
for word in words:
    not_in_ascii = False
    for letter in word:
        if letter not in ascii_letters:
            not_in_ascii = True
    if not_in_ascii:
        continue
    out.append(word)
It is also possible with list comprehension and all() as you tried:
out = [word for word in words if all([letter in ascii_letters for letter in word])]
[i for i in words if i.isalpha()]
Result:
['Hello', 'my', 'dear', 'Friends']
Note that str.isalpha() also accepts non-ASCII letters such as 'é'; if you need strictly ASCII, combine it with str.isascii().
I have a dictionary and a text:
{"love":1, "expect":2, "annoy":-2}
test="i love you, that is annoying"
I need to remove the words from the string if they appear in the dictionary. I have tried this code:
for k in dict:
    if k in test:
        test = test.replace(k, "")
However the result is:
i  you, that is ing
And this is not what I am looking for, as it should not remove "annoy" as a part of the word, the whole word should be evaluated. How can I achieve it?
First, you should not assign names to variables that are also names of builtin in classes, such as dict.
Variable test is a string composed of characters. When you say, if k in test:, you will be testing k to see if it is a substring of test. What you want to do is break up test into a list of words and compare k against each complete word in that list. If words are separated by a single space, then they may be "split" with:
test.split(' ')
The only complication is that it will create the following list:
['i', 'love', 'you,', 'that', 'is', 'annoying']
Note that the third item still has a , in it. So we should first get rid of punctuation marks we might expect to find in our sentence:
test.replace('.', '').replace(',', ' ').split(' ')
Yielding:
['i', 'love', 'you', '', 'that', 'is', 'annoying']
The following will actually get rid of all punctuation:
import string
test.translate(str.maketrans('', '', string.punctuation))
So now our code becomes:
>>> import string
>>> d = {"love":1, "expect":2, "annoy":-2}
>>> test="i love you, that is annoying"
>>> for k in d:
...     if k in test.translate(str.maketrans('', '', string.punctuation)).split(' '):
...         test = test.replace(k, "")
...
>>> print(test)
i  you, that is annoying
>>>
You may now find you have extra spaces in your sentence, but you can figure out how to get rid of those.
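For those extra spaces, one simple trick (a sketch; the literal string here stands in for whatever result you ended up with) is to split on arbitrary whitespace and re-join:

```python
test = "i  you, that is annoying"  # the loop above can leave doubled spaces

# split() with no arguments splits on runs of whitespace, so joining
# the pieces back with single spaces normalizes the spacing
cleaned = ' '.join(test.split())
print(cleaned)  # i you, that is annoying
```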
you can use this:
query = "i love you, that is annoying"
query = query.replace('.', '').replace(',', '')
my_dict = {"love": 1, "expect": 2, "annoy": -2}
querywords = query.split()
resultwords = [word for word in querywords if word.lower() not in my_dict]
result = ' '.join(resultwords)
print(result)
>> i you that is annoying
If you want to exclude all words case-insensitively, convert all keys in my_dict to lowercase:
my_dict = {k.lower(): v for k, v in my_dict.items()}
Assuming I have a string
string = 'i am a person i believe i can fly i believe i can touch the sky'
What I would like to do is to get all the words that are next to (from the right side) the word 'i', so in this case am, believe, can, believe, can.
How could I do that in Python? I found one approach, but it only gives the first word, in this case 'am'.
Simple generator method (note that if the match is the final word, next(words) will raise StopIteration, which Python 3.7+ turns into a RuntimeError):
def get_next_words(text, match, sep=' '):
    words = iter(text.split(sep))
    for word in words:
        if word == match:
            yield next(words)
Usage:
text = 'i am a person i believe i can fly i believe i can touch the sky'
words = get_next_words(text, 'i')
for w in words:
    print(w)
# am
# believe
# can
# believe
# can
You can write a regular expression to find the words after the target word:
import re
word = "i"
string = 'i am a person i believe i can fly i believe i can touch the sky'
pat = re.compile(r'\b{}\b \b(\w+)\b'.format(word))
print(pat.findall(string))
# ['am', 'believe', 'can', 'believe', 'can']
One way is to use a regular expression with a look behind assertion:
>>> import re
>>> string = 'i am a person i believe i can fly i believe i can touch the sky'
>>> re.findall(r'(?<=\bi )\w+', string)
['am', 'believe', 'can', 'believe', 'can']
You can split the string and get the next index of the word "i" as you iterate with enumerate:
string = 'i am a person i believe i can fly i believe i can touch the sky'
sl = string.split()
all_is = [sl[i + 1] for i, word in enumerate(sl[:-1]) if word == 'i']
print(all_is)
# ['am', 'believe', 'can', 'believe', 'can']
Note that as #PatrickHaugh pointed out, we want to be careful if "i" is the last word so we can exclude iterating over the last word completely.
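An equivalent sketch that sidesteps the last-word edge case entirely is to zip the split list with itself shifted by one:

```python
string = 'i am a person i believe i can fly i believe i can touch the sky'
sl = string.split()

# zip pairs each word with its successor; the final word has no
# successor, so a trailing 'i' is skipped automatically
after_i = [nxt for cur, nxt in zip(sl, sl[1:]) if cur == 'i']
print(after_i)  # ['am', 'believe', 'can', 'believe', 'can']
```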
import re
string = 'i am a person i believe i can fly i believe i can touch the sky'
words = [w.split()[0] for w in re.split('i +', string) if w]
print(words)
# ['am', 'believe', 'can', 'believe', 'can']
Note that re.split('i +', ...) splits on any 'i' followed by spaces, so it would also split inside a word that ends with 'i'; the \b word-boundary anchors used in the other regex answers are safer.
I've managed to get my program to store a sentence or two into a dictionary and at the same time create a word position list.
What I need to do now is recreate the original sentence just from the dictionary and the position list. I've done lots of searches, but the results I'm getting are either not what I need or are too confusing and beyond me.
Any help would be much appreciated, thanks.
Here is my code so far:
sentence = ("This Sentence is a very, very good sentence. Did you like my very good sentence?")
print ('This is the sentence:', sentence)
punctuation = ['(', ')', '?', ':', ';', ',', '.', '!', '/', '"', "'"]
for punct in punctuation:
    sentence = sentence.replace(punct, " %s" % punct)
print ('This is the sentence with spaces before the punctuations:', sentence)
words_list = sentence.split()
print ('A list of the words in the sentence:', words_list)
dictionary = {}
word_pos_list = []
counter = 0
for word in words_list:
    if word not in dictionary:
        counter += 1
        dictionary[word] = counter
    word_pos_list.append(dictionary[word])
print ('The positions of the words in the sentence are:', word_pos_list)
John
While, as mentioned in the comments, dictionaries are not sorted data structures, if you are breaking up a sentence, indexing it into a dictionary, and trying to put it back together, you can use an OrderedDict from the collections module to do what you're doing.
That said, this is without any further background on how you are splitting your sentence (punctuation etc.); I suggest looking into NLTK if you are doing any sort of natural language processing (NLP).
from collections import OrderedDict
In [182]: def index_sentence(s):
   .....:     return {s.split(' ').index(i): i for i in s.split(' ')}
   .....:

In [183]: def build_sentence_from_dict(d):
   .....:     return ' '.join(OrderedDict(d).values())
   .....:
In [184]: s
Out[184]: 'See spot jump over the brown fox.'
In [185]: id = index_sentence(s)
In [186]: id
Out[186]: {0: 'See', 1: 'spot', 2: 'jump', 3: 'over', 4: 'the', 5: 'brown', 6: 'fox.'}
In [187]: build_sentence_from_dict(id)
Out[187]: 'See spot jump over the brown fox.'
Note that s.split(' ').index(i) always returns the first occurrence of a word, so this approach breaks on sentences with repeated words; it works here because every word in the example is unique.
To reconstruct from your list you have to reverse the location mapping:
# reconstruct
reversed_dictionary = {x:y for y, x in dictionary.items()}
print(' '.join(reversed_dictionary[x] for x in word_pos_list))
This can be done more nicely using a defaultdict (a dictionary with a predefined default value, in your case a list of locations for the word):
#!/usr/bin/env python3.4
from collections import defaultdict
# preprocessing
sentence = ("This Sentence is a very, very good sentence. Did you like my very good sentence?")
punctuation = '()?:;,.!/"\''
for punct in punctuation:
    sentence = sentence.replace(punct, " %s" % punct)
# using defaultdict this time
word_to_locations = defaultdict(list)
for position, word in enumerate(sentence.split()):
    word_to_locations[word].append(position)
# word -> list of locations
print(word_to_locations)
# location -> word
location_to_word = dict((y, x) for x in word_to_locations for y in word_to_locations[x])
print(location_to_word)
# reconstruct
print(' '.join(location_to_word[x] for x in range(len(location_to_word))))
It's not the randomness of dictionary keys that's the problem here, it's the failure to record every position at which a word was seen, duplicate or not. The following does that and then unwinds the dictionary to produce the original sentence, sans punctuation:
from collections import defaultdict
sentence = ("This Sentence is a very, very good sentence. Did you like my very good sentence?")
print ('This is the sentence:', sentence)
punctuation = set('()?:;\\,.!/"\'')
sentence = ''.join(character for character in sentence if character not in punctuation)
print ('This is the sentence with no punctuation:', sentence)
words = sentence.split()
print('A list of the words in the sentence:', words)
dictionary = defaultdict(list)
last_word_position = 0
for word in words:
    last_word_position += 1
    dictionary[word].append(last_word_position)
print('A list of unique words in the sentence and their positions:', dictionary.items())
# Now the tricky bit to unwind our random dictionary:
sentence = []
for position in range(1, last_word_position + 1):
    sentence.extend([word for word, positions in dictionary.items() if position in positions])
print(*sentence)
The output of the various print() statements:
This is the sentence: This Sentence is a very, very good sentence. Did you like my very good sentence?
This is the sentence with no punctuation: This Sentence is a very very good sentence Did you like my very good sentence
A list of the words in the sentence: ['This', 'Sentence', 'is', 'a', 'very', 'very', 'good', 'sentence', 'Did', 'you', 'like', 'my', 'very', 'good', 'sentence']
A list of unique words in the sentence and their positions: dict_items([('Sentence', [2]), ('is', [3]), ('a', [4]), ('very', [5, 6, 13]), ('This', [1]), ('my', [12]), ('Did', [9]), ('good', [7, 14]), ('you', [10]), ('sentence', [8, 15]), ('like', [11])])
This Sentence is a very very good sentence Did you like my very good sentence