Create dictionary of context words without stopwords - python

I am trying to create a dictionary of words in a text and their context. The context should be the list of words that occur within a 5 word window (two words on either side) of the term's position in the string. Effectively, I want to ignore the stopwords in my output vectors.
My code is below. I can get the stopwords out of my dictionary's keys but not the values.
from itertools import count
words = ["This", "is", "an", "example", "sentence"]
stopwords = ["it", "the", "was", "of"]
context_size = 2
stripes = {word:words[max(i - context_size,0):j] for word,i,j in zip(words,count(0),count(context_size+1)) if word.lower() not in stopwords}
print(stripes)
The output is:
{'example': ['is', 'an', 'example', 'sentence'], 'sentence': ['an', 'example', 'sentence']}

words = ["This", "is", "a", "longer", "example", "sentence"]
stopwords = set(["it", "the", "was", "of", "is", "a"])
context_size = 2
stripes = []
for index, word in enumerate(words):
    # Skip stopwords as keys
    if word.lower() in stopwords:
        continue
    # Clamp the window to the bounds of the list
    i = max(index - context_size, 0)
    j = min(index + context_size, len(words) - 1) + 1
    # Words before and after the term, excluding the term itself
    context = words[i:index] + words[index + 1:j]
    stripes.append((word, context))
print(stripes)
I would recommend using a list of tuples rather than a dict: if a word occurs more than once in words, a dict keeps only the last occurrence, which overwrites the earlier ones. I would also put the stopwords in a set, especially for a larger list such as NLTK's stopwords, since membership tests on a set are faster.
I also excluded the word itself from the context, but depending on how you want to use it you might want to include it.
This results in:
[('This', ['is', 'a']), ('longer', ['is', 'a', 'example', 'sentence']), ('example', ['a', 'longer', 'sentence']), ('sentence', ['longer', 'example'])]
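Note that the contexts above still contain stopwords such as 'is' and 'a'. If the goal is to keep them out of the values as well, one option is to filter each window after slicing it. This is only a sketch: the function name build_stripes is made up for illustration, and it assumes the window is measured on the original positions, so a context can end up shorter than four words.

```python
def build_stripes(words, stopwords, context_size=2):
    """Return (word, context) pairs with stopwords removed from both."""
    stopwords = {s.lower() for s in stopwords}
    stripes = []
    for index, word in enumerate(words):
        if word.lower() in stopwords:
            continue
        start = max(index - context_size, 0)
        end = index + context_size + 1  # slicing clamps past the end of the list
        window = words[start:index] + words[index + 1:end]
        # Drop stopwords from the context as well (assumed requirement)
        context = [w for w in window if w.lower() not in stopwords]
        stripes.append((word, context))
    return stripes


words = ["This", "is", "a", "longer", "example", "sentence"]
stripes = build_stripes(words, {"it", "the", "was", "of", "is", "a"})
print(stripes)
```

If the window should instead span two non-stopwords on either side, filter the whole word list first and take the window from the filtered list.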

Related

How to iterate and apply text pre processing on sublists

I have a corpus of 20k rows of twitter data, which I have already lower cased and tokenised using tweet tokenizer.
For example:
X = [
["i","love","to","play","games","","."],
["my","favourite","colour","is","purple","!"],
["#ladygaga","we","love","you","#stan","'someurl"]
]
tweet_tokens = []
for tweet in tweets:
    tweet = tweet.lower()
    tweet_tokens.append(tweet)
This is how I lowercased my tokens.
How can I iterate through the sublists to remove stopwords, punctuation, blank strings and URLs, but keep the content of hashtags?
This is what I tried, but it's not giving me the right results (only showing stopword removal as an example):
filtered_sentence = []
filtered_word = []
for sent in X:
    for word in sent:
        if word not in stopwords:
            filtered_word.append(word)
    filtered_sentence.append(word)
What would be the correct way to iterate through each sublist and process it without disrupting the list structure?
Ideally the output should look like this:
Cleaned_X = [
["love","play","games"],
["favourite","colour","purple"],
["ladygaga","love","#stan"]
]
import validators
punctuation_list = ['(',')',';',':','[',']',',', '!' ,'?','.', "", '']
dirty = [
["i","love","to","play","games","","."],
["my","favourite","colour","is","purple","!"],
["#ladygaga","we","love","you","#stan","https://test.de"]
]
def clean_list_list(tweets):
    # Rebuild each tweet, keeping only the tokens that pass the check
    return [[elem for elem in tweet if elem_check(elem)]
            for tweet in tweets]
def elem_check(elem):
    # Keep tokens that are neither punctuation/empty nor a valid URL
    return elem not in punctuation_list and not validators.url(elem)
clean_list_list(dirty)
I have tested this; it should be very close to the solution you are looking for.
Output:
[['i', 'love', 'to', 'play', 'games'],
['my', 'favourite', 'colour', 'is', 'purple'],
['#ladygaga', 'we', 'love', 'you', '#stan']]
You can write your own validate function if you want to or use validators:
pip install validators
This works.
from nltk.corpus import stopwords
x = [
["i","love", "to","play","games","","."],
["my","favourite", "colour","is","purple","!"],
["#ladygaga","we", "love","you","#stan","'someurl"]
]
punctuations = ['(',')',';',':','[',']',',', '!' ,'?','.', "", '']
stop_words = stopwords.words('english')
# Note: this collects the words from all sublists into one flat list
clean_list = []
for sub_list in x:
    for word in sub_list:
        if word not in stop_words and word not in punctuations:
            clean_list.append(word)
print(clean_list)
Output:
['love', 'play', 'games', 'favourite', 'colour', 'purple', '#ladygaga', 'love', '#stan', "'someurl"]
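The result above is one flat list, whereas the question asks to keep one sublist per tweet. A nested comprehension can do that. The following is a rough sketch using only the standard library: the small STOP_WORDS set and the helper names looks_like_url, keep and clean_tweets are placeholders invented for this example, and the URL test is a simple heuristic rather than full validation.

```python
import string
from urllib.parse import urlparse

# Assumption: a tiny stand-in stopword set; swap in a fuller list as needed.
STOP_WORDS = {"i", "to", "my", "is", "we", "you"}


def looks_like_url(token):
    """Heuristic: the token parses with an http(s) scheme and a host."""
    parsed = urlparse(token)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def keep(token):
    """Return True for tokens that should survive the cleaning step."""
    if not token.strip():
        return False  # empty or whitespace-only
    if all(ch in string.punctuation for ch in token):
        return False  # pure punctuation such as "." or "!"
    if token in STOP_WORDS:
        return False
    return not looks_like_url(token)


def clean_tweets(tweets):
    # One output sublist per input tweet, so the structure is preserved
    return [[token for token in tweet if keep(token)] for tweet in tweets]


dirty = [
    ["i", "love", "to", "play", "games", "", "."],
    ["my", "favourite", "colour", "is", "purple", "!"],
    ["#ladygaga", "we", "love", "you", "#stan", "https://test.de"],
]
cleaned = clean_tweets(dirty)
print(cleaned)
```

Hashtags are kept because a token like "#stan" is not made up of punctuation only.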

Exclude words from a list that contain one or more characters from another list python

I have a list of words as input. Some of the words contain characters that are not ASCII letters, and I need to filter out any word that contains such a character.
So if the input is:
words = ['Hello', 'my','dear', 'de7ar', 'Fri?ends', 'Friends']
I need the Output:
['Hello', 'my', 'dear', 'Friends']
words = ['Hello', 'my','dear', 'de7ar', 'Fri?ends', 'Friends']
al = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
ascii_letters = [char for char in al]
filtered_words=[]
I tried it with this:
for el in words:
    try:
        words in ascii_letters
    except FALSE:
        filtered_words.append(el)
and this:
filtered_words = [ele for ele in words if all(ch not in ele for ch in ascii_letters)]
Neither gives the result I need. I understand why, but since I have only been learning Python for a week I can't work out how to adjust them. Maybe someone knows how to handle this (without using any libraries)?
Thanks
You could check whether your alphabet is a superset of the words:
>>> [*filter(set(al).issuperset, words)]
['Hello', 'my', 'dear', 'Friends']
By the way, it is better not to hardcode the alphabet (I've seen quite a few people do that and forget letters); import it instead:
from string import ascii_letters as al
You need to iterate through the words in the list and check whether all letters of each word are ASCII letters, either with an explicit loop or with the all() function:
words = ['Hello', 'my','dear', 'de7ar', 'Fri?ends', 'Friends']
al = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
ascii_letters = [char for char in al]
out = []
for word in words:
    not_in_ascii = False
    for letter in word:
        if letter not in ascii_letters:
            not_in_ascii = True
    if not_in_ascii:
        continue
    out.append(word)
It is also possible with list comprehension and all() as you tried:
out = [word for word in words if all([letter in ascii_letters for letter in word])]
[i for i in words if i.isalpha()]
Result:
['Hello', 'my', 'dear', 'Friends']
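One caveat with isalpha() on its own: it also returns True for non-ASCII letters such as 'é', so it only matches the hardcoded alphabet when the input is known to be ASCII. Combining it with str.isascii() (built in, Python 3.7+) covers that case. The word 'café' is added here purely to illustrate the difference.

```python
words = ['Hello', 'my', 'dear', 'de7ar', 'Fri?ends', 'Friends', 'café']

# isalpha() alone accepts any Unicode letter, so 'café' passes
isalpha_only = [w for w in words if w.isalpha()]

# Requiring isascii() as well restricts the match to a-z and A-Z
only_ascii_letters = [w for w in words if w.isascii() and w.isalpha()]

print(isalpha_only)
print(only_ascii_letters)
```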

Alter the letter in list of strings

An example:
word_list = ["a", "is", "bus", "on", "the"]
alter_the_list("A bus station is where a bus stops A train station is where a train stops On my desk I have a work station", word_list)
print("1.", word_list)
word_list = ["a", 'up', "you", "it", "on", "the", 'is']
alter_the_list("It is up to YOU", word_list)
print("2.", word_list)
word_list = ["easy", "come", "go"]
alter_the_list("Easy come easy go go go", word_list)
print("3.", word_list)
word_list = ["a", "is", "i", "on"]
alter_the_list("", word_list)
print("4.", word_list)
word_list = ["a", "is", "i", "on", "the"]
alter_the_list("May your coffee be strong and your Monday be short", word_list)
print("5.", word_list)
def alter_the_list(text, word_list):
    return [text for text in word_list if text in word_list]
I'm trying to remove from the list of words every word that appears as a separate word in the string of text. The string of text should be converted to lower case before checking; the elements of the list of words are all in lower case. There is no punctuation in the string of text, and each word in the list is unique. I don't know how to fix my function.
output:
1. ['a', 'is', 'bus', 'on', 'the']
2. ['a', 'up', 'you', 'it', 'on', 'the', 'is']
3. ['easy', 'come', 'go']
4. ['a', 'is', 'i', 'on']
5. ['a', 'is', 'i', 'on', 'the']
expected:
1. ['the']
2. ['a', 'on', 'the']
3. []
4. ['a', 'is', 'i', 'on']
5. ['a', 'is', 'i', 'on', 'the']
I've done it like this:
def alter_the_list(text, word_list):
    for word in text.lower().split():
        if word in word_list:
            word_list.remove(word)
text.lower().split() returns a list of all space-separated tokens in text.
The key is that you're required to alter word_list. It is not enough to return a new list; you have to use Python 3's list methods to modify the list in-place.
If the order of the resulting list does not matter you can use sets:
def alter_the_list(text, word_list):
    word_list[:] = set(word_list).difference(text.lower().split())
This function will update word_list in place due to the assignment to the list slice with word_list[:] = ...
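If the order of the remaining words does matter, the same slice assignment can be combined with a list comprehension instead of a set difference. A minimal sketch of that variant, keeping the alter_the_list signature from the question:

```python
def alter_the_list(text, word_list):
    """Remove from word_list, in place, every word that occurs in text."""
    seen = set(text.lower().split())
    # Slice assignment replaces the contents of the same list object,
    # so the caller's list is modified and its original order is kept.
    word_list[:] = [word for word in word_list if word not in seen]


word_list = ["a", "up", "you", "it", "on", "the", "is"]
alter_the_list("It is up to YOU", word_list)
print("2.", word_list)
```

Building the set once also avoids rescanning the text for every word in the list.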
1. Your main problem is that you return a value from your function, but then ignore it. You have to save it in some way to print it out, such as:
word_list = ["easy", "come", "go"]
word_out = alter_the_list("Easy come easy go go go", word_list)
print("3.", word_out)
What you printed is the original word list, not the function result.
2. You ignore the text parameter to the function, because you reuse the variable name as the loop variable in your list comprehension. Use a different variable name, such as:
return [word for word in word_list if word in word_list]
3. You still have to involve text in the logic of the list you build. Remember that you're looking for words that are not in the given text.
Most of all, learn basic debugging.
See this lovely debug blog for help.
If nothing else, learn to use simple print statements to display the values of your variables, and to trace program execution.
Does that get you moving toward a solution?
I like @Simon's answer better, but if you want to do it in two list comprehensions:
def alter_the_list(text, word_list):
    # Pull out all words found in the word list (text is lower-cased first)
    c = [w for w in word_list for t in text.lower().split() if t == w]
    # Find the difference of the two lists
    return [w for w in word_list if w not in c]

Check whether there is any synonyms between two word sets

I need to find the synonyms for a given word in a sentence. For example:
list1 = ['happy']
list2 = ['It', 'is', 'so', 'funny']
Here I need to find whether list2 contains any synonyms of the word 'happy' and print them. I am new to Python. I tried it with the following code.
from nltk.corpus import wordnet
list1 = ['happy']
list2 = ['It', 'is', 'so', 'funny']
list = []
for word1 in list1:
    for word2 in list2:
        wordFromList1 = wordnet.synsets(word1)
        wordFromList2 = wordnet.synsets(word2)
        if wordFromList1 and wordFromList2:
            s = wordFromList1[0].word1.lemmas(wordFromList2[0])
            s = wordFromList1[0].word2.lemmas(wordFromList2[0])
            list.append(s)
print((list))
But it does not work. Please help me.
When you use wordnet.synsets("happy") it returns synset entries (such as Synset('happy.a.01')) that contain part-of-speech information and an ID. You need to call lemma_names() on each synset to get the actual word forms. Try this:
from nltk.corpus import wordnet
def get_word_synonyms_from_sent(word, sent):
    word_synonyms = []
    for synset in wordnet.synsets(word):
        for lemma in synset.lemma_names():
            if lemma in sent and lemma != word:
                word_synonyms.append(lemma)
    return word_synonyms
word = "happy"
sent = ['I', 'am', 'glad', 'it', 'was', 'felicitous', '.']
word_synonyms = get_word_synonyms_from_sent(word, sent)
print("WORD:", word)
print("SENTENCE:", sent)
print("SYNONYMS FOR '" + word.upper() + "' FOUND IN THE SENTENCE: " + ", ".join(word_synonyms))
# OUTPUT
# >>> WORD: happy
# >>> SENTENCE: ['I', 'am', 'glad', 'it', 'was', 'felicitous', '.']
# >>> SYNONYMS FOR 'HAPPY' FOUND IN THE SENTENCE: felicitous, glad

Dictionary and position list back to sentence

I've managed to get my program to store a sentence or two into a dictionary and at the same time create a word position list.
What I need to do now is recreate the original sentence just from the dictionary and the position list. I've done lots of searches, but the results I'm getting are either not what I need or are too confusing and beyond me.
Any help would be much appreciated, thanks.
Here is my code so far:
sentence = ("This Sentence is a very, very good sentence. Did you like my very good sentence?")
print('This is the sentence:', sentence)
punctuation = ['(', ')', '?', ':', ';', ',', '.', '!', '/', '"', "'"]
for punct in punctuation:
    sentence = sentence.replace(punct, " %s" % punct)
print('This is the sentence with spaces before the punctuations:', sentence)
words_list = sentence.split()
print('A list of the words in the sentence:', words_list)
dictionary = {}
word_pos_list = []
counter = 0
for word in words_list:
    if word not in dictionary:
        counter += 1
        dictionary[word] = counter
    word_pos_list.append(dictionary[word])
print('The positions of the words in the sentence are:', word_pos_list)
John
As mentioned in the comments, dictionaries are not sorted data structures. If you are breaking a sentence up, indexing it into a dictionary and then trying to put it back together, you can try an OrderedDict from the collections module.
That said, this is without any further background on how you are splitting your sentence (punctuation etc.). I suggest looking into NLTK if you are doing any sort of natural language processing (NLP).
from collections import OrderedDict
In [182]: def index_sentence(s):
   .....:     return {s.split(' ').index(i): i for i in s.split(' ')}
   .....:
In [183]: def build_sentence_from_dict(d):
   .....:     return ' '.join(OrderedDict(d).values())
   .....:
In [184]: s
Out[184]: 'See spot jump over the brown fox.'
In [185]: id = index_sentence(s)
In [186]: id
Out[186]: {0: 'See', 1: 'spot', 2: 'jump', 3: 'over', 4: 'the', 5: 'brown', 6: 'fox.'}
In [187]: build_sentence_from_dict(id)
Out[187]: 'See spot jump over the brown fox.'
To reconstruct from your list you have to reverse the location mapping:
# reconstruct
reversed_dictionary = {x:y for y, x in dictionary.items()}
print(' '.join(reversed_dictionary[x] for x in word_pos_list))
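To see the whole round trip in one place, here is a small self-contained sketch. The helper names encode and decode are made up for illustration, and the token list is a shortened version of the sentence from the question, already split so that punctuation marks are separate tokens.

```python
def encode(tokens):
    """Map each distinct token to an ID and record the ID at every position."""
    word_ids = {}
    positions = []
    for token in tokens:
        if token not in word_ids:
            word_ids[token] = len(word_ids) + 1  # IDs start at 1, as in the question
        positions.append(word_ids[token])
    return word_ids, positions


def decode(word_ids, positions):
    """Invert the word -> ID mapping and look up every recorded position."""
    id_to_word = {number: word for word, number in word_ids.items()}
    return [id_to_word[number] for number in positions]


tokens = "This Sentence is a very , very good sentence .".split()
word_ids, positions = encode(tokens)
restored = decode(word_ids, positions)
print(positions)
print(" ".join(restored))
```

The inversion works because each ID belongs to exactly one word; the position list is what carries the order and the repeats, so the ordering of the dictionary itself does not matter.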
This can be done more nicely using a defaultdict (a dictionary with a predefined default value, in your case a list of locations for the word):
#!/usr/bin/env python3.4
from collections import defaultdict
# preprocessing
sentence = ("This Sentence is a very, very good sentence. Did you like my very good sentence?")
# a plain string, so the loop visits one punctuation character at a time
punctuation = '()?:;,.!/"\''
for punct in punctuation:
    sentence = sentence.replace(punct, " %s" % punct)
# using defaultdict this time
word_to_locations = defaultdict(list)
for position, word in enumerate(sentence.split()):
    word_to_locations[word].append(position)
# word -> list of locations
print(word_to_locations)
# location -> word
location_to_word = dict((y, x) for x in word_to_locations for y in word_to_locations[x])
print(location_to_word)
# reconstruct
print(' '.join(location_to_word[x] for x in range(len(location_to_word))))
It's not the randomness of dictionary keys that's the problem here, it's the failure to record every position at which a word was seen, duplicate or not. The following does that and then unwinds the dictionary to produce the original sentence, sans punctuation:
from collections import defaultdict
sentence = ("This Sentence is a very, very good sentence. Did you like my very good sentence?")
print ('This is the sentence:', sentence)
punctuation = set('()?:;\\,.!/"\'')
sentence = ''.join(character for character in sentence if character not in punctuation)
print ('This is the sentence with no punctuation:', sentence)
words = sentence.split()
print('A list of the words in the sentence:', words)
dictionary = defaultdict(list)
last_word_position = 0
for word in words:
    last_word_position += 1
    dictionary[word].append(last_word_position)
print('A list of unique words in the sentence and their positions:', dictionary.items())
# Now the tricky bit to unwind our random dictionary:
sentence = []
for position in range(1, last_word_position + 1):
    sentence.extend([word for word, positions in dictionary.items() if position in positions])
print(*sentence)
The output of the various print() statements:
This is the sentence: This Sentence is a very, very good sentence. Did you like my very good sentence?
This is the sentence with no punctuation: This Sentence is a very very good sentence Did you like my very good sentence
A list of the words in the sentence: ['This', 'Sentence', 'is', 'a', 'very', 'very', 'good', 'sentence', 'Did', 'you', 'like', 'my', 'very', 'good', 'sentence']
A list of unique words in the sentence and their positions: dict_items([('Sentence', [2]), ('is', [3]), ('a', [4]), ('very', [5, 6, 13]), ('This', [1]), ('my', [12]), ('Did', [9]), ('good', [7, 14]), ('you', [10]), ('sentence', [8, 15]), ('like', [11])])
This Sentence is a very very good sentence Did you like my very good sentence
