I have a corpus of 20k rows of Twitter data, which I have already lowercased and tokenised using TweetTokenizer.
For example:
X = [
    ["i", "love", "to", "play", "games", "", "."],
    ["my", "favourite", "colour", "is", "purple", "!"],
    ["#ladygaga", "we", "love", "you", "#stan", "'someurl"]
]
tweet_tokens = []
for tweet in tweets:
    tweet = tweet.lower()
    tweet_tokens.append(tweet)
This is how I lowercased my tokens.
How can I iterate through the sublists to remove stopwords, punctuation, blank strings, and URLs from each list, while keeping the content of hashtags?
This is what I thought of and tried, but it's not giving me the right results (only showing stop words as an example):
filtered_sentence = []
filtered_word = []
for sent in X:
    for word in sent:
        if word not in stopwords:
            filtered_word.append(word)
            filtered_sentence.append(word)
What would be the correct way to iterate through each sublist and process it without disrupting the list structure?
Ideally the output should look like this
Cleaned_X = [
    ["love", "play", "games"],
    ["favourite", "colour", "purple"],
    ["ladygaga", "love", "#stan"]
]
import validators
punctuation_list = ['(', ')', ';', ':', '[', ']', ',', '!', '?', '.', '']
dirty = [
    ["i", "love", "to", "play", "games", "", "."],
    ["my", "favourite", "colour", "is", "purple", "!"],
    ["#ladygaga", "we", "love", "you", "#stan", "https://test.de"]
]
def clean_list_list(tweets):
    return [[elem for elem in tweet if tweet_check(elem)]
            for tweet in tweets]

def tweet_check(elem):
    return elem not in punctuation_list and not validators.url(elem)

clean_list_list(dirty)
I have tested this; it should be very close to the solution you are looking for.
output
[['i', 'love', 'to', 'play', 'games'],
['my', 'favourite', 'colour', 'is', 'purple'],
['#ladygaga', 'we', 'love', 'you', '#stan']]
You can write your own validate function if you want to or use validators:
pip install validators
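If you'd rather avoid the extra dependency, a simple regex can stand in for validators.url. This is a minimal sketch, assuming every URL in your corpus carries an explicit scheme; the URL_RE pattern is my own, not part of the validators package:

```python
import re

# Hypothetical stand-in for validators.url: accept tokens that start
# with an explicit scheme (assumption, not the validators API).
URL_RE = re.compile(r"^(https?|ftp)://\S+$", re.IGNORECASE)

punctuation_list = ['(', ')', ';', ':', '[', ']', ',', '!', '?', '.', '']

def tweet_check(elem):
    # Drop punctuation, empty strings, and URL-shaped tokens
    return elem not in punctuation_list and not URL_RE.match(elem)

def clean_list_list(tweets):
    return [[elem for elem in tweet if tweet_check(elem)]
            for tweet in tweets]

dirty = [
    ["i", "love", "to", "play", "games", "", "."],
    ["my", "favourite", "colour", "is", "purple", "!"],
    ["#ladygaga", "we", "love", "you", "#stan", "https://test.de"]
]
print(clean_list_list(dirty))
```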
This works.
from nltk.corpus import stopwords

x = [
    ["i", "love", "to", "play", "games", "", "."],
    ["my", "favourite", "colour", "is", "purple", "!"],
    ["#ladygaga", "we", "love", "you", "#stan", "'someurl"]
]
punctuations = ['(', ')', ';', ':', '[', ']', ',', '!', '?', '.', '']
stop_words = stopwords.words('english')
clean_list = []
for sub_list in x:
    for word in sub_list:
        if word not in stop_words and word not in punctuations:
            clean_list.append(word)
print(clean_list)
Output:
['love', 'play', 'games', 'favourite', 'colour', 'purple', '#ladygaga', 'love', '#stan', "'someurl"]
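Note that this collects everything into one flat list (and keeps "'someurl", since there is no URL check). If you want to preserve the per-tweet sublists, a nested list comprehension does the same filtering. A sketch, with a small hand-picked stop-word set standing in for stopwords.words('english') so it runs without NLTK:

```python
# Small stand-in stop-word set (assumption: swap in
# stopwords.words('english') in real code).
stop_words = {"i", "my", "to", "is", "we", "you"}
punctuations = {'(', ')', ';', ':', '[', ']', ',', '!', '?', '.', ''}

x = [
    ["i", "love", "to", "play", "games", "", "."],
    ["my", "favourite", "colour", "is", "purple", "!"],
    ["#ladygaga", "we", "love", "you", "#stan"]
]

# One inner comprehension per tweet keeps the nesting intact
clean_nested = [
    [word for word in sub_list
     if word not in stop_words and word not in punctuations]
    for sub_list in x
]
print(clean_nested)
# → [['love', 'play', 'games'], ['favourite', 'colour', 'purple'], ['#ladygaga', 'love', '#stan']]
```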
Related
I am trying to create a dictionary of words in a text and their context. The context should be the list of words that occur within a 5 word window (two words on either side) of the term's position in the string. Effectively, I want to ignore the stopwords in my output vectors.
My code is below. I can get the stopwords out of my dictionary's keys but not the values.
from itertools import count

words = ["This", "is", "an", "example", "sentence"]
stopwords = ["it", "the", "was", "of"]
context_size = 2
stripes = {word: words[max(i - context_size, 0):j]
           for word, i, j in zip(words, count(0), count(context_size + 1))
           if word.lower() not in stopwords}
print(stripes)
the output is:
{'example': ['is', 'an', 'example', 'sentence'], 'sentence': ['an', 'example', 'sentence']}
words = ["This", "is", "a", "longer", "example", "sentence"]
stopwords = set(["it", "the", "was", "of", "is", "a"])
context_size = 2
stripes = []
for index, word in enumerate(words):
    if word.lower() in stopwords:
        continue
    i = max(index - context_size, 0)
    j = min(index + context_size, len(words) - 1) + 1
    context = words[i:index] + words[index + 1:j]
    stripes.append((word, context))
print(stripes)
I would recommend using a list of tuples: if a word occurs more than once in words, a dict would only keep the last occurrence, overwriting the previous ones. I would also put the stopwords in a set, especially if it's a larger list like NLTK's stopwords, since that speeds up membership checks.
I also excluded the word itself from its context, but depending on how you want to use it, you might want to include it.
This results in:
[('This', ['is', 'a']), ('longer', ['is', 'a', 'example', 'sentence']), ('example', ['a', 'longer', 'sentence']), ('sentence', ['longer', 'example'])]
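If you do want a dict-like structure that survives repeated words, collections.defaultdict can group the contexts per word. A sketch of that variant, with a repeated "example" to show the accumulation:

```python
from collections import defaultdict

words = ["This", "is", "a", "longer", "example", "sentence", "example"]
stopwords = {"it", "the", "was", "of", "is", "a"}
context_size = 2

# defaultdict(list) lets repeated words accumulate a list of contexts
# instead of overwriting each other as plain dict keys would
stripes = defaultdict(list)
for index, word in enumerate(words):
    if word.lower() in stopwords:
        continue
    i = max(index - context_size, 0)
    j = min(index + context_size, len(words) - 1) + 1
    stripes[word].append(words[i:index] + words[index + 1:j])

print(dict(stripes))
```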
Python noob here, so sorry for the simple question, but I can't find the exact solution for my situation.
I've got a Python list, and I want to remove the stop words from it. My code isn't removing a stopword if it's paired with another token.
from nltk.corpus import stopwords
rawData = ['for', 'the', 'game', 'the movie']
text = [each_string.lower() for each_string in rawData]
newText = [word for word in text if word not in stopwords.words('english')]
print(newText)
current output:
['game', 'the movie']
desired output
['game', 'movie']
I'd prefer to use list comprehension for this.
It took me a while to do this because list comprehensions are not my thing. Anyway, this is how I did it:
import functools
stopwords = ["for", "the"]
rawData = ['for', 'the', 'game', 'the movie']
lst = functools.reduce(lambda x,y: x+y, [i.split() for i in rawData])
newText = [word for word in lst if word not in stopwords]
print(newText)
Basically, the functools.reduce line splits each element into words, which creates a nested list, and then flattens that nested list into one dimension.
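For what it's worth, itertools.chain.from_iterable does the same flattening without the repeated list concatenation that reduce performs. An alternative sketch:

```python
import itertools

stopwords = ["for", "the"]
rawData = ['for', 'the', 'game', 'the movie']

# chain.from_iterable flattens the per-element splits lazily,
# avoiding quadratic-cost list concatenation
lst = itertools.chain.from_iterable(s.split() for s in rawData)
newText = [word for word in lst if word not in stopwords]
print(newText)
# → ['game', 'movie']
```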
I am trying to remove stop words from the list of tokens I have, but it seems like the words are not removed. What would be the problem? Thanks.
Tried:
from stop_words import get_stop_words
from nltk.corpus import stopwords

Trans = []
with open('data.txt', 'r') as myfile:
    file = myfile.read()
    # start reading again from the beginning of the file
    myfile.seek(0)
    for row in myfile:
        split = row.split()
        Trans.append(split)
    myfile.close()
stop_words = list(get_stop_words('en'))
nltk_words = list(stopwords.words('english'))
stop_words.extend(nltk_words)
output = [w for w in Trans if not w in stop_words]
Input:
[['Apparent', 'magnitude', 'is', 'a', 'measure', 'of', 'the',
  'brightness', 'of', 'a', 'star', 'or', 'other']]
output:
It returns the same words as input.
I think Trans.append(split) should be Trans.extend(split), because split() returns a list.
For more readability, create a function. For example:
import string
from nltk.corpus import stopwords

def drop_stopwords(row):
    stop_words = set(stopwords.words('english'))
    return [word for word in row if word not in stop_words and word not in list(string.punctuation)]
Note that with open() does not need a close().
Then create a collection of strings (sentences) and apply the function (map and apply here assume Trans is a pandas Series):
Trans = Trans.map(str).apply(drop_stopwords)
This will be applied to each sentence...
You can add other functions for lemmatization, etc. Here is a very clear example (code):
https://github.com/SamLevinSE/job_recommender_with_NLP/blob/master/job_recommender_data_mining_JOBS.ipynb
As the input contains a list of lists, you need to traverse the outer list and then the inner list's elements; then you can get the correct output using:
output = [j for w in Trans for j in w if j not in stop_words]
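For example, with a small hand-picked stop-word set standing in for the combined stop-words and NLTK lists, so the snippet runs without either package:

```python
# Stand-in stop-word set (assumption: replace with the combined
# get_stop_words('en') + stopwords.words('english') lists in real code).
stop_words = {"is", "a", "of", "the", "or"}

Trans = [["Apparent", "magnitude", "is", "a", "measure", "of", "the",
          "brightness", "of", "a", "star", "or", "other"]]

# Outer loop walks the sublists, inner loop walks their tokens
output = [j for w in Trans for j in w if j not in stop_words]
print(output)
# → ['Apparent', 'magnitude', 'measure', 'brightness', 'star', 'other']
```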
An example:
word_list = ["a", "is", "bus", "on", "the"]
alter_the_list("A bus station is where a bus stops A train station is where a train stops On my desk I have a work station", word_list)
print("1.", word_list)
word_list = ["a", 'up', "you", "it", "on", "the", 'is']
alter_the_list("It is up to YOU", word_list)
print("2.", word_list)
word_list = ["easy", "come", "go"]
alter_the_list("Easy come easy go go go", word_list)
print("3.", word_list)
word_list = ["a", "is", "i", "on"]
alter_the_list("", word_list)
print("4.", word_list)
word_list = ["a", "is", "i", "on", "the"]
alter_the_list("May your coffee be strong and your Monday be short", word_list)
print("5.", word_list)
def alter_the_list(text, word_list):
    return [text for text in word_list if text in word_list]
I'm trying to remove from the list of words any word that appears as a separate word in the string of text. The string of text should be converted to lower case before checking; the elements of the list of words are all in lower case. There is no punctuation in the string of text, and each word in the word-list parameter is unique. I don't know how to fix it.
output:
1. ['a', 'is', 'bus', 'on', 'the']
2. ['a', 'up', 'you', 'it', 'on', 'the', 'is']
3. ['easy', 'come', 'go']
4. ['a', 'is', 'i', 'on']
5. ['a', 'is', 'i', 'on', 'the']
expected:
1. ['the']
2. ['a', 'on', 'the']
3. []
4. ['a', 'is', 'i', 'on']
5. ['a', 'is', 'i', 'on', 'the']
I've done it like this:
def alter_the_list(text, word_list):
    for word in text.lower().split():
        if word in word_list:
            word_list.remove(word)
text.lower().split() returns a list of all space-separated tokens in text.
The key is that you're required to alter word_list. It is not enough to return a new list; you have to use Python 3's list methods to modify the list in-place.
If the order of the resulting list does not matter you can use sets:
def alter_the_list(text, word_list):
    word_list[:] = set(word_list).difference(text.lower().split())
This function will update word_list in place due to the assignment to the list slice with word_list[:] = ...
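A quick demonstration of why the slice assignment matters: it replaces the contents of the one list object, so the caller's variable (and any alias of it) sees the result, as in expected output 3 above:

```python
def alter_the_list(text, word_list):
    # Slice assignment mutates the existing list object in place,
    # so every reference to that object sees the change
    word_list[:] = set(word_list).difference(text.lower().split())

words = ["easy", "come", "go"]
alias = words                       # second reference to the same object
alter_the_list("Easy come easy go go go", words)
print(words, alias is words)
# → [] True
```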
1
Your main problem is that you return a value from your function, but then ignore it. You have to save it in some way to print out, such as:
word_list = ["easy", "come", "go"]
word_out = alter_the_list("Easy come easy go go go", word_list)
print("3.", word_out)
What you printed is the original word list, not the function result.
2
You ignore the text parameter to the function, and you reuse its name as the loop variable in your list comprehension. Use a different variable name, such as:
return [word for word in word_list if word in word_list]
3
You still have to involve text in the logic of the list you build. Remember that you're looking for words that are not in the given text.
Most of all, learn basic debugging.
See this lovely debug blog for help.
If nothing else, learn to use simple print statements to display the values of your variables, and to trace program execution.
Does that get you moving toward a solution?
I like @Simon's answer better, but if you want to do it with two list comprehensions:
def alter_the_list(text, word_list):
    # Pull out all words found in the word list
    c = [w for w in word_list for t in text.split() if t == w]
    # Find the difference of the two lists
    return [w for w in word_list if w not in c]
I need to find the synonyms for a given word in a sentence. For example:
list1 = ['happy']
list2 = ['It', 'is', 'so', 'funny']
Here I need to find whether there are any synonyms of the word 'happy' in list2 and print them. I'm new to Python. I tried the following code.
from nltk.corpus import wordnet
list1 = ['happy']
list2 = ['It', 'is', 'so', 'funny']
list = []
for word1 in list1:
    for word2 in list2:
        wordFromList1 = wordnet.synsets(word1)
        wordFromList2 = wordnet.synsets(word2)
        if wordFromList1 and wordFromList2:
            s = wordFromList1[0].word1.lemmas(wordFromList2[0])
            s = wordFromList1[0].word2.lemmas(wordFromList2[0])
            list.append(s)
print(list)
But it does not work. Please help me.
When you use wordnet.synsets("happy"), it returns synset entries (such as Synset('happy.a.01')) that contain part-of-speech info and an ID. You need to call lemma_names() on each synset to get the actual word forms. Try this:
from nltk.corpus import wordnet
def get_word_synonyms_from_sent(word, sent):
    word_synonyms = []
    for synset in wordnet.synsets(word):
        for lemma in synset.lemma_names():
            if lemma in sent and lemma != word:
                word_synonyms.append(lemma)
    return word_synonyms

word = "happy"
sent = ['I', 'am', 'glad', 'it', 'was', 'felicitous', '.']
word_synonyms = get_word_synonyms_from_sent(word, sent)
print("WORD:", word)
print("SENTENCE:", sent)
print("SYNONYMS FOR '" + word.upper() + "' FOUND IN THE SENTENCE: " + ", ".join(word_synonyms))
# OUTPUT
# >>> WORD: happy
# >>> SENTENCE: ['I', 'am', 'glad', 'it', 'was', 'felicitous', '.']
# >>> SYNONYMS FOR 'HAPPY' FOUND IN THE SENTENCE: felicitous, glad