I am trying to remove stop words from the list of tokens I have, but it seems like the words are not removed. What could be the problem? Thanks.
Tried:
Trans = []
with open('data.txt', 'r') as myfile:
    file = myfile.read()
    # start reading from the start of the file again
    myfile.seek(0)
    for row in myfile:
        split = row.split()
        Trans.append(split)
myfile.close()
stop_words = list(get_stop_words('en'))
nltk_words = list(stopwords.words('english'))
stop_words.extend(nltk_words)
output = [w for w in Trans if w not in stop_words]
Input:
[['Apparent',
'magnitude',
'is',
'a',
'measure',
'of',
'the',
'brightness',
'of',
'a',
'star',
'or',
'other']]
output:
It returns the same words as input.
I think Trans.append(split) should be Trans.extend(split) because split returns a list.
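A quick demonstration of the difference (the sample row is made up for illustration): append nests the list returned by split(), while extend flattens it into the target list.

```python
row = "Apparent magnitude is"

nested = []
nested.append(row.split())   # appends the whole list as one element
print(nested)                # [['Apparent', 'magnitude', 'is']]

flat = []
flat.extend(row.split())     # adds each word individually
print(flat)                  # ['Apparent', 'magnitude', 'is']
```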
For more readability, create a function, e.g.:
import string
from nltk.corpus import stopwords

def drop_stopwords(row):
    stop_words = set(stopwords.words('english'))
    punctuation = set(string.punctuation)
    return [word for word in row if word not in stop_words and word not in punctuation]
Note that a file opened with with open() does not need a close().
Then build a list of strings (sentences) and apply the function (assuming Trans is a pandas Series), e.g.:
Trans = Trans.map(str).apply(drop_stopwords)
This will be applied to each sentence.
You can add other functions for lemmatizing, etc. There is a very clear example (code) here:
https://github.com/SamLevinSE/job_recommender_with_NLP/blob/master/job_recommender_data_mining_JOBS.ipynb
Since the input is a list of lists, you need to traverse the outer list and then each inner list; after that you get the correct output using:
output = [j for w in Trans for j in w if j not in stop_words]
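Putting it together as a self-contained sketch — here a small inline stop-word set stands in for the NLTK and get_stop_words lists, so the example runs without extra downloads:

```python
# Small inline stand-in for the combined stop-word lists.
stop_words = {'is', 'a', 'of', 'the', 'or'}

Trans = [['Apparent', 'magnitude', 'is', 'a', 'measure', 'of',
          'the', 'brightness', 'of', 'a', 'star', 'or', 'other']]

# Traverse the outer list, then each inner token list, filtering as we go.
output = [j for w in Trans for j in w if j not in stop_words]
print(output)  # ['Apparent', 'magnitude', 'measure', 'brightness', 'star', 'other']
```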
Related
I have a txt file that contains 4 lines (like a poem).
What I want is to add all the words to one list.
For example, a poem like this:
I am done with you,
Don't love me anymore
I want it like this: ['I', 'am', 'done', 'with', 'you', 'dont', 'love', 'me', 'anymore']
But I cannot remove the line break at the end of the first sentence; it gives me 2 separate lists.
romeo = open(r'd:\romeo.txt')
list = []
for line in romeo:
    line = line.rstrip()
    line = line.split()
    list = list + [line]
print(list)
with open(r'd:\romeo.txt', 'r') as msg:
    data = msg.read().replace("\n", " ")
data = [x for x in data.split() if x.strip()]
Even shorter:
with open(r'd:\romeo.txt', 'r') as msg:
    list = " ".join(msg.read().split()).split(' ')
Or, with the comma removed as well:
with open(r'd:\romeo.txt', 'r') as msg:
    list = " ".join(msg.read().replace(',', ' ').split()).split(' ')
You can use a regular expression like this:
import re

poem = ''  # your poem
split = re.split(r'\040|\n', poem)
print(split)
The regular expression \040 matches a white space and \n matches a new line.
The output is:
['I', 'am', 'done', 'with', 'you,', "Don't", 'love', 'me', 'anymore']
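Note that this keeps the trailing comma in 'you,'. If you also want punctuation dropped, one possible alternative is re.findall with a word pattern (the embedded apostrophe in the character class keeps "Don't" intact):

```python
import re

poem = "I am done with you,\nDon't love me anymore"
# [\w']+ matches runs of word characters and apostrophes,
# so the comma and newline are skipped rather than kept.
words = re.findall(r"[\w']+", poem)
print(words)  # ['I', 'am', 'done', 'with', 'you', "Don't", 'love', 'me', 'anymore']
```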
I want to return a list of the words containing a given letter, disregarding its case.
Say if I have sentence = "Anyone who has never made a mistake has never tried anything new", then f(sentence, 'a') would return
['Anyone', 'has', 'made', 'a', 'mistake', 'has', 'anything']
This is what I have:
import re

def f(string, match):
    string_list = string.split()
    match_list = []
    for word in string_list:
        if match in word:
            match_list.append(word)
    return match_list
You don't need re. Use str.casefold:
[w for w in sentence.split() if "a" in w.casefold()]
Output:
['Anyone', 'has', 'made', 'a', 'mistake', 'has', 'anything']
You can use string splitting for it, if there is no punctuation.
match_list = [s for s in sentence.split(' ') if 'a' in s.lower()]
Here's another variation:
sentence = 'Anyone who has never made a mistake has never tried anything new'

def f(string, match):
    match_list = []
    for word in string.split():
        if match in word.lower():
            match_list.append(word)
    return match_list

print(f(sentence, 'a'))
I have a big text file like this (without the blank space between words, and with every word on its own line):
this
is
my
text
and
it
should
be
awesome
.
And I have also a list like this:
index_list = [[1,2,3,4,5],[6,7,8],[9,10]]
Now I want to replace every element of each list with the corresponding index line of my text file, so the expected answer would be:
new_list = [['this', 'is', 'my', 'text', 'and'], ['it', 'should', 'be'], ['awesome', '.']]
I tried a nasty workaround with two for loops and a range function that got way too complicated. Then I tried it with linecache.getline, but that also has some issues:
import linecache

new_list = []
for l in index_list:
    for j in l:
        new_list.append(linecache.getline('text_list', j))
This produces only one big list, which I don't want. Also, after every word I get a bad \n, which I don't get when I open the file with b = open('text_list', 'r').read().splitlines(), but I don't know how to implement this in my function, so I keep getting [['this\n', 'is\n', etc.
You are very close. Just use a temp list and then append that to the main list. You can also use str.strip to remove the newline character.
Ex:
import linecache

new_list = []
index_list = [[1,2,3,4,5],[6,7,8],[9,10]]
for l in index_list:
    temp = []  # temp list
    for j in l:
        temp.append(linecache.getline('text_list', j).strip())
    new_list.append(temp)  # append to main list
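A runnable version of the same idea — here a throwaway temp file stands in for your text_list file, so the example is self-contained:

```python
import linecache
import os
import tempfile

# Write a one-word-per-line file to disk, as in the question.
words = ['this', 'is', 'my', 'text', 'and', 'it', 'should', 'be', 'awesome', '.']
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('\n'.join(words) + '\n')
    path = f.name

index_list = [[1, 2, 3, 4, 5], [6, 7, 8], [9, 10]]
new_list = []
for l in index_list:
    temp = []  # collect one inner list per index group
    for j in l:
        # linecache line numbers are 1-based; strip() drops the trailing \n
        temp.append(linecache.getline(path, j).strip())
    new_list.append(temp)

os.unlink(path)  # clean up the temp file
print(new_list)
# [['this', 'is', 'my', 'text', 'and'], ['it', 'should', 'be'], ['awesome', '.']]
```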
You could use iter to do this, as long as your text_list has exactly sum(map(len, index_list)) elements:
text_list = ['this', 'is', 'my', 'text', 'and', 'it', 'should', 'be', 'awesome', '.']
index_list = [[1,2,3,4,5],[6,7,8],[9,10]]
text_list_iter = iter(text_list)
texts = [[next(text_list_iter) for _ in index] for index in index_list]
Output
[['this', 'is', 'my', 'text', 'and'], ['it', 'should', 'be'], ['awesome', '.']]
But I am not sure if this is what you wanted to do; I may be assuming some ordering of index_list. Another option is this list comprehension:
texts_ = [[text_list[i-1] for i in l] for l in index_list]
Output
[['this', 'is', 'my', 'text', 'and'], ['it', 'should', 'be'], ['awesome', '.']]
I am trying to extract unique words out of the following text into 1 list.
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
But I keep getting a list within a list for each line of the text. I understand I have some "\n" to get rid of, but I can't figure out how.
Here is my code:
fname = input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
    line = line.rstrip("\n")
    for word in line:
        word = line.lower().split()
        lst.append(word)
print(lst)
And the output I get:
[['but', 'soft', 'what', 'light', 'through', 'yonder', 'window', 'breaks'], ['it', 'is', 'the', 'east', 'and', 'juliet', 'is', 'the', 'sun'], ['arise', 'fair', 'sun', 'and', 'kill', 'the', 'envious', 'moon'], ['who', 'is', 'already', 'sick', 'and', 'pale', 'with', 'grief']]
Thanks!!
When you do line.lower().split() you get a list of words, and you're appending that whole list to your list, lst. Use extend instead of append; extend adds each element of the list returned by split(). Also, the second for loop, for word in line:, is unnecessary.
Additionally, if you want to extract unique words, you might want to look into the set datatype.
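A minimal sketch of the set idea (the word list is made up for the demo): a set keeps only unique elements, and sorted() gives them back in alphabetical order.

```python
words = ['the', 'sun', 'and', 'the', 'moon', 'and', 'the', 'sun']
unique_sorted = sorted(set(words))  # dedupe, then sort alphabetically
print(unique_sorted)  # ['and', 'moon', 'sun', 'the']
```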
Use this:
lst += word
Instead of:
lst.append(word)
As #Shalan and #BladeMight suggested, the issue is that word = line.lower().split() produces a list, and append appends the whole list rather than its elements. A syntactically simple way to write this would be:
fname = input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
    line = line.rstrip("\n")
    lst += line.lower().split()
If the order is not important, you can use set instead of list:
fname = input("Enter file name: ")
fh = open(fname)
uniq_words = set()
for line in fh:
    line = line.strip()
    uniq_words_in_line = line.split(' ')
    uniq_words.update(uniq_words_in_line)
print(uniq_words)
A list comprehension does the same as what you've done. Then use chain.from_iterable to chain all the sublists into one list:
from itertools import chain

lst = list(chain.from_iterable(line.lower().split() for line in fh))
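The same one-liner can be tried with in-memory lines standing in for the open file handle:

```python
from itertools import chain

# Stand-in for iterating over the open file, line by line.
lines = ["But soft what light\n", "It is the east\n"]

# Each line yields a sublist of lowercased words; chain flattens them.
lst = list(chain.from_iterable(line.lower().split() for line in lines))
print(lst)  # ['but', 'soft', 'what', 'light', 'it', 'is', 'the', 'east']
```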
Open the file romeo.txt and read it line by line. For each line,
split the line into a list of words using the split() function. The
program should build a list of words. For each word on each line check
to see if the word is already in the list and if not append it to the
list. When the program completes, sort and print the resulting words
in alphabetical order.
http://www.pythonlearn.com/code/romeo.txt
Here's my code :
fname = raw_input("Enter file name: ")
fh = open(fname)
for line in fh:
    for word in line.split():
        if word in line.split():
            line.split().append(word)
        if word not in line.split():
            continue
print word
It only returns the last word of the last line, for some reason.
At the top of your loop, add a list into which you'll collect your words; right now you are just discarding everything.
Your logic is also reversed: you are discarding words that you should be saving.
words = []
fname = raw_input("Enter file name: ")
fh = open(fname)
for line in fh:
    for word in line.split():
        if word not in words:
            words.append(word)
fh.close()
# Now sort the words list and continue with your assignment
Try the following; it uses a set() to build a unique list of words. Each word is also lower-cased so that "The" and "the" are treated the same.
import re

word_set = set()
re_nonalpha = re.compile('[^a-zA-Z ]+')
fname = raw_input("Enter file name: ")
with open(fname, "r") as f_input:
    for line in f_input:
        line = re_nonalpha.sub(' ', line)  # convert all non a-z to spaces
        for word in line.split():
            word_set.add(word.lower())
word_list = list(word_set)
word_list.sort()
print word_list
This will display the following list:
['already', 'and', 'arise', 'bits', 'breaks', 'but', 'east', 'envious', 'fair', 'grief', 'has', 'is', 'it', 'juliet', 'kill', 'light', 'many', 'moon', 'pale', 'punctation', 'sick', 'soft', 'sun', 'the', 'this', 'through', 'too', 'way', 'what', 'who', 'window', 'with', 'yonder']
sorted(set([w for l in open(fname) for w in l.split()]))
I think you misunderstand what line.split() is doing. line.split() will return a list containing the "words" that are in the string line. Here we interpret a "word" as "substring delimited by the space character". So if line was equal to "Hello, World. I <3 Python", line.split() would return the list ["Hello,", "World.", "I", "<3", "Python"].
When you write for word in line.split() you are iterating through each element of that list. So the condition word in line.split() will always be true! What you really want is a cumulative list of "words you have already come across". At the top of the program you would create it using DiscoveredWords = []. Then for every word in every line you would check
if word not in DiscoveredWords:
    DiscoveredWords.append(word)
Got it? :) Now since it seems you are new to Python (welcome to the fun by the way) here is how I would have written the code:
fname = raw_input("Enter file name: ")
with open(fname) as fh:
    words = [word for line in fh for word in line.strip().split()]
words = list(set(words))
words.sort()
Let's do a quick overview of this code so that you can understand what is going on:
with open(fname) as fh is a handy trick to remember. It allows you to ensure that your file gets closed! Once python exits the with block it will close the file for you automatically :D
words = [word for line in fh for word in line.strip().split()] is another handy trick. This is one of the more concise ways to get a list containing all of the words in a file! We are telling python to make a list by taking every line in the file (for line in fh) and then every word in that line (for word in line.strip().split()).
words = list(set(words)) casts our list to a set and then back to a list. This is a quick way to remove duplicates as a set in python contains unique elements.
Finally we sort the list using words.sort().
Hope this was helpful and instructive :)
words = list()
fname = input("Enter file name: ")
fh = open(fname).read()
fh = fh.split()
for word in fh:
    if word in words:
        continue
    else:
        words.append(word)
words.sort()
print(words)