So essentially I have the following text file
hello you how are you doing?
i am doing well
is that so?
yes
I want to collect into a new list every word that is at least 3 and at most 7 characters long. My code is shown below.
f=open("W7Ex11.txt","r")
words=[]
for line in f:
    line=line.rstrip()
    if len(line)>=3 and len(line)<=7:
        words.append(line)
f.close()
print(words)
Unfortunately, only the last word, "yes", gets appended. I honestly don't see why it's going wrong. Does anyone know why my code isn't working the way I want?
result
https://imgur.com/a/MAdhOfz
You need to split the line into words. The split(char) function splits a string wherever it finds char. Try this:
words = []
with open("W7Ex11.txt","r") as f: # will automatically close file afterwards
    for line in f:
        line = line.rstrip()
        for word in line.split(' '):
            word = word.strip() # strip to remove whitespace around the word
            if 3 <= len(word) <= 7: # yes, you can do that in Python :)
                words.append(word)
print(words)
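To see why the original loop kept only "yes": len(line) measures the whole line, and "yes" is the only line that is between 3 and 7 characters long. Here is a minimal sketch of the per-word check, using an in-memory list of lines instead of W7Ex11.txt (the helper name words_in_range is hypothetical):

```python
def words_in_range(lines, min_len=3, max_len=7):
    """Return every whitespace-separated word whose length is in [min_len, max_len]."""
    selected = []
    for line in lines:
        # split() with no argument splits on any run of whitespace,
        # so no separate strip() call is needed
        for word in line.split():
            if min_len <= len(word) <= max_len:
                selected.append(word)
    return selected


sample_lines = [
    "hello you how are you doing?\n",
    "i am doing well\n",
    "is that so?\n",
    "yes\n",
]

# Whole-line check, as in the question: only the line "yes" qualifies
whole_line_hits = [l.rstrip() for l in sample_lines if 3 <= len(l.rstrip()) <= 7]
print(whole_line_hits)

# Per-word check
print(words_in_range(sample_lines))
```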
Read the file and search for the length of words and extract them
import re
with open("W7Ex11.txt","r") as f:
    a = ' '.join(f.read().splitlines())
print(re.findall(r'([\w\?]{3,7})', a))  # raw string avoids invalid-escape warnings
Or
new_list = []
# iterate over the matches in the whole text (iterating over a itself would yield single characters)
for match in re.finditer(r'([\w\?]{3,7})', a):
    new_list.append(match.group())
print(new_list)
Output:
['hello', 'you', 'how', 'are', 'you', 'doing?', 'doing', 'well', 'that', 'so?', 'yes']
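Note that the pattern above counts "?" as part of a word, which is why 'doing?' and 'so?' appear in the output. If punctuation should not count toward the length (an assumption about the desired behavior; the question does not say), one possible variant uses word boundaries:

```python
import re

text = "hello you how are you doing? i am doing well is that so? yes"

# \b anchors the match at word boundaries, so only whole words made of
# 3 to 7 word characters match, and punctuation is never included
whole_words = re.findall(r"\b\w{3,7}\b", text)
print(whole_words)
```

With this variant "so?" is dropped, because "so" on its own has only two characters.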
Related
I have a text file that contains 4 lines (like a poem).
I want to add all of its words to a single list.
For example, if the poem is:
I am done with you,
Don't love me anymore
I want the result to be: ['I', 'am', 'done', 'with', 'you', 'dont', 'love', 'me', 'anymore']
But I cannot get rid of the line break at the end of the first sentence, so I end up with two separate lists.
romeo = open(r'd:\romeo.txt')
list = []
for line in romeo:
    line = line.rstrip()
    line = line.split()
    list = list + [line]
print(list)
with open(r'd:\romeo.txt', 'r') as msg:
    data = msg.read().replace("\n"," ")
data = [x for x in data.split() if x.strip()]
Even shorter (note the read() call: a file object has no split() method):
with open(r'd:\romeo.txt', 'r') as msg:
    words = " ".join(msg.read().split()).split(' ')
Or, also removing the comma:
with open(r'd:\romeo.txt', 'r') as msg:
    words = " ".join(msg.read().replace(',', ' ').split()).split(' ')
You can use a regular expression like this.
import re
poem = '' # your poem
split = re.split(r'\040|\n', poem)
print(split)
In the regular expression, \040 (an octal escape) matches a space and \n matches a newline.
The output is:
['I', 'am', 'done', 'with', 'you,', "Don't", 'love', 'me', 'anymore']
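The outputs above still contain the comma and the apostrophe. If the punctuation should go too, as in the desired list in the question, here is a small sketch using str.translate; the poem is inlined as a string here instead of being read from d:\romeo.txt, and the original capitalization is kept:

```python
import string

poem = "I am done with you,\nDon't love me anymore\n"

# Translation table that deletes every ASCII punctuation character
remove_punct = str.maketrans("", "", string.punctuation)

# split() with no argument treats spaces and newlines alike,
# so the line break needs no special handling
words = poem.translate(remove_punct).split()
print(words)
```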
I am trying to remove stop words from my list of tokens, but the words do not seem to be removed. What could be the problem? Thanks.
Tried:
Trans = []
with open('data.txt', 'r') as myfile:
    file = myfile.read()
    # start reading from the first character again
    myfile.seek(0)
    for row in myfile:
        split = row.split()
        Trans.append(split)
myfile.close()
stop_words = list(get_stop_words('en'))
nltk_words = list(stopwords.words('english'))
stop_words.extend(nltk_words)
output = [w for w in Trans if not w in stop_words]
Input:
[['Apparent',
'magnitude',
'is',
'a',
'measure',
'of',
'the',
'brightness',
'of',
'a',
'star',
'or',
'other']]
output:
It returns the same words as input.
I think Trans.append(split) should be Trans.extend(split) because split returns a list.
For better readability, create a function, e.g.:
def drop_stopwords(row):
    stop_words = set(stopwords.words('english'))  # the NLTK corpus name is 'english', not 'en'
    return [word for word in row if word not in stop_words and word not in list(string.punctuation)]
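Also, a file opened using with open() does not need an explicit close().
If Trans is a pandas Series whose elements are token lists (apply is a Series method, not a list method), you can apply the function to each element, e.g.:
Trans = Trans.apply(drop_stopwords)
This applies the function to each row of tokens.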
You can add other functions for lemmatization, etc. There is a very clear example (code) here:
https://github.com/SamLevinSE/job_recommender_with_NLP/blob/master/job_recommender_data_mining_JOBS.ipynb
Because the input is a list of lists, you need to iterate over the outer list and then over the elements of each inner list. You can then get the correct output with:
output = [j for w in Trans for j in w if j not in stop_words]
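The difference between filtering the nested list and filtering a flat one can be seen without NLTK installed. In this sketch the stop-word list is a short hypothetical stand-in for the combined get_stop_words('en') and stopwords.words('english') lists, and the row is inlined instead of being read from data.txt:

```python
# Hypothetical stand-in for the real stop-word lists
stop_words = ["is", "a", "of", "the", "or", "other"]

rows = ["Apparent magnitude is a measure of the brightness of a star or other"]

nested = []
flat = []
for row in rows:
    nested.append(row.split())  # adds one inner list per row
    flat.extend(row.split())    # adds the individual words

# Each element of nested is a list; a list never equals a stop-word
# string, so nothing is filtered out
unfiltered = [w for w in nested if w not in stop_words]

# With a flat list every element is a word, so the filter works.
# A set makes the membership test fast for long stop-word lists.
stop_set = set(stop_words)
filtered = [w for w in flat if w.lower() not in stop_set]

print(unfiltered == nested)
print(filtered)
```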
I am trying to extract unique words out of the following text into 1 list.
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
But I keep getting a list within the list for each line of the text. I understand I have some "\n" to get rid of but can't figure out how.
Here is my code:
fname = input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
    line = line.rstrip("\n")
    for word in line:
        word = line.lower().split()
    lst.append(word)
print(lst)
And the output I get:
[['but', 'soft', 'what', 'light', 'through', 'yonder', 'window', 'breaks'], ['it', 'is', 'the', 'east', 'and', 'juliet', 'is', 'the', 'sun'], ['arise', 'fair', 'sun', 'and', 'kill', 'the', 'envious', 'moon'], ['who', 'is', 'already', 'sick', 'and', 'pale', 'with', 'grief']]
Thanks!!
When you call line.lower().split() you get a list of words, and you are appending that whole list to your list, lst. Use extend instead of append: extend adds each element of the list returned by split(). Also, the second for loop, for word in line:, is unnecessary.
Additionally, if you want to extract unique words, you might want to look into the set datatype.
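A short illustration of the append/extend difference and of set, using one line of the text as an inline string (a sketch, not the asker's full program):

```python
line = "It is the east and Juliet is the sun"

appended = []
appended.append(line.lower().split())  # adds ONE element: the whole list

extended = []
extended.extend(line.lower().split())  # adds each word as its own element

print(len(appended))  # number of elements after append
print(len(extended))  # number of elements after extend

# A set keeps only one copy of each word; sorted() returns a list
unique_sorted = sorted(set(extended))
print(unique_sorted)
```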
Use this:
lst += word
Instead of:
lst.append(word)
As @Shalan and @BladeMight suggested, the issue is that word = line.lower().split() produces a list, and append adds that list as a single element rather than adding its items. I think a syntactically simple way to write this would be:
fname = input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
    line = line.rstrip("\n")
    lst += line.lower().split()
If the order is not important, you can use set instead of list:
fname = input("Enter file name: ")
fh = open(fname)
uniq_words = set()
for line in fh:
    line = line.strip()
    uniq_words_in_line = line.split(' ')
    uniq_words.update(uniq_words_in_line)
print(uniq_words)
A generator expression can produce the per-line word lists, just as your loop does.
Then use chain.from_iterable to chain all the sublists into one list:
from itertools import chain
with open(fname) as f:
    lst = list(chain.from_iterable(line.lower().split() for line in f))
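If the words must be unique and their first-seen order does matter, one option is dict.fromkeys, since dict keys are unique and keep insertion order in Python 3.7+. A sketch, with io.StringIO standing in for the opened file so that the example is self-contained:

```python
import io
from itertools import chain

# io.StringIO stands in for the opened file (an assumption for this sketch)
fh = io.StringIO(
    "But soft what light through yonder window breaks\n"
    "It is the east and Juliet is the sun\n"
)

# Lazily flatten the per-line word lists into one stream of words
all_words = chain.from_iterable(line.lower().split() for line in fh)

# dict.fromkeys drops repeats while keeping the first occurrence of each word
unique_in_order = list(dict.fromkeys(all_words))
print(unique_in_order)
```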
Open the file romeo.txt and read it line by line. For each line,
split the line into a list of words using the split() function. The
program should build a list of words. For each word on each line check
to see if the word is already in the list and if not append it to the
list. When the program completes, sort and print the resulting words
in alphabetical order.
http://www.pythonlearn.com/code/romeo.txt
Here's my code :
fname = raw_input("Enter file name: ")
fh = open(fname)
for line in fh:
    for word in line.split():
        if word in line.split():
            line.split().append(word)
        if word not in line.split():
            continue
print word
It only returns the last word of the last line, for some reason.
Before your loop, create a list in which to collect your words. Right now you are just discarding everything.
Your logic is also reversed: you are discarding the words that you should be saving.
words = []
fname = raw_input("Enter file name: ")
fh = open(fname)
for line in fh:
    for word in line.split():
        if word not in words:
            words.append(word)
fh.close()
# Now you should sort the words list and continue with your assignment
Try the following, it uses a set() to build a unique list of words. Each word is also lower-cased so that "The" and "the" are treated the same.
import re
word_set = set()
re_nonalpha = re.compile('[^a-zA-Z ]+')
fname = raw_input("Enter file name: ")
with open(fname, "r") as f_input:
    for line in f_input:
        line = re_nonalpha.sub(' ', line) # Convert all non a-z to spaces
        for word in line.split():
            word_set.add(word.lower())
word_list = list(word_set)
word_list.sort()
print word_list
This will display the following list:
['already', 'and', 'arise', 'bits', 'breaks', 'but', 'east', 'envious', 'fair', 'grief', 'has', 'is', 'it', 'juliet', 'kill', 'light', 'many', 'moon', 'pale', 'punctation', 'sick', 'soft', 'sun', 'the', 'this', 'through', 'too', 'way', 'what', 'who', 'window', 'with', 'yonder']
sorted(set([w for l in open(fname) for w in l.split()]))
I think you misunderstand what line.split() is doing. line.split() will return a list containing the "words" that are in the string line. Here we interpret a "word" as "substring delimited by the space character". So if line was equal to "Hello, World. I <3 Python", line.split() would return the list ["Hello,", "World.", "I", "<3", "Python"].
When you write for word in line.split() you are iterating through each element of that list. So the condition word in line.split() will always be true! What you really want is a cumulative list of "words you have already come across". At the top of the program you would create it using DiscoveredWords = []. Then for every word in every line you would check
if word not in DiscoveredWords:
    DiscoveredWords.append(word)
Got it? :) Now since it seems you are new to Python (welcome to the fun by the way) here is how I would have written the code:
fname = raw_input("Enter file name: ")
with open(fname) as fh:
    words = [word for line in fh for word in line.strip().split()]
words = list(set(words))
words.sort()
Let's do a quick overview of this code so that you can understand what is going on:
with open(fname) as fh is a handy trick to remember. It allows you to ensure that your file gets closed! Once python exits the with block it will close the file for you automatically :D
words = [word for line in fh for word in line.strip().split()] is another handy trick. This is one of the more concise ways to get a list containing all of the words in a file! We are telling python to make a list by taking every line in the file (for line in fh) and then every word in that line (for word in line.strip().split()).
words = list(set(words)) casts our list to a set and then back to a list. This is a quick way to remove duplicates as a set in python contains unique elements.
Finally we sort the list using words.sort().
Hope this was helpful and instructive :)
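For readers new to nested comprehensions, this sketch shows that the one-liner is equivalent to two explicit loops. It is written for Python 3, and io.StringIO stands in for the file so that no input file is needed:

```python
import io

fh = io.StringIO("Hello there\nHello again\n")

# Explicit version: the outer loop over lines comes first,
# the inner loop over words second
words_loop = []
for line in fh:
    for word in line.strip().split():
        words_loop.append(word)

fh.seek(0)  # rewind so the "file" can be read a second time

# Comprehension version: the for clauses appear in the same order as above
words_comp = [word for line in fh for word in line.strip().split()]

print(words_loop == words_comp)

# Remove duplicates and sort in one step
unique_words = sorted(set(words_comp))
print(unique_words)
```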
words=list()
fname = input("Enter file name: ")
fh = open(fname).read()
fh=fh.split()
for word in fh:
    if word in words:
        continue
    else:
        words.append(word)
words.sort()
print(words)
What is the best way to perform the following? The sample document (hello.txt) contains the following:
>>> repr(hello.txt) #show object representations
Hello there! This is a sample text. \n Ten plus ten is twenty. \n Twenty times two is forty \n
>>> print(hello.txt)
Hello There. This is a sample text
Ten plus ten is twenty
Twenty times two is forty
To Do:
Open a file, split each line into a list, then for each word on each line check to see if the word is in list and if not append it to the list
open_file = open('hello.txt')
lst = list() #create empty list
for line in open_file:
    line = line.rstrip() #strip white space at the end of each line
    words = line.split() #split string into a list of words
    for word in words:
        if word not in words:
            #Missing code here; tried 'if word not in words', but then it produces a empty list
            lst.append(word)
lst.sort()
print(lst)
Output from above code:
['Hello', 'Ten', 'There', 'This', 'Twenty', 'a', 'forty', 'is', 'is', 'is', 'plus', 'sample', 'ten', 'text', 'times', 'twenty', 'two']
The 'is' string is present 3 times when it should be present only once. I'm stuck with figuring out how to write code for checking each word on each line to see if the word is in list and if not append it to the list..
Desired output:
['Hello', 'Ten', 'There', 'This', 'Twenty', 'a', 'forty', 'is', 'plus', 'sample', 'ten', 'text', 'times', 'twenty', 'two']
Your error lies in these two lines:
for word in words:
    if word not in words:
Perhaps you meant:
for word in words:
    if word not in lst:
For whatever it is worth, here is how I would write the entire program:
import string
result = sorted(set(
    word.strip(string.punctuation)
    for line in open('hello.txt')
    for word in line.split()))
print result
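One detail of word.strip(string.punctuation) that is easy to miss: strip only removes characters from the two ends of the string, so punctuation inside a word is kept. A quick Python 3 check on a few made-up words:

```python
import string

raw_words = ["there!", "text.", "don't", "(twenty)", "forty"]

# strip() trims punctuation from both ends only; the apostrophe
# inside "don't" is untouched
cleaned = [w.strip(string.punctuation) for w in raw_words]
print(cleaned)
```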
Sets are ideal for unique membership.
Content of hello.txt:
Hello there! This is a sample text.
Ten plus ten is twenty.
Twenty times two is forty
Code:
result = set()
with open('hello.txt', 'r') as myfile:
    for line in myfile:
        result.update(line.strip().split())  # add every word on the line to the set
for item in result:
    print(item)
Results:
text.
Twenty
This
ten
a
sample
times
twenty.
Hello
is
there!
two
forty
plus
Ten
You could also modify your code to say if word not in lst instead of if word not in words, then it would work.
If you want to sort a set... well, sets are unordered, but you can sort the output with sorted(result).
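Keep in mind that sorted compares strings by character code, so capitalized words come before lowercase ones, as in the desired output in the question. If a case-insensitive order is preferred (an assumption; the question does not ask for it), pass a key function:

```python
result = {"Hello", "Ten", "a", "forty", "is", "This"}

# Default ordering: uppercase letters sort before lowercase letters
default_order = sorted(result)
print(default_order)

# Case-insensitive ordering: compare the casefolded form of each word
caseless_order = sorted(result, key=str.casefold)
print(caseless_order)
```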
The Solution:
open_file = open('hello.txt') #open file
lst = list() #create empty list
for line in open_file: #1st for loop strip white space and split string into list of words
    line = line.rstrip()
    words = line.split()
    for word in words: #nested for loop, check if the word is in list and if not append it to the list
        if word not in lst:
            lst.append(word)
lst.sort() #sort the list of words: alphabetically
print(lst) #print the list of words