Python's alphabetical sort file error - python

i'm new to python & here is my question:
Open the file romeo.txt and read it line by line. For each line, split the line into a list of words using the split() method. The program should build a list of words. For each word on each line check to see if the word is already in the list and if not append it to the list. When the program completes, sort and print the resulting words in alphabetical order.
This is the file:
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
Desired Output:
['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'the', 'through', 'what', 'window', 'with', 'yonder']
This is my code:
fname = raw_input("Enter file name: ")
fh = open(fname)
lst = list()
#loop through the text to get the lines
for line in fh:
line = line.rstrip()
#loop through the line to get the words
for word in line:
words = line.split()
#if a word is not in the empty list, append it
if not word in lst: lst.append(word)
lst.sort()
print lst
My output:
[' ', 'A', 'B', 'I', 'J', 'W', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'y']
If you could tell me what is wrong (to get only the first letters of the words with a space in the beginning instead of the whole words), it would be great..
Note: I want the code using these instructions, not other advanced instructions (to keep my learning sequence)
Thank you

You should be calling split() on the line, not the words in the line.
for line in file:
result = line.split() # this returns a list of values
for word in result:
# check if it already is in your list of words
list.sort()

Let's take the code line-by-line
for line in fh:
line = line.rstrip()
So now our first line contains "But soft what light through yonder window breaks", and it's a string.
for word in line:
ah, but now, we've said "Let's iterate over line (a string) and let word be each part of it when we go through the loop. But line is a complete string! When you iterate over a string like this you get one letter at a time, which is what you're seeing in your results.
Instead, don't have that for loop and just split the line as you were before:
for line in fh:
line = line.rstrip()
words = line.split()
for word in words:
if word not in lst:
lst.append(word)

Try this:
with open('test.txt') as f:
words = []
for line in f:
if line:
words.extend(line.split())
print(sorted(set(words)))
Output:
['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'the', 'through', 'what', 'window', 'with', 'yonder']

just small thing you are missing is this line
for word in line.split(): #split() gives you list of word seprated by space
by doing this you are making line => list of words
right now it is list of char(a simple string). try to print word in your example and print word after using line.split() you will get better idea.
Checkout this link > how to use split

Related

beginner issue with python : how to make one list of separated lines in a file in python

I have an issue as a beginner that made me exhausted trying to solve it so many times/ways but still feel dump, the problem is that I have a small file that I read in python and I have to make a list of the whole lines to sort it in alphabetical order. but when I try to make it in a list, it makes a separate list for each line.
here is my may that I tried to solve the issue using it:
file = open("romeo.txt")
for line in file:
words = line.split()
unique = list()
if words not in unique:
unique.extend(words)
unique.sort()
print(unique)
output:
['But', 'breaks', 'light', 'soft', 'through', 'what', 'window', 'yonder']
['It', 'Juliet', 'and', 'east', 'is', 'is', 'sun', 'the', 'the']
['Arise', 'and', 'envious', 'fair', 'kill', 'moon', 'sun', 'the']
['Who', 'already', 'and', 'grief', 'is', 'pale', 'sick', 'with']
To have a list of all lines you can use simple
with open(your_file, 'r') as f:
data = [''.join(x.split('\n')) for x in f.readlines()] # I used simple list comprehension to delete the `\n` at the end.
In data you have each line in a list. To sort the list you have to use sorted()
new_list = sorted(data)
now the new_list is the sorted list.
You have a inbuilt function for it
lines_of_files = open("filename.txt").readlines()
This returns a list of each line in the file.Hope this solves you question

How to find all words in a string that begin with an uppercase letter, for multiple strings in a list

I have a list of strings, each string is about 10 sentences. I am hoping to find all words from each string that begin with a capital letter. Preferably after the first word in the sentence. I am using re.findall to do this. When I manually set the string = '' I have no trouble do this, however when I try to use a for loop to loop over each entry in my list I get a different output.
for i in list_3:
string = i
test = re.findall(r"(\b[A-Z][a-z]*\b)", string)
print(test)
output:
['I', 'I', 'As', 'I', 'University', 'Illinois', 'It', 'To', 'It', 'I', 'One', 'Manu', 'I', 'I', 'Once', 'And', 'Through', 'I', 'I', 'Most', 'Its', 'The', 'I', 'That', 'I', 'I', 'I', 'I', 'I', 'I']
When I manually input the string value
txt = 0
for i in list_3:
string = list_3[txt]
test = re.findall(r"(\b[A-Z][a-z]*\b)", string)
print(test)
output:
['Remember', 'The', 'Common', 'App', 'Do', 'Your', 'Often', 'We', 'Monica', 'Lannom', 'Co', 'Founder', 'Campus', 'Ventures', 'One', 'Break', 'Campus', 'Ventures', 'Universities', 'Undermatching', 'Stanford', 'Yale', 'Undermatching', 'What', 'A', 'Yale', 'Lannom', 'There', 'During', 'Some', 'The', 'Lannom', 'That', 'It', 'Lannom', 'Institutions', 'University', 'Chicago', 'Boston', 'College', 'These', 'Students', 'If', 'Lannom', 'Recruiting', 'Elite', 'Campus', 'Ventures', 'Understanding', 'Campus', 'Ventures', 'The', 'For', 'Lannom', 'What', 'I', 'Wish', 'I', 'Knew', 'Before', 'Starting', 'Company', 'I', 'Even', 'I', 'Lannom', 'The', 'There']
But I can't seem to write a for loop that correctly prints the output for each of the 5 items in the list. Any ideas?
The easiest way yo do that is to write a for loop which checks whether the first letter of an element of the list is capitalized. If it is, it will be appended to the output list.
output = []
for i in list_3:
if i[0] == i[0].upper():
output.append(i)
print(output)
We can also use the list comprehension and made that in 1 line. We are also checking whether the first letter of an element is the capitalized letter.
output = [x for x in list_3 if x[0].upper() == x[0]]
print(output)
EDIT
You want to place the sentence as an element of a list so here is the solution. We iterate over the list_3, then iterate for every word by using the split() function. We are thenchecking whether the word is capitalized. If it is, it is added to an output.
list_3 = ["Remember your college application process? The tedious Common App applications, hours upon hours of research, ACT/SAT, FAFSA, visiting schools, etc. Do you remember who helped you through this process? Your family and guidance counselors perhaps, maybe your peers or you may have received little to no help"]
output = []
for i in list_3:
for j in i.split():
if j[0].isupper():
output.append(j)
print(output)
Assuming sentences are separated by one space, you could use re.findall with the following regular expression.
r'(?m)(?<!^)(?<![.?!] )[A-Z][A-Za-z]*'
Start your engine! | Python code
Python's regex engine performs the following operations.
(?m) : set multiline mode so that ^ and $ match the beginning
and the end of a line
(?<!^) : negative lookbehind asserts current location is not
at the beginning of a line
(?<![.?!] ) : negative lookbehind asserts current location is not
preceded by '.', '?' or '!', followed by a space
[A-Z] : match an uppercase letter
[A-Za-z]* : match 1+ letters
If sentences can be separated by one or two spaces, insert the negative lookbehind (?<![.?!] ) after (?<![.?!] ).
If the PyPI regex module were used, one could use the variable-length lookbehind (?<![.?!] +)
As i understand, you have list like this:
list_3 = [
'First sentence. Another Sentence',
'And yet one another. Sentence',
]
You are iterating over the list but every iteration overrides test variable, thus you have incorrect result. You eihter have to accumulate result inside additional variable or print it right away, every iteration:
acc = []
for item in list_3:
acc.extend(re.findall(regexp, item))
print(acc)
or
for item in list_3:
print(re.findall(regexp, item))
As for regexp, that ignores first word in the sentence, you can use
re.findall(r'(?<!\A)(?<!\.)\s+[A-Z]\w+', s)
(?<!\A) - not the beginning of the string
(?<!\.) - not the first word after dot
\s+ - optional spaces after dot.
You'll receive words potentialy prefixed by space, so here's final example:
acc = []
for item in list_3:
words = [w.strip() for w in re.findall(r'(?<!\A)(?<!\.)\s+[A-Z]\w+', item)]
acc.extend(words)
print(acc)
as I really like regexes, try this one:
#!/bin/python3
import re
PATTERN = re.compile(r'[A-Z][A-Za-z0-9]*')
all_sentences = [
"My House! is small",
"Does Annie like Cats???"
]
def flat_list(sentences):
for sentence in sentences:
yield from PATTERN.findall(sentence)
upper_words = list(flat_list(all_sentences))
print(upper_words)
# Result: ['My', 'House', 'Does', 'Annie', 'Cats']

removing common words from a text file

I am trying to remove common words from a text. for example the sentence
"It is not a commonplace river, but on the contrary is in all ways remarkable."
I want to turn it into just unique words. This means removing "it", "but", "a" etc. I have a text file that has all the common words and another text file that contains a paragraph. How can I delete the common words in the paragraph text file?
For example:
['It', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
How do I remove the common words from the file efficiently. I have a text file called common.txt that has all the common words listed. How do I use that list to remove identical words in the sentence above. End output I want:
['commonplace', 'river', 'contrary', 'remarkable']
Does that make sense?
Thanks.
you would want to use "set" objects in python.
If order and number of occurrence are not important:
str_list = ['It', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
common_words = ['It', 'is', 'not', 'a', 'but', 'on', 'the', 'in', 'all', 'ways','other_words']
set(str_list) - set(common_words)
>>> {'contrary', 'commonplace', 'river', 'remarkable'}
If both are important:
#Using "set" is so much faster
common_set = set(common_words)
[s for s in str_list if not s in common_set]
>>> ['commonplace', 'river', 'contrary', 'remarkable']
Here's an example that you can use:
l = text.replace(",","").replace(".","").split(" ")
occurs = {}
for word in l:
occurs[word] = l.count(word)
resultx = ''
for word in occurs.keys()
if occurs[word] < 3:
resultx += word + " "
resultx = resultx[:-1]
you can change 3 with what you think suited or based it on the average using :
occurs.values()/len(occurs)
Additional
if you want it to be Case insensitive change the 1st line with :
l = text.replace(",","").replace(".","").lower().split(" ")
Most simple method would be just to read() your common.txt and then use list comprehension and only take the words that are not in the file we read
with open('common.txt') as f:
content = f.read()
s = ['It', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
res = [i for i in s if i not in content]
print(res)
# ['commonplace', 'river', 'contrary', 'remarkable']
filter also works here
res = list(filter(lambda x: x not in content, s))

How do I remove duplicate words from a list in python without using sets?

I have the following python code which almost works for me (I'm SO close!). I have text file from one Shakespeare's plays that I'm opening:
Original text file:
"But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief"
And the result of the code I worte gives me is this:
['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'and', 'and',
'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'is', 'is', 'kill',
'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'sun', 'the', 'the', 'the',
'through', 'what', 'window', 'with', 'yonder']
So this is almost what I want: It's already in a list sorted the way I want it, but how do I remove the duplicate words? I'm trying to create a new ResultsList and append the words to it, but it gives me the above result without getting rid of the duplicate words. If I "print ResultsList" it just dumps a ton of words out. They way I have it now is close, but I want to get rid of the extra "and's", "is's", "sun's" and "the's".... I want to keep it simple and use append(), but I'm not sure how I can get it to work. I don't want to do anything crazy with the code. What simple thing am I missing from my code inorder to remove the duplicate words?
fname = raw_input("Enter file name: ")
fhand = open(fname)
NewList = list() #create new list
ResultList = list() #create new results list I want to append words to
for line in fhand:
line.rstrip() #strip white space
words = line.split() #split lines of words and make list
NewList.extend(words) #make the list from 4 lists to 1 list
for word in line.split(): #for each word in line.split()
if words not in line.split(): #if a word isn't in line.split
NewList.sort() #sort it
ResultList.append(words) #append it, but this doesn't work.
print NewList
#print ResultList (doesn't work the way I want it to)
mylist = ['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'and', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'is', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'sun', 'the', 'the', 'the', 'through', 'what', 'window', 'with', 'yonder']
newlist = sorted(set(mylist), key=lambda x:mylist.index(x))
print(newlist)
['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'the', 'through', 'what', 'window', 'with', 'yonder']
newlist contains a list of the set of unique values from mylist, sorted by each item's index in mylist.
You did have a couple logic error with your code. I fixed them, hope it helps.
fname = "stuff.txt"
fhand = open(fname)
AllWords = list() #create new list
ResultList = list() #create new results list I want to append words to
for line in fhand:
line.rstrip() #strip white space
words = line.split() #split lines of words and make list
AllWords.extend(words) #make the list from 4 lists to 1 list
AllWords.sort() #sort list
for word in AllWords: #for each word in line.split()
if word not in ResultList: #if a word isn't in line.split
ResultList.append(word) #append it.
print(ResultList)
Tested on Python 3.4, no importing.
Below function might help.
def remove_duplicate_from_list(temp_list):
if temp_list:
my_list_temp = []
for word in temp_list:
if word not in my_list_temp:
my_list_temp.append(word)
return my_list_temp
else: return []
This should work, it walks the list and adds elements to a new list if they are not the same as the last element added to the new list.
def unique(lst):
""" Assumes lst is already sorted """
unique_list = []
for el in lst:
if el != unique_list[-1]:
unique_list.append(el)
return unique_list
You could also use collections.groupby which works similarly
from collections import groupby
# lst must already be sorted
unique_list = [key for key, _ in groupby(lst)]
A good alternative to using a set would be to use a dictionary. The collections module contains a class called Counter which is specialized dictionary for counting the number of times each of its keys are seen. Using it you could do something like this:
from collections import Counter
wordlist = ['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'and',
'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'is',
'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'sun',
'the', 'the', 'the', 'through', 'what', 'window', 'with', 'yonder']
newlist = sorted(Counter(wordlist),
key=lambda w: w.lower()) # case insensitive sort
print(newlist)
Output:
['already', 'and', 'Arise', 'breaks', 'But', 'east', 'envious', 'fair',
'grief', 'is', 'It', 'Juliet', 'kill', 'light', 'moon', 'pale', 'sick',
'soft', 'sun', 'the', 'through', 'what', 'Who', 'window', 'with', 'yonder']
There is a problem with your code.
I think you mean:
for word in line.split(): #for each word in line.split()
if words not in ResultList: #if a word isn't in ResultList
Use plain old lists. Almost certainly not as efficient as Counter.
fname = raw_input("Enter file name: ")
Words = []
with open(fname) as fhand:
for line in fhand:
line = line.strip()
# lines probably not needed
#if line.startswith('"'):
# line = line[1:]
#if line.endswith('"'):
# line = line[:-1]
Words.extend(line.split())
UniqueWords = []
for word in Words:
if word.lower() not in UniqueWords:
UniqueWords.append(word.lower())
print Words
UniqueWords.sort()
print UniqueWords
This always checks against the lowercase version of the word, to ensure the same word but in a different case configuration is not counted as 2 different words.
I added checks to remove the double quotes at the start and end of the file, but if they are not present in the actual file. These lines could be disregarded.
This should do the job:
fname = input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
line = line.rstrip()
words = line.split()
for word in words:
if word not in lst:
lst.append(word)
lst.sort()
print(lst)

Convert a list of string sentences to words

I'm trying to essentially take a list of strings containg sentences such as:
sentence = ['Here is an example of what I am working with', 'But I need to change the format', 'to something more useable']
and convert it into the following:
word_list = ['Here', 'is', 'an', 'example', 'of', 'what', 'I', 'am',
'working', 'with', 'But', 'I', 'need', 'to', 'change', 'the format',
'to', 'something', 'more', 'useable']
I tried using this:
for item in sentence:
for word in item:
word_list.append(word)
I thought it would take each string and append each item of that string to word_list, however the output is something along the lines of:
word_list = ['H', 'e', 'r', 'e', ' ', 'i', 's' .....etc]
I know I am making a stupid mistake but I can't figure out why, can anyone help?
You need str.split() to split each string into words:
word_list = [word for line in sentence for word in line.split()]
Just .split and .join:
word_list = ' '.join(sentence).split(' ')
You haven't told it how to distinguish a word. By default, iterating through a string simply iterates through the characters.
You can use .split(' ') to split a string by spaces. So this would work:
for item in sentence:
for word in item.split(' '):
word_list.append(word)
for item in sentence:
for word in item.split():
word_list.append(word)
Split sentence into words:
print(sentence.rsplit())

Categories

Resources