I have an issue as a beginner that has exhausted me; I have tried to solve it so many times and in so many ways, but I still feel dumb. The problem is that I have a small file that I read in Python, and I need to make a single list of all the lines so I can sort it in alphabetical order. But when I try to build the list, it makes a separate list for each line.
Here is the way I tried to solve the issue:
file = open("romeo.txt")
for line in file:
    words = line.split()
    unique = list()
    if words not in unique:
        unique.extend(words)
    unique.sort()
    print(unique)
output:
['But', 'breaks', 'light', 'soft', 'through', 'what', 'window', 'yonder']
['It', 'Juliet', 'and', 'east', 'is', 'is', 'sun', 'the', 'the']
['Arise', 'and', 'envious', 'fair', 'kill', 'moon', 'sun', 'the']
['Who', 'already', 'and', 'grief', 'is', 'pale', 'sick', 'with']
To get a list of all the lines you can simply use:
with open(your_file, 'r') as f:
    data = [''.join(x.split('\n')) for x in f.readlines()]  # a simple list comprehension to delete the '\n' at the end
Now data holds each line of the file as a list element. To sort the list you can use sorted():
new_list = sorted(data)
Now new_list is the sorted list.
There is a built-in function for it:
lines_of_files = open("filename.txt").readlines()
This returns a list of each line in the file. Hope this solves your question.
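Building on readlines(), a minimal sketch (the file name filename.txt is just the placeholder from above) that also strips the trailing newlines and sorts the lines alphabetically:
lines = open("filename.txt").readlines()
lines = [line.rstrip("\n") for line in lines]  # drop the trailing newline on each line
lines.sort()                                   # alphabetical order
print(lines)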
I want to put the unique items from one list into another list, i.e. eliminate the duplicate items. When I do it the longer way I am able to; see for example:
>>> new_list = []
>>> a = ['It', 'is', 'the', 'east', 'and', 'Juliet', 'is', 'the', 'sun']
>>> for word in a:
...     if word not in new_list:
...         new_list.append(word)
...
>>> new_list
['It', 'is', 'the', 'east', 'and', 'Juliet', 'sun']
But when I try to accomplish this using a list comprehension in a single line, each iteration returns the value None:
>>> new_list = []
>>> a = ['It', 'is', 'the', 'east', 'and', 'Juliet', 'is', 'the', 'sun']
>>> new_list = [new_list.append(word) for word in a if word not in new_list]
Can someone please help me understand what's going wrong in the list comprehension?
Thanks in advance,
Umesh
List comprehensions provide a concise way to create lists. Common
applications are to make new lists where each element is the result of
some operations applied to each member of another sequence or
iterable, or to create a subsequence of those elements that satisfy a
certain condition.
Maybe you can try this:
>>> new_list = []
>>> a = ['It', 'is', 'the', 'east', 'and', 'Juliet', 'is', 'the', 'sun']
>>> unused=[new_list.append(word) for word in a if word not in new_list]
>>> new_list
['It', 'is', 'the', 'east', 'and', 'Juliet', 'sun']
>>> unused
[None, None, None, None, None, None, None]
Notice:
append() always returns None; it modifies the list in place rather than returning a new list.
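A tiny sketch of my own showing this in the REPL:
>>> new_list = []
>>> print(new_list.append('word'))
None
>>> new_list
['word']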
Another way: you can use a set to remove duplicate items:
>>> a = ['It', 'is', 'the', 'east', 'and', 'Juliet', 'is', 'the', 'sun']
>>> list(set(a))
['and', 'sun', 'is', 'It', 'the', 'east', 'Juliet']
If you want a unique list of words, you can use set().
list(set(a))
# returns:
# ['It', 'is', 'east', 'and', 'the', 'sun', 'Juliet']
If the order is important, try:
new_list = []
for word in a:
    if word not in new_list:
        new_list.append(word)
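As an extra sketch (not part of the original answer), an order-preserving de-duplication can also be written in one line with dict.fromkeys, since dicts keep insertion order in Python 3.7+:
a = ['It', 'is', 'the', 'east', 'and', 'Juliet', 'is', 'the', 'sun']
new_list = list(dict.fromkeys(a))  # keeps the first occurrence of each word
# ['It', 'is', 'the', 'east', 'and', 'Juliet', 'sun']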
I am trying to remove common words from a text. For example, take the sentence
"It is not a commonplace river, but on the contrary is in all ways remarkable."
I want to turn it into just the uncommon words. This means removing "it", "but", "a", etc. I have a text file that has all the common words and another text file that contains a paragraph. How can I delete the common words in the paragraph text file?
For example:
['It', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
How do I remove the common words from the file efficiently? I have a text file called common.txt that lists all the common words. How do I use that list to remove matching words from the sentence above? The end output I want is:
['commonplace', 'river', 'contrary', 'remarkable']
Does that make sense?
Thanks.
You would want to use "set" objects in Python.
If order and number of occurrence are not important:
str_list = ['It', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
common_words = ['It', 'is', 'not', 'a', 'but', 'on', 'the', 'in', 'all', 'ways','other_words']
set(str_list) - set(common_words)
>>> {'contrary', 'commonplace', 'river', 'remarkable'}
If both are important:
#Using "set" is so much faster
common_set = set(common_words)
[s for s in str_list if s not in common_set]
>>> ['commonplace', 'river', 'contrary', 'remarkable']
Here's an example that you can use:
l = text.replace(",", "").replace(".", "").split(" ")  # text holds the paragraph as one string
occurs = {}
for word in l:
    occurs[word] = l.count(word)
resultx = ''
for word in occurs.keys():
    if occurs[word] < 3:
        resultx += word + " "
resultx = resultx[:-1]
You can replace 3 with whatever threshold you think is suitable, or base it on the average count:
sum(occurs.values()) / len(occurs)
Additionally, if you want it to be case-insensitive, change the first line to:
l = text.replace(",", "").replace(".", "").lower().split(" ")
The simplest method would be to just read() your common.txt and then use a list comprehension, taking only the words that are not in the file contents we read:
with open('common.txt') as f:
    content = f.read()
s = ['It', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
res = [i for i in s if i not in content]
print(res)
# ['commonplace', 'river', 'contrary', 'remarkable']
filter also works here
res = list(filter(lambda x: x not in content, s))
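One caveat with i not in content: it is a substring test against the whole file, so 'in' would also match inside a word such as 'window' if that word appeared in common.txt. As a hedged sketch (assuming common.txt holds whitespace-separated words), splitting the file into a set avoids that and keeps lookups fast:
with open('common.txt') as f:
    common = set(f.read().split())  # one whole-word membership test per word

s = ['It', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
res = [w for w in s if w not in common]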
I have the following Python code which almost works for me (I'm SO close!). I have a text file from one of Shakespeare's plays that I'm opening:
Original text file:
"But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief"
And the result the code I wrote gives me is this:
['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'and', 'and',
'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'is', 'is', 'kill',
'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'sun', 'the', 'the', 'the',
'through', 'what', 'window', 'with', 'yonder']
So this is almost what I want: it's already in a list sorted the way I want it, but how do I remove the duplicate words? I'm trying to create a new ResultsList and append the words to it, but it gives me the above result without getting rid of the duplicate words. If I "print ResultsList" it just dumps a ton of words out. The way I have it now is close, but I want to get rid of the extra "and"s, "is"s, "sun"s and "the"s... I want to keep it simple and use append(), but I'm not sure how I can get it to work. I don't want to do anything crazy with the code. What simple thing am I missing from my code in order to remove the duplicate words?
fname = raw_input("Enter file name: ")
fhand = open(fname)
NewList = list()    #create new list
ResultList = list() #create new results list I want to append words to
for line in fhand:
    line.rstrip()   #strip white space
    words = line.split()  #split lines of words and make list
    NewList.extend(words) #make the list from 4 lists to 1 list
    for word in line.split():         #for each word in line.split()
        if words not in line.split(): #if a word isn't in line.split
            NewList.sort()            #sort it
            ResultList.append(words)  #append it, but this doesn't work.
print NewList
#print ResultList (doesn't work the way I want it to)
mylist = ['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'and', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'is', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'sun', 'the', 'the', 'the', 'through', 'what', 'window', 'with', 'yonder']
newlist = sorted(set(mylist), key=lambda x:mylist.index(x))
print(newlist)
['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'the', 'through', 'what', 'window', 'with', 'yonder']
newlist contains a list of the set of unique values from mylist, sorted by each item's index in mylist.
You did have a couple of logic errors in your code. I fixed them; hope it helps.
fname = "stuff.txt"
fhand = open(fname)
AllWords = list() #create new list
ResultList = list() #create new results list I want to append words to
for line in fhand:
line.rstrip() #strip white space
words = line.split() #split lines of words and make list
AllWords.extend(words) #make the list from 4 lists to 1 list
AllWords.sort() #sort list
for word in AllWords: #for each word in line.split()
if word not in ResultList: #if a word isn't in line.split
ResultList.append(word) #append it.
print(ResultList)
Tested on Python 3.4, no importing.
The function below might help.
def remove_duplicate_from_list(temp_list):
    if temp_list:
        my_list_temp = []
        for word in temp_list:
            if word not in my_list_temp:
                my_list_temp.append(word)
        return my_list_temp
    else:
        return []
This should work; it walks the list and adds an element to a new list if it is not the same as the last element added to the new list.
def unique(lst):
    """ Assumes lst is already sorted """
    unique_list = []
    for el in lst:
        if not unique_list or el != unique_list[-1]:  # guard against indexing an empty list
            unique_list.append(el)
    return unique_list
You could also use itertools.groupby, which works similarly:
from itertools import groupby
# lst must already be sorted
unique_list = [key for key, _ in groupby(lst)]
A good alternative to using a set would be to use a dictionary. The collections module contains a class called Counter which is specialized dictionary for counting the number of times each of its keys are seen. Using it you could do something like this:
from collections import Counter
wordlist = ['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'and',
            'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'is',
            'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'sun',
            'the', 'the', 'the', 'through', 'what', 'window', 'with', 'yonder']
newlist = sorted(Counter(wordlist),
                 key=lambda w: w.lower())  # case insensitive sort
print(newlist)
Output:
['already', 'and', 'Arise', 'breaks', 'But', 'east', 'envious', 'fair',
'grief', 'is', 'It', 'Juliet', 'kill', 'light', 'moon', 'pale', 'sick',
'soft', 'sun', 'the', 'through', 'what', 'Who', 'window', 'with', 'yonder']
There is a problem with your code.
I think you mean:
for word in line.split():      #for each word in line.split()
    if word not in ResultList: #if the word isn't in ResultList
Use plain old lists. Almost certainly not as efficient as Counter.
fname = raw_input("Enter file name: ")
Words = []
with open(fname) as fhand:
    for line in fhand:
        line = line.strip()
        # lines probably not needed
        #if line.startswith('"'):
        #    line = line[1:]
        #if line.endswith('"'):
        #    line = line[:-1]
        Words.extend(line.split())

UniqueWords = []
for word in Words:
    if word.lower() not in UniqueWords:
        UniqueWords.append(word.lower())

print Words
UniqueWords.sort()
print UniqueWords
This always checks against the lowercase version of the word, to ensure the same word in a different case is not counted as two different words.
I added checks to remove the double quotes at the start and end of the file, but if they are not present in the actual file those lines can be disregarded (they are commented out above).
This should do the job:
fname = input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
    line = line.rstrip()
    words = line.split()
    for word in words:
        if word not in lst:
            lst.append(word)
lst.sort()
print(lst)
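As a small optional variation (a sketch, not part of the original answer), a with block closes the file automatically, and a parallel set keeps the membership test fast on larger files:
fname = input("Enter file name: ")
lst = []
seen = set()                 # fast membership checks
with open(fname) as fh:      # file is closed automatically
    for line in fh:
        for word in line.split():
            if word not in seen:
                seen.add(word)
                lst.append(word)
lst.sort()
print(lst)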
Hi, as per the earlier post.
Given the following list:
['Jellicle', 'Cats', 'are', 'black', 'and', 'white,', 'Jellicle', 'Cats', 'are', 'rather', 'small;', 'Jellicle', 'Cats', 'are', 'merry', 'and', 'bright,', 'And', 'pleasant', 'to', 'hear', 'when', 'they', 'caterwaul.', 'Jellicle', 'Cats', 'have', 'cheerful', 'faces,', 'Jellicle', 'Cats', 'have', 'bright', 'black', 'eyes;', 'They', 'like', 'to', 'practise', 'their', 'airs', 'and', 'graces', 'And', 'wait', 'for', 'the', 'Jellicle', 'Moon', 'to', 'rise.', '']
I am trying to count how many times each word THAT STARTS WITH A CAPITAL appears and display the top 3.
I am not interested in words that do not start with a capital.
If a word appears multiple times, sometimes starting with a capital and sometimes not, only count the times it does with a capital.
This is what my code looks like currently:
words = ""
for word in open('novel.txt', 'rU'):
words += word
words = words.split(' ')
words= list(words)
words = ('\n'.join(words)).split('\n')
word_counter = {}
for word in words:
if word in word_counter:
word_counter[word] += 1
else:
word_counter[word] = 1
popular_words = sorted(word_counter, key = word_counter.get, reverse = True)
top_3 = popular_words[:3]
matches = []
for i in range(3):
print word_counter[top_3[i]], top_3[i]
#uncomment to produce the word file
##words = ['Jellicle', 'Cats', 'are', 'black', 'and', 'white,', 'Jellicle', 'Cats', 'are', 'rather', 'small;', 'Jellicle', 'Cats', 'are', 'merry', 'and', 'bright,', 'And', 'pleasant', 'to', 'hear', 'when', 'they', 'caterwaul.', 'Jellicle', 'Cats', 'have', 'cheerful', 'faces,', 'Jellicle', 'Cats', 'have', 'bright', 'black', 'eyes;', 'They', 'like', 'to', 'practise', 'their', 'airs', 'and', 'graces', 'And', 'wait', 'for', 'the', 'Jellicle', 'Moon', 'to', 'rise.', '']
##open('novel.txt','w').write('\n'.join(words))
import string

cap_words = [word.strip(string.punctuation) for word in open('novel.txt').read().split() if word.istitle()]

##print(cap_words) # debug

try:
    from collections import Counter # Python >= 2.7
    print('Counter')
    print(Counter(cap_words).most_common(3))
except ImportError:
    print('Normal dict')
    wordcount = dict()
    for word in cap_words:
        wordcount[word] = (wordcount[word] + 1
                           if word in wordcount
                           else 1)
    print(sorted(wordcount.items(), key=lambda x: x[1], reverse=True)[:3])
I do not get why you would want to deal with different kinds of line termination via 'rU' mode; I would normally use a plain open(), as I did in the edited code above.
EDIT: You had words together with punctuation, so I cleaned those up with strip().
print "\n".join(sorted(["%d %s" % (lst.count(i), i) \
for i in set(lst) if i.istitle()])[-3:])
2 And
5 Cats
6 Jellicle
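A hedged note on this one-liner: it sorts the formatted strings, so the counts compare lexicographically and a count of 10 would sort before 2. A small sketch that sorts numerically instead (same lst variable assumed):
top3 = sorted((lst.count(i), i) for i in set(lst) if i.istitle())[-3:]
for count, word in top3:
    print count, word  # Python 2 print, matching the answer's style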
Here are some additional comments:
words = ""
for word in open('novel.txt', 'rU'):
words += word
words = words.split(' ')
words= list(words)
words = ('\n'.join(words)).split('\n')
can be replaced with:
text = open('novel.txt', 'rU').read() # read everything
wordlist = text.split() # split on all whitespace
But you don't use your 'must start with a capital letter' requirement yet. Time to add:
capwordlist = (word for word in wordlist if word.istitle())
istitle() means word[0].isupper() and word[1:].islower(). This means 'SO'.istitle() -> False.
That might work for you, but maybe you just want the word[0].isupper() instead.
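If you only want the first-letter check, a hedged variant of that generator (also guarding against empty strings, since the sample list ends with one) could be:
capwordlist = (word for word in wordlist if word and word[0].isupper())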
This part is good if you can't use collections.Counter (new in 2.7)
word_counter = {}
for word in capwordlist:
    if word in word_counter:
        word_counter[word] += 1
    else:
        word_counter[word] = 1

popular_words = sorted(word_counter, key=word_counter.get, reverse=True)
top_3 = popular_words[:3]
else this simply becomes:
from collections import Counter
word_counter = Counter(capwordlist)
top_3 = word_counter.most_common(3) # gives `word, count` pairs!
And this:
for i in range(3):
    print word_counter[top_3[i]], top_3[i]
can be this:
for word in top_3:
    print word_counter[word], word
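One caveat worth flagging (my addition, not the answer's): that last loop matches the popular_words[:3] version. If you take the Counter path, most_common(3) already returns (word, count) pairs, so a sketch of the print loop would be:
for word, count in top_3:  # top_3 from word_counter.most_common(3)
    print count, word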
One thing I'd avoid is reading all the words in before processing. It will work, but IMHO it's better not to do that if you don't need to, and you don't. Here's my solution (with elements liberally stolen from previous ones!), done with 2.6.2:
import sys

# a generator function which iterates over the words in a file
def words(f):
    for line in f:
        for word in line.split():
            yield word

# returns a generator expression filtering an iterator down to titlecase words
def titles(s):
    return (word for word in s if word.istitle())

# count the titlecase words in the file
count = {}
for word in titles(words(file(sys.argv[1]))):
    count[word] = count.get(word, 0) + 1

# build a list of tuples with the count for each word
countsAndWords = [(kv[1], kv[0]) for kv in count.iteritems()]

# put them in decreasing order
countsAndWords.sort()
countsAndWords.reverse()

# print the top three
for count, word in countsAndWords[:3]:
    print word, count
I do a sort of decorate-sort-undecorate on the counts rather than doing the sort with a comparator that does lookups in the count dictionary; it's less elegant, but I believe it will be faster. That's probably a sinful thing to do.
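For comparison, a sketch of the key-based sort the author alludes to (my addition, replacing the countsAndWords block above): it pulls the top three straight from the dict with a sort key that looks up the counts:
for word, n in sorted(count.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print word, n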
In general, word[0].isupper() will tell you if a word starts with an uppercase letter. Combine this into a list comprehension (or your loop):
[x for x in my_list if x[0].isupper()]
(assuming there are no empty strings)
and you get all words starting with an uppercase letter.
Since you aren't using Python2.7 and don't have Counter
from collections import defaultdict
counter = defaultdict(int)
words = ['Jellicle', 'Cats', 'are', 'black', 'and', 'white,', 'Jellicle', 'Cats', 'are', 'rather', 'small;', 'Jellicle', 'Cats', 'are', 'merry', 'and', 'bright,', 'And', 'pleasant', 'to', 'hear', 'when', 'they', 'caterwaul.', 'Jellicle', 'Cats', 'have', 'cheerful', 'faces,', 'Jellicle', 'Cats', 'have', 'bright', 'black', 'eyes;', 'They', 'like', 'to', 'practise', 'their', 'airs', 'and', 'graces', 'And', 'wait', 'for', 'the', 'Jellicle', 'Moon', 'to', 'rise.', '']
for word in (word for word in words if word and word[0].isupper()):  # skip the empty string at the end of the list
    counter[word] += 1
print counter
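To get the top three from that defaultdict (my addition as a sketch, since the question asks for the top 3):
top_3 = sorted(counter.items(), key=lambda kv: kv[1], reverse=True)[:3]
for word, n in top_3:
    print n, word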
You could use itertools:
import itertools
words = ['Jellicle', 'Cats', 'are', 'black', 'and', 'white,', 'Jellicle', 'Cats', 'are', 'rather', 'small;', 'Jellicle', 'Cats', 'are', 'merry', 'and', 'bright,', 'And', 'pleasant', 'to', 'hear', 'when', 'they', 'caterwaul.', 'Jellicle', 'Cats', 'have', 'cheerful', 'faces,', 'Jellicle', 'Cats', 'have', 'bright', 'black', 'eyes;', 'They', 'like', 'to', 'practise', 'their', 'airs', 'and', 'graces', 'And', 'wait', 'for', 'the', 'Jellicle', 'Moon', 'to', 'rise.', '']
capwords = (word for word in words if len(word) > 1 and word[0].isupper())
capwordssorted = sorted(capwords)
wordswithcounts = ((k,len(list(g))) for (k,g) in itertools.groupby(capwordssorted))
print sorted(wordswithcounts,key=lambda x:x[1],reverse=True)[:3]