removing duplicates from a list of strings - python

I am trying to read a file, make a list of words, and then make a new list of words with the duplicates removed.
I am not able to append the words to the new list. It says 'NoneType' object has no attribute 'append'.
Here is the bit of code:
fh = open("gdgf.txt")
lst = list()
file = fh.read()
for line in fh:
    line = line.rstrip()
file = file.split()
for word in file:
    if word in lst:
        continue
    lst = lst.append(word)
print lst

Python's list.append returns None, so a set will help here to remove duplicates.
In [102]: mylist = ["aa","bb","cc","aa"]
In [103]: list(set(mylist))
Out[103]: ['aa', 'cc', 'bb']
Hope this helps.
In your case:
file = fh.read()
After this, fh is exhausted (the read position is at the end of the file), so iterating over it yields nothing. You have to do the remaining operations on the variable file.
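To see the exhaustion in action, here is a minimal sketch (file name taken from the question):
fh = open("gdgf.txt")
content = fh.read()      # consumes the whole file; fh is now at EOF
for line in fh:          # this loop body never runs: fh is exhausted
    print("never reached")
words = content.split()  # operate on the string you already read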

You are replacing your list with the return value of the append function, which is not a list. Simply do this instead:
lst.append(word)

append modifies the list it was called on, and returns None. I.e., you should replace the line:
lst = lst.append(word)
with simply
lst.append(word)

fh = open("gdgf.txt")
file = fh.read()
for line in fh:
    line = line.rstrip()
lst = []
file = file.split()
for word in file:
    lst.append(word)
print(set(lst))

append appends an item in place, which means it does not return any value. You should get rid of lst = when appending word:
if word in lst:
    continue
lst.append(word)

list.append() is an in-place append; it returns None (as it does not return anything), so you do not need to assign the return value of list.append() back to the list. Just change the line lst = lst.append(word) to:
lst.append(word)
Another issue: you first call .read() on the file and then iterate over its lines. You do not need to do both; just remove the iteration part.
Also, an easy way to remove duplicates, if you are not interested in the order of the elements, is to use a set.
Example -
>>> lst = [1,2,3,4,1,1,2,3]
>>> set(lst)
{1, 2, 3, 4}
So, in your case you can initialize lst as lst = set() and then use lst.add(element); you would not even need to check whether the element already exists. At the end, if you really want the result as a list, do list(lst) to convert it. (Though when doing this, consider renaming the variable to something that makes it clear it is a set, not a list.)
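Putting that together, a minimal sketch of the whole program using a set (the variable name words is illustrative):
words = set()
with open("gdgf.txt") as fh:
    for word in fh.read().split():
        words.add(word)
print(list(words))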

append() does not return anything, so don't assign it. lst.append() is
enough.
Modified Code:
fh = open("gdgf.txt")
lst = []
file = fh.read()
for line in fh:
    line = line.rstrip()
file = file.split()
for word in file:
    if word in lst:
        continue
    lst.append(word)
print lst
I suggest you use set() , because it is used for unordered collections of unique elements.
fh = open("gdgf.txt")
file = fh.read()
for line in fh:
    line = line.rstrip()
file = file.split()
lst = list(set(file))
print lst

You can simplify your code by reading and adding the words directly to a set. Sets do not allow duplicates, so you'll be left with just the unique words:
words = set()
with open('gdgf.txt') as f:
    for line in f:
        for word in line.split():
            words.add(word.strip())
print(words)
The problem with the logic above is that words that end in punctuation will be counted as separate words:
>>> s = "Hello? Hello should only be twice in the list"
>>> set(s.split())
set(['be', 'twice', 'list', 'should', 'Hello?', 'only', 'in', 'the', 'Hello'])
You can see you have Hello? and Hello.
You can enhance the code above by using a regular expression to extract words, which will take care of the punctuation:
>>> set(re.findall(r"(\w[\w']*\w|\w)", s))
set(['be', 'list', 'should', 'twice', 'only', 'in', 'the', 'Hello'])
Now your code is:
import re
with open('gdgf.txt') as f:
    words = set(re.findall(r"(\w[\w']*\w|\w)", f.read(), re.M))
print(words)
Even with the above you'll have duplicates, since Word and word will be counted as two distinct words. You can enhance it further if you want to store a single version of each word.
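One way to do that, assuming case-insensitive uniqueness is what you want, is to lowercase each match before adding it to the set (a sketch reusing the regex above):
import re

with open('gdgf.txt') as f:
    words = set(w.lower() for w in re.findall(r"(\w[\w']*\w|\w)", f.read()))
print(words)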

I think the solution to this problem can be more succinct:
import string
with open("gdgf.txt") as fh:
    word_set = set()
    for line in fh:
        line = line.split()
        for word in line:
            # For each character in string.punctuation, iterate and remove
            # it from the word by replacing it with '', an empty string
            for char in string.punctuation:
                word = word.replace(char, '')
            # Add the word to the set
            word_set.add(word)
word_list = list(word_set)
# Sort the list to be fastidious.
word_list.sort()
print(word_list)
One thing about counting words with split is that you are splitting on whitespace, so this will make "words" out of things like "Hello!" and "Really?". The words will include punctuation, which is probably not what you want.
Your variable names could be a bit more descriptive, and your indentation seems a bit off, but I think that may be a matter of cutting/pasting into the posting. I have tried to name the variables I used based on the logical structure each one interacts with (file, line, word, char, and so on).
To see the contents of string.punctuation, you can launch IPython, import string, and simply enter string.punctuation to see what it contains.
It is also unclear if you need to have a list, or if you just need a data structure that contains a unique list of words. A set or a list that has been properly created to avoid duplicates should do the trick. Following on with the question, I used a set to uniquely store elements, then converted that set to a list trivially, and later sorted this alphabetically.
Hope this helps!

Related

The output is unsorted and sorting on the second value is not possible. Is there a special method to sort on the second value?

This program takes a text and counts how many times each word appears in it:
import string
with open("romeo.txt") as file:  # opens the file with text
    lst = []
    d = dict()
    uniquewords = open('romeo_unique.txt', 'w')
    for line in file:
        words = line.split()
        for word in words:  # loops through all words
            word = word.translate(str.maketrans('', '', string.punctuation)).upper()  # removes the punctuation
            if word not in d:
                d[word] = 1
            else:
                d[word] = d[word] + 1
            if word not in lst:
                lst.append(word)  # append only this unique word to the list
                uniquewords.write(str(word) + '\n')  # write the unique word to the file
print(d)
Dictionaries with default value
The code snippet:
d = dict()
...
if word not in d:
    d[word] = 1
else:
    d[word] = d[word] + 1
has become so common in Python that a subclass of dict was created to get rid of it. It goes by the name defaultdict and can be found in the module collections.
Thus we can simplify your code snippet to:
from collections import defaultdict
d = defaultdict(int)
...
d[word] = d[word] + 1
No need for this manual if/else test; if word is not in the defaultdict, it will be added automatically with initial value 0.
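As a sketch, the counting loop from the question then shrinks to this (file name taken from the question):
from collections import defaultdict

d = defaultdict(int)
with open('romeo.txt') as file:
    for line in file:
        for word in line.split():
            d[word] += 1  # a missing key starts at 0 automatically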
Counters
Counting occurrences is also something that is frequently useful; so much so that there exists a subclass of dict called Counter in module collections. It will do all the hard work for you.
from collections import Counter
import string

with open('romeo.txt') as input_file:
    counts = Counter(
        word.translate(str.maketrans('', '', string.punctuation)).upper()
        for line in input_file
        for word in line.split()
    )
with open('romeo_unique.txt', 'w') as output_file:
    for word in counts:
        output_file.write(word + '\n')
As far as I can tell from the documentation, Counters are not guaranteed to be ordered by number of occurrences by default; however:
when I use them in the interactive Python interpreter, they are always printed in decreasing order of occurrences;
they provide a method .most_common() which is guaranteed to return elements in decreasing order of occurrences.
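For example, a small interactive sketch:
>>> from collections import Counter
>>> Counter("the cat and the hat".split()).most_common(2)
[('the', 2), ('cat', 1)]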
In Python, standard dictionaries are an unsorted data type. Assuming that by sorting your output you mean sorting d:
A couple of remarks first:
You are not sorting explicitly (e.g. by using sorted) by a given property. Dictionaries might appear to have a "natural" order when iterated (e.g. for printing), but it is better to sort a dict explicitly.
You check the existence of a word via the lst variable, which is very slow: checking a list requires scanning entries until a match is found (or the list ends). It would be much better to check for existence in a dict.
I'm assuming by "the second column" you mean the information for each word that counts the order in which the word first appeared.
With that, I'd change the code to also record the word index of the first occurrence of each word, which then allows sorting on exactly that.
Edit: Fixed the code. The sorting yielded by sorted sorts by key, not value. That's what I get for not testing code before posting an answer.
import string
from operator import itemgetter

with open("romeo.txt") as file:  # opens the file with text
    first_occurence = {}
    uniqueness = {}
    word_index = 1
    uniquewords = open('romeo_unique.txt', 'w')
    for line in file:
        words = line.split()
        for word in words:  # loops through all words
            word = word.translate(str.maketrans('', '', string.punctuation)).upper()  # removes the punctuation
            if word not in uniqueness:
                uniqueness[word] = 1
            else:
                uniqueness[word] += 1
            if word not in first_occurence:
                first_occurence[word] = word_index
                uniquewords.write(str(word) + '\n')  # write the unique word to the file
            word_index += 1
print(sorted(uniqueness.items(), key=itemgetter(1)))
print(sorted(first_occurence.items(), key=itemgetter(1)))

I don't know how to add a filter function to my program to remove duplicate elements, especially '(' and ')', and repeated words.

def doc_read_alpha():
    with open('input.txt', 'r') as file:
        for line in file:
            f_contents = file.read()
            lines = line.split()
            lines = sorted(lines)
The above is the code I use to iterate through my file's contents and separate each word into an element of the list lines. I am having trouble adding a filter function which would remove duplicates from the list.
    input_file_string = " ".join(lines)
    return lines

def main():
    print(doc_read_alpha())

if __name__ == '__main__':
    main()
If I understand you correctly, you want to have a list of unique words/tokens extracted from a text. You can achieve that with a "set" instead of a list, which behaves almost identically to a list but does not allow duplicate entries.
It is as simple as:
lines = set(line.split())
If you want to avoid duplicates in your list, a better solution would be to use a set. Each element in a set is unique and does not repeat.
You can convert your list into a set using:
s = set(lines)
However, sets are unordered, and while that makes it faster to check whether a value belongs to the set, it means you cannot find an element using an index.
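A quick interactive illustration of the trade-off:
>>> s = set(["b", "a", "b"])
>>> "a" in s    # fast membership test
True
>>> s[0]        # sets do not support indexing
Traceback (most recent call last):
  ...
TypeError: 'set' object is not subscriptable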
def doc_read_alpha():
    with open('text.txt', 'r') as file:
        s = set()
        for line in file.readlines():
            s.update(line.split())
    return s
I don't know if this is the best solution, but it works. However, this method does count "word" and "word." as two different words. To avoid this you have to strip all non-letter characters.
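One way to do that stripping is with string.punctuation; a sketch (assuming the same text.txt as above, and that only leading/trailing punctuation matters):
import string

def doc_read_alpha():
    with open('text.txt', 'r') as file:
        s = set()
        for line in file:
            # strip surrounding punctuation so "word." counts as "word"
            s.update(w.strip(string.punctuation) for w in line.split())
    return s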

Working with list and split method in list in python

I have to open a file and read it line by line. For each line, I have to split the line into a list of words using the split() method. The program should build a list of words: for each word on each line, check whether the word is already in the list and, if not, append it. When the program completes, print the resulting words in alphabetical order.
fname = input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
    for each in line:
        word = line.split()
        if word not in lst:
            lst.append(word)
print(lst)
After this I am getting 4 different lists, but I am required to get a single list, and I am not able to get that.
In one line: build a set in a set comprehension using split on each line, then sort the set into a list, using key=str.casefold to sort case-insensitively.
with open(fname) as f:
    result = sorted({word for line in f for word in line.split()}, key=str.casefold)
This is particularly efficient since you don't have to use in on your existing list, which performs a linear search (very slow if the list is big).
If the file contains punctuation, that won't work very well, because split won't remove it. Use a regex in that case (with import re):
result = sorted({word for line in f for word in re.split(r"\W+", line) if word}, key=str.casefold)
(you have to add an extra non-empty filter)
Try this:
fname = input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
    for each in line:
        word = line.split()
        for wrd in word:
            if wrd not in lst:
                lst.append(wrd)
print(lst)
There are several problems in the code
First, it's unclear what you want to do with for each in line. I simply removed it.
Second, with word = line.split(), you get word as a list of words. You need to iterate through the list of words and perform actions on individual words.
Then, use sorted() to sort the words.
The refined code looks like this:
fname = input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
    words = line.split()
    for word in words:
        if word not in lst:
            lst.append(word)
fh.close()
print(sorted(lst))
Side note: You're not closing the file. Use fh.close() (as I added above) or use with open(fname) as fh: which will close the file for you after leaving the with block.
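For example, a minimal sketch of the same loop with a context manager (reusing fname from above; the file is closed automatically when the block ends):
lst = []
with open(fname) as fh:
    for line in fh:
        for word in line.split():
            if word not in lst:
                lst.append(word)
print(sorted(lst))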
Same style as the answers before, but rather than looping over the file object directly, I use the readlines() function.
The line lines = fh.readlines() returns a full list of the lines.
Then you look at each line with for line in lines, and at each word in the line with for word in words.
fname = input("Enter file name: ")
fh = open(fname)
lst = list()
lines = fh.readlines()
for line in lines:
    words = line.split()
    for word in words:
        if word not in lst:
            lst.append(word)
You can do this like this:
with open(fname) as fh:
    unique_words = set(fh.read().split())
To turn that set into a list, use:
unique_words = list(unique_words)
and to sort that list:
unique_words.sort()
You should be able to adapt this idea to your problem.
words.txt
In linguistics, a word is the smallest element
that can be uttered in isolation with objective
or practical meaning. This contrasts deeply with
a morpheme, which is the smallest unit of meaning
but will not necessarily stand on its own.
import re

with open('words.txt', 'r', encoding='utf8') as f:
    words = f.read()

all_words = re.findall(r'\w+', words)
result = sorted(set(all_words), key=str.lower)
print(result)
['a', 'be', 'but', 'can', 'contrasts', 'deeply', 'element', 'in', 'In',
'is', 'isolation', 'its', 'linguistics', 'meaning', 'morpheme',
'necessarily', 'not', 'objective', 'of', 'on', 'or', 'own', 'practical',
'smallest', 'stand', 'that', 'the', 'This', 'unit', 'uttered', 'which',
'will', 'with', 'word']
Is this an interview question, by chance? I am asking because of this part:
For each word on each line check to see if the word is already in the list
If this is an interview question, it could very well be that your interviewer does not want you to use language features to solve this, but rather to implement a searching algorithm or structure, for example a BST.
Assuming this is not the case, however, let's go over your code. First, I would recommend you switch from opening the file like you do, fh = open(fname), to using a context (with). Or, at least, close the file handle.
word_dictionary = dict()

with open(file_name) as source:
    for line in source:
        for word in line.split():
            if word_dictionary.get(word) is None:
                word_dictionary[word] = True

word_list = [word for word, _ in word_dictionary.items()]
word_list = sorted(word_list)
print(word_list)
Let's go over the code together.
First, we define a dictionary, called word_dictionary. A dictionary, or hash table, is a data structure that allows you to perform lookup operations in constant, O(1), time. That is to say, very fast.
Second, we open the file containing the text: with open(file_name) as source:. We call this a context. It is a convenient way of dealing with files (and not only files) that automatically takes care of resource management. I won't go into detail here, but context managers are well worth reading about.
We begin by reading each line, and for each line, we read each word.
for line in source:
    for word in line.split():
For each word, we check whether we have already encountered it. We do this using the .get() method of dictionaries. This method checks whether the argument exists as a key in the dictionary. If it does, it returns the value associated with that key; otherwise, it returns None.
if word_dictionary.get(word) is None:
    word_dictionary[word] = True
This says that if we encounter a word we have not seen already, we record seeing it. Note that it is not necessary to use True as a value; anything other than None will work.
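A tiny interactive sketch of .get():
>>> d = {"hello": True}
>>> d.get("hello")
True
>>> d.get("world") is None
True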
Once we have seen every word in the text, we do this:
word_list = [word for word, _ in word_dictionary.items()]
Using .items() we iterate over the key, value pairs of the dictionary. That is to say, if we had a dictionary d = {0: "a", 1: "b", 2: "c"}, calling for key, value in d.items() would yield key = 0, value = "a" first, then key = 1, value = "b", and finally key = 2, value = "c".
In our case, we are not interested in the value. That is why we use _. We are only interested in the word.
What the list comprehension leaves us with is a list of all the words you encountered, in no particular order. This is because classic dictionaries made no guarantee about the order of the key, value pairs (CPython 3.7+ does preserve insertion order, but that is still not sorted order).
Therefore, we need to sort.
word_list = sorted(word_list)
Since word_list is a list of strings, and strings are comparable by default, it will sort correctly.
Now, there are several things to consider. First, .split() will consider 'this' and 'this!' to be different words. You may or may not want this.
If you do not want this, you can use the 'string' module to check against punctuation or you can use a regex to clean up your word.
The second thing to consider is capitalisation. You make no mention of this. Are you allowed to lower or capitalise your text? If you are, your life will be easier. You lowercase everything and your problems go away.
If you are not allowed to lowercase your text, you will have to change the sort call, because capital letters compare as "smaller" than lowercase ones.
>>> "A" < "a"
True
This will manifest in the following way.
>>> sorted(["b", "a", "C"])
['C', 'a', 'b']
Most likely you expect ["a", "b", "C"] here. In this case, I recommend you look into the key argument for sort (the old cmp argument no longer exists in Python 3).
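For example, a case-insensitive sort via key:
>>> sorted(["b", "a", "C"], key=str.lower)
['a', 'b', 'C']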

text file to dictionary as keys

How can I convert a file's words into dictionary keys directly, without using a list? Here is my code:
def abc(f):
    with open(f, "r") as f:
        dict = {}
        lst = []
        for line in f:
            lst += line.strip().split()
        for item in lst:
            dict[item] = ""
        return dict

print(abc("file.txt"))
Example of input "file.txt":
abc def
ghi jkl mno
pqr
Output:
{"abc":"", "def":"", "ghi":"", "jkl":"", "mno":"", "pqr":""}
The output of split() is a list. So, using it, I read the data from the file, store it in a list, and then use the list items as the dictionary's keys. My question is: how can I skip the list and, after reading the data from the file, put the words directly into the dictionary?
If you really want a dictionary, build it in one go using a dict comprehension on the split words:
def abc(f):
    with open(f) as fh:
        return {word: "" for line in fh for word in line.split()}
but you probably want a set instead, since there are no values in your dict:
def abc(f):
    with open(f) as fh:
        return {word for line in fh for word in line.split()}
I suspect you want to count the words; in that case:
import collections

def abc(f):
    with open(f) as fh:
        return collections.Counter(word for line in fh for word in line.split())
Note that split doesn't split on punctuation, so if the text contains some, you'll have duplicate words unless you replace
for word in line.split()
with
for word in re.split(r"\W", line) if word
(using the re module, which has the slight drawback of generating empty fields at the start/end; that is easily fixed by filtering on word).
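A small interactive sketch of the difference (the empty strings come from delimiters at the edges of the string):
>>> import re
>>> line = "Hello? Hello!"
>>> re.split(r"\W", line)
['Hello', '', 'Hello', '']
>>> [word for word in re.split(r"\W", line) if word]
['Hello', 'Hello']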
I don't know what you're trying to do, nor what the expected output really is, but the following is functionally equivalent to your code snippet (and a bit cleaner too):
def abc(f):
    dic = {}
    with open(f, "r") as f:
        for line in f:
            # split() already discards leading and trailing
            # whitespace, so strip() is not needed here
            for item in line.split():
                dic[item] = ""
    return dic

Converting sentences in a file to word tokens in a list

I'm using python to convert the words in sentences in a text file to individual tokens in a list for the purpose of counting up word frequencies. I'm having trouble converting the different sentences into a single list. Here's what I do:
f = open('music.txt', 'r')
sent = [word.lower().split() for word in f]
That gives me the following list:
[['party', 'rock', 'is', 'in', 'the', 'house', 'tonight'],
['everybody', 'just', 'have', 'a', 'good', 'time'],...]
Since the sentences in the file were in separate lines, it returns this list of lists and defaultdict can't identify the individual tokens to count up.
I tried the following list comprehension to isolate the tokens in the different lists and return them to a single list, but it returns an empty list instead:
sent2 = [[w for w in word] for word in sent]
Is there a way to do this using list comprehensions? Or perhaps another easier way?
Just use a nested loop inside the list comprehension:
sent = [word for line in f for word in line.lower().split()]
There are some alternatives to this approach, for example using itertools.chain.from_iterable(), but I think the nested loop is much easier in this case.
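For comparison, a sketch of the itertools variant (assuming the same music.txt; it yields the same flat list):
from itertools import chain

with open('music.txt') as f:
    sent = list(chain.from_iterable(line.lower().split() for line in f))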
Just read the entire file into memory, as a single string, and apply split once to that string.
There is no need to read the file line by line in such a case.
Therefore your code can be as short as:
sent = open("music.txt").read().split()
(A few niceties like closing the file and checking for errors make the code a little larger, of course.)
Since you want to be counting word frequencies, you can use the collections.Counter class for that:
from collections import Counter

counter = Counter()
for word in open("music.txt").read().split():
    counter[word] += 1
List comprehensions can do the job, but they accumulate everything in memory. For large inputs this could be an unacceptable cost. The solution below does not accumulate large amounts of data in memory, even for large files. The final product is a dictionary of the form {token: occurrences}.
import itertools

def distinct_tokens(filename):
    tokendict = {}
    f = open(filename, 'r')
    # itertools.imap is Python 2; on Python 3 use the built-in map instead
    tokens = itertools.imap(lambda L: iter(L.lower().split()), f)
    for tok in itertools.chain.from_iterable(tokens):
        if tok in tokendict:
            tokendict[tok] += 1
        else:
            tokendict[tok] = 1
    f.close()
    return tokendict
