Unique words in a text file - Python

I need to insert unique words into a text file: one word per line, and each word must be unique within the whole file.
Whenever a new word arrives in the variable word, I need a way to check whether it already exists in the file. If it does, another word should be picked, and the check repeated until a word that is not yet in the file comes along.
How can I do that?
By the way, I was doing:
import mmap

newword = "learn"
f = open('wordlist.txt')
s = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
while s.find(newword) != -1:
    return newword
else:
    return newword
Thanks in advance.

Assuming the file isn't too large, you could just read it all to memory, and check if the word is there:
def word_in_file(filename, word):
    with open(filename) as f:
        words = f.read().splitlines()
    return word in words
If you're going to do this more than once, you'd be better off keeping the words list in memory and appending each newly added word to it, instead of re-reading the file every time.
Additionally, building a set from words will speed up the membership check and may be worth the "effort" if you're adding multiple words to the file, as in the sketch below.
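For example, here is a minimal sketch of that approach (the pick_word() helper is hypothetical and stands in for wherever your candidate words come from):

import random

def pick_word():
    # Hypothetical source of candidate words; replace with your own logic.
    return random.choice(["learn", "python", "word", "unique"])

def add_unique_word(filename="wordlist.txt"):
    with open(filename) as f:
        words = set(f.read().splitlines())   # a set makes membership checks fast

    word = pick_word()
    while word in words:                     # keep picking until the word is not in the file
        word = pick_word()

    with open(filename, "a") as f:
        f.write(word + "\n")                 # one word per line
    words.add(word)                          # keep the in-memory set up to date
    return word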

Related

Finding number of words in a file in python

I'm new to Python and attempting an exercise where I open a .txt file and then read its contents (probably straightforward for most, but I'll admit I'm struggling a bit).
I opened my file and used .read() to read it. I then removed all punctuation from the text.
Next I created a for loop. In this loop I used .split() and added to a running total:
words = words + len(characters)
with words previously defined as 0 outside the loop and characters being the result of the split at the beginning of the loop.
Very long story short, the problem I'm having now is that instead of adding whole words to my counter, each individual character is being added. Is there anything I can do to fix that in my for loop?
my_document = open("book.txt")
readTheDocument = my_document.read
comma = readTheDocument.replace(",", "")
period = comma.replace(".", "")
stripDocument = period.strip()
numberOfWords = 0
for line in my_document:
    splitDocument = line.split()
    numberOfWords = numberOfWords + len(splitDocument)
print(numberOfWords)
A more Pythonic way is to use with:
with open("book.txt") as infile:
count = len(infile.read().split())
Bear in mind that by using .split() you are not really getting grammatical words, only word-like fragments. If you want proper words, use the nltk module:
import nltk
with open("book.txt") as infile:
count = len(nltk.word_tokenize(infile.read()))
Just open the file and split to get the count of words.
file = open("path/to/file/name.txt", "r+")
count = 0
for word in file.read().split():
    count = count + 1
print(count)

Word Count from File: Is it having problems opening the file, or have I coded it incorrectly?

Problem: Program seems to get stuck opening a file to read.
My problem is that at the very beginning the program seems to be broken. It just displays
[(1, 'C:\Users\....\Desktop\Sense_and_Sensibility.txt')]
over and over, never-ending.
(NOTE: .... is a replacement for the purpose of posting because my computer username is my full name).
I'm not sure if I've coded this completely incorrectly, or if it's having problems opening the file. Any help is appreciated.
The program should:
1: open a file, replace all punctuation with spaces, change all words to lowercase, then store them in a dictionary.
2: look at a list of words (stop words) that will be removed from the original dictionary.
3: count the remaining words and sort based on frequency.
fname = r"C:\Users\....\Desktop\Sense_and_Sensibility.txt" # file to read
swfilename = r"C:\Users\....\Desktop\stopwords.txt" # words to delete
with open(fname) as file: # have the program run the file
for line in file: # loop through
fname.replace('-.,"!?', " ") # replace punc. with space
words = fname.lower() # make all words lowercase
word_list = fname.split() # separate the words, store
word_dict = {} # create a dictionary
with open(swfilename) as delete: # open stop word list
for line in delete:
sw_list = swfilename.split() # separate the words, store them
sw_dict = {}
for key in sw_dict:
word_dict.pop(key, None) # delete common words
for word in word_list: # loop through
word_dict[word] = word_dict.get(word, 0) + 1 # count frequency
word_freq = [] # create index
for key, value in word_dict.items(): # count occurrences
word_freq.append((value, key)) # append freq list
word_freq.sort(reverse=True) # sort the words by freq
print(word_freq) # print most to least
Importing file paths on Windows in Python is somewhat different from Mac and Linux.
Just change the path from fname = r"C:\Users\....\Desktop\Sense_and_Sensibility.txt"
to fname = "C:\\Users\\....\\Desktop\\Sense_and_Sensibility.txt"
Use double backslashes.
There are a couple of issues with your code. I will only discuss the most obvious one, given that it is impossible to reproduce your exact observations because the input you are using is not accessible to readers.
I will first repeat your code verbatim and mark weak points with ??? followed by a number, which I address after the code.
fname = r"C:\Users\....\Desktop\Sense_and_Sensibility.txt" #file to read
swfilename = r"C:\Users\....\Desktop\stopwords.txt" #words to delete
with open(fname) as file: #???(1) have the program run the file
for line in file: #loop through
fname.replace ('-.,"!?', " ") #???(2) replace punc. with space
words = fname.lower() #???(3) make all words lowercase
word_list = fname.split() #separate the words, store
word_dict = {} #???(4) create a dictionary
with open(swfilename) as delete: #open stop word list
for line in delete:
sw_list = swfilename.split() #separate the words, store them
sw_dict = {}
for key in sw_dict:
word_dict.pop(key, None) #???(5) delete common words
for word in word_list: #???(6) loop through
word_dict[word] = word_dict.get(word, 0) + 1 #???(7) count frequency
word_freq = [] #???(8)create index
for key, value in word_dict.items(): #count occurrences
word_freq.append((value, key)) #append freq list
word_freq.sort(reverse = True) #sort the words by freq
print(word_freq) #print most to least
(1) (minor) file is the name of a Python built-in (in Python 2), and it is good practice not to reuse it for your own variables as you are doing.
(2) (major) .replace() replaces the exact string on the left with the exact string on the right, but what you would like is some sort of multi_replace(), which you could implement yourself (for example as a function) by consecutive calls to .replace(), e.g. in a loop or using functools.reduce(); see the sketch after this list.
(3) (major) fname contains the file name (the path, actually), not the content of the file you want to work with.
(4) (major) You are looping through the lines of the file, but if you create your word_list and word_dict for each line, you "overwrite" their content at each iteration. Also, word_dict is created empty and never filled at that point.
(5) (major) The logic you are trying to implement will not work on a dictionary, because dictionaries cannot contain multiple identical keys. A more effective approach is to create a filtered_list from word_list by excluding the stop words. The dictionary can then be used to implement a counter. At your level it may be worth learning how to implement a counter yourself, but keep in mind that collections.Counter from the standard library (accessible with import collections) does exactly what you want.
(6) (major) Given that at this point there is little useful left of your code: looping through the original list instead of the filtered list means the stop words are never actually excluded.
(7) (major) dictionary[key] can be used both for reading (which you do not do here) and for writing (which you do) the value associated with a specific key.
(8) (minor) Your approach to sorting by word frequency would work, but a much better approach is to use the key parameter of .sort() and sorted().
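To make points (2), (5), (6) and (8) concrete, here is a minimal sketch under the same assumptions (the truncated paths are kept as placeholders, and the exact punctuation set and variable names are illustrative rather than taken from the original code):

import collections
import functools

fname = r"C:\Users\....\Desktop\Sense_and_Sensibility.txt"   # file to read
swfilename = r"C:\Users\....\Desktop\stopwords.txt"          # stop words to delete

with open(fname) as book:
    text = book.read().lower()

# (2) "multi-replace": apply .replace() once per punctuation character
text = functools.reduce(lambda s, ch: s.replace(ch, " "), '-.,"!?', text)

with open(swfilename) as sw:
    stop_words = set(sw.read().split())

# (5)/(6) build a filtered list without the stop words, then count it
filtered_list = [w for w in text.split() if w not in stop_words]
word_counts = collections.Counter(filtered_list)

# (8) sort by frequency using the key parameter
word_freq = sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)
print(word_freq)  # most frequent first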
Hope this helps!

Replace words of a long document in Python

I have a dictionary dict with some words (about 2000) and I have a huge text, like the Wikipedia corpus, in text format. For each word that appears both in the dictionary and in the text file, I would like to replace it with word_1 (the word with a _1 suffix appended).
with open("wiki.txt",'r') as original, open("new.txt",'w') as mod:
for line in original:
new_line = line
for word in line.split():
if (dict.get(word.lower()) is not None):
new_line = new_line.replace(word,word+"_1")
mod.write(new_line)
This code creates a new file called new.txt with the words that appear in the dictionary replaced as I want.
This works for short files, but for the longer that I am using as input, it "freezes" my computer.
Is there a more efficient way to do that?
Edit for Adi219:
Your code seems to work, but there is a problem:
if a line is like Albert is a friend of Albert and my dictionary contains Albert, then after the for cycle the line becomes Albert_1_1 is a friend of Albert_1. How can I replace only the exact word that I want, and avoid repetitions like _1_1_1_1?
Edit2:
To solve the previous problem, I changed your code:
with open("wiki.txt", "r") as original, open("new.txt", "w") as mod:
for line in original:
words = line.split()
for word in words:
if dict.get(word.lower()) is not None:
mod.write(word+"_1 ")
else:
mod.write(word+" ")
mod.write("\n")
Now everything should work
A few things:
You could remove the declaration of new_line. Then, replace the new_line = new_line.replace(...) line with line = line.replace(...). You would also have to write(line) afterwards.
You could add words = line.split() and use for word in words: for the for loop, as this avoids calling .split() on every iteration through the words.
You could (manually(?)) split your large .txt file into multiple smaller files and have multiple instances of your program running on each file, and then you could combine the multiple outputs into one file. Note: You would have to remember to change the filename for each file you're reading/writing to.
So, your code would look like:
with open("wiki.txt", "r") as original, open("new.txt", "w") as mod:
for line in original:
words = line.split()
for word in words:
if dict.get(word.lower()) is not None:
line = line.replace(word, word + "_1")
mod.write(line)
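As a side note on the exact-word issue raised in the edits above, one common alternative (not part of the question or of this answer) is a single pass with a word-boundary regular expression, so each word is replaced exactly once and the original spacing and punctuation are preserved. A minimal sketch, assuming word_dict maps lowercase words to something non-None:

import re

word_dict = {"albert": True}  # hypothetical dictionary of words to tag

def tag_known_words(line):
    # \w+ matches a whole word; the replacement function runs once per match,
    # so a word can never be tagged twice.
    return re.sub(r"\w+",
                  lambda m: m.group(0) + "_1" if m.group(0).lower() in word_dict else m.group(0),
                  line)

with open("wiki.txt") as original, open("new.txt", "w") as mod:
    for line in original:
        mod.write(tag_known_words(line))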

How to input a line word by word in Python?

I have multiple files, each containing a single line with, say, ~10M numbers. I want to check each file and print a 0 for each file that has repeated numbers and 1 for each that doesn't.
I am using a list for counting frequency. Because of the large amount of numbers per line I want to update the frequency after accepting each number and break as soon as I find a repeated number. While this is simple in C, I have no idea how to do this in Python.
How do I input a line in a word-by-word manner without storing (or taking as input) the whole line?
EDIT: I also need a way for doing this from live input rather than a file.
Read the line, split it, and copy the result into a set. If the size of the set is less than the size of the list, the file contains repeated elements.
with open('filename', 'r') as f:
    for line in f:
        # Here is where you do what I said above
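For instance, filling in the comment above, a minimal sketch of that check (assuming whitespace-separated numbers, and printing 0 for a file with repeats and 1 otherwise, as the question asks) could be:

with open('filename', 'r') as f:
    for line in f:
        numbers = line.split()
        # duplicates exist exactly when the set is smaller than the list
        print(0 if len(set(numbers)) < len(numbers) else 1)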
To read the file word by word, try this
import itertools

def readWords(file_object):
    word = ""
    # itertools.imap is Python 2 only; the built-in map is already lazy in Python 3
    for ch in itertools.takewhile(lambda c: bool(c), map(file_object.read, itertools.repeat(1))):
        if ch.isspace():
            if word:  # in case of multiple spaces
                yield word
                word = ""
            continue
        word += ch
    if word:
        yield word  # handles last word before EOF
Then you can do:
with open('filename', 'r') as f:
    for num in map(int, readWords(f)):
        # Store the numbers in a set, and use the set to check if the number already exists
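Filling in that comment, a minimal sketch of the early-exit duplicate check (again printing 0 for a file with repeats and 1 otherwise) might look like:

with open('filename', 'r') as f:
    seen = set()
    has_duplicate = False
    for num in map(int, readWords(f)):
        if num in seen:            # repeated number: stop reading immediately
            has_duplicate = True
            break
        seen.add(num)
    print(0 if has_duplicate else 1)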
This method should also work for streams because it only reads one byte at a time and yields one whitespace-delimited token at a time from the input stream.
After giving this answer, I've updated this method quite a bit. Have a look:
https://gist.github.com/smac89/bddb27d975c59a5f053256c893630cdc
Reading a file word by word, the way you are asking, is not really possible in Python, I guess. Something like this can be done instead:
f = open('words.txt')
for word in f.read().split():
    print(word)

How to assign a word from a text file to a variable in Python

I have a text file and I need to assign a random word from this text file (each word is on a separate line) to a variable in Python. Then I need to remove this word from the text file.
This is what I have so far.
with open("words.txt") as f: #Open the text file
wordlist = [x.rstrip() for x in f]
variable = random.sample(wordlist,1) #Assigning the random word
print(variable)
Use random.choice to pick a single word:
variable = random.choice(wordlist)
You can then remove it from the word list by another comprehension:
new_wordlist = [word for word in wordlist if word != variable]
(You can also use filter for this part)
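For example, the filter version (equivalent to the comprehension above) might look like:

new_wordlist = list(filter(lambda word: word != variable, wordlist))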
You can then save that word list to a file by using:
with open("words.txt", 'w') as f: # Open file for writing
f.write('\n'.join(new_wordlist))
If you want to remove just a single instance of the word you should choose an index to use. See this answer.
If you need to handle duplicates, and it's not acceptable to reshuffle the list every time, there's a simple solution: Instead of just randomly picking a word, randomly pick an index. Like this:
index = random.randrange(len(wordlist))
word = wordlist.pop(index)

with open("words.txt", 'w') as f:
    f.write('\n'.join(wordlist))  # wordlist no longer contains the popped word
Or, alternatively, use enumerate to pick both at once:
index, word = random.choice(list(enumerate(wordlist)))  # enumerate yields (index, word) pairs
del wordlist[index]

with open("words.txt", 'w') as f:
    f.write('\n'.join(wordlist))
Rather than random.choice as Reut suggested, I would do this because it works even when the list contains duplicate words (only one instance is removed):
random.shuffle(wordlist)  # shuffle the word list
theword = wordlist.pop()  # pop the last element (any element is equally likely after shuffling)
