python dictionary function, textfile - python

I would like to define a function scaryDict() which takes one parameter (a text file) and returns the words from the text file in alphabetical order, basically producing a dictionary, but without any one- or two-letter words.
Here is what I have so far. It isn't much, but I don't know the next step:
def scaryDict(fileName):
    inFile = open(fileName,'r')
    lines = inFile.read()
    line = lines.split()
    myDict = {}
    for word in inFile:
        myDict[words] = []
    #I am not sure what goes between the line above and below
    for x in lines:
        print(word, end='\n')

You are doing fine up to line = lines.split(). But your for loop must iterate over the line list, not over inFile.
for word in line:
    if len(word) > 2: # Make sure to check the word length!
        myDict[word] = 'something'
I'm not sure what you want the dictionary for (maybe a word count?), but once you have it, you can get the words you added to it in alphabetical order with:
allWords = sorted(myDict) # sorted list of the dictionary's keys
In Python 3, myDict.keys() returns a view object that has no sort() method, so calling keys() followed by allWords.sort() only works in Python 2.

I would store all of the words into a set (to eliminate dups), then sort that set:
#!/usr/bin/python3
def scaryDict(fileName):
    with open(fileName) as inFile:
        return sorted(set(word
                          for line in inFile
                          for word in line.split()
                          if len(word) > 2))
scaryWords = scaryDict('frankenstein.txt')
print('\n'.join(scaryWords))

Also keep in mind that, as of Python 2.5, file objects work with the 'with' statement: they provide __enter__ and __exit__ methods, which prevents some issues (such as the file never getting closed).
with open(...) as f:
    for line in f:
        <do something with line>
Unique set
Sort the set
Now you can put it all together.

sorry that i am 3 years late : ) here is my version
def scaryDict():
    infile = open('filename', 'r')
    content = infile.read()
    infile.close()
    table = str.maketrans('.`/()|,\';!:"?=-', 15 * ' ')
    content = content.translate(table)
    words = content.split()
    new_words = list()
    for word in words:
        if len(word) > 2:
            new_words.append(word)
    new_words = list(set(new_words))
    new_words.sort()
    for word in new_words:
        print(word)
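A variant of the same idea, as a minimal sketch only: it takes the text as a string rather than a file name, so it can be tried without a file, and it uses string.punctuation in place of the hand-written character list. The name scary_words and the sample sentence are made up for illustration.

```python
import string

def scary_words(text):
    """Return the unique words of `text` longer than two letters, sorted."""
    # Map every ASCII punctuation character to a space before splitting
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.translate(table).split()
    # A set comprehension removes duplicates; sorted() returns a list
    return sorted({word for word in words if len(word) > 2})

print(scary_words("It is alive! The monster, it is alive."))
# ['The', 'alive', 'monster']
```

Note that the default sort puts capitalised words first; lowercasing the text beforehand is one way around that if it matters.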

Related

Is there a more efficient way to create an inverted index from a large text file?

def inverted_index(doc):
    words = word_count(doc)
    ln = 0
    for word in words:
        temp = []
        with open(doc) as file:
            for line in file:
                ln += 1
                li = line.split()
                if word in li:
                    temp.append(ln)
        words[word] = temp
    return words
I am trying to create an inverted index from a text file, where words is a dictionary with all the 19000 unique words in the file. The text file has around 5000+ lines. I want to iterate through the file and the dictionary to create an inverted index that maps each word to the line numbers on which it appears, but it is taking too long to run because of the nested for loops. Is there a more efficient way to do this?
Here is my approach to solve this, please read the notes below the code for some pragmatic tips.
def inverted_index(doc):
    # Map each lowercased word to the list of line numbers it appears on
    d = {}
    # Read the file once, line by line, instead of once per word
    with open(doc, encoding='utf8') as file:
        for line_number, line in enumerate(file, start=1):
            for item in line.lower().split():
                if item not in d:
                    d[item] = [line_number]
                elif d[item][-1] != line_number:
                    # record each line only once per word
                    d[item].append(line_number)
    return d
print(inverted_index('file.txt'))
I would suggest removing stopwords first before creating the inverted index for any meaningful analysis. You can use nltk package for that.
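To make the single-pass idea and the stopword suggestion concrete, here is a rough sketch. It works on any iterable of lines (a list here, an open file would do as well). The function name build_inverted_index and the tiny STOPWORDS set are assumptions for illustration; nltk provides a fuller stopword list.

```python
from collections import defaultdict

# Hypothetical, tiny stopword list for illustration only
STOPWORDS = {"a", "an", "the", "is", "of", "and"}

def build_inverted_index(lines, stopwords=STOPWORDS):
    """Map each word to the list of 1-based line numbers containing it."""
    index = defaultdict(list)
    for line_number, line in enumerate(lines, start=1):
        # set() so a word repeated on one line is recorded once
        for word in set(line.lower().split()):
            if word not in stopwords:
                index[word].append(line_number)
    return dict(index)

sample = ["The cat sat", "a cat and a dog", "the dog"]
print(build_inverted_index(sample))
```

Each line is split once, so the work grows with the total number of words in the file rather than with (unique words) x (lines).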

How to make a list of words from a text file? split() function is not working correctly [duplicate]

I have a text file which is named test.txt. I want to read it and return a list of all words (with newlines removed) from the file.
This is my current code:
def read_words(words_file):
    open_file = open(words_file, 'r')
    words_list =[]
    contents = open_file.readlines()
    for i in range(len(contents)):
        words_list.append(contents[i].strip('\n'))
    return words_list
    open_file.close()
Running this code produces this list:
['hello there how is everything ', 'thank you all', 'again', 'thanks a lot']
I want the list to look like this:
['hello','there','how','is','everything','thank','you','all','again','thanks','a','lot']
Depending on the size of the file, this seems like it would be as easy as:
with open(file) as f:
    words = f.read().split()
Replace the words_list.append(...) line in the for loop with the following:
words_list.extend(contents[i].split())
This will split each line on whitespace characters, and then add each element of the resulting list to words_list.
Or as an alternative method for rewriting the entire function as a list comprehension:
def read_words(words_file):
    return [word for line in open(words_file, 'r') for word in line.split()]
Here is how I'd write that:
def read_words(words_file):
    with open(words_file, 'r') as f:
        ret = []
        for line in f:
            ret += line.split()
        return ret
print(read_words('test.txt'))
The function can be somewhat shortened by using itertools, but I personally find the result less readable:
import itertools
def read_words(words_file):
    with open(words_file, 'r') as f:
        return list(itertools.chain.from_iterable(line.split() for line in f))
print(read_words('test.txt'))
The nice thing about the second version is that it can be made to be entirely generator-based and thus avoid keeping all of the file's words in memory at once.
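As an illustration of what "entirely generator-based" could look like, here is a small sketch. The name iter_words is hypothetical, and the input is a list of lines so the example runs without a file; an open file object could be passed in the same way.

```python
def iter_words(lines):
    """Yield words one at a time from any iterable of lines."""
    for line in lines:
        # yield from hands over each word without building one big list
        yield from line.split()

lines = ["hello there how is everything ", "thank you all", "again", "thanks a lot"]
for word in iter_words(lines):
    print(word)
```

Wrapping the call in list(...) gives back the flat word list from the question when the whole result is needed at once.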
There are several ways to do this. Here are a few:
If you don't care about repeated words:
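import itertools
def getWords(filepath):
    with open(filepath) as f:
        # chain.from_iterable flattens the per-line word lists into one sequence
        return list(itertools.chain.from_iterable(line.split() for line in f))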
If you want to return a list of words in which each word appears only once:
Note: this does not preserve the order of the words
def getWords(filepath):
    with open(filepath) as f:
        return {word for line in f for word in line.split()} # python2.7
        # python 2.6 alternative (no set comprehensions):
        # return set(word for line in f for word in line.split())
If you want a set --and-- want to preserve the order of words:
import itertools
def getWords(filepath):
    with open(filepath) as f:
        words = []
        pos = {}
        position = itertools.count()
        for line in f:
            for word in line.split():
                if word not in pos:
                    # remember the position of the first occurrence
                    pos[word] = next(position)
                    words.append(word)
    return sorted(words, key=pos.__getitem__)
If you want a word-frequency dictionary:
import collections
import itertools
def getWords(filepath):
    with open(filepath) as f:
        return collections.Counter(itertools.chain.from_iterable(line.split() for line in f))
Hope these help
The actual question has already been answered, but I would like to point out that the line open_file.close() will never be executed, because the function returns before reaching it. Try writing open_file.close() before the return statement.
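One way to guarantee the close without reordering the statements by hand is try/finally (the with statement does the same thing more compactly). A minimal sketch, with a temporary file created only so the example runs on its own:

```python
import os
import tempfile

def read_words(words_file):
    open_file = open(words_file, "r")
    try:
        words_list = []
        for line in open_file:
            words_list.extend(line.split())
        return words_list
    finally:
        # runs even though the function returns inside the try block
        open_file.close()

# Demonstration on a throwaway file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("thank you all\nagain\n")
print(read_words(tmp.name))
os.remove(tmp.name)
```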

find words in txt files Python 3

I'd like to create a program in Python 3 that finds how many times specific words appear in txt files and then builds an Excel table with these values.
I wrote this function, but when I call it with my input, the program doesn't work. This message appears: unindent does not match any outer indentation level
def wordcount(filename, listwords):
    try:
        file = open( filename, "r")
        read = file.readlines()
        file.close()
        for x in listwords:
            y = x.lower()
            counter = 0
            for z in read:
                line = z.split()
                for ss in line:
                    l = ss.lower()
                    if y == l:
                        counter += 1
            print(y , counter)
Now I try to call the function with a txt file and the word to find:
wordcount("aaa.txt" , 'word' )
As output I'd like to see:
word 4
Thanks to everybody!
Here is an example you can use to find the number of times a specific word occurs in a text file:
def searching(filename, word):
    counter = 0
    with open(filename) as f:
        for line in f:
            # count whole-word matches; "word in line" would also match substrings
            counter += line.split().count(word)
    return counter
x = searching("filename", "wordtofind")
print("wordtofind", x)
The output will be the word you are looking for and the number of times it occurs.
As short as possible:
def wordcount(filename, listwords):
    with open(filename) as file_object:
        file_words = file_object.read().split()
    # count whole words; str.count on the raw text would also match substrings
    return {word: file_words.count(word) for word in listwords}
for word, count in wordcount('aaa.txt', ['a', 'list', 'of', 'words']).items():
    print("Count of {}: {}".format(word, count))
Getting back to mij's comment about passing listwords as an actual list: if you pass a string to code that expects a list, Python will iterate over the string character by character, which can be confusing if this behaviour is unfamiliar.
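A short demonstration of that behaviour, plus one possible guard. The helper name normalize_words is made up for this sketch and is not part of any answer above.

```python
def normalize_words(listwords):
    """Return a list of words, wrapping a lone string in a list."""
    if isinstance(listwords, str):
        return [listwords]
    return list(listwords)

# Iterating over a string yields its characters, not the word itself
print(list("word"))             # ['w', 'o', 'r', 'd']
print(normalize_words("word"))  # ['word']
print(normalize_words(["a", "list"]))
```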

Python create a sorted list from file using append remove duplicate words

I am currently working through a Python course and I am lost on this after 6+ hours. The assignment directs the student to create a program where the user enters a file name and Python opens the file and builds a sorted word list without duplicates. The directions are very clear that a for loop and append must be used: "For each word on each line check to see if the word is already in the list and if not append it to the list."
fname = raw_input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
    line = line.strip()
    words = line.split()
for words in fh:
    if words in 1st:continue
    elif 1st.append
1st.sort()
print 1st
It would be easier to just use set() by itself, but this would be a good implementation per the assignment instructions. Tracking the words already seen in a set makes it much faster than a list-only version!
def get_exclusive_list(fname):
    words = []
    with open(fname, 'r') as file_d:
        for li in file_d:
            words.extend(li.split())
    data = []
    already_seen = set()
    for thing in words:
        if thing not in already_seen:
            data.append(thing)
            already_seen.add(thing)
    data.sort()  # the assignment asks for a sorted list
    return data
# The better implementation
def get_exclusive_list_improved(fname):
    with open(fname, 'r') as file_d:
        return sorted(set(file_d.read().split()))
I'm not sure what the following loop is supposed to do:
for words in fh:
    if words in 1st:continue
    elif 1st.append
It does nothing, because you have already exhausted the file fh before control reaches this part.
You should put an inner loop inside for line in fh: that goes over the words in the words list one by one and appends each to lst if it is not already there.
Also, you should call lst.append(word) (the list is named lst; 1st is not a valid identifier because names cannot start with a digit).
Also, I do not think your if..elif block is valid syntax either.
You should be doing something like this:
for line in fh:
    line = line.strip()
    words = line.split()
    for word in words:
        if word not in lst:
            lst.append(word)
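Put together with the final sort, the loop could look like the sketch below. It takes a list of lines instead of an open file so it can be run as-is; the function name sorted_unique_words and the sample lines are invented for the example.

```python
def sorted_unique_words(lines):
    lst = []
    for line in lines:
        for word in line.split():
            # append only words that are not in the list yet
            if word not in lst:
                lst.append(word)
    lst.sort()
    return lst

sample = ["but soft what light", "it is the east and", "soft light"]
print(sorted_unique_words(sample))
# ['and', 'but', 'east', 'is', 'it', 'light', 'soft', 'the', 'what']
```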

How to return unique words from the text file using Python

How do I return all the unique words from a text file using Python?
For example:
I am not a robot
I am a human
Should return:
I
am
not
a
robot
human
Here is what I've done so far:
def unique_file(input_filename, output_filename):
    input_file = open(input_filename, 'r')
    file_contents = input_file.read()
    input_file.close()
    word_list = file_contents.split()
    file = open(output_filename, 'w')
    for word in word_list:
        if word not in word_list:
            file.write(str(word) + "\n")
    file.close()
The text file the Python creates has nothing in it. I'm not sure what I am doing wrong
for word in word_list:
    if word not in word_list:
Every word is in word_list by definition, since word_list is exactly what the first line iterates over, so the condition is never true.
Instead of that logic, use a set:
unique_words = set(word_list)
for word in unique_words:
    file.write(str(word) + "\n")
Sets only hold unique members, which is exactly what you're trying to achieve.
Note that order won't be preserved, but you didn't specify if that's a requirement.
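If the order of first appearance does matter, one option (a sketch, not part of the answer above) is dict.fromkeys, since dictionary keys are unique and keep insertion order from Python 3.7 on. The helper name unique_in_order is hypothetical.

```python
def unique_in_order(words):
    # dict keys are unique and, from Python 3.7 on, keep insertion order
    return list(dict.fromkeys(words))

text = "I am not a robot\nI am a human"
print(unique_in_order(text.split()))
# ['I', 'am', 'not', 'a', 'robot', 'human']
```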
Simply iterate over the lines in the file and use set to keep only the unique ones.
from itertools import chain
def unique_words(lines):
    return set(chain(*(line.split() for line in lines if line)))
Then simply do the following to read all unique words from a file and print them:
with open(filename, 'r') as f:
    print(unique_words(f))
This seems to be a typical application for a collection:
...
import collections
d = collections.OrderedDict()
for word in wordlist: d[word] = None
# use this if you also want to count the words:
# for word in wordlist: d[word] = d.get(word, 0) + 1
for k in d.keys(): print(k)
You could also use a collections.Counter(), which would also count the elements you feed in. Before Python 3.7 the order of the words would get lost, though. I added a line for counting and keeping the order.
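For comparison, a small sketch of the Counter route on the example sentence from the question. It assumes Python 3.7 or newer, where Counter (a dict subclass) keeps the order in which words were first seen.

```python
from collections import Counter

words = "I am not a robot I am a human".split()
counts = Counter(words)

print(counts["am"])   # 2
print(list(counts))   # unique words, in first-seen order on Python 3.7+
```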
string = "I am not a robot\n I am a human"
list_str = string.split()
print(list(set(list_str)))
def unique_file(input_filename, output_filename):
    input_file = open(input_filename, 'r')
    file_contents = input_file.read()
    input_file.close()
    duplicates = []
    word_list = file_contents.split()
    file = open(output_filename, 'w')
    for word in word_list:
        if word not in duplicates:
            duplicates.append(word)
            file.write(str(word) + "\n")
    file.close()
This code loops over every word, and if it is not in a list duplicates, it appends the word and writes it to a file.
Using Regex and Set:
import re
# text is assumed to hold the contents of the file; the raw string keeps \w intact
words = re.findall(r'\w+', text.lower())
uniq_words = set(words)
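A runnable version of the same idea on the example from the question, as a sketch; the sample text with added punctuation is made up to show that \w+ leaves punctuation out of the words.

```python
import re

text = "I am not a robot.\nI am a human!"
# \w+ matches runs of letters, digits and underscores, so punctuation is dropped
words = re.findall(r"\w+", text.lower())
uniq_words = set(words)
print(sorted(uniq_words))
# ['a', 'am', 'human', 'i', 'not', 'robot']
```

Lowercasing merges "I" and "i"; leave out .lower() if case should be preserved.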
Another way is to create a dict and insert the words as keys (doc is assumed to be a list of lines):
dict_word = {}
for i in range(len(doc)):
    frase = doc[i].split(" ")
    for palavra in frase:
        if palavra not in dict_word:
            dict_word[palavra] = 1
print(dict_word.keys())
The problem with your code is that word_list already contains every word of the input file. In the loop you are checking whether a word from word_list is missing from word_list itself, so the condition is always false. This should work (note that it also preserves the order):
def unique_file(input_filename, output_filename):
    z = []
    with open(input_filename,'r') as fileIn, open(output_filename,'w') as fileOut:
        for line in fileIn:
            for word in line.split():
                if word not in z:
                    z.append(word)
                    fileOut.write(word+'\n')
Use a set. You don't need to import anything to do this.
#Open the file
my_File = open(file_Name, 'r')
#Read the file
read_File = my_File.read()
#Split the words
words = read_File.split()
#Using a set will only save the unique words
unique_words = set(words)
#You can then print the set as a whole or loop through the set etc
for word in unique_words:
    print(word)
try:
    with open("gridlex.txt", mode="r", encoding="utf-8") as india:
        for data in india:
            # data is one line of text; chr() expects an integer, so measure the string directly
            print("no of chrats", len(data))
except IOError:
    print("sorry")
