Find words in txt files (Python 3)

I'd like to create a program in Python 3 that finds how many times specific words appear in txt files and then builds an Excel table with those values.
I made this function, but when I call it with my input the program doesn't work; this message appears: unindent does not match any outer indentation level
def wordcount(filename, listwords):
    try:
        file = open(filename, "r")
        read = file.readlines()
        file.close()
        for x in listwords:
            y = x.lower()
            counter = 0
            for z in read:
                line = z.split()
                for ss in line:
                    l = ss.lower()
                    if y == l:
                        counter += 1
            print(y, counter)
Now I try to call the function with a txt file and the word to find:
wordcount("aaa.txt", 'word')
As output I'd like to see:
word 4
Thanks to everybody!

Here is an example you can use to find the number of times a specific word appears in a text file:
def searching(filename, word):
    counter = 0
    with open(filename) as f:
        for line in f:
            if word in line:
                print(word)
                counter += 1
    return counter

x = searching("filename", "wordtofind")
print(x)
The output will be the word you are searching for and the number of times it occurs.
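Note that the check "if word in line" counts lines that contain the word (and will also match substrings such as "words"). If you want every whole-word occurrence counted instead, here is a minimal variant of the function above; the name searching_occurrences is purely illustrative:
def searching_occurrences(filename, word):
    # Illustrative variant: counts every whole-word occurrence
    # rather than the number of matching lines.
    counter = 0
    with open(filename) as f:
        for line in f:
            for token in line.split():
                if token.lower() == word.lower():
                    counter += 1
    return counter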

As short as possible:
def wordcount(filename, listwords):
    with open(filename) as file_object:
        file_text = file_object.read()
    return {word: file_text.count(word) for word in listwords}

for word, count in wordcount('aaa.txt', ['a', 'list', 'of', 'words']).items():
    print("Count of {}: {}".format(word, count))
Getting back to mij's comment about passing listwords as an actual list: if you pass a string to code that expects a list, Python will iterate over the string as a sequence of characters, which can be confusing if this behaviour is unfamiliar.
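A quick illustration of that difference with the same dictionary comprehension used in wordcount above (the sample text is made up):
text = "a word is a word"
# Passing a string: each character is treated as a separate "word".
print({w: text.count(w) for w in "word"})    # counts 'w', 'o', 'r', 'd' individually
# Passing a list: the whole word is counted, as intended.
print({w: text.count(w) for w in ["word"]})  # counts 'word'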

Related

How to find and count all the matching words in a list?

EmpRecords = [1, 'Angelo', 'Fabregas', 'South', 'City',
              2, 'Fabian', 'Fabregas', 'North', 'City',
              3, 'Griffin', 'De Leon', 'West', 'City',
              4, 'John', 'Doe', 'East', 'City',
              5, 'Jane', 'Doe', 'Southville', 'Town']
The output should be something like:
Enter word to search: Doe
Same words: 2
How do I do this? I should also clarify that EmpRecords is actually just a text file that is converted into a list, so it's actually:
EmpRecords = '''1,Angelo,Fabregas,South,City;
2,Fabian,Fabregas,North,City;
3,Griffin,De Leon,West,City;
4,John,Doe,East,City;
5,Jane,Doe,Southville,Town'''
Maybe this has something to do with finding the matching words?
Assuming you want to search for any comma-separated word, and that each line is a separate record:
Since your actual records are separated by ";", you need to create a nested list as below:
>>> record_groups = EmpRecords.split(";")
>>> final_groups = [each_group.split(",") for each_group in record_groups]
Later you can search through list items for the given word:
>>> word = "Doe"
>>> counter = 0
>>> for each_entry in final_groups:
...     if word in each_entry:
...         counter += 1
...
>>> print(counter)
APPROACH 2:
If it is already in a file, you can open it and search line by line:
word = "Doe"
counter = 0
with open("input.txt") as fd:
for line in fd:
if word in line.strip().split(",")
counter += 1
print(counter)
If you want to read from the file and count, you can use a loop.
import csv

with open('records.txt') as csvfile:
    linereader = csv.reader(csvfile, delimiter=',')
    count = 0
    target_value = 'Doe'
    for row in linereader:
        if row[2] == target_value:
            count += 1
print("Count: ", count)
You may need to remove the semicolon (;) from the last field if you will be using the data.
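For example, a small, hedged tweak to the loop above could strip a trailing ";" from each field before using it:
for row in linereader:
    # Remove any trailing ";" left over from the record separator.
    row = [field.rstrip(';') for field in row]
    if row[2] == target_value:
        count += 1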

Is there a more efficient way to create an inverted index from a large text file?

def inverted_index(doc):
    words = word_count(doc)
    ln = 0
    for word in words:
        temp = []
        with open(doc) as file:
            for line in file:
                ln += 1
                li = line.split()
                if word in li:
                    temp.append(ln)
        words[word] = temp
    return words
I am trying to create an inverted index from a text file, where words is a dictionary of all the roughly 19,000 unique words in the file. The text file has around 5,000+ lines. I want to iterate through the file and the dictionary to build an inverted index that maps each word to the line numbers on which it appears, but it is taking too long to run because of the nested for loop. Is there a more efficient way to do this?
Here is my approach to solving this; please read the notes below the code for some pragmatic tips.
def inverted_index(doc):
    # this will open the file
    file = open(doc, encoding='utf8')
    f = file.read()
    file.seek(0)
    # Get number of lines in file
    lines = 1
    for word in f:
        if word == '\n':
            lines += 1
    print("Number of lines in file is: ", lines)  # Just for debugging, please remove in the PROD version
    d = {}
    for i in range(lines):
        line = file.readline()
        l = line.lower().split(' ')
        for item in l:
            if item not in d:
                d[item] = [i + 1]
            else:
                d[item].append(i + 1)
    return d

print(inverted_index('file.txt'))
I would suggest removing stopwords before creating the inverted index for any meaningful analysis. You can use the nltk package for that.
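A minimal sketch of that filtering step, assuming nltk is installed and the stopwords corpus has already been downloaded with nltk.download('stopwords'):
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

def remove_stopwords(tokens):
    # Keep only tokens that are not common English stopwords.
    return [t for t in tokens if t.lower() not in stop_words]

print(remove_stopwords("this is a line from the file".split()))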

Recreating a sentence using text files and lists

I am relatively new to Python and I am currently working on a compression program that uses a list containing the positions of words in a sentence and a list of the words that make up the sentence. So far I have written my program as two functions. The first function, 'compress', gets the words that make up the sentence and the positions of those words. My second function is called 'recreate'; it uses the lists to recreate the sentence, and the recreated sentence is then stored in a file called recreate.txt. My issue is that the positions and the words are not being written to their respective files, and the recreate.txt file is not being created or written to. Any help would be greatly appreciated. Thanks :)
sentence = input("Input the sentence that you wish to be compressed")
sentence.lower()
sentencelist = sentence.split()
d = {}
plist = []
wds = []

def compress():
    for i in sentencelist:
        if i not in wds:
            wds.append(i)
    for i, j in enumerate(sentencelist):
        if j in (d):
            plist.append(d[j])
        else:
            plist.append(i)
    print(plist)
    tsk3pos = open("tsk3pos.txt", "wt")
    for item in plist:
        tsk3pos.write("%s\n" % item)
    tsk3pos.close()
    tsk3wds = open("tsk3wds.txt", "wt")
    for item in wds:
        tsk3wds.write("%s\n" % item)
    tsk3wds.close()
    print(wds)
def recreate(compress):
    compress()
    num = list()
    wds = list()
    with open("tsk3wds.txt", "r") as txt:
        for line in txt:
            words += line.split()
    with open("tsk3pos.txt", "r") as txt:
        for line in txt:
            num += [int(i) for i in line.split()]
    recreate = ' '.join(words[pos] for pos in num)
    with open("recreate.txt", "wt") as txt:
        txt.write(recreate)
UPDATED
I have fixed all the other problems except the recreate function, which will not create the 'recreate' file and will not rebuild the sentence from the words, although it does rebuild it from the positions.
def recreate(compress):  # function that will be used to recreate the compressed sentence.
    compress()
    num = list()
    wds = list()
    with open("words.txt", "r") as txt:  # with statement opening the word text file
        for line in txt:  # iterating over each line in the text file.
            words += line.split()  # turning the text file into a list and appending it to num
    with open("tsk3pos.txt", "r") as txt:
        for line in txt:
            num += [int(i) for i in line.split()]
    recreate = ' '.join(wds[pos] for pos in num)
    with open("recreate.txt", "wt") as txt:
        txt.write(recreate)
    main()

def main():
    print("Do you want to compress an input or recreate a compressed input?")
    user = input("Type 'a' if you want to compress an input. Type 'b' if you want to recreate an input").lower()
    if user not in ("a", "b"):
        print("That's not an option. Please try again")
    elif user == "a":
        compress()
    elif user == "b":
        recreate(compress)
    main()

main()
A simpler (yet less efficient) approach:
recreate_file_object = open("C:/FullPathToWriteFolder/recreate.txt", "w")
recreate_file_object.write(recreate)
recreate_file_object.close()
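For what it's worth, the most likely reason the recreate function in the question never rebuilds the sentence is the mix-up between the names words and wds. A minimal corrected sketch, assuming compress() has already written tsk3wds.txt and tsk3pos.txt as in the question:
def recreate():
    wds = []
    num = []
    with open("tsk3wds.txt", "r") as txt:
        for line in txt:
            wds += line.split()   # use one consistent name for the word list
    with open("tsk3pos.txt", "r") as txt:
        for line in txt:
            num += [int(i) for i in line.split()]
    rebuilt = ' '.join(wds[pos] for pos in num)
    with open("recreate.txt", "w") as txt:
        txt.write(rebuilt)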

Python Beginning Program Dictionary and List Issue

Write a program that reads the contents of a random text file. The program should create a dictionary in which the keys are individual words found in the file and the values are the number of times each word appears.
How would I go about doing this?
def main():
    c = 0
    dic = {}
    words = set()
    inFile = open('text2', 'r')
    for line in inFile:
        line = line.strip()
        line = line.replace('.', '')
        line = line.replace(',', '')
        line = line.replace("'", '')  # strips the punctuation
        line = line.replace('"', '')
        line = line.replace(';', '')
        line = line.replace('?', '')
        line = line.replace(':', '')
        words = line.split()
        for x in words:
            for y in words:
                if x == y:
                    c += 1
            dic[x] = c
    print(dic)
    print(words)
    inFile.close()

main()
Sorry for the vague question; I've never asked any questions here before. This is what I have so far. Also, this is the first programming I've ever done, so I expect it to be pretty terrible.
with open('path/to/file') as infile:
    # code goes here
That's how you open a file.
for line in infile:
    # code goes here
That's how you read a file line by line.
line.strip().split()
That's how you split a line into (white-space separated) words.
some_dictionary['abcd']
That's how you access the key 'abcd' in some_dictionary.
Questions for you:
What does it mean if you can't access the key in a dictionary?
What error does that give you? Can you catch it with a try/except block?
How do you increment a value?
Is there some function that GETS a default value from a dict if the key doesn't exist?
For what it's worth, there's also a function that does almost exactly this, but since this is pretty obviously homework it won't fulfill your assignment requirements anyway. It's in the collections module. If you're interested, try and figure out what it is :)
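As a small illustration of the default-value question above (a sketch, not part of the original hints): dict.get accepts a fallback value when a key is missing, which covers the increment pattern:
counts = {}
for word in "the cat sat on the mat".split():
    # get() returns 0 when the word has not been seen yet.
    counts[word] = counts.get(word, 0) + 1
print(counts)  # {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}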
There are at least three different approaches to adding a new word to the dictionary and counting the number of occurrences in this file.
def add_element_check1(my_dict, elements):
    for e in elements:
        if e not in my_dict:
            my_dict[e] = 1
        else:
            my_dict[e] += 1

def add_element_check2(my_dict, elements):
    for e in elements:
        if e not in my_dict:
            my_dict[e] = 0
        my_dict[e] += 1

def add_element_except(my_dict, elements):
    for e in elements:
        try:
            my_dict[e] += 1
        except KeyError:
            my_dict[e] = 1

my_words = {}
with open('pathtomyfile.txt', 'r') as in_file:
    for line in in_file:
        words = [word.strip().lower() for word in line.strip().split()]
        add_element_check1(my_words, words)
        # or add_element_check2(my_words, words)
        # or add_element_except(my_words, words)
If you are wondering which is the fastest, the answer is: it depends. It depends on how often a given word occurs in the file. If a word only occurs relatively few times, the try-except approach would be the best choice in your case.
I have done some simple benchmarks here
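If you want to repeat such a comparison yourself, the standard timeit module is one option; a minimal sketch, assuming the three functions above are defined and using a small made-up word list:
import timeit

sample = ("apple banana apple cherry " * 1000).split()

for fn in (add_element_check1, add_element_check2, add_element_except):
    # Time each strategy on a fresh dictionary, 100 runs each.
    elapsed = timeit.timeit(lambda: fn({}, sample), number=100)
    print(fn.__name__, round(elapsed, 4))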
This is a perfect job for the built-in Python collections module. From it, you can import Counter, which is a dictionary subclass made for just this.
How you want to process your data is up to you. One way to do it would be something like this:
from collections import Counter

# Open your file and split by white spaces
with open("yourfile.txt", "r") as infile:
    textData = infile.read()

# Replace characters you don't want with empty strings
textData = textData.replace(".", "")
textData = textData.replace(",", "")
textList = textData.split(" ")

# Put your data into the Counter container datatype
dic = Counter(textList)

# Print out the results
for key, value in dic.items():
    print("Word: %s\n Count: %d\n" % (key, value))
Hope this helps!
Matt
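As a follow-up, Counter also provides most_common() if you want the words ordered by frequency; a small self-contained example:
from collections import Counter

dic = Counter("the quick brown fox jumps over the lazy dog the".split())
# Show the three most frequent words and their counts.
for word, count in dic.most_common(3):
    print(word, count)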

python dictionary function, textfile

I would like to define a function scaryDict() which takes one parameter (a text file) and returns the words from the text file in alphabetical order, essentially producing a dictionary, but without printing any one- or two-letter words.
Here is what I have so far... it isn't much, but I don't know the next step:
def scaryDict(fineName):
    inFile = open(fileName, 'r')
    lines = inFile.read()
    line = lines.split()
    myDict = {}
    for word in inFile:
        myDict[words] = []
    # I am not sure what goes between the line above and below
    for x in lines:
        print(word, end='\n')
You are doing fine up until line = lines.split(). But your for loop must loop through the line list, not inFile:
for word in line:
    if len(word) > 2:  # Make sure to check the word length!
        myDict[word] = 'something'
I'm not sure what you want with the dictionary (maybe get the word count?), but once you have it, you can get the words you added to it with:
allWords = list(myDict.keys())  # so allWords is now a list of words
And then you can sort allWords to get them in alphabetical order:
allWords.sort()
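Putting those pieces together, a hedged sketch in the spirit of this answer (with 'something' standing in for whatever value you decide to store):
def scaryDict(fileName):
    myDict = {}
    with open(fileName) as inFile:
        for line in inFile:
            for word in line.split():
                if len(word) > 2:          # skip one- and two-letter words
                    myDict[word] = 'something'
    allWords = list(myDict.keys())
    allWords.sort()                        # alphabetical order
    return allWords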
I would store all of the words into a set (to eliminate dups), then sort that set:
#!/usr/bin/python3
def scaryDict(fileName):
    with open(fileName) as inFile:
        return sorted(set(word
                          for line in inFile
                          for word in line.split()
                          if len(word) > 2))

scaryWords = scaryDict('frankenstein.txt')
print('\n'.join(scaryWords))
Also keep in mind that, as of Python 2.5, file objects support the with statement (via their enter and exit methods), which prevents issues such as the file never getting closed:
with open(...) as f:
    for line in f:
        <do something with line>
Build a unique set, sort the set, and then you can put it all together.
Sorry that I am 3 years late :) Here is my version:
def scaryDict():
    infile = open('filename', 'r')
    content = infile.read()
    infile.close()
    table = str.maketrans('.`/()|,\';!:"?=-', 15 * ' ')
    content = content.translate(table)
    words = content.split()
    new_words = list()
    for word in words:
        if len(word) > 2:
            new_words.append(word)
    new_words = list(set(new_words))
    new_words.sort()
    for word in new_words:
        print(word)
