I am writing a program that prompts for a file name, then opens that file and reads through it, looking for lines of the form:
X-DSPAM-Confidence: 0.8475
I want to count these lines, extract the floating-point value from each one, and compute the average of those values. Can I please get some help? I just started programming, so I need something very simple. This is the code I have written so far:
fname = raw_input("Enter file name: ")
if len(fname) == 0:
    fname = 'mbox-short.txt'
fh = open(fname,'r')
count = 0
total = 0
#Average = total/num of lines
for line in fh:
    if not line.startswith("X-DSPAM-Confidence:"): continue
    count = count+1
    print line
Try:
total += float(line.split(' ')[1])
so that total / count gives you the answer.
Iterate over the file (opening it with the context manager ("with") closes it automatically), look for such lines (as you did), and then read the values in like this:
fname = raw_input("Enter file name:")
if not fname:
    fname = "mbox-short.txt"
scores = []
with open(fname) as f:
    for line in f:
        if not line.startswith("X-DSPAM-Confidence:"):
            continue
        _, score = line.split()
        scores.append(float(score))
print sum(scores)/len(scores)
Or a bit more compact:
mean = lambda x: sum(x)/len(x)
with open(fname) as f:
    # the filter goes after the for clause, and must test the loop variable l
    result = mean([float(l.split()[1]) for l in f if l.startswith("X-DSPAM-Confidence:")])
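For readers on Python 3, the same idea can be sketched as a small function. This is only an illustration, assuming every matching line has the form shown in the question; the name average_confidence and the sample lines are made up for the example.

```python
PREFIX = "X-DSPAM-Confidence:"

def average_confidence(lines):
    """Return the mean of the values on lines starting with PREFIX.

    `lines` is any iterable of strings, for example an open file.
    Returns None when no matching line is found, to avoid dividing by zero.
    """
    count = 0
    total = 0.0
    for line in lines:
        if not line.startswith(PREFIX):
            continue
        # Everything after the prefix is the number; float() ignores
        # surrounding whitespace, including the trailing newline.
        total += float(line[len(PREFIX):])
        count += 1
    return total / count if count else None

# Hypothetical sample data standing in for the contents of a mail file.
sample = [
    "From: someone@example.com\n",
    "X-DSPAM-Confidence: 0.8475\n",
    "X-DSPAM-Confidence: 0.6525\n",
]
print(average_confidence(sample))
```

With a real file, the function could be called as average_confidence(fh) inside a with open(fname) as fh: block.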
A program like the following should satisfy your needs. If you need to change what the program is looking for, just change the PATTERN variable to describe what you are trying to match. The code is written for Python 3.x but can be adapted for Python 2.x without much difficulty if needed.
Program:
#! /usr/bin/env python3
import re
import statistics
import sys
PATTERN = r'X-DSPAM-Confidence:\s*(?P<float>[+-]?\d*\.\d+)'
def main(argv):
    """Calculate the average X-DSPAM-Confidence from a file."""
    filename = argv[1] if len(argv) > 1 else input('Filename: ')
    if filename in {'', 'default'}:
        filename = 'mbox-short.txt'
    print('Average:', statistics.mean(get_numbers(filename)))
    return 0
def get_numbers(filename):
    """Extract all X-DSPAM-Confidence values from the named file."""
    with open(filename) as file:
        for line in file:
            for match in re.finditer(PATTERN, line, re.IGNORECASE):
                yield float(match.groupdict()['float'])
if __name__ == '__main__':
    sys.exit(main(sys.argv))
You may also implement the get_numbers generator in the following way if desired.
Alternative:
def get_numbers(filename):
    """Extract all X-DSPAM-Confidence values from the named file."""
    with open(filename) as file:
        yield from (float(match.groupdict()['float'])
                    for line in file
                    for match in re.finditer(PATTERN, line, re.IGNORECASE))
Related
New to Python and I'm trying to count the words in a directory of text files and write the output to a separate text file. However, I want to specify conditions: if the word count is > 0, I would like to write the count and file path to one file, and if the count is == 0, I would like to write the count and file path to a separate file. Below is my code so far. I think I'm close, but I'm hung up on how to do the conditions and separate files. Thanks.
import sys
import os
from collections import Counter
import glob
stdoutOrigin=sys.stdout
sys.stdout = open("log.txt", "w")
def count_words_in_dir(dirpath, words, action=None):
    for filepath in glob.iglob(os.path.join("path", '*.txt')):
        with open(filepath) as f:
            data = f.read()
            for key,val in words.items():
                #print("key is " + key + "\n")
                ct = data.count(key)
                words[key] = ct
            if action:
                action(filepath, words)
def print_summary(filepath, words):
    for key,val in sorted(words.items()):
        print(filepath)
        if val > 0:
            print('{0}:\t{1}'.format(
                key,
                val))
filepath = sys.argv[1]
keys = ["x", "y"]
words = dict.fromkeys(keys,0)
count_words_in_dir(filepath, words, action=print_summary)
sys.stdout.close()
sys.stdout=stdoutOrigin
I would strongly urge you to not repurpose stdout for writing data to a file as part of the normal course of your program. I also wonder how you can ever have a word "count < 0". I assume you meant "count == 0".
The main problem that your code has is in this line:
for filepath in glob.iglob(os.path.join("path", '*.txt')):
The string constant "path" I'm pretty sure doesn't belong there. I think you want the dirpath parameter there instead. As written, glob looks in a directory literally named "path", so I would think that this problem would prevent your code from working at all.
Here's a version of your code where I fixed these issues and added the logic to write to two different output files based on the count:
import sys
import os
import glob
out1 = open("/tmp/so/seen.txt", "w")
out2 = open("/tmp/so/missing.txt", "w")
def count_words_in_dir(dirpath, words, action=None):
    for filepath in glob.iglob(os.path.join(dirpath, '*.txt')):
        with open(filepath) as f:
            data = f.read()
            for key, val in words.items():
                # print("key is " + key + "\n")
                ct = data.count(key)
                words[key] = ct
            if action:
                action(filepath, words)
def print_summary(filepath, words):
    for key, val in sorted(words.items()):
        whichout = out1 if val > 0 else out2
        print(filepath, file=whichout)
        print('{0}: {1}'.format(key, val), file=whichout)
filepath = sys.argv[1]
keys = ["country", "friend", "turnip"]
words = dict.fromkeys(keys, 0)
count_words_in_dir(filepath, words, action=print_summary)
out1.close()
out2.close()
Result:
file seen.txt:
/Users/steve/tmp/so/dir/data2.txt
friend: 1
/Users/steve/tmp/so/dir/data.txt
country: 2
/Users/steve/tmp/so/dir/data.txt
friend: 1
file missing.txt:
/Users/steve/tmp/so/dir/data2.txt
country: 0
/Users/steve/tmp/so/dir/data2.txt
turnip: 0
/Users/steve/tmp/so/dir/data.txt
turnip: 0
(excuse me for using some search words that were a bit more interesting than yours)
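The routing step in the answer above can also be written without module-level file handles, by opening both output files in one with statement so they are closed even if an error occurs. This is a rough sketch rather than a drop-in replacement: the function name write_reports, the tab-separated line format, and the sample counts are assumptions made for illustration.

```python
import os
import tempfile

def write_reports(counts, seen_path, missing_path):
    """Write 'path<TAB>count' lines to one of two files depending on count.

    `counts` maps a file path to the number of matches found in it.
    """
    with open(seen_path, "w") as seen, open(missing_path, "w") as missing:
        for path, count in sorted(counts.items()):
            # Pick the destination file based on the condition.
            out = seen if count > 0 else missing
            print(path, count, sep="\t", file=out)

# Hypothetical counts, as if two text files had already been scanned.
example = {"dir/data.txt": 3, "dir/data2.txt": 0}
tmpdir = tempfile.mkdtemp()
seen_file = os.path.join(tmpdir, "seen.txt")
missing_file = os.path.join(tmpdir, "missing.txt")
write_reports(example, seen_file, missing_file)
```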
Hello, I hope I understood your question correctly. This code counts how many different words are in each file and, depending on the condition, writes the result to one of two files.
import os
all_words = {}
def count(file_path):
    with open(file_path, "r") as f:
        # for better performance it is a good idea to go line by line through file
        for line in f:
            # singles out all the words by splitting the line around whitespace
            # (split() with no argument also drops the newline and empty strings)
            words = line.split()
            # and checks if word already exists in all_words dictionary...
            for word in words:
                try:
                    # ...if it does increment number of repetitions
                    all_words[word.replace(",", "").replace(".", "").lower()] += 1
                except KeyError:
                    # ...if it doesn't create it and give it number of repetitions 1
                    all_words[word.replace(",", "").replace(".", "").lower()] = 1
if __name__ == '__main__':
    # for every text file in your current directory count how many different words it has
    for file in os.listdir("."):
        if file.endswith(".txt"):
            all_words = {}
            count(file)
            n = len(all_words)
            # depending on the number of words do something
            if n > 0:
                with open("count1.txt", "a") as f:
                    f.write(file + "\n" + str(n) + "\n")
            else:
                with open("count2.txt", "a") as f:
                    f.write(file + "\n" + str(n) + "\n")
If you want to count the same word multiple times, you can add up all the values from the dictionary, or you can eliminate the try-except block and count every word there.
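As a rough sketch of the first suggestion, collections.Counter from the standard library can replace the try-except bookkeeping, and summing its values gives the total number of words, repeats included. The function name count_words and the sample text are invented for illustration.

```python
from collections import Counter

def count_words(text):
    """Count how often each word occurs, ignoring case, commas and periods."""
    cleaned = text.replace(",", "").replace(".", "").lower()
    # split() with no argument splits on any whitespace and drops empty strings
    return Counter(cleaned.split())

counts = count_words("The dog saw the cat. The cat ran.")
distinct = len(counts)          # number of different words
total = sum(counts.values())    # every occurrence counted
print(distinct, total)          # prints: 5 8
```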
I have the code below to write out a list of N-grams in Python.
from nltk.util import ngrams
def word_grams(words, min=1, max=6):
    s = []
    for n in range(min, max):
        for ngram in ngrams(words, n):
            s.append(' '.join(str(i) for i in ngram))
    return s
email = open("output.txt", "r")
for line in email.readlines():
    with open('file.txt', 'w') as f:
        for line in email:
            prnt = word_grams(email.split(' '))
            f.write("prnt")
email.close()
f.close()
When I print out the word_grams it prints them out correctly, but when it comes to writing the output into files.txt it doesn't work. The "file.txt" is empty.
So I guess the problem must be within these lines of code:
for line in email.readlines():
    with open('file.txt', 'w') as f:
        for line in email:
            prnt = word_grams(email.split(' '))
            f.write("prnt")
email.close()
f.close()
1) the final f.close() does something else than what you want (f inside the loop is another object)
2) You name the file "file.txt" but want the output in "files.txt". Are you sure that you are looking in a correct file?
3) You are overwriting the file for each line in the email. Perhaps the with statement for "file.txt" should be outside the loop.
4) You are writing "prnt" instead of prnt
Something like this?
from nltk.util import ngrams
def word_grams(words, min=1, max=6):
    s = []
    for n in range(min, max):
        for ngram in ngrams(words, n):
            s.append(' '.join(str(i) for i in ngram))
    return s
with open("output.txt", "r") as email:
    with open('file.txt', 'w') as f:
        for line in email.readlines():
            # split() without arguments also strips the trailing newline
            prnt = word_grams(line.split())
            for ngram in prnt:
                # write one n-gram per line
                f.write(ngram + "\n")
I don't know what you are trying to accomplish exactly, but it seems that you would like to apply the function word_grams to every word in the file "output.txt" and save the output to a file called "file.txt", probably one item per line.
With these assumptions, I would recommend rewriting your iteration like this:
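words = []
# load words from input
with open("output.txt") as f:
    for line in f:
        words += line.strip().split(" ")
# generate and save output
# (call word_grams directly; apply() no longer exists in Python 3 and
# would have unpacked the word list into separate arguments anyway)
grams = word_grams(words)
with open("file.txt", "w") as f:
    f.write("\n".join(grams))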
However, this code assumes that the function word_grams is working properly.
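To check that assumption without installing NLTK, here is a minimal standard-library sketch. The ngrams function below is a simple stand-in for nltk.util.ngrams, not its actual implementation, and the parameter names min_n and max_n are renamed from the original min and max only to avoid shadowing the built-ins.

```python
def ngrams(words, n):
    """Yield each run of n consecutive items from words as a tuple."""
    # Zipping the list against shifted copies of itself pairs up neighbours.
    return zip(*(words[i:] for i in range(n)))

def word_grams(words, min_n=1, max_n=6):
    """Return all n-grams for min_n <= n < max_n, joined with spaces."""
    grams = []
    for n in range(min_n, max_n):
        for gram in ngrams(words, n):
            grams.append(" ".join(gram))
    return grams

print(word_grams("the dog walked".split(), 1, 3))
# prints: ['the', 'dog', 'walked', 'the dog', 'dog walked']
```

Note that, as in the original, the upper bound is exclusive, so the default arguments produce n-grams of length 1 to 5.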
The inner loop in your code:
for line in email:
never runs, because once email.readlines() has run, the file object email has been read to the end and yields nothing more.
You can test this as follows:
email = open("output.txt", "r")
for line in email.readlines():
    print '1'
    for line in email:
        print '2'
If output.txt has 3 lines, running this test prints:
1
1
1
You can also try this in the interactive interpreter:
email = open("output.txt", "r")
email.readlines()
You will see a list with the lines of your output.txt.
But when you run email.readlines() again, you get an empty list!
So that is the problem: email has nothing left to give in your second loop.
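The same behaviour can be reproduced in Python 3 without a real file. The snippet below is a small sketch that uses io.StringIO as a stand-in for the opened output.txt (the three sample lines are invented) and shows that seek(0) rewinds the stream if a second pass is really needed.

```python
import io

# io.StringIO behaves like an open text file, so no real file is needed.
email = io.StringIO("line one\nline two\nline three\n")

first = email.readlines()   # consumes the whole stream
second = email.readlines()  # nothing left, so this is an empty list

email.seek(0)               # rewind to the beginning
third = email.readlines()   # the lines are available again

print(len(first), len(second), len(third))  # prints: 3 0 3
```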
For example:
import codecs
def main():
    fileName = input("Please input a python file: ")
    file = codecs.open(fileName, encoding = "utf8")
    fornum = 0
    for line in file:
        data = line.split()
        if "for" in data:
            fornum += 1
    print("The number of for loop in", fileName, ":", fornum)
main()
There is 1 for-statement in the code above. But the program also counts the 'for' inside the quotation marks, which is not expected, and it displays 2. How can I change the code so that it counts the keyword (for) without counting the words inside ""? Thx
As mentioned in the comments, to properly count for loops you should parse the Python file and walk through its AST. You can do this with the ast module. Example code:
import ast
def main():
    fileName = input("Please input a python file: ")
    with open(fileName) as f:
        src = f.read()
    source_tree = ast.parse(src) # get AST of source file
    fornum = 0
    # and recursively walk through all AST nodes
    for n in ast.walk(source_tree):
        if n.__class__.__name__ == "For":
            fornum = fornum+1
    print("The number of for loop in ", fileName, ":", fornum)
main()
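The counting step can be tried out on a string instead of a file. The following is a small sketch of the same AST approach; the helper name count_for_loops and the sample source are made up for illustration, and isinstance is used in place of comparing class names.

```python
import ast

def count_for_loops(source):
    """Count for statements in a string of Python source code."""
    tree = ast.parse(source)
    # ast.walk visits every node; only for statements are ast.For nodes.
    return sum(isinstance(node, ast.For) for node in ast.walk(tree))

sample = '''
for i in range(3):
    print("this for is inside a string")
squares = [x * x for x in range(3)]
'''
print(count_for_loops(sample))  # prints: 1
```

The word inside the string literal is ignored, and so is the comprehension, which the ast module represents with a separate node type (ast.comprehension). Asynchronous loops are ast.AsyncFor nodes and would also need their own check.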
Currently, I'm trying to search for an exact word/phrase in a text file. I am using Python 3.4
Here is the code I have so far.
import re
def main():
    fileName = input("Please input the file name").lower()
    term = input("Please enter the search term").lower()
    fileName = fileName + ".txt"
    regex_search(fileName, term)
def regex_search(file,term):
    source = open(file, 'r')
    destination = open("new.txt", 'w')
    lines = []
    for line in source:
        if re.search(term, line):
            lines.append(line)
    for line in lines:
        destination.write(line)
    source.close()
    destination.close()
'''
def search(file, term): #This function doesn't work
    source = open(file, 'r')
    destination = open("new.txt", 'w')
    lines = [line for line in source if term in line.split()]
    for line in lines:
        destination.write(line)
    source.close()
    destination.close()'''
main()
In my function regex_search I use regex to search for the particular string. However, I don't know how to search for a particular phrase.
In the second function, search, I split the line into a list and search for the word in there. However, this won't be able to search for a particular phrase because I am searching for ["dog walked"] in ['the','dog','walked'] which won't return the correct lines.
edit: Considering that you don't want to match partial words ('foo' should not match 'foobar'), you need to look ahead in the data stream. The code for that is a bit awkward, so I think regex (your current regex_search with a fix) is the way to go:
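def regex_search(filename, term):
    # the term must be followed by a non-word character or the end of the line
    searcher = re.compile(term + r'([^\w-]|$)').search
    # open the file named by the filename parameter
    with open(filename, 'r') as source, open("new.txt", 'w') as destination:
        for line in source:
            if searcher(line):
                destination.write(line)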
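A variation on this idea, offered only as a sketch: wrapping the phrase in \b word-boundary anchors checks both ends of the phrase, and re.escape keeps any punctuation in the search term from being read as regex syntax. The function name matching_lines and the sample lines are hypothetical, and the case-insensitive flag is an assumption based on the question lowercasing the search term. The \b anchors only behave as intended when the phrase starts and ends with a letter, digit or underscore.

```python
import re

def matching_lines(lines, phrase):
    """Return the lines that contain phrase as whole words."""
    # re.escape makes characters such as '.' or '?' in the phrase literal;
    # \b requires a word boundary on both sides of the phrase.
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    return [line for line in lines if pattern.search(line)]

sample = [
    "The dog walked home.\n",
    "The dog walkedaway.\n",
    "A hotdog walked by.\n",
]
print(matching_lines(sample, "dog walked"))
# prints: ['The dog walked home.\n']
```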
I have two different functions in my program, one writes an output to a txt file (function A) and the other one reads it and should use it as an input (function B).
Function A works just fine (although I'm always open to suggestions on how I could improve).
It looks like this:
def createFile():
    fileName = raw_input("Filename: ")
    fileNameExt = fileName + ".txt" #to make sure a .txt extension is used
    line1 = "1.1.1"
    line2 = int(input("Enter line 2: "))
    line3 = int(input("Enter line 3: "))
    file = open(fileNameExt, "w+")
    file.write("%s\n%s\n%s" % (line1, line2, line3))
    file.close()
    return
This appears to work fine and will create a file like
1.1.1
123
456
Now, function B should use that file as an input. This is how far I've gotten:
def loadFile():
    loadFileName = raw_input("Filename: ")
    loadFile = open(loadFileName, "r")
    line1 = loadFile.read(5)
That's where I'm stuck. I know how to read the first 5 characters, but I need lines 2 and 3 as variables too.
f = open('file.txt')
lines = f.readlines()
f.close()
lines is what you want: a list with one string per line of the file.
Other option:
f = open( "file.txt", "r" )
lines = []
for line in f:
    lines.append(line)
f.close()
Further reading:
https://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files
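Applied to the three-line file from the question, the lines can be unpacked straight into variables. This is a minimal Python 3 sketch under the assumption that the file always has exactly three lines and that the last two hold integers; the function name load_values and the temporary file are only for illustration.

```python
import os
import tempfile

def load_values(path):
    """Read a three-line file and return (line1, line2, line3)."""
    with open(path) as f:
        # splitlines() drops the newline characters
        line1, line2, line3 = f.read().splitlines()
    # The last two lines were written as integers, so convert them back.
    return line1, int(line2), int(line3)

# Create a throwaway file with the same layout as the one in the question.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w") as f:
    f.write("1.1.1\n123\n456")

print(load_values(path))  # prints: ('1.1.1', 123, 456)
```

If the file has more or fewer than three lines, the unpacking raises a ValueError, which may or may not be the behaviour you want.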
from string import ascii_uppercase
my_data = dict(zip(ascii_uppercase,open("some_file_to_read.txt")))
print my_data["A"]
This will store the lines in a dictionary with letters as keys. If you really want to cram them into variables (note that in general this is a TERRIBLE idea), you can do:
globals().update(my_data)
print A