Not counting characters right in text file - python

I am writing another program that uses text file I/O, and I'm confused because my code seems perfectly reasonable but the result seems crazy. I want to count the number of words, characters, sentences, and unique words in a text file of political speeches. Here is my code, which might clear things up a bit.
#This program will serve to analyze text files for the number of words in
#the text file, number of characters, sentences, unique words, and the longest
#word in the text file. This program will also provide the frequency of unique
#words. In particular, the text will be three political speeches which we will
#analyze, building on searching techniques in Python.
#CISC 101, Queen's University
#By Damian Connors; 10138187
def main():
    harper = readFile("Harper's Speech.txt")
    print(numCharacters(harper), "Characters.")
    obama1 = readFile("Obama's 2009 Speech.txt")
    print(numCharacters(obama1), "Characters.")
    obama2 = readFile("Obama's 2008 Speech.txt")
    print(numCharacters(obama1), "Characters.")
def readFile(filename):
    '''Function that reads a text file, then prints the name of file without
    '.txt'. The function returns the read file for main() to call, and prints
    the file's name so the user knows which file is read'''
    inFile1 = open(filename, "r")
    fileContentsList = inFile1.readlines()
    inFile1.close()
    print(filename.replace(".txt", "") + ":") #this prints filename
    return fileContentsList
def numCharacters(file):
    return len(file) - file.count(" ")
What I'm having trouble with at the moment is counting the characters. It keeps saying that the number is 85, but it's a pretty big file and I know it is supposed to be 7792 characters. Any idea what I'm doing wrong? Here is my shell output; I'm using Python 3.3.3.
>>> ================================ RESTART ================================
>>>
Harper's Speech:
85 Characters.
Obama's 2009 Speech:
67 Characters.
Obama's 2008 Speech:
67 Characters.
>>>
So as you can see, I have 3 speech files, but there's no way they contain so few characters.

You should change this line: fileContentsList = inFile1.readlines()
Right now you are counting how many lines Obama has in his speech.
Change readlines() to read() and it will work.

The readlines function returns a list containing the lines, so the length of it will be the number of lines in the file, not the number of characters.
You either have to find a way to read in all the characters (so that the length is correct), for example with read(),
or go through each line tallying up the characters in it, perhaps something like:
def numCharacters(file):
    tot = 0
    for line in file:
        tot = tot + len(line) - line.count(" ")
    return tot
(assuming your actual method chosen for calculating characters is correct, of course).
As an aside, your third output statement is referencing obama1 rather than obama2, you may want to fix that as well.
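To make the difference concrete, here is a minimal sketch comparing the two ways of reading the same text. It uses an in-memory io.StringIO object as a stand-in for the opened speech file, and the sample text is made up for illustration. The helper num_characters is a hypothetical name, and it assumes that newlines should be left out of the count along with spaces.

```python
import io

# Made-up sample standing in for a speech file (an assumption for illustration).
sample = "Yes we can.\nYes we did.\n"

# readlines() gives a list of lines, so len() counts lines.
lines = io.StringIO(sample).readlines()
print(len(lines))

# read() gives one string, so len() counts characters.
text = io.StringIO(sample).read()
print(len(text))


def num_characters(text):
    """Count characters, leaving out spaces and newlines (assumed definition)."""
    return len(text) - text.count(" ") - text.count("\n")


print(num_characters(text))
```

With a real file, the same calls would be made on the object returned by open(filename) instead of io.StringIO.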

You are instead counting lines. In more detail, you are effectively reading the file into a list of lines and then counting them. A cleaned up version of your code follows.
def count_lines(filename):
    with open(filename) as stream:
        return len(stream.readlines())
The easiest change to such code to count words is to instead read out the whole file and split it into words and then count them, see the following code.
def count_words(filename):
    with open(filename) as stream:
        return len(stream.read().split())
Notes:
The code may need to be updated to match your exact definition of words.
This method is not suitable for very large files as it reads the whole file into memory and the list of words is also stored there.
Therefore the above code is more a conceptual model than the best final solution.
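As one possible way to address both notes, the sketch below counts words line by line, so only one line is held in memory at a time, and keeps a frequency table for the unique words. The function name word_frequencies and the regular expression used as the definition of a word (a run of letters or apostrophes, case ignored) are assumptions for illustration, not something taken from the question.

```python
import io
import re
from collections import Counter


def word_frequencies(stream):
    """Count words line by line so the whole file is never held in memory."""
    counts = Counter()
    for line in stream:
        # Assumed definition of a word: a run of letters or apostrophes.
        counts.update(re.findall(r"[A-Za-z']+", line.lower()))
    return counts


# Made-up text standing in for an opened file.
demo = io.StringIO("Yes we can. Yes, we can!\nWe did.\n")
freq = word_frequencies(demo)
print(sum(freq.values()), "words,", len(freq), "unique")
```

The total word count is the sum of the frequencies, and the number of unique words is the number of keys in the table.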

What you are currently seeing is the number of lines in your file. Since readFile returns fileContentsList, which is a list of lines, numCharacters returns the size of that list.
If you want to continue to use readlines, you need to count the number of characters in each line and add them up to get the total number of characters in the file.
def main():
    print(readFile("Harper's Speech.txt"), "Characters.")
    print(readFile("Obama's 2009 Speech.txt"), "Characters.")
    print(readFile("Obama's 2008 Speech.txt"), "Characters.")
def readFile(filename):
    '''Function that reads a text file, then prints the name of file without
    '.txt'. The function returns the number of characters in the file, not
    counting spaces and line ends, for main() to print.'''
    inFile1 = open(filename, "r")
    fileContentsList = inFile1.readlines()
    inFile1.close()
    totalChar = 0 # Variable to store total number of characters
    for line in fileContentsList: # reading all lines
        line = line.rstrip("\n") # removing line end character '\n' from lines
        totalChar = totalChar + len(line) - line.count(" ") # adding number of characters in line to total characters,
        # also removing number of whitespaces in current line
    print(filename.replace(".txt", "") + ":") #this prints filename
    return totalChar
main() # calling main function.

Related

Creating words out of numbers from a file

Right so, I have a file with lots of numbers in it. It is a .txt file with binary numbers in it so it has 0's, 1's and spaces every 8 numbers (it is the lyrics to Telegraph Road but in binary, but that isn't important). What I am trying to do is create a program that takes the file, reads a single character of the file, and depending on what it reads it then writes either "one" or "zero" in a second .txt file.
As it stands, as a proof of concept, this works:
with open('binary.txt') as file:
    while 1:
        string = file.read(1)
        if string == "1":
            print("one")
        elif string == "0":
            print("zero")
It prints out either a "one" or "zero" in about 15000 lines:
Picture of the IDLE Shell after running the program
In the future I want it to print them in sets of eight (so that one line = one binary ASCII code), but this is pointless if I can't get it to do the following first.
Instead of printing to the IDLE shell, I want to write the output into a new .txt file. A few hours of searching, testing and outright guessing have me here:
with open('binary.txt') as file:
    with open('words.txt') as paper:
        while 1:
            string = file.read(1)
            if string == "1":
                line = ['one']
                paper.writelines(line)
            elif string == "0":
                line = ['zero']
                paper.writelines(line)
This then prints to the IDLE:
paper.writelines(line),
io.UnsupportedOperation: not writable
It goes without saying that I'm not very good at this. It has been a long time since I tried programming and I wasn't very good at it then either, so any help would be much appreciated.
Cheers.
In Python, open() opens a file in read mode by default, but you can change this by passing a second argument specifying the mode you need. To keep things simple, you need to open the output file in write mode or append mode.
# this will clear all the contents of the file and put the cursor at the beginning
with open('file.txt', 'w') as f:
    pass  # logic
# this will open the file and put the cursor at the end of file after its contents
with open('file.txt', 'a') as f:
    pass  # logic
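The three modes can be compared in a short sketch. It writes to a temporary file (an assumption, so that no real data is touched) and shows that the default read mode raises the same io.UnsupportedOperation error as in the question, while "w" and "a" accept writes.

```python
import io
import os
import tempfile

# Temporary file so the sketch does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "words.txt")

with open(path, "w") as f:      # "w" creates the file or empties an existing one
    f.write("one\n")

with open(path, "a") as f:      # "a" keeps the contents and writes at the end
    f.write("zero\n")

with open(path) as f:           # default mode is "r": read-only
    try:
        f.write("one\n")
        failed = False
    except io.UnsupportedOperation:
        failed = True           # same error as in the question

with open(path) as f:
    contents = f.read()
print(repr(contents), failed)
```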
I think this is what you are trying to do:
with open('binary.txt',"r") as file:
    with open('words.txt',"r+") as paper:
        while 1:
            for i in range(8):
                string = file.read(1)
                if string == "1":
                    line = 'one '
                    paper.write(line)
                elif string == "0":
                    line = 'zero '
                    paper.write(line)
            paper.write("\n")
This program takes the digits from the binary file and writes them to the paper file as you described (1 = one, 0 = zero, 8 per line).
Here it is:
with open('binary.txt',"r") as file:
    data = file.read()
    characters = data.split()
    print(len(characters))
file.close()
with open('binary.txt',"r") as file:
    with open('words.txt',"w") as paper:
        count = 0
        while count < len(characters):
            for i in range(9):
                string = file.read(1)
                if string == "1":
                    line = 'one '
                    paper.write(line)
                    print("one")
                elif string == "0":
                    line = 'zero '
                    paper.write(line)
                    print("zero")
            paper.write("\n")
            count += 1
It's probably really inefficient in places but it does the job.
Important stuff:
Because the output file was opened in the default mode (reading) rather than for writing, the program was unable to write to it. Adding "w" solved this; I also added an "r" to the input file's open() call to be safe.
I open the input file once beforehand to work out the number of lines, so that the program stops instead of churning out newlines and making the output file millions of lines long unless you stop it immediately.
This works out the number of lines necessary (the print(len(characters)) line is, ironically, unnecessary):
data = file.read()
characters = data.split()
print(len(characters))
I limited the number of lines the program can write by first setting a variable to 0, then changing the while loop so that it keeps looping only while the variable is less than the number of lines, and then adding a statement that adds 1 to the variable after every new line is written:
count = 0
while count < len(characters):
    count += 1
for i in range(8) was changed to for i in range(9) so that it would actually produce words in blocks of 8 before making new lines; the ninth read consumes the space that follows each group of eight digits. Before, it would make a line of 8 and then a line of 7 every time until the last line, where it would only write the leftover words, which in my case was 6.
Read file must have a very specific layout, that is only 0's and 1's, must have a space every 8 numbers and no returns at all:
Picture of the text file, notice how each "line" continues beyond the boundary, this is because each line is 1001 columns long but due to there being a limit per column of 1001, it returns back around to create a new line, while still counting as the same column on the counter at the bottom.
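A more compact variant of the same idea, offered only as a sketch: because the groups of eight digits are separated by spaces, split() can do the grouping, and the loop ends by itself when the groups run out, so no counter is needed. The names NAMES and binary_to_words are hypothetical, the sample data is made up, and in-memory io.StringIO objects stand in for binary.txt and words.txt. It reads the whole input at once, which should be fine for a file of this size.

```python
import io

NAMES = {"0": "zero", "1": "one"}


def binary_to_words(source, target):
    """Write one line of number words per whitespace-separated group of bits."""
    for group in source.read().split():
        # Anything other than 0 or 1 is skipped, as in the original loop.
        words = [NAMES[bit] for bit in group if bit in NAMES]
        target.write(" ".join(words) + "\n")


# In-memory stand-ins for binary.txt and words.txt (made-up sample data).
source = io.StringIO("01001000 01101001")
target = io.StringIO()
binary_to_words(source, target)
print(target.getvalue())
```

With real files, source would be opened with open('binary.txt') and target with open('words.txt', 'w').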

Python word counting program for .txt files keeps on showing string index out of range as an error code

I'm pretty new to this and I was trying to write a program which counts the words in .txt files. There is probably a better way of doing this, but this was the idea I came up with, so I wanted to go through with it. I just don't understand why i, or any other variable, doesn't work as an index for the string of the page that I'm counting on...
Do you guys have a solution, or should I just take a different approach?
page = open("venv\harrry_potter.txt", "r")
alphabet = "qwertzuiopüasdfghjklöäyxcvbnmßQWERTZUIOPÜASDFGHJKLÖÄYXCVBNM"
# Counting the characters
list_of_lines = page.readlines()
characternum = 0
textstr = "" # to convert the .txt file to string
for line in list_of_lines:
    for character in line:
        characternum += 1
        textstr += character
# Counting the words
i = 0
wordnum = 1
while i <= characternum:
    if textstr[i] not in alphabet and textstr[i+1] in alphabet:
        wordnum += 1
    i += 1
print(wordnum)
page.close()
Counting the characters and converting the .txt file to a string is done in a slightly odd way, because I thought doing it the other way could be the source of the problem...
Can you help me please?
Typically you want to use split for a simple word count. The way you are doing it, you will count right-minded as two words, and don't as two words. If you can rely on whitespace as the separator, then you can just use split like this:
book = "Hello, my name is Inigo Montoya, you killed my father, prepare to die."
words = book.split()
print(f'word count = {len(words)}')
You can also pass arguments to split if the default behaviour doesn't suit you.
https://pythonexamples.org/python-count-number-of-words-in-text-file/
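As for the error itself: textstr contains characternum characters, so its valid indexes run from 0 to characternum - 1, but the loop in the question runs up to i == characternum and also looks at textstr[i+1], which is what raises the IndexError. If you want to keep the letter-by-letter idea, one way to avoid indexes altogether is to remember whether the previous character was a letter. The sketch below is only an illustration; count_words and previous_is_letter are hypothetical names.

```python
def count_words(text, alphabet):
    """Count places where a letter follows a non-letter (or starts the text)."""
    wordnum = 0
    previous_is_letter = False
    for character in text:
        is_letter = character in alphabet
        if is_letter and not previous_is_letter:
            wordnum += 1
        previous_is_letter = is_letter
    return wordnum


alphabet = "qwertzuiopüasdfghjklöäyxcvbnmßQWERTZUIOPÜASDFGHJKLÖÄYXCVBNM"
print(count_words("Hello, my name is Inigo Montoya", alphabet))
```

Like the original approach, this still counts don't as two words, because the apostrophe is not in the alphabet string.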
You want to get the word count of a text file
The shortest code is this (that I could come up with):
with open('lorem.txt', 'r') as file:
    print(len(file.read().split()))
For smaller files this is fine, but it loads all of the data into memory, so it is not that great for large files. Note the context manager (with): it helps with error handling and closes the file for you. Here file.read() reads the whole file and returns a string, .split() splits that string on whitespace and returns a list of the words, and len() gives the length of that list.
A better approach would be this:
word_count = 0
with open('lorem.txt', 'r') as file:
    for line in file:
        word_count += len(line.split())
print(word_count)
Because here the whole file is not saved into memory, you read each line separately and overwrite the previous in the memory. Here again for each line you split it by space and measure the length of the returned list, then add to the total word count. At the end simply print out the total word count.
Useful sources:
about with
Context Managers - Efficiently Managing Resources (to learn how they work a bit in detail) by Corey Schafer
.split() "docs"

script to cat every other (even) line in a set of files together while leaving the odd lines unchanged

I have a set of three .fasta files of standardized format. Each one begins with a string that acts as a header on line 1, followed by a long string of nucleotides on line 2, where the header string denotes the animal that the nucleotide sequence came from. There are 14 of them altogether, for a total of 28 lines, and each of the three files has the headers in the same order. A snippet of one of the files is included below as an example, with the sequences shortened for clarity.
anas-crecca-crecca_KSW4951-mtDNA
ATGCAACCCCAGTCCTAGTCCTCAGTCTCGCATTAG...CATTAG
anas-crecca-crecca_KGM021-mtDNA
ATGCAACCCCAGTCCTAGTCCTCAGTCTCGCATTAG...CATTAG
anas-crecca-crecca_KGM020-mtDNA
ATGCAACCCCAGTCCTAGTCCTCAGTCTCGCATTAG...CATTAG
What I would like to do is write a script or program that cats each of the strings of nucleotides together, but keeps them in the same position. My knowledge, however, is limited to rudimentary python, and I'd appreciate any help or tips someone could give me.
Try this:
data = ""
with open('filename.fasta') as f:
    i = 0
    for line in f:
        i=i+1
        if (i%2 == 0):
            data = data + line[:-1]
# Copy and paste above block for each file,
# replacing filename with the actual name.
print(data)
Remember to replace "filename.fasta" with your actual file name!
How it works
Variable i acts as a line counter, when it is even, i%2 will be zero and the new line is concatenated to the "data" string. This way, the odd lines are ignored.
The [:-1] at the end of the data line removes the line break, allowing you to add all sequences to the same line.
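If the goal is instead to join the sequences position by position across the three files while keeping each header once (one possible reading of the question), a sketch along the following lines might help. It assumes, as the question states, that every file lists the headers in the same order, and additionally that each sequence sits on exactly one line. The function name merge_fasta and the miniature sample data are made up for illustration.

```python
import io


def merge_fasta(streams):
    """Yield header lines unchanged and sequence lines joined across files."""
    for number, lines in enumerate(zip(*streams), start=1):
        stripped = [line.rstrip("\n") for line in lines]
        if number % 2 == 1:
            yield stripped[0]          # odd line: header, taken from the first file
        else:
            yield "".join(stripped)    # even line: sequences joined in file order


# Made-up miniature files standing in for the three .fasta files.
a = io.StringIO("duck1\nATG\nduck2\nCCC\n")
b = io.StringIO("duck1\nTTT\nduck2\nGGG\n")
c = io.StringIO("duck1\nAAA\nduck2\nTAG\n")
merged = list(merge_fasta([a, b, c]))
print("\n".join(merged))
```

With real files, the list passed to merge_fasta would hold the three objects returned by open().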

How to input a line word by word in Python?

I have multiple files, each with a line with, say ~10M numbers each. I want to check each file and print a 0 for each file that has numbers repeated and 1 for each that doesn't.
I am using a list for counting frequency. Because of the large amount of numbers per line I want to update the frequency after accepting each number and break as soon as I find a repeated number. While this is simple in C, I have no idea how to do this in Python.
How do I input a line in a word-by-word manner without storing (or taking as input) the whole line?
EDIT: I also need a way for doing this from live input rather than a file.
Read the line, split the line, copy the array result into a set. If the size of the set is less than the size of the array, the file contains repeated elements
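with open('filename', 'r') as f:
    for line in f:
        numbers = line.split()
        # Fewer distinct values than values means the line contains repeats
        print(0 if len(set(numbers)) < len(numbers) else 1)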
To read the file word by word, try this
import itertools
def readWords(file_object):
    word = ""
    # map is lazy in Python 3 (itertools.imap existed only in Python 2)
    for ch in itertools.takewhile(lambda c: bool(c), map(file_object.read, itertools.repeat(1))):
        if ch.isspace():
            if word: # In case of multiple spaces
                yield word
                word = ""
            continue
        word += ch
    if word:
        yield word # Handles last word before EOF
Then you can do:
with open('filename', 'r') as f:
    for num in map(int, readWords(f)):
        # Store the numbers in a set, and use the set to check if the number already exists
        pass
This method should also work for streams because it only reads one character at a time and yields one whitespace-delimited word at a time from the input stream.
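Putting the pieces together, the following sketch stops reading as soon as the first repeated number is seen. It is a self-contained Python 3 variant rather than the exact code above: read_words uses iter() with a sentinel in place of the itertools calls, and all_unique is a hypothetical helper name. The io.StringIO objects stand in for files.

```python
import io


def read_words(stream):
    """Yield whitespace-separated words, reading one character at a time."""
    word = ""
    # iter() with a sentinel calls stream.read(1) until it returns "".
    for ch in iter(lambda: stream.read(1), ""):
        if ch.isspace():
            if word:
                yield word
                word = ""
        else:
            word += ch
    if word:
        yield word


def all_unique(stream):
    """Return 1 if no number repeats, 0 as soon as a repeat is seen."""
    seen = set()
    for number in map(int, read_words(stream)):
        if number in seen:
            return 0
        seen.add(number)
    return 1


print(all_unique(io.StringIO("3 1 4 1 5 9")))
print(all_unique(io.StringIO("3 1 4 5 9")))
```

For live input, passing sys.stdin instead of a file object should work the same way, since it also supports read(1).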
After giving this answer, I've updated this method quite a bit. Have a look:
https://gist.github.com/smac89/bddb27d975c59a5f053256c893630cdc
I don't think this is possible in the way you are asking. Python file objects have no built-in method for reading one word at a time. Something like this can be done instead, although it does read the whole file:
with open('words.txt') as f:
    for word in f.read().split():
        print(word)

How to count the number of characters in a file (not using the len function)?

Basically, I want to be able to count the number of characters in a txt file (with user input of file name). I can get it to display how many lines are in the file, but not how many characters. I am not using the len function and this is what I have:
def length(n):
    value = 0
    for char in n:
        value += 1
    return value
filename = input('Enter the name of the file: ')
f = open(filename)
for data in f:
    data = length(f)
print(data)
All you need to do is sum the number of characters in each line (data):
total = 0
for line in f:
    data = length(line)
    total += data
print(total)
There are two problems.
First, for each line in the file, you're passing f itself—that is, a sequence of lines—to length. That's why it's printing the number of lines in the file. The length of that sequence of lines is the number of lines in the file.
To fix this, you want to pass each line, data—that is, a sequence of characters. So:
for data in f:
    print(length(data))
Next, while that will properly calculate the length of each line, you have to add them all up to get the length of the whole file. So:
total_length = 0
for data in f:
    total_length += length(data)
print(total_length)
However, there's another way to tackle this that's a lot simpler. If you read() the file, you will get one giant string, instead of a sequence of separate lines. So you can just call length once:
data = f.read()
print(length(data))
The problem with this is that you have to have enough memory to store the whole file at once. Sometimes that's not appropriate. But sometimes it is.
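A possible middle ground, sketched here under assumptions: read the file in fixed-size chunks, so that memory use stays bounded even if the file has no line breaks at all. The function name count_characters and the default chunk_size of 4096 are arbitrary choices for illustration, and the counting is done by hand to respect the no-len constraint.

```python
import io


def count_characters(stream, chunk_size=4096):
    """Count characters without len(), holding at most one chunk in memory."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:          # read() returns "" at end of file
            break
        for char in chunk:
            total += 1
    return total


# In-memory stand-in for an opened text file.
print(count_characters(io.StringIO("abc\ndef\n"), chunk_size=3))
```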
When you iterate over a file (opened in text mode) you are iterating over its lines.
for data in f: could be rewritten as for line in f: and it is easier to see what it is doing.
Your length function looks like it should work but you are sending the open file to it instead of each line.
