counting letters in a text file in python - python

So I'm trying to do this problem
Write a program that reads a file named text.txt and prints the following to the
screen:
 The number of characters in that file
 The number of letters in that file
 The number of uppercase letters in that file
 The number of vowels in that file
This is what I have so far, but I am stuck on step 2:
file = open('text.txt', 'r')
lineC = 0
chC = 0
lowC = 0
vowC = 0
capsC = 0
for line in file:
for ch in line:
words = line.split()
lineC += 1
chC += len(ch)
for letters in file:
for ch in line:
print("Charcter Count = " + str(chC))
print("Letter Count = " + str(num))

You can do this using regular expressions: find all occurrences of a pattern as a list, then take the length of that list.
import re
with open('text.txt') as f:
    text = f.read()
characters = len(re.findall(r'\S', text))  # counts non-whitespace characters only
letters = len(re.findall(r'[A-Za-z]', text))
uppercase = len(re.findall(r'[A-Z]', text))
vowels = len(re.findall(r'[AEIOUYaeiouy]', text))  # note: treats y as a vowel

The answer above uses regular expressions, which are very useful and worth learning about if you haven't used them before. Bunji's code is also more efficient, as looping through the characters of a string in Python is relatively slow.
However, if you want to try doing this without regular expressions, take a look at the code below. A couple of points: First, wrap your open() in a with statement, which automatically calls close() on the file when you are finished. Next, notice that Python lets you use the in keyword in all kinds of interesting ways. Any sequence can be tested with in, including strings. You could replace each of the string.xxx constants with a string of your own if you would like.
import string
chars = []
# The question names the file text.txt
with open("text.txt", "r") as f:
    for c in f.read():
        chars.append(c)
num_chars = len(chars)
num_upper = 0
num_vowels = 0
num_letters = 0
vowels = "aeiouAEIOU"
for c in chars:
    if c in vowels:
        num_vowels += 1
    if c in string.ascii_uppercase:
        num_upper += 1
    if c in string.ascii_letters:
        num_letters += 1
print(num_chars)
print(num_letters)
print(num_upper)
print(num_vowels)
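The same four counts can also be gathered in a single loop and returned from a function, which makes the logic easy to check on a short string before pointing it at a file. This is only a minimal sketch: count_stats is a hypothetical helper name, "letters" is assumed to mean ASCII letters, and y is assumed not to be a vowel.

```python
import string

def count_stats(text):
    """Return (characters, letters, uppercase, vowels) for a string."""
    letters = uppercase = vowels = 0
    for c in text:
        if c in string.ascii_letters:  # assumption: only ASCII letters count
            letters += 1
            if c in string.ascii_uppercase:
                uppercase += 1
            if c in "aeiouAEIOU":  # assumption: y is not a vowel
                vowels += 1
    # Every character counts here, including spaces and newlines
    return len(text), letters, uppercase, vowels

if __name__ == "__main__":
    print(count_stats("Hello, World!"))
```

To use it on the file from the question, read the file inside a with block and pass the resulting string to count_stats.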

Related

find words in txt files Python 3

I'd like to create a program in Python 3 that finds how many times specific words appear in txt files and then builds an Excel table with these values.
I wrote this function, but when I call it with my input, the program doesn't work. It fails with this message: unindent does not match any outer indentation level
def wordcount(filename, listwords):
try:
file = open( filename, "r")
read = file.readlines()
file.close()
for x in listwords:
y = x.lower()
counter = 0
for z in read:
line = z.split()
for ss in line:
l = ss.lower()
if y == l:
counter += 1
print(y , counter)
Now I try to call the function with a txt file and the word to find:
wordcount("aaa.txt" , 'word' )
As output I'd like to see:
word 4
Thanks, everybody!
Here is an example you can use to count how many lines of a text file contain a specific word:
def searching(filename, word):
    counter = 0
    with open(filename) as f:
        for line in f:
            if word in line:
                print(word)
                counter += 1
    return counter
x = searching("filename", "wordtofind")
print(x)
The output is the word, printed once for each line that contains it, followed by the number of such lines.
As short as possible:
def wordcount(filename, listwords):
    with open(filename) as file_object:
        file_text = file_object.read()
    return {word: file_text.count(word) for word in listwords}
for word, count in wordcount('aaa.txt', ['a', 'list', 'of', 'words']).items():
    print("Count of {}: {}".format(word, count))
Getting back to mij's comment about passing listwords as an actual list: if you pass a string to code that expects a list, Python will iterate over the string character by character, which can be confusing if this behaviour is unfamiliar.
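Two pitfalls come up in this thread: a bare string is iterated character by character, and both in and str.count match substrings, so searching for "word" also finds "sword". A minimal sketch that guards against both follows. count_words is a hypothetical name, and treating a "word" as a run of \w characters is an assumption rather than something the question specifies.

```python
import re
from collections import Counter

def count_words(text, words):
    """Count whole-word, case-insensitive occurrences of each word in text."""
    if isinstance(words, str):
        # A bare string would otherwise be iterated one character at a time
        words = [words]
    # Assumption: a word is a run of letters, digits, or underscores
    found = Counter(re.findall(r"\w+", text.lower()))
    return {word: found[word.lower()] for word in words}

if __name__ == "__main__":
    print(count_words("Word, word; sword.", "word"))
```

Because the text is tokenized once into a Counter, looking up many words costs no extra passes over the text.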

Printing character percent in a text file

I just wrote a function that prints the percentage of each character in a text file. However, I have a problem: my program treats uppercase characters as different characters and also counts spaces, so the result is wrong. How can I fix this?
def count_char(text, char):
    count = 0
    for character in text:
        if character == char:
            count += 1
    return count
filename = input("Enter the file name: ")
with open(filename) as file:
    text = file.read()
for char in "abcdefghijklmnopqrstuvwxyz":
    perc = 100 * count_char(text, char) / len(text)
    print("{0} - {1}%".format(char, round(perc, 2)))
You should make the text lower case using text.lower(), and to avoid counting spaces you should split the string into a list of words using text.lower().split(). This should do:
def count_char(text, char):
    count = 0
    for word in text.lower().split():  # iterates over every word in the text
        for character in word:  # iterates over every character in each word
            if character == char:
                count += 1
    return count
filename = input("Enter the file name: ")
with open(filename) as file:
    text = file.read()
totalChars = sum([len(i) for i in text.lower().split()])
for char in "abcdefghijklmnopqrstuvwxyz":
    perc = 100 * count_char(text, char) / totalChars
    print("{0} - {1}%".format(char, round(perc, 2)))
Notice the change in the definition of perc: sum([len(i) for i in text.lower().split()]) returns the number of characters in the list of words, whereas len(text) also counts spaces.
You can use a Counter and a generator expression to count all letters like so:
from collections import Counter
with open(fn) as f:
    c=Counter(c.lower() for line in f for c in line if c.isalpha())
Explanation of the generator expression:
c = Counter(           # create a Counter
    c.lower()          # each character, made lower case
    for line in f      # read one line at a time from the file
    for c in line      # iterate over the line one character at a time
    if c.isalpha()     # only add letters (isalpha() also accepts non-ASCII letters)
)
Then get the total letter count:
total_letters=float(sum(c.values()))
Then the percentage for any letter is c[letter] / total_letters * 100
Note that the Counter c only holds letters, not spaces. So the calculated percentage for each letter is its share of all letters.
The advantages here:
You are reading the entire file anyway to get the count of the character in question and the total of all characters. You might as well count the frequency of all characters as you read them;
You do not need to read the entire file into memory. Doing so is fine for smaller files but not for larger ones;
A Counter will correctly return 0 for letters not in the file;
Idiomatic Python.
So your entire program becomes:
from collections import Counter
with open(fn) as f:
    c=Counter(c.lower() for line in f for c in line if c.isalpha())
total_letters=float(sum(c.values()))
for char in "abcdefghijklmnopqrstuvwxyz":
    print("{} - {:.2%}".format(char, c[char] / total_letters))
You want to make the text lower case before counting the char:
def count_char(text, char):
    count = 0
    for character in text.lower():
        if character == char:
            count += 1
    return count
You can use the built-in str.count method to count the characters after converting everything to lowercase via .lower(). Additionally, your current program doesn't work properly because calling len on the whole text doesn't exclude spaces and punctuation.
import string
filename = input("Enter the file name: ")
with open(filename) as file:
    text = file.read().lower()
chars = {char:text.count(char) for char in string.ascii_lowercase}
allLetters = float(sum(chars.values()))
for char in chars:
    print("{} - {}%".format(char, round(chars[char]/allLetters*100, 2)))
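One case the programs in this thread do not cover is a file with no letters at all, where dividing by the letter total raises ZeroDivisionError. The sketch below wraps the Counter approach in a function that can be checked on a short string and that returns zeros in that case. letter_percentages is a hypothetical name, and restricting the count to ASCII a-z is an assumption.

```python
import string
from collections import Counter

def letter_percentages(text):
    """Map each letter a-z to its percentage of all ASCII letters in text."""
    counts = Counter(c for c in text.lower() if c in string.ascii_lowercase)
    total = sum(counts.values())
    if total == 0:
        # No letters at all: avoid dividing by zero
        return {letter: 0.0 for letter in string.ascii_lowercase}
    return {letter: 100 * counts[letter] / total
            for letter in string.ascii_lowercase}

if __name__ == "__main__":
    for letter, perc in letter_percentages("Aa b!").items():
        print("{} - {}%".format(letter, round(perc, 2)))
```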

Need assistance with cleaning words that were counted from a text file

I have an input text file from which I have to count sum of characters, sum of lines, and sum of each word.
So far I have been able to get the count of characters, lines and words. I also converted the text to all lower case so I don't get 2 different counts for same word where one is in lower case and the other is in upper case.
Looking at the output, I realized that the word counts are not clean. I have been struggling to produce clean output that does not count special characters and does not include a trailing period or comma as part of a word.
Ex. if the text file contains the line: "Hello, I am Bob. Hello to Bob *"
it should output:
2 Hello
2 Bob
1 I
1 am
1 to
Instead my code outputs
1 Hello,
1 Hello
1 Bob.
1 Bob
1 I
1 am
1 to
1 *
Below is the code I have as of now.
# Open the input file
fname = open('2013_honda_accord.txt', 'r').read()
# COUNT CHARACTERS
num_chars = len(fname)
# COUNT LINES
num_lines = fname.count('\n')
#COUNT WORDS
fname = fname.lower() # convert the text to lower first
words = fname.split()
d = {}
for w in words:
    # if the word is repeated - start count
    if w in d:
        d[w] += 1
    # if the word is only used once then give it a count of 1
    else:
        d[w] = 1
# Add the sum of all the repeated words
num_words = sum(d[w] for w in d)
lst = [(d[w], w) for w in d]
# sort the list of words in alpha for the same count
lst.sort()
# list word count from greatest to lowest (will also show the sort in reverse order Z-A)
lst.reverse()
# output the total number of characters
print('Your input file has characters = ' + str(num_chars))
# output the total number of lines
print('Your input file has num_lines = ' + str(num_lines))
# output the total number of words
print('Your input file has num_words = ' + str(num_words))
print('\n The 30 most frequent words are \n')
# print the number of words as a count from the text file with the sum of each word used within the text
i = 1
for count, word in lst[:10000]:
    print('%2s. %4s %s' % (i, count, word))
    i += 1
Thanks
Try replacing
words = fname.split()
with
get_alphabetical_characters = lambda word: "".join([char if char in 'abcdefghijklmnopqrstuvwxyz' else '' for char in word])
# drop empty strings so a token such as "*" does not become an empty word
words = [w for w in map(get_alphabetical_characters, fname.split()) if w]
Let me explain the various parts of the code.
Starting with the first line, whenever you have a declaration of the form
function_name = lambda argument1, argument2, ..., argumentN: some_python_expression
what you're looking at is the definition of a function whose body is a single expression. It cannot contain statements such as assignments; it simply returns the value of that expression.
So get_alphabetical_characters is a function which, as its name suggests, takes a word and returns only the alphabetical characters contained within it.
This is accomplished using the "".join(some_list) idiom, which takes a list of strings and concatenates them (in other words, it produces a single string by joining them together in the given order).
The some_list here is provided by the list comprehension [char if char in 'abcdefghijklmnopqrstuvwxyz' else '' for char in word]
This steps through every character in the given word and puts it into the list if it's alphabetical; if it isn't, it puts an empty string in its place.
For example
[char if char in 'abcdefghijklmnopqrstuvwxyz' else '' for char in "hello."]
evaluates to the following list:
['h','e','l','l','o','']
which is then passed to
"".join(['h','e','l','l','o',''])
which is equivalent to
'h'+'e'+'l'+'l'+'o'+''
Notice that the empty string added at the end has no effect. Adding an empty string to any string returns that same string again.
And this in turn ultimately yields
"hello"
Hope that's clear!
Edit #2: If you want to keep periods that are used as decimal points, you can write a function like this:
include_char = lambda pos, a_string: a_string[pos].isalnum() or a_string[pos].isspace() or (a_string[pos] == '.' and a_string[pos-1:pos].isdigit())
words = "".join(char for pos, char in enumerate(fname) if include_char(pos, fname)).split()
The include_char function checks whether the character at a given position is alphanumeric (i.e. a letter or a digit), is whitespace, or is a period preceded by a digit. We use it to keep only the characters we want, join them into a single string, and then separate that string into a list of words using the str.split method.
This program (written for Python 2) may help you:
# A list of characters that should not be treated as part of a word
char2remove = (".", ",", ";", "!", "?", "*", ":")
# Read a string from the user
string = raw_input("Enter your string: ")
# Make all the letters lower case
string = string.lower()
# Replace the special characters with white space
for char in char2remove:
    string = string.replace(char, " ")
# Extract all the words in the new string (with repeats)
words = string.split(" ")
# Create a dictionary to remove repeats
to_count = dict()
for word in words:
    to_count[word] = 0
# Count the occurrences of each word
for word in to_count:
    # Splitting on single spaces leaves empty strings behind; isalpha() skips them
    if word.isalpha():
        print word, string.count(word)
Works as below:
>>> ================================ RESTART ================================
>>>
Enter your string: Hello, I am Bob. Hello to Bob *
i 1
am 1
to 1
bob 2
hello 2
>>>
Another way is to use a regex to remove all non-letter characters (which does away with the char2remove list):
import re
regex = re.compile('[^a-zA-Z]')
your_str = raw_input("Enter String: ")
your_str = your_str.lower()
your_str = regex.sub(' ', your_str)  # sub() returns a new string, so keep the result
words = your_str.split(" ")
to_count = dict()
for word in words:
    to_count[word] = 0
for word in to_count:
    if word.isalpha():
        print word, your_str.count(word)
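Both programs above count with str.count, which matches substrings, so counting "am" would also match inside "name". A minimal Python 3 sketch that tokenizes first and then counts whole words is shown here. word_frequencies is a hypothetical name, and it assumes a word is a run of ASCII letters, so digits are dropped and "don't" splits into "don" and "t".

```python
import re
from collections import Counter

def word_frequencies(text):
    """Return (word, count) pairs, most frequent first, ignoring punctuation."""
    # Assumption: a word is a run of ASCII letters after lower-casing
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common()

if __name__ == "__main__":
    for word, count in word_frequencies("Hello, I am Bob. Hello to Bob *"):
        print(count, word)
```

Note that the words come back lower-cased, so the output shows hello and bob rather than Hello and Bob.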

Write a function to take a text file and print the number of lines, words, and characters in the file

I am currently struggling with a question related to Python file I/O. The question that I have been unable to complete is this:
Write a function stats() that takes one input argument: the
name of a text file. The function should print, on the screen, the
number of lines, words, and characters in the file; your function
should open the file only once.
I must also eliminate all punctuation to get the word count right. The function should print output like this:
>>>stats('example.txt')
line count: 3
word count: 20
character count: 98
Please review the Python documentation about I/O here, Built-in Functions here, and Common string operations here.
There are many ways to get this done. With a quick go at it, the following should get the job done based on your requirements. Calling split() with no arguments drops the whitespace when it converts each line to a list of words.
def GetFileCounts(in_file):
    char_count = 0
    word_count = 0
    line_count = 0
    with open(in_file) as read_file:
        for line in read_file:
            line_count += 1
            char_count += len(line)
            # split() with no argument ignores repeated spaces and the trailing newline
            word_count += len(line.split())
    print("line count: {0}\nword count: {1}\ncharacter count: {2}".format(line_count, word_count, char_count))
You may want to refine your requirements definition a bit more, as there are some subtle things that can change the output:
What is your definition of a character?
Only letters?
Letters, spaces, in-word punctuation (e.g. hyphens), end-of-line characters, etc.?
As far as your question goes, you can achieve it like this:
fname = "inputfile.txt"
num_lines = 0
num_words = 0
num_chars = 0
with open(fname, 'r') as f:
    for line in f:
        words = line.split()
        num_lines += 1
        num_words += len(words)
        num_chars += len(line)
There are lots of free tutorials online for Python. Please go through them thoroughly. You can find reference books here
Are you open to third party libs?
Have a look at: https://textblob.readthedocs.org/en/dev/
It has all of the features you want implemented pretty well.
@nagato, I think that with num_chars += len(line) you count everything in the file, both characters and blank spaces, but thanks for your code, it helped me write my own.
My version:
def char_freq_table():
    with open('/home/.....', 'r') as f:
        line = 0
        word = 0
        character = 0
        for i in f:
            line += 1  # count every line
            word += len(i.split())
            character += len(i.replace(" ", ""))  # drop the blank spaces, accumulate per line
    print("line count: {}".format(line))
    print("word count: {}".format(word))
    print("character count: {}".format(character))
char_freq_table()
filename = 'Inputfile.txt'
def filecounts(filename):
    try:
        f = open(filename)
    except IOError as err:
        # report the problem, then re-raise because there is nothing to count
        print("Unable to open the file for reading %s: %s" % (filename, err))
        raise
    text = f.readlines()
    f.close()
    linecount = 0
    wordcount = 0
    lettercount = 0
    for i in text:
        linecount += 1
        lettercount += len(i)  # counts every character, including spaces and newlines
        wordcount += len(i.split())
    return lettercount, wordcount, linecount
print("Problem 1: letters: %d, words: %d, lines: %d" % (filecounts(filename)))
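None of the answers above removes punctuation before counting words, which the exercise asks for. One possible sketch follows, with the counting split into a helper so it can be tried on a list of lines. compute_stats is a hypothetical name, and counting every character (spaces and newlines included) is an assumption, since the exercise does not define "character".

```python
import string

# Translation table that deletes every ASCII punctuation character
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)

def compute_stats(lines):
    """Return (line_count, word_count, char_count) for an iterable of lines."""
    line_count = word_count = char_count = 0
    for line in lines:
        line_count += 1
        char_count += len(line)  # assumption: spaces and newlines count too
        # Remove punctuation first so a lone "-" or "*" is not a word
        word_count += len(line.translate(_STRIP_PUNCT).split())
    return line_count, word_count, char_count

def stats(filename):
    """Print the three counts; the file is opened exactly once."""
    with open(filename) as f:
        lines, words, chars = compute_stats(f)
    print("line count: {}\nword count: {}\ncharacter count: {}".format(
        lines, words, chars))
```

Passing the open file object straight to compute_stats keeps memory use low, because the file is read one line at a time.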

Reading a file by word without using split in Python

I have a one line file that I want to read word by word, i.e., with space separating words. Is there a way to do this without loading the data into the memory and using split? The file is too large.
You can read the file character by character and yield a word each time you reach a space. Below is a simple solution for a file whose words are separated by single spaces; you should refine it for more complex cases (tabs, multiple spaces, etc.).
def read_words(filename):
    with open(filename) as f:
        out = ''
        while True:
            c = f.read(1)
            if not c:
                if out:
                    yield out  # emit the last word, which has no trailing space
                break
            elif c == ' ':
                yield out
                out = ''
            else:
                out += c
Example:
for i in read_words("test"):
    print(i)
It uses a generator to avoid having to allocate a big chunk of memory.
Try this little function:
def readword(file):
    word = ''
    while True:
        c = file.read(1)
        # stop at a space, a newline, or the end of the file ('')
        if c in ('', ' ', '\n'):
            return word
        word += c
Then to use it, you can do something like:
f = open('file.ext', 'r')
print(readword(f))
This will read the first word in the file, so if your file looks like this:
12 22 word x yy
another word
...
then the output should be 12.
The next time you call this function, it will read the next word, and so on.
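Reading one character at a time works, but it makes a great many read() calls on a large file. A possible variation reads fixed-size chunks and carries any partial word over to the next chunk, which also copes with tabs, newlines, and repeated spaces. This is only a sketch: iter_words and the default chunk_size of 4096 are illustrative choices, not taken from the answers above.

```python
def iter_words(stream, chunk_size=4096):
    """Yield whitespace-separated words from a text stream, chunk by chunk."""
    leftover = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parts = (leftover + chunk).split()
        if chunk[-1].isspace():
            # The chunk ended on whitespace, so every piece is a whole word
            leftover = ""
        else:
            # The chunk ended mid-word: hold the last piece for the next read
            leftover = parts.pop() if parts else ""
        for word in parts:
            yield word
    if leftover:
        yield leftover  # the final word has no trailing whitespace

if __name__ == "__main__":
    import io
    for word in iter_words(io.StringIO("12 22 word x yy\nanother word")):
        print(word)
```

Memory use is bounded by the chunk size plus the length of the longest word, so a one-line file never has to be loaded whole.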
