Find the number of characters in a file using Python

Find the number of characters in a file using Python - python

Here is the question:
I have a file with these words:
hey how are you
I am fine and you
Yes I am fine
And it is asked to find the number of words, lines and characters.
Below is my program, but the number of counts for the characters without space is not correct.
The number of words is correct and the number of line is correct.
What is the mistake in the same loop?
fname = input("Enter the name of the file:")
infile = open(fname, 'r')
lines = 0
words = 0
characters = 0
for line in infile:
wordslist = line.split()
lines = lines + 1
words = words + len(wordslist)
characters = characters + len(line)
print(lines)
print(words)
print(characters)
The output is:
lines=3(Correct)
words=13(correct)
characters=47
I've looked on the site with multiple answers and I am confused because I didn't learn some other functions in Python. How do I correct the code as simple and basic as it is in the loop I've done?
Whereas the number of characters without space is 35 and with space is 45.
If possible, I want to find the number of characters without space. Even if someone know the loop for the number of characters with space that's fine.

Sum up the length of all words in a line:
characters += sum(len(word) for word in wordslist)
The whole program:
with open('my_words.txt') as infile:
lines=0
words=0
characters=0
for line in infile:
wordslist=line.split()
lines=lines+1
words=words+len(wordslist)
characters += sum(len(word) for word in wordslist)
print(lines)
print(words)
print(characters)
Output:
3
13
35
This:
(len(word) for word in wordslist)
is a generator expression. It is essentially a loop in one line that produces the length of each word. We feed these lengths directly to sum:
sum(len(word) for word in wordslist)
Improved version
This version takes advantage of enumerate, so you save two lines of code, while keeping the readability:
with open('my_words.txt') as infile:
words = 0
characters = 0
for lineno, line in enumerate(infile, 1):
wordslist = line.split()
words += len(wordslist)
characters += sum(len(word) for word in wordslist)
print(lineno)
print(words)
print(characters)
This line:
with open('my_words.txt') as infile:
opens the file with the promise to close it as soon as you leave indentation.
It is always good practice to close file after your are done using it.

Remember that each line (except for the last) has a line separator.
I.e. "\r\n" for Windows or "\n" for Linux and Mac.
Thus, exactly two characters are added in this case, as 47 and not 45.
A nice way to overcome this could be to use:
import os
fname=input("enter the name of the file:")
infile=open(fname, 'r')
lines=0
words=0
characters=0
for line in infile:
line = line.strip(os.linesep)
wordslist=line.split()
lines=lines+1
words=words+len(wordslist)
characters=characters+ len(line)
print(lines)
print(words)
print(characters)

To count the characters, you should count each individual word. So you could have another loop that counts characters:
for word in wordslist:
characters += len(word)
That ought to do it. The wordslist should probably take away newline characters on the right, something like wordslist = line.rstrip().split() perhaps.

I found this solution very simply and readable:
with open("filename", 'r') as file:
text = file.read().strip().split()
len_chars = sum(len(word) for word in text)
print(len_chars)

This is too long for a comment.
Python 2 or 3? Because it really matters. Try out the following in your REPL for both:
Python 2.7.12
>>>len("taña")
5
Python 3.5.2
>>>len("taña")
4
Huh? The answer lies in unicode. That ñ is an 'n' with a combining diacritical. Meaning its 1 character, but not 1 byte. So unless you're working with plain ASCII text, you'd better specify which version of python your character counting function is for.

How's this? It uses a regular expression to match all non-whitespace characters and returns the number of matches within a string.
import re
DATA="""
hey how are you
I am fine and you
Yes I am fine
"""
def get_char_count(s):
return len(re.findall(r'\S', s))
if __name__ == '__main__':
print(get_char_count(DATA))
Output
35
The image below shows this tested on RegExr:

It is probably counting new line characters. Subtract characters with (lines+1)

Here is the code:
fp = open(fname, 'r+').read()
chars = fp.decode('utf8')
print len(chars)
Check the output. I just tested it.

A more Pythonic solution than the others:
with open('foo.txt') as f:
text = f.read().splitlines() # list of lines
lines = len(text) # length of the list = number of lines
words = sum(len(line.split()) for line in text) # split each line on spaces, sum up the lengths of the lists of words
characters = sum(len(line) for line in text) # sum up the length of each line
print(lines)
print(words)
print(characters)
The other answers here are manually doing what str.splitlines() does. There's no reason to reinvent the wheel.

You do have the correct answer - and your code is completely correct. The thing that I think it is doing is that there is an end of line character being passed through, which includes your character count by two (there isn't one on the last line, as there is no new line to go to). If you want to remove this, the simple fudge would be to do as Loaf suggested
characters = characters - (lines - 1)
See csl's answer for the second part...

Simply skip unwanted characters while calling len,
import os
characters=characters+ len([c for c in line if c not in (os.linesep, ' ')])
or sum the count,
characters=characters+ sum(1 for c in line if c not in (os.linesep, ' '))
or build a str from the wordlist and take len,
characters=characters+ len(''.join(wordlist))
or sum the characters in the wordlist. I think this is the fastest.
characters=characters+ sum(1 for word in wordlist for char in word)

You have two problems. One is the line endings and the other is the spaces in between.
Now there are many people who posted pretty good answers, but I find this method easier to understand:
characters = characters + len(line.strip()) - line.strip().count(' ')
line.strip() removes the trailing and leading spaces. Then I'm subtracting the number of spaces from the total length.

It's very simple:
f = open('file.txt', 'rb')
f.seek(0) # Move to the start of file
print len(f.read())

Here i got smallest program with less memory usage for your problem
with open('FileName.txt') as f:
lines = f.readlines()
data = ''.join(lines)
print('lines =',len(lines))
print('Words = ',len(data.split()))
data = ''.join(data.split())
print('characters = ',len(data))
lines will be list of lines,so length of lines is nothing but number of lines.Next step data contains a string of your file contents(each word separated by a whitespace), so if we split data gives list of words in your file. thus, length of that list gives number of words. again if we join the words list you will get all characters as a single string. thus length of that gives number of characters.

taking the input as file name i.e files.txt from the input parameter and then counting the total number of characters in the file and save to the variable
char
fname = input("Enter the name of the file:")
infile = open(fname, 'r') # connection of the file
lines = 0
words = 0
char = 0 # init as zero integer
for line in infile:
wordslist = line.split() # splitting line to word
lines = lines + 1 # counter up the word
words = words + len(wordslist) # splitting word to charac
char = char + len(line) # counter up the character
print("lines are: " + str(lines))
print("words are: " + str(words))
print("chars are: " + str(char)) # printing beautify

num_lines = sum(1 for line in open('filename.txt'))
num_words = sum(1 for word in open('filename.txt').read().split())
num_chars = sum(len(word) for word in open('filename.txt').read().split())

Related

How to get first word from text file removing \n - python

If the text file is /n/n Hello world!/n I like python./n
How do I get the first word from that text?
I tried to code:
def word_file(file):
files = open(file, 'r')
l = files.readlines()
for i in range(len(l)):
a = l[i].rstrip("\n")
line = l[0]
word = line.strip().split(" ")[0]
return word
There is space in front Hello.
The result I get is NONE. How should I correct it?
Can anybody help?

Assuming there is a word in the file:
def word_file(f):
with open(f) as file:
return file.read().split()[0]
file.read reads the entire file as a string. Do a split with no parameters on that string (i.e. sep=None). Then according to the Python manual "runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace." So the splitting will be done on consecutive white space and there will be no empty strings returned as a result of the split. Therefore the first element of the returned list will be the first word in the file.
If there is a possibility that the file is empty or contains nothing but white space, then you would need to check the return value from file.read().split() to ensure it is not an empty list.
If you need to avoid having to read the entire file into memory at once, then the following, less terse code can be used:
def word_file(f):
with open(f) as file:
for line in file:
words = line.split()
if words:
return words[0]
return None # No words found

Edit: #Booboo answer is far better than my answer
This should work:
def word_file(file):
with open(file, 'r') as f:
for line in f:
for index, character in enumerate(line):
if not character.isspace():
line = line[index:]
for ind, ch in enumerate(line):
if ch.isspace():
return line[:ind]
return line # could not find whitespace character at end
return None # no words found
output:
Hello

Python - Calculating length of string is inaccurate for certain strings only

I'm new to programming and trying to make a basic hangman game. For some reason when calculating the length of a string from a text file some words have the length of the string calculated incorrectly. Some strings have values too high and some too low. I can't seem to figure out why. I have already ensured that there are no spaces in the text file so that the space is counted as a character.
import random
#chooses word from textfile for hangman
def choose_word():
words = []
with open("words.txt", "r") as file:
words = file.readlines()
#number of words in text file
num_words = sum(1 for line in open("words.txt"))
n = random.randint(1, num_words)
#chooses the selected word from the text file
chosen_word = (words[n-1])
print(chosen_word)
#calculates the length of the word
len_word = len(chosen_word)
print(len_word)
choose_word()
#obama returns 5
#snake, turtle, hoodie, all return 7
#intentions returns 11
#racecar returns 8
words.txt
snake
racecar
turtle
cowboy
intentions
hoodie
obama

Use strip().
string.strip(s[, chars])
Return a copy of the string with leading and trailing characters removed. If chars is omitted or None, whitespace characters are removed. If given and not None, chars must be a string; the characters in the string will be stripped from the both ends of the string this method is called on.
Example:
>>> ' Hello'.strip()
'Hello'
Try this:
import random
#chooses word from textfile for hangman
def choose_word():
words = []
with open("words.txt", "r") as file:
words = file.readlines()
#number of words in text file
num_words = sum(1 for line in open("words.txt"))
n = random.randint(1, num_words)
#chooses the selected word from the text file
chosen_word = (words[n-1].strip())
print(chosen_word)
#calculates the length of the word
len_word = len(chosen_word)
print(len_word)
choose_word()

You are reading a random line from a text file.
Probably you have spaces in some lines after the words in those lines.
For example, the word "snake" is written in the file as "snake ", so it has length of 7.
To solve it you can either:
A) Manually or by a script remove the spaces in the file
B) When you read a random line from the text, before you check the length of the word, write: chosen_word = chosen_word.replace(" ", "").
This will remove the spaces from your word.

You need to strip all spaces from each line. This removes the beginning and trailing spaces. Here is your corrected code.
import random
# chooses word from textfile for hangman
def choose_word():
words = []
with open("./words.txt", "r") as file:
words = file.readlines()
# number of words in text file
num_words = sum(1 for line in open("words.txt"))
n = random.randint(1, num_words)
# chooses the selected word from the text file
# Added strip() to remove spaces surrounding your words
chosen_word = (words[n-1]).strip()
print(chosen_word)
# calculates the length of the word
len_word = len(chosen_word)
print(len_word)
choose_word()

Im supposing that the .txt file contains one word per line and without commas.
Maybe try to change some things here:
First, notice that the readlines() method is returning a list with all the lines but that also includes the newline string "\n".
# This deletes the newline from each line
# strip() also matches new lines as Hampus Larsson suggested
words = [x.strip() for x in file.readlines()]
You can calculate the number of words from the length of the words list itself:
num_words = len(words)
You do not need parenthesis to get the random word
chosen_word = words[n]
It should now work correctly!

in the file everyword has an \n to symbolize a new line.
in order to cut that out you have to replace:
chosen_word = (words[n-1])
by
chosen_word = (words[n-1][:-1])
this will cut of the last two letters of the chosen word!

Detect text connected

I'm trying to detect how many times a word appears in a txt file but the word is connected with other letters.
Detecting Hello
Text: Hellooo, how are you?
Expected output: 1
Here is the code I have now:
total = 0
with open('text.txt') as f:
for line in f:
finded = line.find('Hello')
if finded != -1 and finded != 0:
total += 1
print total´
Do you know how can I fix this problem?

As suggested in the comment by #SruthiV, you can use re.findall from re module,
import re
pattern = re.compile(r"Hello")
total = 0
with open('text.txt', 'r') as fin:
for line in fin:
total += len(re.findall(pattern, line))
print total
re.compile creates a pattern for regex to use, here "Hello". Using re.compile improves programs performance and is (by some) recommended for repeated usage of the same pattern. More here.
Remaining part of the program opens the file, reads it line by line, and looks for occurrences of the pattern in every line using re.findall. Since re.findall returns a list of matches, total is updated with the length of that list, i.e. number of matches in a given line.
Note: this program will count all occurrences of Hello- as separate words or as part of other words. Also, it is case sensitive so hello will not be counted.

For every line, you can iterate through every word by splitting the line on spaces which makes the line into a list of words. Then, iterate through the words and check if the string is in the word:
total = 0
with open('text.txt') as f:
# Iterate through lines
for line in f:
# Iterate through words by splitting on spaces
for word in line.split(' '):
# Match string in word
if 'Hello' in word:
total += 1
print total

Python unicode search not giving correct answer

I am trying to search hindi words contained one line per file in file-1 and find them in lines in file-2. I have to print the line numbers with the number of words found.
This is the code:
import codecs
hypernyms = codecs.open("hindi_hypernym.txt", "r", "utf-8").readlines()
words = codecs.open("hypernyms_en2hi.txt", "r", "utf-8").readlines()
count_arr = []
for counter, line in enumerate(hypernyms):
count_arr.append(0)
for word in words:
if line.find(word) >=0:
count_arr[counter] +=1
for iterator, count in enumerate(count_arr):
if count>0:
print iterator, ' ', count
This is finding some words, but ignoring some others
The input files are:
File-1:
पौधा
वनस्पति
File-2:
वनस्पति, पेड़-पौधा
वस्तु-भाग, वस्तु-अंग, वस्तु_भाग, वस्तु_अंग
पादप_समूह, पेड़-पौधे, वनस्पति_समूह
पेड़-पौधा
This gives output:
0 1
3 1
Clearly, it is ignoring वनस्पति and searching for पौधा only. I have tried with other inputs as well. It only searches for one word. Any idea how to correct this?

I think the problem is here:
words = codecs.open("hypernyms_en2hi.txt", "r", "utf-8").readlines()
.readlines() will leave the line break at the end, so you're not searching for पौधा, you're searching for पौधा\n, and you'll only match at the end of a line. If I use .read().split() instead, I get
0 2
2 1
3 1

That because You don't remove the "\n" charactor at the end of lines.
So you don't search "some_pattern\n", not "some_pattern".
Use strip() function to chop them off like this:
import codecs
words = [word.strip() for word in codecs.open("hypernyms_en2hi.txt", "r", "utf-8")]
hypernyms = codecs.open("hindi_hypernym.txt", "r", "utf-8")
count_arr = []
for line in hypernyms:
count_arr.append(0)
for word in words:
count_arr[-1] += (word in line)
for count in enumerate(count_arr):
if count:
print iterator, ' ', count

Put this code and you will see why that happens,because of the spaces:
in file 1 the first word is पौधा[space]....
for i in hypernyms:
print "file1",i
for i in words:
print "file2",i
After count_arr = [] and before for counter, line...

Read words from .txt, and count for each words

I wonder, how to read character string like fscanf. I need to read for word, in the all .txt . I need a count for each words.
collectwords = collections.defaultdict(int)
with open('DatoSO.txt', 'r') as filetxt:
for line in filetxt:
v=""
for char in line:
if str(char) != " ":
v=v+str(char)
elif str(char) == " ":
collectwords[v] += 1
v=""
this way, I cant to read the last word.

You might also consider using collections.counter if you are using Python >=2.7
http://docs.python.org/library/collections.html#collections.Counter
It adds a number of methods like 'most_common', which might be useful in this type of application.
From Doug Hellmann's PyMOTW:
import collections
c = collections.Counter()
with open('/usr/share/dict/words', 'rt') as f:
for line in f:
c.update(line.rstrip().lower())
print 'Most common:'
for letter, count in c.most_common(3):
print '%s: %7d' % (letter, count)
http://www.doughellmann.com/PyMOTW/collections/counter.html -- although this does letter counts instead of word counts. In the c.update line, you would want to replace line.rstrip().lower with line.split() and perhaps some code to get rid of punctuation.
Edit: To remove punctuation here is probably the fastest solution:
import collections
import string
c = collections.Counter()
with open('DataSO.txt', 'rt') as f:
for line in f:
c.update(line.translate(string.maketrans("",""), string.punctuation).split())
(borrowed from the following question Best way to strip punctuation from a string in Python)

Uhm, like this?
with open('DatoSO.txt', 'r') as filetxt:
for line in filetxt:
for word in line.split():
collectwords[word] += 1

Python makes this easy:
collectwords = []
filetxt = open('DatoSO.txt', 'r')
for line in filetxt:
collectwords.extend(line.split())

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Find the number of characters in a file using Python - python

I found this solution very simply and readable: with open("filename", 'r') as file: text = file.read().strip().split() len_chars = sum(len(word) for word in text) print(len_chars)

It is probably counting new line characters. Subtract characters with (lines+1)

Here is the code: fp = open(fname, 'r+').read() chars = fp.decode('utf8') print len(chars) Check the output. I just tested it.

It's very simple: f = open('file.txt', 'rb') f.seek(0) # Move to the start of file print len(f.read())

num_lines = sum(1 for line in open('filename.txt')) num_words = sum(1 for word in open('filename.txt').read().split()) num_chars = sum(len(word) for word in open('filename.txt').read().split())

Related

How to get first word from text file removing \n - python

Python - Calculating length of string is inaccurate for certain strings only

Detect text connected

Python unicode search not giving correct answer

Read words from .txt, and count for each words

Categories

Resources