Counting lines, words, and characters within a text file using Python

I'm having a rough time working out how to count certain elements within a text file using Python. I'm a few months into Python and I'm familiar with the following functions:
raw_input
open
split
len
print
rsplit()
Here's my code so far:
fname = "feed.txt"
fname = open('feed.txt', 'r')
num_lines = 0
num_words = 0
num_chars = 0
for line in feed:
    lines = line.split('\n')
At this point I'm not sure what to do next. I feel the most logical approach would be to first count the lines, then count the words within each line, and then count the number of characters within each word. But one of the issues I ran into was trying to perform all of the necessary steps at once, without having to re-open the file to perform each one separately.

Try this:
fname = "feed.txt"
num_lines = 0
num_words = 0
num_chars = 0
with open(fname, 'r') as f:
    for line in f:
        words = line.split()
        num_lines += 1
        num_words += len(words)
        num_chars += len(line)
Back to your code:
fname = "feed.txt"
fname = open('feed.txt', 'r')
What's the point of this? fname is first a string and then a file object. You never use the string defined in the first line, and you should use one variable for one thing only: either a string or a file object.
for line in feed:
    lines = line.split('\n')
line is one line from the file. It does not make sense to split('\n') it.
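To see why, here is a small sketch that uses an in-memory io.StringIO object as a stand-in for the open file (the sample text is invented for illustration). Iterating yields one line at a time, each still ending in its newline, so splitting on '\n' only produces the line itself plus an empty string:

```python
import io

# Stand-in for an open text file; the contents are invented for this demo.
sample = io.StringIO("first line\nsecond line\n")

pieces = []
for line in sample:
    # Each `line` is already a single line, including its trailing "\n".
    pieces.append(line.split('\n'))

print(pieces)
```

Calling line.split() with no argument instead splits on any whitespace, which is what gives the words.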

Functions that might be helpful:
open("file").read() which reads the contents of the whole file at once
'string'.splitlines() which separates lines from each other (and strips the line endings; empty lines are kept as empty strings)
By using len() and those functions you could accomplish what you're doing.

fname = "feed.txt"
with open(fname, 'r') as feed:
    text = feed.read()  # read the whole file once
lines = text.splitlines()
num_lines = len(lines)
num_words = 0
num_chars = len(text)
for line in lines:
    num_words += len(line.split())

file__IO = input('\nEnter file name here to analyze with path:: ')
with open(file__IO, 'r') as f:
    data = f.read()
line = data.splitlines()
words = data.split()
spaces = data.count(" ")  # number of space characters
charc = len(data) - spaces  # characters excluding spaces
print('\n Line number ::', len(line), '\n Words number ::', len(words), '\n Spaces ::', spaces, '\n Characters ::', charc)
I tried this code & it works as expected.

One way I like is this one, though it may only be suitable for small files:
import re
with open(fileName, 'r') as content_file:
    content = content_file.read()
lineCount = len(content.splitlines())
words = [w for w in re.split(r"\W+", content.lower()) if w]  # drop empty strings
There are two ways to count words. If you don't care about repetition, you can just do
words_count = len(words)
If you want the count of each word, you can do
import collections
words_count = collections.Counter(words)  # Count the occurrences of each word
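As a rough illustration of the two kinds of count, here is a sketch that runs on a made-up string instead of a file. The sample sentence and the names total_words and word_counts are assumptions for the demo, not part of the answer above:

```python
import collections
import re

# Made-up sample text standing in for the file contents.
content = "The cat saw the dog. The dog ran."

# Split on runs of non-word characters; drop the empty string left at the end.
words = [w for w in re.split(r"\W+", content.lower()) if w]

total_words = len(words)                  # every occurrence counts
word_counts = collections.Counter(words)  # occurrences per distinct word

print(total_words)
print(word_counts.most_common(2))
```

Counter.most_common(n) returns the n most frequent words with their counts, which is often the more useful view of the per-word tally.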

Related

Python program to number rows

I have a file with data such as this:
>1_DL_2021.1123
>2_DL_2021.1206
>3_DL_2021.1202
>3_DL_2021.1214
>4_DL_2021.1214
>4_DL_2021.1214
>6_DL_2021.1214
>7_DL_2021.1214
>8_DL_2021.1214
Now, as you can see, the data is not numbered properly and needs to be renumbered.
What I'm aiming for is this:
>1_DL_2021.1123
>2_DL_2021.1206
>3_DL_2021.1202
>4_DL_2021.1214
>5_DL_2021.1214
>6_DL_2021.1214
>7_DL_2021.1214
>8_DL_2021.1214
>9_DL_2021.1214
Now, the file has a lot of other content between these lines that start with the > sign. I want only the lines starting with > to be affected.
Could someone please help me out with this?
Also, there are 563 such lines, so doing it manually is out of the question.
So, assuming the input data file is "input.txt", you can achieve what you want with this:
import re
with open("input.txt", "r") as f:
    a = f.readlines()
regex = re.compile(r"^>\d+_DL_2021\.\d+\n$")
counter = 1
for i, line in enumerate(a):
    if regex.match(line):
        tokens = line.split("_")
        tokens[0] = f">{counter}"
        a[i] = "_".join(tokens)
        counter += 1
with open("input.txt", "w") as f:
    f.writelines(a)
What it does: it searches for lines matching the regex ^>\d+_DL_2021\.\d+\n$, splits each match by _, rewrites the first (0th) element with the current counter value, and increments the counter by 1. After all lines are processed, it writes the updated strings back to "input.txt".
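The same idea can be sketched as a function that works on a list of lines, which makes it easy to try without touching a file. The function name renumber_headers and the looser pattern (any line starting with > followed by digits and an underscore, with or without a trailing newline) are assumptions of this sketch, not part of the answer above:

```python
import re

# Matches a header such as ">3_DL_2021.1214" and captures everything
# after the number, including a trailing newline if there is one.
HEADER = re.compile(r"^>\d+(_.*)$", re.DOTALL)

def renumber_headers(lines):
    """Return a copy of `lines` with '>' header lines numbered 1, 2, 3, ..."""
    result = []
    counter = 1
    for line in lines:
        match = HEADER.match(line)
        if match:
            result.append(f">{counter}{match.group(1)}")
            counter += 1
        else:
            result.append(line)  # leave non-header lines untouched
    return result

sample = [">1_DL_2021.1123\n", "ACGT\n", ">3_DL_2021.1202\n", ">3_DL_2021.1214\n"]
print(renumber_headers(sample))
```

Returning a new list rather than editing in place keeps the original lines available if the result needs checking before it is written back.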
sudden_appearance already provided a good answer.
In case you don't like regex too much you can use this code instead:
new_lines = []
with open('test_file.txt', 'r') as f:
    c = 1
    for line in f:
        if line[0] == '>':
            after_dash = line.split('_',1)[1]
            new_line = '>' + str(c) + '_' + after_dash
            c += 1
            new_lines.append(new_line)
        else:
            new_lines.append(line)
with open('test_file.txt', 'w') as f:
    f.writelines(new_lines)
Also you can have a look at this split tutorial for more information about how to use split.

How to add first 30 chars of a .txt file to a variable?

So I have this homework question:
Assign the first 30 characters of school_prompt.txt as a string to the variable beginning_chars.
In the previous problem I managed to count all the characters in the txt file but I don't know how to add the first 30 into a variable.
fname = "school_prompt.txt"
lines = 0
nwords = 0
beginning_chars = 0
with open(fname, 'r') as f:
    for line in f:
        if line >= 30:
            words = line.split()
            lines +=1
            nwords += len(words)
            beginning_chars += len(line)
It's as simple as this:
fname = "school_prompt.txt"
with open(fname, 'r') as f:
    beginning_chars = f.read(30)
The read method takes an optional size argument. For a file opened in text mode (the default, as here) it reads up to that many characters; for a file opened in binary mode ('rb') it reads up to that many bytes.
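The difference between text and binary mode can be sketched with in-memory objects: io.StringIO stands in for a file opened with 'r' and io.BytesIO for one opened with 'rb'. The sample text is invented and deliberately contains accented letters, which take two bytes each in UTF-8:

```python
import io

# Invented sample text; \u00c9 is "É" and \u00e9 is "é".
text = "\u00c9cole \u00e9l\u00e9mentaire de Montr\u00e9al, founded 1901"

# Text-mode stand-in: read(10) returns 10 characters.
first_chars = io.StringIO(text).read(10)

# Binary-mode stand-in: read(10) returns 10 bytes, which is fewer
# characters here because the accented letters take two bytes each.
first_bytes = io.BytesIO(text.encode("utf-8")).read(10)

print(len(first_chars), len(first_bytes))
print(len(first_bytes.decode("utf-8")))
```

For plain ASCII text the two counts coincide, which is why the distinction is easy to overlook.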

Searching through a file in Python

Say that I have a file of restaurant names and that I need to search through said file and find a particular string like "Italian". How would the code look if I searched the file for the string and print out the number of restaurants with the same string?
f = open("/home/ubuntu/ipynb/NYU_Notes/2-Introduction_to_Python/data/restaurant-names.txt", "r")
content = f.read()
f.close()
lines = content.split("\n")
with open("/home/ubuntu/ipynb/NYU_Notes/2-Introduction_to_Python/data/restaurant-names.txt") as f:
    print ("There are", len(f.readlines()), "restaurants in the dataset")
with open("/home/ubuntu/ipynb/NYU_Notes/2-Introduction_to_Python/data/restaurant-names.txt") as f:
    searchlines = f.readlines()
    for i, line in enumerate(searchlines):
        if "GREEK" in line:
            for l in searchlines[i:i+3]: print (l),
            print
You could count all the words using a Counter dict and then do lookups for certain words:
from collections import Counter
from string import punctuation
f_name = "/home/ubuntu/ipynb/NYU_Notes/2-Introduction_to_Python/data/restaurant-names.txt"
with open(f_name) as f:
    # sum(1 for _ in f) -> counts lines
    print ("There are", sum(1 for _ in f), "restaurants in the dataset")
    # reset file pointer back to the start
    f.seek(0)
    # get count of how many times each word appears, at most once per line
    cn = Counter(word.strip(punctuation).lower() for line in f for word in set(line.split()))
    print(cn["italian"]) # no keyError if missing, will be 0
We use set(line.split()) so that a word appearing twice for one restaurant is only counted once. That looks for exact matches; if you also want to match partials like foo in foobar, it becomes more complex to build a dataset in which you can efficiently look up multiple words.
If you really just want to count one word, all you need to do is use sum to count the lines in which the substring appears:
f_name = "/home/ubuntu/ipynb/NYU_Notes/2-Introduction_to_Python/data/restaurant-names.txt"
with open(f_name) as f:
    print ("There are", sum(1 for _ in f), "restaurants in the dataset")
    f.seek(0)
    sub = "italian"
    count = sum(sub in line.lower() for line in f)
If you want exact matches, you would need the split logic again or to use a regex with word boundaries.
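A minimal sketch of the word-boundary approach, run on a few hypothetical restaurant names rather than the real file. The names list and the variables exact and partial are made up for illustration:

```python
import re

# Hypothetical restaurant names standing in for the lines of the file.
names = [
    "Luigi's Italian Kitchen",
    "Italiano Express",
    "Best ITALIAN, Greek and more",
    "Athens Greek Grill",
]

# \b marks a word boundary, so "italian" only matches as a whole word.
pattern = re.compile(r"\bitalian\b", re.IGNORECASE)

exact = sum(1 for name in names if pattern.search(name))     # whole-word matches
partial = sum("italian" in name.lower() for name in names)   # substring matches

print(exact, partial)
```

The substring test also counts "Italiano", while the word-boundary pattern does not; punctuation next to the word, as in "ITALIAN,", does not prevent a match.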
Read the file into a string, then use the string's count method.
Code:
# Let the file contents be held as a string in s1
print(s1.count("italian"))

how to loop around every word in a line and then every line in a file?

I have a dictionary like
list1={'ab':10,'ba':20,'def':30}
Now my input file contains :
ab def
ba ab
I have coded:
filename=raw_input("enter file:")
f=open(filename,'r')
ff=open(filename+'_value','w')
for word in f.read().split():
    s=0
    if word in list1:
        ff.write(word+'\t'+list1[word]+'\n');
        s+=int(list1[word])
    else:
        ff.write(word+'\n')
ff.write("\n"+"total:%d"%(s)+"\n")
Now I want my output file to contain:
ab 10
def 30
total: 40
ba 20
ab 10
total: 30
I am not able to loop over each line. How should I do it? I tried a few variations using f.readlines() and f.read(), and tried looping once, then twice with them, but I cannot get it right.
Instead of giving the answer right away, let me give you the gist of what you need:
To read the whole file:
f = open('myfile','r')
data = f.read()
To loop through each line in the file (data is one string, so split it into lines first):
for line in data.splitlines():
To loop through each word in the line:
    for word in line.split():
Use it wisely to get what you want.
You need two loops, not just one:
filename = raw_input("enter file:")
with open(filename, 'r') as f, open(filename + '_value','w') as ff:
    # Read each line sequentially (iterating over a file object yields lines)
    for line in f:
        # In each line, read each word
        total = 0
        for word in line.split():
            if word in list1:
                ff.write("%s\t%s\n" % (word, list1[word]))
                total += int(list1[word])
            else:
                ff.write(word+'\n')
        ff.write("\ntotal: %s\n" % total)
I have also cleaned up your code a little to make it more readable. Also see What is the python "with" statement designed for? if you want to understand the with block.
with open("in.txt","r") as f:
    with open("out.txt","w") as f1:
        for line in f:
            words = line.split() # split into list of two words
            f1.write("{} {}\n".format((words[0]),list1[words[0]])) # write first word plus value
            f1.write("{} {}\n".format((words[1]),list1[words[1]])) # second word plus value
            f1.write("Total: {}\n".format((int(list1[words[0]]) + int(list1[words[1]])))) # finally add first and second and get total
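The two-word version above can be generalized to lines with any number of words. Here is a sketch that works on a list of lines rather than on files, so it can be tried directly. The function name annotate_lines and the dictionary name values (the question calls it list1) are assumptions of this sketch:

```python
values = {'ab': 10, 'ba': 20, 'def': 30}  # same contents as list1 in the question

def annotate_lines(lines, values):
    """Yield output rows: each word with its value, then a total per line."""
    for line in lines:
        total = 0
        for word in line.split():
            if word in values:
                yield "%s %d" % (word, values[word])
                total += values[word]
            else:
                yield word  # unknown words are echoed without a value
        yield "total: %d" % total

rows = list(annotate_lines(["ab def\n", "ba ab\n"], values))
print("\n".join(rows))
```

Because the total is reset at the start of each line and written after the inner loop finishes, each line of input gets its own sum.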

Count how many full stops a text file contains in Python

I would like to write code that will open and read a text file and tell me how many "." (full stops) it contains.
I have something like this, but I don't know what to do next:
f = open( "mustang.txt", "r" )
a = []
for line in f:
with open('mustang.txt') as f:
    s = sum(line.count(".") for line in f)
Assuming there is absolutely no danger of your file being so large it will cause your computer to run out of memory (for instance, in a production environment where users can select arbitrary files, you may not wish to use this method):
f = open("mustang.txt", "r")
count = f.read().count('.')
f.close()
print(count)
More properly:
with open("mustang.txt", "r") as f:
    count = f.read().count('.')
print(count)
I'd do it like so:
with open('mustang.txt', 'r') as handle:
    count = handle.read().count('.')
If your file isn't too big, just load it into memory as a string and count the dots.
with open('mustang.txt') as f:
    fullstops = 0
    for line in f:
        fullstops += line.count('.')
This will work:
with open('mustangused.txt') as inf:
    count = 0
    for line in inf:
        count += line.count('.')
print('found %d periods in file.' % count)
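You can even do it with a regular expression:
import re
with open('filename.txt','r') as f:
    # r'\.' matches each full stop; r'\.+' would count a run such as "..." only once
    c = re.findall(r'\.', f.read())
if c:
    print(len(c))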
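For files that may be too large to read in one go, as cautioned earlier, one option is to read fixed-size chunks and add up the counts. This is a minimal sketch under that assumption. The function name count_char and the chunk size are made up, and an in-memory io.StringIO object stands in for the open file:

```python
import io

def count_char(stream, char=".", chunk_size=64 * 1024):
    """Count occurrences of a single character, reading `stream` in chunks."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty string means end of file
            break
        total += chunk.count(char)
    return total

# In-memory stand-in for open('mustang.txt'); the text is invented.
sample = io.StringIO("Fast. Loud... Red.")
print(count_char(sample, chunk_size=4))
```

Chunking is safe here because a single character cannot straddle a chunk boundary in text mode; counting multi-character substrings this way would need extra care at the boundaries.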
