Trying to count words in a file using Python - python

I am attempting to count the number of 'difficult words' in a file, which requires me to count the number of letters in each word. For now, I am only trying to get single words, one at a time, from a file. I've written the following:
file = open('infile.txt', 'r+')
fileinput = file.read()
for line in fileinput:
for word in line.split():
print(word)
Output:
t
h
e
o
r
i
g
i
n
.
.
.
It seems to be printing one character at a time instead of one word at a time. I'd really like to know more about what is actually happening here. Any suggestions?

Use splitlines():
fopen = open('infile.txt', 'r+')
fileinput = fopen.read()
for line in fileinput.splitlines():
for word in line.split():
print(word)
fopen.close()
Without splitlines():
You can also use with statement to open the file. It closes the file automagically:
with open('infile.txt', 'r+') as fopen:
for line in fopen:
for word in line.split():
print(word)

A file supports the iteration protocol, which for bigger files is much better than reading the whole content in memory in one go
with open('infile.txt', 'r+') as f:
for line in f:
for word in line.split():
print(word)
Assuming you are going to define a filter function, you could do something along the line
def is_difficult(word):
return len(word)>5
with open('infile.txt', 'r+') as f:
words = (w for line in f for w in line.split() if is_difficult(w))
for w in words:
print(w)
which, with an input file of
ciao come va
oggi meglio di domani
ieri peggio di oggi
produces
meglio
domani
peggio

Your code is giving you single characters because you called .read() which store all the content as a single string so when you for line in fileinput you are iterating over the string char by char, there is no good reason to use read and splitlines you as can simple iterate over the file object, if you did want a list of lines you would call readlines.
If you want to group words by length use a dict using the length of the word as the key, you will want to also remove punctuation from the words which you can do with str.strip:
def words(n, fle):
from collections import defaultdict
d = defaultdict(list)
from string import punctuation
with open(fle) as f:
for line in f:
for word in line.split():
word = word.strip(punctuation)
_len = len(word)
if _len >= n:
d[_len].append(word)
return d
Your dict will contain all the words in the file grouped by length and all at least n characters long.

Related

Searching through a file in Python

Say that I have a file of restaurant names and that I need to search through said file and find a particular string like "Italian". How would the code look if I searched the file for the string and print out the number of restaurants with the same string?
f = open("/home/ubuntu/ipynb/NYU_Notes/2-Introduction_to_Python/data/restaurant-names.txt", "r")
content = f.read()
f.close()
lines = content.split("\n")
with open("/home/ubuntu/ipynb/NYU_Notes/2-Introduction_to_Python/data/restaurant-names.txt") as f:
print ("There are", len(f.readlines()), "restaurants in the dataset")
with open("/home/ubuntu/ipynb/NYU_Notes/2-Introduction_to_Python/data/restaurant-names.txt") as f:
searchlines = f.readlines()
for i, line in enumerate(searchlines):
if "GREEK" in line:
for l in searchlines[i:i+3]: print (l),
print
You could count all the words using a Counter dict and then do lookups for certain words:
from collections import Counter
from string import punctuation
f_name = "/home/ubuntu/ipynb/NYU_Notes/2-Introduction_to_Python/data/restaurant-names.txt"
with open(f_name) as f:
# sum(1 for _ in f) -> counts lines
print ("There are", sum(1 for _ in f), "restaurants in the dataset")
# reset file pointer back to the start
f.seek(0)
# get count of how many times each word appears, at most once per line
cn = Counter(word.strip(punctuation).lower() for line in f for word in set(line.split()))
print(cn["italian"]) # no keyError if missing, will be 0
we use set(line.split()) so if a word appeared twice for a certain restaurant, we would only count it once. That looks for exact matches, if you are also looking to match partials like foo in foobar then it is going to be more complex to create a dataset where you can efficiently lookup multiple words.
If you really just want to count one word all you need to do is use sum how many times the substring appears in a line:
f_name = "/home/ubuntu/ipynb/NYU_Notes/2-Introduction_to_Python/data/restaurant-names.txt"
with open(f_name) as f:
print ("There are", sum(1 for _ in f), "restaurants in the dataset")
f.seek(0)
sub = "italian"
count = sum(sub in line.lower() for line in f)
If you want exact matches, you would need the split logic again or to use a regex with word boundaries.
You input the file as a string.
Then use the count method of strings.
Code:
#Let the file be taken as a string in s1
print s1.count("italian")

How would I read only the first word of each line of a text file?

I wanted to know how I could read ONLY the FIRST WORD of each line in a text file. I tried various codes and tried altering codes but can only manage to read whole lines from a text file.
The code I used is as shown below:
QuizList = []
with open('Quizzes.txt','r') as f:
for line in f:
QuizList.append(line)
line = QuizList[0]
for word in line.split():
print(word)
This refers to an attempt to extract only the first word from the first line. In order to repeat the process for every line i would do the following:
QuizList = []
with open('Quizzes.txt','r') as f:
for line in f:
QuizList.append(line)
capacity = len(QuizList)
capacity = capacity-1
index = 0
while index!=capacity:
line = QuizList[index]
for word in line.split():
print(word)
index = index+1
You are using split at the wrong point, try:
for line in f:
QuizList.append(line.split(None, 1)[0]) # add only first word
Changed to a one-liner that's also more efficient with the strip as Jon Clements suggested in a comment.
with open('Quizzes.txt', 'r') as f:
wordlist = [line.split(None, 1)[0] for line in f]
This is pretty irrelevant to your question, but just so the line.split(None, 1) doesn't confuse you, it's a bit more efficient because it only splits the line 1 time.
From the str.split([sep[, maxsplit]]) docs
If sep is not specified or is None, a different splitting algorithm is
applied: runs of consecutive whitespace are regarded as a single
separator, and the result will contain no empty strings at the start
or end if the string has leading or trailing whitespace. Consequently,
splitting an empty string or a string consisting of just whitespace
with a None separator returns [].
' 1 2 3 '.split() returns ['1', '2', '3']
and
' 1 2 3 '.split(None, 1) returns ['1', '2 3 '].
with Open(filename,"r") as f:
wordlist = [r.split()[0] for r in f]
I'd go for the str.split and similar approaches, but for completness here's one that uses a combination of mmap and re if you needed to extract more complicated data:
import mmap, re
with open('quizzes.txt') as fin:
mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
wordlist = re.findall('^(\w+)', mf, flags=re.M)
You should read one character at a time:
import string
QuizList = []
with open('Quizzes.txt','r') as f:
for line in f:
for i, c in enumerate(line):
if c not in string.letters:
print line[:i]
break
l=[]
with open ('task-1.txt', 'rt') as myfile:
for x in myfile:
l.append(x)
for i in l:
print[i.split()[0] ]

Store words of file in dictionary

I want to store the words of a text file in a dictionary.
My code is
word=0
char=0
i=0
a=0
d={}
with open("m.txt","r") as f:
for line in f:
w=line.split()
d[i]=w[a]
i=i+1
a=a+1
word=word+len(w)
char=char+len(line)
print(word,char)
print(d)
my text file is
jdfjdnv dj g gjv,kjvbm
but the problem is that the dictionary is storing only the first word of the text file .how to store the rest of the words.please help
How many lines does your text file have? If it only has one line your loop executes only once, splits whole line into separate words, then saves one word in Python dict. If you want to save all words from this text file with one line you need to add another loop. Like this:
for word in line.split():
d[i] = word
i += 1
You only store the first word because you only have one line in the file, and your only for loop is over the lines.
Generally, if you are going to key the dictionary by index, you can just use the list you are already making:
w = []
char = 0
with open("m.txt", "r") as f:
for line in f:
char += len(line)
w.extend(line.split())
word = sum(map(len, w))

Read words from .txt, and count for each words

I wonder, how to read character string like fscanf. I need to read for word, in the all .txt . I need a count for each words.
collectwords = collections.defaultdict(int)
with open('DatoSO.txt', 'r') as filetxt:
for line in filetxt:
v=""
for char in line:
if str(char) != " ":
v=v+str(char)
elif str(char) == " ":
collectwords[v] += 1
v=""
this way, I cant to read the last word.
You might also consider using collections.counter if you are using Python >=2.7
http://docs.python.org/library/collections.html#collections.Counter
It adds a number of methods like 'most_common', which might be useful in this type of application.
From Doug Hellmann's PyMOTW:
import collections
c = collections.Counter()
with open('/usr/share/dict/words', 'rt') as f:
for line in f:
c.update(line.rstrip().lower())
print 'Most common:'
for letter, count in c.most_common(3):
print '%s: %7d' % (letter, count)
http://www.doughellmann.com/PyMOTW/collections/counter.html -- although this does letter counts instead of word counts. In the c.update line, you would want to replace line.rstrip().lower with line.split() and perhaps some code to get rid of punctuation.
Edit: To remove punctuation here is probably the fastest solution:
import collections
import string
c = collections.Counter()
with open('DataSO.txt', 'rt') as f:
for line in f:
c.update(line.translate(string.maketrans("",""), string.punctuation).split())
(borrowed from the following question Best way to strip punctuation from a string in Python)
Uhm, like this?
with open('DatoSO.txt', 'r') as filetxt:
for line in filetxt:
for word in line.split():
collectwords[word] += 1
Python makes this easy:
collectwords = []
filetxt = open('DatoSO.txt', 'r')
for line in filetxt:
collectwords.extend(line.split())

I have a text file of a paragraph of writing, and want to iterate through each word in Python

How would I do this? I want to iterate through each word and see if it fits certain parameters (for example is it longer than 4 letters..etc. not really important though).
The text file is literally a rambling of text with punctuation and white spaces, much like this posting.
Try split()ing the string.
f = open('your_file')
for line in f:
for word in line.split():
# do something
If you want it without punctuation:
f = open('your_file')
for line in f:
for word in line.split():
word = word.strip('.,?!')
# do something
You can simply content.split()
f = open(filename,"r");
lines = f.readlines();
for i in lines:
thisline = i.split(" ");
data=open("file").read().split()
for item in data:
if len(item)>4:
print "longer than 4: ",item

Categories

Resources