Read words from a .txt file and count each word - Python

How can I read a file word by word, the way fscanf does in C? I need to go through the whole .txt file and count how many times each word occurs.
collectwords = collections.defaultdict(int)
with open('DatoSO.txt', 'r') as filetxt:
    for line in filetxt:
        v=""
        for char in line:
            if str(char) != " ":
                v=v+str(char)
            elif str(char) == " ":
                collectwords[v] += 1
                v=""
This way, I can't read the last word.

You might also consider using collections.Counter if you are using Python >= 2.7:
http://docs.python.org/library/collections.html#collections.Counter
It adds a number of methods like 'most_common', which might be useful in this type of application.
From Doug Hellmann's PyMOTW:
import collections
c = collections.Counter()
with open('/usr/share/dict/words', 'rt') as f:
    for line in f:
        c.update(line.rstrip().lower())
print 'Most common:'
for letter, count in c.most_common(3):
    print '%s: %7d' % (letter, count)
http://www.doughellmann.com/PyMOTW/collections/counter.html (note that this counts letters rather than words). In the c.update line, you would want to replace line.rstrip().lower() with line.split(), and perhaps add some code to get rid of punctuation.
Edit: To remove punctuation here is probably the fastest solution:
import collections
import string
c = collections.Counter()
with open('DataSO.txt', 'rt') as f:
    for line in f:
        c.update(line.translate(string.maketrans("",""), string.punctuation).split())
(borrowed from the following question Best way to strip punctuation from a string in Python)
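On Python 3 the two-argument translate call above no longer works, because str.translate accepts only a translation table. A minimal sketch of the same idea, assuming Python 3; the io.StringIO object and its sample text are made up and only stand in for the real file, and count_words is a hypothetical helper name:

```python
import collections
import io
import string

# Hypothetical sample text standing in for the contents of the file.
sample = io.StringIO("Hello, world! Hello again.\nThe world says: hello?\n")

# Translation table that deletes every ASCII punctuation character.
strip_punct = str.maketrans("", "", string.punctuation)

def count_words(lines):
    """Count lowercase, punctuation-free words across an iterable of lines."""
    counts = collections.Counter()
    for line in lines:
        counts.update(line.translate(strip_punct).lower().split())
    return counts

word_counts = count_words(sample)
print(word_counts.most_common(2))  # [('hello', 3), ('world', 2)]
```

Lowercasing is an extra choice made here so that "Hello" and "hello" are counted together; drop the .lower() call if case matters.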

Uhm, like this?
with open('DatoSO.txt', 'r') as filetxt:
    for line in filetxt:
        for word in line.split():
            collectwords[word] += 1

Python makes this easy:
collectwords = []
filetxt = open('DatoSO.txt', 'r')
for line in filetxt:
    collectwords.extend(line.split())

Related

Total number of first words in a txt file

I need a program that counts the top 5 most common first words of the lines in a file and which does not include lines where the first word is followed by a "DM" or an "RT"?
I don't have any code as of so far because I'm completely lost.
f = open("C:/Users/Joe Simpleton/Desktop/talking.txt", "r")
?????
Read each line of your text in. For each line, split it into words using a regular expression, this will return a list of the words. If there are at least two words, test the second word to make sure it is not in your list. Then use a Counter() to keep track of all of the word counts. Store the lowercase of each word so that uppercase and lowercase versions of the same word are not counted separately:
from collections import Counter
import re
word_counts = Counter()
with open('talking.txt') as f_input:
    for line in f_input:
        words = re.findall(r'\w+', line)
        if (len(words) > 1 and words[1] not in ['DM', 'RT']) or len(words) == 1:
            word_counts.update(word.lower() for word in words)
print(word_counts.most_common(5))
The Counter() has a useful feature in being able to show the most common values.
Not tested, but should work roughly like that:
from collections import Counter
count = Counter()
with open("path") as f:
    for line in f:
        parts = line.split(" ")
        if parts[1] not in ["DM", "RT"]:
            count[parts[0]] += 1
print(count.most_common(5))
You should also add a check that ensures that parts has at least 2 elements, otherwise parts[1] raises an IndexError on one-word lines.
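A minimal sketch of that check, counting only the first word of each line. It assumes Python 3; the io.StringIO sample stands in for talking.txt, and count_first_words is a hypothetical helper name. Lowercasing the first word is an assumption borrowed from the other answer:

```python
from collections import Counter
import io

# Hypothetical sample lines standing in for the file.
sample = io.StringIO(
    "alice RT some retweet\n"
    "bob hello there\n"
    "bob DM private note\n"
    "carol\n"
    "\n"
    "Bob again here\n"
)

def count_first_words(lines, skip=("DM", "RT")):
    """Count first words, ignoring lines whose second word is in `skip`."""
    counts = Counter()
    for line in lines:
        parts = line.split()  # no argument: splits on any whitespace, drops the newline
        if not parts:
            continue  # blank line, so there is no first word
        if len(parts) > 1 and parts[1] in skip:
            continue  # first word is followed by DM or RT
        counts[parts[0].lower()] += 1
    return counts

first_words = count_first_words(sample)
print(first_words.most_common(5))  # [('bob', 2), ('carol', 1)]
```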

Counting specific characters in a file (Python)

I'd like to count specific things from a file, i.e. how many times "--undefined--" appears. Here is a piece of the file's content:
"jo:ns 76.434
pRE 75.417
zi: 75.178
dEnt --undefined--
ba --undefined--
I tried to use something like this. But it won't work:
with open("v3.txt", 'r') as infile:
    data = infile.readlines().decode("UTF-8")
count = 0
for i in data:
    if i.endswith("--undefined--"):
        count += 1
print count
Do I have to implement, say, a dictionary of tuples to tackle this, or is there an easier solution?
EDIT:
The word in question appears only once in a line.
You can read all the data into one string, split the string into a list, and count the occurrences of the substring in that list:
with open('afile.txt', 'r') as myfile:
    data=myfile.read().replace('\n', ' ')
data.split(' ').count("--undefined--")
or directly from the string:
data.count("--undefined--")
readlines() returns the list of lines, but they are not stripped (ie. they contain the newline character).
Either strip them first:
data = [line.strip() for line in data]
or check for --undefined--\n:
if line.endswith("--undefined--\n"):
Alternatively, consider string's .count() method:
file_contents.count("--undefined--")
Or don't limit yourself to .endswith(), use the in operator.
data = ''
count = 0
with open('v3.txt', 'r') as infile:
    data = infile.readlines()
    print(data)
    for line in data:
        if '--undefined--' in line:
            count += 1
count
When reading a file line by line, each line ends with the newline character:
>>> with open("blookcore/models.py") as f:
...     lines = f.readlines()
...
>>> lines[0]
'# -*- coding: utf-8 -*-\n'
>>>
so your endswith() test just can't work - you have to strip the line first:
if i.strip().endswith("--undefined--"):
    count += 1
Now, reading a whole file into memory is more often than not a bad idea: even if the file fits in memory, it still eats resources for no good reason. Python's file objects are iterable, so you can just loop over your file. And finally, you can specify which encoding should be used when opening the file (instead of decoding manually), using the codecs module (Python 2) or directly (Python 3):
# py3
with open("your/file.text", encoding="utf-8") as f:
# py2:
import codecs
with codecs.open("your/file.text", encoding="utf-8") as f:
then just use the builtin sum and a generator expression:
result = sum(line.strip().endswith("whatever") for line in f)
this relies on the fact that booleans are integers with values 0 (False) and 1 (True).
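Put together, the approach might look like the sketch below. It assumes Python 3; the io.StringIO object holds a few lines copied from the question and stands in for the opened file, and count_lines_ending_with is a hypothetical helper name:

```python
import io

# Stand-in for the opened file; with a real file you would pass the
# object returned by open("v3.txt", encoding="utf-8") instead.
sample = io.StringIO(
    "jo:ns 76.434\n"
    "dEnt --undefined--\n"
    "ba --undefined--\n"
    "zi: 75.178"
)

def count_lines_ending_with(lines, suffix):
    """Count lines that end with `suffix`, ignoring trailing whitespace."""
    # Each comparison yields True or False, which sum() adds up as 1 or 0.
    return sum(line.rstrip().endswith(suffix) for line in lines)

undefined_count = count_lines_ending_with(sample, "--undefined--")
print(undefined_count)  # 2
```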
Quoting Raymond Hettinger, "There must be a better way":
from collections import Counter
counter = Counter()
words = ('--undefined--', 'otherword', 'onemore')
with open("v3.txt", 'r') as f:
    lines = f.readlines()
    for line in lines:
        for word in words:
            if word in line:
                counter.update((word,)) # note the single element tuple
print counter

Find the number of characters in a file using Python

Here is the question:
I have a file with these words:
hey how are you
I am fine and you
Yes I am fine
And it is asked to find the number of words, lines and characters.
Below is my program, but the number of counts for the characters without space is not correct.
The number of words is correct and the number of line is correct.
What is the mistake in the same loop?
fname = input("Enter the name of the file:")
infile = open(fname, 'r')
lines = 0
words = 0
characters = 0
for line in infile:
    wordslist = line.split()
    lines = lines + 1
    words = words + len(wordslist)
    characters = characters + len(line)
print(lines)
print(words)
print(characters)
The output is:
lines=3(Correct)
words=13(correct)
characters=47
I've looked at the many answers on the site and I am confused, because I haven't learned some of the other Python functions they use. How do I correct the code while keeping the loop as simple and basic as the one I've written?
The number of characters should be 35 without spaces and 45 with spaces.
If possible, I want to find the number of characters without spaces. A loop that counts the characters with spaces would be fine too.
Sum up the length of all words in a line:
characters += sum(len(word) for word in wordslist)
The whole program:
with open('my_words.txt') as infile:
    lines=0
    words=0
    characters=0
    for line in infile:
        wordslist=line.split()
        lines=lines+1
        words=words+len(wordslist)
        characters += sum(len(word) for word in wordslist)
print(lines)
print(words)
print(characters)
Output:
3
13
35
This:
(len(word) for word in wordslist)
is a generator expression. It is essentially a loop in one line that produces the length of each word. We feed these lengths directly to sum:
sum(len(word) for word in wordslist)
Improved version
This version takes advantage of enumerate, so you save two lines of code, while keeping the readability:
with open('my_words.txt') as infile:
    words = 0
    characters = 0
    for lineno, line in enumerate(infile, 1):
        wordslist = line.split()
        words += len(wordslist)
        characters += sum(len(word) for word in wordslist)
print(lineno)
print(words)
print(characters)
This line:
with open('my_words.txt') as infile:
opens the file with the promise to close it as soon as you leave the indented block.
It is always good practice to close a file after you are done using it.
Remember that each line (except possibly the last) ends with a line separator,
i.e. "\r\n" on Windows or "\n" on Linux and Mac (Python's text mode hands both to you as "\n").
With three lines, exactly two extra characters are counted in this case, which is why you get 47 and not 45.
A nice way to overcome this could be to use:
import os
fname=input("enter the name of the file:")
infile=open(fname, 'r')
lines=0
words=0
characters=0
for line in infile:
    line = line.strip(os.linesep)
    wordslist=line.split()
    lines=lines+1
    words=words+len(wordslist)
    characters=characters+ len(line)
print(lines)
print(words)
print(characters)
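The three numbers discussed in the question (47, 45 and 35) can be reproduced with a short sketch. It assumes Python 3 and uses io.StringIO with the question's three lines in place of the real file; the variable names are illustrative only:

```python
import io

# The question's three lines; the last one has no trailing newline.
sample_text = "hey how are you\nI am fine and you\nYes I am fine"

with_newlines = 0       # what len(line) adds up to
without_newlines = 0    # line separators removed, spaces kept
without_whitespace = 0  # only the characters inside words

for line in io.StringIO(sample_text):
    with_newlines += len(line)
    without_newlines += len(line.rstrip("\r\n"))
    without_whitespace += sum(len(word) for word in line.split())

print(with_newlines, without_newlines, without_whitespace)  # 47 45 35
```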
To count the characters, you should count each individual word. So you could have another loop that counts characters:
for word in wordslist:
    characters += len(word)
That ought to do it. The wordslist should probably take away newline characters on the right, something like wordslist = line.rstrip().split() perhaps.
I found this solution very simple and readable:
with open("filename", 'r') as file:
    text = file.read().strip().split()
    len_chars = sum(len(word) for word in text)
    print(len_chars)
This is too long for a comment.
Python 2 or 3? Because it really matters. Try out the following in your REPL for both:
Python 2.7.12
>>>len("taña")
5
Python 3.5.2
>>>len("taña")
4
Huh? The answer lies in Unicode. That ñ is one character (a single code point), but in UTF-8 it takes two bytes. Python 2's str counts bytes, whereas Python 3's str counts code points. So unless you're working with plain ASCII text, you'd better specify which version of Python your character-counting function is for.
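The difference can be checked directly in Python 3. This sketch assumes the ñ is stored as the single code point U+00F1; the last line shows what happens when the same letter is stored in decomposed form (n followed by a combining tilde), which is an extra case not covered above:

```python
import unicodedata

word = "ta\u00f1a"  # "taña" with a precomposed ñ (U+00F1)

code_points = len(word)                 # Python 3 str counts code points
utf8_bytes = len(word.encode("utf-8"))  # what Python 2's str would count

# The same text in decomposed form: "n" followed by U+0303 (combining tilde).
decomposed = unicodedata.normalize("NFD", word)
decomposed_points = len(decomposed)

print(code_points, utf8_bytes, decomposed_points)  # 4 5 5
```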
How's this? It uses a regular expression to match all non-whitespace characters and returns the number of matches within a string.
import re
DATA="""
hey how are you
I am fine and you
Yes I am fine
"""
def get_char_count(s):
    return len(re.findall(r'\S', s))
if __name__ == '__main__':
    print(get_char_count(DATA))
Output
35
The image below shows this tested on RegExr:
It is probably counting new line characters. Subtract characters with (lines+1)
Here is the code:
fp = open(fname, 'r+').read()
chars = fp.decode('utf8')
print len(chars)
Check the output. I just tested it.
A more Pythonic solution than the others:
with open('foo.txt') as f:
    text = f.read().splitlines() # list of lines
lines = len(text) # length of the list = number of lines
words = sum(len(line.split()) for line in text) # split each line on spaces, sum up the lengths of the lists of words
characters = sum(len(line) for line in text) # sum up the length of each line
print(lines)
print(words)
print(characters)
The other answers here are manually doing what str.splitlines() does. There's no reason to reinvent the wheel.
You do have the correct answer, and your code is completely correct. What I think is happening is that an end-of-line character is being counted on each line, which increases your character count by two (there isn't one on the last line, as there is no new line to go to). If you want to remove this, the simple fudge would be to do as Loaf suggested:
characters = characters - (lines - 1)
See csl's answer for the second part...
Simply skip unwanted characters while calling len,
import os
characters=characters+ len([c for c in line if c not in (os.linesep, ' ')])
or sum the count,
characters=characters+ sum(1 for c in line if c not in (os.linesep, ' '))
or build a str from wordslist and take len,
characters=characters+ len(''.join(wordslist))
or sum the characters in wordslist. I think this is the fastest.
characters=characters+ sum(1 for word in wordslist for char in word)
You have two problems. One is the line endings and the other is the spaces in between.
Now there are many people who posted pretty good answers, but I find this method easier to understand:
characters = characters + len(line.strip()) - line.strip().count(' ')
line.strip() removes the trailing and leading spaces. Then I'm subtracting the number of spaces from the total length.
It's very simple:
f = open('file.txt', 'rb')
f.seek(0) # Move to the start of file
print len(f.read())
Here is the smallest program I came up with for your problem:
with open('FileName.txt') as f:
    lines = f.readlines()
    data = ''.join(lines)
    print('lines =',len(lines))
    print('Words = ',len(data.split()))
    data = ''.join(data.split())
    print('characters = ',len(data))
lines is the list of lines, so its length is the number of lines. Next, data contains the file contents as a single string, so splitting data gives the list of words in the file, and the length of that list is the number of words. Joining that list of words again gives all the non-whitespace characters as a single string, whose length is the number of characters.
This takes the file name (e.g. files.txt) as input, counts the total number of characters in the file, and saves it to the variable char:
fname = input("Enter the name of the file:")
infile = open(fname, 'r') # open the file
lines = 0
words = 0
char = 0 # initialise as integer zero
for line in infile:
    wordslist = line.split() # split the line into words
    lines = lines + 1 # count the line
    words = words + len(wordslist) # add the number of words in the line
    char = char + len(line) # add the number of characters in the line (newline included)
print("lines are: " + str(lines))
print("words are: " + str(words))
print("chars are: " + str(char)) # print the results
num_lines = sum(1 for line in open('filename.txt'))
num_words = sum(1 for word in open('filename.txt').read().split())
num_chars = sum(len(word) for word in open('filename.txt').read().split())

Searching through a file in Python

Say that I have a file of restaurant names and that I need to search through said file and find a particular string like "Italian". How would the code look if I searched the file for the string and print out the number of restaurants with the same string?
f = open("/home/ubuntu/ipynb/NYU_Notes/2-Introduction_to_Python/data/restaurant-names.txt", "r")
content = f.read()
f.close()
lines = content.split("\n")
with open("/home/ubuntu/ipynb/NYU_Notes/2-Introduction_to_Python/data/restaurant-names.txt") as f:
    print ("There are", len(f.readlines()), "restaurants in the dataset")
with open("/home/ubuntu/ipynb/NYU_Notes/2-Introduction_to_Python/data/restaurant-names.txt") as f:
    searchlines = f.readlines()
for i, line in enumerate(searchlines):
    if "GREEK" in line:
        for l in searchlines[i:i+3]: print (l),
        print
You could count all the words using a Counter dict and then do lookups for certain words:
from collections import Counter
from string import punctuation
f_name = "/home/ubuntu/ipynb/NYU_Notes/2-Introduction_to_Python/data/restaurant-names.txt"
with open(f_name) as f:
    # sum(1 for _ in f) -> counts lines
    print ("There are", sum(1 for _ in f), "restaurants in the dataset")
    # reset file pointer back to the start
    f.seek(0)
    # get count of how many times each word appears, at most once per line
    cn = Counter(word.strip(punctuation).lower() for line in f for word in set(line.split()))
print(cn["italian"]) # no keyError if missing, will be 0
We use set(line.split()) so that if a word appears twice for a certain restaurant, we only count it once. That looks for exact matches; if you also want to match partials like foo in foobar, it is going to be more complex to create a dataset where you can efficiently look up multiple words.
If you really just want to count one word, all you need to do is use sum to count the lines in which the substring appears:
f_name = "/home/ubuntu/ipynb/NYU_Notes/2-Introduction_to_Python/data/restaurant-names.txt"
with open(f_name) as f:
    print ("There are", sum(1 for _ in f), "restaurants in the dataset")
    f.seek(0)
    sub = "italian"
    count = sum(sub in line.lower() for line in f)
If you want exact matches, you would need the split logic again or to use a regex with word boundaries.
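A minimal sketch of the regex variant, assuming Python 3. The restaurant names are made up, io.StringIO stands in for the real file, and count_lines_with_word is a hypothetical helper name:

```python
import io
import re

# Hypothetical restaurant names standing in for the file.
sample = io.StringIO(
    "Luigi's Italian Kitchen\n"
    "Italiano Express\n"
    "Best ITALIAN and Greek Grill\n"
    "Athens Diner\n"
)

def count_lines_with_word(lines, word):
    """Count lines that contain `word` as a whole word, ignoring case."""
    # \b matches only at a word boundary, so "italian" does not match "Italiano".
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return sum(1 for line in lines if pattern.search(line))

exact = count_lines_with_word(sample, "italian")
print(exact)  # 2
```

The plain substring test would have counted "Italiano Express" as well, giving 3.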
You input the file as a string.
Then use the count method of strings.
Code:
#Let the file be taken as a string in s1
print s1.count("italian")

Trying to count words in a file using Python

I am attempting to count the number of 'difficult words' in a file, which requires me to count the number of letters in each word. For now, I am only trying to get single words, one at a time, from a file. I've written the following:
file = open('infile.txt', 'r+')
fileinput = file.read()
for line in fileinput:
    for word in line.split():
        print(word)
Output:
t
h
e
o
r
i
g
i
n
.
.
.
It seems to be printing one character at a time instead of one word at a time. I'd really like to know more about what is actually happening here. Any suggestions?
Use splitlines():
fopen = open('infile.txt', 'r+')
fileinput = fopen.read()
for line in fileinput.splitlines():
    for word in line.split():
        print(word)
fopen.close()
Without splitlines():
You can also use a with statement to open the file. It closes the file automatically:
with open('infile.txt', 'r+') as fopen:
    for line in fopen:
        for word in line.split():
            print(word)
A file supports the iteration protocol, which for bigger files is much better than reading the whole content in memory in one go
with open('infile.txt', 'r+') as f:
    for line in f:
        for word in line.split():
            print(word)
Assuming you are going to define a filter function, you could do something along the line
def is_difficult(word):
    return len(word)>5
with open('infile.txt', 'r+') as f:
    words = (w for line in f for w in line.split() if is_difficult(w))
    for w in words:
        print(w)
which, with an input file of
ciao come va
oggi meglio di domani
ieri peggio di oggi
produces
meglio
domani
peggio
Your code is giving you single characters because you called .read(), which stores all the content as a single string, so when you write for line in fileinput you are iterating over the string character by character. There is no good reason to use read and splitlines, as you can simply iterate over the file object; if you did want a list of lines, you would call readlines.
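A quick way to see the difference, assuming Python 3 and an in-memory io.StringIO in place of infile.txt (the sample text is made up):

```python
import io

text = "the origin\nof species\n"

# Iterating over a string yields one character at a time,
# which is what happens after calling .read().
chars = [piece for piece in text]

# Iterating over a file-like object yields one line at a time.
lines = [line for line in io.StringIO(text)]

# Splitting each line then gives the individual words.
words = [word for line in io.StringIO(text) for word in line.split()]

print(chars[:3])  # ['t', 'h', 'e']
print(lines)      # ['the origin\n', 'of species\n']
print(words)      # ['the', 'origin', 'of', 'species']
```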
If you want to group words by length, use a dict with the length of the word as the key. You will also want to remove punctuation from the words, which you can do with str.strip:
def words(n, fle):
    from collections import defaultdict
    d = defaultdict(list)
    from string import punctuation
    with open(fle) as f:
        for line in f:
            for word in line.split():
                word = word.strip(punctuation)
                _len = len(word)
                if _len >= n:
                    d[_len].append(word)
    return d
Your dict will contain all the words in the file grouped by length and all at least n characters long.
