I have two Python snippets that count words and their frequencies.
import io
import collections
import codecs
from collections import Counter

with io.open('JNb.txt', 'r', encoding='utf8') as infh:
    words = infh.read().split()
with open('e1.txt', 'a') as f:
    for word, count in Counter(words).most_common(10):
        f.write(u'{} {}\n'.format(word, count).encode('utf8'))
import io
import collections
import codecs
from collections import Counter

with io.open('JNb.txt', 'r', encoding='utf8') as infh:
    for line in infh:
        words = line.split()
        with open('e1.txt', 'a') as f:
            for word, count in Counter(words).most_common(10):
                f.write(u'{} {}\n'.format(word, count).encode('utf8'))
Neither of these produces the expected output. The code contains no syntax errors.
Output
താത്കാലിക 1
- 1
ഒഴിവ് 1
അധ്യാപക 1
വാര്ത്തകള് 1
ആലപ്പുഴ 1
ഇന്നത്തെപരിപാടി 1
വിവാഹം 1
അമ്പലപ്പുഴ 1
The actual file contains 100 occurrences of these words.
I am not printing anything; I am writing everything to a file (e1.txt).
Update: I tried another version and got a result:
import io
import collections
import codecs
from collections import Counter

with io.open('JNb.txt', 'r', encoding='utf8') as infh:
    words = infh.read().split()
with open('file.txt', 'wb') as f:
    for word, count in Counter(words).most_common(10000000):
        f.write(u'{} {}\n'.format(word, count).encode('utf8'))
It can count words in files up to 2 GB on a machine with 4 GB of RAM.
What is the problem here?
I coded up the task and here is my solution.
I have tested the program with a 5.1 GB text file; it finished in ~20 minutes on a MBP6.2.
Let me know if anything is unclear or if you have suggestions. Best of luck.
from collections import Counter
import io
import sys

cnt = Counter()

if len(sys.argv) < 2:
    print("Provide an input file as argument")
    sys.exit()

try:
    with io.open(sys.argv[1], 'r', encoding='utf-8') as f:
        for line in f:
            for word in line.split():
                cnt[word] += 1
except FileNotFoundError:
    print("File not found")
    sys.exit()

with sys.stdout as f:
    total_word_count = sum(cnt.values())
    for word, count in cnt.most_common(30):
        f.write('{:<6} {:<7.2%} {}\n'.format(
            count, count / total_word_count, word))
Output:
~ python countword.py CSW07.txt
79619 4.58% [n]
63717 3.67% a
56783 3.27% of
42341 2.44% to
40156 2.31% the
39295 2.26% [v]
38231 2.20% [n
36592 2.11% -S]
35250 2.03% or
17113 0.98% in
You need to read each line, split it into words, and update a single counter; otherwise you are counting each line separately. Even if the file is very big, processing it line by line means you only ever hold one line plus the word counts in memory.
Try this version instead:
import collections
import io

c = collections.defaultdict(int)

with io.open('somefile.txt', encoding='utf-8') as f:
    for line in f:
        if line.strip():
            for word in line.split():
                c[word] += 1

with io.open('out.txt', 'w', encoding='utf-8') as f:
    for word, count in c.items():
        f.write(u'{} {}\n'.format(word, count))
You are counting words for each line.
Instead, read the whole file at once, split it into words, and make a single Counter call.
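A minimal sketch of that whole-file approach (JNb.txt and e1.txt are the question's filenames; the demo input written first is an assumption standing in for the real data):

```python
import io
from collections import Counter

# Demo input standing in for JNb.txt (assumption: whitespace-separated words)
with io.open('JNb.txt', 'w', encoding='utf8') as f:
    f.write(u'a b a c a b')

# Read the whole file at once, split on whitespace, count in one Counter call
with io.open('JNb.txt', 'r', encoding='utf8') as infh:
    cnt = Counter(infh.read().split())

# Write the most common words to the output file, one "word count" per line
with io.open('e1.txt', 'w', encoding='utf8') as f:
    for word, count in cnt.most_common(10):
        f.write(u'{} {}\n'.format(word, count))
```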
Edit: If you don't have enough memory to read the whole file, but enough to store all the distinct words:
from collections import Counter

def count(file):
    cnt = Counter()
    with open(file, 'r') as f:
        for line in f:  # iterate lazily so the whole file is never in memory
            for word in line.split():
                cnt[word] += 1
    return cnt
Now take the returned counter and write the data you want to a file.
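For example, a self-contained sketch of that last step (demo.txt and counts.txt are placeholder filenames, and the count() helper is restated here so the snippet runs on its own):

```python
from collections import Counter

def count(path):
    # Stream the file line by line so only the counter stays in memory
    cnt = Counter()
    with open(path, 'r') as f:
        for line in f:
            for word in line.split():
                cnt[word] += 1
    return cnt

# Demo input (assumption: any whitespace-separated text file works)
with open('demo.txt', 'w') as f:
    f.write('spam eggs spam\nspam ham\n')

# Count, then write one "word count" pair per line, most frequent first
cnt = count('demo.txt')
with open('counts.txt', 'w') as f:
    for word, n in cnt.most_common():
        f.write('{} {}\n'.format(word, n))
```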
Related
Imagine this situation: a file with 1000 lines, named file.txt.
file = 'file.txt'
word = 'error'

for line in open(file):
    if word in line:
        pass  # execute things
If I want the 8 lines BEFORE the line containing the word "error", how do I get them?
Read the file and save the lines in a deque of fixed size:
from collections import deque

file = "file.txt"
word = 'error'

lines = deque(maxlen=8)
with open(file) as f:
    for line in f:
        if word in line:
            break
        lines.append(line)

print(lines)
You can use a combination of collections.deque() with a fixed length and itertools.takewhile().
from collections import deque
from itertools import takewhile

with open("file.txt") as f:
    lines = deque(takewhile(lambda l: "error" not in l, f), maxlen=8)

print(*lines, sep="", end="")
So I have compiled a bunch of txt files, and I got Python to combine them and print all of them to the console, as seen here:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import glob
import sys

rally_files = glob.glob("C:/Users/epicr/Downloads/archive/*.txt")

for file in rally_files:
    with open(file, 'r', encoding="UTF-8") as f:
        lines = f.readlines()
        for line in lines:
            for word in line.split('\n'):
                print(word)
The output works; here it is:
word word word word word word word word word word more random words Joe Joe random random Joe
Now, I want to loop through every single word in the huge text file I just made and find a specific word. If that word exists, I want to increment a counter. Let's say if it detects the word 'Joe', the counter goes up. Here is my code below:
from __future__ import print_function
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import glob
import sys

rally_files = glob.glob("C:/Users/epicr/Downloads/archive/*.txt")
count = 0

for file in rally_files:
    with open(file, 'r', encoding="UTF-8") as f:
        lines = f.readlines()
        for line in lines:
            for word in line.split('\n'):
                if word in line.split('\n') == 'Joe':
                    count += 1

print(count)
It doesn't seem to be picking up anything. I know for a fact the word 'Joe' shows up about 300 times. Can anyone help me, please?
count = 0
for file in rally_files:
    with open(file, 'r', encoding="UTF-8") as f:
        lines = f.readlines()
        for line in lines:
            count += line.count("Joe")
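One caveat: str.count matches substrings, so a line containing "Joey" would also bump the counter. If only whole-word matches should count, a regex with word boundaries is a safer sketch (the sample line below is made up for illustration):

```python
import re

# \b anchors the match to word boundaries, so "Joey" no longer matches
word_re = re.compile(r'\bJoe\b')

text = 'Joe met Joey and Joe waved'      # stand-in for a line read from a file
substring_hits = text.count('Joe')        # also counts the "Joe" inside "Joey"
whole_word_hits = len(word_re.findall(text))  # counts standalone "Joe" only
```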
I need to open a text file and find the number of occurrences of the names given in another file. The program should write name; count pairs, separated by semicolons, into a file in .csv format.
It should look like:
Jane; 77
Hector; 34
Anna; 39
...
I tried to use Counter, but then the output looks like a list, so I think this is the wrong way to do the task.
import re
from collections import Counter

wanted = re.findall(r'\w+', open('iliadcounts.csv').read().lower())
cnt = Counter()
words = re.findall(r'\w+', open('pg6130.txt').read().lower())

for word in words:
    if word in wanted:
        cnt[word] += 1

print(cnt)
but this is definitely not the right code for this task...
You can feed the whole list of words to Counter at once; it will count them for you.
You can then print only the words in wanted by iterating over it:
import re
from collections import Counter

# create some demo data as I do not have your data at hand - uses your filenames
def create_demo_files():
    with open('iliadcounts.csv', 'w') as f:
        f.write("hug,crane,box")
    with open('pg6130.txt', 'w') as f:
        f.write("hug,shoe,blues,crane,crane,box,box,box,wood")

create_demo_files()

# work with your files
with open('iliadcounts.csv') as f:
    wanted = re.findall(r'\w+', f.read().lower())

with open('pg6130.txt') as f:
    cnt = Counter(re.findall(r'\w+', f.read().lower()))

# printed output for all words in wanted (all words are counted)
for word in wanted:
    print("{}; {}".format(word, cnt.get(word)))

# would work as well:
# https://docs.python.org/3/library/string.html#string-formatting
# print(f"{word}; {cnt.get(word)}")
Output:
hug; 1
crane; 2
box; 3
Or you can print the whole Counter:
print(cnt)
Output:
Counter({'box': 3, 'crane': 2, 'hug': 1, 'shoe': 1, 'blues': 1, 'wood': 1})
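Since the question asked for the "name; count" pairs to be written into a .csv file rather than printed, the writing step can be sketched like this (name_counts.csv is a placeholder filename, and the words are the same stand-in demo data):

```python
from collections import Counter

# Stand-ins for the question's data: the wanted names and the counted text
wanted = ['hug', 'crane', 'box']
cnt = Counter(['hug', 'shoe', 'blues', 'crane', 'crane',
               'box', 'box', 'box', 'wood'])

# Write one "name; count" pair per line; cnt.get(name, 0) handles
# names that never appear in the text
with open('name_counts.csv', 'w') as f:
    for name in wanted:
        f.write('{}; {}\n'.format(name, cnt.get(name, 0)))
```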
Links:
- https://pyformat.info/
- string formatting
- with open(...) as f:
I am trying to count the elements in a text file. I know I am missing something obvious, but I can't put my finger on it. This is what I currently have, which just produces the count of the letter "f", not of the file:
import collections

filename = open("output3.txt")
f = open("countoutput.txt", "w")

for line in filename:
    for number in line.split():
        print(collections.Counter("f"))
        break
import collections

counts = collections.Counter()  # create a new counter
with open(filename) as infile:  # open the file for reading
    for line in infile:
        for number in line.split():
            counts.update((number,))
            print("Now there are {} instances of {}".format(counts[number], number))

print(counts)
I wonder how to read strings the way fscanf does. I need to read word by word through the whole .txt file and get a count for each word.
import collections

collectwords = collections.defaultdict(int)

with open('DatoSO.txt', 'r') as filetxt:
    for line in filetxt:
        v = ""
        for char in line:
            if char != " ":
                v = v + char
            else:
                collectwords[v] += 1
                v = ""
This way, I can't read the last word.
You might also consider using collections.Counter if you are using Python >= 2.7:
http://docs.python.org/library/collections.html#collections.Counter
It adds a number of methods, like most_common, which can be useful in this type of application.
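A minimal sketch of what most_common returns (the word list is made up):

```python
from collections import Counter

cnt = Counter(['to', 'be', 'or', 'not', 'to', 'be'])

# most_common(n) returns the n highest counts as (word, count) pairs,
# in decreasing order of count
top = cnt.most_common(2)
```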
From Doug Hellmann's PyMOTW:
import collections

c = collections.Counter()
with open('/usr/share/dict/words', 'rt') as f:
    for line in f:
        c.update(line.rstrip().lower())

print 'Most common:'
for letter, count in c.most_common(3):
    print '%s: %7d' % (letter, count)
http://www.doughellmann.com/PyMOTW/collections/counter.html -- although this does letter counts instead of word counts. In the c.update line, you would want to replace line.rstrip().lower() with line.split(), plus perhaps some code to get rid of punctuation.
Edit: To remove punctuation here is probably the fastest solution:
import collections
import string

c = collections.Counter()
with open('DataSO.txt', 'rt') as f:
    for line in f:
        c.update(line.translate(string.maketrans("", ""), string.punctuation).split())
(borrowed from the following question Best way to strip punctuation from a string in Python)
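Note that string.maketrans with a separate deletechars argument is Python 2 only. Under Python 3 the equivalent is str.maketrans, whose third argument lists characters to delete; a sketch with a made-up line:

```python
import string

# Python 3: build a translation table that deletes all punctuation characters
table = str.maketrans('', '', string.punctuation)

line = "Hello, world! It's me."   # stand-in for a line read from a file
words = line.translate(table).split()
```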
Uhm, like this?
with open('DatoSO.txt', 'r') as filetxt:
    for line in filetxt:
        for word in line.split():
            collectwords[word] += 1
Python makes this easy:
collectwords = []
with open('DatoSO.txt', 'r') as filetxt:
    for line in filetxt:
        collectwords.extend(line.split())
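Once collectwords holds every word, passing the list to Counter gives the frequencies; a self-contained sketch with made-up lines standing in for the file:

```python
from collections import Counter

collectwords = []
for line in ['ham spam', 'spam spam']:  # stand-in for iterating over the file
    collectwords.extend(line.split())

# Counter accepts the whole list at once and tallies each distinct word
counts = Counter(collectwords)
```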