Say that I have a file of restaurant names and I need to search through that file for a particular string like "Italian". How would the code look if I searched the file for the string and printed out the number of restaurants whose names contain it?
f = open("/home/ubuntu/ipynb/NYU_Notes/2-Introduction_to_Python/data/restaurant-names.txt", "r")
content = f.read()
f.close()
lines = content.split("\n")
with open("/home/ubuntu/ipynb/NYU_Notes/2-Introduction_to_Python/data/restaurant-names.txt") as f:
    print("There are", len(f.readlines()), "restaurants in the dataset")
with open("/home/ubuntu/ipynb/NYU_Notes/2-Introduction_to_Python/data/restaurant-names.txt") as f:
    searchlines = f.readlines()

for i, line in enumerate(searchlines):
    if "GREEK" in line:
        for l in searchlines[i:i+3]:
            print(l, end='')
        print()
You could count all the words using a Counter dict and then do lookups for certain words:
from collections import Counter
from string import punctuation

f_name = "/home/ubuntu/ipynb/NYU_Notes/2-Introduction_to_Python/data/restaurant-names.txt"

with open(f_name) as f:
    # sum(1 for _ in f) -> counts lines
    print("There are", sum(1 for _ in f), "restaurants in the dataset")
    # reset file pointer back to the start
    f.seek(0)
    # get count of how many times each word appears, at most once per line
    cn = Counter(word.strip(punctuation).lower() for line in f for word in set(line.split()))
    print(cn["italian"])  # no KeyError if missing, will be 0
We use set(line.split()) so that if a word appears twice for a certain restaurant, we only count it once. That looks for exact matches; if you are also looking to match partials like foo in foobar, then it becomes more complex to create a dataset where you can efficiently look up multiple words.
If you really just want to count one word, all you need to do is sum how many lines contain the substring:
f_name = "/home/ubuntu/ipynb/NYU_Notes/2-Introduction_to_Python/data/restaurant-names.txt"

with open(f_name) as f:
    print("There are", sum(1 for _ in f), "restaurants in the dataset")
    f.seek(0)
    sub = "italian"
    count = sum(sub in line.lower() for line in f)
If you want exact matches, you would need the split logic again or to use a regex with word boundaries.
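For instance, a rough sketch of the word-boundary route, reusing the same file path (the case-insensitive flag is my assumption, not something the original answer requires):

import re

f_name = "/home/ubuntu/ipynb/NYU_Notes/2-Introduction_to_Python/data/restaurant-names.txt"
# \b marks a word boundary, so "italian" matches but "italianate" does not
pattern = re.compile(r"\bitalian\b", re.IGNORECASE)

with open(f_name) as f:
    count = sum(bool(pattern.search(line)) for line in f)
print(count)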
Read the file in as a single string.
Then use the count method of strings.
Code:
# Let the file be taken as a string in s1
print(s1.count("italian"))
Related
I need a program that counts the top 5 most common first words of the lines in a file, and that does not include lines where the first word is followed by "DM" or "RT".
I don't have any code so far because I'm completely lost.
f = open("C:/Users/Joe Simpleton/Desktop/talking.txt", "r")
?????
Read each line of your text in. For each line, split it into words using a regular expression; this will return a list of the words. If there are at least two words, test the second word to make sure it is not in your exclusion list. Then use a Counter() to keep track of all of the word counts. Store the lowercase of each word so that uppercase and lowercase versions of the same word are not counted separately:
from collections import Counter
import re

word_counts = Counter()

with open('talking.txt') as f_input:
    for line in f_input:
        words = re.findall(r'\w+', line)
        if (len(words) > 1 and words[1] not in ['DM', 'RT']) or len(words) == 1:
            word_counts.update(word.lower() for word in words)

print(word_counts.most_common(5))
The Counter() has a useful feature in being able to show the most common values.
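If you only want to tally the first word of each qualifying line (as the question literally asks), a small variation on the same idea might look like this; it still assumes the talking.txt file from the question:

from collections import Counter
import re

first_word_counts = Counter()

with open('talking.txt') as f_input:
    for line in f_input:
        words = re.findall(r'\w+', line)
        # skip empty lines and lines whose second word is DM or RT
        if words and (len(words) == 1 or words[1] not in ['DM', 'RT']):
            first_word_counts[words[0].lower()] += 1

print(first_word_counts.most_common(5))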
Not tested, but it should work roughly like this:
from collections import Counter

count = Counter()

with open("path") as f:
    for line in f:
        parts = line.split(" ")
        if parts[1] not in ["DM", "RT"]:
            count[parts[0]] += 1

print(count.most_common(5))
You should also add a check that ensures that parts has at least 2 elements.
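A sketch of what that check might look like, under the same assumptions as the snippet above (a hypothetical file at "path", words separated by single spaces):

from collections import Counter

count = Counter()

with open("path") as f:
    for line in f:
        parts = line.split(" ")
        # guard against blank or one-word lines before indexing parts[1]
        if len(parts) >= 2 and parts[1] not in ["DM", "RT"]:
            count[parts[0]] += 1

print(count.most_common(5))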
I am attempting to count the number of 'difficult words' in a file, which requires me to count the number of letters in each word. For now, I am only trying to get single words, one at a time, from a file. I've written the following:
file = open('infile.txt', 'r+')
fileinput = file.read()
for line in fileinput:
    for word in line.split():
        print(word)
Output:
t
h
e
o
r
i
g
i
n
.
.
.
It seems to be printing one character at a time instead of one word at a time. I'd really like to know more about what is actually happening here. Any suggestions?
Use splitlines():
fopen = open('infile.txt', 'r+')
fileinput = fopen.read()
for line in fileinput.splitlines():
    for word in line.split():
        print(word)
fopen.close()
Without splitlines():
You can also use a with statement to open the file. It closes the file automagically:
with open('infile.txt', 'r+') as fopen:
    for line in fopen:
        for word in line.split():
            print(word)
A file supports the iteration protocol, which for bigger files is much better than reading the whole content into memory in one go:
with open('infile.txt', 'r+') as f:
    for line in f:
        for word in line.split():
            print(word)
Assuming you are going to define a filter function, you could do something along the lines of:
def is_difficult(word):
    return len(word) > 5

with open('infile.txt', 'r+') as f:
    words = (w for line in f for w in line.split() if is_difficult(w))
    for w in words:
        print(w)
which, with an input file of
ciao come va
oggi meglio di domani
ieri peggio di oggi
produces
meglio
domani
peggio
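If the goal is just the number of difficult words rather than printing them, the same generator idea can be fed to sum; a sketch under the same assumed length threshold:

def is_difficult(word):
    return len(word) > 5  # assumed threshold, as above

with open('infile.txt', 'r+') as f:
    n_difficult = sum(1 for line in f for w in line.split() if is_difficult(w))

print(n_difficult)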
Your code is giving you single characters because you called .read(), which stores all the content as a single string, so when you do for line in fileinput you are iterating over that string char by char. There is no good reason to use read and splitlines, as you can simply iterate over the file object; if you did want a list of lines, you would call readlines.
If you want to group words by length, use a dict with the length of the word as the key. You will also want to remove punctuation from the words, which you can do with str.strip:
def words(n, fle):
    from collections import defaultdict
    from string import punctuation
    d = defaultdict(list)
    with open(fle) as f:
        for line in f:
            for word in line.split():
                word = word.strip(punctuation)
                _len = len(word)
                if _len >= n:
                    d[_len].append(word)
    return d
Your dict will contain all the words in the file grouped by length and all at least n characters long.
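A quick usage sketch (the file name and the minimum length of 6 are only illustrative):

difficult = words(6, 'infile.txt')

# how many words there are of each qualifying length
for length in sorted(difficult):
    print(length, len(difficult[length]))

# total number of words with at least 6 characters
print(sum(len(ws) for ws in difficult.values()))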
I have a dictionary like
list1 = {'ab': 10, 'ba': 20, 'def': 30}
Now my input file contains :
ab def
ba ab
I have coded:
filename = raw_input("enter file:")
f = open(filename, 'r')
ff = open(filename + '_value', 'w')
for word in f.read().split():
    s = 0
    if word in list1:
        ff.write(word + '\t' + list1[word] + '\n')
        s += int(list1[word])
    else:
        ff.write(word + '\n')
ff.write("\n" + "total:%d" % (s) + "\n")
Now I want my output file to contain:
ab 10
def 30
total: 40
ba 20
ab 10
total: 30
I am not able to loop it for each line. How should I do it? I tried a few variations using f.readlines() and f.read(), and tried looping once, then twice with them, but I cannot get it right.
Instead of giving the answer right away, let me give you the gist of what you are asking:
To read the whole file:
f = open('myfile','r')
data = f.read()
To loop through each line in the file:
for line in data.splitlines():
To loop through each word in the line:
for word in line.split():
Use it wisely to get what you want.
You need to make two loops, not only one:
filename = raw_input("enter file:")

with open(filename, 'r') as f, open(filename + '_value', 'w') as ff:
    # Read each line sequentially
    for line in f:
        # In each line, read each word
        total = 0
        for word in line.split():
            if word in list1:
                ff.write("%s\t%s\n" % (word, list1[word]))
                total += int(list1[word])
            else:
                ff.write(word + '\n')
        ff.write("\ntotal: %s\n" % total)
I have also cleaned up your code a little to make it more readable. Also see What is the python "with" statement designed for? if you want to understand the with block.
with open("in.txt","r") as f:
with open("out.txt","w") as f1:
for line in f:
words = line.split() # split into list of two words
f1.write("{} {}\n".format((words[0]),list1[words[0]])) # write first word plus value
f1.write("{} {}\n".format((words[1]),list1[words[1]])) # second word plus value
f1.write("Total: {}\n".format((int(list1[words[0]]) + int(list1[words[1]])))) # finally add first and second and get total
I wonder how to read a character string, like fscanf does. I need to read word by word through the whole .txt file, and I need a count for each word.
import collections

collectwords = collections.defaultdict(int)

with open('DatoSO.txt', 'r') as filetxt:
    for line in filetxt:
        v = ""
        for char in line:
            if str(char) != " ":
                v = v + str(char)
            elif str(char) == " ":
                collectwords[v] += 1
                v = ""
This way, I can't read the last word (each line ends with a newline rather than a space, so the final word is never flushed into the counter).
You might also consider using collections.Counter if you are using Python >= 2.7:
http://docs.python.org/library/collections.html#collections.Counter
It adds a number of methods like 'most_common', which might be useful in this type of application.
From Doug Hellmann's PyMOTW:
import collections

c = collections.Counter()
with open('/usr/share/dict/words', 'rt') as f:
    for line in f:
        c.update(line.rstrip().lower())

print 'Most common:'
for letter, count in c.most_common(3):
    print '%s: %7d' % (letter, count)
http://www.doughellmann.com/PyMOTW/collections/counter.html -- although this does letter counts instead of word counts. In the c.update line, you would want to replace line.rstrip().lower() with line.split() and perhaps add some code to get rid of punctuation.
Edit: To remove punctuation here is probably the fastest solution:
import collections
import string

c = collections.Counter()
with open('DataSO.txt', 'rt') as f:
    for line in f:
        c.update(line.translate(string.maketrans("", ""), string.punctuation).split())
(borrowed from the following question Best way to strip punctuation from a string in Python)
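Note that string.maketrans only exists in Python 2; in Python 3 the same punctuation-stripping idea would use str.maketrans and str.translate, roughly like this (my adaptation, not part of the linked answer):

import collections
import string

c = collections.Counter()
with open('DataSO.txt', 'rt') as f:
    for line in f:
        # str.maketrans('', '', chars) builds a table that deletes every punctuation character
        c.update(line.translate(str.maketrans('', '', string.punctuation)).split())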
Uhm, like this?
with open('DatoSO.txt', 'r') as filetxt:
    for line in filetxt:
        for word in line.split():
            collectwords[word] += 1
Python makes this easy:
collectwords = []
filetxt = open('DatoSO.txt', 'r')
for line in filetxt:
    collectwords.extend(line.split())
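To turn that flat list into per-word counts, one option (my addition, not part of the original answer) is to hand it to collections.Counter:

import collections

counts = collections.Counter(collectwords)
print(counts["pizza"])  # hypothetical word; returns 0 if it never appears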
I'm having a bit of a rough time laying out how I would count certain elements within a text file using Python. I'm a few months into Python and I'm familiar with the following functions:
raw_input
open
split
len
print
rsplit()
Here's my code so far:
fname = "feed.txt"
fname = open('feed.txt', 'r')
num_lines = 0
num_words = 0
num_chars = 0
for line in feed:
    lines = line.split('\n')
At this point I'm not sure what to do next. I feel the most logical way to approach it would be to first count the lines, count the words within each line, and then count the number of characters within each word. But one of the issues I ran into was trying to perform all of the necessary functions at once, without having to re-open the file to perform each function separately.
Try this:
fname = "feed.txt"
num_lines = 0
num_words = 0
num_chars = 0
with open(fname, 'r') as f:
for line in f:
words = line.split()
num_lines += 1
num_words += len(words)
num_chars += len(line)
Back to your code:
fname = "feed.txt"
fname = open('feed.txt', 'r')
what's the point of this? fname is a string first and then a file object. You don't really use the string defined in the first line and you should use one variable for one thing only: either a string or a file object.
for line in feed:
    lines = line.split('\n')
line is one line from the file. It does not make sense to split('\n') it.
Functions that might be helpful:
open("file").read() which reads the contents of the whole file at once
'string'.splitlines() which separates lines from each other (and discards the line-ending characters)
By using len() and those functions you could accomplish what you're doing.
fname = "feed.txt"
feed = open(fname, 'r')
num_lines = len(feed.splitlines())
num_words = 0
num_chars = 0
for line in lines:
num_words += len(line.split())
file__IO = input('\nEnter file name here to analyze with path:: ')

with open(file__IO, 'r') as f:
    data = f.read()
    line = data.splitlines()
    words = data.split()
    spaces = data.split(" ")
    charc = (len(data) - len(spaces))

    print('\n Line number ::', len(line), '\n Words number ::', len(words), '\n Spaces ::', len(spaces), '\n Characters ::', charc)
I tried this code & it works as expected.
One approach I like is this one, though it may only be good for small files:
import re

with open(fileName, 'r') as content_file:
    content = content_file.read()
    lineCount = len(re.split("\n", content))
    words = re.split("\W+", content.lower())
To count words, there are two ways. If you don't care about repetition, you can just do:
words_count = len(words)
If you want the count of each word, you can just do:
import collections
words_count = collections.Counter(words) #Count the occurrence of each word
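For instance, with a stand-in word list (in practice words would come from the file as above):

import collections

words = ["pizza", "pasta", "pizza", "grill", "pizza", "pasta"]  # stand-in data
words_count = collections.Counter(words)

print(words_count["pizza"])        # 3
print(words_count.most_common(2))  # [('pizza', 3), ('pasta', 2)]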