counting the word length in a file

counting the word length in a file - python

So my function should open a file and count the word length and give the output. For example,
many('sample.txt')
Words of length 1: 2
Words of length 2: 6
Words of length 3: 7
Words of length 4: 6
My sample.txt file contains:
This is a test file. How many words are of length one?
How many words are of length three? We should figure it out!
Can a function do this?
My code so far:
def many(fname):
    infile = open(fname,'r')
    text = infile.read()
    infile.close()
    L = text.split()
    L.sort
    for item in L:
        if item == 1:
            print('Words of length 1:', L.count(item))
Can anyone tell me what I'm doing wrong? When I call the function, nothing happens. It's clearly because of my code, but I don't know where to go from here. Any help would be nice, thanks.

You want to obtain a list of lengths (1, 2, 3, 4,... characters) and, for each length, the number of words of that length in the file.
So up to L = text.split() your approach was fine. Now have a look at dictionaries in Python: they let you store the mapping described above while you iterate over the list of words in the file. Just a hint...

Since this is homework, I'll post a short solution here, and leave it as exercise to figure out what it does and why it works :)
>>> from collections import Counter
>>> text = open("sample.txt").read()
>>> counts = Counter([len(word.strip('?!,.')) for word in text.split()])
>>> counts[3]
7

What do you expect here
if item == 1:
and here
L.count(item)
And what does actually happen? Use a debugger and have a look at the variable values or just print them to the screen.
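To see why nothing is printed, it can help to compare what the loop actually tests with what was probably intended. This is only a minimal sketch: the word list is made up and stands in for the result of text.split(), and the variable names are not from the original code.

```python
# Hypothetical sample data standing in for text.split().
words = ["This", "is", "a", "test"]

# Each item is a string, so comparing it to the integer 1 is always False.
matches_as_written = [item for item in words if item == 1]

# Comparing the *length* of each item to 1 is what the task asks for.
matches_by_length = [item for item in words if len(item) == 1]

print(matches_as_written)  # []
print(matches_by_length)   # ['a']
```

Since the condition is never true, the print call inside the if is never reached, which matches the "nothing happens" symptom.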

Maybe also this:
>>> s
'This is a test file. How many words are of length one? How many words are of length three? We should figure it out! Can a function do this?'
>>> {x:[len([c for c in w ]) for w in s.split()].count(x) for x in [len([c for c in w ]) for w in s.split()] }
{1: 2, 2: 6, 3: 5, 4: 6, 5: 4, 6: 5, 8: 1}

Let's analyze your problem step-by-step.
You need to:
1. Retrieve all the words from a file
2. Iterate over all the words
3. Increment the counter N every time you find a word of length N
4. Output the result
You have already done step 1:
def many(fname):
    infile = open(fname,'r')
    text = infile.read()
    infile.close()
    L = text.split()
Then you (try to) sort the words, but L.sort without parentheses never calls the method, and sorting the words alphabetically would not help with this task anyway.
Instead, let's define a Python dictionary to hold the count of words
lengths = dict()
@sukhbir correctly suggested in a comment to use the Counter class, and I encourage you to go and search for it, but I'll stick to traditional dictionaries in this example, as I find it important to get familiar with the basics of the language before exploring the library.
Let's go on with step 2:
    for word in L:
        length = len(word)
For each word in the list, we assign to the variable length the length of the current word. Let's check if the counter already has a slot for our length:
        if length not in lengths:
            lengths[length] = 0
If no word of length length was encountered, we allocate that slot and we set that to zero. We can finally execute step 3:
        lengths[length] += 1
Here we increment the counter for the current word length by one.
At the end of the function, you'll find that lengths will contain a map of word length -> number of words of that length. Let's verify that by printing its contents (step 4):
    for length, counter in lengths.items():
        print("Words of length %d: %d" % (length, counter))
If you copy and paste the code I wrote (respecting the indentation!!) you will get the answers you need.
I strongly suggest you go through the Python tutorial.
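For reference, the four steps above can be assembled into one piece of code. This is a sketch rather than the only way to do it: the helper name count_word_lengths is made up for illustration, the file is opened with a with block instead of an explicit close(), and the output is sorted by length so it reads in order.

```python
def count_word_lengths(text):
    """Return a dict mapping word length -> number of words of that length."""
    lengths = {}
    for word in text.split():
        length = len(word)
        if length not in lengths:
            lengths[length] = 0
        lengths[length] += 1
    return lengths


def many(fname):
    """Print the word-length counts for the file at fname, shortest first."""
    with open(fname, "r") as infile:
        text = infile.read()
    for length, counter in sorted(count_word_lengths(text).items()):
        print("Words of length %d: %d" % (length, counter))
```

Note that punctuation is still attached to the words here, so "file." counts as a word of length 5. Stripping punctuation before calling len(), as in the Counter answer, gives the numbers from the question.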

The regular expression library might also be helpful, if somewhat overkill. A simple word-matching regex might be something like:
import re
f = open("sample.txt")
text = f.read()
f.close()
words = re.findall(r"\w+", text)
words is then a list of... words :)
This however will not properly match words like 'isn't' and 'I'm', as \w only matches letters, digits and underscores. In the spirit of this being homework I guess I'll leave that for the interested reader, but the Python regular expression documentation is pretty good as a start.
Then my approach for counting these words by length would be something like:
occurrence = dict()
for word in words:
    try:
        occurrence[len(word)] = occurrence[len(word)] + 1
    except KeyError:
        occurrence[len(word)] = 1
print(occurrence.items())
Where a dictionary (occurrence) is used to store the word lengths and their occurrence in your text. The try: and except: keywords deal with the first time we try and store a particular length of word in the dictionary, where in this case the dictionary is not happy at being asked to retrieve something that it has no knowledge of, and the except: picks up the exception that is thrown as a result and stores the first occurrence of that length of word. The last line prints everything in your dictionary.
Hope this helps :)
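For the apostrophe case mentioned above, one possible pattern is sketched below. The pattern and the names WORD_RE, find_words and count_by_length are assumptions made for illustration, not part of the answer, and dict.get is used in place of try/except purely as an alternative spelling of the same counting idea.

```python
import re

# Assumed pattern: runs of word characters, optionally joined by apostrophes,
# so that "isn't" and "I'm" are kept as single words.
WORD_RE = re.compile(r"\w+(?:'\w+)*")


def find_words(text):
    """Return the list of words found in text."""
    return WORD_RE.findall(text)


def count_by_length(words):
    """Return a dict mapping word length -> number of occurrences."""
    occurrence = {}
    for word in words:
        # get() returns 0 the first time a length is seen.
        occurrence[len(word)] = occurrence.get(len(word), 0) + 1
    return occurrence


print(find_words("I'm sure it isn't hard."))
```

With this pattern the apostrophe counts toward the word length, so "isn't" has length 5. Whether that is the desired behaviour depends on the assignment.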

Related

I'm looking for a string in a file, seems to not be working

My function first calculates all possible anagrams of the given word. Then, for each of these anagrams, it checks whether it is a valid word by checking whether it equals any of the words in the wordlist.txt file. The file is a large list of words, one per line. So I decided to just read each line and check if each anagram is there. However, it comes up blank. Here is my code:
def perm1(lst):
    if len(lst) == 0:
        return []
    elif len(lst) == 1:
        return [lst]
    else:
        l = []
        for i in range(len(lst)):
            x = lst[i]
            xs = lst[:i] + lst[i+1:]
            for p in perm1(xs):
                l.append([x] + p)
        return l
def jumbo_solve(string):
    '''jumbo_solve(string) -> list
    returns list of valid words that are anagrams of string'''
    passer = list(string)
    allAnagrams = []
    validWords = []
    for x in perm1(passer):
        allAnagrams.append((''.join(x)))
    for x in allAnagrams:
        if x in open("C:\\Users\\Chris\\Python\\wordlist.txt"):
            validWords.append(x)
    return(validWords)
print(jumbo_solve("rarom"))
I have put in many print statements to debug, and the list "allAnagrams" is built correctly. For example, with the input "rarom", one valid anagram is the word "armor", which is contained in the wordlist.txt file. However, when I run it, it does not detect it for some reason. Thanks again, I'm still a little new to Python so all the help is appreciated!

You missed a tiny but important aspect of:
word in open("C:\\Users\\Chris\\Python\\wordlist.txt")
This will search the file line by line, as if open(...).readlines() was used, and attempt to match the entire line, with '\n' in the end. Really, anything that demands iterating over open(...) works like readlines().
You would need
x+'\n' in open("C:\\Users\\Chris\\Python\\wordlist.txt")
to make what you have work, assuming the file is a list of words on separate lines, but reopening and rescanning the file on every lookup is inefficient. Better to read it once:
wordlist = open("C:\\Users\\Chris\\Python\\wordlist.txt").read().split('\n')
This creates a list of words if the file is a '\n'-separated word list. Note that you can use readlines() instead of read().split('\n'), but this keeps the \n on every word, as in your code, and you would need to include it in your search as shown above. Now you can use the list as a global variable or as a function argument.
if x in wordlist: stuff
Note that Graphier raised an important suggestion in the comments. A set:
wordlist = set(open("C:\\Users\\Chris\\Python\\wordlist.txt").read().split('\n'))
is better suited for word lookup than a list, since a lookup is O(word length) rather than O(number of words).
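The newline behaviour can be checked without touching the real file. In the sketch below, io.StringIO stands in for the opened word list and the two-word contents are made up. A fresh StringIO is created for each test because an `in` check consumes the stream as it iterates.

```python
import io

# Hypothetical file contents: one word per line, as in wordlist.txt.
contents = "armor\nroman\n"

# Iterating over a file-like object yields lines with the trailing newline.
lines = list(io.StringIO(contents))

# `in` on a file-like object compares against whole lines.
found_bare = "armor" in io.StringIO(contents)
found_with_newline = "armor\n" in io.StringIO(contents)

# Building a set once drops the newlines and makes later lookups cheap.
wordlist = set(contents.splitlines())
found_in_set = "armor" in wordlist

print(lines)               # ['armor\n', 'roman\n']
print(found_bare)          # False
print(found_with_newline)  # True
print(found_in_set)        # True
```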

You have used the following code in the wrong way:
if x in open("C:\\Users\\Chris\\Python\\wordlist.txt"):
Instead, try the following code; it should solve your problem:
with open("words.txt", "r") as file:
    lines = file.read().splitlines()
    for line in lines:
        pass  # do something here

So, putting all advice together, your code could be as simple as:
from itertools import permutations
def get_valid_words(file_name):
    with open(file_name) as f:
        return set(line.strip() for line in f)
def jumbo_solve(s, valid_words=None):
    """jumbo_solve(s: str) -> list
    returns list of valid words that are anagrams of `s`"""
    if valid_words is None:
        valid_words = get_valid_words("C:\\Users\\Chris\\Python\\wordlist.txt")
    # permutations() yields tuples of characters, so join each one into a string;
    # the set removes duplicates caused by repeated letters.
    candidates = {"".join(p) for p in permutations(s)}
    return [word for word in candidates if word in valid_words]
if __name__ == "__main__":
    print(jumbo_solve("rarom"))

fast way to search for a set of words in a list of words python

I have a set of fixed words of size 20. I have a large file of 20,000 records, where each record contains a string and I want to find if any word from the fixed set is present in a string and if present the index of the word.
example
s1=set(["barely", "rarely", "hardly"]) # (actual size 20)
l2=["i hardly visit", "i do not visit", "i can barely talk"] # (actual size 20,000)
def get_token_index(token,indx):
    if token in s1:
        return indx
    else:
        return -1
def find_word(text):
    tokens=nltk.word_tokenize(text)
    indexlist=[]
    for i in range(0,len(tokens)):
        indexlist.append(i)
    word_indx=map(get_token_index,tokens,indexlist)
    for indx in word_indx:
        if indx !=-1:
            # Do Something with tokens[indx]
I want to know if there is a better/faster way to do it.

This suggestion only removes some glaring inefficiencies; it won't affect the overall complexity of your solution:
def find_word(text, s1=s1): # micro-optimization, make s1 local
    tokens = nltk.word_tokenize(text)
    for i, word in enumerate(tokens):
        if word in s1:
            # Do something with `word` and `i`
Essentially, you are slowing things down by using map when all you really need is a condition inside your loop body anyway... So basically, just get rid of get_token_index, it is over-engineered.
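A self-contained version of that loop might look like the sketch below. It makes two assumptions not in the answer: str.split() stands in for nltk.word_tokenize so the example runs without NLTK, and the function returns the matches instead of acting on them. The names find_words and hits are made up for illustration.

```python
s1 = {"barely", "rarely", "hardly"}
l2 = ["i hardly visit", "i do not visit", "i can barely talk"]


def find_words(text, targets=s1):
    """Return (token index, token) pairs for tokens that are in targets."""
    # str.split() is a stand-in for nltk.word_tokenize here (an assumption).
    return [(i, word) for i, word in enumerate(text.split()) if word in targets]


# Map each record number to the matches found in that record.
hits = {n: find_words(record) for n, record in enumerate(l2)}
print(hits)  # {0: [(1, 'hardly')], 1: [], 2: [(2, 'barely')]}
```

Each token costs one set lookup, so the work is proportional to the total number of tokens regardless of whether the fixed set holds 3 words or 20.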

You can use list comprehension with a double for loop:
s1=set(["barely","rarely", "hardly"])
l2 = ["i hardly visit", "i do not visit", "i can barely talk"]
locations = [c for c, b in enumerate(l2) for a in s1 if a in b]
In this example, the output would be:
[0, 2]
However, if you would like a way of accessing the indexes at which a certain word appears:
from collections import defaultdict
d = defaultdict(list)
for word in s1:
    for index, sentence in enumerate(l2):
        if word in sentence:
            d[word].append(index)

This should work:
strings = []
for string in l2:
    words = string.split(' ')
    for s in s1:
        if s in words:
            print "%s at index %d" % (s, words.index(s))

The easiest and slightly more efficient way would be to use a generator expression:
index_tuple = list(l2.index(i) for i in s1 if i in l2)
Note that this only finds words that appear as whole elements of l2. You can time it and check how well it fits your requirement.

How can I find and print the indexes of multiple elements in a list?

I'm trying to create a basic program to pick out the positions of words in a quote. So far, I've got the following code:
print("Your word appears in your quote at position(s)", string.index(word))
However, this only prints the first position where the word is indexed, which is fine if the quote only contains the word once, but if the word appears multiple times, it will still only print the first position and none of the others.
How can I make it so that the program will print every position in succession?
Note: very confusingly, string here stores a list. The program is supposed to find the positions of words stored within this list.

It seems that you're trying to find occurrences of a word inside a string: the re library has a function called finditer that is ideal for this purpose. We can use this along with a list comprehension to make a list of the indexes of a word:
>>> import re
>>> word = "foo"
>>> string = "Bar foo lorem foo ipsum"
>>> [x.start() for x in re.finditer(word, string)]
[4, 14]
This function will find matches even if the word is inside another, like this:
>>> [x.start() for x in re.finditer("foo", "Lorem ipsum foobar")]
[12]
If you don't want this, wrap your word in word-boundary anchors (and escape it, in case it contains regex metacharacters) like this:
[x.start() for x in re.finditer(r"\b" + re.escape(word) + r"\b", string)]

Probably not the fastest/best way, but it will work. I used in rather than == in case there are quotation marks or other unexpected punctuation attached to the word as well! Hope this helps!!
def getWord(string, word):
    index = 0
    data = []
    for i in string.split(' '):
        if word.lower() in i.lower():
            data.append(index)
        index += 1
    return data

Here is some code I quickly put together that should work:
string = "Hello my name is Amit and I'm answering your question".split(' ')
indices = [index for (index, word) in enumerate(string) if word == "QUERY"]
That should work, although it returns the index of the word in the list rather than the character position. You could add up the lengths of all the words before it (plus the separating spaces) to get the index of its first letter.
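The calculation described above, adding up the lengths of the preceding words, could be sketched as follows. The function name word_positions is made up, and the sketch assumes words are separated by single spaces, as in the split(' ') call.

```python
def word_positions(quote, target):
    """Return (word_indices, char_offsets) of target in a space-separated quote."""
    word_indices = []
    char_offsets = []
    offset = 0
    for index, word in enumerate(quote.split(" ")):
        if word == target:
            word_indices.append(index)
            char_offsets.append(offset)
        # Advance past this word and the single space that followed it.
        offset += len(word) + 1
    return word_indices, char_offsets


print(word_positions("Bar foo lorem foo ipsum", "foo"))  # ([1, 3], [4, 14])
```

The character offsets agree with the re.finditer result shown in the earlier answer for the same string.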

python check if word is in certain elements of a list

I was wondering if there was a better way to put:
if word==wordList[0] or word==wordList[2] or word==wordList[3] or word==wordList[4]

word in wordList
Or, if you only want to check the first 4 elements,
word in wordList[:4]

Very simple task, and so many ways to deal with it. Exciting! Here is what I think:
If you know for sure that wordList is small (else it might be too inefficient), then I recommend using this one:
b = word in (wordList[:1] + wordList[2:])
Otherwise I would probably go for this (still, it depends!):
b = word in (w for i, w in enumerate(wordList) if i != 1)
For example, if you want to ignore several indexes:
ignore = frozenset([5, 17])
b = word in (w for i, w in enumerate(wordList) if i not in ignore)
This is pythonic and it scales.
However, there are noteworthy alternatives:
### Constructing a tuple ad-hoc. Easy to read/understand, but doesn't scale.
# Note lack of index 1.
b = word in (wordList[0], wordList[2], wordList[3], wordList[4])
### Playing around with iterators. Scales, but rather hard to understand.
from itertools import chain, islice
b = word in chain(islice(wordList, None, 1), islice(wordList, 2, None))
### More efficient, if condition is to be evaluated many times in a loop.
from itertools import chain
words = frozenset(chain(wordList[:1], wordList[2:]))
b = word in words

Have indexList be a list of the indices you want to check (e.g. [0,2,3]) and have wordList be all the words you want to check. Then, the following expression will return the 0th, 2nd, and 3rd elements of wordList, as a list:
[wordList[i] for i in indexList]
This will return [wordList[0], wordList[2], wordList[3]].
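To turn that selection into the membership test from the question, one option is to combine it with any(). This is a sketch with made-up data. The helper name in_selected is hypothetical, and the index list [0, 2, 3, 4] mirrors the indexes used in the question.

```python
wordList = ["alpha", "beta", "gamma", "delta", "epsilon"]  # made-up data
indexList = [0, 2, 3, 4]  # the indexes compared in the question


def in_selected(word, words, indices):
    """Return True if word equals the element of words at any given index."""
    # any() stops at the first match, like the chained `or` in the question.
    return any(word == words[i] for i in indices)


print(in_selected("gamma", wordList, indexList))  # True
print(in_selected("beta", wordList, indexList))   # False (index 1 is skipped)
```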

python and palindromes

I recently wrote a method to cycle through /usr/share/dict/words and return a list of palindromes using my ispalindrome(x) method.
Here's some of the code... what's wrong with it? It just stalls for 10 minutes and then returns a list of all the words in the file.
def reverse(a):
    return a[::-1]
def ispalindrome(a):
    b = reverse(a)
    if b.lower() == a.lower():
        return True
    else:
        return False
wl = open('/usr/share/dict/words', 'r')
wordlist = wl.readlines()
wl.close()
for x in wordlist:
    if not ispalindrome(x):
        wordlist.remove(x)
print wordlist

wordlist = wl.readlines()
When you do this, each line keeps its trailing newline character, so your list looks like:
['eye\n','bye\n', 'cyc\n']
and none of these elements is a palindrome, because of the newline.
You need this:
['eye','bye', 'cyc']
So strip the newline character and it should be fine.
To do this in one line:
wordlist = [line.strip() for line in open('/usr/share/dict/words')]
EDIT: Iterating over a list while modifying it is also causing problems. Use a list comprehension, as pointed out by Matthew.

Others have already pointed out better solutions. I want to show you why the list is not empty after running your code. Since your ispalindrome() function will never return True because of the "newlines problem" mentioned in the other answers, your code will call wordlist.remove(x) for every single item. So why is the list not empty at the end?
Because you're modifying the list as you're iterating over it. Consider the following:
>>> l = [1,2,3,4,5,6]
>>> for i in l:
...     l.remove(i)
...
>>> l
[2, 4, 6]
When you remove the 1, the rest of the elements travels one step upwards, so now l[0] is 2. The iteration counter has advanced, though, and will look at l[1] in the next iteration and therefore remove 3 and so on.
So your code removes half of the entries. Moral: Never modify a list while you're iterating over it (unless you know exactly what you're doing :)).
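Two common ways around this are sketched below: build a new list with a comprehension, or iterate over a copy if the original list really must be changed in place. The sample lines are made up, and ispalindrome is written in the compact form used elsewhere in this thread.

```python
def ispalindrome(a):
    return a[::-1].lower() == a.lower()


wordlist = ["eye\n", "bye\n", "Level\n", "word\n"]  # made-up lines

# Build a new list instead of removing from the one being iterated.
palindromes = [w.strip() for w in wordlist if ispalindrome(w.strip())]

# If in-place removal is really wanted, iterate over a copy (numbers[:]).
numbers = [1, 2, 3, 4, 5, 6]
for n in numbers[:]:
    numbers.remove(n)

print(palindromes)  # ['eye', 'Level']
print(numbers)      # []
```

Because the loop walks the copy, every element is visited exactly once and the original list ends up empty, unlike the [2, 4, 6] result above.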

I think there are two problems.
Firstly, what is the point in reading all of the words into a list? Why not process each word in turn and print it if it's a palindrome?
Secondly, watch out for whitespace. You have newlines at the end of each of your words!
Since you're not identifying any palindromes (due to the whitespace), you're going to attempt to remove every item from the list. While you're iterating over it!
This solution runs in well under a second and identifies lots of palindromes:
for word in open('/usr/share/dict/words', 'r'):
    word = word.strip()
    if ispalindrome(word):
        print word
Edit:
Perhaps more 'pythonic' is to use generator expressions:
def ispalindrome(a):
    return a[::-1].lower() == a.lower()
words = (word.strip() for word in open('/usr/share/dict/words', 'r'))
palindromes = (word for word in words if ispalindrome(word))
print '\n'.join(palindromes)

It doesn't return all the words. It returns half. This is because you're modifying the list while iterating over it, which is a mistake. A simpler, and more effective solution, is to use a list comprehension. You can modify sukhbir's to do the whole thing:
[word for word in (word.strip() for word in wl.readlines()) if ispalindrome(word)]
You can also break this up:
stripped = (word.strip() for word in wl.readlines())
wordlist = [word for word in stripped if ispalindrome(word)]

You're including the newline at the end of each word in /usr/share/dict/words. That means you never find any palindromes. You'll speed things up if you just log the palindromes as you find them, instead of deleting non-palindromes from the list, too.
