Find if groups of characters are repeated in a string in python - python

I am a beginner and I want to know how to determine whether a string contains characters that recur in a pattern.
Example: "aabcdabcdabcdabcd"
Here four characters, 'abcd', are repeated.
But I do not know how many characters are repeated.
The pattern is not known in advance. "abcd" is just an example.
The pattern can be in any order
Please help.
My code is:
I don't actually know the string!
s1=str("aabcdabcdabcd")
x=0
z=""
for i in range (1,len(s1)):
    z=s1[i:i+5]
    s1.replace(z,"",1)
    if z in s1:
        x+=1
if x!=0:
    print "yes"
else:
    print "no"
The above program works only for the given string. I want it to be able to evaluate any string.

This will find all repeats of letters - then you can filter the sets you want.
cstr = 'aabcdabcdabcdabcd'
dd = {}
for ii, ch in enumerate(cstr):
    # find all sequences of 3-6 characters long
    for jj in range(3,7):
        wrd = cstr[ii:ii+jj]
        if not len(wrd) == jj:
            break
        dd.setdefault(wrd, 0)
        dd[wrd] += 1
# find any "word" that occurs more than once
for k, v in dd.iteritems():
    if v > 1:
        print k, v

I'm relatively new to Python myself, and one of the things I found most exciting initially was that it's very easy to start working with the characters which make up strings.
To answer your problem, I would start with:
for letter in string:
    # work through the string and check for repeated patterns
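One possible way to fill in that loop, offered only as a sketch and not as the answerer's method, is to swap the explicit loop for a regular expression with a backreference. It finds the first group of characters that is immediately followed by at least one copy of itself, so it only detects back-to-back repeats. The function name find_repeated_group and the min_len parameter are made up for this illustration.

```python
import re

def find_repeated_group(text, min_len=2):
    """Return the first group of at least min_len characters that is
    immediately repeated in text, or None if there is no such group."""
    # (.{N,}?) lazily captures N or more characters; \1+ requires the
    # captured text to appear again right after it, one or more times.
    pattern = re.compile(r'(.{%d,}?)\1+' % min_len)
    match = pattern.search(text)
    return match.group(1) if match else None

print(find_repeated_group("aabcdabcdabcdabcd"))
print(find_repeated_group("abcdef"))
```

The lazy quantifier makes the search prefer the shortest repeating unit at the earliest position, which is why the leading "a" in the example does not become part of the reported group.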

In Natural Language Processing these are called n-grams. For most common NLP tasks the nltk library is very useful:
from nltk.util import ngrams
from collections import Counter
s = 'aabcdabcdabcdabcd'
max_ngram = 5
minimum_count = 2
ngrams_found = Counter()
for x in range(max_ngram-1):
    ngrams_found += Counter(["".join(ngram) for ngram in ngrams(s, x+minimum_count)])
# iterate over a copy, because entries are deleted inside the loop
for key, val in list(ngrams_found.items()):
    if val < minimum_count:
        del ngrams_found[key]
    else:
        print(key, val)
The Counter object also allows you to print the x most common ngrams:
ngrams_found.most_common(5)

Related

Leetcode 792: number of matching subsequences, wrong answer

I'm experiencing an issue with Leetcode #792, what I understand from the description is that abc possible subsequences are a, b, c, ab, ac, bc, abc. Impossible subsequences would be f, gh, bb, cc, ca ... If this understanding is correct, a simple solution would be keeping count of all letters and where we are in a given string, and if counts[letter] = 0 or non-existent, a subsequence cannot be formed. Here's the implementation of what I just described, if I'm missing something, pointing it out would be greatly appreciated. Code works for the examples in the description but fails one of the test cases.
from collections import Counter
class Solution:
    def numMatchingSubseq(self, s: str, words: List[str]) -> int:
        total = 0
        counts = Counter(s)
        for word in words:
            seen = counts.copy()
            for char in word:
                if not seen.get(char, 0):
                    break
                seen[char] -= 1
            else:
                total += 1
        return total
Here's the description of what's required:
Given a string s and an array of strings words, return the number of words[i] that is a subsequence of s. A subsequence of a string is a new string generated from the original string with some characters (can be none) deleted without changing the relative order of the remaining characters. For example, "ace" is a subsequence of "abcde".
Example 1:
Input: s = "abcde", words = ["a","bb","acd","ace"]
Output: 3
Explanation: There are three strings in words that are a subsequence of s: "a", "acd", "ace".
Example 2:
Input: s = "dsahjpjauf", words = ["ahjpjau","ja","ahbwzgqnuk","tnmlanowax"]
Output: 2
The failed case:
s = "ricogwqznwxxcpueelcobbbkuvxxrvgyehsudccpsnuxpcqobtvwkuvsubiidjtccoqvuahijyefbpqhbejuisksutsowhufsygtwteiqyligsnbqglqblhpdzzeurtdohdcbjvzgjwylmmoiundjscnlhbrhookmioxqighkxfugpeekgtdofwzemelpyjsdeeppapjoliqlhbrbghqjezzaxuwyrbczodtrhsvnaxhcjiyiphbglyolnswlvtlbmkrsurrcsgdzutwgjofowhryrubnxkahocqjzwwagqidjhwbunvlchojtbvnzdzqpvrazfcxtvhkruvuturdicnucvndigovkzrqiyastqpmfmuouycodvsyjajekhvyjyrydhxkdhffyytldcdlxqbaszbuxsacqwqnhrewhagldzhryzdmmrwnxhaqfezeeabuacyswollycgiowuuudrgzmwnxaezuqlsfvchjfloczlwbefksxsbanrektvibbwxnokzkhndmdhweyeycamjeplecewpnpbshhidnzwopdjuwbecarkgapyjfgmanuavzrxricbgagblomyseyvoeurekqjyljosvbneofjzxtaizjypbcxnbfeibrfjwyjqrisuybfxpvqywqjdlyznmojdhbeomyjqptltpugzceyzenflfnhrptuugyfsghluythksqhmxlmggtcbdddeoincygycdpehteiugqbptyqbvokpwovbnplshnzafunqglnpjvwddvdlmjjyzmwwxzjckmaptilrbfpjxiarmwalhbdjiwbaknvcqovwcqiekzfskpbhgxpyomekqvzpqyirelpadooxjhsyxjkfqavbaoqqvvknqryhotjritrkvdveyapjfsfzenfpuazdrfdofhudqbfnzxnvpluwicurrtshyvevkriudayyysepzqfgqwhgobwyhxltligahroyshfndydvffd"
words = ["iowuuudrgzmw","azfcxtvhkruvuturdicnucvndigovkzrq","ylmmo","maptilrbfpjxiarmwalhbd","oqvuahijyefbpqhbejuisksutsowhufsygtwteiqyligsnbqgl","ytldcdlxqbaszbuxsacqwqnhrewhagldzhr","zeeab","cqie","pvrazfcxtvhkruvuturdicnucvndigovkzrqiya","zxnvpluwicurrtshyvevkriudayyysepzq","wyhxltligahroyshfn","nhrewhagldzhryzdmmrwn","yqbvokpwovbnplshnzafunqglnpjvwddvdlmjjyzmw","nhrptuugyfsghluythksqhmxlmggtcbdd","yligsnbqglqblhpdzzeurtdohdcbjvzgjwylmmoiundjsc","zdrfdofhudqbfnzxnvpluwicurrtshyvevkriudayyysepzq","ncygycdpehteiugqbptyqbvokpwovbnplshnzafun","gdzutwgjofowhryrubnxkahocqjzww","eppapjoliqlhbrbgh","qwhgobwyhxltligahroys","dzutwgjofowhryrubnxkah","rydhxkdhffyytldcdlxqbaszbuxs","tyqbvokpwovbnplshnzafunqglnpjvwddvdlmjjyzmwwxzjc","khvyjyrydhxkdhffyytldcdlxqbasz","jajekhvyjyrydhxkdhffyytldcdlxqbaszbuxsacqwqn","ppapjoliqlhbrbghq","zmwwxzjckmaptilrbfpjxiarm","nxkahocqjzwwagqidjhwbunvlchoj","ybfxpvqywqjdlyznmojdhbeomyjqptltp","udrgzmwnxae","nqglnpjvwddvdlmjjyzmww","swlvtlbmkrsurrcsgdzutwgjofowhryrubn","hudqbfnzxnvpluwicurr","xaezuqlsfvchjf","tvibbwxnokzkhndmdhweyeycamjeplec","olnswlvtlbmkrsurrcsgdzu","qiyastqpmfmuouycodvsyjajekhvyjyrydhxkdhffyyt","eiqyligsnbqglqblhpdzzeurtdohdcbjvzgjwyl","cgiowuuudrgzmwnxaezuqlsfvchjflocz","rxric","cygycdpehteiugqbptyqbvokpwovbnplshnzaf","g","surrcsgd","yzenflfnhrptuugyfsghluythksqh","gdzutwgjofowhryrubnxkahocqjzwwagqid","ddeoincygycdpeh","yznmojdhbeomyjqptltpugzceyzenflfnhrptuug","ejuisks","teiqyligsnbqglqblhpdzzeurtdohdcbjvzgjwylmmoi","mrwnxhaqfezeeabuacyswollycgio","qfskkpfakjretogrokmxemjjbvgmmqrfdxlkfvycwalbdeumav","wjgjhlrpvhqozvvkifhftnfqcfjmmzhtxsoqbeduqmnpvimagq","ibxhtobuolmllbasaxlanjgalgmbjuxmqpadllryaobcucdeqc","ydlddogzvzttizzzjohfsenatvbpngarutztgdqczkzoenbxzv","rmsakibpprdrttycxglfgtjlifznnnlkgjqseguijfctrcahbb","pqquuarnoybphojyoyizhuyjfgwdlzcmkdbdqzatgmabhnpuyh","akposmzwykwrenlcrqwrrvsfqxzohrramdajwzlseguupjfzvd","vyldyqpvmnoemzeyxslcoysqfpvvotenkmehqvopynllvwhxzr","ysyskgrbolixwmffygycvgewxqnxvjsfefpmxrtsqsvpowoctw","oqjgumi
tldivceezxgoiwjgozfqcnkergctffspdxdbnmvjago","bpfgqhlkvevfazcmpdqakonkudniuobhqzypqlyocjdngltywn","ttucplgotbiceepzfxdebvluioeeitzmesmoxliuwqsftfmvlg","xhkklcwblyjmdyhfscmeffmmerxdioseybombzxjatkkltrvzq","qkvvbrgbzzfhzizulssaxupyqwniqradvkjivedckjrinrlxgi","itjudnlqncbspswkbcwldkwujlshwsgziontsobirsvskmjbrq","nmfgxfeqgqefxqivxtdrxeelsucufkhivijmzgioxioosmdpwx","ihygxkykuczvyokuveuchermxceexajilpkcxjjnwmdbwnxccl","etvcfbmadfxlprevjjnojxwonnnwjnamgrfwohgyhievupsdqd","ngskodiaxeswtqvjaqyulpedaqcchcuktfjlzyvddfeblnczmh","vnmntdvhaxqltluzwwwwrbpqwahebgtmhivtkadczpzabgcjzx","yjqqdvoxxxjbrccoaqqspqlsnxcnderaewsaqpkigtiqoqopth","wdytqvztzbdzffllbxexxughdvetajclynypnzaokqizfxqrjl","yvvwkphuzosvvntckxkmvuflrubigexkivyzzaimkxvqitpixo","lkdgtxmbgsenzmrlccmsunaezbausnsszryztfhjtezssttmsr","idyybesughzyzfdiibylnkkdeatqjjqqjbertrcactapbcarzb","ujiajnirancrfdvrfardygbcnzkqsvujkhcegdfibtcuxzbpds","jjtkmalhmrknaasskjnixzwjgvusbozslrribgazdhaylaxobj","nizuzttgartfxiwcsqchizlxvvnebqdtkmghtcyzjmgyzszwgi","egtvislckyltpfogtvfbtxbsssuwvjcduxjnjuvnqyiykvmrxl","ozvzwalcvaobxbicbwjrububyxlmfcokdxcrkvuehbnokkzala","azhukctuheiwghkalboxfnuofwopsrutamthzyzlzkrlsefwcz","yhvjjzsxlescylsnvmcxzcrrzgfhbsdsvdfcykwifzjcjjbmmu","tspdebnuhrgnmhhuplbzvpkkhfpeilbwkkbgfjiuwrdmkftphk","jvnbeqzaxecwxspuxhrngmvnkvulmgobvsnqyxdplrnnwfhfqq","bcbkgwpfmmqwmzjgmflichzhrjdjxbcescfijfztpxpxvbzjch","bdrkibtxygyicjcfnzigghdekmgoybvfwshxqnjlctcdkiunob","koctqrqvfftflwsvssnokdotgtxalgegscyeotcrvyywmzescq","boigqjvosgxpsnklxdjaxtrhqlyvanuvnpldmoknmzugnubfoa","jjtxbxyazxldpnbxzgslgguvgyevyliywihuqottxuyowrwfar","zqsacrwcysmkfbpzxoaszgqqsvqglnblmxhxtjqmnectaxntvb","izcakfitdhgujdborjuhtwubqcoppsgkqtqoqyswjfldsbfcct","rroiqffqzenlerchkvmjsbmoybisjafcdzgeppyhojoggdlpzq","xwjqfobmmqomhczwufwlesolvmbtvpdxejzslxrvnijhvevxmc","ccrubahioyaxuwzloyhqyluwoknxnydbedenrccljoydfxwaxy","jjoeiuncnvixvhhynaxbkmlurwxcpukredieqlilgkupminjaj","pdbsbjnrqzrbmewmdkqqhcpzielskcazuliiatmvhcaksrusae","nizbnxpqbzsihakkadsbtgxovyuebgtzvrvbowxllkzevk
tkuu","hklskdbopqjwdrefpgoxaoxzevpdaiubejuaxxbrhzbamdznrr","uccnuegvmkqtagudujuildlwefbyoywypakjrhiibrxdmsspjl","awinuyoppufjxgqvcddleqdhbkmolxqyvsqprnwcoehpturicf"]
Your function does not take into account that although a character might be available, it only occurs before the characters you have already used, and so it actually is not available.
counts does not give any clue where a letter occurs, so counts would be the same whether s is "ab" or "ba", yet it is clear that this difference in s influences the sub-sequences that are possible.
For example, if s="ab" and one of the words is "ba" you'll get a false positive.
To solve this, you need to not only know the number of occurrences of a certain letter, but the indices of where these occurrences are.
There are several ways to do this. A trivial way is to just look for the next occurrence in s based on a given index (using index function and its second argument).
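A minimal sketch of that trivial approach, assuming plain Python strings: str.find takes a start position as its second argument (like index, but it returns -1 instead of raising), so each character of the word is searched for only after the position of the previous match. The helper names is_subsequence and count_matching are made up for this illustration.

```python
def is_subsequence(word, s):
    """Return True if word can be formed from s by deleting characters."""
    pos = 0  # search for the next character of word from here onwards
    for ch in word:
        found = s.find(ch, pos)
        if found == -1:
            return False  # ch does not occur after the previous match
        pos = found + 1
    return True

def count_matching(s, words):
    """Count how many of the given words are subsequences of s."""
    return sum(1 for word in words if is_subsequence(word, s))

print(count_matching("abcde", ["a", "bb", "acd", "ace"]))
```

Because the search position only ever moves forward, a word such as "ba" is correctly rejected for s = "ab".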
Another way, to improve a bit on efficiency, is to have a dictionary with 26 keys: one per lowercase letter of the Latin alphabet. The corresponding values are lists of indices. Then when a word is processed, you could keep track of a current index and use a binary search in the appropriate list of indices to find the next one, and so find an increasing sequence of indices. As soon as no index is available, the word fails as candidate.
Here is a spoiler implementation of that idea:
from bisect import bisect
from typing import List
class Solution:
    def numMatchingSubseq(self, s: str, words: List[str]) -> int:
        # for each letter, the sorted list of indices where it occurs in s
        indexes = {
            ch: []
            for ch in 'abcdefghijklmnopqrstuvwxyz'
        }
        for i, ch in enumerate(s):
            indexes[ch].append(i)
        count = 0
        for word in words:
            i = -1
            for ch in word:
                lookup = indexes[ch]
                # first occurrence of ch strictly after index i
                k = bisect(lookup, i)
                if k == len(lookup):
                    break
                i = lookup[k]
            else:
                count += 1
        return count

How to remove characters that appear more than once from a string?

So, I had a similar exercise on my IT classes: 'Print a string without characters appearing more than once (if they appear more than once, remove them)'. I thought that it was easy (and maybe it is), but I have completely no idea how to do that. I can do similar exercises (print all unique characters from a string / remove duplicates etc).
Example:
Input: '12345555555678'
Output: '1234678'
The basic algorithm for this is described in this answer: for each char, you check whether it appears more than once by counting its occurrences in the string.
However, that's fairly inefficient, since it takes O(n^2) time. You can improve on that at the expense of some memory (which is illustrated in this answer, but obscured by a library).
The algorithm would then be to go through the string once, counting the occurrences of each char and saving them somewhere, then go through the string again and print only the chars whose count is 1.
inp = '1345552225555678'
counts = {}
for ch in inp:
    if ch in counts:
        counts[ch] = counts[ch] + 1
    else:
        counts[ch] = 1
result = ''
for ch in inp:
    if counts[ch] == 1:
        result = result + ch
print result
Arguably, this would be O(n), since the access time for a dictionary is generally considered O(1) (see this question for a discussion).
Note: Usually this is done using an array whose size is the number of legal chars. Since strings in Python are Unicode, such an array would be huge, although its access time would be truly O(1).
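To illustrate the array variant from the note, here is a small sketch under the loud assumption that the input contains only ASCII characters, so 128 slots are enough; any other character would raise an IndexError. The function name remove_repeated_ascii is made up for this example.

```python
def remove_repeated_ascii(text):
    """Keep only the characters that occur exactly once in text.
    Assumption: text is ASCII-only (code points 0..127)."""
    counts = [0] * 128  # one counter per ASCII code point
    for ch in text:
        counts[ord(ch)] += 1
    # second pass: keep the characters whose total count is 1
    return ''.join(ch for ch in text if counts[ord(ch)] == 1)

print(remove_repeated_ascii('12345555555678'))
```

The list is indexed directly by code point, so there is no hashing involved, at the cost of allocating a slot for every possible character.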
You could use collections.Counter().
from collections import Counter
inp = '12345555555678'
c = Counter(inp)
output = ''.join(k for k, v in c.items() if v == 1) # -> 1234678
Simple implementation of Counter
c = {}
for char in inp:
c[char] = c.get(char, 0) + 1
This should do what you want:
input_str = 'ahuadvzudnioqdazvyduazdazdui'
for c in input_str:
    if input_str.count(c)==1:
        print(c)
It's easier to understand, but note that it performs quite poorly (O(n^2) complexity).
To make it a little faster you can use a list comprehension, although the complexity remains O(n^2).
input_str = '12345555555678'
[x for x in input_str if input_str.count(x) == 1]
If the order of the elements doesn't matter to you, iterating over a set of the characters is beneficial.
Converting the string to a set with set(input_str) leaves only unique values, which reduces the search space.
Then you can apply a list comprehension.
input_str = '12345555555678'
[x for x in set(input_str) if input_str.count(x) == 1]
Note: Do not forget the condition that order will not be preserved after converting to set.
i_str = '12345555555678'
b = sorted(i_str)
for i in range(len(b)-1):
    if b[i] == b[i+1]:
        i_str = i_str.replace(b[i],'')
You just sort the string and compare each element with the next one. If they are the same, the character is repeated and gets removed; otherwise it is unique.
Also, I am pretty sure it should be faster than using the count function, which iterates through the whole string for each unique element to check whether the character's count is greater than 1.
I solved a similar task on Codecademy. I was asked to define a function that removes all vowels, even repeated ones. My code, which also removes repeated symbols, is below:
def anti_vowel(text):
    all_vowels = ["A", "E", "U", "I", "O", "a", "e", "o", "u", "i"]
    listed_text = []
    for letter in text:
        listed_text.append(letter)
    for vowel in all_vowels:
        while vowel in listed_text:
            listed_text.remove(vowel)
    return "".join(listed_text)
print(anti_vowel("Hey look Words!"))
output:
Hy lk Wrds!

fast way to search for a set of words in a list of words python

I have a set of fixed words of size 20. I have a large file of 20,000 records, where each record contains a string and I want to find if any word from the fixed set is present in a string and if present the index of the word.
example
import nltk
s1=set(["barely","rarely","hardly"]) # (actual size 20)
l2=["i hardly visit", "i do not visit", "i can barely talk"] # (actual size 20,000)
def get_token_index(token,indx):
    if token in s1:
        return indx
    else:
        return -1
def find_word(text):
    tokens=nltk.word_tokenize(text)
    indexlist=[]
    for i in range(0,len(tokens)):
        indexlist.append(i)
    word_indx=map(get_token_index,tokens,indexlist)
    for indx in word_indx:
        if indx !=-1:
            pass # Do Something with tokens[indx]
I want to know if there is a better/faster way to do it.
This suggestion only removes some glaring inefficiencies; it won't affect the overall complexity of your solution:
def find_word(text, s1=s1): # micro-optimization, make s1 local
    tokens = nltk.word_tokenize(text)
    for i, word in enumerate(tokens):
        if word in s1:
            pass # Do something with `word` and `i`
Essentially, you are slowing things down by using map when all you really need is a condition inside your loop body anyway... So basically, just get rid of get_token_index, it is over-engineered.
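For illustration, here is a self-contained sketch of the same idea that runs without nltk. It assumes a plain whitespace split is an acceptable stand-in for nltk.word_tokenize, which means punctuation stays attached to the words. The name find_words and the shape of the returned tuples are made up for this example.

```python
def find_words(sentences, targets):
    """Return (sentence index, token index, word) for every token
    that is in targets. Assumption: tokens are split on whitespace."""
    hits = []
    for sent_idx, sentence in enumerate(sentences):
        for tok_idx, token in enumerate(sentence.split()):
            if token in targets:  # set membership test, O(1) on average
                hits.append((sent_idx, tok_idx, token))
    return hits

s1 = {"barely", "rarely", "hardly"}
l2 = ["i hardly visit", "i do not visit", "i can barely talk"]
print(find_words(l2, s1))
```

Each token is looked at once and tested against the set, so the total work grows with the number of tokens rather than with the number of tokens times the size of the word set.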
You can use list comprehension with a double for loop:
s1=set(["barely","rarely", "hardly"])
l2 = ["i hardly visit", "i do not visit", "i can barely talk"]
locations = [c for c, b in enumerate(l2) for a in s1 if a in b]
In this example, the output would be:
[0, 2]
However, if you would like a way of accessing the indexes at which a certain word appears:
from collections import defaultdict
d = defaultdict(list)
for word in s1:
    for index, sentence in enumerate(l2):
        if word in sentence:
            d[word].append(index)
This should work:
strings = []
for string in l2:
    words = string.split(' ')
    for s in s1:
        if s in words:
            print "%s at index %d" % (s, words.index(s))
The easiest and slightly more efficient way would be to use a generator expression:
index_tuple = list(l2.index(i) for i in s1 if i in l2)
Note that this only matches entries of l2 that are exactly equal to a word in s1. You can time it and check how efficiently it works for your requirement.

How to remove duplicate characters in a string and print according to the longest occurrence

I've been trying to solve this problem, but I am unable to.
x="abcaa" # sample input
x="bca" # sample output
I have tried this:
from collections import OrderedDict
def f(x):
    print ''.join(OrderedDict.fromkeys(x))
t=input()
for i in range(t):
    x=raw_input()
    f(x)
The above code is giving:
x="abcaa" # Sample input
x="abc" # sample output
More Details:
Sample Input:
abc
aaadcea
abcdaaae
Sample Output:
abc
adce
bcdae
In the first case, the string is "abcaa": the longest run of 'a' is at the end, so 'a' is placed last, giving "bca". In the other case, "aaadcea", the longest run of 'a' is at the start, so 'a' is placed first, giving "adce".
The OrderedDict isn't helping you at all, because the order you're preserving isn't the one you want.
If I understand your question (and I'm not at all sure I do…) the order you want is a sorted order, using the number of times the character appears as the sorting key, so the most frequent characters appear last.
So, this means you need to associate each character with a count in some way. You could do that with an explicit loop and d.setdefault(char, 0) and so on, but if you look in the collections docs, you'll see something named Counter right next to OrderedDict, which is a:
dict subclass for counting hashable objects
That's exactly what you want:
>>> import collections
>>> x = 'abcaa'
>>> c = collections.Counter(x)
>>> c
Counter({'a': 3, 'b': 1, 'c': 1})
And now you just need to sort with a key function:
>>> ''.join(sorted(c, key=c.__getitem__))
'bca'
If you want this to be a stable sort, so that elements with the same counts are shown in the order they first appear, or the order they first reach that count, then you will need OrderedDict. How do you get both OrderedDict behavior and Counter behavior? There's a recipe in the docs that shows how to do it. (And you actually don't even need that much; the __repr__ and __reduce__ are irrelevant for your use, so you can just inherit from Counter and OrderedDict and pass for the body.)
Taking a different guess at what you want:
For each character, you want to find the position at which it has the most repetitions.
That means that, as you go along, you need to keep track of two things for each character: the position at which it has the most repetitions so far, and how many. And you also need to keep track of the current run of characters.
In that case, the OrderedDict is necessary, it's just not sufficient. You need to add characters to the OrderedDict as you find them, and remove them and readd them when you find a longer run, and you also need to store a count in the value for each key rather than just use the OrderedDict as an OrderedSet. Like this:
d = collections.OrderedDict()
lastch, runlength = None, 0
for ch in x:
    if ch == lastch:
        runlength += 1
    else:
        # the run that just ended is the longest so far: move it to the end
        if runlength > d.get(lastch, 0):
            try:
                del d[lastch]
            except KeyError:
                pass
            d[lastch] = runlength
        lastch, runlength = ch, 1
# same check for the final run
if runlength > d.get(lastch, 0):
    try:
        del d[lastch]
    except KeyError:
        pass
    d[lastch] = runlength
x = ''.join(d)
You may notice that there's a bit of repetition here, and a lot of verbosity. You can simplify the problem quite a bit by breaking it into two steps: first compress the string into runs, then just keep track of the largest run for each character. Thanks to the magic of iterators, this doesn't even have to be done in two passes, the first step can be done lazily.
Also, because you're still using Python 2.7 and therefore don't have OrderedDict.move_to_end, we have to do that silly delete-then-add shuffle, but we can use pop to make that more concise.
So:
d = collections.OrderedDict()
for key, group in itertools.groupby(x):
    runlength = len(list(group))
    if runlength > d.get(key, 0):
        d.pop(key, None)
        d[key] = runlength
x = ''.join(d)
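On Python 3, where OrderedDict.move_to_end is available, the delete-then-add shuffle is not needed. The following is only a sketch of how the groupby version might look there; the function name longest_run_order is made up for this illustration.

```python
from collections import OrderedDict
from itertools import groupby

def longest_run_order(text):
    """Order the distinct characters of text by the position of their
    longest run (the first such run wins when lengths tie)."""
    runs = OrderedDict()
    for ch, group in groupby(text):
        length = sum(1 for _ in group)  # length of the current run
        if length > runs.get(ch, 0):
            runs[ch] = length
            runs.move_to_end(ch)  # this run is the new longest for ch
    return ''.join(runs)

for sample in ("abcaa", "aaadcea", "abcdaaae"):
    print(sample, "->", longest_run_order(sample))
```

The strict comparison keeps a character where it is when a later run merely ties its longest run so far.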
A different way to solve this would be to use a plain-old dict, and store the runlength and position for each character, then sort the results in position order. This means we no longer need to do the move-to-end shuffle, we're just updating the position as part of the value:
d = {}
for i, (key, group) in enumerate(itertools.groupby(x)):
    runlength = len(list(group))
    if runlength > d.get(key, (None, 0))[1]:
        d[key] = (i, runlength)
x = ''.join(sorted(d, key=d.__getitem__))
However, I'm not sure this improvement actually improves the readability, so I'd go with the second version above.
This is an inelegant, ugly, inefficient, and almost certainly non-Pythonic solution but I think it does what you're looking for.
t = raw_input('Write your string here: ')
# Create a dict to store the longest run seen for each character
seen = dict()
# Make sure we actually have a string
if len(t) < 1:
    print ""
else:
    prevChar = t[0]
    count = 0
    for char in t:
        if char == prevChar:
            count = count + 1
        else:
            # Check if the substring we just finished is the longest
            if count > seen.get(prevChar, 0):
                seen[prevChar] = count
            # Characters differ, restart
            count = 1
            prevChar = char
    # Record the last run, but only if it is the longest for its character
    if count > seen.get(prevChar, 0):
        seen[prevChar] = count
    # Now let's build the string, appending the character when we find the longest version
    count = 0
    prevChar = t[0]
    finalString = ""
    for char in t:
        if char in finalString:
            # Make sure we don't append a char twice, append the first time we find the longest subsequence
            continue
        if char == prevChar:
            count = count + 1
        else:
            # Check if the substring we just finished is the longest
            if count == seen.get(prevChar, 0):
                finalString = finalString + prevChar
            # Characters differ, restart
            count = 1
            prevChar = char
    # Check the last character
    if count == seen[prevChar]:
        finalString = finalString + prevChar
    print finalString

counting the word length in a file

So my function should open a file and count the word length and give the output. For example,
many('sample.txt')
Words of length 1: 2
Words of length 2: 6
Words of length 3: 7
Words of length 4: 6
My sample.txt file contains:
This is a test file. How many words are of length one?
How many words are of length three? We should figure it out!
Can a function do this?
My coding so far,
def many(fname):
    infile = open(fname,'r')
    text = infile.read()
    infile.close()
    L = text.split()
    L.sort
    for item in L:
        if item == 1:
            print('Words of length 1:', L.count(item))
Can anyone tell me what I'm doing wrong. I call the function nothing happens. It's clearly because of my coding but I don't know where to go from here. Any help would be nice, thanks.
You want to obtain a list of lengths (1, 2, 3, 4,... characters) and a number of occurrences of words with this length in the file.
So until L = text.split() it was a good approach. Now have a look at dictionaries in Python, that will allow you to store the data structure mentioned above and iterate over the list of words in the file. Just a hint...
Since this is homework, I'll post a short solution here, and leave it as exercise to figure out what it does and why it works :)
>>> from collections import Counter
>>> text = open("sample.txt").read()
>>> counts = Counter([len(word.strip('?!,.')) for word in text.split()])
>>> counts[3]
7
What do you expect here
if item == 1:
and here
L.count(item)
And what does actually happen? Use a debugger and have a look at the variable values or just print them to the screen.
Maybe also this:
>>> s
'This is a test file. How many words are of length one? How many words are of length three? We should figure it out! Can a function do this?'
>>> {x:[len([c for c in w ]) for w in s.split()].count(x) for x in [len([c for c in w ]) for w in s.split()] }
{1: 2, 2: 6, 3: 5, 4: 6, 5: 4, 6: 5, 8: 1}
Let's analyze your problem step-by-step.
You need to:
1. Retrieve all the words from a file
2. Iterate over all the words
3. Increment the counter N every time you find a word of length N
4. Output the result
You already did step 1:
def many(fname):
    infile = open(fname,'r')
    text = infile.read()
    infile.close()
    L = text.split()
Then you (try to) sort the words, but that is not useful for your task: it would only order them alphanumerically.
Instead, let's define a Python dictionary to hold the count of words
lengths = dict()
@sukhbir correctly suggested in a comment to use the Counter class, and I encourage you to go and look it up, but I'll stick to traditional dictionaries in this example, as I find it important to become familiar with the basics of the language before exploring the library.
Let's go on with step 2:
    for word in L:
        length = len(word)
For each word in the list, we assign to the variable length the length of the current word. Let's check if the counter already has a slot for our length:
        if length not in lengths:
            lengths[length] = 0
If no word of length length has been encountered yet, we allocate that slot and set it to zero. We can finally execute step 3:
        lengths[length] += 1
This increments by one the counter of words with the current length.
At the end of the function, lengths will contain a map of word length -> number of words of that length. Let's verify that by printing its contents (step 4):
    for length, counter in lengths.items():
        print "Words of length %d: %d" % (length, counter)
If you copy and paste the code I wrote (respecting the indentation!!) you will get the answers you need.
I strongly suggest you go through the Python tutorial.
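Put together, the fragments from this answer might look like the sketch below, written for Python 3. To keep it self-contained it takes the text directly instead of a file name; the function name word_length_counts is made up for this illustration. As in the fragments above, punctuation is not stripped, so "file." counts as a five-character word.

```python
def word_length_counts(text):
    """Map each word length to the number of words of that length."""
    lengths = {}
    for word in text.split():
        length = len(word)
        if length not in lengths:
            lengths[length] = 0  # first word of this length
        lengths[length] += 1
    return lengths

sample = "This is a test file. How many words are of length one?"
for length, counter in sorted(word_length_counts(sample).items()):
    print("Words of length %d: %d" % (length, counter))
```

Sorting the items before printing is only there to list the lengths in ascending order.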
The regular expression library might also be helpful, if somewhat overkill. A simple word-matching regular expression might be something like:
import re
f = open("sample.txt")
text = f.read()
f.close()
words = re.findall(r"\w+", text)
words is then a list of... words :)
This, however, will not properly match words like "isn't" and "I'm", because \w matches only alphanumerics and the underscore, so the apostrophe splits them. In the spirit of this being homework I guess I'll leave that for the interested reader, but the Python regular expression documentation is pretty good as a start.
Then my approach for counting these words by length would be something like:
occurrence = dict()
for word in words:
    try:
        occurrence[len(word)] = occurrence[len(word)] + 1
    except KeyError:
        occurrence[len(word)] = 1
print occurrence.items()
Where a dictionary (occurrence) is used to store the word lengths and their occurrence in your text. The try: and except: keywords deal with the first time we try and store a particular length of word in the dictionary, where in this case the dictionary is not happy at being asked to retrieve something that it has no knowledge of, and the except: picks up the exception that is thrown as a result and stores the first occurrence of that length of word. The last line prints everything in your dictionary.
Hope this helps :)
