I am currently writing a program that counts all of the non-whitespace characters in a user-submitted string and then returns the most frequently used character. I cannot use the collections module, a Counter, or a dictionary. Here is what I want to do:
Split the string so that whitespace is removed, then count each character and return a value. Everything I have attempted so far has failed; the closest I came was this program:
strin = input('Enter a string: ')
fc = []
nfc = 0
for ch in strin:
    i = 0
    j = 0
    while i < len(strin):
        if ch.lower() == strin[i].lower():
            j += 1
        i += 1
    if j > nfc and ch != ' ':
        nfc = j
        fc = ch
print('The most frequent character in string is: ', fc)
If you can fix this code or suggest a better way of doing it that meets the required criteria, that would be helpful. And before you say this has been asked a hundred times on this forum, please note I created an account specifically to ask this question. Yes, there are a ton of questions like this, but some of them read from a text file or an existing string within the program, and an overwhelmingly large number of them use a dictionary, counter, or collection, which I cannot use yet in this chapter.
Just do it "the old way": create a list of 26 zeroes (okay, a list is a collection, but a very basic one, so it shouldn't be a problem) and increment the count at each letter's position. Compute the max index at the same time.
strin="lazy cat dog whatever"
l=[0]*26
maxindex=-1
maxvalue=0
for c in strin.lower():
pos = ord(c)-ord('a')
if 0<=pos<=25:
l[pos]+=1
if l[pos]>maxvalue:
maxindex=pos
maxvalue = l[pos]
print("max count {} for letter {}".format(maxvalue,chr(maxindex+ord('a'))))
result:
max count 3 for letter a
As an alternative to Jean's solution (which uses a list to get away with a single pass over the string), you could just use str.count here, which does pretty much what you're trying to do:
strin = input("Enter a string: ").strip()
maxcount = float('-inf')
maxchar = ''
for char in strin:
    c = strin.count(char) if not char.isspace() else 0
    if c > maxcount:
        maxcount = c
        maxchar = char
print("Char {}, Count {}".format(maxchar, maxcount))
If lists are available, I'd use Jean's solution. He doesn't call an O(N) function N times :-)
P.S.: you could compact this into one line if you use max:
max(((strin.count(i), i) for i in strin if not i.isspace()))
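Since the generator yields (count, character) tuples, max compares them lexicographically: the highest count wins, with ties broken by the character itself. A quick sketch of how you'd unpack the result:
strin = "hello world"
count, char = max((strin.count(i), i) for i in strin if not i.isspace())
print("Char {}, Count {}".format(char, count))  # Char l, Count 3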
To keep track of several counts for different characters, you have to use a collection (even if it is a global namespace implemented as a dictionary in Python).
To print the most frequent non-space character while supporting arbitrary Unicode strings:
import sys

text = input("Enter a string (case is ignored): ").casefold()  # default caseless matching
# count non-space character frequencies
counter = [0] * (sys.maxunicode + 1)
for nonspace in map(ord, ''.join(text.split())):
    counter[nonspace] += 1
# find the most common character
print(chr(max(range(len(counter)), key=counter.__getitem__)))
A similar list in Cython was the fastest way to find the frequency of each character.
I'm trying to solve a problem involving recurring characters. I'm a beginner in development, so I'm trying to think of ways I can do this.
thisWord = input()

def firstChar(thisWord):
    for i in range(len(thisWord)):
        for j in range(i+1, len(thisWord)):
            if thisWord[i] == thisWord[j]:
                return thisWord[i]

print(firstChar(thisWord))
This is what I came up with. In plenty of cases the result is fine. The problem I found after some fiddling around is that with a word like "statistics", the "t" is the first recurring letter, but because of the distance between the letters my code counts the "s" first and returns that as the result.
I've tried weird solutions like measuring the entire string first for each possible case, creating variables for string length, and then comparing it to another variable, but I'm just ending up with more errors than I can handle.
Thank you in advance.
So you want to find the first letter that recurs in your text, with "first" being determined by the recurrence, not the first occurrence of the letter? To illustrate that with your "statistics" example, the t is the first letter that recurs, but the s had its first occurrence before the first occurrence of the t. I understand that in such cases, it's the t you want, not the s.
If that's the case, then I think a set is what you want, since it allows you to keep track of letters you've already seen before:
thisword = "statistics"
set_of_letters = set()
for letter in thisword:
if letter not in set_of_letters:
set_of_letters.add(letter)
else:
firstchar = letter
break
print(firstchar)
Whenever you're looking at a certain character in the word, you should not check whether the character will occur again at all, but whether it has already occurred. The algorithmically optimal way would be to use a set to store and look up characters as you go, but it could just as well be done with your double loop. The second one should then become for j in range(i).
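For completeness, here is that minimal change applied to the original double loop: the inner loop now scans the characters before position i rather than after it.
def firstChar(thisWord):
    for i in range(len(thisWord)):
        for j in range(i):  # only look at positions before i
            if thisWord[i] == thisWord[j]:
                return thisWord[i]  # this character occurred earlier: first recurrence

print(firstChar("statistics"))  # t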
This is not an answer to your problem (one was already provided), but advice for a better solution:
def firstChar(thisWord):
    occurrences: dict[str, int] = {char: 0 for char in thisWord}  # at the start, every character's count is zero
    for char in thisWord:
        occurrences[char] += 1  # you found this char
        if occurrences[char] == 2:  # it was already found once before
            return char  # so you return it as the first duplicate
This works as expected:
>>> firstChar("statistics")
't'
EDIT:
occurrences: dict[str, int] = {char: 0 for char in thisWord}
This line of code creates a dictionary with the chars from thisWord as keys and 0 as values, so that you can use it to count the occurrences starting from 0 (before finding a char its count is 0).
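A quick interactive illustration of what that comprehension produces (duplicate characters simply collapse into a single key):
>>> {char: 0 for char in "statistics"}
{'s': 0, 't': 0, 'a': 0, 'i': 0, 'c': 0}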
I need to replace any repeated character with a reference back to its previous occurrence.
For example: a(-1)rdv(-4)(-4)k or hel(-1)o
This is my code so far:
text = 'aardvark'
i = 0
j = 0
for i in range(len(text)-1):
    for j in range(i+1, len(text)):
        if text[j] == text[i]:
            sub = text[j]
            val2 = text.find(sub, i+1, len(text))
            p = val2 + 1
            val = str(i-j)
            text = text[:val2] + val + text[p:]
            break
print(text)
Output: a-1rdva-4k
The second 'a' is not recognised. And I'm not sure how to include brackets in my print.
By updating the text in place each time you find a back-reference, you muck up your indices (the text gets longer each time) and you never process the last characters properly. You also stop checking when you find the first repeat of the 'current' character, so the 3rd a is never processed; this applies to every 3rd repeat in an input string. In addition, if your input text contains any - characters or digits, they'll end up being tested against the -offset references you inserted before them too!
For your specific example of aardvark, a string with 8 characters, what happens is this:
You find the second a and set text to a-1rdvark. The text is now 9 characters long, so the last r will never be checked (you loop to i = 6 at most); this would be a problem if your test string ended in a double letter. You break out of the loop, so the j for loop never comes to the 3rd a, and the second a can't be tested for anymore as it has already been replaced.
Your code then finds - (not repeated), 1 (not repeated) and then r (repeated once), so now you replace text with a-1rdva-4k. Now you have a string of 10 characters, so -, 4 and k will never be tested. Not a big problem anymore, but what if there was a repeat in just the last 3 positions of the string?
Build a new object for the output (adding both letters you haven't seen before and backreferences). That way you won't cause the text you are looping over to grow, and you will continue to find repeats; for the parentheses you could use more string concatenation. You'll need to scan the part of the string before i, not after, for this to work, and go backwards! Testing i - 1, i - 2, etc, down to 0. Naturally, this means your i loop should then range up to the full length:
output = ''
for i in range(len(text)):
    current = text[i]
    for j in range(i - 1, -1, -1):
        if text[j] == current:
            current = '(' + str(j - i) + ')'
            break
    output = output + current
print(output)
I kept the fix to a minimum here, but ideally I'd also make some more changes:
Add all processed characters and references to a new list instead of a string, then use str.join() to join that list into the output afterwards. This is far more efficient than rebuilding the string each iteration.
Using two loops means you check every character in the string again while looping over the text, so the number of steps the algorithm takes grows quadratically with the length of the input. In Computer Science we talk about the time complexity of algorithms, and yours is an O(N^2) (N squared) quadratic algorithm. A text with 1000 letters would take up to 1 million steps to process! Rather than loop a quadratic number of times, you can use a dictionary to track the indices of letters you have seen. If the current character is in the dictionary, you can trivially calculate the offset. Dictionary lookups take constant time (O(1)), making the whole algorithm take linear time (O(N)), meaning that the time the process takes is directly proportional to the length of the input string.
Use enumerate() to add a counter to the loop so you can just loop over the characters directly, no need to use range().
You can use string formatting to build the "(<offset>)" string. Python 3.6 and newer have formatted string literals, where f'...' strings take {} placeholders containing expressions; f'({some - calculation + or * other})' will evaluate the expression and put the result in a string that has the ( and ) characters in it too. For earlier Python versions, you can use the str.format() method (https://docs.python.org/3/library/stdtypes.html#str.format) to get the same result; the syntax then becomes '({})'.format(some - calculation + or * other).
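A minimal illustration of both forms, using a made-up offset value:
offset = -4
print(f'({offset})')           # Python 3.6+: prints (-4)
print('({})'.format(offset))   # older versions: prints (-4)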
Put together, that becomes:
def add_backrefs(text):
    output = []
    seen = {}
    for i, character in enumerate(text):
        if character in seen:
            # add a back-reference, we have seen this already
            output.append(f'({seen[character] - i})')
        else:
            # add the literal character instead
            output.append(character)
        # record the position of this character for later reference
        seen[character] = i
    return ''.join(output)
Demo:
>>> add_backrefs('aardvark')
'a(-1)rdv(-4)(-4)k'
>>> add_backrefs('hello')
'hel(-1)o'
text = 'aardvark'
d = {}  # dictionary tracking the index each character was last seen at
new_text = ''  # new text to be generated
for i in range(len(text)):  # iterate over text from index 0 up to its length
    c = text[i]  # store the character in a temporary variable, as it is used frequently
    if c not in d:  # check whether this character has been visited before
        d[c] = i  # first visit: just record its index in the dictionary
        new_text += c  # concatenate the character to the result text
    else:  # visiting an already-visited character
        new_text += '({0})'.format(d[c] - i)  # string formatting inserts the offset to the last occurrence in place of {0}
        d[c] = i  # update the last-seen index
print(new_text)
Output:
a(-1)rdv(-4)(-4)k
I'm doing an artistic project where I want to see if any information emerges from a long string of characters (~28,000). It's sort of like the problem one faces in solving a Jumble. Here's a snippet:
jfifddcceaqaqbrcbdrstcaqaqbrcrisaxohvaefqiygjqotdimwczyiuzajrizbysuyuiathrevwdjxbinwajfgvlxvdpdckszkcyrlliqxsdpunnvmedjjjqrczrrmaaaipuzekpyqflmmymedvovsudctceccgexwndlgwaqregpqqfhgoesrsridfgnlhdwdbbwfmrrsmplmvhtmhdygmhgrjflfcdlolxdjzerqxubwepueywcamgtoifajiimqvychktrtsbabydqnmhcmjhddynrqkoaxeobzbltsuenewvjbstcooziubjpbldrslhmneirqlnpzdsxhyqvfxjcezoumpevmuwxeufdrrwhsmfirkwxfadceflmcmuccqerchkcwvvcbsxyxdownifaqrabyawevahiuxnvfbskivjbtylwjvzrnuxairpunskavvohwfblurcbpbrhapnoahhcqqwtqvmrxaxbpbnxgjmqiprsemraacqhhgjrwnwgcwcrghwvxmqxcqfpcdsrgfmwqvqntizmnvizeklvnngzhcoqgubqtsllvppnedpgtvyqcaicrajbmliasiayqeitcqtexcrtzacpxnbydkbnjpuofyfwuznkf
What's the most efficient way of searching for all possible English words embedded (both forwards and backwards) in this string?
What is a useful dictionary against which to check the substrings? Is there a good library for doing this sort of thing? I have searched around and found some interesting trie solutions, but most of them deal with the situation where you know the set of words in advance.
I used this solution to find all words, forwards and backwards, in a corpus of 28,000 random characters against a dictionary of 100,000 words in 0.5 seconds. It runs in O(n) time. It expects a file called "words.txt", a dictionary with words separated by some kind of whitespace. I used the default Unix word list in /usr/share/dict/words, but you can find plenty of text-file dictionaries online if you don't have that one.
from random import choice
import string

dictionary = set(open('words.txt', 'r').read().lower().split())
max_len = max(map(len, dictionary))  # longest word in the set of words
text = ''.join([choice(string.ascii_lowercase) for i in xrange(28000)])
text += '-' + text[::-1]  # append the reverse of the text to itself

words_found = set()  # set of words found, starts empty
for i in xrange(len(text)):  # for each possible starting position in the corpus
    chunk = text[i:i+max_len+1]  # chunk that is the size of the longest word
    for j in xrange(1, len(chunk)+1):  # loop to check each possible subchunk
        word = chunk[:j]  # subchunk
        if word in dictionary:  # constant-time hash lookup if it's in the dictionary
            words_found.add(word)  # add to set of words
print words_found
Here is a bisection/binary search that should be useful.
def isaprefix(frag, wordlist, first, last):
    """
    Recursive binary search of wordlist for words that start with frag.
    Assumes wordlist is a sorted list.
    Typically called with first = 0 and last = len(wordlist) - 1.
    first, last -->> integer
    returns bool
    """
    # base case - down to two elements
    if (last - first) < 2:
        # return False unless frag is a prefix
        # of either of the two remaining words
        return wordlist[first].startswith(frag) or wordlist[last].startswith(frag)
    mid = (first + last) // 2  # integer division, so mid is a valid index
    midword = wordlist[mid]
    # go ahead and return if you find one
    # (a second base case)
    if midword.startswith(frag):
        return True
    # start the tests
    # python does just fine comparing strings
    if frag < midword:
        # set the limits to the lower half
        # of the previous range searched and recurse
        return isaprefix(frag, wordlist, first, mid - 1)
    # frag is > midword: set the limits to the upper half
    # of the previous range searched and recurse
    return isaprefix(frag, wordlist, mid + 1, last)
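The answer stops short of showing how to drive the search with this; here is a hypothetical usage sketch (words_at, wordset and the word-list loading are my own names and assumptions, not part of the answer) that grows a fragment from each starting position and stops as soon as isaprefix reports that no dictionary word can start with it:
wordlist = sorted(set(open('/usr/share/dict/words').read().lower().split()))
wordset = set(wordlist)

def words_at(text, start, max_len=20):
    # collect every dictionary word that begins at position start
    found = []
    for end in range(start + 1, min(start + max_len, len(text)) + 1):
        frag = text[start:end]
        if not isaprefix(frag, wordlist, 0, len(wordlist) - 1):
            break  # nothing starts with frag, so no longer fragment can match either
        if frag in wordset:
            found.append(frag)
    return found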
You could also think of creating one long sequence out of the entire dictionary and then aligning the two sequences to find the words, using Smith-Waterman or any other heuristic local-alignment algorithm.
I'm working with some text that has a mix of languages and that I've already done some processing on; it is in the form of a list of single characters (called "letters"). I can tell which language each character belongs to simply by testing whether it has case (with a small function called "test_lang"). I then want to insert a space between characters of different types, so I don't end up with any words that are a mix of character types. At the same time, I want to insert a space between words and punctuation (which I defined in a list called "punc"). I wrote a script that does this in a very straightforward way that made sense to me (below), but apparently it's the wrong way to do it, because it is incredibly slow.
Can anyone tell me what the better way to do this is?
# Add a space between Arabic/foreign mixes, and between words and punc
cleaned = ""
i = 0
while i <= len(letters)-2:  # excludes last letter to avoid an out-of-range error for i+1
    cleaned += letters[i]
    # words that have case are Latin; otherwise Arabic
    if test_lang(letters[i]) != test_lang(letters[i+1]):
        cleaned += " "
    if letters[i] in punc or letters[i+1] in punc:
        cleaned += " "
    i += 1
cleaned += letters[len(letters)-1]  # add in last letter
There are a few things going on here:
You call test_lang() on every letter in the string twice; this is probably the main reason the code is slow.
Concatenating strings in Python isn't very efficient; you should instead build a list or generator and then use str.join() (most likely ''.join()).
Here is the approach I would take, using itertools.groupby():
from itertools import groupby

def keyfunc(letter):
    return (test_lang(letter), letter in punc)

cleaned = ' '.join(''.join(g) for k, g in groupby(letters, keyfunc))
This will group the letters into consecutive letters of the same language and whether or not they are punctuation, then ''.join(g) converts each group back into a string, then ' '.join() combines these strings adding a space between each string.
Also, as noted in comments by DSM, make sure that punc is a set.
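For example (the exact punctuation characters here are just an illustration):
punc = set('.,;:!?')  # set membership tests are O(1), unlike scanning a list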
Every time you perform a string concatenation, a new string is created. The longer the string gets, the longer each concatenation takes.
http://en.wikipedia.org/wiki/Schlemiel_the_Painter's_algorithm
You might be better off declaring a list big enough to store the characters of the output, and joining them at the end.
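A minimal sketch of that idea, where needs_space is a hypothetical predicate standing in for whatever split condition you use (it is not from the question):
pieces = []
for i in range(len(letters) - 1):
    pieces.append(letters[i])
    if needs_space(letters[i], letters[i + 1]):  # hypothetical predicate
        pieces.append(' ')
pieces.append(letters[-1])
cleaned = ''.join(pieces)  # one final join instead of repeated concatenation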
I suggest an entirely different solution that should be very fast:
import re
cleaned = re.sub(r"(?<!\s)\b(?!\s)", " ", ''.join(letters), flags=re.LOCALE)  # letters is a list, so join it into a string first
This inserts a space at every word boundary (defining words as "sequences of alphanumeric characters, including accented characters in your current locale", which should work in most cases), unless it's a word boundary next to whitespace.
This should split between Latin and Arabic characters as well as between Latin and punctuation.
Assuming test_lang is not the bottleneck, I'd try:
''.join(
    x + ' '
    if x in punc or y in punc or test_lang(x) != test_lang(y)
    else x
    for x, y in zip(letters[:-1], letters[1:])
) + letters[-1]  # zip stops one short, so append the final character
Here is a solution that uses yield. I would be interested to know whether this runs any faster than your original solution.
This avoids all the indexing in the original. It just iterates through the input, holding onto a single previous character.
This should be easy to modify if your requirements change in the future.
ch_sep = ' '

def _sep_chars_by_lang(s_input):
    itr = iter(s_input)
    ch_prev = next(itr)
    yield ch_prev
    for ch in itr:  # a for loop ends cleanly; a bare next() would leak StopIteration on Python 3.7+
        if test_lang(ch_prev) != test_lang(ch) or ch_prev in punc:
            yield ch_sep
        yield ch
        ch_prev = ch

def sep_chars_by_lang(s_input):
    return ''.join(_sep_chars_by_lang(s_input))
Keeping the basic logic of the OP's original code, we speed it up by not doing all that [i] and [i+1] indexing. We use a prev and next reference that scan through the string, maintaining prev one character behind next:
# Add a space between Arabic/foreign mixes, and between words and punc
cleaned = ''
prev = letters[0]
for next in letters[1:]:
    cleaned += prev
    if test_lang(prev) != test_lang(next):
        cleaned += ' '
    if prev in punc or next in punc:
        cleaned += ' '
    prev = next
cleaned += next
Testing on a string of 10 million characters shows this is about twice the speed of the OP's code. The "string concatenation is slow" complaint is obsolete, as others have pointed out. Running the test again using the ''.join(...) idiom shows slightly slower execution than using string concatenation.
Further speedup may come through not calling the test_lang() function but by inlining some simple code. Can't comment as I don't really know what test_lang() does :).
Edit: removed a 'return' statement that should not have been there (testing remnant!).
Edit: You could also speed this up by not calling test_lang() twice on the same character (once as next in one iteration and again as prev in the following one). Cache the test_lang(next) result.
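A sketch of that caching idea (my own variation on the loop above, not part of the original answer); each character's test_lang() result is computed once and carried over to the next iteration:
cleaned = ''
prev = letters[0]
prev_lang = test_lang(prev)
for nxt in letters[1:]:
    cleaned += prev
    nxt_lang = test_lang(nxt)  # computed once, reused as prev_lang next time around
    if prev_lang != nxt_lang:
        cleaned += ' '
    if prev in punc or nxt in punc:
        cleaned += ' '
    prev, prev_lang = nxt, nxt_lang
cleaned += prev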
I have a large list of domain names (around six thousand), and I would like to see which words trend the highest for a rough overview of our portfolio.
The problem I have is the list is formatted as domain names, for example:
examplecartrading.com
examplepensions.co.uk
exampledeals.org
examplesummeroffers.com
+5996
Just running a word count brings up garbage. So I guess the simplest way to go about this would be to insert spaces between whole words then run a word count.
For my sanity I would prefer to script this.
I know (very) little Python 2.7, but I am open to any recommendations on approaching this; an example of code would really help. I have been told that using a simple string trie data structure would be the simplest way of achieving this, but I have no idea how to implement one in Python.
We try to split the domain name (s) into any number of words (not just 2) from a set of known words (words). Recursion ftw!
def substrings_in_set(s, words):
    if s in words:
        yield [s]
    for i in range(1, len(s)):
        if s[:i] not in words:
            continue
        for rest in substrings_in_set(s[i:], words):
            yield [s[:i]] + rest
This generator function first yields the string it was called with, if it is in words. Then it splits the string in two in every possible way. If the first part is not in words, it tries the next split. If it is, the first part is prepended to all the results of calling itself on the second part (of which there may be none, as with ["example", "cart", ...], where the remainder "rading" can't be split into words).
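For instance, with a small hand-made word set (just for illustration):
>>> words = {"example", "car", "cart", "trading"}
>>> list(substrings_in_set("examplecartrading", words))
[['example', 'car', 'trading']]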
Then we build the English dictionary:
# Assuming Linux. Word list may also be at /usr/dict/words.
# If not on Linux, grab yourself an English word list and insert here:
words = set(x.strip().lower() for x in open("/usr/share/dict/words").readlines())

# The above English dictionary for some reason lists all single letters as words.
# Remove all except "a", "i" and "u" (remember a string is an iterable, which means
# that set("abc") == set(["a", "b", "c"])).
words -= set("bcdefghjklmnopqrstvwxyz")

# If there are more words we don't like, we remove them like this:
words -= set(("ex", "rs", "ra", "frobnicate"))

# We may also add words that we do want to recognize. Now the domain name
# slartibartfast4ever.co.uk will be properly counted, for instance.
words |= set(("4", "2", "slartibartfast"))
Now we can put things together:
count = {}
no_match = []

# Assume domains is the list of domain names ["examplecartrading.com", ...]
domains = ["examplecartrading.com", "examplepensions.co.uk",
           "exampledeals.org", "examplesummeroffers.com"]

for domain in domains:
    # Extract the part in front of the first ".", and make it lower case
    name = domain.partition(".")[0].lower()
    found = set()
    for split in substrings_in_set(name, words):
        found |= set(split)
    for word in found:
        count[word] = count.get(word, 0) + 1
    if not found:
        no_match.append(name)

print count
print "No match found for:", no_match
Result: {'ions': 1, 'pens': 1, 'summer': 1, 'car': 1, 'pensions': 1, 'deals': 1, 'offers': 1, 'trading': 1, 'example': 4}
Using a set to contain the english dictionary makes for fast membership checks. -= removes items from the set, |= adds to it.
Using the all function together with a generator expression improves efficiency, since all returns on the first False.
Some substrings may be a valid word both as either a whole or split, such as "example" / "ex" + "ample". For some cases we can solve the problem by excluding unwanted words, such as "ex" in the above code example. For others, like "pensions" / "pens" + "ions", it may be unavoidable, and when this happens, we need to prevent all the other words in the string from being counted multiple times (once for "pensions" and once for "pens" + "ions"). We do this by keeping track of the found words of each domain name in a set -- sets ignore duplicates -- and then count the words once all have been found.
EDIT: Restructured and added lots of comments. Forced strings to lower case to avoid misses because of capitalization. Also added a list to keep track of domain names where no combination of words matched.
NECROMANCY EDIT: Changed substring function so that it scales better. The old version got ridiculously slow for domain names longer than 16 characters or so. Using just the four domain names above, I've improved my own running time from 3.6 seconds to 0.2 seconds!
Assuming you only have a few thousand standard domains, you should be able to do this all in memory.
domains = open(domainfile)
dictionary = set(DictionaryFileOfEnglishLanguage.readlines())
found = []
for domain in domains.readlines():
    for substring in all_sub_strings(domain):
        if substring in dictionary:
            found.append(substring)

from collections import Counter
c = Counter(found)  # this is what you want
print c
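The snippet leaves all_sub_strings undefined; a minimal sketch of what it might look like (my own filler, not part of the answer):
def all_sub_strings(s):
    # yield every contiguous substring of s
    for start in range(len(s)):
        for end in range(start + 1, len(s) + 1):
            yield s[start:end]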
with open('/usr/share/dict/words') as f:
    words = [w.strip() for w in f.readlines()]

def guess_split(word):
    result = []
    for n in xrange(len(word)):
        if word[:n] in words and word[n:] in words:
            result = [word[:n], word[n:]]
    return result

from collections import defaultdict
word_counts = defaultdict(int)
with open('blah.txt') as f:
    for line in f.readlines():
        for word in line.strip().split('.'):
            if len(word) > 3:
                # junks the com, org, stuff
                for x in guess_split(word):
                    word_counts[x] += 1

for spam in word_counts.items():
    print '{word}: {count}'.format(word=spam[0], count=spam[1])
Here's a brute-force method which only tries to split the domains into 2 English words. If a domain doesn't split into 2 English words, it gets junked. It should be straightforward to extend this to attempt more splits, but it will probably not scale well with the number of splits unless you're clever. Fortunately I guess you'll only need 3 or 4 splits max; a sketch of one possible extension follows the output below.
output:
deals: 1
example: 2
pensions: 1
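As for attempting more splits, here is a hedged sketch of how guess_split might be generalized (the name guess_split_multi and the max_parts cutoff are my own; this mirrors the recursive substrings_in_set approach from the earlier answer):
def guess_split_multi(word, max_parts=4):
    # return the first split of word into at most max_parts dictionary words
    if word in words:
        return [word]
    if max_parts == 1:
        return []
    for n in xrange(1, len(word)):
        if word[:n] in words:
            rest = guess_split_multi(word[n:], max_parts - 1)
            if rest:
                return [word[:n]] + rest
    return []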