Best way to count the number of occurrences of a substring? [duplicate] - python
This question already has answers here:
String count with overlapping occurrences [closed]
(25 answers)
Closed last year.
C = 'Kabansososkabansosos'
I need the number of times the string 'sos' occurs in C. I've already used the C.count('sos') method, but it did not help me.
def count_occurrences(expression, word):
    L = len(word)
    counts = 0
    # Slide a window of length L over the string and compare each slice
    for i in range(L, len(expression) + 1):
        if expression[i-L:i] == word:
            counts += 1
    return counts
count_occurrences('Kabansososkabansosos','sos')
>>> 4
To be explicit, your problem is that you want to count overlapping occurrences of the substring, which is not what str.count() does: it counts only non-overlapping occurrences.
First, let's use the powerful re (regular expression) module instead of str.count():
import re
C = 'Kabansososkabansosos'
matches = re.findall(r'sos', C)
n = len(matches)
print(n) # 2
At first sight this seems silly, as it 1) also does not work with overlapping occurrences and 2) produces a (potentially) large list of matches, all of which will just be the str 'sos'.
However, using the clever trick of the accepted answer here (wrapping the pattern in a zero-width lookahead, (?=...), so that a match consumes no characters), we can make re.findall() work with overlapping occurrences:
C = 'Kabansososkabansosos'
matches = re.findall(r'(?=(sos))', C)
n = len(matches)
print(n) # 4
If C is very large, we would like to avoid the potentially large temporary list matches. We can do this by instead using the re.finditer() function, which returns an iterator rather than a list:
C = 'Kabansososkabansosos'
matches = re.finditer(r'(?=(sos))', C)
n = sum(1 for match in matches)
print(n) # 4
where the sum() over 1s simply produces the length of the iterator, i.e. the number of matches.
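If the substring is not a fixed literal, the lookahead approach can be wrapped in a small helper. This is only a sketch: the name count_overlapping is made up for illustration, and re.escape is applied on the assumption that the substring should be matched literally rather than interpreted as a regex pattern.

```python
import re

def count_overlapping(text, sub):
    """Count occurrences of sub in text, including overlapping ones."""
    if not sub:
        raise ValueError("sub must be a non-empty string")
    # The lookahead (?=...) matches at a position without consuming any
    # characters, so the scan resumes one character later and overlapping
    # occurrences are all visited.
    pattern = re.compile('(?=' + re.escape(sub) + ')')
    return sum(1 for _ in pattern.finditer(text))

print(count_overlapping('Kabansososkabansosos', 'sos'))  # 4 (overlapping)
print('Kabansososkabansosos'.count('sos'))               # 2 (non-overlapping)
```

Because finditer is lazy, no list of matches is built, which is the same memory argument as above.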
Related
Is there an operator to specify an index "not less than 0"? [duplicate]
This question already has answers here: Pythonic way to replace list values with upper and lower bound (clamping, clipping, thresholding)? (2 answers) Closed last month.
To "peek" at characters in a string preceding the current index, i, is there a one-liner for avoiding negative indices? I want n chars before the current char index of the string OR beginning of the string to current char. My intuition says there might be something simple I can put in the same line as the list operation instead of another line to make an index. I'd rather avoid any libraries, and I'm aware a simple if check would work... Hoping there's a magic operator I missed.
>>> s = 'ABCDEFG'
>>> s[0:5]
'ABCDE'
>>> s[-5:5]
'CDE'
There is no operator for doing this, but you can achieve the same effect with max:
>>> s = 'abcdefg'
>>> i = 3
>>> s[max(i - 6, 0):i + 2]
'abcde'
How about this?
s = 'ABCDEFG'
i = 5
n = 3
print(s[max(0, i-n): i])
Output:
CDE
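The max() idea from both answers could be packaged as a tiny helper. This is a sketch only; the function name preceding and its parameter order are assumptions made for illustration.

```python
def preceding(s, i, n):
    """Return up to n characters of s that sit immediately before index i."""
    # max() clamps the start at 0; without it, a negative start would be
    # interpreted as an offset from the end of the string.
    return s[max(0, i - n):i]

s = 'ABCDEFG'
print(preceding(s, 5, 3))  # CDE
print(preceding(s, 1, 3))  # A   (only one character is available)
print(repr(s[1 - 3:1]))    # ''  (the unclamped slice s[-2:1] is empty)
```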
counting the number of substrings in a string
I am working on a Python assignment and I am stuck here. Apparently, I have to write code that counts the number of occurrences of a given substring within a string. I thought I got it right, then I got stuck here.
def count(substr,theStr):
    # your code here
    num = 0
    i = 0
    while substr in theStr[i:]:
        i = i + theStr.find(substr)+1
        num = num + 1
    return num
substr = 'is'
theStr = 'mississipi'
print(count(substr,theStr))
If I run this, I expect to get 2 as the result; instead, I get 3... Other examples such as ana and banana work fine, but this specific example keeps producing the error. I don't know what I did wrong here. Would you PLEASE help me out.
In your code while substr in theStr[i:]: correctly advances over the target string theStr, however the i = i + theStr.find(substr)+1 keeps looking from the start of theStr.
The str.find method accepts optional start and end arguments to limit the search:
    str.find(sub[, start[, end]])
    Return the lowest index in the string where substring sub is found within the slice s[start:end]. Optional arguments start and end are interpreted as in slice notation. Return -1 if sub is not found.
We don't really need to use in here: we can just check that find doesn't return -1. It's a bit wasteful performing an in search when we then need to repeat the search using find to get the index of the substring.
I assume that you want to find overlapping matches, since the str.count method can find non-overlapping matches, and since it's implemented in C it's more efficient than implementing it yourself in Python.
def count(substr, theStr):
    num = i = 0
    while True:
        j = theStr.find(substr, i)
        if j == -1:
            break
        num += 1
        i = j + 1
    return num
print(count('is', 'mississipi'))
print(count('ana', 'bananana'))
output
2
3
The core of this code is
    j = theStr.find(substr, i)
i is initialised to 0, so we start searching from the beginning of theStr, and because of i = j + 1 subsequent searches start looking from the index following the last found match.
The code change you need is
    i = i + theStr[i:].find(substr) + 1
instead of
    i = i + theStr.find(substr) + 1
In your code the substring is found in theStr[i:] until i moves past position 4. But while finding the index of the substring, you were using the original (whole) string, which always returns position 1.
In your example of banana, after the first iteration i becomes 2. So, in the next iteration theStr[i:] becomes nana, and the position of the substring ana is 1 in both this sliced string and the original string. So the bug in the code is simply masked and the code seems to work fine.
If your code is purely for learning purposes, then you can do it this way. Otherwise you may want to make use of Python-provided functions (like count()) to do the job.
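Applied to the code from the question, the one-line change might look like the sketch below. It assumes substr is non-empty, since an empty substring would make the while condition true forever.

```python
def count(substr, theStr):
    # Assumes substr is non-empty; '' is "in" every string, so the loop
    # below would never terminate for it.
    num = 0
    i = 0
    while substr in theStr[i:]:
        # find() runs on the remaining slice, so its result is relative to i;
        # adding i converts it back to an index into theStr.
        i = i + theStr[i:].find(substr) + 1
        num = num + 1
    return num

print(count('is', 'mississipi'))  # 2
print(count('ana', 'banana'))     # 2
```

Since i only moves one character past the start of each match, this version counts overlapping occurrences.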
Counting the number of substrings:
def count(substr,theStr):
    num = 0
    for i in range(len(theStr)):
        if theStr[i:i+len(substr)] == substr:
            num += 1
    return num
substr = 'is'
theStr = 'mississipi'
print(count(substr,theStr))
Output: 2
Here theStr[i:i+len(substr)] is a string slice, where i is the starting index and i+len(substr) is the ending index. E.g. with i = 0 and a substring length of 2, the first slice compared against the substring is 'mi'.
Is it possible to use .count on multiple strings in one command for PYTHON?
I was wondering, is it possible to count multiple strings using the .count function?
string = "abcdefg"
string.count("." or "!")
When I use the or command, it only gives me the count for 1 of the variables, but I want the total. How do you combine it such that it counts for 2 strings without splitting it into two functions? Thanks
Unfortunately, the built-in str.count function doesn't support doing counts of multiple characters in a single call. The other respondents both use multiple calls to str.count and make multiple passes over the data. I believe your question specified that you didn't want to split the calls.
If you aspire to a single-pass search in only one call, there are several other ways.
One way uses a regex such as len(re.findall(r'[ab]', s)), which counts the number of a or b characters in a single pass. You could even do this in a memory-efficient iterator form, sum(1 for _ in re.finditer(r'[ab]', s)).
Another way uses sets together with filter. For example, len(filter({'a', 'b'}.__contains__, s)) also counts the number of a or b characters in a single pass. You can also do this one in a memory-efficient iterator form, sum(1 for _ in itertools.ifilter({'a', 'b'}.__contains__, s)).
>>> s = 'abracadabra'
>>> re.findall(r'[ab]', s)
['a', 'b', 'a', 'a', 'a', 'b', 'a']
>>> filter({'a', 'b'}.__contains__, s)
'abaaaba'
Just for grins, there is one other way but it is a bit off-beat: len(s) - len(s.translate(None, 'ab')). This starts with the total string length and subtracts the length of a translated string where the a and b characters have been removed.
Side note: the filter and translate examples above are Python 2. In Python 3, filter() has changed to return an iterator, so the new code would become len(list(filter({'a', 'b'}.__contains__, s))) and sum(1 for _ in filter({'a', 'b'}.__contains__, s)). The translate form also needs adjusting there, since str.translate takes only a table argument: len(s) - len(s.translate(str.maketrans('', '', 'ab'))).
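For reference, the three single-pass approaches can be put side by side in Python 3 form. This is a sketch; the variable names are made up, and the translate variant uses str.maketrans('', '', 'ab') to build a deletion table.

```python
import re

s = 'abracadabra'
targets = {'a', 'b'}

# Regex character class, counted lazily with finditer
n_regex = sum(1 for _ in re.finditer(r'[ab]', s))

# Set membership test applied to each character; filter() is lazy in Python 3
n_filter = sum(1 for _ in filter(targets.__contains__, s))

# Delete the target characters and compare lengths; the third argument of
# str.maketrans lists the characters to remove
n_translate = len(s) - len(s.translate(str.maketrans('', '', 'ab')))

print(n_regex, n_filter, n_translate)  # 7 7 7
```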
You could use a Counter: simple, but it may not be memory efficient.
string_ = "abcdefg"
from collections import Counter
counter = Counter(string_)
sum_ = counter['.'] + counter['!']
You could also simply use a generator expression with sum (better):
string_ = "abcdefg"
sum_ = sum(1 for c in string_ if c in '.!')
You can use this code, and set it to a variable and then print the variable.
text.count(".") + text.count("?") + text.count("!")
For example:
count = text.count(".") + text.count("?") + text.count("!")
print(count)
3
The or expression will select the ".", because a non-empty string is truthy, so the count method will never see the exclamation mark. In any case, count would not be able to handle two independent strings in one call, so you need to do this separately.
string.count(".") + string.count("!")
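A quick way to see what the or expression actually passes to count is to evaluate it on its own. A small sketch, using a sample string that contains both characters:

```python
string = "abcdefg.a.!"

# "or" returns its first truthy operand, and any non-empty string is truthy,
# so count() only ever receives "."
needle = "." or "!"
print(repr(needle))          # '.'
print(string.count(needle))  # 2, the "!" is never counted

# Counting each character separately gives the total
total = string.count(".") + string.count("!")
print(total)                 # 3
```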
You can use the sum function to get the total.
string = "abcdefg.a.!"
print(sum(string.count(i) for i in '.!'))
Output:
3
You could use:
string.replace("!",".").count(".")
From a performance point of view I don't expect any benefit here, since the string has to be parsed twice. However, if your main motivation is to count the chars in a one-liner (e.g. myFunctionReturnsString().replace("!",".").count(".")), this might help.
Without using count(), you can split your original string and search string into lists, and then check the length of their overlap:
s = list('abcdefg.a.!')
t = list('.!')
print(len([i for i in s if i in t]))
>> 3
How to find maximum value of two numbers in python? [duplicate]
This question already has answers here: How do I compare version numbers in Python? (16 answers) Closed 6 years ago.
I want to get the maximum value from a list.
List = ['1.23','1.8.1.1']
print(max(List))
If I print this I'm getting 1.8.1.1 instead of 1.23. What am I doing wrong?
The easiest way is to use tuple comparison. Say:
versions = ['1.23','1.8.1.1']
def getVersionTuple(v):
    return tuple(map(int, v.strip().split('.')))
Now you can use print(max(map(getVersionTuple, versions))) to get the maximum.
EDIT: You can use '.'.join(map(str, m)) to get the original string (given m holds the max tuple).
These aren't numbers, they are strings, and as such they are sorted lexicographically. Since the character 8 comes after 2, 1.8.1.1 is returned as the maximum.
One way to solve this is to write your own comparison function which takes each part of the string as an int and compares them numerically:
from functools import cmp_to_key
def version_compare(a, b):
    a_parts = a.split('.')
    b_parts = b.split('.')
    a_len = len(a_parts)
    b_len = len(b_parts)
    length = min(a_len, b_len)
    # Compare the parts one by one
    for i in range(length):
        a_int = int(a_parts[i])
        b_int = int(b_parts[i])
        # And return on the first nonequal part
        if a_int != b_int:
            return a_int - b_int
    # If we got here, the longest list is the "biggest"
    return a_len - b_len
print(sorted(['1.23','1.8.1.1'], key=cmp_to_key(version_compare), reverse=True)[0])
(In Python 3, sorted() no longer accepts a cmp argument, so the comparison function is wrapped with functools.cmp_to_key.)
A similar approach, assuming these strings are version numbers, is to turn each version string into a list of integers:
vsn_list=['1.23', '1.8.1.1']
print(sorted([[int(v) for v in x.split(".")] for x in vsn_list]))
When you compare strings, they are compared character by character, so any string starting with '2' will sort before a string starting with '8', for example.
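If the goal is to get the original string back rather than a tuple or list, the conversion can be passed to max as a key function. This is a sketch under the assumption that every component is a plain integer (no suffixes such as 'rc1'); the name version_key is made up for illustration.

```python
def version_key(v):
    """Turn '1.8.1.1' into (1, 8, 1, 1) so comparison is numeric per part."""
    return tuple(int(part) for part in v.split('.'))

versions = ['1.23', '1.8.1.1']
print(max(versions))                   # 1.8.1.1 (lexicographic string comparison)
print(max(versions, key=version_key))  # 1.23    (numeric comparison, part by part)
```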
Python finding patterns within large group of numbers? [duplicate]
This question already has answers here: Finding a repeating sequence at the end of a sequence of numbers (5 answers) Closed 9 years ago.
I'm working with a list of lists that have the periods of continued fractions for non-perfect square roots in each of them. What I'm trying to do with them is to check the size of the largest repeating pattern in each list. Some of the lists for example:
[
[1,1,1,1,1,1....],
[4,1,4,1,4,1....],
[1,2,10,1,2,10....],
[1,1,1,1,1,4,1,4,1,20,9,8,1,1,1,1,1,4,1,4,1,20,9,8....],
[2,2,2,4,2,2,2,4....],
[1,1,1,13,21,45,3,3,1,16,4,1,4,1,1,1,24,15,1,1,1,13,21,45,3,3,1,16,4,1,4,1,1,1,24,15....],
[1,1,1,3,28,1,1,1,3,28,67,25,1,1,1,3,28,1,1,1,3,28,67,25....]
]
The two similar methods that I've been working with are:
def lengths(seq):
    for i in range(len(seq),1,-1):
        if seq[0:i] == seq[i:i*2]:
            return i
def lengths(seq):
    for i in range(1,len(seq)-1):
        if seq[0:i] == seq[i:i*2]:
            return i
These both take the size of the list and compare slices of increasing or decreasing size from the start of the list. The problem is that the first one returns the wrong result for a single repeating digit, because it starts big and sees just the one large pattern. The problem with the second is that there are nested patterns, as in the sixth and seventh example lists, and it will be satisfied with the nested pattern and overlook the rest of the pattern.
Works (caught a typo in 4th element of your sample)
>>> seq_l = [
... [1,1,1,1,1,1],
... [4,1,4,1,4,1],
... [1,2,10,1,2,10],
... [1,1,1,1,1,4,1,4,1,20,9,8,1,1,1,1,1,4,1,4,1,20,9,8],
... [2,2,2,4,2,2,2,4,2,2,2,4,2,2,2,4],
... [1,1,1,13,21,45,3,3,1,16,4,1,4,1,1,1,24,15,1,1,1,13,21,45,3,3,1,16,4,1,4,1,1,1,24,15],
... [1,1,1,3,28,1,1,1,3,28,67,25,1,1,1,3,28,1,1,1,3,28,67,25]
... ]
>>>
>>> def rep_len(seq):
...     s_len = len(seq)
...     for i in range(1,s_len-1):
...         if s_len%i == 0:
...             # Integer division, so the result can be used to repeat a list
...             j = s_len//i
...             if seq == j*seq[:i]:
...                 return i
...
>>> [rep_len(seq) for seq in seq_l]
[1, 2, 3, 12, 4, 18, 12]
If it's not infeasible to convert your lists to strings, using regular expressions would make this a trivial task.
import re
lists = [
    [1,1,1,1,1,1],
    [4,1,4,1,4,1],
    [1,2,10,1,2,10],
    [1,1,1,1,1,4,1,4,1,20,9,8,1,1,1,1,1,4,1,4,1,20,9,8], #I think you had a typo in this one...
    [2,2,2,4,2,2,2,4],
    [1,1,1,13,21,45,3,3,1,16,4,1,4,1,1,1,24,15,1,1,1,13,21,45,3,3,1,16,4,1,4,1,1,1,24,15],
    [1,1,1,3,28,1,1,1,3,28,67,25,1,1,1,3,28,1,1,1,3,28,67,25]
]
for l in lists:
    s = "x".join(str(i) for i in l)
    print(s)
    match = re.match(r"^(?P<foo>.*)x?(?P=foo)", s)
    if match:
        print(match.group('foo'))
    else:
        print("****")
    print()
(?P<foo>.*) creates a group known as "foo" and (?P=foo) matches that. Since regular expressions are greedy, you get the longest match by default. The "x?" just allows for a single x in the middle to handle even/odd lengths.
You could probably use a collections.defaultdict(int) to keep counts of all the sublists, unless you know there are some sublists you don't care about. Convert the sublists to tuples before making them dictionary keys. You might be able to get somewhere using a series of bloom filters though, if space is tight. You'd have one bloom filter for subsequences of length 1, another for subsequences of length 2, etc. Then the largest bloom filter that gets a collision has your maximum length sublist. http://stromberg.dnsalias.org/~strombrg/drs-bloom-filter/
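The defaultdict idea might be sketched as follows. The function name count_sublists is made up for illustration, and "sublist" is assumed here to mean a contiguous slice. Note that this enumerates every slice, so the work grows quickly with the list length; it is meant to show the tuple-as-key bookkeeping, not to be an efficient solution.

```python
from collections import defaultdict

def count_sublists(seq):
    """Count how often each contiguous sublist of seq occurs."""
    counts = defaultdict(int)
    n = len(seq)
    for length in range(1, n + 1):
        for start in range(n - length + 1):
            # Lists are not hashable, so each slice is converted to a tuple
            # before it is used as a dictionary key.
            counts[tuple(seq[start:start + length])] += 1
    return counts

counts = count_sublists([4, 1, 4, 1, 4, 1])
print(counts[(4, 1)])        # 3
print(counts[(4, 1, 4, 1)])  # 2 (the two occurrences overlap)
```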
I think you just have to check two levels of sequences at once: 0..i == i..i*2 and 0..i/2 != i/2..i.
def lengths(seq):
    for i in range(len(seq),1,-1):
        # // keeps the slice indices integers on Python 3 as well
        if seq[0:i] == seq[i:i*2] and seq[0:i//2] != seq[i//2:i]:
            return i
If the two halves of 0..i are equal then it means that you are actually comparing two concatenated patterns with each other.
Starting with the first example method, you could recursively search the sub pattern.
def lengths(seq):
    for i in range(len(seq)-1,1,-1):
        if seq[0:i] == seq[i:i*2]:
            j = lengths(seq[0:i]) # Search pattern for sub pattern
            # lengths() returns None when no repetition is found
            if j is not None and j < i and i % j == 0:
                # Found a smaller pattern; further, a longer repeated
                # pattern length must be a multiple of the shorter pattern length
                n = i//j # Number of pattern repetitions (// keeps this an integer on Python 3)
                for k in range(1, n): # Check that all the smaller patterns are the same
                    if seq[0:j] != seq[j*k:j*(k+1)]: # Stop when we find a mismatch
                        return i # Not a repetition of smaller pattern
                else:
                    return j # All the sub-patterns are the same, return the smaller length
            else:
                return i # No smaller pattern
I get the feeling this solution isn't quite correct, but I'll do some testing and edit it as necessary.
(Quick note: Shouldn't the initial for loop start at len(seq)-1? If not, you compare seq[0:len] to seq[len:len], which seems silly, and would cause the recursion to loop infinitely.)
Edit: Seems sorta similar to the top answer in the related question senderle posted, so you'd best just go read that. ;)