Memory consumption of python lists of lists

Memory consumption of python lists of lists - python

A code I was recently working on was found to be using around 200MB of memory to run, and I'm stumped as to why it would need that much.
Basically it mapped a text file onto a list where each character in the file was its own list containing the character and how often it has shown up so far (starting from zero) as its two items.
So 'abbac...' would be [['a','0'],['b','0'],['b','1'],['a','1'],['c','0'],...]
For a text file 1 million characters long, it used 200MB.
Is this reasonable or was it something else my code was doing? If it is reasonable, was it because of the high number of lists? Would [a,0,b,0,b,1,a,1,c,0...] take up substantially less space?

If you do not need the list itself, then I fully subscribe #Lattyware's solution of using a generator.
However, if that's not an option then perhaps you could compress the data in your list without loss of information by storing only the positions for each character in the file.
import random
import string
def track_char(s):
# Make sure all characters have the same case
s = s.lower()
d = dict((k, []) for k in set(s))
for position, char in enumerate(s):
d[char].append(position)
return d
st = ''.join(random.choice(string.ascii_uppercase) for _ in range(50000))
d = track_char(st)
len(d["a"])
# Total number of occurrences of character 2
for char, vals in d.items():
if 2 in vals:
print("Character %s has %s occurrences" % (char,len(d[char]))
Character C has 1878 occurrences
# Number of occurrences of character 2 so far
for char, vals in d.items():
if 2 in vals:
print("Character %s has %s occurrences so far" % (char, len([x for x in d[char] if x <= 2))
Character C has 1 occurrences so far
This way there is no need to duplicate the character string each time there is an occurrence, and you maintain the information of all their occurrences.
To compare the object size of your original list or this approach, here's a test
import random
import string
from sys import getsizeof
# random generation of a string with 50k characters
st = ''.join(random.choice(string.ascii_uppercase) for _ in range(50000))
# Function that returns the original list for this string
def original_track(s):
l = []
for position, char in enumerate(s):
l.append([char, position])
return l
# Testing sizes
original_list = original_track(st)
dict_format = track_char(st)
getsizeof(original_list)
406496
getsizeof(dict_format)
1632
As you can see, the dict_format is roughly 250x times smaller in size. However this difference in sizes should be more pronounced in larger strings.

When it comes to memory use and lists, one of the best ways to reduce memory usage is to avoid lists at all - Python has great support for iterators in the form of generators. If you can produce a generator instead of constructing a list, you should be able to do something like this with very little memory usage. Of course, it depends what you are doing with the data afterwards (say you are writing this structure out to a file, you could do so piece by piece and not store the entire thing at once).
from collections import Counter
def charactersWithCounts():
seen = Counter()
for character in data:
yield (character, seen[character])
seen[character] += 1

Related

Approaches for finding matches in a large dataset

I have a project where, given a list of ~10,000 unique strings, I want to find where those strings occur in a file with 10,000,000+ string entries. I also want to include partial matches if possible. My list of ~10,000 strings is dynamic data and updates every 30 minutes, and currently I'm not able to process all of the searching to keep up with the updated data. My searches take about 3 hours now (compared to the 30 minutes I have to do the search within), so I feel my approach to this problem isn't quite right.
My current approach is to first create a list from the 10,000,000+ string entries. Then each item from the dynamic list is searched for in the larger list using an in-search.
results_boolean = [keyword in n for n in string_data]
Is there a way I can greatly speed this up with a more appropriate approach?

Using a generator with a set is probably your best bet ... this solution i think will work and presumably faster
def find_matches(target_words,filename_to_search):
targets = set(target_words)
with open("search_me.txt") as f:
for line_no,line in enumerate(f):
matching_intersection = targets.intersection(line.split())
if matching_intersection:
yield (line_no,line,matching_intersection) # there was a match
for match in find_matches(["unique","list","of","strings"],"search_me.txt"):
print("Match: %s"%(match,))
input("Hit Enter For next match:") #py3 ... just to see your matches
of coarse it gets harder if your matches are not single words, especially if there is no reliable grouping delimiter

In general, you would want to preprocess the large, unchanging data is some way to speed repeated searches. But you said too little to suggest something clearly practical. Like: how long are these strings? What's the alphabet (e.g., 7-bit ASCII or full-blown Unicode?)? How many characters total are there? Are characters in the alphabet equally likely to appear in each string position, or is the distribution highly skewed? If so, how? And so on.
Here's about the simplest kind of indexing, buiding a dict with a number of entries equal to the number of unique characters across all of string_data. It maps each character to the set of string_data indices of strings containing that character. Then a search for a keyword can be restricted to the only string_data entries now known in advance to contain the keyword's first character.
Now, depending on details that can't be guessed from what you said, it's possible even this modest indexing will consume more RAM than you have - or it's possible that it's already more than good enough to get you the 6x speedup you seem to need:
# Preprocessing - do this just once, when string_data changes.
def build_map(string_data):
from collections import defaultdict
ch2ixs = defaultdict(set)
for i, s in enumerate(string_data):
for ch in s:
ch2ixs[ch].add(i)
return ch2ixs
def find_partial_matches(keywords, string_data, ch2ixs):
for keyword in keywords:
ch = keyword[0]
if ch in ch2ixs:
result = []
for i in ch2ixs[ch]:
if keyword in string_data[i]:
result.append(i)
if result:
print(repr(keyword), "found in strings", result)
Then, e.g.,
string_data = ['banana', 'bandana', 'bandito']
ch2ixs = build_map(string_data)
find_partial_matches(['ban', 'i', 'dana', 'xyz', 'na'],
string_data,
ch2ixs)
displays:
'ban' found in strings [0, 1, 2]
'i' found in strings [2]
'dana' found in strings [1]
'na' found in strings [0, 1]
If, e.g., you still have plenty of RAM, but need more speed, and are willing to give up on (probably silly - but can't guess from here) 1-character matches, you could index bigrams (adjacent letter pairs) instead.
In the limit, you could build a trie out of string_data, which would require lots of RAM, but could reduce the time to search for an embedded keyword to a number of operations proportional to the number of characters in the keyword, independent of how many strings are in string_data.
Note that you should really find a way to get rid of this:
results_boolean = [keyword in n for n in string_data]
Building a list with over 10 million entries for every keyword search makes every search expensive, no matter how cleverly you index the data.
Note: a probably practical refinement of the above is to restrict the search to strings that contain all of the keyword's characters:
def find_partial_matches(keywords, string_data, ch2ixs):
for keyword in keywords:
keyset = set(keyword)
if all(ch in ch2ixs for ch in keyset):
ixs = set.intersection(*(ch2ixs[ch] for ch in keyset))
result = []
for i in ixs:
if keyword in string_data[i]:
result.append(i)
if result:
print(repr(keyword), "found in strings", result)

Find all the possible N-length anagrams - fast alternatives

I am given a sequence of letters and have to produce all the N-length anagrams of the sequence given, where N is the length of the sequence.
I am following a kinda naive approach in python, where I am taking all the permutations in order to achieve that. I have found some similar threads like this one but I would prefer a math-oriented approach in Python. So what would be a more performant alternative to permutations? Is there anything particularly wrong in my attempt below?
from itertools import permutations
def find_all_anagrams(word):
pp = permutations(word)
perm_set = set()
for i in pp:
perm_set.add(i)
ll = [list(i) for i in perm_set]
ll.sort()
print(ll)

If there are lots of repeated letters, the key will be to produce each anagram only once instead of producing all possible permutations and eliminating duplicates.
Here's one possible algorithm which only produces each anagram once:
from collections import Counter
def perm(unplaced, prefix):
if unplaced:
for element in unplaced:
yield from perm(unplaced - Counter(element), prefix + element)
else:
yield prefix
def permutations(iterable):
yield from perm(Counter(iterable), "")
That's actually not much different from the classic recursion to produce all permutations; the only difference is that it uses a collections.Counter (a multiset) to hold the as-yet-unplaced elements instead of just using a list.
The number of Counter objects produced in the course of the iteration is certainly excessive, and there is almost certainly a faster way of writing that; I chose this version for its simplicity and (hopefully) its clarity

This is very slow for long words with many similar characters. Slow compared to theoretical maximum performance that is. For example, permutations("mississippi") will produce a much longer list than necessary. It will have a length of 39916800, but but the set has a size of 34650.
>>> len(list(permutations("mississippi")))
39916800
>>> len(set(permutations("mississippi")))
34650
So the big flaw with your method is that you generate ALL anagrams and then remove the duplicates. Use a method that only generates the unique anagrams.
EDIT:
Here is some working, but extremely ugly and possibly buggy code. I'm making it nicer as you're reading this. It does give 34650 for mississippi, so I assume there aren't any major bugs. Warning again. UGLY!
# Returns a dictionary with letter count
# get_letter_list("mississippi") returns
# {'i':4, 'm':1, 'p': 2, 's':4}
def get_letter_list(word):
w = sorted(word)
c = 0
dd = {}
dd[w[0]]=1
for l in range(1,len(w)):
if w[l]==w[l-1]:
d[c]=d[c]+1
dd[w[l]]=dd[w[l]]+1
else:
c=c+1
d.append(1)
dd[w[l]]=1
return dd
def sum_dict(d):
s=0
for x in d:
s=s+d[x]
return s
# Recursively create the anagrams. It takes a letter list
# from the above function as an argument.
def create_anagrams(dd):
if sum_dict(dd)==1: # If there's only one letter left
for l in dd:
return l # Ugly hack, because I'm not used to dics
a = []
for l in dd:
if dd[l] != 0:
newdd=dict(dd)
newdd[l]=newdd[l]-1
if newdd[l]==0:
newdd.pop(l)
newl=create(newdd)
for x in newl:
a.append(str(l)+str(x))
return a
>>> print (len(create_anagrams(get_letter_list("mississippi"))))
34650
It works like this: For every unique letter l, create all unique permutations with one less occurance of the letter l, and then append l to all these permutations.
For "mississippi", this is way faster than set(permutations(word)) and it's far from optimally written. For instance, dictionaries are quite slow and there's probably lots of things to improve in this code, but it shows that the algorithm itself is much faster than your approach.

Maybe I am missing something, but why don't you just do this:
from itertools import permutations
def find_all_anagrams(word):
return sorted(set(permutations(word)))

You could simplify to:
from itertools import permutations
def find_all_anagrams(word):
word = set(''.join(sorted(word)))
return list(permutations(word))
In the doc for permutations the code is detailled and it seems already optimized.

I don't know python but I want to try to help you: probably there are a lot of other more performant algorithm, but I've thought about this one: it's completely recursive and it should cover all the cases of a permutation. I want to start with a basic example:
permutation of ABC
Now, this algorithm works in this way: for Length times you shift right the letters, but the last letter will become the first one (you could easily do this with a queue).
Back to the example, we will have:
ABC
BCA
CAB
Now you repeat the first (and only) step with the substring built from the second letter to the last one.
Unfortunately, with this algorithm you cannot consider permutation with repetition.

Why does text representation as a list consume so much memory?

I have a 335 MB large text file. The entire text is tokenized. Each token is separated by a whitespace. I want to represent each sentence as a list of words while the entire text is a list of sentences. This means that I'll get a list of lists.
I use this simple peace of code to load the text into my main memory:
def get_tokenized_text(file_name):
tokens = list()
with open(file_name,'rt') as f:
sentences = f.readlines()
return [sent.strip().split(' ') for sent in sentences]
Unfortunately, this method consumes so much memory that my laptop always crashes. I have 4 GB RAM, but it is congested after about five seconds.
Why? The text should occupy about 335 MB. Even if I'd been generous and I'd approved let's say four times as much memory just for administration stuff, there is no reason for memory congestion. Is there any memory leak that I oversee right now?

Lists and strings are objects and objects have properties that take memory space. You can check the size of the objects and the overhead with sys.getsizeof:
>>> sys.getsizeof('')
49
>>> sys.getsizeof('abcd')
53
>>> sys.getsizeof([])
64
>>> sys.getsizeof(['a'])
72
>>> sys.getsizeof(['a', 'b'])
80

Why? The text should occupy about 335 MB.
Supposing that the text is encoded in UTF-8 or one of the various single-byte encodings -- which is likely -- the text itself does occupy a bit more than 335 MB in Python 2, but at least twice as much and maybe four times as much in Python 3, depending on your implementation. This is because Python 3 strings are Unicode strings by default, and they are represented internally with either two or four bytes per character.
Even if I'd been generous and I'd approved let's say four times as much memory just for administration stuff, there is no reason for memory congestion.
But there is. Each Python object has relatively substantial overhead. In CPython 3.4, for example, there is a refcount, a pointer to a type object, a couple of additional pointers linking the objects together into a doubly-linked list, and type-specific additional data. Almost all of that is overhead. Ignoring the type-specific data, just the three pointers and the refcount represent 32 bytes of overhead per object in a 64-bit build.
Strings have an additional length, hashcode, data pointer, and flags, for about 24 more bytes per object (again assuming a 64-bit build).
If your words average 6 characters then each one takes about 6 bytes in your text file, but about 68 bytes as a Python object (maybe as little as 40-ish bytes in a 32-bit Python). That doesn't count the overhead of the lists, which likely add at least 8 bytes per word and 8 more per sentence.
So yes, an expansion of a factor of 12 or more does not seem at all unlikely.
Is there any memory leak that I oversee right now?
Unlikely. Python does a pretty good job of tracking objects and collecting garbage. You do not generally see memory leaks in pure Python codes.

You are keeping multiple representations of the data in memory at the same time. The file buffer in readlines(), also sentences and again when you are building the list to return. To reduce memory, process the file a line at a time. Only words will then hold the entire contents of the file.
def get_tokenized_text(file_name):
words = []
f = open(file_name,'rt')
for line in f:
words.extend( x for x in line.strip().split(' ') if x not in words)
return words
words = get_tokenized_text('book.txt')
print words

My first answer attempted to reduce memory usage by not keeping intermediate lists in memory at the same time.
But that still did not manage to squeeze the whole data structure into 4GB of RAM.
With this approach, using a 40MB text file made up of Project Gutenberg books as test data, the data requirement is reduced from 270 to 55 MB.
A 355 MB input file would then take an estimated 500MB memory, which hopefully would fit.
This approach builds a dictionary of unique words and assigns a unique integer token to each one (word_dict).
Then the list of sentences word_tokens uses the integer token instead of the word itself.
Then word_dict has its keys and values swapped so that the integer tokens in word_tokens can be used to lookup the corresponding word.
I'm using 32-bit Python which uses a lot less memory then 64-bit Python, because the pointers are half the size.
To get the total size of containers like list & dictionary, I've used code from http://code.activestate.com/recipes/577504/ by Raymond Hettinger.
It includes not just the container itself but sub-containers and the bottom level items they point to.
import sys, os, fnmatch, datetime, time, re
# Original approach
def get_tokenized_text(file_name):
words = []
f = open(file_name,'rt')
for line in f:
words.append( line.strip().split(' ') )
return words
# Two step approach
# 1. Build a dictionary of unique words in the file indexed with an integer
def build_dict(file_name):
dict = {}
n = 0
f = open(file_name,'rt')
for line in f:
words = line.strip().split(' ')
for w in words:
if not w in dict:
dict[w] = n
n = n + 1
return dict
# 2. Read the file again and build list of sentence-words but using the integer indexes instead of the word itself
def read_with_dict(file_name):
tokens = []
f = open(file_name,'rt')
for line in f:
words = line.strip().split(' ')
tokens.append( dict[w] for w in words )
return tokens
# Adapted from http://code.activestate.com/recipes/577504/ by Raymond Hettinger
from itertools import chain
from collections import deque
def total_size(o, handlers={}):
""" Returns the approximate memory footprint an object and all of its contents.
Automatically finds the contents of the following builtin containers and
their subclasses: tuple, list, deque, dict, set and frozenset.
To search other containers, add handlers to iterate over their contents:
handlers = {SomeContainerClass: iter,
OtherContainerClass: OtherContainerClass.get_elements}
"""
dict_handler = lambda d: chain.from_iterable(d.items())
all_handlers = {tuple: iter,
list: iter,
deque: iter,
dict: dict_handler,
set: iter,
frozenset: iter,
}
all_handlers.update(handlers) # user handlers take precedence
seen = set() # track which object id's have already been seen
default_size = sys.getsizeof(0) # estimate sizeof object without __sizeof__
def sizeof(o):
if id(o) in seen: # do not double count the same object
return 0
seen.add(id(o))
s = sys.getsizeof(o, default_size)
for typ, handler in all_handlers.items():
if isinstance(o, typ):
s += sum(map(sizeof, handler(o)))
break
return s
return sizeof(o)
# Display your Python configurstion? 32-bit Python takes about half the memory of 64-bit
import platform
print platform.architecture(), sys.maxsize # ('32bit', 'WindowsPE') 2147483647
file_name = 'LargeTextTest40.txt' # 41,573,429 bytes
# I ran this only for a size comparison - don't run it on your machine
# words = get_tokenized_text(file_name)
# print len(words), total_size(words) # 962,632 268,314,991
word_dict = build_dict(file_name)
print len(word_dict), total_size(word_dict) # 185,980 13,885,970
word_tokens = read_with_dict(file_name)
print len(word_tokens), total_size(word_tokens) # 962,632 42,370,804
# Reverse the dictionary by swapping key and value so the integer token can be used to lookup corresponding word
word_dict.update( dict((word_dict[k], k) for k in word_dict) )

python longest repeated substring in a single string? [duplicate]

I need to find the longest sequence in a string with the caveat that the sequence must be repeated three or more times. So, for example, if my string is:
fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld
then I would like the value "helloworld" to be returned.
I know of a few ways of accomplishing this but the problem I'm facing is that the actual string is absurdly large so I'm really looking for a method that can do it in a timely fashion.

This problem is a variant of the longest repeated substring problem and there is an O(n)-time algorithm for solving it that uses suffix trees. The idea (as suggested by Wikipedia) is to construct a suffix tree (time O(n)), annotate all the nodes in the tree with the number of descendants (time O(n) using a DFS), and then to find the deepest node in the tree with at least three descendants (time O(n) using a DFS). This overall algorithm takes time O(n).
That said, suffix trees are notoriously hard to construct, so you would probably want to find a Python library that implements suffix trees for you before attempting this implementation. A quick Google search turns up this library, though I'm not sure whether this is a good implementation.
Another option would be to use suffix arrays in conjunction with LCP arrays. You can iterate over pairs of adjacent elements in the LCP array, taking the minimum of each pair, and store the largest number you find this way. That will correspond to the length of the longest string that repeats at least three times, and from there you can then read off the string itself.
There are several simple algorithms for building suffix arrays (the Manber-Myers algorithm runs in time O(n log n) and isn't too hard to code up), and Kasai's algorithm builds LCP arrays in time O(n) and is fairly straightforward to code up.
Hope this helps!

Use defaultdict to tally each substring beginning with each position in the input string. The OP wasn't clear whether overlapping matches should or shouldn't be included, this brute force method includes them.
from collections import defaultdict
def getsubs(loc, s):
substr = s[loc:]
i = -1
while(substr):
yield substr
substr = s[loc:i]
i -= 1
def longestRepetitiveSubstring(r, minocc=3):
occ = defaultdict(int)
# tally all occurrences of all substrings
for i in range(len(r)):
for sub in getsubs(i,r):
occ[sub] += 1
# filter out all substrings with fewer than minocc occurrences
occ_minocc = [k for k,v in occ.items() if v >= minocc]
if occ_minocc:
maxkey = max(occ_minocc, key=len)
return maxkey, occ[maxkey]
else:
raise ValueError("no repetitions of any substring of '%s' with %d or more occurrences" % (r,minocc))
prints:
('helloworld', 3)

Let's start from the end, count the frequency and stop as soon as the most frequent element appears 3 or more times.
from collections import Counter
a='fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
times=3
for n in range(1,len(a)/times+1)[::-1]:
substrings=[a[i:i+n] for i in range(len(a)-n+1)]
freqs=Counter(substrings)
if freqs.most_common(1)[0][1]>=3:
seq=freqs.most_common(1)[0][0]
break
print "sequence '%s' of length %s occurs %s or more times"%(seq,n,times)
Result:
>>> sequence 'helloworld' of length 10 occurs 3 or more times
Edit: if you have the feeling that you're dealing with random input and the common substring should be of small length, you better start (if you need the speed) with small substrings and stop when you can't find any that appear at least 3 time:
from collections import Counter
a='fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
times=3
for n in range(1,len(a)/times+1):
substrings=[a[i:i+n] for i in range(len(a)-n+1)]
freqs=Counter(substrings)
if freqs.most_common(1)[0][1]<3:
n-=1
break
else:
seq=freqs.most_common(1)[0][0]
print "sequence '%s' of length %s occurs %s or more times"%(seq,n,times)
The same result as above.

The first idea that came to mind is searching with progressively larger regular expressions:
import re
text = 'fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
largest = ''
i = 1
while 1:
m = re.search("(" + ("\w" * i) + ").*\\1.*\\1", text)
if not m:
break
largest = m.group(1)
i += 1
print largest # helloworld
The code ran successfully. The time complexity appears to be at least O(n^2).

If you reverse the input string, then feed it to a regex like (.+)(?:.*\1){2}
It should give you the longest string repeated 3 times. (Reverse capture group 1 for the answer)
Edit:
I have to say cancel this way. It's dependent on the first match. Unless its tested against a curr length vs max length so far, in an itterative loop, regex won't work for this.

In Python you can use the string count method.
We also use an additional generator which will generate all the unique substrings of a given length for our example string.
The code is straightforward:
test_string2 = 'fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
def generate_substrings_of_length(this_string, length):
''' Generates unique substrings of a given length for a given string'''
for i in range(len(this_string)-2*length+1):
yield this_string[i:i+length]
def longest_substring(this_string):
'''Returns the string with at least two repetitions which has maximum length'''
max_substring = ''
for subs_length in range(2, len(this_string) // 2 + 1):
for substring in generate_substrings_of_length(this_string, subs_length):
count_occurences = this_string.count(substring)
if count_occurences > 1 :
if len(substring) > len(max_substring) :
max_substring = substring
return max_substring
I must note here (and this is important) that the generate_substrings_of_length generator does not generate all the substrings of a certain length. It will generate only the required substring to be able to make comparisons. Otherwise we will have some artificial duplicates. For example in the case :
test_string = "banana"
GS = generate_substrings_of_length(test_string , 2)
for i in GS: print(i)
will result :
ba
an
na
and this is enough for what we need.

from collections import Counter
def Longest(string):
b = []
le = []
for i in set(string):
for j in range(Counter(string)[i]+1):
b.append(i* (j+1))
for i in b:
if i in string:
le.append(i)
return ([s for s in le if len(s)==len(max( le , key = len))])

comparing occurrence of strings in list in python

i'm super duper new in python. I'm kinda stuck for one of my class exercises.
The question goes something like this: You have a file that contains characters i.e. words. (I'm still at the stage where all the terms get mixed up, I apologize if that is not the correct term)
Example of the file.txt content: accbd
The question asks me to import the file to python editor and make sure that no letter occurs more than letter that comes later than it in the alphabet. e.g. a cannot occur more frequently than b; b cannot occur more than c, and so on. In the example file, c occurs more frequently than d, so I need to raise an error message.
Here's my pathetic attempt :
def main():
f=open('.txt','r') # 1st import the file and open it.
data = f.read() #2nd read the file
words = list(data) #3rd create a list that contains every letter
newwords = sorted(words) # sort according to alphabetical order
I'm stuck at the last part which is to count that the former word doesn't occur more than the later word, and so on. I tried two ways but neither is working. Here's trial 1:
from collections import counter
for i in newwords:
try:
if counter(i) <=counter(i+1):
print 'ok'
else:
print 'not ok between indexes %d and %d' % (i, i+1)
except:
pass
The 2nd trial is similar
for i in newwords:
try:
if newwords.count(i) <= newwords.count(i+1):
print 'ok'
else:
print 'ok between indexes %d and %d' % (i, i+1)
except:
pass
What is the correct way to compare the count for each word in sequential order?

I had posted an answer, but I see it's for an assignment, so I'll try to explain instead of just splatting a solution here.
My suggestion would be to solve it in three steps:
1) in the first line, create a list of sorted characters that appear in the string:
from the data string you can use set(data) to pick every unique character
if you use sort() on this set you can create a list of characters, sorted alphabetically.
2) then use this list in a for loop (or list comprehension) to create a second list, of their number of occurrences in data, using data.count(<letter in the list>); note that the elements in this second list are technically sorted based on the alphabetical order of the letters in the first list you made (because of the for loop).
3) compare this second list of values with a sorted version of itself (now sorted by values), and see if they match or not. If they don't match, it's because some of the initial letters appears too many times compared to the next ones.

To be a little more clear:
In [2]: string = 'accbd'
In [3]: import collections
In [4]: collections.Counter(string)
Out[4]: Counter({'c': 2, 'a': 1, 'b': 1, 'd': 1})
Then it's just a for loop with enumerate(list_).

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.