I need to find the longest sequence in a string with the caveat that the sequence must be repeated three or more times. So, for example, if my string is:
fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld
then I would like the value "helloworld" to be returned.
I know of a few ways of accomplishing this, but the problem I'm facing is that the actual string is absurdly large, so I'm really looking for a method that can do it in a timely fashion.
This problem is a variant of the longest repeated substring problem and there is an O(n)-time algorithm for solving it that uses suffix trees. The idea (as suggested by Wikipedia) is to construct a suffix tree (time O(n)), annotate all the nodes in the tree with the number of descendants (time O(n) using a DFS), and then to find the deepest node in the tree with at least three descendants (time O(n) using a DFS). This overall algorithm takes time O(n).
That said, suffix trees are notoriously hard to construct, so you would probably want to find a Python library that implements suffix trees for you before attempting this implementation. A quick Google search turns up this library, though I'm not sure whether this is a good implementation.
Another option would be to use suffix arrays in conjunction with LCP arrays. You can iterate over pairs of adjacent elements in the LCP array, taking the minimum of each pair, and store the largest number you find this way. That will correspond to the length of the longest string that repeats at least three times, and from there you can then read off the string itself.
There are several simple algorithms for building suffix arrays (the Manber-Myers algorithm runs in time O(n log n) and isn't too hard to code up), and Kasai's algorithm builds LCP arrays in time O(n) and is fairly straightforward to code up.
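For a concrete starting point, here is a rough sketch of the suffix-array / LCP route (my own illustration, not a tuned implementation: the suffix array is built naively by sorting suffixes, which is simpler but slower than Manber-Myers, and the LCP array comes from Kasai's algorithm):
def longest_thrice_repeated(s):
    n = len(s)
    # naive suffix array: sort suffix start positions lexicographically
    sa = sorted(range(n), key=lambda i: s[i:])
    # Kasai's algorithm: lcp[k] = longest common prefix of suffixes sa[k-1] and sa[k]
    rank = [0] * n
    for k, start in enumerate(sa):
        rank[start] = k
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h:
                h -= 1
        else:
            h = 0
    # three adjacent suffixes in the suffix array share a prefix of length
    # min(lcp[k], lcp[k+1]); the largest such minimum gives the answer
    best_len, best_start = 0, 0
    for k in range(1, n - 1):
        common = min(lcp[k], lcp[k + 1])
        if common > best_len:
            best_len, best_start = common, sa[k]
    return s[best_start:best_start + best_len]
print(longest_thrice_repeated(
    'fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'))
# -> helloworld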
Hope this helps!
Use a defaultdict to tally each substring beginning at each position in the input string. The OP wasn't clear whether overlapping matches should be included; this brute-force method includes them.
from collections import defaultdict

def getsubs(loc, s):
    substr = s[loc:]
    i = -1
    while substr:
        yield substr
        substr = s[loc:i]
        i -= 1

def longestRepetitiveSubstring(r, minocc=3):
    occ = defaultdict(int)
    # tally all occurrences of all substrings
    for i in range(len(r)):
        for sub in getsubs(i, r):
            occ[sub] += 1
    # filter out all substrings with fewer than minocc occurrences
    occ_minocc = [k for k, v in occ.items() if v >= minocc]
    if occ_minocc:
        maxkey = max(occ_minocc, key=len)
        return maxkey, occ[maxkey]
    else:
        raise ValueError("no repetitions of any substring of '%s' with %d or more occurrences" % (r, minocc))
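Called on the example string from the question (the variable name r below is just for illustration):
r = 'fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
print(longestRepetitiveSubstring(r))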
prints:
('helloworld', 3)
Let's start from the largest candidate length, count the frequencies, and stop as soon as the most frequent substring appears 3 or more times.
from collections import Counter

a = 'fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
times = 3
# try the longest candidate lengths first
for n in range(len(a) // times, 0, -1):
    substrings = [a[i:i+n] for i in range(len(a) - n + 1)]
    freqs = Counter(substrings)
    if freqs.most_common(1)[0][1] >= times:
        seq = freqs.most_common(1)[0][0]
        break
print("sequence '%s' of length %s occurs %s or more times" % (seq, n, times))
Result:
>>> sequence 'helloworld' of length 10 occurs 3 or more times
Edit: if you have the feeling that you're dealing with random input and the common substring should be short, you'd better start (if you need the speed) with small substrings and stop when you can't find any that appear at least 3 times:
from collections import Counter

a = 'fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
times = 3
for n in range(1, len(a) // times + 1):
    substrings = [a[i:i+n] for i in range(len(a) - n + 1)]
    freqs = Counter(substrings)
    if freqs.most_common(1)[0][1] < times:
        n -= 1
        break
    else:
        seq = freqs.most_common(1)[0][0]
print("sequence '%s' of length %s occurs %s or more times" % (seq, n, times))
The same result as above.
The first idea that came to mind is searching with progressively larger regular expressions:
import re

text = 'fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
largest = ''
i = 1
while 1:
    m = re.search(r"(" + r"\w" * i + r").*\1.*\1", text)
    if not m:
        break
    largest = m.group(1)
    i += 1
print(largest)  # helloworld
This works on the example, but its time complexity appears to be at least O(n^2).
If you reverse the input string, then feed it to a regex like (.+)(?:.*\1){2}
It should give you the longest string repeated 3 times. (Reverse capture group 1 for the answer)
Edit:
I have to retract this approach. It depends on the first match; unless the current match length is tested against the maximum length so far in an iterative loop, a regex won't work for this.
In Python you can use the string count method.
We also use an additional generator which will generate all the unique substrings of a given length for our example string.
The code is straightforward:
test_string2 = 'fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'

def generate_substrings_of_length(this_string, length):
    '''Generates unique substrings of a given length for a given string'''
    for i in range(len(this_string) - 2*length + 1):
        yield this_string[i:i+length]

def longest_substring(this_string):
    '''Returns the string with at least two repetitions which has maximum length'''
    max_substring = ''
    for subs_length in range(2, len(this_string) // 2 + 1):
        for substring in generate_substrings_of_length(this_string, subs_length):
            count_occurences = this_string.count(substring)
            if count_occurences > 1:
                if len(substring) > len(max_substring):
                    max_substring = substring
    return max_substring
I must note here (and this is important) that the generate_substrings_of_length generator does not generate all the substrings of a certain length. It generates only the substrings needed to make the comparisons; otherwise we would get some artificial duplicates. For example, in the case:
test_string = "banana"
GS = generate_substrings_of_length(test_string , 2)
for i in GS: print(i)
will output:
ba
an
na
and this is enough for what we need.
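For completeness, calling it on the question's string (a usage line of my own, not part of the original answer):
print(longest_substring(test_string2))  # 'helloworld' for this input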
from collections import Counter

def Longest(string):
    b = []
    le = []
    for i in set(string):
        for j in range(Counter(string)[i] + 1):
            b.append(i * (j + 1))
    for i in b:
        if i in string:
            le.append(i)
    return [s for s in le if len(s) == len(max(le, key=len))]
Related
I have a project where, given a list of ~10,000 unique strings, I want to find where those strings occur in a file with 10,000,000+ string entries. I also want to include partial matches if possible. My list of ~10,000 strings is dynamic data and updates every 30 minutes, and currently I'm not able to process all of the searching to keep up with the updated data. My searches take about 3 hours now (compared to the 30 minutes I have to do the search within), so I feel my approach to this problem isn't quite right.
My current approach is to first create a list from the 10,000,000+ string entries. Then each item from the dynamic list is searched for in the larger list using an in-search.
results_boolean = [keyword in n for n in string_data]
Is there a way I can greatly speed this up with a more appropriate approach?
Using a generator with a set is probably your best bet ... I think this solution will work, and it should be faster:
def find_matches(target_words, filename_to_search):
    targets = set(target_words)
    with open(filename_to_search) as f:
        for line_no, line in enumerate(f):
            matching_intersection = targets.intersection(line.split())
            if matching_intersection:
                yield (line_no, line, matching_intersection)  # there was a match

for match in find_matches(["unique", "list", "of", "strings"], "search_me.txt"):
    print("Match: %s" % (match,))
    input("Hit Enter For next match:")  # py3 ... just to see your matches
Of course it gets harder if your matches are not single words, especially if there is no reliable grouping delimiter.
In general, you would want to preprocess the large, unchanging data in some way to speed repeated searches. But you said too little to suggest something clearly practical. Like: how long are these strings? What's the alphabet (e.g., 7-bit ASCII or full-blown Unicode)? How many characters total are there? Are characters in the alphabet equally likely to appear in each string position, or is the distribution highly skewed? If so, how? And so on.
Here's about the simplest kind of indexing: building a dict with a number of entries equal to the number of unique characters across all of string_data. It maps each character to the set of string_data indices of strings containing that character. A search for a keyword can then be restricted to only those string_data entries known in advance to contain the keyword's first character.
Now, depending on details that can't be guessed from what you said, it's possible even this modest indexing will consume more RAM than you have - or it's possible that it's already more than good enough to get you the 6x speedup you seem to need:
# Preprocessing - do this just once, when string_data changes.
def build_map(string_data):
    from collections import defaultdict
    ch2ixs = defaultdict(set)
    for i, s in enumerate(string_data):
        for ch in s:
            ch2ixs[ch].add(i)
    return ch2ixs
def find_partial_matches(keywords, string_data, ch2ixs):
    for keyword in keywords:
        ch = keyword[0]
        if ch in ch2ixs:
            result = []
            for i in ch2ixs[ch]:
                if keyword in string_data[i]:
                    result.append(i)
            if result:
                print(repr(keyword), "found in strings", result)
Then, e.g.,
string_data = ['banana', 'bandana', 'bandito']
ch2ixs = build_map(string_data)
find_partial_matches(['ban', 'i', 'dana', 'xyz', 'na'],
                     string_data,
                     ch2ixs)
displays:
'ban' found in strings [0, 1, 2]
'i' found in strings [2]
'dana' found in strings [1]
'na' found in strings [0, 1]
If, e.g., you still have plenty of RAM, but need more speed, and are willing to give up on (probably silly - but can't guess from here) 1-character matches, you could index bigrams (adjacent letter pairs) instead.
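A minimal sketch of that bigram variant (my own illustration of the idea, not tested against your data):
def build_bigram_map(string_data):
    from collections import defaultdict
    bg2ixs = defaultdict(set)
    for i, s in enumerate(string_data):
        for j in range(len(s) - 1):
            bg2ixs[s[j:j+2]].add(i)
    return bg2ixs

def find_partial_matches_bigram(keywords, string_data, bg2ixs):
    for keyword in keywords:
        if len(keyword) < 2:
            continue  # this index gives up on 1-character keywords
        candidates = bg2ixs.get(keyword[:2], ())
        result = [i for i in candidates if keyword in string_data[i]]
        if result:
            print(repr(keyword), "found in strings", result)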
In the limit, you could build a trie out of string_data, which would require lots of RAM, but could reduce the time to search for an embedded keyword to a number of operations proportional to the number of characters in the keyword, independent of how many strings are in string_data.
Note that you should really find a way to get rid of this:
results_boolean = [keyword in n for n in string_data]
Building a list with over 10 million entries for every keyword search makes every search expensive, no matter how cleverly you index the data.
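If all you actually need to know is whether a keyword occurs at all, a generator expression with any() avoids materializing that list, since it stops at the first hit - a one-line sketch:
keyword_found = any(keyword in s for s in string_data)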
Note: a probably practical refinement of the above is to restrict the search to strings that contain all of the keyword's characters:
def find_partial_matches(keywords, string_data, ch2ixs):
    for keyword in keywords:
        keyset = set(keyword)
        if all(ch in ch2ixs for ch in keyset):
            ixs = set.intersection(*(ch2ixs[ch] for ch in keyset))
            result = []
            for i in ixs:
                if keyword in string_data[i]:
                    result.append(i)
            if result:
                print(repr(keyword), "found in strings", result)
There is a problem where an array of numbers is given, and the task is to find the maximum length of a sub-sequence formed from the given array such that all the elements in that sub-sequence share at least one common digit.
Now what's the catch? Well, I intended to use a dictionary b to store each digit as a key and its count so far as the value while traversing the given array, digit by digit. I thought the maximum value in the dictionary, i.e., the biggest count of a digit, would be the answer, with the caveat that we should not count the same digit more than once if it appears multiple times in ONE element of the array. To overcome that glitch, I used the set c.
The code for this, along with a driver, is written below for convenience.
def solve(a):
    b = {}
    answer = 1
    for i in a:
        j = i
        c = set()
        c.clear()
        while j:
            last_digit = i % 10
            if last_digit not in b and last_digit not in c:
                b[last_digit] = 1
                c.add(last_digit)
            elif last_digit in b and last_digit not in c:
                b[last_digit] += 1
                c.add(last_digit)
                answer = max(answer, b[last_digit])
            j //= 10
    return answer

a = list(map(int, input().strip().split()))
print(solve(a))
There are a lot of test cases this code needs to pass. For one of them, the input is 12 11 3 4 5: the code outputs 1 but the expected output is 2. What gives?
You have good ideas. But your code would be easier if you used the Counter object from the collections module. It is designed to do just what you are trying to do: count the number of occurrences of an item in an iterable.
This code uses a generator expression to look at each value in the list alist, uses the built-in str() function to convert that integer to a string of digits, then uses the set() built-in function to convert that to a set. As you said, this removes the duplicate digits since you want to count each digit only once per item. The Counter object then looks at these digits and counts their occurrences. The code then uses Counter's most_common method to choose the digit that occurs most (the (1) parameter returns only the single most popular digit in a list, and the 0 index takes that digit and its count out of the list) then takes the count of that digit (that's the 1 index). That count is then returned to the caller.
If you are not familiar with Counter or with generator expressions, you could do the counting yourself and use regular for loops. But this code is short and fairly clear to anyone who knows the Counter object. You could use the line in the comment to replace the following four lines, if you want brief code, but I expanded out the code to make it more clear.
from collections import Counter

def solve(alist):
    digitscount = Counter(digit for val in alist for digit in set(str(abs(val))))
    # return digitscount.most_common(1)[0][1]
    most_common_list = digitscount.most_common(1)
    most_common_item = most_common_list[0]
    most_common_count = most_common_item[1]
    return most_common_count

alist = list(map(int, input().strip().split()))
print(solve(alist))
For your example input 12 11 3 4 5, this prints the correct answer 2. Note that my code will give an error if the input is empty or contains a non-integer. This version of my code takes the absolute value of the list values, which prevents a minus (or negative) sign from being counted as a digit.
Here's my own implementation of this:
def solve(list_of_numbers):
    counts = {str(i): 0 for i in range(10)}  # make a dict of running counts for digits 0 through 9
    longest_sequence = 0                     # store the length of the longest sequence so far
    for num in list_of_numbers:              # iterate through our list of numbers
        num_str = str(num)                   # convert number to a string (this makes "is digit present?" easier)
        for digit in counts:                 # evaluate whether each digit is present
            if digit in num_str:             # if the digit is present in this number...
                counts[digit] += 1           # ...then increment the running count of this digit
            else:                            # otherwise, we've broken the sequence...
                counts[digit] = 0            # ...so we reset the running count to 0.
            if counts[digit] > longest_sequence:   # check if we've found a new longest sequence...
                longest_sequence = counts[digit]   # ...and if so, record its length
    return longest_sequence                  # finally, return the length of the longest running sequence.
It uses a dict to store the running counts of each digit's consecutive occurrences - for every number, the count is incremented if the digit is present, and reset to 0 if it isn't. The longest sequence's length so far is saved in its own variable for storage.
There are a few details that your implementation, I think, is overlooking:
Your code might be returning the digit that appears the most, rather than the greatest number of appearances. I'm not sure, because your code is kind of hard for me to parse, and you only gave the one test example.
Try to stay away from using single-letter variable names, if you can. Notice how much more clear my code above is, since I've used full names (and, at worst, clear abbreviations like ct for "count"). This makes it much easier to debug your own code.
I see what you're doing to find the digits that are present in the number, but that's a bit more verbose than the situation requires. A simpler solution, which I used, was to simply cast the number into a string and use each individual digit's character instead of its value. In your implementation, you could do something like this: c = set(int(digit) for digit in str(j))
You don't seem to have anything in your code that detects when a digit is no longer present, which could lead to an incorrect result.
I'm having trouble understanding the original problem, but I think what you need to do is cast each item as a string if it is an int and then split out each digit.
digits = {}
for item in thelist:
    digits[item] = []
    for digit in str(item):
        digits[item].append(int(digit))
If your test case is 12 11 3 4 5, then this would produce a dictionary of {12: [1, 2], 11: [1, 1], etc}.
I have a long string, let us say astr = "I am a very long string and I could contain a lot of text, so think of efficiency here". I also have a list alist = ["I", "am a", "list", "of strings", "and each string", "could be made up of many words", "so think of efficiency here"]. Now, my list of strings also has a corresponding list of integers alist_ofints = [1, 2, 3, 4, 5, 6, 7] that represents how many points each string in this list equals.
I am supposed to create a function that finds how many of the words in astr appear in the list alist, and create a "points" counter using the corresponding points list alist_ofints. So, in this example, the words "I", "am a", and "so think of efficiency here" appear twice, once, and once respectively. That would give us 1*2 + 2*1 + 7*1 = 11 points.
I have come up with two naive solutions. The first is to create a function that looks into this list of strings alist and checks to see if each item is in astr, and if it is, applies the obvious following logic. This is inefficient because I will be looking into astr len(alist) times. That is a waste, isn't it? It is clean and to the point, but inefficient.
The second solution was to make astr a list of words, and I would check each word at index i through to index j, where i is where I am in the list and j is the length of the phrase in alist that I am looking for. So, "am a" is a phrase of length 2 (since it has two words in it), so I would look at i = some number, and j = some number + 1. If I am looking for the phrase "and each string", i = some number, j = some number + 3. So I am looking at three words, when testing for this phrase. Now, I think this also has the same time complexity. Although I am not looping through the astr list once, I am looping through my list of words alist len(list(astr)) times. Also, I have to create a list of astr, which adds some complexity, I imagine.
So, I like the first solution better so far, because it's the easiest, simplest, and cleanest. Is there a better way to do this? Extra points if you can find a list comprehension way...
Thank you
NOTE: I know list(astr) will not return the list of words. Imagine that for this example, it does.
TLDR: I have two lists. I need to check if each element in a list is equal to an element in the other list and create a count of how many times they appear. Is there a more efficient way of doing this than to check every element in list 1 against every other element in list 2 (I think this is O(n^2))?
A more efficient algorithm could index the long string astr using a string index (e.g., a Suffix Array). Then you search for each entry in alist in the index and increment the points accordingly when you find results.
The runtime for indexing astr is then O(n) where n is the length of astr.
Searching an entry from alist of length m in the index is in O(log n)
Overall you should get away with O(p log n) where p is the number of entries in alist.
Example
Lets consider the long string astr to be
I am a very long string
then the corresponding suffix array (all lower-case) will be
SA = [1 4 6 11 16 5 2 8 22 15 0 20 12 3 21 14 13 19 9 17 18 7 10]
these are all suffixes of astr (represented by their starting index) sorted lexicographically. For example, SA[9] = 15 represents the string in astr starting at position 15 ("g string").
Now lets assume your list of phrases
alist = ["I am", "very long",...]
then for each of the entries you want to search the occurrences in the suffix array. This is done using binary search on the suffix array. For "I am" this will look as follows:
First you look at the middle entry of the suffix array (SA[11] = 20). Then you look at the suffix represented by that index ("ing"). Since this suffix is larger than your search phrase "I am" you want to look in the left half of your suffix array. Continue this binary search until you have found the phrase or you are sure it's not in there.
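A tiny sketch of that search in Python, using a naively built suffix array and the bisect module (materializing the suffixes keeps the sketch short, but it would be far too memory-hungry for a truly huge astr):
import bisect

astr = "I am a very long string".lower()
sa = sorted(range(len(astr)), key=lambda i: astr[i:])   # naive suffix array
suffixes = [astr[i:] for i in sa]

def count_occurrences(phrase):
    lo = bisect.bisect_left(suffixes, phrase)
    hi = bisect.bisect_left(suffixes, phrase + '\uffff')  # assumes ordinary text below U+FFFF
    return hi - lo

print(count_occurrences("a very"))  # 1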
I have written this line that seems to do exactly what you want:
print(sum(astr.count(s) * i for (s, i) in zip(alist, alist_ofints)))
It is more like your first approach, but I don't find it that inefficient.
One thing you should note, though, is that astr.count(s) only finds the number of non-overlapping occurrences of s in astr.
You can build a Trie data structure for the list of words with the end nodes containing the index of the points array.
From Wikipedia: a trie structure for the input ["A", "to", "tea", "ted", "ten", "i", "in", "inn"] would look like this:
[Trie example diagram from Wikipedia, by Booyabazooka (based on a PNG image by Deco), modifications by Superm401; public domain.]
So we can run through the entire length of the input string, and every time we encounter an end-of-word node we add its points and move on.
So the search for the entire word can be done in linear time.
But in the case of overlapping list items like ["ab", "cd", "abcd"] with points [3, 4, 1], where the word is abcd, we will not be able to get a linear-time solution after preprocessing, because every time we encounter an end-of-word node the maximum points can come from either:
extending the word matched so far and looking further ahead, or
starting to look for the remaining string as an individual word from the list.
Time and space complexity to build the Trie structure: O(w * m), where w is the number of words and m is the maximum size of a word in the list.
Search can be done in O(m) where m is the length of the word being searched for.
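Here is a rough sketch of that idea in Python (my own illustration, not the answer's code; the '$points' key is just an arbitrary marker for an end-of-phrase node), using the question's data:
def build_trie(phrases, points):
    root = {}
    for phrase, pts in zip(phrases, points):
        node = root
        for word in phrase.split():
            node = node.setdefault(word, {})
        node['$points'] = pts  # marker: this node ends a phrase worth pts points
    return root

def score(text, root):
    words = text.split()
    total = 0
    for start in range(len(words)):  # try to match phrases starting at every word
        node = root
        for word in words[start:]:
            if word not in node:
                break
            node = node[word]
            total += node.get('$points', 0)
    return total

alist = ["I", "am a", "list", "of strings", "and each string",
         "could be made up of many words", "so think of efficiency here"]
alist_ofints = [1, 2, 3, 4, 5, 6, 7]
astr = "I am a very long string and I could contain a lot of text, so think of efficiency here"
print(score(astr, build_trie(alist, alist_ofints)))  # 1*2 + 2*1 + 7*1 = 11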
(I think this is similar to thebenman's answer.) Depending on the kinds of overlaps in alist, you might get away with turning alist into a dictionary (or a nested dictionary, i.e., a tree):
{
    "I": [(None, 1)],
    "am": [("a", 2)],
    "list": [(None, 3)],
    "of": [("strings", 4)],
    "and": [("each", 0), ("string", 5)],
    "could": [("be", 0), ("made", 0), ..., ("words", 6)],
    "so": [("think", 0), ("of", 0), ..., ("here", 7)]
}
Now we could traverse astr once as words without indexing it, keeping a reference to and updating all currently open, accumulating matches.
You could also generate all possible contiguous word subsequences, put them in a Counter, and then the lookup time would be almost O(1).
This would take more memory to generate the dictionary (or index), but it would be more efficient in case you need to lookup the same long string multiple times.
Something like this:
from collections import Counter

def get_all_counts(input_string):
    cnt = Counter()
    s = input_string.split()
    for i in range(0, len(s)):
        current_subsequence = ''
        for j in range(i, len(s)):
            current_subsequence += ' ' + s[j]
            cnt[current_subsequence.strip()] += 1  # I've put 1 here, but you could easily replace it with a lookup of your "points"
    return cnt
counts = get_all_counts(
    'I am a very long string and I could contain a lot of text, so think of efficiency here')
print(counts['am'])
print(counts['of'])
Probably using itertools would be better, but you should get the idea.
Another advantage of this is that you could turn this into a pandas dataframe and do queries on it.
For example something like:
import pandas as pd

df = pd.DataFrame.from_dict(counts, orient='index').reset_index()
print(df[df[0] > 1])
would give you all the substrings with an occurrence greater than 1.
I am given a sequence of letters and have to produce all the N-length anagrams of the sequence given, where N is the length of the sequence.
I am following a kinda naive approach in python, where I am taking all the permutations in order to achieve that. I have found some similar threads like this one but I would prefer a math-oriented approach in Python. So what would be a more performant alternative to permutations? Is there anything particularly wrong in my attempt below?
from itertools import permutations

def find_all_anagrams(word):
    pp = permutations(word)
    perm_set = set()
    for i in pp:
        perm_set.add(i)
    ll = [list(i) for i in perm_set]
    ll.sort()
    print(ll)
If there are lots of repeated letters, the key will be to produce each anagram only once instead of producing all possible permutations and eliminating duplicates.
Here's one possible algorithm which only produces each anagram once:
from collections import Counter

def perm(unplaced, prefix):
    if unplaced:
        for element in unplaced:
            yield from perm(unplaced - Counter(element), prefix + element)
    else:
        yield prefix

def permutations(iterable):
    yield from perm(Counter(iterable), "")
That's actually not much different from the classic recursion to produce all permutations; the only difference is that it uses a collections.Counter (a multiset) to hold the as-yet-unplaced elements instead of just using a list.
The number of Counter objects produced in the course of the iteration is certainly excessive, and there is almost certainly a faster way of writing that; I chose this version for its simplicity and (hopefully) its clarity.
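For example, with the definitions above, "mississippi" yields each of its distinct anagrams exactly once:
print(len(list(permutations("mississippi"))))  # 34650, not 39916800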
This is very slow for long words with many similar characters. Slow compared to the theoretical maximum performance, that is. For example, permutations("mississippi") will produce a much longer list than necessary. It will have a length of 39916800, but the set has a size of 34650.
>>> len(list(permutations("mississippi")))
39916800
>>> len(set(permutations("mississippi")))
34650
So the big flaw with your method is that you generate ALL anagrams and then remove the duplicates. Use a method that only generates the unique anagrams.
EDIT:
Here is some working, but extremely ugly and possibly buggy code. I'm making it nicer as you're reading this. It does give 34650 for mississippi, so I assume there aren't any major bugs. Warning again. UGLY!
# Returns a dictionary with letter counts
# get_letter_list("mississippi") returns
# {'i': 4, 'm': 1, 'p': 2, 's': 4}
def get_letter_list(word):
    w = sorted(word)
    dd = {}
    dd[w[0]] = 1
    for l in range(1, len(w)):
        if w[l] == w[l-1]:
            dd[w[l]] = dd[w[l]] + 1
        else:
            dd[w[l]] = 1
    return dd
def sum_dict(d):
    s = 0
    for x in d:
        s = s + d[x]
    return s
# Recursively create the anagrams. It takes a letter list
# from the above function as an argument.
def create_anagrams(dd):
    if sum_dict(dd) == 1:  # If there's only one letter left
        for l in dd:
            return l  # Ugly hack, because I'm not used to dicts
    a = []
    for l in dd:
        if dd[l] != 0:
            newdd = dict(dd)
            newdd[l] = newdd[l] - 1
            if newdd[l] == 0:
                newdd.pop(l)
            newl = create_anagrams(newdd)
            for x in newl:
                a.append(str(l) + str(x))
    return a
>>> print (len(create_anagrams(get_letter_list("mississippi"))))
34650
It works like this: for every unique letter l, create all unique permutations with one fewer occurrence of the letter l, and then prepend l to all these permutations.
For "mississippi", this is way faster than set(permutations(word)) and it's far from optimally written. For instance, dictionaries are quite slow and there's probably lots of things to improve in this code, but it shows that the algorithm itself is much faster than your approach.
Maybe I am missing something, but why don't you just do this:
from itertools import permutations

def find_all_anagrams(word):
    return sorted(set(permutations(word)))
You could simplify to:
from itertools import permutations

def find_all_anagrams(word):
    word = set(''.join(sorted(word)))
    return list(permutations(word))
In the docs for permutations the code is detailed and it seems already optimized.
I don't know Python, but I want to try to help you. There are probably many other, more performant algorithms, but I've thought about this one: it's completely recursive and it should cover all the cases of a permutation. I want to start with a basic example:
permutation of ABC
Now, this algorithm works in this way: for length times you shift the letters, with the letter that falls off one end wrapping around to the other (you could easily do this with a queue).
Back to the example, we will have:
ABC
BCA
CAB
Now you repeat the first (and only) step with the substring built from the second letter to the last one.
Unfortunately, with this algorithm you cannot consider permutation with repetition.
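For what it's worth, here is a rough Python translation of that rotation idea (my own sketch, not the answerer's code); as noted, it produces all n! permutations and does not handle repeated letters specially:
def rotations(s):
    # every rotation of s (ABC -> ABC, BCA, CAB)
    return [s[i:] + s[:i] for i in range(len(s))]

def permute(s):
    if len(s) <= 1:
        return [s]
    result = []
    for r in rotations(s):
        # fix the first letter of this rotation, recurse on the rest
        for tail in permute(r[1:]):
            result.append(r[0] + tail)
    return result

print(permute("ABC"))
# ['ABC', 'ACB', 'BCA', 'BAC', 'CAB', 'CBA']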
A code I was recently working on was found to be using around 200MB of memory to run, and I'm stumped as to why it would need that much.
Basically it mapped a text file onto a list where each character in the file was its own list containing the character and how often it has shown up so far (starting from zero) as its two items.
So 'abbac...' would be [['a','0'],['b','0'],['b','1'],['a','1'],['c','0'],...]
For a text file 1 million characters long, it used 200MB.
Is this reasonable or was it something else my code was doing? If it is reasonable, was it because of the high number of lists? Would [a,0,b,0,b,1,a,1,c,0...] take up substantially less space?
If you do not need the list itself, then I fully subscribe to @Lattyware's solution of using a generator.
However, if that's not an option then perhaps you could compress the data in your list without loss of information by storing only the positions for each character in the file.
import random
import string

def track_char(s):
    # Make sure all characters have the same case
    s = s.lower()
    d = dict((k, []) for k in set(s))
    for position, char in enumerate(s):
        d[char].append(position)
    return d
st = ''.join(random.choice(string.ascii_uppercase) for _ in range(50000))
d = track_char(st)
len(d["a"])

# Total number of occurrences of the character at position 2
for char, vals in d.items():
    if 2 in vals:
        print("Character %s has %s occurrences" % (char, len(d[char])))

Character C has 1878 occurrences

# Number of occurrences of the character at position 2 so far
for char, vals in d.items():
    if 2 in vals:
        print("Character %s has %s occurrences so far" % (char, len([x for x in d[char] if x <= 2])))

Character C has 1 occurrences so far
This way there is no need to duplicate the character string each time there is an occurrence, and you maintain the information of all their occurrences.
To compare the object size of your original list or this approach, here's a test
import random
import string
from sys import getsizeof

# random generation of a string with 50k characters
st = ''.join(random.choice(string.ascii_uppercase) for _ in range(50000))

# Function that returns the original list for this string
def original_track(s):
    l = []
    for position, char in enumerate(s):
        l.append([char, position])
    return l

# Testing sizes
original_list = original_track(st)
dict_format = track_char(st)

getsizeof(original_list)
406496
getsizeof(dict_format)
1632
As you can see, the dict_format is roughly 250 times smaller. However, this difference in size should be even more pronounced for larger strings.
When it comes to memory use and lists, one of the best ways to reduce memory usage is to avoid lists altogether - Python has great support for iterators in the form of generators. If you can produce a generator instead of constructing a list, you should be able to do something like this with very little memory usage. Of course, it depends what you are doing with the data afterwards (say you are writing this structure out to a file - you could do so piece by piece and not store the entire thing at once).
from collections import Counter

def charactersWithCounts(data):
    # yields (character, how many times it has been seen so far)
    seen = Counter()
    for character in data:
        yield (character, seen[character])
        seen[character] += 1
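A small usage sketch (the sample data and file name are placeholders of my own), streaming the pairs straight to a file rather than keeping them in a list:
data = 'abbac'
with open("counts.txt", "w") as f:
    for character, count in charactersWithCounts(data):
        f.write("%s %s\n" % (character, count))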