Insert items into a list based on their occurrence - python

Say I am continuously generating new data (e.g. integers) and want to collect them in a list.
import random

lst = []
for _ in range(50):
    num = random.randint(0, 10)
    lst.append(num)
When a new value is generated, I want it to be positioned in the list based on the count of occurrences of that value, so data with lower "current occurrence" should be placed before those with higher "current occurrence".
"Current occurrence" means "the number of duplicates of that data that have already been collected so far, up to this iteration". For the data that have the same occurrence, they should then follow the order in which they are generated.
For example, if at iteration 10 the current list is [1,2,3,4,2,3,4,3,4] and a new value 1 is generated, it should be inserted at index 7, resulting in [1,2,3,4,2,3,4,1,3,4]. Because this is the second occurrence of 1, it must come after all the values that have occurred only once, and also after the existing second occurrences of 2, 3 and 4 (hence preserving the order of generation), but before any third occurrences.
This is my current code that can rearrange the list:
from collections import defaultdict

def rearrange(lst):
    d = defaultdict(list)
    count = defaultdict(int)
    for x in lst:
        count[x] += 1
        d[count[x]].append(x)
    res = []
    for k in sorted(d.keys()):
        res += d[k]
    return res
lst = rearrange(lst)
However, this rearranges the list after the fact, which is not what I want.
I wrote a separate algorithm that keeps generating new data until some convergence criterion is met, so the list can become extremely large.
Therefore I want to rearrange my generated values on the fly, i.e. constantly insert data into the list "in place". Of course I could call my rearrange function in each iteration, but that would be super inefficient. What I want is to insert new data at the correct position in the list, not to replace the list with a new one in every iteration.
Any suggestions?
Edit: the data structure doesn't necessarily need to be a list, but it has to be ordered, and it shouldn't require an auxiliary data structure to hold extra information.

The data structure that I think might work better for your purpose is a forest (in this case, a disjoint union of lists).
In summary, you keep one internal list per occurrence count. When a new value arrives, you append it to the list one level after the one you appended it to the last time that value appeared.
To keep track of the occurrence counts, you can use the built-in Counter.
Here is a sample implementation:
from collections import Counter

def rearranged(iterable):
    forest, counter = list(), Counter()
    for x in iterable:
        c = counter[x]
        if c == len(forest):
            forest.append([x])
        else:
            forest[c] += [x]
        counter[x] += 1
    return [x for lst in forest for x in lst]
rearranged([1,2,3,4,2,3,4,3,4,1])
# [1, 2, 3, 4, 2, 3, 4, 1, 3, 4]
For this to work better, your input iterable should be a generator (so the items can be generated on the fly).
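For instance, a minimal sketch of wiring a data source straight in (the generate helper here is hypothetical, standing in for whatever produces your values):

import random

def generate(n):
    # Hypothetical data source: yields one value at a time
    # instead of building a full list first.
    for _ in range(n):
        yield random.randint(0, 10)

lst = rearranged(generate(50))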

Related

Dictionary comprehension Bigram

I have a Python assignment to extract bigrams from a string into a dictionary. I think I found the solution online but can't remember where. It seems to work, but I am having trouble understanding it as I am new to Python. Can anyone explain the code below, which takes a string, extracts pairs of characters, counts instances and puts them into a dictionary?
s = 'Mississippi'  # Your string

# Dictionary comprehension
dic_ = {k : s.count(k) for k in {s[i]+s[i+1] for i in range(len(s)-1)}}
First let's understand comprehensions:
A list, dict, set, etc. can be made with a comprehension. Basically, a comprehension takes a generator expression and uses it to build a new collection. A generator is just an object that produces one value per iteration, so, using list as an example: to make a list with a list comprehension we take the values the generator yields and put each into its own spot in a list. Take this generator expression for example:
x for x in range(0, 10)
This will just give 0 on the first iteration, then 1, then 2, and so on. To make this a list we would use [] (list brackets) like so:
[x for x in range(0, 10)]
This would give:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] #note: range does not include the second input
For a dictionary and for a set we use {}, but since dictionaries use key-value pairs, our expression will be different for dictionaries than for sets. For a set it is the same as for a list:
{x for x in range(0, 10)} #gives the set --> {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
but for a dictionary we need a key and a value. Since enumerate gives two items this could be useful for dictionaries in some cases:
{key: value for key, value in enumerate([1,2,3])}
In this case the keys are the indexes and the values are the items in the list. So this gives:
{0: 1, 1: 2, 2: 3} #dictionary
It doesn't make a set because we denote x : y which is the format for items in a dictionary, not a set.
Now, let's break this down:
This part of the code:
{s[i]+s[i+1] for i in range(len(s)-1)}
is making a set of values that is every pair of touching letters: s[i] is one letter, s[i+1] is the letter after it, so it says "take this pair (s[i]+s[i+1])" and does so for every index in the string (for i in range(len(s)-1)). Notice the -1: the last letter has no letter after it, so we don't run it for the last index.
Now that we have a set let's save it to a variable so it's easier to see:
setOfPairs = {s[i]+s[i+1] for i in range(len(s)-1)}
Then our original comprehension would change to:
{k : s.count(k) for k in setOfPairs}
This says we want to make a dictionary with keys of k and values of s.count(k). Since we get every k from our set of pairs (for k in setOfPairs), the keys of the dictionary are the pairs. And since s.count(k) returns the number of times k appears in s, the values of the dictionary are the number of times each key appears in s.
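Running the whole comprehension on the example string shows the result (key order may vary, since the inner set is unordered):

s = 'Mississippi'
dic_ = {k: s.count(k) for k in {s[i] + s[i+1] for i in range(len(s)-1)}}
print(dic_)
# {'Mi': 1, 'is': 2, 'ss': 2, 'si': 2, 'ip': 1, 'pp': 1, 'pi': 1}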
Let's take this apart one step at a time:
1. s[i] is the code to select the i-th letter in the string s.
2. s[i]+s[i+1] concatenates the letter at position i and the letter at position i+1.
3. s[i]+s[i+1] for i in range(len(s)-1) iterates over each index i (except the last one) and so computes all the bigrams.
4. Since the expression in 3 is surrounded by curly brackets, the result is a set, meaning that all duplicate bigrams are removed.
5. for k in {s[i]+s[i+1] for i in range(len(s)-1)} therefore iterates over all unique bigrams in the given string s.
6. Lastly, {k : s.count(k) for k in {s[i]+s[i+1] for i in range(len(s)-1)}} maps each bigram k to the number of times it appears in s, because str.count returns the number of times a substring appears in a string.
I hope that helps. If you want to know more about list/set/dict comprehensions in Python, the relevant entry in the Python documentation is here: https://docs.python.org/3/tutorial/datastructures.html?highlight=comprehension#list-comprehensions
dic_ = {k : s.count(k) for k in {s[i]+s[i+1] for i in range(len(s)-1)}}
Read it backwards:
dic_ = {k : s.count(k)
# Step 3: with each pair of characters, count how many occur in the string,
# and store the 2-char string as the key with the count as the value
# in the dictionary.
for k in {s[i]+s[i+1]
# Step 2: from each position, take 2 characters out of the string.
for i in range(len(s)-1)}}
# Step 1: loop over all but the last character of the string.
The code may be inefficient for long strings with many repetitions: s.count(k) rescans the entire string once for every unique bigram. Refactoring to count everything in a single pass may speed it up; benchmark it on, say, a billion-base-pair DNA sequence. A single-pass version is sketched below.
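Here is such a sketch using collections.Counter (note it also counts overlapping occurrences, which str.count does not, so results can differ on inputs like 'aaa'):

from collections import Counter

s = 'Mississippi'
# Pair each character with its successor and tally the pairs in one pass.
bigram_counts = Counter(a + b for a, b in zip(s, s[1:]))
print(bigram_counts)
# Counter({'is': 2, 'ss': 2, 'si': 2, 'Mi': 1, 'ip': 1, 'pp': 1, 'pi': 1})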

How to know when to refer an item in a for loop by itself vs. referring to an item as an index of the list

In a for loop, I'm trying to understand when to refer to an item by its item name and when to refer to the item as an index of the list I'm looping through.
In the code pasted below, I don't understand why "idx" is referred to in the "if" statement with a reference to the list index but then in the definition of maximum_score_index, it is referred to by itself.
def linear_search(search_list):
    maximum_score_index = None
    for idx in range(len(search_list)):
        if not maximum_score_index or search_list[idx] > search_list[maximum_score_index]:
            maximum_score_index = idx
    return maximum_score_index
I'd love to have an explanation so I can differentiate in the future and some examples to show the difference so I can understand.
In Python, range(num) (more or less) returns a list of numbers from 0 through num - 1. It follows that range(len(my_list)) will generate a list of numbers from 0 through the length of my_list minus one. This is frequently useful, because the generated numbers are the indices of each item in my_list (Python lists start counting at 0). For example, range(len(["a", "b", "c"])) is [0, 1, 2], the indices needed to access each item in the original list. ["a", "b", "c"][0] is "a", and so on.
In Python, the for x in mylist loop iterates through each item in mylist, setting x to the value of each item in order. One common pattern for Python for loops is for x in range(len(my_list)). This is useful because you loop through the indices of each list item instead of the values themselves. It's almost as easy to access the values (just use my_list[x]), and it's much easier to do things like access the preceding value (just use my_list[x-1], much simpler than it would be without the index!).
In your example, idx is tracking the index of each list item as the program iterates through search_list. In order to retrieve values from search_list, the program uses search_list[idx], much like I used my_list[x] in my example. The code then assigns maximum_score_index to the index itself, a number like 0, 1, or 2, rather than the value. It's still easy to find out what the maximum score is, with search_list[maximum_score_index]. The reason idx is not being used as a list accessor in the second case is because the program is storing the index itself, not the value of the array at that index.
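A tiny side-by-side (an illustrative list, not taken from the question) makes the index/value distinction concrete:

scores = [10, 15, 5]
for idx in range(len(scores)):
    # idx is the position; scores[idx] is the value stored there
    print(idx, scores[idx])
# 0 10
# 1 15
# 2 5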
Basically, this line:
if not maximum_score_index or search_list[idx] > search_list[maximum_score_index]:
    maximum_score_index = idx
Can be thought of as:
if (this is the first pass) or (this element > element at the stored index):
    keep this index as the index of the largest element
What I recommend doing:
- Go through the code on a piece of paper and iterate over a list to see what the code does
- Write the code in any IDE, and use a debugger to see what the code does
Are you looking for the index of the highest element in the list or the value?
If you are looking for the value, it can be as simple as:
highest = max(search_list)
You could also use enumerate, which will grant you "free" access to both the current index and its value in the loop:
>>> search_list
[10, 15, 5, 3]
>>> maximum_score_index = None
>>> for idx, value in enumerate(search_list):
...     if maximum_score_index is None or value > search_list[maximum_score_index]:
...         maximum_score_index = idx
...
>>> maximum_score_index
1
>>> search_list[maximum_score_index]
15
(Note the is None test: not maximum_score_index would also be true for index 0, which is falsy.)
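For completeness, a loop-free sketch of the same search: max can compare the (index, value) pairs produced by enumerate on their value part:

search_list = [10, 15, 5, 3]
# key=lambda pair: pair[1] makes max compare values; [0] extracts the index
maximum_score_index = max(enumerate(search_list), key=lambda pair: pair[1])[0]
print(maximum_score_index)  # 1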

Need help speeding up this function

Input: A list of lists of various positions.
[['61097', '12204947'],
['61097', '239293'],
['61794', '37020977'],
['61794', '63243'],
['63243', '5380636']]
Output: A sorted list containing the count of unique numbers for each (merged) list.
[4, 3, 3, 3, 3]
The idea is fairly simple, I have a list of lists where each list contains a variable number of positions (in our example there is only 2 in each list, but lists of up to 10 exist). I want to loop through each list and if there exists ANY other list that contains the same number then that list gets appended to the original list.
Example: Taking the input data from above and using the following code:
import itertools
from IPython.display import clear_output, display  # clear_output/display are IPython helpers

def gen_haplotype_blocks(df):
    counts = []
    for i in range(len(df)):
        my_list = [item for item in df if any(x in item for x in df[i])]
        my_list = list(itertools.chain.from_iterable(my_list))
        uniq_counts = len(set(my_list))
        counts.append(uniq_counts)
        clear_output()
        display('Currently Running ' + str(i))
    return sorted(counts, reverse=True)
I get the output that is expected. In this case, when I loop through the first list ['61097', '12204947'] I find that the second list ['61097', '239293'] also contains '61097', so these two lists get concatenated together to form ['61097', '12204947', '61097', '239293']. This is done for every single list, outputting the following:
['61097', '12204947', '61097', '239293']
['61097', '12204947', '61097', '239293']
['61794', '37020977', '61794', '63243']
['61794', '37020977', '61794', '63243', '63243', '5380636']
['61794', '63243', '63243', '5380636']
Once this list is complete, I then count the number of unique values in each list, append that to another list, then sort the final list and return that.
So in the case of ['61097', '12204947', '61097', '239293'], we have two '61097', one '12204947' and one '239293', which makes 3 unique numbers.
While my code works, it is VERY slow: after running for nearly two hours it is still only on line ~44k.
I am looking for a way to speed up this function considerably. Preferably without changing the original data structure. I am very new to python.
Thanks in advance!
To considerably improve the speed of your program, especially for larger data sets, the key is to use a hash table, or dictionary in Python terms: store each distinct number as a key, with the list of lines that number appears on as the value. Then, in a second pass, merge the lists for each line based on the dictionary and count the unique elements.
def gen_haplotype_blocks(input):
    # First pass: map each number to the indices of the lines it appears on.
    unique_numbers = {}
    for i, numbers in enumerate(input):
        for number in numbers:
            if number in unique_numbers:
                unique_numbers[number].append(i)
            else:
                unique_numbers[number] = [i]
    # Second pass: for each line, merge in every line that shares a number.
    output = [[] for _ in range(len(input))]
    for i, numbers in enumerate(input):
        for number in numbers:
            for line in unique_numbers[number]:
                output[i] += input[line]
    counts = [len(set(x)) for x in output]
    return sorted(counts, reverse=True)
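For example, running this on the sample input from the question (bound to a name here just for illustration):

positions = [['61097', '12204947'],
             ['61097', '239293'],
             ['61794', '37020977'],
             ['61794', '63243'],
             ['63243', '5380636']]

print(gen_haplotype_blocks(positions))  # [4, 3, 3, 3, 3]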
In theory, the time complexity of your algorithm is O(N*N), where N is the size of the input list, because you compare each list with every other list. In this approach the complexity is closer to O(N) (assuming few lines share each number), which should be considerably faster for a larger data set. The trade-off is extra space complexity.
Not sure how much you expect by saying "considerably", but converting your inner lists to sets from the beginning should speed things up. The following runs approximately 2.5x faster in my testing:
def gen_haplotype_blocks_improved(df):
    df_set = [set(d) for d in df]
    counts = []
    for d1 in df_set:
        row = d1
        for d2 in df_set:
            if d1.intersection(d2) and d1 != d2:
                row = row.union(d2)
        counts.append(len(row))
    return sorted(counts, reverse=True)
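As a quick sanity check, the same positions list as above gives the identical result:

print(gen_haplotype_blocks_improved(positions))  # [4, 3, 3, 3, 3]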

How do I initialize and fill a list of lists in Python?

What I'm trying to do is sort word objects (which consist of a scanned word, its alphabetized version, and its length) into lists by their length. So, I initialized a list of length 0 and I am extending it as I go through my input file. What I want is to have lists within a list, such that results[5] contains the list of words of length 5. How do I do that?
I first initialize my list as follows:
results = []
I then scan through the input file line by line, creating temp objects, and I want them to be placed into their appropriate lists:
try:  # check if there exists an array for that length
    results[lineLength]
except IndexError:  # if it doesn't, create it up to that length
    # Grow the list so that the new highest index is len(word)
    difference = len(results) - lineLength
    results.extend([] for _ in range(difference))
finally:
    results[lineLength].append(tempWordObject)
I feel at least one of the following needs to be edited
(1) The way I initialize the results list
(2) The way I append objects to the list
(3) The way I'm extending the list (though I think that part is right)
I am using Python 3.4.
EDIT:
from sys import argv

main, filename = argv
file = open(filename)
for line in file:  # go through the file
    if line == '\n':  # if the line is empty (aka end of file), exit loop
        break
    lineLength = len(line) - 1  # get the line length
    line = line.strip('\r\n')
    if lineLength > maxL:  # keeps track of length of longest word encountered
        maxL = lineLength
    # note: I've written a mergesort algorithm in a separate area in the code and it works
    tempAZ = mergesort(line)  # mergesort the word into alphabetical order
    tempAZ = ''.join(tempAZ)  # merges the chars back together to form a string
    tempWordObject = word(line, tempAZ, lineLength)  # creates a new word object
    try:  # check if there exists an array for that length
        results[lineLength]
    except IndexError:  # if it doesn't, create it up to that length
        # Grow the list so that the new highest index is len(word)
        difference = len(results) - lineLength
        results.extend([] for _ in range(difference))
        print("lineLength: ", lineLength, " difference:", difference)
    finally:
        results[lineLength].append(tempWordObject)
EDIT:
This is my word class:
class word(object):  # object class
    def __init__(self, originalWord=None, azWord=None, wLength=None):
        self.originalWord = originalWord
        self.azWord = azWord
        self.wLength = wLength
EDIT:
Here is a clarification of what I'm trying to achieve: As I'm iterating through a list (of unknown length) of words (also of unknown length), I am creating word objects that include the word, its alphabetized version, and its length (e.g. dog, dgo, 3). As I'm going through that list, I want all objects to go into a list that is within another list (results[]), indexed by the word's length. If results[] does not contain such an index (e.g. 3), I want to extend results[] and start a list in results[3] that contains the word object (dog, dgo, 3). At the end, results[] should contain lists of words indexed by their length.
Rather than a list, you could have a dictionary:
d = {}
here the key would be length and the value a list of words:
if linelength not in d:
    d[linelength] = []
d[linelength].append(tempWordObject)
You can simplify further with d = collections.defaultdict(list).
Your difference is negative; you need to subtract the other way round. You'll also need to add one extra, since indexing starts at 0:
difference = lineLength - len(results) + 1
Turns out it's usually easier to use a defaultdict for this
eg:
from collections import defaultdict

D = defaultdict(list)
for tempWordObject in the_file:
    D[len(tempWordObject)].append(tempWordObject)
Three notes on your questions.
Nested list initialization
You mention it in your question title, although you might not need it in the end. One simple way to do this is to use two nested list comprehensions:
import pprint
m, n = 3, 4 # 2D: 3 rows, 4 columns
lol = [[(j, i) for i in range(n)] for j in range(m)]
pprint.pprint(lol)
# [[(0, 0), (0, 1), (0, 2), (0, 3)],
# [(1, 0), (1, 1), (1, 2), (1, 3)],
# [(2, 0), (2, 1), (2, 2), (2, 3)]]
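One classic pitfall worth flagging here (an aside, not something from the question): building a nested list with * copies references, so every row is the same inner list; the nested comprehension above avoids that:

rows = [[0] * 3] * 2   # both rows reference the same inner list
rows[0][0] = 99
print(rows)            # [[99, 0, 0], [99, 0, 0]]

safe = [[0] * 3 for _ in range(2)]  # each row is a fresh list
safe[0][0] = 99
print(safe)            # [[99, 0, 0], [0, 0, 0]]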
Using some default data structure
As others pointed out, you could use a dictionary. In particular, a collections.defaultdict will give you initialization-on-demand:
import collections

dd = collections.defaultdict(list)
for value in range(10):
    dd[value % 3].append(value)

pprint.pprint(dd)
# defaultdict(<class 'list'>, {0: [0, 3, 6, 9], 1: [1, 4, 7], 2: [2, 5, 8]})
Comparing custom objects
The built-in sorted function takes a keyword argument key that can be used to compare custom objects that do not themselves provide sorting hooks:
class Thing:
    def __init__(self, word):
        self.word = word
        self.length = len(word)

    def __repr__(self):
        return '<Word %s>' % self.word

things = [Thing('the'), Thing('me'), Thing('them'), Thing('anybody')]
print(sorted(things, key=lambda obj: obj.length))
# [<Word me>, <Word the>, <Word them>, <Word anybody>]
If you're set on using a list (which may not be the best choice), I think it would be easier and clearer to create the list as big as it needs to be from the get-go. That is to say, if the longest word is 5 characters long, you start by creating this list:
output = [None, [], [], [], [], []]
This has the advantage that you won't have to worry about catching exceptions as you go but it does require that you know all your words before you start. Since you created an object class to store all this, I'm assuming you're actually storing all this so it shouldn't be an issue.
You'll always need the None at the beginning so the indices match up. Once you have this you can iterate through your list of words and simply append it to the appropriate list as you already do.
for word in wordlist:
    output[len(word)].append(word)
So specifically for you, what I would do is: instead of storing tempWordObject, I'd build a list (wordObjList) of these objects as you work through your file. Once you're done with the file, close the handle, then proceed to do the rest of your processing.
Generate the template list:
output = [None]
for i in range(maxLen):
    output.append([])
Fill the list from your list of word objects:
for wordObj in wordObjList:
    output[wordObj.wLength].append(wordObj.originalWord)
Some other things to note:
- You don't need to handle hitting the end of the file: when Python reaches the end of the file in the for loop, it will stop iterating automatically.
- Always make sure you close your files. You can use the with statement to do this, as sketched below.
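A sketch of both notes combined, reusing the asker's filename variable (the processing body is elided):

with open(filename) as file:   # the file is closed automatically on exit
    for line in file:          # the loop ends by itself at end of file
        line = line.strip('\r\n')
        # ... build the word object and file it by length ...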
You refused to accept the answers proposing to store your objects in dictionaries. However, your real problem is that you want to hold all 6 million scanned words in memory at once. Use indexing (or some kind of simple references) and track those in your structure, then look up the actual data through them. Use iterators to retrieve the info you need.

python combine multiple sorted lists into one big sorted list one by one

I have several sorted lists, and I want to add them together into one big sorted list. What is the most efficient way to do this?
Here is what I would do, but it is too inefficient:
big_list = []
for slist in sorted_lists:  # sorted_lists is a generator, so lists have to be added one by one
    big_list.extend(slist)
    big_list.sort()
Here is an example for sorted_lists:
The size of sorted_lists is 200, and the size of its first element is 1668.
sorted_lists = [
['000008.htm_181_0040_0009', '000008.htm_181_0040_0037', '000008.htm_201_0041_0031', '000008.htm_213_0029_0004', '000008.htm_263_0015_0011', '000018.htm_116_0071_0002', '000018.htm_147_0046_0002', '000018.htm_153_0038_0015', '000018.htm_160_0060_0001', '000018.htm_205_0016_0002', '000031.htm_4_0003_0001', '000032.htm_4_0003_0001', '000065.htm_5_0013_0005', '000065.htm_8_0008_0006', '000065.htm_14_0038_0036', '000065.htm_127_0016_0006', '000065.htm_168_0111_0056', '000072.htm_97_0016_0012', '000072.htm_175_0028_0020', '000072.htm_188_0035_0004'….],
['000018.htm_68_0039_0030', '000018.htm_173_0038_0029', '000018.htm_179_0042_0040', '000018.htm_180_0054_0021', '000018.htm_180_0054_0031', '000018.htm_182_0025_0023', '000018.htm_191_0041_0010', '000065.htm_5_0013_0007', '000072.htm_11_0008_0002', '000072.htm_14_0015_0002', '000072.htm_75_0040_0021', '000079.htm_11_0005_0000', '000079.htm_14_0006_0000', '000079.htm_16_0054_0006', '000079.htm_61_0018_0012', '000079.htm_154_0027_0011', '000086.htm_8_0003_0000', '000086.htm_9_0030_0005', '000086.htm_11_0038_0004', '000086.htm_34_0031_0024'….],
['000001.htm_13_0037_0004', '000008.htm_48_0025_0006', '000008.htm_68_0025_0008', '000008.htm_73_0024_0014', '000008.htm_122_0034_0026', '000008.htm_124_0016_0005', '000008.htm_144_0046_0030', '000059.htm_99_0022_0012', '000065.htm_69_0045_0017', '000065.htm_383_0026_0020', '000072.htm_164_0030_0002', '000079.htm_122_0030_0009', '000079.htm_123_0049_0015', '000086.htm_13_0037_0004', '000109.htm_71_0054_0029', '000109.htm_73_0035_0005', '000109.htm_75_0018_0004', '000109.htm_76_0027_0013', '000109.htm_101_0030_0008', '000109.htm_134_0036_0030']]
EDIT
Thank you for the answers. I think I should have made it clearer that I do not have all the sorted lists simultaneously; I am iterating over some large files to get them. So I need to add them one by one, as I show in my crude code above.
The standard library provides heapq.merge for this purpose:
>>> import heapq
>>> a = [1, 3, 5, 6]
>>> b = [2, 4, 6, 8]
>>> c = [2.5, 4.5]
>>> list(heapq.merge(a, b, c))
[1, 2, 2.5, 3, 4, 4.5, 5, 6, 6, 8]
>>>
Or, in your case:
big_list = list(heapq.merge(*sorted_lists))
Note that you don't have to create the list, since heapq.merge returns an iterable:
for item in heapq.merge(*sorted_lists):
    ...  # process each item lazily, in sorted order
Quoting the doc:
Similar to sorted(itertools.chain(*iterables)) but returns an iterable, does not pull the data into memory all at once, and assumes that each of the input streams is already sorted (smallest to largest).
Use the heapq module to track which list to pick the next sorted value from:
import heapq

def merge(*iterables):
    h = []
    for it in map(iter, iterables):
        try:
            next = it.__next__  # Python 3; on Python 2 this was it.next
            h.append([next(), next])
        except StopIteration:
            pass
    heapq.heapify(h)
    while True:
        try:
            while True:
                v, next = s = h[0]   # raises IndexError when the heap is empty
                yield v
                s[0] = next()        # raises StopIteration when exhausted
                heapq._siftup(h, 0)  # restore the heap invariant
        except StopIteration:
            heapq.heappop(h)         # remove the exhausted iterable
        except IndexError:
            return
This pushes all the iterables onto a heap, kept sorted by their next value. Each time the generator yields the lowest value, the heap entry is updated with the next value from the iterable that produced it, and the heap is reordered.
In essence this keeps a list of [next_value, advance_function] pairs, and these are sorted efficiently by next_value.
Usage:
for value in merge(*sorted_lists):
    ...  # loops over all values in `sorted_lists` in sorted order
or
big_list = list(merge(*sorted_lists))
to create a new big list with all values sorted, efficiently.
This exact implementation was added to the heapq module as the heapq.merge() function so you can just do:
from heapq import merge
big_list = list(merge(*sorted_lists))
import heapq

def merge_lists(*args):
    # heapq.merge already yields the items in sorted order,
    # so the extra sorted() here is redundant but harmless.
    new_list = sorted(heapq.merge(*args))
    print(new_list)
