I have a Python assignment to extract bigrams from a string into a dictionary. I found a solution online (I can't remember where) and it seems to work, but I'm new to Python and having trouble understanding it. Can anyone explain the code below, which takes a string, extracts pairs of adjacent characters, counts their occurrences, and puts them into a dictionary?
'''
s = 'Mississippi' # Your string
# Dictionary comprehension
dic_ = {k : s.count(k) for k in{s[i]+s[i+1] for i in range(len(s)-1)}}
'''
First let's understand comprehensions:
A list, dict, set, etc. can be made with a comprehension. Basically, a comprehension takes a generator expression and uses it to build a new object. A generator is just an object that yields one value per iteration, so, taking lists as the example: to make a list with a list comprehension, we take the values the generator produces and put each one into its own spot in the list. Take this generator expression for example:
x for x in range(0, 10)
This will just give 0 on the first iteration, then 1, then 2, and so on. To turn this into a list we wrap it in [] (list brackets) like so:
[x for x in range(0, 10)]
This would give:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] #note: range does not include the second input
For a dictionary and for a set we use {}, but since a dictionary uses key-value pairs, the expression is different for sets and dictionaries. For a set it is the same as for a list:
{x for x in range(0, 10)} #gives the set --> {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
but for a dictionary we need a key and a value. Since enumerate yields (index, item) pairs, it can be useful for building dictionaries in some cases:
{key: value for key, value in enumerate([1,2,3])}
In this case the keys are the indexes and the values are the items in the list. So this gives:
{0: 1, 1: 2, 2: 3} #dictionary
It doesn't make a set because we write x : y, which is the format for items in a dictionary, not a set.
Now, let's break this down:
This part of the code:
{s[i]+s[i+1] for i in range(len(s)-1)}
is making a set of values that is every pair of adjacent letters: s[i] is one letter and s[i+1] is the letter after it, so it says take this pair (s[i]+s[i+1]) and do it for every index in the string (for i in range(len(s)-1)). Notice the -1: the last letter has no letter after it, so we don't run it for the last index.
Now that we have a set let's save it to a variable so it's easier to see:
setOfPairs = {s[i]+s[i+1] for i in range(len(s)-1)}
Then our original comprehension would change to:
{k : s.count(k) for k in setOfPairs}
This says we want a dictionary with keys k and values s.count(k). Since we get every k from our set of pairs (for k in setOfPairs), the keys of the dictionary are the pairs. And since s.count(k) returns the number of times k appears in s, the values are the number of times each key appears in s.
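To make that concrete, here is a quick check (just printing the intermediate set and the final dictionary for 'Mississippi'; set and key order may differ on your machine):
s = 'Mississippi'
setOfPairs = {s[i]+s[i+1] for i in range(len(s)-1)}
print(setOfPairs)  # {'Mi', 'is', 'ss', 'si', 'ip', 'pp', 'pi'}
dic_ = {k: s.count(k) for k in setOfPairs}
print(dic_)        # {'Mi': 1, 'is': 2, 'ss': 2, 'si': 2, 'ip': 1, 'pp': 1, 'pi': 1}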
Let's take this apart one step at a time:
s[i] is the code to select the i-th letter in the string s.
s[i]+s[i+1] concatenates the letter at position i and the letter at position i+1.
s[i]+s[i+1] for i in range(len(s)-1) iterates each index i (except the last one) and so computes all the bigrams.
Since the expression in 3 is surrounded by curly brackets, the result is a set, meaning that all duplicate bigrams are removed.
for k in {s[i]+s[i+1] for i in range(len(s)-1)} therefore iterates over all unique bigrams in the given string s.
Lastly, {k : s.count(k) for k in{s[i]+s[i+1] for i in range(len(s)-1)}} maps each bigram k to the number of times it appears in s, because the str.count function returns the number of times a substring appears in a string.
I hope that helps. If you want to know more about list/set/dict comprehensions in Python, the relevant entry in the Python documentation is here: https://docs.python.org/3/tutorial/datastructures.html?highlight=comprehension#list-comprehensions
dic_ = {k : s.count(k) for k in{s[i]+s[i+1] for i in range(len(s)-1)}}
Read backwards
dic_ = {k : s.count(k)
# Step 3: with each pair of characters, count how many times it appears in the
#         string, and store the 2-char string as the key and the count as the
#         value in the dictionary.
        for k in {s[i]+s[i+1]
# Step 2: from each position, take 2 characters out of the string.
                  for i in range(len(s)-1)}}
# Step 1: loop over all but the last character of the string.
The code may be inefficient for long strings. Although the inner set removes duplicate pairs before counting, s.count(k) still rescans the entire string once for every distinct pair, so the work grows with the length of the string times the number of distinct bigrams.
Refactoring to count in a single pass, incrementing a running total for each pair as you meet it rather than calling count, may speed it up. Benchmark the timing on ... say a billion base pair DNA sequence.
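One way to do that single pass is with collections.Counter (a sketch of the refactoring, not the original code; note it counts overlapping pairs individually, whereas s.count does not):
from collections import Counter

def bigram_counts(s):
    # One pass: build each adjacent pair and let Counter tally them.
    return Counter(s[i] + s[i+1] for i in range(len(s) - 1))

print(bigram_counts('Mississippi'))
# Counter({'is': 2, 'ss': 2, 'si': 2, 'Mi': 1, 'ip': 1, 'pp': 1, 'pi': 1})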
Related
Say I am continuously generating new data (e.g. integers) and want to collect them in a list.
import random
lst = []
for _ in range(50):
num = random.randint(0, 10)
lst.append(num)
When a new value is generated, I want it to be positioned in the list based on the count of occurrences of that value, so data with lower "current occurrence" should be placed before those with higher "current occurrence".
"Current occurrence" means "the number of duplicates of that data that have already been collected so far, up to this iteration". For the data that have the same occurrence, they should then follow the order in which they are generated.
For example, if at iteration 10 the current list is [1,2,3,4,2,3,4,3,4] and a new value 1 is generated, it should be inserted at index 7, resulting in [1,2,3,4,2,3,4,1,3,4]. Because it is the second occurrence of 1, it goes after all the first occurrences and after the second occurrences of 2, 3 and 4 (which were generated earlier), but before the third occurrences of 3 and 4 (hence preserving the order).
This is my current code that can rearrange the list:
from collections import defaultdict

def rearrange(lst):
    d = defaultdict(list)
    count = defaultdict(int)
    for x in lst:
        count[x] += 1
        d[count[x]].append(x)
    res = []
    for k in sorted(d.keys()):
        res += d[k]
    return res

lst = rearrange(lst)
However, this is not giving my expected result.
I wrote a separate algorithm that keeps generating new data until some convergence criterion is met, where the list has the potential to become extremely large.
Therefore I want to rearrange my generated values on the fly, i.e. to constantly insert data into the list "in place". Of course I could call my rearrange function in each iteration, but that would be super inefficient. What I want to do is insert new data into the correct position of the list, not replace the list with a new one in each iteration.
Any suggestions?
Edit: the data structure doesn't necessarily need to be a list, but it has to be ordered, and it shouldn't require another data structure to hold extra information.
The data structure I think that might work better for your purpose is a forest (in this case, a disjoint union of lists).
In summary, you keep one internal list for each occurrence count. When a new value arrives, you append it to the list for its current occurrence count, i.e. the list just after the one that received the previous occurrence of that value.
In order to keep track of the counts of occurrences, you can use a built-in Counter.
Here is a sample implementation:
from collections import Counter

def rearranged(iterable):
    forest, counter = list(), Counter()
    for x in iterable:
        c = counter[x]
        if c == len(forest):
            forest.append([x])
        else:
            forest[c] += [x]
        counter[x] += 1
    return [x for lst in forest for x in lst]
rearranged([1,2,3,4,2,3,4,3,4,1])
# [1, 2, 3, 4, 2, 3, 4, 1, 3, 4]
For this to work better, your input iterable should be a generator (so the items can be generated on the fly).
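For instance, a minimal sketch of feeding it a stream (reusing the random generation from the question; the names here are just illustrative):
import random

# Pass a generator expression so the values are consumed as they are produced,
# instead of being collected into a list first.
stream = (random.randint(0, 10) for _ in range(50))
ordered = rearranged(stream)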
Input: A list of lists of various positions.
[['61097', '12204947'],
['61097', '239293'],
['61794', '37020977'],
['61794', '63243'],
['63243', '5380636']]
Output: A sorted list that contains, for each list, the count of unique numbers after merging.
[4, 3, 3, 3, 3]
The idea is fairly simple, I have a list of lists where each list contains a variable number of positions (in our example there is only 2 in each list, but lists of up to 10 exist). I want to loop through each list and if there exists ANY other list that contains the same number then that list gets appended to the original list.
Example: Taking the input data from above and using the following code:
import itertools
from IPython.display import clear_output, display  # needed if running in a Jupyter notebook

def gen_haplotype_blocks(df):
    counts = []
    for i in range(len(df)):
        my_list = [item for item in df if any(x in item for x in df[i])]
        my_list = list(itertools.chain.from_iterable(my_list))
        uniq_counts = len(set(my_list))
        counts.append(uniq_counts)
        clear_output()
        display('Currently Running ' + str(i))
    return sorted(counts, reverse=True)
I get the output that is expected. In this case, when I loop through the first list ['61097', '12204947'], I find that the second list ['61097', '239293'] also contains '61097', so these two lists get concatenated together to form ['61097', '12204947', '61097', '239293']. This is done for every single list, outputting the following:
['61097', '12204947', '61097', '239293']
['61097', '12204947', '61097', '239293']
['61794', '37020977', '61794', '63243']
['61794', '37020977', '61794', '63243', '63243', '5380636']
['61794', '63243', '63243', '5380636']
Once this list is complete, I then count the number of unique values in each list, append that to another list, then sort the final list and return that.
So in the case of ['61097', '12204947', '61097', '239293'], we have two '61097', one '12204947' and one '239293', which makes 3 unique numbers.
While my code works, it is VERY slow. Running for nearly two hours and still only on line ~44k.
I am looking for a way to speed up this function considerably. Preferably without changing the original data structure. I am very new to python.
Thanks in advance!
To considerably improve the speed of your program, especially for larger data sets, the key is to use a hash table (a dictionary in Python terms) that stores each distinct number as a key and the lines in which that number appears as the value. Then, in a second pass, merge the lists for each line based on the dictionary and count the unique elements.
def gen_haplotype_blocks(input):
    unique_numbers = {}
    for i, numbers in enumerate(input):
        for number in numbers:
            if number in unique_numbers:
                unique_numbers[number].append(i)
            else:
                unique_numbers[number] = [i]

    output = [[] for _ in range(len(input))]
    for i, numbers in enumerate(input):
        for number in numbers:
            for line in unique_numbers[number]:
                output[i] += input[line]

    counts = [len(set(x)) for x in output]
    return sorted(counts, reverse=True)
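As a quick sanity check, calling it on the sample input from the question reproduces the expected output (the name positions is just for the example):
positions = [['61097', '12204947'],
             ['61097', '239293'],
             ['61794', '37020977'],
             ['61794', '63243'],
             ['63243', '5380636']]

print(gen_haplotype_blocks(positions))
# [4, 3, 3, 3, 3]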
In theory, the time complexity of your algorithm is O(N*N), with N the size of the input list, because you need to compare each list with all other lists. This approach is roughly O(N), assuming each number appears in only a few lines, which should be considerably faster for a larger data set. The trade-off is extra space complexity.
Not sure how much improvement you expect by saying "considerably", but converting your inner lists to sets from the beginning should speed things up. The following ran approximately 2.5x faster in my testing:
def gen_haplotype_blocks_improved(df):
    df_set = [set(d) for d in df]
    counts = []
    for d1 in df_set:
        row = d1
        for d2 in df_set:
            if d1.intersection(d2) and d1 != d2:
                row = row.union(d2)
        counts.append(len(row))
    return sorted(counts, reverse=True)
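Called on the same sample input from the question, it gives the same result:
print(gen_haplotype_blocks_improved([['61097', '12204947'],
                                     ['61097', '239293'],
                                     ['61794', '37020977'],
                                     ['61794', '63243'],
                                     ['63243', '5380636']]))
# [4, 3, 3, 3, 3]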
I have to create a function that takes a string and a list of strings as inputs, and outputs a list of the indices of the strings that contain the given string. I have done that, but now I need to order the indices according to the number of occurrences of the string in each of those strings. How can I do that? This is my code:
I added the count under the if to count the occurrences; how can I use it to order the indices by that?
You can have your function also build a list of the counts in each matching string:
def function(s, lst):
    l = []
    counts = []
    for i in range(len(lst)):
        if s in lst[i]:
            counts += [lst[i].count(s)]
            l += [i]
    return l, counts
Here counts is a list in which each entry is the count of occurrences of s in the corresponding string from your input list. The function now returns two lists in a tuple, with the first element being l and the second being counts. Note that i = -1 is redundant here, since i takes its values from the iterable made with range, and assigning a value to it before the loop doesn't change its loop value.
You can now sort the first list based on the second list using a line modified from this post,
out_fun = function(s,inp)
out = [x for x,_ in sorted(zip(out_fun[0],out_fun[1]), key = lambda x: x[1], reverse=True)]
inp is the list of strings, for example inp = ["hello", "cure", "access code"]. out_fun is the return tuple of two lists from the function function. s is the string of interest - here as in your original example it is 'c'.
What this line does is first create a list of tuples using zip, where the first element of each tuple is an element from the list of indices and the second is from the list of occurrence counts. It then sorts the tuples by their second element in reverse order (largest first). The list comprehension fetches only the first element from each tuple in the sorted result, which gives back the index list, now ordered by count.
If you have questions about this solution, feel free to ask. You have a Python 2.7 tag - in Python 3.X you would need to use list(zip()) as zip returns a zip object rather than a list.
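Putting it together with the example values above (inp and s as given there), this is roughly what you would see:
inp = ["hello", "cure", "access code"]
s = 'c'

out_fun = function(s, inp)
# out_fun == ([1, 2], [1, 3]): "cure" contains one 'c', "access code" contains three

out = [x for x, _ in sorted(zip(out_fun[0], out_fun[1]), key=lambda x: x[1], reverse=True)]
# out == [2, 1]: index 2 comes first because its string contains 'c' more often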
This is a more concise version of your program:
def function(s, lst):
    t = [(i, x.count(s)) for i, x in enumerate(lst) if s in x]
    return t
It uses a list comprehension to create and return a list of tuples t, with the first element of each tuple being the index of a string that contains the character s and the second being the count. This is not necessarily more efficient; that would need to be checked. But it's a clean one-liner that, at least to me, is more readable.
The list of tuples can then be sorted in a similar way to the previous program, based on the second tuple element, i.e. the count:
out_fun = function(s,inp)
out = [x for x,_ in sorted(out_fun, key = lambda x: x[1], reverse=True)]
Hey, I have a doubt about the following Python code I wrote:
#create a list of elements
#use a dictionary to find out the frequency of each element
list = [1,2,6,3,4,5,1,1,3,2,2,5]
list.sort()
dict = {i: list.count(i) for i in list}
print(dict)
In the dictionary comprehension, "for i in list" is the sequence supplied to the comprehension, right? So it takes 1, 2, 3, 4, ... as the keys. My question is: why doesn't it take 1 three times? Because I've said "for i in list", doesn't it have to take each and every element in the list as a key?
(I'm new to python so be easy on me !)
My question is why doesn't it take 1 three times ?
That's because dictionary keys are unique. If there is another entry found for the same key, the previous value for that key will be overwritten.
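A tiny illustration of that overwriting (my own example, not your code):
pairs = [(1, 'first'), (1, 'second'), (2, 'only')]
d = {k: v for k, v in pairs}
print(d)  # {1: 'second', 2: 'only'} - the later value for key 1 wins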
Well, for your issue, if you are only after counting the frequency of each element in your list, then you can use collections.Counter
And please don't use list as variable name. It's a built-in.
>>> lst = [1,2,6,3,4,5,1,1,3,2,2,5]
>>> from collections import Counter
>>> Counter(lst)
Counter({1: 3, 2: 3, 3: 2, 5: 2, 4: 1, 6: 1})
Yes, your suspicion is correct: 1 will come up 3 times during iteration. However, since dictionaries have unique keys, each time 1 comes up it replaces the previously generated key/value pair with the newly generated one. This gives the right answer, but it's not the most efficient. You could convert the list to a set instead to avoid reprocessing duplicate keys:
dict = {i: list.count(i) for i in set(list)}
However, even this method is horribly inefficient because it does a full pass over the list for each distinct value, i.e. O(n²) comparisons overall. You can do this in a single pass over the list, but you wouldn't use a dict comprehension:
xs = [1,2,6,3,4,5,1,1,3,2,2,5]
counts = {}
for x in xs:
    counts[x] = counts.get(x, 0) + 1
The result for counts is: {1: 3, 2: 3, 3: 2, 4: 1, 5: 2, 6: 1}
Edit: I didn't realize there was something in the library to do this for you. You should use Rohit Jain's solution with collections.Counter instead.
I have a dictionary in the form:
{"a": (1, 0.1), "b": (2, 0.2), ...}
Each tuple corresponds to (score, standard deviation).
How can I take the average of just the first integer in each tuple?
I've tried this:
for word in d:
    (score, std) = d[word]
    d[word] = float(score), float(std)
    if word in string:
        number = len(string)
        v = sum(score)
return (v) / number
Get this error:
v = sum(score)
TypeError: 'int' object is not iterable
It's easy to do using a list comprehension. First, you can get all the dictionary values from d.values(). To make a list of just the first item in each value, write [v[0] for v in d.values()]. Then just take the sum of those elements and divide by the number of items in the dictionary:
sum([v[0] for v in d.values()]) / float(len(d))
As Pedro rightly points out, this actually creates the list, and then does the sum. If you have a huge dictionary, this might take up a bunch of memory and be inefficient, so you would want a generator expression instead of a list comprehension. In this case, that just means getting rid of one pair of brackets:
sum(v[0] for v in d.values()) / float(len(d))
The two methods are compared in another question.
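For instance, with the two entries shown at the top of the question, either version gives the same result:
d = {"a": (1, 0.1), "b": (2, 0.2)}
avg = sum(v[0] for v in d.values()) / float(len(d))
print(avg)  # 1.5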