I have a dict with 50,000,000 keys (strings) mapped to a count of that key (which is a subset of one with billions).
I also have a series of objects with a class set member containing a few thousand strings that may or may not be in the dict keys.
I need the fastest way to find the intersection of each of these sets.
Right now, I do it like this code snippet below:
for block in self.blocks:
    # a block is a Python object containing a set with a few thousand strings
    # block.get_kmers() returns the set
    count = sum([kmerCounts[x] for x in block.get_kmers().intersection(kmerCounts)])
    # kmerCounts is the dict mapping millions of strings to ints
From my tests so far, this takes about 15 seconds per iteration. Since I have around 20,000 of these blocks, I am looking at half a week just to do this. And that is for the 50,000,000 items, not the billions I need to handle...
(And yes I should probably do this in another language, but I also need it done fast and I am not very good at non-python languages).
There's no need to do a full intersection, you just want the matching elements from the big dictionary if they exist. If an element doesn't exist you can substitute 0 and there will be no effect on the sum. There's also no need to convert the input of sum to a list.
count = sum(kmerCounts.get(x, 0) for x in block.get_kmers())
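As a rough check of the claim above, here is a minimal sketch on toy data. The names `kmer_counts` and `block_kmers` and their values are made up for illustration; they only stand in for the real dict and the set returned by block.get_kmers().

```python
# Toy stand-ins for the real data; names and values are illustrative only.
kmer_counts = {"AAA": 3, "AAC": 5, "ACG": 2}
block_kmers = {"AAA", "ACG", "TTT"}  # "TTT" is absent from the dict

# Original approach: build the intersection first, then look each key up.
via_intersection = sum(
    kmer_counts[x] for x in block_kmers.intersection(kmer_counts)
)

# Suggested approach: one lookup per k-mer; missing keys contribute 0.
via_get = sum(kmer_counts.get(x, 0) for x in block_kmers)
```

Both expressions give the same total, but the second one never builds the intermediate intersection set.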
Remove the square brackets around your list comprehension to turn it into a generator expression:
sum(kmerCounts[x] for x in block.get_kmers().intersection(kmerCounts))
That will save you some time and some memory, which may in turn reduce swapping, if you're experiencing that.
There is a lower bound to how much you can optimize here. Switching to another language may ultimately be your only option.
I am currently reading Learning Python, 5th Edition - by Mark Lutz and have come across the phrase "Physically Stored Sequence".
From what I've learnt so far, a sequence is an object that contains items that can be indexed in sequential order from left to right e.g. Strings, Tuples and Lists.
So with regard to a "Physically Stored Sequence", would that be a sequence that is referenced by a variable for use later on in a program? Or am I not getting it?
Thank you in advance for your answers.
A Physically Stored Sequence is best explained by contrast. It is one type of "iterable" with the main example of the other type being a "generator."
A generator is an iterable, meaning you can iterate over it as in a "for" loop, but it does not actually store anything--it merely spits out values when requested. Examples of this would be a pseudo-random number generator, the whole itertools package, or any function you write yourself using yield. Those sorts of things can be the subject of a "for" loop but do not actually "contain" any data.
A physically stored sequence then is an iterable which does contain its data. Examples include most data structures in Python, like lists. It doesn't matter in the Python parlance if the items in the sequence have any particular reference count or anything like that (e.g. the None object exists only once in Python, so [None, None] does not exactly "store" it twice).
A key feature of physically stored sequences is that you can usually iterate over them multiple times, and sometimes get items other than the "first" one (the one any iterable gives you when you call next() on it).
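The contrast can be sketched in a few lines. The generator function `squares` is a hypothetical example, not something from the book; it only illustrates that a generator is used up after one pass while a list keeps its items.

```python
def squares(n):
    """Generator: yields values on demand and stores nothing."""
    for i in range(n):
        yield i * i

stored = [i * i for i in range(4)]  # a list physically holds its items
lazy = squares(4)                   # a generator holds only its state

first_pass = list(lazy)
second_pass = list(lazy)            # the generator is already exhausted

stored_first = list(stored)
stored_second = list(stored)        # a list can be iterated again
third_item = stored[2]              # and indexed directly
```

Here the second pass over the generator yields nothing, whereas the list can be walked as many times as needed.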
All that said, this phrase is not very common--certainly not something you'd expect to see or use as a workaday Python programmer.
Let's say I want to sort rows and I want to resolve any ties with the next column, subsequent ties with the next-next column, etc.
In Python terms, the equivalent of sorted(rows, key=itemgetter(1, 2, 3, 4, ...)).
I tried writing my own generator but sorted doesn't iterate over my generator as it does with the tuple itemgetter returns. Any advice?
For the reasons noted in the comments, you cannot sort a list of things that hasn't been created yet. Generators exist to yield results when they are asked for, so you can't sort an iterable that hasn't been iterated (as with list(generator())).
To put in more ordinary terms, I'm thinking of ten names but am not telling you what they are yet, please sort them into alphabetical order. You should respond "how can I sort them when you haven't given them to me?" and you'd be correct: you can't.
OK, here's what you say you want to do:
I want to sort rows and I want to resolve any ties with the next column, subsequent ties with the next-next column etc.
Note, first, that the documentation for the key argument says the following:
key specifies a function of one argument that is used to extract a comparison key from each list element
So your itemgetter idea isn't quite right, since you want to move through the list only when a comparison is equal.
However, things are actually much easier than you think. Check out the Python docs (See also this SO question.):
Sequence types also support comparisons. In particular, tuples and lists are compared lexicographically by comparing corresponding elements. This means that to compare equal, every element must compare equal and the two sequences must be of the same type and have the same length. (For full details see Comparisons in the language reference.)
Which, I think, is exactly what you want if you just make sure that each row is an equal-length sequence (list or tuple).
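A minimal sketch of that lexicographic behaviour, using made-up rows. It assumes every row is a tuple of the same length; the column choice passed to itemgetter is arbitrary and only for illustration.

```python
from operator import itemgetter

# Hypothetical rows: (label, group, score)
rows = [
    ("b", 2, 9),
    ("a", 2, 1),
    ("c", 1, 7),
    ("a", 2, 0),
]

# Tuples compare element by element, so a later column is consulted
# only when all earlier columns tie.
by_whole_row = sorted(rows)

# itemgetter with several indexes returns a tuple of those columns,
# which is compared in the same element-by-element way.
by_cols_1_then_2 = sorted(rows, key=itemgetter(1, 2))
```

In both calls the key for each row is fully built before any comparison happens; the tie-breaking comes from how tuples compare, not from sorted stepping through columns.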
(Aha, I just read the comment regarding the die-roll function producing the keys. Confusing -- not sure if the above is helpful in that case, but I'm not sure what you are asking actually makes sense...)
>>> {x for x in 'spam'}
{'a', 'p', 's', 'm'}
Why does it change the order? If you take a look at a loop, it works perfectly:
>>> for x in 'spam':
...     print(x)
...
s
p
a
m
>>>
Sets in python (and in set theory) are not ordered. So when you loop over them, there is no defined ordering.
You looped over the string literal 'spam' to make a set containing each character in that string. Once you did that, the ordering was gone.
When you perform the for loop over 'spam', you are performing the loop against a string which does have ordering.
From Set types:
These represent unordered, finite sets of unique, immutable objects. As such, they cannot be indexed by any subscript [because no ordering is defined among the elements]. However, they can be iterated over, and the built-in function len() returns the number of items in a set. Common uses for sets are fast membership testing, removing duplicates from a sequence, and computing mathematical operations such as intersection, union, difference, and symmetric difference.
But if you really need to preserve the order, then please check ordered set.
In any case, you can simply write set('spam') instead of using a comprehension.
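If the goal is only to drop duplicates while keeping first-seen order, one common stand-in for an ordered set is dict.fromkeys. This sketch assumes Python 3.7 or later, where dicts are guaranteed to preserve insertion order; the input string is just an example.

```python
# A set keeps the unique characters but promises nothing about order.
chars_as_set = set("spamspam")

# dict keys are unique and (Python 3.7+) keep insertion order,
# so this removes duplicates while preserving first-seen order.
ordered_unique = list(dict.fromkeys("spamspam"))
```

The set still compares equal to {'s', 'p', 'a', 'm'}; only the list built from dict.fromkeys has a dependable order.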
set is not an ordered collection, and as such, the internal order of keys is undefined.
From docs.python.org
A set object is an unordered collection of distinct hashable objects. Common uses include membership testing, removing duplicates from a sequence, and computing mathematical operations such as intersection, union, difference, and symmetric difference. (For other containers see the built in dict, list, and tuple classes, and the collections module.)
sets are unordered by definition. The reason for this is that their implementation runs faster that way, by using appropriate data structures that do not preserve order. If you need order, you can use the (slower) OrderedDict type.
Python sets are defined as unordered, so Python is free to order them any way it likes (efficiently, I presume).
I am learning Python for a class now, and we just covered tuples as one of the data types. I read the Wikipedia page on it, but, I could not figure out where such a data type would be useful in practice. Can I have some examples, perhaps in Python, where an immutable set of numbers would be needed? How is this different from a list?
Tuples are used whenever you want to return multiple results from a function.
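For instance, a function can hand back two results at once. The helper `min_max` below is a hypothetical example written for illustration.

```python
def min_max(values):
    """Return the smallest and largest value together."""
    # The comma builds a tuple; no parentheses are required.
    return min(values), max(values)

result = min_max([3, 1, 4, 1, 5])
lowest, highest = result  # unpack the returned tuple into two names
```

The caller can either keep `result` as a single tuple or unpack it straight away.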
Since they're immutable, they can be used as keys for a dictionary (lists can't).
Tuples make good dictionary keys when you need to combine more than one piece of data into your key and don't feel like making a class for it.
a = {}
a[(1,2,"bob")] = "hello!"
a[("Hello","en-US")] = "Hi There!"
I've used this feature primarily to create a dictionary whose keys are the coordinates of the vertices of a mesh. In my particular case, exact comparison of the floats involved worked fine, but that might not always be true for your purposes (in which case I'd probably convert the incoming floats to some kind of fixed-point integer).
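A rough sketch of that fixed-point idea follows. The helper `vertex_key` and its `scale` parameter are assumptions made for illustration, not part of the original code; the right scale depends on the precision your mesh needs.

```python
def vertex_key(x, y, z, scale=1000):
    """Hypothetical helper: quantize float coordinates to integers."""
    return (round(x * scale), round(y * scale), round(z * scale))

vertex_ids = {}
vertex_ids[vertex_key(0.1 + 0.2, 1.0, 2.0)] = 7

# 0.1 + 0.2 is not exactly 0.3 as a float, so a raw float tuple would
# miss here, yet both values quantize to the same integer key.
floats_differ = (0.1 + 0.2) != 0.3
same_vertex = vertex_ids.get(vertex_key(0.3, 1.0, 2.0))
```

Values that fall on opposite sides of a rounding boundary can still land in different keys, so this reduces the problem rather than removing it.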
The best way to think about it is:
A tuple is a record whose fields don't have names.
You use a tuple instead of a record when you can't be bothered to specify the field names.
So instead of writing things like:
person = {"name": "Sam", "age": 42}
name, age = person["name"], person["age"]
Or the even more verbose:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
person = Person("Sam", 42)
name, age = person.name, person.age
You can just write:
person = ("Sam", 42)
name, age = person
This is useful when you want to pass around a record that has only a couple of fields, or a record that is only used in a few places. In that case specifying a whole new record type with field names (in Python, you'd use an object or a dictionary, as above) could be too verbose.
Tuples originate from the world of functional programming (Haskell, OCaml, Elm, F#, etc.), where they are commonly used for this purpose. Unlike Python, most functional programming languages are statically typed (a variable can only hold one type of value, and that type is determined at compile time). Static typing makes the role of tuples more obvious. For example, in the Elm language:
type alias Person = (String, Int)
person : Person
person = ("Sam", 42)
This highlights the fact that a particular type of tuple is always supposed to have a fixed number of fields in a fixed order, and each of those fields is always supposed to be of the same type. In this example, a person is always a tuple of two fields, one is a string and the other is an integer.
The above is in stark contrast to lists, which are supposed to be variable length (the number of items is normally different in each list, and you write functions to add and remove items) and each item in the list is normally of the same type. For example, you'd have one list of people and another list of addresses - you would not mix people and addresses in the same list. Whereas mixing different types of data inside the same tuple is the whole point of tuples. Fields in a tuple are usually of different types (but not always - e.g. you could have a (Float, Float, Float) tuple to represent x,y,z coordinates).
Tuples and lists are often nested. It's common to have a list of tuples. You could have a list of Person tuples just as well as a list of Person objects. You can also have a tuple field whose value is a list. For example, if you have an address book where one person can have multiple addresses, you could have a tuple of type (Person, [String]). The [String] type is commonly used in functional programming languages to denote a list of strings. In Python, you wouldn't write down the type, but you could use tuples like that in exactly the same manner, putting a Person object in the first field of a tuple and a list of strings in its second field.
In Python, confusion arises because the language does not enforce any of these practices that are enforced by the compiler in statically typed functional languages. In those languages, you cannot mix different kinds of tuples. For example, you cannot return a (String, String) tuple from a function whose type says that it returns a (String, Integer) tuple. You also cannot return a list when the type says you plan to return a tuple, and vice versa. Lists are used strictly for growing collections of items, and tuples strictly for fixed-size records. Python doesn't stop you from breaking any of these rules if you want to.
In Python, a list is sometimes converted into a tuple for use as a dictionary key, because Python dictionary keys need to be immutable (i.e. constant) values, whereas Python lists are mutable (you can add and remove items at any time). This is a workaround for a particular limitation in Python, not a property of tuples as a computer science concept.
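A short sketch of that workaround; the `path` and `cache` names are made up for the example.

```python
path = ["usr", "local", "bin"]
cache = {}

# A list is unhashable, so using it as a key raises TypeError.
try:
    cache[path] = "found"
    list_key_ok = True
except TypeError:
    list_key_ok = False

# Converting the list to a tuple gives a hashable key.
cache[tuple(path)] = "found"
```

Looking the entry up again requires the same conversion, for example cache[tuple(path)].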
So in Python, lists are mutable and tuples are immutable. But this is just a design choice, not an intrinsic property of lists and tuples in computer science. You could just as well have immutable lists and mutable tuples.
In Python (using the default CPython implementation), tuples are also faster than objects or dictionaries for most purposes, so they are occasionally used for that reason, even when naming the fields using an object or dictionary would be clearer.
Finally, to make it even more obvious that tuples are intended to be another kind of record (not another kind of list), Python also has named tuples:
from collections import namedtuple
Person = namedtuple("Person", "name age")
person = Person("Sam", 42)
name, age = person.name, person.age
This is often the best choice - shorter than defining a new class, but the meaning of the fields is more obvious than when using normal tuples whose fields don't have names.
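A named tuple is still a tuple underneath, which a brief sketch can show. It reuses the Person definition from above.

```python
from collections import namedtuple

Person = namedtuple("Person", "name age")
person = Person("Sam", 42)

name, age = person                  # unpacks like a plain tuple
is_tuple = isinstance(person, tuple)

# Fields cannot be reassigned; _replace returns a new Person instead.
older = person._replace(age=43)
```

The original `person` is left untouched by `_replace`, in keeping with tuple immutability.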
Immutable lists are highly useful for many purposes, but the topic is far too complex to answer here. The main point is that things that cannot change are easier to reason about than things that can change. Most software bugs come from things changing in unexpected ways, so restricting the ways in which they can change is a good way to eliminate bugs. If you are interested, I recommend reading a tutorial for a functional programming language such as Elm, Haskell or Clojure (Elm is the friendliest). The designers of those languages considered immutability so useful that all lists are immutable there. (Instead of changing a list to add or remove an item, you make a new list with the item added or removed. Immutability guarantees that the old copy of the list can never change, so the compiler and runtime can make the code perform well by re-using parts of the old list in the new one and garbage-collecting the left-over parts when they are no longer needed.)
I like this explanation.
Basically, you should use tuples when there's a constant structure (the 1st position always holds one type of value and the second another, and so forth), and lists should be used for lists of homogeneous values.
Of course there's always exceptions, but this is a good general guideline.
Tuples and lists have the same uses in general. Immutable data types in general have many benefits, mostly about concurrency issues.
So, when you have lists that are not volatile in nature and you need to guarantee that no consumer is altering it, you may use a tuple.
Typical examples are fixed data in an application, like company divisions, categories, etc. If this data changes, typically a single producer rebuilds the tuple.
I find them useful when you always deal with two or more objects as a set.
A tuple is a sequence of values. The values can be any type, and they are indexed by integers, so in that respect tuples are a lot like lists. The most important difference is that tuples are immutable.
A tuple is a comma-separated list of values:
t = 'p', 'q', 'r', 's', 't'
Although it is not required, it is good practice to enclose tuples in parentheses:
t = ('p', 'q', 'r', 's', 't')
A list can always replace a tuple, with respect to functionality (except, apparently, as keys in a dict). However, a tuple can make things go faster. The same is true for, for example, immutable strings in Java -- when will you ever need to be unable to alter your strings? Never!
I just read a decent discussion on limiting what you can do in order to make better programs; Why Why Functional Programming Matters Matters
A tuple is useful for storing multiple values. As you note, a tuple is just like a list that is immutable, i.e. once created you cannot add, remove, or swap elements.
One benefit of being immutable is that, because the tuple has a fixed size, the run-time can perform certain optimizations. This is particularly beneficial when a tuple is used as a return value or as a parameter to a function.
Use a tuple:
If your data should not or does not need to be changed.
Tuples are faster than lists. We should use a tuple instead of a list if we are defining a constant set of values and all we are ever going to do with it is iterate through it.
If we need an array of elements to be used as dictionary keys, we can use tuples. As lists are mutable, they can never be used as dictionary keys.
Furthermore, tuples are immutable, whereas lists are mutable. By the same token, tuples are fixed in size, whereas lists are dynamic.
a_tuple = tuple(range(1000))
a_list = list(range(1000))
a_tuple.__sizeof__()  # 8024 bytes (exact value depends on the Python build)
a_list.__sizeof__()  # 9088 bytes (exact value depends on the Python build)
More information:
https://jerrynsh.com/tuples-vs-lists-vs-sets-in-python/
In addition to the places where they're syntactically required like the string % operation and for multiple return values, I use tuples as a form of lightweight classes. For example, suppose you have an object that passes out an opaque cookie to a caller from one method which is then passed into another method. A tuple is a good way to pack multiple values into that cookie without having to define a separate class to contain them.
I try to be judicious about this particular use, though. If the cookies are used liberally throughout the code, it's better to create a class because it helps document their use. If they are only used in one place (e.g. one pair of methods) then I might use a tuple. In any case, because it's Python you can start with a tuple and then change it to an instance of a custom class without having to change any code in the caller.
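The cookie pattern described above might look like the following sketch. The `Paginator` class, its methods, and the (start, stop) layout of the cookie are all hypothetical and chosen only for illustration.

```python
class Paginator:
    """Hypothetical example: hands out an opaque cookie, accepts it back."""

    def __init__(self, items, page_size):
        self._items = items
        self._page_size = page_size

    def first_page(self):
        # The cookie is a plain (start, stop) tuple; callers never look inside.
        stop = self._page_size
        return self._items[0:stop], (0, stop)

    def next_page(self, cookie):
        # Only this class unpacks the cookie.
        _start, stop = cookie
        new_stop = stop + self._page_size
        return self._items[stop:new_stop], (stop, new_stop)

pager = Paginator(list(range(7)), 3)
page1, cookie = pager.first_page()
page2, cookie = pager.next_page(cookie)
```

Because the caller only stores the cookie and passes it back, the tuple could later be swapped for a small class by editing the two methods alone.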
Tuples are used:
in places where you want your sequence of elements to be immutable
in tuple assignments
a, b = 1, 2
in variable-length arguments
def add(*arg):  # arg is a tuple
    return sum(arg)
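A small runnable sketch of the last two uses. Here `add` is modified to also return the type of its argument pack, purely to make the tuple visible.

```python
def add(*args):
    # args arrives as a tuple, however many arguments are passed.
    return type(args), sum(args)

kind, total = add(1, 2, 3)

# Tuple assignment also allows swapping without a temporary variable.
a, b = 1, 2
a, b = b, a
```

The right-hand side of the swap is packed into a tuple first and then unpacked into the two names.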