How does Python's set difference work internally?

Recently, I have been looking through some Python modules to understand their behavior and how optimized their implementations are. Can anyone tell me what algorithm Python uses to perform the set difference operation? One possible way to achieve set difference is by using hash tables, which would involve an extra O(N) space. I tried to find the source code of the set operations but was not able to locate it. Please help.

A set in Python is itself a hash table, so implementing difference for it is not as hard as you might imagine. Looking at it from a higher level, how does one implement set difference? Iterate over one of the collections and add to the result all elements that are not present in the other set.
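A rough Python-level sketch of that idea (CPython's actual implementation is in C, in Objects/setobject.c of the source tree) could look like this:

# Minimal sketch of the idea, not CPython's actual C implementation:
# iterate over the first set and keep the elements whose hash lookup
# in the other set fails.
def set_difference(a, b):
    result = set()
    for item in a:          # O(len(a)) iterations
        if item not in b:   # average O(1) hash lookup
            result.add(item)
    return result

print(set_difference({1, 2, 3, 4}, {2, 4}))   # {1, 3}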

Related

Why is sys.path a list?

Why would the implementers choose to make sys.path a list as opposed to an ordered set?
Having sys.path as a list gives rise to the possibility of having multiple duplicates in the path, slowing down the search time for modules.
An artificial example would be the following:
# instant importing
import os
import sys

# flood the path with 50000 copies of the current directory
for i in xrange(50000):
    sys.path.insert(0, os.path.abspath("."))

# importing now takes a while to fail
import hello
To summarise from the comments and answers given:
It seems from the responses below that a list is a simple structure which handles 99% of everyone's needs. It does not come with the safety feature of avoiding duplicates, but it does come with a primitive form of prioritisation: the index of an element in the list. You can easily give an entry the highest priority by prepending it, or the lowest priority by appending it.
Adding richer prioritisation, i.e. "insert before this element", would rarely be used, and the interface for it would be too much effort for such a simple task. As the accepted answer states, there is no practical need for anything more advanced, and historically people are used to this.
An ordered set is:
a very recent idea (the recipe mentioned in Does Python have an ordered set? is for 2.6+)
a very special-purpose structure (not even in the standard library)
There's no practical need for the added complexity
A list is a very simple structure, while an ordered set is basically a hash table + a list + the weaving logic between them
You don't need the operations a set is designed for on sys.path - checking whether an exact path is already present - let alone doing that check very quickly
On the contrary, sys.path's typical use cases are those exactly for a list: trying elements in sequence, prepending or appending one
To summarize, there's both a historical reason and a lack of any practical need.
sys.path specifies a search path. Typically search paths are ordered, with the order of the items indicating search order. If sys.path were a set there would be no explicit ordering, making sys.path less useful. It's also worth considering that optimization is a tricky issue. A reasonable optimization to address any performance concerns would be to simply keep a record of already searched elements of sys.path. Trying to be tricky with ordered sets probably isn't worth the effort.
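If duplicates in sys.path ever do become a real concern, they are easy to strip out by hand while keeping the list's ordering semantics; a minimal sketch:

import sys

# Hedged sketch: drop duplicate entries from sys.path while preserving order,
# so the search-order semantics of the list are kept.
seen = set()
deduped = []
for p in sys.path:
    if p not in seen:
        seen.add(p)
        deduped.append(p)
sys.path[:] = deduped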

Use APIs for sorting or algorithm?

In a programming language like Python, which will be more efficient: sorting an array with a sorting algorithm I implement myself, such as merge sort, or using a built-in API like sort()? If algorithms are independent of programming languages, then what is the advantage of algorithms over built-in methods or APIs?
Why to use public APIs:
The built-in methods were written and reviewed by many very experienced coders, and a lot of effort was invested in optimizing them to be as efficient as possible.
Since the built-in methods are public APIs, it also means they are constantly used, which means you get massive "free" testing. You are much more likely to detect issues in public APIs than in private ones, and once something is discovered, it will be fixed for you.
Don't reinvent the wheel. Someone already programmed it for you, use it. If your profiler says there is a problem, think about replacing it. Not before.
Why to use custom made methods:
That said, the public APIs cover the general case. If you need something very specific for your scenario, you might find a solution that is more efficient, but it will take you quite some time to actually do better than the already optimized, general-purpose public API.
tl;dr: Use public APIs unless you:
Need it and can afford a lot of time to replace it.
Know what you are doing pretty well.
Intend to maintain it and do robust testing for it.
The libraries normally use well-tested and carefully optimized algorithms. For example, Python uses Timsort, which:
is a stable sort (order of elements that compare equal is preserved)
in the worst case takes O(n log n) comparisons to sort an array of n elements
in the best case (when the input is already sorted) runs in linear time
Unless you have special requirements and know that for your particular data sets another sort algorithm will give better results, you can use the standard library implementation.
The other reason to build a sort by hand is, evidently, for academic purposes...
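As a rough illustration of the gap, here is a hedged sketch that times the built-in sorted() against a straightforward hand-written merge sort (the data size and repetition count are arbitrary):

import random
import timeit

def merge_sort(seq):
    # Plain top-down merge sort, written for clarity rather than speed.
    if len(seq) <= 1:
        return seq
    mid = len(seq) // 2
    left = merge_sort(seq[:mid])
    right = merge_sort(seq[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

data = [random.random() for _ in range(10000)]
print(timeit.timeit(lambda: sorted(data), number=20))      # built-in Timsort
print(timeit.timeit(lambda: merge_sort(data), number=20))  # hand-written sort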

Python: How to obtain current bandwidth usage?

I need to get current bandwidth usage on given interface in my Python code. How can I achieve this?
I tried extracting the values from /proc/net/dev twice, with a short sleep (e.g. 0.3 s) between the two reads, and computing the difference in bytes divided by the elapsed time. It works, but I do not know if I can trust these results. I'm looking for a more elegant solution. Any tools, libraries, or simple algorithms?
I will appreciate all suggestions!
There are libraries for this, like psutil, but if you are on *nix I would opt for using the underlying system.
Although you could use /proc/net/dev, rather use /sys/class/net/eth0/statistics: you will not have to parse anything, since there is an individual file for each statistic.
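A minimal sketch of the sampling approach, reading the counters from /sys/class/net/<iface>/statistics (the interface name eth0 and the one-second window are just placeholders):

import time

def read_bytes(iface, counter):
    # counter is "rx_bytes" or "tx_bytes"
    with open("/sys/class/net/%s/statistics/%s" % (iface, counter)) as f:
        return int(f.read())

def bandwidth(iface="eth0", interval=1.0):
    # sample the counters twice and divide the delta by the elapsed time
    rx1, tx1 = read_bytes(iface, "rx_bytes"), read_bytes(iface, "tx_bytes")
    time.sleep(interval)
    rx2, tx2 = read_bytes(iface, "rx_bytes"), read_bytes(iface, "tx_bytes")
    return (rx2 - rx1) / interval, (tx2 - tx1) / interval  # bytes per second

print(bandwidth())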

How best to store large sequences of text in Python?

I recently discovered that a student of mine was doing an independent project in which he was using very large strings (2-4MB) as values in a dictionary.
I've never had a reason to work with such large blocks of text and it got me wondering if there were performance issues associated with creating such large strings.
Is there a better way of doing it than to simply create a string? I realize this question is largely context dependent, but I'm looking for generalized answers that may cover more than one possible use-case.
If you were working with that much text, how would you store it in your code, and would you do anything different than if you were simply working with an ordinary string of only a few characters?
It depends a lot on what you're doing with the strings. I'm not exactly sure how Python stores strings but I've done a lot of work on XEmacs (similar to GNU Emacs) and on the underlying implementation of Emacs Lisp, which is a dynamic language like Python, and I know how strings are implemented there. Strings are going to be stored as blocks of memory similar to arrays. There's not a huge issue creating large arrays in Python, so I don't think simply storing the strings this way will cause performance issues. Some things to consider though:
How are you building up the string? If you build it up piece by piece by repeatedly appending to an ever larger string, you have an O(N^2) algorithm that will be very slow. Java handles this with a StringBuilder class. I'm not sure if there's an exact equivalent in Python, but you can simply collect all the parts you want to join in a list and join them at the end with ''.join(parts) (see the sketch at the end of this answer).
Do you need to search the string? This isn't related to creating the strings but it's something to consider. Searching will in general be O(n) in the size of the string; there are speedups that make it O(n/m) where m is the size of the substring you're searching for, but that's about it. The main consideration here is whether to store one big string or a series of substrings. If you need to search all the substrings, that won't help much over searching a big string, but it's possible you might know in advance that some parts don't need to be searched.
Do you need to access substrings? Again, this isn't related to creating the strings, it's something to consider. Accessing a substring by position is just a matter of indexing to the right memory location, but if you need to take large substrings it may be inefficient; you might be able to speed things up by storing your string as an array of substrings and then creating a new string as another array that shares some of those substrings. However, doing it this way takes work and shouldn't be done unless it's really necessary.
In sum, I think for simple cases it's fine to have large strings like this, but you should think about the sorts of operations you're going to perform and what their O(...) time is.
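To illustrate the first point above, a small sketch contrasting the quadratic append pattern with collecting the pieces and joining once (the piece count is made up):

# Repeated += copies the growing string each time, roughly O(N^2) total work
# in the worst case, while collecting the pieces and joining once is linear.
pieces = ["chunk-%d " % i for i in range(100000)]

# slow pattern: append to an ever larger string
slow = ""
for p in pieces:
    slow += p

# preferred pattern: build a list, join once at the end
fast = "".join(pieces)

assert slow == fast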
I would say that potential issues depend on two things:
how many strings of this kind are held in memory at the same time, compared to the capacity of memory (RAM)?
what are the operations done on these strings ?
It seems to me I've read that string operations in Python are very efficient, so working on very long strings isn't supposed to be a problem in itself. But in fact it depends on the algorithm of each operation performed on a big string.
This answer is rather vague; I don't have enough experience to make a more useful estimate of the problem. But the question is also very broad.

Data structure in python for 2d range counting queries

I need a data structure for doing 2d range counting queries (i.e. how many points are in a given rectangle).
I think my best bet is a range tree (it can count in O(log² n), or even O(log n) after some optimizations). Does that sound like a good choice? Does anybody know of a Python implementation, or will I have to write one myself?
See scipy.spatial.KDTree for one implementation.
There's also a less generic (but occasionally more useful, particularly with regard to what you have in mind) implementation using shapelib's quadtree. See this blog and the corresponding package on PyPI.
There are probably other implementations, too, but those are the two that I've used...
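For the KDTree route, here is a rough sketch of one way to answer a rectangle-count query with scipy (the data set and the rectangle bounds are invented; query_ball_point returns candidates inside a circumscribing circle, which are then filtered exactly):

import numpy as np
from scipy.spatial import KDTree

points = np.random.rand(100000, 2)   # example 2D data set
tree = KDTree(points)

def count_in_rect(x1, x2, y1, y2):
    # Query a ball that circumscribes the rectangle, then filter exactly.
    center = np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])
    radius = np.hypot(x2 - x1, y2 - y1) / 2.0
    candidates = tree.query_ball_point(center, radius)
    pts = points[candidates]
    inside = ((pts[:, 0] >= x1) & (pts[:, 0] <= x2) &
              (pts[:, 1] >= y1) & (pts[:, 1] <= y2))
    return int(inside.sum())

print(count_in_rect(0.2, 0.4, 0.3, 0.6))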
