Dynamically append and delete elements from a vector - python

Is there any way of dynamically adding elements and at the same time removing some of them? Preferably in MATLAB.
For example, let's say I am streaming data from a sensor. Since it will be streaming forever, I would like to keep only the last, say, 100 samples/elements of the vector.

You may try the Queue module in Python (it was renamed to queue in Python 3):
from Queue import Queue
q = Queue()
To enqueue at the back: q.put(x)
To dequeue from the front: q.get()
You may also use deque from the collections module (in case you have more advanced requirements):
from collections import deque
d = deque([])
To enqueue at the back: d.append(x)
To enqueue at the front: d.appendleft(x)
To dequeue from the back: d.pop()
To dequeue from the front: d.popleft()
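For the streaming case in the question (keeping only the last 100 samples), deque's maxlen argument does the trimming for you; here is a minimal sketch, with a plain range standing in for the sensor stream:

from collections import deque

buffer = deque(maxlen=100)     # once full, appends silently drop the oldest element
for sample in range(1000):     # stand-in for samples arriving from a sensor
    buffer.append(sample)

print(len(buffer))             # 100
print(buffer[0], buffer[-1])   # 900 999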

There's no formal queue data structure for this in MATLAB, but the basic case you describe can be implemented quite simply with clever use of indexing and max:
d = []; % Allocate empty array
n = 100; % Max length of buffer/queue
% A loop example for illustration
for i = 1:1e3
    x = rand(1,3); % Data to append
    d = [d(max(end-n+1+length(x),1):end) x]; % Append to end, remove from front if needed
end
The above assumes the appended data, x, is a row vector of length 0 to n. You can easily modify this to append to the front, etc. It could also be turned into a function.
You can also find classes that implement various forms of queues on the MathWorks File Exchange, e.g., this one.

Which is better: deque or list slicing?

If I use the code
from collections import deque
q = deque(maxlen=2)
while step <= step_max:
    calculate(item)
    q.append(item)
    another_calculation(q)
how does it compare in efficiency and readability to
q = []
while step <= step_max:
    calculate(item)
    q.append(item)
    q = q[-2:]
    another_calculation(q)
calculate() and another_calculation() are not real in this case but in my actual program are simply two calculations. I'm doing these calculations every step for millions of steps (I'm simulating an ion in 2-d space). Because there are so many steps, q gets very long and uses a lot of memory, while another_calculation() only uses the last two values of q. I had been using the latter method, then heard deque mentioned and thought it might be more efficient; thus the question.
I.e., how do deques in Python compare to plain list slicing?
q = q[-2:]
This is a costly operation because it recreates a list every time (and copies the references). (A nasty side effect is that it also rebinds the name q, although you can use q[:] = q[-2:] to avoid that.)
The deque object just advances its internal pointer and "forgets" the oldest item, so it's faster, and this is one of the use cases it was designed for.
Of course, for 2 values there isn't much difference, but for a bigger number there is.
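A quick way to see the behavioural difference (a minimal sketch; the numbers are just placeholders):

from collections import deque

q = deque(maxlen=2)
for item in range(5):
    q.append(item)        # old items fall off the left automatically
print(q)                  # deque([3, 4], maxlen=2)

q2 = []
for item in range(5):
    q2.append(item)
    q2 = q2[-2:]          # rebuilds a fresh 2-element list on every step
print(q2)                 # [3, 4]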
If I interpret your question correctly, you have a function that calculates a value, and you want to do another calculation with this value and the previous one. The best way is to use two variables:
previous_item = None  # or whatever initial value makes sense before the first step
while step <= step_max:
    item = calculate()
    another_calculation(previous_item, item)
    previous_item = item
If the calculations are some form of vector math, you should consider using numpy.

Using a dictionary to index parallel arrays?

I have 4 parallel arrays based on a table representing attributes of a map. Each array has approx. 500 values, but all have the same number of values.
The arrays are:
start = location of the endpoint with the smaller flow accumulation,
end = location of the other endpoint (with the larger flow accumulation),
length = segment length, and
shape = actual shape, oriented to run from start to end.
I am attempting to create a data structure on which I can use a recursive function to determine the start and end points every 2000 m along the length.
The following question and answer describe what I am attempting to accomplish:
https://gis.stackexchange.com/questions/87649/select-points-approx-2000-metres-from-another-point-along-a-river
How do I store these 4 parallel arrays in a dictionary keyed by start?
I am new to writing functions, dictionaries and using arrays in dictionaries. I am attempting to do this task in Python.
I think this is what you mean:
d = {}
for i in range(len(start)):
    d[start[i]] = (shape[i], length[i], end[i])
So now d[some_start_value] will hold the corresponding shape, length, and end values.
If you want to do things a little bit more Python-esque, you can use enumerate:
d = {}
for (i, st) in enumerate(start):
    d[st] = (shape[i], length[i], end[i])
or even better - zip:
d = {}
for (st, sh, le, en) in zip(start, shape, length, end):
    d[st] = (sh, le, en)
Note that you can leave out the parentheses around the first part of the for loops (i.e. between the for and in keywords). I used them solely for enhanced code readability.
As with WeaselFox's answer, d[some_start_value] will now hold the corresponding shape, length and end values.
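For example, with some made-up sample data (hypothetical values, just to show the lookup):

start  = [10.0, 20.0]
shape  = ['LINESTRING A', 'LINESTRING B']
length = [1500.0, 2500.0]
end    = [20.0, 35.0]

d = {st: (sh, le, en) for st, sh, le, en in zip(start, shape, length, end)}
sh, le, en = d[10.0]   # -> ('LINESTRING A', 1500.0, 20.0)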
In addition to the above answers, I would recommend using namedtuple to simplify accesses:
from collections import namedtuple
# This creates a namedtuple called GISData. Name of the object and name in the first argument
# should be the same.
GISData = namedtuple('GISData', 'start shape length end')
# zip creates 1 list of 4-tuples from 4 single lists
# There are other ways to write this; this is just the shortest for me.
# Note that if you need this ordered, you should use an OrderedDict,
# which is in the collections module in python 2.7+, or you can find
# backported versions for python 2.6+. In those, the keys preserve ordering,
# so can still be searched as a list, which is useful if you need to find e.g.
# 479, which is not in the dictionary, but 400 and 500 are and you have to interpolate etc.
GISDict = dict((x[0], GISData(*x)) for x in zip(start, shape, length, end))
# The dictionary for any given start value
# Access the 4 individual pieces by name, or by index
GISDict[start_lookup].shape
etc.
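As a rough sketch of the ordered lookup mentioned in the comments above (using the standard bisect module; 479, 400 and 500 are just the illustrative numbers from the comment, and the lookup value is assumed to fall strictly inside the range of stored start values):

import bisect

keys = sorted(GISDict)               # sorted start values
i = bisect.bisect_left(keys, 479)    # insertion point for 479 in the sorted keys
lower, upper = keys[i - 1], keys[i]  # e.g. 400 and 500, the neighbours to interpolate between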

Initialize Multiple Numpy Arrays (Multiple Assignment) - Like MATLAB deal()

I was unable to find anything describing how to do this, which leads me to believe I'm not doing this in the proper idiomatic Python way. Advice on the 'proper' Python way to do this would also be appreciated.
I have a bunch of variables for a datalogger I'm writing (arbitrary logging length, with a known maximum length). In MATLAB, I would initialize them all as 1-D arrays of zeros of length n, n bigger than the number of entries I would ever see, assign each individual element variable(measurement_no) = data_point in the logging loop, and trim off the extraneous zeros when the measurement was over. The initialization would look like this:
[dData gData cTotalEnergy cResFinal etc] = deal(zeros(n,1));
Is there a way to do this in Python/NumPy so I don't either have to put each variable on its own line:
dData = np.zeros(n)
gData = np.zeros(n)
etc.
I would also prefer not to just make one big matrix, because keeping track of which column is which variable is unpleasant. Perhaps the solution is to make the (length x numvars) matrix, and assign the column slices out to individual variables?
EDIT: Assume I'm going to have a lot of vectors of the same length by the time this is over; e.g., my post-processing takes each log file, calculates a bunch of separate metrics (>50), stores them, and repeats until the logs are all processed. Then I generate histograms, means/maxes/sigmas/etc. for all the various metrics I computed. Since initializing 50+ vectors is clearly not easy in Python, what's the best (cleanest code and decent performance) way of doing this?
If you're really motivated to do this in a one-liner you could create an (n_vars, ...) array of zeros, then unpack it along the first dimension:
a, b, c = np.zeros((3, 5))
print(a is b)
# False
Another option is to use a list comprehension or a generator expression:
a, b, c = [np.zeros(5) for _ in range(3)] # list comprehension
d, e, f = (np.zeros(5) for _ in range(3)) # generator expression
print(a is b, d is e)
# False False
Be careful, though! You might think that using the * operator on a list or tuple containing your call to np.zeros() would achieve the same thing, but it doesn't:
h, i, j = (np.zeros(5),) * 3
print(h is i)
# True
This is because the expression inside the tuple gets evaluated first. np.zeros(5) therefore only gets called once, and each element in the repeated tuple ends up being a reference to the same array. This is the same reason why you can't just use a = b = c = np.zeros(5).
Unless you really need to assign a large number of empty array variables and you really care deeply about making your code compact (!), I would recommend initialising them on separate lines for readability.
Nothing wrong or un-Pythonic with
dData = np.zeros(n)
gData = np.zeros(n)
etc.
You could put them on one line, but there's no particular reason to do so.
dData, gData = np.zeros(n), np.zeros(n)
Don't try dData = gData = np.zeros(n), because a change to dData changes gData (they point to the same object). For the same reason you usually don't want to use x = y = [].
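A quick illustration of that aliasing (minimal sketch):

import numpy as np

dData = gData = np.zeros(3)   # both names refer to the same array
dData[0] = 1.0
print(gData)                  # [1. 0. 0.]  -- gData changed too
print(dData is gData)         # True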
The deal in MATLAB is a convenience, but isn't magical. Here's how Octave implements it
function [varargout] = deal (varargin)
  if (nargin == 0)
    print_usage ();
  elseif (nargin == 1 || nargin == nargout)
    varargout(1:nargout) = varargin;
  else
    error ("deal: nargin > 1 and nargin != nargout");
  endif
endfunction
In contrast to Python, in Octave (and presumably MATLAB)
one=two=three=zeros(1,3)
assigns different objects to the 3 variables.
Notice also how MATLAB talks about deal as a way of assigning contents of cells and structure arrays. http://www.mathworks.com/company/newsletters/articles/whats-the-big-deal.html
If you put your data in a collections.defaultdict you won't need to do any explicit initialization. Everything will be initialized the first time it is used.
import numpy as np
import collections

n = 100
data = collections.defaultdict(lambda: np.zeros(n))
for i in range(1, n):
    data['g'][i] = data['d'][i - 1]
    # ...
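When logging ends early, the unused tail of zeros can be trimmed afterwards, for example (a sketch, assuming a hypothetical measurement_count holds the number of points actually written):

measurement_count = 42   # hypothetical: how many samples were actually logged
for key in data:
    data[key] = data[key][:measurement_count]   # drop the extraneous zeros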
How about using map:
import numpy as np
n = 10 # Number of data points per array
m = 3 # Number of arrays being initialised
gData, pData, qData = map(np.zeros, [n] * m)

Python speeding up the search for a value in a dictionary of ranges

I have a file with a column of values I would like to use to compare with a dictionary that contains two values that together form a range.
for instance:
File A:
Chr1 200 ....
Chr3 300
File B:
Chr1 200 300 ...
Chr2 300 350 ...
For now I created a dictionary of values for File B:
for Line in FileB:
    LineB = Line.strip('\n').split('\t')
    Chr = LineB[0]   # chromosome is the first column
    Ranges[Chr].append(LineB)
For the comparison:
for Line in MethylationFile:
    Line = Line.strip("\n")
    Info = Line.split("\t")
    Chr = Info[0]
    Location = int(Info[1])
    Annotation = ""
    for i, r in enumerate(Ranges[Chr]):
        n = i + 1
        while (n < len(Ranges[Chr])):
            if (int(Ranges[Chr][i][1]) <= Location <= int(Ranges[Chr][i][2])):
                Annotation = '\t'.join(Ranges[Chr][i][4:])
            n += 1
    OutFile.write(Line + '\t' + Annotation + '\n')
If I leave the while loop in, the program does not seem to run (or is probably running too slowly to produce results), since I have over 7,000 values in each dictionary. If I change the while loop to an if statement, the program runs, but at an incredibly slow pace.
I'm looking for a way to make this program faster and more efficient.
Dictionaries are great when you want to look up a key by exact match. In particular, the hash of the lookup key has to be the same as the hash of the stored key.
If your ranges are consistent, you could fake this by writing a hash function that returns the same value for a range, and for every value within that range. But if they're not, this hash function would have to keep track of all of the known ranges, which takes you back to the same problem you're starting with.
In that case, the right data structure here is probably some kind of sorted collection. If you only need to build up the collection, and then use it many times without ever modifying it, just sorting a list and using the bisect module will do it for you. If you need to modify the collection after creation, you'll want something built around a binary tree or B-tree variant of some kind, like blist or bintrees.
This will reduce the time to find a range from N/2 to log2(N). So, if you've got 10000 ranges, instead of 5000 comparisons, you'll do 14.
While we're at it, it would help to convert the range start and stop values to ints once, instead of doing it each time. Also, if you want to use the stdlib bisect, you unfortunately can't pass a key to most functions, so let's reorganize the ranges into comparable order too. So:
for Line in FileB:
    LineB = Line.strip('\n').split('\t')
    Chr = LineB[0]
    Ranges[Chr].append((int(LineB[1]), int(LineB[2]), LineB))   # (start, stop, full line)
for r in Ranges.values():
    r.sort()
Now, instead of this loop:
for i, r in enumerate(Ranges[Chr]):
# ...
Do this:
import bisect

i = bisect.bisect(Ranges[Chr], (Location, Location, None))
if i:
    r = Ranges[Chr][i - 1]
    if r[0] <= Location < r[1]:
        # do whatever you wanted with r
    else:
        # there is no range that includes Location
else:
    # Location is before all ranges
You have to be careful thinking about bisect, and it's possible I've got this wrong on the first attempt, so… read the docs on what it does, and experiment with your data (printing out the results of the bisect function), before trusting this.
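For instance, a toy run along those lines might look like this (made-up ranges, just to see what bisect returns):

import bisect

ranges = [(100, 200), (300, 400), (500, 600)]   # already sorted by start
for Location in (150, 250, 350):
    i = bisect.bisect(ranges, (Location, Location))
    print(Location, '->', i, ranges[i - 1] if i else None)
# 150 -> 1 (100, 200)   inside the first range
# 250 -> 1 (100, 200)   candidate found, but 250 >= 200, so no match
# 350 -> 2 (300, 400)   inside the second range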
If your ranges can overlap, and you want to be able to find all ranges that contain a value rather than just one, you'll need a bit more than this to keep things efficient. There's no way to fully-order overlapping ranges, so bisect won't cut it.
If you're expecting more than log N matches per average lookup, you can do it with two sorted lists and bisect.
But otherwise, you need a more complex data structure, and more complex code. For example, if you can spare N^2 space, you can keep the time at log N by having, for each range in the first list, a second list, sorted by end, of all the values with a matching start.
And at this point, I think it's getting complex enough that you want to look for a library to do it for you.
However, you might want to consider a different solution.
If you use numpy or a database instead of pure Python, this can't cut the algorithmic complexity from N to log N… but it can cut the constant overhead by a factor of 10 or so, which may be good enough. In fact, if you're doing tons of searches on a medium-small list, it may even be better.
Plus, it looks a lot simpler, and once you get used to array operations or SQL, it may even be more readable. So:
RangeArrays = [np.array([a[:2] for a in value]) for value in Ranges]
… or, if Ranges is a dict mapping strings to values, instead of a list:
RangeArrays = {key: np.array([a[:2] for a in value]) for key, value in Ranges.items()}
Then, instead of this:
for i, r in enumerate(Ranges[Chr]):
# ...
Do:
comparisons = Location < RangeArrays[Chr]
matches = comparisons[:, 0] < comparisons[:, 1]   # True where start <= Location < stop
indices = matches.nonzero()[0]
for index in indices:
    r = Ranges[Chr][index]
    # Do stuff with r
(You can of course make things more concise, but it's worth doing it this way and printing out all of the intermediate steps to see why it works.)
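For example, with one chromosome's ranges as a small made-up array, the intermediate steps look like this (a sketch):

import numpy as np

RangeArray = np.array([[100, 200], [300, 400], [500, 600]])   # made-up (start, stop) rows
Location = 350

comparisons = Location < RangeArray
# [[False False]
#  [False  True]
#  [ True  True]]
matches = comparisons[:, 0] < comparisons[:, 1]
# [False  True False]  -> True where start <= Location < stop
indices = matches.nonzero()[0]
# [1]  -> the (300, 400) range contains 350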
Or, using a database:
cur = db.execute('''SELECT Start, Stop, Chr FROM Ranges
                    WHERE Start <= ? AND Stop > ?''', (Location, Location))
for (Start, Stop, Chr) in cur:
    # do stuff
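Setting that up with the sqlite3 module from the standard library might look roughly like this (the table and column names are simply the ones used above; the index is an assumption that speeds up the range query):

import sqlite3

db = sqlite3.connect(':memory:')   # or a file on disk
db.execute('CREATE TABLE Ranges (Chr TEXT, Start INTEGER, Stop INTEGER)')
db.execute('CREATE INDEX idx_ranges_start ON Ranges (Chr, Start)')
db.executemany('INSERT INTO Ranges VALUES (?, ?, ?)',
               [('Chr1', 200, 300), ('Chr2', 300, 350)])   # sample rows from File B
db.commit()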

How can I pop() lots of elements from a deque?

I have a deque object that holds a large amount of data. I want to extract, say, 4096 elements from the front of the queue (I'm using it as a kind of FIFO). It seems like there should be a way of doing this without having to iterate over 4096 pop requests.
Is this correct/efficient/stupid?
from numpy import arange
from collections import deque

A = arange(100000)
B = deque()
C = []              # a list will do
B.extend(A)         # nice large deque
# extract 4096 elements
for i in xrange(4096):
    C.append(B.popleft())   # pop from the deque, not the array
There is no multi-pop method for deques. You're welcome to submit a feature request to bugs.python.org and I'll consider adding it.
I don't know the details of your use case, but if your data comes in blocks of 4096, consider storing the blocks in tuples or lists and then adding the blocks to the deque:
block = data[:4096]
d.append(block)
...
someblock = d.popleft()
Where you're using a deque, the .popleft() method is really the best way of getting elements off the front. You can index into it, but index performance degrades toward the middle of the deque (as opposed to a list, which has quick indexed access but slow pops). You could get away with this, though (it saves a few lines of code):
A = arange(100000)
B = deque(A)
C = [B.popleft() for _i in xrange(4096)]
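If you do this in several places, wrapping it in a small helper (popleft_many is a hypothetical name, not a deque method) also guards against asking for more items than the deque holds:

def popleft_many(d, k):
    # Pop up to k items off the front of deque d and return them as a list.
    return [d.popleft() for _ in range(min(k, len(d)))]

C = popleft_many(B, 4096)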
