I have a list of numbers, e.g. [50,100,150,200,250]. I need to increment (or decrement) each number from a specified index and by a specified amount. I have been able to do this in two ways:
from itertools import islice
l = [50,100,150,200,250]
start_increment_index = 3
l[start_increment_index:] = [e+100 for e in l[start_increment_index:]]
print (l)
l = [50,100,150,200,250]
l[start_increment_index:] = [e+100 for e in islice(l,start_increment_index,len(l))]
print (l)
Both print: [50, 100, 150, 300, 350].
However, my real list contains millions of numbers and this operation is performed repeatedly with different indexes and different increments/decrements. Would there be a faster way of doing this using a Python list? I have been considering writing my own C/C++ extension to deal with this.
Edit: Would this be a useful module for Python in general? Having a function written in C which can take parameters (python_list_object, increment_amount, start_index, end_index)?
The main problem with your solution is that it creates two lists (allocating memory and copying each time): the first is the list comprehension itself, and the second is the slice l[start_increment_index:] inside it.
If your data source is a Python list, you can do the operation in place in O(n), without any temporary list:
for i in range(start_increment_index, len(l)):
    l[i] += increment
NB: define increment first.
It depends on your specific goals. I suppose you could use a segment tree for this case. For more information see https://en.m.wikipedia.org/wiki/Segment_tree.
As a brief description: this structure represents an array on which range operations are performed (such as adding a number to, or subtracting it from, every element of a subarray). It is optimized for the case where you have a very large number of such range queries.
Note: if you want to use only the Python list structure, you can store the tree implicitly in a flat list, as an array-backed segment tree or a Fenwick (binary indexed) tree does.
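As a rough illustration of the idea of keeping the tree implicitly in a plain Python list, here is a minimal sketch. It uses a Fenwick (binary indexed) tree rather than a full segment tree, because it is shorter. The class name RangeAdder and its methods are made up for this example. It assumes you only need to read individual elements between updates, and that indexes are non-negative.

```python
class RangeAdder:
    """Range add / point query in O(log n), backed by plain Python lists.

    Hypothetical helper: a Fenwick tree over the *differences* between
    the current values and the original ones.
    """

    def __init__(self, values):
        self._base = list(values)             # original values, never modified
        self._tree = [0] * (len(values) + 1)  # 1-based Fenwick tree of deltas

    def _add(self, pos, amount):
        # Record "everything from pos onwards changes by amount".
        i = pos + 1
        while i < len(self._tree):
            self._tree[i] += amount
            i += i & -i

    def add_from(self, start, amount, end=None):
        """Add amount to every element in [start, end); end=None means to the end."""
        self._add(start, amount)
        if end is not None and end < len(self._base):
            self._add(end, -amount)

    def __getitem__(self, index):
        # Current value = original value + sum of all deltas covering index.
        total = self._base[index]
        i = index + 1
        while i > 0:
            total += self._tree[i]
            i -= i & -i
        return total


numbers = RangeAdder([50, 100, 150, 200, 250])
numbers.add_from(3, 100)
print([numbers[i] for i in range(5)])  # [50, 100, 150, 300, 350]
```

Each update touches O(log n) entries instead of n, at the price of O(log n) per read, so it only pays off when updates are frequent compared to full scans of the list.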
I have a file with millions of lines, each of which is a list of integers (these sublists are in the range of tens to hundreds of items). What I want is to read through the file contents once and create 3 numpy arrays -- one with the average of each sublist, one with the length of each sublist, and one which is a flattened list of all the values in all the sublists.
If I just wanted one of these things, I'd do something like:
counts = np.fromiter((len(json.loads(line.rstrip())) for line in mystream), int)
but if I write 3 of those, my code would iterate through my millions of sublists 3 times, and I obviously only want to iterate through them once. So I want to do something like this:
averages = []
counts = []
allvals = []
for line in mystream:
    sublist = json.loads(line.rstrip())
    averages.append(np.average(sublist))
    counts.append(len(sublist))
    allvals.extend(sublist)
I believe that creating regular lists as above and then doing
np_averages = np.array(averages)
is very inefficient (basically creating the list twice). What is the right/efficient way to iteratively create a numpy array if it's not practical to use fromiter? Or do I want to create a function that returns the 3 values and do something like a list comprehension over a function with multiple return values, with fromiter instead of a traditional list comprehension?
Or would it be efficient to create a 2D array of
[[count1, average1, sublist1], [count2, average2, sublist2], ...] and then do additional operations to slice off (and in the 3rd case also flatten) the columns as their own 1D arrays?
First of all, the json library is not the most optimized library for that. You can use the pysimdjson package, based on the optimized simdjson library, to speed up the computation. For small integer lists, it is about twice as fast on my machine.
Moreover, Numpy functions are not great for relatively small arrays, as they introduce a pretty big overhead. For example, np.average takes about 8-10 us on my machine to compute the average of an array of 20 items. Meanwhile, sum(sublist)/len(sublist) only takes 0.25-0.30 us.
Finally, np.array needs to iterate twice to convert the list into an array because it does not know the type of all the objects. You can specify the type to make the conversion faster: np.array(averages, np.float64).
Here is a significantly faster implementation:
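import numpy as np
import simdjson
averages = []
counts = []
allvals = []
for line in mystream:
    sublist = simdjson.loads(line.rstrip())
    # Plain Python arithmetic is cheaper than np.average for short lists
    averages.append(sum(sublist) / len(sublist))
    counts.append(len(sublist))
    allvals.extend(sublist)
# Passing the dtype avoids a second pass to infer it
np_averages = np.array(averages, np.float64)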
One issue with this implementation is that allvals will contain all the values as one big list of objects. CPython objects are quite large in memory compared to native Numpy integers (especially 32-bit, i.e. 4-byte, integers): each object usually takes 32 bytes and its reference in the list usually takes 8 bytes, giving 40 bytes per item, that is, 10 times more than a Numpy array of 32-bit integers. Thus, it may be better to use a native implementation, possibly based on Cython.
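Short of writing native code, one way to avoid the big list of Python objects is to accumulate into typed arrays from the standard-library array module, which store raw machine values. The sketch below is only an illustration: the function name summarize is made up, it uses the standard json module so that it runs without extra packages, and it assumes every value fits in a signed 64-bit integer and that no sublist is empty.

```python
import json
from array import array


def summarize(lines):
    """Single pass over the lines; returns (averages, counts, allvals).

    Hypothetical helper. Typed arrays store 8 bytes per value instead of
    one Python object plus one reference per value.
    """
    averages = array("d")  # C doubles
    counts = array("q")    # signed 64-bit integers
    allvals = array("q")
    for line in lines:
        sublist = json.loads(line)
        counts.append(len(sublist))
        averages.append(sum(sublist) / len(sublist))
        allvals.extend(sublist)
    return averages, counts, allvals


averages, counts, allvals = summarize(["[1, 2, 3]\n", "[10, 20]\n"])
print(list(averages), list(counts), list(allvals))
```

Because these arrays expose the buffer protocol, they can afterwards be viewed as Numpy arrays without copying, for example with np.frombuffer(allvals, dtype=np.int64).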
If I use the code
from collections import deque
q = deque(maxlen=2)
while step <= step_max:
    calculate(item)
    q.append(item)
    another_calculation(q)
how does it compare in efficiency and readability to
q = []
while step <= step_max:
    calculate(item)
    q.append(item)
    q = q[-2:]
    another_calculation(q)
calculate() and another_calculation() are not real in this case but in my actual program are simply two calculations. I'm doing these calculations every step for millions of steps (I'm simulating an ion in 2-d space). Because there are so many steps, q gets very long and uses a lot of memory, while another_calculation() only uses the last two values of q. I had been using the latter method, then heard deque mentioned and thought it might be more efficient; thus the question.
I.e., how do deques in python compare to just normal list slicing?
q = q[-2:]
This is a costly operation because it creates a new list every time (and copies the references). (A nasty side effect is that it rebinds q to a new object, although you can use q[:] = q[-2:] to avoid that.)
A deque with maxlen just advances its start pointer and "forgets" the oldest item. So it's faster, and this is one of the uses it was designed for.
Of course, for 2 values there isn't much difference, but for a larger number there is.
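A small demonstration of that "forgetting" behaviour, with integers standing in for the simulation items:

```python
from collections import deque

q = deque(maxlen=2)
for item in range(5):
    q.append(item)  # once full, appending silently discards the oldest item

print(list(q))  # [3, 4]
```

The deque never holds more than maxlen items, so memory use stays constant however many steps the loop runs.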
If I interpret your question correctly, you have a function, that calculates a value, and you want to do another calculation with this and the previous value. The best way is to use two variables:
while step <= step_max:
    item = calculate()
    another_calculation(previous_item, item)
    previous_item = item
If the calculations are some form of vector math, you should consider using numpy.
I have a very large (say a few thousand) list of partitions, something like:
[[9,0,0,0,0,0,0,0,0],
[8,1,0,0,0,0,0,0,0],
...,
[1,1,1,1,1,1,1,1,1]]
What I want to do is apply to each of them a function (which outputs a small number of partitions), then put all the outputs in a list and remove duplicates.
I am able to do this, but the problem is that my computer gets very slow if I put the above list directly into the Python file (esp. when scrolling). What is making it slow? If it is the memory used to load the whole list, is there a way to put the partitions in another file and have the function read the list term by term?
EDIT: I am adding some code. My code is probably very inefficient because I'm quite an amateur. So what I really have is a list of lists of partitions, that I want to add to:
listofparts3 = [[[3],[2,1],[1,1,1]],
[[6],[5,1],...,[1,1,1,1,1,1]],...]
def addtolist3(n):
    a=int(n/3)-2
    counter = 0
    added = []
    for i in range(len(listofparts3[a])):
        first = listofparts3[a][i]
        if len(first)<n:
            for i in range(n-len(first)):
                first.append(0)
        answer = lowering1(fock(first),-2)[0]
        for j in range(len(answer)):
            newelement = True
            for k in range(len(added)):
                if (answer[j]==added[k]).all():
                    newelement = False
                    break
            if newelement==True:
                added.append(answer[j])
        print(counter)
        counter = counter+1
    for i in range(len(added)):
        added[i]=partition(added[i]).tolist()
    return(added)
fock, lowering1 and partition are all functions defined in earlier code; they are pretty simple. The above function, say addtolist3(24), takes all the partitions of 21 that I have and returns the desired list of partitions of 24, which I can then append to the end of listofparts3.
A few thousand partitions use only a modest amount of memory, so that likely isn't the source of your problem.
One way to speed up function application is to use map() in Python 3 or itertools.imap() in Python 2.
The fastest way to eliminate duplicates is to feed them into a Python set() object. Set members must be hashable, so convert each partition from a list to a tuple first.
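For instance, a minimal sketch of set-based deduplication. The function name unique_partitions and the sample data are made up for illustration. Lists are not hashable, so each partition is converted to a tuple before it goes into the set.

```python
def unique_partitions(partitions):
    """Return the distinct partitions as tuples, largest first (hypothetical helper)."""
    return sorted(set(map(tuple, partitions)), reverse=True)


outputs = [[2, 1, 0], [3, 0, 0], [2, 1, 0], [1, 1, 1]]
print(unique_partitions(outputs))  # [(3, 0, 0), (2, 1, 0), (1, 1, 1)]
```

This replaces the nested loop that compares every new element against everything already added, which is quadratic in the number of outputs.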
I want to eliminate extremes from a list of integers in Python. I'd say that my problem is one of design. Here's what I cooked up so far:
listToTest = [120,130,140,160,200]
def function(l):
    length = len(l)
    for x in xrange(0,length - 1):
        if l[x] < (l[x+1] - l[x]) * 4:
            l.remove(l[x+1])
    return l
print function(listToTest)
So the output of this should be: 120,130,140,160 without 200, since that's way too far ahead from the others.
And this works, given 200 is the last one or there's only one extreme. Though, it gets problematic with a list like this:
listToTest = [120,200,130,140,160,200]
Or
listToTest = [120,130,140,160,200,140,130,120,200]
So, the output for the last list should be: 120,130,140,160,140,130,120. 200 should be gone, since it's a lot bigger than the "usual", which revolved around ~130-140.
To illustrate it, here's an image:
Obviously, my method doesn't work. Some thoughts:
- I need to somehow compare x and x+1 and see whether the next pair has a bigger difference than the last pair; if it does, the pair with the bigger difference should have one element eliminated (the biggest one), and then this should be done again recursively. I think I should also have an "acceptable difference", so it knows when the difference is acceptable and the recursion doesn't continue until I end up with only 2 values.
I tried writing it, but no luck so far.
You can use statistics here, eliminating values that fall beyond n standard deviations from the mean:
import numpy as np
test = [120,130,140,160,200,140,130,120,200]
n = 1
mean, std = np.mean(test), np.std(test)  # compute once, not once per element
output = [x for x in test if abs(x - mean) < std * n]
# output is [120, 130, 140, 160, 140, 130, 120]
Your problem statement is not clear. If you simply want to remove the max and min, that is a simple O(N) operation with 2 extra variables, i.e. O(1) extra memory. This is achieved by retaining the current min/max value and comparing it to each entry in the list in turn.
If you want the min/max K items, it is an O(N log K) operation with O(K) extra memory. This is achieved with two priority queues of size K: one for the mins, one for the maxes.
Or did you intend a different output/outcome from your algorithm?
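In Python, the two size-K priority queues are available ready-made as heapq.nsmallest and heapq.nlargest. Here is a minimal sketch of trimming the K smallest and K largest values while keeping the original order. The function name trim_extremes is made up for this example, and it assumes 2*K is no larger than the length of the list.

```python
import heapq
from collections import Counter


def trim_extremes(values, k):
    """Drop the k smallest and k largest values, preserving order (hypothetical helper)."""
    # Count how many copies of each extreme value have to go.
    to_drop = Counter(heapq.nsmallest(k, values)) + Counter(heapq.nlargest(k, values))
    kept = []
    for value in values:
        if to_drop[value] > 0:
            to_drop[value] -= 1  # consume one pending removal
        else:
            kept.append(value)
    return kept


data = [120, 130, 140, 160, 200, 140, 130, 120, 200]
print(trim_extremes(data, 2))  # [130, 140, 160, 140, 130]
```

Note that this removes a fixed number of values from each end regardless of how extreme they are, which is why it trims the two 120s here as well.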
UPDATE: the OP has updated the question; it appears they want a moving (windowed) average and to delete outliers.
The following is an online algorithm, i.e. it can handle streaming data: http://en.wikipedia.org/wiki/Online_algorithm
We can retain a moving average: let's say you keep K entries for the average.
Create a linked list of size K with pointers to its head and tail. Handling the first K entries needs to be thought out separately. After the first K retained items, the algorithm can proceed as follows:
Check the next item in the input list against the running K-average. If the value exceeds the acceptable ratio threshold, put its list index into a separate "deletion queue" list. Otherwise, update the running windowed sum as follows:
(a) remove the head entry from the linked list and subtract its value from the running sum
(b) add the latest list entry as the tail of the linked list and add its value to the running sum
(c) recalculate the running average as the running sum / K
Now, how to handle the first K entries, i.e. before we have a properly initialized running sum?
You will need to make some hard-coded decisions here. One possibility:
Run through the first K+2D (D << K) entries.
Keep track of the D max and D min values.
Remove those D max and D min values from that list.
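The windowed approach described above could be sketched roughly as follows. The function name drop_outliers and the default values for k and max_ratio are assumptions made for this example. To keep it short, it simply accepts the first k entries as they are instead of trimming them, returns the kept values rather than building a deletion queue of indexes, and uses collections.deque as the linked list.

```python
from collections import deque


def drop_outliers(values, k=3, max_ratio=1.3):
    """Drop values above max_ratio times the running average of the last k kept values.

    Hypothetical sketch; the first k values are accepted unconditionally.
    """
    window = deque(values[:k])
    running_sum = sum(window)
    kept = list(window)
    for value in values[k:]:
        average = running_sum / len(window)
        if value > average * max_ratio:
            continue  # outlier: skip it and leave the window untouched
        # (a) drop the oldest entry, (b) add the newest, updating the sum
        running_sum += value - window.popleft()
        window.append(value)
        kept.append(value)
    return kept


data = [120, 130, 140, 160, 200, 140, 130, 120, 200]
print(drop_outliers(data))  # [120, 130, 140, 160, 140, 130, 120]
```

The result depends heavily on the chosen window size and threshold, and an outlier among the first k entries would pass unnoticed in this simplified version.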
I have 4 parallel arrays based on a table representing attributes of a map. Each array has approx. 500 values, but all have the same number of values.
The arrays are:
start = location of the endpoint with the smaller flow accumulation,
end = location of the other endpoint (with the larger flow accumulation),
length = segment length, and;
shape = actual shape, oriented to run from start to end.
I am attempting to create a data structure on which I can use a recursive function to determine the start and end points every 2000 m along the length.
The following question and answer describe what I am attempting to accomplish:
https://gis.stackexchange.com/questions/87649/select-points-approx-2000-metres-from-another-point-along-a-river
How do I store these 4 parallel arrays in a dictionary keyed by start?
I am new to writing functions, dictionaries and using arrays in dictionaries. I am attempting to do this task in Python.
I think this is what you mean:
d = {}
for i in range(len(start)):
    d[start[i]] = (shape[i],length[i],end[i])
So now d[some_start_value] will hold the corresponding shape, length and end values.
If you want to do things a little bit more Python-esque, you can use enumerate:
d = {}
for (i,st) in enumerate(start):
    d[st] = (shape[i],length[i],end[i])
or even better - zip:
d = {}
for (st,sh,le,en) in zip(start,shape,length,end):
    d[st] = (sh,le,en)
Note that you can leave out the parentheses around the first part of the for loops (i.e. between the for and in keywords). I used them solely for enhanced code readability.
As with WeaselFox's answer, d[some_start_value] will now hold the corresponding shape, length and end values.
In addition to the above answers, I would recommend using namedtuple to simplify accesses:
from collections import namedtuple
# This creates a namedtuple called GISData. Name of the object and name in the first argument
# should be the same.
GISData = namedtuple('GISData', 'start shape length end')
# zip creates 1 list of 4-tuples from 4 single lists
# There are other ways to write this; this is just the shortest for me.
# Note that if you need this ordered, you should use an OrderedDict,
# which is in the collections module in python 2.7+, or you can find
# backported versions for python 2.6+. In those, the keys preserve ordering,
# so can still be searched as a list, which is useful if you need to find e.g.
# 479, which is not in the dictionary, but 400 and 500 are and you have to interpolate etc.
GISDict = dict((x[0], GISData(*x)) for x in zip(start, shape, length, end))
# The dictionary for any given start value
# Access the 4 individual pieces by name, or by index
GISDict[start_lookup].shape
etc.