sum elements of list under conditions of second list - python

I'm trying to add up certain elements of two related lists. I'll give an example so you can see what I mean, and at the end I include the code I have. It works, but I want to optimize it; otherwise I have to write lots of things by hand. Apologies if the question is not interesting.
list1 = [4.0, 8.0, 14.0, 20.0, 22.0, 26.0, 28.0, 30.0, 32.0, 34.0, 36.0, 38.0, 40.0]
list2 = [2.1, 1.8, 9.5, 5., 5.4, 6.7, 3.3, 5.3, 8.8, 9.4, 5., 9.3, 3.1]
List 1 corresponds to time, and what I want to do is cluster everything into bins of 10 [units of time]. From list1 I can see that the first and second elements belong to the range 0-10, so I need to add their corresponding values in list2. Next, the third and fourth elements of list1 belong to the range (10 < time <= 20), so I add the corresponding elements of list2; for the third range I need to add the following 4 elements of list2, and so on. In the end I would like to create 2 new lists:
list3 = [10., 20., 30., 40.]
list4 = [3.9, 14.5, 20.7, 35.6]
The code I wrote is the following:
import numpy

list1 = [4.0, 8.0, 14.0, 20.0, 22.0, 26.0, 28.0, 30.0, 32.0, 34.0, 36.0, 38.0, 40.0]
list2 = [2.1, 1.8, 9.5, 5., 5.4, 6.7, 3.3, 5.3, 8.8, 9.4, 5., 9.3, 3.1]

list3 = numpy.arange(10., 50., 10.)  # [10., 20., 30., 40.]
a = [[] for i in range(4)]
for i, j in enumerate(list1):
    if 0. <= j <= 10.:
        a[0].append(list2[i])
    elif 10. < j <= 20.:
        a[1].append(list2[i])
    elif 20. < j <= 30.:
        a[2].append(list2[i])
    elif 30. < j <= 40.:
        a[3].append(list2[i])
list4 = [sum(i) for i in a]
It works; however, list1 is in reality much larger (by a few orders of magnitude), and I don't want to write all the ifs (and the sublists) by hand. Any suggestions will be appreciated.

First of all, if we are talking about huge data sets, I would use numpy, pandas, or another tool designed for this. In my experience, plain Python is not well suited to working with more than ~10M elements (unless there is structure in the data you can exploit).
With numpy we can do this as follows:
import numpy as np

# convert the lists to numpy arrays
l1 = np.array(list1)
l2 = np.array(list2)
# determine the "group" of each value (bins of width 10, right-inclusive)
g = (l1 - 0.00001) // 10
# create a boolean mask that flags where the group changes
flag = np.concatenate(([True], g[1:] != g[:-1]))
# determine the indices where a new group starts
inv_idx, = flag.nonzero()
# calculate the sum per subrange
result = np.add.reduceat(l2, inv_idx)
For your sample output, this gives:
>>> result
array([ 3.9, 14.5, 20.7, 35.6])
The 0.00001 is used to push a 20.0 down to roughly 19.99999 and thus assign it to group 1 instead of group 2. The advantages of this approach are that (a) it works for an arbitrary number of "groups", and (b) a fixed number of sweeps is done over the list, so it scales linearly with the number of elements in the list.
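Note that reduceat relies on the groups being contiguous, which holds here because list1 is sorted. As a minimal alternative sketch (my own addition, not part of the answer above), the right-inclusive bins (0, 10], (10, 20], ... can also be built with np.ceil and summed with np.bincount, which works even for unsorted input:

import numpy as np

l1 = np.array(list1)
l2 = np.array(list2)
# map each time to its bin index: (0, 10] -> 0, (10, 20] -> 1, ...
groups = (np.ceil(l1 / 10.0) - 1).astype(int)
# sum the list2 values that fall into each bin
sums = np.bincount(groups, weights=l2)
# sums -> array([ 3.9, 14.5, 20.7, 35.6])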

If you convert your lists to numpy.array, there is an easy way to extract values from one 1D array based on another:
import numpy

list1 = numpy.array([4.0, 8.0, 14.0, 20.0, 22.0, 26.0, 28.0, 30.0, 32.0, 34.0, 36.0, 38.0, 40.0])
list2 = numpy.array([2.1, 1.8, 9.5, 5., 5.4, 6.7, 3.3, 5.3, 8.8, 9.4, 5., 9.3, 3.1])

step = 10
r, s = range(0, 50, 10), []
for i in r:
    s.append(numpy.sum(list2[(list1 > i) & (list1 <= i + step)]))
print(list(r)[1:], s[:-1])
# [10, 20, 30, 40] [3.9, 14.5, 20.7, 35.6]
Edit
In one line:
s = [numpy.sum(list2[(list1 > i) & (list1 <= i + step)]) for i in r]
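For completeness, a pandas sketch of the same binning (my own addition, assuming pandas is available; pd.cut builds the right-inclusive intervals):

import numpy as np
import pandas as pd

bins = pd.cut(list1, np.arange(0, 50, 10))  # intervals (0, 10], (10, 20], ...
sums = pd.Series(list2).groupby(bins, observed=True).sum()
# sums.values -> array([ 3.9, 14.5, 20.7, 35.6])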


How can I reshape a 2D array into 1D in python?

Let me edit my question again. I know how flatten works, but I am asking whether it is possible to remove the inner brackets and keep just the two outer brackets, like in MATLAB, while keeping the shape of (3,4). Here it is arrays inside an array, and I want to have just one array, so I can plot it easily and get the same results as in MATLAB.
For example I have the following matrix (which is arrays inside an array):
import numpy as np
s = np.arange(12).reshape(3, 4)
print(s)
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
Is it possible to reshape or flatten() it and get a result like this:
[ 0  1  2  3
  4  5  6  7
  8  9 10 11]
First answer
If I understood your question correctly (and 4 other answers say I didn't), your problem is not how to flatten() or reshape(-1) an array, but how to ensure that, even after reshaping, it is still displayed with 4 elements per line.
I don't think you can, strictly speaking. Arrays are just a bunch of elements; they don't contain any indication of how we want to see them. That's a printing problem, which you are supposed to solve when printing. You can see [1] that people who want to do that... start by reshaping the array to 2D.
That being said, without creating your own printing function, you can control how numpy display arrays, using np.set_printoptions.
Still, it is tricky, because this function only allows you to specify how many characters, not elements, are printed per line. So you need to know how many characters each element will need, to force the line breaks.
In your example:
np.set_printoptions(formatter={"all":lambda x:"{:>6}".format(x)}, linewidth=7+(6+2)*4)
The formatter ensures that each number uses 6 characters. The linewidth must account for the "array([" prefix and the closing "])" (9 characters), plus the 4 numbers at 6 characters each, plus the 3 ", " separators at 2 characters each: 9 + 6×4 + 2×3 = 39, or equivalently 7 + (6+2)×4.
You can use it for just one print:
with np.printoptions(formatter={"all": lambda x: "{:>6}".format(x)}, linewidth=7+(6+2)*4):
    print(s.reshape(-1))
Edit after some time: subclassing
Another method that came to my mind would be to subclass ndarray, to make it behave as you want:
import numpy as np

class MyArr(np.ndarray):
    # Create a new array; ls is the number of elements to print per line,
    # arr is a normal array-like to take the data from
    def __new__(cls, ls, arr):
        n = np.ndarray.__new__(MyArr, (len(arr),))
        n.ls = ls
        n[:] = arr[:]
        return n

    def __init__(self, *args):
        pass

    # Make .ls "viral": whenever an array is created from an operation on an
    # array that has .ls, the .ls is copied into the new array
    def __array_finalize__(self, obj):
        if not hasattr(self, 'ls') and type(obj) == MyArr and hasattr(obj, 'ls'):
            self.ls = obj.ls

    # Print the array with .ls elements per line
    def __repr__(self):
        # For anything other than a 1D array, use the standard representation
        if len(self.shape) != 1:
            return super().__repr__()
        mxsize = max(len(str(s)) for s in self)
        s = '['
        for i in range(len(self)):
            if i % self.ls == 0 and i > 0:
                s += '\n '
            s += f'{{:{mxsize}}}'.format(self[i])
            if i + 1 < len(self):
                s += ', '
        s += ']'
        return s
Now you can use this MyArr to build your own 1D array
MyArr(4, range(12))
shows
[ 0.0, 1.0, 2.0, 3.0,
4.0, 5.0, 6.0, 7.0,
8.0, 9.0, 10.0, 11.0]
And you can use it anywhere a 1D ndarray is legal, and most of the time the .ls attribute will follow. (I say "most of the time" because I cannot guarantee that some functions won't build a new ndarray and fill it with the data from this one.)
a=MyArr(4, range(12))
a*2
#[ 0.0, 2.0, 4.0, 6.0,
# 8.0, 10.0, 12.0, 14.0,
# 16.0, 18.0, 20.0, 22.0]
a*a
#[ 0.0, 1.0, 4.0, 9.0,
# 16.0, 25.0, 36.0, 49.0,
# 64.0, 81.0, 100.0, 121.0]
a[8::-1]
#[8.0, 7.0, 6.0, 5.0,
# 4.0, 3.0, 2.0, 1.0,
# 0.0]
# It even resists reshaping
b=a.reshape((3,4))
b
#MyArr([[ 0., 1., 2., 3.],
# [ 4., 5., 6., 7.],
# [ 8., 9., 10., 11.]])
b.reshape((12,))
#[ 0.0, 1.0, 2.0, 3.0,
# 4.0, 5.0, 6.0, 7.0,
# 8.0, 9.0, 10.0, 11.0]
# Or fancy indexing
a[np.array([1,2,5,5,5])]
#[1.0, 2.0, 5.0, 5.0,
# 5.0]
# Or matrix operations
M = np.eye(12, k=1) + 2*np.identity(12)  # just a matrix
M @ a
#[ 1.0, 4.0, 7.0, 10.0,
# 13.0, 16.0, 19.0, 22.0,
# 25.0, 28.0, 31.0, 22.0]
np.diag(M*a)
#[ 0.0, 2.0, 4.0, 6.0,
# 8.0, 10.0, 12.0, 14.0,
# 16.0, 18.0, 20.0, 22.0]
# But of course, sometimes you lose the MyArr class
import pandas as pd
pd.DataFrame(a, columns=['v']).v.values
#array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11.])
[1]: https://stackoverflow.com/questions/25991666/how-to-efficiently-output-n-items-per-line-from-numpy-array
Simply using the reshape function with -1 as the shape should do it:
print(s)
print(s.reshape(-1))
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
[ 0 1 2 3 4 5 6 7 8 9 10 11]
Try .ravel():
s = np.arange(12).reshape(3, 4)
print(s.ravel())
Prints:
[ 0 1 2 3 4 5 6 7 8 9 10 11]
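Side note (my addition): ravel returns a view of the original array when it can, while flatten always returns a copy; for printing purposes either works, but the distinction matters if you modify the result:

s = np.arange(12).reshape(3, 4)
v = s.ravel()    # a view when possible: modifying v also modifies s
c = s.flatten()  # always a copy: modifying c leaves s untouched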
You can use itertools.chain:
from itertools import chain
import numpy as np
s=np.arange(12).reshape(3,4)
print(list(chain(*s)))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
print(s.reshape(12,)) # this will also work
print(s.reshape(s.shape[0] * s.shape[1],)) # if don't know number of elements before hand

Find median for each element in list

I have some large lists of data, between 1000 and 10000 elements each. Now I want to filter out some peak values with the help of the median function.
import statistics

# example list with just 10 elements
my_list = [4.5, 4.7, 5.1, 3.9, 9.9, 5.6, 4.3, 0.2, 5.0, 4.6]
# list of medians, each calculated from 3 elements
my_median_list = []
for i in range(len(my_list)):
    if i == 0:
        my_median_list.append(statistics.median([my_list[0], my_list[1], my_list[2]]))
    elif i == (len(my_list) - 1):
        my_median_list.append(statistics.median([my_list[-1], my_list[-2], my_list[-3]]))
    else:
        my_median_list.append(statistics.median([my_list[i-1], my_list[i], my_list[i+1]]))
print(my_median_list)
# [4.7, 4.7, 4.7, 5.1, 5.6, 5.6, 4.3, 4.3, 4.6, 4.6]
This works so far, but I think it looks ugly and is maybe inefficient. Is there a way to do it faster with statistics or NumPy? Or another solution? Also, I am looking for a solution where I can pass an argument for how many elements the median is calculated from. In my example I always used the median of 3 elements, but with my real data I want to play with that setting and maybe use the median of 10 elements.
You are calculating too many values, since:
my_median_list.append(statistics.median([my_list[i-1], my_list[i], my_list[i+1]]))
and
my_median_list.append(statistics.median([my_list[0], my_list[1], my_list[2]]))
are the same when i == 1. The same duplication happens at the end, so you get one extra value on each side.
It's easier and less error-prone to do this with zip(), which builds the three-element tuples for you:
from statistics import median
my_list = [4.5, 4.7, 5.1, 3.9, 9.9, 5.6, 4.3, 0.2, 5.0, 4.6]
[median(l) for l in zip(my_list, my_list[1:], my_list[2:])]
# [4.7, 4.7, 5.1, 5.6, 5.6, 4.3, 4.3, 4.6]
For windows of arbitrary size, collections.deque is super handy because you can set a maximum size: you keep pushing items onto one end and it removes items from the other to maintain the size. Here's a generator example that takes your window size as n:
from statistics import median
from collections import deque
def rolling_median(l, n):
    d = deque(l[0:n], n)
    yield median(d)
    for num in l[n:]:
        d.append(num)
        yield median(d)
my_list = [4.5, 4.7, 5.1, 3.9, 9.9, 5.6, 4.3, 0.2, 5.0, 4.6]
list(rolling_median(my_list, 3))
# [4.7, 4.7, 5.1, 5.6, 5.6, 4.3, 4.3, 4.6]
list(rolling_median(my_list, 5))
# [4.7, 5.1, 5.1, 4.3, 5.0, 4.6]
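Since the question also asks about NumPy: a vectorized sketch using sliding_window_view (my addition; requires NumPy 1.20+), which computes all window medians in one call:

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

my_list = [4.5, 4.7, 5.1, 3.9, 9.9, 5.6, 4.3, 0.2, 5.0, 4.6]
arr = np.array(my_list)
# one median per length-3 window, without any Python-level loop
medians = np.median(sliding_window_view(arr, 3), axis=1)
# array([4.7, 4.7, 5.1, 5.6, 5.6, 4.3, 4.3, 4.6])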

Find the indices of first positive elements in list - python

I am trying to find the indices of the starting position of each sequence of positive values. With the code below I only get the positions of all the positive values. My code looks like the following:
index = []
for i, x in enumerate(lst):
    if x > 0:
        index.append(i)
print(index)
I expect the output of [-1.1, 2.0, 3.0, 4.0, 5.0, -2.0, -3.0, -4.0, 5.5, 6.6, 7.7, 8.8, 9.9] to be [1, 8]
I think it would be better if you used a list comprehension:
index = [i for i, x in enumerate(lst) if x > 0]
Currently you are selecting all indices where the number is positive; instead, you want to collect an index only when a number switches from negative to positive. Additionally, you can handle lists that are all negative, or that start with a positive number, as well:
def get_pos_indexes(lst):
    index = []
    # iterate over the list by index
    for i in range(len(lst) - 1):
        # if the first element is positive, add 0 as an index
        if i == 0:
            if lst[i] > 0:
                index.append(0)
        # if successive values switch from negative to positive, collect the positive index
        if lst[i] < 0 and lst[i+1] > 0:
            index.append(i + 1)
    # if the index list is empty, only negative values were encountered, hence return [-1]
    if len(index) == 0:
        index = [-1]
    return index
print(get_pos_indexes([-1.1, 2.0, 3.0, 4.0, 5.0, -2.0, -3.0, -4.0, 5.5, 6.6, 7.7, 8.8, 9.9]))
print(get_pos_indexes([2.0, 3.0, 4.0, 5.0, -2.0, -3.0, -4.0, 5.5, 6.6, 7.7, 8.8, 9.9]))
print(get_pos_indexes([2.0,1.0,4.0,5.0]))
print(get_pos_indexes([-2.0,-1.0,-4.0,-5.0]))
The output will be
[1, 8]
[0, 7]
[0]
[-1]
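A vectorized NumPy sketch of the same idea (my addition): a run of positives starts wherever an element is positive and its predecessor is not (treating the position before the start of the list as non-positive):

import numpy as np

def pos_starts(lst):
    pos = np.array(lst) > 0
    # a run starts where the value is positive and the previous one is not
    starts = np.flatnonzero(pos & np.concatenate(([True], ~pos[:-1])))
    return starts.tolist() if len(starts) else [-1]

pos_starts([-1.1, 2.0, 3.0, 4.0, 5.0, -2.0, -3.0, -4.0, 5.5, 6.6, 7.7, 8.8, 9.9])
# [1, 8]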

Python Linear Regression Error

I have two arrays with the following values:
>>> x = [24.0, 13.0, 12.0, 22.0, 21.0, 10.0, 9.0, 12.0, 7.0, 14.0, 18.0,
... 1.0, 18.0, 15.0, 13.0, 13.0, 12.0, 19.0, 13.0]
>>> y = [10.0, 9.0, 22.0, 7.0, 4.0, 7.0, 56.0, 5.0, 24.0, 25.0, 11.0, 2.0,
... 9.0, 1.0, 9.0, 12.0, 9.0, 4.0, 2.0]
I used the scipy library to calculate r-squared:
>>> from scipy.interpolate import polyfit
>>> p1 = polyfit(x, y, 1)
When I run the code below:
>>> yfit = p1[0] * x + p1[1]
>>> yfit
array([], dtype=float64)
The yfit array is empty. I don't understand why.
The problem is that you are performing scalar addition with an empty list.
The reason you have an empty list is that you performed scalar multiplication with a Python list rather than with a numpy.array. The scalar gets truncated to the integer 0, and repeating a list zero times creates an empty list.
We'll explore this below, but to fix it, you just need your data in numpy arrays instead of lists. Either create them as arrays originally, or convert the lists:
>>> x = numpy.array([24.0, 13.0, 12.0, 22.0, 21.0, 10.0, 9.0, 12.0, 7.0, 14.0,
...                  18.0, 1.0, 18.0, 15.0, 13.0, 13.0, 12.0, 19.0, 13.0])
An explanation of what was going on follows:
Let's unpack the expression yfit = p1[0] * x + p1[1].
The component parts are:
>>> p1[0]
-0.58791208791208893
p1[0] isn't a float however, it's a numpy data type:
>>> type(p1[0])
<class 'numpy.float64'>
x is as given above.
>>> p1[1]
20.230769230769241
Similar to p1[0], the type of p1[1] is also numpy.float64:
>>> type(p1[1])
<class 'numpy.float64'>
Multiplying a Python list by a non-integer scalar truncates the number to an integer, so p1[0], which is -0.58791208791208893, becomes 0, and repeating a list zero times gives an empty list:
>>> p1[0] * x
[]
as
>>> 0 * [1, 2, 3]
[]
Finally, you are adding p1[1], a numpy.float64, to the empty list.
This doesn't append the value to the list; it performs scalar addition, i.e. it adds 20.230769230769241 to each entry of the list.
However, since the list is empty, there is no effect, other than that it returns an empty numpy array with dtype float64:
>>> [] + p1[1]
array([], dtype=float64)
An example of a scalar addition having an effect:
>>> [10, 20, 30] + p1[1]
array([ 30.23076923, 40.23076923, 50.23076923])
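Putting it all together, a minimal corrected sketch (my own consolidation, using numpy.polyfit directly instead of the scipy import from the question):

import numpy as np

x = np.array([24.0, 13.0, 12.0, 22.0, 21.0, 10.0, 9.0, 12.0, 7.0, 14.0,
              18.0, 1.0, 18.0, 15.0, 13.0, 13.0, 12.0, 19.0, 13.0])
y = np.array([10.0, 9.0, 22.0, 7.0, 4.0, 7.0, 56.0, 5.0, 24.0, 25.0, 11.0,
              2.0, 9.0, 1.0, 9.0, 12.0, 9.0, 4.0, 2.0])

p1 = np.polyfit(x, y, 1)   # slope and intercept of the least-squares line
yfit = p1[0] * x + p1[1]   # element-wise now, since x is an ndarray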

How to improve my performance in filling gaps in time series and data lists with Python

I have time series data sets comprising 10 Hz data over several years. For one year, my data has around 3.1×10^8 rows (each row has a time stamp and 8 float values). The data has gaps which I need to identify and fill with NaN. My Python code below is capable of doing so, but the performance is by far too bad for my kind of problem: I cannot get through my data set in anything even close to a reasonable time.
Below is a minimal working example.
I have, for example, series (time series data) and data as lists of the same length:
series = [1.1, 2.1, 3.1, 7.1, 8.1, 9.1, 10.1, 14.1, 15.1, 16.1, 20.1]
data_a = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
data_b = [1.2, 1.2, 1.2, 2.2, 2.2, 2.2, 2.2, 3.2, 3.2, 3.2, 4.2]
I would like series to advance in intervals of 1; hence the gaps in series are 4.1, 5.1, 6.1, 11.1, 12.1, 13.1, 17.1, 18.1, 19.1. The data_a and data_b lists shall be filled with float('nan')s at those positions.
So data_b, for example, should become:
[1.2, 1.2, 1.2, nan, nan, nan, 2.2, 2.2, 2.2, 2.2, nan, nan, nan, 3.2, 3.2, 3.2, nan, nan, nan, 4.2]
I achieved this using:
d_max = 1.0  # normal increment in series where no gaps shall be filled
shift = 0
for i in range(len(series) - 1):
    diff = series[i+1] - series[i]
    if diff > d_max:
        num_fills = round(diff / d_max) - 1  # number of fills within one gap
        for it in range(num_fills):
            data_a.insert(i+1+it+shift, float('nan'))
            data_b.insert(i+1+it+shift, float('nan'))
        shift = int(shift + num_fills)  # shift the index by the number of inserts from previous gap fillings
I searched for other solutions to this problem but only came across the use of the find() function for getting the indices of the gaps. Is find() faster than my solution? And how would I insert the NaNs into data_a and data_b in a more efficient way?
First, realize that your innermost loop is not necessary:
for it in range(num_fills):
    data_a.insert(i+1+it+shift, float('nan'))
is the same as
data_a[i+1+shift:i+1+shift] = [float('nan')] * int(num_fills)
That might make it slightly faster, because there is less allocation and less moving of items going on.
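A tiny demonstration of that slice-assignment trick (my addition): assigning a list to a zero-length slice splices it in at that position in one step:

data = [1, 2, 3]
data[1:1] = [float('nan')] * 2
# data is now [1, nan, nan, 2, 3]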
Then, for large numerical problems, always use NumPy. It may take some effort to learn, but the performance is likely to go up orders of magnitude. Start with something like:
import numpy as np

series = np.array([1.1, 2.1, 3.1, 7.1, 8.1, 9.1, 10.1, 14.1, 15.1, 16.1, 20.1])
data_a = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
data_b = [1.2, 1.2, 1.2, 2.2, 2.2, 2.2, 2.2, 3.2, 3.2, 3.2, 4.2]

d_max = 1.0  # normal increment in series where no gaps shall be filled
shift = 0

# the following two statements use NumPy's broadcasting
# to implicitly run the loops at the C level
diff = series[1:] - series[:-1]
num_fills = np.round(diff / d_max) - 1

for i in np.where(diff > d_max)[0]:
    nf = int(num_fills[i])
    nans = [np.nan] * nf
    data_a[i+1+shift:i+1+shift] = nans
    data_b[i+1+shift:i+1+shift] = nans
    shift += nf
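If the time stamps always fall on a regular grid, as in this example, a fully vectorized sketch is possible (my own addition, assuming every gap is an exact multiple of d_max): compute each row's slot index and scatter the data into NaN-filled output arrays:

import numpy as np

series = np.array([1.1, 2.1, 3.1, 7.1, 8.1, 9.1, 10.1, 14.1, 15.1, 16.1, 20.1])
data_b = np.array([1.2, 1.2, 1.2, 2.2, 2.2, 2.2, 2.2, 3.2, 3.2, 3.2, 4.2])
d_max = 1.0

# slot index of every time stamp on the regular grid
idx = np.round((series - series[0]) / d_max).astype(int)
# NaN-filled output, then scatter the known values into their slots
filled_b = np.full(idx[-1] + 1, np.nan)
filled_b[idx] = data_b
# filled_b -> [1.2, 1.2, 1.2, nan, nan, nan, 2.2, ...]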
IIRC, inserts into Python lists are expensive, scaling with the size of the list.
I'd recommend not loading your huge data sets into memory, but iterating through them with a generator function, something like:
series = [1.1, 2.1, 3.1, 7.1, 8.1, 9.1, 10.1, 14.1, 15.1, 16.1, 20.1]
data_a = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
data_b = [1.2, 1.2, 1.2, 2.2, 2.2, 2.2, 2.2, 3.2, 3.2, 3.2, 4.2]

def fillGaps(series, data_a, data_b, d_max=1.0):
    prev = None
    for s, a, b in zip(series, data_a, data_b):
        if prev is not None:
            diff = s - prev
            if diff > d_max:
                # emit one NaN pair per missing time step
                for x in range(int(round(diff / d_max)) - 1):
                    yield (float('nan'), float('nan'))
        prev = s
        yield (a, b)

newA = []
newB = []
for a, b in fillGaps(series, data_a, data_b):
    newA.append(a)
    newB.append(b)
E.g. stream the data through the zip and write the results out as you go, instead of appending to lists.
