Translating float index interpolation from MATLAB to Python

For example, I have an index array
ax = [0, 0.2, 2] #start from index 0: python
and matrix I
I=
10 20 30 40 50
10 20 30 40 50
10 20 30 40 50
10 20 30 40 50
10 20 30 40 50
In MATLAB, by running this code
[gx, gy] = meshgrid([1,1.2,3], [1,1.2,3]);
I = [10:10:50];
I = vertcat(I,I,I,I,I)
SI = interp2(I,gx,gy,'bilinear');
The resulting SI is
SI =
10 12 30
10 12 30
10 12 30
I tried to do the same interpolation in Python, using NumPy. I first interpolate row-wise, then column-wise
import numpy as np
ax = np.array([0.0, 0.2, 2.0])
ay = np.array([0.0, 0.2, 2.0])
I = np.array([[10,20,30,40,50]])
I = np.concatenate((I,I,I,I,I), axis=0)
r_idx = np.arange(1, I.shape[0]+1)
c_idx = np.arange(1, I.shape[1]+1)
I_row = np.transpose(np.array([np.interp(ax, r_idx, I[:,x]) for x in range(0,I.shape[0])]))
I_col = np.array([np.interp(ay, c_idx, I_row[y,:]) for y in range(0, I_row.shape[0])])
SI = I_col
However, the resulting SI is
SI =
10 10 20
10 10 20
10 10 20
Why are my results using Python different from those using MATLAB?

It seems that you over-corrected yourself when translating from MATLAB to Python, as shown by your first code excerpt.
ax = [0, 0.2, 2] #start from index 0: python
In NumPy's logic, this sequence does not represent indexes but the coordinates at which the function is interpolated.
Since you already take care of shifting the coordinates to be compatible with MATLAB here:
r_idx = np.arange(1, I.shape[0]+1)
c_idx = np.arange(1, I.shape[1]+1)
You can reuse the same interpolation coordinates that you used in MATLAB:
ax = [1,1.2,3]
Full code:
import numpy as np
ax = np.array([1.0, 1.2, 3.0])
ay = np.array([1.0, 1.2, 3.0])
I = np.array([[10,20,30,40,50]])
I = np.concatenate((I,I,I,I,I), axis=0)
r_idx = np.arange(1, I.shape[0]+1)
c_idx = np.arange(1, I.shape[1]+1)
I_row = np.transpose(np.array([np.interp(ax, r_idx, I[:,x]) for x in range(0, I.shape[0])]))
I_col = np.array([np.interp(ay, c_idx, I_row[y,:]) for y in range(0, I_row.shape[0])])
SI = I_col
and result:
array([[10., 12., 30.],
[10., 12., 30.],
[10., 12., 30.]])
Explanation of the bug
Since ax represented coordinates, the first two values (0.0 and 0.2) fell before the first coordinate of r_idx. According to the documentation, np.interp then defaults to the first sample value, i.e. I[:,x][0].
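For completeness, here is a minimal sketch (my addition, not part of the original answer) of the same bilinear interpolation with 0-based coordinates, using scipy.interpolate.RegularGridInterpolator instead of two passes of np.interp:
import numpy as np
from scipy.interpolate import RegularGridInterpolator

I = np.tile(np.arange(10, 60, 10), (5, 1))   # the 5x5 matrix from the question
rows = np.arange(I.shape[0])                 # 0-based row coordinates
cols = np.arange(I.shape[1])                 # 0-based column coordinates
interp = RegularGridInterpolator((rows, cols), I, method='linear')

ay = np.array([0.0, 0.2, 2.0])               # 0-based row query coordinates
ax = np.array([0.0, 0.2, 2.0])               # 0-based column query coordinates
gy, gx = np.meshgrid(ay, ax, indexing='ij')  # grid of (row, col) query points
SI = interp(np.column_stack([gy.ravel(), gx.ravel()])).reshape(len(ay), len(ax))
# SI -> [[10. 12. 30.]
#        [10. 12. 30.]
#        [10. 12. 30.]]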

Related

Optimization of equation parameter values such that largest distance between groups is created

For a particular gene scoring system I would like to set up a rudimentary plot such that new sample values that are entered immediately gravitate, based on multiple gene measurements, towards either a healthy or unhealthy group within the plot. Let's presume we have 5 people, each having 6 genes measured.
Import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.DataFrame(np.array([[A, 1, 1.2, 1.4, 2, 2], [B, 1.5, 1, 1.4, 1.3, 1.2], [C, 1, 1.2, 1.6, 2, 1.4], [D, 1.7, 1.5, 1.5, 1.5, 1.4], [E, 1.6, 1.9, 1.8, 3, 2.5], [F, 2, 2.2, 1.9, 2, 2]]), columns=['Gene', 'Healthy 1', 'Healthy 2', 'Healthy 3', 'Unhealthy 1', 'Unhealthy 2'])
This creates the following table:
Gene   Healthy 1   Healthy 2   Healthy 3   Unhealthy 1   Unhealthy 2
A      1.0         1.2         1.4         2.0           2.0
B      1.5         1.0         1.4         1.3           1.2
C      1.0         1.2         1.6         2.0           1.4
D      1.7         1.5         1.5         1.5           1.4
E      1.6         1.9         1.8         3.0           2.5
F      2.0         2.2         1.9         2.0           2.0
The X and Y coordinates of each sample are then calculated by adding the contributions of the genes together, each contribution being its parameter/weight multiplied by the measured value. The first 4 genes contribute towards the Y value, whilst genes 5 and 6 determine the X value. wA - wF are the parameters/weights associated with their gene A-F counterparts.
wA = .15
wB = .25
wC = .35
wD = .45
wE = .50
wF = .60
n = 0
for n in range(5):
    y1 = df.iat[0,n]
    y2 = df.iat[1,n]
    y3 = df.iat[2,n]
    y4 = df.iat[3,n]
    TrueY = wA*y1+wB*y2+wC*y3+wD*y4
    x1 = df.iat[4,n]
    x2 = df.iat[5,n]
    TrueX = (wE*x1+wF*x2)
    result = (TrueX, TrueY)
    n += 1
    label = f"({TrueX},{TrueY})"
    plt.scatter(TrueX, TrueY, alpha=0.5)
    plt.annotate(label, (TrueX,TrueY), textcoords="offset points", xytext=(0,10), ha='center')
We thus calculate all the coordinates and plot them
[plot of the calculated sample coordinates]
What I would now like to do is find out how I can optimize the wA-wF parameters/weights such that the healthy samples are pushed towards the origin of the plot, let's say (0,0), whilst the unhealthy samples are pushed towards a reasonable opposite point, let's say (1,1). I've looked into K-means/SVM, but as a novice coder/biochemist I was thoroughly overwhelmed and would appreciate any help available.
Here's an example using scipy.optimize combined with your code. (Since your code contains some syntax and type errors, I had to make small corrections.)
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

df = pd.DataFrame(np.array([[1, 1.2, 1.4, 2, 2],
                            [1.5, 1, 1.4, 1.3, 1.2],
                            [1, 1.2, 1.6, 2, 1.4],
                            [1.7, 1.5, 1.5, 1.5, 1.4],
                            [1.6, 1.9, 1.8, 3, 2.5],
                            [2, 2.2, 1.9, 2, 2]]),
                  columns=['Healthy 1', 'Healthy 2', 'Healthy 3', 'Unhealthy 1', 'Unhealthy 2'],
                  index=[['A', 'B', 'C', 'D', 'E', 'F']])

wA = .15
wB = .25
wC = .35
wD = .45
wE = .50
wF = .60

from scipy.optimize import minimize

# use your given weights as the initial guess
w0 = np.array([wA, wB, wC, wD, wE, wF])

# the objective function to be minimized
# - it computes the (square of) the samples' distances to (0,0) resp. (1,1)
def fun(w):
    weighted = df.values*w[:, None]  # multiply all sample values by their weight
    y = sum(weighted[:4])            # compute all 5 "TrueY" coordinates
    x = sum(weighted[4:])            # compute all 5 "TrueX" coordinates
    y[3:] -= 1                       # adjust the "Unhealthy" y to the target (x,1)
    x[3:] -= 1                       # adjust the "Unhealthy" x to the target (1,y)
    return sum(x**2+y**2)            # return the sum of (squared) distances

res = minimize(fun, w0)
print(res)

# assign the optimized weights back to your parameters
wA, wB, wC, wD, wE, wF = res.x

# this is mostly your unchanged code
for n in range(5):
    y1 = df.iat[0,n]
    y2 = df.iat[1,n]
    y3 = df.iat[2,n]
    y4 = df.iat[3,n]
    TrueY = wA*y1+wB*y2+wC*y3+wD*y4
    x1 = df.iat[4,n]
    x2 = df.iat[5,n]
    TrueX = (wE*x1+wF*x2)
    result = (TrueX, TrueY)
    label = f"({TrueX:.3f},{TrueY:.3f})"
    plt.scatter(TrueX, TrueY, alpha=0.5)
    plt.annotate(label, (TrueX,TrueY), textcoords="offset points", xytext=(0,10), ha='center')
plt.savefig("mygraph.png")
This yields the parameters [ 1.21773653, 0.22185886, -0.39377451, -0.76513658, 0.86984207, -0.73166533] as the solution array. With these weights, the resulting plot shows the healthy samples clustered around (0,0) and the unhealthy samples around (1,1).
You may want to experiment with other optimization methods - see scipy.optimize.minimize.
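For instance, a minimal sketch (my addition, assuming fun and w0 from the code above) of passing a different solver; Nelder-Mead is gradient-free and a reasonable sanity check:
# Sketch: re-running the same objective with a gradient-free solver.
res_nm = minimize(fun, w0, method='Nelder-Mead', options={'maxiter': 10000})
print(res_nm.x)  # compare against the default (BFGS) solution above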

How to add new variables for an xarray dataset using groupby and apply?

I am facing serious difficulties in understanding how xarray.groupby really works. I am trying to apply a given function "f" over each group of an xarray DatasetGroupBy collection, such that "f" adds new variables to each of the groups of the original xr.Dataset.
Here is a Brief Introduction:
My problem is commonly found in geoscience, remote sensing, etc.
The aim is to apply a given function over an Array, pixel by pixel (or gridcell by gridcell).
Example
Let's assume that I want to evaluate the wind speed components (u, v) of a wind field for a given region with respect to a new direction. Therefore, I wish to evaluate rotated versions of the 'u' and 'v' components, namely u_rotated and v_rotated.
Let's assume that this new direction is rotated 30° anti-clockwise with respect to each pixel position in the wind field. So the new wind components would be (u_30_degrees and v_30_degrees).
My first attempt was to stack the x and y coordinates (or longitudes and latitudes) into a new dimension called 'pixel', then group by this new dimension ("pixel") and apply a function that would do the vector wind rotation.
Here is a snippet of my initial attempt:
# First, let's create some functions for vector rotation:
import numpy as np
import xarray as xr

def rotate_2D_vector_per_given_degrees(array2D, angle=30):
    '''
    Parameters
    ----------
    array2D : 1D length-2 numpy array
    angle : float angle in degrees (optional)
        DESCRIPTION. The default is 30.

    Returns
    -------
    Rotated_2D_Vector : 1D length-2 numpy array
    '''
    R = get_rotation_matrix(rotation=angle)
    Rotated_2D_Vector = np.dot(R, array2D)
    return Rotated_2D_Vector

def get_rotation_matrix(rotation=90):
    '''
    Description:
        This function creates a rotation matrix for a given rotation angle (in degrees)
    Parameters:
        rotation: in degrees
    Returns:
        rotation matrix
    '''
    theta = np.radians(rotation)  # degrees
    c, s = np.cos(theta), np.sin(theta)
    R = np.array(((c, -s), (s, c)))
    return R
# Then let's create a reproducible dataset for analysis:
u_wind = xr.DataArray(np.ones(shape=(20, 30)),
                      dims=('x', 'y'),
                      coords={'x': np.arange(0, 20),
                              'y': np.arange(0, 30)},
                      name='u')
v_wind = xr.DataArray(np.ones(shape=(20, 30))*0.3,
                      dims=('x', 'y'),
                      coords={'x': np.arange(0, 20),
                              'y': np.arange(0, 30)},
                      name='v')
data = xr.merge([u_wind, v_wind])
# Let's create the given function that will be applied per each group in the dataset:
def rotate_wind(array, degrees=30):
    # Here I create a 1-dimensional vector of length 2 with the wind speed of
    # the u and v components, respectively. The best solution I found was to
    # convert the dataset into a single xr.DataArray by stacking the 'u' and
    # 'v' components into a single variable named 'wind'.
    vector = array.to_array(dim='wind').values

    # Now, I rotate the wind vector given a rotation angle in degrees
    Rotated = rotate_2D_vector_per_given_degrees(vector, degrees)

    # Treat tiny numerical noise (e.g. 1e-17) as 0.
    Rotated = np.where(np.abs(Rotated - 6.123234e-15) < 1e-15, 0, Rotated)

    # sanity check for each point position:
    print('Coords: ', array['point'].values,
          'Wind Speed: ', vector,
          'Response :', Rotated,
          end='\n\n' + '-'*20 + '\n')

    components = [a for a in data.variables if a not in data.dims]
    for dim, value in zip(components, Rotated):
        array['{0}_rotated_{1}'.format(dim, degrees)] = value

    return array
# Finally, let's stack our dataset per grid point, group by this new dimension, and apply the desired function:
stacked = data.stack(point = ['x', 'y'])
stacked = stacked.groupby('point').apply(rotate_wind)
# let's unstack the data to recover the original dataset:
data = stacked.unstack('point')
# Let's check if the function worked correctly
data.to_dataframe().head(30)
Though the above example is apparently working, I am still unsure if its results are correct, or even if the groupby-apply function implementation is efficient (clean, non-redundant, fast, etc.).
Any insights are most welcome!
You can simply multiply the whole array by the rotation matrix, something like np.dot(R, da).
So, if you have the following Dataset:
>>> dims = ("x", "y")
>>> sizes = (20, 30)
>>> ds = xr.Dataset(
...     data_vars=dict(u=(dims, np.ones(sizes)), v=(dims, np.ones(sizes) * 0.3)),
...     coords={d: np.arange(s) for d, s in zip(dims, sizes)},
... )
>>> ds
<xarray.Dataset>
Dimensions: (x: 20, y: 30)
Coordinates:
* x (x) int64 0 1 2 3 4 ... 16 17 18 19
* y (y) int64 0 1 2 3 4 ... 26 27 28 29
Data variables:
u (x, y) float64 1.0 1.0 ... 1.0 1.0
v (x, y) float64 0.3 0.3 ... 0.3 0.3
Converted, like you did, to the following DataArray:
>>> da = ds.stack(point=["x", "y"]).to_array(dim="wind")
>>> da
<xarray.DataArray (wind: 2, point: 600)>
array([[1. , 1. , 1. , ..., 1. , 1. , 1. ],
[0.3, 0.3, 0.3, ..., 0.3, 0.3, 0.3]])
Coordinates:
* point (point) MultiIndex
- x (point) int64 0 0 0 0 ... 19 19 19 19
- y (point) int64 0 1 2 3 ... 26 27 28 29
* wind (wind) <U1 'u' 'v'
Then, you have your rotated values thanks to np.dot(R, da):
>>> np.dot(R, da).shape
(2, 600)
>>> type(np.dot(R, da))
<class 'numpy.ndarray'>
But it is a numpy ndarray. So if you want to go back to an xarray DataArray, you can use a trick like this (other solutions may exist):
def rotate(da, dim, angle):
    # Put dim first
    dims_orig = da.dims
    da = da.transpose(dim, ...)
    # Rotate
    R = get_rotation_matrix(angle)  # rotation-matrix helper defined in the question
    rotated = da.copy(data=np.dot(R, da), deep=True)
    # Rename values of "dim" coord according to rotation
    rotated[dim] = [f"{orig}_rotated_{angle}" for orig in da[dim].values]
    # Transpose back to orig
    return rotated.transpose(*dims_orig)
And use it like:
>>> da_rotated = rotate(da, dim="wind", angle=30)
>>> da_rotated
<xarray.DataArray (wind: 2, point: 600)>
array([[0.7160254 , 0.7160254 , 0.7160254 , ..., 0.7160254 , 0.7160254 ,
0.7160254 ],
[0.75980762, 0.75980762, 0.75980762, ..., 0.75980762, 0.75980762,
0.75980762]])
Coordinates:
* point (point) MultiIndex
- x (point) int64 0 0 0 0 ... 19 19 19 19
- y (point) int64 0 1 2 3 ... 26 27 28 29
* wind (wind) <U12 'u_rotated_30' 'v_rotated_30'
Finally, you can go back to the original Dataset structure like this:
>>> ds_rotated = da_rotated.to_dataset(dim="wind").unstack(dim="point")
>>> ds_rotated
<xarray.Dataset>
Dimensions: (x: 20, y: 30)
Coordinates:
* x (x) int64 0 1 2 3 ... 17 18 19
* y (y) int64 0 1 2 3 ... 27 28 29
Data variables:
u_rotated_30 (x, y) float64 0.716 ... 0.716
v_rotated_30 (x, y) float64 0.7598 ... 0.7598
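As a quick numerical sanity check (my addition, not from the original answer), the constant wind vector (u, v) = (1, 0.3) rotated by 30° can be computed by hand and compared with the values above:
import numpy as np

theta = np.radians(30)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
u_rot, v_rot = R @ np.array([1.0, 0.3])
print(u_rot, v_rot)  # ~0.7160254 and ~0.7598076, matching u_rotated_30 and v_rotated_30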

verify distribution of uniformly distributed 3D coordinates

I would like to write a Python script to generate uniformly distributed 3D coordinates (x, y, z) where x, y, and z are floats between 0 and 1. For the moment, z can be fixed, so what I need is uniformly distributed points in the 2D (x-y) plane. I have written a script to do this job and checked that both x and y are uniform numbers. However, I am not sure whether these points are uniformly distributed in the (x-y) plane.
My code is
import matplotlib.pyplot as plt
import random
import numpy as np
import csv

nk1 = 300
nk2 = 300
nk3 = 10
kx = []
ky = []
kz = []
for i in range(nk1):
    for j in range(nk2):
        for k in range(nk3):
            xkg1 = random.random()
            xkg2 = random.random()
            xkg3 = float(k)/nk3
            kx.append(xkg1)
            ky.append(xkg2)
            kz.append(xkg3)
kx = np.array(kx)
count, bins, ignored = plt.hist(kx, density=True)
plt.plot(bins, np.ones_like(bins), linewidth=2, color='r')
plt.show()
The plot shows that both "kx" and "ky" are uniformly distributed numbers; however, how can I make sure that x-y are uniformly distributed in the 2D plane?
Just as you used np.histogram [1] to check uniformity in 1D, you can use np.histogram2d to do the same thing in 2D, and np.histogramdd in 3D+.
To see an example, let's first fix your loops by making them go away:
kx = np.random.random(nk1 * nk2 * nk3)
ky = np.random.random(nk1 * nk2 * nk3)
kz = np.tile(np.arange(nk3) / nk3, nk1 * nk2)
hist2d, *_ = np.histogram2d(kx, ky, range=[[0, 1], [0, 1]])
The range parameter ensures that you are binning over [0, 1) in each direction, not over the actual min and max of your data, no matter how close they may be.
Now it's entirely up to you how to visualize the 100 bin counts in hist2d. One simple way would be to just ravel it and do a bar chart like you did for the 1D case:
plt.bar(np.arange(hist2d.size), hist2d.ravel())
plt.plot([0, hist2d.size], [nk1 * nk2 * nk3 / hist2d.size] * 2)
Another simple way would be to do a heat map:
plt.imshow(hist2d, interpolation='nearest', cmap='hot')
This is actually not as useful as the bar chart, and doesn't generalize to higher dimensions as well.
Your best bet is probably just checking the standard deviation of the raw data.
[1] Or rather plt.hist did for you under the hood.
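If you want a single number instead of eyeballing a chart, here is a minimal sketch (my addition, assuming kx and hist2d from the code above) of a chi-square uniformity check on the bin counts:
import numpy as np
from scipy import stats

# Under uniformity each of the hist2d.size bins should hold roughly the same
# number of points; chisquare quantifies the deviation from that expectation.
expected = np.full(hist2d.size, kx.size / hist2d.size)
chi2, p = stats.chisquare(hist2d.ravel(), f_exp=expected)
print(chi2, p)  # a large p-value is consistent with a uniform 2D distribution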
With the help of @Mad Physicist, I finally found a way to verify the uniform distribution of random numbers in 2D. Here I post my script and explain the details:
import numpy as np
import random
import matplotlib.pyplot as plt
import matplotlib

fig = plt.figure()
ax1 = fig.add_subplot(211)
ax2 = fig.add_subplot(212)
nk1 = 100
nk2 = 100
nk3 = 1
kx = []
ky = []
kz = []
for i in range(nk1):
    for j in range(nk2):
        for k in range(nk3):
            xkg = random.random()
            ykg = random.random()
            zkg = float(k)/nk3
            kx.append(xkg)
            ky.append(ykg)
            kz.append(zkg)
kx = np.array(kx)
ky = np.array(ky)
kz = np.array(kz)
xedges, yedges = np.linspace(0, 1, 6), np.linspace(0, 1, 6)
## count the number of points in the region defined by (xedges[i], xedges[i+1])
## and (yedges[j], yedges[j+1]). There are in total 5*5 2D squares.
hist, xedges, yedges = np.histogram2d(kx, ky, (xedges, yedges))
xidx = np.clip(np.digitize(kx, xedges) - 1, 0, hist.shape[0] - 1)
yidx = np.clip(np.digitize(ky, yedges) - 1, 0, hist.shape[1] - 1)
ax1.bar(np.arange(hist.size), hist.ravel())
ax1.plot([0, hist.size], [nk1 * nk2 * nk3 / hist.size] * 2)
c = hist[xidx, yidx]
new = ax2.scatter(kx, ky, c=c, cmap='jet')
cax, _ = matplotlib.colorbar.make_axes(ax2)
cbar = matplotlib.colorbar.ColorbarBase(cax, cmap='jet')
ax2.grid(True)
plt.show()

Pandas reverse of diff()

I have calculated the differences between consecutive values in a series, but I cannot reverse / undifference them using diffinv():
ds_sqrt = np.sqrt(ds)
ds_sqrt = pd.DataFrame(ds_sqrt)
ds_diff = ds_sqrt.diff().values
How can I undifference this?
You can do this via numpy. Algorithm courtesy of @Divakar.
Of course, you need to know the first item in your series for this to work.
df = pd.DataFrame({'A': np.random.randint(0, 10, 10)})
df['B'] = df['A'].diff()
x, x_diff = df['A'].iloc[0], df['B'].iloc[1:]
df['C'] = np.r_[x, x_diff].cumsum().astype(int)
# A B C
# 0 8 NaN 8
# 1 5 -3.0 5
# 2 4 -1.0 4
# 3 3 -1.0 3
# 4 9 6.0 9
# 5 7 -2.0 7
# 6 4 -3.0 4
# 7 0 -4.0 0
# 8 8 8.0 8
# 9 1 -7.0 1
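Applied to the question's own variables, the same idea looks like this (a sketch of my own, assuming ds_sqrt and ds_diff = ds_sqrt.diff().values from the question):
import numpy as np

rebuilt = ds_diff.copy()                      # NumPy array with NaN in the first row
rebuilt[0] = ds_sqrt.values[0]                # restore the first row lost to diff()
ds_sqrt_rebuilt = np.cumsum(rebuilt, axis=0)  # cumulative sum inverts the differencing
ds_rebuilt = ds_sqrt_rebuilt ** 2             # undo the earlier square root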
You can use diff_inv from pmdarima (see the pmdarima docs).
# generating a random table
np.random.seed(10)
vals = np.random.randint(1, 10, 6)
df_t = pd.DataFrame({"a":vals})
#creating two columns with diff 1 and diff 2
df_t['dif_1'] = df_t.a.diff(1)
df_t['dif_2'] = df_t.a.diff(2)
df_t
a dif_1 dif_2
0 5 NaN NaN
1 1 -4.0 NaN
2 2 1.0 -3.0
3 1 -1.0 0.0
4 2 1.0 0.0
5 9 7.0 8.0
Then create a function that will return an array with inverse values of diff.
from pmdarima.utils import diff_inv

def inv_diff(df_orig_column, df_diff_column, periods):
    # Generate np.array for the diff_inv function - it includes the first n values
    # (n = periods) of the original data plus the diff values for the given periods
    value = np.array(df_orig_column[:periods].tolist() + df_diff_column[periods:].tolist())
    # Generate np.array with inverse diff
    inv_diff_vals = diff_inv(value, periods, 1)[periods:]
    return inv_diff_vals
Example of Use:
# df_orig_column - column with original values
# df_diff_column - column with differentiated values
# periods - periods used for pd.diff()
inv_diff(df_t.a, df_t.dif_2, 2)
Output:
array([5., 1., 2., 1., 2., 9.])
Reverse diff in one line with pandas
import pandas as pd
df = pd.DataFrame([10, 15, 14, 18], columns = ['Age'])
df['Age_diff'] = df.Age.diff()
df['reverse_diff'] = df['Age'].shift(1) + df['Age_diff']
print(df)
Age Age_diff reverse_diff
0 10 NaN NaN
1 15 5.0 15.0
2 14 -1.0 14.0
3 18 4.0 18.0
Here's a working example.
First, let's import needed packages
import numpy as np
import pandas as pd
import pmdarima as pm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
Then, let's create a simple discretized cosine wave
period = 5
cycles = 7
x = np.cos(np.linspace(0, 2*np.pi*cycles, period*cycles+1))
X = pd.DataFrame(x)
and plot
fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(X, marker='.')
ax.set(
xticks=X.index
)
ax.axvline(0, color='r', ls='--')
ax.axvline(period, color='r', ls='--')
ax.set(
title='Original data'
)
plt.show()
Note that the period is 5. Let's now remove this "seasonality" by differentiating with period 5
X_diff = X.diff(periods=period)
# NOTE: the first `period` observations
# are needed for back transformation
X_diff.iloc[:period] = X[:period]
Note that we have to keep the first `period` observations to allow back-transformation. If you don't want them in the differenced series, keep them elsewhere and concatenate them back when you want to back-transform.
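A minimal sketch of that variant (my addition, assuming X and period from above):
# Keep the first `period` original values separately instead of inside X_diff
head = X.iloc[:period]
X_diff_only = X.diff(periods=period).iloc[period:]  # purely differenced values
# ...later, concatenate them back before inverting the differencing
X_ready_for_inversion = pd.concat([head, X_diff_only])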
fig, ax = plt.subplots(figsize=(12, 5))
ax.axvline(0, color='r', ls='--')
ax.axvline(period-1, color='r', ls='--')
ax.plot(X_diff, marker='.')
ax.annotate(
'Keep these original data\nto allow back transformation',
xy=(period-1, .5), xytext=(10, .5),
arrowprops=dict(color='k')
)
ax.set(
title='Transformed data'
)
plt.show()
Let's now back transform data with pmdarima.utils.diff_inv
X_diff_inv = pm.utils.diff_inv(X_diff, lag=period)[period:]
Note that we discard the first `period` results, which would be 0 and are not needed.
fig, ax = plt.subplots(figsize=(12, 5))
ax.axvline(0, color='r', ls='--')
ax.axvline(period-1, color='r', ls='--')
ax.plot(X_diff_inv, marker='.')
ax.set(
title='Back transformed data'
)
plt.show()
I think some examples may overcomplicate this. The inverse of differencing is simply integrating, i.e. a cumulative sum. But for this you need a starting value, so to speak the constant part when integrating dx = f(x) + const:
import pandas as pd
import matplotlib.pyplot as plt
# some example data
input = pd.Series([5., 1., 2., 1., 2., 9.])
# saving the offset ('const' part of integral) to reconstruct
offset = input[0]
# differentiating
diff = input.diff()
# the first row after diff() will always be NaN, it is reasonable to set this to zero
diff[0] = 0
# => reconstruct (reverse diff / integrate) <=
reverse_diff_no_offset = diff.cumsum()
reverse_diff = reverse_diff_no_offset + offset
# plot: You can see why an offset is needed. Any other offset will shift the line up/down
plt.plot(input, color='green', linestyle=None, marker="o")
plt.plot(reverse_diff_no_offset, color='grey')
plt.plot(reverse_diff, color='blue')
Also numpy has cumsum, so it will work there as well.
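For example (a sketch of my own using plain NumPy arrays instead of a Series):
import numpy as np

vals = np.array([5., 1., 2., 1., 2., 9.])
d = np.diff(vals)                                         # the differenced values (length 5)
rebuilt = np.concatenate(([vals[0]], vals[0] + np.cumsum(d)))
print(np.allclose(rebuilt, vals))                         # True: cumsum undoes diff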
arb = pd.DataFrame({'a': [1, 4, 9, 16, 25, 36]})
(-1)*arb['a'].diff(periods=-1)
Output:
0     3.0
1     5.0
2     7.0
3     9.0
4    11.0
5     NaN
Name: a, dtype: float64

Take the sum of every N rows in a pandas series

Suppose
s = pd.Series(range(50))
0 0
1 1
2 2
3 3
...
48 48
49 49
How can I get a new series consisting of the sum of every n rows?
The expected result is like below, when n = 5:
0 10
1 35
2 60
3 85
...
8 210
9 235
It can of course be accomplished using loc or iloc and a Python loop, however I believe it could be done more simply in a Pandas way.
Also, this is a very simplified example; I don't expect an explanation of the sequences :). The actual data series I'm working with has a time index and the number of events that occurred in each second as the values.
GroupBy.sum
N = 5
s.groupby(s.index // N).sum()
0 10
1 35
2 60
3 85
4 110
5 135
6 160
7 185
8 210
9 235
dtype: int64
Chunk the index into groups of 5 and group accordingly.
numpy.reshape + sum
If the size is a multiple of N (or 5), you can reshape and add:
s.values.reshape(-1, N).sum(1)
# array([ 10, 35, 60, 85, 110, 135, 160, 185, 210, 235])
numpy.add.at
b = np.zeros(len(s) // N)
np.add.at(b, s.index // N, s.values)
b
# array([ 10., 35., 60., 85., 110., 135., 160., 185., 210., 235.])
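Another option in the same spirit (my addition, a sketch) is np.add.reduceat, which sums between consecutive start indices and also handles a trailing partial chunk:
import numpy as np
import pandas as pd

s = pd.Series(range(50))
N = 5
# Sum the slices s[0:5], s[5:10], ... defined by the start indices 0, 5, 10, ...
np.add.reduceat(s.to_numpy(), np.arange(0, len(s), N))
# array([ 10,  35,  60,  85, 110, 135, 160, 185, 210, 235])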
The most efficient solution I can think of is f1() in my example below. It is orders of magnitude faster than using the groupby in the other answer.
Note that f1() doesn't work when the length of the array is not an exact multiple of the chunk size, e.g. if you want to sum a 3-item array every 2 items.
For those cases, you can use f1v2():
f1v2(np.array([0, 1, 2, 3, 4]), 2) = [1, 5, 4]
My code is below. I have used timeit for the comparisons:
import timeit
import numpy as np
import pandas as pd

def f1(a, x):
    if isinstance(a, pd.Series):
        a = a.to_numpy()
    return a.reshape((int(a.shape[0]/x), int(x))).sum(1)

def f2(myarray, x):
    return [sum(myarray[n: n+x]) for n in range(0, len(myarray), x)]

def f3(myarray, x):
    s = pd.Series(myarray)
    out = s.groupby(s.index // x).sum()
    return out

def f1v2(a, x):
    if isinstance(a, pd.Series):
        a = a.to_numpy()
    mod = a.shape[0] % x
    if mod != 0:
        excl = a[-mod:]
        keep = a[: len(a) - mod]
        out = keep.reshape((int(keep.shape[0]/x), int(x))).sum(1)
        out = np.hstack((out, excl.sum()))
    else:
        out = a.reshape((int(a.shape[0]/x), int(x))).sum(1)
    return out

a = np.arange(0, 1e6)
out1 = f1(a, 2)
out2 = f2(a, 2)
out3 = f3(a, 2)

t1 = timeit.Timer("f1(a,2)", globals=globals()).repeat(repeat=5, number=2)
t1v2 = timeit.Timer("f1v2(a,2)", globals=globals()).repeat(repeat=5, number=2)
t2 = timeit.Timer("f2(a,2)", globals=globals()).repeat(repeat=5, number=2)
t3 = timeit.Timer("f3(a,2)", globals=globals()).repeat(repeat=5, number=2)

resdf = pd.DataFrame(index=['min time'])
resdf['f1'] = [min(t1)]
resdf['f1v2'] = [min(t1v2)]
resdf['f2'] = [min(t2)]
resdf['f3'] = [min(t3)]
# the timeit docs explain why it makes more sense to take the min than the avg
resdf = resdf.transpose()
resdf['% difference vs fastest'] = (resdf / resdf.min() - 1) * 100

b = np.array([0, 1, 2, 4, 5, 6, 7])
out1v2 = f1v2(b, 2)
