Using Pandas/NumPy to increase resolution - python

I need to change the number of points in an array so that each new point takes the y value of the nearest original point on its left (a step / zero-order-hold resampling).
import numpy as np
def regularizeSeries1(x, y, M=100):
    s0 = x - x[0]
    s1 = np.linspace(0, max(s0), M + 1)
    z = np.empty(M)
    for i in range(M):
        z[i] = y[s0 <= s1[i]][-1]
    return z
x = np.array([0, 1, 2, 5, 7, 8, 10])
y = np.array([0, 1, 3, 4, 6, 7.5, 9])
M = 20
Z = regularizeSeries1(x, y, M)
How can I do this without a loop, using Pandas or NumPy?

Merge the two sets of x values, then fill the resulting NaNs with DataFrame.ffill:
import pandas as pd
import numpy as np
M = 20
x = np.array([0, 1, 2, 5, 7, 8, 10])
y = np.array([0, 1, 3, 4, 6, 7.5, 9])
s1 = np.linspace(0, max(x), M)  # here x[0] == 0, so x - x[0] == x
df1 = pd.DataFrame({'x': x, 'y': y})
df2 = pd.DataFrame({'x': s1})
df3 = df1.merge(df2, on='x', how='outer').sort_values(by='x').ffill().reset_index(drop=True)
df3 = df3[df3['x'].isin(df2['x'])]
newX, newY = df3['x'], df3['y']
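For completeness, here is a loop-free pure-NumPy alternative (a sketch added here, not part of the original answer; regularizeSeries1_np is just an illustrative name): np.searchsorted finds, for each new sample position, the index of the last original point at or to its left, which reproduces the step behaviour of the loop.

import numpy as np

def regularizeSeries1_np(x, y, M=100):
    # same sampling grid as the loop version
    s0 = x - x[0]
    s1 = np.linspace(0, max(s0), M + 1)[:M]
    # index of the last original point with s0 <= s1[i]
    idx = np.searchsorted(s0, s1, side='right') - 1
    return y[idx]

Z = regularizeSeries1_np(x, y, M)  # should match the loop version's Z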

Related

Remove elements from Numpy array until y has equivalent elements in each value

I have an array y composed of 0s and 1s, where the two values occur with different frequencies.
For example:
y = np.array([0, 0, 1, 1, 1, 1, 0])
And I have an array x of the same length.
x = np.array([0, 1, 2, 3, 4, 5, 6])
The idea is to filter out elements until there are the same number of 0s and 1s.
A valid solution would be to remove index 5:
x = np.array([0, 1, 2, 3, 4, 6])
y = np.array([0, 0, 1, 1, 1, 0])
A naive method I can think of is to get the difference between the value frequencies of y (in this case 4 - 3 = 1), create a mask for y == 1, and switch random elements from True to False until the difference is 0. Then create a mask for y == 0, OR the two masks together, and apply the result to both x and y.
This doesn't really seem the best "python/numpy way" of doing it, though.
Any suggestions? Something like randomly selecting n elements from the majority value, where n is the count of the minority value.
If this is easier with pandas then that would work for me too.
Naive algorithm, assuming the 1s outnumber the 0s:
mask_pos = y == 1
mask_neg = y == 0
pos = len(y[mask_pos])
neg = len(y[mask_neg])
diff = pos - neg
while diff > 0:
    rand = np.random.randint(0, len(y))
    if mask_pos[rand]:
        mask_pos[rand] = False
        diff -= 1
mask_final = mask_pos | mask_neg
y_new = y[mask_final]
x_new = x[mask_final]
This naive algorithm is really slow.
One way to do that with NumPy is this:
import numpy as np

# Makes a mask to balance ones and zeros
def balance_binary_mask(binary_array):
    # Work on a flat boolean copy so `~` inverts 0/1 correctly
    binary_array = np.asarray(binary_array).astype(bool).ravel()
    # Count number of ones
    z = np.count_nonzero(binary_array)
    # If there are no more ones than zeros
    if z <= len(binary_array) // 2:
        # Invert the array so the majority value is True
        binary_array = ~binary_array
    # Find the majority elements
    idx = np.nonzero(binary_array)[0]
    # Number of elements to remove
    rem = 2 * len(idx) - len(binary_array)
    # Pick random indices to remove
    rem_idx = np.random.choice(idx, size=rem, replace=False)
    # Make mask
    mask = np.ones_like(binary_array, dtype=bool)
    # Mask out the elements to remove
    mask[rem_idx] = False
    return mask
# Test
np.random.seed(0)
y = np.array([0, 0, 1, 1, 1, 1, 0])
x = np.array([0, 1, 2, 3, 4, 5, 6])
m = balance_binary_mask(y)
print(m)
# [ True True True True False True True]
y = y[m]
x = x[m]
print(y)
# [0 0 1 1 1 0]
print(x)
# [0 1 2 3 5 6]
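Since the question says pandas would also be fine: the same balancing can be done with a short pandas sketch (my addition, assuming pandas >= 1.1 for GroupBy.sample), by sampling the minority-class count from each class:

import numpy as np
import pandas as pd

y = np.array([0, 0, 1, 1, 1, 1, 0])
x = np.array([0, 1, 2, 3, 4, 5, 6])

df = pd.DataFrame({'x': x, 'y': y})
n = df['y'].value_counts().min()                     # size of the minority class
balanced = df.groupby('y').sample(n=n).sort_index()  # n random rows per class, original order kept
x_new, y_new = balanced['x'].to_numpy(), balanced['y'].to_numpy()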

initiate 2d array with given ordered datasets

I didn't know what keywords to search for in this case.
I want to make a 2-D array whose columns hold the three given arrays x, y, and z, so each row is an (x, y, z) triple.
x = [3,6,9,12]
y = [4,8,12,16]
z = [5,10,15,20]
to this:
[3,4,5],
[6,8,10],
[9,12,15],
[12,16,20]
My code is below; is there a better way to write this?
x = [3, 6, 9, 12]
y = [4, 8, 12, 16]
z = [5, 10, 15, 20]
count = 0
ans = []
for ind1 in range(4):
    ans.append([x[count], y[count], z[count]])
    count += 1
I will use numpy here.
import numpy as np
xyz = np.zeros((4, 3))
x = [3,6,9,12]
y = [4,8,12,16]
z = [5,10,15,20]
xyz[:, 0] = np.reshape(x, -1)
xyz[:, 1] = np.reshape(y, -1)
xyz[:, 2] = np.reshape(z, -1)
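For reference, NumPy can also build this in one call (my addition, not part of the quoted answer): np.column_stack stacks 1-D sequences as columns.

import numpy as np

x = [3, 6, 9, 12]
y = [4, 8, 12, 16]
z = [5, 10, 15, 20]

xyz = np.column_stack((x, y, z))  # shape (4, 3); rows are (x, y, z) triples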
You can do this with zip (renaming the loop variable to t, to avoid shadowing x):
In [1]: x = [3,6,9,12]
   ...: y = [4,8,12,16]
   ...: z = [5,10,15,20]
In [2]: [list(t) for t in zip(x, y, z)]
Out[2]: [[3, 4, 5], [6, 8, 10], [9, 12, 15], [12, 16, 20]]

Calculate average of groups of DataFrame rows given by 2-D lists of indices with unequal lengths

I have a DataFrame with n rows. I also have a 2-D array of indices, which likewise has n rows; however, each row can be variable in length. I need to group DataFrame rows according to the indices and calculate the average of a column.
For example:
If I have DataFrame df and array ind, I need to get
[df.loc[i, col_name].mean() for i in ind].
I've implemented this using the pandas apply function:
size = 100000
df = pd.DataFrame(columns=['a'])
df['a'] = np.arange(size)
np.random.seed(1)
ind = np.array([np.random.randint(0, size, size=5) for _ in range(size)])

def group(row):
    return df.loc[ind[df.index.get_loc(row.name)], 'a'].mean()

df['avg'] = df.apply(group, axis=1)
but this is slow and scales poorly. In this case it's significantly faster to do
df.a.values[ind].mean(axis=1)
However, as far as I understand, this works only because all elements of ind are the same length, and the following code does not work:
new_ind = ind.tolist()
new_ind[0].pop()
df.a.values[new_ind].mean(axis=1)
I've toyed around with the pandas groupby method but have had no success. Is there another efficient way to group rows according to lists of indices with unequal lengths and return a mean of a column?
Setup
Keeping the dataframe shorter for demonstration purposes:
np.random.seed(1)
size = 10
df = pd.DataFrame(dict(a=np.arange(size)))

# array of variable-length sub-arrays
# (dtype=object because the rows are ragged)
ind = np.array([
    np.random.randint(
        0, size, size=np.random.randint(1, 11)
    ) for _ in range(size)
], dtype=object)
Solution
Use np.bincount with the weights parameter.
This should be a very fast solution.
# get an array of the lengths of the sub-arrays
lengths = np.array([len(x) for x in ind])
# simple np.arange for the bin positions
positions = np.arange(len(ind))
# get at the underlying values of column 'a'
values = df.a.values
# Repeat each position as many times as the length of its sub-array,
# then, for each bin identified by a position, add up the amounts
# from `values` at the indices in that sub-array.
# Dividing the sums by the lengths gives the averages.
avg = np.bincount(
    positions.repeat(lengths),
    values[np.concatenate(ind)]
) / lengths

df.assign(avg=avg)
   a       avg
0  0  3.833333
1  1  4.250000
2  2  6.200000
3  3  6.000000
4  4  5.200000
5  5  5.400000
6  6  2.000000
7  7  3.750000
8  8  6.500000
9  9  6.200000
Timing
For each row, this table expresses every method's time as a multiple of the fastest method's time for that size (the minimum is 1). The last column identifies the fastest method for the data length in that row.
Method  pir      mcf  Best
Size
10        1  12.3746   pir
30        1  44.0495   pir
100       1  124.054   pir
300       1    270.6   pir
1000      1  576.505   pir
3000      1  819.034   pir
10000     1  990.847   pir
Code
from timeit import timeit
import matplotlib.pyplot as plt

def mcf(d, i):
    g = lambda r: d.loc[i[d.index.get_loc(r.name)], 'a'].mean()
    return d.assign(avg=d.apply(g, 1))

def pir(d, i):
    lengths = np.array([len(x) for x in i])
    positions = np.arange(len(i))
    values = d.a.values
    avg = np.bincount(
        positions.repeat(lengths),
        values[np.concatenate(i)]
    ) / lengths
    return d.assign(avg=avg)
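As a quick sanity check that the two functions agree (a small sketch added here, reusing the setup from above):

np.random.seed(1)
df = pd.DataFrame(dict(a=np.arange(10)))
ind = np.array([np.random.randint(0, 10, size=np.random.randint(1, 11))
                for _ in range(10)], dtype=object)
assert np.allclose(mcf(df, ind)['avg'], pir(df, ind)['avg'])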
results = pd.DataFrame(
    index=pd.Index([10, 30, 100, 300, 1000, 3000, 10000], name='Size'),
    columns=pd.Index(['pir', 'mcf'], name='Method')
)

for i in results.index:
    df = pd.DataFrame(dict(a=np.arange(i)))
    ind = np.array([
        np.random.randint(
            0, i, size=np.random.randint(1, 11)
        ) for _ in range(i)
    ], dtype=object)
    for j in results.columns:
        stmt = '{}(df, ind)'.format(j)
        setp = 'from __main__ import df, ind, {}'.format(j)
        results.loc[i, j] = timeit(stmt, setp, number=10)

results.div(results.min(1), 0).round(2).pipe(lambda d: d.assign(Best=d.idxmin(1)))

fig, (a1, a2) = plt.subplots(2, 1, figsize=(6, 6))
results.plot(loglog=True, lw=3, ax=a1)
results.div(results.min(1), 0).round(2).plot.bar(logy=True, ax=a2)
I think this is what you might be after. I set the size lower to make it easier to demonstrate.
Here is a shortened version of your code, with a repeatable (fixed) ind that you can test against:
import pandas as pd
import numpy as np
size = 10
df = pd.DataFrame(columns=['a'])
df['a'] = np.arange(size)
ind = np.array([[5, 8, 9, 5, 0],
[0, 1, 7, 6, 9],
[2, 4, 5, 2, 4],
[2, 4, 7, 7, 9],
[1, 7, 0, 6, 9],
[9, 7, 6, 9, 1],
[0, 1, 8, 8, 3],
[9, 8, 7, 3, 6],
[5, 1, 9, 3, 4],
[8, 1, 4, 0, 3]])
def group(row):
    return df.loc[ind[df.index.get_loc(row.name)], 'a'].mean()

df['avg'] = df.apply(group, axis=1)
The following also gives the same result:
df['comparison'] = df.a.values[ind].mean(axis=1)
In [86]: (df['comparison'] == df['avg']).all()
Out[86]: True
Timings
Before:        0.5263588428497314
After:         0.014391899108886719
With bincount: 0.03328204154968262
Comparison and Scaling
To compare how the approaches scale, I set up three timeit functions (code at the bottom) and define the sizes I want to test:
import timeit
sizes = [10, 100, 1000, 10000]
res_mine = list(map(mine, sizes))
res_bincount = list(map(bincount, sizes))
res_original = list(map(original, sizes[:-1]))
Timing Code
def bincount(size):
    return min(timeit.repeat(
        """lengths = np.array([len(x) for x in ind])
positions = np.arange(len(ind))
values = df.a.values
avg = np.bincount(positions.repeat(lengths), values[np.concatenate(ind)]) / lengths
df.assign(avg=avg)""",
        """import pandas as pd
import numpy as np
size = {size}
df = pd.DataFrame(columns=['a'])
df['a'] = np.arange(size)
np.random.seed(1)
ind = np.array([np.random.randint(0, size, size=5) for _ in range(size)])
def group(row):
    return df.loc[ind[df.index.get_loc(row.name)], 'a'].mean()""".format(size=size),
        number=100, repeat=10))

def original(size):
    return min(timeit.repeat(
        """df['avg'] = df.apply(group, axis=1)""",
        """import pandas as pd
import numpy as np
size = {size}
df = pd.DataFrame(columns=['a'])
df['a'] = np.arange(size)
np.random.seed(1)
ind = np.array([np.random.randint(0, size, size=5) for _ in range(size)])
def group(row):
    return df.loc[ind[df.index.get_loc(row.name)], 'a'].mean()""".format(size=size),
        repeat=10, number=1))

def mine(size):
    return min(timeit.repeat(
        """df['comparison'] = df.a.values[ind].mean(axis=1)""",
        """import pandas as pd
import numpy as np
size = {size}
df = pd.DataFrame(columns=['a'])
df['a'] = np.arange(size)
np.random.seed(1)
ind = np.array([np.random.randint(0, size, size=5) for _ in range(size)])
def group(row):
    return df.loc[ind[df.index.get_loc(row.name)], 'a'].mean()""".format(size=size),
        repeat=100, number=10))
import matplotlib.pyplot as plt
fig = plt.figure()
ax = plt.axes()
ax.plot(sizes, res_mine, label='mine')
ax.plot(sizes, res_bincount, label='bincount')
ax.plot(sizes[:-1], res_original, label='original')
plt.yscale('log')
plt.xscale('log')
plt.legend()
plt.xlabel('size of dataframe')
plt.ylabel('run time (s)')
plt.show()
Note that I had to reduce the number of runs for original, as it was taking very long.

How to access multiple columns in a rolling operation?

I want to do a rolling-window calculation in pandas that needs to work with two columns at the same time. I'll use a simple example to express the problem clearly:
import pandas as pd

df = pd.DataFrame({
    'x': [1, 2, 3, 2, 1, 5, 4, 6, 7, 9],
    'y': [4, 3, 4, 6, 5, 9, 1, 3, 1, 2]
})
windowSize = 4
result = []

for i in range(1, len(df) + 1):
    if i < windowSize:
        result.append(None)
    else:
        x = df.x.iloc[i - windowSize:i]
        y = df.y.iloc[i - windowSize:i]
        m = y.mean()
        r = sum(x[y > m]) / sum(x[y <= m])
        result.append(r)

print(result)
Is there any way to solve this in pandas without a for loop? Any help is appreciated.
You can use the rolling window trick for numpy arrays and apply it to the array underlying the DataFrame.
import pandas as pd
import numpy as np

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

df = pd.DataFrame({
    'x': [1, 2, 3, 2, 1, 5, 4, 6, 7, 9],
    'y': [4, 3, 4, 6, 5, 9, 1, 3, 1, 2]
})
windowSize = 4

rw = rolling_window(df.values.T, windowSize)
m = np.mean(rw[1], axis=-1, keepdims=True)
a = np.sum(rw[0] * (rw[1] > m), axis=-1)
b = np.sum(rw[0] * (rw[1] <= m), axis=-1)
result = a / b
The result lacks the leading None values, but they are easy to prepend (in the form of np.nan, or after converting the result to a list).
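For instance, a one-line way to prepend them (a small sketch, not part of the original answer):

result = np.concatenate([np.full(windowSize - 1, np.nan), result])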
This is probably not the pandas-native solution you were looking for, but it gets the job done without loops.
Here's one vectorized approach using NumPy tools -
windowSize = 4
a = df.values
X = strided_app(a[:,0],windowSize,1)
Y = strided_app(a[:,1],windowSize,1)
M = Y.mean(1)
mask = Y>M[:,None]
sums = np.einsum('ij,ij->i',X,mask)
rest_sums = X.sum(1) - sums
out = sums/rest_sums
strided_app is taken from here.
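Since the link target doesn't carry over here, this is the usual definition of strided_app (reconstructed for reference; treat it as a sketch):

def strided_app(a, L, S):
    # rolling windows of length L with stride S over the 1-D array a
    nrows = ((a.size - L) // S) + 1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(
        a, shape=(nrows, L), strides=(S * n, n))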
Runtime test -
Approaches -
# @kazemakase's solution
def rolling_window_sum(df, windowSize=4):
    rw = rolling_window(df.values.T, windowSize)
    m = np.mean(rw[1], axis=-1, keepdims=True)
    a = np.sum(rw[0] * (rw[1] > m), axis=-1)
    b = np.sum(rw[0] * (rw[1] <= m), axis=-1)
    result = a / b
    return result

# Proposed in this post
def strided_einsum(df, windowSize=4):
    a = df.values
    X = strided_app(a[:,0], windowSize, 1)
    Y = strided_app(a[:,1], windowSize, 1)
    M = Y.mean(1)
    mask = Y > M[:,None]
    sums = np.einsum('ij,ij->i', X, mask)
    rest_sums = X.sum(1) - sums
    out = sums / rest_sums
    return out
Timings -
In [46]: df = pd.DataFrame(np.random.randint(0,9,(1000000,2)))
In [47]: %timeit rolling_window_sum(df)
10 loops, best of 3: 90.4 ms per loop
In [48]: %timeit strided_einsum(df)
10 loops, best of 3: 62.2 ms per loop
To squeeze out more performance, we can compute the Y.mean(1) part, which is basically a windowed mean, with SciPy's 1-D uniform filter. Thus, for windowSize=4, M could alternatively be computed as -
from scipy.ndimage import uniform_filter1d as unif1d
M = unif1d(a[:,1].astype(float), windowSize)[2:-1]
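The timings below reference strided_einsum_unif_filter, which the post doesn't spell out; presumably it is strided_einsum with M computed this way, roughly (my reconstruction, a sketch):

def strided_einsum_unif_filter(df, windowSize=4):
    a = df.values
    X = strided_app(a[:,0], windowSize, 1)
    Y = strided_app(a[:,1], windowSize, 1)
    # windowed mean via a uniform filter instead of Y.mean(1)
    M = unif1d(a[:,1].astype(float), windowSize)[2:-1]
    mask = Y > M[:,None]
    sums = np.einsum('ij,ij->i', X, mask)
    rest_sums = X.sum(1) - sums
    return sums / rest_sums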
The performance gains are significant -
In [65]: %timeit strided_einsum(df)
10 loops, best of 3: 61.5 ms per loop
In [66]: %timeit strided_einsum_unif_filter(df)
10 loops, best of 3: 49.4 ms per loop

Summing values of numpy array based on indices in other array

Assume I have the following arrays:
N = 8
M = 4
a = np.zeros(M)
b = np.random.randint(M, size=N) # contains indices for a
c = np.random.rand(N) # contains random values
I want to sum the values of c according to the indices provided in b, and store them in a. Writing a loop for this is trivial:
for i, v in enumerate(b):
    a[v] += c[i]
Since N can get quite big in my real-world problem, I'd like to avoid Python loops, but I can't figure out how to write it as a NumPy statement. Can anyone help me out?
OK, here are some example values:
In [27]: b
Out[27]: array([0, 1, 2, 0, 2, 3, 1, 1])
In [28]: c
Out[28]:
array([ 0.15517108, 0.84717734, 0.86019899, 0.62413489, 0.24357903,
0.86015187, 0.85813481, 0.7071174 ])
In [30]: a
Out[30]: array([ 0.77930596, 2.41242955, 1.10377802, 0.86015187])
import numpy as np
N = 8
M = 4
b = np.array([0, 1, 2, 0, 2, 3, 1, 1])
c = np.array([ 0.15517108, 0.84717734, 0.86019899, 0.62413489, 0.24357903, 0.86015187, 0.85813481, 0.7071174 ])
a = ((np.mgrid[:M,:N] == b)[0] * c).sum(axis=1)
returns
array([ 0.77930597, 2.41242955, 1.10377802, 0.86015187])
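For what it's worth, this scatter-add pattern also maps directly onto np.bincount with weights (a standard NumPy idiom, added here for reference; it is not part of the quoted answer):

a = np.bincount(b, weights=c, minlength=M)

np.add.at(a, b, c) is another option if a already exists and should be updated in place.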
