Let's say I have a numpy array
my_array = np.array([0.2, 0.3, np.nan, np.nan, np.nan, 0.1, np.nan, 0.5, np.nan])
For each nan value, I want to extract the nearest non-nan value to the left and to the right of that point (or a single value where only one side has one). So I would like my output to be something like
output = [[0.3,0.1], [0.3,0.1], [0.3,0.1], [0.1,0.5], [0.5]]
I was thinking of looping through all the values in my_array, then finding those that are nan, but I'm not sure how to do the next part of finding the nearest non-nan values.
Using pandas and numpy:
import numpy as np
import pandas as pd

s = pd.Series([0.2, 0.3, np.nan, np.nan, np.nan, 0.1, np.nan, 0.5, np.nan])
m = s.isna()
a = np.vstack((s.ffill()[m], s.bfill()[m]))
out = a[:,~np.isnan(a).any(0)].T.tolist()
Output:
[[0.3, 0.1], [0.3, 0.1], [0.3, 0.1], [0.1, 0.5]]
NB. You can choose to keep or drop the lists containing NaNs.
With NaNs:
out = a.T.tolist()
[[0.3, 0.1], [0.3, 0.1], [0.3, 0.1], [0.1, 0.5], [0.5, nan]]
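Or, to keep the trailing group as a single-element list instead of dropping it, you can simply filter the NaN out of each pair (a quick sketch of my own):
# drop NaNs from each pair; edge groups keep their single neighbor
out = [[v for v in pair if not np.isnan(v)] for pair in a.T.tolist()]
# [[0.3, 0.1], [0.3, 0.1], [0.3, 0.1], [0.1, 0.5], [0.5]]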
An alternative to handle the single elements:
s = pd.Series([0.2, 0.3, np.nan, np.nan, np.nan, 0.1, np.nan, 0.5, np.nan])
m = s.isna()
(pd
 .concat((s.ffill()[m], s.bfill()[m]), axis=1)
 .stack()
 .groupby(level=0).agg(list)
 .to_list()
)
Output (the .stack() step drops the remaining NaN, which is why the trailing group keeps just its single neighbor):
[[0.3, 0.1], [0.3, 0.1], [0.3, 0.1], [0.1, 0.5], [0.5]]
Less elegant than @mozway's answer, but the last list only has one element (here arr is the data as a pandas Series):
arr = pd.Series([0.2, 0.3, np.nan, np.nan, np.nan, 0.1, np.nan, 0.5, np.nan])

pd.DataFrame({
    'left': arr.ffill(),
    'right': arr.bfill()
}).loc[arr.isna()].apply(lambda row: row.dropna().to_list(), axis=1).to_list()
For the sake of education, I'll post a pretty straightforward algorithm for achieving this result. It works by finding the closest non-NaN index to the left and to the right of each NaN index, and filters out the infinite sentinels at the end:
def get_neighbors(x: np.ndarray) -> list:
    mask = np.isnan(x)
    nan_idxs, *_ = np.where(mask)
    val_idxs, *_ = np.where(~mask)

    neighbors = []
    for nan_idx in nan_idxs:
        L, R = -float("inf"), float("inf")
        for val_idx in val_idxs:
            if val_idx < nan_idx:
                L = max(L, val_idx)
            else:
                R = min(R, val_idx)
        # casting to list isn't strictly necessary, you'll just end up with a list of arrays
        neighbors.append(list(x[[i for i in (L, R) if 0 <= i < float("inf")]]))
    return neighbors
Output:
>>> get_neighbors(my_array)
[[0.3, 0.1], [0.3, 0.1], [0.3, 0.1], [0.1, 0.5], [0.5]]
The nested for loop has a worst-case runtime of O(n^2), where n is the number of elements of x; the inner work is len(nan_idxs) * len(val_idxs), which peaks at (n/2)^2 when exactly half the elements are NaN.
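If that quadratic scan ever matters, the inner loop can be replaced by a single binary search over the sorted non-NaN indices, giving O(n log n) overall. A sketch of that idea (get_neighbors_fast is my own name):
import numpy as np

def get_neighbors_fast(x: np.ndarray) -> list:
    mask = np.isnan(x)
    nan_idxs = np.where(mask)[0]
    val_idxs = np.where(~mask)[0]
    # for each NaN index, count how many non-NaN indices lie to its left
    pos = np.searchsorted(val_idxs, nan_idxs)
    neighbors = []
    for p in pos:
        pair = []
        if p > 0:                  # a non-NaN value exists to the left
            pair.append(x[val_idxs[p - 1]])
        if p < len(val_idxs):      # a non-NaN value exists to the right
            pair.append(x[val_idxs[p]])
        neighbors.append(pair)
    return neighbors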
I was eager to check how I could solve this problem with just NumPy, as an exercise. After some hours I reached a solution :), but since I think it will be inefficient compared to pandas, as mentioned by Mozway, I didn't optimize the code further (it can be optimized; the if conditions could probably be merged into other sections):
import numpy as np

my_array = np.array([np.nan, np.nan, 0.2, 0.3, np.nan, np.nan, np.nan, 0.1, 0.7, np.nan, 0.5])
nans = np.isnan(my_array).astype(np.int8) # [1 1 0 0 1 1 1 0 0 1 0]
zeros = np.where(nans == 0)[0] # [ 2 3 7 8 10]
diff_nan = np.diff(nans) # [ 0 -1 0 1 0 0 -1 0 1 -1]
start = np.where(diff_nan == 1)[0] # [3 8]
end = np.where(diff_nan == -1)[0] + 1 # [ 2 7 10]
mask_start_nan = np.isnan(my_array[0]) # True
mask_end_nan = np.isnan(my_array[-1]) # False
if mask_end_nan: start = start[:-1] # [3 8]
if mask_start_nan: end = end[1:] # [ 7 10]
inds = np.dstack([start, end]).squeeze() # [[ 3 7] [ 8 10]]
initial = my_array[inds] # [[0.3 0.1] [0.7 0.5]]
repeats = np.diff(np.where(np.concatenate(([nans[0]], nans[:-1] != nans[1:], [True])))[0])[::2] # [2 3 1]
if mask_end_nan: repeats = repeats[:-1] # [2 3 1]
if mask_start_nan: repeats = repeats[1:] # [3 1]
result = np.repeat(initial, repeats, axis=0) # [[0.3 0.1] [0.3 0.1] [0.3 0.1] [0.7 0.5]]
if mask_end_nan: result = np.array([*result, np.array(my_array[zeros[-1]])], dtype=object)
if mask_start_nan: result = np.array([np.array(my_array[zeros[0]]), *result], dtype=object)
# [array(0.2) array([0.3, 0.1]) array([0.3, 0.1]) array([0.3, 0.1]) array([0.7, 0.5])]
I don't know if there is a much easier NumPy solution; I implemented what came to my mind. I believe this code can be greatly improved (I will do it if I find some free time).
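For what it's worth, here is a shorter pure-NumPy sketch of my own that emulates pandas' ffill/bfill with cumulative maxima over indices (nan_neighbors is a hypothetical name; unlike the code above, it yields one entry per NaN position):
import numpy as np

def nan_neighbors(x):
    mask = np.isnan(x)
    idx = np.arange(len(x))
    # index of the nearest non-NaN to the left of each position (-1 if none)
    prev = np.maximum.accumulate(np.where(mask, -1, idx))
    # the same trick on the reversed array gives the nearest non-NaN to the right (len(x) if none)
    prev_rev = np.maximum.accumulate(np.where(mask[::-1], -1, idx))
    nxt = (len(x) - 1 - prev_rev)[::-1]
    # collect neighbor values at NaN positions, dropping out-of-range sentinels
    return [[x[i] for i in (p, n) if 0 <= i < len(x)]
            for p, n, m in zip(prev, nxt, mask) if m]

nan_neighbors(my_array)
# [[0.2], [0.2], [0.3, 0.1], [0.3, 0.1], [0.3, 0.1], [0.7, 0.5]]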
I'm sorry if this question isn't framed well, so I'd rather explain with an example.
I have a numpy matrix:
a = np.array([[0.5, 0.8, 0.1], [0.6, 0.9, 0.3], [0.7, 0.4, 0.8], [0.8, 0.7, 0.6]])
And another numpy array as shown:
b = np.array([1, 0, 2, 2])
The given conditions are that the values in b are in range(a.shape[1]) and that b.shape[0] == a.shape[0]. Now this is the operation I need to perform.
For every row index i of a and the corresponding entry b[i], I need to subtract 1 from a[i][j] where j == b[i].
So in my example, a[0] == [0.5, 0.8, 0.1] and b[0] == 1. Therefore I need to subtract 1 from a[0][b[0]] so that a[0] = [0.5, -0.2, 0.1]. This has to be done for all rows of a. Is there any direct solution without having to iterate through the rows or columns one by one?
Thanks.
Use numpy indexing. See this post for a nice introduction:
import numpy as np
a = np.array([[0.5, 0.8, 0.1], [0.6, 0.9, 0.3], [0.7, 0.4, 0.8], [0.8, 0.7, 0.6]])
b = np.array([1, 0, 2, 2])
a[np.arange(a.shape[0]), b] -= 1
print(a)
Output
[[ 0.5 -0.2 0.1]
[-0.4 0.9 0.3]
[ 0.7 0.4 -0.2]
[ 0.8 0.7 -0.4]]
As an alternative, use np.subtract.at:
np.subtract.at(a, (np.arange(a.shape[0]), b), 1)
print(a)
Output
[[ 0.5 -0.2 0.1]
[-0.4 0.9 0.3]
[ 0.7 0.4 -0.2]
[ 0.8 0.7 -0.4]]
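A note on the difference (my addition, not part of the original answer): with repeated index pairs, np.subtract.at applies the operation once per occurrence, while the fancy-indexed -= buffers duplicates and applies it only once per unique location. Here every (row, column) pair is unique, so both give the same result:
import numpy as np

v = np.zeros(3)
v[[0, 0, 1]] -= 1                # duplicate index 0 is buffered: [-1. -1.  0.]
w = np.zeros(3)
np.subtract.at(w, [0, 0, 1], 1)  # unbuffered: [-2. -1.  0.]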
The main idea is that:
np.arange(a.shape[0])  # shape[0] equals the number of rows
generates the indices of the rows:
[0 1 2 3]
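Paired elementwise with b, those row indices select exactly one cell per row, which is why the subtraction touches only a[i, b[i]] (a small check):
rows = np.arange(a.shape[0])  # [0 1 2 3]
a[rows, b]                    # the cells that get decremented: a[0,1], a[1,0], a[2,2], a[3,2]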
I have a pandas dataframe like this:
id prob
0 1 0.5
1 1 0.6
2 1 0.4
3 1 0.2
4 2 0.3
6 2 0.5
...
I want to group it by 'id', sort each group by prob in descending order, and take the first 3 prob values of every group. Note that some groups contain fewer than 3 rows.
Finally I want to get a 2D list like:
[[1, [0.6, 0.5, 0.4]], [2, [0.5, 0.3]], ...]
How can I do that with pandas?
Thanks!
Use sort_values, groupby, and head:
df.sort_values(by=['id','prob'], ascending=[True,False]).groupby('id').head(3).values
Output:
array([[ 1. , 0.6],
[ 1. , 0.5],
[ 1. , 0.4],
[ 2. , 0.5],
[ 2. , 0.3]])
Following @COLDSPEED's lead:
df.sort_values(by=['id','prob'], ascending=[True,False])\
.groupby('id').agg(lambda x: x.head(3).tolist())\
.reset_index().values.tolist()
Output:
[[1, [0.6, 0.5, 0.4]], [2, [0.5, 0.3]]]
You can use groupby and nlargest
df.groupby('id').prob.nlargest(3).reset_index(1,drop = True)
id
1 0.6
1 0.5
1 0.4
2 0.5
2 0.3
For the array
df1 = df.groupby('id').prob.nlargest(3).unstack(1)
np.column_stack((df1.index.values, df1.values))
You get
array([[ 1. , 0.5, 0.6, 0.4, nan, nan],
[ 2. , nan, nan, nan, 0.3, 0.5]])
If you're looking for a dataframe of array columns, you can use np.sort:
df = df.groupby('id').prob.apply(lambda x: np.sort(x.values)[:-4:-1])
df
id
1 [0.6, 0.5, 0.4]
2 [0.5, 0.3]
To retrieve the values, reset_index and access:
df.reset_index().values
array([[1, array([ 0.6, 0.5, 0.4])],
[2, array([ 0.5, 0.3])]], dtype=object)
Or, as a simple list comprehension over the groups:
[[n, g.nlargest(3).tolist()] for n, g in df.groupby('id').prob]
[[1, [0.6, 0.5, 0.4]], [2, [0.5, 0.3]]]
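If a mapping is more convenient than a nested list, the same comprehension adapts directly (a small variation of my own):
{n: g.nlargest(3).tolist() for n, g in df.groupby('id').prob}
# {1: [0.6, 0.5, 0.4], 2: [0.5, 0.3]}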
I have a numpy array of M*N dimensions in which each element of the array is a float with a value between 0-1.
Input: for simplicity, let's consider a 3*4 array:
a=np.array([
[0.1, 0.2, 0.3, 0.6],
[0.3, 0.4, 0.8, 0.7],
[0.5, 0.6, 0.2, 0.1]
])
I want to consider 3 columns at a time (say columns 0,1,2 for the first iteration and 1,2,3 for the second), take the maximum over all products formed by picking one value from each of the 3 columns, and also get the row indices of the values involved.
In this case the first window gives a max value of 0.5*0.6*0.8 = 0.24, with the row indices of the values that produced it: (2,2,1).
Output: [[0.24,(2,2,1)],[0.336,(2,1,1)]]
I can do this using loops, but I want to avoid them as it would affect the running time. Is there any way I can do this in NumPy?
Here's an approach using NumPy strides that is supposedly very efficient for such sliding-window operations, as it creates a view into the array without actually making copies -
N = 3 # Window size
m,n = a.strides
p,q = a.shape
a3D = np.lib.stride_tricks.as_strided(a,shape=(p, q-N +1, N),strides=(m,n,n))
out1 = a3D.argmax(0)       # row index of each column's maximum
out2 = a3D.max(0).prod(1)  # all values are positive, so the max product is the product of the per-column maxima
Sample run -
In [69]: a
Out[69]:
array([[ 0.1, 0.2, 0.3, 0.6],
[ 0.3, 0.4, 0.8, 0.7],
[ 0.5, 0.6, 0.2, 0.1]])
In [70]: out1
Out[70]:
array([[2, 2, 1],
[2, 1, 1]])
In [71]: out2
Out[71]: array([ 0.24 , 0.336])
We can zip those two outputs together if needed in that format (wrapping in list for Python 3) -
In [75]: list(zip(out2, map(tuple, out1)))
Out[75]: [(0.23999999999999999, (2, 2, 1)), (0.33599999999999997, (2, 1, 1))]
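On NumPy 1.20+ the same windowed view can be built with sliding_window_view, which avoids computing the strides by hand (a sketch of the same logic):
import numpy as np

a3D = np.lib.stride_tricks.sliding_window_view(a, 3, axis=1)  # shape (p, q-N+1, N)
out1 = a3D.argmax(0)
out2 = a3D.max(0).prod(1)
list(zip(out2, map(tuple, out1)))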
I have a symmetric, multi-index dataframe from which I want to systematically extract data:
import pandas as pd
df_index = pd.MultiIndex.from_arrays(
[["A", "A", "B", "B"], [1, 2, 3, 4]], names = ["group", "id"])
df = pd.DataFrame(
[[1.0, 0.5, 0.3, -0.4],
[0.5, 1.0, 0.9, -0.8],
[0.3, 0.9, 1.0, 0.1],
[-0.4, -0.8, 0.1, 1.0]],
index=df_index, columns=df_index)
I want a function extract_vals that can return all values related to elements in the same group, EXCEPT for the diagonal AND elements must not be double-counted. Here are two examples of the desired behavior (order does not matter):
A_vals = extract_vals("A", df) # [0.5, 0.3, -0.4, 0.9, -0.8]
B_vals = extract_vals("B", df) # [0.3, 0.9, 0.1, -0.4, -0.8]
My question is similar to this question on SO, but my situation is different because I am using a multi-index dataframe.
Finally, to make things more fun, please consider efficiency because I'll be running this many times on much bigger dataframes. Thanks very much!
EDIT:
Happy001's solution is awesome. I came up with a method myself based on the logic of extracting the elements where the target is in the rows but NOT in the columns, and then extracting the upper triangle of the elements where the target IS in BOTH the rows and columns. However, Happy001's solution is much faster.
First, I created a more complex dataframe to make sure both methods are generalizable:
import pandas as pd
import numpy as np
df_index = pd.MultiIndex.from_arrays(
[["A", "B", "A", "B", "C", "C"], [1, 2, 3, 4, 5, 6]], names=["group", "id"])
df = pd.DataFrame(
[[1.0, 0.5, 1.0, -0.4, 1.1, -0.6],
[0.5, 1.0, 1.2, -0.8, -0.9, 0.4],
[1.0, 1.2, 1.0, 0.1, 0.3, 1.3],
[-0.4, -0.8, 0.1, 1.0, 0.5, -0.2],
[1.1, -0.9, 0.3, 0.5, 1.0, 0.7],
[-0.6, 0.4, 1.3, -0.2, 0.7, 1.0]],
index=df_index, columns=df_index)
Next, I defined both versions of extract_vals (the first is my own):
def extract_vals(target, multi_index_level_name, df):
    # Extract entries where target is in the rows but NOT also in the columns
    target_in_rows_but_not_in_cols_vals = df.loc[
        df.index.get_level_values(multi_index_level_name) == target,
        df.columns.get_level_values(multi_index_level_name) != target]
    # Extract entries where target is in the rows AND in the columns
    target_in_rows_and_cols_df = df.loc[
        df.index.get_level_values(multi_index_level_name) == target,
        df.columns.get_level_values(multi_index_level_name) == target]
    mask = np.triu(np.ones(target_in_rows_and_cols_df.shape), k=1).astype(bool)
    vals_with_nans = target_in_rows_and_cols_df.where(mask).values.flatten()
    target_in_rows_and_cols_vals = vals_with_nans[~np.isnan(vals_with_nans)]
    # Append both arrays of extracted values
    vals = np.append(target_in_rows_but_not_in_cols_vals, target_in_rows_and_cols_vals)
    return vals
def extract_vals2(target, multi_index_level_name, df):
    # Get indices for what you want to extract and then extract all at once
    coord = [[i, j] for i in range(len(df)) for j in range(len(df))
             if i < j and (
                 df.index.get_level_values(multi_index_level_name)[i] == target
                 or df.columns.get_level_values(multi_index_level_name)[j] == target)]
    return df.values[tuple(np.transpose(coord))]
I checked that both functions returned output as desired:
# Expected values
e_A_vals = np.sort([0.5, 1.0, -0.4, 1.1, -0.6, 1.2, 0.1, 0.3, 1.3])
e_B_vals = np.sort([0.5, 1.2, -0.8, -0.9, 0.4, -0.4, 0.1, 0.5, -0.2])
e_C_vals = np.sort([1.1, -0.9, 0.3, 0.5, 0.7, -0.6, 0.4, 1.3, -0.2])
# Sort because order doesn't matter
assert np.allclose(np.sort(extract_vals("A", "group", df)), e_A_vals)
assert np.allclose(np.sort(extract_vals("B", "group", df)), e_B_vals)
assert np.allclose(np.sort(extract_vals("C", "group", df)), e_C_vals)
assert np.allclose(np.sort(extract_vals2("A", "group", df)), e_A_vals)
assert np.allclose(np.sort(extract_vals2("B", "group", df)), e_B_vals)
assert np.allclose(np.sort(extract_vals2("C", "group", df)), e_C_vals)
And finally, I checked speed:
## Test speed
import time
# Method 1
start1 = time.time()
for ii in range(10000):
    out = extract_vals("C", "group", df)
elapsed1 = time.time() - start1
print(elapsed1)  # 28.5 sec

# Method 2
start2 = time.time()
for ii in range(10000):
    out2 = extract_vals2("C", "group", df)
elapsed2 = time.time() - start2
print(elapsed2)  # 10.9 sec
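For reference, the coordinate construction in extract_vals2 can itself be vectorized with boolean masks and np.triu, removing the Python-level double loop entirely (a sketch; extract_vals3 is my own name and I haven't timed it):
import numpy as np

def extract_vals3(target, multi_index_level_name, df):
    # Boolean masks: which rows / columns belong to the target group
    row_in = df.index.get_level_values(multi_index_level_name) == target
    col_in = df.columns.get_level_values(multi_index_level_name) == target
    # keep cells strictly above the diagonal (i < j) whose row OR column is in the group
    keep = np.triu(row_in[:, None] | col_in[None, :], k=1)
    return df.values[keep]

It selects exactly the same (i, j) cells as extract_vals2, so the sorted outputs match.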
I don't assume df has the same columns and index (of course they can be the same).
def extract_vals(group_label, df):
    coord = [[i, j] for i in range(len(df)) for j in range(len(df))
             if i < j and (df.index.get_level_values('group')[i] == group_label
                           or df.columns.get_level_values('group')[j] == group_label)]
    return df.values[tuple(np.transpose(coord))]

print(extract_vals('A', df))
print(extract_vals('B', df))
result:
[ 0.5 0.3 -0.4 0.9 -0.8]
[ 0.3 -0.4 0.9 -0.8 0.1]
Is that what you want?
All elements above the diagonal (note that the [:-1] / [1:] slicing below relies on this example's layout, where group A occupies the first two rows/columns and group B the last two):
In [139]: df.values[np.triu_indices(len(df), 1)]
Out[139]: array([ 0.5, 0.3, -0.4, 0.9, -0.8, 0.1])
A_vals:
In [140]: df.values[np.triu_indices(len(df), 1)][:-1]
Out[140]: array([ 0.5, 0.3, -0.4, 0.9, -0.8])
B_vals:
In [141]: df.values[np.triu_indices(len(df), 1)][1:]
Out[141]: array([ 0.3, -0.4, 0.9, -0.8, 0.1])
Source matrix:
In [142]: df.values
Out[142]:
array([[ 1. , 0.5, 0.3, -0.4],
[ 0.5, 1. , 0.9, -0.8],
[ 0.3, 0.9, 1. , 0.1],
[-0.4, -0.8, 0.1, 1. ]])
Ok let's imagine that I have a list of values like so:
lst = [-0.23, -0.5, -0.3, -0.8, 0.3, 0.6, 0.8, -0.9, -0.4, 0.1, 0.6]
I would like to loop on this list and when the sign changes to get the absolute difference between the maximum (minimum if it's negative) of the first interval and maximum (minimum if it's negative) of the next interval.
For example on the previous list, we would like to have a result like so:
[1.6, 1.7, 1.5]
The tricky part is that it has to work also for lists like:
lst = [0.12, -0.23, 0.52, 0.2, 0.6, -0.3, 0.4]
Which would return :
[0.35, 0.83, 0.9, 0.7]
It's tricky because some intervals are 1 value long, and I'm having difficulties with managing this.
How would you solve this with the least possible number of lines?
Here is my current code, but it's not working at the moment.
lst is a list of 6 lists, where each of these 6 lists contains either a nan or a np.array of 1024 values (the values I want to evaluate):
Hmax = []
for c in range(0, 6):
    Hmax_tmp = []
    for i in range(len(lst[c])):
        if not np.isnan(lst[c][i]).any():
            zero_crossings = np.where(np.diff(np.sign(lst[c][i])))[0]
            if not zero_crossings[0] == 0:
                zero_crossings = [0] + zero_crossings.tolist() + [1023]
            diff = []
            for j in range(1, len(zero_crossings) - 2):
                if
                diff.append(max(lst[c][i][np.arange(zero_crossings[j-1], zero_crossings[j])].min(),
                                lst[c][i][np.arange(zero_crossings[j]+1, zero_crossings[j+1])].max(),
                                key=abs)
                            - max(lst[c][i][np.arange(zero_crossings[j+1], zero_crossings[j+2])].min(),
                                  lst[c][i][np.arange(zero_crossings[j+1], zero_crossings[j+2])].max(),
                                  key=abs))
            Hmax_tmp.append(np.max(diff))
        else:
            Hmax_tmp.append(np.nan)
    Hmax.append(Hmax_tmp)
This type of grouping operation can be greatly simplified using itertools.groupby. For example:
>>> from itertools import groupby
>>> lst = [-0.23, -0.5, -0.3, -0.8, 0.3, 0.6, 0.8, -0.9, -0.4, 0.1, 0.6] # the list
>>> minmax = [min(v) if k else max(v) for k,v in groupby(lst, lambda a: a < 0)]
>>> [abs(j-i) for i,j in zip(minmax[:-1], minmax[1:])]
[1.6, 1.7000000000000002, 1.5]
And the second list:
>>> lst2 = [0.12, -0.23, 0.52, 0.2, 0.6, -0.3, 0.4] # the list
>>> minmax = [min(v) if k else max(v) for k,v in groupby(lst2, lambda a: a < 0)]
>>> [abs(j-i) for i,j in zip(minmax[:-1], minmax[1:])]
[0.35, 0.83, 0.8999999999999999, 0.7]
So here, the list is grouped into consecutive intervals of negative/positive values. The min/max is computed for each group and stored in a list minmax. Lastly, a list comprehension finds the differences.
Excuse the inexact representations of floating point numbers!
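Since the real data are NumPy arrays of 1024 values, the same grouping idea translates to NumPy by splitting at the sign changes (a sketch of my own; sign_change_diffs is a hypothetical name, and 0 counts as positive, as above):
import numpy as np

def sign_change_diffs(arr):
    arr = np.asarray(arr, dtype=float)
    neg = arr < 0
    # positions where the sign flips mark the boundaries between runs
    splits = np.where(neg[1:] != neg[:-1])[0] + 1
    runs = np.split(arr, splits)
    # min of each negative run, max of each positive run
    extremes = np.array([r.min() if r[0] < 0 else r.max() for r in runs])
    return np.abs(np.diff(extremes))

sign_change_diffs([-0.23, -0.5, -0.3, -0.8, 0.3, 0.6, 0.8, -0.9, -0.4, 0.1, 0.6])
# array([1.6, 1.7, 1.5])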
It is straightforward to retrieve the max/min values of the intervals and then do the calculation:
def difference(nums):
    if not nums:
        return []
    pivots = []
    last_sign = nums[0] >= 0
    current = 0
    for x in nums:
        current_sign = x >= 0
        if current_sign != last_sign:
            # sign changed: store the extreme of the interval just finished
            pivots.append(current)
            current = 0
        last_sign = current_sign
        # track the max of a positive interval, the min of a negative one
        current = max(current, x) if current_sign else min(current, x)
    pivots.append(current)
    result = []
    for idx in range(len(pivots) - 1):
        result.append(abs(pivots[idx] - pivots[idx + 1]))
    return result
>>> print(difference([-0.23, -0.5, -0.3, -0.8, 0.3, 0.6, 0.8, -0.9, -0.4, 0.1, 0.6]))
[1.6, 1.7000000000000002, 1.5]
>>> print(difference([0.12, -0.23, 0.52, 0.2, 0.6, -0.3, 0.4]))
[0.35, 0.83, 0.8999999999999999, 0.7]