Creating a dataframe based on conditions on other dataframes - Python

I have two dataframes: s with 1 column and d with 3 columns:
import pandas as pd

s = {0: [0, 0.3, 0.5, -0.1, -0.2, 0.7, 0]}
d = {0: [0.1, 0.2, -0.2, 0, 0, 0, 0], 1: [0.3, 0.4, -0.7, 0, 0.8, 0, 0.1], 2: [-0.5, 0.4, -0.1, 0.5, 0.5, 0, 0]}
sd = pd.DataFrame(data=s)
dd = pd.DataFrame(data=d)
result = pd.DataFrame()
I want to get the result dataframe (1 column) based on values in those two:
1. When the value in sd is 0, return 0.
2. When the value in sd is not 0, check whether the corresponding row in dd has at least one non-zero value: if yes, return the average of the non-zero values; if not, return 'OK'.
Here is what I would like to get:
results:
0     0
1     0.333
2    -0.333
3     0.5
4     0.65
5     OK
6     0
I know I can use dd[dd != 0].mean(axis=1) to calculate the mean of the non-zero values in each row, but I don't know how to connect all three conditions together.

Using np.where twice:
np.where(sd[0] == 0, 0, np.where(dd.eq(0).all(1), 'OK', dd.mask(dd == 0).mean(1)))
Out[232]:
array(['0', '0.3333333333333333', '-0.3333333333333333', '0.5', '0.65',
'OK', '0'], dtype='<U32')
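Note that np.where upcasts the whole result to strings here, because 'OK' is mixed in with the numbers. If you want to keep the numeric rows as floats, one option (a sketch, not part of the original answer) is to build an object-dtype Series and fill in the special cases afterwards:
res = pd.Series(dd.mask(dd == 0).mean(axis=1), dtype=object)
res[dd.eq(0).all(axis=1)] = 'OK'   # rows of dd that are all zero
res[sd[0] == 0] = 0                # rule 1 takes priority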

Using numpy.select:
c1 = sd[0].eq(0)
c2 = dd.eq(0).all(1)
res = np.select([c1, c2], [0, 'OK'], dd.where(dd.ne(0)).mean(1))
pd.Series(res)
0 0
1 0.3333333333333333
2 -0.3333333333333333
3 0.5
4 0.65
5 OK
6 0
dtype: object

Thank you for your help. I managed to do it in a somewhat different way.
I used:
res1 = pd.Series(np.where(sd[0]==0, 0, dd[dd != 0].mean(axis=1))).fillna('OK')
The difference is that it returns float values (for the rows that are not 'OK') rather than strings. It also appears to be slightly faster.

Get two neighboring non-nan values in numpy array

Let's say I have a NumPy array:
my_array = np.array([0.2, 0.3, np.nan, np.nan, np.nan, 0.1, np.nan, 0.5, np.nan])
For each nan value, I want to extract the two non-nan values to the left and right of that point (or single value if appropriate). So I would like my output to be something like
output = [[0.3,0.1], [0.3,0.1], [0.3,0.1], [0.1,0.5], [0.5]]
I was thinking of looping through all the values in my_array, then finding those that are nan, but I'm not sure how to do the next part of finding the nearest non-nan values.
Using pandas and numpy:
import numpy as np
import pandas as pd

s = pd.Series([0.2, 0.3, np.nan, np.nan, np.nan, 0.1, np.nan, 0.5, np.nan])
m = s.isna()
a = np.vstack((s.ffill()[m], s.bfill()[m]))
out = a[:, ~np.isnan(a).any(0)].T.tolist()
Output:
[[0.3, 0.1], [0.3, 0.1], [0.3, 0.1], [0.1, 0.5]]
NB. You can choose to keep or drop the lists containing NaNs.
With NaNs:
out = a.T.tolist()
[[0.3, 0.1], [0.3, 0.1], [0.3, 0.1], [0.1, 0.5], [0.5, nan]]
An alternative to handle the single elements:
s = pd.Series([0.2, 0.3, np.nan, np.nan, np.nan, 0.1, np.nan, 0.5, np.nan])
m = s.isna()
(pd
 .concat((s.ffill()[m], s.bfill()[m]), axis=1)
 .stack()
 .groupby(level=0).agg(list)
 .to_list()
)
Output:
[[0.3, 0.1], [0.3, 0.1], [0.3, 0.1], [0.1, 0.5], [0.5]]
Less elegant than @mozway's answer, but the last list only has one element (here arr is the input as a pandas Series, e.g. arr = pd.Series(my_array)):
pd.DataFrame({
    'left': arr.ffill(),
    'right': arr.bfill()
}).loc[arr.isna()].apply(lambda row: row.dropna().to_list(), axis=1).to_list()
For the sake of education, I'll post a pretty straightforward algorithm for achieving this result. It works by finding the closest index of a value to the left and to the right of each index of a NaN, and filters out any infs at the end:
import numpy as np

def get_neighbors(x: np.ndarray) -> list:
    mask = np.isnan(x)
    nan_idxs, *_ = np.where(mask)
    val_idxs, *_ = np.where(~mask)
    neighbors = []
    for nan_idx in nan_idxs:
        L, R = -float("inf"), float("inf")
        for val_idx in val_idxs:
            if val_idx < nan_idx:
                L = max(L, val_idx)
            else:
                R = min(R, val_idx)
        # casting to list isn't strictly necessary, you'll just end up with a list of arrays
        neighbors.append(list(x[[i for i in (L, R) if i >= 0 and i < float("inf")]]))
    return neighbors
Output:
>>> get_neighbors(my_array)
[[0.3, 0.1], [0.3, 0.1], [0.3, 0.1], [0.1, 0.5], [0.5]]
The nested for loop has a worst-case runtime of O((n / 2)^2) where n is the number of elements of x (worst case occurs when exactly half the elements are NaN).
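If the quadratic inner loop ever becomes a bottleneck, the search for the nearest value index on each side can be done with np.searchsorted instead (a sketch, not part of the original answer; same input and output format as get_neighbors above):
import numpy as np

def get_neighbors_searchsorted(x: np.ndarray) -> list:
    mask = np.isnan(x)
    nan_idxs = np.flatnonzero(mask)
    val_idxs = np.flatnonzero(~mask)
    # for each NaN position, the point where it would be inserted among the non-NaN positions
    pos = np.searchsorted(val_idxs, nan_idxs)
    neighbors = []
    for p in pos:
        pair = []
        if p > 0:                  # there is a non-NaN value to the left
            pair.append(float(x[val_idxs[p - 1]]))
        if p < len(val_idxs):      # there is a non-NaN value to the right
            pair.append(float(x[val_idxs[p]]))
        neighbors.append(pair)
    return neighbors

>>> get_neighbors_searchsorted(my_array)
[[0.3, 0.1], [0.3, 0.1], [0.3, 0.1], [0.1, 0.5], [0.5]]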
As an exercise, I was eager to see how this problem could be solved using just NumPy. After some hours I reached a solution :), but since I think it will be inefficient compared to pandas, as mentioned by Mozway, I didn't optimize the code further (it can be optimized; the if conditions could probably be simplified and merged into the other sections):
import numpy as np

my_array = np.array([np.nan, np.nan, 0.2, 0.3, np.nan, np.nan, np.nan, 0.1, 0.7, np.nan, 0.5])
nans = np.isnan(my_array).astype(np.int8) # [1 1 0 0 1 1 1 0 0 1 0]
zeros = np.where(nans == 0)[0] # [ 2 3 7 8 10]
diff_nan = np.diff(nans) # [ 0 -1 0 1 0 0 -1 0 1 -1]
start = np.where(diff_nan == 1)[0] # [3 8]
end = np.where(diff_nan == -1)[0] + 1 # [ 2 7 10]
mask_start_nan = np.isnan(my_array[0]) # True
mask_end_nan = np.isnan(my_array[-1]) # False
if mask_end_nan: start = start[:-1] # [3 8]
if mask_start_nan: end = end[1:] # [ 7 10]
inds = np.dstack([start, end]).squeeze() # [[ 3 7] [ 8 10]]
initial = my_array[inds] # [[0.3 0.1] [0.7 0.5]]
repeats = np.diff(np.where(np.concatenate(([nans[0]], nans[:-1] != nans[1:], [True])))[0])[::2] # [2 3 1]
if mask_end_nan: repeats = repeats[:-1] # [2 3 1]
if mask_start_nan: repeats = repeats[1:] # [3 1]
result = np.repeat(initial, repeats, axis=0) # [[0.3 0.1] [0.3 0.1] [0.3 0.1] [0.7 0.5]]
if mask_end_nan: result = np.array([*result, np.array(my_array[zeros[-1]])], dtype=object)
if mask_start_nan: result = np.array([np.array(my_array[zeros[0]]), *result], dtype=object)
# [array(0.2) array([0.3, 0.1]) array([0.3, 0.1]) array([0.3, 0.1]) array([0.7, 0.5])]
I don't know if there is a much easier NumPy-only solution; I implemented what came to mind. I believe this code can be greatly improved (I will do so if I find free time).
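For comparison, a more compact NumPy-only sketch (not part of the original answer) is to forward- and backward-fill the indices of the non-NaN values with np.maximum.accumulate / np.minimum.accumulate; on the array from the question it reproduces the expected output:
import numpy as np

def neighbors_numpy(x: np.ndarray) -> list:
    n = len(x)
    isnan = np.isnan(x)
    idx = np.arange(n)
    # index of the nearest non-NaN to the left of each position (forward fill of indices)
    left = np.maximum.accumulate(np.where(isnan, -1, idx))
    # index of the nearest non-NaN to the right of each position (backward fill of indices)
    right = np.minimum.accumulate(np.where(isnan, n, idx)[::-1])[::-1]
    out = []
    for i in np.flatnonzero(isnan):
        pair = []
        if left[i] >= 0:     # a value exists to the left
            pair.append(float(x[left[i]]))
        if right[i] < n:     # a value exists to the right
            pair.append(float(x[right[i]]))
        out.append(pair)
    return out

>>> neighbors_numpy(np.array([0.2, 0.3, np.nan, np.nan, np.nan, 0.1, np.nan, 0.5, np.nan]))
[[0.3, 0.1], [0.3, 0.1], [0.3, 0.1], [0.1, 0.5], [0.5]]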

How to append array as column to the original Pandas dataframe

I have a data frame that looks like below:
x_1 x_2 x_3 x_combined
0   1   0   [0, 1, 0]
1   0   1   [1, 0, 1]
1   1   0   [1, 1, 0]
0   0   1   [0, 0, 1]
Then I calculated the centroid of each dimension by using np.mean(df['x_combined'], axis=0) and got
array([0.5, 0.5, 0.5])
How do I now append it back to the original DF as a fifth column and have the same value for each row? It should look like this:
x_1 x_2 x_3 x_combined centroid
0   1   0   [0, 1, 0]  [0.5, 0.5, 0.5]
1   0   1   [1, 0, 1]  [0.5, 0.5, 0.5]
1   1   0   [1, 1, 0]  [0.5, 0.5, 0.5]
0   0   1   [0, 0, 1]  [0.5, 0.5, 0.5]
This also works:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'x_1': [0, 1, 1, 0],
    'x_2': [1, 0, 1, 0],
    'x_3': [0, 1, 0, 1],
    'x_combined': [np.array([0, 1, 0]), np.array([1, 0, 1]),
                   np.array([1, 1, 0]), np.array([0, 0, 1])]
})
a = np.mean(df['x_combined'], axis=0) # or a = df['x_combined'].mean(axis=0)
df['centroid'] = [a]*len(df)
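If you prefer to skip the list column altogether, the centroid can also be computed directly from the numeric columns (a small sketch, not from the original answer, assuming the column names above):
centroid = df[['x_1', 'x_2', 'x_3']].mean().to_numpy()  # array([0.5, 0.5, 0.5])
df['centroid'] = [centroid] * len(df)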

How do I create a majority voting system based on two arrays?

Scenario:
I want to create a majority vote system that takes into account the weight of each observer's vote about N observations.
So, M observers will give their guess about N observations, selecting from 3 classes (1, 2, 3). For each observation, each observer has an associated weight.
Defining:
G: Matrix of guesses per observation / observer (N observations × M observers);
W: Weights for each observation / observer (N observations × M observers)
Example:
# 2 observations, 3 observers
G = [[1, 2, 3],
[2, 2, 1]]
# Weights (influence) each observer has about each observation
W = [[0.1, 0.2, 0.3],
[0.3, 0.1, 0.2]]
I need to compute another matrix with shape (N observations × C classes) that stores the probability that an observation comes from a specific class.
Example using values above:
G = [[1, 2, 3],
[2, 2, 1]]
W = [[0.1, 0.2, 0.3],
[0.3, 0.1, 0.2]]
P = [[0.1, 0.2, 0.3],
[0.2, (0.3 + 0.1), 0]]
After computing the P matrix, I could apply np.argmax() row-wise to get the column (class) with the highest value:
P = [[0.1, 0.2, 0.3], #class 3 has highest value (0.3)
[0.2, 0.4, 0]] #class 2 has highest value (0.4)
result = [3, 2]
I would like to know how I can combine G and W to generate the P matrix.
You can get the job done in a vectorized manner by using NumPy's indices and advanced indexing:
In [569]: import numpy as np
In [570]: G = np.array([[1, 2, 3], [2, 2, 1]] )
In [571]: W = np.array([[0.1, 0.2, 0.3], [0.3, 0.1, 0.2]])
In [572]: C = 3
In [573]: M, N = G.shape
In [574]: row, col = np.indices((M, N))
In [575]: P3d = np.zeros(shape=(M, N, C))
In [576]: P3d[row, col, G-1] = W
In [577]: P = P3d.sum(axis=1)
In [578]: P
Out[578]:
array([[0.1, 0.2, 0.3],
[0.2, 0.4, 0. ]])
Initialize P with zeros, then iterate over the observations (rows of G) and over the observers within each row: the value G[observation][observer] is the chosen class, so add W[observation][observer] to P[observation][class]. In your sample test case, for the first row, observer 0 voted class 1 and W[0][0] is 0.1, so 0.1 is added to P[0][0]; similarly for observers 1 and 2, whose chosen classes match their positions, so their weights land in the corresponding columns of P.
For the second row, observer 0 voted class 2, so 0.3 is added to P[1][1]; observer 1 also voted class 2, so another 0.1 is added to P[1][1]; the last observer voted class 1, so 0.2 is added to P[1][0]. Now the matrix is ready, so use np.argmax() row-wise for the required answer.
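A minimal sketch of that loop (not from the original answer), using the example G and W from the question; classes are 1-based, so G - 1 gives the column index:
import numpy as np

G = np.array([[1, 2, 3], [2, 2, 1]])
W = np.array([[0.1, 0.2, 0.3], [0.3, 0.1, 0.2]])
C = 3  # number of classes

P = np.zeros((G.shape[0], C))
for obs in range(G.shape[0]):                # each observation (row)
    for m in range(G.shape[1]):              # each observer (column)
        P[obs, G[obs, m] - 1] += W[obs, m]   # add this observer's weight to the class they voted for

result = P.argmax(axis=1) + 1
print(P)       # [[0.1 0.2 0.3]
               #  [0.2 0.4 0. ]]
print(result)  # [3 2]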

Search a list of values in a csv file in python

I have a list of lists as follows.
[[0, 0.1, 0.3], [0.5, 0.2, 0.8]]
I also have a csv file as follows.
No, heading1, heading2, heading3
0, 0, 0.7, 0.3
1, 0, 0.1, 0.3
Now I want to search for these lists of values using only the values in 'heading1', 'heading2' and 'heading3', and if a match is found, return the corresponding 'No'.
Can we do this using pandas?
You can use merge on all columns (here df is the DataFrame read from the CSV):
L = [[0, 0.1, 0.3], [0.5, 0.2, 0.8]]
# helper df with the same columns, without the first one
df1 = pd.DataFrame(L, columns=df.columns[1:])
print (df1)
heading1 heading2 heading3
0 0.0 0.1 0.3
1 0.5 0.2 0.8
# after the merge, select the 'No' column
d = pd.merge(df, df1)['No']
print (d)
0 1
Name: No, dtype: int64
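One caveat not covered in the answer: merge matches float columns by exact equality, so if the values read from the CSV might differ by floating-point rounding, it can be safer to round both frames to a common precision first, for example:
d = pd.merge(df.round(6), df1.round(6))['No']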

Efficient way for calculating selected differences in array

I have two arrays as output from a simulation script, where one contains IDs and the other times, i.e. something like:
ids = np.array([2, 0, 1, 0, 1, 1, 2])
times = np.array([.1, .3, .3, .5, .6, 1.2, 1.3])
These arrays are always of the same size. Now I need to calculate the differences of the times, but only for times with the same id. Of course, I can simply loop over the different ids and do
for id in np.unique(ids):
    diffs = np.diff(times[ids == id])
    print(diffs)
    # do stuff with diffs
However, this is quite inefficient and the two arrays can be very large. Does anyone have a good idea on how to do that more efficiently?
You can use array.argsort() and ignore the values corresponding to changes in ids:
>>> id_ind = ids.argsort(kind='mergesort')
>>> times_diffs = np.diff(times[id_ind])
>>> times_diffs
array([ 0.2, -0.2, 0.3, 0.6, -1.1, 1.2])
To see which values you need to discard, you could use a Counter to count the number of times per id (from collections import Counter)
or just sort ids and see where its diff is nonzero: these are the indices where the id changes, and where your time diffs are irrelevant:
times_diffs[np.diff(ids[id_ind]) == 0] # ids[id_ind] being the sorted indices sequence
and finally you can split this array with np.split and np.where:
np.split(times_diffs, np.where(np.diff(ids[id_ind]) != 0)[0])
As you mentioned in your comment, argsort()'s default algorithm (quicksort) might not preserve the original order among equal ids, so the argsort(kind='mergesort') option must be used.
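Putting those pieces together on the example arrays (a sketch, not part of the original answer, that drops the boundary diffs and then splits per id):
id_ind = ids.argsort(kind='mergesort')
sorted_ids = ids[id_ind]
all_diffs = np.diff(times[id_ind])
keep = np.diff(sorted_ids) == 0                 # True where both times belong to the same id
_, counts = np.unique(sorted_ids, return_counts=True)
groups = np.split(all_diffs[keep], np.cumsum(counts - 1)[:-1])
# [array([0.2]), array([0.3, 0.6]), array([1.2])]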
Say you np.argsort by ids:
inds = np.argsort(ids, kind='mergesort')
>>> inds
array([1, 3, 2, 4, 5, 0, 6])
Now sort times by this, take np.diff, and prepend a NaN:
diffs = np.concatenate(([np.nan], np.diff(times[inds])))
>>> diffs
array([ nan, 0.2, -0.2, 0.3, 0.6, -1.1, 1.2])
These differences are correct except at the boundaries. Let's calculate those:
boundaries = np.concatenate(([False], ids[inds][1: ] == ids[inds][: -1]))
>>> boundaries
array([False, True, False, True, True, False, True], dtype=bool)
Now we can just do
diffs[~boundaries] = np.nan
Let's see what we got:
>>> ids[inds]
array([0, 0, 1, 1, 1, 2, 2])
>>> times[inds]
array([ 0.3, 0.5, 0.3, 0.6, 1.2, 0.1, 1.3])
>>> diffs
array([ nan, 0.2, nan, 0.3, 0.6, nan, 1.2])
I'm adding another answer, since, even though these things are possible in numpy, I think that the higher-level pandas is much more natural for them.
In pandas, you could do this in one step, after creating a DataFrame:
df = pd.DataFrame({'ids': ids, 'times': times})
df['diffs'] = df.groupby(df.ids).transform(pd.Series.diff)
This gives:
>>> df
ids times diffs
0 2 0.1 NaN
1 0 0.3 NaN
2 1 0.3 NaN
3 0 0.5 0.2
4 1 0.6 0.3
5 1 1.2 0.6
6 2 1.3 1.2
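In recent pandas versions the same result can be written a bit more directly by diffing just the times column per group (a small variation, not in the original answer):
df['diffs'] = df.groupby('ids')['times'].diff()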
The numpy_indexed package (disclaimer: I am its author) contains efficient and flexible functionality for these kinds of grouping operations:
import numpy_indexed as npi
unique_ids, diffed_time_groups = npi.group_by(keys=ids, values=times, reduction=np.diff)
Unlike pandas, it does not require a specialized data structure just to perform this kind of rather elementary operation.
