Insert value in numpy array with conditions - python

I want to insert values into a NumPy array as follows:
If the Nth row is the same as the (N-1)th row, insert 1 for both the Nth and (N-1)th rows in the current column, and 0 elsewhere.
If the Nth row is different from the (N-1)th row, move to the next column and repeat the condition.
Here is an example:
import numpy as np
import pandas as pd

d = {'col1': [2, 2, 3, 3, 3, 4, 4, 5, 5, 5],
     'col2': [3, 3, 4, 4, 4, 1, 1, 0, 0, 0]}
df = pd.DataFrame(data=d)
np.zeros((10, 4))  # the desired output shape
###########################################################
OUTPUT MATRIX
1 0 0 0   <- first two rows are the same, so 1, 1 in the first column
1 0 0 0
0 1 0 0   <- next three rows are the same, so 1, 1, 1 in the second column
0 1 0 0
0 1 0 0
0 0 1 0   <- again two rows are the same, so 1, 1 in the third column
0 0 1 0
0 0 0 1   <- again three rows are the same, so 1, 1, 1 in the fourth column
0 0 0 1
0 0 0 1

IIUC, you can achieve this simply with numpy indexing:
# group by runs of successive identical rows
# (a new group starts whenever any column differs from the previous row)
group = df.ne(df.shift()).any(axis=1).cumsum().sub(1)
# craft the numpy array
a = np.zeros((len(group), group.max() + 1), dtype=int)
a[np.arange(len(df)), group] = 1
print(a)
Alternative with numpy.identity:
# group by runs of successive identical rows
group = df.ne(df.shift()).any(axis=1).cumsum().sub(1)
shape = df.groupby(group).size()
# craft the numpy array by repeating each identity row by its group size
a = np.repeat(np.identity(len(shape), dtype=int), shape, axis=0)
print(a)
output:
array([[1, 0, 0, 0],
       [1, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 1, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 1],
       [0, 0, 0, 1],
       [0, 0, 0, 1]])
intermediates:
group
0 0
1 0
2 1
3 1
4 1
5 2
6 2
7 3
8 3
9 3
dtype: int64
shape
0 2
1 3
2 2
3 3
dtype: int64
Other option, just for fun (likely not as efficient on large inputs):
a = pd.get_dummies(df.agg(tuple, axis=1)).to_numpy()
Note that this second option uses groups of identical values, not successive identical values. For identical values with the first (numpy indexing) approach, you would need to use group = df.groupby(list(df)).ngroup() instead (this wouldn't work with repeating the identity).
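For reference, a minimal sketch of that identical-values variant (not from the original answer; with this example the output happens to be the same, since the rows are already sorted):
# group by identical rows anywhere in the frame, not just consecutive runs
group = df.groupby(list(df)).ngroup()
a = np.zeros((len(df), group.max() + 1), dtype=int)
a[np.arange(len(df)), group] = 1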

Related

Calculate the average of sections of a column with condition met to create new dataframe

I have the data table below:
A = [2, 3, 1, 2, 4, 1, 5, 3, 1, 7, 5]
B = [0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0]
df = pd.DataFrame({'A':A, 'B':B})
I'd like to calculate the average of column A over each run of consecutive rows where column B equals 1. All rows where column B equals 0 are ignored, and the result should go into a new dataframe like below:
Thanks for your help!
Keywords: groupby, shift, mean
Code:
df_result = df.groupby((df['B'].shift(1, fill_value=0) != df['B']).cumsum()).mean()
df_result = df_result[df_result['B'] != 0]
df_result
A B
1 2.0 1.0
3 3.0 1.0
As you might have noticed, you first need to determine the blocks of consecutive rows that share the same B value.
One way to do so is by shifting B one row and then comparing it with itself:
df['B_shifted'] = df['B'].shift(1, fill_value=0)  # fill_value=0 keeps ints and replaces NaN with 0
df['A']                      = [2, 3, 1, 2, 4, 1, 5, 3, 1, 7, 5]
df['B']                      = [0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0]
df['B_shifted']              = [0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0]
(df['B_shifted'] != df['B']) = [F, T, F, F, T, F, T, F, F, T, F]  # each True marks the start of a new block
Now we can use the pandas groupby method as follows:
df_grouped = df.groupby((df['B_shifted'] != df['B']).cumsum())
If we loop over the DataFrameGroupBy object df_grouped, we'll see the following tuples:
(0,    A  B  B_shifted
 0     2  0          0)
(1,    A  B  B_shifted
 1     3  1          0
 2     1  1          1
 3     2  1          1)
(2,    A  B  B_shifted
 4     4  0          1
 5     1  0          0)
(3,    A  B  B_shifted
 6     5  1          0
 7     3  1          1
 8     1  1          1)
(4,    A  B  B_shifted
 9     7  0          1
 10    5  0          0)
We can now simply calculate the mean and filter out the zero groups as follows:
df_result = df_grouped.mean()
df_result = df_result[df_result['B'] != 0][['A', 'B']]
References: (link, link).
Try:
m = (df.B != df.B.shift(1)).cumsum() * df.B
df_out = df.groupby(m[m > 0])["A"].mean().reset_index(drop=True).to_frame()
df_out["B"] = 1
print(df_out)
Prints:
     A  B
0  2.0  1
1  3.0  1
df1 = df.groupby((df['B'].shift() != df['B']).cumsum()).mean().reset_index(drop=True)
df1 = df1[df1['B'] == 1].astype(int).reset_index(drop=True)
df1
Output
A B
0 2 1
1 3 1
Explanation
We check whether each row's value of B differs from the previous row's value using shift; the cumulative sum of those changes labels each run of consecutive equal values. We then group on those labels, calculate each group's mean, and assign the result to the new dataframe df1.
Since this yields the mean of every run of consecutive 0s and 1s, we then keep only the rows where B == 1.
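A compact variant of the same idea (a sketch, not from the original answers), filtering to the B == 1 rows before grouping:
m = df['B'].ne(df['B'].shift()).cumsum()   # label each run of consecutive equal B values
keep = df['B'].eq(1)                       # drop the B == 0 runs
df.loc[keep, 'A'].groupby(m[keep]).mean().reset_index(drop=True)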

Counting matrices with specific condition

I have a list of 3x3 0/1 matrices and I want to count how many of them have 0 on the diagonal and are pseudo-antisymmetric, i.e., A[i][j] != A[j][i] (if A[i][j] = 1 then A[j][i] should be 0).
How can I implement this? I was trying a similar approach as for counting symmetric matrices (here: Counting symmetric matrices), but it doesn't work here.
Assuming a is a 3x3 numpy array,
(a + np.eye(len(a)) == 1 - a.T).all()
Explanation:
a.T transposes the matrix.
1 - a.T inverts the transposed matrix (swaps 0s and 1s).
In the inverted matrix the diagonal elements are expected to be 1, while a has 0s there; fix this by adding a diagonal matrix: a + np.eye(len(a)).
The test passes only if all elements satisfy the condition: (... == ...).all()
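Applied to the counting task, a sketch (matrices is an assumed name for your list of 3x3 arrays):
import numpy as np

def is_pseudo_antisymmetric(a):
    # zero diagonal, and exactly one of a[i][j], a[j][i] set for each i != j
    return bool((a + np.eye(len(a)) == 1 - a.T).all())

count = sum(is_pseudo_antisymmetric(np.asarray(m)) for m in matrices)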
You could use your existing is_symmetric function and create a new has_zero_main_diagonal one:
def pretty_print(matrix):
    for row in matrix:
        print(*row)

def has_zero_main_diagonal(matrix, n):
    return all(matrix[i][i] == 0 for i in range(n))

def is_symmetric(matrix, n):
    return all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(n))
Informal Testing:
symmetric_zero_diag = [[0, 1, 1],
                       [1, 0, 1],
                       [1, 1, 0]]
symmetric_non_zero_diag = [[0, 1, 1],
                           [1, 0, 1],
                           [1, 1, 1]]
asymmetric_zero_diag = [[0, 1, 1],
                        [1, 0, 1],
                        [0, 1, 0]]
asymmetric_non_zero_diag = [[1, 1, 1],
                            [1, 1, 1],
                            [0, 1, 1]]
testcases = {
    'symmetric_zero_diag': symmetric_zero_diag,
    'symmetric_non_zero_diag': symmetric_non_zero_diag,
    'asymmetric_zero_diag': asymmetric_zero_diag,
    'asymmetric_non_zero_diag': asymmetric_non_zero_diag
}
num_pseudo_antisymmetric_zero_diag = 0
for testcase_name, matrix in testcases.items():
    print(f'{testcase_name}:')
    pretty_print(matrix)
    print(f'has_zero_main_diagonal={has_zero_main_diagonal(matrix, 3)}')
    print(f'is_symmetric={is_symmetric(matrix, 3)}')
    is_pseudo_antisymmetric_zero_diag = (has_zero_main_diagonal(matrix, 3)
                                         and not is_symmetric(matrix, 3))
    if is_pseudo_antisymmetric_zero_diag:
        num_pseudo_antisymmetric_zero_diag += 1
    print((f'has_zero_main_diagonal and not is_symmetric='
           f'{is_pseudo_antisymmetric_zero_diag}'))
    print()
print(f'Number of matrices that are pseudo-antisymmetric with a zero_diag: {num_pseudo_antisymmetric_zero_diag}')
Output:
symmetric_zero_diag:
0 1 1
1 0 1
1 1 0
has_zero_main_diagonal=True
is_symmetric=True
has_zero_main_diagonal and not is_symmetric=False
symmetric_non_zero_diag:
0 1 1
1 0 1
1 1 1
has_zero_main_diagonal=False
is_symmetric=True
has_zero_main_diagonal and not is_symmetric=False
asymmetric_zero_diag:
0 1 1
1 0 1
0 1 0
has_zero_main_diagonal=True
is_symmetric=False
has_zero_main_diagonal and not is_symmetric=True
asymmetric_non_zero_diag:
1 1 1
1 1 1
0 1 1
has_zero_main_diagonal=False
is_symmetric=False
has_zero_main_diagonal and not is_symmetric=False
Number of matrices that are pseudo-antisymmetric with a zero_diag: 1
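As an aside (not part of the original answers), you can sanity-check the first answer's vectorized test by enumerating all 512 binary 3x3 matrices; each of the three off-diagonal pairs must be (0, 1) or (1, 0), so the expected count is 2**3 = 8:
import itertools
import numpy as np

count = sum(
    bool((a + np.eye(3) == 1 - a.T).all())
    for bits in itertools.product((0, 1), repeat=9)
    for a in [np.array(bits).reshape(3, 3)]
)
print(count)  # 8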

Numpy: how to convert observations to probabilities?

I have a feature matrix and a corresponding targets, which are ones or zeroes:
# raw observations
features = np.array([[1, 1, 0],
                     [1, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 0, 1]])
targets = np.array([1, 0, 1, 1, 0, 0])
As you can see, each feature row may correspond to both ones and zeros. I need to convert my raw observation matrix to a probability matrix, where each unique feature row corresponds to the probability of seeing 1 as the target:
[1 1 0] -> 0.5
[0 1 0] -> 0.67
[0 0 1] -> 0
I have constructed a quite straightforward solution:
import numpy as np
from collections import Counter

# raw observations
features = np.array([[1, 1, 0],
                     [1, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 0, 1]])
targets = np.array([1, 0, 1, 1, 0, 0])

def convert_obs_to_proba(features, targets):
    features_ = []
    targets_ = []
    # compute unique rows (idx will point to some representative)
    b = np.ascontiguousarray(features).view(
        np.dtype((np.void, features.dtype.itemsize * features.shape[1])))
    _, idx = np.unique(b, return_index=True)
    idx = idx[::-1]
    zeros = Counter()
    ones = Counter()
    # collect row-wise number of one and zero targets
    for i, row in enumerate(features):
        if targets[i] == 0:
            zeros[tuple(row)] += 1
        else:
            ones[tuple(row)] += 1
    # iterate over unique features and compute probabilities
    for k in idx:
        unique_row = features[k]
        zero_count = zeros[tuple(unique_row)]
        one_count = ones[tuple(unique_row)]
        proba = float(one_count) / float(zero_count + one_count)
        features_.append(unique_row)
        targets_.append(proba)
    return np.array(features_), np.array(targets_)

features_, targets_ = convert_obs_to_proba(features, targets)
print(features_)
print(targets_)
which:
extracts the unique features;
counts the number of zero and one targets for each unique feature;
computes the probability and constructs the result.
Could it be solved in a prettier way using some advanced numpy magic?
Update. The previous code was pretty inefficient, O(n^2); I converted it to the more performance-friendly version above. Old code:
import numpy as np

# raw observations
features = np.array([[1, 1, 0],
                     [1, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 0, 1]])
targets = np.array([1, 0, 1, 1, 0, 0])

def convert_obs_to_proba(features, targets):
    features_ = []
    targets_ = []
    # compute unique rows (idx will point to some representative)
    b = np.ascontiguousarray(features).view(
        np.dtype((np.void, features.dtype.itemsize * features.shape[1])))
    _, idx = np.unique(b, return_index=True)
    idx = idx[::-1]
    # calculate ZERO class occurrences and ONE class occurrences
    for k in idx:
        unique_row = features[k]
        zeros = 0
        ones = 0
        for i, row in enumerate(features):
            if np.array_equal(row, unique_row):
                if targets[i] == 0:
                    zeros += 1
                else:
                    ones += 1
        proba = float(ones) / float(zeros + ones)
        features_.append(unique_row)
        targets_.append(proba)
    return np.array(features_), np.array(targets_)

features_, targets_ = convert_obs_to_proba(features, targets)
print(features_)
print(targets_)
It's easy using Pandas:
import pandas as pd

df = pd.DataFrame(features)
df['targets'] = targets
Now you have:
   0  1  2  targets
0  1  1  0        1
1  1  1  0        0
2  0  1  0        1
3  0  1  0        1
4  0  1  0        0
5  0  0  1        0
Now, the fancy part:
df.groupby([0,1,2]).targets.mean()
Gives you:
0  1  2
0  0  1    0.000000
   1  0    0.666667
1  1  0    0.500000
Name: targets, dtype: float64
Pandas doesn't print the 0 at the leftmost part of the 0.666 row, but if you inspect the value there, it is indeed 0.
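If you'd rather stay in pure numpy, a sketch (not from the original answers) using np.unique with axis=0 (available since numpy 1.13) avoids the void-view trick entirely:
uniq, inv = np.unique(features, axis=0, return_inverse=True)
probs = np.bincount(inv, weights=targets) / np.bincount(inv)
print(uniq)   # the unique feature rows, in lexicographic order
print(probs)  # probability of target 1 for each unique row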
(np.sum(np.reshape([targets[f] if tuple(features[f]) == tuple(i) else 0
                    for i in np.vstack(set(map(tuple, features)))
                    for f in range(features.shape[0])],
                   features.shape[::-1]), axis=1)
 / np.sum(np.reshape([1 if tuple(features[f]) == tuple(i) else 0
                      for i in np.vstack(set(map(tuple, features)))
                      for f in range(features.shape[0])],
                     features.shape[::-1]), axis=1))
Here you go, numpy magic! Although unnecessarily so; this could probably be cleaned up using some boring variables ;)
(And this is probably far from optimal.)

Pandas DataFrame column concatenation

I have a pandas DataFrame y with 1 million rows and 5 columns.
np.shape(y)
(1037889, 5)
The column values are all 0 or 1. Looks something like this:
y.head()
a, b, c, d, e
0, 0, 1, 0, 0
1, 0, 0, 1, 1
0, 1, 1, 1, 1
0, 0, 0, 0, 0
I want a DataFrame with 1 million rows and 1 column.
np.shape(y)
(1037889, )
where the column is just the 5 columns concatenated together.
New column
0, 0, 1, 0, 0
1, 0, 0, 1, 1
0, 1, 1, 1, 1
0, 0, 0, 0, 0
I keep trying different things like merge, concat, dstack, etc., but can't seem to figure this out.
If you want the new column to have all the data concatenated into a string, it's a good case for the apply() function:
>>> df = pd.DataFrame({'a': [0, 1, 0, 0], 'b': [0, 0, 1, 0], 'c': [0, 1, 1, 0], 'd': [0, 1, 1, 0]})
>>> df
   a  b  c  d
0  0  0  0  0
1  1  0  1  1
2  0  1  1  1
3  0  0  0  0
>>> df2 = df.apply(lambda row: ','.join(map(str, row)), axis=1)
>>> df2
0    0,0,0,0
1    1,0,1,1
2    0,1,1,1
3    0,0,0,0
dtype: object
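With a million rows, a vectorized cast to string before joining is usually faster than calling map(str, row) per row; an equivalent sketch:
>>> df.astype(str).agg(','.join, axis=1)
0    0,0,0,0
1    1,0,1,1
2    0,1,1,1
3    0,0,0,0
dtype: object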

numpy/pandas: How to convert a series of strings of zeros and ones into a matrix

I have a data that arrives in this format:
[
(1, "000010101001010101011101010101110101", "aaa", ... ),
(0, "111101010100101010101110101010111010", "bb", ... ),
(0, "100010110100010101001010101011101010", "ccc", ... ),
(1, "000010101001010101011101010101110101", "ddd", ... ),
(1, "110100010101001010101011101010111101", "eeee", ... ),
...
]
In tuple format, it looks like this:
(Y, X, other_info, ... )
At the end of the day, I need to train a classifier (e.g. sklearn.linear_model.logistic.LogisticRegression) using Y and X.
What's the most straightforward way to turn the string of ones and zeros into something like a np.array, so that I can run it through the classifier? Seems like there should be an easy answer here, but I haven't been able to think of/google one.
A few notes:
I'm already using numpy/pandas/sklearn, so anything in those libraries is fair game.
For a lot of what I'm doing, it's convenient to have the other_info columns together in a DataFrame
The strings are pretty long (~20,000 columns), but the total data frame is not very tall (~500 rows).
Since you asked primarily for a way to convert a string of ones and zeros into a numpy array, I'll offer my solution as follows:
d = '0101010000' * 2000 # create a 20,000 long string of 1s and 0s
d_array = np.fromstring(d, 'int8') - 48 # 48 is ascii 0. ascii 1 is 49
This compares favourably to @DSM's solution in terms of speed:
In [21]: timeit numpy.fromstring(d, dtype='int8') - 48
10000 loops, best of 3: 35.8 us per loop
In [22]: timeit numpy.fromiter(d, dtype='int', count=20000)
100 loops, best of 3: 8.57 ms per loop
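Note that np.fromstring on text is deprecated in modern NumPy; np.frombuffer over the encoded bytes is the drop-in replacement (subtracting 48 copies the read-only buffer into a fresh array):
d_array = np.frombuffer(d.encode('ascii'), dtype=np.int8) - 48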
How about something like this:
Make the dataframe:
In [82]: v = [
....: (1, "000010101001010101011101010101110101", "aaa"),
....: (0, "111101010100101010101110101010111010", "bb"),
....: (0, "100010110100010101001010101011101010", "ccc"),
....: (1, "000010101001010101011101010101110101", "ddd"),
....: (1, "110100010101001010101011101010111101", "eeee"),
....: ]
In [83]: df = pandas.DataFrame(v)
We can use fromiter or array to get an ndarray:
In [84]: d ="000010101001010101011101010101110101"
In [85]: np.fromiter(d, int) # better: np.fromiter(d, int, count=len(d))
Out[85]:
array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0,
1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1])
In [86]: np.array(list(d), int)
Out[86]:
array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0,
1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1])
There might be a slick vectorized way to do this, but I'd just apply the obvious per-entry function to the values and get on with my day:
In [87]: df[1]
Out[87]:
0 000010101001010101011101010101110101
1 111101010100101010101110101010111010
2 100010110100010101001010101011101010
3 000010101001010101011101010101110101
4 110100010101001010101011101010111101
Name: 1
In [88]: df[1] = df[1].apply(lambda x: np.fromiter(x, int)) # better with count=len(x)
In [89]: df
Out[89]:
0 1 2
0 1 [0 0 0 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 1 1 0 1 aaa
1 0 [1 1 1 1 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 1 1 0 bb
2 0 [1 0 0 0 1 0 1 1 0 1 0 0 0 1 0 1 0 1 0 0 1 0 1 0 ccc
3 1 [0 0 0 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 1 1 0 1 ddd
4 1 [1 1 0 1 0 0 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 1 eeee
In [90]: df[1][0]
Out[90]:
array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0,
1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1])
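To feed the classifier, you'll want a single 2-D array rather than a column of per-row arrays; a short sketch continuing from the frame above:
X = np.vstack(df[1])    # stack the per-row arrays into an (n_rows, n_bits) matrix
y = df[0].to_numpy()
# X and y can now go straight into e.g. LogisticRegression().fit(X, y)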
