numpy for no repeating for two columns - python

Basically, I want AA and BB to mark alternating signals rather than repeating: once AA has produced a 1, the next 1 should come from BB before AA can fire again, and vice versa.
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'A': np.array([1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0], dtype='int32'),
    'B': np.array([0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1], dtype='int32')
})
df2['AA'] = np.where(df2['A'] > df2['A'].shift(1), 1, 0)
df2['BB'] = np.where(df2['B'] > df2['B'].shift(1), 1, 0)
With this I get repeated 1s in BB (rows 7 and 12) with no AA signal in between. How can I suppress such repeats, so that after BB fires, the next 1 must come from AA rather than BB again?
df2
Out[24]:
A B AA BB
0 1 0 0 0
1 1 0 0 0
2 0 0 0 0
3 1 0 1 0
4 0 1 0 1
5 1 0 1 0
6 1 0 0 0
7 0 1 0 1
8 0 1 0 0
9 0 0 0 0
10 0 0 0 0
11 0 0 0 0
12 0 1 0 1
The result should be as follows: once AA has a 1 in a previous row, AA must not fire again until BB has produced a 1, and likewise BB must not repeat until AA has fired. Row 12's BB therefore becomes 0, because BB already fired at row 7 with no AA in between.
Out[24]:
A B AA BB
0 1 0 0 0
1 1 0 0 0
2 0 0 0 0
3 1 0 1 0
4 0 1 0 1
5 1 0 1 0
6 1 0 0 0
7 0 1 0 1
8 0 1 0 0
9 0 0 0 0
10 0 0 0 0
11 0 0 0 0
12 0 1 0 0
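No answer was posted here, but a minimal sketch of one way to enforce the alternation is a single pass that remembers which side fired last, continuing from the df2 built in the question. I'm assuming the sequence must start with AA (per the desired output) and that AA wins if both columns are 1 on the same row; last, aa and bb are helper names introduced for the example:

aa = df2['AA'].to_numpy().copy()
bb = df2['BB'].to_numpy().copy()
last = 'BB'  # pretend BB fired before the start, so the first signal kept is an AA
for i in range(len(df2)):
    if aa[i]:
        if last == 'AA':
            aa[i] = 0  # AA already fired since the last BB: drop the repeat
        else:
            last = 'AA'
    if bb[i]:
        if last == 'BB':
            bb[i] = 0  # BB already fired since the last AA: drop the repeat
        else:
            last = 'BB'
df2['AA'], df2['BB'] = aa, bb

On the sample data this keeps AA at rows 3 and 5, BB at rows 4 and 7, and drops the repeated BB at row 12, matching the desired output.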

Related

Join DataFrames on Condition Pandas

I have the following two dataframes with binary values that I want to merge.
df1
Action Adventure Animation Biography
0 0 1 0 0
1 0 0 0 0
2 1 0 0 0
3 0 0 0 0
4 1 0 0 0
df2
Action Adventure Biography Comedy
0 0 0 0 0
1 0 0 1 0
2 0 0 0 0
3 0 0 0 1
4 1 0 0 0
I want to join these two dataframes so that the result has the union of the columns, and each cell is 1 if it is 1 in either dataframe and 0 otherwise.
Result
Action Adventure Animation Biography Comedy
0 0 1 0 0 0
1 0 0 0 1 0
2 1 0 0 0 0
3 0 0 0 0 1
4 1 0 0 0 0
I am stuck on this, so I do not have a proposed solution.
Add the two dataframes, then clip the values at 1: fill_value=0 covers the columns that exist in only one frame (the alignment promotes the result to float, hence the astype(int)):
df1.add(df2, fill_value=0).clip(upper=1).astype(int)
Action Adventure Animation Biography Comedy
0 0 1 0 0 0
1 0 0 0 1 0
2 1 0 0 0 0
3 0 0 0 0 1
4 1 0 0 0 0
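For what it's worth, a sketch of a variant that avoids the float round trip by aligning the column sets explicitly and taking the elementwise maximum (cols and result are helper names introduced here):

import numpy as np

# Union of the two column sets; reindex fills the missing columns with 0 and keeps int dtype.
cols = df1.columns.union(df2.columns)
result = np.maximum(df1.reindex(columns=cols, fill_value=0),
                    df2.reindex(columns=cols, fill_value=0))
print(result)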
Thinking of it as a set problem may give you a solution. Have a look at the code:
print((df1 | df2).fillna(0).astype(int) | df2)
COMPLETE CODE:
import pandas as pd
df1 = pd.DataFrame({
    'Action': [0, 0, 1, 0, 1],
    'Adventure': [1, 0, 0, 0, 0],
    'Animation': [0, 0, 0, 0, 0],
    'Biography': [0, 0, 0, 0, 0]
})
df2 = pd.DataFrame({
    'Action': [0, 0, 1, 0, 1],
    'Adventure': [1, 0, 0, 0, 0],
    'Animation': [0, 0, 0, 0, 0],
    'Biography': [0, 1, 0, 0, 0],
    'Comedy': [0, 0, 0, 1, 0]
})
print((df1 | df2).fillna(0).astype(int) | df2)
OUTPUT:
Action Adventure Animation Biography Comedy
0 0 1 0 0 0
1 0 0 0 1 0
2 1 0 0 0 0
3 0 0 0 0 1
4 1 0 0 0 0

Create a new column based on other column values - conditional forward fill?

Have the following dataframe
d = {'c_1': [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0],
     'c_2': [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1]}
df = pd.DataFrame(d)
I want to create another column 'f' that turns 1 when c_1 == 1 and stays 1 until c_2 == 1, at which point 'f' returns to 0.
desired output as follows
c_1 c_2 f
0 0 0 0
1 0 0 0
2 0 0 0
3 1 0 1
4 0 0 1
5 0 1 0
6 0 0 0
7 1 0 1
8 0 0 1
9 0 0 1
10 0 0 1
11 0 1 0
I think this requires some kind of conditional forward fill; I've looked at previous questions but haven't been able to arrive at the desired output.
Edit: I have come across a related scenario where the inputs differ and the current solutions do not work. I will confirm the question as answered, but would appreciate any input on the below.
d = {'c_1': [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0],
     'c_2': [1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]}
df = pd.DataFrame(d)
Desired output as follows. Same as before: 'f' turns 1 when c_1 == 1 and stays 1 until c_2 == 1, at which point it returns to 0.
c_1 c_2 f
0 0 1 0
1 0 1 0
2 0 1 0
3 0 0 0
4 0 0 0
5 0 0 0
6 1 0 1
7 0 0 1
8 0 1 0
9 0 1 0
10 0 1 0
11 0 1 0
12 0 1 0
13 0 0 0
14 0 0 0
15 0 0 0
16 1 0 1
17 0 0 1
18 1 0 1
19 1 0 1
20 0 0 1
21 0 0 1
22 0 0 1
23 0 0 1
24 0 1 0
You can try:
df['f'] = df[['c_1','c_2']].sum(1).cumsum().mod(2)
print(df)
c_1 c_2 f
0 0 0 0
1 0 0 0
2 0 0 0
3 1 0 1
4 0 0 1
5 0 1 0
6 0 0 0
7 1 0 1
8 0 0 1
9 0 0 1
10 0 0 1
11 0 1 0
You can also try this:
df.loc[df['c_2'].shift().ne(1), 'f'] = df['c_1'].replace(to_replace=0, method='ffill')
c_1 c_2 f
0 0 0 0.0
1 0 0 0.0
2 0 0 0.0
3 1 0 1.0
4 0 0 1.0
5 0 1 1.0 # <--- these values should be zero
6 0 0 NaN
7 1 0 1.0
8 0 0 1.0
9 0 0 1.0
10 0 0 1.0
11 0 1 1.0 # <---
Add one more condition if you don't want to include the end position, and fill the remaining NaNs with 0.
Final:
df.loc[df['c_2'].shift().ne(1) & df['c_2'].ne(1), 'f'] = df['c_1'].replace(to_replace=0, method='ffill')
df = df.fillna(0)
c_1 c_2 f
0 0 0 0.0
1 0 0 0.0
2 0 0 0.0
3 1 0 1.0
4 0 0 1.0
5 0 1 0.0
6 0 0 0.0
7 1 0 1.0
8 0 0 1.0
9 0 0 1.0
10 0 0 1.0
11 0 1 0.0
This should work for both scenarios:
df['c_1'].groupby(df[['c_1','c_2']].sum(1).cumsum()).transform('first')
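Why this works: every 1 in c_1 or c_2 bumps the cumulative sum, so each event opens a new group, and broadcasting the first c_1 of each group forward keeps f at 1 from a c_1 event until the next event closes the group. A quick check against the edited second scenario (a sketch reusing the d from the edit):

import pandas as pd

d = {'c_1': [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0],
     'c_2': [1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]}
df = pd.DataFrame(d)

# Each c_1/c_2 event starts a new cumsum group; a group's first c_1 value is 1
# exactly when the group was opened by a c_1 event.
df['f'] = df['c_1'].groupby(df[['c_1', 'c_2']].sum(axis=1).cumsum()).transform('first')
print(df)  # matches the desired output listed in the edit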

Efficient way to get a subset of indices in numpy

I have the following indices as you would get them from np.where(...):
coords = (
    np.asarray([0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6]),
    np.asarray([2, 2, 8, 2, 2, 4, 4, 6, 2, 2, 6, 2, 2, 4, 6, 2, 2, 6, 2, 2, 4, 4, 6, 2, 2, 6]),
    np.asarray([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
    np.asarray([0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
)
A second tuple, index, is meant to select the entries of coords that also appear in it:
index = (
    np.asarray([0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6]),
    np.asarray([2, 8, 2, 4, 4, 6, 2, 2, 6, 2, 2, 4, 6, 2, 2, 6, 2, 2, 4, 4, 6, 2, 2, 6]),
    np.asarray([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
    np.asarray([0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
)
So, for instance, the first coordinate tuple in coords is selected because it also appears in index (at position 0), but the second one isn't, because it does not appear in index.
I can calculate the mask easily with [x in zip(*index) for x in zip(*coords)] (converted from bool to int for better readability):
[1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
but this wouldn't be very efficient for larger arrays. Is there a more "numpy-based" way that could calculate the mask?
Not so sure about efficiency, but since you're basically comparing coordinate pairs you could use scipy's distance functions. Something along the lines of:
from scipy.spatial.distance import cdist
c = np.stack(coords).T
i = np.stack(index).T
d = cdist(c, i)
In [113]: np.any(d == 0, axis=1).astype(int)
Out[113]:
array([1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1])
By default it uses the L2 norm; you could probably make it slightly faster with a simpler distance function, e.g.:
d = cdist(c, i, lambda u, v: np.all(np.equal(u, v)))
np.any(d != 0, axis=1).astype(int)
You can use np.ravel_multi_index to compress the columns into unique numbers which are easier to handle:
# Per-axis maxima over both tuples give a shape large enough for every coordinate.
cmx = *map(np.max, coords),
imx = *map(np.max, index),
shape = np.maximum(cmx, imx) + 1
# Compress each coordinate tuple into a single flat integer.
ct = np.ravel_multi_index(coords, shape)
it = np.ravel_multi_index(index, shape)
# Sort, then compare each compressed coord with the candidate at its insertion point.
# (Assumes no value of ct exceeds it.max(); otherwise searchsorted can index past the end.)
it.sort()
result = ct == it[it.searchsorted(ct)]
print(result.view(np.int8))
Prints:
[1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
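Once the coordinates are compressed to flat integers, np.isin is arguably a simpler membership test than sort + searchsorted, and it also sidesteps the out-of-range edge case noted above; a sketch reusing ct and it from the previous answer:

# Boolean mask: which compressed coords appear among the compressed index values.
result = np.isin(ct, it)
print(result.astype(int))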

Encode integer pandas dataframe column to padded 16 bit binary

I would like to encode integers stored in a pandas dataframe column as 16-bit binary numbers, one column per bit position, padding with leading zeros for numbers whose binary representation is shorter than 16 bits. For example, given a column containing integers ranging from 0 to 33000, for a value of 20 (10100 in binary) I would like to produce 16 columns with values 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, and so on across the entire column.
Setup
Consider the data frame df with column 'A'
df = pd.DataFrame(dict(A=range(16)))
Numpy broadcasting and bit shifting
a = df.A.values
n = int(np.log2(a.max() + 1))
b = (a[:, None] >> np.arange(n)[::-1]) & 1
pd.DataFrame(b)
0 1 2 3
0 0 0 0 0
1 0 0 0 1
2 0 0 1 0
3 0 0 1 1
4 0 1 0 0
5 0 1 0 1
6 0 1 1 0
7 0 1 1 1
8 1 0 0 0
9 1 0 0 1
10 1 0 1 0
11 1 0 1 1
12 1 1 0 0
13 1 1 0 1
14 1 1 1 0
15 1 1 1 1
String formatting with f-strings
n = int(np.log2(df.A.max() + 1))
pd.DataFrame([list(map(int, f'{i:0{n}b}')) for i in df.A])
0 1 2 3
0 0 0 0 0
1 0 0 0 1
2 0 0 1 0
3 0 0 1 1
4 0 1 0 0
5 0 1 0 1
6 0 1 1 0
7 0 1 1 1
8 1 0 0 0
9 1 0 0 1
10 1 0 1 0
11 1 0 1 1
12 1 1 0 0
13 1 1 0 1
14 1 1 1 0
15 1 1 1 1
Could you do something like this?
x = 20
bin_string = format(x, '016b')
df = pd.DataFrame(list(bin_string)).T
I don't know enough about what you're trying to do to know if that's sufficient.
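The broadcasting and f-string answers infer the width from the column maximum (n = 4 for the setup above), while the question asks for a fixed, zero-padded 16-bit encoding. A sketch pinning the broadcasting approach to 16 bits, assuming all values fit in 16 bits (0 to 65535); the sample values are taken from the question:

import numpy as np
import pandas as pd

df = pd.DataFrame(dict(A=[0, 20, 33000]))

# Shift each value right by 15, 14, ..., 0 bits and mask off the lowest bit,
# giving one column per bit position, most significant bit first.
bits = (df.A.values[:, None] >> np.arange(16)[::-1]) & 1
print(pd.DataFrame(bits))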

Pandas histogram (counts) on grouped (by) values

I have a DataFrame which looks like this:
>>> df
type value
0 1 0.698791
1 3 0.228529
2 3 0.560907
3 1 0.982690
4 1 0.997881
5 1 0.301664
6 1 0.877495
7 2 0.561545
8 1 0.167920
9 1 0.928918
10 2 0.212339
11 2 0.092313
12 4 0.039266
13 2 0.998929
14 4 0.476712
15 4 0.631202
16 1 0.918277
17 3 0.509352
18 1 0.769203
19 3 0.994378
I would like to group on the type column and obtain histogram counts for the value column in 10 new columns, e.g. something like this:
1 3 9 6 8 10 5 4 7 2
type
1 0 1 0 0 0 2 1 1 0 1
2 2 1 1 0 0 1 1 0 0 0
3 2 0 0 0 0 1 1 0 0 0
4 1 1 0 0 0 1 0 0 0 1
Where column 1 is the count for the first bin (0.0 to 0.1) and so on...
Using numpy.histogram, I can only obtain the following:
>>> df.groupby('type')['value'].agg(lambda x: numpy.histogram(x, bins=10, range=(0, 1)))
type
1 ([0, 1, 1, 1, 1, 0, 0, 0, 0, 2], [0.0, 0.1, 0....
2 ([2, 0, 1, 0, 1, 0, 0, 0, 1, 1], [0.0, 0.1, 0....
3 ([2, 0, 0, 0, 1, 0, 0, 0, 0, 1], [0.0, 0.1, 0....
4 ([1, 1, 1, 0, 0, 0, 0, 0, 0, 1], [0.0, 0.1, 0....
Name: value, dtype: object
I do not manage to put this into the desired format afterwards (at least not in a simple way).
I found a trick to do what I want, but it is very ugly:
>>> d = {str(k): lambda x, _k = k: ((x >= (_k - 1)/10) & (x < _k/10)).sum() for k in range(1, 11)}
>>> df.groupby('type')['value'].agg(d)
1 3 9 6 8 10 5 4 7 2
type
1 0 1 0 0 0 2 1 1 0 1
2 2 1 1 0 0 1 1 0 0 0
3 2 0 0 0 0 1 1 0 0 0
4 1 1 0 0 0 1 0 0 0 1
Is there a better way to do what I want? I know that in R, the aggregate method can return a DataFrame, but not in python...
Is that what you want?
bins = np.linspace(0, 1.0, 11)
labels = list(range(1, 11))
(df.assign(q=pd.cut(df.value, bins=bins, labels=labels, right=False))
   .pivot_table(index='type', columns='q', aggfunc='size', fill_value=0)
)
Out[98]:
q 1 2 3 4 5 6 7 8 9 10
type
1 0 1 0 1 0 0 1 1 1 4
2 1 0 1 0 0 1 0 0 0 1
3 0 0 1 0 0 2 0 0 0 1
4 1 0 0 0 1 0 1 0 0 0
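A sketch of an equivalent one-step variant using pd.crosstab instead of assign + pivot_table; right=False makes each bin left-closed, [0.0, 0.1), matching the trick in the question:

import numpy as np
import pandas as pd

q = pd.cut(df['value'], bins=np.linspace(0, 1.0, 11), labels=range(1, 11), right=False)
# dropna=False keeps bins that happen to be empty (all ten are populated here).
print(pd.crosstab(df['type'], q, dropna=False))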
