I have a DataFrame which looks like this:
>>> df
type value
0 1 0.698791
1 3 0.228529
2 3 0.560907
3 1 0.982690
4 1 0.997881
5 1 0.301664
6 1 0.877495
7 2 0.561545
8 1 0.167920
9 1 0.928918
10 2 0.212339
11 2 0.092313
12 4 0.039266
13 2 0.998929
14 4 0.476712
15 4 0.631202
16 1 0.918277
17 3 0.509352
18 1 0.769203
19 3 0.994378
I would like to group on the type column and obtain histogram bins for the value column in 10 new columns, e.g. something like this:
1 3 9 6 8 10 5 4 7 2
type
1 0 1 0 0 0 2 1 1 0 1
2 2 1 1 0 0 1 1 0 0 0
3 2 0 0 0 0 1 1 0 0 0
4 1 1 0 0 0 1 0 0 0 1
Where column 1 is the count for the first bin (0.0 to 0.1) and so on...
Using numpy.histogram, I can only obtain the following:
>>> df.groupby('type')['value'].agg(lambda x: numpy.histogram(x, bins=10, range=(0, 1)))
type
1 ([0, 1, 1, 1, 1, 0, 0, 0, 0, 2], [0.0, 0.1, 0....
2 ([2, 0, 1, 0, 1, 0, 0, 0, 1, 1], [0.0, 0.1, 0....
3 ([2, 0, 0, 0, 1, 0, 0, 0, 0, 1], [0.0, 0.1, 0....
4 ([1, 1, 1, 0, 0, 0, 0, 0, 0, 1], [0.0, 0.1, 0....
Name: value, dtype: object
Which I do not manage to put in the correct format afterwards (at least not in a simple way).
I found a trick to do what I want, but it is very ugly:
>>> d = {str(k): lambda x, _k = k: ((x >= (_k - 1)/10) & (x < _k/10)).sum() for k in range(1, 11)}
>>> df.groupby('type')['value'].agg(d)
1 3 9 6 8 10 5 4 7 2
type
1 0 1 0 0 0 2 1 1 0 1
2 2 1 1 0 0 1 1 0 0 0
3 2 0 0 0 0 1 1 0 0 0
4 1 1 0 0 0 1 0 0 0 1
Is there a better way to do what I want? I know that in R, the aggregate method can return a DataFrame, but not in Python...
Is that what you want?
bins = np.linspace(0, 1.0, 11)
labels = list(range(1, 11))
(df.assign(q=pd.cut(df.value, bins=bins, labels=labels, right=False))
   .pivot_table(index='type', columns='q', aggfunc='size', fill_value=0)
)
which gives:
q 1 2 3 4 5 6 7 8 9 10
type
1 0 1 0 1 0 0 1 1 1 4
2 1 0 1 0 0 1 0 0 0 1
3 0 0 1 0 0 2 0 0 0 1
4 1 0 0 0 1 0 1 0 0 0
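Here pd.cut labels each value with the bin it falls in (right=False makes the bins left-closed, i.e. [0.0, 0.1) and so on, matching your bin definition), and pivot_table with aggfunc='size' counts the rows per (type, bin) pair, with fill_value=0 for empty bins.
If you would rather stay closer to the numpy.histogram attempt, here is a minimal sketch of an equivalent reshape (the 1-10 bin labels are my own choice):
import numpy as np
import pandas as pd

# Wrap each group's histogram counts in a labeled Series, then
# unstack the bin labels into one column per bin.
hist = (
    df.groupby('type')['value']
      .apply(lambda x: pd.Series(np.histogram(x, bins=10, range=(0, 1))[0],
                                 index=range(1, 11)))
      .unstack()
)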
Related
I have the following dictionary, which I'm loading into a pandas DataFrame to manipulate:
data = {"Value": [4, 4, 2, 1, 1, 1, 0, 7, 0, 4, 1, 1, 3, 0, 3, 0, 7, 0, 4, 1, 0, 1, 0, 1, 4, 4, 2, 3],
"IdPar": [0, 0, 0, 0, 0, 0, 10, 10, 10, 10, 10, 0, 0, 22, 22, 28, 28, 28, 28, 0, 0, 38, 38 , 0, 0, 0, 0, 0]
}
df = pd.DataFrame(data)
I would like a new column called Count such that, whenever a non-zero value repeats in the IdPar column, a sequential counter appears in the corresponding rows, and whenever IdPar is 0, Count is 0 as well. This is what I expect to get:
Value IdPar Count
0 4 0 0
1 4 0 0
2 2 0 0
3 1 0 0
4 1 0 0
5 1 0 0
6 0 10 1
7 7 10 2
8 0 10 3
9 4 10 4
10 1 10 5
11 1 0 0
12 3 0 0
13 0 22 1
14 3 22 2
15 0 28 1
16 7 28 2
17 0 28 3
18 4 28 4
19 1 0 0
20 0 0 0
21 1 38 1
22 0 38 2
23 1 0 0
24 4 0 0
25 4 0 0
26 2 0 0
27 3 0 0
I have reviewed the pandas documentation and tried many functions (ne, shift, cumsum, groupby, pivot_table, transform), but none of them gives the result I want:
s = df.pivot_table(index = ['IdPar'], aggfunc = 'size')
print(s)
t = df['IdPar'].ne(df['IdPar'].shift()).cumsum()
print(t)
df['Count'] = df['IdPar'].isin(df['IdPar'])
df['Count'] = df.loc[df['Count'] == True, 'IdPar']
print(df)
The furthest I've gotten is to place in the Count column the total number of repetitions for each row where a repeated IdPar value occurs, using the code below, but I don't want that either:
df['Count'] = df.groupby(['IdPar'])['Value'].transform('count')
print(df['Count'])
I really appreciate anyone who can help me. Any comment helps.
Try cumcount:
df['Count'] = df.groupby('IdPar')['IdPar'].cumcount() + 1
df.loc[df['IdPar'] == 0, 'Count'] = 0
print(df)
Or try in one line:
df['Count'] = df.groupby('IdPar').cumcount().add(1).mask(df['IdPar'].eq(0), 0)
Both versions output:
IdPar Value Count
0 0 4 0
1 0 4 0
2 0 2 0
3 0 1 0
4 0 1 0
5 0 1 0
6 10 0 1
7 10 7 2
8 10 0 3
9 10 4 4
10 10 1 5
11 0 1 0
12 0 3 0
13 22 0 1
14 22 3 2
15 28 0 1
16 28 7 2
17 28 0 3
18 28 4 4
19 0 1 0
20 0 0 0
21 38 1 1
22 38 0 2
23 0 1 0
24 0 4 0
25 0 4 0
26 0 2 0
27 0 3 0
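For context, cumcount() numbers the rows within each IdPar group starting at 0, so adding 1 gives the 1-based counter, and the mask (or the .loc assignment) zeroes the rows where IdPar is 0. If you prefer numpy, an equivalent sketch of the same idea:
import numpy as np

# 0 where IdPar is 0, otherwise the running 1-based count of that IdPar value.
df['Count'] = np.where(df['IdPar'].eq(0), 0, df.groupby('IdPar').cumcount() + 1)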
I have the following DataFrame:
d = {'c_1': [0,0,0,1,0,0,0,1,0,0,0,0],
'c_2': [0,0,0,0,0,1,0,0,0,0,0,1]}
df = pd.DataFrame(d)
I want to create another column 'f' that becomes 1 when c_1 == 1 and stays 1 until c_2 == 1, at which point the value in 'f' returns to 0.
The desired output is as follows:
c_1 c_2 f
0 0 0 0
1 0 0 0
2 0 0 0
3 1 0 1
4 0 0 1
5 0 1 0
6 0 0 0
7 1 0 1
8 0 0 1
9 0 0 1
10 0 0 1
11 0 1 0
I'm thinking this requires some kind of conditional forward fill; I've looked at previous questions but haven't been able to arrive at the desired output.
Edit: I have come across a related scenario where the inputs differ and the current solutions do not work. I will confirm the accepted answer, but I'd appreciate any input on the below.
d = {'c_1': [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0],
'c_2': [1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]}
df = pd.DataFrame(d)
The desired output is as follows. Same as before: I want to create another column 'f' that becomes 1 when c_1 == 1 and stays 1 until c_2 == 1, at which point 'f' returns to 0.
c_1 c_2 f
0 0 1 0
1 0 1 0
2 0 1 0
3 0 0 0
4 0 0 0
5 0 0 0
6 1 0 1
7 0 0 1
8 0 1 0
9 0 1 0
10 0 1 0
11 0 1 0
12 0 1 0
13 0 0 0
14 0 0 0
15 0 0 0
16 1 0 1
17 0 0 1
18 1 0 1
19 1 0 1
20 0 0 1
21 0 0 1
22 0 0 1
23 0 0 1
24 0 1 0
You can try:
df['f'] = df[['c_1','c_2']].sum(axis=1).cumsum().mod(2)
print(df)
c_1 c_2 f
0 0 0 0
1 0 0 0
2 0 0 0
3 1 0 1
4 0 0 1
5 0 1 0
6 0 0 0
7 1 0 1
8 0 0 1
9 0 0 1
10 0 0 1
11 0 1 0
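This works because every 1, whether in c_1 or c_2, flips the parity of the running total, so the cumulative sum is odd exactly between a c_1 event and the following c_2 event. Note that it assumes the events strictly alternate starting with c_1; on the edited input, which begins with c_2 == 1, the leading rows would incorrectly get f = 1.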
You can also try this:
df.loc[df['c_2'].shift().ne(1), 'f'] = df['c_1'].replace(to_replace=0, method='ffill')
c_1 c_2 f
0 0 0 0.0
1 0 0 0.0
2 0 0 0.0
3 1 0 1.0
4 0 0 1.0
5 0 1 1.0 # <--- these values should be zero
6 0 0 NaN
7 1 0 1.0
8 0 0 1.0
9 0 0 1.0
10 0 0 1.0
11 0 1 1.0 # <---
Add one more condition if you don't want to include the end position.
Final:
df.loc[df['c_2'].shift().ne(1) & df['c_2'].ne(1), 'f'] = df['c_1'].replace(to_replace=0, method='ffill')
df = df.fillna(0)
c_1 c_2 f
0 0 0 0.0
1 0 0 0.0
2 0 0 0.0
3 1 0 1.0
4 0 0 1.0
5 0 1 0.0
6 0 0 0.0
7 1 0 1.0
8 0 0 1.0
9 0 0 1.0
10 0 0 1.0
11 0 1 0.0
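As a side note, newer pandas versions deprecate the method argument of replace, so the forward-fill part can be rewritten with mask + ffill (a sketch of the same idea):
# Equivalent to df['c_1'].replace(to_replace=0, method='ffill'):
# hide the zeros, forward-fill the last seen 1, then restore 0 for
# any leading rows before the first event.
filled = df['c_1'].mask(df['c_1'].eq(0)).ffill().fillna(0).astype(int)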
This should work for both scenarios:
df['c_1'].groupby(df[['c_1','c_2']].sum(axis=1).cumsum()).transform('first')
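Here the running total of both event columns increases on every event row, so it partitions the frame into segments that each start at an event; transform('first') then broadcasts the c_1 value of each segment's first row, which is 1 for segments opened by a c_1 event and 0 otherwise, including for the rows before the first event.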
Basically, I am looking to do this without repeated flags in the AA and BB columns: once AA has a 1, the next 1 should appear in BB rather than in AA again.
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'A': np.array([1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0], dtype='int32'),
    'B': np.array([0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1], dtype='int32')
})
df2['AA'] = np.where(df2['A'] > df2['A'].shift(1), 1, 0)
df2['BB'] = np.where(df2['B'] > df2['B'].shift(1), 1, 0)
I am getting repeated 1s in BB. How can I avoid the repetition: if AA already has a 1, the next 1 should go to BB instead of repeating in the same column.
df2
Out[24]:
A B AA BB
0 1 0 0 0
1 1 0 0 0
2 0 0 0 0
3 1 0 1 0
4 0 1 0 1
5 1 0 1 0
6 1 0 0 0
7 0 1 0 1
8 0 1 0 0
9 0 0 0 0
10 0 0 0 0
11 0 0 0 0
12 0 1 0 1
The result should be as follows: if AA has a 1 in a previous row, then the next 1 should appear in BB rather than AA repeating; likewise, BB should not repeat.
Out[24]:
A B AA BB
0 1 0 0 0
1 1 0 0 0
2 0 0 0 0
3 1 0 1 0
4 0 1 0 1
5 1 0 1 0
6 1 0 0 0
7 0 1 0 1
8 0 1 0 0
9 0 0 0 0
10 0 0 0 0
11 0 0 0 0
12 0 1 0 0
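A minimal sketch of one way to enforce the alternation, assuming a plain Python loop over the rows is acceptable: keep an edge only when the previously kept edge came from the other column.
# Build the raw AA/BB edge columns as in the question, then walk the
# rows and suppress any edge whose column matches the last kept edge.
last = None  # column name of the last kept edge
for i in df2.index:
    if df2.at[i, 'AA'] == 1:
        if last == 'AA':
            df2.at[i, 'AA'] = 0  # repeated AA edge: drop it
        else:
            last = 'AA'
    elif df2.at[i, 'BB'] == 1:
        if last == 'BB':
            df2.at[i, 'BB'] = 0  # repeated BB edge: drop it
        else:
            last = 'BB'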
I have a df where I want to do multi-label classification. One of the ways suggested to me was to calculate a probability vector. Here's an example of my DataFrame with what would represent the training data.
id ABC DEF GHI
1 0 0 0 1
2 1 0 1 0
3 2 1 0 0
4 3 0 1 1
5 4 0 0 0
6 5 0 1 1
7 6 1 1 1
8 7 1 0 1
9 8 1 1 0
I would like to concatenate columns ABC, DEF and GHI into a new column. I will also have to do this with more than 3 columns, so I want to do it relatively cleanly using a column list or something similar:
col_list = ['ABC','DEF','GHI']
The result I am looking for would be something like:
id ABC DEF GHI Conc
1 0 0 0 1 [0,0,1]
2 1 0 1 0 [0,1,0]
3 2 1 0 0 [1,0,0]
4 3 0 1 1 [0,1,1]
5 4 0 0 0 [0,0,0]
6 5 0 1 1 [0,1,1]
7 6 1 1 1 [1,1,1]
8 7 1 0 1 [1,0,1]
9 8 1 1 0 [1,1,0]
Try:
col_list = ['ABC','DEF','GHI']
df['agg_lst'] = df.apply(lambda x: [x[col] for col in col_list], axis=1)
You can use agg with the built-in list function:
df[col_list].agg(list, axis=1)
1 [0, 0, 1]
2 [0, 1, 0]
3 [1, 0, 0]
4 [0, 1, 1]
5 [0, 0, 0]
6 [0, 1, 1]
7 [1, 1, 1]
8 [1, 0, 1]
9 [1, 1, 0]
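Either way, the result can be assigned straight to the new column from the question, e.g. df['Conc'] = df[col_list].agg(list, axis=1).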
I have the following indices as you would get them from np.where(...):
import numpy as np

coords = (
    np.asarray([0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6]),
    np.asarray([2, 2, 8, 2, 2, 4, 4, 6, 2, 2, 6, 2, 2, 4, 6, 2, 2, 6, 2, 2, 4, 4, 6, 2, 2, 6]),
    np.asarray([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
    np.asarray([0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
)
Another tuple of indices is meant to select the entries of coords that also appear in it:
index = (
    np.asarray([0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6]),
    np.asarray([2, 8, 2, 4, 4, 6, 2, 2, 6, 2, 2, 4, 6, 2, 2, 6, 2, 2, 4, 4, 6, 2, 2, 6]),
    np.asarray([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
    np.asarray([0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
)
So, for instance, the first coordinate in coords is selected because it appears in index (at position 0), but the second one isn't selected because it is not present in index.
I can calculate the mask easily with [x in zip(*index) for x in zip(*coords)] (converted from bool to int for better readability):
[1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
but this wouldn't be very efficient for larger arrays. Is there a more "numpy-based" way that could calculate the mask?
I'm not so sure about efficiency, but given that you're basically comparing coordinate tuples, you could use scipy's distance functions. Something along the lines of:
from scipy.spatial.distance import cdist
c = np.stack(coords).T
i = np.stack(index).T
d = cdist(c, i)
np.any(d == 0, axis=1).astype(int)
which gives:
array([1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1])
By default it uses the L2 norm; you could probably make it slightly faster with a simpler distance function, e.g.:
d = cdist(c, i, lambda u, v: np.all(np.equal(u, v)))
np.any(d != 0, axis=1).astype(int)
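Keep in mind that cdist materializes the full len(c) by len(i) distance matrix, so memory use grows with the product of the two sizes; for very large inputs that can dominate the cost.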
You can use np.ravel_multi_index to compress the columns into unique numbers which are easier to handle:
cmx = tuple(map(np.max, coords))
imx = tuple(map(np.max, index))
shape = np.maximum(cmx, imx) + 1
ct = np.ravel_multi_index(coords, shape)
it = np.ravel_multi_index(index, shape)
it.sort()
result = ct == it[it.searchsorted(ct)]
print(result.view(np.int8))
Prints:
[1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
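One caveat: it.searchsorted(ct) can return len(it) when a compressed coords value is larger than everything in it, which would make the indexing fail. A small guard, as a sketch:
pos = it.searchsorted(ct)
pos[pos == len(it)] = len(it) - 1  # clamp out-of-range insertion points
result = ct == it[pos]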