I have this df:
import numpy as np
import pandas as pd

arrays = [np.array(['bar', 'bar', 'bar', 'bar', 'bar', 'bar', 'baz', 'baz', 'baz', 'baz', 'baz', 'baz', 'foo', 'foo', 'foo', 'foo', 'foo', 'foo']),
          np.array(['one', 'two'] * 9),
          np.array([1, 2, 3] * 6)]
df = pd.DataFrame(np.random.randn(18, 2), index=arrays, columns=['col1', 'col2'])

                 col1      col2
bar one 1 0.872359 -1.115871
two 2 -0.937908 -0.528563
one 3 -0.118874 0.286595
two 1 -0.507698 1.364643
one 2 1.507611 1.379498
two 3 -1.398019 -1.603056
baz one 1 1.498263 0.412380
two 2 -0.930022 -1.483657
one 3 -0.438157 1.465089
two 1 0.161887 1.346587
one 2 0.167086 1.246322
two 3 0.276344 -1.206415
foo one 1 -0.045389 -0.759927
two 2 0.087999 -0.435753
one 3 -0.232054 -2.221466
two 1 -1.299483 1.697065
one 2 0.612211 -1.076738
two 3 -1.482573 0.907826
And now I want to create a 'NEW' column such that:
for 'bar':
    if index level 2 > 1:
        'NEW' = col1
    else:
        'NEW' = col2
for 'baz' the same, but with > 2
for 'foo' the same, but with > 3
How can I do this without Python loops?
You can use get_level_values to select index values by level, and then create the new column with numpy.where:
# if possible, use a dictionary mapping each level-0 label to its threshold
d = {'bar':1, 'baz':2, 'foo':3}
m = df.index.get_level_values(2) > df.rename(d).index.get_level_values(0)
df['NEW'] = np.where(m, df.col1, df.col2)
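An equivalent sketch maps the level-0 labels directly with Index.map instead of rename (assuming the same dictionary d as above):
# sketch: map each level-0 label to its threshold, then compare
thresh = df.index.get_level_values(0).map(d)
df['NEW'] = np.where(df.index.get_level_values(2) > thresh, df.col1, df.col2)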
For a more general solution, use Series.rank:
a = df.index.get_level_values(2)
b = df.index.get_level_values(0).to_series().rank(method='dense')
df['NEW'] = np.where(a > b, df.col1, df.col2)
Detail:
print (b)
bar 1.0
bar 1.0
bar 1.0
bar 1.0
bar 1.0
bar 1.0
baz 2.0
baz 2.0
baz 2.0
baz 2.0
baz 2.0
baz 2.0
foo 3.0
foo 3.0
foo 3.0
foo 3.0
foo 3.0
foo 3.0
dtype: float64
I think you can avoid looping over the level-0 values by using pd.factorize:
In [40]: thresh = pd.factorize(df.index.get_level_values(0).values)[0] + 1
In [41]: mask = df.index.get_level_values(2) > thresh
In [42]: df['NEW'] = np.where(mask, df.col1, df.col2)
In [43]: df
Out[43]:
col1 col2 NEW
bar one 1 0.247222 -0.270104 -0.270104
two 2 0.429196 -1.385352 0.429196
one 3 0.782293 -1.565623 0.782293
two 1 0.392214 1.023960 1.023960
one 2 -1.628410 -0.484275 -1.628410
two 3 0.256757 0.529373 0.256757
baz one 1 -0.568608 -0.776577 -0.776577
two 2 2.142408 -0.815413 -0.815413
one 3 0.860080 0.501965 0.860080
two 1 -0.267029 -0.025360 -0.025360
one 2 0.187145 -0.063436 -0.063436
two 3 0.351296 -2.050649 0.351296
foo one 1 0.704941 0.176698 0.176698
two 2 -0.380353 1.027745 1.027745
one 3 -1.337364 -0.568359 -0.568359
two 1 -0.588601 -0.800426 -0.800426
one 2 1.513358 -0.616237 -0.616237
two 3 0.244831 1.027109 1.027109
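For reference, a small sketch of what pd.factorize returns for the level-0 labels; codes are assigned in order of first appearance, which is why codes + 1 gives the per-group thresholds used in the mask above:
import numpy as np
import pandas as pd

labels = np.array(['bar'] * 6 + ['baz'] * 6 + ['foo'] * 6)
codes, uniques = pd.factorize(labels)
# codes   -> [0 0 0 0 0 0 1 1 1 1 1 1 2 2 2 2 2 2]
# uniques -> ['bar' 'baz' 'foo']
# codes + 1 -> the thresholds 1, 2, 3 compared against index level 2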
I have a DataFrame:
>>> df
A
0 foo
1 bar
2 foo
3 baz
4 foo
5 bar
I need to find all the duplicate groups and label them with sequential dgroup_id's:
>>> df
A dgroup_id
0 foo 1
1 bar 2
2 foo 1
3 baz
4 foo 1
5 bar 2
(This means that foo belongs to the first group of duplicates, bar to the second group of duplicates, and baz is not duplicated.)
I did this:
import pandas as pd
df = pd.DataFrame({'A': ('foo', 'bar', 'foo', 'baz', 'foo', 'bar')})
duplicates = df.groupby('A').size()
duplicates = duplicates[duplicates>1]
# Yes, this is ugly, but I didn't know how to do it otherwise:
duplicates[duplicates.reset_index().index] = duplicates.reset_index().index
df.insert(1, 'dgroup_id', df['A'].map(duplicates))
This leads to:
>>> df
A dgroup_id
0 foo 1.0
1 bar 0.0
2 foo 1.0
3 baz NaN
4 foo 1.0
5 bar 0.0
Is there a simpler/shorter way to achieve this in pandas? I read that maybe pandas.factorize could be of help here, but I don't know how to use it... (the pandas documentation on this function is of no help)
Also: I don't mind either the 0-based group count or the weird sorting order, but I would like to have the dgroup_id's as ints, not floats.
You can build a list of the duplicated values with get_duplicates() and then derive dgroup_id from each value's position in that list:
import pandas as pd

def find_index(string):
    # 1-based position in the list of duplicated values, 0 for unique values
    if string in duplicates:
        return duplicates.index(string) + 1
    else:
        return 0

df = pd.DataFrame({'A': ('foo', 'bar', 'foo', 'baz', 'foo', 'bar')})
duplicates = df.set_index('A').index.get_duplicates()
df['dgroup_id'] = df['A'].apply(find_index)
df
Output:
A dgroup_id
0 foo 2
1 bar 1
2 foo 2
3 baz 0
4 foo 2
5 bar 1
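Note that Index.get_duplicates() has since been removed in newer pandas versions; a hedged sketch of an equivalent way to build the list of duplicated values (the sorted() call reproduces the bar=1, foo=2 numbering shown above):
# duplicated values of A, sorted so the 1-based numbering matches the output above
duplicates = sorted(df.loc[df['A'].duplicated(keep=False), 'A'].unique())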
Use a chained operation to first get the value counts for each A, compute a sequence number for each group, and then join back to the original DataFrame:
(
pd.merge(df,
df.A.value_counts().apply(lambda x: 1 if x>1 else np.nan)
.cumsum().rename('dgroup_id').to_frame(),
left_on='A', right_index=True).sort_index()
)
Out[49]:
A dgroup_id
0 foo 1.0
1 bar 2.0
2 foo 1.0
3 baz NaN
4 foo 1.0
5 bar 2.0
If you need NaN for the unique groups, you can't have int as the dtype, which is a pandas limitation at the moment. If you are OK with 0 for the unique groups, you can do something like:
(
pd.merge(df,
df.A.value_counts().apply(lambda x: 1 if x>1 else np.nan)
.cumsum().rename('dgroup_id').to_frame().fillna(0).astype(int),
left_on='A', right_index=True).sort_index()
)
A dgroup_id
0 foo 1
1 bar 2
2 foo 1
3 baz 0
4 foo 1
5 bar 2
Use duplicated to identify where dups are. Use where to replace singletons with ''. Use categorical to factorize.
dups = df.A.duplicated(keep=False)
df.assign(dgroup_id=df.A.where(dups, '').astype('category').cat.codes)
A dgroup_id
0 foo 2
1 bar 1
2 foo 2
3 baz 0
4 foo 2
5 bar 1
If you insist on the zeros being ''
dups = df.A.duplicated(keep=False)
df.assign(
dgroup_id=df.A.where(dups, '').astype('category').cat.codes.replace(0, ''))
A dgroup_id
0 foo 2
1 bar 1
2 foo 2
3 baz
4 foo 2
5 bar 1
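Since the question explicitly asks about pandas.factorize, here is a minimal sketch combining it with duplicated (assuming the same df) that also keeps dgroup_id as ints:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ('foo', 'bar', 'foo', 'baz', 'foo', 'bar')})
dups = df['A'].duplicated(keep=False)        # True where a value occurs more than once
codes = pd.factorize(df['A'])[0] + 1         # 1-based group id, in order of first appearance
df['dgroup_id'] = np.where(dups, codes, 0)   # 0 marks non-duplicated values
This yields foo -> 1, bar -> 2, baz -> 0 with an integer dtype.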
You could go for:
import pandas as pd
import numpy as np
df = pd.DataFrame(['foo', 'bar', 'foo', 'baz', 'foo', 'bar',], columns=['name'])
# Create the groups order
ordered_names = df['name'].drop_duplicates().tolist() # ['foo', 'bar', 'baz']
# Find index of each element in the ordered list
df['duplication_index'] = df['name'].apply(lambda x: ordered_names.index(x) + 1)
# Discard non-duplicated entries
df.loc[~df['name'].duplicated(keep=False), 'duplication_index'] = np.nan
print(df)
# name duplication_index
# 0 foo 1.0
# 1 bar 2.0
# 2 foo 1.0
# 3 baz NaN
# 4 foo 1.0
# 5 bar 2.0
df = pd.DataFrame({'A': ('foo', 'bar', 'foo', 'baz', 'foo', 'bar')})
key_set = set(df['A'])
df_a = pd.DataFrame(list(key_set))
df_a['dgroup_id'] = df_a.index
result = pd.merge(df,df_a,left_on='A',right_on=0,how='left')
In [32]: result.drop(0,axis=1)
Out[32]:
A dgroup_id
0 foo 2
1 bar 0
2 foo 2
3 baz 1
4 foo 2
5 bar 0
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
'C' : [np.nan, 'bla2', np.nan, 'bla3', np.nan, np.nan, np.nan, np.nan]})
Output:
A B C
0 foo one NaN
1 bar one bla2
2 foo two NaN
3 bar three bla3
4 foo two NaN
5 bar two NaN
6 foo one NaN
7 foo three NaN
I would like to use groupby in order to count the number of NaNs in C for the different combinations of A and B.
Expected Output (EDIT):
A B C D
0 foo one NaN 2
1 bar one bla2 0
2 foo two NaN 2
3 bar three bla3 0
4 foo two NaN 2
5 bar two NaN 1
6 foo one NaN 2
7 foo three NaN 1
Currently I am trying this:
df['count']=df.groupby(['A'])['B'].isnull().transform('sum')
But this is not working...
Thank You
I think you need groupby with sum of NaN values:
df2 = df.C.isnull().groupby([df['A'],df['B']]).sum().astype(int).reset_index(name='count')
print(df2)
A B count
0 bar one 0
1 bar three 0
2 bar two 1
3 foo one 2
4 foo three 1
5 foo two 2
Notice that .isnull() is applied to the original DataFrame column, not to the groupby object. The groupby object does not have .isnull(), but if it did, it would be expected to give the same result as .isnull() on the original DataFrame.
If you need to filter first, add boolean indexing:
df = df[df['A'] == 'foo']
df2 = df.C.isnull().groupby([df['A'],df['B']]).sum().astype(int)
print(df2)
A B
foo one 2
three 1
two 2
Or simpler:
df = df[df['A'] == 'foo']
df2 = df['B'].value_counts()
print(df2)
one 2
two 2
three 1
Name: B, dtype: int64
EDIT: The solution to the edited question is very similar, just add transform:
df['D'] = df.C.isnull().groupby([df['A'],df['B']]).transform('sum').astype(int)
print(df)
A B C D
0 foo one NaN 2
1 bar one bla2 0
2 foo two NaN 2
3 bar three bla3 0
4 foo two NaN 2
5 bar two NaN 1
6 foo one NaN 2
7 foo three NaN 1
Similar solution:
df['D'] = df.C.isnull()
df['D'] = df.groupby(['A','B'])['D'].transform('sum').astype(int)
print(df)
A B C D
0 foo one NaN 2
1 bar one bla2 0
2 foo two NaN 2
3 bar three bla3 0
4 foo two NaN 2
5 bar two NaN 1
6 foo one NaN 2
7 foo three NaN 1
df[df.A == 'foo'].groupby('B').agg({'C': lambda x: x.isnull().sum()})
returns:
         C
B
one      2
three    1
two      2
Just add the parameter dropna=False:
df.groupby(['A', 'B','C'], dropna=False).size()
Check the documentation:
dropna : bool, default True
    If True, and if group keys contain NA values, NA values together with row/column will be dropped. If False, NA values will also be treated as the key in groups.
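As a sketch on the sample df from this question (assuming pandas >= 1.1, where groupby's dropna argument was added), the NaN values in C then form their own group keys instead of being dropped:
# counts rows per (A, B, C) combination, keeping NaN in C as a key
df.groupby(['A', 'B', 'C'], dropna=False).size()
# e.g. ('foo', 'one', NaN) -> 2, ('bar', 'two', NaN) -> 1, ('bar', 'one', 'bla2') -> 1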
I know this solution: How to make a pandas crosstab with percentages?, but the solution proposed there does not work with three-way tables.
Consider the following table:
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 6,
'B' : ['A', 'B', 'C'] * 8,
'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4})
pd.crosstab(df.A,[df.B,df.C],colnames=['topgroup','bottomgroup'])
Out[89]:
topgroup A B C
bottomgroup bar foo bar foo bar foo
A
one 2 2 2 2 2 2
three 2 0 0 2 2 0
two 0 2 2 0 0 2
Here, I would like to get the row percentage, within each topgroup (A, B and C).
Using apply(lambda x: x / x.sum(), axis=1) will not do, because the percentages have to sum to 1 within each topgroup, not across the whole row.
Any ideas?
If I understand your question, it seems that you could write:
>>> table = pd.crosstab(df.A,[df.B,df.C], colnames=['topgroup','bottomgroup'])
>>> table / table.sum(axis=1, level=0)
topgroup A B C
bottomgroup bar foo bar foo bar foo
A
one 0.5 0.5 0.5 0.5 0.5 0.5
three 1.0 0.0 0.0 1.0 1.0 0.0
two 0.0 1.0 1.0 0.0 0.0 1.0
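Note that in newer pandas versions the level keyword of DataFrame.sum has been deprecated, so an equivalent sketch of the same within-topgroup normalization might be:
# same normalization written with groupby, for pandas versions where
# table.sum(axis=1, level=0) is no longer available
table = pd.crosstab(df.A, [df.B, df.C], colnames=['topgroup', 'bottomgroup'])
table / table.T.groupby(level='topgroup').transform('sum').T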
I have DataFrame with two columns of "a" and "b". How can I find the conditional probability of "a" given specific "b"?
df.groupby('a').groupby('b')
does not work. Let's assume I have 3 categories in column a, and for each specific one I have 5 categories of b. What I need to do is to find the total number of one class of b for each class of a. I tried the apply command, but I think I do not know how to use it properly.
df.groupby('a').apply(lambda x: x[x['b']] == '...').count()
To find the total number of each class of b for each class of a, you would do
df.groupby('a').b.value_counts()
For example, create a DataFrame as below:
df = pd.DataFrame({'A':['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'], 'B':['one', 'one', 'two', 'three','two', 'two', 'one', 'three'], 'C':np.random.randn(8), 'D':np.random.randn(8)})
A B C D
0 foo one -1.565185 -0.465763
1 bar one 2.499516 -0.941229
2 foo two -0.091160 0.689009
3 bar three 1.358780 -0.062026
4 foo two -0.800881 -0.341930
5 bar two -0.236498 0.198686
6 foo one -0.590498 0.281307
7 foo three -1.423079 0.424715
Then:
df.groupby('A')['B'].value_counts()
A
bar one 1
two 1
three 1
foo one 2
two 2
three 1
To convert this to a conditional probability, you need to divide by the total size of each group.
You can either do it with another groupby:
df.groupby('A')['B'].value_counts() / df.groupby('A')['B'].count()
A
bar one 0.333333
two 0.333333
three 0.333333
foo one 0.400000
two 0.400000
three 0.200000
dtype: float64
Or you can apply a lambda function onto the groups:
df.groupby('A')['B'].apply(lambda g: g.value_counts() / len(g))
Answer:
This is possible using the pandas crosstab function. Given the description of the problem, where the DataFrame is called 'df' with columns 'a' and 'b':
pd.crosstab(df.a, df.b, normalize='columns')
will return a DataFrame representing P(a | b).
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.crosstab.html
Explanation:
Consider the DataFrame:
df = pd.DataFrame({'a':['x', 'x', 'x', 'y', 'y', 'y', 'y', 'z'],
'b':['1', '2', '3', '4','5', '1', '2', '3']})
Looking at columns a and b
df[["a", "b"]]
We have
a b
0 x 1
1 x 2
2 x 3
3 y 4
4 y 5
5 y 1
6 y 2
7 z 3
Then
pd.crosstab(df.a, df.b)
returns the frequency table of df.a and df.b with the rows being values of df.a and the columns being values of df.b
b 1 2 3 4 5
a
x 1 1 1 0 0
y 1 1 0 1 1
z 0 0 1 0 0
We can instead use the normalize keyword to get the table of conditional probabilities P(a | b)
pd.crosstab(df.a, df.b, normalize='columns')
This normalizes over the columns; in our case, it returns a DataFrame where the columns represent the conditional probabilities P(a | b=B) for specific values of B:
b 1 2 3 4 5
a
x 0.5 0.5 0.5 0.0 0.0
y 0.5 0.5 0.0 1.0 1.0
z 0.0 0.0 0.5 0.0 0.0
Notice, the columns sum to 1.
If we would instead prefer to get P(b | a), we can normalize over the rows with normalize='index'
pd.crosstab(df.a, df.b, normalize='index')
To get
b 1 2 3 4 5
a
x 0.333333 0.333333 0.333333 0.00 0.00
y 0.250000 0.250000 0.000000 0.25 0.25
z 0.000000 0.000000 1.000000 0.00 0.00
Where the rows represent the conditional probabilities P(b | a=A) for specific values of A. Notice, the rows sum to 1.
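A small sketch to verify the two normalizations on the same df:
# each column of the P(a | b) table sums to 1
pd.crosstab(df.a, df.b, normalize='columns').sum(axis=0)
# each row of the P(b | a) table sums to 1
pd.crosstab(df.a, df.b, normalize='index').sum(axis=1)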
You can pass in a list to groupby:
df.groupby(['a','b']).count()
You could try this function:
import pandas as pd

def conprob(pd1, pd2, transpose=1):
    # frequency table: rows = pd1, columns = pd2 (swapped if transpose != 0)
    if transpose == 0:
        table = pd.crosstab(pd1, pd2)
    else:
        table = pd.crosstab(pd2, pd1)
    cnames = table.columns.values
    # divide each column by its total -> P(row | column)
    weights = 1 / table[cnames].sum()
    out = table * weights
    pc = table[cnames].sum() / table[cnames].sum().sum()  # marginal P(column); computed but not returned
    table = table.transpose()
    cnames = table.columns.values
    # marginal probability of each row value, appended as column 'p'
    p = table[cnames].sum() / table[cnames].sum().sum()
    out['p'] = p
    return out
This returns the conditional probability P(row | column).
Consider the DataFrame that Maxymoo suggested:
df = pd.DataFrame({'A':['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'], 'B':['one', 'one', 'two', 'three','two', 'two', 'one', 'three'], 'C':np.random.randn(8), 'D':np.random.randn(8)})
df
A B C D
0 foo one 0.229206 -1.899999
1 bar one 0.174972 0.328746
2 foo two -1.384699 -1.691151
3 bar three -1.008328 -0.915467
4 foo two -0.065298 -0.107240
5 bar two 1.871916 0.798135
6 foo one 1.589609 -1.682237
7 foo three 2.292783 0.639595
Let's assume that we are interested in calculating the probability of (y = foo) given x = one: P(y=foo | x=one) = ?
Approach 1:
df.groupby('B')['A'].value_counts()/df.groupby('B')['A'].count()
B
one foo 0.666667
bar 0.333333
three foo 0.500000
bar 0.500000
two foo 0.666667
bar 0.333333
dtype: float64
So the answer is: 0.6667
Approach 2:
Probability of x = one: 0.375
df['B'].value_counts()/df['B'].count()
one 0.375
two 0.375
three 0.250
dtype: float64
Probability of y = foo: 0.625
df['A'].value_counts()/df['A'].count()
foo 0.625
bar 0.375
dtype: float64
Probability of (x=one|y=foo): 0.4
df.groupby('A')['B'].value_counts()/df.groupby('A')['B'].count()
A
bar one 0.333333
two 0.333333
three 0.333333
foo one 0.400000
two 0.400000
three 0.200000
dtype: float64
Therefore: P(y=foo|x=one) = P(x=one|y=foo)*P(y=foo)/P(x=one) = 0.4 * 0.625 / 0.375 = 0.6667
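A direct sketch of the same conditional probability from counts, as a sanity check on the Bayes computation:
# P(A='foo' | B='one') computed directly from the rows where B == 'one'
sub = df[df['B'] == 'one']
(sub['A'] == 'foo').mean()   # 2 of the 3 'one' rows are 'foo' -> 0.666...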
The question is a little odd, in that it suggests that column B has categorical values. Typically, we compute (conditional) expectations on real-valued variables. In this case, it's actually much simpler
df.groupby('A')['B'].mean()
For example, in the dataframe
df = pd.DataFrame({'A':['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'], 'B':[1, 1, 2, 3,2, 2, 1, 3], 'C':np.random.randn(8), 'D':np.random.randn(8)})
we get
A
bar 2.0
foo 1.8
Name: B, dtype: float64
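The same pattern gives the conditional expectation of any real-valued column; a small sketch using column C of the same df:
# conditional expectation E[C | A]
df.groupby('A')['C'].mean()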
After pivoting a DataFrame with two value columns like below:
import pandas as pd
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'bar'],
'B' : ['one', 'one', 'two', 'two',
'two', 'two', 'one', 'two'],
'C' : [56, 2, 3, 4, 5, 6, 0, 2],
'D' : [51, 2, 3, 4, 5, 6, 0, 2]})
pd.pivot_table(df, values=['C','D'],rows='B',cols='A').unstack().reset_index()
When I unstack the pivot and reset the index, two new columns, 'level_0' and 0, are created. 'level_0' contains the column names C and D, and 0 contains the values.
level_0 A B 0
0 C bar one 2.0
1 C bar two 4.0
2 C foo one 28.0
3 C foo two 4.0
4 D bar one 2.0
5 D bar two 4.0
6 D foo one 25.5
7 D foo two 4.0
Is it possible to unstack the frame so each value (C,D) appears in a separate column or do I have to split and concatenate the frame to achieve this? Thanks.
edited to show desired output:
A B C D
0 bar one 2 2
1 bar two 4 4
2 foo one 28 25.5
3 foo two 4 4
You want to stack (and not unstack):
In [70]: pd.pivot_table(df, values=['C','D'],rows='B',cols='A').stack()
Out[70]:
C D
B A
one bar 2 2.0
foo 28 25.5
two bar 4 4.0
foo 4 4.0
Although the unstack you used actually performed a 'stack' operation, because you had no MultiIndex in the index axis (only in the column axis).
But actually, you can also get there (and I think more logically) with a groupby operation, as this is what you actually do (group columns C and D by A and B):
In [72]: df.groupby(['A', 'B']).mean()
Out[72]:
C D
A B
bar one 2 2.0
two 4 4.0
foo one 28 25.5
two 4 4.0
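Note that the rows=/cols= keywords used here are from an older pandas API; a hedged sketch of the same pivot for more recent pandas versions, where they became index= and columns=:
# equivalent call with the current pivot_table keyword names
pd.pivot_table(df, values=['C', 'D'], index='B', columns='A').stack()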