I want to do what they've done in the answer here: "Calculating the number of specific consecutive equal values in a vectorized way in pandas", but using a grouped dataframe instead of a series.
So given a dataframe with several columns:
A B C
------------
x x 0
x x 5
x x 2
x x 0
x x 0
x x 3
x x 0
y x 1
y x 10
y x 0
y x 5
y x 0
y x 0
I want to groupby columns A and B, then count the number of consecutive zeros in C. After that I'd like to return counts of the number of times each length of zeros occurred. So I want output like this:
A B num_consecutive_zeros count
---------------------------------------
x x 1 2
x x 2 1
y x 1 1
y x 2 1
I don't know how to adapt the answer from the linked question to deal with grouped dataframes.
Here is the code. count_consecutive_zeros() uses NumPy functions and value_counts() to get the results; groupby().apply(count_consecutive_zeros) calls it for every group, and reset_index() turns the resulting MultiIndex into columns:
import pandas as pd
import numpy as np
from io import BytesIO
text = """A B C
x x 0
x x 5
x x 2
x x 0
x x 0
x x 3
x x 0
y x 1
y x 10
y x 0
y x 5
y x 0
y x 0"""
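df = pd.read_csv(BytesIO(text.encode()), sep=r"\s+")
def count_consecutive_zeros(s):
    # Pad with 0 on both sides so runs touching either end are closed;
    # in the diff, +1 marks the start of a zero run and -1 marks its end.
    v = np.diff(np.r_[0, s.values == 0, 0])
    # Run length = end position - start position; count how often each length occurs.
    s = pd.Series(np.where(v == -1)[0] - np.where(v == 1)[0]).value_counts()
    s.index.name = "num_consecutive_zeros"
    s.name = "count"
    return s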
df.groupby(["A", "B"]).C.apply(count_consecutive_zeros).reset_index()
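A possible alternative that avoids groupby().apply(): give every run of consecutive zero or non-zero values its own id, measure the zero runs, then count how often each length occurs. This is only a sketch under stated assumptions: the helper column name run is made up for illustration, and the frame is rebuilt inline from the question's data.

```python
import pandas as pd

df = pd.DataFrame({
    "A": list("xxxxxxxyyyyyy"),
    "B": ["x"] * 13,
    "C": [0, 5, 2, 0, 0, 3, 0, 1, 10, 0, 5, 0, 0],
})

is_zero = df["C"].eq(0)
keys = df[["A", "B"]]

# A new run starts when zero-ness flips or when the (A, B) group changes.
new_run = is_zero.ne(is_zero.shift()) | keys.ne(keys.shift()).any(axis=1)

# "run" is a hypothetical helper column holding the run id; keep only zero rows.
zeros = df.assign(run=new_run.cumsum())[is_zero]

# Length of every zero run within its group.
run_lengths = (
    zeros.groupby(["A", "B", "run"]).size()
    .reset_index(name="num_consecutive_zeros")
)

# How many times each run length occurs per group.
out = (
    run_lengths.groupby(["A", "B", "num_consecutive_zeros"]).size()
    .reset_index(name="count")
)
print(out)
```

Like the answer above, this assumes the rows of each group are already in the order in which the runs should be measured.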
I want to discover the underlying pattern between my features and target, so I tried groupby. Instead of the raw count, though, I want the ratio (percentage) of each feature combination relative to the total count of its class.
The following code is similar to what I have done:
fet1=["A","B","C"]
fet2=["X","Y","Z"]
target=["0","1"]
df = pd.DataFrame(data={"fet1":np.random.choice(fet1,1000),"fet2":np.random.choice(fet2,1000),"class":np.random.choice(target,1000)})
df.groupby(['fet1','fet2','class'])['class'].agg(['count'])
You can achieve this more simply with:
out = df.groupby('class').value_counts(normalize=True).mul(100)
Output:
class fet1 fet2
0 A Y 13.859275
B Y 12.366738
X 12.153518
C X 11.513859
Y 10.660981
B Z 10.447761
A Z 10.021322
C Z 9.594883
A X 9.381663
1 A Y 14.124294
C Z 13.935970
B Z 11.676083
Y 11.111111
C Y 11.111111
X 11.111111
A X 10.169492
B X 9.416196
A Z 7.344633
dtype: float64
If you want the same MultiIndex level order as in the question:
out = (df
       .groupby('class').value_counts(normalize=True).mul(100)
       .reorder_levels(['fet1', 'fet2', 'class']).sort_index()
      )
Output:
fet1 fet2 class
A X 0 9.381663
1 10.169492
Y 0 13.859275
1 14.124294
Z 0 10.021322
1 7.344633
B X 0 12.153518
1 9.416196
Y 0 12.366738
1 11.111111
Z 0 10.447761
1 11.676083
C X 0 11.513859
1 11.111111
Y 0 10.660981
1 11.111111
Z 0 9.594883
1 13.935970
dtype: float64
I achieved it by doing this:
fet1=["A","B","C"]
fet2=["X","Y","Z"]
target=["0","1"]
df = pd.DataFrame(data={"fet1":np.random.choice(fet1,1000),"fet2":np.random.choice(fet2,1000),"class":np.random.choice(target,1000)})
df.groupby(['fet1','fet2','class'])['class'].agg(['count'])/df.groupby(['class'])['class'].agg(['count'])*100
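The division above relies on pandas aligning the per-class totals with the class level of the three-level index. A sketch that makes that alignment explicit with Series.div(..., level=...) is shown here; the seeded generator and the variable names counts, class_totals and pct are assumptions added for reproducibility and illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
df = pd.DataFrame({
    "fet1": rng.choice(["A", "B", "C"], 1000),
    "fet2": rng.choice(["X", "Y", "Z"], 1000),
    "class": rng.choice(["0", "1"], 1000),
})

# Rows per (fet1, fet2, class) combination and rows per class.
counts = df.groupby(["fet1", "fet2", "class"]).size()
class_totals = df.groupby("class").size()

# Divide each combination by the total of its own class, as a percentage.
pct = counts.div(class_totals, level="class").mul(100)
print(pct)
```

Within each class the percentages add up to 100, which is a quick sanity check on the result.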
I've got data in a pandas dataframe that looks like this:
ID A B C D
100 0 1 0 1
101 1 1 0 1
102 0 0 0 1
...
The idea is to create a barchart plot that shows the total of each (sum of the total number of A's, B's, etc.). Something like:
      X
  X   X
x X   X
A B C D
This should be so simple...
Set 'ID' aside, sum, and plot.bar:
df.set_index('ID').sum().plot.bar()
# or
df.drop(columns=['ID']).sum().plot.bar()
Output: [bar chart of the column sums, one bar each for A, B, C and D]
Just for fun:
print(df.drop(columns='ID')
        .replace({0: ' ', 1: 'X'})
        .apply(sorted, reverse=True)
        .to_string(index=False)
      )
Output:
A B C D
X X   X
  X   X
      X
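To check the numbers behind the bar chart, the sums can be computed on their own first. The sketch below rebuilds only the three rows shown in the question (the rows elided by "..." are not reproduced), so the totals apply to that excerpt only.

```python
import pandas as pd

# Only the three rows visible in the question; the elided rows are unknown.
df = pd.DataFrame({
    "ID": [100, 101, 102],
    "A": [0, 1, 0],
    "B": [1, 1, 0],
    "C": [0, 0, 0],
    "D": [1, 1, 1],
})

# Move ID out of the data columns so it is not summed, then total each column.
totals = df.set_index("ID").sum()
print(totals)

# totals.plot.bar() would then draw one bar per column (requires matplotlib).
```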
I have the following code, where my dataframe contains three columns whose values I want to count (plus one other column):
toBeSummed toBeSummed2 toBesummed3 someColumn
0 X X Y NaN
1 X Y Z NaN
2 Y Y Z NaN
3 Z Z Z NaN
oneframe = pd.concat([df['toBeSummed'],df['toBeSummed2'],df['toBesummed3']], axis=1).reset_index()
temp = oneframe.groupby(['toBeSummed']).size().reset_index()
temp2 = oneframe.groupby(['toBeSummed2']).size().reset_index()
temp3 = oneframe.groupby(['toBesummed3']).size().reset_index()
temp.columns.values[0] = "SameName"
temp2.columns.values[0] = "SameName"
temp3.columns.values[0] = "SameName"
final = pd.concat([temp,temp2,temp3]).groupby(['SameName']).sum().reset_index()
final.columns.values[0] = "Letter"
final.columns.values[1] = "Sum"
The problem here is that with the code I have, it sums up all instances of each value. Meaning calling final would result in
Letter Sum
0 X 3
1 Y 4
2 Z 5
However, I want a value to be counted only once per row, even if it appears several times in that row (i.e. the first row has two X's, so it should count as a single X).
Meaning the desired output is
Letter Sum
0 X 2
1 Y 3
2 Z 3
I can update or add more comments if this is confusing.
Given df:
toBeSummed toBeSummed2 toBesummed3 someColumn
0 X X Y NaN
1 X Y Z NaN
2 Y Y Z NaN
3 Z Z Z NaN
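Doing (note axis=1, so that duplicates are dropped within each row rather than within each column):
sum_cols = ['toBeSummed', 'toBeSummed2', 'toBesummed3']
out = df[sum_cols].apply(lambda row: row.unique(), axis=1).explode().value_counts()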
print(out.to_frame('Sum'))
Output:
Sum
Y 3
Z 3
X 2
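The same "count each letter at most once per row" idea can also be sketched without apply, by reshaping to long form and dropping duplicate (row, letter) pairs. The names long, dedup and row below are assumptions for illustration, and someColumn is left out because it plays no part in the count.

```python
import pandas as pd

df = pd.DataFrame({
    "toBeSummed": ["X", "X", "Y", "Z"],
    "toBeSummed2": ["X", "Y", "Y", "Z"],
    "toBesummed3": ["Y", "Z", "Z", "Z"],
})
sum_cols = ["toBeSummed", "toBeSummed2", "toBesummed3"]

# Long form: one line per (row, column) cell, keeping the original row label.
long = (
    df[sum_cols]
    .rename_axis("row")
    .reset_index()
    .melt(id_vars="row", value_name="Letter")
)

# A letter repeated within one row collapses to a single (row, Letter) pair.
dedup = long.drop_duplicates(subset=["row", "Letter"])

out = dedup["Letter"].value_counts().rename("Sum")
print(out.to_frame())
```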
Given a pandas.DataFrame with a column MultiIndex as follows
A B C
X Y Z X Y Z X Y Z
how to rearrange the columns into this format?
X Y Z
A B C A B C A B C
(I tried .columns.swaplevel(0,1), but that does not yet yield the desired grouping)
df.swaplevel with axis=1 (for columns) is what you need.
>>> df.swaplevel(0,1, axis=1)
X Y Z X Y Z X Y Z
A A A B B B C C C
0 x x x x x x x x x
You can use sort_index to sort:
>>> df.swaplevel(0,1, axis=1).sort_index(level=0, axis=1)
X Y Z
A B C A B C A B C
0 x x x x x x x x x
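A self-contained way to try this out, assuming a one-row frame filled with the placeholder value "x" as in the output above:

```python
import pandas as pd

# Columns A/B/C on the outer level and X/Y/Z on the inner level.
cols = pd.MultiIndex.from_product([["A", "B", "C"], ["X", "Y", "Z"]])
df = pd.DataFrame([["x"] * 9], columns=cols)

# Swap the two column levels, then sort so equal outer labels sit together.
out = df.swaplevel(0, 1, axis=1).sort_index(axis=1, level=0)
print(out)
```

swaplevel only relabels each column, so the columns stay in their original positions; the sort_index step is what actually groups them under X, Y and Z.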
Here's my pseudo code
source
a b c d e
0 x x x x x
1 x x x x x
2 x x x x x
3 x x x x x
4 x x x x x
5 x x x x x
And then I have a lookup dataframe
lookup
a b c
0 1 2 3
Is there any function that would behave something like this - pd.source.overlay(lookup[2,c]) - producing an "overlay" at a specific position?
a b c d e
0 x x x x x
1 x x x x x
2 x x 1 2 3
3 x x x x x
4 x x x x x
5 x x x x x
Like this:
In [898]: df.iloc[2, -3:] = lu.values
In [899]: df
Out[899]:
a b c d e
0 x x x x x
1 x x x x x
2 x x 1 2 3
3 x x x x x
4 x x x x x
5 x x x x x
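First we get the index, then assign the value. Note that this writes through df.values, so it only changes df when .values returns a view (a single-dtype frame, and not under copy-on-write):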
df.values[2,2:]=lu.values
df
a b c d e
0 x x x x x
1 x x x x x
2 x x 1 2 3
3 x x x x x
4 x x x x x
5 x x x x x
col='c'
df.values[2,df.columns.get_indexer([col])[0]:]=lu.values
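The approaches above can be wrapped in a small helper. pandas has no built-in overlay function that I know of, so the overlay name and its parameters below are hypothetical; this is only a sketch that copies the first row of the lookup frame into a copy of the source, starting at a given column label.

```python
import pandas as pd


def overlay(source, lookup, row, start_col):
    """Hypothetical helper: write lookup's first row into a copy of source,
    at positional row `row`, starting at column label `start_col`."""
    out = source.copy()
    start = out.columns.get_loc(start_col)
    stop = start + lookup.shape[1]
    if stop > out.shape[1]:
        raise ValueError("lookup does not fit to the right of start_col")
    # Positional assignment through iloc; the column names of lookup are ignored.
    out.iloc[row, start:stop] = lookup.iloc[0].to_numpy()
    return out


# dtype=object so that integers can be stored next to the "x" placeholders.
source = pd.DataFrame("x", index=range(6), columns=list("abcde"), dtype=object)
lookup = pd.DataFrame([[1, 2, 3]], columns=list("abc"))

result = overlay(source, lookup, row=2, start_col="c")
print(result)
```

Working on a copy and assigning through iloc avoids depending on whether .values returns a view of the frame's data.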