Does anybody have an idea how to count all boolean values (including the false ones) in a pivot table?
passed_exam is a column of boolean values.
This code performs the task only for true values, which is great:
table = pd.pivot_table(df, index=["student","semester"], values=["passed_exam"], aggfunc=np.sum)
But I also want a column that counts all boolean values.
Thank you in advance!
I think you need groupby with size, then reshape by unstack:
df = pd.DataFrame({'student':['a'] * 4 + ['b'] * 6,
                   'semester':[1,1,2,2,1,1,2,2,2,2],
                   'passed_exam':[True, False] * 5})
print (df)
passed_exam semester student
0 True 1 a
1 False 1 a
2 True 2 a
3 False 2 a
4 True 1 b
5 False 1 b
6 True 2 b
7 False 2 b
8 True 2 b
9 False 2 b
table = (df.groupby(["student","semester","passed_exam"])
           .size()
           .unstack(fill_value=0)
           .rename_axis(None, axis=1)
           .reset_index())
print (table)
student semester False True
0 a 1 1 1
1 a 2 1 1
2 b 1 1 1
3 b 2 2 2
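If you prefer to stay with pivot_table from the question, passing a list of aggregation functions should also work; a minimal sketch, assuming the same df as above:

import pandas as pd

# "sum" counts only the True values; "count" counts all boolean values per group.
table = pd.pivot_table(df, index=["student", "semester"],
                       values=["passed_exam"],
                       aggfunc=["sum", "count"])
print(table)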
I have something like this:
data = {'SKU':[1,1,2,1,2,2,3],
        'QTY':[5,12,2,24,1,2,12],
        'TYPE':['M','C','M','C','M','M','C']}
df = pd.DataFrame(data)
print(df)
OUTPUT:
SKU QTY TYPE
0 1 5 M
1 1 12 C
2 2 2 M
3 1 24 C
4 2 1 M
5 2 2 M
6 3 12 C
And I want a list of unique SKUs and a true/false column indicating whether TYPE equals 'C' for all instances of that SKU.
Something like this:
SKU Case
0 1 False
1 2 False
2 3 True
I've tried all manner of combinations of groupby, filter, agg, value_counts, etc. and just can't seem to find a reasonable way to achieve this.
Any help would be much appreciated. I'm sure the answer will be humbling.
# True only when every TYPE in the SKU group equals 'C'
print(df.groupby('SKU')['TYPE'].agg(lambda x: np.all(x == 'C')).reset_index())
Prints:
SKU TYPE
0 1 False
1 2 False
2 3 True
Let us do groupby + all:
s = df.TYPE.eq('C').groupby(df['SKU']).all().reset_index()
SKU TYPE
0 1 False
1 2 False
2 3 True
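Both answers leave the result column named TYPE rather than Case as in the desired output; a small follow-up sketch (the rename chain is an assumption, not from either answer):

out = (df.groupby('SKU')['TYPE']
         .agg(lambda x: (x == 'C').all())
         .rename('Case')        # label the aggregated column as in the question
         .reset_index())
print(out)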
I have some columns in my DataFrame with values 0 and 1:
name a b c d e
0 one 1 0 1 0 0
1 two 0 0 1 0 0
2 three 0 0 1 0 1
How can I select columns where at least one value is 1? Other columns (that contain strings, or that take values other than 0 and 1) must be selected too.
I tried this expression:
df.iloc[:, [(clm == 'name') | (1 in df[clm].unique()) for clm in df.columns]]
Out:
name a c e
0 one 1 1 0
1 two 0 1 0
2 three 0 1 1
But this seems poor because I have to choose the column 'name' explicitly.
It is possible to remove all columns containing only 0 values: compare with DataFrame.ne (not equal), check for at least one True per column with any, and select with DataFrame.loc:
df = df.loc[:, df.ne(0).any()]
print (df)
name a c e
0 one 1 1 0
1 two 0 1 0
2 three 0 1 1
Details:
print (df.ne(0))
name a b c d e
0 True True False True False False
1 True False False True False False
2 True False False True False True
print (df.ne(0).any())
name True
a True
b False
c True
d False
e True
dtype: bool
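The ne(0) mask keeps the name column because strings compare unequal to 0. If you would rather restrict the test to numeric columns explicitly, one hedged variant (assuming the same df):

# Keep every non-numeric column, plus numeric columns with at least one nonzero value.
num = df.select_dtypes('number')
mask = ~df.columns.isin(num.columns) | df.columns.isin(num.columns[num.ne(0).any()])
print(df.loc[:, mask])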
There are quite a few similar questions out there, but I am not sure there is one that tackles both index and row values (relevant to a binary-classification df).
What I am trying to do is compare the columns with the same name, checking that they have the same values and index; if not, simply return an error.
Let's say DataFrame df has columns a, b and c, and df_original has columns from a to z.
How can we first find the columns that have the same name between those two DataFrames, and then check that the contents of those columns match row by row in value and index between a, b and c from df and df_original?
The contents of all the columns are numerical; that's why I want to compare the combination of index and values.
Demo:
In [1]: df
Out[1]:
a b c
0 0 1 2
1 1 2 0
2 0 1 0
3 1 1 0
4 3 1 0
In [3]: df_original
Out[3]:
a b c d e f g ......
0 4 3 1 1 0 0 0
1 3 1 2 1 1 2 1
2 1 2 1 1 1 2 1
3 3 4 1 1 1 2 1
4 0 3 0 0 1 1 1
In the above example, for the columns that share a name, compare each (index, value) combination and flag an error if any combination does not match.
common_cols = df.columns.intersection(df_original.columns)
for col in common_cols:
    df1_ind_val_pair = df[col].index.astype(str) + ' ' + df[col].astype(str)
    df2_ind_val_pair = df_original[col].index.astype(str) + ' ' + df_original[col].astype(str)
    if any(df1_ind_val_pair != df2_ind_val_pair):
        print('Found one or more unequal (index, value) pairs in col {}'.format(col))
IIUC:
Use pd.DataFrame.align with a join method of 'inner', then pass the resulting tuple, unpacked, to pd.DataFrame.eq (dfo below is shorthand for df_original):
pd.DataFrame.eq(*df.align(dfo, 'inner'))
a b c
0 False False False
1 False False False
2 False False False
3 False False False
4 False False True
To see rows that have all columns True, filter with this mask:
pd.DataFrame.eq(*df.align(dfo, 'inner')).all(1)
0 False
1 False
2 False
3 False
4 False
dtype: bool
With the sample data, however, the result will be empty:
df[pd.DataFrame.eq(*df.align(dfo, 'inner')).all(1)]
Empty DataFrame
Columns: [a, b, c]
Index: []
Same answer but with clearer code
def eq(d1, d2):
    # Align on the shared labels, then compare elementwise.
    d1, d2 = d1.align(d2, 'inner')
    return d1 == d2
eq(df, dfo)
a b c
0 False False False
1 False False False
2 False False False
3 False False False
4 False False True
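Since the question asks to return an error, one could raise on any mismatch, roughly like this (a sketch; the message and names are assumptions):

left, right = df.align(dfo, 'inner')
mismatch = left.ne(right)
if mismatch.values.any():
    bad = mismatch.any()
    raise ValueError('Unequal (index, value) pairs in columns: {}'.format(
        bad[bad].index.tolist()))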
Consider the dataframe df
A B C D match?
0 x y 1 1 true
1 x y 1 2 false
2 x y 2 1 false
3 x y 2 2 true
4 x y 3 4 false
5 x y 5 6 false
I would like to drop the unmatched rows that are already matched somewhere else.
A B C D match?
0 x y 1 1 true
3 x y 2 2 true
4 x y 3 4 false
5 x y 5 6 false
How can I do that with Pandas?
You could sort those two columns within each row so that their ordering is consistent throughout. Then drop all such duplicated entries by passing keep=False to DF.drop_duplicates():
# Sort C and D within each row so that (2, 1) and (1, 2) become identical
df[['C','D']] = np.sort(df[['C','D']].values)
# Drop every row that now has a duplicate
df.drop_duplicates(keep=False)
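If overwriting C and D is undesirable, a non-destructive variant might look like this (a sketch under the same assumptions; on the sample data it keeps rows 0, 3, 4 and 5):

import numpy as np
import pandas as pd

# Build the row-sorted pairs separately and use their duplicate mask
# to filter the original, untouched frame.
pairs = pd.DataFrame(np.sort(df[['C', 'D']].values), index=df.index)
result = df[~pairs.duplicated(keep=False)]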
You can compare the two columns with:
df.C == df.D
0 True
1 False
2 False
3 True
4 False
5 False
dtype: bool
Then shift the series down with (df.C == df.D).shift():
0 NaN
1 True
2 False
3 False
4 True
5 False
dtype: object
Each True value indicates the start of a new group. We can use cumsum to create the groupings we need for groupby
(df.C == df.D).shift().fillna(False).cumsum()
0 0
1 1
2 1
3 1
4 2
5 2
dtype: int64
Then use groupby + last:
df.groupby(df.C.eq(df.D).shift().fillna(False).cumsum()).last()
A B C D match?
0 x y 1 1 true
1 x y 2 2 true
2 x y 5 6 false
If you would like to remove the rows where "C" and "D" matched, boolean indexing with .loc will help you (.ix is deprecated and removed in modern pandas):
df = df.loc[df['C'] != df['D']]
Here, df['C'] != df['D'] generates a boolean Series, and .loc extracts the corresponding rows of the DataFrame :)
I have two columns I am working with. The first column is populated with zeros and the second column is populated with booleans.
column 1 column 2
0 True
0 True
0 False
0 True
0 True
0 False
0 False
0 True
There are millions of rows, so I am trying to figure out an efficient process that looks at column 2 and numbers each contiguous grouping of True bools in column 1.
column 1 column 2
1 True
1 True
0 False
2 True
2 True
0 False
0 False
3 True
Any help is much appreciated!
One trick which often comes in handy when vectorizing operations on contiguous groups is the shift-cumsum pattern:
>>> c = df["column 2"]
>>> c * (c & (c != c.shift())).cumsum()
0 1
1 1
2 0
3 2
4 2
5 0
6 0
7 3
Name: column 2, dtype: int32
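To write that back into the frame (a small usage sketch):

c = df["column 2"]
# A new group starts wherever True follows a non-True; cumsum numbers the groups,
# and multiplying by c zeroes out the False rows.
df["column 1"] = c * (c & (c != c.shift())).cumsum()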
# Mark the start of each True run, number the runs cumulatively,
# then zero out the False rows.
df['column 3'] = (df['column 2'] & (df['column 2'].shift() != True))
df['column 4'] = df['column 3'].cumsum()
df['column 1'] = df['column 2'] * df['column 4']
print(df)
column 1 column 2 column 3 column 4
0 1 True True 1
1 1 True False 1
2 0 False False 1
3 2 True True 2
4 2 True False 2
5 0 False False 2
6 0 False False 2
7 3 True True 3