Given a series that looks like:
0 foo
1 bar
2 foo
3 foo
4 bar
5 baz
How can I create a dataframe where each column is a mask for a unique value in the series? In this example, it would look like:
foo bar baz
0 True False False
1 False True False
2 True False False
3 True False False
4 False True False
5 False False True
Using get_dummies
s.str.get_dummies().astype(bool)
Out[392]:
bar baz foo
0 False False True
1 True False False
2 False False True
3 False False True
4 True False False
5 False True False
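Note that the top-level pd.get_dummies accepts a dtype argument, so the astype step can be skipped; in pandas 2.0+ bool is even the default dtype:
pd.get_dummies(s, dtype=bool)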
Or we can try something different with crosstab:
pd.crosstab(s.index, s).astype(bool)
Out[395]:
a bar baz foo
row_0
0 False False True
1 True False False
2 False False True
3 False False True
4 True False False
5 False True False
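Note that crosstab (like get_dummies) sorts the columns. If you want them in order of first appearance (foo, bar, baz), one option, assuming s has no missing values, is to reindex by the uniques:
pd.crosstab(s.index, s).astype(bool).reindex(columns=s.unique())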
Here's one with array-initialization -
import numpy as np

def series_hotencode(s):
    # factorize gives integer codes (a) and the unique values (b)
    a, b = s.factorize()
    ar = np.zeros((len(a), len(b)), dtype=bool)
    ar[np.arange(len(a)), a] = True  # flip on the matching column per row
    return pd.DataFrame(ar, columns=b)
Sample run -
In [40]: s
Out[40]:
0 foo
1 bar
2 foo
3 foo
4 bar
5 baz
Name: 1, dtype: object
In [41]: series_hotencode(s)
Out[41]:
foo bar baz
0 True False False
1 False True False
2 True False False
3 True False False
4 False True False
5 False False True
Let's try pd.factorize + np.eye for a fast, concise solution.
x, y = pd.factorize(s)
pd.DataFrame(np.eye(len(y), dtype=bool)[x], columns=y)
foo bar baz
0 True False False
1 False True False
2 True False False
3 True False False
4 False True False
5 False False True
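One caveat: np.eye materializes a len(y) x len(y) identity matrix, which is harmless for a handful of uniques but wasteful for high-cardinality data. A sketch that avoids it with a broadcast comparison instead:
pd.DataFrame(x[:, None] == np.arange(len(y)), columns=y)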
My logic is like this:
if cond2 is True before the current row, and cond1 is True before that cond2, then the expected column can be True.
input
import pandas as pd
import numpy as np
d = {'cond1': [False, False, True, False, False, False, False, True, False, False],
     'cond2': [False, True, False, True, True, False, False, False, True, False]}
df = pd.DataFrame(d)
expected result table
   cond1  cond2  expected
0  False  False
1  False  True
2  True   False
3  False  True
4  False  True
5  False  False  True
6  False  False  True
7  True   False
8  False  True
9  False  False  True
I have an idea:
get the number of rows from the last True in cond1 to the present row, then use the cumsum function to check whether the count of rows where cond2 is True is greater than 0.
But how do I get the number of rows from the last True in cond1 to the present?
The description is not fully clear. It looks like you need a cummax per group starting with True in cond1:
m = df.groupby(df['cond1'].cumsum())['cond2'].cummax()
df['expected'] = df['cond2'].ne(m)
Output:
cond1 cond2 expected
0 False False False
1 False True False
2 True False False
3 False True False
4 False True False
5 False False True
6 False False True
7 True False False
8 False True False
9 False False True
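For reference, the intermediate m (the cummax of cond2 within each group opened by a True in cond1) is:
print(m)
0    False
1     True
2    False
3     True
4     True
5     True
6     True
7    False
8     True
9     True
Name: cond2, dtype: bool
so expected is True exactly where cond2 is False but a True cond2 already occurred earlier in the same group.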
It's not very clear what you're looking for, but:
df['expected'] = ((df.index > df.idxmax().max())
                  & ~df.any(axis=1))
# Output:
cond1 cond2 expected
0 False False False
1 False True False
2 True False False
3 False True False
4 False True False
5 False False True
6 False False True
7 True False False
8 False True False
9 False False True
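To unpack this one: df.idxmax() gives the index of the first True in each column, so df.idxmax().max() is the position of the last of those "first Trues", and expected turns True only for all-False rows past that point:
df.idxmax()        # cond1    2
                   # cond2    1
df.idxmax().max()  # 2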
I have a dataframe with two columns:
A B
0 False False
1 False False
2 False False
3 True False
4 False False
5 False False
6 False True
7 False False
8 False False
9 False False
10 True False
11 False False
12 False False
I would like to create a new column "C" with Boolean values, that turns on (=True) each time B turns on and turns off each time A turns on (e.g. here between index 6 and index 10).
Ex: for this df, the output will be:
A B C
0 False False False
1 False False False
2 False False False
3 True False False
4 False False False
5 False False False
6 False True True
7 False False True
8 False False True
9 False False True
10 True False True
11 False False False
12 False False False
I wrote this code with a for loop and a "switch", but I'm pretty sure there is a faster and easier way to do the same thing for large dataframes. I appreciate your help.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [False,False,False,True,False,False,False,False,False,False,True,False,False],
    'B': [False,False,False,False,False,False,True,False,False,False,False,False,False]
})

df["C"] = False
switch = False
for i in df.index:
    if df.B.iloc[i]:
        switch = True
    if switch:
        df.loc[i, "C"] = True
    else:
        df.loc[i, "C"] = False
    if df.A.iloc[i]:
        switch = False
print(df)
Alternative approach using ffill
df.loc[df['A'], 'C'] = False
df.loc[df['B'], 'C'] = True
df['C'] = df['C'].ffill().fillna(False)  # start "off"
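To see why this works: before the ffill, C holds False at the A rows (3 and 10), True at the B row (6), and NaN everywhere else, so forward-filling carries the last on/off state down the frame. Roughly:
# after the two .loc assignments:
# C: [NaN, NaN, NaN, False, NaN, NaN, True, NaN, NaN, NaN, False, NaN, NaN]
# after ffill().fillna(False):
# C: [F, F, F, F, F, F, T, T, T, T, F, F, F]
Note that row 10 (where A turns on) ends up False with this variant, whereas the question's expected output keeps it True; whether the "off" row itself counts is a boundary choice.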
Combine the two columns, take the cumulative sum, subtract 1, then filter out negative and even numbers:
x = (df['A'] | df['B']).cumsum().sub(1)
df['C'] = (x >= 0) & (x % 2 == 1)
Output:
>>> df
A B C
0 False False False
1 False False False
2 False False False
3 True False False
4 False False False
5 False False False
6 False True True <
7 False False True <
8 False False True <
9 False False True <
10 True False False
11 False False False
12 False False False
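For reference, the intermediate x is:
x = (df['A'] | df['B']).cumsum().sub(1)
# x: [-1, -1, -1, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
so only the rows with x == 1 (between the B event at row 6 and the A event at row 10) pass both filters. Note this parity trick relies on the A and B events strictly alternating, starting with an A.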
I have a DataFrame that looks like the following:
df = pd.DataFrame({'a':[True]*5+[False]*5+[True]*5,'b':[False]+[True]*3+[False]+[True]*5+[False]*4+[True]})
a b
0 True False
1 True True
2 True True
3 True True
4 True False
5 False True
6 False True
7 False True
8 False True
9 False True
10 True False
11 True False
12 True False
13 True False
14 True True
How can I select blocks where column a is True only when the interior values of column b over the same rows are all True?
I know that I could break the DataFrame apart into consecutive True regions and apply a function to each chunk, but this is for a much larger problem with 10 million+ rows, and I don't think such a solution would scale up very well.
My expected output would be the following:
a b c
0 True False True
1 True True True
2 True True True
3 True True True
4 True False True
5 False True False
6 False True False
7 False True False
8 False True False
9 False True False
10 True False False
11 True False False
12 True False False
13 True False False
14 True True False
You can do a groupby on the a values and then look at the b values in a function, like this:
groupby_consec_a = df.groupby(df.a.diff().ne(0).cumsum())
all_interior = lambda x: x.iloc[1:-1].all()
df['c'] = df.a & groupby_consec_a.b.transform(all_interior)
Try out whether it's fast enough on your data. If not, the lambda will have to be replaced by pandas functions, but that will be more code.
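If the lambda does become a bottleneck, one possible fully vectorized sketch (my variant, not tested at 10M-row scale): a row is "interior" unless it is the first or last of its consecutive-a block, so a block qualifies when b holds on all of its interior rows:
g = df.a.ne(df.a.shift()).cumsum()    # ids of consecutive-a blocks
first = g.ne(g.shift())               # first row of each block
last = g.ne(g.shift(-1))              # last row of each block
interior_ok = df.b | first | last     # b only has to hold on interior rows
df['c'] = df.a & interior_ok.groupby(g).transform('all')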
I'm working with a Pandas DataFrame, and I would like to find an efficient way in which the second DataFrame below is created from the first.
import pandas as pd
data = {"column":[0,1,2,0,1,2,0]}
df = pd.DataFrame(data)
column
0
1
2
0
1
2
0
column0 column1 column2
true false false
false true false
false false true
true false false
false true false
false false true
true false false
This is a get_dummies problem, but you will additionally need to specify dtype=bool to get columns of bools:
pd.get_dummies(df['column'], dtype=bool)
0 1 2
0 True False False
1 False True False
2 False False True
3 True False False
4 False True False
5 False False True
6 True False False
pd.get_dummies(df['column'], dtype=bool).dtypes
0 bool
1 bool
2 bool
dtype: object
# carbon copy of expected output
import numpy as np

dummies = pd.get_dummies(df['column'], dtype=bool)
dummies[:] = np.where(dummies, 'true', 'false')  # reuse the mask instead of recomputing it
dummies.add_prefix('column')
column0 column1 column2
0 true false false
1 false true false
2 false false true
3 true false false
4 false true false
5 false false true
6 true false false
I also use get_dummies, as cs95 does. However, I use str.get_dummies, and I concatenate the word "column" in front before calling it. Finally, I replace 1/0 with 'true'/'false':
('column'+df.column.astype(str)).str.get_dummies().replace({1:'true', 0:'false'})
Out[2164]:
column0 column1 column2
0 true false false
1 false true false
2 false false true
3 true false false
4 false true false
5 false false true
6 true false false
factorize and slice assignment
i, u = pd.factorize(df.column)
a = np.empty((len(i), len(u)), '<U5')  # unicode strings up to 5 chars, enough for 'false'
a.fill('false')
a[np.arange(len(i)), i] = 'true'
pd.DataFrame(a).add_prefix('column')
column0 column1 column2
0 true false false
1 false true false
2 false false true
3 true false false
4 false true false
5 false false true
6 true false false
This is my dataframe, on which I want to use groupby:
Value Boolean1 Boolean2
5.175603 False False
5.415855 False False
5.046997 False False
4.607749 True False
5.140482 False False
1.796552 False False
0.139924 False True
4.157981 False True
4.893860 False False
5.091573 False False
6 True False
6.05 False False
I want to use groupby with the Boolean1 and Boolean2 columns. Each group runs over consecutive False rows until a True is found in either column; that row closes the group, and the next group again runs from False until the next True. If there are no more True values, the remaining False rows (and the values corresponding to them) can either be ignored or kept as a final group.
I want to achieve something similar to this:
Value Boolean1 Boolean2
This is one group
5.175603 False False
5.415855 False False
5.046997 False False
4.607749 True False
This is another one
5.140482 False False
1.796552 False False
0.139924 False True
4.157981 False True
And this is another one
4.893860 False False
5.091573 False False
6 True False
My idea is to check for False in both columns before at least one True:
# chain the conditions together with OR and invert
m = ~(df['Boolean1'] | df['Boolean2'])
# get consecutive groups, ANDing with m to count only the Trues
# (because of the inversion, True here means False in both columns)
s = (m.ne(m.shift()) & m).cumsum()
for i, x in df.groupby(s):
    print(x)
Value Boolean1 Boolean2
0 5.175603 False False
1 5.415855 False False
2 5.046997 False False
3 4.607749 True False
Value Boolean1 Boolean2
4 5.140482 False False
5 1.796552 False False
6 0.139924 False True
7 4.157981 False True
Value Boolean1 Boolean2
8 4.893860 False False
9 5.091573 False False
10 6.000000 True False
Value Boolean1 Boolean2
11 6.05 False False
Detail:
print (m)
0 True
1 True
2 True
3 False
4 True
5 True
6 False
7 False
8 True
9 True
10 False
11 True
dtype: bool
print (s)
0 1
1 1
2 1
3 1
4 2
5 2
6 2
7 2
8 3
9 3
10 3
11 4
dtype: int32
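If you want to drop a trailing all-False group (like group 4 above, which contains no True), one option is to keep only the groups where at least one of the Boolean columns is True somewhere:
# keep only groups containing at least one True in either column
has_true = (df['Boolean1'] | df['Boolean2']).groupby(s).transform('any')
for i, x in df[has_true].groupby(s[has_true]):
    print(x)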