How to sum appearances in columns with pandas without lambda - python

I have the following dataframe
a b c d e
0 0 0 -1 1 -1
1 0 1 -1 1 -1
2 -1 0 -1 1 1
3 -1 1 1 -1 1
4 1 0 1 -1 1
5 1 0 0 0 1
6 1 1 0 0 -1
7 1 1 -1 0 0
For each number that appears in a, b, c, d, e, I want to count how many times it appears in each row and store that count in a column, so the result should be something like this:
a b c d e Sum1 Sum0 Sum_1
0 0 0 -1 1 -1 1 2 2
1 0 1 -1 1 -1 2 1 2
2 -1 0 -1 1 1 2 1 2
3 -1 1 1 -1 1 3 0 2
4 1 0 1 -1 1 3 1 1
5 1 0 0 0 1 2 3 0
6 1 1 0 0 -1 2 2 1
7 1 1 -1 0 0 2 2 1
So in the first row, the number "1" appears once across a, b, c, d, e, so we store that in the Sum1 column. The number "0" appears twice, and we store that in Sum0, and "-1" appears twice, and we store it in Sum_1.
How could these columns be calculated without using lambda functions (to get better performance)? I guess NumPy is involved here, but I don't see how to do it.

Using get_dummies
df=df.astype(str)
pd.get_dummies(df.stack()).sum(level=0)
Out[667]:
-1 0 1
0 2 2 1
1 2 1 2
2 2 1 2
3 2 0 3
4 1 1 3
5 0 3 2
6 1 2 2
7 1 2 2
More info
pd.concat([df,pd.get_dummies(df.stack()).sum(level=0).add_prefix('Sum')],1)
Out[669]:
a b c d e Sum-1 Sum0 Sum1
0 0 0 -1 1 -1 2 2 1
1 0 1 -1 1 -1 2 1 2
2 -1 0 -1 1 1 2 1 2
3 -1 1 1 -1 1 2 0 3
4 1 0 1 -1 1 1 1 3
5 1 0 0 0 1 0 3 2
6 1 1 0 0 -1 1 2 2
7 1 1 -1 0 0 1 2 2
Another method that also solves it, without needing the conversion to str:
df.apply(lambda x : x.value_counts(),1).fillna(0)
Out[674]:
-1 0 1
0 2.0 2.0 1.0
1 2.0 1.0 2.0
2 2.0 1.0 2.0
3 2.0 0.0 3.0
4 1.0 1.0 3.0
5 0.0 3.0 2.0
6 1.0 2.0 2.0
7 1.0 2.0 2.0
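Note: Series.sum(level=...) was removed in pandas 2.0, so the stack/get_dummies code above fails on current versions. A minimal sketch of the same idea using the groupby equivalent (imports included for completeness):
import pandas as pd
# one-hot the stacked values, then collapse back to one count row per original row
counts = pd.get_dummies(df.astype(str).stack()).groupby(level=0).sum()
out = pd.concat([df, counts.add_prefix('Sum')], axis=1)
The lambda in the value_counts variant is also avoidable: df.apply(pd.Series.value_counts, axis=1).fillna(0) gives the same result.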

Use
In [62]: df.assign(**{'Sum{}'.format(v):df.eq(v).sum(1) for v in [1, 0, -1]})
Out[62]:
a b c d e Sum-1 Sum0 Sum1
0 0 0 -1 1 -1 2 2 1
1 0 1 -1 1 -1 2 1 2
2 -1 0 -1 1 1 2 1 2
3 -1 1 1 -1 1 2 0 3
4 1 0 1 -1 1 1 1 3
5 1 0 0 0 1 0 3 2
6 1 1 0 0 -1 1 2 2
7 1 1 -1 0 0 1 2 2
Same as
In [72]: df.join(pd.DataFrame({'Sum{}'.format(v):df.eq(v).sum(1) for v in [1, 0, -1]}))
Out[72]:
a b c d e Sum-1 Sum0 Sum1
0 0 0 -1 1 -1 2 2 1
1 0 1 -1 1 -1 2 1 2
2 -1 0 -1 1 1 2 1 2
3 -1 1 1 -1 1 2 0 3
4 1 0 1 -1 1 1 1 3
5 1 0 0 0 1 0 3 2
6 1 1 0 0 -1 1 2 2
7 1 1 -1 0 0 1 2 2
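A self-contained version of the same idea, with f-strings instead of str.format (behavior unchanged; the sample frame is rebuilt here for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(3, size=(8, 5)), columns=list('abcde')) - 1
# df.eq(v) gives a boolean mask; summing across axis=1 counts True as 1
out = df.assign(**{f'Sum{v}': df.eq(v).sum(axis=1) for v in (1, 0, -1)})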

Create boolean masks and count them; True values are processed like 1:
m1 = df == 1
m0 = df == 0
m_1 = df == -1
df['Sum1'] = m1.sum(1)
df['Sum0'] = m0.sum(1)
df['Sum_1'] = m_1.sum(1)
print (df)
a b c d e Sum1 Sum0 Sum_1
0 0 0 -1 1 -1 1 2 2
1 0 1 -1 1 -1 2 1 2
2 -1 0 -1 1 1 2 1 2
3 -1 1 1 -1 1 3 0 2
4 1 0 1 -1 1 3 1 1
5 1 0 0 0 1 2 3 0
6 1 1 0 0 -1 2 2 1
7 1 1 -1 0 0 2 2 1
General solution with get_dummies:
df1 = (pd.get_dummies(df.astype(str), prefix='', prefix_sep='')
.sum(level=0, axis=1)
.add_prefix('Sum'))
print (df1)
Sum-1 Sum0 Sum1
0 2 2 1
1 2 1 2
2 2 1 2
3 2 0 3
4 1 1 3
5 0 3 2
6 1 2 2
7 1 2 2
df = df.join(df1)
print (df)
a b c d e Sum-1 Sum0 Sum1
0 0 0 -1 1 -1 2 2 1
1 0 1 -1 1 -1 2 1 2
2 -1 0 -1 1 1 2 1 2
3 -1 1 1 -1 1 2 0 3
4 1 0 1 -1 1 1 1 3
5 1 0 0 0 1 0 3 2
6 1 1 0 0 -1 1 2 2
7 1 1 -1 0 0 1 2 2
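Note: sum(level=..., axis=1) was removed in pandas 2.0. A sketch of the same aggregation over the duplicated column labels on current versions, via transpose and groupby:
dummies = pd.get_dummies(df.astype(str), prefix='', prefix_sep='')
# group the duplicated '-1', '0', '1' column labels and sum them
df1 = dummies.T.groupby(level=0).sum().T.add_prefix('Sum')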
Idea for better performance of Zero's solution: compare against the underlying NumPy array, and instead of static values it is possible to use the unique values from numpy.unique:
all_vals = np.unique(df.values)
arr = df.values
df1 = df.join(pd.DataFrame({'Sum{}'.format(v):(arr == v).sum(1) for v in all_vals}))
print (df1)
a b c d e Sum-1 Sum0 Sum1
0 0 0 -1 1 -1 2 2 1
1 0 1 -1 1 -1 2 1 2
2 -1 0 -1 1 1 2 1 2
3 -1 1 1 -1 1 2 0 3
4 1 0 1 -1 1 1 1 3
5 1 0 0 0 1 0 3 2
6 1 1 0 0 -1 1 2 2
7 1 1 -1 0 0 1 2 2
Timings
np.random.seed(234)
N = 100000
df = pd.DataFrame(np.random.randint(3, size=(N,5)), columns=list('abcde')) - 1
print (df)
#wen's solution 1
In [49]: %timeit pd.concat([df,pd.get_dummies(df.astype(str).stack()).sum(level=0).add_prefix('Sum')],1)
1 loop, best of 3: 2.21 s per loop
#wen's solution 2
In [56]: %timeit df.apply(lambda x : x.value_counts(),1).fillna(0)
1 loop, best of 3: 1min 35s per loop
#jezrael's solution 2
In [50]: %timeit df.join((pd.get_dummies(df.astype(str), prefix='', prefix_sep='').sum(level=0, axis=1).add_prefix('Sum')))
1 loop, best of 3: 2.14 s per loop
#jezrael's solution 1
In [55]: %%timeit
...: m1 = df == 1
...: m0 = df == 0
...: m_1 = df == -1
...: df['Sum1'] = m1.sum(1)
...: df['Sum0'] = m0.sum(1)
...: df['Sum_1'] = m_1.sum(1)
...:
10 loops, best of 3: 50.6 ms per loop
#zero's solution1
In [51]: %timeit df.assign(**{'Sum{}'.format(v):df.eq(v).sum(1) for v in [1, 0, -1]})
10 loops, best of 3: 39.8 ms per loop
#zero's solution2
In [52]: %timeit df.join(pd.DataFrame({'Sum{}'.format(v):df.eq(v).sum(1) for v in [1, 0, -1]}))
10 loops, best of 3: 39.6 ms per loop
#zero&jezrael's solution1
In [53]: %timeit df.join(pd.DataFrame({'Sum{}'.format(v):(df.values == v).sum(1) for v in np.unique(df.values)}))
10 loops, best of 3: 23.8 ms per loop
#zero&jezrael's solution2
In [54]: %timeit df.join(pd.DataFrame({'Sum{}'.format(v):(df.values == v).sum(1) for v in [0, 1, -1]}))
100 loops, best of 3: 12.8 ms per loop
# if there are many columns and more unique values, it is possible to convert to a numpy array outside the loop
def f1(df):
    all_vals = np.unique(df.values)
    arr = df.values
    return df.join(pd.DataFrame({'Sum{}'.format(v): (arr == v).sum(1) for v in all_vals}))

def f2(df):
    arr = df.values
    return df.join(pd.DataFrame({'Sum{}'.format(v): (arr == v).sum(1) for v in [0, 1, -1]}))
print (f1(df))
print (f2(df))
#zero&jezrael's solution3
In [58]: %timeit (f1(df))
10 loops, best of 3: 25.8 ms per loop
#zero&jezrael's solution4
In [59]: %timeit (f2(df))
100 loops, best of 3: 13 ms per loop
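For readers outside IPython, a rough equivalent of these benchmarks with the standard timeit module (a sketch; absolute numbers will vary by machine and pandas version):
import timeit
import numpy as np
import pandas as pd

np.random.seed(234)
df = pd.DataFrame(np.random.randint(3, size=(100000, 5)), columns=list('abcde')) - 1

stmt = "df.join(pd.DataFrame({'Sum{}'.format(v): (df.values == v).sum(1) for v in [0, 1, -1]}))"
# average seconds per run over 100 runs
print(timeit.timeit(stmt, globals=globals(), number=100) / 100)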

Related

How to count consecutive same values in a pythonic way that looks iterative

I am trying to count the number of consecutive identical values in a dataframe and put that information into a new column, but I want the count to look iterative.
Here is what I have so far:
df = pd.DataFrame(np.random.randint(0,3, size=(15,4)), columns=list('ABCD'))
df['subgroupA'] = (df.A != df.A.shift(1)).cumsum()
dfg = df.groupby(by='subgroupA', as_index=False).apply(lambda grp: len(grp))
dfg.rename(columns={None: 'numConsec'}, inplace=True)
df = df.merge(dfg, how='left', on='subgroupA')
df
Here is the result:
A B C D subgroupA numConsec
0 2 1 1 1 1 1
1 1 2 1 0 2 2
2 1 0 2 1 2 2
3 0 1 2 0 3 1
4 1 0 0 1 4 1
5 0 2 2 1 5 2
6 0 2 1 1 5 2
7 1 0 0 1 6 1
8 0 2 0 0 7 4
9 0 0 0 2 7 4
10 0 2 1 1 7 4
11 0 2 2 0 7 4
12 1 2 0 1 8 1
13 0 1 1 0 9 1
14 1 1 1 0 10 1
The problem is that in the numConsec column I don't want the full group count on every row; I want it to reflect a running count as you move down the dataframe. My dataframe is too large to loop through iteratively to build the counts, as that would be too slow. I need to do it in a pythonic way and make it look like this:
A B C D subgroupA numConsec
0 2 1 1 1 1 1
1 1 2 1 0 2 1
2 1 0 2 1 2 2
3 0 1 2 0 3 1
4 1 0 0 1 4 1
5 0 2 2 1 5 1
6 0 2 1 1 5 2
7 1 0 0 1 6 1
8 0 2 0 0 7 1
9 0 0 0 2 7 2
10 0 2 1 1 7 3
11 0 2 2 0 7 4
12 1 2 0 1 8 1
13 0 1 1 0 9 1
14 1 1 1 0 10 1
Any ideas?
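A minimal sketch of one vectorized approach: groupby().cumcount() gives a running position within each run, which replaces the groupby/merge step above:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 3, size=(15, 4)), columns=list('ABCD'))
# label each run of consecutive equal values in A
df['subgroupA'] = (df.A != df.A.shift(1)).cumsum()
# running 1-based position within each run
df['numConsec'] = df.groupby('subgroupA').cumcount() + 1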

How to one-hot-encode a matrix of sentences at the character level?

There is a dataframe:
0 1 2 3
0 a c e NaN
1 b d NaN NaN
2 b c NaN NaN
3 a b c d
4 a b NaN NaN
5 b c NaN NaN
6 a b NaN NaN
7 a b c e
8 a b c NaN
9 a c e NaN
I would like to transform it with one-hot encoding like this:
a c e b d
0 1 1 1 0 0
1 0 0 0 1 1
2 0 1 0 1 0
3 1 1 0 1 1
4 1 0 0 1 0
5 0 1 0 1 0
6 1 0 0 1 0
7 1 1 1 1 0
8 1 1 0 1 0
9 1 1 1 0 0
pd.get_dummies does not work here, because it actually encodes each column independently. How can I get this? By the way, the order of the columns doesn't matter.
Try this:
df.stack().str.get_dummies().max(level=0)
Out[129]:
a b c d e
0 1 0 1 0 1
1 0 1 0 1 0
2 0 1 1 0 0
3 1 1 1 1 0
4 1 1 0 0 0
5 0 1 1 0 0
6 1 1 0 0 0
7 1 1 1 0 1
8 1 1 1 0 0
9 1 0 1 0 1
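Note: max(level=0) was removed in pandas 2.0; on current versions the equivalent (a sketch) is:
df.stack().str.get_dummies().groupby(level=0).max()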
One way, using str.join and str.get_dummies:
one_hot = df.apply(lambda x: "|".join([i for i in x if pd.notna(i)]), axis=1).str.get_dummies()
print(one_hot)
Output:
a b c d e
0 1 0 1 0 1
1 0 1 0 1 0
2 0 1 1 0 0
3 1 1 1 1 0
4 1 1 0 0 0
5 0 1 1 0 0
6 1 1 0 0 0
7 1 1 1 0 1
8 1 1 1 0 0
9 1 0 1 0 1
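The NaN filtering can also be written with Series.dropna (a small sketch, same output):
one_hot = df.apply(lambda x: "|".join(x.dropna()), axis=1).str.get_dummies()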

Create Duplicate Rows and Change Values in Specific Columns

How can I create x duplicates of a row in the dataframe and change one or more variables in specific columns? The rows are then added to the end of the same dataframe.
A B C D E F
0 1 1 0 1 1 0
1 2 2 1 1 1 0
2 2 2 1 1 1 0
3 2 2 1 1 1 0
4 1 1 0 1 1 0 <- Create 25 Duplicates of this row (4) and change variable C to 1
5 1 1 0 1 1 0
6 2 2 1 1 1 0
7 2 2 1 1 1 0
8 2 2 1 1 1 0
9 1 1 0 1 1 0
I repeat only 10 times to keep the length of the result reasonable.
# the 10 in df.loc[[4] * 10] below is the number of repeats
df.append(df.loc[[4] * 10].assign(C=1), ignore_index=True)
A B C D E F
0 1 1 0 1 1 0
1 2 2 1 1 1 0
2 2 2 1 1 1 0
3 2 2 1 1 1 0
4 1 1 0 1 1 0
5 1 1 0 1 1 0
6 2 2 1 1 1 0
7 2 2 1 1 1 0
8 2 2 1 1 1 0
9 1 1 0 1 1 0
10 1 1 1 1 1 0
11 1 1 1 1 1 0
12 1 1 1 1 1 0
13 1 1 1 1 1 0
14 1 1 1 1 1 0
15 1 1 1 1 1 0
16 1 1 1 1 1 0
17 1 1 1 1 1 0
18 1 1 1 1 1 0
19 1 1 1 1 1 0
Per comments, try:
df.append(df.loc[[4] * 10].assign(**{'C': 1}), ignore_index=True)
I am using repeat and reindex:
s = df.iloc[[4]]                   # pick the row you want to repeat
s = s.reindex(s.index.repeat(25))  # repeat the row the given number of times
# s = pd.DataFrame([df.iloc[4].tolist()] * 25)  # use this line instead if you need more speed
s.loc[:, 'C'] = 1                  # change the value
pd.concat([df, s])                 # append to the original df
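Note that DataFrame.append was removed in pandas 2.0; a sketch of the first answer's idea with pd.concat instead, using the 25 copies from the question:
import pandas as pd
# build 25 copies of row 4 with C set to 1, then append them to the original
dupes = df.loc[[4] * 25].assign(C=1)
df = pd.concat([df, dupes], ignore_index=True)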

Python append dataframe such that only columns remain the same

I have the following dataframes in python pandas:
A:
1 2 3 4 5 6 7 8 9 10
0 1 1 1 1 1 1 1 0 0 1 1
B:
1 2 3 4 5 6 7 8 9 10
1 0 1 1 1 1 1 1 0 0 1 0
C:
1 2 3 4 5 6 7 8 9 10
2 0 1 1 1 0 0 0 0 0 1 0
I want to concatenate them together so that the column titles remain the same while the row indices and values get appended, so the new dataframe is:
df:
1 2 3 4 5 6 7 8 9 10
0 1 1 1 1 1 1 1 0 0 1 1
1 0 1 1 1 1 1 1 0 0 1 0
2 0 1 1 1 0 0 0 0 0 1 0
I have tried using append and concat, but neither produces the output I am trying to achieve. Any suggestions?
Here is what I tried:
df = pd.concat([df,pd.concat([A,B,C], ignore_index=True)], axis=1)
This is a plain vanilla concat
pd.concat([A, B, C])
1 2 3 4 5 6 7 8 9 10
0 1 1 1 1 1 1 1 0 0 1 1
1 0 1 1 1 1 1 1 0 0 1 0
2 0 1 1 1 0 0 0 0 0 1 0
A simple pd.concat will do the job; you overcomplicated the task a little bit:
pd.concat([A,B,C], axis=0, ignore_index=True)

Drop columns with more than 70% zeros

I would like to know if there is a command that drops columns that have more than 70% zeros, or X% zeros, like:
df = df.loc[:, df.isnull().mean() < .7]
which works for NaN.
Thank you!
Just change df.isnull().mean() to (df==0).mean():
df = df.loc[:, (df==0).mean() < .7]
Here's a demo:
df
Out:
0 1 2 3 4
0 1 1 1 1 0
1 1 0 0 0 1
2 0 1 1 0 0
3 1 0 0 1 0
4 1 1 1 1 1
5 1 0 0 0 0
6 0 1 0 0 0
7 0 1 1 0 0
8 1 0 0 1 0
9 0 0 0 1 0
(df==0).mean()
Out:
0 0.4
1 0.5
2 0.6
3 0.5
4 0.8
dtype: float64
df.loc[:, (df==0).mean() < .7]
Out:
0 1 2 3
0 1 1 1 1
1 1 0 0 0
2 0 1 1 0
3 1 0 0 1
4 1 1 1 1
5 1 0 0 0
6 0 1 0 0
7 0 1 1 0
8 1 0 0 1
9 0 0 0 1
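To generalize the X% part, the check can be wrapped in a small helper (a sketch; the function name and threshold parameter are illustrative, not from the answer):
import pandas as pd

def drop_mostly_zero(df, threshold=0.7):
    # per-column fraction of zeros; keep only columns below the threshold
    zero_frac = (df == 0).mean()
    return df.loc[:, zero_frac < threshold]

# usage: keep only columns with fewer than 70% zeros
df = drop_mostly_zero(df, threshold=0.7)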
