How to one-hot-encode matrix of sentences at the character level? - python

There is a dataframe:
0 1 2 3
0 a c e NaN
1 b d NaN NaN
2 b c NaN NaN
3 a b c d
4 a b NaN NaN
5 b c NaN NaN
6 a b NaN NaN
7 a b c e
8 a b c NaN
9 a c e NaN
I would like to one-hot encode it like this:
a c e b d
0 1 1 1 0 0
1 0 0 0 1 1
2 0 1 0 1 0
3 1 1 0 1 1
4 1 0 0 1 0
5 0 1 0 1 0
6 1 0 0 1 0
7 1 1 1 1 0
8 1 1 0 1 0
9 1 1 1 0 0
pd.get_dummies does not work here, because it actually encodes each column independently. How can I get this? By the way, the order of the columns doesn't matter.

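Try this (the level argument of max was removed in pandas 2.0, so group on the index level instead):
df.stack().str.get_dummies().groupby(level=0).max()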
Out[129]:
a b c d e
0 1 0 1 0 1
1 0 1 0 1 0
2 0 1 1 0 0
3 1 1 1 1 0
4 1 1 0 0 0
5 0 1 1 0 0
6 1 1 0 0 0
7 1 1 1 0 1
8 1 1 1 0 0
9 1 0 1 0 1

One way using str.join and str.get_dummies:
one_hot = df.apply(lambda x: "|".join([i for i in x if pd.notna(i)]), axis=1).str.get_dummies()
print(one_hot)
Output:
a b c d e
0 1 0 1 0 1
1 0 1 0 1 0
2 0 1 1 0 0
3 1 1 1 1 0
4 1 1 0 0 0
5 0 1 1 0 0
6 1 1 0 0 0
7 1 1 1 0 1
8 1 1 1 0 0
9 1 0 1 0 1
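For anyone who wants to reproduce the answers above, here is a minimal self-contained sketch. It rebuilds the question's frame from ragged lists (the `rows` variable is only an illustration, not part of the original post) and uses the `groupby(level=0).max()` form, which should work on both older and newer pandas versions.

```python
import pandas as pd

# Rebuild the frame from the question; shorter rows are padded with missing values.
rows = [
    ["a", "c", "e"], ["b", "d"], ["b", "c"], ["a", "b", "c", "d"], ["a", "b"],
    ["b", "c"], ["a", "b"], ["a", "b", "c", "e"], ["a", "b", "c"], ["a", "c", "e"],
]
df = pd.DataFrame(rows)

# stack() turns every cell into one long Series indexed by (row, column),
# get_dummies() one-hot encodes each cell, and the groupby collapses the
# cells that belong to the same original row.
one_hot = df.stack().str.get_dummies().groupby(level=0).max()
print(one_hot)
```

The columns come out sorted alphabetically, which is fine here since the question says column order does not matter.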

Related

Is there any way to convert the columns of a pandas DataFrame into their mirror image?

The df I have is:
0 1 2
0 0 0 0
1 0 0 1
2 0 1 0
3 0 1 1
4 1 0 0
5 1 0 1
6 1 1 0
7 1 1 1
I want to obtain a DataFrame with the columns reversed (a mirror image):
0 1 2
0 0 0 0
1 1 0 0
2 0 1 0
3 1 1 0
4 0 0 1
5 1 0 1
6 0 1 1
7 1 1 1
Is there any way to do that?
You can try:
df[:] = df.iloc[:,::-1]
df
Out[959]:
0 1 2
0 0 0 0
1 1 0 0
2 0 1 0
3 1 1 0
4 0 0 1
5 1 0 1
6 0 1 1
7 1 1 1
Here is a slightly more verbose but likely more efficient solution, since it doesn't need to rewrite the data. It only renames and reorders the columns:
cols = df.columns
df.columns = df.columns[::-1]
df = df.loc[:,cols]
Or a shorter variant:
df = df.iloc[:,::-1].set_axis(df.columns, axis=1)
Output:
0 1 2
0 0 0 0
1 1 0 0
2 0 1 0
3 1 1 0
4 0 0 1
5 1 0 1
6 0 1 1
7 1 1 1
There are other ways, but here's one solution:
df[df.columns] = df[reversed(df.columns)]
Output:
0 1 2
0 0 0 0
1 1 0 0
2 0 1 0
3 1 1 0
4 0 0 1
5 1 0 1
6 0 1 1
7 1 1 1
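As a quick check of the approaches above, the sketch below rebuilds the 3-bit table with `itertools.product` (my own shortcut, not from the original post) and applies the non-mutating `set_axis` variant. It also shows a NumPy-based equivalent, which assumes all columns share one dtype.

```python
import itertools

import pandas as pd

# All 3-bit combinations in counting order, same as the question's frame.
df = pd.DataFrame(list(itertools.product([0, 1], repeat=3)))

# Reverse the column order positionally, then put the original labels back.
mirrored = df.iloc[:, ::-1].set_axis(df.columns, axis=1)

# Equivalent via the underlying array (assumes a single, homogeneous dtype).
mirrored_np = pd.DataFrame(
    df.to_numpy()[:, ::-1], index=df.index, columns=df.columns
)

print(mirrored)
```

Unlike the in-place assignments, both variants leave the original `df` untouched.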

Fill column with nan if sum of multiple columns is 0

Task
I have a df where I compute some ratios that are grouped by date and id. I want to fill column c with NaN if the sum of a and b is 0. Any help would be awesome!!
df
date id a b c
0 2001-09-06 1 3 1 1
1 2001-09-07 1 3 1 1
2 2001-09-08 1 4 0 1
3 2001-09-09 2 6 0 1
4 2001-09-10 2 0 0 2
5 2001-09-11 1 0 0 2
6 2001-09-12 2 1 1 2
7 2001-09-13 2 0 0 2
8 2001-09-14 1 0 0 2
Try this:
df['new_c'] = df.c.where(df[['a','b']].sum(1).ne(0))
Out[75]:
date id a b c new_c
0 2001-09-06 1 3 1 1 1.0
1 2001-09-07 1 3 1 1 1.0
2 2001-09-08 1 4 0 1 1.0
3 2001-09-09 2 6 0 1 1.0
4 2001-09-10 2 0 0 2 NaN
5 2001-09-11 1 0 0 2 NaN
6 2001-09-12 2 1 1 2 2.0
7 2001-09-13 2 0 0 2 NaN
8 2001-09-14 1 0 0 2 NaN
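The same idea can also be written with `Series.mask`, which blanks values where the condition is true (the mirror of `where`). This is a minimal sketch that keeps only columns a, b and c from the question; the name `zero_sum` is just an illustration.

```python
import pandas as pd

# Columns a, b, c from the question (date and id omitted for brevity).
df = pd.DataFrame({
    "a": [3, 3, 4, 6, 0, 0, 1, 0, 0],
    "b": [1, 1, 0, 0, 0, 0, 1, 0, 0],
    "c": [1, 1, 1, 1, 2, 2, 2, 2, 2],
})

# True on the rows where a + b equals zero.
zero_sum = df[["a", "b"]].sum(axis=1).eq(0)

# mask() replaces the flagged rows with NaN; the column becomes float.
df["new_c"] = df["c"].mask(zero_sum)
print(df)
```

Note that a sum of zero is not the same as both values being zero if negative numbers can occur; in that case a check such as `df[["a", "b"]].eq(0).all(axis=1)` may be closer to the intent.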
A slower, more explicit alternative is to copy the dataframe and loop over its rows (the vectorized where above is preferable for large data):
import numpy as np
new_df = df.copy()
new_df['c'] = new_df['c'].astype(float)  # NaN needs a float column
for i, row in df.iterrows():
    # blank out c when a and b sum to zero
    if row['a'] + row['b'] == 0:
        new_df.loc[i, 'c'] = np.nan

Python append dataframe such that only columns remain the same

I have the following dataframes in python pandas:
A:
1 2 3 4 5 6 7 8 9 10
0 1 1 1 1 1 1 1 0 0 1 1
B:
1 2 3 4 5 6 7 8 9 10
1 0 1 1 1 1 1 1 0 0 1 0
C:
1 2 3 4 5 6 7 8 9 10
2 0 1 1 1 0 0 0 0 0 1 0
I want to concatenate them together such that the column titles remain the same while row index and values get appended so the new dataframe is:
df:
1 2 3 4 5 6 7 8 9 10
0 1 1 1 1 1 1 1 0 0 1 1
1 0 1 1 1 1 1 1 0 0 1 0
2 0 1 1 1 0 0 0 0 0 1 0
I have tried using append and concat, but neither seems to produce the output I am trying to achieve. Any suggestions?
Here is what I tried:
df = pd.concat([df,pd.concat([A,B,C], ignore_index=True)], axis=1)
This is a plain vanilla concat:
pd.concat([A, B, C])
1 2 3 4 5 6 7 8 9 10
0 1 1 1 1 1 1 1 0 0 1 1
1 0 1 1 1 1 1 1 0 0 1 0
2 0 1 1 1 0 0 0 0 0 1 0
A simple pd.concat will do the job; you overcomplicated the task a little bit:
pd.concat([A,B,C], axis=0, ignore_index=True)
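The difference between the two answers only shows up when the row labels collide. The sketch below uses three hypothetical one-row frames that all carry the default index 0 (unlike the question, where the indexes are already 0, 1 and 2) to show what `ignore_index=True` changes.

```python
import pandas as pd

# Hypothetical one-row frames with identical columns and the same index label 0.
cols = [1, 2, 3]
A = pd.DataFrame([[1, 1, 0]], columns=cols)
B = pd.DataFrame([[0, 1, 1]], columns=cols)
C = pd.DataFrame([[0, 0, 1]], columns=cols)

# Default: rows are stacked and the original row labels are kept (0, 0, 0).
kept = pd.concat([A, B, C])

# ignore_index=True discards the old labels and renumbers the rows 0..n-1.
renumbered = pd.concat([A, B, C], ignore_index=True)

print(kept)
print(renumbered)
```

If the input frames already have distinct row labels, as in the question, both calls give the same table.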

Sequence number groupby ID with reset

I'm looking for a way to generate a sequence of numbers that resets on every break.
Example
ID VAR
A 0
A 0
A 1
A 1
A 0
A 0
A 1
A 1
B 1
B 1
B 1
B 0
B 0
B 0
B 0
Each time VAR is 1 and ID is the same as in the previous row, the counter increases.
If ID is not the same or VAR is 0, the counter starts again from 0.
Desired output
ID VAR DESIRED
A 0 0
A 0 0
A 1 1
A 1 2
A 0 0
A 0 0
A 1 1
A 1 2
B 1 1
B 1 2
B 1 3
B 0 0
B 0 0
B 0 0
B 0 0
You can create an intermediate index, then group by this index and ID and take the cumulative sum of VAR:
df['ix'] = df['VAR'].diff().fillna(0).abs().cumsum()
df['DESIRED'] = df.groupby(['ID','ix'])['VAR'].cumsum()
In [21]: df
Out[21]:
ID VAR ix DESIRED
0 A 0 0 0
1 A 0 0 0
2 A 1 1 1
3 A 1 1 2
4 A 0 2 0
5 A 0 2 0
6 A 1 3 1
7 A 1 3 2
8 B 1 3 1
9 B 1 3 2
10 B 1 3 3
11 B 0 4 0
12 B 0 4 0
13 B 0 4 0
14 B 0 4 0
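A variation on the same idea, sketched under the assumption that VAR only contains 0 and 1: start a new run whenever either ID or VAR differs from the previous row, and take the cumulative sum within each run. The names `new_run` and `run_id` are illustrative and no helper column is added to the frame.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["A"] * 8 + ["B"] * 7,
    "VAR": [0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0],
})

# A run starts on the first row and wherever ID or VAR changes.
new_run = (df["ID"] != df["ID"].shift()) | (df["VAR"] != df["VAR"].shift())

# Cumulative sum of the flags gives every run its own integer label.
run_id = new_run.cumsum()

# Within a run of ones the cumulative sum counts 1, 2, 3, ...;
# within a run of zeros it stays at 0.
df["DESIRED"] = df.groupby(run_id)["VAR"].cumsum()
print(df)
```

Because the ID change is part of the run boundary, the counter restarts at the A to B transition even though VAR stays at 1 there.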

Python - Pandas - create "first fail" column from other column data

I have a data frame that represents fail-data for a series of parts, showing which of 3 tests (A, B, C) pass (0) or fail (1).
A B C
1 0 1 1
2 0 0 0
3 1 0 0
4 0 0 1
5 0 0 0
6 0 1 0
7 1 1 0
8 1 1 1
I'd like to add a final column to the dataframe showing the First Fail (FF) of each part, or a default (P) if no fails.
A B C | FF
1 0 1 1 | B
2 0 0 0 | P
3 1 0 0 | A
4 0 0 1 | C
5 0 0 0 | P
6 0 1 0 | B
7 1 1 0 | A
8 1 1 1 | A
Is there an easy way to do this in pandas? Does it require iterating over each row?
Maybe:
>>> df['FF'] = df.dot(df.columns).str.slice(0, 1).replace('', 'P')
>>> df
A B C FF
1 0 1 1 B
2 0 0 0 P
3 1 0 0 A
4 0 0 1 C
5 0 0 0 P
6 0 1 0 B
7 1 1 0 A
8 1 1 1 A
Alternatively:
>>> df['FF'] = np.where(df.any(axis=1), df.idxmax(axis=1), 'P')
>>> df
A B C FF
1 0 1 1 B
2 0 0 0 P
3 1 0 0 A
4 0 0 1 C
5 0 0 0 P
6 0 1 0 B
7 1 1 0 A
8 1 1 1 A
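Both snippets above operate on the whole frame, so running one after the other would include the new FF column in the calculation. A minimal sketch that restricts the work to the test columns (the `tests` list is my own addition) and stays within pandas:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [0, 0, 1, 0, 0, 0, 1, 1],
        "B": [1, 0, 0, 0, 0, 1, 1, 1],
        "C": [1, 0, 0, 1, 0, 0, 0, 1],
    },
    index=range(1, 9),
)

# Only these columns take part, so the code can be re-run after FF exists.
tests = ["A", "B", "C"]

# idxmax returns the first column holding the row maximum, i.e. the first 1.
# For all-zero rows it would return "A", so those rows are replaced with "P".
df["FF"] = df[tests].idxmax(axis=1).where(df[tests].any(axis=1), "P")
print(df)
```

This relies on the columns being ordered in test order, since "first" here means leftmost.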
