I have a large dataframe ('data') made up of one column. Each row in the column holds a string, and each string is made up of comma-separated categories. I wish to one-hot encode this data.
For example,
data = {"mesh": ["A, B, C", "C,B", ""]}
From this I would like to get a dataframe consisting of:
index A B C
0 1 1 1
1 0 1 1
2 0 0 0
How can I do this?
Note that you're not dealing with one-hot encodings (OHEs) in the strict sense here, since a single row can contain more than one category.
str.split + stack + get_dummies + sum
import pandas as pd

df = pd.DataFrame(data)
df
mesh
0 A, B, C
1 C,B
2
(df.mesh.str.split(r'\s*,\s*', expand=True)
   .stack()
   .str.get_dummies()
   .sum(level=0))
A B C
0 1 1 1
1 0 1 1
2 0 0 0
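On newer pandas releases, sum(level=0) has been deprecated and then removed; a minimal sketch of the same method using groupby(level=0) instead (same data, same result):

import pandas as pd

data = {"mesh": ["A, B, C", "C,B", ""]}
df = pd.DataFrame(data)

# one row per (original row, token), then collapse back to one row per original index
(df.mesh.str.split(r'\s*,\s*', expand=True)
   .stack()
   .str.get_dummies()
   .groupby(level=0).sum())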
apply + value_counts
(df.mesh.str.split(r'\s*,\s*', expand=True)
   .apply(pd.Series.value_counts, 1)   # count each row's values (axis=1)
   .iloc[:, 1:]                        # drop the empty-string column produced by the blank row
   .fillna(0, downcast='infer'))
A B C
0 1 1 1
1 0 1 1
2 0 0 0
pd.crosstab
x = df.mesh.str.split(r'\s*,\s*', expand=True).stack()
pd.crosstab(x.index.get_level_values(0), x.values).iloc[:, 1:]  # iloc drops the empty-string column
col_0 A B C
row_0
0 1 1 1
1 0 1 1
2 0 0 0
I figured there is a simpler answer, or at least one that feels simpler than chaining several operations.
Make sure the column's categories are separated by commas, with no stray spaces.
Use the built-in sep parameter of str.get_dummies to specify the comma as the separator; the default separator is the pipe character (|).
data = {"mesh": ["A, B, C", "C,B", ""]}
sof_df = pd.DataFrame(data)
sof_df.mesh = sof_df.mesh.str.replace(' ', '')  # remove the stray spaces so the tags match
sof_df.mesh.str.get_dummies(sep=',')
OUTPUT:
A B C
0 1 1 1
1 0 1 1
2 0 0 0
If the categories are controlled (you know how many there are and what they are), the best answer is the one by @Tejeshar Gurram. But what if you have lots of potential categories and you are not interested in all of them? Say:
import numpy as np
import pandas as pd

s = pd.Series(['A,B,C,', 'B,C,D', np.nan, 'X,W,Z'])
0 A,B,C,
1 B,C,D
2 NaN
3 X,W,Z
dtype: object
If you are only interested in categories B and C for the final df of dummies, I've found this workaround does the job:
cat_list = ['B', 'C']
# the == True comparison maps the NaN row to False before casting to int
list_of_lists = [(s.str.contains(cat_, regex=False) == True).astype(int).to_list()
                 for cat_ in cat_list]
data = {k: v for k, v in zip(cat_list, list_of_lists)}
pd.DataFrame(data)
B C
0 1 1
1 1 1
2 0 0
3 0 0
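A possibly simpler equivalent (my own sketch, not part of the answer above) is to build the full dummy frame with str.get_dummies and then keep only the categories of interest:

cat_list = ['B', 'C']
# NaN rows become all-zero rows; a category missing from the data would get a zero column via fill_value
s.str.get_dummies(sep=',').reindex(columns=cat_list, fill_value=0)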
Related
I have the following data:
movies.head()
and would like to create a categorical matrix based on its genres.
The final result should look like this:
I know how to do it using a SLOW way, which is:
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))
genres = pd.unique(all_genres)
genres
Output is:
array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
'Western'], dtype=object)
Creating a zero matrix and renaming its column to be the genres:
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)
dummies.head()
Output is:
Converting movies.genres into categorical matrix:
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1
movies_windic = movies.join(dummies.add_prefix('Genre'))
movies_windic.iloc[0:2]
Output is:
The above code is copied from the book Python for Data Analysis, 2nd edition, pages 213-214.
What irritates me is the warning under the code regarding its performance, which is
For much larger data, this method of constructing indicator variables with multiple membership is not especially speedy. It would be better to write a lower-level function that writes directly to a NumPy array, and then wrap the result in a DataFrame.
Could someone give me a pointer how to do it with a lower-level function so that it could work faster?
Thank you in advance.
Let's set up some sample data:
import pandas as pd
df = pd.DataFrame({"Movie_number": [1, 2, 3, 4, 5], "genres": ["A|B|C", "B", "B|C", "C", "A|C"]})
print(df)
Movie_number genres
0 1 A|B|C
1 2 B
2 3 B|C
3 4 C
4 5 A|C
I've managed to come up with this horrible solution:
newdf = pd.concat([df, pd.get_dummies(df['genres'].str.split('|').explode(), prefix="genre")], axis=1).groupby(["Movie_number", "genres"]).sum().reset_index()
print(newdf)
Movie_number genres genre_A genre_B genre_C
0 1 A|B|C 1 1 1
1 2 B 0 1 0
2 3 B|C 0 1 1
3 4 C 0 0 1
4 5 A|C 1 0 1
Explanation:
First we explode our "genres" column based on | separator:
>>> df['genres'].str.split('|').explode()
0 A
0 B
0 C
1 B
2 B
2 C
3 C
4 A
4 C
Name: genres, dtype: object
Then we convert these into indicator variables with pd.get_dummies:
>>> pd.get_dummies(df['genres'].str.split('|').explode(), prefix="genre")
genre_A genre_B genre_C
0 1 0 0
0 0 1 0
0 0 0 1
1 0 1 0
2 0 1 0
2 0 0 1
3 0 0 1
4 1 0 0
4 0 0 1
After that we concatenate it with the original dataframe, then finally we merge the rows with groupby and sum.
>>> pd.concat([df, pd.get_dummies(df['genres'].str.split('|').explode(), prefix="genre")],axis=1).groupby(["Movie_number", "genres"]).sum().reset_index()
Movie_number genres genre_A genre_B genre_C
0 1 A|B|C 1 1 1
1 2 B 0 1 0
2 3 B|C 0 1 1
3 4 C 0 0 1
4 5 A|C 1 0 1
Although it's not exactly low-level, I think it is definitely faster than looping row by row.
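For the lower-level function that writes directly to a NumPy array, as the book suggests, here is a rough sketch (the genre_dummies helper and its structure are my own, not taken from the book or the answer above); it still loops in Python, but it writes into a pre-allocated array instead of assigning into a DataFrame row by row, which is usually the slow part:

import numpy as np
import pandas as pd

def genre_dummies(genre_series, sep="|"):
    # split once, collect the vocabulary, and map each genre to a column position
    split = [s.split(sep) for s in genre_series]
    genres = pd.unique(np.concatenate(split))
    col_of = {g: i for i, g in enumerate(genres)}
    # write the indicators directly into a pre-allocated NumPy array
    out = np.zeros((len(genre_series), len(genres)), dtype=np.int8)
    for row, gens in enumerate(split):
        for g in gens:
            out[row, col_of[g]] = 1
    return pd.DataFrame(out, columns=genres, index=genre_series.index)

df = pd.DataFrame({"Movie_number": [1, 2, 3, 4, 5],
                   "genres": ["A|B|C", "B", "B|C", "C", "A|C"]})
print(df.join(genre_dummies(df["genres"]).add_prefix("genre_")))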
I am still new to pandas' pivot_table and I'm trying to reshape my data to get a binary indicator of whether a value occurs in a given observation. I have followed some previous code and got some encouraging results; however, instead of the ones and zeros of my ideal result, I get a sum. Please see a small sample data set below:
ID SKILL NUM
1 A 1
1 A 1
1 B 1
2 C 1
3 C 1
3 C 1
3 E 1
The result I am aiming for is:
ID A B C E
1 1 1 0 0
2 0 0 1 0
3 0 0 1 1
My code at the moment gives the following result:
ID A B C E
1 2 1 0 0
2 0 0 1 0
3 0 0 2 1
Should I remove the duplicates first??
The code I'm using at the moment is below:
df_pivot = df2.pivot_table(index='Job_posting_ID', columns='SKILL', aggfunc=len, fill_value=0)
You can use get_dummies with set_index for indicator columns and then get max values per index:
df = pd.get_dummies(df.set_index('ID')['SKILL']).max(level=0)
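On newer pandas releases, where max(level=0) has been removed, an equivalent sketch is to group on the index level instead:

df = pd.get_dummies(df.set_index('ID')['SKILL']).groupby(level=0).max()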
For better performance, remove duplicates with drop_duplicates and reshape with set_index + unstack:
df = df.drop_duplicates(['ID','SKILL']).set_index(['ID','SKILL'])['NUM'].unstack(fill_value=0)
A solution with pivot is also possible, but then it is necessary to replace the NaNs with 0:
df = df.drop_duplicates(['ID','SKILL']).pivot('ID','SKILL','NUM').fillna(0).astype(int)
If you want to use your own solution, just remove the duplicates first. unstack is still better here, because after dropping duplicates there are no repeated (ID, SKILL) pairs, so nothing actually needs to be aggregated:
df2 = df.drop_duplicates(['ID','SKILL'])
df_pivot = (df2.pivot_table(index='ID',
columns='SKILL',
values='NUM',
aggfunc=len,
fill_value=0))
print (df_pivot)
SKILL A B C E
ID
1 1 1 0 0
2 0 0 1 0
3 0 0 1 1
Try like this:
df.pivot_table(index='ID', columns='SKILL', values='NUM', aggfunc=lambda x: len(x.unique()), fill_value=0)
Or this:
df.pivot_table(index='ID', columns='SKILL',aggfunc=lambda x: int(x.any()), fill_value=0)
Whichever suits you best.
You can use aggfunc='any' and convert to int as a separate step. This avoids having to use a lambda / custom function, and may be more efficient.
df_pivot = df.pivot_table(index='ID', columns='SKILL',
aggfunc='any', fill_value=0).astype(int)
print(df_pivot)
NUM
SKILL A B C E
ID
1 1 1 0 0
2 0 0 1 0
3 0 0 1 1
The same would work with aggfunc=len + conversion to int, except this is likely to be more expensive.
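For reference, a minimal sketch of that len-based variant (my own illustration, using the same df); the counts become 0/1 after the bool round-trip:

df_pivot = (df.pivot_table(index='ID', columns='SKILL',
                           aggfunc=len, fill_value=0)
              .astype(bool).astype(int))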
I need to find the count of letters in each column as follows:
String: ATCG
TGCA
AAGC
GCAT
string is a series.
I need to write a program to get the following:
0 1 2 3
A 2 1 1 1
T 1 1 0 1
C 0 1 2 1
G 1 1 1 1
I have written the following code, but I am getting an extra row at index 0 and an extra column at the end (column index 450, i.e. the 451st column) filled with NaN values. I should not be getting either the row or column 451; I need to have only 450 columns.
f = zip(*string)
counts = [{letter: column.count(letter) for letter in column} for column in f]
counts = pd.DataFrame(counts).transpose()
print(counts)
counts = counts.drop(counts.columns[[450]], axis=1)
Can anyone please help me understand the issue?
Here is one way you can implement your logic. If required, you can turn your series into a list via lst = s.tolist().
lst = ['ATCG', 'TGCA', 'AAGC', 'GCAT']
arr = [[i.count(x) for i in zip(*lst)] for x in ('ATCG')]
res = pd.DataFrame(arr, index=list('ATCG'))
Result
0 1 2 3
A 2 1 1 1
T 1 1 0 1
C 0 1 2 1
G 1 1 1 1
Explanation
In the list comprehension, deal with columns first by iterating the first, second, third and fourth elements of each string sequentially.
Deal with rows second by iterating through 'ATCG' sequentially.
This produces a list of lists which can be fed directly into pd.DataFrame.
With Series.value_counts():
>>> s = pd.Series(['ATCG', 'TGCA', 'AAGC', 'GCAT'])
>>> s.str.join('|').str.split('|', expand=True)\
... .apply(lambda row: row.value_counts(), axis=0)\
... .fillna(0.)\
... .astype(int)
0 1 2 3
A 2 1 1 1
C 0 1 2 1
G 1 1 1 1
T 1 1 0 1
I'm not sure how you want the index ordered, but you could call .reindex() or .sort_index() on this result (a small reindex sketch follows at the end of this answer).
The first line, s.str.join('|').str.split('|', expand=True) gets you an "expanded" version
0 1 2 3
0 A T C G
1 T G C A
2 A A G C
3 G C A T
which should be faster than calling pd.Series(list(x)) ... on each row.
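For example, to match the ATCG row order shown in the question (a small sketch; res is just a hypothetical name for the frame produced above):

res.reindex(list('ATCG'))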
I have a simple question which relates to similar questions here, and here.
I am trying to drop all columns from a pandas dataframe, which have only zeroes (vertically, axis=1). Let me give you an example:
df = pd.DataFrame({'a':[0,0,0,0], 'b':[0,-1,0,1]})
a b
0 0 0
1 0 -1
2 0 0
3 0 1
I'd like to drop column a since it has only zeroes.
However, I'd like to do it in a nice and vectorized fashion if possible. My data set is huge - so I don't want to loop. Hence I tried
df = df.loc[(df).any(1), (df!=0).any(0)]
b
1 -1
3 1
This allows me to drop both columns and rows. But if I just try to drop the columns, loc seems to fail. Any ideas?
You are really close; use any, since 0 is cast to False:
df = df.loc[:, df.any()]
print (df)
b
0 0
1 -1
2 0
3 1
If it's a matter of 0s and not sum, use df.any:
In [291]: df.T[df.any()].T
Out[291]:
b
0 0
1 -1
2 0
3 1
Alternatively:
In [296]: df.T[(df != 0).any()].T # or df.loc[:, (df != 0).any()]
Out[296]:
b
0 0
1 -1
2 0
3 1
In [73]: df.loc[:, df.ne(0).any()]
Out[73]:
b
0 0
1 -1
2 0
3 1
or:
In [71]: df.loc[:, ~df.eq(0).all()]
Out[71]:
b
0 0
1 -1
2 0
3 1
If we want to keep only the columns that do NOT sum up to 0:
In [78]: df.loc[:, df.sum().astype(bool)]
Out[78]:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]
Note that with this data the sum criterion also drops b, because the -1 and 1 cancel out; it only matches the other approaches when the values are non-negative.
OK, I admit I had trouble formulating a good title for this, so I will try to give an example.
This is my sample dataframe:
df = pd.DataFrame([
(1,"a","good"),
(1,"a","good"),
(1,"b","good"),
(1,"c","bad"),
(2,"a","good"),
(2,"b","bad"),
(3,"a","none")], columns=["id", "type", "eval"])
What I do with it is the following:
df.groupby(["id", "type"])["id"].agg({'id':'count'})
This results in:
id
id type
1 a 2
b 1
c 1
2 a 1
b 1
3 a 1
This is fine, although what I will need later on is for the id to be repeated in every row, for example. But this is not the most important part.
What I would need now is something like this:
id good bad none
id type
1 a 2 2 0 0
b 1 1 0 0
c 1 0 1 0
2 a 1 1 0 0
b 1 0 1 0
3 a 1 0 0 1
And even better would be a result like this, because I will need this back in a dataframe (and finally in an Excel sheet) with all fields populated. In reality, there will be many more columns I am grouping by. They would have to be completely populated as well.
id good bad none
id type
1 a 2 2 0 0
1 b 1 1 0 0
1 c 1 0 1 0
2 a 1 1 0 0
2 b 1 0 1 0
3 a 1 0 0 1
Thank you for helping me out.
You can use groupby + size (with the eval column added to the grouping keys), or value_counts, together with unstack:
df1 = (df.groupby(["id", "type", "eval"])
         .size()
         .unstack(fill_value=0)
         .rename_axis(None, axis=1))
print (df1)
bad good none
id type
1 a 0 2 0
b 0 1 0
c 1 0 0
2 a 0 1 0
b 1 0 0
3 a 0 0 1
df1 = (df.groupby(["id", "type"])["eval"]
         .value_counts()
         .unstack(fill_value=0)
         .rename_axis(None, axis=1))
print (df1)
bad good none
id type
1 a 0 2 0
b 0 1 0
c 1 0 0
2 a 0 1 0
b 1 0 0
3 a 0 0 1
But writing this to Excel as-is keeps the MultiIndex:
df1.to_excel('file.xlsx')
So reset_index is needed last:
df1.reset_index().to_excel('file.xlsx', index=False)
EDIT:
I forgot the count column (id in the question); id would be a duplicate column name here, so it needs to be id1:
df1.insert(0, 'id1', df1.sum(axis=1))
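Putting the pieces together into one runnable sketch (id1 being the count column name from the edit above):

import pandas as pd

df = pd.DataFrame([
    (1, "a", "good"),
    (1, "a", "good"),
    (1, "b", "good"),
    (1, "c", "bad"),
    (2, "a", "good"),
    (2, "b", "bad"),
    (3, "a", "none")], columns=["id", "type", "eval"])

df1 = (df.groupby(["id", "type"])["eval"]
         .value_counts()
         .unstack(fill_value=0)
         .rename_axis(None, axis=1))
df1.insert(0, 'id1', df1.sum(axis=1))   # total count per (id, type) pair
print(df1.reset_index())                # id and type populated in every row
df1.reset_index().to_excel('file.xlsx', index=False)  # needs an Excel writer such as openpyxl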