I have the following data:
movies.head()
and I would like to create a categorical indicator matrix based on its genres.
The final result should look like this:
I know how to do it using a SLOW way, which is:
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))

genres = pd.unique(all_genres)
genres
Output is:
array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
'Western'], dtype=object)
Creating a zero matrix and naming its columns after the genres:
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)
dummies.head()
Output is:
Converting movies.genres into a categorical matrix:
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1

movies_windic = movies.join(dummies.add_prefix('Genre'))
movies_windic.iloc[0:2]
Output is:
The above code is copied from the book Python for Data Analysis, 2nd edition, pages 213-214.
What irritates me is the warning under the code regarding its performance, which is:

For much larger data, this method of constructing indicator variables with multiple membership is not especially speedy. It would be better to write a lower-level function that writes directly to a NumPy array, and then wrap the result in a DataFrame.
Could someone give me a pointer on how to do this with a lower-level function so that it works faster?
Thank you in advance.
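For reference, a lower-level function along the lines the book suggests might look like this (a minimal sketch, assuming the movies and genres objects above; genre_indicators is just an illustrative name):

import numpy as np
import pandas as pd

def genre_indicators(genre_series, genres):
    # map each genre to its column position once, up front
    col_index = {g: j for j, g in enumerate(genres)}
    # pre-allocate the full array and write into it directly
    out = np.zeros((len(genre_series), len(genres)), dtype=np.uint8)
    for i, row in enumerate(genre_series):
        for g in row.split('|'):
            out[i, col_index[g]] = 1
    # wrap the finished array in a DataFrame only at the end
    return pd.DataFrame(out, columns=genres, index=genre_series.index)

dummies = genre_indicators(movies.genres, genres)

This skips the per-row get_indexer and .iloc calls, which is presumably where most of the time goes in the book's loop.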
Let's generate some sample data:
import pandas as pd
df = pd.DataFrame({"Movie_number": [1, 2, 3, 4, 5], "genres": ["A|B|C", "B", "B|C", "C", "A|C"]})
print(df)
Movie_number genres
0 1 A|B|C
1 2 B
2 3 B|C
3 4 C
4 5 A|C
I've managed to come up with this horrible solution:
newdf = (pd.concat([df, pd.get_dummies(df['genres'].str.split('|').explode(),
                                       prefix="genre")], axis=1)
           .groupby(["Movie_number", "genres"]).sum().reset_index())
print(newdf)
Movie_number genres genre_A genre_B genre_C
0 1 A|B|C 1 1 1
1 2 B 0 1 0
2 3 B|C 0 1 1
3 4 C 0 0 1
4 5 A|C 1 0 1
Explanation:
First we explode our "genres" column on the | separator:
>>> df['genres'].str.split('|').explode()
0 A
0 B
0 C
1 B
2 B
2 C
3 C
4 A
4 C
Name: genres, dtype: object
Then we convert these into indicator variables with pd.get_dummies:
>>> pd.get_dummies(df['genres'].str.split('|').explode(), prefix="genre")
genre_A genre_B genre_C
0 1 0 0
0 0 1 0
0 0 0 1
1 0 1 0
2 0 1 0
2 0 0 1
3 0 0 1
4 1 0 0
4 0 0 1
After that we concatenate it with the original dataframe, then finally we merge the rows with groupby and sum.
>>> pd.concat([df, pd.get_dummies(df['genres'].str.split('|').explode(), prefix="genre")],axis=1).groupby(["Movie_number", "genres"]).sum().reset_index()
Movie_number genres genre_A genre_B genre_C
0 1 A|B|C 1 1 1
1 2 B 0 1 0
2 3 B|C 0 1 1
3 4 C 0 0 1
4 5 A|C 1 0 1
Although it's not exactly low-level, I think it is definitely faster than using for loops.
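For what it's worth, pandas also ships a helper that covers this exact case: Series.str.get_dummies takes a separator, so the whole construction collapses to one join (a sketch on the same df as above):

df.join(df['genres'].str.get_dummies(sep='|').add_prefix('genre_'))

Movie_number genres genre_A genre_B genre_C
0 1 A|B|C 1 1 1
1 2 B 0 1 0
2 3 B|C 0 1 1
3 4 C 0 0 1
4 5 A|C 1 0 1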
This seems like it should be easy, but I couldn't find a working solution:
I have a dataframe with 3 columns:
df = pd.DataFrame({'A': [0,0,2,2,2],
'B': [1,1,2,2,3],
'C': [1,1,2,3,4]})
A B C
0 0 1 1
1 0 1 1
2 2 2 2
3 2 2 3
4 2 3 4
I want to select rows based on the values of column A, then group by the values of column B, and finally transform the values of column C into their sum, something along the lines of this (obviously not working) code:
df[df['A'].isin(['2']), 'C'] = df[df['A'].isin(['2']), 'C'].groupby('B').transform('sum')
The desired output for the above example is:
A B C
0 0 1 1
1 0 1 1
2 2 2 5
3 2 3 4
I also know how to split the dataframe and do it that way. I am looking more for a solution that does it without the need for split + concat/merge. Thank you.
Is it just
s = df['A'].isin([2])
pd.concat((df[s].groupby(['A','B'])['C'].sum().reset_index(),
           df[~s]))
Output:
A B C
0 2 2 5
1 2 3 4
0 0 1 1
Update: Without splitting, you can assign a new column indicating special values of A:
(df.sort_values('A')
   .assign(D=(~df['A'].isin([2])).cumsum())
   .groupby(['D','A','B'])['C'].sum()
   .reset_index('D', drop=True)
   .reset_index()
)
Output:
A B C
0 0 1 1
1 0 1 1
2 2 2 5
3 2 3 4
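To see why the D column works, it can help to print it on its own: the cumsum increments on every A != 2 row, so a run of A == 2 rows shares one D value and collapses in the groupby, while the remaining rows stay distinct once A and B join the grouping key:

print((~df['A'].isin([2])).cumsum())
0    1
1    2
2    2
3    2
4    2
Name: A, dtype: int64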
I want to select the rows in a dataframe which have zero in every column in a list of columns, e.g. this df:
In:
df = pd.DataFrame([[1,2,3,6], [2,4,6,8], [0,0,3,4], [1,0,3,4], [0,0,0,0]], columns=['a','b','c','d'])
df
Out:
a b c d
0 1 2 3 6
1 2 4 6 8
2 0 0 3 4
3 1 0 3 4
4 0 0 0 0
Then:
In:
mylist = ['a','b']
selection = df.loc[df['mylist']==0]
selection
I would like to see:
Out:
a b c d
2 0 0 3 4
4 0 0 0 0
Should be simple but I'm having a slow day!
You'll need to determine whether all columns of a row have zeros or not. Given a boolean mask, use DataFrame.all(axis=1) to do that.
df[df[mylist].eq(0).all(1)]
a b c d
2 0 0 3 4
4 0 0 0 0
Note that if you wanted to find rows with zeros in every column, remove the subsetting step:
df[df.eq(0).all(1)]
a b c d
4 0 0 0 0
Using reduce and NumPy's logical_and
The point of this is to eliminate the need to create new Pandas objects and simply produce the mask we are looking for using the data where it sits.
import numpy as np
from functools import reduce

df[reduce(np.logical_and, (df[c].values == 0 for c in mylist))]
a b c d
2 0 0 3 4
4 0 0 0 0
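A small variant of the same idea: NumPy ufuncs carry their own reduce method, so functools isn't strictly needed here.

df[np.logical_and.reduce([df[c].values == 0 for c in mylist])]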
I have a large dataframe ('data') made up of one column. Each row in the column is a string, and each string is made up of comma-separated categories. I wish to one-hot encode this data.
For example,
data = {"mesh": ["A, B, C", "C,B", ""]}
From this I would like to get a dataframe consisting of:
index  A  B  C
0 1 1 1
1 0 1 1
2 0 0 0
How can I do this?
Note that you're not dealing with one-hot encodings here, since a row can belong to several categories at once.
str.split + stack + get_dummies + sum
df = pd.DataFrame(data)
df
mesh
0 A, B, C
1 C,B
2
(df.mesh.str.split(r'\s*,\s*', expand=True)
   .stack()
   .str.get_dummies()
   .sum(level=0))
A B C
0 1 1 1
1 0 1 1
2 0 0 0
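One caveat on recent pandas versions: Series.sum(level=0) has been deprecated in favor of an explicit groupby, so the last step becomes .groupby(level=0).sum():

(df.mesh.str.split(r'\s*,\s*', expand=True)
   .stack()
   .str.get_dummies()
   .groupby(level=0).sum())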
apply + value_counts
(df.mesh.str.split(r'\s*,\s*', expand=True)
   .apply(pd.Series.value_counts, 1)
   .iloc[:, 1:]  # drop the count column for the empty-string token
   .fillna(0, downcast='infer'))
A B C
0 1 1 1
1 0 1 1
2 0 0 0
pd.crosstab
x = df.mesh.str.split(r'\s*,\s*', expand=True).stack()
pd.crosstab(x.index.get_level_values(0), x.values).iloc[:, 1:]  # drop the '' column
col_0 A B C
row_0
0 1 1 1
1 0 1 1
2 0 0 0
I figured there is a simpler answer, or at least one that feels simpler than the multiple operations above.

Make sure the column's values are comma-separated strings.

Use get_dummies' built-in sep parameter to specify the separator as a comma (the default is pipe-separated).
data = {"mesh": ["A, B, C", "C,B", ""]}
sof_df=pd.DataFrame(data)
sof_df.mesh = sof_df.mesh.str.replace(' ', '')
sof_df.mesh.str.get_dummies(sep=',')
OUTPUT:
A B C
0 1 1 1
1 0 1 1
2 0 0 0
If the categories are controlled (you know how many there are and what they are), the best answer is by #Tejeshar Gurram. But what if you have lots of potential categories and you are not interested in all of them? Say:
s = pd.Series(['A,B,C,', 'B,C,D', np.nan, 'X,W,Z'])
0 A,B,C,
1 B,C,D
2 NaN
3 X,W,Z
dtype: object
If you are only interested in categories B and C for the final df of dummies, I've found this workaround does the job:
cat_list = ['B', 'C']
list_of_lists = [(s.str.contains(cat_, regex=False) == True).astype(int).to_list()
                 for cat_ in cat_list]
data = {k: v for k, v in zip(cat_list, list_of_lists)}
pd.DataFrame(data)
B C
0 1 1
1 1 1
2 0 0
3 0 0
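A shorter route to the same place (a sketch): str.get_dummies treats NaN as all zeros, and subsetting by the category list drops both the unwanted categories and the empty column created by the trailing comma in 'A,B,C,'. Unlike str.contains, it also matches whole tokens, so a category 'B' cannot accidentally match inside a longer tag.

s.str.get_dummies(sep=',')[cat_list]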
So I would like to make a slice of a dataframe and then set the value of the first item in that slice without copying the dataframe. For example:
df = pandas.DataFrame(numpy.random.rand(3,1))
df[df[0]>0][0] = 0
The slice here is irrelevant and just for the example; it will return the whole dataframe again. The point is that doing it as in the example gives a SettingWithCopyWarning (understandably). I have also tried slicing first and then using iloc/ix/loc, and using iloc twice, i.e. something like:
df.iloc[df[0]>0,:][0] = 0
df[df[0]>0,:].iloc[0] = 0
And neither of these works. Again, I don't want to make a copy of the dataframe, even if it is just the sliced version.
EDIT:
It seems there are two ways: using a mask or idxmax. The idxmax method seems to work if your index is unique, and the mask method if not. In my case the index is not unique, which I forgot to mention in the initial post.
I think you can use idxmax to get the index of the first True value and then set it with loc:
np.random.seed(1)
df = pd.DataFrame(np.random.randint(4, size=(5,1)))
print (df)
0
0 1
1 3
2 0
3 0
4 3
print ((df[0] == 0).idxmax())
2
df.loc[(df[0] == 0).idxmax(), 0] = 100
print (df)
0
0 1
1 3
2 100
3 0
4 3
df.loc[(df[0] == 3).idxmax(), 0] = 200
print (df)
0
0 1
1 200
2 0
3 0
4 3
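One caveat with idxmax: if no value matches, the mask is all False and idxmax() still returns the first label, so the assignment would silently hit the wrong row. A guard with any() avoids that (a sketch using a value, 5, that does not occur in this data):

m = df[0] == 5
if m.any():
    df.loc[m.idxmax(), 0] = 500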
EDIT:
Solution with a non-unique index:
np.random.seed(1)
df = pd.DataFrame(np.random.randint(4, size=(5,1)), index=[1,2,2,3,4])
print (df)
0
1 1
2 3
2 0
3 0
4 3
df = df.reset_index()
df.loc[(df[0] == 3).idxmax(), 0] = 200
df = df.set_index('index')
df.index.name = None
print (df)
0
1 1
2 200
2 0
3 0
4 3
EDIT1:
Solution with MultiIndex:
np.random.seed(1)
df = pd.DataFrame(np.random.randint(4, size=(5,1)), index=[1,2,2,3,4])
print (df)
0
1 1
2 3
2 0
3 0
4 3
df.index = [np.arange(len(df.index)), df.index]
print (df)
0
0 1 1
1 2 3
2 2 0
3 3 0
4 4 3
df.loc[(df[0] == 3).idxmax(), 0] = 200
df = df.reset_index(level=0, drop=True)
print (df)
0
1 1
2 200
2 0
3 0
4 3
EDIT2:
Solution with a double cumsum (the second cumsum keeps growing after the first match, so the mask equals 1 only at that first match, even when the rows that follow don't match):
np.random.seed(1)
df = pd.DataFrame([4,0,4,7,4], index=[1,2,2,3,4])
print (df)
0
1 4
2 0
2 4
3 7
4 4
mask = (df[0] == 0).cumsum().cumsum()
print (mask)
1 0
2 1
2 2
3 3
4 4
Name: 0, dtype: int32
df.loc[mask == 1, 0] = 200
print (df)
0
1 4
2 200
2 4
3 7
4 4
Consider the dataframe df
df = pd.DataFrame(dict(A=[1, 2, 3, 4, 5]))
print(df)
A
0 1
1 2
2 3
3 4
4 5
Create some arbitrary slice slc
slc = df[df.A > 2]
print(slc)
A
2 3
3 4
4 5
Access the first row of slc within df by using index[0] and loc (note that with a non-unique index this assigns to every row sharing that label):
df.loc[slc.index[0]] = 0
print(df)
A
0 1
1 2
2 0
3 4
4 5
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(6,1), index=[1,2,2,3,3,3])
df[1] = 0
df.columns = ['a','b']
df.loc[df['a'] >= 0.5, 'b'] = 1
df = df.sort_values(['b','a'], ascending=[False, True])
df.loc[df[df['b'] == 0].index.tolist()[0], 'a'] = 0
With this method no extra copy of the dataframe is created, but an extra column is introduced, which can be dropped after processing. To change the nth item in a slice instead of the first, change the last line to:

df.loc[df[df['b'] == 0].index.tolist()[n], 'a'] = 0
df
a
1 0.111089
2 0.255633
2 0.332682
3 0.434527
3 0.730548
3 0.844724
df after slicing and labelling them
a b
1 0.111089 0
2 0.255633 0
2 0.332682 0
3 0.434527 0
3 0.730548 1
3 0.844724 1
After changing the value of the first item in the slice (labelled 0) to 0:
a b
3 0.730548 1
3 0.844724 1
1 0.000000 0
2 0.255633 0
2 0.332682 0
3 0.434527 0
So using some of the answers I managed to find a one-liner way to do this:
np.random.seed(1)
df = pd.DataFrame(np.random.randint(4, size=(5,1)))
print(df)
0
0 1
1 3
2 0
3 0
4 3
df.loc[(df[0] == 0).cumsum()==1,0] = 1
0
0 1
1 3
2 1
3 0
4 3
Essentially this is using the mask inline with a cumsum. One caveat: a single cumsum equals 1 on every row from the first match up to just before the second match, so it only isolates the first match here because the very next row is also 0. The double-cumsum mask from the earlier answer handles the general case, as sketched below.
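The general form, robust to an isolated match, is:

df.loc[(df[0] == 0).cumsum().cumsum() == 1, 0] = 1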
I have a dataframe with about 100 columns that looks like this:
Id Economics-1 English-107 English-2 History-3 Economics-zz Economics-2 \
0 56 1 1 0 1 0 0
1 11 0 0 0 0 1 0
2 6 0 0 1 0 0 1
3 43 0 0 0 1 0 1
4 14 0 1 0 0 1 0
Histo Economics-51 Literature-re Literatureu4
0 1 0 1 0
1 0 0 0 1
2 0 0 0 0
3 0 1 1 0
4 1 0 0 0
My goal is to keep only the global categories (Economics, English, History, Literature) and write the sum of the values of their components, respectively, into this dataframe. For instance, "English" would be the sum of "English-107" and "English-2":
Id Economics English History Literature
0 56 1 1 2 1
1 11 1 0 0 1
2 6 0 1 1 0
3 43 2 0 1 1
4 14 0 1 1 0
For this purpose, I have tried two methods. First method:
df = pd.read_csv(file_path, sep='\t')
df['History'] = df.loc[df[df.columns[pd.Series(df.columns).str.startswith('History')]].sum(axes=1)]
Second method:
df = pd.read_csv(file_path, sep='\t')
filter_col = [col for col in list(df) if col.startswith('History')]
df['History'] = 0 # initialize value, otherwise throws KeyError
for c in df[filter_col]:
    df['History'] = df[filter_col].sum(axes=1)
print df['History', df[filter_col]]
However, both gives the error:
TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed
My question is: how can I debug this error, or is there another solution to my problem? Note that I have a rather large dataframe (about 100 columns and 400,000 rows), so I'm looking for an optimized solution, like using loc in pandas.
I'd suggest that you do something different, which is to perform a transpose, groupby the prefix of the rows (your original columns), sum, and transpose again.
Consider the following:
df = pd.DataFrame({
    'a_a': [1, 2, 3, 4],
    'a_b': [2, 3, 4, 5],
    'b_a': [1, 2, 3, 4],
    'b_b': [2, 3, 4, 5],
})
Now
[s.split('_')[0] for s in df.T.index.values]
is the prefix of the columns. So
>>> df.T.groupby([s.split('_')[0] for s in df.T.index.values]).sum().T
a b
0 3 3
1 5 5
2 7 7
3 9 9
does what you want.
In your case, make sure to split using the '-' character.
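Adapted to the question's frame, that might look like the following (a sketch; note that 'Histo' and 'Literatureu4' contain no '-' and therefore remain groups of their own, a wrinkle a later answer deals with):

df.set_index('Id').T.groupby(lambda col: col.split('-')[0]).sum().T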
You can use this to create the sum of all columns that start with a specific prefix (anchoring the regex with ^ keeps out columns that merely contain the name somewhere in the middle):

df['Economics'] = df[list(df.filter(regex='^Economics'))].sum(axis=1)
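The same idea extends to several prefixes with a loop (a sketch; note that '^History' deliberately will not catch the 'Histo' column, which the next answer addresses):

for cat in ['Economics', 'English', 'Literature']:
    df[cat] = df.filter(regex='^' + cat).sum(axis=1)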
Using DSM's brilliant idea:
from __future__ import print_function
import pandas as pd

categories = set(['Economics', 'English', 'Histo', 'Literature'])

def correct_categories(cols):
    return [cat for col in cols for cat in categories if col.startswith(cat)]

df = pd.read_csv('data.csv', sep=r'\s+', index_col='Id')
#print(df)
print(df.groupby(correct_categories(df.columns), axis=1).sum())
Output:
Economics English Histo Literature
Id
56 1 1 2 1
11 1 0 0 1
6 1 1 0 0
43 2 0 1 1
14 1 1 1 0
Here is another version, which takes care of the "Histo"/"History" problem:
from __future__ import print_function
import pandas as pd
#categories = set(['Economics', 'English', 'Histo', 'Literature'])
#
# mapping: common starting pattern: desired name
#
categories = {
    'Histo': 'History',
    'Economics': 'Economics',
    'English': 'English',
    'Literature': 'Literature'
}

def correct_categories(cols):
    return [categories[cat] for col in cols for cat in categories.keys() if col.startswith(cat)]
df = pd.read_csv('data.csv', sep=r'\s+', index_col='Id')
#print(df.columns, len(df.columns))
#print(correct_categories(df.columns), len(correct_categories(df.columns)))
#print(df.groupby(pd.Index(correct_categories(df.columns)),axis=1).sum())
rslt = df.groupby(correct_categories(df.columns),axis=1).sum()
print(rslt)
print('History\n', rslt['History'])
Output:
Economics English History Literature
Id
56 1 1 2 1
11 1 0 0 1
6 1 1 0 0
43 2 0 1 1
14 1 1 1 0
History
Id
56 2
11 0
6 0
43 1
14 1
Name: History, dtype: int64
PS: You may want to add any missing categories to the categories map/dictionary.