sum values of columns starting with the same string in pandas dataframe - python

I have a dataframe with about 100 columns that looks like this:
   Id  Economics-1  English-107  English-2  History-3  Economics-zz  Economics-2  \
0  56            1            1          0          1             0            0
1  11            0            0          0          0             1            0
2   6            0            0          1          0             0            1
3  43            0            0          0          1             0            1
4  14            0            1          0          0             1            0

   Histo  Economics-51  Literature-re  Literatureu4
0      1             0              1             0
1      0             0              0             1
2      0             0              0             0
3      0             1              1             0
4      1             0              0             0
My goal is to keep only the global categories -- Economics, English, History, Literature -- and write the sum of the values of their components into this dataframe. For instance, "English" would be the sum of "English-107" and "English-2":
   Id  Economics  English  History  Literature
0  56          1        1        2           1
1  11          1        0        0           1
2   6          0        1        1           0
3  43          2        0        1           1
4  14          0        1        1           0
For this purpose, I have tried two methods. First method:
df = pd.read_csv(file_path, sep='\t')
df['History'] = df.loc[df[df.columns[pd.Series(df.columns).str.startswith('History')]].sum(axes=1)]
Second method:
df = pd.read_csv(file_path, sep='\t')
filter_col = [col for col in list(df) if col.startswith('History')]
df['History'] = 0 # initialize value, otherwise throws KeyError
for c in df[filter_col]:
    df['History'] = df[filter_col].sum(axes=1)
print df['History', df[filter_col]]
However, both give the error:
TypeError: 'DataFrame' objects are mutable, thus they cannot be
hashed
My question is: how can I debug this error, or is there another solution to my problem? Note that I have a rather large dataframe (about 100 columns and 400,000 rows), so I'm looking for an optimized solution, like using loc in pandas.

I'd suggest that you do something different, which is to perform a transpose, groupby the prefix of the rows (your original columns), sum, and transpose again.
Consider the following:
df = pd.DataFrame({
    'a_a': [1, 2, 3, 4],
    'a_b': [2, 3, 4, 5],
    'b_a': [1, 2, 3, 4],
    'b_b': [2, 3, 4, 5],
})
Now
[s.split('_')[0] for s in df.T.index.values]
gives the prefix of each column. So
>>> df.T.groupby([s.split('_')[0] for s in df.T.index.values]).sum().T
   a  b
0  3  3
1  5  5
2  7  7
3  9  9
does what you want.
In your case, make sure to split using the '-' character.
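Applied to the question's frame (a minimal sketch, assuming it is loaded as df), that would be something like:
# group the transposed frame by the text before the first '-' in each
# original column name, sum within each group, then transpose back
prefixes = [c.split('-')[0] for c in df.columns]
result = df.T.groupby(prefixes).sum().T
Note that 'Id' simply passes through unchanged, while a bare column such as 'Histo' ends up in its own group, separate from 'History'; the mapping approach further down handles that case.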

You can use this to create the sum of columns starting with a specific name:
df['Economics']= df[list(df.filter(regex='Economics'))].sum(axis=1)
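Note that filter(regex='Economics') matches the pattern anywhere in the column name; anchoring it with '^' restricts it to columns that start with the prefix, and the same line can be repeated per category (a sketch; families with inconsistent prefixes such as 'Histo'/'History-3' still need a mapping like the one in the next answer):
for cat in ['Economics', 'English', 'History', 'Literature']:
    # '^' anchors the pattern to the start of the column name
    df[cat] = df.filter(regex='^' + cat).sum(axis=1)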

Using brilliant DSM's idea:
from __future__ import print_function
import pandas as pd

categories = set(['Economics', 'English', 'Histo', 'Literature'])

def correct_categories(cols):
    return [cat for col in cols for cat in categories if col.startswith(cat)]

df = pd.read_csv('data.csv', sep=r'\s+', index_col='Id')
#print(df)
print(df.groupby(correct_categories(df.columns), axis=1).sum())
Output:
    Economics  English  Histo  Literature
Id
56          1        1      2           1
11          1        0      0           1
6           1        1      0           0
43          2        0      1           1
14          1        1      1           0
Here is another version, which takes care of the "Histo/History" problem:
from __future__ import print_function
import pandas as pd

#categories = set(['Economics', 'English', 'Histo', 'Literature'])
#
# mapping: common starting pattern: desired name
#
categories = {
    'Histo': 'History',
    'Economics': 'Economics',
    'English': 'English',
    'Literature': 'Literature'
}

def correct_categories(cols):
    return [categories[cat] for col in cols for cat in categories.keys() if col.startswith(cat)]

df = pd.read_csv('data.csv', sep=r'\s+', index_col='Id')
#print(df.columns, len(df.columns))
#print(correct_categories(df.columns), len(correct_categories(df.columns)))
#print(df.groupby(pd.Index(correct_categories(df.columns)),axis=1).sum())

rslt = df.groupby(correct_categories(df.columns), axis=1).sum()
print(rslt)
print('History\n', rslt['History'])
Output:
    Economics  English  History  Literature
Id
56          1        1        2           1
11          1        0        0           1
6           1        1        0           0
43          2        0        1           1
14          1        1        1           0
History
 Id
56    2
11    0
6     0
43    1
14    1
Name: History, dtype: int64
PS: you may want to add any missing categories to the categories map/dictionary.
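Note also that newer pandas releases deprecate DataFrame.groupby(..., axis=1); if your version warns about it, the same column-wise grouping can be done through a transpose (a sketch reusing correct_categories from above):
# group columns by transposing, grouping the rows, then transposing back
rslt = df.T.groupby(correct_categories(df.columns)).sum().T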

Related

Create indicator matrix for much larger data?

I have the following data:
movies.head()
and would like to create a categorical matrix based on its genres.
The final result should look like this:
I know how to do it using a SLOW way, which is:
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))
genres = pd.unique(all_genres)
genres
Output is:
array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
'Western'], dtype=object)
Creating a zero matrix and renaming its column to be the genres:
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)
dummies.head()
Output is:
Converting movies.genres into categorical matrix:
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1
movies_windic = movies.join(dummies.add_prefix('Genre'))
movies_windic.iloc[0:2]
Output is:
The above code is copied from the book Python for Data Analysis 2nd edition page 213, 214.
What irritates me is the warning under the code regarding its performance, which is
For much larger data, this method of constructing indicator variables
with multiple membership is not especially speedy. It would be
better to write a lower-level function that writes directly to a NumPy
array, and then wrap the result in a DataFrame.
Could someone give me a pointer on how to do this with a lower-level function so that it works faster?
Thank you in advance.
Let's generate some random data:
import pandas as pd
df = pd.DataFrame({"Movie_number": [1, 2, 3, 4, 5], "genres": ["A|B|C", "B", "B|C", "C", "A|C"]})
print(df)
   Movie_number genres
0             1  A|B|C
1             2      B
2             3    B|C
3             4      C
4             5    A|C
I've managed to come up with this horrible solution:
newdf = pd.concat([df, pd.get_dummies(df['genres'].str.split('|').explode(), prefix="genre")], axis=1).groupby(["Movie_number", "genres"]).sum().reset_index()
print(newdf)
   Movie_number genres  genre_A  genre_B  genre_C
0             1  A|B|C        1        1        1
1             2      B        0        1        0
2             3    B|C        0        1        1
3             4      C        0        0        1
4             5    A|C        1        0        1
Explanation:
First we explode our "genres" column based on | separator:
>>> df['genres'].str.split('|').explode()
0 A
0 B
0 C
1 B
2 B
2 C
3 C
4 A
4 C
Name: genres, dtype: object
Then we convert these into indicator variables with pd.get_dummies:
>>> pd.get_dummies(df['genres'].str.split('|').explode(), prefix="genre")
   genre_A  genre_B  genre_C
0        1        0        0
0        0        1        0
0        0        0        1
1        0        1        0
2        0        1        0
2        0        0        1
3        0        0        1
4        1        0        0
4        0        0        1
After that we concatenate it with the original dataframe, then finally we merge the rows with groupby and sum.
>>> pd.concat([df, pd.get_dummies(df['genres'].str.split('|').explode(), prefix="genre")],axis=1).groupby(["Movie_number", "genres"]).sum().reset_index()
   Movie_number genres  genre_A  genre_B  genre_C
0             1  A|B|C        1        1        1
1             2      B        0        1        0
2             3    B|C        0        1        1
3             4      C        0        0        1
4             5    A|C        1        0        1
Although it's not particularly low-level, I think it is definitely faster than using for loops.
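If you really want the lower-level routine the book hints at, writing directly into a preallocated NumPy array and only wrapping it in a DataFrame at the end, a rough sketch could look like this (genre_dummies is just an illustrative name, and it assumes a frame like df above with a '|'-separated genres column):
import numpy as np
import pandas as pd

def genre_dummies(genres):
    # map each distinct genre to a column position
    all_genres = pd.unique(genres.str.split('|').explode())
    col_of = {g: i for i, g in enumerate(all_genres)}
    # write the indicator values straight into a preallocated NumPy array
    out = np.zeros((len(genres), len(all_genres)), dtype=np.uint8)
    for row, cell in enumerate(genres):
        for g in cell.split('|'):
            out[row, col_of[g]] = 1
    # wrap the result in a DataFrame only at the end
    return pd.DataFrame(out, columns=all_genres, index=genres.index)

dummies = genre_dummies(df['genres'])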

Pandas: obtaining frequency of a specified value in a row across multiple columns

I have a large dataset with many columns of numeric data and want to be able to count all the zeros in each of the rows. The following will generate a small sample of the data.
df = pd.DataFrame(np.random.randint(0, 3, size=(8,3)),columns=list('abc'))
df
While I can create a column to sum all the values in the rows with the following code:
df2=df.sum(axis=1)
df2
And I can get a count of the zeros in a column:
df.loc[df.a==1].count()
I haven't been able to figure out how to get a count of the zeros across each of the rows. Any assistance would be greatly appreciated.
To count matched values, you can sum the Trues of a boolean mask.
If need new column:
df['sum of 1'] = df.eq(1).sum(axis=1)
#alternative
#df['sum of 1'] = (df == 1).sum(axis=1)
Sample:
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(0, 3, size=(8,3)),columns=list('abc'))
df['sum of 1'] = df.eq(1).sum(axis=1)
print (df)
   a  b  c  sum of 1
0  0  0  2         0
1  1  0  1         2
2  0  0  0         0
3  2  1  2         1
4  2  2  1         1
5  0  0  0         0
6  0  2  0         0
7  1  1  1         3
If need new row:
df.loc['sum of 1'] = df.eq(1).sum()
#alternative
#df.loc['sum of 1'] = (df == 1).sum()
Sample:
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(0, 3, size=(8,3)),columns=list('abc'))
df.loc['sum of 1'] = df.eq(1).sum()
print (df)
          a  b  c
0         0  0  2
1         1  0  1
2         0  0  0
3         2  1  2
4         2  2  1
5         0  0  0
6         0  2  0
7         1  1  1
sum of 1  2  2  3
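The question itself asks about zeros; the very same pattern applies, just with eq(0) (a small sketch, assuming df holds only the numeric columns):
# count zeros in each row
df['zeros per row'] = df.eq(0).sum(axis=1)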

Label encoding across multiple columns with same attributes in scikit-learn

If I have two columns as below:
Origin  Destination
China   USA
China   Turkey
USA     China
USA     Turkey
USA     Russia
Russia  China
How would I perform label encoding while ensuring the label for the Origin column matches the one in the Destination column, i.e.
Origin  Destination
0       1
0       3
1       0
1       0
1       0
2       1
If I do the encoding for each column separately, the algorithm will see the China in column 1 as different from the China in column 2, which is not the case.
stack
df.stack().pipe(lambda s: pd.Series(pd.factorize(s.values)[0], s.index)).unstack()
   Origin  Destination
0       0            1
1       0            2
2       1            0
3       1            2
4       1            3
5       3            0
factorize with reshape
pd.DataFrame(
    pd.factorize(df.values.ravel())[0].reshape(df.shape),
    df.index, df.columns
)
   Origin  Destination
0       0            1
1       0            2
2       1            0
3       1            2
4       1            3
5       3            0
np.unique and reshape
pd.DataFrame(
    np.unique(df.values.ravel(), return_inverse=True)[1].reshape(df.shape),
    df.index, df.columns
)
   Origin  Destination
0       0            3
1       0            2
2       3            0
3       3            2
4       3            1
5       1            0
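The factorize-based and unique-based results above differ only in which integer each country receives: pd.factorize numbers labels in order of first appearance, while np.unique sorts them first; both stay consistent across the two columns. A quick check on the sample frame (a sketch):
import numpy as np
import pandas as pd

# order of first appearance in the flattened values
codes, labels = pd.factorize(df.values.ravel())
print(labels)                        # ['China' 'USA' 'Turkey' 'Russia'] for this sample

# sorted label order
print(np.unique(df.values.ravel()))  # ['China' 'Russia' 'Turkey' 'USA']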
Disgusting Option
I couldn't stop trying stuff... sorry!
import itertools

df.applymap(
    lambda x, y={}, c=itertools.count():
        y.get(x) if x in y else y.setdefault(x, next(c))
)
   Origin  Destination
0       0            1
1       0            3
2       1            0
3       1            3
4       1            2
5       2            0
As pointed out by cᴏʟᴅsᴘᴇᴇᴅ
You can shorten this by assigning back to dataframe
df[:] = pd.factorize(df.values.ravel())[0].reshape(df.shape)
pandas Method
You could create a dictionary of {country: value} pairs and map the dataframe to that:
country_map = {country:i for i, country in enumerate(df.stack().unique())}
df['Origin'] = df['Origin'].map(country_map)
df['Destination'] = df['Destination'].map(country_map)
>>> df
   Origin  Destination
0       0            1
1       0            2
2       1            0
3       1            2
4       1            3
5       3            0
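To get the original labels back with this approach, just invert the mapping (a small sketch building on country_map above):
inverse_map = {code: country for country, code in country_map.items()}
df['Origin'].map(inverse_map)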
sklearn method
Since you tagged sklearn, you could use LabelEncoder():
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(df.stack().unique())
df['Origin'] = le.transform(df['Origin'])
df['Destination'] = le.transform(df['Destination'])
>>> df
   Origin  Destination
0       0            3
1       0            2
2       3            0
3       3            2
4       3            1
5       1            0
To get the original labels back:
>>> le.inverse_transform(df['Origin'])
# array(['China', 'China', 'USA', 'USA', 'USA', 'Russia'], dtype=object)
You can use replace:
df.replace(dict(zip(np.unique(df.values),list(range(len(np.unique(df.values)))))))
   Origin  Destination
0       0            3
1       0            2
2       3            0
3       3            2
4       3            1
5       1            0
Succinct and nice answer from Pir
df.replace((lambda u: dict(zip(u, range(u.size))))(np.unique(df)))
And
df.replace(dict(zip(np.unique(df), itertools.count())))
Edit: just found out about return_inverse option to np.unique. No need to search and substitute!
df.values[:] = np.unique(df, return_inverse=True)[1].reshape(-1,2)
You could leverage the vectorized version of np.searchsorted with
df.values[:] = np.searchsorted(np.sort(np.unique(df)), df)
Or you could create an array of one-hot encodings and recover indices with argmax. Probably not a great idea if there are many countries.
df.values[:] = (df.values[...,None] == np.unique(df)).argmax(-1)
Using LabelEncoder from sklearn, you can also try:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df.values.flatten())
# use transform (not fit_transform) so both columns share the labels fitted above
df = df.apply(le.transform)
print(df)
Result:
   Origin  Destination
0       0            3
1       0            2
2       3            0
3       3            2
4       3            1
5       1            0
If you have more columns and only want to apply the encoding to selected columns of the dataframe, you can try:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
# columns to select for encoding
selected_col = ['Origin','Destination']
le.fit(df[selected_col].values.flatten())
df[selected_col] = df[selected_col].apply(le.transform)
print(df)

Set value of first item in slice in python pandas

So I would like to make a slice of a dataframe and then set the value of the first item in that slice without copying the dataframe. For example:
df = pandas.DataFrame(numpy.random.rand(3,1))
df[df[0]>0][0] = 0
The slice here is irrelevant and just for the example; it will return the whole data frame again. The point is that, done the way it is in the example, you get a SettingWithCopyWarning (understandably). I have also tried slicing first and then using iloc/ix/loc, and using iloc twice, i.e. something like:
df.iloc[df[0]>0,:][0] = 0
df[df[0]>0,:].iloc[0] = 0
And neither of these work. Again, I don't want to make a copy of the dataframe, even if it is just the sliced version.
EDIT:
It seems there are two ways, using a mask or idxmax. The idxmax method seems to work if your index is unique, and the mask method if not. In my case, the index is not unique, which I forgot to mention in the initial post.
I think you can use idxmax to get the index of the first True value and then set it with loc:
np.random.seed(1)
df = pd.DataFrame(np.random.randint(4, size=(5,1)))
print (df)
0
0 1
1 3
2 0
3 0
4 3
print ((df[0] == 0).idxmax())
2
df.loc[(df[0] == 0).idxmax(), 0] = 100
print (df)
0
0 1
1 3
2 100
3 0
4 3
df.loc[(df[0] == 3).idxmax(), 0] = 200
print (df)
0
0 1
1 200
2 0
3 0
4 3
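One caveat with idxmax that is worth keeping in mind (not from the original answer): if the mask contains no True at all, idxmax returns the first index label, so a non-matching cell would be overwritten. A guard with any() avoids that (a sketch):
mask = df[0] == 5          # a value that may not occur in the column
if mask.any():
    df.loc[mask.idxmax(), 0] = 200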
EDIT:
Solution with a non-unique index:
np.random.seed(1)
df = pd.DataFrame(np.random.randint(4, size=(5,1)), index=[1,2,2,3,4])
print (df)
0
1 1
2 3
2 0
3 0
4 3
df = df.reset_index()
df.loc[(df[0] == 3).idxmax(), 0] = 200
df = df.set_index('index')
df.index.name = None
print (df)
0
1 1
2 200
2 0
3 0
4 3
EDIT1:
Solution with MultiIndex:
np.random.seed(1)
df = pd.DataFrame(np.random.randint(4, size=(5,1)), index=[1,2,2,3,4])
print (df)
0
1 1
2 3
2 0
3 0
4 3
df.index = [np.arange(len(df.index)), df.index]
print (df)
     0
0 1  1
1 2  3
2 2  0
3 3  0
4 4  3
df.loc[(df[0] == 3).idxmax(), 0] = 200
df = df.reset_index(level=0, drop=True)
print (df)
0
1 1
2 200
2 0
3 0
4 3
EDIT2:
Solution with double cumsum:
np.random.seed(1)
df = pd.DataFrame([4,0,4,7,4], index=[1,2,2,3,4])
print (df)
0
1 4
2 0
2 4
3 7
4 4
mask = (df[0] == 0).cumsum().cumsum()
print (mask)
1 0
2 1
2 2
3 3
4 4
Name: 0, dtype: int32
df.loc[mask == 1, 0] = 200
print (df)
0
1 4
2 200
2 4
3 7
4 4
Consider the dataframe df
df = pd.DataFrame(dict(A=[1, 2, 3, 4, 5]))
print(df)
A
0 1
1 2
2 3
3 4
4 5
Create some arbitrary slice slc
slc = df[df.A > 2]
print(slc)
A
2 3
3 4
4 5
Access the first row of slc within df by using index[0] and loc
df.loc[slc.index[0]] = 0
print(df)
A
0 1
1 2
2 0
3 4
4 5
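With a non-unique index (the situation mentioned in the question's edit), df.loc[slc.index[0]] would update every row sharing that label. A purely positional variant avoids this (a sketch, assuming at least one row matches):
import numpy as np

# position of the first row matching the condition
pos = np.flatnonzero(df.A > 2)[0]
# write to that single cell by position
df.iloc[pos, df.columns.get_loc('A')] = 0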
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(6,1), index=[1,2,2,3,3,3])
df[1] = 0
df.columns = ['a','b']
# use .loc instead of chained indexing to avoid SettingWithCopyWarning
df.loc[df['a'] >= 0.5, 'b'] = 1
# df.sort was removed in newer pandas; sort_values is the current API
df = df.sort_values(['b','a'], ascending=[0,1])
df.loc[df[df['b']==0].index.tolist()[0], 'a'] = 0
In this method, an extra copy of the dataframe is not created, but an extra column is introduced, which can be dropped after processing. To choose any index instead of the first one, you can change the last line as follows:
df.loc[df[df['b']==0].index.tolist()[n],'a']=0
in order to change the nth item in the slice.
df
a
1 0.111089
2 0.255633
2 0.332682
3 0.434527
3 0.730548
3 0.844724
df after slicing and labelling them
          a  b
1  0.111089  0
2  0.255633  0
2  0.332682  0
3  0.434527  0
3  0.730548  1
3  0.844724  1
After changing value of first item in slice (labelled as 0) to 0
          a  b
3  0.730548  1
3  0.844724  1
1  0.000000  0
2  0.255633  0
2  0.332682  0
3  0.434527  0
So, using some of the answers, I managed to find a one-liner way to do this:
np.random.seed(1)
df = pd.DataFrame(np.random.randint(4, size=(5,1)))
print df
0
0 1
1 3
2 0
3 0
4 3
df.loc[(df[0] == 0).cumsum()==1,0] = 1
0
0 1
1 3
2 1
3 0
4 3
Essentially this is using the mask inline with a cumsum.

Efficiently transform pandas dataFrame using column name as factor

I would like to transform a DataFrame produced by a piece of software into a more Python-usable one, and I can't do it in a simple way with pandas because I have to use information contained in the column names. Here is a simple example:
import pandas as pd
d = {'00' : [1],'01' : [11], '10': [111], '11':[1111]}
pd.DataFrame(d)
   00  01   10    11
0   1  11  111  1111
The column names contain the factors that I need to use in rows; I would like to get something like this:
df = {'trt': [0,0,1,1], 'grp': [0,1,0,1], 'value':[1,11,111,1111]}
pd.DataFrame(df)
   grp  trt  value
0    0    0      1
1    1    0     11
2    0    1    111
3    1    1   1111
Any ideas on how to do it properly?
A solution with MultiIndex.from_arrays, created by indexing the column names with str, then transposing with T:
df.columns = pd.MultiIndex.from_arrays([df.columns.str[0], df.columns.str[1]])
print (df)
   0        1
   0   1    0     1
0  1  11  111  1111
df1 = df.T.reset_index()
df1.columns = ['grp','trt','value']
print (df1)
  grp trt  value
0   0   0      1
1   0   1     11
2   1   0    111
3   1   1   1111
A similar solution with rename_axis and renaming the index:
d = {'00' : [1],'01' : [11], '10': [111], '11':[1111]}
df = pd.DataFrame(d)
df.columns = pd.MultiIndex.from_arrays([df.columns.str[0], df.columns.str[1]])
print(df.rename_axis(('grp','trt'), axis=1).rename(index={0:'value'}).T.reset_index())
  grp trt  value
0   0   0      1
1   0   1     11
2   1   0    111
3   1   1   1111
To me the simplest solution is just melting the original frame and splitting the column names in a second step. Something like this:
df = pd.DataFrame(d)
mf = pd.melt(df)
mf[['grp', 'trt']] = mf.pop('variable').apply(lambda x: pd.Series(tuple(x)))
Here's mf after melting:
  variable  value
0       00      1
1       01     11
2       10    111
3       11   1111
And the final result, after splitting the variable column:
   value grp trt
0      1   0   0
1     11   0   1
2    111   1   0
3   1111   1   1
I'd encourage you to read up more on melting here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html . It can be incredibly useful.
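A slight variation on the splitting step (a sketch): the str accessor can pull the two characters out of the melted 'variable' column directly, which is typically faster than apply with pd.Series on large frames:
mf = pd.melt(df)
mf['grp'] = mf['variable'].str[0]
mf['trt'] = mf['variable'].str[1]
mf = mf.drop(columns='variable')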
