I have a MultiIndex pandas DataFrame.
               Issue
                high med low
name age empId
Jack 44  Ab1       0   1   0
Bob  34  Ab2       0   0   1
Mike 52  Ab6       1   1   0
When I execute df.columns, I get the following result:
MultiIndex(levels=[['Issue'], ['high', 'med', 'low']],
labels=[[0, 0, 0], [0, 1, 2]])
I'm looking to flatten this DataFrame by renaming the MultiIndex Issue columns.
Expected output df:
name age empId Issue_high Issue_med Issue_low
Jack 44 Ab1 0 1 0
Bob 34 Ab2 0 0 1
Mike 52 Ab6 1 1 0
I tried this:
df2 = df.rename(columns={'high':'Issue_high','low':'Issue_low','med':'Issue_med'}, level = 1)
I'm getting an error:
rename() got an unexpected keyword argument "level"
Is there any way to get the output structure?
Edit: By using df.columns = df.columns.map('_'.join) I'm getting
Issue_high Issue_med Issue_low
name age empId
Jack 44 Ab1 0 1 0
Bob 34 Ab2 0 0 1
Mike 52 Ab6 1 1 0
df.columns
>>> Index(['Issue_high', 'Issue_med', 'Issue_low'], dtype='object')
Use Index.map with join if all values are strings:
df.columns = df.columns.map('_'.join)
Or use str.format or f-strings, which also work with numeric values in the levels:
df.columns = df.columns.map('{0[0]}_{0[1]}'.format)
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
Finally, use DataFrame.reset_index to convert the row MultiIndex to columns:
df = df.reset_index()
print(df)
name age empId Issue_high Issue_med Issue_low
0 Jack 44 Ab1 0 1 0
1 Bob 34 Ab2 0 0 1
2 Mike 52 Ab6 1 1 0
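Putting the steps together, here is a minimal runnable sketch; the small frame built below is just a stand-in for the original data:

```python
import pandas as pd

# Rebuild a frame shaped like the question's: a one-level 'Issue'
# column MultiIndex over a (name, age, empId) row MultiIndex
df = pd.DataFrame(
    {('Issue', 'high'): [0, 0, 1],
     ('Issue', 'med'): [1, 0, 1],
     ('Issue', 'low'): [0, 1, 0]},
    index=pd.MultiIndex.from_tuples(
        [('Jack', 44, 'Ab1'), ('Bob', 34, 'Ab2'), ('Mike', 52, 'Ab6')],
        names=['name', 'age', 'empId']),
)

# Flatten the column MultiIndex, then move the row MultiIndex to columns
df.columns = df.columns.map('_'.join)
df = df.reset_index()
print(df.columns.tolist())
# ['name', 'age', 'empId', 'Issue_high', 'Issue_med', 'Issue_low']
```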
Related
In Python, I have a df that looks like this:
Name ID
Anna 1
Sarah 2
Max 3
And a df that looks like this:
Name ID
Dan 1
Hallie 2
Cam 3
How can I merge the dfs so that the ID column looks like this?
Name ID
Anna 1
Sarah 2
Max 3
Dan 4
Hallie 5
Cam 6
This is just a minimal reproducible example; my actual data set has thousands of values. I'm basically merging data frames and want the IDs in numerical order (continuing from the previous data frame) instead of restarting from one each time.
Use pd.concat:
out = pd.concat([df1, df2.assign(ID=df2['ID'] + df1['ID'].max())], ignore_index=True)
print(out)
# Output
Name ID
0 Anna 1
1 Sarah 2
2 Max 3
3 Dan 4
4 Hallie 5
5 Cam 6
Concatenate the two DataFrames, reset the index, and use the new index to assign the IDs:
df_new = pd.concat((df1, df2)).reset_index(drop=True)
df_new['ID'] = df_new.index + 1
Output:
Name ID
0 Anna 1
1 Sarah 2
2 Max 3
3 Dan 4
4 Hallie 5
5 Cam 6
You can concat the DataFrames with ignore_index=True and then set the ID column:
df = pd.concat([df1, df2], ignore_index=True)
df['ID'] = df.index + 1
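As a self-contained sketch, both variants produce the same result on the sample frames from the question; note that renumbering from the index assumes the combined IDs should simply be 1..n, while shifting by df1's maximum preserves any gaps inside df2:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Anna', 'Sarah', 'Max'], 'ID': [1, 2, 3]})
df2 = pd.DataFrame({'Name': ['Dan', 'Hallie', 'Cam'], 'ID': [1, 2, 3]})

# Option 1: shift df2's IDs past df1's maximum before concatenating
out1 = pd.concat([df1, df2.assign(ID=df2['ID'] + df1['ID'].max())],
                 ignore_index=True)

# Option 2: concatenate first, then renumber from the fresh RangeIndex
out2 = pd.concat([df1, df2], ignore_index=True)
out2['ID'] = out2.index + 1

print(out1['ID'].tolist())  # [1, 2, 3, 4, 5, 6]
```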
I need help transforming the data as follows:
From a dataset in this version (df1)
ID apples oranges pears apples_pears oranges_pears
0 1 1 0 0 1 0
1 2 0 1 0 1 0
2 3 0 1 1 0 1
to a data set like the following (df2):
ID apples oranges pears
0 1 2 0 1
1 2 1 1 1
2 3 0 2 2
What I'm trying to accomplish is to get the total number of apples from all the columns whose name contains the word "apples". E.g. in df1 there are 2 column names containing the word "apples". If you sum up all the apples from the first row, there is a total of 2. I want a single column for apples in the new dataset (df2). Note that a 1 for apples_pears counts as a 1 for EACH of apples and pears.
The idea is to split the DataFrame in two: for the first, rename the columns to the value before the _; for the second, select the columns containing _ with DataFrame.filter and rename them to the value after the _. Finally, join both back together with concat and sum per column label:
df1 = df.set_index('ID')
df2 = df1.filter(like='_')
df1.columns = df1.columns.str.split('_').str[0]
df2.columns = df2.columns.str.split('_').str[1]
df = pd.concat([df1, df2], axis=1).sum(level=0, axis=1).reset_index()
print(df)
ID apples oranges pears
0 1 2 0 1
1 2 1 1 1
2 3 0 2 2
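Note that sum(level=..., axis=1) was removed in pandas 2.0, so here is a version-agnostic sketch of the same idea: after concatenating, the duplicated column labels are grouped on the transposed frame instead:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3],
    'apples': [1, 0, 0],
    'oranges': [0, 1, 1],
    'pears': [0, 0, 1],
    'apples_pears': [1, 1, 0],
    'oranges_pears': [0, 0, 1],
})

# Split into the plain columns and the combined ('_') columns
df1 = df.set_index('ID')
df2 = df1.filter(like='_')

# Rename each half so every combined column appears once per fruit
df1.columns = df1.columns.str.split('_').str[0]
df2.columns = df2.columns.str.split('_').str[1]

# Concatenate, then sum the duplicate labels; transposing sidesteps
# both the removed sum(level=...) and the deprecated groupby(axis=1)
combined = pd.concat([df1, df2], axis=1)
result = combined.T.groupby(level=0).sum().T.reset_index()
print(result)
```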
I have the following problem: I have a pandas DataFrame with shop IDs and shop categories, looking something like this:
id cats
0 10002718 182,45001,83079
1 10004056 9798
2 10009726 17,45528
3 10009752 64324,17
4 1001107 44607,83520,76557
... ... ...
24922 9992184 45716
24923 9997866 77063
24924 9998461 45001,44605,3238,72627,83785
24925 9998954 69908,78574,77890
24926 9999728 45653,44605,83648,85023,84481,68822
So the problem is that each shop can have multiple categories, and the task is to count the frequency of each category. What's the easiest way to do it?
In the end I need to have a DataFrame with the columns:
cats count
0 1 133
1 2 1
2 3 15
3 4 12
Use Series.str.split with Series.explode and Series.value_counts:
df1 = (df['cats'].str.split(',')
.explode()
.value_counts()
.rename_axis('cats')
.reset_index(name='count'))
Or add expand=True to split into a DataFrame and use DataFrame.stack:
df1 = (df['cats'].str.split(',', expand=True)
.stack()
.value_counts()
.rename_axis('cats')
.reset_index(name='count'))
print(df1.head(10))
cats count
0 17 2
1 44605 2
2 45001 2
3 83520 1
4 64324 1
5 44607 1
6 45653 1
7 69908 1
8 83785 1
9 83079 1
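A self-contained sketch of the explode variant on a few of the sample rows (Series.explode needs pandas 0.25+):

```python
import pandas as pd

df = pd.DataFrame({
    'id': [10002718, 10004056, 10009726, 10009752],
    'cats': ['182,45001,83079', '9798', '17,45528', '64324,17'],
})

# Split each comma-separated string into a list, flatten one
# category per row, and count occurrences of each category
df1 = (df['cats'].str.split(',')
                 .explode()
                 .value_counts()
                 .rename_axis('cats')
                 .reset_index(name='count'))
print(df1)
```

value_counts sorts descending, so '17' (the only category appearing twice in this sample) ends up in the first row.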
I would like to transform a DataFrame produced by a piece of software into a more Python-usable one, and I can't fix it in a simple way with pandas because I have to use information contained in the column names. Here is a simple example:
import pandas as pd
d = {'00' : [1],'01' : [11], '10': [111], '11':[1111]}
pd.DataFrame(d)
00 01 10 11
0 1 11 111 1111
The column names contain the factors that I need to use in rows; I would like to get something like this:
df = {'trt': [0,0,1,1], 'grp': [0,1,0,1], 'value':[1,11,111,1111]}
pd.DataFrame(df)
grp trt value
0 0 0 1
1 1 0 11
2 0 1 111
3 1 1 1111
Any ideas on how to do it properly?
Solution with MultiIndex.from_arrays, created by indexing the column names with str, then transposing with T:
df.columns = pd.MultiIndex.from_arrays([df.columns.str[0], df.columns.str[1]])
print(df)
   0        1
   0   1    0     1
0  1  11  111  1111
df1 = df.T.reset_index()
df1.columns = ['grp','trt','value']
print(df1)
grp trt value
0 0 0 1
1 0 1 11
2 1 0 111
3 1 1 1111
A similar solution with rename_axis and renaming the index:
d = {'00' : [1],'01' : [11], '10': [111], '11':[1111]}
df = pd.DataFrame(d)
df.columns = pd.MultiIndex.from_arrays([df.columns.str[0], df.columns.str[1]])
print(df.rename_axis(('grp','trt'), axis=1).rename(index={0:'value'}).T.reset_index())
grp trt value
0 0 0 1
1 0 1 11
2 1 0 111
3 1 1 1111
To me the simplest solution is just melting the original frame and splitting the column names in a second step. Something like this:
df = pd.DataFrame(d)
mf = pd.melt(df)
mf[['grp', 'trt']] = mf.pop('variable').apply(lambda x: pd.Series(tuple(x)))
Here's mf after melting:
variable value
0 00 1
1 01 11
2 10 111
3 11 1111
And the final result, after splitting the variable column:
value grp trt
0 1 0 0
1 11 0 1
2 111 1 0
3 1111 1 1
I'd encourage you to read up more on melting here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html. It can be incredibly useful.
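The whole melt pipeline is short enough to run end to end; this sketch uses the str accessor for the splitting step as an alternative to the apply/tuple trick above:

```python
import pandas as pd

d = {'00': [1], '01': [11], '10': [111], '11': [1111]}
df = pd.DataFrame(d)

# Wide -> long: one row per original column
mf = pd.melt(df)

# Split the two-character names into the two factor columns
var = mf.pop('variable')
mf['grp'] = var.str[0]
mf['trt'] = var.str[1]
print(mf)
```

The grp and trt values come out as strings here; cast with astype(int) if numeric factors are needed.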
I have a dataframe with about 100 columns that looks like this:
Id Economics-1 English-107 English-2 History-3 Economics-zz Economics-2 \
0 56 1 1 0 1 0 0
1 11 0 0 0 0 1 0
2 6 0 0 1 0 0 1
3 43 0 0 0 1 0 1
4 14 0 1 0 0 1 0
Histo Economics-51 Literature-re Literatureu4
0 1 0 1 0
1 0 0 0 1
2 0 0 0 0
3 0 1 1 0
4 1 0 0 0
My goal is to leave only global categories -- English, History, Literature -- and write the sum of the value of their components, respectively, in this dataframe. For instance, "English" would be the sum of "English-107" and "English-2":
Id Economics English History Literature
0 56 1 1 2 1
1 11 1 0 0 1
2 6 0 1 1 0
3 43 2 0 1 1
4 14 0 1 1 0
For this purpose, I have tried two methods. First method:
df = pd.read_csv(file_path, sep='\t')
df['History'] = df.loc[df[df.columns[pd.Series(df.columns).str.startswith('History')]].sum(axes=1)]
Second method:
df = pd.read_csv(file_path, sep='\t')
filter_col = [col for col in list(df) if col.startswith('History')]
df['History'] = 0 # initialize value, otherwise throws KeyError
for c in df[filter_col]:
df['History'] = df[filter_col].sum(axes=1)
print df['History', df[filter_col]]
However, both give the error:
TypeError: 'DataFrame' objects are mutable, thus they cannot be
hashed
My question is: how can I debug this error, or is there another solution to my problem? Note that I have a rather large DataFrame with about 100 columns and 400,000 rows, so I'm looking for an optimized solution, like using loc in pandas.
I'd suggest that you do something different, which is to perform a transpose, groupby the prefix of the rows (your original columns), sum, and transpose again.
Consider the following:
df = pd.DataFrame({
'a_a': [1, 2, 3, 4],
'a_b': [2, 3, 4, 5],
'b_a': [1, 2, 3, 4],
'b_b': [2, 3, 4, 5],
})
Now
[s.split('_')[0] for s in df.T.index.values]
is the prefix of the columns. So
>>> df.T.groupby([s.split('_')[0] for s in df.T.index.values]).sum().T
a b
0 3 3
1 5 5
2 7 7
3 9 9
does what you want.
In your case, make sure to split using the '-' character.
You can use this to create the sum of columns starting with a specific name:
df['Economics'] = df[list(df.filter(regex='Economics'))].sum(axis=1)
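The same filter-and-sum pattern extends to every prefix with a small loop. A sketch on a cut-down frame, assuming '-' always separates the category from its suffix; anchoring the regex with ^ and the separator avoids accidental substring matches:

```python
import pandas as pd

df = pd.DataFrame({
    'Id': [56, 11],
    'Economics-1': [1, 0],
    'Economics-2': [0, 1],
    'English-107': [1, 0],
    'English-2': [0, 0],
})

# Collect the distinct prefixes, then sum each prefix's columns
prefixes = {c.split('-')[0] for c in df.columns if '-' in c}
out = df[['Id']].copy()
for p in sorted(prefixes):
    out[p] = df.filter(regex=f'^{p}-').sum(axis=1)
print(out)
```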
Using DSM's brilliant idea:
from __future__ import print_function
import pandas as pd
categories = set(['Economics', 'English', 'Histo', 'Literature'])
def correct_categories(cols):
return [cat for col in cols for cat in categories if col.startswith(cat)]
df = pd.read_csv('data.csv', sep=r'\s+', index_col='Id')
#print(df)
print(df.groupby(correct_categories(df.columns),axis=1).sum())
Output:
Economics English Histo Literature
Id
56 1 1 2 1
11 1 0 0 1
6 1 1 0 0
43 2 0 1 1
14 1 1 1 0
Here is another version, which takes care of the "Histo/History" problem:
from __future__ import print_function
import pandas as pd
#categories = set(['Economics', 'English', 'Histo', 'Literature'])
#
# mapping: common starting pattern: desired name
#
categories = {
'Histo': 'History',
'Economics': 'Economics',
'English': 'English',
'Literature': 'Literature'
}
def correct_categories(cols):
return [categories[cat] for col in cols for cat in categories.keys() if col.startswith(cat)]
df = pd.read_csv('data.csv', sep=r'\s+', index_col='Id')
#print(df.columns, len(df.columns))
#print(correct_categories(df.columns), len(correct_categories(df.columns)))
#print(df.groupby(pd.Index(correct_categories(df.columns)),axis=1).sum())
rslt = df.groupby(correct_categories(df.columns),axis=1).sum()
print(rslt)
print('History\n', rslt['History'])
Output:
Economics English History Literature
Id
56 1 1 2 1
11 1 0 0 1
6 1 1 0 0
43 2 0 1 1
14 1 1 1 0
History
Id
56 2
11 0
6 0
43 1
14 1
Name: History, dtype: int64
P.S. You may want to add missing categories to the categories map/dictionary.