how to transpose multiple level pandas dataframe based only on outer index - python

Below is my dataframe with two-level indexing. I want only the outer index to be transposed as columns, so the desired output is a 2x2 dataframe instead of the current 4x1 dataframe. Can anyone please help?
        0
0 0   232
  1  3453
1 0   443
  1  3241

Given that you have the MultiIndex, you can use unstack() on level 0:
import pandas as pd
import numpy as np
index = pd.MultiIndex.from_tuples([(0,0),(0,1),(1,0),(1,1)])
df = pd.DataFrame([[1],[2],[3],[4]] , index=index, columns=[0])
print(df.unstack(level=0))
   0
   0  1
0  1  3
1  2  4

One way to do this would be to reset the index and then pivot the table, using level_1 of the index as the new index, level_0 as the columns, and 0 as the values. Example -
df.reset_index().pivot(index='level_1',columns='level_0',values=0)
Demo -
In [66]: index = pd.MultiIndex.from_tuples([(0,0),(0,1),(1,0),(1,1)])
In [67]: df = pd.DataFrame([[1],[2],[3],[4]] , index=index, columns=[0])
In [68]: df
Out[68]:
     0
0 0  1
  1  2
1 0  3
  1  4
In [69]: df.reset_index().pivot(index='level_1',columns='level_0',values=0)
Out[69]:
level_0  0  1
level_1
0        1  3
1        2  4
Later on, if you want, you can set the .name attribute of the index and the columns to an empty string (or whatever you want) if you don't want the level_* labels there.
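Putting the pivot approach together with that cleanup, here is a minimal self-contained sketch (same toy data as the demo above), clearing the leftover level_* labels by setting the names to None:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([(0, 0), (0, 1), (1, 0), (1, 1)])
df = pd.DataFrame([[1], [2], [3], [4]], index=index, columns=[0])

# Pivot: inner index level becomes the rows, outer level becomes the columns
result = df.reset_index().pivot(index='level_1', columns='level_0', values=0)

# Drop the leftover level_* labels on both axes
result.index.name = None
result.columns.name = None
print(result)
```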


Pandas count based on condition in current row from records before current row

I have a dataframe as follows:
import pandas as pd
df=pd.DataFrame({'col1':['a','a','a','b','b','c']})
df.sort_values('col1', inplace=True)
df['Ref']=0
Thus the dataframe looks like:
a 0
a 0
a 0
b 0
b 0
c 0
For the Ref column, I want to show the number of previous occurrences of the current row's value. For illustration purposes, the following is what I want to achieve:
a 0
a 1
a 2
b 0
b 1
c 0
I can use df.iterrows() and loop row by row. Unfortunately, in my case it takes 15 minutes to run. I am wondering if there is a more reasonable way to do this.
Group the data by col1 and use cumcount:
import pandas as pd
df = pd.DataFrame({'col1':['a','a','a','b','b','c']})
df['Ref'] = df.groupby('col1').cumcount()
df.sort_values('col1', inplace=True)
Output:
>>> df
col1 Ref
0 a 0
1 a 1
2 a 2
3 b 0
4 b 1
5 c 0
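As a quick self-contained check that cumcount numbers each row within its group starting from zero (i.e. it counts prior occurrences of the value):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'a', 'b', 'b', 'c']})
# Number each row within its col1 group, starting at 0
df['Ref'] = df.groupby('col1').cumcount()
print(df)
```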

Why does df.isnull().sum() work the way it does?

When I do df.isnull().sum(), I get the count of null values in a column. But the default axis for .sum() is None, or 0 - which should be summing across the columns.
Why does .sum() calculate the sum down the columns, instead of the rows, when the default says to sum across axis = 0?
Thanks!
I'm seeing the opposite behavior to what you described:
Sums across the columns
In [3309]: df1.isnull().sum(1)
Out[3309]:
0 0
1 1
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 0
dtype: int64
Sums down the columns
In [3310]: df1.isnull().sum()
Out[3310]:
date 0
variable 1
value 0
dtype: int64
Hmm, this is not the behavior I'm seeing either. Let's look at a small example.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[np.nan, np.nan, 3],'B':[1,1,3]}, index =[*'abc'])
print(df)
print(df.isnull().sum())
print(df.sum())
Note that the columns are uppercase 'A' and 'B', and the row index labels are lowercase.
Output:
     A  B
a  NaN  1
b  NaN  1
c  3.0  3
A 2
B 0
dtype: int64
A 3.0
B 5.0
dtype: float64
Per docs:
axis : {index (0), columns (1)} Axis for the function to be applied
on.
The axis parameter names the axis that the function is applied along and collapses, which is orthogonal to the axis the results are labeled by: axis=0 sums down each column (over the index), leaving one value per column.
Unfortunately, the pandas documentation for sum doesn't currently make this clear, but the documentation for count does:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.count.html
Parameters
axis{0 or ‘index’, 1 or ‘columns’}, default 0
If 0 or ‘index’ counts are generated for each column. If 1 or ‘columns’ counts are generated for each row.
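To make the two directions concrete, here is a minimal sketch on the same toy data as above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, np.nan, 3], 'B': [1, 1, 3]}, index=list('abc'))

# axis=0 (the default): collapse the index -> one null count per column
per_column = df.isnull().sum(axis=0)

# axis=1: collapse the columns -> one null count per row
per_row = df.isnull().sum(axis=1)
```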

Subsetting pandas dataframe and retain original size

I am trying to subset a dataframe, but I want the new dataframe to have the same size as the original dataframe.
Attaching the input, output and the expected output.
import pandas as pd

df_input = pd.DataFrame([[1,2,3,4,5], [2,1,4,7,6], [5,6,3,7,0]], columns=["A", "B","C","D","E"])
df_output = pd.DataFrame(df_input.iloc[1:2,:])
df_expected_output = pd.DataFrame([[0,0,0,0,0], [2,1,4,7,6], [0,0,0,0,0]], columns=["A", "B","C","D","E"])
Please suggest the way forward.
After you subset, set the index back to the original with reindex. This sets all the values for the new rows to NaN, which you can replace with 0 via fillna. Since NaN is a float type, you can convert everything back to int with astype.
df_input.iloc[1:2,:].reindex(df_input.index).fillna(0).astype(int)
Setup
df = pd.DataFrame([[1,2,3,4,5], [2,1,4,7,6], [5,6,3,7,0]], columns=["A", "B","C","D","E"])
output = df.iloc[1:2,:]
You can create a mask and use multiplication:
m = df.index.isin(output.index)
m[:, None] * df
A B C D E
0 0 0 0 0 0
1 2 1 4 7 6
2 0 0 0 0 0
I will use where + between:
df_input.where(df_input.index.to_series().between(1,1),other=0)
Out[611]:
A B C D E
0 0 0 0 0 0
1 2 1 4 7 6
2 0 0 0 0 0
One more option would be to create a DataFrame of zeros and then update it with the df_input slice:
df_output = pd.DataFrame(0, index=df_input.index, columns = df_input.columns)
df_output.update(df_input.iloc[1:2,:])
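For completeness, here is the reindex approach from the first answer as a self-contained snippet on the question's data:

```python
import pandas as pd

df_input = pd.DataFrame([[1, 2, 3, 4, 5], [2, 1, 4, 7, 6], [5, 6, 3, 7, 0]],
                        columns=["A", "B", "C", "D", "E"])

# Keep only row 1, re-expand to the original index, then zero-fill the dropped rows
result = df_input.iloc[1:2, :].reindex(df_input.index).fillna(0).astype(int)
print(result)
```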

Add columns to pandas dataframe containing max of each row, AND corresponding column name

My system
Windows 7, 64 bit
python 3.5.1
The challenge
I've got a pandas dataframe, and I would like to know the maximum value in each row and append that info as a new column. I would also like to add another column to the existing dataframe containing the name of the column where that maximum value is located.
A similar question has been asked and answered for R in this post.
Reproducible example
In[1]:
# Make pandas dataframe
df = pd.DataFrame({'a':[1,0,0,1,3], 'b':[0,0,1,0,1], 'c':[0,0,0,0,0]})
# Calculate max
my_series = df.max(numeric_only=True, axis = 1)
my_series.name = "maxval"
# Include maxval in df
df = df.join(my_series)
df
Out[1]:
a b c maxval
0 1 0 0 1
1 0 0 0 0
2 0 1 0 1
3 1 0 0 1
4 3 1 0 3
So far so good. Now for the part about adding another column containing the name of the column with the max value:
In[2]:
?
?
?
# This is what I'd like to accomplish:
Out[2]:
a b c maxval maxcol
0 1 0 0 1 a
1 0 0 0 0 a,b,c
2 0 1 0 1 b
3 1 0 0 1 a
4 3 1 0 3 a
Notice that I'd like to return all column names if multiple columns contain the same maximum value. Also please notice that the column maxval is not included in maxcol since that would not make much sense. Thanks in advance if anyone out there finds this interesting.
You can compare the df against maxval using eq with axis=0, then use apply with a lambda to produce a boolean mask to mask the columns and join them:
In [183]:
df['maxcol'] = df.loc[:,:'c'].eq(df['maxval'], axis=0).apply(lambda x: ','.join(df.columns[:3][x==x.max()]),axis=1)
df
Out[183]:
a b c maxval maxcol
0 1 0 0 1 a
1 0 0 0 0 a,b,c
2 0 1 0 1 b
3 1 0 0 1 a
4 3 1 0 3 a
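Note that .ix has since been removed from pandas; a sketch of the same idea on current versions, using an explicit list of value columns and a boolean mask:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 0, 0, 1, 3], 'b': [0, 0, 1, 0, 1], 'c': [0, 0, 0, 0, 0]})
value_cols = ['a', 'b', 'c']

df['maxval'] = df[value_cols].max(axis=1)

# True wherever a cell equals its row's max; join the matching column names
is_max = df[value_cols].eq(df['maxval'], axis=0)
df['maxcol'] = is_max.apply(lambda row: ','.join(row.index[row]), axis=1)
print(df)
```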

Group by value of sum of columns with Pandas

I got lost in the pandas docs trying to figure out a way to group a DataFrame by the sums of its columns.
for instance, let say I have the following data :
In [2]: dat = {'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]}
In [3]: df = pd.DataFrame(dat)
In [4]: df
Out[4]:
a b c d
0 1 0 1 2
1 0 1 0 3
2 0 0 0 4
I would like columns a, b and c to be grouped, since their sums are all equal to 1. The resulting DataFrame's column labels would be the sums of the columns that were combined, like this:
   1  9
0  2  2
1  1  3
2  0  4
Any idea to point me in the right direction? Thanks in advance!
Here you go:
In [57]: df.groupby(df.sum(), axis=1).sum()
Out[57]:
   1  9
0  2  2
1  1  3
2  0  4
[3 rows x 2 columns]
df.sum() is your grouper. It sums over the 0 axis (the index), giving you the two groups: 1 (columns a, b, and c) and 9 (column d). You want to group the columns (axis=1) and take the sum of each group.
Because pandas is designed with database concepts in mind, it really expects information to be stored together in rows, not in columns. Because of this, it's usually more elegant to do things row-wise. Here's how to solve your problem row-wise:
dat = {'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]}
df = pd.DataFrame(dat)
df = df.transpose()
df['totals'] = df.sum(1)
print(df.groupby('totals').sum().transpose())
#totals 1 9
#0 2 2
#1 1 3
#2 0 4
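Note that groupby(..., axis=1) has been deprecated in recent pandas versions; an equivalent sketch on current versions is to transpose, group the rows by the column sums, and transpose back:

```python
import pandas as pd

dat = {'a': [1, 0, 0], 'b': [0, 1, 0], 'c': [1, 0, 0], 'd': [2, 3, 4]}
df = pd.DataFrame(dat)

# df.sum() gives one sum per column; grouping df.T's rows by it combines
# columns with equal sums, then .T restores the original orientation
result = df.T.groupby(df.sum()).sum().T
print(result)
```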
