Dataframe set_index produces duplicate index values instead of doing hierarchical grouping

I have a dataframe that looks like this (index not shown)
Time Letter Type Value
0 A x 10
0 B y 20
1 A y 30
1 B x 40
3 C x 50
I want to produce a dataframe that looks like this:
Time Letter TypeX TypeY
0    A      10    20
0    B            20
1    A            30
1    B      40
3    C      50
To do that, I decided to first create a table with multiple index levels (Time, Letter) and then unstack the last index level, Type.
Let's say my original dataframe was named my_table:
I ran my_table.reset_index().set_index(['Time', 'Letter']). Instead of grouping the rows so that both Type x and Type y appear under every (Time, Letter) pair, the result seems to have been sorted by Type (I added a few more entries to demonstrate the point):
Time(i) Letter(i) Type Value
0       A         x    10
        D         x    25
        H         x    15
        G         x    33
1       B         x    40
        G         x    10
3       C         x    50
0       B         y    20
        H         y    10
1       A         y    30
Why does this happen? I expected a result like so:
Time Letter Type Value
0    A      x    10
            y    30
     B      y    20
     H      x    15
            y    10
     D      x    25
     G      x    33
1    B      x    40
     G      x    10
3    C      x    50
The same behavior occurs when I make Type one of the indices; it is merely displayed in bold as an index.
How do I group the rows by Time and Letter so that the x and y values are matched on those columns and I can use unstack?

You need to set Type as an index level as well, then unstack it:
df.set_index(['Time','Letter','Type']).Value.unstack(fill_value='').reset_index()
Out[178]:
Type  Time Letter   x   y
0        0      A  10
1        0      B      20
2        1      A      30
3        1      B  40
4        3      C  50
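Some background on why this works: set_index only relabels the rows and keeps them in their existing order; it does not group or sort anything. unstack matches values by their full index label, so Type has to be an index level before it can be moved into columns. The following is a minimal sketch on the question's five rows. The renaming to TypeX/TypeY is added here to match the desired output, and missing combinations are left as NaN instead of being filled with ''.

```python
import pandas as pd

# The five rows from the question.
my_table = pd.DataFrame({
    "Time": [0, 0, 1, 1, 3],
    "Letter": ["A", "B", "A", "B", "C"],
    "Type": ["x", "y", "y", "x", "x"],
    "Value": [10, 20, 30, 40, 50],
})

wide = (
    my_table.set_index(["Time", "Letter", "Type"])["Value"]
    .unstack("Type")                                  # Type level -> columns x, y
    .rename(columns={"x": "TypeX", "y": "TypeY"})     # names chosen for illustration
    .reset_index()
)
wide.columns.name = None  # drop the leftover "Type" label on the column axis
print(wide)
```

If a (Time, Letter, Type) combination appears more than once, unstack raises an error because the index is not unique; in that case the duplicates have to be aggregated first.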

Related

Top N rows vs rows with Top N unique values using pandas

I have a pandas dataframe like the one shown below:
import pandas as pd
data ={'Count':[1,1,2,3,4,2,1,1,2,1,3,1,3,6,1,1,9,3,3,6,1,5,2,2,0,2,2,4,0,1,3,2,5,0,3,3,1,2,2,1,6,2,3,4,1,1,3,3,4,3,1,1,4,2,3,0,2,2,3,1,3,6,1,8,4,5,4,2,1,4,1,1,1,2,3,4,1,1,1,3,2,0,6,2,3,2,9,10,2,1,2,3,1,2,2,3,2,1,8,4,0,3,3,5,12,1,5,13,6,13,7,3,5,2,3,3,1,1,5,15,7,9,1,1,1,2,2,2,4,3,3,2,4,1,2,9,3,1,3,0,0,4,0,1,0,1,0]}
df = pd.DataFrame(data)
I would like to get the following:
a) The top 5 rows (this returns exactly 5 rows)
b) The rows with the top 5 unique values (this can return more than 5 rows if the top 5 values repeat). In my example screenshot, selecting the top 5 unique values gives 8 rows.
I am able to get the top 5 rows with:
df.nlargest(5,['Count'])
However, when I try the following for b), I don't get the expected output:
df.nlargest(5,['Count'],keep='all')
I expect my output to look like that screenshot (not reproduced here).

Are you after the top 5 unique values or the five largest values?
import numpy as np
df = df.assign(
    top5rows=np.where(df.index.isin(df.head(5).index), 'Y', 'N'),
    top5unique=np.where(df.index.isin(df.drop_duplicates(keep='first').head(5).index), 'Y', 'N'))
Or did you need this?
df = df.assign(
    top5rows=np.where(df.index.isin(df.head(5).index), 'Y', 'N'),
    top5unique=np.where(df['Count'].isin(list(df['Count'].unique()[:5])), 'Y', 'N'))
Count top5rows top5unique
0 1 Y Y
1 1 Y Y
2 2 Y Y
3 3 Y Y
4 4 Y Y
5 2 N Y
6 1 N Y
7 1 N Y
8 2 N Y
9 1 N Y
10 3 N Y
11 1 N Y
12 3 N Y
13 6 N Y
14 1 N Y
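For the "largest" reading of the question: nlargest(5, 'Count', keep='all') only keeps extra rows that tie with the fifth-largest row; it does not count distinct values. One possible approach is to take the five largest distinct values first and then filter with isin. A small sketch, using made-up sample data (the Count values below are hypothetical, not the question's data):

```python
import pandas as pd

# Hypothetical sample data, chosen so that some of the top values repeat.
df = pd.DataFrame({"Count": [1, 9, 3, 9, 7, 5, 7, 2, 8, 6]})

# a) Top 5 rows: exactly five rows, repeated values count separately.
top_rows = df.nlargest(5, "Count")

# b) Rows holding the five largest distinct values: may be more than five rows.
top_values = df["Count"].drop_duplicates().nlargest(5)
top_unique_rows = df[df["Count"].isin(top_values)]

print(top_rows)
print(top_unique_rows)
```

Here a) returns the values 9, 9, 8, 7, 7, while b) returns every row whose value is one of 9, 8, 7, 6, 5.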

pd.Dataframe.update puts the result at the top of the dataframe

Let's say I have two dataframes like this:
n = {'x':['a','b','c','d','e'], 'y':['1','2','3','4','5'],'z':['0','0','0','0','0']}
nf = pd.DataFrame(n)
m = {'x':['b','d','e'], 'z':['10','100','1000']}
mf = pd.DataFrame(m)
I want to replace the zeroes in the z column of nf with the values from the z column of mf, but only in the rows whose key in column x matches.
When I call
nf.update(mf)
I get
x y z
b 1 10
d 2 100
e 3 1000
d 4 0
e 5 0
instead of the desired output
x y z
a 1 0
b 2 10
c 3 0
d 4 100
e 5 1000

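To solve this, you need to make the indexes of both dataframes match, because update aligns rows on the index. Here is how you can do it:
import pandas as pd
n = {'x':['a','b','c','d','e'], 'y':['1','2','3','4','5'],'z':['0','0','0','0','0']}
nf = pd.DataFrame(n).set_index('x')  # use the key column as the index
m = {'x':['b','d','e'], 'z':['10','100','1000']}
mf = pd.DataFrame(m).set_index('x')
nf.update(mf)  # rows are matched on x; nf is modified in place
nf = nf.reset_index()  # turn x back into a regular column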
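The difference between the two behaviors can be checked side by side. The sketch below uses the question's data; the variable names positional and keyed are introduced here for illustration only.

```python
import pandas as pd

nf = pd.DataFrame({"x": list("abcde"), "y": list("12345"), "z": ["0"] * 5})
mf = pd.DataFrame({"x": ["b", "d", "e"], "z": ["10", "100", "1000"]})

# With the default RangeIndex, rows 0..2 of mf overwrite rows 0..2 of nf,
# including the x column, which reproduces the unwanted output.
positional = nf.copy()
positional.update(mf)

# With x as the index, rows are matched by key instead of by position.
keyed = nf.set_index("x")
keyed.update(mf.set_index("x"))
keyed = keyed.reset_index()

print(positional)
print(keyed)
```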

Function for bar plot in python

I am trying to write a function for a bar plot like the one shown below, for Category and Group, based on the index. The problem is that the function has to split the X and Y values of Index and plot the graphs for Category and Group separately.
Index Group Category Population
X A 5 12
X A 5 34
Y B 5 23
Y B 5 34
Y B 6 33
X A 6 44
Y C 7 12
X C 7 23
Y A 8 12
Y A 8 4
X B 8 56
Y B 9 67
X B 10 23
Y A 8 45
X C 9 34
X C 9 56
Here, Men and Women correspond to the Index values X and Y in my case.
I have tried many different ways but have not been able to solve this. Any help would be appreciated.

Not sure if this is what you are looking for, but it's the easiest way to plot multi-indices IMO:
df["Index"] = df["Index"].map({"X":"Male", "Y": "Female"})
df_ = df.groupby(["Group","Category","Index"]).mean().unstack()
df_.plot.bar()
This produces a grouped bar chart with a Male and a Female bar for each (Group, Category) combination.
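The table that plot.bar() draws from can be inspected without plotting. The sketch below rebuilds the question's 16 rows and applies the same groupby and unstack; selecting the Population column explicitly and naming the result wide are choices made here for illustration. The plotting call is left commented out because it needs matplotlib.

```python
import pandas as pd

# The 16 rows from the question.
df = pd.DataFrame({
    "Index": list("XXYYYXYXYYXYXYXX"),
    "Group": list("AABBBACCAABBBACC"),
    "Category": [5, 5, 5, 5, 6, 6, 7, 7, 8, 8, 8, 9, 10, 8, 9, 9],
    "Population": [12, 34, 23, 34, 33, 44, 12, 23, 12, 4, 56, 67, 23, 45, 34, 56],
})

df["Index"] = df["Index"].map({"X": "Male", "Y": "Female"})

# One row per (Group, Category), one column per sex, mean Population as value.
wide = df.groupby(["Group", "Category", "Index"])["Population"].mean().unstack("Index")
print(wide)

# wide.plot.bar()  # draws one bar per column for every (Group, Category) row
```

Combinations that occur for only one sex show up as NaN and are drawn as missing bars.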

Pandas: batch substitution of values from different rows meeting the same criteria

I have extracted some data from a SQL server into a pandas dataframe. The structure looks like this:
df = pd.DataFrame({'Day':(1,2,3,4,1,2,3,4),'State':('A','A','A','A','B','B','B','B'),'Direction':('N','S','N','S','N','S','N','S'),'values':(12,34,22,37,14,16,23,43)})
>>> df
Day Direction State values
0 1 N A 12
1 2 S A 34
2 3 N A 22
3 4 S A 37
4 1 N B 14
5 2 S B 16
6 3 N B 23
7 4 S B 43
Now I want to replace every value where State == 'A' with itself plus the value that has the same Day and the same Direction but State == 'B'. For example:
df.loc[(df.Day == 1) & (df.Direction == 'N') & (df.State == 'A'),'values'] = df.loc[(df.Day == 1) & (df.Direction == 'N') & (df.State == 'A'),'values'].values + df.loc[(df.Day == 1) & (df.Direction == 'N') & (df.State == 'B'),'values'].values
>>> df
Day Direction State values
0 1 N A 26
1 2 S A 34
2 3 N A 22
3 4 S A 37
4 1 N B 14
5 2 S B 16
6 3 N B 23
7 4 S B 43
Notice that the value in the first row has changed from 12 to 26 (12 + 14).
Since the values come from different rows, it seems difficult to use combine_first.
Right now I use two loops (over 'Day' and over 'Direction') with the assignment above, which is extremely slow when the dataframe gets big. Is there a smarter, more efficient way to do this?

You can first define a function that adds the value from B to A within a group, then apply this function to each group.
def f(x):
    # add the first B value of this group to every A row of the group
    x.loc[x.State=='A','values'] += x.loc[x.State=='B','values'].iloc[0]
    return x
df.groupby(['Day','Direction']).apply(f)
Out[94]:
Day Direction State values
0 1 N A 26
1 2 S A 50
2 3 N A 45
3 4 S A 80
4 1 N B 14
5 2 S B 16
6 3 N B 23
7 4 S B 43
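For large dataframes, the same result can be obtained without calling a Python function per group. The following is one possible vectorized sketch, not the answer's method: it looks up the B value for each A row by its (Day, Direction) key. It assumes each (Day, Direction) pair has at most one B row; A rows without a matching B row are left unchanged.

```python
import pandas as pd

df = pd.DataFrame({
    "Day": [1, 2, 3, 4, 1, 2, 3, 4],
    "State": list("AAAABBBB"),
    "Direction": list("NSNSNSNS"),
    "values": [12, 34, 22, 37, 14, 16, 23, 43],
})

keys = ["Day", "Direction"]

# B values indexed by (Day, Direction); assumed unique per key.
b_values = df.loc[df["State"] == "B"].set_index(keys)["values"]

# Look up the matching B value for every A row and add it in one step.
is_a = df["State"] == "A"
a_keys = pd.MultiIndex.from_frame(df.loc[is_a, keys])
df.loc[is_a, "values"] += b_values.reindex(a_keys, fill_value=0).to_numpy()

print(df)
```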

populate new column in a pandas dataframe which takes input from other columns

I have a function that takes x, y, and z as input and returns r as output.
For example, my_func(x, y, z) takes x = 10, y = 'apple', and z = 2 and returns the value for column r. Similarly, the function takes x = 20, y = 'orange', and z = 4 and populates column r. What would be an efficient way to code this?
Before :
a x y z
5 10 'apple' 2
2 20 'orange' 4
0 4 'apple' 2
5 5 'pear' 6
After:
a x y z r
5 10 'apple' 2 x
2 20 'orange' 4 x
10 4 'apple' 2 x
5 5 'pear' 6 x

Depends on how complex your function is. In general you can use pandas.DataFrame.apply:
>>> def my_func(x):
...     return '{0} - {1} - {2}'.format(x['y'],x['a'],x['x'])
...
>>> df['r'] = df.apply(my_func, axis=1)
>>> df
a x y z r
0 5 10 'apple' 2 'apple' - 5 - 10
1 2 20 'orange' 4 'orange' - 2 - 20
2 0 4 'apple' 2 'apple' - 0 - 4
3 5 5 'pear' 6 'pear' - 5 - 5
axis=1 makes your function run 'for each row' instead of 'for each column':
Objects passed to functions are Series objects having index either the
DataFrame’s index (axis=0) or the columns (axis=1)
But if it's a really simple function, like the one above, you can probably do it without a function at all, using vectorized operations.
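As an illustration of the vectorized route, the string built by my_func can also be produced with column-wise operations. This sketch assumes the four-row dataframe from the question (with plain strings in y); the column names r_apply and r_vectorized are made up here to compare the two approaches.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [5, 2, 0, 5],
    "x": [10, 20, 4, 5],
    "y": ["apple", "orange", "apple", "pear"],
    "z": [2, 4, 2, 6],
})

# Row-wise: the function is called once per row (flexible, but slower).
df["r_apply"] = df.apply(lambda row: f"{row['y']} - {row['a']} - {row['x']}", axis=1)

# Vectorized: whole columns are combined at once.
df["r_vectorized"] = df["y"] + " - " + df["a"].astype(str) + " - " + df["x"].astype(str)

print(df)
```

Both columns hold the same strings; the vectorized form avoids the per-row Python call, which usually matters once the dataframe is large.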
