dataframe pivot with pandas - python

I am trying to pivot df111 into df222:
df111:
   ID1 ID2 Type  Value
0    1   a    X      1
1    1   a    Y      2
2    1   b    X      3
3    1   b    Y      4
4    2   a    X      5
5    2   a    Y      6
6    2   b    X      7
7    2   b    Y      8

df222:
   ID1 ID2  X Value  Y Value
0    1   a        1        2
1    1   b        3        4
2    2   a        5        6
3    2   b        7        8
I tried with df111.pivot() and df111.groupby() but no luck. Can someone throw me a one-liner? Thanks

You can do it by first calling set_index on the first three columns and then unstack. To match the exact output, rename the columns by keeping the second level, then reset_index, such as:
df222 = df111.set_index(['ID1', 'ID2', 'Type']).unstack()
df222.columns = [col[1] + ' Value' for col in df222.columns]
df222 = df222.reset_index()
print(df222)
ID1 ID2 X Value Y Value
0 1 a 1 2
1 1 b 3 4
2 2 a 5 6
3 2 b 7 8
And if you want to do it with method chaining:
df222 = df111.set_index(['ID1', 'ID2', 'Type']).Value.unstack()\
             .rename(columns={'X': 'X Value', 'Y': 'Y Value'})\
             .rename_axis(None, axis="columns")\
             .reset_index()

If there is a pivot_table function, why provide pivot as well? This is just confusing ...
df333 = pd.pivot_table(df111, index=['ID1','ID2'], columns=['Type'], values='Value')
df333.reset_index()
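The pivot_table result keeps plain X and Y as column headers (with Type left over as the column-axis name). To match the asker's exact headers, a small follow-up could rename them; this is a sketch, not part of the original answer, and df444 is just an illustrative name:
df444 = (df333.add_suffix(' Value')               # X -> 'X Value', Y -> 'Y Value'
              .rename_axis(None, axis='columns')  # drop the leftover 'Type' axis name
              .reset_index())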

df222 = (df111.set_index(['ID1', 'ID2', 'Type']).unstack()
              .add_suffix(' Value'))
df222.columns = [lev[1] for lev in df222.columns]
df222.reset_index(inplace=True)

Related

Python: Group and count unique variables based on multiple grouping without recount

I have a Dataframe with 3 columns:
ID Round Investor
X 1 A
X 2 A
X 2 B
X 3 A
Y 1 A
Y 1 B
Y 1 C
Y 2 B
Y 2 D
And I want to count the number of unique investors for each round for each ID, but I don't want it to recount an investor that has already been in a previous round. The code I am using is:
print(df.groupby(['ID', 'Round'])['Investor'].nunique())
Which results in:
ID Round Unique Investor
X 1 1
2 2
2 2
3 1
Y 1 3
1 3
1 3
2 2
2 2
But I don't want it to count an investor who has invested in an earlier round for the same ID:
ID Round Unique Investor
X 1 1
2 1
2 1
3 0
Y 1 3
1 3
1 3
2 1
2 1
Any help is greatly appreciated!
You can define a helper column Investor2 by grouping by ID and dropping duplicate investors within each ID with Series.drop_duplicates.
Then, group by ID and Round as you did before, but on this Investor2 column, using .transform() with nunique, as follows:
df['Unique Investor'] = (
    df.assign(Investor2=df.groupby('ID')['Investor'].apply(pd.Series.drop_duplicates).droplevel(0))
      .groupby(['ID', 'Round'])['Investor2'].transform('nunique')
)
Result:
print(df)
ID Round Investor Unique Investor
0 X 1 A 1
1 X 2 A 1
2 X 2 B 1
3 X 3 A 0
4 Y 1 A 3
5 Y 1 B 3
6 Y 1 C 3
7 Y 2 B 1
8 Y 2 D 1
One possible solution is to drop duplicates based on ID and Investor, group by ID and Round to get the number of unique investors, and merge the result back to the main dataframe:
dups = ['ID', 'Investor']
group = ['ID', 'Round']
mapping = (df.drop_duplicates(subset=dups)
             .groupby(group)
             .Investor
             .nunique()
           )
(df.filter(group)
   .merge(mapping, left_on=group,
          right_index=True, how='left')
   .fillna(0, downcast='infer')
)
ID Round Investor
0 X 1 1
1 X 2 1
2 X 2 1
3 X 3 0
4 Y 1 3
5 Y 1 3
6 Y 1 3
7 Y 2 1
8 Y 2 1
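For completeness, a more compact sketch of the same idea (not from either answer above; it assumes an investor should only be counted in the round of their first appearance for a given ID): flag first appearances with duplicated and sum the flags per (ID, Round).
# True only on an investor's first appearance within each ID
first_time = ~df.duplicated(subset=['ID', 'Investor'])
# Sum the flags per (ID, Round) and broadcast the count back to every row
df['Unique Investor'] = first_time.groupby([df['ID'], df['Round']]).transform('sum')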

Split a dataframe based on certain column values

Let's say I have a DF like this:
Mean 1  Mean 2  Stat 1  Stat 2  ID
5       10      15      20      Z
3       6       9       12      X
Now, I want to split the dataframe to separate the data based on whether it is a #1 or #2 value for each ID.
Basically, I would double the number of rows for each ID, with each row dedicated to either #1 or #2, and a new column added to specify which number we are looking at. Instead of Mean 1 and Mean 2 being on the same row, they would be listed in two separate rows, with the # column making it clear which one we are looking at. What's the best way to do this? I was trying pd.melt(), but it seems like a slightly different use case.
Mean  Stat  ID  #
5     15    Z   1
10    20    Z   2
3     9     X   1
6     12    X   2
Use pd.wide_to_long:
new_df = pd.wide_to_long(
    df, stubnames=['Mean', 'Stat'], i='ID', j='#', sep=' '
).reset_index()
new_df:
ID # Mean Stat
0 Z 1 5 15
1 X 1 3 9
2 Z 2 10 20
3 X 2 6 12
Or set_index, then str.split the columns, then stack, if the row order must match the OP's expected output:
new_df = df.set_index('ID')
new_df.columns = new_df.columns.str.split(expand=True)
new_df = new_df.stack().rename_axis(['ID', '#']).reset_index()
new_df:
ID # Mean Stat
0 Z 1 5 15
1 Z 2 10 20
2 X 1 3 9
3 X 2 6 12
Here is a solution with melt and pivot:
df = df.melt(id_vars=['ID'], value_name='Mean')
df[['variable', '#']] = df['variable'].str.split(expand=True)
df = (df.assign(idx=df.groupby('variable').cumcount())
        .pivot(index=['idx', 'ID', '#'], columns='variable')
        .reset_index()
        .drop(('idx', ''), axis=1))
df.columns = [col[0] if col[1] == '' else col[1] for col in df.columns]
df
Out[1]:
ID # Mean Stat
0 Z 1 5 15
1 X 1 3 9
2 Z 2 10 20
3 X 2 6 12
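Note that pivot only accepts a list for index (as used in the last step) from pandas 1.1 onwards; on older versions, pivot_table or set_index followed by unstack can be used instead.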

sort headers by specific cols - pandas

I'm trying to sort the column headers by the last 3 columns only. Using the code below, sort_index works on the whole data frame, but not when I select only the last 3 columns.
Note: I can't hard-code the sorting because I don't know the column headers beforehand.
import pandas as pd

df = pd.DataFrame({
    'Z': [1, 1, 1, 1, 1],
    'B': ['A', 'A', 'A', 'A', 'A'],
    'C': ['B', 'A', 'A', 'A', 'A'],
    'A': [5, 6, 6, 5, 5],
})
# sorts all cols
df = df.sort_index(axis = 1)
# aim to sort by last 3 cols
#df.iloc[:,1:3] = df.iloc[:,1:3].sort_index(axis=1)
Intended Out:
Z A B C
0 1 A B 5
1 1 A A 6
2 1 A A 6
3 1 A A 5
4 1 A A 5
Try with reindex
out = df.reindex(columns=df.columns[[0]].tolist()+sorted(df.columns[1:].tolist()))
Out[66]:
Z A B C
0 1 5 A B
1 1 6 A A
2 1 6 A A
3 1 5 A A
4 1 5 A A
Method two: insert
newdf = df.iloc[:,1:].sort_index(axis=1)
newdf.insert(loc=0, column='Z', value=df.Z)
newdf
Out[74]:
Z A B C
0 1 5 A B
1 1 6 A A
2 1 6 A A
3 1 5 A A
4 1 5 A A
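If the frame can have more than one leading column, a more general sketch (my addition, assuming it is always the last three columns that should be sorted, without hard-coding their names):
lead = df.columns[:-3].tolist()            # columns to leave in place, in their original order
tail = sorted(df.columns[-3:].tolist())    # the last three columns, sorted by header
out = df.reindex(columns=lead + tail)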

How do I calculate the first value in each group from every other value in the group to calculate change over time?

In R I can calculate the change over time for each group in a data set like this:
df %>%
  group_by(z) %>%
  mutate(diff = y - y[x == 0])
What is the equivalent in pandas?
I know that using pandas you can subtract the first value of a column like this:
df['diff'] = df.y-df.y.iloc[0]
But how do you group by variable z?
Example data:
x y z
0 2 A
5 4 A
10 6 A
0 1 B
5 5 B
10 9 B
Expected output:
x y z diff
0 2 A 0
5 4 A 2
10 6 A 4
0 1 B 0
5 5 B 4
10 9 B 8
You can try this.
# subtract each group's first y value to get the change over time
temp = df.groupby('z').\
    apply(lambda g: g.y - g.y.iloc[0]).\
    reset_index().\
    rename(columns={'y': 'diff'}).\
    drop('z', axis=1)
df.merge(temp, how='inner', left_index=True, right_on='level_1').\
    drop('level_1', axis=1)
Return:
x y z diff
0 2 A 0
5 4 A 2
10 6 A 4
0 1 B 0
5 5 B 4
10 9 B 8
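A shorter equivalent (a sketch; it assumes the row with x == 0 is always the first row of each z group, as in the example data):
# subtract each group's first y value from every y in that group
df['diff'] = df['y'] - df.groupby('z')['y'].transform('first')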

Update in pandas on specific columns

I want to update values in one pandas data frame based on the values in another dataframe, but I want to specify which column to update by (i.e., which column should be the “key” for looking up matching rows). Right now it seems to treat the first column as the key. Is there a way to pass it a specific column name?
Example:
import pandas as pd
import numpy as np
df_a = pd.DataFrame()
df_a['x'] = range(5)
df_a['y'] = range(4, -1, -1)
df_a['z'] = np.random.rand(5)
df_b = pd.DataFrame()
df_b['x'] = range(5)
df_b['y'] = range(5)
df_b['z'] = range(5)
print('df_b:')
print(df_b.head())
print('\nold df_a:')
print(df_a.head(10))
df_a.update(df_b)
print('\nnew df_a:')
print(df_a.head())
Out:
df_b:
x y z
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
old df_a:
x y z
0 0 4 0.333648
1 1 3 0.683656
2 2 2 0.605688
3 3 1 0.816556
4 4 0 0.360798
new df_a:
x y z
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
You see, what it did is replace y and z in df_a with the respective columns from df_b, based on matches of x between df_a and df_b.
What if I wanted to keep y the same? What if I want it to replace based on y and not x? Also, what if there are multiple columns on which I'd like to do the replacement (in the real problem, I have to update a dataset with a new dataset, matching on two or three shared columns and taking the values from a fourth column)?
Basically, I want to do some sort of a merge-replace action, where I specify which columns I am merging/replacing on and which column should be replaced.
Hope this makes things clearer. If this cannot be accomplished with update in pandas, I am wondering if there is another way (short of writing a separate function with for loops for it).
This is my current solution, but it seems somewhat inelegant:
df_merge = df_a.merge(df_b, on='y', how='left', suffixes=('_a', '_b'))
print(df_merge.head())
df_merge['x'] = df_merge.x_b
df_merge['z'] = df_merge.z_b
df_update = df_a.copy()
df_update.update(df_merge)
print(df_update)
Out:
x_a y z_a x_b z_b
0 0 0 0.505949 0 0
1 1 1 0.231265 1 1
2 2 2 0.241109 2 2
3 3 3 0.579765 NaN NaN
4 4 4 0.172409 NaN NaN
x y z
0 0 0 0.000000
1 1 1 1.000000
2 2 2 2.000000
3 3 3 0.579765
4 4 4 0.172409
5 5 5 0.893562
6 6 6 0.638034
7 7 7 0.940911
8 8 8 0.998453
9 9 9 0.965866
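One common pattern for this kind of keyed update (a sketch rather than a definitive answer; it assumes the key column y has unique values in df_b): set the key column as the index on both frames, call update, then restore the original layout.
# update df_a from df_b, matching rows on 'y' and replacing only column 'z'
df_a2 = df_a.set_index('y')
df_a2.update(df_b.set_index('y')[['z']])      # update aligns on the index, i.e. on 'y'
df_a = df_a2.reset_index()[['x', 'y', 'z']]   # back to the original column order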
