Map value from one row as a new column in pandas - python

I have a pandas dataframe:
SrNo value
a nan
1 100
2 200
3 300
b nan
1 500
2 600
3 700
c nan
1 900
2 1000
I want my final dataframe to be:
value new_col
100 a
200 a
300 a
500 b
600 b
700 b
900 c
1000 c
i.e. for SrNo 'a', the value rows under it should get 'a' in a new column, and similarly for 'b' and 'c'.
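For reference, a minimal sketch to rebuild this frame (names taken from the question; SrNo is held as strings so letters and row numbers can coexist in one column):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'SrNo':  ['a', '1', '2', '3', 'b', '1', '2', '3', 'c', '1', '2'],
    'value': [np.nan, 100, 200, 300, np.nan, 500, 600, 700, np.nan, 900, 1000],
})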

Create the new column with where, keeping SrNo only where value is null, then use ffill to replace the NaNs by forward filling.
Finally remove the NaN rows with dropna and the helper column with drop:
print (df['SrNo'].where(df['value'].isnull()))
0 a
1 NaN
2 NaN
3 NaN
4 b
5 NaN
6 NaN
7 NaN
8 c
9 NaN
10 NaN
Name: SrNo, dtype: object
df['new_col'] = df['SrNo'].where(df['value'].isnull()).ffill()
df = df.dropna().drop(columns='SrNo')
print (df)
value new_col
1 100.0 a
2 200.0 a
3 300.0 a
5 500.0 b
6 600.0 b
7 700.0 b
9 900.0 c
10 1000.0 c

Here's one way:
In [2160]: df.assign(
    new_col=df.SrNo.str.extract(r'(\D+)', expand=False).ffill()
).dropna().drop(columns='SrNo')
Out[2160]:
value new_col
1 100.0 a
2 200.0 a
3 300.0 a
5 500.0 b
6 600.0 b
7 700.0 b
9 900.0 c
10 1000.0 c

Another way: replace the numbers with NaN, then ffill():
df['col'] = df['SrNo'].replace('([0-9]+)', np.nan, regex=True).ffill()
df = df.dropna(subset=['value']).drop(columns='SrNo')
Output:
value col
1 100.0 a
2 200.0 a
3 300.0 a
5 500.0 b
6 600.0 b
7 700.0 b
9 900.0 c
10 1000.0 c
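For completeness, a sketch of the same where/ffill idea spelled with mask (its inverse), written against the current pandas API:
# Keep SrNo only on the label rows (where value is missing), then fill down
df['new_col'] = df['SrNo'].mask(df['value'].notna()).ffill()
df = df.dropna(subset=['value']).drop(columns='SrNo')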

Related

Pandas Multiply 2D by 1D Dataframe

Looking for an elegant way to multiply a 2D dataframe by a 1D series where the indices and column names align
df1 =
Index   A   B
1       1   5
2       2   6
3       3   7
4       4   8
df2 =
    Coef
A     10
B    100
Something like...
df3 = df1.mul(df2)
To get:
Index    A    B
1       10  500
2       20  600
3       30  700
4       40  800
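A quick sketch to rebuild the two frames above (names taken from the question; the Index column is stored as the actual index):
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]},
                   index=pd.Index([1, 2, 3, 4], name='Index'))
df2 = pd.DataFrame({'Coef': [10, 100]}, index=['A', 'B'])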
There is no such thing as a 1D DataFrame; you need to select a single column as a Series to get 1D data, then multiply (by default on axis=1, so the Series index aligns with the DataFrame columns):
df3 = df1.mul(df2['Coef'])
Output:
A B
1 10 500
2 20 600
3 30 700
4 40 800
If Index is an ordinary column instead, the multiplication turns it into NaN (df2['Coef'] has no 'Index' entry), so fill it back from df1 with combine_first:
df3 = df1.mul(df2['Coef']).combine_first(df1)[df1.columns]
Output:
Index A B
0 1.0 10.0 500.0
1 2.0 20.0 600.0
2 3.0 30.0 700.0
3 4.0 40.0 800.0
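An alternative sketch for the 'Index is a column' case: make it the index first, multiply, then restore it, which avoids the combine_first round trip:
# Assumes 'Index' is an ordinary column of df1
df3 = df1.set_index('Index').mul(df2['Coef']).reset_index()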

With pd.merge() on different column names, the resulting dataframe has duplicated columns. How to avoid that?

Assume the dataframes df_1 and df_2 below, which I want to merge "left".
df_1 = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                     'B': [10, 20, 30, 40, 50]})
df_2 = pd.DataFrame({'AA': [1, 5],
                     'BB': [10, 50],
                     'CC': [100, 500]})
>>> df_1
A B
0 1 10
1 2 20
2 3 30
3 4 40
4 5 50
>>> df_2
AA BB CC
0 1 10 100
1 5 50 500
I want to perform a merging which will result to the following output:
A B CC
0 1 10 100.0
1 2 20 NaN
2 3 30 NaN
3 4 40 NaN
4 5 50 500.0
So, I tried pd.merge(df_1, df_2, left_on=['A', 'B'], right_on=['AA', 'BB'], how='left') which unfortunately duplicates the columns upon which I merge:
A B AA BB CC
0 1 10 1.0 10.0 100.0
1 2 20 NaN NaN NaN
2 3 30 NaN NaN NaN
3 4 40 NaN NaN NaN
4 5 50 5.0 50.0 500.0
How do I achieve this without needing to drop the columns 'AA' and 'BB'?
Thank you!
You can rename the columns and then join on A and B only:
df = pd.merge(df_1, df_2.rename(columns={'AA':'A','BB':'B'}), on=['A', 'B'], how='left')
print (df)
A B CC
0 1 10 100.0
1 2 20 NaN
2 3 30 NaN
3 4 40 NaN
4 5 50 500.0
pd.merge's right_on parameter also accepts array-like arguments. From the docs:
Can also be an array or list of arrays of the length of the right DataFrame. These arrays are treated as if they are columns.
df_1.merge(
    df_2["CC"], left_on=["A", "B"], right_on=[df_2["AA"], df_2["BB"]], how="left"
)
A B CC
0 1 10 100.0
1 2 20 NaN
2 3 30 NaN
3 4 40 NaN
4 5 50 500.0
Merging with df_2['CC'] gives the right-hand side without the column names you merge on.
right_on takes a list of the arrays you want to merge on; [df_2['AA'], df_2['BB']] is equivalent to [*df_2[['AA', 'BB']].to_numpy().T].
IMHO this method is cumbersome; as @jezrael posted, renaming the columns and merging on them is more pythonic/pandorable.

Transposing a Pandas DataFrame Without Aggregating

I have a multi-columned dataframe which holds several numerical values that are the same. It looks like the following:
A B C D
0 1 1 10 1
1 1 1 20 2
2 1 5 30 3
3 2 2 40 4
4 2 3 50 5
This is great; however, I need to make A the index and B the columns. The problem is that the values are aggregated: they are averaged for every identical value of B.
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': [1, 1, 5, 2, 3],
                   'C': [10, 20, 30, 40, 50],
                   'D': [1, 2, 3, 4, 5]})
transposed_df = df.pivot_table(index=['A'], columns=['B'])
Instead of keeping 10 and 20 across B1, it averages the two to 15.
C D
B 1 2 3 5 1 2 3 5
A
1 15.0 NaN NaN 30.0 1.5 NaN NaN 3.0
2 NaN 40.0 50.0 NaN NaN 4.0 5.0 NaN
Is there any way I can keep column B the same and display every value of C and D using Pandas, or am I better off writing my own function to do this? Also, it is very important that the index and columns stay the same, because only one of each number can exist.
EDIT: This is the desired output. I understand that this exact layout probably isn't possible, but it shows that 10 and 20 need to both be in column 1 and index 1.
C D
B 1 2 3 5 1 2 3 5
A
1 10.0,20.0 NaN NaN 30.0 1.0,2.0 NaN NaN 3.0
2 NaN 40.0 50.0 NaN NaN 4.0 5.0 NaN
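One way to keep every duplicate (a sketch, not from the original thread): pivot_table accepts any aggregation function, so aggfunc=list collects the values instead of averaging them. The cells become Python lists, which suits inspection more than further numeric work:
# Each (A, B) cell holds all matching values, e.g. [10, 20] for A=1, B=1
transposed_df = df.pivot_table(index='A', columns='B', aggfunc=list)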

How to preserve column order when calling groupby and shift from pandas?

It seems that the columns get reordered by column index when calling pandas.DataFrame.groupby().shift(). The sort parameter applies only to rows.
Here is an example:
import pandas as pd
df = pd.DataFrame({'A': ['group1', 'group1', 'group2', 'group2', 'group3', 'group3'],
                   'E': ['a', 'b', 'c', 'd', 'e', 'f'],
                   'B': [10, 12, 10, 25, 10, 12],
                   'C': [100, 102, 100, 250, 100, 102],
                   'D': [1, 2, 3, 4, 5, 6]})
df.set_index('A',inplace=True)
df = df[['E','C','D','B']]
df
# E C D B
# A
#group1 a 100 1 10
#group1 b 102 2 12
#group2 c 100 3 10
#group2 d 250 4 25
#group3 e 100 5 10
#group3 f 102 6 12
Going from here, I want to achieve:
# E C D B C_s D_s B_s
# A
#group1 a 100 1 10 102.0 2.0 12.0
#group1 b 102 2 12 NaN NaN NaN
#group2 c 100 3 10 250.0 4.0 25.0
#group2 d 250 4 25 NaN NaN NaN
#group3 e 100 5 10 102.0 6.0 12.0
#group3 f 102 6 12 NaN NaN NaN
But
df[['C_s','D_s','B_s']]= df.groupby(level='A')[['C','D','B']].shift(-1)
Results in:
# E C D B C_s D_s B_s
# A
#group1 a 100 1 10 12.0 102.0 2.0
#group1 b 102 2 12 NaN NaN NaN
#group2 c 100 3 10 25.0 250.0 4.0
#group2 d 250 4 25 NaN NaN NaN
#group3 e 100 5 10 12.0 102.0 6.0
#group3 f 102 6 12 NaN NaN NaN
Introducing an artificial ordering of the columns helps to maintain the intrinsic logical connection of the columns:
df = df.sort_index(axis=1)
df[['B_s','C_s','D_s']]= df.groupby(level='A')[['B','C','D']].shift(-1).sort_index(axis=1)
df
# B C D E B_s C_s D_s
# A
#group1 10 100 1 a 12.0 102.0 2.0
#group1 12 102 2 b NaN NaN NaN
#group2 10 100 3 c 25.0 250.0 4.0
#group2 25 250 4 d NaN NaN NaN
#group3 10 100 5 e 12.0 102.0 6.0
#group3 12 102 6 f NaN NaN NaN
Why are the columns reordered in the first place?
In my opinion it is a bug.
A working custom lambda function:
df[['C_s','D_s','B_s']] = df.groupby(level='A')[['C','D','B']].apply(lambda x: x.shift(-1))
print (df)
E C D B C_s D_s B_s
A
group1 a 100 1 10 102.0 2.0 12.0
group1 b 102 2 12 NaN NaN NaN
group2 c 100 3 10 250.0 4.0 25.0
group2 d 250 4 25 NaN NaN NaN
group3 e 100 5 10 102.0 6.0 12.0
group3 f 102 6 12 NaN NaN NaN
Thank you @cᴏʟᴅsᴘᴇᴇᴅ for another solution:
df[['C_s','D_s','B_s']] = (df.groupby(level='A')[['C','D','B']]
                             .apply(pd.DataFrame.shift, periods=-1))
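A version-agnostic sketch (not from the thread): select the columns of the shifted result in the desired order and assign the raw values, so the pairing no longer depends on how groupby/shift orders its output:
shifted = df.groupby(level='A')[['C', 'D', 'B']].shift(-1)
# Selecting in the target order before assigning guarantees that
# C_s/D_s/B_s line up with C/D/B whatever order shift() returned
df[['C_s', 'D_s', 'B_s']] = shifted[['C', 'D', 'B']].to_numpy()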

Backfilling columns by groups in Pandas

I have a csv like
A,B,C,D
1,2,,
1,2,30,100
1,2,40,100
4,5,,
4,5,60,200
4,5,70,200
8,9,,
In rows 1 and 4 the C value is missing (NaN). I want to take it from rows 2 and 5 respectively (the first occurrence of the same A, B values).
If no matching row is found, just put 0 (as in the last line).
Expected op:
A,B,C,D
1,2,30,
1,2,30,100
1,2,40,100
4,5,60,
4,5,60,200
4,5,70,200
8,9,0,
Using fillna I found bfill ("use NEXT valid observation to fill gap"), but the next observation has to be chosen logically (looking at the A, B values), not simply as the next value down column C.
You'll have to call df.groupby on A and B first and then apply the bfill function:
In [501]: df.C = df.groupby(['A', 'B']).apply(lambda x: x.C.bfill()).reset_index(drop=True).fillna(0).astype(int)
In [502]: df
Out[502]:
A B C D
0 1 2 30 NaN
1 1 2 30 100.0
2 1 2 40 100.0
3 4 5 60 NaN
4 4 5 60 200.0
5 4 5 70 200.0
6 8 9 0 NaN
You can also group and then call dfGroupBy.bfill directly (I think this would be faster):
In [508]: df.C = df.groupby(['A', 'B']).C.bfill().fillna(0).astype(int); df
Out[508]:
A B C D
0 1 2 30 NaN
1 1 2 30 100.0
2 1 2 40 100.0
3 4 5 60 NaN
4 4 5 60 200.0
5 4 5 70 200.0
6 8 9 0 NaN
If you wish to get rid of the NaNs in D, you could do:
df['D'] = df['D'].fillna('')
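Since the question asks for the first occurrence of each (A, B) pair, a hedged alternative is groupby().transform('first'), which broadcasts the first non-null C of every group back onto its rows:
df['C'] = (df['C']
           .fillna(df.groupby(['A', 'B'])['C'].transform('first'))
           .fillna(0)           # groups with no valid C at all, e.g. (8, 9)
           .astype(int))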
