Pandas create new column based on first unique values of existing column - python

I'm trying to add a new column to a dataframe that holds only the first occurrence of each value from an existing column. The new column will have fewer values, with np.nan wherever a duplicate would have been.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,2,3,4,5], 'b':[3,4,3,4,5]})
df
a b
0 1 3
1 2 4
2 3 3
3 4 4
4 5 5
Goal:
a b c
0 1 3 3
1 2 4 4
2 3 3 nan
3 4 4 nan
4 5 5 5
I've tried:
df['c'] = np.where(df['b'].unique(), df['b'], np.nan)
It throws: operands could not be broadcast together with shapes (3,) (5,) ()

mask + duplicated
You can use Pandas methods for masking a series:
df['c'] = df['b'].mask(df['b'].duplicated())
print(df)
a b c
0 1 3 3.0
1 2 4 4.0
2 3 3 NaN
3 4 4 NaN
4 5 5 5.0
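The first occurrence survives here because duplicated() defaults to keep='first'. A minimal sketch of how the keep argument changes which rows are masked, using the question's data (the column names c_first, c_last and c_none are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [3, 4, 3, 4, 5]})

# keep='first' (the default): later repeats are flagged and masked to NaN
df['c_first'] = df['b'].mask(df['b'].duplicated(keep='first'))
# keep='last': earlier repeats are flagged instead
df['c_last'] = df['b'].mask(df['b'].duplicated(keep='last'))
# keep=False: every value that occurs more than once is flagged
df['c_none'] = df['b'].mask(df['b'].duplicated(keep=False))

print(df)
```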

Use duplicated with np.where:
df['c'] = np.where(df['b'].duplicated(),np.nan,df['b'])
Or:
df['c'] = df['b'].where(~df['b'].duplicated(),np.nan)
print(df)
a b c
0 1 3 3.0
1 2 4 4.0
2 3 3 NaN
3 4 4 NaN
4 5 5 5.0
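The attempt in the question fails because unique() returns only the distinct values, so its length no longer matches the column, while duplicated() returns one boolean per row. A small sketch of the difference, assuming the question's data (the name is_repeat is only illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [3, 4, 3, 4, 5]})

# unique() returns the distinct values, so its length differs from the column's
print(df['b'].unique().shape)   # 3 distinct values
print(df['b'].shape)            # 5 rows

# duplicated() returns one boolean per row, so it lines up with df['b']
is_repeat = df['b'].duplicated()
df['c'] = np.where(is_repeat, np.nan, df['b'])
print(df)
```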

Related

Merge two DataFrames by combining duplicates and concatenating nonduplicates

I have two DataFrames:
df = pd.DataFrame({'A':[1,2],
'B':[3,4]})
A B
0 1 3
1 2 4
df2 = pd.DataFrame({'A':[3,2,1],
'C':[5,6,7]})
A C
0 3 5
1 2 6
2 1 7
and I want to merge them so that column 'A' contains the values from both DataFrames, with rows that share a value of 'A' combined into one.
Desired output:
A B C
0 3 NaN 5
1 2 4 6
2 1 3 7
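You can use combine_first after setting 'A' as the index, so that rows are aligned on 'A' rather than on row position:
df2 = df2.set_index('A').combine_first(df.set_index('A')).reset_index()
Output (sorted by 'A'):
A B C
0 1 3.0 7
1 2 4.0 6
2 3 NaN 5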
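combine_first aligns rows on the index, not on the values of column 'A'. If the goal is to match rows by 'A' and keep df2's row order as in the desired output, a merge on 'A' is another option. This is a sketch under the assumption that every value of 'A' in df also appears in df2, as in the question; otherwise how='outer' would be needed, and the row order would differ. The name merged is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [3, 2, 1], 'C': [5, 6, 7]})

# how='right' keeps every row of df2 in df2's order;
# B is NaN where df has no row with a matching A
merged = df.merge(df2, on='A', how='right')
print(merged)
```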

How to remove or drop all rows after first occurrence of `NaN` from the entire DataFrame

I want to remove/drop all rows from the first occurrence of NaN onward, based on any column of the DataFrame.
I have created two sample DataFrames, shown below. In the first, the dtypes of the first two columns are object and the last is int; in the second, they are float, object and int.
First:
>>> df = pd.DataFrame({"A": (1,2,3,4,5,6,7,'NaN','NaN','NaN','NaN'),"B": (1,2,3,'NaN',4,5,6,7,'NaN',"9","10"),"C": range(11)})
>>> df
A B C
0 1 1 0
1 2 2 1
2 3 3 2
3 4 NaN 3
4 5 4 4
5 6 5 5
6 7 6 6
7 NaN 7 7
8 NaN NaN 8
9 NaN 9 9
10 NaN 10 10
Dtypes:
>>> df.dtypes
A object
B object
C int64
dtype: object
An index-based approach on a single column, as follows, works fine as long as the dtype is object or int, but I'm looking for a DataFrame-level operation that is not limited to one column.
>>> df[:df[df['A'] == 'NaN'].index[0]]
A B C
0 1 1 0
1 2 2 1
2 3 3 2
3 4 NaN 3
4 5 4 4
5 6 5 5
6 7 6 6
>>> df[:df[df['B'] == 'NaN'].index[0]]
A B C
0 1 1 0
1 2 2 1
2 3 3 2
Second:
Another interesting fact: creating the DataFrame with np.nan gives different dtypes, and then even the index-based approach fails for a single-column operation as well.
>>> df = pd.DataFrame({"A": (1,2,3,4,5,6,7,np.nan,np.nan,np.nan,np.nan),"B": (1,2,3,np.nan,4,5,6,7,np.nan,"9","10"),"C": range(11)})
>>> df
A B C
0 1.0 1 0
1 2.0 2 1
2 3.0 3 2
3 4.0 NaN 3
4 5.0 4 4
5 6.0 5 5
6 7.0 6 6
7 NaN 7 7
8 NaN NaN 8
9 NaN 9 9
10 NaN 10 10
dtypes:
>>> df.dtypes
A float64
B object
C int64
dtype: object
Error:
>>> df[:df[df['B'] == 'NaN'].index[0]]
IndexError: index 0 is out of bounds for axis 0 with size 0
>>> df[:df[df['A'] == 'NaN'].index[0]]
IndexError: index 0 is out of bounds for axis 0 with size 0
Expected result for the second DataFrame:
>>> df
A B C
0 1.0 1 0
1 2.0 2 1
2 3.0 3 2
So I am looking for a way to check across the entire DataFrame, regardless of dtype, and drop all rows from the first occurrence of NaN onward.
You can try:
out=df.iloc[:df.isna().any(axis=1).idxmax()]
OR
use replace() to turn the string 'NaN's into real NaN values, then check for missing values and filter rows:
df=df.replace({'NaN':float('NaN'),'nan':float('NaN')})
out=df.iloc[:df.isna().any(axis=1).idxmax()]
output of out:
A B C
0 1.0 1 0
1 2.0 2 1
2 3.0 3 2
Just for posterity ...
>>> df.iloc[:df.isna().any(axis=1).argmax()]
A B C
0 1.0 1 0
1 2.0 2 1
2 3.0 3 2
>>> df.iloc[:df.isnull().any(axis=1).argmax()]
A B C
0 1.0 1 0
1 2.0 2 1
2 3.0 3 2
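The one-liners above rely on two things: the frame has a default 0..n-1 index (idxmax returns a label, while iloc needs a position), and at least one row has a missing value (otherwise the maximum is found at the first row and the result is empty). A hedged sketch that handles both frames from the question and the no-NaN case; the function name rows_before_first_nan and the na_strings parameter are hypothetical, not part of pandas:

```python
import numpy as np
import pandas as pd

def rows_before_first_nan(frame, na_strings=('NaN', 'nan')):
    """Return the rows that precede the first row containing a missing value."""
    # A cell counts as missing if it is a real NaN or one of the placeholder strings
    is_missing = frame.isna() | frame.isin(list(na_strings))
    has_missing = is_missing.any(axis=1).to_numpy()
    if not has_missing.any():
        return frame  # nothing missing: keep every row
    # argmax on a boolean array gives the position of the first True
    return frame.iloc[:has_missing.argmax()]

df_str = pd.DataFrame({"A": (1, 2, 3, 4, 5, 6, 7, 'NaN', 'NaN', 'NaN', 'NaN'),
                       "B": (1, 2, 3, 'NaN', 4, 5, 6, 7, 'NaN', "9", "10"),
                       "C": range(11)})
df_nan = pd.DataFrame({"A": (1, 2, 3, 4, 5, 6, 7, np.nan, np.nan, np.nan, np.nan),
                       "B": (1, 2, 3, np.nan, 4, 5, 6, 7, np.nan, "9", "10"),
                       "C": range(11)})

print(rows_before_first_nan(df_str))
print(rows_before_first_nan(df_nan))
```

Because the check works on positions, it does not depend on the index labels, and the original data is returned unmodified.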

compare two columns row by row and nan duplicate values pandas

I have a df
a b c
0 3 0
1 1 4
2 3 3
4 4 1
I want to compare a and b to c. If a value in the same row is equal to c, I want NaN in a and/or b.
Like this:
a b c
nan 3 0
1 1 4
2 nan 3
4 4 1
We can use to_numpy with DataFrame.mask for this:
eqs = df.loc[:, :'b'].eq(df['c'].to_numpy()[:, None])
df.loc[:, :'b'] = df.loc[:, :'b'].mask(eqs)
a b c
0 NaN 3.0 0
1 1.0 1.0 4
2 2.0 NaN 3
3 4.0 4.0 1
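An equivalent way to make the row-wise comparison is to pass axis=0 to eq, which aligns c with the rows without going through to_numpy. This is a sketch, with the frame rebuilt from the question's table and cols as an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2, 4], 'b': [3, 1, 3, 4], 'c': [0, 4, 3, 1]})

cols = ['a', 'b']
# axis=0 aligns df['c'] with the rows, so each row is compared against its own c
eqs = df[cols].eq(df['c'], axis=0)
# Assigning whole columns lets a and b become float so they can hold NaN
df[cols] = df[cols].mask(eqs)
print(df)
```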

Sort dataframe by another on one column - pandas

Let's say I have two DataFrames, as shown below:
df=pd.DataFrame({'a':[1,4,3,2],'b':[1,2,3,4]})
df2=pd.DataFrame({'a':[1,2,3,4],'b':[1,2,3,4],'c':[34,56,7,55]})
I would like to sort df by the order of df2 on the 'a' column, so that df.a follows the order of df2.a and the rest of each row moves with it.
Desired output:
a b
0 1 1
1 2 4
2 3 3
3 4 2
(made it manually, and if there's any mistake with it, please tell me :D)
My own attempt:
df = df.set_index('a')
df = df.reindex(index=df2['a'])
df = df.reset_index()
print(df)
Works as expected!
But when I have longer DataFrames, like:
df=pd.DataFrame({'a':[1,4,3,2,3,4,5,3,5,6],'b':[1,2,3,4,5,5,5,6,6,7]})
df2=pd.DataFrame({'a':[1,2,3,4,3,4,5,6,4,5],'b':[1,2,4,3,4,5,6,7,4,3]})
It doesn't work as expected.
Note: I don't only want an explanation of why; I also need a solution that works for big DataFrames.
One possible solution is to create helper columns in both DataFrames, because 'a' contains duplicated values:
df['g'] = df.groupby('a').cumcount()
df2['g'] = df2.groupby('a').cumcount()
df = df.set_index(['a','g']).reindex(index=df2.set_index(['a','g']).index)
print(df)
b
a g
1 0 1.0
2 0 4.0
3 0 3.0
4 0 2.0
3 1 5.0
4 1 5.0
5 0 5.0
6 0 7.0
4 2 NaN
5 1 6.0
Or maybe you need merge:
df3 = df.merge(df2[['a','g']], on=['a','g'])
print(df3)
a b g
0 1 1 0
1 4 2 0
2 3 3 0
3 2 4 0
4 3 5 1
5 4 5 1
6 5 5 0
7 5 6 1
8 6 7 0
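The original set_index/reindex attempt breaks on the longer frames because 'a' then contains repeated labels, and reindex needs unique ones. The cumcount helper makes each (a, g) pair unique. If the result should follow df2's row order as a flat frame, one possible variation is a left merge starting from df2's keys. This is a sketch; the names left, order and result are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 4, 3, 2, 3, 4, 5, 3, 5, 6],
                   'b': [1, 2, 3, 4, 5, 5, 5, 6, 6, 7]})
df2 = pd.DataFrame({'a': [1, 2, 3, 4, 3, 4, 5, 6, 4, 5],
                    'b': [1, 2, 4, 3, 4, 5, 6, 7, 4, 3]})

# g numbers the repeats of each value of a, so (a, g) is unique within a frame
left = df.assign(g=df.groupby('a').cumcount())
order = df2[['a']].assign(g=df2.groupby('a').cumcount())

# A left merge keeps the rows of `order` in place, i.e. df2's order;
# b is NaN where df has no matching (a, g) pair
result = order.merge(left, on=['a', 'g'], how='left').drop(columns='g')
print(result)
```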

Inserting Series into DataFrame with automatic reindexing

I have a DataFrame and a Series of different dimensions
from pandas import DataFrame, Series, concat
df = DataFrame({'a':[1,2,3],'b':[1,2,3]})
s = Series([1,2,3,4,5])
Is there a way to insert s into df without creating a reindexed copy of df first? Currently I am using
df = df.reindex(range(len(s)))
df['s'] = s
print(df)
a b s
0 1 1 1
1 2 2 2
2 3 3 3
3 NaN NaN 4
4 NaN NaN 5
Use concat:
In [19]:
concat([df,s], axis=1)
Out[19]:
0 1 2
0 1 1 1
1 2 2 2
2 3 3 3
3 NaN NaN 4
4 NaN NaN 5
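With a current pandas, the same idea can be sketched as follows. Naming the Series 's' is an assumption made here so that the new column gets that label; the name out is illustrative. The shorter columns are padded with NaN and therefore become float.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 2, 3]})
s = pd.Series([1, 2, 3, 4, 5], name='s')

# axis=1 places s beside df; the row index becomes the union of both indexes
out = pd.concat([df, s], axis=1)
print(out)
```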
