I have two dataframes:
index a b c d
1 x x x x
2 x nan x x
3 x x x x
4 x nan x x
index a b e
3 x nan x
4 x x x
5 x nan x
6 x x x
I want to turn them into the following, where we simply get rid of the NaN values. (An easier version of this question is where the second dataframe has no NaN values.)
index a b c d e
1 x x x x x
2 x x x x x
3 x x x x x
4 x x x x x
5 x x x x x
6 x x x x x
You may use combine_first with fillna:
DataFrame.combine_first(other)
Combine two DataFrame objects and
default to non-null values in the frame calling the method. The result's
index and columns will be the union of the respective indexes and columns.
You can read more in the combine_first documentation.
import pandas as pd
import numpy as np

d1 = pd.DataFrame([[np.nan, 1, 1], [2, 2, 2], [3, 3, 3]], columns=['a', 'b', 'c'])
d1
a b c
0 NaN 1 1
1 2 2 2
2 3 3 3
d2 = pd.DataFrame([[1, np.nan, 1], [np.nan, 2, 2], [3, 3, np.nan]], columns=['b', 'd', 'e'])
d2
b d e
0 1 NaN 1
1 NaN 2 2
2 3 3 NaN
d2.combine_first(d1)  # d2's values take priority; d1 fills in where d2 has NaN
a b c d e
0 NaN 1 1 NaN 1
1 2 2 2 2 2
2 3 3 3 3 NaN
d2.combine_first(d1).fillna(5) # simply fill NaN with a value
a b c d e
0 5 1 1 5 1
1 2 2 2 2 2
2 3 3 3 3 5
Use nan_to_num to replace NaN with a number (zero, by default):
http://docs.scipy.org/doc/numpy/reference/generated/numpy.nan_to_num.html
Just apply this:
from numpy import nan_to_num
df2 = df.apply(nan_to_num)
Then you can merge the dataframes however you want.
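A quick sketch of what this does; note that nan_to_num replaces NaN with 0.0 (and infinities with large finite numbers), so it only helps when zero is an acceptable placeholder:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, 2.0]})

# apply() runs nan_to_num column by column: every NaN becomes 0.0
filled = df.apply(np.nan_to_num)
print(filled)
```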
Related
I have a pandas df:
ID Score C D
1 2 x y
1 nan x y
1 2 x y
2 3 x y
2 2 x y
3 2 x y
3 4 x y
3 3 x y
For each ID, I'd like to remove rows where df.Score == 2, but only when a 3 or 4 is also present for that ID. I'd like to keep NaNs, and keep the 2s when the only score for an ID is 2.
So I get:
ID Score C D
1 2 x y
1 nan x y
1 2 x y
2 3 x y
3 4 x y
3 3 x y
Any help, much appreciated
Use:
df[~df.groupby('ID')['Score'].apply(lambda x: x.eq(2) & x.isin([3, 4]).any())]
ID Score C D
0 1 2.0 x y
1 1 NaN x y
2 1 2.0 x y
3 2 3.0 x y
6 3 4.0 x y
7 3 3.0 x y
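An equivalent formulation using transform, which sidesteps the index alignment of groupby/apply (a sketch built from the question's data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 3, 3, 3],
                   'Score': [2, np.nan, 2, 3, 2, 2, 4, 3],
                   'C': list('xxxxxxxx'),
                   'D': list('yyyyyyyy')})

# Flag every row whose ID group contains a 3 or 4...
has_high = df.groupby('ID')['Score'].transform(lambda s: s.isin([3, 4]).any())

# ...then drop the 2s only in those groups. NaN != 2, so NaNs survive.
result = df[~(df['Score'].eq(2) & has_high)]
print(result)
```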
I have the following table:
ind_ID pair_ID orig_data
0 A 1 W
1 B 1 X
2 C 2 Y
3 D 2 Z
4 A 3 W
5 C 3 X
6 B 4 Y
7 D 4 Z
Each row has an individual_ID, and a pair_ID that it shares with exactly one other row. I want to do a self join, so that every row has its original data, and the data of the row it shares a pair_ID with:
ind_ID pair_ID orig_data partner_data
0 A 1 W X
1 B 1 X W
2 C 2 Y Z
3 D 2 Z Y
4 A 3 W X
5 C 3 X W
6 B 4 Y Z
7 D 4 Z Y
I have tried:
df.join(df, on='pair_ID')
But obviously since pair_ID values are not unique I get:
ind_ID pair_ID orig_data partner_data
0 A 1 W NaN
1 B 1 X NaN
2 C 2 Y NaN
3 D 2 Z NaN
4 A 3 W NaN
5 C 3 X NaN
6 B 4 Y NaN
7 D 4 Z NaN
I've also thought about creating a new column that concatenates ind_ID+pair_ID which would be unique, but then the join would not know what to match on.
Is it possible to do a self-join on pair_ID where each row is joined with the matching row that is not itself?
In your case (with exactly two rows per pair_ID) you can probably just group by pair_ID and transform, reversing the order of the values within each group, e.g.:
df.loc[:, 'partner_data'] = df.groupby('pair_ID').orig_data.transform(lambda L: L[::-1])
Which gives you:
ind_ID pair_ID orig_data partner_data
0 A 1 W X
1 B 1 X W
2 C 2 Y Z
3 D 2 Z Y
4 A 3 W X
5 C 3 X W
6 B 4 Y Z
7 D 4 Z Y
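If the pairing ever needs to generalize beyond exactly two rows, a merge-based self-join is another option (a sketch on the question's data): merge the table with itself on pair_ID, then discard the rows where a record matched itself.

```python
import pandas as pd

df = pd.DataFrame({'ind_ID': list('ABCDACBD'),
                   'pair_ID': [1, 1, 2, 2, 3, 3, 4, 4],
                   'orig_data': list('WXYZWXYZ')})

# Self-merge on pair_ID; every row matches itself and its partner(s).
merged = df.merge(df, on='pair_ID', suffixes=('', '_partner'))

# Drop the self-matches (within a pair, the two ind_IDs differ).
merged = merged[merged['ind_ID'] != merged['ind_ID_partner']]

result = (merged[['ind_ID', 'pair_ID', 'orig_data', 'orig_data_partner']]
          .rename(columns={'orig_data_partner': 'partner_data'}))
print(result)
```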
I have a list of (x,y,z) tuples in a dataframe A.
How can I produce a dataframe B which represents the underlying matrix of A, using the existing values of x and y as index and columns values, respectively?
Example:
A:
x y z
1 1 1
1 2 10
2 1 100
B:
1 2
1 1 10
2 100 NaN
For this data frame df:
x y z
0 1 1 1
1 1 2 10
2 2 1 100
pivoting:
df.pivot(index='x', columns='y')
works:
z
y 1 2
x
1 1.0 10.0
2 100.0 NaN
You can also clean the column and index names:
res = df.pivot(index='x', columns='y')
res.index.name = None
res.columns = res.columns.levels[1].values
print(res)
Output:
1 2
1 1.0 10.0
2 100.0 NaN
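The same reshape can also be written with set_index/unstack, which avoids the extra column level in the first place (a sketch):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2], 'y': [1, 2, 1], 'z': [1, 10, 100]})

# Rows from x, columns from y, values from z; missing (x, y) combos become NaN.
res = df.set_index(['x', 'y'])['z'].unstack()
res.index.name = None
res.columns.name = None
print(res)
```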
I have a dataframe (df_temp) which is like the following:
ID1 ID2
0 A X
1 A X
2 A Y
3 A Y
4 A Z
5 B L
6 B L
What I need is to add a column which shows the cumulative number of unique values of ID2 for each ID1, something like:
ID1 ID2 CumUniqueIDs
0 A X 1
1 A X 1
2 A Y 2
3 A Y 2
4 A Z 3
5 B L 1
6 B L 1
I've tried:
df_temp['CumUniqueIDs'] = df_temp.groupby(by=['ID1'])['ID2'].nunique().cumsum() + 1
But this simply fills CumUniqueIDs with NaN.
Not sure what I'm doing wrong here! Any help much appreciated!
You can use groupby() + transform() + factorize():
In [12]: df['CumUniqueIDs'] = df.groupby('ID1')['ID2'].transform(lambda x: pd.factorize(x)[0]+1)
In [13]: df
Out[13]:
ID1 ID2 CumUniqueIDs
0 A X 1
1 A X 1
2 A Y 2
3 A Y 2
4 A Z 3
5 B L 1
6 B L 1
By using the category dtype's codes:
df.groupby(['ID1']).ID2.apply(lambda x : x.astype('category').cat.codes.add(1))
Out[551]:
0 1
1 1
2 2
3 2
4 3
5 1
6 1
Name: ID2, dtype: int8
Then assign it back:
df['CumUniqueIDs']=df.groupby(['ID1']).ID2.apply(lambda x : x.astype('category').cat.codes.add(1))
df
Out[553]:
ID1 ID2 CumUniqueIDs
0 A X 1
1 A X 1
2 A Y 2
3 A Y 2
4 A Z 3
5 B L 1
6 B L 1
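Another way to express the running unique count, using duplicated + cumsum inside the transform (a sketch on the question's data):

```python
import pandas as pd

df = pd.DataFrame({'ID1': list('AAAAABB'),
                   'ID2': list('XXYYZLL')})

# Within each ID1 group, count a value only the first time it appears,
# then take the running total of those first appearances.
df['CumUniqueIDs'] = (df.groupby('ID1')['ID2']
                        .transform(lambda s: (~s.duplicated()).cumsum()))
print(df)
```

Unlike the factorize and category approaches, this one keeps working if the same ID2 value later reappears non-consecutively, since duplicated() looks at all prior occurrences in the group.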
I have a pandas dataframe and want to replace each Y value with the mean for its group.
ID X Y
1 a 1
2 a 2
3 a 3
4 b 2
5 b 4
How do I replace Y values with mean Y for every unique X?
ID X Y
1 a 2
2 a 2
3 a 2
4 b 3
5 b 3
Use transform:
df['Y'] = df.groupby('X')['Y'].transform('mean')
print (df)
ID X Y
0 1 a 2
1 2 a 2
2 3 a 2
3 4 b 3
4 5 b 3
For a new column in another DataFrame, use map with drop_duplicates:
df1 = pd.DataFrame({'X':['a','a','b']})
print (df1)
X
0 a
1 a
2 b
df1['Y'] = df1['X'].map(df.drop_duplicates('X').set_index('X')['Y'])
print (df1)
X Y
0 a 2
1 a 2
2 b 3
Another solution:
df1['Y'] = df1['X'].map(df.groupby('X')['Y'].mean())
print (df1)
X Y
0 a 2
1 a 2
2 b 3
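A merge is one more way to bring the per-group means into another frame, if you prefer an explicit join over map (a sketch using the question's data):

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'X': ['a', 'a', 'a', 'b', 'b'],
                   'Y': [1, 2, 3, 2, 4]})
df1 = pd.DataFrame({'X': ['a', 'a', 'b']})

# Per-group means as a small lookup frame, then a left join on X.
means = df.groupby('X', as_index=False)['Y'].mean()
out = df1.merge(means, on='X', how='left')
print(out)
```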