Conditionally delete rows by ID in Pandas - python

I have a pandas df:
ID Score C D
1 2 x y
1 nan x y
1 2 x y
2 3 x y
2 2 x y
3 2 x y
3 4 x y
3 3 x y
For each ID, I'd like to remove rows where df.Score == 2, but only when a 3 or 4 is present for that ID. I'd like to keep NaNs, and keep 2 when the only score for an ID is 2.
So I get:
ID Score C D
1 2 x y
1 nan x y
1 2 x y
2 3 x y
3 4 x y
3 3 x y
Any help, much appreciated

Use:
df[~df.groupby('ID')['Score'].apply(lambda x: x.eq(2) & x.isin([3, 4]).any())]
ID Score C D
0 1 2.0 x y
1 1 NaN x y
2 1 2.0 x y
3 2 3.0 x y
6 3 4.0 x y
7 3 3.0 x y
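A transform-based variant of the same idea (a sketch; the sample frame below is reconstructed from the question) builds the mask without relying on groupby.apply returning a flat boolean Series, which newer pandas versions may index differently:

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the question
df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 3, 3, 3],
                   'Score': [2, np.nan, 2, 3, 2, 2, 4, 3],
                   'C': list('xxxxxxxx'),
                   'D': list('yyyyyyyy')})

# True for every row whose ID group contains a 3 or 4
has_high = df.groupby('ID')['Score'].transform(lambda s: s.isin([3, 4]).any())

# Drop rows scoring 2 only in groups that also have a 3 or 4
out = df[~(df['Score'].eq(2) & has_high)]
```

transform always returns a Series aligned to the original index, so the mask can be combined with df['Score'].eq(2) directly.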


how to repeat each row n times in pandas so that it looks like this?

I want to repeat each row n times in pandas, in this fashion. I want the result below (with df_repeat = pd.concat([df]*2, ignore_index=False) I can't get the expected result):
Original Dataset:
index  value
0      x
1      x
2      x
3      x
4      x
5      x
Dataframe I want:
index  value
0      x
0      x
1      x
1      x
2      x
2      x
3      x
3      x
4      x
4      x
5      x
5      x
You can repeat the index:
df_repeat = df.loc[df.index.repeat(2)]
output:
index value
0 0 x
0 0 x
1 1 x
1 1 x
2 2 x
2 2 x
3 3 x
3 3 x
4 4 x
4 4 x
5 5 x
5 5 x
For a clean, new index:
df_repeat = df.loc[df.index.repeat(2)].reset_index(drop=True)
output:
index value
0 0 x
1 0 x
2 1 x
3 1 x
4 2 x
5 2 x
6 3 x
7 3 x
8 4 x
9 4 x
10 5 x
11 5 x
On a Series
Should you have a Series as input, there is a Series.repeat method:
# create a Series from the above DataFrame
s = df.set_index('index')['value']
s.repeat(2)
output:
index
0 x
0 x
1 x
1 x
2 x
2 x
3 x
3 x
4 x
4 x
5 x
5 x
Name: value, dtype: object
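If the repeat count varies per row, Index.repeat also accepts an array of counts; a small sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({'value': list('abc')})
counts = [1, 2, 3]  # hypothetical per-row repeat counts
out = df.loc[df.index.repeat(counts)].reset_index(drop=True)
# 'a' once, 'b' twice, 'c' three times
```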

Pandas - adding another column with the same name as another column

For example, see below:
1  2  3  4
X  X  X  X
X  X  X  X
X  X  X  X
X  X  X  X
how would I add another column with a 4 in it? I have used:
df = df.assign(**{'4': np.zeros(df.shape[0])})
however, it just overwrites the existing column 4 with what I entered.
I hope this question is clear enough! The future state should look like this:
1  2  3  4  4
X  X  X  X  X
X  X  X  X  X
X  X  X  X  X
X  X  X  X  X
As @anon01 states, this is not a good idea, but you can use pd.concat:
df = pd.DataFrame(np.arange(25).reshape(-1, 5))
pd.concat([df,pd.Series([np.nan]*5).rename(4)], axis=1)
And as @CameronRiddell states:
pd.concat([df, pd.Series(np.nan, name=4)], axis=1)
Output:
0 1 2 3 4 4
0 0 1 2 3 4 NaN
1 5 6 7 8 9 NaN
2 10 11 12 13 14 NaN
3 15 16 17 18 19 NaN
4 20 21 22 23 24 NaN
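Alternatively, DataFrame.insert can add a duplicate-named column in place when allow_duplicates=True; a sketch on the same example frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(25).reshape(-1, 5))
# Append a second column named 4, filled with NaN;
# allow_duplicates=True permits the repeated column label
df.insert(len(df.columns), 4, np.nan, allow_duplicates=True)
```

Unlike the concat approach, this mutates df in place rather than returning a new frame.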

Python: how to reshape a Pandas dataframe while keeping the information?

I have a dataframe containing the geographical information of points.
df:
A B ax ay bx by
0 x y 5 7 3 2
1 z w 2 0 7 4
2 k x 5 7 2 0
3 v y 2 3 3 2
I would like to create a dataframe with the geographical info of the unique points
df1:
ID x y
0 x 5 7
1 y 3 2
2 z 2 0
3 w 7 4
4 k 5 7
5 v 2 3
First flatten the values in the columns with numpy.ravel, create the DataFrame with the constructor, and finally add drop_duplicates (thanks @zipa):
a = df[['A','B']].values.ravel()
b = df[['ax','bx']].values.ravel()
c = df[['ay','by']].values.ravel()
df = pd.DataFrame({'ID':a, 'x':b, 'y':c}).drop_duplicates('ID').reset_index(drop=True)
print (df)
ID x y
0 x 5 7
1 y 3 2
2 z 2 0
3 w 7 4
4 k 5 7
5 v 2 3
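The same reshape can be sketched with pd.concat and set_axis instead of ravel (the sample frame below is rebuilt from the question); note the row order differs from the ravel version, because the columns are stacked rather than interleaved:

```python
import pandas as pd

# Data rebuilt from the question
df = pd.DataFrame({'A': list('xzkv'), 'B': list('ywxy'),
                   'ax': [5, 2, 5, 2], 'ay': [7, 0, 7, 3],
                   'bx': [3, 7, 2, 3], 'by': [2, 4, 0, 2]})

# Stack the (A, ax, ay) and (B, bx, by) column groups under common names
parts = [df[['A', 'ax', 'ay']].set_axis(['ID', 'x', 'y'], axis=1),
         df[['B', 'bx', 'by']].set_axis(['ID', 'x', 'y'], axis=1)]
df1 = (pd.concat(parts, ignore_index=True)
         .drop_duplicates('ID')
         .reset_index(drop=True))
```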

Pandas self-join on non-unique values

I have the following table:
ind_ID pair_ID orig_data
0 A 1 W
1 B 1 X
2 C 2 Y
3 D 2 Z
4 A 3 W
5 C 3 X
6 B 4 Y
7 D 4 Z
Each row has an individual_ID, and a pair_ID that it shares with exactly one other row. I want to do a self join, so that every row has its original data, and the data of the row it shares a pair_ID with:
ind_ID pair_ID orig_data partner_data
0 A 1 W X
1 B 1 X W
2 C 2 Y Z
3 D 2 Z Y
4 A 3 W X
5 C 3 X W
6 B 4 Y Z
7 D 4 Z Y
I have tried:
df.join(df, on='pair_ID')
But obviously since pair_ID values are not unique I get:
ind_ID pair_ID orig_data partner_data
0 A 1 W NaN
1 B 1 X NaN
2 C 2 Y NaN
3 D 2 Z NaN
4 A 3 W NaN
5 C 3 X NaN
6 B 4 Y NaN
7 D 4 Z NaN
I've also thought about creating a new column that concatenates ind_ID+pair_ID which would be unique, but then the join would not know what to match on.
Is it possible to do a self-join on pair_ID where each row is joined with the matching row that is not itself?
In your case (with exactly two rows per pair) you can probably just group by pair_ID and transform, reversing the order of the values in each group, e.g.:
df.loc[:, 'partner_data'] = df.groupby('pair_ID').orig_data.transform(lambda L: L[::-1])
Which gives you:
ind_ID pair_ID orig_data partner_data
0 A 1 W X
1 B 1 X W
2 C 2 Y Z
3 D 2 Z Y
4 A 3 W X
5 C 3 X W
6 B 4 Y Z
7 D 4 Z Y
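If you ever have more than two rows per pair, or prefer an explicit join, a merge-based self-join also works; a sketch on the question's data (with more than two rows per pair, each row would get one output row per partner):

```python
import pandas as pd

df = pd.DataFrame({'ind_ID': list('ABCDACBD'),
                   'pair_ID': [1, 1, 2, 2, 3, 3, 4, 4],
                   'orig_data': list('WXYZWXYZ')})

# Join each row to every row sharing its pair_ID, then drop self-matches
merged = df.merge(df, on='pair_ID', suffixes=('', '_partner'))
result = (merged[merged['ind_ID'] != merged['ind_ID_partner']]
          .drop(columns='ind_ID_partner')
          .rename(columns={'orig_data_partner': 'partner_data'})
          .reset_index(drop=True))
```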

pandas dataframe inserting null values

I have two dataframes:
index a b c d
1 x x x x
2 x nan x x
3 x x x x
4 x nan x x
index a b e
3 x nan x
4 x x x
5 x nan x
6 x x x
I want to combine them into the following, where we simply get rid of the NaN values. (An easier version of this question is where the second dataframe has no NaN values.)
index a b c d e
1 x x x x x
2 x x x x x
3 x x x x x
4 x x x x x
5 x x x x x
6 x x x x x
You may use combine_first with fillna:
DataFrame.combine_first(other) combines two DataFrame objects, defaulting to the non-null values in the frame calling the method; the result's index and columns are the union of the respective indexes and columns (see the combine_first documentation).
import pandas as pd
import numpy as np
d1 = pd.DataFrame([[np.nan, 1, 1], [2, 2, 2], [3, 3, 3]], columns=['a', 'b', 'c'])
d1
a b c
0 NaN 1 1
1 2 2 2
2 3 3 3
d2 = pd.DataFrame([[1, np.nan, 1], [np.nan, 2, 2], [3, 3, np.nan]], columns=['b', 'd', 'e'])
d2
b d e
0 1 NaN 1
1 NaN 2 2
2 3 3 NaN
d2.combine_first(d1) # d1's values are prioritized, if d2 has no NaN
a b c d e
0 NaN 1 1 NaN 1
1 2 2 2 2 2
2 3 3 3 3 NaN
d2.combine_first(d1).fillna(5) # simply fill NaN with a value
a b c d e
0 5 1 1 5 1
1 2 2 2 2 2
2 3 3 3 3 5
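Applied to frames shaped like the question's (a sketch; the cell values are reconstructed, and cells absent from both frames stay NaN until a fillna):

```python
import numpy as np
import pandas as pd

# Frames reconstructed from the question: overlapping index 3-4,
# shared columns a and b, and one column unique to each frame
df1 = pd.DataFrame({'a': ['x'] * 4, 'b': ['x', np.nan, 'x', np.nan],
                    'c': ['x'] * 4, 'd': ['x'] * 4}, index=[1, 2, 3, 4])
df2 = pd.DataFrame({'a': ['x'] * 4, 'b': [np.nan, 'x', np.nan, 'x'],
                    'e': ['x'] * 4}, index=[3, 4, 5, 6])

out = df1.combine_first(df2)  # df1's non-null values win on overlap
```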
Use nan_to_num to replace NaN with a number (zero by default):
http://docs.scipy.org/doc/numpy/reference/generated/numpy.nan_to_num.html
Just apply this:
from numpy import nan_to_num
df2 = df.apply(nan_to_num)
Then you can merge the arrays however you want.
