Update non-available (NaN) values of one pandas column based on another - Python

I have a two-column dataframe with column names ['user_id', 'cookie_id'], and I would like to update user_id values that are NaN whenever an available user_id value exists for the same cookie_id.
Example:
(before)
user_id cookie_id
2 15
2 15
3 22
NaN 15
NaN 15
NaN 38
(after)
user_id cookie_id
2 15
2 15
3 22
2 15
2 15
NaN 38

If you need to replace only the missing values with the first non-missing user_id per cookie_id, use GroupBy.transform with GroupBy.first and Series.fillna:
df['user_id'] = df['user_id'].fillna(df.groupby("cookie_id")['user_id'].transform('first'))
print (df)
user_id cookie_id
0 2.0 15
1 2.0 15
2 3.0 22
3 2.0 15
4 2.0 15
5 NaN 38
Or, if you want every row to take the first non-missing value per group, use:
df['user_id'] = df.groupby("cookie_id")['user_id'].transform('first')
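For reference, a minimal self-contained sketch of the fillna approach; the DataFrame construction is an assumption based on the sample data above:
import numpy as np
import pandas as pd

# rebuild the example data from the question (an assumption)
df = pd.DataFrame({
    "user_id": [2, 2, 3, np.nan, np.nan, np.nan],
    "cookie_id": [15, 15, 22, 15, 15, 38],
})

# fill only missing user_id values with the first non-missing user_id
# seen for the same cookie_id; cookie 38 has none, so it stays NaN
df['user_id'] = df['user_id'].fillna(df.groupby("cookie_id")['user_id'].transform('first'))
print (df)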

Related

How to filter a pandas dataframe until it finds a value in a NaN column?

I have a data frame like this:
df:
number score
12 NaN
13 NaN
14 NaN
15 NaN
16 10
17 NaN
18 NaN
I want to filter this data frame from the start to the row where it finds a number in the score column.
So, after filtering the data frame should look like this:
new_df:
number score
12 NaN
13 NaN
14 NaN
15 NaN
16 10
I want to filter this data frame from the row where it finds a number in the score column to the end of the data frame.
So, after filtering the data frame should look like this:
new_df:
number score
16 10
17 NaN
18 NaN
How do I filter this data frame?
Kindly help
You can use pd.Series.first_valid_index and pd.Series.last_valid_index like this:
df.loc[df['score'].first_valid_index():]
Output:
number score
4 16 10.0
5 17 NaN
6 18 NaN
And,
df.loc[:df['score'].last_valid_index()]
Output:
number score
0 12 NaN
1 13 NaN
2 14 NaN
3 15 NaN
4 16 10.0
And if you want to clip both leading and trailing NaN values, you can combine the two:
df.loc[df['score'].first_valid_index():df['score'].last_valid_index()]
Output:
number score
4 16 10.0
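Putting it together, a runnable sketch; the frame is reconstructed from the sample, so treat the construction as an assumption:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "number": [12, 13, 14, 15, 16, 17, 18],
    "score": [np.nan, np.nan, np.nan, np.nan, 10, np.nan, np.nan],
})

start = df['score'].first_valid_index()   # 4, the first non-NaN row label
end = df['score'].last_valid_index()      # 4, the last non-NaN row label
print(df.loc[start:])       # rows 4-6: first valid value to the end
print(df.loc[:end])         # rows 0-4: start through the last valid value
print(df.loc[start:end])    # row 4 only: both ends clipped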
You can use a reverse cummax and boolean slicing:
new_df = df[df['score'].notna()[::-1].cummax()]
Output:
number score
0 12 NaN
1 13 NaN
2 14 NaN
3 15 NaN
4 16 10.0
For the second one, a simple cummax:
new_df = df[df['score'].notna().cummax()]
Output:
number score
4 16 10.0
5 17 NaN
6 18 NaN
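To see why the reversed cummax keeps the leading NaNs, here is a sketch that prints the intermediate mask (same assumed frame as above):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "number": [12, 13, 14, 15, 16, 17, 18],
    "score": [np.nan, np.nan, np.nan, np.nan, 10, np.nan, np.nan],
})

# notna() marks the valid rows; reversing and taking the cumulative maximum
# turns every position up to and including the LAST valid row into True
mask = df['score'].notna()[::-1].cummax()
print(mask.sort_index())   # True for rows 0-4, False for rows 5-6
print(df[mask])            # boolean indexing aligns on the index labels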

How to filter the rows with multiple entries of MultiIndex level two?

I have a dataframe, df with a MultiIndex.
df.columns
Index(['all', 'month', 'day', 'year'], dtype='object')
all month day year
match
7 0 10/24/89 10 24 89
8 0 3/7/86 3 7 86
1 10 NaN NaN 10
9 0 4/10/71 4 10 71
10 0 5/11/85 5 11 85
1 96 NaN NaN 96
2 26 NaN NaN 26
11 0 10 NaN NaN 10
1 4/09/75 4 09 75
12 0 8/01/98 8 01 98
How can I select the rows with more than 1 entry at the MultiIndex level 2?
For example, here I need the rows 8, 10 and 11.
You can use groupby.transform on the first level of the index with len. Then test where the length is greater than or equal to (ge) the value you want (here 2) to get a boolean mask and select the rows:
print(df[df.groupby(level=0)['month'].transform(len).ge(2)])
all month day year
match
8 0 3/7/86 3.0 7.0 86
1 10 NaN NaN 10
10 0 5/11/85 5.0 11.0 85
1 96 NaN NaN 96
2 26 NaN NaN 26
11 0 10 NaN NaN 10
1 4/09/75 4.0 9.0 75
Here I use 'month' as the column after the groupby operation, but any column in your dataframe would work.
You can also use groupby.filter and get the same result with:
print(df.groupby(level=0).filter(lambda x: len(x)>=2))
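A small runnable stand-in; the index values and data are assumptions chosen to mirror the question's shape:
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples([(7, 0), (8, 0), (8, 1), (9, 0)], names=['match', None])
df = pd.DataFrame({'all': ['10/24/89', '3/7/86', '10', '4/10/71'],
                   'month': [10, 3, np.nan, 4]}, index=idx)

# keep only level-0 groups that contain 2 or more rows (here, match == 8)
print(df[df.groupby(level=0)['month'].transform(len).ge(2)])
# the filter variant gives the same result
print(df.groupby(level=0).filter(lambda x: len(x) >= 2))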

Applying values to column and grouping all columns by those values

I have a pandas dataframe as shown here. All lines without a value for ["sente"] contain further information, but they are not yet linked to ["sente"].
id pos value sente
1 a I 21
2 b have 21
3 b a 21
4 a cat 21
5 d ! 21
6 cat N NaN
7 a My 22
8 a cat 22
9 b is 22
10 a cute 22
11 d . 22
12 cat N NaN
13 cute M NaN
Now I want each row with no value in ["sente"] to take its value from the row above. Then I want to group everything by ["sente"] and collect the content of the rows that had no ["sente"] value into a new column.
sente pos value content
21 a,b,b,a,d I have a cat ! 'cat,N'
22 a,a,b,a,d My cat is cute . 'cat,N','cute,M'
This would be my first step:
df.loc[(df['sente'] != df["sente"].shift(-1)) & df["sente"].isna(), "sente"] = df["sente"].shift(1)
but it only works for one additional row, not if there are 2 or more.
This groups one column the way I want:
df.groupby(["sente"])['value'].apply(lambda x: " ".join(x))
But for more columns it doesn't work the way I want:
df.groupby(["sente"]).agg(lambda x: ",".join(x))
Is there any way to do this without using stack functions?
Use:
import numpy as np

#boolean mask of missing 'sente' values
m = df['sente'].isnull()
#new column of joined pos and value, only where the mask is True
df['content'] = np.where(m, df['pos'] + ',' + df['value'], np.nan)
#replace 'pos' and 'value' with NaN where the mask is True
df[['pos', 'value']] = df[['pos', 'value']].mask(m)
print (df)
id pos value sente content
0 1 a I 21.0 NaN
1 2 b have 21.0 NaN
2 3 b a 21.0 NaN
3 4 a cat 21.0 NaN
4 5 d ! 21.0 NaN
5 6 NaN NaN NaN cat,N
6 7 a My 22.0 NaN
7 8 a cat 22.0 NaN
8 9 b is 22.0 NaN
9 10 a cute 22.0 NaN
10 11 d . 22.0 NaN
11 12 NaN NaN NaN cat,N
12 13 NaN NaN NaN cute,M
Finally, replace the NaNs in 'sente' by forward filling with ffill, group by it, and join each column's values while removing NaNs with dropna:
df1 = df.groupby(df["sente"].ffill()).agg(lambda x: " ".join(x.dropna()))
print (df1)
pos value content
sente
21.0 a b b a d I have a cat ! cat,N
22.0 a a b a d My cat is cute . cat,N cute,M
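An end-to-end sketch on a small slice of the data; the construction is an assumption, and the columns are selected explicitly before agg because joining non-string columns such as 'id' would raise on recent pandas versions:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'pos': ['a', 'b', 'b', 'a', 'd', 'cat'],
    'value': ['I', 'have', 'a', 'cat', '!', 'N'],
    'sente': [21, 21, 21, 21, 21, np.nan],
})

m = df['sente'].isnull()
df['content'] = np.where(m, df['pos'] + ',' + df['value'], np.nan)
df[['pos', 'value']] = df[['pos', 'value']].mask(m)

# group by the forward-filled sente and join only the string columns
out = (df.groupby(df['sente'].ffill())[['pos', 'value', 'content']]
         .agg(lambda x: ' '.join(x.dropna())))
print(out)
# expected: pos 'a b b a d', value 'I have a cat !', content 'cat,N'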

Working with NaN values in multiple columns in Pandas

I have multiple datasets with different numbers of rows and the same number of columns. I would like to find NaN values in each column. For example, consider these two datasets:
dataset1:
a b
1 10
2 9
3 8
4 nan
5 nan
6 nan
dataset2:
a b
2 11
3 12
4 13
nan 14
nan 15
nan 16
I want to handle the NaN values in columns a and b: if a NaN occurs in column b, remove that row; if it occurs in column a, fill that value with 0.
This is my code snippet:
a = pd.notnull(data['a'].values.any())
b = pd.notnull(data['b'].values.any())
if a:
    data = data.dropna(subset=['a'])
if b:
    data[['a']] = data[['a']].fillna(value=0)
which does not work properly.
You just need fillna and dropna, without control flow:
data = data.dropna(subset=['b']).fillna(0)
Pass your conditions to fillna as a dict:
df=df.fillna({'a':0,'b':np.nan}).dropna()
You do not need 'b' here:
df=df.fillna({'a':0}).dropna()
EDIT:
df.fillna({'a':0}).dropna()
Out[1319]:
a b
0 2.0 11
1 3.0 12
2 4.0 13
3 0.0 14
4 0.0 15
5 0.0 16
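For instance, on dataset2 (reconstructed here as an assumption):
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'a': [2, 3, 4, np.nan, np.nan, np.nan],
    'b': [11, 12, 13, 14, 15, 16],
})

# fill NaN in 'a' with 0, then drop any rows where NaN remains (i.e. in 'b')
data = data.fillna({'a': 0}).dropna()
print(data)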

Iterate through the rows of a dataframe and reassign minimum values by group

I am working with a dataframe that looks like this.
id time diff
0 0 34 nan
1 0 36 2
2 1 43 7
3 1 55 12
4 1 59 4
5 2 2 -57
6 2 10 8
What is an efficient way to find the minimum values of 'time' by id and then set 'diff' to NaN at those minimums? I am looking for a solution that results in:
id time diff
0 0 34 nan
1 0 36 2
2 1 43 nan
3 1 55 12
4 1 59 4
5 2 2 nan
6 2 10 8
Use groupby('id') with idxmin to find the locations of the minimum values of 'time', then use loc to assign np.nan:
df.loc[df.groupby('id').time.idxmin(), 'diff'] = np.nan
df
You can group the time by id and compute a boolean vector that is True where the time is the minimum within its group and False otherwise, then use that vector to assign NaN to the corresponding rows:
import numpy as np
import pandas as pd
df.loc[df.groupby('id')['time'].apply(lambda g: g == min(g)), "diff"] = np.nan
df
# id time diff
#0 0 34 NaN
#1 0 36 2.0
#2 1 43 NaN
#3 1 55 12.0
#4 1 59 4.0
#5 2 2 NaN
#6 2 10 8.0
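Both answers run as-is on a reconstruction of the example frame (the construction itself is an assumption):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [0, 0, 1, 1, 1, 2, 2],
    'time': [34, 36, 43, 55, 59, 2, 10],
    'diff': [np.nan, 2, 7, 12, 4, -57, 8],
})

# idxmin returns the row label of each group's minimum 'time';
# loc then assigns NaN to 'diff' exactly at those labels
df.loc[df.groupby('id')['time'].idxmin(), 'diff'] = np.nan
print(df)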
