Given the following dataframe df, where df['B']=df['M1']+df['M2']:
A M1 M2 B
1 1 2 3
1 2 NaN NaN
1 3 6 9
1 4 8 12
1 NaN 10 NaN
1 6 12 18
I want each NaN in column B to be replaced by the corresponding value in M1 or M2, whichever is not NaN:
A M1 M2 B
1 1 2 3
1 2 NaN 2
1 3 6 9
1 4 8 12
1 NaN 10 10
1 6 12 18
This answer suggested using:
df.loc[df['B'].isnull(), 'B'] = df['M1']
but this line can only take one of M1 or M2 into account, not both at the same time.
Ideas on how I should change it to consider both columns?
EDIT
Not a duplicate question! For ease of understanding, I claimed that df['B']=df['M1']+df['M2'], but in my real case, df['B'] is not a sum and comes from a rather complicated computation. So I cannot apply a simple formula to df['B']: all I can do is change the NaN values to match the corresponding value in either M1 or M2.
Based on our discussion in the comments above:
df.B = df.B.fillna(df[['M1', 'M2']].max(axis=1))
df
Out[52]:
A M1 M2 B
0 1 1.0 2.0 3.0
1 1 2.0 NaN 2.0
2 1 3.0 6.0 9.0
3 1 4.0 8.0 12.0
4 1 NaN 10.0 10.0
5 1 6.0 12.0 18.0
From jezrael:
df['B'] = (df['M1'] + df['M2']).fillna(df[['M2', 'M1']].sum(axis=1))
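Since B should simply take whichever of M1/M2 is present, a chained fillna is another minimal sketch (using the column names from the question):

df['B'] = df['B'].fillna(df['M1']).fillna(df['M2'])

This fills B's NaNs from M1 first and falls back to M2 where M1 is also missing.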
Let's say we want to compute variable D in the dataframe below based on the time values in variables B and C.
Here, the second row of D is C2 - B1, a difference of 4 minutes;
the third row is C3 - B2 = 4 minutes, and so on.
There is no reference value for the first row of D, so it is NA.
Issue:
We also want an NA value in the first row whenever the category value in variable A changes from 1 to 2. In other words, the value -183 must be replaced by NA.
A B C D
1 5:43:00 5:24:00 NA
1 6:19:00 5:47:00 4
1 6:53:00 6:23:00 4
1 7:29:00 6:55:00 2
1 8:03:00 7:31:00 2
1 8:43:00 8:05:00 2
2 6:07:00 5:40:00 -183
2 6:42:00 6:11:00 4
2 7:15:00 6:45:00 3
2 7:53:00 7:17:00 2
2 8:30:00 7:55:00 2
2 9:07:00 8:32:00 2
2 9:41:00 9:09:00 2
2 10:17:00 9:46:00 5
2 10:52:00 10:20:00 3
You can use:
import numpy as np
import pandas as pd

# Compute the row-over-row delta in minutes
df['D'] = (pd.to_timedelta(df['C']).sub(pd.to_timedelta(df['B'].shift()))
           .dt.total_seconds().div(60))
# Reset D to NaN wherever the value of A changes
df.loc[df['A'].ne(df['A'].shift()), 'D'] = np.nan
Output:
>>> df
A B C D
0 1 5:43:00 5:24:00 NaN
1 1 6:19:00 5:47:00 4.0
2 1 6:53:00 6:23:00 4.0
3 1 7:29:00 6:55:00 2.0
4 1 8:03:00 7:31:00 2.0
5 1 8:43:00 8:05:00 2.0
6 2 6:07:00 5:40:00 NaN
7 2 6:42:00 6:11:00 4.0
8 2 7:15:00 6:45:00 3.0
9 2 7:53:00 7:17:00 2.0
10 2 8:30:00 7:55:00 2.0
11 2 9:07:00 8:32:00 2.0
12 2 9:41:00 9:09:00 2.0
13 2 10:17:00 9:46:00 5.0
14 2 10:52:00 10:20:00 3.0
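Alternatively, the shift can be done within each group of A, which makes the separate masking step unnecessary (a sketch, not from the answer above; it assumes the same column layout):

df['D'] = (pd.to_timedelta(df['C'])
           .sub(pd.to_timedelta(df['B']).groupby(df['A']).shift())
           .dt.total_seconds().div(60))

The grouped shift leaves NaT in the first row of each group, so those rows come out as NaN without an explicit mask.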
You can use the difference between datetime columns in pandas. Having
df['B_dt'] = pd.to_datetime(df['B'])
df['C_dt'] = pd.to_datetime(df['C'])
makes the following possible:
>>> df['D'] = (df.groupby('A')
.apply(lambda s: (s['C_dt'] - s['B_dt'].shift()).dt.seconds / 60)
.reset_index(drop=True))
You can always drop these new columns later.
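If the helper columns are no longer needed afterwards, dropping them is a one-liner:

df = df.drop(columns=['B_dt', 'C_dt'])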
I have a multi-column dataframe in which some of the numerical values repeat. It looks like the following:
A B C D
0 1 1 10 1
1 1 1 20 2
2 1 5 30 3
3 2 2 40 4
4 2 3 50 5
This is great; however, I need to make A the index and B the columns. The problem is that the values get aggregated: they are averaged for every identical value of B.
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': [1, 1, 5, 2, 3],
                   'C': [10, 20, 30, 40, 50],
                   'D': [1, 2, 3, 4, 5]})
transposed_df = df.pivot_table(index=['A'], columns=['B'])
Instead of keeping both 10 and 20 under B = 1, it averages the two to 15.
C D
B 1 2 3 5 1 2 3 5
A
1 15.0 NaN NaN 30.0 1.5 NaN NaN 3.0
2 NaN 40.0 50.0 NaN NaN 4.0 5.0 NaN
Is there any way I can keep column B the same and display every value of C and D using pandas, or am I better off writing my own function to do this? Also, it is very important that the index and columns stay the same, because only one of each number can exist.
EDIT: This is the desired output. I understand that this exact layout probably isn't possible, but it shows that 10 and 20 need to both be in column 1 and index 1.
C D
B 1 2 3 5 1 2 3 5
A
1 10.0,20.0 NaN NaN 30.0 1.0,2.0 NaN NaN 3.0
2 NaN 40.0 50.0 NaN NaN 4.0 5.0 NaN
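One way to keep every duplicate instead of the mean is to aggregate into lists (a sketch, not an accepted answer; list-valued cells reproduce the layout above but are awkward for further numeric work):

transposed_df = df.pivot_table(index=['A'], columns=['B'], aggfunc=list)

Each cell then holds the list of all C (or D) values for that (A, B) pair, e.g. [10, 20] at A=1, B=1.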
I have several DataFrames (all with the same index and column structure). The problem is that there are NaN values in these dataframes.
I want to replace these NaN values with the mean of the corresponding values in the other DataFrames.
For example, let's look at 3 dataframes.
DataFrame1, with a NaN at row 1, column M2:
M1 M2 M3
0 1 1 2
1 8 NaN 9
2 4 2 7
3 9 6 3
DataFrame2, with a NaN at row 0, column M3:
M1 M2 M3
0 2 3 NaN
1 1 1 6
2 1 2 9
3 4 6 2
DataFrame3:
M1 M2 M3
0 1 4 2
1 2 9 1
2 1 6 5
3 1 NaN 4
So we replace the NaN in the first DataFrame with 5 = (9+1)/2, the second NaN with 2 = (2+2)/2, the third with 6, and so on.
Is there any good and elegant way to do it?
This is one way, using numpy.nanmean:
import numpy as np

# Element-wise mean across the three frames, ignoring NaN
avg = np.nanmean([df1.values, df2.values, df3.values], axis=0)
for df in [df1, df2, df3]:
    df[df.isnull()] = avg

# Since np.nan is float, convert explicitly back to int;
# astype returns a copy, so the results must be assigned back
df1, df2, df3 = (df.astype(int) for df in (df1, df2, df3))
We can concat the frames, fill the NaN values via a groupby on the row index, then split them back apart to get what you need:
s = pd.concat([df1, df2, df3], keys=[1, 2, 3])
s = s.groupby(level=1).apply(lambda x: x.fillna(x.mean()))
df1, df2, df3 = [x.reset_index(level=0, drop=True) for _, x in s.groupby(level=0)]
df1
Out[1737]:
M1 M2 M3
0 1 1.0 2.0
1 8 5.0 9.0
2 4 2.0 7.0
3 9 6.0 3.0
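The same element-wise mean can also be spelled with plain pandas and used to fill each frame, under the same three-frame setup (a sketch, equivalent to the groupby answer above):

# Per-cell mean across the frames; NaN is skipped by default
avg = pd.concat([df1, df2, df3]).groupby(level=0).mean()
df1, df2, df3 = (df.fillna(avg) for df in (df1, df2, df3))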
I have a data frame, for example:
df = pd.DataFrame([[1, 2, np.nan], [4, 5, np.nan], [7, 8, 9]], columns=['sku', 'r1', 'r2'])
so it would be:
sku r1 r2
0 1 2 NaN
1 4 5 NaN
2 7 8 9.0
I would like to change column r1's values based on r2: if r2 is not NaN, replace r1's value with r2's; otherwise keep r1 unchanged.
So the result would be:
sku r1 r2
0 1 2 NaN
1 4 5 NaN
2 7 9.0 9.0
So you see, 8 is changed to 9.0 in the third row of this example.
I am a new learner of pandas, and it is taking me time to find a solution for this.
Thanks for any help.
You can use mask with notnull:
df['r1'] = df['r1'].mask(df['r2'].notnull(), df['r2'])
print (df)
sku r1 r2
0 1 2.0 NaN
1 4 5.0 NaN
2 7 9.0 9.0
Or loc:
df.loc[df['r2'].notnull(), 'r1'] = df['r2']
print (df)
sku r1 r2
0 1 2.0 NaN
1 4 5.0 NaN
2 7 9.0 9.0
Use np.where:
df['r1'] = np.where(df['r2'].notnull(),df['r2'],df['r1'])
df
Output:
sku r1 r2
0 1 2.0 NaN
1 4 5.0 NaN
2 7 9.0 9.0
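Another spelling worth knowing (a sketch, not from the answers above) is combine_first, which takes r2 where it is not NaN and falls back to r1 otherwise:

df['r1'] = df['r2'].combine_first(df['r1'])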
I'm new to Python and Pandas so there might be a simple solution which I don't see.
I have a number of discontinuous datasets which look like this:
ind A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 3.5 2 0
4 4.0 4 5
5 4.5 3 3
I am now looking for a solution to get the following:
ind A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NAN NAN
4 2.0 NAN NAN
5 2.5 NAN NAN
6 3.0 NAN NAN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
The problem is that the gap in A varies from dataset to dataset in both position and length...
set_index and reset_index are your friends.
df = pd.DataFrame({"A": [0, 0.5, 1.0, 3.5, 4.0, 4.5], "B": [1, 4, 6, 2, 4, 3], "C": [3, 2, 1, 0, 5, 3]})
First move column A to the index:
In [64]: df.set_index("A")
Out[64]:
B C
A
0.0 1 3
0.5 4 2
1.0 6 1
3.5 2 0
4.0 4 5
4.5 3 3
Then reindex with a new index; here the missing data is filled in with NaNs. We use the Index object since we can name it; this will be used in the next step.
In [66]: new_index = pd.Index(np.arange(0, 5, 0.5), name="A")
In [67]: df.set_index("A").reindex(new_index)
Out[67]:
B C
0.0 1 3
0.5 4 2
1.0 6 1
1.5 NaN NaN
2.0 NaN NaN
2.5 NaN NaN
3.0 NaN NaN
3.5 2 0
4.0 4 5
4.5 3 3
Finally move the index back to the columns with reset_index. Since we named the index, it all works magically:
In [69]: df.set_index("A").reindex(new_index).reset_index()
Out[69]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
Using the answer by EdChum above, I created the following function:
def fill_missing_range(df, field, range_from, range_to, range_step=1, fill_with=0):
    return (df
            .merge(pd.DataFrame({field: np.arange(range_from, range_to, range_step)}),
                   how='right', on=field)
            .sort_values(by=field)
            .reset_index(drop=True)
            .fillna(fill_with))
Example usage:
fill_missing_range(df, 'A', 0.0, 4.5, 0.5, np.nan)
In this case I am overwriting your A column with a newly generated dataframe, merging this with your original df, and then re-sorting it:
In [177]:
df.merge(how='right', on='A',
         right=pd.DataFrame({'A': np.arange(df.iloc[0]['A'], df.iloc[-1]['A'] + 0.5, 0.5)}))\
  .sort_values(by='A').reset_index().drop(['index'], axis=1)
Out[177]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
So in the general case you can adjust the arange function, which takes start and end values; note I added 0.5 to the end because arange ranges are half-open (the end value is excluded), and you can pass a step value.
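For example, with the 0.5 step used here, the effect of the half-open range is easy to see:

np.arange(0.0, 4.5, 0.5)        # stops at 4.0 -- the end value 4.5 is excluded
np.arange(0.0, 4.5 + 0.5, 0.5)  # includes 4.5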
A more general method could be like this:
In [197]:
new_index = np.arange(df.iloc[0]['A'], df.iloc[-1]['A'] + 0.5, 0.5)
df = df.set_index('A').reindex(new_index).reset_index()
df
Out[197]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
Here we set the index to column A, reindex the df using the arange function, and then reset the index to move A back into a column.
This question was asked a long time ago, but I have a simple solution that's worth mentioning. You can simply use NumPy's NaN with positional indexing. For instance:
import numpy as np
df.iloc[i, j] = np.nan
will do the trick.