I would like to fill NaN based on a column values' mean.
Example:
   Groups  Temp
1       5    27
2       5    23
3       5   NaN   (will be replaced by 25)
4       1   NaN   (will be replaced by the mean of the Temps that are in group 1)
Any suggestions? Thanks!
Use groupby, transform, and a lambda function with fillna and mean:
df = df.assign(Temp=df.groupby('Groups')['Temp'].transform(lambda x: x.fillna(x.mean())))
print(df)
Output:
   Groups  Temp
0       5  27.0
1       5  23.0
2       5  25.0
3       1   NaN
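An equivalent sketch (not the answer's exact code, but the same idea) fills from the per-group mean computed with transform('mean'); since group 1 contains only a NaN, its mean is NaN and that row stays NaN:
# alternative: compute the group means with transform and fill the gaps
df['Temp'] = df['Temp'].fillna(df.groupby('Groups')['Temp'].transform('mean'))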
I have a DataFrame with (several) grouping variables and (several) value variables. My goal is to set the last n non-NaN values to NaN. So let's take a simple example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2],
                   'value': [1, 2, np.nan, 9, 8]})
df
Out[1]:
id value
0 1 1.0
1 1 2.0
2 1 NaN
3 2 9.0
4 2 8.0
The desired result for n=1 would look like the following:
Out[53]:
id value
0 1 1.0
1 1 NaN
2 1 NaN
3 2 9.0
4 2 NaN
Use groupby() with cumcount() and the group sizes:
N = 1
groups = df.loc[df['value'].notna()].groupby('id')
enum = groups.cumcount()                     # position of each non-NaN value within its group
sizes = groups['value'].transform('size')    # number of non-NaN values per group
df['value'] = df['value'].where(enum < sizes - N)
Output:
id value
0 1 1.0
1 1 NaN
2 1 NaN
3 2 9.0
4 2 NaN
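To make the enum < sizes - N condition concrete, here is a small sketch of the helper series for the sample frame above:
# enum: index 0 -> 0, 1 -> 1, 3 -> 0, 4 -> 1 (index 2 is excluded because its value is NaN)
# sizes: 2 for each of those rows
# enum < sizes - N keeps index 0 and 3; rows 1 and 4 (and the already-NaN row 2) end up NaN
print(pd.concat({'enum': enum, 'sizes': sizes}, axis=1))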
You can use a reversed cumsum after groupby to count how many non-NaN values remain per row, then keep a value only while more than N remain:
df['value'].where(df['value'].notna().iloc[::-1].groupby(df['id']).cumsum() > 1, inplace=True)
df
Out[86]:
id value
0 1 1.0
1 1 NaN
2 1 NaN
3 2 9.0
4 2 NaN
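To see how the mask is built, here is a small sketch of the reversed cumulative count on the original sample frame:
# counted from the end within each id group, the non-NaN values remaining are:
# index 0 -> 2, 1 -> 1, 2 -> 0, 3 -> 2, 4 -> 1
remaining = df['value'].notna().iloc[::-1].groupby(df['id']).cumsum()
print(remaining.sort_index())  # values are kept only where this count exceeds 1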
One option: create a reversed cumcount on the non-NA values:
N = 1
m = (df
.loc[df['value'].notna()]
.groupby('id')
.cumcount(ascending=False)
.lt(N)
)
df.loc[m[m].index, 'value'] = np.nan
Similar approach with boolean masking:
m = df['value'].notna()
df['value'] = df['value'].mask(m[::-1].groupby(df['id']).cumsum().le(N))
Output:
id value
0 1 1.0
1 1 NaN
2 1 NaN
3 2 9.0
4 2 NaN
I am fairly new to Python and pandas; I've been searching for a solution for a couple of days with no luck... here's the problem:
I have a data set like the one below, and I need to cull the first few values of some rows so the highest value in each row ends up in column A. In the example below, rows 0 and 3 would drop the value in column A, row 4 would drop the values in columns A and B, and all remaining values would then shift to the left.
A B C D
0 11 23 21 14
1 24 18 17 15
2 22 18 15 13
3 10 13 12 10
4 5 7 14 11
Desired
A B C D
0 23 21 14 NaN
1 24 18 17 15
2 22 18 15 13
3 13 12 10 NaN
4 14 11 NaN NaN
I've looked at df.shift(), but I don't see how to get that function to work on a row-by-row basis. Should I instead be using an array and a loop?
Any help is greatly appreciated.
You need to turn all values to the left of the max into np.nan and then use the solution from this question. I use the one from @cs95:
df_final = df[df.eq(df.max(1), axis=0).cummax(1)].apply(lambda x: sorted(x, key=pd.isnull), 1)
A B C D
0 23.0 21.0 14.0 NaN
1 24.0 18.0 17.0 15.0
2 22.0 18.0 15.0 13.0
3 13.0 12.0 10.0 NaN
4 14.0 11.0 NaN NaN
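To see why this works, here is a minimal sketch breaking the one-liner above into its steps:
# True where the value equals the row max
is_max = df.eq(df.max(axis=1), axis=0)
# True from the row max onward, False to its left
mask = is_max.cummax(axis=1)
# indexing with the mask turns the values left of the max into NaN;
# sorting each row with key=pd.isnull then pushes those NaNs to the end
masked = df[mask]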
You can loop over the unique shifts (fewer of these than rows) with a groupby and join the results back:
import pandas as pd
shifts = df.to_numpy().argmax(1)
pd.concat([gp.shift(-i, axis=1) for i, gp in df.groupby(shifts)]).sort_index()
A B C D
0 23.0 21.0 14.0 NaN
1 24.0 18.0 17.0 15.0
2 22.0 18.0 15.0 13.0
3 13.0 12.0 10.0 NaN
4 14.0 11.0 NaN NaN
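For the sample data, shifts is array([1, 0, 0, 1, 2]), i.e. the column position of each row's max, so rows that need the same left-shift are grouped and shifted together. A small sketch of the grouping:
for i, gp in df.groupby(shifts):
    print(f"shift left by {i}: rows {list(gp.index)}")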
One approach is to convert each row of the data frame to a list (excluding the index) and append NaN values, then keep ncols elements, starting at the max value.
ncols = len(df.columns)
nans = [np.nan] * ncols
new_rows = list()
for row in df.itertuples():
    # convert each row of the data frame to a list,
    # starting at 1 to exclude the index,
    # and append the list of NaNs
    new_list = list(row[1:]) + nans
    # find index of max value (excluding the NaNs we appended)
    k = np.argmax(new_list[:ncols])
    # collect the new row, starting at the max element
    new_rows.append(new_list[k : k + ncols])
# create new data frame
df_new = pd.DataFrame(new_rows, columns=df.columns)
df_new
for i in range(df.shape[0]):
    arr = list(df.iloc[i, :])
    c = 0
    # drop leading values until the row's max comes first
    while True:
        if arr[0] != max(arr):
            arr.remove(arr[0])
            c += 1
        else:
            break
    # pad with NaN at the end to keep the number of columns
    nan = [np.nan] * c
    arr.extend(nan)
    df.iloc[i, :] = arr
print(df)
I looped over every row, found the max value, removed the values before the max, and padded NaN values at the end to match the number of columns in every row.
I have a DataFrame, and I want to select the 5 rows before and the 5 rows after each row where the flag value is 1.
df=pd.DataFrame({'A':[2,1,3,4,7,8,11,1,15,20,15,16,87],
'flag':[0,0,0,0,0,1,1,1,0,0,0,0,0]})
Expected output:
df1_before =pd.DataFrame({'A':[1,3,4,7,8],
'flag':[0,0,0,0,1]})
df1_after =pd.DataFrame({'A':[8,11,1,15,20],
'flag':[1,1,1,0,0]})
The same process should be repeated for all three rows where flag is 1.
I think one easy way is to loop over the index where the flag is 1 and select the rows you want with loc:
l = len(df)
for idx in df[df.flag.astype(bool)].index:
dfb = df.loc[max(idx-4,0):idx]
dfa = df.loc[idx:min(idx+4,l)]
#do stuff
The min and max functions ensure the boundaries are not exceeded in case you have a flag=1 within the first or last 5 rows. Note also that with loc, if you want 5 rows, you need to use +/-4 on idx to get the right segment.
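A hypothetical example of what #do stuff might be, collecting each window into a list (the names before_windows/after_windows are illustrative, not from the question):
before_windows, after_windows = [], []
l = len(df)
for idx in df[df.flag.astype(bool)].index:
    before_windows.append(df.loc[max(idx-4,0):idx])
    after_windows.append(df.loc[idx:min(idx+4,l)])
# before_windows[0] and after_windows[0] hold the same values as df1_before and
# df1_after (only the index labels differ)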
That said, depending on what your actual #do stuff is, you might want to change tactics. Let's say, for example, that you want to calculate the difference between the sum of A over the 5 rows after and over the 5 rows before. You could use rolling and shift:
df['roll'] = df.rolling(5)['A'].sum()
df.loc[df.flag.astype(bool), 'diff_roll'] = df['roll'].shift(-4) - df['roll']
print (df)
A flag roll diff_roll
0 2 0 NaN NaN
1 1 0 NaN NaN
2 3 0 NaN NaN
3 4 0 NaN NaN
4 7 0 17.0 NaN
5 8 1 23.0 32.0 # = 55 - 23, where 55 is the sum of A over df_after and 23 over df_before
6 11 1 33.0 29.0
7 1 1 31.0 36.0
8 15 0 42.0 NaN
9 20 0 55.0 NaN
10 15 0 62.0 NaN
11 16 0 67.0 NaN
12 87 0 153.0 NaN
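As a quick sanity check for the first flag at index 5 (assuming, as in the expected output, that both windows include the flag row):
before = df.loc[1:5, 'A'].sum()   # 1 + 3 + 4 + 7 + 8 = 23
after = df.loc[5:9, 'A'].sum()    # 8 + 11 + 1 + 15 + 20 = 55
print(after - before)             # 32, matching diff_roll at index 5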
I have a DataFrame with an id column and several columns with data, like the column "value" in this example.
For this DataFrame I want to move all the values that correspond to the same id into new columns in the same row, as shown below.
I guess there is an opposite function to "melt" that allows this, but I can't figure out how to pivot this DataFrame.
The dicts for the input and output DataFrames are:
d = {"id":[1,1,1,2,2,3,3,4,5],"value":[12,13,1,22,21,23,53,64,9]}
d2 = {"id":[1,2,3,4,5],"value1":[12,22,23,64,9],"value2":[1,21,53,"","",],"value3":[1,"","","",""]}
Create a MultiIndex with cumcount, reshape with unstack, and change the column names with add_prefix:
df = (df.set_index(['id',df.groupby('id').cumcount()])['value']
.unstack()
.add_prefix('value')
.reset_index())
print (df)
id value0 value1 value2
0 1 12.0 13.0 1.0
1 2 22.0 21.0 NaN
2 3 23.0 53.0 NaN
3 4 64.0 NaN NaN
4 5 9.0 NaN NaN
Missing values can be replaced with fillna, but this mixes numeric data with strings, so some functions may fail:
df = (df.set_index(['id',df.groupby('id').cumcount()])['value']
.unstack()
.add_prefix('value')
.reset_index()
.fillna(''))
print (df)
id value0 value1 value2
0 1 12.0 13 1
1 2 22.0 21
2 3 23.0 53
3 4 64.0
4 5 9.0
You can GroupBy to a list, then expand the series of lists:
df = pd.DataFrame(d) # create input dataframe
res = df.groupby('id')['value'].apply(list).reset_index() # groupby to list
res = res.join(pd.DataFrame(res.pop('value').values.tolist())) # expand lists to columns
print(res)
id 0 1 2
0 1 12 13.0 1.0
1 2 22 21.0 NaN
2 3 23 53.0 NaN
3 4 64 NaN NaN
4 5 9 NaN NaN
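If the value1/value2/value3 names from d2 are wanted, the integer column labels can be renamed afterwards (a small follow-up sketch):
# rename the expanded columns 0, 1, 2 ... to value1, value2, value3 ...
res.columns = ['id'] + [f'value{i + 1}' for i in range(res.shape[1] - 1)]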
In general, such operations will be expensive as the number of columns is arbitrary. Pandas / NumPy solutions work best when you can pre-allocate memory, which isn't possible here.
I have a pandas dataframe as shown here. All rows without a value in ["sente"] contain further information, but they are not yet linked to ["sente"].
id pos value sente
1 a I 21
2 b have 21
3 b a 21
4 a cat 21
5 d ! 21
6 cat N NaN
7 a My 22
8 a cat 22
9 b is 22
10 a cute 22
11 d . 22
12 cat N NaN
13 cute M NaN
Now I want each row without a value in ["sente"] to get its value from the row above. Then I want to group everything by ["sente"] and create a new column holding the content of the rows that had no value in ["sente"].
sente pos value content
21 a,b,b,a,d I have a cat ! 'cat,N'
22 a,a,b,a,d My cat is cute . 'cat,N','cute,M'
This would be my first step:
df.loc[(df['sente'] != df["sente"].shift(-1)) & (df["sente"].isna()), "sente"] = df["sente"].shift(+1)
but it only works for one additional row, not if there are 2 or more.
This groups up one column like I want:
df.groupby(["sente"])['value'].apply(lambda x: " ".join(x))
But for more columns it doesn't work like I want:
df.groupby(["sente"]).agg(lambda x: ",".join(x))
Is there any way to do this without using stack functions?
Use:
# boolean mask of rows where 'sente' is NaN
m = df['sente'].isnull()
# new column joining 'pos' and 'value', only where the mask is True
df['content'] = np.where(m, df['pos'] + ',' + df['value'], np.nan)
# replace 'pos' and 'value' with NaN where the mask is True
df[['pos', 'value']] = df[['pos', 'value']].mask(m)
print (df)
print (df)
id pos value sente content
0 1 a I 21.0 NaN
1 2 b have 21.0 NaN
2 3 b a 21.0 NaN
3 4 a cat 21.0 NaN
4 5 d ! 21.0 NaN
5 6 NaN NaN NaN cat,N
6 7 a My 22.0 NaN
7 8 a cat 22.0 NaN
8 9 b is 22.0 NaN
9 10 a cute 22.0 NaN
10 11 d . 22.0 NaN
11 12 NaN NaN NaN cat,N
12 13 NaN NaN NaN cute,M
Last, replace the NaNs in sente by forward filling with ffill, then group and join, removing NaNs with dropna:
df1 = df.groupby(df["sente"].ffill()).agg(lambda x: " ".join(x.dropna()))
print (df1)
pos value content
sente
21.0 a b b a d I have a cat ! cat,N
22.0 a a b a d My cat is cute . cat,N cute,M
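If, as in the desired output, pos should be comma-separated while value keeps spaces, a per-column aggregation could be used instead (a sketch under that assumption):
df1 = df.groupby(df["sente"].ffill()).agg({
    'pos': lambda x: ",".join(x.dropna()),
    'value': lambda x: " ".join(x.dropna()),
    'content': lambda x: " ".join(x.dropna()),
})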