I want to add a new column with the maximum next_crossing_down for the entire x street.
I have this:
import pandas as pd

cars = pd.DataFrame({'x': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
                     'y': [7, None, 13, 14, 22, None, 9, 13, 14, 15, 16],
                     'next_crossing_down': [5, None, 10, 10, 20, None, 5, 10, 10, 10, 15]})
x y next_crossing_down
0 1 7.0 5.0
1 1 NaN NaN
2 1 13.0 10.0
3 1 14.0 10.0
4 1 22.0 20.0
5 1 NaN NaN
6 1 9.0 5.0
7 2 13.0 10.0
8 2 14.0 10.0
9 2 15.0 10.0
10 2 16.0 15.0
And I would like this:
x y next_crossing_down next_crossing_down_max
0 1 7.0 5.0 20.0
1 1 NaN NaN NaN
2 1 13.0 10.0 20.0
3 1 14.0 10.0 20.0
4 1 22.0 20.0 20.0
5 1 NaN NaN NaN
6 1 9.0 5.0 20.0
7 2 13.0 10.0 15.0
8 2 14.0 10.0 15.0
9 2 15.0 10.0 15.0
10 2 16.0 15.0 15.0
This is the closest I have come. I get the right numbers, just not broadcast across the entire x street.
cars['next_crossing_down_max'] = cars.groupby(['x'])['next_crossing_down'].max()
x y next_crossing_down next_crossing_down_max
0 1 7.0 5.0 NaN
1 1 NaN NaN 20.0
2 1 13.0 10.0 15.0
3 1 14.0 10.0 NaN
4 1 22.0 20.0 NaN
5 1 NaN NaN NaN
6 1 9.0 5.0 NaN
7 2 13.0 10.0 NaN
8 2 14.0 10.0 NaN
9 2 15.0 10.0 NaN
10 2 16.0 15.0 NaN
Are you looking for pandas.DataFrame.transform? A plain .max() returns one value per group, indexed by the group key x, so assigning it back aligns on the row index (which is why the values land only on rows 1 and 2 above); transform('max') broadcasts the group maximum back to every row of its group.
import numpy as np

cars['next_crossing_down_max'] = cars.groupby(['x'])['next_crossing_down'].transform('max')
cars['next_crossing_down_max'] = np.where(cars['next_crossing_down'].isnull(),
                                          np.nan,
                                          cars['next_crossing_down_max'])
Output
cars
Out[18]:
x y next_crossing_down next_crossing_down_max
0 1 7.0 5.0 20.0
1 1 NaN NaN NaN
2 1 13.0 10.0 20.0
3 1 14.0 10.0 20.0
4 1 22.0 20.0 20.0
5 1 NaN NaN NaN
6 1 9.0 5.0 20.0
7 2 13.0 10.0 15.0
8 2 14.0 10.0 15.0
9 2 15.0 10.0 15.0
10 2 16.0 15.0 15.0
Alternatively, you could use Series.mask instead of np.where, which gets you the same result but is a bit slower (thanks to @Anky):
>>> cars.groupby("x")['next_crossing_down'].transform('max').mask(cars['next_crossing_down'].isna())
Out[19]:
0 20.0
1 NaN
2 20.0
3 20.0
4 20.0
5 NaN
6 20.0
7 15.0
8 15.0
9 15.0
10 15.0
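If you prefer a single statement, the transform and the NaN masking can be combined; a minimal sketch using the same frame:
cars['next_crossing_down_max'] = (
    cars.groupby('x')['next_crossing_down']
        .transform('max')
        .mask(cars['next_crossing_down'].isna())
)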
Suppose that I have 2 dataframes, with indexes populated so that the elements in each column are unique (because in the real data they are):
vals = pd.DataFrame(np.random.randint(0, 100, (10, 3)), columns=list('ABC'))
indexes = pd.DataFrame(np.argsort(np.random.randint(0, 10, (10, 3)), axis=0)[:5], columns=list('ABC'))
>>> vals
A B C
0 64 20 48
1 28 60 81
2 5 73 77
3 74 66 86
4 41 39 21
5 65 37 98
6 10 20 73
7 6 70 3
8 36 29 28
9 43 13 12
>>> indexes
A B C
0 4 2 3
1 3 3 8
2 5 1 7
3 9 8 9
4 2 4 0
I would like to retain only those values in vals whose indexes are listed in indexes. I don't care about row integrity or NaNs, as I'll use the columns as Series later.
This is what I came up with:
vals_indexes = pd.DataFrame()
for i in range(vals.shape[1]):
    vals_indexes = pd.concat(
        [vals_indexes,
         vals.iloc[[e for e in indexes.iloc[:, i] if e in vals.index], i]],
        axis=1)
>>> vals_indexes
A B C
0 NaN NaN 48.0
1 NaN 60.0 NaN
2 5.0 73.0 NaN
3 74.0 66.0 86.0
4 41.0 39.0 NaN
5 65.0 NaN NaN
7 NaN NaN 3.0
8 NaN 29.0 28.0
9 43.0 NaN 12.0
Which is a bit ugly, but works for me. Question: is there a more efficient way to do this?
Use .loc within a loop to set entries whose index is not listed in indexes to NaN:
for i in vals.columns:
    vals.loc[~vals.index.isin(indexes[i]), i] = np.nan
print(vals)
      A     B     C
0   NaN   NaN  48.0
1   NaN  60.0   NaN
2   5.0  73.0   NaN
3  74.0  66.0  86.0
4  41.0  39.0   NaN
5  65.0   NaN   NaN
6   NaN   NaN   NaN
7   NaN   NaN   3.0
8   NaN  29.0  28.0
9  43.0   NaN  12.0
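A loop-free sketch is also possible, assuming (as above) that indexes holds row labels of vals: build a boolean keep-mask per column and apply it with DataFrame.where:
# True where the row label of vals appears in the corresponding column of indexes
keep = pd.DataFrame({c: vals.index.isin(indexes[c]) for c in vals.columns},
                    index=vals.index)
vals_indexes = vals.where(keep)  # everything not kept becomes NaN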
How can I separate each dataframe with an empty row?
I've combined them using this snippet:
frames1 = [df4, df5, df6]
Summary = pd.concat(frames1)
So how can I separate them with an empty row?
You can use the example below:
Create test dfs
df1 = pd.DataFrame(np.random.randint(0,20,20).reshape(5,4),columns=list('ABCD'))
df2 = pd.DataFrame(np.random.randint(0,20,20).reshape(5,4),columns=list('ABCD'))
df3 = pd.DataFrame(np.random.randint(0,20,20).reshape(5,4),columns=list('ABCD'))
dfs=[df1,df2,df3]
Solution:
pd.concat([df.append(pd.Series(), ignore_index=True) for df in dfs])
A B C D
0 17.0 16.0 15.0 7.0
1 13.0 6.0 12.0 18.0
2 0.0 2.0 10.0 17.0
3 8.0 13.0 10.0 17.0
4 4.0 18.0 8.0 19.0
5 NaN NaN NaN NaN
0 14.0 0.0 13.0 12.0
1 10.0 3.0 6.0 3.0
2 15.0 10.0 15.0 3.0
3 9.0 16.0 11.0 4.0
4 5.0 7.0 6.0 2.0
5 NaN NaN NaN NaN
0 10.0 18.0 13.0 12.0
1 1.0 6.0 10.0 0.0
2 2.0 19.0 4.0 18.0
3 4.0 3.0 9.0 16.0
4 16.0 6.0 5.0 6.0
5 NaN NaN NaN NaN
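Note that DataFrame.append was removed in pandas 2.0, so on current pandas the same idea has to be written with pd.concat alone. A sketch (with_spacer is a hypothetical helper, not part of the original answer):
def with_spacer(df):
    # build one all-NaN row with the same columns as df
    spacer = pd.DataFrame([[np.nan] * df.shape[1]], columns=df.columns)
    return pd.concat([df, spacer], ignore_index=True)

pd.concat([with_spacer(df) for df in dfs])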
For a horizontal stack:
pd.concat([df.assign(test=np.nan) for df in dfs], axis=1)
A B C D test A B C D test A B C D test
0 17 16 15 7 NaN 14 0 13 12 NaN 10 18 13 12 NaN
1 13 6 12 18 NaN 10 3 6 3 NaN 1 6 10 0 NaN
2 0 2 10 17 NaN 15 10 15 3 NaN 2 19 4 18 NaN
3 8 13 10 17 NaN 9 16 11 4 NaN 4 3 9 16 NaN
4 4 18 8 19 NaN 5 7 6 2 NaN 16 6 5 6 NaN
Is this what you want?
fname = 'test2.csv'
frames1 = [df4, df5, df6]
with open(fname, mode='a+') as f:
    for df in frames1:
        df.to_csv(fname, mode='a', header=f.tell() == 0)
        f.write('\n')  # blank separator row
        f.flush()      # push the separator to disk before the next append
test2.csv:
,a,b,c
0,0,1,2
1,3,4,5
2,6,7,8

0,0,1,2
1,3,4,5
2,6,7,8

0,0,1,2
1,3,4,5
2,6,7,8
f.tell() == 0 checks whether the file handle is at the beginning of the file, i.e. at position 0; if so, the header is written, otherwise it isn't.
NOTE: I have used the same values for all the dfs, which is why all the groups look identical.
For columns:
fname = 'test3.csv'
frames1 = [df1, df2, df3]
Summary = pd.concat([df.assign(**{' ':' '}) for df in frames1], axis=1)
Summary.to_csv(fname)
test3.csv:
,a,b,c, ,a,b,c, ,a,b,c,
0,0,1,2, ,0,1,2, ,0,1,2,
1,3,4,5, ,3,4,5, ,3,4,5,
2,6,7,8, ,6,7,8, ,6,7,8,
But the columns will not be equally spaced. If you save with header=False:
test3.csv:
0,0,1,2, ,0,1,2, ,0,1,2,
1,3,4,5, ,3,4,5, ,3,4,5,
2,6,7,8, ,6,7,8, ,6,7,8,
I am trying to sort out a dataframe where some rows are all NaN. I want to fill these using ffill. I'm currently trying this, although I feel like it's a mishmash of a few commands:
df.loc[df['A'].isna(), :] = df.fillna(method='ffill')
This gives an error:
AttributeError: 'NoneType' object has no attribute 'fillna'
But I want to restrict the ffill to rows where a particular column is NaN, i.e.:
A B C D E
0 45 88 NaN NaN 3
1 62 34 2 86 NaN
2 85 65 11 31 5
3 NaN NaN NaN NaN NaN
4 90 38 34 93 8
5 0 94 45 10 10
6 58 NaN 23 60 11
7 10 32 5 15 11
8 NaN NaN NaN NaN NaN
So I would only like to ffill a row iff the value of A is NaN, whilst leaving (0, C) and (0, D) as NaN, giving the dataframe below:
A B C D E
0 45 88 NaN NaN 3
1 62 34 2 86 NaN
2 85 65 11 31 5
3 85 65 11 31 5
4 90 38 34 93 8
5 0 94 45 10 10
6 58 NaN 23 60 11
7 10 32 5 15 11
8 10 32 5 15 11
So just to clarify: the ONLY rows that get replaced with ffill are 3 and 8, because the value of column A in rows 3 and 8 is NaN.
Thanks
---Update---
When I'm debugging and evaluate the expression df.loc[df['A'].isna(), :],
I get
3 NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN
So I assume what's happening here is that I then attempt ffill on this new dataframe containing only rows 3 and 8, and obviously I can't ffill NaNs with NaNs.
Change values only in those rows that start with NaN:
df.loc[df['A'].isna(), :] = df.ffill().loc[df['A'].isna(), :]
A B C D E
0 45.0 88.0 NaN NaN 3.0
1 62.0 34.0 2.0 86.0 NaN
2 85.0 65.0 11.0 31.0 5.0
3 85.0 65.0 11.0 31.0 5.0
4 90.0 38.0 34.0 93.0 8.0
5 0.0 94.0 45.0 10.0 10.0
6 58.0 NaN 23.0 60.0 11.0
7 10.0 32.0 5.0 15.0 11.0
8 10.0 32.0 5.0 15.0 11.0
Try using a mask to identify the relevant rows where column A is null, then take those same rows from the forward-filled dataframe.
mask = df['A'].isnull()
df.loc[mask, :] = df.ffill().loc[mask, :]
>>> df
A B C D E
0 45.0 88.0 NaN NaN 3.0
1 62.0 34.0 2.0 86.0 NaN
2 85.0 65.0 11.0 31.0 5.0
3 85.0 65.0 11.0 31.0 5.0
4 90.0 38.0 34.0 93.0 8.0
5 0.0 94.0 45.0 10.0 10.0
6 58.0 NaN 23.0 60.0 11.0
7 10.0 32.0 5.0 15.0 11.0
8 10.0 32.0 5.0 15.0 11.0
You just want to fill (DataFrame.ffill) where (DataFrame.where) df['A'] is NaN, and leave the rest as it was before (df):
df = df.ffill().where(df['A'].isna(), df)
print(df)
A B C D E
0 45.0 88.0 NaN NaN 3.0
1 62.0 34.0 2.0 86.0 NaN
2 85.0 65.0 11.0 31.0 5.0
3 85.0 65.0 11.0 31.0 5.0
4 90.0 38.0 34.0 93.0 8.0
5 0.0 94.0 45.0 10.0 10.0
6 58.0 NaN 23.0 60.0 11.0
7 10.0 32.0 5.0 15.0 11.0
8 10.0 32.0 5.0 15.0 11.0
I create the following dataframe:
Date ProductID SubProductId Value
0 2015-01-02 1 1 11
1 2015-01-02 1 2 12
2 2015-01-02 1 3 NaN
3 2015-01-02 1 4 NaN
4 2015-01-02 2 1 14
5 2015-01-02 2 2 15
6 2015-01-02 2 3 16
7 2015-01-03 1 1 17
8 2015-01-03 1 2 18
9 2015-01-03 1 3 NaN
10 2015-01-03 1 4 21
11 2015-01-03 2 1 20
12 2015-01-03 2 2 21
And then I group the subproducts by products:
df.set_index(['Date','ProductID','SubProductId']).unstack(['ProductID','SubProductId'])
and I would like to get the following:
Value
ProductID 1 2
SubProductId 1 2 3 4 1 2 3
Date
2015-01-02 11.0 12.0 NaN NaN 14.0 15.0 16.0
2015-01-03 17.0 18.0 NaN 21.0 20.0 21.0 NaN
But when I print it, every column that starts with NaN gets pushed to the end:
Value
ProductID 1 2 1
SubProductId 1 2 1 2 3 4 3
Date
2015-01-02 11.0 12.0 14.0 15.0 16.0 NaN NaN
2015-01-03 17.0 18.0 20.0 21.0 NaN 21.0 NaN
How can I have every sub-column grouped under its corresponding column, even the sub-columns that contain NaN?
NB: Versions used:
Python version: 3.6.0
Pandas version: 0.19.2
If you want to have ordered column names, you can use sortlevel with axis=1 to sort the column index:
df1 = df.set_index(['Date','ProductID','SubProductId']).unstack(['ProductID','SubProductId'])
# sort in descending order
df1.sortlevel(axis=1, ascending=False)
# Value
#ProductID 2 1
#SubProductId 3 2 1 4 3 2 1
#Date
#2015-01-02 16.0 15.0 14.0 NaN NaN 12.0 11.0
#2015-01-03 NaN 21.0 20.0 21.0 NaN 18.0 17.0
# sort in ascending order
df1.sortlevel(axis=1, ascending=True)
# Value
#ProductID 1 2
#SubProductId 1 2 3 4 1 2 3
#Date
#2015-01-02 11.0 12.0 NaN NaN 14.0 15.0 16.0
#2015-01-03 17.0 18.0 NaN 21.0 20.0 21.0 NaN
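Note that sortlevel was deprecated in pandas 0.20 and later removed; on a current pandas version the equivalent (a sketch, since the question targets 0.19.2) is sort_index:
df1.sort_index(axis=1, ascending=True)  # same ascending/descending options apply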