I have the following dataframe:
a = pd.DataFrame([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]], columns=['a','b','c'])
a
Out[234]:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
I want to add a column d in which only the last row is filled: the mean of the last 2 values of column c. Something like:
a b c d
0 1 2 3 NaN
1 4 5 6 NaN
2 7 8 9 NaN
3 10 11 12 mean(9,12)
I tried this, but the first part (a['d']) raises a KeyError because column d doesn't exist yet:
a['d'].iloc[-1] = a.c.iloc[-2:].values.mean()
You can use .at to assign at a single row/column label pair:
ix = a.shape[0]  # number of rows; with the default RangeIndex the last label is ix - 1
a.at[ix-1, 'd'] = a.loc[ix-2:ix, 'c'].values.mean()
a b c d
0 1 2 3 NaN
1 4 5 6 NaN
2 7 8 9 NaN
3 10 11 12 10.5
Also note that chained indexing (what you're doing with a.c.iloc[-2:]) is explicitly discouraged in the docs, given that pandas sees these operations as separate events, namely two separate calls to __getitem__, rather than a single call using a nested tuple of slices.
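For comparison, a quick sketch of the same lookup written as a single indexing call rather than a chained one:
# chained form: two separate __getitem__ calls
a.c.iloc[-2:]
# single-call form: both axes in one .iloc, looking up c's position
a.iloc[-2:, a.columns.get_loc('c')]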
You may create the d column beforehand (so the assignment has a column to land in):
In [100]: a['d'] = np.nan
In [101]: a['d'].iloc[-1] = a.c.iloc[-2:].mean()
In [102]: a
Out[102]:
a b c d
0 1 2 3 NaN
1 4 5 6 NaN
2 7 8 9 NaN
3 10 11 12 10.5
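One caveat: a['d'].iloc[-1] = ... is itself a chained assignment, so under copy-on-write (the default from pandas 3.0) it no longer propagates to the DataFrame; the .at and .loc forms shown in the other answers are the robust way to set a single cell.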
We can use .loc, .iloc & np.mean:
import numpy as np

a.loc[a.index.max(), 'd'] = np.mean(a.iloc[-2:, 2])  # column c sits at position 2
a b c d
0 1 2 3 NaN
1 4 5 6 NaN
2 7 8 9 NaN
3 10 11 12 10.5
Or just using .loc and np.mean:
a.loc[a.index.max(), 'd'] = np.mean(a.loc[a.index.max()-1:, 'c'])  # label slice covering the last two rows
a b c d
0 1 2 3 NaN
1 4 5 6 NaN
2 7 8 9 NaN
3 10 11 12 10.5
I would like to replace values in a column, but only the values seen after a specific value.
for example, I have the following dataset:
In [108]: df = pd.DataFrame([[12,13,14,15,16,17],[4,10,5,6,1,3],[1,3,5,4,9,1],[2,4,1,8,3,4],[4,2,6,7,1,8]], index=['ID','time','A','B','C']).T
In [109]: df
Out[109]:
ID time A B C
0 12 4 1 2 4
1 13 10 3 4 2
2 14 5 5 1 6
3 15 6 4 8 7
4 16 1 9 3 1
5 17 3 1 4 8
and I want to change, for column "A", all the values that come after the first 5 to a 1; for column "B", all the values that come after the first 1 to 6; and for column "C", all the values after the first 7 to a 5, so it will look like this:
ID time A B C
0 12 4 1 2 4
1 13 10 3 4 2
2 14 5 5 1 6
3 15 6 1 6 7
4 16 1 1 6 5
5 17 3 1 6 5
I know that I could use where to get a similar effect, but if I put a condition like df["A"] = np.where(x != 5, 1, x), this will obviously change the values before the 5 as well. I can't think of anything else at the moment.
Thanks for the help.
Use DataFrame.mask with values shifted by DataFrame.shift and compared against the dictionary; DataFrame.cummax then propagates True to all following rows:
df=pd.DataFrame([[12,13,14,15,16,17],[4,10,5,6,1,3],
[1, 3,5,4,9,1],[2, 4, 1,8,3,4], [4, 2, 6,7,1,8]],
index=['ID','time','A', 'B', 'C']).T
after = {'A':5, 'B':1, 'C': 7}
new = {'A':1, 'B':6, 'C': 5}
cols = list(after.keys())
s = pd.Series(new)
# the mask turns True from the row *after* each trigger value onward
df[cols] = df[cols].mask(df[cols].shift().eq(pd.Series(after)).cummax(), s, axis=1)
print (df)
ID time A B C
0 12 4 1 2 4
1 13 10 3 4 2
2 14 5 5 1 6
3 15 6 1 6 7
4 16 1 1 6 5
5 17 3 1 6 5
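To make the mechanics visible, this is roughly what the intermediate mask evaluates to for this frame (shift pushes each trigger value one row down, eq flags it, cummax carries the True forward):
print (df[cols].shift().eq(pd.Series(after)).cummax())
       A      B      C
0  False  False  False
1  False  False  False
2  False  False  False
3   True   True  False
4   True   True   True
5   True   True   True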
I want to remove/drop all rows after the first occurrence of NaN in any of the DataFrame's columns.
I have created two sample DataFrames as illustrated below. In the first DataFrame the dtypes of the initial two columns are object while the last one is int; in the second DataFrame they are float, object and int.
First:
>>> df = pd.DataFrame({"A": (1,2,3,4,5,6,7,'NaN','NaN','NaN','NaN'),"B": (1,2,3,'NaN',4,5,6,7,'NaN',"9","10"),"C": range(11)})
>>> df
A B C
0 1 1 0
1 2 2 1
2 3 3 2
3 4 NaN 3
4 5 4 4
5 6 5 5
6 7 6 6
7 NaN 7 7
8 NaN NaN 8
9 NaN 9 9
10 NaN 10 10
Dtypes:
>>> df.dtypes
A object
B object
C int64
dtype: object
An index-based approach on a particular column, as follows, works just fine as long as the dtypes are object and int, but I'm looking for a DataFrame-level action, not one limited to a single column.
>>> df[:df[df['A'] == 'NaN'].index[0]]
A B C
0 1 1 0
1 2 2 1
2 3 3 2
3 4 NaN 3
4 5 4 4
5 6 5 5
6 7 6 6
>>> df[:df[df['B'] == 'NaN'].index[0]]
A B C
0 1 1 0
1 2 2 1
2 3 3 2
Second:
Another interesting fact: when creating the DataFrame with np.nan we get different dtypes, and then even the index-based approach fails, even for a single-column operation.
>>> df = pd.DataFrame({"A": (1,2,3,4,5,6,7,np.nan,np.nan,np.nan,np.nan),"B": (1,2,3,np.nan,4,5,6,7,np.nan,"9","10"),"C": range(11)})
>>> df
A B C
0 1.0 1 0
1 2.0 2 1
2 3.0 3 2
3 4.0 NaN 3
4 5.0 4 4
5 6.0 5 5
6 7.0 6 6
7 NaN 7 7
8 NaN NaN 8
9 NaN 9 9
10 NaN 10 10
dtypes:
>>> df.dtypes
A float64
B object
C int64
dtype: object
Error:
>>> df[:df[df['B'] == 'NaN'].index[0]]
IndexError: index 0 is out of bounds for axis 0 with size 0
>>> df[:df[df['A'] == 'NaN'].index[0]]
IndexError: index 0 is out of bounds for axis 0 with size 0
Expected output for the second DataFrame:
>>> df
A B C
0 1.0 1 0
1 2.0 2 1
2 3.0 3 2
So, I am looking for a way to check across the entire DataFrame regardless of dtype and drop all rows from the first occurrence of NaN in the DataFrame.
You can try:
out = df.iloc[:df.isna().any(axis=1).idxmax()]
OR
via replace(), turn your 'NaN' strings into real NaN values, then check for missing values and filter rows:
df=df.replace({'NaN':float('NaN'),'nan':float('NaN')})
out = df.iloc[:df.isna().any(axis=1).idxmax()]
output of out:
A B C
0 1.0 1 0
1 2.0 2 1
2 3.0 3 2
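For reference, a sketch of the intermediate step on the second DataFrame: idxmax returns the index label of the first True, which iloc then uses as an exclusive stop:
>>> df.isna().any(axis=1)
0     False
1     False
2     False
3      True
4     False
5     False
6     False
7      True
8      True
9      True
10     True
dtype: bool
>>> df.isna().any(axis=1).idxmax()
3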
Just for posterity ...
>>> df.iloc[:df.isna().any(axis=1).argmax()]
A B C
0 1.0 1 0
1 2.0 2 1
2 3.0 3 2
>>> df.iloc[:df.isnull().any(axis=1).argmax()]
A B C
0 1.0 1 0
1 2.0 2 1
2 3.0 3 2
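A note on the difference: idxmax returns the index label of the first True, while argmax returns its position. With the default RangeIndex they coincide, but with a non-default index only the argmax variant is safe to pass to iloc. A sketch (the letter index here is hypothetical):
# give df a non-default index; argmax still yields a position iloc understands
df2 = df.set_axis(list('abcdefghijk'))
out = df2.iloc[:df2.isna().any(axis=1).argmax()]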
Let's say I have a DataFrame that looks like this:
A  B  C  D
X  5  6  5
Y  3  1  2
X  9  7  5
3  5  2
9  8  7  NaN
5  2  2  NaN
X  4  3  1
Y  3  2  1
6  8  0  NaN
Notice that Column A contains certain values that are letters (i.e. X, Y), but it also contains values that should be in Column B (i.e. the numbers). How do I specify in Pandas:
For every value that is not X, Y, or empty, shift that specific row over by 1 column.
My desired output being:
A  B  C  D
X  5  6  5
Y  3  1  2
X  9  7  5
   3  5  2
   9  8  7
   5  2  2
X  4  3  1
Y  3  2  1
   6  8  0
Even something like this would work for me:
A  B  C  D  E
X  5  6  5
Y  3  1  2
X  9  7  5
   3  5  2  NaN
   9  8  7  NaN
   5  2  2  NaN
X  4  3  1
Y  3  2  1
   6  8  0  NaN
There should be some sort of function to say: for every value that is X, Y or empty, ignore it, and for the ones that are numbers, shift them over... or vice versa.
Any tips would be greatly appreciated!
Basically: how do I say, for anything that is not one of these specific values, take that row and shift everything over?
My approach would be to convert the DataFrame into a list of lists and then insert an empty element at the front of each row that doesn't start with X, Y, or ''.
df = df.values.tolist()
for row in df:
    if row[0] not in ['X', 'Y', '']:
        row.insert(0, '')  # shift the whole row one column to the right
result = pd.DataFrame(df, columns=list('ABCDE'))
Output:
   A  B  C    D    E
0  X  5  6  5.0  NaN
1  Y  3  1  2.0  NaN
2  X  9  7  5.0  NaN
3     3  5  2.0  NaN
4     9  8  7.0  NaN
5     5  2  2.0  NaN
6  X  4  3  1.0  NaN
7  Y  3  2  1.0  NaN
8     6  8  0.0  NaN
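If you'd rather stay in pandas (applied to the original DataFrame, before the .values.tolist() conversion), a hedged alternative sketch along the same lines: mask the misplaced rows and write their values back one column to the right. The empty-string check mirrors the answer above.
import numpy as np

misplaced = ~df['A'].isin(['X', 'Y', ''])   # rows whose A holds a number
df['E'] = np.nan                            # make room for the shifted values
df.loc[misplaced, ['B', 'C', 'D', 'E']] = df.loc[misplaced, ['A', 'B', 'C', 'D']].to_numpy()
df.loc[misplaced, 'A'] = ''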
having two columns A and B in a dataframe:
A B
0 1 6
1 2 7
2 1 8
3 2 9
4 1 10
I would like to create a column C. C must contain the values of B shifted row-wise by the value of A:
A B C
0 1 6 NaN
1 2 7 NaN
2 1 8 7
3 2 9 7
4 1 10 9
The command:
df['C'] = df['B'].shift(df['A'])
does not work.
Do you have any other ideas?
I'd use help from numpy to avoid the apply
import numpy as np

# positional index of the source row for each target row
l = np.arange(len(df)) - df.A.values
# take B at those positions; rows that would reach before row 0 become NaN
df['C'] = np.where(l >= 0, df.B.values[l], np.nan)
df
A B C
0 1 6 NaN
1 2 7 NaN
2 1 8 7.0
3 2 9 7.0
4 1 10 9.0
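For this example frame, the intermediate index array works out as follows (a sketch of the mechanics, using the values shown above):
# l = np.arange(5) - df.A.values
#   = [ 0,  1, 2, 3, 4] - [1, 2, 1, 2, 1]
#   = [-1, -1, 1, 1, 3]
# the negative entries are masked to NaN; positions 1, 1, 3 pick B values 7, 7, 9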
This is tricky due to index alignment. You can define a user func and apply it row-wise on your df; here the function performs a shift on the B column by that row's A value and then picks out the value at the row's own index (using the .name attribute):
In [134]:
def func(x):
    return df['B'].shift(x['A'])[x.name]

df['C'] = df.apply(func, axis=1)
df
Out[134]:
A B C
0 1 6 NaN
1 2 7 NaN
2 1 8 7.0
3 2 9 7.0
4 1 10 9.0
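A note on the trade-off: the apply version recomputes df['B'].shift(...) once per row, so it scales quadratically with the number of rows, while the numpy answer above does a single vectorized gather. For small frames the readability may well be worth it.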
I'm trying to understand the expected behavior of DataFrame.sort on columns with NaN values.
Given this DataFrame:
In [36]: df
Out[36]:
a b
0 1 9
1 2 NaN
2 NaN 5
3 1 2
4 6 5
5 8 4
6 4 5
Sorting using one column puts the NaN at the end, as expected:
In [37]: df.sort(columns="a")
Out[37]:
a b
0 1 9
3 1 2
1 2 NaN
6 4 5
4 6 5
5 8 4
2 NaN 5
But nested sort doesn't behave as I would expect, leaving the NaN unsorted:
In [38]: df.sort(columns=["a","b"])
Out[38]:
a b
3 1 2
0 1 9
1 2 NaN
2 NaN 5
6 4 5
4 6 5
5 8 4
Is there a way to make sure the NaNs in nested sort will appear at the end, per column?
Until this is fixed in Pandas, this is what I'm using for my needs, covering a subset of the functionality of the original DataFrame.sort function. It works for numerical values only:
import numpy as np

def dataframe_sort(df, columns, ascending=True):
    a = np.array(df[columns])
    # per-column sign: 1 if ascending, -1 if descending
    if isinstance(ascending, bool):
        ascending = len(columns) * [ascending]
    ascending = [1 if x else -1 for x in ascending]
    # np.lexsort uses the last key in the list as the primary key, hence the
    # reversed order; NaNs sort to the end of each key regardless of sign
    ind = np.lexsort([ascending[i] * a[:, i] for i in reversed(range(len(columns)))])
    return df.iloc[ind]
Usage example:
In [4]: df
Out[4]:
a b c
10 1 9 7
11 NaN NaN 1
12 2 NaN 6
13 NaN 5 6
14 1 2 6
15 6 5 NaN
16 8 4 4
17 4 5 3
In [5]: dataframe_sort(df, ['a', 'c'], False)
Out[5]:
a b c
16 8 4 4
15 6 5 NaN
17 4 5 3
12 2 NaN 6
10 1 9 7
14 1 2 6
13 NaN 5 6
11 NaN NaN 1
In [6]: dataframe_sort(df, ['b', 'a'], [False, True])
Out[6]:
a b c
10 1 9 7
17 4 5 3
15 6 5 NaN
13 NaN 5 6
16 8 4 4
14 1 2 6
12 2 NaN 6
11 NaN NaN 1
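For what it's worth, on modern pandas DataFrame.sort is gone and its replacement handles this case directly: sort_values takes an na_position argument that, as far as I can tell, pushes NaNs to the end for each sort key even in multi-column sorts:
df.sort_values(by=['a', 'b'], na_position='last')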