I've got a pandas DataFrame in which I want to modify several columns for some of the rows. These are the approaches I've attempted. First I create the final columns and select the exceptional rows:
df[['finalA', 'finalB']] = df[['A', 'B']]
exceptions = df.loc[df.normal == False]
Which works like a charm, but now I want to set the exceptions:
df.loc[exceptions.index, ['finalA', 'finalB']] = \
df.loc[exceptions.index, ['A_except', 'B_except']]
Which doesn't work. So I tried using .ix from this answer.
df.ix[exceptions.index, ['finalA', 'finalB']] = \
df.ix[exceptions.index, ['A_except', 'B_except']]
Which doesn't work either. Both methods give me NaN in both finalA and finalB for the exceptional rows.
The only way that seems to work is doing it one column at a time:
df.ix[exceptions.index, 'finalA'] = \
df.ix[exceptions.index, 'A_except']
df.ix[exceptions.index, 'finalB'] = \
df.ix[exceptions.index, 'B_except']
What's going on here in pandas? How do I avoid setting the values to the copy that is apparently made by selecting multiple columns? Is there a way to avoid this kind of code repetition?
Some more musings: it doesn't actually write the values to a copy of the DataFrame; it sets finalA and finalB to NaN in the original frame. So the assignment does go through, it just overwrites the values with NaN.
Sample dataframe:
import pandas as pd
df = pd.DataFrame({'A': [1,2,3,4],
'B': [5,6,7,8],
'normal': [True, True, False, False],
'A_except': [0,0,9,9],
'B_except': [0,0,10,10]})
Result:
A A_except B B_except normal finalA finalB
0 1 0 5 0 True 1.0 5.0
1 2 0 6 0 True 2.0 6.0
2 3 9 7 10 False NaN NaN
3 4 9 8 10 False NaN NaN
Expected result:
A A_except B B_except normal finalA finalB
0 1 0 5 0 True 1 5
1 2 0 6 0 True 2 6
2 3 9 7 10 False 9 10
3 4 9 8 10 False 9 10
You can rename the column names so that they align:
d = {'A_except':'finalA', 'B_except':'finalB'}
df.loc[exceptions.index, ['finalA', 'finalB']] = \
df.loc[exceptions.index, ['A_except', 'B_except']].rename(columns=d)
print (df)
A A_except B B_except normal finalA finalB
0 1 0 5 0 True 1 5
1 2 0 6 0 True 2 6
2 3 9 7 10 False 9 10
3 4 9 8 10 False 9 10
Another solution is to convert the output to a numpy array, in which case the columns are not aligned at all:
df.loc[exceptions.index, ['finalA', 'finalB']] = \
df.loc[exceptions.index, ['A_except', 'B_except']].values
print (df)
A A_except B B_except normal finalA finalB
0 1 0 5 0 True 1 5
1 2 0 6 0 True 2 6
2 3 9 7 10 False 9 10
3 4 9 8 10 False 9 10
If you look at both sides of the assignment, you will notice that the column labels differ. Pandas takes the column labels into account when aligning, and since they don't match, it won't insert the values.
It works for a single column because then you are extracting a Series, so the column label no longer applies.
A quick solution is to simply strip the DataFrame down to a bare array; then both the loc and the ix method work:
df.loc[exceptions.index, ['finalA', 'finalB']] = \
df.loc[exceptions.index, ['A_except', 'B_except']].values
But keep in mind that doing this bypasses pandas' attempt to match column and index labels; it's basically a 'hard' insert, so it makes you, the user, responsible for proper alignment. In this case that's not a problem, but it is something to be aware of in general.
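If you also want to avoid selecting the exceptional rows and repeating yourself per column, here is a minimal sketch of an alternative (using numpy; one option, not the only way) that builds both final columns in a single conditional assignment. np.where sidesteps label alignment entirely because it operates on plain arrays:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8],
                   'normal': [True, True, False, False],
                   'A_except': [0, 0, 9, 9],
                   'B_except': [0, 0, 10, 10]})

df[['finalA', 'finalB']] = df[['A', 'B']]  # create the target columns first
# Take A/B where 'normal' is True, otherwise the *_except values;
# the (4, 1) condition broadcasts across both columns.
df.loc[:, ['finalA', 'finalB']] = np.where(df[['normal']],
                                           df[['A', 'B']],
                                           df[['A_except', 'B_except']])
print(df)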
Related
I have a pandas DataFrame that has a little over 100 columns.
There are about 50 columns that I want to sort ascending and then there is one column (a date_time column) that I want to reverse sort.
How do I go about achieving this? I know I can do something like...
df = df.sort_values(by = ['column_001', 'column_003', 'column_009', 'column_017',... 'date_time'], ascending=[True, True, True, True,... False])
... but I am trying to avoid having to type 'True' 50 times.
Just wondering if there is a shorthand way of doing this.
Thanks.
Dan
You can use:
cols = ['column_001', 'column_003', 'column_009', 'column_017',... 'date_time']
df.sort_values(by=cols, ascending=[True]*49+[False])
Or, for a programmatic variant for which you don't need to know the position of the False, using numpy:
cols = ['column_001', 'column_003', 'column_009', 'column_017',... 'date_time']
df.sort_values(by=cols, ascending=np.array(cols)!='date_time')
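If you'd rather not pull in numpy for that last step, an equivalent sketch with a plain list comprehension (reusing the cols list from above) is:
df.sort_values(by=cols, ascending=[c != 'date_time' for c in cols])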
It should go something like this.
to_be_reserved = "COLUMN_TO_BE_RESERVED"
df = df.sort_values(by=[col for col in df.columns if col != to_be_reserved],ignore_index=True)
df[to_be_reserved] = df[to_be_reserved].sort_values(ascending=False,ignore_index = True)
You can also use filter if your 49 columns have a regular pattern:
# if you have a column name pattern
cols = df.filter(regex=('^(column_|date_time)')).columns.tolist()
ascending_false = ['date_time']
ascending = [True if c not in ascending_false else False for c in cols]
df.sort_values(by=cols, ascending=ascending)
Example:
>>> df
column_0 column_1 date_time value other_value another_value
0 4 2 6 6 1 1
1 4 4 0 6 0 2
2 3 2 6 9 0 7
3 9 2 1 7 4 7
4 6 9 2 4 4 1
>>> df.sort_values(by=cols, ascending=ascending)
column_0 column_1 date_time value other_value another_value
2 3 2 6 9 0 7
0 4 2 6 6 1 1
1 4 4 0 6 0 2
4 6 9 2 4 4 1
3 9 2 1 7 4 7
I have a dataframe with a series of numbers. For example:
Index Column 1
1 10
2 12
3 24
4 NaN
5 20
6 15
7 NaN
8 NaN
9 2
I can't use bfill or ffill, as the rule is dynamic: take the value preceding a run of NaNs and divide it by the number of consecutive NaNs + 1. For example, rows 3 and 4 should both be replaced with 12 (24/2), and rows 6, 7 and 8 should be replaced with 5. All other numbers should remain unchanged.
How should I do that?
Note: Edited the dataframe to be more general by inserting a new row between rows 4 and 5 and another row at the end.
You can do:
m = (df["Column 1"].notna()) & (
(df["Column 1"].shift(-1).isna()) | (df["Column 1"].shift().isna())
)
out = df.groupby(m.cumsum()).transform(
lambda x: x.fillna(0).mean() if x.isna().any() else x
)
print(out):
Index Column 1
0 1 10.0
1 2 12.0
2 3 12.0
3 4 12.0
4 5 20.0
5 6 5.0
6 7 5.0
7 8 5.0
8 9 2.0
Explanation and intermediate values:
Basically, look for rows whose own value is not NaN but whose next or previous value is NaN. Those rows form the first row of each such group.
So the m in above code looks like:
0 True
1 False
2 True
3 False
4 True
5 True
6 False
7 False
8 True
Now I want to form groups of rows of the shape [True, <all False>], because those are the groups I want to take the average of. For that, use cumsum.
If you want to take a look at those groups, you can use ngroup() after groupby on m.cumsum():
0 0
1 0
2 1
3 1
4 2
5 3
6 3
7 3
8 4
The above is only to show what are the groups.
Now for each group you can get the mean of the group if the group has any NaN value. This is accomplished by checking for NaNs using x.isna().any().
If the group has any NaN value, then assign the mean after filling NaN with 0; otherwise just keep the group as is. This is accomplished by the lambda:
lambda x: x.fillna(0).mean() if x.isna().any() else x
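Putting it all together, a minimal runnable version of the above (rebuilding the sample column from the question; the exact index labels don't matter for the logic) might look like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({"Column 1": [10, 12, 24, np.nan, 20, 15, np.nan, np.nan, 2]})

# A group starts at every non-NaN value that has a NaN directly before or after it.
m = df["Column 1"].notna() & (
    df["Column 1"].shift(-1).isna() | df["Column 1"].shift().isna()
)

# Within each group, spread the first value evenly if the group contains NaNs.
out = df.groupby(m.cumsum()).transform(
    lambda x: x.fillna(0).mean() if x.isna().any() else x
)
print(out)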
Why not use interpolate? It has a method parameter, and one of its options might fit what you want.
However, if you really want to do as you described above, you can do something like this. (Note that iterating over rows in pandas is considered bad practice, but it does the job)
import pandas as pd
import numpy as np
df = pd.DataFrame([10,
12,
24,
np.NaN,
15,
np.NaN,
np.NaN])
for col in df:
    for idx in df.index:  # (iterating over rows is considered bad practice)
        local_idx = idx
        while local_idx + 1 < len(df) and np.isnan(df.at[local_idx + 1, col]):
            # extend to cover the run of consecutive NaNs after this row
            local_idx += 1
        if (local_idx - idx) > 0:
            # split the starting value evenly over itself and the NaNs that follow
            fillvalue = df.loc[idx] / (local_idx - idx + 1)
            for fillidx in range(idx, local_idx + 1):
                df.loc[fillidx] = fillvalue
df
Output:
0
0 10.0
1 12.0
2 12.0
3 12.0
4 5.0
5 5.0
6 5.0
I am looking to populate a new dataframe column with True if two cell values match another smaller subset dataframe in pandas, otherwise with a value of False.
For instance, this is the output dataframe I am constructing.
ID Type
1 A
2 B
3 A
4 A
5 C
6 A
7 D
8 A
9 B
10 A
And the smaller subset of the dataframe selected based on some criteria:
ID Type
1 A
3 A
4 A
5 C
7 D
10 A
What I am trying to accomplish: when ID and Type in the output dataframe match the smaller subset dataframe, I want to populate a new column called 'Result' with the value True. Otherwise, the value should be False.
ID Type Result
1 A True
2 B False
3 A True
4 A True
5 C True
6 A False
7 D True
8 A False
9 B False
10 A True
You can .merge() the 2 dataframes using a left merge with the original dataframe as base and turn on the indicator= parameter to show the merge result. Then change the merge result to True for the rows that appear in both dataframes and False otherwise.
df_out = df1.merge(df2, on=['ID', 'Type'] , how='left', indicator='Result')
df_out['Result'] = (df_out['Result'] == 'both')
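For reference, here is a minimal sketch that rebuilds the two dataframes from the question (the names df1 for the full frame and df2 for the subset are assumed), so the two lines above can be run as-is:
import pandas as pd

# Full output dataframe from the question.
df1 = pd.DataFrame({'ID': range(1, 11),
                    'Type': ['A', 'B', 'A', 'A', 'C', 'A', 'D', 'A', 'B', 'A']})

# Smaller subset selected "based on some criteria" (hard-coded IDs here for the example).
df2 = df1[df1['ID'].isin([1, 3, 4, 5, 7, 10])]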
Explanation:
With the indicator= parameter turned on, pandas will show you which dataframe each row came from (in terms of both, left_only and right_only):
df_out = df1.merge(df2, on=['ID', 'Type'] , how='left', indicator='Result')
print(df_out)
ID Type Result
0 1 A both
1 2 B left_only
2 3 A both
3 4 A both
4 5 C both
5 6 A left_only
6 7 D both
7 8 A left_only
8 9 B left_only
9 10 A both
Then we transform 'both' versus the other values into True/False with a boolean mask, as follows:
df_out['Result'] = (df_out['Result'] == 'both')
print(df_out)
ID Type Result
0 1 A True
1 2 B False
2 3 A True
3 4 A True
4 5 C True
5 6 A False
6 7 D True
7 8 A False
8 9 B False
9 10 A True
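As a side note, the same Result column can also be computed without a merge; this is only a sketch of an alternative, comparing (ID, Type) pairs through the index:
# Mark rows of df1 whose (ID, Type) pair also appears in df2.
df1['Result'] = df1.set_index(['ID', 'Type']).index.isin(
    df2.set_index(['ID', 'Type']).index)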
I have two dataframes, e.g.
import pandas as pd
import numpy as np
from random import shuffle
df_data = pd.DataFrame(data=np.random.randint(low=0, high=10, size=(10,3)), columns=['A', 'B', 'C'])
keys = np.arange(0, 10)
shuffle(keys)
df_data['keys'] = keys
key_data = pd.DataFrame(data=np.reshape(np.arange(1,10), (3,3)), columns=['Key_col1', 'Key_col2', 'Key_col3'])
key_data['Timestamp'], key_data['Info'] = ['Mon', 'Wed', 'Fri'], [13, 2, 47]
Which returns something like this:
A B C keys
0 3 9 2 5
1 7 9 4 7
2 9 6 6 0
3 9 9 0 9
4 8 5 8 6
5 2 5 7 3
6 5 1 2 4
7 3 9 6 2
8 4 2 3 8
9 6 5 5 1
and this:
Key_col1 Key_col2 Key_col3 Timestamp Info
0 1 2 3 Mon 13
1 4 5 6 Wed 2
2 7 8 9 Fri 47
I'd like to use the 'keys' column in the first dataframe to search only the key columns in the second dataframe (i.e. Key_col1, Key_col2, Key_col3), because the 'Info' column may contain values that match keys.
I'll then add the columns Timestamp and Info to the row in which there is a match for the key.
Expected output for row 0 would be this:
A B C keys Timestamp Info
0 3 9 2 5 Wed 2
My approach is to first compare a subset of my key_data against a value:
key_data.iloc[:, 0:3] == 2
OUT
Key_col1 Key_col2 Key_col3
0 False True False
1 False False False
2 False False False
In my next step I try to return only the row where the value True occurs using df.loc
key_data.loc[:, key_data.iloc[:, 0:3] == 2]
But this results in the error ValueError: Cannot index with multidimensional key
Can somebody help me to return the row in which the value True occurs so that I can use this index for selecting where to append my data?
Thanks
EDIT: The keys are unique and all of them are present in exactly 1 of the 3 key columns.
This works for you, just rename the columns:
new_df = pd.merge(df_data, key_data, how= 'right', left_on=['keys','keys','keys'], right_on = ['Key_col1','Key_col2','Key_col3'])
new_df = new_df.dropna(axis=1, how='all')
Can somebody help me to return the row in which the value True occurs so that I can use this index for selecting where to append my data?
The answer to this question is key_data.loc[(key_data.iloc[:, 0:3] == 2).any(axis=1)], but for your larger goal, doing something with merge as Rahul Agarwal suggests would be better.
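For the larger goal, one way the merge route might be sketched (using the df_data and key_data constructed in the question) is to reshape key_data so that every key value sits in its own row, then merge on it:
# Melt the three key columns into one 'keys' column, keeping Timestamp and Info alongside.
long_keys = key_data.melt(id_vars=['Timestamp', 'Info'], value_name='keys')
# Attach Timestamp and Info to each row of df_data via its 'keys' value;
# keys that do not appear in any key column end up with NaN.
result = df_data.merge(long_keys[['keys', 'Timestamp', 'Info']], on='keys', how='left')
print(result)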
This may sound a bit weird, but I think it's exactly what I need right now:
I have several pandas dataframes that contain columns with float numbers, for example:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
Now I want to add a column, with only one row, and the value is equal to the average of column 'a', in this case, is 3.0. So the new dataframe will looks like this:
a b c average
0 0 1 2 3.0
1 3 4 5
2 6 7 8
And all the rows below are empty.
I've tried things like df['average'] = np.mean(df['a']) but that give me a whole column of 3.0. Any help will be appreciated.
Assign a Series; this is cleaner.
df['average'] = pd.Series(df['a'].mean(), index=df.index[[0]])
Or, even better, assign with loc:
df.loc[df.index[0], 'average'] = df['a'].mean().item()
Filling the NaNs is straightforward; you can do
df['average'] = df['average'].fillna('')
df
a b c average
0 0 1 2 3
1 3 4 5
2 6 7 8
Can do something like:
df['average'] = [np.mean(df['a'])]+['']*(len(df)-1)
Here is a full example:
import pandas as pd
import numpy as np
df = pd.DataFrame(
[(0,1,2), (3,4,5), (6,7,8)],
columns=['a', 'b', 'c'])
print(df)
a b c
0 0 1 2
1 3 4 5
2 6 7 8
df['average'] = ''
df.loc[0, 'average'] = df['a'].mean()  # use .loc rather than chained indexing, which may not write back
print(df)
a b c average
0 0 1 2 3
1 3 4 5
2 6 7 8