I am trying to create a column (is_max) that is 1 if the value in column B is the maximum within its group of column A, and 0 otherwise.
Example:
[Input]
A B
1 2
2 3
1 4
2 5
[Output]
A B is_max
1 2 0
2 5 1
1 4 1
2 3 0
What I'm trying:
df['is_max'] = 0
df.loc[df.reset_index().groupby('A')['B'].idxmax(),'is_max'] = 1
Fix your code by removing the reset_index call:
df['is_max'] = 0
df.loc[df.groupby('A')['B'].idxmax(),'is_max'] = 1
df
Out[39]:
A B is_max
0 1 2 0
1 2 3 0
2 1 4 1
3 2 5 1
I am assuming A is your grouping column, since you did not state it:
df['is_max']=(df['B']==df.groupby('A')['B'].transform('max')).astype(int)
or
df['is_max']=df.groupby('A')['B'].transform(lambda x: x==x.max()).astype(int)
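The two approaches above differ when a group contains ties. The following is a minimal sketch on made-up data (the extra tied row is an assumption added for illustration, not part of the question): idxmax returns a single label per group, so only the first maximum is flagged, whereas comparing against transform('max') flags every tied row.

```python
import pandas as pd

# Hypothetical data: group 1 has two rows tied at B == 4.
df = pd.DataFrame({"A": [1, 2, 1, 2, 1], "B": [2, 3, 4, 5, 4]})

# idxmax yields one index label per group, so only the first maximum is set to 1.
first_only = pd.Series(0, index=df.index)
first_only.loc[df.groupby("A")["B"].idxmax()] = 1

# transform('max') broadcasts the group maximum to every row,
# so the comparison marks all rows that equal it.
all_ties = (df["B"] == df.groupby("A")["B"].transform("max")).astype(int)

print(first_only.tolist())  # [0, 0, 1, 1, 0]
print(all_ties.tolist())    # [0, 0, 1, 1, 1]
```

Which behaviour is wanted depends on whether exactly one row per group should be marked.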
I have this dataframe
dd = pd.DataFrame({'a':[1,5,3],'b':[3,2,3],'c':[2,4,5]})
a b c
0 1 3 2
1 5 2 4
2 3 3 5
I want to replace with 0 the values in columns a and b that are smaller than the value in column c. I want to do this operation row-wise.
I tried this:
dd.applymap(lambda x: 0 if x < x['c'] else x )
I get this error:
TypeError: 'int' object is not subscriptable
I understand that x is an int here, but how do I get the value of column c for that row?
I want this output
a b c
0 0 3 2
1 5 0 4
2 0 0 5
Use DataFrame.mask with DataFrame.lt:
df = dd.mask(dd.lt(dd['c'], axis=0), 0)
print (df)
a b c
0 0 3 2
1 5 0 4
2 0 0 5
Or you can set the values in place by comparing against column c with NumPy broadcasting:
dd[dd < dd['c'].to_numpy()[:, None]] = 0
print (dd)
a b c
0 0 3 2
1 5 0 4
2 0 0 5
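For completeness, here is a sketch of the row-wise variant the question was reaching for. applymap passes single scalars to the function, which is why x['c'] fails; apply with axis=1 passes a whole row as a Series instead. This is only an illustration and is likely slower than the vectorized versions above on large frames.

```python
import pandas as pd

dd = pd.DataFrame({"a": [1, 5, 3], "b": [3, 2, 3], "c": [2, 4, 5]})

# With axis=1 the lambda receives one row as a Series, so row["c"] is valid.
# Series.where keeps values where the condition holds and writes 0 elsewhere.
row_wise = dd.apply(lambda row: row.where(row >= row["c"], 0), axis=1)

# Vectorized version from the answer above, for comparison.
vectorized = dd.mask(dd.lt(dd["c"], axis=0), 0)

print(row_wise)
print(vectorized)
```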
I have the following dataframe:
frame=pd.DataFrame({"col1":[1,5,9,4,7,3],"col2":[5,8,7,9,3,4],"col3":[3,4,2,7,9,1],
"col4":[2,4,7,4,9,0],"col5":[3,4,5,2,1,1],"col6":[8,7,5,4,1,2]})
it results in the following output:
col1 col2 col3 col4 col5 col6
0 1 5 3 2 3 8
1 5 8 4 4 4 7
2 9 7 2 7 5 5
3 4 9 7 4 2 4
4 7 3 9 9 1 1
5 3 4 1 0 1 2
I want to create a new dataframe containing the differences between col1 and col2, col3 and col4, and col5 and col6.
The expected output looks like this:
col1-col2 col3-col4 col5-col6
0 -4 1 -5
1 -3 0 -3
2 2 -5 0
3 -5 3 -2
4 4 0 0
5 -1 1 -1
Thanks in advance
dfr = pd.DataFrame({'col1-col2': frame.col1 - frame.col2,
'col3-col4': frame.col3 - frame.col4,
'col5-col6': frame.col5 - frame.col6})
If there are many columns, use a general solution: select the columns at even and odd positions, convert them to NumPy arrays, and create a new DataFrame with the constructor:
# pandas 0.24+
arr = frame.iloc[:, ::2].to_numpy() - frame.iloc[:, 1::2].to_numpy()
# pandas below 0.24
#arr = frame.iloc[:, ::2].values - frame.iloc[:, 1::2].values
c = [f'{a}-{b}' for a, b in zip(frame.columns[::2], frame.columns[1::2])]
df = pd.DataFrame(arr, columns=c)
print (df)
col1-col2 col3-col4 col5-col6
0 -4 1 -5
1 -3 0 -3
2 2 -5 0
3 -5 3 -2
4 4 0 0
5 -1 1 -1
If performance is important, convert to a NumPy array first, store it in a variable, and then index that array:
# pandas 0.24+
arr = frame.to_numpy()
# pandas below 0.24
#arr = frame.values
c = [f'{a}-{b}' for a, b in zip(frame.columns[::2], frame.columns[1::2])]
df = pd.DataFrame(arr[:, ::2] - arr[:, 1::2], columns=c)
df = pd.DataFrame(frame.apply(lambda x: [x['col1']-x['col2'],x['col3']-x['col4'],x['col5']-x['col6']],axis=1).tolist())
df.rename({0:'col1-col2',1:'col3-col4',2:'col5-col6'},axis=1)
col1-col2 col3-col4 col5-col6
0 -4 1 -5
1 -3 0 -3
2 2 -5 0
3 -5 3 -2
4 4 0 0
5 -1 1 -1
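The strided-slicing idea from the answers above can be wrapped in a small helper. This is only a sketch: the function name pairwise_diff is made up for illustration, and it assumes the columns are already ordered as consecutive pairs.

```python
import pandas as pd

def pairwise_diff(frame: pd.DataFrame) -> pd.DataFrame:
    """Subtract the second column of each consecutive pair from the first."""
    if frame.shape[1] % 2 != 0:
        raise ValueError("expected an even number of columns")
    arr = frame.to_numpy()  # convert once, then slice the array
    names = [f"{a}-{b}" for a, b in zip(frame.columns[::2], frame.columns[1::2])]
    return pd.DataFrame(arr[:, ::2] - arr[:, 1::2], columns=names, index=frame.index)

frame = pd.DataFrame({"col1": [1, 5, 9, 4, 7, 3], "col2": [5, 8, 7, 9, 3, 4],
                      "col3": [3, 4, 2, 7, 9, 1], "col4": [2, 4, 7, 4, 9, 0],
                      "col5": [3, 4, 5, 2, 1, 1], "col6": [8, 7, 5, 4, 1, 2]})
result = pairwise_diff(frame)
print(result)
```

Passing index=frame.index keeps the original row labels, which matters if the frame does not use the default integer index.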
I have an array of numbers that correspond to the row numbers that need to be selected from a DataFrame.
For example, arr = np.array([0,0,1,1]) and the DataFrame is shown below. arr holds row positions, not index labels.
Index A B C D
3 10 0 0 0
4 5 2 0 0
Using arr I would like to produce a DataFrame that looks like this
Index A B C D
3 10 0 0 0
3 10 0 0 0
4 5 2 0 0
4 5 2 0 0
You can use iloc with integer indexing:
df.iloc[[0,0,1,1], :] # or df.iloc[arr, :]
# A B C D
#Index
#3 10 0 0 0
#3 10 0 0 0
#4 5 2 0 0
#4 5 2 0 0
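To make the position-versus-label distinction concrete, here is a short sketch that rebuilds the frame from the question (the construction is an assumption based on the printed table). It shows three ways that should give the same rows: iloc with positions, loc after translating positions to labels, and take.

```python
import numpy as np
import pandas as pd

# Reconstruction of the example frame; the index labels are 3 and 4.
df = pd.DataFrame(
    {"A": [10, 5], "B": [0, 2], "C": [0, 0], "D": [0, 0]},
    index=pd.Index([3, 4], name="Index"),
)
arr = np.array([0, 0, 1, 1])  # row positions, not labels

by_position = df.iloc[arr]        # positional selection
by_label = df.loc[df.index[arr]]  # map positions to labels, then select by label
taken = df.take(arr)              # positional selection via take

print(by_position)
```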
I want to start with an empty data frame and then add to it one row each time.
I could also start with a data frame of zeros, data=pd.DataFrame(np.zeros(shape=(10,2)),columns=["a","b"]), and then replace one row each time.
How can I do that?
Use .loc for label-based selection. It is important that you understand how to slice properly: http://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-label. You should also understand why chained assignment is best avoided: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
In [14]:
data=pd.DataFrame(np.zeros(shape=(10,2)),columns=["a","b"])
data
Out[14]:
a b
0 0 0
1 0 0
2 0 0
3 0 0
4 0 0
5 0 0
6 0 0
7 0 0
8 0 0
9 0 0
[10 rows x 2 columns]
In [15]:
data.loc[2:2,'a':'b']=5,6
data
Out[15]:
a b
0 0 0
1 0 0
2 5 6
3 0 0
4 0 0
5 0 0
6 0 0
7 0 0
8 0 0
9 0 0
[10 rows x 2 columns]
If you are replacing the entire row, you can pass just the index label and do not need row and column slices.
...
data.loc[2]=5,6
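The question also asks about starting from an empty frame. Below is a minimal sketch of two options, using made-up row values: enlarging the frame through .loc with a new label, and collecting rows in a list to build the frame once. The second is usually the cheaper choice when many rows are added, since each enlargement has to reallocate data.

```python
import pandas as pd

new_rows = [(5, 6), (7, 8), (9, 10)]  # hypothetical values to add

# Option 1: start empty and enlarge one row at a time with .loc.
data = pd.DataFrame(columns=["a", "b"])
for a, b in new_rows:
    data.loc[len(data)] = [a, b]  # assigning to a new label appends a row

# Option 2: collect plain dicts and build the DataFrame in one step.
records = []
for a, b in new_rows:
    records.append({"a": a, "b": b})
built = pd.DataFrame(records, columns=["a", "b"])

print(data)
print(built)
```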