I am trying to create a column (is_max) that has either 1 if a column B is the maximum in a group of values of column A or 0 if it is not.
Example:
[Input]
A B
1 2
2 3
1 4
2 5
[Output]
A B is_max
1 2 0
2 5 0
1 4 1
2 3 0
What I'm trying:
df['is_max'] = 0
df.loc[df.reset_index().groupby('A')['B'].idxmax(),'is_max'] = 1
Fix your code by remove the reset_index
df['is_max'] = 0
df.loc[df.groupby('A')['B'].idxmax(),'is_max'] = 1
df
Out[39]:
A B is_max
0 1 2 0
1 2 3 0
2 1 4 1
3 2 5 1
I make assumption A is your group now that you did not state
df['is_max']=(df['B']==df.groupby('A')['B'].transform('max')).astype(int)
or
df1.groupby('A')['B'].apply(lambda x: x==x.max()).astype(int)
In python dataframe, I have a data frame like this
index
column A
0
a
1
a
2
b
3
c
4
c
5
c
6
c
I want to create a column that will set index based on the same column's value
index
column A
setIndex
0
a
0
1
a
1
2
b
0
3
c
0
4
c
1
5
c
2
6
c
3
You can use .groupby() + .cumcount(), as follows:
df['setIndex'] = df.groupby('column A').cumcount()
Result:
print(df)
column A setIndex
0 a 0
1 a 1
2 b 0
3 c 0
4 c 1
5 c 2
6 c 3
I have a matrix with 0s and 1s, and want to do a cumsum on each column that resets to 0 whenever a zero is observed. For example, if we have the following:
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]],columns = ['a','b'])
print(df)
a b
0 0 1
1 1 1
2 0 1
3 1 0
4 1 1
5 0 1
The result I desire is:
print(df)
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 2 1
5 0 2
However, when I try df.cumsum() * df, I am able to correctly identify the 0 elements, but the counter does not reset:
print(df.cumsum() * df)
a b
0 0 1
1 1 2
2 0 3
3 2 0
4 3 4
5 0 5
You can use:
a = df != 0
df1 = a.cumsum()-a.cumsum().where(~a).ffill().fillna(0).astype(int)
print (df1)
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 2 1
5 0 2
Try this
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]],columns = ['a','b'])
df['groupId1']=df.a.eq(0).cumsum()
df['groupId2']=df.b.eq(0).cumsum()
New=pd.DataFrame()
New['a']=df.groupby('groupId1').a.transform('cumsum')
New['b']=df.groupby('groupId2').b.transform('cumsum')
New
Out[1184]:
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 2 1
5 0 2
You may also try the following naive but reliable approach.
Per every column - create groups to count within. Group starts once sequential value difference by row appears and lasts while value is being constant: (x != x.shift()).cumsum().
Example:
a b
0 1 1
1 2 1
2 3 1
3 4 2
4 4 3
5 5 3
Calculate cummulative sums within groups per columns using pd.DataFrame's apply and groupby methods and you get cumsum with the zero reset in one line:
import pandas as pd
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]], columns = ['a','b'])
cs = df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumsum())
print(cs)
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 2 1
5 0 2
A slightly hacky way would be to identify the indices of the zeros and set the corresponding values to the negative of those indices before doing the cumsum:
import pandas as pd
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]],columns = ['a','b'])
z = np.where(df['b']==0)
df['b'][z[0]] = -z[0]
df['b'] = np.cumsum(df['b'])
df
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 1 1
5 0 2
I've dataframe which is group by y column and sorted on their count column of y column.
Code:
df['count'] = df.groupby(['y'])['y'].transform(pd.Series.value_counts)
df = df.sort('count', ascending=False)
Output:
x y count
1 a 4
3 a 4
2 a 4
1 a 4
2 c 3
1 c 3
2 c 3
2 b 2
1 b 2
Now, I want to sort x column on its frequency having same values grouped on y column like below:
Expected Output:
x y count
1 a 4
1 a 4
2 a 4
3 a 4
2 c 3
2 c 3
1 c 3
2 b 2
1 b 2
It seems you need groupby and value_counts and then numpy.repeat for expand index values by their counts to DataFrame:
s = df.groupby('y', sort=False)['x'].value_counts()
#alternative
#s = df.groupby('y', sort=False)['x'].apply(pd.Series.value_counts)
print (s)
y x
a 1 2
2 1
3 1
c 2 2
1 1
b 1 1
2 1
Name: x, dtype: int64
df1 = pd.DataFrame(np.repeat(s.index.values, s.values).tolist(), columns=['y','x'])
#change order of columns
df1 = df1.reindex_axis(['x','y'], axis=1)
print (df1)
x y
0 1 a
1 1 a
2 2 a
3 3 a
4 2 c
5 2 c
6 1 c
7 1 b
8 2 b
If you are using an older version where df.sort_values is not supported. you can use:
df.sort(columns=['count','x'], ascending=[False,True])
I have an array with numbers which corresponds to the row numbers that need to be selected from a DataFrame.
For example, arr = np.array([0,0,1,1]) and the DataFrame is seen below. arr is the row number and not the index.
Index A B C D
3 10 0 0 0
4 5 2 0 0
Using arr I would like to produce a DataFrame that looks like this
Index A B C D
3 10 0 0 0
3 10 0 0 0
4 5 2 0 0
4 5 2 0 0
You can use iloc with integer indexing:
df.iloc[[0,0,1,1], :] # or df.iloc[arr, :]
# A B C D
#Index
#3 10 0 0 0
#3 10 0 0 0
#4 5 2 0 0
#4 5 2 0 0