Add incrementing number to certain columns of a pandas DataFrame [duplicate] - python

This question already has answers here:
Grouping and auto increment based on columns in pandas
(1 answer)
How to use groupby and cumcount on unique names in a Pandas column
(2 answers)
Closed 3 years ago.
Say I have a df as follows:
import pandas as pd

df = pd.DataFrame({'val': [30, 40, 50, 60, 70, 80, 90], 'idx': [9, 8, 7, 6, 5, 4, 3],
                   'category': ['a', 'a', 'b', 'b', 'c', 'c', 'c']}).set_index('idx')
Output:
val category
idx
9 30 a
8 40 a
7 50 b
6 60 b
5 70 c
4 80 c
3 90 c
I want to add an incrementing number, from 1 to the total number of rows, for each 'category'. The new column should look like this:
category incrNbr val
idx
3 a 1 30
4 a 2 40
5 b 1 50
6 b 2 60
7 c 1 70
8 c 2 80
9 c 3 90
Currently I loop through each category like this:
li = []
for index, row in df.iterrows():
    cat = row['category']
    if cat not in li:
        li.append(cat)
        temp = df.loc[df['category'] == row['category']][['val']]
        temp.insert(0, 'incrNbr', range(1, 1 + len(temp)))
        del temp['val']
        df = df.combine_first(temp)
It is very slow.
Is there a way to do this using vectorized operations?

If your category column is sorted, we can use GroupBy.cumcount:
df['incrNbr'] = df.groupby('category')['category'].cumcount().add(1)
val category incrNbr
idx
9 30 a 1
8 40 a 2
7 50 b 1
6 60 b 2
5 70 c 1
4 80 c 2
3 90 c 3
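For completeness, a minimal end-to-end sketch of the same idea (the column reorder at the end is only there to match the layout asked for in the question):
import pandas as pd

df = pd.DataFrame({'val': [30, 40, 50, 60, 70, 80, 90], 'idx': [9, 8, 7, 6, 5, 4, 3],
                   'category': ['a', 'a', 'b', 'b', 'c', 'c', 'c']}).set_index('idx')

# Per-category running counter, starting at 1.
df['incrNbr'] = df.groupby('category').cumcount().add(1)
df = df[['category', 'incrNbr', 'val']]
print(df)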

Related

Get the column names for 2nd largest value for each row in a Pandas dataframe

Say I have such Pandas dataframe
df = pd.DataFrame({
    'a': [4, 5, 3, 1, 2],
    'b': [20, 10, 40, 50, 30],
    'c': [25, 20, 5, 15, 10]
})
so df looks like:
print(df)
a b c
0 4 20 25
1 5 10 20
2 3 40 5
3 1 50 15
4 2 30 10
And I want to get the column name of the 2nd largest value in each row. Borrowing the answer from Felex Le in this thread, I can now get the 2nd largest value by:
def second_largest(l = []):
    return l.nlargest(2).min()

print(df.apply(second_largest, axis = 1))
which gives me:
0 20
1 10
2 5
3 15
4 10
dtype: int64
But what I really want is the column names for those values, or to say:
0 b
1 b
2 c
3 c
4 c
Pandas has a function idxmax which can do the job for the largest value:
df.idxmax(axis = 1)
0 c
1 c
2 b
3 b
4 b
dtype: object
Is there any elegant way to do the same job but for the 2nd largest value?
Use numpy.argsort for positions of second largest values:
import numpy as np

df['new'] = df.columns.to_numpy()[np.argsort(df.to_numpy())[:, -2]]
print(df)
a b c new
0 4 20 25 b
1 5 10 20 b
2 3 40 5 c
3 1 50 15 c
4 2 30 10 c
Your solution should work, but it is slow:
def second_largest(l = []):
    return l.nlargest(2).idxmin()

print(df.apply(second_largest, axis = 1))
If efficiency is important, numpy.argpartition is quite efficient:
N = 2
cols = df.columns.to_numpy()
pd.Series(cols[np.argpartition(df.to_numpy().T, -N, axis=0)[-N]], index=df.index)
If you want a pure pandas (less efficient):
out = df.stack().groupby(level=0).apply(lambda s: s.nlargest(2).index[-1][1])
Output:
0 b
1 b
2 c
3 c
4 c
dtype: object
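If you ever need the k-th largest column rather than specifically the second, the argsort answer above generalizes directly; a small sketch (k and the 'kth' column name are just illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [4, 5, 3, 1, 2],
                   'b': [20, 10, 40, 50, 30],
                   'c': [25, 20, 5, 15, 10]})

k = 2  # 1 = largest, 2 = second largest, ...
order = np.argsort(df.to_numpy(), axis=1)        # ascending positions per row
df['kth'] = df.columns.to_numpy()[order[:, -k]]  # column name of the k-th largest value
print(df)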

Delete rows in apply() function or depending on apply() result

Here I have a working solution, but my question focuses on how to do this the Pandas way. I assume Pandas offers better solutions for this.
I use groupby() and then apply(axis=1) to compare the values in the rows of the groups, and while doing this I decide which row to delete.
The rule doesn't matter! In this example the rule is that when values in column A differ only by 1 (the values are "near"), the second one is deleted. How the decision is made is not part of the question. There could also be a list of color names, and I could say that darkblue and marineblue are "near" and one of them should be deleted.
The initial data frame is this:
X A B
0 A 9 0 <--- DELETE
1 A 14 1
2 A 8 2
3 A 1 3
4 A 18 4
5 B 10 5
6 B 20 6
7 B 11 7 <--- DELETE
8 B 30 8
Row index 0 should be deleted because its value 9 is near the value 8 in row index 2. The same with row index 7: its value 11 is "near" 10 in row index 5.
This is the code:
#!/usr/bin/env python3
import pandas as pd

df = pd.DataFrame(
    {
        'X': list('AAAAABBBB'),
        'A': [9, 14, 8, 1, 18, 10, 20, 11, 30],
        'B': range(9)
    }
)
print(df)

def mark_near_neighbors(group):
    # I snip the decision process here.
    # Delete 9 because it is "near" 8.
    default_result = pd.Series(
        data=[False] * len(group),
        index=['Delete'] * len(group)
    )
    if group.X.iloc[0] == 'A':
        # the 9
        default_result.iloc[0] = True
    else:
        # the 11
        default_result.iloc[2] = True
    return default_result

result = df.groupby('X').apply(mark_near_neighbors)
result = result.reset_index(drop=True)
print(result)

df = df.loc[~result]
print(df)
So in the end I use a "boolean indexing thing" to solve this
0 True
1 False
2 False
3 False
4 False
5 False
6 False
7 True
8 False
dtype: bool
But is there a better way to do this?
Initialize the dataframe
df = pd.DataFrame([
    ['A', 9, 0],
    ['A', 14, 1],
    ['A', 8, 2],
    ['A', 1, 3],
    ['B', 18, 4],
    ['B', 10, 5],
    ['B', 20, 6],
    ['B', 11, 7],
    ['B', 30, 8],
], columns=['X', 'A', 'B'])
Sort the dataframe based on A column
df = df.sort_values('A')
Find the difference between values
df["diff" ] =df.groupby('X')['A'].diff()
Select the rows where the difference is not 1
result = df[df["diff"] != 1.0]
Drop the extra column and sort by index to get the initial dataframe
result.drop("diff", axis=1, inplace=True)
result = result.sort_index()
Sample output
X A B
1 A 14 1
2 A 8 2
3 A 1 3
4 B 18 4
5 B 10 5
6 B 20 6
8 B 30 8
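If you prefer, the same sort / diff / filter / restore-order steps above can be chained into a single expression; a sketch under the same assumptions (diff is only a temporary helper column):
result = (df.sort_values('A')
            .assign(diff=lambda d: d.groupby('X')['A'].diff())
            .loc[lambda d: d['diff'] != 1.0]
            .drop(columns='diff')
            .sort_index())
print(result)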
IIUC, you can use numpy broadcasting to compare all values within a group. Keeping everything with apply here as it seems wanted:
import numpy as np

def mark_near_neighbors(group, thresh=1):
    a = group.to_numpy().astype(float)
    idx = np.argsort(a)
    b = a[idx]                                   # group values in ascending order
    d = abs(b - b[:, None])                      # pairwise absolute differences
    d[np.triu_indices(d.shape[0])] = thresh + 1  # ignore self-comparisons and larger sorted values
    # keep a value only if it is more than thresh away from every smaller value in the group,
    # then map the sorted-order mask back to the original row order
    return pd.Series((d > thresh).all(1)[np.argsort(idx)], index=group.index)
out = df[df.groupby('X')['A'].apply(mark_near_neighbors)]
output:
X A B
1 A 14 1
2 A 8 2
3 A 1 3
4 A 18 4
5 B 10 5
6 B 20 6
8 B 30 8

How to get equivalent of pandas melt using groupby + stack?

Recently I have been learning groupby and stack, and I encountered a pandas method called melt. I would like to know how to achieve the same result given by melt using groupby and stack.
Here is the MWE:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': [1, 1, 2, 2, 1],
                   'C': [10, 20, 30, 40, 50],
                   'D': ['X', 'Y', 'X', 'Y', 'Y']})
df1 = pd.melt(df, id_vars='A', value_vars=['B', 'C'], var_name='variable', value_name='value')
print(df1)
A variable value
0 1 B 1
1 1 B 1
2 1 B 2
3 2 B 2
4 2 B 1
5 1 C 10
6 1 C 20
7 1 C 30
8 2 C 40
9 2 C 50
How to get the same result using groupby and stack?
My attempt
df.groupby('A')[['B','C']].count().stack(0).reset_index()
This is not quite correct, and I am looking for suggestions.
I guess you do not need groupby, just stack + sort_values:
result = df[['A', 'B', 'C']].set_index('A').stack().reset_index().sort_values(by='level_1')
result.columns = ['A', 'variable', 'value']
Output
A variable value
0 1 B 1
2 1 B 1
4 1 B 2
6 2 B 2
8 2 B 1
1 1 C 10
3 1 C 20
5 1 C 30
7 2 C 40
9 2 C 50
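A slightly more compact variant of the same stack idea, as a sketch (rename_axis names the stacked level so no manual column renaming is needed; kind='stable' keeps the row order within each variable):
result = (df.set_index('A')[['B', 'C']]
            .rename_axis(columns='variable')
            .stack()
            .rename('value')
            .reset_index()
            .sort_values('variable', kind='stable'))
print(result)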

return new columns by subtracting two columns in a pandas list

It seems so basic, but I can't work out how to achieve the following...
Consider the scenario where I have the following data:
all_columns = ['A','B','C','D']
first_columns = ['A','B']
second_columns = ['C','D']
new_columns = ['E','F']
values = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
df = pd.DataFrame(data = values, columns = all_columns)
df
A B C D
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
3 13 14 15 16
How can I, using this data, subtract, say, column C - column A, then column D - column B, and return the two new columns E and F respectively in my Pandas dataframe? I have multiple columns, so writing the formula one by one is not an option.
I imagine it should be something like this, but Python thinks that I am trying to subtract list names rather than the values in the actual lists...
df[new_columns] = df[second_columns] - df[first_columns]
Expected output:
A B C D E F
0 1 2 3 4 2 2
1 5 6 7 8 2 2
2 9 10 11 12 2 2
3 13 14 15 16 2 2
df['E'] = df['C'] - df['A']
df['F'] = df['D'] - df['B']
Or, alternatively (similar to #rafaelc's comment):
new_cols = ['E', 'F']
second_cols = ['C', 'D']
first_cols = ['A', 'B']
df[new_cols] = df[second_cols] - df[first_cols].values
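The .values (or .to_numpy()) on the right-hand side matters: without it, pandas aligns the subtraction on column labels, so C/D and A/B never line up. A small sketch of the difference, using the same data as above:
import pandas as pd

values = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
df = pd.DataFrame(values, columns=['A', 'B', 'C', 'D'])

# Label-aligned subtraction: the result has columns A, B, C, D filled with NaN.
print(df[['C', 'D']] - df[['A', 'B']])

# Positional subtraction: dropping the labels on the right gives E = C - A, F = D - B.
df[['E', 'F']] = df[['C', 'D']] - df[['A', 'B']].values
print(df)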
As #rafaelc and #Ben.T mentioned, the approach below would be a good fit.
I'm just placing this in the answer section for posterity...
>>> df
A B C D
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
3 13 14 15 16
Result:
>>> df[['E', 'F']] = df[['C', 'D']] - df[['A', 'B']].values
>>> df
A B C D E F
0 1 2 3 4 2 2
1 5 6 7 8 2 2
2 9 10 11 12 2 2
3 13 14 15 16 2 2

Selecting rows from pandas dataframe limited by count per column value

I have a dataframe defined as follows:
df = pd.DataFrame({'id': [11, 12, 13, 14, 21, 22, 31, 32, 33],
                   'class': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
                   'count': [2, 2, 2, 2, 1, 1, 2, 2, 2]})
For each class, I'd like to select the top n rows, where n is specified by the count column. The expected output from the above dataframe would be like this:
id class count
0 11 A 2
1 12 A 2
4 21 B 1
6 31 C 2
7 32 C 2
How can I achieve this?
You could use
In [771]: df.groupby('class').apply(
              lambda x: x.head(x['count'].iloc[0])
          ).reset_index(drop=True)
Out[771]:
id class count
0 11 A 2
1 12 A 2
2 21 B 1
3 31 C 2
4 32 C 2
Use:
(df.groupby('class', as_index=False, group_keys=False)
   .apply(lambda x: x.head(x['count'].iloc[0])))
Output:
id class count
0 11 A 2
1 12 A 2
4 21 B 1
6 31 C 2
7 32 C 2
Using cumcount
df[(df.groupby('class').cumcount()+1).le(df['count'])]
Out[150]:
class count id
0 A 2 11
1 A 2 12
4 B 1 21
6 C 2 31
7 C 2 32
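For reference, the cumcount approach builds a per-group row position and keeps the rows whose position is within that row's own count, which avoids apply entirely; a commented sketch, assuming the df from the question:
import pandas as pd

df = pd.DataFrame({'id': [11, 12, 13, 14, 21, 22, 31, 32, 33],
                   'class': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
                   'count': [2, 2, 2, 2, 1, 1, 2, 2, 2]})

# Position of each row within its class, starting at 1.
rank_in_class = df.groupby('class').cumcount() + 1

# Keep a row only while its position does not exceed that row's count.
out = df[rank_in_class.le(df['count'])]
print(out)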
Here is a solution which groups by class, then looks at the first count value in the smaller dataframe and returns the corresponding rows.
def func(df_):
    count_val = df_['count'].values[0]
    return df_.iloc[0:count_val]
df.groupby('class', group_keys=False).apply(func)
returns
class count id
0 A 2 11
1 A 2 12
4 B 1 21
6 C 2 31
7 C 2 32
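As a side note, group_keys=False in the last two answers keeps the original row labels instead of prepending the class value as an extra index level; a quick sketch of the difference (behaviour as of recent pandas versions):
import pandas as pd

df = pd.DataFrame({'id': [11, 12, 13, 14, 21, 22, 31, 32, 33],
                   'class': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
                   'count': [2, 2, 2, 2, 1, 1, 2, 2, 2]})

def func(df_):
    return df_.head(df_['count'].iloc[0])

# Default: the group key becomes an outer index level, e.g. ('A', 0), ('A', 1), ('B', 4), ...
print(df.groupby('class').apply(func).index)

# group_keys=False: the original integer index 0, 1, 4, 6, 7 is kept as-is.
print(df.groupby('class', group_keys=False).apply(func).index)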
