For the following dataframe
df = pd.DataFrame({"a": [1, 0, 11], "b": [7, 0, 0], "c": [0,10,0], "d": [1,0,0],
"e": [0,0,0], "name":["b","c","a"]})
print(df)
a b c d e name
0 1 7 0 1 0 b
1 0 0 10 0 0 c
2 11 0 0 0 0 a
I would like to get one row back that comprises the maximum values of each column plus the name of that column.
E.g. in this case:
a b c d e name
11 7 10 1 0 a
How can this be performed?
First get the maximum values into a one-row DataFrame with max, to_frame and a transpose by T, then get the name of the maximum value per row with idxmax:
a = df.max().to_frame().T
a.loc[0, 'name'] = df.set_index('name').max(axis=1).idxmax()
print (a)
a b c d e name
0 11 7 10 1 0 a
Detail:
print (df.set_index('name').max(axis=1))
name
b 7
c 10
a 11
dtype: int64
print (df.set_index('name').max(axis=1).idxmax())
a
Use df.max(), create a DataFrame from it and transpose:
pd.DataFrame(df.max()).T
a b c d e name
0 11 7 10 1 0 c
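Note that df.max() also takes the maximum of the name column alphabetically ('c' here), which is not the name of the row holding the largest value. A minimal sketch that patches the name column afterwards, assuming the same df as above:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 0, 11], "b": [7, 0, 0], "c": [0, 10, 0],
                   "d": [1, 0, 0], "e": [0, 0, 0], "name": ["b", "c", "a"]})

# Column-wise maxima for the numeric columns only
out = df.drop(columns="name").max().to_frame().T

# Fill "name" with the name of the row holding the overall largest value
out["name"] = df.set_index("name").max(axis=1).idxmax()
print(out)
```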
Related
Say I have such Pandas dataframe
df = pd.DataFrame({
'a': [4, 5, 3, 1, 2],
'b': [20, 10, 40, 50, 30],
'c': [25, 20, 5, 15, 10]
})
so df looks like:
print(df)
a b c
0 4 20 25
1 5 10 20
2 3 40 5
3 1 50 15
4 2 30 10
And I want to get the column name of the 2nd largest value in each row. Borrowing the answer from Felex Le in this thread, I can now get the 2nd largest value by:
def second_largest(row):
    return row.nlargest(2).min()
print(df.apply(second_largest, axis=1))
which gives me:
0 20
1 10
2 5
3 15
4 10
dtype: int64
But what I really want is the column names for those values, or to say:
0 b
1 b
2 c
3 c
4 c
Pandas has a function idxmax which can do the job for the largest value:
df.idxmax(axis = 1)
0 c
1 c
2 b
3 b
4 b
dtype: object
Is there any elegant way to do the same job but for the 2nd largest value?
Use numpy.argsort for positions of second largest values:
df['new'] = df.columns.to_numpy()[np.argsort(df.to_numpy())[:, -2]]
print(df)
a b c new
0 4 20 25 b
1 5 10 20 b
2 3 40 5 c
3 1 50 15 c
4 2 30 10 c
Your solution should work, but it is slow:
def second_largest(row):
    return row.nlargest(2).idxmin()
print(df.apply(second_largest, axis = 1))
If efficiency is important, numpy.argpartition is quite efficient:
N = 2
cols = df.columns.to_numpy()
pd.Series(cols[np.argpartition(df.to_numpy().T, -N, axis=0)[-N]], index=df.index)
If you want a pure pandas (less efficient):
out = df.stack().groupby(level=0).apply(lambda s: s.nlargest(2).index[-1][1])
Output:
0 b
1 b
2 c
3 c
4 c
dtype: object
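The argsort idea generalizes to the N-th largest for any N; a minimal sketch with a hypothetical helper, assuming all columns are numeric:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [4, 5, 3, 1, 2],
                   'b': [20, 10, 40, 50, 30],
                   'c': [25, 20, 5, 15, 10]})

def nth_largest_col(frame, n):
    # argsort ascending along each row; position -n holds the n-th largest
    order = np.argsort(frame.to_numpy(), axis=1)
    return pd.Series(frame.columns.to_numpy()[order[:, -n]], index=frame.index)

print(nth_largest_col(df, 2))
```

With n=1 this reproduces idxmax(axis=1).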
I have a pandas DataFrame like below:
df = pd.DataFrame({"type": ["A", "B", "C"],
"A": [0, 0, 12],
"B": [1, 3, 0],
"C": [0, 1, 1]}
)
I want to transform this to a DataFrame that is N X 2, where I concatenate the column and type values with " - " as delimiter. The output should look like this:
pair value
A - A 0
A - B 0
A - C 12
B - A 1
B - B 3
B - C 0
C - A 0
C - B 1
C - C 1
I don't know if there is a name for what I want to accomplish (I thought about pivoting, but I believe that is something else), so googling didn't help me find a solution. How can this be done efficiently?
First set the index to type, then unstack and convert the result to a DataFrame.
Try:
x = df.set_index('type').unstack().to_frame('value')
x.index = x.index.map(' - '.join)
res = x.rename_axis('pair').reset_index()
res:
pair value
0 A - A 0
1 A - B 0
2 A - C 12
3 B - A 1
4 B - B 3
5 B - C 0
6 C - A 0
7 C - B 1
8 C - C 1
First melt on the type column, then join the variable and type columns with a hyphen -, and keep only the required columns:
>>> out = df.melt(id_vars='type')
>>> out.assign(pair=out['variable']+'-'+out['type'])[['pair', 'value']]
pair value
0 A-A 0
1 A-B 0
2 A-C 12
3 B-A 1
4 B-B 3
5 B-C 0
6 C-A 0
7 C-B 1
8 C-C 1
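Note the question asked for " - " (with surrounding spaces) as the delimiter; the same melt approach with that separator, as a sketch:

```python
import pandas as pd

df = pd.DataFrame({"type": ["A", "B", "C"],
                   "A": [0, 0, 12],
                   "B": [1, 3, 0],
                   "C": [0, 1, 1]})

# melt stacks the value columns; "variable" holds the original column name
out = df.melt(id_vars="type")
out["pair"] = out["variable"] + " - " + out["type"]
out = out[["pair", "value"]]
print(out)
```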
So my dataframe has multiple columns; one of them, named "multiple", contains boolean values, only 1s and 0s. Now I want to replicate all the rows 4 times, but only for df.loc[df.multiple==1]. How can I do that? (I don't want to replicate indexes.)
example input:
df=
index strings multiple
0 A 0
1 B 1
2 C 1
3 D 0
4 E 1
Expected output:
index strings multiple
0 A 0
1 B 1
2 B 1
3 B 1
4 B 1
5 B 1
6 C 1
7 C 1
8 C 1
9 C 1
10 C 1
11 D 0
12 E 1
13 E 1
14 E 1
15 E 1
16 E 1
Here is another alternative, based on @Vinzent's answer.
It uses the same approach to construct the repeats, but doesn't require reconstructing the full dataframe; it is based on indexing instead. This solution is ~30% faster on the provided dataset and on larger ones.
df.loc[np.repeat(df.multiple, df.multiple.values*4+1).index].reset_index(drop=True)
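A self-contained variant of the same indexing idea using Index.repeat, which also keeps the original dtypes (np.repeat on df.values upcasts everything to object):

```python
import pandas as pd

df = pd.DataFrame({"strings": list("ABCDE"),
                   "multiple": [0, 1, 1, 0, 1]})

# Repeat each index label once, or five times where multiple == 1,
# select by label, then rebuild a clean RangeIndex
out = df.loc[df.index.repeat(df["multiple"] * 4 + 1)].reset_index(drop=True)
print(out)
```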
This is what numpy.repeat is for:
import pandas as pd
import numpy as np
df = pd.DataFrame([['A', 0],
['B', 1],
['C', 1],
['D', 0],
['E', 1]],
columns=['strings', 'multiple'])
df = pd.DataFrame(np.repeat(df.values, df['multiple']*4+1, axis=0), columns=df.columns)
print(df)
# strings multiple
# 0 A 0
# 1 B 1
# 2 B 1
# 3 B 1
# 4 B 1
# 5 B 1
# 6 C 1
# 7 C 1
# 8 C 1
# 9 C 1
# 10 C 1
# 11 D 0
# 12 E 1
# 13 E 1
# 14 E 1
# 15 E 1
# 16 E 1
You can do it with pandas:
(df.groupby('multiple')
.apply(lambda x: pd.concat([x]*5) if x.name else x)
.droplevel(level=0)
.sort_index()
.reset_index(drop=True)
)
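The same idea can be spelled out without apply, as a sketch: keep the unflagged rows once, concatenate five copies of the flagged rows (the original plus four replicas), and restore the original order via the index:

```python
import pandas as pd

df = pd.DataFrame({"strings": list("ABCDE"),
                   "multiple": [0, 1, 1, 0, 1]})

# Five copies of the flagged rows, one copy of the rest;
# sort_index restores the original row order
flagged = df[df["multiple"] == 1]
out = (pd.concat([df[df["multiple"] == 0]] + [flagged] * 5)
         .sort_index()
         .reset_index(drop=True))
print(out)
```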
For a given DataFrame, sorted by b and index reset:
df = pd.DataFrame({'a': list('abcdef'),
'b': [0, 2, 7, 3, 9, 15]}
).sort_values('b').reset_index(drop=True)
a b
0 a 0
1 b 2
2 d 3
3 c 7
4 e 9
5 f 15
and a list, v
v = list('adf')
I would like to pull out just the rows in v and the following row (if there is one), similar to grep -A1:
a b
0 a 0
1 b 2
2 d 3
3 c 7
5 f 15
I can do this by concatenating the index from isin and the index from isin plus one, like so:
df[df.index.isin(
np.concatenate(
(df[df['a'].isin(v)].index,
df[df['a'].isin(v)].index + 1)))]
But this is long and not too easy to understand. Is there a better way?
You can combine the isin condition with its shift (the next row) to create the boolean mask you need:
df[df.a.isin(v).pipe(lambda x: x | x.shift())]
# a b
#0 a 0
#1 b 2
#2 d 3
#3 c 7
#5 f 15
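The shift trick extends to grep -An for any n by OR-ing together n shifted masks; a sketch with a hypothetical helper:

```python
import pandas as pd

df = pd.DataFrame({'a': list('abcdef'),
                   'b': [0, 2, 7, 3, 9, 15]}).sort_values('b').reset_index(drop=True)
v = list('adf')

def grep_after(frame, col, values, n=1):
    # Keep matching rows plus the n rows that follow each match
    mask = frame[col].isin(values)
    for k in range(1, n + 1):
        mask = mask | frame[col].isin(values).shift(k, fill_value=False)
    return frame[mask]

print(grep_after(df, 'a', v, n=1))
```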
I have a dataframe as follows:
data
0 a
1 a
2 a
3 a
4 a
5 b
6 b
7 b
8 b
9 b
I want to group the repeating values of a and b into a single row element as follows:
data
0 a
a
a
a
a
1 b
b
b
b
b
How do I go about doing this? I tried the following but it puts each repeating value in its own column
df.groupby('data')
This looks like a pivot problem, but the columns (created by cumcount) and the index (created by factorize) are missing, which makes it harder to see:
pd.crosstab(pd.factorize(df.data)[0],df.groupby('data').cumcount(),df.data,aggfunc='sum')
Out[358]:
col_0 0 1 2 3 4
row_0
0 a a a a a
1 b b b b b
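The missing index and columns the answer refers to can also be built explicitly and passed to pivot instead of crosstab; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'data': list('aaaaabbbbb')})

# row: group id from factorize; col: running count within each group
out = (df.assign(row=pd.factorize(df["data"])[0],
                 col=df.groupby("data").cumcount())
         .pivot(index="row", columns="col", values="data"))
print(out)
```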
Something like
index = (df['data'] != df['data'].shift()).cumsum() - 1
index.name = None
df = df.set_index(index)
data
0 a
0 a
0 a
0 a
0 a
1 b
1 b
1 b
1 b
1 b
You can use pd.factorize followed by set_index:
df = df.assign(key=pd.factorize(df['data'], sort=False)[0]).set_index('key')
print(df)
data
key
0 a
0 a
0 a
0 a
0 a
1 b
1 b
1 b
1 b
1 b