I have a dataframe that looks like:
Alice Eleonora Mike Helen
2 7 8 6
11 5 9 4
6 15 12 3
5 3 7 8
I want to create a new column that contains, for each row, the name of the column with the max value for that row:
Alice Eleonora Mike Helen _Max
2 7 8 6 Mike
11 5 9 4 Alice
6 15 12 3 Eleonora
5 3 7 8 Helen
I figured out how to get the max value:
df['_Max'] = df[['Alice', 'Eleonora', 'Mike', 'Helen']].max(axis=1)
but how do I get the name of the column with the max value and write it into _Max instead of the value itself?
You can use apply with a lambda to return the name of the column. Here we compare each value row-wise against the row's max; this produces a boolean mask we can use to index the columns:
In [229]:
df['MAX'] = df.apply(lambda x: df.columns[x == x.max()][0], axis=1)
df
Out[229]:
Alice Eleonora Mike Helen MAX
0 2 7 8 6 Mike
1 11 5 9 4 Alice
2 6 15 12 3 Eleonora
3 5 3 7 8 Helen
Here is the boolean mask:
In [232]:
df.apply(lambda x: x == x.max(), axis=1)
Out[232]:
Alice Eleonora Mike Helen
0 False False True False
1 True False False False
2 False True False False
3 False False False True
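As an aside, a more direct route should be DataFrame.idxmax, which returns the label of the first column holding the row maximum, so it matches the apply/mask approach above (ties resolve to the first column, just like taking [0] on the masked columns):
df['MAX'] = df[['Alice', 'Eleonora', 'Mike', 'Helen']].idxmax(axis=1)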
This may be a little confusing, but I have the following dataframe:
exporter assets liabilities
False 5 1
True 10 8
False 3 1
False 24 20
False 40 2
True 12 11
I want to calculate a ratio with this formula: (df['liabilities'].sum() / df['assets'].sum()) * 100
And I expect to create a new column where the values are the ratio, but calculated separately for each boolean group, like this:
exporter assets liabilities ratio
False 5 1 33.3
True 10 8 86.3
False 3 1 33.3
False 24 20 33.3
False 40 2 33.3
True 12 11 86.3
Use DataFrame.groupby on the exporter column and transform the dataframe using sum, then use Series.div to divide liabilities by assets and Series.mul to multiply by 100:
d = df.groupby('exporter').transform('sum')
df['ratio'] = d['liabilities'].div(d['assets']).mul(100).round(2)
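For clarity, the intermediate d holds each group's sums broadcast back to every row, so each row is divided by its own group's totals (values below computed from the sample data):
print(d)
   assets  liabilities
0      72           24
1      22           19
2      72           24
3      72           24
4      72           24
5      22           19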
Result:
print(df)
exporter assets liabilities ratio
0 False 5 1 33.33
1 True 10 8 86.36
2 False 3 1 33.33
3 False 24 20 33.33
4 False 40 2 33.33
5 True 12 11 86.36
My input dataframe:
   MinA  MinB  MaxA  MaxB
0   1.0   2.0   5.0   7.0
1   1.0   0.0   8.0   6.0
2   2.0   NaN  15.0  15.0
3   NaN   3.0   NaN   NaN
4   NaN   NaN   NaN  10.0
I want to merge the "min" and "max" columns amongst themselves, with the A columns taking priority over the B columns.
If both columns are null, default values should be used: 0 for Min and 100 for Max.
Desired output is:
   MinA  MinB  MaxA  MaxB  Min  Max
0   1.0   2.0   5.0   7.0    1    5
1   1.0   0.0   8.0   6.0    1    8
2   2.0   NaN  15.0  15.0    2   15
3   NaN   3.0   NaN   NaN    3  100
4   NaN   NaN   NaN  10.0    0   10
Could you please help me with this?
This can be accomplished using mask. With your data that would look like the following:
import pandas as pd

df = pd.DataFrame({
    'MinA': [1, 1, 2, None, None],
    'MinB': [2, 0, None, 3, None],
    'MaxA': [5, 8, 15, None, None],
    'MaxB': [7, 6, 15, None, 10],
})
# Create the new column using A as the base; where it is NaN, fall back to B,
# then fall back again to the default value. Series.mask accepts a callable
# cond, so pd.isna is applied to the series being masked at each step.
df['Min'] = df['MinA'].mask(pd.isna, df['MinB']).mask(pd.isna, 0)
df['Max'] = df['MaxA'].mask(pd.isna, df['MaxB']).mask(pd.isna, 100)
The above would result in the desired output:
MinA MinB MaxA MaxB Min Max
0 1 2 5 7 1 5
1 1 0 8 6 1 8
2 2 NaN 15 15 2 15
3 NaN 3 NaN NaN 3 100
4 NaN NaN NaN 10 0 10
Just using fillna() will be fine:
df['Min'] = df['MinA'].fillna(df['MinB']).fillna(0)
df['Max'] = df['MaxA'].fillna(df['MaxB']).fillna(100)
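An equivalent spelling, if you want the fallback to read explicitly, is Series.combine_first, which takes values from the other series wherever the first is null:
df['Min'] = df['MinA'].combine_first(df['MinB']).fillna(0)
df['Max'] = df['MaxA'].combine_first(df['MaxB']).fillna(100)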
So I have a df like this:
NAME TRY SCORE
Bob 1st 3
Sue 1st 7
Tom 1st 3
Max 1st 8
Jay 1st 4
Mel 1st 7
Bob 2nd 4
Sue 2nd 2
Tom 2nd 6
Max 2nd 4
Jay 2nd 7
Mel 2nd 8
Bob 3rd 3
Sue 3rd 5
Tom 3rd 6
Max 3rd 3
Jay 3rd 4
Mel 3rd 6
I want to count how many times each person scored more than 5, and put the result into a new df2 that looks like this:
NAME COUNT
Bob 0
Sue 1
Tom 2
Max 1
Jay 1
Mel 3
My attempts have been many; here is the latest:
df2 = df.groupby('NAME')[['SCORE'] > 5].count().reset_index(name="count")
Just use groupby and sum:
df.assign(SCORE=df.SCORE.gt(5)).groupby('NAME')['SCORE'].sum().astype(int).reset_index()
Out[524]:
NAME SCORE
0 Bob 0
1 Jay 1
2 Max 1
3 Mel 3
4 Sue 1
5 Tom 2
Or use set_index with sum (note that sum(level=0) has been removed in newer pandas; use groupby(level=0).sum() instead):
df.set_index('NAME').SCORE.gt(5).sum(level=0).astype(int)
First create a boolean mask, then aggregate by sum; True values are processed as 1:
df2 = (df['SCORE'] > 5).groupby(df['NAME']).sum().astype(int).reset_index(name="count")
print (df2)
NAME count
0 Bob 0
1 Jay 1
2 Max 1
3 Mel 3
4 Sue 1
5 Tom 2
Detail:
print (df['SCORE'] > 5)
0 False
1 True
2 False
3 True
4 False
5 True
6 False
7 False
8 True
9 False
10 True
11 True
12 False
13 False
14 True
15 False
16 False
17 True
Name: SCORE, dtype: bool
One way to do this is to write a custom groupby aggregation that takes the scores of each group and counts how many are greater than 5, like this:
df.groupby('NAME')['SCORE'].agg(lambda x: (x > 5).sum())
NAME
Bob 0
Jay 1
Max 1
Mel 3
Sue 1
Tom 2
Name: SCORE, dtype: int64
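To get the exact two-column df2 the question asks for, you could tack reset_index onto this (the COUNT column name is taken from the desired output in the question):
df2 = df.groupby('NAME')['SCORE'].agg(lambda x: (x > 5).sum()).reset_index(name='COUNT')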
If you want counts as a dictionary, you can use collections.Counter:
from collections import Counter
c = Counter(df.loc[df['SCORE'] > 5, 'NAME'])
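With the sample data the counter holds only the names that actually scored above 5, which is why the map/fillna step below is needed to put Bob back in with a zero:
print(c)
Counter({'Mel': 3, 'Tom': 2, 'Sue': 1, 'Max': 1, 'Jay': 1})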
For a dataframe you can map counts from unique names:
res = pd.DataFrame({'NAME': df['NAME'].unique(), 'COUNT': 0})
res['COUNT'] = res['NAME'].map(c).fillna(0).astype(int)
print(res)
COUNT NAME
0 0 Bob
1 1 Sue
2 2 Tom
3 1 Max
4 1 Jay
5 3 Mel
Filter the dataframe first, then groupby with aggregation, and reindex to fill in the names that never scored above 5:
df[df['SCORE'] > 5].groupby('NAME')['SCORE'].size()\
    .reindex(df['NAME'].unique(), fill_value=0)
Output:
NAME
Bob 0
Sue 1
Tom 2
Max 1
Jay 1
Mel 3
Name: SCORE, dtype: int64
I have a pandas DataFrame where each observation (row) represents a person.
I want to assign every person who satisfies a particular condition to different groups. I need this because my final aim is to create a network and link the persons in the same groups with some probability depending on the group.
So, for instance, I want to assign all children aged between 6 and 10 to schools. Then in the end I will create links between the children in the same school with a particular probability p.
I know the size distribution of the schools in the area I want to simulate.
So I want to draw school sizes from this distribution and then "fill up" the schools with all the children aged from 6 to 10.
I am new to pandas: the way I was thinking of doing this was to create a new column, fill it with NaN, and then just assign a school ID to the different students.
Let's say my DataFrame df is this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'ID': range(11), 'AGE': [15, 6, 54, 8, 10, 39, 2, 7, 9, 10, 6]})
df
Out[1]:
AGE ID
0 15 0
1 6 1
2 54 2
3 8 3
4 10 4
5 39 5
6 2 6
7 7 7
8 9 8
9 10 9
10 6 10
(Incidentally, I don't know how to put the ID column first, but anyway in real life I'm reading the dataframe from a CSV file so that's not a problem).
Now, what I'd like to do is create another column, SCHOOL_ID, initialize it to NaN, and just assign values to those who are the right age.
What I have succeeded in doing so far is to create a subset of the DataFrame with the persons who satisfy the age condition.
df['IN_ELEM_SCH'] = np.where((df['AGE'] > 5) & (df['AGE'] < 11), True, False)
df
Out[2]:
AGE ID IN_ELEM_SCH
0 15 0 False
1 6 1 True
2 54 2 False
3 8 3 True
4 10 4 True
5 39 5 False
6 2 6 False
7 7 7 True
8 9 8 True
9 10 9 True
10 6 10 True
Then, I would need to add another column, SCHOOL_ID, that contains the ID of the particular elementary school each student attends.
I can initialize the new column with:
df["ELEM_SCHOOL_ID"] = np.nan
df
Out[84]:
AGE ID IN_ELEM_SCH SCHOOL_ID
0 15 0 False NaN
1 6 1 True NaN
2 54 2 False NaN
3 8 3 True NaN
4 10 4 True NaN
5 39 5 False NaN
6 2 6 False NaN
7 7 7 True NaN
8 9 8 True NaN
9 10 9 True NaN
10 6 10 True NaN
What I want to do now is:
Draw a number from the school size distribution: n0
For n0 random persons satisfying the age condition (so those who have IN_ELEM_SCH == True), assign 0 to SCHOOL_ID
Draw another number from the school size distribution: n1
For n1 random persons still not assigned to a school, assign 1 to SCHOOL_ID
Repeat until all the persons with IN_ELEM_SCH == True have been assigned a school ID.
So, for example, let's say that the first school size drawn from the distribution is n0=2, the second n1=3 and the third n2=4.
I want to end up with something like this:
AGE ID IN_ELEM_SCH SCHOOL_ID
0 15 0 False NaN
1 6 1 True 0
2 54 2 False NaN
3 8 3 True 1
4 10 4 True 2
5 39 5 False NaN
6 2 6 False NaN
7 7 7 True 1
8 9 8 True 1
9 10 9 True 2
10 6 10 True 0
In real life, the school size is distributed as a lognormal distribution. Say, with parameters mu = 4 and sigma = 1
I can then draw from this distribution:
s = np.random.lognormal(mu, sigma, 100)
But I still wasn't able to figure out how to assign the schools.
I apologize for the length of this question, but I wanted to be clear.
Thank you very much for any hint or help you could give me.
Pandas will automatically match on the index when assigning new data. Check out the pandas docs on indexing.
Note: You wouldn't normally create the extra IN_ELEM_SCHOOL column (i.e. third line in the code below is unnecessary).
mu, sigma = 1, 0.5
m = (5 < df['AGE']) & (df['AGE'] < 11)
df['IN_ELEM_SCHOOL'] = m
# shuffled Series over the qualifying rows; cast to float so the
# school IDs assigned below don't fight the boolean dtype
s = m[m].sample(frac=1).astype(float)
n, i = 0, 0
while n < len(s):
    # draw a school size; max(1, ...) guards against a draw below 1,
    # whose int() would be 0 and leave the loop stuck forever
    num_students = max(1, int(np.random.lognormal(mu, sigma)))
    s.iloc[n: n + num_students] = i  # positional assignment
    i += 1
    n += num_students
df['SCHOOL_ID'] = s  # aligns on the index; non-matching rows get NaN
df
returns
AGE ID IN_ELEM_SCHOOL SCHOOL_ID
0 15 0 False NaN
1 6 1 True 0.0
2 54 2 False NaN
3 8 3 True 1.0
4 10 4 True 2.0
5 39 5 False NaN
6 2 6 False NaN
7 7 7 True 1.0
8 9 8 True 0.0
9 10 9 True 0.0
10 6 10 True 1.0
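If you want to sanity-check the resulting school sizes against the distribution you drew from, counting assigned students per school is a one-liner (NaN rows, i.e. non-students, are excluded by default):
df.groupby('SCHOOL_ID').size()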
In Pandas, I'm trying to figure out how to generate a column that is the difference between the time of the current row and the time of the last row in which the value of another column is True.
So given the dataframe:
import pandas as pd

df = pd.DataFrame({'Time': [5, 10, 15, 20, 25, 30, 35, 40, 45, 50],
                   'Event_Occured': [True, False, False, True, True, False, False, True, False, False]})
print(df)
Event_Occured Time
0 True 5
1 False 10
2 False 15
3 True 20
4 True 25
5 False 30
6 False 35
7 True 40
8 False 45
9 False 50
I'm trying to generate a column that would look like this:
Event_Occured Time Time_since_last
0 True 5 0
1 False 10 5
2 False 15 10
3 True 20 0
4 True 25 0
5 False 30 5
6 False 35 10
7 True 40 0
8 False 45 5
9 False 50 10
Thanks very much!
Using df.Event_Occured.cumsum() gives you distinct groups to groupby. Then applying a function per group that subtracts the first member's value from every member gets you what you want.
df['Time_since_last'] = \
    df.groupby(df.Event_Occured.cumsum()).Time.apply(lambda x: x - x.iloc[0])
df
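To see the grouping, the cumsum turns the booleans into running group labels, so every True starts a new group (values computed from the sample data):
df.Event_Occured.cumsum()
0    1
1    1
2    1
3    2
4    3
5    3
6    3
7    4
8    4
9    4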
Here's an alternative that fills the values corresponding to Falses with the last valid observation:
df['Time'] - df.loc[df['Event_Occured'], 'Time'].reindex(df.index).ffill()
Out:
0 0.0
1 5.0
2 10.0
3 0.0
4 0.0
5 5.0
6 10.0
7 0.0
8 5.0
9 10.0
Name: Time, dtype: float64
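To see why this works, the inner expression keeps Time only at the event rows, reindexes back to the full index, and forward-fills, yielding the time of the most recent event at every row (values computed from the sample data):
df.loc[df['Event_Occured'], 'Time'].reindex(df.index).ffill()
0     5.0
1     5.0
2     5.0
3    20.0
4    25.0
5    25.0
6    25.0
7    40.0
8    40.0
9    40.0
Name: Time, dtype: float64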