Grouping by ID and choosing the highest values in each column for the same ID - python

I have a problem trying to calculate some final test marks. I need to group by Students, taking only the highest value in each column for each student.
With df being the dataframe:
data = {'Students': ['Student1', 'Student1', 'Student1', 'Student2', 'Student2', 'Student3'],
        'Result1': [2, 4, 5, 8, 2, 5],
        'Result2': [5, 3, 2, 8, 5, 5],
        'Result3': [7, 5, 7, 3, 8, 9]}
df = pd.DataFrame(data)
Students Result1 Result2 Result3
0 Student1 2 5 7
1 Student1 4 3 5
2 Student1 5 2 7
3 Student2 8 8 3
4 Student2 2 5 8
5 Student3 5 5 9
I need to generate a DF choosing the highest mark, for each student, in each Result.
So, the final DF should look like:
Students Result1 Result2 Result3
0 Student1 5 5 7
1 Student2 8 8 8
2 Student3 5 5 9
Any help?

The dataframe can be generated simply by iterating over the groups:
df2 = pd.DataFrame(columns=('Student', 'res1', 'res2', 'res3'))
for s in df.Students.unique():
    stdf = df[df["Students"] == s]
    df2 = df2.append({'Student': s, 'res1': max(stdf.Result1), 'res2': max(stdf.Result2),
                      'res3': max(stdf.Result3)}, ignore_index=True)
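Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so a sketch of the same loop that collects the rows in a plain list and builds the frame once could look like this (same df as above):
rows = []
for s in df.Students.unique():
    stdf = df[df["Students"] == s]
    rows.append({'Student': s, 'res1': stdf.Result1.max(),
                 'res2': stdf.Result2.max(), 'res3': stdf.Result3.max()})
df2 = pd.DataFrame(rows, columns=['Student', 'res1', 'res2', 'res3'])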

This works by calling groupby('Students').max():
>>> df.groupby('Students').max()
Result1 Result2 Result3
Students
Student1 5 5 7
Student2 8 8 8
Student3 5 5 9
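If you also want Students back as an ordinary column with a fresh 0..n index, matching the frame shown in the question, a small variant is to pass as_index=False:
>>> df.groupby('Students', as_index=False).max()
Students Result1 Result2 Result3
0 Student1 5 5 7
1 Student2 8 8 8
2 Student3 5 5 9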


Creating a new column based on other columns from another dataframe

I have 2 dataframes:
df1
Name Apples Pears Grapes Peachs
James 3 5 5 2
Harry 1 0 2 9
Will 20 2 7 3
df2
Class User Factor
A Harry 3
A Will 2
A James 5
B NaN 4
I want to create a new column in df2 called Total which is a list of all the columns for each user in df1, multiplied by the Factor for that user - this should only be done if they are in Class A.
This is how the final df should look
df2
Class User Factor Total
A Harry 3 [3,0,6,27]
A Will 2 [40,4,14,6]
A James 5 [15,25,25,10]
B NaN 4
This is what I tried:
df2['Total'] = list(df1.Name.isin((df2.User) and (df2.Class==A)) * df2.Factor)
This will do what your question asks:
df2 = df2[df2.Class=='A'].join(df.set_index('Name'), on='User').set_index(['Class','User'])
df2['Total'] = df2.apply(lambda x: list(x * x.Factor)[1:], axis=1)
df2 = df2.reset_index()[['Class','User','Factor','Total']]
Full test code:
import pandas as pd
import numpy as np
df = pd.DataFrame(columns=[
    x.strip() for x in 'Name Apples Pears Grapes Peachs'.split()], data=[
    ['James', 3, 5, 5, 2],
    ['Harry', 1, 0, 2, 9],
    ['Will', 20, 2, 7, 3]])
print(df)
df2 = pd.DataFrame(columns=[
    x.strip() for x in 'Class User Factor'.split()], data=[
    ['A', 'Harry', 3],
    ['A', 'Will', 2],
    ['A', 'James', 5],
    ['B', np.nan, 4]])
print(df2)
df2 = df2[df2.Class=='A'].join(df.set_index('Name'), on='User').set_index(['Class','User'])
df2['Total'] = df2.apply(lambda x: list(x * x.Factor)[1:], axis=1)
df2 = df2.reset_index()[['Class','User','Factor','Total']]
print(df2)
Input:
Name Apples Pears Grapes Peachs
0 James 3 5 5 2
1 Harry 1 0 2 9
2 Will 20 2 7 3
Class User Factor
0 A Harry 3
1 A Will 2
2 A James 5
3 B NaN 4
Output
Class User Factor Total
0 A Harry 3 [3, 0, 6, 27]
1 A Will 2 [40, 4, 14, 6]
2 A James 5 [15, 25, 25, 10]
You can use:
# First lookup
cols = df1.columns.drop('Name')   # the value columns: Apples, Pears, Grapes, Peachs
factor = df2[df2['Class'] == 'A'].set_index('User')['Factor']
df1['Total'] = df1[cols].mul(df1['Name'].map(factor), axis=0).agg(list, axis=1)
# Second lookup
df2['Total'] = df2['User'].map(df1.set_index('Name')['Total'])
Output:
>>> df2
Class User Factor Total
0 A Harry 3 [3, 0, 6, 27]
1 A Will 2 [40, 4, 14, 6]
2 A James 5 [15, 25, 25, 10]
3 B NaN 4 NaN
>>> df1
Name Apples Pears Grapes Peachs Total
0 James 3 5 5 2 [15, 25, 25, 10]
1 Harry 1 0 2 9 [3, 0, 6, 27]
2 Will 20 2 7 3 [40, 4, 14, 6]
One-liner masochists, greetings ;)
df2['Total'] = pd.Series(df1.sort_values(by='Name').reset_index(drop=True).iloc[:, 1:5]
                         .mul(df2[df2.Class == 'A'].sort_values(by='User')['Factor'].reset_index(drop=True), axis=0)
                         .values.tolist())
df2
df2
Output:
index Class User Factor Total
0 A Harry 3 [3, 0, 6, 27]
1 A Will 2 [15, 25, 25, 10]
2 A James 5 [40, 4, 14, 6]
3 B NaN 4 NaN
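Note that because df1 is sorted by Name (Harry, James, Will) while df2's Class A rows keep their original order (Harry, Will, James), the purely positional assignment above puts James's and Will's totals on each other's rows, as the output shows. A sketch that aligns by name instead (reusing the same df1 and df2) could be:
# align each user's row with that user's Factor by index, then map back by name
prod = df1.set_index('Name').mul(
    df2[df2.Class == 'A'].set_index('User')['Factor'], axis=0)
df2['Total'] = df2['User'].map(prod.apply(list, axis=1))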

Rolling groupby nunique count based on start and end dates

I have a dataframe with a unique ID, a start date, and an end date. Over the course of a year, the ID can start, stop, and be restarted.
I would like to get a groupby nunique count of IDs over the course of a year.
Currently, I can count unique values for a start date of the ID, but how exactly do I incorporate the end date?
fun = pd.DataFrame({'ZIP_KEY': ['A', 'B', 'C', 'A', 'B', 'A'],
                    'start_month': [1, 2, 2, 6, 8, 10],
                    'end_month': [4, 3, 7, 7, 12, 12]})
fun.groupby('start_month')['ZIP_KEY'].nunique()
start_month
1 1
2 2
3 0
4 0
5 0
6 1
7 0
8 1
9 0
10 1
11 0
12 0
Essentially, if an ID starts in January and ends in March, I'd like it to be included in the count for February and March, not just January, which is how my current method is operating.
Desired Output:
start_month
1 1
2 3
3 3
4 2
5 1
6 2
7 2
8 1
9 1
10 2
11 2
12 2
Any tips or help is much appreciated!
Maybe you can list all the months between start and end, explode and finally count
import pandas as pd

df = pd.DataFrame({'ZIP_KEY': ['A', 'B', 'C', 'A', 'B', 'A'],
                   'start_month': [1, 2, 2, 6, 8, 10],
                   'end_month': [4, 3, 7, 7, 12, 12]})
df["list"] = df.apply(lambda x: list(range(x["start_month"], x["end_month"] + 1)),
                      axis=1)
df = df.explode("list")
df.groupby("list")["ZIP_KEY"].nunique()
One option is to re-create the DataFrame where you expand the ranges to all months within the range and replicate the key across every row. Then you can use a normal groupby.
df = pd.concat([pd.DataFrame({'month': range(st, en + 1), 'key': k})
                for k, st, en in zip(fun['ZIP_KEY'], fun['start_month'], fun['end_month'])])
df.groupby('month').key.nunique()
#month
#1 1
#2 3
#3 3
#4 2
#5 1
#6 2
#7 2
#8 1
#9 1
#10 2
#11 2
#12 2
#Name: key, dtype: int64
Here's a little fun using pd.IntervalIndex with pandas 1.0.0.
import numpy as np

ii = pd.IntervalIndex.from_arrays(fun['start_month'], fun['end_month'], closed='both')
monthrange = np.arange(1, 13)
pd.Series(monthrange, index=monthrange).apply(lambda x: sum(ii.contains(x)))\
    .rename_axis('months').rename('count')
Output:
months
1 1
2 3
3 3
4 2
5 1
6 2
7 2
8 1
9 1
10 2
11 2
12 2
Name: count, dtype: int64

using isin across multiple columns

I'm trying to use .isin with the ~ so I can get a list of unique rows back based on multiple columns in 2 data-sets.
So, I have 2 data-sets with 9 rows:
df1 is the bottom and df2 is the top (sorry but I couldn't get it to show both below, it showed 1 then a row of numbers)
Index Serial Count Churn
1 9 5 0
2 8 6 0
3 10 2 1
4 7 4 2
5 7 9 2
6 10 2 2
7 2 9 1
8 9 8 3
9 4 3 5
Index Serial Count Churn
1 10 2 1
2 10 2 1
3 9 3 0
4 8 6 0
5 9 8 0
6 1 9 1
7 10 3 1
8 6 7 1
9 4 8 0
I would like to get a list of rows from df1 which aren't in df2 based on more than 1 column.
For example if I base my search on the columns Serial and Count I wouldn't get Index 1 and 2 back from df1 as it appears in df2 at Index position 6, the same with Index position 4 in df1 as it appears at Index position 2 in df2. The same would apply to Index position 5 in df1 as it is at Index position 8 in df2.
The churn column doesn't really matter.
I can get it to work but based only on 1 column but not on more than 1 column.
df2[~df2.Serial.isin(df1.Serial.values)] kinda does what I want, but only on 1 column. I want it to be based on 2 or more.
Index Serial Count Churn
3 9 3 0
6 1 9 1
7 10 3 1
8 6 7 1
9 4 8 0
One solution is to merge with indicators:
df1 = pd.DataFrame([[10, 2, 0], [9, 4, 1], [9, 8, 1], [8, 6, 1], [9, 8, 1], [1, 9, 1], [10, 3, 1], [6, 7, 1], [4, 8, 1]], columns=['Serial', 'Count', 'Churn'])
df2 = pd.DataFrame([[9, 5, 1], [8, 6, 1], [10, 2, 1], [7, 4, 1], [7, 9, 1], [10, 2, 1], [2, 9, 1], [9, 8, 1], [4, 3, 1]], columns=['Serial', 'Count', 'Churn'])
# merge with indicator on
df_temp = df1.merge(df2[['Serial', 'Count']].drop_duplicates(), on=['Serial', 'Count'], how='left', indicator=True)
res = df_temp.loc[df_temp['_merge'] == 'left_only'].drop('_merge', axis=1)
Output
Serial Count Churn
1 9 4 1
5 1 9 1
6 10 3 1
7 6 7 1
8 4 8 1
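If you prefer to stay closer to the isin idea from the title, another sketch (with df1 as the frame to filter and df2 holding the rows to exclude, as in the original question) is to compare row tuples built from the identifier columns:
key1 = df1[['Serial', 'Count']].apply(tuple, axis=1)
key2 = df2[['Serial', 'Count']].apply(tuple, axis=1)
res = df1[~key1.isin(key2)]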
I've had a similar issue to solve; I found the easiest way to deal with it is to create a temporary column made of the merged identifier columns, and use isin on this newly created temporary column's values.
A simple function achieving this could be the following:
from functools import reduce

get_temp_col = lambda df, cols: reduce(lambda x, y: x + df[y].astype('str'), cols, "")

def subset_on_x_columns(df1, df2, cols):
    """
    Subsets the input dataframe `df1` based on the missing unique values of input columns
    `cols` of dataframe `df2`.

    :param df1: Pandas dataframe to be subsetted
    :param df2: Pandas dataframe whose missing values are going to be
                used to subset `df1` by
    :param cols: List of column names
    """
    df1_temp_col = get_temp_col(df1, cols)
    df2_temp_col = get_temp_col(df2, cols)
    return df1[~df1_temp_col.isin(df2_temp_col.unique())]
Thus, for your case, all that is needed is to execute:
result_df = subset_on_x_columns(df1, df2, ['Serial', 'Count'])
which has the wanted rows:
Index Serial Count Churn
3 9 3 0
6 1 9 1
7 10 3 1
8 6 7 1
9 4 8 0
The nice thing about this solution is that it is naturally scalable in the number of columns to use, i.e. all that is needed is to specify in the input parameter list cols which columns to use as identifiers.
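One caveat worth noting: concatenating values without a separator can create false matches (for example, Serial 1 with Count 23 produces the same key as Serial 12 with Count 3), so a safer variant of the helper inserts a separator between the column values, e.g.:
get_temp_col = lambda df, cols: reduce(lambda x, y: x + "_" + df[y].astype('str'), cols, "")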

Add multiple dataframes based on one column

I have several hundred dataframes with the same column names, like this:
df1
wave num stlines fwhm EWs MeasredWave
0 4050.32 3 0.28269 0.07365 22.16080 4050.311360
1 4208.98 5 0.48122 0.08765 44.90035 4208.972962
2 4374.94 9 0.71483 0.11429 86.96497 4374.927110
3 4379.74 9 0.31404 0.09107 30.44271 4379.760601
4 4398.01 14 0.50415 0.09845 52.83236 4398.007473
5 5520.50 1 0.06148 0.12556 8.21685 5520.484742
df2
wave num stlines fwhm EWs MeasredWave
0 4050.32 3 0.28616 0.07521 22.91064 4050.327388
1 4208.98 6 0.48781 0.08573 44.51609 4208.990029
2 4374.94 9 0.71548 0.11437 87.10152 4374.944513
3 4379.74 10 0.31338 0.09098 30.34791 4379.778009
4 4398.01 15 0.49950 0.08612 45.78707 4398.020367
5 4502.21 9 0.56362 0.10114 60.67868 4502.223123
6 4508.28 3 0.69554 0.11600 85.88428 4508.291777
7 4512.99 2 0.20486 0.08891 19.38745 4512.999332
8 5520.50 1 0.06148 0.12556 8.21685 5520.484742
That's how I'm reading them:
path_to_files = '/home/Desktop/computed_2d/'
lst = []
for filen in dir1:
    df = pd.read_table(path_to_files + filen, skiprows=0, usecols=(0, 1, 2, 3, 4, 8),
                       names=['wave', 'num', 'stlines', 'fwhm', 'EWs', 'MeasredWave'],
                       delimiter=r'\s+')
    lst.append(df)
The desired result should look like this:
wave num stlines fwhm EWs MeasredWave
0 4050.32 3.0 0.284425 0.074430 22.535720 4050.319374
1 4208.98 5.5 0.484515 0.086690 44.708220 4208.981496
2 4374.94 9.0 0.715155 0.114330 87.033245 4374.935812
3 4379.74 9.5 0.313710 0.091025 30.395310 4379.769305
4 4398.01 14.5 0.501825 0.092285 49.309715 4398.013920
5 4502.21 9 0.56362 0.10114 60.67868 4502.223123
6 4508.28 3 0.69554 0.11600 85.88428 4508.291777
7 4512.99 2 0.20486 0.08891 19.38745 4512.999332
8 5520.50 1.0 0.061480 0.125560 8.216850 5520.484742
As you can see, the number of rows is not the same. Now I want to take the average of all the dataframes based on the wave column, and I want to make sure that each value in the wave column of df1 gets matched with the correct row of df2.
You can combine the dataframes with an outer merge on wave and then take the average of the respective columns:
df3 = pd.merge(df1,df2,on=['wave'],how ='outer',)
df4 = df3.rename(columns = lambda x: x.split('_')[0]).T
df4.groupby(df4.index).mean().T
Out:
EWs MeasredWave fwhm num stlines wave
0 22.535720 4050.319374 0.074430 3.0 0.284425 4050.32
1 44.708220 4208.981496 0.086690 5.5 0.484515 4208.98
2 87.033245 4374.935812 0.114330 9.0 0.715155 4374.94
3 30.395310 4379.769305 0.091025 9.5 0.313710 4379.74
4 49.309715 4398.013920 0.092285 14.5 0.501825 4398.01
5 8.216850 5520.484742 0.125560 1.0 0.061480 5520.50
6 60.678680 4502.223123 0.101140 9.0 0.563620 4502.21
7 85.884280 4508.291777 0.116000 3.0 0.695540 4508.28
8 19.387450 4512.999332 0.088910 2.0 0.204860 4512.99
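Since the question mentions several hundred dataframes already collected in lst, a shorter sketch (assuming every frame shares the same columns) is to concatenate them all and average by wave in one go:
avg = pd.concat(lst, ignore_index=True).groupby('wave', as_index=False).mean()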
Here is an example to do what you need:
import pandas as pd

df1 = pd.DataFrame({'A': [0, 1, 2, 3],
                    'B': [0, 1, 2, 3],
                    'C': [0, 1, 2, 3],
                    'D': [0, 1, 2, 3]},
                   index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': [4, 5, 6, 7],
                    'B': [4, 5, 6, 7],
                    'C': [4, 5, 6, 7],
                    'D': [4, 5, 6, 7]},
                   index=[0, 1, 2, 3])
df3 = pd.DataFrame({'A': [8, 9, 10, 11],
                    'B': [8, 9, 10, 11],
                    'C': [8, 9, 10, 11],
                    'D': [8, 9, 10, 11]},
                   index=[0, 1, 2, 3])
df4 = pd.concat([df1, df2, df3])
df5 = pd.concat([df1, df2, df3], ignore_index=True)
print(df4)
print('\n\n')
print(df5)
print(f"Average of column A = {df4['A'].mean()}")
You will have
A B C D
0 0 0 0 0
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
0 4 4 4 4
1 5 5 5 5
2 6 6 6 6
3 7 7 7 7
0 8 8 8 8
1 9 9 9 9
2 10 10 10 10
3 11 11 11 11
A B C D
0 0 0 0 0
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
7 7 7 7 7
8 8 8 8 8
9 9 9 9 9
10 10 10 10 10
11 11 11 11 11
Average of column A = 5.5
Answer from #Naga Kiran is great. I updated the whole solution here:
import pandas as pd
df1 = pd.DataFrame(
    {'wave': [4050.32, 4208.98, 4374.94, 4379.74, 4398.01, 5520.50],
     'num': [3, 5, 9, 9, 14, 1],
     'stlines': [0.28269, 0.48122, 0.71483, 0.31404, 0.50415, 0.06148],
     'fwhm': [0.07365, 0.08765, 0.11429, 0.09107, 0.09845, 0.12556],
     'EWs': [22.16080, 44.90035, 86.96497, 30.44271, 52.83236, 8.21685],
     'MeasredWave': [4050.311360, 4208.972962, 4374.927110, 4379.760601, 4398.007473, 5520.484742]},
    index=[0, 1, 2, 3, 4, 5])
df2 = pd.DataFrame(
    {'wave': [4050.32, 4208.98, 4374.94, 4379.74, 4398.01, 4502.21, 4508.28, 4512.99, 5520.50],
     'num': [3, 6, 9, 10, 15, 9, 3, 2, 1],
     'stlines': [0.28616, 0.48781, 0.71548, 0.31338, 0.49950, 0.56362, 0.69554, 0.20486, 0.06148],
     'fwhm': [0.07521, 0.08573, 0.11437, 0.09098, 0.08612, 0.10114, 0.11600, 0.08891, 0.12556],
     'EWs': [22.91064, 44.51609, 87.10152, 30.34791, 45.78707, 60.67868, 85.88428, 19.38745, 8.21685],
     'MeasredWave': [4050.327388, 4208.990029, 4374.944513, 4379.778009, 4398.020367, 4502.223123, 4508.291777, 4512.999332, 5520.484742]},
    index=[0, 1, 2, 3, 4, 5, 6, 7, 8])
df3 = pd.merge(df1, df2, on='wave', how='outer')
df4 = df3.rename(columns = lambda x: x.split('_')[0]).T
df5 = df4.groupby(df4.index).mean().T
df6 = df5[['wave', 'num', 'stlines', 'fwhm', 'EWs', 'MeasredWave']]
df7 = df6.sort_values('wave', ascending = True).reset_index(drop=True)
df7

create a dataframe from 3 other dataframes in python

I am trying to create a new df which summarises my key information, by taking that information from 3 (say) other dataframes.
dfdate = {'x1': [2, 4, 7, 5, 6],
          'x2': [2, 2, 2, 6, 7],
          'y1': [3, 1, 4, 5, 9]}
dfdate = pd.DataFrame(dfdate, index=range(5))
dfqty = {'x1': [1, 2, 6, 6, 8],
         'x2': [3, 1, 1, 7, 5],
         'y1': [2, 4, 3, 2, 8]}
dfqty = pd.DataFrame(dfqty, index=range(5))
dfprices = {'x1': [0, 2, 2, 4, 4],
            'x2': [2, 0, 0, 3, 4],
            'y1': [1, 3, 2, 1, 3]}
dfprices = pd.DataFrame(dfprices, index=range(5))
Let us say the above 3 dataframes are my data. Say, some dates, qty, and prices of goods. My new df is to be constructed from the above data:
rng = len(dfprices.columns) * len(dfprices.index)  # This is the len of new df
dfnew = pd.DataFrame(np.nan, index=range(0, rng), columns=['Letter', 'Number', 'date', 'qty', 'price'])
Now, this is where I struggle to piece my stuff together. I am trying to take all the data in dfdate and put it into a column in the new df, and the same with dfqty and dfprices (so 3x5 matrices essentially go into a 1x15 vector and are placed into the new df).
As well as that, I need a couple of columns in dfnew as identifiers, from the names of the columns of the old df.
I've tried for loops but to no avail, and I don't know how to convert a df to a series. My desired output is:
dfnew:
'Lettercol','Numbercol', 'date', 'qty', 'price'
0 X 1 2 1 0
1 X 1 4 2 2
2 X 1 7 6 2
3 X 1 5 6 4
4 X 1 6 8 4
5 X 2 2 3 2
6 X 2 2 1 0
7 X 2 2 1 0
8 X 2 6 7 3
9 X 2 7 5 4
10 Y 1 3 2 1
11 Y 1 1 4 3
12 Y 1 4 3 2
13 Y 1 5 2 1
14 Y 1 9 8 3
where the numbers 0-14 are the index.
letter = letter from col header in DFs
number = number from col header in DFs
next 3 columns are data from the orig df's
(don't ask why the original data is in that funny format :)
Thanks so much. My last Q wasn't well received so I have tried to make this one better, thanks.
Use:
#list of DataFrames
dfs = [dfdate, dfqty, dfprices]
#list comprehension with reshape
comb = [x.unstack() for x in dfs]
#join together
df = pd.concat(comb, axis=1, keys=['date', 'qty', 'price'])
#remove second level of MultiIndex and index to column
df = df.reset_index(level=1, drop=True).reset_index().rename(columns={'index':'col'})
#extract all values without first by indexing [1:] and first letter by [0]
df['Number'] = df['col'].str[1:]
df['Letter'] = df['col'].str[0]
cols = ['Letter', 'Number', 'date', 'qty', 'price']
#change order of columns
df = df.reindex(columns=cols)
print (df)
Letter Number date qty price
0 x 1 2 1 0
1 x 1 4 2 2
2 x 1 7 6 2
3 x 1 5 6 4
4 x 1 6 8 4
5 x 2 2 3 2
6 x 2 2 1 0
7 x 2 2 1 0
8 x 2 6 7 3
9 x 2 7 5 4
10 y 1 3 2 1
11 y 1 1 4 3
12 y 1 4 3 2
13 y 1 5 2 1
14 y 1 9 8 3
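The desired output in the question shows uppercase letters (X, Y); if that matters, a one-line tweak at the end is:
df['Letter'] = df['Letter'].str.upper()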
