I am trying to group by several columns. With one column it is easy, just df.groupby('c1')['c3'].sum()
Original dataframe
c1 c2 c3
1 1 2
1 2 12
2 1 87
2 2 12
2 3 87
2 3 13
Desired result
c2 c3(c1_1) c3(c1_2)
1 2 87
2 12 12
3 (0?) 100
Where c3(c1_1) means sum of column c3 where c1 has a value of 1
I have no idea how to apply groupby to this. It would be nice if someone could show not only how to solve it, but also what to read so I can avoid asking such basic questions.
You can group by multiple columns at once by providing a list to groupby. If you don't mind the output being formatted slightly differently, you can achieve this result through
In [32]: df.groupby(['c2', 'c1']).c3.sum().unstack(fill_value=0)
Out[32]:
c1 1 2
c2
1 2 87
2 12 12
3 0 100
With a bit of work, this can be massaged into the form you give as well.
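For example, one possible way to get exactly the layout from the question (the c3(c1_1)-style names are just illustrative labels taken from your desired output):
out = df.groupby(['c2', 'c1'])['c3'].sum().unstack(fill_value=0)
out.columns = [f'c3(c1_{c})' for c in out.columns]  # label the unstacked c1 levels
out = out.reset_index()                             # make c2 a regular column again
As for what to read: the pandas user guide chapter "Group by: split-apply-combine" covers this pattern, and the reshaping docs cover unstack/pivot.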
Related
Hi, I'm trying to create a new column in my dataframe, and I want its values to be based on a calculation: each Student's share of the Score within their Class. There are two different students with the same name in different classes, which is why the first groupby below is on both Class and Student.
df['share'] = df.groupby(['Class', 'Student'])['Score'].agg('sum')/df.groupby(['Class'])['Score'].agg('sum')
With the code above, I get the error "incompatible index of inserted column with frame index".
Can someone please help. Thanks
The problem is that a groupby aggregation is indexed by the unique values of the columns you group by, so its index does not match the original frame's index, which is why the assignment fails. As I understood your problem, you want each student's share of their class's total score, which you can compute as a new dataframe:
ndf = df.groupby(['Class', 'Student'])['Score'].agg('sum')/df.groupby(['Class'])['Score'].agg('sum')
ndf = ndf.reset_index()
ndf
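If you want to assign the result directly onto df instead (which avoids the index mismatch behind your error), a possible sketch is to use transform, since it returns a result aligned with df's original index:
df['share'] = (df.groupby(['Class', 'Student'])['Score'].transform('sum')
               / df.groupby('Class')['Score'].transform('sum'))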
If I understood you correctly, given an example df like the following:
Class Student Score
1 1 1 99
2 1 2 60
3 1 3 90
4 1 4 50
5 2 1 93
6 2 2 93
7 2 3 67
8 2 4 58
9 3 1 54
10 3 2 29
11 3 3 34
12 3 4 46
Do you need the following result?
Class Student Score Score_Share
1 1 1 99 0.331104
2 1 2 60 0.200669
3 1 3 90 0.301003
4 1 4 50 0.167224
5 2 1 93 0.299035
6 2 2 93 0.299035
7 2 3 67 0.215434
8 2 4 58 0.186495
9 3 1 54 0.331288
10 3 2 29 0.177914
11 3 3 34 0.208589
12 3 4 46 0.282209
If so, that can be achieved straightforwardly with:
df['Score_Share'] = df.groupby('Class', group_keys=False)['Score'].apply(lambda x: x / x.sum())
You can apply operations within each group's scope like that.
PS. I don't know why a student with the same name in a different class would be a problem, so maybe I'm not getting something right. I'll edit this according to your response. Can't make a comment because I'm a newbie here :)
I have a pandas dataframe as shown below
df = pd.DataFrame({'sub_id': [101, 101, 101, 102, 102, 103, 104, 104, 105],
                   'test_id': ['A1', 'A1', 'C1', 'A1', 'B1', 'D1', 'E1', 'A1', 'F1'],
                   'dummy': ['hi', 'hello', 'how', 'are', 'you', 'am', 'fine', 'thank', 'you']})
I want each combination of sub_id and test_id to have a unique id (sequence number)
Please note that one subject can have duplicate test_ids, but the dummy values will be different.
Similarly, multiple subjects can share the same test_id, as shown in the sample dataframe.
So, I tried the below 2 approaches but they are incorrect.
df.groupby(['sub_id','test_id']).cumcount()+1 # incorrect
df['seq_id'] = df.index + 1 # incorrect
I expect my output to be as below
IIUC:
try via ngroup():
df['seq_id'] = df.groupby(['sub_id', 'test_id'], sort=False).ngroup() + 1
output of df:
sub_id test_id dummy seq_id
0 101 A1 hi 1
1 101 A1 hello 1
2 101 C1 how 2
3 102 A1 are 3
4 102 B1 you 4
5 103 D1 am 5
6 104 E1 fine 6
7 104 A1 thank 7
8 105 F1 you 8
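For reference, an alternative sketch that produces the same first-appearance numbering without a groupby, by factorizing the (sub_id, test_id) pairs:
df['seq_id'] = pd.factorize(list(zip(df['sub_id'], df['test_id'])))[0] + 1  # codes start at 0, so add 1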
I have two dataframes like
df1
sub_id Weight
1 56
2 67
3 81
5 73
9 59
df2
sub_id Text
1 He is normal.
1 person is healthy.
1 has strong immune power.
3 She is over weight.
3 person is small.
9 Looks good.
5 Not well.
5 Need to be tested.
By combining these two dataframes I need to get the result below (when there are multiple rows for a sub_id in the second df, pick the first text and combine it with the first df, as shown):
merge_df
sub_id Weight Text
1 56 He is normal.
2 67 NaN
3 81 She is over weight.
5 73 Not well.
9 59 Looks good.
Can anyone help me out?
Thanks in advance.
Here you go:
print(pd.merge(df1, df2.drop_duplicates(subset='sub_id'),
on='sub_id',
how='outer'))
Output
sub_id Weight Text
0 1 56 He is normal.
1 2 67 NaN
2 3 81 She is over weight.
3 5 73 Not well.
4 9 59 Looks good.
To keep the last duplicate, you'd use the parameter keep='last'
print(pd.merge(df1, df2.drop_duplicates(subset='sub_id', keep='last'),
on='sub_id',
how='outer'))
Output
sub_id Weight Text
0 1 56 has strong immune power.
1 2 67 NaN
2 3 81 person is small.
3 5 73 Need to be tested.
4 9 59 Looks good.
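An alternative sketch, in case you prefer it: collapse df2 to one row per sub_id first (keeping the first Text), then merge; this should give the same result as the drop_duplicates approach above.
first_text = df2.groupby('sub_id', as_index=False).first()  # one row per sub_id, first Text wins
print(pd.merge(df1, first_text, on='sub_id', how='left'))   # left join keeps every sub_id from df1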
I have a data frame with many columns that I need to divide by a column to compute proportions. Can someone help with a for loop for this? In the given data example below I want to add columns c1p = c1/ct, c2p=c2/ct, c3p=c3/ct, c4p=c4/ct.
id c1 c2 c3 c4 ct
1 6 8 8 12 34
2 5 3 11 6 25
3 3 9 6 12 30
4 14 10 10 3 37
The power of DataFrames is that you can do stuff column by column. Like this:
for i in range(1, 5):
    df[f'c{i}p'] = df[f'c{i}'] / df['ct']
Here is a vectorized approach to get another dataframe df_p which contains all the per-unit values you need:
df_p = df.filter(regex='c[0-9]').divide(df['ct'], axis=0)
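If you then want those proportions attached to the original frame with the c1p, c2p, ... names from the question, a possible follow-up is:
df = df.join(df_p.add_suffix('p'))  # c1 -> c1p, c2 -> c2p, ..., aligned on the row index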
I just asked a similar question but then realized it wasn't the right question.
What I'm trying to accomplish is to combine two data frames that actually have the same columns, but may or may not have common rows (indices of a MultiIndex). I'd like to combine them taking the sum of one of the columns, but leaving the other columns.
According to the accepted answer, the approach may be something like:
import numpy as np
import pandas as pd

def mklbl(prefix, n):
    try:
        return ["%s%s" % (prefix, i) for i in range(n)]
    except TypeError:
        return ["%s%s" % (prefix, i) for i in n]

mi1 = pd.MultiIndex.from_product([mklbl('A', 4), mklbl('C', 2)])
mi2 = pd.MultiIndex.from_product([mklbl('A', [2, 3, 4]), mklbl('C', 2)])

df2 = pd.DataFrame({'a': np.arange(len(mi1)), 'b': np.arange(len(mi1)),
                    'c': np.arange(len(mi1)), 'd': np.arange(len(mi1))[::-1]},
                   index=mi1).sort_index().sort_index(axis=1)
df1 = pd.DataFrame({'a': np.arange(len(mi2)), 'b': np.arange(len(mi2)),
                    'c': np.arange(len(mi2)), 'd': np.arange(len(mi2))[::-1]},
                   index=mi2).sort_index().sort_index(axis=1)
df1 = df1.add(df2.pop('b'))
but the problem is this will fail as the indices don't align.
This is close to what I'm trying to achieve, except that I lose rows that are not common to the two dataframes:
df1['b'] = df1['b'].add(df2['b'], fill_value=0)
But this gives me:
Out[197]:
a b c d
A2 C0 0 4 0 5
C1 1 6 1 4
A3 C0 2 8 2 3
C1 3 10 3 2
A4 C0 4 4 4 1
C1 5 5 5 0
When I want:
In [197]: df1
Out[197]:
a b c d
A0 C0 0 0 0 7
C1 1 2 1 6
A1 C0 2 4 2 5
C1 3 6 3 4
A2 C0 0 4 0 5
C1 1 6 1 4
A3 C0 2 8 2 3
C1 3 10 3 2
A4 C0 4 4 4 1
C1 5 5 5 0
Note: in response to #RandyC's comment about the XY problem... the specific problem is that I have a class which reads data and returns a dataframe of 1e9 rows. The columns of the data frame are latll, latur, lonll, lonur, concentration, elevation. The data frame is indexed by a MultiIndex (lat, lon, time) where time is a datetime. The rows of the two dataframes may/may not be the same (IF they exist for a given date, the lat/lon will be the same... they are grid cell centers). latll, latur, lonll, lonur are calculated from lat/lon. I want to sum the concentration column as I add two data frames, but not change the others.
Self-answering: there was an error in the comment above that caused double adding. This is correct:
newdata = df2.pop('b')                                 # set aside df2's 'b' column
result = df1.combine_first(df2)                        # union of rows; df1 values win where both frames have the row
result['b'] = result['b'].add(newdata, fill_value=0)   # sum 'b' across the two frames, treating missing as 0
seems to provide the solution to my use-case.
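For reference, an equivalent sketch using the original df1/df2 from the question (before 'b' is popped), which should match the combine_first result above: stack the two frames, then aggregate per index label, summing 'b' and taking the first (df1-preferred) value for the other columns.
combined = pd.concat([df1, df2])  # df1 listed first, so its rows come first within each group
result = combined.groupby(level=[0, 1]).agg({'a': 'first', 'b': 'sum', 'c': 'first', 'd': 'first'})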