I have a pandas dataframe where I am trying to sum based on groupings, but I can't seem to get the order right. In the example below, I want to group by group2 then group1 and sum without double-counting the group1 values. This is part of a larger table with other things going on, so I don't want to filter out by unique group1/group2 sets.
Using pandas 1.0.5
x, y = [(21643,21665,21640,21668,21713,21706), (30,28,84,2,32,-9)]
val = [11,27,31,15,50,35]
group1, group2 = [(1,1,3,4,1,4), (21660,21660,21660,21660,21700,21700)]
df = pd.DataFrame(list(zip(x, y, val, group1, group2)),
columns =['x', 'y', 'val', 'group1', 'group2']
)
df.reset_index(drop=True, inplace=True)
df.sort_values(['group2', 'group1'],inplace=True)
df['group1_mean'] = df.groupby(['group2', 'group1'])['val'].transform('mean')
df['group2_sum'] = df.groupby(['group2', 'group1'])['group1_mean'].transform('sum')
display(df)
I would make a temporary frame holding the per-(group2, group1) means of val, then sum those means per group2:
dfsum = df.groupby(['group2', 'group1'])['val'].mean()
dfsum = dfsum.groupby('group2').sum().rename('result')
Then merge df with this dfsum:
df = df.merge(dfsum, on='group2')
The one-line trick:
df = df.merge(df.groupby(['group2', 'group1']).val.mean()
.groupby('group2').sum().rename('result'), on='group2')
This does not bind the intermediate results to names, so the groupby intermediates can be garbage-collected right away.
Output
x y val group1 group2 result
0 21643 30 11 1 21660 65
1 21665 28 27 1 21660 65
2 21640 84 31 3 21660 65
3 21668 2 15 4 21660 65
4 21713 32 50 1 21700 85
5 21706 -9 35 4 21700 85
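As a check on the arithmetic: for group2 == 21660 the group1 means of val are (11 + 27) / 2 = 19 (group1 == 1), 31 (group1 == 3) and 15 (group1 == 4), which sum to 65; for group2 == 21700 they are 50 and 35, which sum to 85. Both match the result column, with each group1 counted exactly once.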
Related
I have two dataframes, df_1 and df_2, where df_1 has several columns of "codes" and df_2 has the definitions for all of those codes:
df_1 = pd.DataFrame({
'Age': [42, 35, 64, 53],
'Code 1': [1234, 3452, 9583, 8753],
'Code 2': [3857, np.nan, np.nan, 1234]})
df_2 = pd.DataFrame({
'Code': [3452, 8753, 3857, 1234, 9583],
'Code Def':['a', 'b', 'c', 'd', 'e']})
How do I create a new column in df_1 that contains the definitions of all the codes from df_2 to look something like this?
Age Code 1 Code 2 Code def
42 1234 3857 d, c
35 3452 NaN a
64 9583 NaN e
53 8753 1234 b, d
I've tried using merge() to combine the two dataframes, but that doesn't work since I want to join on multiple columns in df_1 and just one column in df_2. I also tried creating empty columns in df_1 and filling them using if statements, but that got quite complicated.
Thanks!
You could first stack and groupby+agg to form the new column, then merge with the original dataset:
s = df_2.set_index(['Code'])['Code Def']
df_1.merge(df_1.set_index('Age')
.stack().map(s)
.groupby(level='Age').agg(','.join)
.rename('Code def'),
left_on='Age', right_index=True
)
Output:
Age Code 1 Code 2 Code def
0 42 1234 3857.0 d,c
1 35 3452 NaN a
2 64 9583 NaN e
3 53 8753 1234.0 b,d
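The reason stack is convenient here: it collapses the code columns into one long Series and drops the NaN entries by default, so only real codes reach the map. Inspecting the intermediates (same names as above) shows the flow:
stacked = df_1.set_index('Age').stack()  # one row per (Age, code column) pair; NaNs dropped
print(stacked.map(s))                    # each code translated to its definition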
Here's another approach: map the codes through a dict with replace:
x = dict(zip(df_2["Code"], df_2['Code Def']))
tmp = df_1[["Code 1", "Code 2"]].replace({"Code 1": x, 'Code 2': x}).fillna('')
df_1["Code Def"] = tmp["Code 1"] + " " + tmp["Code 2"]
Output:
Age Code 1 Code 2 Code Def
0 42 1234 3857.0 d c
1 35 3452 NaN a
2 64 9583 NaN e
3 53 8753 1234.0 b d
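One caveat: when Code 2 is NaN, fillna('') leaves an empty string, so the concatenation carries a trailing space (e.g. 'a '). A str.strip() on the result cleans this up:
df_1["Code Def"] = (tmp["Code 1"] + " " + tmp["Code 2"]).str.strip()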
I have a dataframe named df which has two columns, id1 and id2.
I need to filter values based on another dataframe named meta_df.
meta_df has three columns: id, name, text.
df

   id1  id2
0   12   34
1   99   42

metadf

   id name         text
0  12   aa     lowerend
1  42   bb     upperend
2  99   cc  upper limit
3  34   dd    uppersome
I need the pairs whose text values contain both 'lower' and 'upper', e.g. 12 and 34 (12 maps to 'lowerend', 34 to 'uppersome'). I am trying the code below and am stuck at looking up the text column:
for row in df.itertuples():
    print(row.Index, row.id1, row.id2)
    print(meta_df[meta_df['id'] == row.id1])
    print(meta_df[meta_df['id'] == row.id2])
Output expected:

   id1  id2 flag
0   12   34  yes
1   99   42   no
Melt df and merge with metadf; a bit of reshaping then gets the final value:
# keep the index with ignore_index
# it will be used when reshaping back to original form
reshaped = (df.melt(value_name='id', ignore_index=False)
              .assign(ind=lambda df: df.index)
              .merge(metadf, on='id', how='left')
              .assign(text=lambda df: df.text.str.contains('lower'))
              .drop(columns='name')
              .pivot(index='ind', columns='variable')
              .rename_axis(columns=[None, None], index=None)
           )
# if the row contains both lower(1) and upper(0)
# it will sum to 1, else 0, or 2(unlikely with the sample data shared)
flag = reshaped.loc(axis=1)['text'].sum(1)
reshaped.loc(axis=1)['id'].assign(flag = flag.map({1:'yes', 0:'no'}))
id1 id2 flag
0 12 34 yes
1 99 42 no
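If the reshaping feels heavy, the same flag can be built from plain map lookups; a minimal sketch of the same logic, assuming every id in df appears in metadf:
text = metadf.set_index('id')['text']
has_lower = df['id1'].map(text).str.contains('lower') | df['id2'].map(text).str.contains('lower')
has_upper = df['id1'].map(text).str.contains('upper') | df['id2'].map(text).str.contains('upper')
df.assign(flag=(has_lower & has_upper).map({True: 'yes', False: 'no'}))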
I have original df:
df = pd.DataFrame({'Creatinine':[68,69,80,75],
'Ferritin':[251,1481,107,110],
'ALT':[11,14,10,15]})
I would like to append the values of df2 (below) as a unit suffix on the corresponding column names of df.
df2 = pd.DataFrame({'Creatinine_Units':['umol/L','umol/L','umol/L','umol/L'],
'Ferritin_units':['ug/L','ug/L','ug/L','ug/L'],
'ALT':['U/L','U/L','U/L','U/L']})
Expected outcome: each column of df renamed with its unit suffix, e.g. Creatinine (umol/L).
How do I go about this in Python?
You can perform arithmetic on column names like this:
df.columns = df.columns + ' (' + df2.iloc[0] + ')'
Output:
Creatinine (umol/L) Ferritin (ug/L) ALT (U/L)
0 68 251 11
1 69 1481 14
2 80 107 10
3 75 110 15
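Note that df2's column names ('Creatinine_Units', etc.) don't match df's, so the addition above pairs the labels positionally rather than by name. To make that pairing explicit, a zip over the first row does the same thing:
df.columns = [f'{col} ({unit})' for col, unit in zip(df.columns, df2.iloc[0])]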
I have a dataframe with two columns: id1 and id2.
df = pd.DataFrame({'id1': list('ABCBAC'), 'id2': [123,13,12,11,13,132]})
print(df)
  id1  id2
0   A  123
1   B   13
2   C   12
3   B   11
4   A   13
5   C  132
And I want to reshape it (using, groupby, or pivot maybe?) to obtain the following:
id1 id2-1 id2-2
A 123 13
B 13 11
C 12 132
Note that there are exactly two rows for each id1 but a great number of different values of id2 (so I'd rather not do one-hot vector encoding).
Ideally, the values within each row would be sorted in increasing order, to give this:
id1 id2-1 id2-2
A 13 123
B 11 13
C 12 132
i.e. for each row the values in id2-1 and id2-2 are sorted (see the row corresponding to id1 == 'B').
Plan
Create an index that counts each successive time we see a value in 'id1'. For this we groupby('id1') and use cumcount() to produce that new index.
We then set the index to be a pd.MultiIndex with set_index.
With the pd.MultiIndex in place we are set up to unstack.
Finally, we rename the columns with some tricky mapping.
d = df.set_index(['id1', df.groupby('id1').cumcount() + 1]).unstack()
d.columns = d.columns.to_series().map('{0[0]}-{0[1]}'.format)
print(d)
     id2-1  id2-2
id1
A      123     13
B       13     11
C       12    132
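To get the sorted-within-row variant the question prefers, sort by id2 within each id1 before taking the cumcount; the rest of the machinery is unchanged:
ds = df.sort_values(['id1', 'id2'])
d = ds.set_index(['id1', ds.groupby('id1').cumcount() + 1]).unstack()
d.columns = d.columns.to_series().map('{0[0]}-{0[1]}'.format)
print(d)
     id2-1  id2-2
id1
A       13    123
B       11     13
C       12    132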
This should do it (note it round-trips through strings and relies on there being exactly two rows per id1):
import pandas as pd
df = pd.DataFrame({'id1': list('ABCBAC'), 'id2': [123,13,12,11,13,132]})
df['id2'] = df['id2'].astype(str)
df = df.groupby(['id1']).agg(lambda x: '-'.join(x))
df['id2-1'] = df['id2'].apply(lambda x: x.split('-')[0]).astype(int)
df['id2-2'] = df['id2'].apply(lambda x: x.split('-')[1]).astype(int)
df = df.reset_index()[['id1', 'id2-1', 'id2-2']]
Say I have an n ⨉ p matrix of n samples of a single feature of dimension p (for example a word2vec element, so that p is of the order of ~300). I can create each column programmatically, e.g. with features = ['f'+str(i) for i in range(p)], and then append to an existing dataframe.
Since they represent a single feature, how can I reference all those columns later on? I can assign df.feature = df[features] which works, but it breaks when I slice the dataset: df[:x].feature results in an exception.
Example:
df = pre_existing_dataframe() # such that len(df) is n
n,p = 3,4
m = np.arange(n*p).reshape((n,p))
fs = ['f'+str(i) for i in range(p)]
df_m = pd.DataFrame(m)
df_m.columns = fs
df = pd.concat([df,df_m],axis=1) # m is now only a part of df
df.f = df[fs]
df.f # works: I can access the whole m at once
df[:1].f # crashes
I wouldn't use df.f = df[fs]. It may lead to undesired and surprising behaviour if you try to modify the data frame. Instead, I'd consider creating hierarchical columns as in the below example.
Say, we already have a preexisting data frame df0 and another one with features:
df0 = pd.DataFrame(np.arange(4).reshape(2,2), columns=['A', 'B'])
df1 = pd.DataFrame(np.arange(10, 16).reshape(2,3), columns=['f0', 'f1', 'f2'])
Then, using the keys argument to concat, we create another level in columns:
df = pd.concat([df0, df1], keys=['pre', 'feat1'], axis=1)
df
Out[103]:
pre feat1
A B f0 f1 f2
0 0 1 10 11 12
1 2 3 13 14 15
The subframe with features can be accessed as follows:
df['feat1']
Out[104]:
f0 f1 f2
0 10 11 12
1 13 14 15
df[('feat1', 'f0')]
Out[105]:
0 10
1 13
Name: (feat1, f0), dtype: int64
Slicing on rows is straightforward. Slicing on columns may be more complicated:
df.loc[:, pd.IndexSlice['feat1', :]]
Out[106]:
feat1
f0 f1 f2
0 10 11 12
1 13 14 15
df.loc[:, pd.IndexSlice['feat1', 'f0':'f1']]
Out[107]:
feat1
f0 f1
0 10 11
1 13 14
To modify values in the data frame, use .loc, for example df.loc[1:, ('feat1', 'f1')] = -1. (More on hierarchical indexing, slicing etc.)
It's also possible to append another frame to df.
# another set of features
df2 = pd.DataFrame(np.arange(100, 108).reshape(2,4), columns=['f0', 'f1', 'f2', 'f3'])
# create a MultiIndex:
idx = pd.MultiIndex.from_product([['feat2'], df2.columns])
# append
df[idx] = df2
df
Out[117]:
pre feat1 feat2
A B f0 f1 f2 f0 f1 f2 f3
0 0 1 10 11 12 100 101 102 103
1 2 3 13 -1 15 104 105 106 107
To keep a nice layout, it's important that idx has the same number of levels as df.columns.
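With this layout, the row slice that crashed in the question works, because the feature block is an ordinary column selection rather than an attribute:
df[:1]['feat1']      # first row of the feature columns
df.loc[:0, 'feat1']  # equivalent, label-based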