pandas merge dataframe and pivot creating new columns - python

I've got two input dataframes
df1 (note, this DF could have more columns of data)
   Sample Animal Time     Sex
0       1      A  one    male
1       2      A  two    male
2       3      B  one  female
3       4      C  one    male
4       5      D  one  female
and df2
          a    b    c
Sample
1       0.2  0.4  0.3
2       0.5  0.7  0.2
3       0.4  0.1  0.9
4       0.4  0.2  0.3
5       0.6  0.2  0.4
and I'd like to combine them so that I get the following:
        one_a  one_b  one_c  two_a  two_b  two_c     Sex
Animal
A         0.2    0.4    0.3    0.5    0.7    0.2    male
B         0.4    0.1    0.9    NaN    NaN    NaN  female
C         0.4    0.2    0.3    NaN    NaN    NaN    male
D         0.6    0.2    0.4    NaN    NaN    NaN  female
This is how I'm doing things:
cols = df2.columns  # the value columns: a, b, c
df2.reset_index(inplace=True)
df3 = pd.melt(df2, id_vars=['Sample'], value_vars=list(cols))
df4 = pd.merge(df3, df1, on='Sample')
df4['moo'] = df4['Time'] + '_' + df4['variable']
df5 = pd.pivot_table(df4, values='value', index='Animal', columns='moo')
df6 = df1.groupby('Animal').agg('first')
pd.concat([df5, df6], axis=1).drop(['Sample', 'Time'], axis=1)
This works just fine, but could potentially be slow for large datasets. I'm wondering if any panda-pros see a better (read: faster, more efficient) way? I'm new to pandas and can imagine there are shortcuts here that I don't know about.

A few steps here. The key is that to create columns like one_a, one_b, ..., two_c, we need to add the Time column to the Sample index to build a multi-level index, and then unstack it into the required shape. A groupby on the Animal level is then required to aggregate and collapse the NaNs. The rest is just formatting.
import pandas as pd
# your data
# ==============================
# set index
df1 = df1.set_index('Sample')
print(df1)
       Animal Time     Sex
Sample
1           A  one    male
2           A  two    male
3           B  one  female
4           C  one    male
5           D  one  female
print(df2)
          a    b    c
Sample
1       0.2  0.4  0.3
2       0.5  0.7  0.2
3       0.4  0.1  0.9
4       0.4  0.2  0.3
5       0.6  0.2  0.4
# processing
# =============================
df = df1.join(df2)
df_temp = df.set_index(['Animal', 'Sex','Time'], append=True).unstack()
print(df_temp)
                        a         b         c
Time                  one  two  one  two  one  two
Sample Animal Sex
1      A      male    0.2  NaN  0.4  NaN  0.3  NaN
2      A      male    NaN  0.5  NaN  0.7  NaN  0.2
3      B      female  0.4  NaN  0.1  NaN  0.9  NaN
4      C      male    0.4  NaN  0.2  NaN  0.3  NaN
5      D      female  0.6  NaN  0.2  NaN  0.4  NaN
# rename the columns if you wish
df_temp.columns = ['{}_{}'.format(x, y) for x, y in
                   zip(df_temp.columns.get_level_values(1),
                       df_temp.columns.get_level_values(0))]
print(df_temp)
                      one_a  two_a  one_b  two_b  one_c  two_c
Sample Animal Sex
1      A      male      0.2    NaN    0.4    NaN    0.3    NaN
2      A      male      NaN    0.5    NaN    0.7    NaN    0.2
3      B      female    0.4    NaN    0.1    NaN    0.9    NaN
4      C      male      0.4    NaN    0.2    NaN    0.3    NaN
5      D      female    0.6    NaN    0.2    NaN    0.4    NaN
result = df_temp.reset_index('Sex').groupby(level='Animal').agg('max').sort_index(axis=1)
print(result)
           Sex  one_a  one_b  one_c  two_a  two_b  two_c
Animal
A         male    0.2    0.4    0.3    0.5    0.7    0.2
B       female    0.4    0.1    0.9    NaN    NaN    NaN
C         male    0.4    0.2    0.3    NaN    NaN    NaN
D       female    0.6    0.2    0.4    NaN    NaN    NaN
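For comparison, the same reshape can be written in one pass with pivot_table. A minimal sketch (the flattened column names and the final ordering here are choices, not requirements):
wide = (df1.join(df2)
           .pivot_table(index=['Animal', 'Sex'], columns='Time',
                        values=['a', 'b', 'c']))
# flatten ('a', 'one') -> 'one_a', etc.
wide.columns = ['{}_{}'.format(time, col) for col, time in wide.columns]
result = wide.sort_index(axis=1).reset_index('Sex')
Because pivot_table aggregates duplicate index/column pairs (mean by default), this also collapses the per-Sample rows without a separate groupby step.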

Related

Coalesce values only from columns whose names match the row's date

I have a data frame similar to the one below.
Date        20180601T32  20180604T33  20180605T32  20180610T33
2018-06-04          0.1          0.5          4.5          NaN
2018-06-05          1.5          0.2          1.7          0.0
2018-06-07          1.1          1.6          NaN          NaN
2018-06-10          0.4          1.1          0.0          0.3
The values in columns '20180601', '20180604', '20180605' and '20180610' need to be coalesced into a new column.
I am using bfill as below, but it selects the first non-null value in the row:
coalesce_columns = ['20180601', '20180604', '20180605', '20180610']
df['obs'] = df[coalesce_columns].bfill(axis=1).iloc[:, 0]
But instead of taking the value from the first column, the value should come from the column whose name matches 'Date'. The expected output should be:
Date        20180601T32  20180604T33  20180605T32  20180610T33  obs
2018-06-04          0.1          0.5          4.5          NaN  0.5
2018-06-05          1.5          0.2          1.7          0.0  1.7
2018-06-07          1.1          1.6          NaN          NaN  NaN
2018-06-10          0.4          1.1          0.0          0.3  0.3
Any suggestions?
Use an indexer-based lookup after converting the Date column to the same format as the column names:
import numpy as np

df['Date'] = pd.to_datetime(df['Date'])
idx, cols = pd.factorize(df['Date'].dt.strftime('%Y%m%d'))
df['obs'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
print (df)
        Date  20180601  20180604  20180605  20180610  obs
0 2018-06-04       0.1       0.5       4.5       NaN  0.5
1 2018-06-05       1.5       0.2       1.7       0.0  1.7
2 2018-06-07       1.1       1.6       NaN       NaN  NaN
3 2018-06-10       0.4       1.1       0.0       0.3  0.3
If the column names are possibly integers, cast them to strings first:
df['Date'] = pd.to_datetime(df['Date'])
idx, cols = pd.factorize(df['Date'].dt.strftime('%Y%m%d'))
df['obs'] = df.rename(columns=str).reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
print (df)
        Date  20180601  20180604  20180605  20180610  obs
0 2018-06-04       0.1       0.5       4.5       NaN  0.5
1 2018-06-05       1.5       0.2       1.7       0.0  1.7
2 2018-06-07       1.1       1.6       NaN       NaN  NaN
3 2018-06-10       0.4       1.1       0.0       0.3  0.3
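This row-by-row pick is essentially the pattern the pandas docs recommend in place of DataFrame.lookup, which was removed in pandas 2.0. A self-contained run, assuming the T32/T33 suffixes are stripped from the column names:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Date': ['2018-06-04', '2018-06-05', '2018-06-07', '2018-06-10'],
    '20180601': [0.1, 1.5, 1.1, 0.4],
    '20180604': [0.5, 0.2, 1.6, 1.1],
    '20180605': [4.5, 1.7, np.nan, 0.0],
    '20180610': [np.nan, 0.0, np.nan, 0.3],
})
df['Date'] = pd.to_datetime(df['Date'])

# one target column label per row, factorized into integer codes + unique labels
idx, cols = pd.factorize(df['Date'].dt.strftime('%Y%m%d'))
# reindex puts the columns in `cols` order; labels with no matching column become all-NaN
df['obs'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]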

pandas: get value based on column of indexes

I have a pd Dataframe like this:
df = pd.DataFrame({'val':[0.1,0.2,0.3,None,None],'parent':[None,None,None,0,2]})
   parent  val
0     NaN  0.1
1     NaN  0.2
2     NaN  0.3
3     0.0  NaN
4     2.0  NaN
where parent holds an index into the same DataFrame. I want to create a new column that has either the row's own value or the value of its parent row. That would look like this:
   parent  val  val_full
0     NaN  0.1       0.1
1     NaN  0.2       0.2
2     NaN  0.3       0.3
3     0.0  NaN       0.1
4     2.0  NaN       0.3
This is a fairly large dataframe (10k+ rows), so something efficient would be preferable. How can I do this without using something like .iterrows()?
In your case, do:
df['new'] = df.val
df.loc[df.new.isna(),'new'] = df.loc[df.parent.dropna().values,'val'].values
df
Out[289]:
   val  parent  new
0  0.1     NaN  0.1
1  0.2     NaN  0.2
2  0.3     NaN  0.3
3  NaN     0.0  0.1
4  NaN     2.0  0.3
Or try fillna with replace:
df['new'] = df.val.fillna(df.parent.replace(df.val))
Out[290]:
0    0.1
1    0.2
2    0.3
3    0.1
4    0.3
Name: val, dtype: float64
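A Series.map variant reads even more directly, treating val as a lookup table keyed by index. A sketch (the float parent labels hash-equal the integer index labels, so the lookup resolves):
df['val_full'] = df['val'].fillna(df['parent'].map(df['val']))
map leaves the NaN parents as NaN, so fillna only touches the rows that actually have a parent.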

Appending df rows into another df based on its index value

I have the following df1:
  col1  col2  col3  col4  col5
A    3     4     1     2     1
B    2     1     2     3     1
C    2     3     4     2     1
On the other hand, I have df2:
  type  col1  col2  col3
j    A   0.5   0.7   0.1
k    B   0.2   0.3   0.9
l    A   0.5   0.3   0.2
m    C   0.8   0.7   0.1
n    A   0.3   0.3   0.2
o    B   0.1   0.7   0.3
Given the type column in df2, I would like to generate a pivot-table-like layout:
  col1  col2  col3  col4  col5
A    3     4     1     2     1
j  0.5   0.7   0.1
l  0.5   0.3   0.2
n  0.3   0.3   0.2
B    2     1     2     3     1
k  0.2   0.3   0.9
o  0.1   0.7   0.3
C    2     3     4     2     1
m  0.8   0.7   0.1
Is there a premade function in pandas I could use to append each line of df2 below its corresponding index in df1?
Sorry for not including an attempt, but I have no idea how to approach this problem.
It seems you need a MultiIndex here. You should not use NaN indices as shown in your desired result: such a label lacks meaning. One idea is to use a non-letter indicator such as 0:
# set index as (type, current_index) for df2
df2 = df2.reset_index().set_index(['type', 'index']).sort_index()
# reassign index as (type, 0) for df1
df1.index = pd.MultiIndex.from_tuples([(i, 0) for i in df1.index])
# concatenate df1 and df2
res = pd.concat([df1, df2]).sort_index()
print(res)
      col1  col2  col3  col4  col5
A 0    3.0   4.0   1.0   2.0   1.0
  j    0.5   0.7   0.1   NaN   NaN
  l    0.5   0.3   0.2   NaN   NaN
  n    0.3   0.3   0.2   NaN   NaN
B 0    2.0   1.0   2.0   3.0   1.0
  k    0.2   0.3   0.9   NaN   NaN
  o    0.1   0.7   0.3   NaN   NaN
C 0    2.0   3.0   4.0   2.0   1.0
  m    0.8   0.7   0.1   NaN   NaN
Using pd.merge and sort_index, specifying na_position='first':
pd.merge(df2.reset_index(),
         df1.reset_index().rename(columns={'index': 'type'}),
         'outer')\
  .set_index(['type', 'index'])\
  .sort_index(na_position='first')
            col1  col2  col3  col4  col5
type index
A    NaN     3.0   4.0   1.0   2.0   1.0
     j       0.5   0.7   0.1   NaN   NaN
     l       0.5   0.3   0.2   NaN   NaN
     n       0.3   0.3   0.2   NaN   NaN
B    NaN     2.0   1.0   2.0   3.0   1.0
     k       0.2   0.3   0.9   NaN   NaN
     o       0.1   0.7   0.3   NaN   NaN
C    NaN     2.0   3.0   4.0   2.0   1.0
     m       0.8   0.7   0.1   NaN   NaN
As highlighted by @jpp, the docs for sort_index say that
    na_position : {'first', 'last'}, default 'last'
        first puts NaNs at the beginning, last puts NaNs at the end. Not implemented for MultiIndex.
even though it does, in fact, appear to be implemented. However, if you worry that this behavior could change, an alternative is to sort_values first and only then set the index; the sort_values docs carry no such "not implemented" warning.
pd.merge(df2.reset_index(),
         df1.reset_index().rename(columns={'index': 'type'}),
         'outer')\
  .sort_values(['type', 'index'], na_position='first')\
  .set_index(['type', 'index'])
Similar to @jpp's approach:
import numpy as np

d2 = df2.rename_axis('k').set_index('type', append=True).swaplevel(0, 1)
# the empty-string key sorts ahead of the letter keys, keeping each df1 row on top
d1 = df1.set_index(np.zeros(len(df1), str), append=True).rename_axis(['type', 'k'])
d1.append(d2).sort_index()
        col1  col2  col3  col4  col5
type k
A        3.0   4.0   1.0   2.0   1.0
     j   0.5   0.7   0.1   NaN   NaN
     l   0.5   0.3   0.2   NaN   NaN
     n   0.3   0.3   0.2   NaN   NaN
B        2.0   1.0   2.0   3.0   1.0
     k   0.2   0.3   0.9   NaN   NaN
     o   0.1   0.7   0.3   NaN   NaN
C        2.0   3.0   4.0   2.0   1.0
     m   0.8   0.7   0.1   NaN   NaN
Alternate:
df1.rename_axis('type').assign(k='').set_index('k', append=True).append(
    df2.rename_axis('k').set_index('type', append=True).swaplevel(0, 1)
).sort_index()
        col1  col2  col3  col4  col5
type k
A        3.0   4.0   1.0   2.0   1.0
     j   0.5   0.7   0.1   NaN   NaN
     l   0.5   0.3   0.2   NaN   NaN
     n   0.3   0.3   0.2   NaN   NaN
B        2.0   1.0   2.0   3.0   1.0
     k   0.2   0.3   0.9   NaN   NaN
     o   0.1   0.7   0.3   NaN   NaN
C        2.0   3.0   4.0   2.0   1.0
     m   0.8   0.7   0.1   NaN   NaN
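Note that DataFrame.append was removed in pandas 2.0, so on current pandas the combine step in the last two snippets needs pd.concat instead. A minimal sketch, reusing the d1/d2 frames built above:
# pd.concat is the modern replacement for the removed DataFrame.append
res = pd.concat([d1, d2]).sort_index()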

Pandas creating new dataframe from several group by operations

I have a pandas dataframe
test = pd.DataFrame({'d': [1, 1, 1, 2, 2, 3, 3],
                     'id': [1, 2, 3, 1, 2, 2, 3],
                     'v1': [10, 20, 15, 35, 5, 10, 30],
                     'v2': [3, 4, 1, 6, 0, 2, 0],
                     'w1': [0.1, 0.3, 0.2, 0.1, 0.4, 0.3, 0.2],
                     'w2': [0.8, 0.1, 0.2, 0.3, 0.1, 0.1, 0.0]})
   d  id  v1  v2    w1    w2
0  1   1  10   3  0.10  0.80
1  1   2  20   4  0.30  0.10
2  1   3  15   1  0.20  0.20
3  2   1  35   6  0.10  0.30
4  2   2   5   0  0.40  0.10
5  3   2  10   2  0.30  0.10
6  3   3  30   0  0.20  0.00
and I would like to get some weighted values by group like
test['w1v1'] = test['w1'] * test['v1']
test['w1v2'] = test['w1'] * test['v2']
test['w2v1'] = test['w2'] * test['v1']
test['w2v2'] = test['w2'] * test['v2']
How can I get the result nicely into a DataFrame? Something that looks like
test.groupby('id').sum()['w1v1'] / test.groupby('id').sum()['w1']
id
1    22.50
2    11.00
3    22.50
but includes a column for each weighted value, like so:
id   w1v1  w1v2  w2v1  w2v2
1   22.50   ...   ...   ...
2   11.00   ...   ...   ...
3   22.50   ...   ...   ...
Any ideas how I can achieve this quickly and easily?
Use:
cols = ['w1v1', 'w2v1', 'w1v2', 'w2v2']  # order must match the (w, v) pairs formed below
test1 = test[['w1', 'w2', 'w1', 'w2']] * test[['v1', 'v1', 'v2', 'v2']].values
test1.columns = cols
print (test1)
   w1v1  w2v1  w1v2  w2v2
0   1.0   8.0   0.3   2.4
1   6.0   2.0   1.2   0.4
2   3.0   3.0   0.2   0.2
3   3.5  10.5   0.6   1.8
4   2.0   0.5   0.0   0.0
5   3.0   1.0   0.6   0.2
6   6.0   0.0   0.0   0.0
df = test.join(test1).groupby('id').sum()
df1 = df[cols] / df[['w1', 'w2', 'w1', 'w2']].values
print (df1)
    w1v1       w2v1  w1v2      w2v2
id
1   22.5  16.818182   4.5  3.818182
2   11.0  11.666667   1.8  2.000000
3   22.5  15.000000   0.5  1.000000
Another, more dynamic solution with MultiIndex DataFrames:
a = ['v1', 'v2']
b = ['w1', 'w2']
mux = pd.MultiIndex.from_product([a,b])
df1 = test.set_index('id').drop('d', axis=1)
v = df1.reindex(columns=mux, level=0)
w = df1.reindex(columns=mux, level=1)
print (v)
    v1      v2
    w1  w2  w1  w2
id
1   10  10   3   3
2   20  20   4   4
3   15  15   1   1
1   35  35   6   6
2    5   5   0   0
2   10  10   2   2
3   30  30   0   0
print (w)
     v1        v2
     w1   w2   w1   w2
id
1   0.1  0.8  0.1  0.8
2   0.3  0.1  0.3  0.1
3   0.2  0.2  0.2  0.2
1   0.1  0.3  0.1  0.3
2   0.4  0.1  0.4  0.1
2   0.3  0.1  0.3  0.1
3   0.2  0.0  0.2  0.0
df = w * v
print (df)
     v1         v2
     w1    w2   w1   w2
id
1   1.0   8.0  0.3  2.4
2   6.0   2.0  1.2  0.4
3   3.0   3.0  0.2  0.2
1   3.5  10.5  0.6  1.8
2   2.0   0.5  0.0  0.0
2   3.0   1.0  0.6  0.2
3   6.0   0.0  0.0  0.0
df1 = df.groupby('id').sum() / w.groupby('id').sum()
# flatten the MultiIndex columns into 'w1v1'-style names
df1.columns = ['{0[1]}{0[0]}'.format(x) for x in df1.columns]
print (df1)
    w1v1       w2v1  w1v2      w2v2
id
1   22.5  16.818182   4.5  3.818182
2   11.0  11.666667   1.8  2.000000
3   22.5  15.000000   0.5  1.000000
If you can accept multi-index columns, you can use groupby + dot:
test.groupby('id').apply(
    lambda g: g.filter(like='v').T.dot(g.filter(like='w') / g.filter(like='w').sum()).stack()
)
#      v1              v2
#      w1         w2   w1        w2
# id
# 1  22.5  16.818182  4.5  3.818182
# 2  11.0  11.666667  1.8  2.000000
# 3  22.5  15.000000  0.5  1.000000
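If you would rather avoid MultiIndex columns entirely, a compact sketch that builds the flat frame straight from the raw test, before the w1v1-style columns are added (the column names and loop order here are choices, not requirements):
g = test[['id', 'v1', 'v2', 'w1', 'w2']].groupby('id')
out = pd.DataFrame({
    '{}{}'.format(w, v): g.apply(lambda x, v=v, w=w: (x[v] * x[w]).sum() / x[w].sum())
    for v in ['v1', 'v2'] for w in ['w1', 'w2']
})
The v=v, w=w defaults pin each pair down at lambda definition time; without them, every column would reuse the last pair of the comprehension.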

Pandas dataframe: assigning values according to ranks at row level

Consider the following pandas dataframe (df):
index    A    B    C    D    E    F    G  weights
1      NaN    1  NaN  NaN  NaN    3    2  [0.6, 0.2, 0.2]
2        3    2  NaN    1  NaN  NaN  NaN  [0.5, 0.4, 0.1]
3      NaN  NaN    1    2    3  NaN  NaN  [0.8, 0.1, 0.1]
4      NaN    3    1  NaN  NaN    2  NaN  [0.9, 0.1, 0.0]
Desired output (values matched to their corresponding weights at row level):
1  NaN  0.6  NaN  NaN  NaN  0.2  0.2
2  0.1  0.4  NaN  0.5  NaN  NaN  NaN
3  NaN  NaN  0.8  0.1  0.1  NaN  NaN
4  NaN  0.0  0.9  NaN  NaN  0.1  NaN
My current solution:
def assign_weights(row):
    for i in range(1, 4):
        row.replace(i, row.weights[i - 1], inplace=True)
    return row

df.apply(assign_weights, axis=1)
Is there a faster way (for a big dataframe with more weights to be assigned)?
Not sure if this will be faster, though:
>>> def worker(row):
...     n = np.array(row['weights'])
...     i = (row.notnull()) & (row.index != 'weights')
...     row[i] = n[row[i].astype('int').values - 1]
...     return row
>>>
>>> df.apply(worker, axis=1)
         A    B    C    D    E    F    G          weights
index
1      NaN  0.6  NaN  NaN  NaN  0.2  0.2  [0.6, 0.2, 0.2]
2      0.1  0.4  NaN  0.5  NaN  NaN  NaN  [0.5, 0.4, 0.1]
3      NaN  NaN  0.8  0.1  0.1  NaN  NaN  [0.8, 0.1, 0.1]
4      NaN  0.0  0.9  NaN  NaN  0.1  NaN  [0.9, 0.1, 0.0]
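For a big dataframe it can be cheaper to leave apply behind and index a weight matrix once with NumPy. A vectorized sketch, assuming every weights list has the same length:
import numpy as np
import pandas as pd

vals = df.drop(columns='weights')
w = np.vstack(df['weights'].to_numpy())      # (n_rows, n_weights) weight matrix
ranks = vals.to_numpy(dtype=float)           # rank values, NaN where empty
out = np.full(ranks.shape, np.nan)
rows, cols = np.nonzero(~np.isnan(ranks))    # positions that actually hold a rank
out[rows, cols] = w[rows, ranks[rows, cols].astype(int) - 1]
result = pd.DataFrame(out, index=df.index, columns=vals.columns)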
