Python: Group elements in DataFrame [duplicate]

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 5 months ago.
I have a DataFrame as follows:
df:
date name issuer rate
2022-01-01 SPY A 0.3
2022-01-01 SPY B 0.2
2022-01-01 MSFT A 0.2
2022-01-01 MSFT B 0.1
2022-01-02 SPY A 0.2
2022-01-02 SPY B 0.1
2022-01-02 SPY C 0.2
2022-01-02 SPY D 0.2
2022-01-02 MSFT A 0.2
2022-01-02 MSFT B 0.4
2022-01-02 MSFT C 0.5
2022-01-02 MSFT D 0.4
I want to group the DataFrame (which contains duplicate entries) and add a median column so that it looks like this:
df1:
A B C D median
date name
2022-01-01 SPY 0.3 0.2 0.25
MSFT 0.2 0.1 0.15
2022-01-02 SPY 0.2 0.1 0.2 0.2 0.2
MSFT 0.2 0.4 0.5 0.4 0.4
I have tried using the groupby function but it gives me an error due to duplicate entries.
What's the best way to do this?
EDIT: I have tried the pivot function and got an error about duplicates. The groupby function works, but I don't know how to turn the issuer values into column names.

You can pivot the dataframe first, then calculate the median along the row axis:
>>> out = df.pivot_table('rate', ['date', 'name'], ['issuer'])
>>> out['median'] = out.median(axis=1)
OUTPUT:
issuer A B C D median
date name
2022-01-01 MSFT 0.2 0.1 NaN NaN 0.15
SPY 0.3 0.2 NaN NaN 0.25
2022-01-02 MSFT 0.2 0.4 0.5 0.4 0.40
SPY 0.2 0.1 0.2 0.2 0.20
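Put together, the answer's approach runs end-to-end like this (the sample frame is re-typed from the question):

```python
import pandas as pd

# sample frame re-typed from the question
df = pd.DataFrame({
    'date': ['2022-01-01'] * 4 + ['2022-01-02'] * 8,
    'name': ['SPY', 'SPY', 'MSFT', 'MSFT',
             'SPY', 'SPY', 'SPY', 'SPY',
             'MSFT', 'MSFT', 'MSFT', 'MSFT'],
    'issuer': ['A', 'B', 'A', 'B',
               'A', 'B', 'C', 'D',
               'A', 'B', 'C', 'D'],
    'rate': [0.3, 0.2, 0.2, 0.1,
             0.2, 0.1, 0.2, 0.2,
             0.2, 0.4, 0.5, 0.4],
})

# pivot_table tolerates duplicate (date, name, issuer) rows by averaging them,
# which is why it succeeds where DataFrame.pivot raises on duplicates
out = df.pivot_table('rate', ['date', 'name'], 'issuer')

# row-wise median; NaN cells (missing issuers) are skipped by default
out['median'] = out.median(axis=1)
print(out)
```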

Related

Coalesce values only from columns where column matches with data dates

I have a data frame similar to one below.
Date 20180601T32 20180604T33 20180605T32 20180610T33
2018-06-04 0.1 0.5 4.5 nan
2018-06-05 1.5 0.2 1.7 0
2018-06-07 1.1 1.6 nan nan
2018-06-10 0.4 1.1 0 0.3
The values in columns '20180601', '20180604', '20180605' and '20180610' need to be coalesced into a new column.
I am using bfill as below, but it just selects the first non-null value in each row.
coalesce_columns = ['20180601', '20180604', '20180605', '20180610']
df['obs'] = df[coalesce_columns].bfill(axis=1).iloc[:, 0]
But instead of taking value from first column, value should match 'Date' and respective column names. The expected output should be:
Date 20180601T32 20180604T33 20180605T32 20180610T33 Obs
2018-06-04 0.1 0.5 4.5 nan 0.5
2018-06-05 1.5 0.2 1.7 0 1.7
2018-06-07 1.1 1.6 nan nan nan
2018-06-10 0.4 1.1 0 0.3 0.3
Any suggestions?
Use a lookup: convert the Date column to the same string format as the column names, then pick each row's matching cell:
import numpy as np
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
idx, cols = pd.factorize(df['Date'].dt.strftime('%Y%m%d'))
df['obs'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
print(df)
Date 20180601 20180604 20180605 20180610 obs
0 2018-06-04 0.1 0.5 4.5 NaN 0.5
1 2018-06-05 1.5 0.2 1.7 0.0 1.7
2 2018-06-07 1.1 1.6 NaN NaN NaN
3 2018-06-10 0.4 1.1 0.0 0.3 0.3
If the column names are integers rather than strings, rename them to strings first:
df['Date'] = pd.to_datetime(df['Date'])
idx, cols = pd.factorize(df['Date'].dt.strftime('%Y%m%d'))
df['obs'] = df.rename(columns=str).reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
print(df)
Date 20180601 20180604 20180605 20180610 obs
0 2018-06-04 0.1 0.5 4.5 NaN 0.5
1 2018-06-05 1.5 0.2 1.7 0.0 1.7
2 2018-06-07 1.1 1.6 NaN NaN NaN
3 2018-06-10 0.4 1.1 0.0 0.3 0.3
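For reference, here is a self-contained sketch of the lookup, modeled on the answer's printed frame (plain YYYYMMDD column names; the question's T-suffixed names would need stripping first):

```python
import numpy as np
import pandas as pd

# frame modeled on the answer's printed output
df = pd.DataFrame({
    'Date': ['2018-06-04', '2018-06-05', '2018-06-07', '2018-06-10'],
    '20180601': [0.1, 1.5, 1.1, 0.4],
    '20180604': [0.5, 0.2, 1.6, 1.1],
    '20180605': [4.5, 1.7, np.nan, 0.0],
    '20180610': [np.nan, 0.0, np.nan, 0.3],
})

df['Date'] = pd.to_datetime(df['Date'])

# factorize assigns each formatted date a code; cols holds the unique labels
idx, cols = pd.factorize(df['Date'].dt.strftime('%Y%m%d'))

# reindex aligns the columns to those labels (a date with no matching column
# becomes an all-NaN column), then fancy indexing picks one cell per row
df['obs'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
print(df)
```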

I would like to copy a Dataframe and interpolate the values of this new dataframe to achieve Data Augmentation

My original dataframe looks like this:
No index  Value1  Value2  Value3
0         1.0     0.0     0.0
1         1.0     0.2     0.2
2         1.0     0.4     0.4
3         0.8     0.6     0.6
4         0.5     0.4     0.8
5         0.1     0.2     1.0
And what I want to achieve is the following:
No index  Value1  Value2  Value3
0         1.0     0.1     0.1
1         1.0     0.3     0.3
2         0.9     0.5     0.5
3         0.65    0.5     0.7
4         0.3     0.3     0.9
5         0.1     0.2     1.0
I would basically like to shift the new dataframe by one index and take the average of each pair of adjacent original values, while keeping the last row the same.
Is there someone who can help me with this? Thank you in advance.
Use a rolling mean with window 2, shift the result up one row, then restore the values of the last row:
out = df.rolling(2).mean().shift(-1)
out.loc[len(df)-1] = df.tail(1).values[0]
Output:
>>> out
Value1 Value2 Value3
0 1.00 0.1 0.1
1 1.00 0.3 0.3
2 0.90 0.5 0.5
3 0.65 0.5 0.7
4 0.30 0.3 0.9
5 0.10 0.2 1.0
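A self-contained version of the rolling-mean trick (data re-typed from the question):

```python
import pandas as pd

# data re-typed from the question
df = pd.DataFrame({
    'Value1': [1.0, 1.0, 1.0, 0.8, 0.5, 0.1],
    'Value2': [0.0, 0.2, 0.4, 0.6, 0.4, 0.2],
    'Value3': [0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
})

# rolling(2).mean() puts the mean of rows i-1 and i at row i;
# shift(-1) moves it up so row i holds the mean of rows i and i+1
out = df.rolling(2).mean().shift(-1)

# the shift leaves the last row all-NaN; restore its original values
out.iloc[-1] = df.iloc[-1]
print(out)
```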

pandas: get value based on column of indexes

I have a pd Dataframe like this:
df = pd.DataFrame({'val':[0.1,0.2,0.3,None,None],'parent':[None,None,None,0,2]})
parent val
0 NaN 0.1
1 NaN 0.2
2 NaN 0.3
3 0.0 NaN
4 2.0 NaN
where parent represents an index within the pandas df. I want to create a new column that has either the value, or the value of the parent.
that would look like this:
parent val val_full
0 NaN 0.1 0.1
1 NaN 0.2 0.2
2 NaN 0.3 0.3
3 0.0 NaN 0.1
4 2.0 NaN 0.3
This is a fairly large dataframe (10k+ rows), so something efficient would be preferable. How can I do this without using something like .iterrows()?
In your case do
df['new'] = df.val
df.loc[df.new.isna(),'new'] = df.loc[df.parent.dropna().values,'val'].values
df
Out[289]:
val parent new
0 0.1 NaN 0.1
1 0.2 NaN 0.2
2 0.3 NaN 0.3
3 NaN 0.0 0.1
4 NaN 2.0 0.3
Or combine fillna with replace:
df['new'] = df.val.fillna(df.parent.replace(df.val))
Out[290]:
0 0.1
1 0.2
2 0.3
3 0.1
4 0.3
Name: val, dtype: float64
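A map-based variant (my own sketch, not one of the answers above): Series.map treats the val column as an index-to-value table, so each parent position can be looked up directly and the gaps filled. Note that parent holds floats (0.0, 2.0), which pandas matches against the integer index:

```python
import pandas as pd

df = pd.DataFrame({'val': [0.1, 0.2, 0.3, None, None],
                   'parent': [None, None, None, 0, 2]})

# map looks each parent position up in val's index; NaN parents map to NaN,
# and fillna keeps the original val wherever it already exists
df['val_full'] = df['val'].fillna(df['parent'].map(df['val']))
print(df)
```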

Appending df lines into another df based on its index value

I have the following df1:
col1 col2 col3 col4 col5
A 3 4 1 2 1
B 2 1 2 3 1
C 2 3 4 2 1
On the other hand I have the df2:
type col1 col2 col3
j A 0.5 0.7 0.1
k B 0.2 0.3 0.9
l A 0.5 0.3 0.2
m C 0.8 0.7 0.1
n A 0.3 0.3 0.2
o B 0.1 0.7 0.3
Given the column type in df2 I would like to generate like a pivot table like this:
col1 col2 col3 col4 col5
A 3 4 1 2 1
j 0.5 0.7 0.1
l 0.5 0.3 0.2
n 0.3 0.3 0.2
B 2 1 2 3 1
k 0.2 0.3 0.9
o 0.1 0.7 0.3
C 2 3 4 2 1
m 0.8 0.7 0.1
Is there a premade function in pandas I could use to append each line of df2 below its corresponding index in df1?
Sorry, I don't include my attempt, but I have no idea how to approach this problem.
It seems you need MultiIndex here. You should not use NaN indices as shown in your desired result: the label lacks meaning. One idea is to use a non-letter indicator such as 0:
# set index as (type, current_index) for df2
df2 = df2.reset_index().set_index(['type', 'index']).sort_index()
# reassign index as (type, 0) for df1
df1.index = pd.MultiIndex.from_tuples([(i, 0) for i in df1.index])
# concatenate df1 and df2
res = pd.concat([df1, df2]).sort_index()
print(res)
col1 col2 col3 col4 col5
A 0 3.0 4.0 1.0 2.0 1.0
j 0.5 0.7 0.1 NaN NaN
l 0.5 0.3 0.2 NaN NaN
n 0.3 0.3 0.2 NaN NaN
B 0 2.0 1.0 2.0 3.0 1.0
k 0.2 0.3 0.9 NaN NaN
o 0.1 0.7 0.3 NaN NaN
C 0 2.0 3.0 4.0 2.0 1.0
m 0.8 0.7 0.1 NaN NaN
Using pd.merge and sort_index, specifying na_position='first':
pd.merge(df2.reset_index(),
         df1.reset_index().rename(columns={'index': 'type'}),
         'outer')\
  .set_index(['type', 'index'])\
  .sort_index(na_position='first')
col1 col2 col3 col4 col5
type index
A NaN 3.0 4.0 1.0 2.0 1.0
j 0.5 0.7 0.1 NaN NaN
l 0.5 0.3 0.2 NaN NaN
n 0.3 0.3 0.2 NaN NaN
B NaN 2.0 1.0 2.0 3.0 1.0
k 0.2 0.3 0.9 NaN NaN
o 0.1 0.7 0.3 NaN NaN
C NaN 2.0 3.0 4.0 2.0 1.0
m 0.8 0.7 0.1 NaN NaN
As highlighted by #jpp, in the docs for sort_index it says that
na_position : {‘first’, ‘last’}, default ‘last’
first puts NaNs at the beginning, last puts NaNs at the end. Not implemented for MultiIndex.
even though it does, in fact, appear to be implemented.
However, if you worry that this behavior could be inconsistent, an alternative is to sort_values first and only then set the index; the sort_values docs carry no such "not implemented" warning.
pd.merge(df2.reset_index(),
         df1.reset_index().rename(columns={'index': 'type'}),
         'outer')\
  .sort_values(['type', 'index'], na_position='first')\
  .set_index(['type', 'index'])
Similar to #jpp
import numpy as np

d2 = df2.rename_axis('k').set_index('type', append=True).swaplevel(0, 1)
d1 = df1.set_index(np.zeros(len(df1), str), append=True).rename_axis(['type', 'k'])
d1.append(d2).sort_index()
col1 col2 col3 col4 col5
type k
A 3.0 4.0 1.0 2.0 1.0
j 0.5 0.7 0.1 NaN NaN
l 0.5 0.3 0.2 NaN NaN
n 0.3 0.3 0.2 NaN NaN
B 2.0 1.0 2.0 3.0 1.0
k 0.2 0.3 0.9 NaN NaN
o 0.1 0.7 0.3 NaN NaN
C 2.0 3.0 4.0 2.0 1.0
m 0.8 0.7 0.1 NaN NaN
Alternate
df1.rename_axis('type').assign(k='').set_index('k', append=True).append(
df2.rename_axis('k').set_index('type', append=True).swaplevel(0, 1)
).sort_index()
col1 col2 col3 col4 col5
type k
A 3.0 4.0 1.0 2.0 1.0
j 0.5 0.7 0.1 NaN NaN
l 0.5 0.3 0.2 NaN NaN
n 0.3 0.3 0.2 NaN NaN
B 2.0 1.0 2.0 3.0 1.0
k 0.2 0.3 0.9 NaN NaN
o 0.1 0.7 0.3 NaN NaN
C 2.0 3.0 4.0 2.0 1.0
m 0.8 0.7 0.1 NaN NaN
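Putting the concat idea together end-to-end (data re-typed from the question; an empty-string label marks each parent row, as in the last answer):

```python
import pandas as pd

# data re-typed from the question
df1 = pd.DataFrame({'col1': [3, 2, 2], 'col2': [4, 1, 3], 'col3': [1, 2, 4],
                    'col4': [2, 3, 2], 'col5': [1, 1, 1]},
                   index=['A', 'B', 'C'])
df2 = pd.DataFrame({'type': ['A', 'B', 'A', 'C', 'A', 'B'],
                    'col1': [0.5, 0.2, 0.5, 0.8, 0.3, 0.1],
                    'col2': [0.7, 0.3, 0.3, 0.7, 0.3, 0.7],
                    'col3': [0.1, 0.9, 0.2, 0.1, 0.2, 0.3]},
                   index=list('jklmno'))

# give df1 rows the key (type, '') and df2 rows the key (type, label)
d1 = df1.set_index(pd.MultiIndex.from_product([df1.index, ['']]))
d2 = df2.set_index('type', append=True).swaplevel(0, 1)

# '' sorts before every letter, so each parent row lands above its children
res = pd.concat([d1, d2]).sort_index()
print(res)
```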

pandas merge dataframe and pivot creating new columns

I've got two input dataframes
df1 (note, this DF could have more columns of data)
Sample Animal Time Sex
0 1 A one male
1 2 A two male
2 3 B one female
3 4 C one male
4 5 D one female
and df2
a b c
Sample
1 0.2 0.4 0.3
2 0.5 0.7 0.2
3 0.4 0.1 0.9
4 0.4 0.2 0.3
5 0.6 0.2 0.4
and I'd like to combine them so that I get the following:
one_a one_b one_c two_a two_b two_c Sex
Animal
A 0.2 0.4 0.3 0.5 0.7 0.2 male
B 0.4 0.1 0.9 NaN NaN NaN female
C 0.4 0.2 0.3 NaN NaN NaN male
D 0.6 0.2 0.4 NaN NaN NaN female
This is how I'm doing things:
df2.reset_index(inplace = True)
df3 = pd.melt(df2, id_vars=['Sample'], value_vars=['a', 'b', 'c'])
df4 = pd.merge(df3, df1, on='Sample')
df4['moo'] = df4['Time'] + '_' + df4['variable']
df5 = pd.pivot_table(df4, values='value', index='Animal', columns='moo')
df6 = df1.groupby('Animal').agg('first')
pd.concat([df5, df6], axis=1).drop(columns=['Sample', 'Time'])
This works just fine, but could be slow for large datasets. I'm wondering if any pandas pros see a better (read: faster, more efficient) way? I'm new to pandas and imagine there are shortcuts here that I don't know about.
A few steps here. The key is that, to create columns like one_a, one_b, ..., two_c, we need to add the Time column to the Sample index to build a multi-level index, then unstack to get the required shape. After that, a groupby on the Animal index aggregates the rows and reduces the NaNs. The rest is just formatting.
import pandas as pd
# your data
# ==============================
# set index
df1 = df1.set_index('Sample')
print(df1)
Animal Time Sex
Sample
1 A one male
2 A two male
3 B one female
4 C one male
5 D one female
print(df2)
a b c
Sample
1 0.2 0.4 0.3
2 0.5 0.7 0.2
3 0.4 0.1 0.9
4 0.4 0.2 0.3
5 0.6 0.2 0.4
# processing
# =============================
df = df1.join(df2)
df_temp = df.set_index(['Animal', 'Sex','Time'], append=True).unstack()
print(df_temp)
a b c
Time one two one two one two
Sample Animal Sex
1 A male 0.2 NaN 0.4 NaN 0.3 NaN
2 A male NaN 0.5 NaN 0.7 NaN 0.2
3 B female 0.4 NaN 0.1 NaN 0.9 NaN
4 C male 0.4 NaN 0.2 NaN 0.3 NaN
5 D female 0.6 NaN 0.2 NaN 0.4 NaN
# rename the columns if you wish
df_temp.columns = ['{}_{}'.format(x, y) for x, y in zip(df_temp.columns.get_level_values(1), df_temp.columns.get_level_values(0))]
print(df_temp)
one_a two_a one_b two_b one_c two_c
Sample Animal Sex
1 A male 0.2 NaN 0.4 NaN 0.3 NaN
2 A male NaN 0.5 NaN 0.7 NaN 0.2
3 B female 0.4 NaN 0.1 NaN 0.9 NaN
4 C male 0.4 NaN 0.2 NaN 0.3 NaN
5 D female 0.6 NaN 0.2 NaN 0.4 NaN
result = df_temp.reset_index('Sex').groupby(level='Animal').agg(max).sort_index(axis=1)
print(result)
Sex one_a one_b one_c two_a two_b two_c
Animal
A male 0.2 0.4 0.3 0.5 0.7 0.2
B female 0.4 0.1 0.9 NaN NaN NaN
C male 0.4 0.2 0.3 NaN NaN NaN
D female 0.6 0.2 0.4 NaN NaN NaN
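As an alternative sketch (my own variant, with the data re-typed from the question), pivot_table can build the wide frame in one shot instead of unstack plus groupby:

```python
import pandas as pd

# data re-typed from the question
df1 = pd.DataFrame({'Sample': [1, 2, 3, 4, 5],
                    'Animal': ['A', 'A', 'B', 'C', 'D'],
                    'Time': ['one', 'two', 'one', 'one', 'one'],
                    'Sex': ['male', 'male', 'female', 'male', 'female']})
df2 = pd.DataFrame({'a': [0.2, 0.5, 0.4, 0.4, 0.6],
                    'b': [0.4, 0.7, 0.1, 0.2, 0.2],
                    'c': [0.3, 0.2, 0.9, 0.3, 0.4]},
                   index=pd.Index([1, 2, 3, 4, 5], name='Sample'))

# join measurements onto metadata, then pivot Time out into the columns
wide = (df1.set_index('Sample')
           .join(df2)
           .pivot_table(index=['Animal', 'Sex'], columns='Time',
                        values=['a', 'b', 'c']))

# flatten the (value, Time) column MultiIndex into one_a, two_a, ...
wide.columns = [f'{t}_{v}' for v, t in wide.columns]
result = wide.sort_index(axis=1).reset_index('Sex')
print(result)
```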
