Python DF - apply same procedure to multiple columns using multiple parameters

I have a dataframe with id, v1 and v2. For v1, I first take the first 4 rows of each id (a lookback of lb=4). Within those 4 rows I identify the top 3 values of v1 and build a table of those top 3. I then do the same thing using the first 5 rows (lb=5).
For v2, I apply the same procedure as for v1.
Finally, I combine all the top-3 results together.
The code below yields exactly what I want. However, my real data has many value columns (v1, v2, ...) and requires multiple lookbacks as well, so could you guide me towards more efficient code? Many thanks.
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,1,4], [1,6,7], [1,39,9],[1,30,8],[1,40,6],[1,140,0], [2,2,1], [2,1,99], [2,20,88], [2,15,25], [2,99,25], [2,9,0]], columns=['id', 'v1','v2'])
print(df)
# PART 1: WORK ON Value 1 ************************************************************************************************************************
# lookback the first 4 rows
df_temp = df.groupby('id').apply(lambda t: t.iloc[0:4])
df_temp.reset_index(drop=True, inplace=True)
# Filtering the largest 3 values
xx = (df_temp.groupby('id')['v1'].apply(lambda x: x.nlargest(3)).reset_index(level=1, drop=True).to_frame('v1'))
# Transposing using unstack
v1_top3_lb4 = xx.set_index(np.arange(len(xx)) % 3, append=True)['v1'].unstack().add_prefix('v1_lb4_top')
v1_top3_lb4 = v1_top3_lb4.reset_index()
# lookback the first 5 rows
df_temp = df.groupby('id').apply(lambda t: t.iloc[0:5])
df_temp.reset_index(drop=True, inplace=True)
# Filtering the largest 3 values
xx = (df_temp.groupby('id')['v1'].apply(lambda x: x.nlargest(3)).reset_index(level=1, drop=True).to_frame('v1'))
# Transposing using unstack
v1_top3_lb5 = xx.set_index(np.arange(len(xx)) % 3, append=True)['v1'].unstack().add_prefix('v1_lb5_top')
v1_top3_lb5 = v1_top3_lb5.reset_index()
# PART 2: WORK ON Value 2 ************************************************************************************************************************
# lookback the first 4 rows
df_temp = df.groupby('id').apply(lambda t: t.iloc[0:4])
df_temp.reset_index(drop=True, inplace=True)
# Filtering the largest 3 values
xx = (df_temp.groupby('id')['v2'].apply(lambda x: x.nlargest(3)).reset_index(level=1, drop=True).to_frame('v2'))
# Transposing using unstack
v2_top3_lb4 = xx.set_index(np.arange(len(xx)) % 3, append=True)['v2'].unstack().add_prefix('v2_lb4_top')
v2_top3_lb4 = v2_top3_lb4.reset_index()
# lookback the first 5 rows
df_temp = df.groupby('id').apply(lambda t: t.iloc[0:5])
df_temp.reset_index(drop=True, inplace=True)
# Filtering the largest 3 values
xx = (df_temp.groupby('id')['v2'].apply(lambda x: x.nlargest(3)).reset_index(level=1, drop=True).to_frame('v2'))
# Transposing using unstack
v2_top3_lb5 = xx.set_index(np.arange(len(xx)) % 3, append=True)['v2'].unstack().add_prefix('v2_lb5_top')
v2_top3_lb5 = v2_top3_lb5.reset_index()
# PART 3: Merge all ************************************************************************************************************************
combine = pd.merge(v1_top3_lb4, v1_top3_lb5, on='id')
combine = pd.merge(combine, v2_top3_lb4, on='id')
combine = pd.merge(combine, v2_top3_lb5, on='id')
combine
id v1 v2
0 1 1 4
1 1 6 7
2 1 39 9
3 1 30 8
4 1 40 6
5 1 140 0
6 2 2 1
7 2 1 99
8 2 20 88
9 2 15 25
10 2 99 25
11 2 9 0
id v1_lb4_top0 v1_lb4_top1 v1_lb4_top2 v1_lb5_top0 v1_lb5_top1 v1_lb5_top2 v2_lb4_top0 v2_lb4_top1 v2_lb4_top2 v2_lb5_top0 v2_lb5_top1 v2_lb5_top2
0 1 39 30 6 40 39 30 9 8 7 9 8 7
1 2 20 15 2 99 20 15 99 88 25 99 88 25
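One way to remove the repetition is to wrap the per-column, per-lookback steps in a small helper and loop over every (column, lookback) pair. Below is a minimal sketch along those lines, using the same df as above; the helper name top_n and the use of groupby().head() / cumcount() are my own choices rather than anything from the original code:
from functools import reduce

def top_n(df, col, lookback, n=3):
    # first `lookback` rows of each id
    head = df.groupby('id').head(lookback)
    # largest `n` values of `col` within those rows
    top = (head.groupby('id')[col]
               .nlargest(n)
               .reset_index(level=1, drop=True)
               .to_frame(col))
    # rank 0..n-1 within each id, then pivot to wide form
    top['rank'] = top.groupby(level=0).cumcount().to_numpy()
    wide = (top.set_index('rank', append=True)[col]
               .unstack()
               .add_prefix(f'{col}_lb{lookback}_top'))
    return wide.reset_index()

pieces = [top_n(df, col, lb) for col in ['v1', 'v2'] for lb in [4, 5]]
combine = reduce(lambda left, right: pd.merge(left, right, on='id'), pieces)
Adding more value columns or lookbacks then only means extending the two lists in the comprehension.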

Related

Split a dataframe based on certain column values

Let's say I have a DF like this:
Mean 1  Mean 2  Stat 1  Stat 2  ID
5       10      15      20      Z
3       6       9       12      X
Now, I want to split the dataframe to separate the data based on whether it is a #1 or #2 for each ID.
Basically, I would double the number of rows for each ID, with each row dedicated to either #1 or #2, and a new column added to specify which number we are looking at. Instead of Mean 1 and Mean 2 being on the same row, they will be listed in two separate rows, with the # column making it clear which one we are looking at. What's the best way to do this? I was trying pd.melt(), but it seems like a slightly different use case.
Mean  Stat  ID  #
5     15    Z   1
10    20    Z   2
3     9     X   1
6     12    X   2
Use pd.wide_to_long:
new_df = pd.wide_to_long(
    df, stubnames=['Mean', 'Stat'], i='ID', j='#', sep=' '
).reset_index()
new_df:
ID # Mean Stat
0 Z 1 5 15
1 X 1 3 9
2 Z 2 10 20
3 X 2 6 12
Or, if the row order must match the OP's desired output, set_index, then str.split the columns, then stack:
new_df = df.set_index('ID')
new_df.columns = new_df.columns.str.split(expand=True)
new_df = new_df.stack().rename_axis(['ID', '#']).reset_index()
new_df:
ID # Mean Stat
0 Z 1 5 15
1 Z 2 10 20
2 X 1 3 9
3 X 2 6 12
Here is a solution with melt and pivot:
df = df.melt(id_vars=['ID'], value_name='Mean')
df[['variable', '#']] = df['variable'].str.split(expand=True)
df = (df.assign(idx=df.groupby('variable').cumcount())
        .pivot(index=['idx', 'ID', '#'], columns='variable')
        .reset_index()
        .drop(('idx', ''), axis=1))
df.columns = [col[0] if col[1] == '' else col[1] for col in df.columns]
df
Out[1]:
ID # Mean Stat
0 Z 1 5 15
1 X 1 3 9
2 Z 2 10 20
3 X 2 6 12

Changing a cell value based on other rows and columns

I have a dataframe called Result that comes from a SQL query:
Loc ID Bank
1 23 NULL
1 24 NULL
1 25 NULL
2 23 6
2 24 7
2 25 8
I am trying to set the Bank values of the Loc == 1 rows equal to the Bank of the Loc == 2 row with the same ID, resulting in:
Loc ID Bank
1 23 6
1 24 7
1 25 8
2 23 6
2 24 7
2 25 8
Here is where I am at with the code. I know the ending is super simple, but I can't wrap my head around a solution that doesn't involve iterating over every row (~9000).
result.loc[(result['Loc'] == '1'), 'bank'] = ???
You can try this. It uses map() to get the values from ID.
for_map = df.loc[df['Loc'] == 2].set_index('ID')['Bank'].squeeze().to_dict()
df.loc[df['Loc'] == 1,'Bank'] = df.loc[df['Loc'] == 1,'Bank'].fillna(df['ID'].map(for_map))
You could do a self merge of the dataframe on ID, then filter for rows where Loc is equal to 2:
(
    df.merge(df, on="ID")
    .loc[lambda df: df.Loc_y == 2, ["Loc_x", "ID", "Bank_y"]]
    .rename(columns=lambda x: x.split("_")[0] if "_" in x else x)
    .astype({"Bank": "Int8"})
    .sort_values("Loc", ignore_index=True)
)
Loc ID Bank
0 1 23 6
1 1 24 7
2 1 25 8
3 2 23 6
4 2 24 7
5 2 25 8
You could also stack/unstack, although this fails if you have duplicate indices:
(
    df.set_index(["Loc", "ID"])
    .unstack("Loc")
    .bfill(axis=1)
    .stack()
    .reset_index()
    .reindex(columns=df.columns)
)
Loc ID Bank
0 1 23 6.0
1 2 23 6.0
2 1 24 7.0
3 2 24 7.0
4 1 25 8.0
5 2 25 8.0
Why not use pandas.MultiIndex?
Commonalities
# Arguments,
_0th_level = 'Loc'
merge_key = 'ID'
value_key = 'Bank' # or a list of colnames or `slice(None)` to propagate all columns values.
src_key = '2'
dst_key = '1'
# Computed once for all,
df = result.set_index([_0th_level, merge_key])
df2 = df.xs(key=src_key, level=_0th_level, drop_level=False)
df1_ = df2.rename(level=_0th_level, index={src_key: dst_key})
First (naive) approach
df.loc[df1_.index, value_key] = df1_
# to get `result` back : df.reset_index()
Second (robust) approach
That said, the first approach may be illegal (since pandas version 1.0.0) if there is one or more missing label [...].
So if you must ensure that the indexes exist both at source and destination, the following does the job on shared IDs only:
df1 = df.xs(key=dst_key, level=_0th_level, drop_level=False)
idx = df1.index.intersection(df1_.index) # <-----
df.loc[idx, value_key] = df1_.loc[idx, value_key]
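For reference, if the NULLs come through as NaN and each ID has exactly one non-null Bank, a groupby transform gives the same result in a single line (a sketch, not one of the answers above):
import pandas as pd
import numpy as np

df = pd.DataFrame({'Loc': [1, 1, 1, 2, 2, 2],
                   'ID': [23, 24, 25, 23, 24, 25],
                   'Bank': [np.nan, np.nan, np.nan, 6, 7, 8]})
# Broadcast the first non-null Bank within each ID to every row of that ID
df['Bank'] = df.groupby('ID')['Bank'].transform('first')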

How do I sort a Pandas dataframe Excel import?

I have imported the following Excel file but would like to sort it by frequency descending, with 'Other', 'No data' and 'All' (the total) at the bottom, in that order. Is this possible?
table1 = pd.read_excel("table1.xlsx")
table1
Use:
df = pd.DataFrame({
    'generalenq': list('abcdef'),
    'percentage': [1, 3, 5, 7, 1, 0],
    'frequency': [5, 3, 6, 9, 2, 4],
})
df.loc[0, 'generalenq'] = 'All'
df.loc[2, 'generalenq'] = 'No data'
df.loc[3, 'generalenq'] = 'Other'
print (df)
generalenq percentage frequency
0 All 1 5
1 b 3 3
2 No data 5 6
3 Other 7 9
4 e 1 2
5 f 0 4
First create a dictionary that maps the special labels to integers for ordering. Then create a mask by membership with Series.isin, and sort the non-matched rows, selected with ~ (inverted mask) and boolean indexing:
d = {'Other':0,'No data':1,'All':2}
mask = df['generalenq'].isin(list(d.keys()))
df1 = df[~mask].sort_values('frequency', ascending=False)
print (df1)
generalenq percentage frequency
5 f 0 4
1 b 3 3
4 e 1 2
Then filter the matched rows with the mask and create a helper column for sorting by the mapped dict:
df2 = df[mask].assign(new = lambda x: x['generalenq'].map(d)).sort_values('new').drop('new', axis=1)
print (df2)
generalenq percentage frequency
3 Other 7 9
2 No data 5 6
0 All 1 5
Finally, join them together with concat:
df = pd.concat([df1, df2], ignore_index=True)
print (df)
generalenq percentage frequency
0 f 0 4
1 b 3 3
2 e 1 2
3 Other 7 9
4 No data 5 6
5 All 1 5
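The same ordering can also be produced in a single chain with a temporary sort-key column (a sketch using the sample df above; the _grp name is only illustrative):
d = {'Other': 0, 'No data': 1, 'All': 2}
df_sorted = (df.assign(_grp=df['generalenq'].map(d).fillna(-1))   # regular rows get -1, so they sort first
               .sort_values(['_grp', 'frequency'], ascending=[True, False])
               .drop(columns='_grp')
               .reset_index(drop=True))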

Aggregate columns values by string column numerated name in pandas

I have a table
I want to sum the values of the columns belonging to the same class h.*. So my final table will look like this:
Is it possible to aggregate by string column name?
Thank you for any suggestions!
Use a lambda function to select the first 3 characters of each column name with parameter axis=1 (or index the column names in a similar way) and aggregate with sum:
df1 = df.set_index('object')
df2 = df1.groupby(lambda x: x[:3], axis=1).sum().reset_index()
Or:
df1 = df.set_index('object')
df2 = df1.groupby(df1.columns.str[:3], axis=1).sum().reset_index()
Sample:
np.random.seed(123)
cols = ['object', 'h.1.1','h.1.2','h.1.3','h.1.4','h.1.5',
        'h.2.1','h.2.2','h.2.3','h.2.4','h.3.1','h.3.2','h.3.3']
df = pd.DataFrame(np.random.randint(10, size=(4, 13)), columns=cols)
print (df)
object h.1.1 h.1.2 h.1.3 h.1.4 h.1.5 h.2.1 h.2.2 h.2.3 h.2.4 \
0 2 2 6 1 3 9 6 1 0 1
1 9 3 4 0 0 4 1 7 3 2
2 4 8 0 7 9 3 4 6 1 5
3 8 3 5 0 2 6 2 4 4 6
h.3.1 h.3.2 h.3.3
0 9 0 0
1 4 7 2
2 6 2 1
3 3 0 6
df1 = df.set_index('object')
df2 = df1.groupby(lambda x: x[:3], axis=1).sum().reset_index()
print (df2)
object h.1 h.2 h.3
0 2 21 8 9
1 9 11 13 13
2 4 27 16 9
3 8 16 16 9
The solution above works great, but is vulnerable in case the h.X goes beyond single digits. I'd recommend the following:
Sample Data:
cols = ['h.%d.%d' %(i, j) for i in range(1, 11) for j in range(1, 11)]
df = pd.DataFrame(np.random.randint(10, size=(4, len(cols))), columns=cols, index=['p_%d'%p for p in range(4)])
Proposed Solution:
new_df = df.groupby(df.columns.str.split('.').str[1], axis=1).sum()
new_df.columns = 'h.' + new_df.columns # the columns are originally numbered 1, 2, 3. This brings it back to h.1, h.2, h.3
Alternative Solution:
Going through multiindices might be more convoluted, but may be useful while manipulating this data elsewhere.
df.columns = df.columns.str.split('.', expand=True) # Transform into a multiindex
new_df = df.sum(axis = 1, level=[0,1])
new_df.columns = new_df.columns.get_level_values(0) + '.' + new_df.columns.get_level_values(1) # Rename columns

pandas merge to bring two columns from a dataframe and doing operations on column

I have a table in a pandas dataframe df:
Leafid pidx pidy value
100 1 3 10
100 2 6 12
200 5 7 48
300 7 1 11
I have another dataframe df2 which has:
pid price
1 10
2 20
3 30
4 40
5 50
6 60
7 70
I want to merge df and df2 so that I have two more columns, price_pidx and price_pidy,
and then also compute price_pidy/price_pidx.
For example:
Leafid pidx pidy value price_pidx price_pidy price_pidy/price_pidx
100 1 3 10 10 30 3
My final df should have the columns
pidx pidy value price_pidy/price_pidx
I don't want to use .map() for this.
Is there any way to do it using pd.merge?
I know how to bring in one column, price_pidx, e.g.
pd.merge(df, df2[['pid','price']], how='left', left_on='pidx', right_on='pid')
but how do I bring in both price_pidx and price_pidy?
Without map it is complicated, because you need to reshape with melt, then merge, and finally unstack:
df = pd.melt(df, id_vars='value', value_name='pid', var_name='g')
df2 = pd.merge(df,df2[['pid','price']], how='left', on = 'pid')
df2 = df2.set_index(['value','g']).unstack()
df2.columns = ['_'.join(col) for col in df2.columns]
df2['col'] = df2.price_pidy / df2.price_pidx
df2 = df2.rename(columns={'pid_pidx':'pidx','pid_pidy':'pidy'})
print (df2)
pidx pidy price_pidx price_pidy col
value
10 1 3 10 30 3.000000
11 7 1 70 10 0.142857
12 2 6 20 60 3.000000
48 5 7 50 70 1.400000
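For what it's worth, the two extra price columns can also be brought in with two plain merges, one per pid column (a sketch built from the sample frames in the question; the renamed columns are only illustrative):
import pandas as pd

df = pd.DataFrame({'Leafid': [100, 100, 200, 300],
                   'pidx': [1, 2, 5, 7],
                   'pidy': [3, 6, 7, 1],
                   'value': [10, 12, 48, 11]})
df2 = pd.DataFrame({'pid': [1, 2, 3, 4, 5, 6, 7],
                    'price': [10, 20, 30, 40, 50, 60, 70]})

# one merge per pid column, renaming df2 so the price columns stay distinct
out = (df
       .merge(df2.rename(columns={'pid': 'pidx', 'price': 'price_pidx'}), on='pidx', how='left')
       .merge(df2.rename(columns={'pid': 'pidy', 'price': 'price_pidy'}), on='pidy', how='left'))
out['price_pidy/price_pidx'] = out['price_pidy'] / out['price_pidx']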
