Summarize data from a list of pandas dataframes - python

I have a list of dfs, df_list:
[   CLASS  IDX    A    B    C    D
 0      1    1  1.0  0.0  0.0  0.0
 1      1    2  1.0  0.0  0.0  0.0
 2      1    3  1.0  0.0  0.0  0.0,
    CLASS  IDX    A    B    C    D
 0      1    1  NaN  NaN  NaN  NaN
 1      1    2  1.0  0.0  0.0  0.0
 2      1    3  1.0  0.0  0.0  0.0,
    CLASS  IDX      A      B    C    D
 0      1    1  0.900  0.100  0.0  0.0
 1      1    2  1.000  0.000  0.0  0.0
 2      1    3  0.999  0.001  0.0  0.0]
I would like to summarize the data into one df based on conditions and values in the individual dfs. Each df has 4 columns of interest: A, B, C and D. If, for example, the value in column A is >= 0.1 in df_list[0], I want to print 'A' in the summary df. If two columns, say A and B, both have values >= 0.1, I want to print 'A/B'. The final summary df for this data should be:
   CLASS  IDX  0    1    2
0      1    1  A  NaN  A/B
1      1    2  A    A    A
2      1    3  A    A    A
In the summary df, the column labels (0,1,2) represent the position of the df in the df_list.
I am starting with this:
for index, values in enumerate(df_list):
    # summarize the data
but I am not sure what the best way to continue would be.
Any help greatly appreciated!

Here is one approach:
import numpy as np
import pandas as pd

cols = ['A', 'B', 'C', 'D']

def join_func(df):
    m = df[cols].ge(0.1)                 # True where the value is >= 0.1
    return (df[cols].mask(m, cols)       # qualifying cells become their column name
            .where(m, np.nan)            # everything else becomes NaN
            .apply(lambda x: '/'.join(x.dropna()), axis=1))

result = (df_list[0].loc[:, ['CLASS', 'IDX']]
          .assign(**{str(i): join_func(df)
                     for i, df in enumerate(df_list)}))
print(result)
   CLASS  IDX  0  1    2
0      1    1  A     A/B
1      1    2  A  A    A
2      1    3  A  A    A
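To see how join_func builds the labels, here is a small sketch of the intermediate steps, run on df_list[2] from the question (the variable names are just for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'CLASS': 1, 'IDX': [1, 2, 3],
                   'A': [0.900, 1.000, 0.999],
                   'B': [0.100, 0.000, 0.001],
                   'C': 0.0, 'D': 0.0})
cols = ['A', 'B', 'C', 'D']

m = df[cols].ge(0.1)              # True where a value qualifies
labeled = df[cols].mask(m, cols)  # qualifying cells replaced by their column name
kept = labeled.where(m, np.nan)   # all other cells become NaN
print(kept.apply(lambda x: '/'.join(x.dropna()), axis=1))
# 0    A/B
# 1      A
# 2      A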

Related

How to get average of all the columns and store into a new row

I have a dataframe like this:
         A  B  C  D
user_id
1        1  0  0  1
2        2  1  0  2
3        2  3  1  3
4        3  2  0  4
I need to compute the average of all the columns and make the dataframe look like this:
          A    B     C    D
user_id
1         1    0     0    1
2         2    1     0    2
3         2    3     1    3
4         3    2     0    4
Average   2  1.5  0.25  2.5
I'm trying this, but it gives me an error:
df = df.append({'user_id':'Average', df.mean}, ignore_index=True)
This works:
df = pd.concat([df, df.mean().to_frame('Average').T])
which will create the following result:
           A    B     C    D
1        1.0  0.0  0.00  1.0
2        2.0  1.0  0.00  2.0
3        2.0  3.0  1.00  3.0
4        3.0  2.0  0.00  4.0
Average  2.0  1.5  0.25  2.5
Comment:
If you really want to mix floats and integers, please use
pd.concat([df, df.mean().to_frame('Average').T.astype(object)])
This will result in
           A    B     C    D
1          1    0     0    1
2          2    1     0    2
3          2    3     1    3
4          3    2     0    4
Average  2.0  1.5  0.25  2.5
I want to quote from the official documentation on dtypes to show the disadvantage of this solution:
Finally, arbitrary objects may be stored using the object dtype, but should be avoided to the extent possible (for performance and interoperability with other libraries and methods).
This is also the reason why the result's data type defaults to float.
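A minimal sketch of that upcast, on an invented two-column frame (just for demonstration):

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})              # both columns are int64
summary = pd.concat([df, df.mean().to_frame('Average').T])
print(summary.dtypes)  # both columns are now float64: the int rows were upcast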
Do you mean to do:
df.loc["Average", :] = df.mean()
This creates a new row called "Average" in your df that stores the column-wise mean.
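A quick check on the question's data (note that the assignment upcasts every column to float):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3], 'B': [0, 1, 3, 2],
                   'C': [0, 0, 1, 0], 'D': [1, 2, 3, 4]},
                  index=pd.Index([1, 2, 3, 4], name='user_id'))
df.loc['Average', :] = df.mean()
print(df)
#            A    B     C    D
# user_id
# 1        1.0  0.0  0.00  1.0
# 2        2.0  1.0  0.00  2.0
# 3        2.0  3.0  1.00  3.0
# 4        3.0  2.0  0.00  4.0
# Average  2.0  1.5  0.25  2.5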
This also works:
pd.concat([df, df.describe().loc[['mean']]])
        A    B     C    D
0     1.0  0.0  0.00  1.0
1     2.0  1.0  0.00  2.0
2     2.0  3.0  1.00  3.0
3     3.0  2.0  0.00  4.0
mean  2.0  1.5  0.25  2.5
You can use pandas.DataFrame.loc to add a line at the bottom:
>>> df.loc['Average'] = df.iloc[:, :].mean()
>>> print(df)
           A    B     C    D
1        1.0  0.0  0.00  1.0
2        2.0  1.0  0.00  2.0
3        2.0  3.0  1.00  3.0
4        3.0  2.0  0.00  4.0
Average  2.0  1.5  0.25  2.5

How to rank the categorical values while one-hot-encoding

I have data like this:
id  feature_1  feature_2
1   a          e
2   b          c
3   c          d
4   d          b
5   e          a
I want a one-hot-encoded-like result where a category appearing in feature_1 is marked 1 and one appearing in feature_2 is marked 0.5, like the following table:
id  a    b    c    d    e
1   1    0    0    0    0.5
2   0    1    0.5  0    0
3   0    0    1    0.5  0
4   0    0.5  0    1    0
5   0.5  0    0    0    1
But when I apply sklearn.preprocessing.OneHotEncoder, it outputs 10 columns with the respective 1s.
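A minimal sketch of that behaviour (the encoder fits each feature column independently, so 5 + 5 columns come out):

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
onehot = enc.fit_transform(df[['feature_1', 'feature_2']])
print(onehot.shape)  # (5, 10): five categories per feature, encoded separately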
How can I achieve this?
For the two columns, you can do:
pd.crosstab(df.id, df.feature_1) + pd.crosstab(df['id'], df['feature_2']) * .5
Output:
feature_1    a    b    c    d    e
id
1          1.0  0.0  0.0  0.0  0.5
2          0.0  1.0  0.5  0.0  0.0
3          0.0  0.0  1.0  0.5  0.0
4          0.0  0.5  0.0  1.0  0.0
5          0.5  0.0  0.0  0.0  1.0
If you have more than two features, with the weights defined, you can melt, then map the features to the weights:
weights = {'feature_1': 1, 'feature_2': 0.5}

flatten = df.melt('id')
(flatten['variable'].map(weights)
 .groupby([flatten['id'], flatten['value']])
 .sum()
 .unstack('value', fill_value=0)
)
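To see why this works, here is a sketch of the intermediate melt output on the question's data; the groupby then sums the mapped weights per (id, value) pair:

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'feature_1': list('abcde'),
                   'feature_2': list('ecdba')})
flatten = df.melt('id')
print(flatten.head(4))
#    id   variable value
# 0   1  feature_1     a
# 1   2  feature_1     b
# 2   3  feature_1     c
# 3   4  feature_1     d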

How to use groupby + transform instead of pipe?

Let's say I have a dataframe like this:
import pandas as pd
from scipy import stats
df = pd.DataFrame(
    {
        'group': list('abaab'),
        'val1': range(5),
        'val2': range(2, 7),
        'val3': range(4, 9)
    }
)

  group  val1  val2  val3
0     a     0     2     4
1     b     1     3     5
2     a     2     4     6
3     a     3     5     7
4     b     4     6     8
Now I want to calculate linear regressions for each group in column group using two of the val columns (potentially all pairs, so I don't want to hardcode column names anywhere).
A potential implementation based on pipe could look like this:
def do_lin_reg_pipe(df, group_col, col1, col2):
    group_names = df[group_col].unique()
    df_subsets = []
    for s in group_names:
        df_subset = df.loc[df[group_col] == s]
        x = df_subset[col1].values
        y = df_subset[col2].values
        slope, intercept, r, p, se = stats.linregress(x, y)
        df_subset = df_subset.assign(
            slope=slope,
            intercept=intercept,
            r=r,
            p=p,
            se=se
        )
        df_subsets.append(df_subset)
    return pd.concat(df_subsets)
and then I can use
df_linreg_pipe = (
    df.pipe(do_lin_reg_pipe, group_col='group', col1='val1', col2='val3')
    .assign(p=lambda d: d['p'].round(3))
)
which gives the desired outcome:
  group  val1  val2  val3  slope  intercept    r    p   se
0     a     0     2     4    1.0        4.0  1.0  0.0  0.0
2     a     2     4     6    1.0        4.0  1.0  0.0  0.0
3     a     3     5     7    1.0        4.0  1.0  0.0  0.0
1     b     1     3     5    1.0        4.0  1.0  0.0  0.0
4     b     4     6     8    1.0        4.0  1.0  0.0  0.0
What I don't like is that I have to loop through the groups, use append and then concat, so I thought I should somehow use groupby and transform, but I can't get this to work. The function call should be something like
df_linreg_transform = df.copy()
df_linreg_transform[['slope', 'intercept', 'r', 'p', 'se']] = (
    df.groupby('group').transform(do_lin_reg_transform, col1='val1', col2='val3')
)
The question is how to define do_lin_reg_transform; I would like to have something along these lines:
def do_lin_reg_transform(df, col1, col2):
    x = df[col1].values
    y = df[col2].values
    slope, intercept, r, p, se = stats.linregress(x, y)
    return (slope, intercept, r, p, se)
but that - of course - crashes with a KeyError
KeyError: 'val1'
How could one implement do_lin_reg_transform to make it work with groupby and transform?
As you can't use groupby + transform here (you need extra columns to compute the result), the idea is to use groupby + apply, then map to broadcast the result to each row:
cols = ['slope', 'intercept', 'r', 'p', 'se']
lingress = lambda x: stats.linregress(x['val1'], x['val3'])

df[cols] = pd.DataFrame.from_records(df['group'].map(df.groupby('group').apply(lingress)))
print(df)
# Output
  group  val1  val2  val3  slope  intercept    r             p   se
0     a     0     2     4    1.0        4.0  1.0  9.003163e-11  0.0
1     b     1     3     5    1.0        4.0  1.0  0.000000e+00  0.0
2     a     2     4     6    1.0        4.0  1.0  9.003163e-11  0.0
3     a     3     5     7    1.0        4.0  1.0  9.003163e-11  0.0
4     b     4     6     8    1.0        4.0  1.0  0.000000e+00  0.0
Transform is meant to produce a result for a single column; a regression requires multiple columns, so you should use apply.
If you wanted, you could define your aggregation to return a DataFrame as opposed to a Series (so the result doesn't reduce). For this to work, you'd want to make sure your index is unique. Then concat the result back so it aligns on the index. You won't have any issues even if there's more than one grouping column.
def group_reg(gp, col1, col2):
    df = pd.DataFrame([stats.linregress(gp[col1], gp[col2])] * len(gp),
                      columns=['slope', 'intercept', 'r', 'p', 'se'],
                      index=gp.index)
    return df

pd.concat([df, df.groupby('group').apply(group_reg, col1='val1', col2='val3')], axis=1)
  group  val1  val2  val3  slope  intercept    r             p   se
0     a     0     2     4    1.0        4.0  1.0  9.003163e-11  0.0
1     b     1     3     5    1.0        4.0  1.0  0.000000e+00  0.0
2     a     2     4     6    1.0        4.0  1.0  9.003163e-11  0.0
3     a     3     5     7    1.0        4.0  1.0  9.003163e-11  0.0
4     b     4     6     8    1.0        4.0  1.0  0.000000e+00  0.0

How to create another column according to another column's value in Python?

I have the following dataframe, built with the following code:
for i in range(int(tower1_base), int(tower1_top)):
    if i not in tower1_not_included_int:
        df = pd.concat([df, pd.DataFrame({"Tower": 1, "Floor": i, "Unit": list("ABCDEFG")})],
                       ignore_index=True)
Result:
   Tower  Floor Unit
0      1    1.0    A
1      1    1.0    B
2      1    1.0    C
3      1    1.0    D
4      1    1.0    E
5      1    1.0    F
6      1    1.0    G
How can I create another Index column like this?
   Tower  Floor Unit Index
0      1    1.0    A   1A1
1      1    2.0    B   1B2
2      1    3.0    C   1C3
3      1    4.0    D   1D4
4      1    5.0    E   1E5
5      1    6.0    F   1F6
6      1    7.0    G   1G7
You can simply add the columns:
df['Index'] = df['Tower'].astype(str) + df['Unit'] + df['Floor'].astype(int).astype(str)
Outputs this for the first version of your dataframe:
   Tower  Floor Unit Index
0      1    1.0    A   1A1
1      1    1.0    B   1B1
2      1    1.0    C   1C1
3      1    1.0    D   1D1
4      1    1.0    E   1E1
5      1    1.0    F   1F1
6      1    1.0    G   1G1
Another approach.
I've created a copy of the dataframe and reordered the columns to make the "melting" easier.
dfAl = df.reindex(columns=['Tower', 'Unit', 'Floor'])

to_load = []                          # list to hold the new column
vals = pd.DataFrame.to_numpy(dfAl)    # all values extracted, row by row
for sublist in vals:
    # join the row's values, rendering floats like 1.0 as '1'
    # (str(i).strip('.0') would also mangle values such as 10.0 -> '1')
    combs = ''.join(str(int(i)) if isinstance(i, float) else str(i)
                    for i in sublist)
    to_load.append(combs)
df['Index'] = to_load
If you really want the 'Index' column to be a real index, the last step:
df = df.set_index('Index')
print(df)

       Tower  Floor Unit
Index
1A1        1    1.0    A
1B2        1    2.0    B
1C3        1    3.0    C
1D4        1    4.0    D
1E5        1    5.0    E
1F6        1    6.0    F
1G7        1    7.0    G

Check value on merged dataframe and make changes on original dataframe

I have a merged_dataframe whose id column comes from dataframe X and whose value column comes from dataframe Y.
I want to drop ids like A, whose last row has value 1.
How do I do it so that the A rows are also dropped in dataframe X?
id  value
A   0
A   1
B   0
C   0
To check the last value per id, is it by using:
merged_dataframe = merged_dataframe.groupby('id').nth(-1)
get_last_value = merged_dataframe['value']
Here's one method of doing it:
import numpy as np

mask = df.groupby('id', as_index=False)['value'].nth(-1) == 1
df.loc[mask[mask].index, 'value'] = np.nan
ndf = df.dropna()
Output:
  id  value
0  A    0.0
2  B    0.0
3  C    0.0
If you have a dataframe like
  id  value
0  A    1.0
1  A    0.0
2  A    1.0
3  B    1.0
4  B    0.0
5  B    1.0
6  C    0.0
then the output is:
  id  value
0  A    1.0
1  A    0.0
3  B    1.0
4  B    0.0
6  C    0.0
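As a side note: if the goal is to drop every row of an id whose last value is 1 (as the question describes for A), a more direct sketch uses groupby with transform('last') to broadcast each id's last value to all of its rows:

import pandas as pd

df = pd.DataFrame({'id': list('AABC'), 'value': [0, 1, 0, 0]})
# keep only ids whose last row's value is not 1
out = df[df.groupby('id')['value'].transform('last') != 1]
print(out)
#   id  value
# 2  B      0
# 3  C      0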
