Find matching column interval in pandas - python

I have a pandas dataframe with multiple columns whose values increase from some value between 0 and 1 in column A up to column E, which is always 1 (the columns represent cumulative probabilities).
ID A B C D E SIM
1: 0.49 0.64 0.86 0.97 1.00 0.98
2: 0.76 0.84 0.98 0.99 1.00 0.87
3: 0.32 0.56 0.72 0.92 1.00 0.12
The column SIM contains random uniform numbers.
I wish to add a new column SIM_CAT whose value is the name of the column that forms the right boundary of the interval in which the value in column SIM falls:
ID A B C D E SIM SIM_CAT
1: 0.49 0.64 0.86 0.97 1.00 0.98 E
2: 0.76 0.84 0.98 0.99 1.00 0.87 C
3: 0.32 0.56 0.72 0.92 1.00 0.12 A
Is there a concise way to do that?

You can compare the columns with SIM and use idxmax to find the first value greater than or equal to it. Because the row values are non-decreasing from A to E and E is always 1, each boolean row switches from False to True exactly once, so idxmax(axis=1) returns the name of the first qualifying column:
cols = list('ABCDE')
df['SIM_CAT'] = df[cols].ge(df.SIM, axis=0).idxmax(axis=1)
df
ID A B C D E SIM SIM_CAT
0 1: 0.49 0.64 0.86 0.97 1.0 0.98 E
1 2: 0.76 0.84 0.98 0.99 1.0 0.87 C
2 3: 0.32 0.56 0.72 0.92 1.0 0.12 A
If SIM can contain values greater than 1:
cols = list('ABCDE')
df['SIM_CAT'] = None
df.loc[df.SIM <= 1, 'SIM_CAT'] = df[cols].ge(df.SIM, axis=0).idxmax(axis=1)
df
ID A B C D E SIM SIM_CAT
0 1: 0.49 0.64 0.86 0.97 1.0 0.98 E
1 2: 0.76 0.84 0.98 0.99 1.0 0.87 C
2 3: 0.32 0.56 0.72 0.92 1.0 0.12 A
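For reference, a minimal runnable reconstruction of the example (the frame below is rebuilt from the question's numbers, so treat it as an illustration rather than the asker's exact data):
import pandas as pd

df = pd.DataFrame({
    'ID': ['1:', '2:', '3:'],
    'A': [0.49, 0.76, 0.32],
    'B': [0.64, 0.84, 0.56],
    'C': [0.86, 0.98, 0.72],
    'D': [0.97, 0.99, 0.92],
    'E': [1.00, 1.00, 1.00],
    'SIM': [0.98, 0.87, 0.12],
})
cols = list('ABCDE')
# first column (left to right) whose cumulative value reaches SIM
df['SIM_CAT'] = df[cols].ge(df.SIM, axis=0).idxmax(axis=1)
print(df)  # SIM_CAT comes out as E, C, A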

Related

How to add sum() and mean() value above the df column values in the same line?

Suppose we have a df with a sum() row as in the DataFrame below (thanks so much to @jezrael for the earlier answer). Right now the sum value is on the first line and the avg value on the second line, which is ugly. How can I put the sum and avg values on the same line, with the index name Total, and place it as the first row as below?
# Total 27.56 25.04 -1.31
The pandas code is as below:
df.columns=['value_a','value_b','name','up_or_down','difference']
df1 = df[['value_a','value_b']].sum().to_frame().T
df2 = df[['difference']].mean().to_frame().T
df = pd.concat([df1,df2, df], ignore_index=True)
df
value_a value_b name up_or_down difference
project_name
27.56 25.04 NaN NaN NaN
NaN NaN NaN NaN -1.31
2021-project11 0.43 0.48 2021-project11 up 0.05
2021-project1 0.62 0.56 2021-project1 down -0.06
2021-project2 0.51 0.47 2021-project2 down -0.04
2021-porject3 0.37 0.34 2021-porject3 down -0.03
2021-porject4 0.64 0.61 2021-porject4 down -0.03
2021-project5 0.32 0.25 2021-project5 down -0.07
2021-project6 0.75 0.81 2021-project6 up 0.06
2021-project7 0.60 0.60 2021-project7 down 0.00
2021-project8 0.85 0.74 2021-project8 down -0.11
2021-project10 0.67 0.67 2021-project10 down 0.00
2021-project9 0.73 0.73 2021-project9 down 0.00
2021-project11 0.54 0.54 2021-project11 down 0.00
2021-project12 0.40 0.40 2021-project12 down 0.00
2021-project13 0.76 0.77 2021-project13 up 0.01
2021-project14 1.16 1.28 2021-project14 up 0.12
2021-project15 1.01 0.94 2021-project15 down -0.07
2021-project16 1.23 1.24 2021-project16 up 0.01
2022-project17 0.40 0.36 2022-project17 down -0.04
2022-project_11 0.40 0.40 2022-project_11 down 0.00
2022-project4 1.01 0.80 2022-project4 down -0.21
2022-project1 0.65 0.67 2022-project1 up 0.02
2022-project2 0.75 0.57 2022-project2 down -0.18
2022-porject3 0.32 0.32 2022-porject3 down 0.00
2022-project18 0.91 0.56 2022-project18 down -0.35
2022-project5 0.84 0.89 2022-project5 up 0.05
2022-project19 0.61 0.48 2022-project19 down -0.13
2022-project6 0.77 0.80 2022-project6 up 0.03
2022-project20 0.63 0.54 2022-project20 down -0.09
2022-project8 0.59 0.55 2022-project8 down -0.04
2022-project21 0.58 0.54 2022-project21 down -0.04
2022-project10 0.76 0.76 2022-project10 down 0.00
2022-project9 0.70 0.71 2022-project9 up 0.01
2022-project22 0.62 0.56 2022-project22 down -0.06
2022-project23 2.03 1.74 2022-project23 down -0.29
2022-project12 0.39 0.39 2022-project12 down 0.00
2022-project24 1.35 1.55 2022-project24 up 0.20
project25 0.45 0.42 project25 down -0.03
project26 0.53 NaN project26 down NaN
project27 0.68 NaN project27 down NaN
Thanks so much for any advice
Use DataFrame.agg with a dictionary of aggregate functions:
df.columns=['value_a','value_b','name','up_or_down','difference']
df1 = df.agg({'value_a':'sum', 'value_b':'sum', 'difference':'mean'}).to_frame('Total').T
df = pd.concat([df1,df])
print (df.head())
value_a value_b difference name up_or_down
Total 27.56 25.04 -0.035405 NaN NaN
2021-project11 0.43 0.48 0.050000 2021-project11 up
2021-project1 0.62 0.56 -0.060000 2021-project1 down
2021-project2 0.51 0.47 -0.040000 2021-project2 down
2021-porject3 0.37 0.34 -0.030000 2021-porject3 down
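Note that concat orders the columns by first appearance, which is why value_a, value_b, difference come before name and up_or_down above. If you want to keep the original column order, one small variation (my addition, not part of the original answer) is to reindex after concatenating:
df = pd.concat([df1, df]).reindex(columns=['value_a', 'value_b', 'name', 'up_or_down', 'difference'])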

How to get maximums of multiple groups based on grouping column?

I have an initial dataset data grouped by id:
id x y
1 0.21 1.00
1 0.34 0.66
1 0.35 0.33
1 0.94 0.00
2 0.11 1.00
2 0.90 0.66
2 0.31 0.33
2 0.33 0.00
3 0.12 1.00
3 0.34 0.71
3 0.64 0.43
3 0.89 0.14
4 0.32 1.00
4 0.33 0.66
4 0.45 0.33
4 0.76 0.00
I am trying to predict the maximum y based on variable x while considering the groups. First, I split the data with train_test_split based on the groups:
data_train
id x y
1 0.21 1.00
1 0.34 0.66
1 0.35 0.33
1 0.94 0.00
2 0.11 1.00
2 0.90 0.66
2 0.31 0.33
2 0.33 0.00
and
data_test
id x y
3 0.12 1.00
3 0.34 0.66
3 0.64 0.33
3 0.89 0.00
4 0.33 1.00
4 0.32 0.66
4 0.45 0.33
4 0.76 0.00
After training the model and applying it to data_test, I get:
y_hat
0.65
0.33
0.13
0.00
0.33
0.34
0.21
0.08
I am trying to transform y_hat so that the maximum in each of the initial groups is 1.00; otherwise, it is 0.00:
y_hat_transform
1.00
0.00
0.00
0.00
0.00
1.00
0.00
0.00
How would I do that? Note that the groups can be of varying sizes.
Edit: To simplify the problem, I have id_test and y_hat, where
id_test
3
3
3
3
4
4
4
4
and I am trying to get y_hat_transform.
id y
0 3 0.65
1 3 0.33
2 3 0.13
3 3 0.00
4 4 0.33
5 4 0.34
6 4 0.21
7 4 0.08
# Flag the max row in each group.
# I treat 1.0 and 0.0 as binary, so I cast the boolean directly to float.
# transform('max') returns a column of the same length with each group's
# max repeated, so the comparison marks the per-group maximum rows.
id_y['y_transform'] = (id_y['y'] == id_y.groupby('id')['y'].transform('max')).astype(float)
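A minimal runnable sketch of this answer, assuming id_y is the combined frame built from id_test and y_hat in the question (the name id_y comes from the answer; the construction below is my assumption):
import pandas as pd

id_y = pd.DataFrame({
    'id': [3, 3, 3, 3, 4, 4, 4, 4],
    'y':  [0.65, 0.33, 0.13, 0.00, 0.33, 0.34, 0.21, 0.08],
})
id_y['y_transform'] = (id_y['y'] == id_y.groupby('id')['y'].transform('max')).astype(float)
print(id_y['y_transform'].tolist())
# [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
One caveat: if a group's maximum value is tied across several rows, every tied row gets 1.0.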

Python(Pandas) - Create a column by matching column's values into dataframe

I have the below assumed dataframe
a b c d e F
0.02 0.62 0.31 0.67 0.27 a
0.30 0.07 0.23 0.42 0.00 a
0.82 0.59 0.34 0.73 0.29 a
0.90 0.80 0.13 0.14 0.07 d
0.50 0.62 0.94 0.34 0.53 d
0.59 0.84 0.95 0.42 0.54 d
0.13 0.33 0.87 0.20 0.25 d
0.47 0.37 0.84 0.69 0.28 e
Column F holds the name of one of the dataframe's columns.
For each row, I want to look up the value at that row in the column named by F, and return those values in one new column.
The outcome will look like this:
a b c d e F To_Be_Filled
0.02 0.62 0.31 0.67 0.27 a 0.02
0.30 0.07 0.23 0.42 0.00 a 0.30
0.82 0.59 0.34 0.73 0.29 a 0.82
0.90 0.80 0.13 0.14 0.07 d 0.14
0.50 0.62 0.94 0.34 0.53 d 0.34
0.59 0.84 0.95 0.42 0.54 d 0.42
0.13 0.33 0.87 0.20 0.25 d 0.20
0.47 0.37 0.84 0.69 0.28 e 0.28
I am able to identify a single case with the line below (here for column a), but I am not sure how to do it across the whole dataframe.
test.loc[test.iloc[:, 5] == 'a', test.columns == 'a']
Many thanks in advance.
You can use lookup:
import numpy as np

df['To_Be_Filled'] = df.lookup(np.arange(len(df)), df['F'])
df
Out:
a b c d e F To_Be_Filled
0 0.02 0.62 0.31 0.67 0.27 a 0.02
1 0.30 0.07 0.23 0.42 0.00 a 0.30
2 0.82 0.59 0.34 0.73 0.29 a 0.82
3 0.90 0.80 0.13 0.14 0.07 d 0.14
4 0.50 0.62 0.94 0.34 0.53 d 0.34
5 0.59 0.84 0.95 0.42 0.54 d 0.42
6 0.13 0.33 0.87 0.20 0.25 d 0.20
7 0.47 0.37 0.84 0.69 0.28 e 0.28
np.arange(len(df)) can be replaced with df.index.
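Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On current versions, the same result can be obtained with plain NumPy fancy indexing; a sketch that should be equivalent to the lookup call above:
import numpy as np

rows = np.arange(len(df))
cols = df.columns.get_indexer(df['F'])  # positional index of each row's target column
df['To_Be_Filled'] = df.to_numpy()[rows, cols]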

DataFrame of means of top N most correlated columns

I have a dataframe df1 where each column represents a time series of returns. I want to create a new dataframe df2 with columns that corresponds to each of the columns in df1 where the column in df2 is defined to be the average of the top 5 most correlated columns in df1.
import pandas as pd
import numpy as np
from string import ascii_letters
np.random.seed([3,1415])
df1 = pd.DataFrame(np.random.randn(100, 10).round(2),
columns=list(ascii_letters[26:36]))
print(df1.head())
A B C D E F G H I J
0 -2.13 -1.27 -1.97 -2.26 -0.35 -0.03 0.32 0.35 0.72 0.77
1 -0.61 0.35 -0.35 -0.42 -0.91 -0.14 0.75 -1.50 0.61 0.40
2 -0.96 1.49 -0.35 -1.47 1.06 1.06 0.59 0.30 -0.77 0.83
3 1.49 0.26 -0.90 0.38 -0.52 0.05 0.95 -1.03 0.95 0.73
4 1.24 0.16 -1.34 0.16 1.26 0.78 1.34 -1.64 -0.20 0.13
I expect the head of the resulting dataframe rounded to 2 places to look like:
A B C D E F G H I J
0 -0.78 -0.70 -0.53 -0.45 -0.99 -0.10 -0.47 -0.86 -0.31 -0.64
1 -0.49 -0.11 -0.45 -0.03 -0.04 0.10 -0.26 0.11 -0.06 -0.10
2 0.03 0.13 0.54 0.33 -0.13 0.27 0.22 0.32 0.41 0.27
3 -0.22 0.13 0.19 0.58 0.63 0.24 0.34 0.51 0.32 0.22
4 -0.04 0.31 0.23 0.52 0.43 0.24 0.07 0.31 0.73 0.43
For each column in the correlation matrix, take the six largest values and ignore the first one (a column is 100% correlated with itself). Use a dictionary comprehension to do this for each column.
Use another dictionary comprehension to locate these columns in df1 and take their mean. Create a dataframe from the result, and reorder the columns to match those of df1 by appending [df1.columns].
corr = df1.corr()
most_correlated_cols = {col: corr[col].nlargest(6)[1:].index
                        for col in corr}
df2 = pd.DataFrame({col: df1.loc[:, most_correlated_cols[col]].mean(axis=1)
                    for col in df1})[df1.columns]
>>> df2.head()
A B C D E F G H I J
0 -0.782 -0.698 -0.526 -0.452 -0.994 -0.102 -0.472 -0.856 -0.310 -0.638
1 -0.486 -0.106 -0.454 -0.032 -0.042 0.100 -0.258 0.108 -0.064 -0.102
2 0.026 0.132 0.544 0.330 -0.130 0.272 0.224 0.320 0.414 0.274
3 -0.224 0.128 0.186 0.582 0.626 0.242 0.344 0.506 0.318 0.224
4 -0.044 0.310 0.230 0.518 0.428 0.238 0.068 0.306 0.734 0.432
%%timeit
corr = df1.corr()
most_correlated_cols = {col: corr[col].nlargest(6)[1:].index
                        for col in corr}
df2 = pd.DataFrame({col: df1.loc[:, most_correlated_cols[col]].mean(axis=1)
                    for col in df1})[df1.columns]
100 loops, best of 3: 10 ms per loop
%%timeit
corr = df1.corr()
df2 = corr.apply(argsort).head(5).apply(lambda x: avg_of(x, df1))
100 loops, best of 3: 16 ms per loop
Setup
import pandas as pd
import numpy as np
from string import ascii_letters
np.random.seed([3,1415])
df1 = pd.DataFrame(np.random.randn(100, 10).round(2),
columns=list(ascii_letters[26:36]))
Solution
corr = df1.corr()

# I don't want a security's correlation with itself to be included.
# Because `corr` is symmetrical, I can assume that a series' name will be in its index.
def remove_self(x):
    return x.loc[x.index != x.name]

# This utilizes `remove_self`, then sorts by correlation
# and returns the index.
def argsort(x):
    return pd.Series(remove_self(x).sort_values(ascending=False).index)

# This reaches into `df1` and gets all columns identified in x,
# then takes the mean.
def avg_of(x, df):
    return df.loc[:, x].mean(axis=1)

# Putting it all together.
df2 = corr.apply(argsort).head(5).apply(lambda x: avg_of(x, df1))
print(df2.round(2).head())
A B C D E F G H I J
0 -0.78 -0.70 -0.53 -0.45 -0.99 -0.10 -0.47 -0.86 -0.31 -0.64
1 -0.49 -0.11 -0.45 -0.03 -0.04 0.10 -0.26 0.11 -0.06 -0.10
2 0.03 0.13 0.54 0.33 -0.13 0.27 0.22 0.32 0.41 0.27
3 -0.22 0.13 0.19 0.58 0.63 0.24 0.34 0.51 0.32 0.22
4 -0.04 0.31 0.23 0.52 0.43 0.24 0.07 0.31 0.73 0.43

Creating Multi-hierarchy pivot table in Pandas

1. Background
The .xls files I have contain parameters for multiple pollutants, measured in several ways at different monitoring sites.
I created a simplified dataframe below as an illustration:
Some declarations:
Column Site contains the monitoring site names. In this case, sites S1 and S2 are the only two locations.
Column Time contains the monitoring period for the different sites.
Species A & B represent two chemical pollutants that were detected.
Conc is a key parameter for each species (A & B) representing the concentration. Notice that the concentration of species A is measured twice, in parallel.
P and Q are two different analysis experiments. Since species A has two samples, it has P1, P2, P3 & Q1, Q2 as analysis results respectively. Species B has only been analyzed by P, so P1, P2, P3 are its only parameters.
After reading some posts on manipulating pivot_table in pandas, I wanted to give it a try.
2. My target
I mocked up my target file structure manually in Excel, like this:
3. My work
df = pd.ExcelFile("./test_file.xls")
df = df.parse("Sheet1")
df = pd.pivot_table(df, index=["Site", "Time", "Species"])
This is the result:
Update
What I'm trying to figure out is how to create two top-level columns P & Q with sub-columns below them.
I have re-uploaded my test file here. Anyone interested can download it.
The P and Q tests are run on each sample of species A respectively.
The Conc test applies to them both.
Any advice would be appreciated!
IIUC
You want the same dataframe, but with a better column index.
To create the first level:
level0 = df.columns.str.extract(r'([^\d]*)', expand=False)
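For the column names in this question, the non-digit prefix extraction yields exactly the grouping we want; a quick check (assuming the columns shown in the output below):
import pandas as pd

cols = pd.Index(['Conc', 'P1', 'P2', 'P3', 'Q1', 'Q2'])
print(list(cols.str.extract(r'([^\d]*)', expand=False)))
# ['Conc', 'P', 'P', 'P', 'Q', 'Q']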
then assign a multiindex to the columns attribute.
df.columns = pd.MultiIndex.from_arrays([level0, df.columns])
Looks like:
print(df)
Conc P Q
Conc P1 P2 P3 Q1 Q2
Site Time Species
S1 20141222 A 0.79 0.02 0.62 1.05 0.01 1.73
20141228 A 0.13 0.01 0.79 0.44 0.01 1.72
20150103 B 0.48 0.03 1.39 0.84 NaN NaN
20150104 A 0.36 0.02 1.13 0.31 0.01 0.94
20150109 A 0.14 0.01 0.64 0.35 0.00 1.00
20150114 B 0.47 0.08 1.16 1.40 NaN NaN
20150115 A 0.62 0.02 0.90 0.95 0.01 2.63
20150116 A 0.71 0.03 1.72 1.71 0.01 2.53
20150121 B 0.61 0.03 0.67 0.87 NaN NaN
S2 20141222 A 0.23 0.01 0.66 0.44 0.01 1.49
20141228 A 0.42 0.06 0.99 1.56 0.00 2.18
20150103 B 0.09 0.01 0.56 0.12 NaN NaN
20150104 A 0.18 0.01 0.56 0.36 0.00 0.67
20150109 A 0.50 0.03 0.74 0.71 0.00 1.11
20150114 B 0.64 0.06 1.76 0.92 NaN NaN
20150115 A 0.58 0.05 0.77 0.95 0.01 1.54
20150116 A 0.93 0.04 1.33 0.69 0.00 0.82
20150121 B 0.33 0.09 1.33 0.76 NaN NaN
