AWK else if to Python (Pandas)


I have the following data:
data = {'f_geno': ["AA", "AA", "AA", "BB", "BB", "BB", "AB", "AB", "AB"],
'ch_geno': ["AA", "AB", "BB", "AA", "AB", "BB", "AA", "BB", "AB"],
'freq_A': [0.50, 0.46, 0.49, 0.57, 0.55, 0.44, 0.37, 0.66, 0.46],
'freq_B': [0.50, 0.54, 0.51, 0.43, 0.45, 0.56, 0.63, 0.34, 0.54]
}
I already wrote a simple AWK script that computes a value for each row and stores it in field $5:
awk 'BEGIN {
    FS = OFS = ","
}
{
    if ($1 == "AA" && $2 == "AA") {
        $5 = (1 / $3)
    } else if ($1 == "AA" && $2 == "AB") {
        $5 = (0.5 / $3)
    } else if ($1 == "AA" && $2 == "BB") {
        $5 = (0.001)
    } else if ($1 == "BB" && $2 == "AA") {
        $5 = (0.001)
    } else if ($1 == "BB" && $2 == "AB") {
        $5 = (0.5 / $4)
    } else if ($1 == "BB" && $2 == "BB") {
        $5 = (1 / $4)
    } else if ($1 == "AB" && $2 == "AA") {
        $5 = (0.5 / $3)
    } else if ($1 == "AB" && $2 == "BB") {
        $5 = (0.5 / $4)
    } else {
        $5 = (($3 + $4) / (4 * $3 * $4))
    }
    print
}'
I would like to do the same as above but in Python. Can someone help, please?

You can pass a function to .apply() with axis=1, which calls it once per row:
import pandas as pd

def condition(x) -> float:
    # Each test mirrors one or more branches of the AWK script.
    if x.f_geno == "AA" and x.ch_geno == "AA":
        return 1 / x.freq_A
    if (x.f_geno == "AA" and x.ch_geno == "AB") or (x.f_geno == "AB" and x.ch_geno == "AA"):
        return 0.5 / x.freq_A
    if (x.f_geno == "AA" and x.ch_geno == "BB") or (x.f_geno == "BB" and x.ch_geno == "AA"):
        return 0.001
    if (x.f_geno == "BB" and x.ch_geno == "AB") or (x.f_geno == "AB" and x.ch_geno == "BB"):
        return 0.5 / x.freq_B
    if x.f_geno == "BB" and x.ch_geno == "BB":
        return 1 / x.freq_B
    return (x.freq_A + x.freq_B) / (4 * x.freq_A * x.freq_B)

df = pd.DataFrame(data=data)
df["result"] = df.apply(condition, axis=1)
print(df)
Output:
  f_geno ch_geno  freq_A  freq_B    result
0     AA      AA    0.50    0.50  2.000000
1     AA      AB    0.46    0.54  1.086957
2     AA      BB    0.49    0.51  0.001000
3     BB      AA    0.57    0.43  0.001000
4     BB      AB    0.55    0.45  1.111111
5     BB      BB    0.44    0.56  1.785714
6     AB      AA    0.37    0.63  1.351351
7     AB      BB    0.66    0.34  1.470588
8     AB      AB    0.46    0.54  1.006441
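As a hedged alternative to the chain of if statements, the AWK branches can be written as a lookup table keyed by the genotype pair. This is only a sketch: the names RULES, fallback, and score are made up for illustration and do not come from the answer above.

```python
import pandas as pd

data = {
    "f_geno": ["AA", "AA", "AA", "BB", "BB", "BB", "AB", "AB", "AB"],
    "ch_geno": ["AA", "AB", "BB", "AA", "AB", "BB", "AA", "BB", "AB"],
    "freq_A": [0.50, 0.46, 0.49, 0.57, 0.55, 0.44, 0.37, 0.66, 0.46],
    "freq_B": [0.50, 0.54, 0.51, 0.43, 0.45, 0.56, 0.63, 0.34, 0.54],
}
df = pd.DataFrame(data)

# (f_geno, ch_geno) -> function of (freq_A, freq_B).
# One entry per "if"/"else if" branch of the AWK script.
RULES = {
    ("AA", "AA"): lambda a, b: 1 / a,
    ("AA", "AB"): lambda a, b: 0.5 / a,
    ("AA", "BB"): lambda a, b: 0.001,
    ("BB", "AA"): lambda a, b: 0.001,
    ("BB", "AB"): lambda a, b: 0.5 / b,
    ("BB", "BB"): lambda a, b: 1 / b,
    ("AB", "AA"): lambda a, b: 0.5 / a,
    ("AB", "BB"): lambda a, b: 0.5 / b,
}


def fallback(a, b):
    # The final "else" branch of the AWK script.
    return (a + b) / (4 * a * b)


def score(row):
    # Pick the rule for this genotype pair, or the fallback if none matches.
    rule = RULES.get((row.f_geno, row.ch_geno), fallback)
    return rule(row.freq_A, row.freq_B)


df["result"] = df.apply(score, axis=1)
print(df)
```

Keeping the branches in a dictionary makes it easy to compare them one by one against the AWK source, at the cost of still calling a Python function per row.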

If performance matters, use numpy.select with boolean masks, combining them with & (element-wise AND) and | (element-wise OR):
import numpy as np
import pandas as pd

df = pd.DataFrame(data=data)
faa = df.f_geno == "AA"
chaa = df.ch_geno == "AA"
fab = df.f_geno == "AB"
chab = df.ch_geno == "AB"
fbb = df.f_geno == "BB"
chbb = df.ch_geno == "BB"
# Each mask is paired with the value at the same position in vals.
masks = [(faa & chaa),
         (faa & chab) | (fab & chaa),
         (faa & chbb) | (fbb & chaa),
         (fbb & chbb),
         (fbb & chab) | (fab & chbb)]
vals = [1 / df.freq_A,
        0.5 / df.freq_A,
        0.001,
        1 / df.freq_B,
        0.5 / df.freq_B]
# Rows that match no mask get the default value.
default = (df.freq_A + df.freq_B) / (4 * df.freq_A * df.freq_B)
df["result"] = np.select(masks, vals, default=default)
print(df)
f_geno ch_geno freq_A freq_B result
0 AA AA 0.50 0.50 2.000000
1 AA AB 0.46 0.54 1.086957
2 AA BB 0.49 0.51 0.001000
3 BB AA 0.57 0.43 0.001000
4 BB AB 0.55 0.45 1.111111
5 BB BB 0.44 0.56 1.785714
6 AB AA 0.37 0.63 1.351351
7 AB BB 0.66 0.34 1.470588
8 AB AB 0.46 0.54 1.006441
Performance with 90k rows:
# 90k rows; masks, vals, and default have to be rebuilt from the enlarged df before timing np.select
df = pd.DataFrame(data=data)
df = pd.concat([df] * 10000, ignore_index=True)
In [98]: %timeit df["result"] = df.apply(condition, axis=1)
5.96 s ± 585 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [99]: %timeit df["result"] = np.select(masks, vals, default=default)
1.59 ms ± 20.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
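Because the masks are built from a specific DataFrame, they must match its length. One way to avoid stale masks is to wrap the np.select logic in a function that rebuilds them on each call. The following is a minimal sketch under that assumption; the function name awk_result is hypothetical.

```python
import numpy as np
import pandas as pd


def awk_result(df: pd.DataFrame) -> np.ndarray:
    """Rebuild the masks from df on every call so they always match its length."""
    f, ch = df["f_geno"], df["ch_geno"]
    a, b = df["freq_A"], df["freq_B"]
    masks = [
        (f == "AA") & (ch == "AA"),
        ((f == "AA") & (ch == "AB")) | ((f == "AB") & (ch == "AA")),
        ((f == "AA") & (ch == "BB")) | ((f == "BB") & (ch == "AA")),
        (f == "BB") & (ch == "BB"),
        ((f == "BB") & (ch == "AB")) | ((f == "AB") & (ch == "BB")),
    ]
    vals = [1 / a, 0.5 / a, 0.001, 1 / b, 0.5 / b]
    default = (a + b) / (4 * a * b)
    return np.select(masks, vals, default=default)


small = pd.DataFrame({
    "f_geno": ["AA", "AA", "AA", "BB", "BB", "BB", "AB", "AB", "AB"],
    "ch_geno": ["AA", "AB", "BB", "AA", "AB", "BB", "AA", "BB", "AB"],
    "freq_A": [0.50, 0.46, 0.49, 0.57, 0.55, 0.44, 0.37, 0.66, 0.46],
    "freq_B": [0.50, 0.54, 0.51, 0.43, 0.45, 0.56, 0.63, 0.34, 0.54],
})
big = pd.concat([small] * 10000, ignore_index=True)  # 90k rows
big["result"] = awk_result(big)
print(big.head(9))
```

With this shape, timing awk_result(big) also includes the cost of building the masks, which is arguably the fairer comparison with the .apply() version.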

Related

How to make a python program that calculates a result for each row of the input table?

I am trying to make a Python program that will calculate a result based on a formula, given factors and an input dataframe.
I have a number of cars (N_cars) on a given length of the road (l) and their average speed (v):
input_columns = ['l', 'N_cars', 'v']
input_data = [[3.5, 1000, 100], [5.7, 500, 110],
[10, 367, 110], [11.1, 1800, 95],
[2.8, 960, 105], [4.7, 800, 120],
[10.4, 103, 111], [20.1, 1950, 115]]
input_df = pd.DataFrame(input_data, columns=input_columns)
input_df
l N_cars v
0 3.5 1000 100
1 5.7 500 110
2 10.0 367 110
3 11.1 1800 95
4 2.8 960 105
5 4.7 800 120
6 10.4 103 111
7 20.1 1950 115
I also know the factors needed for the formula for each category of car, and I know the percentage of each category. I also have different options for each category (3 options that I have here are just an example, there are many more options).
factors_columns = ['category', 'category %', 'option', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
factors_data = [['A', 58, 'opt_1', 0.000011, 0.23521, 0.93847, 0.39458, 0.00817, 0.24566, 0.0010, 0],
['A', 58, 'opt_2', 0.000011, 0.23521, 0.93145, 0.39458, 0.00467, 0.24566, 0.0010, 0],
['A', 58, 'opt_3', 0.000011, 0.23521, 0.93145, 0.39458, 0.00467, 0.24566, 0.0010, 0],
['B', 22, 'opt_1', 0.002452, 0.48327, 0.83773, 0.92852, 0.00871, 0.29568, 0.0009, 0.02],
['B', 22, 'opt_2', 0.002899, 0.49327, 0.83773, 0.92852, 0.00871, 0.30468, 0.0009, 0.02],
['B', 22, 'opt_3', 0.002452, 0.48327, 0.83773, 0.92852, 0.00771, 0.29568, 0.0119, 0.01],
['C', 17, 'opt_1', 0.082583, 0.39493, 0.02462, 0.82714, 0.00918, 0.28572, 0.0012, 0],
['C', 17, 'opt_2', 0.072587, 0.35493, 0.02852, 0.82723, 0.00912, 0.29572, 0.0018, 0],
['C', 17, 'opt_3', 0.082583, 0.39493, 0.02852, 0.82714, 0.00962, 0.28572, 0.0012, 0.01],
['D', 3, 'opt_1', 0.018327, 0.32342, 0.82529, 0.92752, 0.00988, 0.21958, 0.0016, 0],
['D', 3, 'opt_2', 0.014427, 0.32342, 0.82729, 0.92752, 0.00968, 0.22558, 0.0026, 0],
['D', 3, 'opt_3', 0.018327, 0.32342, 0.82729, 0.94452, 0.00988, 0.21258, 0.0016, 0]]
factors_df = pd.DataFrame(factors_data, columns=factors_columns)
factors_df
category category % option a b c d e f g h
0 A 58 opt_1 0.000011 0.23521 0.93847 0.39458 0.00817 0.24566 0.0010 0.00
1 A 58 opt_2 0.000011 0.23521 0.93145 0.39458 0.00467 0.24566 0.0010 0.00
2 A 58 opt_3 0.000011 0.23521 0.93145 0.39458 0.00467 0.24566 0.0010 0.00
3 B 22 opt_1 0.002452 0.48327 0.83773 0.92852 0.00871 0.29568 0.0009 0.02
4 B 22 opt_2 0.002899 0.49327 0.83773 0.92852 0.00871 0.30468 0.0009 0.02
5 B 22 opt_3 0.002452 0.48327 0.83773 0.92852 0.00771 0.29568 0.0119 0.01
6 C 17 opt_1 0.082583 0.39493 0.02462 0.82714 0.00918 0.28572 0.0012 0.00
7 C 17 opt_2 0.072587 0.35493 0.02852 0.82723 0.00912 0.29572 0.0018 0.00
8 C 17 opt_3 0.082583 0.39493 0.02852 0.82714 0.00962 0.28572 0.0012 0.01
9 D 3 opt_1 0.018327 0.32342 0.82529 0.92752 0.00988 0.21958 0.0016 0.00
10 D 3 opt_2 0.014427 0.32342 0.82729 0.92752 0.00968 0.22558 0.0026 0.00
11 D 3 opt_3 0.018327 0.32342 0.82729 0.94452 0.00988 0.21258 0.0016 0.00
For each option (opt_1, opt_2, opt_3), I have to calculate the result based on this formula (factors are taken from the factors table, but v is coming from the input table):
formula = ( (a*v*v) + (b*v) + c + (d/v) ) / ( (e*v*v) + (f*v) + g) * (1 - h)
result = l * N_cars * formula
However, I have to take into account the percentage of each category of car. For each row of input_df, I have to perform the calculation three times, once for each of the three options. For example, for index 0 of input_df (N_cars=1000, v=100, l=3.5), the output should be something like this:
# for opt_1:
result = 3.5 * 1000 * ((58% of category A {formula for index 0 of factors_df}) +
                       (22% of category B {formula for index 3 of factors_df}) +
                       (17% of category C {formula for index 6 of factors_df}) +
                       (3% of category D {formula for index 9 of factors_df}) )
# for opt_2:
result = 3.5 * 1000 * ((58% of category A {formula for index 1 of factors_df}) +
                       (22% of category B {formula for index 4 of factors_df}) +
                       (17% of category C {formula for index 7 of factors_df}) +
                       (3% of category D {formula for index 10 of factors_df}) )
# for opt_3:
result = 3.5 * 1000 * ((58% of category A {formula for index 2 of factors_df}) +
                       (22% of category B {formula for index 5 of factors_df}) +
                       (17% of category C {formula for index 8 of factors_df}) +
                       (3% of category D {formula for index 11 of factors_df}) )
So, as an output, for each of the rows in input_df, I should have three results, one for each of the three options.
I can do the calculation manually for each step, but I am having trouble writing a loop that does it automatically for each input row and all 3 options, then moves on to the next input row, and so on until the last one.

Solution
Not sure what your expected results are, but I believe this does what you're asking for:
def formula(g, *, l, N_cars, v):
    x = (1 - g.h) * (g.a * v*v + g.b*v + g.c + g.d/v) / (g.e * v*v + g.f*v + g.g)
    return N_cars * l * (x * g.pct / 100).sum()

groups = factors_df.rename(columns={"category %": "pct"}).groupby("option")
result = input_df.apply(lambda r: groups.apply(lambda g: formula(g, **r)), axis=1)
Output:
In [5]: input_df.join(result)
Out[5]:
l N_cars v opt_1 opt_2 opt_3
0 3.5 1000 100 5411.685077 5115.048256 5500.985916
1 5.7 500 110 4425.339734 4169.893681 4483.595803
2 10.0 367 110 5698.595376 5369.652565 5773.612841
3 11.1 1800 95 30820.717985 29180.106606 31384.785443
4 2.8 960 105 4165.270216 3930.726187 4226.877893
5 4.7 800 120 5860.057879 5506.509637 5919.496692
6 10.4 103 111 1663.960420 1567.455541 1685.339848
7 20.1 1950 115 60976.735053 57375.300546 61685.075902
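For comparison, here is a hedged sketch of a variant without nested apply calls: it pairs every input row with every factor row through a cross join and then sums per input row and option. It assumes pandas 1.2 or later (for how="cross"), uses only the first four input rows to stay short, and names the percentage column pct directly. The variable names pairs, per_option, and result are illustrative.

```python
import pandas as pd

input_df = pd.DataFrame(
    [[3.5, 1000, 100], [5.7, 500, 110], [10, 367, 110], [11.1, 1800, 95]],
    columns=["l", "N_cars", "v"],
)
cols = ["category", "pct", "option", "a", "b", "c", "d", "e", "f", "g", "h"]
factors_df = pd.DataFrame([
    ["A", 58, "opt_1", 0.000011, 0.23521, 0.93847, 0.39458, 0.00817, 0.24566, 0.0010, 0],
    ["A", 58, "opt_2", 0.000011, 0.23521, 0.93145, 0.39458, 0.00467, 0.24566, 0.0010, 0],
    ["A", 58, "opt_3", 0.000011, 0.23521, 0.93145, 0.39458, 0.00467, 0.24566, 0.0010, 0],
    ["B", 22, "opt_1", 0.002452, 0.48327, 0.83773, 0.92852, 0.00871, 0.29568, 0.0009, 0.02],
    ["B", 22, "opt_2", 0.002899, 0.49327, 0.83773, 0.92852, 0.00871, 0.30468, 0.0009, 0.02],
    ["B", 22, "opt_3", 0.002452, 0.48327, 0.83773, 0.92852, 0.00771, 0.29568, 0.0119, 0.01],
    ["C", 17, "opt_1", 0.082583, 0.39493, 0.02462, 0.82714, 0.00918, 0.28572, 0.0012, 0],
    ["C", 17, "opt_2", 0.072587, 0.35493, 0.02852, 0.82723, 0.00912, 0.29572, 0.0018, 0],
    ["C", 17, "opt_3", 0.082583, 0.39493, 0.02852, 0.82714, 0.00962, 0.28572, 0.0012, 0.01],
    ["D", 3, "opt_1", 0.018327, 0.32342, 0.82529, 0.92752, 0.00988, 0.21958, 0.0016, 0],
    ["D", 3, "opt_2", 0.014427, 0.32342, 0.82729, 0.92752, 0.00968, 0.22558, 0.0026, 0],
    ["D", 3, "opt_3", 0.018327, 0.32342, 0.82729, 0.94452, 0.00988, 0.21258, 0.0016, 0],
], columns=cols)

# Pair every input row with every factor row; "index" remembers the input row.
pairs = input_df.reset_index().merge(factors_df, how="cross")
v = pairs["v"]
num = pairs["a"] * v**2 + pairs["b"] * v + pairs["c"] + pairs["d"] / v
den = pairs["e"] * v**2 + pairs["f"] * v + pairs["g"]
# Per-category formula value, weighted by that category's share of cars.
pairs["weighted"] = num / den * (1 - pairs["h"]) * pairs["pct"] / 100

# Sum the weighted values per (input row, option), then scale by l * N_cars.
per_option = pairs.groupby(["index", "option"])["weighted"].sum().unstack("option")
result = per_option.mul(input_df["l"] * input_df["N_cars"], axis=0)
print(input_df.join(result))
```

The cross join creates len(input_df) * len(factors_df) intermediate rows, so this trades memory for fewer Python-level function calls.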
Explanation
The first step is to group factors_df by option. Just to show what that looks like:
In [6]: groups.apply(print)
category pct option a b ... d e f g h
0 A 58 opt_1 0.000011 0.23521 ... 0.39458 0.00817 0.24566 0.0010 0.00
3 B 22 opt_1 0.002452 0.48327 ... 0.92852 0.00871 0.29568 0.0009 0.02
6 C 17 opt_1 0.082583 0.39493 ... 0.82714 0.00918 0.28572 0.0012 0.00
9 D 3 opt_1 0.018327 0.32342 ... 0.92752 0.00988 0.21958 0.0016 0.00
[4 rows x 11 columns]
category pct option a b ... d e f g h
1 A 58 opt_2 0.000011 0.23521 ... 0.39458 0.00467 0.24566 0.0010 0.00
4 B 22 opt_2 0.002899 0.49327 ... 0.92852 0.00871 0.30468 0.0009 0.02
7 C 17 opt_2 0.072587 0.35493 ... 0.82723 0.00912 0.29572 0.0018 0.00
10 D 3 opt_2 0.014427 0.32342 ... 0.92752 0.00968 0.22558 0.0026 0.00
[4 rows x 11 columns]
category pct option a b ... d e f g h
2 A 58 opt_3 0.000011 0.23521 ... 0.39458 0.00467 0.24566 0.0010 0.00
5 B 22 opt_3 0.002452 0.48327 ... 0.92852 0.00771 0.29568 0.0119 0.01
8 C 17 opt_3 0.082583 0.39493 ... 0.82714 0.00962 0.28572 0.0012 0.01
11 D 3 opt_3 0.018327 0.32342 ... 0.94452 0.00988 0.21258 0.0016 0.00
Note that I renamed the category % to pct. This isn't necessary, but made accessing that column in the formula() function a bit cleaner (g.pct vs g["category %"]).
The next step was to implement formula() in such a way as to accept a group from factors_df as an argument:
def formula(g, *, l, N_cars, v):
    x = (1 - g.h) * (g.a * v*v + g.b*v + g.c + g.d/v) / (g.e * v*v + g.f*v + g.g)
    return N_cars * l * (x * g.pct / 100).sum()
In the function signature, g is a group from factors_df. It is followed by the keyword-only arguments l, N_cars, and v, which come from a single row of input_df at a time.
Each of the three groups shown above will be entered into the formula() function one at a time, in their entirety. For example, during one call to formula(), the g argument will hold all of this data:
category pct option a b ... d e f g h
0 A 58 opt_1 0.000011 0.23521 ... 0.39458 0.00817 0.24566 0.0010 0.00
3 B 22 opt_1 0.002452 0.48327 ... 0.92852 0.00871 0.29568 0.0009 0.02
6 C 17 opt_1 0.082583 0.39493 ... 0.82714 0.00918 0.28572 0.0012 0.00
9 D 3 opt_1 0.018327 0.32342 ... 0.92752 0.00988 0.21958 0.0016 0.00
When the formula uses something like g.e, it's accessing the entire e column, and is taking advantage of vectorization to perform the arithmetic calculations on the entire column at the same time. When the dust settles, x will be a Series where each item in the series will be the result of the formula for each of the four categories of car. Here's an example:
0 0.231242
3 0.619018
6 7.188941
9 1.792376
Notice the indices? Those correspond to category A, B, C, and D from factors_df, respectively.
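The example Series can be reproduced with a small sketch that builds only the opt_1 group and evaluates the formula at v = 100 (the speed in the first row of input_df). The frame is typed in by hand here with the original row labels 0, 3, 6, and 9; the variable names opt_1 and weighted are illustrative.

```python
import pandas as pd

# The four opt_1 rows of factors_df, keeping their original index labels.
opt_1 = pd.DataFrame(
    {
        "pct": [58, 22, 17, 3],
        "a": [0.000011, 0.002452, 0.082583, 0.018327],
        "b": [0.23521, 0.48327, 0.39493, 0.32342],
        "c": [0.93847, 0.83773, 0.02462, 0.82529],
        "d": [0.39458, 0.92852, 0.82714, 0.92752],
        "e": [0.00817, 0.00871, 0.00918, 0.00988],
        "f": [0.24566, 0.29568, 0.28572, 0.21958],
        "g": [0.0010, 0.0009, 0.0012, 0.0016],
        "h": [0.00, 0.02, 0.00, 0.00],
    },
    index=[0, 3, 6, 9],
)

v = 100
g = opt_1
# Whole-column arithmetic: one formula value per category (A, B, C, D).
x = (1 - g.h) * (g.a * v*v + g.b*v + g.c + g.d/v) / (g.e * v*v + g.f*v + g.g)
print(x)

# Weight each category by its percentage and add them up.
weighted = (x * g.pct / 100).sum()
print(weighted)
```

Multiplying weighted by l * N_cars for the same input row then gives the opt_1 value for that row.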
From there, we need to call formula() on each row of input_df, using the axis argument of pd.DataFrame.apply():
input_df.apply(lambda r: groups.apply(lambda g: formula(g, **r)), axis=1)
The lambda r is an anonymous function object being passed to apply, being applied over axis 1, meaning that r will be a single row from input_df at a time, for example:
In [13]: input_df.apply(print, axis=1)
l 3.5
N_cars 1000.0
v 100.0
Name: 0, dtype: float64
.
.
.
Now, on each row-wise apply, we're also applying the formula() function on the groups groupby object with lambda g: formula(g, **r). The **r unpacks the row from input_df as keyword arguments, which helps to ensure that the values for v, l, and N_cars aren't misused in the formula (no need to worry about which order they're passed into the formula() function).
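The keyword unpacking can be tried in isolation. In this sketch, travel_metric is a made-up stand-in for formula() that only illustrates how ** maps the labels of a row onto keyword-only parameters.

```python
import pandas as pd


def travel_metric(*, l, N_cars, v):
    # Hypothetical helper with keyword-only parameters, like formula().
    return l * N_cars / v


# The labels are deliberately in a different order than the parameters.
row = pd.Series({"v": 100, "l": 3.5, "N_cars": 1000})
value = travel_metric(**row)
print(value)

# A row with a missing label fails loudly instead of silently shifting values.
try:
    travel_metric(**row.drop("v"))
    missing_label_rejected = False
except TypeError:
    missing_label_rejected = True
print(missing_label_rejected)
```

Because the match is by name, reordering the columns of input_df does not change the result, while a renamed or missing column raises a TypeError.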

Here is the code I wrote. It's somewhat long; maybe you (or someone else) can modify it and make it shorter.
# Transforming factors_df
df = factors_df.pivot(columns=["category", "option"])
df.reset_index(inplace=True)
# Renaming column names for each combination of option and category
df.columns = [s3 + s2 + s1 for (s1, s2, s3) in df.columns.to_list()]
df.drop(columns=["index"], inplace=True)
# Flattening to a single row to be able to apply formula
df = pd.DataFrame(df.max()).T
# Merging input with transformed factors data
input_df["tmp"] = 1
df["tmp"] = 1
df = pd.merge(input_df, df, on="tmp", how="left")
df.drop("tmp", axis=1, inplace=True)
# Calculating values for opt_1 using the formula
df["opt_1_value"] = (
df["l"]
* df["N_cars"]
* (
(
df["opt_1Acategory %"]
/ 100
* (
df["opt_1Aa"] * df["v"] * df["v"]
+ df["opt_1Ab"] * df["v"]
+ df["opt_1Ac"]
+ df["opt_1Ad"] / df["v"]
)
/ (
(
df["opt_1Ae"] * df["v"] * df["v"]
+ df["opt_1Af"] * df["v"]
+ df["opt_1Ag"]
)
* (1 - df["opt_1Ah"])
)
)
+ (
df["opt_1Bcategory %"]
/ 100
* (
df["opt_1Ba"] * df["v"] * df["v"]
+ df["opt_1Bb"] * df["v"]
+ df["opt_1Bc"]
+ df["opt_1Bd"] / df["v"]
)
/ (
(
df["opt_1Be"] * df["v"] * df["v"]
+ df["opt_1Bf"] * df["v"]
+ df["opt_1Bg"]
)
* (1 - df["opt_1Bh"])
)
)
+ (
df["opt_1Ccategory %"]
/ 100
* (
df["opt_1Ca"] * df["v"] * df["v"]
+ df["opt_1Cb"] * df["v"]
+ df["opt_1Cc"]
+ df["opt_1Cd"] / df["v"]
)
/ (
(
df["opt_1Ce"] * df["v"] * df["v"]
+ df["opt_1Cf"] * df["v"]
+ df["opt_1Cg"]
)
* (1 - df["opt_1Ch"])
)
)
)
)
# Calculating values for opt_2 using the formula
df["opt_2_value"] = (
df["l"]
* df["N_cars"]
* (
(
df["opt_2Acategory %"]
/ 100
* (
df["opt_2Aa"] * df["v"] * df["v"]
+ df["opt_2Ab"] * df["v"]
+ df["opt_2Ac"]
+ df["opt_2Ad"] / df["v"]
)
/ (
(
df["opt_2Ae"] * df["v"] * df["v"]
+ df["opt_2Af"] * df["v"]
+ df["opt_2Ag"]
)
* (1 - df["opt_2Ah"])
)
)
+ (
df["opt_2Bcategory %"]
/ 100
* (
df["opt_2Ba"] * df["v"] * df["v"]
+ df["opt_2Bb"] * df["v"]
+ df["opt_2Bc"]
+ df["opt_2Bd"] / df["v"]
)
/ (
(
df["opt_2Be"] * df["v"] * df["v"]
+ df["opt_2Bf"] * df["v"]
+ df["opt_2Bg"]
)
* (1 - df["opt_2Bh"])
)
)
+ (
df["opt_2Ccategory %"]
/ 100
* (
df["opt_2Ca"] * df["v"] * df["v"]
+ df["opt_2Cb"] * df["v"]
+ df["opt_2Cc"]
+ df["opt_2Cd"] / df["v"]
)
/ (
(
df["opt_2Ce"] * df["v"] * df["v"]
+ df["opt_2Cf"] * df["v"]
+ df["opt_2Cg"]
)
* (1 - df["opt_2Ch"])
)
)
)
)
# Calculating values for opt_3 using the formula
df["opt_3_value"] = (
df["l"]
* df["N_cars"]
* (
(
df["opt_3Acategory %"]
/ 100
* (
df["opt_3Aa"] * df["v"] * df["v"]
+ df["opt_3Ab"] * df["v"]
+ df["opt_3Ac"]
+ df["opt_3Ad"] / df["v"]
)
/ (
(
df["opt_3Ae"] * df["v"] * df["v"]
+ df["opt_3Af"] * df["v"]
+ df["opt_3Ag"]
)
* (1 - df["opt_3Ah"])
)
)
+ (
df["opt_3Bcategory %"]
/ 100
* (
df["opt_3Ba"] * df["v"] * df["v"]
+ df["opt_3Bb"] * df["v"]
+ df["opt_3Bc"]
+ df["opt_3Bd"] / df["v"]
)
/ (
(
df["opt_3Be"] * df["v"] * df["v"]
+ df["opt_3Bf"] * df["v"]
+ df["opt_3Bg"]
)
* (1 - df["opt_3Bh"])
)
)
+ (
df["opt_3Ccategory %"]
/ 100
* (
df["opt_3Ca"] * df["v"] * df["v"]
+ df["opt_3Cb"] * df["v"]
+ df["opt_3Cc"]
+ df["opt_3Cd"] / df["v"]
)
/ (
(
df["opt_3Ce"] * df["v"] * df["v"]
+ df["opt_3Cf"] * df["v"]
+ df["opt_3Cg"]
)
* (1 - df["opt_3Ch"])
)
)
)
)
# Removing unnecessary columns
df_final = df[["l", "N_cars", "v", "opt_1_value", "opt_2_value", "opt_3_value"]]
print(df_final)
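Output (note that the code above sums only categories A, B, and C and multiplies the denominator by (1 - h) rather than the whole quotient, so these values do not match the other answers):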
l N_cars v opt_1_value opt_2_value opt_3_value
0 3.5 1000 100 1496.002370 1420.656629 1534.748740
1 5.7 500 110 750.997279 710.944885 767.411691
2 10.0 367 110 551.157686 521.754019 562.906668
3 11.1 1800 95 2685.551348 2554.477141 2756.164589
4 2.8 960 105 1439.467965 1364.815604 1475.082027
5 4.7 800 120 1206.116125 1138.614075 1229.225287
6 10.4 103 111 154.744048 146.445615 157.990346
7 20.1 1950 115 2933.825622 2773.297776 2990.828374

Another way to do it, though not nearly as elegant as #ddejhon's solution:
def formula(input_index, factors_index):
    formula = ((factors_df.loc[factors_index,'a']*input_df['v'][input_index]**2)+
               (factors_df.loc[factors_index,'b']*input_df['v'][input_index])+
               (factors_df.loc[factors_index,'c'])+
               (factors_df.loc[factors_index,'d']/input_df['v'][input_index])
               )/(
               (factors_df.loc[factors_index,'e']*input_df['v'][input_index]**2)+
               (factors_df.loc[factors_index,'f']*input_df['v'][input_index])+
               (factors_df.loc[factors_index,'g'])
               )*(1-factors_df.loc[factors_index,'h'])
    return formula
index_list = [factors_df[factors_df['option'] == opt].index.tolist() for opt in factors_df['option'].unique().tolist()]
Edit1: got rid of that ugly nested for structure and replaced it with list comprehension
output_df = pd.DataFrame(np.repeat(input_df.values, len(factors_df['option'].unique()), axis=0))
output_df.columns = input_df.columns
output_df['option'] = factors_df['option'].unique().tolist()*len(input_df.index)
output_df['formula'] = [n for sub_list in [[sum(factors_df['category %'].unique()[k]/100 * formula(i,j[k])
for k in range(len(factors_df['category'].unique())))
for j in index_list] for i in input_df.index] for n in sub_list]
output_df['result'] = output_df['l'] * output_df['N_cars'] * output_df['formula']
Output:
output_df
l N_cars v option formula result
0 3.5 1000.0 100.0 opt_1 1.546196 5411.685077
1 3.5 1000.0 100.0 opt_2 1.461442 5115.048256
2 3.5 1000.0 100.0 opt_3 1.571710 5500.985916
3 5.7 500.0 110.0 opt_1 1.552751 4425.339734
4 5.7 500.0 110.0 opt_2 1.463121 4169.893681
5 5.7 500.0 110.0 opt_3 1.573192 4483.595803
6 10.0 367.0 110.0 opt_1 1.552751 5698.595376
7 10.0 367.0 110.0 opt_2 1.463121 5369.652565
8 10.0 367.0 110.0 opt_3 1.573192 5773.612841
9 11.1 1800.0 95.0 opt_1 1.542578 30820.717985
10 11.1 1800.0 95.0 opt_2 1.460466 29180.106606
11 11.1 1800.0 95.0 opt_3 1.570810 31384.785443
12 2.8 960.0 105.0 opt_1 1.549580 4165.270216
13 2.8 960.0 105.0 opt_2 1.462324 3930.726187
14 2.8 960.0 105.0 opt_3 1.572499 4226.877893
15 4.7 800.0 120.0 opt_1 1.558526 5860.057879
16 4.7 800.0 120.0 opt_2 1.464497 5506.509637
17 4.7 800.0 120.0 opt_3 1.574334 5919.496692
18 10.4 103.0 111.0 opt_1 1.553361 1663.960420
19 10.4 103.0 111.0 opt_2 1.463271 1567.455541
20 10.4 103.0 111.0 opt_3 1.573319 1685.339848
21 20.1 1950.0 115.0 opt_1 1.555727 60976.735053
22 20.1 1950.0 115.0 opt_2 1.463842 57375.300546
23 20.1 1950.0 115.0 opt_3 1.573800 61685.075902

Calculating ALL nested level aggregations of specific Column ( SUM, AVG, STDEV) in dataframe

I have a table that looks like below (simplified):
col_A col_B col_C
A 37 2
B 28 7
C 10 5
D 11 5
E 99 4
I would like to get a table with all nested combinations of each level of col_A and calculate, say, an average within the subgroup: for example the choose-any-2 table would look like (10 unique level combinations):
Grp_2 AVG (col_B/col_C)
A,B 7.76
A,C 6.61
A,D 7.55
… …
D,E 12.99
Choose-any-4 would look like (5 unique level combinations):
Grp_4 AVG (col_B/col_C)
A,B,C,D 7.84
A,B,C,E 6.68
A,C,D,E 7.63
… …
B,C,D,E 13.12
Languages, in order of preference: R, SQL (Postgres, ANSI), Python.
My current solution in R (below) does not scale well as the number of levels of col_A grows:
require(tidyverse)
df <- tibble(col_A=c("A", "B","C", "D", "E"), col_B=c(37,28,10,11,99), col_C=c(2,7,5,5,4))
nested_subgroup_agg <- function(choice = 2, mydf = NULL) {
  library(tidyverse)
  dfx <-
    combn(c("A", "B", "C", "D", "E"), choice) %>%
    t() %>%
    as_tibble()
  try(if (choice <= 1) {
    stop("Can't Choose less than 2 levels at a time")
  } else if (choice == 2) {
    val <- map_dbl(1:nrow(dfx), function(i) {
      (mydf$col_B[mydf$col_A == dfx$V1[i]] + mydf$col_B[mydf$col_A == dfx$V2[i]]) /
        (mydf$col_C[mydf$col_A == dfx$V1[i]] + mydf$col_C[mydf$col_A == dfx$V2[i]])
    })
  } else if (choice == 3) {
    val <- map_dbl(1:nrow(dfx), function(i) {
      (mydf$col_B[mydf$col_A == dfx$V1[i]] + mydf$col_B[mydf$col_A == dfx$V2[i]] + mydf$col_B[mydf$col_A == dfx$V3[i]]) /
        (mydf$col_C[mydf$col_A == dfx$V1[i]] + mydf$col_C[mydf$col_A == dfx$V2[i]] + mydf$col_C[mydf$col_A == dfx$V3[i]])
    })
  } else if (choice == 4) {
    val <- map_dbl(1:nrow(dfx), function(i) {
      (mydf$col_B[mydf$col_A == dfx$V1[i]] + mydf$col_B[mydf$col_A == dfx$V2[i]] + mydf$col_B[mydf$col_A == dfx$V3[i]] + mydf$col_B[mydf$col_A == dfx$V4[i]]) /
        (mydf$col_C[mydf$col_A == dfx$V1[i]] + mydf$col_C[mydf$col_A == dfx$V2[i]] + mydf$col_C[mydf$col_A == dfx$V3[i]] + mydf$col_C[mydf$col_A == dfx$V4[i]])
    })
  })
  dfx$val <- val
  dfx
}
## Example
df <-
tibble(
col_A = c("A", "B", "C", "D", "E"),
col_B = c(37, 28, 10, 11, 99),
col_C = c(2, 7, 5, 5, 4)
)
nested_subgroup_agg(choice = 4, mydf = df)
Can you help improve?

An option using data.table:
nested_subgroup_agg <- function(choice=2, mydf) {
ans <- setDT(mydf)[.(g=rep(seq(choose(.N, choice)), each=choice), col_A=c(combn(col_A, choice))), on=.(col_A)][,
.(toString(col_A), sum(col_B) / sum(col_C)), g]
setnames(ans, names(ans)[-1L], c(paste0("Grp_", choice), "val"))[]
}
nested_subgroup_agg(3, DT)
output:
g Grp_3 val
1: 1 A, B, C 5.357143
2: 2 A, B, D 5.428571
3: 3 A, B, E 12.615385
4: 4 A, C, D 4.833333
5: 5 A, C, E 13.272727
6: 6 A, D, E 13.363636
7: 7 B, C, D 2.882353
8: 8 B, C, E 8.562500
9: 9 B, D, E 8.625000
10: 10 C, D, E 8.571429
data:
library(data.table)
DT <- fread("col_A col_B col_C
A 37 2
B 28 7
C 10 5
D 11 5
E 99 4")

One idea is to use combn to get all combinations of the rows (assuming you have one letter per row) and then simply aggregate every 2 rows, i.e.
#get a df with all combination of rows
new_d <- dd[c(combn(nrow(dd), 2)),]
#Aggregate
#You can use `aggregate` or `lapply(split())`
lapply(split(new_d, rep(seq((nrow(new_d)) / 2), each = 2)), function(i)sum(i$col_C))
DATA
dput(dd)
structure(list(col_A = structure(1:5, .Label = c("A", "B", "C",
"D", "E"), class = "factor"), col_B = c(37L, 28L, 10L, 11L, 99L
), col_C = c(2L, 7L, 5L, 5L, 4L)), class = "data.frame", row.names = c(NA,
-5L))
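Since Python was listed as an acceptable language, here is a minimal sketch of the same idea with itertools.combinations. It follows the R code in the question and the data.table answer in computing sum(col_B) / sum(col_C) per subgroup. The rows dictionary and the variable grp3 are illustrative names, and the data is typed in by hand.

```python
from itertools import combinations

# col_A -> (col_B, col_C), copied from the table in the question.
rows = {"A": (37, 2), "B": (28, 7), "C": (10, 5), "D": (11, 5), "E": (99, 4)}


def nested_subgroup_agg(rows, choice=2):
    """Return {combination of levels: sum(col_B) / sum(col_C)} for every combination."""
    if choice < 2:
        raise ValueError("choose at least 2 levels at a time")
    out = {}
    for combo in combinations(sorted(rows), choice):
        total_b = sum(rows[level][0] for level in combo)
        total_c = sum(rows[level][1] for level in combo)
        out[combo] = total_b / total_c
    return out


grp3 = nested_subgroup_agg(rows, choice=3)
for combo, val in grp3.items():
    print(", ".join(combo), round(val, 6))
```

The function works for any value of choice without a separate branch per size, but the number of combinations still grows quickly with the number of levels.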

PANDAS NEW COLUMN BASED ON MULTIPLE CRITERIA AND COLUMNS

I want to create a new column for a big table using several criteria and columns, and I am not sure of the best way to approach it.
df = pd.DataFrame({'a': ['A', "B", "B", "C", "D"],
'b':['y','n','y','n', np.nan], 'c':[10,20,10,40,30], 'd':[.3,.1,.4,.2, .1]})
df.head()
def fun(df=df):
    df=df.copy()
    if df.a=='A' & df.b =='n':
        df['new_Col'] = df.c+df.d
    if df.a=='A' & df.b =='y':
        df['new_Col'] = df.d *2
    else:
        df['new_Col'] = 0
    return df
fun()
OR
def fun(df=df):
    df=df.copy()
    if df.a=='A' & df.b =='n':
        return df.c+df.d
    if df.a=='A' & df.b =='y':
        return df.d *2
    else:
        return 0
df['new_Col'] = df.apply(fun)
OR using np.where:
df['new_Col'] = np.where(df.a=='A' & df.b =='n', df.c+df.d,0 )
df['new_Col'] = np.where(df.a=='A' & df.b =='y', df.d *2,0 )

Looks like you need np.select
a, n, y = df.a.eq('A'), df.b.eq('n'), df.b.eq('y')
df['result'] = np.select([a & n, a & y], [df.c + df.d, df.d*2], default=0)
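A self-contained version of this approach might look like the sketch below. It uses the question's frame plus one extra row (a = 'A', b = 'n') that is not in the original data, added only so that both conditions are exercised.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": ["A", "B", "B", "C", "D", "A"],
    "b": ["y", "n", "y", "n", np.nan, "n"],  # last row is an added test case
    "c": [10, 20, 10, 40, 30, 50],
    "d": [0.3, 0.1, 0.4, 0.2, 0.1, 0.9],
})

# Boolean masks; .eq() treats the missing value in column b as "not equal".
is_a, is_n, is_y = df["a"].eq("A"), df["b"].eq("n"), df["b"].eq("y")

# The first condition that is True for a row decides its value; otherwise 0.
df["result"] = np.select(
    [is_a & is_n, is_a & is_y],
    [df["c"] + df["d"], df["d"] * 2],
    default=0,
)
print(df)
```

Building each mask once and combining them with & avoids the operator-precedence problem in expressions like df.a=='A' & df.b=='n', where & binds more tightly than ==.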

This is an arithmetic way (I added one more row to your sample for case a = 'A' and b = 'n'):
sample
Out[1369]:
a b c d
0 A y 10 0.3
1 B n 20 0.1
2 B y 10 0.4
3 C n 40 0.2
4 D NaN 30 0.1
5 A n 50 0.9
nc = df.a.eq('A') & df.b.eq('y')
mc = df.a.eq('A') & df.b.eq('n')
nr = df.d * 2
mr = df.c + df.d
df['new_col'] = nc*nr + mc*mr
Out[1371]:
a b c d new_col
0 A y 10 0.3 0.6
1 B n 20 0.1 0.0
2 B y 10 0.4 0.0
3 C n 40 0.2 0.0
4 D NaN 30 0.1 0.0
5 A n 50 0.9 50.9

Issue in applying function in Pandas data frame

Hi I have the following function to decide the winner:
def winner(T1,T2,S1,S2,PS1,PS2):
    if S1>S2:
        return T1
    elif S2>S1:
        return T2
    else:
        #print('Winner will be decided via penalty shoot out')
        Ninit = 5
        Ts1 = np.sum(np.random.random(size=Ninit))*PS1
        Ts2 = np.sum(np.random.random(size=Ninit))*PS2
        if Ts1>Ts2:
            return T1
        elif Ts2>Ts1:
            return T2
        else:
            return 'Draw'
And I have the following data frame:
df = pd.DataFrame()
df['Team1'] = ['A','B','C','D','E','F']
df['Score1'] = [1,2,3,1,2,4]
df['Team2'] = ['U','V','W','X','Y','Z']
df['Score2'] = [2,2,2,2,3,3]
df['Match'] = df['Team1'] + ' Vs '+ df['Team2']
df['Match_no']= [1,2,3,4,5,6]
df ['P1'] = [0.8,0.7,0.6,0.9,0.75,0.77]
df ['P2'] = [0.75,0.75,0.65,0.78,0.79,0.85]
I want to create a new column in which winner from each match will be assigned.
To decide the winner of each match, I used the function winner. I tested the function with arbitrary inputs and it works. When I called it on the dataframe columns
as follows:
df['Winner']= winner(df.Team1,df.Team2,df.Score1,df.Score2,df.P1,df.P2)
it showed me the following error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Can anyone advise why there is an error?
Thanks
Zep.

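Your function isn't set up to take pandas.Series as inputs: a comparison such as S1 > S2 returns a boolean Series, and using that in an if statement raises the ValueError. One option is to call the function once per row: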
df['Winner'] = [
winner(*t) for t in zip(df.Team1, df.Team2, df.Score1, df.Score2, df.P1, df.P2)]
df
Team1 Score1 Team2 Score2 Match Match_no P1 P2 Winner
0 A 1 U 2 A Vs U 1 0.80 0.75 U
1 B 2 V 2 B Vs V 2 0.70 0.75 V
2 C 3 W 2 C Vs W 3 0.60 0.65 C
3 D 1 X 2 D Vs X 4 0.90 0.78 X
4 E 2 Y 3 E Vs Y 5 0.75 0.79 Y
5 F 4 Z 3 F Vs Z 6 0.77 0.85 F
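Another way to go about it is to make the function itself work on whole columns:
def winner(T1,T2,S1,S2,PS1,PS2):
    ninit = 5
    # Note: one pair of random totals is drawn here and shared by every row.
    Ts1 = np.random.rand(ninit).sum() * PS1
    Ts2 = np.random.rand(ninit).sum() * PS2
    a = np.select(
        [S1 > S2, S2 > S1, Ts1 > Ts2, Ts2 > Ts1],
        [T1, T2, T1, T2], 'DRAW')
    return a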
df.assign(Winner=winner(df.Team1, df.Team2, df.Score1, df.Score2, df.P1, df.P2))
Team1 Score1 Team2 Score2 Match Match_no P1 P2 Winner
0 A 1 U 2 A Vs U 1 0.80 0.75 U
1 B 2 V 2 B Vs V 2 0.70 0.75 B
2 C 3 W 2 C Vs W 3 0.60 0.65 C
3 D 1 X 2 D Vs X 4 0.90 0.78 X
4 E 2 Y 3 E Vs Y 5 0.75 0.79 Y
5 F 4 Z 3 F Vs Z 6 0.77 0.85 F
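If each match should get its own penalty shoot-out draw, one possible variation is to generate one row of random numbers per match. This is a hedged sketch rather than the answer's method: winner_vectorized, n_kicks, and the seeded generator are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def winner_vectorized(df, rng, n_kicks=5):
    """Pick a winner per row; tied scores are settled by per-row random totals."""
    n = len(df)
    # One independent set of n_kicks random numbers for every match and team.
    ts1 = rng.random((n, n_kicks)).sum(axis=1) * df["P1"].to_numpy()
    ts2 = rng.random((n, n_kicks)).sum(axis=1) * df["P2"].to_numpy()
    s1, s2 = df["Score1"].to_numpy(), df["Score2"].to_numpy()
    t1 = df["Team1"].to_numpy(dtype=object)
    t2 = df["Team2"].to_numpy(dtype=object)
    # np.select uses the first True condition, so scores take priority.
    return np.select(
        [s1 > s2, s2 > s1, ts1 > ts2, ts2 > ts1],
        [t1, t2, t1, t2],
        default="Draw",
    )


df = pd.DataFrame({
    "Team1": ["A", "B", "C", "D", "E", "F"],
    "Score1": [1, 2, 3, 1, 2, 4],
    "Team2": ["U", "V", "W", "X", "Y", "Z"],
    "Score2": [2, 2, 2, 2, 3, 3],
    "P1": [0.8, 0.7, 0.6, 0.9, 0.75, 0.77],
    "P2": [0.75, 0.75, 0.65, 0.78, 0.79, 0.85],
})
rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
df["Winner"] = winner_vectorized(df, rng)
print(df)
```

Only the tied match (B Vs V) depends on the random draw; the other rows are decided by the scores alone.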

Python Pandas: how to set 2 columns at the same time?

I originally posted a simpler example because I thought it would be easier to understand, but judging by your comments I was wrong, so I have edited the question.
Here is the code. I want to do this without a loop; can it be done in pandas?
import pandas as pd
myval = [0.0,1.1, 2.2, 3.3, 4.4, 5.5,6.6, 7.7, 8.8,9.9]
s1 = [0,0,1,1,0,0,1,1,0,1]
s2 = [0,0,1,0,1,0,1,0,1,1]
posin = [10,0,0,0,0,0,0,0,0,0]
posout = [0,0,0,0,0,0,0,0,0,0]
sig = ['-']
d = {'myval' : myval, 's1' : s1, 's2' : s2}
d = pd.DataFrame(d)
'''
Normally the DataFrame should have all 6 columns,
but I can't get the part below working on the DataFrame (THAT is the problem!).
The real DataFrame has 5000+ rows, and this has to be done for 100+ sets of values,
so this approach is not viable: it is too slow.
'''
for i in range(1, len(myval)):
    if (s1[i] == 1) and (s2[i] == 1) and (posin[i-1] != 0):
        posin[i] = 0
        posout[i] = posin[i-1] / myval[i]
        sig.append('a')
    elif (s1[i] == 0) and (s2[i] == 1) and (posin[i-1] == 0):
        posin[i] = posout[i-1] * myval[i]
        posout[i] = 0
        sig.append('v')
    else:
        posin[i] = posin[i-1]
        posout[i] = posout[i-1]
        sig.append('-')
d2 = pd.DataFrame({'posin' : posin , 'posout' : posout , 'sig' : sig })
d = d.join(d2)
# the wanted result:
print(d)
myval s1 s2 posin posout sig
0 0.0 0 0 10.000000 0.000000 -
1 1.1 0 0 10.000000 0.000000 -
2 2.2 1 1 0.000000 4.545455 a
3 3.3 1 0 0.000000 4.545455 -
4 4.4 0 1 20.000000 0.000000 v
5 5.5 0 0 20.000000 0.000000 -
6 6.6 1 1 0.000000 3.030303 a
7 7.7 1 0 0.000000 3.030303 -
8 8.8 0 1 26.666667 0.000000 v
9 9.9 1 1 0.000000 2.693603 a
Any help ?
Thanks for it !!

I was hoping that something like the following might work (as suggested in the comments); however, perhaps surprisingly, this use of np.where raises ValueError: shape mismatch: objects cannot be broadcast to a single shape (it uses a 1D condition to select from 2D inputs):
np.where(df.s1 & df.s2,
pd.DataFrame({"bin": 0, "bout": df.bin.diff() / df.myval}),
np.where(df.s1,
pd.DataFrame({"bin": df.bout.diff() * df.myval, "bout": 0}),
pd.DataFrame({"bin": df.bin.diff(), "bout": df.bout.diff()})))
As an alternative to using where, I would construct this in stages:
res = pd.DataFrame({"bin": 0, "bout": df.bin.diff() / df.myval})
res.update(pd.DataFrame({"bin": df.bout.diff() * df.myval,
"bout": 0}).loc[(df.s1 == 1) & (df.s2 == 0)])
res.update(pd.DataFrame({"bin": df.bin.diff(),
"bout": df.bout.diff()}).loc[(df.s1 == 0) & (df.s2 == 0)])
Then you can assign this to the two columns in df:
df[["bin", "bout"]] = res

Code referring to Andy Hayden's answer:
import pandas as pd
myval = [0.0,1.1, 2.2, 3.3, 4.4, 5.5,6.6, 7.7, 8.8,9.9]
s1 = [0,0,1,1,0,0,1,1,0,1]
s2 = [0,0,1,0,1,0,1,0,1,1]
posin = [10,0,0,0,0,0,0,0,0,0]
posout = [0,0,0,0,0,0,0,0,0,0]
sig = ['-']
d = {'myval' : myval, 's1' : s1, 's2' : s2,'posin' : posin , 'posout' : posout }
d = pd.DataFrame(d)
res = pd.DataFrame({"posin": 10, 'sig' : '-', "posout": d.posin.diff() / d.myval})
res.update(pd.DataFrame({"posin": 0,
'sig' : 'a',
"posout":d.posin.diff() / d.myval }
).loc[(d.s1 == 1) & (d.s2 == 1) & (d.posin.diff() != 0) ])
res.update(pd.DataFrame({"posin": d.posout.diff() * d.myval,
'sig' : 'v',
"posout": 0}
).loc[(d.s1 == 0) & (d.s2 == 1) & (d.posin.diff()) == 0])
d[["posin", "posout", 'sig']] = res
print d
myval posin posout s1 s2 sig
0 0.0 10 0 0 0 v
1 1.1 0 0 0 0 v
2 2.2 0 0 1 1 v
3 3.3 0 0 1 0 v
4 4.4 0 0 0 1 v
5 5.5 0 0 0 0 v
6 6.6 0 0 1 1 v
7 7.7 0 0 1 0 v
8 8.8 0 0 0 1 v
9 9.9 0 0 1 1 v
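Each row of posin and posout depends on the values computed for the previous row, which is why column-wise operations such as diff() on the initial columns do not reproduce the wanted result. One hedged option is to keep the loop but run it over plain Python lists and assign all three columns at the end. The function name run_positions and the start_posin parameter are illustrative, not from the question.

```python
import pandas as pd

d = pd.DataFrame({
    "myval": [0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9],
    "s1": [0, 0, 1, 1, 0, 0, 1, 1, 0, 1],
    "s2": [0, 0, 1, 0, 1, 0, 1, 0, 1, 1],
})


def run_positions(myval, s1, s2, start_posin=10.0):
    """Single pass that carries the previous posin/posout forward."""
    posin, posout, sig = [start_posin], [0.0], ["-"]
    for val, a, b in zip(myval[1:], s1[1:], s2[1:]):
        prev_in, prev_out = posin[-1], posout[-1]
        if a == 1 and b == 1 and prev_in != 0:
            posin.append(0.0)
            posout.append(prev_in / val)
            sig.append("a")
        elif a == 0 and b == 1 and prev_in == 0:
            posin.append(prev_out * val)
            posout.append(0.0)
            sig.append("v")
        else:
            posin.append(prev_in)
            posout.append(prev_out)
            sig.append("-")
    return posin, posout, sig


posin, posout, sig = run_positions(
    d["myval"].tolist(), d["s1"].tolist(), d["s2"].tolist()
)
# All three columns are set in one step.
d = d.assign(posin=posin, posout=posout, sig=sig)
print(d)
```

Working on lists avoids per-element DataFrame indexing inside the loop, which is usually the slow part; whether that is fast enough for 100+ sets of 5000+ rows would need to be measured.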
