I have a dataframe which was created via a df.pivot:
type                     start    end
F_Type to_date
A      20150908143000      345    316
B      20150908140300      NaN    480
       20150908140600      NaN    120
       20150908143000    10743   8803
C      20150908140100      NaN   1715
       20150908140200      NaN   1062
       20150908141000      NaN    145
       20150908141500      418    NaN
       20150908141800      NaN    450
       20150908142900     1973   1499
       20150908143000    19522  16659
D      20150908143000      433     65
E      20150908143000     7290   7375
F      20150908143000        0      0
G      20150908143000     1796    340
I would like to filter this and return a single row for each 'F_Type', keeping only the row with the maximum 'to_date'. I would like to return the following dataframe:
type                     start    end
F_Type to_date
A      20150908143000      345    316
B      20150908143000    10743   8803
C      20150908143000    19522  16659
D      20150908143000      433     65
E      20150908143000     7290   7375
F      20150908143000        0      0
G      20150908143000     1796    340
Thanks.
A standard approach is to use groupby(keys)[column].idxmax().
However, to select the desired rows using idxmax you need idxmax to return unique index values. One way to obtain a unique index is to call reset_index.
Once you obtain the index values from groupby(keys)[column].idxmax() you can then select the entire row using df.loc:
In [20]: df2 = df.reset_index()

In [21]: df2.loc[df2.groupby('F_Type')['to_date'].idxmax()].set_index(['F_Type', 'to_date'])
Out[21]:
                         start    end
F_Type to_date
A      20150908143000      345    316
B      20150908143000    10743   8803
C      20150908143000    19522  16659
D      20150908143000      433     65
E      20150908143000     7290   7375
F      20150908143000        0      0
G      20150908143000     1796    340
Note: idxmax returns index labels, not necessarily ordinals. After reset_index the labels happen to coincide with the ordinals, but since idxmax returns labels (not ordinals) it is better to always use idxmax in conjunction with df.loc, not df.iloc (as this post originally did).
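To see why the distinction matters, here is a minimal sketch with made-up data whose index labels are not the ordinals 0..n-1:

import pandas as pd

df = pd.DataFrame({'grp': ['a', 'a', 'b'], 'val': [1, 5, 3]},
                  index=[10, 20, 30])

idx = df.groupby('grp')['val'].idxmax()
print(list(idx))     # [20, 30] -- these are index labels, not positions
print(df.loc[idx])   # selects the intended rows
# df.iloc[idx] would raise IndexError here: positions 20 and 30 don't exist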
Other ways to do this are as follows.
If you want only one max row per group:
(
df
.groupby(level=0)
.apply(lambda group: group.nlargest(1, columns='to_date'))
.reset_index(level=-1, drop=True)
)
If you want all rows that are equal to the max per group:
(
df
.groupby(level=0)
.apply(lambda group: group.loc[group['to_date'] == group['to_date'].max()])
.reset_index(level=-1, drop=True)
)
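Both snippets assume 'to_date' is an ordinary column rather than an index level (for the question's pivoted frame, run df = df.reset_index(level='to_date') first). A minimal runnable setup in that shape, abbreviated to three rows of the question's data:

import pandas as pd

# Abbreviated frame: F_Type as the index, to_date as an ordinary column
df = pd.DataFrame(
    {'to_date': [20150908143000, 20150908140300, 20150908143000],
     'start': [345.0, None, 10743.0],
     'end': [316.0, 480.0, 8803.0]},
    index=pd.Index(['A', 'B', 'B'], name='F_Type'))

result = (
    df
    .groupby(level=0)
    .apply(lambda group: group.nlargest(1, columns='to_date'))
    .reset_index(level=-1, drop=True)
)
print(result)   # one row per F_Type: A -> 345/316, B -> 10743/8803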
Related
I have a DataFrame in which I have already identified rows to be summed up, with the results stored in a new row.
For example in Year 1990:
Category    A    B    C    D  Year
E         147   78  476  531  1990
F         914  356  337  781  1990
G         117  874   15   69  1990
H          45  682  247   65  1990
I          20  255  465   19  1990
Here, the rows G and H should be summed up and the result stored in a new row. The same categories repeat every year from 1990 to 2019.
I have already tried it with .iloc, e.g. [4:8], [50:54], [96:100] and so on, but with iloc I cannot specify multiple index ranges at once, and I can't manage to write a loop over the individual years.
Is there a way to sum the values in categories (G-H) for each year (1990 -2019)?
I'm not sure what you mean by "multiple index"; a MultiIndex usually appears after a groupby and aggregation, and in your table it looks like plain columns.
So, if I understand correctly, here is complete code showing how to combine multiple conditions on a DataFrame:
import io
import pandas as pd

data = """Category A B C D Year
E 147 78 476 531 1990
F 914 356 337 781 1990
G 117 874 15 69 1990
H 45 682 247 65 1990
I 20 255 465 19 1990"""

# the sample is whitespace-separated, so use a regex separator
table = pd.read_csv(io.StringIO(data), sep=r"\s+")
years = table["Year"].unique()
for year in years:
    row = table[((table["Category"] == "G") | (table["Category"] == "H")) & (table["Year"] == year)]
    row = row[["A", "B", "C", "D"]].sum()
    row["Category"], row["Year"] = "sum", year
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    table = pd.concat([table, row.to_frame().T], ignore_index=True)
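For this sample the appended row is Category='sum', A=162, B=1556, C=262, D=134, Year=1990, i.e. the column-wise totals of the G and H rows.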
If you are only interested in G/H, you can slice with isin combined with boolean indexing, then sum:
df[df['Category'].isin(['G', 'H'])].sum()
output:
Category GH
A 162
B 1556
C 262
D 134
Year 3980
dtype: object
NB: note the side effect of sum here, which concatenates the two "G"/"H" strings into one "GH".
Or, better, set Category as index and slice with loc:
df.set_index('Category').loc[['G', 'H']].sum()
output:
A 162
B 1556
C 262
D 134
Year 3980
dtype: int64
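Since the real data covers every year from 1990 to 2019, you will likely want one total per year. A sketch combining the isin filter with a groupby on Year (assuming the full frame is df):

# One row of G+H column totals per year
sums = (df[df['Category'].isin(['G', 'H'])]
        .groupby('Year')[['A', 'B', 'C', 'D']]
        .sum())
print(sums)   # for the 1990 sample: A=162, B=1556, C=262, D=134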
I have a dataframe in Python.
I want to calculate the difference between the "Start" and "End" columns for each "Product order" in the following way:
for each product number I have "Type"s A to D, and I need to subtract the start time of A from the end time of D for each product order. Any idea how to do it? Thanks.
Input data:
process order Type Start End
111 A 10 20
111 B 22 25
111 C 28 30
111 D 33 35
222 A 37 40
222 B 42 45
222
222
333
333
333
333
like for process order 111: we have D (End: 35) - A (Start: 10) = 25
Output should be like:
process order Time_difference
111 25
222 ?
333 ?
Group by your order number, then take the end value from the last row and subtract from it the start value from the second row. Note that iloc is purely positional, so select the columns by name first:
df.groupby('order_number').apply(lambda x: x['end'].iloc[-1] - x['start'].iloc[1])
If you need to, you can join these results back into the original df.
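A runnable sketch on the question's sample data, using the column names from the post; note that the question's worked example (35 - 10 = 25) takes the start from the first row (type A), so this version uses .iloc[0] rather than the second row:

import pandas as pd

# Sample rows for process order 111 from the question
df = pd.DataFrame({
    'process order': [111, 111, 111, 111],
    'Type': ['A', 'B', 'C', 'D'],
    'Start': [10, 22, 28, 33],
    'End': [20, 25, 30, 35],
})

# End of the last row (D) minus Start of the first row (A): 35 - 10 = 25
result = (df.sort_values(['process order', 'Type'])
            .groupby('process order')
            .apply(lambda x: x['End'].iloc[-1] - x['Start'].iloc[0]))
print(result)   # process order 111 -> 25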
Assuming all products have Types A to F, try:
#sort values so that B is the 2nd and F is the last row for each order number
df = df.sort_values(["order number", "type"])
#groupby order number and keep the 2nd (iat[1]) row for "start" and the last row for "end"
output = df.groupby("order number").agg({"start": lambda x: x.iat[1], "end": "last"})
#compute the difference
diff = output["end"] - output["start"]
>>> diff
order number
103895166 2072
103900419 2228
103902156 9348
dtype: int64
Input df:
order number type start end
0 103895166 A 0 1999
1 103895166 B 361 2067
2 103895166 C 365 2117
3 103895166 D 368 2118
4 103895166 E 497 2423
5 103895166 F 498 2433
6 103900419 A 0 3627
7 103900419 B 2128 3791
8 103900419 C 2132 3841
9 103900419 D 2135 3842
10 103900419 E 2264 4346
11 103900419 F 2454 4356
12 103902156 A 0 12432
13 103902156 B 3852 12938
14 103902156 C 3856 12987
15 103902156 D 3860 13000
16 103902156 E 3864 13100
17 103902156 F 3868 13200
df1.fillna(0)
Montant vente Marge
0 778283.75 13.63598
1 312271.20 9.26949
2 163214.65 14.50288
3 191000.20 9.55818
4 275970.00 12.76534
... ... ...
408 2999.80 14.60610
409 390.00 0.00000
410 699.00 26.67334
411 625.00 30.24571
412 0.00 24.79797
x = df1.iloc[:, 1:3]  # first slice selects rows, second selects columns
x
Marge
0 13.63598
1 9.26949
2 14.50288
3 9.55818
4 12.76534
... ...
408 14.60610
409 NaN
410 26.67334
411 30.24571
412 24.79797
413 rows × 1 columns
Why does row 409 show a value of 0.00000 first, but NaN after the iloc? Later on I get:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
You should learn which functions mutate the data frame and which don't. fillna, for example, does not mutate the dataframe by default; it returns a new one. You can reassign the result:
df1 = df1.fillna(0)
or
df1.fillna(0, inplace=True)
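A minimal sketch of the difference, with made-up values:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Montant vente': [390.0], 'Marge': [np.nan]})

df1.fillna(0)                     # returns a filled copy; df1 is untouched
print(df1['Marge'].isna().any())  # True -- the NaN is still there

df1 = df1.fillna(0)               # reassign to keep the filled values
print(df1['Marge'].isna().any())  # False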
I am attempting to perform arithmetic on the 'data_d' column.
dataframe
data_a data_b data_c data_d
60 0.30786 Discharge 2.31714
61 0.30792 Rest 2.34857
121 0.62095 Rest 2.38647
182 0.93398 Discharge 2.31115
183 0.93408 Rest 2.34550
243 1.24711 Rest 2.37162
304 1.56014 Discharge 2.30855
305 1.56019 Rest 2.34215
365 1.87322 Rest 2.36276
426 2.18630 Discharge 2.30591
I want to assign the labels A, B, C to a new column named 'variable', as shown below.
dataframe2
data_a data_b data_c data_d variable
60 0.30786 Discharge 2.31714 A
61 0.30792 Rest 2.34857 B
121 0.62095 Rest 2.38647 C
182 0.93398 Discharge 2.31115 A
183 0.93408 Rest 2.34550 B
243 1.24711 Rest 2.37162 C
304 1.56014 Discharge 2.30855 A
305 1.56019 Rest 2.34215 B
365 1.87322 Rest 2.36276 C
426 2.18630 Discharge 2.30591 A
The script should then perform the following operation iteratively over the entire 'data_d' column:
(C - (B-A))
(2.38647 - (2.34857-2.31714))
(2.35504)
...
dataframe3
measurement
0 2.35504
1 2.33727
2 2.32916
... ...
And so on.
Thank you in advance for any insight.
We use cumsum to create the groupby key, then cumcount within each group, and map the count back to a letter:
key = df['data_c'].eq('Discharge').cumsum()
df['variable'] = df.groupby(key).cumcount().map({0:'A',1:'B',2:'C'})
df
Out[61]:
data_a data_b data_c data_d variable
0 60 0.30786 Discharge 2.31714 A
1 61 0.30792 Rest 2.34857 B
2 121 0.62095 Rest 2.38647 C
3 182 0.93398 Discharge 2.31115 A
4 183 0.93408 Rest 2.34550 B
5 243 1.24711 Rest 2.37162 C
6 304 1.56014 Discharge 2.30855 A
7 305 1.56019 Rest 2.34215 B
8 365 1.87322 Rest 2.36276 C
9 426 2.18630 Discharge 2.30591 A
Then we just need to pivot; here I am using crosstab:
s = pd.crosstab(index=key, columns=df['variable'], values=df['data_d'], aggfunc='sum')
dfout = s.eval('C - (B-A)').to_frame(name = 'measurement')
dfout
Out[69]:
measurement
data_c
1 2.35504
2 2.33727
3 2.32916
4 NaN
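The NaN in the last row appears because the final group consists of the single 'Discharge' row at index 9 (an A measurement only), so B and C are missing for it and C - (B - A) evaluates to NaN.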
I have two functions that do some calculation and give me results. For now, I am able to apply a function to one column and get the result in the form of a dataframe.
I need to know how I can apply the function to all the columns in the dataframe and get the results as a dataframe as well.
Say I have the data frame below; I need to apply the function to each column and get a dataframe with the corresponding results for all the columns.
A B C D E F
1456 6744 9876 374 65413 1456
654 2314 674654 2156 872 6744
875 653 36541 345 4963 9876
6875 7401 3654 465 3547 374
78654 8662 35 6987 6874 65413
658 94512 687 489 8756 5854
Results
A B C D E F
2110 9058 684530 2530 66285 8200
1529 2967 711195 2501 5835 16620
7750 8054 40195 810 8510 10250
85529 16063 3689 7452 10421 65787
Here is a simple example.
df
A B C D
0 10 11 12 13
1 20 21 22 23
2 30 31 32 33
3 40 41 42 43
# Assume your user defined function is
def mul(x, y):
return x * y
which multiplies two values.
Let's say you want to multiply the first column, 'A', by 3:
df['A'].apply(lambda x: mul(x,3))
0 30
1 60
2 90
3 120
Now, to apply the mul function to all columns of the dataframe and create a new dataframe with the results:
df1 = df.applymap(lambda x: mul(x, 3))
df1
A B C D
0 30 33 36 39
1 60 63 66 69
2 90 93 96 99
3 120 123 126 129
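One caveat: DataFrame.applymap was deprecated in pandas 2.1 in favour of the equivalent element-wise DataFrame.map, so on recent versions the same result is:

# pandas >= 2.1: element-wise application via DataFrame.map
df1 = df.map(lambda x: mul(x, 3))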
The pd.DataFrame object also has its own apply method.
From the example given in its documentation:
>>> import numpy as np
>>> df = pd.DataFrame([[4, 9],] * 3, columns=['A', 'B'])
>>> df
A B
0 4 9
1 4 9
2 4 9
>>> df.apply(np.sqrt)
A B
0 2.0 3.0
1 2.0 3.0
2 2.0 3.0
Conclusion: you should be able to apply your function to the whole dataframe.
It looks like this is what you are trying to do in your output:
df = pd.DataFrame(
[[1456, 6744, 9876, 374, 65413, 1456],
[654, 2314, 674654, 2156, 872, 6744],
[875, 653, 36541, 345, 4963, 9876],
[6875, 7401, 3654, 465, 3547, 374],
[78654, 8662, 35, 6987, 6874, 65413],
[658, 94512, 687, 489, 8756, 5854]],
columns=list('ABCDEF'))
def fn(col):
    # sum each row with the following row (via .values to sidestep index
    # alignment); the slices stop early so the output has four rows
    return col[:-2].values + col[1:-1].values
Apply the function as mentioned in previous answers:
>>> df.apply(fn)
A B C D E F
0 2110 9058 684530 2530 66285 8200
1 1529 2967 711195 2501 5835 16620
2 7750 8054 40195 810 8510 10250
3 85529 16063 3689 7452 10421 65787