I have some data like this:
import pandas as pd

df = pd.DataFrame({'code': ['a', 'a', 'a', 'b', 'b', 'c'],
                   'value': [1, 2, 3, 4, 2, 1]})
+-------+------+-------+
| index | code | value |
+-------+------+-------+
| 0 | a | 1 |
+-------+------+-------+
| 1 | a | 2 |
+-------+------+-------+
| 2 | a | 3 |
+-------+------+-------+
| 3 | b | 4 |
+-------+------+-------+
| 4 | b | 2 |
+-------+------+-------+
| 5 | c | 1 |
+-------+------+-------+
I want to add a column that contains the max value for each code:
| index | code | value | max |
|-------|------|-------|-----|
| 0 | a | 1 | 3 |
| 1 | a | 2 | 3 |
| 2 | a | 3 | 3 |
| 3 | b | 4 | 4 |
| 4 | b | 2 | 4 |
| 5 | c | 1 | 1 |
Is there any way to do this with pandas?
Use GroupBy.transform to create a new column of aggregated values:
df['max'] = df.groupby('code')['value'].transform('max')
You can try this as well.
df["max"] = df["code"].apply(lambda i: df.loc[df["code"] == i, "value"].max())
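A short aside (my own sketch, not part of either answer): transform('max') computes one max per group and broadcasts it back onto every row, aligned with the original index, which is why it can be assigned directly. A roughly equivalent two-step version:
# compute one max per code, then map it back onto each row
group_max = df.groupby('code')['value'].max()
df['max'] = df['code'].map(group_max)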
I have a pandas DataFrame containing a product and its state, along with other information. An example DataFrame can be created as follows:
import pandas as pd
import numpy as np

data = {'Product': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C'],
        'Date': ['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04', '2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04', '2020-01-05', '2020-01-06', '2020-01-01', '2020-01-02'],
        'Price': [10, 20, 30, 40, 15, 25, 35, 45, 55, 65, 101, 102],
        'state': [1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1]}
test = pd.DataFrame(data)
I want to count how many times the product state changes from 0 to 1. I've used the following code to check whether the state goes from 0 to 1, and named the result change:
test['change'] = np.where(test.state < test.state.shift(-1), 1, 0)
The problem is that the code above does not take the product into account, so I need to group by product and then check the change in state.
Output:
+---------+----------+-------+-------+--------+
| Product | Date | Price | state | change |
+---------+----------+-------+-------+--------+
| A | 1/1/2020 | 10 | 1 | 0 |
| A | 1/2/2020 | 20 | 0 | 1 |
| A | 1/3/2020 | 30 | 1 | 0 |
| A | 1/4/2020 | 40 | 0 | 1 |
| B | 1/1/2020 | 15 | 1 | 0 |
| B | 1/2/2020 | 25 | 0 | 0 |
| B | 1/3/2020 | 35 | 0 | 1 |
| B | 1/4/2020 | 45 | 1 | 0 |
| B | 1/5/2020 | 55 | 0 | 1 |
| B | 1/6/2020 | 65 | 1 | 0 |
| C | 1/1/2020 | 101 | 0 | 1 |
| C | 1/2/2020 | 102 | 1 | 0 |
+---------+----------+-------+-------+--------+
As seen in the output above, for product A the change on the 4th date is 1 because the state in the next row is 1, but that row belongs to a different product.
Desired Output:
+---------+----------+-------+-------+--------+
| Product | Date | Price | state | change |
+---------+----------+-------+-------+--------+
| A | 1/1/2020 | 10 | 1 | 0 |
| A | 1/2/2020 | 20 | 0 | 1 |
| A | 1/3/2020 | 30 | 1 | 0 |
| A | 1/4/2020 | 40 | 0 | 0 |
| B | 1/1/2020 | 15 | 1 | 0 |
| B | 1/2/2020 | 25 | 0 | 0 |
| B | 1/3/2020 | 35 | 0 | 1 |
| B | 1/4/2020 | 45 | 1 | 0 |
| B | 1/5/2020 | 55 | 0 | 1 |
| B | 1/6/2020 | 65 | 1 | 0 |
| C | 1/1/2020 | 101 | 0 | 1 |
| C | 1/2/2020 | 102 | 1 | 0 |
+---------+----------+-------+-------+--------+
+---------+------------+
| Product |count_change|
+---------+------------+
| A | 1 |
| B | 2 |
| C | 1 |
+---------+------------+
How can I tweak the code so that change is computed after grouping by product, and how can I get a product-wise count of how many times the state changed from 0 to 1?
Try groupby:
g = test.groupby('Product')
test['change'] = (g['state'].diff(-1)<0).astype(int)
g['change'].sum()
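A note on how this works (my reading, not the original answer's wording): within each product, diff(-1) computes state[i] - state[i+1], so a 0 followed by a 1 yields -1, and the < 0 comparison flags exactly those rows; summing the flags per product then gives the desired counts (A: 1, B: 2, C: 1). An equivalent shift-based sketch, assuming test is already sorted by product and date:
# look at the next row's state within the same product
next_state = test.groupby('Product')['state'].shift(-1)
test['change'] = ((test['state'] == 0) & (next_state == 1)).astype(int)
test.groupby('Product')['change'].sum()  # expected: A 1, B 2, C 1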
I would like to group by ID, count the non-NA values in A and B, and then add the counts for A and B together. On top of that, what if I want to count only the 'y' values in A?
+----+---+---+
| ID | A | B |
+----+---+---+
| 1 | x | x |
| 1 | x | x |
| 1 | y | |
| 2 | y | x |
| 2 | y | |
| 2 | y | x |
| 2 | x | x |
| 3 | x | x |
| 3 | | x |
| 3 | y | x |
+----+---+---+
+----+--------+
| ID | Output |
+----+--------+
| 1 | 3 |
| 2 | 6 |
| 3 | 4 |
+----+--------+
Here's a way to do it:
df = df.groupby('ID').agg(lambda x: sum(pd.notna(x))).sum(axis=1).reset_index(name='Output')
print(df)
ID Output
0 1 5.0
1 2 7.0
2 3 5.0
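For the follow-up about counting only the 'y' values in A, one possible reading (an assumption on my part: the desired output of 3, 6, 4 matches counting 'y' in A plus non-NA in B), starting from the original df with the blanks as NaN:
# count 'y' in A and non-NA in B separately, then add the counts per ID
y_in_a = df['A'].eq('y').groupby(df['ID']).sum()
non_na_b = df['B'].notna().groupby(df['ID']).sum()
print((y_in_a + non_na_b).reset_index(name='Output'))  # ID 1 -> 3, 2 -> 6, 3 -> 4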
I'm trying to create a new column whose value is based on two of that row's index levels. I have two DataFrames with an equivalent multi-index on the levels I'm querying (but not of equal size). For each row in the first DataFrame, I want the value from the second DataFrame that matches that row's indices.
I originally thought perhaps I could use a .loc[] and filter off the index values, but I cannot seem to get this to change the output row-by-row. If I wasn't using a dataframe object, I'd loop over the whole thing to do it.
I have tried to use the .apply() method, but I can't figure out what function to pass to it.
Creating some toy data with the same structure:
import pandas as pd
import numpy as np

np.random.seed(1)
df = pd.DataFrame({'Aircraft':np.ones(15),
'DC':np.append(np.repeat(['A','B'], 7), 'C'),
'Test':np.array([10,10,10,10,10,10,20,10,10,10,10,10,10,20,10]),
'Record':np.array([1,2,3,4,5,6,1,1,2,3,4,5,6,1,1]),
# There are multiple "value" columns in my data, but I have simplified here
'Value':np.random.random(15)
}
)
df.set_index(['Aircraft', 'DC', 'Test', 'Record'], inplace=True)
df.sort_index(inplace=True)
v = pd.DataFrame({'Aircraft':np.ones(7),
'DC':np.repeat('v',7),
'Test':np.array([10,10,10,10,10,10,20]),
'Record':np.array([1,2,3,4,5,6,1]),
'Value':np.random.random(7)
}
)
v.set_index(['Aircraft', 'DC', 'Test', 'Record'], inplace=True)
v.sort_index(inplace=True)
df['v'] = df.apply(lambda x: v.loc[df.iloc[x]])
This returns an error for indexing on a multi-index.
To set all values to a single "v" value:
df['v'] = float(v.loc[(slice(None), 'v', 10, 1), 'Value'])
So inputs look like this:
--------------------------------------------
| Aircraft | DC | Test | Record | Value |
|----------|----|------|--------|----------|
| 1.0 | A | 10 | 1 | 0.847576 |
| | | | 2 | 0.860720 |
| | | | 3 | 0.017704 |
| | | | 4 | 0.082040 |
| | | | 5 | 0.583630 |
| | | | 6 | 0.506363 |
| | | 20 | 1 | 0.844716 |
| | B | 10 | 1 | 0.698131 |
| | | | 2 | 0.112444 |
| | | | 3 | 0.718316 |
| | | | 4 | 0.797613 |
| | | | 5 | 0.129207 |
| | | | 6 | 0.861329 |
| | | 20 | 1 | 0.535628 |
| | C | 10 | 1 | 0.121704 |
--------------------------------------------
--------------------------------------------
| Aircraft | DC | Test | Record | Value |
|----------|----|------|--------|----------|
| 1.0 | v | 10 | 1 | 0.961791 |
| | | | 2 | 0.046681 |
| | | | 3 | 0.913453 |
| | | | 4 | 0.495924 |
| | | | 5 | 0.149950 |
| | | | 6 | 0.708635 |
| | | 20 | 1 | 0.874841 |
--------------------------------------------
And after the operation, I want this:
| Aircraft | DC | Test | Record | Value | v |
|----------|----|------|--------|----------|----------|
| 1.0 | A | 10 | 1 | 0.847576 | 0.961791 |
| | | | 2 | 0.860720 | 0.046681 |
| | | | 3 | 0.017704 | 0.913453 |
| | | | 4 | 0.082040 | 0.495924 |
| | | | 5 | 0.583630 | 0.149950 |
| | | | 6 | 0.506363 | 0.708635 |
| | | 20 | 1 | 0.844716 | 0.874841 |
| | B | 10 | 1 | 0.698131 | 0.961791 |
| | | | 2 | 0.112444 | 0.046681 |
| | | | 3 | 0.718316 | 0.913453 |
| | | | 4 | 0.797613 | 0.495924 |
| | | | 5 | 0.129207 | 0.149950 |
| | | | 6 | 0.861329 | 0.708635 |
| | | 20 | 1 | 0.535628 | 0.874841 |
| | C | 10 | 1 | 0.121704 | 0.961791 |
Edit:
since you are on pandas 0.23.4, just change droplevel to reset_index with drop=True:
df_result = (df.reset_index('DC').assign(v=v.reset_index('DC', drop=True))
.set_index('DC', append=True)
.reorder_levels(v.index.names))
Original:
One way is to move the DC index level of df into the columns, use assign to create the new column on the aligned frames, then set_index (with append=True) and reorder_levels to restore the original index:
df_result = (df.reset_index('DC').assign(v=v.droplevel('DC'))
.set_index('DC', append=True)
.reorder_levels(v.index.names))
Out[1588]:
Value v
Aircraft DC Test Record
1.0 A 10 1 0.847576 0.961791
2 0.860720 0.046681
3 0.017704 0.913453
4 0.082040 0.495924
5 0.583630 0.149950
6 0.506363 0.708635
20 1 0.844716 0.874841
B 10 1 0.698131 0.961791
2 0.112444 0.046681
3 0.718316 0.913453
4 0.797613 0.495924
5 0.129207 0.149950
6 0.861329 0.708635
20 1 0.535628 0.874841
C 10 1 0.121704 0.961791
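A side note (my own sketch, not part of the original answer): the assign works because, once DC is dropped from v, the two frames align on the remaining levels (Aircraft, Test, Record). On a recent pandas, the same alignment should also be achievable by joining on the shared index level names:
# v's index (Aircraft, Test, Record) is a subset of df's index levels,
# so the join broadcasts each v value across df's DC level.
df_result = df.join(v.droplevel('DC')['Value'].rename('v'))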
I have the following DataFrame
| name | number |
|------|--------|
| a | 1 |
| a | 1 |
| a | 1 |
| b | 2 |
| b | 2 |
| b | 2 |
| c | 3 |
| c | 3 |
| c | 3 |
| d | 4 |
| d | 4 |
| d | 4 |
I wish to merge all the rows with the same name, with their number values added up and kept in line with the name.
Desired output:
| name | number |
|------|--------|
| a | 3 |
| b | 6 |
| c | 9 |
| d | 12 |
It seems you need groupby with a sum aggregation:
df = df.groupby('name', as_index=False)['number'].sum()
#or
#df = df.groupby('name')['number'].sum().reset_index()
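Both lines are equivalent; as_index=False simply keeps name as a regular column instead of moving it into the index and resetting it afterwards. A quick, self-contained check on the example data (my own sketch):
import pandas as pd

df = pd.DataFrame({'name': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd'],
                   'number': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]})
print(df.groupby('name', as_index=False)['number'].sum())
# -> a 3, b 6, c 9, d 12, matching the desired output above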
Assuming DataFrame is your table name:
SELECT name, SUM(number) AS [number] FROM DataFrame GROUP BY name
Insert the result after deleting the original rows
I have a dataframe:
| city | field2 | field3 | field4 | field5 |
| 1 | a | | b | b |
| 2 | | | c | |
| 3 | | a | | |
| 4 | a | | | |
| 1 | | a | | b |
| 2 | b | | c | |
| 4 | | a | | |
| 3 | | | a | |
| 2 | b | | | |
| 1 | | a | | b |
| 2 | | | a | |
| 3 | a | | | b |
| 1 | | | b | |
| 1 | b | a | | |
| 2 | | | b | b |
| 1 | b | a | | b |
What I need to get is a count of the blank fields in each column, grouped by the "city" field.
| city | field2 | field3 | field4 | field5 |
| 1 | 3 | 2 | 4 | 2 |
| 2 | 3 | 5 | 1 | 4 |
| 3 | 2 | 2 | 2 | 2 |
| 4 | 1 | 1 | 2 | 2 |
How can I do this with pandas?
import pandas as pd
import numpy as np
df = pd.DataFrame({
"city": [1,2,1,2,1,2],
"field2": [np.nan, "a", np.nan, np.nan, "b", np.nan],
"field3": [np.nan, np.nan, np.nan, "b", "a", "b"],
})
df
This is my example data:
city field2 field3
0 1 NaN NaN
1 2 a NaN
2 1 NaN NaN
3 2 NaN b
4 1 b a
5 2 NaN b
Now the logic:
# define a function that counts the number of `nan` in a series.
def count_nan(col):
return col.isnull().sum()
# group by city and count the number of `nan` per city
df.groupby("city").agg({"field2": count_nan, "field3": count_nan})
This is the output:
field2 field3
city
1 2 2
2 2 1
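To apply the same count to every field column at once (a small extension of my own, assuming the full frame is called df and the blank cells are NaN):
# group by city and count the NaNs in every remaining column
df.groupby("city").agg(lambda col: col.isnull().sum())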