I have some data like this:
import pandas as pd

df = pd.DataFrame({'code': ['a', 'a', 'a', 'b', 'b', 'c'],
                   'value': [1, 2, 3, 4, 2, 1]})
+-------+------+-------+
| index | code | value |
+-------+------+-------+
| 0 | a | 1 |
+-------+------+-------+
| 1 | a | 2 |
+-------+------+-------+
| 2 | a | 3 |
+-------+------+-------+
| 3 | b | 4 |
+-------+------+-------+
| 4 | b | 2 |
+-------+------+-------+
| 5 | c | 1 |
+-------+------+-------+
I want to add a column that contains the max value for each code:
| index | code | value | max |
|-------|------|-------|-----|
| 0 | a | 1 | 3 |
| 1 | a | 2 | 3 |
| 2 | a | 3 | 3 |
| 3 | b | 4 | 4 |
| 4 | b | 2 | 4 |
| 5 | c | 1 | 1 |
Is there any way to do this with pandas?
Use GroupBy.transform to create a new column of aggregated values:
df['max'] = df.groupby('code')['value'].transform('max')
You can try this as well.
df["max"] = df["code"].apply(lambda i: df.loc[df["code"] == i, "value"].max())
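A short aside (my own sketch, not part of either answer): transform('max') computes one max per group and broadcasts it back onto every row, aligned with the original index, which is why it can be assigned directly. A roughly equivalent two-step version:
# compute one max per code, then map it back onto each row
group_max = df.groupby('code')['value'].max()
df['max'] = df['code'].map(group_max)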
I have a pandas DataFrame containing a product and its state, along with other information. An example DataFrame can be created as follows:
import pandas as pd
import numpy as np

data = {'Product': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C'],
        'Date': ['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04', '2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04', '2020-01-05', '2020-01-06', '2020-01-01', '2020-01-02'],
        'Price': [10, 20, 30, 40, 15, 25, 35, 45, 55, 65, 101, 102],
        'state': [1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1]}
test = pd.DataFrame(data)
I want to count how many times the product state changes from 0 to 1. I've used the following code to check whether the state goes from 0 to 1, and named the result change:
test['change'] = np.where(test.state < test.state.shift(-1), 1, 0)
The problem is that the code above does not take the product into account, so I need to group by product and then check the change in state.
Output:
+---------+----------+-------+-------+--------+
| Product | Date | Price | state | change |
+---------+----------+-------+-------+--------+
| A | 1/1/2020 | 10 | 1 | 0 |
| A | 1/2/2020 | 20 | 0 | 1 |
| A | 1/3/2020 | 30 | 1 | 0 |
| A | 1/4/2020 | 40 | 0 | 1 |
| B | 1/1/2020 | 15 | 1 | 0 |
| B | 1/2/2020 | 25 | 0 | 0 |
| B | 1/3/2020 | 35 | 0 | 1 |
| B | 1/4/2020 | 45 | 1 | 0 |
| B | 1/5/2020 | 55 | 0 | 1 |
| B | 1/6/2020 | 65 | 1 | 0 |
| C | 1/1/2020 | 101 | 0 | 1 |
| C | 1/2/2020 | 102 | 1 | 0 |
+---------+----------+-------+-------+--------+
As seen in the output above, for product A the change on the 4th date is 1 because the state in the next row is 1, but that row belongs to a different product.
Desired Output:
+---------+----------+-------+-------+--------+
| Product | Date | Price | state | change |
+---------+----------+-------+-------+--------+
| A | 1/1/2020 | 10 | 1 | 0 |
| A | 1/2/2020 | 20 | 0 | 1 |
| A | 1/3/2020 | 30 | 1 | 0 |
| A | 1/4/2020 | 40 | 0 | 0 |
| B | 1/1/2020 | 15 | 1 | 0 |
| B | 1/2/2020 | 25 | 0 | 0 |
| B | 1/3/2020 | 35 | 0 | 1 |
| B | 1/4/2020 | 45 | 1 | 0 |
| B | 1/5/2020 | 55 | 0 | 1 |
| B | 1/6/2020 | 65 | 1 | 0 |
| C | 1/1/2020 | 101 | 0 | 1 |
| C | 1/2/2020 | 102 | 1 | 0 |
+---------+----------+-------+-------+--------+
+---------+------------+
| Product |count_change|
+---------+------------+
| A | 1 |
| B | 2 |
| C | 1 |
+---------+------------+
How can I tweak the code so that change is computed after grouping by product, and how can I get a product-wise count of how many times the state changed from 0 to 1?
Try groupby:
g = test.groupby('Product')
test['change'] = (g['state'].diff(-1)<0).astype(int)
g['change'].sum()
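A note on how this works (my reading, not the original answer's wording): within each product, diff(-1) computes state[i] - state[i+1], so a 0 followed by a 1 yields -1, and the < 0 comparison flags exactly those rows; summing the flags per product then gives the desired counts (A: 1, B: 2, C: 1). An equivalent shift-based sketch, assuming test is already sorted by product and date:
# look at the next row's state within the same product
next_state = test.groupby('Product')['state'].shift(-1)
test['change'] = ((test['state'] == 0) & (next_state == 1)).astype(int)
test.groupby('Product')['change'].sum()  # expected: A 1, B 2, C 1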
I would like to group by ID, count the non-NA values in A and B, and then add the counts for A and B together. On top of that, what if I want to count only the 'y' values in A?
+----+---+---+
| ID | A | B |
+----+---+---+
| 1 | x | x |
| 1 | x | x |
| 1 | y | |
| 2 | y | x |
| 2 | y | |
| 2 | y | x |
| 2 | x | x |
| 3 | x | x |
| 3 | | x |
| 3 | y | x |
+----+---+---+
+----+--------+
| ID | Output |
+----+--------+
| 1 | 3 |
| 2 | 6 |
| 3 | 4 |
+----+--------+
Here's a way to do it:
df = df.groupby('ID').agg(lambda x: sum(pd.notna(x))).sum(axis=1).reset_index(name='Output')
print(df)
ID Output
0 1 5.0
1 2 7.0
2 3 5.0
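For the follow-up about counting only the 'y' values in A, one possible reading (an assumption on my part: the desired output of 3, 6, 4 matches counting 'y' in A plus non-NA in B), starting from the original df with the blanks as NaN:
# count 'y' in A and non-NA in B separately, then add the counts per ID
y_in_a = df['A'].eq('y').groupby(df['ID']).sum()
non_na_b = df['B'].notna().groupby(df['ID']).sum()
print((y_in_a + non_na_b).reset_index(name='Output'))  # ID 1 -> 3, 2 -> 6, 3 -> 4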
I'm trying to create a new column whose value is based on two of that row's index levels. I have two DataFrames with an equivalent multi-index on the levels I'm querying (but not of equal size). For each row in the first DataFrame, I want the value from the second DataFrame that matches that row's indices.
I originally thought perhaps I could use a .loc[] and filter off the index values, but I cannot seem to get this to change the output row-by-row. If I wasn't using a dataframe object, I'd loop over the whole thing to do it.
I have tried to use the .apply() method, but I can't figure out what function to pass to it.
Creating some toy data with the same structure:
import pandas as pd
import numpy as np

np.random.seed(1)
df = pd.DataFrame({'Aircraft':np.ones(15),
'DC':np.append(np.repeat(['A','B'], 7), 'C'),
'Test':np.array([10,10,10,10,10,10,20,10,10,10,10,10,10,20,10]),
'Record':np.array([1,2,3,4,5,6,1,1,2,3,4,5,6,1,1]),
# There are multiple "value" columns in my data, but I have simplified here
'Value':np.random.random(15)
}
)
df.set_index(['Aircraft', 'DC', 'Test', 'Record'], inplace=True)
df.sort_index(inplace=True)
v = pd.DataFrame({'Aircraft':np.ones(7),
'DC':np.repeat('v',7),
'Test':np.array([10,10,10,10,10,10,20]),
'Record':np.array([1,2,3,4,5,6,1]),
'Value':np.random.random(7)
}
)
v.set_index(['Aircraft', 'DC', 'Test', 'Record'], inplace=True)
v.sort_index(inplace=True)
df['v'] = df.apply(lambda x: v.loc[df.iloc[x]])
This returns an error for indexing on a multi-index.
To set all values to a single "v" value:
df['v'] = float(v.loc[(slice(None), 'v', 10, 1), 'Value'])
So inputs look like this:
--------------------------------------------
| Aircraft | DC | Test | Record | Value |
|----------|----|------|--------|----------|
| 1.0 | A | 10 | 1 | 0.847576 |
| | | | 2 | 0.860720 |
| | | | 3 | 0.017704 |
| | | | 4 | 0.082040 |
| | | | 5 | 0.583630 |
| | | | 6 | 0.506363 |
| | | 20 | 1 | 0.844716 |
| | B | 10 | 1 | 0.698131 |
| | | | 2 | 0.112444 |
| | | | 3 | 0.718316 |
| | | | 4 | 0.797613 |
| | | | 5 | 0.129207 |
| | | | 6 | 0.861329 |
| | | 20 | 1 | 0.535628 |
| | C | 10 | 1 | 0.121704 |
--------------------------------------------
--------------------------------------------
| Aircraft | DC | Test | Record | Value |
|----------|----|------|--------|----------|
| 1.0 | v | 10 | 1 | 0.961791 |
| | | | 2 | 0.046681 |
| | | | 3 | 0.913453 |
| | | | 4 | 0.495924 |
| | | | 5 | 0.149950 |
| | | | 6 | 0.708635 |
| | | 20 | 1 | 0.874841 |
--------------------------------------------
And after the operation, I want this:
| Aircraft | DC | Test | Record | Value | v |
|----------|----|------|--------|----------|----------|
| 1.0 | A | 10 | 1 | 0.847576 | 0.961791 |
| | | | 2 | 0.860720 | 0.046681 |
| | | | 3 | 0.017704 | 0.913453 |
| | | | 4 | 0.082040 | 0.495924 |
| | | | 5 | 0.583630 | 0.149950 |
| | | | 6 | 0.506363 | 0.708635 |
| | | 20 | 1 | 0.844716 | 0.874841 |
| | B | 10 | 1 | 0.698131 | 0.961791 |
| | | | 2 | 0.112444 | 0.046681 |
| | | | 3 | 0.718316 | 0.913453 |
| | | | 4 | 0.797613 | 0.495924 |
| | | | 5 | 0.129207 | 0.149950 |
| | | | 6 | 0.861329 | 0.708635 |
| | | 20 | 1 | 0.535628 | 0.874841 |
| | C | 10 | 1 | 0.121704 | 0.961791 |
Edit:
since you are on pandas 0.23.4, just change droplevel to reset_index with drop=True:
df_result = (df.reset_index('DC').assign(v=v.reset_index('DC', drop=True))
.set_index('DC', append=True)
.reorder_levels(v.index.names))
Original:
One way is to move the DC index level of df into the columns, use assign to create the new column on the aligned frames, then set_index (with append=True) and reorder_levels to restore the original index:
df_result = (df.reset_index('DC').assign(v=v.droplevel('DC'))
.set_index('DC', append=True)
.reorder_levels(v.index.names))
Out[1588]:
Value v
Aircraft DC Test Record
1.0 A 10 1 0.847576 0.961791
2 0.860720 0.046681
3 0.017704 0.913453
4 0.082040 0.495924
5 0.583630 0.149950
6 0.506363 0.708635
20 1 0.844716 0.874841
B 10 1 0.698131 0.961791
2 0.112444 0.046681
3 0.718316 0.913453
4 0.797613 0.495924
5 0.129207 0.149950
6 0.861329 0.708635
20 1 0.535628 0.874841
C 10 1 0.121704 0.961791
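A side note (my own sketch, not part of the original answer): the assign works because, once DC is dropped from v, the two frames align on the remaining levels (Aircraft, Test, Record). On a recent pandas, the same alignment should also be achievable by joining on the shared index level names:
# v's index (Aircraft, Test, Record) is a subset of df's index levels,
# so the join broadcasts each v value across df's DC level.
df_result = df.join(v.droplevel('DC')['Value'].rename('v'))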
I have the following DataFrame
| name | number |
|------|--------|
| a | 1 |
| a | 1 |
| a | 1 |
| b | 2 |
| b | 2 |
| b | 2 |
| c | 3 |
| c | 3 |
| c | 3 |
| d | 4 |
| d | 4 |
| d | 4 |
I wish to merge all the rows with the same name, with their number values added up and kept in line with the name.
Desired output:
| name | number |
|------|--------|
| a | 3 |
| b | 6 |
| c | 9 |
| d | 12 |
It seems you need groupby with a sum aggregation:
df = df.groupby('name', as_index=False)['number'].sum()
#or
#df = df.groupby('name')['number'].sum().reset_index()
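Both lines are equivalent; as_index=False simply keeps name as a regular column instead of moving it into the index and resetting it afterwards. A quick, self-contained check on the example data (my own sketch):
import pandas as pd

df = pd.DataFrame({'name': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd'],
                   'number': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]})
print(df.groupby('name', as_index=False)['number'].sum())
# -> a 3, b 6, c 9, d 12, matching the desired output above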
Assuming DataFrame is your table name:
SELECT name, SUM(number) AS [number] FROM DataFrame GROUP BY name
Insert the result after deleting the original rows
I have a dataframe:
| city | field2 | field3 | field4 | field5 |
| 1 | a | | b | b |
| 2 | | | c | |
| 3 | | a | | |
| 4 | a | | | |
| 1 | | a | | b |
| 2 | b | | c | |
| 4 | | a | | |
| 3 | | | a | |
| 2 | b | | | |
| 1 | | a | | b |
| 2 | | | a | |
| 3 | a | | | b |
| 1 | | | b | |
| 1 | b | a | | |
| 2 | | | b | b |
| 1 | b | a | | b |
What I need to get is a count of the blank fields in each column, grouped by the "city" field.
| city | field2 | field3 | field4 | field5 |
| 1 | 3 | 2 | 4 | 2 |
| 2 | 3 | 5 | 1 | 4 |
| 3 | 2 | 2 | 2 | 2 |
| 4 | 1 | 1 | 2 | 2 |
How can I do this with pandas?
import pandas as pd
import numpy as np
df = pd.DataFrame({
"city": [1,2,1,2,1,2],
"field2": [np.nan, "a", np.nan, np.nan, "b", np.nan],
"field3": [np.nan, np.nan, np.nan, "b", "a", "b"],
})
df
This is my example data:
city field2 field3
0 1 NaN NaN
1 2 a NaN
2 1 NaN NaN
3 2 NaN b
4 1 b a
5 2 NaN b
Now the logic:
# define a function that counts the number of `nan` in a series.
def count_nan(col):
return col.isnull().sum()
# group by city and count the number of `nan` per city
df.groupby("city").agg({"field2": count_nan, "field3": count_nan})
This is the output:
field2 field3
city
1 2 2
2 2 1
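To apply the same count to every field column at once (a small extension of my own, assuming the full frame is called df and the blank cells are NaN):
# group by city and count the NaNs in every remaining column
df.groupby("city").agg(lambda col: col.isnull().sum())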