Comparing column values in groupby in pandas - python

My dataframe looks like this:
Time Name price Profit
5:25 A 150 15
5:25 B 250 10
5:25 C 200 20
5:30 A 200 25
5:30 B 150 20
5:30 C 210 25
5:35 A 180 15
5:35 B 200 30
5:35 C 200 10
5:40 A 150 20
5:40 B 260 15
5:40 C 220 10
I want the output to be like:
Time Name price profit diff_price diff_profit
5:25 A 150 15 0 0
5:25 B 250 10 0 0
5:25 C 200 20 0 0
5:30 A 200 25 50 10
5:30 B 150 20 -100 10
5:30 C 210 25 10 5
5:35 A 180 15 20 -10
5:35 B 200 30 50 10
5:35 C 200 10 -10 -15
5:40 A 150 20 -30 5
5:40 B 260 35 60 5
5:40 C 220 15 20 5
I need to check, for each Name, whether the latest diff values are greater than all of that Name's previous values, i.e. whether the diffs of A, B, and C at the latest Time exceed every earlier diff for that Name.
If the condition matches, the Name should be displayed.
In the sample above, at Time 5:40 the diff_price and diff_profit of B are greater than all values at previous Times, so the output should print: B
My code looks like:
df.groupby(['Time','Price'])
df['diff_price']=df.groupby(['Time','Price']).price.diff().fillna(0)
df['diff_profit']=df.groupby(['Time','Price']).profit.diff().fillna(0)
How do I then compare the values so that the desired output, B, gets displayed?

You could tackle this problem one group ("Name") at a time. Note that the diffs below are computed as previous row minus current row, so their signs are flipped relative to your expected output:
# Iterate over the dataframe grouped by "Name"
for name, group_df in df.groupby("Name"):
    # Make sure that the rows are sorted by time
    group_df = group_df.sort_values("Time")
    # Calculate the difference between consecutive rows (diff = previous row - current row)
    group_df[["diff_price", "diff_profit"]] = group_df[["price", "Profit"]].shift(1) - group_df[["price", "Profit"]]
    # Fill the first value with 0 instead of NaN (as in your sample input)
    group_df = group_df.fillna(0)
    # Check whether the maximum diff_price is reached at the end
    *previous_values, last_value = group_df["diff_price"]
    if last_value >= max(previous_values):
        print(f"Max price diff reached: {name}")
        print(group_df.tail(1))
    # Again, but for diff_profit
    *previous_values, last_value = group_df["diff_profit"]
    if last_value >= max(previous_values):
        print(f"Max profit diff reached: {name}")
        print(group_df.tail(1))
This is the output I get for your sample input:
Max price diff reached: A
Time Name price Profit diff_price diff_profit
9 5:40 A 150 20 30.0 -5.0
Max profit diff reached: B
Time Name price Profit diff_price diff_profit
10 5:40 B 260 15 -60.0 15.0

IIUC, compute diff_price and diff_profit based on the Name column, then patch the last Time group according to your condition:
df[['diff_price', 'diff_profit']] = df.groupby('Name')[['price', 'profit']].diff().fillna(0)
mask = df['Time'].eq(df['Time'].max())
df.loc[mask, 'diff_profit'] = df.loc[mask, 'diff_profit'].max()
Output:
>>> df
Time Name price profit diff_price diff_profit
0 5:25 A 150 15 0.0 0.0
1 5:25 B 250 10 0.0 0.0
2 5:25 C 200 20 0.0 0.0
3 5:30 A 200 25 50.0 10.0
4 5:30 B 150 20 -100.0 10.0
5 5:30 C 210 25 10.0 5.0
6 5:35 A 180 15 -20.0 -10.0
7 5:35 B 200 30 50.0 10.0
8 5:35 C 200 10 -10.0 -15.0
9 5:40 A 150 20 -30.0 5.0
10 5:40 B 260 15 60.0 5.0
11 5:40 C 220 10 20.0 5.0
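If you want the final comparison itself ("print B") without an explicit loop, here is a minimal vectorized sketch. It assumes diff_price and diff_profit were already computed with groupby('Name') as above, and that the Time strings sort correctly as given (otherwise convert them with pd.to_datetime first); it prints every Name whose latest diffs exceed all of its earlier ones:
# Rows belonging to the latest Time
is_last = df['Time'].eq(df['Time'].max())
# Per-Name maximum of all earlier diffs
prev_max = df[~is_last].groupby('Name')[['diff_price', 'diff_profit']].max()
# Latest diffs, one row per Name, aligned on the Name index
last = df[is_last].set_index('Name')[['diff_price', 'diff_profit']]
# Names whose latest diff_price AND diff_profit beat all earlier values
winners = last.index[(last > prev_max).all(axis=1)]
print(list(winners))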

Related

Replace the 0 in a column with groupby median in pandas

I have a data frame as shown below: sales data for two health care products from December 2016 to November 2018.
product profit bougt_date discount salary
A 50 2016-12-01 5 25
A 50 2017-01-03 4 20
B 200 2016-12-24 10 100
A 50 2017-01-18 3 0
B 200 2017-01-28 15 80
A 50 2017-01-18 6 15
B 200 2017-01-28 20 0
A 50 2017-04-18 6 0
B 200 2017-12-08 25 0
A 50 2017-11-18 6 20
B 200 2017-08-21 20 90
B 200 2017-12-28 30 110
A 50 2018-03-18 10 0
B 300 2018-06-08 45 100
B 300 2018-09-20 50 60
A 50 2018-11-18 8 45
B 300 2018-11-28 35 0
From the above, I would like to replace each 0 salary with the median of the non-zero salaries for that product.
Explanation:
A : 15, 20, 20, 25, 45
So the median = 20.
B : 60, 80, 90, 100, 100, 110
So the median = 95.
Expected Output
product profit bougt_date discount salary
A 50 2016-12-01 5 25
A 50 2017-01-03 4 20
B 200 2016-12-24 10 100
A 50 2017-01-18 3 20
B 200 2017-01-28 15 80
A 50 2017-01-18 6 15
B 200 2017-01-28 20 95
A 50 2017-04-18 6 20
B 200 2017-12-08 25 95
A 50 2017-11-18 6 20
B 200 2017-08-21 20 90
B 200 2017-12-28 30 110
A 50 2018-03-18 10 20
B 300 2018-06-08 45 100
B 300 2018-09-20 50 60
A 50 2018-11-18 8 45
B 300 2018-11-28 35 95
You can try masking the 0 values with pd.Series.mask, then using np.nanmedian:
import numpy as np

fill_vals = df.salary.mask(df.salary.eq(0)).groupby(df['product']).transform(np.nanmedian)
df.assign(salary=df.salary.mask(df.salary.eq(0), fill_vals))
product profit bougt_date discount salary
0 A 50 2016-12-01 5 25
1 A 50 2017-01-03 4 20
2 B 200 2016-12-24 10 100
3 A 50 2017-01-18 3 20
4 B 200 2017-01-28 15 80
5 A 50 2017-01-18 6 15
6 B 200 2017-01-28 20 95
7 A 50 2017-04-18 6 20
8 B 200 2017-12-08 25 95
9 A 50 2017-11-18 6 20
10 B 200 2017-08-21 20 90
11 B 200 2017-12-28 30 110
12 A 50 2018-03-18 10 20
13 B 300 2018-06-08 45 100
14 B 300 2018-09-20 50 60
15 A 50 2018-11-18 8 45
16 B 300 2018-11-28 35 95
OR
Using np.where
df['salary'] = np.where(
    df['salary'].eq(0),
    df['salary'].replace(0, np.nan).groupby(df['product']).transform('median'),
    df['salary'],
)
First use .groupby and .transform to get the per-group median of the column. Then locate the salaries that are 0 with .loc and set them equal to that median.
# NOTE: the line below uses `median` instead of `np.nanmedian`, so the zeros are
# included when the median is computed, and the results differ. Know which one
# fits your situation; as you can see, the output below differs from Chester's answer.
df.loc[df['salary'] == 0, 'salary'] = df.groupby('product')['salary'].transform('median')
df
output:
product profit bougt_date discount salary
0 A 50 2016-12-01 5 25.0
1 A 50 2017-01-03 4 20.0
2 B 200 2016-12-24 10 100.0
3 A 50 2017-01-18 3 17.5
4 B 200 2017-01-28 15 80.0
5 A 50 2017-01-18 6 15.0
6 B 200 2017-01-28 20 80.0
7 A 50 2017-04-18 6 17.5
8 B 200 2017-12-08 25 80.0
9 A 50 2017-11-18 6 20.0
10 B 200 2017-08-21 20 90.0
11 B 200 2017-12-28 30 110.0
12 A 50 2018-03-18 10 17.5
13 B 300 2018-06-08 45 100.0
14 B 300 2018-09-20 50 60.0
15 A 50 2018-11-18 8 45.0
16 B 300 2018-11-28 35 80.0
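If you want the zeros excluded but prefer the plain 'median' aggregation string, a minimal alternative sketch (same idea as the masking answer above) is to turn the zeros into NaN first, since transform('median') skips NaN by default:
import numpy as np

# Treat 0 as missing, then fill from the per-product median of the remaining values
df['salary'] = df['salary'].replace(0, np.nan)
df['salary'] = df['salary'].fillna(df.groupby('product')['salary'].transform('median'))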

Want to assign value in the new column by comparing the other column in pandas data frame?

I have a dataframe with a column called savings, which contains positive and negative values. If savings is negative, I want to assign 1 in a new column (which I need to create, named flag_negative); if savings is positive, assign 0. The savings column also has missing values, which I don't want to touch; they should be left as they are.
I would like to use a loop or any other simple method.
My dataframe is named df.
I want to get the following:
Number of rows: 9000
savings flag_negative
100 0
-76 1
1200 0
-
-
-200 1
500 0
I tried a loop and created the new column flag_negative, but I am getting None for all rows.
Below is my code:
for i in sum['savings']:
    if i > 0:
        sum['flag_negative'] = print(0)
    elif i == " ":
        sum['flag_negative'] = print(" ")
    else:
        sum['flag_negative'] = print(1)
If your dataframe is like this:
savings
0 -4
1 -41
2 174
3 -103
4 -194
5 -160
6 126
7 100
8 -125
9 -71
10 -159
11 -100
12 -30
13 -50
14 83
15 124
16 -123
17 -70
18 -71
19 -29
then you can easily filter on positive/negative and assign to a new column like this:
df.loc[df.savings < 0, 'flag_negative'] = 1
df.loc[df.savings >= 0, 'flag_negative'] = 0
resulting in:
savings flag_negative
0 -4 1.0
1 -41 1.0
2 174 0.0
3 -103 1.0
4 -194 1.0
5 -160 1.0
6 126 0.0
7 100 0.0
8 -125 1.0
9 -71 1.0
10 -159 1.0
11 -100 1.0
12 -30 1.0
13 -50 1.0
14 83 0.0
15 124 0.0
16 -123 1.0
17 -70 1.0
18 -71 1.0
19 -29 1.0
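Rows with missing savings match neither .loc condition, so they keep NaN in flag_negative, as required. As a hedged one-statement alternative, the same logic can be sketched with numpy's np.select:
import numpy as np

# 1 for negative, 0 for non-negative; NaN comparisons are False, so
# missing savings fall through to the default and stay NaN
df['flag_negative'] = np.select(
    [df['savings'] < 0, df['savings'] >= 0],
    [1, 0],
    default=np.nan,
)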

Slice values of a column and calculate average in python

I have a dataframe with three columns:
a b c
0 73 12
73 80 2
80 100 5
100 150 13
Values in "a" and "b" are days. I need to find the average values of "c" in each 30 day-interval (slice values inside [min(a),max(b)] in 30 days and calculate average of c). I want as a result have a dataframe like this:
aa bb c_avg
0 30 12
30 60 12
60 90 6.33
90 120 9
120 150 13
Another sample dataset could be:
a b c
0 1264.0 1629.0 0.000000
1 1629.0 1632.0 133.333333
6 1632.0 1699.0 0.000000
2 1699.0 1706.0 21.428571
7 1706.0 1723.0 0.000000
3 1723.0 1726.0 50.000000
8 1726.0 1890.0 0.000000
4 1890.0 1893.0 33.333333
1 1893.0 1994.0 0.000000
How can I get to the final table?
First create a ranges DataFrame with the 30-day bins spanning the a and b columns:
a = np.arange(0, 180, 30)
df1 = pd.DataFrame({'aa':a[:-1], 'bb':a[1:]})
#print (df1)
Then cross join all rows using the helper column tmp:
df3 = pd.merge(df1.assign(tmp=1), df.assign(tmp=1), on='tmp')
#print (df3)
And last, filter. There are two solutions, depending on which columns drive the filtering: the first keeps rows where a bin endpoint aa or bb falls inside [a, b]; the second, further below, keeps rows where a or b falls inside [aa, bb].
df4 = df3[df3['aa'].between(df3['a'], df3['b']) | df3['bb'].between(df3['a'], df3['b'])]
print (df4)
aa bb tmp a b c
0 0 30 1 0 73 12
4 30 60 1 0 73 12
8 60 90 1 0 73 12
10 60 90 1 80 100 5
14 90 120 1 80 100 5
15 90 120 1 100 150 13
19 120 150 1 100 150 13
df4 = df4.groupby(['aa','bb'], as_index=False)['c'].mean()
print (df4)
aa bb c
0 0 30 12.0
1 30 60 12.0
2 60 90 8.5
3 90 120 9.0
4 120 150 13.0
df5 = df3[df3['a'].between(df3['aa'], df3['bb']) | df3['b'].between(df3['aa'], df3['bb'])]
print (df5)
aa bb tmp a b c
0 0 30 1 0 73 12
8 60 90 1 0 73 12
9 60 90 1 73 80 2
10 60 90 1 80 100 5
14 90 120 1 80 100 5
15 90 120 1 100 150 13
19 120 150 1 100 150 13
df5 = df5.groupby(['aa','bb'], as_index=False)['c'].mean()
print (df5)
aa bb c
0 0 30 12.000000
1 60 90 6.333333
2 90 120 9.000000
3 120 150 13.000000
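The two conditions keep slightly different rows (compare the [30, 60] and [60, 90] bins above). A third option, sketched below, is the standard interval-overlap test a < bb and b > aa, which keeps every row whose [a, b] range touches the bin at all; note that merge(..., how='cross') needs pandas 1.2+. For the first sample this reproduces the asker's expected table, including the [30, 60] row:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 73, 80, 100],
                   'b': [73, 80, 100, 150],
                   'c': [12, 2, 5, 13]})

edges = np.arange(0, 180, 30)
bins = pd.DataFrame({'aa': edges[:-1], 'bb': edges[1:]})

# Cross join, keep rows whose [a, b] overlaps the bin, then average c per bin
cross = bins.merge(df, how='cross')
overlap = (cross['a'] < cross['bb']) & (cross['b'] > cross['aa'])
print(cross[overlap].groupby(['aa', 'bb'], as_index=False)['c'].mean())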

Cumulative Sum by date (Month)

I have a pandas dataframe and I need to work out the cumulative sum for each month.
Date Amount
2017/01/12 50
2017/01/12 30
2017/01/15 70
2017/01/23 80
2017/02/01 90
2017/02/01 10
2017/02/02 10
2017/02/03 10
2017/02/03 20
2017/02/04 60
2017/02/04 90
2017/02/04 100
The cumulative sum is the running total within each month (days 01-31); some days are missing. The data frame should look like:
Date Sum_Amount
2017/01/12 80
2017/01/15 150
2017/01/23 230
2017/02/01 100
2017/02/02 110
2017/02/03 140
2017/02/04 390
If you only need the cumulative sum by month, use groupby with sum per day and then group by the month values of the index:
df.Date = pd.to_datetime(df.Date)
df = df.groupby('Date').Amount.sum()
df = df.groupby(df.index.month).cumsum().reset_index()
print (df)
Date Amount
0 2017-01-12 80
1 2017-01-15 150
2 2017-01-23 230
3 2017-02-01 100
4 2017-02-02 110
5 2017-02-03 140
6 2017-02-04 390
But if you need to distinguish months across different years, convert the index to a month period with to_period:
df = df.groupby(df.index.to_period('m')).cumsum().reset_index()
The difference is easier to see in a changed df with a different year added:
print (df)
Date Amount
0 2017/01/12 50
1 2017/01/12 30
2 2017/01/15 70
3 2017/01/23 80
4 2017/02/01 90
5 2017/02/01 10
6 2017/02/02 10
7 2017/02/03 10
8 2018/02/03 20
9 2018/02/04 60
10 2018/02/04 90
11 2018/02/04 100
df.Date = pd.to_datetime(df.Date)
df = df.groupby('Date').Amount.sum()
df = df.groupby(df.index.month).cumsum().reset_index()
print (df)
Date Amount
0 2017-01-12 80
1 2017-01-15 150
2 2017-01-23 230
3 2017-02-01 100
4 2017-02-02 110
5 2017-02-03 120
6 2018-02-03 140
7 2018-02-04 390
df.Date = pd.to_datetime(df.Date)
df = df.groupby('Date').Amount.sum()
df = df.groupby(df.index.to_period('m')).cumsum().reset_index()
print (df)
Date Amount
0 2017-01-12 80
1 2017-01-15 150
2 2017-01-23 230
3 2017-02-01 100
4 2017-02-02 110
5 2017-02-03 120
6 2018-02-03 20
7 2018-02-04 270
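For reference, an equivalent sketch that keeps Date as a regular column instead of moving it into the index; Series.dt.to_period('M') yields the same year-month grouping key:
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
daily = df.groupby('Date', as_index=False)['Amount'].sum()
# Running total within each year-month
daily['Amount'] = daily.groupby(daily['Date'].dt.to_period('M'))['Amount'].cumsum()
print(daily)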

Add a column in dataframe conditionally from values in other dataframe python

I have a table in a pandas dataframe df:
id product_1 count
1 100 10
2 200 20
3 100 30
4 400 40
5 500 50
6 200 60
7 100 70
I also have another table in dataframe df2:
product score
100 5
200 10
300 15
400 20
500 25
600 30
700 35
I have to create a new column score in my first df, taking the values of score from df2 matched on product_1.
My final output should be: df =
id product_1 count score
1 100 10 5
2 200 20 10
3 100 30 5
4 400 40 20
5 500 50 25
6 200 60 10
7 100 70 5
Any ideas how to achieve it?
Use map:
df['score'] = df['product_1'].map(df2.set_index('product')['score'].to_dict())
print (df)
id product_1 count score
0 1 100 10 5
1 2 200 20 10
2 3 100 30 5
3 4 400 40 20
4 5 500 50 25
5 6 200 60 10
6 7 100 70 5
Or merge:
df = pd.merge(df,df2, left_on='product_1', right_on='product', how='left')
print (df)
id product_1 count product score
0 1 100 10 100 5
1 2 200 20 200 10
2 3 100 30 100 5
3 4 400 40 400 20
4 5 500 50 500 25
5 6 200 60 200 10
6 7 100 70 100 5
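Note that merge keeps df2's product column next to product_1; if you only want the score, you can drop it afterwards (a small follow-up sketch using the same frames as above):
df = pd.merge(df, df2, left_on='product_1', right_on='product', how='left').drop(columns='product')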
EDIT by comment:
df['score'] = df['product_1'].map(df2.set_index('product')['score'].to_dict())
df['final_score'] = (df['count'].mul(0.6).div(df.id)).add(df.score.mul(0.4))
print (df)
id product_1 count score final_score
0 1 100 10 5 8.0
1 2 200 20 10 10.0
2 3 100 30 5 8.0
3 4 400 40 20 14.0
4 5 500 50 25 16.0
5 6 200 60 10 10.0
6 7 100 70 5 8.0
