Conditional Calculation of Pandas Dataframe columns - python

I have a pandas dataframe which reads
Category Sales
A 10
B 20
I want to create a new column Target conditionally.
And I want my target df to look like
Category Sales Target
A 10 5
B 20 10
I used the below code and it threw an error:
if(df['Category']=='A'):
    df['Target']=df['Sales']-5
else:
    df['Target']=df['Sales']-10

Use vectorized numpy.where (a plain if fails here because df['Category']=='A' is a whole Series of booleans, whose truth value is ambiguous):
import numpy as np

df['Target'] = np.where(df['Category']=='A', df['Sales'] - 5, df['Sales'] - 10)
print (df)
Category Sales Target
0 A 10 5
1 B 20 10
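If there are more than two categories, nesting np.where gets unwieldy; numpy.select generalizes it to one condition per category. A minimal sketch, assuming a hypothetical third category C and a scalar default for anything unmatched:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'C'], 'Sales': [10, 20, 30]})

# np.select picks the first matching condition per row
conditions = [df['Category'] == 'A', df['Category'] == 'B']
choices = [df['Sales'] - 5, df['Sales'] - 10]
df['Target'] = np.select(conditions, choices, default=0)  # rows matching nothing get 0
print(df)
```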

Related

Groupby by sum of revenue and the corresponding highest contributing month - Pandas

I have a bill-details data set and I want to group the products by the sum of their total value; additionally, I want a column indicating the month that produced the most revenue for each product.
Data set:
Bill_Id Month Product_Id Net_Value
1 1 20 100
2 1 20 100
3 2 20 100
4 1 30 200
5 2 30 200
6 2 30 200
Desired_Result
Product_Id Total_revenue Top_Month
20 300 1
30 600 2
This is just a sample dataset; I have transaction data for the entire year.
Pivot the dataframe with aggfunc='sum', then use sum and idxmax along the columns axis to find the total revenue and the month contributing most to it; finally, concat the individual components along the columns axis to get the result:
s = df.pivot_table('Net_Value', 'Product_Id', 'Month', aggfunc='sum')
pd.concat([s.sum(1), s.idxmax(1)], axis=1, keys=['Total_revenue', 'Top_Month'])
Total_revenue Top_Month
Product_Id
20 300 1
30 600 2
Assuming only one Top_Month value is needed, based on the maximum sum of Net_Value, the code below might work for you.
We can achieve this in three stages:
1. Extracting the sum of net revenue based on product id
df_1 = df.groupby(['Product_Id']).agg({'Net_Value' : sum}).reset_index()
df_1 = df_1.rename(columns={'Net_Value' : 'Total_revenue'})
print(df_1)
Product_Id Total_revenue
0 20 300
1 30 600
2. Extracting the best contributing month based on the max sum of net revenue for each product id
df_2 = df.groupby(['Product_Id', 'Month']).agg({'Net_Value' : sum}).sort_values('Net_Value', ascending=False).reset_index()
df_2 = df_2.drop_duplicates(subset=['Product_Id'])[['Product_Id', 'Month']]
print(df_2)
Product_Id Month
0 30 2
1 20 1
3. The final step is to merge both dataframes into one based on product id
final_df = df_1.merge(df_2)
print(final_df)
Product_Id Total_revenue Month
0 20 300 1
1 30 600 2
A small modification of @Shubham's approach:
result = (
df.pivot_table("Net_Value", "Product_Id", "Month", aggfunc="sum")
.agg(["sum", "idxmax"], axis=1)
.set_axis(["Total_revenue", "Top_Month"], axis=1)
)
Since multiple columns interact, I have used the apply function in addition to groupby:
Net_Value is calculated using the basic aggregate function sum.
Top_month requires interaction between columns, so first get the index of the max Net_Value using idxmax, then use loc to find the month.
The resulting Series has the groupby column (Product_Id) as its index, so to make it a column I used reset_index:
def f(x):
    d = {}
    d['Net_Value'] = x['Net_Value'].sum()
    d['Top_month'] = df.loc[x['Net_Value'].idxmax(), "Month"]
    return pd.Series(d, index=['Net_Value', 'Top_month'])

df.groupby('Product_Id').apply(f).reset_index()
# Output
Product_Id Net_Value Top_month
0 20 300 1
1 30 600 2

How to implement the excel function IF(H3>I3,C2,0) in pandas

In column J I would like to get the value as per the Excel function IF(H3>I3,C2,0); occurrences are counted from bottom to top, so the first occurrence is the latest one and the next is the 2nd occurrence.
Here is the solution:
import pandas as pd
import numpy as np
# suppose we have this DataFrame:
df = pd.DataFrame({'A':[55,23,11,100,9] , 'B':[12,72,35,4,100]})
# suppose we want to keep the values of column 'A' where they are greater than or equal to column 'B', otherwise return 0
# so I'll make another column named 'Result' to put the results in
df['Result'] = np.where(df['A'] >= df['B'] , df['A'] , 0)
Then if you print the DataFrame:
df
result:
A B Result
0 55 12 55
1 23 72 0
2 11 35 0
3 100 4 100
4 9 100 0
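Note that the Excel formula IF(H3>I3,C2,0) takes C from the row above the one being tested. If that row offset matters, Series.shift reproduces it; a sketch with column names assumed to mirror the formula:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'H': [55, 23, 40], 'I': [12, 72, 35], 'C': [1, 2, 3]})

# shift(1) supplies the previous row's C (Excel's C2 when evaluating row 3);
# the first row has no predecessor, so fillna(0) falls back to 0
df['J'] = np.where(df['H'] > df['I'], df['C'].shift(1).fillna(0), 0)
print(df)
```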

Select a specific group of a grouped dataframe with pandas

I have the following dataframe:
df.index = df['Date']
df.groupby([df.index.month, df['Category']])['Amount'].sum()
Date Category Amount
1 A -125.35
B -40.00
...
12 A 505.15
B -209.00
I would like to report the sum of the Amount for every Category B like:
Date Category Amount
1 B -40.00
...
12 B -209.00
I tried the df.get_group method, but it needs a tuple containing both the Date and Category keys. Is there a way to filter out only the Categories with B?
You can use pd.IndexSlice:
# groupby here
df_group = df.groupby([df.index.month, df['Category']])['Amount'].sum()
# report only Category B
df_group.loc[pd.IndexSlice[:, 'B']]
Or query:
# query works with index level name too
df_group.query('Category=="B"')
Output:
Date Category
1 B -40.0
12 B -209.0
Name: Amount, dtype: float64
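Series.xs is a third option; it selects one label from a named index level and drops that level. Sample data is assumed here to make the snippet runnable:

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2020-01-05', '2020-01-10',
                                           '2020-12-03', '2020-12-20']),
                   'Category': ['A', 'B', 'A', 'B'],
                   'Amount': [-125.35, -40.0, 505.15, -209.0]}).set_index('Date')

df_group = df.groupby([df.index.month, df['Category']])['Amount'].sum()
# cross-section: keep only Category B, dropping the Category level
print(df_group.xs('B', level='Category'))
```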
Apply a filter to your dataframe where Category equals B, then group; the grouping keys must come from the filtered frame so their lengths match:
mask = df['Category'] == 'B'
filtered = df[mask]
filtered.groupby([filtered.index.month, filtered['Category']])['Amount'].sum()

Need to display the max price and company that has it

So, I just started working with Python and I need to display the maximum price and the company that has it. I got the data from a CSV file that has multiple columns describing some cars; I'm only interested in two of them: price and company. Any advice?
This is what I tried, and I don't know how to get the company too, not only the maximum price:
import pandas as pd
df = pd.read_csv("Automobile_data.csv")
for x in df['price']:
    if x == df['price'].max():
        print(x)
Use Series.max, create index by DataFrame.set_index and get company name by Series.idxmax:
df = pd.DataFrame({
'company':list('abcdef'),
'price':[7,8,9,4,2,3],
})
print (df)
company price
0 a 7
1 b 8
2 c 9
3 d 4
4 e 2
5 f 3
print(df['price'].max())
9
print(df.set_index('company')['price'].idxmax())
c
Another idea is use DataFrame.agg:
s = df.set_index('company')['price'].agg(['max','idxmax'])
print (s['max'])
9
print (s['idxmax'])
c
If possible duplicated maximum values and need all companies of max price use boolean indexing with DataFrame.loc - get Series:
df = pd.DataFrame({
'company':list('abcdef'),
'price':[7,8,9,4,2,9],
})
print (df)
company price
0 a 7
1 b 8
2 c 9
3 d 4
4 e 2
5 f 9
print(df['price'].max())
9
#only first value
print(df.set_index('company')['price'].idxmax())
c
#all maximum values
s = df.loc[df['price'] == df['price'].max(), 'company']
print (s)
2 c
5 f
Name: company, dtype: object
If need one row DataFrame with only the first maximum, wrap idxmax in a list:
out = df.loc[[df['price'].idxmax()], ['company','price']]
print (out)
company price
2 c 9
And if need all rows with the maximum price:
out = df.loc[df['price'] == df['price'].max(), ['company','price']]
print (out)
company price
2 c 9
5 f 9
That is how not to use pandas; pandas is made to avoid loops.
import pandas as pd
df = pd.read_csv("Automobile_data.csv")
max_price = df[df['price'] == df['price'].max()]
print(max_price)
That is how you would do it. If you only want price and company
print(max_price[['company','price']])
Explanation: we create a boolean filter that is True where the price equals the maximum price, then use it as a mask to select the rows we need.
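For completeness, DataFrame.nlargest collapses the mask-and-filter into a single call (pass keep='all' if tied maximum prices should all be returned); sketched on sample data since the CSV isn't available:

```python
import pandas as pd

df = pd.DataFrame({'company': list('abcdef'), 'price': [7, 8, 9, 4, 2, 3]})

# top-1 row by price, already sorted descending
print(df.nlargest(1, 'price')[['company', 'price']])
```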
In addition to Jezrael's complete answer, I would suggest using groupby as follows:
df = pd.DataFrame({
'company':list('abcdef'),
'price':[7,8,9,4,2,3],
})
sorted_df = df.groupby(['price']).max().reset_index()
desired_row = sorted_df.iloc[-1]
price = desired_row['price']
company = desired_row['company']
print('Maximum price is: ', price)
print('The company is: ', company)
The above code prints:
Maximum price is: 9
The company is: c

Pandas DataFrames: Extract Information and Collapse Columns

I have a pandas DataFrame which contains information in columns which I would like to extract into a new column.
It is best explained visually:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Number Type 1': [1, 2, np.nan],
                   'Number Type 2': [np.nan, 3, 4],
                   'Info': list('abc')})
The table shows the initial DataFrame with the Number Type 1 and Number Type 2 columns.
I would like to extract the types and create a new Type column, refactoring the DataFrame accordingly.
Basically, the numbers are collapsed into a single Number column and the types are extracted into the Type column. The information in the Info column is bound to the numbers (e.g. 2 and 3 have the same information, b).
What is the best way to do this in Pandas?
Use melt with dropna:
df = df.melt('Info', value_name='Number', var_name='Type').dropna(subset=['Number'])
df['Type'] = df['Type'].str.extract(r'(\d+)')
df['Number'] = df['Number'].astype(int)
print (df)
Info Type Number
0 a 1 1
1 b 1 2
4 b 2 3
5 c 2 4
Another solution with set_index and stack:
df = df.set_index('Info').stack().rename_axis(('Info','Type')).reset_index(name='Number')
df['Type'] = df['Type'].str.extract(r'(\d+)')
df['Number'] = df['Number'].astype(int)
print (df)
Info Type Number
0 a 1 1
1 b 1 2
2 b 2 3
3 c 2 4
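Since both column names share the stub 'Number Type ', pd.wide_to_long can also do this reshape, capturing the numeric suffix directly as the Type level. A sketch on the same sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Number Type 1': [1, 2, np.nan],
                   'Number Type 2': [np.nan, 3, 4],
                   'Info': list('abc')})

# the stub 'Number Type ' matches both columns; their numeric suffixes become level 'Type'
out = (pd.wide_to_long(df, stubnames='Number Type ', i='Info', j='Type')
         .rename(columns={'Number Type ': 'Number'})
         .dropna(subset=['Number'])
         .astype({'Number': int})
         .reset_index())
print(out)
```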
