Extract minimum values only in a dataframe - python

I have the following dataframe:
Quantity_Limit  Cost  Wholesaler_Code
             2   9.2                1
             2   9.4                1
             2   7.1                2
             4  10.2                1
             4   4.1                2
             4   2.1                3
I would like to create the following dataframe, keeping only the wholesaler that offers the minimum Cost for each quantity limit, without using a for loop:
Quantity_Limit  Cost  Wholesaler_Code
             2   7.1                2
             4   2.1                3
I tried with:
df.groupby(["Quantity_Limit", "Wholesaler_Code"], as_index = False).agg({"Cost": "min"})
but I don't get the desired result.
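Grouping by both Quantity_Limit and Wholesaler_Code puts every (quantity, wholesaler) pair in its own group, so no rows are eliminated. As a minimal sketch of one common fix, group by Quantity_Limit alone and select each group's minimum-Cost row with idxmin:
# idxmin returns the row index of each group's minimum Cost
df.loc[df.groupby("Quantity_Limit")["Cost"].idxmin()].reset_index(drop=True)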

Just sort by Quantity_Limit and Cost, then drop_duplicates on Quantity_Limit:
df.sort_values(['Quantity_Limit', 'Cost']).drop_duplicates(subset=['Quantity_Limit'])
Out[1121]:
   Quantity_Limit  Cost  Wholesaler_Code
2               2   7.1                2
5               4   2.1                3

You can use transform to create a column with the minimum values and filter based on those.
df["min_cost"] = df.groupby("Quantity_Limit")["Cost"].transform("min")
df[df["Cost"] == df["min_cost"]]
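If you'd rather not keep the helper column, the same transform-based filter also works inline, as a one-line sketch:
df[df["Cost"] == df.groupby("Quantity_Limit")["Cost"].transform("min")]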

You can also groupby and merge the result back onto the original df to recover the leftover column:
df2 = df.groupby(['Quantity_Limit'])['Cost'].min().reset_index()
df2 = pd.merge(df2, df, on = ['Quantity_Limit', 'Cost'], how = 'left')
Output:
   Quantity_Limit  Cost  Wholesaler_Code
0               2   7.1                2
1               4   2.1                3

import pandas as pd

# Raw data (the third row's Wholesaler_Code is 2, matching the question)
data = [[2, 9.2, 1], [2, 9.4, 1], [2, 7.1, 2], [4, 10.2, 1], [4, 4.1, 2], [4, 2.1, 3]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Quantity_Limit', 'Cost', 'Wholesaler_Code'])
# Broadcast each group's minimum Cost back onto its rows as min_cost
df["min_cost"] = df.groupby("Quantity_Limit")["Cost"].transform("min")
Now filter the rows whose Cost equals the group minimum:
df1 = df[df["Cost"] == df["min_cost"]]
And you will get your desired output.

Related

Replace values in Pandas Dataframe using another Dataframe as a lookup table

I'm looking to replace values in a Dataframe with values from a second Dataframe, matching each value in the first Dataframe to a column label of the second.
Example:
import numpy as np
import pandas as pd
dt_index = pd.to_datetime(['2003-05-01', '2003-05-02', '2003-05-03', '2003-05-04'])
df = pd.DataFrame({'A':[1,1,3,12], 'B':[12,1,3,3], 'C':[3,12,12,1]}, index = dt_index)
df2 = pd.DataFrame({1:[1.4,4.2,1.3,5.6], 12:[2.3,7.3,9.5,0.4], 3:[8.8,0.1,8.7,2.4], 4:[9.6,9.8,5.5,1.8]}, index = dt_index)
df =
             A   B   C
2003-05-01   1  12   3
2003-05-02   1   1  12
2003-05-03   3   3  12
2003-05-04  12   3   1
df2 =
              1    12    3    4
2003-05-01  1.4   2.3  8.8  9.6
2003-05-02  4.2   7.3  0.1  9.8
2003-05-03  1.3   9.5  8.7  5.5
2003-05-04  5.6   0.4  2.4  1.8
Expected output:
expect = pd.DataFrame({'A':[1.4,4.2,8.7,0.4], 'B':[2.3,4.2,8.7,2.4], 'C':[8.8,7.3,9.5,5.6]}, index = dt_index)
expect =
              A    B    C
2003-05-01  1.4  2.3  8.8
2003-05-02  4.2  4.2  7.3
2003-05-03  8.7  8.7  9.5
2003-05-04  0.4  2.4  5.6
Attempt:
X = df.copy()
for i in np.unique(df):
    X.mask(df == i, df2[i], axis=0, inplace=True)
My attempt seems to work, but I'm not sure whether it has any pitfalls or how it would scale as the sizes of the Dataframes increase.
Are there better or faster solutions?
EDIT:
After cottontail's helpful answer, I realised I've made an oversimplification in my example. The values in df and columns of df and df2 cannot be assumed to be sequential.
I've now modified the example to reflect that.
One approach is to use stack() to reshape df2 into a Series, reindex() it using the values in df, then reshape back into the original shape using unstack().
# look up each (date, value) pair from df in df2's stacked (date, column) index
tmp = df2.stack().reindex(df.stack().droplevel(-1).items())
# restore df's column labels alongside the dates
tmp.index = pd.MultiIndex.from_arrays([tmp.index.get_level_values(0), df.columns.tolist()*len(df)])
df = tmp.unstack()
Another approach is to iteratively create a dummy dataframe shaped like df2, multiply it by df2, reduce it into a Series (using sum()) and assign it to an empty dataframe shaped like df.
X = pd.DataFrame().reindex_like(df)
df['dummy'] = 1
for c in X:
    X[c] = (
        df.groupby([df.index, c])['dummy'].size()
        .unstack(fill_value=0)
        .reindex(df2.columns, axis=1, fill_value=0)
        .mul(df2)
        .sum(1)
    )
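For larger frames, a numpy-based lookup may be worth benchmarking. A sketch, assuming every value appearing in df is a column label of df2 (pandas and numpy imported as above):
# map each value in df to the positional index of the matching df2 column
col_pos = {c: i for i, c in enumerate(df2.columns)}
idx = df.apply(lambda s: s.map(col_pos)).to_numpy()
# pick, row by row, the df2 entries sitting at those column positions
out = pd.DataFrame(np.take_along_axis(df2.to_numpy(), idx, axis=1),
                   index=df.index, columns=df.columns)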

How to iterate two or more columns and perform analysis in pandas?

I have two dataframes: one with 2 columns and 11 rows, and another with 2 columns and 2 rows.
print(df)
Output is:
    C1  C2
0    1   1
1    2   2
2    3   3
3    4   4
4    5   5
5    6   6
6    7   7
7    9   9
8   11  13
9   10  11
10  12  11
Second dataframe is
print(df1)
Output is:
   Mean  Dev
0     2  0.5
1     1  1.0
I'm trying to subtract the Mean value (row 0) from every value in column C1 of df and divide by the Dev value (row 0). Below is the code:
for i in range(0, len(df)):
    print((df['C1'][i] - df1['Mean'][0]) / (df1['Dev'][0]))
Output is :
-2.0
0.0
2.0
4.0
6.0
8.0
10.0
14.0
18.0
16.0
20.0
My question is how to perform the subtraction and division for every column with respect to the corresponding Mean and Dev rows. For example, I'm trying to write code
for i in range(0, len(df)):
    print((df['C2'][i] - df1['Mean'][1]) / (df1['Dev'][1]))
Followed by
for i in range(0, len(df)):
    print((df['C3'][i] - df1['Mean'][2]) / (df1['Dev'][2]))
Followed by
for i in range(0, len(df)):
    print((df['C4'][i] - df1['Mean'][3]) / (df1['Dev'][3]))
In the code above, we are looping over the df values. How do I also loop over the df1 values?
You can accomplish this without for loops by taking advantage of elementwise operations and broadcasting:
import pandas as pd

# Example data
df = pd.DataFrame({'C1': [i for i in range(1, 12)], 'C2': [i for i in range(2, 13)]})
# Example mean and standard deviation
df1 = pd.DataFrame({'Mean': [2, 1], 'Dev': [0.5, 1]})
# Subtract the Mean values and divide by the Dev values; each length-2
# array broadcasts across df's two columns
df_out = (df - df1['Mean'].to_numpy()) / df1['Dev'].to_numpy()
This assumes that the number of rows in the mean/standard-deviation frame equals the number of columns in the data frame, and that row i of the former corresponds to column i of the latter.
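Equivalently, pandas' sub and div methods make the broadcast axis explicit; a sketch using the df and df1 defined above:
# axis=1 pairs each array entry with one column of df
means = df1['Mean'].to_numpy()
devs = df1['Dev'].to_numpy()
df_out = df.sub(means, axis=1).div(devs, axis=1)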
If you need to store the results for a single column:
l = (df.loc[:, 'C1'] - df1.at[0, 'Mean']) / df1.at[0, 'Dev']
If you need to store the new values in a new dataframe:
df3 = pd.DataFrame(columns=['c1'])
df3.loc[:, 'c1'] = (df.loc[:, 'C1'] - df1.at[0, 'Mean']) / df1.at[0, 'Dev']

How do I create a rank table for a given pandas dataframe with multiple numerical columns?

I would like to create a rank table from a pandas dataframe that has several numerical columns.
Let's use the following df as an example:
Name  Sales  Volume  Reviews
A      1000     100      100
B      2000     200       50
C      5400     500       10
I would like to create a new table, ranked_df, that ranks the values in each column in descending order while maintaining essentially the same format:
Name  Sales_rank  Volume_rank  Reviews_rank
A              3            3             1
B              2            2             2
C              1            1             3
Now, I can iteratively do this by looping through the columns, i.e.
df = pd.DataFrame({
    "Name": ['A', 'B', 'C'],
    "Sales": [1000, 2000, 5400],
    "Volume": [100, 200, 500],
    "Reviews": [100, 50, 10]
})
# make a copy of the original df
ranked_df = df.copy()
# define our columns of interest
interest_cols = ['Sales', 'Volume', 'Reviews']
for col in interest_cols:
    ranked_df[f"{col}_rank"] = df[col].rank(ascending=False)
# drop the cols not needed
...
But my question is this: is there a more elegant, or more pythonic, way of doing this? Maybe an apply over the dataframe? Or some vectorized operation by throwing it to numpy?
Thank you.
df.set_index('Name').rank(ascending=False).reset_index()
  Name  Sales  Volume  Reviews
0    A    3.0     3.0      1.0
1    B    2.0     2.0      2.0
2    C    1.0     1.0      3.0
You could use transform/apply to hit each column
df.set_index('Name').transform(pd.Series.rank, ascending=False)
      Sales  Volume  Reviews
Name
A       3.0     3.0      1.0
B       2.0     2.0      2.0
C       1.0     1.0      3.0
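If you do want to push the work to numpy, a double argsort yields the same descending ranks. A sketch reusing df and interest_cols from the question, assuming no ties (pandas' rank handles ties more gracefully):
import numpy as np
vals = df.set_index('Name').to_numpy()
# argsort of argsort turns values into 0-based ranks; negate for descending
ranks = (-vals).argsort(axis=0).argsort(axis=0) + 1
ranked_df = pd.DataFrame(ranks, index=df['Name'],
                         columns=[f"{c}_rank" for c in interest_cols])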

need to add a column together and put the average beneath the column in Pandas

I'm currently trying to add together a column that has two rows to it, as such:
Now I just need to add rows 1 and 2 together for each column, and I want to append their average underneath the respective column header. I currently have this:
for x in sub_houseKeeping:
    if "phch" in x:
        sub_houseKeeping['Average'] = sub_houseKeeping[x].sum()/2
However, this adds together the entire row and appends it to the end of the rows, not the bottom of the column as I wished. How can I fix it to add to the bottom of the column?
This?
import io
import pandas as pd

data = ''' id  a   b
0 1 34 10
1 2 27 40'''
df = pd.read_csv(io.StringIO(data), sep='\s+', engine='python')
df1 = df.append(df[['a', 'b']].mean(), ignore_index=True)
df1
id a b
0 1.0 34.0 10.0
1 2.0 27.0 40.0
2 NaN 30.5 25.0
Try this:
sub_houseKeeping = pd.DataFrame({'ID':['200650_s_at','1565446_at'], 'phchp003v1':[2174.84972,6.724141107], 'phchp003v2':[444.9008362,4.093883364]})
sub_houseKeeping = sub_houseKeeping.append(pd.DataFrame(sub_houseKeeping.mean(axis=0)).T, ignore_index=True)
Output:
print(sub_houseKeeping)
ID phchp003v1 phchp003v2
0 200650_s_at 2174.849720 444.900836
1 1565446_at 6.724141 4.093883
2 NaN 1090.786931 224.497360
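Note that DataFrame.append was removed in pandas 2.0; a concat-based sketch of the same append, using the sub_houseKeeping frame above:
# average the numeric columns only; ID stays NaN in the appended row
mean_row = sub_houseKeeping.mean(numeric_only=True).to_frame().T
sub_houseKeeping = pd.concat([sub_houseKeeping, mean_row], ignore_index=True)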

Replace values in dataframe from another dataframe with Pandas

I have 3 dataframes: df1, df2, df3. I am trying to fill the NaN values of df1 with values contained in df2. The values selected from df2 are chosen according to the output of a simple function (mul_val) that processes some data stored in df3.
I was able to get such a result, but I would like to find a simpler, more readable way.
Here is what I have so far:
import pandas as pd
import numpy as np
# simple function
def mul_val(a, b):
    return a*b
# dataframe 1
data = {'Name':['PINO','PALO','TNCO' ,'TNTO','CUCO' ,'FIGO','ONGF','LABO'],
'Id' :[ 10 , 9 ,np.nan , 14 , 3 ,np.nan, 7 ,np.nan]}
df1 = pd.DataFrame(data)
# dataframe 2
infos = {'Info_a':[10,20,30,40,70,80,90,50,60,80,40,50,20,30,15,11],
'Info_b':[10,30,30,60,10,85,99,50,70,20,30,50,20,40,16,17]}
df2 = pd.DataFrame(infos)
dic = {'Name': {0: 'FIGO', 1: 'TNCO'},
'index': {0: [5, 6], 1: [11, 12, 13]}}
df3 = pd.DataFrame(dic)
#---------------Modify from here in the most efficient way!-----------------
for idx, row in df3.iterrows():
    store_val = []
    print(row['Name'])
    for j in row['index']:
        store_val.append([mul_val(df2['Info_a'][j], df2['Info_b'][j]), j])
    store_val = np.asarray(store_val)
    # - Identify the index of the minimum value in the first column
    indx_min_val = np.argmin(store_val[:, 0])
    # - Get the corresponding row number stored in the second column
    col_value = row['index'][indx_min_val]
    # - Identify the value to be replaced in df1
    value_to_be_replaced = df1['Id'][df1['Name'] == row['Name']]
    # - Replace that value in df1 for the matching row['Name']
    df1['Id'].replace(to_replace=value_to_be_replaced, value=col_value, inplace=True)
By printing store_val at every iteration I get:
FIGO
[[6800 5]
[8910 6]]
TNCO
[[2500 11]
[ 400 12]
[1200 13]]
Let's do a simple example: considering FIGO, I identify 6800 as the minimum between 6800 and 8910. Therefore I select the number 5, which is placed in df1. Repeating this operation for the remaining rows of df3 (here I have only 2 rows, but there could be many more), the final result should look like this:
In [0]: before                In [0]: after
Out[0]:                       Out[0]:
     Id  Name                      Id  Name
0  10.0  PINO                 0  10.0  PINO
1   9.0  PALO                 1   9.0  PALO
2   NaN  TNCO     ----->      2  12.0  TNCO
3  14.0  TNTO                 3  14.0  TNTO
4   3.0  CUCO                 4   3.0  CUCO
5   NaN  FIGO     ----->      5   5.0  FIGO
6   7.0  ONGF                 6   7.0  ONGF
7   NaN  LABO                 7   NaN  LABO
Note: you can also remove the for loops if needed and use different formats to store the data (lists, arrays...); the important thing is that the final result is still a dataframe.
I can offer two similar options that achieve the same result as your loop in a couple of lines:
1. Using apply and fillna() (fillna is faster than combine_first by a factor of two):
# Series.idxmin returns the index label of the minimum product
df3['Id'] = df3.apply(lambda row: (df2.Info_a*df2.Info_b).loc[row['index']].idxmin(), axis=1)
df1 = df1.set_index('Name').fillna(df3.set_index('Name')).reset_index()
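A closely related sketch that maps the computed Ids by Name instead of aligning whole frames (df3['Id'] as computed above):
lookup = df3.set_index('Name')['Id']
df1['Id'] = df1['Id'].fillna(df1['Name'].map(lookup))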
2. Using a function (lambda doesn't support assignment, so you have to apply a func):
def f(row):
    df1.loc[df1.Name == row['Name'], 'Id'] = (df2.Info_a*df2.Info_b).loc[row['index']].idxmin()
df3.apply(f, axis=1)
or a slight variant not relying on global definitions:
def f(row, df1, df2):
    df1.loc[df1.Name == row['Name'], 'Id'] = (df2.Info_a*df2.Info_b).loc[row['index']].idxmin()
df3.apply(f, args=(df1, df2), axis=1)
Note that your solution, even though much more verbose, takes the least time with this small dataset (7.5 ms versus 9.5 ms for both of mine). It makes sense that the speeds are similar, since in both cases it's a matter of looping over the rows of df3.
