I'm looking to replace the values in one DataFrame with values from a second DataFrame, by matching each value in the first DataFrame against the column labels of the second DataFrame.
Example:
import numpy as np
import pandas as pd
dt_index = pd.to_datetime(['2003-05-01', '2003-05-02', '2003-05-03', '2003-05-04'])
df = pd.DataFrame({'A':[1,1,3,12], 'B':[12,1,3,3], 'C':[3,12,12,1]}, index = dt_index)
df2 = pd.DataFrame({1:[1.4,4.2,1.3,5.6], 12:[2.3,7.3,9.5,0.4], 3:[8.8,0.1,8.7,2.4], 4:[9.6,9.8,5.5,1.8]}, index = dt_index)
df =
A B C
2003-05-01 1 12 3
2003-05-02 1 1 12
2003-05-03 3 3 12
2003-05-04 12 3 1
df2 =
1 12 3 4
2003-05-01 1.4 2.3 8.8 9.6
2003-05-02 4.2 7.3 0.1 9.8
2003-05-03 1.3 9.5 8.7 5.5
2003-05-04 5.6 0.4 2.4 1.8
Expected output:
expect = pd.DataFrame({'A':[1.4,4.2,8.7,0.4], 'B':[2.3,4.2,8.7,2.4], 'C':[8.8,7.3,9.5,5.6]}, index = dt_index)
expect =
A B C
2003-05-01 1.4 2.3 8.8
2003-05-02 4.2 4.2 7.3
2003-05-03 8.7 8.7 9.5
2003-05-04 0.4 2.4 5.6
Attempt:
X = df.copy()
for i in np.unique(df):
    X.mask(df == i, df2[i], axis=0, inplace=True)
My attempt seems to work, but I'm not sure whether it has any pitfalls, or how it would scale as the sizes of the DataFrames increase.
Are there better or faster solutions?
EDIT:
After cottontail's helpful answer, I realised I'd made an oversimplification in my example: the values in df and the columns of df and df2 cannot be assumed to be sequential.
I've now modified the example to reflect that.
One approach is to use stack() to reshape df2 into a Series, reindex() it using the values in df, then reshape it back into the original shape using unstack().
# look up df2[(date, value)] for every (date, value) pair produced by df
tmp = df2.stack().reindex(list(df.stack().droplevel(-1).items()))
# restore df's column labels so the Series can be unstacked back into df's shape
tmp.index = pd.MultiIndex.from_arrays([tmp.index.get_level_values(0), df.columns.tolist() * len(df)])
df = tmp.unstack()
Another approach is to iteratively create a dummy indicator DataFrame shaped like df2, multiply it by df2, reduce it to a Series (using sum()), and assign that to an empty DataFrame shaped like df.
X = pd.DataFrame().reindex_like(df)
df['dummy'] = 1
for c in X:
    X[c] = (
        df.groupby([df.index, c])['dummy'].size()    # count of each value of column c per date
        .unstack(fill_value=0)                       # one-hot indicator, columns = values seen in c
        .reindex(df2.columns, axis=1, fill_value=0)  # align the indicator columns with df2
        .mul(df2)                                    # keep only the matching df2 value per row
        .sum(1)                                      # collapse back to one value per date
    )
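For larger frames, a NumPy fancy-indexing lookup may also be worth benchmarking. This is only a sketch of mine, not part of the original answer; it assumes the original df and df2 from the question, with every value of df appearing among df2's columns:
col_pos = {c: i for i, c in enumerate(df2.columns)}   # df2 column label -> array position
pos = df.apply(lambda s: s.map(col_pos)).to_numpy()   # same shape as df, positions into df2
rows = np.arange(len(df))[:, None]                    # row indices, broadcast across columns
X = pd.DataFrame(df2.to_numpy()[rows, pos], index=df.index, columns=df.columns)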
I would like to create a rank table based on a multi-column pandas dataframe, with several numerical columns.
Let's use the following df as an example:
Name  Sales  Volume  Reviews
A      1000     100      100
B      2000     200       50
C      5400     500       10
I would like to create a new table, ranked_df, that ranks the values in each column in descending order while maintaining essentially the same format:
Name  Sales_rank  Volume_rank  Reviews_rank
A              3            3             1
B              2            2             2
C              1            1             3
Now, I can iteratively do this by looping through the columns, i.e.
df = pd.DataFrame({
    "Name": ['A', 'B', 'C'],
    "Sales": [1000, 2000, 5400],
    "Volume": [100, 200, 500],
    "Reviews": [100, 50, 10]
})
# make a copy of the original df
ranked_df = df.copy()
# define the columns of interest
interest_cols = ['Sales', 'Volume', 'Reviews']
for col in interest_cols:
    ranked_df[f"{col}_rank"] = df[col].rank(ascending=False)
# drop the cols not needed
...
But my question is this: is there a more elegant or Pythonic way of doing this? Maybe an apply over the DataFrame? Or some vectorized operation by throwing it to NumPy?
Thank you.
df.set_index('Name').rank().reset_index()
Name Sales Volume Reviews
0 A 1.0 1.0 3.0
1 B 2.0 2.0 2.0
2 C 3.0 3.0 1.0
You could use transform/apply to hit each column
df.set_index('Name').transform(pd.Series.rank, ascending = False)
Sales Volume Reviews
Name
A 3.0 3.0 1.0
B 2.0 2.0 2.0
C 1.0 1.0 3.0
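If you want to push it all the way to NumPy, as the question hints, a double argsort produces ranks directly. This is a sketch of mine rather than part of the answers above, and it breaks ties by position instead of averaging them the way rank() does:
import numpy as np
num = df.set_index('Name')
ranks = (-num.to_numpy()).argsort(axis=0).argsort(axis=0) + 1   # descending, 1-based
ranked_df = (pd.DataFrame(ranks, index=num.index, columns=num.columns)
             .add_suffix('_rank')
             .reset_index())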
I am using Python, and would like to get a calculated number (price * ratio) from two data frames for each group:
Table 1: df1
Group  Category  price_1  price_2  price_3  price_4
a      Single       20.1     19.8     19.7     19.9
a      Multi        25.1     26.8     24.7     24.9
b      Multi        27.1     27.8     27.7     26.9
Table 2: df2
Group  Category  ratio_1  ratio_2  ratio_3  ratio_4
a      Single        1.0      0.8      0.7      0.5
a      Multi         1.0      0.7      0.6      0.4
b      Multi         1.0      0.7      0.5      0.3
Desired Output: df
Group  Category  value
a      Single    59.68
a      Multi     68.64
b      Multi     68.48
For example, for Group = 'b' and Category = 'Multi': value = 27.1 * 1.0 + 27.8 * 0.7 + 27.7 * 0.5 + 26.9 * 0.3 = 68.48.
How may I get that? Thanks!
We can use set_index + str.rsplit to create a MultiIndex on df1 and df2 (on both the columns and the index), then let pandas' math operations compute the value column:
# Create MultiIndex on df1 and df2
idx_cols = ['Group', 'Category']
df1 = df1.set_index(idx_cols)
df1.columns = df1.columns.str.rsplit('_', n=1, expand=True)
df2 = df2.set_index(idx_cols)
df2.columns = df2.columns.str.rsplit('_', n=1, expand=True)
# Compute DF3
df3 = df1['price'].mul(df2['ratio']).sum(axis=1).reset_index(name='value')
df3:
Group Category value
0 a Single 59.68
1 a Multi 68.64
2 b Multi 68.48
df1 becomes:
price
1 2 3 4
Group Category
a Single 20.1 19.8 19.7 19.9
Multi 25.1 26.8 24.7 24.9
b Multi 27.1 27.8 27.7 26.9
And df2 becomes:
ratio
1 2 3 4
Group Category
a Single 1.0 0.8 0.7 0.5
Multi 1.0 0.7 0.6 0.4
b Multi 1.0 0.7 0.5 0.3
pandas will correctly align columns and index to perform appropriate multiplication.
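A quick sketch of mine to see that claim at work: shuffling df2's rows before the multiplication leaves the result unchanged, because pandas aligns on the (Group, Category) index rather than on row position.
shuffled = df2.sample(frac=1, random_state=0)   # same rows, different order
check = df1['price'].mul(shuffled['ratio']).sum(axis=1).reset_index(name='value')
# check matches df3 above, up to row order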
IF AND ONLY IF the DataFrames are already aligned row-for-row, we can simply do the operations using the Group and Category columns from one of the DataFrames, use to_numpy() to multiply the two frames together while ignoring the column labels, and np.sum to compute the totals:
df3 = df1[['Group', 'Category']].copy()
df3['value'] = np.sum(
    df1.filter(like='price_').to_numpy() * df2.filter(like='ratio_').to_numpy(),
    axis=1
)
df3:
Group Category value
0 a Single 59.68
1 a Multi 68.64
2 b Multi 68.48
This method is much faster and uses less memory, but it requires df1 and df2 to already be aligned correctly (as they are in the OP) and is much less robust in handling errors than the former. When that condition is met, however, it is optimal.
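A defensive sketch of mine, if you want the speed without silently trusting the alignment: verify the key columns line up row-for-row before doing the positional math.
keys = ['Group', 'Category']
assert df1[keys].reset_index(drop=True).equals(df2[keys].reset_index(drop=True)), \
    'df1 and df2 rows are not aligned'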
Code:
df = pd.DataFrame.from_dict({
    'Group': list(df1['Group']),
    'Category': list(df1['Category']),
    'value': df1['price_1'] * df2['ratio_1']
             + df1['price_2'] * df2['ratio_2']
             + df1['price_3'] * df2['ratio_3']
             + df1['price_4'] * df2['ratio_4']
})
Output:
Group Category value
0 a Single 59.68
1 a Multi 68.64
2 b Multi 68.48
Hello, I would like to replace my "Timestamp" column with an outer merge of "Timestamp" and "Timestamp+0.4". Moreover, I would like the values in the "input" column to still correspond to this new merged "Timestamp" column, with NaNs where the value is not defined (for example, 0.6 = NaN in the "input" column).
My expected output is something like this:
Do you have any idea how to achieve this?
Here is the code to create the dataframe:
df = pd.DataFrame({'Timestamp': [0.2, 0.4, 0.8, 1.2, 1.4, 1.6, 2.0, 2.4],
                   'input': [10, 20, 40, 5, 15, 25, 0, 20]})
df["Timestamp+0.4"] = df["Timestamp"]+0.4
Thanks a lot!
You may use concat to concatenate the dataframes along a particular axis. After that, drop the duplicates using only the Timestamp column as the subset, then finally sort the values again by the Timestamp column.
import pandas as pd
df = pd.DataFrame({'Timestamp': [0.2, 0.4, 0.8, 1.2, 1.4, 1.6, 2.0, 2.4],
                   'input': [10, 20, 40, 5, 15, 25, 0, 20]})
df1 = pd.DataFrame(df["Timestamp"] + 0.4)
df = pd.concat([df, df1])
# round away float noise (e.g. 0.2 + 0.4 = 0.6000000000000001) so duplicates match
df["Timestamp"] = round(df["Timestamp"], 8)
df = df.drop_duplicates(subset=["Timestamp"], keep="first")
df = df.sort_values(["Timestamp"], ignore_index=True)
print(df)
Output from df
Timestamp input
0 0.2 10.0
1 0.4 20.0
2 0.6 NaN
3 0.8 40.0
4 1.2 5.0
5 1.4 15.0
6 1.6 25.0
7 1.8 NaN
8 2.0 0.0
9 2.4 20.0
10 2.8 NaN
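Starting again from the original df, here is an alternative sketch of mine using reindex on the union of the two timestamp columns; the round(8) again guards against float noise such as 0.2 + 0.4 = 0.6000000000000001:
import numpy as np
union = np.union1d(df['Timestamp'].round(8), (df['Timestamp'] + 0.4).round(8))
out = (df.round({'Timestamp': 8})
       .set_index('Timestamp')['input']
       .reindex(union)                 # timestamps with no input become NaN
       .rename_axis('Timestamp')
       .reset_index())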
I have the following dataframe:
Quantity_Limit Cost Wholesaler_Code
2 9.2 1
2 9.4 1
2 7.1 2
4 10.2 1
4 4.1 2
4 2.1 3
And I would like to create the following dataframe, keeping only the wholesaler that offers the minimum Cost for each Quantity_Limit, without using a for loop:
Quantity_Limit Cost Wholesaler_Code
2 7.1 2
4 2.1 3
I tried with:
df.groupby(["Quantity_Limit", "Wholesaler_Code"], as_index = False).agg({"Cost": "min"})
but I don't get the desired result.
Just sort by Quantity_Limit and Cost, then drop_duplicates:
df.sort_values(['Quantity_Limit', 'Cost']).drop_duplicates(subset=['Quantity_Limit'])
Out[1121]:
Quantity_Limit Cost Wholesaler_Code
2 2 7.1 2
5 4 2.1 3
You can use transform to create a column with the minimum values and filter based on those.
df["min_cost"] = df.groupby(["Quantity_Limit"])["Cost"].min()
df[df["Cost"] == df["min_cost"]]
You can also groupby and then join the result back to the original df to recover the leftover column:
df2 = df.groupby(['Quantity_Limit'])['Cost'].min().reset_index()
df2 = pd.merge(df2, df, on = ['Quantity_Limit', 'Cost'], how = 'left')
Output:
Quantity_Limit Cost Wholesaler_Code
0 2 7.1 2
1 4 2.1 3
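Another option worth knowing, a sketch of mine rather than one of the answers above: groupby + idxmin returns the row label of the minimum Cost within each group, so a single loc call selects exactly one row per Quantity_Limit (the first one on ties).
df.loc[df.groupby('Quantity_Limit')['Cost'].idxmin()]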
import pandas as pd
# Raw data
data = [[2, 9.2, 1], [2, 9.4, 1], [2, 7.1, 2], [4, 10.2, 1], [4, 4.1, 2], [4, 2.1, 3]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Quantity_Limit', 'Cost', 'Wholesaler_Code'])
# Group by Quantity_Limit and broadcast the minimum Cost back to every row
df["min_cost"] = df.groupby("Quantity_Limit")["Cost"].transform("min")
Now filter the rows where Cost equals the group minimum:
df1 = df[df["Cost"] == df["min_cost"]]
And you will get your desired output.
I have two data frames, df1 and df2, each with the same number of columns and the same column names, but different numbers of rows. Basically, there are many columns in df2 which have all 0 values.
What I would like to accomplish is for every all-zero column in df2 to be replaced with the mean (average) of the same-named column in df1.
So, if df1 has a structure like:-
Column1 Column2 ------ Column n
0.4 2.3 1.7
0.7 2.5 1.4
0.1 2.1 1.2
and df2 has a structure like:-
Column1 Column2 ------ Column n
0 2.3 1.7
0 2.5 1.4
0 2.1 1.2
I would like to replace column1 (and any other all-zero columns in df2) with the mean of the same column mapped in df1.
So, finally, df2 would look like:-
Column1 Column2 ------ Column n
0.4 2.3 1.7
0.4 2.5 1.4
0.4 2.1 1.2
(All zero values in column 1 of df2 are replaced with the mean of column 1 in df1.)
I am fairly new to this and have checked other options such as fillna() and replace(), but am unable to accomplish exactly what I want. Any help in this regard is highly appreciated.
Use DataFrame.mask with mean:
df = df2.mask(df2 == 0, df1.mean(), axis=1)
print (df)
Column1 Column2 Column n
0 0.4 2.3 1.7
1 0.4 2.5 1.4
2 0.4 2.1 1.2
A numpy alternative with numpy.where, which should work faster on large DataFrames:
df = pd.DataFrame(np.where(df2 == 0, df1.mean(), df2),
                  index=df2.index,
                  columns=df2.columns)
print (df)
Column1 Column2 Column n
0 0.4 2.3 1.7
1 0.4 2.5 1.4
2 0.4 2.1 1.2