The "Day" column won't rename. Could it have something to do with the "Day" column being an index and not a column? Here's my code and a sample of the unprocessed data:
import pandas as pd
import numpy as np
# Load the excel file into a dataframe
df = pd.read_excel("Marginal CPA data - NOV.xlsx")
# Delete the bottom row
df = df[:-1]
# Filter the column labeled "Campaign Type (Search ACQ) - ONC" to keep only rows with value "NonBrand"
df = df[df["Campaign Type (Search ACQ) - ONC"] == "NonBrand"]
# Make a pivot table
pivot_table = pd.pivot_table(df, values=["Media Cost", "CAFE Approvals"],
index=["Campaign Type (Search ACQ) - ONC", "Product (ACQ Search) - ONC", "Day"],
columns=["CDJ"], aggfunc="sum")
df_pivot = pivot_table.fillna(value=0)
# Reset the column index to a single level
df_pivot.columns = ["_".join(col) for col in df_pivot.columns]
# Rename columns
df_pivot = df_pivot.rename(columns={"Media Cost_CPA": "CPA Spend", "Media Cost_Non CPA (CDJ)": "CDJ Spend",
"CAFE Approvals_CPA": "CPA Approvals", "CAFE Approvals_Non CPA (CDJ)": "CDJ Approvals",
"Day": "Date"})
Campaign Type (Search ACQ) - ONC  Product (ACQ Search) - ONC  CDJ            Day          Media Cost   CAFE Approvals
NonBrand                          Consumer                    CPA            11 Jan 2023  29019.77415  94
NonBrand                          Consumer                    Non CPA (CDJ)  17 Jan 2023  24640.36448  86
NonBrand                          Consumer                    Non CPA (CDJ)  12 Jan 2023  23627.78256  78
NonBrand                          Student                     CPA            17 Jan 2023  29863.95447  152
NonBrand                          Miles                       CPA            23 Jan 2023  380.94       1
NonBrand                          Miles                       CPA            07 Jan 2023  1786.51      5
NonBrand                          Consumer                    CPA            19 Jan 2023  26745.81705  64
NonBrand                          Secured                     CPA            20 Jan 2023  1551.35      19
NonBrand                          Consumer                    Non CPA (CDJ)  02 Feb 2023  41185.11225  66
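rename(columns=...) only changes column labels, and after the pivot "Day" is a level of the row MultiIndex rather than a column, so that entry is silently ignored. rename_axis renames index levels and should work better here: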
cols = {
"Media Cost_CPA": "CPA Spend",
"Media Cost_Non CPA (CDJ)": "CDJ Spend",
"CAFE Approvals_CPA": "CPA Approvals",
"CAFE Approvals_Non CPA (CDJ)": "CDJ Approvals"
}
indx = {
"Day": "Date"
}
df_pivot = df_pivot.rename(columns=cols).rename_axis(index=indx)
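To see the difference in isolation, here is a minimal sketch on made-up data. The frame `raw`, its shortened column name "Product" and all the numbers are assumptions for illustration, not the real spreadsheet.

```python
import pandas as pd

# Toy data standing in for the spreadsheet (values are invented).
raw = pd.DataFrame({
    "Product": ["Consumer", "Consumer", "Student"],
    "Day": ["11 Jan 2023", "11 Jan 2023", "17 Jan 2023"],
    "CDJ": ["CPA", "Non CPA (CDJ)", "CPA"],
    "Media Cost": [100.0, 50.0, 80.0],
})

pivot = pd.pivot_table(raw, values="Media Cost", index=["Product", "Day"],
                       columns="CDJ", aggfunc="sum").fillna(0)

# "Day" is an index level name, so rename(columns=...) finds nothing to match
# and leaves the frame unchanged without raising an error.
unchanged = pivot.rename(columns={"Day": "Date"})
print(list(unchanged.index.names))  # ['Product', 'Day']

# rename_axis operates on the index level names instead.
renamed = pivot.rename_axis(index={"Day": "Date"})
print(list(renamed.index.names))    # ['Product', 'Date']
```

If "Date" is needed as an ordinary column afterwards, calling reset_index() on the result turns every index level back into a column.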
I'm trying to subset my data by the value of a column using the typical pandas protocol:
df[df[column_name] == "value"]
But I keep getting a KeyError for "Product (ACQ Search) - ONC". Checking the column names with df_pivot.columns also shows only the 4 columns I renamed at a different point in the script. Why do I keep getting a KeyError?
Here's my code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Load the excel file into a dataframe
df = pd.read_excel("Marginal CPA data - NOV.xlsx")
# Delete the bottom row
df = df[:-1]
# Filter the column labeled "Campaign Type (Search ACQ) - ONC" to keep only rows with value "NonBrand"
df = df[df["Campaign Type (Search ACQ) - ONC"] == "NonBrand"]
df["Date"] = pd.to_datetime(df["Day"], format="%d %b %Y")
df = df.drop("Day", axis=1)
# Make a pivot table
pivot_table = pd.pivot_table(df, values=["Media Cost", "CAFE Approvals"],
index=["Campaign Type (Search ACQ) - ONC", "Product (ACQ Search) - ONC", "Date"],
columns=["CDJ"], aggfunc="sum")
df_pivot = pivot_table.fillna(value=0)
# Reset the column index to a single level
df_pivot.columns = ["_".join(col) for col in df_pivot.columns]
cols = {
"Media Cost_CPA": "CPA Spend",
"Media Cost_Non CPA (CDJ)": "CDJ Spend",
"CAFE Approvals_CPA": "CPA Approvals",
"CAFE Approvals_Non CPA (CDJ)": "CDJ Approvals"
}
df_pivot = df_pivot.rename(columns=cols)
# Add two new columns for Total Spend and Total Approvals
df_pivot["Total Approvals"] = df_pivot["CPA Approvals"] + df_pivot["CDJ Approvals"]
df_pivot["Total Spend"] = df_pivot["CPA Spend"] + df_pivot["CDJ Spend"]
#Remove data for days where spend is zero
df_pivot = df_pivot[df_pivot["CPA Spend"] != 0]
df_pivot = df_pivot[df_pivot["Total Approvals"] != 0]
# Sort by Product, then by Date within each product
df_pivot = df_pivot.sort_values(["Product (ACQ Search) - ONC", "Date"], ascending=True)
df_pivot.to_excel("Marginal CPA data - NOV (processed).xlsx")
# filter the data to only include rows where "Product (ACQ Search) - ONC" is "Consumer"
consumer_data = df_pivot[df_pivot["Product (ACQ Search) - ONC"] == "Consumer"]
Data:
Campaign Type (Search ACQ) - ONC  Product (ACQ Search) - ONC  CDJ            Day          Media Cost   CAFE Approvals
NonBrand                          Consumer                    CPA            11 Jan 2023  29019.77415  94
NonBrand                          Consumer                    Non CPA (CDJ)  17 Jan 2023  24640.36448  86
NonBrand                          Consumer                    Non CPA (CDJ)  12 Jan 2023  23627.78256  78
NonBrand                          Student                     CPA            17 Jan 2023  29863.95447  152
NonBrand                          Miles                       CPA            23 Jan 2023  380.94       1
NonBrand                          Miles                       CPA            07 Jan 2023  1786.51      5
NonBrand                          Consumer                    CPA            19 Jan 2023  26745.81705  64
NonBrand                          Secured                     CPA            20 Jan 2023  1551.35      19
NonBrand                          Consumer                    Non CPA (CDJ)  02 Feb 2023  41185.11225  66
NonBrand                          Student                     CPA            08 Jan 2023  42822.8508   171
NonBrand                          Student                     CPA            16 Jan 2023  29408.66012  160
NonBrand                          Consumer                    CPA            17 Jan 2023  29378.05227  85
NonBrand                          Miles                       CPA            10 Jan 2023  2019.25      4
NonBrand                          Miles                       CPA            11 Jan 2023  1604.98      4
NonBrand                          Secured                     CPA            21 Jan 2023  1704.13419   22
The problem is you are trying to slice your data using a column that is actually an index.
You can slice your MultiIndex data frame by using pd.IndexSlice and passing "Consumer" to the second level:
idx = pd.IndexSlice
df_pivot.loc[idx[:, "Consumer", :]]
Which returns the following:
                                                CPA Approvals  CDJ Approvals  CPA Spend  CDJ Spend  Total Approvals  Total Spend
('NonBrand', Timestamp('2023-01-11 00:00:00'))  94             0              29019.8    0          94               29019.8
('NonBrand', Timestamp('2023-01-17 00:00:00'))  85             86             29378.1    24640.4    171              54018.4
('NonBrand', Timestamp('2023-01-19 00:00:00'))  64             0              26745.8    0          64               26745.8
See more about advanced indexing here: https://pandas.pydata.org/docs/user_guide/advanced.html#
You can also reset the index and then subset the data in a manner similar to your last line:
df = df_pivot.reset_index()
df.loc[df["Product (ACQ Search) - ONC"] == "Consumer"]
Which returns the following:
   Campaign Type (Search ACQ) - ONC  Product (ACQ Search) - ONC  Date                 CPA Approvals  CDJ Approvals  CPA Spend  CDJ Spend  Total Approvals  Total Spend
0  NonBrand                          Consumer                    2023-01-11 00:00:00  94             0              29019.8    0          94               29019.8
1  NonBrand                          Consumer                    2023-01-17 00:00:00  85             86             29378.1    24640.4    171              54018.4
2  NonBrand                          Consumer                    2023-01-19 00:00:00  64             0              26745.8    0          64               26745.8
The first method drops the selected "Product (ACQ Search) - ONC" level from the index, while the second method keeps every level as a regular column, so no information is lost.
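Two other ways to filter on an index level without resetting the index are sketched below. This is only an illustration on a toy frame: `toy`, the shortened level names and the values are assumptions, not the poster's data.

```python
import pandas as pd

# Toy three-level index mirroring the shape of df_pivot (names shortened).
index = pd.MultiIndex.from_tuples(
    [("NonBrand", "Consumer", "2023-01-11"),
     ("NonBrand", "Student", "2023-01-17"),
     ("NonBrand", "Consumer", "2023-01-19")],
    names=["Campaign", "Product", "Date"],
)
toy = pd.DataFrame({"CPA Spend": [10.0, 20.0, 30.0]}, index=index)

# Boolean mask built from one index level; the index itself is left untouched.
mask = toy.index.get_level_values("Product") == "Consumer"
kept_levels = toy[mask]

# xs selects by level name; drop_level=False keeps the selected level.
same_rows = toy.xs("Consumer", level="Product", drop_level=False)

# With the default drop_level=True the "Product" level is removed.
dropped = toy.xs("Consumer", level="Product")

print(list(kept_levels.index.names))  # ['Campaign', 'Product', 'Date']
print(list(dropped.index.names))      # ['Campaign', 'Date']
```

The mask version reads most like the usual df[df[col] == value] idiom, which may make it the easiest drop-in replacement for the failing line.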
I am working on extraction of raw data from various sources. After a process, I could form a dataframe that looked like this.
data
0 ₹ 16,50,000\n2014 - 49,000 km\nJaguar XF 2.2\nJAN 16
1 ₹ 23,60,000\n2017 - 28,000 km\nMercedes-Benz CLA 200 CDI Style, 2017, Diesel\nNOV 26
2 ₹ 26,00,000\n2016 - 44,000 km\nMercedes Benz C-Class Progressive C 220d, 2016, Diesel\nJAN 03
I want to split this raw dataframe into relevant columns, in the order in which they occur in the raw data: Price, Year, Mileage, Name, Date
I have tried df.data.split('-', expand=True) with other delimiter options applied sequentially, along with some lambda functions, but haven't had much success.
I need help splitting this data into the relevant columns.
Expected output:
price year mileage name date
16,50,000 2014 49000 Jaguar 2.2 XF Luxury Jan-17
23,60,000 2017 28000 CLA CDI Style Nov-26
26,00,000 2016 44000 Mercedes C-Class C220d Jan-03
Split on '\n' first, then split the combined year/mileage part on '-':
df[["Price", "Year-Mileage", "Name", "Date"]] = df["data"].str.split('\n', expand=True)
df[["Year", "Mileage"]] = df["Year-Mileage"].str.split('-', expand=True)
df.drop(columns=["data", "Year-Mileage"], inplace=True)
print(df)
Price Name Date Year Mileage
0 ₹ 16,50,000 Jaguar XF 2.2 JAN 16 2014 49,000 km
2 ₹ 26,00,000 Mercedes Benz C-Class Progressive C 220d, 2016, Diesel JAN 03 2016 44,000 km
1 ₹ 23,60,000 Mercedes-Benz CLA 200 CDI Style, 2017, Diesel NOV 26 2017 28,000 km
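If the fields should also come out cleaned (no currency sign, numeric year and mileage), a single regular expression with named groups is one alternative. The sketch below assumes every row follows the four-line layout of the sample and that the separators are real newline characters; the pattern is an illustration, not something tested against the full dataset.

```python
import pandas as pd

df = pd.DataFrame({"data": [
    "₹ 16,50,000\n2014 - 49,000 km\nJaguar XF 2.2\nJAN 16",
    "₹ 23,60,000\n2017 - 28,000 km\nMercedes-Benz CLA 200 CDI Style, 2017, Diesel\nNOV 26",
]})

# One named group per target column; "." does not match "\n" by default,
# so each group stays on its own line of the raw string.
pattern = (
    r"₹\s*(?P<price>[\d,]+)\n"
    r"(?P<year>\d{4})\s*-\s*(?P<mileage>[\d,]+)\s*km\n"
    r"(?P<name>.+)\n"
    r"(?P<date>.+)"
)
parsed = df["data"].str.extract(pattern)

# Convert the numeric fields; the price keeps its original digit grouping.
parsed["year"] = parsed["year"].astype(int)
parsed["mileage"] = parsed["mileage"].str.replace(",", "", regex=False).astype(int)
print(parsed)
```

Rows that do not match the pattern come back as NaN in every column, which makes malformed records easy to spot before the integer conversion.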
I have two dataframes that I want to merge.
df1: the sales dataframe. It contains only products that were sold; a product that was not sold does not appear. It covers all weeks, 1 to 53, for 2019, 2020 and 2021.
Year Week Store Article Sales Volume
2019 11 SF sku1 500
2021 16 NY sku2 20
2020 53 PA sku1 500
2021 01 NY sku3 200
2019 11 SF sku1 455
2021 16 NY sku2 20
df2: the stock dataframe. It contains the entire product range, so a product appears even if it was not sold, but it only holds the stock at week 16 of each year (2019, 2020, 2021).
Year Week Store Article Stock Volume
2019 16 SF sku1 500
2021 16 NY sku2 20
2020 16 PA sku4 500
2021 16 NY sku5 200
2019 16 SF sku65 455
2021 16 NY sku2000 20
...
I have tried to merge both dataframes like this (I wanted to get all articles, but the drawback is that I lose the other weeks):
merged = pd.merge(df1,df2, how = "right", right_on=["Article ID", "Year ID", "Week ID", "Store ID"], left_on=["Article", "Year", "Week", "Store"])
But I only get the sales value associated to week 16 stock and I lose all the other weeks.
So I tried a left join
merged = pd.merge(df1,df2, how = "left", right_on=["Article ID", "Year ID", "Week ID", "Store ID"], left_on=["Article", "Year", "Week", "Store"])
Now I have all the weeks, but I am missing the stock of some products.
I need to keep ALL PRODUCTS of df2 while also keeping every sales week of df1.
Is there a way to merge both dataframes while keeping the entire stock depth and all the sales weeks?
Thanks for your help !!
You can try this:
merged = pd.merge(df1, df2, on='Year')
Source: how to merge two data frames based on particular column in pandas python?
You need a full outer join in order to not lose any Sales from df1 or Product from df2:
merged = pd.merge(df1,df2, how = "outer", right_on=["Article ID", "Year ID", "Week ID", "Store ID"], left_on=["Article", "Year", "Week", "Store"])
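A small sketch of what the outer join keeps, using made-up rows. The frames `sales` and `stock` and their shared key names are assumptions; with differently named keys, as in the question, left_on and right_on would replace on.

```python
import pandas as pd

# Invented rows: sku3 was sold in week 1, sku2 appears in both frames,
# and sku5 is in stock but was never sold.
sales = pd.DataFrame({
    "Year": [2021, 2021], "Week": [1, 16], "Store": ["NY", "NY"],
    "Article": ["sku3", "sku2"], "Sales Volume": [200, 20],
})
stock = pd.DataFrame({
    "Year": [2021, 2021], "Week": [16, 16], "Store": ["NY", "NY"],
    "Article": ["sku2", "sku5"], "Stock Volume": [20, 200],
})

keys = ["Year", "Week", "Store", "Article"]

# how="outer" keeps unmatched rows from both sides; indicator=True adds a
# "_merge" column saying where each row came from.
merged = pd.merge(sales, stock, how="outer", on=keys, indicator=True)
print(merged)
```

Rows marked left_only have sales but no stock record, and rows marked right_only have stock but no sales, so their missing volume shows up as NaN and can be filled with fillna(0) if that suits the analysis.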
I have three tables (dataframes), all with the same column names. Basically, each one holds the data for a different month.
October (df1 name)
Sales_value Sales_units Unique_Customer_id Countries Month
1000 10 4 1 Oct
20 2 4 3 Oct
November (df2 name)
Sales_value Sales_units Unique_Customer_id Countries Month
2000 1000 40 14 Nov
112 200 30 10 Nov
December (df3 name)
Sales_value Sales_units Unique_Customer_id Countries Month
20009090 4809509 4500 30 Dec
etc. This is dummy data; in reality each table has thousands of rows. How can I combine these 3 tables so that the columns appear only once and all rows are kept, with the October rows first, followed by the November rows and then the December rows? When I use joins, the column names are repeated.
Expected output:
Sales_value Sales_units Unique_Customer_id Countries Month
1000 10 4 1 Oct
20 2 4 3 Oct
2000 1000 40 14 Nov
112 200 30 10 Nov
20009090 4809509 4500 30 Dec
pd.concat stacks the rows of the dataframes in the order given, aligning them on their common column names:
pd.concat([df1, df2, df3])
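Here is a runnable sketch using the dummy rows from the question. Passing ignore_index=True is an optional addition, not part of the answer above; it renumbers the combined rows instead of repeating each frame's own 0, 1, ... labels.

```python
import pandas as pd

cols = ["Sales_value", "Sales_units", "Unique_Customer_id", "Countries", "Month"]
df1 = pd.DataFrame([[1000, 10, 4, 1, "Oct"], [20, 2, 4, 3, "Oct"]], columns=cols)
df2 = pd.DataFrame([[2000, 1000, 40, 14, "Nov"], [112, 200, 30, 10, "Nov"]], columns=cols)
df3 = pd.DataFrame([[20009090, 4809509, 4500, 30, "Dec"]], columns=cols)

# Rows are stacked in list order: October, then November, then December.
combined = pd.concat([df1, df2, df3], ignore_index=True)
print(combined)
```

Because all three frames share the same column names, the result has each column exactly once.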
df["% Sales"] = df["Jan"]/df["Q1"]
q1_sales = df.groupby(["City"])[["Jan", "Feb", "Mar", "Q1"]].sum()
q1_sales.head()
Jan Feb Mar Q1
City
Los Angeles 44 40 54 138
I want code that gives each month's share of the quarter's sales, i.e. each month divided by the total sales of the quarter, so that it looks like this:
Jan Feb Mar
City
Los Angeles 31.9% 29% 39.1%
Try div:
q1_sales[['Jan','Feb','Mar']].div(q1_sales['Q1']*0.01, axis='rows')
Output:
Jan Feb Mar
City
Los Angeles 31.884058 28.985507 39.130435
Use:
new_df=q1_sales[q1_sales.columns.difference(['Q1'])]
new_df=(new_df.T/new_df.sum(axis=1)*100).T
print(new_df)
Feb Jan Mar
Los Angeles 28.985507 31.884058 39.130435
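Both answers return plain floats. If the percent signs from the expected output are wanted, the shares can be formatted as strings afterwards. The sketch below rebuilds q1_sales from the single row shown in the question; the one-decimal format is an assumption, so 40/138 is rendered as "29.0%" rather than "29%".

```python
import pandas as pd

# q1_sales rebuilt from the row shown in the question.
q1_sales = pd.DataFrame(
    {"Jan": [44], "Feb": [40], "Mar": [54], "Q1": [138]},
    index=pd.Index(["Los Angeles"], name="City"),
)

months = ["Jan", "Feb", "Mar"]

# Divide each month by the quarter total, row by row (fractions, not percent).
share = q1_sales[months].div(q1_sales["Q1"], axis=0)

# Format as strings for display only; keep `share` for further arithmetic.
formatted = share.apply(lambda col: col.map("{:.1%}".format))
print(formatted)
```

Keeping the numeric frame and the formatted frame separate avoids doing arithmetic on strings later.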