I have a pandas time-series DataFrame df with columns date, week, week_start_date, country, campaign_name and active. For some dates we have information on multiple campaigns.
For example:
data = [["2023.01.02", 1, "2023.01.01", "BR", "SALE-1", 1],
["2023.01.02", 1, "2023.01.01", "BR", "SALE-2", 1],
["2023.01.02", 1, "2023.01.01", "NL", "SALE-1", 1],
["2023.01.02", 1, "2023.01.01", "DE", "SALE-1", 1]]
df = pd.DataFrame(data, columns=["date", "week", "week_start_date", "country", "campaign_name", "active"])
date week week_start_date country campaign_name active
2023.01.02 1 2023.01.01 BR SALE-1 1
2023.01.02 1 2023.01.01 BR SALE-2 1
2023.01.02 1 2023.01.01 NL SALE-1 1
2023.01.02 1 2023.01.01 DE SALE-1 1
I don't mind having a separate time series per date/country combination, but when the same country has two (or more) campaigns on a date, I would like to pivot them into extra columns:
date week week_start_date country campaign_name active campaign_name_n active_n total_active
2023.01.02 1 2023.01.01 BR SALE-1 1 SALE-2 1 2
2023.01.02 1 2023.01.01 NL SALE-1 1 NaN NaN 1
2023.01.02 1 2023.01.01 DE SALE-1 1 NaN NaN 1
Here campaign_name_n and active_n can go up to any number n, depending on how many campaigns we find while running the loop.
I am trying to use:
import pandas as pd

# Load your data into a pandas DataFrame
df = pd.read_csv("data.csv")

# Group the data by date, week, week_start_date, country, and days_active
grouped = df.groupby(["date", "week", "week_start_date", "country", "days_active"])

# Create a dictionary to store the campaign names for each group
campaign_names = {}

# Iterate through the groups
for name, group in grouped:
    # Check if there are multiple entries for a particular date
    if len(group) > 1:
        # Create new columns for the campaign names
        for i, row in enumerate(group.itertuples()):
            campaign_name = "campaign_name_{}".format(i + 1)
            campaign_names[row.Index] = campaign_name
            df.at[row.Index, campaign_name] = row.campaign_name

# Add the campaign name columns to the DataFrame
for index, campaign_name in campaign_names.items():
    df.at[index, "campaign_name"] = campaign_name

# Drop the original campaign_name column
df = df.drop(columns=["campaign_name"])

# Save the grouped and modified data to a new file
df.to_csv("grouped_data.csv", index=False)
but I am getting all the campaigns pivoted, which is not intended. It would be great if someone could help here. Thank you!
Try:
# collect the campaign names as a list and the active values as a list plus their sum
x = df.groupby(["date", "week", "week_start_date", "country"]).agg(
    {"campaign_name": list, "active": [list, "sum"]}
)
# flatten the MultiIndex columns: ("active", "sum") -> "active_sum", the "_list" suffix is dropped
x.columns = [f"{a}_{b}".replace("_list", "") for a, b in x.columns]

# expand each list into numbered columns (campaign_name_1, active_1, campaign_name_2, ...)
tmp = pd.DataFrame(
    x[["campaign_name", "active"]]
    .apply(
        lambda x: {
            f"{a}_{i}": v for a, b in zip(x.index, x.values) for i, v in enumerate(b, 1)
        },
        axis=1,
    )
    .to_list()
)

x = pd.concat([x.reset_index(), tmp], axis=1).drop(columns=["campaign_name", "active"])
print(x)
Prints:
date week week_start_date country active_sum campaign_name_1 campaign_name_2 active_1 active_2
0 2023.01.02 1 2023.01.01 BR 2 SALE-1 SALE-2 1 1.0
1 2023.01.02 1 2023.01.01 DE 1 SALE-1 NaN 1 NaN
2 2023.01.02 1 2023.01.01 NL 1 SALE-1 NaN 1 NaN
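If you want the column names from your expected output, a small follow-up rename is enough (a sketch on top of the result above; total_active is the name used in the question):
# optional: align with the column names used in the question
x = x.rename(columns={"active_sum": "total_active"})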
Given the following python pandas DataFrame:
ID      country  money  other  money_add
832932  France   12131  19     82932
217#8#  NaN      NaN    NaN    NaN
1329T2  NaN      NaN    NaN    NaN
832932  France   NaN    30     NaN
31728#  NaN      NaN    NaN    NaN
I would like to make the following modifications to each row:
If the ID column contains a '#', the row is left unchanged.
If the ID column contains no '#' and country is NaN, "Other" is added to the country column and 0 is added to the other column.
Finally, only if the money column is NaN and the other column has a value, we assign money and money_add from the following lookup table (matching other against other_ID):
other_ID  money  money_add
19        4532   723823
50        1213   238232
18        1813   273283
30        1313   83293
0         8932   3920
Example of the resulting table:
ID      country  money  other  money_add
832932  France   12131  19     82932
217#8#  NaN      NaN    NaN    NaN
1329T2  Other    8932   0      3920
832932  France   1313   30     83293
31728#  NaN      NaN    NaN    NaN
First set values in both columns where both conditions match, then index the rows without '#' by their other value and update only the matched rows with DataFrame.update:
m1 = df['ID'].str.contains('#')  # rows whose ID contains '#'
m2 = df['country'].isna()        # rows with missing country

# condition 2: fill country/other for rows without '#' and missing country
df.loc[~m1 & m2, ['country','other']] = ['Other',0]

# df1 is the lookup table; align both frames on the other / other_ID key
df1 = df1.set_index(df1['other_ID'])
# rows with '#' get NaN as index so they are never matched
df = df.set_index(df['other'].mask(m1))

# condition 3: fill only missing money / money_add values from the lookup table
df.update(df1, overwrite=False)
df = df.reset_index(drop=True)
print (df)
ID country money other money_add
0 832932 France 12131 19.0 82932.0
1 217#8# NaN NaN NaN NaN
2 1329T2 Other 8932.0 0.0 3920.0
3 832932 France 1313.0 30.0 83293.0
4 31728# NaN NaN NaN NaN
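For reference, df1 above is the lookup table from the question; a minimal sketch of how it could be constructed (values copied from that table):
import pandas as pd

# lookup table used as df1 in the answer above
df1 = pd.DataFrame({
    'other_ID': [19, 50, 18, 30, 0],
    'money': [4532, 1213, 1813, 1313, 8932],
    'money_add': [723823, 238232, 273283, 83293, 3920],
})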
Let's say that I have this dataframe with four columns: "Name", "Value", "Ccy" and "Group":
import pandas as pd
Name = ['ID', 'Country', 'IBAN','Dan_Age', 'Dan_city', 'Dan_country', 'Dan_sex', 'Dan_Age', 'Dan_country','Dan_sex' , 'Dan_city','Dan_country' ]
Value = ['TAMARA_CO', 'GERMANY','FR56','18', 'Berlin', 'GER', 'M', '22', 'FRA', 'M', 'Madrid', 'ESP']
Ccy = ['','','','EUR','EUR','USD','USD','','CHF', '','DKN','']
Group = ['0','0','0','1','1','1','1','2','2','2','3','3']
df = pd.DataFrame({'Name':Name, 'Value' : Value, 'Ccy' : Ccy,'Group':Group})
print(df)
Name Value Ccy Group
0 ID TAMARA_CO 0
1 Country GERMANY 0
2 IBAN FR56 0
3 Dan_Age 18 EUR 1
4 Dan_city Berlin EUR 1
5 Dan_country GER USD 1
6 Dan_sex M USD 1
7 Dan_Age 22 2
8 Dan_country FRA CHF 2
9 Dan_sex M 2
10 Dan_city Madrid DKN 3
11 Dan_country ESP 3
I want to represent this data differently before saving it to a csv. I would like to group the duplicates in the column "Name" together with the associated values in "Value" and "Ccy", and store the data from "Value" and "Ccy" in the row (index) defined by the column "Group", so that the data are not mixed.
Then, if the name is in group 0, it is general data, so I would like all the rows for that "Name" to be filled with the same value.
So I would like to get this result :
ID_Value Country_Value IBAN_Value Dan_age Dan_age_Ccy Dan_city_Value Dan_city_Ccy Dan_sex_Value
1 TAMARA GER FR56 18 EUR Berlin EUR M
2 TAMARA GER FR56 22 M
3 TAMARA GER FR56 Madrid DKN
I cannot figure out how to do the first part. With the code below, I do not get what I want even if I remove the empty columns:
g = df.groupby(['Name']).cumcount()
df = df.set_index([g,'Name']).unstack().sort_index(level=1, axis=1)
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
Can anyone help me? Thank you!
You can use the following. See comments in code for each step:
import numpy as np

s = df.loc[df['Group'] == '0', 'Name'].tolist()  # this variable will be used later according to Condition 2
df['Name'] = pd.Categorical(df['Name'], categories=df['Name'].unique(), ordered=True)  # this preserves order before pivoting
df = df.pivot(index='Group', columns='Name')  # transforms long-to-wide per expected output
for col in df.columns:
    if col[1] in s:
        df[col] = df[col].shift().ffill()  # Condition 2
df = df.iloc[1:].replace('', np.nan).dropna(axis=1, how='all').fillna('')  # dataframe cleanup
df.columns = ['_'.join(col) for col in df.columns.swaplevel()]  # column name cleanup
df
Out[1]:
ID_Value Country_Value IBAN_Value Dan_Age_Value Dan_city_Value \
Group
1 TAMARA_CO GERMANY FR56 18 Berlin
2 TAMARA_CO GERMANY FR56 22
3 TAMARA_CO GERMANY FR56 Madrid
Dan_country_Value Dan_sex_Value Dan_Age_Ccy Dan_city_Ccy \
Group
1 GER M EUR EUR
2 FRA M
3 ESP DKN
Dan_country_Ccy Dan_sex_Ccy
Group
1 USD USD
2 CHF
3
From there, you can drop columns you don't want, change strings from "TAMARA_CO" to "TAMARA", "GERMANY" to "GER", use reset_index(drop=True), etc.
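For example, a small sketch of that cleanup (the replacement values and dropped columns are just taken from the expected output in the question):
# rename the general values and drop columns not present in the desired output
df = df.replace({'TAMARA_CO': 'TAMARA', 'GERMANY': 'GER'}).reset_index(drop=True)
df = df.drop(columns=['Dan_country_Value', 'Dan_country_Ccy', 'Dan_sex_Ccy'])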
You can do this quite easily with only 3 steps:
Split your data frame into 2 parts: the "general data" (which we want as a series) and the more specific data. Each data frame now contains the same kinds of information.
The key part of your problem: reorganizing the data. All you need is the pandas pivot function. It does exactly what you need!
Add the general information and the pivoted data back together.
# Split Data
general = df[df.Group == "0"].set_index("Name")["Value"].copy()
main_df = df[df.Group != "0"]
# Pivot Data
result = main_df.pivot(index="Group", columns=["Name"],
values=["Value", "Ccy"]).fillna("")
result.columns = [f"{c[1]}_{c[0]}" for c in result.columns]
# Create a data frame that has an identical row for each group
general_df = pd.DataFrame([general]*3, index=result.index)
general_df.columns = [c + "_Value" for c in general_df.columns]
# Merge the data back together
result = general_df.merge(result, on="Group")
The result given above does not have the exact column order you want, so you'd have to specify that manually with:
final_cols = ["ID_Value", "Country_Value", "IBAN_Value",
              "Dan_Age_Value", "Dan_Age_Ccy", "Dan_city_Value",
              "Dan_city_Ccy", "Dan_sex_Value"]
result = result[final_cols]
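As a side note, [general]*3 hard-codes the number of groups; if that can vary, a slightly more general version of that line would be:
# repeat the general data once per remaining group instead of hard-coding 3
general_df = pd.DataFrame([general] * len(result), index=result.index)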
df = {'Region':['France','France','France','France'],'total':[1,2,3,4],'date':['12/30/19','12/31/19','01/01/20','01/02/20']}
df=pd.DataFrame.from_dict(df)
print(df)
Region total date
0 France 1 12/30/19
1 France 2 12/31/19
2 France 3 01/01/20
3 France 4 01/02/20
The dates are ordered. Now if I use pivot:
pandas_temp = df.pivot(index='Region',values='total', columns='date')
print(pandas_temp)
date 01/01/20 01/02/20 12/30/19 12/31/19
Region
France 3 4 1 2
I am losing the order. How can I keep it?
Convert the values to datetimes before the pivot and then, if necessary, convert the columns back to your custom format:
df['date'] = pd.to_datetime(df['date'])
pandas_temp = df.pivot(index='Region',values='total', columns='date')
pandas_temp = pandas_temp.rename(columns=lambda x: x.strftime('%m/%d/%y'))
#alternative
#pandas_temp.columns = pandas_temp.columns.strftime('%m/%d/%y')
print (pandas_temp)
date 12/30/19 12/31/19 01/01/20 01/02/20
Region
France 1 2 3 4
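Alternatively, if you prefer to keep the dates as strings, a sketch (assuming the same df) that preserves the original order by turning date into an ordered categorical before pivoting:
# keep the existing string order without parsing dates
df['date'] = pd.Categorical(df['date'], categories=df['date'].unique(), ordered=True)
pandas_temp = df.pivot(index='Region', values='total', columns='date')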
I have a dataframe like this:
df
ID Brands Age Gender City
1 BMW_Audi_VW 50 M Milano
2 VW_BMW 45 F SF
I would like to split the Brands column on "_" and duplicate all columns except City.
I can do it based on the ID column like this:
df = df.set_index('ID').stack().str.split('_', expand=True).unstack(-1).stack(0).reset_index()
but it duplicates only the ID column. I need all columns except "City".
Here is the desired output that I am looking for:
ID Brands Age Gender City
1 BMW 50 M Milano
1 Audi 50 M None
1 VW 50 M None
2 VW 45 F SF
2 BMW 45 F None
Use DataFrame.explode on the Brands values split by Series.str.split, and then set None in the duplicated rows with DataFrame.mask:
# split Brands into lists and explode to one row per brand (index labels repeat)
df = df.assign(Brands = df['Brands'].str.split('_')).explode('Brands')
include = ['ID','Brands','Age','Gender']
cols = df.columns.difference(include)  # columns to blank out, here: City
# keep City only in the first row of each original record
df[cols] = df[cols].mask(df.index.to_series().duplicated(), None)
df = df.reset_index(drop=True)
print (df)
ID Brands Age Gender City
0 1 BMW 50 M Milano
1 1 Audi 50 M None
2 1 VW 50 M None
3 2 VW 45 F SF
4 2 BMW 45 F None
EDIT:
Check the difference:
# Brands column is assigned to the Brands column (the same column)
df1 = df.assign(Brands = df['Brands'].str.split('_')).explode('Brands')
print (df1)
ID Brands Age Gender City
0 1 BMW 50 M Milano
0 1 Audi 50 M Milano
0 1 VW 50 M Milano
1 2 VW 45 F SF
1 2 BMW 45 F SF
# Brands column is assigned to the Brands1 column (a different column)
df2 = df.assign(Brands1 = df['Brands'].str.split('_')).explode('Brands')
print (df2)
ID Brands Age Gender City Brands1
0 1 BMW_Audi_VW 50 M Milano [BMW, Audi, VW]
1 2 VW_BMW 45 F SF [VW, BMW]
Make a DataFrame:
import pandas as pd

people = ['shayna','shayna','shayna','shayna','john']
dates = ['01-01-18','01-01-18','01-01-18','01-02-18','01-02-18']
places = ['hospital', 'hospital', 'inpatient', 'hospital', 'hospital']
d = {'Person':people,'Service_Date':dates, 'Site_Where_Served':places}
df = pd.DataFrame(d)
df
Person Service_Date Site_Where_Served
shayna 01-01-18 hospital
shayna 01-01-18 hospital
shayna 01-01-18 inpatient
shayna 01-02-18 hospital
john 01-02-18 hospital
What I would like to do is count the unique pairs of Person and their Service_Date grouped by Site_Where_Served.
Expected Output:
Site_Where_Served Site_Visit_Count
hospital 3
inpatient 1
My attempt:
df[['Person', 'Service_Date']].groupby(df['Site_Where_Served']).nunique().reset_index(name='Site_Visit_Count')
But this fails when resetting the index. So I tried leaving that out, and I realized that it isn't counting the unique pairs of 'Person' and 'Service_Date', because the output looks like this:
Person Service_Date
Site_Where_Served
hospital 2 2
inpatient 1 1
drop_duplicates with groupby + count
(df.drop_duplicates()
.groupby('Site_Where_Served')
.Site_Where_Served.count()
.reset_index(name='Site_Visit_Count')
)
Site_Where_Served Site_Visit_Count
0 hospital 3
1 inpatient 1
Note, one tiny difference between count/size is that the former does not count NaN entries.
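For example, a tiny illustration of that count/size difference on a made-up frame:
# hypothetical frame with one NaN value in column v
tmp = pd.DataFrame({'g': ['a', 'a'], 'v': [1, None]})
print(tmp.groupby('g')['v'].count())  # 1 -- NaN not counted
print(tmp.groupby('g').size())        # 2 -- NaN counted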
Tuplization, groupby and nunique
This is really only fixing your current solution, but I would not recommend it as it is quite long-winded, with more steps than necessary. First, tuplize your columns, group by Site_Where_Served, and then count:
(df[['Person', 'Service_Date']]
.apply(tuple, 1)
.groupby(df.Site_Where_Served)
.nunique()
.reset_index(name='Site_Visit_Count')
)
Site_Where_Served Site_Visit_Count
0 hospital 3
1 inpatient 1
In my opinion, a better way is to drop duplicates before using groupby.size:
res = df.drop_duplicates()\
.groupby('Site_Where_Served').size()\
.reset_index(name='Site_Visit_Count')
print(res)
Site_Where_Served Site_Visit_Count
0 hospital 3
1 inpatient 1
Maybe value_counts
(df.drop_duplicates()
.Site_Where_Served
.value_counts()
.to_frame('Site_Visit_Count')
.rename_axis('Site_Where_Served')
.reset_index()
)
Site_Where_Served Site_Visit_Count
0 hospital 3
1 inpatient 1
Counter 1
from collections import Counter

pd.Series(Counter(df.drop_duplicates().Site_Where_Served)) \
    .rename_axis('Site_Where_Served').reset_index(name='Site_Visit_Count')
Site_Where_Served Site_Visit_Count
0 hospital 3
1 inpatient 1
Counter 2
pd.DataFrame(
    list(Counter(t[2] for t in set(map(tuple, df.values))).items()),  # t[2] is Site_Where_Served
    columns=['Site_Where_Served', 'Site_Visit_Count']
)
Site_Where_Served Site_Visit_Count
0 hospital 3
1 inpatient 1