I have a pandas time-series DataFrame df with columns date, week, week_start_date, country, campaign_name and active. For some dates we have information on multiple campaigns.
For example:
data = [["2023.01.02", 1, "2023.01.01", "BR", "SALE-1", 1],
["2023.01.02", 1, "2023.01.01", "BR", "SALE-2", 1],
["2023.01.02", 1, "2023.01.01", "NL", "SALE-1", 1],
["2023.01.02", 1, "2023.01.01", "DE", "SALE-1", 1]]
df = pd.DataFrame(data, columns=["date", "week", "week_start_date", "country", "campaign_name", "active"])
date week week_start_date country campaign_name active
2023.01.02 1 2023.01.01 BR SALE-1 1
2023.01.02 1 2023.01.01 BR SALE-2 1
2023.01.02 1 2023.01.01 NL SALE-1 1
2023.01.02 1 2023.01.01 DE SALE-1 1
I don't mind having a separate time series per date/country combination, but when the same country has two (or more) campaigns on a date, I would like to pivot them into extra columns:
date week week_start_date country campaign_name active campaign_name_n active_n total_active
2023.01.02 1 2023.01.01 BR SALE-1 1 SALE-2 1 2
2023.01.02 1 2023.01.01 NL SALE-1 1 NaN NaN 1
2023.01.02 1 2023.01.01 DE SALE-1 1 NaN NaN 1
Here campaign_name_n and active_n can go up to any number n, depending on how many campaigns we find while running the loop.
I am trying to use:
import pandas as pd

# Load your data into a pandas DataFrame
df = pd.read_csv("data.csv")

# Group the data by date, week, week_start_date, country, and days_active
grouped = df.groupby(["date", "week", "week_start_date", "country", "days_active"])

# Create a dictionary to store the campaign names for each group
campaign_names = {}

# Iterate through the groups
for name, group in grouped:
    # Check if there are multiple entries for a particular date
    if len(group) > 1:
        # Create new columns for the campaign names
        for i, row in enumerate(group.itertuples()):
            campaign_name = "campaign_name_{}".format(i + 1)
            campaign_names[row.Index] = campaign_name
            df.at[row.Index, campaign_name] = row.campaign_name

# Add the campaign name columns to the DataFrame
for index, campaign_name in campaign_names.items():
    df.at[index, "campaign_name"] = campaign_name

# Drop the original campaign_name column
df = df.drop(columns=["campaign_name"])

# Save the grouped and modified data to a new file
df.to_csv("grouped_data.csv", index=False)
but I am getting all the campaigns pivoted, which is not intended. It would be great if someone could help here. Thank you!
Try:
# collect the campaign names as a list and the active values as a list plus their sum
x = df.groupby(["date", "week", "week_start_date", "country"]).agg(
    {"campaign_name": list, "active": [list, "sum"]}
)
# flatten the MultiIndex columns: ("active", "sum") -> "active_sum", the "_list" suffix is dropped
x.columns = [f"{a}_{b}".replace("_list", "") for a, b in x.columns]

# expand each list into numbered columns (campaign_name_1, active_1, campaign_name_2, ...)
tmp = pd.DataFrame(
    x[["campaign_name", "active"]]
    .apply(
        lambda x: {
            f"{a}_{i}": v for a, b in zip(x.index, x.values) for i, v in enumerate(b, 1)
        },
        axis=1,
    )
    .to_list()
)

x = pd.concat([x.reset_index(), tmp], axis=1).drop(columns=["campaign_name", "active"])
print(x)
Prints:
date week week_start_date country active_sum campaign_name_1 campaign_name_2 active_1 active_2
0 2023.01.02 1 2023.01.01 BR 2 SALE-1 SALE-2 1 1.0
1 2023.01.02 1 2023.01.01 DE 1 SALE-1 NaN 1 NaN
2 2023.01.02 1 2023.01.01 NL 1 SALE-1 NaN 1 NaN
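If you want the column names from your expected output, a small follow-up rename is enough (a sketch on top of the result above; total_active is the name used in the question):
# optional: align with the column names used in the question
x = x.rename(columns={"active_sum": "total_active"})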
Given the following python pandas DataFrame:
ID      country  money  other  money_add
832932  France   12131  19     82932
217#8#  NaN      NaN    NaN    NaN
1329T2  NaN      NaN    NaN    NaN
832932  France   NaN    30     NaN
31728#  NaN      NaN    NaN    NaN
I would like to make the following modifications to each row:
If the ID column contains a '#', the row is left unchanged.
If the ID column contains no '#' and country is NaN, "Other" is added to the country column and 0 is added to the other column.
Finally, only if the money column is NaN and the other column has a value, we assign money and money_add from the following lookup table (matching other against other_ID):
other_ID  money  money_add
19        4532   723823
50        1213   238232
18        1813   273283
30        1313   83293
0         8932   3920
Example of the resulting table:
ID      country  money  other  money_add
832932  France   12131  19     82932
217#8#  NaN      NaN    NaN    NaN
1329T2  Other    8932   0      3920
832932  France   1313   30     83293
31728#  NaN      NaN    NaN    NaN
First set values in both columns where both conditions match, then index the rows without '#' by their other value and update only the matched rows with DataFrame.update:
m1 = df['ID'].str.contains('#')  # rows whose ID contains '#'
m2 = df['country'].isna()        # rows with missing country

# condition 2: fill country/other for rows without '#' and missing country
df.loc[~m1 & m2, ['country','other']] = ['Other',0]

# df1 is the lookup table; align both frames on the other / other_ID key
df1 = df1.set_index(df1['other_ID'])
# rows with '#' get NaN as index so they are never matched
df = df.set_index(df['other'].mask(m1))

# condition 3: fill only missing money / money_add values from the lookup table
df.update(df1, overwrite=False)
df = df.reset_index(drop=True)
print (df)
ID country money other money_add
0 832932 France 12131 19.0 82932.0
1 217#8# NaN NaN NaN NaN
2 1329T2 Other 8932.0 0.0 3920.0
3 832932 France 1313.0 30.0 83293.0
4 31728# NaN NaN NaN NaN
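For reference, df1 above is the lookup table from the question; a minimal sketch of how it could be constructed (values copied from that table):
import pandas as pd

# lookup table used as df1 in the answer above
df1 = pd.DataFrame({
    'other_ID': [19, 50, 18, 30, 0],
    'money': [4532, 1213, 1813, 1313, 8932],
    'money_add': [723823, 238232, 273283, 83293, 3920],
})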
Let's say that I have this dataframe with four columns: "Name", "Value", "Ccy" and "Group":
import pandas as pd
Name = ['ID', 'Country', 'IBAN','Dan_Age', 'Dan_city', 'Dan_country', 'Dan_sex', 'Dan_Age', 'Dan_country','Dan_sex' , 'Dan_city','Dan_country' ]
Value = ['TAMARA_CO', 'GERMANY','FR56','18', 'Berlin', 'GER', 'M', '22', 'FRA', 'M', 'Madrid', 'ESP']
Ccy = ['','','','EUR','EUR','USD','USD','','CHF', '','DKN','']
Group = ['0','0','0','1','1','1','1','2','2','2','3','3']
df = pd.DataFrame({'Name':Name, 'Value' : Value, 'Ccy' : Ccy,'Group':Group})
print(df)
Name Value Ccy Group
0 ID TAMARA_CO 0
1 Country GERMANY 0
2 IBAN FR56 0
3 Dan_Age 18 EUR 1
4 Dan_city Berlin EUR 1
5 Dan_country GER USD 1
6 Dan_sex M USD 1
7 Dan_Age 22 2
8 Dan_country FRA CHF 2
9 Dan_sex M 2
10 Dan_city Madrid DKN 3
11 Dan_country ESP 3
I want to represent this data differently before saving it to a csv. I would like to group the duplicates in the column "Name" together with the associated values in "Value" and "Ccy", and store the data from "Value" and "Ccy" in the row (index) defined by the column "Group", so that the data are not mixed.
Then, if the name is in group 0, it is general data, so I would like all the rows for that "Name" to be filled with the same value.
So I would like to get this result :
ID_Value Country_Value IBAN_Value Dan_age Dan_age_Ccy Dan_city_Value Dan_city_Ccy Dan_sex_Value
1 TAMARA GER FR56 18 EUR Berlin EUR M
2 TAMARA GER FR56 22 M
3 TAMARA GER FR56 Madrid DKN
I cannot figure out how to do the first part. With the code below, I do not get what I want even if I remove the empty columns:
g = df.groupby(['Name']).cumcount()
df = df.set_index([g,'Name']).unstack().sort_index(level=1, axis=1)
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
Can anyone help me? Thank you!
You can use the following. See comments in code for each step:
import numpy as np

s = df.loc[df['Group'] == '0', 'Name'].tolist()  # this variable will be used later according to Condition 2
df['Name'] = pd.Categorical(df['Name'], categories=df['Name'].unique(), ordered=True)  # this preserves order before pivoting
df = df.pivot(index='Group', columns='Name')  # transforms long-to-wide per expected output
for col in df.columns:
    if col[1] in s:
        df[col] = df[col].shift().ffill()  # Condition 2
df = df.iloc[1:].replace('', np.nan).dropna(axis=1, how='all').fillna('')  # dataframe cleanup
df.columns = ['_'.join(col) for col in df.columns.swaplevel()]  # column name cleanup
df
Out[1]:
ID_Value Country_Value IBAN_Value Dan_Age_Value Dan_city_Value \
Group
1 TAMARA_CO GERMANY FR56 18 Berlin
2 TAMARA_CO GERMANY FR56 22
3 TAMARA_CO GERMANY FR56 Madrid
Dan_country_Value Dan_sex_Value Dan_Age_Ccy Dan_city_Ccy \
Group
1 GER M EUR EUR
2 FRA M
3 ESP DKN
Dan_country_Ccy Dan_sex_Ccy
Group
1 USD USD
2 CHF
3
From there, you can drop columns you don't want, change strings from "TAMARA_CO" to "TAMARA", "GERMANY" to "GER", use reset_index(drop=True), etc.
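For example, a small sketch of that cleanup (the replacement values and dropped columns are just taken from the expected output in the question):
# rename the general values and drop columns not present in the desired output
df = df.replace({'TAMARA_CO': 'TAMARA', 'GERMANY': 'GER'}).reset_index(drop=True)
df = df.drop(columns=['Dan_country_Value', 'Dan_country_Ccy', 'Dan_sex_Ccy'])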
You can do this quite easily with only 3 steps:
Split your data frame into 2 parts: the "general data" (which we want as a series) and the more specific data. Each data frame now contains the same kinds of information.
The key part of your problem: reorganizing the data. All you need is the pandas pivot function. It does exactly what you need!
Add the general information and the pivoted data back together.
# Split Data
general = df[df.Group == "0"].set_index("Name")["Value"].copy()
main_df = df[df.Group != "0"]
# Pivot Data
result = main_df.pivot(index="Group", columns=["Name"],
values=["Value", "Ccy"]).fillna("")
result.columns = [f"{c[1]}_{c[0]}" for c in result.columns]
# Create a data frame that has an identical row for each group
general_df = pd.DataFrame([general]*3, index=result.index)
general_df.columns = [c + "_Value" for c in general_df.columns]
# Merge the data back together
result = general_df.merge(result, on="Group")
The result given above does not have the exact column order you want, so you'd have to specify that manually with:
final_cols = ["ID_Value", "Country_Value", "IBAN_Value",
              "Dan_Age_Value", "Dan_Age_Ccy", "Dan_city_Value",
              "Dan_city_Ccy", "Dan_sex_Value"]
result = result[final_cols]
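As a side note, [general]*3 hard-codes the number of groups; if that can vary, a slightly more general version of that line would be:
# repeat the general data once per remaining group instead of hard-coding 3
general_df = pd.DataFrame([general] * len(result), index=result.index)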
df = {'Region':['France','France','France','France'],'total':[1,2,3,4],'date':['12/30/19','12/31/19','01/01/20','01/02/20']}
df=pd.DataFrame.from_dict(df)
print(df)
Region total date
0 France 1 12/30/19
1 France 2 12/31/19
2 France 3 01/01/20
3 France 4 01/02/20
The dates are ordered. Now if I use pivot:
pandas_temp = df.pivot(index='Region',values='total', columns='date')
print(pandas_temp)
date 01/01/20 01/02/20 12/30/19 12/31/19
Region
France 3 4 1 2
I am losing the order. How can I keep it?
Convert the values to datetimes before the pivot and then, if necessary, convert the columns back to your custom format:
df['date'] = pd.to_datetime(df['date'])
pandas_temp = df.pivot(index='Region',values='total', columns='date')
pandas_temp = pandas_temp.rename(columns=lambda x: x.strftime('%m/%d/%y'))
#alternative
#pandas_temp.columns = pandas_temp.columns.strftime('%m/%d/%y')
print (pandas_temp)
date 12/30/19 12/31/19 01/01/20 01/02/20
Region
France 1 2 3 4
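Alternatively, if you prefer to keep the dates as strings, a sketch (assuming the same df) that preserves the original order by turning date into an ordered categorical before pivoting:
# keep the existing string order without parsing dates
df['date'] = pd.Categorical(df['date'], categories=df['date'].unique(), ordered=True)
pandas_temp = df.pivot(index='Region', values='total', columns='date')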
I have a dataframe like this:
df
ID Brands Age Gender City
1 BMW_Audi_VW 50 M Milano
2 VW_BMW 45 F SF
I would like to split the Brands column on "_" and duplicate all columns except City.
I can do it based on the ID column like this:
df = df.set_index('ID').stack().str.split('_', expand=True).unstack(-1).stack(0).reset_index()
but it duplicates only the ID column. I need all columns except "City".
Here is the desired output that I am looking for:
ID Brands Age Gender City
1 BMW 50 M Milano
1 Audi 50 M None
1 VW 50 M None
2 VW 45 F SF
2 BMW 45 F None
Use DataFrame.explode on the Brands values split by Series.str.split, and then set None in the duplicated rows with DataFrame.mask:
# split Brands into lists and explode to one row per brand (index labels repeat)
df = df.assign(Brands = df['Brands'].str.split('_')).explode('Brands')
include = ['ID','Brands','Age','Gender']
cols = df.columns.difference(include)  # columns to blank out, here: City
# keep City only in the first row of each original record
df[cols] = df[cols].mask(df.index.to_series().duplicated(), None)
df = df.reset_index(drop=True)
print (df)
ID Brands Age Gender City
0 1 BMW 50 M Milano
1 1 Audi 50 M None
2 1 VW 50 M None
3 2 VW 45 F SF
4 2 BMW 45 F None
EDIT:
Check the difference:
# Brands column is assigned to the Brands column (the same column)
df1 = df.assign(Brands = df['Brands'].str.split('_')).explode('Brands')
print (df1)
ID Brands Age Gender City
0 1 BMW 50 M Milano
0 1 Audi 50 M Milano
0 1 VW 50 M Milano
1 2 VW 45 F SF
1 2 BMW 45 F SF
# Brands column is assigned to the Brands1 column (a different column)
df2 = df.assign(Brands1 = df['Brands'].str.split('_')).explode('Brands')
print (df2)
ID Brands Age Gender City Brands1
0 1 BMW_Audi_VW 50 M Milano [BMW, Audi, VW]
1 2 VW_BMW 45 F SF [VW, BMW]
Make a DataFrame:
import pandas as pd

people = ['shayna','shayna','shayna','shayna','john']
dates = ['01-01-18','01-01-18','01-01-18','01-02-18','01-02-18']
places = ['hospital', 'hospital', 'inpatient', 'hospital', 'hospital']
d = {'Person':people,'Service_Date':dates, 'Site_Where_Served':places}
df = pd.DataFrame(d)
df
Person Service_Date Site_Where_Served
shayna 01-01-18 hospital
shayna 01-01-18 hospital
shayna 01-01-18 inpatient
shayna 01-02-18 hospital
john 01-02-18 hospital
What I would like to do is count the unique pairs of Person and their Service_Date grouped by Site_Where_Served.
Expected Output:
Site_Where_Served Site_Visit_Count
hospital 3
inpatient 1
My attempt:
df[['Person', 'Service_Date']].groupby(df['Site_Where_Served']).nunique().reset_index(name='Site_Visit_Count')
But this fails when resetting the index. So I tried leaving that out, and I realized that it isn't counting the unique pairs of 'Person' and 'Service_Date', because the output looks like this:
Person Service_Date
Site_Where_Served
hospital 2 2
inpatient 1 1
drop_duplicates with groupby + count
(df.drop_duplicates()
.groupby('Site_Where_Served')
.Site_Where_Served.count()
.reset_index(name='Site_Visit_Count')
)
Site_Where_Served Site_Visit_Count
0 hospital 3
1 inpatient 1
Note, one tiny difference between count/size is that the former does not count NaN entries.
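For example, a tiny illustration of that count/size difference on a made-up frame:
# hypothetical frame with one NaN value in column v
tmp = pd.DataFrame({'g': ['a', 'a'], 'v': [1, None]})
print(tmp.groupby('g')['v'].count())  # 1 -- NaN not counted
print(tmp.groupby('g').size())        # 2 -- NaN counted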
Tuplization, groupby and nunique
This is really only fixing your current solution, but I would not recommend it as it is quite long-winded, with more steps than necessary. First, tuplize your columns, group by Site_Where_Served, and then count:
(df[['Person', 'Service_Date']]
.apply(tuple, 1)
.groupby(df.Site_Where_Served)
.nunique()
.reset_index(name='Site_Visit_Count')
)
Site_Where_Served Site_Visit_Count
0 hospital 3
1 inpatient 1
In my opinion, a better way is to drop duplicates before using groupby.size:
res = df.drop_duplicates()\
.groupby('Site_Where_Served').size()\
.reset_index(name='Site_Visit_Count')
print(res)
Site_Where_Served Site_Visit_Count
0 hospital 3
1 inpatient 1
Maybe value_counts
(df.drop_duplicates()
.Site_Where_Served
.value_counts()
.to_frame('Site_Visit_Count')
.rename_axis('Site_Where_Served')
.reset_index()
)
Site_Where_Served Site_Visit_Count
0 hospital 3
1 inpatient 1
Counter 1
from collections import Counter

pd.Series(Counter(df.drop_duplicates().Site_Where_Served)) \
    .rename_axis('Site_Where_Served').reset_index(name='Site_Visit_Count')
Site_Where_Served Site_Visit_Count
0 hospital 3
1 inpatient 1
Counter 2
pd.DataFrame(
    list(Counter(t[2] for t in set(map(tuple, df.values))).items()),  # t[2] is Site_Where_Served
    columns=['Site_Where_Served', 'Site_Visit_Count']
)
Site_Where_Served Site_Visit_Count
0 hospital 3
1 inpatient 1