Dataframe to Sankey Diagram - python

I want to generate a Sankey diagram from product data that looks like this:
id  begin_date  status
1   01.02.2020  a
1   10.02.2020  b
1   17.02.2020  c
2   02.02.2020  d
2   06.03.2020  b
2   17.04.2020  c
For your experimentation:
pd.DataFrame([[1, '2020-02-01', 'a'], [1, '2020-02-10', 'b'], [1, '2020-02-17', 'c'],
              [2, '2020-02-02', 'd'], [2, '2020-03-06', 'b'], [2, '2020-04-17', 'c']],
             columns=['id', 'begin_date', 'status'])
After looking at this explanation:
Draw Sankey Diagram from dataframe
I want to construct the "Source-Target-Value" dataframe shown below. To improve understanding, I did not convert Source and Target to integers.
# with Source = previous status
# with Target = next status
# with Value = count of IDs that transition from Source to Target
Source  Target  Value  Link Color
a       b       1      rgba(127, 194, 65, 0.2)
b       c       2      rgba(127, 194, 65, 0.2)
d       b       1      rgba(211, 211, 211, 0.5)
The problem lies in generating Source, Target, and Value.
The Source and Target should be the status transition from a to b. The Value is the count of ids doing that transition.
What is the best way to do this?
EDIT: Using an online generator, the result would look like this: [Sankey diagram image]

Found the answer!
# assuming df is sorted by begin_date
import pandas as pd

df = pd.read_csv(r"path")

# build one transition table per id
dfs = []
unique_ids = df["id"].unique()
for uid in unique_ids:
    df_t = df[df["id"] == uid].copy()
    # pair each status with the next status of the same id
    df_t["status_next"] = df_t["status"].shift(-1)
    df_t["status_append"] = df_t["status"] + df_t["status_next"]
    # count each transition within this id (the last row has no successor and drops out)
    df_t = df_t.groupby("status_append").agg(Value=("status_append", "count")).reset_index()
    dfs.append(df_t)

# combine the per-id counts and sum them per transition
df = pd.concat(dfs, ignore_index=True)
df = df.groupby("status_append").agg(Value=("Value", "sum")).reset_index()

# split the concatenated pair back into Source and Target
df["Source"] = df["status_append"].astype(str).str[0]
df["Target"] = df["status_append"].astype(str).str[1]
df = df.drop("status_append", axis=1)
df = df[["Source", "Target", "Value"]]
yields
Source  Target  Value
a       b       1
b       c       2
d       b       1
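For reference, the same link table can also be built without the per-id loop and fed straight into plotly. A minimal sketch, assuming the sample frame from the question (an alternative, not part of the answer above):

import pandas as pd
import plotly.graph_objects as go

df = pd.DataFrame([[1, '2020-02-01', 'a'], [1, '2020-02-10', 'b'], [1, '2020-02-17', 'c'],
                   [2, '2020-02-02', 'd'], [2, '2020-03-06', 'b'], [2, '2020-04-17', 'c']],
                  columns=['id', 'begin_date', 'status'])

# sort so that shift(-1) sees the chronologically next status of each id
df = df.sort_values(['id', 'begin_date'])
df['Target'] = df.groupby('id')['status'].shift(-1)

# drop each id's last row (no next status) and count the transitions
links = (df.dropna(subset=['Target'])
           .groupby(['status', 'Target']).size()
           .reset_index(name='Value')
           .rename(columns={'status': 'Source'}))

# plotly wants integer node indices, so map the status labels
labels = sorted(set(links['Source']) | set(links['Target']))
idx = {label: i for i, label in enumerate(labels)}
fig = go.Figure(go.Sankey(
    node=dict(label=labels),
    link=dict(source=links['Source'].map(idx),
              target=links['Target'].map(idx),
              value=links['Value']),
))
fig.show()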

Related

How to pivot a pandas table just for some columns

I have a dataframe in pandas with a group of suffixed columns (there are several, but I'll use two as an example, _1 and _2), where each suffix depicts a different year.
df = pd.DataFrame({'A': ['BP', 'Virgin'],
                   'B(LY)': ['A', 'C'],
                   'B(LY_1)': ['B', 'D'],
                   'C': [1, 3],
                   'C_1': [2, 4],
                   'D': ['W', 'Y'],
                   'D_1': ['X', 'Z']})
I'm trying to reorganise the table to pivot it, so that it looks like this:
df = pd.DataFrame({'A': ['BP', 'BP', 'Virgin', 'Virgin'],
                   'Year': ['A', 'B', 'C', 'D'],
                   'C': [1, 2, 3, 4],
                   'D': ['W', 'X', 'Y', 'Z']})
But I can't work out how to do it. The problem is, I only need each suffixed column to match the equivalent suffix on the other variables. Any help is appreciated, thanks.
EDIT
here is a real life example of the data
df = pd.DataFrame({'Company': ['BP', 'Virgin'],
                   'Account_date(LY)': ['May', 'Apr'],
                   'Account_date(LY_1)': ['Apr', 'Mar'],
                   'Account_date(LY_2)': ['Mar', 'Feb'],
                   'Account_date(LY_3)': ['Feb', 'Jan'],
                   'Acc_day': [1, 5],
                   'Acc_day_1': [2, 6],
                   'Acc_day_2': [3, 7],
                   'Acc_day_3': [4, 8],
                   'D': ['W', 'A'],
                   'D_1': ['X', 'B'],
                   'D_2': ['Y', 'C'],
                   'D_3': ['Z', 'D']})
desired output:
df = pd.DataFrame({'Company': ['BP', 'BP', 'BP', 'BP', 'Virgin', 'Virgin', 'Virgin', 'Virgin'],
                   'Year': ['May', 'Apr', 'Mar', 'Feb', 'Apr', 'Mar', 'Feb', 'Jan'],
                   'Acc_day': [1, 2, 3, 4, 5, 6, 7, 8],
                   'D': ['W', 'X', 'Y', 'Z', 'A', 'B', 'C', 'D']})
You can use:
# set A aside
df2 = df.set_index('A')
# normalize the "(LY…)" tags to plain numeric suffixes so every column
# splits the same way on "_"
df2.columns = (df2.columns
               .str.replace(r'\(LY\)', '_0', regex=True)
               .str.replace(r'\(LY_(\d+)\)', r'_\1', regex=True))
# split columns to a MultiIndex on "_", giving the unsuffixed columns a matching "_0" level
df2.columns = pd.MultiIndex.from_tuples(
    [tuple(c.rsplit('_', 1)) if '_' in c else (c, '0') for c in df2.columns])
# reshape
out = df2.stack().droplevel(1).rename(columns={'B': 'Year'}).reset_index()
Or using janitor's pivot_longer (again assuming the names have been normalized to plain _N suffixes first):
import janitor
out = (df.pivot_longer(index='A', names_sep='_', names_to=('.value', '_drop'), sort_by_appearance=True)
         .rename(columns={'B': 'Year'}).drop(columns='_drop')
)
Output:
A Year C D
0 BP A 1 W
1 BP B 2 X
2 Virgin C 3 Y
3 Virgin D 4 Z
Updated example
Using a mapper to match (LY) -> _0, (LY_1) -> _1, etc.:
import re
# map each "(LY…)" tag to the numeric suffix used by the other columns
# (you can generate this mapper programmatically if needed)
mapper = {'(LY)': '_0', '(LY_1)': '_1', '(LY_2)': '_2', '(LY_3)': '_3'}
# set Company aside
df2 = df.set_index('Company')
# rewrite the tags, then split on the LAST "_" so names like "Acc_day" stay intact
pattern = '|'.join(map(re.escape, mapper))
df2.columns = df2.columns.str.replace(pattern, lambda m: mapper[m.group()], regex=True)
df2.columns = pd.MultiIndex.from_tuples(
    [tuple(c.rsplit('_', 1)) if re.search(r'_\d+$', c) else (c, '0') for c in df2.columns])
# reshape
out = (df2.stack().droplevel(1)
          .rename(columns={'Account_date': 'Year'})
          .reset_index())
Output:
  Company Year  Acc_day  D
0      BP  May        1  W
1      BP  Apr        2  X
2      BP  Mar        3  Y
3      BP  Feb        4  Z
4  Virgin  Apr        5  A
5  Virgin  Mar        6  B
6  Virgin  Feb        7  C
7  Virgin  Jan        8  D
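For the record, once the names are normalized to plain _N suffixes, pd.wide_to_long offers a built-in route. A minimal sketch on the simplified example; the _0/_1 column names and the year_idx helper are illustrative assumptions, not part of the answer above:

import pandas as pd

df = pd.DataFrame({'A': ['BP', 'Virgin'],
                   'B_0': ['A', 'C'], 'B_1': ['B', 'D'],
                   'C_0': [1, 3], 'C_1': [2, 4],
                   'D_0': ['W', 'Y'], 'D_1': ['X', 'Z']})

# stubnames are the column bases; the numeric suffix lands in year_idx
out = (pd.wide_to_long(df, stubnames=['B', 'C', 'D'], i='A', j='year_idx', sep='_', suffix=r'\d+')
         .reset_index()
         .sort_values(['A', 'year_idx'], ignore_index=True)
         .rename(columns={'B': 'Year'})
         .drop(columns='year_idx'))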

Iterate over columns and rows to identify what changed for data analysis

I have a historical table that keeps track of the status of a task over time.
The table looks similar to the below, where the 'ID' is unique to the task, 'Date' changes whenever an action is taken on the task, 'Factor1, Factor2, etc' are columns that contain details of the underlying task.
I want to flag on an 'ID' level, what 'Factor' columns are changing over time. Once I identify which 'Factor' columns are changing, I am planning on doing analysis to see which 'Factor' columns are changing the most, the least, etc.
I am looking to:
Sort by 'Date' ascending
Groupby 'ID'
Loop through each column that has 'Factor' in the column name and for each column, identify if the 'Factor' data changed by looping through each row for each ID
Create a new column for each 'Factor' column to flag if the underlying factor row changed overtime for that specific ID
Python code for sample data:
import pandas as pd
data = [[1, '12/12/2021', 'A', 500], [2, '10/20/2021', 'D', 200], [3, '7/2/2022', 'E', 300],
        [1, '5/2/2022', 'B', 500], [1, '8/2/2022', 'B', 500], [3, '10/2/2022', 'C', 200],
        [2, '1/5/2022', 'D', 200]]
df = pd.DataFrame(data, columns=['ID', 'Date', 'Factor1', 'Factor2'])
My desired output is this:
import pandas as pd

data = [[1, '12/12/2021', 'A', 500], [2, '10/20/2021', 'D', 200], [3, '7/2/2022', 'E', 300],
        [1, '5/2/2022', 'B', 500], [1, '8/2/2022', 'B', 500], [3, '10/2/2022', 'C', 200],
        [2, '1/5/2022', 'D', 200]]
df = pd.DataFrame(data, columns=['ID', 'Date', 'Factor1', 'Factor2'])

# get the 'Factor' columns
factor_columns = [col for col in df.columns if col.startswith('Factor')]

# returns Y if the value changed from the previous row of the same ID, else N
def check_factor(x, col, df1):
    # take the previous value for this ID, falling back to the current value for the first row
    val = df1[df1.ID == x.ID].shift(1)[col].fillna(x[col]).loc[x.name]
    return 'N' if val == x[col] else 'Y'

# creating a new columns list to reorder the columns
columns = ['ID', 'Date']
for col in factor_columns:
    columns += [col, f'{col}_Changed']
    # applying check_factor to the new column
    df[f'{col}_Changed'] = df.apply(check_factor, args=(col, df.copy()), axis=1)
df = df[columns]
print(df)
OUTPUT:
ID Date Factor1 Factor1_Changed Factor2 Factor2_Changed
0 1 12/12/2021 A N 500 N
1 2 10/20/2021 D N 200 N
2 3 7/2/2022 E N 300 N
3 1 5/2/2022 B Y 500 N
4 1 8/2/2022 B N 500 N
5 3 10/2/2022 C Y 200 Y
6 2 1/5/2022 D N 200 N
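A vectorized sketch of the same check, assuming df and factor_columns as defined above; unlike the apply-based version, it sorts by date first, per the question's first requirement:

import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['ID', 'Date'])
for col in factor_columns:
    # previous value of this factor within the same ID
    prev = df.groupby('ID')[col].shift(1)
    # the first row of each ID has no predecessor, so it is never flagged
    df[f'{col}_Changed'] = (prev.notna() & prev.ne(df[col])).map({True: 'Y', False: 'N'})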

Using `.groupby().apply()` instead of `.groupby().agg()`

Suppose I have a dataframe like this
import pandas as pd

d = {'User': ['A', 'A', 'B'],
     'time': [1, 2, 3],
     'state': ['CA', 'CA', 'OR'],
     'type': ['cd', 'dvd', 'cd']}
df = pd.DataFrame(data=d)
I want to create a function where I will pass in a single user's dataframe, so for example
user_df = df[df['User'] == 'A']
Then the function will return a single row data frame that will look like this
d = {'User': ['A'],
     'avg_time': [1.5],
     'state': ['CA'],
     'cd': [1],
     'dvd': [1]}
res_df = pd.DataFrame(data=d)
That function will then be applied across the entire dataframe of users, so I will have
def some_function():
Then I will write df.groupby('User').apply(some_function). Then I will have this as the resulting new dataframe
d = {'User': ['A', 'B'],
     'avg_time': [1.5, 3],
     'state': ['CA', 'OR'],
     'cd': [1, 1],
     'dvd': [1, 0]}
final_df = pd.DataFrame(data=d)
I know I can grab values for the df like this
avg_time = user_df['time'].mean()
state = user_df['state'].iloc[0]
type_counts = user_df['type'].value_counts().to_dict()
But I am not sure how to transform this into a one-row results dataframe. Any help is appreciated. The reason I want to do it this way instead of .agg() is that I am going to parallelize this function to make it run faster, since I will have a very large dataframe.
IIUC,
def aggUser(df):
    # scalar aggregates as a one-row frame
    a = pd.DataFrame({'avg_time': df['time'].mean(),
                      'state': [df['state'].iloc[0]]})
    # counts of 'type' as a one-row frame (columns cd, dvd, ...)
    b = df['type'].value_counts().to_frame().T.reset_index(drop=True)
    return pd.concat([a, b], axis=1).set_axis(df['User'].iloc[[0]])

pd.concat([aggUser(df.query('User == "A"')),
           aggUser(df.query('User == "B"'))])
Output:
avg_time state cd dvd
User
A 1.5 CA 1 1.0
B 3.0 OR 1 NaN
df.groupby('User', group_keys=False).apply(aggUser)
Output:
avg_time state cd dvd
User
A 1.5 CA 1 1.0
B 3.0 OR 1 NaN
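Since the stated goal is to parallelize this later, here is a minimal sketch of how the groups could be farmed out to a process pool. parallel_groupby_apply and workers are illustrative names (not a pandas API), and func must be picklable, i.e. defined at module top level:

from concurrent.futures import ProcessPoolExecutor
import pandas as pd

def parallel_groupby_apply(df, func, workers=4):
    # each user's frame is independent, so the groups can be processed in parallel
    groups = [group for _, group in df.groupby('User')]
    with ProcessPoolExecutor(max_workers=workers) as ex:
        return pd.concat(ex.map(func, groups))

result = parallel_groupby_apply(df, aggUser)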

Get frequency counts of the values stored in a pandas dataframe column, based on a range, and separated by a categorical variable

I have a dataframe that looks like so:
df = pd.DataFrame([[0.012343, 'A'], [0.135528, 'A'], [0.198878, 'A'], [0.199999, 'B'], [0.181121, 'B'], [0.199999, 'B']])
df.columns = ['effect', 'category']
effect category
0 0.012343 A
1 0.135528 A
2 0.198878 A
3 0.199999 B
4 0.181121 B
5 0.199999 B
My goal is to get a representation of the frequency distribution of each category. In this case, the bin size would be 0.05. The resulting dataframe would look like the following:
my_distribution = pd.DataFrame([['A', 1, 0, 1, 1], ['B', 0, 0, 0, 3]])
my_distribution.columns = ['category', '0.0-0.05', '0.05-0.10', '0.1-0.15', '0.15-0.20']
category 0.0-0.05 0.05-0.10 0.1-0.15 0.15-0.20
0 A 1 0 1 1
1 B 0 0 0 3
So, in brief what I am trying to do is create bins and count the number of occurrences in each bin, separated by category. Any help would be really appreciated.
You can use cut followed by crosstab + reindex:
import pandas as pd
df = pd.DataFrame([[0.01, 'A'], [0.13, 'A'], [0.19, 'A'], [0.19, 'B'], [0.18, 'B'], [0.19, 'B']])
df.columns = ['effect', 'category']
labels = ['0.0-0.05', '0.05-0.10', '0.1-0.15', '0.15-0.20']
cuts = df.assign(quant=pd.cut(df.effect, bins=[0.0, 0.05, 0.10, 0.15, 0.20], labels=labels))
# get counts per bin
result = pd.crosstab(cuts.category, columns=cuts.quant)
# reindex with labels to account for bin with 0 counts
result = result.reindex(labels, axis=1).fillna(0).astype(int)
# reset index and rename axis for display purposes
result = result.reset_index().rename_axis(None, axis=1)
print(result)
Output
category 0.0-0.05 0.05-0.10 0.1-0.15 0.15-0.20
0 A 1 0 1 1
1 B 0 0 0 3
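An equivalent sketch that skips crosstab, reusing cuts and labels from above (groupby + value_counts on the binned column):

result = (cuts.groupby('category')['quant'].value_counts()
              .unstack(fill_value=0)
              .reindex(columns=labels, fill_value=0)
              .reset_index()
              .rename_axis(None, axis=1))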

Creating Dynamic Data Frames in Python

I have a Data Frame DF1 with three columns A, B and C with values 3, 2 and 100.
What I am trying to do is loop through DF1 to create two new dataframes called DF_A and DF_B, dynamically (i.e., assign the names dynamically), such that
DF_A = 3, 100, 300 # (i.e. A*100)
DF_B = 2, 100, 200 # (i.e. B* 100)
Can someone please help?
There is no clean way to create dynamically named DataFrames. In another Stack Overflow question, storing them in a dictionary was recommended instead. It can be done with the following code.
import pandas as pd

DF1 = pd.DataFrame({"A": [3], "B": [2], "C": [100]})

DF_list = {}
for i in ["A", "B"]:
    DF = pd.DataFrame()
    DF[i] = DF1[i]
    DF["C"] = DF1["C"]
    # multiply the factor column by C, as requested
    DF["value"] = DF[i] * DF["C"]
    DF_list["DF_" + i] = DF
print(DF_list)
# {'DF_A':    A    C  value
#  0  3  100    300, 'DF_B':    B    C  value
#  0  2  100    200}
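The individual frames are then retrieved by dictionary key rather than by a dynamic variable name, for example:

print(DF_list["DF_A"])
#    A    C  value
# 0  3  100    300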
