I have a dataset as below:
index   10_YR_CAGR  5_YR_CAGR  1_YR_CAGR
c1_rev  20.5        21.5       31.5
c2_rev  20.5        22.5       24
c3_rev  21          24         27
c4_rev  20          26         30
c5_rev  24          19         15
c1_eps  21          22         23
c2_eps  21          24         25
This data covers 5 companies and their parameters such as rev, eps, profit, etc. I need to plot it as below:
rev:
  x_axis -> index col: c1_rev, ..., c5_rev
  y_axis -> 10_YR_CAGR, ..., 1_YR_CAGR
eps:
  x_axis -> index col: c1_eps, ..., c5_eps
  y_axis -> 10_YR_CAGR, ..., 1_YR_CAGR
etc.
I have tried the following code:
import seaborn as sns

eps = analysis_df[analysis_df.index.str.contains('eps', regex=True)]
for i1 in eps.columns[eps.columns != 'index']:
    sns.lineplot(x="index", y=i1, data=eps, label=i1)
Currently I make a separate dataframe per parameter and then loop over it. Instead of creating a loop for each parameter, how can I loop over the main source dataframe itself and create a grid of plots with parameters like rev, eps, and profit as the facet variable? How do I apply those filters in a FacetGrid, so that the same sort of plot is drawn for every parameter in a single for loop?
The way facets are typically plotted is by "melting" your analysis_df into id/variable/value columns.
split() the index column into Company and Parameter, which we'll later use as id columns when melting:
analysis_df[['Company', 'Parameter']] = analysis_df['index'].str.split('_', expand=True)
# index 10_YR_CAGR 5_YR_CAGR 1_YR_CAGR Company Parameter
# 0 c1_rev 100 21 1 c1 rev
# 1 c2_rev 1 32 24 c2 rev
# ...
melt() the CAGR columns:
melted = analysis_df.melt(
    id_vars=['Company', 'Parameter'],
    value_vars=['10_YR_CAGR', '5_YR_CAGR', '1_YR_CAGR'],
    var_name='Period',
    value_name='CAGR',
)
# Company Parameter Period CAGR
# 0 c1 rev 10_YR_CAGR 100
# 1 c2 rev 10_YR_CAGR 1
# 2 c3 rev 10_YR_CAGR 14
# 3 c1 eps 10_YR_CAGR 1
# ...
# 25 c2 pft 1_YR_CAGR 14
# 26 c3 pft 1_YR_CAGR 17
relplot() CAGR vs Company (colored by Period) for each Parameter using the melted dataframe:
sns.relplot(
    data=melted,
    kind='line',
    col='Parameter',
    x='Company',
    y='CAGR',
    hue='Period',
    col_wrap=1,
    facet_kws={'sharex': False, 'sharey': False},
)
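If you want to drive FacetGrid directly (the question asks about applying the filters in FacetGrid; relplot() builds one internally), here is a minimal sketch using the same melted dataframe:
import seaborn as sns

# one facet per Parameter (rev, eps, pft), hue mapped at the grid level
g = sns.FacetGrid(melted, col='Parameter', hue='Period', col_wrap=1,
                  sharex=False, sharey=False)
g.map_dataframe(sns.lineplot, x='Company', y='CAGR')
g.add_legend()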
Sample data to reproduce this plot:
import io
import pandas as pd
csv = '''
index,10_YR_CAGR,5_YR_CAGR,1_YR_CAGR
c1_rev,100,21,1
c2_rev,1,32,24
c3_rev,14,23,7
c1_eps,1,20,50
c2_eps,21,20,25
c3_eps,31,20,37
c1_pft,20,1,10
c2_pft,25,20,14
c3_pft,11,55,17
'''
analysis_df = pd.read_csv(io.StringIO(csv))
I have a sample dataframe as below (the actual dataset is roughly 300k entries long):
     user_id  revenue
0    234      100
1    2873     200
2    827      489
3    12       237
4    8942     28934
...  ...      ...
96   498      892384
97   2345     92
98   239      2803
99   4985     98332
100  947      4588
which displays the revenue generated by users. I would like to select the rows that account for the top 20% of the revenue (hence giving the top revenue-generating users).
The method that comes to mind is calculating the total number of users, working out 20% of this, sorting the dataframe with sort_values(), and then using head() or nlargest(), but I'd like to know if there is a simpler, more elegant way.
Can anybody propose a way to do this?
Thank you!
Suppose you have dataframe df:
user_id revenue
234 21
2873 20
827 23
12 23
8942 28
498 22
2345 20
239 24
4985 21
947 25
I've flattened the revenue distribution to show the idea.
Now calculate step by step:
import pandas as pd

df = pd.read_clipboard()
df = df.sort_values(by='revenue', ascending=False)
df['revenue_cum'] = df['revenue'].cumsum()
df['%revenue_cum'] = df['revenue_cum'] / df['revenue'].sum()
df
result:
user_id revenue revenue_cum %revenue_cum
4 8942 28 28 0.123348
9 947 25 53 0.233480
7 239 24 77 0.339207
2 827 23 100 0.440529
3 12 23 123 0.541850
5 498 22 145 0.638767
0 234 21 166 0.731278
8 4985 21 187 0.823789
1 2873 20 207 0.911894
6 2345 20 227 1.000000
The top 2 users alone generate 23.3% of total revenue.
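From here, selecting the rows that make up the top 20% of revenue is just a filter on the cumulative-share column built above; a minimal sketch:
# rows whose cumulative revenue share is still within the top 20%
# (raise the cutoff slightly if you want to include the row that crosses 20%)
top_20 = df[df['%revenue_cum'] <= 0.2]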
This looks like a case for df.quantile. Per the pandas documentation, if you are looking for the top 20%, all you need to do is pass the quantile value you desire.
An example using your dataset:
import pandas as pd
import numpy as np
df = pd.DataFrame({'user_id': [234, 2873, 827, 12, 8942],
                   'revenue': [100, 200, 489, 237, 28934]})
df.quantile([0.8, 1], interpolation='nearest')
This prints the values at the 0.8 and 1.0 quantiles of each column:
user_id revenue
0.8 2873 489
1.0 8942 28934
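Note that quantile() returns values at the quantile positions, not the rows themselves. If the goal is instead to keep every user at or above the 80th-percentile revenue, a sketch:
# users whose revenue is at or above the 80th percentile
threshold = df['revenue'].quantile(0.8)
top_users = df[df['revenue'] >= threshold]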
I usually find it useful to use sort_values() to see the cumulative effect of every row and then keep rows up to some threshold:
# Sort values from highest to lowest:
df = df.sort_values(by='revenue', ascending=False)
# Add a column with aggregated effect of the row:
df['cumulative_percentage'] = 100*df.revenue.cumsum()/df.revenue.sum()
# Define the threshold I need to analyze and keep those rows:
min_threshold = 30
top_percent = df.loc[df['cumulative_percentage'] <= min_threshold]
The original df will be nicely sorted with a clear indication of the top contributing rows and the created 'top_percent' df will contain the rows that need to be analyzed in particular.
I am assuming you are looking for the cumulative top 20% revenue-generating users. Here is a function that will help you get the expected output and more. Just pass your dataframe, the column name of the revenue, and the n_percent you are looking for:
import pandas as pd

def n_percent_revenue_generating_users(df, col, n_percent):
    # sort by revenue, highest first
    df.sort_values(by=[col], ascending=False, inplace=True)
    # cumulative sum and cumulative share (in %)
    df[f'{col}_cs'] = df[col].cumsum()
    df[f'{col}_csp'] = 100 * df[f'{col}_cs'] / df[col].sum()
    # find the revenue value whose cumulative share first exceeds n_percent
    df_ = df[df[f'{col}_csp'] > n_percent]
    index_nearest = (df_[f'{col}_csp'] - n_percent).abs().idxmin()
    threshold_revenue = df_.loc[index_nearest, col]
    # keep all rows at or above that threshold, dropping the helper columns
    output = df[df[col] >= threshold_revenue].drop(columns=[f'{col}_cs', f'{col}_csp'])
    return output

n_percent_revenue_generating_users(df, 'revenue', 20)
I have a dataframe that looks somewhat like the one below (note that there are columns beyond COST and UNITS):
TIME COST1 UNITS1_1 COST2 UNITS2_1 .... COSTN UNITSN_1
21:55:51 25 100 20 50 .... 22 130
22:55:51 23 100 24 150 .... 22 230
21:58:51 28 100 22 250 .... 22 430
I am looking to compute a sumproduct (as a new column) for each row, such that (COST1*UNITS1_1) + (COST2*UNITS2_1) + ... + (COSTN*UNITSN_1) is computed and stored in this column.
Could you advise an efficient way to do this?
The approaches I am considering are looping through the column names based on a filter condition for the columns, and/or using a lambda function to compute the necessary number.
Select columns by position, convert them to numpy arrays with DataFrame.to_numpy (or DataFrame.values), multiply them, and finally sum:
#pandas 0.24+
df['new'] = (df.iloc[:, ::2].to_numpy() * df.iloc[:, 1::2].to_numpy()).sum(axis=1)
#pandas lower
#df['new'] = (df.iloc[:, ::2].values * df.iloc[:, 1::2].values).sum(axis=1)
Or use DataFrame.filter to select the columns:
df['new'] = (df.filter(like='COST').to_numpy()*df.filter(like='UNITS').to_numpy()).sum(axis=1)
df['new'] = (df.filter(like='COST').values*df.filter(like='UNITS').values).sum(axis=1)
print (df)
COST1 UNITS1_1 COST2 UNITS2_1 COSTN UNITSN_1 new
TIME
21:55:51 25 100 20 50 22 130 6360
22:55:51 23 100 24 150 22 230 10960
21:58:51 28 100 22 250 22 430 17760
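A minimal data setup to reproduce the output above (assuming TIME is the index, as the printout suggests):
import pandas as pd

df = pd.DataFrame(
    {'COST1': [25, 23, 28], 'UNITS1_1': [100, 100, 100],
     'COST2': [20, 24, 22], 'UNITS2_1': [50, 150, 250],
     'COSTN': [22, 22, 22], 'UNITSN_1': [130, 230, 430]},
    index=pd.Index(['21:55:51', '22:55:51', '21:58:51'], name='TIME'),
)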
I am trying to create a pivot table in Python with calculated columns, from a DataFrame holding a huge amount of data (running into the thousands of rows).
The file has a few columns: Customer, AlertKey and MTTR.
The expected output should be grouped by Customer and AlertKey (top 5 counts only), with the corresponding mean and median MTTR against each AlertKey.
The dataframe is created by pulling data from multiple database tables, and I am now stuck on the presentation. This cannot easily be done in Excel, since we need to pull records from multiple databases, and median calculation in an Excel pivot is a pain; the process also needs to be automated. My attempt so far:
df.groupby(['Customer','AlertKey']).AlertKey.value_counts().nlargest(20)
Consider first creating indicators for the value-count ranking and then running a groupby().agg() call:
# per-(Customer, AlertKey) frequency, repeated on every matching row
df['AKCount'] = df.groupby(['Customer', 'AlertKey'])['AlertKey'].transform('count')
# rank the highest counts first so that "rank <= N" keeps the top N AlertKeys
df['AKCountRank'] = df.groupby(['Customer'])['AKCount'].transform(
    lambda x: x.rank(method='dense', ascending=False))
sub = df[df['AKCountRank'] <= 20]
gdf = sub.groupby(['Customer', 'AlertKey'])['MTTR'].agg(['count', 'mean', 'median'])
Data
from io import StringIO
import pandas as pd
txt ="""
Customer AlertKey MTTR
C1 C1A1 38
C1 C1A2 25
C2 C2A5 40
C1 C1A1 50
C3 C3A7 60
C3 C3A7 23
C5 C5A8 29
C3 C3A7 30
C5 C5A8 40
"""
df = pd.read_table(StringIO(txt), sep=r"\s+")
Output
print(gdf)
# MTTR
# count mean median
# Customer AlertKey
# C1 C1A1 2 44.000000 44.0
# C1A2 1 25.000000 25.0
# C2 C2A5 1 40.000000 40.0
# C3 C3A7 3 37.666667 30.0
# C5 C5A8 2 34.500000 34.5
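To honor the "top 5 count only" requirement from the question, the same rank filter works with a threshold of 5 instead of 20; a sketch:
# restrict to the 5 most frequent AlertKeys per Customer
sub5 = df[df['AKCountRank'] <= 5]
gdf5 = sub5.groupby(['Customer', 'AlertKey'])['MTTR'].agg(['count', 'mean', 'median'])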
Given a dataframe df with columns A, B, C, and D,
A B C D
0 88 38 15.66 30.0
1 88 34 15.66 40.0
2 15 15 12.00 20.0
3 15 19 8.00 15.0
4 45 12 6.00 15.0
5 45 30 4.00 30.0
6 29 31 3.60 15.0
7 88 20 3.60 10.0
8 64 25 3.60 15.0
9 45 43 3.60 20.0
I want to make a scatter plot that graphs A vs B, with sizes based on C and colors based on D. After trying many ways to do this, I settled on grouping the data by D, then plotting each group in D:
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize

fig, axes = plt.subplots()
factor = df.groupby('D')
for name, group in factor:
    axes.scatter(group.A, group.B, s=(group.C)**2, c=group.D,
                 cmap='viridis', norm=Normalize(vmin=min(df.D), vmax=max(df.D)),
                 label=name)
This yields the appropriate plot, but the default legend() output is wrong: the groups listed in the legend have correct names, but incorrect colors and sizes (colors should vary by group, and all legend markers should be the same size).
I tried to set the legend manually, which lets me approximate the colors, but I can't get the sizes to be equal. Eventually I'd also like a second legend that links sizes to the appropriate levels of C.
import numpy as np

axes.legend(loc=1, scatterpoints=1, fontsize='small', frameon=False, ncol=2)
leg = axes.get_legend()
z = plt.cm.viridis(np.linspace(0, 1, len(factor)))
for i in range(len(factor)):
    leg.legendHandles[i].set_color(z[i])
Here's one approach that seems to satisfy your requirements, using Seaborn's lmplot(). (Inspiration taken from this post.)
First, generate some sample data:
import numpy as np
import pandas as pd
n = 10
min_size = 50
max_size = 300
A = np.random.random(n)
B = np.random.random(n)*2
C = np.random.randint(min_size, max_size, size=n)
D = np.random.choice(['Group1','Group2'], n)
df = pd.DataFrame({'A':A,'B':B,'C':C,'D':D})
Now plot:
import seaborn as sns
sns.lmplot(x='A', y='B', hue='D',
           fit_reg=False, data=df,
           scatter_kws={'s': df.C})
UPDATE
Given the updated example data from the OP, the same lmplot() approach fulfills the specifications: the group legend is tracked by color, and the legend indicators are all the same size.
sns.lmplot(x='A', y='B', hue='D', data=df,
           scatter_kws={'s': df.C**2}, fit_reg=False)
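For the second, size-based legend the question mentions, seaborn's scatterplot() can encode both hue and size and builds legend entries for each; a sketch (the sizes=(min, max) range here is an assumption to be tuned):
import seaborn as sns

# hue tracks D, size tracks C; both appear in the legend automatically
ax = sns.scatterplot(data=df, x='A', y='B', hue='D', size='C',
                     sizes=(20, 300))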
I am trying to plot a histogram distribution of one column with respect to another. For example, if the dataframe columns are ['count','age'], then I want to plot the total counts in each age group. Suppose in
age: 0-10 -> total count was 20
age: 10-20 -> total count was 10
age: 20-30 -> ... etc
I tried groupby('age') and then plotting a histogram, but it didn't work.
Thanks.
Update
Here is some of my data
df.head()
age count
0 65 2417.86
1 65 4173.50
2 65 3549.16
3 65 509.07
4 65 0.00
Also, df.plot(x='age', y='count', kind='hist') does not show what I expect.
OK, if I understand correctly, you want a weighted histogram:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# test data
df = pd.DataFrame({'age': np.random.normal(50, 10, 300).astype(int),
                   'counts': 1000 * np.random.random(300)})
#df.head()
# age counts
#0 38 797.174450
#1 36 402.171434
#2 49 894.218420
#3 66 841.786623
#4 51 597.040259
df.hist('age', weights=df['counts'])
plt.ylabel('counts')
plt.show()
yields the figure: a histogram of ages in which each observation is weighted by its counts, so bar heights show the total counts per age bin.
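For comparison, the groupby route the question attempted also works if you bin the ages first and sum the counts per bin; a sketch (continuing from the df above, assuming 10-year bins):
# sum counts within each 10-year age bin and plot the totals as bars
bins = pd.cut(df['age'], bins=range(0, 101, 10))
df.groupby(bins)['counts'].sum().plot(kind='bar')
plt.ylabel('counts')
plt.show()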