Pandas - plotting user RFM - python

Given the following DF of user RFM activity:
   uid   R  F   M
0    1  10  1   5
1    1   2  2  10
2    1   4  3   1
3    1   5  4  10
4    2  10  1   3
5    2   1  2  10
6    2   1  3   4
Recency: The time between the last purchase and today, represented by
the distance between the rightmost circle and the vertical dotted line
that's labeled Now.
Frequency: The time between purchases, represented by the distance
between the circles on a single line.
Monetary: The amount of money spent on each purchase, represented by
the size of the circle. This amount could be the average order value
or the quantity of products that the customer ordered.
I would like to plot something like the figure below:
Where the size of the circle is the M value and the distance is the R. Any help would be appreciated.
Update
As suggested by Diziet Asahi I've tried the following:
import pandas as pd
import matplotlib.pyplot as plt

def plot_users(df):
    fig, ax = plt.subplots()
    ax.axis('off')
    ax.scatter(x=df['M'],y=df['uid'],s=30*df['R'], marker='o', color='grey')
    ax.invert_xaxis()
    ax.axvline(0, ls='--', color='black', zorder=-1)
    for y in df['uid'].unique():
        ax.axhline(y, color='grey', zorder=-1)

tmp = pd.DataFrame({'uid':[1,1,1,1,2,2,2],'R':[10,2,4,5,10,1,1],'F':[1,2,3,4,1,3,4],'M':[5,10,1,10,3,10,4]})
plot_users(tmp)
And I get the following:
So I think there is a bug: the first user has 4 records, and the sizes don't match either.

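You can use matplotlib's scatter() with the s= argument to draw markers with an area proportional to the value in M. The rest is just tweaking the appearance of the plot.
import matplotlib.pyplot as plt

# df is the DataFrame from the question
c = 'xkcd:dark grey'
fig, ax = plt.subplots()
ax.axis('off')
# x is the recency (R); the marker area scales with the monetary value (M)
ax.scatter(x=df['R'],y=df['uid'],s=60*df['M'], marker='o', color=c)
ax.invert_xaxis()
# dashed vertical line marking "Now" (R = 0)
ax.axvline(0, ls='--', color=c, zorder=-1)
# one horizontal line per user
for y in df['uid'].unique():
    ax.axhline(y, color=c, zorder=-1)
ax.set_ymargin(1)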
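For reference, here is a self-contained sketch of the same approach wrapped in a function, using the sample data from the question. The function name plot_rfm and the size_scale parameter are illustrative assumptions, not part of the original answer. Note that x is R and the marker area comes from M; the attempt in the question had these two swapped.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Sample data from the question: one row per purchase.
df = pd.DataFrame({
    "uid": [1, 1, 1, 1, 2, 2, 2],
    "R": [10, 2, 4, 5, 10, 1, 1],
    "F": [1, 2, 3, 4, 1, 3, 4],
    "M": [5, 10, 1, 10, 3, 10, 4],
})


def plot_rfm(df, size_scale=60):
    """Hypothetical helper: one line per user, x is R, marker area scales with M."""
    fig, ax = plt.subplots()
    ax.axis("off")
    points = ax.scatter(x=df["R"], y=df["uid"], s=size_scale * df["M"],
                        marker="o", color="grey")
    ax.invert_xaxis()  # larger R (older purchases) ends up on the left
    ax.axvline(0, ls="--", color="black", zorder=-1)  # the "Now" line at R = 0
    for y in df["uid"].unique():
        ax.axhline(y, color="grey", zorder=-1)  # one horizontal line per user
    ax.set_ymargin(1)
    return fig, ax, points


fig, ax, points = plot_rfm(df)
```

Call plt.show() or fig.savefig(...) afterwards to see the result. User 2 has two purchases with R = 1, so those two circles overlap.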

Related

Set the y-axis to scale in a Seaborn heat map

I currently have a dataframe, df:
In [1]: df
Out [1]:
one  two
1.5  11.22
2    15.36
2.5  11
3.3  12.5
3.5  14.78
5    9
6.2  26.14
I used this code to get a heat map:
In [2]:
plt.figure(figsize=(30, 7))
plt.title('Test')
ax = sns.heatmap(data=df, annot=True,)
plt.xlabel('Test')
ax.invert_yaxis()
value = 6
index = np.abs(df.index - value).argmin()
ax.axhline(index + .5, ls='--')
print(index)
Out [2]:
Instead, I would like the y-axis to scale automatically and plot the df['two'] values at their true positions along the full axis. For example, there should be a clear empty space between 3.5 and 5.0, because there aren't any values there; I want the positions in between to appear on the y-axis with a value of 0 against them.
This can be easily achieved with a bar plot instead:
plt.bar(df['one'], df['two'], color=list('rgb'), width=0.2, alpha=0.4)
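If the values should stay on the y-axis, as in the heat map, a horizontal variant of the same idea might look like the sketch below. It assumes 'one' is an ordinary column (if it is the index, df.index could be passed instead) and rebuilds the sample data by hand. Because the bars are placed at their numeric positions, the gap between 3.5 and 5 stays empty.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Sample data from the question; 'one' is assumed to be a regular column.
df = pd.DataFrame({
    "one": [1.5, 2, 2.5, 3.3, 3.5, 5, 6.2],
    "two": [11.22, 15.36, 11, 12.5, 14.78, 9, 26.14],
})

fig, ax = plt.subplots()
# Bars are centred on the numeric values of 'one', so empty ranges stay empty.
bars = ax.barh(df["one"], df["two"], height=0.2, alpha=0.4)
ax.axhline(6, ls="--")  # reference line at the value used in the question
ax.set_ylabel("one")
ax.set_xlabel("two")
```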

Plotting two Seaborn catplots in one figure

I am trying to plot the following data as a horizontal stacked bar plot. I would like to show Week 1 and Week 2 as bars, with the largest bar ('Total') at the top and the rest in descending order. The actual data has 100 lines, so I arrived at using Seaborn catplots with kind='bar'. I'm not sure whether stacking is possible (as in Matplotlib), so I opted to create two charts and overlay 'Week 1' on top of 'Total' for the same stacked effect.
However, when I run the code below I get two separate plots, and the chart title and axis settings are applied to only one of them. Can I combine these into one stacked horizontal chart? If there is an easier way, I would appreciate hearing about it.
Company           Week 1  Week 2  Total
Stanley Atherton       0       1      1
Dennis Auton           1       1      2
David Bailey           3       8     11
Alan Ball              5       2      7
Philip Barker          3       0      3
Mark Beirne            0       1      1
Phyllis Blitz          3       0      3
Simon Blower           4       2      6
Steven Branton         5       7     12
Rebecca Brown          0       4      4
(Names created from random name generator)
Code:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

data = pd.read_csv('Sample1.csv', delimiter="\t", error_bad_lines=False)
data_rank = data.sort_values(["Attending", "Company"], ascending=[False,True])
sns.set(style="ticks")
g = sns.catplot(y='Company', x='Total', data=data_rank, kind='bar', height=4, color='red', aspect=0.8, ax=ax)
ax2 =ax.twinx()
g = sns.catplot(y='Company', x='Week 1', data=data_rank, kind='bar', height=4, color='blue', aspect=0.8, ax=ax2)
for ax in g.axes[0]:
    ax.xaxis.tick_top()
    ax.xaxis.set_label_position('top')
    ax.spines['bottom'].set_visible(True)
    ax.spines['top'].set_visible(True)
plt.title("Company by week ", size=7)
[Figures: catplot 1, catplot 2]
I think something like this works.
# both barplot calls draw on the same axes, so 'Week 1' overlays 'Total'
g = sns.barplot(y='Company', x='Total', data=data_rank, color='red', label='Total')
g = sns.barplot(y='Company', x='Week 1', data=data_rank, color='blue', label='Week 1')
plt.title("Company by week ", size=12)
plt.xlabel('Frequency')
plt.legend()
plt.show()
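As an alternative to overlaying two seaborn bar plots, pandas' own plotting can stack the bars directly. The sketch below is not the answer's method; it rebuilds the sample table by hand (instead of reading Sample1.csv) and sorts by Total so the largest bar ends up at the top.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Sample table from the question, rebuilt by hand for illustration.
data = pd.DataFrame({
    "Company": ["Stanley Atherton", "Dennis Auton", "David Bailey", "Alan Ball",
                "Philip Barker", "Mark Beirne", "Phyllis Blitz", "Simon Blower",
                "Steven Branton", "Rebecca Brown"],
    "Week 1": [0, 1, 3, 5, 3, 0, 3, 4, 5, 0],
    "Week 2": [1, 1, 8, 2, 0, 1, 0, 2, 7, 4],
})
data["Total"] = data["Week 1"] + data["Week 2"]

# barh draws the first row at the bottom, so sort ascending to put the
# largest total at the top of the chart.
ranked = data.sort_values("Total", ascending=True).set_index("Company")

ax = ranked[["Week 1", "Week 2"]].plot.barh(stacked=True, color=["blue", "red"])
ax.set_xlabel("Frequency")
ax.set_title("Company by week")
```

Each bar's full length equals the Total column, with the Week 1 and Week 2 segments shown in different colours.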

Generating an histogram with Matplotlib using a dataframe for x and y

I'm generating a simple line chart with Matplotlib, here is my code:
fig = plt.figure(facecolor='#131722',dpi=155, figsize=(8, 4))
ax1 = plt.subplot2grid((1,2), (0,0), facecolor='#131722')
for x in OrderedList:
    rate_buy = []
    total_buy = []
    for y in x['data']['bids']:
        rate_buy.append(y[0])
        total_buy.append(y[1])
    rBuys = pd.DataFrame({'buy': rate_buy})
    tBuys = pd.DataFrame({'total': total_buy})
    ax1.plot(rBuys.buy, tBuys.total, color='#0400ff', linewidth=0.5, alpha=1)
    ax1.fill_between(rBuys.buy, 0, tBuys.total, facecolor='#0400ff', alpha=1)
Which gives me the following output:
And here is the data I used in the dataframe:
buy
0 9611
1 9610
2 9609
3 9608
4 9607
5 9606
6 9605
7 9604
8 9603
9 9602
10 9601
11 9600
12 9599
total
0 3.033661
1 3.295753
2 3.599813
3 22.305765
4 22.987476
5 30.975145
6 39.492845
7 42.828580
8 46.677708
9 49.533740
10 50.925840
11 61.396243
12 61.921523
I want to get the same output as in the image, but as a histogram or something similar, where the height of each column on the y axis is taken from the total dataframe and its x axis position is taken from the buy dataframe. So the first element will have position x=9611 and y=3.033661.
Is it possible to do that with Matplotlib? I tried to use hist, but it doesn't allow me to set both the x and the y axis.
Pandas uses matplotlib as well, and the API is very easy once you have the dataframe.
Here is an example.
d = {
    'buy': [9611, 9610, 9609, 9608, 9607, 9606, 9605,
            9604, 9603, 9602, 9601, 9600, 9599],
    'total': [3.033661, 3.295753, 3.599813, 22.305765, 22.987476,
              30.975145, 39.492845, 42.828580, 46.677708, 49.533740,
              50.925840, 61.396243, 61.921523],
}
df = pd.DataFrame(d)
df = df.sort_values(by=['buy']) #remember to sort your x values!
df.plot(kind='bar', x='buy', y='total', width=1)
plt.show()
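A pandas bar plot treats the x column as categories, which is why the rows have to be sorted first. If the bars should instead sit at their numeric x positions, matplotlib's own bar() can be called directly. The following is a minimal sketch of that alternative, using the data from the question; it is not part of the original answer.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "buy": list(range(9611, 9598, -1)),  # 9611 down to 9599
    "total": [3.033661, 3.295753, 3.599813, 22.305765, 22.987476,
              30.975145, 39.492845, 42.828580, 46.677708, 49.533740,
              50.925840, 61.396243, 61.921523],
})

fig, ax = plt.subplots()
# Each bar is centred on its 'buy' value, so no sorting is needed;
# width=1.0 makes neighbouring bars touch, like a histogram.
bars = ax.bar(df["buy"], df["total"], width=1.0, color="#0400ff")
ax.set_xlabel("buy")
ax.set_ylabel("total")
```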

scatter plot with multiple X features and single Y in Python

Data in form:
        x1    x2
data =  2104, 3
        1600, 3
        2400, 3
        1416, 2
        3000, 4
        1985, 4
y =     399900
        329900
        369000
        232000
        539900
        299900
I want to draw a scatter plot that has two X features (x1 and x2) and a single Y,
but when I try
y=data.loc[:'y']
px=data.loc[:,['x1','x2']]
plt.scatter(px,y)
I get:
'ValueError: x and y must be the same size'.
So I tried this:
data=pd.read_csv('ex1data2.txt',names=['x1','x2','y'])
px=data.loc[:,['x1','x2']]
x1=px['x1']
x2=px['x2']
y=data.loc[:'y']
plt.scatter(x1,x2,y)
This time I got a blank graph filled entirely with blue.
I would be grateful for some guidance.
You can only plot with one x and several y's. You could plot the different x's in a twiny axis:
fig, ax = plt.subplots()
ay = ax.twiny()
ax.scatter(df['x1'], df['y'])
ay.scatter(df['x2'], df['y'], color='r')
plt.show()
Output:
You can check the pandas functions for plotting dataframe content; they are very powerful.
But if you want to use matplotlib, you can check the documentation (https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.scatter.html), which says that x and y must be array-like and of the same size. You are instead passing a two-column DataFrame as x.
So working code looks like this:
data = pd.read_csv("test.txt", header=None)
data
      0  1       2
0  2104  3  399900
1  1600  3  329900
2  2400  3  369000
3  1416  2  232000
4  3000  4  539900
5  1985  4  299900
data.columns = ["x1", "x2", "y"]
data
     x1  x2       y
0  2104   3  399900
1  1600   3  329900
2  2400   3  369000
3  1416   2  232000
4  3000   4  539900
5  1985   4  299900
# If you call scatter many times and then plt.show() a single image is created
plt.scatter(data["x1"], data["y"])
plt.scatter(data["x2"], data["y"])
plt.show()
Note that if you want to have data in an array format you can do data["x1"].values and it will return an ndarray.
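Two details in the question's attempts are worth spelling out. First, data.loc[:'y'] is a row slice, not a column selection; the column is data.loc[:, 'y'] or simply data['y']. Second, the third positional argument of plt.scatter is s, the marker size, so plt.scatter(x1, x2, y) uses the y values (in the hundreds of thousands) as marker areas, which would explain the plot being filled with blue. The sketch below, with the data typed in by hand, selects the column correctly and draws each feature against y in its own panel; the side-by-side layout is just one possible choice.

```python
import pandas as pd
import matplotlib.pyplot as plt

data = pd.DataFrame({
    "x1": [2104, 1600, 2400, 1416, 3000, 1985],
    "x2": [3, 3, 3, 2, 4, 4],
    "y": [399900, 329900, 369000, 232000, 539900, 299900],
})

y = data.loc[:, "y"]  # column selection: note the ':,' before the label

# One panel per feature, sharing the y-axis, since x1 and x2 differ in scale.
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
ax1.scatter(data["x1"], y)
ax1.set_xlabel("x1")
ax1.set_ylabel("y")
ax2.scatter(data["x2"], y, color="r")
ax2.set_xlabel("x2")
```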
You could use seaborn with a melted dataframe. seaborn.scatterplot has a hue argument, which lets you include multiple data series.
import seaborn as sns
ax = sns.scatterplot(x='value', hue='series', y='y',
                     data=data.melt(value_vars=['x1', 'x2'],
                                    id_vars='y',
                                    var_name='series'))
However, if your x values differ that much in scale, you might want to use twin axes, as in @Quang Hoang's answer.

Python Plotting: Heatmap from dataframe with fixed colors in case of strings

I'm trying to visualise a large (pandas) dataframe in Python as a heatmap. This dataframe has two types of variables: strings ("Absent" or "Unknown") and floats.
I want the heatmap to show cells with "Absent" in black and "Unknown" in red, and the rest of the dataframe as a normal heatmap, with the floats in a scale of greens.
I can do this easily in Excel with conditional formatting of cells, but I can't find any help online to do this with Python either with matplotlib, seaborn, ggplot. What am I missing?
Thank you for your time.
You could use cmap_custom.set_under('red') and cmap_custom.set_over('black') to apply custom colors to values below vmin and above vmax, respectively:
import numpy as np
import matplotlib.pyplot as plt
import mpl_toolkits.axes_grid1 as axes_grid1
import pandas as pd
# make a random DataFrame
np.random.seed(1)
arr = np.random.choice(['Absent', 'Unknown']+list(range(10)), size=(5,7))
df = pd.DataFrame(arr)
# find the largest and smallest finite values
finite_values = pd.to_numeric(list(set(np.unique(df.values))
                                   .difference(['Absent', 'Unknown'])))
vmin, vmax = finite_values.min(), finite_values.max()
# change Absent and Unknown to numeric values
df2 = df.replace({'Absent': vmax+1, 'Unknown': vmin-1})
# make sure the values are numeric
for col in df2:
    df2[col] = pd.to_numeric(df2[col])
fig, ax = plt.subplots()
cmap_custom = plt.get_cmap('Greens')
cmap_custom.set_under('red')
cmap_custom.set_over('black')
im = plt.imshow(df2, interpolation='nearest', cmap = cmap_custom,
                vmin=vmin, vmax=vmax)
# add a colorbar (https://stackoverflow.com/a/18195921/190597)
divider = axes_grid1.make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=0.05)
plt.colorbar(im, cax=cax, extend='both')
plt.show()
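One caveat about the code above: depending on the matplotlib version, plt.get_cmap may hand back the shared, registered colormap, and calling set_under/set_over on it would then affect other plots too. A cautious variant, sketched below, works on a copy instead. The small values array is made up purely for illustration, with -1 standing in for "Unknown" and 10 for "Absent".

```python
import copy

import numpy as np
import matplotlib.pyplot as plt

# Work on a copy so the registered 'Greens' colormap is left untouched.
cmap_custom = copy.copy(plt.get_cmap("Greens"))
cmap_custom.set_under("red")   # values below vmin ("Unknown")
cmap_custom.set_over("black")  # values above vmax ("Absent")

# Hypothetical data: -1 stands in for "Unknown", 10 for "Absent".
values = np.array([[3.0, 9.0, -1.0],
                   [10.0, 0.0, 5.0]])
vmin, vmax = 0, 9

fig, ax = plt.subplots()
im = ax.imshow(values, cmap=cmap_custom, vmin=vmin, vmax=vmax,
               interpolation="nearest")
fig.colorbar(im, ax=ax, extend="both")
```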
The DataFrame
In [117]: df
Out[117]:
        0        1  2        3  4        5       6
0       3        9  6        7  9        3  Absent
1  Absent  Unknown  5        4  7        0       2
2       3        0  2        9  8        0       2
3       5        5  7  Unknown  5   Absent       4
4       7        7  5        4  7  Unknown  Absent
becomes
