plot.bar(), duplicate values are removed in the x-axis?


As an example, I have the following DataFrame df. I want to plot Price on the x-axis and share_1 and share_2 on the y-axis as stacked bars. I want to avoid pandas.plot and use plt.bar instead, extracting the x and y values from the DataFrame.
Price  size  share_1  share_2
10     1     0.05     0.95
10     2     0.07     0.93
10     3     0.1      0.95
20     4     0.15     0.75
20     5     0.2      0.8
20     6     0.35     0.65
30     7     0.5      0.5
30     8     0.53     0.47
30     9     0.6      0.4
This is the way I proceed:
x= df['Price']
y1= df['share_1']
y2= df['share_2']
plt.bar(x,y1,label='share_1')
plt.bar(x,y2,label='share_2')
My problem is that matplotlib seems to remove the duplicate values on the x-axis, or perhaps it automatically computes the mean for the duplicated values, so I get 3 values on the x-axis and not the 6 I am aiming for. I don't know the reason.
My questions are:
Is it possible to extract the x and y values as I did, or should I convert them to another form, such as strings or a list?
How can I keep the duplicate values on the x-axis? I want exactly as many x values as there are rows in the DataFrame.

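Bars that share an x value are drawn on top of each other, so place the bars at their row positions and use the prices only as tick labels. Try:
import matplotlib.pyplot as plt
# x, y1 and y2 are the Series extracted from df in the question
positions = list(range(len(x)))  # one slot per row, so duplicate prices stay separate
labels = x.astype(str)  # the prices, shown as tick labels
fig, ax = plt.subplots()
ax.bar(positions, y1, label="share_1")
ax.bar(positions, y2, label="share_2", bottom=y1)  # stack share_2 on top of share_1
ax.set_xticks(positions)
ax.set_xticklabels(labels)
ax.legend()
plt.show()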
As an aside, consider using pandas.plot as follows:
fig,ax = plt.subplots()
df.plot.bar(x="Price", y=["share_1","share_2"], stacked=True, ax=ax)
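To see why only three bars show up, it can help to check what matplotlib actually draws. The sketch below rebuilds the question's data (reading the stray dots in values such as 0.2. as typos, which is an assumption) and suggests that nothing is removed or averaged: all nine bars are drawn, but rows sharing a price land on the same x position and cover each other. Plotting against row positions keeps them apart. The names ax_dup, ax_pos, overlapping and separate are illustrative only.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Price": [10, 10, 10, 20, 20, 20, 30, 30, 30],
    "share_1": [0.05, 0.07, 0.1, 0.15, 0.2, 0.35, 0.5, 0.53, 0.6],
    "share_2": [0.95, 0.93, 0.95, 0.75, 0.8, 0.65, 0.5, 0.47, 0.4],
})

fig, (ax_dup, ax_pos) = plt.subplots(1, 2, figsize=(10, 4))

# Duplicate x values: nine rectangles, but only three distinct x positions.
overlapping = ax_dup.bar(df["Price"], df["share_1"])
distinct_x = {bar.get_x() for bar in overlapping}

# Row positions: one slot per row, prices shown only as tick labels.
positions = list(range(len(df)))
separate = ax_pos.bar(positions, df["share_1"], label="share_1")
ax_pos.bar(positions, df["share_2"], bottom=df["share_1"], label="share_2")
ax_pos.set_xticks(positions)
ax_pos.set_xticklabels(df["Price"].tolist())
ax_pos.legend()

print(len(overlapping), len(distinct_x), len(separate))
```

In an interactive session, drop the matplotlib.use("Agg") line and call plt.show() to compare the two panels side by side.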

Related

Set the y-axis to scale in a Seaborn heat map

I currently have a dataframe, df:
In [1]: df
Out [1]:
one two
1.5 11.22
2 15.36
2.5 11
3.3 12.5
3.5 14.78
5 9
6.2 26.14
I used this code to get a heat map:
In [2]:
plt.figure(figsize=(30, 7))
plt.title('Test')
ax = sns.heatmap(data=df, annot=True,)
plt.xlabel('Test')
ax.invert_yaxis()
value = 6
index = np.abs(df.index - value).argmin()
ax.axhline(index + .5, ls='--')
print(index)
Out [2]:
Instead, I want the y-axis to scale automatically and plot the df[2] values at their respective positions along the full axis. For example, there should be a clear empty space between 3.5 and 5.0 because there are no values there. I want the in-between values to appear on the y-axis with a value of 0 against them.

This can be easily achieved with a bar plot instead:
plt.bar(df['one'], df['two'], color=list('rgb'), width=0.2, alpha=0.4)
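If the values of one should stay on the y-axis as in the heat map, a horizontal bar plot is one way to keep that orientation. This is a minimal sketch, assuming one and two are ordinary columns as in the answer above (the question may hold one in the index instead). The variable nearest is introduced here for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "one": [1.5, 2, 2.5, 3.3, 3.5, 5, 6.2],
    "two": [11.22, 15.36, 11, 12.5, 14.78, 9, 26.14],
})

fig, ax = plt.subplots(figsize=(8, 5))
# Each bar sits at its true 'one' value on a continuous y-axis,
# so the stretch between 3.5 and 5 stays visibly empty.
bars = ax.barh(df["one"], df["two"], height=0.2, alpha=0.4)

# Mark the row whose 'one' value is closest to 6, as the question does.
value = 6
nearest = df["one"].iloc[np.abs(df["one"] - value).argmin()]
ax.axhline(nearest, ls="--")
ax.set_xlabel("two")
ax.set_ylabel("one")
```

Because the axis is numeric rather than categorical, no zero-valued filler rows are needed to show the gaps.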

Pandas - plotting user RFM

Given the following DF of user RFM activity:
uid R F M
0 1 10 1 5
1 1 2 2 10
2 1 4 3 1
3 1 5 4 10
4 2 10 1 3
5 2 1 2 10
6 2 1 3 4
Recency: The time between the last purchase and today, represented by
the distance between the rightmost circle and the vertical dotted line
that's labeled Now.
Frequency: The time between purchases, represented by the distance
between the circles on a single line.
Monetary: The amount of money spent on each purchase, represented by
the size of the circle. This amount could be the average order value
or the quantity of products that the customer ordered.
I would like to plot something like the figure below:
Where the size of the circle is the M value and the distance is the R. Any help would be appreciated.
Update
As suggested by Diziet Asahi I've tried the following:
import matplotlib.pyplot as plt
import pandas as pd
def plot_users(df):
    fig, ax = plt.subplots()
    ax.axis('off')
    ax.scatter(x=df['M'],y=df['uid'],s=30*df['R'], marker='o', color='grey')
    ax.invert_xaxis()
    ax.axvline(0, ls='--', color='black', zorder=-1)
    for y in df['uid'].unique():
        ax.axhline(y, color='grey', zorder=-1)
tmp = pd.DataFrame({'uid':[1,1,1,1,2,2,2],'R':[10,2,4,5,10,1,1],'F':[1,2,3,4,1,3,4],'M':[5,10,1,10,3,10,4]})
plot_users(tmp)
And I get the following:
So I think there is a bug, since the first user has 4 records and the sizes also don't match.

You can use matplotlib's scatter() with the s= argument to draw markers whose area is proportional to the value in M. The rest is just tweaking the appearance of the plot.
import matplotlib.pyplot as plt
c = 'xkcd:dark grey'
fig, ax = plt.subplots()
ax.axis('off')
ax.scatter(x=df['R'],y=df['uid'],s=60*df['M'], marker='o', color=c)
ax.invert_xaxis()
ax.axvline(0, ls='--', color=c, zorder=-1)
for y in df['uid'].unique():
    ax.axhline(y, color=c, zorder=-1)
ax.set_ymargin(1)
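For reference, here is a self-contained sketch of the same idea using the table from the question. It assumes, as the question describes, that R goes on the x-axis and M drives the marker size; the attempt in the update passes them the other way round, which would explain the mismatched sizes. Note also that user 2 has two records with R = 1, so those two markers are drawn on top of each other.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "uid": [1, 1, 1, 1, 2, 2, 2],
    "R": [10, 2, 4, 5, 10, 1, 1],
    "F": [1, 2, 3, 4, 1, 2, 3],
    "M": [5, 10, 1, 10, 3, 10, 4],
})

fig, ax = plt.subplots()
ax.axis("off")
# x is recency (R); marker area scales with monetary value (M).
points = ax.scatter(x=df["R"], y=df["uid"], s=60 * df["M"], color="grey")
ax.invert_xaxis()  # larger recency further left, "Now" (R = 0) at the right
ax.axvline(0, ls="--", color="black", zorder=-1)
for uid in df["uid"].unique():
    ax.axhline(uid, color="grey", zorder=-1)  # one horizontal line per user
```

The factor 60 is only a visual scale; s is the marker area in points squared, so any constant that gives readable circles will do.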

Finding intersection of Pandas dataframes within range

A project I'm working on requires merging two dataframes together along some line with a delta. Basically, I need to take a dataframe with a non-linear 2D line and find the data points within the other that fall along that line, plus or minus a delta.
Dataframe 1 (Line that we want to find points along)
import pandas as pd
df1 = pd.read_csv('path/to/df1/data.csv')
df1
x y
0 0.23 0.54
1 0.27 0.95
2 0.78 1.59
...
97 0.12 2.66
98 1.74 0.43
99 0.93 4.23
Dataframe 2 (Dataframe we want to filter, leaving points within some delta)
df2 = pd.read_csv('path/to/df2/data.csv')
df2
x y
0 0.21 0.51
1 0.27 0.35
2 3.45 1.19
...
971 0.94 2.60
982 1.01 1.33
993 0.43 2.43
Finding the coarse line
DELTA = 0.03
coarse_line = find_coarse_line(df1, df2, DELTA)
coarse_line
x y
0 0.21 0.51
1 0.09 2.68
2 0.23 0.49
...
345 1.71 0.45
346 0.96 0.40
347 0.81 1.62
I've tried using df.loc[(df['x'] >= BOTLEFT_X) & (df['y'] >= BOTLEFT_Y) & (df['x'] <= TOPRIGHT_X) & (df['y'] <= TOPRIGHT_Y)], among many other pandas functions, but have yet to find anything that works, much less anything efficient (the datasets have more than 2 million points).

I took an approach based on merge(), where x and y are placed into bins derived from the good curve, df1:
Generated a uniform line, y = x^2.
Randomised it by a small amount to generate df1.
Randomised it by a large amount to generate df2, with three times as many coordinates.
Took df1 as the reference for good ranges of x and y coordinates and split them into bins using pd.cut(). A bin count of 1/3 of the total number of coordinates works well.
Standardised the bin edges back into arrays so they can be reused in pd.cut() when merging.
The scatter plots show that this does a reasonable job of finding and keeping the points in df2 that lie close to the curve.
import pandas as pd
import random
import numpy as np
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,3, sharey=True, sharex=False, figsize=[20,5])
linex = [i for i in range(100)]
liney = [i**2 for i in linex]
df1 = pd.DataFrame({"x":[l*random.uniform(0.95, 1.05) for l in linex],
                    "y":[l*random.uniform(0.95, 1.05) for l in liney]})
df1.plot("x","y", kind="scatter", ax=ax[0])
df2 = pd.DataFrame({"x":[l*random.uniform(0.5, 1.5) for l in linex*3],
                    "y":[l*random.uniform(0.5, 1.5) for l in liney*3]})
df2.plot("x","y", kind="scatter", ax=ax[1])
# use bins on x and y axis - both need to be within range to find
bincount = len(df1)//3
xc = pd.cut(df1["x"], bincount).unique()
yc = pd.cut(df1["y"], bincount).unique()
xc = np.sort([intv.left for intv in xc] + [xc[-1].right])
yc = np.sort([intv.left for intv in yc] + [yc[-1].right])
dfm = (
    df2.assign(
        xb=pd.cut(df2["x"], xc, duplicates="drop"),
        yb=pd.cut(df2["y"], yc, duplicates="drop"),
    )
    .query("~(xb.isna() | yb.isna())")  # exclude rows where df2 falls outside the range of df1
    .merge(
        df1.assign(
            xb=pd.cut(df1["x"], xc, duplicates="drop"),
            yb=pd.cut(df1["y"], yc, duplicates="drop"),
        ),
        on=["xb", "yb"],
        how="inner",
        suffixes=("_l", "_r"),
    )
)
dfm.plot("x_l", "y_l", kind="scatter", ax=ax[2])
print(f"graph 2 pairs:{len(df2)} graph 3 pairs:{len(dfm)}")
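The bin-and-merge idea above can also be sketched in a more compact form with fixed-width grid cells instead of pd.cut(). This is a simplified variant under stated assumptions, not the code from the answer: the helper with_bins, the cell width DELTA = 0.05 and the tiny sample frames are all made up for illustration.

```python
import numpy as np
import pandas as pd

DELTA = 0.05  # illustrative cell width, not the value from the question

df1 = pd.DataFrame({"x": [0.23, 0.78], "y": [0.56, 1.59]})  # reference line
df2 = pd.DataFrame({"x": [0.21, 0.27, 3.45, 0.76],
                    "y": [0.58, 0.35, 1.19, 1.57]})  # points to filter


def with_bins(df, width):
    """Add integer grid-cell ids for x and y (hypothetical helper)."""
    return df.assign(xb=np.floor(df["x"] / width).astype(int),
                     yb=np.floor(df["y"] / width).astype(int))


# Cells occupied by the reference line; dropping duplicates keeps the
# merge from repeating a df2 row once per df1 point in the same cell.
cells = with_bins(df1, DELTA)[["xb", "yb"]].drop_duplicates()

# Keep the df2 rows whose cell also contains a df1 point.
near = with_bins(df2, DELTA).merge(cells, on=["xb", "yb"], how="inner")[["x", "y"]]
print(near)
```

One limitation of any grid approach is that two points closer than DELTA can still fall on opposite sides of a cell boundary and be missed. Also matching against the neighbouring cells is one way to reduce that, at the cost of a larger merge.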

Matplotlib DataFrame boxplot with given max, min and quartiles

I want to plot a box plot with my DataFrame:
A B C
max 10 11 14
min 3 4 10
q1 5 6 12
q3 9 7 13
How can I plot a box plot with these fixed values?

You can use the Axes.bxp method in matplotlib, based on this helpful answer. The input is a list of dictionaries containing the relevant values, but the median is a required key in these dictionaries. Since the data you provided does not include medians, I have made up medians in the code below (but you will need to calculate them from your actual data).
import matplotlib.pyplot as plt
import pandas as pd
# reproducing your data
df = pd.DataFrame({'A':[10,3,5,9],'B':[11,4,6,7],'C':[14,10,12,13]})
# add a row for median, you need median values!
sample_medians = {'A':7, 'B':6.5, 'C':12.5}
df = pd.concat([df, pd.DataFrame([sample_medians])], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
df.index = ['max','min','q1','q3','med']
Here is the modified df with medians included:
>>> df
A B C
max 10.0 11.0 14.0
min 3.0 4.0 10.0
q1 5.0 6.0 12.0
q3 9.0 7.0 13.0
med 7.0 6.5 12.5
Now we transform the df into a list of dictionaries:
labels = list(df.columns)
# create dictionaries for each column as items of a list
bxp_stats = df.apply(lambda x: {'med':x.med, 'q1':x.q1, 'q3':x.q3, 'whislo':x['min'], 'whishi':x['max']}, axis=0).tolist()
# add the column names as labels to each dictionary entry
for index, item in enumerate(bxp_stats):
    item.update({'label':labels[index]})
_, ax = plt.subplots()
ax.bxp(bxp_stats, showfliers=False);
plt.show()
Unfortunately, the median is required, so it must be specified for every box. To hide it, we make the median line as thin as possible so that it is virtually invisible.
If you want each box to be drawn with different specifications, they will have to be in different subplots. I understand if this looks kind of ugly, so you can play around with the spacing between subplots or consider removing some of the y-axes.
fig, axes = plt.subplots(nrows=1, ncols=3, sharey=True)
# specify list of background colors, median line colors same as background with as thin of a width as possible
colors = ['LightCoral', '#FEF1B5', '#EEAEEE']
medianprops = [dict(linewidth = 0.1, color='LightCoral'), dict(linewidth = 0.1, color='#FEF1B5'), dict(linewidth = 0.1, color='#EEAEEE')]
# create a list of boxplots of length 3
bplots = [axes[i].bxp([bxp_stats[i]], medianprops=medianprops[i], patch_artist=True, showfliers=False) for i in range(len(df.columns))]
# set each boxplot a different color
for i, bplot in enumerate(bplots):
    for patch in bplot['boxes']:
        patch.set_facecolor(colors[i])
plt.show()
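As a possible alternative to separate subplots, the artists that bxp returns can be styled one by one after drawing, which keeps all boxes on a single axes. The sketch below assumes that styling after the fact is acceptable and reuses the made-up medians from above; bxp_stats is written out by hand here so the block runs on its own.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.colors import to_rgba

# Stats from the question's table; the medians are made up, as noted above.
bxp_stats = [
    {"label": "A", "whislo": 3, "q1": 5, "med": 7, "q3": 9, "whishi": 10},
    {"label": "B", "whislo": 4, "q1": 6, "med": 6.5, "q3": 7, "whishi": 11},
    {"label": "C", "whislo": 10, "q1": 12, "med": 12.5, "q3": 13, "whishi": 14},
]
colors = ["lightcoral", "#FEF1B5", "#EEAEEE"]

fig, ax = plt.subplots()
result = ax.bxp(bxp_stats, patch_artist=True, showfliers=False)

# bxp returns a dict of artist lists; style each box and its median line.
for box, median, color in zip(result["boxes"], result["medians"], colors):
    box.set_facecolor(color)
    median.set_color(color)  # same color as the box, so the median blends in
```

Matching the median color to the box fill hides the line where it crosses the box, which serves the same purpose as the very thin line used above.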

Is there a way where I can plot data from a CSV file where every 10 data points in a column are a different line in the same graph?

Say I have a data set that is in two columns. I want to plot a line plot iterating through every 10. So, I would take the first 10, and then the second 10, which is right under the first 10, for another line plot on the same graph (different color line). The data is stacked on each other in a CSV file with no header.
Currently, I have it taking in the entire column. It plots them, however there is no differentiation as to which data set it is. I want to plot multiple lines on the same graph but the CSV file has all the data sets in one column, but I need to graph every 10.
EDIT
Below I have added sample data. I would like the first column to be the x-axis and the second to be the y-axis.
Sample Data:
0 8.2
1 9.1
2 2.2
3 3.3
4 9.8
5 6.3
6 4.8
7 8.6
8 3.9
9 2.1
0 9.34
1 10.2
2 7.22
3 6.98
4 1.34
5 2.56
6 6.78
7 4.56
8 3.3
9 9.4

OK, try this:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# this is the toy data
df = pd.DataFrame({0:list(range(10))*2,
                   1:np.random.uniform(9,11,20)})
# set up axes for plots
fig, ax = plt.subplots(1,1)
# the groupby argument groups every 10 rows together
# then pass it to the `lambda` function,
# which plots each chunk to the given plt axis
df.groupby(df.reset_index().index//10).apply(lambda x: ax.plot(x[0], x[1]) )
plt.show()
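The grouping approach above can be sketched end to end with the sample data from the question. This is a minimal sketch with several assumptions: the file is comma separated, the in-memory csv_text stands in for the real file, and the column names x and y and the labels "set 1", "set 2" are made up for illustration.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in for the headerless CSV file; comma separation is an assumption.
csv_text = "\n".join(
    f"{x},{y}" for x, y in zip(
        list(range(10)) * 2,
        [8.2, 9.1, 2.2, 3.3, 9.8, 6.3, 4.8, 8.6, 3.9, 2.1,
         9.34, 10.2, 7.22, 6.98, 1.34, 2.56, 6.78, 4.56, 3.3, 9.4],
    )
)
df = pd.read_csv(io.StringIO(csv_text), header=None, names=["x", "y"])

fig, ax = plt.subplots()
chunk_id = np.arange(len(df)) // 10  # 0 for rows 0-9, 1 for rows 10-19, ...
for chunk, part in df.groupby(chunk_id):
    # Each call adds one line; matplotlib cycles the colors automatically.
    ax.plot(part["x"], part["y"], label=f"set {chunk + 1}")
ax.legend()
```

An explicit loop over the groups makes it easy to attach a label to each line, which the apply-based version does not do.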
Option 2:
I found seaborn (imported as sns) to be a better tool for this purpose:
import seaborn as sns
fig, ax = plt.subplots(1,1, figsize=(10,6))
sns.lineplot(x=df[0],
             y=df[1],
             hue=df.reset_index().index//10,
             data=df,
             palette='Set1')
plt.show()
