I have a count table as dataframe in Python and I want to plot my distribution as a boxplot. E.g.:
df = pandas.DataFrame({'Quality': [29,30,31,32,33,34,35,36,37,38,39,40], 'Count': [3,38,512,2646,9523,23151,43140,69250,107597,179374,840596,38243]})
I 'solved' it by repeating each quality value by its count, but I don't think that's a good approach, and my dataframe is getting very big.
In R it's a one-liner:
ggplot(df, aes(x=1,y=Quality,weight=Count)) + geom_boxplot()
This will output: [image: boxplot from R]
My aim is to compare the distributions of different groups, and it should look like this: [image]
Can Python solve it like this too?
What are you trying to look at here? The boxplot call below groups Quality by Count, so it draws one box per Count value.
import matplotlib.pyplot as plt
import pandas as pd
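# %matplotlib inline  # uncomment when running in a Jupyter notebook
# DataFrame.from_items was removed in pandas 1.0; a dict keeps the same column order
df = pd.DataFrame({'Quality': [29,30,31,32,33,34,35,36,37,38,39,40], 'Count': [3,38,512,2646,9523,23151,43140,69250,107597,179374,840596,38243]})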
plt.figure()
df_box = df.boxplot(column='Quality', by='Count',return_type='axes')
If you want to look at your Quality distribution weighted by Count, you can try plotting a histogram:
plt.figure()
# `normed` was removed from plt.hist in matplotlib 3.x; `density` is its replacement
df_hist = plt.hist(df.Quality, bins=10, range=None, density=False, weights=df.Count)
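If a box plot is still what you want, one way to avoid repeating every value is to compute the box statistics from the counts yourself and pass them to matplotlib's Axes.bxp, which draws boxes from precomputed statistics. This is only a minimal sketch: the helper weighted_quantile is hypothetical and uses a simple "first value whose cumulative weight share reaches q" definition, so the numbers may differ slightly from R's interpolated quantiles. The whiskers are assumed to extend to the most extreme observed values within 1.5 IQR of the box, and no fliers are drawn.

```python
import numpy as np
import matplotlib.pyplot as plt

quality = np.array([29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40])
count = np.array([3, 38, 512, 2646, 9523, 23151, 43140, 69250,
                  107597, 179374, 840596, 38243])

def weighted_quantile(values, weights, q):
    """Hypothetical helper: smallest value whose cumulative weight share reaches q."""
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cum_share = np.cumsum(weights) / weights.sum()
    return values[np.searchsorted(cum_share, q)]

q1, med, q3 = (weighted_quantile(quality, count, q) for q in (0.25, 0.5, 0.75))
iqr = q3 - q1

# Whiskers: most extreme observed values (count > 0) within 1.5 * IQR of the box.
present = quality[count > 0]
inside = present[(present >= q1 - 1.5 * iqr) & (present <= q3 + 1.5 * iqr)]

stats = [{"label": "Quality", "q1": q1, "med": med, "q3": q3,
          "whislo": inside.min(), "whishi": inside.max(), "fliers": []}]

fig, ax = plt.subplots()
ax.bxp(stats, showfliers=False)  # one dict per box; add more dicts to compare groups
ax.set_ylabel("Quality")
```

To compare several groups, build one such dict per group and pass the whole list to ax.bxp, then call plt.show() or fig.savefig(...) as usual.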
I would like to plot from the seaborn dataset 'tips'.
import seaborn as sns
import pandas as pd
tips = sns.load_dataset("tips")
x1 = tips.loc[(df['time']=='lunch'), 'tip']
x2 = tips.loc[(df['time']=='dinner'),'tip']
x1.plot.kde(color='orange')
x2.plot.kde(color='blue')
plt.show()
I don't know exactly where it's wrong...
Thanks for the help.
Seaborn's sns.kdeplot() supports the hue argument to split the plot between different categories:
import seaborn as sns
import pandas as pd
tips = sns.load_dataset("tips")
sns.kdeplot(data=tips, x='tip', hue='time')
Of course your approach could work too, but there are several problems with your code:
What is df? Shouldn't that be tips?
The category names Lunch and Dinner must be capitalized, as in the data.
You're mixing different indexing techniques. It should be e.g. x1 = tips.tip[tips['time'] == 'Lunch'].
If you want to plot two KDE in the same diagram, they should be scaled according to sample size. With my approach above, seaborn has done that automatically.
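The first three points can be sketched on a small made-up frame standing in for tips. The values below are invented for illustration; only the column names time and tip and the capitalized labels match the seaborn dataset.

```python
import pandas as pd

# Hypothetical stand-in for the seaborn "tips" data; the values are invented.
tips = pd.DataFrame({
    "time": ["Lunch", "Dinner", "Dinner", "Lunch", "Dinner"],
    "tip": [1.5, 3.0, 2.5, 2.0, 4.0],
})

# One consistent indexing style: a boolean mask built from tips itself,
# plus the column label, both inside .loc, with capitalized category names.
x1 = tips.loc[tips["time"] == "Lunch", "tip"]
x2 = tips.loc[tips["time"] == "Dinner", "tip"]

# A lowercase label matches no rows at all.
empty = tips.loc[tips["time"] == "lunch", "tip"]

print(len(x1), len(x2), len(empty))
```

With the selections fixed, x1.plot.kde(color='orange') and x2.plot.kde(color='blue') from the question can be called on the two Series (pandas needs SciPy installed for KDE plots).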
Since you are loading one of seaborn's built-in datasets, check the column names and category values: they are case sensitive, so replace them with the correctly capitalized names.
You can also plot the cumulative distribution for each time category as follows:
sns.kdeplot(
    data=tips, x="total_bill", hue="time",
    cumulative=True, common_norm=False, common_grid=True,
)
I am using pandas scatter_matrix (I couldn't get PairGrid in seaborn to work) to plot all combinations of a set of columns in a pandas DataFrame. Each column has 1000 data points and there are nine columns.
I am using the following code:
pandas.plotting.scatter_matrix(df, alpha=0.2, figsize=(8,8))
I get the figure shown below:
This is nice. However, you'll notice that the plots are mirrored across the main diagonal. Is it possible to plot only the lower portion, as in the following mock-up I made in Paint?
This is probably not the cleanest way to do it, but it works:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
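# `iris` is assumed to be an already loaded DataFrame, e.g. sns.load_dataset("iris")
axes = pd.plotting.scatter_matrix(iris, alpha=0.2, figsize=(8,8))
# Hide every panel above the main diagonal
for i in range(np.shape(axes)[0]):
    for j in range(np.shape(axes)[1]):
        if i < j:
            axes[i,j].set_visible(False)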
I'm trying to get the count of each category in a category column, and then the sum of the values for each category.
In seaborn the count is simple with a countplot, but I'd like to do it in matplotlib directly.
Is this a case where I simply have to create a new dataframe with a column for each category?
I've included an image with a sample dataset and a sample of what I'm trying to accomplish.
I'd appreciate any advice on the technique for achieving this.
Use this:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
data.groupby(['Category']).sum().plot(ax=ax, kind='bar')
plt.title('Sum of Categories')
fig2, ax2 = plt.subplots()
data['Category'].value_counts().plot(ax=ax2,kind='bar')
plt.title('Count of Categories')
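To see what these two expressions feed into the bar charts, here is a minimal sketch on invented data. The column name Value and all the numbers are assumptions, since the sample image from the question is not available.

```python
import pandas as pd

# Hypothetical sample data; only the 'Category' column name comes from the answer.
data = pd.DataFrame({
    "Category": ["A", "B", "A", "C", "B", "A"],
    "Value": [10, 5, 20, 7, 15, 30],
})

sums = data.groupby("Category")["Value"].sum()  # one total per category
counts = data["Category"].value_counts()        # number of rows per category

print(sums)
print(counts)
```

Both results are Series indexed by category, so if you prefer calling matplotlib directly you can pass them to ax.bar(sums.index, sums.values) instead of going through .plot.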
I have the following dataframe (with different campaigns)
When I use groupby and try to plot, I get several graphs
df.groupby("Campaign").plot(y=["Visits"], x = "Week")
I would like to have only one graph with all the visits in the same graph by every campaign during the week time. Also because the graphs show up separated, I do not know which one belongs to each campaign.
I would appreciate any tips regarding this.
You could do this:
df.set_index(['Week','Campaign'])['Visits'].unstack().plot(title='Visits by Campaign')
If a Week/Campaign pair occurs more than once, aggregate the values first, with sum for totals or mean for averages:
df.groupby(['Week','Campaign'])['Visits'].sum().unstack().plot(title='Visits by Campaign')
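A minimal sketch of what the reshaping does, using invented numbers (the campaign names and visit counts are assumptions, since the original table is not shown): unstack turns each campaign into its own column indexed by week, so plot draws one labelled line per campaign on a single set of axes.

```python
import pandas as pd

# Hypothetical data in the long format described in the question.
df = pd.DataFrame({
    "Week": [1, 1, 2, 2, 3, 3],
    "Campaign": ["A", "B", "A", "B", "A", "B"],
    "Visits": [100, 80, 120, 90, 130, 70],
})

# Rows become weeks, columns become campaigns.
wide = df.groupby(["Week", "Campaign"])["Visits"].sum().unstack()
print(wide)

# One line per column, with the campaign names in the legend.
ax = wide.plot(title="Visits by Campaign")
```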
Another possible solution would be to use seaborn:
import seaborn as sns
ax = sns.lineplot(x="Week",
                  y="Visits",
                  hue="Campaign",
                  estimator=None,
                  lw=1,
                  data=df)
See the seaborn lineplot documentation for details.
Using plotly, I've learned to plot maps that represent things like 'salary per country' or 'number of XX per country'.
Now I'd like to plot the following: say I'm interested in three quantities A, B and C. For each country, I would like to plot little circles whose size grows with the value. For example:
USA: A=10, B=12, C=3. I would have 3 circles in the US zone, with circle(B) > circle(A) > circle(C).
My dataframe has 4 columns: columns=['Country','quantity_A','quantity_B','quantity_C']
How can I plot a map that looks like what I described? I'm willing to use any library that allows that (the simpler the better, of course).
Thanks !
Using matplotlib you can draw a scatter plot as follows, where the size of the scatter point is given by the quantity in the respective column.
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(1)
import pandas as pd
countries = ["USA", "Israel", "Tibet"]
columns=['quantity_A','quantity_B','quantity_C']
df = pd.DataFrame(np.random.rand(len(countries),len(columns))+.2,
                  columns=columns, index=countries)
fig, ax=plt.subplots()
for i,c in enumerate(df.columns):
    ax.scatter(df.index, np.ones(len(df))*i, s = df[c]*200, c=range(len(df)), cmap="tab10")
ax.set_yticks(range(len(df.columns)))
ax.set_yticklabels(df.columns)
ax.margins(0.5)
plt.show()