Data in form:
x1 x2
data= 2104, 3
1600, 3
2400, 3
1416, 2
3000, 4
1985, 4
y= 399900
329900
369000
232000
539900
299900
I want to plot scatter plot which have got 2 X feature {x1 and x2} and single Y,
but when I try
y=data.loc[:'y']
px=data.loc[:,['x1','x2']]
plt.scatter(px,y)
I get:
'ValueError: x and y must be the same size'.
So I tried this:
data=pd.read_csv('ex1data2.txt',names=['x1','x2','y'])
px=data.loc[:,['x1','x2']]
x1=px['x1']
x2=px['x2']
y=data.loc[:'y']
plt.scatter(x1,x2,y)
This time I got blank graph with full blue color painted inside.
I will be great full if i get some guide
You can only plot with one x and several y's. You could plot the different x's in a twiny axis:
fig, ax = plt.subplots()
ay = ax.twiny()
ax.scatter(df['x1'], df['y'])
ay.scatter(df['x2'], df['y'], color='r')
plt.show()
Output:
You can check the pandas functions for plotting dataframe content, it's very powerful.
But if you want to use matplotlib you can check the documentation (https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.scatter.html), and it's said that X and Y must be array-like. You are instead passing a list.
So the working code it's like this:
data = pd.read_csv("test.txt", header=None)
data
0 1 2
0 2104 3 399900
1 1600 3 329900
2 2400 3 369000
3 1416 2 232000
4 3000 4 539900
5 1985 4 299900
data.columns = ["x1", "x2", "y"]
data
x1 x2 y
0 2104 3 399900
1 1600 3 329900
2 2400 3 369000
3 1416 2 232000
4 3000 4 539900
5 1985 4 299900
# If you call scatter many times and then plt.show() a single image is created
plt.scatter(data["x1"], data["y"])
plt.scatter(data["x2"], data["y"])
plt.show()
Note that if you want to have data in an array format you can do data["x1"].values and it will return an ndarray.
You could use seaborn with a melted dataframe. seaborn.scatterplot has a hue argument, which allows to include multiple data series.
import seaborn as sns
ax = sns.scatterplot(x='value', hue='series', y='y',
data=data.melt(value_vars=['x1', 'x2'],
id_vars='y',
var_name='series'))
However, if your x values are that different, you might want to use twin axes, as in #Quang Hoang's answer.
Related
I have different datasets:
Df1
X Y
1 1
2 5
3 14
4 36
5 90
Df2
X Y
1 1
2 5
3 21
4 38
5 67
Df3
X Y
1 1
2 5
3 10
4 50
5 78
I would like to determine a line which fits this data and plot all data in one chart (like a regression).
On the x axis I have the time; on the y axis I have the frequency of an event that occurs.
Any help on the approach on how to determine the line and plot the results keeping the different legends (would be ok with seaborn or matplotlib) would be helpful.
What I have done so far is plotting the three lines as follows:
plot_df = pd.DataFrame(list(zip(dataset_list, x_lists, y_lists)),
columns =['Dataset', 'X', 'Y']).set_index('Dataset', inplace=False)
plot_df= plot_df.apply(pd.Series.explode).reset_index() # this step should transpose the resulting df and explode the values
# plot
fig, ax = plt.subplots(figsize=(10,8))
for name, group in plot_df.groupby('Dataset'):
group.plot(x = "X", y= "Y", ax=ax, label=name)
Please note that the three lists at the beginning contain information on the three different df.
I recommend using linregress from scipy.stats as this gives very readable code. Just need to add in the logic to your loop:
from scipy.stats import linregress
for name, group in plot_df.groupby('Dataset'):
group.plot(x = "X", y= "Y", ax=ax, label=name)
#fit a line to the data
fit = linregress(group.X, group.Y)
ax.plot(group.X, group.X * fit.slope + fit.intercept, label=f'{name} fit')
I try to display a histogram with this dataframe.
gr_age weighted_cost
0 1 2272.985462
1 2 2027.919360
2 3 1417.617779
3 4 946.568598
4 5 715.731002
5 6 641.716770
I want to use gr_age column as the X axis and weighted_cost as the Y axis. Here is an example of what I am looking for with Excel:
I tried with the following code, and with discrete=True, but it gives another result, and I didn't do better with displot.
sns.histplot(data=df, x="gr_age", y="weighted_cost")
plt.show()
Thanking you for your ideas!
You want a barplot (x vs y values) not a histplot which plots the distribution of a dataset:
import seaborn as sns
ax = sns.barplot(data=df, x='gr_age', y='weighted_cost', color='#4473C5')
ax.set_title('Values by age group')
output:
I'm generating a simple line chart with Matplotlib, here is my code:
fig = plt.figure(facecolor='#131722',dpi=155, figsize=(8, 4))
ax1 = plt.subplot2grid((1,2), (0,0), facecolor='#131722')
for x in OrderedList:
rate_buy = []
total_buy = []
for y in x['data']['bids']:
rate_buy.append(y[0])
total_buy.append(y[1])
rBuys = pd.DataFrame({'buy': rate_buy})
tBuys = pd.DataFrame({'total': total_buy})
ax1.plot(rBuys.buy, tBuys.total, color='#0400ff', linewidth=0.5, alpha=1)
ax1.fill_between(rBuys.buy, 0, tBuys.total, facecolor='#0400ff', alpha=1)
Which gives me the following output:
And here is the data i used in the dataframe:
buy
0 9611
1 9610
2 9609
3 9608
4 9607
5 9606
6 9605
7 9604
8 9603
9 9602
10 9601
11 9600
12 9599
total
0 3.033661
1 3.295753
2 3.599813
3 22.305765
4 22.987476
5 30.975145
6 39.492845
7 42.828580
8 46.677708
9 49.533740
10 50.925840
11 61.396243
12 61.921523
I want to get the same output of the image, but with an histogram chart or whatever it's similar to that, where the height of the column on the y axis is retrieved from the total dataframe and the x axis position is retrieved from the buy dataframe. So the first element will have position x=9611 and y=3.033661
Is it possible to do that with Matplotlib? I tried to use hist, but it doesn't allow me to set both the x and the y axis
Pandas uses matplotlib as well, and the API is very easy once you have the dataframe.
Here is an example.
d = {
'buy':[
9611,
9610,
9609,
9608,
9607,
9606,
9605,
9604,
9603,
9602,
9601,
9600,
9599
],
'total':[
3.033661,
3.295753,
3.599813,
22.305765,
22.987476,
30.975145,
39.492845,
42.828580,
46.677708,
49.533740,
50.925840,
61.396243,
61.921523
]
}
df = pd.DataFrame(d)
df = df.sort_values(by=['buy']) #remember to sort your x values!
df.plot(kind='bar', x='buy', y='total', width=1)
plt.show()
I would like to combine this violin plot http://seaborn.pydata.org/generated/seaborn.violinplot.html (fourth example with split=True) with this one http://seaborn.pydata.org/examples/elaborate_violinplot.html.
Actually, I have a dataFrame with a column Success (Yes or No) and several data column. For example :
df = pd.DataFrame(
{"Success": 50 * ["Yes"] + 50 * ["No"],
"A": np.random.randint(1, 7, 100),
"B": np.random.randint(1, 7, 100)}
)
A B Success
0 6 4 Yes
1 6 2 Yes
2 1 1 Yes
3 1 2 Yes
.. .. .. ...
95 4 4 No
96 2 1 No
97 2 6 No
98 2 3 No
99 2 1 No
I would like to plot a violin plot for each data column. It works with :
import seaborn as sns
sns.violinplot(data=df[["A", "B"]], inner="quartile", bw=.15)
But now, I would like to split the violin according to the Success column. But, using hue="Success" I got an error with Cannot use 'hue' without 'x' or 'y'. Thus how can I do to plot the violin plot by splitting according to "Success" column ?
If understand your question correctly, you need to reshape your dataframe to have it in long format:
df = pd.melt(df, value_vars=['A', 'B'], id_vars='Success')
sns.violinplot(x='variable', y='value', hue='Success', data=df)
plt.show()
I was able to adapt an example of a violin plot over a DataFrame like so:
df = pd.DataFrame({"Success": 50 * ["Yes"] + 50 * ["No"],
"A": np.random.randint(1, 7, 100),
"B": np.random.randint(1, 7, 100)})
sns.violinplot(df.A, df.B, df.Success, inner="quartile", split=True)
sns.plt.show()
Clearly, it still needs some work: the A scale should be sized to fit a single half-violin, for example.
I'm trying to visualise a large (pandas) dataframe in Python as a heatmap. This dataframe has two types of variables: strings ("Absent" or "Unknown") and floats.
I want the heatmap to show cells with "Absent" in black and "Unknown" in red, and the rest of the dataframe as a normal heatmap, with the floats in a scale of greens.
I can do this easily in Excel with conditional formatting of cells, but I can't find any help online to do this with Python either with matplotlib, seaborn, ggplot. What am I missing?
Thank you for your time.
You could use cmap_custom.set_under('red') and cmap_custom.set_over('black') to apply custom colors to values below and above vmin and vmax (See 1, 2):
import numpy as np
import matplotlib.pyplot as plt
import mpl_toolkits.axes_grid1 as axes_grid1
import pandas as pd
# make a random DataFrame
np.random.seed(1)
arr = np.random.choice(['Absent', 'Unknown']+list(range(10)), size=(5,7))
df = pd.DataFrame(arr)
# find the largest and smallest finite values
finite_values = pd.to_numeric(list(set(np.unique(df.values))
.difference(['Absent', 'Unknown'])))
vmin, vmax = finite_values.min(), finite_values.max()
# change Absent and Unknown to numeric values
df2 = df.replace({'Absent': vmax+1, 'Unknown': vmin-1})
# make sure the values are numeric
for col in df2:
df2[col] = pd.to_numeric(df2[col])
fig, ax = plt.subplots()
cmap_custom = plt.get_cmap('Greens')
cmap_custom.set_under('red')
cmap_custom.set_over('black')
im = plt.imshow(df2, interpolation='nearest', cmap = cmap_custom,
vmin=vmin, vmax=vmax)
# add a colorbar (https://stackoverflow.com/a/18195921/190597)
divider = axes_grid1.make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=0.05)
plt.colorbar(im, cax=cax, extend='both')
plt.show()
The DataFrame
In [117]: df
Out[117]:
0 1 2 3 4 5 6
0 3 9 6 7 9 3 Absent
1 Absent Unknown 5 4 7 0 2
2 3 0 2 9 8 0 2
3 5 5 7 Unknown 5 Absent 4
4 7 7 5 4 7 Unknown Absent
becomes