I have two DataFrames, df1 and df2, in Python, both built from NumPy arrays: df1 has 50 rows and 8 columns, and df2 has 10 rows and 8 columns. I would like to use pairplot to visualize these values. So far I have this:
df1=pd.DataFrame(data1)
df2=pd.DataFrame(data2)
sns.pairplot(df1)
sns.pairplot(df2)
plt.show()
But I would like the points or histograms of df2 to be superimposed, for example in red, on the df1 points, which are in blue. How can I do that?
To illustrate the problem, I use the iris dataset.
First, produce two dataframes:
import seaborn as sns
iris = sns.load_dataset("iris")
df1 = iris[iris.species =='setosa']
df2 = iris[iris.species =='versicolor']
This is now your starting point. Next, concatenate the dataframes and plot the result:
import pandas as pd
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df12 = pd.concat([df1, df2])
g = sns.pairplot(df12, hue="species")
The hue parameter separates the points by color.
With the hue parameter in seaborn, you can choose the column that distinguishes the two groups.
sns.pairplot(joined_df,hue='special_column_to_differ_df')
But you will have to join them first.
In a pandas DataFrame with three variables, I want to plot two columns as two separate boxes and use the third column as hue with seaborn.
I can reach the first step with pd.melt, but I can't add the hue and make it work.
This is what I have:
df=pd.DataFrame({'A':['a','a','b','a','b'],'B':[1,3,5,4,7],'C':[2,3,4,1,3]})
df2=df[['B','C']].copy()
sb.boxplot(data=pd.melt(df2), x="variable", y="value",palette= 'Blues')
I want to do this with the first DataFrame, using column 'A' as hue.
Can you help me?
Thank you
IIUC, you can achieve this as follows:
Apply df.melt, using column A for id_vars, and ['B','C'] for value_vars.
Next, inside sns.boxplot, feed the melted df to the data parameter, and add hue='A'.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame({'A':['a','a','b','a','b'], 'B':[1,3,5,4,7], 'C':[2,3,4,1,3]})
sns.boxplot(data=df.melt(id_vars='A', value_vars=['B','C']),
            x='variable', y='value', hue='A', palette='Blues')
plt.show()
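To see why hue='A' works after melting, it can help to look at the long-format frame itself. This is a small sketch using the same df as above; long_df is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'b', 'a', 'b'],
                   'B': [1, 3, 5, 4, 7],
                   'C': [2, 3, 4, 1, 3]})

# id_vars columns are repeated for every melted value, so 'A' survives
# as a regular column that seaborn can use for hue
long_df = df.melt(id_vars='A', value_vars=['B', 'C'])
print(long_df)
```

The result has one row per (original row, melted column) pair, with columns A, variable and value, which is the layout that x='variable', y='value', hue='A' expects.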
I am wondering whether there is an elegant and efficient way to achieve what my title states.
import pandas as pd
data1 = pd.DataFrame([['ad_001','50'], ['ad_002', '100'], ['ad_003', '150']],columns=['name', 'score'])
data2 = pd.DataFrame([['ad_001','75'], ['ad_002', '200'], ['ad_004', '100']],columns=['name', 'score'])
I tried using
data1.merge(data2, how='left', left_on='name', right_on='name')
to merge the two dataframes.
My aim is to join the two dataframes and auto-fill the missing values, so that the result looks like this:
data1 = pd.DataFrame([['ad_001','50','75'], ['ad_002', '100', '200'], ['ad_003', '150', '0'], ['ad_004', '0', '100']],columns=['name', 'score_x','score_y'])
Then I want to show a scatterplot of the data using matplotlib and colour each point according to the maximum score of x and y.
if x or y >100, colour red
if x or y >150, colour green
if x or y >200, colour red.
I tried looking at the user guide
[https://matplotlib.org/stable/gallery/lines_bars_and_markers/scatter_with_legend.html#sphx-glr-gallery-lines-bars-and-markers-scatter-with-legend-py] but do not quite know how to implement it.
Or are there any other Python plotting modules that one would recommend to achieve the same outcome?
For the first part, merging the two dataframes, one way is to use merge with how='outer' so that every name from both dataframes is kept. Rows with no data on one side will contain NaN. Using .fillna(0) handles this, based on how you said you want the missing numbers to appear.
For the conditions and plotting, the simplest way is to use something like np.where() to assign the colors you want. As your question had red for two conditions, I have made one red and the other blue. You can adjust the numbers and colors as you need. Once the color column is available, using groupby() and plotting each group will give you the result you need. Hope this helps...
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data1 = pd.DataFrame([['ad_001','50'], ['ad_002', '100'], ['ad_003', '150']],columns=['name', 'score'])
data2 = pd.DataFrame([['ad_001','75'], ['ad_002', '200'], ['ad_004', '100']],columns=['name', 'score'])
newdata=pd.merge(data1, data2, on="name", how='outer').fillna(0) ## Merge & fillna()
newdata['score_x']=newdata['score_x'].astype('int64') ## Convert to int as you are comparing
newdata['score_y']=newdata['score_y'].astype('int64') ## Convert to int as you are comparing
## Use np.where to create color column with the colors you need
newdata['color']=np.where(((newdata.score_x<100) & (newdata.score_y<100)), 'red',
                 np.where(((newdata.score_x<150) & (newdata.score_y<150)), 'green', 'blue'))
## Group and plot, drawing each group in its own color
fig, ax = plt.subplots()
for clr, d in newdata.groupby('color'):
    ax.scatter(x=d['score_x'], y=d['score_y'], color=clr, label=clr)
ax.legend()
plt.show()
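Since the question asks to color by the maximum of the two scores, the same idea can also be written with an explicit max column and np.select. This is only a sketch: the bands (above 150 green, above 100 red, otherwise blue) are an assumption, because the thresholds in the question are ambiguous, and max_score is a hypothetical helper column.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data1 = pd.DataFrame([['ad_001', '50'], ['ad_002', '100'], ['ad_003', '150']],
                     columns=['name', 'score'])
data2 = pd.DataFrame([['ad_001', '75'], ['ad_002', '200'], ['ad_004', '100']],
                     columns=['name', 'score'])

# Outer merge keeps every name; convert the string scores to integers,
# filling the gaps left by the merge with 0
merged = data1.merge(data2, on='name', how='outer')
for col in ['score_x', 'score_y']:
    merged[col] = pd.to_numeric(merged[col]).fillna(0).astype(int)

# Row-wise maximum of the two scores (hypothetical helper column)
merged['max_score'] = merged[['score_x', 'score_y']].max(axis=1)

# Assumed bands, checked from the highest threshold down
conditions = [merged['max_score'] > 150, merged['max_score'] > 100]
merged['color'] = np.select(conditions, ['green', 'red'], default='blue')

fig, ax = plt.subplots()
ax.scatter(merged['score_x'], merged['score_y'], c=merged['color'].tolist())
ax.set_xlabel('score_x')
ax.set_ylabel('score_y')
# Call plt.show() to display the figure when running as a script
```

np.select takes the first condition that matches, so listing the highest threshold first keeps the bands from overlapping.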
I want to make a scatterplot in seaborn (but I'm open to other ways to execute this) from two numerical columns of data and one categorical column of data, with the two titles of the numerical columns on the x axis, the values of the numerical columns on the y axis, and the cat column represented by hue.
This is kind of like what I want, with the names firstgame and lastgame on the x axis instead of 1 minute and 15 minute.
There are 50 basketball teams in my dataset, each with its own row (so there are 50 rows). Each team has a label, "good" or "bad". The label is the categorical column that I want in my plot. The first numerical column has the number of attendees for the first game of the season, and the second numerical column has the number of attendees for the last game of the season. I figured I could plot this using seaborn, but I'm not sure how to designate x and y. I tried adding the two numerical columns together in a list and going from there, but that didn't really work out. Any suggestions? Thank you so much in advance.
Try the following:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = [[8.98, 1.56, 'fail'],
        [8.91, 5.22, 'success'],
        [5.39, 2.13, 'fail'],
        [5.06, 1.61, 'fail'],
        [5.84, 2.86, 'fail']]
df=pd.DataFrame(data=data, columns=['firstgame','lastgame','label'])
ax=sns.scatterplot(x='firstgame',y='lastgame',hue='label',data=df)
plt.show()
You can try the following:
import numpy as np
import pandas as pd
import seaborn as sns
## sample data, ignore this
np.random.seed(1)
df = pd.DataFrame(np.random.randint(0,100, (50,2)),
                  columns=['firstgame','lastgame'])
df['label'] = np.random.choice(['good','bad'], 50)
## replace 'index' with your index name if any
sns.lineplot(data=df.reset_index().melt(id_vars=['index','label']),
             hue='label',
             style='variable',
             x='index',
             y='value')
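If the goal is literally to have the two column names on the x axis and the attendance values on the y axis, another option is to melt the frame and use a categorical scatter plot. The sketch below assumes random stand-in data and hypothetical column names (game, attendees) for the melted frame; sns.stripplot is used here as the categorical counterpart of a scatter plot.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Stand-in data (assumption): 50 teams, two attendance columns, one label
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.integers(0, 100, size=(50, 2)),
                  columns=['firstgame', 'lastgame'])
df['label'] = rng.choice(['good', 'bad'], size=50)

# Long format: one row per (team, game), keeping the label alongside
long_df = df.melt(id_vars='label', var_name='game', value_name='attendees')

# Column names on the x axis, values on the y axis, label as hue
ax = sns.stripplot(data=long_df, x='game', y='attendees',
                   hue='label', dodge=True)
# Call matplotlib.pyplot.show() to display the figure when running as a script
```

With dodge=True the good and bad teams are drawn side by side within each game, which makes the two groups easier to compare.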
I would like to produce a scatter plot of pandas DataFrame with categorical row and column labels using matplotlib. A sample DataFrame looks like this:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({"a": [1,2], "b": [3,4]}, index=["c","d"])
#    a  b
# c  1  3
# d  2  4
The marker size is a function of the respective DataFrame values. So far, I have come up with an awkward solution that essentially enumerates the rows and columns, plots the data, and then reconstructs the labels:
flat = df.reset_index(drop=True).T.reset_index(drop=True).T.stack().reset_index()
#    level_0  level_1  0
# 0        0        0  1
# 1        0        1  3
# 2        1        0  2
# 3        1        1  4
# level_1 holds the column positions and level_0 the row positions,
# so they go on x and y to match the tick labels set below
flat.plot(kind='scatter', x='level_1', y='level_0', s=100*flat[0])
plt.xticks(range(df.shape[1]), df.columns)
plt.yticks(range(df.shape[0]), df.index)
plt.show()
Which kind of works.
Now, question: Is there a more intuitive, more integrated way to produce this scatter plot, ideally without splitting the data and the metadata?
Maybe not the entire answer you're looking for, but an idea to help save time and readability with the flat= line of code.
The pandas unstack method will produce a Series with a MultiIndex.
dfu = df.unstack()
print(dfu.index)
MultiIndex([('a', 'c'),
            ('a', 'd'),
            ('b', 'c'),
            ('b', 'd')],
           )
The MultiIndex contains the x and y positions needed to construct the plot, stored as integer codes (this attribute was called labels before pandas 0.24). Here, I assign the levels and codes to more informative variable names better suited for plotting.
xlabels, ylabels = dfu.index.levels
xs, ys = dfu.index.codes
Plotting is pretty straight-forward from here.
plt.scatter(xs, ys, s=dfu*100)
plt.xticks(range(len(xlabels)), xlabels)
plt.yticks(range(len(ylabels)), ylabels)
plt.show()
I tried this on a few different DataFrame shapes and it seemed to hold up.
It's not exactly what you were asking for, but it helps to visualize values in a similar way:
import seaborn as sns
sns.heatmap(df[::-1], annot=True)
Maybe you can use a NumPy array and pd.melt to create the scatter plot, as shown below:
arr = np.array([[i,j] for i in range(df.shape[1]) for j in range(df.shape[0])])
plt.scatter(arr[:,0],arr[:,1],s=100*pd.melt(df)['value'],marker='o')
plt.xlabel('level_0')
plt.ylabel('level_1')
plt.xticks(range(df.shape[1]), df.columns)
plt.yticks(range(df.shape[0]), df.index)
plt.show()
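Another possibility, offered as a sketch rather than a definitive answer, is to keep the labels attached to the data and let matplotlib's categorical axis support place them. Here the frame is reshaped to long format with the hypothetical column names row, col and value, and the string labels are passed straight to scatter.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["c", "d"])

# One row per cell, with the row and column labels kept as ordinary columns
long_df = (df.rename_axis("row")
             .reset_index()
             .melt(id_vars="row", var_name="col", value_name="value"))

# String x and y values are treated as categories, so no manual
# enumeration or tick relabeling is needed
fig, ax = plt.subplots()
ax.scatter(long_df["col"], long_df["row"], s=100 * long_df["value"])
# Call plt.show() to display the figure when running as a script
```

The categories appear on each axis in the order they are first encountered, so sorting long_df beforehand controls the tick order.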
I have a DataFrame like this:
I tried these two instructions one after another:
sns.boxplot([dataFrame.mean_qscore_template,dataFrame.mean_qscore_complement,dataFrame.mean_qscore_2d])
sns.boxplot(x = "mean_qscore_template", y= "mean_qscore_complement", hue = "mean_qscore_2d", data = tips)
I want to get mean_qscore_template, mean_qscore_complement and mean_qscore_2d on the x-axis with the measure on y-axis but it doesn't work.
In the documentation they give an example with tips, but my DataFrame is not organized the same way.
sns.boxplot(data = dataFrame) will make boxplots for each numeric column of your dataframe.