I am new to python and pandas, and have the following DataFrame.
How can I plot the DataFrame where each ModelID is a separate plot, saledate is the x-axis and MeanToDate is the y-axis?
Attempt
data[40:76].groupby('ModelID').plot()
DataFrame
You can make the plots by looping over the groups from groupby:
import matplotlib.pyplot as plt
for title, group in df.groupby('ModelID'):
    group.plot(x='saleDate', y='MeanToDate', title=title)
See for more information on plotting with pandas dataframes:
http://pandas.pydata.org/pandas-docs/stable/visualization.html
and for looping over a groupby-object:
http://pandas.pydata.org/pandas-docs/stable/groupby.html#iterating-through-groups
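A self-contained sketch of the same loop, using made-up data because the original DataFrame is not shown. The column names follow the question; the values, and the choice to give each ModelID its own subplot row in a single figure, are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data; only the column names come from the question.
df = pd.DataFrame({
    "ModelID": [3157, 3157, 3157, 77, 77, 77],
    "saledate": pd.to_datetime(["2006-01-01", "2007-01-01", "2008-01-01"] * 2),
    "MeanToDate": [66000.0, 62000.0, 58000.0, 10000.0, 11000.0, 12500.0],
})

groups = df.groupby("ModelID")
# squeeze=False keeps axes two-dimensional even when there is a single group.
fig, axes = plt.subplots(nrows=groups.ngroups, figsize=(6, 3 * groups.ngroups),
                         squeeze=False)

# Pair each group with its own Axes and draw one line per ModelID.
for ax, (model_id, group) in zip(axes.ravel(), groups):
    group.plot(x="saledate", y="MeanToDate", ax=ax, title=f"ModelID {model_id}")

fig.tight_layout()
```

Call plt.show() (or fig.savefig with a file name of your choice) afterwards to see the result.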
Example with aggregation:
I wanted to do something like the following, if pandas had a colour aesthetic like ggplot:
aggregated = df.groupby(['model', 'training_examples']).aggregate(np.mean)
aggregated.plot(x='training_examples', y='accuracy', label='model')
(columns: model is a string, training_examples is an integer, accuracy is a decimal)
But that just produces a mess.
Thanks to joris's answer, I ended up with:
for index, group in df.groupby(['model']):
    group_agg = group.groupby(['training_examples']).aggregate(np.mean)
    group_agg.plot(y='accuracy', label=index)
I found that title= was just replacing the single title of the plot on each loop iteration, but label= does what you'd expect -- after running plt.legend(), of course.
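With recent pandas versions the loop above may need two adjustments: DataFrame.plot may open a new figure on every call unless it is given an ax, and averaging a group that still contains the string column model may raise an error. A minimal sketch under those assumptions, with made-up data that uses the column names listed above:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up results; only the column names come from the description above.
df = pd.DataFrame({
    "model": ["svm"] * 4 + ["tree"] * 4,
    "training_examples": [100, 100, 200, 200] * 2,
    "accuracy": [0.70, 0.74, 0.80, 0.82, 0.60, 0.64, 0.71, 0.73],
})

fig, ax = plt.subplots()
for model, group in df.groupby("model"):
    # Average only the numeric column, so the string column is never aggregated.
    mean_accuracy = group.groupby("training_examples")["accuracy"].mean()
    # Passing ax= draws every model on the same Axes; label= feeds the legend.
    mean_accuracy.plot(ax=ax, label=model)

ax.set_xlabel("training_examples")
ax.set_ylabel("accuracy")
ax.legend()
```

This gives one line per model in a single plot, which is roughly what a colour aesthetic would do.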
I'm trying to create a scatterplot in plotly, but am having some difficulties. I think I need to rearrange my data table to be able to work with it, but I am not sure.
This is how my data table looks:
table structure
The "Average Price" is the "real" data and the prices in the "Predictions" column are what my model predicted.
I want to display it in a scatterplot, showing both the predicted and real prices as dots, like this:
scatterplot created through matplotlib
I created this with pyplot:
plt.scatter(x_axis, result['Average Price'], label='Real')
plt.scatter(x_axis, result['Predictions'], label='Predictions')
plt.xlabel('YYYY-MM-DD')
plt.ylabel('Average Price')
plt.legend(loc='lower right')
plt.show()
However, I wanted to do the same with plotly, which I can't seem to figure out. I have no problems with one column, but don't know how to access both. Do I need to rearrange the table so that I have all prices (predicted and real) in one column and an additional column labeling the data as "real" or "predicted"?
chart_model = px.scatter(result, x='YYYY-MM-DD', y='Predictions', title='Predictions')
chart_model.update_layout(title_x=0.5, plot_bgcolor='#ecf0f1', yaxis_title='Average Price Predicted',
font_color='#2c3e50')
chart_model.update_traces(marker=dict(color='blue'))
Thanks in advance for any tips on how to proceed!
I have simulated a DataFrame with the same structure as the one in your question.
I have used pandas melt() to reshape it inline into a long DataFrame, which is then simple to use with plotly.
import pandas as pd
import numpy as np
import plotly.express as px
# simulate data frame
df = pd.DataFrame(
    {
        "YYYY-MM-DD": pd.date_range("4-jan-2015", freq="7D", periods=300),
        "Average Price": np.random.uniform(1.2, 1.4, 300),
    }
).pipe(
    lambda d: d.assign(
        Predictions=d["Average Price"] * np.random.uniform(0.9, 1.1, 300)
    )
)
# simple inline restructure of data frame
px.scatter(df.set_index("YYYY-MM-DD").melt(ignore_index=False), y="value", color="variable")
Alternatively, just move the date column into the index and list the columns to be plotted:
px.scatter(df.set_index("YYYY-MM-DD"), y=["Average Price", "Predictions"])
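To see what the melt() step in the first variant does, here is a small pandas-only sketch. The three rows of prices are made up; the column names match the question. It needs pandas 1.1 or later for ignore_index.

```python
import pandas as pd

# Tiny wide-format table: one column per series (values are made up).
wide = pd.DataFrame({
    "YYYY-MM-DD": pd.date_range("2015-01-04", freq="7D", periods=3),
    "Average Price": [1.25, 1.30, 1.35],
    "Predictions": [1.20, 1.33, 1.31],
})

# Long format: one row per (date, series) pair. "variable" holds the original
# column name and "value" holds the price; the date stays in the index.
long = wide.set_index("YYYY-MM-DD").melt(ignore_index=False)
print(long)
```

This is the layout asked about in the question: all prices in one column, plus a second column that says whether a row is real or predicted.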
I need to create a somewhat unusual bar plot in matplotlib and the standard functionality does not seem to offer what I need.
I have clustered some documents and want to show the 5 most important keywords per cluster. The first problem is that I have one group per cluster which consists of 5 individual bars. The second problem is that the labels of these individual bars are important, not the same across groups and not unique either.
I have a makeshift prototype that looks like this:
I just plotted all the individual bars in the right order and separated them by empty entries. The biggest problem (aside from being ugly) is that the only way to identify the cluster is by counting the groups. It would help a lot if the clusters could be identified either by color or something else, but I cannot figure out how to do this.
Edit: Here is some requested toy data as well as the code used to produce the plot I already have.
Toy data:
The following two pandas dataframes are included in an array. The two code blocks include the results from df_list[i].to_csv(). I hope this helps, but for the context of this problem the actual data does not really matter, so you can also just create your own dataframes.
,features,score
0,knowledg,0.09862235117497174
1,manag,0.07812351138840486
2,innov,0.06502084705448799
3,organ,0.0561819290497529
4,km,0.05580332888282127
and
,features,score
0,knowledg,0.04217018718591911
1,develop,0.03423580137595049
2,manag,0.032239226503136
3,system,0.031064303713788467
4,sustain,0.029628875636649198
Code:
The approach for the current solution is to combine all the individual dataframes into one dataframe, add empty entries where necessary, and plot the result.
def plot_all_clusters_words(dfs):
    # target structure: word as non unique column, value as other non unique column
    df_dict_list = []
    for df in dfs:
        for index, row in df.iterrows():
            df_dict_list.append({"word": row.features, "value": row.score})
        df_dict_list.append({"word": "", "value": 0})
    df_dict_list = df_dict_list[:-1]
    new_df = pd.DataFrame(df_dict_list)
    new_df.plot.bar(x="word")
    plt.show()
    return new_df
Note:
I just need a way to easily identify the groups. If you know a different approach from the ones I suggested above, feel free to propose it.
Calling plt.bar for each of the dataframes, each with its own label and color, would create the following plot:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from io import StringIO
df1_str = '''features,score
0,knowledg,0.09862235117497174
1,manag,0.07812351138840486
2,innov,0.06502084705448799
3,organ,0.0561819290497529
4,km,0.05580332888282127'''
df2_str = '''features,score
0,knowledg,0.04217018718591911
1,develop,0.03423580137595049
2,manag,0.032239226503136
3,system,0.031064303713788467
4,sustain,0.029628875636649198'''
df1 = pd.read_csv(StringIO(df1_str))
df2 = pd.read_csv(StringIO(df2_str))
dfs = [df1, df2]
cluster_names = [f'cluster {i}' for i in range(1, len(dfs) + 1)]
colors = plt.cm.rainbow(np.linspace(0, 1, len(dfs)))
bar_width = 0.8 # width of individual bars
cluster_gap = 0.2 # extra distance between clusters
starts = np.append(0, np.array([len(df) + cluster_gap for df in dfs]).cumsum())
all_tickpos = [s + np.arange(len(df)) for df, s in zip(dfs, starts)]
for df, name, color, tickpos in zip(dfs, cluster_names, colors, all_tickpos):
    plt.bar(tickpos, df['score'], width=bar_width, color=color, label=name)
plt.xticks(np.concatenate(all_tickpos), [f for df in dfs for f in df['features']], rotation=90)
plt.legend()
plt.tight_layout()
plt.show()
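The tick-position arithmetic in the code above can be checked on its own. This sketch repeats only that step, assuming two clusters of five bars each as in the toy data; group_sizes is a name introduced here for illustration.

```python
import numpy as np

group_sizes = [5, 5]   # number of bars per cluster (assumed, as in the toy data)
cluster_gap = 0.2      # extra distance between clusters

# x position of the first bar of each cluster: running sum of (size + gap),
# with 0 prepended for the first cluster.
starts = np.append(0, np.cumsum([n + cluster_gap for n in group_sizes]))

# zip() stops at the shorter input, so the trailing start value is ignored.
all_tickpos = [s + np.arange(n) for n, s in zip(group_sizes, starts)]

print(starts)
print(all_tickpos)
```

The second cluster starts at 5.2 instead of 5, which is what produces the visible gap between the groups.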
I have a DataFrame with a multi-index on the rows, and I would like to create a heatmap without repeating the row labels, just as pandas displays the DataFrame. Here is code to reproduce my problem:
import pandas as pd
from matplotlib import pyplot as plt
import random
import seaborn as sns
%matplotlib inline
df = pd.DataFrame({'Occupation':['Economist','Economist','Economist','Engineer','Engineer','Engineer',
'Data Scientist','Data Scientist','Data Scientist'],
'Sex':['Female','Male','Both']*3, 'UK':random.sample(range(-10,10),9),
'US':random.sample(range(-10,10),9),'Brazil':random.sample(range(-10,10),9)})
df = df.set_index(['Occupation','Sex'])
df
sns.heatmap(df, annot=True, fmt="",cmap="YlGnBu")
Besides eliminating the repetition, I would like to customize the y-labels a bit, since this raw form doesn't look good to me.
Is it possible?
AFAIK there's no quick and easy way to do that within seaborn, but hopefully someone will correct me. You can do it manually by resetting the y tick labels to just the values from level 1 of your index. Then you can loop over level 0 of your index and add a text element to your visualization at the correct location:
from collections import OrderedDict
ax = sns.heatmap(df, annot=True, cmap="YlGnBu")
ylabel_mapping = OrderedDict()
for occupation, sex in df.index:
    ylabel_mapping.setdefault(occupation, [])
    ylabel_mapping[occupation].append(sex)
hline = []
new_ylabels = []
for occupation, sex_list in ylabel_mapping.items():
    sex_list[0] = "{} - {}".format(occupation, sex_list[0])
    new_ylabels.extend(sex_list)
    if hline:
        hline.append(len(sex_list) + hline[-1])
    else:
        hline.append(len(sex_list))
ax.hlines(hline, xmin=-1, xmax=4, color="white", linewidth=5)
ax.set_yticklabels(new_ylabels)
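The label positions can also be computed directly from the MultiIndex, without plotting anything. The sketch below is pandas-only and rebuilds the index from the question; the names inner_labels, centers and boundaries are introduced here for illustration. It assumes each heatmap row is one unit tall and that rows with the same Occupation are adjacent.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["Economist", "Engineer", "Data Scientist"], ["Female", "Male", "Both"]],
    names=["Occupation", "Sex"],
)

# Inner-level values: one tick label per heatmap row.
inner_labels = list(index.get_level_values("Sex"))

# First and last row number of each outer-level value, in order of appearance.
rows = pd.Series(range(len(index)), index=index.get_level_values("Occupation"))
spans = rows.groupby(level=0, sort=False).agg(["min", "max"])

# Vertical centre of each block of rows (a block covers min .. max + 1).
centers = (spans["min"] + spans["max"] + 1) / 2
# Row boundaries between consecutive blocks, where separator lines could go.
boundaries = list(spans["max"].iloc[:-1] + 1)
```

These values could then feed ax.set_yticklabels(inner_labels), one ax.text call per entry of centers, and ax.hlines(boundaries, ...); the exact x offset for the text depends on the figure and is left open here.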
An alternative approach is to use dataframe styling. This leads to a very simple syntax, but you do lose the colorbar. It keeps your index and column presentation the same as in the dataframe. Note that you'll need to be working in a notebook or somewhere else that can render HTML to view the output:
df.style.background_gradient(cmap="YlGnBu", vmin=-10, vmax=10)
I have a DataFrame like this:
I tried these two instructions one after another:
sns.boxplot([dataFrame.mean_qscore_template,dataFrame.mean_qscore_complement,dataFrame.mean_qscore_2d])
sns.boxplot(x = "mean_qscore_template", y= "mean_qscore_complement", hue = "mean_qscore_2d", data = tips)
I want to get mean_qscore_template, mean_qscore_complement and mean_qscore_2d on the x-axis with the measure on y-axis but it doesn't work.
In the documentation they give an example with tips, but my dataFrame is not organized the same way.
sns.boxplot(data = dataFrame) will make boxplots for each numeric column of your dataframe.
I have a pandas dataframe where one of the columns is a set of labels that I would like to plot each of the other columns against in subplots. In other words, I want the y-axis of each subplot to use the same column, called 'labels', and I want a subplot for each of the remaining columns with the data from each column on the x-axis. I expected the following code snippet to achieve this, but I don't understand why this results in a single nonsensical plot:
examples.plot(subplots=True, layout=(-1, 3), figsize=(20, 20), y='labels', sharey=False)
The problem with that code is that you didn't specify an x value. The result looks nonsensical because it plots the labels column against an index running from 0 to the number of rows. As far as I know, you can't do what you want in pandas directly. You might want to check out seaborn, though: it's another visualization library that has some nice grid plotting helpers.
Here's an example with your data:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt  # needed for plt.plot below
examples = pd.DataFrame(np.random.rand(10,4), columns=['a', 'b', 'c', 'labels'])
g = sns.PairGrid(examples, x_vars=['a', 'b', 'c'], y_vars='labels')
g = g.map(plt.plot)
This creates the following plot:
Obviously it doesn't look great with random data, but hopefully with your data it will look better.
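If seaborn is not available, a plain loop over matplotlib subplots gives a similar grid. This is a sketch of an alternative, not the PairGrid approach itself; it uses the same random stand-in data, and sorting each column before plotting is an optional choice made here so the line runs left to right.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

examples = pd.DataFrame(np.random.rand(10, 4), columns=["a", "b", "c", "labels"])
feature_cols = [c for c in examples.columns if c != "labels"]

# One subplot per feature column, all sharing the 'labels' y-axis.
fig, axes = plt.subplots(1, len(feature_cols), figsize=(12, 4), sharey=True)
for ax, col in zip(axes, feature_cols):
    # Passing x=, y= and ax= explicitly puts 'labels' on the y-axis of each subplot.
    examples.sort_values(col).plot(x=col, y="labels", ax=ax, legend=False)
    ax.set_xlabel(col)
axes[0].set_ylabel("labels")
fig.tight_layout()
```

Replacing the line plot with kind="scatter" may be more readable when the rows have no natural order.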