Settings for timedata in seaborn FacetGrid plots - python

I want to plot monthly data and show the year label once per year.
Here is the data:
import pandas as pd
timedates = ['2013-01-01', '2013-02-01', '2013-03-01', '2013-04-01', '2013-05-01', '2013-06-01', '2013-07-01',
             '2013-08-01', '2013-09-01', '2013-10-01', '2013-11-01', '2013-12-01', '2014-01-01', '2014-02-01',
             '2014-03-01', '2014-04-01', '2014-05-01', '2014-06-01', '2014-07-01', '2014-08-01', '2014-09-01',
             '2014-10-01', '2014-11-01', '2014-12-01']
timedates = pd.to_datetime(timedates)
amount = [38870, 42501, 44855, 44504, 41194, 42087, 43687, 42347, 45098, 43783, 47275, 49767,
39502, 35951, 47059, 47639, 44236, 40826, 46087, 41462, 38384, 41452, 36811, 37943]
types = ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C',
'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C']
df_x = pd.DataFrame({'timedates': timedates, 'amount': amount, 'types': types})
I found out how to do that with matplotlib
import matplotlib.pyplot as plt
import matplotlib.dates as md
plt.style.use('ggplot')
fig, ax = plt.subplots()
ax.plot_date(df_x.timedates, df_x.amount, 'v-')
ax.xaxis.set_minor_locator(md.MonthLocator())
ax.xaxis.set_minor_formatter(md.DateFormatter('%m'))
ax.xaxis.grid(True, which="minor")
ax.yaxis.grid()
ax.xaxis.set_major_locator(md.YearLocator())
ax.xaxis.set_major_formatter(md.DateFormatter('\n\n%Y'))
plt.show()
Now I am moving to seaborn to take the different types of data into account. Is it possible to get the same tick style with a seaborn FacetGrid?
g = sns.FacetGrid(df_x, hue='types', size=8, aspect=1.5)
g.map(sns.pointplot, 'timedates', 'amount')
plt.show()
When I try to apply tick formatting, the ticks just disappear.

You could format the xticks to just include the month and year of the datetime object and get a pointplot with xticks corresponding to the position of scatter plot points.
df['timedates'] = df['timedates'].map(lambda x: x.strftime('%Y-%m'))
def plot(x, y, data=None, label=None, **kwargs):
    # Pass x and y by keyword; recent seaborn versions no longer accept them positionally.
    sns.pointplot(x=x, y=y, data=data, label=label, **kwargs)
g = sns.FacetGrid(df, hue='types', size=8, aspect=1.5)
g.map_dataframe(plot, 'timedates', 'amount')
plt.show()
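As a quick check of what the strftime mapping above produces, the sketch below applies the same format string to a few hypothetical dates (not the full data from the question). The results are plain strings, so seaborn will treat them as categories rather than as dates.

```python
import pandas as pd

# Hypothetical three-row sample; the question's df_x has one row per month.
dates = pd.Series(pd.to_datetime(["2013-01-01", "2013-02-01", "2014-01-01"]))

# Same format string as in the answer above: keep only year and month.
labels = dates.map(lambda x: x.strftime("%Y-%m"))

print(labels.tolist())
```

Since every label still carries the year, this gives readable ticks but not the "year shown once" layout from the matplotlib version.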

So far, I have done it manually: I separated the lines by type and plotted them together.
I changed this line
ax.plot_date(df_x.timedates, df_x.amount, 'v-')
into three plotted lines:
types_levels = df_x.types.unique()
for i in types_levels:
    ax.plot_date(df_x[df_x.types==i].timedates, df_x[df_x.types==i].amount, 'v-')
plt.legend(types_levels)
This is not a real answer, though, because this way I can't use the other advantages of seaborn FacetGrid.
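The same manual approach can be written with a pandas groupby, which avoids filtering the frame twice per type and attaches each label to its own line. This is only a sketch: the six-row frame is a shortened, hypothetical stand-in for df_x, and the Agg backend is selected just so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Shortened stand-in for the question's df_x (first six months only).
df_x = pd.DataFrame({
    "timedates": pd.date_range("2013-01-01", periods=6, freq="MS"),
    "amount": [38870, 42501, 44855, 44504, 41194, 42087],
    "types": ["A", "B", "C", "A", "B", "C"],
})

fig, ax = plt.subplots()
# One line per type; the label is set per line, so the legend cannot get out of order.
for type_name, group in df_x.groupby("types"):
    ax.plot(group["timedates"], group["amount"], "v-", label=type_name)
ax.legend()
```

The date locators and formatters from the matplotlib version can then be applied to ax unchanged.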

You can just use the same code you used for matplotlib!
for ax in g.axes.flat:
    # Paste in your own code!

Related

Coloring points in a plotly box plot by column variable

I am looking to change the color of the points in the box plot based on a column value from the data frame (committed).
users = ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c']
cost = [87.25, 84.69, 82.32, 87.25, 84.59, 82.32, 89.45, 86.37, 83.83]
committed = ['f', 'f', 't', 't', 'f', 'f', 'f', 't', 'f']
df = pd.DataFrame({'user': users, 'cost': cost, 'committed': committed})
To get the below plot, I entered this:
fig = go.Figure(
data=[go.Box(
x=df['user'],
y=df['cost'],
boxpoints='all',
jitter=.3,
pointpos=-1.8,
)]
)
fig.show()
I found this post for R and was hoping that something has changed in the past 5 years to make it possible: Coloring jitters in a plotly box plot by group variable
I tried the following, but neither marker nor marker_color appears to accept a list of colors, and I am unsure how else to set the values.
def set_color(point):
    if point == 't':
        return "red"
    elif point == 'f':
        return "blue"
fig = go.Figure(
data=[go.Box(
x=df['user'],
y=df['cost'],
boxpoints='all',
jitter=.3,
pointpos=-1.8,
marker=dict(color=list(map(set_color, df['committed'])))
)]
)
fig.show()
Error:
Invalid value of type 'builtins.list' received for the 'color' property of box.marker
I ended up finding my answer here: Plotting data points over a box plot with specific colors & jitter in plotly
My solution:
import plotly.express as px
import plotly.graph_objects as go
fig = px.strip(
df, x='user', y='cost', color='committed', stripmode="overlay"
)
for user in df['user'].unique():
    fig.add_trace(go.Box(y=df.query(f'user == "{user}"')['cost'], name=user, marker_color="aquamarine"))
fig.show()

Y-axis values cuts off using seaborn scatter plot

I have an issue with plotting a big CSV file whose Y-axis values range from 1 up to more than 20 million. I am facing two problems right now.
The Y-axis does not show all the values that it is supposed to. With the original data, it shows values up to 6 million instead of the full range up to 20 million. With the sample data (a smaller data set) below, it shows only the first Y-axis value and no others.
In the legend, since I am using hue and style = name, "name" appears both as the legend title and as an item inside it.
Questions:
How can I make all the Y-axis values show up? A sample would help.
How can I get rid of "name" in the legend without losing the shapes and colors of the scatter points?
(Please let me know if this has already been answered in another post, rather than just marking it as a duplicate. Please also let me know if there are any grammar or spelling issues I need to fix. Thank you!)
Below you can find the function I am using to plot the graph and the sample data.
def test_graph (file_name):
    data_file = pd.read_csv(file_name, header=None, error_bad_lines=False, delimiter="|", index_col = False, dtype='unicode')
    data_file.rename(columns={0: 'name',
                              1: 'date',
                              2: 'name3',
                              3: 'name4',
                              4: 'name5',
                              5: 'ID',
                              6: 'counter'}, inplace=True)
    data_file.date = pd.to_datetime(data_file['date'], unit='s')
    norm = plt.Normalize(1,4)
    cmap = plt.cm.tab10
    df = pd.DataFrame(data_file)
    # Below creates and returns a dictionary of category-point combinations,
    # by cycling over the marker points specified.
    points = ['o', 'v', '^', '<', '>', '8', 's', 'p', 'H', 'D', 'd', 'P', 'X']
    mult = len(df['name']) // len(points) + (len(df['name']) % len(points) > 0)
    markers = {key:value for (key, value)
               in zip(df['name'], points * mult)}
    sc = sns.scatterplot(data = df, x=df['date'], y=df['counter'], hue = df['name'], style = df['name'], markers = markers, s=50)
    ax.set_autoscaley_on(True)
    ax.set_title("TEST", size = 12, zorder=0)
    plt.legend(title="Names", loc='center left', shadow=True, edgecolor = 'grey', handletextpad = 0.1, bbox_to_anchor=(1, 0.5))
    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(100))
    plt.xlabel("Dates", fontsize = 12, labelpad = 7)
    plt.ylabel("Counter", fontsize = 12)
    plt.grid(axis='y', color='0.95')
    fig.autofmt_xdate(rotation = 30)
fig = plt.figure(figsize=(20,15),dpi=100)
ax = fig.add_subplot(1,1,1)
test_graph(file_name)
plt.savefig(graph_results + "/Test.png", dpi=100)
# Prevents to cut-off the bottom labels (manually) => makes the bottom part bigger
plt.gcf().subplots_adjust(bottom=0.15)
plt.show()
Sample data
namet1|1582334815|ai1|ai1||150|101
namet1|1582392415|ai2|ai2||142|105
namet2|1582882105|pc1|pc1||1|106
namet2|1582594106|pc1|pc1||1|123
namet2|1580592505|pc1|pc1||1|141
namet2|1580909305|pc1|pc1||1|144
namet3|1581974872|ai3|ai3||140|169
namet1|1581211616|ai4|ai4||134|173
namet2|1582550907|pc1|pc1||1|179
namet2|1582608505|pc1|pc1||1|185
namet4|1581355640|ai5|ai5|bcu|180|298466
namet4|1582651641|pc2|pc2||233|298670
namet5|1582406860|ai6|ai6|bcu|179|298977
namet5|1580563661|pc2|pc2||233|299406
namet6|1581283626|qe1|q0/1|Link to btse1/3|51|299990
namet7|1581643672|ai5|ai5|bcu|180|300046
namet4|1581758842|ai6|ai6|bcu|179|300061
namet6|1581298027|qe2|q0/2|Link to btse|52|300064
namet1|1582680415|pc2|pc2||233|300461
namet6|1581744427|pc3|p90|Link to btsi3a4|55|6215663
namet6|1581730026|pc3|p90|Link to btsi3a4|55|6573348
namet6|1582190826|qe2|q0/2|Link to btse|52|6706378
namet6|1582190826|qe1|q0/1|Link to btse1/3|51|6788568
namet1|1581974815|pc2|pc2||233|6895836
namet4|1581974841|pc2|pc2||233|7874504
namet6|1582176427|qe1|q0/1|Link to btse1/3|51|9497687
namet6|1582176427|qe2|q0/2|Link to btse|52|9529133
namet7|1581974872|pc2|pc2||233|9573450
namet6|1582162027|pc3|p90|Link to btsi3a4|55|9819491
namet6|1582190826|pc3|p90|Link to btsi3a4|55|13494946
namet6|1582176427|pc3|p90|Link to btsi3a4|55|19026820
Results I am getting (images not included): a plot of the big data, a plot of the small data, and an updated graph.
First of all, some improvements on your post: you are missing the import statements
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import ticker
import seaborn as sns
The line
df = pd.DataFrame(data_file)
is not necessary, since data_file already is a DataFrame. The lines
points = ['o', 'v', '^', '<', '>', '8', 's', 'p', 'H', 'D', 'd', 'P', 'X']
mult = len(df['name']) // len(points) + (len(df['name']) % len(points) > 0)
markers = {key:value for (key, value)
in zip(df['name'], points * mult)}
do not cycle through points as you might expect; consider using itertools instead, as suggested here. Also, setting yticks like
ax.yaxis.set_major_locator(ticker.MultipleLocator(100))
every 100 units is probably far too dense if your data spans values from 0 to 20 million; consider replacing 100 with, say, 1000000.
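Returning to the marker assignment: one way to cycle through the marker list is itertools.cycle combined with zip, which stops as soon as the names run out. The sketch below uses a hypothetical list of 15 names (more than the 13 markers) to show the wrap-around. seaborn's scatterplot accepts such a dict for its markers argument.

```python
import itertools

points = ['o', 'v', '^', '<', '>', '8', 's', 'p', 'H', 'D', 'd', 'P', 'X']

# Hypothetical category names; in the question these would be df['name'].unique().
names = [f"namet{i}" for i in range(1, 16)]

# zip stops at the shorter input, so the infinite cycle is safe here.
markers = dict(zip(names, itertools.cycle(points)))

print(markers["namet1"], markers["namet14"], markers["namet15"])
```

The 14th and 15th names reuse the first and second markers, so each unique name gets exactly one marker.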
I was able to reproduce your first problem. Using df.dtypes I found that the column counter was stored as type object. Adding the line
df['counter']=df['counter'].astype(int)
resolved your first problem for me. I couldn't reproduce your second issue, though. Here is what the resulting plot looks like for me:
Have you tried updating all your packages to the latest version?
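The dtype issue described above can be reproduced without a plot. The sketch below reads three rows of the sample data as text (dtype=str is used here as a stand-in for the question's dtype='unicode') and compares the maximum of the counter column before and after conversion.

```python
import io
import pandas as pd

# Three rows taken from the sample data in the question.
sample = (
    "namet1|1582334815|ai1|ai1||150|101\n"
    "namet6|1581744427|pc3|p90|Link to btsi3a4|55|6215663\n"
    "namet6|1582176427|pc3|p90|Link to btsi3a4|55|19026820\n"
)

# Reading everything as text mirrors what the question's read_csv call does.
raw = pd.read_csv(io.StringIO(sample), header=None, delimiter="|", dtype=str)

text_max = raw[6].max()                  # compared as strings, character by character
numeric_max = raw[6].astype(int).max()   # compared as numbers

print(text_max, numeric_max)
```

As text, "6215663" sorts after "19026820" because "6" comes after "1", which is consistent with an axis that is not laid out numerically until the column is converted.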
EDIT: as follow up on your comment, you can also adjust the number of xticks in your plot by replacing 1 in
ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
by a higher number, say 10. Incorporating all my suggestions and deleting the seemingly unnecessary function definition, my version of your code looks as follows:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import ticker
import seaborn as sns
import itertools
fig = plt.figure()
ax = fig.add_subplot()
df = pd.read_csv(
    'data.csv',
    header = None,
    on_bad_lines = 'skip',  # replaces error_bad_lines = False (removed in pandas 2.0)
    delimiter = "|",
    index_col = False,
    dtype = 'unicode')
df.rename(columns={0: 'name',
1: 'date',
2: 'name3',
3: 'name4',
4: 'name5',
5: 'ID',
6: 'counter'}, inplace=True)
df.date = pd.to_datetime(df['date'], unit='s')
df['counter'] = df['counter'].astype(int)
points = ['o', 'v', '^', '<', '>', '8', 's', 'p', 'H', 'D', 'd', 'P', 'X']
markers = itertools.cycle(points)
markers = list(itertools.islice(markers, len(df['name'].unique())))
sc = sns.scatterplot(
data = df,
x = 'date',
y = 'counter',
hue = 'name',
style = 'name',
markers = markers,
s = 50)
ax.set_title("TEST", size = 12, zorder=0)
ax.legend(
title = "Names",
loc = 'center left',
shadow = True,
edgecolor = 'grey',
handletextpad = 0.1,
bbox_to_anchor = (1, 0.5))
ax.xaxis.set_major_locator(ticker.MultipleLocator(10))
ax.yaxis.set_major_locator(ticker.MultipleLocator(1000000))
ax.minorticks_off()
ax.set_xlabel("Dates", fontsize = 12, labelpad = 7)
ax.set_ylabel("Counter", fontsize = 12)
ax.grid(axis='y', color='0.95')
fig.autofmt_xdate(rotation = 30)
plt.gcf().subplots_adjust(bottom=0.15)
plt.show()
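The tick spacing chosen above can be sanity-checked without drawing anything, since a locator can report the tick positions it would produce for a given range. This is a sketch under the assumption that the data spans 0 to 20 million, as stated in the question.

```python
import numpy as np
from matplotlib import ticker

y_min, y_max = 0, 20_000_000  # assumed data range from the question

# Positions a MultipleLocator with a step of one million would place on that range.
ticks = ticker.MultipleLocator(1_000_000).tick_values(y_min, y_max)

# For comparison: how many 100-unit steps fit into the same range.
steps_at_100 = (y_max - y_min) // 100

print(len(ticks), steps_at_100)
```

A step of 100 would mean 200,000 intervals on the same axis, which is why a much larger step is suggested.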

Python + Matplotlib: multi-level treemap plot?

I recently saw this treemap chart from https://www.kaggle.com/philippsp/exploratory-analysis-instacart (two levels of hierarchy, colored, squarified treemap).
It is made in R with the treemap function (from the treemap package), by:
treemap(tmp,index=c("department","aisle"),vSize="n",title="",
palette="Set3",border.col="#FFFFFF")
How can I make this plot in Python?
I searched a bit, but didn't find any multi-level treemap example.
https://gist.github.com/gVallverdu/0b446d0061a785c808dbe79262a37eea
https://python-graph-gallery.com/200-basic-treemap-with-python/
You can use plotly. Here you can find several examples.
https://plotly.com/python/treemaps/
This is a very simple example with a multi-level structure.
import plotly.express as px
import pandas as pd
data = {
    'level_1': ['A', 'A', 'A', 'B', 'B', 'B'],
    'level_2': ['X', 'X', 'Y', 'Z', 'Z', 'X'],
    'level_3': ['1', '2', '2', '1', '1', '2'],
}
data = pd.DataFrame(data)
fig = px.treemap(data, path=['level_1', 'level_2', 'level_3'])
fig.show()
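Whichever library draws the rectangles, a two-level treemap like the one in the question needs one size per leaf, which is what vSize="n" supplies in the R call. A minimal pandas sketch of that aggregation is shown below; the product rows are hypothetical, and only the column names department, aisle and n are taken from the R snippet.

```python
import pandas as pd

# Hypothetical raw data: one row per product.
products = pd.DataFrame({
    "department": ["produce", "produce", "produce", "dairy", "dairy"],
    "aisle": ["fresh fruits", "fresh fruits", "fresh herbs", "milk", "yogurt"],
})

# Count products per (department, aisle) pair; 'n' plays the role of vSize.
sizes = products.groupby(["department", "aisle"]).size().reset_index(name="n")

print(sizes)
```

A frame like this could then be passed to px.treemap with path=['department', 'aisle'] and values='n'.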
The matplotlib-extra package provides a treemap function that supports multi-level treemap plots. For a dataset of the G20 countries (assumed here to be loaded in df), it can produce a similar treemap, for example:
import matplotlib.pyplot as plt
import mpl_extra.treemap as tr
fig, ax = plt.subplots(figsize=(7,7), dpi=100, subplot_kw=dict(aspect=1.156))
trc = tr.treemap(ax, df, area='gdp_mil_usd', fill='hdi', labels='country',
levels=['region', 'country'],
textprops={'c':'w', 'wrap':True,
'place':'top left', 'max_fontsize':20},
rectprops={'ec':'w'},
subgroup_rectprops={'region':{'ec':'grey', 'lw':2, 'fill':False,
'zorder':5}},
subgroup_textprops={'region':{'c':'k', 'alpha':0.5, 'fontstyle':'italic'}},
)
ax.axis('off')
cb = fig.colorbar(trc.mappable, ax=ax, shrink=0.5)
cb.ax.set_title('hdi')
cb.outline.set_edgecolor('w')
plt.show()
The obtained treemap is as follows:
For more information, see the project, which has some examples. The source code has an API docstring.

Non overlapping error bars in line plot

I am using Pandas and Matplotlib to create some plots. I want line plots with error bars on them. The code I am using currently looks like this
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.DataFrame(index=[10,100,1000,10000], columns=['A', 'B', 'C', 'D', 'E', 'F'], data=np.random.rand(4,6))
df_yerr = pd.DataFrame(index=[10,100,1000,10000], columns=['A', 'B', 'C', 'D', 'E', 'F'], data=np.random.rand(4,6))
fig, ax = plt.subplots()
df.plot(yerr=df_yerr, ax=ax, fmt="o-", capsize=5)
ax.set_xscale("log")
plt.show()
With this code, I get 6 lines on a single plot (which is what I want). However, the error bars completely overlap, making the plot difficult to read.
Is there a way I could slightly shift the position of each point on the x-axis so that the error bars no longer overlap?
Here is a screenshot:
One way to achieve what you want is to plot the error bars 'by hand', but it is neither straightforward nor much better looking than your original. Basically, you let pandas produce the line plot, then iterate through the data frame columns and draw a pyplot errorbar plot for each of them such that the index is shifted slightly sideways (in your case, with the logarithmic scale on the x axis, this means a shift by a factor). In the error bar plots, the marker size is set to zero:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
colors = ['red','blue','green','yellow','purple','black']
df = pd.DataFrame(index=[10,100,1000,10000], columns=['A', 'B', 'C', 'D', 'E', 'F'], data=np.random.rand(4,6))
df_yerr = pd.DataFrame(index=[10,100,1000,10000], columns=['A', 'B', 'C', 'D', 'E', 'F'], data=np.random.rand(4,6))
fig, ax = plt.subplots()
df.plot(ax=ax, marker="o",color=colors)
index = df.index
rows = len(index)
columns = len(df.columns)
factor = 0.95
for column,color in zip(range(columns),colors):
    y = df.values[:,column]
    yerr = df_yerr.values[:,column]
    ax.errorbar(
        df.index*factor, y, yerr=yerr, markersize=0, capsize=5,color=color,
        zorder = 10,
    )
    factor *= 1.02
ax.set_xscale("log")
plt.show()
As I said, the result is not pretty:
UPDATE
In my opinion a bar plot would be much more informative:
fig2,ax2 = plt.subplots()
df.plot(kind='bar',yerr=df_yerr, ax=ax2)
plt.show()
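For the shifted error bars described earlier in this answer, the multiplicative factors can also be computed in one step so that they are centered on the original x positions rather than starting at 0.95 and drifting to one side. This is a sketch, not the answer's method; the spread value of 1.10 is an arbitrary assumption.

```python
import numpy as np

x = np.array([10, 100, 1000, 10000])  # x positions from the question
n_series = 6                          # columns A to F
spread = 1.10                         # assumed ratio between the outermost shifted positions

# Exponents run from -0.5 to 0.5, so the factors are centered on 1 in log space.
factors = spread ** np.linspace(-0.5, 0.5, n_series)

# Row i holds the shifted x positions for series i.
shifted = x[np.newaxis, :] * factors[:, np.newaxis]

print(factors.round(4))
```

Each row of shifted can be passed as the x argument of ax.errorbar for the corresponding column. On a log axis the offsets then look equally wide at every x position.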
You can mitigate the overlap with alpha, for example:
df.plot(yerr=df_yerr, ax=ax, fmt="o-", capsize=5,alpha=0.5)
You can also check this link for reference

Remove anti-aliasing for pandas plot.area

I want to plot stacked areas with Python, and found this pandas function:
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
df.plot.area();
However, the result is weirdly antialiased, mixing together the colors, as shown on those 2 plots:
The same problem occurs in the example provided in the documentation.
Do you know how to remove this anti-aliasing? (Or another means of getting a neat output for a stacked representation of line plots.)
Using a matplotlib stack plot works fine
fig, ax = plt.subplots()
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
ax.stackplot(df.index, df.values.T)
Since the area plot is a stackplot, the only difference would be the linewidth of the areas, which you can set to zero.
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
df.plot.area(linewidth=0)
The remaining grayish lines are then indeed due to antialiasing. You may turn that off in the matplotlib plot
fig, ax = plt.subplots()
ax.stackplot(df.index, df.values.T, antialiased=False)
The result however, may not be visually appealing:
It looks like there are two boundaries.
Try a zero line width:
df.plot.area(lw=0);
