Python Pandas plot multiindex specify x and y - python

Below is an example DataFrame.
           joaquin     manolo
xx 0  0.000000e+00  44.000000
   1  1.570796e+00  52.250000
   2  3.141593e+00  60.500000
   3  4.712389e+00  68.750000
   4  6.283185e+00  77.000000
yy 0  0.000000e+00  37.841896
   1  2.078796e+00  39.560399
   2  5.292179e-17  41.026434
   3 -8.983291e-02  42.304767
   4 -4.573916e-18  43.438054
As you can see, the row index has two levels, ['xx', 'yy'] and [0, 1, 2, 3, 4]. I want to call DataFrame.plot() in such a way that it will produce two subplots, one for joaquin and one for manolo, and where I can specify to use data.loc["xx", :] for the domain data and to use data.loc["yy", :] for the ordinate data. In addition, I want the option to supply the subplots on which the plots should be drawn, in a list (or array) of matplotlib.axes._subplots.AxesSubplot instances, such as those that can be returned by the DataFrame.hist() method. How can this be done?
Generating the data above
Just in case you're wondering, below is the code I used to generate the data. If there is an easier way to generate this data, I'd be very interested to know as a side-note.
import numpy
import pandas
# One {(level, i): value} dict per column; the tuple keys become the two-level row index
joaquin_dict = {}
xx_joaquin = numpy.linspace(0, 2*numpy.pi, 5)
yy_joaquin = 10 * numpy.sin(xx_joaquin) * numpy.exp(-xx_joaquin)
for i in range(len(xx_joaquin)):
    joaquin_dict[("xx", i)] = xx_joaquin[i]
    joaquin_dict[("yy", i)] = yy_joaquin[i]
manolo_dict = {}
xx_manolo = numpy.linspace(44, 77, 5)
yy_manolo = 10 * numpy.log(xx_manolo)
for i in range(len(xx_manolo)):
    manolo_dict[("xx", i)] = xx_manolo[i]
    manolo_dict[("yy", i)] = yy_manolo[i]
data_dict = {"joaquin": joaquin_dict, "manolo": manolo_dict}
data = pandas.DataFrame.from_dict(data_dict)

Just use a for loop:
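import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 2)
for ax, col in zip(axes, data.columns):
    # unstack(0) turns the "xx"/"yy" index level into columns that can serve as x and y
    data[col].unstack(0).plot(x="xx", y="yy", ax=ax, title=col)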
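To see what unstack(0) does in that loop, here is a minimal sketch that needs no plotting. It also builds the frame with pd.concat, which may be a shorter way to generate the data; the names xx, yy and wide are made up for this illustration and are not from the original post.

```python
import numpy as np
import pandas as pd

# Per-person x and y values, using the same formulas as in the question
xx = pd.DataFrame({
    "joaquin": np.linspace(0, 2 * np.pi, 5),
    "manolo": np.linspace(44, 77, 5),
})
yy = pd.DataFrame({
    "joaquin": 10 * np.sin(xx["joaquin"]) * np.exp(-xx["joaquin"]),
    "manolo": 10 * np.log(xx["manolo"]),
})

# The dict keys passed to pd.concat become the outer level of the row index
data = pd.concat({"xx": xx, "yy": yy})

# unstack(0) moves that outer level into the columns: one "xx" and one "yy" column
wide = data["manolo"].unstack(0)
print(wide)
```

The wide frame has one row per sample and the columns xx and yy, which is why plot(x="xx", y="yy") can then be called on it.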

Related

Drawing a 1D lattice graph in Python networkx

I am trying to plot a 1D lattice graph, but I get the error below:
NetworkXPointlessConcept: the null graph has no paths, thus there is no average shortest path length
What is the problem with this code?
Thanks.
N = 1000
x = 0
for n in range(1, N, 10):
    lattice_1d_distance = list()
    d = 0
    lattice_1d = nx.grid_graph(range(1,n))
    d = nx.average_shortest_path_length(lattice_1d)
    lattice_1d_distance.append(d)
    x.append(n)
plt.plot(x, lattice_1d_distance)
plt.show()
According to the networkx documentation for nx.grid_graph, its input is a list giving the size of each dimension.
Example:
print(list(range(1, 4)))  # [1, 2, 3]
nx.draw(nx.grid_graph(list(range(1, 4))))  # effectively a two-dimensional graph: three entries, one of which is 1
print(list(range(1, 5)))  # [1, 2, 3, 4]
nx.draw(nx.grid_graph([1, 2, 3, 4]))  # effectively a three-dimensional graph: four entries, one of which is 1
So suppose you want either (1) to plot the average distance against an increasing number of dimensions, with a constant size for each dimension, or (2) to plot the average distance against an increasing size of each dimension, with a constant number of dimensions:
import networkx as nx
import matplotlib.pyplot as plt
# 1. Increasing number of dimensions, each dimension of length 2
x = []
lattice_1d_distance = []
for n in range(1, 10):
    lattice_1d = nx.grid_graph([2]*n)
    d = nx.average_shortest_path_length(lattice_1d)
    lattice_1d_distance.append(d)
    x.append(n)
plt.plot(x, lattice_1d_distance)
plt.show()
# 2. Two-dimensional graphs with an increasing length for each dimension
x = []
lattice_1d_distance = []
for n in range(1, 10):
    lattice_1d = nx.grid_graph([n,n])
    d = nx.average_shortest_path_length(lattice_1d)
    lattice_1d_distance.append(d)
    x.append(n)
plt.plot(x, lattice_1d_distance)
plt.show()
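Also, pay attention to how the list variables are declared. In the question, x = 0 is an integer, so x.append(n) fails, and lattice_1d_distance is reset to an empty list on every iteration; both should be created as empty lists before the loop. The error itself comes from the first iteration: with n = 1, range(1, n) is empty, so nx.grid_graph receives no dimensions and returns the null graph.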
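If a true 1D lattice is what you are after, a minimal sketch, assuming that a single-entry dimension list [n] is what you want (it gives a path of n nodes), could look like this. The names sizes and distances are made up for the illustration.

```python
import networkx as nx

sizes = list(range(2, 12))  # start at 2 to skip the empty and single-node cases
distances = []
for n in sizes:
    lattice = nx.grid_graph([n])  # one dimension of length n, i.e. a path of n nodes
    distances.append(nx.average_shortest_path_length(lattice))

print(list(zip(sizes, distances)))
```

For a path of n nodes the average shortest path length works out to (n + 1) / 3, so the curve is a straight line. The two lists can be passed to plt.plot in the same way as in the snippets above.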

Plotting the mean of multiple columns including standard deviation

I have a data set with 8 columns and several rows. The rows hold measurements for different variables (6 in total) under 2 different conditions; each condition has 4 columns of repeated measurements.
Using Seaborn, I would like to generate a bar chart displaying the mean and standard deviation of each set of 4 columns, grouped by index key (i.e. measured variable). The dataframe structure is as follows:
import numpy as np
import pandas as pd
np.random.seed(10)
df = pd.DataFrame({
    'S1_1': np.random.randn(6),
    'S1_2': np.random.randn(6),
    'S1_3': np.random.randn(6),
    'S1_4': np.random.randn(6),
    'S2_1': np.random.randn(6),
    'S2_2': np.random.randn(6),
    'S2_3': np.random.randn(6),
    'S2_4': np.random.randn(6),
}, index=['var1', 'var2', 'var3', 'var4', 'var5', 'var6'])
How do I tell Seaborn that I would like only 2 bars per variable, 1 for the first 4 columns and 1 for the second 4, with each bar displaying the mean (and standard deviation or some other measure of dispersion) across its 4 columns?
I was thinking of using multi-indexing, adding a second column level to group the columns into 2 conditions,
df.columns = pd.MultiIndex.from_arrays([['Condition 1'] * 4 + ['Condition 2'] * 4,df.columns])
but I can't figure out what I should pass to Seaborn to generate the plot I want.
If anyone could point me in the right direction, that would be a great help!
Update Based on Comment
Plotting is all about reshaping the dataframe for the plot API
# still create the groups
l = df.columns
n = 4
groups = [l[i:i+n] for i in range(0, len(l), n)]
num_gps = len(groups)
# stack each group and add an id column
data_list = list()
for group in groups:
    id_ = group[0][1]  # second character of the first column name, e.g. '2' from 'S2_1'
    data = df[group].copy().T
    data['id_'] = id_
    data_list.append(data)
df2 = pd.concat(data_list, axis=0).reset_index()
df2.rename({'index': 'sample'}, axis=1, inplace=True)
# melt df2 into a long form
dfm = df2.melt(id_vars=['sample', 'id_'])
# plot
p = sns.catplot(kind='bar', data=dfm, x='variable', y='value', hue='id_', ci='sd', aspect=3)
df2.head()
   sample     YAL001C     YAL002W    YAL004W    YAL005C    YAL007C    YAL008W     YAL011W   YAL012W     YAL013W    YAL014C  id_
0    S2_1  -13.062716   -8.084685   2.360795  -0.740357   3.086768  -0.117259   -5.678183  2.527573  -17.326287  -1.319402    2
1    S2_2   -5.431474  -12.676807   0.070569  -4.214761  -4.318011  -4.489010  -10.268632  0.691448  -24.189106  -2.343884    2
2    S2_3   -9.365509  -12.281169   0.497772  -3.228236   0.212941  -2.287206  -10.250004  1.111842  -27.811564  -4.329987    2
3    S2_4   -7.582111  -15.587219  -1.286167  -4.531494  -3.090265  -4.718281   -8.933496  2.079757  -21.580854  -2.834441    2
4    S3_1  -12.618254  -20.010779  -2.530541  -3.203072  -2.436503  -2.922565  -15.972632  3.551605  -35.618485  -4.925495    3
dfm.head()
  sample id_ variable      value
0   S2_1   2  YAL001C -13.062716
1   S2_2   2  YAL001C  -5.431474
2   S2_3   2  YAL001C  -9.365509
3   S2_4   2  YAL001C  -7.582111
4   S3_1   3  YAL001C -12.618254
Plot Result
kind='box'
A box plot might be a better way to convey the distribution
p = sns.catplot(kind='box', data=dfm, y='variable', x='value', hue='id_', height=12)
Original Answer
Use a list comprehension to chunk the columns into groups of 4
This uses the original, more comprehensive data that was posted. It can be found in revision 4
Create a figure with subplots and zip each group to an ax from axes
Use each group to select data from df and transpose the data with .T.
With sns.barplot the default estimator is the mean, so the length of each bar is the mean; set ci='sd' so that the error bar shows the standard deviation instead of a confidence interval.
sns.barplot(data=data, ci='sd', ax=ax) can easily be replaced with sns.boxplot(data=data, ax=ax)
import matplotlib.pyplot as plt
import seaborn as sns
# using the first comma separated data that was posted, create groups of 4
l = df.columns
n = 4 # chunk size for groups
groups = [l[i:i+n] for i in range(0, len(l), n)]
num_gps = len(groups)
# plot: one subplot per group of columns
fig, axes = plt.subplots(num_gps, 1, figsize=(12, 6*num_gps))
for ax, group in zip(axes, groups):
    data = df[group].T
    sns.barplot(data=data, ci='sd', ax=ax)
    ax.set_title(f'{group.to_list()}')
fig.tight_layout()
fig.savefig('test.png')
Example of data
The bar is the mean of each column, and the line is the standard deviation
        YAL001C     YAL002W    YAL004W    YAL005C    YAL007C    YAL008W     YAL011W    YAL012W     YAL013W    YAL014C
S8_1  -1.731388  -17.215712  -3.518643  -2.358103   0.418170  -1.529747  -12.630343   2.435674  -27.471971  -4.021264
S8_2  -1.325524  -24.056632  -0.984390  -2.119338  -1.770665  -1.447103  -10.618954   2.156420  -30.362998  -4.735058
S8_3  -2.024020  -29.094027  -6.146880  -2.101090  -0.732322  -2.773949  -12.642857  -0.009749  -28.486835  -4.783863
S8_4   2.541671  -13.599049  -2.688125  -2.329332  -0.694555  -2.820627   -8.498677   3.321018  -31.741916  -2.104281
Plot Result
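For the 8-column frame in the question itself, the same reshaping idea can be sketched with pandas alone. This is only an illustration under the assumption that the part of each column name before the underscore ("S1", "S2") identifies the condition; long_df, condition and summary are names invented here.

```python
import numpy as np
import pandas as pd

# Same construction as the frame in the question
np.random.seed(10)
cols = [f"S{cond}_{rep}" for cond in (1, 2) for rep in (1, 2, 3, 4)]
df = pd.DataFrame({c: np.random.randn(6) for c in cols},
                  index=[f"var{i}" for i in range(1, 7)])

# Long form: one row per (variable, sample) measurement
long_df = (
    df.rename_axis("variable")
      .reset_index()
      .melt(id_vars="variable", var_name="sample", value_name="value")
)
# Assumption: the prefix before "_" names the condition
long_df["condition"] = long_df["sample"].str.split("_").str[0]

# Mean and standard deviation of the 4 repeats per variable and condition
summary = long_df.groupby(["variable", "condition"])["value"].agg(["mean", "std"])
print(summary)
```

A frame shaped like long_df is what Seaborn expects: sns.barplot(data=long_df, x='variable', y='value', hue='condition', ci='sd') should then draw 2 bars per variable. Note that newer Seaborn releases (0.12 and later) deprecate ci in favor of errorbar='sd'.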

How to draw proper chart of distributional tree?

I am using Python with matplotlib and need to visualize the percentage distribution of the sub-groups of a data set.
Imagine this tree:
Data --- group1 (40%)
     |
     --- group2 (25%)
     |
     --- group3 (35%)
group1 --- A (25%)
       |
       --- B (25%)
       |
       --- c (50%)
And it can go on: each group can have several sub-groups, and the same holds for each sub-group.
How can I plot a proper chart for this information?
I created a minimal reproducible example that I think fits your description, but please let me know if that is not what you need.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
data = pd.DataFrame()
n_rows = 100
data['group'] = np.random.choice(['1', '2', '3'], n_rows)
data['subgroup'] = np.random.choice(['A', 'B', 'C'], n_rows)
For instance, we could get the following counts for the subgroups.
In [1]: data.groupby(['group'])['subgroup'].value_counts()
Out[1]:
group  subgroup
1      A           17
       C           16
       B            5
2      A           23
       C           10
       B            7
3      C            8
       A            7
       B            7
Name: subgroup, dtype: int64
I created a function that computes the necessary counts given an ordering of the columns (e.g. ['group', 'subgroup']) and incrementally plots the bars with the corresponding percentages.
import matplotlib.pyplot as plt
import numpy as np
def plot_tree(data, ordering, axis=False):
    """
    Plots a sequence of bar plots reflecting how the data
    is distributed at different levels. The order of the
    levels is given by the ordering parameter.
    Parameters
    ----------
    data: pandas DataFrame
    ordering: list
        Names of the columns to be plotted. They should be
        ordered top down, from the larger to the smaller group.
    axis: boolean
        Whether to plot the axis.
    Returns
    -------
    fig: matplotlib figure object.
        The final tree plot.
    """
    # Frame set-up
    fig, ax = plt.subplots(figsize=(9.2, 3*len(ordering)))
    ax.set_xticks(np.arange(-1, len(ordering)) + 0.5)
    ax.set_xticklabels(['All'] + ordering, fontsize=18)
    if not axis:
        plt.axis('off')
    # Get colormap
    labels = ['All']
    for o in reversed(ordering):
        labels.extend(data[o].unique().tolist())
    # Pastel is nice but has few colors. Change for a larger map if needed.
    # plt.get_cmap is used because matplotlib.cm.get_cmap was removed in Matplotlib 3.9.
    cmap = plt.get_cmap('Pastel1', len(labels))
    colors = dict(zip(labels, [cmap(i) for i in range(len(labels))]))
    # Group the counts
    counts = data.groupby(ordering).size().reset_index(name='c_' + ordering[-1])
    for i, o in enumerate(ordering[:-1], 1):
        if ordering[:i]:
            counts['c_' + o] = counts.groupby(ordering[:i]).transform('sum')['c_' + ordering[-1]]
    # Calculate percentages
    counts['p_' + ordering[0]] = counts['c_' + ordering[0]]/data.shape[0]
    for i, o in enumerate(ordering[1:], 1):
        counts['p_' + o] = counts['c_' + o]/counts['c_' + ordering[i-1]]
    # Plot first bar - all data
    ax.bar(-1, data.shape[0], width=1, label='All', color=colors['All'], align="edge")
    ax.annotate('All -- 100%', (-0.9, 0.5), fontsize=12)
    comb = 1  # keeps track of the number of possible combinations at each level
    for bar, col in enumerate(ordering):
        labels = sorted(data[col].unique())*comb
        comb *= len(data[col].unique())
        # Get only the relevant counts at this level
        local_counts = counts[ordering[:bar+1] +
                              ['c_' + o for o in ordering[:bar+1]] +
                              ['p_' + o for o in ordering[:bar+1]]].drop_duplicates()
        sizes = local_counts['c_' + col]
        percs = local_counts['p_' + col]
        bottom = 0  # start from 0
        for size, perc, label in zip(sizes, percs, labels):
            ax.bar(bar, size, width=1, bottom=bottom, label=label, color=colors[label], align="edge")
            ax.annotate('{} -- {:.0%}'.format(label, perc), (bar+0.1, bottom+0.5), fontsize=12)
            bottom += size  # stack the bars
    ax.legend(colors)
    return fig
With the data shown above we would get the following.
fig = plot_tree(data, ['group', 'subgroup'], axis=True)
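The percentages annotated on the bars can also be checked without plotting. The following is a rough sketch with pandas only; the seeded generator is a stand-in for the random data above, and group_share and subgroup_share are names invented here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded stand-in for the random data above
n_rows = 100
data = pd.DataFrame({
    "group": rng.choice(["1", "2", "3"], n_rows),
    "subgroup": rng.choice(["A", "B", "C"], n_rows),
})

# Share of each group in the whole data set
group_share = data["group"].value_counts(normalize=True).sort_index()

# Share of each subgroup within its parent group
subgroup_share = data.groupby("group")["subgroup"].value_counts(normalize=True)

print(group_share)
print(subgroup_share)
```

The group shares sum to 1 over the whole data set, and the subgroup shares sum to 1 within each group, which matches the tree in the question.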
Have you tried a stacked bar graph?
https://matplotlib.org/gallery/lines_bars_and_markers/bar_stacked.html#sphx-glr-gallery-lines-bars-and-markers-bar-stacked-py

How do I set different colors for subsets of line plot iterations in matplotlib?

I am iteratively plotting the np.exp results of 12 rows of data from a 2D array (12,5000), out_array. All data share the same x values, (x_d). I want the first 4 iterations to all plot as the same color, the next 4 to be a different color, and next 4 a different color...such that I have 3 different colors each corresponding to the 1st-4th, 5th-8th, and 9th-12th iterations respectively. In the end, it would also be nice to define these sets with their corresponding colors in a legend.
I have researched cycler (https://matplotlib.org/examples/color/color_cycle_demo.html), but I can't figure out how to assign colors into sets of iterations > 1. (i.e. 4 in my case). As you can see in my code example, I can have all 12 lines plotted with different (default) colors -or- I know how to make them all the same color (i.e. ...,color = 'r',...)
plt.figure()
for i in range(out_array.shape[0]):
    plt.plot(x_d, np.exp(out_array[i]),linewidth = 1, alpha = 0.6)
plt.xlim(-2,3)
I expect a plot like this, only with a total of 3 different colors, each corresponding to the chunks of iterations described above.
Another solution:
import matplotlib.pyplot as plt
import numpy as np
x = np.arange(10)
color = ['r', 'g', 'b']
for i in range(12):
    # integer division maps iterations 0-3, 4-7 and 8-11 to colors 0, 1 and 2
    plt.plot(x, i*x, color[i//4])
plt.show()
plt.figure()
n = 0
color = ['r','g','b']
for i in range(out_array.shape[0]):
    n = n+1
    if n/4 <= 1:
        c = 1
    elif n/4 >1 and n/4 <= 2:
        c = 2
    elif n/4 >2:
        c = 3
    else:
        print(n)
    plt.plot(x_d, np.exp(out_array[i]),color = color[c-1])
plt.show()
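Neither snippet adds the legend asked for in the question. One possible way, sketched here with stand-in data because out_array and x_d were not posted, is to label only the first line of each group; group_size, group_colors and group_names are names made up for this example.

```python
import matplotlib.pyplot as plt
import numpy as np

# Stand-in data: 12 rows sharing the same x values (the real out_array is not shown)
x_d = np.linspace(-2, 3, 200)
centers = np.linspace(-1, 2, 12)
out_array = -0.5 * (x_d[None, :] - centers[:, None]) ** 2

group_size = 4
group_colors = ["r", "g", "b"]
group_names = ["rows 1-4", "rows 5-8", "rows 9-12"]

fig, ax = plt.subplots()
for i, row in enumerate(out_array):
    g = i // group_size  # 0 for rows 0-3, 1 for rows 4-7, 2 for rows 8-11
    # Label only the first line of each group so the legend gets one entry per color
    label = group_names[g] if i % group_size == 0 else "_nolegend_"
    ax.plot(x_d, np.exp(row), color=group_colors[g], linewidth=1, alpha=0.6, label=label)
ax.set_xlim(-2, 3)
ax.legend()
```

Call plt.show() or fig.savefig(...) afterwards to view the figure. The legend should then list three entries, one per color.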

Plotting pandas dataframe with string labels

I have a pandas dataframe that has several fields. The ones of importance are:
In[191]: tasks[['start','end','appId','index']]
Out[189]:
                  start                end                           appId  index
2576  1464262540102.000  1464262541204.000  application_1464258584784_0012      1
2577  1464262540098.000  1464262541208.000  application_1464258584784_0012      0
2579  1464262540104.000  1464262541194.000  application_1464258584784_0012      3
2583  1464262540107.000  1464262541287.000  application_1464258584784_0012      6
2599  1464262540125.000  1464262541214.000  application_1464258584784_0012     26
2600  1464262541191.000  1464262541655.000  application_1464258584784_0012     28
...
2701  1464262562172.000  1464262591147.000  application_1464258584784_0013     14
2718  1464262578901.000  1464262588156.000  application_1464258584784_0013     28
2727  1464262591145.000  1464262602085.000  application_1464258584784_0013     40
I want to plot a line for each row that goes from the coordinates (x1=start, y1=index) to (x2=end, y2=index). Each line should have a different color depending on the value of appId, which is a string. This is all done in a subplot I have inside a time series plot. I post the code here, but the important bit is the tasks.iterrows() part; you can ignore the rest.
def plot_stage_in_host(dfm,dfg,appId,stageId,parameters,host):
    [s,e] = time_interval_for_app(dfm, appId,stageId, host)
    time_series = create_time_series_host(dfg, host, parameters, s,e)
    fig,p1 = plt.subplots()
    p2 = p1.twinx()
    for para in parameters:
        p1.plot(time_series.loc[time_series['parameter']==para].time,time_series.loc[time_series['parameter']==para].value,label=para)
    p1.legend()
    p1.set_xlabel("Time")
    p1.set_ylabel(ylabel='%')
    p1.set(ylim=(-1,1))
    p2.set_ylabel("TASK INDEX")
    tasks = dfm.loc[(dfm["hostname"]==host) & (dfm["start"]>s) & (dfm["end"]<e) & (dfm["end"]!=0)] #& (dfm["appId"]==appId) & (dfm["stageId"]==stageId)]
    apps = tasks.appId.unique()
    norm = colors.Normalize(0,len(apps))
    scalar_map = cm.ScalarMappable(norm=norm, cmap='hsv')
    for _,row in tasks.iterrows():
        color = scalar_map.to_rgba(np.where(apps == row['appId'])[0][0])
        p2.plot([row['start'],row['end']],[row['index'],row['index']],lw=4 ,c=color)
    p2.legend(apps,loc='lower right')
    p2.show()
This is the result I get.
Apparently it is not considering the labels, and the legend shows the same color for all the lines. How can I label them correctly and show the legend as well?
The problem is that you are assigning the label each time you plot the graph in the for loop using the label= argument. Try removing it and giving p2.legend() a list of strings that represent the labels you want to show.
p2.legend(['label1', 'label2'])
If you want to assign a different color to each line, try the following:
import matplotlib.pyplot as plt
import numpy as np
xdata = [1, 2, 3, 4, 5]
ydata = [[np.random.randint(0, 6) for i in range(5)],
         [np.random.randint(0, 6) for i in range(5)],
         [np.random.randint(0, 6) for i in range(5)]]
colors = ['r', 'g', 'b'] # can be hex colors as well
legend_names = ['a', 'b', 'c']
for c, y in zip(colors, ydata):
    plt.plot(xdata, y, c=c)
plt.legend(legend_names)
plt.show()
It gives the following result:
Hope this helps!
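When many lines share a label, as with the appId values in the question, one option is to build the legend from one proxy handle per category instead of from the plotted lines. The following is only a sketch under assumptions: the small tasks frame and the shortened appId strings are invented stand-ins, and the tab10 colormap is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.lines import Line2D

# Invented stand-in for the tasks frame in the question
tasks = pd.DataFrame({
    "start": [0.0, 0.5, 2.0, 2.5],
    "end": [1.0, 1.5, 3.0, 4.0],
    "appId": ["app_0012", "app_0012", "app_0013", "app_0013"],
    "index": [0, 1, 14, 28],
})

# One fixed color per distinct appId
apps = list(tasks["appId"].unique())
cmap = plt.get_cmap("tab10")
color_of = {app: cmap(i) for i, app in enumerate(apps)}

fig, ax = plt.subplots()
for _, row in tasks.iterrows():
    ax.plot([row["start"], row["end"]], [row["index"], row["index"]],
            lw=4, color=color_of[row["appId"]])

# Proxy handles: one legend entry per appId rather than one per plotted line
handles = [Line2D([0], [0], color=color_of[app], lw=4) for app in apps]
legend = ax.legend(handles, apps, loc="lower right")
```

Because the handles are created separately from the plotted lines, the legend colors follow the appId-to-color mapping no matter in which order the rows are drawn.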
