I have some data of the form:
Name Score1 Score2 Score3 Score4
Bob -2 3 5 7
and I'm trying to use bqplot to plot a really basic bar chart.
I'm trying:
sc_ord = OrdinalScale()
y_sc_rf = LinearScale()
bar_chart = Bars(x=data6.Name,
                 y=[data6.Score1, data6.Score2, data6.Score3],
                 scales={'x': sc_ord, 'y': y_sc_rf},
                 labels=['Score1', 'Score2', 'Score3'],
                 )
ord_ax = Axis(label='Score', scale=sc_ord, grid_lines='none')
y_ax = Axis(label='Scores', scale=y_sc_rf, orientation='vertical',
            grid_lines='solid')
Figure(axes=[ord_ax, y_ax], marks=[bar_chart])
But all I'm getting is one bar, I assume because Name only has one value. Is there a way to set the column headers as the x data, or some other way to solve this?
I think this is what Doug is getting at. Your length of x and y data should be the same. In this case, x is the column labels, and y is the score values. You should set the 'Name' column of your DataFrame as the index; this will prevent it from being plotted as a value.
PS. Next time, if you make sure your code is a complete example that can be run from scratch without external data (a MCVE, https://stackoverflow.com/help/mcve) you are likely to get a much quicker answer.
The bqplot documentation has lots of good examples using the more complex Object Model interface, which are worth reading: https://github.com/bloomberg/bqplot/blob/master/examples/Marks/Object%20Model/Bars.ipynb
from bqplot import *
import pandas as pd
data = pd.DataFrame(
    index = ['Bob'],
    columns = ['score1', 'score2', 'score3', 'score4'],
    data = [[-2, 3, 5, 7]]
)
sc_ord = OrdinalScale()
y_sc_rf = LinearScale()
# x is the column labels, y is the single row of scores
bar_chart = Bars(x=data.columns, y=data.iloc[0],
                 scales={'x': sc_ord, 'y': y_sc_rf},
                 # data.index[1:] would be empty here, since there is only one row
                 labels=data.index.tolist(),
                 )
ord_ax = Axis(label='Score', scale=sc_ord, grid_lines='none')
y_ax = Axis(label='Scores', scale=y_sc_rf, orientation='vertical',
            grid_lines='solid')
Figure(axes=[ord_ax, y_ax], marks=[bar_chart])
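If your data already sits in a DataFrame with a Name column (as data6 in the question appears to), the reshaping step could look roughly like this. This is only a sketch using pandas; the frame below is a made-up stand-in for data6, and the names scores, row, x and y are just for illustration. The two lists are what you would then hand to Bars(x=..., y=...).

```python
import pandas as pd

# Hypothetical stand-in for the question's data6
data6 = pd.DataFrame({
    "Name": ["Bob"],
    "Score1": [-2], "Score2": [3], "Score3": [5], "Score4": [7],
})

# Move Name into the index so only the score columns remain as values
scores = data6.set_index("Name")

# One x label per score column, one y value per label
row = scores.loc["Bob"]
x = row.index.tolist()
y = row.tolist()

print(x)
print(y)
```

The point is simply that x and y end up with the same length (four labels, four values).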
Just to be upfront, I am a mechanical engineer with limited coding experience, though I have some programming classes under my belt (Java, C++, and Lisp).
I have inherited this code from my predecessor and am just trying to make it work for what I'm doing with it. I need to iterate through an Excel file whose column A holds the values 0, 1, 2, and 3 (in the code below this corresponds to "Revs"), pick out all the rows with value = 0 and put them into a separate folder, and again for value = 2, etc. Thank you for bearing with me; I appreciate any help I can get.
import pandas as pd
import numpy as np
import os
import os.path
import xlsxwriter
import matplotlib.pyplot as plt
import six
import matplotlib.backends.backend_pdf
from matplotlib.gridspec import GridSpec
from matplotlib.ticker import AutoMinorLocator, MultipleLocator
def CamAnalyzer(entryName):
    #Enter excel data from file as a dataframe
    df = pd.read_excel(str(file_loc) + str(entryName), header = 1) #header 1 to get correct header row
    print(df)
    #setup grid for plots
    plt.style.use('fivethirtyeight')
    fig = plt.figure(figsize=(17,22))
    gs = GridSpec(3,2, figure=fig)
    props = dict(boxstyle='round', facecolor='w', alpha=1)
    #create a list of 4 smaller dataframes by splitting df when the rev count changes and name them
    dfSplit = list(df.groupby("Revs"))
    names = ["Air Vent","Inlet","Diaphram","Outlet"]
    for x, y in enumerate(dfSplit):
        #for each smaller dataframe #x,(df-y), create a polar plot and assign it to a space in the grid
        dfs = y[1]
        r = dfs["Measurement"].str.strip(" in") #radius measurement column has units. ditch em
        r = r.apply(pd.to_numeric) + zero/2 #convert all values in the frame to a float
        theta = dfs["Rads"]
        if x<2:
            ax = fig.add_subplot(gs[1,x],polar = True)
        else:
            ax = fig.add_subplot(gs[2,x-2],polar = True)
        ax.set_rlim(0,0.1) #set limits to radial axis
        ax.plot(theta, r)
        ax.grid(True)
        ax.set_title(names[x]) #nametag
    #create another subplot in the grid that overlays all 4 smaller dataframes on one plot
    ax2 = fig.add_subplot(gs[0,:],polar = True)
    ax2.set_rlim(0,0.1)
    for x, y in enumerate(dfSplit):
        dfs = y[1]
        r = dfs["Measurement"].str.strip(" in")
        r = r.apply(pd.to_numeric) + zero/2
        theta = dfs["Rads"]
        ax2.plot(theta, r)
    #splitext drops the extension; strip(".xlsx") would also eat any leading/trailing ".", "x", "l", "s" characters of the name
    sampleName = os.path.splitext(str(entryName))[0]
    ax2.set_title("Sample " + sampleName + " Overlayed")
    ax2.legend(names,bbox_to_anchor=(1.1, 1.05)) #place legend outside of plot area
    plt.savefig(str(file_loc) + "/Results/" + sampleName + ".png")
    print("Results Saved")
I'm on my phone, so I can't check exact code examples, but this should get you started.
First, most of the code you posted is about graphing, and therefore not useful for your needs. The basic approach: use pandas (a library), to read in the Excel sheet, use the pandas function 'groupby' to split that sheet by 'Revs', then iterate through each Rev, and use pandas again to write back to a file. Copying the relevant sections from above:
#this brings in the necessary library
import pandas as pd
#Read excel data from file as a dataframe
#header should point to the row that describes your columns. The first row is row 0.
df = pd.read_excel("filename.xlsx", header = 1)
#create a list of 4 smaller dataframes using GroupBy.
#This returns a 'GroupBy' object.
dfSplit = df.groupby("Revs")
#iterate through the groupby object, saving each
#iterating over key (name) and value (dataframes)
#use the name to build a filename
for name, frame in dfSplit:
    frame.to_excel("Rev "+str(name)+".xlsx")
Edit: I had a chance to test this code, and it should now work. This will depend a little on your actual file (eg, which row is your header row).
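Since the question asks for a separate folder per value, here is a rough sketch of that variation. Assumptions on my part: the data below is made up to stand in for the Excel sheet, the output root is a temporary directory, the file name data.csv is arbitrary, and I write CSV rather than Excel so the sketch runs without an Excel writer installed. Swap to_csv back to to_excel for real use.

```python
import os
import tempfile

import pandas as pd

# Made-up data standing in for the sheet; in practice df comes from pd.read_excel
df = pd.DataFrame({
    "Revs": [0, 0, 1, 2, 2, 3],
    "Measurement": [0.01, 0.02, 0.03, 0.04, 0.05, 0.06],
})

# Hypothetical output location; a temp dir keeps the sketch self-contained
out_root = tempfile.mkdtemp()

written = []
for rev, frame in df.groupby("Revs"):
    # One folder per distinct Revs value, created if it does not exist yet
    folder = os.path.join(out_root, f"Rev {rev}")
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, "data.csv")
    frame.to_csv(path, index=False)
    written.append(path)

print(written)
```

Each folder then holds only the rows for its own Revs value.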
I'm adding the rSquared to a chart using the method outlined in this answer:
r2 = alt.Chart(df).transform_regression('x', 'y', params=True
).mark_text().encode(x=alt.value(20), y=alt.value(20), text=alt.Text('rSquared:N', format='.4f'))
But I want to prepend "rSquared = " to the final text.
I've seen this answer involving an f string and a value calculated outside the chart, but I'm not clever enough to figure out how to apply that solution to this scenario.
I've tried, e.g., format='rSquared = .4f', but adding any additional text breaks the output, which I'm sure is the system working as intended.
One possible solution using the posts you linked to would be to extract the value of the parameter using altair_transform and then add the value to the plot. This is not the most elegant solution but should achieve what you want.
# pip install git+https://github.com/altair-viz/altair-transform.git
import altair as alt
import pandas as pd
import numpy as np
import altair_transform
np.random.seed(42)
x = np.linspace(0, 10)
y = x - 5 + np.random.randn(len(x))
df = pd.DataFrame({'x': x, 'y': y})
chart = alt.Chart(df).mark_point().encode(
    x='x',
    y='y'
)
line = chart.transform_regression('x', 'y').mark_line()
params = chart.transform_regression('x','y', params=True).mark_line()
R2 = altair_transform.extract_data(params)['rSquared'][0]
text = alt.Chart({'values':[{}]}).mark_text(
    align="left", baseline="top"
).encode(
    x=alt.value(5),  # pixels from left
    y=alt.value(5),  # pixels from top
    text=alt.value(f"rSquared = {R2:.4f}"),
)
chart + line + text
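If installing altair_transform is not an option, the same label can be built by computing R² outside the chart with NumPy. This is a swapped-in technique, not what the code above does, and I am assuming that for a plain linear fit it gives the same number transform_regression reports. The variable names are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 10)
y = x - 5 + rng.standard_normal(len(x))

# Least-squares straight-line fit
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# R^2 = 1 - SS_res / SS_tot
r2 = 1 - np.sum(residuals**2) / np.sum((y - y.mean())**2)

# The f-string is where the prefix gets prepended
label = f"rSquared = {r2:.4f}"
print(label)
```

The resulting string can be passed to the text layer as text=alt.value(label), exactly as in the code above.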
I'm currently experimenting with pandas and matplotlib.
I have created a Pandas dataframe which stores data like this:
cmc|coloridentity
1 | G
1 | R
2 | G
3 | G
3 | B
4 | B
What I now want to do is to make a stacked bar plot where I can see how many entries per cmc exist. And I want to do that for all coloridentity and stack them above.
My thoughts so far:
#get all unique values of coloridentity
unique_values = df['coloridentity'].unique()
#Create two dictionaries. One for the number of entries per cost and one
# to store the different costs for each color
color_dict_values = {}
color_dict_index = {}
for u in unique_values:
    temp_df = df['cmc'].loc[df['coloridentity'] == u].value_counts()
    color_dict_values[u] = np.array(temp_df)
    color_dict_index[u] = temp_df.index.to_numpy()
width = 0.4
p1 = plt.bar(color_dict_index['G'], color_dict_values['G'], width, color='g')
p2 = plt.bar(color_dict_index['R'], color_dict_values['R'], width,
             bottom=color_dict_values['G'], color='r')
plt.show()
But this gives me an error: the bottom of the second bar plot is set to the values of the first, and the two arrays have different numpy shapes.
Does anyone know a solution? I thought of adding 0 values so that the shapes are the same, but I don't know if this is the best solution and, if so, what the best way to do it would be.
Working with a fixed index (the range of cmc values), makes things easier. That way the color_dict_values of a color_id give a count for each of the possible cmc values (stays zero when there are none).
The color_dict_index isn't needed any more. To fill in the color_dict_values, we iterate through the temporary dataframe with the value_counts.
To plot the bars, the x-axis is now the range of possible cmc values. I added [1:] to each array to skip the zero at the beginning which would look ugly in the plot.
The bottom starts at zero, and gets incremented by the color_dict_values of the color that has just been plotted. (Thanks to numpy, the constant 0 added to an array will be that array.)
In the code I generated some random numbers similar to the format in the question.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
N = 50
df = pd.DataFrame({'cmc': np.random.randint(1, 10, N), 'coloridentity': np.random.choice(['R', 'G'], N)})
# get all unique values of coloridentity
unique_values = df['coloridentity'].unique()
# find the range of all cmc indices
max_cmc = df['cmc'].max()
cmc_range = range(max_cmc + 1)
# dictionary for each coloridentity: array of values of each possible cmc
color_dict_values = {}
for u in unique_values:
    value_counts_df = df['cmc'].loc[df['coloridentity'] == u].value_counts()
    color_dict_values[u] = np.zeros(max_cmc + 1, dtype=int)
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for ind, cnt in value_counts_df.items():
        color_dict_values[u][ind] = cnt
width = 0.4
bottom = 0
for col_id, col in zip(['G', 'R'], ['limegreen', 'crimson']):
    plt.bar(cmc_range[1:], color_dict_values[col_id][1:], bottom=bottom, width=width, color=col)
    bottom += color_dict_values[col_id][1:]
plt.xticks(cmc_range[1:]) # make sure every cmc gets a tick label
plt.tick_params(axis='x', length=0) # hide the tick marks
plt.xlabel('cmc')
plt.ylabel('count')
plt.show()
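As a more compact variation on the same idea (a fixed index with zeros for missing combinations), pandas can build the count table in one call. This is a sketch on the sample data from the question; bottoms is a name I made up for the running totals that would be passed as bottom= to plt.bar.

```python
import pandas as pd

# The sample data from the question
df = pd.DataFrame({
    "cmc": [1, 1, 2, 3, 3, 4],
    "coloridentity": ["G", "R", "G", "G", "B", "B"],
})

# Rows: cmc values, columns: colors, cells: counts (0 where a combination is absent)
counts = pd.crosstab(df["cmc"], df["coloridentity"])

# The bottom of each color's bars is the running total of the colors before it
bottoms = counts.cumsum(axis=1) - counts

print(counts)
print(bottoms)
```

Because every column has the same length, the shape mismatch from the question cannot occur. If you do not need control over the individual bars, counts.plot(kind="bar", stacked=True) draws the stacked chart directly.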
The post Get data points from Seaborn distplot describes how you can get data elements using sns.distplot(x).get_lines()[0].get_data(), sns.distplot(x).patches and [h.get_height() for h in sns.distplot(x).patches]
But how can you do this if you've used multiple layers by plotting the data in a loop, such as:
Snippet 1
for var in list(df):
    print(var)
    distplot = sns.distplot(df[var])
Plot
Is there a way to retrieve the X and Y values for both linecharts and the bars?
Here's the whole setup for an easy copy&paste:
#%%
# imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pylab
pylab.rcParams['figure.figsize'] = (8, 4)
import seaborn as sns
from collections import OrderedDict
# Function to build synthetic data
def sample(rSeed, periodLength, colNames):
    np.random.seed(rSeed)
    date = pd.to_datetime("1st of Dec, 1999")
    cols = OrderedDict()
    for col in colNames:
        cols[col] = np.random.normal(loc=0.0, scale=1.0, size=periodLength)
    dates = date+pd.to_timedelta(np.arange(periodLength), 'D')
    df = pd.DataFrame(cols, index = dates)
    return(df)
# Dataframe with synthetic data
df = sample(rSeed = 123, colNames = ['X1', 'X2'], periodLength = 50)
# sns.distplot with multiple layers
for var in list(df):
    myPlot = sns.distplot(df[var])
Here's what I've tried:
Y-values for histogram:
If I run:
barX = [h.get_height() for h in myPlot.patches]
Then I get the following list of length 11:
[0.046234272703757885,
0.1387028181112736,
0.346757045278184,
0.25428849987066837,
0.2542884998706682,
0.11558568175939472,
0.11875881712519201,
0.3087729245254993,
0.3087729245254993,
0.28502116110046083,
0.1662623439752689]
And this seems reasonable since there seems to be 6 values for the blue bars and 5 values for the red bars. But how do I tell which values belong to which variable?
Y-values for line:
This seems a bit easier than the histogram part since you can use myPlot.get_lines()[0].get_data() AND myPlot.get_lines()[1].get_data() to get:
Out[678]:
(array([-4.54448949, -4.47612134, -4.40775319, -4.33938504, -4.27101689,
...
3.65968859, 3.72805675, 3.7964249 , 3.86479305, 3.9331612 ,
4.00152935, 4.0698975 , 4.13826565]),
array([0.00042479, 0.00042363, 0.000473 , 0.00057404, 0.00073097,
0.00095075, 0.00124272, 0.00161819, 0.00208994, 0.00267162,
...
0.0033384 , 0.00252219, 0.00188591, 0.00139919, 0.00103544,
0.00077219, 0.00059125, 0.00047871]))
myPlot.get_lines()[1].get_data()
Out[679]:
(array([-3.68337423, -3.6256517 , -3.56792917, -3.51020664, -3.4524841 ,
-3.39476157, -3.33703904, -3.27931651, -3.22159398, -3.16387145,
...
3.24332952, 3.30105205, 3.35877458, 3.41649711, 3.47421965,
3.53194218, 3.58966471, 3.64738724]),
array([0.00035842, 0.00038018, 0.00044152, 0.00054508, 0.00069579,
0.00090076, 0.00116922, 0.00151242, 0.0019436 , 0.00247792,
...
0.00215912, 0.00163627, 0.00123281, 0.00092711, 0.00070127,
0.00054097, 0.00043517, 0.00037599]))
But the whole thing still seems a bit cumbersome. So does anyone know of a more direct approach to perhaps retrieve all data to a dictionary or dataframe?
I had the same need to retrieve data from a seaborn distribution plot. What worked for me was calling the .findobj() method on each iteration's graph. The matplotlib.lines.Line2D objects it returns have a get_data() method, similar to the myPlot.get_lines()[1].get_data() you mentioned.
Following your example code:
data = []
for idx, var in enumerate(list(df)):
    myPlot = sns.distplot(df[var])
    # Find Line2D objects
    lines2D = [obj for obj in myPlot.findobj() if str(type(obj)) == "<class 'matplotlib.lines.Line2D'>"]
    # Retrieving x, y data
    x, y = lines2D[idx].get_data()
    # Store as dataframe
    data.append(pd.DataFrame({'x':x, 'y':y}))
Notice here that the data for the first sns.distplot plot is stored at the first index of lines2D and the data for the second sns.distplot at the second index. I'm not really sure why it happens this way, but if you were to consider more than two plots, you would access each sns.distplot's data by indexing lines2D at its respective position.
Finally, to verify, one can plot each distplot:
plt.plot(data[0].x, data[0].y)
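On the histogram part of the question (which bar heights belong to which variable), one way to sidestep the problem is to compute the bars per variable directly instead of reading them back from the plot. This is a different technique from the findobj() approach above, it covers only the bars and not the KDE line, and bins="auto" is my assumption: it may not reproduce the exact bins distplot picked. The data is synthetic, along the lines of the question's sample() helper.

```python
import numpy as np
import pandas as pd

# Synthetic data standing in for the question's dataframe
rng = np.random.default_rng(123)
df = pd.DataFrame({
    "X1": rng.normal(size=50),
    "X2": rng.normal(size=50),
})

hist_data = {}
for var in df.columns:
    # density=True normalizes the heights so the bar areas sum to 1
    heights, edges = np.histogram(df[var], bins="auto", density=True)
    hist_data[var] = pd.DataFrame({
        "left": edges[:-1],
        "right": edges[1:],
        "height": heights,
    })

print(hist_data["X1"])
```

This gives a dictionary keyed by variable name, with one small dataframe of bin edges and heights per variable.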
I'm recording data from 5 temperature sensors using a Raspberry Pi running Python 3.
All is working well and I now want to display plots of the 5 temperatures on one graph, updating every 10 minutes or so. I'd like to use Plotly.
I wrote the following code to test out the idea.
#many_lines2
# trying to sort out why x is sent more than once when using extend
import time
import plotly.plotly as py
from plotly.graph_objs import *
import plotly.tools as tls
#tls.set_credentials_file(username=, api_key)
from datetime import datetime
for count in range(1, 5):
    x1 = count
    y1 = count * 2
    y2 = count * 3
    y3 = count * 4
    trace1 = Scatter(x=x1, y=y1, mode="lines")
    trace2 = Scatter(x=x1, y=y2, mode="lines")
    trace3 = Scatter(x=x1, y=y3, mode="lines")
    data = [trace1, trace2, trace3]
    py.plot(data, filename="3lines6", fileopt="extend")
    time.sleep(60)
See the plot and the data received by Plotly here: https://plot.ly/~steverat/334/trace0-trace1-trace2/
See the data tab for the data received by Plotly.
It looks to me as though the x value in the data table has been added three times after the first values were sent.
I can get the right results by using .append in Python to create lists of values. This leads to long lists and more data being sent to Plotly, and just seems wrong.
The code to do this is below, and the data on the Plotly server can be found here: https://plot.ly/~steverat/270
# using lists and append to send data to plotly
import time
import plotly.plotly as py
from plotly.graph_objs import *
import plotly.tools as tls
#tls.set_credentials_file(username='steverat', api_key='f0qs8y2vj8')
from datetime import datetime
xlist = []
y1list= []
y2list = []
y3list = []
for count in range(1, 5):
    xlist.append(count)
    y1list.append(count * 2)
    y2list.append(count * 3)
    y3list.append(count * 4)
    # print() calls, since this runs on Python 3
    print("xlist = ", xlist)
    print("y1list = ", y1list)
    print("y2list = ", y2list)
    trace1 = Scatter(x=xlist, y=y1list, mode="lines")
    trace2 = Scatter(x=xlist, y=y2list, mode="lines")
    trace3 = Scatter(x=xlist, y=y3list, mode="lines")
    data = [trace1, trace2, trace3]
    py.plot(data, filename="3lines2")
    time.sleep(60)
I've searched the web and can find examples where data is streamed, but I only want to update the plots every 10 minutes or longer.
Have I missed something obvious?
Cheers
Steve
Andrew from Plotly here. Thanks very much for documenting this so well!
EDIT
This issue should now be fixed, which makes the following workaround obsolete/incorrect. Please don't use the following workaround anymore! (keeping it here for documentation though)
TL;DR (just make it work)
Try this code out:
import time
import plotly.plotly as py
from plotly.graph_objs import Figure, Scatter
filename = 'Stack Overflow 31436471'
# setup original figure (behind the scenes, Plotly assumes you're sharing that x source)
# note that the x arrays are all the same and the y arrays are all *different*
fig = Figure()
fig['data'].append(Scatter(x=[0], y=[1], mode='lines'))
fig['data'].append(Scatter(x=[0], y=[2], mode='lines'))
fig['data'].append(Scatter(x=[0], y=[3], mode='lines'))
print(py.plot(fig, filename=filename, auto_open=False))
# --> https://plot.ly/~theengineear/4949
# start extending the plots
for i in range(1, 3):
    x1 = i
    y1 = i * 2
    y2 = i * 3
    y3 = i * 4
    fig = Figure()
    # note that since the x arrays are shared, you only need to extend one of them
    fig['data'].append(Scatter(x=x1, y=y1))
    fig['data'].append(Scatter(y=y2))
    fig['data'].append(Scatter(y=y3))
    py.plot(fig, filename=filename, fileopt='extend', auto_open=False)
    time.sleep(2)
More info
This appears to be a bug in our backend code. The issue is that we reuse data arrays that hash to the same value. In this case your x value is hashing to the same value and when you go to extend the traces you're actually extending the same x array three times.
The fix proposed above has you only extend one of the x arrays, which is the same array being used by the other traces anyhow.
Do note that for this to work you must supply a non-zero length array in the initial setup. This is because Plotly won't save an array if it doesn't have any data to begin with.
The takeaway is that you'll be A-OK as long as you initialize identical x arrays and ensure that in the initialization your y arrays aren't also identical to any of the x arrays.
Apologies for the inconvenient workaround. I'll edit this response when a fix has been submitted on our end.
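To make the "same array extended three times" explanation concrete, here is a toy model in plain Python. To be clear, this is not Plotly's backend code: store and save_array are hypothetical names, and the deduplication by content is only my simplified reading of the description above.

```python
# Toy model only: NOT Plotly's backend, just an illustration of what happens
# when equal arrays are deduplicated and then extended once per trace.

store = {}  # maps a content key to the single stored copy of that array


def save_array(values):
    """Store values once per distinct content and return the shared list."""
    key = tuple(values)
    return store.setdefault(key, list(values))


# Three traces with identical x and different y, as in the initial upload
traces = [
    {"x": save_array([0]), "y": save_array([1])},
    {"x": save_array([0]), "y": save_array([2])},
    {"x": save_array([0]), "y": save_array([3])},
]

# Naive extend: every trace appends its new x, but all three share one list
for trace, new_y in zip(traces, [2, 3, 4]):
    trace["x"].append(1)
    trace["y"].append(new_y)

print(traces[0]["x"])  # [0, 1, 1, 1]
```

The shared x list picks up the new value once per trace, which matches the tripled x values in the question, while each y list is extended only once. It also shows why an initial y array identical to x would get tangled up in the same way.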