Pandas GroupBy object is not 'serializable' by Plot.ly - python

I'm trying to create a boxplot using Plotly and I get an error when attempting to use a Pandas DataFrame that's been grouped. Some initial digging produced this chunk of code to convert Pandas to Plotly interface:
def df_to_iplot(df):
    '''
    Converting a Pandas DataFrame to the Plotly interface
    '''
    x = df.index.values
    lines = {}
    for key in df:
        lines[key] = {}
        lines[key]["x"] = x
        lines[key]["y"] = df[key].values
        lines[key]["name"] = key
    # Appending all lines
    lines_plotly = [lines[key] for key in df]
    return lines_plotly
Are there alternatives to this method of converting DataFrames to a Plotly-compatible series? The above code is for line graphs, but I'd like to iterate over my dimensions to produce a boxplot for each group in my DataFrame. Here is the error message I'm getting:
"TypeError: pandas.core.groupby.SeriesGroupBy object is not JSON serializable"
Here is an example from the Plotly website: https://plot.ly/python/box-plots
import plotly.plotly as py
from plotly.graph_objs import *
py.sign_in("xxxx", "xxxxxxxxxx")
import numpy as np
y0 = np.random.randn(50)
y1 = np.random.randn(50)+1
trace0 = Box(
    y=y0
)
trace1 = Box(
    y=y1
)
data = Data([trace0, trace1])
unique_url = py.plot(data, filename = 'basic-box-plot')

If I understand right, you want something like this:
data = Data([Box(y=v.values) for k, v in g])
(where g is your grouped object). Then you can use py.plot on that.
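To illustrate why iterating helps: the grouped object itself cannot be serialized, but iterating over it yields (key, Series) pairs, and each Series can be turned into a plain list. Here is a minimal sketch using only pandas and the standard json module. The column names "group" and "value" and the sample data are made up, and the dict layout is only my assumption of what a box trace looks like once serialized, so check it against the Plotly docs.

```python
import json

import numpy as np
import pandas as pd

# Hypothetical example data: one grouping column and one numeric column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": ["a"] * 5 + ["b"] * 5,
    "value": rng.normal(size=10),
})

# A SeriesGroupBy is not JSON serializable, but each group inside it is a Series.
g = df.groupby("group")["value"]

# One plain dict per group, holding lists instead of pandas/NumPy objects.
traces = [
    {"type": "box", "name": str(key), "y": values.tolist()}
    for key, values in g
]

# json.dumps raises TypeError if anything non-serializable slipped through.
payload = json.dumps(traces)
```

The same loop with Box(y=values.values, name=str(key)) in place of the dict should correspond to the one-liner above.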
Like I said in the comments, I know nothing about plotly; I'm just going based off your example. We'll see if anyone who knows more about plotly replies. Failing that, it would be helpful if you could explain in your question what format you want the data in (i.e., figure out what format plotly wants).

Related

How to iterate distance calculation for different vehicles from coordinates

I am new to coding and need help building a time-space diagram (TSD) from a CSV file that I exported from a VISSIM simulation.
A general TSD looks like this: TSD, and my CSV looks like this: CSV.
I want to plot "VEHICLE:SIMSEC" (the simulation time) on the x-axis of the TSD and the vehicle position on the y-axis, where the position comes from "COORDFRONTX" and "COORDFRONTY" (the x and y coordinates in the simulation). Each line on the TSD should be one vehicle, identified by "NO" (the vehicle number); there are 185 different vehicles and I want to plot all of them.
I have tried the following code but did not get the result I want.
import pandas as pd
import matplotlib.pyplot as mp
# take data
data = pd.read_csv(r"C:\Users\hk385\Desktop\VISSIM_DATA_CSV.csv")
df = pd.DataFrame(data, columns=["VEHICLE:SIMSEC", "NO", "DISTTRAVTOT"])
# plot the dataframe
df.plot(x="NO", y=["DISTTRAVTOT"], kind="scatter")
# show the scatter plot
mp.show()
The plot came out to be uninterpretable as there were too many dots. The diagram looks like this: Time Space Diagram. So would you be able to help me or guide me to get a TSD from the CSV I have?
As suggested by mitoRibo, here are the top 20 rows of the CSV:
VEHICLE:SIMSEC,NO,LANE\LINK\NO,LANE\INDEX,POS,POSLAT,COORDFRONTX,COORDFRONTY,COORDREARX,COORDREARY,DISTTRAVTOT
5.9,1,1,1,2.51,0.5,-1.259,-3.518,-4.85,-1.319,8.42
6.0,1,1,1,10.94,0.5,0.932,-4.86,-2.659,-2.661,16.86
6.1,1,1,1,19.37,0.5,3.125,-6.203,-0.466,-4.004,25.29
6.2,1,1,1,27.82,0.5,5.319,-7.547,1.728,-5.348,33.73
6.3,1,1,1,36.26,0.5,7.515,-8.892,3.924,-6.693,42.18
6.4,1,1,1,44.72,0.5,9.713,-10.238,6.122,-8.039,50.64
6.5,1,1,1,53.18,0.5,11.912,-11.585,8.321,-9.386,59.1
6.6,1,1,1,61.65,0.5,14.112,-12.933,10.521,-10.734,67.56
6.7,1,1,1,70.12,0.5,16.314,-14.282,12.724,-12.082,76.04
6.8,1,1,1,78.6,0.5,18.518,-15.632,14.927,-13.432,84.51
6.9,1,1,1,87.08,0.5,20.723,-16.982,17.132,-14.783,93.0
7.0,1,1,1,95.57,0.5,22.93,-18.334,19.339,-16.135,101.49
7.1,1,1,1,104.07,0.5,25.138,-19.687,21.547,-17.487,109.99
7.2,1,1,1,112.57,0.5,27.348,-21.04,23.757,-18.841,118.49
7.3,1,1,1,121.08,0.5,29.56,-22.395,25.969,-20.195,127.0
7.4,1,1,1,129.59,0.5,31.773,-23.75,28.182,-21.551,135.51
7.5,1,1,1,138.11,0.5,33.987,-25.107,30.396,-22.907,144.03
7.6,1,1,1,146.64,0.5,36.203,-26.464,32.612,-24.264,152.56
7.7,1,1,1,155.17,0.5,38.421,-27.822,34.83,-25.623,161.09
Thank you.
You can groupby and iterate through different vehicles, adding each one to your plot. I changed your example data so there were 2 different vehicles.
import pandas as pd
import io
import matplotlib.pyplot as plt
df = pd.read_csv(io.StringIO("""
VEHICLE:SIMSEC,NO,LANE_LINK_NO,LANE_INDEX,POS,POSLAT,COORDFRONTX,COORDFRONTY,COORDREARX,COORDREARY,DISTTRAVTOT
5.9,1,1,1,2.51,0.5,-1.259,-3.518,-4.85,-1.319,0
6.0,1,1,1,10.94,0.5,0.932,-4.86,-2.659,-2.661,16.86
6.1,1,1,1,19.37,0.5,3.125,-6.203,-0.466,-4.004,25.29
6.2,1,1,1,27.82,0.5,5.319,-7.547,1.728,-5.348,33.73
6.3,1,1,1,36.26,0.5,7.515,-8.892,3.924,-6.693,42.18
6.4,1,1,1,44.72,0.5,9.713,-10.238,6.122,-8.039,50.64
6.5,1,1,1,53.18,0.5,11.912,-11.585,8.321,-9.386,59.1
6.6,1,1,1,61.65,0.5,14.112,-12.933,10.521,-10.734,67.56
6.7,1,1,1,70.12,0.5,16.314,-14.282,12.724,-12.082,76.04
6.8,1,1,1,78.6,0.5,18.518,-15.632,14.927,-13.432,84.51
6.9,1,1,1,87.08,0.5,20.723,-16.982,17.132,-14.783,90
6.0,2,1,1,95.57,0.5,22.93,-18.334,19.339,-16.135,0
6.1,2,1,1,104.07,0.5,25.138,-19.687,21.547,-17.487,30
6.2,2,1,1,112.57,0.5,27.348,-21.04,23.757,-18.841,40
6.3,2,1,1,121.08,0.5,29.56,-22.395,25.969,-20.195,50
6.4,2,1,1,129.59,0.5,31.773,-23.75,28.182,-21.551,60
6.5,2,1,1,138.11,0.5,33.987,-25.107,30.396,-22.907,70
6.6,2,1,1,146.64,0.5,36.203,-26.464,32.612,-24.264,80
6.7,2,1,1,155.17,0.5,38.421,-27.822,34.83,-25.623,90
"""),sep=',')
fig = plt.figure()
# Iterate through each vehicle, adding it to the plot
for vehicle_no, vehicle_df in df.groupby('NO'):
    plt.plot(vehicle_df['VEHICLE:SIMSEC'], vehicle_df['DISTTRAVTOT'], label=vehicle_no)
plt.legend()  # comment this out if you don't want a legend
plt.show()
plt.close()
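The plot above relies on the DISTTRAVTOT column. If the distance had to be derived from the front coordinates instead, as the question title suggests, the same groupby idea can be applied per vehicle. This is only a sketch: the sample rows are made up, and STEP and DIST_FROM_COORDS are hypothetical column names I introduced for illustration.

```python
import numpy as np
import pandas as pd

# Small made-up sample with the same column names as the question's CSV.
df = pd.DataFrame({
    "VEHICLE:SIMSEC": [5.9, 6.0, 6.1, 6.0, 6.1, 6.2],
    "NO":             [1, 1, 1, 2, 2, 2],
    "COORDFRONTX":    [0.0, 3.0, 6.0, 0.0, 0.0, 0.0],
    "COORDFRONTY":    [0.0, 4.0, 8.0, 0.0, 2.0, 5.0],
})

# Make sure each vehicle's rows are in time order before differencing.
df = df.sort_values(["NO", "VEHICLE:SIMSEC"])

# Change in x and y since the previous row of the same vehicle
# (NaN on the first row of each vehicle).
steps = df.groupby("NO")[["COORDFRONTX", "COORDFRONTY"]].diff()

# Straight-line length of each step; the first row of a vehicle counts as 0.
df["STEP"] = np.hypot(steps["COORDFRONTX"], steps["COORDFRONTY"]).fillna(0.0)

# Running total of the step lengths, restarted for every vehicle.
df["DIST_FROM_COORDS"] = df.groupby("NO")["STEP"].cumsum()
```

DIST_FROM_COORDS can then be plotted against VEHICLE:SIMSEC inside the same for loop over df.groupby('NO').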
If you don't mind, could you please try this:
mp.scatter(x="NO", y="DISTTRAVTOT", data=df)
If it still does not work, please attach your data so I can test it on my side.

How to retrieve all data from seaborn distribution plot with multiple distributions?

The post Get data points from Seaborn distplot describes how you can get data elements using sns.distplot(x).get_lines()[0].get_data(), sns.distplot(x).patches and [h.get_height() for h in sns.distplot(x).patches]
But how can you do this if you've used multiple layers by plotting the data in a loop, such as:
Snippet 1
for var in list(df):
    print(var)
    distplot = sns.distplot(df[var])
Plot
Is there a way to retrieve the X and Y values for both linecharts and the bars?
Here's the whole setup for an easy copy&paste:
#%%
# imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pylab
pylab.rcParams['figure.figsize'] = (8, 4)
import seaborn as sns
from collections import OrderedDict
# Function to build synthetic data
def sample(rSeed, periodLength, colNames):
    np.random.seed(rSeed)
    date = pd.to_datetime("1st of Dec, 1999")
    cols = OrderedDict()
    for col in colNames:
        cols[col] = np.random.normal(loc=0.0, scale=1.0, size=periodLength)
    dates = date + pd.to_timedelta(np.arange(periodLength), 'D')
    df = pd.DataFrame(cols, index=dates)
    return df
# Dataframe with synthetic data
df = sample(rSeed=123, colNames=['X1', 'X2'], periodLength=50)
# sns.distplot with multiple layers
for var in list(df):
    myPlot = sns.distplot(df[var])
Here's what I've tried:
Y-values for histogram:
If I run:
barX = [h.get_height() for h in myPlot.patches]
Then I get the following list of length 11:
[0.046234272703757885,
0.1387028181112736,
0.346757045278184,
0.25428849987066837,
0.2542884998706682,
0.11558568175939472,
0.11875881712519201,
0.3087729245254993,
0.3087729245254993,
0.28502116110046083,
0.1662623439752689]
And this seems reasonable since there seem to be 6 values for the blue bars and 5 values for the red bars. But how do I tell which values belong to which variable?
Y-values for line:
This seems a bit easier than the histogram part, since you can use myPlot.get_lines()[0].get_data() and myPlot.get_lines()[1].get_data() to get:
Out[678]:
(array([-4.54448949, -4.47612134, -4.40775319, -4.33938504, -4.27101689,
...
3.65968859, 3.72805675, 3.7964249 , 3.86479305, 3.9331612 ,
4.00152935, 4.0698975 , 4.13826565]),
array([0.00042479, 0.00042363, 0.000473 , 0.00057404, 0.00073097,
0.00095075, 0.00124272, 0.00161819, 0.00208994, 0.00267162,
...
0.0033384 , 0.00252219, 0.00188591, 0.00139919, 0.00103544,
0.00077219, 0.00059125, 0.00047871]))
myPlot.get_lines()[1].get_data()
Out[679]:
(array([-3.68337423, -3.6256517 , -3.56792917, -3.51020664, -3.4524841 ,
-3.39476157, -3.33703904, -3.27931651, -3.22159398, -3.16387145,
...
3.24332952, 3.30105205, 3.35877458, 3.41649711, 3.47421965,
3.53194218, 3.58966471, 3.64738724]),
array([0.00035842, 0.00038018, 0.00044152, 0.00054508, 0.00069579,
0.00090076, 0.00116922, 0.00151242, 0.0019436 , 0.00247792,
...
0.00215912, 0.00163627, 0.00123281, 0.00092711, 0.00070127,
0.00054097, 0.00043517, 0.00037599]))
But the whole thing still seems a bit cumbersome. So does anyone know of a more direct approach to perhaps retrieve all data to a dictionary or dataframe?
I just had the same need to retrieve data from a seaborn distribution plot. What worked for me was calling .findobj() on the plot returned in each iteration. The matplotlib.lines.Line2D objects it returns have a get_data() method, similar to what you mentioned for myPlot.get_lines()[1].get_data().
Following your example code
data = []
for idx, var in enumerate(list(df)):
    myPlot = sns.distplot(df[var])
    # Find Line2D objects
    lines2D = [obj for obj in myPlot.findobj() if str(type(obj)) == "<class 'matplotlib.lines.Line2D'>"]
    # Retrieving x, y data
    x, y = lines2D[idx].get_data()[0], lines2D[idx].get_data()[1]
    # Store as dataframe
    data.append(pd.DataFrame({'x':x, 'y':y}))
Notice here that the data for the first sns.distplot plot is stored at the first index of lines2D and the data for the second sns.distplot is stored at the second index. I'm not really sure why it works this way; presumably every sns.distplot call draws onto the same Axes, so the lines accumulate in the order they were added. If you have more than two plots, you can access each sns.distplot's data by indexing lines2D at its respective index.
Finally, to verify one can plot each distplot
plt.plot(data[0].x, data[0].y)
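For the bars, one way to tell which values belong to which variable is to note how many patches and lines the Axes holds before each layer is drawn and slice off whatever was added afterwards. The sketch below is an assumption-laden stand-in rather than seaborn itself: ax.hist and ax.plot take the place of sns.distplot(df[var], ax=ax), the bin count of 6 is arbitrary, and the layers dictionary is a name I made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)
df = pd.DataFrame({"X1": rng.normal(size=50), "X2": rng.normal(size=50)})

fig, ax = plt.subplots()
layers = {}
for var in df:
    # Remember how many bars and lines the Axes held before this layer.
    n_patches, n_lines = len(ax.patches), len(ax.lines)

    # Stand-in for sns.distplot(df[var], ax=ax): a histogram plus one line.
    ax.hist(df[var], bins=6, density=True, alpha=0.4)
    ax.plot(np.sort(df[var]), np.linspace(0, 1, len(df)))

    # Everything past the remembered counts was added by this layer.
    new_patches = list(ax.patches)[n_patches:]
    new_lines = list(ax.lines)[n_lines:]
    layers[var] = {
        "bars": pd.DataFrame({
            "x": [p.get_x() for p in new_patches],
            "width": [p.get_width() for p in new_patches],
            "height": [p.get_height() for p in new_patches],
        }),
        "lines": [
            pd.DataFrame(dict(zip(("x", "y"), line.get_data())))
            for line in new_lines
        ],
    }
```

layers["X1"]["bars"] then holds only the bars drawn for X1, and layers["X1"]["lines"][0] holds the x and y values of its line.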

Plotting pandas dataframe after doing pandas melt is slow and creates strange y-axis

This could be caused by me not understanding how pandas.melt works, but I get strange behaviour when plotting "melted" dataframes using plotnine. Both frames have been converted from wide to long format: one frame with a column containing string values (df_slow) and another with only numerical values (df_fast).
The following code gives different behaviour. Plotting df_slow is slow and gives a strange looking y-axis. Plotting df_fast looks ok and is fast. My guess is that pandas melt is doing something strange with the data which causes this behaviour. See example plots.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotnine as p9
SIZE = 200
value = np.random.rand(SIZE, 1)
# Create test data, one with a column containing strings, one with only numeric values
df_slow = pd.DataFrame({'value': value.flatten(), 'string_column': ['A']*SIZE})
df_fast = pd.DataFrame({'value': value.flatten()})
# Move the index into a regular column named 'index'
df_slow = df_slow.reset_index()
df_fast = df_fast.reset_index()
# Convert 'df_slow', 'df_fast' to long format
df_slow = pd.melt(df_slow, id_vars='index')
df_fast = pd.melt(df_fast, id_vars='index')
print(df_slow.head())
print(df_fast.head())
df_slow = df_slow[df_slow.variable == 'value']
# This is slow and has many breaks on y-axis
p = (p9.ggplot(df_slow, p9.aes(x='index', y='value')) + p9.geom_point())
p.draw()
# This is much faster and y-axis looks good
p = (p9.ggplot(df_fast, p9.aes(x='index', y='value')) + p9.geom_point())
p.draw()
slow and strange plot
fast and good looking plot
Possible fix
Changing the type of the "value" column in df_slow makes it behave like df_fast when plotting.
# This makes df_slow behave like df_fast when plotting
df_slow['value'] = df_slow.value.astype(np.float64)
Question
Is this a bug in plotnine (or pandas) or am I doing something wrong?
Answer
When melting two columns with different data types, in this case string and float, I guess it makes sense that the resulting column, which contains both strings and floats, will have the type object. As ALollz pointed out, this probably makes plotnine interpret the values as strings, which causes this behavior.

Choosing the correct values in excel in Python

General Overview:
I am creating a graph of a large data set; however, I have created a sample text document so that it is easier to work through the problems.
The data is from an Excel document that will be saved as a CSV.
Problem:
I am able to load the data and graph it (see below). However, the way I pull the data will not work for all of the different Excel sheets I am going to pull from.
More detail on the problem:
The Y-values (labeled 'Value' and 'Value1') are pulled from the Excel sheet using the numbers 26 and 31 (see picture and code).
This is a problem because the values 26 and 31 will not be the same for each graph.
Let's take a look so that this makes more sense.
Here is my code
import pandas as pd
import matplotlib.pyplot as plt
pd.read_csv('CSV_GM_NB_Test.csv').T.to_csv('GM_NB_Transpose_Test.csv', header=False)
df = pd.read_csv('GM_NB_Transpose_Test.csv', skiprows = 2)
DID = df['SN']
Value = df['26']
Value1 = df['31']
x= (DID[16:25])
y= (Value[16:25])
y1= (Value1[16:25])
"""
print(x,y)
print(x,y1)
"""
plt.plot(x.astype(int), y.astype(int))
plt.plot(x.astype(int), y1.astype(int))
plt.show()
Output:
Data Set:
Below in the comments you will find the 0bin link to my data set; this is because I do not have enough reputation to post two links.
As you can see from the Data Set
X- DID = Blue
Y-Value = Green
Y-Value1 = Grey
Troublesome Values = Red
The problem again is that the data for the Y-values is pulled from rows 10 and 11, which are identified by the values 26 and 31 under SN.
Let me know if more information is needed.
Thank you
Not sure why you are creating the transposed CSV version. It is also possible to work directly from your original data. For example:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('CSV_GM_NB_Test.csv', skiprows=8)
data = df.iloc[:, 19:].T
data.columns = df['SN']
data.plot()
plt.show()
This would give you:
You can use pandas.DataFrame.iloc to give you a sliced version of your data using integer positions (the older .ix indexer was removed in pandas 1.0). The [:, 19:] says to give you columns 19 onwards. The final .T transposes it. You can then apply the values for the SN column as column headings using .columns to specify the names.
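Here is the same slicing pattern on a tiny made-up frame, in case it helps to see what each step produces. Only the SN column name comes from the question; the "meta" column, the numbers, and the offset of 2 are invented. It uses .iloc, the current integer-position indexer.

```python
import pandas as pd

# Hypothetical wide data: one row per series, identified by SN,
# followed by a metadata column and then the measurement columns.
df = pd.DataFrame({
    "SN": ["26", "31"],
    "meta": ["a", "b"],
    "1": [10, 20],
    "2": [11, 21],
    "3": [12, 22],
})

# Keep everything from integer position 2 onwards, then transpose so
# that each original row becomes a column.
data = df.iloc[:, 2:].T

# Label the new columns with the SN values instead of 0 and 1.
data.columns = df["SN"]

print(data)
```

Because the columns are named after whatever is in SN, the values 26 and 31 no longer need to be hard-coded, and data.plot() draws one line per SN value.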

Create Pandas DataFrame for use with ggPlot line plot

I'm trying to create a Pandas dataFrame so that I can create some visualization with ggPlot. But I am having a hard time getting the DataFrame structure setup.
My visualization would be a line plot of (year vs. total). The line plot would be tracking multiple 'cause_of_death' over the years.
I have imported my CSV file, grouped by year and then by 'cause_of_death', and done a count. But the result is not in the right format to create a line plot because it is not a DataFrame.
Below is my code; any suggestion would be helpful, thanks.
The fields that I want from the CSV file are 'deathYear' and 'cause_of_death'.
from pandas import *
from ggplot import *
df = read_csv('query_result.csv')
newDF = df.loc[:,['date_of_death_year','acme_underlying_cause_code']]
data = DataFrame(newDF.groupby(['date_of_death_year','acme_underlying_cause_code']).size())
print(data)
This is a mega-old question, but it's pretty straightforward to solve. (hint, it's nothing to do with ggplot. It's all about how pandas works)
Here's how I'd render your code:
import numpy as np # |Don't import * from these
import pandas as pd # |
from ggplot import * # But this is customary because it's like R
# All this bit is just to make a DataFrame
# You can ignore it all
causes = ['foo', 'bar', 'baz']
years = [2001, 2002, 2003, 2004]
size = 100
data = {'causes': np.random.choice(causes, size),
        'years': np.random.choice(years, size),
        'something_else': np.random.random(size)
        }
df = pd.DataFrame(data)
# Here's where the good stuff happens. You're importing from
# a CSV so you can just start here
counts = df.groupby(['years', 'causes'])['something_else'].count()
counts = counts.reset_index() # Because ggplot doesn't plot with indexes
g = ggplot(counts, aes(x='years', y='something_else', color='causes')) +\
    geom_line()
print(g)
Which results in:
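If you would rather keep .size() from the question, the grouped result can be turned into a flat DataFrame in one step with reset_index(name=...). A minimal sketch with a few made-up rows; the column names are the ones from the question's code, and "total" is a name I chose for the count column.

```python
import pandas as pd

# Hypothetical rows using the column names from the question.
df = pd.DataFrame({
    "date_of_death_year": [2001, 2001, 2001, 2002, 2002],
    "acme_underlying_cause_code": ["A", "A", "B", "A", "B"],
})

# size() returns a Series indexed by (year, cause); reset_index turns the
# index levels back into columns and names the count column.
totals = (
    df.groupby(["date_of_death_year", "acme_underlying_cause_code"])
      .size()
      .reset_index(name="total")
)

print(totals)
```

totals then has three ordinary columns, so it can be passed to ggplot with x='date_of_death_year', y='total' and color='acme_underlying_cause_code'.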
