Create Pandas DataFrame for use with ggPlot line plot - python

I'm trying to create a pandas DataFrame so that I can create some visualizations with ggplot, but I am having a hard time getting the DataFrame structure set up.
My visualization would be a line plot of year vs. total, with one line tracking each 'cause_of_death' over the years.
I have imported my CSV file, grouped by year and then by 'cause_of_death', and done a count. But the result is not in the right format to create a line plot because it is not a DataFrame.
Below is my code; any suggestions would be helpful, thanks.
The fields that I want from the CSV file are 'deathYear' and 'cause_of_death'.
from pandas import *
from ggplot import *
df = read_csv('query_result.csv')
newDF = df.loc[:,['date_of_death_year','acme_underlying_cause_code']]
data = DataFrame(newDF.groupby(['date_of_death_year','acme_underlying_cause_code']).size())
print(data)

This is a mega-old question, but it's pretty straightforward to solve. (Hint: it has nothing to do with ggplot. It's all about how pandas works.)
Here's how I'd render your code:
import numpy as np # |Don't import * from these
import pandas as pd # |
from ggplot import * # But this is customary because it's like R
# All this bit is just to make a DataFrame
# You can ignore it all
causes = ['foo', 'bar', 'baz']
years = [2001, 2002, 2003, 2004]
size = 100
data = {'causes': np.random.choice(causes, size),
        'years': np.random.choice(years, size),
        'something_else': np.random.random(size)
        }
df = pd.DataFrame(data)
# Here's where the good stuff happens. You're importing from
# a CSV so you can just start here
counts = df.groupby(['years', 'causes'])['something_else'].count()
counts = counts.reset_index() # Because ggplot doesn't plot with indexes
g = ggplot(counts, aes(x='years', y='something_else', color='causes')) +\
    geom_line()
print(g)
Which results in a line plot of the counts per year, with one colored line per cause.
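
The same reshaping step can be applied to a size() count like the one in the question. This is only a minimal sketch: it assumes the column names from the question's code, uses a few made-up rows instead of the real CSV, and the output column name total is an illustrative choice.

```python
import pandas as pd

# Made-up rows standing in for query_result.csv (hypothetical data)
df = pd.DataFrame({
    "date_of_death_year": [2001, 2001, 2001, 2002, 2002, 2003],
    "acme_underlying_cause_code": ["A", "A", "B", "A", "B", "B"],
})

# size() counts rows per (year, cause) pair and returns a Series with a MultiIndex;
# reset_index(name=...) turns it into a flat DataFrame with ordinary columns
counts = (
    df.groupby(["date_of_death_year", "acme_underlying_cause_code"])
      .size()
      .reset_index(name="total")
)

print(counts)
```

The flat frame could then be handed to ggplot with x='date_of_death_year', y='total' and color='acme_underlying_cause_code'.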

Related

Seeking to modify code to pull data from an Excel sheet where column A has a given numeric value (i.e., all rows with value = 0)

Just to be upfront, I am a mechanical engineer with limited coding experience, though I have some programming classes under my belt (Java, C++, and Lisp).
I have inherited this code from my predecessor and am just trying to make it work for what I'm doing with it. I need to iterate through an Excel file that has column A values of 0, 1, 2, and 3 (in the code below this corresponds to "Revs"), pick out all the rows with value = 0 and put them into a separate folder, and then do the same for value = 2, etc. Thank you for bearing with me; I appreciate any help I can get.
import pandas as pd
import numpy as np
import os
import os.path
import xlsxwriter
import matplotlib.pyplot as plt
import six
import matplotlib.backends.backend_pdf
from matplotlib.gridspec import GridSpec
from matplotlib.ticker import AutoMinorLocator, MultipleLocator
def CamAnalyzer(entryName):
    #Enter excel data from file as a dataframe
    df = pd.read_excel(str(file_loc) + str(entryName), header = 1) #header 1 to get correct header row
    print(df)
    #setup grid for plots
    plt.style.use('fivethirtyeight')
    fig = plt.figure(figsize=(17,22))
    gs = GridSpec(3,2, figure=fig)
    props = dict(boxstyle='round', facecolor='w', alpha=1)
    #create a list of 4 smaller dataframes by splitting df when the rev count changes and name them
    dfSplit = list(df.groupby("Revs"))
    names = ["Air Vent","Inlet","Diaphram","Outlet"]
    for x, y in enumerate(dfSplit):
        #for each smaller dataframe #x,(df-y), create a polar plot and assign it to a space in the grid
        dfs = y[1]
        r = dfs["Measurement"].str.strip(" in") #radius measurement column has units. ditch em
        r = r.apply(pd.to_numeric) + zero/2 #convert all values in the frame to a float
        theta = dfs["Rads"]
        if x<2:
            ax = fig.add_subplot(gs[1,x],polar = True)
        else:
            ax = fig.add_subplot(gs[2,x-2],polar = True)
        ax.set_rlim(0,0.1) #set limits to radial axis
        ax.plot(theta, r)
        ax.grid(True)
        ax.set_title(names[x]) #nametag
    #create another subplot in the grid that overlays all 4 smaller dataframes on one plot
    ax2 = fig.add_subplot(gs[0,:],polar = True)
    ax2.set_rlim(0,0.1)
    for x, y in enumerate(dfSplit):
        dfs = y[1]
        r = dfs["Measurement"].str.strip(" in")
        r = r.apply(pd.to_numeric) + zero/2
        theta = dfs["Rads"]
        ax2.plot(theta, r)
    ax2.set_title("Sample " + str(entryName).strip(".xlsx") + " Overlayed")
    ax2.legend(names,bbox_to_anchor=(1.1, 1.05)) #place legend outside of plot area
    plt.savefig(str(file_loc) + "/Results/" + str(entryName).strip(".xlsx") + ".png")
    print("Results Saved")
I'm on my phone, so I can't check exact code examples, but this should get you started.
First, most of the code you posted is about graphing, and therefore not useful for your needs. The basic approach: use pandas (a library), to read in the Excel sheet, use the pandas function 'groupby' to split that sheet by 'Revs', then iterate through each Rev, and use pandas again to write back to a file. Copying the relevant sections from above:
#this brings in the necessary library
import pandas as pd
#Read excel data from file as a dataframe
#header should point to the row that describes your columns. The first row is row 0.
df = pd.read_excel("filename.xlsx", header = 1)
#create a list of 4 smaller dataframes using GroupBy.
#This returns a 'GroupBy' object.
dfSplit = df.groupby("Revs")
#iterate through the groupby object, saving each
#iterating over key (name) and value (dataframes)
#use the name to build a filename
for name, frame in dfSplit:
    frame.to_excel("Rev "+str(name)+".xlsx")
Edit: I had a chance to test this code, and it should now work. This will depend a little on your actual file (eg, which row is your header row).
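
If only one value is needed (for example, just the rows where Revs is 0), a boolean mask is a possible alternative to looping over every group. This is a minimal sketch using a small made-up frame in place of the real sheet; the column names mirror the question, and the data values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the sheet; the real data would come from pd.read_excel(...)
df = pd.DataFrame({
    "Revs": [0, 0, 1, 2, 2, 3],
    "Rads": [0.0, 1.6, 0.0, 0.0, 1.6, 0.0],
    "Measurement": ["0.01 in", "0.02 in", "0.03 in", "0.04 in", "0.05 in", "0.06 in"],
})

# Boolean mask: keep only the rows whose Revs value equals 0
rev0 = df[df["Revs"] == 0]

# Or collect every group into a dict keyed by its Revs value
frames = {name: frame for name, frame in df.groupby("Revs")}

print(rev0)
print(sorted(frames))
```

Each frame in the dict can then be written out with to_excel, one file per Revs value.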

How to plot scatterplot using matplotlib from arrays (using strings)? Python

I have been trying to plot a 3D scatterplot from a pandas DataFrame (I have tried converting the data to numpy arrays and strings before passing it in). However, the error ValueError: s must be a scalar, or float array-like with the same size as x and y keeps popping up. My data for Patient ID is in the format EMR-001, EMR-002, etc. after anonymizing it. My data for Discharge Date has been converted to a string of digits like 20200120. My data for Diagnosis Code is a mix of characters like 001 or 10B.
I have also looked online at some other examples but have not been able to identify where I went wrong. Could I ask for your advice on anything I missed or code I could use?
I'm using Python 3.9, UTF-8. Thank you in advance!
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
#importing csv data via pd
A = pd.read_csv('input.csv') #import file for current master list
Diagnosis_Des = A["Diagnosis Code"]
Discharge_Date = A["Discharge Date"]
Patient_ID = A["Patient ID"]
B = Diagnosis_Des.to_numpy()
#B1 = np.array2string(B)
#print(B.shape)
C = Discharge_Date.to_numpy() #need to change to data format
#C1 = np.array2string(C)
#print(C1)
D = Patient_ID.to_numpy()
#D1 = np.array2string(D)
#print(D.shape)
from matplotlib import pyplot
from mpl_toolkits.mplot3d import Axes3D
sequence_containing_x_vals = D
sequence_containing_y_vals = B
print(type(sequence_containing_y_vals))
sequence_containing_z_vals = C
print(type(sequence_containing_z_vals))
plt.scatter(sequence_containing_x_vals, sequence_containing_y_vals, sequence_containing_z_vals)
pyplot.show()
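
One likely cause of the error, going only by the code above: plt.scatter is a 2D call whose third positional argument is s (the marker size), so the discharge dates are being read as sizes rather than as z-coordinates. The sketch below is one possible way around it, assuming the three columns are encoded as numbers first and drawn on a 3D axes. The sample rows are made up, and the integer encoding via pd.factorize is an illustrative choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older Matplotlib
import pandas as pd

# Made-up rows in the formats described in the question
A = pd.DataFrame({
    "Patient ID": ["EMR-001", "EMR-002", "EMR-003", "EMR-001"],
    "Discharge Date": ["20200120", "20200215", "20200301", "20200410"],
    "Diagnosis Code": ["001", "10B", "001", "22C"],
})

# Encode the string columns as integer codes; keep the labels for the tick marks
x_codes, x_labels = pd.factorize(A["Patient ID"])
y_codes, y_labels = pd.factorize(A["Diagnosis Code"])

# Parse the dates and convert them to day ordinals so they are plain numbers
dates = pd.to_datetime(A["Discharge Date"], format="%Y%m%d")
z_vals = dates.map(pd.Timestamp.toordinal)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")  # a 3D axes accepts x, y and z
ax.scatter(x_codes, y_codes, z_vals)

# Show the original strings on the encoded axes
ax.set_xticks(range(len(x_labels)))
ax.set_xticklabels(x_labels)
ax.set_yticks(range(len(y_labels)))
ax.set_yticklabels(y_labels)
ax.set_zlabel("Discharge date (ordinal day)")

fig.canvas.draw()  # render; use plt.show() in an interactive session
```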

How to retrieve all data from seaborn distribution plot with multiple distributions?

The post Get data points from Seaborn distplot describes how you can get data elements using sns.distplot(x).get_lines()[0].get_data(), sns.distplot(x).patches and [h.get_height() for h in sns.distplot(x).patches]
But how can you do this if you've used multiple layers by plotting the data in a loop, such as:
Snippet 1
for var in list(df):
    print(var)
    distplot = sns.distplot(df[var])
Plot
Is there a way to retrieve the X and Y values for both linecharts and the bars?
Here's the whole setup for an easy copy&paste:
#%%
# imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pylab
pylab.rcParams['figure.figsize'] = (8, 4)
import seaborn as sns
from collections import OrderedDict
# Function to build synthetic data
def sample(rSeed, periodLength, colNames):
    np.random.seed(rSeed)
    date = pd.to_datetime("1st of Dec, 1999")
    cols = OrderedDict()
    for col in colNames:
        cols[col] = np.random.normal(loc=0.0, scale=1.0, size=periodLength)
    dates = date+pd.to_timedelta(np.arange(periodLength), 'D')
    df = pd.DataFrame(cols, index = dates)
    return(df)
# Dataframe with synthetic data
df = sample(rSeed = 123, colNames = ['X1', 'X2'], periodLength = 50)
# sns.distplot with multiple layers
for var in list(df):
    myPlot = sns.distplot(df[var])
Here's what I've tried:
Y-values for histogram:
If I run:
barX = [h.get_height() for h in myPlot.patches]
Then I get the following list of length 11:
[0.046234272703757885,
0.1387028181112736,
0.346757045278184,
0.25428849987066837,
0.2542884998706682,
0.11558568175939472,
0.11875881712519201,
0.3087729245254993,
0.3087729245254993,
0.28502116110046083,
0.1662623439752689]
And this seems reasonable since there seem to be 6 values for the blue bars and 5 values for the red bars. But how do I tell which values belong to which variable?
Y-values for line:
This seems a bit easier than the histogram part since you can use myPlot.get_lines()[0].get_data() AND myPlot.get_lines()[1].get_data() to get:
Out[678]:
(array([-4.54448949, -4.47612134, -4.40775319, -4.33938504, -4.27101689,
...
3.65968859, 3.72805675, 3.7964249 , 3.86479305, 3.9331612 ,
4.00152935, 4.0698975 , 4.13826565]),
array([0.00042479, 0.00042363, 0.000473 , 0.00057404, 0.00073097,
0.00095075, 0.00124272, 0.00161819, 0.00208994, 0.00267162,
...
0.0033384 , 0.00252219, 0.00188591, 0.00139919, 0.00103544,
0.00077219, 0.00059125, 0.00047871]))
myPlot.get_lines()[1].get_data()
Out[679]:
(array([-3.68337423, -3.6256517 , -3.56792917, -3.51020664, -3.4524841 ,
-3.39476157, -3.33703904, -3.27931651, -3.22159398, -3.16387145,
...
3.24332952, 3.30105205, 3.35877458, 3.41649711, 3.47421965,
3.53194218, 3.58966471, 3.64738724]),
array([0.00035842, 0.00038018, 0.00044152, 0.00054508, 0.00069579,
0.00090076, 0.00116922, 0.00151242, 0.0019436 , 0.00247792,
...
0.00215912, 0.00163627, 0.00123281, 0.00092711, 0.00070127,
0.00054097, 0.00043517, 0.00037599]))
But the whole thing still seems a bit cumbersome. So does anyone know of a more direct approach to perhaps retrieve all data to a dictionary or dataframe?
I had the same need to retrieve data from a seaborn distribution plot. What worked for me was to call the .findobj() method on each iteration's graph. The matplotlib.lines.Line2D objects it returns have a get_data() method, which is similar to what you mentioned above for myPlot.get_lines()[1].get_data().
Following your example code:
data = []
for idx, var in enumerate(list(df)):
    myPlot = sns.distplot(df[var])
    # Find Line2D objects
    lines2D = [obj for obj in myPlot.findobj() if str(type(obj)) == "<class 'matplotlib.lines.Line2D'>"]
    # Retrieving x, y data
    x, y = lines2D[idx].get_data()[0], lines2D[idx].get_data()[1]
    # Store as dataframe
    data.append(pd.DataFrame({'x':x, 'y':y}))
Notice here that the data for the first sns.distplot plot is stored at the first index of lines2D and the data for the second sns.distplot at the second index. I'm not really sure why it happens this way, but if you were to consider more than two plots, you would access each sns.distplot's data by indexing lines2D at its respective position.
Finally, to verify, one can plot each distplot:
plt.plot(data[0].x, data[0].y)
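
For the bar heights, one possible way to tell the variables apart is to record how many patches the Axes holds before and after each plotting call, then slice ax.patches accordingly. The sketch below illustrates that bookkeeping with plain Matplotlib ax.hist rather than sns.distplot (a deliberate substitution so that it runs without seaborn). It assumes the same idea carries over, since the question's myPlot.patches suggests distplot also draws its bars as patches on one shared Axes. The bin count of 6 and the random data are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data with the same column names as the question
rng = np.random.default_rng(123)
df = pd.DataFrame({"X1": rng.normal(size=50), "X2": rng.normal(size=50)})

fig, ax = plt.subplots()
bars = {}
for var in df.columns:
    start = len(ax.patches)                 # patches already on the Axes
    ax.hist(df[var], bins=6, density=True, alpha=0.5)
    new_patches = list(ax.patches)[start:]  # only the bars added by this call
    bars[var] = pd.DataFrame({
        "x": [p.get_x() for p in new_patches],
        "width": [p.get_width() for p in new_patches],
        "height": [p.get_height() for p in new_patches],
    })

print(bars["X1"])
print(bars["X2"])
```

Keeping the result in a dict keyed by column name gives one small DataFrame of bar positions and heights per variable.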

Choosing the correct values in excel in Python

General Overview:
I am creating a graph of a large data set; however, I have created a sample text document so that it is easier to work through the problems.
The data is from an Excel document that will be saved as a CSV.
Problem:
I am able to compile the data and it will graph (see below). However, the way I pull the data will not work for all of the different Excel sheets I am going to pull from.
More detail of the problem:
The Y-values (labeled 'Value' and 'Value1') are being pulled from the Excel sheet by the numbers 26 and 31 (see picture and code).
This is a problem because the values 26 and 31 will not be the same for each graph.
Let's take a look so this makes more sense.
Here is my code
import pandas as pd
import matplotlib.pyplot as plt
pd.read_csv('CSV_GM_NB_Test.csv').T.to_csv('GM_NB_Transpose_Test.csv', header=False)
df = pd.read_csv('GM_NB_Transpose_Test.csv', skiprows = 2)
DID = df['SN']
Value = df['26']
Value1 = df['31']
x= (DID[16:25])
y= (Value[16:25])
y1= (Value1[16:25])
"""
print(x,y)
print(x,y1)
"""
plt.plot(x.astype(int), y.astype(int))
plt.plot(x.astype(int), y1.astype(int))
plt.show()
Output:
Data Set:
Below in the comments you will find the 0bin link to my data set; this is because I do not have enough reputation to post two links.
As you can see from the Data Set
X- DID = Blue
Y-Value = Green
Y-Value1 = Grey
Troublesome Values = Red
The problem again is that the data for the Y-values is pulled from rows 10 and 11, using the values 26 and 31 under SN.
Let me know if more information is needed.
Thank you
Not sure why you are creating the transposed CSV version. It is also possible to work directly from your original data. For example:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('CSV_GM_NB_Test.csv', skiprows=8)
data = df.iloc[:,19:].T
data.columns = df['SN']
data.plot()
plt.show()
This would give you:
You can use pandas.DataFrame.iloc to give you a sliced version of your data using integer positions (older pandas versions offered the .ix indexer for this, but it was removed in pandas 1.0). The [:,19:] says to give you columns 19 onwards. The final .T transposes it. You can then apply the values for the SN column as column headings using .columns to specify the names.
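
The slice-then-transpose step can be tried on a toy table first. This is only a sketch: the frame below is made up, with far fewer columns than the real CSV, so the slice starts at position 2 instead of 19.

```python
import pandas as pd

# Made-up table: an identifying "SN" column, one extra column, then numeric columns
df = pd.DataFrame({
    "SN": [26, 31],
    "Other": ["a", "b"],
    "100": [1, 4],
    "101": [2, 5],
    "102": [3, 6],
})

# Columns from integer position 2 onwards, then transpose so each row becomes a column
data = df.iloc[:, 2:].T
# Use the SN values as the column headings
data.columns = df["SN"]

print(data)
```

Because the columns are named after whatever appears under SN, nothing in the code depends on the specific values 26 and 31.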

Pandas GroupBy object is not 'serializable' by Plot.ly

I'm trying to create a boxplot using Plotly and I get an error when attempting to use a Pandas DataFrame that's been grouped. Some initial digging produced this chunk of code to convert Pandas to Plotly interface:
def df_to_iplot(df):
    '''
    Converting a Pandas Data Frame to Plotly interface
    '''
    x = df.index.values
    lines={}
    for key in df:
        lines[key]={}
        lines[key]["x"]=x
        lines[key]["y"]=df[key].values
        lines[key]["name"]=key
    #Appending all lines
    lines_plotly=[lines[key] for key in df]
    return lines_plotly
Are there alternatives to this method of converting DataFrames to a Plotly-compatible series? The above code is for line graphs, but I'd like to iterate over my dimensions to produce a boxplot for each group in my DataFrame. Here is the error message I'm getting:
"TypeError: pandas.core.groupby.SeriesGroupBy object is not JSON serializable"
Here is an example from the Plotly website: https://plot.ly/python/box-plots
import plotly.plotly as py
from plotly.graph_objs import *
py.sign_in("xxxx", "xxxxxxxxxx")
import numpy as np
y0 = np.random.randn(50)
y1 = np.random.randn(50)+1
trace0 = Box(
    y=y0
)
trace1 = Box(
    y=y1
)
data = Data([trace0, trace1])
unique_url = py.plot(data, filename = 'basic-box-plot')
If I understand right, you want something like this:
data = Data([Box(y=v.values) for k, v in g])
(where g is your grouped object). Then you can use py.plot on that.
Like I said in the comments, I know nothing about plotly; I'm just going based off your example. We'll see if anyone who knows more about plotly replies. Failing that, it would be helpful if you could explain in your question what format you want the data in (i.e., figure out what format plotly wants).
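
To see why iterating over the grouped object helps, here is a minimal sketch that does not import Plotly at all. It assumes a made-up frame with hypothetical column names group and value, and it builds plain dicts whose "type": "box" layout is only meant to mirror the Box traces in the example above, not to be a verified Plotly payload.

```python
import json
import numpy as np
import pandas as pd

# Made-up data: three groups of 20 values each
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": np.repeat(["a", "b", "c"], 20),
    "value": rng.normal(size=60),
})

g = df.groupby("group")["value"]  # a SeriesGroupBy, which json cannot serialize as-is

# Iterating yields (key, Series) pairs; turn each Series into a plain list
traces = [{"type": "box", "name": str(k), "y": v.tolist()} for k, v in g]

payload = json.dumps(traces)  # succeeds because only plain Python types remain
print(len(traces), "traces,", len(payload), "characters of JSON")
```

The point is that the GroupBy object itself is never handed to the serializer; only the per-group values are.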
