Convert a histogram plot from a Pandas dataframe to a scatter plot - python

I have a histogram:
# Let's load the diabetes dataset that ships with scikit-learn.
from sklearn.datasets import load_diabetes
# sklearn returns a dictionary-like object, so pull the DataFrame out of it
diabetes = load_diabetes(as_frame=True)
data = diabetes['frame']
import matplotlib.pyplot as plt
%matplotlib inline
bmi_hist = plt.hist(data['bmi'], density=False)
bmi_hist = plt.ylabel("Frequency")
bmi_hist = plt.xlabel("Normalized BMI")
bp_hist = plt.hist(data['bp'], density=False)
bp_hist = plt.ylabel("Frequency")
bp_hist = plt.xlabel("Normalized BP")
This is a histogram for two of the columns in the frame above.
I want to compare these two in a scatter graph. My attempts haven't been quite successful as I know I need an X and a Y to plot.
I thought I would use the same axis as the histogram:
y_bmi = data['bmi'].value_counts() # frequency
x_bmi = data['bmi'] # normalized value
ax1 = df.plot.scatter(x = x_bmi, y= y_bmi, c='DarkBlue')
But this can only be used on the 'dataframe' so do I have to repeat the values of bmi column into a new dataframe? or is there a simpler method?
Any help would be greatly appreciated.
Many Thanks.
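One way to read the question is sketched below, under assumptions: the real diabetes frame is replaced by synthetic 'bmi' and 'bp' columns (the random data and the seed are illustrative only), and two interpretations are shown side by side. The first plots the two columns against each other directly, which needs no new dataframe because plot.scatter takes column names. The second reuses the histogram's own axes (value against frequency) by computing the bins with np.histogram and plotting the bin centres as points.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in for diabetes['frame']: synthetic 'bmi' and 'bp' columns (assumption).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "bmi": rng.normal(0, 0.05, size=442),
    "bp": rng.normal(0, 0.05, size=442),
})

fig, (ax_pairs, ax_counts) = plt.subplots(1, 2, figsize=(10, 4))

# Option 1: one point per row, comparing the two columns directly.
# x and y are column NAMES, not Series.
data.plot.scatter(x="bmi", y="bp", c="DarkBlue", ax=ax_pairs)

# Option 2: keep the histogram axes (value vs. frequency) but draw points.
for column in ("bmi", "bp"):
    counts, edges = np.histogram(data[column], bins=10)
    centers = (edges[:-1] + edges[1:]) / 2  # midpoint of each bin
    ax_counts.scatter(centers, counts, label=column)
ax_counts.set_xlabel("Normalized value")
ax_counts.set_ylabel("Frequency")
ax_counts.legend()
fig.tight_layout()
```

With the real data, the same calls should apply to data = diabetes['frame'] unchanged, since only the column names 'bmi' and 'bp' are used.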

Related

Adding labels to some datapoints in seaborn regression plot given condition

Here is a table:
dict1 = {'left':[7,3,5,10,9],
'right':[2,17,0,8,1]}
table = pd.DataFrame(dict1)
I've created a regression scatter plot (scatterplot with best fit line):
sns.regplot(x=table['right'], y=table['left'], data=table)
I would like to add labels to the datapoints in the plot where values are >= 10 in either column. I'm not sure how to do this.
You can iterate over the x,y pairs and if any are >=10 add that text to the chart at those coordinates, with an offset of +/- .5 so it doesn't land on the dot.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dict1 = {'left':[7,3,5,10,9],
'right':[2,17,0,8,1]}
table = pd.DataFrame(dict1)
ax = sns.regplot(x=table['right'], y=table['left'], data=table)
for x in table.values:
    if any([n>=10 for n in x]):
        ax.text(x=x[1]+.5, y=x[0]-.5, s=','.join(map(str,reversed(x))))

Issue with x-axis tick labels in matplotlib scatter plot

I'm trying to plot some data that I have and I'm having issues with the x-axis tick labels. Does anyone have a fix for this? Also, is there an easier way to plot this data with certain conditions? For example, I'm looking at poker hands here, and I only want to plot this data for individuals that have over 50 hands (ie. data points). To do this, I created a new list and filtered out those with Hands < 50, is there a way of plotting this with pandas without creating a new list?
## For data handling
import pandas as pd
import numpy as np
from pandas import plotting
## For plotting
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.figure import Figure
preflop = pd.read_csv("all_player_preflop_report_tourney.csv", thousands=',')
#preflop['Hands'] = preflop['Hands'].astype(int) if preflop['Hands'] < 20000
preflop['Hands'] = preflop['Hands'].astype(int)
preflop = preflop.rename(columns={'BB/100':'BB_100','Raise First':'RFI','WTSD %': 'WTSD', 'All-In Adj BB/100':'adj_BB_100','Avg PF All-In Equity':'pf_all_in','CC 2Bet PF':'cc_2bet','3Bet PF':'3bet','2Bet PF & Call 3Bet':'2Bet_call_3Bet','Raise & 4Bet+ PF':'rfi_and_4bet+','2Bet PF & Fold':'2bet_and_fold','5Bet+ PF':'5bet+','3Bet PF & Fold':'3bet_and_fold','Call Any PFR':'call_any_pfr','Call Steal':'call_steal', 'Call vs BTN Open':'call_btn_open','CC 3Bet+ PF':'cc_3bet+','Limp Behind':'limp_behind','Raise Limpers':'raise_limpers'})
preflop = preflop.set_index('Player')
preflop_copy = preflop.copy()
preflop_train = preflop_copy.sample(frac = .75, random_state = 250)
preflop_test = preflop_copy.drop(preflop_train.index)
## first make a figure
## this makes a figure that is 8 inches by 8 inches
plt.figure(figsize = (8,8))
preflop_50 = preflop_copy.loc[(preflop_copy.Hands > 50)]
#preflop_50.plot.scatter(x="RFI", y="BB_100")
plt.scatter(preflop_50.RFI,preflop_50.BB_100)
x = np.arange(0,1,0.1)
plt.xticks(x)
#Figure.align_xlabels(plot)
## Always good practice to label well when
## presenting a figure to others
## place an xlabel
plt.xlabel("RFI", fontsize =16)
## place a ylabel
plt.ylabel("BB/100", fontsize = 16)
## type this to show the plot
plt.show()
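On the second part of the question (plotting only players above a hand count without building a separate list), a minimal sketch follows. The CSV is not available here, so the frame below is a hypothetical stand-in with made-up values; only the column names Hands, RFI and BB_100 are taken from the code above. The filter is applied inline with DataFrame.query and the result is plotted straight away.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the renamed CSV data (values are made up).
preflop = pd.DataFrame(
    {
        "Hands": [12, 75, 340, 48, 1200],
        "RFI": [0.10, 0.22, 0.18, 0.35, 0.27],
        "BB_100": [-4.0, 3.5, 1.2, 9.9, 0.4],
    },
    index=pd.Index(["p1", "p2", "p3", "p4", "p5"], name="Player"),
)

# Filter and plot in one chain; no intermediate list or copy is kept around.
ax = preflop.query("Hands > 50").plot.scatter(x="RFI", y="BB_100", figsize=(8, 8))
ax.set_xlabel("RFI", fontsize=16)
ax.set_ylabel("BB/100", fontsize=16)
```

A boolean mask such as preflop[preflop.Hands > 50] would do the same job; query is just shorter when the condition is simple.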

Plotting data with categorical x and y axes in python

I have a list of case and control samples along with the information about what characteristics are present or absent in each of them. A dataframe including the information can be generated by Pandas:
import pandas as pd
df={'Patient':[True,True,False],'Control':[False,True,False]} # Presence/absence data for three genes for each sample
df=pd.DataFrame(df)
df=df.transpose()
df.columns=['GeneA','GeneB','GeneC']
I need to visualize this data as a dot plot/scatter plot in which both the x and y axes are categorical and presence/absence is encoded by different shapes. Something like the following:
Patient|   x     x     -
Control|   -     x     -
       __________________
         GeneA GeneB GeneC
I am new to Matplotlib/seaborn and I can plot simple line plots and scatter plots. But searching online I could not find any instructions or plot similar to what I need here.
A quick way would be:
import pandas as pd
import matplotlib.pyplot as plt
df={'Patient':[1,1,0],'Control':[0,1,0]} # Presence/absence data for three genes for each sample
df=pd.DataFrame(df)
df=df.transpose()
df.columns=['GeneA','GeneB','GeneC']
heatmap = plt.imshow(df)
plt.xticks(range(len(df.columns.values)), df.columns.values)
plt.yticks(range(len(df.index)), df.index)
cbar = plt.colorbar(mappable=heatmap, ticks=[0, 1], orientation='vertical')
# vertically oriented colorbar
cbar.ax.set_yticklabels(['Absent', 'Present'])
Thanks to @DEEPAK SURANA for adding labels to the colorbar.
I searched the pyplot documentation and could not find a scatter or dot plot exactly like you described. Here is my take on creating a plot that illustrates what you want. The True records are blue and the False records are red.
# creating dataframe and extra column because index is not numeric
import pandas as pd
df={'Patient':[True,True,False],
'Control':[False,True,False]}
df=pd.DataFrame(df)
df=df.transpose()
df.columns=['GeneA','GeneB','GeneC']
df['level'] = [i for i in range(0, len(df))]
print(df)
# plotting the data
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10,6))
for idx, gene in enumerate(df.columns[:-1]):
    cList = ['blue' if x else 'red' for x in df[gene]]
    for inr_idx, lv in enumerate(df['level']):
        ax.scatter(x=idx, y=lv, c=cList[inr_idx], s=20)
fig.tight_layout()
plt.yticks([i for i in range(len(df.index))], list(df.index))
plt.xticks([i for i in range(len(df.columns)-1)], list(df.columns[:-1]))
plt.show()
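Since the question asked for presence/absence to be shown by different shapes rather than colors, here is a hedged variant of the same idea. It is only a sketch: the marker choices ("x" for present, "_" for absent) and the figure size are my assumptions, picked to mimic the ASCII drawing in the question. It uses one scatter call per state, with a boolean mask selecting the grid positions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Patient": [True, True, False], "Control": [False, True, False]}
).transpose()
df.columns = ["GeneA", "GeneB", "GeneC"]

values = df.to_numpy(dtype=bool)        # rows = samples, columns = genes
rows, cols = np.indices(values.shape)   # integer grid positions

fig, ax = plt.subplots(figsize=(5, 3))
# One scatter call per state, each with its own marker shape.
ax.scatter(cols[values], rows[values], marker="x", s=80, label="Present")
ax.scatter(cols[~values], rows[~values], marker="_", s=80, label="Absent")

# Turn the integer positions back into categorical labels.
ax.set_xticks(range(len(df.columns)))
ax.set_xticklabels(df.columns)
ax.set_yticks(range(len(df.index)))
ax.set_yticklabels(df.index)
ax.invert_yaxis()                       # first sample on top, as in the sketch
ax.margins(0.3)
ax.legend(loc="center left", bbox_to_anchor=(1, 0.5))
fig.tight_layout()
```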
Something like this might work
import pandas as pd
import numpy as np
from matplotlib.ticker import FixedLocator
df={'Patient':[1,1,0],'Control':[0,1,0]} # Presence/absence data for three genes for each sample
df=pd.DataFrame(df)
df=df.transpose()
df.columns=['GeneA','GeneB','GeneC']
plot = df.T.plot()
loc = FixedLocator([0,1,2])
plot.xaxis.set_major_locator(loc)
plot.xaxis.set_ticklabels(df.columns)
Look at https://matplotlib.org/examples/pylab_examples/major_minor_demo1.html
and https://matplotlib.org/api/ticker_api.html
I think you have to convert the boolean values to zeros and ones to make it work. Something like df.astype(int)

Histogram color by class

I have a pandas dataframe with 2 columns, "height" and "class"; "class" is a column with 3 values: 1, 2 and 5.
Now I want to make a histogram of the height data and color it by class:
plot19_s["vegetation height"].plot.hist(bins = 10)
This is my histogram.
But now I want to see the different classes as different colors in the histogram.
Since I'm not sure if the potential duplicate actually answers the question here, this is a way to produce a stacked histogram using numpy.histogram and matplotlib bar plot.
import pandas as pd
import numpy as np;np.random.seed(1)
import matplotlib.pyplot as plt
df = pd.DataFrame({"x" : np.random.exponential(size=100),
"class" : np.random.choice([1,2,5],100)})
_, edges = np.histogram(df["x"], bins=10)
histdata = []; labels=[]
for n, group in df.groupby("class"):
    histdata.append(np.histogram(group["x"], bins=edges)[0])
    labels.append(n)
hist = np.array(histdata)
histcum = np.cumsum(hist,axis=0)
plt.bar(edges[:-1],hist[0,:], width=np.diff(edges)[0],
        label=labels[0], align="edge")
for i in range(1,len(hist)):
    plt.bar(edges[:-1],hist[i,:], width=np.diff(edges)[0],
            bottom=histcum[i-1,:],label=labels[i], align="edge")
plt.legend(title="class")
plt.show()
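As a possibly shorter alternative, plt.hist accepts a list of arrays together with stacked=True, so the per-class stacking can be left to matplotlib. This is a sketch on the same kind of synthetic data as above, not a drop-in for the asker's plot19_s frame; the column names "x" and "class" and the random generator seed are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "x": rng.exponential(size=100),
    "class": rng.choice([1, 2, 5], size=100),
})

# One array of values per class, in a fixed class order.
labels = sorted(df["class"].unique())
groups = [df.loc[df["class"] == label, "x"].to_numpy() for label in labels]

# Passing a list of arrays makes hist use common bin edges for all classes;
# stacked=True draws the classes on top of each other.
counts, edges, bars = plt.hist(
    groups, bins=10, stacked=True, label=[str(label) for label in labels]
)
plt.legend(title="class")
```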

Python Matplotlib - Smooth plot line for x-axis with date values

I'm trying to smooth a graph line, but since the x-axis values are dates I'm having great trouble doing this. Say we have a dataframe as follows:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
startDate = '2015-05-15'
endDate = '2015-12-5'
index = pd.date_range(startDate, endDate)
data = np.random.normal(0, 1, size=len(index))
cols = ['value']
df = pd.DataFrame(data, index=index, columns=cols)
Then we plot the data
fig, axs = plt.subplots(1,1, figsize=(18,5))
x = df.index
y = df.value
axs.plot(x, y)
fig.show()
we get
Now, to smooth this line there are already some useful Stack Overflow questions, such as:
Generating smooth line graph using matplotlib,
Plot smooth line with PyPlot
Creating numpy linspace out of datetime
But I just can't seem to get any code working for my example. Any suggestions?
You can use the interpolation functionality that ships with pandas. Because your dataframe already has a value for every index entry, you can reindex it onto a denser index (hourly, for example), which fills every previously non-existent timestamp with NaN. Then, after choosing one of the many interpolation methods available, interpolate and plot your data:
index_hourly = pd.date_range(startDate, endDate, freq='1h')
df_smooth = df.reindex(index=index_hourly).interpolate('cubic')
df_smooth = df_smooth.rename(columns={'value':'smooth'})
df_smooth.plot(ax=axs, alpha=0.7)
df.plot(ax=axs, alpha=0.7)
fig.show()
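To make the reindex-then-interpolate step easier to inspect, here is a small self-contained sketch without any plotting. The eleven-day range, the six-hour spacing and the random seed are assumptions chosen to keep the output small; the 'cubic' method relies on SciPy being installed.

```python
import numpy as np
import pandas as pd

# Small daily series (synthetic values, for illustration only).
daily_index = pd.date_range("2015-05-15", "2015-05-25")
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {"value": rng.normal(0, 1, size=len(daily_index))}, index=daily_index
)

# Step 1: reindex onto a denser index; the new timestamps hold NaN.
dense_index = pd.date_range(daily_index[0], daily_index[-1], freq="6h")
upsampled = df.reindex(dense_index)

# Step 2: fill the NaN gaps with a cubic interpolation (needs SciPy).
df_smooth = upsampled.interpolate("cubic")

print(upsampled.head(5))
print(df_smooth.head(5))
```

The original daily values are left untouched by interpolate; only the NaN rows introduced by the reindex are filled in.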
There is one workaround: create two plots, 1) non-smoothed/non-interpolated with date labels and 2) smoothed without date labels.
Plot 1) using the argument linestyle=" " and convert the dates on the x-axis to string type.
Plot 2) using the argument linestyle="-", interpolating the x-axis and y-axis with np.linspace and make_interp_spline respectively.
The workaround applied to your code follows.
# your initial code
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.interpolate import make_interp_spline
%matplotlib inline
startDate = "2015-05-15"
endDate = "2015-07-5" #reduced the end date so smoothness is clearly seen
index = pd.date_range(startDate, endDate)
data = np.random.normal(0, 1, size=len(index))
cols = ["value"]
df = pd.DataFrame(data, index=index, columns=cols)
fig, axs = plt.subplots(1, 1, figsize=(40, 12))
x = df.index
y = df.value
# workaround: build a numeric linspace over the positions of the x axis
# stop at len - 1 (the last data position) so the spline never extrapolates
x_new = np.linspace(0, len(df.index) - 1, 300)
a_BSpline = make_interp_spline(
    [i for i in range(0, len(df.index))],
    df.value,
    k=5,
)
y_new = a_BSpline(x_new)
# plot this new plot with linestyle = "-"
axs.plot(
    x_new,
    y_new,
    "-",
    label="interpolated"
)
# to get the date on x axis we will keep our previous plot but linestyle will be None so it won't be visible
x = list(x.astype(str))
axs.plot(x, y, linestyle=" ", alpha=0.75, label="initial")
xt = [x[i] for i in range(0,len(x),5)]
plt.xticks(xt,rotation="vertical")
plt.legend()
fig.show()
Resulting Plot
Overlapped plot to see the smoothing.
Depending on what exactly you mean by "smoothing," the easiest way can be the use of savgol_filter or something similar. Unlike with interpolated splines, this method means that the smoothed line does not pass through the measured points, effectively filtering out higher-frequency noise.
from scipy.signal import savgol_filter
...
windowSize = 21
polyOrder = 1
smoothed = savgol_filter(values, windowSize, polyOrder)
axes.plot(datetimes, smoothed, color=chart.color)
The higher the polynomial order value, the closer the smoothed line is to the raw data.
Here is an example.
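A self-contained sketch of the savgol_filter approach, applied to a frame shaped like the one in the question. The snippet above leaves values, datetimes and axes undefined, so the names below (df, window_size, poly_order, the random seed) are my assumptions rather than the answerer's actual variables.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.signal import savgol_filter

# Same shape of data as in the question (synthetic noise on a daily index).
index = pd.date_range("2015-05-15", "2015-12-05")
rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.normal(0, 1, size=len(index))}, index=index)

window_size = 21  # number of samples per local fit; must be odd here
poly_order = 1    # degree of the polynomial fitted in each window
df["smoothed"] = savgol_filter(df["value"].to_numpy(), window_size, poly_order)

# The filter works on the values only, so the datetime index is reused as-is.
fig, ax = plt.subplots(figsize=(18, 5))
ax.plot(df.index, df["value"], alpha=0.4, label="raw")
ax.plot(df.index, df["smoothed"], label="smoothed")
ax.legend()
```

Because the filter only touches the y values, the date axis needs no conversion, which sidesteps the linspace-over-datetimes problem entirely.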
