I am importing data from a .txt file and plotting it; that part works.
Now I want to hide part of the data, i.e. set all y-values within an interval of x to 0 or, better, hide them completely, without the rest of the plot disappearing.
data = pd.read_csv('C:\\users\johan\Documents\Arbeit\Schwarzkoerper\Avasoft\winkel3\\'+''.join(L[k]),sep='\;',skiprows=10,decimal=",",header=None)
data = pd.DataFrame(data) # required
x = data[0]*10**(-9)
y = data[1]
plt.plot(x, y*Teilung())
plt.axis([450*10**(-9), 1100*10**(-9), 0, 60000])
plt.show()
To be concrete: I want to hide the y-values for x in [500*10**(-9), 600*10**(-9)].
Using a boolean condition on the columns, it is possible to select only the y-values you need.
df.y[(df.x < 500 * 1e-9) | (df.x > 600 * 1e-9)]
Then use the new range of y-values over the original x-values.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
lower_bound = 500 * 1e-9
upper_bound = 600 * 1e-9
df = pd.DataFrame({
"x" : np.linspace(0,1100*1e-9,100),
"y" : [5*np.cos(i)+i for i in range(100)]
})
#original values
plt.plot(df.iloc[:,0], df.iloc[:,1])
# keep only the points outside the interval, together with their own x-values
mask = (df.x < lower_bound) | (df.x > upper_bound)
hidden_df = df.y[mask]
plt.plot(df.x[mask], hidden_df)
plt.ticklabel_format(axis="x", style="sci", scilimits=(0,0))
plt.legend(("Original", "Hidden"))
plt.grid()
plt.show()
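If the goal is a visible gap in a single curve, one option is to replace the y-values inside the interval with NaN, because matplotlib does not draw line segments through NaN points. The following is a minimal sketch under the same assumptions as the example data above; the names `keep` and `y_hidden` are only illustrative.

```python
import numpy as np
import pandas as pd

lower_bound = 500e-9
upper_bound = 600e-9

df = pd.DataFrame({
    "x": np.linspace(0, 1100e-9, 100),
    "y": [5 * np.cos(i) + i for i in range(100)],
})

# True for points that should stay visible (outside the interval)
keep = (df["x"] < lower_bound) | (df["x"] > upper_bound)

# Series.where keeps values where the condition holds and puts NaN elsewhere
y_hidden = df["y"].where(keep)
```

Passing `df["x"]` and `y_hidden` to `plt.plot` should then show the curve with a hole between the two bounds. To set the values to zero instead of hiding them, `df["y"].where(keep, 0)` can be used.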
I have about 200 pairs of columns in a dataframe that I would like to plot in a single plot. Each pair of columns can be thought of as related "x" and "y" variables. Some of the "y variables" are 0 at certain points in the data. I don't want to plot those. I would rather they show up as a discontinuity in the plot. I am not able to figure out an efficient way to exclude those values. There is also a "date" variable that I don't need in the plot, but I am keeping it in the sample data just to mirror reality.
Here is a sample data set and what I have done with it. I created my sample dataset in a hurry, the original data has unique "y values" for a given "x value" for every pair of column data.
import pandas as pd
from numpy.random import randint
data1y = [n**3 -n**2+n for n in range(12)]
data1x = [randint(0, 100) for n in range(12)]
data1x.sort()
data2y = [n**3 for n in range(12)]
data2x = [randint(0, 100) for n in range(12)]
data2x.sort()
data3y = [n**3 - n**2 for n in range(12)]
data3x = [randint(0, 100) for n in range(12)]
data3x.sort()
data1y = [0 if x%7==0 else x for x in data1y]
data2y = [0 if x%7==0 else x for x in data2y]
data3y = [0 if x%7==0 else x for x in data3y]
date = ['Jan','Feb','Mar','Apr','May', 'Jun','Jul','Aug','Sep','Oct','Nov','Dec']
df = pd.DataFrame({'Date':date,'Var1':data1y, 'Var1x':data1x, 'Vartwo':data2y, 'Vartwox':data2x,'datatree':data3y, 'datatreex':data3x})
print(df)
ax = plt.gca()
fig = plt.figure()
for k in ['Var1','Vartwo','datatree']:
    df.plot(x=k+'x', y=k, kind = 'line',ax=ax)
The output I get is this:
I would like to see discontinuity where the 'y variables' are zero.
I have tried:
import numpy as np
df2 = df.copy()
df2[df2.Var1 < 0.5] = np.nan
But this makes an entire row NaN when I only want it to be a particular variable.
I'm trying this, but it isn't working.
ax = plt.gca()
fig = plt.figure()
for k in ['Var1','Vartwo','datatree']:
    filter = df.k.values > 0
    x = df.k+'x'
    y = df.k
    plot(x[filter], y[filter], kind = 'line',ax=ax)
This works for a single variable but I don't know how to loop it across 200 variables and this also doesn't show the discontinuities.
import matplotlib.pyplot as plt
ax = plt.gca()
fig = plt.figure()
for k in ['Var1','Vartwo','datatree']:
    filter = df.Var1.values > 0
    x = df.Var1x[filter]
    y = df.Var1[filter]
    plt.plot(x, y)
You're looking for .replace():
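import numpy as np
import matplotlib.pyplot as plt
df2 = df.copy()
# replace zeros with NaN in the y columns only, so the x values stay intact
cols_to_replace = ['Var1', 'Vartwo', 'datatree']
df2[cols_to_replace] = df2[cols_to_replace].replace({0: np.nan})
fig, ax = plt.subplots()
for k in ['Var1', 'Vartwo', 'datatree']:
    df2.plot(x=k+'x', y=k, kind='line', ax=ax)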
Result:
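For the 200-pair case, the list of y columns does not have to be typed by hand. The sketch below assumes, as in the sample data, that every y column `name` has a matching x column called `name + 'x'`; the small frame is a stand-in for the real data, and it uses `DataFrame.mask` instead of `.replace()`.

```python
import numpy as np
import pandas as pd

# small stand-in for the real data: two (y, x) column pairs plus a date column
df = pd.DataFrame({
    "Date": ["Jan", "Feb", "Mar", "Apr"],
    "Var1": [0, 1, 6, 21],
    "Var1x": [0, 10, 20, 30],
    "Vartwo": [0, 1, 8, 0],
    "Vartwox": [5, 0, 25, 35],
})

# assumption: every y column "name" has a matching x column "namex"
y_cols = [c for c in df.columns if c + "x" in df.columns]

df2 = df.copy()
# mask() replaces entries where the condition is True with NaN, column by column
df2[y_cols] = df2[y_cols].mask(df2[y_cols] == 0)
```

Looping with `for k in y_cols: df2.plot(x=k + 'x', y=k, ax=ax)` should then leave a gap wherever a y value was zero, while zeros in the x columns are untouched.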
I am trying to identify outliers in a dataset using the 5th and 95th percentiles of a regression line, so I'm using quantile regression in Python with statsmodels, matplotlib and pandas. Based on this answer from blokeley, I can create a scatterplot of my data and show the best fit line and the lines for the 5th and 95th percentile based on quantile regression. But how do I identify the points that fall above and below those lines and then save them out to a pandas dataframe?
My data looks like this (there are 95 values in total):
   Month  Year         LST      NDVI
0   June  1984  310.550975  0.344335
1   June  1985  310.495331  0.320504
2   June  1986  306.820900  0.369494
3   June  1987  308.945602  0.369946
4   June  1988  308.694022  0.318632
and the script I have so far is this:
import pandas as pd
excel = my_excel
df = pd.read_excel(excel)
df.head()
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
model = smf.quantreg('NDVI ~ LST',df)
quantiles = [0.05,0.95]
fits = [model.fit(q=q) for q in quantiles]
figure,axes = plt.subplots()
x = df['LST']
y = df['NDVI']
axes.scatter(x,df['NDVI'],c='green',alpha=0.3,label='data point')
fit = np.polyfit(x, y, deg=1)
axes.plot(x, fit[0] * x + fit[1], color='grey',label='best fit')
_x = np.linspace(x.min(),x.max())
for index, quantile in enumerate(quantiles):
    _y = fits[index].params['LST'] * _x + fits[index].params['Intercept']
    axes.plot(_x, _y, label=quantile)
title = 'LST/NDVI Jun-Aug'
plt.title(title)
axes.legend()
axes.set_xticks(np.arange(298,320,4))
axes.set_yticks(np.arange(0.25,0.5,.05))
axes.set_xlabel('LST')
axes.set_ylabel('NDVI');
And the chart I get out of that is this:
So I can definitely see data points above the 95th line and below the 5th line that I would classify as outliers, but I want to identify those in my original dataframe and maybe plot them on the chart or highlight them in some way to show them as "outliers".
I have been searching for a method but am coming up empty and could use some help.
You need to determine whether each point lies above the 95% quantile line or below the 5% quantile line. You can do this using the cross product; see this answer for a straightforward implementation.
In your example, you would need to combine the points above and below the quantile lines, for instance in a single boolean mask.
Here is an example:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
df = pd.DataFrame(np.random.normal(0, 1, (100, 2)))
df.columns = ['LST', 'NDVI']
model = smf.quantreg('NDVI ~ LST', df)
quantiles = [0.05, 0.95]
fits = [model.fit(q=q) for q in quantiles]
figure, axes = plt.subplots()
x = df['LST']
y = df['NDVI']
fit = np.polyfit(x, y, deg=1)
_x = np.linspace(x.min(), x.max(), num=len(y))
# fit lines
_y_005 = fits[0].params['LST'] * _x + fits[0].params['Intercept']
_y_095 = fits[1].params['LST'] * _x + fits[1].params['Intercept']
# start and end coordinates of fit lines
p = np.column_stack((x, y))
a = np.array([_x[0], _y_005[0]]) #first point of 0.05 quantile fit line
b = np.array([_x[-1], _y_005[-1]]) #last point of 0.05 quantile fit line
a_ = np.array([_x[0], _y_095[0]])
b_ = np.array([_x[-1], _y_095[-1]])
#mask based on if coordinates are above 0.95 or below 0.05 quantile fitlines using cross product
mask = lambda p, a, b, a_, b_: (np.cross(p-a, b-a) > 0) | (np.cross(p-a_, b_-a_) < 0)
mask = mask(p, a, b, a_, b_)
axes.scatter(x[mask], df['NDVI'][mask], facecolor='r', edgecolor='none', alpha=0.3, label='data point')
axes.scatter(x[~mask], df['NDVI'][~mask], facecolor='g', edgecolor='none', alpha=0.3, label='data point')
axes.plot(x, fit[0] * x + fit[1], label='best fit', c='lightgrey')
axes.plot(_x, _y_095, label=quantiles[1], c='orange')
axes.plot(_x, _y_005, label=quantiles[0], c='lightblue')
axes.legend()
axes.set_xlabel('LST')
axes.set_ylabel('NDVI')
plt.show()
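Because both quantile fits are straight lines, an alternative to the cross product is to evaluate each line at every point's own `LST` value and compare the result with the observed `NDVI`. This also yields the outliers as a dataframe, which is what the question asks for. The sketch below assumes the slopes and intercepts have already been read from `fits[i].params`; the helper name `flag_outliers` is hypothetical and the numbers are made up for illustration, not real fit results.

```python
import numpy as np
import pandas as pd

def flag_outliers(df, lower, upper, x_col="LST", y_col="NDVI"):
    """Return the rows lying below the lower line or above the upper line.

    `lower` and `upper` are (slope, intercept) tuples, e.g. built from
    fits[0].params['LST'] and fits[0].params['Intercept'].
    """
    y_low = lower[0] * df[x_col] + lower[1]
    y_high = upper[0] * df[x_col] + upper[1]
    is_outlier = (df[y_col] < y_low) | (df[y_col] > y_high)
    return df[is_outlier]

# illustrative data and line parameters (not real fit results)
df = pd.DataFrame({
    "LST": [300.0, 305.0, 310.0, 315.0],
    "NDVI": [0.50, 0.35, 0.20, 0.33],
})
outliers = flag_outliers(df, lower=(0.0, 0.30), upper=(0.0, 0.40))
```

The returned frame keeps the original index, so the rows can be traced back to the source data, written out with `outliers.to_csv(...)`, or passed to `axes.scatter` in a different colour.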
Here's the short example version of my problem:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(1, 101, 1)
y1 = np.linspace(0, 1000, 50)
y2 = np.linspace(500, 2000, 50)
y = np.concatenate((y1, y2))
data = np.asmatrix([x, y]).T
df = pd.DataFrame(data)
fig, ax = plt.subplots(figsize=(12,8))
ax.plot(df.iloc[:, 0], df.iloc[:, 1], 'r')
plt.gca().invert_yaxis()
plt.show()
Please run the code and see the plot it generates.
I want to read the dataframe from the back (from x=100 to x=0) and make sure my y-axis is always decreasing (from y=2000 to y=0). I want to remove rows where the y value is not decreasing when read from the end of the dataframe.
How can I edit my dataframe to make this happen?
I'm not really happy with this solution, but it's better than nothing. I found it really hard to describe this problem without becoming too vague. Please comment if you see room for improvement, because I know there is.
newindex = []
running_max = -999
for row in df.index:
    if df.loc[row, 1] > running_max:
        running_max = df.loc[row, 1]
        newindex.append(row)
df = df.loc[newindex,]
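A vectorized variant that reads the frame from the last row backwards, as the question describes, can be sketched with a running minimum. This is a different scan direction from the loop above, so it may keep a different set of rows. The column names `x` and `y` are an assumption made for readability; the question's frame uses the integer labels 0 and 1.

```python
import numpy as np
import pandas as pd

x = np.arange(1, 101)
y = np.concatenate((np.linspace(0, 1000, 50), np.linspace(500, 2000, 50)))
df = pd.DataFrame({"x": x, "y": y})

# walk through y from the last row to the first
y_rev = df["y"].iloc[::-1]

# smallest value seen *before* each row (infinity before the first one read)
prev_min = y_rev.cummin().shift(1, fill_value=np.inf)

# keep a row only if it is strictly below everything read before it
keep = y_rev < prev_min

# restore the original row order before using the mask for selection
result = df[keep.sort_index()]
```

With the sample data this keeps the whole upper branch and only the part of the lower branch that lies below 500, so `result["y"]` increases strictly with `x`.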
I am trying to get my function to output the outlier in the array "data." I have created a graph to show the outlier, however I want my function to spit out the actual value also. Basically I want the value '220' to be outputted in my code. How can I do this?
What am I doing wrong with my code? I think something is off with my distance
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
web_stats = {'Day': [1,2,3,4,5,6], 'Visitors': [43,53,34,45,64,34],
'Bounce_Rate': [65,72,62,64,54,220]}
df = pd.DataFrame(web_stats)
data = np.array(df['Bounce_Rate'])
def find_outlier(data, q1, q3):
    lower = q1 - 1.5 * (q3 - q1)
    upper = q3 + 1.5 * (q3 - q1)
    return data <= lower or data >= upper
def find_indices(data):
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    indices_of_outliers = []
    for ind, value in enumerate(data):
        if find_outlier(value, q1, q3):
            indices_of_outliers.append(ind)
    return indices_of_outliers
dist=data
find_indices = find_indices(dist)
fig = plt.figure()
ax = fig.add_subplot(111) # 1x1 grid, first subplot
ax.plot(dist, 'b-', label='distances')
ax.plot(
find_indices,
data[find_indices],
'ro',
markersize = 7,
label='outliers')
ax.legend(loc='best')
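To get the outlier value itself, the indices can be used to index back into the array. The following is a minimal sketch of the same 1.5 * IQR rule using a boolean mask; the function name `find_outliers` is illustrative. Note that the code above assigns `find_indices = find_indices(dist)`, which replaces the function with its result, so a separate variable name is used here.

```python
import numpy as np

data = np.array([65, 72, 62, 64, 54, 220])

def find_outliers(values):
    """Return (indices, values) of points on or outside the 1.5 * IQR fences."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    mask = (values <= lower) | (values >= upper)
    return np.flatnonzero(mask), values[mask]

outlier_idx, outlier_vals = find_outliers(data)
print(outlier_vals)  # [220]
```

`outlier_idx` can be passed to `ax.plot` as the x positions of the red markers, with `outlier_vals` as the y positions.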
I am trying to make a profile plot for two columns of a pandas.DataFrame. I would not expect this to be in pandas directly but it seems there is nothing in matplotlib either. I have searched around and cannot find it in any package other than rootpy. Before I take the time to write this myself I thought I would ask if there was a small package that contained profile histograms, perhaps where they are known by a different name.
If you don't know what I mean by "profile histogram" have a look at the ROOT implementation. http://root.cern.ch/root/html/TProfile.html
You can easily do it using scipy.stats.binned_statistic.
import scipy.stats
import numpy
import matplotlib.pyplot as plt
x = numpy.random.rand(10000)
y = x + scipy.stats.norm(0, 0.2).rvs(10000)
means_result = scipy.stats.binned_statistic(x, [y, y**2], bins=50, range=(0,1), statistic='mean')
means, means2 = means_result.statistic
standard_deviations = numpy.sqrt(means2 - means**2)
bin_edges = means_result.bin_edges
bin_centers = (bin_edges[:-1] + bin_edges[1:])/2.
plt.errorbar(x=bin_centers, y=means, yerr=standard_deviations, linestyle='none', marker='.')
Use seaborn. The data is taken from @MaxNoe's answer.
import numpy as np
import seaborn as sns
# just some random numbers to get started
x = np.random.uniform(-2, 2, 10000)
y = np.random.normal(x**2, np.abs(x) + 1)
sns.regplot(x=x, y=y, x_bins=10, fit_reg=False)
You can do much more (error bands are from bootstrap, you can change the estimator on the y-axis, add regression, ...)
While @Keith's answer seems to fit what you mean, it is quite a lot of code. I think this can be done much more simply, so that one gets the key concepts and can adjust and build on top of it.
Let me stress one thing: what ROOT calls a ProfileHistogram is not a special kind of plot. It is an errorbar plot, which can easily be made in matplotlib.
It is a special kind of computation, and that is not the task of a plotting library. This lies in the pandas realm, and pandas is great at stuff like this. It is symptomatic of ROOT, as the giant monolithic pile it is, to have an extra class for this.
So what you want to do is: discretize in some variable x and, for each bin, calculate something in another variable y.
This can easily be done using np.digitize together with the pandas groupby and aggregate methods.
Putting it all together:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# just some random numbers to get started
x = np.random.uniform(-2, 2, 10000)
y = np.random.normal(x**2, np.abs(x) + 1)
df = pd.DataFrame({'x': x, 'y': y})
# calculate which bin each row belongs to, based on `x`
# bins needs the bin edges, so this will give us 100 equally sized bins
bins = np.linspace(-2, 2, 101)
df['bin'] = np.digitize(x, bins=bins)
bin_centers = 0.5 * (bins[:-1] + bins[1:])
bin_width = bins[1] - bins[0]
# group by bin, so we can calculate stuff
binned = df.groupby('bin')
# calculate mean and standard error of the mean for y in each bin
result = binned['y'].agg(['mean', 'sem'])
result['x'] = bin_centers
result['xerr'] = bin_width / 2
# plot it
result.plot(
x='x',
y='mean',
xerr='xerr',
yerr='sem',
linestyle='none',
capsize=0,
color='black',
)
plt.savefig('result.png', dpi=300)
Just like ROOT ;)
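The same binning step can also be written with `pd.cut`, which labels every row with the interval it falls into and so avoids keeping a separate array of bin centers. This is a sketch of an alternative to the `np.digitize` code above, not a replacement for it; the seed and the choice of 10 bins are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 10000)
y = rng.normal(x**2, np.abs(x) + 1)
df = pd.DataFrame({"x": x, "y": y})

bins = np.linspace(-2, 2, 11)  # 11 edges give 10 equally wide bins

# pd.cut labels each row with the interval its x value falls into
df["bin"] = pd.cut(df["x"], bins=bins)

# observed=False keeps empty bins in the result instead of dropping them
profile = df.groupby("bin", observed=False)["y"].agg(["mean", "sem", "count"])

# the midpoint of each interval serves as the x position of the bin
profile["x"] = [interval.mid for interval in profile.index]
```

`profile` can then be plotted in the same way as `result` above, for example with `profile.plot(x='x', y='mean', yerr='sem', linestyle='none')`.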
I made a module myself for this functionality.
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
import matplotlib.pyplot as plt
def Profile(x,y,nbins,xmin,xmax,ax):
    df = DataFrame({'x' : x , 'y' : y})
    binedges = xmin + ((xmax-xmin)/nbins) * np.arange(nbins+1)
    df['bin'] = np.digitize(df['x'],binedges)
    bincenters = xmin + ((xmax-xmin)/nbins)*np.arange(nbins) + ((xmax-xmin)/(2*nbins))
    ProfileFrame = DataFrame({'bincenters' : bincenters, 'N' : df['bin'].value_counts(sort=False)},index=range(1,nbins+1))
    bins = ProfileFrame.index.values
    # .ix was removed from pandas; .loc does the same label-based selection here
    for bin in bins:
        ProfileFrame.loc[bin,'ymean'] = df.loc[df['bin']==bin,'y'].mean()
        ProfileFrame.loc[bin,'yStandDev'] = df.loc[df['bin']==bin,'y'].std()
        ProfileFrame.loc[bin,'yMeanError'] = ProfileFrame.loc[bin,'yStandDev'] / np.sqrt(ProfileFrame.loc[bin,'N'])
    # fmt='none' draws only the error bars, without a connecting line
    ax.errorbar(ProfileFrame['bincenters'], ProfileFrame['ymean'], yerr=ProfileFrame['yMeanError'], xerr=(xmax-xmin)/(2*nbins), fmt='none')
    return ax
def Profile_Matrix(frame):
    #Much of this is stolen from https://github.com/pydata/pandas/blob/master/pandas/tools/plotting.py
    import pandas.core.common as com
    import pandas.tools.plotting as plots
    from pandas.compat import lrange
    from matplotlib.artist import setp
    range_padding=0.05
    df = frame._get_numeric_data()
    n = df.columns.size
    fig, axes = plots._subplots(nrows=n, ncols=n, squeeze=False)
    # no gaps between subplots
    fig.subplots_adjust(wspace=0, hspace=0)
    mask = com.notnull(df)
    boundaries_list = []
    for a in df.columns:
        values = df[a].values[mask[a].values]
        rmin_, rmax_ = np.min(values), np.max(values)
        rdelta_ext = (rmax_ - rmin_) * range_padding / 2.
        boundaries_list.append((rmin_ - rdelta_ext, rmax_+ rdelta_ext))
    for i, a in zip(lrange(n), df.columns):
        for j, b in zip(lrange(n), df.columns):
            common = (mask[a] & mask[b]).values
            nbins = 100
            (xmin,xmax) = boundaries_list[i]
            ax = axes[i, j]
            Profile(df[a][common],df[b][common],nbins,xmin,xmax,ax)
            ax.set_xlabel('')
            ax.set_ylabel('')
            plots._label_axis(ax, kind='x', label=b, position='bottom', rotate=True)
            plots._label_axis(ax, kind='y', label=a, position='left')
            if j!= 0:
                ax.yaxis.set_visible(False)
            if i != n-1:
                ax.xaxis.set_visible(False)
    for ax in axes.flat:
        setp(ax.get_xticklabels(), fontsize=8)
        setp(ax.get_yticklabels(), fontsize=8)
    return axes
To my knowledge, matplotlib still does not provide a direct way to produce profile histograms.
You can instead take a look at Hippodraw, a package developed at SLAC that can be used as a Python extension module.
Here is a profile histogram example:
http://www.slac.stanford.edu/grp/ek/hippodraw/datareps_root.html#datareps_profilehist