This question is a follow-up to an earlier question and to the answer from @JoeKington here. Both of these solutions work excellently for my needs.
However, I have been trying to overlay a basemap on the contours. Going by the examples at http://matplotlib.org/basemap/users/examples.html, I cannot get it right. I think my basic problem is converting the contour x, y values into map coordinates. Below I reproduce the code for 1) the contours (as given by @usethedeathstar, which works very well) and 2) the map object and the plotting.
#!/usr/bin/python
from mpl_toolkits.basemap import Basemap
import numpy as np
from scipy.interpolate import griddata
class d():
    def __init__(self):
        A0 = open("meansr2.txt","rb") #
        A1 = A0.readlines()
        A = np.zeros((len(A1),3))
        for i, l in enumerate(A1):
            li = l.split()
            A[i,0] = float(li[0])
            A[i,1] = float(li[1])
            A[i,2] = float(li[2])
        self.Lon = A[:,0]
        self.Lat = A[:,1]
        self.Z = A[:,2]
data = d()
numcols, numrows = 300, 300
xi = np.linspace(data.Lon.min(), data.Lon.max(), numrows)
yi = np.linspace(data.Lat.min(), data.Lat.max(), numcols)
xi, yi = np.meshgrid(xi, yi)
x, y, z = data.Lon, data.Lat, data.Z
points = np.vstack((x,y)).T
values = z
wanted = (xi, yi)
zi = griddata(points, values, wanted)
Defining map object
m = Basemap(projection = 'merc',llcrnrlon = 21, llcrnrlat = -18, urcrnrlon = 34, urcrnrlat = -8, resolution='h')
m.drawcountries(linewidth=0.5, linestyle='solid', color='k', antialiased=1, ax=None, zorder=None)
m.drawmapboundary(fill_color = 'white')
m.fillcontinents(color='coral',lake_color='blue')
parallels = np.arange(-18, -8, 2.)
m.drawparallels(parallels, color = 'black', linewidth = 0.5, labels=[True,False,False,False])
meridians = np.arange(22,34, 2.)
m.drawmeridians(meridians, color = '0.25', linewidth = 0.5, labels=[False,False,False,True])
import pylab as plt
An attempt to transform from lat/lon to map coordinates
#lat = list(data.Lat)
#lon = list(data.Lon)
#x, y = m(lon,lat)
Comment: contourf was tried with (x, y, zi); then all the above definitions were rewritten with xi, yi, including many different attempts to redefine x, y and lon, lat.
The plot functions
fig = plt.figure(0, figsize=(8,4.5))
im = plt.contourf(xi, yi, zi)
plt.scatter(data.Lon, data.Lat, c= data.Z)
plt.colorbar()
plt.show()
The above gives two plots side by side.
Here is some data in case there is need to test
Lon Lat Z Z2 pos
32.6 -13.6 41 9 CHIP
27.1 -16.9 43 12 CHOM
32.7 -10.2 46 14 ISOK
24.2 -13.6 33 13 KABO
28.5 -14.4 43 11 KABW
28.1 -12.6 33 16 KAFI
27.9 -15.8 46 13 KAFU
24.8 -14.8 44 9 KAOM
31.1 -10.2 35 14 KASA
25.9 -13.5 24 8 KASE
29.1 -9.8 10 13 KAWA
25.8 -17.8 39 11 LIVI
33.2 -12.3 44 8 LUND
28.3 -15.4 46 12 LUSA
27.6 -16.1 47 9 MAGO
28.9 -11.1 31 15 MANS
31.3 -8.9 39 9 MBAL
31.9 -13.3 45 9 MFUW
23.1 -15.3 31 9 MONG
31.4 -11.9 39 9 MPIK
27.1 -15.0 42 12 MUMB
24.4 -11.8 15 9 MWIN
28.6 -13.0 39 9 NDOL
31.3 -14.3 44 12 PETA
23.3 -16.1 39 5 SENA
30.2 -13.2 38 11 SERE
24.3 -17.5 32 10 SESH
26.4 -12.2 23 12 SOLW
23.1 -13.5 27 14 ZAMB
Any assistance will be appreciated
I would like to thank all those who have looked at my problem and may have tried to work on it. By consistent trying, it has come to my attention that the overlaying of the basemap on the contours actually works with the following lines
After the map object definition
m = Basemap(projection = 'merc',llcrnrlon = 21, llcrnrlat = -18, urcrnrlon = 34, urcrnrlat = -8, resolution='h')
I have
x, y = m(xi, yi)
fig=plt.figure(figsize=(8,4.5))
cs = m.contour(x,y,zi,colors='b',linewidths=1.)
m.contour(x, y, zi) plots the contours on the map. Since I was using contourf, I still have to find out why contourf does not give me the filled contours.
Thank you very much all for the patience and tolerance.
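The key step above is `x, y = m(xi, yi)`: the contour grid has to be in projected map coordinates, not degrees. As a rough illustration of what such a conversion does, here is a minimal sketch of a plain spherical Mercator projection in NumPy. The function name `mercator_xy`, the Earth radius `R` and the `lon0_deg` origin are assumptions made for this example; Basemap performs its own transformation internally, so its numbers will generally differ from these.

```python
import numpy as np

R = 6371000.0  # assumed mean Earth radius in metres (spherical model)

def mercator_xy(lon_deg, lat_deg, lon0_deg=0.0):
    """Spherical Mercator: degrees in, metres out (hypothetical helper)."""
    lam = np.deg2rad(np.asarray(lon_deg, dtype=float) - lon0_deg)
    phi = np.deg2rad(np.asarray(lat_deg, dtype=float))
    x = R * lam                                   # linear in longitude
    y = R * np.log(np.tan(np.pi / 4 + phi / 2))   # stretched in latitude
    return x, y

# Project a small lon/lat mesh covering the map window used in the question.
lon = np.linspace(21, 34, 4)
lat = np.linspace(-18, -8, 3)
xi, yi = np.meshgrid(lon, lat)
x, y = mercator_xy(xi, yi, lon0_deg=21.0)
print(x.shape, y.shape)
```

The projected arrays keep the shape of the mesh, which is why they can be passed to a contour call together with the unchanged `zi`.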
I'm working on a project involving railway tracks, and I'm trying to find an algorithm that can detect curves (left/right) or straight lines from time-series GPS coordinates.
The data contains latitude, longitude, and altitude values along with many different sensor readings of a vehicle in a specific range of time.
Example dataframe of a curve looks as follows:
latitude longitude altitude
1 43.46724 -5.823470 145.0
2 43.46726 -5.823653 145.2
3 43.46728 -5.823837 145.4
4 43.46730 -5.824022 145.6
5 43.46730 -5.824022 145.6
6 43.46734 -5.824394 146.0
7 43.46738 -5.824768 146.3
8 43.46738 -5.824768 146.3
9 43.46742 -5.825146 146.7
10 43.46742 -5.825146 146.7
11 43.46746 -5.825527 147.1
12 43.46746 -5.825527 147.1
13 43.46750 -5.825910 147.3
14 43.46751 -5.826103 147.4
15 43.46753 -5.826295 147.6
16 43.46753 -5.826489 147.8
17 43.46753 -5.826685 148.1
18 43.46753 -5.826878 148.2
19 43.46752 -5.827073 148.4
20 43.46750 -5.827266 148.6
21 43.46748 -5.827458 148.9
22 43.46744 -5.827650 149.2
23 43.46741 -5.827839 149.5
24 43.46736 -5.828029 149.7
25 43.46731 -5.828212 150.1
26 43.46726 -5.828393 150.4
27 43.46720 -5.828572 150.5
28 43.46713 -5.828746 150.8
29 43.46706 -5.828914 151.0
30 43.46698 -5.829078 151.2
31 43.46690 -5.829237 151.4
32 43.46681 -5.829392 151.6
33 43.46671 -5.829540 151.8
34 43.46661 -5.829680 152.0
35 43.46650 -5.829816 152.2
36 43.46639 -5.829945 152.4
37 43.46628 -5.830066 152.4
38 43.46616 -5.830180 152.4
39 43.46604 -5.830287 152.5
40 43.46591 -5.830384 152.6
41 43.46579 -5.830472 152.8
42 43.46566 -5.830552 152.9
43 43.46552 -5.830623 153.2
44 43.46539 -5.830687 153.4
45 43.46526 -5.830745 153.6
46 43.46512 -5.830795 153.8
47 43.46499 -5.830838 153.9
48 43.46485 -5.830871 153.9
49 43.46471 -5.830895 154.0
50 43.46458 -5.830911 154.2
51 43.46445 -5.830919 154.3
52 43.46432 -5.830914 154.7
53 43.46418 -5.830896 155.1
54 43.46406 -5.830874 155.6
55 43.46393 -5.830842 155.9
56 43.46381 -5.830803 156.0
57 43.46368 -5.830755 155.5
58 43.46356 -5.830700 155.3
59 43.46332 -5.830575 155.1
I've found out about spline interpolation on this old post asking the same question and decided to apply it in my problem:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.interpolate import make_interp_spline
## read csv file with pandas
df = pd.read_csv("Curvas/Curva_2.csv")
# forward-fill missing latitude and longitude values
df['latitude'] = df['latitude'].ffill()
df['longitude'] = df['longitude'].ffill()
# plot the data
# df.plot(x='longitude', y='latitude', style='o')
# plt.show()
# using longitude and latitude data, use spline interpolation to create a new curve
x = df['longitude']
y = df['latitude']
xnew = np.linspace(x.min(), x.max(), x.shape[0])
ynew = make_interp_spline(xnew, y)(x)
plt.plot(xnew, ynew, zorder=2)
plt.show()
## Error results using different coordinates/routes
## Curve_1 → Left (e = 0.04818886515888465)
## Curve_2 → Left (e = 0.019459215874292113)
## Straight_1 → Straight (e = 0.03839597167971931)
I've calculated the error between the interpolated points and the real ones but I'm not quite sure how to proceed next or what threshold to use to figure out the direction.
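An alternative to thresholding the spline error is to look at how the heading changes along the track. The following is a minimal sketch under stated assumptions, not a validated method: it projects the fixes to a local flat plane, computes the heading of each segment, and classifies the net heading change. The function name `turn_direction` and the 5 degree tolerance `straight_tol_deg` are hypothetical choices that would need tuning on real routes.

```python
import numpy as np

def turn_direction(lat, lon, straight_tol_deg=5.0):
    """Classify a track as 'left', 'right' or 'straight' (hypothetical helper)."""
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)
    # Local flat-earth approximation: x points east, y points north.
    lat0 = np.deg2rad(lat.mean())
    x = np.deg2rad(lon) * np.cos(lat0)
    y = np.deg2rad(lat)
    dx, dy = np.diff(x), np.diff(y)
    moving = np.hypot(dx, dy) > 0          # drop repeated GPS fixes
    dx, dy = dx[moving], dy[moving]
    if dx.size < 2:
        return "straight", 0.0
    # Unwrapped segment headings; counterclockwise change is positive.
    heading = np.unwrap(np.arctan2(dy, dx))
    total = float(np.rad2deg(heading[-1] - heading[0]))
    if abs(total) < straight_tol_deg:
        return "straight", total
    return ("left" if total > 0 else "right"), total

# Synthetic quarter circle travelled counterclockwise, i.e. a left-hand curve.
theta = np.linspace(0, np.pi / 2, 30)
arc_lon = -5.83 + 0.001 * np.cos(theta)
arc_lat = 43.46 + 0.001 * np.sin(theta)
print(turn_direction(arc_lat, arc_lon))
```

The net change only captures a single bend. For a route with several curves, the same headings could be differenced over a sliding window instead.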
I have a simple exercise that I am not sure how to do. I have the following data sets:
male100
Year Time
0 1896 12.00
1 1900 11.00
2 1904 11.00
3 1906 11.20
4 1908 10.80
5 1912 10.80
6 1920 10.80
7 1924 10.60
8 1928 10.80
9 1932 10.30
10 1936 10.30
11 1948 10.30
12 1952 10.40
13 1956 10.50
14 1960 10.20
15 1964 10.00
16 1968 9.95
17 1972 10.14
18 1976 10.06
19 1980 10.25
20 1984 9.99
21 1988 9.92
22 1992 9.96
23 1996 9.84
24 2000 9.87
25 2004 9.85
26 2008 9.69
and the second one:
female100
Year Time
0 1928 12.20
1 1932 11.90
2 1936 11.50
3 1948 11.90
4 1952 11.50
5 1956 11.50
6 1960 11.00
7 1964 11.40
8 1968 11.00
9 1972 11.07
10 1976 11.08
11 1980 11.06
12 1984 10.97
13 1988 10.54
14 1992 10.82
15 1996 10.94
16 2000 11.12
17 2004 10.93
18 2008 10.78
I have the following code:
y = -0.014*male100['Year']+38
plt.plot(male100['Year'],y,'r-',color = 'b')
ax = plt.gca() # gca stands for 'get current axis'
ax = male100.plot(x=0,y=1, kind ='scatter', color='g', label="Mens 100m", ax = ax)
female100.plot(x=0,y=1, kind ='scatter', color='r', label="Womens 100m", ax = ax)
Which produces this result:
I need to plot a line that would go exactly between them. So the line would leave all of the green points below it, and the red point above it. How do I do so?
I've tried playing with the parameters of y, but to no avail. I also tried fitting a linear regression to male100 , female100 , and the merged version of them (across rows), but couldn't get any results.
Any help would be appreciated!
One solution is to use a support vector machine (SVM). It finds the two margins that separate the two classes of points; the line midway between these margins is your answer. Note that this works only when the two sets of points are linearly separable.
You can use the following code to see the result:
Data Entry
male = [
(1896 , 12.00),
(1900 , 11.00),
(1904 , 11.00),
(1906 , 11.20),
(1908 , 10.80),
(1912 , 10.80),
(1920 , 10.80),
(1924 , 10.60),
(1928 , 10.80),
(1932 , 10.30),
(1936 , 10.30),
(1948 , 10.30),
(1952 , 10.40),
(1956 , 10.50),
(1960 , 10.20),
(1964 , 10.00),
(1968 , 9.95),
(1972 , 10.14),
(1976 , 10.06),
(1980 , 10.25),
(1984 , 9.99),
(1988 , 9.92),
(1992 , 9.96),
(1996 , 9.84),
(2000 , 9.87),
(2004 , 9.85),
(2008 , 9.69)
]
female = [
(1928, 12.20),
(1932, 11.90),
(1936, 11.50),
(1948, 11.90),
(1952, 11.50),
(1956, 11.50),
(1960, 11.00),
(1964, 11.40),
(1968, 11.00),
(1972, 11.07),
(1976, 11.08),
(1980, 11.06),
(1984, 10.97),
(1988, 10.54),
(1992, 10.82),
(1996, 10.94),
(2000, 11.12),
(2004, 10.93),
(2008, 10.78)
]
Main Code
Notice that the value of C is important here. If it is set to 1, you will not get the desired result.
from sklearn import svm
import numpy as np
import matplotlib.pyplot as plt
X = np.array(male + female)
Y = np.array([0] * len(male) + [1] * len(female))
# fit the model
clf = svm.SVC(kernel='linear', C=1000) # C is important here
clf.fit(X, Y)
plt.figure(figsize=(8, 4))
# get the separating hyperplane
w = clf.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-1000, 10000)
yy = a * xx - (clf.intercept_[0]) / w[1]
plt.figure(1, figsize=(4, 3))
plt.clf()
plt.plot(xx, yy, "k-") #********* This is the separator line ************
plt.scatter(X[:, 0], X[:, 1], c=Y, zorder=10, cmap=plt.cm.Paired,
edgecolors="k")
plt.xlim((1890, 2010))
plt.ylim((9, 13))
plt.show()
I believe your idea of making use of regression lines is correct - if they aren't used, the line would be merely superficial (and impossible to justify if the points overlap in the event of messy data).
Therefore, using some randomly made data with a known linear relationship, we can do the following:
import random
import numpy as np
from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression
x_values = np.arange(0, 51, 1)
# build the y data from x_values; x_points is only defined below
y_points_1 = [i * 2 + random.randint(5, 30) for i in x_values]
y_points_2 = [i - random.randint(5, 30) for i in x_values]
x_points = x_values.reshape(-1, 1)
def regression(x, y):
    model = LinearRegression().fit(x, y)
    y_pred = model.predict(x)
    return y_pred
barrier = [(regression(x=x_points, y=y_points_1)[i] + value) / 2 for i, value in enumerate(regression(x=x_points, y=y_points_2))]
plt.plot(x_points, regression(x=x_points, y=y_points_1))
plt.plot(x_points, regression(x=x_points, y=y_points_2))
plt.plot(x_points, barrier)
plt.scatter(x_values, y_points_1)
plt.scatter(x_values, y_points_2)
plt.grid(True)
plt.show()
Giving us the following plot:
This method also works for an overlap in the data points, so if we change the random data slightly and apply the same process:
x_values = np.arange(0, 51, 1)
y_points_1 = [i * 2 + random.randint(-10, 30) for i in x_values]
y_points_2 = [i - random.randint(-10, 30) for i in x_values]
We get something like the following:
It is important to note that the lists used here are of the same length, so you would need to add some predicted points to the female data after applying regression in order to make use of the line between them. These points would merely be along the regression line with the x-values corresponding to those present in the male data.
Because sklearn might be a bit over the top for a linear fit, and to get rid of the condition that you need the same number of data points for the male and female data, here is the same implementation with numpy.polyfit. This also demonstrates that the previous answer's approach is not a solution to the problem.
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
#data import
male = pd.read_csv("test1.txt", delim_whitespace=True)
female = pd.read_csv("test2.txt", delim_whitespace=True)
#linear fit of both populations
pmale = np.polyfit(male.Year, male.Time, 1)
pfemale = np.polyfit(female.Year, female.Time, 1)
#more appealing presentation, let's pretend we do not just fit a line
x_fitmin=min(male.Year.min(), female.Year.min())
x_fitmax=max(male.Year.max(), female.Year.max())
x_fit=np.linspace(x_fitmin, x_fitmax, 100)
#create functions for the three fit lines
male_fit = np.poly1d(pmale)
print(male_fit)
female_fit = np.poly1d(pfemale)
print(female_fit)
sep = np.poly1d(np.mean([pmale, pfemale], axis=0))
print(sep)
#plot all markers and lines
ax = male.plot(x="Year", y="Time", c="blue", kind="scatter", label="male")
female.plot(x="Year", y="Time", c="red", kind="scatter", ax=ax, label="female")
ax.plot(x_fit, male_fit(x_fit), c="blue", ls="dotted", label="male fit")
ax.plot(x_fit, female_fit(x_fit), c="red", ls="dotted", label="female fit")
ax.plot(x_fit, sep(x_fit), c="black", ls="dashed", label="separator")
plt.legend()
plt.show()
Sample output:
-0.01333 x + 36.42
-0.01507 x + 40.92
-0.0142 x + 38.67
And one point is still in the wrong section. However, I find this question so interesting because I expected answers from the sklearn crowd for non-linear data groups. I even installed sklearn in anticipation! If nobody posts a good solution with SVMs in the next few days, I will set a bounty on this question.
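Whether a candidate line really separates the two groups can be checked by counting the points that end up on the wrong side. The sketch below is only an illustration: `count_misplaced` is a hypothetical helper, and the usage takes the rounded separator coefficients printed above together with four points copied from the question's data.

```python
import numpy as np

def count_misplaced(slope, intercept, below_pts, above_pts):
    """Count points on the wrong side of y = slope * x + intercept.

    below_pts should lie strictly below the line, above_pts strictly above.
    """
    below = np.asarray(below_pts, dtype=float)
    above = np.asarray(above_pts, dtype=float)
    line_at_below = slope * below[:, 0] + intercept
    line_at_above = slope * above[:, 0] + intercept
    wrong_below = int(np.sum(below[:, 1] >= line_at_below))
    wrong_above = int(np.sum(above[:, 1] <= line_at_above))
    return wrong_below + wrong_above

# Four points from the question: first and last entry of each data set.
male_subset = [(1896, 12.00), (2008, 9.69)]
female_subset = [(1928, 12.20), (2008, 10.78)]
print(count_misplaced(-0.0142, 38.67, male_subset, female_subset))
```

With these four points the count is 1: the 1896 men's time of 12.00 lies above the averaged line. A return value of 0 on the full data would mean the line is a valid separator.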
One solution is the geometrical approach. You can find the convex hull of each data class, then find a line that goes through these two convex hulls. To find the line, you can find inner tangent line between two convex hulls using this code, and rotate it a little bit.
You can use the following code:
from scipy.spatial import ConvexHull, convex_hull_plot_2d
male = np.array(male)
female = np.array(female)
hull_male = ConvexHull(male)
hull_female = ConvexHull(female)
plt.plot(male[:,0], male[:,1], 'o')
for simplex in hull_male.simplices:
    plt.plot(male[simplex, 0], male[simplex, 1], 'k-')
# Here, the separator line comes from the SVM result.
# It is shown only as an example of a separator.
# plt.plot(xx, yy, "k-")
plt.plot(female[:,0], female[:,1], 'o')
for simplex in hull_female.simplices:
    plt.plot(female[simplex, 0], female[simplex, 1], 'k-')
plt.xlim((1890, 2010))
plt.ylim((9, 13))
I am trying to create a color map for my 10x10 confusion matrix that is provided by sklearn. I would like to customize the color map so that it is normalized to [0, 1], but I have had no success. I am trying to use ax_ and matplotlib.colors.Normalize, but I am struggling to get something to work because ConfusionMatrixDisplay is a sklearn object whose plot differs from a usual matplotlib plot.
My code is the following:
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
train_confuse_matrix = confusion_matrix(y_true = ytrain, y_pred = y_train_pred_labels)
print(train_confuse_matrix)
cm_display = ConfusionMatrixDisplay(train_confuse_matrix, display_labels = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck'])
print(cm_display)
cm_display.plot(cmap = 'Greens')
plt.show()
plt.clf()
[[3289 56 84 18 55 7 83 61 48 252]
[ 2 3733 0 1 2 1 16 1 3 220]
[ 81 15 3365 64 81 64 273 18 6 17]
[ 17 37 71 3015 127 223 414 44 6 64]
[ 3 1 43 27 3659 24 225 35 0 3]
[ 5 23 38 334 138 3109 224 80 4 25]
[ 3 1 19 10 12 7 3946 1 1 5]
[ 4 7 38 69 154 53 89 3615 2 27]
[ 62 67 12 7 25 3 62 4 3595 153]
[ 2 30 1 2 4 0 15 2 0 3957]]
Let's try imshow and annotate manually:
# divide each row by its own total; keepdims=True keeps the sums as a column
accuracies = conf_mat/conf_mat.sum(axis=1, keepdims=True)
fig, ax = plt.subplots(figsize=(10,8))
cb = ax.imshow(accuracies, cmap='Greens')
plt.xticks(range(len(classes)), classes,rotation=90)
plt.yticks(range(len(classes)), classes)
for i in range(len(classes)):
    for j in range(len(classes)):
        color='green' if accuracies[i,j] < 0.5 else 'white'
        ax.annotate(f'{conf_mat[i,j]}', (i,j),
                    color=color, va='center', ha='center')
plt.colorbar(cb, ax=ax)
plt.show()
Output:
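The normalization step is where broadcasting matters: each row has to be divided by its own total, which requires the row sums to be a column vector. A minimal NumPy sketch follows, using a made-up 2x2 matrix rather than the data above. `row_normalize` is a hypothetical helper name, and leaving all-zero rows at zero is an assumption of this example.

```python
import numpy as np

def row_normalize(cm):
    """Scale each row of a confusion matrix to sum to 1 (hypothetical helper)."""
    cm = np.asarray(cm, dtype=float)
    row_sums = cm.sum(axis=1, keepdims=True)      # shape (n, 1)
    # Rows with no samples stay all-zero instead of producing NaN.
    return np.divide(cm, row_sums, out=np.zeros_like(cm), where=row_sums > 0)

example = [[3, 1],
           [0, 4]]
normalized = row_normalize(example)
print(normalized)
```

Every entry then lies in [0, 1]. Passing `vmin=0, vmax=1` to `imshow` pins the color scale to that full range regardless of the values actually present. As far as I know, `sklearn.metrics.confusion_matrix` also accepts `normalize='true'` for the same row-wise scaling.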
I would comment on the great answer above by @quang-hoang, but I do not have enough reputation.
The annotation position needs to be swapped to (j, i): annotate takes (x, y) coordinates, and in imshow the x axis runs along the columns (j) while the y axis runs along the rows (i).
Code:
classes = ['A','B','C']
accuracies = np.random.random((3,3))
fig, ax = plt.subplots(figsize=(10,8))
cb = ax.imshow(accuracies, cmap='Greens')
plt.xticks(range(len(classes)), classes,rotation=90)
plt.yticks(range(len(classes)), classes)
for i in range(len(classes)):
    for j in range(len(classes)):
        color='green' if accuracies[i,j] < 0.5 else 'white'
        ax.annotate(f'{accuracies[i,j]:.2f}', (j,i),
                    color=color, va='center', ha='center')
plt.colorbar(cb, ax=ax)
plt.show()
Output
This is my current output:
Now I want the next bars next to the already plotted bars.
My DataFrame has 3 columns: 'Block', 'Cluster', and 'District'.
'Block' and 'Cluster' contain the numbers for plotting and the grouping is based
on the strings in 'District'.
How can I plot the other bars next to the existing bars?
df=pd.read_csv("main_ds.csv")
fig = plt.figure(figsize=(20,8))
ax = fig.add_subplot(111)
plt.xticks(rotation=90)
bwidth=0.30
indic1=ax.bar(df["District"],df["Block"], width=bwidth, color='r')
indic2=ax.bar(df["District"],df["Cluster"], width=bwidth, color='b')
ax.autoscale(tight=False)
def autolabel(rects):
    for rect in rects:
        h = rect.get_height()
        ax.text(rect.get_x()+rect.get_width()/2., 1.05*h, '%d'%int(h),
                ha='center', va='top')
autolabel(indic1)
autolabel(indic2)
plt.show()
Data:
District Block Cluster Villages Schools Decadal_Growth_Rate Literacy_Rate Male_Literacy Female_Literacy Primary ... Govt_School Pvt_School Govt_Sch_Rural Pvt_School_Rural Govt_Sch_Enroll Pvt_Sch_Enroll Govt_Sch_Enroll_Rural Pvt_Sch_Enroll_Rural Govt_Sch_Teacher Pvt_Sch_Teacher
0 Dimapur 5 30 278 494 23.2 85.4 88.1 82.5 147 ... 298 196 242 90 33478 57176 21444 18239 3701 3571
1 Kiphire 3 3 94 142 -58.4 73.1 76.5 70.4 71 ... 118 24 118 24 5947 7123 5947 7123 853 261
2 Kohima 5 5 121 290 22.7 85.6 89.3 81.6 128 ... 189 101 157 49 10116 26464 5976 8450 2068 2193
3 Longleng 2 2 37 113 -30.5 71.1 75.6 65.4 60 ... 90 23 90 23 3483 4005 3483 4005 830 293
4 Mon 5 5 139 309 -3.8 56.6 60.4 52.4 165 ... 231 78 219 58 18588 16578 17108 8665 1667 903
5 rows × 26 columns
Try using pandas.DataFrame.plot
import pandas as pd
import numpy as np
from io import StringIO
from datetime import date
import matplotlib.pyplot as plt
def add_value_labels(ax, spacing=5):
    for rect in ax.patches:
        y_value = rect.get_height()
        x_value = rect.get_x() + rect.get_width() / 2
        space = spacing
        # Vertical alignment for positive values
        va = 'bottom'
        # If value of bar is negative: Place label below bar
        if y_value < 0:
            # Invert space to place label below
            space *= -1
            # Vertically align label at top
            va = 'top'
        # Use Y value as label and format number with one decimal place
        label = "{:.1f}".format(y_value)
        # Create annotation
        ax.annotate(
            label,                      # Use `label` as label
            (x_value, y_value),         # Place label at end of the bar
            xytext=(0, space),          # Vertically shift label by `space`
            textcoords="offset points", # Interpret `xytext` as offset in points
            ha='center',                # Horizontally center label
            va=va)                      # Vertically align label differently for
                                        # positive and negative values.
first3columns = StringIO("""District Block Cluster
Dimapur 5 30
Kiphire 3 3
Kohima 5 5
Longleng 2 2
Mon 5 5
""")
df_plot = pd.read_csv(first3columns, delim_whitespace=True)
fig, ax = plt.subplots()
#df_plot.set_index(['District'], inplace=True)
df_plot[['Block', 'Cluster']].plot.bar(ax=ax, color=['r', 'b'])
ax.set_xticklabels(df_plot['District'])
add_value_labels(ax)
plt.show()
Try changing
indic1=ax.bar(df["District"],df["Block"], width=bwidth, color='r')
indic2=ax.bar(df["District"],df["Cluster"], width=bwidth, color='b')
to
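x = np.arange(len(df))  # numeric bar positions (needs numpy as np); District holds strings
indic1=ax.bar(x-bwidth/2,df["Block"], width=bwidth, color='r')
indic2=ax.bar(x+bwidth/2,df["Cluster"], width=bwidth, color='b')
plt.xticks(x, df["District"], rotation=90)  # label each group with its district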
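The idea behind shifting by half a bar width generalizes to any number of series: give every group an integer position and spread the series symmetrically around it. The following sketch only computes the positions. `grouped_bar_positions` is a hypothetical helper, not a matplotlib function.

```python
import numpy as np

def grouped_bar_positions(n_groups, n_series, width):
    """Return an (n_series, n_groups) array of bar centre positions."""
    base = np.arange(n_groups)                               # one slot per group
    offsets = (np.arange(n_series) - (n_series - 1) / 2) * width
    return base[None, :] + offsets[:, None]

# Two series (Block, Cluster) over five districts with bar width 0.30.
positions = grouped_bar_positions(5, 2, 0.30)
print(positions)
```

Row `k` of the result can be passed as the x positions of series `k` in `ax.bar(positions[k], values_k, width=0.30)`, with the tick labels placed at `np.arange(n_groups)`.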
I have some clinical data that contains values for multiple visits for multiple subjects. I created a script to loop and create a plot for each subject containing values for each visit. Now, I need to add data to each subject plot:
1. For each subject, add a new marker (star) to identify the baseline value (bcva_OS and bcva_OD) only. I can only get it to display the markers for all values. How do I subset for baseline only? See the comment in the code. I get a syntax error if I use:
plt.plot_date(sub_df['visit_date'] if sub_df[sub_df.visit_label == 'Visit 2 - Baseline'],
2. For each subject, how can I add an entirely new data type so that both data types are overlaid on one plot per subject? I think I could do that with just one subject's worth of data, but again the loop...
Sample code:
for subject, sub_df in new_od_df.groupby(by='subject'):
    # Plot fellow eye
    plt.plot(sub_df['visit_date'], sub_df['bcva_OS'], marker='^',
             label='OS (fellow) ', color=sns.xkcd_rgb['pale red'])
    # Plot treated eye
    plt.plot(sub_df['visit_date'], sub_df['bcva_OD'], marker='o',
             label='OD (treated) ', color=sns.xkcd_rgb['denim blue'])
    # Trying to plot only the baseline values
    #plt.plot_date(sub_df['visit_date'] if sub_df[sub_df.visit_label == 'Visit 2 - Baseline'],
    # Plot fellow eye
    plt.plot_date(sub_df['visit_date'], sub_df['bcva_OS'],
                  marker='*', markersize=10,
                  label='BL (fellow) ', color=sns.xkcd_rgb['light pink'])
    # Plot treated eye
    plt.plot_date(sub_df['visit_date'], sub_df['bcva_OD'],
                  marker='*', markersize=10,
                  label='BL (treated) ', color=sns.xkcd_rgb['baby blue'])
    # Legend the old way
    plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0)
    # Display each chart separately
    plt.show()
Sample data:
subject treated_eye visit_label visit_date bcva_OD bcva_OS refract_OD refract_OS
index
108 1101 OD Visit 1 - Screening 2016-01-07 27.0 41.0 + 5 + 0.75 X 27 + 5 + 1.75 X 45
115 1101 OD Visit 2 - Baseline 2016-01-25 35.0 41.0 + 5 + 0.75 X 27 + 5.5 + 1.75 X 40
120 1101 OD Baseline - VA Session 2 2016-01-25 35.0 41.0 + 5 + 0.75 X 27 + 5.5 + 1.75 X 40
125 1101 OD Visit 4 - Day 1 2016-02-02 32.0 42.0 + 5 + 0.75 X 27 + 5 + 1.75 X 30
123 1101 OD Visit 5 - Day 7 2016-02-08 40.0 43.0 + 5 + 0.75 X 28 + 5 + 1.75 X 30
111 1101 OD Visit 6 - Day 14 2016-02-16 33.0 44.0 + 5 + 0.75 X 27 + 5 + 1.75 X 40
124 1101 OD Unscheduled 2016-02-24 37.0 44.0 + 4.5 + 1.25 X 30 + 5 + 1.75 X 40
118 1101 OD Visit 7 - Month 1 2016-02-29 37.0 40.0 + 4.5 + 1.25 X 30 + 5 + 1.75 X 43
Sample plot:
Note: this is a partial answer to point 1:
I'm not sure I completely understood your requests, especially regarding point 2: creating a new data type. Please edit your question to make point 2 clearer. Right now I'm guessing you want to plot both OD and OS values after baseline-subtraction, is this correct?
Regarding point 1, the solution below correctly gets the baseline values and plots them as a dashed line. Note that I've also added a plot title and changed calls to plt. to ax., after properly creating a figure using fig,ax=plt.subplots(). This may come in handy later, and is already required for fig.autofmt_xdate().
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.style.use('ggplot')
import seaborn as sns
data="""index,subject,treated_eye,visit_label,visit_date,bcva_OD,bcva_OS,refract_OD,refract_OS
108, 1101, OD, Visit 1 - Screening, 2016-01-07, 27.0, 41.0, + 5 + 0.75 X 27, + 5 + 1.75 X 45
115, 1101, OD, Visit 2 - Baseline, 2016-01-25, 35.0, 41.0, + 5 + 0.75 X 27, + 5.5 + 1.75 X 40
120, 1101, OD, Baseline - VA Session 2 ,2016-01-25, 35.0, 41.0, + 5 + 0.75 X 27, + 5.5 + 1.75 X 40
125, 1101, OD, Visit 4 - Day 1 ,2016-02-02, 32.0, 42.0, + 5 + 0.75 X 27, + 5 + 1.75 X 30
123, 1101, OD, Visit 5 - Day 7 ,2016-02-08, 40.0, 43.0, + 5 + 0.75 X 28, + 5 + 1.75 X 30
111, 1101, OD, Visit 6 - Day 14 ,2016-02-16,33.0, 44.0, + 5 + 0.75 X 27, + 5 + 1.75 X 40
124, 1101, OD, Unscheduled ,2016-02-24, 37.0, 44.0, + 4.5 + 1.25 X 30, + 5 + 1.75 X 40
118, 1101, OD, Visit 7 - Month 1 , 2016-02-29 , 37.0, 40.0, + 4.5 + 1.25 X 30, + 5 + 1.75 X 43
"""
## DataFrame cleanup
from io import StringIO  # pd.compat.StringIO is no longer available in newer pandas releases
df=pd.read_csv(StringIO(data),sep=",",index_col=0)
df_obj = df.select_dtypes(['object'])
df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())
df['visit_date']=pd.to_datetime(df['visit_date'])
for subject, sub_df in df.groupby(by='subject'):
    mask=(sub_df.visit_label == 'Visit 2 - Baseline')
    bcva_OS_baseline=sub_df['bcva_OS'][mask].values
    bcva_OD_baseline=sub_df['bcva_OD'][mask].values
    fig,ax=plt.subplots()
    # Plot fellow eye
    ax.plot(sub_df['visit_date'], sub_df['bcva_OS'], marker='^',
            label='OS (fellow) ', color=sns.xkcd_rgb['pale red'])
    # Plot treated eye
    ax.plot(sub_df['visit_date'], sub_df['bcva_OD'], marker='o',
            label='OD (treated) ', color=sns.xkcd_rgb['denim blue'])
    # Plot fellow eye
    ax.plot_date(sub_df['visit_date'], sub_df['bcva_OS'],
                 marker='*', markersize=10,
                 label='BL (fellow) ', color=sns.xkcd_rgb['light pink'])
    # Plot treated eye
    ax.plot_date(sub_df['visit_date'], sub_df['bcva_OD'],
                 marker='*', markersize=10,
                 label='BL (treated) ', color=sns.xkcd_rgb['baby blue'])
    # Plot baseline
    ax.axhline(bcva_OS_baseline,color=sns.xkcd_rgb['pale red'],linestyle="dashed")
    ax.axhline(bcva_OD_baseline,color=sns.xkcd_rgb['denim blue'],linestyle="dashed")
    # Legend the old way
    ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0)
    # Display each chart separately
    ax.set_title('subject {0}'.format(subject))
    fig.autofmt_xdate()
    plt.tight_layout()
    plt.show()
Result:
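If star markers at the baseline visit only are wanted, as the question originally asked, the same boolean mask can select the rows before plotting. The sketch below shows just the subsetting on a three-row frame built from the sample data. The variable names are illustrative, and the plotting call is left as a comment because it depends on the surrounding figure.

```python
import pandas as pd

# Three visits of subject 1101, copied from the sample data.
sub_df = pd.DataFrame({
    "visit_label": ["Visit 1 - Screening", "Visit 2 - Baseline", "Visit 4 - Day 1"],
    "visit_date": pd.to_datetime(["2016-01-07", "2016-01-25", "2016-02-02"]),
    "bcva_OD": [27.0, 35.0, 32.0],
    "bcva_OS": [41.0, 41.0, 42.0],
})

# A boolean mask goes inside .loc; it cannot be written as an inline `if`.
mask = sub_df["visit_label"] == "Visit 2 - Baseline"
baseline = sub_df.loc[mask, ["visit_date", "bcva_OD", "bcva_OS"]]
print(baseline)

# Inside the plotting loop, the stars would then use only these rows, e.g.:
# ax.plot(baseline["visit_date"], baseline["bcva_OD"],
#         linestyle="none", marker="*", markersize=10)
```

The syntax error in the question comes from using `if` without an `else` inside the call: a conditional expression needs both branches, and it cannot filter rows in any case.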