Pandas DataFrame: Creating 3D Surface Plots - python

I am trying to draw a 3D surface plot using 3 columns (3 features) in a data frame:
age size_tc Survival_days
0 60.463 43185.0 289
1 52.263 15709.0 616
2 54.301 3731.0 464
3 39.068 26400.0 788
4 68.493 14410.0 465
5 67.126 44774.0 269
6 69.912 9557.0 503
7 56.419 76260.0 1155
8 48.367 6994.0 515
9 65.899 8280.0 495
10 59.693 14535.0 698
11 51.734 27568.0 359
12 62.614 17677.0 169
13 55.759 41082.0 368
14 58.258 14713.0 439
15 61.605 2036.0 486
16 68.049 20547.0 287
17 56.921 5669.0 576
18 44.162 30637.0 350
19 67.833 17526.0 332
20 46.666 28472.0 331
21 76.367 15027.0 106
22 67.860 24355.0 473
23 46.452 44985.0 1283
24 71.370 5751.0 89
25 75.978 24963.0 172
26 53.362 19018.0 84
27 75.312 40795.0 726
28 46.570 3461.0 660
29 77.337 7635.0 522
My code is as follows:
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older matplotlib
import matplotlib.pyplot as plt
from matplotlib import cm
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
x = df['age']
y = df['size_tc']
z = df['Survival_days']  # column name as shown in the table above
surf = ax.plot_trisurf(x, y, z, cmap=cm.coolwarm, linewidth=0.2)
The plot looks like this:
Why is the plot not smooth, and how can I generate a plot like this?
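One possible explanation, offered as a sketch rather than a definitive answer: `plot_trisurf` draws triangles directly between the raw data points, so 30 scattered, noisy rows give a faceted surface. A smoother-looking plot usually needs the points resampled onto a regular grid first. The example below assumes SciPy is available and uses `scipy.interpolate.griddata`; the helper names `grid_scattered` and `plot_gridded_surface`, the grid size, and the nearest-neighbour fill outside the data's convex hull are illustrative choices, not part of the original code.

```python
import numpy as np
from scipy.interpolate import griddata


def grid_scattered(x, y, z, n=50):
    """Resample scattered (x, y, z) points onto an n-by-n regular grid."""
    x, y, z = (np.asarray(a, dtype=float) for a in (x, y, z))
    xi = np.linspace(x.min(), x.max(), n)
    yi = np.linspace(y.min(), y.max(), n)
    XI, YI = np.meshgrid(xi, yi)
    points = np.column_stack([x, y])
    # rescale=True matters here because age and size_tc are on very
    # different scales.
    ZI = griddata(points, z, (XI, YI), method="linear", rescale=True)
    # Linear interpolation is undefined outside the convex hull of the
    # data (NaN); fill those cells with the nearest observed value.
    nearest = griddata(points, z, (XI, YI), method="nearest", rescale=True)
    ZI = np.where(np.isnan(ZI), nearest, ZI)
    return XI, YI, ZI


def plot_gridded_surface(x, y, z, n=50):
    """Draw the gridded data with plot_surface (hypothetical helper)."""
    # Imported here so the gridding step can be used without matplotlib.
    import matplotlib.pyplot as plt
    from matplotlib import cm

    XI, YI, ZI = grid_scattered(x, y, z, n)
    fig = plt.figure(figsize=(10, 8))
    ax = fig.add_subplot(111, projection="3d")
    ax.plot_surface(XI, YI, ZI, cmap=cm.coolwarm, linewidth=0.2)
    return fig, ax
```

With the question's frame this would be called as `plot_gridded_surface(df['age'], df['size_tc'], df['Survival_days'])`, followed by `plt.show()`. With only 30 rows the result is still an interpolation of sparse data, so the shape of the surface should be read with care.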

Related

Plotting a histogram for categorical data

I have a dataset with two columns, as follows:
index Year
0 5 <2012
1 8 >=2012
2 9 >=2012
3 10 <2012
4 15 <2012
... ... ...
171 387 >=2012
172 390 <2012
173 398 <2012
174 403 >=2012
175 409 <2012
I would like to plot it as a histogram. I tried:
plt.style.use('ggplot')
df.groupby(['Year'])\
    .Year.count().unstack().plot.bar(legend=True)
plt.show()
but the groupby line raises an error:
AttributeError: 'CategoricalIndex' object has no attribute 'remove_unused_levels'
I think this is because I am using categorical values. Any help would be appreciated.
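Try dropping .unstack(): the grouped count is already a Series with a single-level index, so there is nothing to unstack: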
plt.style.use('ggplot')
df.groupby(["Year"])["Year"].agg("count").plot.bar();
Alternatively:
plt.hist(df["Year"]);
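As a small sketch of the same idea, `value_counts` gives the per-category counts that the bar chart draws. The frame below is built from the first five rows shown in the question, and the `sort_index` call is only there to fix the bar order.

```python
import pandas as pd

# First five rows from the question, with Year stored as a categorical column.
df = pd.DataFrame({
    "index": [5, 8, 9, 10, 15],
    "Year": pd.Categorical(["<2012", ">=2012", ">=2012", "<2012", "<2012"]),
})

# One count per category; sort_index keeps the bars in category order.
counts = df["Year"].value_counts().sort_index()
print(counts)

# counts.plot.bar() would then draw one bar per category (needs matplotlib).
```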

Plot moving average with data [duplicate]

I am trying to calculate and plot moving average along with the data it is calculated from:
def movingAvg(df):
    window_size = 7
    i = 0
    moving_averages = []
    while i < len(df) - window_size + 1:
        current_window = df[i : i + window_size]
        window_average = current_window.mean()
        moving_averages.append(window_average)
        i += 1
    return moving_averages
dates = df_valid['dateTime']
startDay = dates.iloc[0]
lastDay = dates.iloc[-1]
fig, ax = plt.subplots(figsize=(20, 10))
ax.autoscale()
#plt.xlim(startDay, lastDay)
df_valid.sedentaryActivityMins.reset_index(drop=True, inplace=True)
df_moving = pd.DataFrame(movingAvg(df_valid['sedentaryActivityMins']))
df_nan = [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]
df_nan = pd.DataFrame(df_nan)
df_moving = pd.concat([df_nan, df_moving])
plt.plot(df_valid.sedentaryActivityMins)
plt.plot(df_moving)
#plt.show()
But because the moving average uses a window of 7, the list of moving averages is 7 items short, so the two plots do not line up.
I tried prepending 7 NaN values to the moving-average list, but they are ignored when I plot.
The plot is as follows:
But I would like the orange line to start 7 steps later.
So it looks like this:
df_valid.sedentaryActivityMins.head(40)
0 608
1 494
2 579
3 586
4 404
5 750
6 573
7 466
8 389
9 604
10 351
11 553
12 768
13 572
14 616
15 522
16 675
17 607
18 229
19 529
20 746
21 646
22 625
23 590
24 572
25 462
26 708
27 662
28 649
29 626
30 485
31 509
32 561
33 664
34 517
35 587
36 602
37 601
38 495
39 352
Name: sedentaryActivityMins, dtype: int64
Any ideas as to how?
Thanks in advance!
When you concatenate, the original indexes are kept: the NaN rows keep the labels 0 to 6 and the moving averages keep their own labels starting at 0, so the line is still drawn from x = 0. Either reset the index after the concat or pass ignore_index=True, as follows:
df_moving = pd.concat([df_nan, df_moving], ignore_index=True)
plt.plot(df_valid.sedentaryActivityMins)
plt.plot(df_moving)
This gives the output as expected:
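An alternative sketch that avoids the padding altogether, assuming a reasonably recent pandas: `Series.rolling` returns a result with the same index as its input, with NaN wherever the window is incomplete, so both lines share one x-axis. The data are the first ten values from the question; the variable names are illustrative.

```python
import pandas as pd

# First ten values of sedentaryActivityMins from the question.
mins = pd.Series([608, 494, 579, 586, 404, 750, 573, 466, 389, 604],
                 name="sedentaryActivityMins")

# 7-point trailing mean; the first 6 entries are NaN because the
# window is not yet full.
moving = mins.rolling(window=7).mean()
print(moving)

# mins.plot() and moving.plot() on the same axes would now line up,
# since both Series share the same index.
```

Note that a window of 7 leaves 6 incomplete positions, not 7, which is one reason to let pandas handle the alignment.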

Dates from CSV files: how can I graph them?

I am new to Python.
I am trying to plot two variables, Y1 and Y2 (Y2 on a secondary y-axis), against the date on the x-axis, using data from a CSV file.
I think my main problem is converting the dates in the CSV.
Also, is it possible to save three separate graphs, one per ID (A, B, C)?
I have added the CSV data below and an image of the figure I am looking for.
Thanks a lot for your advice.
ID date Y1 Y2
A 40480 136 83
A 41234 173 23
A 41395 180 29
A 41458 124 60
A 41861 158 27
A 42441 152 26
A 43009 155 51
A 43198 154 38
B 40409 185 71
B 40612 156 36
B 40628 165 39
B 40989 139 77
B 41346 138 20
B 41558 132 85
B 41872 157 58
B 41992 120 91
B 42245 139 43
B 42397 131 34
B 42745 114 68
C 40711 110 68
C 40837 156 38
C 40946 110 63
C 41186 161 46
C 41243 187 20
C 41494 122 55
C 41970 103 19
C 42183 148 78
C 42247 115 33
C 42435 132 92
C 42720 187 43
C 43228 127 28
C 43426 183 45
Try the matplotlib library; if I understood correctly, something like this should work (date, y1 and y2 stand for the columns read from your CSV file):
from mpl_toolkits import mplot3d
# The next line is a Jupyter notebook magic; drop it in a plain script
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax = plt.axes(projection='3d')
# Data for a three-dimensional line
zaxis = y1
xaxis = date
yaxis = y2
ax.plot3D(xaxis, yaxis, zaxis, 'red')
# Data for three-dimensional scattered points
zdat = y1
xdat = date
ydat = y2
ax.scatter3D(xdat, ydat, zdat, c=xdat, cmap='Greens')
If I understand you correctly, you are looking for three separate graphs for ID=A, ID=B, ID=C. Here is how you could get that:
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('data.dat', sep='\t') # read your datafile, you might have a different name here
for i, (label, subset) in enumerate(data.groupby('ID')):
    plt.subplot(131+i)  # one panel per ID, side by side
    plt.plot(subset['date'], subset['Y1'])
    plt.plot(subset['date'], subset['Y2'], 'o')
    plt.title('ID: {}'.format(label))
plt.show()
Note that this treats your dates as integers (same as in the datafile).
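If the `date` column holds Excel serial day numbers (an assumption; values around 40000 to 43000 would then fall between 2010 and 2018), they can be converted to real dates before plotting. The sketch below uses `pd.to_datetime` with `unit="D"` and the origin `1899-12-30` commonly used for Excel serials; only three rows from the question are reproduced.

```python
import pandas as pd

# Three rows from the question's table.
data = pd.DataFrame({
    "ID": ["A", "A", "B"],
    "date": [40480, 41234, 40409],
    "Y1": [136, 173, 185],
    "Y2": [83, 23, 71],
})

# ASSUMPTION: 'date' is an Excel serial day number.
data["date"] = pd.to_datetime(data["date"], unit="D", origin="1899-12-30")
print(data)

# One sub-frame per ID, e.g. to draw and save one figure per ID.
for label, subset in data.groupby("ID"):
    print(label, subset["date"].min().date())
```

Once the column is a datetime, matplotlib formats the x-axis as dates; Y2 could then go on a secondary axis created with `ax.twinx()`, and each figure could be written out with `fig.savefig`.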

Scipy peak_widths returns TypeError: only integer scalar arrays can be converted to a scalar index

I am trying to find the x value at each maximum of a data set and the width of the corresponding peak. I have tried the code below. The first part correctly returns the peak x positions, but once I add the second section it fails with the error message:
TypeError: only integer scalar arrays can be converted to a scalar index
The code is below:
import matplotlib.pyplot as plt
import csv
from scipy.signal import find_peaks, peak_widths
import numpy
x = []
y = []
with open('data.csv','r') as csvfile:
    plots = csv.reader(csvfile, delimiter=',')
    for row in plots:
        x.append(float(row[0]))
        y.append(float(row[1]))
peaks = find_peaks(y, height=10000,) # set the height to remove background
peak_x = numpy.array(x)[peaks[0]]  # renamed from 'list' to avoid shadowing the built-in
print("Optimum values")
print(peak_x)
This next part fails:
peaks, _ = find_peaks(y)
results_half = peak_widths(y, peaks, rel_height=0.5)
print(results_half)
results_full = peak_widths(y, peaks, rel_height=1)
plt.plot(y)
plt.plot(peaks, y[peaks], "y")
plt.hlines(*results_half[1:], color="C2")
plt.hlines(*results_full[1:], color="C3")
plt.show()
I have read the scipy documentation but I think the issue is more fundamental than that. How can I make the second part work? I'd like it to return the peak widths and show on a plot which peaks it has picked.
Thanks
Example data:
-7 16
-6.879 14
-6.759 20
-6.638 31
-6.518 33
-6.397 28
-6.276 17
-6.156 17
-6.035 30
-5.915 50
-5.794 64
-5.673 77
-5.553 96
-5.432 113
-5.312 112
-5.191 113
-5.07 123
-4.95 151
-4.829 173
-4.709 207
-4.588 328
-4.467 590
-4.347 1246
-4.226 3142
-4.106 7729
-3.985 18015
-3.864 40097
-3.744 85164
-3.623 167993
-3.503 302845
-3.382 499848
-3.261 761264
-3.141 1063770
-3.02 1380165
-2.899 1644532
-2.779 1845908
-2.658 1931555
-2.538 1918458
-2.417 1788508
-2.296 1586322
-2.176 1346871
-2.055 1086383
-1.935 831396
-1.814 590559
-1.693 398865
-1.573 261396
-1.452 174992
-1.332 139774
-1.211 154694
-1.09 235406
-0.97 388021
-0.849 616041
-0.729 911892
-0.608 1248544
-0.487 1579659
-0.367 1859034
-0.246 2042431
-0.126 2120969
-0.005 2081017
0.116 1925716
0.236 1684327
0.357 1372293
0.477 1064307
0.598 766824
0.719 535333
0.839 346882
0.96 217215
1.08 125673
1.201 68861
1.322 35618
1.442 16286
1.563 7361
1.683 2572
1.804 1477
1.925 1072
2.045 977
2.166 968
2.286 1030
2.407 1173
2.528 1398
2.648 1586
2.769 1770
2.889 1859
3.01 1980
3.131 2041
3.251 2084
3.372 2069
3.492 2012
3.613 1937
3.734 1853
3.854 1787
3.975 1737
4.095 1643
4.216 1548
4.337 1399
4.457 1271
4.578 1143
4.698 1022
4.819 896
4.94 762
5.06 663
5.181 587
5.302 507
5.422 428
5.543 339
5.663 277
5.784 228
5.905 196
6.025 158
6.146 122
6.266 93
6.387 76
6.508 67
6.628 63
6.749 58
6.869 43
6.99 27
7.111 13
7.231 7
7.352 3
7.472 3
7.593 2
7.714 2
7.834 2
7.955 3
8.075 2
8.196 1
8.317 1
8.437 2
8.558 3
8.678 2
8.799 1
8.92 2
9.04 4
9.161 7
9.281 4
9.402 3
9.523 2
9.643 3
9.764 4
9.884 6
10.005 7
10.126 4
10.246 2
10.367 0
10.487 0
10.608 0
10.729 0
10.849 0
10.97 0
11.09 1
11.211 2
11.332 3
11.452 2
11.573 1
11.693 0
11.814 0
11.935 0
12.055 0
12.176 0
12.296 0
12.417 0
12.538 0
12.658 0
12.779 0
12.899 0
13.02 0
13.141 0
13.261 0
13.382 0
13.503 0
13.623 0
13.744 0
13.864 0
13.985 0
14.106 0
14.226 0
14.347 0
14.467 0
14.588 0
14.709 0
14.829 0
14.95 0
15.07 0
15.191 0
15.312 0
15.432 0
15.553 0
15.673 0
15.794 0
15.915 0
16.035 0
16.156 0
16.276 0
16.397 1
16.518 2
16.638 3
16.759 2
16.879 2
17 4
I think your problem is that y is a list, not a NumPy array.
The indexing operation y[peaks] only works if y is a NumPy array (peaks already is one, since find_peaks returns an array).
So you should convert y before indexing, e.g. as follows:
import numpy as np
y_arr = np.array(y)
plt.plot(y_arr)
plt.plot(peaks, y_arr[peaks], 'o', color="y")
plt.hlines(*results_half[1:], color="C2")
plt.hlines(*results_full[1:], color="C3")
plt.show()
This yields the following plot:
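A self-contained sketch of the same fix on synthetic data (two Gaussian bumps standing in for the measured curve; the spacing and peak positions are made up). It also converts the widths, which `peak_widths` reports in samples, to x units, assuming evenly spaced x values.

```python
import numpy as np
from scipy.signal import find_peaks, peak_widths

# Synthetic stand-in for the measured data: two Gaussian peaks.
x = np.linspace(-5, 5, 201)
y = np.exp(-(x + 2) ** 2) + np.exp(-(x - 2) ** 2)  # already a NumPy array

# Indices of the maxima; the height threshold removes the background.
peaks, _ = find_peaks(y, height=0.5)

# Widths at half prominence, measured in samples.
widths, width_heights, left_ips, right_ips = peak_widths(y, peaks, rel_height=0.5)

dx = x[1] - x[0]        # sample spacing (assumed uniform)
peak_x = x[peaks]       # array indexing works because x is an array
widths_x = widths * dx  # widths converted to x units

print(peak_x, widths_x)
```

The measured x values in the question are only approximately evenly spaced (steps of about 0.12), so the conversion with a single `dx` would be an approximation there.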

Python pdist: Setting an array element with a sequence

I have written the following code
arr_coord = []
for chains in structure:
    for chain in chains:
        for residue in chain:
            for atom in residue:
                x = atom.get_coord()
                arr_coord.append({'X': [x[0]],'Y':[x[1]],'Z':[x[2]]})
coord_table = pd.DataFrame(arr_coord)
print(coord_table)
To generate the following dataframe
X Y Z
0 [-5.43] [28.077] [-0.842]
1 [-3.183] [26.472] [1.741]
2 [-2.574] [22.752] [1.69]
3 [-1.743] [21.321] [5.121]
4 [0.413] [18.212] [5.392]
5 [0.714] [15.803] [8.332]
6 [4.078] [15.689] [10.138]
7 [5.192] [12.2] [9.065]
8 [4.088] [12.79] [5.475]
9 [5.875] [16.117] [4.945]
10 [8.514] [15.909] [2.22]
11 [12.235] [15.85] [2.943]
12 [13.079] [16.427] [-0.719]
13 [10.832] [19.066] [-2.324]
14 [12.327] [22.569] [-2.163]
15 [8.976] [24.342] [-1.742]
16 [7.689] [25.565] [1.689]
17 [5.174] [23.336] [3.467]
18 [2.339] [24.135] [5.889]
19 [0.9] [22.203] [8.827]
20 [-1.217] [22.065] [11.975]
21 [0.334] [20.465] [15.09]
22 [0.0] [20.066] [18.885]
23 [2.738] [21.762] [20.915]
24 [4.087] [19.615] [23.742]
25 [7.186] [21.618] [24.704]
26 [8.867] [24.914] [23.91]
27 [11.679] [27.173] [24.946]
28 [10.76] [30.763] [25.731]
29 [11.517] [33.056] [22.764]
.. ... ... ...
431 [8.093] [34.654] [68.474]
432 [7.171] [32.741] [65.298]
433 [5.088] [35.626] [63.932]
434 [7.859] [38.22] [64.329]
435 [10.623] [35.908] [63.1]
436 [12.253] [36.776] [59.767]
437 [10.65] [35.048] [56.795]
438 [7.459] [34.084] [58.628]
439 [4.399] [35.164] [56.713]
440 [0.694] [35.273] [57.347]
441 [-1.906] [34.388] [54.667]
442 [-5.139] [35.863] [55.987]
443 [-8.663] [36.808] [55.097]
444 [-9.629] [40.233] [56.493]
445 [-12.886] [42.15] [56.888]
446 [-12.969] [45.937] [56.576]
447 [-14.759] [47.638] [59.485]
448 [-14.836] [51.367] [60.099]
449 [-11.607] [51.863] [58.176]
450 [-9.836] [48.934] [59.829]
451 [-8.95] [45.445] [58.689]
452 [-9.824] [42.599] [61.073]
453 [-8.559] [39.047] [60.598]
454 [-11.201] [36.341] [60.195]
455 [-11.561] [32.71] [59.077]
456 [-7.786] [32.216] [59.387]
457 [-5.785] [29.886] [61.675]
458 [-2.143] [29.222] [62.469]
459 [-0.946] [25.828] [61.248]
460 [2.239] [25.804] [63.373]
[461 rows x 3 columns]
What I intend to do is to create a Euclidean distance matrix using these X, Y, and Z values. I tried to do this using the pdist function
dist = pdist(coord_table, metric = 'euclidean')
distance_matrix = squareform(dist)
print(distance_matrix)
However, the interpreter gives the following error
ValueError: setting an array element with a sequence.
I am not sure how to interpret this error or how to fix it.
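Each cell of your DataFrame holds a one-element list, so pdist cannot convert the table to a numeric array. Change your loop to store plain scalars instead:
arr_coord = []
for chains in structure:
    for chain in chains:
        for residue in chain:
            for atom in residue:
                x = atom.get_coord()
                arr_coord.append({'X': x[0], 'Y': x[1], 'Z': x[2]})  # scalars, not one-element lists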
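For reference, a minimal sketch of the full pipeline with scalar coordinates. The three points are made up so that the distances (5, 12 and 13) are easy to check by hand, and the explicit `float` conversion is an extra safeguard rather than something the original code had.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Made-up coordinates; in the question these come from atom.get_coord().
arr_coord = [
    {"X": 0.0, "Y": 0.0, "Z": 0.0},
    {"X": 3.0, "Y": 4.0, "Z": 0.0},
    {"X": 0.0, "Y": 0.0, "Z": 12.0},
]
coord_table = pd.DataFrame(arr_coord)

# A purely numeric N x 3 array is what pdist expects.
coords = coord_table[["X", "Y", "Z"]].to_numpy(dtype=float)

dist = pdist(coords, metric="euclidean")  # condensed pairwise distances
distance_matrix = squareform(dist)        # symmetric N x N matrix
print(distance_matrix)
```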
