I have this numpy array:
[
[ 0 0 0 0 0 0 2 0 2 0 0 1 26 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 4]
[21477 61607 21999 17913 22470 32390 11987 41977 81676 20668 17997 15278 46281 19884]
[ 5059 13248 5498 3866 2144 6161 2361 8734 16914 3724 4614 3607 11305 2880]
[ 282 1580 324 595 218 525 150 942 187 232 430 343 524 189]
[ 1317 6416 1559 882 599 2520 525 2560 19197 729 1391 1727 2044 1198]
]
I've just created a logarithmic heatmap, which works as intended. However, I would now like a second heatmap on a linear scale per row: each position in the matrix should show its percentage of the row total, so that each row sums to 100%. Without using seaborn or pandas.
Something like this:
Here you go:
import matplotlib.pyplot as plt
import numpy as np
a = np.array([[0,0,0,0,0,0,2,0,2,0,0,1,26,0],
[0,0,0,0,0,0,0,0,0,0,0,0,0,4],
[21477,61607,21999,17913,22470,32390,11987,41977,81676,20668,17997,15278,46281,19884],
[5059,13248,5498,3866,2144,6161,2361,8734,16914,3724,4614,3607,11305,2880],
[282,1580,324,595,218,525,150,942,187,232,430,343,524,189],
[1317,6416,1559,882,599,2520,525,2560,19197,729,1391,1727,2044,1198]])
# normalize each row so it sums to 100 (percentage of the row total)
normalized_a = 100 * a / np.sum(a, axis=1)[:, None]
# plot on a linear color scale
plt.imshow(normalized_a)
plt.colorbar()
plt.show()
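One caveat with this approach: a row that sums to zero would trigger a divide-by-zero. If your real data can contain all-zero rows, `np.divide` with a `where` mask keeps them at 0% instead; a minimal sketch (the toy array is illustrative, not the data from the question):

```python
import numpy as np

# toy data: first row is all zeros, second sums to 8
a = np.array([[0.0, 0.0, 0.0],
              [2.0, 2.0, 4.0]])

row_sums = a.sum(axis=1, keepdims=True)
# divide only where the row sum is nonzero; all-zero rows stay at 0%
percent = np.divide(100 * a, row_sums,
                    out=np.zeros_like(a),
                    where=row_sums != 0)
```

The resulting `percent` array can be passed to `plt.imshow` exactly as above.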
I am trying to find the x value at each maximum of a data set, and the width of each peak. I have tried the code below; the first part correctly returns the peak x positions, but once I add the second section it fails with the error message:
TypeError: only integer scalar arrays can be converted to a scalar index
The code is below:
import matplotlib.pyplot as plt
import csv
from scipy.signal import find_peaks, peak_widths
import numpy

x = []
y = []
with open('data.csv', 'r') as csvfile:
    plots = csv.reader(csvfile, delimiter=',')
    for row in plots:
        x.append(float(row[0]))
        y.append(float(row[1]))

peaks = find_peaks(y, height=10000)  # set the height to remove background
peak_x = numpy.array(x)[peaks[0]]    # avoid shadowing the built-in name `list`
print("Optimum values")
print(peak_x)
This next part fails:
peaks, _ = find_peaks(y)
results_half = peak_widths(y, peaks, rel_height=0.5)
print(results_half)
results_full = peak_widths(y, peaks, rel_height=1)
plt.plot(y)
plt.plot(peaks, y[peaks], "y")
plt.hlines(*results_half[1:], color="C2")
plt.hlines(*results_full[1:], color="C3")
plt.show()
I have read the scipy documentation but I think the issue is more fundamental than that. How can I make the second part work? I'd like it to return the peak widths and show on a plot which peaks it has picked.
Thanks
Example data:
-7 16
-6.879 14
-6.759 20
-6.638 31
-6.518 33
-6.397 28
-6.276 17
-6.156 17
-6.035 30
-5.915 50
-5.794 64
-5.673 77
-5.553 96
-5.432 113
-5.312 112
-5.191 113
-5.07 123
-4.95 151
-4.829 173
-4.709 207
-4.588 328
-4.467 590
-4.347 1246
-4.226 3142
-4.106 7729
-3.985 18015
-3.864 40097
-3.744 85164
-3.623 167993
-3.503 302845
-3.382 499848
-3.261 761264
-3.141 1063770
-3.02 1380165
-2.899 1644532
-2.779 1845908
-2.658 1931555
-2.538 1918458
-2.417 1788508
-2.296 1586322
-2.176 1346871
-2.055 1086383
-1.935 831396
-1.814 590559
-1.693 398865
-1.573 261396
-1.452 174992
-1.332 139774
-1.211 154694
-1.09 235406
-0.97 388021
-0.849 616041
-0.729 911892
-0.608 1248544
-0.487 1579659
-0.367 1859034
-0.246 2042431
-0.126 2120969
-0.005 2081017
0.116 1925716
0.236 1684327
0.357 1372293
0.477 1064307
0.598 766824
0.719 535333
0.839 346882
0.96 217215
1.08 125673
1.201 68861
1.322 35618
1.442 16286
1.563 7361
1.683 2572
1.804 1477
1.925 1072
2.045 977
2.166 968
2.286 1030
2.407 1173
2.528 1398
2.648 1586
2.769 1770
2.889 1859
3.01 1980
3.131 2041
3.251 2084
3.372 2069
3.492 2012
3.613 1937
3.734 1853
3.854 1787
3.975 1737
4.095 1643
4.216 1548
4.337 1399
4.457 1271
4.578 1143
4.698 1022
4.819 896
4.94 762
5.06 663
5.181 587
5.302 507
5.422 428
5.543 339
5.663 277
5.784 228
5.905 196
6.025 158
6.146 122
6.266 93
6.387 76
6.508 67
6.628 63
6.749 58
6.869 43
6.99 27
7.111 13
7.231 7
7.352 3
7.472 3
7.593 2
7.714 2
7.834 2
7.955 3
8.075 2
8.196 1
8.317 1
8.437 2
8.558 3
8.678 2
8.799 1
8.92 2
9.04 4
9.161 7
9.281 4
9.402 3
9.523 2
9.643 3
9.764 4
9.884 6
10.005 7
10.126 4
10.246 2
10.367 0
10.487 0
10.608 0
10.729 0
10.849 0
10.97 0
11.09 1
11.211 2
11.332 3
11.452 2
11.573 1
11.693 0
11.814 0
11.935 0
12.055 0
12.176 0
12.296 0
12.417 0
12.538 0
12.658 0
12.779 0
12.899 0
13.02 0
13.141 0
13.261 0
13.382 0
13.503 0
13.623 0
13.744 0
13.864 0
13.985 0
14.106 0
14.226 0
14.347 0
14.467 0
14.588 0
14.709 0
14.829 0
14.95 0
15.07 0
15.191 0
15.312 0
15.432 0
15.553 0
15.673 0
15.794 0
15.915 0
16.035 0
16.156 0
16.276 0
16.397 1
16.518 2
16.638 3
16.759 2
16.879 2
17 4
I think your problem is that y is a list, not a numpy array.
The slicing operation y[peaks] will only work if both y and peaks are numpy arrays.
So you should convert y before doing the slicing, e.g. as follows
import numpy as np

y_arr = np.array(y)
plt.plot(y_arr)
plt.plot(peaks, y_arr[peaks], 'o', color="y")
plt.hlines(*results_half[1:], color="C2")
plt.hlines(*results_full[1:], color="C3")
plt.show()
This yields the following plot:
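One more detail worth knowing: `peak_widths` reports widths (and the hlines positions) in sample indices, not in your x units. With an evenly spaced x axis you can rescale by the grid step; a small self-contained sketch with a synthetic Gaussian peak (the data here is made up, not the CSV from the question):

```python
import numpy as np
from scipy.signal import find_peaks, peak_widths

# synthetic single Gaussian peak on an evenly spaced x grid
x = np.linspace(-7, 17, 200)
y = 1e6 * np.exp(-((x + 2.6) / 0.8) ** 2)

peaks, _ = find_peaks(y, height=10000)
widths, h, left, right = peak_widths(y, peaks, rel_height=0.5)

# widths are measured in samples; convert to x units via the grid spacing
dx = x[1] - x[0]
widths_x = widths * dx
```

The interpolated `left`/`right` positions are in sample coordinates too, so convert them the same way (`x[0] + left * dx`) before plotting against x.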
I have written the following code
import pandas as pd

arr_coord = []
for chains in structure:
    for chain in chains:
        for residue in chain:
            for atom in residue:
                x = atom.get_coord()
                arr_coord.append({'X': [x[0]], 'Y': [x[1]], 'Z': [x[2]]})

coord_table = pd.DataFrame(arr_coord)
print(coord_table)
To generate the following dataframe
X Y Z
0 [-5.43] [28.077] [-0.842]
1 [-3.183] [26.472] [1.741]
2 [-2.574] [22.752] [1.69]
3 [-1.743] [21.321] [5.121]
4 [0.413] [18.212] [5.392]
5 [0.714] [15.803] [8.332]
6 [4.078] [15.689] [10.138]
7 [5.192] [12.2] [9.065]
8 [4.088] [12.79] [5.475]
9 [5.875] [16.117] [4.945]
10 [8.514] [15.909] [2.22]
11 [12.235] [15.85] [2.943]
12 [13.079] [16.427] [-0.719]
13 [10.832] [19.066] [-2.324]
14 [12.327] [22.569] [-2.163]
15 [8.976] [24.342] [-1.742]
16 [7.689] [25.565] [1.689]
17 [5.174] [23.336] [3.467]
18 [2.339] [24.135] [5.889]
19 [0.9] [22.203] [8.827]
20 [-1.217] [22.065] [11.975]
21 [0.334] [20.465] [15.09]
22 [0.0] [20.066] [18.885]
23 [2.738] [21.762] [20.915]
24 [4.087] [19.615] [23.742]
25 [7.186] [21.618] [24.704]
26 [8.867] [24.914] [23.91]
27 [11.679] [27.173] [24.946]
28 [10.76] [30.763] [25.731]
29 [11.517] [33.056] [22.764]
.. ... ... ...
431 [8.093] [34.654] [68.474]
432 [7.171] [32.741] [65.298]
433 [5.088] [35.626] [63.932]
434 [7.859] [38.22] [64.329]
435 [10.623] [35.908] [63.1]
436 [12.253] [36.776] [59.767]
437 [10.65] [35.048] [56.795]
438 [7.459] [34.084] [58.628]
439 [4.399] [35.164] [56.713]
440 [0.694] [35.273] [57.347]
441 [-1.906] [34.388] [54.667]
442 [-5.139] [35.863] [55.987]
443 [-8.663] [36.808] [55.097]
444 [-9.629] [40.233] [56.493]
445 [-12.886] [42.15] [56.888]
446 [-12.969] [45.937] [56.576]
447 [-14.759] [47.638] [59.485]
448 [-14.836] [51.367] [60.099]
449 [-11.607] [51.863] [58.176]
450 [-9.836] [48.934] [59.829]
451 [-8.95] [45.445] [58.689]
452 [-9.824] [42.599] [61.073]
453 [-8.559] [39.047] [60.598]
454 [-11.201] [36.341] [60.195]
455 [-11.561] [32.71] [59.077]
456 [-7.786] [32.216] [59.387]
457 [-5.785] [29.886] [61.675]
458 [-2.143] [29.222] [62.469]
459 [-0.946] [25.828] [61.248]
460 [2.239] [25.804] [63.373]
[461 rows x 3 columns]
What I intend to do is to create a Euclidean distance matrix using these X, Y, and Z values. I tried to do this using the pdist function
dist = pdist(coord_table, metric = 'euclidean')
distance_matrix = squareform(dist)
print(distance_matrix)
However, the interpreter gives the following error
ValueError: setting an array element with a sequence.
I am not sure how to interpret this error or how to fix it.
Change your loop so each cell holds a scalar rather than a one-element list:
arr_coord = []
for chains in structure:
    for chain in chains:
        for residue in chain:
            for atom in residue:
                x = atom.get_coord()
                arr_coord.append({'X': x[0], 'Y': x[1], 'Z': x[2]})  # scalars, no lists needed
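With scalar cells, the pdist/squareform calls from the question work directly; a minimal sketch with made-up coordinates standing in for the atom positions:

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# three illustrative points forming a 3-4-5 triangle
coord_table = pd.DataFrame({'X': [0.0, 3.0, 0.0],
                            'Y': [0.0, 0.0, 4.0],
                            'Z': [0.0, 0.0, 0.0]})

dist = pdist(coord_table, metric='euclidean')   # condensed pairwise distances
distance_matrix = squareform(dist)              # full symmetric matrix
```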
I have a DataFrame (df) like the one shown below, where each column is sorted from largest to smallest for frequency analysis. That leaves some values as either zero or NaN, since each column has a different length.
08FB006 08FC001 08FC003 08FC005 08GD004
----------------------------------------------
0 253 872 256 11.80 2660
1 250 850 255 10.60 2510
2 246 850 241 10.30 2130
3 241 827 235 9.32 1970
4 241 821 229 9.17 1900
5 232 0 228 8.93 1840
6 231 0 225 8.05 1710
7 0 0 225 0 1610
8 0 0 224 0 1590
9 0 0 0 0 1590
10 0 0 0 0 1550
I need to perform the following calculation as if each column had a different number of records (i.e. ignoring the zero values). I have tried using NaN instead, but for some reason operations on NaN values are not possible.
Here is what I am trying to do with my df columns :
shape_list1 = []
location_list1 = []
scale_list1 = []
for column in df.columns:
    shape1, location1, scale1 = stats.genpareto.fit(df[column])
    shape_list1.append(shape1)
    location_list1.append(location1)
    scale_list1.append(scale1)
Assuming all values are positive (as your example and description suggest), try:
stats.genpareto.fit(df[df[column] > 0][column])
This filters every column to operate just on the positive values.
Or, if negative values are allowed,
stats.genpareto.fit(df[df[column] != 0][column])
The syntax is messy, but change
shape1, location1, scale1=stats.genpareto.fit(df[column])
to
shape1, location1, scale1 = stats.genpareto.fit(df[column][df[column].to_numpy().nonzero()[0]])
Explanation: df[column].to_numpy().nonzero() returns a tuple of size (1,) whose only element, element [0], is a numpy array holding the positions where the column is nonzero. With the default RangeIndex those positions match the index labels, so df[column][...] selects exactly the nonzero rows. (Older pandas offered Series.nonzero() directly, but it has since been removed.)
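An end-to-end sketch of the filtering approach, with a toy column padded by zeros standing in for the real gauge data (the column name follows the question's example):

```python
import pandas as pd
from scipy import stats

df = pd.DataFrame({'08FB006': [253, 250, 246, 241, 241, 232, 231, 0, 0, 0, 0]})

params = {}
for column in df.columns:
    positive = df[df[column] > 0][column]        # drop the zero padding
    params[column] = stats.genpareto.fit(positive)
```

Each entry in `params` is the usual `(shape, loc, scale)` tuple, fitted only on the nonzero records.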
I have a csv file that I read with pandas. I want to split the dataframe into chunks based on the values of a specified column:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
list_of_classes=[]
# Reading file
fileName = 'Training.csv'
df = pd.read_csv(fileName)
classID = df.iloc[:,-2]
len(classID)
df.iloc[0,-2]
for i in range(len(classID)):
    print(classID[i])
    if classID[i] not in list_of_classes:
        list_of_classes.append(classID[i])

for i in range(len(df)):
    ...............................
UPDATE
Say the dataframe looks like:
........................................
Feature0 Feature1 Feature2 Feature3 ......... classID lastColum
190 565 35474 0.336283 2.973684 255 0
311 984 113199 0.316057 3.163987 155 0
310 984 94197 0.315041 3.174194 1005 0
280 984 116359 0.284553 3.514286 255 18
249 984 107482 0.253049 3.951807 1005 0
283 984 132343 0.287602 3.477032 155 0
213 984 88244 0.216463 4.619718 255 0
839 984 203139 0.852642 1.172825 255 0
376 984 105133 0.382114 2.617021 1005 0
324 984 129209 0.329268 3.037037 1005 0
In this example the result I'm aiming for is 3 dataframes, each containing only one classID: 155, 1005, or 255.
My question is: is there a cleaner way to do this?
Split into 3 separate CSV files:
df.groupby('classID') \
.apply(lambda x: x.to_csv(r'c:/temp/{}.csv'.format(x.name), index=False))
Generate a dictionary of split DataFrames:
In [210]: dfs = {g:x for g,x in df.groupby('classID')}
In [211]: dfs.keys()
Out[211]: dict_keys([155, 255, 1005])
In [212]: dfs[155]
Out[212]:
Feature0 Feature1 Feature2 Feature3 classID lastColum
1 311 984 113199 0.316057 155 0
5 283 984 132343 0.287602 155 0
In [213]: dfs[255]
Out[213]:
Feature0 Feature1 Feature2 Feature3 classID lastColum
0 190 565 35474 0.336283 255 0
3 280 984 116359 0.284553 255 18
6 213 984 88244 0.216463 255 0
7 839 984 203139 0.852642 255 0
In [214]: dfs[1005]
Out[214]:
Feature0 Feature1 Feature2 Feature3 classID lastColum
2 310 984 94197 0.315041 1005 0
4 249 984 107482 0.253049 1005 0
8 376 984 105133 0.382114 1005 0
9 324 984 129209 0.329268 1005 0
Here is an example of how you can do it:
import pandas as pd

df = pd.DataFrame({'A': list('abcdef'), 'part': [1, 1, 1, 2, 2, 2]})
parts = df.part.unique()
for part in parts:
    print(df.loc[df.part == part])
So the point is that you take all the unique parts by calling unique() on the series you want to split on.
After that, you can iterate over those parts and do whatever you need with each sub-frame.
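Both answers can be checked on a tiny frame; a minimal sketch using the question's classID column name (the feature values are made up):

```python
import pandas as pd

df = pd.DataFrame({'Feature0': [190, 311, 310, 280],
                   'classID': [255, 155, 1005, 255]})

# one sub-frame per distinct classID value
dfs = {g: x for g, x in df.groupby('classID')}
```

Each `dfs[k]` keeps the original row labels, so rows can be traced back to the source frame.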
I ran into a problem formatting a pivot table created with pandas.
I made a matrix table between 2 columns (A, B) from my source data using pandas.pivot_table, with A as the columns and B as the index.
>> df = PD.read_excel("data.xls")
>> table = PD.pivot_table(df, index=["B"],
                          values='Count', columns=["A"], aggfunc=[NUM.sum],
                          fill_value=0, margins=True, dropna=True)
>> table
It returns as:
sum
A 1 2 3 All
B
1 23 52 0 75
2 16 35 12 65
3 56 0 0 56
All 95 87 12 196
And I hope to have a format like this:
A All_B
1 2 3
1 23 52 0 75
B 2 16 35 12 65
3 56 0 0 56
All_A 95 87 12 196
How can I do this? Thanks very much in advance.
The table returned by pd.pivot_table is very convenient to work with (it has a single-level index and columns) and normally does not require further format manipulation. But if you insist on the format shown in the post, you need to construct a multi-level index and columns with pd.MultiIndex. Here is an example of how to do it.
Before manipulation,
import pandas as pd
import numpy as np
np.random.seed(0)
a = np.random.randint(1, 4, 100)
b = np.random.randint(1, 4, 100)
df = pd.DataFrame(dict(A=a,B=b,Val=np.random.randint(1,100,100)))
table = pd.pivot_table(df, index='A', columns='B', values='Val', aggfunc=sum, fill_value=0, margins=True)
print(table)
B 1 2 3 All
A
1 454 649 770 1873
2 628 576 467 1671
3 376 247 481 1104
All 1458 1472 1718 4648
After:
multi_level_column = pd.MultiIndex.from_arrays([['A', 'A', 'A', 'All_B'], [1,2,3,'']])
multi_level_index = pd.MultiIndex.from_arrays([['B', 'B', 'B', 'All_A'], [1,2,3,'']])
table.index = multi_level_index
table.columns = multi_level_column
print(table)
A All_B
1 2 3
B 1 454 649 770 1873
2 628 576 467 1671
3 376 247 481 1104
All_A 1458 1472 1718 4648
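As a side note, if all you need is a different label for the totals (rather than the full two-level header), pivot_table's margins_name parameter renames the 'All' row and column directly. A sketch with the same synthetic data; note it applies one name to both axes, so it cannot produce the distinct All_A/All_B labels from the post:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'A': np.random.randint(1, 4, 100),
                   'B': np.random.randint(1, 4, 100),
                   'Val': np.random.randint(1, 100, 100)})

# margins_name relabels the totals row and column in one go
table = pd.pivot_table(df, index='A', columns='B', values='Val',
                       aggfunc='sum', fill_value=0,
                       margins=True, margins_name='Total')
```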