Python pdist: Setting an array element with a sequence
I have written the following code:

arr_coord = []
for chains in structure:
    for chain in chains:
        for residue in chain:
            for atom in residue:
                x = atom.get_coord()
                arr_coord.append({'X': [x[0]], 'Y': [x[1]], 'Z': [x[2]]})
coord_table = pd.DataFrame(arr_coord)
print(coord_table)
To generate the following dataframe:
X Y Z
0 [-5.43] [28.077] [-0.842]
1 [-3.183] [26.472] [1.741]
2 [-2.574] [22.752] [1.69]
3 [-1.743] [21.321] [5.121]
4 [0.413] [18.212] [5.392]
5 [0.714] [15.803] [8.332]
6 [4.078] [15.689] [10.138]
7 [5.192] [12.2] [9.065]
8 [4.088] [12.79] [5.475]
9 [5.875] [16.117] [4.945]
10 [8.514] [15.909] [2.22]
11 [12.235] [15.85] [2.943]
12 [13.079] [16.427] [-0.719]
13 [10.832] [19.066] [-2.324]
14 [12.327] [22.569] [-2.163]
15 [8.976] [24.342] [-1.742]
16 [7.689] [25.565] [1.689]
17 [5.174] [23.336] [3.467]
18 [2.339] [24.135] [5.889]
19 [0.9] [22.203] [8.827]
20 [-1.217] [22.065] [11.975]
21 [0.334] [20.465] [15.09]
22 [0.0] [20.066] [18.885]
23 [2.738] [21.762] [20.915]
24 [4.087] [19.615] [23.742]
25 [7.186] [21.618] [24.704]
26 [8.867] [24.914] [23.91]
27 [11.679] [27.173] [24.946]
28 [10.76] [30.763] [25.731]
29 [11.517] [33.056] [22.764]
.. ... ... ...
431 [8.093] [34.654] [68.474]
432 [7.171] [32.741] [65.298]
433 [5.088] [35.626] [63.932]
434 [7.859] [38.22] [64.329]
435 [10.623] [35.908] [63.1]
436 [12.253] [36.776] [59.767]
437 [10.65] [35.048] [56.795]
438 [7.459] [34.084] [58.628]
439 [4.399] [35.164] [56.713]
440 [0.694] [35.273] [57.347]
441 [-1.906] [34.388] [54.667]
442 [-5.139] [35.863] [55.987]
443 [-8.663] [36.808] [55.097]
444 [-9.629] [40.233] [56.493]
445 [-12.886] [42.15] [56.888]
446 [-12.969] [45.937] [56.576]
447 [-14.759] [47.638] [59.485]
448 [-14.836] [51.367] [60.099]
449 [-11.607] [51.863] [58.176]
450 [-9.836] [48.934] [59.829]
451 [-8.95] [45.445] [58.689]
452 [-9.824] [42.599] [61.073]
453 [-8.559] [39.047] [60.598]
454 [-11.201] [36.341] [60.195]
455 [-11.561] [32.71] [59.077]
456 [-7.786] [32.216] [59.387]
457 [-5.785] [29.886] [61.675]
458 [-2.143] [29.222] [62.469]
459 [-0.946] [25.828] [61.248]
460 [2.239] [25.804] [63.373]
[461 rows x 3 columns]
What I intend to do is create a Euclidean distance matrix from these X, Y, and Z values. I tried to do this using SciPy's pdist function:
from scipy.spatial.distance import pdist, squareform

dist = pdist(coord_table, metric='euclidean')
distance_matrix = squareform(dist)
print(distance_matrix)
However, the interpreter gives the following error:
ValueError: setting an array element with a sequence.
I am not sure how to interpret this error or how to fix it.
Change your loop so that each cell holds a plain scalar instead of a one-element list:

arr_coord = []
for chains in structure:
    for chain in chains:
        for residue in chain:
            for atom in residue:
                x = atom.get_coord()
                arr_coord.append({'X': x[0], 'Y': x[1], 'Z': x[2]})  # plain scalars, no need for one-element lists
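For context on why this fixes the ValueError: each cell of the original coord_table holds a one-element list, so NumPy cannot coerce the frame into the 2-D float array that pdist expects. If the list-valued table has already been built, a minimal sketch of unwrapping it instead of re-running the loop (this assumes every cell is a one-element list, as in the printed output above):

from scipy.spatial.distance import pdist, squareform

# Unwrap each one-element list into a plain float, element by element.
coords = coord_table.applymap(lambda v: v[0]).astype(float)

dist = pdist(coords.to_numpy(), metric='euclidean')
distance_matrix = squareform(dist)  # symmetric N x N matrix of pairwise distances
print(distance_matrix.shape)        # (461, 461) for the data above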
Related
text file rows into CSV column python
I have a text file containing data like this:

A 34 45 7789 3475768 443 67 8999 3343 656 8876 802 383358 873 36789 2374859 485994 86960 32838459 3484549 24549 58423
T 3445 574649 68078 59348604 45959 64585304 56568 595 49686 656564 55446 665 677 778 433 545 333 65665 3535

and so on. I want to make a CSV file from this text file, with A and T as column headings and the numbers underneath:

A        T
34       3445
45       574649
7789     68078
3475768  59348604
443      45959
EDIT (a lot simpler solution, inspired by Michael Butscher's comment):

import pandas as pd

df = pd.read_csv("filename.txt", delimiter=" ")
df.T.to_csv("filename.csv", header=False)

Here is the original code:

import pandas as pd

# Read file
with open("filename.txt", "r") as f:
    data = f.read()

# Split data by lines and remove empty lines
columns = data.split("\n")
columns = [x.split() for x in columns if x != ""]

# Row sizes are different in your example, so find the max number of rows
column_lengths = [len(x) for x in columns]
max_col_length = max(column_lengths)

data = {}
for i in columns:
    # Pad with None at the end for columns that have fewer values
    if len(i) < max_col_length:
        i += [None] * (max_col_length - len(i))
    data[i[0]] = i[1:]

# Create dataframe
df = pd.DataFrame(data)

# Create csv
df.to_csv("filename.csv", index=False)

Output should look like this:

    A         T
0   34        3445
1   45        574649
2   7789      68078
3   3475768   59348604
4   443       45959
5   67        64585304
6   8999      56568
7   3343      595
8   656       49686
9   8876      656564
10  802       55446
11  383358    665
12  873       677
13  36789     778
14  2374859   433
15  485994    545
16  86960     333
17  32838459  65665
18  3484549   3535
19  24549     None
20  58423     None
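As an aside, the manual None-padding in the loop above can also be delegated to itertools.zip_longest, which transposes the rows and pads the short ones in one step. A sketch under the same assumptions (whitespace-separated rows whose first field is the column name; the filenames are placeholders):

from itertools import zip_longest

import pandas as pd

# Read non-empty lines and split on whitespace.
with open("filename.txt") as f:
    rows = [line.split() for line in f if line.strip()]

headers = [row[0] for row in rows]
# zip_longest transposes the value lists and fills missing spots with None.
values = list(zip_longest(*(row[1:] for row in rows), fillvalue=None))

df = pd.DataFrame(values, columns=headers)
df.to_csv("filename.csv", index=False)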
Here is my code:

import pandas as pd

data = pd.read_csv("text (3).txt", header=None)
Our_Data = pd.DataFrame(data)
for rows in Our_Data:
    New_Data = pd.DataFrame(Our_Data[rows].str.split(' ').tolist()).T
    New_Data.columns = New_Data.iloc[0]
    New_Data = New_Data[1:]
New_Data.to_csv("filename.csv", index=False)

The output:

    A         T
1   34        3445
2   45        574649
3   7789      68078
4   3475768   59348604
5   443       45959
6   67        64585304
7   8999      56568
8   3343      595
9   656       49686
10  8876      656564
11  802       55446
12  383358    665
13  873       677
14  36789     778
15  2374859   433
16  485994    545
17  86960     333
18  32838459  65665
19  3484549   3535
20  24549     None
21  58423     None
Python Data Cleaning with Pandas
My dataset looks like this:

Name  Subset         Value
A     67-A-5678      14
A     58-ABC-87555   187
A     45-ASH-87954   5465
S     34-A-8785      454
S     58-ASO-98978   54
S     23-ASH-87895   784
X     98-X-87876     455
X     87-ABC-54578   4545
X     56-ASH-89667   854
Y     09-D-98644     45
Y     87-ABC-78834   98
Y     87-ASH-87455A  4566
L     67-A-87545     78
L     89-GHS-08753   12
L     78-PHU-09876   655

I want to keep only those groups of rows whose "Subset" column follows the pattern *, *ABC, *ASH (note: * is any letter or digit). For example, the output should look like:

Name  Subset         Value
A     67-A-5678      14
A     58-ABC-87555   187
A     45-ASH-87954   5465
X     98-X-87876     455
X     87-ABC-54578   4545
X     56-ASH-89667   854
Y     09-D-98644     45
Y     87-ABC-78834   98
Y     87-ASH-87455A  4566

P.S. The actual dataset can have many more columns/rows.
Try this:

filtered = df[df.groupby('Name')['Subset'].transform(
    lambda x: len(x) >= 3 and '-ABC-' in x.iloc[1] and '-ASH-' in x.iloc[2]
)]

Output:

>>> filtered
   Name         Subset  Value
0     A      67-A-5678     14
1     A   58-ABC-87555    187
2     A   45-ASH-87954   5465
6     X     98-X-87876    455
7     X   87-ABC-54578   4545
8     X   56-ASH-89667    854
9     Y     09-D-98644     45
10    Y   87-ABC-78834     98
11    Y  87-ASH-87455A   4566
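As a follow-up, the same rule can be expressed with GroupBy.filter, which keeps or drops whole groups and makes the middle-token check explicit. This is a sketch, not part of the original answer, and it assumes each group lists its rows in the *, *ABC, *ASH order shown in the sample data:

import pandas as pd

def keeps_group(subsets: pd.Series) -> bool:
    # Middle token of each Subset, e.g. '58-ABC-87555' -> 'ABC'.
    mids = subsets.str.split('-').str[1]
    return (len(mids) >= 3
            and len(mids.iloc[0]) == 1      # *: a single letter or digit
            and mids.iloc[1] == 'ABC'
            and mids.iloc[2] == 'ASH')

filtered = df.groupby('Name').filter(lambda g: keeps_group(g['Subset']))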
Plot moving average with data [duplicate]
This question already has answers here: Moving Average Pandas (4 answers). Closed 2 years ago.

I am trying to calculate and plot a moving average along with the data it is calculated from:

def movingAvg(df):
    window_size = 7
    i = 0
    moving_averages = []
    while i < len(df) - window_size + 1:
        current_window = df[i : i + window_size]
        window_average = current_window.mean()
        moving_averages.append(window_average)
        i += 1
    return moving_averages

dates = df_valid['dateTime']
startDay = dates.iloc[0]
lastDay = dates.iloc[-1]

fig, ax = plt.subplots(figsize=(20, 10))
ax.autoscale()
#plt.xlim(startDay, lastDay)

df_valid.sedentaryActivityMins.reset_index(drop=True, inplace=True)
df_moving = pd.DataFrame(movingAvg(df_valid['sedentaryActivityMins']))
df_nan = [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]
df_nan = pd.DataFrame(df_nan)
df_moving = pd.concat([df_nan, df_moving])

plt.plot(df_valid.sedentaryActivityMins)
plt.plot(df_moving)
#plt.show()

But as the moving average uses a 7-item window, the list of moving averages is 7 items short, and therefore the plots do not follow each other correctly. I tried putting 7 NaN values into the moving-average list, but those are ignored when I plot.

The plot is as follows: [plot: both lines start at x = 0]

But I would like the orange line to start 7 steps ahead, so it looks like this: [plot: the orange moving-average line starts 7 steps later than the blue data line]

df_valid.sedentaryActivityMins.head(40)
0     608
1     494
2     579
3     586
4     404
5     750
6     573
7     466
8     389
9     604
10    351
11    553
12    768
13    572
14    616
15    522
16    675
17    607
18    229
19    529
20    746
21    646
22    625
23    590
24    572
25    462
26    708
27    662
28    649
29    626
30    485
31    509
32    561
33    664
34    517
35    587
36    602
37    601
38    495
39    352
Name: sedentaryActivityMins, dtype: int64

Any ideas as to how? Thanks in advance!
When you do a concat, the indexes don't change, so the NaNs take the same indices (0-6) as the first 7 observations of your series. So either do a reset_index after the concat or set ignore_index to True, as follows:

df_moving = pd.concat([df_nan, df_moving], ignore_index=True)

plt.plot(df_valid.sedentaryActivityMins)
plt.plot(df_moving)

This gives the output as expected.
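For what it's worth, pandas can also compute the moving average directly with a rolling window, which keeps the original index and fills the first window_size - 1 positions with NaN, so the two lines align without any manual concat. A minimal sketch, assuming df_valid from the question:

import matplotlib.pyplot as plt

rolled = df_valid['sedentaryActivityMins'].rolling(window=7).mean()

plt.plot(df_valid['sedentaryActivityMins'])
plt.plot(rolled)  # first 6 values are NaN, so the orange line starts 7 steps in
plt.show()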
Splitting data into subsamples
I have a huge dataset which contains coordinates of particles. In order to split the data into test and training sets I want to divide the space into many subspaces. I did this with a for-loop in every direction (x, y, z), but when running the code it takes very long and is not efficient enough, especially for large datasets:

particle_boxes = []

init = 0
final = 50
number_box = 5

for i in range(number_box):
    for j in range(number_box):
        for k in range(number_box):
            index_particle = (df_particles['X'].between(init+i*final, final+final*i)
                              & df_particles['Y'].between(init+j*final, final+final*j)
                              & df_particles['Z'].between(init+k*final, final+final*k))
            particle_boxes.append(df_particles[index_particle])

where init and final define the box size and df_particles contains every particle coordinate (x, y, z). After running this, particle_boxes contains 125 (number_box^3) equally spaced subboxes. Is there any way to write this code more efficiently?
Note on efficiency

I conducted a number of tests using other tricks and nothing changed substantially. This is roughly as good as any other technique I used. I'm curious to see if anyone else comes up with something an order of magnitude faster.

Sample data

np.random.seed([3, 1415])
df_particles = pd.DataFrame(
    np.random.randint(250, size=(1000, 3)),
    columns=['X', 'Y', 'Z']
)

Solution

Construct an array a that represents your boundaries:

a = np.array([50, 100, 150, 200, 250])

Then use searchsorted to create the individual dimensional bins:

x_bin = a.searchsorted(df_particles['X'].to_numpy())
y_bin = a.searchsorted(df_particles['Y'].to_numpy())
z_bin = a.searchsorted(df_particles['Z'].to_numpy())

Use groupby on the three bins. I used trickery to get that into a dict:

g = dict((*df_particles.groupby([x_bin, y_bin, z_bin]),))

We can see the first zone:

g[(0, 0, 0)]

      X   Y   Z
30    2  36  47
194   0  34  45
276  46  37  34
364  10  16  21
378   4  15   4
429  12  34  13
645  36  17   5
743  18  36  13
876  46  11  34

and the last:

g[(4, 4, 4)]

       X    Y    Z
87   223  236  213
125  206  241  249
174  218  247  221
234  222  204  237
298  208  211  225
461  234  204  238
596  209  229  241
731  210  220  242
761  225  215  231
762  206  241  240
840  211  241  238
846  212  242  241
899  249  203  228
970  214  217  232
981  236  216  248
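In case searchsorted is unfamiliar: for each value, it returns the index at which that value would be inserted into the sorted boundary array, which here is exactly the box index along that axis. A tiny illustration with made-up values:

import numpy as np

a = np.array([50, 100, 150, 200, 250])
print(a.searchsorted([10, 50, 99, 100, 249]))  # -> [0 0 1 1 4]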
Instead of multiple nested for-loops, consider one loop using itertools.product. But of course avoid any loops if possible, as @piRSquared shows:

from itertools import product

particle_boxes = []
for i, j, k in product(range(number_box), range(number_box), range(number_box)):
    index_particle = (df_particles['X'].between(init+i*final, final+final*i)
                      & df_particles['Y'].between(init+j*final, final+final*j)
                      & df_particles['Z'].between(init+k*final, final+final*k))
    particle_boxes.append(df_particles[index_particle])

Alternatively, with a list comprehension:

def sub_df(i, j, k):
    index_particle = (df_particles['X'].between(init+i*final, final+final*i)
                      & df_particles['Y'].between(init+j*final, final+final*j)
                      & df_particles['Z'].between(init+k*final, final+final*k))
    return df_particles[index_particle]

particle_boxes = [sub_df(i, j, k)
                  for i, j, k in product(range(number_box), range(number_box), range(number_box))]
Have a look at the train_test_split function available in the scikit-learn lib. I think it is almost the kind of functionality that you need. The code is available on GitHub.
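For reference, a minimal sketch of what that could look like here (train_test_split shuffles and splits row-wise; the 80/20 ratio and random_state are arbitrary choices for illustration):

from sklearn.model_selection import train_test_split

# Randomly split the particle table into training and test sets.
train_df, test_df = train_test_split(df_particles, test_size=0.2, random_state=42)
print(len(train_df), len(test_df))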
How to display a sequence of numbers in column-major order?
Program description: Find all the prime numbers between 1 and 4,027 and print them in a table which "reads down", using as few rows as possible, and using as few sheets of paper as possible. (This is because I have to print them out on paper to turn it in.) All numbers should be right-justified in their column. The height of the columns should all be the same, except for perhaps the last column, which might have a few blank entries towards its bottom row.

The plan for my first function is to find all prime numbers between the range above and put them in a list. Then I want my second function to display the list in a table that reads top to bottom:

 2  23  59
 3  29  61
 5  31  67
 7  37  71
11  41  73
13  43  79
17  47  83
19  53  89

etc.

This is all I've been able to come up with myself:

def findPrimes(n):
    """ Adds calculated prime numbers to a list. """
    prime_list = list()
    for number in range(1, n + 1):
        prime = True
        for i in range(2, number):
            if number % i == 0:
                prime = False
        if prime:
            prime_list.append(number)
    return prime_list

def displayPrimes():
    pass

print(findPrimes(4027))

I'm not sure how to make a row/column display in Python. I remember using Java in my previous class and we had to use a for loop inside a for loop, I believe. Do I have to do something similar to that?
Although I frequently don't answer questions where the original poster hasn't even made an attempt to solve the problem themselves, I decided to make an exception of yours, mostly because I found it an interesting (and surprisingly challenging) problem that required solving a number of somewhat tricky sub-problems.

I also optimized your find_primes() function slightly by taking advantage of some relatively well-known computational shortcuts for calculating primes. For testing and demo purposes, I made the tables only 15 rows high to force more than one page to be generated, as shown in the output at the end.

from itertools import zip_longest
import locale
import math

locale.setlocale(locale.LC_ALL, '')  # enable locale-specific formatting

def zip_discard(*iterables, _NULL=object()):
    """ Like zip_longest() but doesn't fill out all rows to equal length.
        https://stackoverflow.com/questions/38054593/zip-longest-without-fillvalue
    """
    return [[entry for entry in iterable if entry is not _NULL]
            for iterable in zip_longest(*iterables, fillvalue=_NULL)]

def grouper(n, seq):
    """ Group elements in sequence into groups of "n" items. """
    for i in range(0, len(seq), n):
        yield seq[i:i+n]

def tabularize(width, height, numbers):
    """ Print list of numbers in column-major tabular form given the
        dimensions of the table in characters (rows and columns). Will
        create multiple tables if required to display all numbers.
    """
    # Determine number of chars needed to hold longest formatted numeric value
    gap = 2  # including space between numbers
    col_width = len('{:n}'.format(max(numbers))) + gap
    # Determine number of columns that will fit within the table's width.
    num_cols = width // col_width
    chunk_size = num_cols * height  # maximum numbers in each table
    for i, chunk in enumerate(grouper(chunk_size, numbers), start=1):
        print('---- Page {} ----'.format(i))
        num_rows = int(math.ceil(len(chunk) / num_cols))  # rounded up
        table = zip_discard(*grouper(num_rows, chunk))
        for row in table:
            print(''.join('{:{width}n}'.format(num, width=col_width)
                          for num in row))

def find_primes(n):
    """ Create list of prime numbers from 1 to n. """
    prime_list = []
    for number in range(1, n+1):
        for i in range(2, int(math.sqrt(number)) + 1):
            if not number % i:  # Evenly divisible?
                break  # Not prime.
        else:
            prime_list.append(number)
    return prime_list

primes = find_primes(4027)
tabularize(80, 15, primes)

Output:

---- Page 1 ----
      1     47    113    197    281    379    463    571    659    761    863
      2     53    127    199    283    383    467    577    661    769    877
      3     59    131    211    293    389    479    587    673    773    881
      5     61    137    223    307    397    487    593    677    787    883
      7     67    139    227    311    401    491    599    683    797    887
     11     71    149    229    313    409    499    601    691    809    907
     13     73    151    233    317    419    503    607    701    811    911
     17     79    157    239    331    421    509    613    709    821    919
     19     83    163    241    337    431    521    617    719    823    929
     23     89    167    251    347    433    523    619    727    827    937
     29     97    173    257    349    439    541    631    733    829    941
     31    101    179    263    353    443    547    641    739    839    947
     37    103    181    269    359    449    557    643    743    853    953
     41    107    191    271    367    457    563    647    751    857    967
     43    109    193    277    373    461    569    653    757    859    971
---- Page 2 ----
    977  1,069  1,187  1,291  1,427  1,511  1,613  1,733  1,867  1,987  2,087
    983  1,087  1,193  1,297  1,429  1,523  1,619  1,741  1,871  1,993  2,089
    991  1,091  1,201  1,301  1,433  1,531  1,621  1,747  1,873  1,997  2,099
    997  1,093  1,213  1,303  1,439  1,543  1,627  1,753  1,877  1,999  2,111
  1,009  1,097  1,217  1,307  1,447  1,549  1,637  1,759  1,879  2,003  2,113
  1,013  1,103  1,223  1,319  1,451  1,553  1,657  1,777  1,889  2,011  2,129
  1,019  1,109  1,229  1,321  1,453  1,559  1,663  1,783  1,901  2,017  2,131
  1,021  1,117  1,231  1,327  1,459  1,567  1,667  1,787  1,907  2,027  2,137
  1,031  1,123  1,237  1,361  1,471  1,571  1,669  1,789  1,913  2,029  2,141
  1,033  1,129  1,249  1,367  1,481  1,579  1,693  1,801  1,931  2,039  2,143
  1,039  1,151  1,259  1,373  1,483  1,583  1,697  1,811  1,933  2,053  2,153
  1,049  1,153  1,277  1,381  1,487  1,597  1,699  1,823  1,949  2,063  2,161
  1,051  1,163  1,279  1,399  1,489  1,601  1,709  1,831  1,951  2,069  2,179
  1,061  1,171  1,283  1,409  1,493  1,607  1,721  1,847  1,973  2,081  2,203
  1,063  1,181  1,289  1,423  1,499  1,609  1,723  1,861  1,979  2,083  2,207
---- Page 3 ----
  2,213  2,333  2,423  2,557  2,687  2,789  2,903  3,037  3,181  3,307  3,413
  2,221  2,339  2,437  2,579  2,689  2,791  2,909  3,041  3,187  3,313  3,433
  2,237  2,341  2,441  2,591  2,693  2,797  2,917  3,049  3,191  3,319  3,449
  2,239  2,347  2,447  2,593  2,699  2,801  2,927  3,061  3,203  3,323  3,457
  2,243  2,351  2,459  2,609  2,707  2,803  2,939  3,067  3,209  3,329  3,461
  2,251  2,357  2,467  2,617  2,711  2,819  2,953  3,079  3,217  3,331  3,463
  2,267  2,371  2,473  2,621  2,713  2,833  2,957  3,083  3,221  3,343  3,467
  2,269  2,377  2,477  2,633  2,719  2,837  2,963  3,089  3,229  3,347  3,469
  2,273  2,381  2,503  2,647  2,729  2,843  2,969  3,109  3,251  3,359  3,491
  2,281  2,383  2,521  2,657  2,731  2,851  2,971  3,119  3,253  3,361  3,499
  2,287  2,389  2,531  2,659  2,741  2,857  2,999  3,121  3,257  3,371  3,511
  2,293  2,393  2,539  2,663  2,749  2,861  3,001  3,137  3,259  3,373  3,517
  2,297  2,399  2,543  2,671  2,753  2,879  3,011  3,163  3,271  3,389  3,527
  2,309  2,411  2,549  2,677  2,767  2,887  3,019  3,167  3,299  3,391  3,529
  2,311  2,417  2,551  2,683  2,777  2,897  3,023  3,169  3,301  3,407  3,533
---- Page 4 ----
  3,539  3,581  3,623  3,673  3,719  3,769  3,823  3,877  3,919  3,967  4,019
  3,541  3,583  3,631  3,677  3,727  3,779  3,833  3,881  3,923  3,989  4,021
  3,547  3,593  3,637  3,691  3,733  3,793  3,847  3,889  3,929  4,001  4,027
  3,557  3,607  3,643  3,697  3,739  3,797  3,851  3,907  3,931  4,003
  3,559  3,613  3,659  3,701  3,761  3,803  3,853  3,911  3,943  4,007
  3,571  3,617  3,671  3,709  3,767  3,821  3,863  3,917  3,947  4,013