Splitting data into subsamples - python

I have a huge dataset containing particle coordinates. To split the data into training and test sets, I want to divide the space into many subspaces. I did this with a for loop in each direction (x, y, z), but the code takes very long to run and is not efficient enough, especially for large datasets:
particle_boxes = []
init = 0
final = 50
number_box = 5
for i in range(number_box):
    for j in range(number_box):
        for k in range(number_box):
            index_particle = (df_particles['X'].between(init+i*final, final+final*i)
                              & df_particles['Y'].between(init+j*final, final+final*j)
                              & df_particles['Z'].between(init+k*final, final+final*k))
            particle_boxes.append(df_particles[index_particle])
Here init and final define the box size, and df_particles contains every particle coordinate (x, y, z).
After running this, particle_boxes contains 125 (number_box^3) equally spaced sub-boxes.
Is there any way to write this code more efficiently?

Note on efficiency
I conducted a number of tests using other tricks and nothing changed substantially. This is roughly as good as any other technique I used.
I'm curious to see if anyone else comes up with something an order of magnitude faster.
Sample data
import numpy as np
import pandas as pd
np.random.seed([3, 1415])
df_particles = pd.DataFrame(
    np.random.randint(250, size=(1000, 3)),
    columns=['X', 'Y', 'Z']
)
Solution
Construct an array a that represents your boundaries
a = np.array([50, 100, 150, 200, 250])
Then use searchsorted to create the individual dimensional bins
x_bin = a.searchsorted(df_particles['X'].to_numpy())
y_bin = a.searchsorted(df_particles['Y'].to_numpy())
z_bin = a.searchsorted(df_particles['Z'].to_numpy())
Use groupby on the three bins. I used trickery to get that into a dict
g = dict((*df_particles.groupby([x_bin, y_bin, z_bin]),))
We can see the first zone
g[(0, 0, 0)]
X Y Z
30 2 36 47
194 0 34 45
276 46 37 34
364 10 16 21
378 4 15 4
429 12 34 13
645 36 17 5
743 18 36 13
876 46 11 34
and the last
g[(4, 4, 4)]
X Y Z
87 223 236 213
125 206 241 249
174 218 247 221
234 222 204 237
298 208 211 225
461 234 204 238
596 209 229 241
731 210 220 242
761 225 215 231
762 206 241 240
840 211 241 238
846 212 242 241
899 249 203 228
970 214 217 232
981 236 216 248
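The searchsorted-plus-groupby approach above can be checked end to end with a short sketch. The data here are freshly generated (not the sample above), and the names edges, bins, and boxes are illustrative choices. Note that with the default side='left', a value equal to an edge falls into the lower bin.

```python
import numpy as np
import pandas as pd

# Illustrative data: 1000 particles with integer coordinates in [0, 250).
rng = np.random.default_rng(0)
df_particles = pd.DataFrame(
    rng.integers(0, 250, size=(1000, 3)),
    columns=["X", "Y", "Z"],
)

# Upper edge of each of the 5 bins along one axis.
edges = np.array([50, 100, 150, 200, 250])

# Bin index (0..4) per particle and axis; side='left' (the default) puts a
# value equal to an edge into the lower bin.
bins = {c: edges.searchsorted(df_particles[c].to_numpy()) for c in "XYZ"}

# One pass over the data: group rows by their (x_bin, y_bin, z_bin) triple.
boxes = {
    key: group
    for key, group in df_particles.groupby([bins["X"], bins["Y"], bins["Z"]])
}

# Every particle lands in exactly one box, so the sizes add up to 1000.
total = sum(len(group) for group in boxes.values())
print(len(boxes), total)
```

Because each row is assigned to exactly one bin triple, this avoids the double counting that an inclusive between() can produce for particles sitting exactly on a box boundary.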

Instead of multiple nested for loops, consider a single loop using itertools.product. Of course, avoid loops altogether if possible, as @piRSquared shows:
from itertools import product
particle_boxes = []
for i, j, k in product(range(number_box), range(number_box), range(number_box)):
    index_particle = (df_particles['X'].between(init+i*final, final+final*i) &
                      df_particles['Y'].between(init+j*final, final+final*j) &
                      df_particles['Z'].between(init+k*final, final+final*k))
    particle_boxes.append(df_particles[index_particle])
Alternatively, with a list comprehension:
def sub_df(i, j, k):
    index_particle = (df_particles['X'].between(init+i*final, final+final*i) &
                      df_particles['Y'].between(init+j*final, final+final*j) &
                      df_particles['Z'].between(init+k*final, final+final*k))
    return df_particles[index_particle]
particle_boxes = [sub_df(i, j, k)
                  for i, j, k in product(range(number_box), range(number_box), range(number_box))]

Have a look at the train_test_split function available in the scikit-learn library.
I think it is almost exactly the kind of functionality that you need.
The source code is available on GitHub.
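If whole boxes, rather than individual particles, should go to either the training or the test set, the split can be done on box labels. The following is a minimal sketch under that assumption, using plain NumPy; the 80/20 ratio, the integer-division binning (a swapped-in technique, not the one from the question), and all variable names are illustrative. scikit-learn's train_test_split could be applied to the array of distinct box labels in the same way.

```python
import numpy as np
import pandas as pd

# Illustrative data: 1000 particles with coordinates in [0, 250).
rng = np.random.default_rng(1)
df_particles = pd.DataFrame(
    rng.uniform(0, 250, size=(1000, 3)), columns=["X", "Y", "Z"]
)

box_size = 50
number_box = 5

# Integer box index per axis (0..4), clipped in case a coordinate is exactly 250.
idx = (df_particles[["X", "Y", "Z"]].to_numpy() // box_size).astype(int)
idx = np.clip(idx, 0, number_box - 1)

# One flat label per particle (0..124) from the three per-axis indices.
box_id = np.ravel_multi_index(tuple(idx.T), (number_box,) * 3)

# Shuffle the distinct box labels and hold out 20% of the boxes for testing.
labels = rng.permutation(np.unique(box_id))
n_test = int(round(0.2 * len(labels)))
test_labels = labels[:n_test]

is_test = np.isin(box_id, test_labels)
test_set = df_particles[is_test]
train_set = df_particles[~is_test]
print(len(train_set), len(test_set))
```

Splitting by box keeps spatially neighbouring particles on the same side of the split, which is usually the point of binning before splitting.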

Related

sort pivot/dataframe without All row pandas/python

I created a dataframe with the help of a pivot, and I have:
name x y z All
A 155 202 218 575
C 206 149 45 400
B 368 215 275 858
Total 729 566 538 1833
I would like to sort by the column "All" without taking the "Total" row into account. I am using:
df.sort_values(by = ["All"], ascending = False)
Thank you in advance!
If the Total row is the last one, you can sort other rows and then concat the last row:
df = pd.concat([df.iloc[:-1, :].sort_values(by="All"), df.iloc[-1:, :]])
print(df)
Prints:
name x y z All
C 206 149 45 400
A 155 202 218 575
B 368 215 275 858
Total 729 566 538 1833
You can try the following, but be careful: DataFrame.append raises a FutureWarning since pandas 1.4 and was removed in pandas 2.0, so this only works on older versions:
df = df.iloc[:-1,:].sort_values('All',ascending=False).append(df.iloc[-1,:])
This outputs:
name x y z All
2 B 368 215 275 858
0 A 155 202 218 575
1 C 206 149 45 400
3 Total 729 566 538 1833
You can get the sorted order without Total (assuming here the last row), then index by position:
import numpy as np
idx = np.argsort(df['All'].iloc[:-1])
df2 = df.iloc[np.r_[idx[::-1], len(df)-1]]
NB. as we are sorting only an indexer here this should be very fast
output:
name x y z All
2 B 368 215 275 858
0 A 155 202 218 575
1 C 206 149 45 400
3 Total 729 566 538 1833
You can just ignore the last row (note that this drops the Total row from the result):
df.iloc[:-1].sort_values(by = ["All"], ascending = False)
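Putting the pieces together, a version that sorts in descending order, keeps the Total row last, and relies only on pd.concat could look like this. The frame is rebuilt from the table in the question; that Total is the last row is an assumption.

```python
import pandas as pd

# Rebuilt from the table in the question.
df = pd.DataFrame(
    {
        "name": ["A", "C", "B", "Total"],
        "x": [155, 206, 368, 729],
        "y": [202, 149, 215, 566],
        "z": [218, 45, 275, 538],
        "All": [575, 400, 858, 1833],
    }
)

# Sort every row except the last, then put the Total row back at the end.
# df.iloc[[-1]] (double brackets) keeps the last row as a one-row DataFrame.
body = df.iloc[:-1].sort_values(by="All", ascending=False)
result = pd.concat([body, df.iloc[[-1]]])
print(result)
```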

Python pdist: Setting an array element with a sequence

I have written the following code
arr_coord = []
for chains in structure:
    for chain in chains:
        for residue in chain:
            for atom in residue:
                x = atom.get_coord()
                arr_coord.append({'X': [x[0]],'Y':[x[1]],'Z':[x[2]]})
coord_table = pd.DataFrame(arr_coord)
print(coord_table)
To generate the following dataframe
X Y Z
0 [-5.43] [28.077] [-0.842]
1 [-3.183] [26.472] [1.741]
2 [-2.574] [22.752] [1.69]
3 [-1.743] [21.321] [5.121]
4 [0.413] [18.212] [5.392]
5 [0.714] [15.803] [8.332]
6 [4.078] [15.689] [10.138]
7 [5.192] [12.2] [9.065]
8 [4.088] [12.79] [5.475]
9 [5.875] [16.117] [4.945]
10 [8.514] [15.909] [2.22]
11 [12.235] [15.85] [2.943]
12 [13.079] [16.427] [-0.719]
13 [10.832] [19.066] [-2.324]
14 [12.327] [22.569] [-2.163]
15 [8.976] [24.342] [-1.742]
16 [7.689] [25.565] [1.689]
17 [5.174] [23.336] [3.467]
18 [2.339] [24.135] [5.889]
19 [0.9] [22.203] [8.827]
20 [-1.217] [22.065] [11.975]
21 [0.334] [20.465] [15.09]
22 [0.0] [20.066] [18.885]
23 [2.738] [21.762] [20.915]
24 [4.087] [19.615] [23.742]
25 [7.186] [21.618] [24.704]
26 [8.867] [24.914] [23.91]
27 [11.679] [27.173] [24.946]
28 [10.76] [30.763] [25.731]
29 [11.517] [33.056] [22.764]
.. ... ... ...
431 [8.093] [34.654] [68.474]
432 [7.171] [32.741] [65.298]
433 [5.088] [35.626] [63.932]
434 [7.859] [38.22] [64.329]
435 [10.623] [35.908] [63.1]
436 [12.253] [36.776] [59.767]
437 [10.65] [35.048] [56.795]
438 [7.459] [34.084] [58.628]
439 [4.399] [35.164] [56.713]
440 [0.694] [35.273] [57.347]
441 [-1.906] [34.388] [54.667]
442 [-5.139] [35.863] [55.987]
443 [-8.663] [36.808] [55.097]
444 [-9.629] [40.233] [56.493]
445 [-12.886] [42.15] [56.888]
446 [-12.969] [45.937] [56.576]
447 [-14.759] [47.638] [59.485]
448 [-14.836] [51.367] [60.099]
449 [-11.607] [51.863] [58.176]
450 [-9.836] [48.934] [59.829]
451 [-8.95] [45.445] [58.689]
452 [-9.824] [42.599] [61.073]
453 [-8.559] [39.047] [60.598]
454 [-11.201] [36.341] [60.195]
455 [-11.561] [32.71] [59.077]
456 [-7.786] [32.216] [59.387]
457 [-5.785] [29.886] [61.675]
458 [-2.143] [29.222] [62.469]
459 [-0.946] [25.828] [61.248]
460 [2.239] [25.804] [63.373]
[461 rows x 3 columns]
What I intend to do is to create a Euclidean distance matrix using these X, Y, and Z values. I tried to do this using the pdist function
dist = pdist(coord_table, metric = 'euclidean')
distance_matrix = squareform(dist)
print(distance_matrix)
However, the interpreter gives the following error
ValueError: setting an array element with a sequence.
I am not sure how to interpret this error or how to fix it.
Change your loop so that each coordinate is stored as a scalar rather than as a one-element list:
arr_coord = []
for chains in structure:
    for chain in chains:
        for residue in chain:
            for atom in residue:
                x = atom.get_coord()
                arr_coord.append({'X': x[0],'Y':x[1],'Z':x[2]}) # no list wrapped around each value here
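The error arises because each cell holds a one-element list, so the DataFrame columns have object dtype and cannot be converted to a float array. A small sketch with made-up coordinates (the values and the names raw, coords, and bad_table are illustrative) shows the scalar version producing a plain numeric array. The distance matrix is computed here with NumPy broadcasting, which should agree with squareform(pdist(coord_table)) for the scalar table.

```python
import numpy as np
import pandas as pd

# Made-up coordinates standing in for the atom.get_coord() results.
raw = [(-5.43, 28.077, -0.842), (-3.183, 26.472, 1.741), (-2.574, 22.752, 1.69)]

# Scalars in every cell give float64 columns.
coord_table = pd.DataFrame([{"X": x, "Y": y, "Z": z} for x, y, z in raw])
coords = coord_table.to_numpy(dtype=float)  # shape (n_atoms, 3)

# Pairwise Euclidean distances via broadcasting: (n, 1, 3) - (1, n, 3).
diff = coords[:, None, :] - coords[None, :, :]
distance_matrix = np.sqrt((diff ** 2).sum(axis=-1))

# For comparison, the list-in-cell version ends up with object dtype.
bad_table = pd.DataFrame([{"X": [x], "Y": [y], "Z": [z]} for x, y, z in raw])
print(coord_table.dtypes.tolist(), bad_table.dtypes.tolist())
```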

Can't set column name from index to str(index) + string (Pandas, Python)

I need to change the names of a subset of columns in a dataframe from whatever number they are to that number plus a string suffix. I know there is a function to add a suffix, but it doesn't seem to work on just indices.
I create a list of all the column indices, then run a loop that, for each item in that list, renames the matching dataframe column to the same number plus the suffix string.
if scalename == "CDR":
    print(scaledf.columns.tolist())
    oldCols = scaledf.columns[7:].tolist()
    for f in range(len(oldCols)):
        changeCol = int(oldCols[f])
        print(changeCol)
        scaledf.rename(columns = {changeCol:scalename + str(changeCol)})
    print(scaledf.columns)
This doesn't work.
The code will print out the column names, and prints out every item, but it does not rename the columns. It doesn't throw errors, it just doesn't work. I've tried variation after variation, and gotten all kinds of other errors, but this error-free code does nothing. It just runs, and doesn't rename anything.
Any help would be seriously appreciated! Thank you.
Adding sample of list:
45
52
54
55
59
60
61
66
67
68
69
73
74
75
80
81
82
94
101
103
104
108
110
115
116
117
129
136
138
139
143
144
145
150
151
157
158
159
171
178
180
181
185
186
187
192
193
199
200
201
213
220
222
223
227
228
229
234
235
236
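rename returns a new DataFrame rather than modifying the existing one, so the result has to be assigned back. Try this: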
scaledf = scaledf.rename(columns=lambda c:scalename + str(c) if c in oldCols else c)
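A small sketch with a made-up frame illustrates the assignment. The column labels and the slice position 2 are illustrative, standing in for the question's columns[7:].

```python
import pandas as pd

scalename = "CDR"
# Toy frame: two named columns followed by integer-labelled columns.
scaledf = pd.DataFrame([[1, 2, 3, 4, 5]], columns=["id", "date", 45, 52, 54])

# Columns to rename: everything from position 2 onwards in this toy frame.
old_cols = set(scaledf.columns[2:])

# rename returns a new DataFrame, so the result is assigned back to scaledf.
scaledf = scaledf.rename(
    columns=lambda c: scalename + str(c) if c in old_cols else c
)
print(scaledf.columns.tolist())  # ['id', 'date', 'CDR45', 'CDR52', 'CDR54']
```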

How to display a sequence of numbers in column-major order?

Program description:
Find all the prime numbers between 1 and 4,027 and print them in a table which
"reads down", using as few rows as possible, and using as few sheets of paper
as possible. (This is because I have to print them out on paper to turn it in.) All numbers should be right-justified in their column. The height
of the columns should all be the same, except for perhaps the last column,
which might have a few blank entries towards its bottom row.
The plan for my first function is to find all prime numbers between the range above and put them in a list. Then I want my second function to display the list in a table that reads up to down.
2 23 59
3 29 61
5 31 67
7 37 71
11 41 73
13 43 79
17 47 83
19 53 89
etc.
This is all I've been able to come up with myself:
def findPrimes(n):
    """ Adds calculated prime numbers to a list. """
    prime_list = list()
    for number in range(1, n + 1):
        prime = True
        for i in range(2, number):
            if(number % i == 0):
                prime = False
        if prime:
            prime_list.append(number)
    return prime_list

def displayPrimes():
    pass

print(findPrimes(4027))
I'm not sure how to make a row/column display in Python. I remember using Java in my previous class and we had to use a for loop inside a for loop I believe. Do I have to do something similar to that?
Although I frequently don't answer questions where the original poster hasn't even made an attempt to solve the problem themselves, I decided to make an exception of yours—mostly because I found it an interesting (and surprisingly challenging) problem that required solving a number of somewhat tricky sub-problems.
I also optimized your find_primes() function slightly by taking advantage of some relatively well-known computational shortcuts for calculating them.
For testing and demo purposes, I made the tables only 15 rows high to force more than one page to be generated as shown in the output at the end.
from itertools import zip_longest
import locale
import math

locale.setlocale(locale.LC_ALL, '')  # enable locale-specific formatting

def zip_discard(*iterables, _NULL=object()):
    """ Like zip_longest() but doesn't fill out all rows to equal length.
        https://stackoverflow.com/questions/38054593/zip-longest-without-fillvalue
    """
    return [[entry for entry in iterable if entry is not _NULL]
            for iterable in zip_longest(*iterables, fillvalue=_NULL)]

def grouper(n, seq):
    """ Group elements in sequence into groups of "n" items. """
    for i in range(0, len(seq), n):
        yield seq[i:i+n]

def tabularize(width, height, numbers):
    """ Print list of numbers in column-major tabular form given the dimensions
        of the table in characters (rows and columns). Will create multiple
        tables if required to display all numbers.
    """
    # Determine number of chars needed to hold longest formatted numeric value
    gap = 2  # including space between numbers
    col_width = len('{:n}'.format(max(numbers))) + gap
    # Determine number of columns that will fit within the table's width.
    num_cols = width // col_width
    chunk_size = num_cols * height  # maximum numbers in each table
    for i, chunk in enumerate(grouper(chunk_size, numbers), start=1):
        print('---- Page {} ----'.format(i))
        num_rows = int(math.ceil(len(chunk) / num_cols))  # rounded up
        table = zip_discard(*grouper(num_rows, chunk))
        for row in table:
            print(''.join(('{:{width}n}'.format(num, width=col_width)
                           for num in row)))

def find_primes(n):
    """ Create list of prime numbers from 1 to n. """
    prime_list = []
    for number in range(1, n+1):
        for i in range(2, int(math.sqrt(number)) + 1):
            if not number % i:  # Evenly divisible?
                break  # Not prime.
        else:  # inner loop finished without finding a divisor
            prime_list.append(number)
    return prime_list

primes = find_primes(4027)
tabularize(80, 15, primes)
Output:
---- Page 1 ----
1 47 113 197 281 379 463 571 659 761 863
2 53 127 199 283 383 467 577 661 769 877
3 59 131 211 293 389 479 587 673 773 881
5 61 137 223 307 397 487 593 677 787 883
7 67 139 227 311 401 491 599 683 797 887
11 71 149 229 313 409 499 601 691 809 907
13 73 151 233 317 419 503 607 701 811 911
17 79 157 239 331 421 509 613 709 821 919
19 83 163 241 337 431 521 617 719 823 929
23 89 167 251 347 433 523 619 727 827 937
29 97 173 257 349 439 541 631 733 829 941
31 101 179 263 353 443 547 641 739 839 947
37 103 181 269 359 449 557 643 743 853 953
41 107 191 271 367 457 563 647 751 857 967
43 109 193 277 373 461 569 653 757 859 971
---- Page 2 ----
977 1,069 1,187 1,291 1,427 1,511 1,613 1,733 1,867 1,987 2,087
983 1,087 1,193 1,297 1,429 1,523 1,619 1,741 1,871 1,993 2,089
991 1,091 1,201 1,301 1,433 1,531 1,621 1,747 1,873 1,997 2,099
997 1,093 1,213 1,303 1,439 1,543 1,627 1,753 1,877 1,999 2,111
1,009 1,097 1,217 1,307 1,447 1,549 1,637 1,759 1,879 2,003 2,113
1,013 1,103 1,223 1,319 1,451 1,553 1,657 1,777 1,889 2,011 2,129
1,019 1,109 1,229 1,321 1,453 1,559 1,663 1,783 1,901 2,017 2,131
1,021 1,117 1,231 1,327 1,459 1,567 1,667 1,787 1,907 2,027 2,137
1,031 1,123 1,237 1,361 1,471 1,571 1,669 1,789 1,913 2,029 2,141
1,033 1,129 1,249 1,367 1,481 1,579 1,693 1,801 1,931 2,039 2,143
1,039 1,151 1,259 1,373 1,483 1,583 1,697 1,811 1,933 2,053 2,153
1,049 1,153 1,277 1,381 1,487 1,597 1,699 1,823 1,949 2,063 2,161
1,051 1,163 1,279 1,399 1,489 1,601 1,709 1,831 1,951 2,069 2,179
1,061 1,171 1,283 1,409 1,493 1,607 1,721 1,847 1,973 2,081 2,203
1,063 1,181 1,289 1,423 1,499 1,609 1,723 1,861 1,979 2,083 2,207
---- Page 3 ----
2,213 2,333 2,423 2,557 2,687 2,789 2,903 3,037 3,181 3,307 3,413
2,221 2,339 2,437 2,579 2,689 2,791 2,909 3,041 3,187 3,313 3,433
2,237 2,341 2,441 2,591 2,693 2,797 2,917 3,049 3,191 3,319 3,449
2,239 2,347 2,447 2,593 2,699 2,801 2,927 3,061 3,203 3,323 3,457
2,243 2,351 2,459 2,609 2,707 2,803 2,939 3,067 3,209 3,329 3,461
2,251 2,357 2,467 2,617 2,711 2,819 2,953 3,079 3,217 3,331 3,463
2,267 2,371 2,473 2,621 2,713 2,833 2,957 3,083 3,221 3,343 3,467
2,269 2,377 2,477 2,633 2,719 2,837 2,963 3,089 3,229 3,347 3,469
2,273 2,381 2,503 2,647 2,729 2,843 2,969 3,109 3,251 3,359 3,491
2,281 2,383 2,521 2,657 2,731 2,851 2,971 3,119 3,253 3,361 3,499
2,287 2,389 2,531 2,659 2,741 2,857 2,999 3,121 3,257 3,371 3,511
2,293 2,393 2,539 2,663 2,749 2,861 3,001 3,137 3,259 3,373 3,517
2,297 2,399 2,543 2,671 2,753 2,879 3,011 3,163 3,271 3,389 3,527
2,309 2,411 2,549 2,677 2,767 2,887 3,019 3,167 3,299 3,391 3,529
2,311 2,417 2,551 2,683 2,777 2,897 3,023 3,169 3,301 3,407 3,533
---- Page 4 ----
3,539 3,581 3,623 3,673 3,719 3,769 3,823 3,877 3,919 3,967 4,019
3,541 3,583 3,631 3,677 3,727 3,779 3,833 3,881 3,923 3,989 4,021
3,547 3,593 3,637 3,691 3,733 3,793 3,847 3,889 3,929 4,001 4,027
3,557 3,607 3,643 3,697 3,739 3,797 3,851 3,907 3,931 4,003
3,559 3,613 3,659 3,701 3,761 3,803 3,853 3,911 3,943 4,007
3,571 3,617 3,671 3,709 3,767 3,821 3,863 3,917 3,947 4,013
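The core of the column-major layout can also be expressed with slicing alone: with num_rows rows per table, row r holds every num_rows-th item starting at index r. Here is a minimal single-page sketch using only the standard library. The function name column_major and its parameters are illustrative, and it leaves out pagination and locale formatting.

```python
import math


def column_major(items, num_cols, gap=2):
    """Return the lines of a table that reads down the columns."""
    if not items:
        return []
    # Fewest rows that fit all items into at most num_cols columns.
    num_rows = math.ceil(len(items) / num_cols)
    # Column width: longest entry plus the gap between columns.
    width = max(len(str(item)) for item in items) + gap
    lines = []
    for r in range(num_rows):
        # Row r takes items r, r + num_rows, r + 2*num_rows, ...
        row = items[r::num_rows]
        lines.append("".join(f"{item:>{width}}" for item in row))
    return lines


for line in column_major([2, 3, 5, 7, 11, 13, 17, 19, 23, 29], num_cols=3):
    print(line)
```

Only the last column can come up short, which matches the requirement that all columns have the same height except perhaps the last one.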

Speeding up array iteration time in python

Currently I am iterating over one array and for each value in this array I am looking for the closest value at the corresponding point in another array that is within a region surrounding the corresponding point.
In summary: For any point in an array, how far away from a corresponding point in another array do you need to go to get the same value.
The code seems to work well for small arrays, however I am working now with 1024x768 arrays, leading me to wait a long time for each run....
Any help or advice would be greatly appreciated as I have been on this for a while!!
Example matrix in the format I'm using: np.array([[1,2],[3,4]])
#Distance to agreement
#Used later to define a region of pixels around a corresponding point
#to iterate over:
DTA = 26
#To account for noise in pixels - doesn't have to find the exact value,
#just one within +/-130 of it.
limit = 130
#Containers for all pixel value matches and also the smallest distance
#to pixel match
Dist = []
Dist_min = []
#Container matrix for gamma pass/fail values
Dist_to_agree = np.zeros((i_size,j_size))
#i,j indexes the reference matrix (x), ii,jj indexes the measured
#matrix (y). Finds a match within the limits,
#appends the distance to the match into Dist.
#Then find the minimum distance to a match for that pixel and append it
#to Dist_min
for i, k in enumerate(x):
    for j, l in enumerate(k):
        #added 10 padding to y matrix, so need to shift it by 10 in i&j
        for ii in range((i+10)-DTA,(i+10)+DTA):
            for jj in range((j+10)-DTA,(j+10)+DTA):
                #If the pixel value is within a range to account for noise,
                #let it be "found"
                if (y[ii,jj]-limit) <= x[i,j] <= (y[ii,jj]+limit):
                    #Calculating distance
                    dist_eu = sqrt(((i)-(ii))**2 + ((j) - (jj))**2)
                    Dist.append(dist_eu)
                #If a value cannot be found within the noise range,
                #append 10 = instant fail.
                else:
                    Dist.append(10)
        try:
            Dist_min.append(min(Dist))
            Dist_to_agree[i,j] = min(Dist)
        except ValueError:
            pass
        #Need to reset container or previous values will also be
        #accounted for when finding minimum
        Dist = []
print(Dist_to_agree)
First, you are getting the elements of x in k and l, but then throwing that away and indexing x again. So in place of x[i,j], you could just use l, which would be much faster (although l isn't a very meaningful name, something like xi and xij might be better).
Second, you are recomputing y[ii,jj]-limit and y[ii,jj]+limit every time. If you have enough memory, you can precompute these: ym = y-limit and yp = y+limit.
Third, appending to a list is slower than creating an array and setting the values for long lists vs. long arrays. You can also skip the entire else clause by pre-setting the default value.
Fourth, you are computing min(Dist) twice, and you may also be using the Python version rather than the NumPy version, the latter being faster for arrays (which is another reason to make Dist an array).
However, the biggest speedup would come from vectorizing the inner two loops. Here are my tests, with x=np.random.random((10,10)) and y=np.random.random((100,100)):
Your version takes 623 ms.
Here is my version, which takes 7.6 ms:
dta = 26
limit = 130
dist_to_agree = np.zeros_like(x)
dist_min = []
ym = y-limit
yp = y+limit
for i, xi in enumerate(x):
    irange = (i-np.arange(i+10-dta, i+10+dta))**2
    if not irange.size:
        continue
    ymi = ym[i+10-dta:i+10+dta, :]
    ypi = yp[i+10-dta:i+10+dta, :]
    for j, xij in enumerate(xi):
        jrange = (j-np.arange(j+10-dta, j+10+dta))**2
        if not jrange.size:
            continue
        ymij = ymi[:, j+10-dta:j+10+dta]
        ypij = ypi[:, j+10-dta:j+10+dta]
        imesh, jmesh = np.meshgrid(irange, jrange, indexing='ij')
        dist = np.sqrt(imesh+jmesh)
        # Mark pixels outside the noise band as "not found";
        # arrays need the element-wise | operator, not `or`.
        dist[(ymij > xij) | (xij > ypij)] = 10
        mindist = dist.min()
        dist_min.append(mindist)
        dist_to_agree[i,j] = mindist
print(dist_to_agree)
@Ciaran: meshgrid is roughly a vectorized equivalent of two nested loops. Below are two equivalent ways of calculating the dist, one with loops and one with meshgrid plus NumPy vector operations. The second one is six times faster.
import numpy as np
from math import sqrt
DTA=5
i=100
j=200
def func1():
    dist1=np.zeros((DTA*2,DTA*2))
    for ii in range((i+10)-DTA,(i+10)+DTA):
        for jj in range((j+10)-DTA,(j+10)+DTA):
            dist1[ii-((i+10)-DTA),jj-((j+10)-DTA)] = sqrt(((i)-(ii))**2 + ((j) - (jj))**2)
    return dist1
def func2():
    dist2=np.zeros((DTA*2,DTA*2))
    # With the default indexing='xy' the result is the transpose of dist1;
    # pass indexing='ij' to np.meshgrid to get the same orientation.
    ii, jj = np.meshgrid(np.arange((i+10)-DTA,(i+10)+DTA),
                         np.arange((j+10)-DTA,(j+10)+DTA))
    dist2=np.sqrt((i-ii)**2+(j-jj)**2)
    return dist2
This is how ii and jj matrices look after meshgrid operation
ii=
[[105 106 107 108 109 110 111 112 113 114]
[105 106 107 108 109 110 111 112 113 114]
[105 106 107 108 109 110 111 112 113 114]
[105 106 107 108 109 110 111 112 113 114]
[105 106 107 108 109 110 111 112 113 114]
[105 106 107 108 109 110 111 112 113 114]
[105 106 107 108 109 110 111 112 113 114]
[105 106 107 108 109 110 111 112 113 114]
[105 106 107 108 109 110 111 112 113 114]
[105 106 107 108 109 110 111 112 113 114]]
jj=
[[205 205 205 205 205 205 205 205 205 205]
[206 206 206 206 206 206 206 206 206 206]
[207 207 207 207 207 207 207 207 207 207]
[208 208 208 208 208 208 208 208 208 208]
[209 209 209 209 209 209 209 209 209 209]
[210 210 210 210 210 210 210 210 210 210]
[211 211 211 211 211 211 211 211 211 211]
[212 212 212 212 212 212 212 212 212 212]
[213 213 213 213 213 213 213 213 213 213]
[214 214 214 214 214 214 214 214 214 214]]
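One detail worth checking when replacing nested loops with meshgrid: the default indexing='xy' makes the first returned array vary along columns, as the printout above shows, whereas indexing='ij' matches the dist[ii, jj] layout of a nested loop. A small sketch with illustrative sizes and names:

```python
import numpy as np

a = np.arange(3)       # stands in for the ii range
b = np.arange(10, 14)  # stands in for the jj range

# Default 'xy' indexing: result shape is (len(b), len(a)).
ii_xy, jj_xy = np.meshgrid(a, b)

# 'ij' indexing: result shape is (len(a), len(b)), like dist[ii, jj] in a loop.
ii_ij, jj_ij = np.meshgrid(a, b, indexing="ij")

# Nested-loop reference filling loop[p, q] from a[p] and b[q].
loop = np.zeros((len(a), len(b)))
for p, av in enumerate(a):
    for q, bv in enumerate(b):
        loop[p, q] = av**2 + bv**2

print(np.array_equal(loop, ii_ij**2 + jj_ij**2))      # same orientation
print(np.array_equal(loop, (ii_xy**2 + jj_xy**2).T))  # equal after transposing
```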
For loops are very slow in pure Python, and four nested loops will be very slow. Cython does wonders for loop speed. You can also try vectorization. While I'm not sure I fully understand what you are trying to do, you may be able to vectorize at least some of the operations, especially the two innermost loops.
So instead of the two ii, jj loops over
(y[ii,jj]-limit) <= x[i,j] <= (y[ii,jj]+limit)
you can do something like
ii, jj = np.meshgrid(np.arange((i+10)-DTA,(i+10)+DTA), np.arange((j+10)-DTA,(j+10)+DTA), indexing='ij')
window = y[(i+10)-DTA:(i+10)+DTA, (j+10)-DTA:(j+10)+DTA]
t = (window-limit <= x[i,j]) & (x[i,j] <= window+limit)
Dist = np.sqrt((i-ii)**2 + (j-jj)**2)
np.min(Dist[t]) will then give the minimum distance for element i,j (provided at least one match exists).
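As a self-contained check of the masking idea, the sketch below computes the minimum distance for a single pixel both with the two inner loops and with a vectorized window, then compares them over a whole small image. The padding of 10 and the fallback value of 10 follow the question; the function names, the smaller DTA of 5, and the random test data are assumptions made for illustration.

```python
import numpy as np


def min_dist_loop(x, y, i, j, dta, limit, pad=10, fail=10.0):
    """Reference version: two explicit loops over the search window."""
    best = fail
    for ii in range(i + pad - dta, i + pad + dta):
        for jj in range(j + pad - dta, j + pad + dta):
            if y[ii, jj] - limit <= x[i, j] <= y[ii, jj] + limit:
                best = min(best, np.sqrt((i - ii) ** 2 + (j - jj) ** 2))
    return best


def min_dist_vec(x, y, i, j, dta, limit, pad=10, fail=10.0):
    """Vectorized version: mask the window, then take one minimum."""
    rows = np.arange(i + pad - dta, i + pad + dta)
    cols = np.arange(j + pad - dta, j + pad + dta)
    ii, jj = np.meshgrid(rows, cols, indexing="ij")
    window = y[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
    # Same test as y-limit <= x <= y+limit, applied to the whole window at once.
    match = np.abs(window - x[i, j]) <= limit
    dist = np.sqrt((i - ii) ** 2 + (j - jj) ** 2)
    return min(fail, dist[match].min()) if match.any() else fail


# Illustrative data: an 8x8 reference image and a measured image padded by 10.
rng = np.random.default_rng(0)
x = rng.integers(0, 1000, size=(8, 8)).astype(float)
y = rng.integers(0, 1000, size=(28, 28)).astype(float)

agree = all(
    np.isclose(min_dist_loop(x, y, i, j, 5, 130), min_dist_vec(x, y, i, j, 5, 130))
    for i in range(8)
    for j in range(8)
)
print(agree)
```

The window here stays inside y for every pixel; for larger DTA values the slice bounds would need clipping, since negative start indices wrap around in NumPy.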
The NumbaPro compiler offers GPU acceleration. Unfortunately, it isn't free.
http://docs.continuum.io/numbapro/
