I have a huge dataset containing the coordinates of particles. To split the data into training and test sets, I want to divide the space into many subspaces. I did this with a for loop in each direction (x, y, z), but the code takes very long to run and is not efficient enough, especially for large datasets:
particle_boxes = []
init = 0
final = 50
number_box = 5
for i in range(number_box):
    for j in range(number_box):
        for k in range(number_box):
            index_particle = (df_particles['X'].between(init+i*final, final+final*i)
                              & df_particles['Y'].between(init+j*final, final+final*j)
                              & df_particles['Z'].between(init+k*final, final+final*k))
            particle_boxes.append(df_particles[index_particle])
where init and final define the box size and df_particles contains every particle coordinate (x, y, z).
After running this, particle_boxes contains 125 (number_box^3) equally spaced subboxes.
Is there any way to write this code more efficiently?
Note on efficiency
I conducted a number of tests using other tricks and nothing changed substantially. This is roughly as good as any other technique I used.
I'm curious to see if anyone else comes up with something an order of magnitude faster.
Sample data
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df_particles = pd.DataFrame(
    np.random.randint(250, size=(1000, 3)),
    columns=['X', 'Y', 'Z']
)
Solution
Construct an array a that represents your boundaries
a = np.array([50, 100, 150, 200, 250])
Then use searchsorted to create the individual dimensional bins
x_bin = a.searchsorted(df_particles['X'].to_numpy())
y_bin = a.searchsorted(df_particles['Y'].to_numpy())
z_bin = a.searchsorted(df_particles['Z'].to_numpy())
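Which bin a coordinate lands in depends on which side of an edge searchsorted treats as closed. The sketch below uses a handful of hand-picked values (the names edges and coords are only for illustration) to show the default side='left' next to side='right'. Either way each particle gets exactly one bin, whereas Series.between is inclusive at both ends by default, so in the original loop a particle sitting exactly on a boundary ends up in two neighbouring boxes.

```python
import numpy as np

# Upper edges of five 50-unit-wide bins, matching the array a above.
edges = np.array([50, 100, 150, 200, 250])

# A few hand-picked coordinates, including values that sit exactly on an edge.
coords = np.array([0, 49, 50, 51, 100, 249, 250])

# side='left' (the default): a value equal to an edge stays in the lower bin.
left = edges.searchsorted(coords)
print(left)    # [0 0 0 1 1 4 4]

# side='right': a value equal to an edge moves up into the next bin.
right = edges.searchsorted(coords, side='right')
print(right)   # [0 0 1 1 2 4 5]
```

With side='right' a coordinate of exactly 250 would fall into a sixth bin (index 5), so the default is the safer choice if the data can touch the upper boundary.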
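Use groupby on the three bins. Iterating over a groupby yields (key, sub-frame) pairs, so unpacking it into a tuple and passing that to dict gives a mapping from each bin triple to its sub-frame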
g = dict((*df_particles.groupby([x_bin, y_bin, z_bin]),))
We can see the first zone
g[(0, 0, 0)]
      X   Y   Z
30    2  36  47
194   0  34  45
276  46  37  34
364  10  16  21
378   4  15   4
429  12  34  13
645  36  17   5
743  18  36  13
876  46  11  34
and the last
g[(4, 4, 4)]
       X    Y    Z
87   223  236  213
125  206  241  249
174  218  247  221
234  222  204  237
298  208  211  225
461  234  204  238
596  209  229  241
731  210  220  242
761  225  215  231
762  206  241  240
840  211  241  238
846  212  242  241
899  249  203  228
970  214  217  232
981  236  216  248
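Since the stated goal is a train/test split, it may be enough to label every particle with a single box id instead of building 125 sub-frames. The following is a minimal sketch of that idea in plain NumPy, not part of the answer above: the names coords, box_id and test_boxes, the seed, and the choice of holding out 25 boxes are all assumptions made for illustration. It bins with integer division, so the bins are half-open ([0, 50), [50, 100), ...), which differs slightly from the searchsorted version at the edges.

```python
import numpy as np

rng = np.random.default_rng(0)                  # seed chosen only for repeatability
coords = rng.integers(0, 250, size=(1000, 3))   # stand-in for the X, Y, Z columns

box_size = 50     # edge length of one sub-box
number_box = 5    # boxes per axis

# Integer division gives a per-axis bin index in 0..4.
bins = coords // box_size

# Collapse the three per-axis indices into one box id in 0..124.
box_id = np.ravel_multi_index(tuple(bins.T), (number_box,) * 3)

# Hold out 25 of the 125 boxes (20% of the boxes, not of the particles).
all_boxes = np.arange(number_box ** 3)
test_boxes = rng.choice(all_boxes, size=25, replace=False)
is_test = np.isin(box_id, test_boxes)

train, test = coords[~is_test], coords[is_test]
print(len(train), len(test))
```

Splitting by box rather than by row keeps the test particles spatially separated from the training particles, at the cost of a test fraction that is only approximately 20% of the rows.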
Instead of multiple nested for loops, consider a single loop over itertools.product. But of course, avoid loops altogether if possible, as @piRSquared shows:
from itertools import product

particle_boxes = []
for i, j, k in product(range(number_box), range(number_box), range(number_box)):
    index_particle = (df_particles['X'].between(init+i*final, final+final*i) &
                      df_particles['Y'].between(init+j*final, final+final*j) &
                      df_particles['Z'].between(init+k*final, final+final*k))
    particle_boxes.append(df_particles[index_particle])
Alternatively, with a list comprehension:
def sub_df(i, j, k):
    index_particle = (df_particles['X'].between(init+i*final, final+final*i) &
                      df_particles['Y'].between(init+j*final, final+final*j) &
                      df_particles['Z'].between(init+k*final, final+final*k))
    return df_particles[index_particle]

particle_boxes = [sub_df(i, j, k)
                  for i, j, k in product(range(number_box), range(number_box), range(number_box))]
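As a small check that itertools.product really is a drop-in replacement for the nested loops, the sketch below compares the index triples produced both ways. It also uses the repeat argument, which saves writing range(number_box) three times. Note that this only tidies the code: every iteration still scans the whole frame, so the running time should stay about the same.

```python
from itertools import product

number_box = 5

# Index triples produced by three explicitly nested loops.
nested = [(i, j, k)
          for i in range(number_box)
          for j in range(number_box)
          for k in range(number_box)]

# The same triples, in the same order, from a single iterator.
flat = list(product(range(number_box), repeat=3))

print(nested == flat, len(flat))   # True 125
```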
Have a look at the train_test_split function in the scikit-learn library.
I think it provides almost exactly the functionality you need.
The source code is available on GitHub.
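For reference, a minimal usage sketch, assuming scikit-learn is installed; the random coordinates, the 80/20 ratio and the seeds are placeholders rather than anything from the question. Keep in mind that train_test_split shuffles rows at random, so it does not by itself produce spatially separated sets; scikit-learn's group-aware splitters (for example GroupShuffleSplit) may be closer to that if each particle carries a box label.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
coords = rng.integers(0, 250, size=(1000, 3))   # stand-in for the particle coordinates

# Random 80/20 split of the rows; random_state makes the shuffle repeatable.
train, test = train_test_split(coords, test_size=0.2, random_state=0)

print(train.shape, test.shape)   # (800, 3) (200, 3)
```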
Program description:
Find all the prime numbers between 1 and 4,027 and print them in a table which
"reads down", using as few rows as possible, and using as few sheets of paper
as possible. (This is because I have to print them out on paper to turn it in.) All numbers should be right-justified in their column. The height
of the columns should all be the same, except for perhaps the last column,
which might have a few blank entries towards its bottom row.
The plan for my first function is to find all prime numbers in the range above and put them in a list. Then I want my second function to display the list in a table that reads from top to bottom.
 2  23  59
 3  29  61
 5  31  67
 7  37  71
11  41  73
13  43  79
17  47  83
19  53  89
etc.
This is all I've been able to come up with myself:
def findPrimes(n):
    """ Adds calculated prime numbers to a list. """
    prime_list = list()
    for number in range(1, n + 1):
        prime = True
        for i in range(2, number):
            if(number % i == 0):
                prime = False
        if prime:
            prime_list.append(number)
    return prime_list

def displayPrimes():
    pass

print(findPrimes(4027))
I'm not sure how to make a row/column display in Python. I remember using Java in my previous class and we had to use a for loop inside a for loop I believe. Do I have to do something similar to that?
Although I often don't answer questions where the original poster hasn't made an attempt to solve the problem themselves, I decided to make an exception for yours, mostly because I found it an interesting (and surprisingly challenging) problem that required solving a number of somewhat tricky sub-problems.
I also optimized your find_primes() function slightly using two relatively well-known shortcuts: only divisors up to the square root of the candidate are tested, and the search stops at the first divisor found.
For testing and demo purposes, I made the tables only 15 rows high to force more than one page to be generated as shown in the output at the end.
from itertools import zip_longest
import locale
import math

locale.setlocale(locale.LC_ALL, '')  # enable locale-specific formatting

def zip_discard(*iterables, _NULL=object()):
    """ Like zip_longest() but doesn't fill out all rows to equal length.
        https://stackoverflow.com/questions/38054593/zip-longest-without-fillvalue
    """
    return [[entry for entry in iterable if entry is not _NULL]
            for iterable in zip_longest(*iterables, fillvalue=_NULL)]

def grouper(n, seq):
    """ Group elements in sequence into groups of "n" items. """
    for i in range(0, len(seq), n):
        yield seq[i:i+n]

def tabularize(width, height, numbers):
    """ Print list of numbers in column-major tabular form given the dimensions
        of the table in characters (rows and columns). Will create multiple
        tables if required to display all numbers.
    """
    # Determine number of chars needed to hold longest formatted numeric value
    gap = 2  # including space between numbers
    col_width = len('{:n}'.format(max(numbers))) + gap

    # Determine number of columns that will fit within the table's width.
    num_cols = width // col_width
    chunk_size = num_cols * height  # maximum numbers in each table

    for i, chunk in enumerate(grouper(chunk_size, numbers), start=1):
        print('---- Page {} ----'.format(i))
        num_rows = int(math.ceil(len(chunk) / num_cols))  # rounded up
        table = zip_discard(*grouper(num_rows, chunk))
        for row in table:
            print(''.join(('{:{width}n}'.format(num, width=col_width)
                           for num in row)))

def find_primes(n):
    """ Create list of prime numbers from 1 to n. """
    prime_list = []
    for number in range(1, n+1):
        for i in range(2, int(math.sqrt(number)) + 1):
            if not number % i:  # Evenly divisible?
                break  # Not prime.
        else:
            prime_list.append(number)
    return prime_list

primes = find_primes(4027)
tabularize(80, 15, primes)
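One detail worth knowing: because find_primes() starts its range at 1, it also reports 1, which is not prime by the usual definition (it shows up as the first table entry). If that matters for the assignment, one alternative is a sieve of Eratosthenes, sketched below. This is a different technique from the trial division used above, and the name sieve_primes is just a placeholder.

```python
import math


def sieve_primes(n):
    """Return all primes p with 2 <= p <= n using the sieve of Eratosthenes."""
    if n < 2:
        return []
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False            # 0 and 1 are not prime
    for p in range(2, math.isqrt(n) + 1):
        if is_prime[p]:
            # Cross out multiples of p, starting at p*p; smaller multiples
            # were already crossed out by smaller primes.
            for multiple in range(p * p, n + 1, p):
                is_prime[multiple] = False
    return [number for number, flag in enumerate(is_prime) if flag]


primes = sieve_primes(4027)
print(len(primes), primes[:5], primes[-1])   # 557 [2, 3, 5, 7, 11] 4027
```

The resulting list can be passed to tabularize() unchanged; math.isqrt needs Python 3.8 or later.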
Output:
---- Page 1 ----
      1     47    113    197    281    379    463    571    659    761    863
      2     53    127    199    283    383    467    577    661    769    877
      3     59    131    211    293    389    479    587    673    773    881
      5     61    137    223    307    397    487    593    677    787    883
      7     67    139    227    311    401    491    599    683    797    887
     11     71    149    229    313    409    499    601    691    809    907
     13     73    151    233    317    419    503    607    701    811    911
     17     79    157    239    331    421    509    613    709    821    919
     19     83    163    241    337    431    521    617    719    823    929
     23     89    167    251    347    433    523    619    727    827    937
     29     97    173    257    349    439    541    631    733    829    941
     31    101    179    263    353    443    547    641    739    839    947
     37    103    181    269    359    449    557    643    743    853    953
     41    107    191    271    367    457    563    647    751    857    967
     43    109    193    277    373    461    569    653    757    859    971
---- Page 2 ----
    977  1,069  1,187  1,291  1,427  1,511  1,613  1,733  1,867  1,987  2,087
    983  1,087  1,193  1,297  1,429  1,523  1,619  1,741  1,871  1,993  2,089
    991  1,091  1,201  1,301  1,433  1,531  1,621  1,747  1,873  1,997  2,099
    997  1,093  1,213  1,303  1,439  1,543  1,627  1,753  1,877  1,999  2,111
  1,009  1,097  1,217  1,307  1,447  1,549  1,637  1,759  1,879  2,003  2,113
  1,013  1,103  1,223  1,319  1,451  1,553  1,657  1,777  1,889  2,011  2,129
  1,019  1,109  1,229  1,321  1,453  1,559  1,663  1,783  1,901  2,017  2,131
  1,021  1,117  1,231  1,327  1,459  1,567  1,667  1,787  1,907  2,027  2,137
  1,031  1,123  1,237  1,361  1,471  1,571  1,669  1,789  1,913  2,029  2,141
  1,033  1,129  1,249  1,367  1,481  1,579  1,693  1,801  1,931  2,039  2,143
  1,039  1,151  1,259  1,373  1,483  1,583  1,697  1,811  1,933  2,053  2,153
  1,049  1,153  1,277  1,381  1,487  1,597  1,699  1,823  1,949  2,063  2,161
  1,051  1,163  1,279  1,399  1,489  1,601  1,709  1,831  1,951  2,069  2,179
  1,061  1,171  1,283  1,409  1,493  1,607  1,721  1,847  1,973  2,081  2,203
  1,063  1,181  1,289  1,423  1,499  1,609  1,723  1,861  1,979  2,083  2,207
---- Page 3 ----
  2,213  2,333  2,423  2,557  2,687  2,789  2,903  3,037  3,181  3,307  3,413
  2,221  2,339  2,437  2,579  2,689  2,791  2,909  3,041  3,187  3,313  3,433
  2,237  2,341  2,441  2,591  2,693  2,797  2,917  3,049  3,191  3,319  3,449
  2,239  2,347  2,447  2,593  2,699  2,801  2,927  3,061  3,203  3,323  3,457
  2,243  2,351  2,459  2,609  2,707  2,803  2,939  3,067  3,209  3,329  3,461
  2,251  2,357  2,467  2,617  2,711  2,819  2,953  3,079  3,217  3,331  3,463
  2,267  2,371  2,473  2,621  2,713  2,833  2,957  3,083  3,221  3,343  3,467
  2,269  2,377  2,477  2,633  2,719  2,837  2,963  3,089  3,229  3,347  3,469
  2,273  2,381  2,503  2,647  2,729  2,843  2,969  3,109  3,251  3,359  3,491
  2,281  2,383  2,521  2,657  2,731  2,851  2,971  3,119  3,253  3,361  3,499
  2,287  2,389  2,531  2,659  2,741  2,857  2,999  3,121  3,257  3,371  3,511
  2,293  2,393  2,539  2,663  2,749  2,861  3,001  3,137  3,259  3,373  3,517
  2,297  2,399  2,543  2,671  2,753  2,879  3,011  3,163  3,271  3,389  3,527
  2,309  2,411  2,549  2,677  2,767  2,887  3,019  3,167  3,299  3,391  3,529
  2,311  2,417  2,551  2,683  2,777  2,897  3,023  3,169  3,301  3,407  3,533
---- Page 4 ----
  3,539  3,581  3,623  3,673  3,719  3,769  3,823  3,877  3,919  3,967  4,019
  3,541  3,583  3,631  3,677  3,727  3,779  3,833  3,881  3,923  3,989  4,021
  3,547  3,593  3,637  3,691  3,733  3,793  3,847  3,889  3,929  4,001  4,027
  3,557  3,607  3,643  3,697  3,739  3,797  3,851  3,907  3,931  4,003
  3,559  3,613  3,659  3,701  3,761  3,803  3,853  3,911  3,943  4,007
  3,571  3,617  3,671  3,709  3,767  3,821  3,863  3,917  3,947  4,013
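Stripped of paging and locale formatting, the "reads down" layout comes down to slicing the list into columns and then emitting the r-th entry of each column on row r. The sketch below is a simplified illustration of that idea, not a replacement for tabularize(): the function name column_major_lines is made up here, and it returns the rows as strings instead of printing them. The sample input is the first 24 primes, laid out in three columns as in the question.

```python
def column_major_lines(numbers, num_cols, col_width):
    """Return the rows of a table that reads down each column, then across."""
    num_rows = -(-len(numbers) // num_cols)      # ceiling division
    # Slice the list into columns of num_rows items; the last may be shorter.
    columns = [numbers[i:i + num_rows]
               for i in range(0, len(numbers), num_rows)]
    lines = []
    for r in range(num_rows):
        # Take the r-th entry of every column tall enough to have one.
        row = [col[r] for col in columns if r < len(col)]
        lines.append(''.join(f'{value:>{col_width}}' for value in row))
    return lines


sample = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37,
          41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89]
lines = column_major_lines(sample, num_cols=3, col_width=4)
for line in lines:
    print(line)
```

The first printed row holds 2, 23 and 59 and the last holds 19, 53 and 89, each right-justified in a four-character column. When the count is not a multiple of the column count, only the last column comes up short, which is the behaviour the assignment asks for.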