Sort pivot/dataframe without All row pandas/python

I created a dataframe with the help of a pivot, and I have:
name x y z All
A 155 202 218 575
C 206 149 45 400
B 368 215 275 858
Total 729 566 538 1833
I would like to sort by column "All" without taking the "Total" row into account. I am using:
df.sort_values(by = ["All"], ascending = False)
Thank you in advance!

If the Total row is the last one, you can sort other rows and then concat the last row:
df = pd.concat([df.iloc[:-1, :].sort_values(by="All"), df.iloc[-1:, :]])
print(df)
Prints:
name x y z All
C 206 149 45 400
A 155 202 218 575
B 368 215 275 858
Total 729 566 538 1833

You can try the following, but be careful: on pandas 1.4+ it raises a FutureWarning, and DataFrame.append was removed in pandas 2.0 (use pd.concat there instead):
df = df.iloc[:-1,:].sort_values('All',ascending=False).append(df.iloc[-1,:])
This outputs:
name x y z All
2 B 368 215 275 858
0 A 155 202 218 575
1 C 206 149 45 400
3 Total 729 566 538 1833

You can get the sorted order without Total (assuming here the last row), then index by position:
import numpy as np
idx = np.argsort(df['All'].iloc[:-1])
df2 = df.iloc[np.r_[idx[::-1], len(df)-1]]
NB. As we are sorting only an indexer here, this should be very fast.
output:
name x y z All
2 B 368 215 275 858
0 A 155 202 218 575
1 C 206 149 45 400
3 Total 729 566 538 1833

You can simply exclude the last row before sorting (note that the result then no longer contains the "Total" row):
df.iloc[:-1].sort_values(by = ["All"], ascending = False)
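If the position of the "Total" row is not guaranteed, one option is to select it by label rather than by position. The following is a minimal sketch, assuming the labels live in a regular "name" column as in the printed table (with a real pivot they may be in the index instead); sort_keep_total_last is a hypothetical helper name, not a pandas function.

```python
import pandas as pd

# Sample data copied from the question; "name" is assumed to be a regular column.
df = pd.DataFrame(
    {
        "name": ["A", "C", "B", "Total"],
        "x": [155, 206, 368, 729],
        "y": [202, 149, 215, 566],
        "z": [218, 45, 275, 538],
        "All": [575, 400, 858, 1833],
    }
)


def sort_keep_total_last(frame, by="All", label="Total", ascending=False):
    """Sort every row except the one labelled `label`, then put that row last."""
    is_total = frame["name"] == label
    body = frame[~is_total].sort_values(by=by, ascending=ascending)
    return pd.concat([body, frame[is_total]])


result = sort_keep_total_last(df)
print(result)
```

Selecting by label costs one extra comparison over the column, but it no longer depends on "Total" being the last row.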

Related

Splitting data into subsamples

I have a huge dataset which contains coordinates of particles. In order to split the data into test and training set I want to divide the space into many subspaces; I did this with a for-loop in every direction (x,y,z) but when running the code it takes very long and is not efficient enough especially for large datasets:
particle_boxes = []
init = 0
final = 50
number_box = 5
for i in range(number_box):
    for j in range(number_box):
        for k in range(number_box):
            index_particle = (df_particles['X'].between(init+i*final, final+final*i)
                              & df_particles['Y'].between(init+j*final, final+final*j)
                              & df_particles['Z'].between(init+k*final, final+final*k))
            particle_boxes.append(df_particles[index_particle])
where init and final define the box size, df_particles contains every particle coordinate (x,y,z).
After running this, particle_boxes contains 125 (number_box^3) equally spaced sub-boxes.
Is there any way to write this code more efficiently?
Note on efficiency
I conducted a number of tests using other tricks and nothing changed substantially. This is roughly as good as any other technique I used.
I'm curious to see if anyone else comes up with something an order of magnitude faster.
Sample data
import numpy as np
import pandas as pd
np.random.seed([3, 1415])
df_particles = pd.DataFrame(
    np.random.randint(250, size=(1000, 3)),
    columns=['X', 'Y', 'Z']
)
Solution
Construct an array a that represents your boundaries
a = np.array([50, 100, 150, 200, 250])
Then use searchsorted to create the individual dimensional bins
x_bin = a.searchsorted(df_particles['X'].to_numpy())
y_bin = a.searchsorted(df_particles['Y'].to_numpy())
z_bin = a.searchsorted(df_particles['Z'].to_numpy())
Use groupby on the three bins. I used trickery to get that into a dict
g = dict((*df_particles.groupby([x_bin, y_bin, z_bin]),))
We can see the first zone
g[(0, 0, 0)]
X Y Z
30 2 36 47
194 0 34 45
276 46 37 34
364 10 16 21
378 4 15 4
429 12 34 13
645 36 17 5
743 18 36 13
876 46 11 34
and the last
g[(4, 4, 4)]
X Y Z
87 223 236 213
125 206 241 249
174 218 247 221
234 222 204 237
298 208 211 225
461 234 204 238
596 209 229 241
731 210 220 242
761 225 215 231
762 206 241 240
840 211 241 238
846 212 242 241
899 249 203 228
970 214 217 232
981 236 216 248
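The steps above can be put together as one self-contained sketch. It assumes the same box edges as in the answer and uses a different random seed, so the rows will not match the printed zones; split_into_boxes is a hypothetical helper name. Note one difference from the question's loop: searchsorted with its default side="left" assigns a value equal to an edge to the lower box only, whereas between() is inclusive at both ends and so puts boundary particles into two boxes.

```python
import numpy as np
import pandas as pd

# Synthetic particle coordinates in [0, 250); the seed is arbitrary.
rng = np.random.default_rng(0)
df_particles = pd.DataFrame(
    rng.integers(0, 250, size=(1000, 3)), columns=["X", "Y", "Z"]
)


def split_into_boxes(frame, edges):
    """Return {(x_bin, y_bin, z_bin): sub-frame} using one searchsorted per axis."""
    bins = [edges.searchsorted(frame[axis].to_numpy()) for axis in ["X", "Y", "Z"]]
    return {key: group for key, group in frame.groupby(bins)}


edges = np.array([50, 100, 150, 200, 250])  # upper edge of each box
boxes = split_into_boxes(df_particles, edges)
print(len(boxes), "non-empty boxes")
```

Each particle ends up in exactly one box, so the sub-frames partition the data without any Python-level loop over the 125 boxes.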
Instead of multiple nested for loops, consider one loop using itertools.product. But of course avoid loops altogether if possible, as @piRSquared shows:
from itertools import product
particle_boxes = []
for i, j, k in product(range(number_box), range(number_box), range(number_box)):
    index_particle = (df_particles['X'].between(init+i*final, final+final*i) &
                      df_particles['Y'].between(init+j*final, final+final*j) &
                      df_particles['Z'].between(init+k*final, final+final*k))
    particle_boxes.append(df_particles[index_particle])
Alternatively, with a list comprehension:
def sub_df(i, j, k):
    index_particle = (df_particles['X'].between(init+i*final, final+final*i) &
                      df_particles['Y'].between(init+j*final, final+final*j) &
                      df_particles['Z'].between(init+k*final, final+final*k))
    return df_particles[index_particle]
particle_boxes = [sub_df(i, j, k) for i, j, k in product(range(number_box), range(number_box), range(number_box))]
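As a small aside on itertools.product: its repeat argument can replace the three identical range() arguments. The sketch below only checks that the flattened iteration visits the same (i, j, k) triples, in the same order, as the nested loops from the question; it does not touch the particle data.

```python
from itertools import product

number_box = 5

# Index triples as produced by three nested for loops.
nested = [
    (i, j, k)
    for i in range(number_box)
    for j in range(number_box)
    for k in range(number_box)
]

# The same triples from a single iterator.
flat = list(product(range(number_box), repeat=3))

print(len(flat), flat[:3])
```

This tidies the code but does not change the amount of work: the mask is still evaluated once per box.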
Have a look at the train_test_split function available in the scikit-learn library.
I think it provides almost exactly the functionality you need.
Its source code is available on GitHub.

Python print string alignment

I am printing some values in a loop in Python. My current output is as follows:
0 Data Count: 249 7348 249 4469 2768 261 20 126
1 Data Count: 288 11 288 48 2284 598 137 408
2 Data Count: 808 999 808 2896 32739 138 202 678
3 Data Count: 140 26 140 2688 8054 884 433 987
What I'd like is for all values in each column to align, despite differing character/number counts in some, to make it easier to read.
The pseudo code behind this is as follows:
for i in range(0,3):
    print(i, " Data Count: ", Count_A, " ", Count_B, " ", Count_C, " ", Count_D, " ", Count_E, " ", Count_F, " ", Count_G, " ", Count_H)
Thanks in advance everyone!
You could use format string justification:
from random import randint
for i in range(5):
    data = [randint(0, 1000) for j in range(5)]
    print("{:5} {:5} {:5} {:5}".format(*data))
output:
92 460 72 630
837 214 118 677
906 328 102 320
895 998 177 922
651 742 215 938
According to the format specification from Python docs
With the % string formatting operator, the minimum width of output is specified in a placeholder as a number before the data type (the full format of a placeholder is %[key][flags][width][.precision][length type]conversion type). If the result is shorter, it will be left-padded to the specified length:
from random import randint
for i in range(5):
    data = [randint(0, 1000) for j in range(5)]
    print("%5d %5d %5d %5d %5d" % tuple(data))
gives:
946 937 544 636 871
232 860 704 877 716
868 849 851 488 739
419 381 695 909 518
570 756 467 351 537
(code adapted from @andreihondrari's answer)
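Both answers hard-code a width of 5. When the widest value is not known in advance, the width can be computed per column first. The following is a minimal sketch using the numbers from the question; format_row is a hypothetical helper name, and the two-space separator is an arbitrary choice.

```python
# Rows copied from the question's output.
rows = [
    [249, 7348, 249, 4469, 2768, 261, 20, 126],
    [288, 11, 288, 48, 2284, 598, 137, 408],
    [808, 999, 808, 2896, 32739, 138, 202, 678],
    [140, 26, 140, 2688, 8054, 884, 433, 987],
]

# Width of each column = length of its longest entry.
widths = [max(len(str(value)) for value in column) for column in zip(*rows)]


def format_row(index, counts):
    """Right-justify every count to the width of its column."""
    cells = "  ".join(
        "{:>{w}}".format(count, w=width) for count, width in zip(counts, widths)
    )
    return "{} Data Count: {}".format(index, cells)


lines = [format_row(i, counts) for i, counts in enumerate(rows)]
print("\n".join(lines))
```

This needs all rows before printing the first one, so it suits data that is already collected rather than values printed as they are produced.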

How to display a sequence of numbers in column-major order?

Program description:
Find all the prime numbers between 1 and 4,027 and print them in a table which
"reads down", using as few rows as possible, and using as few sheets of paper
as possible. (This is because I have to print them out on paper to turn it in.) All numbers should be right-justified in their column. The height
of the columns should all be the same, except for perhaps the last column,
which might have a few blank entries towards its bottom row.
The plan for my first function is to find all prime numbers between the range above and put them in a list. Then I want my second function to display the list in a table that reads up to down.
2 23 59
3 29 61
5 31 67
7 37 71
11 41 73
13 43 79
17 47 83
19 53 89
etc.
This is all I've been able to come up with myself:
def findPrimes(n):
    """ Adds calculated prime numbers to a list. """
    prime_list = list()
    for number in range(1, n + 1):
        prime = True
        for i in range(2, number):
            if(number % i == 0):
                prime = False
        if prime:
            prime_list.append(number)
    return prime_list
def displayPrimes():
    pass
print(findPrimes(4027))
I'm not sure how to make a row/column display in Python. I remember using Java in my previous class and we had to use a for loop inside a for loop I believe. Do I have to do something similar to that?
Although I frequently don't answer questions where the original poster hasn't even made an attempt to solve the problem themselves, I decided to make an exception of yours—mostly because I found it an interesting (and surprisingly challenging) problem that required solving a number of somewhat tricky sub-problems.
I also optimized your find_primes() function slightly by taking advantage of some relatively well-known computational shortcuts for calculating them.
For testing and demo purposes, I made the tables only 15 rows high to force more than one page to be generated as shown in the output at the end.
from itertools import zip_longest
import locale
import math
locale.setlocale(locale.LC_ALL, '')  # enable locale-specific formatting
def zip_discard(*iterables, _NULL=object()):
    """ Like zip_longest() but doesn't fill out all rows to equal length.
        https://stackoverflow.com/questions/38054593/zip-longest-without-fillvalue
    """
    return [[entry for entry in iterable if entry is not _NULL]
            for iterable in zip_longest(*iterables, fillvalue=_NULL)]
def grouper(n, seq):
    """ Group elements in sequence into groups of "n" items. """
    for i in range(0, len(seq), n):
        yield seq[i:i+n]
def tabularize(width, height, numbers):
    """ Print list of numbers in column-major tabular form given the dimensions
        of the table in characters (rows and columns). Will create multiple
        tables if required to display all numbers.
    """
    # Determine number of chars needed to hold longest formatted numeric value
    gap = 2  # including space between numbers
    col_width = len('{:n}'.format(max(numbers))) + gap
    # Determine number of columns that will fit within the table's width.
    num_cols = width // col_width
    chunk_size = num_cols * height  # maximum numbers in each table
    for i, chunk in enumerate(grouper(chunk_size, numbers), start=1):
        print('---- Page {} ----'.format(i))
        num_rows = int(math.ceil(len(chunk) / num_cols))  # rounded up
        table = zip_discard(*grouper(num_rows, chunk))
        for row in table:
            print(''.join(('{:{width}n}'.format(num, width=col_width)
                           for num in row)))
def find_primes(n):
    """ Create list of prime numbers from 1 to n. """
    prime_list = []
    # Starts at 1, so 1 appears in the output below although it is not prime.
    for number in range(1, n+1):
        for i in range(2, int(math.sqrt(number)) + 1):
            if not number % i:  # Evenly divisible?
                break  # Not prime.
        else:
            prime_list.append(number)
    return prime_list
primes = find_primes(4027)
tabularize(80, 15, primes)
Output:
---- Page 1 ----
1 47 113 197 281 379 463 571 659 761 863
2 53 127 199 283 383 467 577 661 769 877
3 59 131 211 293 389 479 587 673 773 881
5 61 137 223 307 397 487 593 677 787 883
7 67 139 227 311 401 491 599 683 797 887
11 71 149 229 313 409 499 601 691 809 907
13 73 151 233 317 419 503 607 701 811 911
17 79 157 239 331 421 509 613 709 821 919
19 83 163 241 337 431 521 617 719 823 929
23 89 167 251 347 433 523 619 727 827 937
29 97 173 257 349 439 541 631 733 829 941
31 101 179 263 353 443 547 641 739 839 947
37 103 181 269 359 449 557 643 743 853 953
41 107 191 271 367 457 563 647 751 857 967
43 109 193 277 373 461 569 653 757 859 971
---- Page 2 ----
977 1,069 1,187 1,291 1,427 1,511 1,613 1,733 1,867 1,987 2,087
983 1,087 1,193 1,297 1,429 1,523 1,619 1,741 1,871 1,993 2,089
991 1,091 1,201 1,301 1,433 1,531 1,621 1,747 1,873 1,997 2,099
997 1,093 1,213 1,303 1,439 1,543 1,627 1,753 1,877 1,999 2,111
1,009 1,097 1,217 1,307 1,447 1,549 1,637 1,759 1,879 2,003 2,113
1,013 1,103 1,223 1,319 1,451 1,553 1,657 1,777 1,889 2,011 2,129
1,019 1,109 1,229 1,321 1,453 1,559 1,663 1,783 1,901 2,017 2,131
1,021 1,117 1,231 1,327 1,459 1,567 1,667 1,787 1,907 2,027 2,137
1,031 1,123 1,237 1,361 1,471 1,571 1,669 1,789 1,913 2,029 2,141
1,033 1,129 1,249 1,367 1,481 1,579 1,693 1,801 1,931 2,039 2,143
1,039 1,151 1,259 1,373 1,483 1,583 1,697 1,811 1,933 2,053 2,153
1,049 1,153 1,277 1,381 1,487 1,597 1,699 1,823 1,949 2,063 2,161
1,051 1,163 1,279 1,399 1,489 1,601 1,709 1,831 1,951 2,069 2,179
1,061 1,171 1,283 1,409 1,493 1,607 1,721 1,847 1,973 2,081 2,203
1,063 1,181 1,289 1,423 1,499 1,609 1,723 1,861 1,979 2,083 2,207
---- Page 3 ----
2,213 2,333 2,423 2,557 2,687 2,789 2,903 3,037 3,181 3,307 3,413
2,221 2,339 2,437 2,579 2,689 2,791 2,909 3,041 3,187 3,313 3,433
2,237 2,341 2,441 2,591 2,693 2,797 2,917 3,049 3,191 3,319 3,449
2,239 2,347 2,447 2,593 2,699 2,801 2,927 3,061 3,203 3,323 3,457
2,243 2,351 2,459 2,609 2,707 2,803 2,939 3,067 3,209 3,329 3,461
2,251 2,357 2,467 2,617 2,711 2,819 2,953 3,079 3,217 3,331 3,463
2,267 2,371 2,473 2,621 2,713 2,833 2,957 3,083 3,221 3,343 3,467
2,269 2,377 2,477 2,633 2,719 2,837 2,963 3,089 3,229 3,347 3,469
2,273 2,381 2,503 2,647 2,729 2,843 2,969 3,109 3,251 3,359 3,491
2,281 2,383 2,521 2,657 2,731 2,851 2,971 3,119 3,253 3,361 3,499
2,287 2,389 2,531 2,659 2,741 2,857 2,999 3,121 3,257 3,371 3,511
2,293 2,393 2,539 2,663 2,749 2,861 3,001 3,137 3,259 3,373 3,517
2,297 2,399 2,543 2,671 2,753 2,879 3,011 3,163 3,271 3,389 3,527
2,309 2,411 2,549 2,677 2,767 2,887 3,019 3,167 3,299 3,391 3,529
2,311 2,417 2,551 2,683 2,777 2,897 3,023 3,169 3,301 3,407 3,533
---- Page 4 ----
3,539 3,581 3,623 3,673 3,719 3,769 3,823 3,877 3,919 3,967 4,019
3,541 3,583 3,631 3,677 3,727 3,779 3,833 3,881 3,923 3,989 4,021
3,547 3,593 3,637 3,691 3,733 3,793 3,847 3,889 3,929 4,001 4,027
3,557 3,607 3,643 3,697 3,739 3,797 3,851 3,907 3,931 4,003
3,559 3,613 3,659 3,701 3,761 3,803 3,853 3,911 3,943 4,007
3,571 3,617 3,671 3,709 3,767 3,821 3,863 3,917 3,947 4,013
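For comparison, here is a minimal sketch of two pieces in a different form. It swaps the trial division above for a Sieve of Eratosthenes (a different technique from the one in the answer, and one that leaves out 1), and it builds the "reads down" rows with a slice step instead of zip_longest. The names sieve and column_major_rows are hypothetical.

```python
import math


def sieve(n):
    """Return all primes <= n using the Sieve of Eratosthenes."""
    if n < 2:
        return []
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, math.isqrt(n) + 1):
        if is_prime[p]:
            # Cross out every multiple of p, starting at p*p.
            for multiple in range(p * p, n + 1, p):
                is_prime[multiple] = False
    return [i for i, flag in enumerate(is_prime) if flag]


def column_major_rows(values, num_rows):
    """Row r holds items r, r + num_rows, r + 2*num_rows, ... (reads down)."""
    return [values[r::num_rows] for r in range(num_rows)]


primes = sieve(4027)
for row in column_major_rows(primes[:24], 8):
    print("".join("{:>4}".format(p) for p in row))
```

The slice approach naturally leaves the last column short, which matches the requirement that only the final column may have blank entries at the bottom.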

Separate specific value in a dataframe

I have a large dataset. I am trying to read it with Pandas Dataframe. I want to separate some values from one of the columns. Assuming the name of column is "A", there are values ranging from 90 to 300. I want to separate any values between 270 to 280. I did try below code but it is wrong!
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('....csv')
df2 = df[ 270 < df['A'] < 280]
Use between with boolean indexing:
df = pd.DataFrame({'A':range(90,300)})
df2 = df[df['A'].between(270, 280, inclusive="neither")]  # pandas >= 1.3; older versions used inclusive=False
print (df2)
A
181 271
182 272
183 273
184 274
185 275
186 276
187 277
188 278
189 279
Or:
df2 = df[(df['A'] > 270) & (df['A'] < 280)]
print (df2)
A
181 271
182 272
183 273
184 274
185 275
186 276
187 277
188 278
189 279
Using numpy to speed things up and reconstruct a new dataframe.
Assuming we use jezrael's sample data
a = df.A.values
m = (a > 270) & (a < 280)
pd.DataFrame(a[m], df.index[m], df.columns)
A
181 271
182 272
183 273
184 274
185 275
186 276
187 277
188 278
189 279
You can also use query() method:
df2 = df.query("270 < A < 280")
Demo:
In [40]: df = pd.DataFrame({'A':range(90,300)})
In [41]: df.query("270 < A < 280")
Out[41]:
A
181 271
182 272
183 273
184 274
185 275
186 276
187 277
188 278
189 279
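As background on why the question's df[270 < df['A'] < 280] fails: Python evaluates a chained comparison as two comparisons joined by `and`, which asks the intermediate boolean Series for a single truth value, and pandas refuses with a ValueError. The sketch below, using the same sample data as the answers, shows the failure and the element-wise form that works.

```python
import pandas as pd

df = pd.DataFrame({"A": range(90, 300)})

error = None
try:
    # Equivalent to (270 < df["A"]) and (df["A"] < 280): `and` needs one bool.
    df[270 < df["A"] < 280]
except ValueError as exc:
    error = exc

print(type(error).__name__)

# Element-wise alternative: combine the two masks with & (parentheses required).
mask = (df["A"] > 270) & (df["A"] < 280)
df2 = df[mask]
print(df2)
```

query("270 < A < 280") accepts the chained form because pandas parses the expression string itself instead of letting Python evaluate it.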

Delete duplicate values in a row within a Pandas DataFrame (Python)

What is the expression to remove rows in which every value is the same within a pandas DataFrame, as follows? (Note: the first column is the index (date), followed by four columns of data.)
1983-02-16 512 517 510 514,
1983-02-17 513 520 513 517,
1983-02-18 500 500 500 500 <-- duplicate values,
1983-02-21 505 505 496 496
Delete row of duplicate values, end up with this...
1983-02-16 512 517 510 514,
1983-02-17 513 520 513 517,
1983-02-21 505 505 496 496
I could only find how to do this by columns, not rows. Many thanks in advance,
Peter
A slightly more elegant/dynamic (but perhaps less performant) version:
In [11]: msk = df1.apply(lambda col: df[1] != col).any(axis=1)
Out[11]:
0 True
1 True
2 False
3 True
dtype: bool
In [12]: msk.index = df1.index # iloc doesn't support masking
In [13]: df1.loc[msk]
Out[13]:
1 2 3 4
1983-02-16 512 517 510 514
1983-02-17 513 520 513 517
1983-02-21 505 505 496 496
import pandas as pd
import io
content = '''\
1983-02-16 512 517 510 514
1983-02-17 513 520 513 517
1983-02-18 500 500 500 500
1983-02-21 505 505 496 496'''
# io.StringIO is needed on Python 3, where content is a str rather than bytes
df = pd.read_table(io.StringIO(content), parse_dates=[0], header=None, sep=r'\s+',
                   index_col=0)
# True for rows where all four columns hold the same value
index = (df[1] == df[2]) & (df[1] == df[3]) & (df[1] == df[4])
df = df.loc[~index]  # df.ix was removed in pandas 1.0; .loc accepts a boolean mask
print(df)
yields
1 2 3 4
0
1983-02-16 512 517 510 514
1983-02-17 513 520 513 517
1983-02-21 505 505 496 496
df.loc can be used to select rows. df = df.loc[~index] selects all rows where index is False.
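Another way to express "every value in the row is the same" without naming each column is to count distinct values per row. This is a minimal sketch using DataFrame.nunique(axis=1), which is a different technique from the ones in the answers above; the variable names are illustrative.

```python
import io

import pandas as pd

content = """\
1983-02-16 512 517 510 514
1983-02-17 513 520 513 517
1983-02-18 500 500 500 500
1983-02-21 505 505 496 496"""

df = pd.read_csv(
    io.StringIO(content), sep=r"\s+", header=None, index_col=0, parse_dates=[0]
)

# A row whose values are all identical has exactly one distinct value.
all_equal = df.nunique(axis=1) == 1
result = df[~all_equal]
print(result)
```

The row 505 505 496 496 is kept because it has two distinct values; only rows with a single distinct value are dropped.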
