shifting column in pandas without eliminating data

Input:
## x2
##0 214
##1 234
##2 253
##3 272
##4 291
Desired output:
## x2
##0 291
##1 214
##2 234
##3 253
##4 272
The following code drops the bottom value of the shifted column and puts NaN at the top. However, I want the shift to be cyclic, so that the last value wraps around to the top.
import pandas as pd
a = pd.DataFrame([214,234,253,272,291], columns=['x2'])
a.x2 = a.x2.shift(1)

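I would just chain a call to fillna after the call to shift:
import pandas
a = pandas.DataFrame([214,234,253,272,291], columns=['x2'])
# shift introduces NaN (and a float dtype), so fill the gap and cast back to the original dtype
a['x3'] = a['x2'].shift(1).fillna(a['x2'].iloc[-1]).astype(a['x2'].dtype)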
print(a)
x2 x3
0 214 291
1 234 214
2 253 234
3 272 253
4 291 272
You can reassign directly to the x2 column, but I wanted to be able to show both the source and the result at once for comparison.
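As an alternative sketch (not part of the answer above), a cyclic shift can also be done with numpy's np.roll, which rotates the values so that the last element wraps around to the front. The column name x3 is only used here so that source and result can be compared side by side.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame([214, 234, 253, 272, 291], columns=['x2'])

# np.roll rotates the values by one position: the last element
# wraps around to the front, so no NaN is introduced.
a['x3'] = np.roll(a['x2'].to_numpy(), 1)
print(a)
```

Because no NaN is ever created, the integer dtype is preserved without a cast.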

Related

sort pivot/dataframe without All row pandas/python

I created a dataframe with the help of a pivot, and I have:
name x y z All
A 155 202 218 575
C 206 149 45 400
B 368 215 275 858
Total 729 566 538 1833
I would like to sort by column "All" without taking the "Total" row into account. I am using:
df.sort_values(by = ["All"], ascending = False)
Thank you in advance!

If the Total row is the last one, you can sort other rows and then concat the last row:
df = pd.concat([df.iloc[:-1, :].sort_values(by="All"), df.iloc[-1:, :]])
print(df)
Prints:
name x y z All
C 206 149 45 400
A 155 202 218 575
B 368 215 275 858
Total 729 566 538 1833

You can try the following, but be careful: DataFrame.append raises a FutureWarning in pandas 1.4 and later and was removed in pandas 2.0, where pd.concat should be used instead:
df = df.iloc[:-1,:].sort_values('All',ascending=False).append(df.iloc[-1,:])
This outputs:
name x y z All
2 B 368 215 275 858
0 A 155 202 218 575
1 C 206 149 45 400
3 Total 729 566 538 1833

You can get the sorted order without Total (assuming here the last row), then index by position:
import numpy as np
idx = np.argsort(df['All'].iloc[:-1])
df2 = df.iloc[np.r_[idx[::-1], len(df)-1]]
NB: since we only sort an indexer here, this should be very fast.
Output:
name x y z All
2 B 368 215 275 858
0 A 155 202 218 575
1 C 206 149 45 400
3 Total 729 566 538 1833

You can just ignore the last row (note that this drops the Total row from the result):
df.iloc[:-1].sort_values(by = ["All"], ascending = False)
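If the Total row is not guaranteed to be last, a label-based variant may be safer. The sketch below assumes that name is an ordinary column and that the summary row is labelled exactly 'Total'; both are assumptions about the pivot output, which the question does not show in full.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['A', 'C', 'B', 'Total'],
    'x': [155, 206, 368, 729],
    'y': [202, 149, 215, 566],
    'z': [218, 45, 275, 538],
    'All': [575, 400, 858, 1833],
})

# Split off the summary row by label, sort the rest, then put it back last.
is_total = df['name'] == 'Total'
result = pd.concat([
    df[~is_total].sort_values('All', ascending=False),
    df[is_total],
])
print(result)
```

If name is the index rather than a column, the same idea works with df.index == 'Total' as the mask.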

Create new Pandas.DataFrame with .groupby(...).agg(sum) then recover unsummed columns

I'm starting with a dataframe of baseball seasons, a section of which looks similar to this:
Name Season AB H SB playerid
13047 A.J. Pierzynski 2013 503 137 1 746
6891 A.J. Pierzynski 2006 509 150 1 746
1374 Rod Carew 1977 616 239 23 1001942
1422 Stan Musial 1948 611 230 7 1009405
1507 Todd Helton 2000 580 216 5 432
1508 Nomar Garciaparra 2000 529 197 5 190
1509 Ichiro Suzuki 2004 704 262 36 1101
From these seasons, I want to create a dataframe of career stats; that is, one row for each player which is a sum of their AB, H, etc. This dataframe should still include the names of the players. The playerid in the above is a unique key for each player and should either be an index or an unchanged value in a column after creating the career stats dataframe.
My hypothetical starting point is df_careers = df_seasons.groupby('playerid').agg(sum) but this leaves out all the non-numeric data. With numeric_only = False I can get some sort of mess in the names columns like 'Ichiro SuzukiIchiro SuzukiIchiro Suzuki' from concatenation, but that just requires a bunch of cleaning. This is something I'd like to be able to do with other data sets, and the actual data I have is more like 25 columns, so I'd rather understand a specific routine for getting the Name data back or preserving it from the outset rather than write a specific function and use groupby('playerid').agg(func) (or a similar process) to do it, if possible.
I'm guessing there's a fairly simple way to do this, but I only started learning Pandas a week ago, so there are gaps in my knowledge.

You can write your own rule for how to handle the columns that should not be summed. Here the first value is kept for object (string) columns and everything else is summed:
col = df.columns.tolist()
col.remove('playerid')  # the grouping key itself should not be aggregated
df.groupby('playerid').agg({i: (lambda x: x.iloc[0] if x.dtype == object else x.sum()) for i in col})
Output (note that Season is numeric, so it is summed as well):
Name Season AB H SB
playerid
190 Nomar_Garciaparra 2000 529 197 5
432 Todd_Helton 2000 580 216 5
746 A.J._Pierzynski 4019 1012 287 2
1101 Ichiro_Suzuki 2004 704 262 36
1001942 Rod_Carew 1977 616 239 23
1009405 Stan_Musial 1948 611 230 7

If there is a one-to-one relationship between 'playerid' and 'Name', as appears to be the case, you can just include 'Name' in the groupby columns:
stat_cols = ['AB', 'H', 'SB']
groupby_cols = ['playerid', 'Name']
results = df.groupby(groupby_cols)[stat_cols].sum()
Results:
AB H SB
playerid Name
190 Nomar Garciaparra 529 197 5
432 Todd Helton 580 216 5
746 A.J. Pierzynski 1012 287 2
1101 Ichiro Suzuki 704 262 36
1001942 Rod Carew 616 239 23
1009405 Stan Musial 611 230 7
If you'd prefer to group only by 'playerid' and add the 'Name' data back in afterwards, you can instead create a 'playerid' to 'Name' mapping as a dictionary, and look it up using map:
results = df.groupby('playerid')[stat_cols].sum()
name_map = pd.Series(df.Name.to_numpy(), df.playerid).to_dict()
results['Name'] = results.index.map(name_map)
Results:
AB H SB Name
playerid
190 529 197 5 Nomar Garciaparra
432 580 216 5 Todd Helton
746 1012 287 2 A.J. Pierzynski
1101 704 262 36 Ichiro Suzuki
1001942 616 239 23 Rod Carew
1009405 611 230 7 Stan Musial

groupby.agg() can accept a dictionary that maps column names to functions. So, one solution is to pass a dictionary to agg, specifying which functions to apply to each column.
Using the sample data above, one might use
mapping = {'AB': 'sum', 'H': 'sum', 'SB': 'sum', 'Season': 'max', 'Name': 'max'}
df_1 = df.groupby('playerid').agg(mapping)
The choice to use 'max' for the columns that shouldn't be summed is arbitrary. You could define a lambda function to apply to a column if you want to handle it in a certain way. DataFrameGroupBy.agg can work with any function that will work with DataFrame.apply.
To expand this to larger data sets, you might use a dictionary comprehension. This would work well:
dictionary = {x: 'sum' for x in df.columns if x != 'playerid'}  # skip the grouping key
dont_sum = {'Name': 'max', 'Season': 'max'}
dictionary.update(dont_sum)
df_1 = df.groupby('playerid').agg(dictionary)
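The same idea can be written with pandas named aggregation, which some readers find easier to scan when there are many columns. This is only a sketch on a cut-down copy of the sample data; the output column name Seasons is made up for illustration, and 'first' relies on each playerid having a single Name.

```python
import pandas as pd

df_seasons = pd.DataFrame({
    'Name': ['A.J. Pierzynski', 'A.J. Pierzynski', 'Rod Carew', 'Ichiro Suzuki'],
    'Season': [2013, 2006, 1977, 2004],
    'AB': [503, 509, 616, 704],
    'H': [137, 150, 239, 262],
    'SB': [1, 1, 23, 36],
    'playerid': [746, 746, 1001942, 1101],
})

# Each keyword is an output column: (input column, aggregation).
df_careers = df_seasons.groupby('playerid').agg(
    Name=('Name', 'first'),       # keep the name as-is
    Seasons=('Season', 'count'),  # number of seasons instead of a meaningless sum
    AB=('AB', 'sum'),
    H=('H', 'sum'),
    SB=('SB', 'sum'),
)
print(df_careers)
```

The result keeps playerid as the index, as requested in the question.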

Splitting data into subsamples

I have a huge dataset which contains coordinates of particles. In order to split the data into test and training set I want to divide the space into many subspaces; I did this with a for-loop in every direction (x,y,z) but when running the code it takes very long and is not efficient enough especially for large datasets:
particle_boxes = []
init = 0
final = 50
number_box = 5
for i in range(number_box):
    for j in range(number_box):
        for k in range(number_box):
            index_particle = df_particles['X'].between(init+i*final, final+final*i)&df_particles['Y'].between(init+j*final, final+final*j)&df_particles['Z'].between(init+k*final, final+final*k)
            particle_boxes.append(df_particles[index_particle])
where init and final define the box size, df_particles contains every particle coordinate (x,y,z).
After running this, particle_boxes contains 125 (number_box^3) equally spaced subboxes.
Is there any way to write this code more efficiently?

Note on efficiency
I conducted a number of tests using other tricks and nothing changed substantially. This is roughly as good as any other technique I used.
I'm curious to see if anyone else comes up with something an order of magnitude faster.
Sample data
import numpy as np
import pandas as pd
np.random.seed([3, 1415])
df_particles = pd.DataFrame(
    np.random.randint(250, size=(1000, 3)),
    columns=['X', 'Y', 'Z']
)
Solution
Construct an array a that represents your boundaries
a = np.array([50, 100, 150, 200, 250])
Then use searchsorted to create the individual dimensional bins
x_bin = a.searchsorted(df_particles['X'].to_numpy())
y_bin = a.searchsorted(df_particles['Y'].to_numpy())
z_bin = a.searchsorted(df_particles['Z'].to_numpy())
Use groupby on the three bins. I used trickery to get that into a dict
g = dict((*df_particles.groupby([x_bin, y_bin, z_bin]),))
We can see the first zone
g[(0, 0, 0)]
X Y Z
30 2 36 47
194 0 34 45
276 46 37 34
364 10 16 21
378 4 15 4
429 12 34 13
645 36 17 5
743 18 36 13
876 46 11 34
and the last
g[(4, 4, 4)]
X Y Z
87 223 236 213
125 206 241 249
174 218 247 221
234 222 204 237
298 208 211 225
461 234 204 238
596 209 229 241
731 210 220 242
761 225 215 231
762 206 241 240
840 211 241 238
846 212 242 241
899 249 203 228
970 214 217 232
981 236 216 248
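When all boxes have the same edge length, the bin index can also be obtained by integer division, which avoids building the boundary array. The sketch below is a variation on the approach above, not a replacement for it: it assumes non-negative coordinates, a uniform box size of 50, and half-open boxes [0, 50), [50, 100), ... (so a value of exactly 50 lands in the second box, unlike with between or searchsorted). The sample data is generated differently from the answer's, so the groups will not match the output shown.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_particles = pd.DataFrame(
    rng.integers(0, 250, size=(1000, 3)),
    columns=['X', 'Y', 'Z'],
)

box_size = 50  # edge length of one box (plays the role of `final` in the question)

# Integer division maps each coordinate to a box index 0..4 along its axis.
bins = df_particles[['X', 'Y', 'Z']] // box_size

# One groupby pass collects the rows of every non-empty box, keyed by (i, j, k).
particle_boxes = {
    key: group
    for key, group in df_particles.groupby([bins['X'], bins['Y'], bins['Z']])
}
```

Every particle ends up in exactly one box, and empty boxes simply have no key in the dictionary.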

Instead of multiple nested for loops, consider one loop using itertools.product. But of course avoid any loops if possible, as @piRSquared shows:
from itertools import product
particle_boxes = []
for i, j, k in product(range(number_box), range(number_box), range(number_box)):
    index_particle = (df_particles['X'].between(init+i*final, final+final*i) &
                      df_particles['Y'].between(init+j*final, final+final*j) &
                      df_particles['Z'].between(init+k*final, final+final*k))
    particle_boxes.append(df_particles[index_particle])
Alternatively, with a list comprehension:
def sub_df(i, j, k):
    index_particle = (df_particles['X'].between(init+i*final, final+final*i) &
                      df_particles['Y'].between(init+j*final, final+final*j) &
                      df_particles['Z'].between(init+k*final, final+final*k))
    return df_particles[index_particle]
particle_boxes = [sub_df(i, j, k) for i, j, k in product(range(number_box), range(number_box), range(number_box))]

Have a look at the train_test_split function available in the scikit-learn library.
I think it is almost the kind of functionality that you need.
The source code is available on GitHub.

How to shift a column in Pandas DataFrame without losing value

I would like to shift a column in a Pandas DataFrame, but I haven't been able to find a method to do it without losing values.
(This post is quite similar to How to shift a column in Pandas DataFrame but the validated answer doesn't give the desired output and I can't comment it).
Does anyone know how to do it?
## x1 x2
##0 206 214
##1 226 234
##2 245 253
##3 265 272
##4 283 291
Desired output:
## x1 x2
##0 206 nan
##1 226 214
##2 245 234
##3 265 253
##4 283 272
##5 nan 291

Use loc to add a new blank row to the DataFrame, then perform the shift.
df.loc[max(df.index)+1, :] = None
df.x2 = df.x2.shift(1)
The code above assumes that your index is integer based, which is the pandas default. If you're using a non-integer based index, replace max(df.index)+1 with whatever you want the new last index to be.
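A variation on the same idea, sketched here under the assumption that the DataFrame has the default integer index starting at 0: reindex adds the extra row without modifying the original DataFrame in place, and the shift is then applied to x2 only.

```python
import pandas as pd

df = pd.DataFrame({'x1': [206, 226, 245, 265, 283],
                   'x2': [214, 234, 253, 272, 291]})

# Extend the index by one row (filled with NaN), then shift only x2.
out = df.reindex(range(len(df) + 1))
out['x2'] = out['x2'].shift(1)
print(out)
```

Both columns become float, because NaN cannot be stored in a plain integer column.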

Speeding up array iteration time in python

Currently I am iterating over one array and for each value in this array I am looking for the closest value at the corresponding point in another array that is within a region surrounding the corresponding point.
In summary: For any point in an array, how far away from a corresponding point in another array do you need to go to get the same value.
The code seems to work well for small arrays, however I am working now with 1024x768 arrays, leading me to wait a long time for each run....
Any help or advice would be greatly appreciated as I have been on this for a while!!
Example matrix in the format I'm using: np.array([[1,2],[3,4]])
#Distance to agreement
#Used later to define a region of pixels around a corresponding point
#to iterate over:
DTA = 26
#To account for noise in pixels - doesnt have to find the exact value,
#just one within +/-130 of it.
limit = 130
#Containers for all pixel value matches and also the smallest distance
#to pixel match
Dist = []
Dist_min = []
#Container matrix for gamma pass/fail values
Dist_to_agree = np.zeros((i_size,j_size))
#i,j indexes the reference matrix (x), ii,jj indexes the measured
#matrix(y). Finds a match within the limits,
#appends the distance to the match into Dist.
#Then find the minimum distance to a match for that pixel and append it
#to dist_min
for i, k in enumerate(x):
    for j, l in enumerate(k):
        #added 10 packing to y matrix, so need to shift it by 10 in i&j
        for ii in range((i+10)-DTA,(i+10)+DTA):
            for jj in range((j+10)-DTA,(j+10)+DTA):
                #If the pixel value is within a range to account for noise,
                #let it be "found"
                if (y[ii,jj]-limit) <= x[i,j] <= (y[ii,jj]+limit):
                    #Calculating distance
                    dist_eu = sqrt(((i)-(ii))**2 + ((j) - (jj))**2)
                    Dist.append(dist_eu)
                #If a value cannot be found within the noise range,
                #append 10 = instant fail.
                else:
                    Dist.append(10)
        try:
            Dist_min.append(min(Dist))
            Dist_to_agree[i,j] = min(Dist)
        except ValueError:
            pass
        #Need to reset container or previous values will also be
        #accounted for when finding minimum
        Dist = []
print(Dist_to_agree)

First, you are getting the elements of x in k and l, but then throwing that away and indexing x again. So in place of x[i,j], you could just use l, which would be much faster (although l isn't a very meaningful name, something like xi and xij might be better).
Second, you are recomputing y[ii,jj]-limit and y[ii,jj]+limit every time. If you have enough memory, you can precompute these: ym = y-limit and yp = y+limit.
Third, appending to a list is slower than creating an array and setting the values for long lists vs. long arrays. You can also skip the entire else clause by pre-setting the default value.
Fourth, you are computing min(dist) twice, and further may be using the python version rather than the numpy version, the latter being faster for arrays (which is another reason to make dist an array).
However, the biggest speedup would be to vectorize the inner two loops. Here are my tests, with x=np.random.random((10,10)) and y=np.random.random((100,100)):
Your version takes 623 ms.
Here is my version, which takes 7.6 ms:
dta = 26
limit = 130
dist_to_agree = np.zeros_like(x)
dist_min = []
ym = y-limit
yp = y+limit
# Assumes the slices below stay inside y (i+10-dta >= 0 and j+10-dta >= 0);
# negative start indices would give empty or wrapped slices.
for i, xi in enumerate(x):
    irange = (i-np.arange(i+10-dta, i+10+dta))**2
    if not irange.size:
        continue
    ymi = ym[i+10-dta:i+10+dta, :]
    ypi = yp[i+10-dta:i+10+dta, :]
    for j, xij in enumerate(xi):
        jrange = (j-np.arange(j+10-dta, j+10+dta))**2
        if not jrange.size:
            continue
        ymij = ymi[:, j+10-dta:j+10+dta]
        ypij = ypi[:, j+10-dta:j+10+dta]
        imesh, jmesh = np.meshgrid(irange, jrange, indexing='ij')
        dist = np.sqrt(imesh+jmesh)
        # pixels whose value is outside [y-limit, y+limit] count as a fail (10)
        dist[(ymij > xij) | (xij > ypij)] = 10
        mindist = dist.min()
        dist_min.append(mindist)
        dist_to_agree[i,j] = mindist
print(dist_to_agree)
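For readers who want something they can run end to end, here is a self-contained sketch of the same vectorization idea. It is not the code from the answer above and it makes several assumptions that differ from the question: y is x plus pad extra pixels on every side with pad >= dta (so no window ever leaves y), distances are measured from the window centre (i+pad, j+pad), and a pixel with no match gets the fail value while a pixel with a match gets the smallest matching distance. The function name and its parameters are made up for illustration.

```python
import numpy as np

def distance_to_agreement(x, y, dta, limit, pad, fail=10.0):
    """Smallest distance from each x[i, j] to a y pixel within +/-limit of it.

    Assumes y.shape == x.shape + 2*pad in each dimension and pad >= dta.
    """
    # Window offsets, matching the half-open range(c - dta, c + dta).
    offsets = np.arange(-dta, dta)
    di, dj = np.meshgrid(offsets, offsets, indexing='ij')
    base_dist = np.sqrt(di**2 + dj**2)  # computed once, reused for every pixel

    out = np.full(x.shape, fail, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            ci, cj = i + pad, j + pad  # centre of the search window in y
            window = y[ci - dta:ci + dta, cj - dta:cj + dta]
            match = np.abs(window - x[i, j]) <= limit
            if match.any():
                out[i, j] = base_dist[match].min()
    return out

rng = np.random.default_rng(0)
x = rng.random((6, 5)) * 1000
pad = dta = 3
y = np.pad(x, pad, mode='edge')  # y agrees with x exactly at each window centre
result = distance_to_agreement(x, y, dta=dta, limit=130, pad=pad)
print(result)
```

The two outer loops remain, but all work inside the window is done by numpy, and the distance grid is built only once instead of once per pixel.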

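@Ciaran
Meshgrid is more or less a vectorized equivalent of two nested loops. Below are two ways of calculating the distances, one with loops and one with meshgrid plus numpy vector operations. The second one is six times faster. Note that with meshgrid's default indexing='xy', func2 returns the transpose of func1's array, which does not matter when only the minimum is needed.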
import numpy as np
DTA=5
i=100
j=200
def func1():
    dist1=np.zeros((DTA*2,DTA*2))
    for ii in range((i+10)-DTA,(i+10)+DTA):
        for jj in range((j+10)-DTA,(j+10)+DTA):
            dist1[ii-((i+10)-DTA),jj-((j+10)-DTA)] = np.sqrt(((i)-(ii))**2 + ((j) - (jj))**2)
    return dist1
def func2():
    ii, jj = np.meshgrid(np.arange((i+10)-DTA,(i+10)+DTA),
                         np.arange((j+10)-DTA,(j+10)+DTA))
    dist2=np.sqrt((i-ii)**2+(j-jj)**2)
    return dist2
This is how ii and jj matrices look after meshgrid operation
ii=
[[105 106 107 108 109 110 111 112 113 114]
[105 106 107 108 109 110 111 112 113 114]
[105 106 107 108 109 110 111 112 113 114]
[105 106 107 108 109 110 111 112 113 114]
[105 106 107 108 109 110 111 112 113 114]
[105 106 107 108 109 110 111 112 113 114]
[105 106 107 108 109 110 111 112 113 114]
[105 106 107 108 109 110 111 112 113 114]
[105 106 107 108 109 110 111 112 113 114]
[105 106 107 108 109 110 111 112 113 114]]
jj=
[[205 205 205 205 205 205 205 205 205 205]
[206 206 206 206 206 206 206 206 206 206]
[207 207 207 207 207 207 207 207 207 207]
[208 208 208 208 208 208 208 208 208 208]
[209 209 209 209 209 209 209 209 209 209]
[210 210 210 210 210 210 210 210 210 210]
[211 211 211 211 211 211 211 211 211 211]
[212 212 212 212 212 212 212 212 212 212]
[213 213 213 213 213 213 213 213 213 213]
[214 214 214 214 214 214 214 214 214 214]]

for loops are very slow in pure Python, and you have four nested loops, which will be very slow. Cython does wonders for for-loop speed. You can also try vectorization. While I'm not sure I fully understand what you are trying to do, you may try to vectorize at least some of the operations, especially the last two loops.
So instead of two ii,jj cycles over
(y[ii,jj]-limit) <= x[i,j] <= (y[ii,jj]+limit)
you can do something like
ii, jj = np.meshgrid(np.arange((i+10)-DTA,(i+10)+DTA), np.arange((j+10)-DTA,(j+10)+DTA))
t = (y[ii,jj]-limit <= x[i,j]) & (x[i,j] <= y[ii,jj]+limit)
Dist = np.sqrt((i-ii)**2 + (j-jj)**2)
np.min(Dist[t]) will then give your minimum distance for element i,j (provided at least one pixel in the window matches; otherwise np.min raises a ValueError on the empty selection).

The NumbaPro compiler offers GPU acceleration. Unfortunately, it isn't free.
http://docs.continuum.io/numbapro/
