Collect rows based on unique ID Pandas dataframe - python

I have a large time series dataset with some of the observations (each with a unique ID) having a different length. I also have a 'Section' column that counts time step or rows for each unique ID.
df.groupby([df['ID']]).agg({'count'})
A B Z
count count ... count
ID
25782 194 194 194
25783 198 198 198
25784 194 194 194
25785 192 192 192
... ... ... ... ...
25787 192 192 192
25788 195 195 195
25789 196 196 196
25790 200 200 200
Say I want to create a new dataframe containing only the IDs whose length is 192, i.e. where 'Section' counts up to 192.
So far I have tried the following, but to no avail. Please help.
mask = df.groupby('ID')(len(df['Section']) == 192)
df = df.loc[mask]
print(df)
and
df.groupby('ID').df[df['Section'].max() == 192]
Edit:
Desired output
new_df.groupby([new_df['ID']]).agg({'count'})
A B Z
count count ... count
ID
25752 192 192 192
25137 192 192 192
25970 192 192 192
25440 192 192 192

You can use filter after the groupby to keep only the IDs for which the 'Section' column has exactly 192 rows:
new_df = df.groupby('ID').filter(lambda x: len(x['Section']) == 192)
Then new_df.groupby('ID').agg({'count'}) should give your expected output.
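To make the filter step concrete, here is a minimal sketch on made-up data. The toy frame, the name target_len (standing in for 192) and the transform-based variant are assumptions for illustration and are not from the original answer.

```python
import pandas as pd

# Hypothetical toy data: ID 1 has 3 rows, ID 2 has 2 rows, ID 3 has 3 rows.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3, 3, 3],
    "Section": [1, 2, 3, 1, 2, 1, 2, 3],
    "A": range(8),
})
target_len = 3  # stands in for 192 in the question

# Approach from the answer: keep whole groups whose row count matches.
by_filter = df.groupby("ID").filter(lambda g: len(g) == target_len)

# Alternative sketch: broadcast each group's size back onto its rows, then mask.
group_size = df.groupby("ID")["Section"].transform("size")
by_mask = df[group_size == target_len]

print(by_mask["ID"].unique())
```

Both variants keep the original row order and index. The mask version avoids calling a Python lambda once per group, which may matter when there are many IDs.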

Related

sort pivot/dataframe without All row pandas/python

I created a dataframe with the help of a pivot, and I have:
name x y z All
A 155 202 218 575
C 206 149 45 400
B 368 215 275 858
Total 729 566 538 1833
I would like to sort by column "All" without taking the "Total" row into account. I am using:
df.sort_values(by = ["All"], ascending = False)
Thank you in advance!
If the Total row is the last one, you can sort other rows and then concat the last row:
df = pd.concat([df.iloc[:-1, :].sort_values(by="All"), df.iloc[-1:, :]])
print(df)
Prints:
name x y z All
C 206 149 45 400
A 155 202 218 575
B 368 215 275 858
Total 729 566 538 1833
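The same can be done with DataFrame.append, but note that append was deprecated in pandas 1.4 and removed in pandas 2.0, so this only works on older versions (where it emits a FutureWarning):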
df = df.iloc[:-1,:].sort_values('All',ascending=False).append(df.iloc[-1,:])
This outputs:
name x y z All
2 B 368 215 275 858
0 A 155 202 218 575
1 C 206 149 45 400
3 Total 729 566 538 1833
You can get the sorted order without Total (assuming here the last row), then index by position:
import numpy as np
idx = np.argsort(df['All'].iloc[:-1])
df2 = df.iloc[np.r_[idx[::-1], len(df)-1]]
NB. as we are sorting only an indexer here this should be very fast
output:
name x y z All
2 B 368 215 275 858
0 A 155 202 218 575
1 C 206 149 45 400
3 Total 729 566 538 1833
You can simply exclude the last row before sorting (note that the result then no longer contains the Total row):
df.iloc[:-1].sort_values(by = ["All"], ascending = False)
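The answers above assume Total is the last row. If that is not guaranteed, a label-based variant could look like the sketch below. It assumes that "name" is a regular column, as in the printed outputs that show a 0..3 index, and it rebuilds the question's data by hand.

```python
import pandas as pd

# Data from the question, with "name" assumed to be an ordinary column.
df = pd.DataFrame({
    "name": ["A", "C", "B", "Total"],
    "x": [155, 206, 368, 729],
    "y": [202, 149, 215, 566],
    "z": [218, 45, 275, 538],
    "All": [575, 400, 858, 1833],
})

# Split on the label instead of the position, sort the rest, put Total back last.
is_total = df["name"] == "Total"
sorted_df = pd.concat([
    df[~is_total].sort_values("All", ascending=False),
    df[is_total],
])
print(sorted_df)
```

If the pivot left "name" in the index instead, the mask would be built from df.index rather than from a column.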

Splitting data into subsamples

I have a huge dataset which contains coordinates of particles. In order to split the data into test and training set I want to divide the space into many subspaces; I did this with a for-loop in every direction (x,y,z) but when running the code it takes very long and is not efficient enough especially for large datasets:
particle_boxes = []
init = 0
final = 50
number_box = 5
for i in range(number_box):
    for j in range(number_box):
        for k in range(number_box):
            index_particle = (df_particles['X'].between(init+i*final, final+final*i) &
                              df_particles['Y'].between(init+j*final, final+final*j) &
                              df_particles['Z'].between(init+k*final, final+final*k))
            particle_boxes.append(df_particles[index_particle])
where init and final define the box size, df_particles contains every particle coordinate (x,y,z).
After running this, particle_boxes contains 125 (number_box^3) equally sized subboxes.
Is there any way to write this code more efficiently?
Note on efficiency
I conducted a number of tests using other tricks and nothing changed substantially. This is roughly as good as any other technique I used.
I'm curious to see if anyone else comes up with something order of magnitude faster.
Sample data
import numpy as np
import pandas as pd
np.random.seed([3, 1415])
df_particles = pd.DataFrame(
    np.random.randint(250, size=(1000, 3)),
    columns=['X', 'Y', 'Z']
)
Solution
Construct an array a that represents your boundaries
a = np.array([50, 100, 150, 200, 250])
Then use searchsorted to create the individual dimensional bins
x_bin = a.searchsorted(df_particles['X'].to_numpy())
y_bin = a.searchsorted(df_particles['Y'].to_numpy())
z_bin = a.searchsorted(df_particles['Z'].to_numpy())
Use groupby on the three bins. I used trickery to get that into a dict
g = dict((*df_particles.groupby([x_bin, y_bin, z_bin]),))
We can see the first zone
g[(0, 0, 0)]
X Y Z
30 2 36 47
194 0 34 45
276 46 37 34
364 10 16 21
378 4 15 4
429 12 34 13
645 36 17 5
743 18 36 13
876 46 11 34
and the last
g[(4, 4, 4)]
X Y Z
87 223 236 213
125 206 241 249
174 218 247 221
234 222 204 237
298 208 211 225
461 234 204 238
596 209 229 241
731 210 220 242
761 225 215 231
762 206 241 240
840 211 241 238
846 212 242 241
899 249 203 228
970 214 217 232
981 236 216 248
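When all boxes have the same width, the bin index can also be computed with integer division. The sketch below is not part of the answer above. It assumes a box width of 50 starting at 0 and uses half-open boxes [0, 50), [50, 100), and so on, so a coordinate exactly on a boundary is assigned differently than with searchsorted. It also differs from the original loop, where between is inclusive at both ends and a boundary particle lands in two boxes. The names box_size and bins are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: integer coordinates in [0, 250).
rng = np.random.default_rng(0)
df_particles = pd.DataFrame(
    rng.integers(0, 250, size=(1000, 3)),
    columns=["X", "Y", "Z"],
)

box_size = 50    # assumed box width
number_box = 5   # boxes per dimension

# Integer division gives the box index per axis; clip guards the upper edge.
bins = (df_particles[["X", "Y", "Z"]] // box_size).clip(upper=number_box - 1)

# One groupby over the three index arrays yields every non-empty box.
particle_boxes = {
    key: box
    for key, box in df_particles.groupby(
        [bins["X"].to_numpy(), bins["Y"].to_numpy(), bins["Z"].to_numpy()]
    )
}

print(len(particle_boxes), "non-empty boxes")
```

Each particle ends up in exactly one box, which is usually what a train/test split needs.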
Instead of multiple nested for loops, consider one loop using itertools.product. But of course avoid any loops if possible, as @piRSquared shows:
from itertools import product
particle_boxes = []
for i, j, k in product(range(number_box), range(number_box), range(number_box)):
    index_particle = (df_particles['X'].between(init+i*final, final+final*i) &
                      df_particles['Y'].between(init+j*final, final+final*j) &
                      df_particles['Z'].between(init+k*final, final+final*k))
    particle_boxes.append(df_particles[index_particle])
Alternatively, with a list comprehension:
def sub_df(i, j, k):
    index_particle = (df_particles['X'].between(init+i*final, final+final*i) &
                      df_particles['Y'].between(init+j*final, final+final*j) &
                      df_particles['Z'].between(init+k*final, final+final*k))
    return df_particles[index_particle]
particle_boxes = [sub_df(i, j, k) for i, j, k in product(range(number_box), range(number_box), range(number_box))]
Have a look at the train_test_split function available in the scikit-learn library.
I think it provides roughly the functionality that you need.
The source code is available on GitHub.

Can't set column name from index to str(index) + string (Pandas, Python)

I need to change the names of a subset of columns in a dataframe from whatever number they are to that number plus a string suffix. I know there is a function to add a suffix, but it doesn't seem to work on just indices.
I create a list with all the column indices in it, then run a loop that, for each item in that list, renames the matching dataframe column to the same number plus the suffix string.
if scalename == "CDR":
    print(scaledf.columns.tolist())
    oldCols = scaledf.columns[7:].tolist()
    for f in range(len(oldCols)):
        changeCol = int(oldCols[f])
        print(changeCol)
        scaledf.rename(columns = {changeCol:scalename + str(changeCol)})
    print(scaledf.columns)
This doesn't work.
The code will print out the column names, and prints out every item, but it does not rename the columns. It doesn't throw errors, it just doesn't work. I've tried variation after variation, and gotten all kinds of other errors, but this error-free code does nothing. It just runs, and doesn't rename anything.
Any help would be seriously appreciated! Thank you.
Adding sample of list:
45, 52, 54, 55, 59, 60, 61, 66, 67, 68, 69, 73, 74, 75, 80, 81, 82, 94, 101, 103,
104, 108, 110, 115, 116, 117, 129, 136, 138, 139, 143, 144, 145, 150, 151, 157, 158, 159, 171, 178,
180, 181, 185, 186, 187, 192, 193, 199, 200, 201, 213, 220, 222, 223, 227, 228, 229, 234, 235, 236
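rename returns a new DataFrame instead of modifying scaledf in place, so the result has to be assigned back. Try this: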
scaledf = scaledf.rename(columns=lambda c:scalename + str(c) if c in oldCols else c)
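As a small self-contained illustration of the fix, the sketch below uses a made-up frame with two leading columns and three numeric ones; the frame and the [2:] slice are assumptions standing in for the real scaledf and its [7:] slice. It uses a dict mapping rather than a lambda, which works equally well here.

```python
import pandas as pd

scalename = "CDR"
# Hypothetical frame: two descriptive columns followed by integer-named columns.
scaledf = pd.DataFrame([[0] * 5], columns=["id", "visit", 45, 52, 54])
old_cols = scaledf.columns[2:].tolist()

# rename returns a new frame; assigning it back is the step missing in the question.
scaledf = scaledf.rename(columns={c: f"{scalename}{c}" for c in old_cols})
print(scaledf.columns.tolist())
```

Passing inplace=True to rename would also work, but assigning the result back keeps the data flow explicit.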

Subtraction/Addition from separate rows/columns

I have a dataframe like this:
Day Diff
137 0
185 48
249 64
139 -110
In the column Diff whenever a negative value is encountered I want to subtract 365 from the value in Day from the previous row and then add that value to the Day value in the current row of the negative number. For example, in this scenario when -110 is encountered I want to do 365-249 (249 is from Day in previous row) and then add 139. So 365-249 = 116 and 116 + 139 = 255. Therefore -110 would be replaced with 255.
My desired output then is:
Day Diff
137 0
185 48
249 64
139 255
You can do it this way:
In [32]: df.loc[df.Diff < 0, 'Diff'] = 365 + df.Day - df.shift().loc[df.Diff < 0, 'Day']
In [33]: df
Out[33]:
Day Diff
0 137 0.0
1 185 48.0
2 249 64.0
3 139 255.0
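A variant that keeps the column as integers could look like the sketch below. It computes the wrapped value for every row with shift and then picks it only where Diff is negative. It assumes the first row never has a negative Diff, since that row has no predecessor; the name wrapped is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Day": [137, 185, 249, 139], "Diff": [0, 48, 64, -110]})

# Days left after the previous row's Day (365 - previous Day) plus the current Day.
wrapped = 365 - df["Day"].shift() + df["Day"]

# Keep non-negative values, substitute the wrapped value elsewhere, restore int dtype.
df["Diff"] = df["Diff"].where(df["Diff"] >= 0, wrapped).astype(int)
print(df)
```

In the sample data Diff is just the row-to-row difference of Day, so for the negative row the result equals Diff + 365 (-110 + 365 = 255).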

Reindex or reorder group

I am looking for a clean way to reorder the index in a group.
Example code:
import numpy as np
import pandas as pd
mydates = pd.date_range('1/1/2012', periods=1000, freq='D')
myts = pd.Series(np.random.randn(len(mydates)), index=mydates)
grouped = myts.groupby(lambda x: x.timetuple()[7])
mymin = grouped.min()
mymax = grouped.max()
The above gives me what I want, aggregate stats on julian day of the year BUT I would then like to reorder the group so the last half (183 days) is placed in front of the 1st half.
With a normal numpy array:
myindex = np.arange(1,367)
myindex = np.concatenate((myindex[183:],myindex[:183]))
But I can't do this with the groupby object; it raises a not-implemented error.
Note: this is a cross post from google-groups. Also I have been reading on comp.lang.python, unfortunately people tend to ignore some posts e.g. from google groups.
Thanks in advance,
Bevan
Why not just reindex the result?
In [7]: mymin.reindex(myindex)
Out[7]:
184 -0.788140
185 -2.206314
186 0.284884
187 -2.197727
188 -0.714634
189 -1.082745
190 -0.789286
191 -1.489837
192 -1.278941
193 -0.795507
194 -0.661476
195 0.582994
196 -1.634310
197 0.104332
198 -0.602378
...
169 -1.150616
170 -0.315325
171 -2.233139
172 -1.081528
173 -1.316668
174 -0.963783
175 -0.215260
176 -2.723446
177 -0.493480
178 -0.706771
179 -2.082051
180 -1.066649
181 -1.455419
182 -0.332383
183 -1.277424
I'm not aware of a specific pandas function for this, but you could consider the np.roll() function to build the index:
myindex = np.arange(1,367)
myindex = np.roll(myindex, int(len(myindex)/2.))
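Putting the two answers together, a minimal end-to-end sketch might look like this. It groups by the index's dayofyear attribute instead of the timetuple lambda from the question and uses a seeded generator for the random data; both are substitutions made for illustration.

```python
import numpy as np
import pandas as pd

mydates = pd.date_range("1/1/2012", periods=1000, freq="D")
rng = np.random.default_rng(0)
myts = pd.Series(rng.standard_normal(len(mydates)), index=mydates)

# Aggregate per day of year (1..366, since 2012 is a leap year).
mymin = myts.groupby(myts.index.dayofyear).min()

# Rotate the day-of-year labels so that 184..366 come before 1..183.
myindex = np.roll(np.arange(1, 367), 183)

# Reindexing the aggregated result applies the new order.
reordered = mymin.reindex(myindex)
print(reordered.head())
```

The reordering is applied to the aggregated Series rather than to the groupby object, which is why it works where the original attempt did not.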
