Reindex or reorder group - python

I am looking for a clean way to reorder the index in a group.
Example code:
import numpy as np
import pandas as pd
mydates = pd.date_range('1/1/2012', periods=1000, freq='D')
myts = pd.Series(np.random.randn(len(mydates)), index=mydates)
grouped = myts.groupby(lambda x: x.timetuple()[7])
mymin = grouped.min()
mymax = grouped.max()
The above gives me what I want: aggregate stats on the Julian day of the year. BUT I would then like to reorder the group so that the last half (183 days) is placed in front of the first half.
With a normal numpy array:
myindex = np.arange(1,367)
myindex = np.concatenate((myindex[183:],myindex[:183]))
But I can't do this with the groupby result; it raises a NotImplementedError.
Note: this is a cross-post from Google Groups. I have also been reading comp.lang.python; unfortunately, people tend to ignore some posts, e.g. those from Google Groups.
Thanks in advance,
Bevan

Why not just reindex the result?
In [7]: mymin.reindex(myindex)
Out[7]:
184 -0.788140
185 -2.206314
186 0.284884
187 -2.197727
188 -0.714634
189 -1.082745
190 -0.789286
191 -1.489837
192 -1.278941
193 -0.795507
194 -0.661476
195 0.582994
196 -1.634310
197 0.104332
198 -0.602378
...
169 -1.150616
170 -0.315325
171 -2.233139
172 -1.081528
173 -1.316668
174 -0.963783
175 -0.215260
176 -2.723446
177 -0.493480
178 -0.706771
179 -2.082051
180 -1.066649
181 -1.455419
182 -0.332383
183 -1.277424

I'm not aware of a specific pandas function for this, but you could consider the np.roll() function:
myindex = np.arange(1,367)
myindex = np.roll(myindex, len(myindex) // 2)
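Combining this with the reindex answer above, the rolled index can then be applied to the aggregated result (mymin_reordered is a name introduced here for illustration):
# Reorder the aggregated stats so days 184..366 come first
mymin_reordered = mymin.reindex(myindex)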


Problem animating polar plots from measured data

Problem
I'm trying to animate a polar plot of measured temperature data from a cylinder using the plotly.express function line_polar. The dataset has 6 radial values (represented by columns #1 - #6) over 10 rows (represented by the Time column), distributed over a polar plot. I'm struggling to make it animate, and I get the following error:
Error
ValueError: All arguments should have the same length. The length of column argument df[animation_frame] is 10, whereas the length of previously-processed arguments ['r', 'theta'] is 6
According to the help for the parameter "animation_frame", it should be specified as follows:
animation_frame (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to assign marks to animation frames.
I'm a bit stumped by this problem since I don't see why this shouldn't work; other use cases seem to use multi-dimensional data with equal numbers of rows.
[Example of polar plot for t=1]
Dataset:
Time   #1   #2   #3   #4   #5   #6
1     175  176  179  182  178  173
2     174  175  179  184  178  172
3     175  176  178  183  179  174
4     173  174  178  184  179  174
5     173  174  177  185  180  175
6     173  174  177  185  180  175
7     172  173  176  186  181  176
8     172  173  176  186  181  176
9     171  172  175  187  182  177
10    171  172  175  187  182  177
Code:
import pandas as pd
import plotly.express as px

df = pd.read_excel('TempData.xlsx')
sensor = ["0", "60", "120", "180", "240", "300"]
radial_all = ['#1', '#2', '#3', '#4', '#5', '#6']
fig = px.line_polar(df, r=radial_all, theta=sensor, line_close=True,
                    color_discrete_sequence=px.colors.sequential.Plasma_r,
                    template="plotly_dark", animation_frame="Time")
fig.update_polars(radialaxis_range=[160, 190])
fig.update_polars(radialaxis_rangemode="normal")
fig.update_polars(radialaxis=dict(tickvals=[150, 160, 170, 180, 190, 200]))
Thanks in advance!
I have found the solution to this problem. It's also possible with Scatterpolar, but I recommend line_polar from Plotly Express; it's much more elegant and easy. What you need to do is reshape the data from wide to long format using the pandas method melt(). This will allow the animation to correctly walk through the data and match it to the animation steps (in this case, the "Time" column). See the following links for helpful info:
Pandas - reshaping-by-melt
pandas.melt()
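As a minimal illustration of the reshape, using just the first two times and sensors from the dataset above (wide and long are names introduced here):
import pandas as pd

# Tiny wide-format frame mimicking the question's layout
wide = pd.DataFrame({'Time': [1, 2], '#1': [175, 174], '#2': [176, 175]})
# Reshape to long format: one row per (Time, Sensor) pair
long = wide.melt(id_vars=['Time'], var_name='Sensor', value_name='Temperature')
print(long)
#    Time Sensor  Temperature
# 0     1     #1          175
# 1     2     #1          174
# 2     1     #2          176
# 3     2     #2          175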
Resulting code:
import plotly.express as px
import pandas as pd

df = pd.read_excel('TempData.xlsx')
df_1 = df.melt(id_vars=['Time'], var_name="Sensor", value_name="Temperature",
               value_vars=['#1', '#2', '#3', '#4', '#5', '#6'])
fig = px.line_polar(df_1, r="Temperature", theta="Sensor", line_close=True,
                    line_shape="linear", direction="clockwise",
                    color_discrete_sequence=px.colors.sequential.Plasma_r,
                    template="plotly_dark", animation_frame="Time")
fig.show()
[Resulting animated plot]
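If the fixed radial-axis range from the question's original code is still wanted, those update_polars calls can be reapplied before fig.show():
fig.update_polars(radialaxis_range=[160, 190])
fig.update_polars(radialaxis=dict(tickvals=[150, 160, 170, 180, 190, 200]))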

Splitting data into subsamples

I have a huge dataset which contains the coordinates of particles. In order to split the data into test and training sets, I want to divide the space into many subspaces. I did this with a for-loop in every direction (x, y, z), but running the code takes very long and is not efficient enough, especially for large datasets:
particle_boxes = []
init = 0
final = 50
number_box = 5

for i in range(number_box):
    for j in range(number_box):
        for k in range(number_box):
            index_particle = (df_particles['X'].between(init+i*final, final+final*i) &
                              df_particles['Y'].between(init+j*final, final+final*j) &
                              df_particles['Z'].between(init+k*final, final+final*k))
            particle_boxes.append(df_particles[index_particle])
where init and final define the box size, and df_particles contains every particle coordinate (x, y, z).
After running this, particle_boxes contains 125 (number_box^3) equally spaced sub-boxes.
Is there any way to write this code more efficiently?
Note on efficiency
I conducted a number of tests using other tricks and nothing changed substantially. This is roughly as good as any other technique I used.
I'm curious to see if anyone else comes up with something an order of magnitude faster.
Sample data
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df_particles = pd.DataFrame(
    np.random.randint(250, size=(1000, 3)),
    columns=['X', 'Y', 'Z']
)
Solution
Construct an array a that represents your boundaries
a = np.array([50, 100, 150, 200, 250])
Then use searchsorted to create the individual dimensional bins
x_bin = a.searchsorted(df_particles['X'].to_numpy())
y_bin = a.searchsorted(df_particles['Y'].to_numpy())
z_bin = a.searchsorted(df_particles['Z'].to_numpy())
Use groupby on the three bins. I used trickery to get that into a dict
g = dict((*df_particles.groupby([x_bin, y_bin, z_bin]),))
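The trickery works because iterating a GroupBy object yields (key, group) pairs, which the dict constructor consumes directly. An equivalent, more explicit sketch:
g = {key: group for key, group in df_particles.groupby([x_bin, y_bin, z_bin])}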
We can see the first zone
g[(0, 0, 0)]
X Y Z
30 2 36 47
194 0 34 45
276 46 37 34
364 10 16 21
378 4 15 4
429 12 34 13
645 36 17 5
743 18 36 13
876 46 11 34
and the last
g[(4, 4, 4)]
X Y Z
87 223 236 213
125 206 241 249
174 218 247 221
234 222 204 237
298 208 211 225
461 234 204 238
596 209 229 241
731 210 220 242
761 225 215 231
762 206 241 240
840 211 241 238
846 212 242 241
899 249 203 228
970 214 217 232
981 236 216 248
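To finish the train/test split the question asks about, one possible sketch: randomly hold out some of the boxes for testing and concatenate the rest for training (the 80/20 box-selection strategy is an assumption, not part of the original answer):
import random

random.seed(0)
keys = list(g)                                          # the (x, y, z) bin keys from above
test_keys = set(random.sample(keys, k=len(keys) // 5))  # hold out ~20% of the boxes
test_df = pd.concat([g[k] for k in test_keys])
train_df = pd.concat([g[k] for k in keys if k not in test_keys])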
Instead of multiple nested for-loops, consider a single loop using itertools.product. But of course, avoid loops entirely if possible, as @piRSquared shows:
from itertools import product

particle_boxes = []
for i, j, k in product(range(number_box), range(number_box), range(number_box)):
    index_particle = (df_particles['X'].between(init+i*final, final+final*i) &
                      df_particles['Y'].between(init+j*final, final+final*j) &
                      df_particles['Z'].between(init+k*final, final+final*k))
    particle_boxes.append(df_particles[index_particle])
Alternatively, with a helper function and a list comprehension:
def sub_df(i, j, k):
    index_particle = (df_particles['X'].between(init+i*final, final+final*i) &
                      df_particles['Y'].between(init+j*final, final+final*j) &
                      df_particles['Z'].between(init+k*final, final+final*k))
    return df_particles[index_particle]

particle_boxes = [sub_df(i, j, k)
                  for i, j, k in product(range(number_box), range(number_box), range(number_box))]
Have a look at the train_test_split function available in the scikit-learn library.
I think it is almost exactly the kind of functionality that you need.
The code can be consulted on GitHub.
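For reference, a minimal sketch of how train_test_split could be applied directly to the particle dataframe (a plain row-wise split, which ignores the spatial boxes):
from sklearn.model_selection import train_test_split

# Random 80/20 row split of the particle coordinates
train_df, test_df = train_test_split(df_particles, test_size=0.2, random_state=42)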

Aggregations over specific columns of a large dataframe, with named output

I am looking for a way to aggregate over a large dataframe, possibly using groupby. Each group would be based on either pre-specified columns or regex, and the aggregation should produce a named output.
This produces a sample dataframe:
import pandas as pd
import itertools
import numpy as np
col = "A,B,C".split(',')
col1 = "1,2,3,4,5,6,7,8,9".split(',')
col2 = "E,F,G".split(',')
all_dims = [col, col1, col2]
all_keys = ['.'.join(i) for i in itertools.product(*all_dims)]
rng = pd.date_range(end=pd.Timestamp.today().date(), periods=12, freq='M')
df = pd.DataFrame(np.random.randint(0, 1000, size=(len(rng), len(all_keys))), columns=all_keys, index=rng)
The above produces a dataframe with one year's worth of monthly data and 81 columns with the following names:
['A.1.E', 'A.1.F', 'A.1.G', 'A.2.E', 'A.2.F', 'A.2.G', 'A.3.E', 'A.3.F',
'A.3.G', 'A.4.E', 'A.4.F', 'A.4.G', 'A.5.E', 'A.5.F', 'A.5.G', 'A.6.E',
'A.6.F', 'A.6.G', 'A.7.E', 'A.7.F', 'A.7.G', 'A.8.E', 'A.8.F', 'A.8.G',
'A.9.E', 'A.9.F', 'A.9.G', 'B.1.E', 'B.1.F', 'B.1.G', 'B.2.E', 'B.2.F',
'B.2.G', 'B.3.E', 'B.3.F', 'B.3.G', 'B.4.E', 'B.4.F', 'B.4.G', 'B.5.E',
'B.5.F', 'B.5.G', 'B.6.E', 'B.6.F', 'B.6.G', 'B.7.E', 'B.7.F', 'B.7.G',
'B.8.E', 'B.8.F', 'B.8.G', 'B.9.E', 'B.9.F', 'B.9.G', 'C.1.E', 'C.1.F',
'C.1.G', 'C.2.E', 'C.2.F', 'C.2.G', 'C.3.E', 'C.3.F', 'C.3.G', 'C.4.E',
'C.4.F', 'C.4.G', 'C.5.E', 'C.5.F', 'C.5.G', 'C.6.E', 'C.6.F', 'C.6.G',
'C.7.E', 'C.7.F', 'C.7.G', 'C.8.E', 'C.8.F', 'C.8.G', 'C.9.E', 'C.9.F',
'C.9.G']
What I would like now is to be able to aggregate over the dataframe, taking certain column combinations and producing named outputs. For example, one rule might be: take all 'A.*.E' columns (any number in the middle), sum them, and produce a named output column called 'A.SUM.E'. Then do the same for 'A.*.F', 'A.*.G' and so on.
I have looked into pandas 0.25 named aggregation, which allows me to name my outputs, but I couldn't see how to simultaneously capture the right column combinations and produce the right output names.
If you need to reshape the dataframe to make a workable solution, that is fine as well.
Note, I am aware I could do something like this in a Python loop but I am looking for a pandas way to do it.
Not a groupby solution, and it uses a loop, but I think it's nonetheless rather elegant: first get a list of the unique (first, last) column-name combinations using a set, and then do the sums using filter:
cols = sorted(set((x.split('.')[0], x.split('.')[-1]) for x in df.columns))
for c0, c1 in cols:
    df[f'{c0}.SUM.{c1}'] = df.filter(regex=fr'{c0}\.\d+\.{c1}').sum(axis=1)
Result:
A.1.E A.1.F A.1.G A.2.E ... B.SUM.G C.SUM.E C.SUM.F C.SUM.G
2018-08-31 978 746 408 109 ... 4061 5413 4102 4908
2018-09-30 923 649 488 447 ... 5585 3634 3857 4228
2018-10-31 911 359 897 425 ... 5039 2961 5246 4126
2018-11-30 77 479 536 509 ... 4634 4325 2975 4249
2018-12-31 608 995 114 603 ... 5377 5277 4509 3499
2019-01-31 138 612 363 218 ... 4514 5088 4599 4835
2019-02-28 994 148 933 990 ... 3907 4310 3906 3552
2019-03-31 950 931 209 915 ... 4354 5877 4677 5557
2019-04-30 255 168 357 800 ... 5267 5200 3689 5001
2019-05-31 593 594 824 986 ... 4221 2108 4636 3606
2019-06-30 975 396 919 242 ... 3841 4787 4556 3141
2019-07-31 350 312 104 113 ... 4071 5073 4829 3717
If you want to have the result in a new DataFrame, just create an empty one and add the columns to it:
result = pd.DataFrame()
for c0, c1 in cols:
    result[f'{c0}.SUM.{c1}'] = df.filter(regex=fr'{c0}\.\d+\.{c1}').sum(axis=1)
Update: using a simple groupby (which is even simpler in this particular case):
def grouper(col):
    c = col.split('.')
    return f'{c[0]}.SUM.{c[-1]}'

df.groupby(grouper, axis=1).sum()
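For illustration, the grouper simply maps each original column name to its output bucket, so all columns sharing a bucket get summed together:
grouper('A.1.E')  # -> 'A.SUM.E'
grouper('C.9.G')  # -> 'C.SUM.G'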

Collect rows based on unique ID Pandas dataframe

I have a large time series dataset in which the observations (each with a unique ID) have different lengths. I also have a 'Section' column that counts the time steps (rows) for each unique ID.
df.groupby([df['ID']]).agg({'count'})
A B Z
count count ... count
ID
25782 194 194 194
25783 198 198 198
25784 194 194 194
25785 192 192 192
... ... ... ... ...
25787 192 192 192
25788 195 195 195
25789 196 196 196
25790 200 200 200
Say I want to create a new dataframe consisting only of the unique IDs whose length is 192, i.e. where 'Section' counts up to 192.
So far I have tried the following, to no avail. Please help.
mask = df.groupby('ID')(len(df['Section']) == 192)
df = df.loc[mask]
print(df)
AND
df.groupby('ID').df[df['Section'].max() == 192]
edit
Desired output
new_df.groupby([new_df['ID']]).agg({'count'})
A B Z
count count ... count
ID
25752 192 192 192
25137 192 192 192
25970 192 192 192
25440 192 192 192
You can use filter after the groupby to keep only the IDs where the length of the 'Section' column is 192, like so:
new_df = df.groupby('ID').filter(lambda x: len(x['Section']) == 192)
Then, when you do new_df.groupby('ID').agg({'count'}), you should get your expected output.
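An equivalent sketch using a boolean mask built with transform (a swapped-in technique, not part of the original answer), which avoids calling the lambda on each group and is often faster when there are many groups:
new_df = df[df.groupby('ID')['Section'].transform('size') == 192]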

Separate specific value in a dataframe

I have a large dataset that I am reading with a pandas DataFrame. I want to separate some values from one of the columns. Assuming the column is named "A", its values range from 90 to 300, and I want to select the values between 270 and 280. I tried the code below, but it is wrong!
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('....csv')
df2 = df[ 270 < df['A'] < 280]
Use between with boolean indexing:
df = pd.DataFrame({'A':range(90,300)})
df2 = df[df['A'].between(270,280, inclusive=False)]
print (df2)
A
181 271
182 272
183 273
184 274
185 275
186 276
187 277
188 278
189 279
Or:
df2 = df[(df['A'] > 270) & (df['A'] < 280)]
print (df2)
A
181 271
182 272
183 273
184 274
185 275
186 276
187 277
188 278
189 279
Using numpy to speed things up and reconstructing a new dataframe, assuming we use jezrael's sample data:
a = df.A.values
m = (a > 270) & (a < 280)
pd.DataFrame(a[m], df.index[m], df.columns)
A
181 271
182 272
183 273
184 274
185 275
186 276
187 277
188 278
189 279
You can also use the query() method:
df2 = df.query("270 < A < 280")
Demo:
In [40]: df = pd.DataFrame({'A':range(90,300)})
In [41]: df.query("270 < A < 280")
Out[41]:
A
181 271
182 272
183 273
184 274
185 275
186 276
187 277
188 278
189 279
