I have a dataset with two columns, as follows:
index Year
0 5 <2012
1 8 >=2012
2 9 >=2012
3 10 <2012
4 15 <2012
... ... ...
171 387 >=2012
172 390 <2012
173 398 <2012
174 403 >=2012
175 409 <2012
And I would like to plot it in a histogram. I tried with
plt.style.use('ggplot')
df.groupby(['Year'])\
.Year.count().unstack().plot.bar(legend=True)
plt.show()
but I got an error, AttributeError: 'CategoricalIndex' object has no attribute 'remove_unused_levels', for
df.groupby(['Year'])\
.Year.count().unstack().plot.bar(legend=True)
I think this is because I am using categorical values. Any help would be appreciated.
Try:
plt.style.use('ggplot')
df.groupby(["Year"])["Year"].agg("count").plot.bar();
Alternatively:
plt.hist(df["Year"]);
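One possible explanation for the error is that counting with a single grouping key yields a Series with a flat index, so there is no second index level for unstack() to pivot. A minimal sketch, assuming Year is stored as a categorical column (the five rows below mirror the first rows of the question's data), counts each category directly:

```python
import pandas as pd

# First five rows of the question's data, stored as a categorical column (assumption)
df = pd.DataFrame(
    {"Year": pd.Categorical(["<2012", ">=2012", ">=2012", "<2012", "<2012"])}
)

# One count per category; sort_index keeps the bars in category order
counts = df["Year"].value_counts().sort_index()
print(counts)

# counts.plot.bar() would then draw one bar per category
```

Since the column has only two categories, a bar chart of these counts is probably closer to what is wanted than a numeric histogram.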
I have a huge dataset that contains the coordinates of particles. To split the data into test and training sets, I want to divide the space into many subspaces. I did this with a for loop in every direction (x, y, z), but the code takes very long to run and is not efficient enough, especially for large datasets:
particle_boxes = []
init = 0
final = 50
number_box = 5
for i in range(number_box):
    for j in range(number_box):
        for k in range(number_box):
            index_particle = (
                df_particles['X'].between(init+i*final, final+final*i)
                & df_particles['Y'].between(init+j*final, final+final*j)
                & df_particles['Z'].between(init+k*final, final+final*k)
            )
            particle_boxes.append(df_particles[index_particle])
where init and final define the box size and df_particles contains the coordinates (x, y, z) of every particle.
After running this, particle_boxes contains 125 (number_box^3) equally spaced subboxes.
Is there any way to write this code more efficiently?
Note on efficiency
I conducted a number of tests using other tricks and nothing changed substantially. This is roughly as good as any other technique I used.
I'm curious to see if anyone else comes up with something an order of magnitude faster.
Sample data
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df_particles = pd.DataFrame(
    np.random.randint(250, size=(1000, 3)),
    columns=['X', 'Y', 'Z']
)
Solution
Construct an array a that represents your boundaries
a = np.array([50, 100, 150, 200, 250])
Then use searchsorted to create the individual dimensional bins
x_bin = a.searchsorted(df_particles['X'].to_numpy())
y_bin = a.searchsorted(df_particles['Y'].to_numpy())
z_bin = a.searchsorted(df_particles['Z'].to_numpy())
Use groupby on the three bins. Iterating over a groupby yields (key, sub-frame) pairs, so unpacking it into a tuple is a quick trick to get the result into a dict
g = dict((*df_particles.groupby([x_bin, y_bin, z_bin]),))
We can see the first zone
g[(0, 0, 0)]
X Y Z
30 2 36 47
194 0 34 45
276 46 37 34
364 10 16 21
378 4 15 4
429 12 34 13
645 36 17 5
743 18 36 13
876 46 11 34
and the last
g[(4, 4, 4)]
X Y Z
87 223 236 213
125 206 241 249
174 218 247 221
234 222 204 237
298 208 211 225
461 234 204 238
596 209 229 241
731 210 220 242
761 225 215 231
762 206 241 240
840 211 241 238
846 212 242 241
899 249 203 228
970 214 217 232
981 236 216 248
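When all boxes have the same width, the bin indices can also be computed with integer division instead of searchsorted. The following is only a sketch under that assumption: box_size and the randomly generated sample data are illustrative, and the boxes here are half-open ([0, 50), [50, 100), ...), which differs slightly from the boundary handling of searchsorted and between.

```python
import numpy as np
import pandas as pd

# Illustrative sample data (assumption): integer coordinates in [0, 250)
rng = np.random.default_rng(0)
df_particles = pd.DataFrame(
    rng.integers(0, 250, size=(1000, 3)), columns=["X", "Y", "Z"]
)

box_size = 50  # hypothetical name for the edge length of one box

# Integer division maps each coordinate to the index of its box
bins = df_particles[["X", "Y", "Z"]].to_numpy() // box_size

# Group on the three bin indices; keys are (x_bin, y_bin, z_bin) tuples
groups = df_particles.groupby([bins[:, 0], bins[:, 1], bins[:, 2]])
particle_boxes = {key: sub for key, sub in groups}

print(len(particle_boxes), "non-empty boxes")
```

Every particle lands in exactly one box, so the sizes of the sub-frames add up to the size of the original frame.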
Instead of multiple nested for loops, consider a single loop using itertools.product. Of course, avoid loops altogether if possible, as @piRSquared shows:
from itertools import product
particle_boxes = []
for i, j, k in product(range(number_box), range(number_box), range(number_box)):
    index_particle = (df_particles['X'].between(init+i*final, final+final*i) &
                      df_particles['Y'].between(init+j*final, final+final*j) &
                      df_particles['Z'].between(init+k*final, final+final*k))
    particle_boxes.append(df_particles[index_particle])
Alternatively, with list comprehension:
def sub_df(i, j, k):
    index_particle = (df_particles['X'].between(init+i*final, final+final*i) &
                      df_particles['Y'].between(init+j*final, final+final*j) &
                      df_particles['Z'].between(init+k*final, final+final*k))
    return df_particles[index_particle]

particle_boxes = [sub_df(i, j, k)
                  for i, j, k in product(range(number_box), range(number_box), range(number_box))]
Have a look at the train_test_split function available in the scikit-learn library.
I think it provides almost exactly the functionality you need.
The source code is available on GitHub.
28 121 106 112 134
42 123 114 115 135
56 130 118 124 138
42 123 114 115 135
63 132 126 131 141 (and 14 more rows...)
Basically, each row has 5 points that need to be plotted as a line graph along equidistant x values.
Even plotting, say, 5 rows would be good enough. This has been my approach so far, but it displays a blank plot:
for i in range(20):
    print()
    for j in range(5):
        print(int(mat[0].split()[j]),end=' ')
        plt.plot(j,int(mat[i].split()[j]),'r')
plt.show()
I checked, and mat[i].split()[j] returns the correct number for each row, but nothing gets plotted. I don't want to deal with DataFrames for now since the data is so simple.
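The blank plot is most likely because each plt.plot call receives a single point: the format string 'r' asks for a red line without markers, and a line through one point has nothing to draw. A minimal sketch, assuming mat is a list of whitespace-separated strings like the rows shown above, passes each whole row to plot at once:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# The five rows shown in the question, as strings (assumed format of mat)
mat = [
    "28 121 106 112 134",
    "42 123 114 115 135",
    "56 130 118 124 138",
    "42 123 114 115 135",
    "63 132 126 131 141",
]

# Parse each row into a list of integers
rows = [[int(value) for value in line.split()] for line in mat]

fig, ax = plt.subplots()
for row in rows:
    # x positions 0..4 are equidistant; one call draws one complete line
    ax.plot(range(len(row)), row, marker="o")

# In an interactive session, drop the Agg line above and call plt.show() here
```

Each row then appears as its own line with five connected points.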
I have a dataset:
A B C D yearweek
0 245 95 60 30 2014-48
1 245 15 70 25 2014-49
2 150 275 385 175 2014-50
3 100 260 170 335 2014-51
4 580 925 535 2590 2015-02
5 630 126 485 2115 2015-03
6 425 90 905 1085 2015-04
7 210 670 655 945 2015-05
How to plot each value against 'yearweek'?
I tried for example:
import matplotlib.pyplot as plt
import pandas as pd
new = pd.DataFrame([df['A'].values, df['yearweek'].values])
plt.plot(new)
but it doesn't work and shows
ValueError: could not convert string to float: '2014-48'
Then I tried this:
plt.scatter(df['Total'], df['yearweek'])
which gives:
ValueError: could not convert string to float: '2015-37'
Does this mean there is a problem with the type of yearweek? How can I fix it?
Or is it possible to convert the index to dates?

The best solution I see is to calculate the date from scratch and add it as a new datetime column. Then you can plot it easily.
import datetime
df['date'] = df['yearweek'].map(lambda x: datetime.datetime.strptime(x,"%Y-%W")+datetime.timedelta(days=7*(int(x.split('-')[1])-1)))
df.plot('date','A')
This starts from January 1 of the year in the string (strptime ignores %W when no weekday is given) and moves forward 7*(week-1) days to produce the date.
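A possible alternative is to let pandas parse the week number itself. The sketch below assumes that mapping each week to its Monday is acceptable; it appends a weekday ("-1") so that %W takes part in the parsing. The resulting dates can differ by a few days from the ones computed above, because weeks are counted from the first Monday of the year rather than from January 1.

```python
import pandas as pd

# A few rows from the question's data
df = pd.DataFrame({
    "A": [245, 245, 150, 100, 580],
    "yearweek": ["2014-48", "2014-49", "2014-50", "2014-51", "2015-02"],
})

# "%W" is the week of the year (Monday as first day), "%w" the weekday (1 = Monday)
df["date"] = pd.to_datetime(df["yearweek"] + "-1", format="%Y-%W-%w")
print(df[["yearweek", "date"]])

# df.plot(x="date", y="A") would then draw A against real dates
```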
As of pandas 0.20.X, you can use DataFrame.plot() to generate the required plots. It uses matplotlib under the hood:
import pandas as pd
data = pd.read_csv('Your_Dataset.csv')
data.plot(x='yearweek', y=['A'])
Here, yearweek becomes the x-axis and A becomes the y-axis. Since y accepts a list, you can plot several columns at once, e.g. y=['A', 'B'].
Note: If it still doesn't look good, you could parse the yearweek column into a proper date format and try again.
I am looking for a clean way to reorder the index in a group.
Example code:
import numpy as np
import pandas as pd
mydates = pd.date_range('1/1/2012', periods=1000, freq='D')
myts = pd.Series(np.random.randn(len(mydates)), index=mydates)
grouped = myts.groupby(lambda x: x.timetuple()[7])
mymin = grouped.min()
mymax = grouped.max()
The above gives me what I want: aggregate stats by Julian day of the year. But I would then like to reorder the groups so that the last half (183 days) is placed in front of the first half.
With a normal numpy array:
myindex = np.arange(1,367)
myindex = np.concatenate((myindex[183:],myindex[:183]))
But I can't do this with the groupby object; it raises a not-implemented error.
Note: this is a cross post from google-groups. Also I have been reading on comp.lang.python, unfortunately people tend to ignore some posts e.g. from google groups.
Thanks in advance,
Bevan
Why not just reindex the result?
In [7]: mymin.reindex(myindex)
Out[7]:
184 -0.788140
185 -2.206314
186 0.284884
187 -2.197727
188 -0.714634
189 -1.082745
190 -0.789286
191 -1.489837
192 -1.278941
193 -0.795507
194 -0.661476
195 0.582994
196 -1.634310
197 0.104332
198 -0.602378
...
169 -1.150616
170 -0.315325
171 -2.233139
172 -1.081528
173 -1.316668
174 -0.963783
175 -0.215260
176 -2.723446
177 -0.493480
178 -0.706771
179 -2.082051
180 -1.066649
181 -1.455419
182 -0.332383
183 -1.277424
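Put together, the reindex approach might look like the sketch below. It assumes grouping by the index's dayofyear attribute (equivalent to the timetuple()[7] lambda in the question) and uses seeded random data purely for illustration.

```python
import numpy as np
import pandas as pd

mydates = pd.date_range("1/1/2012", periods=1000, freq="D")
rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
myts = pd.Series(rng.standard_normal(len(mydates)), index=mydates)

# Aggregate by day of the year (1..366)
mymin = myts.groupby(myts.index.dayofyear).min()

# Desired order: days 184..366 first, then days 1..183
myindex = np.arange(1, 367)
myindex = np.concatenate((myindex[183:], myindex[:183]))

# reindex returns the same values arranged in the new label order
reordered = mymin.reindex(myindex)
print(reordered.head())
```

The aggregated result is an ordinary Series, which is why it can be reordered even though the groupby object itself cannot.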
I'm not aware of a specific pandas function for this, but you could consider the np.roll() function:
myindex = np.arange(1,367)
myindex = np.roll(myindex, int(len(myindex)/2.))
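As a quick check, the sketch below compares the rolled index with the slice-and-concatenate version from the question. The two agree here because 366 is even; for an odd length, the split point would have to be chosen explicitly.

```python
import numpy as np

myindex = np.arange(1, 367)

# Slice-and-concatenate version from the question
by_concat = np.concatenate((myindex[183:], myindex[:183]))

# Rolling by half the length moves the last 183 entries to the front
by_roll = np.roll(myindex, len(myindex) // 2)

print(np.array_equal(by_concat, by_roll))
```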