Grouping by monthly and plot a bar stacked in pandas - python

I would like to get a dataframe, where data representing different classes and monthly frequency for each class. For example, in the following dataframe want to use the column Forma for get a dataframe representing monthly frequencies of each of the classes of the column Forma y get for example a dataframe df1
df
Evento Forma Excentricidad
Fecha
2004-04-09 22:45:00 1 MBCCM 0.7
2004-04-12 22:45:00 2 MBSCL 0.6
2004-04-24 03:45:00 3 SCL 0.4
2004-05-02 06:45:00 4 SCL 0.5
2004-05-30 04:45:00 5 MBCCM 0.9
2004-05-31 03:15:00 6 MBCCM 0.8
2004-06-08 00:15:00 7 MBSCL 0.6
2004-06-12 22:15:00 8 CCM 1.0
2004-06-13 02:45:00 9 MBCCM 0.8
2004-06-13 23:45:00 10 MBSCL 0.6
2004-06-14 03:15:00 11 MBSCL 0.6
2004-06-17 08:15:00 12 MBCCM 0.7
2004-06-17 11:45:00 13 MBCCM 0.7
2004-06-22 00:15:00 14 SCL 0.5
2004-06-22 07:45:00 15 MBCCM 0.9
2004-06-22 22:45:00 16 CCM 0.8
2004-07-01 05:15:00 17 MBCCM 0.8
2004-07-02 00:15:00 18 MBSCL 0.6
2004-07-04 11:45:00 19 MBCCM 0.9
2004-07-06 03:45:00 20 SCL 0.6
2004-07-07 04:15:00 21 CCM 0.9
2004-07-08 02:45:00 22 MBCCM 1.0
2004-07-08 11:45:00 23 MBCCM 0.8
2004-07-08 02:15:00 24 MBCCM 0.9
2004-07-09 04:45:00 25 CCM 0.7
2004-07-11 18:15:00 26 MBSCL 0.4
2004-07-11 23:15:00 27 MBSCL 0.3
2004-07-15 10:45:00 28 CCM 0.8
2004-07-16 12:15:00 29 MBCCM 0.8
2004-07-17 02:15:00 30 MBCCM 0.8
2004-07-17 05:45:00 31 MBCCM 0.7
2004-07-19 23:15:00 32 CCM 0.9
2004-07-20 09:15:00 33 CCM 0.7
2004-07-20 21:45:00 34 SCL 0.6
2004-07-23 03:45:00 35 SCL 0.6
2004-07-23 12:45:00 36 MBCCM 0.9
2004-07-24 00:45:00 37 CCM 0.7
2004-07-26 00:15:00 38 MBCCM 0.8
2004-07-27 05:15:00 39 MBSCL 0.6
2004-07-27 07:15:00 40 MBSCL 0.6
2004-07-27 14:15:00 41 MBCCM 0.7
2004-07-27 19:45:00 42 SCL 0.6
2004-07-27 23:15:00 43 MBSCL 0.6
2004-07-28 07:15:00 44 MBCCM 0.8
2004-07-30 05:15:00 45 MBCCM 0.7
2004-07-31 00:15:00 46 SCL 0.5
2004-07-31 04:15:00 47 MBSCL 0.6
df1
Tipo Abril Mayo Junio Julio Agosto Septiembre Octubre
MCC 2 9 8 1 5 6 3
CCM 7 11 23 12 7 2 4
MBCCM 4 8 4 1 3 4 2
SCL 1 7 2 4 1 9 5
MBSCL 6 3 7 1 9 3 10
how can i do this, from df?

import pandas as pd
df = pd.read_table('data', sep='\s{2,}')
df.index = pd.to_datetime(df.index)
df['Month'] = [date.strftime('%B') for date in df.index]
print(pd.crosstab(rows=[df['Forma']], cols=[df['Month']], margins=False))
yields
Month April July June May
Forma
CCM 0 6 2 0
MBCCM 1 13 4 2
MBSCL 1 7 3 0
SCL 1 5 1 1

Related

Error plotting a time column as x-axis ticks

I have a df as follows
Time Samstag
0 00:15:00 80.6
1 00:30:00 74.6
2 00:45:00 69.2
3 01:00:00 63.6
4 01:15:00 57.1
5 01:30:00 50.4
6 01:45:00 44.1
7 02:00:00 39.1
8 02:15:00 36.0
9 02:30:00 34.4
10 02:45:00 33.7
11 03:00:00 33.3
12 03:15:00 32.7
13 03:30:00 32.0
14 03:45:00 31.5
15 04:00:00 31.3
16 04:15:00 31.5
17 04:30:00 31.7
18 04:45:00 31.5
19 05:00:00 30.3
20 05:15:00 28.1
21 05:30:00 26.4
22 05:45:00 27.1
23 06:00:00 32.3
24 06:15:00 42.9
25 06:30:00 56.2
26 06:45:00 68.5
27 07:00:00 76.3
28 07:15:00 77.0
29 07:30:00 72.9
30 07:45:00 67.3
31 08:00:00 63.6
32 08:15:00 64.5
33 08:30:00 69.5
34 08:45:00 77.4
35 09:00:00 87.1
36 09:15:00 97.4
37 09:30:00 108.4
38 09:45:00 119.9
39 10:00:00 132.1
40 10:15:00 144.7
41 10:30:00 156.7
42 10:45:00 166.9
43 11:00:00 174.1
44 11:15:00 177.4
45 11:30:00 177.7
46 11:45:00 176.2
47 12:00:00 174.1
48 12:15:00 172.6
49 12:30:00 172.0
50 12:45:00 172.4
51 13:00:00 174.1
52 13:15:00 177.1
53 13:30:00 180.4
54 13:45:00 183.0
55 14:00:00 183.9
56 14:15:00 182.4
57 14:30:00 179.5
58 14:45:00 176.6
59 15:00:00 175.1
60 15:15:00 176.0
61 15:30:00 178.9
62 15:45:00 182.8
63 16:00:00 186.8
64 16:15:00 190.3
65 16:30:00 193.8
66 16:45:00 197.9
67 17:00:00 203.5
68 17:15:00 210.8
69 17:30:00 218.8
70 17:45:00 226.3
71 18:00:00 231.8
72 18:15:00 234.4
73 18:30:00 234.5
74 18:45:00 233.0
75 19:00:00 230.9
76 19:15:00 228.7
77 19:30:00 226.9
78 19:45:00 225.3
79 20:00:00 224.0
80 20:15:00 223.0
81 20:30:00 221.5
82 20:45:00 218.9
83 21:00:00 214.2
84 21:15:00 207.0
85 21:30:00 197.0
86 21:45:00 184.4
87 22:00:00 169.2
88 22:15:00 151.8
89 22:30:00 133.7
90 22:45:00 116.7
91 23:00:00 102.7
92 23:15:00 93.0
93 23:30:00 86.6
94 23:45:00 82.2
I am trying to plot this as follows:
sns.lineplot(x="Time", y="Samstag", data=w_df)
plt.xticks(rotation=15)
plt.xlabel("Time")
plt.ylabel("KWH")
plt.show()
and it gives:
The label of x-axis is 00:00, 05:33:20, .... and so on.
I am trying to plot the Time column as the ticks in x-axis
I tried:
t = pd.to_datetime(w_df["Time"], format='%H:%M:%S')
t = t.apply(lambda x: x.strftime('%H:%M:%S'))
sns.lineplot(x="Time", y="Samstag", data=w_df)
plt.xticks(ticks=t, rotation=15)
plt.xlabel("Time")
plt.ylabel("KWH")
plt.show()
It throws the following error:
Traceback (most recent call last):
File "", line 2, in
plt.xticks(ticks=t, rotation=15)
File
"/home/user/anaconda3/lib/python3.7/site-packages/matplotlib/pyplot.py",
line 1540, in xticks
locs = ax.set_xticks(ticks)
File
"/home/user/anaconda3/lib/python3.7/site-packages/matplotlib/axes/_base.py",
line 3350, in set_xticks
ret = self.xaxis.set_ticks(ticks, minor=minor)
File
"/home/user/anaconda3/lib/python3.7/site-packages/matplotlib/axis.py",
line 1755, in set_ticks
self.set_view_interval(min(ticks), max(ticks))
File
"/home/user/anaconda3/lib/python3.7/site-packages/matplotlib/axis.py",
line 1892, in setter
setter(self, min(vmin, vmax, oldmin), max(vmin, vmax, oldmax),
TypeError: '<' not supported between instances of 'numpy.ndarray' and
'str'
Can anyone please tell the mistake that I am doing?
Also,
w_df.dtypes
Out[27]:
Time object
Samstag float64
Sonntag float64
Werktag float64
dtype: object
So I took some of your data and attempted to get your result. Unfortunately, my Seaborn plot is plotting in the same format that you would like. This may have to do with the format of your time column. When I made my small dataset from your example, I made the time column a string, and it appears that everything is plotting fine.
d = {'Time': ["00:15:00", "00:30:00", "00:45:00", "01:00:00", "01:15:00", "01:30:00", "01:45:00",
"02:00:00", "02:15:00", "02:30:00", "02:45:00", "03:00:00", "03:15:00", "03:30:00", "03:45:00",
"04:00:00", "04:15:00", "04:30:00", "04:45:00", "05:00:00", "05:15:00", "05:30:00",
"05:45:00", "06:00:00"],
'Samstag': [80.6, 74.6,69.2, 62.6, 57.1,50.4, 44.1, 39.1, 36.0, 34.4, 33.7,33.3, 32.7, 32.0,
31.5, 31.3, 31.5, 31.7, 31.5,30.3, 28.1, 26.4, 27.1, 32.3]
}
df = pd.DataFrame(d)
sns.lineplot(x="Time", y="Samstag", data=df)
plt.xticks(rotation=15)
plt.xlabel("Time")
plt.ylabel("KWH")
plt.show()
This makes every time stamp a tick mark. Perhaps you can change your time column to be a string, if it is not already.
df['Time'] = df['Time'].astype(str)

Resampling and add threshold information in Pandas dataframe

I have my pandas data frame in 1 min frequency, I want to do the re-sampling based on the threshold data (there are multiple thresholds in a numpy array)
Here is example of my dataset:
2018-01-01 00:01:00 0.867609
2018-01-01 00:02:00 0.544493
2018-01-01 00:03:00 0.958497
2018-01-01 00:04:00 0.371790
2018-01-01 00:05:00 0.470320
2018-01-01 00:06:00 0.757448
2018-01-01 00:07:00 0.198261
2018-01-01 00:08:00 0.666350
2018-01-01 00:09:00 0.392574
2018-01-01 00:10:00 0.627608
2018-01-01 00:11:00 0.414380
2018-01-01 00:12:00 0.120925
2018-01-01 00:13:00 0.559495
2018-01-01 00:14:00 0.260619
2018-01-01 00:15:00 0.982731
2018-01-01 00:16:00 0.996133
2018-01-01 00:17:00 0.410816
2018-01-01 00:18:00 0.366457
2018-01-01 00:19:00 0.927745
2018-01-01 00:20:00 0.626804
2018-01-01 00:21:00 0.223193
2018-01-01 00:22:00 0.007136
2018-01-01 00:23:00 0.245006
2018-01-01 00:24:00 0.491245
2018-01-01 00:25:00 0.215716
2018-01-01 00:26:00 0.932378
2018-01-01 00:27:00 0.366263
2018-01-01 00:28:00 0.522177
2018-01-01 00:29:00 0.614966
2018-01-01 00:30:00 0.670983
threshold=np.array([0.5,0.8,0.9])
What I want is to extract the data where it crosses the threshold values and if doesn't cross the threshold value just resample data at 30 min
Sample ans :
Threshold
2018-01-01 00:01:00 0.867609 0.8
2018-01-01 00:02:00 0.544493 0.5
2018-01-01 00:03:00 0.958497 0.9
2018-01-01 00:05:00 0.421055 NA
2018-01-01 00:06:00 0.757448 0.5
2018-01-01 00:07:00 0.198261 NA
2018-01-01 00:08:00 0.666350 0.5
2018-01-01 00:09:00 0.392574 NA
2018-01-01 00:10:00 0.627608 0.5
2018-01-01 00:12:00 0.414380 NA
2018-01-01 00:13:00 0.559495 0.5
2018-01-01 00:14:00 0.260619 NA
2018-01-01 00:15:00 0.982731 0.9
2018-01-01 00:16:00 0.996133 0.9
2018-01-01 00:18:00 0.388636 NA
2018-01-01 00:19:00 0.927745 0.9
2018-01-01 00:20:00 0.626804 0.5
2018-01-01 00:25:00 0.215716 NA
2018-01-01 00:26:00 0.932378 0.9
2018-01-01 00:27:00 0.366263 NA
2018-01-01 00:28:00 0.522177 0.5
2018-01-01 00:29:00 0.614966 0.5
2018-01-01 00:30:00 0.670983 0.5
I got the solution for resampling from #Scott Boston,
df = df.set_index(0)
g = df[1].lt(-22).mul(1).diff().bfill().ne(0).cumsum()
df.groupby(g).apply(lambda x: x.resample('1T', kind='period').mean().reset_index()
if (x.iloc[0] < -22).any() else
x.resample('30T', kind='period').mean().reset_index())\
.reset_index(drop=True)
Use pd.cut:
threshold=np.array([0.5,0.8,0.9]).tolist()
pd.cut(df[1],bins=threshold+[np.inf],labels=threshold)
Output:
0 0.8
1 0.5
2 0.9
3 NaN
4 NaN
5 0.5
6 NaN
7 0.5
8 NaN
9 0.5
10 NaN
11 NaN
12 0.5
13 NaN
14 0.9
15 0.9
16 NaN
17 NaN
18 0.9
19 0.5
20 NaN
21 NaN
22 NaN
23 NaN
24 NaN
25 0.9
26 NaN
27 0.5
28 0.5
29 0.5
Name: 1, dtype: category
Categories (3, float64): [0.5 < 0.8 < 0.9]
Now, let's add this to the datafame and filter out all consecutive NaNs.
df['Threshold'] = pd.cut(df[1],bins=threshold+[np.inf],labels=threshold)
mask = ~(df.Threshold.isnull() & (df.Threshold.isnull() == df.Threshold.isnull().shift(1)))
df[mask]
Output:
0 1 Threshold
0 2018-01-01 00:01:00 0.867609 0.8
1 2018-01-01 00:02:00 0.544493 0.5
2 2018-01-01 00:03:00 0.958497 0.9
3 2018-01-01 00:04:00 0.371790 NaN
5 2018-01-01 00:06:00 0.757448 0.5
6 2018-01-01 00:07:00 0.198261 NaN
7 2018-01-01 00:08:00 0.666350 0.5
8 2018-01-01 00:09:00 0.392574 NaN
9 2018-01-01 00:10:00 0.627608 0.5
10 2018-01-01 00:11:00 0.414380 NaN
12 2018-01-01 00:13:00 0.559495 0.5
13 2018-01-01 00:14:00 0.260619 NaN
14 2018-01-01 00:15:00 0.982731 0.9
15 2018-01-01 00:16:00 0.996133 0.9
16 2018-01-01 00:17:00 0.410816 NaN
18 2018-01-01 00:19:00 0.927745 0.9
19 2018-01-01 00:20:00 0.626804 0.5
20 2018-01-01 00:21:00 0.223193 NaN
25 2018-01-01 00:26:00 0.932378 0.9
26 2018-01-01 00:27:00 0.366263 NaN
27 2018-01-01 00:28:00 0.522177 0.5
28 2018-01-01 00:29:00 0.614966 0.5
29 2018-01-01 00:30:00 0.670983 0.5

How do I group hourly data by day and count only values greater than a set amount in Pandas?

I am new to Pandas but have been working with python for a few years now.
I have a large data set of hourly data with multiple columns. I need to group the data by day then count how many times the value is above 85 for each day for each column.
example data:
date KMRY KSNS PCEC1 KFAT
2014-06-06 13:00:00 56.000000 63.0 17 11
2014-06-06 14:00:00 58.000000 61.0 17 11
2014-06-06 15:00:00 63.000000 63.0 16 10
2014-06-06 16:00:00 67.000000 65.0 12 11
2014-06-06 17:00:00 67.000000 67.0 10 13
2014-06-06 18:00:00 72.000000 75.0 9 14
2014-06-06 19:00:00 77.000000 79.0 9 15
2014-06-06 20:00:00 84.000000 81.0 9 23
2014-06-06 21:00:00 81.000000 86.0 12 31
2014-06-06 22:00:00 84.000000 84.0 13 28
2014-06-06 23:00:00 83.000000 86.0 15 34
2014-06-07 00:00:00 84.000000 86.0 16 36
2014-06-07 01:00:00 86.000000 89.0 17 43
2014-06-07 02:00:00 86.000000 89.0 20 44
2014-06-07 03:00:00 89.000000 89.0 22 49
2014-06-07 04:00:00 86.000000 86.0 22 51
2014-06-07 05:00:00 86.000000 89.0 21 53
From the sample above my results should look like the following:
date KMRY KSNS PCEC1 KFAT
2014-06-06 0 2 0 0
2014-06-07 5 6 0 0
Any help you be greatly appreciated.
(D_RH>85).sum()
The above code gets me close but I need a daily break down also not just the column counts.
One way would be to make date a DatetimeIndex and then groupby the result of the comparison to 85. For example:
>>> df["date"] = pd.to_datetime(df["date"]) # only if it isn't already
>>> df = df.set_index("date")
>>> (df > 85).groupby(df.index.date).sum()
KMRY KSNS PCEC1 KFAT
2014-06-06 0 2 0 0
2014-06-07 5 6 0 0

How to order dataframe for plotting 3d bar in pandas

I am trying to create a chart with multiple bars in 3d from pandas. Reviewing some examples on the web, I see that the best way to accomplish this is to get a dataframe like this:
data
Variable A B C D
date
2000-01-03 0.469112 -1.135632 0.119209 -2.104569
2000-01-04 -0.282863 1.212112 -1.044236 -0.494929
2000-01-05 -1.509059 -0.173215 -0.861849 1.071804
My dataframe is:
df
Date_inicio Date_Fin Date_Max Clase
0 2004-04-09 23:00:00 2004-04-10 04:00:00 2004-04-10 02:00:00 MBCCM
1 2004-04-12 23:00:00 2004-04-13 04:00:00 2004-04-13 00:00:00 MBSCL
2 2004-04-24 04:00:00 2004-04-24 12:00:00 2004-04-24 09:00:00 SCL
3 2004-05-02 07:00:00 2004-05-02 14:00:00 2004-05-02 11:00:00 SCL
4 2004-05-30 05:00:00 2004-05-30 08:00:00 2004-05-30 07:00:00 MBCCM
5 2004-05-31 03:00:00 2004-05-31 07:00:00 2004-05-31 05:00:00 MBCCM
6 2004-06-08 00:00:00 2004-06-08 05:00:00 2004-06-08 03:00:00 MBSCL
7 2004-06-12 22:00:00 2004-06-13 12:00:00 2004-06-13 06:00:00 CCM
8 2004-06-13 03:00:00 2004-06-13 08:00:00 2004-06-13 06:00:00 MBCCM
9 2004-06-14 00:00:00 2004-06-14 03:00:00 2004-06-14 02:00:00 MBSCL
10 2004-06-14 03:00:00 2004-06-14 09:00:00 2004-06-14 07:00:00 MBSCL
11 2004-06-17 08:00:00 2004-06-17 14:00:00 2004-06-17 11:00:00 MBCCM
12 2004-06-17 12:00:00 2004-06-17 17:00:00 2004-06-17 14:00:00 MBCCM
13 2004-06-22 00:00:00 2004-06-22 08:00:00 2004-06-22 06:00:00 SCL
14 2004-06-22 08:00:00 2004-06-22 14:00:00 2004-06-22 11:00:00 MBCCM
15 2004-06-22 23:00:00 2004-06-23 09:00:00 2004-06-23 06:00:00 CCM
16 2004-07-01 05:00:00 2004-07-01 09:00:00 2004-07-01 06:00:00 MBCCM
17 2004-07-02 00:00:00 2004-07-02 04:00:00 2004-07-02 02:00:00 MBSCL
18 2004-07-04 12:00:00 2004-07-04 15:00:00 2004-07-04 13:00:00 MBCCM
19 2004-07-06 04:00:00 2004-07-06 13:00:00 2004-07-06 07:00:00 SCL
20 2004-07-07 04:00:00 2004-07-07 12:00:00 2004-07-07 10:00:00 CCM
21 2004-07-08 03:00:00 2004-07-08 06:00:00 2004-07-08 05:00:00 MBCCM
22 2004-07-08 12:00:00 2004-07-08 17:00:00 2004-07-08 13:00:00 MBCCM
23 2004-07-08 02:00:00 2004-07-08 06:00:00 2004-07-08 04:00:00 MBCCM
24 2004-07-09 05:00:00 2004-07-09 12:00:00 2004-07-09 08:00:00 CCM
25 2004-07-11 18:00:00 2004-07-12 12:00:00 2004-07-11 21:00:00 MBSCL
26 2004-07-11 23:00:00 2004-07-12 05:00:00 2004-07-12 02:00:00 MBSCL
27 2004-07-15 11:00:00 2004-07-15 19:00:00 2004-07-15 12:00:00 CCM
28 2004-07-16 12:00:00 2004-07-16 16:00:00 2004-07-16 14:00:00 MBCCM
29 2004-07-17 02:00:00 2004-07-17 06:00:00 2004-07-17 05:00:00 MBCCM
Now I want to get the occurrence of all classes for hour. For example, how many times different classes occur at some time in Date_inicio, Date_fin and Date_max. From df i obtain the next frecuencies table,
frec
Frec_Inicio Frec_Max Frec_Fin
Horas
1 2 0 1
2 3 8 1
3 5 3 2
4 6 2 6
5 6 6 5
6 5 6 4
7 5 7 2
8 2 4 5
9 1 6 6
10 0 3 2
11 2 5 5
12 4 1 9
13 2 4 2
14 3 2 4
15 0 2 3
16 1 1 3
17 0 2 3
18 1 1 1
19 0 0 3
20 1 1 1
21 1 1 0
22 3 1 0
23 9 1 0
24 8 3 2
Now, my goal is to plot a 3D bar like the figure below
To achieve this, i write the following code
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
xpos=np.arange(frec.shape[0])
ypos=np.arange(frec.shape[1])
xpos, ypos = np.meshgrid(xpos+0.25, ypos+0.25)
xpos = xpos.flatten()
ypos = ypos.flatten()
zpos=np.zeros(frec.shape).flatten()
dx=0.5 * np.ones_like(zpos)
dy=0.5 * np.ones_like(zpos)
dz=frec.values.ravel()
dz[np.isnan(dz)]=0.
ax.bar3d(xpos,ypos,zpos,dx,dy,dz,color='b', alpha=0.5)
ax.set_xticks([.5,1.5,2.5])
ax.set_yticks([.5,1.5,2.5,3.5])
ax.w_yaxis.set_ticklabels(frec.columns)
ax.w_xaxis.set_ticklabels(frec.index)
ax.set_xlabel('Time')
ax.set_ylabel('B')
ax.set_zlabel('Occurrence')
plt.show()
How I get a better plot, similar to the previous figure?
Here is the code to do count:
import pandas as pd
text="""Date_inicio, Date_Fin, Date_Max, Clase
2004-04-09 23:00:00, 2004-04-10 04:00:00, 2004-04-10 02:00:00, MBCCM
2004-04-12 23:00:00, 2004-04-13 04:00:00, 2004-04-13 00:00:00, MBSCL
2004-04-24 04:00:00, 2004-04-24 12:00:00, 2004-04-24 09:00:00, SCL
2004-05-02 07:00:00, 2004-05-02 14:00:00, 2004-05-02 11:00:00, SCL
2004-05-30 05:00:00, 2004-05-30 08:00:00, 2004-05-30 07:00:00, MBCCM
2004-05-31 03:00:00, 2004-05-31 07:00:00, 2004-05-31 05:00:00, MBCCM
2004-06-08 00:00:00, 2004-06-08 05:00:00, 2004-06-08 03:00:00, MBSCL
2004-06-12 22:00:00, 2004-06-13 12:00:00, 2004-06-13 06:00:00, CCM
2004-06-13 03:00:00, 2004-06-13 08:00:00, 2004-06-13 06:00:00, MBCCM
2004-06-14 00:00:00, 2004-06-14 03:00:00, 2004-06-14 02:00:00, MBSCL
2004-06-14 03:00:00, 2004-06-14 09:00:00, 2004-06-14 07:00:00, MBSCL
2004-06-17 08:00:00, 2004-06-17 14:00:00, 2004-06-17 11:00:00, MBCCM
2004-06-17 12:00:00, 2004-06-17 17:00:00, 2004-06-17 14:00:00, MBCCM
2004-06-22 00:00:00, 2004-06-22 08:00:00, 2004-06-22 06:00:00, SCL
2004-06-22 08:00:00, 2004-06-22 14:00:00, 2004-06-22 11:00:00, MBCCM
2004-06-22 23:00:00, 2004-06-23 09:00:00, 2004-06-23 06:00:00, CCM
2004-07-01 05:00:00, 2004-07-01 09:00:00, 2004-07-01 06:00:00, MBCCM
2004-07-02 00:00:00, 2004-07-02 04:00:00, 2004-07-02 02:00:00, MBSCL
2004-07-04 12:00:00, 2004-07-04 15:00:00, 2004-07-04 13:00:00, MBCCM
2004-07-06 04:00:00, 2004-07-06 13:00:00, 2004-07-06 07:00:00, SCL
2004-07-07 04:00:00, 2004-07-07 12:00:00, 2004-07-07 10:00:00, CCM
2004-07-08 03:00:00, 2004-07-08 06:00:00, 2004-07-08 05:00:00, MBCCM
2004-07-08 12:00:00, 2004-07-08 17:00:00, 2004-07-08 13:00:00, MBCCM
2004-07-08 02:00:00, 2004-07-08 06:00:00, 2004-07-08 04:00:00, MBCCM
2004-07-09 05:00:00, 2004-07-09 12:00:00, 2004-07-09 08:00:00, CCM
2004-07-11 18:00:00, 2004-07-12 12:00:00, 2004-07-11 21:00:00, MBSCL
2004-07-11 23:00:00, 2004-07-12 05:00:00, 2004-07-12 02:00:00, MBSCL
2004-07-15 11:00:00, 2004-07-15 19:00:00, 2004-07-15 12:00:00, CCM
2004-07-16 12:00:00, 2004-07-16 16:00:00, 2004-07-16 14:00:00, MBCCM
2004-07-17 02:00:00, 2004-07-17 06:00:00, 2004-07-17 05:00:00, MBCCM"""
import io
df = pd.read_csv(io.BytesIO(text), skipinitialspace=True)
df.drop(["Clase"], axis=1, inplace=True)
df = df.apply(lambda s:s.str[11:13]).convert_objects(convert_numeric=True)
df2 = df.apply(lambda s:s.value_counts())
print df2
Here is the code that draw 3d bars:
import pandas as pd
text="""Horas Frec_Inicio Frec_Max Frec_Fin
1 2 0 1
2 3 8 1
3 5 3 2
4 6 2 6
5 6 6 5
6 5 6 4
7 5 7 2
8 2 4 5
9 1 6 6
10 0 3 2
11 2 5 5
12 4 1 9
13 2 4 2
14 3 2 4
15 0 2 3
16 1 1 3
17 0 2 3
18 1 1 1
19 0 0 3
20 1 1 1
21 1 1 0
22 3 1 0
23 9 1 0
24 8 3 2"""
import io
df = pd.read_csv(io.BytesIO(text), skipinitialspace=True, delim_whitespace=True)
df.set_index("Horas", inplace=True)
columns_name = [x.replace("_", " ") for x in df.columns]
df.columns = [0, 2, 4]
x, y, z = df.stack().reset_index().values.T
import visvis as vv
app = vv.use()
f = vv.clf()
a = vv.cla()
bar =vv.bar3(x, y, z, width=0.8)
bar.colors = ["r","g","b"] * 24
a.axis.yTicks = dict(zip(df.columns, columns_name))
app.Run()
the output:

How order bins from a crosstab

I am trying to create a frequency table from a dataframe like this:
scm=pd.read_csv('carac_scm.csv')
scm=scm[0:30][['Hora_inicio','Forma','AreaMax']]
scm
Hora_inicio Forma AreaMax
0 2004-04-09 22:45:00 MBCCM 58
1 2004-04-12 22:45:00 MBSCL 86
2 2004-04-24 03:45:00 SCL 141
3 2004-05-02 06:45:00 SCL 108
4 2004-05-30 04:45:00 MBCCM 64
5 2004-05-31 03:15:00 MBCCM 77
6 2004-06-08 00:15:00 MBSCL 51
7 2004-06-12 22:15:00 CCM 73
8 2004-06-13 02:45:00 MBCCM 87
9 2004-06-13 23:45:00 MBSCL 54
10 2004-06-14 03:15:00 MBSCL 70
11 2004-06-17 08:15:00 MBCCM 47
12 2004-06-17 11:45:00 MBCCM 76
13 2004-06-22 00:15:00 SCL 76
14 2004-06-22 07:45:00 MBCCM 115
15 2004-06-22 22:45:00 CCM 98
16 2004-07-01 05:15:00 MBCCM 57
17 2004-07-02 00:15:00 MBSCL 61
18 2004-07-04 11:45:00 MBCCM 50
19 2004-07-06 03:45:00 SCL 77
20 2004-07-07 04:15:00 CCM 51
21 2004-07-08 02:45:00 MBCCM 49
22 2004-07-08 11:45:00 MBCCM 40
23 2004-07-08 02:15:00 MBCCM 74
24 2004-07-09 04:45:00 CCM 39
25 2004-07-11 18:15:00 MBSCL 59
26 2004-07-11 23:15:00 MBSCL 85
27 2004-07-15 10:45:00 CCM 51
28 2004-07-16 12:15:00 MBCCM 53
29 2004-07-17 02:15:00 MBCCM 80
Now I ordered scm.AreaMax, in order to get the best bin. To do this, use the "cut module" and add a new column called bins containing the generated intervals. The following code is an example of what is described above:
scm=scm.sort(columns=['AreaMax'])
scm['bins']=pd.cut(scm.AreaMax, bins=[30, 50, 70,90, 110,130,150])
Hora_inicio Forma AreaMax bins
24 2004-07-09 04:45:00 CCM 39 (30, 50]
22 2004-07-08 11:45:00 MBCCM 40 (30, 50]
11 2004-06-17 08:15:00 MBCCM 47 (30, 50]
21 2004-07-08 02:45:00 MBCCM 49 (30, 50]
18 2004-07-04 11:45:00 MBCCM 50 (30, 50]
27 2004-07-15 10:45:00 CCM 51 (50, 70]
6 2004-06-08 00:15:00 MBSCL 51 (50, 70]
20 2004-07-07 04:15:00 CCM 51 (50, 70]
28 2004-07-16 12:15:00 MBCCM 53 (50, 70]
9 2004-06-13 23:45:00 MBSCL 54 (50, 70]
16 2004-07-01 05:15:00 MBCCM 57 (50, 70]
0 2004-04-09 22:45:00 MBCCM 58 (50, 70]
25 2004-07-11 18:15:00 MBSCL 59 (50, 70]
17 2004-07-02 00:15:00 MBSCL 61 (50, 70]
4 2004-05-30 04:45:00 MBCCM 64 (50, 70]
10 2004-06-14 03:15:00 MBSCL 70 (50, 70]
7 2004-06-12 22:15:00 CCM 73 (70, 90]
23 2004-07-08 02:15:00 MBCCM 74 (70, 90]
12 2004-06-17 11:45:00 MBCCM 76 (70, 90]
13 2004-06-22 00:15:00 SCL 76 (70, 90]
5 2004-05-31 03:15:00 MBCCM 77 (70, 90]
19 2004-07-06 03:45:00 SCL 77 (70, 90]
29 2004-07-17 02:15:00 MBCCM 80 (70, 90]
26 2004-07-11 23:15:00 MBSCL 85 (70, 90]
1 2004-04-12 22:45:00 MBSCL 86 (70, 90]
8 2004-06-13 02:45:00 MBCCM 87 (70, 90]
15 2004-06-22 22:45:00 CCM 98 (90, 110]
3 2004-05-02 06:45:00 SCL 108 (90, 110]
14 2004-06-22 07:45:00 MBCCM 115 (110, 130]
2 2004-04-24 03:45:00 SCL 141 (130, 150]
Now create a frequency table to plot a stacked bar charty get the next:
df=pd.crosstab(rows=[scm['bins']],cols=[scm['Forma']],margins=False)
df
Forma CCM MBCCM MBSCL SCL
bins
(110, 130] 0 1 0 0
(130, 150] 0 0 0 1
(30, 50] 1 4 0 0
(50, 70] 2 4 5 0
(70, 90] 1 5 2 2
(90, 110] 1 0 0 1
df.plot(kind='bar', stacked=True)
How to order bins to get a table like this?
Forma CCM MBCCM MBSCL SCL
bins
(30, 50] 1 4 0 0
(50, 70] 2 4 5 0
(70, 90] 1 5 2 2
(90, 110] 1 0 0 1
(110, 130] 0 1 0 0
(130, 150] 0 0 0 1
I tried to get this with the following lines of code and do not get the desired result
df.sort() #Get the same table
df.sort_index() # Get the same table
df.sort_index(ascending=False)
Forma CCM MBCCM MBSCL SCL
bins
(90, 110] 1 0 0 1
(70, 90] 1 5 2 2
(50, 70] 2 4 5 0
(30, 50] 1 4 0 0
(130, 150] 0 0 0 1
(110, 130] 0 1 0 0
Can anyone suggest me an idea?
This is because index is string/unicode and '30' > '110'
You can create a numeric col to do sorting and then delete.
df['sort_col'] = [float(s.split(',')[0][1:]) for s in df.index]
df.sort(columns= 'sort_col',inplace=True)
del df['sort_col'] #You don't want to plot this col
df.plot(kind='bar', stacked=True)

Categories

Resources