Issue with reindexing a multiindex - python

I am struggling to reindex a multiindex. Example code below:
import numpy as np
import pandas as pd

rng = pd.date_range('01/01/2000 00:00', '31/12/2004 23:00', freq='H')
ts = pd.Series([h.dayofyear for h in rng], index=rng)
daygrouped = ts.groupby(lambda x: x.dayofyear)
daymean = daygrouped.mean()
myindex = np.arange(1,367)
myindex = np.concatenate((myindex[183:],myindex[:183]))
daymean.reindex(myindex)
gives (as expected):
184 184
185 185
186 186
187 187
...
180 180
181 181
182 182
183 183
Length: 366, dtype: int64
BUT if I create a multiindex:
hourgrouped = ts.groupby([lambda x: x.dayofyear, lambda x: x.hour])
hourmean = hourgrouped.mean()
myindex = np.arange(1,367)
myindex = np.concatenate((myindex[183:],myindex[:183]))
hourmean.reindex(myindex, level=1)
I get:
1 1 1
2 1
3 1
4 1
...
366 20 366
21 366
22 366
23 366
Length: 8418, dtype: int64
Any ideas on my mistake? - Thanks.
Bevan

First, you have to specify level=0 instead of level=1: the day-of-year values are the first level of the MultiIndex, and levels are zero-based.
But there is still a problem: the reindexing works, but it does not seem to preserve the order of the provided index in the case of a MultiIndex:
In [54]: hourmean.reindex([5,4], level=0)
Out[54]:
4 0 4
1 4
2 4
3 4
4 4
...
20 4
21 4
22 4
23 4
5 0 5
1 5
2 5
3 5
4 5
...
20 5
21 5
22 5
23 5
dtype: int64
So selecting a new subset of the index works, but the result stays in the original order rather than the order of the provided index.
This is possibly a bug in reindex on a specified level (I opened an issue to discuss this: https://github.com/pydata/pandas/issues/8241).
A workaround for now is to build the full MultiIndex and reindex with that (so not on a specified level; reindexing with the full index does preserve the order). This is easy with MultiIndex.from_product, as you already have myindex:
In [79]: myindex2 = pd.MultiIndex.from_product([myindex, range(24)])
In [82]: hourmean.reindex(myindex2)
Out[82]:
184 0 184
1 184
2 184
3 184
4 184
5 184
6 184
7 184
8 184
9 184
10 184
11 184
12 184
13 184
14 184
...
183 9 183
10 183
11 183
12 183
13 183
14 183
15 183
16 183
17 183
18 183
19 183
20 183
21 183
22 183
23 183
Length: 8784, dtype: int64
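Putting the workaround together, here is a minimal self-contained sketch using the data from the question (note that recent pandas versions prefer freq='h' over freq='H', so treat the frequency string as version-dependent):

import numpy as np
import pandas as pd

# Hourly series over five years; each value is the day-of-year of its timestamp
rng = pd.date_range('01/01/2000 00:00', '31/12/2004 23:00', freq='H')
ts = pd.Series([h.dayofyear for h in rng], index=rng)
hourmean = ts.groupby([lambda x: x.dayofyear, lambda x: x.hour]).mean()

# Desired day order: 184..366 followed by 1..183
myindex = np.arange(1, 367)
myindex = np.concatenate((myindex[183:], myindex[:183]))

# Reindex with the full MultiIndex (days x hours) so the new order is kept
myindex2 = pd.MultiIndex.from_product([myindex, range(24)])
result = hourmean.reindex(myindex2)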

Related

Pandas Dataframe Reshape/Alteration Question

I feel like this should be an easy solution, but it has eluded me a bit (long week).
Say I have the following Pandas Dataframe (df):
day  x_count  x_max  y_count  y_max
1    8        230    18       127
1    6        174    12       121
1    5        218    21       184
1    11       91     32       162
2    11       128    17       151
2    13       156    16       148
2    18       191    22       120
Etc. How can I collapse it down so that I have one row per day, with each of the columns in my example summed across all rows for that day?
For example:
day  x_count  x_max  y_count  y_max
1    40       713    93       594
2    42       475    55       419
Is it best to reshape it or simply create a new one?
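A minimal sketch of the groupby-and-sum approach, assuming the sample rows above are loaded as df (the literal construction below is only for illustration):

import pandas as pd

df = pd.DataFrame({
    'day':     [1, 1, 1, 1, 2, 2, 2],
    'x_count': [8, 6, 5, 11, 11, 13, 18],
    'x_max':   [230, 174, 218, 91, 128, 156, 191],
    'y_count': [18, 12, 21, 32, 17, 16, 22],
    'y_max':   [127, 121, 184, 162, 151, 148, 120],
})

# One row per day; every other column is summed within its day
collapsed = df.groupby('day', as_index=False).sum()
print(collapsed)

There is no need to reshape; a plain groupby on day does the aggregation in one step.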

Pandas: clean & convert DataFrame to numbers

I have a dataframe containing strings, as read from a sloppy csv:
id  Total   B       C       ...
0   56 974  20 739  34 482
1   29 479  10 253  16 704
2   86 961  29 837  43 593
3   52 687  22 921  28 299
4   23 794  7 646   15 600
What I want to do: convert every cell in the frame into a number. It should ignore the whitespace (the spaces are thousands separators), but put NaN where a cell contains something really strange.
I probably know how to do it using terribly unperformant manual looping and replacing values, but was wondering if there's a nice and clean way to do this.
You can use read_csv with the regex separator \s{2,} (two or more whitespace characters) and the thousands parameter:
import pandas as pd
from io import StringIO

temp = u"""id  Total   B       C
0  56 974  20 739  34 482
1  29 479  10 253  16 704
2  86 961  29 837  43 593
3  52 687  22 921  28 299
4  23 794  7 646   15 600"""
#after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), sep=r"\s{2,}", engine='python', thousands=' ')
print (df)
   id  Total      B      C
0   0  56974  20739  34482
1   1  29479  10253  16704
2   2  86961  29837  43593
3   3  52687  22921  28299
4   4  23794   7646  15600
print (df.dtypes)
id int64
Total int64
B int64
C int64
dtype: object
And then, if necessary, apply to_numeric with errors='coerce', which replaces anything non-numeric with NaN:
df = df.apply(pd.to_numeric, errors='coerce')
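If the strings are already sitting in a DataFrame (rather than being re-read from the csv), a hedged alternative is to strip the separator spaces per column before coercing; this sketch assumes every column currently holds strings:

# Remove the space thousands-separators, then coerce; odd cells become NaN
df = df.apply(lambda col: pd.to_numeric(col.str.replace(' ', '', regex=False),
                                        errors='coerce'))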

Python pandas group by two columns

I have a pandas dataframe:
code type
index
312 11 21
312 11 41
312 11 21
313 23 22
313 11 21
... ...
So I need to count the occurrences of each ('code', 'type') pair for each index item:
11_21 11_41 23_22
index
312 2 1 0
313 1 0 1
... ...
How can I implement this with Python and pandas?
Here's one way: use pd.crosstab, then rename the columns using the level information.
In [136]: dff = pd.crosstab(df['index'], [df['code'], df['type']])
In [137]: dff
Out[137]:
code 11 23
type 21 41 22
index
312 2 1 0
313 1 0 1
In [138]: dff.columns = ['%s_%s' % c for c in dff.columns]
In [139]: dff
Out[139]:
11_21 11_41 23_22
index
312 2 1 0
313 1 0 1
Alternatively, less elegantly, create another column and use crosstab.
In [140]: df['ct'] = df.code.astype(str) + '_' + df.type.astype(str)
In [141]: df
Out[141]:
index code type ct
0 312 11 21 11_21
1 312 11 41 11_41
2 312 11 21 11_21
3 313 23 22 23_22
4 313 11 21 11_21
In [142]: pd.crosstab(df['index'], df['ct'])
Out[142]:
ct 11_21 11_41 23_22
index
312 2 1 0
313 1 0 1
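For reference, a minimal construction that reproduces the frames above (the values are taken from the question's sample; the construction itself is just for illustration):

import pandas as pd

df = pd.DataFrame({'index': [312, 312, 312, 313, 313],
                   'code':  [11, 11, 11, 23, 11],
                   'type':  [21, 41, 21, 22, 21]})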

Selectively remove deprecated rows in a pandas dataframe

I have a Dataframe containing data that looks like below.
p,g,a,s,v
15,196,1399,16,5
15,196,948,5,1
15,196,1894,5,1
15,196,1616,5,1
15,196,1742,3,1
15,196,1742,4,4
15,196,1742,5,1
15,195,732,9,2
15,195,1765,11,7
15,196,1815,9,1
15,196,1399,11,8
15,196,1958,0,1
15,195,767,9,1
15,195,1765,11,8
15,195,886,9,1
15,195,1765,11,9
15,196,1958,5,1
15,196,1697,1,1
15,196,1697,4,1
Given multiple entries that have the same p, g, a, and s, I need to drop all but the one with the highest v. The reason is that the original source of this data is a kind of event log, and each line corresponds to a "new total". If it matters, the source data is ordered by time and includes a timestamp index, which I removed for brevity. The entry with the latest date would be the same as the entry with the highest v, as v only increases.
Pulling an example out of the above data, given this:
p,g,a,s,v
15,195,1765,11,7
15,195,1765,11,8
15,195,1765,11,9
I need to drop the first two rows and keep the last one.
If I understand correctly, I think you want the following: it performs a groupby on your columns of interest, takes the max of column 'v', and then calls reset_index:
In [103]:
df.groupby(['p', 'g', 'a', 's'])['v'].max().reset_index()
Out[103]:
p g a s v
0 15 195 732 9 2
1 15 195 767 9 1
2 15 195 886 9 1
3 15 195 1765 11 9
4 15 196 948 5 1
5 15 196 1399 11 8
6 15 196 1399 16 5
7 15 196 1616 5 1
8 15 196 1697 1 1
9 15 196 1697 4 1
10 15 196 1742 3 1
11 15 196 1742 4 4
12 15 196 1742 5 1
13 15 196 1815 9 1
14 15 196 1894 5 1
15 15 196 1958 0 1
16 15 196 1958 5 1
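Since the source data is ordered by time and v only increases, a hedged alternative that keeps the original row order is drop_duplicates with keep='last':

# Keep only the last (latest, hence highest-v) row per (p, g, a, s) key
deduped = df.drop_duplicates(subset=['p', 'g', 'a', 's'], keep='last')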

Python Pandas GroupBy().Sum() Having Clause

So I have this DataFrame with 3 columns: 'Order ID', 'Order Qty' and 'Fill Qty'.
I want to sum the Fill Qty per order and then compare it to the Order Qty. Ideally I will return only a dataframe that gives me the Order ID whenever the aggregate Fill Qty is greater than the Order Qty.
In SQL I think what I'm looking for is
SELECT * FROM DataFrame GROUP BY Order ID, Order Qty HAVING sum(Fill Qty)>Order Qty
So far I have this:
SumFills= DataFrame.groupby(['Order ID','Order Qty']).sum()
output:
                     Fill Qty
Order ID  Order Qty
1         300        300
2         80         40
3         20         20
4         110        220
5         100        200
6         100        200
The above is already aggregated; I would ideally like to return a list/array of [4, 5, 6], since those have sum(Fill Qty) > Order Qty.
View original dataframe:
In [57]: print(original_df)
Order Id Fill Qty Order Qty
0 1 419 334
1 2 392 152
2 3 167 469
3 4 470 359
4 5 447 441
5 6 154 190
6 7 365 432
7 8 209 181
8 9 140 136
9 10 112 358
10 11 384 302
11 12 307 376
12 13 119 237
13 14 147 342
14 15 279 197
15 16 280 137
16 17 148 381
17 18 313 498
18 19 193 328
19 20 291 193
20 21 100 357
21 22 161 286
22 23 453 168
23 24 349 283
Create and view new dataframe summing the Fill Qty:
In [58]: new_df = original_df.groupby(['Order Id','Order Qty'], as_index=False).sum()
In [59]: print(new_df)
Order Id Order Qty Fill Qty
0 1 334 419
1 2 152 392
2 3 469 167
3 4 359 470
4 5 441 447
5 6 190 154
6 7 432 365
7 8 181 209
8 9 136 140
9 10 358 112
10 11 302 384
11 12 376 307
12 13 237 119
13 14 342 147
14 15 197 279
15 16 137 280
16 17 381 148
17 18 498 313
18 19 328 193
19 20 193 291
20 21 357 100
21 22 286 161
22 23 168 453
23 24 283 349
Slice new dataframe to only those rows where Fill Qty > Order Qty:
In [60]: new_df = new_df.loc[new_df['Fill Qty'] > new_df['Order Qty'],:]
In [61]: print(new_df)
Order Id Order Qty Fill Qty
0 1 334 419
1 2 152 392
3 4 359 470
4 5 441 447
7 8 181 209
8 9 136 140
10 11 302 384
14 15 197 279
15 16 137 280
19 20 193 291
22 23 168 453
23 24 283 349
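And to get the plain list of order ids that the question asks for, take the column from the filtered frame:

# List of Order Ids whose total Fill Qty exceeds the Order Qty
order_ids = new_df['Order Id'].tolist()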
