Python pandas idxmax for multiple indexes in a dataframe

I have a series that looks like this:
delivery
2007-04-26 706 23
2007-04-27 705 10
706 1089
708 83
710 13
712 51
802 4
806 1
812 3
2007-04-29 706 39
708 4
712 1
2007-04-30 705 3
706 1016
707 2
...
2014-11-04 1412 53
1501 1
1502 1
1512 1
2014-11-05 1411 47
1412 1334
1501 40
1502 433
1504 126
1506 100
1508 7
1510 6
1512 51
1604 1
1612 5
Length: 26255, dtype: int64
where the query is: df.groupby([df.index.date, 'delivery']).size()
For each day, I need to pull out the delivery number which has the most volume. I feel like it would be something like:
df.groupby([df.index.date, 'delivery']).size().idxmax(axis=1)
However, this just returns the idxmax for the entire dataframe. Instead, I need the second-level idxmax (not the date but the delivery number) for each day, i.e. it should return a vector with one delivery per day.
Any ideas on how to accomplish this?

Your example code doesn't work because idxmax is executed after the groupby operation, so it runs on the whole dataframe.
I'm not sure how to use idxmax on multilevel indexes, so here's a simple workaround.
Setting up the data:
import pandas as pd

d = {'Date': ['2007-04-26', '2007-04-27', '2007-04-27', '2007-04-27',
              '2007-04-27', '2007-04-28', '2007-04-28'],
     'DeliveryNb': [706, 705, 708, 450, 283, 45, 89],
     'DeliveryCount': [23, 10, 1089, 82, 34, 100, 11]}
df = pd.DataFrame.from_dict(d, orient='columns').set_index('Date')
print(df)
output
DeliveryCount DeliveryNb
Date
2007-04-26 23 706
2007-04-27 10 705
2007-04-27 1089 708
2007-04-27 82 450
2007-04-27 34 283
2007-04-28 100 45
2007-04-28 11 89
Creating a custom function:
The trick is to use the reset_index() method, so you can easily get the integer position of the maximum within each group.
def func(df):
    idx = df.reset_index()['DeliveryCount'].idxmax()
    return df['DeliveryNb'].iloc[idx]
Applying it:
g = df.groupby(df.index)
g.apply(func)
Result:
Date
2007-04-26 706
2007-04-27 708
2007-04-28 45
dtype: int64
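A shorter variant of the same idea (a sketch against the example df above, not part of the original answer): reset the index once, then feed the per-group idxmax positions straight into loc.
tmp = df.reset_index()
# idxmax gives the integer row label of the max DeliveryCount per Date
best = tmp.loc[tmp.groupby('Date')['DeliveryCount'].idxmax()]
print(best[['Date', 'DeliveryNb']])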

Suppose you have this series:
delivery
2001-01-02 0 2
1 3
6 2
7 2
9 3
2001-01-03 3 2
6 1
7 1
8 3
9 1
dtype: int64
If you want one delivery per date with the maximum value, you could use idxmax:
dates = series.index.get_level_values(0)
series.loc[series.groupby(dates).idxmax()]
yields
delivery
2001-01-02 1 3
2001-01-03 8 3
dtype: int64
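Since the question asks only for the delivery number per day, note that idxmax here returns (date, delivery) tuples, so the second element can be extracted (a small sketch, assuming the same series):
# .str[1] pulls the delivery number out of each (date, delivery) tuple
winners = series.groupby(dates).idxmax().str[1]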
If you want all deliveries per date with the maximum value, use transform to generate a boolean mask:
mask = series.groupby(dates).transform(lambda x: x==x.max()).astype('bool')
series.loc[mask]
yields
delivery
2001-01-02 1 3
9 3
2001-01-03 8 3
dtype: int64
This is the code I used to generate series:
import pandas as pd
import numpy as np
np.random.seed(1)
N = 20
rng = pd.date_range('2001-01-02', periods=N//2, freq='4H')
rng = np.random.choice(rng, N, replace=True)
rng.sort()
df = pd.DataFrame(np.random.randint(10, size=(N,)), columns=['delivery'], index=rng)
series = df.groupby([df.index.date, 'delivery']).size()

If you have the following dataframe (you can always reset the index if needed with df = df.reset_index()):
Date Del_Count Del_Nb
0 1/1 14 19 <
1 11 17
2 2/2 25 29 <
3 21 27
4 22 28
5 3/3 34 36
6 37 37
7 31 39 <
To find the max per Date and extract the relevant Del_Count you can use:
df = df.loc[df.groupby(['Date'], sort=False)['Del_Nb'].idxmax()][['Date', 'Del_Count', 'Del_Nb']]
Which would yield:
Date Del_Count Del_Nb
0 1/1 14 19
2 2/2 25 29
7 3/3 31 39
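An equivalent pattern that avoids idxmax entirely (a hedged sketch, not from the original answer): sort by Del_Nb, then keep the last row of each Date group.
# after sorting, the last row of each Date group holds the max Del_Nb
best = df.sort_values('Del_Nb').groupby('Date', sort=False).tail(1)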

Related

Python Pandas calculate total volume with last article volume

I have the following problem and do not know how to solve it in a performant way:
Input Pandas DataFrame:
timestep  article  volume
      35        1      20
      37        2       5
     123        2      12
     155        3      10
     178        2      23
     234        1      17
     478        1      28
Output Pandas DataFrame:
timestep  volume
      35      20
      37      25
     123      32
     178      53
     234      50
     478      61
Calculation Example for timestep 478:
28 (last article 1 volume) + 23 (last article 2 volume) + 10 (last article 3 volume) = 61
What is the best way to do this in pandas?
Try with ffill:
# sort if needed
df = df.sort_values("timestep")
df["volume"] = (df["volume"].where(df["article"].eq(1)).ffill().fillna(0) +
                df["volume"].where(df["article"].eq(2)).ffill().fillna(0))
output = df.drop("article", axis=1)
>>> output
timestep volume
0 35 20.0
1 37 25.0
2 123 32.0
3 178 43.0
4 234 40.0
5 478 51.0
Group by article, take the last element of each group, and sum:
df.groupby(['article']).tail(1)["volume"].sum()
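That one-liner only gives the final total. For the running total at every timestep, a hedged sketch of the same idea generalised (column names as in the question): pivot to one column per article, forward-fill each article's last seen volume, and sum across columns.
# one column per article, holding that article's volume history
wide = df.pivot(index='timestep', columns='article', values='volume')
# forward-fill the last seen volume per article, then sum across articles
running = wide.ffill().sum(axis=1)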
You can set group number of consecutive article by .cumsum(). Then get the value of previous group last item by .map() with GroupBy.last(). Finally, add volume with this previous last, as follows:
# Get group number of consecutive `article`
g = df['article'].ne(df['article'].shift()).cumsum()
# Add `volume` to previous group last
df['volume'] += g.sub(1).map(df.groupby(g)['volume'].last()).fillna(0, downcast='infer')
Result:
print(df)
timestep article volume
0 35 1 20
1 37 2 25
2 123 2 32
3 178 2 43
4 234 1 40
5 478 1 51
Breakdown of steps
Previous group last values:
g.sub(1).map(df.groupby(g)['volume'].last()).fillna(0, downcast='infer')
0 0
1 20
2 20
3 20
4 43
5 43
Name: article, dtype: int64
Try:
df["new_volume"] = (
df.loc[df["article"] != df["article"].shift(-1), "volume"]
.reindex(df.index, method='ffill')
.shift()
+ df["volume"]
).fillna(df["volume"])
df
Output:
timestep article volume new_volume
0 35 1 20 20.0
1 37 2 5 25.0
2 123 2 12 32.0
3 178 2 23 43.0
4 234 1 17 40.0
5 478 1 28 51.0
Explained:
Find the last record of each group by comparing 'article' with the next row, reindex that series to align with the original dataframe (forward-filling), and shift it down so each row sees the previous group's last 'volume'. Add this to the current row's 'volume', and fill the first value with the original 'volume'.

How to calculate aggregate percentage in a dataframe grouped by a value in python?

I am new to python and I am trying to understand how to work with aggregating data and manipulation.
I have a dataframe:
df3
Out[122]:
SBK SSC CountRecs
0 99 22 9
1 99 12 10
2 99 121 11
3 99 138 12
4 99 123 8
... ... ...
160247 184 1318 1
160248 394 2659 1
160249 412 757 1
160250 357 1312 1
160251 202 106 1
I want to compute, for each row of the dataframe, what percentage its CountRecs is of the total CountRecs for its SBK.
For example, the first row has CountRecs = 9 and the CountRecs for SBK 99 sum to 50, so the percentage is 9/50 * 100. I want this done automatically for all rows. How can I go about this?
You need to:
1. group by the column you want,
2. merge on the grouped column (you can change the name of the new column),
3. add the percentage column.
a = df3.merge(pd.DataFrame(df3.groupby('SBK')['CountRecs'].sum()), on='SBK')
df3['percent'] = (a['CountRecs_x']/a['CountRecs_y']) *100
df3
Use GroupBy.transform to get a Series the same size as the original DataFrame, filled with the group sums, so you can divide the original column by it:
df3['percent'] = df3['CountRecs'] / df3.groupby('SBK')['CountRecs'].transform('sum') * 100
print (df3)
SBK SSC CountRecs percent
0 99 22 9 18.0
1 99 12 10 20.0
2 99 121 11 22.0
3 99 138 12 24.0
4 99 123 8 16.0
160247 184 1318 1 100.0
160248 394 2659 1 100.0
160249 412 757 1 100.0
160250 357 1312 1 100.0
160251 202 106 1 100.0
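As a quick sanity check (a sketch, not from the original answer): within each SBK the percentages should sum to 100.
assert (df3.groupby('SBK')['percent'].sum().round(6) == 100).all()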

Pandas Dataframe calculating in intervals

I have a dataframe like this
time value
0 1 214
1 4 234
2 5 253
3 7 272
4 9 201
5 11 221
6 13 211
7 15 201
8 17 199
I want to split it into intervals and calculate, for every interval, the difference between each value and the first value of that interval.
Result should be like this with an interval of 6 for example (the lines inside are just for better explanation):
time value diff_to_first
0 1 214 0
1 4 234 20
2 5 253 39
--------------------------------
3 7 272 0
4 9 201 -71
5 11 221 -51
--------------------------------
6 13 211 0
7 15 201 -10
8 17 199 -12
With the following code I get the wanted result, but I think the code is not very elegant. Are there any better solutions (for example, how can I integrate the subset term into the loc statement)?
import pandas as pd

interval = 6
low = 0
df = pd.DataFrame([[1, 214], [4, 234], [5, 253], [7, 272], [9, 201], [11, 221],
                   [13, 211], [15, 201], [17, 199]], columns=['time', 'value'])
df['diff_to_first'] = None
maxvalue = df['time'].max()
while low <= maxvalue:
    high = low + interval
    subset = df[(df['time'] >= low) & (df['time'] < high)]
    first = subset.iloc[0]['value']
    df.loc[(df['time'] >= low) & (df['time'] < high), 'diff_to_first'] = \
        df.loc[(df['time'] >= low) & (df['time'] < high), 'value'] - first
    low = high
You can make a new "group" column, then use groupby and apply a user-defined function to add the diff column per group. That is more elegant, though I think my way of creating the "group" column could itself be more elegant. =)
import numpy as np

def diff(df):
    df['diff_to_first'] = df.value - df.value.values[0]
    return df

df['group'] = np.concatenate([[i] * 3 for i in range(0, len(df) // 3)])
df.groupby('group').apply(diff)
Output:
time value group diff_to_first
0 1 214 0 0
1 4 234 0 20
2 5 253 0 39
3 7 272 1 0
4 9 201 1 -71
5 11 221 1 -51
6 13 211 2 0
7 15 201 2 -10
8 17 199 2 -12
You can group the dataframe into chunks of interval rows and subtract the first value of each chunk:
interval = 3
df['diff_to_first'] = df.value.groupby(np.repeat(np.arange(len(df) // interval), interval)[:len(df)]).apply(lambda x: x - x.iloc[0])
Out:
time value diff_to_first
0 1 214 0
1 4 234 20
2 5 253 39
3 7 272 0
4 9 201 -71
5 11 221 -51
6 13 211 0
7 15 201 -10
8 17 199 -12
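A more direct variant honouring the time intervals themselves rather than fixed 3-row chunks (a hedged sketch, assuming interval = 6 as in the question): bin the time column and use transform('first').
# rows with time in [0, 6), [6, 12), ... share a bin
bins = df['time'] // 6
df['diff_to_first'] = df['value'] - df.groupby(bins)['value'].transform('first')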

Pandas: clean & convert DataFrame to numbers

I have a dataframe containing strings, as read from a sloppy csv:
id Total B C ...
0 56 974 20 739 34 482
1 29 479 10 253 16 704
2 86 961 29 837 43 593
3 52 687 22 921 28 299
4 23 794 7 646 15 600
What I want to do: convert every cell in the frame into a number. It should be ignoring whitespaces, but put NaN where the cell contains something really strange.
I probably know how to do it using terribly unperformant manual looping and replacing values, but was wondering if there's a nice and clean way to do this.
You can use read_csv with the regex separator \s{2,} (2 or more whitespace characters) and the thousands parameter:
import pandas as pd
from io import StringIO

temp = u"""id Total B C
0 56 974 20 739 34 482
1 29 479 10 253 16 704
2 86 961 29 837 43 593
3 52 687 22 921 28 299
4 23 794 7 646 15 600 """
# after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), sep=r"\s{2,}", engine='python', thousands=' ')
print (df)
id Total B C
0 0 56974 20739 34482
1 1 29479 10253 16704
2 2 86961 29837 43593
3 3 52687 22921 28299
4 4 23794 7646 15600
print (df.dtypes)
id int64
Total int64
B int64
C int64
dtype: object
And then, if necessary, apply the function to_numeric with the parameter errors='coerce', which replaces non-numeric values with NaN:
df = df.apply(pd.to_numeric, errors='coerce')
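If the dataframe is already loaded as strings, a hedged sketch of the same cleanup done in place (assuming the space-separated thousands shown above): strip the spaces, then coerce.
# remove the thousands spaces, then turn anything non-numeric into NaN
df = df.apply(lambda col: pd.to_numeric(col.astype(str).str.replace(' ', '', regex=False), errors='coerce'))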

Python pandas group by two columns

I have a pandas dataframe:
code type
index
312 11 21
312 11 41
312 11 21
313 23 22
313 11 21
... ...
So I need to count the occurrences of each ('code', 'type') pair for each index item:
11_21 11_41 23_22
index
312 2 1 0
313 1 0 1
... ...
How can I implement this with Python and pandas?
Here's one way: use pd.crosstab, then rename the columns using the level information.
In [136]: dff = pd.crosstab(df['index'], [df['code'], df['type']])
In [137]: dff
Out[137]:
code 11 23
type 21 41 22
index
312 2 1 0
313 1 0 1
In [138]: dff.columns = ['%s_%s' % c for c in dff.columns]
In [139]: dff
Out[139]:
11_21 11_41 23_22
index
312 2 1 0
313 1 0 1
Alternatively, less elegantly, create another column and use crosstab.
In [140]: df['ct'] = df.code.astype(str) + '_' + df.type.astype(str)
In [141]: df
Out[141]:
index code type ct
0 312 11 21 11_21
1 312 11 41 11_41
2 312 11 21 11_21
3 313 23 22 23_22
4 313 11 21 11_21
In [142]: pd.crosstab(df['index'], df['ct'])
Out[142]:
ct 11_21 11_41 23_22
index
312 2 1 0
313 1 0 1
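An equivalent approach without crosstab (a sketch, not from the original answers): group by all three columns, count, and unstack the pair levels into columns.
out = (df.groupby(['index', 'code', 'type']).size()
         .unstack(['code', 'type'], fill_value=0))
# flatten the (code, type) column MultiIndex, as in the first answer
out.columns = ['%s_%s' % c for c in out.columns]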
