I have a pandas dataframe:
       code  type
index
312      11    21
312      11    41
312      11    21
313      23    22
313      11    21
...     ...   ...
I need to count the occurrences of each ('code', 'type') pair for every index item:
       11_21  11_41  23_22
index
312        2      1      0
313        1      0      1
...
How can I implement this with Python and pandas?
Here's one way: use pd.crosstab and then rename the columns using the level information.
In [136]: dff = pd.crosstab(df['index'], [df['code'], df['type']])
In [137]: dff
Out[137]:
code   11     23
type   21 41  22
index
312     2  1   0
313     1  0   1
In [138]: dff.columns = ['%s_%s' % c for c in dff.columns]
In [139]: dff
Out[139]:
       11_21  11_41  23_22
index
312        2      1      0
313        1      0      1
Alternatively, less elegantly, create another column and use crosstab.
In [140]: df['ct'] = df.code.astype(str) + '_' + df.type.astype(str)
In [141]: df
Out[141]:
   index  code  type     ct
0    312    11    21  11_21
1    312    11    41  11_41
2    312    11    21  11_21
3    313    23    22  23_22
4    313    11    21  11_21
In [142]: pd.crosstab(df['index'], df['ct'])
Out[142]:
ct     11_21  11_41  23_22
index
312        2      1      0
313        1      0      1
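If you'd rather not use crosstab, an equivalent sketch with groupby and unstack gives the same table; the DataFrame construction here is an assumption based on the sample above:

import pandas as pd

# Sample data reconstructed from the question (assumed)
df = pd.DataFrame({'index': [312, 312, 312, 313, 313],
                   'code':  [11, 11, 11, 23, 11],
                   'type':  [21, 41, 21, 22, 21]})

# Count each (code, type) pair per index value, then pivot to wide form
out = df.groupby(['index', 'code', 'type']).size().unstack(['code', 'type'], fill_value=0)
out.columns = ['%s_%s' % c for c in out.columns]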
I have a pandas data frame like this:
Subset Position Value
1 1 2
1 10 3
1 15 0.285714
1 43 1
1 48 0
1 89 2
1 132 2
1 152 0.285714
1 189 0.133333
1 200 0
2 1 0.133333
2 10 0
2 15 2
2 33 2
2 36 0.285714
2 72 2
2 132 0.133333
2 152 0.133333
2 220 3
2 250 8
2 350 6
2 750 0
I want to know how I can get the mean of the values for every x rows with a step size of y per subset in pandas.
For example, the mean of every 5 rows (step size = 2) for the value column in each subset, like this:
Subset Start_position End_position Mean
1 1 48 1.2571428
1 15 132 1.0571428
1 48 189 0.8838094
2 1 36 0.8838094
2 15 132 1.2838094
2 36 220 1.110476
2 132 350 3.4533332
Is this what you were looking for:
import pandas as pd

df = pd.DataFrame({'Subset': [1]*10 + [2]*12,
                   'Position': [1,10,15,43,48,89,132,152,189,200,1,10,15,33,36,72,132,152,220,250,350,750],
                   'Value': [2,3,.285714,1,0,2,2,.285714,.1333333,0,0.133333,0,2,2,.285714,2,.133333,.133333,3,8,6,0]})

window = 5
step_size = 2
rows = []
for subset in df.Subset.unique():
    subset_df = df[df.Subset == subset].reset_index(drop=True)
    # slide a window of `window` rows over this subset, `step_size` rows at a time
    for i in range(0, len(subset_df), step_size):
        window_rows = subset_df.iloc[i:i + window]
        if len(window_rows) < window:  # skip incomplete windows at the end
            continue
        rows.append({'Subset': subset,
                     'Start_position': window_rows.Position.iloc[0],
                     'End_position': window_rows.Position.iloc[-1],
                     'Mean': window_rows.Value.mean()})
# DataFrame.append is removed in pandas 2.x, so collect the rows and build once
averaged_df = pd.DataFrame(rows, columns=['Subset', 'Start_position', 'End_position', 'Mean'])
Some notes about the code:
It assumes all subsets are in order in the original df (1,1,2,1,2,2 will behave as if it was 1,1,1,2,2,2)
If there is a leftover group smaller than a window, it is skipped (e.g. Subset 1's tail from Position 132 to 200, mean 0.60476, is not included)
A version-specific answer, using pandas.api.indexers.FixedForwardWindowIndexer, introduced in pandas 1.1.0:
>>> window=5
>>> step=2
>>> indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=window)
>>> df2 = df.join(df.Position.shift(-(window-1)), lsuffix='_start', rsuffix='_end')
>>> df2 = df2.assign(Mean=df2.pop('Value').rolling(window=indexer).mean()).iloc[::step]
>>> df2 = df2[df2.Position_start.lt(df2.Position_end)].dropna()
>>> df2['Position_end'] = df2['Position_end'].astype(int)
>>> df2
Subset Position_start Position_end Mean
0 1 1 48 1.257143
2 1 15 132 1.057143
4 1 48 189 0.883809
10 2 1 36 0.883809
12 2 15 132 1.283809
14 2 36 220 1.110476
16 2 132 350 3.453333
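On pandas 1.5 or newer, rolling() itself accepts a step argument, so the iloc[::step] slice isn't needed. A sketch applying it per subset, under the same column names as above; here the windows are backward-looking, so each mean lands on its window's last row:

import pandas as pd

window, step = 5, 2

def subset_windows(g):
    # evaluate the rolling mean every `step` rows; windows with fewer
    # than `window` rows are NaN (min_periods defaults to window) and dropped
    mean = g['Value'].rolling(window, step=step).mean().dropna()
    start = g['Position'].shift(window - 1).loc[mean.index].astype(int)
    end = g['Position'].loc[mean.index]
    return pd.DataFrame({'Start_position': start,
                         'End_position': end,
                         'Mean': mean})

out = (df.groupby('Subset', group_keys=True)
         .apply(subset_windows)
         .droplevel(1)
         .reset_index())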
I have a dataframe like this
time value
0 1 214
1 4 234
2 5 253
3 7 272
4 9 201
5 11 221
6 13 211
7 15 201
8 17 199
I want to split it into intervals and, for every interval, calculate each value's difference to the first row of that interval.
The result should look like this with an interval of 6, for example (the divider lines are just for illustration):
time value diff_to_first
0 1 214 0
1 4 234 20
2 5 253 39
--------------------------------
3 7 272 0
4 9 201 -71
5 11 221 -51
--------------------------------
6 13 211 0
7 15 201 -10
8 17 199 -12
With the following code I get the wanted result, but I think the code is not very elegant. Are there any better solutions (for example, how can I integrate the subset term into the loc statement)?
import pandas as pd

interval = 6
low = 0
df = pd.DataFrame([[1, 214], [4, 234], [5, 253], [7, 272], [9, 201], [11, 221],
                   [13, 211], [15, 201], [17, 199]], columns=['time', 'value'])
df['diff_to_first'] = None
maxvalue = df['time'].max()
while low <= maxvalue:
    high = low + interval
    subset = df[(df['time'] >= low) & (df['time'] < high)]
    first = subset.iloc[0]['value']
    df.loc[(df['time'] >= low) & (df['time'] < high), 'diff_to_first'] = \
        df.loc[(df['time'] >= low) & (df['time'] < high), 'value'] - first
    low = high
You can make a new column "group", then use groupby and apply a function that computes the diff within each group and joins it back. It will be more elegant. But I think my way of creating the "group" column could itself be more elegant. =)
import numpy as np

def diff(df):
    # subtract the group's first value from every value in the group
    df['diff_to_first'] = df.value - df.value.values[0]
    return df

df['group'] = np.concatenate([[i] * 3 for i in range(len(df) // 3)])
df.groupby('group').apply(diff)
Output:
time value group diff_to_first
0 1 214 0 0
1 4 234 0 20
2 5 253 0 39
3 7 272 1 0
4 9 201 1 -71
5 11 221 1 -51
6 13 211 2 0
7 15 201 2 -10
8 17 199 2 -12
You can group the dataframe into chunks of interval rows and difference each row against the row before it within its group. Note that this gives the diff to the previous row, not to the first row of each interval as asked (see the sketch after the output below for that variant):
import numpy as np

interval = 3
groups = np.repeat(np.arange(np.ceil(len(df) / interval)), interval)[:len(df)]
df['diff_to_first'] = df.value.groupby(groups).diff().fillna(0)  # groupwise x - x.shift()
Out:
time value diff_to_first
0 1 214 0.0
1 4 234 20.0
2 5 253 19.0
3 7 272 0.0
4 9 201 -71.0
5 11 221 20.0
6 13 211 0.0
7 15 201 -10.0
8 17 199 -2.0
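For the interval-based grouping the question actually describes (intervals of the time column rather than fixed row counts), a compact sketch using integer division and transform, assuming intervals start at 0:

import pandas as pd

df = pd.DataFrame([[1, 214], [4, 234], [5, 253], [7, 272], [9, 201], [11, 221],
                   [13, 211], [15, 201], [17, 199]], columns=['time', 'value'])

interval = 6
bins = df['time'] // interval  # 0 for times 0-5, 1 for times 6-11, ...
# broadcast each bin's first value back to every row of that bin
df['diff_to_first'] = df['value'] - df.groupby(bins)['value'].transform('first')

This reproduces the desired output exactly and replaces both the while loop and the repeated loc conditions.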
Here is my original data (posted as an image; the answer below reprints a sample of it). The cust_id column records the consumption record for each ID, the second column is the product bought, and the third is the number bought each time.
I want to get a table where each row shows which products a customer bought and how many. If they never bought a product, the value is None. I think this is a sparse matrix.
I have tried many ways and still can't work it out. Maybe pandas? NumPy?
There is a problem with duplicates; I added a last row with the same cust_id and prd_id values to demonstrate it.
print (df)
cust_id prd_id prd_number
8 462 40 1
9 462 46 3
10 462 59 1
11 462 63 13
12 462 67 1
13 462 82 12
14 462 88 1
15 462 163 3
16 463 68 1
17 463 90 1
18 463 159 2
16 464 93 11
20 464 94 8
21 464 96 1
22 464 142 4
23 465 50 1
24 465 50 5
Then you need to group by columns cust_id and prd_id, aggregating with some function like mean() or sum(). Finally, unstack, replacing NaN with 0:
print (df.groupby(['cust_id', 'prd_id'])['prd_number'].sum().unstack(fill_value=0))
prd_id 40 46 50 59 63 67 68 82 88 90 93 94 96 142 \
cust_id
462 1 3 0 1 13 1 0 12 1 0 0 0 0 0
463 0 0 0 0 0 0 1 0 0 1 0 0 0 0
464 0 0 0 0 0 0 0 0 0 0 11 8 1 4
465 0 0 6 0 0 0 0 0 0 0 0 0 0 0
prd_id 159 163
cust_id
462 0 3
463 2 0
464 0 0
465 0 0
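The same table can come from pivot_table, which aggregates the duplicate (cust_id, prd_id) rows itself via aggfunc; a one-line sketch against the same df:

result = df.pivot_table(index='cust_id', columns='prd_id',
                        values='prd_number', aggfunc='sum', fill_value=0)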
New to pandas, I'm trying to sum up all previous values of a column. In SQL I did this by joining the table to itself, so I've been taking the same approach in pandas, but I'm having some issues.
Original Data Frame
TeamName PlayerCount Goals CalMonth
0 A 25 126 1
1 A 25 100 2
2 A 25 156 3
3 B 22 205 1
4 B 30 300 2
5 B 28 189 3
Code
import numpy as np
import pandas as pd

# right-side key: each row's following month (12 wraps to 1),
# so the merge attaches the previous month's record to each row
prev_month = np.where(df3['CalMonth'] == 12, df3['CalMonth'] - 11, df3['CalMonth'] + 1)
df4 = pd.merge(df3, df3, how='left', left_on=['TeamName', 'CalMonth'],
               right_on=['TeamName', prev_month])
print(df4.head(20))
Output

  TeamName  PlayerCount_x  Goals_x  CalMonth_x  PlayerCount_y  Goals_y  CalMonth_y
0        A             25      126           1            NaN      NaN         NaN
1        A             25      100           2             25      126           1
2        A             25      156           3             25      100           2
3        B             22      205           1             22      NaN         NaN
4        B             22      300           2             22      205           1
5        B             22      189           3             22      100           2
The output is what I had in mind, but what I want now is to create a YTD column that sums up all Goals from previous months. Here are my desired results (it can either include the current month or not; that can be handled in an additional step):
  TeamName  PlayerCount_x  Goals_x  CalMonth_x  PlayerCount_y  Goals_y  CalMonth_y  Goals_YTD
0        A             25      126           1            NaN      NaN         NaN        NaN
1        A             25      100           2             25      126           1        126
2        A             25      156           3             25      100           2        226
3        B             22      205           1             22      NaN         NaN        NaN
4        B             22      300           2             22      205           1        205
5        B             22      189           3             22      100           2        305
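In pandas the self-join isn't actually needed for this: a grouped cumulative sum yields the YTD column directly. A minimal sketch against the original frame, assuming rows are sorted by CalMonth within each team as in the sample:

df3 = df3.sort_values(['TeamName', 'CalMonth'])

# YTD including the current month
df3['Goals_YTD'] = df3.groupby('TeamName')['Goals'].cumsum()

# YTD of previous months only (each team's first month becomes 0)
df3['Goals_prev_YTD'] = df3.groupby('TeamName')['Goals'].cumsum() - df3['Goals']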
I am struggling to reindex a multiindex. Example code below:
import numpy as np
import pandas as pd

rng = pd.date_range('01/01/2000 00:00', '31/12/2004 23:00', freq='H')
ts = pd.Series([h.dayofyear for h in rng], index=rng)
daygrouped = ts.groupby(lambda x: x.dayofyear)
daymean = daygrouped.mean()
myindex = np.arange(1, 367)
myindex = np.concatenate((myindex[183:], myindex[:183]))
daymean.reindex(myindex)
gives (as expected):
184 184
185 185
186 186
187 187
...
180 180
181 181
182 182
183 183
Length: 366, dtype: int64
BUT if I create a multiindex:
hourgrouped = ts.groupby([lambda x: x.dayofyear, lambda x: x.hour])
hourmean = hourgrouped.mean()
myindex = np.arange(1,367)
myindex = np.concatenate((myindex[183:],myindex[:183]))
hourmean.reindex(myindex, level=1)
I get:
1 1 1
2 1
3 1
4 1
...
366 20 366
21 366
22 366
23 366
Length: 8418, dtype: int64
Any ideas on my mistake? - Thanks.
Bevan
First, you have to specify level=0 instead of 1 (as it is the first level -> zero-based indexing -> 0).
But, there is still a problem: the reindexing works, but does not seem to preserve the order of the provided index in the case of a MultiIndex:
In [54]: hourmean.reindex([5,4], level=0)
Out[54]:
4 0 4
1 4
2 4
3 4
4 4
...
20 4
21 4
22 4
23 4
5 0 5
1 5
2 5
3 5
4 5
...
20 5
21 5
22 5
23 5
dtype: int64
So getting a new subset of the index works, but it is in the same order as the original and not as the new provided index.
This is possibly a bug with reindex on a certain level (I opened an issue to discuss this: https://github.com/pydata/pandas/issues/8241)
A solution for now is to create a full MultiIndex and reindex with that (so not on a specified level, but with the full index, which does preserve the order). This is very easy with MultiIndex.from_product, as you already have myindex:
In [79]: myindex2 = pd.MultiIndex.from_product([myindex, range(24)])
In [82]: hourmean.reindex(myindex2)
Out[82]:
184 0 184
1 184
2 184
3 184
4 184
5 184
6 184
7 184
8 184
9 184
10 184
11 184
12 184
13 184
14 184
...
183 9 183
10 183
11 183
12 183
13 183
14 183
15 183
16 183
17 183
18 183
19 183
20 183
21 183
22 183
23 183
Length: 8784, dtype: int64
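As a side note, on current pandas versions the same reordering can be done by selecting with .loc and a label list, which, unlike reindex with level=0 above, returns the rows in the order of the provided labels. A minimal sketch, reusing the myindex array built above:

# select all hours for each day label, in the order the labels are given
reordered = hourmean.loc[myindex.tolist()]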