How to groupby consecutive values in pandas DataFrame - python

I have a column in a DataFrame with values:
[1, 1, -1, 1, -1, -1]
How can I group them like this?
[1,1] [-1] [1] [-1, -1]

You can use groupby with a custom Series:
import pandas as pd

df = pd.DataFrame({'a': [1, 1, -1, 1, -1, -1]})
print(df)
   a
0  1
1  1
2 -1
3  1
4 -1
5 -1
print((df.a != df.a.shift()).cumsum())
0    1
1    1
2    2
3    3
4    4
5    4
Name: a, dtype: int32
for i, g in df.groupby((df.a != df.a.shift()).cumsum()):
    print(i)
    print(g)
    print(g.a.tolist())
1
   a
0  1
1  1
[1, 1]
2
    a
2  -1
[-1]
3
   a
3  1
[1]
4
    a
4  -1
5  -1
[-1, -1]

Using groupby from itertools (data from Jez):
from itertools import groupby

[list(group) for key, group in groupby(df.a.tolist())]
# [[1, 1], [-1], [1], [-1, -1]]

Series.diff is another way to mark the group boundaries (a != a.shift is equivalent to a.diff != 0):
consecutives = df['a'].diff().ne(0).cumsum()
# 0 1
# 1 1
# 2 2
# 3 3
# 4 4
# 5 4
# Name: a, dtype: int64
And to turn these groups into a Series of lists (see the other answers for a list of lists), aggregate with groupby.agg or groupby.apply:
df['a'].groupby(consecutives).agg(list)
# a
# 1 [1, 1]
# 2 [-1]
# 3 [1]
# 4 [-1, -1]
# Name: a, dtype: object
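If a plain list of lists is wanted instead of a Series (matching the itertools output above), the aggregation result converts directly with tolist; a minimal sketch on the same example data:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, -1, 1, -1, -1]})

# label each run of equal values, then collect every run into a list
consecutives = df['a'].diff().ne(0).cumsum()
groups = df['a'].groupby(consecutives).agg(list).tolist()
print(groups)  # [[1, 1], [-1], [1], [-1, -1]]
```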

If you are dealing with string values:
import itertools
import pandas as pd

s = pd.DataFrame(['A', 'A', 'A', 'BB', 'BB', 'CC', 'A', 'A', 'BB'], columns=['a'])
string_groups = sum([['%s_%s' % (i, n) for i in g]
                     for n, (k, g) in enumerate(itertools.groupby(s.a))], [])
>>> string_groups
['A_0', 'A_0', 'A_0', 'BB_1', 'BB_1', 'CC_2', 'A_3', 'A_3', 'BB_4']
grouped = s.groupby(string_groups, sort=False).agg(list)
grouped.index = grouped.index.str.split('_').str[0]
>>> grouped
a
A [A, A, A]
BB [BB, BB]
CC [CC]
A [A, A]
BB [BB]
As a separate function:
def groupby_consec(df, col):
    string_groups = sum([['%s_%s' % (i, n) for i in g]
                         for n, (k, g) in enumerate(itertools.groupby(df[col]))], [])
    return df.groupby(string_groups, sort=False)
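A quick usage sketch of the helper above, on a small made-up frame:

```python
import itertools
import pandas as pd

def groupby_consec(df, col):
    # tag each value with its run number so equal values in different
    # runs end up in different groups
    string_groups = sum([['%s_%s' % (i, n) for i in g]
                         for n, (k, g) in enumerate(itertools.groupby(df[col]))], [])
    return df.groupby(string_groups, sort=False)

s = pd.DataFrame(['A', 'A', 'BB', 'A'], columns=['a'])
print(groupby_consec(s, 'a')['a'].agg(list).tolist())
# [['A', 'A'], ['BB'], ['A']]
```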

Related

Create list of dataframes of a single dataframe based on repeating binary column [duplicate]


assign a shorter column to a longer column, given a sliding winow

I have this dataframe df1 of 8 rows:
ID
A
B
C
D
E
F
G
H
And I have this array arr of size 4, [-1, 0, 1, 2], and m = 2. I want to assign the values of this array to df1, with each value repeated m times, so that I eventually have:
ID N
A -1
B -1
C 0
D 0
E 1
F 1
G 2
H 2
How to do that in Python?
df1 = pd.DataFrame({'ID': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']})
arr = [-1, 0, 1, 2]
m = 2
If arr should be repeated again and again (cycled):
df1['N'] = (arr * m)[:len(df1)]
Result:
  ID  N
0  A -1
1  B  0
2  C  1
3  D  2
4  E -1
...
If each element should be repeated m times:
df1['N'] = [arr[i] for i in range(len(arr)) for j in range(m)][:len(df1)]
Result:
  ID  N
0  A -1
1  B -1
2  C  0
3  D  0
4  E  1
...
Without numpy:
arr = [-1, 0, 1, 2]
m = 2
df1["N"] = sum([[x] * m for x in arr], [])
With numpy:
import numpy as np
arr = [-1, 0, 1, 2]
m = 2
df1["N"] = np.repeat(arr, m)
Output:
ID N
0 A -1
1 B -1
2 C 0
3 D 0
4 E 1
5 F 1
6 G 2
7 H 2
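For completeness, the same element-wise repetition can be written without numpy using itertools; the islice truncation also covers the case where len(arr) * m exceeds len(df1). A minimal sketch:

```python
import itertools
import pandas as pd

df1 = pd.DataFrame({'ID': list('ABCDEFGH')})
arr = [-1, 0, 1, 2]
m = 2

# repeat each element m times, then truncate to the frame's length
repeated = itertools.chain.from_iterable(itertools.repeat(x, m) for x in arr)
df1['N'] = list(itertools.islice(repeated, len(df1)))
print(df1['N'].tolist())  # [-1, -1, 0, 0, 1, 1, 2, 2]
```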

How to track the unordered pairs in the `pandas` dataframe

I have a pd.DataFrame with the columns R_fighter (the name of the first fighter), B_fighter (the name of the second fighter), and Winner. The data is sorted in chronological order, and I would like to add a column that is -1 if the fighters have previously met and the R fighter won, 1 if the B fighter won, and 0 otherwise. If it were guaranteed that the fighters meet again in the same order (R_fighter is again R_fighter, B_fighter is again B_fighter), one could do the following:
last_winner_col = np.zeros(df_train.shape[0])
for x in df_train.groupby(['R_fighter', 'B_fighter'])['Winner']:
    last_winner = 0
    for idx, val in zip(x[1].index, x[1].values):
        last_winner_col[idx] = last_winner
        last_winner = 2 * val - 1
And add the resulting pd.Series to the dataset. However, the fighters' roles may be swapped in a subsequent fight. The solutions that come to my mind are lengthy and cumbersome. I would be grateful if someone could suggest a handy way to track the previous winner that accounts for a possible change in the fighters' order.
You can create a "sorted" version of your two combatants and use that:
import pandas as pd
a = list("ABCDEFGH1234")
b = list("12341234ABCD")
win = list("ABCD12341234")
df = pd.DataFrame({"R_fighter":a, "B_fighter":b, "Winner":win})
# make a column with a fixed (sorted) order of the two names
df["combatants"] = df[['R_fighter', 'B_fighter']].apply(lambda x: sorted(x), axis=1)
# or simply set the result directly
df["w"] = df[['R_fighter', 'B_fighter', 'Winner']].apply(
    lambda x: '-1' if x['Winner'] == x['R_fighter']
    else ('1' if x['Winner'] == x['B_fighter'] else '0'), axis=1)
print(df)
Output:
R_fighter B_fighter Winner combatants w
0 A 1 A [1, A] -1
1 B 2 B [2, B] -1
2 C 3 C [3, C] -1
3 D 4 D [4, D] -1
4 E 1 1 [1, E] 1
5 F 2 2 [2, F] 1
6 G 3 3 [3, G] 1
7 H 4 4 [4, H] 1
8 1 A 1 [1, A] -1
9 2 B 2 [2, B] -1
10 3 C 3 [3, C] -1
11 4 D 4 [4, D] -1
To get the winner based on 'combatants' (which contains the sorted names) you can do:
df["w_combatants"] = df[['combatants', 'Winner']].apply(
    lambda x: '-1' if x['Winner'] == x['combatants'][0]
    else ('1' if x['Winner'] == x['combatants'][1] else '0'), axis=1)
to get
R_fighter B_fighter Winner combatants w w_combatants
0 A 1 A [1, A] -1 1
1 B 2 B [2, B] -1 1
2 C 3 C [3, C] -1 1
3 D 4 D [4, D] -1 1
4 E 1 1 [1, E] 1 -1
5 F 2 2 [2, F] 1 -1
6 G 3 3 [3, G] 1 -1
7 H 4 4 [4, H] 1 -1
8 1 A 1 [1, A] -1 -1
9 2 B 2 [2, B] -1 -1
10 3 C 3 [3, C] -1 -1
11 4 D 4 [4, D] -1 -1
Building on Patrick Artner's answer, I've come up with the following solution:
df_train['fighters'] = df_train[['R_fighter', 'B_fighter']].apply(lambda x: tuple(sorted(x)), axis=1)
df_train['fighter_ord_changed'] = df_train[['R_fighter', 'B_fighter']].apply(lambda x: np.argsort(x)[0], axis=1)
last_winner_col = np.zeros(df_train.shape[0])
for x in df_train.groupby('fighters')['Winner']:
    last_winner = 0
    for idx, val in zip(x[1].index, x[1].values):
        flag = df_train['fighter_ord_changed'][idx]
        last_winner_col[idx] = -last_winner if flag else last_winner
        last_winner = 2 * (val ^ flag) - 1
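An alternative sketch that sidesteps the groupby entirely: one chronological pass with a dict keyed by the sorted pair, storing the previous winner's name and translating it per row relative to the current R/B orientation. The column names follow the question; the three-row dataset is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'R_fighter': ['A', '1', 'A'],
    'B_fighter': ['1', 'A', '1'],
    'Winner':    ['A', '1', '1'],   # hypothetical results
})

last = {}   # sorted pair -> name of the previous winner
out = []
for r, b, w in zip(df['R_fighter'], df['B_fighter'], df['Winner']):
    key = tuple(sorted((r, b)))
    prev = last.get(key)
    # -1 if the current R fighter won last time, 1 if B did, 0 if no prior bout
    out.append(0 if prev is None else (-1 if prev == r else 1))
    last[key] = w
df['prev_winner'] = out
print(df['prev_winner'].tolist())  # [0, 1, 1]
```

Storing the winner's name rather than an encoded -1/1 makes a change of orientation trivial to handle, because the translation happens at lookup time.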

How to find a set mean in a pandas dataframe? [duplicate]


construct sparse matrix using categorical data

I have a data that looks something like this:
numpy array:
[[a, abc],
[b, def],
[c, ghi],
[d, abc],
[a, ghi],
[e, fg],
[f, f76],
[b, f76]]
It's like a user-item matrix.
I want to construct a sparse matrix of shape (number_of_items, number_of_users) that contains 1 if the user has rated/bought an item and 0 if he hasn't. For the above example the shape should be (5, 6). This is just an example; there are thousands of users and thousands of items.
Currently I'm doing this with two for loops. Is there any faster/more pythonic way of achieving the same?
desired output:
1,0,0,1,0,0
0,1,0,0,0,0
1,0,1,0,0,0
0,0,0,0,1,0
0,1,0,0,0,1
where rows: abc,def,ghi,fg,f76
and columns: a,b,c,d,e,f
The easiest way is to assign integer labels to the users and items and use these as coordinates into the sparse matrix, for example:
import numpy as np
from scipy import sparse

users, I = np.unique(user_item[:, 0], return_inverse=True)
items, J = np.unique(user_item[:, 1], return_inverse=True)
points = np.ones(len(user_item), int)
# rows are items, columns are users
mat = sparse.coo_matrix((points, (J, I)), shape=(len(items), len(users)))
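A self-contained sketch of this approach on the question's data; note that np.unique sorts the labels, so the rows come out in the order abc, def, f76, fg, ghi rather than the order given in the question:

```python
import numpy as np
from scipy import sparse

user_item = np.array([['a', 'abc'], ['b', 'def'], ['c', 'ghi'],
                      ['d', 'abc'], ['a', 'ghi'], ['e', 'fg'],
                      ['f', 'f76'], ['b', 'f76']])

users, I = np.unique(user_item[:, 0], return_inverse=True)
items, J = np.unique(user_item[:, 1], return_inverse=True)
points = np.ones(len(user_item), int)

# rows are items, columns are users -> shape (n_items, n_users)
mat = sparse.coo_matrix((points, (J, I)), shape=(len(items), len(users)))
print(mat.toarray())
# [[1 0 0 1 0 0]
#  [0 1 0 0 0 0]
#  [0 1 0 0 0 1]
#  [0 0 0 0 1 0]
#  [1 0 1 0 0 0]]
```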
pandas.get_dummies provides an easy way to convert categorical columns to indicator columns:
import pandas as pd

# construct the data
x = pd.DataFrame([['a', 'abc'], ['b', 'def'], ['c', 'ghi'],
                  ['d', 'abc'], ['a', 'ghi'], ['e', 'fg'],
                  ['f', 'f76'], ['b', 'f76']],
                 columns=['user', 'item'])
print(x)
# user item
# 0 a abc
# 1 b def
# 2 c ghi
# 3 d abc
# 4 a ghi
# 5 e fg
# 6 f f76
# 7 b f76
for col, col_data in x.items():
    if str(col) == 'item':
        col_data = pd.get_dummies(col_data, prefix=col)
        x = x.join(col_data)
print(x)
#   user item  item_abc  item_def  item_f76  item_fg  item_ghi
# 0    a  abc         1         0         0        0         0
# 1    b  def         0         1         0        0         0
# 2    c  ghi         0         0         0        0         1
# 3    d  abc         1         0         0        0         0
# 4    a  ghi         0         0         0        0         1
# 5    e   fg         0         0         0        1         0
# 6    f  f76         0         0         1        0         0
# 7    b  f76         0         0         1        0         0
Here's what I could come up with:
You need to be careful, since np.unique sorts the labels before returning them, so the row/column order differs slightly from the one given in the question.
Moreover, you need to convert the array rows to tuples, because ('a', 'abc') in [('a', 'abc'), ('b', 'def')] returns True, but ('a', 'abc') in [['a', 'abc'], ['b', 'def']] does not (a tuple never compares equal to a list).
import itertools as it
import numpy as np

A = np.array([
    ['a', 'abc'],
    ['b', 'def'],
    ['c', 'ghi'],
    ['d', 'abc'],
    ['a', 'ghi'],
    ['e', 'fg'],
    ['f', 'f76'],
    ['b', 'f76']])
customers = np.unique(A[:, 0])
items = np.unique(A[:, 1])
A = [tuple(a) for a in A]
combinations = it.product(customers, items)
C = np.array([b in A for b in combinations], dtype=int)
# product iterates customers in the outer loop, so reshape per customer
# and transpose to get items as rows
C = C.reshape((customers.size, items.size)).T
>> array(
[[1, 0, 0, 1, 0, 0],
 [0, 1, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 1],
 [0, 0, 0, 0, 1, 0],
 [1, 0, 1, 0, 0, 0]])
Here is my approach using pandas; let me know if it performs better:
# create a dataframe from your numpy array
x = pd.DataFrame(x, columns=['User', 'Item'])
# get the rows and cols for the output dataframe
cols = pd.unique(x['User']); ncols = cols.shape[0]
rows = pd.unique(x['Item']); nrows = rows.shape[0]
# initialize the output dataframe
# (this one is dense, but you can check pandas' support for sparse dtypes)
spdf = pd.DataFrame(np.zeros((nrows, ncols)), columns=cols, index=rows)
# define the apply function
def hasUser(xx):
    # xx.name is the group key (the item); xx holds that item's users
    spdf.loc[xx.name, xx] = 1
# groupby and apply to fill the desired output dataframe
g = x.groupby(by='Item', sort=False)
g['User'].apply(lambda xx: hasUser(xx))
Here are the sample dataframes for the code above:
spdf
Out[71]:
a b c d e f
abc 1 0 0 1 0 0
def 0 1 0 0 0 0
ghi 1 0 1 0 0 0
fg 0 0 0 0 1 0
f76 0 1 0 0 0 1
x
Out[72]:
User Item
0 a abc
1 b def
2 c ghi
3 d abc
4 a ghi
5 e fg
6 f f76
7 b f76
Also, in case you want to parallelize the groupby-apply execution, this question might be of help:
Parallelize apply after pandas groupby
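For completeness, a much shorter route not mentioned in the answers above is pd.crosstab, which tabulates the item-by-user co-occurrence counts directly (like np.unique, it sorts the labels):

```python
import pandas as pd

x = pd.DataFrame([['a', 'abc'], ['b', 'def'], ['c', 'ghi'],
                  ['d', 'abc'], ['a', 'ghi'], ['e', 'fg'],
                  ['f', 'f76'], ['b', 'f76']],
                 columns=['User', 'Item'])

# rows: items, columns: users; each cell counts the pair (0 or 1 here)
table = pd.crosstab(x['Item'], x['User'])
print(table.shape)  # (5, 6)
```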
