How to track the unordered pairs in the `pandas` dataframe - python

I have a pd.DataFrame with the columns R_fighter (the name of the first fighter), B_fighter (the name of the second fighter), and Winner. The data is sorted in chronological order, and I would like to add a column that is -1 if the fighters have previously met and the R fighter won, 1 if the B fighter won, and 0 otherwise. If it were guaranteed that the fighters meet again in the same order (R_fighter is again R_fighter, B_fighter is again B_fighter), one could do the following:
import numpy as np

last_winner_col = np.zeros(df_train.shape[0])
for x in df_train.groupby(['R_fighter', 'B_fighter'])['Winner']:
    last_winner = 0
    for idx, val in zip(x[1].index, x[1].values):
        last_winner_col[idx] = last_winner
        last_winner = 2 * val - 1
And add the resulting pd.Series to the dataset. However, the fighters' roles may be swapped in a subsequent fight. The solutions that come to my mind are very lengthy and cumbersome. I would be grateful if someone suggested a handy way to track the previous winner that takes the possible change of fighter order into account.

You can create a "sorted" version of your two combatants and use that:
import pandas as pd

a = list("ABCDEFGH1234")
b = list("12341234ABCD")
win = list("ABCD12341234")

df = pd.DataFrame({"R_fighter": a, "B_fighter": b, "Winner": win})

# make a column with a fixed (sorted) order of the two names
df["combatants"] = df[['R_fighter', 'B_fighter']].apply(lambda x: sorted(x), axis=1)

# or simply set the result directly
df["w"] = df[['R_fighter', 'B_fighter', 'Winner']].apply(
    lambda x: '-1' if x[2] == x[0] else ('1' if x[2] == x[1] else '0'), axis=1)

print(df)
Output:
R_fighter B_fighter Winner combatants w
0 A 1 A [1, A] -1
1 B 2 B [2, B] -1
2 C 3 C [3, C] -1
3 D 4 D [4, D] -1
4 E 1 1 [1, E] 1
5 F 2 2 [2, F] 1
6 G 3 3 [3, G] 1
7 H 4 4 [4, H] 1
8 1 A 1 [1, A] -1
9 2 B 2 [2, B] -1
10 3 C 3 [3, C] -1
11 4 D 4 [4, D] -1
To get the winner based on 'combatants' (which contains the sorted names) you can do:
df["w_combatants"] = df[['combatants', 'Winner']].apply(lambda x: '-1'
if x[1]==x[0][0]
else ('1' if x[1]==x[0][1]
else '0'), axis=1 )
to get
R_fighter B_fighter Winner combatants w w_combatants
0 A 1 A [1, A] -1 1
1 B 2 B [2, B] -1 1
2 C 3 C [3, C] -1 1
3 D 4 D [4, D] -1 1
4 E 1 1 [1, E] 1 -1
5 F 2 2 [2, F] 1 -1
6 G 3 3 [3, G] 1 -1
7 H 4 4 [4, H] 1 -1
8 1 A 1 [1, A] -1 -1
9 2 B 2 [2, B] -1 -1
10 3 C 3 [3, C] -1 -1
11 4 D 4 [4, D] -1 -1
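For what it's worth, a vectorized alternative to the row-wise apply (my own sketch, not part of the original answer) builds the same w column, as integers rather than strings, with numpy.select:

import numpy as np
import pandas as pd

df = pd.DataFrame({"R_fighter": list("AB12"),
                   "B_fighter": list("12AB"),
                   "Winner":    list("ABAB")})

# -1 where the R fighter won, 1 where the B fighter won, 0 otherwise
df["w"] = np.select(
    [df["Winner"] == df["R_fighter"], df["Winner"] == df["B_fighter"]],
    [-1, 1],
    default=0,
)
print(df)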

Building on @Patrick Artner's answer, I've come up with the following solution:
df_train['fighters'] = df_train[['R_fighter', 'B_fighter']].apply(lambda x: tuple(sorted(x)), axis=1)
# 1 if the alphabetically sorted order differs from this row's (R, B) order, else 0
df_train['fighter_ord_changed'] = df_train[['R_fighter', 'B_fighter']].apply(lambda x: np.argsort(x)[0], axis=1)

last_winner_col = np.zeros(df_train.shape[0])
for x in df_train.groupby('fighters')['Winner']:
    last_winner = 0
    for idx, val in zip(x[1].index, x[1].values):
        flag = df_train['fighter_ord_changed'][idx]
        # flip the sign when this row's fighter order is swapped relative to the sorted pair
        last_winner_col[idx] = -last_winner if flag else last_winner
        last_winner = 2 * (val ^ flag) - 1
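For comparison, here is a minimal single-pass sketch of my own (not from the answers above) that keys a dict on a frozenset of the two names and remembers the last winner's name, so no order flag or sign flipping is needed. It assumes Winner holds the winning fighter's name, as in Patrick Artner's example, rather than a 0/1 code:

import pandas as pd

df = pd.DataFrame({
    "R_fighter": list("ABAB"),
    "B_fighter": list("1212"),
    "Winner":    list("AB12"),
})

last = {}    # frozenset({name1, name2}) -> name of the last winner
prev = []
for r, b, w in zip(df["R_fighter"], df["B_fighter"], df["Winner"]):
    key = frozenset((r, b))
    seen = last.get(key)
    # -1 if the current R fighter won the previous meeting, 1 if B did, 0 if no meeting yet
    prev.append(-1 if seen == r else (1 if seen == b else 0))
    last[key] = w

df["last_winner"] = prev
print(df)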

Related

assign a shorter column to a longer column, given a sliding window

I have this dataframe df1 of 8 rows:
ID
A
B
C
D
E
F
G
H
And I have this array arr of size 4, [-1, 0, 1, 2], and m = 2; I want to assign the values of this array to df1, repeating each value m times, so that I eventually have:
ID N
A -1
B -1
C 0
D 0
E 1
F 1
G 2
H 2
How to do that in Python?
df1=pd.DataFrame({'ID':['A','B', 'C', 'D', 'E', 'F', 'G', 'H']})
arr=[-1,0,1,2]
m=2
If arr should be repeated again and again:
df1['N']=(arr*m)[:len(df1)]
Result:
  ID  N
0  A -1
1  B  0
2  C  1
3  D  2
4  E -1
If each element should be repeated:
df1['N']=[arr[i] for i in range(len(arr)) for j in range(m)][:len(df1)]
Result:
  ID  N
0  A -1
1  B -1
2  C  0
3  D  0
4  E  1
Without numpy:
arr = [-1, 0, 1, 2]
m = 2
df1["N"] = sum([[x]*m for x in arr], [])
With numpy:
import numpy as np
arr = [-1, 0, 1, 2]
m = 2
df1["N"] = np.repeat(arr, m)
Output:
ID N
0 A -1
1 B -1
2 C 0
3 D 0
4 E 1
5 F 1
6 G 2
7 H 2
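If instead you want the cycling interpretation without Python list arithmetic, numpy's np.tile is the counterpart of np.repeat (a small addition, not from the original answers):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID': list('ABCDEFGH')})
arr = [-1, 0, 1, 2]
m = 2

# np.repeat([-1, 0, 1, 2], 2) -> [-1, -1, 0, 0, 1, 1, 2, 2] (each element m times)
# np.tile([-1, 0, 1, 2], 2)   -> [-1, 0, 1, 2, -1, 0, 1, 2] (whole array m times)
df1["N"] = np.tile(arr, m)[:len(df1)]
print(df1)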

Replace contents of cell with another cell if condition on a separate cell is met

I have the following data frame:
A = [1,2,5,4,3,1]
B = ["yes","No","hello","yes","no", 'why']
C = [1,0,1,1,0,0]
D = ['y','n','y','y','n','n']
test_df = pd.DataFrame({'A': A, 'B': B, 'C': C, 'D':D})
We can see 4 columns A, B, C, D. The intended outcome is to replace the contents of B with the contents of D if a condition on C is met; for this example, the condition is C = 1.
The intended output is:
A = [1,2,5,4,3,1]
B = ["y","No","y","y","no", 'why']
C = [1,0,1,1,0,0]
D = ['y','n','y','y','n','n']
output_df = pd.DataFrame({'A': A, 'B': B, 'C': C, 'D':D})
output_df.drop('D', axis = 1)
What is the best way to apply this logic to a data frame?
There are many ways to solve this; here is one of them:
test_df['B'] = test_df['B'].mask(test_df['C'] == 1, test_df['D'])
Output:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n
This can be done with np.where:
import numpy as np

test_df['B'] = np.where(test_df['C']==1, test_df['D'], test_df['B'])
Output:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n
The desired output is achieved using .loc with column 'C' as the mask.
test_df.loc[test_df['C']==1,'B'] = test_df.loc[test_df['C']==1,'D']
UPDATE: I just found out a similar answer was posted by @QuangHoang. This answer is slightly different in that it does not require numpy.
I don't know if inverse is the right word here, but I noticed recently that mask and where are "inverses" of each other: if you pass ~ (negation) to the condition of a .where statement, you get the same result as mask:
A = [1,2,5,4,3,1]
B = ["yes","No","hello","yes","no", 'why']
C = [1,0,1,1,0,0]
D = ['y','n','y','y','n','n']
test_df = pd.DataFrame({'A': A, 'B': B, 'C': C, 'D':D})
test_df['B'] = test_df['B'].where(~(test_df['C'] == 1), test_df['D'])
# test_df['B'] = test_df['B'].mask(test_df['C'] == 1, test_df['D']) - Scott Boston's answer
test_df
Out[1]:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n
You can also use df.where:
test_df['B'] = test_df['D'].where(test_df.C.eq(1), test_df.B)
Output:
In [875]: test_df
Out[875]:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n

Splitting values in a column and creating new columns

I have survey data in which one column is as follows:
Evaluations_Col
E: 3, D: 3, C: 3, S: 3, E: 3, X, K: 3
E: 1, D: 1, C: 1, S: 1, E: 1, X, K: 1
E: 2, D: 2, C: 2, S: 2, E: 2, X, K: 2
E: 5, D: 5, C: 5, S: 5, E: 5, X, K: 5
E: 3, D: 1, C: 1, S: 1, E: 1, X, K: 1
NOTE: I need to ignore X values in the columns.
I want to extract each evaluation and create a separate column for each type of evaluation; at the end, the expected columns will look like:
E_col D_col C_Col ...
3 3 3
1 1 1
2 2 2
5 5 5
3 1 1
I can split them by comma and get a list like this: [E: 3, D: 3, C: 3, S: 3, E: 3, K: 3]. But how do I create a separate column for each and spread the corresponding values correctly?
I can normally achieve this with the code below, but the X values cause a problem because of the dictionary construction... How can I exclude them?
df1 = pd.DataFrame([dict([y.split(':') for y in x.split(',')]) for x in test_col])
df1.head()
error is
ValueError: dictionary update sequence element #9 has length 1; 2 is required
Use a list comprehension, filtering to the items that contain the ':' separator.
Let's break the list comprehension into parts:
Looping over lines: for x in test_col
Separating each line (denoted by x) into columns by splitting on ',': for y in x.split(',')
Splitting a column into a key-value pair only if the ':' separator exists: y.split(':') for y in x.split(',') if ':' in y (this solves the problem described)
Code:
import pandas as pd

test_col = []
with open('data.csv', 'r') as f:
    test_col = [l.strip() for l in f.readlines()]

df = pd.DataFrame([dict([y.split(':') for y in x.split(',') if ':' in y]) for x in test_col])
print(df.head())
Output:
E D C S E K
0 3 3 3 3 3 3
1 1 1 1 1 1 1
2 2 2 2 2 2 2
3 5 5 5 5 5 5
4 3 1 1 1 1 1
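One caveat worth noting (my own observation, not part of the original answer): the split keeps leading spaces, so the second E per row survives as the distinct dict key ' E' rather than 'E', which is why the output above shows two E columns. A sketch that strips whitespace while parsing (and therefore collapses the duplicate E, keeping its last value):

import pandas as pd

test_col = [
    "E: 3, D: 3, C: 3, S: 3, E: 3, X, K: 3",
    "E: 3, D: 1, C: 1, S: 1, E: 1, X, K: 1",
]

rows = []
for x in test_col:
    row = {}
    for y in x.split(','):
        if ':' not in y:
            continue          # skip the bare X entries
        k, v = y.split(':')
        # stripped keys collide for the duplicate E; the last reading wins
        row[k.strip()] = v.strip()
    rows.append(row)

df = pd.DataFrame(rows)
print(df)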
One way is to use str.extractall:
s = df["Value"].str.extractall(r"([A-Z]):\s(\d)").reset_index().groupby("level_0")
print (pd.DataFrame(s[1].agg(list).tolist(), columns=s[0].get_group(0).tolist()))
E D C S E K
0 3 3 3 3 3 3
1 1 1 1 1 1 1
2 2 2 2 2 2 2
3 5 5 5 5 5 5
4 3 1 1 1 1 1
using str.split and stack
df1 = (
    df["Evaluations_Col"]
    .str.split(",", expand=True)
    .stack()
    .str.split(":", expand=True)
    .set_index(0, append=True)
    .dropna()
    .unstack([1, 2])
    .droplevel(1, 1)
)
1
0 E D C S E K
0 3 3 3 3 3 3
1 1 1 1 1 1 1
2 2 2 2 2 2 2
3 5 5 5 5 5 5
4 3 1 1 1 1 1
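Whichever parsing route you take, the extracted values come back as strings; casting them to numbers is one extra step, e.g. with pd.to_numeric (a follow-up note, not part of the original answers):

import pandas as pd

# toy frame standing in for any of the parsed results above
df1 = pd.DataFrame({"E": ["3", "1"], "D": ["3", "1"]})
df1 = df1.apply(pd.to_numeric)
print(df1.dtypes)  # int64 for both columns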

How to find a set mean in a pandas dataframe? [duplicate]

I have a column in a DataFrame with values:
[1, 1, -1, 1, -1, -1]
How can I group them like this?
[1,1] [-1] [1] [-1, -1]
You can use groupby with a custom Series:
df = pd.DataFrame({'a': [1, 1, -1, 1, -1, -1]})
print (df)
a
0 1
1 1
2 -1
3 1
4 -1
5 -1
print ((df.a != df.a.shift()).cumsum())
0 1
1 1
2 2
3 3
4 4
5 4
Name: a, dtype: int32
for i, g in df.groupby([(df.a != df.a.shift()).cumsum()]):
    print (i)
    print (g)
    print (g.a.tolist())
1
a
0 1
1 1
[1, 1]
2
a
2 -1
[-1]
3
a
3 1
[1]
4
a
4 -1
5 -1
[-1, -1]
Using groupby from itertools (data from Jez):
from itertools import groupby
[ list(group) for key, group in groupby(df.a.values.tolist())]
Out[361]: [[1, 1], [-1], [1], [-1, -1]]
Series.diff is another way to mark the group boundaries (a!=a.shift means a.diff!=0):
consecutives = df['a'].diff().ne(0).cumsum()
# 0 1
# 1 1
# 2 2
# 3 3
# 4 4
# 5 4
# Name: a, dtype: int64
And to turn these groups into a Series of lists (see the other answers for a list of lists), aggregate with groupby.agg or groupby.apply:
df['a'].groupby(consecutives).agg(list)
# a
# 1 [1, 1]
# 2 [-1]
# 3 [1]
# 4 [-1, -1]
# Name: a, dtype: object
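Since the question title asks for a per-set mean, the same grouper feeds groupby.mean directly (a small extension of the answers above):

import pandas as pd

df = pd.DataFrame({'a': [1, 1, -1, 1, -1, -1]})
consecutives = df['a'].diff().ne(0).cumsum()

# mean of each consecutive run: 1.0, -1.0, 1.0, -1.0
print(df['a'].groupby(consecutives).mean())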
If you are dealing with string values:
import itertools
import pandas as pd

s = pd.DataFrame(['A','A','A','BB','BB','CC','A','A','BB'], columns=['a'])
string_groups = sum([['%s_%s' % (i,n) for i in g] for n,(k,g) in enumerate(itertools.groupby(s.a))],[])
>>> string_groups
['A_0', 'A_0', 'A_0', 'BB_1', 'BB_1', 'CC_2', 'A_3', 'A_3', 'BB_4']
grouped = s.groupby(string_groups, sort=False).agg(list)
grouped.index = grouped.index.str.split('_').str[0]
>>> grouped
a
A [A, A, A]
BB [BB, BB]
CC [CC]
A [A, A]
BB [BB]
As a separate function:
def groupby_consec(df, col):
    string_groups = sum([['%s_%s' % (i, n) for i in g]
                         for n, (k, g) in enumerate(itertools.groupby(df[col]))], [])
    return df.groupby(string_groups, sort=False)
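A quick usage sketch for the helper above, assuming the groupby_consec definition just given and the same s frame from this answer:

import itertools
import pandas as pd

s = pd.DataFrame(['A', 'A', 'A', 'BB', 'BB', 'CC', 'A', 'A', 'BB'], columns=['a'])

for key, grp in groupby_consec(s, 'a'):
    print(key, grp['a'].tolist())
# A_0 ['A', 'A', 'A']
# BB_1 ['BB', 'BB']
# CC_2 ['CC']
# A_3 ['A', 'A']
# BB_4 ['BB']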

KeyError after removing NaNs in pandas

I am reading a file with pd.read_csv and removing all the rows whose value in column B is -1. Here's the code:
import pandas as pd
import numpy as np

columns = ['A', 'B', 'C', 'D']
catalog = pd.read_csv('data.txt', sep=r'\s+', names=columns, skiprows=1)

a = catalog['A']
b = catalog['B']
c = catalog['C']
d = catalog['D']

print len(b)  # answer is 700

# remove rows that are -1 in column b
idx = np.where(b != -1)[0]
a = a[idx]
b = b[idx]
c = c[idx]
d = d[idx]

print len(b)  # answer is 612
So I am assuming that I have successfully managed to remove all the rows where the value in column b is -1.
In order to test this, I am doing the following naive way:
for i in range(len(b)):
print i, a[i], b[i]
It prints out the values until it reaches a row which was supposedly filtered out. But now it gives a KeyError.
You can filter by boolean indexing:
catalog = catalog[catalog['B'] != -1]

a = catalog['A']
b = catalog['B']
c = catalog['C']
d = catalog['D']
It is expected that you get a KeyError, because the index values no longer match after filtering.
One possible solution is to convert the Series to lists:
for i in range(len(b)):
    print i, list(a)[i], list(b)[i]
Sample:
catalog = pd.DataFrame({'A':list('abcdef'),
                        'B':[-1,5,4,5,-1,4],
                        'C':[7,8,9,4,2,3],
                        'D':[1,3,5,7,1,0]})
print (catalog)
   A  B  C  D
0  a -1  7  1
1  b  5  8  3
2  c  4  9  5
3  d  5  4  7
4  e -1  2  1
5  f  4  3  0
# the filtered DataFrame no longer has index 0 or 4
catalog = catalog[catalog['B'] != -1]
print (catalog)
   A  B  C  D
1  b  5  8  3
2  c  4  9  5
3  d  5  4  7
5  f  4  3  0
a = catalog['A']
b = catalog['B']
c = catalog['C']
d = catalog['D']
print (b)
1 5
2 4
3 5
5 4
Name: B, dtype: int64
# a[i] in the first loop wants to match index value 0 (a[0]), which does not exist, so KeyError;
# same problem for b[0]
for i in range(len(b)):
    print (i, a[i], b[i])

KeyError: 0
# convert the Series to a list, so list(a)[0] returns the first value of the list - there is no Series index involved
for i in range(len(b)):
    print (i, list(a)[i], list(b)[i])
0 b 5
1 c 4
2 d 5
3 f 4
Another solution is to create a default index 0,1,... with reset_index and drop=True:
catalog = catalog[catalog['B'] != -1].reset_index(drop=True)
print (catalog)
   A  B  C  D
0  b  5  8  3
1  c  4  9  5
2  d  5  4  7
3  f  4  3  0
a = catalog['A']
b = catalog['B']
c = catalog['C']
d = catalog['D']
# the default index values now match, so a[i] and b[i] work
for i in range(len(b)):
    print (i, a[i], b[i])
0 b 5
1 c 4
2 d 5
3 f 4
If you filter out indices, then
for i in range(len(b)):
    print i, a[i], b[i]
will attempt to access erased indices. Instead, you can use the following:
for i, ae, be in zip(a.index, a.values, b.values):
    print(i, ae, be)
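As a more idiomatic alternative (my suggestion, beyond the original answers), you can skip positional indexing entirely and iterate the filtered frame directly:

import pandas as pd

catalog = pd.DataFrame({'A': list('abcdef'),
                        'B': [-1, 5, 4, 5, -1, 4]})
catalog = catalog[catalog['B'] != -1]

# row.Index is the surviving (non-consecutive) index label
for row in catalog[['A', 'B']].itertuples():
    print(row.Index, row.A, row.B)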
