I have survey data in which one column is as follows:
Evaluations_Col
E: 3, D: 3, C: 3, S: 3, E: 3, X, K: 3
E: 1, D: 1, C: 1, S: 1, E: 1, X, K: 1
E: 2, D: 2, C: 2, S: 2, E: 2, X, K: 2
E: 5, D: 5, C: 5, S: 5, E: 5, X, K: 5
E: 3, D: 1, C: 1, S: 1, E: 1, X, K: 1
NOTE: I need to ignore X values in the columns.
I want to extract each evaluation and split them into separate columns, one per evaluation type. At the end the expected columns will look like this:
E_col D_col C_Col ...
3 3 3
1 1 1
2 2 2
5 5 5
3 1 1
I can maybe split them by comma and get a list like this: [E: 3, D: 3, C: 3, S: 3, E: 3, K: 3]. But how do I create a separate column for each and spread the corresponding values correctly?
I can achieve this normally with the code below, but the X values cause a problem because of the dictionary... How can I exclude them?
df1 = pd.DataFrame([dict([y.split(':') for y in x.split(',')]) for x in test_col])
df1.head()
The error is:
ValueError: dictionary update sequence element #9 has length 1; 2 is required
Using a list comprehension and filtering to the entries that contain the ':' separator only:
Let's break the list comprehension into parts:
Looping over the lines: for x in test_col
Separating each line (denoted by x) into columns by splitting on ',': for y in x.split(',')
Splitting a column into a key-value pair only if the ':' separator exists: y.split(':') for y in x.split(',') ***only*** if ':' in y (that solves the problem described)
Code:
import pandas as pd
import numpy as np
test_col = []
with open('data.csv', 'r') as f:
    test_col = [l.strip() for l in f.readlines()]
df = pd.DataFrame([dict([y.split(':') for y in x.split(',') if ':' in y]) for x in test_col])
print(df.head())
Output:
E D C S E K
0 3 3 3 3 3 3
1 1 1 1 1 1 1
2 2 2 2 2 2 2
3 5 5 5 5 5 5
4 3 1 1 1 1 1
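One subtlety worth noting: the dict keys keep their leading spaces (' E' vs 'E'), which is the only reason two E columns survive, and the values are still strings like ' 3'. A small cleanup sketch, assuming you want trimmed headers and integer values:
df.columns = df.columns.str.strip()  # leaves two columns both labelled 'E'
df = df.astype(int)                  # int(' 3') == 3, so the padded strings convert fine
Keep in mind that pandas allows duplicate column labels, so df['E'] will then return both E columns.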
One way is to use str.extractall:
s = df["Value"].str.extractall(r"([A-Z]):\s(\d)").reset_index().groupby("level_0")
print (pd.DataFrame(s[1].agg(list).tolist(), columns=s[0].get_group(0).tolist()))
E D C S E K
0 3 3 3 3 3 3
1 1 1 1 1 1 1
2 2 2 2 2 2 2
3 5 5 5 5 5 5
4 3 1 1 1 1 1
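For reference, a sketch of the intermediate result: str.extractall returns one row per regex match, indexed by the original row plus a 'match' level (which becomes level_0 after reset_index, so the groupby above reassembles the rows):
m = df["Evaluations_Col"].str.extractall(r"([A-Z]):\s(\d)")
print(m.head(3))  # columns 0 (letter) and 1 (digit), MultiIndex (row, match)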
Using str.split and stack:
df1 = (
    df["Evaluations_Col"]
    .str.split(",", expand=True)    # one column per comma-separated entry
    .stack()                        # long format: one row per entry
    .str.split(":", expand=True)    # split each entry into key / value
    .set_index(0, append=True)      # move the key into the index
    .dropna()                       # the 'X' entries have no value, so they drop out
    .unstack([1, 2])                # back to wide: one column per (position, key)
    .droplevel(1, 1)                # drop the helper position level
)
   1
0  E  D  C  S  E  K
0  3  3  3  3  3  3
1  1  1  1  1  1  1
2  2  2  2  2  2  2
3  5  5  5  5  5  5
4  3  1  1  1  1  1
Related
How can I remove consecutive pairs of equal numbers with opposite signs from a Pandas dataframe?
Assuming I have this input dataframe:
incremental_changes = [2, -2, 2, 1, 4, 5, -5, 7, -6, 6]
df = pd.DataFrame({
    'idx': range(len(incremental_changes)),
    'incremental_changes': incremental_changes
})
idx incremental_changes
0 0 2
1 1 -2
2 2 2
3 3 1
4 4 4
5 5 5
6 6 -5
7 7 7
8 8 -6
9 9 6
I would like to get the following
idx incremental_changes
0 0 2
3 3 1
4 4 4
7 7 7
Note that the first 2 could either be idx 0 or 2, it doesn't really matter.
Thanks
You can group consecutive equal (absolute) numbers and transform:
import itertools

def remove_duplicates(s):
    ''' Generates booleans that indicate when a pair of ints with
    opposite signs are found.
    '''
    # zipping an iterator with itself walks the values two at a time
    iter_ = iter(s)
    for (a, b) in itertools.zip_longest(iter_, iter_):
        if b is None:
            yield False
        else:
            yield a + b == 0
            yield a + b == 0

>>> mask = df.groupby(df['incremental_changes'].abs().diff().ne(0).cumsum()) \
               ['incremental_changes'] \
               .transform(remove_duplicates)
Then
>>> df[~mask]
idx incremental_changes
2 2 2
3 3 1
4 4 4
7 7 7
Just do rolling, then filter out the matched pairs:
s = df.incremental_changes.rolling(2).sum()
s = s.mask(s[s==0].groupby(s.ne(0).cumsum()).cumcount()==1)==0
df[~(s | s.shift(-1))]
Out[640]:
idx incremental_changes
2 2 2
3 3 1
4 4 4
7 7 7
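A sketch of the intermediates, for anyone puzzling over the mask line (my annotation, not the original author's):
s = df.incremental_changes.rolling(2).sum()
print(s.tolist())  # [nan, 0.0, 0.0, 3.0, 5.0, 9.0, 0.0, 2.0, 1.0, 0.0]
# The consecutive zeros at idx 1 and 2 come from overlapping pairs; the
# groupby/cumcount masks all but the first zero in each run, so only one
# pair is dropped there. After the `== 0`, s is True exactly where a kept
# pair ends, and s | s.shift(-1) flags both members of that pair.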
I have a Pandas data frame like this
x y
0 0 a
1 0 b
2 0 c
3 0 d
4 1 e
5 1 f
6 1 g
7 1 h
What I want to do is, for each value of x, create a series which cumulatively concatenates the strings which have already appeared in y for that value of x. In other words, I want to get a Pandas series like this:
0
1 a,
2 a,b,
3 a,b,c,
4
5 e,
6 e,f,
7 e,f,g,
I can do it using a double for loop:
dat = pd.DataFrame({'x': [0, 0, 0, 0, 1, 1, 1, 1],
                    'y': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']})
z = dat['x'].copy()
for i in range(dat.shape[0]):
    z[i] = ''
    for j in range(i):
        if dat['x'][j] == dat['x'][i]:
            z[i] += dat['y'][j] + ","
but I was wondering whether there is a quicker way? It seems that pandas' expanding().apply() doesn't work for strings and that is an open issue. But perhaps there is an efficient way of doing it which doesn't involve apply?
You can do it with shift and np.cumsum in a custom function:
def myfun(x):
    y = x.shift()   # only values before the current row count
    # add ',' to each prior value, blank out the slot where y was NaN,
    # cumulatively concatenate, then trim the trailing separator
    return np.cumsum(y.fillna('').add(',').mask(y.isna(), '')).str[:-1]

dat.groupby("x")['y'].apply(myfun)
0
1 a
2 a,b
3 a,b,c
4
5 e
6 e,f
7 e,f,g
Name: y, dtype: object
We can group the dataframe by x; then, for each group, cumsum and shift the column y and update the values in a new column cum_y in dat:
dat['cum_y'] = ''
for _, g in dat.groupby('x'):
    dat['cum_y'].update(g['y'].add(',').cumsum().shift().str[:-1])
>>> dat
x y cum_y
0 0 a
1 0 b a
2 0 c a,b
3 0 d a,b,c
4 1 e
5 1 f e
6 1 g e,f
7 1 h e,f,g
Use GroupBy.transform with a lambda function that applies Series.shift, adds ',', takes the cumulative sum, and finally removes the trailing separator:
f = lambda x: (x.shift(fill_value='') + ',').cumsum()
dat['z'] = dat.groupby('x')['y'].transform(f).str.strip(',')
print (dat)
x y z
0 0 a
1 0 b a
2 0 c a,b
3 0 d a,b,c
4 1 e
5 1 f e
6 1 g e,f
7 1 h e,f,g
I would try to use lists here. Unsure about the efficiency, though...
dat.assign(y=dat['y'].apply(lambda x: [x])).groupby('x')['y'].transform(
    lambda x: x.cumsum()).str.join(',')
It gives the running concatenation including the current row (note: the question's expected output shifts by one, excluding it):
0 a
1 a,b
2 a,b,c
3 a,b,c,d
4 e
5 e,f
6 e,f,g
7 e,f,g,h
Name: y, dtype: object
Can also do:
(dat['y'].apply(list)
    .groupby(dat['x'])
    .transform(lambda x: x.cumsum().shift(fill_value=''))
    .str.join(',')
)
Output:
0
1 a
2 a,b
3 a,b,c
4
5 e
6 e,f
7 e,f,g
Name: y, dtype: object
I have a pd.DataFrame with the columns R_fighter (the name of the first fighter), B_fighter (the name of the second fighter), and a Winner column. The data is sorted in chronological order, and I would like to add a column where, if the fighters have previously met, the value is -1 if the R fighter won that earlier fight, 1 if the B fighter won, and 0 otherwise. If it were guaranteed that the fighters always meet again in the same order (R_fighter is again R_fighter, B_fighter is again B_fighter), then one could do the following:
last_winner_col = np.zeros(df_train.shape[0])
for x in df_train.groupby(['R_fighter', 'B_fighter'])['Winner']:
    last_winner = 0
    for idx, val in zip(x[1].index, x[1].values):
        last_winner_col[idx] = last_winner
        last_winner = 2 * val - 1
and add the resulting pd.Series to the dataset. However, the fighters' roles may swap in a subsequent fight. The solutions which come to my mind are very lengthy and cumbersome. I would be grateful if someone suggested a handy way to track the previous winner that takes the possible change of fighter order into account.
You can create a "sorted" version of your two combatants and use that:
import pandas as pd
a = list("ABCDEFGH1234")
b = list("12341234ABCD")
win = list("ABCD12341234")
df = pd.DataFrame({"R_fighter":a, "B_fighter":b, "Winner":win})
# make a column with fixed order
df["combatants"] = df[['R_fighter', 'B_fighter']].apply(lambda x: sorted(x), axis=1)
# or simply set the result
df["w"] = df[['R_fighter', 'B_fighter', 'Winner']].apply(lambda x: '-1'
if x[2]==x[0]
else ('1' if x[2]==x[1]
else '0'), axis=1 )
print(df)
Output:
R_fighter B_fighter Winner combatants w
0 A 1 A [1, A] -1
1 B 2 B [2, B] -1
2 C 3 C [3, C] -1
3 D 4 D [4, D] -1
4 E 1 1 [1, E] 1
5 F 2 2 [2, F] 1
6 G 3 3 [3, G] 1
7 H 4 4 [4, H] 1
8 1 A 1 [1, A] -1
9 2 B 2 [2, B] -1
10 3 C 3 [3, C] -1
11 4 D 4 [4, D] -1
To get the winner based on 'combatants' (which contains the sorted names) you can do:
df["w_combatants"] = df[['combatants', 'Winner']].apply(lambda x: '-1'
if x[1]==x[0][0]
else ('1' if x[1]==x[0][1]
else '0'), axis=1 )
to get
R_fighter B_fighter Winner combatants w w_combatants
0 A 1 A [1, A] -1 1
1 B 2 B [2, B] -1 1
2 C 3 C [3, C] -1 1
3 D 4 D [4, D] -1 1
4 E 1 1 [1, E] 1 -1
5 F 2 2 [2, F] 1 -1
6 G 3 3 [3, G] 1 -1
7 H 4 4 [4, H] 1 -1
8 1 A 1 [1, A] -1 -1
9 2 B 2 [2, B] -1 -1
10 3 C 3 [3, C] -1 -1
11 4 D 4 [4, D] -1 -1
Building on @Patrick Artner's answer, I've come up with the following solution:
df_train['fighters'] = df_train[['R_fighter', 'B_fighter']].apply(lambda x: tuple(sorted(x)), axis=1)
df_train['fighter_ord_changed'] = df_train[['R_fighter', 'B_fighter']].apply(lambda x: np.argsort(x)[0], axis=1)
last_winner_col = np.zeros(df_train.shape[0])
for x in df_train.groupby('fighters')['Winner']:
    last_winner = 0
    for idx, val in zip(x[1].index, x[1].values):
        flag = df_train['fighter_ord_changed'][idx]
        # flip the stored result when this fight's corner order is swapped
        last_winner_col[idx] = -last_winner if flag else last_winner
        # store the winner relative to the *sorted* pair via XOR with the flag
        last_winner = 2 * (val ^ flag) - 1
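A tiny sanity check of the idea (my own example; it assumes Winner is encoded 0/1 with 0 meaning the R fighter won, which matches the 2 * val - 1 encoding above):
import numpy as np
import pandas as pd

df_train = pd.DataFrame({'R_fighter': ['Ann', 'Bea'],
                         'B_fighter': ['Bea', 'Ann'],
                         'Winner':    [1, 0]})   # Bea wins both meetings
# After running the snippet above, last_winner_col is [0., -1.]:
# there is no prior meeting for row 0; in row 1 the previous winner
# (Bea) is now the R fighter, hence -1.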
If I were to type something like this, I would get these values:
print range(1,10)
[1,2,3,4,5,6,7,8,9]
but say if I want to use this same value in a for loop then it would instead start at 0, an example of what I mean:
for r in range(1,10):
    for c in range(r):
        print c,
    print ""
The Output is this:
0
0 1
0 1 2
0 1 2 3
0 1 2 3 4
0 1 2 3 4 5
0 1 2 3 4 5 6
0 1 2 3 4 5 6 7
0 1 2 3 4 5 6 7 8
Why is the 0 here? Shouldn't it start at 1 and end at 9?
You are creating a second range() object in your loop. The default start value is 0.
Each iteration you create a loop over range(r), meaning a range from 0 to r, exclusive, to produce the output numbers. For range(1) that means you get a list with just [0] in it, for range(2) you get [0, 1], etc.
If you wanted to produce ranges from 1 to r inclusive, just add 1 to the number you actually print:
for r in range(1,10):
    for c in range(r):
        print c + 1,
    print ""
or range from 1 to r + 1:
for r in range(1,10):
    for c in range(1, r + 1):
        print c,
    print ""
Both produce your expected output:
>>> for r in range(1,10):
... for c in range(r):
... print c + 1,
... print ""
...
1
1 2
1 2 3
1 2 3 4
1 2 3 4 5
1 2 3 4 5 6
1 2 3 4 5 6 7
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8 9
>>> for r in range(1,10):
... for c in range(1, r + 1):
... print c,
... print ""
...
1
1 2
1 2 3
1 2 3 4
1 2 3 4 5
1 2 3 4 5 6
1 2 3 4 5 6 7
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8 9
If you pass only one argument to range function, it would treat that as the ending value (without including it), starting from zero.
If you pass two arguments to the range function, it would treat the first value as the starting value and the second value as the ending value (without including it).
If you pass three arguments to the range function, it would treat the first value as the starting value and the second value as the ending value (without including it) and the third value as the step value.
You can confirm this with a few trial runs like these:
>>> range(10)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] # Default start value 0
>>> range(5, 10)
[5, 6, 7, 8, 9] # Starts from 5
>>> range(5, 10, 2)
[5, 7, 9] # Starts from 5 & steps by 2, taking every 2nd element
Nope.
for r in range(1,10):
    for c in range(r):
        print c,
    print ""
range(), when given only one argument, produces the numbers from 0 up to, but not including, that argument:
>>> range(6)
[0, 1, 2, 3, 4, 5]
And so, on the third iteration of your code, this is what happens:
for r in range(1,10):    # r is 3
    for c in range(r):   # range(3) is [0, 1, 2]
        print c,         # you then print each element of range(3), giving the output you observe
    print ""
https://docs.python.org/2/library/functions.html#range
From the docs:
The arguments must be plain integers. If the step argument is omitted, it defaults to 1. If the start argument is omitted, it defaults to 0.
Suppose I have the following DataFrame:
a b
0 A 1.516733
1 A 0.035646
2 A -0.942834
3 B -0.157334
4 A 2.226809
5 A 0.768516
6 B -0.015162
7 A 0.710356
8 A 0.151429
And I need to group it at every "edge" B; that means the groups will be:
   a         b
0  A  1.516733
1  A  0.035646
2  A -0.942834
3  B -0.157334

   a         b
4  A  2.226809
5  A  0.768516
6  B -0.015162

   a         b
7  A  0.710356
8  A  0.151429
That is any time I find a 'B' in the column 'a' I want to split my DataFrame.
My current solution is:
# create the dataframe
s = pd.Series(['A','A','A','B','A','A','B','A','A'])
ss = pd.Series(np.random.randn(9))
dff = pd.DataFrame({"a": s, "b": ss})

# my solution
count = 0
ls = []
for i in s:
    if i == "A":
        ls.append(count)
    else:
        ls.append(count)
        count += 1
dff['grpb'] = ls
and I got the dataframe:
a b grpb
0 A 1.516733 0
1 A 0.035646 0
2 A -0.942834 0
3 B -0.157334 0
4 A 2.226809 1
5 A 0.768516 1
6 B -0.015162 1
7 A 0.710356 2
8 A 0.151429 2
Which I can then split with dff.groupby('grpb').
Is there a more efficient way to do this using pandas' functions?
Here's a one-liner:
zip(*dff.groupby(pd.rolling_median((1*(dff['a']=='B')).cumsum(),3,True)))[-1]
[ 1 2
0 A 1.516733
1 A 0.035646
2 A -0.942834
3 B -0.157334,
1 2
4 A 2.226809
5 A 0.768516
6 B -0.015162,
1 2
7 A 0.710356
8 A 0.151429]
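Note that pd.rolling_median was removed in later pandas versions; a rough modern equivalent of that grouping key (my translation, not the original code) would be:
key = (dff['a'] == 'B').cumsum().rolling(3, min_periods=1).median()
groups = [g for _, g in dff.groupby(key)]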
How about:
df.groupby((df.a == "B").shift(1).fillna(0).cumsum())
For example:
>>> df
a b
0 A -1.957118
1 A -0.906079
2 A -0.496355
3 B 0.552072
4 A -1.903361
5 A 1.436268
6 B 0.391087
7 A -0.907679
8 A 1.672897
>>> gg = list(df.groupby((df.a == "B").shift(1).fillna(0).cumsum()))
>>> pprint.pprint(gg)
[(0,
a b
0 A -1.957118
1 A -0.906079
2 A -0.496355
3 B 0.552072),
(1, a b
4 A -1.903361
5 A 1.436268
6 B 0.391087),
(2, a b
7 A -0.907679
8 A 1.672897)]
(I didn't bother getting rid of the indices; you could use [g for k, g in df.groupby(...)] if you liked.)
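The trick is the grouping key itself: shifting the 'a == B' flags by one row means each 'B' bumps the running count on the following row, so every 'B' closes out its own group. For the example frame above:
key = (df.a == "B").shift(1).fillna(0).cumsum()
# row:  0  1  2  3  4  5  6  7  8
# key:  0  0  0  0  1  1  1  2  2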
An alternative is:
In [36]: dff
Out[36]:
a b
0 A 0.689785
1 A -0.374623
2 A 0.517337
3 B 1.549259
4 A 0.576892
5 A -0.833309
6 B -0.209827
7 A -0.150917
8 A -1.296696
In [37]: dff['grpb'] = np.NaN
In [38]: breaks = dff[dff.a == 'B'].index
In [39]: dff['grpb'][breaks] = range(len(breaks))
In [40]: dff.fillna(method='bfill').fillna(len(breaks))
Out[40]:
a b grpb
0 A 0.689785 0
1 A -0.374623 0
2 A 0.517337 0
3 B 1.549259 0
4 A 0.576892 1
5 A -0.833309 1
6 B -0.209827 1
7 A -0.150917 2
8 A -1.296696 2
Or using itertools to create 'grpb' is an option too; a sketch follows.
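For instance, the itertools idea might look like this (my sketch, not necessarily what was meant):
import itertools

# True on rows immediately after a 'B', i.e. where a new group starts
starts = [False] + list(dff['a'] == 'B')[:-1]
dff['grpb'] = list(itertools.accumulate(int(f) for f in starts))
# grpb: 0 0 0 0 1 1 1 2 2 -- the same labels as the loop above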
def vGroup(dataFrame, edgeCondition, groupName='autoGroup'):
    groupNum = 0
    dataFrame[groupName] = ''
    # loop over each row
    for inx, row in dataFrame.iterrows():
        if edgeCondition[inx]:
            dataFrame.loc[inx, groupName] = 'edge'   # .ix is deprecated; .loc works here
            groupNum += 1
        else:
            dataFrame.loc[inx, groupName] = groupNum
    return dataFrame[groupName]
vGroup(dff, dff['a'] == 'B')