Encoding column labels in Pandas for machine learning - python

I am working on the car evaluation dataset for machine learning, and the data looks like this:
buying,maint,doors,persons,lug_boot,safety,class
vhigh,vhigh,2,2,small,low,unacc
vhigh,vhigh,2,2,small,med,unacc
vhigh,vhigh,2,2,small,high,unacc
vhigh,vhigh,2,2,med,low,unacc
vhigh,vhigh,2,2,med,med,unacc
vhigh,vhigh,2,2,med,high,unacc
I want to convert these strings to unique enumerated integers column-wise. I see that pandas.factorize() is the way to go, but it only works on one column. How do I factorize the whole DataFrame in one go, with one command?
I tried a lambda function, but it is not working:
df.apply(lambda c: pd.factorize(c), axis=1)
Output:
0 ([0, 0, 1, 1, 2, 3, 4], [vhigh, 2, small, low,...
1 ([0, 0, 1, 1, 2, 3, 4], [vhigh, 2, small, med,...
2 ([0, 0, 1, 1, 2, 3, 4], [vhigh, 2, small, high...
3 ([0, 0, 1, 1, 2, 3, 4], [vhigh, 2, med, low, u...
4 ([0, 0, 1, 1, 2, 2, 3], [vhigh, 2, med, unacc])
5 ([0, 0, 1, 1, 2, 3, 4], [vhigh, 2, med, high, ...
I can see the encoded values, but I can't pull them out of the array above.

Factorize returns a tuple of (values, labels). You'll just want the values in the DataFrame. (Note also that you want the default axis=0, so each column is factorized rather than each row.)
In [26]: cols = ['buying', 'maint', 'lug_boot', 'safety', 'class']
In [27]: df[cols].apply(lambda x: pd.factorize(x)[0])
Out[27]:
buying maint lug_boot safety class
0 0 0 0 0 0
1 0 0 0 1 0
2 0 0 0 2 0
3 0 0 1 0 0
4 0 0 1 1 0
5 0 0 1 2 0
Then concat that to the numeric data.
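For instance, continuing the session above, a minimal sketch (assuming 'doors' and 'persons' are the columns you keep as numeric) could be:
encoded = df[cols].apply(lambda x: pd.factorize(x)[0])
result = pd.concat([df[['doors', 'persons']], encoded], axis=1)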
A word of warning though: this implies that "low" safety and "high" safety are the same distance from "med" safety. You might be better off using pd.get_dummies:
In [37]: dummies = []
In [38]: for col in cols:
....: dummies.append(pd.get_dummies(df[col]))
....:
In [39]: pd.concat(dummies, axis=1)
Out[39]:
vhigh vhigh med small high low med unacc
0 1 1 0 1 0 1 0 1
1 1 1 0 1 0 0 1 1
2 1 1 0 1 1 0 0 1
3 1 1 1 0 0 1 0 1
4 1 1 1 0 0 0 1 1
5 1 1 1 0 1 0 0 1
get_dummies has some optional parameters to control the naming, which you'll probably want.
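For example, a hedged sketch of that naming control (prefix and columns are standard get_dummies options; the exact prefixes are your choice):
# Per-column prefixes so e.g. 'vhigh' becomes 'buying_vhigh' instead of a bare 'vhigh'
dummies = pd.get_dummies(df[cols], prefix=cols)
# Or let get_dummies handle the listed columns and keep the rest untouched:
# dummies = pd.get_dummies(df, columns=cols)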

Related

Print all possible rotations of [1,2,3,4,5,6], e.g. (2,3,4,5,6,1), (3,4,5,6,1,2), (4,5,6,1,2,3). I have to rotate the numbers using a for loop in Python.

code:
a = ['1','2','3','4','5','6']
for i in range(1,6):
    for j in range(i+1):
        for k in range(j+1):
            for l in range(k+1):
                for m in range(l+1):
                    for p in range(m+1):
                        print(i,j,k,l,m,p)
#------------------------------------------------------------------------------------------
output: 1 0 0 0 0 0
1 1 0 0 0 0
1 1 1 0 0 0
1 1 1 1 0 0
1 1 1 1 1 0
1 1 1 1 1 1
2 0 0 0 0 0
2 1 0 0 0 0
2 1 1 0 0 0
2 1 1 1 0 0
2 1 1 1 1 0
2 1 1 1 1 1
2 2 0 0 0 0
2 2 1 0 0 0
2 2 1 1 0 0
2 2 1 1 1 0
2 2 1 1 1 1
2 2 2 0 0 0
2 2 2 1 0 0
2 2 2 1 1 0
2 2 2 1 1 1
2 2 2 2 0 0
2 2 2 2 1 0
2 2 2 2 1 1
2 2 2 2 2 0
2 2 2 2 2 1
2 2 2 2 2 2
3 0 0 0 0 0
3 1 0 0 0 0
3 1 1 0 0 0
3 1 1 1 0 0
and so on....
This is the code I have tried, but I'm not getting the desired output. Can someone please explain? Thank you.
It would appear to me that you need to look at the examples very carefully, and do something with the use of the word 'rotate'.
A fairly simple solution:
def rotate(xs):
    for i in range(len(xs)):
        yield tuple(xs[i:] + xs[:i])

for result in rotate([1,2,3,4,5,6]):
    print(result)
Output:
(1, 2, 3, 4, 5, 6)
(2, 3, 4, 5, 6, 1)
(3, 4, 5, 6, 1, 2)
(4, 5, 6, 1, 2, 3)
(5, 6, 1, 2, 3, 4)
(6, 1, 2, 3, 4, 5)
Do you mean this?
# Create new list containing all "rotated" versions of lst
lst = list(range(7))
new_lists = [lst[-i:] + lst[:-i] for i in range(len(lst))]
# print results
for l in new_lists:
    print(l)
Output:
[0, 1, 2, 3, 4, 5, 6]
[6, 0, 1, 2, 3, 4, 5]
[5, 6, 0, 1, 2, 3, 4]
[4, 5, 6, 0, 1, 2, 3]
[3, 4, 5, 6, 0, 1, 2]
[2, 3, 4, 5, 6, 0, 1]
[1, 2, 3, 4, 5, 6, 0]
Use the itertools library and import permutations. permutations will generate every possible ordering of your input (a superset of the rotations), and it is an easy method too. It will print all possible arrangements of your input.
from itertools import permutations
a = [1,2,3,4,5,6]
perm = permutations(a)
for i in list(perm):
    print(i)
I got the desired output using for loops, so I am posting this answer in case it helps anyone struggling with the same question.
import numpy as np

count = 0
n = "kashmeen"
for i in range(0,8):
    for j in range(0,8):
        for k in range(0,8):
            for l in range(0,8):
                for m in range(0,8):
                    for o in range(0,8):
                        for p in range(0,8):
                            for q in range(len(n)):
                                new = [i, j, k, l, m, o, p, q]
                                new = np.array(new)
                                new_1 = np.unique(new)
                                if len(new_1) == 8:
                                    print(n[i], n[j], n[k], n[l], n[m], n[o], n[p], n[q])
                                    count += 1
It is clear that the intent of this exercise is for you to demonstrate knowledge of slicing operations. Using slices, this can be done in a single for-loop:
>>> n = "kashmeen"
>>> for index in range(len(n)):
...     print(n[index:] + n[:index])
which gives the output
kashmeen
ashmeenk
shmeenka
hmeenkas
meenkash
eenkashm
enkashme
nkashmee
This solution will work for a string of any length (not just 8) and will work equally well for lists or tuples.
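For instance, a small sketch of the same slicing idea applied to a list:
xs = [1, 2, 3, 4, 5, 6]
for index in range(len(xs)):
    print(xs[index:] + xs[:index])
# [1, 2, 3, 4, 5, 6]
# [2, 3, 4, 5, 6, 1]
# ... through [6, 1, 2, 3, 4, 5]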

Pandas replace all but first in consecutive group

The problem description is simple, but I cannot figure out how to make this work in Pandas. Basically, I'm trying to replace consecutive values (except the first) with some replacement value. For example:
data = {
"A": [0, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2, 3]
}
df = pd.DataFrame.from_dict(data)
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 2
10 2
11 2
12 3
If I run this through some function foo(df, 2, 0) I would get the following:
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 0
10 0
11 0
12 3
This replaces all consecutive values of 2 with 0, except for the first one in the run. Is this possible?
You can find all the rows where A = 2 and A is also equal to the previous A value and set them to 0:
data = {
"A": [0, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2, 3]
}
df = pd.DataFrame.from_dict(data)
df[(df.A == 2) & (df.A == df.A.shift(1))] = 0
Output:
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 0
10 0
11 0
12 3
If you have more than one column in the dataframe, use df.loc to just set the A values:
df.loc[(df.A == 2) & (df.A == df.A.shift(1)), 'A'] = 0
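Wrapped in the foo(df, val, repl) form from the question, a minimal sketch (assuming the column is named 'A') could look like:
def foo(df, val, repl):
    out = df.copy()
    out.loc[(out.A == val) & (out.A == out.A.shift(1)), 'A'] = repl
    return out

foo(df, 2, 0)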
Try this if 'A' is duplicated further down the dataframe and is monotonically increasing:
def foo(df, val=2, repl=0):
    return df.mask((df.groupby('A').transform('cumcount') > 0) & (df['A'] == val), repl)

foo(df, 2, 0)
Output:
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 0
10 0
11 0
12 3
I'm not sure if this is the best way, but I came up with this solution; I hope it's helpful:
import pandas as pd

data = {
    "A": [0, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2, 3]
}
df = pd.DataFrame(data)

def replecate(df, number, replacement):
    i = 1
    for column in df.columns:
        for index, value in enumerate(df[column]):
            if i == 1 and value == number:
                i = 0
            elif value == number and i != 1:
                df[column][index] = replacement
        i = 1
    return df

replecate(df, 2, 0)
Output
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 0
10 0
11 0
12 3
I've managed a solution to this problem by shifting the row down by one and checking to see if the values align. Also included a function which can take multiple values to check for (not just 2).
import pandas as pd

data = {
    "A": [0, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2, 3]
}
df = pd.DataFrame(data)

def replace_recurring(df, key, offset=1, values=[2]):
    df['offset'] = df[key].shift(offset)
    df.loc[(df[key] == df['offset']) & (df[key].isin(values)), key] = 0
    df = df.drop(['offset'], axis=1)
    return df

df = replace_recurring(df, 'A', offset=1, values=[2])
Giving the output:
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 0
10 0
11 0
12 3

Fast Cartesian sum of rows of dataframe

I have two dataframes of errors in 3 axis (x, y, z):
df1 = pd.DataFrame([[0, 1, 2], [-1, 0, 1], [-2, 0, 3]], columns = ['x', 'y', 'z'])
df2 = pd.DataFrame([[1, 1, 3], [1, 0, 2], [1, 0, 3]], columns = ['x', 'y', 'z'])
I'm looking for a fast way to find the Cartesian sum of the square of each row of the two dataframes.
EDIT My current solution:
cartesian_sum = list(np.sum(list(tup), axis=0).tolist()
                     for tup in itertools.product((df1**2).to_numpy().tolist(),
                                                  (df2**2).to_numpy().tolist()))
cartesian_sum
>>>
[[1, 2, 13],
[1, 1, 8],
[1, 1, 13],
[2, 1, 10],
[2, 0, 5],
[2, 0, 10],
[5, 1, 18],
[5, 0, 13],
[5, 0, 18]]
is still too slow for my needs (~2.4 ms, even though it beats the solutions based purely in Pandas, which run ~8-10 ms).
This is similar to the related question (link here), but using itertools remains slow. Is there a faster way of doing this in Python?
I think you need a cross join first, then remove column a, square the values, convert the columns to a MultiIndex, and sum per first level:
df = df1.assign(a=1).merge(df2.assign(a=1), on='a').drop('a', axis=1) ** 2
df.columns = df.columns.str.split('_', expand=True)
df = df.sum(level=0, axis=1)
print (df)
x y z
0 1 2 13
1 1 1 8
2 1 1 13
3 2 1 10
4 2 0 5
5 2 0 10
6 5 1 18
7 5 0 13
8 5 0 18
Details:
print (df1.assign(a=1).merge(df2.assign(a=1), on='a'))
x_x y_x z_x a x_y y_y z_y
0 0 1 2 1 1 1 3
1 0 1 2 1 1 0 2
2 0 1 2 1 1 0 3
3 -1 0 1 1 1 1 3
4 -1 0 1 1 1 0 2
5 -1 0 1 1 1 0 3
6 -2 0 3 1 1 1 3
7 -2 0 3 1 1 0 2
8 -2 0 3 1 1 0 3
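As an aside, on pandas 1.2+ the helper column a isn't needed; a hedged sketch of the same idea using how='cross' (the final transpose/groupby is just one way to collapse the suffixed columns):
cross = df1.merge(df2, how='cross') ** 2           # columns: x_x, y_x, z_x, x_y, y_y, z_y
cross.columns = cross.columns.str.split('_', expand=True)
out = cross.T.groupby(level=0).sum().T             # add the _x and _y halves per axis
print(out)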
One idea to improve performance:
# https://stackoverflow.com/a/53699013/2901002
def cartesian_product_simplified_changed(left, right):
    la, lb = len(left), len(right)
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la, :lb])
    a = np.column_stack([left.values[ia2.ravel()] ** 2, right.values[ib2.ravel()] ** 2])
    # Split on the number of columns (not the number of rows) so the two halves line up.
    ncols = left.shape[1]
    a = a[:, :ncols] + a[:, ncols:]
    return a

a = cartesian_product_simplified_changed(df1, df2)
print(a)
[[ 1 2 13]
[ 1 1 8]
[ 1 1 13]
[ 2 1 10]
[ 2 0 5]
[ 2 0 10]
[ 5 1 18]
[ 5 0 13]
[ 5 0 18]]
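If you're happy to leave pandas entirely, a minimal NumPy broadcasting sketch of my own (not taken from the answers above) gives the same 9x3 result:
import numpy as np

a = df1.to_numpy() ** 2                 # shape (3, 3)
b = df2.to_numpy() ** 2                 # shape (3, 3)
# Pair every row of a with every row of b, then collapse the pair axes.
out = (a[:, None, :] + b[None, :, :]).reshape(-1, a.shape[1])
print(out)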

Vectorized cumulative sum based on value in array numpy [duplicate]

Let's say I have a Pandas DataFrame df:
Date Value
01/01/17 0
01/02/17 0
01/03/17 1
01/04/17 0
01/05/17 0
01/06/17 0
01/07/17 1
01/08/17 0
01/09/17 0
For each row, I want to efficiently calculate the days since the last occurrence of Value=1.
So that df:
Date Value Last_Occurence
01/01/17 0 NaN
01/02/17 0 NaN
01/03/17 1 0
01/04/17 0 1
01/05/17 0 2
01/06/17 0 3
01/07/17 1 0
01/08/17 0 1
01/09/17 0 2
I could do a loop:
for i in range(0, len(df)):
    last = np.where(df.loc[0:i, 'Value'] == 1)
    df.loc[i, 'Last_Occurence'] = i - last
But it seems very inefficient for extremely large data sets and probably isn't right anyway.
Here's a NumPy approach -
def intervaled_cumsum(a, trigger_val=1, start_val=0, invalid_specifier=-1):
    out = np.ones(a.size, dtype=int)
    idx = np.flatnonzero(a == trigger_val)
    if len(idx) == 0:
        return np.full(a.size, invalid_specifier)
    else:
        out[idx[0]] = -idx[0] + 1
        out[0] = start_val
        out[idx[1:]] = idx[:-1] - idx[1:] + 1
        np.cumsum(out, out=out)
        out[:idx[0]] = invalid_specifier
        return out
A few sample runs on array data to showcase the usage, covering various scenarios of trigger and start values:
In [120]: a
Out[120]: array([0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0])
In [121]: p1 = intervaled_cumsum(a, trigger_val=1, start_val=0)
...: p2 = intervaled_cumsum(a, trigger_val=1, start_val=1)
...: p3 = intervaled_cumsum(a, trigger_val=0, start_val=0)
...: p4 = intervaled_cumsum(a, trigger_val=0, start_val=1)
...:
In [122]: np.vstack(( a, p1, p2, p3, p4 ))
Out[122]:
array([[ 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0],
[-1, 0, 0, 0, 1, 2, 0, 1, 2, 0, 0, 0, 0, 0, 1],
[-1, 1, 1, 1, 2, 3, 1, 2, 3, 1, 1, 1, 1, 1, 2],
[ 0, 1, 2, 3, 0, 0, 1, 0, 0, 1, 2, 3, 4, 5, 0],
[ 1, 2, 3, 4, 1, 1, 2, 1, 1, 2, 3, 4, 5, 6, 1]])
Using it to solve our case :
df['Last_Occurence'] = intervaled_cumsum(df.Value.values)
Sample output -
In [181]: df
Out[181]:
Date Value Last_Occurence
0 01/01/17 0 -1
1 01/02/17 0 -1
2 01/03/17 1 0
3 01/04/17 0 1
4 01/05/17 0 2
5 01/06/17 0 3
6 01/07/17 1 0
7 01/08/17 0 1
8 01/09/17 0 2
Runtime test
Approaches -
# @Scott Boston's soln
def pandas_groupby(df):
    mask = df.Value.cumsum().replace(0, False).astype(bool)
    return df.assign(Last_Occurance=df.groupby(df.Value.astype(bool)
                                                 .cumsum()).cumcount().where(mask))

# Proposed in this post
def numpy_based(df):
    df['Last_Occurence'] = intervaled_cumsum(df.Value.values)
Timings -
In [33]: df = pd.DataFrame((np.random.rand(10000000)>0.7).astype(int), columns=[['Value']])
In [34]: %timeit pandas_groupby(df)
1 loops, best of 3: 1.06 s per loop
In [35]: %timeit numpy_based(df)
10 loops, best of 3: 103 ms per loop
In [36]: df = pd.DataFrame((np.random.rand(100000000)>0.7).astype(int), columns=[['Value']])
In [37]: %timeit pandas_groupby(df)
1 loops, best of 3: 11.1 s per loop
In [38]: %timeit numpy_based(df)
1 loops, best of 3: 1.03 s per loop
Let's try this using cumsum, cumcount, and groupby:
mask = df.Value.cumsum().replace(0,False).astype(bool) #Mask starting zeros as NaN
df_out = df.assign(Last_Occurance=df.groupby(df.Value.astype(bool).cumsum()).cumcount().where(mask))
print(df_out)
output:
Date Value Last_Occurance
0 01/01/17 0 NaN
1 01/02/17 0 NaN
2 01/03/17 1 0.0
3 01/04/17 0 1.0
4 01/05/17 0 2.0
5 01/06/17 0 3.0
6 01/07/17 1 0.0
7 01/08/17 0 1.0
8 01/09/17 0 2.0
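To see why the grouper works, here is a small sketch of the intermediate key (continuing the same df):
# Each 1 in Value bumps the cumulative sum and starts a new group;
# cumcount then counts rows within that group.
print(df.Value.astype(bool).cumsum().tolist())
# [0, 0, 1, 1, 1, 1, 2, 2, 2]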
You can use argmax:
df.apply(lambda x: np.argmax(df.iloc[x.name::-1].Value.tolist()),axis=1)
Out[85]:
0 0
1 0
2 0
3 1
4 2
5 3
6 0
7 1
8 2
dtype: int64
If you have to have nan for the first 2 rows, use:
df.apply(lambda x: np.argmax(df.iloc[x.name::-1].Value.tolist()) \
if 1 in df.iloc[x.name::-1].Value.values \
else np.nan,axis=1)
Out[86]:
0 NaN
1 NaN
2 0.0
3 1.0
4 2.0
5 3.0
6 0.0
7 1.0
8 2.0
dtype: float64
You don't have to recompute the last occurrence from scratch at every step of the for loop. Initialize a variable outside the loop
last = np.nan
for i in range(len(df)):
    if df.loc[i, 'Value'] == 1:
        last = i
    df.loc[i, 'Last_Occurence'] = i - last
and update it only when a 1 occurs in column Value.
Note that no matter what method you select, iterating the whole table once is inevitable.
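For completeness, a pandas sketch of my own (not from the answers above) that avoids the explicit Python loop by forward-filling the index of the last 1:
# Position of the most recent row where Value == 1, carried forward.
last_pos = df.index.to_series().where(df.Value.eq(1)).ffill()
df['Last_Occurence'] = df.index.to_series() - last_pos   # NaN before the first 1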

Creating intervaled ramp array based on a threshold - Python / NumPy

I would like to measure the length of a sub-array fulfilling some condition (like a stop clock), but as soon as the condition is no longer fulfilled, the value should reset to zero. So the resulting array should tell me how many consecutive values fulfilled some condition (e.g. value > 1):
[0, 0, 2, 2, 2, 2, 0, 3, 3, 0]
should result in the following array:
[0, 0, 1, 2, 3, 4, 0, 1, 2, 0]
One can easily define a function in Python which returns the corresponding numpy array:
def StopClock(signal, threshold=1):
    clock = []
    current_time = 0
    for item in signal:
        if item > threshold:
            current_time += 1
        else:
            current_time = 0
        clock.append(current_time)
    return np.array(clock)

StopClock([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
However, I really do not like this for-loop, especially since this counter will run over a longer dataset. I thought of some np.cumsum solution in combination with np.diff, but I cannot get past the reset part. Is someone aware of a more elegant numpy-style solution to the above problem?
This solution uses pandas to perform a groupby:
s = pd.Series([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
threshold = 0

>>> np.where(
        s > threshold,
        s
        .to_frame()          # Convert series to dataframe.
        .assign(_dummy_=1)   # Add column of ones.
        .groupby((s.gt(threshold) != s.gt(threshold).shift()).cumsum())['_dummy_']  # shift-cumsum pattern
        .transform(lambda x: x.cumsum()),  # Cumsum the ones per group.
        0)                   # Fill value with zero where threshold not exceeded.
array([0, 0, 1, 2, 3, 4, 0, 1, 2, 0])
Yes, we can use diff-style differentiation along with cumsum to create such intervaled ramps in a vectorized manner, and that should be pretty efficient, especially with large input arrays. The resetting part is taken care of by assigning appropriate values at the end of each interval, so that the cumulative sum restarts the count after each interval.
Here's one implementation to accomplish all that -
def intervaled_ramp(a, thresh=1):
    mask = a > thresh

    # Get start, stop indices
    mask_ext = np.concatenate(([False], mask, [False]))
    idx = np.flatnonzero(mask_ext[1:] != mask_ext[:-1])
    s0, s1 = idx[::2], idx[1::2]

    out = mask.astype(int)
    valid_stop = s1[s1 < len(a)]
    out[valid_stop] = s0[:len(valid_stop)] - valid_stop
    return out.cumsum()
Sample runs -
Input (a) :
[5 3 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 1]
Output (intervaled_ramp(a, thresh=1)) :
[1 2 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 0]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 1]
Output (intervaled_ramp(a, thresh=1)) :
[0 0 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 0]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 5]
Output (intervaled_ramp(a, thresh=1)) :
[0 0 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 1]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 5]
Output (intervaled_ramp(a, thresh=0)) :
[1 2 3 4 5 0 0 1 2 3 4 0 1 2 0 1 2 3 0 1 2 3 4 0 1]
Runtime test
One way to do fair benchmarking is to use the posted sample from the question, tile it a large number of times, and use that as the input array. With that setup, here are the timings -
In [841]: a = np.array([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
In [842]: a = np.tile(a,10000)
# #Alexander's soln
In [843]: %timeit pandas_app(a, threshold=1)
1 loop, best of 3: 3.93 s per loop
# #Psidom 's soln
In [844]: %timeit stop_clock(a, threshold=1)
10 loops, best of 3: 119 ms per loop
# Proposed in this post
In [845]: %timeit intervaled_ramp(a, thresh=1)
1000 loops, best of 3: 527 µs per loop
Another numpy solution:
import numpy as np

a = np.array([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])

def stop_clock(signal, threshold=1):
    mask = signal > threshold
    indices = np.flatnonzero(np.diff(mask)) + 1
    return np.concatenate(list(map(np.cumsum, np.array_split(mask, indices))))

stop_clock(a)
# array([0, 0, 1, 2, 3, 4, 0, 1, 2, 0])
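One more vectorized variant (my own sketch, not from the answers above): keep a running count and subtract a baseline frozen at the most recent position where the condition failed:
import numpy as np

a = np.array([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
mask = a > 1
csum = mask.cumsum()
# At every False, remember the count reached so far; carrying that maximum
# forward and subtracting it restarts the ramp after each reset.
baseline = np.maximum.accumulate(np.where(~mask, csum, 0))
print(csum - baseline)   # [0 0 1 2 3 4 0 1 2 0]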
