Encoding method for a categorical variable - python

Suppose we have a categorical variable
Age = ['0-17', '18-25', '35-40', '55+']
What should we prefer: OneHotEncoding, LabelEncoding, or a mapping (like assigning values such as '0-17': 1, '18-25': 2)? And why?

Since these age ranges have a natural order, an ordinal mapping is a reasonable choice (one-hot encoding would discard that order). You can build such a mapping with pure Python like below:
age = ['0-17','18-25','35-40','40-55', '55-70', '70-85', '85+']
rng = range(len(age))
# If you want labels to start from 1:
# rng = range(1, len(age) + 1)
res = dict(zip(age, rng))
print(res)
Output:
{'0-17': 0, '18-25': 1, '35-40': 2, '40-55': 3, '55-70': 4, '70-85': 5, '85+': 6}
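If you prefer to stay within scikit-learn, OrdinalEncoder expresses the same idea and lets you pin the category order explicitly. A minimal sketch (the reshape is there because the encoder expects 2-D input; the exact column handling is an assumption here):
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

age = ['0-17', '18-25', '35-40', '40-55', '55-70', '70-85', '85+']
enc = OrdinalEncoder(categories=[age])  # fix the ordinal order explicitly
codes = enc.fit_transform(np.array(age).reshape(-1, 1))
print(codes.ravel())  # [0. 1. 2. 3. 4. 5. 6.]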


Cleaner Way to Change Dictionary Structure

I have the following object type in Python; there are several entries in the data object just like the one below.
> G1 \
jobname
x [3.3935e-06, 6.099100000000001e-06, 8.8048e-06...
y [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
yerr [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
xfit [3.3935e-06, 4.0631e-06, 4.7327000000000004e-0...
yfit [-0.0215695613, -0.0215695613, -0.0215695613, ...
xlabel Pulse Time (s)
ylabel Avg Counts
params [-3.6475000000000002e-06, -0.0026722969, -3.20...
x0 -374.2
x0_err 1.124e+07
I have been able to change the structure with the following code, but I want to be able to do this in a cleaner way. I tried a list comprehension, but that doesn't seem to be any cleaner. I also don't want to create all of the empty lists before initializing the class.
class change_dataframe():
    def flatten(l):
        return [item for sublist in l for item in sublist]

    def process_data(data, zones: list):
        for zone in zones:
            for x in df[zone]['x']:
                zone_list.append(zone)
            xdata_append.append(df[zone]['x'])
            ydata_append.append(df[zone]['y'])
            xfit_append.append(df[zone]['xfit'])
            yfit_append.append(df[zone]['yfit'])
        xdata = flatten(xdata_append)
        ydata = flatten(ydata_append)
        xfit = flatten(xfit_append)
        yfit = flatten(yfit_append)
        data = pd.DataFrame({'zones': zone_list, 'x': xdata, 'y': ydata})
        fit_data = pd.DataFrame({'xfit': xfit, 'yfit': yfit})
        return fit_data, data

# x_data, y_data = process_data(df, zones)
if __name__ == "__main__":
    xdata_append = []
    xfit_append = []
    ydata_append = []
    yfit_append = []
    zone_list = []
    zones = ['G1','G2','G3','G4','G5']
    data = change_dataframe.process_data(df, zones)
    print(data)
    # zones_list = flatten(zones_append)
Any help would be greatly appreciated.
You wrote
def process_data(data, zones:list):
but the intent was apparently
def process_data(df, zones:list):
Are we writing class methods, which accept a self parameter, or top-level functions here?
The flatten helper looks good, though PEP 8 warns against single-character names like l (easily confused with the digit 1); lst would be clearer. The repeated applications of flatten seem well suited to a single map(...) invocation, perhaps with tuple unpacking.
The top-level global lists are a bit of a disaster. You have a class already: banish the globals by making them instance variables, e.g. self.zone_list, as in the sketch below. Also, the _append suffix is not great; consider eliding it.
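A minimal sketch of that refactor (assuming, as in the original, that df is indexed by zone name and each df[zone] column holds a list of values; the names here are illustrative):
import pandas as pd

class ChangeDataframe:
    def __init__(self):
        # Former globals become instance state.
        self.zones = []
        self.xdata, self.ydata, self.xfit, self.yfit = [], [], [], []

    @staticmethod
    def flatten(lst):
        return [item for sublist in lst for item in sublist]

    def process_data(self, df, zones: list):
        for zone in zones:
            self.zones.extend([zone] * len(df[zone]['x']))
            self.xdata.append(df[zone]['x'])
            self.ydata.append(df[zone]['y'])
            self.xfit.append(df[zone]['xfit'])
            self.yfit.append(df[zone]['yfit'])
        # One map() call replaces the four flatten lines; tuple unpacking names the results.
        xdata, ydata, xfit, yfit = map(self.flatten,
                                       (self.xdata, self.ydata, self.xfit, self.yfit))
        data = pd.DataFrame({'zones': self.zones, 'x': xdata, 'y': ydata})
        fit_data = pd.DataFrame({'xfit': xfit, 'yfit': yfit})
        return fit_data, data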

How to call pandas columns with numeric suffixes in a for loop then apply conditions based on other columns with numeric suffixes (python)?

In Python I am trying to update pandas DataFrame column values based on the condition of another column's value. The column names have numeric suffixes that relate them. Here is an example DataFrame:
Nodes_2 = pd.DataFrame([[0, 0, 37.76, 0, 0, 1, 28.32], [0, 0, 45.59, 0, 0, 1, 34.19], [22.68, 0, 22.68, 1, 0, 1, 34.02], [0, 0, 41.03, 0, 0, 1, 30.77], [20.25, 0, 20.25, 1, 0, 1, 30.37]], columns=['ait1', 'ait2', 'ait3', 'Type1', 'Type2', 'Type3', 'Flow'])
And the relevant 'Type' list:
TypeNums = [1, 2, 3, 4, 5, 6, 7, 8]
Specifically, I am trying to update values in the 'ait' columns with 'Flow' values if the 'Type' value equals 1; if the 'Type' value equals 0, the 'ait' value should be 0.
My attempt at applying these conditions is not working as it is getting hung up on how I am trying to reference the columns using the string formatting. See below:
for num in TypeNums:
    if Nodes_2['Type{}'.format(num)] == 1:
        Nodes_2['ait{}'.format(num)] == Nodes_2['Flow']
    elif Nodes_2['Type{}'.format(num)] == 0:
        Nodes_2['ait{}'.format(num)] == 0
That said, how should I call the columns with their numeric suffixes without typing duplicative code calling each name? And is this a correct way of applying the above mentioned conditions?
Thank you!
The correct way is to use np.where or, in your case, just simple multiplication:
for num in TypeNums:
    Nodes_2['ait{}'.format(num)] = Nodes_2['Type{}'.format(num)] * Nodes_2['Flow']
Or, you can multiply all the columns at once:
Nodes_2[['ait{}'.format(num) for num in TypeNums]] = Nodes_2[['Type{}'.format(num) for num in TypeNums]].mul(Nodes_2['Flow'], axis='rows').to_numpy()
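For completeness, the np.where form mentioned above would look like this (a sketch; the column-existence guard is an addition here, since the example frame only has suffixes 1-3 while TypeNums runs to 8):
import numpy as np

for num in TypeNums:
    type_col, ait_col = 'Type{}'.format(num), 'ait{}'.format(num)
    if type_col in Nodes_2.columns:  # skip suffixes the frame doesn't have
        Nodes_2[ait_col] = np.where(Nodes_2[type_col] == 1, Nodes_2['Flow'], 0)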

using a `tf.Tensor` as a Python `bool` is not allowed in Graph execution with `tf.data.Dataset`

I'm trying to make this function accept a single element tensor.
def classi(i):
    out = np.zeros((1, 49), np.uint8)
    for j in range(len(classcount)):
        i -= classcount[j]
        if i < 0:
            break
    out[0][j] += 1
    return tf.convert_to_tensor(out)
# basically the error seems to be related to the `if i < 0` line
This function is called by another function:
def formatcars(elem):
    return (elem['image'], tf.function(classi(elem['label'])))
# note: elem['label'] is just a single-element integer tensor.
Which in turn is mapped to the cars dataset.
dataset.map(formatcars)
And I keep getting the error:
OperatorNotAllowedInGraphError: using a `tf.Tensor` as a Python `bool` is not allowed in Graph execution. Use Eager execution or decorate this function with @tf.function.
I've tried enabling eager execution. I've tried using tf.function, tf.cond, tf.greater, .numpy(), .eval(), etc., to no avail. It keeps giving the same error. I'm out of ideas now.
The classcount list is defined as follows:
classcount = [ 1, 6, 4, 14, 13, 6, 2, 4, 3, 22, 6, 1, 15, 1, 2, 4, 1,
12, 5, 1, 2, 4, 11, 2, 1, 1, 5, 4, 2, 1, 1, 1, 1, 1,
6, 1, 4, 1, 1, 1, 3, 1, 2, 4, 1, 4, 3, 3, 1]
It's simply a list of integers created from:
import scipy
import tensorflow_datasets as tfds
dataset = tfds.load('cars196', split = 'train')
mat = scipy.io.loadmat('cars_annos.mat')
classcount = []
starti = 0
curmake = ''
for i in range(len(mat['class_names'][0])):
    print(mat['class_names'][0][i][0].split(' ', 1)[0])
    if mat['class_names'][0][i][0].split(' ', 1)[0] != curmake:
        print(i - starti)
        if i - starti != 0:
            classcount.append(i - starti)
        starti = i
        curmake = mat['class_names'][0][i][0].split(' ', 1)[0]
classcount.append(1)
cars_annos.mat is from http://imagenet.stanford.edu/internal/car196/cars_annos.mat
As the error states, you can't use a tensor as a boolean value in a Python conditional statement outside of eager execution, and tf.data.Dataset will force you to use graph mode (for performance reasons). You can't simply decorate the function with the @tf.function decorator either, because AutoGraph is not able to convert code where a tensor is used as a Python bool (in a condition statement, for example).
The best way of handling that is to rewrite the function using TF ops. One way of doing it could be the following:
def graphmode_classi(i):
    """
    Performs a cumulative sum on classcount and subtracts that sum from i;
    the first negative value gives us the index to one-hot encode.
    """
    cumsum = tf.math.cumsum(classcount)
    first_neg = tf.where((i - cumsum) < 0)[0]
    return tf.one_hot(first_neg, 49)
We can check that the rewritten function is equivalent:
# random data
data = tf.cast(tf.random.normal((200, 1))**2 * 10, tf.int32)
for d in data:
    assert (classi(d).numpy() == graphmode_classi(d).numpy()).all()
Now you should be able to use the function written with only tf ops with the tf.data.Dataset API:
data = tf.cast(tf.random.normal((200, 1))**2 * 10, tf.int32)
ds = tf.data.Dataset.from_tensor_slices(data)
ds.map(graphmode_classi)
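And to get back to the original goal, the rewritten function can be dropped into formatcars in place of the classi call (a sketch, assuming dataset is the cars196 dataset loaded earlier):
def formatcars(elem):
    # graphmode_classi uses only TF ops, so it traces cleanly inside map()
    return elem['image'], graphmode_classi(elem['label'])

ds = dataset.map(formatcars)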

ValueError: object too deep for desired array while using cross correlation

I am trying to investigate the cross-correlation of two DataFrames. The code is given here:
df1 = pd.DataFrame({"A":[1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1]})
df2 = pd.DataFrame({"A":[7191, 7275, 9889, 9934, 9633, 9924, 9650, 9341, 8820, 8784, 8869]})
np.correlate(df1, df2)
But I get this error (screenshot: https://imgur.com/PIOXwND):
ValueError: object too deep for desired array
Any ideas?
You're getting this error because you're passing DataFrames, which are 2-D; np.correlate is for cross-correlation of two 1-dimensional sequences. So try:
np.correlate(df1.squeeze(), df2.squeeze())
which outputs array([80556], dtype=int64).
Edit
Based on your suggestion, try
# You will need to change your column names, like
df1 = pd.DataFrame({"A":[1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1]})
df2 = pd.DataFrame({"B":[7191, 7275, 9889, 9934, 9633, 9924, 9650, 9341, 8820, 8784, 8869]})
df1.join(df2).corr()
which outputs
          A         B
A  1.000000 -0.174287
B -0.174287  1.000000
As suggested by piRSquared in the comments, you can also use df1.corrwith(df2) to return a single value.
Another option is the scipy.stats function pearsonr:
from scipy.stats import pearsonr

pearson = pearsonr(df1['A'].values, df2['A'].values)
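Note that pearsonr returns a (coefficient, p-value) pair, so pearson[0] should match the -0.174287 reported by .corr() above.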

Algorithm to offset a list of data

Given a list of data as follows:
input = [1,1,1,1,5,5,3,3,3,3,3,3,2,2,2,5,5]
I would like to create an algorithm that can offset the list by a certain number of steps. For example, if offset = -1:
def offsetFunc(inputList, offsetList):
    # make something
    return output
where:
output = [0,0,0,0,1,1,5,5,5,5,5,5,3,3,3,2,2]
Important note: the elements of the list are float numbers and they are not in any progression, so I actually need to shift them; I cannot use any workaround to get the result.
So basically, the algorithm should replace the first set of values (the four 1s, basically) with 0, and then it should:
Detect the length of the next range of values
Create a parallel output vector with the values delayed by one set
The way I have roughly described the algorithm above is how I would do it. However, I'm a newbie to Python (and a beginner at programming in general) and I have figured out over time that Python has a lot of built-in functions that could make the algorithm lighter and less iterative. Does anyone have any suggestions for a better way to write this kind of script? This is the code I have written so far (assuming a static offset of -1):
input = [1,1,1,1,5,5,3,3,3,3,3,3,2,2,2,5,5]
output = []
PrevVal = 0
NextVal = input[0]
i = 0
while input[i] == NextVal:
    output.append(PrevVal)
    i += 1
while i < len(input):
    PrevVal = NextVal
    NextVal = input[i]
    while input[i] == NextVal:
        output.append(PrevVal)
        i += 1
        if i >= len(input):
            break
print(output)
Thanks in advance for any help!
BETTER DESCRIPTION
My list will always be composed of "sets" of values. They are usually float numbers, and they take values such as in this short example below:
Sample = [1.236,1.236,1.236,1.236,1.863,1.863,1.863,1.863,1.863,1.863]
In this example, the first set (the one with value 1.236) is 4 elements long, while the second one is 6 elements long. What I would like to get as output, when offset = -1, is:
The value 0.000 in the first 4 elements;
The value 1.236 in the following 6 elements.
So basically, this "offset" function creates a list with the same "structure" (ranges of lengths) but with the values delayed by "offset" sets.
I hope it's clear now; unfortunately the problem itself is still a bit tricky for me to describe (plus I don't even speak good English :) )
Please don't hesitate to ask for any additional info to complete the question and make it clearer.
How about this:
def generateOutput(input, value=0, offset=-1):
    values = []
    for i in range(len(input)):
        if i < 1 or input[i] == input[i-1]:
            yield value
        else:  # value change in input detected
            values.append(input[i-1])
            if len(values) >= -offset:
                value = values.pop(0)
            yield value

input = [1,1,1,1,5,5,3,3,3,3,3,3,2,2,2,5,5]
print(list(generateOutput(input)))
It will print this:
[0, 0, 0, 0, 1, 1, 5, 5, 5, 5, 5, 5, 3, 3, 3, 2, 2]
And in case you just want to iterate, you do not even need to build the list. Just use for i in generateOutput(input): … then.
For other offsets, use this:
print(list(generateOutput(input, 0, -2)))
prints:
[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 5, 5, 5, 3, 3]
This uses a deque as the queue, with maxlen defining the shift length, and it only holds unique values: pushing new values in at the end pushes old values out at the start of the queue once the shift length has been reached.
from collections import deque

def shift(it, shift=1):
    q = deque(maxlen=shift+1)
    q.append(0)
    for i in it:
        if q[-1] != i:
            q.append(i)
        yield q[0]

Sample = [1.236,1.236,1.236,1.236,1.863,1.863,1.863,1.863,1.863,1.863]
print(list(shift(Sample)))
#[0, 0, 0, 0, 1.236, 1.236, 1.236, 1.236, 1.236, 1.236]
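For a larger shift the same maxlen trick applies; with the integer list from the question, shift=2 reproduces the offset = -2 output shown earlier:
print(list(shift([1,1,1,1,5,5,3,3,3,3,3,3,2,2,2,5,5], 2)))
#[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 5, 5, 5, 3, 3]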
My try:
# Input
input = [1,1,1,1,5,5,3,3,3,3,3,3,2,2,2,5,5]
shift = -1

# Build service structures: for each 'set' of data store its length and its value
set_lengths = []
set_values = []
prev_value = None
set_length = 0
for value in input:
    if prev_value is not None and value != prev_value:
        set_lengths.append(set_length)
        set_values.append(prev_value)
        set_length = 0
    set_length += 1
    prev_value = value
else:
    set_lengths.append(set_length)
    set_values.append(prev_value)

# Output the result, shifting the values
output = []
for i, l in enumerate(set_lengths):
    j = i + shift
    if j < 0:
        output += [0] * l
    else:
        output += [set_values[j]] * l

print(input)
print(output)
gives:
[1, 1, 1, 1, 5, 5, 3, 3, 3, 3, 3, 3, 2, 2, 2, 5, 5]
[0, 0, 0, 0, 1, 1, 5, 5, 5, 5, 5, 5, 3, 3, 3, 2, 2]
def x(list, offset):
    return [el + offset for el in list]
A completely different approach from my first answer is this:
import itertools
First analyze the input:
values, amounts = zip(*((n, len(list(g))) for n, g in itertools.groupby(input)))
We now have (1, 5, 3, 2, 5) and (4, 2, 6, 3, 2). Now apply the offset:
values = (0,) * (-offset) + values # nevermind that it is longer now.
And synthesize it again:
output = sum([ [v] * a for v, a in zip(values, amounts) ], [])
This is way more elegant, way less understandable and probably way more expensive than my other answer, but I didn't want to hide it from you.
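For reference, the three steps above can be packaged into a single function (offset_sets is a name introduced here purely for illustration):
import itertools

def offset_sets(data, offset=-1, fill=0):
    # analyze: run-length encode the input into (value, length) pairs
    values, amounts = zip(*((n, len(list(g))) for n, g in itertools.groupby(data)))
    # apply the offset: prepend fill values; zip() below truncates the surplus tail
    values = (fill,) * (-offset) + values
    # synthesize: repeat each (shifted) value to its run length
    return [v for v, a in zip(values, amounts) for _ in range(a)]

print(offset_sets([1,1,1,1,5,5,3,3,3,3,3,3,2,2,2,5,5]))
# [0, 0, 0, 0, 1, 1, 5, 5, 5, 5, 5, 5, 3, 3, 3, 2, 2]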
