Decision Tree Learning - python

I want to implement the decision-tree learning algorithm.
I am pretty new to coding, so I know it's not the best code, but I just want it to work. Unfortunately, I get the error: e2 = b(pk2/(pk2 + nk2))
ZeroDivisionError: division by zero
Can someone explain to me what I am doing wrong?

Let's assume that after some splits you are left with two records with 3 features/attributes each (the last column being the truth label):
1 1 1 2
2 2 2 1
Now you are about to select the next best feature to split on, so you call this method remainder(examples, attribute) as part of selection which internally calls nk1, pk1 = pk_nk(1, examples, attribute).
The values returned by pk_nk for the above-mentioned rows and features will be 0, 0, which results in a divide-by-zero exception for e1 = b(pk1/(pk1 + nk1)). This is a valid scenario given how you coded the decision tree, and you should handle it.

(pk2 + nk2) at some point equals zero. If we step backwards through your code, we see they are assigned here:
nk2, pk2 = pk_nk(2, examples, attribute)
def pk_nk(path, examples, attribute):
    nk = 0
    pk = 0
    for ex in examples:
        if ex[attribute] == path and ex[7] == NO:
            nk += 1
        elif ex[attribute] == path and ex[7] == YES:
            pk += 1
    return nk, pk
As such, for the divisor to equal zero nk and pk must remain zero through the function, i.e. either:
examples is empty, or
neither if/elif condition is satisfied
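One way to handle the empty-branch case is to check the denominator before dividing. This is only a minimal sketch: the body of b is not shown in the question, so the binary-entropy version below is an assumption, and branch_entropy is a hypothetical helper name.

```python
import math

def b(q):
    """Binary entropy of a Bernoulli variable with probability q (assumed definition)."""
    if q in (0, 1):
        return 0.0  # a pure branch carries no uncertainty
    return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

def branch_entropy(pk, nk):
    """Entropy of one branch; an empty branch contributes nothing."""
    total = pk + nk
    if total == 0:
        return 0.0  # no examples took this path, so skip the division
    return b(pk / total)

print(branch_entropy(0, 0))  # empty branch no longer raises ZeroDivisionError
print(branch_entropy(1, 1))  # evenly mixed branch
```

Returning 0.0 for an empty branch matches the usual convention that a branch with no examples adds nothing to the weighted remainder.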

Can I use a value stored in a variable after an if statement with the loc method?

I created an if else statement to determine whether the min or max is the bigger difference and then stored the number to a variable.
findValue = 0.0
minAbs = abs(df[["numbers"]].min())
maxAbs = abs(df[["numbers"]].max())
if minAbs > maxAbs:
    findValue = minAbs
else:
    findValue = maxAbs
**df2=df.loc[df['numbers'] == findValue, 'day_related']**
df2
Python hates that I use findValue and not the actual number that it's set equal to in the statement with ** around it, but I thought these are interchangeable?
df[["numbers"]] creates a new dataframe with one column called numbers, so df[["numbers"]].max() isn't actually going to return a number; it's going to return a Series object with one item. df["numbers"] will return the actual numbers column so .max() and .min() will work as expected.
Change your code to this:
minAbs = abs(df["numbers"].min())
maxAbs = abs(df["numbers"].max())
...and then the rest of your code.
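The difference between the two indexing styles can be seen on a small frame. This is a sketch with made-up data; the column names follow the question, and comparing against the absolute value of the column is an assumption added here so that a negative minimum still matches.

```python
import pandas as pd

# Made-up sample data using the column names from the question.
df = pd.DataFrame({
    "numbers": [-7.5, 2.0, 4.0],
    "day_related": ["mon", "tue", "wed"],
})

as_series = df[["numbers"]].min()  # double brackets: one-item Series
as_scalar = df["numbers"].min()    # single brackets: a plain number

# With scalars, the comparison inside .loc works as intended.
find_value = max(abs(df["numbers"].min()), abs(df["numbers"].max()))
matches = df.loc[df["numbers"].abs() == find_value, "day_related"]

print(type(as_series).__name__, type(as_scalar).__name__)
print(list(matches))
```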

Pulp Integer Programming Constraint ignored

I am trying to solve a LpProblem with only boolean variables and Pulp seems to be ignoring some constraints. To give some context about the problem:
I want to find an optimal solution to the problem schools face when trying to create classroom groups. In this case, students are given a paper to write at most 5 other students and the school guarantees them that they will be together with at least one of those students. To see how I modeled this problem into an integer programming problem please refer to this question.
In that link you will see that my variables are defined as x_i_j = 1 if student i will be together with student j, and x_i_j = 0 otherwise. Also, in that link I ask about the constraint that I am having trouble implementing with Pulp: if x_i_j = 1 and x_j_k = 1, then by the transitive property, x_i_k = 1. In other words, if student i is with student j, and student j is with student k, then student i will inherently be together with student k.
My objective is to maximize the sum of all the elements of the matrix obtained when performing a Hadamard product between the input matrix and the variables matrix. In other words, I want to satisfy as many of the students' requests as possible.
I will now provide some code snippets and screen captures that should help visualize the problem:
Inputs (just a sample: the real matrix is 37x37)
Output
As you can see in this last image, x_27 = 1 and x_37 = 1 but x_23 = 0 which doesn't make sense.
Here is how I define my variables
def define_variables():
    variables = []
    for i in range(AMOUNT_OF_STUDENTS):
        row = []
        for j in range(AMOUNT_OF_STUDENTS):
            row.append(LpVariable(f"x_{i}_{j}", lowBound=0, upBound=1, cat='Integer'))
        variables.append(row)
    return variables
Here is how I define the transitive constraints
for i in range(len(variables)):
    for j in range(i, len(variables)):
        if i != j:
            problem += variables[i][j] == variables[j][i]  # Symmetry
        for k in range(j, len(variables)):
            if i < j < k < len(variables):
                problem += variables[i][j] + variables[j][k] - variables[i][k] <= 1  # Transitive
                problem += variables[i][j] + variables[i][k] - variables[j][k] <= 1
                problem += variables[j][k] + variables[i][k] - variables[i][j] <= 1
When printing the LpProblem I see the constraint that is apparently not working:
As you can see in the output: x_2_7 = 1 and x_3_7 = 1. Therefore, to satisfy this constraint, x_2_3 should also be 1, but as you can also see in the output, it is 0.
Any ideas about what could be happening? I've been stuck for days. The problem seems to be modeled fine, and it worked when I only had 8 students (64 variables). Now that I have 37 students (1369 variables) it seems to be behaving oddly. The solver arrives at a solution, but it seems to be ignoring some constraints.
Any help is very much appreciated! Thank you in advance.
The constraint is working correctly. Find below the analysis:
(crossposted from github: https://github.com/coin-or/pulp/issues/377)
import pulp as pl
import pytups as pt
path = 'debugSolution.txt'
# import model
_vars, prob = pl.LpProblem.from_json(path)
# get all variables with non-zero value
vars_value = pt.SuperDict(_vars).vfilter(pl.value)
# export the lp
prob.writeLP('debugSolution.lp')
# the constraint you show in the SO problem is:
# _C3833: - x_2_3 + x_2_7 + x_3_7 <= 1
'x_2_7' in vars_value
# True, so x_2_7 has value 1
'x_3_7' in vars_value
# False, so x_3_7 has value 0
'x_2_3' in vars_value
# False, so x_2_3 has value 0
So -0 + 1 + 0 <= 1 holds, which means the constraint is respected. There must be a problem with how you read back the value of x_3_7 somewhere, because you think it is 1 when in PuLP it's 0.
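To rule out this kind of read-back mistake, the solved values can be checked for transitivity independently of the solver. The following is a rough sketch in plain Python; violated_triples is a hypothetical helper, and it assumes the solution has already been exported into a dict keyed by (i, j) pairs with i < j.

```python
from itertools import combinations

def violated_triples(together, n):
    """Return triples (i, j, k) where exactly two of the three pairs are together."""
    def x(a, b):
        # Missing pairs are treated as 0 (not together).
        return together.get((min(a, b), max(a, b)), 0)

    bad = []
    for i, j, k in combinations(range(n), 3):
        # Two pairs at 1 and the third at 0 is exactly what transitivity forbids.
        if x(i, j) + x(j, k) + x(i, k) == 2:
            bad.append((i, j, k))
    return bad

# The values reported by PuLP in the analysis above: only x_2_7 is 1.
print(violated_triples({(2, 7): 1}, 8))
# What the question believed the solution to be: x_2_7 and x_3_7 both 1.
print(violated_triples({(2, 7): 1, (3, 7): 1}, 8))
```

An empty list for the real solver output would confirm that the constraints hold and that the discrepancy lies in how the values were displayed.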
This is called a set partitioning problem and PuLP has an example in their documentation here.
In essence, instead of modeling your variables as indicators of whether student A is in the same class as student B, you'll define a mapping between a set of students and a set of classrooms. You can then apply your student preferences as either constraints or part of a maximization objective.
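The student-to-classroom mapping can be illustrated without a solver. The sketch below brute-forces a tiny instance, so it only shows the modeling idea and is not a substitute for the PuLP formulation; the function names are hypothetical, and the guarantee that each student gets at least one requested classmate is left out for brevity.

```python
from itertools import product

def satisfied(requests, rooms):
    """Count requests (i, j) where student i shares a room with student j."""
    return sum(1 for i, j in requests if rooms[i] == rooms[j])

def best_assignment(n_students, n_rooms, requests):
    """Brute-force the student -> room mapping that satisfies the most requests."""
    best = None
    for rooms in product(range(n_rooms), repeat=n_students):
        # Require every room to be used so students are actually split up.
        if len(set(rooms)) < n_rooms:
            continue
        score = satisfied(requests, rooms)
        if best is None or score > best[0]:
            best = (score, rooms)
    return best

# Students 0 and 1 ask for each other, as do students 2 and 3.
score, rooms = best_assignment(4, 2, [(0, 1), (1, 0), (2, 3), (3, 2)])
print(score, rooms)
```

Because "together" is derived from sharing a room, transitivity holds automatically and no triple constraints are needed.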

Trying to calculate EMA using Python and I can't figure out why my code always produces the same result

I am trying to calculate an exponential moving average of bitcoin in python2.7 but my result is always the same value and I have no idea why.
def calcSMA(data,counter,timeframe):
    closesum = 0
    for i in range(timeframe):
        closesum = float(closesum) + float(data[counter-i])
    return float(closesum / timeframe)

def calcEMA(price,timeframe,prevema):
    multiplier = float(2/(timeframe+1))
    ema = ((float(price) - float(prevema))*multiplier) + float(prevema)
    return float(ema)

counter = 0
closeprice = [7242.4,7240,7242.8,7253.8,7250.6,7255.7,7254.9,7251.4,7234.3,7237.4
              ,7240.7,7232,7230.2,7232.2,7236.1,7230.5,7230.5,7230.4,7236.4]
while counter < len(closeprice):
    if counter == 3:
        movingaverage = calcSMA(closeprice,counter,3)
        print movingaverage
    if counter > 3:
        movingaverage = calcEMA(closeprice[counter],3,movingaverage)
        print movingaverage
    counter +=1
This is how to calculate the EMA:
{Close - EMA(previous day)} x multiplier + EMA(previous day)
You seed the formula with a simple moving average.
Doing this in Excel works, so might it be my use of variables?
I would be really glad if someone could tell me what I am doing wrong, because I have been stuck on this simple problem for hours and can't figure it out. I've tried storing my previous EMA in a separate variable, and I even stored all of them in a list, but I always get the same value at every timestep.
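The expression 2/(timeframe+1) evaluates to zero for any timeframe of 2 or more, because the operands are all integers and therefore Python 2 uses integer division. Wrapping that result in float() does no good; you just get 0.0 instead of 0. With a multiplier of zero, the EMA never moves away from its seed value.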
Try 2.0/(timeframe+1) instead.
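For reference, a version with the float literal in place could look like the sketch below. The snake_case names are a restyling for illustration, not the question's originals, and the code runs on both Python 2 and Python 3.

```python
def calc_sma(data, end, timeframe):
    """Simple moving average over the `timeframe` values ending at index `end`."""
    window = data[end - timeframe + 1:end + 1]
    return sum(window) / float(timeframe)

def calc_ema(price, timeframe, prev_ema):
    """One EMA step; 2.0 forces true division even on Python 2."""
    multiplier = 2.0 / (timeframe + 1)
    return (price - prev_ema) * multiplier + prev_ema

closeprice = [7242.4, 7240, 7242.8, 7253.8, 7250.6, 7255.7]
ema = calc_sma(closeprice, 3, 3)  # seed with the SMA, as in the question
for price in closeprice[4:]:
    ema = calc_ema(price, 3, ema)
    print(ema)
```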

Defining a function that counts 2 letters and divides that by total length of the word

I'm trying to create a code that counts how many G's and C's there are in a "strand of DNA" and calculate the percentage of the G's + C's in that strand, e.g.
gcContent('CGGTCCAATAGATTCGAA')
44.4444444444
There are 18 letters in that string and 8 G's + C's together.
I am struggling so far to even count the letter of G's in a strand in my code, this is what I have so far:
def gcContent(dnaMolecule):
    count = 0
    for g in dnaMolecule:
        dnaMolecule.count('g')
        count += 1
    return count
and when I type it into the interactive python shell the result is this:
In [1]: gcContent('a')
Out[1]: 1.0
It's not counting the number of G's so far, and it returns one no matter which single character I type inside the parentheses after gcContent.
You can use the count method that every string has.
def gcContent(dnaMolecule):
    dnaMolecule = dnaMolecule.lower()
    count = dnaMolecule.count('g') + dnaMolecule.count('c')
    return count / len(dnaMolecule)
For Python 2.x and getting a value between 0 - 100 instead of 0 - 1:
def gcContent(dnaMolecule):
    dnaMolecule = dnaMolecule.lower()
    count = dnaMolecule.count('g') + dnaMolecule.count('c')
    return 100.0 * count / len(dnaMolecule)
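Neither version above handles an empty strand, which would raise ZeroDivisionError. A small sketch that adds that guard is shown here; gc_content is a hypothetical renaming, and returning 0.0 for an empty input is an assumption about the desired behavior.

```python
def gc_content(dna):
    """Percentage of G and C bases in a strand, case-insensitive."""
    if not dna:
        return 0.0  # assumed convention for an empty strand
    dna = dna.upper()
    gc = dna.count("G") + dna.count("C")
    return 100.0 * gc / len(dna)

print(gc_content("CGGTCCAATAGATTCGAA"))
```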
If you can make use of Biopython, there is already a predefined function GC which calculates the GC content of a given sequence (recent Biopython releases replace it with gc_fraction, which returns a value between 0 and 1):
from Bio.SeqUtils import GC
print(GC('CGGTCCAATAGATTCGAA'))
That gives the desired output:
44.44444444444444
Depending on what else you want to do with your sequence, I highly recommend using the predefined functions rather than writing your own.
EDIT:
As this is discussed below @TammoHeeren's answer, GC also takes care of the lower/upper case issue:
print(GC('CGGGggg'))
gives
100.0

newB struggling with Backus Naur at Udacity Comp. Sci. 101

I am finishing up the Intro to Computer Science 101 course at Udacity and am looking for some help addressing one of the final quiz problems. The following code returned a "pass" when submitted but I feel like I am not getting at the heart of the challenge in this quiz. Any help or suggestions about how to approach and think about the problem would be appreciated.
The problem:
"""
Define a Python procedure, in_language(<string>),
that takes as input a string and returns True
if the input string is in the language described by the BNF grammar below
(starting from S) and returns False otherwise.
BNF grammar description:
S => 0 S 1
S => 1 S 0
S => 0
"""
# Tests. These all should print True if your procedure is defined correctly
print in_language("00011") == True
print in_language("0") == True
print in_language("01") == False
print in_language("011") == False
print in_language("01002") == False
Here's my code so far:
def in_language(bnf):
    if len(bnf) % 2 == 0:
        return False
    if any(i in '23456789' for i in bnf) == True:
        return False
    if bnf[len(bnf)/2] != '0':
        return False
    else:
        return True
This code will return True for submissions not of the given Backus-Naur Form:
S => 0 S 1
S => 1 S 0
S => 0
such as '11111011111'
print in_language('11111011111') == False
I am still wrapping my head around recursion, but it seems like there is a way to address this problem recursively? Either that or my next step would have been to check the first and last characters of the string to see if they were exclusively a zero and a one (not both), then remove them and continue pruning the string until I got to the base case, or, "middle" zero. I was surprised that the code passed the quiz at this point.
Of note, my thinking about the code:
if len(bnf) % 2 == 0:
The first if conditional occurred to me because given the B-N form, any iteration will result in an odd number of numbers, so string length divisibility by 2 indicates something not of that form.
if any(i in '23456789' for i in bnf) == True:
The second "if" was also a simple consideration, because the problem is only looking for strings constructed of ones and zeros (I suppose I could have included the alphabet as well, or simply written if any(i not in '01' for i in bnf)).
if bnf[len(bnf)/2] != '0':
The third "if" is similarly looking for a qualifying feature of the given B-N form - no matter what the expression according to the given syntax, there will be a zero in the middle - and making use of Python's floor division as well as the index starting at zero.
Any thoughts or suggestions of alternative solutions would be greatly appreciated, thanks!
As I am new to StackOverflow, I did research this question before posting. Any posting style considerations (too verbose?) or concerns would also be helpful :)
Okay,
I took duskwoof's suggestion and came up with this:
def in_language(bnf):
    # corresponding to: S => 0 S 1
    if bnf[0] == '0' and bnf[-1] == '1':
        return in_language(bnf[1:-1])
    # corresponding to: S => 1 S 0
    if bnf[0] == '1' and bnf[-1] == '0':
        return in_language(bnf[1:-1])
    # corresponding to: S => 0
    if bnf == '0':
        return True
    return False
and it works for cases which follow the form, but python barfs when I submit cases which do not... and I still feel like there is something I am missing about recursion and parsing the strings for Backus-Naur Forms. How should I think about handling the cases which don't follow the form? Thank you for the help. I'll keep wrestling with this.
This seems to work better - passes all the test cases:
def in_language(bnf):
    if len(bnf) > 2:
        # corresponding to: S => 0 S 1
        if bnf[0] == '0' and bnf[-1] == '1':
            return in_language(bnf[1:-1])
        # corresponding to: S => 1 S 0
        if bnf[0] == '1' and bnf[-1] == '0':
            return in_language(bnf[1:-1])
    # corresponding to: S => 0
    if bnf == '0':
        return True
    return False
but again, I'm totally a newB #programming, so any advice or input would be very helpful... I still don't feel like I have a very general solution; just something specific to this particular BNF grammar.
"I am still wrapping my head around recursion, but it seems like there is a way to address this problem recursively?"
This is precisely how you are expected to approach this problem. Don't overthink it by analyzing what properties the strings in this language will have (e.g., length modulo 2, what characters it will contain, etc.)! While this could potentially work for this specific language, it won't work in general; some languages will simply be too complicated for a property-checking solution like the one you're describing.
Your solution should be a direct translation of the description of the language -- you shouldn't have to think too much about what the rules mean! -- and should use recursion for rules which have S on the right side. It should be written in the form:
def in_language(bnf):
    if ...: # corresponding to: S => 0 S 1
        return True
    if ...: # corresponding to: S => 1 S 0
        return True
    if ...: # corresponding to: S => 0
        return True
    return False
(The solution you currently have is a "false solution" -- it will pass the tests given in the problem, but will fail on certain other inputs. For example, the string 000 is not in this language, but your function will say it is.)
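Filling in that template directly from the three grammar rules might look like the following. It is one possible sketch, not the official quiz solution; the explicit length check is an addition here so that empty and two-character inputs return False instead of raising an IndexError or recursing on an empty string.

```python
def in_language(s):
    """Return True if s can be derived from S in the grammar above."""
    # S => 0
    if s == "0":
        return True
    # Both recursive rules need an outer pair plus a non-empty middle.
    if len(s) < 3:
        return False
    # S => 0 S 1
    if s[0] == "0" and s[-1] == "1":
        return in_language(s[1:-1])
    # S => 1 S 0
    if s[0] == "1" and s[-1] == "0":
        return in_language(s[1:-1])
    return False

print(in_language("00011"))  # 0 (0 (0) 1) 1
print(in_language("000"))    # matches no rule
```

Each branch mirrors exactly one production, so a different grammar would be handled by changing the branches rather than by rethinking the string's properties.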
