I've been trying to get my head around Frequentist and Bayesian approaches for a toy data AB test problem.
The results don't really make sense to me: I'm struggling to tell whether I've interpreted them correctly or simply computed them incorrectly (which is quite likely). Furthermore, after much research, I am still somewhat lost as to how to compute Bayes factors. I've seen R packages that make this look fairly easy, but I'm not familiar with R and would prefer to solve this problem in Python.
I would greatly appreciate any help and guidance regarding this!
Here is the data:
# imports
import pingouin as pg
import pymc3 as pm
import pandas as pd
import numpy as np
import scipy.stats as scs
import statsmodels.stats.api as sms
import math
import matplotlib.pyplot as plt
# A = control -- B = treatment
a_success = 10730
a_failure = 61988
a_total = a_success + a_failure
a_cr = a_success / a_total
b_success = 10966
b_failure = 60738
b_total = b_success + b_failure
b_cr = b_success / b_total
I started by doing some power analysis to determine the required sample size, using a power of 0.8, an alpha of 0.05 and a practical significance of 2%. I'm not sure whether I should supply the two expected conversion rates, or the baseline plus some proportion of it; depending on which effect size I use, the required number of samples changes dramatically.
# determine required sample size
baseline_rate = a_cr
practical_significance = 0.02
alpha = 0.05
power = 0.8
nobs1 = None
# is this how to calculate effect size?
effect_size = sms.proportion_effectsize(baseline_rate, baseline_rate + practical_significance) # 5204
# # or this?
# effect_size = sms.proportion_effectsize(baseline_rate, baseline_rate + baseline_rate * practical_significance) # 228583
sample_size = sms.NormalIndPower().solve_power(effect_size = effect_size,
                                               power = power,
                                               alpha = alpha,
                                               nobs1 = nobs1,
                                               ratio = 1)
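For context, my (possibly wrong) understanding is that proportion_effectsize computes Cohen's h, so the two options above differ only in which second proportion goes into that formula:
# Cohen's h -- as I understand it, this is what sms.proportion_effectsize returns;
# the second proportion below is just the first option from above
p1 = baseline_rate
p2 = baseline_rate + practical_significance
cohens_h = 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))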
I continued trying to determine if the null hypothesis could be rejected:
# calculate pooled probability
pooled_probability = (a_success + b_success) / (a_total + b_total)
# calculate pooled standard error and margin of error
se_pooled = math.sqrt(pooled_probability * (1 - pooled_probability) * (1 / b_total + 1 / a_total))
z_score = scs.norm.ppf(1 - alpha / 2)
margin_of_error = se_pooled * z_score
# the estimated difference between probability of conversions of both groups
d_hat = (b_success / b_total) - (a_success / a_total)
# test if null hypothesis can be rejected
lower_bound = d_hat - margin_of_error
upper_bound = d_hat + margin_of_error
if practical_significance < lower_bound:
    print("reject null hypothesis -- groups do not have the same conversion rates")
else:
    print("do not reject the null hypothesis -- groups have the same conversion rates")
which evaluates to 'do not reject the null ...', despite group B (treatment) showing a 3.65% relative improvement in conversion rate over group A (control), which seems... odd?
I tried a slightly different approach (I guess a slightly different hypothesis?):
successes = [a_success, b_success]
nobs = [a_total, b_total]
z_stat, p_value = sms.proportions_ztest(successes, nobs=nobs)
(lower_a, lower_b), (upper_a, upper_b) = sms.proportion_confint(successes, nobs=nobs, alpha=alpha)
if p_value < alpha:
    print("reject null hypothesis -- groups do not have the same conversion rates")
else:
    print("do not reject the null hypothesis -- groups have the same conversion rates")
which evaluates to 'reject null hypothesis ...' with a p-value of 0.004236. This seems highly contradictory to the first result, especially since the p-value is < 0.01.
On to Bayes... I created arrays of successes and failures, shuffled them, and (because of how long sampling takes) only used the first 1,000 observations of each group, then ran the following:
# generate lists of 1, 0
obs_a = np.repeat([1, 0], [a_success, a_failure])
obs_b = np.repeat([1, 0], [b_success, b_failure])

for _ in range(10):
    np.random.shuffle(obs_a)
    np.random.shuffle(obs_b)

with pm.Model() as model:
    p_A = pm.Beta("p_A", 1, 1)
    p_B = pm.Beta("p_B", 1, 1)
    delta = pm.Deterministic("delta", p_A - p_B)
    obs_A = pm.Bernoulli("obs_A", p_A, observed = obs_a[:1000])
    obs_B = pm.Bernoulli("obs_B", p_B, observed = obs_b[:1000])
    step = pm.NUTS()
    trace = pm.sample(1000, step = step, chains = 2)
Firstly, I understand that you are supposed to burn some proportion of the trace -- how do you determine an appropriate number of indices to burn?
In trying to evaluate the posterior probabilities, is the following code the correct way to do this?
b_lift = (trace['p_B'].mean() - trace['p_A'].mean()) / trace['p_A'].mean() * 100
b_prob = np.mean(trace["delta"] > 0)
a_lift = (trace['p_A'].mean() - trace['p_B'].mean()) / trace['p_B'].mean() * 100
a_prob = np.mean(trace["delta"] < 0)
# is the Bayes Factor just the ratio of the posterior probabilities for these two models?
BF = (trace['p_B'] / trace['p_A']).mean()
print(f'There is {b_prob} probability B outperforms A by a magnitude of {round(b_lift, 2)}%')
print(f'There is {a_prob} probability A outperforms B by a magnitude of {round(a_lift, 2)}%')
print('BF:', BF)
-- output:
There is 0.666 probability B outperforms A by a magnitude of 1.29%
There is 0.334 probability A outperforms B by a magnitude of -1.28%
BF: 1.013357654428127
I suspect that this is not the correct way to calculate Bayes Factors. How can the Bayes Factor be calculated?
I really hope you can help me understand all of the above... I realize it's an exceptionally long post. But I've tried every resource I can find and am still stuck!
Kind regards.
The Context
In Python 3.5, I'm making a function to generate a map with different biomes: a 2-dimensional list, where the outer list holds the rows (Y-axis) and each item within a row represents a position along the X-axis.
Example:
[
["A1", "B1", "C1"],
["A2", "B2", "C2"],
["A3", "B3", "C3"]
]
This displays as:
A1 B1 C1
A2 B2 C2
A3 B3 C3
The Goal
A given position on the map should be more likely to be a certain biome if its neighbours are also that biome. So, if a given square's neighbours are all Woods, that square is almost guaranteed to be a Woods.
My Code (so far)
All the biomes are represented by classes (woodsBiome, desertBiome, fieldBiome). They all inherit from baseBiome, which is used on its own to fill up a grid.
My code is in the form of a function. It takes the maximum X and Y coordinates as parameters. Here it is:
def generateMap(xMax, yMax):
    areaMap = []  # this will be the final result of a 2d list

    # first, fill the map with nothing to establish a blank grid
    xSampleData = []  # this will be cloned on the X axis for every Y-line
    for i in range(0, xMax):
        biomeInstance = baseBiome()
        xSampleData.append(biomeInstance)  # fill it with baseBiome for now, we will generate biomes later
    for i in range(0, yMax):
        areaMap.append(xSampleData)

    # now we generate biomes
    yCounter = yMax  # because of the way the larger program works. keeps track of the y-coordinate we're on
    for yi in areaMap:  # this increments for every Y-line
        xCounter = 0  # we use this to keep track of the x coordinate we're on
        for xi in yi:  # for every x position in the Y-line
            biomeList = [woodsBiome(), desertBiome(), fieldBiome()]
            biomeProbabilities = [0.0, 0.0, 0.0]
            # biggest bodge I have ever written
            if areaMap[yi-1][xi-1].isinstance(woodsBiome):
                biomeProbabilities[0] += 0.2
            if areaMap[yi+1][xi+1].isinstance(woodsBiome):
                biomeProbabilities[0] += 0.2
            if areaMap[yi-1][xi+1].isinstance(woodsBiome):
                biomeProbabilities[0] += 0.2
            if areaMap[yi+1][xi-1].isinstance(woodsBiome):
                biomeProbabilities[0] += 0.2
            if areaMap[yi-1][xi-1].isinstance(desertBiome):
                biomeProbabilities[1] += 0.2
            if areaMap[yi+1][xi+1].isinstance(desertBiome):
                biomeProbabilities[1] += 0.2
            if areaMap[yi-1][xi+1].isinstance(desertBiome):
                biomeProbabilities[1] += 0.2
            if areaMap[yi+1][xi-1].isinstance(desertBiome):
                biomeProbabilities[1] += 0.2
            if areaMap[yi-1][xi-1].isinstance(fieldBiome):
                biomeProbabilities[2] += 0.2
            if areaMap[yi+1][xi+1].isinstance(fieldBiome):
                biomeProbabilities[2] += 0.2
            if areaMap[yi-1][xi+1].isinstance(fieldBiome):
                biomeProbabilities[2] += 0.2
            if areaMap[yi+1][xi-1].isinstance(fieldBiome):
                biomeProbabilities[2] += 0.2
            choice = numpy.random.choice(biomeList, 4, p=biomeProbabilities)
            areaMap[yi][xi] = choice
    return areaMap
Explanation:
As you can see, I'm starting off with an empty list. I add baseBiome instances as placeholders (one row of xMax entries, appended yMax times) in order to generate a 2D grid that I can then cycle through.
I create a list biomeProbabilities with different indices representing different biomes. While cycling through the positions in the map, I check the neighbours of the chosen position and adjust a value in biomeProbabilities based on its biome.
Finally, I use numpy.random.choice() with biomeList and biomeProbabilities to make a choice from biomeList using the given probabilities for each item.
My Question
How can I make sure that the items in biomeProbabilities sum to 1 (which numpy.random.choice requires for a weighted choice)? There are two logical solutions I see:
a) Assign new probabilities so that the highest-ranking biome is given 0.8, then the second 0.4 and the third 0.2
b) Add or subtract equal amounts to each one until the sum == 1
Which option (if any) would be better, and how would I implement it?
Also, is there a better way to get the result without resorting to the endless if statements I've used here?
This sounds like a complex way to approach the problem. It will be difficult for you to make it work this way, because you are constraining yourself to a single forward pass.
One way you can do this is choose a random location to start a biome, and "expand" it to neighboring patches with some high probability (like 0.9).
(note that there is a code error in your example where you build the blank grid: areaMap.append(xSampleData) appends the same inner list for every row, so you have to copy it, e.g. areaMap.append(list(xSampleData)))
import random
import sys

W = 78
H = 40
BIOMES = [
    ('#', 0.5, 5),
    ('.', 0.5, 5),
]

area_map = []

# Make empty map
inner_list = []
for i in range(W):
    inner_list.append(' ')
for i in range(H):
    area_map.append(list(inner_list))

def neighbors(x, y):
    if x > 0:
        yield x - 1, y
    if y > 0:
        yield x, y - 1
    if y < H - 1:
        yield x, y + 1
    if x < W - 1:
        yield x + 1, y

for biome, proba, locations in BIOMES:
    for _ in range(locations):
        # Random starting location
        x = int(random.uniform(0, W))
        y = int(random.uniform(0, H))

        # Remember the locations to be handled next
        open_locations = [(x, y)]
        while open_locations:
            x, y = open_locations.pop(0)

            # Probability to stop
            if random.random() >= proba:
                continue

            # Propagate to neighbors, adding them to the list to be handled next
            for x, y in neighbors(x, y):
                if area_map[y][x] == biome:
                    continue
                area_map[y][x] = biome
                open_locations.append((x, y))

for y in range(H):
    for x in range(W):
        sys.stdout.write(area_map[y][x])
    sys.stdout.write('\n')
Of course a better method, the one usually used for these kinds of tasks (such as in Minecraft), is to use a Perlin noise function: if the noise value for a specific spot is above some threshold, use the other biome (see the sketch after the list below). The advantages are:
Lazy generation: you don't need to generate the whole area map in advance, you determine what type of biome is in an area when you actually need to know that area
Looks much more realistic
Perlin gives you real values as output, so you can use it for more things, like terrain height, or to blend multiple biomes (or you can use it for "wetness", have 0-20% be desert, 20-60% be grass, 60-80% be swamp, 80-100% be water)
You can overlay multiple "sizes" of noise to give you details in each biome for instance, by simply multiplying them
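To give a feel for the thresholding idea, here is a rough sketch that uses smoothed random values as a cheap stand-in for real Perlin noise (a proper Perlin implementation, or the noise package, would slot into the same place); the "wetness" bands are just the example percentages from the list above:
import random

def value_noise_grid(w, h, cell=8, seed=0):
    """Cheap stand-in for Perlin noise: random values on a coarse grid,
    bilinearly interpolated up to full map resolution (values in 0..1)."""
    rng = random.Random(seed)
    gw, gh = w // cell + 2, h // cell + 2
    coarse = [[rng.random() for _ in range(gw)] for _ in range(gh)]
    grid = []
    for y in range(h):
        gy, fy = divmod(y, cell)
        ty = fy / cell
        row = []
        for x in range(w):
            gx, fx = divmod(x, cell)
            tx = fx / cell
            # blend the four surrounding coarse values (bilinear interpolation)
            top = coarse[gy][gx] * (1 - tx) + coarse[gy][gx + 1] * tx
            bottom = coarse[gy + 1][gx] * (1 - tx) + coarse[gy + 1][gx + 1] * tx
            row.append(top * (1 - ty) + bottom * ty)
        grid.append(row)
    return grid

def biome_for(wetness):
    # example "wetness" bands from the list above
    if wetness < 0.2:
        return "desert"
    elif wetness < 0.6:
        return "grass"
    elif wetness < 0.8:
        return "swamp"
    return "water"

noise_grid = value_noise_grid(60, 20)
biome_map = [[biome_for(v) for v in row] for row in noise_grid]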
I'd propose normalising the weights (with biomeProbabilities as a NumPy array) so that they sum to 1:
biomeProbabilities = biomeProbabilities / biomeProbabilities.sum()
For your endless if statements I'd propose using a predefined list of directions, like:
directions = [(-1, -1), (0, -1), (1, -1),
(-1, 0), (1, 0),
(-1, 1), (0, 1), (1, 1)]
and use it to iterate over the neighbours, like:
for tile_x, tile_y in tiles:
    for dx, dy in directions:
        neighbor = area_map[tile_y + dy][tile_x + dx]  # remember to check the bounds first
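Putting the two suggestions together, a rough sketch (not a drop-in replacement -- it assumes the biome classes from your question, and starts every count at 1 so a neighbourhood of plain baseBiome still gives a uniform pick) could look like:
import numpy as np

directions = [(-1, -1), (0, -1), (1, -1),
              (-1,  0),          (1,  0),
              (-1,  1), (0,  1), (1,  1)]

biome_classes = [woodsBiome, desertBiome, fieldBiome]  # classes from the question

def pick_biome(area_map, x, y):
    """Weight each candidate biome by how many neighbours already have it,
    normalise so the weights sum to 1, then draw one biome instance."""
    counts = np.ones(len(biome_classes))  # start at 1 so untouched areas stay random
    for dx, dy in directions:
        nx, ny = x + dx, y + dy
        if 0 <= ny < len(area_map) and 0 <= nx < len(area_map[ny]):
            for i, cls in enumerate(biome_classes):
                if isinstance(area_map[ny][nx], cls):
                    counts[i] += 1
    probabilities = counts / counts.sum()  # sums to 1, as numpy.random.choice requires
    index = np.random.choice(len(biome_classes), p=probabilities)
    return biome_classes[index]()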
remram gave a nice answer about the algorithm you may or may not use to generate the terrain, so I won't go into that subject.
I have a range of dates and a measurement on each of those dates. I'd like to calculate an exponential moving average for each of the dates. Does anybody know how to do this?
I'm new to python. It doesn't appear that averages are built into the standard python library, which strikes me as a little odd. Maybe I'm not looking in the right place.
So, given the following code, how could I calculate the moving weighted average of IQ points for calendar dates?
from datetime import date
days = [date(2008,1,1), date(2008,1,2), date(2008,1,7)]
IQ = [110, 105, 90]
(there's probably a better way to structure the data, any advice would be appreciated)
EDIT:
It seems that the mov_average_expw() function from the scikits.timeseries.lib.moving_funcs submodule of SciKits (add-on toolkits that complement SciPy) better suits the wording of your question.
To calculate an exponential smoothing of your data with a smoothing factor alpha (it is (1 - alpha) in Wikipedia's terms):
>>> alpha = 0.5
>>> assert 0 < alpha <= 1.0
>>> av = sum(alpha**n.days * iq
...     for n, iq in map(lambda (day, iq), today=max(days): (today-day, iq),
...         sorted(zip(days, IQ), key=lambda p: p[0], reverse=True)))
>>> av
95.0
The above is not pretty, so let's refactor it a bit:
from collections import namedtuple
from operator import itemgetter

def smooth(iq_data, alpha=1, today=None):
    """Perform exponential smoothing with factor `alpha`.

    Time period is a day.
    Each time period the value of `iq` drops `alpha` times.
    The most recent data is the most valuable one.
    """
    assert 0 < alpha <= 1
    if alpha == 1:  # no smoothing
        return sum(map(itemgetter(1), iq_data))
    if today is None:
        today = max(map(itemgetter(0), iq_data))
    return sum(alpha**((today - date).days) * iq for date, iq in iq_data)

IQData = namedtuple("IQData", "date iq")

if __name__ == "__main__":
    from datetime import date
    days = [date(2008,1,1), date(2008,1,2), date(2008,1,7)]
    IQ = [110, 105, 90]
    iqdata = list(map(IQData, days, IQ))
    print("\n".join(map(str, iqdata)))
    print(smooth(iqdata, alpha=0.5))
Example:
$ python26 smooth.py
IQData(date=datetime.date(2008, 1, 1), iq=110)
IQData(date=datetime.date(2008, 1, 2), iq=105)
IQData(date=datetime.date(2008, 1, 7), iq=90)
95.0
I'm always calculating EMAs with Pandas:
Here is an example of how to do it:
import pandas as pd
import numpy as np

def ema(values, period):
    values = np.array(values)
    # note: pd.ewma was removed in newer pandas versions; the equivalent there is
    # pd.Series(values).ewm(span=period).mean().iloc[-1]
    return pd.ewma(values, span=period)[-1]

values = [9, 5, 10, 16, 5]
period = 5
print(ema(values, period))
More info about Pandas EWMA:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.ewma.html
I did a bit of googling and I found the following sample code (http://osdir.com/ml/python.matplotlib.general/2005-04/msg00044.html):
from numpy import array

def ema(s, n):
    """
    returns an n period exponential moving average for
    the time series s

    s is a list ordered from oldest (index 0) to most
    recent (index -1)
    n is an integer

    returns a numeric array of the exponential
    moving average
    """
    s = array(s)
    ema = []
    j = 1

    # get n sma first and calculate the next n period ema
    sma = sum(s[:n]) / n
    multiplier = 2 / float(1 + n)
    ema.append(sma)

    # EMA(current) = ( (Price(current) - EMA(prev) ) x Multiplier) + EMA(prev)
    ema.append(((s[n] - sma) * multiplier) + sma)

    # now calculate the rest of the values
    for i in s[n+1:]:
        tmp = ((i - ema[j]) * multiplier) + ema[j]
        j = j + 1
        ema.append(tmp)

    return ema
You can also use the SciPy filter method because the EMA is an IIR filter. This will have the benefit of being approximately 64 times faster as measured on my system using timeit on large data sets when compared to the enumerate() approach.
import numpy as np
from scipy.signal import lfilter
x = np.random.normal(size=1234)
alpha = .1 # smoothing coefficient
zi = [x[0]] # seed the filter state with first value
# filter can process blocks of continuous data if <zi> is maintained
y, zi = lfilter([1.-alpha], [1., -alpha], x, zi=zi)
I don't know Python, but for the averaging part, do you mean an exponentially decaying low-pass filter of the form
y_new = y_old + (input - y_old)*alpha
where alpha = dt/tau, dt = the timestep of the filter, tau = the time constant of the filter? (the variable-timestep form of this is as follows, just clip dt/tau to not be more than 1.0)
y_new = y_old + (input - y_old)*dt/tau
If you want to filter something like a date, make sure you convert to a floating-point quantity like # of seconds since Jan 1 1970.
My python is a little bit rusty (anyone can feel free to edit this code to make corrections, if I've messed up the syntax somehow), but here goes....
def movingAverageExponential(values, alpha, epsilon=0):
    if not 0 < alpha < 1:
        raise ValueError("out of range, alpha='%s'" % alpha)

    if not 0 <= epsilon < alpha:
        raise ValueError("out of range, epsilon='%s'" % epsilon)

    result = [None] * len(values)

    for i in range(len(result)):
        currentWeight = 1.0

        numerator = 0
        denominator = 0
        for value in values[i::-1]:
            numerator += value * currentWeight
            denominator += currentWeight

            currentWeight *= alpha
            if currentWeight < epsilon:
                break

        result[i] = numerator / denominator

    return result
For each index, this function works backward from that element toward the start of the list, accumulating weighted values until the weight coefficient for an element drops below the given epsilon, and then divides by the total accumulated weight.
Because the result list is preallocated to full size and filled in place, the values already come out in the correct order for the caller.
(SIDE NOTE: in Python you can preallocate a full-size list with [None] * len(values) and fill it in place, as the code above does; that avoids having to build the list backward and reverse it at the end. Appending to Python lists is also much cheaper than prepending. Please correct me if I'm wrong.)
The 'alpha' argument is the decay factor on each iteration. For example, if you used an alpha of 0.5, then today's moving average value would be composed of the following weighted values:
today: 1.0
yesterday: 0.5
2 days ago: 0.25
3 days ago: 0.125
...etc...
Of course, if you've got a huge array of values, the values from ten or fifteen days ago won't contribute very much to today's weighted average. The 'epsilon' argument lets you set a cutoff point, below which you will cease to care about old values (since their contribution to today's value will be insignificant).
You'd invoke the function something like this:
result = movingAverageExponential(values, 0.75, 0.0001)
The matplotlib.org examples (http://matplotlib.org/examples/pylab_examples/finance_work2.html) provide a good example of an Exponential Moving Average (EMA) function using NumPy:
import numpy as np

def moving_average(x, n, type):
    x = np.asarray(x)
    if type == 'simple':
        weights = np.ones(n)
    else:
        weights = np.exp(np.linspace(-1., 0., n))

    weights /= weights.sum()

    a = np.convolve(x, weights, mode='full')[:len(x)]
    a[:n] = a[n]
    return a
I found the above code snippet by #earino pretty useful - but I needed something that could continuously smooth a stream of values - so I refactored it to this:
def exponential_moving_average(period=1000):
    """ Exponential moving average. Smooths the values in v over the period.
    Send in values - at first it'll return a simple average, but as soon as
    it's gathered 'period' values, it'll start to use the Exponential Moving
    Average to smooth the values.
    period: int - how many values to smooth over (default=1000). """
    multiplier = 2 / float(1 + period)
    cum_temp = yield None  # We are being primed

    # Start by just returning the simple average until we have enough data.
    for i in xrange(1, period + 1):
        cum_temp += yield cum_temp / float(i)

    # Grab the simple average
    ema = cum_temp / period

    # and start calculating the exponentially smoothed average
    while True:
        ema = (((yield ema) - ema) * multiplier) + ema
and I use it like this:
def temp_monitor(pin):
    """ Read from the temperature monitor - and smooth the value out. The
    sensor is noisy, so we use exponential smoothing. """
    ema = exponential_moving_average()
    next(ema)  # Prime the generator

    while True:
        yield ema.send(val_to_temp(pin.read()))
(where pin.read() produces the next value I'd like to consume).
Maybe the shortest:
#Specify decay in terms of span
#data_series should be a DataFrame
ema=data_series.ewm(span=5, adjust=False).mean()
import pandas_ta as ta
data["EMA3"] = ta.ema(data["close"], length=3)
pandas_ta is a technical analysis library: https://github.com/twopirllc/pandas-ta. The code above calculates the Exponential Moving Average (EMA) for a series. You can specify the lag using 'length'; specifically, the code above calculates a 3-day EMA.
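For completeness, a minimal (hypothetical) setup for that snippet -- data can be any DataFrame with a 'close' column:
import pandas as pd
import pandas_ta as ta

# hypothetical price data; any DataFrame with a 'close' column works
data = pd.DataFrame({"close": [22.27, 22.19, 22.08, 22.17, 22.18, 22.13, 22.23]})
data["EMA3"] = ta.ema(data["close"], length=3)
print(data)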
Here is a simple sample I worked up based on http://stockcharts.com/school/doku.php?id=chart_school:technical_indicators:moving_averages
Note that unlike in their spreadsheet, I don't calculate the SMA, and I don't wait until after 10 samples to start generating the EMA. This means my values differ slightly, but if you chart it, it follows exactly after 10 samples. During the first 10 samples, the EMA I calculate is appropriately smoothed.
def emaWeight(numSamples):
    return 2 / float(numSamples + 1)

def ema(close, prevEma, numSamples):
    return ((close - prevEma) * emaWeight(numSamples)) + prevEma

samples = [
    22.27, 22.19, 22.08, 22.17, 22.18, 22.13, 22.23, 22.43, 22.24, 22.29,
    22.15, 22.39, 22.38, 22.61, 23.36, 24.05, 23.75, 23.83, 23.95, 23.63,
    23.82, 23.87, 23.65, 23.19, 23.10, 23.33, 22.68, 23.10, 22.40, 22.17,
]
emaCap = 10

e = samples[0]
for s in range(len(samples)):
    numSamples = emaCap if s > emaCap else s
    e = ema(samples[s], e, numSamples)
    print(e)
I'm a little late to the party here, but none of the solutions given were what I was looking for. Nice little challenge using recursion and the exact formula given on Investopedia.
No numpy or pandas required.
prices = [{'i': 1, 'close': 24.5}, {'i': 2, 'close': 24.6}, {'i': 3, 'close': 24.8}, {'i': 4, 'close': 24.9},
          {'i': 5, 'close': 25.6}, {'i': 6, 'close': 25.0}, {'i': 7, 'close': 24.7}]

def rec_calculate_ema(n):
    k = 2 / (n + 1)
    price = prices[n]['close']
    if n == 1:
        return price
    res = (price * k) + (rec_calculate_ema(n - 1) * (1 - k))
    return res
print(rec_calculate_ema(3))
A fast way (copy-pasted from here) is the following:
import numpy as np

def ExpMovingAverage(values, window):
    """ Numpy implementation of EMA
    """
    weights = np.exp(np.linspace(-1., 0., window))
    weights /= weights.sum()

    a = np.convolve(values, weights, mode='full')[:len(values)]
    a[:window] = a[window]
    return a
I am using a list and a rate of decay as inputs. I hope this little two-line function helps, considering that deep recursion is not stable in Python.
def expma(aseries, ratio):
    return sum([ratio * aseries[-x-1] * ((1 - ratio)**x) for x in range(len(aseries))])
More simply, using pandas:
# `data` is assumed to be an existing DataFrame with a 'close' column
def EMA(tw):
    for x in tw:
        data["EMA{}".format(x)] = data['close'].ewm(span=x, adjust=False).mean()

EMA([10, 50, 100])
Papahaba's answer was almost what I was looking for (thanks!) but I needed to match initial conditions. Using an IIR filter with scipy.signal.lfilter is certainly the most efficient. Here's my redux:
Given a NumPy vector, x
import numpy as np
from scipy import signal
period = 12
b = np.array((1,), 'd')
a = np.array((period, 1-period), 'd')
zi = signal.lfilter_zi(b, a)
y, zi = signal.lfilter(b, a, x, zi=zi*x[0:1])
Get the N-point EMA (here, 12) returned in the vector y
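If I'm reading the coefficients right, this is the alpha = 1/period, adjust=False form of the EMA, so (assuming pandas is available) the result should agree with:
import pandas as pd

# cross-check against pandas' recursive (adjust=False) EWM with alpha = 1/period
check = pd.Series(x).ewm(alpha=1.0 / period, adjust=False).mean().to_numpy()
assert np.allclose(y, check)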
I'm struggling to implement a single-layer perceptron: http://en.wikipedia.org/wiki/Perceptron. Depending on the initial weights, my program either gets stuck in the learning loop or finds the wrong weights. As a test case I use logical AND. Could you please give me a hint as to why my perceptron does not converge? This is for my own learning. Thanks.
# learning rate
rate = 0.1

# Test data
# logical AND
# vector = (bias, coordinate1, coordinate2, targetedresult)
testdata = [[1, 0, 0, 0], [1, 0, 1, 0], [1, 1, 0, 0], [1, 1, 1, 1]]

# initial weights
import random
w = [random.random(), random.random(), random.random()]
print 'initial weights = ', w

def test(w, vector):
    if diff(w, vector) <= 0.1:
        return True
    else:
        return False

def diff(w, vector):
    from copy import deepcopy
    we = deepcopy(w)
    return dirac(sum(we[i]*vector[i] for i in range(3))) - vector[3]

def improve(w, vector):
    for i in range(3):
        w[i] += rate*diff(w, vector)*vector[i]
    return w

def dirac(z):
    if z > 0:
        return 1
    else:
        return 0

error = True
while error == True:
    discrepancy = 0
    for x in testdata:
        if not test(w, x):
            w = improve(w, x)
            discrepancy += 1
    if discrepancy == 0:
        print 'improved weights = ', w
        error = False
It looks like you need an extra loop surrounding your for loop to iterate the improvement until your solutions converge (step 3 in the Wikipedia page you linked).
As it stands now, you give each training case exactly one chance to update the weights, so it has no chance to converge.
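Not a drop-in patch for your code, but for reference, roughly what that convergence loop looks like with the standard perceptron update (rate * (target - prediction) * input), reusing testdata, w, rate and dirac from your question:
# keep sweeping the training set until a full pass makes no mistakes
converged = False
while not converged:
    mistakes = 0
    for vector in testdata:
        prediction = dirac(sum(w[i] * vector[i] for i in range(3)))
        error = vector[3] - prediction          # target minus prediction
        if error != 0:
            for i in range(3):
                w[i] += rate * error * vector[i]
            mistakes += 1
    converged = (mistakes == 0)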
The only glitch I can see is in the activation function. Increase the cut-off value (z > 0.5).
Also, since there are only 4 input cases in each epoch, it is very difficult to work with 0 and 1 as the only outputs. Try removing the dirac function and increasing the threshold to 0.2. It might take longer to learn but will be much more precise. Of course, in the case of NAND you don't really need to be, but it helps with understanding.