Python - XOR neural network backprop
I'm trying to implement a basic XOR NN with 1 hidden layer in Python. I'm not understanding the backprop algorithm specifically, so I've been stuck on getting delta2 and updating the weights... help?
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    vec_sigmoid = np.vectorize(sigmoid)

    theta1 = np.matrix(np.random.rand(3,3))
    theta2 = np.matrix(np.random.rand(3,1))

    def fit(x, y, theta1, theta2, learn_rate=.001):
        #forward pass
        layer1 = np.matrix(x, dtype='f')
        layer1 = np.c_[np.ones(1), layer1]
        layer2 = vec_sigmoid(layer1*theta1)
        layer3 = sigmoid(layer2*theta2)

        #backprop
        delta3 = y - layer3
        delta2 = (theta2*delta3) * np.multiply(layer2, 1 - layer2) #??

        #update weights
        theta2 += learn_rate * delta3 #??
        theta1 += learn_rate * delta2 #??

    def train(X, Y):
        for _ in range(10000):
            for i in range(4):
                x = X[i]
                y = Y[i]
                fit(x, y, theta1, theta2)

    X = [(0,0), (1,0), (0,1), (1,1)]
    Y = [0, 1, 1, 0]
    train(X, Y)
OK, so, first, here's the amended code to make yours work.
    #! /usr/bin/python

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    vec_sigmoid = np.vectorize(sigmoid)

    # binesh - just cleaning up, so you can change the number of hiddens.
    # Also, initializing with a heuristic from Yoshua Bengio.
    # In many places you were using matrix multiplication and
    # elementwise multiplication interchangeably... You can't do that..
    # (so I explicitly changed everything to be dot products and
    # multiplies so it's clear.)
    input_sz = 2
    hidden_sz = 3
    output_sz = 1
    theta1 = np.matrix(0.5 * np.sqrt(6.0 / (input_sz+hidden_sz)) * (np.random.rand(1+input_sz,hidden_sz)-0.5))
    theta2 = np.matrix(0.5 * np.sqrt(6.0 / (hidden_sz+output_sz)) * (np.random.rand(1+hidden_sz,output_sz)-0.5))

    def fit(x, y, theta1, theta2, learn_rate=.1):
        #forward pass
        layer1 = np.matrix(x, dtype='f')
        layer1 = np.c_[np.ones(1), layer1]
        # binesh - for layer2 we need to add a bias term.
        layer2 = np.c_[np.ones(1), vec_sigmoid(layer1.dot(theta1))]
        layer3 = sigmoid(layer2.dot(theta2))

        #backprop
        delta3 = y - layer3
        # binesh - in reality, this is the _negative_ derivative of the
        # cross entropy function wrt the _input_ to the final sigmoid function.

        delta2 = np.multiply(delta3.dot(theta2.T), np.multiply(layer2, (1-layer2)))
        # binesh - we don't use the delta for the bias term. (What would be
        # the point? It has no inputs.) Hence the line below.
        delta2 = delta2[:,1:]

        # But, the deltas are just derivatives wrt the inputs to the sigmoid.
        # We don't add those to theta directly. We have to multiply these
        # by the preceding layer to get the theta2d's and theta1d's.
        theta2d = np.dot(layer2.T, delta3)
        theta1d = np.dot(layer1.T, delta2)

        #update weights
        # binesh - here you had delta3 and delta2... Those are not the
        # derivatives wrt the thetas, they are the derivatives wrt
        # the inputs to the sigmoids.. (as I mention above)
        theta2 += learn_rate * theta2d #??
        theta1 += learn_rate * theta1d #??

    def train(X, Y):
        for _ in range(10000):
            for i in range(4):
                x = X[i]
                y = Y[i]
                fit(x, y, theta1, theta2)

    # binesh - here's a little test function to see that it actually works
    def test(X):
        for i in range(4):
            layer1 = np.matrix(X[i], dtype='f')
            layer1 = np.c_[np.ones(1), layer1]
            layer2 = np.c_[np.ones(1), vec_sigmoid(layer1.dot(theta1))]
            layer3 = sigmoid(layer2.dot(theta2))
            print("%d xor %d = %.7f" % (layer1[0,1], layer1[0,2], layer3[0,0]))

    X = [(0,0), (1,0), (0,1), (1,1)]
    Y = [0, 1, 1, 0]
    train(X, Y)

    # binesh - alright, let's see!
    test(X)

And now, the explanation. Forgive the crude drawing. It was easier to take a picture than to draw this in GIMP.
Visual of wbc's XOR neural network: http://cablemodem.hex21.com/~binesh/wbc-xor-nn-small.jpg
So. First, we have our error function. We'll call this ce (for cross entropy). I'll try to use your variables where possible, though I'm going to use l1, l2 and l3 instead of layer1, layer2 and layer3. Sigh (I don't know how to do LaTeX here. It seems to work on the Statistics Stack Exchange. Weird.)
    ce = -(y log(l3) + (1-y) log(1-l3))

We need to take the derivative of this wrt l3, so that we can see how to move l3 to reduce this value.
    dce/dl3 = -((y/l3) - (1-y)/(1-l3))
            = -((y(1-l3) - (1-y)l3) / (l3(1-l3)))
            = -(((y-y*l3) - (l3-y*l3)) / (l3(1-l3)))
            = -((y-y*l3 + y*l3 - l3) / (l3(1-l3)))
            = -((y-l3) / (l3(1-l3)))
            = ((l3-y) / (l3(1-l3)))

Great, but, actually, we can't just alter l3 as we see fit. l3 is a function of z3 (see my picture).
    l3      = sigmoid(z3)
    dl3/dz3 = l3(1-l3)

I'm not deriving this here (the derivative of the sigmoid), but it's actually not that hard to prove.
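If you'd rather check the sigmoid derivative than prove it, here is a minimal sketch that compares l3(1-l3) against a central finite difference. The helper names (`sigmoid_grad`, `numeric_grad`) and the step size `eps` are my own choices for illustration and are not part of the code above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Analytic derivative: l3 * (1 - l3), where l3 = sigmoid(z).
    s = sigmoid(z)
    return s * (1.0 - s)

def numeric_grad(f, z, eps=1e-6):
    # Central finite difference (hypothetical helper, eps is an assumption).
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

z_values = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
analytic = sigmoid_grad(z_values)
numeric = numeric_grad(sigmoid, z_values)

# Largest disagreement between the two; it should be tiny.
max_gap = np.max(np.abs(analytic - numeric))
print(max_gap)
```

The two agree up to floating-point noise, which is about as much reassurance as a finite difference can give.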
But, anyway, that's the derivative of l3 wrt z3, and what we want is the derivative of ce wrt z3.
    dce/dz3 = (dce/dl3) * (dl3/dz3)
            = ((l3-y)/(l3(1-l3))) * (l3(1-l3)) # Hey, look at that. The denominator gets cancelled out and
            = (l3-y) # This is why in my comments I was saying that what you are computing is the _negative_ derivative.

We call the derivatives wrt the z's "deltas". So, in your code, this corresponds to delta3.
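The cancellation can be checked numerically too. This is a small sketch, assuming the cross entropy defined earlier and a scalar z3; the function names and the test points in `cases` are made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(z3, y):
    # ce = -(y log(l3) + (1-y) log(1-l3)) with l3 = sigmoid(z3)
    l3 = sigmoid(z3)
    return -(y * np.log(l3) + (1.0 - y) * np.log(1.0 - l3))

def delta3_analytic(z3, y):
    # Result of the chain rule above: dce/dz3 = l3 - y
    return sigmoid(z3) - y

def delta3_numeric(z3, y, eps=1e-6):
    # Central finite difference of ce wrt z3 (eps is an assumption).
    return (cross_entropy(z3 + eps, y) - cross_entropy(z3 - eps, y)) / (2.0 * eps)

# Arbitrary (z3, y) pairs chosen for illustration.
cases = [(0.3, 1.0), (-1.2, 0.0), (2.0, 1.0), (0.0, 0.0)]
gaps = [abs(delta3_analytic(z, y) - delta3_numeric(z, y)) for z, y in cases]
print(max(gaps))
```

Note that this uses l3 - y, the positive derivative. The code in the answer uses y - layer3, which is the same quantity with the sign flipped.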
Great, but we can't just change z3 as we like either. We need to compute its derivative wrt l2.

But this is more complicated.
    z3 = theta2(0) + theta2(1) * l2(1) + theta2(2) * l2(2) + theta2(3) * l2(3)

So, we need to take partial derivatives wrt l2(1), l2(2) and l2(3).
    dz3/dl2(1) = theta2(1)
    dz3/dl2(2) = theta2(2)
    dz3/dl2(3) = theta2(3)

Notice that for the bias this would be

    dz3/dbias = theta2(0)

but the bias never changes, it's always 1, so we can safely ignore it. But, our layer2 includes the bias, so we'll keep it for now.
But, again, we want the derivative wrt z2(0), z2(1), z2(2). (It looks like I drew that badly, unfortunately. Look at the graph, it'll be clearer with it, I think.)
    dl2(1)/dz2(0) = l2(1) * (1-l2(1))
    dl2(2)/dz2(1) = l2(2) * (1-l2(2))
    dl2(3)/dz2(2) = l2(3) * (1-l2(3))

So what is dce/dz2(0..2)?
    dce/dz2(0) = dce/dz3 * dz3/dl2(1) * dl2(1)/dz2(0)
               = (l3-y) * theta2(1) * l2(1) * (1-l2(1))

    dce/dz2(1) = dce/dz3 * dz3/dl2(2) * dl2(2)/dz2(1)
               = (l3-y) * theta2(2) * l2(2) * (1-l2(2))

    dce/dz2(2) = dce/dz3 * dz3/dl2(3) * dl2(3)/dz2(2)
               = (l3-y) * theta2(3) * l2(3) * (1-l2(3))

But, we can express this as (delta3 * transpose[theta2]) elementwise multiplied with (l2 * (1-l2)) (where l2 is a vector).
These are our delta2 layer. We remove the first entry of it because, as I mention above, it corresponds to the delta of the bias (what I label l2(0) on my graph).
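To see that the vectorised expression really is the same as the three per-unit formulas, here is a rough sketch using plain NumPy arrays instead of np.matrix. The random weights, the seed and the variable names `delta2_full` and `per_unit` are assumptions made for the example; delta3 here is l3 - y, matching the derivation rather than the sign used in the code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)          # arbitrary seed, for repeatability
theta2 = rng.uniform(-0.5, 0.5, size=(4, 1))   # row 0 is the bias weight

# l2 as a (1, 4) row: a leading 1 for the bias, then three hidden activations.
l2 = np.concatenate(([1.0], sigmoid(rng.normal(size=3)))).reshape(1, 4)
y = 1.0
l3 = sigmoid(l2 @ theta2)               # shape (1, 1)
delta3 = l3 - y                         # dce/dz3

# Vectorised form: (delta3 * transpose[theta2]) elementwise times l2 * (1 - l2)
delta2_full = (delta3 @ theta2.T) * (l2 * (1.0 - l2))   # shape (1, 4)
delta2 = delta2_full[:, 1:]             # drop the bias entry

# The same thing written out one hidden unit at a time.
per_unit = np.array([[delta3[0, 0] * theta2[k, 0] * l2[0, k] * (1.0 - l2[0, k])
                      for k in (1, 2, 3)]])

print(np.allclose(delta2, per_unit))
```

A side effect worth noticing: because the bias activation is exactly 1, its l2 * (1 - l2) factor is 0, so the entry being dropped is zero anyway.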
So. Now, we have the derivatives wrt our z's, but, really, the only things we can modify are our thetas.
    z3 = theta2(0) + theta2(1) * l2(1) + theta2(2) * l2(2) + theta2(3) * l2(3)

    dz3/dtheta2(0) = 1
    dz3/dtheta2(1) = l2(1)
    dz3/dtheta2(2) = l2(2)
    dz3/dtheta2(3) = l2(3)

Once again though, what we want is dce/dtheta2(0) and so on, so that becomes
    dce/dtheta2(0) = dce/dz3 * dz3/dtheta2(0)
                   = (l3-y) * 1
    dce/dtheta2(1) = dce/dz3 * dz3/dtheta2(1)
                   = (l3-y) * l2(1)
    dce/dtheta2(2) = dce/dz3 * dz3/dtheta2(2)
                   = (l3-y) * l2(2)
    dce/dtheta2(3) = dce/dz3 * dz3/dtheta2(3)
                   = (l3-y) * l2(3)

Well, this is just np.dot(layer2.T, delta3), and that's what I have in theta2d.
And, similarly:

    z2(0) = theta1(0,0) + theta1(1,0) * l1(1) + theta1(2,0) * l1(2)
    dz2(0)/dtheta1(0,0) = 1
    dz2(0)/dtheta1(1,0) = l1(1)
    dz2(0)/dtheta1(2,0) = l1(2)

    z2(1) = theta1(0,1) + theta1(1,1) * l1(1) + theta1(2,1) * l1(2)
    dz2(1)/dtheta1(0,1) = 1
    dz2(1)/dtheta1(1,1) = l1(1)
    dz2(1)/dtheta1(2,1) = l1(2)

    z2(2) = theta1(0,2) + theta1(1,2) * l1(1) + theta1(2,2) * l1(2)
    dz2(2)/dtheta1(0,2) = 1
    dz2(2)/dtheta1(1,2) = l1(1)
    dz2(2)/dtheta1(2,2) = l1(2)

And, we'd have to multiply these by dce/dz2(0), dce/dz2(1) and dce/dz2(2) (one for each of the three groups up there). But, if you think about that, it then just becomes np.dot(layer1.T, delta2), and that's what I have in theta1d.
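One way to convince yourself that np.dot(layer1.T, delta2) and np.dot(layer2.T, delta3) really are the derivatives of ce wrt the thetas is a gradient check. Below is a sketch with plain NumPy arrays; `forward`, `analytic_grads` and `numeric_grad` are hypothetical helpers written for this example, and the weights, seed and sample are arbitrary. It uses delta3 = l3 - y, so these are the positive derivatives.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, theta1, theta2):
    l1 = np.concatenate(([1.0], x)).reshape(1, -1)                        # (1, 3)
    l2 = np.concatenate((np.ones((1, 1)), sigmoid(l1 @ theta1)), axis=1)  # (1, 4)
    l3 = sigmoid(l2 @ theta2)                                             # (1, 1)
    return l1, l2, l3

def cross_entropy(x, y, theta1, theta2):
    l3 = forward(x, theta1, theta2)[2][0, 0]
    return -(y * np.log(l3) + (1.0 - y) * np.log(1.0 - l3))

def analytic_grads(x, y, theta1, theta2):
    l1, l2, l3 = forward(x, theta1, theta2)
    delta3 = l3 - y
    delta2 = ((delta3 @ theta2.T) * (l2 * (1.0 - l2)))[:, 1:]
    return l1.T @ delta2, l2.T @ delta3      # theta1d, theta2d

def numeric_grad(x, y, theta1, theta2, which, eps=1e-6):
    # Nudge one weight at a time and watch how ce moves.
    target = theta1 if which == 1 else theta2
    grad = np.zeros_like(target)
    for idx in np.ndindex(*target.shape):
        original = target[idx]
        target[idx] = original + eps
        plus = cross_entropy(x, y, theta1, theta2)
        target[idx] = original - eps
        minus = cross_entropy(x, y, theta1, theta2)
        target[idx] = original               # restore the weight
        grad[idx] = (plus - minus) / (2.0 * eps)
    return grad

rng = np.random.default_rng(1)               # arbitrary seed
theta1 = rng.uniform(-0.5, 0.5, size=(3, 3))
theta2 = rng.uniform(-0.5, 0.5, size=(4, 1))
x, y = np.array([1.0, 0.0]), 1.0

theta1d, theta2d = analytic_grads(x, y, theta1, theta2)
gap1 = np.max(np.abs(theta1d - numeric_grad(x, y, theta1, theta2, 1)))
gap2 = np.max(np.abs(theta2d - numeric_grad(x, y, theta1, theta2, 2)))
print(gap1, gap2)
```

If either gap is not tiny, something in the backprop is off. This kind of check is slow, so it is only worth running while debugging, not during training.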
Now, because you did y-l3 in your code, you're adding to theta1 and theta2... But, here's the reasoning. What we just computed above is the derivative of ce wrt the weights. So, that means that moving the weights in the direction of the derivative will increase ce. But, we really want to decrease ce.. So, we subtract (normally). But, because in your code you're computing the negative derivative, it is right that you add.
Does that make sense?
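If the sign business is still fuzzy, this sketch puts the two conventions side by side: subtracting the derivative computed with l3 - y, and adding the one computed with y - l3. It is only an illustration on a single made-up sample with arbitrary weights; `weight_grads` and its `delta3_sign` argument are hypothetical and not part of the code above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, theta1, theta2):
    l1 = np.concatenate(([1.0], x)).reshape(1, -1)
    l2 = np.concatenate((np.ones((1, 1)), sigmoid(l1 @ theta1)), axis=1)
    l3 = sigmoid(l2 @ theta2)
    return l1, l2, l3

def cross_entropy(x, y, theta1, theta2):
    l3 = forward(x, theta1, theta2)[2][0, 0]
    return -(y * np.log(l3) + (1.0 - y) * np.log(1.0 - l3))

def weight_grads(x, y, theta1, theta2, delta3_sign):
    # delta3_sign=+1 gives the derivative (l3 - y); -1 gives the negative (y - l3).
    l1, l2, l3 = forward(x, theta1, theta2)
    delta3 = delta3_sign * (l3 - y)
    delta2 = ((delta3 @ theta2.T) * (l2 * (1.0 - l2)))[:, 1:]
    return l1.T @ delta2, l2.T @ delta3

rng = np.random.default_rng(2)               # arbitrary seed
theta1 = rng.uniform(-0.5, 0.5, size=(3, 3))
theta2 = rng.uniform(-0.5, 0.5, size=(4, 1))
x, y, learn_rate = np.array([0.0, 1.0]), 1.0, 0.1

# Convention A: positive derivative, so subtract.
g1, g2 = weight_grads(x, y, theta1, theta2, delta3_sign=+1.0)
theta1_a, theta2_a = theta1 - learn_rate * g1, theta2 - learn_rate * g2

# Convention B (the one in the code above): negative derivative, so add.
n1, n2 = weight_grads(x, y, theta1, theta2, delta3_sign=-1.0)
theta1_b, theta2_b = theta1 + learn_rate * n1, theta2 + learn_rate * n2

before = cross_entropy(x, y, theta1, theta2)
after = cross_entropy(x, y, theta1_a, theta2_a)
print(np.allclose(theta1_a, theta1_b), np.allclose(theta2_a, theta2_b))
print(before, after)
```

Both conventions land on the same weights, and ce on that sample goes down after the step. Which one you pick is a matter of taste, as long as the sign of the update matches the sign of the delta.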