Some ANN code in Erlang + recent thoughts

There was a point in time this summer where I found this Erlang tutorial on artificial neural networks – you can find it here. Apparently, it’s an aggregation of a series of blog posts written up by this guy a few years back. When I was following through doing it, I had a bit of an issue with continuity in the later parts of the tutorial and a few errors as well (derivative of sigmoid was wrong). But it was good overall and if anyone’s interested, I managed to put something together that executes here. Still, there’s a lot of unused/useless/confusing code that should be killed but try to ignore it.

Here are a few notes that I gathered in a flash (and which you can gather by following the links above):

  • Neurons are modeled by perceptrons
  • Perceptrons = summing junction + nonlinear element
  • Summing junction – dot the input vector with coefficients (state)
  • Nonlinear element – take that dot product, map it under the sigmoid function for your output
  • You hook up a bunch of these perceptrons into a DAG (basically a trellis without the one input node restriction) and call it a neural network
  • You train the neural network by putting in some input signals, compare the generated output with an expected output, and perform backpropagation to adjust the coefficients

With that high-level review, I’m thinking that the tutorial just got through hooking up a small neural network with 1 input layer, 1 hidden layer, and 1 output layer, and training it. I’m not sure the code that I have has it correct because, well, where’s the expected output? -_- And how to use it once it’s trained, I’m not exactly sure. But I thought I’d put the code up so that I could look at some of the code again now that a significant amount of time has passed and I’ve learned some other related things that might help this make more sense. It sort of does, from a high-level point of view, but I couldn’t explain the details of how the neural network is actually adjusting a hyperplane to classify an input vector; I didn’t work any proofs, I just wrote some Erlang.

Just a thought: If the neural network is just doing high-dimensional gradient descent, why can’t we just solve the problem and implement gradient descent? I guess a solution in the form  of a neural network is somehow more general? Or maybe it’s simpler: you’re really just feeding some training data to the network and then using it to predict new data. I don’t know. It’s an interesting approach though.

A thought about BFS queue size

There was a Topcoder SRM this morning and one of the problems was essentially to compute the shortest distance between two points on a 2-dimensional grid.  I usually never stick around for the challenge round because I usually dwell on two or three programs, not really understanding them, before time runs out. But I’m going to try reading code from now on because, well, it’s a weakness of mine, and also you get to see different approaches to problems you’d thought were pretty hard.

Back to the problem, another guy in my room took a different approach from mine and implemented breadth-first search. Basically, you expand outwards uniformly from the source node searching for a destination node. The advantage is that once you find the destination node, you have the shortest path and are done. Sadly, a mistake in the implementation ruined this good approach.

It took me a while to figure out that the guy had implemented his own queue for BFS, using a fixed-size array and pointers to the front and back.  The terminating condition in this case would be to check whether the front and back pointers were the same, ie. the queue is empty. It wasn’t until I could run the program in a debugger that I noticed the program was segfaulting due to the front pointer extending way beyond the upper bound of the array. However! I didn’t recognize this for what it was and decided to patch things up by turning the queue into a circular queue.

This fixed the problem for the grid-type graphs that the problem was giving out, but it left me wondering about a fixed queue size that could handle all cases. In the worst case, BFS will visit every node in a graph once, and in theory, it’s possible to queue up all of the nodes at the very first step. The program that I “fixed” with a circular queue was using a fixed size queue with a size that was an order of magnitude lower than the maximum number of nodes that could be dished out by the problem. I was definitely lucky that the number of nodes enqueued at any one time never exceeded the magic number being used to size the queue.

So it’s interesting to think about what that running number – the number of nodes enqueued at any given time during BFS – represents and whether we are able to give an upper bound for it. At any given moment, it gives the size of a “cross-section” of the graph. I guess without knowing about the structure of the graph, you can’t really assume much and so you have to bound the queue by the total nodes in the graph. The graphs in this problem were 4-regular, ie. every node has 4 adjacent nodes, and since the program was passing, this upper bound is definitely an overestimate.

I guess the moral of the story is to use the data structures provided by your local library, because thinking about upper bounds can be a deal breaker, and implementing your own data structure on the fly is error-prone and definitely cost this guy a hundred and some points.

combinatorial optimization

My dad who works as a technician described to me a problem with a very combinatorial flavor to it. Suppose you had a set of widgets with a left side and a right side, where each side has a measure of offset attached to it. When you arrange two widgets side by side, you get a score computed by the difference between offsets of the widgets. So for two widgets, you can get two possible scores depending on how you arrange them. When you add in N widgets to the mix, the number of possible combinations goes way up. The question is whether there is an efficient algorithm that can return a sequence of widgets such that the net score between all the adjacent widgets is constrained below a given threshold, or better yet, minimized.

My first impression of the problem is the naive approach and factorial run time. Actually, this brute force approach involves calculating through permutations, which isn’t very intuitive how to implement, I think. There is a C++ algorithm function called next_permutation, which generates the next lexicographic permutation from some given sequence, which *is* intuitive to use: just wire it up into a loop and keep track of the best one.

A problem that seems similarly put is the knapsack problem, which tries to maximize a primary value given a secondary constraint.

It’s an interesting problem and I will look more into it.