I've toyed around with the idea of using genetic algorithms for curve fitting, and I've finally gotten round to trying it out. Here's how it went:

## A short intro

### What are genetic algorithms?

Genetic Algorithms, or GA's, solve problems using a method inspired by biological evolution -- selection of the fittest, crossing-over of alleles, and mutation of alleles, repeated over many generations on a large population. GA's have been used to design antennae, predict the market, and design networks.
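Concretely, the evolutionary loop can be sketched in a few lines. This is a toy illustration under my own simplifications -- truncation selection, fixed-size generations -- not how any particular library implements it:

```python
import random

def tiny_ga(fitness, random_individual, crossover, mutate,
            pop_size=50, generations=100):
    # Bare-bones generational GA: rank the population by fitness,
    # keep the top half as parents, and breed a full new generation.
    population = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(children) < pop_size:
            mother, father = random.sample(parents, 2)
            children.append(mutate(crossover(mother, father)))
        population = children
    return max(population, key=fitness)
```

Everything else -- selectors, scaling, crossover operators -- is variation on this basic cycle.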

#### Learn more:

- Introduction to Genetic Algorithms – a quick start on the basics
- A Genetic Algorithm Tutorial by Darrell Whitley (Computer Science Department, Colorado State University) – heavy on theory
- Essentials of Metaheuristics – a free downloadable book that covers genetic algorithms, simulated annealing, and other metaheuristic algorithms.

### What is Pyevolve?

Pyevolve is an open source genetic algorithms framework in Python. I've chosen it since it seems to be the most actively developed GA framework in Python. Since it's in Python, one can make use of the many libraries available for math and science, whilst enjoying the ease of a high-level language.

## Starting off

### Downloading and installing Pyevolve

Pyevolve can be downloaded here. Installing it is as simple as running the executable (on Windows), or using easy_install (on Linux).

### Defining our fitness function

Our objective is to get a curve that matches the sample data as closely as possible. Thus, our fitness function will take the absolute values of the differences between the predicted y's and the sample y's, sum them all up, and take the negative of the sum. In other words, `fitness = -Σ |p(xᵢ) - yᵢ|`, where `p` is the polynomial encoded by the chromosome.

In Python:

```python
def fitness_function(chromosome):
    score = 0
    for point in sample_points:
        # eval_polynomial computes the value of the polynomial
        # ``chromosome`` at x = point[0]
        delta = abs(eval_polynomial(point[0], *chromosome) - point[1])
        score += delta
    score = -score
    return score
```
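The snippet relies on an `eval_polynomial` helper (defined in the full listing further down) that evaluates a polynomial whose coefficients are given in ascending order of power:

```python
def eval_polynomial(x, *coefficients):
    # Coefficients are in ascending order of power, so (16, -7, 12, 6)
    # represents 16 - 7*x + 12*x**2 + 6*x**3.
    result = 0
    for exponent, coeff in enumerate(coefficients):
        result += coeff * x ** exponent
    return result
```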

Our genetic algorithm will rank the chromosomes (polynomials) based on these scores. The higher the score, the higher the chance of being selected for crossover.

To keep our fitness functions flexible, we use a factory function that builds a fitness function for a given set of sample points, like so:

```python
def generate_fitness_function(sample_points):
    def fitness_function(chromosome):
        score = 0
        for point in sample_points:
            delta = abs(eval_polynomial(point[0], *chromosome) - point[1])
            score += delta
        score = -score
        return score
    return fitness_function
```

### Creating the population

For simplicity's sake, we'll use a list of integers to represent each chromosome. This is represented by `G1DList.G1DList`. `G1DList` has convenient default settings, but for this experiment we'll change the range of integers each chromosome can contain, as well as the list's length. Finally, we set its fitness function (or evaluator function).

```python
genome = G1DList.G1DList(5)
genome.setParams(rangemin=-50, rangemax=50)
genome.evaluator.set(generate_fitness_function(sample_points))
```

### Setting up the engine

The GA engine is represented by the `GSimpleGA.GSimpleGA` class. We create an instance of it and, optionally, set its population size, number of generations, selectors, and other parameters. We'll stick with changing the population size and the selector.

```python
ga = GSimpleGA.GSimpleGA(genome)
ga.setPopulationSize(1000)
ga.selector.set(Selectors.GRouletteWheel)
```

Increasing the population from the default 80 increases our algorithm's accuracy at a cost in speed. We'll see just how much in the next article.

Changing the selector is a crucial step in this particular algorithm. The default selector is `Selectors.GRankSelector`, which selects chromosomes based on their ranking – it uses the scores only to rank the chromosomes. This discards important information – the deltas from the sample points. Without making full use of these deltas, our algorithm will be highly inaccurate, as we'll see in a future article.

To fix this problem, we change our selector to `Selectors.GRouletteWheel`, which picks the chromosomes for the next generation with probabilities corresponding to their fitness relative to the population's mean fitness. This e-book has more information on the particular mechanism and the theory behind it.
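As a rough illustration of the mechanism (a toy sketch, not Pyevolve's actual implementation), fitness-proportionate selection amounts to spinning a wheel whose slots are sized by fitness -- note that it assumes non-negative scores, which is why the scaling method matters:

```python
import random

def roulette_wheel_select(population, fitnesses):
    # Pick one individual with probability proportional to its fitness.
    # Assumes all fitnesses are non-negative (hence the need for scaling).
    total = sum(fitnesses)
    pick = random.uniform(0, total)
    running = 0.0
    for individual, fitness in zip(population, fitnesses):
        running += fitness
        if running >= pick:
            return individual
    return population[-1]  # guard against floating-point round-off
```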

Finally, before we run the algorithm, we'll change the scaling method. The default, `Scaling.LinearScaling`, only works for positive raw scores. Our scoring function produces negative scores – and must, unless we use reciprocals instead, which would require arbitrary precision to work properly – so we have to change the scaling method. The only scaling method currently available that works with negative scores is `Scaling.SigmaTruncScaling`, so we'll use that.

```python
pop = ga.getPopulation()
pop.scaleMethod.set(Scaling.SigmaTruncScaling)
```
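For intuition, sigma truncation shifts every raw score by the population mean minus a multiple of the standard deviation, clamping at zero, which is what lets it cope with negative raw scores. A sketch of the idea (my own paraphrase of the textbook formula, not Pyevolve's exact code):

```python
import statistics

def sigma_trunc_scale(raw_scores, c=2.0):
    # f' = max(0, f - (mean - c * stdev)); scores more than c standard
    # deviations below the mean end up with zero fitness.
    mean = statistics.mean(raw_scores)
    stdev = statistics.pstdev(raw_scores)
    floor = mean - c * stdev
    return [max(0.0, f - floor) for f in raw_scores]
```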

## Running the algorithm

Finally, we run the algorithm with `GSimpleGA`'s `evolve` method and print out the best individual from the evolution. We'll also print the original polynomial and the sample points to see how well our algorithm did.

```python
ga.evolve(freq_stats=10)  # freq_stats is the number of generations between statistics messages
print(ga.bestIndividual())
print(source_polynomial)
print(sample_points)
```

The full code of our algorithm is:

```python
from pyevolve import G1DList, GSimpleGA, Selectors, Scaling
from random import seed, randint


def eval_polynomial(x, *coefficients):
    result = 0
    for exponent, coeff in enumerate(coefficients):
        result += coeff * x ** exponent
    return result


def generate_fitness_function(sample_points):
    def fitness_function(chromosome):
        score = 0
        for point in sample_points:
            delta = abs(eval_polynomial(point[0], *chromosome) - point[1])
            score += delta
        score = -score
        return score
    return fitness_function


if __name__ == "__main__":
    # Generate a random polynomial, and generate sample points from it
    seed()
    source_polynomial = []
    for i in range(randint(1, 5)):
        source_polynomial.append(randint(-20, 20))

    sample_points = []
    for i in range(20):
        n = randint(-100, 100)
        sample_points.append((n, eval_polynomial(n, *source_polynomial)))

    # Create the population
    genome = G1DList.G1DList(5)
    genome.evaluator.set(generate_fitness_function(sample_points))
    genome.setParams(rangemin=-50, rangemax=50)

    # Set up the engine
    ga = GSimpleGA.GSimpleGA(genome)
    ga.setPopulationSize(1000)
    ga.selector.set(Selectors.GRouletteWheel)

    # Change the scaling method
    pop = ga.getPopulation()
    pop.scaleMethod.set(Scaling.SigmaTruncScaling)

    # Start the algorithm, and print the results.
    ga.evolve(freq_stats=10)
    print(ga.bestIndividual())
    print("Source polynomial: " + repr(source_polynomial))
    print("Sample points: " + repr(sample_points))
```

## Results

Upon running, you should get something like this:

```
Gen. 1 (1.00%): Max/Min/Avg Fitness(Raw) [15727000722.62(-4292810.00)/0.00(-18226440830.00)/9094256863.27(-6681328575.30)]
Gen. 10 (10.00%): Max/Min/Avg Fitness(Raw) [4637786255.82(-4204824.00)/0.00(-17834916446.00)/3703815517.04(-1089486993.87)]
Gen. 20 (20.00%): Max/Min/Avg Fitness(Raw) [4631124226.18(-4204824.00)/0.00(-17834764598.00)/4133406394.92(-687453861.04)]
Gen. 30 (30.00%): Max/Min/Avg Fitness(Raw) [2783863862.93(-4204824.00)/0.00(-14596526122.00)/2444879772.09(-441610027.03)]
Gen. 40 (40.00%): Max/Min/Avg Fitness(Raw) [3285644641.87(-4197260.00)/0.00(-16392222838.00)/2983713510.34(-436706999.06)]
Gen. 50 (50.00%): Max/Min/Avg Fitness(Raw) [3144311576.46(-62648.00)/0.00(-16006551450.00)/2935495141.08(-336836855.08)]
Gen. 60 (60.00%): Max/Min/Avg Fitness(Raw) [2965577707.55(-58706.00)/0.00(-14566565128.00)/2791088446.86(-294883289.58)]
Gen. 70 (70.00%): Max/Min/Avg Fitness(Raw) [2824632389.09(-57634.00)/0.00(-14567123878.00)/2675797266.42(-256698663.52)]
Gen. 80 (80.00%): Max/Min/Avg Fitness(Raw) [2960822190.42(-21138.00)/0.00(-17483256712.00)/2798736824.07(-273902724.22)]
Gen. 90 (90.00%): Max/Min/Avg Fitness(Raw) [2803805896.09(-2874.00)/0.00(-17482829524.00)/2653855523.20(-264112618.46)]
Gen. 100 (100.00%): Max/Min/Avg Fitness(Raw) [3539918853.96(-2868.00)/0.00(-18213161666.00)/3390487526.67(-290427123.81)]
Total time elapsed: 18.642 seconds.

- GenomeBase
    Score:   -2868.000000
    Fitness: 3539918853.958958

    Slot [Evaluator] (Count: 1)
        Name: fitness_function
    Slot [Initializator] (Count: 1)
        Name: G1DListInitializatorInteger
        Doc: Integer initialization function of G1DList
             This initializator accepts the *rangemin* and *rangemax* genome parameters.
    Slot [Mutator] (Count: 1)
        Name: G1DListMutatorSwap
        Doc: The mutator of G1DList, Swap Mutator
    Slot [Crossover] (Count: 1)
        Name: G1DListCrossoverSinglePoint
        Doc: The crossover of G1DList, Single Point
             .. warning:: You can't use this crossover method for lists with just one element.

- G1DList
    List size: 5
    List: [1, -10, 12, 6, 0]

Source polynomial: [16, -7, 12, 6]
Sample points: [(1, 27), (-88, -3995272), (-47, -596085), (49, 734379), (35, 271721), (54, 979414), (-44, -487548), (-93, -4721687), (-45, -522119), (-72, -2176760), (99, 5938729), (78, 2919790), (-62, -1383390), (0, 16), (-25, -86059), (-36, -264116), (-52, -810820), (-16, -21376), (6, 1702), (64, 1621584)]
```

As you can see, the GA did well, getting the 3rd and 4th coefficients spot on and coming close on the others. You can compare the two polynomial functions in the following graph, with the blue points and line representing the original polynomial, and the red line the genetic algorithm's.

The red line matches the blue line almost perfectly – the blue line is barely visible behind it. The GA fit the curve well – although less efficiently than a specialized algorithm (try polynomial regression at http://www.arachnoid.com/sage/polynomial.html). Try it yourself and see how it performs for you.
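For comparison with a deterministic approach: interpolating a degree-(n−1) polynomial through n points just means solving the Vandermonde linear system. A small pure-Python sketch (illustrative only, not production code) recovers this run's source polynomial [16, -7, 12, 6] exactly from four of its points:

```python
from fractions import Fraction

def interpolate_polynomial(points):
    # Build the augmented Vandermonde system and solve it exactly with
    # Gauss-Jordan elimination; coefficients come back in ascending
    # order of power, matching eval_polynomial's convention.
    n = len(points)
    rows = [[Fraction(x) ** j for j in range(n)] + [Fraction(y)]
            for x, y in points]
    for i in range(n):
        pivot = next(r for r in range(i, n) if rows[r][i] != 0)
        rows[i], rows[pivot] = rows[pivot], rows[i]
        for r in range(n):
            if r != i and rows[r][i] != 0:
                factor = rows[r][i] / rows[i][i]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[i])]
    return [rows[i][n] / rows[i][i] for i in range(n)]
```

Unlike the GA, this is exact and instantaneous for noise-free points, which is why it's the better tool when a formal solution exists.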

## Comments

Did you try Differential Evolution?

Google: Differential Evolution made Easy

RKU_2005_techrep-01_DE.pdf

So how does this compare to adaptive hillclimbing? Or is that basically the same algorithm with a population of 1?

Perhaps I misunderstand. But, I've done very similar things (I think) with nothing more than a single matrix inversion using GNU Octave. Just a thought...

Nice work. Those optimization algorithms are all related; I think this could be seen as a hillclimbing algorithm when only using a population size of one. Still, evolution may not be the most optimal algorithm (in terms of efficiency) for this kind of search problem, I think. Hillclimbing has the problem of getting stuck in local optima; evolution might be better there, but I doubt it would be the optimal choice. With some luck, there might even be an analytical solution ;)

ReplyDelete@Anonymous, I think you do misunderstand -- he's not claiming this is the only way or the most efficient way. It's an interesting experiment, and the fact that he's exploring this for himself instead of using a tool that someone else has already invented makes it inherently worthwhile.

Indeed, most people using GA / GP just don't have the background to realize there are formal solutions to most problems. This is the case.

ReplyDelete@Alex: I understand that there are much more efficient methods for curve fitting -- interpolation, heuristics, etc. I merely tried this out for curiosity's sake. It seems to work reasonably well -- but, as I stated in the article, is rather inefficient.

However, GA's can be and are used for problems that do not have known formal solutions -- see the citations in the article. They are also highly and easily parallelized.

@Bart: Hill climbing is similar to GA's, and is in fact done, to a slight extent, in many GA's. However, I do believe the main principle of GA's is exploring the search space in a highly parallelized manner -- see the GA Tutorial by CSU in my post for more details.

A very nice post, well done!

A little advice on formatting:

1) Change the color of the font used to show code. It's difficult to read.

2) Make code boxes with horizontal scrollers instead of wrapping - the wrapping greatly reduces the readability and general niceness of structured Python code.

@Eli: Thanks. I fixed up SyntaxHighlighter -- should look much better now.

How are you getting those "green u-turn symbols" to start each vertical line from a word-wrapped statement? Is that its default behavior? Do you know of any text editors that line-wrap a continued statement indented underneath the first? (like I'm guessing SyntaxHighlighter does)

ReplyDelete@Anonymous:

ReplyDeleteYeah, it's SyntaxHighlighter doing its job. I don't know of any editors with default behavior like that. Emacs shows a line-wrap symbol at the end of the line, and wraps to first column, but I suppose there's an Elisp script for it somewhere -- of course, you can always make one for yourself.

Anyways, it would probably be more accessible to just keep code under 80 columns -- much more readable, imho.


Hey Tim,

In case you're interested in more reference material, there's a book, "Essentials of Metaheuristics", that discusses GAs and related algorithms. The author has made it available for free online:

http://cs.gmu.edu/~sean/book/metaheuristics/

@Mart: Thanks for the tip. I've added it to the list.