Tuesday, July 28, 2009

Curve fitting with Pyevolve

I've toyed around with the idea of using genetic algorithms for curve fitting, and I've finally gotten round to trying it out. Here's how it went:

A short intro

What are genetic algorithms?

Genetic Algorithms, or GA's, solve problems using a method inspired by biological evolution -- selection of the fittest, crossing-over of alleles, and mutation of alleles, repeated over many generations on a large population. GA's have been used to design antennae, predict the market, and design networks.

Learn more:


Introduction to Genetic Algorithms – a quick start on the basics
A Genetic Algorithm Tutorial by Darrell Whitley Computer Science Department Colorado State University – heavy on theory
Essentials of Metaheuristics – a free downloadable book that features genetic algorithms, simulated annealing, and other metaheuristics algorithms.

What is Pyevolve?

Pyevolve is an open source genetic algorithms library framework in Python. I've chosen it since it seems to be the most actively developed framework in Python. Since it's on Python, one can make use of the many libraries available for math and science, whilst enjoying the ease of a high-level language.


Starting off

Downloading and installing Pyevolve

Pyevolve can be downloaded here. Installing it is as simple as running the executable (on Windows), or using easy_install (on Linux).

Defining our fitness function

Our objective is to get a curve that matches the sample data as close as possible. Thus, our fitness function will take the absolute values of the differences between the predicted y's and sample y's, sum them all up, and take their negative. In other words:



In Python:

def fitness_function(chromosome):
score = 0
for point in sample_points:
# eval_polynomial computes the value of the polynomial ``chromosome`` at ``x``
delta = abs(eval_polynomial(point[0], *chromosome) - point[1])
score += delta
score = -score
return score

Our genetic algorithm will rank the chromosomes (polynomials) based on these scores. The higher the score, the higher chance of getting selected and of crossing.


To make our fitness functions flexible, we opt to make a factory function to make fitness functions based on different sets of sample points, like so:

def generate_fitness_function(sample_points):
def fitness_function(chromosome):
score = 0
for point in sample_points:
delta = abs(eval_polynomial(point[0], *chromosome) - point[1])
score += delta
score = -score
return score
return fitness_function

Creating the population

For simplicity's sake, we'll use a list of integers to represent each chromosome. This is represented by G1DList.G1DList. G1DList has convenient default settings, but for this experiment we'll change the range of integers each chromosome can contain, and their length. Finally, we set its fitness function (or evaluator function).


genome = G1DList.G1DList(5)
genome.setParams(rangemin=-50, rangemax=50)
genome.evaluator.set(generate_fitness_function(sample_points))


Setting up the engine

The GA engine is represented by the GSimpleGA.GSimpleGA class. We create an instance of it, and optionally, set its population size, generations, selectors, and other factors. We'll stick with changing the population size and the selector.

ga = GSimpleGA.GSimpleGA(genome)
ga.setPopulationSize(1000)
ga.selector.set(Selectors.GRouletteWheel)

Increasing the population from the default 80 increases our algorithm's accuracy at a cost in speed. We'll see just how much in the next article.

Changing the selector is a crucial step in this particular algorithm. The default selector is Selectors.GRankSelector, which selects chromosomes based on their ranking – it uses the scores only to rank the chromosomes. This discards important infromation – the deltas from the sample points. Without making full use of these deltas, our algorithm will be highly inaccurate, as we'll see in a future article.

To fix this problem, we change our selector to Selectors.GRouletteWheel, which selects all the chromosomes to be picked simultaneously, with probabilities corresponding to their fitness relative to the population's mean fitness.This e-book has more information on the particular mechanism and the theory behind it.

Finally, before we run the algorithm, we'll change the scaling method used. The default scaling method is Scaling.LinearScaling, which only works for postive raw scores. Since our scoring function uses negative scores, and requires the use of negative scores unless we decide to use reciprocals – which will require arbitary precision to work properly – we must change our scaling method. The only scaling method currently available that works for negative scores is Scaling.SigmaTruncScaling, so we'll use that.

pop = ga.getPopulation()
pop.scaleMethod.set(Scaling.SigmaTruncScaling)


Running the algorithm

Finally we run it with GSimpleGA's evolve function, and print out the best individual from the evolution. We'll also print out the original polynomial and the sample points to see how well our algorithm did.

ga.evolve(freq_stats=10) # freq_stats is the number of generations between each statistics message.
print(ga.bestIndividual())
print(source_polynomial)
print(sample_points)

The full code of our algorithm is:


from pyevolve import G1DList, GSimpleGA, Selectors, Scaling, DBAdapters
from random import seed, randint, random

def eval_polynomial(x, *coefficients):
result = 0
for exponent, coeff in enumerate(coefficients):
result += coeff*x**exponent
return result

def generate_fitness_function(sample_points):
def fitness_function(chromosome):
score = 0
for point in sample_points:
delta = abs(eval_polynomial(point[0], *chromosome) - point[1])
score += delta
score = -score
return score
return fitness_function

if __name__ == "__main__":
# Generate a random polynomial, and generate sample points from it
seed()

source_polynomial = []
for i in xrange(randint(1, 5)):
source_polynomial.append(randint(-20,20))

sample_points = []
for i in xrange(20):
n = randint(-100, 100)
sample_points.append((n, eval_polynomial(n, *source_polynomial)))

# Create the population
genome = G1DList.G1DList(5)
genome.evaluator.set(generate_fitness_function(sample_points))
genome.setParams(rangemin=-50, rangemax=50)

# Set up the engine
ga = GSimpleGA.GSimpleGA(genome)
ga.setPopulationSize(1000)
ga.selector.set(Selectors.GRouletteWheel)

# Change the scaling method
pop = ga.getPopulation()
pop.scaleMethod.set(Scaling.SigmaTruncScaling)

# Start the algorithm, and print the results.
ga.evolve(freq_stats=10)
print(ga.bestIndividual())
print("Source polynomial: " + repr(source_polynomial))
print("Sample points: " + repr(sample_points))

Results

Upon running, you should get something like this:


Gen. 1 (1.00%): Max/Min/Avg Fitness(Raw) [15727000722.62(-4292810.00)/0.00(-18226440830.00)/9094256863.27(-6681328575.30)]
Gen. 10 (10.00%): Max/Min/Avg Fitness(Raw) [4637786255.82(-4204824.00)/0.00(-17834916446.00)/3703815517.04(-1089486993.87)]
Gen. 20 (20.00%): Max/Min/Avg Fitness(Raw) [4631124226.18(-4204824.00)/0.00(-17834764598.00)/4133406394.92(-687453861.04)]
Gen. 30 (30.00%): Max/Min/Avg Fitness(Raw) [2783863862.93(-4204824.00)/0.00(-14596526122.00)/2444879772.09(-441610027.03)]
Gen. 40 (40.00%): Max/Min/Avg Fitness(Raw) [3285644641.87(-4197260.00)/0.00(-16392222838.00)/2983713510.34(-436706999.06)]
Gen. 50 (50.00%): Max/Min/Avg Fitness(Raw) [3144311576.46(-62648.00)/0.00(-16006551450.00)/2935495141.08(-336836855.08)]
Gen. 60 (60.00%): Max/Min/Avg Fitness(Raw) [2965577707.55(-58706.00)/0.00(-14566565128.00)/2791088446.86(-294883289.58)]
Gen. 70 (70.00%): Max/Min/Avg Fitness(Raw) [2824632389.09(-57634.00)/0.00(-14567123878.00)/2675797266.42(-256698663.52)]
Gen. 80 (80.00%): Max/Min/Avg Fitness(Raw) [2960822190.42(-21138.00)/0.00(-17483256712.00)/2798736824.07(-273902724.22)]
Gen. 90 (90.00%): Max/Min/Avg Fitness(Raw) [2803805896.09(-2874.00)/0.00(-17482829524.00)/2653855523.20(-264112618.46)]
Gen. 100 (100.00%): Max/Min/Avg Fitness(Raw) [3539918853.96(-2868.00)/0.00(-18213161666.00)/3390487526.67(-290427123.81)]
Total time elapsed: 18.642 seconds.
- GenomeBase
Score: -2868.000000
Fitness: 3539918853.958958

Slot [Evaluator] (Count: 1)
Name: fitness_function
Slot [Initializator] (Count: 1)
Name: G1DListInitializatorInteger
Doc: Integer initialization function of G1DList

This initializator accepts the *rangemin* and *rangemax* genome parameters.


Slot [Mutator] (Count: 1)
Name: G1DListMutatorSwap
Doc: The mutator of G1DList, Swap Mutator
Slot [Crossover] (Count: 1)
Name: G1DListCrossoverSinglePoint
Doc: The crossover of G1DList, Single Point

.. warning:: You can't use this crossover method for lists with just one element.



- G1DList
List size: 5
List: [1, -10, 12, 6, 0]


Source polynomial: [16, -7, 12, 6]
Sample points: [(1, 27), (-88, -3995272), (-47, -596085), (49, 734379), (35, 271721), (54, 979414), (-44, -487548), (-93, -4721687), (-45, -522119), (-72, -2176760), (99, 5938729), (78, 2919790), (-62, -1383390), (0, 16), (-25, -86059), (-36, -264116), (-52, -810820), (-16, -21376), (6, 1702), (64, 1621584)]

As you can see, the GA did well, getting the 3rd and 4th coefficients spot on, and getting close to the others. You can compare the two polynomial functiosn in the following graph, with blue points and line representing the orginal polynomial, and the red line, the genetic algorithm's.



The red line matches the blue line almost perfectly – the blue line is barely visible behind it. The GA fit the curve well – although less efficiently than a specialized algorithm (try polynomial regression at http://www.arachnoid.com/sage/polynomial.html). Try it yourself, see how it performs for you