Sunday, August 2, 2009

Curve fitting with Pyevolve: Parameters

In my last post, I discussed how GA's can be used to fit curves – although slowly. Now we'll see what the parameters of the GA affects its accuracy and speed.

Setup

I ran the GA with the following parameters 100 times each, with random parameters, and with an error of sigma = 1, mu = 0 added to each of the sample points, and saved their running times, the GA's guess on the polynomial and the original polynomial.


ParameterValue
Generations100, 500, 1000
Population100, 500, 1000

The Raw Data




Generations





Population



Times (s)

Median

Mean

Minimum

Maximum

Standard Deviation

100

100

1.816

1.820

1.795

1.981

0.0249

100

500

9.229

9.32

9.019

10.288

0.2344

100

1000

18.96

18.99

18.58

19.75

0.2946

500

100

8.905

8.992

8.691

10.686

0.3413

1000

100

18.21

18.26

17.43

19.86

0.5212

500

500

47.63

47.81

45.56

54.06

1.3849

1000

1000

186.1

186.8

184.3

194.9

2.0383





Generations





Population



Integral Coefficients of Variation

Median

Mean

Minimum

Maximum

Standard Deviation

100

100

32951

1.5e4

0.0

1.984e7

2.279e6

100

500

0.0

8.34

0.0

1099

97.28

100

1000

4.0e-4

0.107

0.0

11.99

2.0898

500

100

1.03e-3

-6.18e4

0.0

9.919e6

1.312e6

1000

100

-0.1617

-1.97e5

0.0003

1.992e7

2.099e6

500

500

9.63e-4

0.05710

0.0

17.4944

2.8967

1000

1000

0.0

0.0410

0.0

10.991

2.2384

Note: Minimum and maximum values are absolute values.

Integral Coefficients of Variation

the differences between the definite integrals of the original polynomial and the GA's guess, scaled over the definite integral of the original polynomial. In other words:

Analysis

Time

Looking at the times for each parameter tuple, we see that the time (t) is varies linearly as generations (g) and population size (p), or t = kpg.

Accuracy

Graphical Analysis

Here's a typical 100g 100p fit:


ga(x) = -x^4 + 48x^3 + 48x^2 + 48x + 48
source(x) = 12x^3 + 16x^2 - x + 3


As we see, it's far from accurate. If we change generations to 500:




ga(x) = -x^4 - 49x^3 + 44x^2 - 49x + 44
source(x) = 15x^2 + 13x + 9



Not much is changed, but the GA takes 5x more time (9s). If we increase it further to 1000:

ga(x) = -x^4 - 19x^3 + 47x^2 + 47x + 47
source(x) = 11x^3 - 3x^2 + 2x +8

The fit is barely improved, but the time taken doubles to 18s.

Now, if we maintain our generations at 100, and increase our population to 500:


ga(x) = -11x^3 + 23x^2 - 50x + 41
source(x) = -12x^3 - 2x^2 - 17x - 9

The fit is greatly improved, for the same cost as changing generations to 500. If we further increase it to 1000:


ga(x) = 12x^4 + 19x^3 + 50x^2 - 45x + 50
source(x) = 12x^4 + 19x^3 - 11x^2 - 10x - 6

We get a near perfect fit.

What if we change our generations to 500 and population to 500?



ga(x) = 21x^2 + 7x - 49
source(x) = 20x^2 + 5x - 16

We see that the fit is little improved from p=500, while increasing our time fivefold.

If we set generations and populations to 1000:


ga(x) = -19x^2 - 19x - 48
source(x) = -20x^2 -15x - 6

Not much changes either from the p=1000 sample, but our mean time is now 186s.

Having seen these data, we can conclude that the best way to increase the accuracy of a GA would be to increase its population size, at least for curve fitting. Let's look at this a bit more thoroughly.

Numerical Analysis

Here's our table again:





Generations





Population



Integral Coefficients of Variation

Median

Mean

Minimum

Maximum

Standard Deviation

100

100

32951

1.5e4

0.0

1.984e7

2.279e6

100

500

0.0

8.34

0.0

1099

97.28

100

1000

4.0e-4

0.107

0.0

11.99

2.0898

500

100

1.03e-3

-6.18e4

0.0

9.919e6

1.312e6

1000

100

-0.1617

-1.97e5

0.0003

1.992e7

2.099e6

500

500

9.63e-4

0.05710

0.0

17.4944

2.8967

1000

1000

0.0

0.0410

0.0

10.991

2.2384

We see that at p = 100, g = 100, our GA is highly inaccurate. It is generally way off the mark, and even that varies greatly.

Changing the generations helps a bit. Our median CV is greatly decreased, but our mean and standard deviation stay at astronomical levels.

However, when we increase population, we have a marked increase in accuracy. Changing p from 100 to 500 decreases our deviation by 5 orders of magnitude. Our median is now at 0.0, perfect fit, and our mean has decreased by 4 orders of magnitude. Doubling our population drops standard deviation and mean by another tenth.

Increasing our generations with our high population has little benefit. Our quintupling our generations, therefore increasing our running time fivefold, has roughly the same effect as doubling our population.

Tuesday, July 28, 2009

Curve fitting with Pyevolve

I've toyed around with the idea of using genetic algorithms for curve fitting, and I've finally gotten round to trying it out. Here's how it went:

A short intro

What are genetic algorithms?

Genetic Algorithms, or GA's, solve problems using a method inspired by biological evolution -- selection of the fittest, crossing-over of alleles, and mutation of alleles, repeated over many generations on a large population. GA's have been used to design antennae, predict the market, and design networks.

Learn more:


Introduction to Genetic Algorithms – a quick start on the basics
A Genetic Algorithm Tutorial by Darrell Whitley Computer Science Department Colorado State University – heavy on theory

What is Pyevolve?

Pyevolve is an open source genetic algorithms library framework in Python. I've chosen it since it seems to be the most actively developed framework in Python. Since it's on Python, one can make use of the many libraries available for math and science, whilst enjoying the ease of a high-level language.


Starting off

Downloading and installing Pyevolve

Pyevolve can be downloaded here. Installing it is as simple as running the executable (on Windows), or using easy_install (on Linux).

Defining our fitness function

Our objective is to get a curve that matches the sample data as close as possible. Thus, our fitness function will take the absolute values of the differences between the predicted y's and sample y's, sum them all up, and take their negative. In other words:



In Python:

def fitness_function(chromosome):
score = 0
for point in sample_points:
# eval_polynomial computes the value of the polynomial ``chromosome`` at ``x``
delta = abs(eval_polynomial(point[0], *chromosome) - point[1])
score += delta
score = -score
return score

Our genetic algorithm will rank the chromosomes (polynomials) based on these scores. The higher the score, the higher chance of getting selected and of crossing.


To make our fitness functions flexible, we opt to make a factory function to make fitness functions based on different sets of sample points, like so:

def generate_fitness_function(sample_points):
def fitness_function(chromosome):
score = 0
for point in sample_points:
delta = abs(eval_polynomial(point[0], *chromosome) - point[1])
score += delta
score = -score
return score
return fitness_function

Creating the population

For simplicity's sake, we'll use a list of integers to represent each chromosome. This is represented by G1DList.G1DList. G1DList has convenient default settings, but for this experiment we'll change the range of integers each chromosome can contain, and their length. Finally, we set its fitness function (or evaluator function).


genome = G1DList.G1DList(5)
genome.setParams(rangemin=-50, rangemax=50)
genome.evaluator.set(generate_fitness_function(sample_points))


Setting up the engine

The GA engine is represented by the GSimpleGA.GSimpleGA class. We create an instance of it, and optionally, set its population size, generations, selectors, and other factors. We'll stick with changing the population size and the selector.

ga = GSimpleGA.GSimpleGA(genome)
ga.setPopulationSize(1000)
ga.selector.set(Selectors.GRouletteWheel)

Increasing the population from the default 80 increases our algorithm's accuracy at a cost in speed. We'll see just how much in the next article.

Changing the selector is a crucial step in this particular algorithm. The default selector is Selectors.GRankSelector, which selects chromosomes based on their ranking – it uses the scores only to rank the chromosomes. This discards important infromation – the deltas from the sample points. Without making full use of these deltas, our algorithm will be highly inaccurate, as we'll see in a future article.

To fix this problem, we change our selector to Selectors.GRouletteWheel, which selects all the chromosomes to be picked simultaneously, with probabilities corresponding to their fitness relative to the population's mean fitness.This e-book has more information on the particular mechanism and the theory behind it.

Finally, before we run the algorithm, we'll change the scaling method used. The default scaling method is Scaling.LinearScaling, which only works for postive raw scores. Since our scoring function uses negative scores, and requires the use of negative scores unless we decide to use reciprocals – which will require arbitary precision to work properly – we must change our scaling method. The only scaling method currently available that works for negative scores is Scaling.SigmaTruncScaling, so we'll use that.

pop = ga.getPopulation()
pop.scaleMethod.set(Scaling.SigmaTruncScaling)


Running the algorithm

Finally we run it with GSimpleGA's evolve function, and print out the best individual from the evolution. We'll also print out the original polynomial and the sample points to see how well our algorithm did.

ga.evolve(freq_stats=10) # freq_stats is the number of generations between each statistics message.
print(ga.bestIndividual())
print(source_polynomial)
print(sample_points)

The full code of our algorithm is:


from pyevolve import G1DList, GSimpleGA, Selectors, Scaling, DBAdapters
from random import seed, randint, random

def eval_polynomial(x, *coefficients):
result = 0
for exponent, coeff in enumerate(coefficients):
result += coeff*x**exponent
return result

def generate_fitness_function(sample_points):
def fitness_function(chromosome):
score = 0
for point in sample_points:
delta = abs(eval_polynomial(point[0], *chromosome) - point[1])
score += delta
score = -score
return score
return fitness_function

if __name__ == "__main__":
# Generate a random polynomial, and generate sample points from it
seed()

source_polynomial = []
for i in xrange(randint(1, 5)):
source_polynomial.append(randint(-20,20))

sample_points = []
for i in xrange(20):
n = randint(-100, 100)
sample_points.append((n, eval_polynomial(n, *source_polynomial)))

# Create the population
genome = G1DList.G1DList(5)
genome.evaluator.set(generate_fitness_function(sample_points))
genome.setParams(rangemin=-50, rangemax=50)

# Set up the engine
ga = GSimpleGA.GSimpleGA(genome)
ga.setPopulationSize(1000)
ga.selector.set(Selectors.GRouletteWheel)

# Change the scaling method
pop = ga.getPopulation()
pop.scaleMethod.set(Scaling.SigmaTruncScaling)

# Start the algorithm, and print the results.
ga.evolve(freq_stats=10)
print(ga.bestIndividual())
print("Source polynomial: " + repr(source_polynomial))
print("Sample points: " + repr(sample_points))

Results

Upon running, you should get something like this:


Gen. 1 (1.00%): Max/Min/Avg Fitness(Raw) [15727000722.62(-4292810.00)/0.00(-18226440830.00)/9094256863.27(-6681328575.30)]
Gen. 10 (10.00%): Max/Min/Avg Fitness(Raw) [4637786255.82(-4204824.00)/0.00(-17834916446.00)/3703815517.04(-1089486993.87)]
Gen. 20 (20.00%): Max/Min/Avg Fitness(Raw) [4631124226.18(-4204824.00)/0.00(-17834764598.00)/4133406394.92(-687453861.04)]
Gen. 30 (30.00%): Max/Min/Avg Fitness(Raw) [2783863862.93(-4204824.00)/0.00(-14596526122.00)/2444879772.09(-441610027.03)]
Gen. 40 (40.00%): Max/Min/Avg Fitness(Raw) [3285644641.87(-4197260.00)/0.00(-16392222838.00)/2983713510.34(-436706999.06)]
Gen. 50 (50.00%): Max/Min/Avg Fitness(Raw) [3144311576.46(-62648.00)/0.00(-16006551450.00)/2935495141.08(-336836855.08)]
Gen. 60 (60.00%): Max/Min/Avg Fitness(Raw) [2965577707.55(-58706.00)/0.00(-14566565128.00)/2791088446.86(-294883289.58)]
Gen. 70 (70.00%): Max/Min/Avg Fitness(Raw) [2824632389.09(-57634.00)/0.00(-14567123878.00)/2675797266.42(-256698663.52)]
Gen. 80 (80.00%): Max/Min/Avg Fitness(Raw) [2960822190.42(-21138.00)/0.00(-17483256712.00)/2798736824.07(-273902724.22)]
Gen. 90 (90.00%): Max/Min/Avg Fitness(Raw) [2803805896.09(-2874.00)/0.00(-17482829524.00)/2653855523.20(-264112618.46)]
Gen. 100 (100.00%): Max/Min/Avg Fitness(Raw) [3539918853.96(-2868.00)/0.00(-18213161666.00)/3390487526.67(-290427123.81)]
Total time elapsed: 18.642 seconds.
- GenomeBase
Score: -2868.000000
Fitness: 3539918853.958958

Slot [Evaluator] (Count: 1)
Name: fitness_function
Slot [Initializator] (Count: 1)
Name: G1DListInitializatorInteger
Doc: Integer initialization function of G1DList

This initializator accepts the *rangemin* and *rangemax* genome parameters.


Slot [Mutator] (Count: 1)
Name: G1DListMutatorSwap
Doc: The mutator of G1DList, Swap Mutator
Slot [Crossover] (Count: 1)
Name: G1DListCrossoverSinglePoint
Doc: The crossover of G1DList, Single Point

.. warning:: You can't use this crossover method for lists with just one element.



- G1DList
List size: 5
List: [1, -10, 12, 6, 0]


Source polynomial: [16, -7, 12, 6]
Sample points: [(1, 27), (-88, -3995272), (-47, -596085), (49, 734379), (35, 271721), (54, 979414), (-44, -487548), (-93, -4721687), (-45, -522119), (-72, -2176760), (99, 5938729), (78, 2919790), (-62, -1383390), (0, 16), (-25, -86059), (-36, -264116), (-52, -810820), (-16, -21376), (6, 1702), (64, 1621584)]

As you can see, the GA did well, getting the 3rd and 4th coefficients spot on, and getting close to the others. You can compare the two polynomial functiosn in the following graph, with blue points and line representing the orginal polynomial, and the red line, the genetic algorithm's.



The red line matches the blue line almost perfectly – the blue line is barely visible behind it. The GA fit the curve well – although less efficiently than a specialized algorithm. Try it yourself, see how it performs for you.


Next: An analysis of the effects of the various parameters on the accuracy and speed of the genetic algorithm.