Saturday, October 11, 2014

DNS through SSH on Linux and MacOS with one command, or how to get around an ISP that blocks alternative DNS providers

My ISP (PLDT) has horrible DNS problems that make many sites inaccessible due to failed DNS resolution. These sites are usually ones not often visited by the wider population, such as the SQLAlchemy docs, the PostgreSQL git repository, or the Squeryl site. If I were to guess, my ISP likely has some DNS caching setup that doesn't update the cache when a cache miss occurs. Calling my ISP didn't resolve anything either: they acknowledged that they had DNS issues and promised to work on them, but now, months later, I still experience the same problems.

As if to add insult to injury, my ISP also intercepts all DNS packets and forwards them to its own DNS servers, which then proceed to fail to resolve my query again:

% dig
;; reply from unexpected source:, expected
;; reply from unexpected source:, expected

; <<>> DiG 9.8.3-P1 <<>>
;; global options: +cmd
;; connection timed out; no servers could be reached

As a workaround, I found sshuttle, a program that lets you proxy connections (optionally filtered by IP address) through SSH. Since the actual TCP connections work fine (I can reach websites directly by IP if I manually resolve DNS queries by SSH-ing outside my network), I only need to proxy DNS packets, which saves on server bandwidth and connection latency. As a quick guide for others interested, here's how I did it:

% # Clone the sshuttle repository with Git
% git clone
% # cd into the cloned repository
% cd sshuttle
% # Tell sshuttle to forward DNS requests and all connections to (which in CIDR notation means all connections to that exact IP address); i.e., just forward DNS connections
% ./sshuttle --dns -vr
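Once sshuttle is running, you can sanity-check that name resolution works again. Here's a small, generic Python snippet for that (the hostnames here are just examples, not part of the setup above):

```python
import socket

def resolves(hostname):
    # True if DNS resolution succeeds for the hostname, False otherwise
    try:
        socket.getaddrinfo(hostname, 80)
        return True
    except socket.gaierror:
        return False

# localhost resolves via /etc/hosts; a .invalid name never resolves
print(resolves("localhost"))
print(resolves("no-such-host.invalid"))
```

Run it before and after starting the tunnel with a hostname your ISP fails to resolve, and you should see the result flip from False to True.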

Tuesday, July 28, 2009

Curve fitting with Pyevolve

I've toyed around with the idea of using genetic algorithms for curve fitting, and I've finally gotten round to trying it out. Here's how it went:

A short intro

What are genetic algorithms?

Genetic algorithms, or GAs, solve problems using a method inspired by biological evolution -- selection of the fittest, crossing-over of alleles, and mutation of alleles, repeated over many generations on a large population. GAs have been used to design antennas, predict the market, and design networks.
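To make the loop concrete, here's a toy GA in plain Python (a made-up example, unrelated to Pyevolve): it evolves a population of integers toward the peak of a simple objective using selection, crossover, and mutation.

```python
import random

random.seed(0)  # deterministic for illustration

def fitness(x):
    # Toy objective: maximize -(x - 42)**2, which peaks at x = 42
    return -(x - 42) ** 2

def evolve(pop_size=20, generations=60):
    population = [random.randint(0, 100) for _ in range(pop_size)]
    for _ in range(generations):
        # Selection of the fittest: keep the better half as parents
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            child = (a + b) // 2           # "crossover": blend two parents
            if random.random() < 0.3:      # mutation: small random nudge
                child += random.randint(-5, 5)
            children.append(child)
        population = parents + children
    return max(population, key=fitness)

best = evolve()
print(best)  # converges to (or very near) 42
```

Because the parents are carried over each generation, the best individual never gets worse; mutation and blending do the exploring.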

Learn more:

Introduction to Genetic Algorithms – a quick start on the basics
A Genetic Algorithm Tutorial by Darrell Whitley (Computer Science Department, Colorado State University) – heavy on theory
Essentials of Metaheuristics – a free downloadable book that features genetic algorithms, simulated annealing, and other metaheuristics algorithms.

What is Pyevolve?

Pyevolve is an open-source genetic algorithm framework in Python. I've chosen it since it seems to be the most actively developed GA framework in Python. And since it's in Python, one can make use of the many libraries available for math and science, whilst enjoying the ease of a high-level language.

Starting off

Downloading and installing Pyevolve

Pyevolve can be downloaded here. Installing it is as simple as running the executable (on Windows), or using easy_install (on Linux).

Defining our fitness function

Our objective is to get a curve that matches the sample data as closely as possible. Thus, our fitness function will take the absolute values of the differences between the predicted y's and the sample y's, sum them up, and take the negative of the sum. In other words, for a chromosome encoding the polynomial p and sample points (xᵢ, yᵢ):

fitness = -Σᵢ |p(xᵢ) - yᵢ|

In Python:

def fitness_function(chromosome):
    score = 0
    for point in sample_points:
        # eval_polynomial computes the value of the polynomial ``chromosome`` at ``x``
        delta = abs(eval_polynomial(point[0], *chromosome) - point[1])
        score += delta
    score = -score
    return score
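To see what the scoring looks like in practice, here's a small self-contained check; eval_polynomial mirrors the helper from the full listing later in this post, and the sample points are made up:

```python
def eval_polynomial(x, *coefficients):
    # Same helper as in the full listing below: coefficients[i] is the
    # coefficient of x**i
    result = 0
    for exponent, coeff in enumerate(coefficients):
        result += coeff * x ** exponent
    return result

# Sample points taken from y = 2x + 1
sample_points = [(0, 1), (1, 3), (2, 5)]

def fitness_function(chromosome):
    score = 0
    for point in sample_points:
        delta = abs(eval_polynomial(point[0], *chromosome) - point[1])
        score += delta
    return -score

print(fitness_function([1, 2]))  # exact fit: 0
print(fitness_function([0, 2]))  # off by 1 at every point: -3
```

The closer a chromosome's polynomial tracks the samples, the closer its score is to zero, the maximum possible.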

Our genetic algorithm will rank the chromosomes (polynomials) based on these scores. The higher the score, the higher the chance of being selected and of crossing over.

To make our fitness functions flexible, we use a factory function that builds a fitness function from any given set of sample points, like so:

def generate_fitness_function(sample_points):
    def fitness_function(chromosome):
        score = 0
        for point in sample_points:
            delta = abs(eval_polynomial(point[0], *chromosome) - point[1])
            score += delta
        score = -score
        return score
    return fitness_function
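Since each call to the factory closes over its own sample set, we can build independent fitness functions side by side. A quick made-up demonstration (again with a local stand-in for eval_polynomial):

```python
def eval_polynomial(x, *coefficients):
    # Stand-in mirroring the helper from the full listing below
    return sum(c * x ** e for e, c in enumerate(coefficients))

def generate_fitness_function(sample_points):
    def fitness_function(chromosome):
        score = 0
        for point in sample_points:
            score += abs(eval_polynomial(point[0], *chromosome) - point[1])
        return -score
    return fitness_function

# Each call closes over its own sample set
fit_linear = generate_fitness_function([(0, 1), (1, 3)])   # from y = 2x + 1
fit_square = generate_fitness_function([(0, 0), (2, 4)])   # from y = x**2

print(fit_linear([1, 2]))     # perfect fit: 0
print(fit_square([0, 0, 1]))  # perfect fit: 0
```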

Creating the population

For simplicity's sake, we'll use a list of integers to represent each chromosome. This is represented by G1DList.G1DList. G1DList has convenient default settings, but for this experiment we'll change the range of integers each chromosome can contain, and their length. Finally, we set its fitness function (or evaluator function).

genome = G1DList.G1DList(5)
genome.setParams(rangemin=-50, rangemax=50)
genome.evaluator.set(generate_fitness_function(sample_points))

Setting up the engine

The GA engine is represented by the GSimpleGA.GSimpleGA class. We create an instance of it, and optionally, set its population size, generations, selectors, and other factors. We'll stick with changing the population size and the selector.

ga = GSimpleGA.GSimpleGA(genome)
ga.setPopulationSize(500) # any value larger than the default 80 will do
ga.selector.set(Selectors.GRouletteWheel)

Increasing the population from the default 80 increases our algorithm's accuracy at a cost in speed. We'll see just how much in the next article.

Changing the selector is a crucial step in this particular algorithm. The default selector is Selectors.GRankSelector, which selects chromosomes based on their ranking – it uses the scores only to rank the chromosomes. This discards important information: the deltas from the sample points. Without making full use of these deltas, our algorithm will be highly inaccurate, as we'll see in a future article.

To fix this problem, we change our selector to Selectors.GRouletteWheel, which picks chromosomes with probabilities proportional to their fitness relative to the population's mean fitness. This e-book has more information on the particular mechanism and the theory behind it.
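As an illustration of the idea, here's a sketch of the general roulette-wheel mechanism (not Pyevolve's actual GRouletteWheel implementation):

```python
import random

random.seed(1)

def roulette_select(population, scores, k):
    # Fitness-proportionate selection: each individual owns a slice of
    # the wheel proportional to its (non-negative) score
    total = float(sum(scores))
    wheel, running = [], 0.0
    for score in scores:
        running += score / total
        wheel.append(running)
    wheel[-1] = 1.0  # guard against floating-point rounding
    picks = []
    for _ in range(k):
        r = random.random()
        for individual, edge in zip(population, wheel):
            if r <= edge:
                picks.append(individual)
                break
    return picks

population = ["a", "b", "c"]
scores = [1.0, 1.0, 8.0]  # "c" owns 80% of the wheel
picks = roulette_select(population, scores, 1000)
print(picks.count("c"))  # roughly 800 of the 1000 picks
```

Unlike rank selection, a chromosome scoring eight times better really is picked about eight times as often, so the magnitude of the deltas is preserved.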

Finally, before we run the algorithm, we'll change the scaling method used. The default scaling method is Scaling.LinearScaling, which only works for positive raw scores. Since our scoring function uses negative scores, and requires them unless we switch to reciprocals (which would require arbitrary precision to work properly), we must change our scaling method. The only scaling method currently available that works with negative scores is Scaling.SigmaTruncScaling, so we'll use that:
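For the curious, sigma truncation amounts to something like the following (a sketch of the general technique, not Pyevolve's exact code; the constant c = 2 is an assumed, conventional choice):

```python
def sigma_trunc_scale(raw_scores, c=2.0):
    # Shift every score by (mean - c * stddev) and clamp at zero, so
    # all-negative raw scores still become usable non-negative fitnesses
    n = len(raw_scores)
    mean = sum(raw_scores) / n
    stddev = (sum((s - mean) ** 2 for s in raw_scores) / n) ** 0.5
    return [max(0.0, s - (mean - c * stddev)) for s in raw_scores]

raw = [-2868.0, -4204824.0, -18213161666.0]  # negative scores like ours
scaled = sigma_trunc_scale(raw)
print(scaled)  # all non-negative; the best raw score stays the largest
```

The shift is monotone, so the ordering of the chromosomes is preserved while the fitnesses become valid roulette-wheel weights.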

pop = ga.getPopulation()
pop.scaleMethod.set(Scaling.SigmaTruncScaling)

Running the algorithm

Finally we run it with GSimpleGA's evolve function, and print out the best individual from the evolution. We'll also print out the original polynomial and the sample points to see how well our algorithm did.

ga.evolve(freq_stats=10) # freq_stats is the number of generations between each statistics message
print(ga.bestIndividual())

The full code of our algorithm is:

from pyevolve import G1DList, GSimpleGA, Selectors, Scaling, DBAdapters
from random import seed, randint, random

def eval_polynomial(x, *coefficients):
    result = 0
    for exponent, coeff in enumerate(coefficients):
        result += coeff*x**exponent
    return result

def generate_fitness_function(sample_points):
    def fitness_function(chromosome):
        score = 0
        for point in sample_points:
            delta = abs(eval_polynomial(point[0], *chromosome) - point[1])
            score += delta
        score = -score
        return score
    return fitness_function

if __name__ == "__main__":
    # Generate a random polynomial, and generate sample points from it
    source_polynomial = []
    for i in xrange(randint(1, 5)):
        source_polynomial.append(randint(-20, 20)) # coefficient range chosen arbitrarily

    sample_points = []
    for i in xrange(20):
        n = randint(-100, 100)
        sample_points.append((n, eval_polynomial(n, *source_polynomial)))

    # Create the population
    genome = G1DList.G1DList(5)
    genome.setParams(rangemin=-50, rangemax=50)
    genome.evaluator.set(generate_fitness_function(sample_points))

    # Set up the engine
    ga = GSimpleGA.GSimpleGA(genome)
    ga.setPopulationSize(500)
    ga.selector.set(Selectors.GRouletteWheel)

    # Change the scaling method
    pop = ga.getPopulation()
    pop.scaleMethod.set(Scaling.SigmaTruncScaling)

    # Start the algorithm, and print the results.
    ga.evolve(freq_stats=10)
    print(ga.bestIndividual())
    print("Source polynomial: " + repr(source_polynomial))
    print("Sample points: " + repr(sample_points))


Upon running, you should get something like this:

Gen. 1 (1.00%): Max/Min/Avg Fitness(Raw) [15727000722.62(-4292810.00)/0.00(-18226440830.00)/9094256863.27(-6681328575.30)]
Gen. 10 (10.00%): Max/Min/Avg Fitness(Raw) [4637786255.82(-4204824.00)/0.00(-17834916446.00)/3703815517.04(-1089486993.87)]
Gen. 20 (20.00%): Max/Min/Avg Fitness(Raw) [4631124226.18(-4204824.00)/0.00(-17834764598.00)/4133406394.92(-687453861.04)]
Gen. 30 (30.00%): Max/Min/Avg Fitness(Raw) [2783863862.93(-4204824.00)/0.00(-14596526122.00)/2444879772.09(-441610027.03)]
Gen. 40 (40.00%): Max/Min/Avg Fitness(Raw) [3285644641.87(-4197260.00)/0.00(-16392222838.00)/2983713510.34(-436706999.06)]
Gen. 50 (50.00%): Max/Min/Avg Fitness(Raw) [3144311576.46(-62648.00)/0.00(-16006551450.00)/2935495141.08(-336836855.08)]
Gen. 60 (60.00%): Max/Min/Avg Fitness(Raw) [2965577707.55(-58706.00)/0.00(-14566565128.00)/2791088446.86(-294883289.58)]
Gen. 70 (70.00%): Max/Min/Avg Fitness(Raw) [2824632389.09(-57634.00)/0.00(-14567123878.00)/2675797266.42(-256698663.52)]
Gen. 80 (80.00%): Max/Min/Avg Fitness(Raw) [2960822190.42(-21138.00)/0.00(-17483256712.00)/2798736824.07(-273902724.22)]
Gen. 90 (90.00%): Max/Min/Avg Fitness(Raw) [2803805896.09(-2874.00)/0.00(-17482829524.00)/2653855523.20(-264112618.46)]
Gen. 100 (100.00%): Max/Min/Avg Fitness(Raw) [3539918853.96(-2868.00)/0.00(-18213161666.00)/3390487526.67(-290427123.81)]
Total time elapsed: 18.642 seconds.
- GenomeBase
Score: -2868.000000
Fitness: 3539918853.958958

Slot [Evaluator] (Count: 1)
Name: fitness_function
Slot [Initializator] (Count: 1)
Name: G1DListInitializatorInteger
Doc: Integer initialization function of G1DList

This initializator accepts the *rangemin* and *rangemax* genome parameters.

Slot [Mutator] (Count: 1)
Name: G1DListMutatorSwap
Doc: The mutator of G1DList, Swap Mutator
Slot [Crossover] (Count: 1)
Name: G1DListCrossoverSinglePoint
Doc: The crossover of G1DList, Single Point

.. warning:: You can't use this crossover method for lists with just one element.

- G1DList
List size: 5
List: [1, -10, 12, 6, 0]

Source polynomial: [16, -7, 12, 6]
Sample points: [(1, 27), (-88, -3995272), (-47, -596085), (49, 734379), (35, 271721), (54, 979414), (-44, -487548), (-93, -4721687), (-45, -522119), (-72, -2176760), (99, 5938729), (78, 2919790), (-62, -1383390), (0, 16), (-25, -86059), (-36, -264116), (-52, -810820), (-16, -21376), (6, 1702), (64, 1621584)]

As you can see, the GA did well, getting the 3rd and 4th coefficients spot on, and getting close to the others. You can compare the two polynomial functions in the following graph, with the blue points and line representing the original polynomial, and the red line, the genetic algorithm's.

The red line matches the blue line almost perfectly – the blue line is barely visible behind it. The GA fit the curve well, although less efficiently than a specialized algorithm such as polynomial regression. Try it yourself, and see how it performs for you.
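As a point of comparison, here's what a specialized method looks like: a least-squares polynomial fit with NumPy (assuming NumPy is available; the sample x values are made up, and the coefficients are the ones from this post's run):

```python
import numpy as np

# The run's source polynomial: 16 - 7x + 12x^2 + 6x^3
coeffs = [16, -7, 12, 6]  # index i holds the coefficient of x**i
xs = np.array([-3.0, -1.0, 0.0, 1.0, 2.0, 5.0])
ys = sum(c * xs ** e for e, c in enumerate(coeffs))

# np.polyfit returns the highest-degree coefficient first, so reverse it
fitted = np.polyfit(xs, ys, deg=3)[::-1]
print(np.round(fitted, 6))  # recovers the source coefficients
```

Since the data is an exact cubic, the fit recovers all four coefficients to floating-point precision in a single call, with no generations, mutation, or luck involved.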