A primer on distributed computing.

Introduction

Say you've got a big computing task to perform, but little computing power. A popular solution is the "distributed computing" model, where people's computers around the world share a job.

Distributed computing tends to fall into two camps, distributed computation and distributed storage. Distributed storage includes projects such as Freenet. This primer covers distributed computation.

Also related are parallel computing and clustering. From a conceptual viewpoint, these are the same type as the projects discussed here, except the domain is usually more local.

The projects include SETI, distributed.net, GIMPS and many others. Typically on the basis of harnessing the idle cycles of people's home and office computers. The time between keypresses where the computer isn't doing anything at all.

So, how to organise such a project?

Is my task suitable for distributed computing?

The main technical restriction that the large calculation has to be split up into many smaller jobs that can be performed by a normal home or office computer.

There are ways around it, but if your smaller jobs cannot be performed in isolation, there will be problems.

Additionally, if you want to keep the nature of your task secret, perhaps researching something for a patent filing, you will need to keep the general public out of it. The main strength of distributed computing is to harness the tons of idle time of computers in the world. Restrict yourself to people you know and trust, and the scope is greatly reduced.

There are ways to keep the details secret, but from a technical viewpoint, they are all doomed to failure. (Legally is another matter.)

How does distributed computing work?

A central computer acts as the allocator, assigning tasks to clients. There may be various levels to relieve the load off the central server, but there will always be the central allocator.

The protocol between client and server typically consists of two messages.

From server to client, "Here-is-some-work".
From client to server, "Here-are-the-results".

Other messages are possible. Here are just a few.

Server to client Client to server

Stop-that-job
Acknowledge
Your-results-were-incorrect

Give-me-some-work
I-will-not-do-this-job
Stop-sending-me-work

Server to client	Client to server
Stop-that-job Acknowledge Your-results-were-incorrect	Give-me-some-work I-will-not-do-this-job Stop-sending-me-work

The more popular distributed computing projects tend to be of the variety where people pick up work on their own pace. The protocol of "give-me-work", "here-is-some-work" and "here-are-the-results" are the more popular.

These messages need to be transported around. Email and http are common transportation mechanisms. The choice depends on the resources you have aviliable.

If your computer is only on line (say) an hour a day, then emails could be used. The problem here is that it could be difficult for clients to automatically pick up emails. (Not everyone uses the same mail reading client). Also, people usually don't want to be inconvenienced with the hassle of emails clogging up their inbox.

Sending emails is less of a problem. If you client code can be written to talk SMTP (the mail sending protocol) then your client will only need configuring to the user's SMTP server.

If you have a web server permanently online, then you could set it up to distribute work packages. A client would make an http request to a URL, and the server would respond with a work package.

Other protocols are impractical. Even under the most restrictive firewalls in use, HTTP and emails usually get through.

If you can, the web server approach is usually better for the client. Having said that, if the rewards are high enough or emails infrequent enough, people may put up with the hassle of sorting emails.

As an example, the distributed.net RC5-64 project seeks to break an encrypted message. To do this, they plan to try all possible keys until one works. There are 2⁶⁴ possible keys. All have to be tried one by one.

When a client starts, it will ask the server for some work. This will usually be in the form of a range of keys. ("Try all keys from a to b.")

Once the client has attempted to decrypt with all keys in that range, it will report back to the server. The server then remembers that this range has been tried and will be not be given to another client.

How should a client work?

Distributed computing is sold on the idea that you can continue working and the client will do it's thing in the background and the user will not notice anything different. The two usual ways to do are a task running at "idle" priority, or to write a screen saver.

About idle

An idle priority task will (if the OS organises it correctly) only run when the computer would otherwise have nothing to do. If you are reading off your computer, what's is your expensive CPU doing now? Unless you have something else running, a good chance is that the answer is very little indeed. Keeping a fixed image on the screen takes very little work. It'll only be when you come to press a key or move your mouse that your computer will have something to do.

Idle priority tasks are designed to work in this time. Ideally nothing should be time critical. If you want to move the mouse, the distributed task should drop everything and handle what the user wants straight away. Slow things down for the user, and unless the rewards are high, you will find people uninstalling your client code.

How to set up a task to work in idle time depends on system to system. Look it up.

Screen savers

A popular route for distributed computing clients is the screen saver. Easier to write (it says here) and the concept is more understandable to the non-techie public. People can understand that screen savers only run while they are at lunch and so won't interfere with normal working. (Rather than some deeply technical explanation of how schedulers work.)

Also, the screen saver gives you the ideal opportunity to show the user what's going on. The SETI project does this well by showing lots of nice graphics showing the computations at work.

So, which?

Modern schedulers do work rather well. Except in rather freaky circumstances, an idle priority client will get more work done. Certainly, I can personally attest that the distributed.net client runs in the background and is completely unnoticeable.

The screen saver client is a big concession to the non-techie audience. If resources allow, go for two working modes?

How do I generate work packages?

It depends on the job in question.

In the case of distributed.net, this is very simple. The key space from zero to 2⁶⁴ is split up into 2³⁰ sized blocks. When a client asks for some work, it is given a set of ten blocks.

The process is so simple, that when clients run out of work, but cannot ask for new work, it takes it upon itself to randomly choose a block to calculate.

In a project like SETI, a client would not have this freedom. The server collects information from radio telescopes and passes the information to clients to examine. For this type of project, how much data to process is simply a compromise of how much is reasonable.

Thanks to Moore's law, a modern computer will complete a work package twice as fast as a computer 18 months old. The amount of time taken by different clients will vary a lot, and so flexibility is useful, but not required.

Attacks against distributed computing projects

At the heart of many security problems of distributed projects are the motivation for people to bother running the client code.

These include;

Money.
Karma or community spirit. (Simply the knowledge that you have helped.)
Fame. (We've found aliens, and Fred Oobar helped!)
(A combination of the above.)

These can be distributed either as a fixed reward for work done, regardless of actual success, or as a lottery, awarded to the client who directly discovered the winning calculation.

However small, there is often a reward for doing the work, and then there is the motivation for parasites to get the reward without having to do the work.

Spoiler attacks

As well as parasites, who want the reward, without doing the work, there are those people who want to stop your project from succeeding. These are commonly known as spoiler attacks.

The methods to stop parasites can often be used to stop spoiler attacks as well, but when designing your security checks, consider that there may be someone whose only motivation is bragging rights for having broken your project. ("I'm a great haxor, worship me!")

Consider, someone who performs most work packages correctly, but after years of trustworthiness, hides a "good" result, and instead takes full credit for themselves.

Also consider if you have any rivals, who have an interest in seeing your project fail. (Such as commercial competition, or even pressure groups.)

This article deals only with technological measures. There may be legal means, but that is beyond my scope of knowledge.

Faulty clients

This isn't really an attack, but unintentional faults. Even if your client software is perfect, it has to run on hardware owned by the general public.

Faulty hardware isn't unheard of. The "floating point co-processor" is a popular place to find hardware glitches. If your project depends on accurate calculation, you should build in checks to confirm that the client is operating correctly.

How do I stop parasites, spoilers and faulty clients?

When a client helps out with a distributed computing project, there is usually no contractual arrangement, and clients typically select themselves without their trustworthiness or reliability being checked.

When the server receives a completed work package, "it wasn't in this range", the server needs to be sure that this response is genuine. Without checks it's easy for a parasite client to simply (and quickly) respond that their key range was not the one, simply to rise up the table.

So, the server needs to be able to confirm for itself that the work was done and the answer was genuine and accurate.

Randomly repeat a job.

This is the simplest way. Routinely send the same job to a different client and check that they come back with the same answer.

The problem is that where there is no parasite or spoiler, it wastes time. Reduce the number of retries, and risk a parasite getting though unchecked. Increase checks and waste time.

There is the additional problem that you have no way of knowing if the client you give a checking job to isn't also a parasite. If a copy of a parasite client becomes widespread, then the risk of this happening increases.

Having said that, this technique is simple, and easily implemented. It can even be used in conjunction with other schemes.

To help this process along, instead of a simple "yes/no" response, have the client include (say) a totalled up calculation, even when the result was overall, a "no". This also helps detecting a faulty client, once you have established which one is faulty.

For parasites to avoid detection, they would have to collaborate, which makes their job harder (but not impossible).

For example, a collaboration protocol amongst parasites might work;
"Did any other parasite do job #xyz?"
"Yes, I did, and my randomly chosen checking total was 58."

If you do use this method, remember that if you get a discrepancy, either the original or the checker (or both) could be parasitical or faulty

Include a quickly checkable side calculation.

If you could repeat every single completed work package yourself, you wouldn't need to enlist help.

But what if you can quickly check that the client has completed the work package, by having it respond with proof where it came close.

This method usually works where the project is a "search". Consider distributed.net's RC5-64 project.

Each client is given a range of keys to work on. Decryption on the wrong keys will result in "random numbers". Decrypting with the correct key will produce English, or in this case, it will begin "The secret word is".

Quite by chance, the random numbers produced by wrong keys just might get close to the known prefix. In fact, randomness where the first three characters (of 23 bits) is 'The' should occur one in 2²⁴ (or once in 16 million).

So for each key block of 2³⁰ keys, the client will have to find 64 keys which come close, which would be, quite by chance, fairly uniform and unpredictable. (Unpredictable, unless you actually do the job.)

This provides an easy check for the server. When (if) a client fails to find the right key, it could respond to the server, "This block of keys didn't have the right one, however, these keys came close."

The server then has a simple way to check that the work was done. The reported list of keys will be a tiny fraction of the original key block size, and so the server operators can check these themselves and trust the answer.

This is an example of one distributed project. The exact method to use (if there is one) will vary from project to project. To take the example of the GIMPS project (large prime number search) a quick check could be that if a large candidate number is found to be not prime, have the client return one of the factors found. This is easily checkable and would make a good indication that the client has done the work.

Even where the project's primary aim isn't a search, this technique can be adapted by throwing a search into the mix. Take the example of a distributed chess player (expanded on later on in this document).

While the chess player is continuously moving pieces around the board and putting a score on various combinations, if it happens to find a series of moves which produce a particular (but uninteresting) board, record that report it back to the server.

The server can then perform the series of moves and indeed confirm that the client did the work to find the answer.

This solution does not work so well (in the general case) for protecting against spoiler attacks. Unless you can be sure that there are exactly n near misses, how can you know if this work package response merely has just a statistically low amount of near misses, or if the winning answer has been deliberately ignored.

Don't distribute the source code.

This is a controversial topic. So controversial, it's in it's own section.

Only reward success.

If the project is of a type where work packages fall into two groups, unsuccessful ones and successful ones, and the only way you'll find out which group your work package falls into is to actually do the work, how about only recognising those who you can be sure did the work?

Instead of recognising unsuccessful effort, only reward successful effort. Ignore anyone who is unlucky enough to be given a work package which doesn't contribute to the effort, except perhaps marking this work package as "done". Take away the reward away and the motivation for parasitical behaviour goes away?

This strategy alone still leaves you open to spoiling attacks and faulty clients, and unless the actual rewards will be frequent, you may find this strategy will limit the number of helpers.

This, along with the following strategy would be filed under social engineering rather than technical. The effectiveness of social engineering is debatable.

Don't offer a reward.

No rewards, no parasite?

If the only reward for participating in a distributed project is the karma or community spirit, then there will be no parasites?

First of all consider how much help you will get with no reward beyond the karma. Unless you are famous or content to work within a small circle of friends, you may find help limited.

Even if you are content with the amount of help you will get, will the bad guys lay off you? This is one for the psychologists to answer, but looking at the popularity of DoS attacks, evidence suggests otherwise.

Reject "impossibly fast" responses.

Even modern computers have their limits. A parasite will want to churn though as many work packages as possible.

Have a secret threshold, where any responses before that threshold will be thrown out, or at least double checked.

For good measure, tell the client (if such a message is normal) that the work packages was accepted, and also move the threshold around, to keep attackers guessing.

Under this scheme, a parasite would instead have to pretend to be a large number of computers. Request enough work for a 100 computers to do in a day, wait a day, respond with forged completion responses, as if they were from 100 different computers.

This is much more difficult to detect, especially with firewalls being commonplace. Perhaps mark "unusually busy" groups with suspicion and have a higher proportion of their work packages double checked.

To publish source?

This remains a controversial topic within the distributed computing community. Here are some arguments for and against.

Against publishing source code

Compiled code is very difficult to reverse engineer. You don't want anyone producing hacked clients to launch parasite or spoiling attacks, and so it's best to make their job as difficult as possible.

The "open source" crowd will want to push the benefits of publishing sources, but this is not the objective of your distributed computing project. Don't lose sight of your end objective. You want to find aliens, cure cancer, etc. That's more important.

With a controlled set of clients, you can present a moving target for attackers. If your client is reverse engineered, no problem, release version 2, which you've had waiting in the wings and the attackers will have to start all over again. Any work packages from older versions can be detected.

Additionally, if you control the production of compiled copies, you can be thorough in testing them and be sure that compiler bugs or well meaning fixes or enhancements will not trip up the overall project by putting an incorrect result into the stream.

Finally, if your code is important to you, "closed source" is a well established way to protect your copyrights.

Here, the distributed.net people argue against publishing source code, amongst other issues.

For publishing source code

Without source code, an attacker will still have the client code as a compiled executable. This will be in machine code, and this is readable programming language, as much as C is.

For sure, it's difficult to understand, but with tools such as disassemblers and debuggers, the task is a whole lot easier. The bad guys will have the expertise, and if the reward is high enough, they will use it. Without publishing your source code, you lose all the benefits of open source for a short period of security.

Even so, it's not just the compiled executable that can be read. The behaviour of your code can be monitored. The files it uses can be read. The conversations the client has can be picked apart for all to see. Face it, the attackers have all the aces. All you can do is hope they won't use them.

Publishing your source code allows people to produce clients for obscure platforms. Is it worth going to the trouble of producing a copy for (say) NetBSD on PPC? Probably not, but with the source code, anyone can just compile a copy for themselves. The "closed source" crowd would have you limited to just Windows running on a Pentium series.

Publishing your code also allows a greater level of trust. If all you have is a compiled executable, can you be sure the code won't send your private data (like passwords and credit card numbers) to the bad guys?

Only if you trust the authors, or you are one of he few people who know how to read machine code. But with open source code, there will be a greater number of eyes who will spot bad behaviour and report it.

Finally, the issue of bug control, avoiding well meaning fixes which go wrong. Alas, even if you control the software, you don't control the hardware. If you are relying on correct operation, then you may as well give up now.

Other issues to consider

Latency

How quickly do you need an answer to your work package?

In the case of a distributed chess player, or distributed.net's DES challenge (which was broken in a little over a day), the answer is very quickly.

As for SETI, and the RC5-64 project, I can download a whole load of work packages, and then take my laptop up a mountain for a week, returning the completed work packages when I get back to civilisation.

Some existing distributed projects

Here we examine the workings of some distributed computing projects.

Distributed.net RC5-64

(I've used this project a lot for examples earlier in this article. Forgive the duplication.)

RSA labs often set challenges to break their cryptographic protocols, with a cash prize for the winner. They publish some encrypted text, using a secret key. The unencrypted form begins with the words "The secret word is". With longer keys, a weakness would have to be found in the algorithm, but with 64 bits, a brute force attack (where all keys are tried one by one) is feasible.

Run the client code (which has the encrypted text loaded in) and it will ask one of a group of servers for a some work. 10 blocks of 2³⁰ keys will be returned and the client code will set out decrypting using each key in each block, taking note of any which match the prefix.

Finished blocks are returned to the server, where the project organisers check the returned block to see if any of the flagged keys is the actual key.

Helpers are rewarded by placement of a chart of the top contributors, and a lottery where the helper who found the right key gets a share of the prize, although the chance of finding the right key is slim. Enthusiasm of the contributors is arguably the main motivator.

Communication between client and server is over HTTP. The two interactions are "fetch" which requests work from the server and "flush" which returns finished packages to the server.

If the client cannot contact the server, or if it is in "offline" mode, when key blocks have been exhausted, the client resorts to randomly selecting key blocks. It has no idea if the randomly chosen blocks have been done or not, and even so, the repeated blocks are effectively free checks, since the client would be idle anyway.

SETI

The Search for Extra Terrestrial Intelligence is a project to look for alien intelligence by scanning stellar radio waves for the signs of artificiality.

Recently SETI enlisted the help of the world idle computers to help in the search. Packages of radio signals would be sent to people's computers and processed. Finished packages are set back to SETI who perform checks.

Contributor's work is credited by a chart of the top helpers. If any indication of alien life is found, it would be a big news story and the lucky (random) contributor would probably become a big celebrity. Again, enthusiasm, and a nice screen saver which graphically shows what's going on, is the main motivator.

All communication works over HTTP, and it all goes pretty much behind the scenes.

If a server is not contactable, the client is pretty much on it's own. It's beyond the capabilities of most home computers to pick up distant radio signals.

GIMPS

The Great Internet Mersenne Primes Search looks to find very large prime numbers. Euclid proved there is an infinite number of primes, but locating them is difficult.

For convience, since all Mersenne primes are of the form 2ⁿ-1, just the "n" part is exchanged. The actual numbers go into many digits.

The project feels low tech compared to the others. You compile yourself a client and transactions go over email. This project is arguably the highest rewarding, since you have a better chance of finding the right answer, combined with a level of positive media interest, as most people know what prime numbers are.

The project works on many levels. Preliminary checks on numbers are first made, with positive indicators are put though the tougher tests later on.

Tom's S-box search

Tom St.Denis, a reader of sci.crypt, was on the search for an S-box (a matrix of numbers used in cryptography) which fitted certain characteristics.

Not having the resources to perform a systematic search, he wrote a client which would randomly put an S-box together and then check it. This would continue again and again until it found one, and the results displayed on the console.

This is an example of an unashamedly low tech, low scale operation. The helpers were largely other sci.crypt readers who had some idle time to spare. Work done was not recognised at all, unless a successful match was found. Enthusiasm by the contributors was pretty much the only motivator. Even success of finding a right answer only really won your Tom's gratitude.

Any more?

The GIMPS project maintains a list of current (and completed) projects.

How would a more complex project work?

Far-and-wide Blue

As a case study, let us look at the example of how a distributed AI chess player could work.

So far, the projects we have seen are rather simple. The larger project is split up into n work packages which are allocated to clients. A chess player would make for a more complicated client which needs to be adaptable.

Your server is in a chess game against Deep Blue. You have an internet connection and an army of people's computers waiting for work packages.

The strategy I propose here is a fairly dumb exhaustive search. Each possible move will be considered. The opponents possible responses will be considered. For as much time as is available, each possible next move will be given a score, and the highest scoring move will be used.

(If this a poor strategy, well, I'm a programmer, not a chess player.)

An easy way to split up work packages is for each possible move to be considered. Problem is that at any point in chess, there are a relatively small number of possible moves to make.

To give each client a more manageable work package, the server could go further and split the work load up by also telling each client a move that the opponent could make next.

If more work packages are needed, throw in your next move into the mix. Continue splitting until you have enough work packages to use.

Right, so you now have a work package for each client. Pass them along in whatever manner is appropriate. Include a certain level of redundancy as you feel safe with for checking purposes.

(To save a bit of time, consider having each client randomly pick a set of subsequent moves, score it, and then report back. Saves time, but you risk losing coverage.)

Chess is a timing critical game. Instead of a fixed end time, clients should continue giving reports, going further own the game, reporting it's score at fixed points as it goes.

Some tasks will have to left to the central server. To make the move now or wait for more information to come in? The server would be best placed to make this decision.

Optimisations to consider could include putting a means in the protocol to have a faster client pick up the slack on slower clients.

The quick parasite check could include a side response, tell me any sequence of moves where the board finishes in a certain way. Give each piece type and colour a number and tell the client that for each piece on the board, multiply it's piece number with it's board position, add them all together and remember the last two digits. (For the cryptographers among you, MD4 would be ideal.) Any sequence of movements where the calculation results in a 58, remember it and return it.

Each possible board makeup should have a 1% chance of being a "58" type, and you only know which board setups are 58s if you actually make the moves. A server can then validate each returned work package by checking that the reported moves do indeed result in a 58, and so showing that the client was legitimate. If less that 1% are 58 types, resort to some redundancy checks.

Computer animation/Special effects

This type of project was quite common when a single one 50th of a second screen sized frame took minutes to render, but less so these days with increased computing power.

The plan is for each client to have a copy of all the objects in a scene, all the light sources, etc, and for it to perform all the graphical calculations for that scene. The client would send the completed image back to the server which put all the frames together in order.

The granularity of work packages is limited to single frame. Faster clients can simply get more frames done than slower clients.

Because of the volume of data that needs to be transported, this type of project doesn't tend to be used with distributed computing in the domain of the general public. Professionals would prefer to use their own equipment, and amateurs would not want the bother of organising it and instead just wait for their own computers to perform the animation.

Sounds great. Where do I go from here?

Kirk Pearson's nifty web site

To discuss things, try the new newsgroup, "comp.distributed", where distributed computing is discussed.

Best of luck.

If you enjoyed this document, or found it useful, why not consider employing the author! <g>

Bill Godfrey, 5 years professional experience in software development, specialising in Embedded, C, C++, Unix, Network administration and applications.