A primer on distributed computing.

By Bill Godfrey.

2006 Update

I wrote this document way back in 2002. Many of the projects mentioned have ended and some of the technologies have moved on, but the broad concepts and risks remain today as they did back then. I'm planning on updating this when time allows. I will announce any new editions of this document on my main website. Thank you.

Introduction

Say you've got a big computation task to perform. Perhaps you have found a way to cure cancer, or you want to look for aliens. All you need is a few supercomputers to work out some calculations, but you've only got the one PC on your desk. What to do?

A popular solution is the "distributed computing" model, where the task is split up into smaller chunks and performed by the many computers owned by the general public. This guide shows you how.

Computers spend a lot of their time doing nothing. If you are reading this on your computer, that expensive CPU is most probably just sitting around waiting for you to press a key. What a waste!

With the rise of the Internet, distributed computing has become popular. The technology behind it is old, and in its older form it is usually known as parallel computing. When people speak of parallel computing, it is usually in the context of a local group of computers, owned by the same person or organisation, with good links between nodes.

Distributed computing takes the same principle a step further, bringing in computers owned by the general public around the world. Popular projects include SETI, distributed.net, GIMPS and many others.

The key issue here is that you are using computing power that you don't own. These computers are owned and controlled by other people, who you would not necessarily trust. The world is populated by both angels and demons, and unless you know them personally, you can't tell them apart.

This primer attempts to show the basics of running a distributed computing project and avoiding the pitfalls.


Is my task suitable for distributed computing?

Distributed computing works by splitting the larger task into smaller chunks which can be performed at the same time, independently of each other. If your task cannot be split up in this way, alas, distribution is not for you.

For example, say your task is in the form of a linear sequence of repeated steps, where the result of the previous step is needed to perform the subsequent step. This task is not appropriate to distribution. Only one step can be performed at a time.

Most successful distributed projects are of the form of a search. Here, there is greater scope for splitting up work. Each worker can take a particular area and do their job on that area.

Additionally, if you want to keep the nature of your task secret, perhaps researching something for a patent filing, you will need to keep the general public out of it. The main strength of distributed computing is to harness the tons of idle time of computers in the world. Restrict yourself to people you know and trust, and the scope is greatly reduced.

There are ways to keep the details secret, but from a technical viewpoint, they are all doomed to failure. (Legal protection is another matter.)


How does distributed computing work?

The two main entities in distributed computing are the server and the many clients. A central computer, the server, generates work packages, which are passed on to worker clients. A client performs the task detailed in the work package data, and when it has finished, the completed work package is passed back to the server.

This is the minimal distributed computing model. A popular variation is the three-tier model, which splits the server task into two layers: one generates work packages, and the other acts as a broker, communicating with the clients. This is simply a means to manage load and protect against failure.

So that the server knows a potential client exists, we need another message in addition to work packages and completed work packages, initiated by the client. A work request simply says, "give me a work package".
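
As a concrete sketch, the three message types might look something like this in Python. All of the names and fields here are illustrative assumptions, not part of any real project's protocol.

from dataclasses import dataclass, field
from typing import List

@dataclass
class WorkRequest:
    """Client to server: "give me a work package"."""
    client_id: str

@dataclass
class WorkPackage:
    """Server to client: one chunk of the overall task, here a key range."""
    package_id: int
    start_key: int
    end_key: int

@dataclass
class CompletedWorkPackage:
    """Client to server: the results for one work package."""
    package_id: int
    client_id: str
    candidate_keys: List[int] = field(default_factory=list)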


For example

Consider distributed.net's RC5-64 project. The challenge is to unscramble an encrypted message. To do this, each possible key, from zero to 2^64, is tried, one by one, until we find English text. In fact, we know that the hidden text begins "The secret word is", which makes it easy to detect once we have the correct key.
Before distributed computing
  1. Start key at zero.
  2. Decrypt message using key as the key.
  3. If decrypted message begins "The secret word is", report key as the answer.
  4. If not, add one to key and start again.

In this type of search, we could be lucky and find the answer at the beginning. We could just as easily be unlucky and find the answer at the end. With average luck, the answer would be in the middle. This process would take a single Pentium III around 200,000 years to complete. Beyond most lifespans.

Instead, we find ourselves 200,000 computers to divide up the task, so it will take just one year to complete. The key range is split up into blocks of 2^28 keys, one block making up a work package.
After distributed computing
Server:
  • Start;
    1. Start key at zero.
    2. Wait for a message.
  • Received work-package-request;
    1. Generate a work package, listing keys from key to (key + 2^28).
    2. Send new work package back to client.
    3. Increase key by 2^28.
    4. Wait for a message.
  • Received completed-work-package;
    1. Load possible key value from completed work package and attempt decryption by this key.
    2. If decrypted message is English, report key as the answer.
    3. If not, repeat until all listed keys are exhausted.
    4. When finished, continue waiting for a message.

Client:
  1. Send work-package-request to server and wait for response.
  2. Start key from the starting key in the work package.
  3. Attempt decryption by key.
  4. If the first three characters of result are "The", then record key as a possibility.
  5. Increase key by one.
  6. Repeat until key has hit upper limit inside work package.
  7. Return completed work package to server.
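
To make the client's side of this concrete, here is a minimal sketch in Python. The real project uses the RC5 cipher; a toy XOR routine stands in for it here purely so the loop is runnable, and all names are illustrative.

def toy_decrypt(ciphertext: bytes, key: int) -> bytes:
    # Stand-in for the real RC5 decryption, so the loop below can run.
    key_bytes = key.to_bytes(8, "little")
    return bytes(c ^ key_bytes[i % 8] for i, c in enumerate(ciphertext))

def process_work_package(ciphertext: bytes, start_key: int, end_key: int) -> list:
    candidates = []
    for key in range(start_key, end_key + 1):     # steps 2, 5 and 6 above
        plaintext = toy_decrypt(ciphertext, key)  # step 3
        if plaintext[:3] == b"The":               # step 4: record a possibility
            candidates.append(key)
    return candidates                             # step 7: returned to the server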

This example assumes all clients are trustworthy. Later on in this document, we will discuss techniques to sort the angels from the demons. But first, an aside on the mechanics.


Talk to me

These messages need to be transported around. The Internet has become ubiquitous, so we may as well use it. Email and HTTP are common transport mechanisms. The choice depends on the resources you have available.

If your computer is only online (say) an hour a day, then emails could be used. The problem here is that it could be difficult for clients to pick up emails automatically. (Not everyone uses the same mail reading client.) Also, people usually don't want to be inconvenienced with the hassle of emails clogging up their inbox.

Sending emails is less of a problem. If your client code can be written to talk SMTP (the mail sending protocol), then your client will only need configuring with the user's SMTP server.
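
For example, a client could return a completed work package by mail with something along these lines. The host name and addresses are placeholders, not real project addresses.

import smtplib
from email.message import EmailMessage

def mail_completed_package(report: str, smtp_host: str) -> None:
    # Build a plain-text message carrying the completed work package.
    msg = EmailMessage()
    msg["From"] = "client@example.com"             # placeholder addresses
    msg["To"] = "results@project.example.org"
    msg["Subject"] = "Completed work package"
    msg.set_content(report)
    # smtp_host is whatever SMTP server the user has configured.
    with smtplib.SMTP(smtp_host) as server:
        server.send_message(msg)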

If you have a web server permanently online, then you could set it up to distribute work packages. A client would make an http request to a URL, and the server would respond with a work package.
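
On the client side, fetching a work package over HTTP can be as simple as the sketch below. The URL is a made-up placeholder; a real project would publish its own endpoint and data format.

import urllib.request

def fetch_work_package(url: str = "http://project.example.org/work") -> bytes:
    # Ask the server for a work package and hand back the raw bytes;
    # how they are interpreted is up to the project's own format.
    with urllib.request.urlopen(url) as response:
        return response.read()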

Other protocols are often impractical, because firewalls block them. Even under the most restrictive firewalls in use, HTTP and email usually get through.

If you can, the web server approach is usually better for the client. Having said that, if the rewards are high enough or emails infrequent enough, people may put up with the hassle of sorting emails.


How should a client work?

Distributed computing is sold on the idea that you can continue working while the client does its thing in the background, and you will not notice anything different. The two usual ways to do this are to run a task at "idle" priority, or to write a screen saver.

About idle

An idle priority task will (if the OS organises it correctly) only run when the computer would otherwise have nothing to do. Idle priority tasks are designed to work in this time. If the user moves the mouse, the scheduler should make the client drop everything and handle what the user wants straight away. Modern operating systems handle this very well.

How to set up a task to run in idle time varies from system to system. Look it up for your platform.
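
As a rough illustration, on a Unix-like system a process can lower its own scheduling priority like this. It is only a sketch of the idea: a niceness of 19 is the lowest conventional priority rather than true "only when idle", and Windows uses its own idle priority class instead.

import os

def drop_to_lowest_priority() -> None:
    # Raise our niceness to 19, the weakest conventional priority on
    # Unix-like systems, so other work is scheduled ahead of us.
    # (On Windows, the rough equivalent is the idle priority class.)
    os.nice(19)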

Screen savers

A popular route for distributed computing clients is the screen saver. Easier to write (it says here) and the concept is more understandable to the non-techie public. People can understand that screen savers only run while they are at lunch and so won't interfere with normal working. (Rather than some deeply technical explanation of how schedulers work.)

Also, the screen saver gives you the ideal opportunity to show the user what's going on. The SETI project does this well by showing lots of nice graphics showing the computations at work.

So, which?

Modern schedulers do work rather well, and idle priority tasks will do their thing and the user need never realise anything is going on. Except in rather freaky circumstances, an idle priority client will get much much much more work done than a screen saver.

The screen saver client is a big concession to the non-techie audience, who fear that something running all the time will make things less reliable.


How do I generate work packages?

It depends on the job in question.

In the case of distributed.net, this is very simple. The key space from zero to 2^64 is split up into blocks of 2^28 keys. When a client asks for some work, it is given a set of ten blocks. The process is so simple that when a client runs out of work but cannot ask for new work, it takes it upon itself to choose a block at random and calculate that.

Why 2^28? 28 is a nice round number. A block of 2^32 keys would take too long and a block of 2^24 keys would be over too quickly.
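
A sketch of such a work package generator, assuming the simple key-range scheme described above:

BLOCK_SIZE = 2 ** 28    # keys per work package
KEY_SPACE = 2 ** 64     # total number of keys to search

next_key = 0

def next_work_package():
    # Hand out the next unissued block of keys, or None once the
    # whole key space has been issued.
    global next_key
    if next_key >= KEY_SPACE:
        return None
    start = next_key
    end = min(next_key + BLOCK_SIZE, KEY_SPACE) - 1
    next_key += BLOCK_SIZE
    return (start, end)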

Thanks to Moore's law, a modern computer will complete a work package twice as fast as a computer 18 months old, so the amount of time taken by different clients will vary a lot. Flexibility in how much work a package contains is useful, but not required.


Attacks against distributed computing projects

There are three types of attack to consider. Parasites, spoilers and faulty clients.

Parasites

People running distributed computing systems will often have some sort of reward for the people who help out. These are often simple pleasures such as karma and fame. At present, people tend to help distributed projects because they want to help out. Most offer a chart of the top helpers as a motivation. In future, we may start seeing cash payments to helpers, persuading the public to give their spare time to one project rather than another.

However small the reward for doing the work, it gives parasites a motivation to claim the reward without actually doing the work. To stop a parasite, you need to make sure they did the work correctly.

Spoiler attacks

As well as parasites, who want the reward without doing the work, there are people who want to stop your project from succeeding. Their efforts are commonly known as spoiler attacks.

When designing your security checks, consider that there may be someone whose only motivation is bragging rights for having broken your project. ("I'm a great haxor, worship me!")

Again, you need to make sure that when you receive a completed work package, it represents actual work done. In the case of "search" projects, so long as you have adequate parasite protection, one defence against spoilers is that it is statistically unlikely that an attacker will be lucky enough to be handed the one work package containing the successful result, and so get the opportunity to hide it.

Faulty clients

This isn't really an attack, but unintentional faults. Even if your client software is perfect, it has to run on hardware owned by the general public. Faulty hardware isn't unheard of. The "floating point co-processor" is a popular place to find hardware glitches. If your project depends on accurate calculation, you should build in checks to confirm that the client is operating correctly.


How do I stop parasites, spoilers and faulty clients?

When a client helps out with a distributed computing project, there is usually no contractual arrangement, and clients typically select themselves without their trustworthiness or reliability being checked.

Randomly repeat a job.

This is the simplest way. Routinely send the same job to a second client and check that both come back with the same answer.

The problem is that where both clients are genuine and accurate, the repeat wastes time. Reduce the number of repeats and you risk a parasite getting through unchecked; increase the checks and you waste more time.

There is the additional problem that you have no way of knowing whether the client you give the checking job to is itself a parasite. If a copy of a parasite client becomes widespread, the risk of this happening increases. Having said that, this technique is simple and easily implemented. It can even be used in conjunction with other schemes.

To help this process along, instead of a simple "yes/no" response, have the client include (say) a running total of its calculations, even when the overall result was a "no". This also helps in pinning down a faulty client once two reports for the same work package disagree.
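
A minimal sketch of this check, assuming each client's report carries its candidate results plus the running "check total" described above (the field names and the 5% double-check rate are arbitrary choices for illustration):

import random

DOUBLE_CHECK_RATE = 0.05   # fraction of packages re-issued to a second client

def should_double_check() -> bool:
    # Decide at random whether this work package gets repeated.
    return random.random() < DOUBLE_CHECK_RATE

def reports_agree(report_a: dict, report_b: dict) -> bool:
    # Two honest, accurate clients should match on both the candidate
    # results and the running check total, even when the answer is "no".
    return (report_a["candidates"] == report_b["candidates"]
            and report_a["check_total"] == report_b["check_total"])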

With this in place, for parasites to avoid detection, they would have to collaborate. This makes their job harder (but not impossible). For example, a collaboration protocol amongst parasites might work like this:
"Did any other parasite do job #xyz?"
"Yes, I did, and my randomly chosen checking total was 58."

To better detect collaboration, always vary the pairings so that the same two people do not always check each other.

Include quickly checkable proof of work done.

If you could repeat every single completed work package yourself, you wouldn't need to enlist help.

But what if you could quickly check that the client has completed the work package by having it respond with proof of where it came close? Say that you know that in any work package, a client would probably come close (say) eight times. A completed work package showing ten occurrences where it did indeed come close suggests that the client most probably did the work.

This method usually works where the project is a "search". Consider a search for large prime numbers.

If a client tells the server that a given number turned out not to be prime at all, a simple check would be for the client to report a factor. The organisers can very quickly check this answer by performing a simple division. If that value is indeed a factor, it's proof that the larger number is indeed not-prime and you can strike it off the list.
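
That check is a one-liner. A sketch, assuming the client reports the candidate number together with the factor it found:

def factor_is_valid(n: int, claimed_factor: int) -> bool:
    # One division settles it: if claimed_factor really divides n,
    # then n is certainly not prime and can be struck off the list.
    return 1 < claimed_factor < n and n % claimed_factor == 0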

Don't distribute the source code.

This is a controversial topic. So controversial, it's in its own section.

Only reward success.

If the project is of a type where work packages fall into two groups, unsuccessful ones and successful ones, and the only way you'll find out which group your work package falls into is to actually do the work, how about only recognising those who you can be sure did the work?

Instead of recognising unsuccessful effort, only reward successful effort. Ignore anyone who is unlucky enough to be given a work package which doesn't contribute to the effort, except perhaps marking that work package as "done". Take away the reward, and the motivation for parasitic behaviour should go away with it.

This strategy alone still leaves you open to spoiling attacks and faulty clients, and unless the actual rewards will be frequent, you may find this strategy will limit the number of helpers.

This would be filed under social engineering rather than technical measures. The effectiveness of social engineering is debatable.

Reject "impossibly fast" responses.

Even modern computers have their limits. A parasite will want to churn through as many work packages as possible. Have a secret threshold, and throw out (or at least double check) any response that comes back before it.

For good measure, tell the client (if such a message is normal) that the work package was accepted, and also move the threshold around, to keep attackers guessing.

Under this scheme, a parasite would instead have to pretend to be a large number of computers: request enough work for 100 computers to do in a day, wait a day, then respond with forged completion responses as if they were from 100 different computers. This is much more difficult to detect, especially with firewalls being commonplace (so many computers appear behind one IP address). Perhaps mark "unusually busy" groups with suspicion and have a higher proportion of their work packages double checked.
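
A sketch of the timing check, assuming the server records when each work package was issued; the one-hour figure and the jitter range are made-up values for illustration:

import random
import time

issued_at = {}                    # package_id -> time the package was handed out
BASE_THRESHOLD_SECONDS = 3600     # illustrative: assume no honest client finishes sooner

def record_issue(package_id: int) -> None:
    issued_at[package_id] = time.time()

def looks_impossibly_fast(package_id: int) -> bool:
    # Move the threshold around a little each time, to keep attackers guessing.
    elapsed = time.time() - issued_at[package_id]
    threshold = BASE_THRESHOLD_SECONDS * random.uniform(0.8, 1.2)
    return elapsed < threshold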


To publish source?

This remains a controversial topic within the distributed computing community. Here are some arguments for and against.

Against publishing source code

Compiled code is very difficult to reverse engineer. You don't want anyone producing hacked clients to launch parasite or spoiling attacks, and so it's best to make their job as difficult as possible.

The "open source" crowd will want to push the benefits of publishing sources, but this is not the objective of your distributed computing project. Don't lose sight of your end objective. You want to find aliens, cure cancer, etc. That's more important.

With a controlled set of clients, you can present a moving target for attackers. If your client is reverse engineered, no problem, release version 2, which you've had waiting in the wings and the attackers will have to start all over again. Any work packages from older versions can be detected.

Additionally, if you control the production of compiled copies, you can be thorough in testing them and be sure that compiler bugs or well meaning fixes or enhancements will not trip up the overall project by putting an incorrect result into the stream.

Finally, if your code is important to you, "closed source" is a well established way to protect your copyrights, by making it difficult for people to re-use your code.

Here, the distributed.net people argue against publishing source code, amongst other issues.

For publishing source code

Without source code, an attacker will still have the client code as a compiled executable. This will be in machine code, and machine code is a readable programming language, just as much as C is.

For sure, it's difficult to understand, but with tools such as disassemblers and debuggers, the task is a whole lot easier. The bad guys will have the expertise, and if the reward is high enough, they will use it. By not publishing your source code, you give up all the benefits of open source in exchange for only a short period of security.

Even so, it's not just the compiled executable that can be read. The behaviour of your code can be monitored. The files it uses can be read. The conversations the client has can be picked apart for all to see. The attackers have all the aces. All you can do is hope they won't use them.

Publishing your source code allows people to produce clients for obscure platforms. Is it worth going to the trouble of producing a copy for (say) NetBSD on PPC? Probably not, but with the source code, anyone can just compile a copy for themselves. The "closed source" crowd would have you limited to just Windows running on a Pentium series.

History has shown that reliance on security through obscurity doesn't work. Use the techniques mentioned elsewhere in this primer, and embrace the advantages that published source code brings.


Other issues to consider

Latency

How quickly do you need an answer to your work package?

In the case of a distributed chess player, or distributed.net's DES challenge (which was broken in a little over a day), the answer is very quickly.

As for SETI, and the RC5-64 project, I can download a whole load of work packages, and then take my laptop up a mountain for a week, returning the completed work packages when I get back to civilisation.


What existing distributed projects are out there already?

The GIMPS project maintains a list of current (and completed) projects.


Is there a simpler way?

If this all seems too much to cope with, there is a simpler way. Take a cue from Tom St. Denis and his search for an S-box. (The source code has since been removed.)

Tom St. Denis is a reader of the sci.crypt newsgroup, and he wanted an S-box which fitted certain requirements. (An S-box is just a matrix of numbers, often used in cryptography.)

Not having the resources to perform a systematic search, he wrote a client which would randomly put an S-box together and then check it. This would continue again and again until it found one, at which point it would display the result. The user would simply copy and paste the result into an email message.

This is an example of an unashamedly low tech, low scale operation. The helpers were largely other sci.crypt readers who wanted to help out. Work done was not recognised at all, unless a successful match was found. Enthusiasm by the contributors was the only motivator. Even the success of finding a right answer only won you Tom's gratitude.

There were no messages passed between client and server, because there was no server. Instead, randomness took the role of the server in generating work packages.
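
The shape of such a client is simple enough to sketch. The S-box criteria aren't given here, so is_good() below is an empty stand-in for whatever acceptance test the project actually used; everything else is just the random-search skeleton.

import random

def random_sbox():
    # Put a candidate S-box together at random, e.g. 256 random byte values.
    return [random.randrange(256) for _ in range(256)]

def is_good(sbox) -> bool:
    # Stand-in for the project's real acceptance test, which is not shown here.
    return False

for _ in range(1_000_000):            # the real client would simply loop forever
    candidate = random_sbox()
    if is_good(candidate):
        print("Found one! Please copy and paste this into an email:")
        print(candidate)
        break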

For sure, randomness isn't perfect. Roll a six-sided die six times. Did any number fail to appear? If so, continue rolling until all six sides have appeared. How many rolls did it take in all?

As I wrote this, I rolled a die: 3, 2, 4, 5, 2, 6, 3 and 1. Two wasted rolls. Too much wastage? Worth it?
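
If you'd rather not roll a die by hand, a quick simulation gives the average. This sketch just repeats the experiment above many times:

import random

def rolls_to_see_all_faces() -> int:
    # Keep rolling a six-sided die until every face has appeared at least once.
    seen, rolls = set(), 0
    while len(seen) < 6:
        seen.add(random.randint(1, 6))
        rolls += 1
    return rolls

trials = [rolls_to_see_all_faces() for _ in range(10_000)]
print(sum(trials) / len(trials))   # averages out at roughly 14.7 rolls, versus 6 with no waste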


Sounds great. Where do I go from here?

Kirk Pearson's nifty web site

Always remember the 8 undoubtable truths of distributed computing. Print them out in big letters.

Information on parallel computing as distinct from distributed computing.

To discuss things, try the new newsgroup, "comp.distributed", where distributed computing is discussed.

Best of luck.


Previous version of this document

Copyright © Bill Godfrey, 2002.