You don’t have to wear a blazer to school on a hot day!

Remember, if you’re going to school on a hot day, you can leave your blazer at home if it’s normally part of your uniform.

“When you say bronze doesn’t need to be chipped, my questions is this, doesn’t it?”
  1. Teachers who insist that you wear a blazer on hot days can be ignored. They have a callous disregard for the discomfort caused by excess thick layers in hot weather and such callousness does not deserve respect.

  2. Teachers who insist that you bring your blazer to school and carry it around can also be ignored. These ones might not be callous but they are ridiculous. Making you pointlessly carry around some heavy item? Why?

  3. If you’ve been given a detention for not having your blazer, you don’t have to turn up. Go home at the normal time and let the ridiculous teacher whine to themselves. You’ve not broken any rules.

  4. If you need to, show your teacher this page.

  5. If you are a teacher who has just been shown this page by one of the children in your care, please stop making them bring their blazers in on hot days. It is people like you who caused the rise in belief in the flat-earth. “If people in authority can be so wrong about blazers, maybe they’re also wrong about the earth being a globe.” If you really must enforce rules, why not good rules like the one about running with scissors? (If I’ve not convinced you are in error, maybe teaching isn’t right for you. Why not consider a career in cooking where you’re meant to be heating things up?)

  6. I’m not the one undermining teacher’s authority. Teachers who are under the delusion that blazers are required are undermining their own authority by attempting to enforce such ridiculous rules.

  7. Yes, I do know better than those teachers. Thank you for noticing.

Picture Credit:
📷 Close up of blazer pocket emblem for boys school group by “Kaye”.

My adventure into self web-hosting (Part 1)

If you had asked twenty-something me how he thought forty-something me would be hosting his website, he’d have predicted I had a rack of small servers in my attic, as part of a grid-computing business. (That’s what we called “cloud” computing back then.)

He’d have been disappointed to find out I’m using a shared web-hosting service, but that may change.

“The end of the day, remember the way, we stayed so close to the end, we’ll remember it was me and you ’cause we are gonna be…”

Over the Cliff

It all started when my article, Data-Mining Wikipedia for Fun and Profit made it to the top of Hacker News and stayed there for three hours. I was careful to try to not overburden the system by switching on an HTML cache. This way, visitors would only be served up static files without the server having to run the PHP code or talk to the database. Despite that, the server went down and I had to post a sheepish comment with a link to a mirror.

It was clear I was out-growing my current web-host. Despite my precautions, it couldn’t handle being popular for a few hours. Not only that, I’m a software developer and I wanted to develop software. The only practical choice on this service was PHP and I had long decided that life was too short for that.

I started looking at VM services as the natural next step on the ladder, but it was a chance discussion, again on Hacker News, that gave me an idea.

Clifford Stoll: “a heavy load on my raspberry-pi web server told me something was happening…”
Me: “your web server is a Raspberry PI, and its holding up while being on the HN front page?”
CS: “Hi Bill, Yep. Cloudflare is out front, so the actual load on the rasp-pi is mitigated by their content-delivery network.”

Suddenly, the idea of hosting a web server in my attic became real again. Reality had long since taught me that residential ISPs were no good for serious web hosting – but if there was a service that could deal with the bulk of GET requests and it could cover the occasional outage on my side from its cache, that’d change everything.

“Can you deal with my GET requests?”

Tunnelling

At the time, that Raspberry-Pi web server was on his residential ISP with a public IP address. That arrangement wouldn’t work for me as my own ISP didn’t allow their customers to run services like that. However, in that same comment thread, the very CTO of Cloudflare (John Graham-Cumming) mentioned to him that they had an new service that allowed their customers to VPN out to Cloudflare, making such port-forwarding shenanigans a thing of the past.

(As a not-quite a declaration of bias, Cloudflare are on my list of companies I would like to work for should my current day-job come to end. I am not (yet) an employee of Cloudflare and they’re not paying me to write this in any case. By the time you come to read this, that might have changed.)

This service is completely free. While I like not having to pay for things, it does make me a little nervous. This particular service isn’t going to be injecting ads into my site and I do understand how the free tier fits into their business model. But still, I’ve been burnt by free services suddenly disappearing before and you get no sympathy if you’ve become dependent on them. I kind of wish I could give them a few pounds each month, just in case.

Leaving such concerns to one side, I had a plan. Acquire a server and install it into one of the slots on my IKEA KALLAX unit the TV is sitting on. Plug it into my ISP’s router and once that’s running, install a web server along with the VPN software. I’ll finally be in charge of my very own web server, just like the twenty-something me thought I’d be.

“If I get to know your name, well I could trace your private number, baby. All I know is that to me, you look like you’re lots of fun. Open up your loving arms, I want some, want some. You spin me right round, baby, right round, like a record, baby, right round…”

Quiet!

I had acquired a second-hand PC for this purpose but once I got it home it was way too noisy. I needed a machine I could leave switched on 24/7 in the lounge where we watch TV. My server would have to be really quiet.

I also considered a Raspberry Pi, the same hardware Clifford Stoll used, but I wasn’t going to only be running a few WordPress instances. I had an idea I wanted to develop and I’d need a database with plenty of space for that to work. An SD card and maybe some USB storage wouldn’t cut it.

I’m not in particular hurry to buy it as I still want to plan some more before the new machine starts taking up space. It was while I was reading reviews for various machines when I had the craziest of crazy ideas.

“And as we sit here alone, looking for a reason to go on. It’s so clear that all we have now are our thoughts of yesterday. La, la la la…”

It comes with Windows

Any PC I could buy is going to come with Windows pre-installed and fully licensed. I was always going to replace it with a variety of Linux, but I wondered, why not keep the copy Windows?

Before you all think I’ve gone insane, there are a few benefits to doing it this way. I use Windows a lot for my day job so I’m familiar with its quirks and gotchas. Even though there’s a dot-net for Linux, my development machine runs Windows so there would be fewer surprises when the development machine runs the same OS as the production machine. For the handful of WordPress sites I wanted to run, there were docker images available. Finally, because it won’t be directly connected to the scary internet I wouldn’t have to panic when there’s an update.

But even as I’m writing this, I feel I’m going to regret doing it this way. I just know I’ll be writing part six of this series and it’ll be all about installing Linux on that server machine because there’s just one stupid thing I couldn’t get working on Windows. We shall see.

A foreshadowing?

Join me for part 2 of this series, where I’ll be experimenting with getting WordPress running from a Docker container. Wish me luck.

Picture Credits:
📸 “Kee-kaws”, by me.
📸 “Duke”, by my anonymous wife.
📸 “Haven Seafront, Great Yarmouth”, by me.
📸 “Quiet Couple” by Judith Jackson. (CC)
📸 “Blisworth Canal Festival, 2019”, by me.

My Incredibly Stupid Diary

🥇First Entry
⚾ Random Entry

Years ago, 2004 to 2007, I had a website. It was mildly popular – I counted the number of readers and found I had eleven regulars. I called it “The Incredibly Stupid Diary of Bill”, although I added a few friends as writers and “of Bill” very soon became “of Bill et al”.

I occasionally posted long form pieces, but mostly it was quick-and-short stuff that these days I would post to Facebook or Twitter. I used Blogger before it was BlogSpot. Back then, it worked by connecting to my web server and uploading HTML files over FTP. I’d leave my password configured with Blogger so that in case anyone commented, they could update the page with the comment without having to wait for me to allow it.

Along the way, I started a weekly feature – Animated Short of the Week . Each Sunday, I’d pick a Flash-based animation and post a link to it. These would usually be my favourite from the back-catalogue on AlbinoBlackSheep but it was something I really enjoyed doing. It would also become an incentive to post *something* as I wouldn’t want to have two animation post next to each other. I made the decision to stop posting them after 100 posts. It was becoming more and more difficult to find good animations and it felt like the quality was on the decline so 100 selections seemed a good place to stop.

“You may find yourself behind the wheel of a large automobile.”

Time passed and I eventually stopped using writing. I had a new hobby, making old-school YouTube videos. This was the day when videos were limited to ten minutes and there was no such thing as a professional YouTuber. You can see the decline from the last handful of posts – 80% of them are just links to my videos.

When I finally made the decision to moth-ball the site, I wrote one last post and published it. A few more comments were written and the servers at Blogger dutifully updated my website via FTP, but that was it. One day, I changed my password on the web server but didn’t update it on Blogger. That last revision would be fixed as it was left, with a non-functioning comments form to boot.

For a while, my website became nothing more than a bunch of links to my social media websites, although my old posts were still there if you knew the addresses, ready to respond to searches. By now it was a folder full of static files, just as it was left when Blogger did that last FTP connection.

Now, I’ve been reminded about that old website and I wanted to give it a bit of a tidy-up. There were several files all with very similar HTML structures. I wrote a program to loop through each file, remove obsolete stuff like the comments form, added a navigation gadget and made it a nice website again.

A lot of external links have since gone, so I wrote some code to change those links to archive.org links, using the time-stamp of the original post. I made an exception for the AlbinoBlackSheep links as the archive,org copies were all of the original Adobe Flash which doesn’t work any more, whereas the current AlbinoBlackSheep website uses updated video files.

I hope you like it. There is an awful lot of rubbish there but a few gems too. I’ll be making a few new posts reacting to some of the crazy stuff I wrote. Good times.

Start with the first post: Let’s try that again.
Or jump to a random post.

Data-Mining Wikipedia for Fun and Profit

It all started after watching one too many videos narrating the English monarchy, all starting from King William Ⅰ in 1066 as if he’s the first king of England. This annoys me as it completely disregards the handful of Anglo-Saxon kings of England who reigned before the Normans.

They’re Kings of England. If you’re going to make a list of the Kings of England, then you should include the Kings of England.

It was this that made me want to make a particular edit to both the King Alfred and Queen Elizabeth pages on Wikipedia, acknowledging each as related to the other. But what is their relationship and through who?

I went to the page for Queen Elizabeth Ⅱ and started following the Mother/Father links until I found my way to King Alfred, mostly going through the other kings of England. I counted 36 generations, but was there a shorter or even longer route?

Sounds like a job for some software!

Gâteau Brûlé.

Scanning Wikipedia

We have the technology.

  • Visual Studio 2019 and C#.
  • RestSharp, a library for downloading HTML.
  • HtmlAgilityPack, a library for parsing and extracting data from HTML.

With these libraries downloaded from nuget, I was able to write some very quick and dirty code that would download the HTML for the Wikipedia page of Queen Elizabeth II, storing the HTML in a cache folder to save re-downloading it again.

Once the HTML is downloaded (or read from the cache), HtmlAgilityPack can be called upon for the task of pulling items of data from the HTML. For example, the person’s full name, which is always the page’s only <H1>…</H1> element, can be extracted using one line of code:

string personName = 
    html
    .DocumentNode
    .Descendants()
    .Where(h => h.Name == "h1")
    .Single()
    .InnerText;

I used HtmlAgilityPack and LINQ in a similar way to pull out the Mother and Father for each person. The code would look for the info-box <TABLE>, then look inside for a <TH> with the text “Mother” or “Father”. It would then take a few steps backwards to look for the <TR> that the text is a part of and finally pull out all the links it can find inside.

With the links to the Queen Elizabeth’s mother and father, the code would add those links to a queue and the top-level would pull the next link and continue until the links runs out.

Calm down!

This section was added after initial publication.

I would hope that people don’t need to be told to be considerate, but please be considerate.

Before I started on this project, I checked Wikipedia’s robots.txt file. This told me that my project was acceptable, quoth: “Friendly, low-speed bots are welcome viewing article pages, but not dynamically-generated pages please.”

The article pages were exactly what I wanted. My code was already fairly low speed as it was all in a single thread. Nonetheless, I added a short delay after each download once I had worked the kinks out. I also set the User-Agent text to include my email address and phone number so Wikipedia server admins could raise an alarm with me personally if necessary.

As I was running my code in Visual Studio’s debug mode, I could leave the code running unattended (once I had observed it over the first hundred or so) with some breakpoints to stop everything until I could return to inspect what happened.

The most important were during examination of the response from Wikipedia. If the response was anything other than an 200/OK response (after redirects) or anything other than HTML, I wanted my code to stop dead until I can inspect what happened. Even if it happened overnight, I still what that response object in memory.

In the end, the bulk of the download took two days in a number of bursts. I’ll be sending a modest donation to the Wikimedia Foundation in thanks for accommodating my bizarre projects.

“She’s just a girl who says that I am the one…”

I made the decision here to only include people with an info-box. Extracting someone’s parents from free English text was a step too far. If you’re not notable enough to have an info-box with your parents listed, you’re not notable enough for this project. (Although I did find a couple of people who didn’t have a suitable info-box surprisingly early in the process. Rather than hack in an exception, I edited Wikipedia to include those people’s parents in their info-box, copying the link from elsewhere in the text.)

While that got me out of a small hole, more annoying was when the info-box listed “Parents” or “Parent(s)” instead of Mother and Father. I wanted to track matrilineal and patrilineal lines, so it was a little annoying to just have an individual’s parents with no clear indication of which one is which. I coded it so that if there’s only one one link, assume it is the father. If there’s two links, assume the father is the first one.

Because patriarchy.

“Also known as…”

Another issue was that some of the pages changed names. RestSharp would dutifully follow HTTP redirects, but I’d end up storing a page with one name but having a different name internally. This happened right away as the page for Queen Elizabeth links to her mother as “Elizabeth_Bowes-Lyon“, but once you follow the link, you end up at “Queen_Elizabeth_The_Queen_Mother“.

The HTML included a <LINK> tag named the “canonical reference”, so I could pull that out and use it as the primary key in my data structure. To keep the link between child and parent, it collects the aliases when the are detected and a quick reconciliation loop corrects the links after the initial loop completes.

King Alfred, also known as The Muffin Man.

From Alfred to Elizabeth.

Once I had a complete set of Wikipedia pages cached, the next step was to build a tree with all of the parental connections that lead from King Alfred to Queen Elizabeth. I knew that some non-people had crept in because someone’s parents would be listed as “(name) of (town)”, but that didn’t bother me as those towns wouldn’t have a mother or father listed and those loose ends would be discarded.

I wrote some code to walk the tree of connections. It started from Queen Elizabeth and recursively walked to each of the mother and father node. If a node ended on King Alfred, the complete chain would be added to the list of nodes.

With this reduced set in place, I churned through the nodes and generated a GraphViz file. For those who don’t know about it, this an app for producing graphs of connected bubbles. You tell it what bubbles you want and how they are connected and it automatically lays them out.

At this point, I was expecting a graph that would be mainly tall and thin and it would appear right here in this article. While family trees do grow exponentially, I wasn’t including every single relationship, only those that connect both of two individuals. If I were graphing the relationships between myself an a distant ancestor, I’d expect a single line, each parent handing over to their child. There would be a few bulges when third-or-so cousins marry. There, an individual’s two children would split off into separate lines, eventually reuniting with one ever-so-slightly inbred individual.

Yeah, that’s not what I got. This is the SVG file GraphViz generated for me. If you follow this link and are faced with a blank screen, scroll right until you find the King Alfred node. Then zoom out.

Aristocrats…

(The bubbles are all clickable, by the way.)

Count the Generations.

The graph was interesting but this wasn’t the primary objective of this exercise. I wanted to write “He is the n-times great-father of his current successor Queen Elizabeth.” on King Alfred’s Wikipedia page.

But what’s the n? I already had a collection of all the chains between so I just had to loop through them to find the longest and shortest chain. The longest chain has 45 links and the shortest chain has 31 links.

King Alfred is a 42-times great-grandfather of Queen Elizabeth Ⅱ.

(And also 28 times-great-grandfather. And everything in between.)

Here’s the simplified graph showing only those lines with exactly 45 links.

All the parental chains from Alfred to Elizabeth that have exactly 45 links.

“Let’s talk about sex.”

Earlier, I mentioned being annoyed that some info-boxes listed two parents instead of a mother and a father, requiring me to make assumptions that fathers are more likely to be included and put first, because these are aristocrats and society is quite patriarchal.

I still wanted to data-mine into matrilineal lines, so to check on those assumptions, I pulled out all of the people linked only in a “Parents” line of the info-box and checked they were all in order. The fathers all had manly names and the mothers all had womanly names. Seemed fine. But just to be sure, I queried my data structure for any individual that was listed as both a mother and a father, expecting that to happen from two different children’s pages.

There were several. Not only that, the contradicting links came from the same page. Someone apparently had the same individual as both his father and mother. Expecting to see the same person linked twice or a similar variety of quirk, I was surprised to see what should have been very a simple info-box to process.

This person has an info-box with two individuals, each unambiguously listed as Father and Mother. Why was my code somehow interpreting the mother as the same individual as the father?

Investigating, I discovered that not only was Adolphus listed as someone’s mother, his actual mother was skipped over entirely. My data-structure simply didn’t have an entry for her.

To try and work out what was going on, I added a conditional breakpoint and looked as my code dutifully added her name to the queue of work, as well as later on when it was taken off the queue. The code downloaded her page as it disappeared into the parser. Yet the response that came back was that she was already accounted for. I beg to differ!

What I hadn’t done was click on her link. She didn’t have her own page, only a redirect to her husband’s page. Apparently, the only notable thing she had done, according to history, was marry her husband.

I later found a significant number of there links where a woman’s name is just a redirect to her husband. If the patriarchy isn’t going to allow me to rely on Mother/Father links as a sign of an individual’s parental role, investigating matrilineal lines will have to wait.

“We call our show, The Aristocrats!”

Acknowledgements and Notes

If you’d like to do your own analysis, I’ve saved the data I extracted into a JSON file you can download. I make no promises about its accuracy or completeness or indeed anything about the file. I’ve even hidden the word “Rutabaga” in there, just to make it clear how potentially inaccurate it is.

I showed a friend an earlier version of the chart and he wondered if I could do it better in Python. Maybe, but equally maybe not. This isn’t the C# of the early 2000s we’re dealing with. HtmlAgilityPack and LINQ combined can do very clever queries to extract data from web pages, often in single lines of code. Maybe there’s a Python component to do the same, I don’t know.

Rather than install GraphViz myself, I found online GraphViz did the job admirably and I’m grateful to them for hosting it. I’m also grateful to my friend Richard Heathfield for telling me about it several decades ago, back when I was thinking about building my own version control system. (Ah, to be young.)

RestSharp is a very nice component for downloading web content for processing. It flattens all the quirks of using the dot-net standard library directly and wraps it all up in a simple and consistent interface.

Oh, and here’s that Wikipedia edit, in all its glory. It was reverted around three minutes later by another editor but never mind.

Update: Hacker News discussion. Also, I am grateful to Denny Vrandečić for his analysis in response to this piece. I’ll be posting a more extensive response to all these soon.

Picture Credits:
📸 “Another batch of klutz” by “makeshiftlove”.
📸 “King Arthur statue in Winchester ” by “foundin_a_attic”.
📸 “</patriarchy>” by “Gaelx”.
📸 “Banana Muffins” by Richard Lewis.
📸 “River Seine” by Irene Steeves.

POP3 – The ideas that didn’t make it.

This is part of series of posts documenting extensions to the POP3 protocol.

I had a few ideas along the way. This post collects some that didn’t quite make it. I present these so the time I spent won’t have been a complete waste. 🐘

Multi-line Response Indication

Good software engineering employs reusable code.

If you’re developing a library to interact with a POP3 service as a client, you’d observe that the protocol operates on an exchange of command and response. This calls for a single function that can be called to send any command to the server and return the response when it arrives. Your function would look like:

var retrResponse = pop3.Command("RETR 4");

Except you can’t do that. There are two distinct classes of response in POP3. One where the response is a single line and another where the response is multiple lines. If all you have is the first line, you have no clear indication that’s the complete response or if there are more lines coming. You, the developer, need to know what kind of response you’re expecting from the server and have the caller pass that information along.

var retrResponse = pop3.CommandMulti("RETR 4");
var deleResponse = pop3.CommandSingle("DELE 4");

Wouldn’t it be nice if there was a clear an unambiguous way for a server to indicate if there are more lines to follow? That way, client code could have that single function that just knows what to do.

When the client calls CAPA, if the response includes “MULTI-LINE-IND”, the client can know what kind of response is coming from the server from the first line, because the server is making these promises:

  1. All multi-line responses will always have a first line that ends with a ” _”.
  2. All single-line responses will never be a line that ends with a ” _”.
C: RETR 1
S: +OK This line ends with an underscore so keep reading. _
(Message goes here.)
S: .
C: DELE 1
S: +OK This line has no underscore so send the next command.

I chose the underscore character as this would technically be encroaching into the human readable section of a response, so it would need to be ignorable by any humans passing by. I had flirted with using “…” as the indicator as it could be included in the text anyway, but that might not work for all languages. My inclination was to keep it as small as possible when displayed, printable ASCII, but also unlikely to be included in an English sentence.

The first issue I stumbled upon with this idea was that the underscore character could be included in the set of possible unique-ids. The command UIDL (n) returns a single line response with the message’s unique-id on the end as a single line response. Any servers implementing this idea would have to exclude underscores from their unique-ids.

The final nail in the coffin was when I took a step back and thought about the developers of POP3 client libraries. Would they make use of my extension?

No. Servers not implementing my new extension will still exist for a long time and people will still want to connect to those servers. As such, client libraries are still going to be passing a flag down to its command/response layer, indicating if the response is going to be multi-line or not. I won’t have saved the developer any effort.

My example implementation still includes this extension, for now.

Wait, there’s more!

Keepa your CAPA!

One thing that bothered me about reconnecting to a POP3 service was the necessity to call CAPA on reconnecting every time. Each time, the server would send the same response back. Wouldn’t it be nice if the client could store the response once and have some sort of notification if it needed to be checked?

My idea was to use the banner that the service sends immediately on connection, together with a new CAPA response.

S: Welcome to my POP3 service version 1.2.3.
C: CAPA
S: +OK Capabilities follow...
S: CAPA-VERSION 1.2.3.
S: UIDL
S: .

Because the CAPA response included this capability, the next time it connected, it could look in the connection banner and see the token “1.2.3.” included. With this, the client is assured that the response from last time is still good and the client need not ask for it again.

Even if the banner changed (which it would if it implemented APOP), so long as this one token was included, the response is still good.

I saw a problem that the response to a CAPA command might change in the course of a connection. The capabilities might be different after going through TLS. Different users might have different capabilities that only reveal themselves once you’ve logged in. Should the response to the PASS command also include a version token? What about other ways of logging in?

I tried to come up with RFC style wording that would have placed different CAPA responses into different domains, but it all got too complicated and too prone to error.

And for what? To save having to reissue a single command and brief response on top of all the TCP and TLS handshakes? Not worth it.

My demonstration implementation implements this capability.

Safe!

Pick up from where we left off.

The RETR command allows the client to download a message, while the TOP command allows the top part of a message. There’s a gap for a command that retrieves the end of a message. If you’ve downloaded part of a message but the connection broke, you could use this to resume from where you left off.

The END command would have worked in the same way as TOP. You select a message and the line you want to start from, and the server would continue from that point.

Selecting a line rather than a byte index was necessary. POP3 is line orientated protocol and any new commands would need to work within that restriction. If you selected a byte index instead, what if you wanted to start from the middle of a CRLF? What if the selected starting byte is a dot but in the middle of a line, should that be dot-padded?

I told myself that servers would need to keep track of what byte indexes each line started at, but that was an unsatisfactory answer.

It was about the same time as another idea. Email files are highly compressible thanks to their large blocks of base-64. I pictured an alternative form of RETR (RETZ?) that would return the +OK line normally, but the multi-line response would be inside a new GZIP stream. Inside that stream, the CRLF lines and dot-padding would still be present. A lone dot would complete the message as normal as the GZIP stream concludes and the normal uncompressed exchange of commands and responses continues.

It was at this point that I remembered there’s already a very well established protocol for downloading large files with compression, chunking, resumption and all that good stuff. HTTP. I’m still thinking about how that could work in practice but that road seems a lot more productive than trying to force HTTP into a POP3 shaped hole.

Picture Credits. (CC)
📷 “More Selvatiche” by Cristina Sanvito.
📷 “Fun with Cling Film” by Elizabeth Gomm.

Farewell, Hackensplat Industries!

2009, I registered “hackensplat.com”. A friend of mine called me “Wilhelm von Hackensplat” as a joke after my rather loud sneezes. It was about this time I decided to start writing about software development and technology. I liked the idea of having an alter ego. I pictured him as an evil genius, Baron von Hackensplat, and so hackensplat.com was born, an evil genius writing about his evil technologies.

That was the idea, but it never really took off in my mind. I would have an idea to write about something but it wouldn’t really lend itself to the evil genius persona. As time passed I would got bored with the alter ego and gave up writing for the character, instead just writing as myself. I later changed the name of the website to “Hackensplat Industries”, mainly so I could keep the name.

Even more time passed and I wrote a new piece that I wanted to show a friend. I read out the address as “hackensplat dot com”. My heart sank as the response came back “How do you spell that?”. A question I had been asked too many times before.

I almost registered hackandsplat.com as a redirect, but frankly I was over it. One of the reasons I was writing was to gain a little professional exposure but this other name was just getting in the way. I made the decision and started moving all my published posts to billpg.com, a domain I had previously used as my strictly personal website, distinct from my professional site. There wasn’t anything on my personal domain other than a collection of social media links anyway.

I don’t know how long I’ll keep the old domain, which now only has a set of redirects. It expires in November this year so I suspect I’ll be spending a little bit of October looking at access logs. Equally likely is that I’ll completely forget and it’ll automatically renew anyway.

Welcome to billpg industries.

POP3 – Delete Immediately

This post is part of series documenting extensions I’ve designed and prototyped for the POP3 protocol. I originally had this idea on the way to designing a mechanism for keeping connections open and avoid having to close and reopen them. I had abandoned this specific idea early on in that process but once I started writing up my notes for public discussion, I realised this small update might still be useful to implement.

DELI

To delete a message with POP3, you’d normally use a DELE command which flags a message for deletion, followed by a QUIT command, which together with closing a connection, finally deletes those flagged messages.

The DELI command, in contrast, is a command to immediately delete the specified message. Once the server has responded with a +OK response, the delete request has been committed and you don’t need a QUIT.

“I’m not going to tell you again!”

The aftermath…

I originally wrote this extension as part of effort to allow opened connections to be shelved and refreshed. The QUIT command had the job of committing message delete requests but also shut down the underlying connection. My first thought in addressing that was client’s needed a way to delete messages without having to QUIT.

Seemed simple at first. Have an alternative form of DELE that doesn’t need a QUIT. It would be just like deleting a file on an FTP server. Simple!

The problem is what happens to the other messages after one has been deleted. POP3 works by assigned each message a numeric ID from 1 to n. In a world where deletes are deferred to the end of a connection, the view of a mailbox remains consistent throughout the lifespan of a connection. Now, we’re introducing committed deletes, what happens to those numeric message IDs?

There are two realistic alternatives. Servers could either leave a gap in the IDs so all the messages have consistent IDs, or the server could reduce all the higher IDs by one. I didn’t like either of those answers as either way might very realistically require a significant refactoring of the various server implementations out there.

This was why I initially decided against going down this road. It was only when I started putting my notes together for posterity that I realised this idea might still have legs.

Delete by Unique ID

Researching this extension, I came across the POP4 proposal. While this project seemed all but abandoned, it did have an idea I liked, which I’ve adapted back into a POP3 extension.

With this extension too, suddenly a “Delete Immediately” command becomes practical.

C: UIDL
S: +OK Unique IDs follow...
S: 1 XYZ_1001
S: 2 XYZ_1002
S: 3 XYZ_1003
S: .
C: RETR UID:XYZ_1001
S: +OK Message follows...
(Contents removed for brevity.)
S: .
C: DELI UID:XYZ_1001
S: +OK Message deleted and delete committed.

Earlier, I mused that there were two realistic ways to deal with numeric message IDs, either leave a gap in the numbers or fill the gap by reducing the others by one. But what if we don’t care what those numbers are because now we’re only making requests by string unique IDs?

This way a server implementing DELI is free to do what they wish with its numeric message IDs. The RFC would state something along the lines of “After a client has used a DELI command, it MUST NOT send any command that uses a numeric message-id parameter.”

With a wave of RFC 2119 magic, the problem goes away. You can have an immediate delete that’s instantly acknowledged, you just need to completely abandon the numeric message ID. That shouldn’t be too tricky,

billpg.com

Right?

Please do have a read of the posts in this series of POP3 extensions.

Write Your Own POP3 Service

So, you want to write a POP3 service? That’s great. In this post, we’ll walk through building a simple POP3 service that uses a folder full of EML files as a mailbox and serves them to anyone logging in.

Getting Started

I’m assuming you are already set-up to be writing and building C# code. If you have Windows, the free version of Visual Studio 2019 is great. (Or use a more recent version if one exists.) Visual Studio Code is great on Linux too.

Download and build billpg industries POP3 Listener. Open up a new console app project and include the billpg,POP3Listener.dll file as a reference. You’ll find the code for this project on the same github in its own folder.

using System;
using System.IO;
using System.Collections.Generic;
using System.Net;
using System.Linq;
using billpg.pop3;

namespace BuildYourOwnPop3Service
{
    class Program
    {
        static void Main()
        {
            /* Launch POP3. */
            var pop3 = new POP3Listener();
            pop3.ListenOn(IPAddress.Loopback, 110, false);

            /* Keep running until the process is killed. */
            while (true) System.Threading.Thread.Sleep(10000);
        }
    }
}

This is the bare minimum to run a POP3 service. It’ll only accept local connections. If you’re running on Linux, you may need to change the port you’re listening on to 1100. Either way, try connecting to it. You can set up your mail reader or use telnet to connect in and type commands.

Accepting log-in requests.

You’ll notice that any username and password combination fails. This is because you’ve not set up your Provider object yet. If you don’t set one up, the default null-provider just rejects all attempts to log in. Let’s write one.

/* Add just before the ListenOn call. */
pop3.Provider = new MyProvider();

/* New class separate from the Program class. */
class MyProvider : IPOP3MailboxProvider
{
}

This won’t compile because MyProvider doesn’t meet the requirements of the interface. Let’s add those.

/* Inside the MyProvider class. */
public string Name => "My Provider";

public IPOP3Mailbox Authenticate(
    IPOP3ConnectionInfo info, 
    string username, 
    string password)
{
    return null;
}

Now, the service is just as unyielding to attempts to log-in, but we can confirm our provider code is running by adding a breakpoint to the Authenticate function. Now, when we attempt to log-in, we can see that the service has collected a username and password and is asking us if these are correct credentials or not. Returning a NULL means they’re not.

This might be a good opportunity to take a look at the info parameter. All of the functions where the listener calls to the provider will include this object, providing you with the client’s IP address, IDs, user names, etc. You don’t have to make use of them but your code may find the information useful.

A basic mailbox with no messages.

We can change our Authenticate function to actually test credentials. For our play project we’ll just accept one combination of user-name and password.

if (username == "me" && password == "passw0rd")
    return new MyMailbox();
else
    return null;

This will fail compilation because we’ve not written MyMailbox yet. Let’s go ahead and do that.

class MyMailbox : IPOP3Mailbox
{
}

Again, we’ll need to write all the requirements of the interface before we can run. So we can move on quickly, let’s provide just the minimum.

The first thing we’ll need is a list of the available messages. We’ll return an empty collection for now.

public IList<string> ListMessageUniqueIDs(
    IPOP3ConnectionInfo info)
    => new List<string>();

The service needs to know if a mailbox is read-only or not. Let’s say it isn’t.

public bool MailboxIsReadOnly(
    IPOP3ConnectionInfo info)
    => false;

The service might sometimes need to know is a message exists or not. For now, it doesn’t.

public bool MessageExists(
    IPOP3ConnectionInfo info,
    string uniqueID)
    => false;

The client might request the size of a message before it downloads it and the service will pass the request along to the provider. I’ve often suspected that clients don’t really need this so let’s just return your favorite positive integer.

public long MessageSize(
   IPOP3ConnectionInfo info, 
   string uniqueID)
   => 58;

The client will, in due course, request the contents of a message, but won’t because both the list-messages and message-exists will deny the existence of any messages, so for now, we can just return null.

public IMessageContent MessageContents(
    IPOP3ConnectionInfo info, 
    string uniqueID)
    => null;

Finally, we need to handle message deletion. Again, we don’t need to do anything just yet.

public void MessageDelete(
    IPOP3ConnectionInfo info, 
    IList<string> uniqueIDs)
{}

And we’re done. Run the code and log-in. Your mailbox will be perpetually empty but you can add breakpoints and confirm everything is running.

List the messages.

Now, let’s actually start with something useful. Let’s change our ListMessageUniqueIDs to return a list of filenames from a folder. You’ll want to replace the value of FOLDER with something that works for you.

const string FOLDER = @"C:\MyMailbox\";

public IList<string> ListMessageUniqueIDs(
    IPOP3ConnectionInfo info)
    => Directory.GetFiles(FOLDER)
           .Select(Path.GetFileName)
           .ToList();

public bool MessageExists(
    IPOP3ConnectionInfo info, 
    string uniqueID)
    => ListMessageUniqueIDs(info)
           .Contains(uniqueID);

Let’s also place an EML file into our mailbox folder. If you don’t have an EML file to hand, you can write your own using notepad. (It doesn’t care if the file has a “.txt” extension.)

Subject: I'm a very simple EML file.
From: me@example.com
To: you@example.com

Message body goes after a blank line.

If we save that into our mailbox folder and run up the POP3 service, we’ll see there’s a message available. It won’t be able to download it though.

Download the message,

The MessageContents function expects an new object that implements the IMessageContent interface.

/* Replace the MessageContents function. */
public IMessageContent MessageContents(
    IPOP3ConnectionInfo info, 
    string uniqueID)
{
    if (MessageExists(info, uniqueID))
        return new MyMessageContents(
                       Path.Combine(FOLDER, uniqueID));
    else
        return null;
}

/* New class. */
class MyMessageContents : IMessageContent
{
    List<string> lines;
    int index;

    public MyMessageContents(string path)
    {
        lines = File.ReadAllLines(path).ToList();
        index = 0;
    }

    public string NextLine()
        => (index < lines.Count) ? lines[index++] : null;

    public void Close()
    {
    }
}

This shows the requirements of the object that regurgitates a single message’s contents. A function that returns the next line, one-by-one, and another that’s called to close down the stream. The Close function could close opened file streams or delete temporary files, but we don’t need it to do anything in our play project.

Note that the command handling code inside this library has an extension that allows the client to ask for a message by an arbitrary unique ID. Make sure your code doesn’t allow, for example, “../../../../my-secret-file.txt”. Observe the code above checks that the requested unique ID is in the list of acceptable message IDs by going through MessageExists.

Delete messages.

The interface to delete messages passes along a collection of string IDs. This is necessary because the protocol requires that a set of messages are deleted in an atomic manner. Either all of them are deleted or none of them are deleted. We can’t have a situation where some of messages are deleted but some are still there.

But since this is just a play project, we can play fast and loose with such things.

public void MessageDelete(
     IPOP3ConnectionInfo info, 
     IList<string> uniqueIDs)
{
    foreach (var toDelete in uniqueIDs)
        if (MessageExists(info, toDelete))
            File.Delete(Path.Combine(FOLDER, toDelete));
}

What now?

I hope you enjoyed building your very own POP3 service using the POP3 Listener component. The above was a simple project to get you going.

billpg.com

Maybe think about your service could handle multiple users and how you’d check their passwords. What would be a good way to achieve atomic transactions on delete? What happens if someone deletes the file in a mailbox folder just as they’re about to download it?

If you do encounter an issue or you have a question, please open an issue on the project’s github page.

POP3 – Goodbye to numeric message IDs!

This is the second post in my series describing a number of extensions to the POP3 protocol. The main one is a mechanism to refresh an already opened connection to allow newly arrived messages to be downloaded, which I’ve described on a separate post. This one is a lot simpler in scope but if we’re doing work in this protocol anyway, this may as well come as a package.

I am grateful to the authors of POP4 for the original idea.

This post is part of series on POP3. See my project page for more.

Enumeration

To recap, when a client wishes to interact with a mailbox, it first needs to send a UIDL command to retrieve a list of messages in the form of pairs of message-id integers and unique-id strings. (I’ve written before how UIDL really should be considered a required command for both client and server.)

C: UIDL
S: +OK Unique-ids follow...
S: 1 AAA
S: 2 AAB
S: 3 AAJ
S: .

The numeric message-ids are only valid for the lifetime of this connection while the string unique-ids are persistent between connections. All of the commands (prior to this extension) that deal with messages use the numeric message-ids, requiring the client to store the UIDL response so it has a map from unique-id to message-id.

This extension allows the client to disregard the message-ids entirely, modifying all commands that have a message-id parameter (RETR, TOP, DELE, LIST, UIDL) to use a unique-id parameter instead.

“UID:”

If the server lists UID-PARAM in its CAPA response, the client is permitted to use this alternative form of referencing a message. If a message-id parameter to a command is all numeric, the server will interpret that parameter as a numeric message-id as it always has done. If the parameter instead begins with the four characters “UID:”, the parameter is a reference to a message by its unique-id instead.

C: DELE 1
S: +OK Message #1 (UID:AAA) flagged for deletion.
C: DELE UID:AAB
S: +OK Message #2 (UID:AAB) flagged for deletion. 

(The POP4 proposal used a hyphen to indicate the parameter was a unique-id reference. I decided against adopting this as it could be confused for a negative number, as if numeric message-ids extended into the negative number space. A prefix is a clear indication we’re no longer in realm of numeric identifiers and may allow other prefixes in future.)

If a client has multiple connections to a single mailbox, it would normally need to perform a UIDL command and store the response for each connection separately. If the server supports unique-id parameters, the client is permitted to skip the UIDL command unless it needs a fresh directory listing. Additionally, the client is able to use a multiple connections without having to store the potentially different unique-id/message-id maps for each connection.

RFC 1939 requires that unique-ids are made of “printable” ASCII characters, 33 to 126. As the space (32) is explicitly excluded, there is no ambiguity where a unique-id parameter ends, either with a space (such as with TOP) or at the end of the line.

Not-Found?

If a requested unique-id is not present, the server will need to respond with a “-ERR” response. To allow the client to be sure the error is due to a bad unique-id rather than any other error, the error response should inside a [UID] response code. (The CAPA response should also include RESP-CODES.)

S: RETR UID:ABC
C: -ERR [UID] No such message with UID:ABC.
S: RETR UID:ABD
C: -ERR [SYS/PERM] A slightly more fundamental error.

It should be noted that a [UID] error might not necessarily mean the message with this unique-id has been deleted. If a new message has arrived since this particular connection opened, the server may or may not be ready to respond to requests for that message. A client should only make the determination that a message has gone only if it can confirm it with either a new or refreshed connection.

Extensions yet to come

I’ve pondered about if this extension, once its written up in formal RFC language, should modify any future extensions that use message-id parameters. Suppose next year, someone writes a new RFC without having read mine that adds a new command “RUTA” that rutabagas the message specified on its command line.

(What? To rutabaga isn’t a verb? Get that heckler out of here!)

The wording could be: “Any command available on a service that advertises this capability in its CAPA response, that accepts a message-id parameter that is bounded on both sides by either the space character or CRLF, and normally only allows numeric values in this position, MUST allow a uid-colon-unique-id alternative in place of the message=id parameter.”

(In other words, this capability, only changes commands where a unique-id with a prefix can unambiguously be distinguished from a numeric message-id.

My inclination is for the RFC defining this capability exhaustively lists the commands it modifies to just the ones we know about. (RETR, TOP, DELE, LIST, UIDL.) I would add a note that strongly encourages authors of future extensions to allow UID: parameters as part of their standard. If someone does add a RUTA command without such a note, then strictly speaking, the client shouldn’t try and use a UID: parameter with the RUTA command, but probably will.

I’m on the fence. What do you think?

Restrictions

RFC 1939 that defines POP3, makes a couple of allowances with the UIDL command that would make UID: parameters problematic. A server is allowed to reuse unique-ids in a single mailbox, but only if the contents of two messages are identical. A server is also allowed to reuse a unique-id once the original message using that unique-id has been deleted.

Since these allowances would introduce a complication to which message is being referenced, any server advertising this capability (in RFC language) MUST NOT exercise these two allowances. If a server advertises UID parameters, it is also promising that its unique-ids really are unique.

Fortunately, all mail servers I’ve looked at can already make this promise, either they use a hash but add their own “Received:” header or they assign an incrementing ID to each incomming message.

billpg.com

Resources
New extension: Mailbox Refresh Mechanism
New extension: Delete Immediately
Prototype POP3 Service

POP3 – A Commit and Refresh Mechanism

POP3 is a popular protocol for accessing a mailbox full of email messages. While small devices have moved their mail reading apps to IMAP and proprietary protocols, POP3 remains the preferred protocol for moving email messages between big servers where a no-frills, download-and-delete system is preferred.

A problem with this protocol is embodied in this question: “How often should we poll for new messages?” There’s a non-trivial overhead to connecting. Poll too quickly and you overload the system. Poll too far apart and messages take too long to arrive.

Over my head!

To recap, let’s take a look at what needs to happen every time a POP3 client wants to check for new messages.

  • The client and server handshake TCP as the underlying connection.
  • The client and server handshake TLS for security.
  • The client authenticates itself to the server.
  • The client finally gets to ask if there are any new messages.

“Oh, no new messages? Okay, I’ll go through all that again in five seconds.”

POP3 doesn’t have a way to avoid this continual opening and closing of connections, but it does have a mechanism to add extensions to the protocol. All it needs is for someone to write the new extension down and to develop a working prototype. Which I have done.

“I have no time for your prototype!”

billpg industries POP3 service

On my github account, you’ll find a prototype POP3 server that implements this extension. Download it, compile it, run it. Go nuts. The service is written using the “listener” model. You set it up to listen for incomming connections and it talks the protocol until you shut it down. The library deals with the complexities while requests for messages are passed onto your provider code.

You don’t need to write that provider code if you only want to try it out. I’ve written a basic Windows app where you can type in new messages into a form, ready for the client to connect and download them. If Linux is more your thing or you prefer your test apps to work autonomously, I’ve also written a command-line app that sits there randomly populating a mailbox with new messages, waiting for a client to come along and get them.

To Sleep, and Goodnight…

Now for the new extension itself. It comes in two parts, the SLEE command put a connection to sleep and the WAKE command brings it back.

If a server normally locks a mailbox while a connection is open, then SLEE should release that lock. In fact, SLEE is defined to do everything that QUIT does, except actually close down the connection. Crucially, this includes committing any messages deleted with DELE.

During a sleeping state, you’re no longer attached to the mailbox. None of the normal commands work. You can only NOOP to keep the underlying connection alive, QUIT to shut it down or WAKE to reconnect with your mailbox.

If the server responds to WAKE with a +OK response, a new session has begun. The refreshed connection needs to be viewed as if it is a new connection, as if the client had QUIT and reconnected. The numeric message IDs from before will now be invalid and so the client will need to send a new STAT or UIDL command to update them.

In order to save the client additional effort, the server should include a new response code in the +OK response.

  • [ACTIVITY/NEW] indicates there is are new messages in the mailbox that were not accessible in the earlier session on this connection.
  • [ACTIVITY/NONE] indicates there are no new messages this time, but it does serve as an indication that this server has actively checked and that it is not necessary for the client to send a command to check.
  • (No “ACTIVITY” response indicates the server is not performing this test and the client will need to send a STAT or UIDL command to retrieve this information.)

A server might give an error response to a WAKE command, which may include a brief error message. In this situation where a connection can’t be refreshed for whatever reason, a client might chose to close the underlying connection and open a new one.

The error might include the response code [IN-USE] to indicate that someone else is connected to the mailbox, or [AUTH] to indicate that the credentials presented earlier are no longer acceptable.

Note that the SLEE command is required to include a commit of DELE commands made. If the client does not want the server to commit message deletes, it should send an RSET command first to clear those out.

How would a client use this?

To help unpack this protocol extension, here is description of how the process would work in practice with a server that implements this extension.

The client connects to the server for the first time. It sends a CAPA request and the response includes SLEE-WAKE, which means this connection may be pooled later on.

S: +OK Welcome to my POP3 service!
C: CAPA
S: +OK Capabilities follow...
S: UIDL
S: RESP-CODES
S: SLEE-WAKE
S: .

The client successfully logs in and performs a UIDL which reveals three messages ready to be downloaded. It successfully RETRs and DELEs each message, one by one.

C: USER me@example.com
S: +OK Send password.
C: PASS passw0rd
S: +OK Logged in. You have three messages.
C: UIDL
(Remainder of normal POP3 session redacted for brevity.) 

Having successfully downloaded and flagged those three messages for deletion, the client now sends a SLEE command to commit those three DELE commands sent earlier. The response acknowledges that the deleted messages have finally gone and the connection has entered a sleeping state.

C: SLEE
S: +OK Deleted 3 messages. Sleeping.

The client can now put the opened connection in a pool of opened connections until needed later. It is not a problem if the underlying connection is closed without ceremony in this state, but it may be prudent for the pool manager to periodically send a NOOP command to keep the connection alive and to detect if any connections have since been dropped.

C: NOOP
S: +OK Noop to you too!

Time passes and the client wishes to poll the mailbox for new messages. It looks in the pool for an opened connection to this mailbox and takes the one opened earlier. It sends a WAKE command to refresh the mailbox.

C: WAKE
S: +OK [ACTIVITY/NONE] No new messages.

Because the server supports the ACTIVITY response code and the client recognized it, the client immediately knows that there is nothing left to do. The client immediately sends a second SLEE command to put right back into the sleeping state.

C: SLEE
S: +OK Deleted 0 messages. Sleeping.

(Incidentally, the client is free to ignore the “ACTIVITY” response code and instead send a STAT or UIDL command to make its own conclusion. It will need to do this anyway if the server does not include such a response code.)

More time passes and the client is ready to poll the mailbox again. As before, it finds a suitable opened connection in the pool and it sends another WAKE command.

C: WAKE
S: +OK [ACTIVITY/NEW] You've got mail.

Having observed the notification of new mail, the client sends a new sequence of normal POP3 commands.

This was all within one connection. All the additional resources needed to repeatedly open and close TCP and TLS are no longer needed.

billpg.com

I’ll be doing the job of turning all this informal text into a formal RFC on my github project.

Picture Credits:
📷 Keyboard Painting Prototype by Rodolphe Courtier.