Archive for the 'The Web' Category

Ordering Online

Wednesday, January 11th, 2006

Some notes from my experiences with ordering online.

Provide accurate links to manufacturer products

I want to know what I’m buying, dammit. Not providing links is mildly irritating; providing broken links, or links to dealer logins or suchlike, is just downright annoying.

Send me email!

TELL ME WHAT’S HAPPENING. Getting one email when I submit the order which says “We got your order” is important, but so are the emails saying “We’ve confirmed we got your money”, “We’ve shipped your item”, “Oops! We ran out of stock!” and the various other things that can happen. If something is on backorder gimme an update every so often (weekly? monthly?). Try to keep your email from being picked up by spam scanners: do put valid headers on it (eg Message-ID:, X-Mailer:, MIME-Version: etc). Do put text and HTML variants into your email. Do make sure my order number is somewhere obvious (subject?). Do try and batch the emails you send me so they don’t arrive more often than once every 2 to 4 hours. Getting a flurry of emails is mildly annoying. Don’t spam me with promotional updates, I don’t care. An RSS feed maybe.

Do make sure that the Reply-To: address of the email reaches a real human. Do make sure that the From: address is something sane looking that I can pick out of an inbox (eg “ACME Order Update “). Don’t change the From: address, I want to filter it into a box that never gets spam scanned. Never have your email come from a different domain than your website!
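As a rough illustration of the header advice above, here is a minimal sketch using Python's standard `email` package. The addresses, the `acme.example` domain and the `build_order_email` helper are made up for the example; it only builds the message and does not send it.

```python
from email.message import EmailMessage
from email.utils import formatdate, make_msgid


def build_order_email(order_number, status_text, to_addr):
    """Build an order update with sane headers and text + HTML variants."""
    msg = EmailMessage()
    # Hypothetical addresses: same domain as the (imaginary) shop website.
    msg["From"] = "ACME Order Update <orders@acme.example>"
    msg["Reply-To"] = "ACME Support <support@acme.example>"
    msg["To"] = to_addr
    # Order number goes first in the subject so it is easy to spot and filter.
    msg["Subject"] = f"Order {order_number}: {status_text}"
    msg["Date"] = formatdate(localtime=True)
    msg["Message-ID"] = make_msgid(domain="acme.example")
    # Plain text body first, then an HTML alternative (multipart/alternative).
    msg.set_content(f"Order {order_number}: {status_text}\n")
    msg.add_alternative(
        f"<p>Order <b>{order_number}</b>: {status_text}</p>", subtype="html"
    )
    return msg


if __name__ == "__main__":
    print(build_order_email("A1234", "shipped", "customer@example.org"))
```

Handing the result to `smtplib` (or whatever actually sends your mail) is left out on purpose.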

Let me see updates through your website

Either give me a login, or a URL that refers to an order and shows me the current state of all the items in it. I don’t want to have to email you twice a day for my information fix. You probably don’t want to receive those emails either. This is even more important the fewer emails you send me. Remind me in your emails how to check the status through the website.

Tell me the damn track and trace numbers

For every item I want to know which courier it was sent with, and what their track and trace number is. Bonus points if you make this a hyperlink that takes me to the courier company’s website with information about that code. Courier companies are notoriously bad at delivering things properly, so make sure you place the blame on them when blame is due: let the end user see that the product has shipped and that the couriers are stuffing around.

Make your order numbers short

Don’t make them so long they don’t fit in fields! In particular if you want me to do a bank transfer into your account with your order number as a reference, make sure the order number *fits* in the reference area of a bank statement!

Distributed Caching

Sunday, November 6th, 2005

One neat idea I thought of would be a content addressable web. Have a DHT that indexes sha1(content) -> URL(content). If you have another copy of the content, you update the DHT to register your “mirror” of that content. When you want to fetch a document with a URI of say sha1:0a4d55a8d778e5022fab701977c5d840bbc486d0 then you look up the DHT for 0a4d55a8d778e5022fab701977c5d840bbc486d0, which returns you a list of URLs. You can pick the “best” one, a random one, or break the file up into sections and fetch it using byte ranges from multiple places simultaneously. Once you have the file, you save it into your local “cache” which is exported to the world via HTTP, and you register your new URL in the DHT.

If you want to do this as a normal web proxy you could also store in the dht sha1(url) -> sha1(content) mappings too.
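The lookup, fetch, verify, register cycle can be sketched roughly as below. This is only a toy: `ToyDHT` is an in-memory dictionary standing in for a real DHT, and `fetch` is any function you supply that maps a URL to bytes (or `None` on failure). All of these names are invented for the example.

```python
import hashlib
import random


def sha1_hex(data):
    """Hex sha1 of a bytes object, used as the content address."""
    return hashlib.sha1(data).hexdigest()


class ToyDHT:
    """In-memory stand-in for a DHT: sha1(content) -> set of mirror URLs."""

    def __init__(self):
        self.mirrors = {}

    def register(self, digest, url):
        self.mirrors.setdefault(digest, set()).add(url)

    def lookup(self, digest):
        return sorted(self.mirrors.get(digest, ()))


def fetch_by_hash(dht, digest, fetch, my_url, rng=random):
    """Try mirrors in random order until one returns content matching digest."""
    candidates = dht.lookup(digest)
    rng.shuffle(candidates)
    for url in candidates:
        data = fetch(url)
        if data is not None and sha1_hex(data) == digest:
            # We hold a verified copy now, so advertise ourselves as a mirror.
            dht.register(digest, my_url)
            return data
    raise LookupError(f"no mirror returned content matching sha1:{digest}")


if __name__ == "__main__":
    web = {"http://a.example/f": b"hello", "http://evil.example/f": b"bogus"}
    dht = ToyDHT()
    key = sha1_hex(b"hello")
    for url in web:
        dht.register(key, url)
    print(fetch_by_hash(dht, key, web.get, "http://me.example/cache/f"))
```

Note that a bogus mirror costs a wasted download but never gets accepted, because the final sha1 comparison fails.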

Vulnerabilities:

  • People could register sha1(content) -> url(other content). This would be obvious (after you’ve downloaded the content), as the content you downloaded didn’t match the sha1(content) key you originally had. This could be rather annoying if the file is large. Possibly you could grab random byte ranges that you already have from other url’s, and verify they are identical. Those byte ranges could be short (1k?), so long as they are random then you can use it to verify that the content is legit. If the byte ranges disagree you’re kinda screwed tho, which one is correct? If you have sha1(url)->sha1(content) mappings, you can grab the same byte range from the original url and use it to prove which one is correct conclusively.
  • To get around the above issue, the attacker may attack the original url so that you cannot verify the content.
  • An attacker may also have most of the file intact, but have corrupted small sections. At the end of the download you know that the file is corrupt (because the sha1(content) doesn’t match), but you don’t know where the error is. rsync’ing with the original url will find and correct the error fast enough, but who runs an rsync server on their webserver anyway?
  • People could register bogus sha1(url) -> sha1(content) mappings. There is no way to detect this. Maybe do random byte ranges from the original url, and verify it against the content you’re downloading. This also has the problem above.
  • sha1(url) -> sha1(content) mappings may become obsolete quickly. If the original webserver supports it you could use sha1(len(url)+”:”+url+lastmodified), where lastmodified comes from a HEAD request. We could also ask for an X-Sha1: header which we could use directly, bypassing the DHT sha1(url)->sha1(content) mapping altogether.
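Two of the ideas in the list above are easy to sketch: the versioned key sha1(len(url)+":"+url+lastmodified), and spot-checking a download against random byte ranges from the original URL. In this sketch `trusted_range(start, end)` is a placeholder for "fetch bytes [start, end) from the original URL with a Range request"; the function names and defaults are assumptions of mine.

```python
import hashlib
import random


def url_version_key(url, last_modified):
    """sha1(len(url) + ":" + url + lastmodified) as a hex string."""
    material = f"{len(url)}:{url}{last_modified}"
    return hashlib.sha1(material.encode("utf-8")).hexdigest()


def spot_check(trusted_range, candidate, samples=4, size=1024, rng=random):
    """Compare a few random byte ranges of candidate with the trusted source.

    Returns False as soon as one range disagrees. Passing does not prove the
    whole file is intact; the full sha1(content) check still has the last word.
    """
    if not candidate:
        # An empty candidate is only right if the original is empty too.
        return trusted_range(0, 1) == b""
    for _ in range(samples):
        start = rng.randrange(len(candidate))
        end = min(start + size, len(candidate))
        if trusted_range(start, end) != candidate[start:end]:
            return False
    return True


if __name__ == "__main__":
    original = bytes(range(256)) * 64
    print(spot_check(lambda s, e: original[s:e], original))
    print(url_version_key("http://a/", "Sun, 06 Nov 2005 00:00:00 GMT"))
```

A handful of 1k samples will catch a wholesale fake almost immediately, but a file with one small corrupted section can easily slip through, which is the weakness described in the third bullet.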

Hopefully this would mean that the original hoster has a much lower bandwidth load, they only occasionally have people requesting small byte ranges off them to do verification, most of the bandwidth consumption is actually being downloaded from other hosts. Popular files get quickly mirrored around the internet.

Abusers will find that their “corrupt” data quickly gets drowned out by lots of people with uncorrupt versions. If you pick one at random then chances are you’ll pick the uncorrupt versions. However, there is still the gazillion trojanned hosts approach to poisoning the DHT. I guess if you give much higher preference to people who have given you good data in the past, then you will tend to avoid these. The worst an attacker can do anyway is slow down your download massively (by forcing you to waste time and download multiple, useless copies) rather than convince you to accept corrupted data.

All of this could very easily be written in an afternoon as a web proxy by someone who knows Java and/or C#, as both languages have practically everything here as libraries; all that is required is writing a bit of glue.

Neat things google should do

Friday, September 23rd, 2005

My list of feature requests for google web search.

  1. In search results when you click on a link, it should take you to the nearest #target, it’s annoying having to search within a page to find what google is pointing you at.
  2. Provide a page like the sitemap status page where google suggests things about your site. Eg, “/index.php has a broken link on it pointing to /404.html”, “/broken.html is pointed to by offsite page http://www.example.com/”, “foo.zip is served with a text/plain content-type”, “/foo.xml is served with an xml content-type, but doesn’t validate as xml”, etc. I’d fix problems on my website if I knew what they were.
  3. Provide a mechanism to exclude part of a page from being used for search words. Eg, the wlug wiki has/had the site’s RecentChanges on every page. If you googled for “SNMP” it would return you the samba page. Why? Because the samba page had a very high page rank, and included the “SNMP” page in the RecentChanges area. I want google to crawl the links in RecentChanges, I just don’t want it to consider terms in there as being relevant. No, rel=”nofollow” doesn’t cover it.

RSS and If-Modified-Since

Sunday, July 17th, 2005

I can’t be the only one to have had this thought. All good RSS readers send If-Modified-Since: headers, and good RSS producers send back a “304 Not Modified” response if the feed has not been modified. But why don’t RSS producers send back a dynamically sized list of the items created since the date in the If-Modified-Since: header (maybe +1)? If no If-Modified-Since: header is sent you fall back to the last “n” items as normal. You can cap it at no more than “m” items being sent. Very simple, very useful.
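A minimal sketch of that selection logic, assuming the feed items are kept newest first as (published, title) pairs with timezone-aware dates. The function name, the `overlap` parameter (my reading of the "maybe +1") and the default values of n and m are all made up for the example.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def select_items(items, if_modified_since=None, n=10, m=50, overlap=1):
    """Return (http_status, items_to_send) for one feed request.

    items: list of (published, title) tuples, newest first.
    """
    if if_modified_since is None:
        return 200, items[:n]  # no header: the usual last n items
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200, items[:n]  # unparseable header: behave as if absent
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    newer = sum(1 for published, _ in items if published > since)
    if newer == 0:
        return 304, []  # nothing new: Not Modified
    # Everything newer than the header date, plus a little overlap, capped at m.
    return 200, items[:min(newer + overlap, m)]


if __name__ == "__main__":
    from datetime import datetime

    feed = [
        (datetime(2005, 7, 17, 12, tzinfo=timezone.utc), "c"),
        (datetime(2005, 7, 16, 12, tzinfo=timezone.utc), "b"),
        (datetime(2005, 7, 15, 12, tzinfo=timezone.utc), "a"),
    ]
    print(select_items(feed, "Sat, 16 Jul 2005 18:00:00 GMT"))
```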

Server Name Indication, or how to virtual host SSL.

Friday, March 25th, 2005

So after reading chipux’s blog entry on TLS Upgrade in HTTP/1.1 I decided that I should get on and do some coding for mozilla, and have an attempt at implementing this. It would solve a problem that I’ve had for ages of virtual hosting SSL connections.

I quickly remembered why I hate state machines, and how complicated HTTP really is, and how complicated SSL is, and trying to do both together is just even more complication. But then someone pointed out Mozilla bug 116168 (TLS server name indication extension support in NSS). After reading RFC 3546: Transport Layer Security (TLS) Extensions, I decided that it’s probably the better way to go. It allows for virtual hosting of more than just HTTP: SMTP, IMAPS, POPS, LDAPS etc. too. The bug for this is Mozilla Bug 116169: Browser support for TLS server name indication. So I scrapped my earlier implementation of TLS Upgrade and started implementing this instead. It turned out to be very easy, only 20 or so lines of very simple code. The most complicated function is strlen(3). The only problem I had was that ss->url actually contains a hostname, not a URL. Solved.

Now, for a minor diversion. openssl doesn’t seem to support Server Name Indication. So the usual apache SSL libraries (which use openssl) can’t support Server Name Indication. But chipux to the rescue again, with his mod_gnutls module for apache. This module uses gnutls instead of openssl for providing SSL/TLS support. And gnutls does support Server Name Indication.
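To show what the extension amounts to on the wire, here is a small sketch using Python's standard `ssl` module (not the NSS patch described above): the client puts the hostname it wants, in the clear, into the very first handshake message, which is what lets a server pick the right certificate for a virtual host. The `client_hello_for` helper is invented for the example, and no network connection is made.

```python
import ssl


def client_hello_for(hostname):
    """Return the raw bytes of the first TLS message a client would send."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    incoming, outgoing = ssl.MemoryBIO(), ssl.MemoryBIO()
    # server_hostname is what ends up in the server_name (SNI) extension.
    tls = ctx.wrap_bio(incoming, outgoing, server_hostname=hostname)
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        pass  # expected: the ClientHello is written, now it wants a reply
    return outgoing.read()


if __name__ == "__main__":
    hello = client_hello_for("www.example.org")
    print(b"www.example.org" in hello)
```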

So now I have to test my module, and that involves compiling a more up to date version of apache. Sigh.

The answers to the journal problem?

Monday, October 4th, 2004

Scientific journals take the paper you’ve worked hard on, organise (unpaid) people in the field to peer review it, publish it, and then charge everyone to read your work. Computer scientists in particular are in a good position to completely avoid this. We have the ultimate publishing infrastructure and it’s called the Internet. It’s nearly free to publish new papers, and it’s nearly free to “subscribe” to papers. However it’s missing the all important peer review step.

Well, you could allow anyone at any time to review a paper; we kind of have this now with people sometimes publishing “response” papers. However the journals choose people to do the reviewing that have a clue. The Internet certainly has its share of people with a clue, however there are a lot of people out there that don’t know what they’re talking about. So obviously we need some kind of “rating” of how much weight someone’s view carries on a certain topic. Note that people who are good in one area aren’t immediately to be assumed to be good in another. I may be able to write two lines of code without looking like a total fool, but don’t expect me to be able to figure out how to do neurosurgery.

Who do we trust? Well, we can trust people who have published papers in peer reviewed journals. We can also push out the trust metric, like google’s page rank, based on citations. Good papers get cited regularly, good people cite good papers. Somehow we need to find some way of determining if a paper is within someone’s field or not. Maybe by looking at the papers they cite? Is this too easy to abuse? Perhaps something using something like the nzdl’s “phrasier” to search for key phrases and group based on that?
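For the curious, the page rank style propagation over citations might look something like this rough sketch. It is a plain power iteration over a dictionary of "paper -> papers it cites"; the function name, the damping factor of 0.85 and the iteration count are conventional choices of mine, not anything specific to this proposal, and it ignores the "is this in their field?" question entirely.

```python
def citation_rank(cites, damping=0.85, iterations=50):
    """Score papers so that being cited by well-scored papers scores well.

    cites: dict mapping a paper id to the list of paper ids it cites.
    Returns a dict of paper id -> score; the scores sum to 1.
    """
    papers = set(cites) | {p for targets in cites.values() for p in targets}
    if not papers:
        return {}
    n = len(papers)
    rank = {p: 1.0 / n for p in papers}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in papers}
        for p in papers:
            # A paper that cites nothing spreads its score over everyone.
            targets = cites.get(p) or papers
            share = damping * rank[p] / len(targets)
            for t in targets:
                new[t] += share
        rank = new
    return rank


if __name__ == "__main__":
    scores = citation_rank({"a": ["c"], "b": ["c"], "c": []})
    print(max(scores, key=scores.get))
```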

A good search interface over this is of course necessary; being able to quickly and easily find papers that relate to a topic is very important. You could even earn money from it by allowing universities and other educational institutions to sign up for “gold” services including things like email/RSS feeds for:

  • A paper that cites a given paper is “published”
  • A review for a given paper
  • A paper is written in your “field” (for reasonably narrow definitions of narrow)
  • Any paper written by a certain author
  • Any review written by a certain author

Maybe publish a dead tree copy of papers in a specific “field” at regular intervals. Google ads could possibly also work, although I’m not sure how applicable they could be. (Papers by definition don’t usually talk about things people have products for yet…)

Now who should do this? The people that run Citeseer would be an obvious group to do it, they have the database more or less already there. They just don’t seem to be well funded enough to make good use of their database (adding a new paper to their database seems ad hoc at best). All they’d need to do is allow adding “comments” on a paper, some login system, and the “rating” system. Google would be another excellent option, probably starting by purchasing Citeseer. The NZDL people are big on search too. They could join in the fray. Maybe someone else?

The world is ready for a revolution in this space, it’s begging for someone to do it, where are they?

Programming for the Web.

Thursday, February 12th, 2004

Programming for the web today, IMHO, is far harder than it should be.

The problem

PHP and ASP have done wonderful things for improving interactivity on the web. However, as anyone who has done serious programming in PHP will tell you, it’s a nasty language to use for writing “applications.” Mixing HTML and code sounds like a great idea when you mostly have HTML and just a hint of code, but when you have a whole chunk of code and just a hint of HTML then it’s a pain. Most PHP starts with “<?” and goes downhill from there.

PHP has had a whole slew of bad design decisions. Registering global variables being obvious amongst them, but less well known are issues like form posting and the like. PHP doesn’t treat HTML as a “First class” object, you can’t pass around HTML fragments and process them easily. This is surprising as the language is designed to deal with HTML! PHP in general doesn’t even make things like talking to a database easy.

PHP allows people to use classes and OOP, however its “built in” libraries are all done in the procedural style. (Mostly, I suspect, due to the fact that people are just wrapping the C APIs.) PHP 5 promises to resolve a lot of these issues, and I hope it will.

So, I propose, the world needs yet-another language to fill this space.

The requirements

  1. Must be simple

    The language must be easy for nearly-programmers to write in. C may be a powerful language, but, in the end, people just want the job done.

  2. Should treat HTML as a first class object

    You should be able to pass around HTML fragments, and do things like iterate over them.

  3. Generated HTML should validate

    It should take significant effort to generate broken HTML, or, ideally be impossible. The engine should be able to use browser detection to serve up the best possible HTML for your browser automatically.

  4. Should be easily “Themable” or “Skinnable”.

    Content != Display.

The Solution

I propose (with help from Jon/Matthias) that we create a new engine to replace PHP. It should use Javascript as the programming language, and you should pass around a DOM of the page you are trying to construct. You should be able to provide an XSLT filter to “skin” your page. (And if the browser supports it, this could be done on the browser side.)

I believe that you want to use Javascript, as the language is mature, it’s modern, and it’s easy to write and to find documentation about. It was designed to deal with the web (admittedly from the reverse point of view, on the client) and the skills learnt on the server side can be used to make client side scripts as well.

The DOM is a well documented and well understood way of handing around HTML/XML document fragments, which is advantageous as outlined above.

And XSLT is, well, a good idea. It’s a pain to have to go and change all your code to add a new stylesheet, particularly if you are trying to integrate multiple, unrelated projects (phpMyAdmin + phpWiki + cacti + a few index pages to glue it all together makes a great internal website, but is a pain to theme so it all looks the same).
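To give a feel for “HTML as a first class object”, here is a small sketch of passing fragments around as a tree rather than as strings. It uses Python’s `xml.etree.ElementTree` purely for illustration (the proposal above is for Javascript and the DOM, and the XSLT skinning step is not shown); `link_list` and `page` are made-up helper names.

```python
import xml.etree.ElementTree as ET


def link_list(links):
    """Build a <ul> fragment from (text, href) pairs."""
    ul = ET.Element("ul")
    for text, href in links:
        a = ET.SubElement(ET.SubElement(ul, "li"), "a", href=href)
        a.text = text
    return ul


def page(title, *fragments):
    """Assemble a whole document out of already-built fragments."""
    html = ET.Element("html")
    head = ET.SubElement(html, "head")
    ET.SubElement(head, "title").text = title
    body = ET.SubElement(html, "body")
    body.extend(fragments)
    return html


nav = link_list([("Home", "/"), ("Q&A", "/faq?a=1&b=2")])
for a in nav.iter("a"):  # fragments can be iterated over and modified
    a.set("class", "nav")

# Serialising the tree does the escaping, so the markup stays well formed.
markup = ET.tostring(page("Demo", nav), encoding="unicode")

if __name__ == "__main__":
    print(markup)
```

Because the code never concatenates strings of markup, forgetting to close a tag or escape an ampersand is simply not something it can do.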

XML II?

Wednesday, January 21st, 2004

There is a lot of debate at the moment about whether XML parsers should try to recover from non-well-formedness. One side argues that no one is going to parse non-well-formed XML the same way twice, so who knows what you would end up with; the other side argues that you should at least attempt to do something.

Now, my thoughts are that if you made it so that non-well-formed XML had no obvious meaning, people would be less inclined to try and correct it. For instance, XML has </tag> tags at the end of a block, but if the closing tag didn’t name what it closes (say </>), then it’s ambiguous which tag it’s closing unless the document is well formed. As anyone who’s spent time trying to find out which } is missing in a C/C++/Java/etc program will tell you, it’s nearly impossible to do if you’re a human, let alone write a program to do anything sane with it.

The next thing is to make sure that attributes have closing “’s. Well, no one really knows what an XML attribute is anyway, so get rid of ‘em. So we have instead:


And well, that’s kinda messy, so let’s replace <a>b</> with (a b) and uh, wee! sexps!

my RFC index

Wednesday, January 21st, 2004

Every time I look up an RFC, I want to find out more about it (which are the useful RFCs that update this one? What other RFCs are there on this topic?). So I wrote my own RFC index program. For a given RFC number, it provides links to:

  • the HTML version of an RFC (it’s easier on the eyes)
  • RFCs published at about the same time
  • RFCs by the same author
  • all the RFCs that update this one
  • RFCs updated by this RFC
  • RFCs that obsolete this RFC
  • RFCs that this RFC obsoletes
  • RFCs that reference this RFC
  • RFCs that this RFC references

I hope you find this vaguely useful.

XML

Wednesday, January 14th, 2004

Today’s rant is brought to you by XML. In theory XML is a very smart idea. You can extend it, and the rules are so blindingly trivial you can’t set a foot wrong. Everyone can hack up an XML parser in 10 minutes (but why would you want to? there are parsers falling off trees) and everyone knows exactly what’s going on.

Except every single XML file I’ve tried to grab off the ‘net isn’t well formed. People forget to escape &s in attributes, people use namespaces that they’ve not declared, or sometimes just don’t bother with the whole “closing a tag” thing. C’mon, people, HOW DIFFICULT IS THIS?! Sheesh.
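Checking is not hard either. A minimal sketch, using the parser in Python’s standard library (`well_formed` is just a name picked for the example), that catches all three of the mistakes above:

```python
import xml.etree.ElementTree as ET


def well_formed(text):
    """Return (ok, error_message) for a string that claims to be XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return False, str(err)
    return True, ""


if __name__ == "__main__":
    samples = [
        '<a href="x?a=1&amp;b=2">fine</a>',
        '<a href="x?a=1&b=2">unescaped ampersand</a>',
        "<x:a>undeclared namespace</x:a>",
        "<a><b>never closed</a>",
    ]
    for sample in samples:
        print(well_formed(sample), sample)
```

Run your feed through something like this before you publish it.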