Building Hitchhikers Guide To The Galaxy

March 17th, 2009

I’m (hopefully) going travelling soon, and I’d like to have ready access to Wikipedia so I can look things up and generally keep up with what I should know while I’m visiting places. So I’ve spent some time trying to figure out how to get data plans for my phone.

This morning I had an epiphany: why not download the Internet before I leave? My Nokia E66 can take a microSD card (according to Nokia it only supports cards up to 8GB, although I don’t see any reason why it wouldn’t support a 16GB card).

There are several articles on how to build an offline Wikipedia. I like the idea of having the entire compressed text-only English version of Wikipedia (~4.5GB), maybe with the compressed OpenStreetMap data (~5.2GB) to provide some geo-location while offline. And for good measure, maybe a compressed Freebase dump (~1.2GB) to provide more links between articles and information on regions (Wikipedia tends to represent a region by a single point, which is not useful).

Hopefully with a bit of hacking, I can end up with something like Mobilizy’s Wikitude, using the GPS for location and the accelerometers to figure out direction and rotation in 3D space (Android has an electronic compass, which makes this a bit easier for them), and using Freebase + Wikipedia to annotate the current scene.

What would be really cool would be to have a HUD that overlaid Wikipedia articles on your current view from your phone. Although I suspect no matter what you do, you’ll end up looking like a tool.

It’s scary that the sum of human knowledge (Wikipedia + OpenStreetMap + Freebase) fits in ~11GB (ignoring indexes and the like), and that I can fairly easily fit this onto my cellphone, with heaps of room to spare!

New Zealand Copyright Amendment

January 16th, 2009

Some interesting points I’ve not heard anyone mention about the New Zealand Copyright Amendment (IANAL, YMMV, …).

The submission format

92D Requirements for notice of infringement
A notice referred to in section 92C(3) must—
(a) contain the information prescribed by regulations made under this Act; and
(b) be signed by the copyright owner or the copyright owner’s duly authorised agent.

I’m unaware of any regulations made under this Act so far, so currently you can’t create a notice of infringement that contains the information prescribed by regulations… yet.

The submission format II
Although there hasn’t been any discussion about the submission format yet, it concerns me that you obviously need enough information to uniquely identify the copyright infringer (either the person or the account). If an ISP’s business model involves putting customers behind a NAPT box, then a timestamp and an IP address are not sufficient to uniquely identify the user; you need at least a timestamp and the 5-tuple used (protocol, source address, source port, destination address, destination port). This is particularly concerning given that we are rapidly running out of IPv4 addresses, and one of the suggested solutions is to place as many customers as possible behind a service provider NAT box. Since connections through a NAPT box are far more ephemeral than IP address allocations, timestamps must be more precise and more accurate.

Which customer an IP address is assigned to is usually stored along with the rest of the accounting information in RADIUS, so an ISP records it essentially for free. Having to record every connection through a NAPT box would incur a serious overhead and a data management problem for an ISP. Also, how long should an ISP hold on to this information, so that it can process these notices of infringement, before it can discard it?
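To make the ambiguity concrete, here is a minimal sketch of the lookup an ISP would have to do. The log format, the `NatLogEntry` fields and the customer names are all made up for illustration; real NAPT logging will look different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NatLogEntry:
    """One translated connection (hypothetical log record)."""
    customer: str
    start: float      # connection start, seconds since some epoch
    end: float        # connection end
    proto: str
    src_ip: str       # public address after translation
    src_port: int     # public port after translation
    dst_ip: str
    dst_port: int


def customers_by_ip(log, ip, when):
    """Everyone who was using the public address at that time."""
    return {e.customer for e in log
            if e.src_ip == ip and e.start <= when <= e.end}


def customers_by_5tuple(log, proto, src_ip, src_port, dst_ip, dst_port, when):
    """Everyone whose connection matches the full 5-tuple at that time."""
    wanted = (proto, src_ip, src_port, dst_ip, dst_port)
    return {e.customer for e in log
            if (e.proto, e.src_ip, e.src_port, e.dst_ip, e.dst_port) == wanted
            and e.start <= when <= e.end}


# Two customers share one public address behind the NAPT box.
LOG = [
    NatLogEntry("alice", 100.0, 160.0, "tcp", "203.0.113.7", 40001, "198.51.100.9", 80),
    NatLogEntry("bob", 120.0, 130.0, "tcp", "203.0.113.7", 40002, "198.51.100.9", 80),
]
```

With only an address and a timestamp, `customers_by_ip(LOG, "203.0.113.7", 125.0)` cannot tell Alice and Bob apart; the 5-tuple lookup can, but only if the ISP kept a record of every connection.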

You can only disconnect people.

92A Internet service provider must have policy for terminating accounts of repeat infringers
(1) An Internet service provider must adopt and reasonably implement a policy that provides for termination, in appropriate circumstances, of the account with that Internet service provider of a repeat infringer.
(2) In subsection (1), repeat infringer means a person who repeatedly infringes the copyright in a work by using 1 or more of the Internet services of the Internet service provider to do a restricted act without the consent of the copyright owner.

This leads me to some interesting questions: If Alice is a member of an organisation, and the organisation has an account, and Alice infringes people’s copyright repeatedly, then the account that Alice is using is the organisation’s, not Alice’s. Is the organisation (perhaps Alice’s place of work) considered an ISP? In the more obvious case, if Bob sits at an Internet Cafe and infringes people’s copyright, can the Internet Cafe’s account get shut down? If the Internet Cafe buys its bandwidth from LittleIspInc, can LittleIspInc’s account get shut down by their upstream? What should happen if UpstreamInc receives a notice for Bob’s infringement? Obviously it should pass it to LittleIspInc, and LittleIspInc should pass it on to the Internet Cafe, who should terminate Bob’s account. In this case, Bob probably doesn’t even have an account at all. Are Internet Cafes going to require ID so they can check people against lists of previously banned customers?

If LittleIspInc gets a series of notifications from UpstreamInc, should LittleIspInc be cut off, even though it’s multiple different downstream customers of LittleIspInc that have been infringing? Should the Internet Cafe get cut off if it has multiple different customers infringe? What if the Internet Cafe places everyone behind NAPT, and the infringement notices aren’t specific enough to identify an individual person?

Fake notices of infringement
While I’m not a lawyer, I’m sure there are laws already about sending fake infringement notices. So anyone who’s doing this maliciously is likely to get themselves into trouble.

False Positives
Ok, this one I have seen people talk about at length. There is no incentive for people sending notices of infringement to make sure they aren’t generating false positives. If people are too abusive they will probably end up running into trouble, but as long as they put in a reasonable effort, it seems to me that they are likely to get away with it.

I’ve seen people sent takedown notices for OpenOffice because some automated tool decided it was actually Microsoft Office (at the time, an unintended compliment I’m sure). I have seen people asked to take their photos down because someone /else/ had permission to use the photo and was incorrectly believed to be the copyright holder.

Under this law, you appear to have no right of reply, no way to state your case and point out that you are innocent. ISPs don’t appear to have the right to judge the quality of a notice of infringement (not that the ISPs want this responsibility).

What’s an ISP?
I can’t find a definition anywhere of what is considered an ISP. Does it include anyone providing IPv4/IPv6 connections? If I run a public IPv4 network that doesn’t connect with the Internet, am I an ISP? If I run a public packet switched network (such as X.25), am I an ISP? Is a disconnected UUCP graph considered an ISP? Is a FidoNet BBS considered an ISP, given that you can send FidoNet files and emails around (even though no one in a FidoNet network need be connected to “The Internet”)? Is the phone system an Internet, given that I can dial anyone and send them data via a modem? Can I call Telecom and get them to disconnect an account for infringing my copyright?

In Summary
I don’t like this law. It seems to have too many problems. It appears that it could force ISPs to use real-world IPv4 addresses where their use is unwarranted and impractical, thus hastening the depletion of the IPv4 address pool. I am not a lawyer; I’m trying to interpret this the best I can without any formal law training, but I do know something about the technology from the ISP point of view.

Things you probably never wanted to know about MP3 files

March 20th, 2008

I’ve been playing around with the mp3s I’ve got here, trying to automatically find the ones that are broken and need to be reripped from my original CDs. Some have been truncated over the years, some are just encoded at horrible bitrates that make your ears bleed, and some are missing tracks from the album. Some have ID3v1 tags, some have ID3v2 tags, some have both but they disagree on the information. Some don’t have any tags, some are just named wrong.

So I thought I would write a script using a nice library to find out the length of the song, pull out all the metadata from the ID3v1 and ID3v2 tags, and look the data up in the MusicBrainz database to check for accuracy. Then write out nice clean mp3 files with consistent ID3v1 and ID3v2 tags (annoyingly some older hardware mp3 players can only read ID3v1 tags, but ID3v1 is so crappy you really want to use ID3v2 where you can). I thought this should be a nice straightforward evening’s work. Oh boy was I wrong.

None of the libraries I looked at would tell me the length of a track. Most of them would decode the ID3v2 and ID3v1 tags, but most would only give you the ID3v2 information if there was an ID3v2 tag and ignore the ID3v1 information (so I couldn’t easily make sure both tags were consistent). Most didn’t appear to give me the option to write out an ID3v2 and an ID3v1 tag into the same file. Now I expect there probably are libraries out there that do this, but it was disheartening enough that I decided to write my own mp3 parser library. I’m glad I did; I learnt a lot of interesting things along the way.

I started off reading the mp3 frame header specification. An mp3 file is basically a series of frames, one after another. Each frame begins with the “frame sync”. If a frame sync isn’t found, you skip until you find one. This allows recovering from a corrupt datastream. Each header then has information about how this frame was compressed. In general, each frame is independent and can be considered in isolation. So my program started out by parsing all these headers, keeping note of how long each frame is, so I could calculate the total duration of the mp3 file.
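As a rough sketch of that frame walk (this is not my actual library, just an illustration), here is a version that only understands MPEG version 1 layer 3 headers and treats everything else as junk to skip over. The function names are made up for the example.

```python
# Bitrate (kbps) and sample rate tables for MPEG version 1, layer 3 only.
# Index 0 ("free format") and index 15 (invalid) are rejected here.
BITRATES_KBPS = [None, 32, 40, 48, 56, 64, 80, 96, 112,
                 128, 160, 192, 224, 256, 320, None]
SAMPLE_RATES = [44100, 48000, 32000, None]
SAMPLES_PER_FRAME = 1152


def parse_header(b):
    """Return (frame_length, sample_rate, bitrate_kbps), or None if the
    four bytes are not an MPEG-1 layer 3 frame header."""
    if len(b) < 4:
        return None
    if b[0] != 0xFF or (b[1] & 0xE0) != 0xE0:   # 11 frame sync bits
        return None
    if (b[1] >> 3) & 0x03 != 0b11:              # MPEG version 1
        return None
    if (b[1] >> 1) & 0x03 != 0b01:              # layer 3
        return None
    bitrate = BITRATES_KBPS[b[2] >> 4]
    sample_rate = SAMPLE_RATES[(b[2] >> 2) & 0x03]
    if bitrate is None or sample_rate is None:
        return None
    padding = (b[2] >> 1) & 0x01
    length = 144 * bitrate * 1000 // sample_rate + padding
    return length, sample_rate, bitrate


def scan(data):
    """Walk the frames; return (frame_count, seconds, bytes_skipped)."""
    pos = frames = skipped = 0
    seconds = 0.0
    while pos + 4 <= len(data):
        header = parse_header(data[pos:pos + 4])
        if header is None:
            pos += 1          # no sync here: step forward one byte
            skipped += 1
            continue
        length, sample_rate, _ = header
        frames += 1
        seconds += SAMPLES_PER_FRAME / sample_rate
        pos += length         # jump straight to the next expected header
    return frames, seconds, skipped
```

The `bytes_skipped` count is the interesting number: it is exactly the data that can’t be attributed to the mp3 stream. A real scanner would also want to check that a truncated final frame isn’t counted as a full one.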

Now, if an mp3 decoder doesn’t find an mp3 frame where it expects one, it just skips over the data in the file until it does find a valid frame sync. This means you can put arbitrary data into an mp3 file and mp3 decoders will ignore it. So what happens? Every man and his dog decides it’s a great idea to squirrel some data away in mp3 files.

The most obvious of these is the ID3v1 tag. The ID3v1 tag is the last 128 bytes of an mp3 file and starts with the characters “TAG”. It then has some fixed length fields for the artist, album name, title, comment etc. There is also a “Genre” field that has 80 possible values, later extended by Winamp to 125. A variant of this tag (ID3v1.1) also has a track number, by borrowing bytes from the comment field.
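A minimal sketch of reading that tag, assuming the usual fixed layout (30 bytes each for title, artist and album, 4 for the year, 30 for the comment, 1 for the genre) and Latin-1 text. The function name is made up for the example.

```python
def parse_id3v1(data):
    """Parse an ID3v1 / ID3v1.1 tag from the last 128 bytes of a file's
    contents. Returns a dict, or None if there is no tag."""
    if len(data) < 128:
        return None
    tag = data[-128:]
    if tag[:3] != b"TAG":
        return None

    def text(raw):
        # Fields are NUL or space padded; keep what comes before the padding.
        return raw.split(b"\x00", 1)[0].decode("latin-1").rstrip()

    comment = tag[97:127]
    track = None
    # ID3v1.1: a zero byte followed by a non-zero byte at the end of the
    # comment field means the last byte is a track number.
    if comment[28] == 0 and comment[29] != 0:
        track = comment[29]
        comment = comment[:28]

    return {
        "title": text(tag[3:33]),
        "artist": text(tag[33:63]),
        "album": text(tag[63:93]),
        "year": text(tag[93:97]),
        "comment": text(comment),
        "track": track,
        "genre": tag[127],
    }
```

Anything longer than 30 characters simply doesn’t fit, which is where a lot of the later trouble comes from.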

Because the ID3v1 tag is annoying in oh so many ways (limited length fields that a large proportion of songs exceed, a limited number of fields, the tag being at the end of the file so that you can’t get the information when streaming until the file is entirely copied, etc), ID3v2 was invented to solve these issues. ID3v2 tags go at the beginning of the file, and they have an arbitrary number of fields that are of dynamic length. Yay! You’d think.

Except there are ID3v2.2 tags (the original ID3v2 format; I have no idea what happened to ID3v2.0 or ID3v2.1), ID3v2.3, which is what a lot of the software I’ve seen today means when it says “ID3v2”, and then there’s ID3v2.4, which not many people seem to implement yet.
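Telling the versions apart is at least easy, because they share the same 10 byte header: the characters “ID3”, a major version byte, a revision byte, a flags byte, and a 4 byte size where only the low 7 bits of each byte are used. A minimal sketch (the function name is made up, and it only reads the header, not the frames inside the tag):

```python
def parse_id3v2_header(data):
    """Read the 10 byte ID3v2 header at the start of a file's contents.
    Returns a dict, or None if there is no ID3v2 tag."""
    if len(data) < 10 or data[:3] != b"ID3":
        return None
    major, revision, flags = data[3], data[4], data[5]
    size_bytes = data[6:10]
    # The size is "synchsafe": the top bit of every byte must be clear,
    # so the size can never look like an mp3 frame sync.
    if any(b & 0x80 for b in size_bytes):
        return None
    size = 0
    for b in size_bytes:
        size = (size << 7) | b
    return {
        "version": f"ID3v2.{major}.{revision}",
        "flags": flags,
        "size": size,     # bytes of tag data following the header
    }
```

The frame layout inside the tag is where the versions really differ, so a parser still needs per-version code after this point.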

So now my library parses ID3v1 and ID3v2.3 tags (I fortunately didn’t find any ID3v2.2 or ID3v2.4 tags in my collection), and can give me summary statistics on the mp3 stream (duration, histogram of bitrates used, average bitrate, total number of frames etc). Running this over my collection starts turning up all kinds of interesting things. A large proportion of the mp3 files appear to have random data that can’t be attributed to the MP3 stream, nor the ID3v1 or ID3v2 tags.

Further investigation discovers that people put Lyrics3 v2 tags in mp3 files. Not because they want lyrics, but because it was a convenient place to put the full version of the artist/album/title if the ID3v1 tag was too short to support it. So I implemented a Lyrics3 v2 parser.
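A minimal sketch of such a parser, based on my reading of the Lyrics3 v2.00 layout (treat the details as assumptions): the block sits just before the ID3v1 tag, starts with “LYRICSBEGIN”, holds fields made of a 3 character ID plus a 5 digit length, and ends with a 6 digit size followed by “LYRICS200”. The extended artist and title live in fields such as “EAR” and “ETT”.

```python
def parse_lyrics3v2(data):
    """Return a dict of Lyrics3 v2 fields found at the end of a file's
    contents, or None if no Lyrics3 v2 block is present."""
    end = len(data)
    # Step back over an ID3v1 tag if there is one.
    if end >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128
    if end < 15 or data[end - 9:end] != b"LYRICS200":
        return None
    size_field = data[end - 15:end - 9]
    if not size_field.isdigit():
        return None
    # The size covers "LYRICSBEGIN" and the fields, but not the 6 digit
    # size itself or the trailing "LYRICS200".
    start = end - 15 - int(size_field)
    if start < 0 or data[start:start + 11] != b"LYRICSBEGIN":
        return None

    fields = {}
    pos = start + 11
    stop = end - 15
    while pos + 8 <= stop:
        field_id = data[pos:pos + 3].decode("latin-1")
        length_field = data[pos + 3:pos + 8]
        if not length_field.isdigit():
            break
        length = int(length_field)
        fields[field_id] = data[pos + 8:pos + 8 + length].decode("latin-1")
        pos += 8 + length
    return fields
```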

I also discovered that MP3 renormalisation software puts undo information into the file in a tag format called “APEv2”, which has a very poor specification. APEv2 isn’t recommended for MP3 files because it uses “APETAGEX” as its header/footer marker, which means it can coincidentally end up with “TAG” in just the right place in the file for it to be mistaken for an ID3v1 tag. The chances of this are very slim, and it’s impossible if there is a valid ID3v1 tag there anyway. APEv2 is very similar to ID3v2.

But still there’s data in the mp3 files I can’t attribute to any of the mp3 stream, ID3v1, ID3v2.3, Lyrics3v2, or APEv2 tags. So then I discovered the completely obscure MusicMatch tagging format. At this point I gave up. There are obviously more tagging formats in use than I really care for. The other one that would be worth investigating would be the Vorbis/FLAC comment structure, but I’ve not seen any evidence of them being used in any of the .mp3 files I’ve found to test on. Not that I expect this to have stopped someone from deciding it was a great idea and doing it anyway. Sigh.

There’s still quite a bit of data left over even if I ignore the MusicMatch tags, which looks like compressed data that for some reason doesn’t have a valid mp3 synchronisation header on it. I’ve not found an adequate explanation yet as to what this data is doing there.

So now I have a long list of mp3 files that are inconsistent with their tagging, have obviously silly things going on with their tagging (eg trailing whitespace), have data I can’t attribute to the bitstream or any known tag, or have bizarre or crazy frame headers (eg some frames that aren’t MPEG version 1 layer 3), and albums that are broken (eg two files that claim to have the same track number, or no file claiming track “2” on a 5 track album).
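The album level checks are the simple part. A minimal sketch, assuming each track has already been reduced to a made-up `(filename, track_number, title)` tuple and that the expected number of tracks is known:

```python
from collections import Counter


def album_problems(tracks, total_tracks):
    """tracks: list of (filename, track_number, title) tuples.
    Returns a list of human readable problem descriptions."""
    problems = []
    counts = Counter(number for _, number, _ in tracks)

    # Two files claiming the same track number.
    for number, count in sorted(counts.items()):
        if count > 1:
            problems.append(f"track {number} claimed by {count} files")

    # A track number that no file claims.
    for number in range(1, total_tracks + 1):
        if number not in counts:
            problems.append(f"no file claims track {number}")

    # Silly tagging, eg trailing whitespace.
    for filename, _, title in tracks:
        if title != title.strip():
            problems.append(f"{filename}: title has leading or trailing whitespace")

    return problems
```

For example, a 5 track album where two files both say they are track 3 gets reported as one duplicate and one missing track.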

I decide to compare this to the MusicBrainz database to see what things “should” be like. The MusicBrainz database isn’t perfect, but it’s at least reasonably internally consistent, well maintained, and free for download. Immediately I discover that there are regularly multiple versions of the same CD. Urgh. I swear the entire music industry is out to get me. So now I compare the number of tracks and the track lengths that I have against MusicBrainz, pick the one that appears closest to the version I have, and compare them for accuracy. It appears that, depending on the CD ripper used, you get slightly different lengths, presumably due to different ideas of when the CD lead-in/lead-out times begin.
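One simple way to pick the closest version, as a sketch: only consider versions with the same number of tracks, and take the one with the smallest total difference in track lengths. The `candidates` dictionary and release names here are made up for the example; how you get the lengths out of the database is a separate problem.

```python
def closest_release(my_lengths, candidates):
    """my_lengths: track lengths in seconds, in track order.
    candidates: dict mapping a release name to its list of track lengths.
    Returns (release_name, total_difference), or None if no candidate
    has the right number of tracks."""
    best = None
    for name, lengths in candidates.items():
        if len(lengths) != len(my_lengths):
            continue    # a different track count can't be the same version
        error = sum(abs(mine - theirs)
                    for mine, theirs in zip(my_lengths, lengths))
        if best is None or error < best[1]:
            best = (name, error)
    return best
```

Because different rippers give slightly different lengths, the best match will rarely have a difference of zero, so a small total difference is the best you can hope for.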

So I start reripping CDs that I think need to be sorted out, using cdparanoia. At the same time I decide it would be useful to write out the TOC so I can do a CDDB lookup later, and so I can include the TOC data in the ID3v2.3 tag, which allows later offline processing to determine which exact version of a CD a track came from, and what the length of the track should be vs the length of the mp3. But apparently ripping a TOC is a complicated business. No one actually just writes out the subcode “Q” data. So I use cdrdao to generate a TOC file, which is near enough. This however means cdrdao rips the CD again to verify track lead-in/lead-out times. cdrdao can be used to rip the CD itself (using libcdparanoia), but when it does so it writes out one single file instead of one file per track. Arrrggghh. So while I’m writing this I’m waiting for CDs to rip, twice for each CD.
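For reference, this is roughly how a CDDB disc ID falls out of the TOC, as far as I understand the classic calculation. It is a sketch under assumptions: offsets are taken to be in CD frames (75 per second) and to include the 150 frame lead-in, and the function names are made up.

```python
FRAMES_PER_SECOND = 75


def digit_sum(n):
    """Sum of the decimal digits of n."""
    total = 0
    while n > 0:
        total += n % 10
        n //= 10
    return total


def cddb_disc_id(track_offsets, leadout_offset):
    """track_offsets: start of each track in frames, in track order.
    leadout_offset: start of the lead-out in frames.
    Returns the disc ID as 8 hex digits."""
    starts = [offset // FRAMES_PER_SECOND for offset in track_offsets]
    checksum = sum(digit_sum(seconds) for seconds in starts)
    length = leadout_offset // FRAMES_PER_SECOND - starts[0]
    disc_id = ((checksum % 0xFF) << 24) | (length << 8) | len(track_offsets)
    return "%08x" % disc_id
```

The ID only encodes a checksum of the track start times, the disc length and the track count, which is why keeping the full TOC around is more useful than keeping just the ID.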

Then I discover that CDs can have a GRiD (aka Catalog) number which uniquely identifies the contents of the CD! Yay! Surely this would be more reliable than CDDB? Ahh, but as far as I can see there is no publicly available GRiD<->information database on the Internet. There are (sometimes) also ISRC numbers for each track which identify the recording. This sounds like a very useful thing until you realise that it can change if the recording is retouched or slightly different, but might not be modified if the song is, for instance, remixed. Still, a set of ISRCs in a specific order would be a good identifier for a CD, but again there appears to be no publicly usable database of ISRCs on the Internet. I guess if a CD has GRiD or ISRC information then it generally contains CD-Text on the CD itself with the names of the tracks. Still, it would be an interesting database to trawl.

Other bizarre things that you might have never known. Number of unique versions of a CD according to musicbrainz:

Artist           | Album          | Number of versions
Roberto Carlos   | Roberto Carlos | 26
James Earl Jones | The Bible      | 17