I’ve been playing around with the mp3s I’ve got here, trying to automatically find the ones that are broken and need to be re-ripped from my original CDs. Some have been truncated over the years, some are just encoded at horrible bitrates that make your ears bleed, and some albums are missing tracks. Some have ID3v1 tags, some have ID3v2 tags, some have both but they disagree on the information. Some don’t have any tags, some are just named wrong.
So I thought I would write a script using a nice library to find out the length of each song, pull out all the metadata from the ID3v1 and ID3v2 tags, and look the data up in the MusicBrainz database to check for accuracy. Then I’d write out nice clean mp3 files with consistent ID3v1 and ID3v2 tags (annoyingly some older hardware mp3 players can only read ID3v1 tags, but ID3v1 is so crappy you really want to use ID3v2 where you can). I thought this should be a nice straightforward evening’s work. Oh boy was I wrong.
None of the libraries I looked at would tell me the length of a track. Most of them would decode the ID3v2 and ID3v1 tags, but most would only give you the ID3v2 information if there was an ID3v2 tag and ignore the ID3v1 information (so I couldn’t easily make sure both tags were consistent). Most didn’t appear to give me the option to write out an ID3v2 and an ID3v1 tag into the same file. Now I expect there probably are libraries out there that do this, but it was disheartening enough that I decided to write my own mp3 parser library. I’m glad I did, I learnt a lot of interesting things along the way.
I started off reading the mp3 frame header specification. An mp3 file is basically a series of frames, one after another. Each frame begins with the “frame sync”. If a frame sync isn’t found, you skip until you find one. This allows recovering from a corrupt datastream. After the sync, each header has information about how this frame was compressed. In general, each frame is independent and can be considered in isolation. So my program started out by parsing all these headers, keeping note of how long each frame is so I could calculate the total duration of the mp3 file.
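A minimal sketch of that frame-walking idea, assuming only MPEG version 1 layer 3 frames are of interest (anything else is treated as "not a frame"). The function names parse_frame_header and scan are made up for illustration and are not from any particular library. Real files also contain byte patterns that merely look like a sync, so a serious parser would want extra sanity checks.

```python
import struct

# Lookup tables for MPEG version 1, layer 3 only (an assumption of this sketch).
BITRATES_KBPS = [None, 32, 40, 48, 56, 64, 80, 96, 112,
                 128, 160, 192, 224, 256, 320, None]
SAMPLE_RATES_HZ = [44100, 48000, 32000, None]
SAMPLES_PER_FRAME = 1152


def parse_frame_header(data, pos):
    """Return (frame_length, bitrate_kbps, sample_rate_hz), or None if no frame."""
    if pos + 4 > len(data):
        return None
    (word,) = struct.unpack_from(">I", data, pos)
    if (word >> 21) != 0x7FF:            # 11 sync bits, all set
        return None
    version = (word >> 19) & 0x3         # 0b11 means MPEG version 1
    layer = (word >> 17) & 0x3           # 0b01 means layer 3
    if version != 0b11 or layer != 0b01:
        return None
    bitrate = BITRATES_KBPS[(word >> 12) & 0xF]
    sample_rate = SAMPLE_RATES_HZ[(word >> 10) & 0x3]
    if bitrate is None or sample_rate is None:
        return None                      # "free" or reserved values: skip them
    padding = (word >> 9) & 0x1
    length = 144 * bitrate * 1000 // sample_rate + padding
    return length, bitrate, sample_rate


def scan(data):
    """Walk the data, returning (frame_count, seconds, skipped_bytes)."""
    pos = frames = skipped = 0
    seconds = 0.0
    while pos + 4 <= len(data):
        header = parse_frame_header(data, pos)
        if header is None:
            pos += 1                     # no sync here: skip one byte and retry
            skipped += 1
            continue
        length, _, sample_rate = header
        frames += 1
        seconds += SAMPLES_PER_FRAME / sample_rate
        pos += length
    return frames, seconds, skipped
```

The skipped byte count is the interesting number for this exercise: it is the data that cannot be attributed to the audio stream.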
Now, if an mp3 decoder doesn’t find an mp3 frame where it expects one, it just skips over the data in the file until it does find a valid frame sync. This means you can put arbitrary data into an mp3 file and mp3 decoders will ignore it. So what happens? Every man and his dog decides it’s a great idea to squirrel some data away in mp3 files.
The most obvious of these is the ID3v1 tag. The ID3v1 tag is the last 128 bytes of an mp3 file and starts with the characters “TAG”. It then has some fixed-length fields for the title, artist, album name, year, comment etc. There is also a one-byte “Genre” field that originally had 80 possible values, later extended by Winamp up to 125. A variant of this tag (ID3v1.1) also has a track number, made by borrowing bytes from the end of the comment field.
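The fixed layout makes ID3v1 easy to sketch. The field widths below are the standard ones (30 bytes each for title, artist, album and comment, 4 for the year, 1 for the genre). The function name parse_id3v1 and the choice of Latin-1 decoding are assumptions for illustration.

```python
def parse_id3v1(data):
    """Parse an ID3v1/ID3v1.1 tag from the last 128 bytes, or return None."""
    if len(data) < 128 or data[-128:-125] != b"TAG":
        return None
    tag = data[-128:]

    def text(raw):
        # Fields are padded with NULs or spaces. Stripping the padding here
        # is a choice of this sketch; keep the raw bytes if you want to
        # report trailing whitespace instead.
        return raw.split(b"\x00", 1)[0].decode("latin-1").rstrip()

    comment = tag[97:127]
    track = None
    # ID3v1.1: a zero byte followed by a non-zero byte at the end of the
    # comment field means the last byte is a track number.
    if comment[28] == 0 and comment[29] != 0:
        track = comment[29]
        comment = comment[:28]
    return {
        "title": text(tag[3:33]),
        "artist": text(tag[33:63]),
        "album": text(tag[63:93]),
        "year": text(tag[93:97]),
        "comment": text(comment),
        "track": track,
        "genre": tag[127],
    }
```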
Because the ID3v1 tag is annoying in oh so many ways (limited-length fields that a large proportion of songs exceed, a limited number of fields, the tag being at the end of the file so that you can’t get the information when streaming until the file is entirely copied etc), ID3v2 was invented to solve these issues. ID3v2 tags go at the beginning of the file, and they have an arbitrary number of fields, each of dynamic length. Yay! You’d think.
Except there are ID3v2.2 tags (the original ID3v2 format; I have no idea what happened to ID3v2.0 or ID3v2.1), ID3v2.3 which is what a lot of the software I’ve seen today means when it says “ID3v2”, and then there’s ID3v2.4 which not many people seem to implement yet.
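Telling the ID3v2 versions apart only needs the 10-byte tag header: the characters "ID3", a major version byte, a revision byte, a flags byte, and a 4-byte "synchsafe" size that uses only the low 7 bits of each byte so it can never look like a frame sync. A sketch, with parse_id3v2_header being a hypothetical name:

```python
def parse_id3v2_header(data):
    """Return the version and total tag size from an ID3v2 header, or None."""
    if len(data) < 10 or data[:3] != b"ID3":
        return None
    major, revision, flags = data[3], data[4], data[5]
    size_bytes = data[6:10]
    # The version bytes are never 0xFF and the size bytes never have the
    # top bit set; anything else is not a valid ID3v2 header.
    if major == 0xFF or revision == 0xFF or any(b & 0x80 for b in size_bytes):
        return None
    size = 0
    for b in size_bytes:
        size = (size << 7) | b           # 7 useful bits per byte
    # ID3v2.4 may append a 10-byte footer, signalled by flag bit 0x10.
    footer = 10 if (major == 4 and flags & 0x10) else 0
    return {
        "major": major,                  # 2, 3 or 4 for ID3v2.2/2.3/2.4
        "revision": revision,
        "flags": flags,
        "tag_bytes": 10 + size + footer, # where the audio should start
    }
```

The frames inside the tag differ between the versions (ID3v2.2 uses three-character frame IDs, the later versions four), so a parser for one version cannot simply be pointed at another.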
So now my library parses ID3v1 and ID3v2.3 tags (I fortunately didn’t find any ID3v2.2 or ID3v2.4 tags in my collection), and can give me summary statistics on the mp3 stream (duration, histogram of bitrates used, average bitrate, total number of frames etc). Running this over my collection starts turning up all kinds of interesting things. A large proportion of the mp3 files appear to have random data that can’t be attributed to the mp3 stream, nor to the ID3v1 or ID3v2 tags.
Further investigation discovers that people put Lyrics3 v2 tags in mp3 files. Not because they want lyrics, but because it was a convenient place to put the full version of the artist/album/title if the ID3v1 tag was too short to support it. So I implemented a Lyrics3 v2 parser.
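A sketch of what finding such a tag can look like, based on my reading of the Lyrics3 v2 layout: the block sits just before the ID3v1 tag, starts with "LYRICSBEGIN", and ends with a 6-digit size followed by "LYRICS200". Each field is a 3-character ID, a 5-digit length, then the data. The name find_lyrics3v2 is made up, and the layout should be checked against the spec before relying on it.

```python
def find_lyrics3v2(data):
    """Return a dict of Lyrics3 v2 fields, or None if no tag is found."""
    end = len(data)
    if end >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128                       # step back over the ID3v1 tag
    if end < 15 or data[end - 9:end] != b"LYRICS200":
        return None
    size_field = data[end - 15:end - 9]
    if not size_field.isdigit():
        return None
    # The size covers "LYRICSBEGIN" plus the fields, but not the trailer.
    start = end - 15 - int(size_field)
    if start < 0 or data[start:start + 11] != b"LYRICSBEGIN":
        return None
    fields = {}
    pos, limit = start + 11, end - 15
    while pos + 8 <= limit:
        field_id = data[pos:pos + 3].decode("latin-1")
        length_field = data[pos + 3:pos + 8]
        if not length_field.isdigit():
            break                        # malformed field: stop parsing
        length = int(length_field)
        if pos + 8 + length > limit:
            break                        # field claims to run past the tag
        fields[field_id] = data[pos + 8:pos + 8 + length].decode("latin-1")
        pos += 8 + length
    return fields
```

The fields used for over-long names are EAR (artist), EAL (album) and ETT (title).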
I also discovered that mp3 renormalisation software puts undo information into the file in a tag format called “APEv2”, which has a very poor specification. APEv2 isn’t recommended for mp3 files because it uses “APETAGEX” as its header/footer marker, which means it can coincidentally end up with “TAG” in just the right place in the file for it to be mistaken for an ID3v1 tag. The chances of this are very slim, and it’s impossible if there is a valid ID3v1 tag there anyway. APEv2 is very similar to ID3v2.
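The collision is easy to guard against: if the "TAG" found 128 bytes from the end is really the middle of an "APETAGEX" marker, then the marker starts 3 bytes earlier. A small sketch (looks_like_id3v1 is a hypothetical helper, and it only checks the marker, not the rest of the APEv2 structure):

```python
def looks_like_id3v1(data):
    """True if the file appears to end with a genuine ID3v1 tag."""
    if len(data) < 128 or data[-128:-125] != b"TAG":
        return False
    # "APETAGEX" contains "TAG" at offset 3, so check whether the 8 bytes
    # starting 3 bytes before the supposed ID3v1 tag spell out that marker.
    if len(data) >= 131 and data[-131:-123] == b"APETAGEX":
        return False
    return True
```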
But still there’s data in the mp3 files I can’t attribute to any of the mp3 stream, ID3v1, ID3v2.3, Lyrics3 v2, or APEv2 tags. So then I discovered the completely obscure MusicMatch tagging format. At this point I gave up. There are obviously more tagging formats in use than I really care for. The other one that would be worth investigating would be the Vorbis/FLAC comment structure, but I’ve not seen any evidence of it being used in any of the .mp3 files I’ve found to test on. Not that I expect this to have stopped someone from deciding it was a great idea and doing it anyway. Sigh.
There’s still quite a bit of data left over even if I ignore the MusicMatch tags. It looks like compressed data that for some reason doesn’t have a valid mp3 synchronisation header on it. I’ve not found an adequate explanation yet as to what this data is doing there.
So now I have a long list of mp3 files that are inconsistent with their tagging, have obviously silly things going on with their tagging (eg trailing whitespace), have data I can’t attribute to the bitstream or any known tag, or have bizarre or crazy frame headers (eg some frames that aren’t MPEG version 1 layer 3), plus albums that are broken (eg two files that claim to have the same track number, or no file claiming track “2” on a 5 track album).
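The album-level check is the simplest of the lot. A sketch, assuming the track numbers have already been pulled out of the tags for every file in one album directory (check_album and its arguments are made-up names):

```python
from collections import Counter


def check_album(track_numbers, expected_total=None):
    """Return (duplicated, missing) track numbers for one album."""
    counts = Counter(track_numbers)
    # Without a known track count, assume the highest number seen is the last.
    if expected_total is None:
        expected_total = max(counts, default=0)
    duplicated = sorted(n for n, c in counts.items() if c > 1)
    missing = [n for n in range(1, expected_total + 1) if n not in counts]
    return duplicated, missing
```

Note that guessing the total from the highest track number cannot notice a missing final track, which is one reason to compare against an external database as well.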
I decide to compare this to the MusicBrainz database to see what things “should” be like. The MusicBrainz database isn’t perfect, but it’s at least reasonably internally consistent, well maintained and free for download. Immediately I discover that there are regularly multiple versions of the same CD. Urgh. I swear the entire music industry is out to get me. So now I compare the number of tracks and the track lengths that I have against MusicBrainz, pick the version that appears closest to the one I have, and compare the tags against it for accuracy. It appears that depending on the CD ripper used, you get slightly different lengths, presumably due to different ideas of when the CD lead-in/lead-out times begin.
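One simple way to do the "pick the closest version" step is to keep only candidates with the same number of tracks and then take the one with the smallest total difference in track lengths. This is just a sketch of that idea: closest_release is a hypothetical name, the candidate data is assumed to have been looked up already, and a sum of absolute differences is one scoring choice among many.

```python
def closest_release(my_lengths, candidates):
    """Pick the candidate whose track lengths best match my_lengths.

    my_lengths: list of track lengths in seconds, in track order.
    candidates: dict mapping a release name to its list of track lengths.
    Returns (name, total_difference_in_seconds), or None if nothing has
    the same number of tracks.
    """
    best = None
    for name, lengths in candidates.items():
        if len(lengths) != len(my_lengths):
            continue                     # different track count: not this one
        error = sum(abs(a - b) for a, b in zip(my_lengths, lengths))
        if best is None or error < best[1]:
            best = (name, error)
    return best
```

Because different rippers disagree slightly on track boundaries, a small non-zero difference for the best match is expected rather than alarming.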
So I start re-ripping the CDs that I think need to be sorted out, using cdparanoia. At the same time I decide it would be useful to write out the TOC so I can do a CDDB lookup later, and so I can include the TOC data in the ID3v2.3 tag, which allows later offline processing to determine which exact version of a CD a track came from, and what the length of the track should be vs the length of the mp3. But apparently ripping a TOC is a complicated business. No one actually just writes out the subcode “Q” data. So I use cdrdao to generate me a TOC file, which is near enough. This however means cdrdao rips the CD again to verify track lead-in/lead-out times. cdrdao can be used to rip the CD itself (using libcdparanoia), but when it does so it writes out one single file instead of one file per track. Arrrggghh. So while I’m writing this I’m waiting for CDs to rip, twice for each CD.
Then I discover that CDs can have a GRiD (aka Catalog) number which uniquely identifies the contents of the CD! Yay! Surely this would be more reliable than CDDB? Ahh, but as far as I can see there is no publicly available GRiD<->information database on the Internet. There are (sometimes) also ISRC numbers for each track which identify the recording. This sounds like a very useful thing until you realise that the ISRC can change if the recording is retouched or slightly different, but might not change if the song is, for instance, remixed. Still, a set of ISRCs in a specific order would be a good identifier for a CD, but again there appears to be no publicly usable database of ISRCs on the Internet. I guess if a CD has GRiD or ISRC information then it generally also contains CDTEXT on the CD itself with the names of the tracks. Still, it would be an interesting database to trawl.
Other bizarre things that you might never have known. Number of unique versions of a CD according to MusicBrainz: