Back in my day

Monday, November 27th, 2006

It occurred to me how much knowledge has been lost since I taught myself to program. Programmers today "know" some of the low level details of how a machine works in theory, but really, how much do they know?

When I started teaching myself to code it was on an IBM XT with DOS. Now, if you're someone who's never really used DOS (or CP/M or…) except in a "DOS box", then you really don't know what it means to run a non-multitasking, non-protected-mode machine. DOS is not very much more than a way to load programs into memory and run them. It provides only very basic abstraction from the hardware, and most of it is extremely slow and limited. The DOS APIs were generally only ever used to read and write files. DOS was essentially treated as a filesystem driver.

The real-mode x86 chips had a weird memory model called "segmentation". Each segment was 64k, and segments overlapped with each other, each one starting 16 bytes after the last. So offset 0010 in segment 0000 was the same byte as offset 0000 in segment 0001. You'd write these addresses as segment:offset. Programmers would regularly directly access addresses in memory that they knew contained things. I can still remember a lot of interesting addresses off by heart.

  • 0000:0000 - was/is the interrupt vector table
  • 0000:7C00 - where the boot sector was loaded into memory
  • A000:0000 - where the VGA framebuffer was located
  • B000:0000 - where the monochrome text-mode display was located
  • B800:0000 - the colour text-mode display
  • C000:0000 - this was "Adaptor ROM", or where network cards (and other things) put their stuff
  • F000:FA6E - this was the VGA ROM BIOS font
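
The overlap between segments falls straight out of the arithmetic the CPU does: the segment is multiplied by 16 and the offset is added. A minimal sketch in C (the function names are mine, made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Real-mode translation: linear = segment * 16 + offset.
   The result needs more than 16 bits, hence the 32-bit return type. */
uint32_t linear_address(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}

/* The overlap example from the text, written as a check. */
void check_overlap_example(void)
{
    /* 0000:0010 and 0001:0000 name the same byte */
    assert(linear_address(0x0000, 0x0010) == linear_address(0x0001, 0x0000));
}
```

So B800:0000, for example, is linear address B8000.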

You talked directly to the hardware. If you wanted to hook an IRQ, go for it, overwrite the address it would use in the interrupt vector table, and remember to acknowledge the interrupt when it came time to return.
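
To give a feel for what "overwrite the address" meant: each of the 256 vectors was 4 bytes, a 16-bit offset followed by a 16-bit segment, so vector n lived at 0000:n*4. The sketch below simulates the table with an ordinary array so it can run anywhere; `ivt`, `set_vector`, `get_vector` and `hook_vector` are hypothetical names, not a DOS API. On a real machine you would also disable interrupts while changing the entry, and (on a PC) hardware IRQ 0 to 7 arrived as interrupts 08h to 0Fh.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the first 1 KiB of memory at 0000:0000 (simulation only). */
uint8_t ivt[256 * 4];

/* Each entry: offset (low byte first), then segment (low byte first). */
void set_vector(uint8_t n, uint16_t segment, uint16_t offset)
{
    uint8_t *entry = &ivt[n * 4];
    entry[0] = (uint8_t)(offset & 0xFF);
    entry[1] = (uint8_t)(offset >> 8);
    entry[2] = (uint8_t)(segment & 0xFF);
    entry[3] = (uint8_t)(segment >> 8);
}

void get_vector(uint8_t n, uint16_t *segment, uint16_t *offset)
{
    const uint8_t *entry = &ivt[n * 4];
    *offset  = (uint16_t)(entry[0] | (entry[1] << 8));
    *segment = (uint16_t)(entry[2] | (entry[3] << 8));
}

/* Install a new handler and hand back the old one so it can be
   chained to or restored when the program exits. */
void hook_vector(uint8_t n, uint16_t new_seg, uint16_t new_off,
                 uint16_t *old_seg, uint16_t *old_off)
{
    assert(old_seg != 0 && old_off != 0);
    get_vector(n, old_seg, old_off);
    set_vector(n, new_seg, new_off);
}
```

Forgetting to restore the old vector before exiting was a reliable way to hang the machine, since the table would then point into memory your program no longer owned.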

I remember that to change the VGA palette you wrote the palette entry you wanted to change to IO port 0x3c8, then wrote the red value to 0x3c9, green to 0x3c9, then blue to 0x3c9. Each value could be from 0 to 63.
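
The sequence looks like this in C. Real port writes can't be run on a modern protected OS, so `port_write` here is a made-up stand-in that simulates the palette registers in an array (in Turbo Pascal you'd have used the `Port[]` array instead). The simulation, including the automatic step to the next entry after the blue write, is my own sketch of how the hardware behaved.

```c
#include <assert.h>
#include <stdint.h>

#define DAC_WRITE_INDEX 0x3C8
#define DAC_DATA        0x3C9

/* Simulated palette state (illustration only). */
uint8_t dac[256][3];     /* red, green, blue for each entry   */
uint8_t dac_index;       /* entry currently being written     */
uint8_t dac_component;   /* 0 = red, 1 = green, 2 = blue      */

/* Hypothetical stand-in for a real IO port write. */
void port_write(uint16_t port, uint8_t value)
{
    if (port == DAC_WRITE_INDEX) {
        dac_index = value;
        dac_component = 0;
    } else if (port == DAC_DATA) {
        dac[dac_index][dac_component] = value & 0x3F;  /* 6 bits each */
        if (++dac_component == 3) {   /* after blue, move to next entry */
            dac_component = 0;
            dac_index++;
        }
    }
}

void set_palette_entry(uint8_t index, uint8_t r, uint8_t g, uint8_t b)
{
    assert(r <= 63 && g <= 63 && b <= 63);
    port_write(DAC_WRITE_INDEX, index);
    port_write(DAC_DATA, r);
    port_write(DAC_DATA, g);
    port_write(DAC_DATA, b);
}
```

Because the index steps on by itself, a run of consecutive entries only needed the index written once, followed by three data writes per entry.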

In fact, I remember you could poll an IO port (0x3DA, if I remember rightly) and see if a horizontal or vertical retrace was in effect. Depending on the screen mode there were only 16 or 256 or whatever palette entries. Most of the time this meant that you could only have 256 colours on the screen at once. If you modified the palette while the video card was drawing on the screen you'd get "Snow".

Snow was caused when the video card tried to read a value that the host CPU was in the process of modifying. It would normally read out 0xFF or some other weird value instead so you'd get "white" in random places across the screen causing what looked like "Snow".

But you had enough time to modify about 4 entries during the horizontal blanking time (on my hardware at least) before it started painting the line, and so you avoided the snow. This meant you could use 256*4+242 colours on the screen at once (on a 256 scanline resolution). Often this was used to change just the background colour, usually in a nice smooth gradient. This was called coppering.

By investigating what the other video registers were used for, people discovered lots about how video cards worked. Video cards were only allocated 64k of address space, so to fit higher resolutions or higher bit depths in they used "planes" (or banks): you switched a plane into the 64k address space, wrote your updates to it, and then flipped in the next plane. Lots of interesting tricks could be pulled using these techniques. A lot of video modes were "chained". A write to video memory would write the same value to all planes simultaneously. These modes were fast and easy to program for, but they were stuck at very low resolutions. But by programming some of these low resolution modes and turning off the "chained" bit in the video card you could effectively create your own unsupported, undocumented video modes. These modes were collectively called "Mode X".
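
For a flavour of what programming an unchained mode involved, here is a sketch of the usual Mode X pixel layout as I understand it: four planes, with horizontally adjacent pixels living in successive planes. This layout is an assumption on my part rather than something stated above, the 320 pixel width is just an example, and the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

#define MODEX_WIDTH 320   /* assumed width, for illustration */

/* Which of the four planes holds the pixel in column x. */
unsigned modex_plane(unsigned x)
{
    return x & 3;
}

/* Bit mask selecting that plane (one bit per plane). */
uint8_t modex_plane_mask(unsigned x)
{
    return (uint8_t)(1u << (x & 3));
}

/* Byte offset inside the 64k window once the plane is selected.
   Each row takes only WIDTH / 4 bytes per plane. */
uint16_t modex_offset(unsigned x, unsigned y)
{
    assert(x < MODEX_WIDTH);
    return (uint16_t)(y * (MODEX_WIDTH / 4) + (x >> 2));
}
```

With rows taking a quarter of the bytes, a 320x200 screen only used 16000 bytes of the window, which left room for extra pages or taller modes.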

Because the machines were so slow, you spent a lot of time working on optimising code. Often you'd end up rewriting tight loops in assembly code. Even then you'd look at how you could optimise your assembly code.

x86 instructions are of variable length. Their minimum length is 1 byte. Their maximum length was something like 6 bytes. On a modern x86_64 machine the limit is 15 bytes.

Your bus was 8 or 16 bits wide, and you had to wait for bus transactions to complete, so you'd want to choose the shortest instruction you could. Also memory was at a premium, and every byte saved was going to be useful somewhere else. xor ax,ax was better than mov ax,0x0000 because xor ax,ax is a 2 byte instruction, whereas mov ax,0x0000 was 3 bytes.
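
The raw encodings make the difference concrete. These are the bytes as I'd hand assemble them for 16-bit code: xor ax,ax comes to two bytes and mov ax,0x0000 to three. An assembler may pick the equivalent 33 C0 form for the xor, so treat the exact bytes as one valid choice rather than what any particular tool emitted.

```c
#include <assert.h>
#include <stdint.h>

/* xor ax, ax: opcode 31 (xor r/m16, r16), ModRM C0 = ax, ax */
const uint8_t xor_ax_ax[] = { 0x31, 0xC0 };

/* mov ax, 0x0000: opcode B8 (mov ax, imm16) plus a 16-bit immediate */
const uint8_t mov_ax_0[] = { 0xB8, 0x00, 0x00 };

/* Checked at compile time. */
static_assert(sizeof xor_ax_ax == 2, "xor ax, ax is two bytes");
static_assert(sizeof mov_ax_0 == 3, "mov ax, 0 is three bytes");
```

One byte doesn't sound like much, but on an 8-bit bus it was one fewer bus transaction every time the instruction was fetched.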

Unlike the RISC machines you get taught on at universities, where everything is nice and simple and elegant and instructions take a fixed number of cycles, on x86 they can take different amounts of time. xor ax,ax might take one cycle. A multiply or a divide might take a few hundred cycles. A sin or cos might take a small ice age. So sometimes you wanted to use more opcodes because they were faster but they achieved the same result.

One popular VGA video mode was mode 0x13 ("mode thirteen"). It was 320x200x256. To code a putpixel you needed to do (A000:0000 + Y*320 + X). But that had a nasty multiplication in it. Remember how you were taught to do multiplication at primary school?

     32
  * 320
  -----
      0
     64
+   96
  -----
  10240

Imagine doing this in binary:

      0000 0000 0010 0000      (32)
    * 0000 0001 0100 0000      (320)
      -------------------
      0000 0000 0000 0000      bit 0 is 0
      0000 0000 0000 0000      bit 1 is 0
      0000 0000 0000 0000      bit 2 is 0
      0000 0000 0000 0000      bit 3 is 0
      0000 0000 0000 0000      bit 4 is 0
      0000 0000 0000 0000      bit 5 is 0
      0000 1000 0000 0000      bit 6 is 1: 32 shifted left by 6
      0000 0000 0000 0000      bit 7 is 0
      0010 0000 0000 0000      bit 8 is 1: 32 shifted left by 8
<snip>
      -------------------
      0010 1000 0000 0000      (10240)

The thing to note is that where there is a 0 in the second argument to '*', there is a line of all 0's in the working, and where there is a 1, the first number is copied in its entirety, shifted left by the position of the bit. 320 has only two bits set (bit 6 and bit 8, i.e. 64 + 256), so to multiply by 320 you could instead do res=(y<<6)+(y<<8). This was faster, significantly so. I remember benchmarking it.
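
As a sanity check on the arithmetic: 320 = 256 + 64, so y*320 is (y<<8) + (y<<6). Here is a sketch of a putpixel built on that, in C rather than the assembler it would have been at the time. The `framebuffer` pointer is a stand-in for A000:0000 (any 64000 byte buffer will do) and the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

#define SCREEN_W 320
#define SCREEN_H 200

/* The straightforward version, with the multiply. */
uint16_t offset_mul(unsigned x, unsigned y)
{
    return (uint16_t)(y * SCREEN_W + x);
}

/* The same thing with two shifts and two adds. */
uint16_t offset_shift(unsigned x, unsigned y)
{
    assert(x < SCREEN_W && y < SCREEN_H);
    return (uint16_t)((y << 8) + (y << 6) + x);
}

/* framebuffer stands in for the VGA memory at A000:0000. */
void putpixel(uint8_t *framebuffer, unsigned x, unsigned y, uint8_t colour)
{
    framebuffer[offset_shift(x, y)] = colour;
}
```

A modern compiler will do this transformation for you, which is rather the point of the rest of this post.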

Other quirks of the x86 architecture were things like complex addressing modes, for example [ebx+4*eax+3] (the scaled form needed the 386's 32-bit addressing). This was handy when you needed to find the address of a member of a structure in an array. There is an opcode called LEA, Load Effective Address. This instruction takes an address as its argument and loads that address (not the value at that address) into some other register. The fun thing was that LEA was a single cycle opcode (and quite short, a couple of bytes IIRC). This meant that you could do a shift by a power of two and an addition in one opcode instead of two. And not only that, it wouldn't use the ALU to do the arithmetic, so another opcode could be busy using the ALU.
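
What the addressing hardware computes is base + index*scale + displacement, with the scale limited to 1, 2, 4 or 8. A sketch in C of that calculation and the two uses mentioned above; the `sprite` structure and all the function names are hypothetical, picked only so the numbers come out as 4*i + 3.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* What a 386 addressing mode (and therefore LEA) calculates. */
uint32_t effective_address(uint32_t base, uint32_t index,
                           unsigned scale, uint32_t disp)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return base + index * scale + disp;
}

/* A made-up 4 byte structure; flags sits at offset 3. */
struct sprite { uint8_t x, y, colour, flags; };

/* Address of array[i].flags, i.e. lea reg, [base + 4*i + 3] */
uint32_t flags_address(uint32_t array_base, uint32_t i)
{
    return effective_address(array_base, i,
                             (unsigned)sizeof(struct sprite),
                             (uint32_t)offsetof(struct sprite, flags));
}

/* Pure arithmetic use: lea eax, [eax + eax*4] multiplies by 5. */
uint32_t times5(uint32_t n)
{
    return effective_address(n, n, 4, 0);
}
```

The multiply-by-5 style of trick is the same idea as the shift-and-add multiply by 320, just squeezed into a single instruction.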

One of my greatest "hacks" was a program I wrote that used one of the "Mode X" modes I mentioned above to create a 256x256x256 video mode. Most monitors hated this mode as it was effectively square, and they were 4:3, but with some protesting they'd do it. I created a subdivision style "plasma". Then each vertical refresh I'd blit onto the screen a tunnel going off to infinity with this plasma textured onto it.

Originally I wrote this in Pascal (Turbo Pascal v7 for DOS was a brilliant language, compiler, and IDE). Except it was slow. Glacially slow. So I started rewriting chunks of it in assembler. But I soon ran into the problem that Intel machines don't have many registers. So I stored all the registers in the code segment (writable code segments, yay!), and then used all the registers as general purpose registers, including the stack pointer, which is why I couldn't store them on the stack.

It still wasn't fast enough. One of the reasons for this was I only had a handful of segment registers. Now when you're addressing memory in x86 real mode, you give an offset into one of the segment registers: CS (Code Segment), DS (Data Segment), ES (Extra Segment) or SS (Stack Segment). I needed to access 4 buffers (VGA buffer, texture, and x/y data for my tunnel).

So, I did some research and found the opcodes for the new 386 instructions, and tried using them. But Turbo Pascal's internal assembler didn't know about these opcodes. Now when it comes to x86 assembler there are prefixes for changing some characteristic of the next opcode. For instance an "ES:" prefix would modify the address in the instruction to be relative to ES (Extra Segment) instead of DS (Data Segment). If you used 32 bit registers in real mode a 386-aware assembler would automatically output a prefix saying to use 32 bit values/registers instead of 16 bit ones for the next opcode. This assembler didn't know how to do that. So I would use "db 0x66 ; mov ax, sp" to make the instruction "mov eax, esp". But then I discovered that the 386 had introduced another couple of segment registers (FS and GS). However, the assembler had no prefix for them, so I had to hand assemble and insert the raw hex values for them into my program.
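
For the curious, the prefix bytes involved are below, with a small helper that does what the "db" trick did: stick a prefix byte in front of an already assembled instruction. The two sample encodings are my own hand assembly and are only one valid choice (the assembler may well have emitted 8B C4 for the mov); `with_prefix` and the constant names are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
    PREFIX_ES = 0x26, PREFIX_CS = 0x2E, PREFIX_SS = 0x36, PREFIX_DS = 0x3E,
    PREFIX_FS = 0x64, PREFIX_GS = 0x65,   /* new with the 386 */
    PREFIX_OPERAND_SIZE = 0x66            /* 16 <-> 32 bit operands */
};

/* Copy prefix + instruction bytes into out; returns the new length. */
size_t with_prefix(uint8_t prefix, const uint8_t *insn, size_t len,
                   uint8_t *out, size_t out_size)
{
    assert(out_size >= len + 1);
    out[0] = prefix;
    memcpy(out + 1, insn, len);
    return len + 1;
}

/* mov ax, sp (89 E0); behind 0x66 in 16-bit code it is mov eax, esp */
const uint8_t mov_ax_sp[] = { 0x89, 0xE0 };

/* mov al, [bx] (8A 07); behind 0x64 it reads from FS instead of DS */
const uint8_t mov_al_bx[] = { 0x8A, 0x07 };
```

So "mov al, fs:[bx]" ends up as the three bytes 64 8A 07, typed in by hand.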

So now I had a program that I had hand assembled to make run, and now it ran fast enough to be useful on my machine, and it looked sweet!

But around the time of the release of id Software's game "Doom" all of this disappeared. Compilers didn't produce completely moronic code anymore, and processors were fast enough that you could afford to waste a few cycles here or there. I showed the program I'd painstakingly hand optimised above to a friend. He wrote the entire thing in C and it ran fine on his Pentium computer. (His version, however, didn't run at all reasonably on my 386.)

Doom was written almost entirely in C, including the 3d raycaster. Only the dissolve between missions was written in assembler. Soon computers were coming with 3d accelerated video cards and everyone forgot how to write assembler, except for the fringe lunatics and people who write compilers.

You might think that this is progress. No more do people spend a week painstakingly hand assembling code to get things to run as fast as humanly possible. But, the skills I learnt over 10 years ago when I taught myself to program are still useful today. Knowing the layout of a bootsector means I can correctly identify a lot more problems people have with disk images. Knowing assembly language in far more detail than anyone thinks is necessary helps with debugging obscure problems in programs.

You tell people today about hand assembling programs to get them to run at speed and they just won't believe you. :)