This can be something of a taboo in the Jaguar world, and it seems to crop up every once in a while, sometimes heralded as the ultimate fix, sometimes just mentioned as an interesting quirk. The RISC CPUs in the Jag have their fair share of bugs. One of these relates to the GPU executing its code from the system’s main RAM, and the official line is that the GPU is restricted to running code out of the limited 4K of local RAM built onto the chip. Naturally no one ever abides by manufacturers’ rules, and it was soon discovered that it is in fact possible to run code from main memory! There are a few caveats about address restrictions when it comes to jumps, but nothing too complex. It is most likely a simple cock-up that snuck past in the final design of the chip, and Atari at the time thought it easier to simply say “do not do this” rather than come up with workarounds. Needless to say, there are a few commercially released games on the Jag that actually run code from main RAM (Rayman being one of them).
Anyway, that’s all by the by. There is a lot of passion, and unfortunately the FUD that comes with passion, relating to this technology. So I sat down and decided to try and shine some sciency light on this after all! (I may as well put that BSc Computer Science (Hons) to use I guess 😀 )
So here are a few facts:
- The Atari Jaguar has a single shared bus between all of its devices and the main memory
- Main memory is 2MB of DRAM (120ns)
- The local RAM on the RISC devices has its own local bus to the RISC core, is 32 bits wide and is SRAM
- If the GPU is accessing main RAM it is tying up the bus, so unless a higher priority device comes along and nabs it, the GPU has the bus and nothing else gets to play with the main RAM.
What does this mean performance-wise? Well, DRAM is significantly slower than SRAM, and requires regular refreshing. So instruction reads are going to be slower, and that is assuming nothing else has the bus (there are 4 other devices that could grab it or want it).
The performance aspects, however, always seem to be overlooked. Some rules suggest avoiding “tight loops” in main RAM, but to be honest this is irrelevant, as everything you run from main RAM will take longer. To prove this point (here comes the science) I have crafted a simple little piece of code.
My aim is to accurately time the GPU running code in local RAM and then the exact same code in main RAM. To do this I am using the programmable timers available within the Jag: setting the JPIT counters will cause them to decrement based on the ticks from the system clock (~25MHz). The idea is simple:
- set up a counter
- read the counter’s value at the start
- do some busy work (taking care not to access registers in a way that would cause a pipeline stall)
- read the counter’s value at the end
- save both counter values and subtract one from the other
The final value will be the number of ticks of the counter to complete the busy work.
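The subtraction step can be sketched in a few lines of Python. This is only an illustration, not part of the test itself: it assumes the value read back is a 16-bit down-counter that runs through its full range and wraps from 0 to $FFFF, and the function name `elapsed_ticks` and the sample readings are made up for the example.

```python
# Assumption: the counter is 16 bits wide, counts DOWN, and wraps 0 -> 0xFFFF.
COUNTER_BITS = 16
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def elapsed_ticks(start: int, end: int) -> int:
    """Ticks between two readings of a down-counter.

    Masking the difference keeps the result correct even if the
    counter wrapped past zero once between the two readings.
    """
    return (start - end) & COUNTER_MASK


# Illustrative readings only (not measured values).
print(elapsed_ticks(0x1234, 0x1000))  # counter did not wrap
print(elapsed_ticks(0x0010, 0xFFF0))  # counter wrapped past zero
```

If the busy work ran for longer than one full counter period, the wrap could no longer be detected from two readings alone, so the work being timed needs to stay well under 65536 ticks.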
To remove any question about loops etc I made a nice simple flat piece of code for the testing:
gpucode:
.GPU
.ORG G_RAM
movei #$F10036,r0 ; The JPIT Readable counter
movei #startval,r1 ; where we are going to store our start counter
movei #endval,r2 ; where we are going to store our end counter
moveq #0,r3 ; start counter value reg
moveq #0,r4 ; end counter value reg
; get the current counter value
loadw (r0),r3 ; save this in the start counter reg
; now for some busy work
rept 400 ; 400 repetitions
moveq #4,r10
move r12,r13
moveq #6,r11
move r14,r15
endr
; get the counter now
loadw (r0),r4
nop
nop
nop
; save our counters
store r3,(r1)
store r4,(r2)
; lots of pointless faffing just to make sure the writes have completed
nop
nop
nop
nop
; change the screen colour so we know we have finished faffing
movei #BG,r20
movei #$4400000,r21
nop
nop
store r21,(r20)
moveq #0,r5 ; stop the GPU
movei #G_CTRL,r6
nop
nop
store r5,(r6)
nop
nop
nop
As you can see, nothing amazingly complex. The busy work performs no memory reads or writes; these are pure and simple register instructions which should each complete in a single operation. The results from this little test are quite telling, but not surprising really:
I ran the test 3 times for each case, and the values output are the hex values of the timer. As I simply reset the Jaguar with the jcp -r command, the JPIT counter doesn’t actually reset but carries on regardless! (I didn’t know that until now! learning! isn’t science great! 😀 ) This is why the values move around, but the interesting part is the difference between the two values, which represents how long it took to complete our 1600 instructions (4*400). So first up, running the code in local RAM on the chip:
$d44c – $cae2 = $96a = 2410
$988c – $8f4f = $93d = 2365
$8d48 – $83e6 = $962 = 2402
Average of about 2392 ticks to complete 1600 instructions
And now EXACTLY the same code in Main RAM
$f519 – $a335 = $51e4 = 20964
$7567 – $23a5 = $51c2 = 20930
$86e9 – $3519 = $51d0 = 20944
Average of about 20946!!
That is nearly 9 times slower!! And these instructions don’t really do ANYTHING! And this is on a system where the only other thing running is the 68K, which is sat patiently waiting for results to appear. If additional padding nops were added to the code to make jumps work, or there were instructions that actually accessed other areas of main RAM, or perhaps even WROTE to main RAM.. well, things are going to get slower and I dare say more messy as the RAM page is flipped back and forth..
So my verdict.. run it in local, people. There may be some situations where it is necessary to run in main, but I would view these as the edge cases, the minority. It should be possible to run pretty much everything in local; a bit of thought and some paging of code if required should be all that’s needed to keep your GPU code running in a tip-toppety fashion.
Hopefully people will find this an informative and useful read. At the end of the day this is a hobby. If you want to run your code in main, go for it! Have fun! Enjoy what you are doing! Just don’t expect it to be the most snappy code.
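For anyone who wants to recheck the arithmetic, here is a small Python sketch that recomputes the differences, averages and slowdown directly from the raw counter readings quoted above. It assumes the readings are 16-bit values and that the busy work is 4 instructions repeated 400 times, as in the listing; the helper names `ticks` and `summarise` are just for this example.

```python
from statistics import mean

# Four busy-work instructions per repetition, 400 repetitions (from the listing).
INSTRUCTIONS = 4 * 400

# Raw counter readings as quoted in the results, as (larger, smaller) pairs.
local_runs = [(0xD44C, 0xCAE2), (0x988C, 0x8F4F), (0x8D48, 0x83E6)]
main_runs = [(0xF519, 0xA335), (0x7567, 0x23A5), (0x86E9, 0x3519)]


def ticks(runs):
    """Tick count for each run; the mask assumes a 16-bit counter."""
    return [(first - second) & 0xFFFF for first, second in runs]


def summarise(label, runs):
    """Print per-run ticks, the average and the cost per instruction."""
    per_run = ticks(runs)
    average = mean(per_run)
    print(f"{label}: ticks={per_run} "
          f"average={average:.0f} "
          f"ticks/instruction={average / INSTRUCTIONS:.2f}")
    return average


local_avg = summarise("local RAM", local_runs)
main_avg = summarise("main RAM", main_runs)
print(f"main RAM is {main_avg / local_avg:.2f}x slower than local RAM")
```

Working from the raw readings rather than the hand-calculated differences means any slip in the hex subtraction shows up straight away.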