Exophase wrote: Did you measure the performance of the loop w/o the loads and subtract that from the total time?
Yes, I ran the exact same function with NOPs in place of the LOAD and CACHE instructions and subtracted the time it took from the time the instrumented version took. I think you're correct about the CACHE instruction taking two cycles; however, I believe a CP0/DC stall will only arise if the CACHE instruction is in the EX stage while the LOAD is in the DC stage, hence my attempt to separate the two with a NOP.
Of course, I could be wrong.
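
For anyone following along, the subtraction described above boils down to roughly the following. This is only a sketch of the arithmetic, not the actual test routine: the helper names are made up for illustration, and it assumes the timings come from a free-running 32-bit counter that is read before and after each loop.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: ticks elapsed between two reads of a free-running
 * 32-bit counter. Unsigned subtraction wraps modulo 2^32, so a single
 * counter rollover between the two reads is handled correctly. */
static uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Hypothetical helper: extra ticks per iteration attributable to the
 * instructions under test (here, the LOAD/CACHE pair).
 *   instrumented_ticks: time for the loop containing LOAD and CACHE
 *   baseline_ticks:     time for the same loop with NOPs in their place
 *   iterations:         number of loop iterations in each run */
static double extra_ticks_per_iteration(uint32_t instrumented_ticks,
                                        uint32_t baseline_ticks,
                                        uint32_t iterations)
{
    assert(iterations > 0);
    /* Go through a signed type so that a baseline that happens to be
     * slightly slower than the instrumented run gives a small negative
     * result instead of a huge positive one. */
    int32_t delta = (int32_t)(instrumented_ticks - baseline_ticks);
    return (double)delta / (double)iterations;
}
```

The result is in whatever unit the counter ticks in, so it still has to be converted to CPU cycles, and it only isolates the LOAD/CACHE cost if the two loops are otherwise identical.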

I can validate this at a later time by testing with a different method, perhaps the one you mentioned. For now, however, the delays I estimated using the aforementioned routine were enough to uncover some exception-handling bugs caused by some of my generalizations and misunderstandings. They also appear to be accurate enough to prevent at least some ROMs (OoT) from crashing in the odd instances mentioned in another thread on the boards, so at least I'd like to think I'm in the ballpark.
Of course, these figures are probably inaccurate when the RCP is performing DMAs and other events. It'll never be truly accurate until everything comes full circle.