Where's my runtime going?

Posted: Thu May 07, 2015 7:56 pm
by MarathonMan
Results collected using perf on Debian Jessie on an i7-4558U. For each ROM, I just let it run for a few minutes without pressing any buttons.

Since GCC does a lot of optimizations, a lot of the runtime ends up under the "device_*" classifier, where it is essentially unattributable to any particular entity. However, it's safe to assume that a large portion of it is divided between the RSP, VR4300, RDP display list decoding, and VI. Even so, there are still some interesting results (at least compared to what I had expected):

Mario Kart 64:
RSP (10.08%) -- perf report --stdio | grep -i rsp | awk -F% '{sum += $1} END { print sum; }'
VR4300 (11.2%) -- perf report --stdio | grep -i vr4300 | awk -F% '{sum += $1} END { print sum; }'
Device (44.56%) -- perf report --stdio | grep -i device | awk -F% '{sum += $1} END { print sum; }'
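The three pipelines above each scan the whole report, so a row whose text matches two patterns would be counted in both sums. A minimal single-pass sketch follows, assuming each row should land in exactly one bucket. The function name `bucket_sums`, the "Other" bucket, and the match order (Device, then VR4300, then RSP) are my own assumptions, not something the original measurements used, so its totals may differ slightly from the figures quoted here.

```shell
# Sum the overhead column of `perf report --stdio` into one bucket per component.
# Each row is counted once; bucket names and match order are assumptions.
bucket_sums() {
    awk '
        /^[ \t]*[0-9.]+%/ {                 # only rows that start with a percentage
            pct = $1; sub(/%/, "", pct)     # "8.46%" -> "8.46"
            line = tolower($0)              # case-insensitive match, like grep -i
            if      (line ~ /device/) b = "Device"
            else if (line ~ /vr4300/) b = "VR4300"
            else if (line ~ /rsp/)    b = "RSP"
            else                      b = "Other"
            sum[b] += pct
        }
        END {
            n = split("RSP VR4300 Device Other", order, " ")
            for (i = 1; i <= n; i++)
                printf "%-7s %6.2f%%\n", order[i], sum[order[i]]
        }
    '
}
```

With a perf.data file in the current directory, it would be invoked as `perf report --stdio | bucket_sums`. The "Other" line then corresponds roughly to the "remaining stuff" listings, with the i965_dri.so and libc rows included.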

Remaining stuff is mostly RDP (~27.12%):

perf report --stdio | grep -iv device | grep -iv rsp | grep -iv vr4300

     8.46%    cen64  cen64                  [.] render_spans_1cycle_notexel1.lto_priv.145          
     7.21%    cen64  cen64                  [.] render_spans_1cycle_notex.lto_priv.144             
     3.65%    cen64  cen64                  [.] texture_pipeline_cycle.constprop.2                 
     2.48%    cen64  cen64                  [.] fetch_texel_quadro.lto_priv.9                      
     1.53%    cen64  i965_dri.so            [.] 0x00000000000f64fc                                 
     1.49%    cen64  cen64                  [.] fbwrite_16                                         
     1.37%    cen64  cen64                  [.] fbread_16                                          
     0.80%    cen64  cen64                  [.] edgewalker_for_loads.constprop.1                   
     0.71%    cen64  cen64                  [.] edgewalker_for_prims.lto_priv.122                  
     0.48%    cen64  i965_dri.so            [.] 0x00000000000f6516                                 
     0.39%    cen64  cen64                  [.] fbfill_16                                          
     0.38%    cen64  i965_dri.so            [.] 0x00000000000f6509                                 
     0.30%    cen64  cen64                  [.] bus_read_word                                      
     0.22%    cen64  cen64                  [.] fetch_texel_entlut_quadro.lto_priv.10              
     0.14%    cen64  libc-2.19.so           [.] __memcpy_sse2_unaligned                            
     0.13%    cen64  cen64                  [.] get_dither_nothing.lto_priv.151                    
     0.11%    cen64  cen64                  [.] write_dp_regs                                      
     0.10%    cen64  i965_dri.so            [.] 0x0000000000105298                                 
     0.10%    cen64  cen64                  [.] rgb_dither_nothing.lto_priv.142

Super Smash Bros.
RSP (9.22%) -- perf report --stdio | grep -i rsp | awk -F% '{sum += $1} END { print sum; }'
VR4300 (8.41%) -- perf report --stdio | grep -i vr4300 | awk -F% '{sum += $1} END { print sum; }'
Device (35.68%) -- perf report --stdio | grep -i device | awk -F% '{sum += $1} END { print sum; }'

Remaining stuff is mostly RDP (~40.94%):

perf report --stdio | grep -iv device | grep -iv rsp | grep -iv vr4300

    13.81%    cen64  cen64                  [.] render_spans_1cycle_notexel1.lto_priv.145         
     7.02%    cen64  cen64                  [.] texture_pipeline_cycle.constprop.2                
     3.44%    cen64  cen64                  [.] render_spans_1cycle_notex.lto_priv.144            
     3.13%    cen64  cen64                  [.] fetch_texel_entlut_quadro.lto_priv.10             
     2.94%    cen64  cen64                  [.] fetch_texel_quadro.lto_priv.9                     
     1.89%    cen64  cen64                  [.] rgb_dither_complete.lto_priv.141                  
     1.83%    cen64  cen64                  [.] render_spans_2cycle_notexel1.lto_priv.139         
     1.56%    cen64  cen64                  [.] fbwrite_16                                        
     1.47%    cen64  cen64                  [.] fbread_16                                         
     1.11%    cen64  i965_dri.so            [.] 0x00000000000f64fc                                
     1.06%    cen64  cen64                  [.] edgewalker_for_prims.lto_priv.122                 
     0.81%    cen64  cen64                  [.] render_spans_2cycle_notex.lto_priv.146            
     0.67%    cen64  cen64                  [.] edgewalker_for_loads.constprop.1                  
     0.66%    cen64  cen64                  [.] fbfill_16                                         
     0.65%    cen64  cen64                  [.] get_dither_only.lto_priv.150                      
     0.35%    cen64  i965_dri.so            [.] 0x00000000000f6516                                
     0.33%    cen64  cen64                  [.] bus_read_word                                     
     0.28%    cen64  i965_dri.so            [.] 0x00000000000f6509                                
     0.13%    cen64  cen64                  [.] fbread2_16                                        
     0.12%    cen64  libc-2.19.so           [.] __memcpy_sse2_unaligned                           
     0.12%    cen64  libc-2.19.so           [.] memset                                            
     0.11%    cen64  cen64                  [.] bus_write_word.constprop.8                        
     0.08%    cen64  cen64                  [.] write_dp_regs

Zelda: Ocarina of Time
RSP (8.62%) -- perf report --stdio | grep -i rsp | awk -F% '{sum += $1} END { print sum; }'
VR4300 (9.92%) -- perf report --stdio | grep -i vr4300 | awk -F% '{sum += $1} END { print sum; }'
Device (37.51%) -- perf report --stdio | grep -i device | awk -F% '{sum += $1} END { print sum; }'

Remaining stuff is mostly RDP (~38.25%):

perf report --stdio | grep -iv device | grep -iv rsp | grep -iv vr4300

     8.97%    cen64  cen64                  [.] render_spans_2cycle_notexelnext.lto_priv.147      
     5.57%    cen64  cen64                  [.] render_spans_2cycle_notexel1.lto_priv.139         
     5.39%    cen64  cen64                  [.] texture_pipeline_cycle.constprop.2                
     3.45%    cen64  cen64                  [.] fetch_texel_quadro.lto_priv.9                     
     3.13%    cen64  cen64                  [.] fetch_texel_entlut_quadro.lto_priv.10             
     2.92%    cen64  cen64                  [.] texture_pipeline_cycle.constprop.3                
     1.48%    cen64  cen64                  [.] rgb_dither_complete.lto_priv.141                  
     1.37%    cen64  i965_dri.so            [.] 0x00000000000f64fc                                
     0.97%    cen64  cen64                  [.] fbwrite_16                                        
     0.95%    cen64  cen64                  [.] edgewalker_for_loads.constprop.1                  
     0.93%    cen64  cen64                  [.] fbread2_16                                        
     0.75%    cen64  cen64                  [.] render_spans_1cycle_notexel1.lto_priv.145         
     0.68%    cen64  cen64                  [.] edgewalker_for_prims.lto_priv.122                 
     0.58%    cen64  cen64                  [.] render_spans_1cycle_notex.lto_priv.144            
     0.49%    cen64  cen64                  [.] fbfill_16                                         
     0.43%    cen64  i965_dri.so            [.] 0x00000000000f6516                                
     0.41%    cen64  cen64                  [.] get_dither_only.lto_priv.150                      
     0.37%    cen64  cen64                  [.] bus_read_word                                     
     0.35%    cen64  i965_dri.so            [.] 0x00000000000f6509                                
     0.21%    cen64  cen64                  [.] render_spans_2cycle_complete.lto_priv.148         
     0.15%    cen64  libc-2.19.so           [.] __memcpy_sse2_unaligned                           
     0.14%    cen64  cen64                  [.] fbread_16                                         
     0.10%    cen64  cen64                  [.] bus_write_word.constprop.8                    

Super Mario 64
RSP (14.17%) -- perf report --stdio | grep -i rsp | awk -F% '{sum += $1} END { print sum; }'
VR4300 (10.51%) -- perf report --stdio | grep -i vr4300 | awk -F% '{sum += $1} END { print sum; }'
Device (48.04%) -- perf report --stdio | grep -i device | awk -F% '{sum += $1} END { print sum; }'

Remaining stuff is mostly RDP (~21.67%):

perf report --stdio | grep -iv device | grep -iv rsp | grep -iv vr4300

     8.18%    cen64  cen64                  [.] render_spans_1cycle_notexel1.lto_priv.145         
     3.97%    cen64  cen64                  [.] texture_pipeline_cycle.constprop.2                
     2.87%    cen64  cen64                  [.] fetch_texel_quadro.lto_priv.9                     
     1.32%    cen64  cen64                  [.] render_spans_2cycle_notexel1.lto_priv.139         
     1.13%    cen64  cen64                  [.] rgb_dither_complete.lto_priv.141                  
     1.09%    cen64  i965_dri.so            [.] 0x00000000000f64fc                                
     0.98%    cen64  cen64                  [.] edgewalker_for_prims.lto_priv.122                 
     0.82%    cen64  cen64                  [.] fbwrite_16                                        
     0.78%    cen64  cen64                  [.] fbread_16                                         
     0.65%    cen64  cen64                  [.] edgewalker_for_loads.constprop.1                  
     0.48%    cen64  cen64                  [.] bus_read_word                                     
     0.34%    cen64  i965_dri.so            [.] 0x00000000000f6516                                
     0.32%    cen64  cen64                  [.] render_spans_1cycle_notex.lto_priv.144            
     0.29%    cen64  cen64                  [.] get_dither_only.lto_priv.150                      
     0.28%    cen64  i965_dri.so            [.] 0x00000000000f6509                                
     0.24%    cen64  cen64                  [.] fbfill_16                                         
     0.12%    cen64  cen64                  [.] bus_write_word.constprop.8                        
     0.12%    cen64  cen64                  [.] write_dp_regs                                     
     0.11%    cen64  libc-2.19.so           [.] __memcpy_sse2_unaligned                           
     0.09%    cen64  libc-2.19.so           [.] memset                                            
     0.07%    cen64  i965_dri.so            [.] 0x0000000000105298                                
     0.07%    cen64  i965_dri.so            [.] 0x00000000000f6505                                
     0.07%    cen64  i965_dri.so            [.] 0x00000000000f6540                                
     0.07%    cen64  i965_dri.so            [.] 0x000000000010528a                                
     0.07%    cen64  cen64                  [.] fbread2_16          

LaC's "fire" demo (no RSP/RDP):

perf report --stdio

    75.33%    cen64  cen64                  [.] device_spin.lto_priv.19                            
     7.29%    cen64  cen64                  [.] VR4300_LOAD_STORE                                  
     4.44%    cen64  cen64                  [.] vr4300_cycle_slow_ex.lto_priv.46                   
     2.02%    cen64  cen64                  [.] VR4300_ADDIU_LUI_SUBIU                             
     1.50%    cen64  i965_dri.so            [.] 0x00000000000f64fc                                 
     1.12%    cen64  cen64                  [.] VR4300_ADDU_SUBU                                   
     0.84%    cen64  cen64                  [.] VR4300_SLL_SLLV                                    
     0.49%    cen64  i965_dri.so            [.] 0x00000000000f6516                                 
     0.43%    cen64  cen64                  [.] VR4300_DCB                                         
     0.39%    cen64  i965_dri.so            [.] 0x00000000000f6509                                 
     0.39%    cen64  cen64                  [.] VR4300_ANDI_ORI_XORI                               
     0.33%    cen64  cen64                  [.] vr4300_cycle_slow_dc.lto_priv.45                   
     0.30%    cen64  cen64                  [.] VR4300_AND_OR_XOR                                  
     0.28%    cen64  cen64                  [.] VR4300_BEQ_BEQL_BNE_BNEL_BWDETECT                  
     0.23%    cen64  cen64                  [.] bus_read_word                                      
     0.16%    cen64  cen64                  [.] bus_write_word.constprop.8                         
     0.15%    cen64  cen64                  [.] VR4300_SLTIU                                       
     0.12%    cen64  libc-2.19.so           [.] __memcpy_sse2_unaligned               

Pokémon Snap
RSP (10.66%) -- perf report --stdio | grep -i rsp | awk -F% '{sum += $1} END { print sum; }'
VR4300 (6.01%) -- perf report --stdio | grep -i vr4300 | awk -F% '{sum += $1} END { print sum; }'
Device (25.77%) -- perf report --stdio | grep -i device | awk -F% '{sum += $1} END { print sum; }'

Remaining stuff is mostly RDP (~53.91%):

perf report --stdio | grep -iv device | grep -iv rsp | grep -iv vr4300

    27.20%    cen64  cen64                  [.] render_spans_2cycle_notexel1.lto_priv.139          
    10.21%    cen64  cen64                  [.] texture_pipeline_cycle.constprop.2                 
     4.97%    cen64  cen64                  [.] fetch_texel_entlut_quadro.lto_priv.10              
     2.83%    cen64  cen64                  [.] fetch_texel_quadro.lto_priv.9                      
     1.94%    cen64  cen64                  [.] rgb_dither_complete.lto_priv.141                   
     1.54%    cen64  cen64                  [.] fbread2_16                                         
     1.37%    cen64  cen64                  [.] edgewalker_for_prims.lto_priv.122                  
     1.14%    cen64  cen64                  [.] fbwrite_16                                         
     0.83%    cen64  cen64                  [.] fbfill_16                                          
     0.72%    cen64  cen64                  [.] get_dither_only.lto_priv.150                       
     0.62%    cen64  i965_dri.so            [.] 0x00000000000f64fc                                 
     0.55%    cen64  cen64                  [.] edgewalker_for_loads.constprop.1                   
     0.36%    cen64  cen64                  [.] render_spans_2cycle_notexelnext.lto_priv.147       
     0.25%    cen64  cen64                  [.] bus_read_word                                      
     0.18%    cen64  i965_dri.so            [.] 0x00000000000f6516                                 
     0.17%    cen64  libc-2.19.so           [.] memset                                             
     0.17%    cen64  cen64                  [.] render_spans_1cycle_notexel1.lto_priv.145          
     0.16%    cen64  i965_dri.so            [.] 0x00000000000f6509                                 
     0.15%    cen64  cen64                  [.] render_spans_2cycle_notex.lto_priv.146             
     0.13%    cen64  cen64                  [.] texture_pipeline_cycle.constprop.3                 
     0.11%    cen64  cen64                  [.] render_spans_1cycle_notex.lto_priv.144             
     0.10%    cen64  cen64                  [.] bus_write_word.constprop.8                         
     0.07%    cen64  cen64                  [.] write_dp_regs                                      
     0.05%    cen64  libc-2.19.so           [.] __memcpy_sse2_unaligned

Welp... I didn't realize the RDP was putting that much of a damper on performance in some cases.

Re: Where's my runtime going?

Posted: Sat May 09, 2015 11:48 pm
by OldGnashburg
What does this mean for you and CEN64?

Re: Where's my runtime going?

Posted: Sun May 10, 2015 3:10 am
by ShadowFX
For starters, it meant a new build :)

Re: Where's my runtime going?

Posted: Wed May 13, 2015 3:04 pm
by Narann
I'm not that surprised, actually. The RDP (angrylion) code is wonderfully nice and almost serves as RDP documentation by itself. However, IIRC, many operations could easily be vectorized. The whole code could even be multithreaded (like a tile renderer) at the cost of losing atomic accuracy (but I'm not even sure there is any documentation of the RDP's rendering pattern). IIRC, there is no locking operation during RDP processing. Everything operates on one pixel at a time. This would mean you could basically reach a 4x or 8x speedup.

But this is only my perspective. There may be some dark RDP corner I don't know about.