Omniperf version: 2.0.0-RC1
Profiler choice: rocprofv1
Path: /home1/josantos/omniperf/tests/workloads/dispatch_7/MI100
Target: MI100
Command: ./tests/vcopy -n 1048576 -b 256 -i 3
Kernel Selection: None
Dispatch Selection: ['7']
IP Blocks: All

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Collecting Performance Counters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


[profiling] Current input file: tests/workloads/dispatch_7/MI100/perfmon/SQ_IFETCH_LEVEL.txt
   |-> [rocprof] RPL: on '240321_155247' from '/opt/rocm-6.0.2' in '/home1/josantos/omniperf'
   |-> [rocprof] RPL: profiling '""./tests/vcopy -n 1048576 -b 256 -i 3""'
   |-> [rocprof] RPL: input file 'tests/workloads/dispatch_7/MI100/perfmon/SQ_IFETCH_LEVEL.txt'
   |-> [rocprof] RPL: output dir '/tmp/rpl_data_240321_155247_1237531'
   |-> [rocprof] RPL: result dir '/tmp/rpl_data_240321_155247_1237531/input0_results_240321_155247'
   |-> [rocprof] ROCProfiler: input from "/tmp/rpl_data_240321_155247_1237531/input0.xml"
   |-> [rocprof] gpu_index =
   |-> [rocprof] kernel =
   |-> [rocprof] range = 7
   |-> [rocprof] 6 metrics
   |-> [rocprof] GRBM_COUNT, GRBM_GUI_ACTIVE, SQ_WAVES, SQ_IFETCH, SQ_IFETCH_LEVEL, SQ_ACCUM_PREV_HIRES
   |-> [rocprof] vcopy testing on GCD 0
   |-> [rocprof] Finished allocating vectors on the CPU
   |-> [rocprof] Finished allocating vectors on the GPU
   |-> [rocprof] Finished copying vectors to the GPU
   |-> [rocprof] sw thinks it moved 1.000000 KB per wave
   |-> [rocprof] Total threads: 1048576, Grid Size: 4096 block Size:256, Wavefronts:16384:
   |-> [rocprof] Launching the  kernel on the GPU
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished copying the output vector from the GPU to the CPU
   |-> [rocprof] Releasing GPU memory
   |-> [rocprof] Releasing CPU memory
   |-> [rocprof]
   |-> [rocprof] ROCPRofiler: 0 contexts collected, output directory /tmp/rpl_data_240321_155247_1237531/input0_results_240321_155247
   |-> [rocprof] File 'tests/workloads/dispatch_7/MI100/SQ_IFETCH_LEVEL.csv' is generating
   |-> [rocprof]

[profiling] Current input file: tests/workloads/dispatch_7/MI100/perfmon/SQ_INST_LEVEL_LDS.txt
   |-> [rocprof] RPL: on '240321_155248' from '/opt/rocm-6.0.2' in '/home1/josantos/omniperf'
   |-> [rocprof] RPL: profiling '""./tests/vcopy -n 1048576 -b 256 -i 3""'
   |-> [rocprof] RPL: input file 'tests/workloads/dispatch_7/MI100/perfmon/SQ_INST_LEVEL_LDS.txt'
   |-> [rocprof] RPL: output dir '/tmp/rpl_data_240321_155248_1237715'
   |-> [rocprof] RPL: result dir '/tmp/rpl_data_240321_155248_1237715/input0_results_240321_155248'
   |-> [rocprof] ROCProfiler: input from "/tmp/rpl_data_240321_155248_1237715/input0.xml"
   |-> [rocprof] gpu_index =
   |-> [rocprof] kernel =
   |-> [rocprof] range = 7
   |-> [rocprof] 3 metrics
   |-> [rocprof] SQ_INSTS_LDS, SQ_INST_LEVEL_LDS, SQ_ACCUM_PREV_HIRES
   |-> [rocprof] vcopy testing on GCD 0
   |-> [rocprof] Finished allocating vectors on the CPU
   |-> [rocprof] Finished allocating vectors on the GPU
   |-> [rocprof] Finished copying vectors to the GPU
   |-> [rocprof] sw thinks it moved 1.000000 KB per wave
   |-> [rocprof] Total threads: 1048576, Grid Size: 4096 block Size:256, Wavefronts:16384:
   |-> [rocprof] Launching the  kernel on the GPU
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished copying the output vector from the GPU to the CPU
   |-> [rocprof] Releasing GPU memory
   |-> [rocprof] Releasing CPU memory
   |-> [rocprof]
   |-> [rocprof] ROCPRofiler: 0 contexts collected, output directory /tmp/rpl_data_240321_155248_1237715/input0_results_240321_155248
   |-> [rocprof] File 'tests/workloads/dispatch_7/MI100/SQ_INST_LEVEL_LDS.csv' is generating
   |-> [rocprof]

[profiling] Current input file: tests/workloads/dispatch_7/MI100/perfmon/SQ_INST_LEVEL_SMEM.txt
   |-> [rocprof] RPL: on '240321_155248' from '/opt/rocm-6.0.2' in '/home1/josantos/omniperf'
   |-> [rocprof] RPL: profiling '""./tests/vcopy -n 1048576 -b 256 -i 3""'
   |-> [rocprof] RPL: input file 'tests/workloads/dispatch_7/MI100/perfmon/SQ_INST_LEVEL_SMEM.txt'
   |-> [rocprof] RPL: output dir '/tmp/rpl_data_240321_155248_1237900'
   |-> [rocprof] RPL: result dir '/tmp/rpl_data_240321_155248_1237900/input0_results_240321_155248'
   |-> [rocprof] ROCProfiler: input from "/tmp/rpl_data_240321_155248_1237900/input0.xml"
   |-> [rocprof] gpu_index =
   |-> [rocprof] kernel =
   |-> [rocprof] range = 7
   |-> [rocprof] 3 metrics
   |-> [rocprof] SQ_INSTS_SMEM, SQ_INST_LEVEL_SMEM, SQ_ACCUM_PREV_HIRES
   |-> [rocprof] vcopy testing on GCD 0
   |-> [rocprof] Finished allocating vectors on the CPU
   |-> [rocprof] Finished allocating vectors on the GPU
   |-> [rocprof] Finished copying vectors to the GPU
   |-> [rocprof] sw thinks it moved 1.000000 KB per wave
   |-> [rocprof] Total threads: 1048576, Grid Size: 4096 block Size:256, Wavefronts:16384:
   |-> [rocprof] Launching the  kernel on the GPU
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished copying the output vector from the GPU to the CPU
   |-> [rocprof] Releasing GPU memory
   |-> [rocprof] Releasing CPU memory
   |-> [rocprof]
   |-> [rocprof] ROCPRofiler: 0 contexts collected, output directory /tmp/rpl_data_240321_155248_1237900/input0_results_240321_155248
   |-> [rocprof] File 'tests/workloads/dispatch_7/MI100/SQ_INST_LEVEL_SMEM.csv' is generating
   |-> [rocprof]

[profiling] Current input file: tests/workloads/dispatch_7/MI100/perfmon/SQ_INST_LEVEL_VMEM.txt
   |-> [rocprof] RPL: on '240321_155248' from '/opt/rocm-6.0.2' in '/home1/josantos/omniperf'
   |-> [rocprof] RPL: profiling '""./tests/vcopy -n 1048576 -b 256 -i 3""'
   |-> [rocprof] RPL: input file 'tests/workloads/dispatch_7/MI100/perfmon/SQ_INST_LEVEL_VMEM.txt'
   |-> [rocprof] RPL: output dir '/tmp/rpl_data_240321_155248_1238087'
   |-> [rocprof] RPL: result dir '/tmp/rpl_data_240321_155248_1238087/input0_results_240321_155248'
   |-> [rocprof] ROCProfiler: input from "/tmp/rpl_data_240321_155248_1238087/input0.xml"
   |-> [rocprof] gpu_index =
   |-> [rocprof] kernel =
   |-> [rocprof] range = 7
   |-> [rocprof] 3 metrics
   |-> [rocprof] SQ_INSTS_VMEM, SQ_INST_LEVEL_VMEM, SQ_ACCUM_PREV_HIRES
   |-> [rocprof] vcopy testing on GCD 0
   |-> [rocprof] Finished allocating vectors on the CPU
   |-> [rocprof] Finished allocating vectors on the GPU
   |-> [rocprof] Finished copying vectors to the GPU
   |-> [rocprof] sw thinks it moved 1.000000 KB per wave
   |-> [rocprof] Total threads: 1048576, Grid Size: 4096 block Size:256, Wavefronts:16384:
   |-> [rocprof] Launching the  kernel on the GPU
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished copying the output vector from the GPU to the CPU
   |-> [rocprof] Releasing GPU memory
   |-> [rocprof] Releasing CPU memory
   |-> [rocprof]
   |-> [rocprof] ROCPRofiler: 0 contexts collected, output directory /tmp/rpl_data_240321_155248_1238087/input0_results_240321_155248
   |-> [rocprof] File 'tests/workloads/dispatch_7/MI100/SQ_INST_LEVEL_VMEM.csv' is generating
   |-> [rocprof]

[profiling] Current input file: tests/workloads/dispatch_7/MI100/perfmon/SQ_LEVEL_WAVES.txt
   |-> [rocprof] RPL: on '240321_155249' from '/opt/rocm-6.0.2' in '/home1/josantos/omniperf'
   |-> [rocprof] RPL: profiling '""./tests/vcopy -n 1048576 -b 256 -i 3""'
   |-> [rocprof] RPL: input file 'tests/workloads/dispatch_7/MI100/perfmon/SQ_LEVEL_WAVES.txt'
   |-> [rocprof] RPL: output dir '/tmp/rpl_data_240321_155249_1238273'
   |-> [rocprof] RPL: result dir '/tmp/rpl_data_240321_155249_1238273/input0_results_240321_155249'
   |-> [rocprof] ROCProfiler: input from "/tmp/rpl_data_240321_155249_1238273/input0.xml"
   |-> [rocprof] gpu_index =
   |-> [rocprof] kernel =
   |-> [rocprof] range = 7
   |-> [rocprof] 9 metrics
   |-> [rocprof] GRBM_COUNT, GRBM_GUI_ACTIVE, CPC_ME1_BUSY_FOR_PACKET_DECODE, SQ_CYCLES, SQ_WAVES, SQ_WAVE_CYCLES, SQ_BUSY_CYCLES, SQ_LEVEL_WAVES, SQ_ACCUM_PREV_HIRES
   |-> [rocprof] vcopy testing on GCD 0
   |-> [rocprof] Finished allocating vectors on the CPU
   |-> [rocprof] Finished allocating vectors on the GPU
   |-> [rocprof] Finished copying vectors to the GPU
   |-> [rocprof] sw thinks it moved 1.000000 KB per wave
   |-> [rocprof] Total threads: 1048576, Grid Size: 4096 block Size:256, Wavefronts:16384:
   |-> [rocprof] Launching the  kernel on the GPU
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished copying the output vector from the GPU to the CPU
   |-> [rocprof] Releasing GPU memory
   |-> [rocprof] Releasing CPU memory
   |-> [rocprof]
   |-> [rocprof] ROCPRofiler: 0 contexts collected, output directory /tmp/rpl_data_240321_155249_1238273/input0_results_240321_155249
   |-> [rocprof] File 'tests/workloads/dispatch_7/MI100/SQ_LEVEL_WAVES.csv' is generating
   |-> [rocprof]

[profiling] Current input file: tests/workloads/dispatch_7/MI100/perfmon/pmc_perf_0.txt
   |-> [rocprof] RPL: on '240321_155249' from '/opt/rocm-6.0.2' in '/home1/josantos/omniperf'
   |-> [rocprof] RPL: profiling '""./tests/vcopy -n 1048576 -b 256 -i 3""'
   |-> [rocprof] RPL: input file 'tests/workloads/dispatch_7/MI100/perfmon/pmc_perf_0.txt'
   |-> [rocprof] RPL: output dir '/tmp/rpl_data_240321_155249_1238458'
   |-> [rocprof] RPL: result dir '/tmp/rpl_data_240321_155249_1238458/input0_results_240321_155249'
   |-> [rocprof] ROCProfiler: input from "/tmp/rpl_data_240321_155249_1238458/input0.xml"
   |-> [rocprof] gpu_index =
   |-> [rocprof] kernel =
   |-> [rocprof] range = 7
   |-> [rocprof] 28 metrics
   |-> [rocprof] SQ_CYCLES, SQ_BUSY_CYCLES, SQ_BUSY_CU_CYCLES, SQ_WAVES, SQ_WAVE_CYCLES, SQC_TC_INST_REQ, SQC_TC_DATA_READ_REQ, SQC_TC_DATA_WRITE_REQ, GRBM_COUNT, GRBM_GUI_ACTIVE, TCP_GATE_EN1_sum, TCP_GATE_EN2_sum, TCP_TD_TCP_STALL_CYCLES_sum, TCP_TCR_TCP_STALL_CYCLES_sum, TA_TA_BUSY_sum, TA_BUFFER_WAVEFRONTS_sum, TD_TD_BUSY_sum, TD_TC_STALL_sum, SPI_CSN_WINDOW_VALID, SPI_CSN_BUSY, CPC_CPC_STAT_BUSY, CPC_CPC_STAT_IDLE, CPF_CPF_STAT_BUSY, CPF_CPF_STAT_STALL, TCC_CYCLE_sum, TCC_BUSY_sum, TCC_PROBE_sum, TCC_PROBE_ALL_sum
   |-> [rocprof] vcopy testing on GCD 0
   |-> [rocprof] Finished allocating vectors on the CPU
   |-> [rocprof] Finished allocating vectors on the GPU
   |-> [rocprof] Finished copying vectors to the GPU
   |-> [rocprof] sw thinks it moved 1.000000 KB per wave
   |-> [rocprof] Total threads: 1048576, Grid Size: 4096 block Size:256, Wavefronts:16384:
   |-> [rocprof] Launching the  kernel on the GPU
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished copying the output vector from the GPU to the CPU
   |-> [rocprof] Releasing GPU memory
   |-> [rocprof] Releasing CPU memory
   |-> [rocprof]
   |-> [rocprof] ROCPRofiler: 0 contexts collected, output directory /tmp/rpl_data_240321_155249_1238458/input0_results_240321_155249
   |-> [rocprof] File 'tests/workloads/dispatch_7/MI100/pmc_perf_0.csv' is generating
   |-> [rocprof]

[profiling] Current input file: tests/workloads/dispatch_7/MI100/perfmon/pmc_perf_1.txt
   |-> [rocprof] RPL: on '240321_155250' from '/opt/rocm-6.0.2' in '/home1/josantos/omniperf'
   |-> [rocprof] RPL: profiling '""./tests/vcopy -n 1048576 -b 256 -i 3""'
   |-> [rocprof] RPL: input file 'tests/workloads/dispatch_7/MI100/perfmon/pmc_perf_1.txt'
   |-> [rocprof] RPL: output dir '/tmp/rpl_data_240321_155250_1238645'
   |-> [rocprof] RPL: result dir '/tmp/rpl_data_240321_155250_1238645/input0_results_240321_155250'
   |-> [rocprof] ROCProfiler: input from "/tmp/rpl_data_240321_155250_1238645/input0.xml"
   |-> [rocprof] gpu_index =
   |-> [rocprof] kernel =
   |-> [rocprof] range = 7
   |-> [rocprof] 27 metrics
   |-> [rocprof] SQC_TC_DATA_ATOMIC_REQ, SQC_TC_STALL, SQC_TC_REQ, SQC_DCACHE_REQ_READ_16, SQC_ICACHE_REQ, SQC_ICACHE_HITS, SQC_ICACHE_MISSES, SQC_ICACHE_MISSES_DUPLICATE, GRBM_SPI_BUSY, TCP_READ_TAGCONFLICT_STALL_CYCLES_sum, TCP_WRITE_TAGCONFLICT_STALL_CYCLES_sum, TCP_ATOMIC_TAGCONFLICT_STALL_CYCLES_sum, TCP_TA_TCP_STATE_READ_sum, TA_BUFFER_READ_WAVEFRONTS_sum, TA_BUFFER_WRITE_WAVEFRONTS_sum, TD_COALESCABLE_WAVEFRONT_sum, TD_LOAD_WAVEFRONT_sum, SPI_CSN_NUM_THREADGROUPS, SPI_CSN_WAVE, CPC_CPC_TCIU_BUSY, CPC_CPC_TCIU_IDLE, CPF_CPF_TCIU_BUSY, CPF_CPF_TCIU_STALL, TCC_NC_REQ_sum, TCC_UC_REQ_sum, TCC_CC_REQ_sum, TCC_RW_REQ_sum
   |-> [rocprof] vcopy testing on GCD 0
   |-> [rocprof] Finished allocating vectors on the CPU
   |-> [rocprof] Finished allocating vectors on the GPU
   |-> [rocprof] Finished copying vectors to the GPU
   |-> [rocprof] sw thinks it moved 1.000000 KB per wave
   |-> [rocprof] Total threads: 1048576, Grid Size: 4096 block Size:256, Wavefronts:16384:
   |-> [rocprof] Launching the  kernel on the GPU
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished copying the output vector from the GPU to the CPU
   |-> [rocprof] Releasing GPU memory
   |-> [rocprof] Releasing CPU memory
   |-> [rocprof]
   |-> [rocprof] ROCPRofiler: 0 contexts collected, output directory /tmp/rpl_data_240321_155250_1238645/input0_results_240321_155250
   |-> [rocprof] File 'tests/workloads/dispatch_7/MI100/pmc_perf_1.csv' is generating
   |-> [rocprof]

[profiling] Current input file: tests/workloads/dispatch_7/MI100/perfmon/pmc_perf_10.txt
   |-> [rocprof] RPL: on '240321_155250' from '/opt/rocm-6.0.2' in '/home1/josantos/omniperf'
   |-> [rocprof] RPL: profiling '""./tests/vcopy -n 1048576 -b 256 -i 3""'
   |-> [rocprof] RPL: input file 'tests/workloads/dispatch_7/MI100/perfmon/pmc_perf_10.txt'
   |-> [rocprof] RPL: output dir '/tmp/rpl_data_240321_155250_1238830'
   |-> [rocprof] RPL: result dir '/tmp/rpl_data_240321_155250_1238830/input0_results_240321_155250'
   |-> [rocprof] ROCProfiler: input from "/tmp/rpl_data_240321_155250_1238830/input0.xml"
   |-> [rocprof] gpu_index =
   |-> [rocprof] kernel =
   |-> [rocprof] range = 7
   |-> [rocprof] 1 metrics
   |-> [rocprof] TCC_EA_ATOMIC_LEVEL_sum
   |-> [rocprof] vcopy testing on GCD 0
   |-> [rocprof] Finished allocating vectors on the CPU
   |-> [rocprof] Finished allocating vectors on the GPU
   |-> [rocprof] Finished copying vectors to the GPU
   |-> [rocprof] sw thinks it moved 1.000000 KB per wave
   |-> [rocprof] Total threads: 1048576, Grid Size: 4096 block Size:256, Wavefronts:16384:
   |-> [rocprof] Launching the  kernel on the GPU
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished copying the output vector from the GPU to the CPU
   |-> [rocprof] Releasing GPU memory
   |-> [rocprof] Releasing CPU memory
   |-> [rocprof]
   |-> [rocprof] ROCPRofiler: 0 contexts collected, output directory /tmp/rpl_data_240321_155250_1238830/input0_results_240321_155250
   |-> [rocprof] File 'tests/workloads/dispatch_7/MI100/pmc_perf_10.csv' is generating
   |-> [rocprof]

[profiling] Current input file: tests/workloads/dispatch_7/MI100/perfmon/pmc_perf_11.txt
   |-> [rocprof] RPL: on '240321_155251' from '/opt/rocm-6.0.2' in '/home1/josantos/omniperf'
   |-> [rocprof] RPL: profiling '""./tests/vcopy -n 1048576 -b 256 -i 3""'
   |-> [rocprof] RPL: input file 'tests/workloads/dispatch_7/MI100/perfmon/pmc_perf_11.txt'
   |-> [rocprof] RPL: output dir '/tmp/rpl_data_240321_155251_1239014'
   |-> [rocprof] RPL: result dir '/tmp/rpl_data_240321_155251_1239014/input0_results_240321_155251'
   |-> [rocprof] ROCProfiler: input from "/tmp/rpl_data_240321_155251_1239014/input0.xml"
   |-> [rocprof] gpu_index =
   |-> [rocprof] kernel =
   |-> [rocprof] range = 7
   |-> [rocprof] 128 metrics
   |-> [rocprof] TCC_ATOMIC[0], TCC_CYCLE[0], TCC_EA_ATOMIC[0], TCC_EA_ATOMIC_LEVEL[0], TCC_ATOMIC[1], TCC_CYCLE[1], TCC_EA_ATOMIC[1], TCC_EA_ATOMIC_LEVEL[1], TCC_ATOMIC[2], TCC_CYCLE[2], TCC_EA_ATOMIC[2], TCC_EA_ATOMIC_LEVEL[2], TCC_ATOMIC[3], TCC_CYCLE[3], TCC_EA_ATOMIC[3], TCC_EA_ATOMIC_LEVEL[3], TCC_ATOMIC[4], TCC_CYCLE[4], TCC_EA_ATOMIC[4], TCC_EA_ATOMIC_LEVEL[4], TCC_ATOMIC[5], TCC_CYCLE[5], TCC_EA_ATOMIC[5], TCC_EA_ATOMIC_LEVEL[5], TCC_ATOMIC[6], TCC_CYCLE[6], TCC_EA_ATOMIC[6], TCC_EA_ATOMIC_LEVEL[6], TCC_ATOMIC[7], TCC_CYCLE[7], TCC_EA_ATOMIC[7], TCC_EA_ATOMIC_LEVEL[7], TCC_ATOMIC[8], TCC_CYCLE[8], TCC_EA_ATOMIC[8], TCC_EA_ATOMIC_LEVEL[8], TCC_ATOMIC[9], TCC_CYCLE[9], TCC_EA_ATOMIC[9], TCC_EA_ATOMIC_LEVEL[9], TCC_ATOMIC[10], TCC_CYCLE[10], TCC_EA_ATOMIC[10], TCC_EA_ATOMIC_LEVEL[10], TCC_ATOMIC[11], TCC_CYCLE[11], TCC_EA_ATOMIC[11], TCC_EA_ATOMIC_LEVEL[11], TCC_ATOMIC[12], TCC_CYCLE[12], TCC_EA_ATOMIC[12], TCC_EA_ATOMIC_LEVEL[12], TCC_ATOMIC[13], TCC_CYCLE[13], TCC_EA_ATOMIC[13], TCC_EA_ATOMIC_LEVEL[13], TCC_ATOMIC[14], TCC_CYCLE[14], TCC_EA_ATOMIC[14], TCC_EA_ATOMIC_LEVEL[14], TCC_ATOMIC[15], TCC_CYCLE[15], TCC_EA_ATOMIC[15], TCC_EA_ATOMIC_LEVEL[15], TCC_ATOMIC[16], TCC_CYCLE[16], TCC_EA_ATOMIC[16], TCC_EA_ATOMIC_LEVEL[16], TCC_ATOMIC[17], TCC_CYCLE[17], TCC_EA_ATOMIC[17], TCC_EA_ATOMIC_LEVEL[17], TCC_ATOMIC[18], TCC_CYCLE[18], TCC_EA_ATOMIC[18], TCC_EA_ATOMIC_LEVEL[18], TCC_ATOMIC[19], TCC_CYCLE[19], TCC_EA_ATOMIC[19], TCC_EA_ATOMIC_LEVEL[19], TCC_ATOMIC[20], TCC_CYCLE[20], TCC_EA_ATOMIC[20], TCC_EA_ATOMIC_LEVEL[20], TCC_ATOMIC[21], TCC_CYCLE[21], TCC_EA_ATOMIC[21], TCC_EA_ATOMIC_LEVEL[21], TCC_ATOMIC[22], TCC_CYCLE[22], TCC_EA_ATOMIC[22], TCC_EA_ATOMIC_LEVEL[22], TCC_ATOMIC[23], TCC_CYCLE[23], TCC_EA_ATOMIC[23], TCC_EA_ATOMIC_LEVEL[23], TCC_ATOMIC[24], TCC_CYCLE[24], TCC_EA_ATOMIC[24], TCC_EA_ATOMIC_LEVEL[24], TCC_ATOMIC[25], TCC_CYCLE[25], TCC_EA_ATOMIC[25], TCC_EA_ATOMIC_LEVEL[25], TCC_ATOMIC[26], TCC_CYCLE[26], TCC_EA_ATOMIC[26], 
TCC_EA_ATOMIC_LEVEL[26], TCC_ATOMIC[27], TCC_CYCLE[27], TCC_EA_ATOMIC[27], TCC_EA_ATOMIC_LEVEL[27], TCC_ATOMIC[28], TCC_CYCLE[28], TCC_EA_ATOMIC[28], TCC_EA_ATOMIC_LEVEL[28], TCC_ATOMIC[29], TCC_CYCLE[29], TCC_EA_ATOMIC[29], TCC_EA_ATOMIC_LEVEL[29], TCC_ATOMIC[30], TCC_CYCLE[30], TCC_EA_ATOMIC[30], TCC_EA_ATOMIC_LEVEL[30], TCC_ATOMIC[31], TCC_CYCLE[31], TCC_EA_ATOMIC[31], TCC_EA_ATOMIC_LEVEL[31]
   |-> [rocprof] vcopy testing on GCD 0
   |-> [rocprof] Finished allocating vectors on the CPU
   |-> [rocprof] Finished allocating vectors on the GPU
   |-> [rocprof] Finished copying vectors to the GPU
   |-> [rocprof] sw thinks it moved 1.000000 KB per wave
   |-> [rocprof] Total threads: 1048576, Grid Size: 4096 block Size:256, Wavefronts:16384:
   |-> [rocprof] Launching the  kernel on the GPU
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished copying the output vector from the GPU to the CPU
   |-> [rocprof] Releasing GPU memory
   |-> [rocprof] Releasing CPU memory
   |-> [rocprof]
   |-> [rocprof] ROCPRofiler: 0 contexts collected, output directory /tmp/rpl_data_240321_155251_1239014/input0_results_240321_155251
   |-> [rocprof] File 'tests/workloads/dispatch_7/MI100/pmc_perf_11.csv' is generating
   |-> [rocprof]

[profiling] Current input file: tests/workloads/dispatch_7/MI100/perfmon/pmc_perf_12.txt
   |-> [rocprof] RPL: on '240321_155252' from '/opt/rocm-6.0.2' in '/home1/josantos/omniperf'
   |-> [rocprof] RPL: profiling '""./tests/vcopy -n 1048576 -b 256 -i 3""'
   |-> [rocprof] RPL: input file 'tests/workloads/dispatch_7/MI100/perfmon/pmc_perf_12.txt'
   |-> [rocprof] RPL: output dir '/tmp/rpl_data_240321_155252_1239201'
   |-> [rocprof] RPL: result dir '/tmp/rpl_data_240321_155252_1239201/input0_results_240321_155252'
   |-> [rocprof] ROCProfiler: input from "/tmp/rpl_data_240321_155252_1239201/input0.xml"
   |-> [rocprof] gpu_index =
   |-> [rocprof] kernel =
   |-> [rocprof] range = 7
   |-> [rocprof] 128 metrics
   |-> [rocprof] TCC_EA_RDREQ[0], TCC_EA_RDREQ_32B[0], TCC_EA_RDREQ_DRAM_CREDIT_STALL[0], TCC_EA_RDREQ_GMI_CREDIT_STALL[0], TCC_EA_RDREQ[1], TCC_EA_RDREQ_32B[1], TCC_EA_RDREQ_DRAM_CREDIT_STALL[1], TCC_EA_RDREQ_GMI_CREDIT_STALL[1], TCC_EA_RDREQ[2], TCC_EA_RDREQ_32B[2], TCC_EA_RDREQ_DRAM_CREDIT_STALL[2], TCC_EA_RDREQ_GMI_CREDIT_STALL[2], TCC_EA_RDREQ[3], TCC_EA_RDREQ_32B[3], TCC_EA_RDREQ_DRAM_CREDIT_STALL[3], TCC_EA_RDREQ_GMI_CREDIT_STALL[3], TCC_EA_RDREQ[4], TCC_EA_RDREQ_32B[4], TCC_EA_RDREQ_DRAM_CREDIT_STALL[4], TCC_EA_RDREQ_GMI_CREDIT_STALL[4], TCC_EA_RDREQ[5], TCC_EA_RDREQ_32B[5], TCC_EA_RDREQ_DRAM_CREDIT_STALL[5], TCC_EA_RDREQ_GMI_CREDIT_STALL[5], TCC_EA_RDREQ[6], TCC_EA_RDREQ_32B[6], TCC_EA_RDREQ_DRAM_CREDIT_STALL[6], TCC_EA_RDREQ_GMI_CREDIT_STALL[6], TCC_EA_RDREQ[7], TCC_EA_RDREQ_32B[7], TCC_EA_RDREQ_DRAM_CREDIT_STALL[7], TCC_EA_RDREQ_GMI_CREDIT_STALL[7], TCC_EA_RDREQ[8], TCC_EA_RDREQ_32B[8], TCC_EA_RDREQ_DRAM_CREDIT_STALL[8], TCC_EA_RDREQ_GMI_CREDIT_STALL[8], TCC_EA_RDREQ[9], TCC_EA_RDREQ_32B[9], TCC_EA_RDREQ_DRAM_CREDIT_STALL[9], TCC_EA_RDREQ_GMI_CREDIT_STALL[9], TCC_EA_RDREQ[10], TCC_EA_RDREQ_32B[10], TCC_EA_RDREQ_DRAM_CREDIT_STALL[10], TCC_EA_RDREQ_GMI_CREDIT_STALL[10], TCC_EA_RDREQ[11], TCC_EA_RDREQ_32B[11], TCC_EA_RDREQ_DRAM_CREDIT_STALL[11], TCC_EA_RDREQ_GMI_CREDIT_STALL[11], TCC_EA_RDREQ[12], TCC_EA_RDREQ_32B[12], TCC_EA_RDREQ_DRAM_CREDIT_STALL[12], TCC_EA_RDREQ_GMI_CREDIT_STALL[12], TCC_EA_RDREQ[13], TCC_EA_RDREQ_32B[13], TCC_EA_RDREQ_DRAM_CREDIT_STALL[13], TCC_EA_RDREQ_GMI_CREDIT_STALL[13], TCC_EA_RDREQ[14], TCC_EA_RDREQ_32B[14], TCC_EA_RDREQ_DRAM_CREDIT_STALL[14], TCC_EA_RDREQ_GMI_CREDIT_STALL[14], TCC_EA_RDREQ[15], TCC_EA_RDREQ_32B[15], TCC_EA_RDREQ_DRAM_CREDIT_STALL[15], TCC_EA_RDREQ_GMI_CREDIT_STALL[15], TCC_EA_RDREQ[16], TCC_EA_RDREQ_32B[16], TCC_EA_RDREQ_DRAM_CREDIT_STALL[16], TCC_EA_RDREQ_GMI_CREDIT_STALL[16], TCC_EA_RDREQ[17], TCC_EA_RDREQ_32B[17], TCC_EA_RDREQ_DRAM_CREDIT_STALL[17], TCC_EA_RDREQ_GMI_CREDIT_STALL[17], TCC_EA_RDREQ[18], 
TCC_EA_RDREQ_32B[18], TCC_EA_RDREQ_DRAM_CREDIT_STALL[18], TCC_EA_RDREQ_GMI_CREDIT_STALL[18], TCC_EA_RDREQ[19], TCC_EA_RDREQ_32B[19], TCC_EA_RDREQ_DRAM_CREDIT_STALL[19], TCC_EA_RDREQ_GMI_CREDIT_STALL[19], TCC_EA_RDREQ[20], TCC_EA_RDREQ_32B[20], TCC_EA_RDREQ_DRAM_CREDIT_STALL[20], TCC_EA_RDREQ_GMI_CREDIT_STALL[20], TCC_EA_RDREQ[21], TCC_EA_RDREQ_32B[21], TCC_EA_RDREQ_DRAM_CREDIT_STALL[21], TCC_EA_RDREQ_GMI_CREDIT_STALL[21], TCC_EA_RDREQ[22], TCC_EA_RDREQ_32B[22], TCC_EA_RDREQ_DRAM_CREDIT_STALL[22], TCC_EA_RDREQ_GMI_CREDIT_STALL[22], TCC_EA_RDREQ[23], TCC_EA_RDREQ_32B[23], TCC_EA_RDREQ_DRAM_CREDIT_STALL[23], TCC_EA_RDREQ_GMI_CREDIT_STALL[23], TCC_EA_RDREQ[24], TCC_EA_RDREQ_32B[24], TCC_EA_RDREQ_DRAM_CREDIT_STALL[24], TCC_EA_RDREQ_GMI_CREDIT_STALL[24], TCC_EA_RDREQ[25], TCC_EA_RDREQ_32B[25], TCC_EA_RDREQ_DRAM_CREDIT_STALL[25], TCC_EA_RDREQ_GMI_CREDIT_STALL[25], TCC_EA_RDREQ[26], TCC_EA_RDREQ_32B[26], TCC_EA_RDREQ_DRAM_CREDIT_STALL[26], TCC_EA_RDREQ_GMI_CREDIT_STALL[26], TCC_EA_RDREQ[27], TCC_EA_RDREQ_32B[27], TCC_EA_RDREQ_DRAM_CREDIT_STALL[27], TCC_EA_RDREQ_GMI_CREDIT_STALL[27], TCC_EA_RDREQ[28], TCC_EA_RDREQ_32B[28], TCC_EA_RDREQ_DRAM_CREDIT_STALL[28], TCC_EA_RDREQ_GMI_CREDIT_STALL[28], TCC_EA_RDREQ[29], TCC_EA_RDREQ_32B[29], TCC_EA_RDREQ_DRAM_CREDIT_STALL[29], TCC_EA_RDREQ_GMI_CREDIT_STALL[29], TCC_EA_RDREQ[30], TCC_EA_RDREQ_32B[30], TCC_EA_RDREQ_DRAM_CREDIT_STALL[30], TCC_EA_RDREQ_GMI_CREDIT_STALL[30], TCC_EA_RDREQ[31], TCC_EA_RDREQ_32B[31], TCC_EA_RDREQ_DRAM_CREDIT_STALL[31], TCC_EA_RDREQ_GMI_CREDIT_STALL[31]
   |-> [rocprof] vcopy testing on GCD 0
   |-> [rocprof] Finished allocating vectors on the CPU
   |-> [rocprof] Finished allocating vectors on the GPU
   |-> [rocprof] Finished copying vectors to the GPU
   |-> [rocprof] sw thinks it moved 1.000000 KB per wave
   |-> [rocprof] Total threads: 1048576, Grid Size: 4096 block Size:256, Wavefronts:16384:
   |-> [rocprof] Launching the  kernel on the GPU
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished copying the output vector from the GPU to the CPU
   |-> [rocprof] Releasing GPU memory
   |-> [rocprof] Releasing CPU memory
   |-> [rocprof]
   |-> [rocprof] ROCPRofiler: 0 contexts collected, output directory /tmp/rpl_data_240321_155252_1239201/input0_results_240321_155252
   |-> [rocprof] File 'tests/workloads/dispatch_7/MI100/pmc_perf_12.csv' is generating
   |-> [rocprof]

[profiling] Current input file: tests/workloads/dispatch_7/MI100/perfmon/pmc_perf_13.txt
   |-> [rocprof] RPL: on '240321_155252' from '/opt/rocm-6.0.2' in '/home1/josantos/omniperf'
   |-> [rocprof] RPL: profiling '""./tests/vcopy -n 1048576 -b 256 -i 3""'
   |-> [rocprof] RPL: input file 'tests/workloads/dispatch_7/MI100/perfmon/pmc_perf_13.txt'
   |-> [rocprof] RPL: output dir '/tmp/rpl_data_240321_155252_1239391'
   |-> [rocprof] RPL: result dir '/tmp/rpl_data_240321_155252_1239391/input0_results_240321_155252'
   |-> [rocprof] ROCProfiler: input from "/tmp/rpl_data_240321_155252_1239391/input0.xml"
   |-> [rocprof] gpu_index =
   |-> [rocprof] kernel =
   |-> [rocprof] range = 7
   |-> [rocprof] 128 metrics
   |-> [rocprof] TCC_EA_RDREQ_IO_CREDIT_STALL[0], TCC_EA_RDREQ_LEVEL[0], TCC_EA_WRREQ[0], TCC_EA_WRREQ_64B[0], TCC_EA_RDREQ_IO_CREDIT_STALL[1], TCC_EA_RDREQ_LEVEL[1], TCC_EA_WRREQ[1], TCC_EA_WRREQ_64B[1], TCC_EA_RDREQ_IO_CREDIT_STALL[2], TCC_EA_RDREQ_LEVEL[2], TCC_EA_WRREQ[2], TCC_EA_WRREQ_64B[2], TCC_EA_RDREQ_IO_CREDIT_STALL[3], TCC_EA_RDREQ_LEVEL[3], TCC_EA_WRREQ[3], TCC_EA_WRREQ_64B[3], TCC_EA_RDREQ_IO_CREDIT_STALL[4], TCC_EA_RDREQ_LEVEL[4], TCC_EA_WRREQ[4], TCC_EA_WRREQ_64B[4], TCC_EA_RDREQ_IO_CREDIT_STALL[5], TCC_EA_RDREQ_LEVEL[5], TCC_EA_WRREQ[5], TCC_EA_WRREQ_64B[5], TCC_EA_RDREQ_IO_CREDIT_STALL[6], TCC_EA_RDREQ_LEVEL[6], TCC_EA_WRREQ[6], TCC_EA_WRREQ_64B[6], TCC_EA_RDREQ_IO_CREDIT_STALL[7], TCC_EA_RDREQ_LEVEL[7], TCC_EA_WRREQ[7], TCC_EA_WRREQ_64B[7], TCC_EA_RDREQ_IO_CREDIT_STALL[8], TCC_EA_RDREQ_LEVEL[8], TCC_EA_WRREQ[8], TCC_EA_WRREQ_64B[8], TCC_EA_RDREQ_IO_CREDIT_STALL[9], TCC_EA_RDREQ_LEVEL[9], TCC_EA_WRREQ[9], TCC_EA_WRREQ_64B[9], TCC_EA_RDREQ_IO_CREDIT_STALL[10], TCC_EA_RDREQ_LEVEL[10], TCC_EA_WRREQ[10], TCC_EA_WRREQ_64B[10], TCC_EA_RDREQ_IO_CREDIT_STALL[11], TCC_EA_RDREQ_LEVEL[11], TCC_EA_WRREQ[11], TCC_EA_WRREQ_64B[11], TCC_EA_RDREQ_IO_CREDIT_STALL[12], TCC_EA_RDREQ_LEVEL[12], TCC_EA_WRREQ[12], TCC_EA_WRREQ_64B[12], TCC_EA_RDREQ_IO_CREDIT_STALL[13], TCC_EA_RDREQ_LEVEL[13], TCC_EA_WRREQ[13], TCC_EA_WRREQ_64B[13], TCC_EA_RDREQ_IO_CREDIT_STALL[14], TCC_EA_RDREQ_LEVEL[14], TCC_EA_WRREQ[14], TCC_EA_WRREQ_64B[14], TCC_EA_RDREQ_IO_CREDIT_STALL[15], TCC_EA_RDREQ_LEVEL[15], TCC_EA_WRREQ[15], TCC_EA_WRREQ_64B[15], TCC_EA_RDREQ_IO_CREDIT_STALL[16], TCC_EA_RDREQ_LEVEL[16], TCC_EA_WRREQ[16], TCC_EA_WRREQ_64B[16], TCC_EA_RDREQ_IO_CREDIT_STALL[17], TCC_EA_RDREQ_LEVEL[17], TCC_EA_WRREQ[17], TCC_EA_WRREQ_64B[17], TCC_EA_RDREQ_IO_CREDIT_STALL[18], TCC_EA_RDREQ_LEVEL[18], TCC_EA_WRREQ[18], TCC_EA_WRREQ_64B[18], TCC_EA_RDREQ_IO_CREDIT_STALL[19], TCC_EA_RDREQ_LEVEL[19], TCC_EA_WRREQ[19], TCC_EA_WRREQ_64B[19], TCC_EA_RDREQ_IO_CREDIT_STALL[20], TCC_EA_RDREQ_LEVEL[20], 
TCC_EA_WRREQ[20], TCC_EA_WRREQ_64B[20], TCC_EA_RDREQ_IO_CREDIT_STALL[21], TCC_EA_RDREQ_LEVEL[21], TCC_EA_WRREQ[21], TCC_EA_WRREQ_64B[21], TCC_EA_RDREQ_IO_CREDIT_STALL[22], TCC_EA_RDREQ_LEVEL[22], TCC_EA_WRREQ[22], TCC_EA_WRREQ_64B[22], TCC_EA_RDREQ_IO_CREDIT_STALL[23], TCC_EA_RDREQ_LEVEL[23], TCC_EA_WRREQ[23], TCC_EA_WRREQ_64B[23], TCC_EA_RDREQ_IO_CREDIT_STALL[24], TCC_EA_RDREQ_LEVEL[24], TCC_EA_WRREQ[24], TCC_EA_WRREQ_64B[24], TCC_EA_RDREQ_IO_CREDIT_STALL[25], TCC_EA_RDREQ_LEVEL[25], TCC_EA_WRREQ[25], TCC_EA_WRREQ_64B[25], TCC_EA_RDREQ_IO_CREDIT_STALL[26], TCC_EA_RDREQ_LEVEL[26], TCC_EA_WRREQ[26], TCC_EA_WRREQ_64B[26], TCC_EA_RDREQ_IO_CREDIT_STALL[27], TCC_EA_RDREQ_LEVEL[27], TCC_EA_WRREQ[27], TCC_EA_WRREQ_64B[27], TCC_EA_RDREQ_IO_CREDIT_STALL[28], TCC_EA_RDREQ_LEVEL[28], TCC_EA_WRREQ[28], TCC_EA_WRREQ_64B[28], TCC_EA_RDREQ_IO_CREDIT_STALL[29], TCC_EA_RDREQ_LEVEL[29], TCC_EA_WRREQ[29], TCC_EA_WRREQ_64B[29], TCC_EA_RDREQ_IO_CREDIT_STALL[30], TCC_EA_RDREQ_LEVEL[30], TCC_EA_WRREQ[30], TCC_EA_WRREQ_64B[30], TCC_EA_RDREQ_IO_CREDIT_STALL[31], TCC_EA_RDREQ_LEVEL[31], TCC_EA_WRREQ[31], TCC_EA_WRREQ_64B[31]
   |-> [rocprof] vcopy testing on GCD 0
   |-> [rocprof] Finished allocating vectors on the CPU
   |-> [rocprof] Finished allocating vectors on the GPU
   |-> [rocprof] Finished copying vectors to the GPU
   |-> [rocprof] sw thinks it moved 1.000000 KB per wave
   |-> [rocprof] Total threads: 1048576, Grid Size: 4096 block Size:256, Wavefronts:16384:
   |-> [rocprof] Launching the  kernel on the GPU
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished copying the output vector from the GPU to the CPU
   |-> [rocprof] Releasing GPU memory
   |-> [rocprof] Releasing CPU memory
   |-> [rocprof]
   |-> [rocprof] ROCPRofiler: 0 contexts collected, output directory /tmp/rpl_data_240321_155252_1239391/input0_results_240321_155252
   |-> [rocprof] File 'tests/workloads/dispatch_7/MI100/pmc_perf_13.csv' is generating
   |-> [rocprof]

[profiling] Current input file: tests/workloads/dispatch_7/MI100/perfmon/pmc_perf_14.txt
   |-> [rocprof] RPL: on '240321_155253' from '/opt/rocm-6.0.2' in '/home1/josantos/omniperf'
   |-> [rocprof] RPL: profiling '""./tests/vcopy -n 1048576 -b 256 -i 3""'
   |-> [rocprof] RPL: input file 'tests/workloads/dispatch_7/MI100/perfmon/pmc_perf_14.txt'
   |-> [rocprof] RPL: output dir '/tmp/rpl_data_240321_155253_1239576'
   |-> [rocprof] RPL: result dir '/tmp/rpl_data_240321_155253_1239576/input0_results_240321_155253'
   |-> [rocprof] ROCProfiler: input from "/tmp/rpl_data_240321_155253_1239576/input0.xml"
   |-> [rocprof] gpu_index =
   |-> [rocprof] kernel =
   |-> [rocprof] range = 7
   |-> [rocprof] 128 metrics
   |-> [rocprof] TCC_EA_WRREQ_DRAM_CREDIT_STALL[0], TCC_EA_WRREQ_GMI_CREDIT_STALL[0], TCC_EA_WRREQ_IO_CREDIT_STALL[0], TCC_EA_WRREQ_LEVEL[0], TCC_EA_WRREQ_DRAM_CREDIT_STALL[1], TCC_EA_WRREQ_GMI_CREDIT_STALL[1], TCC_EA_WRREQ_IO_CREDIT_STALL[1], TCC_EA_WRREQ_LEVEL[1], TCC_EA_WRREQ_DRAM_CREDIT_STALL[2], TCC_EA_WRREQ_GMI_CREDIT_STALL[2], TCC_EA_WRREQ_IO_CREDIT_STALL[2], TCC_EA_WRREQ_LEVEL[2], TCC_EA_WRREQ_DRAM_CREDIT_STALL[3], TCC_EA_WRREQ_GMI_CREDIT_STALL[3], TCC_EA_WRREQ_IO_CREDIT_STALL[3], TCC_EA_WRREQ_LEVEL[3], TCC_EA_WRREQ_DRAM_CREDIT_STALL[4], TCC_EA_WRREQ_GMI_CREDIT_STALL[4], TCC_EA_WRREQ_IO_CREDIT_STALL[4], TCC_EA_WRREQ_LEVEL[4], TCC_EA_WRREQ_DRAM_CREDIT_STALL[5], TCC_EA_WRREQ_GMI_CREDIT_STALL[5], TCC_EA_WRREQ_IO_CREDIT_STALL[5], TCC_EA_WRREQ_LEVEL[5], TCC_EA_WRREQ_DRAM_CREDIT_STALL[6], TCC_EA_WRREQ_GMI_CREDIT_STALL[6], TCC_EA_WRREQ_IO_CREDIT_STALL[6], TCC_EA_WRREQ_LEVEL[6], TCC_EA_WRREQ_DRAM_CREDIT_STALL[7], TCC_EA_WRREQ_GMI_CREDIT_STALL[7], TCC_EA_WRREQ_IO_CREDIT_STALL[7], TCC_EA_WRREQ_LEVEL[7], TCC_EA_WRREQ_DRAM_CREDIT_STALL[8], TCC_EA_WRREQ_GMI_CREDIT_STALL[8], TCC_EA_WRREQ_IO_CREDIT_STALL[8], TCC_EA_WRREQ_LEVEL[8], TCC_EA_WRREQ_DRAM_CREDIT_STALL[9], TCC_EA_WRREQ_GMI_CREDIT_STALL[9], TCC_EA_WRREQ_IO_CREDIT_STALL[9], TCC_EA_WRREQ_LEVEL[9], TCC_EA_WRREQ_DRAM_CREDIT_STALL[10], TCC_EA_WRREQ_GMI_CREDIT_STALL[10], TCC_EA_WRREQ_IO_CREDIT_STALL[10], TCC_EA_WRREQ_LEVEL[10], TCC_EA_WRREQ_DRAM_CREDIT_STALL[11], TCC_EA_WRREQ_GMI_CREDIT_STALL[11], TCC_EA_WRREQ_IO_CREDIT_STALL[11], TCC_EA_WRREQ_LEVEL[11], TCC_EA_WRREQ_DRAM_CREDIT_STALL[12], TCC_EA_WRREQ_GMI_CREDIT_STALL[12], TCC_EA_WRREQ_IO_CREDIT_STALL[12], TCC_EA_WRREQ_LEVEL[12], TCC_EA_WRREQ_DRAM_CREDIT_STALL[13], TCC_EA_WRREQ_GMI_CREDIT_STALL[13], TCC_EA_WRREQ_IO_CREDIT_STALL[13], TCC_EA_WRREQ_LEVEL[13], TCC_EA_WRREQ_DRAM_CREDIT_STALL[14], TCC_EA_WRREQ_GMI_CREDIT_STALL[14], TCC_EA_WRREQ_IO_CREDIT_STALL[14], TCC_EA_WRREQ_LEVEL[14], TCC_EA_WRREQ_DRAM_CREDIT_STALL[15], TCC_EA_WRREQ_GMI_CREDIT_STALL[15], TCC_EA_WRREQ_IO_CREDIT_STALL[15], TCC_EA_WRREQ_LEVEL[15], TCC_EA_WRREQ_DRAM_CREDIT_STALL[16], TCC_EA_WRREQ_GMI_CREDIT_STALL[16], TCC_EA_WRREQ_IO_CREDIT_STALL[16], TCC_EA_WRREQ_LEVEL[16], TCC_EA_WRREQ_DRAM_CREDIT_STALL[17], TCC_EA_WRREQ_GMI_CREDIT_STALL[17], TCC_EA_WRREQ_IO_CREDIT_STALL[17], TCC_EA_WRREQ_LEVEL[17], TCC_EA_WRREQ_DRAM_CREDIT_STALL[18], TCC_EA_WRREQ_GMI_CREDIT_STALL[18], TCC_EA_WRREQ_IO_CREDIT_STALL[18], TCC_EA_WRREQ_LEVEL[18], TCC_EA_WRREQ_DRAM_CREDIT_STALL[19], TCC_EA_WRREQ_GMI_CREDIT_STALL[19], TCC_EA_WRREQ_IO_CREDIT_STALL[19], TCC_EA_WRREQ_LEVEL[19], TCC_EA_WRREQ_DRAM_CREDIT_STALL[20], TCC_EA_WRREQ_GMI_CREDIT_STALL[20], TCC_EA_WRREQ_IO_CREDIT_STALL[20], TCC_EA_WRREQ_LEVEL[20], TCC_EA_WRREQ_DRAM_CREDIT_STALL[21], TCC_EA_WRREQ_GMI_CREDIT_STALL[21], TCC_EA_WRREQ_IO_CREDIT_STALL[21], TCC_EA_WRREQ_LEVEL[21], TCC_EA_WRREQ_DRAM_CREDIT_STALL[22], TCC_EA_WRREQ_GMI_CREDIT_STALL[22], TCC_EA_WRREQ_IO_CREDIT_STALL[22], TCC_EA_WRREQ_LEVEL[22], TCC_EA_WRREQ_DRAM_CREDIT_STALL[23], TCC_EA_WRREQ_GMI_CREDIT_STALL[23], TCC_EA_WRREQ_IO_CREDIT_STALL[23], TCC_EA_WRREQ_LEVEL[23], TCC_EA_WRREQ_DRAM_CREDIT_STALL[24], TCC_EA_WRREQ_GMI_CREDIT_STALL[24], TCC_EA_WRREQ_IO_CREDIT_STALL[24], TCC_EA_WRREQ_LEVEL[24], TCC_EA_WRREQ_DRAM_CREDIT_STALL[25], TCC_EA_WRREQ_GMI_CREDIT_STALL[25], TCC_EA_WRREQ_IO_CREDIT_STALL[25], TCC_EA_WRREQ_LEVEL[25], TCC_EA_WRREQ_DRAM_CREDIT_STALL[26], TCC_EA_WRREQ_GMI_CREDIT_STALL[26], TCC_EA_WRREQ_IO_CREDIT_STALL[26], TCC_EA_WRREQ_LEVEL[26], TCC_EA_WRREQ_DRAM_CREDIT_STALL[27], TCC_EA_WRREQ_GMI_CREDIT_STALL[27], TCC_EA_WRREQ_IO_CREDIT_STALL[27], TCC_EA_WRREQ_LEVEL[27], TCC_EA_WRREQ_DRAM_CREDIT_STALL[28], TCC_EA_WRREQ_GMI_CREDIT_STALL[28], TCC_EA_WRREQ_IO_CREDIT_STALL[28], TCC_EA_WRREQ_LEVEL[28], TCC_EA_WRREQ_DRAM_CREDIT_STALL[29], TCC_EA_WRREQ_GMI_CREDIT_STALL[29], TCC_EA_WRREQ_IO_CREDIT_STALL[29], TCC_EA_WRREQ_LEVEL[29], TCC_EA_WRREQ_DRAM_CREDIT_STALL[30], TCC_EA_WRREQ_GMI_CREDIT_STALL[30], TCC_EA_WRREQ_IO_CREDIT_STALL[30], TCC_EA_WRREQ_LEVEL[30], TCC_EA_WRREQ_DRAM_CREDIT_STALL[31], TCC_EA_WRREQ_GMI_CREDIT_STALL[31], TCC_EA_WRREQ_IO_CREDIT_STALL[31], TCC_EA_WRREQ_LEVEL[31]
   |-> [rocprof] vcopy testing on GCD 0
   |-> [rocprof] Finished allocating vectors on the CPU
   |-> [rocprof] Finished allocating vectors on the GPU
   |-> [rocprof] Finished copying vectors to the GPU
   |-> [rocprof] sw thinks it moved 1.000000 KB per wave
   |-> [rocprof] Total threads: 1048576, Grid Size: 4096 block Size:256, Wavefronts:16384:
   |-> [rocprof] Launching the  kernel on the GPU
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished copying the output vector from the GPU to the CPU
   |-> [rocprof] Releasing GPU memory
   |-> [rocprof] Releasing CPU memory
   |-> [rocprof]
   |-> [rocprof] ROCPRofiler: 0 contexts collected, output directory /tmp/rpl_data_240321_155253_1239576/input0_results_240321_155253
   |-> [rocprof] File 'tests/workloads/dispatch_7/MI100/pmc_perf_14.csv' is generating
   |-> [rocprof]

[profiling] Current input file: tests/workloads/dispatch_7/MI100/perfmon/pmc_perf_15.txt
   |-> [rocprof] RPL: on '240321_155254' from '/opt/rocm-6.0.2' in '/home1/josantos/omniperf'
   |-> [rocprof] RPL: profiling '""./tests/vcopy -n 1048576 -b 256 -i 3""'
   |-> [rocprof] RPL: input file 'tests/workloads/dispatch_7/MI100/perfmon/pmc_perf_15.txt'
   |-> [rocprof] RPL: output dir '/tmp/rpl_data_240321_155254_1239763'
   |-> [rocprof] RPL: result dir '/tmp/rpl_data_240321_155254_1239763/input0_results_240321_155254'
   |-> [rocprof] ROCProfiler: input from "/tmp/rpl_data_240321_155254_1239763/input0.xml"
   |-> [rocprof] gpu_index =
   |-> [rocprof] kernel =
   |-> [rocprof] range = 7
   |-> [rocprof] 128 metrics
   |-> [rocprof] TCC_HIT[0], TCC_MISS[0], TCC_READ[0], TCC_REQ[0], TCC_HIT[1], TCC_MISS[1], TCC_READ[1], TCC_REQ[1], TCC_HIT[2], TCC_MISS[2], TCC_READ[2], TCC_REQ[2], TCC_HIT[3], TCC_MISS[3], TCC_READ[3], TCC_REQ[3], TCC_HIT[4], TCC_MISS[4], TCC_READ[4], TCC_REQ[4], TCC_HIT[5], TCC_MISS[5], TCC_READ[5], TCC_REQ[5], TCC_HIT[6], TCC_MISS[6], TCC_READ[6], TCC_REQ[6], TCC_HIT[7], TCC_MISS[7], TCC_READ[7], TCC_REQ[7], TCC_HIT[8], TCC_MISS[8], TCC_READ[8], TCC_REQ[8], TCC_HIT[9], TCC_MISS[9], TCC_READ[9], TCC_REQ[9], TCC_HIT[10], TCC_MISS[10], TCC_READ[10], TCC_REQ[10], TCC_HIT[11], TCC_MISS[11], TCC_READ[11], TCC_REQ[11], TCC_HIT[12], TCC_MISS[12], TCC_READ[12], TCC_REQ[12], TCC_HIT[13], TCC_MISS[13], TCC_READ[13], TCC_REQ[13], TCC_HIT[14], TCC_MISS[14], TCC_READ[14], TCC_REQ[14], TCC_HIT[15], TCC_MISS[15], TCC_READ[15], TCC_REQ[15], TCC_HIT[16], TCC_MISS[16], TCC_READ[16], TCC_REQ[16], TCC_HIT[17], TCC_MISS[17], TCC_READ[17], TCC_REQ[17], TCC_HIT[18], TCC_MISS[18], TCC_READ[18], TCC_REQ[18], TCC_HIT[19], TCC_MISS[19], TCC_READ[19], TCC_REQ[19], TCC_HIT[20], TCC_MISS[20], TCC_READ[20], TCC_REQ[20], TCC_HIT[21], TCC_MISS[21], TCC_READ[21], TCC_REQ[21], TCC_HIT[22], TCC_MISS[22], TCC_READ[22], TCC_REQ[22], TCC_HIT[23], TCC_MISS[23], TCC_READ[23], TCC_REQ[23], TCC_HIT[24], TCC_MISS[24], TCC_READ[24], TCC_REQ[24], TCC_HIT[25], TCC_MISS[25], TCC_READ[25], TCC_REQ[25], TCC_HIT[26], TCC_MISS[26], TCC_READ[26], TCC_REQ[26], TCC_HIT[27], TCC_MISS[27], TCC_READ[27], TCC_REQ[27], TCC_HIT[28], TCC_MISS[28], TCC_READ[28], TCC_REQ[28], TCC_HIT[29], TCC_MISS[29], TCC_READ[29], TCC_REQ[29], TCC_HIT[30], TCC_MISS[30], TCC_READ[30], TCC_REQ[30], TCC_HIT[31], TCC_MISS[31], TCC_READ[31], TCC_REQ[31]
   |-> [rocprof] vcopy testing on GCD 0
   |-> [rocprof] Finished allocating vectors on the CPU
   |-> [rocprof] Finished allocating vectors on the GPU
   |-> [rocprof] Finished copying vectors to the GPU
   |-> [rocprof] sw thinks it moved 1.000000 KB per wave
   |-> [rocprof] Total threads: 1048576, Grid Size: 4096 block Size:256, Wavefronts:16384:
   |-> [rocprof] Launching the  kernel on the GPU
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished copying the output vector from the GPU to the CPU
   |-> [rocprof] Releasing GPU memory
   |-> [rocprof] Releasing CPU memory
   |-> [rocprof]
   |-> [rocprof] ROCPRofiler: 0 contexts collected, output directory /tmp/rpl_data_240321_155254_1239763/input0_results_240321_155254
   |-> [rocprof] File 'tests/workloads/dispatch_7/MI100/pmc_perf_15.csv' is generating
   |-> [rocprof]

[profiling] Current input file: tests/workloads/dispatch_7/MI100/perfmon/pmc_perf_16.txt
   |-> [rocprof] RPL: on '240321_155254' from '/opt/rocm-6.0.2' in '/home1/josantos/omniperf'
   |-> [rocprof] RPL: profiling '""./tests/vcopy -n 1048576 -b 256 -i 3""'
   |-> [rocprof] RPL: input file 'tests/workloads/dispatch_7/MI100/perfmon/pmc_perf_16.txt'
   |-> [rocprof] RPL: output dir '/tmp/rpl_data_240321_155254_1239947'
   |-> [rocprof] RPL: result dir '/tmp/rpl_data_240321_155254_1239947/input0_results_240321_155254'
   |-> [rocprof] ROCProfiler: input from "/tmp/rpl_data_240321_155254_1239947/input0.xml"
   |-> [rocprof] gpu_index =
   |-> [rocprof] kernel =
   |-> [rocprof] range = 7
   |-> [rocprof] 96 metrics
   |-> [rocprof] TCC_RW_REQ[0], TCC_TOO_MANY_EA_WRREQS_STALL[0], TCC_WRITE[0], TCC_RW_REQ[1], TCC_TOO_MANY_EA_WRREQS_STALL[1], TCC_WRITE[1], TCC_RW_REQ[2], TCC_TOO_MANY_EA_WRREQS_STALL[2], TCC_WRITE[2], TCC_RW_REQ[3], TCC_TOO_MANY_EA_WRREQS_STALL[3], TCC_WRITE[3], TCC_RW_REQ[4], TCC_TOO_MANY_EA_WRREQS_STALL[4], TCC_WRITE[4], TCC_RW_REQ[5], TCC_TOO_MANY_EA_WRREQS_STALL[5], TCC_WRITE[5], TCC_RW_REQ[6], TCC_TOO_MANY_EA_WRREQS_STALL[6], TCC_WRITE[6], TCC_RW_REQ[7], TCC_TOO_MANY_EA_WRREQS_STALL[7], TCC_WRITE[7], TCC_RW_REQ[8], TCC_TOO_MANY_EA_WRREQS_STALL[8], TCC_WRITE[8], TCC_RW_REQ[9], TCC_TOO_MANY_EA_WRREQS_STALL[9], TCC_WRITE[9], TCC_RW_REQ[10], TCC_TOO_MANY_EA_WRREQS_STALL[10], TCC_WRITE[10], TCC_RW_REQ[11], TCC_TOO_MANY_EA_WRREQS_STALL[11], TCC_WRITE[11], TCC_RW_REQ[12], TCC_TOO_MANY_EA_WRREQS_STALL[12], TCC_WRITE[12], TCC_RW_REQ[13], TCC_TOO_MANY_EA_WRREQS_STALL[13], TCC_WRITE[13], TCC_RW_REQ[14], TCC_TOO_MANY_EA_WRREQS_STALL[14], TCC_WRITE[14], TCC_RW_REQ[15], TCC_TOO_MANY_EA_WRREQS_STALL[15], TCC_WRITE[15], TCC_RW_REQ[16], TCC_TOO_MANY_EA_WRREQS_STALL[16], TCC_WRITE[16], TCC_RW_REQ[17], TCC_TOO_MANY_EA_WRREQS_STALL[17], TCC_WRITE[17], TCC_RW_REQ[18], TCC_TOO_MANY_EA_WRREQS_STALL[18], TCC_WRITE[18], TCC_RW_REQ[19], TCC_TOO_MANY_EA_WRREQS_STALL[19], TCC_WRITE[19], TCC_RW_REQ[20], TCC_TOO_MANY_EA_WRREQS_STALL[20], TCC_WRITE[20], TCC_RW_REQ[21], TCC_TOO_MANY_EA_WRREQS_STALL[21], TCC_WRITE[21], TCC_RW_REQ[22], TCC_TOO_MANY_EA_WRREQS_STALL[22], TCC_WRITE[22], TCC_RW_REQ[23], TCC_TOO_MANY_EA_WRREQS_STALL[23], TCC_WRITE[23], TCC_RW_REQ[24], TCC_TOO_MANY_EA_WRREQS_STALL[24], TCC_WRITE[24], TCC_RW_REQ[25], TCC_TOO_MANY_EA_WRREQS_STALL[25], TCC_WRITE[25], TCC_RW_REQ[26], TCC_TOO_MANY_EA_WRREQS_STALL[26], TCC_WRITE[26], TCC_RW_REQ[27], TCC_TOO_MANY_EA_WRREQS_STALL[27], TCC_WRITE[27], TCC_RW_REQ[28], TCC_TOO_MANY_EA_WRREQS_STALL[28], TCC_WRITE[28], TCC_RW_REQ[29], TCC_TOO_MANY_EA_WRREQS_STALL[29], TCC_WRITE[29], TCC_RW_REQ[30], TCC_TOO_MANY_EA_WRREQS_STALL[30], TCC_WRITE[30], TCC_RW_REQ[31], TCC_TOO_MANY_EA_WRREQS_STALL[31], TCC_WRITE[31]
   |-> [rocprof] vcopy testing on GCD 0
   |-> [rocprof] Finished allocating vectors on the CPU
   |-> [rocprof] Finished allocating vectors on the GPU
   |-> [rocprof] Finished copying vectors to the GPU
   |-> [rocprof] sw thinks it moved 1.000000 KB per wave
   |-> [rocprof] Total threads: 1048576, Grid Size: 4096 block Size:256, Wavefronts:16384:
   |-> [rocprof] Launching the  kernel on the GPU
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished copying the output vector from the GPU to the CPU
   |-> [rocprof] Releasing GPU memory
   |-> [rocprof] Releasing CPU memory
   |-> [rocprof]
   |-> [rocprof] ROCPRofiler: 0 contexts collected, output directory /tmp/rpl_data_240321_155254_1239947/input0_results_240321_155254
   |-> [rocprof] File 'tests/workloads/dispatch_7/MI100/pmc_perf_16.csv' is generating
   |-> [rocprof]

[profiling] Current input file: tests/workloads/dispatch_7/MI100/perfmon/pmc_perf_2.txt
   |-> [rocprof] RPL: on '240321_155255' from '/opt/rocm-6.0.2' in '/home1/josantos/omniperf'
   |-> [rocprof] RPL: profiling '""./tests/vcopy -n 1048576 -b 256 -i 3""'
   |-> [rocprof] RPL: input file 'tests/workloads/dispatch_7/MI100/perfmon/pmc_perf_2.txt'
   |-> [rocprof] RPL: output dir '/tmp/rpl_data_240321_155255_1240131'
   |-> [rocprof] RPL: result dir '/tmp/rpl_data_240321_155255_1240131/input0_results_240321_155255'
   |-> [rocprof] ROCProfiler: input from "/tmp/rpl_data_240321_155255_1240131/input0.xml"
   |-> [rocprof] gpu_index =
   |-> [rocprof] kernel =
   |-> [rocprof] range = 7
   |-> [rocprof] 26 metrics
   |-> [rocprof] SQC_DCACHE_INPUT_VALID_READYB, SQC_DCACHE_ATOMIC, SQC_DCACHE_REQ_READ_8, SQC_DCACHE_REQ, SQC_DCACHE_HITS, SQC_DCACHE_MISSES, SQC_DCACHE_MISSES_DUPLICATE, SQC_DCACHE_REQ_READ_1, TCP_VOLATILE_sum, TCP_TOTAL_ACCESSES_sum, TCP_TOTAL_READ_sum, TCP_TOTAL_WRITE_sum, TA_BUFFER_ATOMIC_WAVEFRONTS_sum, TA_BUFFER_TOTAL_CYCLES_sum, TD_ATOMIC_WAVEFRONT_sum, TD_STORE_WAVEFRONT_sum, SPI_RA_REQ_NO_ALLOC, SPI_RA_REQ_NO_ALLOC_CSN, CPC_CPC_STAT_STALL, CPC_UTCL1_STALL_ON_TRANSLATION, CPF_CPF_STAT_IDLE, CPF_CPF_TCIU_IDLE, TCC_REQ_sum, TCC_STREAMING_REQ_sum, TCC_HIT_sum, TCC_MISS_sum
   |-> [rocprof] vcopy testing on GCD 0
   |-> [rocprof] Finished allocating vectors on the CPU
   |-> [rocprof] Finished allocating vectors on the GPU
   |-> [rocprof] Finished copying vectors to the GPU
   |-> [rocprof] sw thinks it moved 1.000000 KB per wave
   |-> [rocprof] Total threads: 1048576, Grid Size: 4096 block Size:256, Wavefronts:16384:
   |-> [rocprof] Launching the  kernel on the GPU
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished copying the output vector from the GPU to the CPU
   |-> [rocprof] Releasing GPU memory
   |-> [rocprof] Releasing CPU memory
   |-> [rocprof]
   |-> [rocprof] ROCPRofiler: 0 contexts collected, output directory /tmp/rpl_data_240321_155255_1240131/input0_results_240321_155255
   |-> [rocprof] File 'tests/workloads/dispatch_7/MI100/pmc_perf_2.csv' is generating
   |-> [rocprof]

[profiling] Current input file: tests/workloads/dispatch_7/MI100/perfmon/pmc_perf_3.txt
   |-> [rocprof] RPL: on '240321_155255' from '/opt/rocm-6.0.2' in '/home1/josantos/omniperf'
   |-> [rocprof] RPL: profiling '""./tests/vcopy -n 1048576 -b 256 -i 3""'
   |-> [rocprof] RPL: input file 'tests/workloads/dispatch_7/MI100/perfmon/pmc_perf_3.txt'
   |-> [rocprof] RPL: output dir '/tmp/rpl_data_240321_155255_1240317'
   |-> [rocprof] RPL: result dir '/tmp/rpl_data_240321_155255_1240317/input0_results_240321_155255'
   |-> [rocprof] ROCProfiler: input from "/tmp/rpl_data_240321_155255_1240317/input0.xml"
   |-> [rocprof] gpu_index =
   |-> [rocprof] kernel =
   |-> [rocprof] range = 7
   |-> [rocprof] 23 metrics
   |-> [rocprof] SQC_DCACHE_REQ_READ_2, SQC_DCACHE_REQ_READ_4, SQ_INSTS_VMEM_WR, SQ_INSTS_VMEM_RD, SQ_INSTS_VMEM, SQ_INSTS_SALU, SQ_INSTS_VSKIPPED, SQ_INSTS_SMEM, TCP_TOTAL_ATOMIC_WITH_RET_sum, TCP_TOTAL_ATOMIC_WITHOUT_RET_sum, TCP_TOTAL_WRITEBACK_INVALIDATES_sum, TCP_TOTAL_CACHE_ACCESSES_sum, TA_BUFFER_COALESCED_READ_CYCLES_sum, TA_BUFFER_COALESCED_WRITE_CYCLES_sum, SPI_RA_RES_STALL_CSN, SPI_RA_TMP_STALL_CSN, CPC_CPC_UTCL2IU_BUSY, CPC_CPC_UTCL2IU_IDLE, CPF_CMP_UTCL1_STALL_ON_TRANSLATION, TCC_READ_sum, TCC_WRITE_sum, TCC_ATOMIC_sum, TCC_WRITEBACK_sum
   |-> [rocprof] vcopy testing on GCD 0
   |-> [rocprof] Finished allocating vectors on the CPU
   |-> [rocprof] Finished allocating vectors on the GPU
   |-> [rocprof] Finished copying vectors to the GPU
   |-> [rocprof] sw thinks it moved 1.000000 KB per wave
   |-> [rocprof] Total threads: 1048576, Grid Size: 4096 block Size:256, Wavefronts:16384:
   |-> [rocprof] Launching the  kernel on the GPU
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished copying the output vector from the GPU to the CPU
   |-> [rocprof] Releasing GPU memory
   |-> [rocprof] Releasing CPU memory
   |-> [rocprof]
   |-> [rocprof] ROCPRofiler: 0 contexts collected, output directory /tmp/rpl_data_240321_155255_1240317/input0_results_240321_155255
   |-> [rocprof] File 'tests/workloads/dispatch_7/MI100/pmc_perf_3.csv' is generating
   |-> [rocprof]

[profiling] Current input file: tests/workloads/dispatch_7/MI100/perfmon/pmc_perf_4.txt
   |-> [rocprof] RPL: on '240321_155256' from '/opt/rocm-6.0.2' in '/home1/josantos/omniperf'
   |-> [rocprof] RPL: profiling '""./tests/vcopy -n 1048576 -b 256 -i 3""'
   |-> [rocprof] RPL: input file 'tests/workloads/dispatch_7/MI100/perfmon/pmc_perf_4.txt'
   |-> [rocprof] RPL: output dir '/tmp/rpl_data_240321_155256_1240502'
   |-> [rocprof] RPL: result dir '/tmp/rpl_data_240321_155256_1240502/input0_results_240321_155256'
   |-> [rocprof] ROCProfiler: input from "/tmp/rpl_data_240321_155256_1240502/input0.xml"
   |-> [rocprof] gpu_index =
   |-> [rocprof] kernel =
   |-> [rocprof] range = 7
   |-> [rocprof] 22 metrics
   |-> [rocprof] SQ_INSTS_FLAT, SQ_INSTS_LDS, SQ_INSTS_GDS, SQ_INSTS_EXP_GDS, SQ_INSTS_BRANCH, SQ_INSTS_SENDMSG, SQ_INSTS, SQ_WAIT_ANY, TCP_UTCL1_TRANSLATION_MISS_sum, TCP_UTCL1_TRANSLATION_HIT_sum, TCP_UTCL1_PERMISSION_MISS_sum, TCP_UTCL1_REQUEST_sum, TA_ADDR_STALLED_BY_TC_CYCLES_sum, TA_TOTAL_WAVEFRONTS_sum, SPI_RA_WAVE_SIMD_FULL_CSN, SPI_RA_VGPR_SIMD_FULL_CSN, CPC_CPC_UTCL2IU_STALL, CPC_ME1_BUSY_FOR_PACKET_DECODE, TCC_EA_WRREQ_sum, TCC_EA_WRREQ_64B_sum, TCC_EA_WR_UNCACHED_32B_sum, TCC_EA_WRREQ_DRAM_sum
   |-> [rocprof] vcopy testing on GCD 0
   |-> [rocprof] Finished allocating vectors on the CPU
   |-> [rocprof] Finished allocating vectors on the GPU
   |-> [rocprof] Finished copying vectors to the GPU
   |-> [rocprof] sw thinks it moved 1.000000 KB per wave
   |-> [rocprof] Total threads: 1048576, Grid Size: 4096 block Size:256, Wavefronts:16384:
   |-> [rocprof] Launching the  kernel on the GPU
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished copying the output vector from the GPU to the CPU
   |-> [rocprof] Releasing GPU memory
   |-> [rocprof] Releasing CPU memory
   |-> [rocprof]
   |-> [rocprof] ROCPRofiler: 0 contexts collected, output directory /tmp/rpl_data_240321_155256_1240502/input0_results_240321_155256
   |-> [rocprof] File 'tests/workloads/dispatch_7/MI100/pmc_perf_4.csv' is generating
   |-> [rocprof]

[profiling] Current input file: tests/workloads/dispatch_7/MI100/perfmon/pmc_perf_5.txt
   |-> [rocprof] RPL: on '240321_155256' from '/opt/rocm-6.0.2' in '/home1/josantos/omniperf'
   |-> [rocprof] RPL: profiling '""./tests/vcopy -n 1048576 -b 256 -i 3""'
   |-> [rocprof] RPL: input file 'tests/workloads/dispatch_7/MI100/perfmon/pmc_perf_5.txt'
   |-> [rocprof] RPL: output dir '/tmp/rpl_data_240321_155256_1240687'
   |-> [rocprof] RPL: result dir '/tmp/rpl_data_240321_155256_1240687/input0_results_240321_155256'
   |-> [rocprof] ROCProfiler: input from "/tmp/rpl_data_240321_155256_1240687/input0.xml"
   |-> [rocprof] gpu_index =
   |-> [rocprof] kernel =
   |-> [rocprof] range = 7
   |-> [rocprof] 21 metrics
   |-> [rocprof] SQ_WAIT_INST_ANY, SQ_ACTIVE_INST_ANY, SQ_INSTS_VALU, SQ_ACTIVE_INST_VMEM, SQ_ACTIVE_INST_LDS, SQ_ACTIVE_INST_VALU, SQ_ACTIVE_INST_SCA, SQ_ACTIVE_INST_EXP_GDS, TCP_TCP_LATENCY_sum, TCP_TCC_READ_REQ_LATENCY_sum, TCP_TCC_WRITE_REQ_LATENCY_sum, TCP_TCC_READ_REQ_sum, TA_ADDR_STALLED_BY_TD_CYCLES_sum, TA_DATA_STALLED_BY_TC_CYCLES_sum, SPI_RA_SGPR_SIMD_FULL_CSN, SPI_RA_LDS_CU_FULL_CSN, CPC_ME1_DC0_SPI_BUSY, TCC_EA_WRREQ_STALL_sum, TCC_EA_WRREQ_IO_CREDIT_STALL_sum, TCC_EA_WRREQ_GMI_CREDIT_STALL_sum, TCC_EA_WRREQ_DRAM_CREDIT_STALL_sum
   |-> [rocprof] vcopy testing on GCD 0
   |-> [rocprof] Finished allocating vectors on the CPU
   |-> [rocprof] Finished allocating vectors on the GPU
   |-> [rocprof] Finished copying vectors to the GPU
   |-> [rocprof] sw thinks it moved 1.000000 KB per wave
   |-> [rocprof] Total threads: 1048576, Grid Size: 4096 block Size:256, Wavefronts:16384:
   |-> [rocprof] Launching the  kernel on the GPU
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished copying the output vector from the GPU to the CPU
   |-> [rocprof] Releasing GPU memory
   |-> [rocprof] Releasing CPU memory
   |-> [rocprof]
   |-> [rocprof] ROCPRofiler: 0 contexts collected, output directory /tmp/rpl_data_240321_155256_1240687/input0_results_240321_155256
   |-> [rocprof] File 'tests/workloads/dispatch_7/MI100/pmc_perf_5.csv' is generating
   |-> [rocprof]

[profiling] Current input file: tests/workloads/dispatch_7/MI100/perfmon/pmc_perf_6.txt
   |-> [rocprof] RPL: on '240321_155257' from '/opt/rocm-6.0.2' in '/home1/josantos/omniperf'
   |-> [rocprof] RPL: profiling '""./tests/vcopy -n 1048576 -b 256 -i 3""'
   |-> [rocprof] RPL: input file 'tests/workloads/dispatch_7/MI100/perfmon/pmc_perf_6.txt'
   |-> [rocprof] RPL: output dir '/tmp/rpl_data_240321_155257_1240873'
   |-> [rocprof] RPL: result dir '/tmp/rpl_data_240321_155257_1240873/input0_results_240321_155257'
   |-> [rocprof] ROCProfiler: input from "/tmp/rpl_data_240321_155257_1240873/input0.xml"
   |-> [rocprof] gpu_index =
   |-> [rocprof] kernel =
   |-> [rocprof] range = 7
   |-> [rocprof] 20 metrics
   |-> [rocprof] SQ_ACTIVE_INST_MISC, SQ_ACTIVE_INST_FLAT, SQ_INST_CYCLES_VMEM_WR, SQ_INST_CYCLES_VMEM_RD, SQ_INST_CYCLES_SMEM, SQ_INST_CYCLES_SALU, SQ_THREAD_CYCLES_VALU, SQ_IFETCH, TCP_TCC_WRITE_REQ_sum, TCP_TCC_ATOMIC_WITH_RET_REQ_sum, TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum, TCP_TCC_NC_READ_REQ_sum, TA_FLAT_WAVEFRONTS_sum, TA_FLAT_READ_WAVEFRONTS_sum, SPI_RA_BAR_CU_FULL_CSN, SPI_RA_TGLIM_CU_FULL_CSN, TCC_EA_RDREQ_sum, TCC_EA_RDREQ_32B_sum, TCC_EA_RD_UNCACHED_32B_sum, TCC_EA_RDREQ_DRAM_sum
   |-> [rocprof] vcopy testing on GCD 0
   |-> [rocprof] Finished allocating vectors on the CPU
   |-> [rocprof] Finished allocating vectors on the GPU
   |-> [rocprof] Finished copying vectors to the GPU
   |-> [rocprof] sw thinks it moved 1.000000 KB per wave
   |-> [rocprof] Total threads: 1048576, Grid Size: 4096 block Size:256, Wavefronts:16384:
   |-> [rocprof] Launching the  kernel on the GPU
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished copying the output vector from the GPU to the CPU
   |-> [rocprof] Releasing GPU memory
   |-> [rocprof] Releasing CPU memory
   |-> [rocprof]
   |-> [rocprof] ROCPRofiler: 0 contexts collected, output directory /tmp/rpl_data_240321_155257_1240873/input0_results_240321_155257
   |-> [rocprof] File 'tests/workloads/dispatch_7/MI100/pmc_perf_6.csv' is generating
   |-> [rocprof]

[profiling] Current input file: tests/workloads/dispatch_7/MI100/perfmon/pmc_perf_7.txt
   |-> [rocprof] RPL: on '240321_155257' from '/opt/rocm-6.0.2' in '/home1/josantos/omniperf'
   |-> [rocprof] RPL: profiling '""./tests/vcopy -n 1048576 -b 256 -i 3""'
   |-> [rocprof] RPL: input file 'tests/workloads/dispatch_7/MI100/perfmon/pmc_perf_7.txt'
   |-> [rocprof] RPL: output dir '/tmp/rpl_data_240321_155257_1241057'
   |-> [rocprof] RPL: result dir '/tmp/rpl_data_240321_155257_1241057/input0_results_240321_155257'
   |-> [rocprof] ROCProfiler: input from "/tmp/rpl_data_240321_155257_1241057/input0.xml"
   |-> [rocprof] gpu_index =
   |-> [rocprof] kernel =
   |-> [rocprof] range = 7
   |-> [rocprof] 20 metrics
   |-> [rocprof] SQ_LDS_BANK_CONFLICT, SQ_LDS_ADDR_CONFLICT, SQ_LDS_UNALIGNED_STALL, SQ_WAVES_EQ_64, SQ_WAVES_LT_64, SQ_WAVES_LT_48, SQ_WAVES_LT_32, SQ_WAVES_LT_16, TCP_TCC_NC_WRITE_REQ_sum, TCP_TCC_NC_ATOMIC_REQ_sum, TCP_TCC_UC_READ_REQ_sum, TCP_TCC_UC_WRITE_REQ_sum, TA_FLAT_WRITE_WAVEFRONTS_sum, TA_FLAT_ATOMIC_WAVEFRONTS_sum, SPI_RA_WVLIM_STALL_CSN, SPI_SWC_CSC_WR, TCC_EA_RDREQ_IO_CREDIT_STALL_sum, TCC_EA_RDREQ_GMI_CREDIT_STALL_sum, TCC_EA_RDREQ_DRAM_CREDIT_STALL_sum, TCC_TAG_STALL_sum
   |-> [rocprof] vcopy testing on GCD 0
   |-> [rocprof] Finished allocating vectors on the CPU
   |-> [rocprof] Finished allocating vectors on the GPU
   |-> [rocprof] Finished copying vectors to the GPU
   |-> [rocprof] sw thinks it moved 1.000000 KB per wave
   |-> [rocprof] Total threads: 1048576, Grid Size: 4096 block Size:256, Wavefronts:16384:
   |-> [rocprof] Launching the  kernel on the GPU
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished copying the output vector from the GPU to the CPU
   |-> [rocprof] Releasing GPU memory
   |-> [rocprof] Releasing CPU memory
   |-> [rocprof]
   |-> [rocprof] ROCPRofiler: 0 contexts collected, output directory /tmp/rpl_data_240321_155257_1241057/input0_results_240321_155257
   |-> [rocprof] File 'tests/workloads/dispatch_7/MI100/pmc_perf_7.csv' is generating
   |-> [rocprof]

[profiling] Current input file: tests/workloads/dispatch_7/MI100/perfmon/pmc_perf_8.txt
   |-> [rocprof] RPL: on '240321_155258' from '/opt/rocm-6.0.2' in '/home1/josantos/omniperf'
   |-> [rocprof] RPL: profiling '""./tests/vcopy -n 1048576 -b 256 -i 3""'
   |-> [rocprof] RPL: input file 'tests/workloads/dispatch_7/MI100/perfmon/pmc_perf_8.txt'
   |-> [rocprof] RPL: output dir '/tmp/rpl_data_240321_155258_1241241'
   |-> [rocprof] RPL: result dir '/tmp/rpl_data_240321_155258_1241241/input0_results_240321_155258'
   |-> [rocprof] ROCProfiler: input from "/tmp/rpl_data_240321_155258_1241241/input0.xml"
   |-> [rocprof] gpu_index =
   |-> [rocprof] kernel =
   |-> [rocprof] range = 7
   |-> [rocprof] 17 metrics
   |-> [rocprof] SQ_ITEMS, SQ_LDS_MEM_VIOLATIONS, SQ_LDS_ATOMIC_RETURN, SQ_LDS_IDX_ACTIVE, SQ_WAVES_RESTORED, SQ_WAVES_SAVED, SQ_INSTS_SMEM_NORM, TCP_TCC_UC_ATOMIC_REQ_sum, TCP_TCC_CC_READ_REQ_sum, TCP_TCC_CC_WRITE_REQ_sum, TCP_TCC_CC_ATOMIC_REQ_sum, SPI_VWC_CSC_WR, SPI_RA_BULKY_CU_FULL_CSN, TCC_NORMAL_WRITEBACK_sum, TCC_ALL_TC_OP_WB_WRITEBACK_sum, TCC_NORMAL_EVICT_sum, TCC_ALL_TC_OP_INV_EVICT_sum
   |-> [rocprof] vcopy testing on GCD 0
   |-> [rocprof] Finished allocating vectors on the CPU
   |-> [rocprof] Finished allocating vectors on the GPU
   |-> [rocprof] Finished copying vectors to the GPU
   |-> [rocprof] sw thinks it moved 1.000000 KB per wave
   |-> [rocprof] Total threads: 1048576, Grid Size: 4096 block Size:256, Wavefronts:16384:
   |-> [rocprof] Launching the  kernel on the GPU
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished copying the output vector from the GPU to the CPU
   |-> [rocprof] Releasing GPU memory
   |-> [rocprof] Releasing CPU memory
   |-> [rocprof]
   |-> [rocprof] ROCPRofiler: 0 contexts collected, output directory /tmp/rpl_data_240321_155258_1241241/input0_results_240321_155258
   |-> [rocprof] File 'tests/workloads/dispatch_7/MI100/pmc_perf_8.csv' is generating
   |-> [rocprof]

[profiling] Current input file: tests/workloads/dispatch_7/MI100/perfmon/pmc_perf_9.txt
   |-> [rocprof] RPL: on '240321_155258' from '/opt/rocm-6.0.2' in '/home1/josantos/omniperf'
   |-> [rocprof] RPL: profiling '""./tests/vcopy -n 1048576 -b 256 -i 3""'
   |-> [rocprof] RPL: input file 'tests/workloads/dispatch_7/MI100/perfmon/pmc_perf_9.txt'
   |-> [rocprof] RPL: output dir '/tmp/rpl_data_240321_155258_1241427'
   |-> [rocprof] RPL: result dir '/tmp/rpl_data_240321_155258_1241427/input0_results_240321_155258'
   |-> [rocprof] ROCProfiler: input from "/tmp/rpl_data_240321_155258_1241427/input0.xml"
   |-> [rocprof] gpu_index =
   |-> [rocprof] kernel =
   |-> [rocprof] range = 7
   |-> [rocprof] 8 metrics
   |-> [rocprof] TCP_TCC_RW_READ_REQ_sum, TCP_TCC_RW_WRITE_REQ_sum, TCP_TCC_RW_ATOMIC_REQ_sum, TCP_PENDING_STALL_CYCLES_sum, TCC_TOO_MANY_EA_WRREQS_STALL_sum, TCC_EA_ATOMIC_sum, TCC_EA_RDREQ_LEVEL_sum, TCC_EA_WRREQ_LEVEL_sum
   |-> [rocprof] vcopy testing on GCD 0
   |-> [rocprof] Finished allocating vectors on the CPU
   |-> [rocprof] Finished allocating vectors on the GPU
   |-> [rocprof] Finished copying vectors to the GPU
   |-> [rocprof] sw thinks it moved 1.000000 KB per wave
   |-> [rocprof] Total threads: 1048576, Grid Size: 4096 block Size:256, Wavefronts:16384:
   |-> [rocprof] Launching the  kernel on the GPU
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished copying the output vector from the GPU to the CPU
   |-> [rocprof] Releasing GPU memory
   |-> [rocprof] Releasing CPU memory
   |-> [rocprof]
   |-> [rocprof] ROCPRofiler: 0 contexts collected, output directory /tmp/rpl_data_240321_155258_1241427/input0_results_240321_155258
   |-> [rocprof] File 'tests/workloads/dispatch_7/MI100/pmc_perf_9.csv' is generating
   |-> [rocprof]

[profiling] Current input file: tests/workloads/dispatch_7/MI100/perfmon/timestamps.txt
   |-> [rocprof] RPL: on '240321_155259' from '/opt/rocm-6.0.2' in '/home1/josantos/omniperf'
   |-> [rocprof] RPL: profiling '""./tests/vcopy -n 1048576 -b 256 -i 3""'
   |-> [rocprof] RPL: input file 'tests/workloads/dispatch_7/MI100/perfmon/timestamps.txt'
   |-> [rocprof] RPL: output dir '/tmp/rpl_data_240321_155259_1241612'
   |-> [rocprof] RPL: result dir '/tmp/rpl_data_240321_155259_1241612/input0_results_240321_155259'
   |-> [rocprof] ROCProfiler: input from "/tmp/rpl_data_240321_155259_1241612/input0.xml"
   |-> [rocprof] gpu_index =
   |-> [rocprof] kernel =
   |-> [rocprof] range = 7
   |-> [rocprof] 0 metrics
   |-> [rocprof] vcopy testing on GCD 0
   |-> [rocprof] Finished allocating vectors on the CPU
   |-> [rocprof] Finished allocating vectors on the GPU
   |-> [rocprof] Finished copying vectors to the GPU
   |-> [rocprof] sw thinks it moved 1.000000 KB per wave
   |-> [rocprof] Total threads: 1048576, Grid Size: 4096 block Size:256, Wavefronts:16384:
   |-> [rocprof] Launching the  kernel on the GPU
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished executing kernel
   |-> [rocprof] Finished copying the output vector from the GPU to the CPU
   |-> [rocprof] Releasing GPU memory
   |-> [rocprof] Releasing CPU memory
   |-> [rocprof]
   |-> [rocprof] ROCPRofiler: 0 contexts collected, output directory /tmp/rpl_data_240321_155259_1241612/input0_results_240321_155259
   |-> [rocprof] File 'tests/workloads/dispatch_7/MI100/timestamps.csv' is generating
   |-> [rocprof]
