Uploaded image for project: 'Code Tools'
  1. Code Tools
  2. CODETOOLS-7902799

perfasm still handles event modifiers incorrectly

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Fixed
    • Icon: P4 P4
    • None
    • None
    • tools
    • None
    • jmh

      Using the perfasm profiler with an event specification like "-prof perfasm:events=cycles:ppp" will result in no output because JMH won't find any events:

      Secondary result "io.simonis.jmh.Synchronization.synchronizedIncrementGlobal:·asm":
      PrintAssembly processed: 142164 total address lines.
      Perf output processed (skipped 7.859 seconds):
       Column 1: cycles:ppp (0 events)

      This is because since CODETOOLS-7901905 the tags are removed from the events when the perf output is parsed. But the list of profiled events (which is kept in the AbstractPerfAsmProfiler base class of LinuxPerfAsmProfiler) still keeps the original event specification with the tag.

      I've tried to normalize all event name usages in LinuxPerfAsmProfiler to use the event name without tags but doing so is not easy because in the end the base class will check the original events against the parsed ones in 'processAssembly()/PerfEvents()' and fail (i.e. in the sense that it woun't find any events).

      So in the end I gave up and simple removed the part of CODETOOLS-7901905 which chops the tags from the event name. I don't think that's necessary and everything seems to still work just fine with the full name plus tags (except for a tests in PerfParseTest.java which had to be removed because it specifically checked that LinuxPerfAsmProfiler.parsePerfLine() removes tags from event names which are parsed from the perf output).


      This fix will make it possible to use the ":ppp" tag which prevents instruction skew in perf results (see http://www.brendangregg.com/perf.html#EventProfiling). So the following output:

      PrintAssembly processed: 142764 total address lines.
      Perf output processed (skipped 7.898 seconds):
       Column 1: cycles (10247 events)

      Hottest code regions (>10.00% "cycles" events):

      ....[Hottest Region 1]..............................................................................
      c2, level 4, io.simonis.jmh.Synchronization::synchronizedIncrementGlobal, version 468 (221 bytes)

        0.03% 0x00007fffe01e1f37: test $0x2,%rax
               ╭ 0x00007fffe01e1f3d: jne 0x00007fffe01e1f63
               │ 0x00007fffe01e1f3f: or $0x1,%rax
               │ 0x00007fffe01e1f43: mov %rax,(%rbx)
               │ 0x00007fffe01e1f46: lock cmpxchg %rbx,(%rsi)
               │╭ 0x00007fffe01e1f4b: je 0x00007fffe01e1f76
               ││ 0x00007fffe01e1f51: sub %rsp,%rax
               ││ 0x00007fffe01e1f54: and $0xfffffffffffff007,%rax
               ││ 0x00007fffe01e1f5b: mov %rax,(%rbx)
               ││╭ 0x00007fffe01e1f5e: jmpq 0x00007fffe01e1f76
        0.32% ↘││ 0x00007fffe01e1f63: mov %rax,%r11
        0.01% ││ 0x00007fffe01e1f66: xor %rax,%rax
        0.07% ││ 0x00007fffe01e1f69: lock cmpxchg %r15,0x3e(%r11)
       35.31% ││ 0x00007fffe01e1f6f: movq $0x3,(%rbx)
        0.15% ↘↘╭ 0x00007fffe01e1f76: jne 0x00007fffe01e202b ;*synchronization entry

      which obviously suffers from instruction skew because it attributes 35% of the cycles to the "movq" instruction after the "lock cmpxchg" will be fixed and look correctly like this:

      PrintAssembly processed: 143768 total address lines.
      Perf output processed (skipped 7.921 seconds):
       Column 1: cycles:ppp (10074 events)

      Hottest code regions (>10.00% "cycles:ppp" events):

      ....[Hottest Region 1]..............................................................................
      c2, level 4, io.simonis.jmh.Synchronization::synchronizedIncrementGlobal, version 481 (137 bytes)

                         0x00007fffe01e3fb7: test $0x2,%rax
        0.32% ╭ 0x00007fffe01e3fbd: jne 0x00007fffe01e3fe3
               │ 0x00007fffe01e3fbf: or $0x1,%rax
               │ 0x00007fffe01e3fc3: mov %rax,(%rbx)
               │ 0x00007fffe01e3fc6: lock cmpxchg %rbx,(%rsi)
               │╭ 0x00007fffe01e3fcb: je 0x00007fffe01e3ff6
               ││ 0x00007fffe01e3fd1: sub %rsp,%rax
               ││ 0x00007fffe01e3fd4: and $0xfffffffffffff007,%rax
               ││ 0x00007fffe01e3fdb: mov %rax,(%rbx)
               ││╭ 0x00007fffe01e3fde: jmpq 0x00007fffe01e3ff6
        0.01% ↘││ 0x00007fffe01e3fe3: mov %rax,%r11
        0.05% ││ 0x00007fffe01e3fe6: xor %rax,%rax
       35.03% ││ 0x00007fffe01e3fe9: lock cmpxchg %r15,0x3e(%r11)
        0.21% ││ 0x00007fffe01e3fef: movq $0x3,(%rbx)
        0.26% ↘↘ 0x00007fffe01e3ff6: jne 0x00007fffe01e40ab ;*synchronization entry

            simonis Volker Simonis
            simonis Volker Simonis
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

              Created:
              Updated:
              Resolved: