JDK-8261238

NMT should not limit baselining by size threshold

    • b20

        NMT is a very useful tool to detect memory leaks. It's easy to use, relatively cheap, and requires almost zero setup.

        But its use for detecting "slow riser" leaks is limited because it omits small leaks from the output, rather arbitrarily. That introduces subtle errors in the output: allocations below a certain threshold do not appear at all; they appear only once the leak rises above the threshold, and then it looks as if the leak happened suddenly. This is an issue both for the absolute stats and for stats relative to a previously taken baseline.

        There are two thresholds:

        (1) when collecting baseline information for a report, it omits call sites smaller than MemBaseline::SIZE_THRESHOLD, which is hard coded to 1K.
        (2) when printing summary information, it omits categories whose weight would be less than one unit of the scale the NMT report is displayed in. E.g. if the report were for scale=G, we would not see categories allocating less than 1G. Similarly, when printing detail sites, it omits all sites whose size would be less than the NMT scale.

        Note that setting the scale to 1 on the jcmd command line will effectively disable threshold (2), but (1) is still in place.
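
        The effect of threshold (2) can be illustrated with a small model. This is only a sketch: the function, the scale table and the sizes are made up for illustration, and it assumes a category is dropped when its size truncates to zero units of the chosen scale, which may not match the exact rounding HotSpot uses.

```python
# Illustrative model only; names are hypothetical, not HotSpot identifiers.
SCALES = {"1": 1, "K": 1024, "M": 1024**2, "G": 1024**3}


def visible_categories(categories, scale):
    """Return the categories a report at the given scale would print.

    Assumption: threshold (2) drops every category whose size truncates
    to zero units of the chosen scale.
    """
    unit = SCALES[scale]
    return {name: size // unit
            for name, size in categories.items()
            if size // unit > 0}


# Made-up sizes in bytes.
categories = {
    "Thread": 3 * 1024**3,
    "Compiler": 200 * 1024**2,
    "Logging": 512,
}

report_g = visible_categories(categories, "G")  # only "Thread" survives
report_1 = visible_categories(categories, "1")  # nothing is omitted
```

        In this model, the report at scale G shows only the 3G category, while the report at scale 1 shows all three.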

        I propose to remove threshold (1) completely. This is needed to get accurate baseline diffs - otherwise, a baseline seeing an allocation of 1023 bytes, diff'ed against a later baseline of 1025 bytes, would be listed as having a delta of +1025, not +2 as would be correct.
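
        The 1023 vs. 1025 byte case can be reproduced with a toy model. This is a sketch, not the NMT implementation: take_baseline and diff are hypothetical helpers, a baseline is modeled as a plain dict from call site to bytes, and a site missing from a baseline is assumed to count as 0 bytes.

```python
SIZE_THRESHOLD = 1024  # 1K, as described for MemBaseline::SIZE_THRESHOLD


def take_baseline(sites, threshold):
    """Keep only call sites whose size is at least the threshold."""
    return {site: size for site, size in sites.items() if size >= threshold}


def diff(earlier, later):
    """Per-site delta in bytes; a site missing from a baseline counts as 0."""
    deltas = {}
    for site in set(earlier) | set(later):
        delta = later.get(site, 0) - earlier.get(site, 0)
        if delta != 0:
            deltas[site] = delta
    return deltas


before = {"site_a": 1023}  # hypothetical call site, below the threshold
after = {"site_a": 1025}   # same site, now above the threshold

with_threshold = diff(take_baseline(before, SIZE_THRESHOLD),
                      take_baseline(after, SIZE_THRESHOLD))
without_threshold = diff(take_baseline(before, 0),
                         take_baseline(after, 0))
```

        With the threshold, the earlier baseline never recorded the site, so the diff reports +1025; without it, the diff reports the real growth of +2.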

        The limit (2) can be kept in place, but the NMT report should contain a hint about omitted information to reduce confusion.

        Footprint costs of omitting the (1) threshold:

        According to my measurements, omitting that threshold check increases the cost of a MemBaseline from ~60K today to ~270K - an increase of ~210K.

        A single MemBaseline object is used while generating the report. If the "baseline" feature is used in jcmd, a second MemBaseline object is used to hold the baseline. The first MemBaseline is temporary, the second one permanent, since it is not destroyed.

        Therefore, the standard footprint should not be affected at all. If NMT is active and someone runs jcmd VM.native_memory, it will cause about 270K (210K more than today) of temporary allocations. If someone runs jcmd VM.native_memory baseline, that increase is not temporary but sticks.
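
        The footprint reasoning can be written out as simple arithmetic. This is a rough sketch under the two-object model described above (one temporary MemBaseline per report, one retained MemBaseline once a baseline has been taken); the helper and its parameters are hypothetical, and the sizes are the approximate measurements quoted in this description.

```python
K = 1024
COST_TODAY = 60 * K      # approximate MemBaseline cost with threshold (1)
COST_PROPOSED = 270 * K  # approximate MemBaseline cost without threshold (1)


def baseline_footprint(cost, baseline_taken, report_running):
    """Bytes held by MemBaseline objects in the simplified two-object model."""
    permanent = cost if baseline_taken else 0   # retained baseline object
    temporary = cost if report_running else 0   # exists only during a report
    return permanent + temporary


idle = baseline_footprint(COST_PROPOSED, False, False)
during_report = baseline_footprint(COST_PROPOSED, False, True)
after_baseline = baseline_footprint(COST_PROPOSED, True, False)
increase_per_object = COST_PROPOSED - COST_TODAY
```

        In this model an idle VM pays nothing, a plain report costs about 270K temporarily, and a taken baseline keeps about 270K alive, each about 210K more than today.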

        Note that these numbers were taken from some long-running Java programs, which means the majority of malloc call sites should have been hit. I believe these numbers to be representative. While the programs I ran may not have covered all call sites, the total number is bounded and, I believe, not far from what I measured. In fact, I was not able to drastically change this number with different runs.

        Also note that omitting the (1) threshold only affects baselining of malloc call sites. While the threshold is also used for baselining virtual memory call sites, in practice this has no effect, since all of those virtual memory allocations happen at page granularity, which is >= 4K and hence always above the (1) threshold.

        Bottom line: I think the footprint increase is acceptable. It gives us more comprehensive numbers and the ability to scan for small leaks.

        Note: should footprint really be an issue, we could take a look at how NMT manages its data. Currently, allocation site objects are copied by value in various places: inside the MemBaseline as well as in some temporary lists used when sorting output. We could at least share the call stack portion of these objects; there is no reason for multiple call stack objects to exist which refer to a single call site. That would reduce the size of MemBaseline objects by about half.
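
        One possible shape for sharing the call stack portion is an interning table, sketched below. This is not how NMT is implemented today, and none of the names (CallStack, StackTable, AllocationSite) are meant to match HotSpot classes; it only illustrates the idea that several site records can point at one stored stack instead of each carrying a copy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallStack:
    frames: tuple  # return addresses; hypothetical representation


class StackTable:
    """Interns call stacks so that equal stacks are stored only once."""

    def __init__(self):
        self._stacks = {}

    def intern(self, frames):
        key = tuple(frames)
        if key not in self._stacks:
            self._stacks[key] = CallStack(key)
        return self._stacks[key]

    def __len__(self):
        return len(self._stacks)


@dataclass
class AllocationSite:
    stack: CallStack  # shared reference, not a by-value copy
    size: int
    count: int


table = StackTable()
# Two records for the same call site, e.g. one in a baseline and one in a
# temporary sorting list; the addresses are made up.
baseline_site = AllocationSite(table.intern([0x1000, 0x2000]), size=1025, count=1)
sorting_site = AllocationSite(table.intern([0x1000, 0x2000]), size=1025, count=1)
```

        Both records refer to the same CallStack object, so the table holds a single stack no matter how many site records exist for that call site.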

              stuefe Thomas Stuefe