JDK-8312182

THPs cause huge RSS due to thread start timing issue

    • b08
    • linux

        If THP (Transparent Huge Pages) are enabled unconditionally on the system, Java applications that use many threads may see a huge Resident Set Size. That footprint is caused by thread stacks being mostly paged in. The page-in happens because khugepaged transforms thread stack memory into huge pages; later, those huge pages usually shatter into small pages when Java guard pages are established at thread start, but the resulting small pages remain paged in.

        Note that this effect is independent of any JVM switches; it happens regardless of -XX:+UseTransparentHugePages.
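
        To check which THP mode a system is in, a small helper can read the kernel's sysfs setting. The following is a minimal sketch and not part of the JDK; the class name ThpMode and the method activeMode are made up for illustration. It assumes the usual Linux location /sys/kernel/mm/transparent_hugepage/enabled, where the active mode is marked with square brackets.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class ThpMode {
    // Standard Linux sysfs location of the system-wide THP setting.
    static final Path ENABLED = Path.of("/sys/kernel/mm/transparent_hugepage/enabled");

    // The file lists all modes and brackets the active one, e.g. "[always] madvise never".
    // Returns "unknown" if no bracketed entry is found.
    static String activeMode(String fileContent) {
        int open = fileContent.indexOf('[');
        int close = fileContent.indexOf(']', open + 1);
        if (open < 0 || close < 0) {
            return "unknown";
        }
        return fileContent.substring(open + 1, close);
    }

    public static void main(String[] args) throws IOException {
        if (!Files.isReadable(ENABLED)) {
            System.out.println("THP setting not readable (not Linux, or THP not available)");
            return;
        }
        String mode = activeMode(Files.readString(ENABLED));
        System.out.println("THP mode: " + mode);
        if (mode.equals("always")) {
            // Matches the scenario in this report: JVM switches do not matter here.
            System.out.println("Thread stacks may be converted by khugepaged");
        }
    }
}
```

        If this prints "always", the system is in the configuration this report is about.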

        JDK-8303215 attempted to fix this problem by making it unlikely that thread stack boundaries are aligned to THP page size. Unfortunately, that was not sufficient. We still see JVMs with huge footprints, especially if they created many Java threads in rapid succession.

        Demonstration:

        10000 idle threads with 100 MB pre-touched Java heap, -Xss2M, on x64, will consume:

        A) Baseline (THP disabled on system): 369 MB
        B) THP="always", JDK-8303215 present: 1.5 GB .. >2 GB (very wobbly)
        C) THP="always", JDK-8303215 present, artificial delay after thread start: 20.6 GB (!).
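
        A rough way to reproduce this kind of measurement is sketched below. This is not the program used for the numbers above; the class name IdleThreads and the 5 second pause are assumptions made for illustration. It starts idle threads, waits a little so that khugepaged has a chance to act, and then reads VmRSS from /proc/self/status. The artificial delay of scenario C sits inside the JVM's thread start and is not reproduced here.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.CountDownLatch;

final class IdleThreads {
    // Extracts the VmRSS value in kB from the lines of /proc/self/status, or -1 if absent.
    static long vmRssKb(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("VmRSS:")) {
                // Line format: "VmRSS:     12345 kB"
                String[] parts = line.trim().split("\\s+");
                return Long.parseLong(parts[1]);
            }
        }
        return -1;
    }

    // Starts `count` daemon threads and returns once all of them are running.
    // The threads then block until `release` is counted down.
    static void startIdleThreads(int count, CountDownLatch release) throws InterruptedException {
        CountDownLatch started = new CountDownLatch(count);
        for (int i = 0; i < count; i++) {
            Thread t = new Thread(() -> {
                started.countDown();
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.setDaemon(true);
            t.start();
        }
        started.await();
    }

    public static void main(String[] args) throws Exception {
        int count = args.length > 0 ? Integer.parseInt(args[0]) : 10000;
        CountDownLatch release = new CountDownLatch(1);
        startIdleThreads(count, release);
        Thread.sleep(5000); // assumption: some time for khugepaged to scan the stacks
        Path status = Path.of("/proc/self/status");
        if (Files.isReadable(status)) {
            System.out.println("VmRSS: " + vmRssKb(Files.readAllLines(status)) + " kB");
        }
        release.countDown();
    }
}
```

        Assuming JDK 11 or later, one possible invocation matching the setup above would be: java -Xss2M -Xms100M -Xmx100M -XX:+AlwaysPreTouch IdleThreads.java 10000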

        Cause:

        The problem is caused by timing. When we create multiple Java threads, the following sequence of actions happens:

        In the parent thread:

            the parent thread calls pthread_create(3)
            pthread_create(3) creates the thread stack by calling mmap(2)
            pthread_create(3) calls clone(2) to start the child thread
            repeat to start more threads

        Each child thread:

            queries its stack dimensions
            handshakes with the parent to signal liveness
            establishes guard pages at the low end of the stack

        The thread stack mapping is established in the parent thread; the guard pages are placed by the child threads. There is a time window in which the thread stack is already mapped into address space, but guard pages still need to be placed.

        If the parent is faster than the children, it will have created mappings faster than the children can place guard pages on them.

        For the kernel, these thread stacks are just anonymous mappings. It places them adjacent to each other to reduce address space fragmentation. As long as no guard pages are placed yet, all these thread stack mappings (VMAs) have the same attributes - same permission bits, all anonymous. Hence, the kernel will fold them into a single large VMA.
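
        The folding can be illustrated with a toy model. The sketch below is not kernel code: it only models the rule described above (neighbouring anonymous mappings that touch and have the same permission bits become one VMA) and ignores every other condition the kernel checks. The class VmaMergeModel and its methods are hypothetical names; the addresses and the 16 kB guard size are taken from the example further down in this report.

```java
import java.util.ArrayList;
import java.util.List;

final class VmaMergeModel {
    // Simplified stand-in for a kernel VMA: address range [start, end) plus permissions.
    static final class Vma {
        final long start;
        final long end;
        final String perms;

        Vma(long start, long end, String perms) {
            this.start = start;
            this.end = end;
            this.perms = perms;
        }
    }

    // Folds neighbours that touch and carry identical permissions into one VMA.
    // The input must be sorted by start address.
    static List<Vma> merge(List<Vma> sorted) {
        List<Vma> out = new ArrayList<>();
        for (Vma v : sorted) {
            Vma last = out.isEmpty() ? null : out.get(out.size() - 1);
            if (last != null && last.end == v.start && last.perms.equals(v.perms)) {
                out.set(out.size() - 1, new Vma(last.start, v.end, v.perms));
            } else {
                out.add(v);
            }
        }
        return out;
    }

    // Three adjacent stack mappings before any guard pages exist.
    static List<Vma> freshStacks() {
        return List.of(
            new Vma(0x7feea51fe000L, 0x7feea53ff000L, "rw-p"),
            new Vma(0x7feea53ff000L, 0x7feea5600000L, "rw-p"),
            new Vma(0x7feea5600000L, 0x7feea5801000L, "rw-p"));
    }

    // The same stacks once each child has protected the low 16 kB as guard pages.
    static List<Vma> guardedStacks() {
        List<Vma> out = new ArrayList<>();
        for (Vma v : freshStacks()) {
            long guardEnd = v.start + 16 * 1024;
            out.add(new Vma(v.start, guardEnd, "---p"));
            out.add(new Vma(guardEnd, v.end, "rw-p"));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(merge(freshStacks()).size());   // prints 1
        System.out.println(merge(guardedStacks()).size()); // prints 6
    }
}
```

        In this model, the unguarded stacks collapse into one mapping, while the guarded ones stay separate because the permission bits alternate.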

        That VMA may be large enough to be eligible for huge pages. Now the JVM races with khugepaged: if khugepaged is faster than the JVM, it will have converted that larger VMA partly or fully into huge pages before the child threads start creating guard pages.

        The child threads will catch up and create guard pages. That will splinter the large VMA into several smaller VMAs (two for each thread: one for the usable thread stack section and one protected for the guard pages). Each of these VMAs will typically be smaller than a huge page, and typically not huge-page-aligned. The huge pages created by khugepaged will mostly shatter into small pages, but these small pages remain paged in. Effect: we pay memory for the whole thread stacks even though the threads have only just started.
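
        As a back-of-the-envelope check, the worst case can be estimated by assuming that every stack ends up fully paged in. The sketch below does just that multiplication for the demonstration setup (10000 threads, -Xss2M); the class StackFootprint and the method worstCaseKb are hypothetical names. The result is 20480000 kB, roughly 19.5 GiB, which is in the same range as the 20.6 GB measured in scenario C.

```java
final class StackFootprint {
    // Worst case: every thread stack was fully paged in before its guard pages were placed.
    static long worstCaseKb(long threads, long stackKb) {
        return threads * stackKb;
    }

    public static void main(String[] args) {
        long kb = worstCaseKb(10_000, 2 * 1024); // 10000 threads, 2 MB stack each
        System.out.printf("%d kB (about %.1f GiB)%n", kb, kb / (1024.0 * 1024.0));
    }
}
```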

        This effect is similar to the one described in JDK-8303215, but there we assumed it only affects individual threads, whereas it actually affects whole regions of adjacent thread stacks.

        Example:

        Let's create three threads. Each thread stack, including guard pages, is 2M + 4K sized (+4K because of JDK-8303215).

        Their thread stacks will be located at ([base .. end .. guard]):

        T1: [7feea53ff000 .. 7feea5202000 .. 7feea51fe000]
        T2: [7feea5600000 .. 7feea5403000 .. 7feea53ff000]
        T3: [7feea5801000 .. 7feea5604000 .. 7feea5600000]

        After pthread_create(3), their thread stacks exist without JVM guard pages. The kernel merges the VMAs of their thread stacks into a single mapping larger than 6 MB. khugepaged then coalesces their small pages into 3 huge pages:

        ```
        7feea51fe000-7feea5801000 rw-p 00000000 00:00 0 <<<------- all three stacks as one VMA
        Size: 6156 kB
        KernelPageSize: 4 kB
        MMUPageSize: 4 kB
        Rss: 6148 kB
        Pss: 6148 kB
        Shared_Clean: 0 kB
        Shared_Dirty: 0 kB
        Private_Clean: 0 kB
        Private_Dirty: 6148 kB
        Referenced: 6148 kB
        Anonymous: 6148 kB
        LazyFree: 0 kB
        AnonHugePages: 6144 kB <<<---------- 3x2MB huge pages
        ShmemPmdMapped: 0 kB
        FilePmdMapped: 0 kB
        Shared_Hugetlb: 0 kB
        Private_Hugetlb: 0 kB
        Swap: 0 kB
        SwapPss: 0 kB
        Locked: 0 kB
        THPeligible: 1
        VmFlags: rd wr mr mw me ac sd
        ```

        Threads start and create their respective guard pages. The single VMA splinters into 6 smaller VMAs. The huge pages shatter into small pages that remain paged-in:

        ```
        7feea51fe000-7feea5202000 ---p 00000000 00:00 0 <<----- guard pages for T1
        Size: 16 kB
        ...
        7feea5202000-7feea53ff000 rw-p 00000000 00:00 0 <<------ thread stack for T1
        Size: 2036 kB
        KernelPageSize: 4 kB
        MMUPageSize: 4 kB
        Rss: 2036 kB
        Pss: 2036 kB
        Private_Dirty: 2036 kB <<<-------- all pages resident
        ...
        7feea53ff000-7feea5403000 ---p 00000000 00:00 0 <<----- guard pages for T2
        Size: 16 kB
        ...
        7feea5403000-7feea5600000 rw-p 00000000 00:00 0 <<------ thread stack for T2
        Size: 2036 kB
        KernelPageSize: 4 kB
        MMUPageSize: 4 kB
        Rss: 2036 kB
        Pss: 2036 kB
        Private_Dirty: 2036 kB <<<-------- all pages resident
        ...
        7feea5600000-7feea5604000 ---p 00000000 00:00 0 <<----- guard pages for T3
        Size: 16 kB
        ...
        7feea5604000-7feea5801000 rw-p 00000000 00:00 0 <<------ thread stack for T3
        Size: 2036 kB
        KernelPageSize: 4 kB
        MMUPageSize: 4 kB
        Rss: 2036 kB
        Pss: 2036 kB
        Private_Dirty: 2036 kB <<<-------- all pages resident
        ...
        ```
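
        The address arithmetic behind the two dumps above can be checked with a few lines of code. The sketch below counts how many naturally aligned 2 MB ranges fit completely into a mapping, which is an upper bound on the huge pages it can hold. It assumes a 2 MB THP size (x64); the class HugePageFit and the method alignedHugePages are hypothetical names.

```java
final class HugePageFit {
    static final long HUGE = 2L * 1024 * 1024; // 2 MB THP size, as on x64

    // Number of naturally aligned 2 MB ranges that fit entirely inside [start, end).
    static long alignedHugePages(long start, long end) {
        long first = (start + HUGE - 1) / HUGE * HUGE; // round start up to 2 MB
        long last = end / HUGE * HUGE;                 // round end down to 2 MB
        return last > first ? (last - first) / HUGE : 0;
    }

    public static void main(String[] args) {
        // Merged VMA before the guard pages exist.
        System.out.println(alignedHugePages(0x7feea51fe000L, 0x7feea5801000L)); // prints 3
        // Usable stack VMAs of T1, T2 and T3 after the guard pages split the mapping.
        System.out.println(alignedHugePages(0x7feea5202000L, 0x7feea53ff000L)); // prints 0
        System.out.println(alignedHugePages(0x7feea5403000L, 0x7feea5600000L)); // prints 0
        System.out.println(alignedHugePages(0x7feea5604000L, 0x7feea5801000L)); // prints 0
    }
}
```

        The merged mapping has room for exactly the three huge pages reported as AnonHugePages, while none of the per-thread VMAs can hold a single one, which is consistent with the huge pages having been split.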

              stuefe Thomas Stuefe