Rate-limited sampling of events has been discussed since the inception of JFR, but the challenge has always been how sampling should work. After investigations, I have come to the following conclusions:
1. Throttling should happen after all other filtering has occurred, such as threshold and user-defined settings.
The arguments are:
a) Throttling is more expensive than other filtering since we need to increment an atomic counter and possibly recalculate the sampling rate, which requires locking.
b) Events are accepted as long as no more than x events/second have been emitted. If the sampler decides to sample early, but the event is later filtered out due to a threshold, we lose one sample. We could give it back, but that requires decrementing an atomic counter, which adds to the overhead. The give-back may also fall in a different sampling window, which complicates the implementation and makes throttling unreliable.
c) An alternative would be to make sampling an OR operation: the event is emitted if the threshold is exceeded OR the rate limit has not been exceeded. What makes this approach appealing is that both outliers and statistical information can be collected simultaneously. However, the problem is that the samples become difficult to interpret if outliers sneak in, for example when calculating averages. This issue can be mitigated by setting the threshold to 0 ms, although that would undermine the OR operation. Combining this behavior with user-defined settings is also problematic. Consider a setting that records only HTTP requests for a specific URL. If it were an OR operation, all URLs would be recorded when events are below the rate limit, which is likely not what the user intended. A sketch of the preferred ordering, with throttling applied last, is shown below.
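To make the preferred ordering concrete, here is a minimal sketch of enabling an event with a threshold, a user-defined setting and a throttle rate. It assumes the throttle setting is exposed under the name "throttle" with values such as "50/s"; the event name com.example.HttpRequest and the uriFilter setting are made up for illustration. With the ordering argued for above, the threshold and the user-defined setting are evaluated first, and only events that pass them count against the rate limit.

import java.time.Duration;
import jdk.jfr.Recording;

public class ThrottleOrderingSketch {
    public static void main(String[] args) throws Exception {
        try (Recording recording = new Recording()) {
            recording.enable("com.example.HttpRequest")       // hypothetical event
                     .withThreshold(Duration.ofMillis(20))    // threshold filtering happens first
                     .with("uriFilter", "/checkout")          // hypothetical user-defined setting, also first
                     .with("throttle", "50/s");               // throttling applies last, to surviving events
            recording.start();
            // ... application code that emits com.example.HttpRequest events ...
            recording.stop();
        }
    }
}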
2. Throttling is orthogonal to "record everything" (a.k.a. laser pointing) during an event.
There has been a discussion about an annotation that would record all events occurring in the same thread between begin() and end(). To make it practical, it would need to be throttled: for example, record all events for a transaction event, but limit it to 10 transactions per second to keep the overhead reasonable. The decision to sample must occur when a user calls begin(). Doing so in commit() is too late, and holding onto a buffer until then is impractical due to its potential size and the need to track constants. The threshold setting would not apply to these events, and we cannot allow user-defined settings for them either, see 1.c. Consequently, when throttling is used with "record everything", all other settings must be disabled, so there are no interactions, which makes it feasible to add this capability later without impacting the design decisions made now.
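The sketch below illustrates where the decision has to be made for such a transaction event. The record-everything annotation does not exist, so it is only indicated as a comment; the event name and field are placeholders, and the per-second limit is the example figure used above.

import jdk.jfr.Event;
import jdk.jfr.Name;

@Name("com.example.Transaction")         // placeholder event name
// @RecordEverything(limit = "10/s")     // hypothetical annotation from the discussion above
class TransactionEvent extends Event {
    String id;
}

class TransactionHandler {
    void handle(String id) {
        TransactionEvent tx = new TransactionEvent();
        tx.id = id;
        tx.begin();   // the sampling decision must be made here, so the runtime knows whether
                      // to record the other events emitted by this thread while tx is open
        // ... work; nested events in this thread would be captured only if tx was sampled ...
        tx.end();
        tx.commit();  // deciding only here would be too late; buffering all intermediate
                      // events until commit() is impractical (size, constant tracking)
    }
}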
3. The sampling decision must be stored if shouldCommit() is called.
If a user calls shouldCommit() and it returns true, the event must be written when commit() is later called; otherwise, the sampling rate would not be respected. The sampling decision must be cleared after commit() so that the event object can be reused.
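A minimal sketch of this contract, using the existing jdk.jfr.Event API; the event name and field are placeholders. Whatever shouldCommit() answers, including the throttling decision proposed here, must still hold when commit() runs, and is cleared afterwards so the object can be reused.

import jdk.jfr.Event;
import jdk.jfr.Name;

@Name("com.example.HttpRequest")   // placeholder event name
class HttpRequestEvent extends Event {
    String url;
}

class RequestHandler {
    void handle(String url) {
        HttpRequestEvent event = new HttpRequestEvent();
        event.begin();
        // ... handle the request ...
        event.end();
        if (event.shouldCommit()) {   // threshold, and with this proposal throttling, decided here
            event.url = url;          // expensive fields are filled in only for accepted events
            event.commit();           // the stored decision is honored here
        }                             // and cleared afterwards so the object can be reused
    }
}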
4. Throttling is orthogonal to contextual events.
Contextual events, that is, the ability to annotate events or fields and have them injected into events that occur concurrently, can work with or without throttling, now that we know that "record everything" (2) is orthogonal and that throttling happens after all other filtering has occurred (1). This means throttling can be implemented independently of these features.
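As an illustration of what a contextual event could look like, here is a hypothetical sketch; the annotation named in the comment does not exist in the current API, and the event name and field are placeholders.

import jdk.jfr.Event;
import jdk.jfr.Name;

@Name("com.example.Trace")   // placeholder event name
class TraceEvent extends Event {
    // @Contextual           // hypothetical field annotation: while a TraceEvent is in progress,
    String traceId;          // its traceId would be injected into events occurring concurrently
}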
5. Throttling and cascading events.
Cascading events is the capability for a lower-level event, or nested events, to trigger a higher-level event. For example, if an application stalls due to lock contention, it might be interesting to record an HTTP request event that provides the URL at the time the contention occurred. Emitting an HTTP request event for every request may flood the buffers with data that is not interesting; if there were a way for the lower-level event to trigger the higher-level event, that flooding could be avoided. How cascading should interact with thresholds and throttling is not entirely clear, but the most understandable approach from a user's perspective is that shouldCommit() always returns true unless the event is disabled. Anything else would be hard to explain or implement. In the case of throttling, the event would be written even if the rate limit is exceeded. This is similar to the "record everything" scenario, where throttling and thresholds are also ignored. Cascading may target all events or a subset of them.
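The rule described above can be summarized by the following sketch; the class, method names and parameters are illustrative only and do not correspond to an existing API.

final class CommitDecisionSketch {

    // Normal path: the event must be enabled, pass the threshold, and stay
    // within the rate limit (throttling applied last, see 1).
    static boolean shouldCommitNormal(boolean enabled, boolean aboveThreshold, boolean withinRateLimit) {
        return enabled && aboveThreshold && withinRateLimit;
    }

    // Event targeted by cascading (or by "record everything"): threshold and
    // rate limit are ignored; only the enabled state decides.
    static boolean shouldCommitCascadingTarget(boolean enabled) {
        return enabled;
    }
}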
CSR for: JDK-8354232 JFR: Rate-limited sampling of Java events (Draft)