From caeb66bbffcc56c7c3cc4f62d3ed62686ba158c8 Mon Sep 17 00:00:00 2001
From: Doug Lea
Date: Mon, 9 Mar 2026 17:11:22 -0400
Subject: [PATCH] More merge fixes

---
 .../java/util/concurrent/ForkJoinPool.java | 169 ++++++++----------
 1 file changed, 72 insertions(+), 97 deletions(-)

diff --git a/src/java.base/share/classes/java/util/concurrent/ForkJoinPool.java b/src/java.base/share/classes/java/util/concurrent/ForkJoinPool.java
index db5c304f272..1cac9c4c2f3 100644
--- a/src/java.base/share/classes/java/util/concurrent/ForkJoinPool.java
+++ b/src/java.base/share/classes/java/util/concurrent/ForkJoinPool.java
@@ -561,89 +561,70 @@ public class ForkJoinPool extends AbstractExecutorService
  * access (which is usually needed anyway).
  *
  * Signalling. Signals (in signalWork) cause new or reactivated
- * workers to scan for tasks. Method signalWork and its callers
- * try to approximate the unattainable goal of having the right
- * number of workers activated for the tasks at hand, but must err
- * on the side of too many workers vs too few to avoid stalls:
+ * workers to scan for tasks. SignalWork is invoked in two cases:
+ * (1) When a task is pushed onto an empty queue, and (2) When a
+ * worker takes a top-level task from a queue that has additional
+ * tasks. Together, these suffice in O(log(#threads)) steps to
+ * fully activate with at least enough workers, and ideally no
+ * more than required. This ideal is unobtainable: Callers do not
+ * know whether another worker will finish its current task and
+ * poll for others without need of a signal (which is otherwise an
+ * advantage of work-stealing vs other schemes), and also must
+ * conservatively estimate the triggering conditions of emptiness
+ * or non-emptiness; all of which usually cause more activations
+ * than necessary (see below). (Method signalWork is also used as
+ * a failsafe in case of Thread failures in deregisterWorker, to
+ * activate or create a new worker to replace them.)
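[Editorial note: as a reader's aid, here is a minimal sketch of the two
signalling triggers described in the rewritten paragraph above, reduced
to a single lock-based queue. SimpleQueue and its methods are
hypothetical stand-ins for illustration only; ForkJoinPool's real
queues are nonblocking.]

    import java.util.ArrayDeque;
    import java.util.Queue;

    final class SignalSketch {
        static final class SimpleQueue<T> {
            private final Queue<T> q = new ArrayDeque<>();

            // Case (1): a push onto an empty queue must signal, since
            // no already-active worker is guaranteed to find the task.
            synchronized boolean push(T task) {
                boolean wasEmpty = q.isEmpty();
                q.add(task);
                return wasEmpty; // caller signals iff this is true
            }

            // Case (2): taking a task while others remain also
            // signals, so each woken worker can wake another, fully
            // activating the pool in O(log(#threads)) rounds.
            synchronized T poll(boolean[] moreRemain) {
                T t = q.poll();
                moreRemain[0] = (t != null && !q.isEmpty());
                return t;
            }
        }

        public static void main(String[] args) {
            SimpleQueue<Runnable> q = new SimpleQueue<>();
            if (q.push(() -> {}))
                System.out.println("case 1: push to empty queue; signal");
            q.push(() -> {});
            boolean[] more = new boolean[1];
            if (q.poll(more) != null && more[0])
                System.out.println("case 2: took task, more remain; signal");
        }
    }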
  *
- * * If computations are purely tree structured, it suffices for
- * every worker to activate another when it pushes a task into
- * an empty queue, resulting in O(log(#threads)) steps to full
- * activation. Emptiness must be conservatively approximated,
- * which may result in unnecessary signals. Also, to reduce
- * resource usages in some cases, at the expense of slower
- * startup in others, activation of an idle thread is preferred
- * over creating a new one, here and elsewhere.
- *
- * * At the other extreme, if "flat" tasks (those that do not in
- * turn generate others) come in serially from only a single
- * producer, each worker taking a task from a queue should
- * propagate a signal if there are more tasks in that
- * queue. This is equivalent to, but generally faster than,
- * arranging the stealer take multiple tasks, re-pushing one or
- * more on its own queue, and signalling (because its queue is
- * empty), also resulting in logarithmic full activation
- * time. If tasks do not not engage in unbounded loops based on
- * the actions of other workers with unknown dependencies loop,
- * this form of proagation can be limited to one signal per
- * activation (phase change). We distinguish the cases by
- * further signalling only if the task is an InterruptibleTask
- * (see below), which are the only supported forms of task that
- * may do so.
- *
- * * Because we don't know about usage patterns (or most commonly,
- * mixtures), we use both approaches, which present even more
- * opportunities to over-signal. (Failure to distinguish these
- * cases in terms of submission methods was arguably an early
- * design mistake.) Note that in either of these contexts,
- * signals may be (and often are) unnecessary because active
- * workers continue scanning after running tasks without the
- * need to be signalled (which is one reason work stealing is
- * often faster than alternatives), so additional workers
- * aren't needed.
- *
- * * For rapidly branching tasks that require full pool resources,
- * oversignalling is OK, because signalWork will soon have no
- * more workers to create or reactivate. But for others (mainly
- * externally submitted tasks), overprovisioning may cause very
- * noticeable slowdowns due to contention and resource
- * wastage. We reduce impact by deactivating workers when
- * queues don't have accessible tasks, but reactivating and
- * rescanning if other tasks remain.
- *
- * * Despite these, signal contention and overhead effects still
- * occur during ramp-up and ramp-down of small computations.
+ * Top-Level Scheduling
+ * ====================
  *
  * Scanning. Method runWorker performs top-level scanning for (and
  * execution of) tasks by polling a pseudo-random permutation of
  * the array (by starting at a given index, and using a constant
  * cyclically exhaustive stride.) It uses the same basic polling
  * method as WorkQueue.poll(), but restarts with a different
- * permutation on each invocation. The pseudorandom generator
- * need not have high-quality statistical properties in the long
+ * permutation on each rescan. The pseudorandom generator need
+ * not have high-quality statistical properties in the long
  * term. We use Marsaglia XorShifts, seeded with the Weyl sequence
- * from ThreadLocalRandom probes, which are cheap and
- * suffice. Each queue's polling attempts to avoid becoming stuck
- * when other scanners/pollers stall. Scans do not otherwise
- * explicitly take into account core affinities, loads, cache
- * localities, etc, However, they do exploit temporal locality
- * (which usually approximates these) by preferring to re-poll
- * from the same queue after a successful poll before trying
- * others, which also reduces bookkeeping, cache traffic, and
- * scanning overhead. But it also reduces fairness, which is
- * partially counteracted by giving up on detected interference
- * (which also reduces contention when too many workers try to
- * take small tasks from the same queue).
+ * from ThreadLocalRandom probes, which are cheap and suffice.
  *
  * Deactivation. When no tasks are found by a worker in runWorker,
- * it tries to deactivate()), giving up (and rescanning) on "ctl"
- * contention. To avoid missed signals during deactivation, the
- * method rescans and reactivates if there may have been a missed
- * signal during deactivation. To reduce false-alarm reactivations
- * while doing so, we scan multiple times (analogously to method
- * quiescent()) before trying to reactivate. Because idle workers
- * are often not yet blocked (parked), we use a WorkQueue field to
- * advertise that a waiter actually needs unparking upon signal.
+ * it invokes deactivate, which first deactivates (to an IDLE
+ * phase). Avoiding missed signals during deactivation requires a
+ * (conservative) rescan, reactivating if there may be tasks to
+ * poll. Because idle workers are often not yet blocked (parked),
+ * we use a WorkQueue field to advertise that a waiter actually
+ * needs unparking upon signal.
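[Editorial note: the Deactivation paragraph compresses a subtle
protocol. The sketch below shows the deactivate-then-rescan-then-park
shape it describes, under simplifying assumptions: one shared queue,
and single flags standing in for the IDLE phase bits and for the
advertised-waiter WorkQueue field. All names are hypothetical.]

    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.concurrent.locks.LockSupport;

    final class DeactivateSketch {
        final ConcurrentLinkedQueue<Runnable> tasks = new ConcurrentLinkedQueue<>();
        final AtomicBoolean idle = new AtomicBoolean();   // stands in for IDLE phase
        final AtomicBoolean parked = new AtomicBoolean(); // advertised waiter flag
        volatile Thread waiter;                           // the (single) idle worker

        // Worker side: deactivate first, then conservatively rescan;
        // park only if the recheck still finds nothing, so a task
        // enqueued concurrently with deactivation cannot be missed.
        Runnable awaitTask() throws InterruptedException {
            Runnable t;
            while ((t = tasks.poll()) == null) {
                idle.set(true);                   // 1. deactivate
                if ((t = tasks.poll()) != null) { // 2. rescan (may be false alarm)
                    idle.set(false);              //    reactivate and run
                    break;
                }
                waiter = Thread.currentThread();
                parked.set(true);                 // 3. advertise need for unpark
                LockSupport.park(this);           // 4. block until signalled
                if (Thread.interrupted())
                    throw new InterruptedException();
            }
            return t;
        }

        // Submitter side: enqueue, then reactivate an idle worker,
        // skipping the unpark when the worker never actually parked.
        void submit(Runnable task) {
            tasks.add(task);
            if (idle.compareAndSet(true, false)
                && parked.compareAndSet(true, false))
                LockSupport.unpark(waiter);
        }
    }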
+ *
+ * When tasks are constructed as (recursive) DAGs, top-level
+ * scanning is usually infrequent, and doesn't encounter most
+ * of the following problems addressed by runWorker and awaitWork:
+ *
+ * Locality. Polls are organized into "runs", continuing until
+ * empty or contended, while also minimizing interference by
+ * postponing bookkeeping to ends of runs. This may reduce
+ * fairness.
+ *
+ * Contention. When many workers try to poll few queues, they
+ * often collide, generating CAS failures and disrupting locality
+ * of workers already running their tasks. This also leads to
+ * stalls when tasks cannot be taken because other workers have
+ * not finished poll operations, which is detected by reading
+ * ahead in queue arrays. In both cases, workers restart scans in a
+ * way that approximates randomized backoff.
+ *
+ * Oversignalling. When many short top-level tasks are present in
+ * a small number of queues, the above signalling strategy may
+ * activate many more workers than needed, worsening locality and
+ * contention problems, while also generating more global
+ * contention (field ctl is CASed on every activation and
+ * deactivation). We filter out (both in runWorker and
+ * signalWork) attempted signals that are surely not needed
+ * because the signalled tasks are already taken.
+ *
+ * Shutdown and Quiescence
+ * =======================
  *
  * Quiescence. Workers scan looking for work, giving up when they
  * don't find any, without being sure that none are available.
@@ -893,9 +874,7 @@ public class ForkJoinPool extends AbstractExecutorService
  * shutdown, runners are interrupted so they can cancel. Since
  * external joining callers never run these tasks, they must await
  * cancellation by others, which can occur along several different
- * paths. The inability to rely on caller-runs may also require
- * extra signalling (resulting in scanning and contention) so is
- * done only conditionally in methods push and runworker.
+ * paths.
  *
  * Across these APIs, rules for reporting exceptions for tasks
  * with results accessed via join() differ from those via get(),
@@ -962,9 +941,13 @@ public class ForkJoinPool extends AbstractExecutorService
  * less-contended applications. To help arrange this, some
  * non-reference fields are declared as "long" even when ints or
  * shorts would suffice. For class WorkQueue, an
- * embedded @Contended region segregates fields most heavily
- * updated by owners from those most commonly read by stealers or
- * other management.
+ * embedded @Contended region isolates the very busy top index,
+ * along with status and bookkeeping fields written (mostly) by
+ * owners, which otherwise interfere with reading array and base
+ * fields. There are other variables commonly contributing to
+ * false-sharing-related performance issues (including fields of
+ * class Thread), but we can't do much about this except try to
+ * minimize access.
  *
  * Initial sizing and resizing of WorkQueue arrays is an even more
  * delicate tradeoff because the best strategy systematically
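[Editorial note: to make the @Contended discussion in the hunk above
concrete, here is an illustrative field layout using
jdk.internal.vm.annotation.Contended. This sketches the technique, not
WorkQueue's actual layout; compiling it requires
--add-exports java.base/jdk.internal.vm.annotation=ALL-UNNAMED, and the
JVM honors the annotation outside java.base only with
-XX:-RestrictContended.]

    import jdk.internal.vm.annotation.Contended;

    final class QueueLayoutSketch {
        // Busy, (mostly) owner-written fields, grouped and padded so
        // that stores to them do not invalidate the cache line(s)
        // holding the fields that stealers read.
        @Contended("owner")
        int top;             // next slot to push (hypothetical)
        @Contended("owner")
        int status;          // bookkeeping bits (hypothetical)

        // Read-mostly fields used by stealers; left outside the
        // padded group.
        volatile int base;   // next slot to poll
        Object[] array;      // the task slots
    }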
@@ -973,13 +956,11 @@ public class ForkJoinPool extends AbstractExecutorService
  * direct false-sharing and indirect cases due to GC bookkeeping
  * (cardmarks etc), and reduce the number of resizes, which are
  * not especially fast because they require atomic transfers.
- * Currently, arrays for workers are initialized to be just large
- * enough to avoid resizing in most tree-structured tasks, but
- * larger for external queues where both false-sharing problems
- * and the need for resizing are more common. (Maintenance note:
- * any changes in fields, queues, or their uses, or JVM layout
- * policies, must be accompanied by re-evaluation of these
- * placement and sizing decisions.)
+ * Currently, arrays are initialized to be just large enough to
+ * avoid resizing in most tree-structured tasks, but grow rapidly
+ * until large. (Maintenance note: any changes in fields, queues,
+ * or their uses, or JVM layout policies, must be accompanied by
+ * re-evaluation of these placement and sizing decisions.)
  *
  * Style notes
  * ===========
@@ -1062,17 +1043,11 @@ public class ForkJoinPool extends AbstractExecutorService
     static final int DEFAULT_COMMON_MAX_SPARES = 256;
 
     /**
-     * Initial capacity of work-stealing queue array for workers.
+     * Initial capacity of work-stealing queue array.
      * Must be a power of two, at least 2. See above.
      */
     static final int INITIAL_QUEUE_CAPACITY = 1 << 6;
 
-    /**
-     * Initial capacity of work-stealing queue array for external queues.
-     * Must be a power of two, at least 2. See above.
-     */
-    static final int INITIAL_EXTERNAL_QUEUE_CAPACITY = 1 << 9;
-
     // conversions among short, int, long
     static final int SMASK = 0xffff;       // (unsigned) short bits
     static final long LMASK = 0xffffffffL; // lower 32 bits of long
@@ -1243,11 +1218,11 @@ public class ForkJoinPool extends AbstractExecutorService
          */
         WorkQueue(ForkJoinWorkerThread owner, int id, int cfg,
                   boolean clearThreadLocals) {
-            array = new ForkJoinTask<?>[owner == null ?
-                                        INITIAL_EXTERNAL_QUEUE_CAPACITY :
-                                        INITIAL_QUEUE_CAPACITY];
-            this.owner = owner;
             this.config = (clearThreadLocals) ? cfg | CLEAR_TLS : cfg;
+            if ((this.owner = owner) == null) {
+                array = new ForkJoinTask<?>[INITIAL_QUEUE_CAPACITY];
+                phase = id | IDLE;
+            }
         }
 
         /**
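[Editorial note: a small self-contained illustration of why
INITIAL_QUEUE_CAPACITY, and every queue array length, must be a power
of two: slot indices are computed by masking unbounded counters with
length - 1. The ring logic below is a simplified stand-in, not the
ForkJoinPool implementation; only the constant's name and value mirror
the patch.]

    final class RingIndexSketch {
        static final int INITIAL_QUEUE_CAPACITY = 1 << 6; // 64, a power of two

        public static void main(String[] args) {
            Object[] array = new Object[INITIAL_QUEUE_CAPACITY];
            int mask = array.length - 1; // 0x3f; valid only for powers of two
            int base = 62, top = 62;     // counters may exceed array length
            for (int i = 0; i < 4; i++)  // push 4 items; slots wrap 62,63,0,1
                array[(top++) & mask] = "task" + i;
            while (base != top)          // drain in FIFO order from base
                System.out.println(array[(base++) & mask]);
        }
    }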