[PATCH 00/16] sched: Clean-ups and asymmetric cpu capacity support

Morten Rasmussen
Hi,

The scheduler currently does little to help performance on systems with
asymmetric compute capacities (read ARM big.LITTLE). This series improves the
situation with a few tweaks, mainly to the task wake-up path, so that compute
capacity is considered at wake-up and not just whether a cpu is idle. This
gives us consistent, and potentially higher, throughput in partially utilized
scenarios. SMP behaviour and performance should be unaffected.

Test 0:
        for i in `seq 1 10`; \
               do sysbench --test=cpu --max-time=3 --num-threads=1 run; \
               done \
        | awk '{if ($4=="events:") {print $5; sum +=$5; runs +=1}} \
               END {print "Average events: " sum/runs}'

Target: ARM TC2 (2xA15+3xA7)

        (Higher is better)
tip: Average events: 150.2
patch: Average events: 217.9

Test 1:
        perf stat --null --repeat 10 -- \
        perf bench sched messaging -g 50 -l 5000

Target: Intel IVB-EP (2*10*2)

tip: 4.831538935 seconds time elapsed ( +-  1.58% )
patch: 4.839951382 seconds time elapsed ( +-  1.01% )

Target: ARM TC2 A7-only (3xA7) (-l 1000)

tip: 61.406552538 seconds time elapsed ( +-  0.12% )
patch: 61.589263159 seconds time elapsed ( +-  0.22% )

Active migration of tasks away from small capacity cpus isn't addressed
in this set, although it is necessary for consistent throughput in other
scenarios on asymmetric cpu capacity systems.

Patch   1-4: Generic fixes and clean-ups.
Patch  5-13: Improve capacity awareness.
Patch 14-16: Arch features for arm to enable asymmetric capacity support.

Dietmar Eggemann (1):
  sched: Store maximum per-cpu capacity in root domain

Morten Rasmussen (15):
  sched: Fix power to capacity renaming in comment
  sched/fair: Consistent use of prev_cpu in wakeup path
  sched/fair: Disregard idle task wakee_flips in wake_wide
  sched/fair: Optimize find_idlest_cpu() when there is no choice
  sched: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag
  sched: Disable WAKE_AFFINE for asymmetric configurations
  sched: Make SD_BALANCE_WAKE a topology flag
  sched/fair: Let asymmetric cpu configurations balance at wake-up
  sched/fair: Compute task/cpu utilization at wake-up more correctly
  sched/fair: Consider spare capacity in find_idlest_group()
  sched: Add per-cpu max capacity to sched_group_capacity
  sched/fair: Avoid pulling tasks from non-overloaded higher capacity
    groups
  arm: Set SD_ASYM_CPUCAPACITY for big.LITTLE platforms
  arm: Set SD_BALANCE_WAKE flag for asymmetric capacity systems
  arm: Update arch_scale_cpu_capacity() to reflect change to define

 arch/arm/include/asm/topology.h |   5 +
 arch/arm/kernel/topology.c      |  25 ++++-
 include/linux/sched.h           |   3 +-
 kernel/sched/core.c             |  25 ++++-
 kernel/sched/fair.c             | 217 ++++++++++++++++++++++++++++++++++++----
 kernel/sched/sched.h            |   5 +-
 6 files changed, 250 insertions(+), 30 deletions(-)

--
1.9.1


[PATCH 06/16] sched: Disable WAKE_AFFINE for asymmetric configurations

Morten Rasmussen
If the system has cpus of different compute capacities (e.g. big.LITTLE),
constrain affine wakeups to cpus of the same type.

cc: Ingo Molnar <[hidden email]>
cc: Peter Zijlstra <[hidden email]>

Signed-off-by: Morten Rasmussen <[hidden email]>
---
 kernel/sched/core.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d9619a3..558ec4a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6410,6 +6410,9 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
  sd->idle_idx = 1;
  }
 
+ if (sd->flags & SD_ASYM_CPUCAPACITY)
+ sd->flags &= ~SD_WAKE_AFFINE;
+
  sd->private = &tl->data;
 
  return sd;
--
1.9.1


[PATCH 09/16] sched/fair: Let asymmetric cpu configurations balance at wake-up

Morten Rasmussen
Currently, SD_WAKE_AFFINE always takes priority over wakeup balancing if
SD_BALANCE_WAKE is set on the sched_domains. For asymmetric
configurations SD_WAKE_AFFINE is only desirable if the waking task's
compute demand (utilization) is suitable for the cpu capacities
available within the SD_WAKE_AFFINE sched_domain. If not, let wakeup
balancing take over (find_idlest_{group, cpu}()).

The assumption is that SD_WAKE_AFFINE is never set for a sched_domain
containing cpus with different capacities. This is enforced by a
previous patch based on the SD_ASYM_CPUCAPACITY flag.

Ideally, we shouldn't set 'want_affine' in the first place, but we don't
know if SD_BALANCE_WAKE is enabled on the sched_domain(s) until we start
traversing them.
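
To make the margin concrete, the check can be sketched in user space as
below (purely illustrative and not part of the patch; the TC2-like
capacity values are made up):

#include <stdio.h>

#define CAPACITY_MARGIN 1280    /* ~20%, as in the patch */

/* Stand-alone model of wake_cap(): non-zero means wake_affine is skipped */
static int wake_cap_sketch(long max_cap, long prev_cap, long task_util)
{
        long delta = max_cap - prev_cap;

        /* prev_cpu is fairly close to max, no need to abort wake_affine */
        if (delta < prev_cap >> 3)
                return 0;

        /* task does not fit prev_cpu with ~20% headroom left */
        return prev_cap * 1024 < task_util * CAPACITY_MARGIN;
}

int main(void)
{
        /* util 400 on a ~430-capacity little cpu: 400*1280 > 430*1024 -> balance */
        printf("%d\n", wake_cap_sketch(1024, 430, 400));        /* 1 */
        /* util 100 fits comfortably -> stay with wake_affine */
        printf("%d\n", wake_cap_sketch(1024, 430, 100));        /* 0 */
        return 0;
}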

cc: Ingo Molnar <[hidden email]>
cc: Peter Zijlstra <[hidden email]>

Signed-off-by: Morten Rasmussen <[hidden email]>
---
 kernel/sched/fair.c | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 564215d..ce44fa7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -114,6 +114,12 @@ unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
 unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
 #endif
 
+/*
+ * The margin used when comparing utilization with cpu capacity:
+ * util * 1024 < capacity * margin
+ */
+unsigned int capacity_margin = 1280; /* ~20% */
+
 static inline void update_load_add(struct load_weight *lw, unsigned long inc)
 {
  lw->weight += inc;
@@ -5293,6 +5299,25 @@ static int cpu_util(int cpu)
  return (util >= capacity) ? capacity : util;
 }
 
+static inline int task_util(struct task_struct *p)
+{
+ return p->se.avg.util_avg;
+}
+
+static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
+{
+ long delta;
+ long prev_cap = capacity_of(prev_cpu);
+
+ delta = cpu_rq(cpu)->rd->max_cpu_capacity - prev_cap;
+
+ /* prev_cpu is fairly close to max, no need to abort wake_affine */
+ if (delta < prev_cap >> 3)
+ return 0;
+
+ return prev_cap * 1024 < task_util(p) * capacity_margin;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -5316,7 +5341,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 
  if (sd_flag & SD_BALANCE_WAKE) {
  record_wakee(p);
- want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+ want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
+      && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
  }
 
  rcu_read_lock();
--
1.9.1


[PATCH 11/16] sched/fair: Consider spare capacity in find_idlest_group()

Morten Rasmussen
In low-utilization scenarios comparing relative loads in
find_idlest_group() doesn't always lead to the best choice. Systems with
groups containing different numbers of cpus and/or cpus of different
compute capacity are significantly better off when considering spare
capacity rather than relative load in those scenarios.

In addition to the existing load-based search, an alternative candidate
sched_group is found based on spare capacity and selected instead if
sufficient spare capacity exists. Otherwise, the existing behaviour is
preserved.
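
As a rough illustration of the new selection rule at the end of
find_idlest_group() (user-space sketch only, all numbers are made up):

#include <stdio.h>

/* imbalance is 100 + (imbalance_pct - 100) / 2, e.g. 112 for the default
 * imbalance_pct of 125 */
static const char *pick(long this_spare, long most_spare, long task_util,
                        long imbalance)
{
        /* The local group has a significant amount of spare capacity */
        if (this_spare > task_util / 2 &&
            imbalance * this_spare > 100 * most_spare)
                return "stay local (return NULL)";

        /* Another group has enough spare capacity for the waking task */
        if (most_spare > task_util / 2)
                return "most_spare_sg";

        /* Otherwise fall back to the existing load-based choice */
        return "load-based idlest";
}

int main(void)
{
        printf("%s\n", pick(600, 200, 300, 112));       /* stay local */
        printf("%s\n", pick( 50, 700, 300, 112));       /* most_spare_sg */
        printf("%s\n", pick( 50,  60, 300, 112));       /* load-based idlest */
        return 0;
}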

cc: Ingo Molnar <[hidden email]>
cc: Peter Zijlstra <[hidden email]>

Signed-off-by: Morten Rasmussen <[hidden email]>
---
 kernel/sched/fair.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 41 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6d3369a..d4c5339 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5090,6 +5090,14 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
  return 1;
 }
 
+static inline int task_util(struct task_struct *p);
+static int cpu_util_wake(int cpu, struct task_struct *p);
+
+static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
+{
+ return capacity_orig_of(cpu) - cpu_util_wake(cpu, p);
+}
+
 /*
  * find_idlest_group finds and returns the least busy CPU group within the
  * domain.
@@ -5099,7 +5107,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
   int this_cpu, int sd_flag)
 {
  struct sched_group *idlest = NULL, *group = sd->groups;
+ struct sched_group *most_spare_sg = NULL;
  unsigned long min_load = ULONG_MAX, this_load = 0;
+ unsigned long most_spare = 0, this_spare = 0;
  int load_idx = sd->forkexec_idx;
  int imbalance = 100 + (sd->imbalance_pct-100)/2;
 
@@ -5107,7 +5117,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
  load_idx = sd->wake_idx;
 
  do {
- unsigned long load, avg_load;
+ unsigned long load, avg_load, spare_cap, max_spare_cap;
  int local_group;
  int i;
 
@@ -5119,8 +5129,12 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
  local_group = cpumask_test_cpu(this_cpu,
        sched_group_cpus(group));
 
- /* Tally up the load of all CPUs in the group */
+ /*
+ * Tally up the load of all CPUs in the group and find
+ * the group containing the cpu with most spare capacity.
+ */
  avg_load = 0;
+ max_spare_cap = 0;
 
  for_each_cpu(i, sched_group_cpus(group)) {
  /* Bias balancing toward cpus of our domain */
@@ -5130,6 +5144,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
  load = target_load(i, load_idx);
 
  avg_load += load;
+
+ spare_cap = capacity_spare_wake(i, p);
+
+ if (spare_cap > max_spare_cap &&
+    spare_cap > capacity_of(i) >> 3) {
+ max_spare_cap = spare_cap;
+ }
  }
 
  /* Adjust by relative CPU capacity of the group */
@@ -5137,12 +5158,27 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 
  if (local_group) {
  this_load = avg_load;
- } else if (avg_load < min_load) {
- min_load = avg_load;
- idlest = group;
+ this_spare = max_spare_cap;
+ } else {
+ if (avg_load < min_load) {
+ min_load = avg_load;
+ idlest = group;
+ }
+
+ if (most_spare < max_spare_cap) {
+ most_spare = max_spare_cap;
+ most_spare_sg = group;
+ }
  }
  } while (group = group->next, group != sd->groups);
 
+ /* Found a significant amount of spare capacity. */
+ if (this_spare > task_util(p) / 2 &&
+    imbalance*this_spare > 100*most_spare)
+ return NULL;
+ else if (most_spare > task_util(p) / 2)
+ return most_spare_sg;
+
  if (!idlest || 100*this_load < imbalance*min_load)
  return NULL;
  return idlest;
--
1.9.1


[PATCH 12/16] sched: Add per-cpu max capacity to sched_group_capacity

Morten Rasmussen
struct sched_group_capacity currently represents the compute capacity
sum of all cpus in the sched_group. Unless it is divided by the
group_weight to get the average capacity per cpu, it hides differences in
cpu capacity for mixed capacity systems (e.g. high RT/IRQ utilization or
ARM big.LITTLE). But even the average may not be sufficient if the group
covers cpus of different capacities. Instead, extend struct
sched_group_capacity to indicate the max per-cpu capacity in the group.
This allows a suitable group for a given task utilization to be found
easily, such that cpus with reduced capacity can be avoided for tasks
with high utilization (not implemented by this patch).
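
For example, at the DIE level on a TC2-like system the aggregation works
out as in this user-space sketch (illustrative capacity numbers, not
kernel code):

#include <stdio.h>

struct sgc { unsigned long capacity, max_capacity; };

/* Aggregate child groups the way update_group_capacity() does:
 * capacity is the sum, max_capacity the max over the children. */
static struct sgc aggregate(const struct sgc *child, int nr)
{
        struct sgc out = { 0, 0 };
        int i;

        for (i = 0; i < nr; i++) {
                out.capacity += child[i].capacity;
                if (child[i].max_capacity > out.max_capacity)
                        out.max_capacity = child[i].max_capacity;
        }
        return out;
}

int main(void)
{
        /* one 2xA15 group (2x1024) and one 3xA7 group (3x~430) */
        struct sgc children[] = { { 2048, 1024 }, { 1290, 430 } };
        struct sgc die = aggregate(children, 2);

        printf("capacity=%lu max_capacity=%lu\n", die.capacity, die.max_capacity);
        return 0;
}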

cc: Ingo Molnar <[hidden email]>
cc: Peter Zijlstra <[hidden email]>

Signed-off-by: Morten Rasmussen <[hidden email]>
---
 kernel/sched/core.c  |  3 ++-
 kernel/sched/fair.c  | 17 ++++++++++++-----
 kernel/sched/sched.h |  3 ++-
 3 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1d4059c..4dd5cd7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5599,7 +5599,7 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
  printk(KERN_CONT " %*pbl",
        cpumask_pr_args(sched_group_cpus(group)));
  if (group->sgc->capacity != SCHED_CAPACITY_SCALE) {
- printk(KERN_CONT " (cpu_capacity = %d)",
+ printk(KERN_CONT " (cpu_capacity = %lu)",
  group->sgc->capacity);
  }
 
@@ -6070,6 +6070,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
  * die on a /0 trap.
  */
  sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
+ sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;
 
  /*
  * Make sure the first group of this domain contains the
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d4c5339..0dd2d9c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6658,13 +6658,14 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 
  cpu_rq(cpu)->cpu_capacity = capacity;
  sdg->sgc->capacity = capacity;
+ sdg->sgc->max_capacity = capacity;
 }
 
 void update_group_capacity(struct sched_domain *sd, int cpu)
 {
  struct sched_domain *child = sd->child;
  struct sched_group *group, *sdg = sd->groups;
- unsigned long capacity;
+ unsigned long capacity, max_capacity;
  unsigned long interval;
 
  interval = msecs_to_jiffies(sd->balance_interval);
@@ -6677,6 +6678,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
  }
 
  capacity = 0;
+ max_capacity = 0;
 
  if (child->flags & SD_OVERLAP) {
  /*
@@ -6701,11 +6703,12 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
  */
  if (unlikely(!rq->sd)) {
  capacity += capacity_of(cpu);
- continue;
+ } else {
+ sgc = rq->sd->groups->sgc;
+ capacity += sgc->capacity;
  }
 
- sgc = rq->sd->groups->sgc;
- capacity += sgc->capacity;
+ max_capacity = max(capacity, max_capacity);
  }
  } else  {
  /*
@@ -6715,12 +6718,16 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 
  group = child->groups;
  do {
- capacity += group->sgc->capacity;
+ struct sched_group_capacity *sgc = group->sgc;
+
+ capacity += sgc->capacity;
+ max_capacity = max(sgc->max_capacity, max_capacity);
  group = group->next;
  } while (group != child->groups);
  }
 
  sdg->sgc->capacity = capacity;
+ sdg->sgc->max_capacity = max_capacity;
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 72150c2..359f689 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -868,7 +868,8 @@ struct sched_group_capacity {
  * CPU capacity of this group, SCHED_CAPACITY_SCALE being max capacity
  * for a single CPU.
  */
- unsigned int capacity;
+ unsigned long capacity;
+ unsigned long max_capacity; /* Max per-cpu capacity in group */
  unsigned long next_update;
  int imbalance; /* XXX unrelated to capacity but shared group state */
  /*
--
1.9.1


[PATCH 15/16] arm: Set SD_BALANCE_WAKE flag for asymmetric capacity systems

Morten Rasmussen
Asymmetric capacity systems (e.g. ARM big.LITTLE) cannot rely
exclusively on fast idle-based task placement at wake-up in all
scenarios. Enable SD_BALANCE_WAKE to have the option of doing
load/utilization based wake-up task placement (existing tasks) if affine
wake-up fails.

cc: Russell King <[hidden email]>

Signed-off-by: Morten Rasmussen <[hidden email]>
---
 arch/arm/kernel/topology.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 1b81b31..32d0a1b 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -280,18 +280,27 @@ void store_cpu_topology(unsigned int cpuid)
 
 static inline int cpu_corepower_flags(void)
 {
- return SD_SHARE_PKG_RESOURCES  | SD_SHARE_POWERDOMAIN;
+ int flags = SD_SHARE_PKG_RESOURCES  | SD_SHARE_POWERDOMAIN;
+
+ return big_little ? flags | SD_BALANCE_WAKE : flags;
+}
+
+static inline int arm_cpu_core_flags(void)
+{
+ int flags = SD_SHARE_PKG_RESOURCES;
+
+ return big_little ? flags | SD_BALANCE_WAKE : flags;
 }
 
 static inline int arm_cpu_cpu_flags(void)
 {
- return big_little ? SD_ASYM_CPUCAPACITY : 0;
+ return big_little ? SD_ASYM_CPUCAPACITY | SD_BALANCE_WAKE : 0;
 }
 
 static struct sched_domain_topology_level arm_topology[] = {
 #ifdef CONFIG_SCHED_MC
  { cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
- { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
+ { cpu_coregroup_mask, arm_cpu_core_flags, SD_INIT_NAME(MC) },
 #endif
  { cpu_cpu_mask, arm_cpu_cpu_flags, SD_INIT_NAME(DIE) },
  { NULL, },
--
1.9.1


[PATCH 16/16] arm: Update arch_scale_cpu_capacity() to reflect change to define

Morten Rasmussen
arch_scale_cpu_capacity() is no longer a weak function but a #define
instead. Include the #define in topology.h.

cc: Russell King <[hidden email]>

Signed-off-by: Morten Rasmussen <[hidden email]>
---
 arch/arm/include/asm/topology.h | 5 +++++
 arch/arm/kernel/topology.c      | 2 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
index 370f7a7..a09a105 100644
--- a/arch/arm/include/asm/topology.h
+++ b/arch/arm/include/asm/topology.h
@@ -24,6 +24,11 @@ void init_cpu_topology(void);
 void store_cpu_topology(unsigned int cpuid);
 const struct cpumask *cpu_coregroup_mask(int cpu);
 
+#define arch_scale_cpu_capacity arm_arch_scale_cpu_capacity
+struct sched_domain;
+extern
+unsigned long arm_arch_scale_cpu_capacity(struct sched_domain *sd, int cpu);
+
 #else
 
 static inline void init_cpu_topology(void) { }
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 32d0a1b..dc5f8aa 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -42,7 +42,7 @@
  */
 static DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE;
 
-unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
+unsigned long arm_arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
 {
  return per_cpu(cpu_scale, cpu);
 }
--
1.9.1


[PATCH 14/16] arm: Set SD_ASYM_CPUCAPACITY for big.LITTLE platforms

Morten Rasmussen
Set the SD_ASYM_CPUCAPACITY flag for the DIE level sched_domain on
big.LITTLE systems.

cc: Russell King <[hidden email]>

Signed-off-by: Morten Rasmussen <[hidden email]>
---
 arch/arm/kernel/topology.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index ec279d1..1b81b31 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -78,6 +78,7 @@ static unsigned long *__cpu_capacity;
 #define cpu_capacity(cpu) __cpu_capacity[cpu]
 
 static unsigned long middle_capacity = 1;
+static unsigned int big_little;
 
 /*
  * Iterate all CPUs' descriptor in DT and compute the efficiency
@@ -151,6 +152,8 @@ static void __init parse_dt_topology(void)
  middle_capacity = ((max_capacity / 3)
  >> (SCHED_CAPACITY_SHIFT-1)) + 1;
 
+ if (max_capacity && max_capacity != min_capacity)
+ big_little = 1;
 }
 
 /*
@@ -280,12 +283,17 @@ static inline int cpu_corepower_flags(void)
  return SD_SHARE_PKG_RESOURCES  | SD_SHARE_POWERDOMAIN;
 }
 
+static inline int arm_cpu_cpu_flags(void)
+{
+ return big_little ? SD_ASYM_CPUCAPACITY : 0;
+}
+
 static struct sched_domain_topology_level arm_topology[] = {
 #ifdef CONFIG_SCHED_MC
  { cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
  { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
 #endif
- { cpu_cpu_mask, SD_INIT_NAME(DIE) },
+ { cpu_cpu_mask, arm_cpu_cpu_flags, SD_INIT_NAME(DIE) },
  { NULL, },
 };
 
--
1.9.1


[PATCH 13/16] sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups

Morten Rasmussen
For asymmetric cpu capacity systems it is counter-productive for
throughput if low capacity cpus are pulling tasks from non-overloaded
cpus with higher capacity. The assumption is that higher cpu capacity is
preferred over running alone in a group with lower cpu capacity.

This patch rejects higher cpu capacity groups with one or fewer tasks per
cpu as the potential busiest group; picking such a group could otherwise
lead to a series of failing load-balancing attempts and eventually a
force-migration.
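
The rule can be sketched in user space as follows (illustrative only,
with made-up TC2-like capacities; capacity_margin is the ~20% margin
introduced earlier in the series):

#include <stdio.h>
#include <stdbool.h>

#define CAPACITY_MARGIN 1280

/* true if the local group has clearly smaller per-cpu capacity than sg */
static bool local_is_smaller(unsigned long local_max, unsigned long sg_max)
{
        return local_max * CAPACITY_MARGIN < sg_max * 1024;
}

/* Candidate sg is skipped as busiest if it has at most one task per cpu
 * and clearly higher per-cpu capacity than the local group. */
static bool reject_as_busiest(unsigned int sum_nr_running, unsigned int weight,
                              unsigned long local_max, unsigned long sg_max)
{
        return sum_nr_running <= weight && local_is_smaller(local_max, sg_max);
}

int main(void)
{
        /* 2 tasks on a 2-cpu big group seen from a little group: rejected */
        printf("%d\n", reject_as_busiest(2, 2, 430, 1024));     /* 1 */
        /* 4 tasks on the same group: overloaded, may be picked as busiest */
        printf("%d\n", reject_as_busiest(4, 2, 430, 1024));     /* 0 */
        return 0;
}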

cc: Ingo Molnar <[hidden email]>
cc: Peter Zijlstra <[hidden email]>

Signed-off-by: Morten Rasmussen <[hidden email]>
---
 kernel/sched/fair.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0dd2d9c..5213bb9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6822,6 +6822,17 @@ group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
  return false;
 }
 
+/*
+ * group_smaller_cpu_capacity: Returns true if sched_group sg has smaller
+ * per-cpu capacity than sched_group ref.
+ */
+static inline bool
+group_smaller_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
+{
+ return sg->sgc->max_capacity * capacity_margin <
+ ref->sgc->max_capacity * 1024;
+}
+
 static inline enum
 group_type group_classify(struct sched_group *group,
   struct sg_lb_stats *sgs)
@@ -6925,6 +6936,19 @@ static bool update_sd_pick_busiest(struct lb_env *env,
  if (sgs->avg_load <= busiest->avg_load)
  return false;
 
+ if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
+ goto asym_packing;
+
+ /* Candidate sg has no more than one task per cpu and has
+ * higher per-cpu capacity. Migrating tasks to less capable
+ * cpus may harm throughput. Maximize throughput,
+ * power/energy consequences are not considered.
+ */
+ if (sgs->sum_nr_running <= sgs->group_weight &&
+    group_smaller_cpu_capacity(sds->local, sg))
+ return false;
+
+asym_packing:
  /* This is the busiest node in its class. */
  if (!(env->sd->flags & SD_ASYM_PACKING))
  return true;
--
1.9.1


[PATCH 10/16] sched/fair: Compute task/cpu utilization at wake-up more correctly

Morten Rasmussen
At task wake-up, load-tracking isn't updated until the task is enqueued.
The task's own view of its utilization contribution may therefore not be
aligned with its contribution to the cfs_rq load-tracking, which may have
been updated in the meantime. Basically, the task's own utilization
hasn't yet accounted for the sleep decay, while the cfs_rq may have
(partially). Estimating the cfs_rq utilization in case the task is
migrated at wake-up as task_rq(p)->cfs.avg.util_avg - p->se.avg.util_avg
is therefore incorrect, as the two load-tracking signals aren't time
synchronized (different last update).

To solve this problem, this patch introduces task_util_wake(), which
computes the decayed task utilization based on the previous cpu's last
load-tracking update. It is done without having to take the rq lock,
similar to how it is done in remove_entity_load_avg().
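
A rough user-space model of the estimate (not part of the patch): the
kernel decays util_sum with a fixed-point table for y^n where y^32 = 0.5;
plain pow() stands in here only to show the shape of the computation.

/* build: cc decay.c -lm */
#include <stdio.h>
#include <math.h>

#define LOAD_AVG_MAX 47742      /* as in the current PELT implementation */

static unsigned long task_util_wake_model(unsigned long util_sum,
                                          unsigned long util_avg,
                                          unsigned long long ns_behind)
{
        unsigned long long periods = ns_behind >> 20;   /* ~1ms periods */
        unsigned long decayed;

        if (!periods)
                return util_avg;

        decayed = (unsigned long)(util_sum * pow(0.5, periods / 32.0))
                        / LOAD_AVG_MAX;

        /* util_avg may be slightly behind util_sum; never exceed it */
        return decayed < util_avg ? decayed : util_avg;
}

int main(void)
{
        /* a task with util_avg ~300 that has been asleep for ~32ms: halved */
        printf("%lu\n", task_util_wake_model(300UL * LOAD_AVG_MAX, 300,
                                             32ULL << 20));
        return 0;
}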

cc: Ingo Molnar <[hidden email]>
cc: Peter Zijlstra <[hidden email]>

Signed-off-by: Morten Rasmussen <[hidden email]>
---
 kernel/sched/fair.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 69 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ce44fa7..6d3369a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5304,6 +5304,75 @@ static inline int task_util(struct task_struct *p)
  return p->se.avg.util_avg;
 }
 
+/*
+ * task_util_wake: Returns an updated estimate of the utilization contribution
+ * of a waking task. At wake-up the task blocked utilization contribution
+ * (cfs_rq->avg) may have decayed while the utilization tracking of the task
+ * (se->avg) hasn't yet.
+ * Note that this estimate isn't perfectly accurate as the 1ms boundaries used
+ * for updating util_avg in __update_load_avg() are not considered here. This
+ * results in an error of up to 1ms utilization decay/accumulation which leads
+ * to an absolute util_avg error margin of 1024*1024/LOAD_AVG_MAX ~= 22
+ * (for LOAD_AVG_MAX = 47742).
+ */
+static inline int task_util_wake(struct task_struct *p)
+{
+ struct cfs_rq *prev_cfs_rq = &task_rq(p)->cfs;
+ struct sched_avg *psa = &p->se.avg;
+ u64 cfs_rq_last_update, p_last_update, delta;
+ u32 util_decayed;
+
+ p_last_update = psa->last_update_time;
+
+ /*
+ * Task on rq (exec()) should be load-tracking aligned already.
+ * New tasks have no history and should use the init value.
+ */
+ if (p->se.on_rq || !p_last_update)
+ return task_util(p);
+
+ cfs_rq_last_update = cfs_rq_last_update_time(prev_cfs_rq);
+ delta = cfs_rq_last_update - p_last_update;
+
+ if ((s64)delta <= 0)
+ return task_util(p);
+
+ delta >>= 20;
+
+ if (!delta)
+ return task_util(p);
+
+ util_decayed = decay_load((u64)psa->util_sum, delta);
+ util_decayed /= LOAD_AVG_MAX;
+
+ /*
+ * psa->util_avg can be slightly out of date as it is only updated
+ * when a 1ms boundary is crossed.
+ * See 'decayed' in __update_load_avg()
+ */
+ util_decayed = min_t(unsigned long, util_decayed, task_util(p));
+
+ return util_decayed;
+}
+
+/*
+ * cpu_util_wake: Compute cpu utilization with any contributions from
+ * the waking task p removed.
+ */
+static int cpu_util_wake(int cpu, struct task_struct *p)
+{
+ unsigned long util, capacity;
+
+ /* Task has no contribution or is new */
+ if (cpu != task_cpu(p) || !p->se.avg.last_update_time)
+ return cpu_util(cpu);
+
+ capacity = capacity_orig_of(cpu);
+ util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util_wake(p), 0);
+
+ return (util >= capacity) ? capacity : util;
+}
+
 static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
 {
  long delta;
--
1.9.1


[PATCH 05/16] sched: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag

Morten Rasmussen
Add a topology flag to the sched_domain hierarchy indicating which
domains span cpus of different capacity (e.g. big.LITTLE). This
information is currently only available by iterating over the
capacities of all the cpus.

cc: Ingo Molnar <[hidden email]>
cc: Peter Zijlstra <[hidden email]>

Signed-off-by: Morten Rasmussen <[hidden email]>
---
 include/linux/sched.h | 1 +
 kernel/sched/core.c   | 6 +++++-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 463b91f..37752db 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1015,6 +1015,7 @@ extern void wake_up_q(struct wake_q_head *head);
 #define SD_BALANCE_FORK 0x0008 /* Balance on fork, clone */
 #define SD_BALANCE_WAKE 0x0010  /* Balance on wakeup */
 #define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */
+#define SD_ASYM_CPUCAPACITY 0x0040  /* Domain contains cpus of different capacity*/
 #define SD_SHARE_CPUCAPACITY 0x0080 /* Domain members share cpu capacity */
 #define SD_SHARE_POWERDOMAIN 0x0100 /* Domain members share power domain */
 #define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 404c078..d9619a3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5659,6 +5659,7 @@ static int sd_degenerate(struct sched_domain *sd)
  SD_BALANCE_FORK |
  SD_BALANCE_EXEC |
  SD_SHARE_CPUCAPACITY |
+ SD_ASYM_CPUCAPACITY |
  SD_SHARE_PKG_RESOURCES |
  SD_SHARE_POWERDOMAIN)) {
  if (sd->groups != sd->groups->next)
@@ -5689,6 +5690,7 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
  SD_BALANCE_NEWIDLE |
  SD_BALANCE_FORK |
  SD_BALANCE_EXEC |
+ SD_ASYM_CPUCAPACITY |
  SD_SHARE_CPUCAPACITY |
  SD_SHARE_PKG_RESOURCES |
  SD_PREFER_SIBLING |
@@ -6303,14 +6305,16 @@ static int sched_domains_curr_level;
  * SD_NUMA                - describes NUMA topologies
  * SD_SHARE_POWERDOMAIN   - describes shared power domain
  *
- * Odd one out:
+ * Odd ones out:
  * SD_ASYM_PACKING        - describes SMT quirks
+ * SD_ASYM_CPUCAPACITY    - describes mixed capacity topologies
  */
 #define TOPOLOGY_SD_FLAGS \
  (SD_SHARE_CPUCAPACITY | \
  SD_SHARE_PKG_RESOURCES | \
  SD_NUMA | \
  SD_ASYM_PACKING | \
+ SD_ASYM_CPUCAPACITY | \
  SD_SHARE_POWERDOMAIN)
 
 static struct sched_domain *
--
1.9.1


[PATCH 08/16] sched: Store maximum per-cpu capacity in root domain

Morten Rasmussen
From: Dietmar Eggemann <[hidden email]>

To be able to compare the capacity of the target cpu with the highest
available cpu capacity, store the maximum per-cpu capacity in the root
domain.

cc: Ingo Molnar <[hidden email]>
cc: Peter Zijlstra <[hidden email]>

Signed-off-by: Dietmar Eggemann <[hidden email]>
Signed-off-by: Morten Rasmussen <[hidden email]>
---
 kernel/sched/core.c  | 9 +++++++++
 kernel/sched/sched.h | 2 ++
 2 files changed, 11 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8014b4a..1d4059c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6841,6 +6841,7 @@ static int build_sched_domains(const struct cpumask *cpu_map,
  enum s_alloc alloc_state;
  struct sched_domain *sd;
  struct s_data d;
+ struct rq *rq = NULL;
  int i, ret = -ENOMEM;
 
  alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
@@ -6891,11 +6892,19 @@ static int build_sched_domains(const struct cpumask *cpu_map,
  /* Attach the domains */
  rcu_read_lock();
  for_each_cpu(i, cpu_map) {
+ rq = cpu_rq(i);
  sd = *per_cpu_ptr(d.sd, i);
  cpu_attach_domain(sd, d.rd, i);
+
+ if (rq->cpu_capacity_orig > rq->rd->max_cpu_capacity)
+ rq->rd->max_cpu_capacity = rq->cpu_capacity_orig;
  }
  rcu_read_unlock();
 
+ if (rq)
+ pr_info("span: %*pbl (max cpu_capacity = %lu)\n",
+ cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
+
  ret = 0;
 error:
  __free_domain_allocs(&d, alloc_state, cpu_map);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e51145e..72150c2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -564,6 +564,8 @@ struct root_domain {
  */
  cpumask_var_t rto_mask;
  struct cpupri cpupri;
+
+ unsigned long max_cpu_capacity;
 };
 
 extern struct root_domain def_root_domain;
--
1.9.1


[PATCH 07/16] sched: Make SD_BALANCE_WAKE a topology flag

Morten Rasmussen
For systems with the SD_ASYM_CPUCAPACITY flag set at a higher level in
the sched_domain hierarchy we need a way to enable wake-up balancing for
the lower levels as well, since we may want to balance tasks that don't
fit the capacity of the previous cpu.

We have the option of introducing a new topology flag to express this
requirement, or letting the existing SD_BALANCE_WAKE flag be set by the
architecture as a topology flag. The former means introducing yet
another flag, the latter breaks the current meaning of topology flags.
Neither option is really desirable.

cc: Ingo Molnar <[hidden email]>
cc: Peter Zijlstra <[hidden email]>

Signed-off-by: Morten Rasmussen <[hidden email]>
---
 kernel/sched/core.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 558ec4a..8014b4a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5658,6 +5658,7 @@ static int sd_degenerate(struct sched_domain *sd)
  SD_BALANCE_NEWIDLE |
  SD_BALANCE_FORK |
  SD_BALANCE_EXEC |
+ SD_BALANCE_WAKE |
  SD_SHARE_CPUCAPACITY |
  SD_ASYM_CPUCAPACITY |
  SD_SHARE_PKG_RESOURCES |
@@ -5690,6 +5691,7 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
  SD_BALANCE_NEWIDLE |
  SD_BALANCE_FORK |
  SD_BALANCE_EXEC |
+ SD_BALANCE_WAKE |
  SD_ASYM_CPUCAPACITY |
  SD_SHARE_CPUCAPACITY |
  SD_SHARE_PKG_RESOURCES |
@@ -6308,6 +6310,7 @@ static int sched_domains_curr_level;
  * Odd ones out:
  * SD_ASYM_PACKING        - describes SMT quirks
  * SD_ASYM_CPUCAPACITY    - describes mixed capacity topologies
+ * SD_BALANCE_WAKE  - controls wake-up balancing (expensive)
  */
 #define TOPOLOGY_SD_FLAGS \
  (SD_SHARE_CPUCAPACITY | \
@@ -6315,6 +6318,7 @@ static int sched_domains_curr_level;
  SD_NUMA | \
  SD_ASYM_PACKING | \
  SD_ASYM_CPUCAPACITY | \
+ SD_BALANCE_WAKE | \
  SD_SHARE_POWERDOMAIN)
 
 static struct sched_domain *
--
1.9.1


[PATCH 01/16] sched: Fix power to capacity renaming in comment

Morten Rasmussen
It seems that this one escaped Nico's renaming of cpu_power to
cpu_capacity a while back.

cc: Ingo Molnar <[hidden email]>
cc: Peter Zijlstra <[hidden email]>

Signed-off-by: Morten Rasmussen <[hidden email]>
---
 include/linux/sched.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 38526b6..463b91f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1015,7 +1015,7 @@ extern void wake_up_q(struct wake_q_head *head);
 #define SD_BALANCE_FORK 0x0008 /* Balance on fork, clone */
 #define SD_BALANCE_WAKE 0x0010  /* Balance on wakeup */
 #define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */
-#define SD_SHARE_CPUCAPACITY 0x0080 /* Domain members share cpu power */
+#define SD_SHARE_CPUCAPACITY 0x0080 /* Domain members share cpu capacity */
 #define SD_SHARE_POWERDOMAIN 0x0100 /* Domain members share power domain */
 #define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */
 #define SD_SERIALIZE 0x0400 /* Only a single load balancing instance */
--
1.9.1


[PATCH 02/16] sched/fair: Consistent use of prev_cpu in wakeup path

Morten Rasmussen
In commit ac66f5477239 ("sched/numa: Introduce migrate_swap()")
select_task_rq() got a 'cpu' argument to enable overriding of prev_cpu
in special cases (NUMA task swapping). However, the
select_task_rq_fair() helper functions wake_affine() and
select_idle_sibling() still use task_cpu(p) directly to work out
prev_cpu, which leads to inconsistencies.

This patch passes prev_cpu (potentially overridden by NUMA code) into
the helper functions to ensure prev_cpu is indeed the same cpu
everywhere in the wakeup path.

cc: Ingo Molnar <[hidden email]>
cc: Peter Zijlstra <[hidden email]>

Signed-off-by: Morten Rasmussen <[hidden email]>
---
 kernel/sched/fair.c | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 218f8e8..c49e25a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -656,7 +656,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 }
 
 #ifdef CONFIG_SMP
-static int select_idle_sibling(struct task_struct *p, int cpu);
+static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
 static unsigned long task_h_load(struct task_struct *p);
 
 /*
@@ -1502,7 +1502,8 @@ static void task_numa_compare(struct task_numa_env *env,
  * Call select_idle_sibling to maybe find a better one.
  */
  if (!cur)
- env->dst_cpu = select_idle_sibling(env->p, env->dst_cpu);
+ env->dst_cpu = select_idle_sibling(env->p, env->src_cpu,
+   env->dst_cpu);
 
 assign:
  assigned = true;
@@ -5013,18 +5014,18 @@ static int wake_wide(struct task_struct *p)
  return 1;
 }
 
-static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
+static int wake_affine(struct sched_domain *sd, struct task_struct *p,
+       int prev_cpu, int sync)
 {
  s64 this_load, load;
  s64 this_eff_load, prev_eff_load;
- int idx, this_cpu, prev_cpu;
+ int idx, this_cpu;
  struct task_group *tg;
  unsigned long weight;
  int balanced;
 
  idx  = sd->wake_idx;
  this_cpu  = smp_processor_id();
- prev_cpu  = task_cpu(p);
  load  = source_load(prev_cpu, idx);
  this_load = target_load(this_cpu, idx);
 
@@ -5189,11 +5190,10 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
-static int select_idle_sibling(struct task_struct *p, int target)
+static int select_idle_sibling(struct task_struct *p, int prev, int target)
 {
  struct sched_domain *sd;
  struct sched_group *sg;
- int i = task_cpu(p);
 
  if (idle_cpu(target))
  return target;
@@ -5201,8 +5201,8 @@ static int select_idle_sibling(struct task_struct *p, int target)
  /*
  * If the prevous cpu is cache affine and idle, don't be stupid.
  */
- if (i != target && cpus_share_cache(i, target) && idle_cpu(i))
- return i;
+ if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
+ return prev;
 
  /*
  * Otherwise, iterate the domains and find an eligible idle cpu.
@@ -5223,6 +5223,8 @@ static int select_idle_sibling(struct task_struct *p, int target)
  for_each_lower_domain(sd) {
  sg = sd->groups;
  do {
+ int i;
+
  if (!cpumask_intersects(sched_group_cpus(sg),
  tsk_cpus_allowed(p)))
  goto next;
@@ -5331,13 +5333,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 
  if (affine_sd) {
  sd = NULL; /* Prefer wake_affine over balance flags */
- if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
+ if (cpu != prev_cpu && wake_affine(affine_sd, p, prev_cpu, sync))
  new_cpu = cpu;
  }
 
  if (!sd) {
  if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
- new_cpu = select_idle_sibling(p, new_cpu);
+ new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
 
  } else while (sd) {
  struct sched_group *group;
--
1.9.1


[PATCH 03/16] sched/fair: Disregard idle task wakee_flips in wake_wide

Morten Rasmussen
wake_wide() is based on task wakee_flips of the waker and the wakee to
decide whether an affine wakeup is desirable. On lightly loaded systems
the waker is frequently the idle task (pid=0) which can accumulate a lot
of wakee_flips in that scenario. It makes little sense to prevent affine
wakeups on an idle cpu due to the idle task wakee_flips, so it makes
more sense to ignore them in wake_wide().

cc: Ingo Molnar <[hidden email]>
cc: Peter Zijlstra <[hidden email]>

Signed-off-by: Morten Rasmussen <[hidden email]>
---
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c49e25a..0fe3020 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5007,6 +5007,10 @@ static int wake_wide(struct task_struct *p)
  unsigned int slave = p->wakee_flips;
  int factor = this_cpu_read(sd_llc_size);
 
+ /* Don't let the idle task prevent affine wakeups */
+ if (is_idle_task(current))
+ return 0;
+
  if (master < slave)
  swap(master, slave);
  if (slave < factor || master < slave * factor)
--
1.9.1


[PATCH 04/16] sched/fair: Optimize find_idlest_cpu() when there is no choice

Morten Rasmussen
The current find_idlest_group()/find_idlest_cpu() search eventually ends
up calling find_idlest_cpu() on a sched_group containing only one cpu.
Checking idle-states becomes pointless when there is no alternative, so
bail out instead.

cc: Ingo Molnar <[hidden email]>
cc: Peter Zijlstra <[hidden email]>

Signed-off-by: Morten Rasmussen <[hidden email]>
---
 kernel/sched/fair.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0fe3020..564215d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5155,6 +5155,11 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
  int shallowest_idle_cpu = -1;
  int i;
 
+ /* Check if we have any choice */
+ if (group->group_weight == 1) {
+ return cpumask_first(sched_group_cpus(group));
+ }
+
  /* Traverse only the allowed CPUs */
  for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
  if (idle_cpu(i)) {
--
1.9.1


Re: [PATCH 03/16] sched/fair: Disregard idle task wakee_flips in wake_wide

Mike Galbraith
On Mon, 2016-05-23 at 11:58 +0100, Morten Rasmussen wrote:
> wake_wide() is based on task wakee_flips of the waker and the wakee to
> decide whether an affine wakeup is desirable. On lightly loaded systems
> the waker is frequently the idle task (pid=0) which can accumulate a lot
> of wakee_flips in that scenario. It makes little sense to prevent affine
> wakeups on an idle cpu due to the idle task wakee_flips, so it makes
> more sense to ignore them in wake_wide().

You sure?  What's the difference between a task flipping enough to
warrant spreading the load, and an interrupt source doing the same?
 I've both witnessed firsthand, and received user confirmation of this
very thing improving utilization.

> cc: Ingo Molnar <[hidden email]>
> cc: Peter Zijlstra <[hidden email]>
>
> Signed-off-by: Morten Rasmussen <[hidden email]>
> ---
>  kernel/sched/fair.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c49e25a..0fe3020 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5007,6 +5007,10 @@ static int wake_wide(struct task_struct *p)
>   unsigned int slave = p->wakee_flips;
>   int factor = this_cpu_read(sd_llc_size);
>  
> + /* Don't let the idle task prevent affine wakeups */
> + if (is_idle_task(current))
> + return 0;
> +
>   if (master < slave)
>   swap(master, slave);
>   if (slave < factor || master < slave * factor)

Re: [PATCH 03/16] sched/fair: Disregard idle task wakee_flips in wake_wide

Morten Rasmussen
On Mon, May 23, 2016 at 01:12:07PM +0200, Mike Galbraith wrote:

> On Mon, 2016-05-23 at 11:58 +0100, Morten Rasmussen wrote:
> > wake_wide() is based on task wakee_flips of the waker and the wakee to
> > decide whether an affine wakeup is desirable. On lightly loaded systems
> > the waker is frequently the idle task (pid=0) which can accumulate a lot
> > of wakee_flips in that scenario. It makes little sense to prevent affine
> > wakeups on an idle cpu due to the idle task wakee_flips, so it makes
> > more sense to ignore them in wake_wide().
>
> You sure?  What's the difference between a task flipping enough to
> warrant spreading the load, and an interrupt source doing the same?
>  I've both witnessed firsthand, and received user confirmation of this
> very thing improving utilization.

Right, I didn't consider the interrupt source scenario, my fault.

The problem then seems to be distinguishing between cpus that are truly
idle and cpus that are busy handling interrupts. The issue that I observe
is that wake_wide() likes pushing tasks around in lightly loaded
scenarios, which isn't desirable for power management. Selecting the same
cpu again may potentially let others reach deeper C-states.

With that in mind I will see if I can do better. Suggestions are welcome :-)

>
> > cc: Ingo Molnar <[hidden email]>
> > cc: Peter Zijlstra <[hidden email]>
> >
> > Signed-off-by: Morten Rasmussen <[hidden email]>
> > ---
> >  kernel/sched/fair.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index c49e25a..0fe3020 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5007,6 +5007,10 @@ static int wake_wide(struct task_struct *p)
> >   unsigned int slave = p->wakee_flips;
> >   int factor = this_cpu_read(sd_llc_size);
> >  
> > + /* Don't let the idle task prevent affine wakeups */
> > + if (is_idle_task(current))
> > + return 0;
> > +
> >   if (master < slave)
> >   swap(master, slave);
> >   if (slave < factor || master < slave * factor)

Re: [PATCH 03/16] sched/fair: Disregard idle task wakee_flips in wake_wide

Mike Galbraith
On Mon, 2016-05-23 at 13:00 +0100, Morten Rasmussen wrote:

> The problem then seems to be distinguishing between cpus that are truly
> idle and cpus that are busy handling interrupts. The issue that I observe
> is that wake_wide() likes pushing tasks around in lightly loaded
> scenarios, which isn't desirable for power management. Selecting the same
> cpu again may potentially let others reach deeper C-states.
>
> With that in mind I will see if I can do better. Suggestions are welcome :-)

None here.  For big boxen that are highly idle, you'd likely want to
shut down nodes and consolidate load, but otoh, all that slows response
to burst, which I hate.  I prefer race to idle, let power gating do its
job.  If I had a server farm with enough capacity vs load variability
to worry about, I suspect I'd become highly interested in routing.

        -Mike