[RFC][PATCH 0/7] sched: select_idle_siblings rewrite


[RFC][PATCH 0/7] sched: select_idle_siblings rewrite

Peter Zijlstra-5
Hai,

here be a semi coherent patch series for the recent select_idle_siblings()
tinkering. Happy benchmarking..

---
 include/linux/sched.h    |  10 ++
 kernel/sched/core.c      |  94 +++++++++++----
 kernel/sched/fair.c      | 298 ++++++++++++++++++++++++++++++++++++++---------
 kernel/sched/idle_task.c |   2 +-
 kernel/sched/sched.h     |  23 +++-
 kernel/time/tick-sched.c |  10 +-
 6 files changed, 348 insertions(+), 89 deletions(-)


[RFC][PATCH 1/7] sched: Remove unused @cpu argument from destroy_sched_domain*()

Peter Zijlstra-5
Small cleanup; nothing uses the @cpu argument so make it go away.

Signed-off-by: Peter Zijlstra (Intel) <[hidden email]>
---
 kernel/sched/core.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5841,15 +5841,15 @@ static void free_sched_domain(struct rcu
  kfree(sd);
 }
 
-static void destroy_sched_domain(struct sched_domain *sd, int cpu)
+static void destroy_sched_domain(struct sched_domain *sd)
 {
  call_rcu(&sd->rcu, free_sched_domain);
 }
 
-static void destroy_sched_domains(struct sched_domain *sd, int cpu)
+static void destroy_sched_domains(struct sched_domain *sd)
 {
  for (; sd; sd = sd->parent)
- destroy_sched_domain(sd, cpu);
+ destroy_sched_domain(sd);
 }
 
 /*
@@ -5921,7 +5921,7 @@ cpu_attach_domain(struct sched_domain *s
  */
  if (parent->flags & SD_PREFER_SIBLING)
  tmp->flags |= SD_PREFER_SIBLING;
- destroy_sched_domain(parent, cpu);
+ destroy_sched_domain(parent);
  } else
  tmp = tmp->parent;
  }
@@ -5929,7 +5929,7 @@ cpu_attach_domain(struct sched_domain *s
  if (sd && sd_degenerate(sd)) {
  tmp = sd;
  sd = sd->parent;
- destroy_sched_domain(tmp, cpu);
+ destroy_sched_domain(tmp);
  if (sd)
  sd->child = NULL;
  }
@@ -5939,7 +5939,7 @@ cpu_attach_domain(struct sched_domain *s
  rq_attach_root(rq, rd);
  tmp = rq->sd;
  rcu_assign_pointer(rq->sd, sd);
- destroy_sched_domains(tmp, cpu);
+ destroy_sched_domains(tmp);
 
  update_top_cache_domain(cpu);
 }



[RFC][PATCH 6/7] sched: Optimize SCHED_SMT

Peter Zijlstra-5
In reply to this post by Peter Zijlstra-5
Avoid pointless SCHED_SMT code when running on !SMT hardware.
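
Roughly, the pattern (condensed from the hunks below): a static key that
defaults to false is flipped once at boot if SMT siblings were enumerated;
hot paths then test it with static_branch_likely()/static_branch_unlikely(),
so on !SMT hardware the SMT-only code is skipped via a patched branch rather
than a runtime check.

DEFINE_STATIC_KEY_FALSE(sched_smt_present);

/* Flipped once, after CPU enumeration, if CPU0 has SMT siblings. */
static void sched_init_smt(void)
{
	if (cpumask_weight(cpu_smt_mask(0)) > 1)
		static_branch_enable(&sched_smt_present);
}

/* SMT-only paths bail out early when the key was never enabled. */
static inline void update_idle_core(struct rq *rq)
{
	if (static_branch_unlikely(&sched_smt_present))
		__update_idle_core(rq);
}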

Signed-off-by: Peter Zijlstra (Intel) <[hidden email]>
---
 kernel/sched/core.c      |   19 +++++++++++++++++++
 kernel/sched/fair.c      |    8 +++++++-
 kernel/sched/idle_task.c |    2 --
 kernel/sched/sched.h     |   17 +++++++++++++++++
 4 files changed, 43 insertions(+), 3 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7263,6 +7263,22 @@ int sched_cpu_dying(unsigned int cpu)
 }
 #endif
 
+#ifdef CONFIG_SCHED_SMT
+DEFINE_STATIC_KEY_FALSE(sched_smt_present);
+
+static void sched_init_smt(void)
+{
+ /*
+ * We've enumerated all CPUs and will assume that if any CPU
+ * has SMT siblings, CPU0 will too.
+ */
+ if (cpumask_weight(cpu_smt_mask(0)) > 1)
+ static_branch_enable(&sched_smt_present);
+}
+#else
+static inline void sched_init_smt(void) { }
+#endif
+
 void __init sched_init_smp(void)
 {
  cpumask_var_t non_isolated_cpus;
@@ -7292,6 +7308,9 @@ void __init sched_init_smp(void)
 
  init_sched_rt_class();
  init_sched_dl_class();
+
+ sched_init_smt();
+
  sched_smp_initialized = true;
 }
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5249,7 +5249,7 @@ static inline bool test_idle_cores(int c
  * Since SMT siblings share all cache levels, inspecting this limited remote
  * state should be fairly cheap.
  */
-void update_idle_core(struct rq *rq)
+void __update_idle_core(struct rq *rq)
 {
  int core = cpu_of(rq);
  int cpu;
@@ -5281,6 +5281,9 @@ static int select_idle_core(struct task_
  struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
  int core, cpu, wrap;
 
+ if (!static_branch_likely(&sched_smt_present))
+ return -1;
+
  if (!test_idle_cores(target, false))
  return -1;
 
@@ -5314,6 +5317,9 @@ static int select_idle_smt(struct task_s
 {
  int cpu;
 
+ if (!static_branch_likely(&sched_smt_present))
+ return -1;
+
  for_each_cpu(cpu, cpu_smt_mask(target)) {
  if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
  continue;
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -23,8 +23,6 @@ static void check_preempt_curr_idle(stru
  resched_curr(rq);
 }
 
-extern void update_idle_core(struct rq *rq);
-
 static struct task_struct *
 pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct pin_cookie cookie)
 {
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1811,3 +1811,20 @@ static inline void account_reset_rq(stru
  rq->prev_steal_time_rq = 0;
 #endif
 }
+
+
+#ifdef CONFIG_SCHED_SMT
+
+extern struct static_key_false sched_smt_present;
+
+extern void __update_idle_core(struct rq *rq);
+
+static inline void update_idle_core(struct rq *rq)
+{
+ if (static_branch_unlikely(&sched_smt_present))
+ __update_idle_core(rq);
+}
+
+#else
+static inline void update_idle_core(struct rq *rq) { }
+#endif



[RFC][PATCH 2/7] sched: Restructure destroy_sched_domain()

Peter Zijlstra-5
In reply to this post by Peter Zijlstra-5
There is no point in doing a call_rcu() for each domain; only do a
callback for the root sched domain and clean up the entire set in one
go.

Also make the entire call chain be called destroy_sched_domain*() to
remove confusion with the free_sched_domains() call, which does an
entirely different thing.

Both cpu_attach_domain() callers of destroy_sched_domain() can live
without the call_rcu() because at those points the sched_domain hasn't
been published yet.

Signed-off-by: Peter Zijlstra (Intel) <[hidden email]>
---
 kernel/sched/core.c |   14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5824,10 +5824,8 @@ static void free_sched_groups(struct sch
  } while (sg != first);
 }
 
-static void free_sched_domain(struct rcu_head *rcu)
+static void destroy_sched_domain(struct sched_domain *sd)
 {
- struct sched_domain *sd = container_of(rcu, struct sched_domain, rcu);
-
  /*
  * If its an overlapping domain it has private groups, iterate and
  * nuke them all.
@@ -5841,15 +5839,17 @@ static void free_sched_domain(struct rcu
  kfree(sd);
 }
 
-static void destroy_sched_domain(struct sched_domain *sd)
+static void destroy_sched_domains_rcu(struct rcu_head *rcu)
 {
- call_rcu(&sd->rcu, free_sched_domain);
+ struct sched_domain *sd = container_of(rcu, struct sched_domain, rcu);
+
+ for (; sd; sd = sd->parent)
+ destroy_sched_domain(sd);
 }
 
 static void destroy_sched_domains(struct sched_domain *sd)
 {
- for (; sd; sd = sd->parent)
- destroy_sched_domain(sd);
+ call_rcu(&sd->rcu, destroy_sched_domains_rcu);
 }
 
 /*



[RFC][PATCH 3/7] sched: Introduce struct sched_domain_shared

Peter Zijlstra-5
In reply to this post by Peter Zijlstra-5
Since struct sched_domain is strictly per-CPU, introduce a structure
that is shared between all 'identical' sched_domains.

Limit to SD_SHARE_PKG_RESOURCES domains for now, as we'll only use it
for shared cache state; if another use comes up later we can easily
relax this.

While the sched_groups are normally shared between CPUs, they are not
natural to use when we need some shared state on a domain level --
since that would require the domain to have a parent, which is not a
given.
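
For illustration, the lifetime of a shared instance, condensed from the
hunks below: __sdt_alloc() allocates one instance per CPU, but sd_init()
points every domain of a span at the instance belonging to the first CPU
in that span and takes a reference; the last domain to drop its reference
frees it.

	/* sd_init(): attach to the instance of the span's first CPU. */
	sd_id = cpumask_first(sched_domain_span(sd));
	sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
	atomic_inc(&sd->shared->ref);

	/* free_sched_domain(): the last reference frees it. */
	if (sd->shared && atomic_dec_and_test(&sd->shared->ref))
		kfree(sd->shared);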

Signed-off-by: Peter Zijlstra (Intel) <[hidden email]>
---
 include/linux/sched.h |    6 ++++++
 kernel/sched/core.c   |   40 ++++++++++++++++++++++++++++++++++------
 2 files changed, 40 insertions(+), 6 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1057,6 +1057,10 @@ extern int sched_domain_level_max;
 
 struct sched_group;
 
+struct sched_domain_shared {
+ atomic_t ref;
+};
+
 struct sched_domain {
  /* These fields must be setup */
  struct sched_domain *parent; /* top domain must be null terminated */
@@ -1125,6 +1129,7 @@ struct sched_domain {
  void *private; /* used during construction */
  struct rcu_head rcu; /* used during destruction */
  };
+ struct sched_domain_shared *shared;
 
  unsigned int span_weight;
  /*
@@ -1158,6 +1163,7 @@ typedef int (*sched_domain_flags_f)(void
 
 struct sd_data {
  struct sched_domain **__percpu sd;
+ struct sched_domain_shared **__percpu sds;
  struct sched_group **__percpu sg;
  struct sched_group_capacity **__percpu sgc;
 };
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5836,6 +5836,8 @@ static void free_sched_domain(struct sch
  kfree(sd->groups->sgc);
  kfree(sd->groups);
  }
+ if (sd->shared && atomic_dec_and_test(&sd->shared->ref))
+ kfree(sd->shared);
  kfree(sd);
 }
 
@@ -6270,6 +6272,9 @@ static void claim_allocations(int cpu, s
  WARN_ON_ONCE(*per_cpu_ptr(sdd->sd, cpu) != sd);
  *per_cpu_ptr(sdd->sd, cpu) = NULL;
 
+ if (atomic_read(&(*per_cpu_ptr(sdd->sds, cpu))->ref))
+ *per_cpu_ptr(sdd->sds, cpu) = NULL;
+
  if (atomic_read(&(*per_cpu_ptr(sdd->sg, cpu))->ref))
  *per_cpu_ptr(sdd->sg, cpu) = NULL;
 
@@ -6305,10 +6310,12 @@ static int sched_domains_curr_level;
  SD_SHARE_POWERDOMAIN)
 
 static struct sched_domain *
-sd_init(struct sched_domain_topology_level *tl, int cpu)
+sd_init(struct sched_domain_topology_level *tl,
+ const struct cpumask *cpu_map, int cpu)
 {
- struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);
- int sd_weight, sd_flags = 0;
+ struct sd_data *sdd = &tl->data;
+ struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
+ int sd_id, sd_weight, sd_flags = 0;
 
 #ifdef CONFIG_NUMA
  /*
@@ -6362,6 +6369,9 @@ sd_init(struct sched_domain_topology_lev
 #endif
  };
 
+ cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
+ sd_id = cpumask_first(sched_domain_span(sd));
+
  /*
  * Convert topological properties into behaviour.
  */
@@ -6376,6 +6386,9 @@ sd_init(struct sched_domain_topology_lev
  sd->cache_nice_tries = 1;
  sd->busy_idx = 2;
 
+ sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
+ atomic_inc(&sd->shared->ref);
+
 #ifdef CONFIG_NUMA
  } else if (sd->flags & SD_NUMA) {
  sd->cache_nice_tries = 2;
@@ -6397,7 +6410,7 @@ sd_init(struct sched_domain_topology_lev
  sd->idle_idx = 1;
  }
 
- sd->private = &tl->data;
+ sd->private = sdd;
 
  return sd;
 }
@@ -6704,6 +6717,10 @@ static int __sdt_alloc(const struct cpum
  if (!sdd->sd)
  return -ENOMEM;
 
+ sdd->sds = alloc_percpu(struct sched_domain_shared *);
+ if (!sdd->sds)
+ return -ENOMEM;
+
  sdd->sg = alloc_percpu(struct sched_group *);
  if (!sdd->sg)
  return -ENOMEM;
@@ -6714,6 +6731,7 @@ static int __sdt_alloc(const struct cpum
 
  for_each_cpu(j, cpu_map) {
  struct sched_domain *sd;
+ struct sched_domain_shared *sds;
  struct sched_group *sg;
  struct sched_group_capacity *sgc;
 
@@ -6724,6 +6742,13 @@ static int __sdt_alloc(const struct cpum
 
  *per_cpu_ptr(sdd->sd, j) = sd;
 
+ sds = kzalloc_node(sizeof(struct sched_domain_shared),
+ GFP_KERNEL, cpu_to_node(j));
+ if (!sds)
+ return -ENOMEM;
+
+ *per_cpu_ptr(sdd->sds, j) = sds;
+
  sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
  GFP_KERNEL, cpu_to_node(j));
  if (!sg)
@@ -6763,6 +6788,8 @@ static void __sdt_free(const struct cpum
  kfree(*per_cpu_ptr(sdd->sd, j));
  }
 
+ if (sdd->sds)
+ kfree(*per_cpu_ptr(sdd->sds, j));
  if (sdd->sg)
  kfree(*per_cpu_ptr(sdd->sg, j));
  if (sdd->sgc)
@@ -6770,6 +6797,8 @@ static void __sdt_free(const struct cpum
  }
  free_percpu(sdd->sd);
  sdd->sd = NULL;
+ free_percpu(sdd->sds);
+ sdd->sds = NULL;
  free_percpu(sdd->sg);
  sdd->sg = NULL;
  free_percpu(sdd->sgc);
@@ -6781,11 +6810,10 @@ struct sched_domain *build_sched_domain(
  const struct cpumask *cpu_map, struct sched_domain_attr *attr,
  struct sched_domain *child, int cpu)
 {
- struct sched_domain *sd = sd_init(tl, cpu);
+ struct sched_domain *sd = sd_init(tl, cpu_map, cpu);
  if (!sd)
  return child;
 
- cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
  if (child) {
  sd->level = child->level + 1;
  sched_domain_level_max = max(sched_domain_level_max, sd->level);



[RFC][PATCH 5/7] sched: Rewrite select_idle_siblings()

Peter Zijlstra-5
In reply to this post by Peter Zijlstra-5
select_idle_siblings() is a known pain point for a number of
workloads; it either does too much or not enough, and sometimes it
just gets it plain wrong.

This rewrite attempts to address a number of issues (but sadly not
all).

The current code does an unconditional sched_domain iteration; with
the intent of finding an idle core (on SMT hardware). The problems
which this patch tries to address are:

 - it's pointless to look for idle cores if the machine is really busy;
   at that point you're just wasting cycles.

 - its behaviour is inconsistent between SMT and !SMT hardware, in
   that !SMT hardware ends up doing a scan for any idle CPU in the LLC
   domain, while SMT hardware does a scan for idle cores and, if that
   fails, falls back to a scan for idle threads on the 'target' core.

The new code replaces the sched_domain scan with 3 explicit scans:

 1) search for an idle core in the LLC
 2) search for an idle CPU in the LLC
 3) search for an idle thread in the 'target' core

where 1 and 3 are conditional on SMT support and 1 and 2 have runtime
heuristics to skip the step.
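
Condensed from the fair.c hunk below, the resulting top-level flow looks
roughly like this (the fast paths for an already idle target/prev CPU are
elided):

	sd = rcu_dereference(per_cpu(sd_llc, target));
	if (!sd)
		return target;

	i = select_idle_core(p, sd, target);	/* 1) idle core in the LLC */
	if ((unsigned)i < nr_cpumask_bits)
		return i;

	i = select_idle_cpu(p, sd, target);	/* 2) idle CPU in the LLC */
	if ((unsigned)i < nr_cpumask_bits)
		return i;

	i = select_idle_smt(p, sd, target);	/* 3) idle thread in target's core */
	if ((unsigned)i < nr_cpumask_bits)
		return i;

	return target;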

Step 1) is conditional on sd_llc_shared->has_idle_cores; when a cpu
goes idle and sd_llc_shared->has_idle_cores is false, we scan all SMT
siblings of the CPU going idle. Similarly, we clear
sd_llc_shared->has_idle_cores when we fail to find an idle core.

Step 2) tracks the average cost of the scan and compares this to the
average idle time guesstimate for the CPU doing the wakeup. There is a
significant fudge factor involved to deal with the variability of the
averages; hackbench especially was sensitive to this.
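
Condensed from the hunk below: the divide by 512 is the fudge factor, and
the divide by 8 makes avg_scan_cost a 1/8th exponential moving average of
the measured scan time. As a rough example, with rq->avg_idle around 1ms
the scan is only attempted while avg_scan_cost stays below about 2us.

	u64 avg_idle = this_rq()->avg_idle;	/* idle time estimate for this rq */
	u64 avg_cost = this_sd->avg_scan_cost;	/* moving average of scan cost    */

	if ((avg_idle / 512) < avg_cost)	/* scan likely costs more than it gains */
		return -1;

	time = local_clock();
	/* ... scan the LLC span for an idle CPU ... */
	time = local_clock() - time;

	/* 1/8th exponential moving average of the scan cost */
	this_sd->avg_scan_cost += (s64)(time - this_sd->avg_scan_cost) / 8;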

Step 3) is unconditional; we assume (also per step 1) that scanning
all SMT siblings in a core is 'cheap'.

With this; SMT systems gain step 2, which cures a few benchmarks --
notably one from Facebook.

One 'feature' of the sched_domain iteration, which we preserve in the
new code, is that it starts scanning from the 'target' CPU instead of
scanning the cpumask in cpu id order. This keeps multiple CPUs in the
LLC that are scanning for idle from ganging up and finding the same CPU
quite as much. The downside is that tasks can end up hopping across the
LLC for no apparent reason.
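
A concrete (hypothetical) example of the wrap order, using the helper the
patch adds below: with an LLC spanning CPUs 0-7 and target == 5, the scan
visits 5, 6, 7, 0, 1, 2, 3, 4 -- it starts at @target and wraps around,
instead of always starting from the lowest numbered CPU in the mask.

	int cpu, wrap;

	for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {
		if (idle_cpu(cpu))
			return cpu;	/* first idle CPU in wrap order */
	}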

Signed-off-by: Peter Zijlstra (Intel) <[hidden email]>
---
 include/linux/sched.h    |    3
 kernel/sched/core.c      |    3
 kernel/sched/fair.c      |  270 ++++++++++++++++++++++++++++++++++++++---------
 kernel/sched/idle_task.c |    4
 4 files changed, 233 insertions(+), 47 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1060,6 +1060,7 @@ struct sched_group;
 struct sched_domain_shared {
  atomic_t ref;
  atomic_t nr_busy_cpus;
+ int has_idle_cores;
 };
 
 struct sched_domain {
@@ -1092,6 +1093,8 @@ struct sched_domain {
  u64 max_newidle_lb_cost;
  unsigned long next_decay_max_lb_cost;
 
+ u64 avg_scan_cost; /* select_idle_sibling */
+
 #ifdef CONFIG_SCHEDSTATS
  /* load_balance() stats */
  unsigned int lb_count[CPU_MAX_IDLE_TYPES];
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7329,6 +7329,7 @@ static struct kmem_cache *task_group_cac
 #endif
 
 DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
+DECLARE_PER_CPU(cpumask_var_t, select_idle_mask);
 
 void __init sched_init(void)
 {
@@ -7365,6 +7366,8 @@ void __init sched_init(void)
  for_each_possible_cpu(i) {
  per_cpu(load_balance_mask, i) = (cpumask_var_t)kzalloc_node(
  cpumask_size(), GFP_KERNEL, cpu_to_node(i));
+ per_cpu(select_idle_mask, i) = (cpumask_var_t)kzalloc_node(
+ cpumask_size(), GFP_KERNEL, cpu_to_node(i));
  }
 #endif /* CONFIG_CPUMASK_OFFSTACK */
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1501,8 +1501,15 @@ static void task_numa_compare(struct tas
  * One idle CPU per node is evaluated for a task numa move.
  * Call select_idle_sibling to maybe find a better one.
  */
- if (!cur)
+ if (!cur) {
+ /*
+ * select_idle_siblings() uses an per-cpu cpumask that
+ * can be used from IRQ context.
+ */
+ local_irq_disable();
  env->dst_cpu = select_idle_sibling(env->p, env->dst_cpu);
+ local_irq_enable();
+ }
 
 assign:
  assigned = true;
@@ -4499,6 +4506,11 @@ static void dequeue_task_fair(struct rq
 }
 
 #ifdef CONFIG_SMP
+
+/* Working cpumask for: load_balance, load_balance_newidle. */
+DEFINE_PER_CPU(cpumask_var_t, load_balance_mask);
+DEFINE_PER_CPU(cpumask_var_t, select_idle_mask);
+
 #ifdef CONFIG_NO_HZ_COMMON
 /*
  * per rq 'load' arrray crap; XXX kill this.
@@ -5172,64 +5184,233 @@ find_idlest_cpu(struct sched_group *grou
 }
 
 /*
- * Try and locate an idle CPU in the sched_domain.
+ * Implement a for_each_cpu() variant that starts the scan at a given cpu
+ * (@start), and wraps around.
+ *
+ * This is used to scan for idle CPUs; such that not all CPUs looking for an
+ * idle CPU find the same CPU. The down-side is that tasks tend to cycle
+ * through the LLC domain.
+ *
+ * Especially tbench is found sensitive to this.
+ */
+
+static int cpumask_next_wrap(int n, const struct cpumask *mask, int start, int *wrapped)
+{
+ int next;
+
+again:
+ next = find_next_bit(cpumask_bits(mask), nr_cpumask_bits, n+1);
+
+ if (*wrapped) {
+ if (next >= start)
+ return nr_cpumask_bits;
+ } else {
+ if (next >= nr_cpumask_bits) {
+ *wrapped = 1;
+ n = -1;
+ goto again;
+ }
+ }
+
+ return next;
+}
+
+#define for_each_cpu_wrap(cpu, mask, start, wrap) \
+ for ((wrap) = 0, (cpu) = (start)-1; \
+ (cpu) = cpumask_next_wrap((cpu), (mask), (start), &(wrap)), \
+ (cpu) < nr_cpumask_bits; )
+
+#ifdef CONFIG_SCHED_SMT
+
+static inline void set_idle_cores(int cpu, int val)
+{
+ struct sched_domain_shared *sds;
+
+ sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+ if (sds)
+ WRITE_ONCE(sds->has_idle_cores, val);
+}
+
+static inline bool test_idle_cores(int cpu, bool def)
+{
+ struct sched_domain_shared *sds;
+
+ sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+ if (sds)
+ return READ_ONCE(sds->has_idle_cores);
+
+ return def;
+}
+
+/*
+ * Scans the local SMT mask to see if the entire core is idle, and records this
+ * information in sd_llc_shared->has_idle_cores.
+ *
+ * Since SMT siblings share all cache levels, inspecting this limited remote
+ * state should be fairly cheap.
+ */
+void update_idle_core(struct rq *rq)
+{
+ int core = cpu_of(rq);
+ int cpu;
+
+ rcu_read_lock();
+ if (test_idle_cores(core, true))
+ goto unlock;
+
+ for_each_cpu(cpu, cpu_smt_mask(core)) {
+ if (cpu == core)
+ continue;
+
+ if (!idle_cpu(cpu))
+ goto unlock;
+ }
+
+ set_idle_cores(core, 1);
+unlock:
+ rcu_read_unlock();
+}
+
+/*
+ * Scan the entire LLC domain for idle cores; this dynamically switches off if there are
+ * no idle cores left in the system; tracked through sd_llc->shared->has_idle_cores
+ * and enabled through update_idle_core() above.
+ */
+static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
+{
+ struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+ int core, cpu, wrap;
+
+ if (!test_idle_cores(target, false))
+ return -1;
+
+ cpumask_and(cpus, sched_domain_span(sd), tsk_cpus_allowed(p));
+
+ for_each_cpu_wrap(core, cpus, target, wrap) {
+ bool idle = true;
+
+ for_each_cpu(cpu, cpu_smt_mask(core)) {
+ cpumask_clear_cpu(cpu, cpus);
+ if (!idle_cpu(cpu))
+ idle = false;
+ }
+
+ if (idle)
+ return core;
+ }
+
+ /*
+ * Failed to find an idle core; stop looking for one.
+ */
+ set_idle_cores(target, 0);
+
+ return -1;
+}
+
+/*
+ * Scan the local SMT mask for idle CPUs.
+ */
+static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
+{
+ int cpu;
+
+ for_each_cpu(cpu, cpu_smt_mask(target)) {
+ if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+ continue;
+ if (idle_cpu(cpu))
+ return cpu;
+ }
+
+ return -1;
+}
+
+#else /* CONFIG_SCHED_SMT */
+
+void update_idle_core(struct rq *rq) { }
+
+static inline int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
+{
+ return -1;
+}
+
+static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
+{
+ return -1;
+}
+
+#endif /* CONFIG_SCHED_SMT */
+
+/*
+ * Scan the LLC domain for idle CPUs; this is dynamically regulated by
+ * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
+ * average idle time for this rq (as found in rq->avg_idle).
+ */
+static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
+{
+ struct sched_domain *this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
+ u64 avg_idle = this_rq()->avg_idle;
+ u64 avg_cost = this_sd->avg_scan_cost;
+ u64 time, cost;
+ s64 delta;
+ int cpu, wrap;
+
+ /*
+ * Due to large variance we need a large fuzz factor; hackbench in
+ * particularly is sensitive here.
+ */
+ if ((avg_idle / 512) < avg_cost)
+ return -1;
+
+ time = local_clock();
+
+ for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {
+ if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+ continue;
+ if (idle_cpu(cpu))
+ break;
+ }
+
+ time = local_clock() - time;
+ cost = this_sd->avg_scan_cost;
+ delta = (s64)(time - cost) / 8;
+ this_sd->avg_scan_cost += delta;
+
+ return cpu;
+}
+
+/*
+ * Try and locate an idle core/thread in the LLC cache domain.
  */
 static int select_idle_sibling(struct task_struct *p, int target)
 {
  struct sched_domain *sd;
- struct sched_group *sg;
  int i = task_cpu(p);
 
  if (idle_cpu(target))
  return target;
 
  /*
- * If the prevous cpu is cache affine and idle, don't be stupid.
+ * If the previous cpu is cache affine and idle, don't be stupid.
  */
  if (i != target && cpus_share_cache(i, target) && idle_cpu(i))
  return i;
 
- /*
- * Otherwise, iterate the domains and find an eligible idle cpu.
- *
- * A completely idle sched group at higher domains is more
- * desirable than an idle group at a lower level, because lower
- * domains have smaller groups and usually share hardware
- * resources which causes tasks to contend on them, e.g. x86
- * hyperthread siblings in the lowest domain (SMT) can contend
- * on the shared cpu pipeline.
- *
- * However, while we prefer idle groups at higher domains
- * finding an idle cpu at the lowest domain is still better than
- * returning 'target', which we've already established, isn't
- * idle.
- */
  sd = rcu_dereference(per_cpu(sd_llc, target));
- for_each_lower_domain(sd) {
- sg = sd->groups;
- do {
- if (!cpumask_intersects(sched_group_cpus(sg),
- tsk_cpus_allowed(p)))
- goto next;
-
- /* Ensure the entire group is idle */
- for_each_cpu(i, sched_group_cpus(sg)) {
- if (i == target || !idle_cpu(i))
- goto next;
- }
+ if (!sd)
+ return target;
+
+ i = select_idle_core(p, sd, target);
+ if ((unsigned)i < nr_cpumask_bits)
+ return i;
+
+ i = select_idle_cpu(p, sd, target);
+ if ((unsigned)i < nr_cpumask_bits)
+ return i;
+
+ i = select_idle_smt(p, sd, target);
+ if ((unsigned)i < nr_cpumask_bits)
+ return i;
 
- /*
- * It doesn't matter which cpu we pick, the
- * whole group is idle.
- */
- target = cpumask_first_and(sched_group_cpus(sg),
- tsk_cpus_allowed(p));
- goto done;
-next:
- sg = sg->next;
- } while (sg != sd->groups);
- }
-done:
  return target;
 }
 
@@ -7232,9 +7413,6 @@ static struct rq *find_busiest_queue(str
  */
 #define MAX_PINNED_INTERVAL 512
 
-/* Working cpumask for load_balance and load_balance_newidle. */
-DEFINE_PER_CPU(cpumask_var_t, load_balance_mask);
-
 static int need_active_balance(struct lb_env *env)
 {
  struct sched_domain *sd = env->sd;
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -23,11 +23,13 @@ static void check_preempt_curr_idle(stru
  resched_curr(rq);
 }
 
+extern void update_idle_core(struct rq *rq);
+
 static struct task_struct *
 pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct pin_cookie cookie)
 {
  put_prev_task(rq, prev);
-
+ update_idle_core(rq);
  schedstat_inc(rq, sched_goidle);
  return rq->idle;
 }



[RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared

Peter Zijlstra-5
In reply to this post by Peter Zijlstra-5
Move the nr_busy_cpus thing from its hacky sd->parent->groups->sgc
location into the much more natural sched_domain_shared location.
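
In short (condensed from the fair.c hunk below), the nohz busy accounting
no longer has to chase the parent domain's group capacity; it updates the
LLC's shared structure directly:

	/* before */
	sd = rcu_dereference(per_cpu(sd_busy, cpu));
	atomic_inc(&sd->groups->sgc->nr_busy_cpus);

	/* after */
	sd = rcu_dereference(per_cpu(sd_llc, cpu));
	atomic_inc(&sd->shared->nr_busy_cpus);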

Signed-off-by: Peter Zijlstra (Intel) <[hidden email]>
---
 include/linux/sched.h    |    1 +
 kernel/sched/core.c      |   10 +++++-----
 kernel/sched/fair.c      |   22 ++++++++++++----------
 kernel/sched/sched.h     |    6 +-----
 kernel/time/tick-sched.c |   10 +++++-----
 5 files changed, 24 insertions(+), 25 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1059,6 +1059,7 @@ struct sched_group;
 
 struct sched_domain_shared {
  atomic_t ref;
+ atomic_t nr_busy_cpus;
 };
 
 struct sched_domain {
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5866,14 +5866,14 @@ static void destroy_sched_domains(struct
 DEFINE_PER_CPU(struct sched_domain *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_size);
 DEFINE_PER_CPU(int, sd_llc_id);
+DEFINE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
 DEFINE_PER_CPU(struct sched_domain *, sd_numa);
-DEFINE_PER_CPU(struct sched_domain *, sd_busy);
 DEFINE_PER_CPU(struct sched_domain *, sd_asym);
 
 static void update_top_cache_domain(int cpu)
 {
+ struct sched_domain_shared *sds = NULL;
  struct sched_domain *sd;
- struct sched_domain *busy_sd = NULL;
  int id = cpu;
  int size = 1;
 
@@ -5881,13 +5881,13 @@ static void update_top_cache_domain(int
  if (sd) {
  id = cpumask_first(sched_domain_span(sd));
  size = cpumask_weight(sched_domain_span(sd));
- busy_sd = sd->parent; /* sd_busy */
+ sds = sd->shared;
  }
- rcu_assign_pointer(per_cpu(sd_busy, cpu), busy_sd);
 
  rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
  per_cpu(sd_llc_size, cpu) = size;
  per_cpu(sd_llc_id, cpu) = id;
+ rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
 
  sd = lowest_flag_domain(cpu, SD_NUMA);
  rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
@@ -6184,7 +6184,6 @@ static void init_sched_groups_capacity(i
  return;
 
  update_group_capacity(sd, cpu);
- atomic_set(&sg->sgc->nr_busy_cpus, sg->group_weight);
 }
 
 /*
@@ -6388,6 +6387,7 @@ sd_init(struct sched_domain_topology_lev
 
  sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
  atomic_inc(&sd->shared->ref);
+ atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
 
 #ifdef CONFIG_NUMA
  } else if (sd->flags & SD_NUMA) {
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7842,13 +7842,13 @@ static inline void set_cpu_sd_state_busy
  int cpu = smp_processor_id();
 
  rcu_read_lock();
- sd = rcu_dereference(per_cpu(sd_busy, cpu));
+ sd = rcu_dereference(per_cpu(sd_llc, cpu));
 
  if (!sd || !sd->nohz_idle)
  goto unlock;
  sd->nohz_idle = 0;
 
- atomic_inc(&sd->groups->sgc->nr_busy_cpus);
+ atomic_inc(&sd->shared->nr_busy_cpus);
 unlock:
  rcu_read_unlock();
 }
@@ -7859,13 +7859,13 @@ void set_cpu_sd_state_idle(void)
  int cpu = smp_processor_id();
 
  rcu_read_lock();
- sd = rcu_dereference(per_cpu(sd_busy, cpu));
+ sd = rcu_dereference(per_cpu(sd_llc, cpu));
 
  if (!sd || sd->nohz_idle)
  goto unlock;
  sd->nohz_idle = 1;
 
- atomic_dec(&sd->groups->sgc->nr_busy_cpus);
+ atomic_dec(&sd->shared->nr_busy_cpus);
 unlock:
  rcu_read_unlock();
 }
@@ -8092,8 +8092,8 @@ static void nohz_idle_balance(struct rq
 static inline bool nohz_kick_needed(struct rq *rq)
 {
  unsigned long now = jiffies;
+ struct sched_domain_shared *sds;
  struct sched_domain *sd;
- struct sched_group_capacity *sgc;
  int nr_busy, cpu = rq->cpu;
  bool kick = false;
 
@@ -8121,11 +8121,13 @@ static inline bool nohz_kick_needed(stru
  return true;
 
  rcu_read_lock();
- sd = rcu_dereference(per_cpu(sd_busy, cpu));
- if (sd) {
- sgc = sd->groups->sgc;
- nr_busy = atomic_read(&sgc->nr_busy_cpus);
-
+ sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+ if (sds) {
+ /*
+ * XXX: write a coherent comment on why we do this.
+ * See also: http://lkml.kernel.org/r/[hidden email]
+ */
+ nr_busy = atomic_read(&sds->nr_busy_cpus);
  if (nr_busy > 1) {
  kick = true;
  goto unlock;
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -856,8 +856,8 @@ static inline struct sched_domain *lowes
 DECLARE_PER_CPU(struct sched_domain *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_size);
 DECLARE_PER_CPU(int, sd_llc_id);
+DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
 DECLARE_PER_CPU(struct sched_domain *, sd_numa);
-DECLARE_PER_CPU(struct sched_domain *, sd_busy);
 DECLARE_PER_CPU(struct sched_domain *, sd_asym);
 
 struct sched_group_capacity {
@@ -869,10 +869,6 @@ struct sched_group_capacity {
  unsigned int capacity;
  unsigned long next_update;
  int imbalance; /* XXX unrelated to capacity but shared group state */
- /*
- * Number of busy cpus in this group.
- */
- atomic_t nr_busy_cpus;
 
  unsigned long cpumask[0]; /* iteration mask */
 };
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -933,11 +933,11 @@ void tick_nohz_idle_enter(void)
  WARN_ON_ONCE(irqs_disabled());
 
  /*
- * Update the idle state in the scheduler domain hierarchy
- * when tick_nohz_stop_sched_tick() is called from the idle loop.
- * State will be updated to busy during the first busy tick after
- * exiting idle.
- */
+ * Update the idle state in the scheduler domain hierarchy
+ * when tick_nohz_stop_sched_tick() is called from the idle loop.
+ * State will be updated to busy during the first busy tick after
+ * exiting idle.
+ */
  set_cpu_sd_state_idle();
 
  local_irq_disable();



[RFC][PATCH 7/7] sched: debug muck -- not for merging

Peter Zijlstra-5
In reply to this post by Peter Zijlstra-5
Add a few knobs to poke while playing with the new code.

Not-Signed-off-by: Peter Zijlstra (Intel) <[hidden email]>
---
 include/linux/sched/sysctl.h |    1
 kernel/sched/fair.c          |   86 ++++++++++++++++++++++++++++++++++---------
 kernel/sched/features.h      |   10 +++++
 kernel/sysctl.c              |    7 +++
 4 files changed, 86 insertions(+), 18 deletions(-)

--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -37,6 +37,7 @@ extern unsigned int sysctl_sched_migrati
 extern unsigned int sysctl_sched_nr_migrate;
 extern unsigned int sysctl_sched_time_avg;
 extern unsigned int sysctl_sched_shares_window;
+extern unsigned int sysctl_sched_shift;
 
 int sched_proc_update_handler(struct ctl_table *table, int write,
  void __user *buffer, size_t *length,
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -114,6 +114,8 @@ unsigned int __read_mostly sysctl_sched_
 unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
 #endif
 
+const_debug unsigned int sysctl_sched_shift = 9;
+
 static inline void update_load_add(struct load_weight *lw, unsigned long inc)
 {
  lw->weight += inc;
@@ -5354,18 +5356,24 @@ static inline int select_idle_smt(struct
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
 {
  struct sched_domain *this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
- u64 avg_idle = this_rq()->avg_idle;
- u64 avg_cost = this_sd->avg_scan_cost;
  u64 time, cost;
  s64 delta;
  int cpu, wrap;
 
- /*
- * Due to large variance we need a large fuzz factor; hackbench in
- * particularly is sensitive here.
- */
- if ((avg_idle / 512) < avg_cost)
- return -1;
+ if (sched_feat(AVG_CPU)) {
+ u64 avg_idle = this_rq()->avg_idle;
+ u64 avg_cost = this_sd->avg_scan_cost;
+
+ if (sched_feat(PRINT_AVG))
+ trace_printk("idle: %Ld cost: %Ld\n", avg_idle, avg_cost);
+
+ /*
+ * Due to large variance we need a large fuzz factor; hackbench in
+ * particularly is sensitive here.
+ */
+ if ((avg_idle >> sysctl_sched_shift) < avg_cost)
+ return -1;
+ }
 
  time = local_clock();
 
@@ -5379,6 +5387,7 @@ static int select_idle_cpu(struct task_s
  time = local_clock() - time;
  cost = this_sd->avg_scan_cost;
  delta = (s64)(time - cost) / 8;
+ /* trace_printk("time: %Ld cost: %Ld delta: %Ld\n", time, cost, delta); */
  this_sd->avg_scan_cost += delta;
 
  return cpu;
@@ -5390,7 +5399,7 @@ static int select_idle_cpu(struct task_s
 static int select_idle_sibling(struct task_struct *p, int target)
 {
  struct sched_domain *sd;
- int i = task_cpu(p);
+ int start, i = task_cpu(p);
 
  if (idle_cpu(target))
  return target;
@@ -5401,21 +5410,62 @@ static int select_idle_sibling(struct ta
  if (i != target && cpus_share_cache(i, target) && idle_cpu(i))
  return i;
 
+ start = target;
+ if (sched_feat(ORDER_IDLE))
+ start = per_cpu(sd_llc_id, target); /* first cpu in llc domain */
+
  sd = rcu_dereference(per_cpu(sd_llc, target));
  if (!sd)
  return target;
 
- i = select_idle_core(p, sd, target);
- if ((unsigned)i < nr_cpumask_bits)
- return i;
+ if (sched_feat(OLD_IDLE)) {
+ struct sched_group *sg;
 
- i = select_idle_cpu(p, sd, target);
- if ((unsigned)i < nr_cpumask_bits)
- return i;
+ for_each_lower_domain(sd) {
+ sg = sd->groups;
+ do {
+ if (!cpumask_intersects(sched_group_cpus(sg),
+ tsk_cpus_allowed(p)))
+ goto next;
+
+ /* Ensure the entire group is idle */
+ for_each_cpu(i, sched_group_cpus(sg)) {
+ if (i == target || !idle_cpu(i))
+ goto next;
+ }
 
- i = select_idle_smt(p, sd, target);
- if ((unsigned)i < nr_cpumask_bits)
- return i;
+ /*
+ * It doesn't matter which cpu we pick, the
+ * whole group is idle.
+ */
+ target = cpumask_first_and(sched_group_cpus(sg),
+ tsk_cpus_allowed(p));
+ goto done;
+next:
+ sg = sg->next;
+ } while (sg != sd->groups);
+ }
+done:
+ return target;
+ }
+
+ if (sched_feat(IDLE_CORE)) {
+ i = select_idle_core(p, sd, start);
+ if ((unsigned)i < nr_cpumask_bits)
+ return i;
+ }
+
+ if (sched_feat(IDLE_CPU)) {
+ i = select_idle_cpu(p, sd, start);
+ if ((unsigned)i < nr_cpumask_bits)
+ return i;
+ }
+
+ if (sched_feat(IDLE_SMT)) {
+ i = select_idle_smt(p, sd, start);
+ if ((unsigned)i < nr_cpumask_bits)
+ return i;
+ }
 
  return target;
 }
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -69,3 +69,13 @@ SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)
 SCHED_FEAT(ATTACH_AGE_LOAD, true)
 
+SCHED_FEAT(OLD_IDLE, false)
+SCHED_FEAT(ORDER_IDLE, false)
+
+SCHED_FEAT(IDLE_CORE, true)
+SCHED_FEAT(IDLE_CPU, true)
+SCHED_FEAT(AVG_CPU, true)
+SCHED_FEAT(PRINT_AVG, false)
+
+SCHED_FEAT(IDLE_SMT, true)
+
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -334,6 +334,13 @@ static struct ctl_table kern_table[] = {
  .proc_handler = proc_dointvec,
  },
  {
+ .procname = "sched_shift",
+ .data = &sysctl_sched_shift,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
  .procname = "sched_nr_migrate",
  .data = &sysctl_sched_nr_migrate,
  .maxlen = sizeof(unsigned int),



Re: [RFC][PATCH 2/7] sched: Restructure destroy_sched_domain()

Peter Zijlstra-5
In reply to this post by Peter Zijlstra-5
On Mon, May 09, 2016 at 12:48:09PM +0200, Peter Zijlstra wrote:

> There is no point in doing a call_rcu() for each domain; only do a
> callback for the root sched domain and clean up the entire set in one
> go.
>
> Also make the entire call chain be called destroy_sched_domain*() to
> remove confusion with the free_sched_domains() call, which does an
> entirely different thing.
>
> Both cpu_attach_domain() callers of destroy_sched_domain() can live
> without the call_rcu() because at those points the sched_domain hasn't
> been published yet.
>
> Signed-off-by: Peter Zijlstra (Intel) <[hidden email]>
> ---

This one seems to work much better; so much for last minute 'cleanups'.
(The previous version's destroy_sched_domains_rcu() walked sd->parent
after destroy_sched_domain() had already freed sd; the version below
saves the parent pointer before freeing each domain.)

---
 kernel/sched/core.c |   18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5824,10 +5824,8 @@ static void free_sched_groups(struct sch
  } while (sg != first);
 }
 
-static void free_sched_domain(struct rcu_head *rcu)
+static void destroy_sched_domain(struct sched_domain *sd)
 {
- struct sched_domain *sd = container_of(rcu, struct sched_domain, rcu);
-
  /*
  * If its an overlapping domain it has private groups, iterate and
  * nuke them all.
@@ -5841,15 +5839,21 @@ static void free_sched_domain(struct rcu
  kfree(sd);
 }
 
-static void destroy_sched_domain(struct sched_domain *sd)
+static void destroy_sched_domains_rcu(struct rcu_head *rcu)
 {
- call_rcu(&sd->rcu, free_sched_domain);
+ struct sched_domain *sd = container_of(rcu, struct sched_domain, rcu);
+
+ while (sd) {
+ struct sched_domain *parent = sd->parent;
+ destroy_sched_domain(sd);
+ sd = parent;
+ }
 }
 
 static void destroy_sched_domains(struct sched_domain *sd)
 {
- for (; sd; sd = sd->parent)
- destroy_sched_domain(sd);
+ if (sd)
+ call_rcu(&sd->rcu, destroy_sched_domains_rcu);
 }
 
 /*

Re: [RFC][PATCH 0/7] sched: select_idle_siblings rewrite

Chris Mason-6
In reply to this post by Peter Zijlstra-5
On Mon, May 09, 2016 at 12:48:07PM +0200, Peter Zijlstra wrote:
> Hai,
>
> here be a semi coherent patch series for the recent select_idle_siblings()
> tinkering. Happy benchmarking..

Thanks Peter,

I'll have some production numbers tomorrow, but based on schbench I'm
hoping it'll score better than my original.

You win on the pipe test (1.9MB/s vs 2.1MB/s) when a thread sibling is
pegged, and even when I double the think time on schbench, it holds up
well:

# Peter
# ./schbench -t 23 -m 2 -c 60000 -s 30000
Latency percentiles (usec)
        50.0000th: 50
        75.0000th: 60
        90.0000th: 69
        95.0000th: 73
        *99.0000th: 85
        99.5000th: 135
        99.9000th: 4012
        min=0, max=10873

# Mason
# ./schbench -t 23 -m 2 -c 60000 -s 30000
Latency percentiles (usec)
        50.0000th: 50
        75.0000th: 60
        90.0000th: 70
        95.0000th: 74
        *99.0000th: 83
        99.5000th: 88
        99.9000th: 118
        min=0, max=14770

# Mainline
# ./schbench -t 23 -m 2 -c 60000 -s 30000
Latency percentiles (usec)
        50.0000th: 47
        75.0000th: 60
        90.0000th: 70
        95.0000th: 79
        *99.0000th: 5400
        99.5000th: 10352
        99.9000th: 10992
        min=0, max=19642

Re: [RFC][PATCH 5/7] sched: Rewrite select_idle_siblings()

Yuyang Du
In reply to this post by Peter Zijlstra-5
On Mon, May 09, 2016 at 12:48:12PM +0200, Peter Zijlstra wrote:
> +/*
> + * Scan the LLC domain for idle CPUs; this is dynamically regulated by
> + * comparing the average scan cost (tracked in sd->avg_scan_cost) against the

                                       tracked in this_sd->avg_scan_cost

> + * average idle time for this rq (as found in rq->avg_idle).
> + */
> +static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
> +{
> + struct sched_domain *this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> + u64 avg_idle = this_rq()->avg_idle;
> + u64 avg_cost = this_sd->avg_scan_cost;
> + u64 time, cost;
> + s64 delta;
> + int cpu, wrap;
> +
> + /*
> + * Due to large variance we need a large fuzz factor; hackbench in
> + * particularly is sensitive here.
> + */
> + if ((avg_idle / 512) < avg_cost)
> + return -1;
> +
> + time = local_clock();
> +
> + for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {
> + if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
> + continue;
> + if (idle_cpu(cpu))
> + break;
> + }
> +
> + time = local_clock() - time;
> + cost = this_sd->avg_scan_cost;
> + delta = (s64)(time - cost) / 8;
> + this_sd->avg_scan_cost += delta;
> +
> + return cpu;
> +}

[snip]

> +
> + i = select_idle_core(p, sd, target);
> + if ((unsigned)i < nr_cpumask_bits)
> + return i;
> +
> + i = select_idle_cpu(p, sd, target);
> + if ((unsigned)i < nr_cpumask_bits)
> + return i;
> +
> + i = select_idle_smt(p, sd, target);
> + if ((unsigned)i < nr_cpumask_bits)
> + return i;

First, on SMT these three scans have a lot of overlap, but that also has an
opportunistic-spinning effect, if that was intended, which is good. Anyway,
maybe you should write something about it, and about why only the CPU scan,
not the core scan, considers its cost.

Then, I am still considering combining them a bit, as in the following patch.
And if you want, more could be combined.

P.S. The idle_smt_cpu may not be the first idle CPU, but one more i++ can
make it so.

---

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 25bd5b0..648a15c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5281,7 +5281,7 @@ unlock:
 static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
 {
  struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
- int core, cpu, wrap;
+ int core, cpu, wrap, i = 0, idle_smt_cpu = -1;
 
  if (!static_branch_likely(&sched_smt_present))
  return -1;
@@ -5298,8 +5298,14 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
  cpumask_clear_cpu(cpu, cpus);
  if (!idle_cpu(cpu))
  idle = false;
+ /*
+ * First smt must be target's smt, and any cpu here is allowed
+ */
+ else if (i == 0)
+ idle_smt_cpu = cpu;
  }
 
+ i++;
  if (idle)
  return core;
  }

Re: [RFC][PATCH 5/7] sched: Rewrite select_idle_siblings()

Peter Zijlstra-5
On Wed, May 11, 2016 at 05:05:50AM +0800, Yuyang Du wrote:

> On Mon, May 09, 2016 at 12:48:12PM +0200, Peter Zijlstra wrote:
> > + i = select_idle_core(p, sd, target);
> > + if ((unsigned)i < nr_cpumask_bits)
> > + return i;
> > +
> > + i = select_idle_cpu(p, sd, target);
> > + if ((unsigned)i < nr_cpumask_bits)
> > + return i;
> > +
> > + i = select_idle_smt(p, sd, target);
> > + if ((unsigned)i < nr_cpumask_bits)
> > + return i;
>
> First, on smt, these three scans have a lot of overlap,

Limited, we stop doing the idle_core scan the moment we fail to find
one. And on a busy system it is unlikely to come back.

Which leaves us with the straight idle_cpu scan. Sure, the one time we
fail to find an idle core and fall through we'll scan double, but that
should be rare.

> Then, I am still considering combining them a bit, like the following patch.
> And if you want, more might be combined.

I tried that (and you really want the last idle it finds, to minimize
the time between testing for idle and returning it), but it didn't
matter for anything I could find.

Re: [RFC][PATCH 5/7] sched: Rewrite select_idle_siblings()

Yuyang Du
On Wed, May 11, 2016 at 09:00:29AM +0200, Peter Zijlstra wrote:

> On Wed, May 11, 2016 at 05:05:50AM +0800, Yuyang Du wrote:
> > On Mon, May 09, 2016 at 12:48:12PM +0200, Peter Zijlstra wrote:
> > > + i = select_idle_core(p, sd, target);
> > > + if ((unsigned)i < nr_cpumask_bits)
> > > + return i;
> > > +
> > > + i = select_idle_cpu(p, sd, target);
> > > + if ((unsigned)i < nr_cpumask_bits)
> > > + return i;
> > > +
> > > + i = select_idle_smt(p, sd, target);
> > > + if ((unsigned)i < nr_cpumask_bits)
> > > + return i;
> >
> > First, on smt, these three scans have a lot of overlap,
>
> Limited, we stop doing the idle_core scan the moment we fail to find
> one. And on a busy system it is unlikely to come back.
>
> Which leaves us with the straight idle_cpu scan. Sure, the one time we
> fail to find an idle core and fall through we'll scan double, but that
> should be rare.

Yes, I was sorta well educated about how important finding an idle CPU is
("one stack, game over"), so it is surely worth the effort.

Looks to me like it is just about how much opportunistic spinning should be
applied here. Great.

Do you have any suggestion about doing other part of select_task_rq_fair?

Re: [RFC][PATCH 5/7] sched: Rewrite select_idle_siblings()

Mike Galbraith-3
On Wed, 2016-05-11 at 07:42 +0800, Yuyang Du wrote:

> Do you have any suggestion about doing other part of
> select_task_rq_fair?

Conjure up a decent metric for placing tasks in tasty hot L2?  That
would be really lovely to have (/me pondering page faults..).

We used to have an avg_overlap heuristic, which would let us stack high
speed ping-pong like things such that they could really smoke.  It was
too fragile though, so had to go away.

        -Mike

Re: [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared

Matt Fleming-3
In reply to this post by Peter Zijlstra-5
On Mon, 09 May, at 12:48:11PM, Peter Zijlstra wrote:

> Move the nr_busy_cpus thing from its hacky sd->parent->groups->sgc
> location into the much more natural sched_domain_shared location.
>
> Signed-off-by: Peter Zijlstra (Intel) <[hidden email]>
> ---
>  include/linux/sched.h    |    1 +
>  kernel/sched/core.c      |   10 +++++-----
>  kernel/sched/fair.c      |   22 ++++++++++++----------
>  kernel/sched/sched.h     |    6 +-----
>  kernel/time/tick-sched.c |   10 +++++-----
>  5 files changed, 24 insertions(+), 25 deletions(-)
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1059,6 +1059,7 @@ struct sched_group;
>  
>  struct sched_domain_shared {
>   atomic_t ref;
> + atomic_t nr_busy_cpus;
>  };
>  
>  struct sched_domain {

[...]

> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7842,13 +7842,13 @@ static inline void set_cpu_sd_state_busy
>   int cpu = smp_processor_id();
>  
>   rcu_read_lock();
> - sd = rcu_dereference(per_cpu(sd_busy, cpu));
> + sd = rcu_dereference(per_cpu(sd_llc, cpu));
>  
>   if (!sd || !sd->nohz_idle)
>   goto unlock;
>   sd->nohz_idle = 0;
>  
> - atomic_inc(&sd->groups->sgc->nr_busy_cpus);
> + atomic_inc(&sd->shared->nr_busy_cpus);
>  unlock:
>   rcu_read_unlock();
>  }

This breaks my POWER7 box which presumably doesn't have SD_SHARE_PKG_RESOURCES,

  NIP [c00000000012de68] .set_cpu_sd_state_idle+0x58/0x80
  LR [c00000000017ded4] .tick_nohz_idle_enter+0x24/0x90
  Call Trace:
  [c0000007774b7cf0] [c0000007774b4080] 0xc0000007774b4080 (unreliable)
  [c0000007774b7d60] [c00000000017ded4] .tick_nohz_idle_enter+0x24/0x90
  [c0000007774b7dd0] [c000000000137200] .cpu_startup_entry+0xe0/0x440
  [c0000007774b7ee0] [c00000000004739c] .start_secondary+0x35c/0x3a0
  [c0000007774b7f90] [c000000000008bfc] start_secondary_prolog+0x10/0x14

The following fixes it,

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 978b3ef2d87e..d27153adee4d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7920,7 +7920,8 @@ static inline void set_cpu_sd_state_busy(void)
  goto unlock;
  sd->nohz_idle = 0;
 
- atomic_inc(&sd->shared->nr_busy_cpus);
+ if (sd->shared)
+ atomic_inc(&sd->shared->nr_busy_cpus);
 unlock:
  rcu_read_unlock();
 }
@@ -7937,7 +7938,8 @@ void set_cpu_sd_state_idle(void)
  goto unlock;
  sd->nohz_idle = 1;
 
- atomic_dec(&sd->shared->nr_busy_cpus);
+ if (sd->shared)
+ atomic_dec(&sd->shared->nr_busy_cpus);
 unlock:
  rcu_read_unlock();
 }

Re: [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared

Peter Zijlstra-5
On Wed, May 11, 2016 at 12:55:56PM +0100, Matt Fleming wrote:

> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7842,13 +7842,13 @@ static inline void set_cpu_sd_state_busy
> >   int cpu = smp_processor_id();
> >  
> >   rcu_read_lock();
> > - sd = rcu_dereference(per_cpu(sd_busy, cpu));
> > + sd = rcu_dereference(per_cpu(sd_llc, cpu));
> >  
> >   if (!sd || !sd->nohz_idle)
> >   goto unlock;
> >   sd->nohz_idle = 0;
> >  
> > - atomic_inc(&sd->groups->sgc->nr_busy_cpus);
> > + atomic_inc(&sd->shared->nr_busy_cpus);
> >  unlock:
> >   rcu_read_unlock();
> >  }
>
> This breaks my POWER7 box which presumably doesn't have SD_SHARE_PKG_RESOURCES,
>

Hmm, PPC folks; what does your topology look like?

Currently your sched_domain_topology, as per arch/powerpc/kernel/smp.c
seems to suggest your cores do not share cache at all.

https://en.wikipedia.org/wiki/POWER7 seems to agree and states

  "4 MB L3 cache per C1 core"

And http://www-03.ibm.com/systems/resources/systems_power_software_i_perfmgmt_underthehood.pdf
also explicitly draws pictures with the L3 per core.

_however_, that same document describes L3 inter-core fill and lateral
cast-out, which sounds like the L3s work together to form a node wide
caching system.

Do we want to model this co-operative L3 slices thing as a sort of
node-wide LLC for the purpose of the scheduler ?

While we should definitely fix the assumption that an LLC exists (and I
need to look at why it isn't set to the core domain instead as well),
the scheduler does try and scale things by 'assuming' LLC := node.

It does this for NOHZ, and these here patches under discussion would be
doing the same for idle-core state.

Would this make sense for power, or should we somehow think of something
else?

Re: [RFC][PATCH 0/7] sched: select_idle_siblings rewrite

Chris Mason-6
In reply to this post by Peter Zijlstra-5
On Mon, May 09, 2016 at 12:48:07PM +0200, Peter Zijlstra wrote:
> Hai,
>
> here be a semi coherent patch series for the recent select_idle_siblings()
> tinkering. Happy benchmarking..

I ran a few more rounds of the production benchmarks, and NO_AVG_CPU is
consistently faster by about 5% than AVG_CPU.  I think what's happening
here is the production runs are ramping up load in a finer grained setup
than schbench, and production is able to see the AVG_CPU calculations
back off the scan too soon (for us anyway).

I'm going to play around with schbench to try and model this better, but
so far this is a clear win over unpatched v4.6.

-chris

Re: [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared

Peter Zijlstra-5
In reply to this post by Matt Fleming-3
On Wed, May 11, 2016 at 12:55:56PM +0100, Matt Fleming wrote:

> This breaks my POWER7 box which presumably doesn't have SD_SHARE_PKG_RESOURCES,

> index 978b3ef2d87e..d27153adee4d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7920,7 +7920,8 @@ static inline void set_cpu_sd_state_busy(void)
>   goto unlock;
>   sd->nohz_idle = 0;
>  
> - atomic_inc(&sd->shared->nr_busy_cpus);
> + if (sd->shared)
> + atomic_inc(&sd->shared->nr_busy_cpus);
>  unlock:
>   rcu_read_unlock();
>  }


Ah, no, the problem is that while it does have SHARE_PKG_RESOURCES (in
its SMT domain -- SMT threads share all cache after all), I failed to
connect the sched_domain_shared structure for it.

Does something like this also work?

---
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6389,10 +6389,6 @@ sd_init(struct sched_domain_topology_lev
  sd->cache_nice_tries = 1;
  sd->busy_idx = 2;
 
- sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
- atomic_inc(&sd->shared->ref);
- atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
-
 #ifdef CONFIG_NUMA
  } else if (sd->flags & SD_NUMA) {
  sd->cache_nice_tries = 2;
@@ -6414,6 +6410,16 @@ sd_init(struct sched_domain_topology_lev
  sd->idle_idx = 1;
  }
 
+ /*
+ * For all levels sharing cache, connect a sched_domain_shared
+ * instance.
+ */
+ if (sd->flags & SD_SHARE_PKG_RESOURCES) {
+ sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
+ atomic_inc(&sd->shared->ref);
+ atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
+ }
+
  sd->private = sdd;
 
  return sd;

Re: [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared

Matt Fleming-3
On Wed, 11 May, at 07:37:52PM, Peter Zijlstra wrote:

> On Wed, May 11, 2016 at 12:55:56PM +0100, Matt Fleming wrote:
>
> > This breaks my POWER7 box which presumably doesn't have SD_SHARE_PKG_RESOURCES,
>
> > index 978b3ef2d87e..d27153adee4d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7920,7 +7920,8 @@ static inline void set_cpu_sd_state_busy(void)
> >   goto unlock;
> >   sd->nohz_idle = 0;
> >  
> > - atomic_inc(&sd->shared->nr_busy_cpus);
> > + if (sd->shared)
> > + atomic_inc(&sd->shared->nr_busy_cpus);
> >  unlock:
> >   rcu_read_unlock();
> >  }
>
>
> Ah, no, the problem is that while it does have SHARE_PKG_RESOURCES (in
> its SMT domain -- SMT threads share all cache after all), I failed to
> connect the sched_domain_shared structure for it.
>
> Does something like this also work?

Yep, it does.

Re: [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared

Peter Zijlstra-5
In reply to this post by Peter Zijlstra-5
On Wed, May 11, 2016 at 02:33:45PM +0200, Peter Zijlstra wrote:
> Do we want to model this co-operative L3 slices thing as a sort of
> node-wide LLC for the purpose of the scheduler ?
>
> the scheduler does try and scale things by 'assuming' LLC := node.

So this whole series is about selecting idle CPUs to run stuff on. With
the current PPC setup we would never consider placing a task outside of
its core.

If we were to add a node wide cache domain, the scheduler would look to
place a task across cores if there is an idle core to be had.

Would something like that be beneficial?