[PATCH v9 00/13] support "task_isolation" mode for nohz_full

Chris Metcalf-2
It has been a couple of months since the v8 version of this patch
series, as various other priorities came up at work.  Since it's been
a while I will try to summarize where I think we got to on the
various issues that were raised with v8.

1. Andy Lutomirski raised the issue of whether it really made sense to
   only attempt to set up the conditions for task isolation, ask the kernel
   nicely for it, and then wait until it happened.  He wondered if a
   SCHED_ISOLATED class might be a helpful abstraction.  Steven Rostedt
   also suggested having an interface that would force everything else
   off a core to enable SCHED_ISOLATED to succeed.  Frederic added
   some concerns about enforcing the test that the process was in a
   good state to enter task isolation.

   I tried to address the different design philosophies for what I called
   the original "polite" mode and the reviewers' suggestions for an
   "aggressive" mode in this email:

   https://lkml.org/lkml/2015/10/26/625

   As I said there, on balance I think the "polite" option is still
   better.  Obviously folks are welcome to disagree and I'm happy to
   continue that conversation (or perhaps I convinced everyone).

2. Andy didn't like the idea of having a "STRICT" mode which
   delivered a signal to a process for violating its promise to stay
   out of the kernel.  Gilad Ben Yossef argued that
   it made sense to have a way for the kernel to enforce the requested
   correctness guarantee of never being interrupted.  Andy pointed out
   that we should then really deliver such a signal when the kernel
   delivers an asynchronous interrupt to the core as well.  In particular
   this is a concern for the application-error case of a process that
   calls munmap() on one core while a thread on another core is running
   STRICT, and thus gets an unexpected TLB flush.

   This patch series addresses that concern by including support for
   IRQs, IPIs, and similar asynchronous interrupts to also send the
   STRICT signal to the process.  We don't try to send the signal if
   we are in an NMI, and instead just force a console backtrace like
   you would get in task_isolation_debug mode.

3. Frederic nacked my patch for a boot flag to disable the 1Hz
   periodic scheduler tick.

   I'm still hoping he's open to changing his mind about that, but in
   this patch series I have removed that boot flag.

Various other changes have been introduced since v8:

https://lkml.kernel.org/r/1445373372-6567-1-git-send-email-cmetcalf@...

- Rebased to Linux 4.4-rc5.

- Since nohz_full and isolcpus have been separated back out again in
  4.4, I introduced a new task_isolation=MASK boot argument that sets
  both of them.  The task isolation support now requires that this
  boot flag have been used; it intentionally doesn't work if you've
  just enabled nohz_full and isolcpus separately.  I could be
  convinced that doing it the other way around makes sense, though.

- I folded the two STRICT mode patches together since there didn't
  seem to be much value in having the second patch that just enabled
  having a settable signal.  I also refactored the various routines
  that report on interrupts/exceptions/etc to make it easier to hook
  in from the case where we are interrupted asynchronously.

- For the debug support, I moved most of the functionality into
  kernel/isolation.c and out of kernel/sched/core.c, leaving only a
  small hook to handle mapping a remote cpu to a task struct safely.
  In addition to implementing Andy's suggestion of signalling a task
  when it is interrupted asynchronously, I also added a ratelimit
  hook so we won't spam the console if (for example) a timer interrupt
  runs amok - particularly since, without ratelimiting, the console
  output itself can end up perpetuating the timer interrupt.

- I added a task_isolation_debug_cpumask() helper function to check
  all the cpus in a mask to see if they are being interrupted
  inappropriately.

- I made the check for irq_enter() robust to architectures that
  have already entered user mode context_tracking before calling
  irq_enter() by testing user_mode(get_irq_regs()) instead of
  context_tracking_in_user(), and split out the code to a separate
  inlined function so I could comment it better.

- For arm64, I added a task_isolation_debug_cpumask() hook for
  smp_cross_call(), which I had missed in the earlier versions.

- I generalized the fix for tile to set up a clockevents hook for
  set_state_oneshot_stopped() to also apply to the arm_arch_timer,
  which I realized was showing the same problem.  For both cases,
  this seems to be what Viresh had in mind with commit 8fff52fd509345
  ("clockevents: Introduce CLOCK_EVT_STATE_ONESHOT_STOPPED state").

- For tile, I adopted the arm model of doing user_exit() calls in the
  early assembly code (a new patch in this series).  I also added a
  missing task_isolation_debug hook for tile's IPI and remote cache
  flush code.

Chris Metcalf (12):
  vmstat: add vmstat_idle function
  lru_add_drain_all: factor out lru_add_drain_needed
  task_isolation: add initial support
  task_isolation: support PR_TASK_ISOLATION_STRICT mode
  task_isolation: add debug boot flag
  arch/x86: enable task isolation functionality
  arch/arm64: adopt prepare_exit_to_usermode() model from x86
  arch/arm64: enable task isolation functionality
  arch/tile: adopt prepare_exit_to_usermode() model from x86
  arch/tile: move user_exit() to early kernel entry sequence
  arch/tile: enable task isolation functionality
  arm, tile: turn off timer tick for oneshot_stopped state

Christoph Lameter (1):
  vmstat: provide a function to quiet down the diff processing

 Documentation/kernel-parameters.txt  |  16 +++
 arch/arm64/include/asm/thread_info.h |  18 ++-
 arch/arm64/kernel/entry.S            |   6 +-
 arch/arm64/kernel/ptrace.c           |  12 +-
 arch/arm64/kernel/signal.c           |  35 ++++--
 arch/arm64/kernel/smp.c              |   2 +
 arch/arm64/mm/fault.c                |   4 +
 arch/tile/include/asm/processor.h    |   2 +-
 arch/tile/include/asm/thread_info.h  |   8 +-
 arch/tile/kernel/intvec_32.S         |  51 +++-----
 arch/tile/kernel/intvec_64.S         |  54 +++------
 arch/tile/kernel/process.c           |  83 +++++++------
 arch/tile/kernel/ptrace.c            |  19 +--
 arch/tile/kernel/single_step.c       |   8 +-
 arch/tile/kernel/smp.c               |  26 ++--
 arch/tile/kernel/time.c              |   1 +
 arch/tile/kernel/traps.c             |  13 +-
 arch/tile/kernel/unaligned.c         |  16 ++-
 arch/tile/mm/fault.c                 |   6 +-
 arch/tile/mm/homecache.c             |   2 +
 arch/x86/entry/common.c              |  10 +-
 arch/x86/kernel/traps.c              |   2 +
 arch/x86/mm/fault.c                  |   2 +
 drivers/clocksource/arm_arch_timer.c |   2 +
 include/linux/isolation.h            |  80 +++++++++++++
 include/linux/sched.h                |   3 +
 include/linux/swap.h                 |   1 +
 include/linux/vmstat.h               |   4 +
 include/uapi/linux/prctl.h           |   8 ++
 init/Kconfig                         |  20 ++++
 kernel/Makefile                      |   1 +
 kernel/irq_work.c                    |   5 +-
 kernel/isolation.c                   | 225 +++++++++++++++++++++++++++++++++++
 kernel/sched/core.c                  |  18 +++
 kernel/signal.c                      |   5 +
 kernel/smp.c                         |   6 +-
 kernel/softirq.c                     |  33 +++++
 kernel/sys.c                         |   9 ++
 mm/swap.c                            |  13 +-
 mm/vmstat.c                          |  24 ++++
 40 files changed, 665 insertions(+), 188 deletions(-)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

--
2.1.2

[PATCH v9 01/13] vmstat: provide a function to quiet down the diff processing

Chris Metcalf-2
From: Christoph Lameter <[hidden email]>

quiet_vmstat() can be called in anticipation of an OS "quiet" period
where no tick processing should be triggered. quiet_vmstat() will fold
all pending differentials into the global counters and disable the
vmstat_worker processing.

Note that the shepherd thread will continue scanning the differentials
from another processor and will reenable the vmstat workers if it
detects any changes.

Signed-off-by: Christoph Lameter <[hidden email]>
Signed-off-by: Chris Metcalf <[hidden email]>
---
 include/linux/vmstat.h |  2 ++
 mm/vmstat.c            | 14 ++++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 5dbc8b0ee567..6f5a21993ff3 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -189,6 +189,7 @@ extern void __inc_zone_state(struct zone *, enum zone_stat_item);
 extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 
+void quiet_vmstat(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -249,6 +250,7 @@ static inline void __dec_zone_page_state(struct page *page,
 
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
+static inline void quiet_vmstat(void) { }
 
 static inline void drain_zonestat(struct zone *zone,
  struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 0d5712b0206c..0510d2ec31a6 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1418,6 +1418,20 @@ static void vmstat_update(struct work_struct *w)
 }
 
 /*
+ * Switch off vmstat processing and then fold all the remaining differentials
+ * until the diffs stay at zero. The function is used by NOHZ and can only be
+ * invoked when tick processing is not active.
+ */
+void quiet_vmstat(void)
+{
+ do {
+ if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
+ cancel_delayed_work(this_cpu_ptr(&vmstat_work));
+
+ } while (refresh_cpu_vm_stats());
+}
+
+/*
  * Check if the diffs for a certain cpu indicate that
  * an update is needed.
  */
--
2.1.2

[PATCH v9 02/13] vmstat: add vmstat_idle function

Chris Metcalf-2
This function checks that the vmstat worker is not running
and that the vmstat diffs don't require an update.  The function is
called from the task-isolation code to see if we need to
actually do some work to quiet vmstat.
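
For context, the task-isolation core added later in this series is
expected to use the two vmstat helpers roughly as follows.  This is
a sketch only (the wrapper name is hypothetical), not the literal
code from kernel/isolation.c:

	/* Sketch: quiesce vmstat work and report whether we are done. */
	static bool isolation_vmstat_quiesce(void)
	{
		if (!vmstat_idle())
			quiet_vmstat();	/* fold diffs, stop the worker */
		return vmstat_idle();	/* true if nothing is left to do */
	}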

Acked-by: Christoph Lameter <[hidden email]>
Signed-off-by: Chris Metcalf <[hidden email]>
---
 include/linux/vmstat.h |  2 ++
 mm/vmstat.c            | 10 ++++++++++
 2 files changed, 12 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 6f5a21993ff3..3dc82bf5bce6 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -190,6 +190,7 @@ extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 
 void quiet_vmstat(void);
+bool vmstat_idle(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -251,6 +252,7 @@ static inline void __dec_zone_page_state(struct page *page,
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
+static inline bool vmstat_idle(void) { return true; }
 
 static inline void drain_zonestat(struct zone *zone,
  struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 0510d2ec31a6..ccc390197464 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1454,6 +1454,16 @@ static bool need_update(int cpu)
  return false;
 }
 
+/*
+ * Report on whether vmstat processing is quiesced on the core currently:
+ * no vmstat worker running and no vmstat updates to perform.
+ */
+bool vmstat_idle(void)
+{
+ int cpu = smp_processor_id();
+ return cpumask_test_cpu(cpu, cpu_stat_off) && !need_update(cpu);
+}
+
 
 /*
  * Shepherd worker thread that checks the
--
2.1.2

[PATCH v9 03/13] lru_add_drain_all: factor out lru_add_drain_needed

Chris Metcalf-2
This per-cpu check was being done in the loop in lru_add_drain_all(),
but having it be callable for a particular cpu is helpful for the
task-isolation patches.
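
A sketch of the intended per-cpu use in the task-isolation code (the
wrapper function here is hypothetical; only lru_add_drain_needed()
and lru_add_drain() are from this patch and the existing code):

	/* Sketch: drain this cpu's lru pagevecs only if there is work. */
	static void isolation_drain_lru(void)
	{
		int cpu = get_cpu();

		if (lru_add_drain_needed(cpu))
			lru_add_drain();	/* drains the local cpu */
		put_cpu();
	}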

Signed-off-by: Chris Metcalf <[hidden email]>
---
 include/linux/swap.h |  1 +
 mm/swap.c            | 13 +++++++++----
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7ba7dccaf0e7..66719610c9f5 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -305,6 +305,7 @@ extern void activate_page(struct page *);
 extern void mark_page_accessed(struct page *);
 extern void lru_add_drain(void);
 extern void lru_add_drain_cpu(int cpu);
+extern bool lru_add_drain_needed(int cpu);
 extern void lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_file_page(struct page *page);
diff --git a/mm/swap.c b/mm/swap.c
index 39395fb549c0..ce1eb052a293 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -854,6 +854,14 @@ void deactivate_file_page(struct page *page)
  }
 }
 
+bool lru_add_drain_needed(int cpu)
+{
+ return (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
+ pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
+ pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+ need_activate_page_drain(cpu));
+}
+
 void lru_add_drain(void)
 {
  lru_add_drain_cpu(get_cpu());
@@ -880,10 +888,7 @@ void lru_add_drain_all(void)
  for_each_online_cpu(cpu) {
  struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
 
- if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
-    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
-    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
-    need_activate_page_drain(cpu)) {
+ if (lru_add_drain_needed(cpu)) {
  INIT_WORK(work, lru_add_drain_per_cpu);
  schedule_work_on(cpu, work);
  cpumask_set_cpu(cpu, &has_work);
--
2.1.2

[PATCH v9 06/13] task_isolation: add debug boot flag

Chris Metcalf-2
The new "task_isolation_debug" flag simplifies debugging
of TASK_ISOLATION kernels when processes are running in
PR_TASK_ISOLATION_ENABLE mode.  Such processes should get no
interrupts from the kernel; if they do, and this boot flag is
specified, a kernel stack dump is generated on the console.

It's possible to use ftrace to simply detect whether a task_isolation
core has unexpectedly entered the kernel.  But what this boot flag
does is allow the kernel to provide better diagnostics, e.g. by
reporting in the IPI-generating code which remote core and context
is preparing to deliver an interrupt to a task_isolation core.

It may be worth considering other ways to generate useful debugging
output rather than console spew, but for now that is simple and direct.
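
As an example-only command line, a machine set up for task isolation
with this debugging enabled might boot with something like:

	task_isolation=1-3 task_isolation_debug

where "1-3" is just a sample cpu list.  When one of those cpus is
running a PR_TASK_ISOLATION_ENABLE task and is about to be
interrupted, the kernel emits the "Interrupt detected for
task_isolation cpu ..." message and stack dump added below.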

Signed-off-by: Chris Metcalf <[hidden email]>
---
 Documentation/kernel-parameters.txt |  8 +++++
 include/linux/isolation.h           |  5 ++++
 kernel/irq_work.c                   |  5 +++-
 kernel/isolation.c                  | 60 +++++++++++++++++++++++++++++++++++++
 kernel/sched/core.c                 | 18 +++++++++++
 kernel/signal.c                     |  5 ++++
 kernel/smp.c                        |  6 +++-
 kernel/softirq.c                    | 33 ++++++++++++++++++++
 8 files changed, 138 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index e035679e646e..112fba1727f4 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3673,6 +3673,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
  also sets up nohz_full and isolcpus mode for the
  listed set of cpus.
 
+ task_isolation_debug [KNL]
+ In kernels built with CONFIG_TASK_ISOLATION
+ and booted in task_isolation= mode, this
+ setting will generate console backtraces when
+ the kernel is about to interrupt a task that
+ has requested PR_TASK_ISOLATION_ENABLE and is
+ running on a task_isolation core.
+
  tcpmhash_entries= [KNL,NET]
  Set the number of tcp_metrics_hash slots.
  Default value is 8192 or 16384 depending on total
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index 69a3e4c59ab3..3e15e75d078f 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -43,6 +43,9 @@ static inline void task_isolation_enter(void)
 extern bool task_isolation_syscall(int nr);
 extern void task_isolation_exception(const char *fmt, ...);
 extern void task_isolation_interrupt(struct task_struct *, const char *buf);
+extern void task_isolation_debug(int cpu);
+extern void task_isolation_debug_cpumask(const struct cpumask *);
+extern void task_isolation_debug_task(int cpu, struct task_struct *p);
 
 static inline bool task_isolation_strict(void)
 {
@@ -70,6 +73,8 @@ static inline bool task_isolation_ready(void) { return true; }
 static inline void task_isolation_enter(void) { }
 static inline bool task_isolation_check_syscall(int nr) { return false; }
 static inline void task_isolation_check_exception(const char *fmt, ...) { }
+static inline void task_isolation_debug(int cpu) { }
+#define task_isolation_debug_cpumask(mask) do {} while (0)
 #endif
 
 #endif
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index bcf107ce0854..a9b95ce00667 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -17,6 +17,7 @@
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/smp.h>
+#include <linux/isolation.h>
 #include <asm/processor.h>
 
 
@@ -75,8 +76,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
  if (!irq_work_claim(work))
  return false;
 
- if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+ if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+ task_isolation_debug(cpu);
  arch_send_call_function_single_ipi(cpu);
+ }
 
  return true;
 }
diff --git a/kernel/isolation.c b/kernel/isolation.c
index 29ffb21ada0b..9f31c0b458ed 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -11,6 +11,7 @@
 #include <linux/vmstat.h>
 #include <linux/isolation.h>
 #include <linux/syscalls.h>
+#include <linux/ratelimit.h>
 #include <asm/unistd.h>
 #include "time/tick-sched.h"
 
@@ -163,3 +164,62 @@ bool task_isolation_syscall(int syscall)
  task_isolation_exception("syscall %d", syscall);
  return true;
 }
+
+/* Enable debugging of any interrupts of task_isolation cores. */
+static int task_isolation_debug_flag;
+static int __init task_isolation_debug_func(char *str)
+{
+ task_isolation_debug_flag = true;
+ return 1;
+}
+__setup("task_isolation_debug", task_isolation_debug_func);
+
+void task_isolation_debug_task(int cpu, struct task_struct *p)
+{
+ static DEFINE_RATELIMIT_STATE(console_output, HZ, 1);
+ bool force_debug = false;
+
+ /*
+ * Our caller made sure the task was running on a task isolation
+ * core, but make sure the task has enabled isolation.
+ */
+ if (!(p->task_isolation_flags & PR_TASK_ISOLATION_ENABLE))
+ return;
+
+ /*
+ * If the task was in strict mode, deliver a signal to it.
+ * We disable task isolation mode when we deliver a signal
+ * so we won't end up recursing back here again.
+ * If we are in an NMI, we don't try delivering the signal
+ * and instead just treat it as if "debug" mode was enabled,
+ * since that's pretty much all we can do.
+ */
+ if (p->task_isolation_flags & PR_TASK_ISOLATION_STRICT) {
+ if (in_nmi())
+ force_debug = true;
+ else
+ task_isolation_interrupt(p, "interrupt");
+ }
+
+ /*
+ * If (for example) the timer interrupt starts ticking
+ * unexpectedly, we will get an unmanageable flow of output,
+ * so limit to one backtrace per second.
+ */
+ if (force_debug ||
+    (task_isolation_debug_flag && __ratelimit(&console_output))) {
+ pr_err("Interrupt detected for task_isolation cpu %d, %s/%d\n",
+       cpu, p->comm, p->pid);
+ dump_stack();
+ }
+}
+
+void task_isolation_debug_cpumask(const struct cpumask *mask)
+{
+ int cpu, thiscpu = smp_processor_id();
+
+ /* No need to report on this cpu since we're already in the kernel. */
+ for_each_cpu(cpu, mask)
+ if (cpu != thiscpu)
+ task_isolation_debug(cpu);
+}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 732e993b564b..700120221f6b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -74,6 +74,7 @@
 #include <linux/binfmts.h>
 #include <linux/context_tracking.h>
 #include <linux/compiler.h>
+#include <linux/isolation.h>
 
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
@@ -746,6 +747,23 @@ bool sched_can_stop_tick(void)
 }
 #endif /* CONFIG_NO_HZ_FULL */
 
+#ifdef CONFIG_TASK_ISOLATION
+void task_isolation_debug(int cpu)
+{
+ struct task_struct *p;
+
+ if (!task_isolation_possible(cpu))
+ return;
+
+ rcu_read_lock();
+ p = cpu_curr(cpu);
+ get_task_struct(p);
+ rcu_read_unlock();
+ task_isolation_debug_task(cpu, p);
+ put_task_struct(p);
+}
+#endif
+
 void sched_avg_update(struct rq *rq)
 {
  s64 period = sched_avg_period();
diff --git a/kernel/signal.c b/kernel/signal.c
index f3f1f7a972fd..c45ef71f329c 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -638,6 +638,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
  */
 void signal_wake_up_state(struct task_struct *t, unsigned int state)
 {
+#ifdef CONFIG_TASK_ISOLATION
+ /* If the task is being killed, don't complain about task_isolation. */
+ if (state & TASK_WAKEKILL)
+ t->task_isolation_flags = 0;
+#endif
  set_tsk_thread_flag(t, TIF_SIGPENDING);
  /*
  * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/smp.c b/kernel/smp.c
index d903c02223af..a61894409645 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -14,6 +14,7 @@
 #include <linux/smp.h>
 #include <linux/cpu.h>
 #include <linux/sched.h>
+#include <linux/isolation.h>
 
 #include "smpboot.h"
 
@@ -178,8 +179,10 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
  * locking and barrier primitives. Generic code isn't really
  * equipped to do the right thing...
  */
- if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
+ if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu))) {
+ task_isolation_debug(cpu);
  arch_send_call_function_single_ipi(cpu);
+ }
 
  return 0;
 }
@@ -457,6 +460,7 @@ void smp_call_function_many(const struct cpumask *mask,
  }
 
  /* Send a message to all CPUs in the map */
+ task_isolation_debug_cpumask(cfd->cpumask);
  arch_send_call_function_ipi_mask(cfd->cpumask);
 
  if (wait) {
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 479e4436f787..f249b71cddf4 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -26,6 +26,7 @@
 #include <linux/smpboot.h>
 #include <linux/tick.h>
 #include <linux/irq.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/irq.h>
@@ -319,6 +320,37 @@ asmlinkage __visible void do_softirq(void)
  local_irq_restore(flags);
 }
 
+/* Determine whether this IRQ is something task isolation cares about. */
+static void task_isolation_irq(void)
+{
+#ifdef CONFIG_TASK_ISOLATION
+ struct pt_regs *regs;
+
+ if (!context_tracking_cpu_is_enabled())
+ return;
+
+ /*
+ * We have not yet called __irq_enter() and so we haven't
+ * adjusted the hardirq count.  This test will allow us to
+ * avoid false positives for nested IRQs.
+ */
+ if (in_interrupt())
+ return;
+
+ /*
+ * If we were already in the kernel, not from an irq but from
+ * a syscall or synchronous exception/fault, this test should
+ * avoid a false positive as well.  Note that this requires
+ * architecture support for calling set_irq_regs() prior to
+ * calling irq_enter(), and if it's not done consistently, we
+ * will not consistently avoid false positives here.
+ */
+ regs = get_irq_regs();
+ if (regs && user_mode(regs))
+ task_isolation_debug(smp_processor_id());
+#endif
+}
+
 /*
  * Enter an interrupt context.
  */
@@ -335,6 +367,7 @@ void irq_enter(void)
  _local_bh_enable();
  }
 
+ task_isolation_irq();
  __irq_enter();
 }
 
--
2.1.2

[PATCH v9 07/13] arch/x86: enable task isolation functionality

Chris Metcalf-2
In prepare_exit_to_usermode(), call task_isolation_ready()
when we are checking the thread-info flags, and after we've handled
the other work, call task_isolation_enter() unconditionally.

In syscall_trace_enter_phase1(), we add the necessary support for
strict-mode detection of syscalls.

We add strict reporting for the kernel exception types that do
not result in signals, namely non-signalling page faults and
non-signalling MPX fixups.

Signed-off-by: Chris Metcalf <[hidden email]>
---
 arch/x86/entry/common.c | 10 +++++++++-
 arch/x86/kernel/traps.c |  2 ++
 arch/x86/mm/fault.c     |  2 ++
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index a89fdbc1f0be..75958a6b5112 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -21,6 +21,7 @@
 #include <linux/context_tracking.h>
 #include <linux/user-return-notifier.h>
 #include <linux/uprobes.h>
+#include <linux/isolation.h>
 
 #include <asm/desc.h>
 #include <asm/traps.h>
@@ -91,6 +92,10 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
  */
  if (work & _TIF_NOHZ) {
  enter_from_user_mode();
+ if (task_isolation_check_syscall(regs->orig_ax)) {
+ regs->orig_ax = -1;
+ return 0;
+ }
  work &= ~_TIF_NOHZ;
  }
 #endif
@@ -254,12 +259,15 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
  if (cached_flags & _TIF_USER_RETURN_NOTIFY)
  fire_user_return_notifiers();
 
+ task_isolation_enter();
+
  /* Disable IRQs and retry */
  local_irq_disable();
 
  cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags);
 
- if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
+ if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS) &&
+    task_isolation_ready())
  break;
 
  }
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index ade185a46b1d..82bf53ec1e98 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -36,6 +36,7 @@
 #include <linux/mm.h>
 #include <linux/smp.h>
 #include <linux/io.h>
+#include <linux/isolation.h>
 
 #ifdef CONFIG_EISA
 #include <linux/ioport.h>
@@ -398,6 +399,7 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
  case 2: /* Bound directory has invalid entry. */
  if (mpx_handle_bd_fault())
  goto exit_trap;
+ task_isolation_check_exception("bounds check");
  break; /* Success, it was handled */
  case 1: /* Bound violation. */
  info = mpx_generate_siginfo(regs);
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index eef44d9a3f77..7b23487a3bd7 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -14,6 +14,7 @@
 #include <linux/prefetch.h> /* prefetchw */
 #include <linux/context_tracking.h> /* exception_enter(), ... */
 #include <linux/uaccess.h> /* faulthandler_disabled() */
+#include <linux/isolation.h> /* task_isolation_check_exception */
 
 #include <asm/traps.h> /* dotraplinkage, ... */
 #include <asm/pgalloc.h> /* pgd_*(), ... */
@@ -1148,6 +1149,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
  local_irq_enable();
  error_code |= PF_USER;
  flags |= FAULT_FLAG_USER;
+ task_isolation_check_exception("page fault at %#lx", address);
  } else {
  if (regs->flags & X86_EFLAGS_IF)
  local_irq_enable();
--
2.1.2

[PATCH v9 12/13] arch/tile: enable task isolation functionality

Chris Metcalf-2
We add the necessary call to task_isolation_enter() in the
prepare_exit_to_usermode() routine.  We already unconditionally
call into this routine if TIF_NOHZ is set, since that's where
we do the user_enter() call.

We add calls to task_isolation_check_exception() in places
where exceptions may not generate signals to the application.

Signed-off-by: Chris Metcalf <[hidden email]>
---
 arch/tile/kernel/process.c     |  6 +++++-
 arch/tile/kernel/ptrace.c      |  6 ++++++
 arch/tile/kernel/single_step.c |  5 +++++
 arch/tile/kernel/smp.c         | 26 ++++++++++++++------------
 arch/tile/kernel/unaligned.c   |  3 +++
 arch/tile/mm/fault.c           |  3 +++
 arch/tile/mm/homecache.c       |  2 ++
 7 files changed, 38 insertions(+), 13 deletions(-)

diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index b5f30d376ce1..832febfd65df 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -29,6 +29,7 @@
 #include <linux/signal.h>
 #include <linux/delay.h>
 #include <linux/context_tracking.h>
+#include <linux/isolation.h>
 #include <asm/stack.h>
 #include <asm/switch_to.h>
 #include <asm/homecache.h>
@@ -495,10 +496,13 @@ void prepare_exit_to_usermode(struct pt_regs *regs, u32 thread_info_flags)
  tracehook_notify_resume(regs);
  }
 
+ task_isolation_enter();
+
  local_irq_disable();
  thread_info_flags = READ_ONCE(current_thread_info()->flags);
 
- } while (thread_info_flags & _TIF_WORK_MASK);
+ } while ((thread_info_flags & _TIF_WORK_MASK) ||
+ !task_isolation_ready());
 
  if (thread_info_flags & _TIF_SINGLESTEP) {
  single_step_once(regs);
diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index 54e7b723db99..f76f2d8b8923 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -23,6 +23,7 @@
 #include <linux/elf.h>
 #include <linux/tracehook.h>
 #include <linux/context_tracking.h>
+#include <linux/isolation.h>
 #include <asm/traps.h>
 #include <arch/chip.h>
 
@@ -255,6 +256,11 @@ int do_syscall_trace_enter(struct pt_regs *regs)
 {
  u32 work = ACCESS_ONCE(current_thread_info()->flags);
 
+ if (work & _TIF_NOHZ) {
+ if (task_isolation_check_syscall(regs->regs[TREG_SYSCALL_NR]))
+ return -1;
+ }
+
  if (secure_computing() == -1)
  return -1;
 
diff --git a/arch/tile/kernel/single_step.c b/arch/tile/kernel/single_step.c
index 862973074bf9..ba01eacde7a3 100644
--- a/arch/tile/kernel/single_step.c
+++ b/arch/tile/kernel/single_step.c
@@ -23,6 +23,7 @@
 #include <linux/types.h>
 #include <linux/err.h>
 #include <linux/prctl.h>
+#include <linux/isolation.h>
 #include <asm/cacheflush.h>
 #include <asm/traps.h>
 #include <asm/uaccess.h>
@@ -320,6 +321,8 @@ void single_step_once(struct pt_regs *regs)
  int size = 0, sign_ext = 0;  /* happy compiler */
  int align_ctl;
 
+ task_isolation_check_exception("single step at %#lx", regs->pc);
+
  align_ctl = unaligned_fixup;
  switch (task_thread_info(current)->align_ctl) {
  case PR_UNALIGN_NOPRINT:
@@ -767,6 +770,8 @@ void single_step_once(struct pt_regs *regs)
  unsigned long *ss_pc = this_cpu_ptr(&ss_saved_pc);
  unsigned long control = __insn_mfspr(SPR_SINGLE_STEP_CONTROL_K);
 
+ task_isolation_check_exception("single step at %#lx", regs->pc);
+
  *ss_pc = regs->pc;
  control |= SPR_SINGLE_STEP_CONTROL_1__CANCELED_MASK;
  control |= SPR_SINGLE_STEP_CONTROL_1__INHIBIT_MASK;
diff --git a/arch/tile/kernel/smp.c b/arch/tile/kernel/smp.c
index 07e3ff5cc740..7298d68d4584 100644
--- a/arch/tile/kernel/smp.c
+++ b/arch/tile/kernel/smp.c
@@ -20,6 +20,7 @@
 #include <linux/irq.h>
 #include <linux/irq_work.h>
 #include <linux/module.h>
+#include <linux/isolation.h>
 #include <asm/cacheflush.h>
 #include <asm/homecache.h>
 
@@ -181,10 +182,11 @@ void flush_icache_range(unsigned long start, unsigned long end)
  struct ipi_flush flush = { start, end };
 
  /* If invoked with irqs disabled, we can not issue IPIs. */
- if (irqs_disabled())
+ if (irqs_disabled()) {
+ task_isolation_debug_cpumask(&task_isolation_map);
  flush_remote(0, HV_FLUSH_EVICT_L1I, NULL, 0, 0, 0,
  NULL, NULL, 0);
- else {
+ } else {
  preempt_disable();
  on_each_cpu(ipi_flush_icache_range, &flush, 1);
  preempt_enable();
@@ -258,10 +260,8 @@ void __init ipi_init(void)
 
 #if CHIP_HAS_IPI()
 
-void smp_send_reschedule(int cpu)
+static void __smp_send_reschedule(int cpu)
 {
- WARN_ON(cpu_is_offline(cpu));
-
  /*
  * We just want to do an MMIO store.  The traditional writeq()
  * functions aren't really correct here, since they're always
@@ -273,15 +273,17 @@ void smp_send_reschedule(int cpu)
 
 #else
 
-void smp_send_reschedule(int cpu)
+static void __smp_send_reschedule(int cpu)
 {
- HV_Coord coord;
-
- WARN_ON(cpu_is_offline(cpu));
-
- coord.y = cpu_y(cpu);
- coord.x = cpu_x(cpu);
+ HV_Coord coord = { .y = cpu_y(cpu), .x = cpu_x(cpu) };
  hv_trigger_ipi(coord, IRQ_RESCHEDULE);
 }
 
 #endif /* CHIP_HAS_IPI() */
+
+void smp_send_reschedule(int cpu)
+{
+ WARN_ON(cpu_is_offline(cpu));
+ task_isolation_debug(cpu);
+ __smp_send_reschedule(cpu);
+}
diff --git a/arch/tile/kernel/unaligned.c b/arch/tile/kernel/unaligned.c
index 0db5f7c9d9e5..b1e229a1ff62 100644
--- a/arch/tile/kernel/unaligned.c
+++ b/arch/tile/kernel/unaligned.c
@@ -25,6 +25,7 @@
 #include <linux/module.h>
 #include <linux/compat.h>
 #include <linux/prctl.h>
+#include <linux/isolation.h>
 #include <asm/cacheflush.h>
 #include <asm/traps.h>
 #include <asm/uaccess.h>
@@ -1545,6 +1546,8 @@ void do_unaligned(struct pt_regs *regs, int vecnum)
  return;
  }
 
+ task_isolation_check_exception("unaligned JIT at %#lx", regs->pc);
+
  if (!info->unalign_jit_base) {
  void __user *user_page;
 
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index 26734214818c..1dee18d3ffbd 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -35,6 +35,7 @@
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
 #include <linux/kdebug.h>
+#include <linux/isolation.h>
 
 #include <asm/pgalloc.h>
 #include <asm/sections.h>
@@ -844,6 +845,8 @@ static inline void __do_page_fault(struct pt_regs *regs, int fault_num,
 void do_page_fault(struct pt_regs *regs, int fault_num,
    unsigned long address, unsigned long write)
 {
+ task_isolation_check_exception("page fault interrupt %d at %#lx (%#lx)",
+       fault_num, regs->pc, address);
  __do_page_fault(regs, fault_num, address, write);
 }
 
diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index 40ca30a9fee3..e044e8dd8372 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -31,6 +31,7 @@
 #include <linux/smp.h>
 #include <linux/module.h>
 #include <linux/hugetlb.h>
+#include <linux/isolation.h>
 
 #include <asm/page.h>
 #include <asm/sections.h>
@@ -83,6 +84,7 @@ static void hv_flush_update(const struct cpumask *cache_cpumask,
  * Don't bother to update atomically; losing a count
  * here is not that critical.
  */
+ task_isolation_debug_cpumask(&mask);
  for_each_cpu(cpu, &mask)
  ++per_cpu(irq_stat, cpu).irq_hv_flush_count;
 }
--
2.1.2

[PATCH v9 13/13] arm, tile: turn off timer tick for oneshot_stopped state

Chris Metcalf-2
When the schedule tick is disabled in tick_nohz_stop_sched_tick(),
we call hrtimer_cancel(), which eventually calls down into
__remove_hrtimer() and thus into hrtimer_force_reprogram().
That function's call to tick_program_event() detects that
we are trying to set the expiration to KTIME_MAX and calls
clockevents_switch_state() to set the state to ONESHOT_STOPPED,
and returns.  See commit 8fff52fd5093 ("clockevents: Introduce
CLOCK_EVT_STATE_ONESHOT_STOPPED state") for more background.

However, by default the internal __clockevents_switch_state() code
doesn't have a "set_state_oneshot_stopped" function pointer for
the arm_arch_timer or tile clock_event_device structures, so that
code returns -ENOSYS, and we end up not setting the state, and more
importantly, we don't actually turn off the hardware timer.
As a result, the timer tick we were waiting for before is still
queued, and fires shortly afterwards, only to discover there was
nothing for it to do, at which point it quiesces.

The fix is to provide that function pointer field, and like the
other function pointers, have it just turn off the timer interrupt.
Any call to set a new timer interval will properly re-enable it.

For regular applications this fix merely avoids a small performance
hiccup, but for TASK_ISOLATION code it prevents a potentially serious
kernel timer interruption of the time-sensitive application.

Signed-off-by: Chris Metcalf <[hidden email]>
---
 arch/tile/kernel/time.c              | 1 +
 drivers/clocksource/arm_arch_timer.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/arch/tile/kernel/time.c b/arch/tile/kernel/time.c
index 178989e6d3e3..fbedf380d9d4 100644
--- a/arch/tile/kernel/time.c
+++ b/arch/tile/kernel/time.c
@@ -159,6 +159,7 @@ static DEFINE_PER_CPU(struct clock_event_device, tile_timer) = {
  .set_next_event = tile_timer_set_next_event,
  .set_state_shutdown = tile_timer_shutdown,
  .set_state_oneshot = tile_timer_shutdown,
+ .set_state_oneshot_stopped = tile_timer_shutdown,
  .tick_resume = tile_timer_shutdown,
 };
 
diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
index c64d543d64bf..727a669afb1f 100644
--- a/drivers/clocksource/arm_arch_timer.c
+++ b/drivers/clocksource/arm_arch_timer.c
@@ -288,6 +288,8 @@ static void __arch_timer_setup(unsigned type,
  }
  }
 
+ clk->set_state_oneshot_stopped = clk->set_state_shutdown;
+
  clk->set_state_shutdown(clk);
 
  clockevents_config_and_register(clk, arch_timer_rate, 0xf, 0x7fffffff);
--
2.1.2

[PATCH v9 09/13] arch/arm64: enable task isolation functionality

Chris Metcalf-2
We need to call task_isolation_enter() from prepare_exit_to_usermode(),
so that we both ensure we do it last before returning to userspace
and are able to re-run signal handling, etc., if something occurs
while task_isolation_enter() has interrupts enabled.  To do this we
add _TIF_NOHZ to the _TIF_WORK_MASK if
we have CONFIG_TASK_ISOLATION enabled, which brings us into
prepare_exit_to_usermode() on all return to userspace.  But we
don't put _TIF_NOHZ in the flags that we use to loop back and
recheck, since we don't need to loop back only because the flag
is set.  Instead we unconditionally call task_isolation_enter()
at the end of the loop if any other work is done.

To make the assembly code continue to be as optimized as before,
we renumber the _TIF flags so that both _TIF_WORK_MASK and
_TIF_SYSCALL_WORK still have contiguous runs of bits in the
immediate operand for the "and" instruction, as required by the
ARM64 ISA.  Since TIF_NOHZ is in both masks, it must be the
middle bit in the contiguous run that starts with the
_TIF_WORK_MASK bits and ends with the _TIF_SYSCALL_WORK bits.
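
Concretely, with the renumbering below (and assuming the usual arm64
value of TIF_SIGPENDING == 0, which is outside the diff context), the
resulting mask values are:

	_TIF_WORK_LOOP_MASK == 0x00f	/* bits 0-3 */
	_TIF_WORK_MASK      == 0x01f	/* bits 0-4; TIF_NOHZ (4) added */
	_TIF_SYSCALL_WORK   == 0x1f0	/* bits 4-8; TIF_NOHZ shared */

so both masks used from assembly remain contiguous runs of bits that
can be encoded as immediates for the AArch64 "and" instruction.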

We tweak syscall_trace_enter() slightly to carry the "flags"
value from current_thread_info()->flags for each of the tests,
rather than doing a volatile read from memory for each one.  This
avoids a small overhead for each test, and in particular avoids
that overhead for TIF_NOHZ when TASK_ISOLATION is not enabled.

We instrument the smp_cross_call() routine so that it checks for
isolated tasks and generates a suitable warning if we are about
to disturb one of them in strict or debug mode.

Finally, add an explicit check for STRICT mode in do_mem_abort()
to handle the case of page faults.

Signed-off-by: Chris Metcalf <[hidden email]>
---
 arch/arm64/include/asm/thread_info.h | 18 ++++++++++++------
 arch/arm64/kernel/ptrace.c           | 12 +++++++++---
 arch/arm64/kernel/signal.c           |  7 +++++--
 arch/arm64/kernel/smp.c              |  2 ++
 arch/arm64/mm/fault.c                |  4 ++++
 5 files changed, 32 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index 90c7ff233735..94a98e9e29ef 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -103,11 +103,11 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NEED_RESCHED 1
 #define TIF_NOTIFY_RESUME 2 /* callback before returning to user */
 #define TIF_FOREIGN_FPSTATE 3 /* CPU's FP state is not current's */
-#define TIF_NOHZ 7
-#define TIF_SYSCALL_TRACE 8
-#define TIF_SYSCALL_AUDIT 9
-#define TIF_SYSCALL_TRACEPOINT 10
-#define TIF_SECCOMP 11
+#define TIF_NOHZ 4
+#define TIF_SYSCALL_TRACE 5
+#define TIF_SYSCALL_AUDIT 6
+#define TIF_SYSCALL_TRACEPOINT 7
+#define TIF_SECCOMP 8
 #define TIF_MEMDIE 18 /* is terminating due to OOM killer */
 #define TIF_FREEZE 19
 #define TIF_RESTORE_SIGMASK 20
@@ -125,9 +125,15 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_SECCOMP (1 << TIF_SECCOMP)
 #define _TIF_32BIT (1 << TIF_32BIT)
 
-#define _TIF_WORK_MASK (_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
+#define _TIF_WORK_LOOP_MASK (_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
  _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE)
 
+#ifdef CONFIG_TASK_ISOLATION
+# define _TIF_WORK_MASK (_TIF_WORK_LOOP_MASK | _TIF_NOHZ)
+#else
+# define _TIF_WORK_MASK _TIF_WORK_LOOP_MASK
+#endif
+
 #define _TIF_SYSCALL_WORK (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
  _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
  _TIF_NOHZ)
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index 1971f491bb90..69ed3ba81650 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -37,6 +37,7 @@
 #include <linux/regset.h>
 #include <linux/tracehook.h>
 #include <linux/elf.h>
+#include <linux/isolation.h>
 
 #include <asm/compat.h>
 #include <asm/debug-monitors.h>
@@ -1240,14 +1241,19 @@ static void tracehook_report_syscall(struct pt_regs *regs,
 
 asmlinkage int syscall_trace_enter(struct pt_regs *regs)
 {
- /* Do the secure computing check first; failures should be fast. */
+ unsigned long work = ACCESS_ONCE(current_thread_info()->flags);
+
+ if ((work & _TIF_NOHZ) && task_isolation_check_syscall(regs->syscallno))
+ return -1;
+
+ /* Do the secure computing check early; failures should be fast. */
  if (secure_computing() == -1)
  return -1;
 
- if (test_thread_flag(TIF_SYSCALL_TRACE))
+ if (work & _TIF_SYSCALL_TRACE)
  tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);
 
- if (test_thread_flag(TIF_SYSCALL_TRACEPOINT))
+ if (work & _TIF_SYSCALL_TRACEPOINT)
  trace_sys_enter(regs, regs->syscallno);
 
  audit_syscall_entry(regs->syscallno, regs->orig_x0, regs->regs[1],
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index fde59c1139a9..641c828653c7 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -25,6 +25,7 @@
 #include <linux/uaccess.h>
 #include <linux/tracehook.h>
 #include <linux/ratelimit.h>
+#include <linux/isolation.h>
 
 #include <asm/debug-monitors.h>
 #include <asm/elf.h>
@@ -419,10 +420,12 @@ asmlinkage void prepare_exit_to_usermode(struct pt_regs *regs,
  if (thread_flags & _TIF_FOREIGN_FPSTATE)
  fpsimd_restore_current_state();
 
+ task_isolation_enter();
+
  local_irq_disable();
 
  thread_flags = READ_ONCE(current_thread_info()->flags) &
- _TIF_WORK_MASK;
+ _TIF_WORK_LOOP_MASK;
 
- } while (thread_flags);
+ } while (thread_flags || !task_isolation_ready());
 }
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index b1adc51b2c2e..dcb3282d04a2 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -37,6 +37,7 @@
 #include <linux/completion.h>
 #include <linux/of.h>
 #include <linux/irq_work.h>
+#include <linux/isolation.h>
 
 #include <asm/alternative.h>
 #include <asm/atomic.h>
@@ -632,6 +633,7 @@ static const char *ipi_types[NR_IPI] __tracepoint_string = {
 static void smp_cross_call(const struct cpumask *target, unsigned int ipinr)
 {
  trace_ipi_raise(target, ipi_types[ipinr]);
+ task_isolation_debug_cpumask(target);
  __smp_cross_call(target, ipinr);
 }
 
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 92ddac1e8ca2..fbc78035b2af 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -29,6 +29,7 @@
 #include <linux/sched.h>
 #include <linux/highmem.h>
 #include <linux/perf_event.h>
+#include <linux/isolation.h>
 
 #include <asm/cpufeature.h>
 #include <asm/exception.h>
@@ -466,6 +467,9 @@ asmlinkage void __exception do_mem_abort(unsigned long addr, unsigned int esr,
  const struct fault_info *inf = fault_info + (esr & 63);
  struct siginfo info;
 
+ if (user_mode(regs))
+ task_isolation_check_exception("%s at %#lx", inf->name, addr);
+
  if (!inf->fn(addr, esr, regs))
  return;
 
--
2.1.2

[PATCH v9 08/13] arch/arm64: adopt prepare_exit_to_usermode() model from x86

Chris Metcalf-2
This change is a prerequisite change for TASK_ISOLATION but also
stands on its own for readability and maintainability.  The existing
arm64 do_notify_resume() is called in a loop from assembly on
the slow path; this change moves the loop into C code as well.
For the x86 version see commit c5c46f59e4e7 ("x86/entry: Add new,
comprehensible entry and exit handlers written in C").

Signed-off-by: Chris Metcalf <[hidden email]>
---
 arch/arm64/kernel/entry.S  |  6 +++---
 arch/arm64/kernel/signal.c | 32 ++++++++++++++++++++++----------
 2 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 7ed3d75f6304..04eff4c4ac6e 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -630,9 +630,8 @@ work_pending:
  mov x0, sp // 'regs'
  tst x2, #PSR_MODE_MASK // user mode regs?
  b.ne no_work_pending // returning to kernel
- enable_irq // enable interrupts for do_notify_resume()
- bl do_notify_resume
- b ret_to_user
+ bl prepare_exit_to_usermode
+ b no_user_work_pending
 work_resched:
  bl schedule
 
@@ -644,6 +643,7 @@ ret_to_user:
  ldr x1, [tsk, #TI_FLAGS]
  and x2, x1, #_TIF_WORK_MASK
  cbnz x2, work_pending
+no_user_work_pending:
  enable_step_tsk x1, x2
 no_work_pending:
  kernel_exit 0
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index e18c48cb6db1..fde59c1139a9 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -399,18 +399,30 @@ static void do_signal(struct pt_regs *regs)
  restore_saved_sigmask();
 }
 
-asmlinkage void do_notify_resume(struct pt_regs *regs,
- unsigned int thread_flags)
+asmlinkage void prepare_exit_to_usermode(struct pt_regs *regs,
+ unsigned int thread_flags)
 {
- if (thread_flags & _TIF_SIGPENDING)
- do_signal(regs);
+ do {
+ local_irq_enable();
 
- if (thread_flags & _TIF_NOTIFY_RESUME) {
- clear_thread_flag(TIF_NOTIFY_RESUME);
- tracehook_notify_resume(regs);
- }
+ if (thread_flags & _TIF_NEED_RESCHED)
+ schedule();
+
+ if (thread_flags & _TIF_SIGPENDING)
+ do_signal(regs);
+
+ if (thread_flags & _TIF_NOTIFY_RESUME) {
+ clear_thread_flag(TIF_NOTIFY_RESUME);
+ tracehook_notify_resume(regs);
+ }
+
+ if (thread_flags & _TIF_FOREIGN_FPSTATE)
+ fpsimd_restore_current_state();
+
+ local_irq_disable();
 
- if (thread_flags & _TIF_FOREIGN_FPSTATE)
- fpsimd_restore_current_state();
+ thread_flags = READ_ONCE(current_thread_info()->flags) &
+ _TIF_WORK_MASK;
 
+ } while (thread_flags);
 }
--
2.1.2

[PATCH v9 11/13] arch/tile: move user_exit() to early kernel entry sequence

Chris Metcalf-2
This ensures that we always notify context tracking that we
have exited from user space no matter how we enter the kernel.
It is similar to how arm64 handles context tracking, for example.

This allows the removal of all the exception_enter() calls that
were added in commit 49e4e15619cd ("tile: support CONTEXT_TRACKING and
thus NOHZ_FULL").

Signed-off-by: Chris Metcalf <[hidden email]>
---
 arch/tile/kernel/intvec_32.S   |  5 ++++-
 arch/tile/kernel/intvec_64.S   |  5 ++++-
 arch/tile/kernel/ptrace.c      | 15 ---------------
 arch/tile/kernel/single_step.c |  3 ---
 arch/tile/kernel/traps.c       | 13 ++++---------
 arch/tile/kernel/unaligned.c   | 13 ++++---------
 arch/tile/mm/fault.c           |  3 ---
 7 files changed, 16 insertions(+), 41 deletions(-)

diff --git a/arch/tile/kernel/intvec_32.S b/arch/tile/kernel/intvec_32.S
index 33d48812872a..9ff75e3a318a 100644
--- a/arch/tile/kernel/intvec_32.S
+++ b/arch/tile/kernel/intvec_32.S
@@ -572,7 +572,7 @@ intvec_\vecname:
  }
  wh64    r52
 
-#ifdef CONFIG_TRACE_IRQFLAGS
+#if defined(CONFIG_TRACE_IRQFLAGS) || defined(CONFIG_CONTEXT_TRACKING)
  .ifnc \function,handle_nmi
  /*
  * We finally have enough state set up to notify the irq
@@ -588,6 +588,9 @@ intvec_\vecname:
  { move r32, r2; move r33, r3 }
  .endif
  TRACE_IRQS_OFF
+#ifdef CONFIG_CONTEXT_TRACKING
+ jal     context_tracking_user_exit
+#endif
  .ifnc \function,handle_syscall
  { move r0, r30; move r1, r31 }
  { move r2, r32; move r3, r33 }
diff --git a/arch/tile/kernel/intvec_64.S b/arch/tile/kernel/intvec_64.S
index a41c994ce237..f080a6c3d82b 100644
--- a/arch/tile/kernel/intvec_64.S
+++ b/arch/tile/kernel/intvec_64.S
@@ -753,7 +753,7 @@ intvec_\vecname:
  }
  wh64    r52
 
-#ifdef CONFIG_TRACE_IRQFLAGS
+#if defined(CONFIG_TRACE_IRQFLAGS) || defined(CONFIG_CONTEXT_TRACKING)
  .ifnc \function,handle_nmi
  /*
  * We finally have enough state set up to notify the irq
@@ -769,6 +769,9 @@ intvec_\vecname:
  { move r32, r2; move r33, r3 }
  .endif
  TRACE_IRQS_OFF
+#ifdef CONFIG_CONTEXT_TRACKING
+ jal     context_tracking_user_exit
+#endif
  .ifnc \function,handle_syscall
  { move r0, r30; move r1, r31 }
  { move r2, r32; move r3, r33 }
diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index bdc126faf741..54e7b723db99 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -255,13 +255,6 @@ int do_syscall_trace_enter(struct pt_regs *regs)
 {
  u32 work = ACCESS_ONCE(current_thread_info()->flags);
 
- /*
- * If TIF_NOHZ is set, we are required to call user_exit() before
- * doing anything that could touch RCU.
- */
- if (work & _TIF_NOHZ)
- user_exit();
-
  if (secure_computing() == -1)
  return -1;
 
@@ -281,12 +274,6 @@ void do_syscall_trace_exit(struct pt_regs *regs)
  long errno;
 
  /*
- * We may come here right after calling schedule_user()
- * in which case we can be in RCU user mode.
- */
- user_exit();
-
- /*
  * The standard tile calling convention returns the value (or negative
  * errno) in r0, and zero (or positive errno) in r1.
  * It saves a couple of cycles on the hot path to do this work in
@@ -322,7 +309,5 @@ void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs)
 /* Handle synthetic interrupt delivered only by the simulator. */
 void __kprobes do_breakpoint(struct pt_regs* regs, int fault_num)
 {
- enum ctx_state prev_state = exception_enter();
  send_sigtrap(current, regs);
- exception_exit(prev_state);
 }
diff --git a/arch/tile/kernel/single_step.c b/arch/tile/kernel/single_step.c
index 53f7b9def07b..862973074bf9 100644
--- a/arch/tile/kernel/single_step.c
+++ b/arch/tile/kernel/single_step.c
@@ -23,7 +23,6 @@
 #include <linux/types.h>
 #include <linux/err.h>
 #include <linux/prctl.h>
-#include <linux/context_tracking.h>
 #include <asm/cacheflush.h>
 #include <asm/traps.h>
 #include <asm/uaccess.h>
@@ -739,7 +738,6 @@ static DEFINE_PER_CPU(unsigned long, ss_saved_pc);
 
 void gx_singlestep_handle(struct pt_regs *regs, int fault_num)
 {
- enum ctx_state prev_state = exception_enter();
  unsigned long *ss_pc = this_cpu_ptr(&ss_saved_pc);
  struct thread_info *info = (void *)current_thread_info();
  int is_single_step = test_ti_thread_flag(info, TIF_SINGLESTEP);
@@ -756,7 +754,6 @@ void gx_singlestep_handle(struct pt_regs *regs, int fault_num)
  __insn_mtspr(SPR_SINGLE_STEP_CONTROL_K, control);
  send_sigtrap(current, regs);
  }
- exception_exit(prev_state);
 }
 
 
diff --git a/arch/tile/kernel/traps.c b/arch/tile/kernel/traps.c
index 0011a9ff0525..4d9651c5b1ad 100644
--- a/arch/tile/kernel/traps.c
+++ b/arch/tile/kernel/traps.c
@@ -20,7 +20,6 @@
 #include <linux/reboot.h>
 #include <linux/uaccess.h>
 #include <linux/ptrace.h>
-#include <linux/context_tracking.h>
 #include <asm/stack.h>
 #include <asm/traps.h>
 #include <asm/setup.h>
@@ -254,7 +253,6 @@ static int do_bpt(struct pt_regs *regs)
 void __kprobes do_trap(struct pt_regs *regs, int fault_num,
        unsigned long reason)
 {
- enum ctx_state prev_state = exception_enter();
  siginfo_t info = { 0 };
  int signo, code;
  unsigned long address = 0;
@@ -263,7 +261,7 @@ void __kprobes do_trap(struct pt_regs *regs, int fault_num,
 
  /* Handle breakpoints, etc. */
  if (is_kernel && fault_num == INT_ILL && do_bpt(regs))
- goto done;
+ return;
 
  /* Re-enable interrupts, if they were previously enabled. */
  if (!(regs->flags & PT_FLAGS_DISABLE_IRQ))
@@ -277,7 +275,7 @@ void __kprobes do_trap(struct pt_regs *regs, int fault_num,
  const char *name;
  char buf[100];
  if (fixup_exception(regs))  /* ILL_TRANS or UNALIGN_DATA */
- goto done;
+ return;
  if (fault_num >= 0 &&
     fault_num < ARRAY_SIZE(int_name) &&
     int_name[fault_num] != NULL)
@@ -319,7 +317,7 @@ void __kprobes do_trap(struct pt_regs *regs, int fault_num,
  case INT_GPV:
 #if CHIP_HAS_TILE_DMA()
  if (retry_gpv(reason))
- goto done;
+ return;
 #endif
  /*FALLTHROUGH*/
  case INT_UDN_ACCESS:
@@ -346,7 +344,7 @@ void __kprobes do_trap(struct pt_regs *regs, int fault_num,
  if (!state ||
     (void __user *)(regs->pc) != state->buffer) {
  single_step_once(regs);
- goto done;
+ return;
  }
  }
 #endif
@@ -390,9 +388,6 @@ void __kprobes do_trap(struct pt_regs *regs, int fault_num,
  if (signo != SIGTRAP)
  trace_unhandled_signal("trap", regs, address, signo);
  force_sig_info(signo, &info, current);
-
-done:
- exception_exit(prev_state);
 }
 
 void do_nmi(struct pt_regs *regs, int fault_num, unsigned long reason)
diff --git a/arch/tile/kernel/unaligned.c b/arch/tile/kernel/unaligned.c
index d075f92ccee0..0db5f7c9d9e5 100644
--- a/arch/tile/kernel/unaligned.c
+++ b/arch/tile/kernel/unaligned.c
@@ -25,7 +25,6 @@
 #include <linux/module.h>
 #include <linux/compat.h>
 #include <linux/prctl.h>
-#include <linux/context_tracking.h>
 #include <asm/cacheflush.h>
 #include <asm/traps.h>
 #include <asm/uaccess.h>
@@ -1449,7 +1448,6 @@ void jit_bundle_gen(struct pt_regs *regs, tilegx_bundle_bits bundle,
 
 void do_unaligned(struct pt_regs *regs, int vecnum)
 {
- enum ctx_state prev_state = exception_enter();
  tilegx_bundle_bits __user  *pc;
  tilegx_bundle_bits bundle;
  struct thread_info *info = current_thread_info();
@@ -1503,7 +1501,7 @@ void do_unaligned(struct pt_regs *regs, int vecnum)
  *((tilegx_bundle_bits *)(regs->pc)));
  jit_bundle_gen(regs, bundle, align_ctl);
  }
- goto done;
+ return;
  }
 
  /*
@@ -1527,7 +1525,7 @@ void do_unaligned(struct pt_regs *regs, int vecnum)
 
  trace_unhandled_signal("unaligned fixup trap", regs, 0, SIGBUS);
  force_sig_info(info.si_signo, &info, current);
- goto done;
+ return;
  }
 
 
@@ -1544,7 +1542,7 @@ void do_unaligned(struct pt_regs *regs, int vecnum)
  trace_unhandled_signal("segfault in unalign fixup", regs,
        (unsigned long)info.si_addr, SIGSEGV);
  force_sig_info(info.si_signo, &info, current);
- goto done;
+ return;
  }
 
  if (!info->unalign_jit_base) {
@@ -1579,7 +1577,7 @@ void do_unaligned(struct pt_regs *regs, int vecnum)
 
  if (IS_ERR((void __force *)user_page)) {
  pr_err("Out of kernel pages trying do_mmap\n");
- goto done;
+ return;
  }
 
  /* Save the address in the thread_info struct */
@@ -1592,9 +1590,6 @@ void do_unaligned(struct pt_regs *regs, int vecnum)
 
  /* Generate unalign JIT */
  jit_bundle_gen(regs, GX_INSN_BSWAP(bundle), align_ctl);
-
-done:
- exception_exit(prev_state);
 }
 
 #endif /* __tilegx__ */
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index 13eac59bf16a..26734214818c 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -35,7 +35,6 @@
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
 #include <linux/kdebug.h>
-#include <linux/context_tracking.h>
 
 #include <asm/pgalloc.h>
 #include <asm/sections.h>
@@ -845,9 +844,7 @@ static inline void __do_page_fault(struct pt_regs *regs, int fault_num,
 void do_page_fault(struct pt_regs *regs, int fault_num,
    unsigned long address, unsigned long write)
 {
- enum ctx_state prev_state = exception_enter();
  __do_page_fault(regs, fault_num, address, write);
- exception_exit(prev_state);
 }
 
 #if CHIP_HAS_TILE_DMA()
--
2.1.2

[PATCH v9 05/13] task_isolation: support PR_TASK_ISOLATION_STRICT mode

Chris Metcalf-2
In reply to this post by Chris Metcalf-2
With task_isolation mode, the task is in principle guaranteed not to
be interrupted by the kernel, but only if it behaves.  In particular,
if it enters the kernel via system call, page fault, or any of a
number of other synchronous traps, it may be unexpectedly exposed
to long latencies.  Add a simple flag that puts the process into
a state where any such kernel entry is fatal; the check is made
immediately before the SECCOMP test on syscall entry.

By default, the task is signalled with SIGKILL, but we add prctl()
bits to support requesting a specific signal instead.

To allow the state to be entered and exited, the prctl() syscall is
exempted (so the task can clear the bit again later), as are exit()
and exit_group() (so the task can exit without a pointless signal
being delivered on the way out).
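
As an illustration (not part of the patch), a minimal userspace sketch
of requesting strict mode with a custom signal might look like the
following; SIGUSR1 and the handler are arbitrary choices, and the task
is assumed to already be affinitized to a single task_isolation cpu:

#include <signal.h>
#include <unistd.h>
#include <sys/prctl.h>

/* Values from this patch, in case <sys/prctl.h> does not yet have them. */
#ifndef PR_SET_TASK_ISOLATION
#define PR_SET_TASK_ISOLATION		48
#define PR_TASK_ISOLATION_ENABLE	(1 << 0)
#define PR_TASK_ISOLATION_STRICT	(1 << 1)
#define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
#endif

static void isolation_lost(int sig)
{
	/* The kernel clears the isolation flags before signalling us. */
	write(STDERR_FILENO, "task isolation violated\n",
	      sizeof("task isolation violated\n") - 1);
}

int main(void)
{
	signal(SIGUSR1, isolation_lost);

	/* Request strict isolation; deliver SIGUSR1 rather than SIGKILL. */
	if (prctl(PR_SET_TASK_ISOLATION,
		  PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT |
		  PR_TASK_ISOLATION_SET_SIG(SIGUSR1)) != 0)
		return 1;

	for (;;)
		;	/* pure userspace work; any kernel entry is a violation */
}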

Signed-off-by: Chris Metcalf <[hidden email]>
---
 include/linux/isolation.h  | 25 +++++++++++++++++++
 include/uapi/linux/prctl.h |  3 +++
 kernel/isolation.c         | 60 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 88 insertions(+)

diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index ed1bfc793c5a..69a3e4c59ab3 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -40,11 +40,36 @@ static inline void task_isolation_enter(void)
  _task_isolation_enter();
 }
 
+extern bool task_isolation_syscall(int nr);
+extern void task_isolation_exception(const char *fmt, ...);
+extern void task_isolation_interrupt(struct task_struct *, const char *buf);
+
+static inline bool task_isolation_strict(void)
+{
+ return (task_isolation_possible(smp_processor_id()) &&
+ (current->task_isolation_flags &
+ (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) ==
+ (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT));
+}
+
+static inline bool task_isolation_check_syscall(int nr)
+{
+ return task_isolation_strict() && task_isolation_syscall(nr);
+}
+
+#define task_isolation_check_exception(fmt, ...) \
+ do { \
+ if (task_isolation_strict()) \
+ task_isolation_exception(fmt, ## __VA_ARGS__); \
+ } while (0)
+
 #else
 static inline bool task_isolation_possible(int cpu) { return false; }
 static inline bool task_isolation_enabled(void) { return false; }
 static inline bool task_isolation_ready(void) { return true; }
 static inline void task_isolation_enter(void) { }
+static inline bool task_isolation_check_syscall(int nr) { return false; }
+static inline void task_isolation_check_exception(const char *fmt, ...) { }
 #endif
 
 #endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 67224df4b559..a5582ace987f 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -201,5 +201,8 @@ struct prctl_mm_map {
 #define PR_SET_TASK_ISOLATION 48
 #define PR_GET_TASK_ISOLATION 49
 # define PR_TASK_ISOLATION_ENABLE (1 << 0)
+# define PR_TASK_ISOLATION_STRICT (1 << 1)
+# define PR_TASK_ISOLATION_SET_SIG(sig) (((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/isolation.c b/kernel/isolation.c
index 68a9f7457bc0..29ffb21ada0b 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -11,6 +11,7 @@
 #include <linux/vmstat.h>
 #include <linux/isolation.h>
 #include <linux/syscalls.h>
+#include <asm/unistd.h>
 #include "time/tick-sched.h"
 
 cpumask_var_t task_isolation_map;
@@ -103,3 +104,62 @@ void _task_isolation_enter(void)
  /* Quieten the vmstat worker so it won't interrupt us. */
  quiet_vmstat();
 }
+
+void task_isolation_interrupt(struct task_struct *task, const char *buf)
+{
+ siginfo_t info = {};
+ int sig;
+
+ pr_warn("%s/%d: task_isolation strict mode violated by %s\n",
+ task->comm, task->pid, buf);
+
+ sig = PR_TASK_ISOLATION_GET_SIG(task->task_isolation_flags);
+ if (sig == 0)
+ sig = SIGKILL;
+
+ /*
+ * Turn off task isolation mode entirely to avoid spamming
+ * the process with signals.  It can re-enable task isolation
+ * mode in the signal handler if it wants to.
+ */
+ task->task_isolation_flags = 0;
+ info.si_signo = sig;
+ send_sig_info(sig, &info, task);
+}
+
+/*
+ * This routine is called from any userspace exception if the _STRICT
+ * flag is set.
+ */
+void task_isolation_exception(const char *fmt, ...)
+{
+ va_list args;
+ char buf[100];
+
+ /* RCU should have been enabled prior to this point. */
+ RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
+
+ va_start(args, fmt);
+ vsnprintf(buf, sizeof(buf), fmt, args);
+ va_end(args);
+
+ task_isolation_interrupt(current, buf);
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in) if the _STRICT flag is set.
+ */
+bool task_isolation_syscall(int syscall)
+{
+ /* Ignore prctl() syscalls or any task exit. */
+ switch (syscall) {
+ case __NR_prctl:
+ case __NR_exit:
+ case __NR_exit_group:
+ return false;
+ }
+
+ task_isolation_exception("syscall %d", syscall);
+ return true;
+}
--
2.1.2

[PATCH v9 10/13] arch/tile: adopt prepare_exit_to_usermode() model from x86

Chris Metcalf-2
In reply to this post by Chris Metcalf-2
This change is a prerequisite change for TASK_ISOLATION but also
stands on its own for readability and maintainability.  The existing
tile do_work_pending() was called in a loop from assembly on
the slow path; this change moves the loop into C code as well.
For the x86 version see commit c5c46f59e4e7 ("x86/entry: Add new,
comprehensible entry and exit handlers written in C").

This change exposes a pre-existing bug on the older tilepro platform;
the singlestep processing is done last, but on tilepro (unlike tilegx)
we enable interrupts while doing that processing, so we could in
theory miss a signal or other asynchronous event.  A future change
could fix this by breaking the singlestep work into a "prepare"
step done in the main loop, and a "trigger" step done after exiting
the loop.  Since this change is intended as purely a restructuring
change, we call out the bug explicitly now, but don't yet fix it.
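
As a rough illustration only, such a future fix might be shaped like
the sketch below; single_step_prepare() and single_step_trigger() are
hypothetical helper names, not functions provided by this series:

void prepare_exit_to_usermode(struct pt_regs *regs, u32 thread_info_flags)
{
	bool single_step = false;

	do {
		local_irq_enable();

		/* ... handle RESCHED/SIGPENDING/NOTIFY_RESUME as today ... */

		if (thread_info_flags & _TIF_SINGLESTEP) {
			/* On tilepro this may enable irqs; safe inside the loop. */
			single_step_prepare(regs);
			single_step = true;
		}

		local_irq_disable();
		thread_info_flags = READ_ONCE(current_thread_info()->flags);
	} while (thread_info_flags & _TIF_WORK_MASK);

	/* Arm the single-step machinery with no interrupt window remaining. */
	if (single_step)
		single_step_trigger(regs);

	user_enter();
}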

Signed-off-by: Chris Metcalf <[hidden email]>
---
 arch/tile/include/asm/processor.h   |  2 +-
 arch/tile/include/asm/thread_info.h |  8 +++-
 arch/tile/kernel/intvec_32.S        | 46 +++++++--------------
 arch/tile/kernel/intvec_64.S        | 49 +++++++----------------
 arch/tile/kernel/process.c          | 79 +++++++++++++++++++------------------
 5 files changed, 77 insertions(+), 107 deletions(-)

diff --git a/arch/tile/include/asm/processor.h b/arch/tile/include/asm/processor.h
index 139dfdee0134..0684e88aacd8 100644
--- a/arch/tile/include/asm/processor.h
+++ b/arch/tile/include/asm/processor.h
@@ -212,7 +212,7 @@ static inline void release_thread(struct task_struct *dead_task)
  /* Nothing for now */
 }
 
-extern int do_work_pending(struct pt_regs *regs, u32 flags);
+extern void prepare_exit_to_usermode(struct pt_regs *regs, u32 flags);
 
 
 /*
diff --git a/arch/tile/include/asm/thread_info.h b/arch/tile/include/asm/thread_info.h
index dc1fb28d9636..4b7cef9e94e0 100644
--- a/arch/tile/include/asm/thread_info.h
+++ b/arch/tile/include/asm/thread_info.h
@@ -140,10 +140,14 @@ extern void _cpu_idle(void);
 #define _TIF_POLLING_NRFLAG (1<<TIF_POLLING_NRFLAG)
 #define _TIF_NOHZ (1<<TIF_NOHZ)
 
+/* Work to do as we loop to exit to user space. */
+#define _TIF_WORK_MASK \
+ (_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
+ _TIF_ASYNC_TLB | _TIF_NOTIFY_RESUME)
+
 /* Work to do on any return to user space. */
 #define _TIF_ALLWORK_MASK \
- (_TIF_SIGPENDING | _TIF_NEED_RESCHED | _TIF_SINGLESTEP | \
- _TIF_ASYNC_TLB | _TIF_NOTIFY_RESUME | _TIF_NOHZ)
+ (_TIF_WORK_MASK | _TIF_SINGLESTEP | _TIF_NOHZ)
 
 /* Work to do at syscall entry. */
 #define _TIF_SYSCALL_ENTRY_WORK \
diff --git a/arch/tile/kernel/intvec_32.S b/arch/tile/kernel/intvec_32.S
index fbbe2ea882ea..33d48812872a 100644
--- a/arch/tile/kernel/intvec_32.S
+++ b/arch/tile/kernel/intvec_32.S
@@ -846,18 +846,6 @@ STD_ENTRY(interrupt_return)
  FEEDBACK_REENTER(interrupt_return)
 
  /*
- * Use r33 to hold whether we have already loaded the callee-saves
- * into ptregs.  We don't want to do it twice in this loop, since
- * then we'd clobber whatever changes are made by ptrace, etc.
- * Get base of stack in r32.
- */
- {
- GET_THREAD_INFO(r32)
- movei  r33, 0
- }
-
-.Lretry_work_pending:
- /*
  * Disable interrupts so as to make sure we don't
  * miss an interrupt that sets any of the thread flags (like
  * need_resched or sigpending) between sampling and the iret.
@@ -867,33 +855,27 @@ STD_ENTRY(interrupt_return)
  IRQ_DISABLE(r20, r21)
  TRACE_IRQS_OFF  /* Note: clobbers registers r0-r29 */
 
-
- /* Check to see if there is any work to do before returning to user. */
+ /*
+ * See if there are any work items (including single-shot items)
+ * to do.  If so, save the callee-save registers to pt_regs
+ * and then dispatch to C code.
+ */
+ GET_THREAD_INFO(r21)
  {
- addi   r29, r32, THREAD_INFO_FLAGS_OFFSET
- moveli r1, lo16(_TIF_ALLWORK_MASK)
+ addi   r22, r21, THREAD_INFO_FLAGS_OFFSET
+ moveli r20, lo16(_TIF_ALLWORK_MASK)
  }
  {
- lw     r29, r29
- auli   r1, r1, ha16(_TIF_ALLWORK_MASK)
+ lw     r22, r22
+ auli   r20, r20, ha16(_TIF_ALLWORK_MASK)
  }
- and     r1, r29, r1
- bzt     r1, .Lrestore_all
-
- /*
- * Make sure we have all the registers saved for signal
- * handling, notify-resume, or single-step.  Call out to C
- * code to figure out exactly what we need to do for each flag bit,
- * then if necessary, reload the flags and recheck.
- */
+ and     r1, r22, r20
  {
  PTREGS_PTR(r0, PTREGS_OFFSET_BASE)
- bnz    r33, 1f
+ bzt    r1, .Lrestore_all
  }
  push_extra_callee_saves r0
- movei   r33, 1
-1: jal     do_work_pending
- bnz     r0, .Lretry_work_pending
+ jal     prepare_exit_to_usermode
 
  /*
  * In the NMI case we
@@ -1327,7 +1309,7 @@ STD_ENTRY(ret_from_kernel_thread)
  FEEDBACK_REENTER(ret_from_kernel_thread)
  {
  movei  r30, 0               /* not an NMI */
- j      .Lresume_userspace   /* jump into middle of interrupt_return */
+ j      interrupt_return
  }
  STD_ENDPROC(ret_from_kernel_thread)
 
diff --git a/arch/tile/kernel/intvec_64.S b/arch/tile/kernel/intvec_64.S
index 58964d209d4d..a41c994ce237 100644
--- a/arch/tile/kernel/intvec_64.S
+++ b/arch/tile/kernel/intvec_64.S
@@ -879,20 +879,6 @@ STD_ENTRY(interrupt_return)
  FEEDBACK_REENTER(interrupt_return)
 
  /*
- * Use r33 to hold whether we have already loaded the callee-saves
- * into ptregs.  We don't want to do it twice in this loop, since
- * then we'd clobber whatever changes are made by ptrace, etc.
- */
- {
- movei  r33, 0
- move   r32, sp
- }
-
- /* Get base of stack in r32. */
- EXTRACT_THREAD_INFO(r32)
-
-.Lretry_work_pending:
- /*
  * Disable interrupts so as to make sure we don't
  * miss an interrupt that sets any of the thread flags (like
  * need_resched or sigpending) between sampling and the iret.
@@ -902,33 +888,28 @@ STD_ENTRY(interrupt_return)
  IRQ_DISABLE(r20, r21)
  TRACE_IRQS_OFF  /* Note: clobbers registers r0-r29 */
 
-
- /* Check to see if there is any work to do before returning to user. */
+ /*
+ * See if there are any work items (including single-shot items)
+ * to do.  If so, save the callee-save registers to pt_regs
+ * and then dispatch to C code.
+ */
+ move    r21, sp
+ EXTRACT_THREAD_INFO(r21)
  {
- addi   r29, r32, THREAD_INFO_FLAGS_OFFSET
- moveli r1, hw1_last(_TIF_ALLWORK_MASK)
+ addi   r22, r21, THREAD_INFO_FLAGS_OFFSET
+ moveli r20, hw1_last(_TIF_ALLWORK_MASK)
  }
  {
- ld     r29, r29
- shl16insli r1, r1, hw0(_TIF_ALLWORK_MASK)
+ ld     r22, r22
+ shl16insli r20, r20, hw0(_TIF_ALLWORK_MASK)
  }
- and     r1, r29, r1
- beqzt   r1, .Lrestore_all
-
- /*
- * Make sure we have all the registers saved for signal
- * handling or notify-resume.  Call out to C code to figure out
- * exactly what we need to do for each flag bit, then if
- * necessary, reload the flags and recheck.
- */
+ and     r1, r22, r20
  {
  PTREGS_PTR(r0, PTREGS_OFFSET_BASE)
- bnez   r33, 1f
+ beqzt  r1, .Lrestore_all
  }
  push_extra_callee_saves r0
- movei   r33, 1
-1: jal     do_work_pending
- bnez    r0, .Lretry_work_pending
+ jal     prepare_exit_to_usermode
 
  /*
  * In the NMI case we
@@ -1411,7 +1392,7 @@ STD_ENTRY(ret_from_kernel_thread)
  FEEDBACK_REENTER(ret_from_kernel_thread)
  {
  movei  r30, 0               /* not an NMI */
- j      .Lresume_userspace   /* jump into middle of interrupt_return */
+ j      interrupt_return
  }
  STD_ENDPROC(ret_from_kernel_thread)
 
diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index 7d5769310bef..b5f30d376ce1 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -462,54 +462,57 @@ struct task_struct *__sched _switch_to(struct task_struct *prev,
 
 /*
  * This routine is called on return from interrupt if any of the
- * TIF_WORK_MASK flags are set in thread_info->flags.  It is
- * entered with interrupts disabled so we don't miss an event
- * that modified the thread_info flags.  If any flag is set, we
- * handle it and return, and the calling assembly code will
- * re-disable interrupts, reload the thread flags, and call back
- * if more flags need to be handled.
- *
- * We return whether we need to check the thread_info flags again
- * or not.  Note that we don't clear TIF_SINGLESTEP here, so it's
- * important that it be tested last, and then claim that we don't
- * need to recheck the flags.
+ * TIF_ALLWORK_MASK flags are set in thread_info->flags.  It is
+ * entered with interrupts disabled so we don't miss an event that
+ * modified the thread_info flags.  We loop until all the tested flags
+ * are clear.  Note that the function is called on certain conditions
+ * that are not listed in the loop condition here (e.g. SINGLESTEP)
+ * which guarantees we will do those things once, and redo them if any
+ * of the other work items is re-done, but won't continue looping if
+ * all the other work is done.
  */
-int do_work_pending(struct pt_regs *regs, u32 thread_info_flags)
+void prepare_exit_to_usermode(struct pt_regs *regs, u32 thread_info_flags)
 {
- /* If we enter in kernel mode, do nothing and exit the caller loop. */
- if (!user_mode(regs))
- return 0;
+ if (WARN_ON(!user_mode(regs)))
+ return;
 
- user_exit();
+ do {
+ local_irq_enable();
 
- /* Enable interrupts; they are disabled again on return to caller. */
- local_irq_enable();
+ if (thread_info_flags & _TIF_NEED_RESCHED)
+ schedule();
 
- if (thread_info_flags & _TIF_NEED_RESCHED) {
- schedule();
- return 1;
- }
 #if CHIP_HAS_TILE_DMA()
- if (thread_info_flags & _TIF_ASYNC_TLB) {
- do_async_page_fault(regs);
- return 1;
- }
+ if (thread_info_flags & _TIF_ASYNC_TLB)
+ do_async_page_fault(regs);
 #endif
- if (thread_info_flags & _TIF_SIGPENDING) {
- do_signal(regs);
- return 1;
- }
- if (thread_info_flags & _TIF_NOTIFY_RESUME) {
- clear_thread_flag(TIF_NOTIFY_RESUME);
- tracehook_notify_resume(regs);
- return 1;
- }
- if (thread_info_flags & _TIF_SINGLESTEP)
+
+ if (thread_info_flags & _TIF_SIGPENDING)
+ do_signal(regs);
+
+ if (thread_info_flags & _TIF_NOTIFY_RESUME) {
+ clear_thread_flag(TIF_NOTIFY_RESUME);
+ tracehook_notify_resume(regs);
+ }
+
+ local_irq_disable();
+ thread_info_flags = READ_ONCE(current_thread_info()->flags);
+
+ } while (thread_info_flags & _TIF_WORK_MASK);
+
+ if (thread_info_flags & _TIF_SINGLESTEP) {
  single_step_once(regs);
+#ifndef __tilegx__
+ /*
+ * FIXME: on tilepro, since we enable interrupts in
+ * this routine, it's possible that we miss a signal
+ * or other asynchronous event.
+ */
+ local_irq_disable();
+#endif
+ }
 
  user_enter();
-
- return 0;
 }
 
 unsigned long get_wchan(struct task_struct *p)
--
2.1.2

[PATCH v9 04/13] task_isolation: add initial support

Chris Metcalf-2
In reply to this post by Chris Metcalf-2
The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device-driver
style applications such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
task_isolation=CPULIST boot argument, which enables nohz_full and
isolcpus as well.  The "task_isolation" state is then indicated by
setting a new task struct field, task_isolation_flags, to the value
passed by prctl().  When the _ENABLE bit is set for a task, and it
is returning to userspace on a task isolation core, it calls the
new task_isolation_ready() / task_isolation_enter() routines to
take additional actions to help the task avoid being interrupted
in the future.

The task_isolation_ready() call plays an equivalent role to the
TIF_xxx flags when returning to userspace, and should be tested
in the loop condition of the prepare_exit_to_usermode() routine or its
architecture equivalent.  It is called with interrupts disabled and
inspects the kernel state to determine if it is safe to return into
an isolated state.  In particular, if it sees that the scheduler
tick is still enabled, it sets the TIF_NEED_RESCHED bit to notify
the scheduler to attempt to schedule a different task.

Each time through the loop of TIF work to do, we call the new
task_isolation_enter() routine, which takes any actions that might
avoid a future interrupt to the core: for example, quiescing a worker
thread that would otherwise run later (e.g. the vmstat worker), or
flushing per-cpu state now that would otherwise require a future IPI
to clean up (e.g. the mm lru per-cpu cache).

As a result of these tests on the "return to userspace" path, system
calls (and page faults, etc.) can be inordinately slow.  However,
this quiescing guarantees that no unexpected interrupts will occur,
even if the application intentionally calls into the kernel.

Separate patches that follow provide these changes for x86, arm64,
and tile.
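
As an illustration (not part of the patch), a minimal userspace sketch
of the intended usage might look like the following, on a kernel booted
with e.g. task_isolation=2-3; the cpu number and the polling loop are
arbitrary:

#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <sys/prctl.h>

/* Values from this patch, in case <sys/prctl.h> does not yet have them. */
#ifndef PR_SET_TASK_ISOLATION
#define PR_SET_TASK_ISOLATION		48
#define PR_TASK_ISOLATION_ENABLE	(1 << 0)
#endif

int main(void)
{
	cpu_set_t set;

	/* Bind to exactly one task_isolation cpu (cpu 2, arbitrarily). */
	CPU_ZERO(&set);
	CPU_SET(2, &set);
	if (sched_setaffinity(0, sizeof(set), &set) != 0)
		exit(1);

	/* Ask the kernel to quiesce this cpu on each return to userspace. */
	if (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) != 0)
		exit(1);

	for (;;) {
		/* Latency-sensitive polling, e.g. a userspace NIC driver. */
	}
}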

Signed-off-by: Chris Metcalf <[hidden email]>
---
 Documentation/kernel-parameters.txt |   8 +++
 include/linux/isolation.h           |  50 +++++++++++++++++
 include/linux/sched.h               |   3 ++
 include/uapi/linux/prctl.h          |   5 ++
 init/Kconfig                        |  20 +++++++
 kernel/Makefile                     |   1 +
 kernel/isolation.c                  | 105 ++++++++++++++++++++++++++++++++++++
 kernel/sys.c                        |   9 ++++
 8 files changed, 201 insertions(+)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 742f69d18fc8..e035679e646e 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3665,6 +3665,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
  neutralize any effect of /proc/sys/kernel/sysrq.
  Useful for debugging.
 
+ task_isolation= [KNL]
+ In kernels built with CONFIG_TASK_ISOLATION=y, set
+ the specified list of CPUs where cpus will be able
+ to use prctl(PR_SET_TASK_ISOLATION) to set up task
+ isolation mode.  Setting this boot flag implicitly
+ also sets up nohz_full and isolcpus mode for the
+ listed set of cpus.
+
  tcpmhash_entries= [KNL,NET]
  Set the number of tcp_metrics_hash slots.
  Default value is 8192 or 16384 depending on total
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
new file mode 100644
index 000000000000..ed1bfc793c5a
--- /dev/null
+++ b/include/linux/isolation.h
@@ -0,0 +1,50 @@
+/*
+ * Task isolation related global functions
+ */
+#ifndef _LINUX_ISOLATION_H
+#define _LINUX_ISOLATION_H
+
+#include <linux/tick.h>
+#include <linux/prctl.h>
+
+#ifdef CONFIG_TASK_ISOLATION
+
+/* cpus that are configured to support task isolation */
+extern cpumask_var_t task_isolation_map;
+
+static inline bool task_isolation_possible(int cpu)
+{
+ return tick_nohz_full_enabled() &&
+ cpumask_test_cpu(cpu, task_isolation_map);
+}
+
+extern int task_isolation_set(unsigned int flags);
+
+static inline bool task_isolation_enabled(void)
+{
+ return task_isolation_possible(smp_processor_id()) &&
+ (current->task_isolation_flags & PR_TASK_ISOLATION_ENABLE);
+}
+
+extern bool _task_isolation_ready(void);
+extern void _task_isolation_enter(void);
+
+static inline bool task_isolation_ready(void)
+{
+ return !task_isolation_enabled() || _task_isolation_ready();
+}
+
+static inline void task_isolation_enter(void)
+{
+ if (task_isolation_enabled())
+ _task_isolation_enter();
+}
+
+#else
+static inline bool task_isolation_possible(int cpu) { return false; }
+static inline bool task_isolation_enabled(void) { return false; }
+static inline bool task_isolation_ready(void) { return true; }
+static inline void task_isolation_enter(void) { }
+#endif
+
+#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index edad7a43edea..d439ee4f2ce2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1812,6 +1812,9 @@ struct task_struct {
  unsigned long task_state_change;
 #endif
  int pagefault_disabled;
+#ifdef CONFIG_TASK_ISOLATION
+ unsigned int task_isolation_flags;
+#endif
 /* CPU-specific state of this task */
  struct thread_struct thread;
 /*
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index a8d0759a9e40..67224df4b559 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -197,4 +197,9 @@ struct prctl_mm_map {
 # define PR_CAP_AMBIENT_LOWER 3
 # define PR_CAP_AMBIENT_CLEAR_ALL 4
 
+/* Enable/disable or query task_isolation mode for NO_HZ_FULL kernels. */
+#define PR_SET_TASK_ISOLATION 48
+#define PR_GET_TASK_ISOLATION 49
+# define PR_TASK_ISOLATION_ENABLE (1 << 0)
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index 235c7a2c0d20..fb0c707e527f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -787,6 +787,26 @@ config RCU_EXPEDITE_BOOT
 
 endmenu # "RCU Subsystem"
 
+config TASK_ISOLATION
+ bool "Provide hard CPU isolation from the kernel on demand"
+ depends on NO_HZ_FULL
+ help
+ Allow userspace processes to place themselves on task_isolation
+ cores and run prctl(PR_SET_TASK_ISOLATION) to "isolate"
+ themselves from the kernel.  On return to userspace,
+ isolated tasks will first arrange that no future kernel
+ activity will interrupt the task while the task is running
+ in userspace.  This "hard" isolation from the kernel is
+ required for userspace tasks that are running hard real-time
+ tasks in userspace, such as a 10 Gbit network driver in userspace.
+
+ Without this option, but with NO_HZ_FULL enabled, the kernel
+ will make a best-faith, "soft" effort to shield a single userspace
+ process from interrupts, but makes no guarantees.
+
+ You should say "N" unless you are intending to run a
+ high-performance userspace driver or similar task.
+
 config BUILD_BIN2C
  bool
  default n
diff --git a/kernel/Makefile b/kernel/Makefile
index 53abf008ecb3..693a2ba35679 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -103,6 +103,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 
 obj-$(CONFIG_HAS_IOMEM) += memremap.o
+obj-$(CONFIG_TASK_ISOLATION) += isolation.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
diff --git a/kernel/isolation.c b/kernel/isolation.c
new file mode 100644
index 000000000000..68a9f7457bc0
--- /dev/null
+++ b/kernel/isolation.c
@@ -0,0 +1,105 @@
+/*
+ *  linux/kernel/isolation.c
+ *
+ *  Implementation for task isolation.
+ *
+ *  Distributed under GPLv2.
+ */
+
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/vmstat.h>
+#include <linux/isolation.h>
+#include <linux/syscalls.h>
+#include "time/tick-sched.h"
+
+cpumask_var_t task_isolation_map;
+
+/*
+ * Isolation requires both nohz and isolcpus support from the scheduler.
+ * We provide a boot flag that enables both for now, and which we can
+ * add other functionality to over time if needed.  Note that just
+ * specifying "nohz_full=... isolcpus=..." does not enable task isolation.
+ */
+static int __init task_isolation_setup(char *str)
+{
+ alloc_bootmem_cpumask_var(&task_isolation_map);
+ if (cpulist_parse(str, task_isolation_map) < 0) {
+ pr_warn("task_isolation: Incorrect cpumask '%s'\n", str);
+ return 1;
+ }
+
+ alloc_bootmem_cpumask_var(&cpu_isolated_map);
+ cpumask_copy(cpu_isolated_map, task_isolation_map);
+
+ alloc_bootmem_cpumask_var(&tick_nohz_full_mask);
+ cpumask_copy(tick_nohz_full_mask, task_isolation_map);
+ tick_nohz_full_running = true;
+
+ return 1;
+}
+__setup("task_isolation=", task_isolation_setup);
+
+/*
+ * This routine controls whether we can enable task-isolation mode.
+ * The task must be affinitized to a single task_isolation core or we will
+ * return EINVAL.  Although the application could later re-affinitize
+ * to a housekeeping core and lose task isolation semantics, this
+ * initial test should catch 99% of bugs with task placement prior to
+ * enabling task isolation.
+ */
+int task_isolation_set(unsigned int flags)
+{
+ if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
+    !task_isolation_possible(smp_processor_id()))
+ return -EINVAL;
+
+ current->task_isolation_flags = flags;
+ return 0;
+}
+
+/*
+ * In task isolation mode we try to return to userspace only after
+ * attempting to make sure we won't be interrupted again.  To handle
+ * the periodic scheduler tick, we test to make sure that the tick is
+ * stopped, and if it isn't yet, we request a reschedule so that if
+ * another task needs to run to completion first, it can do so.
+ * Similarly, if any other subsystems require quiescing, we will need
+ * to do that before we return to userspace.
+ */
+bool _task_isolation_ready(void)
+{
+ WARN_ON_ONCE(!irqs_disabled());
+
+ /* If we need to drain the LRU cache, we're not ready. */
+ if (lru_add_drain_needed(smp_processor_id()))
+ return false;
+
+ /* If vmstats need updating, we're not ready. */
+ if (!vmstat_idle())
+ return false;
+
+ /* Request rescheduling unless we are in full dynticks mode. */
+ if (!tick_nohz_tick_stopped()) {
+ set_tsk_need_resched(current);
+ return false;
+ }
+
+ return true;
+}
+
+/*
+ * Each time we try to prepare for return to userspace in a process
+ * with task isolation enabled, we run this code to quiesce whatever
+ * subsystems we can readily quiesce to avoid later interrupts.
+ */
+void _task_isolation_enter(void)
+{
+ WARN_ON_ONCE(irqs_disabled());
+
+ /* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+ lru_add_drain();
+
+ /* Quieten the vmstat worker so it won't interrupt us. */
+ quiet_vmstat();
+}
diff --git a/kernel/sys.c b/kernel/sys.c
index 6af9212ab5aa..7c97227dfb39 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -41,6 +41,7 @@
 #include <linux/syscore_ops.h>
 #include <linux/version.h>
 #include <linux/ctype.h>
+#include <linux/isolation.h>
 
 #include <linux/compat.h>
 #include <linux/syscalls.h>
@@ -2266,6 +2267,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
  case PR_GET_FP_MODE:
  error = GET_FP_MODE(me);
  break;
+#ifdef CONFIG_TASK_ISOLATION
+ case PR_SET_TASK_ISOLATION:
+ error = task_isolation_set(arg2);
+ break;
+ case PR_GET_TASK_ISOLATION:
+ error = me->task_isolation_flags;
+ break;
+#endif
  default:
  error = -EINVAL;
  break;
--
2.1.2

Re: [PATCH v9 08/13] arch/arm64: adopt prepare_exit_to_usermode() model from x86

Mark Rutland
In reply to this post by Chris Metcalf-2
Hi,

On Mon, Jan 04, 2016 at 02:34:46PM -0500, Chris Metcalf wrote:
> This change is a prerequisite change for TASK_ISOLATION but also
> stands on its own for readability and maintainability.

I have also been looking into converting the userspace return path from
assembly to C [1], for the latter two reasons. Based on that, I have a
couple of comments.

> The existing arm64 do_notify_resume() is called in a loop from
> assembly on the slow path; this change moves the loop into C code as
> well.  For the x86 version see commit c5c46f59e4e7 ("x86/entry: Add
> new, comprehensible entry and exit handlers written in C").
>
> Signed-off-by: Chris Metcalf <[hidden email]>
> ---
>  arch/arm64/kernel/entry.S  |  6 +++---
>  arch/arm64/kernel/signal.c | 32 ++++++++++++++++++++++----------
>  2 files changed, 25 insertions(+), 13 deletions(-)
>
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index 7ed3d75f6304..04eff4c4ac6e 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -630,9 +630,8 @@ work_pending:
>   mov x0, sp // 'regs'
>   tst x2, #PSR_MODE_MASK // user mode regs?
>   b.ne no_work_pending // returning to kernel
> - enable_irq // enable interrupts for do_notify_resume()
> - bl do_notify_resume
> - b ret_to_user
> + bl prepare_exit_to_usermode
> + b no_user_work_pending
>  work_resched:
>   bl schedule
>  
> @@ -644,6 +643,7 @@ ret_to_user:
>   ldr x1, [tsk, #TI_FLAGS]
>   and x2, x1, #_TIF_WORK_MASK
>   cbnz x2, work_pending
> +no_user_work_pending:
>   enable_step_tsk x1, x2
>  no_work_pending:
>   kernel_exit 0

It seems unfortunate to leave behind portions of the entry.S
_TIF_WORK_MASK state machine (i.e. a small portion of ret_fast_syscall,
and the majority of work_pending and ret_to_user).

I think it would be nicer if we could handle all of that in one place
(or at least all in C).

> diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
> index e18c48cb6db1..fde59c1139a9 100644
> --- a/arch/arm64/kernel/signal.c
> +++ b/arch/arm64/kernel/signal.c
> @@ -399,18 +399,30 @@ static void do_signal(struct pt_regs *regs)
>   restore_saved_sigmask();
>  }
>  
> -asmlinkage void do_notify_resume(struct pt_regs *regs,
> - unsigned int thread_flags)
> +asmlinkage void prepare_exit_to_usermode(struct pt_regs *regs,
> + unsigned int thread_flags)
>  {
> - if (thread_flags & _TIF_SIGPENDING)
> - do_signal(regs);
> + do {
> + local_irq_enable();
>  
> - if (thread_flags & _TIF_NOTIFY_RESUME) {
> - clear_thread_flag(TIF_NOTIFY_RESUME);
> - tracehook_notify_resume(regs);
> - }
> + if (thread_flags & _TIF_NEED_RESCHED)
> + schedule();

Previously, had we called schedule(), we'd reload the thread info flags
and start that state machine again, whereas now we'll handle all the
cached flags before reloading.

Are we sure nothing is relying on the prior behaviour?

> +
> + if (thread_flags & _TIF_SIGPENDING)
> + do_signal(regs);
> +
> + if (thread_flags & _TIF_NOTIFY_RESUME) {
> + clear_thread_flag(TIF_NOTIFY_RESUME);
> + tracehook_notify_resume(regs);
> + }
> +
> + if (thread_flags & _TIF_FOREIGN_FPSTATE)
> + fpsimd_restore_current_state();
> +
> + local_irq_disable();
>  
> - if (thread_flags & _TIF_FOREIGN_FPSTATE)
> - fpsimd_restore_current_state();
> + thread_flags = READ_ONCE(current_thread_info()->flags) &
> + _TIF_WORK_MASK;
>  
> + } while (thread_flags);
>  }

Other than that, this looks good to me.

Thanks,
Mark.

[1] https://git.kernel.org/cgit/linux/kernel/git/mark/linux.git/log/?h=arm64/entry-deasm

Re: [PATCH v9 08/13] arch/arm64: adopt prepare_exit_to_usermode() model from x86

Chris Metcalf-2
On 01/04/2016 03:33 PM, Mark Rutland wrote:
> Hi,
>
> On Mon, Jan 04, 2016 at 02:34:46PM -0500, Chris Metcalf wrote:
>> This change is a prerequisite change for TASK_ISOLATION but also
>> stands on its own for readability and maintainability.
> I have also been looking into converting the userspace return path from
> assembly to C [1], for the latter two reasons. Based on that, I have a
> couple of comments.

Thanks!

> It seems unfortunate to leave behind portions of the entry.S
> _TIF_WORK_MASK state machine (i.e. a small portion of ret_fast_syscall,
> and the majority of work_pending and ret_to_user).
>
> I think it would be nicer if we could handle all of that in one place
> (or at least all in C).

Yes, in principle I agree with this, and I think your deasm tree looks
like an excellent idea.

For this patch series I wanted to focus more on what was necessary
for the various platforms to implement task isolation, and less on
additional cleanups of the platforms in question.  I think my changes
don't make the TIF state machine any less clear, nor do they make
it harder for an eventual further migration to C code along the lines
of what you've done, so it seems plausible to me to commit them
upstream independently of your work.

>> diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
>> index e18c48cb6db1..fde59c1139a9 100644
>> --- a/arch/arm64/kernel/signal.c
>> +++ b/arch/arm64/kernel/signal.c
>> @@ -399,18 +399,30 @@ static void do_signal(struct pt_regs *regs)
>>   restore_saved_sigmask();
>>   }
>>  
>> -asmlinkage void do_notify_resume(struct pt_regs *regs,
>> - unsigned int thread_flags)
>> +asmlinkage void prepare_exit_to_usermode(struct pt_regs *regs,
>> + unsigned int thread_flags)
>>   {
>> - if (thread_flags & _TIF_SIGPENDING)
>> - do_signal(regs);
>> + do {
>> + local_irq_enable();
>>  
>> - if (thread_flags & _TIF_NOTIFY_RESUME) {
>> - clear_thread_flag(TIF_NOTIFY_RESUME);
>> - tracehook_notify_resume(regs);
>> - }
>> + if (thread_flags & _TIF_NEED_RESCHED)
>> + schedule();
> Previously, had we called schedule(), we'd reload the thread info flags
> and start that state machine again, whereas now we'll handle all the
> cached flags before reloading.
>
> Are we sure nothing is relying on the prior behaviour?

Good eye, and I probably should have called that out in the commit
message.  My best guess is that there should be nothing that depends
on the old semantics.  Other platforms (certainly x86 and tile, anyway)
already have the semantics that you run out the old state machine on
return from schedule(), so regardless, it's probably appropriate for
arm to follow that same convention.

>> +
>> + if (thread_flags & _TIF_SIGPENDING)
>> + do_signal(regs);
>> +
>> + if (thread_flags & _TIF_NOTIFY_RESUME) {
>> + clear_thread_flag(TIF_NOTIFY_RESUME);
>> + tracehook_notify_resume(regs);
>> + }
>> +
>> + if (thread_flags & _TIF_FOREIGN_FPSTATE)
>> + fpsimd_restore_current_state();
>> +
>> + local_irq_disable();
>>  
>> - if (thread_flags & _TIF_FOREIGN_FPSTATE)
>> - fpsimd_restore_current_state();
>> + thread_flags = READ_ONCE(current_thread_info()->flags) &
>> + _TIF_WORK_MASK;
>>  
>> + } while (thread_flags);
>>   }
> Other than that, this looks good to me.
>
> Thanks,
> Mark.
>
> [1] https://git.kernel.org/cgit/linux/kernel/git/mark/linux.git/log/?h=arm64/entry-deasm

Thanks again for the review - shall I add your Reviewed-by (or Acked-by?)
to this patch?

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

[PATCH v9bis 07/13] arch/x86: enable task isolation functionality

Chris Metcalf-2
In reply to this post by Chris Metcalf-2
In prepare_exit_to_usermode(), call task_isolation_ready()
when we are checking the thread-info flags, and after we've handled
the other work, call task_isolation_enter() unconditionally.

In syscall_trace_enter_phase1(), we add the necessary support for
strict-mode detection of syscalls.

We add strict reporting for the kernel exception types that do
not result in signals, namely non-signalling page faults and
non-signalling MPX fixups.

Signed-off-by: Chris Metcalf <[hidden email]>
---
Oops! In v9 I sent a version of this patch that didn't have the
semantic merge to 4.4 from Andy's commit 39b48e575e92 ("x86/entry:
Split and inline prepare_exit_to_usermode()").  This "v9bis" version
adds the necessary extra check to get into exit_to_usermode_loop()
in the first place when running in task-isolation mode.

 arch/x86/entry/common.c | 18 ++++++++++++++++--
 arch/x86/kernel/traps.c |  2 ++
 arch/x86/mm/fault.c     |  2 ++
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index a89fdbc1f0be..477d8cafaaf2 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -21,6 +21,7 @@
 #include <linux/context_tracking.h>
 #include <linux/user-return-notifier.h>
 #include <linux/uprobes.h>
+#include <linux/isolation.h>
 
 #include <asm/desc.h>
 #include <asm/traps.h>
@@ -91,6 +92,10 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
  */
  if (work & _TIF_NOHZ) {
  enter_from_user_mode();
+ if (task_isolation_check_syscall(regs->orig_ax)) {
+ regs->orig_ax = -1;
+ return 0;
+ }
  work &= ~_TIF_NOHZ;
  }
 #endif
@@ -254,17 +259,26 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
  if (cached_flags & _TIF_USER_RETURN_NOTIFY)
  fire_user_return_notifiers();
 
+ task_isolation_enter();
+
  /* Disable IRQs and retry */
  local_irq_disable();
 
  cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags);
 
- if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
+ if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS) &&
+    task_isolation_ready())
  break;
 
  }
 }
 
+#ifdef CONFIG_TASK_ISOLATION
+# define EXIT_TO_USERMODE_FLAGS (EXIT_TO_USERMODE_LOOP_FLAGS | _TIF_NOHZ)
+#else
+# define EXIT_TO_USERMODE_FLAGS EXIT_TO_USERMODE_LOOP_FLAGS
+#endif
+
 /* Called with IRQs disabled. */
 __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
 {
@@ -278,7 +292,7 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
  cached_flags =
  READ_ONCE(pt_regs_to_thread_info(regs)->flags);
 
- if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
+ if (unlikely(cached_flags & EXIT_TO_USERMODE_FLAGS))
  exit_to_usermode_loop(regs, cached_flags);
 
  user_enter();
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index ade185a46b1d..82bf53ec1e98 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -36,6 +36,7 @@
 #include <linux/mm.h>
 #include <linux/smp.h>
 #include <linux/io.h>
+#include <linux/isolation.h>
 
 #ifdef CONFIG_EISA
 #include <linux/ioport.h>
@@ -398,6 +399,7 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
  case 2: /* Bound directory has invalid entry. */
  if (mpx_handle_bd_fault())
  goto exit_trap;
+ task_isolation_check_exception("bounds check");
  break; /* Success, it was handled */
  case 1: /* Bound violation. */
  info = mpx_generate_siginfo(regs);
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index eef44d9a3f77..7b23487a3bd7 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -14,6 +14,7 @@
 #include <linux/prefetch.h> /* prefetchw */
 #include <linux/context_tracking.h> /* exception_enter(), ... */
 #include <linux/uaccess.h> /* faulthandler_disabled() */
+#include <linux/isolation.h> /* task_isolation_check_exception */
 
 #include <asm/traps.h> /* dotraplinkage, ... */
 #include <asm/pgalloc.h> /* pgd_*(), ... */
@@ -1148,6 +1149,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
  local_irq_enable();
  error_code |= PF_USER;
  flags |= FAULT_FLAG_USER;
+ task_isolation_check_exception("page fault at %#lx", address);
  } else {
  if (regs->flags & X86_EFLAGS_IF)
  local_irq_enable();
--
2.1.2

Re: [PATCH v9 08/13] arch/arm64: adopt prepare_exit_to_usermode() model from x86

Andy Lutomirski
In reply to this post by Mark Rutland
On Mon, Jan 4, 2016 at 12:33 PM, Mark Rutland <[hidden email]> wrote:

> Hi,
>
> On Mon, Jan 04, 2016 at 02:34:46PM -0500, Chris Metcalf wrote:
>> This change is a prerequisite change for TASK_ISOLATION but also
>> stands on its own for readability and maintainability.
>
> I have also been looking into converting the userspace return path from
> assembly to C [1], for the latter two reasons. Based on that, I have a
> couple of comments.
>

>
> [1] https://git.kernel.org/cgit/linux/kernel/git/mark/linux.git/log/?h=arm64/entry-deasm

Neat!

In case you want to compare notes, I have a branch with the entire
syscall path on x86 in C except for cleanly separated asm fast path
optimizations:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=x86/entry_compat

Even in Linus' tree, the x86 32-bit syscalls are in C.

Re: [PATCH v9 06/13] task_isolation: add debug boot flag

Steven Rostedt
In reply to this post by Chris Metcalf-2
On Mon, 4 Jan 2016 14:34:44 -0500
Chris Metcalf <[hidden email]> wrote:


> +#ifdef CONFIG_TASK_ISOLATION
> +void task_isolation_debug(int cpu)
> +{
> + struct task_struct *p;
> +
> + if (!task_isolation_possible(cpu))
> + return;
> +
> + rcu_read_lock();

What's the rcu_read_lock() for? I don't see what is being protected by
rcu here?

-- Steve

> + p = cpu_curr(cpu);
> + get_task_struct(p);
> + rcu_read_unlock();
> + task_isolation_debug_task(cpu, p);
> + put_task_struct(p);
> +}
> +#endif
> +

Re: [PATCH v9 06/13] task_isolation: add debug boot flag

Chris Metcalf-2
On 1/4/2016 5:52 PM, Steven Rostedt wrote:

> On Mon, 4 Jan 2016 14:34:44 -0500
> Chris Metcalf <[hidden email]> wrote:
>
>
>> +#ifdef CONFIG_TASK_ISOLATION
>> +void task_isolation_debug(int cpu)
>> +{
>> + struct task_struct *p;
>> +
>> + if (!task_isolation_possible(cpu))
>> + return;
>> +
>> + rcu_read_lock();
> What's the rcu_read_lock() for? I don't see what is being protected by
> rcu here?

I'm not completely clear either, but this is the same idiom as is used throughout
kernel/sched/core.c when mapping from a pid or a cpu to a task_struct, since
obviously you could end up racing with the task_struct being removed after the
task dies.  My best understanding is that the rcu_read_lock() holds up the final
free of the structure so that we have time here to get another reference to it.

See for example sched_setaffinity() for a similar use of the idiom.
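
Roughly, the shape of that idiom (paraphrased from memory, not a
verbatim quote of the scheduler code) is:

	rcu_read_lock();
	p = find_process_by_pid(pid);		/* or cpu_curr(cpu), as here */
	if (!p) {
		rcu_read_unlock();
		return -ESRCH;
	}
	/* Pin the task so it can't be freed once we drop the RCU read lock. */
	get_task_struct(p);
	rcu_read_unlock();

	/* ... operate on p ... */

	put_task_struct(p);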

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
