[PATCH v3 0/3] Handle IST interrupts from userspace on the normal stack


Andy Lutomirski
We currently run IST interrupt handlers on the IST stack even when
they interrupt userspace.  Changing that may simplify a few things.
See patch 2 for details.

Patch 1 fixes a not-quite-bug in uprobes, noticed by Oleg, that would
be exposed by patch 2.

NB: Tony has seen odd behavior when stress-testing injected
machine checks with this series applied.  I suspect that
it's a bug in something else, possibly his BIOS.  Bugs in
this series shouldn't be ruled out, though.

Andy Lutomirski (3):
  uprobes, x86: Fix _TIF_UPROBE vs _TIF_NOTIFY_RESUME
  x86, entry: Switch stacks on a paranoid entry from userspace
  sched, x86: Check that we're on the right stack in schedule and
    __might_sleep

 Documentation/x86/entry_64.txt         | 18 ++++---
 Documentation/x86/x86_64/kernel-stacks |  8 ++--
 arch/x86/Kconfig                       |  1 +
 arch/x86/include/asm/thread_info.h     | 19 +++++++-
 arch/x86/kernel/entry_64.S             | 86 ++++++++++++++++++----------------
 arch/x86/kernel/irq_32.c               | 13 ++---
 arch/x86/kernel/traps.c                | 23 ++-------
 include/linux/thread_info.h            |  7 +++
 kernel/Kconfig.locks                   |  3 ++
 kernel/events/uprobes.c                |  1 -
 kernel/sched/core.c                    | 14 ++++--
 11 files changed, 109 insertions(+), 84 deletions(-)

--
1.9.3


[PATCH v3 1/3] uprobes, x86: Fix _TIF_UPROBE vs _TIF_NOTIFY_RESUME

Andy Lutomirski
x86 calls do_notify_resume on paranoid returns if TIF_UPROBE is set,
but not on non-paranoid returns.  I suspect that this is a mistake
and that the code only works because int3 is paranoid.

Setting _TIF_NOTIFY_RESUME in the uprobe code was probably a
workaround for the x86 bug.  With that bug fixed, we can remove
_TIF_NOTIFY_RESUME from the uprobes code.

Acked-by: Srikar Dronamraju <[hidden email]>
Reported-by: Oleg Nesterov <[hidden email]>
Signed-off-by: Andy Lutomirski <[hidden email]>
---
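For context (not part of this patch): with _TIF_UPROBE included in
_TIF_DO_NOTIFY_MASK, the common exit path's do_notify_resume() picks
the flag up on every return to userspace.  A simplified sketch of the
relevant dispatch in arch/x86/kernel/signal.c:

        /* handle breakpoint-related work before returning to userspace */
        if (thread_info_flags & _TIF_UPROBE)
                uprobe_notify_resume(regs);

        if (thread_info_flags & _TIF_SIGPENDING)
                do_signal(regs);

        if (thread_info_flags & _TIF_NOTIFY_RESUME) {
                clear_thread_flag(TIF_NOTIFY_RESUME);
                tracehook_notify_resume(regs);
        }
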
 arch/x86/include/asm/thread_info.h | 2 +-
 kernel/events/uprobes.c            | 1 -
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 854053889d4d..547e344a6dc6 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -141,7 +141,7 @@ struct thread_info {
 /* Only used for 64 bit */
 #define _TIF_DO_NOTIFY_MASK \
  (_TIF_SIGPENDING | _TIF_MCE_NOTIFY | _TIF_NOTIFY_RESUME | \
- _TIF_USER_RETURN_NOTIFY)
+ _TIF_USER_RETURN_NOTIFY | _TIF_UPROBE)
 
 /* flags to check in __switch_to() */
 #define _TIF_WORK_CTXSW \
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 1d0af8a2c646..ed8f2cde34c5 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1640,7 +1640,6 @@ bool uprobe_deny_signal(void)
  if (__fatal_signal_pending(t) || arch_uprobe_xol_was_trapped(t)) {
  utask->state = UTASK_SSTEP_TRAPPED;
  set_tsk_thread_flag(t, TIF_UPROBE);
- set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
  }
  }
 
--
1.9.3


[PATCH v3 2/3] x86, entry: Switch stacks on a paranoid entry from userspace

Andy Lutomirski
This causes all non-NMI, non-double-fault kernel entries from
userspace to run on the normal kernel stack.  Double-fault is
exempt to minimize confusion if we double-fault directly from
userspace due to a bad kernel stack.

This is, surprisingly, simpler and shorter than the current code.  It
removes the (IMO rather frightening) paranoid_userspace path, and it
makes sync_regs much simpler.

There is no risk of stack overflow due to this change -- the kernel
stack that we switch to is empty.

This may enable the machine check code to be simplified at some
point, because a machine check from userspace will be able to safely
enable interrupts.  It will also allow the upcoming fsgsbase code to
be simplified, because it will no longer need to worry about usergs
when scheduling in paranoid_exit; that code path no longer exists.

Cc: Oleg Nesterov <[hidden email]>
Cc: Andi Kleen <[hidden email]>
Cc: Tony Luck <[hidden email]>
Cc: Borislav Petkov <[hidden email]>
Signed-off-by: Andy Lutomirski <[hidden email]>
---
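Not part of this patch, but a minimal sketch of the kind of
simplification this enables: an IST handler that interrupted user mode
now runs on the normal task stack, so it can (if it chooses) enable
interrupts and do process-context work directly, e.g.:

        /* hypothetical tail of an IST handler such as do_machine_check() */
        if (user_mode_vm(regs)) {
                local_irq_enable();
                /* process-context work, e.g. memory_failure()/force_sig() */
                local_irq_disable();
        }

Tony's follow-up patch later in this thread does essentially this for
the machine check recovery path.
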
 Documentation/x86/entry_64.txt         | 18 ++++---
 Documentation/x86/x86_64/kernel-stacks |  8 ++--
 arch/x86/kernel/entry_64.S             | 86 ++++++++++++++++++----------------
 arch/x86/kernel/traps.c                | 23 ++-------
 4 files changed, 67 insertions(+), 68 deletions(-)

diff --git a/Documentation/x86/entry_64.txt b/Documentation/x86/entry_64.txt
index bc7226ef5055..8e53217e6d40 100644
--- a/Documentation/x86/entry_64.txt
+++ b/Documentation/x86/entry_64.txt
@@ -75,9 +75,6 @@ The expensive (paranoid) way is to read back the MSR_GS_BASE value
  xorl %ebx,%ebx
 1: ret
 
-and the whole paranoid non-paranoid macro complexity is about whether
-to suffer that RDMSR cost.
-
 If we are at an interrupt or user-trap/gate-alike boundary then we can
 use the faster check: the stack will be a reliable indicator of
 whether SWAPGS was already done: if we see that we are a secondary
@@ -90,6 +87,15 @@ which might have triggered right after a normal entry wrote CS to the
 stack but before we executed SWAPGS, then the only safe way to check
 for GS is the slower method: the RDMSR.
 
-So we try only to mark those entry methods 'paranoid' that absolutely
-need the more expensive check for the GS base - and we generate all
-'normal' entry points with the regular (faster) entry macros.
+Therefore, super-atomic entries (except NMI, which is handled separately)
+must use idtentry with paranoid=1 to handle gsbase correctly.  This
+triggers three main behavior changes:
+
+ - Interrupt entry will use the slower gsbase check.
+ - Interrupt entry from user mode will switch off the IST stack.
+ - Interrupt exit to kernel mode will not attempt to reschedule.
+
+We try to only use IST entries and the paranoid entry code for vectors
+that absolutely need the more expensive check for the GS base - and we
+generate all 'normal' entry points with the regular (faster) paranoid=0
+variant.
diff --git a/Documentation/x86/x86_64/kernel-stacks b/Documentation/x86/x86_64/kernel-stacks
index a01eec5d1d0b..e3c8a49d1a2f 100644
--- a/Documentation/x86/x86_64/kernel-stacks
+++ b/Documentation/x86/x86_64/kernel-stacks
@@ -40,9 +40,11 @@ An IST is selected by a non-zero value in the IST field of an
 interrupt-gate descriptor.  When an interrupt occurs and the hardware
 loads such a descriptor, the hardware automatically sets the new stack
 pointer based on the IST value, then invokes the interrupt handler.  If
-software wants to allow nested IST interrupts then the handler must
-adjust the IST values on entry to and exit from the interrupt handler.
-(This is occasionally done, e.g. for debug exceptions.)
+the interrupt came from user mode, then the interrupt handler prologue
+will switch back to the per-thread stack.  If software wants to allow
+nested IST interrupts then the handler must adjust the IST values on
+entry to and exit from the interrupt handler.  (This is occasionally
+done, e.g. for debug exceptions.)
 
 Events with different IST codes (i.e. with different stacks) can be
 nested.  For example, a debug interrupt can safely be interrupted by an
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index df088bb03fb3..de24b2eac6b2 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1064,6 +1064,11 @@ ENTRY(\sym)
  CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
 
  .if \paranoid
+ .if \paranoid == 1
+ CFI_REMEMBER_STATE
+ testl $3, CS(%rsp) /* If coming from userspace, switch */
+ jnz 1f /* stacks. */
+ .endif
  call save_paranoid
  .else
  call error_entry
@@ -1104,6 +1109,36 @@ ENTRY(\sym)
  jmp error_exit /* %ebx: no swapgs flag */
  .endif
 
+ .if \paranoid == 1
+ CFI_RESTORE_STATE
+ /*
+ * Paranoid entry from userspace.  Switch stacks and treat it
+ * as a normal entry.  This means that paranoid handlers
+ * run in real process context if user_mode(regs).
+ */
+1:
+ call error_entry
+
+ DEFAULT_FRAME 0
+
+ movq %rsp,%rdi /* pt_regs pointer */
+ call sync_regs
+ movq %rax,%rsp /* switch stack */
+
+ movq %rsp,%rdi /* pt_regs pointer */
+
+ .if \has_error_code
+ movq ORIG_RAX(%rsp),%rsi /* get error code */
+ movq $-1,ORIG_RAX(%rsp) /* no syscall to restart */
+ .else
+ xorl %esi,%esi /* no error code */
+ .endif
+
+ call \do_sym
+
+ jmp error_exit /* %ebx: no swapgs flag */
+ .endif
+
  CFI_ENDPROC
 END(\sym)
 .endm
@@ -1124,7 +1159,7 @@ idtentry overflow do_overflow has_error_code=0
 idtentry bounds do_bounds has_error_code=0
 idtentry invalid_op do_invalid_op has_error_code=0
 idtentry device_not_available do_device_not_available has_error_code=0
-idtentry double_fault __do_double_fault has_error_code=1 paranoid=1
+idtentry double_fault __do_double_fault has_error_code=1 paranoid=2
 idtentry coprocessor_segment_overrun do_coprocessor_segment_overrun has_error_code=0
 idtentry invalid_TSS do_invalid_TSS has_error_code=1
 idtentry segment_not_present do_segment_not_present has_error_code=1
@@ -1305,16 +1340,14 @@ idtentry machine_check has_error_code=0 paranoid=1 do_sym=*machine_check_vector(
 #endif
 
  /*
- * "Paranoid" exit path from exception stack.
- * Paranoid because this is used by NMIs and cannot take
- * any kernel state for granted.
- * We don't do kernel preemption checks here, because only
- * NMI should be common and it does not enable IRQs and
- * cannot get reschedule ticks.
+ * "Paranoid" exit path from exception stack.  This is invoked
+ * only on return from non-NMI IST interrupts that came
+ * from kernel space.
  *
- * "trace" is 0 for the NMI handler only, because irq-tracing
- * is fundamentally NMI-unsafe. (we cannot change the soft and
- * hard flags at once, atomically)
+ * We may be returning to very strange contexts (e.g. very early
+ * in syscall entry), so checking for preemption here would
+ * be complicated.  Fortunately, there's no good reason
+ * to try to handle preemption here.
  */
 
  /* ebx: no swapgs flag */
@@ -1324,43 +1357,14 @@ ENTRY(paranoid_exit)
  TRACE_IRQS_OFF_DEBUG
  testl %ebx,%ebx /* swapgs needed? */
  jnz paranoid_restore
- testl $3,CS(%rsp)
- jnz   paranoid_userspace
-paranoid_swapgs:
  TRACE_IRQS_IRETQ 0
  SWAPGS_UNSAFE_STACK
  RESTORE_ALL 8
- jmp irq_return
+ INTERRUPT_RETURN
 paranoid_restore:
  TRACE_IRQS_IRETQ_DEBUG 0
  RESTORE_ALL 8
- jmp irq_return
-paranoid_userspace:
- GET_THREAD_INFO(%rcx)
- movl TI_flags(%rcx),%ebx
- andl $_TIF_WORK_MASK,%ebx
- jz paranoid_swapgs
- movq %rsp,%rdi /* &pt_regs */
- call sync_regs
- movq %rax,%rsp /* switch stack for scheduling */
- testl $_TIF_NEED_RESCHED,%ebx
- jnz paranoid_schedule
- movl %ebx,%edx /* arg3: thread flags */
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_NONE)
- xorl %esi,%esi /* arg2: oldset */
- movq %rsp,%rdi /* arg1: &pt_regs */
- call do_notify_resume
- DISABLE_INTERRUPTS(CLBR_NONE)
- TRACE_IRQS_OFF
- jmp paranoid_userspace
-paranoid_schedule:
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_ANY)
- SCHEDULE_USER
- DISABLE_INTERRUPTS(CLBR_ANY)
- TRACE_IRQS_OFF
- jmp paranoid_userspace
+ INTERRUPT_RETURN
  CFI_ENDPROC
 END(paranoid_exit)
 
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 0d0e922fafc1..837843513ac3 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -375,27 +375,14 @@ NOKPROBE_SYMBOL(do_int3);
 
 #ifdef CONFIG_X86_64
 /*
- * Help handler running on IST stack to switch back to user stack
- * for scheduling or signal handling. The actual stack switch is done in
- * entry.S
+ * Help handler running on IST stack to switch off the IST stack if the
+ * interrupted code was in user mode. The actual stack switch is done in
+ * entry_64.S
  */
 asmlinkage __visible struct pt_regs *sync_regs(struct pt_regs *eregs)
 {
- struct pt_regs *regs = eregs;
- /* Did already sync */
- if (eregs == (struct pt_regs *)eregs->sp)
- ;
- /* Exception from user space */
- else if (user_mode(eregs))
- regs = task_pt_regs(current);
- /*
- * Exception from kernel and interrupts are enabled. Move to
- * kernel process stack.
- */
- else if (eregs->flags & X86_EFLAGS_IF)
- regs = (struct pt_regs *)(eregs->sp -= sizeof(struct pt_regs));
- if (eregs != regs)
- *regs = *eregs;
+ struct pt_regs *regs = task_pt_regs(current);
+ *regs = *eregs;
  return regs;
 }
 NOKPROBE_SYMBOL(sync_regs);
--
1.9.3


[PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep

Andy Lutomirski
On x86, sleeping while on an IST or irq stack has a surprisingly
good chance of working, but it can also fail dramatically.  Add an
arch hook to allow schedule and __might_sleep to catch sleeping on
the wrong stack.

This will also catch do_exit from a funny stack, which could leave
an IST stack shifted or an NMI nesting count incremented.

Signed-off-by: Andy Lutomirski <[hidden email]>
---
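Not part of this patch: the arch_schedule_allowed() check below relies
on THREAD_SIZE being a power of two, so XORing two stack addresses and
masking off the low bits tests whether they fall within the same
THREAD_SIZE-aligned stack.  A tiny standalone illustration (plain
userspace C, with made-up addresses and an assumed 16K THREAD_SIZE):

        #include <stdio.h>
        #include <stdbool.h>

        #define THREAD_SIZE (16UL * 1024)       /* assumed stack size */

        /* true iff both pointers lie in the same THREAD_SIZE region */
        static bool same_stack(unsigned long sp, unsigned long stack)
        {
                return ((sp ^ stack) & ~(THREAD_SIZE - 1)) == 0;
        }

        int main(void)
        {
                unsigned long task_stack = 0xffff880012340000UL;

                /* same 16K region: prints 1 */
                printf("%d\n", same_stack(task_stack + 0x1f00, task_stack + 0x3e00));
                /* a different stack entirely: prints 0 */
                printf("%d\n", same_stack(task_stack + 0x1f00, 0xffff880098760000UL));
                return 0;
        }
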
 arch/x86/Kconfig                   |  1 +
 arch/x86/include/asm/thread_info.h | 17 +++++++++++++++++
 arch/x86/kernel/irq_32.c           | 13 +++----------
 include/linux/thread_info.h        |  7 +++++++
 kernel/Kconfig.locks               |  3 +++
 kernel/sched/core.c                | 14 ++++++++++----
 6 files changed, 41 insertions(+), 14 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ded8a6774ac9..a811286636d2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -137,6 +137,7 @@ config X86
  select HAVE_ACPI_APEI_NMI if ACPI
  select ACPI_LEGACY_TABLES_LOOKUP if ACPI
  select X86_FEATURE_NAMES if PROC_FS
+ select HAVE_ARCH_SCHEDULE_ALLOWED
 
 config INSTRUCTION_DECODER
  def_bool y
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 547e344a6dc6..05701f132473 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -170,6 +170,23 @@ static inline struct thread_info *current_thread_info(void)
  return ti;
 }
 
+static inline unsigned long current_stack_pointer(void)
+{
+ unsigned long sp;
+#ifdef CONFIG_X86_64
+ asm("mov %%rsp,%0" : "=g" (sp));
+#else
+ asm("mov %%esp,%0" : "=g" (sp));
+#endif
+ return sp;
+}
+
+static inline bool arch_schedule_allowed(void)
+{
+ return ((current_stack_pointer() ^ this_cpu_read_stable(kernel_stack))
+ & ~(THREAD_SIZE - 1)) == 0;
+}
+
 #else /* !__ASSEMBLY__ */
 
 /* how to get the thread information struct from ASM */
diff --git a/arch/x86/kernel/irq_32.c b/arch/x86/kernel/irq_32.c
index 63ce838e5a54..28d28f5eb8f4 100644
--- a/arch/x86/kernel/irq_32.c
+++ b/arch/x86/kernel/irq_32.c
@@ -69,16 +69,9 @@ static void call_on_stack(void *func, void *stack)
      : "memory", "cc", "edx", "ecx", "eax");
 }
 
-/* how to get the current stack pointer from C */
-#define current_stack_pointer ({ \
- unsigned long sp; \
- asm("mov %%esp,%0" : "=g" (sp)); \
- sp; \
-})
-
 static inline void *current_stack(void)
 {
- return (void *)(current_stack_pointer & ~(THREAD_SIZE - 1));
+ return (void *)(current_stack_pointer() & ~(THREAD_SIZE - 1));
 }
 
 static inline int
@@ -103,7 +96,7 @@ execute_on_irq_stack(int overflow, struct irq_desc *desc, int irq)
 
  /* Save the next esp at the bottom of the stack */
  prev_esp = (u32 *)irqstk;
- *prev_esp = current_stack_pointer;
+ *prev_esp = current_stack_pointer();
 
  if (unlikely(overflow))
  call_on_stack(print_stack_overflow, isp);
@@ -156,7 +149,7 @@ void do_softirq_own_stack(void)
 
  /* Push the previous esp onto the stack */
  prev_esp = (u32 *)irqstk;
- *prev_esp = current_stack_pointer;
+ *prev_esp = current_stack_pointer();
 
  call_on_stack(__do_softirq, isp);
 }
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index ff307b548ed3..6deaf7e97009 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -145,6 +145,13 @@ static inline bool test_and_clear_restore_sigmask(void)
 #error "no set_restore_sigmask() provided and default one won't work"
 #endif
 
+#ifndef CONFIG_HAVE_ARCH_SCHEDULE_ALLOWED
+static inline bool arch_schedule_allowed(void)
+{
+ return true;
+}
+#endif
+
 #endif /* __KERNEL__ */
 
 #endif /* _LINUX_THREAD_INFO_H */
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index 76768ee812b2..2714dc34695a 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -237,3 +237,6 @@ config ARCH_USE_QUEUE_RWLOCK
 config QUEUE_RWLOCK
  def_bool y if ARCH_USE_QUEUE_RWLOCK
  depends on SMP
+
+config HAVE_ARCH_SCHEDULE_ALLOWED
+       bool
\ No newline at end of file
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 240157c13ddc..e51ab65a9750 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2705,8 +2705,12 @@ static inline void schedule_debug(struct task_struct *prev)
  * Test if we are atomic. Since do_exit() needs to call into
  * schedule() atomically, we ignore that path. Otherwise whine
  * if we are scheduling when we should not.
+ *
+ * If architectural conditions for scheduling are not met,
+ * complain even if we are in do_exit.
  */
- if (unlikely(in_atomic_preempt_off() && prev->state != TASK_DEAD))
+ if (unlikely((in_atomic_preempt_off() && prev->state != TASK_DEAD) ||
+     !arch_schedule_allowed()))
  __schedule_bug(prev);
  rcu_sleep_check();
 
@@ -7200,10 +7204,12 @@ static inline int preempt_count_equals(int preempt_offset)
 void __might_sleep(const char *file, int line, int preempt_offset)
 {
  static unsigned long prev_jiffy; /* ratelimiting */
+ bool arch_ok;
 
  rcu_sleep_check(); /* WARN_ON_ONCE() by default, no rate limit reqd. */
+ arch_ok = arch_schedule_allowed();
  if ((preempt_count_equals(preempt_offset) && !irqs_disabled() &&
-     !is_idle_task(current)) ||
+     !is_idle_task(current) && arch_ok) ||
     system_state != SYSTEM_RUNNING || oops_in_progress)
  return;
  if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy)
@@ -7214,8 +7220,8 @@ void __might_sleep(const char *file, int line, int preempt_offset)
  "BUG: sleeping function called from invalid context at %s:%d\n",
  file, line);
  printk(KERN_ERR
- "in_atomic(): %d, irqs_disabled(): %d, pid: %d, name: %s\n",
- in_atomic(), irqs_disabled(),
+ "in_atomic(): %d, irqs_disabled(): %d, arch_schedule_allowed: %d, pid: %d, name: %s\n",
+ in_atomic(), irqs_disabled(), (int)arch_ok,
  current->pid, current->comm);
 
  debug_show_held_locks(current);
--
1.9.3


RE: [PATCH v3 0/3] Handle IST interrupts from userspace on the normal stack

Luck, Tony
> NB: Tony has seen odd behavior when stress-testing injected
> machine checks with this series applied.  I suspect that
> it's a bug in something else, possibly his BIOS.  Bugs in
> this series shouldn't be ruled out, though.

v3 did 3.5x better than earlier ones ... survived overnight but died at 91724
injection/consumption/recovery cycles just now.  Different symptom:
instead of losing some cpus, there was a fatal machine check (PCC=1
and OVER=1 bits set in the machine check bank). This might be from a
known issue.
Not sure if this was due to some improvement in the code, or because
I changed the system configuration by pulling out all the memory except
for that on memory controller 0 on node 0. Our BIOS team had told me
they'd seen some instability in the injection code on fully populated
systems.

I did instrument the synchronization in mce_start(). I was a bit worried
that with ever-increasing numbers of cpus the 100ns delay between
pounding on atomic ops on mce_callin might not be enough. But it
seems we are not in trouble yet. Slowest synchronization recorded
took 1.8M TSC cycles. Mean is 500K cycles.  So my gut feeling that
the one second timeout was very conservative is correct.

-Tony

Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep

Linus Torvalds
On Tue, Nov 18, 2014 at 3:15 PM, Andy Lutomirski <[hidden email]> wrote:
> On x86, sleeping while on an IST or irq stack has a surprisingly
> good chance of working, but it can also fail dramatically.  Add an
> arch hook to allow schedule and __might_sleep to catch sleeping on
> the wrong stack.

Why doesn't the normal in_interrupt() test catch this?

          Linus

Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep

Andy Lutomirski
On Wed, Nov 19, 2014 at 10:40 AM, Linus Torvalds
<[hidden email]> wrote:
> On Tue, Nov 18, 2014 at 3:15 PM, Andy Lutomirski <[hidden email]> wrote:
>> On x86, sleeping while on an IST or irq stack has a surprisingly
>> good chance of working, but it can also fail dramatically.  Add an
>> arch hook to allow schedule and __might_sleep to catch sleeping on
>> the wrong stack.
>
> Why doesn't the normal in_interrupt() test catch this?

It catches scheduling on an irq stack, assuming that all of the
irq_count stuff is working correctly.  I don't think it catches
sleeping on an IST stack.
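
(For reference, in_interrupt() only reflects the hardirq/softirq/NMI
bits of preempt_count.  Roughly, from include/linux/preempt_mask.h at
the time:

        #define hardirq_count() (preempt_count() & HARDIRQ_MASK)
        #define softirq_count() (preempt_count() & SOFTIRQ_MASK)
        #define irq_count()     (preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK | NMI_MASK))
        #define in_interrupt()  (irq_count())

so unless an IST entry bumps one of those counts, in_interrupt() has no
way to notice that we're on an IST stack.)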

--Andy

>
>           Linus



--
Andy Lutomirski
AMA Capital Management, LLC

Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep

Andi Kleen
On Wed, Nov 19, 2014 at 10:40:10AM -0800, Linus Torvalds wrote:
> On Tue, Nov 18, 2014 at 3:15 PM, Andy Lutomirski <[hidden email]> wrote:
> > On x86, sleeping while on an IST or irq stack has a surprisingly
> > good chance of working, but it can also fail dramatically.  Add an
> > arch hook to allow schedule and __might_sleep to catch sleeping on
> > the wrong stack.
>
> Why doesn't the normal in_interrupt() test catch this?

The exception handlers which use the IST stacks don't necessarily
set irq count. Maybe they should.

-Andi

--
[hidden email] -- Speaking for myself only.

Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep

Linus Torvalds
On Wed, Nov 19, 2014 at 11:29 AM, Andi Kleen <[hidden email]> wrote:
>
> The exception handlers which use the IST stacks don't necessarily
> set irq count. Maybe they should.

Hmm. I think they should. Since they clearly must not schedule, as
they use a percpu stack.

Which exceptions use IST?

[ grep grep ]

Looks like stack, doublefault, nmi, debug and mce. And yes, I really
think they should all raise the irq count if they don't already.
Rather than add random arch-specific "let's check that we're on the
right stack" code to the might-sleep stuff, just use the one we have.

                    Linus

[PATCH] x86, mce: Get rid of TIF_MCE_NOTIFY and associated mce tricks

Luck, Tony
We now switch to the kernel stack when a machine check interrupts
user mode.  This means that we can perform recovery actions in the
tail of do_machine_check().  So say goodbye to TIF_MCE_NOTIFY,
mce_save_info(), mce_find_info() and mce_notify_process().

Signed-off-by: Tony Luck <[hidden email]>
---

Obviously this is dependent on Andy's patch series in
  git://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git x86/paranoid

 arch/x86/include/asm/mce.h         |   1 -
 arch/x86/include/asm/thread_info.h |   4 +-
 arch/x86/kernel/cpu/mcheck/mce.c   | 106 ++++++++-----------------------------
 arch/x86/kernel/signal.c           |   6 ---
 4 files changed, 23 insertions(+), 94 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 958b90f761e5..be5a51a4a219 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -187,7 +187,6 @@ enum mcp_flags {
 void machine_check_poll(enum mcp_flags flags, mce_banks_t *b);
 
 int mce_notify_irq(void);
-void mce_notify_process(void);
 
 DECLARE_PER_CPU(struct mce, injectm);
 
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 05701f132473..55a63f114a74 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -75,7 +75,6 @@ struct thread_info {
 #define TIF_SYSCALL_EMU 6 /* syscall emulation active */
 #define TIF_SYSCALL_AUDIT 7 /* syscall auditing active */
 #define TIF_SECCOMP 8 /* secure computing */
-#define TIF_MCE_NOTIFY 10 /* notify userspace of an MCE */
 #define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */
 #define TIF_UPROBE 12 /* breakpointed or singlestepping */
 #define TIF_NOTSC 16 /* TSC is not accessible in userland */
@@ -100,7 +99,6 @@ struct thread_info {
 #define _TIF_SYSCALL_EMU (1 << TIF_SYSCALL_EMU)
 #define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT)
 #define _TIF_SECCOMP (1 << TIF_SECCOMP)
-#define _TIF_MCE_NOTIFY (1 << TIF_MCE_NOTIFY)
 #define _TIF_USER_RETURN_NOTIFY (1 << TIF_USER_RETURN_NOTIFY)
 #define _TIF_UPROBE (1 << TIF_UPROBE)
 #define _TIF_NOTSC (1 << TIF_NOTSC)
@@ -140,7 +138,7 @@ struct thread_info {
 
 /* Only used for 64 bit */
 #define _TIF_DO_NOTIFY_MASK \
- (_TIF_SIGPENDING | _TIF_MCE_NOTIFY | _TIF_NOTIFY_RESUME | \
+ (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | \
  _TIF_USER_RETURN_NOTIFY | _TIF_UPROBE)
 
 /* flags to check in __switch_to() */
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 61a9668cebfd..66efc536f81d 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -956,51 +956,6 @@ static void mce_clear_state(unsigned long *toclear)
 }
 
 /*
- * Need to save faulting physical address associated with a process
- * in the machine check handler some place where we can grab it back
- * later in mce_notify_process()
- */
-#define MCE_INFO_MAX 16
-
-struct mce_info {
- atomic_t inuse;
- struct task_struct *t;
- __u64 paddr;
- int restartable;
-} mce_info[MCE_INFO_MAX];
-
-static void mce_save_info(__u64 addr, int c)
-{
- struct mce_info *mi;
-
- for (mi = mce_info; mi < &mce_info[MCE_INFO_MAX]; mi++) {
- if (atomic_cmpxchg(&mi->inuse, 0, 1) == 0) {
- mi->t = current;
- mi->paddr = addr;
- mi->restartable = c;
- return;
- }
- }
-
- mce_panic("Too many concurrent recoverable errors", NULL, NULL);
-}
-
-static struct mce_info *mce_find_info(void)
-{
- struct mce_info *mi;
-
- for (mi = mce_info; mi < &mce_info[MCE_INFO_MAX]; mi++)
- if (atomic_read(&mi->inuse) && mi->t == current)
- return mi;
- return NULL;
-}
-
-static void mce_clear_info(struct mce_info *mi)
-{
- atomic_set(&mi->inuse, 0);
-}
-
-/*
  * The actual machine check handler. This only handles real
  * exceptions when something got corrupted coming in through int 18.
  *
@@ -1037,6 +992,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)
  DECLARE_BITMAP(toclear, MAX_NR_BANKS);
  DECLARE_BITMAP(valid_banks, MAX_NR_BANKS);
  char *msg = "Unknown";
+ u64 recover_paddr = ~0ull;
+ int flags = MF_ACTION_REQUIRED;
 
  this_cpu_inc(mce_exception_count);
 
@@ -1155,9 +1112,9 @@ void do_machine_check(struct pt_regs *regs, long error_code)
  if (no_way_out)
  mce_panic("Fatal machine check on current CPU", &m, msg);
  if (worst == MCE_AR_SEVERITY) {
- /* schedule action before return to userland */
- mce_save_info(m.addr, m.mcgstatus & MCG_STATUS_RIPV);
- set_thread_flag(TIF_MCE_NOTIFY);
+ recover_paddr = m.addr;
+ if (!(m.mcgstatus & MCG_STATUS_RIPV))
+ flags |= MF_MUST_KILL;
  } else if (kill_it) {
  force_sig(SIGBUS, current);
  }
@@ -1168,6 +1125,23 @@ void do_machine_check(struct pt_regs *regs, long error_code)
  mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
 out:
  sync_core();
+
+ if (recover_paddr == ~0ull)
+ return;
+
+ pr_err("Uncorrected hardware memory error in user-access at %llx",
+ recover_paddr);
+ /*
+ * We must call memory_failure() here even if the current process is
+ * doomed. We still need to mark the page as poisoned and alert any
+ * other users of the page.
+ */
+ local_irq_enable();
+ if (memory_failure(recover_paddr >> PAGE_SHIFT, MCE_VECTOR, flags) < 0) {
+ pr_err("Memory error not recovered");
+ force_sig(SIGBUS, current);
+ }
+ local_irq_disable();
 }
 EXPORT_SYMBOL_GPL(do_machine_check);
 
@@ -1185,42 +1159,6 @@ int memory_failure(unsigned long pfn, int vector, int flags)
 #endif
 
 /*
- * Called in process context that interrupted by MCE and marked with
- * TIF_MCE_NOTIFY, just before returning to erroneous userland.
- * This code is allowed to sleep.
- * Attempt possible recovery such as calling the high level VM handler to
- * process any corrupted pages, and kill/signal current process if required.
- * Action required errors are handled here.
- */
-void mce_notify_process(void)
-{
- unsigned long pfn;
- struct mce_info *mi = mce_find_info();
- int flags = MF_ACTION_REQUIRED;
-
- if (!mi)
- mce_panic("Lost physical address for unconsumed uncorrectable error", NULL, NULL);
- pfn = mi->paddr >> PAGE_SHIFT;
-
- clear_thread_flag(TIF_MCE_NOTIFY);
-
- pr_err("Uncorrected hardware memory error in user-access at %llx",
- mi->paddr);
- /*
- * We must call memory_failure() here even if the current process is
- * doomed. We still need to mark the page as poisoned and alert any
- * other users of the page.
- */
- if (!mi->restartable)
- flags |= MF_MUST_KILL;
- if (memory_failure(pfn, MCE_VECTOR, flags) < 0) {
- pr_err("Memory error not recovered");
- force_sig(SIGBUS, current);
- }
- mce_clear_info(mi);
-}
-
-/*
  * Action optional processing happens here (picking up
  * from the list of faulting pages that do_machine_check()
  * placed into the "ring").
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index ed37a768d0fc..2a33c8f68319 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -740,12 +740,6 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
 {
  user_exit();
 
-#ifdef CONFIG_X86_MCE
- /* notify userspace of pending MCEs */
- if (thread_info_flags & _TIF_MCE_NOTIFY)
- mce_notify_process();
-#endif /* CONFIG_X86_64 && CONFIG_X86_MCE */
-
  if (thread_info_flags & _TIF_UPROBE)
  uprobe_notify_resume(regs);
 
--
2.1.0


Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep

Andy Lutomirski
On Wed, Nov 19, 2014 at 11:44 AM, Linus Torvalds
<[hidden email]> wrote:

> On Wed, Nov 19, 2014 at 11:29 AM, Andi Kleen <[hidden email]> wrote:
>>
>> The exception handlers which use the IST stacks don't necessarily
>> set irq count. Maybe they should.
>
> Hmm. I think they should. Since they clearly must not schedule, as
> they use a percpu stack.
>
> Which exceptions use IST?
>
> [ grep grep ]
>
> Looks like stack, doublefault, nmi, debug and mce. And yes, I really
> think they should all raise the irq count if they don't already.
> Rather than add random arch-specific "let's check that we're on the
> right stack" code to the might-sleep stuff, just use the one we have.
>

Does that include nmi?  I'm a bit afraid of touching that code.

It's certainly easy enough to bump irq_count in the paranoid entries.

--Andy

>                     Linus



--
Andy Lutomirski
AMA Capital Management, LLC

Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep

Linus Torvalds
On Wed, Nov 19, 2014 at 3:04 PM, Andy Lutomirski <[hidden email]> wrote:
>
> Does that include nmi?  I'm a bit afraid of touching that code.

NMI is kind of special, since it's really not supposed to touch
'current' or anything like that, and that's how we do preempt-count
(and that's where irq-count is) right now.

I would prefer to have preempt_count be a percpu variable rather than
a per-thread one, but there are historical reasons for that horror. Oh
well.

> It's certainly easy enough to bump irq_count in the paranoid entries.

It definitely shouldn't be done from the assembly code. Usually it's
"irq_enter/exit()" that does it, but for NMI you'd presumably just do
it explicitly from do_nmi() or similar. But NMI really is nastier than
other cases, see above.

(That said, I thought we did it anyway, but now that I look I can't
find it. So I was probably just smoking something)

             Linus

Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep

Thomas Gleixner
On Wed, 19 Nov 2014, Linus Torvalds wrote:

> On Wed, Nov 19, 2014 at 3:04 PM, Andy Lutomirski <[hidden email]> wrote:
> >
> > Does that include nmi?  I'm a bit afraid of touching that code.
>
> NMI is kind of special, since it's really not supposed to touch
> 'current' or anything like that, and that's how we do preempt-count
> (and that's where irq-count is) right now.
>
> I would prefer to have preempt_count be a percpu variable rather than
> a per-thread one, but there are historical reasons for that horror. Oh
> well.

preempt_count is a per cpu variable at least on x86 since:

  commit c2daa3bed53a 'sched, x86: Provide a per-cpu preempt_count
  implementation'
   
> > It's certainly easy enough to bump irq_count in the paranoid entries.
>
> It definitely shouldn't be done from the assembly code. Usually it's
> "irq_enter/exit()" that does it, but for NMI you'd presumably just do
> it explicitly from do_nmi() or similar. But NMI really is nastier than
> other cases, see above.

do_nmi()
  nmi_enter()
    preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET);

> (That said, I thought we did it anyway, but now that I look I can't
> find it. So I was probably just smoking something)

Certainly something you should not smoke again :)

Thanks,

        tglx

Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep

Linus Torvalds
On Wed, Nov 19, 2014 at 3:32 PM, Thomas Gleixner <[hidden email]> wrote:
>
> preempt_count is a per cpu variable at least on x86 since:

Oh, goodie. I hadn't tracked that it went in, I just remembered the
discussion. I should have looked.

And apparently we do it in nmi_enter too, so NMI is already ok. Don't
know about the others.

                Linus

Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep

Andy Lutomirski
On Wed, Nov 19, 2014 at 3:23 PM, Linus Torvalds
<[hidden email]> wrote:

> On Wed, Nov 19, 2014 at 3:04 PM, Andy Lutomirski <[hidden email]> wrote:
>>
>> Does that include nmi?  I'm a bit afraid of touching that code.
>
> NMI is kind of special, since it's really not supposed to touch
> 'current' or anything like that, and that's how we do preempt-count
> (and that's where irq-count is) right now.
>
> I would prefer to have preempt_count be a percpu variable rather than
> a per-thread one, but there are historical reasons for that horror. Oh
> well.
>
>> It's certainly easy enough to bump irq_count in the paranoid entries.
>
> It definitely shouldn't be done from the assembly code. Usually it's
> "irq_enter/exit()" that does it, but for NMI you'd presumably just do
> it explicitly from do_nmi() or similar. But NMI really is nastier than
> other cases, see above.

My only real objection is that it's going to be ugly and error prone.
It'll have to be something like:

if (!user_mode_vm(regs))
    irq_enter();

...

if (!user_mode_vm(regs))
    irq_exit();

because the whole point of this series is to make the IST entries not
be atomic when they come from userspace.  We can wrap this in an
inline cond_ist_enter/cond_ist_exit or whatever, but still, blech.

Tony already sent the follow-up patch that will make do_machine_check
enable interrupts if it came from userspace.

--Andy

>
> (That said, I thought we did it anyway, but now that I look I can't
> find it. So I was probably just smoking something)
>
>              Linus



--
Andy Lutomirski
AMA Capital Management, LLC

Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep

Linus Torvalds
On Wed, Nov 19, 2014 at 3:49 PM, Andy Lutomirski <[hidden email]> wrote:
>
> My only real objection is that it's going to be ugly and error prone.
> It'll have to be something like:

No.

> because the whole point of this series is to make the IST entries not
> be atomic when they come from userspace.

Andy,  you need to lay off the drugs.

NMI absolutely *has* to be atomic. The whole "oh, there's a per-core
NMI flag and it disables all other NMI's and interrupts" kind of
enforces that.

Trust me. Talking about being able to preempt the NMI handler is just
crazy talk.

                    Linus

Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep

Andy Lutomirski
On Wed, Nov 19, 2014 at 3:59 PM, Linus Torvalds
<[hidden email]> wrote:

> On Wed, Nov 19, 2014 at 3:49 PM, Andy Lutomirski <[hidden email]> wrote:
>>
>> My only real objection is that it's going to be ugly and error prone.
>> It'll have to be something like:
>
> No.
>
>> because the whole point of this series is to make the IST entries not
>> be atomic when they come from userspace.
>
> Andy,  you need to lay off the drugs.
>

No drugs, just imprecision.  This series doesn't change NMI handling
at all.  It only changes machine_check, int3, debug, and stack_segment.
(Why is #SS using IST stacks anyway?)

So my point stands: if machine_check is going to be conditionally
atomic, then that condition needs to be expressed somewhere.  And
machine_check really does need to sleep, and it does so today.  It's
just that, in current kernels, it pretends to be atomic, but then it
fakes everyone out before returning (look for sync_regs in entry_64.S)
and becomes non-atomic part-way through, mediated by a special TIF
flag.  I think that the current conditional sync_regs thing is even
crazier than switching stacks depending on regs->cs, so I'm trying to
reduce craziness.

With this series applied, machine_check is honest: if it came from
user space, it doesn't even pretend to be atomic.

And I'm not even making these things preemptable by default.  They
still run with irqs off.  They're just allowed to turn irqs on *if
user_mode_vm(regs)*, if they want.

> NMI absolutely *has* to be atomic. The whole "oh, there's a per-core
> NMI flag and it disables all other NMI's and interrupts" kind of
> enforces that.
>
> Trust me. Talking about being able to preempt the NMI handler is just
> crazy talk.

I engage in crazy talk all the time, but I'm not *that* crazy.

--Andy

Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep

Linus Torvalds
On Wed, Nov 19, 2014 at 4:13 PM, Andy Lutomirski <[hidden email]> wrote:
>
> No drugs, just imprecision.  This series doesn't change NMI handling
> at all.  It only changes machine_check int3, debug, and stack_segment.
> (Why is #SS using IST stacks anyway?)

.. ok, we were talking about adding an explicit preemption count to
nmi, and then you wanted to make that conditional, that kind of
freaked me out.

> So my point stands: if machine_check is going to be conditionally
> atomic, then that condition needs to be expressed somewhere.

I'd still prefer to keep that knowledge in one place, rather than
adding *another* completely ad-hoc thing in addition to what we
already have.

Also, I really don't think it should be about the particular stack
you're using. Sure, if a debug fault happens in user space, the fault
handler could sleep if it runs on the regular stack, but our
"might_sleep()" are about catching things that *could* be problematic,
even if the sleep never happens. And so, might_sleep() _should_
actually trigger, even if it's not using the IST stack, because *if*
the debug exception happened in kernel space, then we should warn.

So I'd actually *prefer* to have special hacks that perhaps then
"undo" the preemption count if the code expressly tests for "did this
happen in user space, then I know I'm safe". But then it's an
*explicit* thing, not something that just magically works because
nobody even thought about it, and the trap happened in user space.

See the argument? I'd *rather* see code like

   /* Magic */
   if (user_mode(regs)) {
       .. verify that we're using the normal kernel stack
       .. enable interrupts, enable preemption
       .. this is the explicit special case and it is aware
       .. of being special
   }

even if on the face of it it looks hacky. But an *explicit* hack is
preferable to something that just "happens" to work only for the
user-mode case.

                   Linus

Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep

Andy Lutomirski
On Wed, Nov 19, 2014 at 4:37 PM, Linus Torvalds
<[hidden email]> wrote:
> On Wed, Nov 19, 2014 at 4:13 PM, Andy Lutomirski <[hidden email]> wrote:
>>
>> No drugs, just imprecision.  This series doesn't change NMI handling
>> at all.  It only changes machine_check int3, debug, and stack_segment.
>> (Why is #SS using IST stacks anyway?)
>
> .. ok, we were talking about adding an explicit preemption count to
> nmi, and then you wanted to make that conditional, that kind of
> freaked me out.

I guess I jumped around in the conversation a bit...

>
>> So my point stands: if machine_check is going to be conditionally
>> atomic, then that condition needs to be expressed somewhere.
>
> I'd still prefer to keep that knowledge in one place, rather than
> adding *another* completely ad-hoc thing in addition to what we
> already have.
>
> Also, I really don't think it should be about the particular stack
> you're using. Sure, if a debug fault happens in user space, the fault
> handler could sleep if it runs on the regular stack, but our
> "might_sleep()" are about catching things that *could* be problematic,
> even if the sleep never happens. And so, might_sleep() _should_
> actually trigger, even if it's not using the IST stack, because *if*
> the debug exception happened in kernel space, then we should warn.
>
> So I'd actually *prefer* to have special hacks that perhaps then
> "undo" the preemption count if the code expressly tests for "did this
> happen in user space, then I know I'm safe". But then it's an
> *explicit* thing, not something that just magically works because
> nobody even thought about it, and the trap happened in user space.
>
> See the argument? I'd *rather* see code like
>
>    /* Magic */
>    if (user_mode(regs)) {
>        .. verify that we're using the normal kernel stack
>        .. enable interrupts, enable preemption
>        .. this is the explicit special case and it is aware
>        .. of being special
>    }
>
> even if on the face of it it looks hacky. But an *explicit* hack is
> preferable to something that just "happens" to work only for the
> user-mode case.

So we'd do, in do_machine_check:

irq_enter();

do atomic stuff;

ist_stop_being_atomic(regs);
local_irq_enable();
...
local_irq_disable();
ist_start_being_atomic_again();

irq_exit();

and we'd have something like:

void ist_stop_being_atomic(struct pt_regs *regs)
{
  BUG_ON(!user_mode_vm(regs));
  --irq_count;
}

I'm very hesitant to use irq_enter for this, though.  I think we want
just the irq_count part.  Maybe ist_enter() and ist_exit()?  I think
that we really don't want to go anywhere near the accounting stuff in
irq_enter from an IST handler if !user_mode_vm(regs).  Doing it from
asm is somewhat less error prone, although I guess we already rely on
the IDT entries themselves being in sync with the paranoid idtentry
setting.

--Andy

Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep

Linus Torvalds
On Wed, Nov 19, 2014 at 4:46 PM, Andy Lutomirski <[hidden email]> wrote:
>
> I'm very hesitant to use irq_enter for this, though.  I think we want
> just the irq_count part.

Oh, I agree. I mentioned irq_enter/exit() just as the place the
preempt count increment normally happens. I have nothing against doing
it manually for these things that really aren't normal interrupts.

                  Linus