[RFC][PATCH 0/3] spin_unlock_wait and assorted borkage


[RFC][PATCH 0/3] spin_unlock_wait and assorted borkage

Peter Zijlstra-5
As per recent discussions spin_unlock_wait() has an unintuitive 'feature'.

  lkml.kernel.org/r/[hidden email]

These patches pull the existing solution into generic code, annotate all
spin_unlock_wait() users, and further fix nf_conntrack.

The alternative -- putting smp_acquire__after_ctrl_dep() into
spin_unlock_wait() -- can be done later if desired, now that all users are
audited and corrected.

Please (very) carefully consider.
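
For reference, the pattern these patches annotate at each call site looks
like this (sketch only; 'some_lock' is a placeholder name):

        spin_unlock_wait(&some_lock);   /* wait for the current holder      */
        smp_acquire__after_ctrl_dep();  /* upgrade the ctrl-dep to ACQUIRE  */
        /* ... we can now safely observe the holder's critical section ...  */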


[RFC][PATCH 2/3] locking: Annotate spin_unlock_wait() users

Peter Zijlstra-5
spin_unlock_wait() has an unintuitive 'feature' in that it doesn't
fully serialize against the spin_unlock() we've waited on.

In particular, spin_unlock_wait() only provides a control dependency,
which is a LOAD->STORE order. Subsequent loads can therefore creep up
and observe state prior to the waited-for unlock, which means we don't
necessarily observe the full critical section.

We must employ smp_acquire__after_ctrl_dep() to upgrade the
LOAD->STORE to LOAD->{LOAD,STORE}, aka load-ACQUIRE, and thereby ensure
we observe the full critical section we've waited on.

Many spin_unlock_wait() users were unaware of this issue and need
help.

Signed-off-by: Peter Zijlstra (Intel) <[hidden email]>
---
 drivers/ata/libata-eh.c   |    4 +++-
 kernel/exit.c             |   14 ++++++++++++--
 kernel/sched/completion.c |    7 +++++++
 kernel/task_work.c        |    2 +-
 4 files changed, 23 insertions(+), 4 deletions(-)

--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -703,8 +703,10 @@ void ata_scsi_cmd_error_handler(struct S
 
  /* initialize eh_tries */
  ap->eh_tries = ATA_EH_MAX_TRIES;
- } else
+ } else {
  spin_unlock_wait(ap->lock);
+ smp_acquire__after_ctrl_dep();
+ }
 
 }
 EXPORT_SYMBOL(ata_scsi_cmd_error_handler);
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -776,11 +776,16 @@ void do_exit(long code)
 
  exit_signals(tsk);  /* sets PF_EXITING */
  /*
- * tsk->flags are checked in the futex code to protect against
- * an exiting task cleaning up the robust pi futexes.
+ * Ensure that all new tsk->pi_lock acquisitions must observe
+ * PF_EXITING. Serializes against futex.c:attach_to_pi_owner().
  */
  smp_mb();
  raw_spin_unlock_wait(&tsk->pi_lock);
+ /*
+ * Ensure that we must observe the pi_state in exit_mm() ->
+ * mm_release() -> exit_pi_state_list().
+ */
+ smp_acquire__after_ctrl_dep();
 
  if (unlikely(in_atomic())) {
  pr_info("note: %s[%d] exited with preempt_count %d\n",
@@ -897,6 +902,11 @@ void do_exit(long code)
  */
  smp_mb();
  raw_spin_unlock_wait(&tsk->pi_lock);
+ /*
+ * Since there are no following loads the LOAD->LOAD order
+ * provided by smp_acquire__after_ctrl_dep() is not
+ * strictly required.
+ */
 
  /* causes final put_task_struct in finish_task_switch(). */
  tsk->state = TASK_DEAD;
--- a/kernel/sched/completion.c
+++ b/kernel/sched/completion.c
@@ -312,6 +312,13 @@ bool completion_done(struct completion *
  */
  smp_rmb();
  spin_unlock_wait(&x->wait.lock);
+ /*
+ * Even though we've observed 'done', this doesn't mean we can observe
+ * all stores prior to complete(), as the only RELEASE barrier on that
+ * path is provided by the spin_unlock().
+ */
+ smp_acquire__after_ctrl_dep();
+
  return true;
 }
 EXPORT_SYMBOL(completion_done);
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -108,7 +108,7 @@ void task_work_run(void)
  * fail, but it can play with *work and other entries.
  */
  raw_spin_unlock_wait(&task->pi_lock);
- smp_mb();
+ smp_acquire__after_ctrl_dep();
 
  do {
  next = work->next;

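As an illustration of the changelog above (a sketch, not part of the patch;
the lock and data names are made up):

        #include <linux/spinlock.h>
        #include <linux/compiler.h>

        static DEFINE_SPINLOCK(demo_lock);
        static int demo_data;

        static void holder(void)
        {
                spin_lock(&demo_lock);
                demo_data = 1;
                spin_unlock(&demo_lock);        /* RELEASE */
        }

        static void waiter(void)
        {
                int r;

                spin_unlock_wait(&demo_lock);   /* control dependency: LOAD->STORE only */
                smp_acquire__after_ctrl_dep();  /* adds LOAD->LOAD, i.e. ACQUIRE */
                r = READ_ONCE(demo_data);       /* sees 1 if the wait observed holder()'s unlock */
                (void)r;
        }

Without the smp_acquire__after_ctrl_dep(), the final load could still observe
the value from before the critical section we waited on.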


[RFC][PATCH 3/3] locking,netfilter: Fix nf_conntrack_lock()

Peter Zijlstra-5
In reply to this post by Peter Zijlstra-5
nf_conntrack_lock{,_all}() is borken as it misses a bunch of memory
barriers to order the whole global vs local locks scheme.

Even x86 (and other TSO archs) are affected.

Signed-off-by: Peter Zijlstra (Intel) <[hidden email]>
---
 net/netfilter/nf_conntrack_core.c |   30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -74,7 +74,18 @@ void nf_conntrack_lock(spinlock_t *lock)
  spin_lock(lock);
  while (unlikely(nf_conntrack_locks_all)) {
  spin_unlock(lock);
+ /*
+ * Order the nf_conntrack_locks_all load vs the spin_unlock_wait()
+ * loads below, to ensure locks_all is indeed held.
+ */
+ smp_rmb(); /* spin_lock(locks_all) */
  spin_unlock_wait(&nf_conntrack_locks_all_lock);
+ /*
+ * The control dependency's LOAD->STORE order is enough to
+ * guarantee the spin_lock() is ordered after the above
+ * unlock_wait(). And the ACQUIRE of the lock ensures we are
+ * fully ordered against the spin_unlock() of locks_all.
+ */
  spin_lock(lock);
  }
 }
@@ -119,14 +130,31 @@ static void nf_conntrack_all_lock(void)
  spin_lock(&nf_conntrack_locks_all_lock);
  nf_conntrack_locks_all = true;
 
+ /*
+ * Order the above store against the spin_unlock_wait() loads
+ * below, such that if nf_conntrack_lock() observes lock_all
+ * we must observe lock[] held.
+ */
+ smp_mb(); /* spin_lock(locks_all) */
+
  for (i = 0; i < CONNTRACK_LOCKS; i++) {
  spin_unlock_wait(&nf_conntrack_locks[i]);
  }
+ /*
+ * Ensure we observe all state prior to the spin_unlock()s
+ * observed above.
+ */
+ smp_acquire__after_ctrl_dep();
 }
 
 static void nf_conntrack_all_unlock(void)
 {
- nf_conntrack_locks_all = false;
+ /*
+ * All prior stores must be complete before we clear locks_all.
+ * Otherwise nf_conntrack_lock() might observe the false but not
+ * the entire critical section.
+ */
+ smp_store_release(&nf_conntrack_locks_all, false);
  spin_unlock(&nf_conntrack_locks_all_lock);
 }
 

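Condensed, for reference (not part of the patch; unlikely() annotations and
the comments above elided), the two sides after this change look like:

        /* global side: nf_conntrack_all_lock() */
        spin_lock(&nf_conntrack_locks_all_lock);
        nf_conntrack_locks_all = true;
        smp_mb();                               /* order store vs unlock_wait loads */
        for (i = 0; i < CONNTRACK_LOCKS; i++)
                spin_unlock_wait(&nf_conntrack_locks[i]);
        smp_acquire__after_ctrl_dep();

        /* local side: nf_conntrack_lock(lock) */
        spin_lock(lock);
        while (nf_conntrack_locks_all) {
                spin_unlock(lock);
                smp_rmb();                      /* order locks_all load vs unlock_wait load */
                spin_unlock_wait(&nf_conntrack_locks_all_lock);
                spin_lock(lock);
        }

See the comments in the diff above for the exact barrier pairing.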


[RFC][PATCH 1/3] locking: Introduce smp_acquire__after_ctrl_dep

Peter Zijlstra-5
In reply to this post by Peter Zijlstra-5
Introduce smp_acquire__after_ctrl_dep(); this construct is not
uncommon, but the lack of this barrier is.

Signed-off-by: Peter Zijlstra (Intel) <[hidden email]>
---
 include/linux/compiler.h |   14 ++++++++++----
 ipc/sem.c                |   14 ++------------
 2 files changed, 12 insertions(+), 16 deletions(-)

--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -305,20 +305,26 @@ static __always_inline void __write_once
 })
 
 /**
+ * smp_acquire__after_ctrl_dep() - Provide ACQUIRE ordering after a control dependency
+ *
+ * A control dependency provides a LOAD->STORE order, the additional RMB
+ * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
+ * aka. ACQUIRE.
+ */
+#define smp_acquire__after_ctrl_dep() smp_rmb()
+
+/**
  * smp_cond_acquire() - Spin wait for cond with ACQUIRE ordering
  * @cond: boolean expression to wait for
  *
  * Equivalent to using smp_load_acquire() on the condition variable but employs
  * the control dependency of the wait to reduce the barrier on many platforms.
  *
- * The control dependency provides a LOAD->STORE order, the additional RMB
- * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
- * aka. ACQUIRE.
  */
 #define smp_cond_acquire(cond) do { \
  while (!(cond)) \
  cpu_relax(); \
- smp_rmb(); /* ctrl + rmb := acquire */ \
+ smp_acquire__after_ctrl_dep(); \
 } while (0)
 
 #endif /* __KERNEL__ */
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -260,16 +260,6 @@ static void sem_rcu_free(struct rcu_head
 }
 
 /*
- * spin_unlock_wait() and !spin_is_locked() are not memory barriers, they
- * are only control barriers.
- * The code must pair with spin_unlock(&sem->lock) or
- * spin_unlock(&sem_perm.lock), thus just the control barrier is insufficient.
- *
- * smp_rmb() is sufficient, as writes cannot pass the control barrier.
- */
-#define ipc_smp_acquire__after_spin_is_unlocked() smp_rmb()
-
-/*
  * Wait until all currently ongoing simple ops have completed.
  * Caller must own sem_perm.lock.
  * New simple ops cannot start, because simple ops first check
@@ -292,7 +282,7 @@ static void sem_wait_array(struct sem_ar
  sem = sma->sem_base + i;
  spin_unlock_wait(&sem->lock);
  }
- ipc_smp_acquire__after_spin_is_unlocked();
+ smp_acquire__after_ctrl_dep();
 }
 
 /*
@@ -350,7 +340,7 @@ static inline int sem_lock(struct sem_ar
  * complex_count++;
  * spin_unlock(sem_perm.lock);
  */
- ipc_smp_acquire__after_spin_is_unlocked();
+ smp_acquire__after_ctrl_dep();
 
  /*
  * Now repeat the test of complex_count:


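For reference, a sketch of the producer/consumer pattern smp_cond_acquire()
serves (the variable names are made up; this is not from the patch):

        static int data;
        static int flag;

        /* producer */
        WRITE_ONCE(data, 42);
        smp_store_release(&flag, 1);            /* publish: data before flag */

        /* consumer */
        smp_cond_acquire(READ_ONCE(flag));      /* spin; ctrl-dep + rmb := ACQUIRE */
        BUG_ON(READ_ONCE(data) != 42);          /* guaranteed once flag is observed */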

Re: [RFC][PATCH 3/3] locking,netfilter: Fix nf_conntrack_lock()

Peter Zijlstra-5
In reply to this post by Peter Zijlstra-5
On Tue, May 24, 2016 at 04:27:26PM +0200, Peter Zijlstra wrote:

> nf_conntrack_lock{,_all}() is borken as it misses a bunch of memory
> barriers to order the whole global vs local locks scheme.
>
> Even x86 (and other TSO archs) are affected.
>
> Signed-off-by: Peter Zijlstra (Intel) <[hidden email]>
> ---
>  net/netfilter/nf_conntrack_core.c |   30 +++++++++++++++++++++++++++++-
>  1 file changed, 29 insertions(+), 1 deletion(-)
>
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -74,7 +74,18 @@ void nf_conntrack_lock(spinlock_t *lock)
>   spin_lock(lock);
>   while (unlikely(nf_conntrack_locks_all)) {

And note that we can replace nf_conntrack_locks_all with
spin_is_locked(&nf_conntrack_locks_all_lock), since that is the exact
same state.

But I didn't want to do too much in one go.

>   spin_unlock(lock);
> + /*
> + * Order the nf_contrack_locks_all load vs the spin_unlock_wait()
> + * loads below, to ensure locks_all is indeed held.
> + */
> + smp_rmb(); /* spin_lock(locks_all) */
>   spin_unlock_wait(&nf_conntrack_locks_all_lock);
> + /*
> + * The control dependency's LOAD->STORE order is enough to
> + * guarantee the spin_lock() is ordered after the above
> + * unlock_wait(). And the ACQUIRE of the lock ensures we are
> + * fully ordered against the spin_unlock() of locks_all.
> + */
>   spin_lock(lock);
>   }
>  }

Re: [RFC][PATCH 2/3] locking: Annotate spin_unlock_wait() users

Linus Torvalds-2
In reply to this post by Peter Zijlstra-5
On Tue, May 24, 2016 at 7:27 AM, Peter Zijlstra <[hidden email]> wrote:
> spin_unlock_wait() has an unintuitive 'feature' in that it doesn't
> fully serialize against the spin_unlock() we've waited on.

NAK.

We don't start adding more of this "after_ctrl_dep" crap.

It's completely impossible to understand, and even people who have
been locking experts have gotten it wrong.

So it is *completely* unacceptable to have it in drivers.

This needs to be either hidden inside the basic spinlock functions,
_or_ it needs to be a clear and unambiguous interface. Anything that
starts talking about control dependencies is not it.

Note that this really is about naming and use, not about
implementation. So something like "spin_sync_after_unlock_wait()" is
acceptable, even if the actual _implementation_ were to be exactly the
same as the "after_ctrl_dep()" crap.

The difference is that one talks about incomprehensible implementation
details that nobody outside of the person who *implemented* the
spinlock code is supposed to understand (and seriously, I have my
doubts even the spinlock implementer understands it, judging by the
last time this happened), and the other is a much simpler semantic
guarantee.

So don't talk about "acquire". And most certainly don't talk about
"control dependencies". Not if we end up having things like *drivers*
using this like in this example libata.

                Linus

Re: [RFC][PATCH 2/3] locking: Annotate spin_unlock_wait() users

Tejun Heo-2
Hello,

On Tue, May 24, 2016 at 09:17:13AM -0700, Linus Torvalds wrote:
> So don't talk about "acquire". And most certainly don't talk about
> "control dependencies". Not if we end up having things like *drivers*
> using this like in this example libata.

A delta but that particular libata usage is probably not needed now.
The path was used while libata was gradually adding error handlers to
the low level drivers.  I don't think we have any left w/o one
at this point.  I'll verify and get rid of that usage.

Thanks.

--
tejun

Re: [RFC][PATCH 2/3] locking: Annotate spin_unlock_wait() users

Peter Zijlstra-5
In reply to this post by Linus Torvalds-2
On Tue, May 24, 2016 at 09:17:13AM -0700, Linus Torvalds wrote:

> This needs to be either hidden inside the basic spinlock functions,
> _or_ it needs to be a clear and unambiguous interface. Anything that
> starts talking about control dependencies is not it.
>
> Note that this really is about naming and use, not about
> implementation. So something like "spin_sync_after_unlock_wait()" is
> acceptable, even if the actual _implementation_ were to be exactly the
> same as the "after_ctrl_dep()" crap.

OK; so I would prefer to keep the smp_acquire__after_ctrl_dep() crap for
common use in smp_cond_acquire() and such, but I'd be more than happy to
just stuff it unconditionally into spin_unlock_wait().

Most users really need it, and it restores intuitive semantics to the
primitive.

I'm assuming the explicit use then left in ipc/sem.c (as paired with the
spin_is_locked) is fine with you; that's certainly not driver code.

Todays series was really more about auditing all the spin_unlock_wait()
usage sites.
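
A rough sketch of that alternative, hand-waving over the per-arch details
(not an actual implementation):

        static inline void spin_unlock_wait(spinlock_t *lock)
        {
                while (spin_is_locked(lock))    /* control dependency */
                        cpu_relax();
                smp_acquire__after_ctrl_dep();  /* callers get ACQUIRE for free */
        }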

Re: [RFC][PATCH 2/3] locking: Annotate spin_unlock_wait() users

Peter Zijlstra-5
In reply to this post by Tejun Heo-2
On Tue, May 24, 2016 at 12:22:07PM -0400, Tejun Heo wrote:
> A delta but that particular libata usage is probably not needed now.
> The path was used while libata was gradually adding error handlers to
> the low level drivers.  I don't think we don't have any left w/o one
> at this point.  I'll verify and get rid of that usage.

OK, that would be great; I was sorta lost in there, but it looked like
if you need the spin_unlock_wait() you also need the extra barrier
thing.

If you can remove it, better still.

Re: [RFC][PATCH 1/3] locking: Introduce smp_acquire__after_ctrl_dep

Paul E. McKenney
In reply to this post by Peter Zijlstra-5
On Tue, May 24, 2016 at 11:01:21PM -0400, Waiman Long wrote:

> On 05/24/2016 10:27 AM, Peter Zijlstra wrote:
> >Introduce smp_acquire__after_ctrl_dep(), this construct is not
> >uncommen, but the lack of this barrier is.
> >
> >Signed-off-by: Peter Zijlstra (Intel)<[hidden email]>
> >---
> >  include/linux/compiler.h |   14 ++++++++++----
> >  ipc/sem.c                |   14 ++------------
> >  2 files changed, 12 insertions(+), 16 deletions(-)
> >
> >--- a/include/linux/compiler.h
> >+++ b/include/linux/compiler.h
> >@@ -305,20 +305,26 @@ static __always_inline void __write_once
> >  })
> >
> >  /**
> >+ * smp_acquire__after_ctrl_dep() - Provide ACQUIRE ordering after a control dependency
> >+ *
> >+ * A control dependency provides a LOAD->STORE order, the additional RMB
> >+ * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
> >+ * aka. ACQUIRE.
> >+ */
> >+#define smp_acquire__after_ctrl_dep() smp_rmb()
> >+
> >+/**
> >   * smp_cond_acquire() - Spin wait for cond with ACQUIRE ordering
> >   * @cond: boolean expression to wait for
> >   *
> >   * Equivalent to using smp_load_acquire() on the condition variable but employs
> >   * the control dependency of the wait to reduce the barrier on many platforms.
> >   *
> >- * The control dependency provides a LOAD->STORE order, the additional RMB
> >- * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
> >- * aka. ACQUIRE.
> >   */
> >  #define smp_cond_acquire(cond) do { \
> >   while (!(cond)) \
> >   cpu_relax(); \
> >- smp_rmb(); /* ctrl + rmb := acquire */ \
> >+ smp_acquire__after_ctrl_dep(); \
> >  } while (0)
> >
> >
>
> I have a question about the claim that control dependence + rmb is
> equivalent to an acquire memory barrier. For example,
>
> S1:    if (a)
> S2:       b = 1;
>        smp_rmb()
> S3:    c = 2;
>
> Since c is independent of both a and b, is it possible that the cpu
> may reorder to execute store statement S3 first before S1 and S2?

The CPUs I know of won't do this, nor should the compiler, at least assuming
"a" (AKA "cond") includes READ_ONCE().  Ditto "b" and WRITE_ONCE().
Otherwise, the compiler could do quite a few "interesting" things,
especially if it knows the value of "b".  For example, if the compiler
knows that b==1, without the volatile casts, the compiler could just
throw away both S1 and S2, eliminating any ordering.  This can get
quite tricky -- see memory-barriers.txt for more mischief.

The smp_rmb() is not needed in this example because S3 is a write, not
a read.  Perhaps you meant something more like this:

        if (READ_ONCE(a))
                WRITE_ONCE(b, 1);
        smp_rmb();
        r1 = READ_ONCE(c);

This sequence would guarantee that "a" was read before "c".

                                                        Thanx, Paul


Re: [RFC][PATCH 1/3] locking: Introduce smp_acquire__after_ctrl_dep

Boqun Feng
On Tue, May 24, 2016 at 09:53:29PM -0700, Paul E. McKenney wrote:

> On Tue, May 24, 2016 at 11:01:21PM -0400, Waiman Long wrote:
> > On 05/24/2016 10:27 AM, Peter Zijlstra wrote:
> > >Introduce smp_acquire__after_ctrl_dep(), this construct is not
> > >uncommen, but the lack of this barrier is.
> > >
> > >Signed-off-by: Peter Zijlstra (Intel)<[hidden email]>
> > >---
> > >  include/linux/compiler.h |   14 ++++++++++----
> > >  ipc/sem.c                |   14 ++------------
> > >  2 files changed, 12 insertions(+), 16 deletions(-)
> > >
> > >--- a/include/linux/compiler.h
> > >+++ b/include/linux/compiler.h
> > >@@ -305,20 +305,26 @@ static __always_inline void __write_once
> > >  })
> > >
> > >  /**
> > >+ * smp_acquire__after_ctrl_dep() - Provide ACQUIRE ordering after a control dependency
> > >+ *
> > >+ * A control dependency provides a LOAD->STORE order, the additional RMB
> > >+ * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
> > >+ * aka. ACQUIRE.
> > >+ */
> > >+#define smp_acquire__after_ctrl_dep() smp_rmb()
> > >+
> > >+/**
> > >   * smp_cond_acquire() - Spin wait for cond with ACQUIRE ordering
> > >   * @cond: boolean expression to wait for
> > >   *
> > >   * Equivalent to using smp_load_acquire() on the condition variable but employs
> > >   * the control dependency of the wait to reduce the barrier on many platforms.
> > >   *
> > >- * The control dependency provides a LOAD->STORE order, the additional RMB
> > >- * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
> > >- * aka. ACQUIRE.
> > >   */
> > >  #define smp_cond_acquire(cond) do { \
> > >   while (!(cond)) \
> > >   cpu_relax(); \
> > >- smp_rmb(); /* ctrl + rmb := acquire */ \
> > >+ smp_acquire__after_ctrl_dep(); \
> > >  } while (0)
> > >
> > >
> >
> > I have a question about the claim that control dependence + rmb is
> > equivalent to an acquire memory barrier. For example,
> >
> > S1:    if (a)
> > S2:       b = 1;
> >        smp_rmb()
> > S3:    c = 2;
> >
> > Since c is independent of both a and b, is it possible that the cpu
> > may reorder to execute store statement S3 first before S1 and S2?
>
> The CPUs I know of won't do, nor should the compiler, at least assuming
> "a" (AKA "cond") includes READ_ONCE().  Ditto "b" and WRITE_ONCE().
> Otherwise, the compiler could do quite a few "interesting" things,
> especially if it knows the value of "b".  For example, if the compiler
> knows that b==1, without the volatile casts, the compiler could just
> throw away both S1 and S2, eliminating any ordering.  This can get
> quite tricky -- see memory-barriers.txt for more mischief.
>
> The smp_rmb() is not needed in this example because S3 is a write, not
but S3 needs to be a WRITE_ONCE(), right? IOW, the following code can
result in reordering:

S1: if (READ_ONCE(a))
S2: WRITE_ONCE(b, 1);
       
S3: c = 2; // this can be reordered before READ_ONCE(a)

but if we change S3 to WRITE_ONCE(c, 2), the reordering can not happen
for the CPUs you are aware of, right?

Regards,
Boqun

> a read.  Perhaps you meant something more like this:
>
> if (READ_ONCE(a))
> WRITE_ONCE(b, 1);
> smp_rmb();
> r1 = READ_ONCE(c);
>
> This sequence would guarantee that "a" was read before "c".
>
> Thanx, Paul
>


Re: [RFC][PATCH 1/3] locking: Introduce smp_acquire__after_ctrl_dep

Paul E. McKenney
On Wed, May 25, 2016 at 01:39:30PM +0800, Boqun Feng wrote:

> On Tue, May 24, 2016 at 09:53:29PM -0700, Paul E. McKenney wrote:
> > On Tue, May 24, 2016 at 11:01:21PM -0400, Waiman Long wrote:
> > > On 05/24/2016 10:27 AM, Peter Zijlstra wrote:
> > > >Introduce smp_acquire__after_ctrl_dep(), this construct is not
> > > >uncommen, but the lack of this barrier is.
> > > >
> > > >Signed-off-by: Peter Zijlstra (Intel)<[hidden email]>
> > > >---
> > > >  include/linux/compiler.h |   14 ++++++++++----
> > > >  ipc/sem.c                |   14 ++------------
> > > >  2 files changed, 12 insertions(+), 16 deletions(-)
> > > >
> > > >--- a/include/linux/compiler.h
> > > >+++ b/include/linux/compiler.h
> > > >@@ -305,20 +305,26 @@ static __always_inline void __write_once
> > > >  })
> > > >
> > > >  /**
> > > >+ * smp_acquire__after_ctrl_dep() - Provide ACQUIRE ordering after a control dependency
> > > >+ *
> > > >+ * A control dependency provides a LOAD->STORE order, the additional RMB
> > > >+ * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
> > > >+ * aka. ACQUIRE.
> > > >+ */
> > > >+#define smp_acquire__after_ctrl_dep() smp_rmb()
> > > >+
> > > >+/**
> > > >   * smp_cond_acquire() - Spin wait for cond with ACQUIRE ordering
> > > >   * @cond: boolean expression to wait for
> > > >   *
> > > >   * Equivalent to using smp_load_acquire() on the condition variable but employs
> > > >   * the control dependency of the wait to reduce the barrier on many platforms.
> > > >   *
> > > >- * The control dependency provides a LOAD->STORE order, the additional RMB
> > > >- * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
> > > >- * aka. ACQUIRE.
> > > >   */
> > > >  #define smp_cond_acquire(cond) do { \
> > > >   while (!(cond)) \
> > > >   cpu_relax(); \
> > > >- smp_rmb(); /* ctrl + rmb := acquire */ \
> > > >+ smp_acquire__after_ctrl_dep(); \
> > > >  } while (0)
> > > >
> > > >
> > >
> > > I have a question about the claim that control dependence + rmb is
> > > equivalent to an acquire memory barrier. For example,
> > >
> > > S1:    if (a)
> > > S2:       b = 1;
> > >        smp_rmb()
> > > S3:    c = 2;
> > >
> > > Since c is independent of both a and b, is it possible that the cpu
> > > may reorder to execute store statement S3 first before S1 and S2?
> >
> > The CPUs I know of won't do, nor should the compiler, at least assuming
> > "a" (AKA "cond") includes READ_ONCE().  Ditto "b" and WRITE_ONCE().
> > Otherwise, the compiler could do quite a few "interesting" things,
> > especially if it knows the value of "b".  For example, if the compiler
> > knows that b==1, without the volatile casts, the compiler could just
> > throw away both S1 and S2, eliminating any ordering.  This can get
> > quite tricky -- see memory-barriers.txt for more mischief.
> >
> > The smp_rmb() is not needed in this example because S3 is a write, not
>
> but S3 needs to be an WRITE_ONCE(), right? IOW, the following code can
> result in reordering:
>
> S1: if (READ_ONCE(a))
> S2: WRITE_ONCE(b, 1);
>
> S3: c = 2; // this can be reordered before READ_ONCE(a)
>
> but if we change S3 to WRITE_ONCE(c, 2), the reordering can not happen
> for the CPUs you are aware of, right?

Yes, if you remove the smp_rmb(), you also need a WRITE_ONCE() for S3.

Even with the smp_rmb(), you have to be careful.

In general, if you don't tell the compiler otherwise, it is within its
rights to assume that nothing else is reading from or writing to the
variables in question.  That means that it can split and fuse loads
and stores.  Or keep some of the variables in registers, so that it
never loads and stores them.  ;-)

                                                        Thanx, Paul

> Regards,
> Boqun
>
> > a read.  Perhaps you meant something more like this:
> >
> > if (READ_ONCE(a))
> > WRITE_ONCE(b, 1);
> > smp_rmb();
> > r1 = READ_ONCE(c);
> >
> > This sequence would guarantee that "a" was read before "c".
> >
> > Thanx, Paul
> >



Re: [RFC][PATCH 1/3] locking: Introduce smp_acquire__after_ctrl_dep

Waiman Long-2
In reply to this post by Paul E. McKenney
On 05/25/2016 12:53 AM, Paul E. McKenney wrote:

> On Tue, May 24, 2016 at 11:01:21PM -0400, Waiman Long wrote:
>> On 05/24/2016 10:27 AM, Peter Zijlstra wrote:
>>> Introduce smp_acquire__after_ctrl_dep(), this construct is not
>>> uncommen, but the lack of this barrier is.
>>>
>>> Signed-off-by: Peter Zijlstra (Intel)<[hidden email]>
>>> ---
>>>   include/linux/compiler.h |   14 ++++++++++----
>>>   ipc/sem.c                |   14 ++------------
>>>   2 files changed, 12 insertions(+), 16 deletions(-)
>>>
>>> --- a/include/linux/compiler.h
>>> +++ b/include/linux/compiler.h
>>> @@ -305,20 +305,26 @@ static __always_inline void __write_once
>>>   })
>>>
>>>   /**
>>> + * smp_acquire__after_ctrl_dep() - Provide ACQUIRE ordering after a control dependency
>>> + *
>>> + * A control dependency provides a LOAD->STORE order, the additional RMB
>>> + * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
>>> + * aka. ACQUIRE.
>>> + */
>>> +#define smp_acquire__after_ctrl_dep() smp_rmb()
>>> +
>>> +/**
>>>    * smp_cond_acquire() - Spin wait for cond with ACQUIRE ordering
>>>    * @cond: boolean expression to wait for
>>>    *
>>>    * Equivalent to using smp_load_acquire() on the condition variable but employs
>>>    * the control dependency of the wait to reduce the barrier on many platforms.
>>>    *
>>> - * The control dependency provides a LOAD->STORE order, the additional RMB
>>> - * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
>>> - * aka. ACQUIRE.
>>>    */
>>>   #define smp_cond_acquire(cond) do { \
>>>   while (!(cond)) \
>>>   cpu_relax(); \
>>> - smp_rmb(); /* ctrl + rmb := acquire */ \
>>> + smp_acquire__after_ctrl_dep(); \
>>>   } while (0)
>>>
>>>
>> I have a question about the claim that control dependence + rmb is
>> equivalent to an acquire memory barrier. For example,
>>
>> S1:    if (a)
>> S2:       b = 1;
>>         smp_rmb()
>> S3:    c = 2;
>>
>> Since c is independent of both a and b, is it possible that the cpu
>> may reorder to execute store statement S3 first before S1 and S2?
> The CPUs I know of won't do, nor should the compiler, at least assuming
> "a" (AKA "cond") includes READ_ONCE().  Ditto "b" and WRITE_ONCE().
> Otherwise, the compiler could do quite a few "interesting" things,
> especially if it knows the value of "b".  For example, if the compiler
> knows that b==1, without the volatile casts, the compiler could just
> throw away both S1 and S2, eliminating any ordering.  This can get
> quite tricky -- see memory-barriers.txt for more mischief.
>
> The smp_rmb() is not needed in this example because S3 is a write, not
> a read.  Perhaps you meant something more like this:
>
> if (READ_ONCE(a))
> WRITE_ONCE(b, 1);
> smp_rmb();
> r1 = READ_ONCE(c);
>
> This sequence would guarantee that "a" was read before "c".
>
> Thanx, Paul
>

The smp_rmb() in Linux should be a compiler barrier, so the compiler
should not reorder the store above the smp_rmb(). However, what I am
wondering is whether a condition + rmb combination can be considered a real
acquire memory barrier from the CPU's point of view, which requires that it
cannot reorder the data store in S3 above S1 and S2. This is the part I am
not so sure about.

Cheers,
Longman

Re: [RFC][PATCH 1/3] locking: Introduce smp_acquire__after_ctrl_dep

Paul E. McKenney
On Wed, May 25, 2016 at 11:20:42AM -0400, Waiman Long wrote:

> On 05/25/2016 12:53 AM, Paul E. McKenney wrote:
> >On Tue, May 24, 2016 at 11:01:21PM -0400, Waiman Long wrote:
> >>On 05/24/2016 10:27 AM, Peter Zijlstra wrote:
> >>>Introduce smp_acquire__after_ctrl_dep(), this construct is not
> >>>uncommen, but the lack of this barrier is.
> >>>
> >>>Signed-off-by: Peter Zijlstra (Intel)<[hidden email]>
> >>>---
> >>>  include/linux/compiler.h |   14 ++++++++++----
> >>>  ipc/sem.c                |   14 ++------------
> >>>  2 files changed, 12 insertions(+), 16 deletions(-)
> >>>
> >>>--- a/include/linux/compiler.h
> >>>+++ b/include/linux/compiler.h
> >>>@@ -305,20 +305,26 @@ static __always_inline void __write_once
> >>>  })
> >>>
> >>>  /**
> >>>+ * smp_acquire__after_ctrl_dep() - Provide ACQUIRE ordering after a control dependency
> >>>+ *
> >>>+ * A control dependency provides a LOAD->STORE order, the additional RMB
> >>>+ * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
> >>>+ * aka. ACQUIRE.
> >>>+ */
> >>>+#define smp_acquire__after_ctrl_dep() smp_rmb()
> >>>+
> >>>+/**
> >>>   * smp_cond_acquire() - Spin wait for cond with ACQUIRE ordering
> >>>   * @cond: boolean expression to wait for
> >>>   *
> >>>   * Equivalent to using smp_load_acquire() on the condition variable but employs
> >>>   * the control dependency of the wait to reduce the barrier on many platforms.
> >>>   *
> >>>- * The control dependency provides a LOAD->STORE order, the additional RMB
> >>>- * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
> >>>- * aka. ACQUIRE.
> >>>   */
> >>>  #define smp_cond_acquire(cond) do { \
> >>>   while (!(cond)) \
> >>>   cpu_relax(); \
> >>>- smp_rmb(); /* ctrl + rmb := acquire */ \
> >>>+ smp_acquire__after_ctrl_dep(); \
> >>>  } while (0)
> >>>
> >>>
> >>I have a question about the claim that control dependence + rmb is
> >>equivalent to an acquire memory barrier. For example,
> >>
> >>S1:    if (a)
> >>S2:       b = 1;
> >>        smp_rmb()
> >>S3:    c = 2;
> >>
> >>Since c is independent of both a and b, is it possible that the cpu
> >>may reorder to execute store statement S3 first before S1 and S2?
> >The CPUs I know of won't do, nor should the compiler, at least assuming
> >"a" (AKA "cond") includes READ_ONCE().  Ditto "b" and WRITE_ONCE().
> >Otherwise, the compiler could do quite a few "interesting" things,
> >especially if it knows the value of "b".  For example, if the compiler
> >knows that b==1, without the volatile casts, the compiler could just
> >throw away both S1 and S2, eliminating any ordering.  This can get
> >quite tricky -- see memory-barriers.txt for more mischief.
> >
> >The smp_rmb() is not needed in this example because S3 is a write, not
> >a read.  Perhaps you meant something more like this:
> >
> > if (READ_ONCE(a))
> > WRITE_ONCE(b, 1);
> > smp_rmb();
> > r1 = READ_ONCE(c);
> >
> >This sequence would guarantee that "a" was read before "c".
>
> The smp_rmb() in Linux should be a compiler barrier. So the compiler
> should not recorder it above smp_rmb. However, what I am wondering
> is whether a condition + rmb combination can be considered a real
> acquire memory barrier from the CPU point of view which requires
> that it cannot reorder the data store in S3 above S1 and S2. This is
> where I am not so sure about.

For your example, but keeping the compiler in check:

        if (READ_ONCE(a))
                WRITE_ONCE(b, 1);
        smp_rmb();
        WRITE_ONCE(c, 2);

On x86, the smp_rmb() is as you say nothing but barrier().  However,
x86's TSO prohibits reordering reads with subsequent writes.  So the
read from "a" is ordered before the write to "c".

On powerpc, the smp_rmb() will be the lwsync instruction plus a compiler
barrier.  This orders prior reads against subsequent reads and writes, so
again the read from "a" will be ordered before the write to "c".  But the
ordering against subsequent writes is an accident of implementation.
The real guarantee comes from powerpc's guarantee that stores won't be
speculated, so that the read from "a" is guaranteed to be ordered before
the write to "c" even without the smp_rmb().

On arm, the smp_rmb() is a full memory barrier, so you are good
there.  On arm64, it is the "dmb ishld" instruction, which only orders
reads.  But in both arm and arm64, speculative stores are forbidden,
just as in powerpc.  So in both cases, the load from "a" is ordered
before the store to "c".

Other CPUs are required to behave similarly, but hopefully those
examples help.

But the READ_ONCE() and WRITE_ONCE() are critically important.
The compiler is permitted to play all sorts of tricks if you have
something like this:

        if (a)
                b = 1;
        smp_rmb();
        c = 2;

Here, the compiler is permitted to assume that no other CPU is either
looking at or touching these variables.  After all, you didn't tell
it otherwise!  (Another way of telling it otherwise is through use
of atomics, as in David Howells's earlier patch.)

First, it might decide to place a, b, and c into registers for the
duration.  In that case, the compiler barrier has no effect, and
the compiler is free to rearrange.  (Yes, real compilers are probably
more strict and thus more forgiving of this sort of thing.  But they
are under no obligation to forgive.)

Second, as noted earlier, the compiler might see an earlier load from
or store to "b".  If so, it is permitted to remember the value loaded
or stored, and if that value happened to have been 1, the compiler
is within its rights to drop the "if" statement completely, thus never
loading "a" or storing to "b".

Finally, at least for this email, there is the possibility of load
or store tearing.

Does that help?

                                                        Thanx, Paul


Re: [RFC][PATCH 1/3] locking: Introduce smp_acquire__after_ctrl_dep

Peter Zijlstra-5
On Wed, May 25, 2016 at 08:57:47AM -0700, Paul E. McKenney wrote:

> For your example, but keeping the compiler in check:
>
> if (READ_ONCE(a))
> WRITE_ONCE(b, 1);
> smp_rmb();
> WRITE_ONCE(c, 2);
>
> On x86, the smp_rmb() is as you say nothing but barrier().  However,
> x86's TSO prohibits reordering reads with subsequent writes.  So the
> read from "a" is ordered before the write to "c".
>
> On powerpc, the smp_rmb() will be the lwsync instruction plus a compiler
> barrier.  This orders prior reads against subsequent reads and writes, so
> again the read from "a" will be ordered befoer the write to "c".  But the
> ordering against subsequent writes is an accident of implementation.
> The real guarantee comes from powerpc's guarantee that stores won't be
> speculated, so that the read from "a" is guaranteed to be ordered before
> the write to "c" even without the smp_rmb().
>
> On arm, the smp_rmb() is a full memory barrier, so you are good
> there.  On arm64, it is the "dmb ishld" instruction, which only orders
> reads.

IIRC dmb ishld orders more than load vs load (like the manual states),
but I forgot the details; we'll have to wait for Will to clarify. But
yes, it also orders loads vs loads, so it is sufficient here.
 
> But in both arm and arm64, speculative stores are forbidden,
> just as in powerpc.  So in both cases, the load from "a" is ordered
> before the store to "c".
>
> Other CPUs are required to behave similarly, but hopefully those
> examples help.

I would consider any architecture that allows speculative stores as
broken. They are values out of thin air and would make any kind of
concurrency extremely 'interesting'.

Re: [RFC][PATCH 1/3] locking: Introduce smp_acquire__after_ctrl_dep

Linus Torvalds-2
On Wed, May 25, 2016 at 9:28 AM, Peter Zijlstra <[hidden email]> wrote:
>
> I would consider any architecture that allows speculative stores as
> broken. They are values out of thin air and would make any kind of
> concurrency extremely 'interesting'.

It's worth noting that the same is true of compilers too. You will
find compiler people who argue that speculative stores are valid
because the spec doesn't explicitly forbid them. Same is true of
compiler-generated value speculation.

Both are cases of "yeah, the C standard may not explicitly disallow
it, but sanity in a threaded environment does". Sadly, I've seen
compiler people who dismiss "sanity" as an argument, since that also
isn't defined in the C standard. There are people who think that paper
is the most precious resource in the universe.

I'm not actually aware of anybody doing speculative stores or value
speculation in a compiler we care about, but those kinds of things
are the kinds of things where we'd just go "that compiler is shit" and
not use it (possibly with a command line option to disable the
particular broken optimization, like we do for the broken type-based
aliasing and some other code generation things that just don't work in
the kernel).

So we definitely have the option to just say "theory is just theory".
We'll never make design decisions based on insane things being
possible in theory, whether it be crazy architectures or crazy
compilers.

               Linus

Re: [RFC][PATCH 1/3] locking: Introduce smp_acquire__after_ctrl_dep

Paul E. McKenney
On Wed, May 25, 2016 at 09:54:55AM -0700, Linus Torvalds wrote:

> On Wed, May 25, 2016 at 9:28 AM, Peter Zijlstra <[hidden email]> wrote:
> >
> > I would consider any architecture that allows speculative stores as
> > broken. They are values out of thin air and would make any kind of
> > concurrency extremely 'interesting'.
>
> It's worth noting that the same is true of compilers too. You will
> find compiler people who argue that speculative stores are valid
> because the spec doesn't explicitly forbid them. Same is true of
> compiler-generated value speculation.

Thankfully, this has improved.  There was a time when compiler writers
were happy to overwrite adjacent variables and fix them up later,
believe it or not.  Not so good if the variables are shared variables,
possibly protected by different locks.  Most compiler writers now
understand that this sort of thing is not permitted, and recent
versions of the standard explicitly forbid it.

But there are still any number of optimizations that can cause trouble
for concurrent code.  Common subexpression elimination, for example...
Which is one reason for my heavy use of READ_ONCE() and WRITE_ONCE().

> Both are cases of "yeah, the C standard may not explicitly disallow
> it, but sanity in a threaded environment does". Sadly, I've seen
> compiler people who dismiss "sanity" as an argument, since that also
> isn't defined in the C standard. There are people who think that paper
> is the most precious resource in the universe.
>
> I'm not actually aware of anybody doing speculative stores or value
> speculation in a compiler we care about, but if those kinds of things
> are the kinds of things where we'd just go "that compiler is shit" and
> not use it (possibly with a command line option to disable the
> particular broken optimization, like we do for the broken type-based
> aliasing and some other code generation things that just don't work in
> the kernel).
>
> So we definitely have the option to just say "theory is just theory".
> We'll never make design decisions based on insane things being
> possible in theory, whether it be crazy architectures or crazy
> compilers.

There has been some discussion of adding "-std=kernel" to tell the
compiler to follow Linux-kernel rules, but not sure whether this is
really going anywhere.

                                                        Thanx, Paul


Re: [RFC][PATCH 2/3] locking: Annotate spin_unlock_wait() users

Tejun Heo-2
In reply to this post by Peter Zijlstra-5
On Tue, May 24, 2016 at 06:58:36PM +0200, Peter Zijlstra wrote:

> On Tue, May 24, 2016 at 12:22:07PM -0400, Tejun Heo wrote:
> > A delta but that particular libata usage is probably not needed now.
> > The path was used while libata was gradually adding error handlers to
> > the low level drivers.  I don't think we don't have any left w/o one
> > at this point.  I'll verify and get rid of that usage.
>
> OK, that would be great; I was sorta lost in there, but it looked like
> if you need the spin_unlock_wait() you also need the extra barrier
> thing.
>
> If you can remove it, better still.

Unfortunately, the ipr SAS driver is still using the old error handling
path, so please include libata in the conversion for now.  I'll see if
ipr can be converted.

Thanks.

--
tejun

Re: [RFC][PATCH 3/3] locking,netfilter: Fix nf_conntrack_lock()

Peter Zijlstra-5
In reply to this post by Peter Zijlstra-5
On Tue, May 24, 2016 at 10:41:43PM +0200, Manfred Spraul wrote:

> >--- a/net/netfilter/nf_conntrack_core.c
> >+++ b/net/netfilter/nf_conntrack_core.c
> >@@ -74,7 +74,18 @@ void nf_conntrack_lock(spinlock_t *lock)
> >   spin_lock(lock);
> >   while (unlikely(nf_conntrack_locks_all)) {
> >   spin_unlock(lock);
> >+ /*
> >+ * Order the nf_contrack_locks_all load vs the spin_unlock_wait()
> >+ * loads below, to ensure locks_all is indeed held.
> >+ */
> >+ smp_rmb(); /* spin_lock(locks_all) */
> >   spin_unlock_wait(&nf_conntrack_locks_all_lock);

> I don't understand the comment, and I don't understand why the smp_rmb() is
> required: What do you want to protect against?

I order against the spin_unlock_wait() load of
nf_conntrack_locks_all_lock being done _before_ the
nf_conntrack_locks_all load.

This is possible because the spin_unlock() in between will stop the
nf_conntrack_locks_all load from moving down past it, but it doesn't stop the
&nf_conntrack_locks_all_lock load (done by spin_unlock_wait()) from being
done early.

Inverting these two loads could result in not waiting when we should.


        nf_conntrack_all_lock()                 nf_conntrack_lock()

                                                [L] all_lock == unlocked
        [S] spin_lock(&all_lock);
        [S] all = true;
                                                [L] all == true

And we'll not wait for the current all_lock section to finish.
Resulting in an all_lock and lock section at the same time.