sem_lock() vs qspinlocks

Re: sem_lock() vs qspinlocks

Linus Torvalds-2
On Fri, May 20, 2016 at 9:04 AM, Peter Zijlstra <[hidden email]> wrote:
> +        * queued_spin_lock_slowpath() can ACQUIRE the lock before
> +        * issuing the unordered store that sets _Q_LOCKED_VAL.

Ugh. This was my least favorite part of the queued locks, and I really
liked the completely unambiguous semantics of the x86 atomics-based
versions we used to have.

But I guess we're stuck with it.

That said, it strikes me that almost all of the users of
"spin_is_locked()" are using it for verification purposes (not locking
correctness), and that the people who are playing games with locking
correctness are few and already have to play *other* games anyway.

See for example "ipc_smp_acquire__after_spin_is_unlocked()", which has
a big comment atop of it that now becomes nonsensical with this patch.

So I do wonder if we should make that smp_mb() be something the
*caller* has to do, and document rules for it. IOW, introduce a new
spinlock primitive called "spin_lock_synchronize()", and then spinlock
implementations that have this non-atomic behavior with an unordered
store would do something like

    static inline void queued_spin_lock_synchronize(struct qspinlock *a,
                                                    struct qspinlock *b)
    {
        smp_mb();
    }

and then we'd document that *if* you need ordering guarantees between

   spin_lock(a);
   .. spin_is_locked/spin_unlock_wait(b) ..

you have to have a

    spin_lock_synchronize(a, b);

in between.

A spin-lock implementation with the old x86 atomic semantics would
make it a no-op.
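
For illustration, a sketch of that no-op variant for an atomics-based
implementation (hypothetical type and function names, mirroring the queued
version above):

    static inline void ticket_spin_lock_synchronize(struct ticket_spinlock *a,
                                                    struct ticket_spinlock *b)
    {
        /*
         * Nothing to do: the locked state is set by the same atomic RMW
         * that ACQUIREs the lock, so it cannot be delayed past the
         * spin_is_locked()/spin_unlock_wait() loads that follow.
         */
    }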

We should also introduce something like a
"spin_lock_acquire_after_unlock()" so that the ipc/sem.c behavior
would be documented too. Partly because some spinlock implementations
might have stronger unlock ordering and it could be a no-op for
them, but mostly for documentation of the rules.

Now, I'd take Peter's patch as-is, because I don't think any of this
matters from a *performance* standpoint, and Peter's patch is much
smaller and simpler.

But the reason I think it might be a good thing to introduce those
spin_lock_synchronize() and spin_lock_acquire_after_unlock() concepts
would be to make it very very clear what those subtle implementations
in mutexes and the multi-level locks in the ipc layer are doing and
what they rely on.

Comments? There are arguments for Peter's simple approach too ("robust
and simpler interface" vs "make our requirements explicit").

               Linus

Re: sem_lock() vs qspinlocks

Waiman Long-2
In reply to this post by Peter Zijlstra-5
On 05/20/2016 07:58 AM, Peter Zijlstra wrote:

> On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
>> As such, the following restores the behavior of the ticket locks and 'fixes'
>> (or hides?) the bug in sems. Naturally incorrect approach:
>>
>> @@ -290,7 +290,8 @@ static void sem_wait_array(struct sem_array *sma)
>>
>> 	for (i = 0; i < sma->sem_nsems; i++) {
>> 		sem = sma->sem_base + i;
>> -		spin_unlock_wait(&sem->lock);
>> +		while (atomic_read(&sem->lock))
>> +			cpu_relax();
>> 	}
>> 	ipc_smp_acquire__after_spin_is_unlocked();
>> }
> The actual bug is clear_pending_set_locked() not having acquire
> semantics. And the above 'fixes' things because it will observe the old
> pending bit or the locked bit, so it doesn't matter if the store
> flipping them is delayed.

The clear_pending_set_locked() is not the only place where the lock is
set. If there is more than one waiter, the queuing path will be used
instead. The set_locked(), which is also an unordered store, will then
be used to set the lock.
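
For reference, a simplified sketch of what that slowpath store looked like in
the qspinlock code of that era (not the verbatim upstream source):

static __always_inline void set_locked(struct qspinlock *lock)
{
	struct __qspinlock *l = (void *)lock;

	/* plain byte store: unordered, just like clear_pending_set_locked() */
	WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
}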

Cheers,
Longman

Re: sem_lock() vs qspinlocks

Waiman Long-2
In reply to this post by Davidlohr Bueso-5
On 05/20/2016 11:00 AM, Davidlohr Bueso wrote:

> On Fri, 20 May 2016, Peter Zijlstra wrote:
>
>> On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
>>>  In addition, this makes me wonder if queued_spin_is_locked() should
>>> then be:
>>>
>>> -    return atomic_read(&lock->val);
>>> +    return atomic_read(&lock->val) & _Q_LOCKED_MASK;
>>>
>>> And avoid considering pending waiters as locked.
>>
>> Probably
>
> Similarly, and I know you hate it, but afaict, then semantically
> queued_spin_is_contended() ought to be:
>
> -       return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
> +       return atomic_read(&lock->val);
>
> Thanks,
> Davidlohr

When looking for a contended lock, you need to consider the lock waiters as
well, so looking at the whole word is right.

Cheers,
Longman

Re: sem_lock() vs qspinlocks

Peter Zijlstra-5
On Fri, May 20, 2016 at 04:47:43PM -0400, Waiman Long wrote:

> >Similarly, and I know you hate it, but afaict, then semantically
> >queued_spin_is_contended() ought to be:
> >
> >-       return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
> >+       return atomic_read(&lock->val);
> >

> Looking for contended lock, you need to consider the lock waiters also. So
> looking at the whole word is right.

No, you _only_ need to look at the lock waiters.

Re: sem_lock() vs qspinlocks

Peter Zijlstra-5
In reply to this post by Waiman Long-2
On Fri, May 20, 2016 at 04:44:19PM -0400, Waiman Long wrote:

> On 05/20/2016 07:58 AM, Peter Zijlstra wrote:
> >On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> >>As such, the following restores the behavior of the ticket locks and 'fixes'
> >>(or hides?) the bug in sems. Naturally incorrect approach:
> >>
> >>@@ -290,7 +290,8 @@ static void sem_wait_array(struct sem_array *sma)
> >>
> >> 	for (i = 0; i < sma->sem_nsems; i++) {
> >> 		sem = sma->sem_base + i;
> >> -		spin_unlock_wait(&sem->lock);
> >> +		while (atomic_read(&sem->lock))
> >> +			cpu_relax();
> >> 	}
> >> 	ipc_smp_acquire__after_spin_is_unlocked();
> >> }
> >The actual bug is clear_pending_set_locked() not having acquire
> >semantics. And the above 'fixes' things because it will observe the old
> >pending bit or the locked bit, so it doesn't matter if the store
> >flipping them is delayed.
>
> The clear_pending_set_locked() is not the only place where the lock is set.
> If there is more than one waiter, the queuing path will be used instead.
> The set_locked(), which is also an unordered store, will then be used to set
> the lock.

Ah yes. I didn't get that far. One case was enough :-)

Re: sem_lock() vs qspinlocks

Peter Zijlstra-5
In reply to this post by Linus Torvalds-2
On Fri, May 20, 2016 at 10:00:45AM -0700, Linus Torvalds wrote:
> On Fri, May 20, 2016 at 9:04 AM, Peter Zijlstra <[hidden email]> wrote:
> > +        * queued_spin_lock_slowpath() can ACQUIRE the lock before
> > +        * issuing the unordered store that sets _Q_LOCKED_VAL.
>
> Ugh. This was my least favorite part of the queued locks, and I really
> liked the completely unambiguous semantics of the x86 atomics-based
> versions we used to have.
>
> But I guess we're stuck with it.

Yeah, I wasn't too happy either when I realized it today. We _could_
make these atomic ops, but then we're just making things slower for this
one weird case :/

> That said, it strikes me that almost all of the users of
> "spin_is_locked()" are using it for verification purposes (not locking
> correctness),

Right; although I feel those people should be using
lockdep_assert_held() for this instead. That not only compiles out when
!LOCKDEP but also asserts the current task/cpu is the lock owner, not
someone else.
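
For example, a verification-only check in the ipc code could look like this
(illustrative sketch; the helper is made up):

static void sem_assert_locked(struct sem_array *sma)	/* hypothetical helper */
{
	/*
	 * Unlike BUG_ON(!spin_is_locked(&sma->sem_perm.lock)), this compiles
	 * out without CONFIG_LOCKDEP and checks that *we* are the owner,
	 * not merely that somebody holds the lock.
	 */
	lockdep_assert_held(&sma->sem_perm.lock);
}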

> and that the people who are playing games with locking
> correctness are few and already have to play *other* games anyway.
>
> See for example "ipc_smp_acquire__after_spin_is_unlocked()", which has
> a big comment atop of it that now becomes nonsensical with this patch.

Not quite; we still need that I think.

We now have:

        spin_lock(A);
        smp_mb();
        while (!spin_is_locked(B))
                cpu_relax();
        smp_rmb();

And that control dependency together with the rmb form a load-acquire on
the unlocked B, which matches the release of the spin_unlock(B) and
ensures we observe the whole previous critical section we waited for.

The new smp_mb() doesn't help with that.
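
Spelled out, the same sequence with the role of each barrier annotated (just
restating the argument above):

	spin_lock(A);			/* ACQUIRE for A's critical section */
	smp_mb();			/* new: orders our A-locked store
					 * before the loads of B below */
	while (!spin_is_locked(B))
		cpu_relax();		/* loop exit depends on the load of B:
					 * control dependency, LOAD->STORE */
	smp_rmb();			/* adds LOAD->LOAD; together with the
					 * ctrl-dep this acts as an ACQUIRE,
					 * pairing with the RELEASE in
					 * spin_unlock(B) */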

> Now, I'd take Peter's patch as-is, because I don't think any of this
> matters from a *performance* standpoint, and Peter's patch is much
> smaller and simpler.

I would suggest you do this and also mark it for stable v4.2 and later.

> But the reason I think it might be a good thing to introduce those
> spin_lock_synchronize() and splin_lock_acquire_after_unlock() concepts
> would be to make it very very clear what those subtle implementations
> in mutexes and the multi-level locks in the ipc layer are doing and
> what they rely on.

We can always do the fancy stuff on top, but that isn't going to need
backporting to all stable trees, this is.

I'll think a little more on the explicit document vs simple thing.

Re: sem_lock() vs qspinlocks

Linus Torvalds-2
On Fri, May 20, 2016 at 2:06 PM, Peter Zijlstra <[hidden email]> wrote:
>>
>> See for example "ipc_smp_acquire__after_spin_is_unlocked()", which has
>> a big comment atop of it that now becomes nonsensical with this patch.
>
> Not quite; we still need that I think.

I think so too, but it's the *comment* that is nonsensical.

The comment says that "spin_unlock_wait() and !spin_is_locked() are
not memory barriers", and clearly now those instructions *are* memory
barriers with your patch.

However, the semaphore code wants a memory barrier after the _read_ in
the spin_unlock_wait(), which it doesn't get.

So that is part of why I don't like the "hide memory barriers inside
the implementation".

Because once the operations aren't atomic (exactly like the spinlock
is now no longer atomic on x86: it's a separate read-with-acquire
followed by an unordered store for the queued case), the barrier
semantics within such an operation get very screwy.  There may be
barriers, but they aren't barriers to *everything*, they are just
barriers to part of the non-atomic operation.

If we were to make the synchronization explicit, we'd still have to
deal with all the subtle semantics, but now the subtle semantics would
at least be *explicit*. And it would make it much easier to explain
the barriers in that ipc semaphore code.
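
Purely as a hypothetical sketch, the ipc wait loop with the ordering made
explicit might read something like this (neither primitive exists as named):

	spin_lock(&sma->sem_perm.lock);			/* the "complex" lock      */
	spin_lock_synchronize(&sma->sem_perm.lock,	/* hypothetical: order our */
			      &sem->lock);		/* locked store vs B loads */
	spin_unlock_wait(&sem->lock);			/* wait out simple ops     */
	spin_lock_acquire_after_unlock(&sem->lock);	/* hypothetical: ACQUIRE the
							 * unlock we just observed */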

>> Now, I'd take Peter's patch as-is, because I don't think any of this
>> matters from a *performance* standpoint, and Peter's patch is much
>> smaller and simpler.
>
> I would suggest you do this and also mark it for stable v4.2 and later.

Oh, I definitely agree on the stable part, and yes, the "split things
up" model should come later if people agree that it's a good thing.

Should I take the patch as-is, or should I just wait for a pull
request from the locking tree? Either is ok by me.

              Linus

Re: sem_lock() vs qspinlocks

Davidlohr Bueso-5
On Fri, 20 May 2016, Linus Torvalds wrote:


>Oh, I definitely agree on the stable part, and yes, the "split things
>up" model should come later if people agree that it's a good thing.

The backporting part is quite nice, yes, but ultimately I think I prefer
Linus' suggestion of making things explicit, as opposed to relying on the
spinlock to imply the barriers. I also hate having an smp_mb() (particularly
for spin_is_locked) given that we are not optimizing for the common case
(regular mutual exclusion).

As opposed to spin_is_locked(), spin_unlock_wait() is perhaps more tempting
to use for locking correctness. For example, taking a look at nf_conntrack_all_lock(),
it too likes to get smart with spin_unlock_wait() -- also for finer graining purposes.
While not identical to sems, it goes like:

nf_conntrack_all_lock():        nf_conntrack_lock():
spin_lock(B);                   spin_lock(A);
                                if (bar) { // false
bar = 1;                           ...
                                }
[loop ctrl-barrier]
  spin_unlock_wait(A);
foo();                          foo();

If the spin_unlock_wait() doesn't yet see the store that makes A visibly locked,
we could end up with both threads in foo(), no? (Although I'm unsure about that
ctrl barrier and which archs could fall into it; the point was to see in-tree
examples of creative thinking with locking.)
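
A simplified, self-contained sketch of that pattern (illustrative names; not
a verbatim copy of the netfilter code):

#include <linux/spinlock.h>

static bool all_locked;			/* the "bar" flag above */
static DEFINE_SPINLOCK(global_lock);	/* "B" */
static DEFINE_SPINLOCK(local_lock);	/* "A" */

static void take_all(void)		/* nf_conntrack_all_lock()-like side */
{
	spin_lock(&global_lock);
	all_locked = true;
	spin_unlock_wait(&local_lock);	/* wait out current local holders */
	/* foo(): assumed exclusive here */
}

static void take_one(void)		/* nf_conntrack_lock()-like side */
{
	spin_lock(&local_lock);
	if (all_locked) {		/* may not yet see take_all()'s lock/flag
					 * stores -- the race in question */
		/* back off / defer to the global side */
	}
	/* foo() */
}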

>Should I take the patch as-is, or should I just wait for a pull
>request from the locking tree? Either is ok by me.

I can verify that this patch fixes the issue.

Thanks,
Davidlohr

Re: sem_lock() vs qspinlocks

Davidlohr Bueso-5
In reply to this post by Peter Zijlstra-5
On Fri, 20 May 2016, Peter Zijlstra wrote:

>On Fri, May 20, 2016 at 04:47:43PM -0400, Waiman Long wrote:
>
>> >Similarly, and I know you hate it, but afaict, then semantically
>> >queued_spin_is_contended() ought to be:
>> >
>> >-       return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
>> >+       return atomic_read(&lock->val);
>> >
>
>> Looking for contended lock, you need to consider the lock waiters also. So
>> looking at the whole word is right.
>
>No, you _only_ need to look at the lock waiters.

Is there any way to do this in a single atomic_read? My thought is that otherwise
we could further expand the race window of when the lock is and isn't
contended (as returned to the user). I.e. avoiding crap like:

atomic_read(&lock->val) && atomic_read(&lock->val) != _Q_LOCKED_VAL

In any case, falsely returning for the 'locked, uncontended' case, vs completely
ignoring waiters is probably the lesser evil :).
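
(For reference, a single-snapshot check is essentially what the current
mask-based code already does -- a sketch, assuming the qspinlock word layout
of the time:)

static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
{
	int val = atomic_read(&lock->val);	/* one snapshot of the lock word */

	/* waiters only: pending bit and/or MCS tail, not the holder's bit */
	return val & ~_Q_LOCKED_MASK;
}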

Thanks,
Davidlohr

Re: sem_lock() vs qspinlocks

Linus Torvalds-2
In reply to this post by Davidlohr Bueso-5
On Fri, May 20, 2016 at 5:48 PM, Davidlohr Bueso <[hidden email]> wrote:
>
> I can verify that this patch fixes the issue.

Ok, I've applied it to my tree.

         Linus

Re: sem_lock() vs qspinlocks

Waiman Long-2
In reply to this post by Davidlohr Bueso-5
On 05/20/2016 08:59 PM, Davidlohr Bueso wrote:

> On Fri, 20 May 2016, Peter Zijlstra wrote:
>
>> On Fri, May 20, 2016 at 04:47:43PM -0400, Waiman Long wrote:
>>
>>> >Similarly, and I know you hate it, but afaict, then semantically
>>> >queued_spin_is_contended() ought to be:
>>> >
>>> >-       return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
>>> >+       return atomic_read(&lock->val);
>>> >
>>
>>> Looking for contended lock, you need to consider the lock waiters
>>> also. So
>>> looking at the whole word is right.
>>
>> No, you _only_ need to look at the lock waiters.
>
> Is there anyway to do this in a single atomic_read? My thought is that
> otherwise
> we could further expand the race window of when the lock is and isn't
> contended (as returned to by the user). Ie avoiding crap like:
>
> atomic_read(&lock->val) && atomic_read(&lock->val) != _Q_LOCKED_VAL
>
> In any case, falsely returning for the 'locked, uncontended' case, vs
> completely
> ignoring waiters is probably the lesser evil :).
>
> Thanks,
> Davidlohr

The existing code is doing that, but I would argue that including the
locked, but uncontended case isn't a bad idea.

Cheers,
Longman

Re: sem_lock() vs qspinlocks

Peter Zijlstra-5
In reply to this post by Davidlohr Bueso-5
On Fri, May 20, 2016 at 05:48:39PM -0700, Davidlohr Bueso wrote:

> On Fri, 20 May 2016, Linus Torvalds wrote:
>
>
> >Oh, I definitely agree on the stable part, and yes, the "split things
> >up" model should come later if people agree that it's a good thing.
>
> The backporting part is quite nice, yes, but ultimately I think I prefer
> Linus' suggestion of making things explicit, as opposed to relying on the
> spinlock to imply the barriers. I also hate having an smp_mb() (particularly
> for spin_is_locked) given that we are not optimizing for the common case
> (regular mutual exclusion).

I'm confused; we _are_ optimizing for the common case. spin_is_locked()
is very unlikely to be used. And arguably should be used less in favour
of lockdep_assert_held().

> As opposed to spin_is_locked(), spin_unlock_wait() is perhaps more tempting
> to use for locking correctness. For example, taking a look at nf_conntrack_all_lock(),
> it too likes to get smart with spin_unlock_wait() -- also for finer graining purposes.
> While not identical to sems, it goes like:
>
> nf_conntrack_all_lock():        nf_conntrack_lock():
> spin_lock(B);                   spin_lock(A);
>                                 if (bar) { // false
> bar = 1;                           ...
>                                 }
> [loop ctrl-barrier]
>   spin_unlock_wait(A);
> foo();                          foo();
>
> If the spin_unlock_wait() doesn't yet see the store that makes A visibly locked,
> we could end up with both threads in foo(), no?. (Although I'm unsure about that
> ctrl barrier and archs could fall into it. The point was to see in-tree examples
> of creative thinking with locking).

I'm tempted to put that trailing smp_rmb() in spin_unlock_wait() too;
because I suspect the netfilter code is broken without it.

And it seems intuitive to assume that if we return from unlock_wait() we
can indeed observe the critical section we waited on.

Something a little like so; but then for _all_ implementations.

---
 include/asm-generic/qspinlock.h |  6 ++++++
 include/linux/compiler.h        | 13 ++++++++-----
 2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index 6bd05700d8c9..2f2eddd3e1f9 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -135,6 +135,12 @@ static inline void queued_spin_unlock_wait(struct qspinlock *lock)
 	smp_mb();
 	while (atomic_read(&lock->val) & _Q_LOCKED_MASK)
 		cpu_relax();
+
+	/*
+	 * Match the RELEASE of the spin_unlock() we just observed. Thereby
+	 * ensuring we observe the whole critical section that ended.
+	 */
+	smp_acquire__after_ctrl_dep();
 }
 
 #ifndef virt_spin_lock
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 793c0829e3a3..3c4bc8160947 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -304,21 +304,24 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int size)
 	__u.__val;				\
 })
 
+/*
+ * A control dependency provides a LOAD->STORE order, the additional RMB
+ * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
+ * aka. ACQUIRE.
+ */
+#define smp_acquire__after_ctrl_dep()	smp_rmb()
+
 /**
  * smp_cond_acquire() - Spin wait for cond with ACQUIRE ordering
  * @cond: boolean expression to wait for
  *
  * Equivalent to using smp_load_acquire() on the condition variable but employs
  * the control dependency of the wait to reduce the barrier on many platforms.
- *
- * The control dependency provides a LOAD->STORE order, the additional RMB
- * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
- * aka. ACQUIRE.
  */
 #define smp_cond_acquire(cond)	do {		\
 	while (!(cond))				\
 		cpu_relax();			\
-	smp_rmb(); /* ctrl + rmb := acquire */	\
+	smp_acquire__after_ctrl_dep();		\
 } while (0)
 
 #endif /* __KERNEL__ */

Re: sem_lock() vs qspinlocks

Peter Zijlstra-5
In reply to this post by Waiman Long-2
On Sat, May 21, 2016 at 12:01:00AM -0400, Waiman Long wrote:

> On 05/20/2016 08:59 PM, Davidlohr Bueso wrote:
> >On Fri, 20 May 2016, Peter Zijlstra wrote:
> >
> >>On Fri, May 20, 2016 at 04:47:43PM -0400, Waiman Long wrote:
> >>
> >>>>Similarly, and I know you hate it, but afaict, then semantically
> >>>>queued_spin_is_contended() ought to be:
> >>>>
> >>>>-       return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
> >>>>+       return atomic_read(&lock->val);
> >>>>
> >>
> >>>Looking for contended lock, you need to consider the lock waiters
> >>>also. So
> >>>looking at the whole word is right.
> >>
> >>No, you _only_ need to look at the lock waiters.
> >
> >Is there anyway to do this in a single atomic_read? My thought is that
> >otherwise
> >we could further expand the race window

It's inherently racy; the arrival of a contender is subject to timing. No
point in trying to fix what can't be fixed.

> The existing code is doing that, but I would argue that including the
> locked, but uncontended case isn't a bad idea.

It _IS_ a bad idea, you get unconditional lock-breaks.

It's the same as:

#define spin_is_contended(l) (true)

Because the only reason you're using spin_is_contended() is if you're
holding it.
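
For context, the usual consumer of contention information is a lock-break
loop along these lines (illustrative sketch; more_work()/do_one_chunk() are
made up). With spin_is_contended() hard-wired to true it would drop and
retake the lock on every iteration:

	spin_lock(&some_lock);
	while (more_work()) {
		do_one_chunk();
		if (need_resched() || spin_is_contended(&some_lock)) {
			spin_unlock(&some_lock);	/* lock-break */
			cond_resched();
			spin_lock(&some_lock);
		}
	}
	spin_unlock(&some_lock);

(The kernel's spin_needbreak()/cond_resched_lock() helpers wrap essentially
this pattern.)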

Re: sem_lock() vs qspinlocks

Manfred Spraul
In reply to this post by Peter Zijlstra-5
On 05/21/2016 09:37 AM, Peter Zijlstra wrote:

> On Fri, May 20, 2016 at 05:48:39PM -0700, Davidlohr Bueso wrote:
>> As opposed to spin_is_locked(), spin_unlock_wait() is perhaps more tempting
>> to use for locking correctness. For example, taking a look at nf_conntrack_all_lock(),
>> it too likes to get smart with spin_unlock_wait() -- also for finer graining purposes.
>> While not identical to sems, it goes like:
>>
>> nf_conntrack_all_lock():        nf_conntrack_lock():
>> spin_lock(B);                   spin_lock(A);
>>                                 if (bar) { // false
>> bar = 1;                           ...
>>                                 }
>> [loop ctrl-barrier]
>>   spin_unlock_wait(A);
>> foo();                          foo();
>>
>> If the spin_unlock_wait() doesn't yet see the store that makes A visibly locked,
>> we could end up with both threads in foo(), no?. (Although I'm unsure about that
>> ctrl barrier and archs could fall into it. The point was to see in-tree examples
>> of creative thinking with locking).
> I'm tempted to put that trailing smp_rmb() in spin_unlock_wait() too;
> because I suspect the netfilter code is broken without it.
>
> And it seems intuitive to assume that if we return from unlock_wait() we
> can indeed observe the critical section we waited on.
Then !spin_is_locked() and spin_unlock_wait() would be different with
regards to memory barriers.
Would that really help?

My old plan was to document the rules, and define a generic
smp_acquire__after_spin_is_unlocked.
https://lkml.org/lkml/2015/3/1/153

No one supported it, so it ended up as
ipc_smp_acquire__after_spin_is_unlocked().
Should we move it to linux/spinlock.h?
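
(For reference, in ipc/sem.c the helper boils down to a documented read
barrier; a generic version would presumably be the same thing, roughly:)

/*
 * Pairs with the RELEASE in spin_unlock(): after observing a lock as
 * unlocked via spin_unlock_wait()/!spin_is_locked(), this upgrades the
 * observation to an ACQUIRE of the critical section that just ended.
 */
#define smp_acquire__after_spin_is_unlocked()	smp_rmb()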

Who needs it?
- ipc/sem.c (but please start from the version from linux-next as
reference, it is far less convoluted compared to the current code)
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/tree/ipc/sem.c

- nf_conntrack

- task_rq_lock() perhaps needs smp_acquire__after_ctrl_dep
(I didn't figure out yet what happened to the proposed patch)
https://lkml.org/lkml/2015/2/17/129

--
     Manfred

Re: sem_lock() vs qspinlocks

Davidlohr Bueso-5
In reply to this post by Peter Zijlstra-5
On Sat, 21 May 2016, Peter Zijlstra wrote:

>On Fri, May 20, 2016 at 05:48:39PM -0700, Davidlohr Bueso wrote:
>> On Fri, 20 May 2016, Linus Torvalds wrote:
>>
>>
>> >Oh, I definitely agree on the stable part, and yes, the "split things
>> >up" model should come later if people agree that it's a good thing.
>>
>> The backporting part is quite nice, yes, but ultimately I think I prefer
>> Linus' suggestion of making things explicit, as opposed to relying on the
>> spinlock to imply the barriers. I also hate having an smp_mb() (particularly
>> for spin_is_locked) given that we are not optimizing for the common case
>> (regular mutual exclusion).
>
>I'm confused; we _are_ optimizing for the common case. spin_is_locked()
>is very unlikely to be used. And arguably should be used less in favour
>of lockdep_assert_held().

Indeed we are.

But by 'common case' I was really thinking about spin_is_locked() vs spin_unlock_wait().
The former is the more common of the two, and the one which mostly will _not_ be used
for lock-correctness purposes, hence it doesn't need that new smp_mb(). Allowing users
to explicitly state the ordering needs (i.e. spin_lock_synchronize()) therefore seems
like the better long-term alternative. OTOH, with your approach all such bugs are
automatically fixed :)

Thanks,
Davidlohr

Re: sem_lock() vs qspinlocks

Manfred Spraul
In reply to this post by Peter Zijlstra-5
Hi Peter,


On 05/20/2016 06:04 PM, Peter Zijlstra wrote:

> On Fri, May 20, 2016 at 05:21:49PM +0200, Peter Zijlstra wrote:
>
>> Let me write a patch..
> OK, something like the below then.. lemme go build that and verify that
> too fixes things.
>
> ---
> Subject: locking,qspinlock: Fix spin_is_locked() and spin_unlock_wait()
>
> Similar to commits:
>
>    51d7d5205d33 ("powerpc: Add smp_mb() to arch_spin_is_locked()")
>    d86b8da04dfa ("arm64: spinlock: serialise spin_unlock_wait against concurrent lockers")
>
> qspinlock suffers from the fact that the _Q_LOCKED_VAL store is
> unordered inside the ACQUIRE of the lock.
>
> And while this is not a problem for the regular mutual exclusive
> critical section usage of spinlocks, it breaks creative locking like:
>
> spin_lock(A)                    spin_lock(B)
> spin_unlock_wait(B)             if (!spin_is_locked(A))
>   do_something()                  do_something()
>
> In that both CPUs can end up running do_something at the same time,
> because our _Q_LOCKED_VAL store can drop past the spin_unlock_wait()
> spin_is_locked() loads (even on x86!!).
How would we handle mixed spin_lock()/mutex_lock() code?
For the IPC code, I would like to replace the outer lock with a mutex.
The code only uses spinlocks, because at the time it was written, the
mutex code didn't contain a busy wait.
With a mutex, the code would become simpler (all the
lock/unlock/kmalloc/relock parts could be removed).

The result would be something like:

        mutex_lock(A)                   spin_lock(B)
        spin_unlock_wait(B)             if (!mutex_is_locked(A))
          do_something()                  do_something()

--
     Manfred

Re: sem_lock() vs qspinlocks

Peter Zijlstra-5
On Sun, May 22, 2016 at 10:43:08AM +0200, Manfred Spraul wrote:

> How would we handle mixed spin_lock()/mutex_lock() code?
> For the IPC code, I would like to replace the outer lock with a mutex.
> The code only uses spinlocks, because at the time it was written, the mutex
> code didn't contain a busy wait.
> With a mutex, the code would become simpler (all the
> lock/unlock/kmalloc/relock parts could be removed).
>
> The result would be something like:
>
> mutex_lock(A)                   spin_lock(B)
> spin_unlock_wait(B)             if (!mutex_is_locked(A))
>   do_something()                  do_something()
>

Should work similarly, but we'll have to audit mutex for these same
issues. I'll put it on todo.

Re: sem_lock() vs qspinlocks

Peter Zijlstra-5
In reply to this post by Linus Torvalds-2
On Fri, May 20, 2016 at 10:00:45AM -0700, Linus Torvalds wrote:

> So I do wonder if we should make that smp_mb() be something the
> *caller* has to do, and document rules for it. IOW, introduce a new
> spinlock primitive called "spin_lock_synchronize()", and then spinlock
> implementations that have this non-atomic behavior with an unordered
> store would do something like
>
>     static inline void queued_spin_lock_synchronize(struct qspinlock *a,
>                                                     struct qspinlock *b)
>     {
>         smp_mb();
>     }
>
> and then we'd document that *if* you need ordering guarantees between
>
>    spin_lock(a);
>    .. spin_is_locked/spin_unlock_wait(b) ..
>
> you have to have a
>
>     spin_lock_synchronize(a, b);
>
> in between.

So I think I favour the explicit barrier. But my 'problem' is that we
now have _two_ different scenarios in which we need to order two
different spinlocks.

The first is the RCpc vs RCsc spinlock situation (currently only on
PowerPC), where the spin_unlock() + spin_lock() 'barrier' is not
transitive.

And the second is this 'new' situation, where the store is unordered and
is not observable until a release, which is fundamentally so on PPC and
ARM64 but also possible due to lock implementation choices like with our
qspinlock, which makes it manifest even on x86.

Now, ideally we'd be able to use one barrier construct for both; but
while there is overlap, they're not the same. And I'd be somewhat
reluctant to issue superfluous smp_mb()s just because; it is an
expensive instruction.

Paul has smp_mb__after_unlock_lock() for the RCpc 'upgrade'. How about
something like:

        smp_mb__after_lock()

?
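
(For comparison, the RCpc 'upgrade' barrier was defined roughly like this at
the time; an smp_mb__after_lock() would presumably take a similar shape:)

/* roughly as in the RCU code of that era: only PowerPC needs the upgrade */
#ifdef CONFIG_PPC
#define smp_mb__after_unlock_lock()	smp_mb()	/* make unlock+lock fully ordered */
#else
#define smp_mb__after_unlock_lock()	do { } while (0)
#endif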


OTOH; even if we document this, it is something that is easy to forget
or miss. It is not like Documentation/memory-barriers.txt is in want of
more complexity.

Re: sem_lock() vs qspinlocks

Linus Torvalds-2
On Mon, May 23, 2016 at 5:25 AM, Peter Zijlstra <[hidden email]> wrote:
>
> Paul has smp_mb__after_unlock_lock() for the RCpc 'upgrade'. How about
> something like:
>
>         smp_mb__after_lock()

I'd much rather make the naming higher level. It's not necessarily
going to be a "mb", and while the problem is about smp, the primitives
it is synchronizing aren't actually smp-specific (ie you're
synchronizing a lock that is relevant on UP too).

So I'd just call it something like

        spin_lock_sync_after_lock();

because different locks might have different levels of serialization
(ie maybe a spinlock needs one thing, and a mutex needs another - if
we start worrying about ordering between spin_lock and
mutex_is_locked(), for example, or between mutex_lock() and
spin_is_locked()).

Hmm?

                     Linus

Re: sem_lock() vs qspinlocks

Peter Zijlstra-5
In reply to this post by Manfred Spraul
On Sat, May 21, 2016 at 03:49:20PM +0200, Manfred Spraul wrote:

> >I'm tempted to put that trailing smp_rmb() in spin_unlock_wait() too;
> >because I suspect the netfilter code is broken without it.
> >
> >And it seems intuitive to assume that if we return from unlock_wait() we
> >can indeed observe the critical section we waited on.

> Then !spin_is_locked() and spin_unlock_wait() would be different with
> regards to memory barriers.
> Would that really help?

We could fix that I think; something horrible like:

static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
{
        int locked;
        smp_mb();
        locked = atomic_read(&lock->val) & _Q_LOCKED_MASK;
        smp_acquire__after_ctrl_dep();
        return locked;
}

Which if used in a conditional like:

        spin_lock(A);
        if (spin_is_locked(B)) {
                spin_unlock(A);
                spin_lock(B);
                ...
        }

would still provide the ACQUIRE semantics required. The only difference
is that it would provide it to _both_ branches, which might be a little
more expensive.

> My old plan was to document the rules, and define a generic
> smp_acquire__after_spin_is_unlocked.
> https://lkml.org/lkml/2015/3/1/153

Yeah; I more or less forgot all that.

Now, I too think having the thing documented is good; _however_ I also
think having primitives that actually do what you assume them to do is a
good thing.

spin_unlock_wait() not actually serializing against the spin_unlock() is
really surprising and subtle.

