sem_lock() vs qspinlocks

sem_lock() vs qspinlocks

Davidlohr Bueso-5
Hi,

Giovanni ran into a pretty reproducible situation in which the libmicro benchmark[1]
shows a functional regression in sysv semaphores, on upstream kernels. Specifically
for the 'cascade_cond' and 'cascade_flock' programs, which exhibit hangs in libc's
semop() blocked waiting for zero. Alternatively, the following splats may appear:

[  692.991258] BUG: unable to handle kernel NULL pointer dereference (null)
[  692.992062] IP: [<ffffffff812a0a9f>] unmerge_queues+0x2f/0x70
[  692.992062] PGD 862fab067 PUD 858bbc067 PMD 0
[  692.992062] Oops: 0000 [#1] SMP
[  692.992062] Modules linked in: ...
[  692.992062] CPU: 18 PID: 7398 Comm: cascade_flock Tainted: G            E   4.6.0-juancho2-default+ #18
[  692.992062] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0030.R03.1405061547 05/06/2014
[  692.992062] task: ffff88084a7e9640 ti: ffff880854748000 task.ti: ffff880854748000
[  692.992062] RIP: 0010:[<ffffffff812a0a9f>]  [<ffffffff812a0a9f>] unmerge_queues+0x2f/0x70
[  692.992062] RSP: 0018:ffff88085474bce8  EFLAGS: 00010216
[  692.992062] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88086cc3d0d0
[  692.992062] RDX: ffff88086cc3d0d0 RSI: ffff88086cc3d0d0 RDI: ffff88086cc3d040
[  692.992062] RBP: ffff88085474bce8 R08: 0000000000000007 R09: ffff88086cc3d088
[  692.992062] R10: 0000000000000000 R11: 000000a1597ea64c R12: ffff88085474bd90
[  692.992062] R13: ffff88086cc3d040 R14: 0000000000000000 R15: 00000000ffffffff
[  692.992062] FS:  00007faa46a2d700(0000) GS:ffff88086e500000(0000) knlGS:0000000000000000
[  692.992062] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  692.992062] CR2: 0000000000000000 CR3: 0000000862faa000 CR4: 00000000001406e0
[  692.992062] Stack:
[  692.992062]  ffff88085474bf38 ffffffff812a2ac3 ffffffff810c3995 ffff88084a7e9640
[  692.992062]  0000000000000000 ffffffff81cb1f48 0000000000000002 ffff880400038000
[  692.992062]  ffff88084a7e9640 ffff88085474bd40 fffffffffffffffc ffff88085474bd40
[  692.992062] Call Trace:
[  692.992062]  [<ffffffff812a2ac3>] SYSC_semtimedop+0x833/0xc00
[  692.992062]  [<ffffffff810c3995>] ? __wake_up_common+0x55/0x90
[  692.992062]  [<ffffffff811f25c0>] ? kmem_cache_alloc+0x1e0/0x200
[  692.992062]  [<ffffffff81266b8b>] ? locks_alloc_lock+0x1b/0x70
[  692.992062]  [<ffffffff81266f33>] ? locks_insert_lock_ctx+0x93/0xa0
[  692.992062]  [<ffffffff81268594>] ? flock_lock_inode+0xf4/0x220
[  692.992062]  [<ffffffff81269cd7>] ? locks_lock_inode_wait+0x47/0x160
[  692.992062]  [<ffffffff811f25c0>] ? kmem_cache_alloc+0x1e0/0x200
[  692.992062]  [<ffffffff81266b8b>] ? locks_alloc_lock+0x1b/0x70
[  692.992062]  [<ffffffff81266d0f>] ? locks_free_lock+0x4f/0x60
[  692.992062]  [<ffffffff812a3340>] SyS_semop+0x10/0x20
[  692.992062]  [<ffffffff81639c32>] entry_SYSCALL_64_fastpath+0x1a/0xa4
[  692.992062] Code: 00 55 8b 47 7c 48 89 e5 85 c0 75 53 48 8b 4f 48 4c 8d 4f 48 4c 39 c9 48 8b 11 48 89 ce 75 08 eb 36 48 89 d1 48 89 c2 48 8b 41 28 <0f> b7 00 48 c1 e0 06 48 03 47 40 4c 8b 40 18 48 89 70 18 48 83
[  692.992062] RIP  [<ffffffff812a0a9f>] unmerge_queues+0x2f/0x70
[  692.992062]  RSP <ffff88085474bce8>
[  692.992062] CR2: 0000000000000000
[  693.882179] ---[ end trace 5605f108ab79cdb2 ]---

Or,

[  463.567641] BUG: unable to handle kernel paging request at fffffffffffffffa
[  463.576246] IP: [<ffffffff8126dcbf>] perform_atomic_semop.isra.5+0xcf/0x170
[  463.584553] PGD 1c0d067 PUD 1c0f067 PMD 0
[  463.590071] Oops: 0000 [#1] SMP
[  463.594667] Modules linked in: ...
[  463.664710] Supported: Yes
[  463.668682] CPU: 6 PID: 2912 Comm: cascade_cond Not tainted 4.4.3-29-default #1
[  463.677230] Hardware name: SGI.COM C2112-4GP3/X10DRT-P, BIOS 1.0b 04/07/2015
[  463.685588] task: ffff88105dba0b40 ti: ffff8808fc7e0000 task.ti: ffff8808fc7e0000
[  463.694366] RIP: 0010:[<ffffffff8126dcbf>]  [<ffffffff8126dcbf>] perform_atomic_semop.isra.5+0xcf/0x170
[  463.705084] RSP: 0018:ffff8808fc7e3c60  EFLAGS: 00010217
[  463.711610] RAX: 0000000000000000 RBX: ffff88085d22dae0 RCX: 000000005732f1e7
[  463.719952] RDX: fffffffffffffffa RSI: ffff88085d22dad0 RDI: ffff88085d22da80
[  463.728270] RBP: 0000000000000000 R08: 00000000fffffff7 R09: 0000000000000000
[  463.736561] R10: 0000000000000000 R11: 0000000000000206 R12: ffff88085d22da88
[  463.745001] R13: ffff88085d22dad0 R14: ffffffffffffffc0 R15: ffff88085d22da40
[  463.753450] FS:  00007f30fd9e5700(0000) GS:ffff88085fac0000(0000) knlGS:0000000000000000
[  463.762684] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  463.769731] CR2: fffffffffffffffa CR3: 000000017bc09000 CR4: 00000000001406e0
[  463.778039] Stack:
[  463.781130]  ffff8808fc7e3d50 ffff88085d22dad0 ffffffff8126dfe1 ffffffff4d226800
[  463.789704]  ffff88085d22da80 0000000000000001 ffff88085d22da88 0000000000000001
[  463.798254]  0000000000000001 ffff88085d22da40 ffff8808fc7e3d50 ffff8808fc7e3da0
[  463.806758] Call Trace:
[  463.810305]  [<ffffffff8126dfe1>] update_queue+0xa1/0x180
[  463.816706]  [<ffffffff8126eb95>] do_smart_update+0x45/0xf0
[  463.823276]  [<ffffffff8126f141>] SYSC_semtimedop+0x3d1/0xb00
[  463.830035]  [<ffffffff815cf66e>] entry_SYSCALL_64_fastpath+0x12/0x71
[  463.838608] DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x12/0x71
[  463.846331]
[  463.848747] Leftover inexact backtrace:
[  463.848747]
[  463.855853] Code: 80 00 00 81 f9 ff ff 00 00 0f 87 98 00 00 00 66 45 89 0b 48 83 c2 06 44 89 00 48 39 ea 72 99 48 83 ea 06 8b 4e 20 49 39 d2 77 16 <0f> b7 02 48 83 ea 06 48 c1 e0 06 48 03 07 49 39 d2 89 48 04 76
[  463.877668] RIP  [<ffffffff8126dcbf>] perform_atomic_semop.isra.5+0xcf/0x170
[  463.885725]  RSP <ffff8808fc7e3c60>
[  463.890145] CR2: fffffffffffffffa
[  463.894338] ---[ end trace 0b29cae12f0e401c ]---

From both splats I've reached the same conclusion: the pending operations array (sop,
struct sembuf) is getting corrupted. For the second splat, the faulting code in
perform_atomic_semop() is:

  2b:*  0f b7 02              movzwl (%rdx),%eax    <-- trapping instruction

              sma->sem_base[sop->sem_num].sempid = pid;

  2e:    48 83 ea 06           sub    $0x6,%rdx
  32:    48 c1 e0 06           shl    $0x6,%rax
  36:    48 03 07              add    (%rdi),%rax
  39:    49 39 d2              cmp    %rdx,%r10
  3c:    89 48 04              mov    %ecx,0x4(%rax)
  3f:    76                    .byte 0x76

libc's semop() calls here mainly distribute simple and complex ops fairly evenly, all
acting on a single, unique semaphore set, ie:

       semop(semid: 884736, tsops: 0x7fffd1567bc0, nsops: 1);
       semop(semid: 884736, tsops: 0x7fffd1567bc0, nsops: 2);

Given that this suggests a broken sem_lock(), I gave -next's c4b7bb08c295 (ipc/sem.c: Fix
complex_count vs. simple op race) a try, but unfortunately it does not fix the issue. In
fact the regression bisects down to the introduction of qspinlocks (as of v4.2), with:

1bf7067 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Considering how sem_lock() plays games with spin_is_locked() and spin_unlock_wait() to
enable finer-grained locking (as opposed to locking the whole set), the regression seems to
be introduced by the fact that spin_unlock_wait() with qspinlocks only checks the
least-significant byte (the locked byte), therefore ignoring pending waiters. Fair enough,
given that a simple write to that byte is enough to release the lock. However, this is
semantically different to what was previously done with ticket locks in that
spin_unlock_wait() will always observe all waiters by adding itself to the tail. For
sysvsems this can cause sem_wait_array() to miss a pending waiter on a sem->lock while a
thread is taking the global lock: the global-lock holder iterates over that specific
per-semaphore lock and sees it unlocked, and shortly thereafter the pending waiter takes
the already-iterated semaphore lock.
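
For reference, the locking pattern in ipc/sem.c that is at play here looks roughly like the
sketch below. This is a simplified illustration of the single-op fast path vs the
complex-op slow path, not the exact upstream code; sem_lock_sketch() and the elided
fallback are made up for illustration.

/*
 * Simplified sketch of the sysv sem locking scheme (illustrative only,
 * not the exact ipc/sem.c code).
 */
static int sem_lock_sketch(struct sem_array *sma, struct sembuf *sops, int nsops)
{
	struct sem *sem;

	if (nsops != 1) {
		/* Complex op: take the global lock, then wait for all per-sem locks. */
		spin_lock(&sma->sem_perm.lock);
		sem_wait_array(sma);	/* spin_unlock_wait() on every sem->lock */
		return -1;
	}

	/* Simple op: only take the lock of the one semaphore being operated on. */
	sem = sma->sem_base + sops->sem_num;
	spin_lock(&sem->lock);

	/* If a complex op holds (or is taking) the global lock, fall back to it. */
	if (spin_is_locked(&sma->sem_perm.lock)) {
		spin_unlock(&sem->lock);
		/* ... slow path: acquire the global lock instead ... */
	}
	return sops->sem_num;
}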

As such, the following restores the behavior of the ticket locks and 'fixes'
(or hides?) the bug in sems. Naturally incorrect approach:

@@ -290,7 +290,8 @@ static void sem_wait_array(struct sem_array *sma)

        for (i = 0; i < sma->sem_nsems; i++) {
                sem = sma->sem_base + i;
-               spin_unlock_wait(&sem->lock);
+               while (atomic_read(&sem->lock))
+                       cpu_relax();
        }
        ipc_smp_acquire__after_spin_is_unlocked();
 }

While there certainly are differences between the two lock implementations wrt
unlock_wait(), ultimately I agree that code should not rely on the semantics of spinlock
waiters -- and therefore sems need fixing. Note that lockref is of course the exception to
this, in how queued_spin_value_unlocked() is implemented. In addition, this makes me wonder
if queued_spin_is_locked() should then be:

- return atomic_read(&lock->val);
+ return atomic_read(&lock->val) & _Q_LOCKED_MASK;

And avoid considering pending waiters as locked.
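
(For context, the qspinlock word packs three fields, which is why a raw atomic_read() can
be non-zero even though the lock itself has been released. Roughly, and glossing over the
exact offsets in qspinlock_types.h; the SKETCH_* names below are made up:)

/* Rough layout of qspinlock->val, written as (tail, pending, locked) elsewhere. */
#define SKETCH_LOCKED_MASK	0x000000ffU	/* locked byte; a plain byte store releases it */
#define SKETCH_PENDING_MASK	0x0000ff00U	/* pending; the first waiter spins here        */
#define SKETCH_TAIL_MASK	0xffff0000U	/* tail; encodes the last queued waiter's CPU  */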

Thoughts?

Thanks,
Davidlohr

[1] https://hg.java.net/hg/libmicro~hg-repo

Re: sem_lock() vs qspinlocks

Peter Zijlstra-5
On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
>  In addition, this makes me wonder if queued_spin_is_locked() should then be:
>
> - return atomic_read(&lock->val);
> + return atomic_read(&lock->val) & _Q_LOCKED_MASK;
>
> And avoid considering pending waiters as locked.

Probably

Re: sem_lock() vs qspinlocks

Peter Zijlstra-5
In reply to this post by Davidlohr Bueso-5
On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:

> However, this is semantically different to
> what was previously done with ticket locks in that spin_unlock_wait() will always observe
> all waiters by adding itself to the tail.


static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
{
        __ticket_t head = READ_ONCE(lock->tickets.head);

        for (;;) {
                struct __raw_tickets tmp = READ_ONCE(lock->tickets);
                /*
                 * We need to check "unlocked" in a loop, tmp.head == head
                 * can be false positive because of overflow.
                 */
                if (__tickets_equal(tmp.head, tmp.tail) ||
                                !__tickets_equal(tmp.head, head))
                        break;

                cpu_relax();
        }
}


I'm not seeing that (although I think I agreed yesterday on IRC). Note
how we observe the head and then loop until either the lock is unlocked
(head == tail) or simply head isn't what it used to be.

And head is the lock holder end of the queue; see arch_spin_unlock()
incrementing it.

So the ticket lock too should only wait for the current lock holder to
go away, not any longer.
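
Schematically (a sketch, not the exact x86 code), the unlock side is just:

/* Ticket lock sketch: releasing the lock only bumps head. */
typedef struct {
	unsigned short head;	/* ticket being served: the lock holder end      */
	unsigned short tail;	/* next ticket to hand out: waiters enqueue here */
} ticket_lock_sketch_t;

static inline void ticket_unlock_sketch(ticket_lock_sketch_t *lock)
{
	/*
	 * Only the lock holder writes head, so a plain increment with
	 * release semantics hands the lock to the next ticket; nothing
	 * is ever added to the queue on the unlock/wait side.
	 */
	smp_store_release(&lock->head, lock->head + 1);
}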

Re: sem_lock() vs qspinlocks

Peter Zijlstra-5
In reply to this post by Davidlohr Bueso-5
On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:

> [1] https://hg.java.net/hg/libmicro~hg-repo

So far I've managed to install mercurial and clone this thing, but it
doesn't actually build :/

I'll try harder..

Re: sem_lock() vs qspinlocks

Peter Zijlstra-5
On Fri, May 20, 2016 at 10:13:15AM +0200, Peter Zijlstra wrote:
> On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
>
> > [1] https://hg.java.net/hg/libmicro~hg-repo
>
> So far I've managed to install mercurial and clone this thing, but it
> doesn't actually build :/
>
> I'll try harder..

The stuff needs this..

---
diff -r 7dd95b416c3c Makefile.com
--- a/Makefile.com Thu Jul 26 12:56:00 2012 -0700
+++ b/Makefile.com Fri May 20 10:18:08 2016 +0200
@@ -107,7 +107,7 @@
  echo "char compiler_version[] = \""`$(COMPILER_VERSION_CMD)`"\";" > tattle.h
  echo "char CC[] = \""$(CC)"\";" >> tattle.h
  echo "char extra_compiler_flags[] = \""$(extra_CFLAGS)"\";" >> tattle.h
- $(CC) -o tattle $(CFLAGS) -I. ../tattle.c libmicro.a -lrt -lm
+ $(CC) -o tattle $(CFLAGS) -I. ../tattle.c libmicro.a -lrt -lm -lpthread
 
 $(ELIDED_BENCHMARKS): ../elided.c
  $(CC) -o $(@) ../elided.c

Re: sem_lock() vs qspinlocks

Peter Zijlstra-5
In reply to this post by Davidlohr Bueso-5
On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> Specifically
> for the 'cascade_cond' and 'cascade_flock' programs, which exhibit hangs in libc's
> semop() blocked waiting for zero.

OK; so I've been running:

while :; do
  bin/cascade_cond -E -C 200 -L -S -W -T 200  -I 2000000 ;
  bin/cascade_flock -E -C 200 -L -S -W -P 200 -I 5000000 ;
done

for a few minutes now and it's not stuck and my machine didn't splat.

Am I not doing it right?

Re: sem_lock() vs qspinlocks

Peter Zijlstra-5
On Fri, May 20, 2016 at 10:30:08AM +0200, Peter Zijlstra wrote:

> On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> > Specifically
> > for the 'cascade_cond' and 'cascade_flock' programs, which exhibit hangs in libc's
> > semop() blocked waiting for zero.
>
> OK; so I've been running:
>
> while :; do
>   bin/cascade_cond -E -C 200 -L -S -W -T 200  -I 2000000 ;
>   bin/cascade_flock -E -C 200 -L -S -W -P 200 -I 5000000 ;
> done
>
> for a few minutes now and its not stuck and my machine didn't splat.
>
> Am I not doing it right?

Hooray, it went *bang*.. OK, lemme have a harder look at this semaphore
code.

Re: sem_lock() vs qspinlocks

Giovanni Gherdovich
In reply to this post by Peter Zijlstra-5
On Fri, 2016-05-20 at 10:18 +0200, Peter Zijlstra wrote:

> On Fri, May 20, 2016 at 10:13:15AM +0200, Peter Zijlstra wrote:
> > On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> >
> > > [1] https://hg.java.net/hg/libmicro~hg-repo
> >
> > So far I've managed to install mercurial and clone this thing, but
> > it
> > doesn't actually build :/
> >
> > I'll try harder..
>
> The stuff needs this..
>
> ---
> diff -r 7dd95b416c3c Makefile.com
> --- a/Makefile.com Thu Jul 26 12:56:00 2012 -0700
> +++ b/Makefile.com Fri May 20 10:18:08 2016 +0200
> @@ -107,7 +107,7 @@
>   echo "char compiler_version[] = \""`$(COMPILER_VERSION_CMD)`"\";" > tattle.h
>   echo "char CC[] = \""$(CC)"\";" >> tattle.h
>   echo "char extra_compiler_flags[] = \""$(extra_CFLAGS)"\";" >> tattle.h
> - $(CC) -o tattle $(CFLAGS) -I. ../tattle.c libmicro.a -lrt -lm
> + $(CC) -o tattle $(CFLAGS) -I. ../tattle.c libmicro.a -lrt -lm -lpthread
>  
>  $(ELIDED_BENCHMARKS): ../elided.c
>   $(CC) -o $(@) ../elided.c
>
Hello Peter,


right, we forgot to mention that the libmicro Makefile is broken;
sorry for the hassle. At the bottom of this message you'll find the
script I use to reproduce the problem; you might have to modify the
variable $CASCADE_PATH.

The script needs an argument, which is the offending benchmark to run,
like

$ ./run_cascade.sh c_flock_200

or

$ ./run_cascade.sh c_cond_10

This runs the benchmark 10 times, and kills it if it lasts too long. I
get around 3 hangs per invocation, and on the affected kernels (4.2 or
later) around one panic per invocation of this reproducer.

The .config file with which you build the kernel seems to affect that,
too; I attach 2 config files:

- config.no-bug
- config.with-bug

The results I report (hangs & panics) happen if I compile with
config.with-bug, but disappear with config.no-bug. If you take
config.no-bug as reference, config.with-bug introduces

    CONFIG_MFD_SYSCON=y
    CONFIG_NO_HZ_IDLE=y
    CONFIG_QUEUED_SPINLOCK=y
    CONFIG_REGMAP=y
    CONFIG_REGMAP_MMIO=y
    CONFIG_TICK_CPU_ACCOUNTING=y

and removes

    CONFIG_BLK_DEV_DM=m
    CONFIG_BLK_DEV_DM_BUILTIN=y
    CONFIG_CONTEXT_TRACKING=y
    CONFIG_DM_UEVENT=y
    CONFIG_NO_HZ_FULL=y
    CONFIG_PAGE_EXTENSION=y
    CONFIG_PAGE_OWNER=y
    CONFIG_PARAVIRT_SPINLOCKS=y
    CONFIG_PERSISTENT_KEYRINGS=y
    CONFIG_RCU_NOCB_CPU=y
    CONFIG_RCU_NOCB_CPU_NONE=y
    CONFIG_RCU_USER_QS=y
    CONFIG_STAGING=y
    CONFIG_UNINLINE_SPIN_UNLOCK=y
    CONFIG_VIRT_CPU_ACCOUNTING=y
    CONFIG_VIRT_CPU_ACCOUNTING_GEN=y

Most of those params might be irrelevant, but some of them must be what
triggers the problem. Both configs are taken from /proc/config.gz on a
running system. FWIW my test machine has 48 Haswell cores and 64GB of RAM.


Giovanni
SUSE Labs


----------- run_cascade.sh -------------------------------------
#!/bin/bash

TESTCASE=$1
CASCADE_PATH="libmicro-1-installed/bin-x86_64"

case $TESTCASE in
    c_flock_200)
        BINNAME="cascade_flock"
        COMMAND="$CASCADE_PATH/cascade_flock -E -D 60000 -L -S -W \
                                             -N c_flock_200 \
                                             -P 200 -I 5000000"
        # c_flock_200 is supposed to last 60 seconds.
        SLEEPTIME=70
        ;;
    c_cond_10)
        BINNAME="cascade_cond"
        COMMAND="$CASCADE_PATH/cascade_cond -E -C 2000 -L -S -W \
                                            -N c_cond_10 \
                                            -T 10 -I 3000"
        # c_cond_10 terminates in less than 1 second.
        SLEEPTIME=5
        ;;
    *)
        echo "Unknown test case" >&2
        exit 1
        ;;
esac

ERRORS=0
uname -a
for i in {1..10} ; do
    {
        eval $COMMAND &
    } >/dev/null 2>&1
    sleep $SLEEPTIME
    if pidof $BINNAME >/dev/null ; then
        echo Run \#$i: $TESTCASE hangs
        for PID in $(pidof $BINNAME) ; do
            head -1 /proc/$PID/stack
        done | sort | uniq -c
        ERRORS=$((ERRORS+1))
        killall $BINNAME
    else
        echo Run \#$i: $TESTCASE exits successfully
    fi
done
echo $TESTCASE hanged $ERRORS times.
----------------------------------------------------------------


Re: sem_lock() vs qspinlocks

Peter Zijlstra-5
On Fri, May 20, 2016 at 11:07:49AM +0200, Giovanni Gherdovich wrote:

> ----------- run_cascade.sh -------------------------------------
> #!/bin/bash
>
> TESTCASE=$1
> CASCADE_PATH="libmicro-1-installed/bin-x86_64"
>
> case $TESTCASE in
>     c_flock_200)
>         BINNAME="cascade_flock"
>         COMMAND="$CASCADE_PATH/cascade_flock -E -D 60000 -L -S -W \
>                                              -N c_flock_200 \
>                                              -P 200 -I 5000000"
>         # c_flock_200 is supposed to last 60 seconds.
>         SLEEPTIME=70
>         ;;
>     c_cond_10)
>         BINNAME="cascade_cond"
>         COMMAND="$CASCADE_PATH/cascade_cond -E -C 2000 -L -S -W \
>                                             -N c_cond_10 \
>                                             -T 10 -I 3000"
>         # c_cond_10 terminates in less than 1 second.
>         SLEEPTIME=5
>         ;;
>     *)
>         echo "Unknown test case" >&2
>         exit 1
>         ;;
> esac
>
> ERRORS=0
> uname -a
> for i in {1..10} ; do
>     {
>         eval $COMMAND &
>     } >/dev/null 2>&1
>     sleep $SLEEPTIME
>     if pidof $BINNAME >/dev/null ; then
>         echo Run \#$i: $TESTCASE hangs
>         for PID in $(pidof $BINNAME) ; do
>             head -1 /proc/$PID/stack
>         done | sort | uniq -c
>         ERRORS=$((ERRORS+1))
>         killall $BINNAME
>     else
>         echo Run \#$i: $TESTCASE exits successfully
>     fi
> done
> echo $TESTCASE hanged $ERRORS times.
> ----------------------------------------------------------------

Thanks, that's a much nicer script than mine ;-)

Re: sem_lock() vs qspinlocks

Ingo Molnar
In reply to this post by Peter Zijlstra-5

* Peter Zijlstra <[hidden email]> wrote:

> On Fri, May 20, 2016 at 10:30:08AM +0200, Peter Zijlstra wrote:
> > On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> > > Specifically
> > > for the 'cascade_cond' and 'cascade_flock' programs, which exhibit hangs in libc's
> > > semop() blocked waiting for zero.
> >
> > OK; so I've been running:
> >
> > while :; do
> >   bin/cascade_cond -E -C 200 -L -S -W -T 200  -I 2000000 ;
> >   bin/cascade_flock -E -C 200 -L -S -W -P 200 -I 5000000 ;
> > done
> >
> > for a few minutes now and its not stuck and my machine didn't splat.
> >
> > Am I not doing it right?
>
> Hooray, it went *bang*..

I suspect a required step was to post about failure to reproduce!

        Ingo

Re: sem_lock() vs qspinlocks

Mel Gorman-4
On Fri, May 20, 2016 at 12:09:01PM +0200, Ingo Molnar wrote:

>
> * Peter Zijlstra <[hidden email]> wrote:
>
> > On Fri, May 20, 2016 at 10:30:08AM +0200, Peter Zijlstra wrote:
> > > On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> > > > Specifically
> > > > for the 'cascade_cond' and 'cascade_flock' programs, which exhibit hangs in libc's
> > > > semop() blocked waiting for zero.
> > >
> > > OK; so I've been running:
> > >
> > > while :; do
> > >   bin/cascade_cond -E -C 200 -L -S -W -T 200  -I 2000000 ;
> > >   bin/cascade_flock -E -C 200 -L -S -W -P 200 -I 5000000 ;
> > > done
> > >
> > > for a few minutes now and its not stuck and my machine didn't splat.
> > >
> > > Am I not doing it right?
> >
> > Hooray, it went *bang*..
>
> I suspect a required step was to post about failure to reproduce!
>

The bug is known to be intermittent, and not all machines can reproduce
it. A failure to reproduce is not necessarily a methodology error; it can
simply be a function of luck.

--
Mel Gorman
SUSE Labs

Re: sem_lock() vs qspinlocks

Peter Zijlstra-5
In reply to this post by Davidlohr Bueso-5
On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:

> As such, the following restores the behavior of the ticket locks and 'fixes'
> (or hides?) the bug in sems. Naturally incorrect approach:
>
> @@ -290,7 +290,8 @@ static void sem_wait_array(struct sem_array *sma)
>
>         for (i = 0; i < sma->sem_nsems; i++) {
>                 sem = sma->sem_base + i;
> -               spin_unlock_wait(&sem->lock);
> +               while (atomic_read(&sem->lock))
> +                       cpu_relax();
>         }
>         ipc_smp_acquire__after_spin_is_unlocked();
> }

The actual bug is clear_pending_set_locked() not having acquire
semantics. And the above 'fixes' things because it will observe the old
pending bit or the locked bit, so it doesn't matter if the store
flipping them is delayed.

The comment in queued_spin_lock_slowpath() above the smp_cond_acquire()
states that that acquire is sufficient, but this is incorrect in the
face of spin_is_locked()/spin_unlock_wait() usage only looking at the
lock byte.

The problem is that the clear_pending_set_locked() is an unordered
store, therefore this store can be delayed until no later than
spin_unlock() (which orders against it due to the address dependency).
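
For reference, with the byte-sized pending field that *,1,0 -> *,0,1 transition is just a
plain byte store, roughly like the below (from memory, modulo the exact struct overlay in
qspinlock.c):

static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
{
	struct __qspinlock *l = (void *)lock;

	/* Clears the pending byte and sets the locked byte in one unordered store. */
	WRITE_ONCE(l->locked_pending, _Q_LOCKED_VAL);
}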

This opens numerous races; for example:

        ipc_lock_object(&sma->sem_perm);
        sem_wait_array(sma);

                                false   -> spin_is_locked(&sma->sem_perm.lock)

is entirely possible, because sem_wait_array() consists of pure reads,
so the store can pass all that, even on x86.

The below 'hack' seems to solve the problem.

_However_ this also means the atomic_cmpxchg_relaxed() in the locked:
branch is equally wrong -- although not visible on x86. And note that
atomic_cmpxchg_acquire() would not in fact be sufficient either, since
the acquire is on the LOAD not the STORE of the LL/SC.

I need a break of sorts, because after twisting my head around the sem
code and then the qspinlock code I'm wrecked. I'll try and make a proper
patch if people can indeed confirm my thinking here.

---
 kernel/locking/qspinlock.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index ce2f75e32ae1..348e172e774f 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -366,6 +366,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 * *,1,0 -> *,0,1
 	 */
 	clear_pending_set_locked(lock);
+	smp_mb();
 	return;
 
 	/*

Re: sem_lock() vs qspinlocks

Boqun Feng
Hi Peter,

On Fri, May 20, 2016 at 01:58:19PM +0200, Peter Zijlstra wrote:

> On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> > As such, the following restores the behavior of the ticket locks and 'fixes'
> > (or hides?) the bug in sems. Naturally incorrect approach:
> >
> > @@ -290,7 +290,8 @@ static void sem_wait_array(struct sem_array *sma)
> >
> > for (i = 0; i < sma->sem_nsems; i++) {
> > sem = sma->sem_base + i;
> > -               spin_unlock_wait(&sem->lock);
> > +               while (atomic_read(&sem->lock))
> > +                       cpu_relax();
> > }
> > ipc_smp_acquire__after_spin_is_unlocked();
> > }
>
> The actual bug is clear_pending_set_locked() not having acquire
> semantics. And the above 'fixes' things because it will observe the old
> pending bit or the locked bit, so it doesn't matter if the store
> flipping them is delayed.
>
> The comment in queued_spin_lock_slowpath() above the smp_cond_acquire()
> states that that acquire is sufficient, but this is incorrect in the
> face of spin_is_locked()/spin_unlock_wait() usage only looking at the
> lock byte.
>
> The problem is that the clear_pending_set_locked() is an unordered
> store, therefore this store can be delayed until no later than
> spin_unlock() (which orders against it due to the address dependency).
>
> This opens numerous races; for example:
>
> ipc_lock_object(&sma->sem_perm);
> sem_wait_array(sma);
>
> false   -> spin_is_locked(&sma->sem_perm.lock)
>
> is entirely possible, because sem_wait_array() consists of pure reads,
> so the store can pass all that, even on x86.
>
> The below 'hack' seems to solve the problem.
>
> _However_ this also means the atomic_cmpxchg_relaxed() in the locked:
> branch is equally wrong -- although not visible on x86. And note that
> atomic_cmpxchg_acquire() would not in fact be sufficient either, since
> the acquire is on the LOAD not the STORE of the LL/SC.
>
> I need a break of sorts, because after twisting my head around the sem
> code and then the qspinlock code I'm wrecked. I'll try and make a proper
> patch if people can indeed confirm my thinking here.
>
I think your analysis is right, however, the problem only exists if we
have the following use pattern, right?

        CPU 0                           CPU 1
        ====================            ==================
        spin_lock(A);                   spin_lock(B);
        spin_unlock_wait(B);            spin_unlock_wait(A);
        do_something();                 do_something();

, which ends up CPU 0 and 1 both running do_something(). And actually
this can be simply fixed by add smp_mb() between spin_lock() and
spin_unlock_wait() on both CPU, or add an smp_mb() in spin_unlock_wait()
as PPC does in 51d7d5205d338 "powerpc: Add smp_mb() to arch_spin_is_locked()".
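
(From memory, that powerpc change boils down to something like the below; not quoted
verbatim from the tree:)

static inline int arch_spin_is_locked(arch_spinlock_t *lock)
{
	/* Order the prior lock acquisition against the read of the other lock. */
	smp_mb();
	return !arch_spin_value_unlocked(*lock);
}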

So if relaxed/acquire atomics and clear_pending_set_locked() work fine
in other situations, a proper fix would be fixing the
spin_is_locked()/spin_unlock_wait() or their users?

Regards,
Boqun

> ---
>  kernel/locking/qspinlock.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index ce2f75e32ae1..348e172e774f 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -366,6 +366,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>   * *,1,0 -> *,0,1
>   */
>   clear_pending_set_locked(lock);
> + smp_mb();
>   return;
>  
>   /*


Re: sem_lock() vs qspinlocks

Davidlohr Bueso-5
In reply to this post by Peter Zijlstra-5
On Fri, 20 May 2016, Peter Zijlstra wrote:

>On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
>>  In addition, this makes me wonder if queued_spin_is_locked() should then be:
>>
>> - return atomic_read(&lock->val);
>> + return atomic_read(&lock->val) & _Q_LOCKED_MASK;
>>
>> And avoid considering pending waiters as locked.
>
>Probably

Similarly, and I know you hate it, but afaict, then semantically
queued_spin_is_contended() ought to be:

-       return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
+       return atomic_read(&lock->val);

Thanks,
Davidlohr

Re: sem_lock() vs qspinlocks

Peter Zijlstra-5
On Fri, May 20, 2016 at 08:00:49AM -0700, Davidlohr Bueso wrote:

> On Fri, 20 May 2016, Peter Zijlstra wrote:
>
> >On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> >> In addition, this makes me wonder if queued_spin_is_locked() should then be:
> >>
> >>- return atomic_read(&lock->val);
> >>+ return atomic_read(&lock->val) & _Q_LOCKED_MASK;
> >>
> >>And avoid considering pending waiters as locked.
> >
> >Probably
>
> Similarly, and I know you hate it, but afaict, then semantically
> queued_spin_is_contended() ought to be:
>
> -       return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
> +       return atomic_read(&lock->val);

Nah, that would make it return true for (0,0,1), ie. uncontended locked.



Re: sem_lock() vs qspinlocks

Peter Zijlstra-5
In reply to this post by Boqun Feng
On Fri, May 20, 2016 at 10:05:33PM +0800, Boqun Feng wrote:

> On Fri, May 20, 2016 at 01:58:19PM +0200, Peter Zijlstra wrote:
> > On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> > > As such, the following restores the behavior of the ticket locks and 'fixes'
> > > (or hides?) the bug in sems. Naturally incorrect approach:
> > >
> > > @@ -290,7 +290,8 @@ static void sem_wait_array(struct sem_array *sma)
> > >
> > > for (i = 0; i < sma->sem_nsems; i++) {
> > > sem = sma->sem_base + i;
> > > -               spin_unlock_wait(&sem->lock);
> > > +               while (atomic_read(&sem->lock))
> > > +                       cpu_relax();
> > > }
> > > ipc_smp_acquire__after_spin_is_unlocked();
> > > }
> >
> > The actual bug is clear_pending_set_locked() not having acquire
> > semantics. And the above 'fixes' things because it will observe the old
> > pending bit or the locked bit, so it doesn't matter if the store
> > flipping them is delayed.
> >
> > The comment in queued_spin_lock_slowpath() above the smp_cond_acquire()
> > states that that acquire is sufficient, but this is incorrect in the
> > face of spin_is_locked()/spin_unlock_wait() usage only looking at the
> > lock byte.
> >
> > The problem is that the clear_pending_set_locked() is an unordered
> > store, therefore this store can be delayed until no later than
> > spin_unlock() (which orders against it due to the address dependency).
> >
> > This opens numerous races; for example:
> >
> > ipc_lock_object(&sma->sem_perm);
> > sem_wait_array(sma);
> >
> > false   -> spin_is_locked(&sma->sem_perm.lock)
> >
> > is entirely possible, because sem_wait_array() consists of pure reads,
> > so the store can pass all that, even on x86.
> >
> > The below 'hack' seems to solve the problem.
> >
> > _However_ this also means the atomic_cmpxchg_relaxed() in the locked:
> > branch is equally wrong -- although not visible on x86. And note that
> > atomic_cmpxchg_acquire() would not in fact be sufficient either, since
> > the acquire is on the LOAD not the STORE of the LL/SC.
> >
> > I need a break of sorts, because after twisting my head around the sem
> > code and then the qspinlock code I'm wrecked. I'll try and make a proper
> > patch if people can indeed confirm my thinking here.
> >
>
> I think your analysis is right, however, the problem only exists if we
> have the following use pattern, right?
>
>         CPU 0                           CPU 1
>         ====================            ==================
>         spin_lock(A);                   spin_lock(B);
>         spin_unlock_wait(B);            spin_unlock_wait(A);
>         do_something();                 do_something();

More or less yes. The semaphore code is like:

        spin_lock(A)                    spin_lock(B)
        spin_unlock_wait(B)             spin_is_locked(A)

which shows that both spin_is_locked() and spin_unlock_wait() are in the
same class.

> , which ends up CPU 0 and 1 both running do_something(). And actually
> this can be simply fixed by add smp_mb() between spin_lock() and
> spin_unlock_wait() on both CPU, or add an smp_mb() in spin_unlock_wait()
> as PPC does in 51d7d5205d338 "powerpc: Add smp_mb() to arch_spin_is_locked()".

Right and arm64 does in d86b8da04dfa. Curiously you only fixed
spin_is_locked() and Will only fixed spin_unlock_wait, while AFAIU we
need to have _BOTH_ fixed.

Now looking at the PPC code, spin_unlock_wait() as per
arch/powerpc/lib/locks.c actually does include the extra smp_mb().

> So if relaxed/acquire atomics and clear_pending_set_locked() work fine
> in other situations, a proper fix would be fixing the
> spin_is_locked()/spin_unlock_wait() or their users?

Right; the relaxed stores work fine for the 'regular' mutual exclusive
critical section usage of locks. And yes, I think only the case you
outlined can care about it.

Let me write a patch..

Re: sem_lock() vs qspinlocks

Davidlohr Bueso-5
In reply to this post by Peter Zijlstra-5
On Fri, 20 May 2016, Peter Zijlstra wrote:

>On Fri, May 20, 2016 at 08:00:49AM -0700, Davidlohr Bueso wrote:
>> On Fri, 20 May 2016, Peter Zijlstra wrote:
>>
>> >On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
>> >> In addition, this makes me wonder if queued_spin_is_locked() should then be:
>> >>
>> >>- return atomic_read(&lock->val);
>> >>+ return atomic_read(&lock->val) & _Q_LOCKED_MASK;
>> >>
>> >>And avoid considering pending waiters as locked.
>> >
>> >Probably
>>
>> Similarly, and I know you hate it, but afaict, then semantically
>> queued_spin_is_contended() ought to be:
>>
>> -       return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
>> +       return atomic_read(&lock->val);
>
>Nah, that would make it return true for (0,0,1), ie. uncontended locked.

Right, and we want:

(*, 1, 1)
(*, 1, 0)
(n, 0, 0)

I may be missing some combinations, it's still early.

Re: sem_lock() vs qspinlocks

Peter Zijlstra-5
In reply to this post by Peter Zijlstra-5
On Fri, May 20, 2016 at 05:05:05PM +0200, Peter Zijlstra wrote:

> On Fri, May 20, 2016 at 08:00:49AM -0700, Davidlohr Bueso wrote:
> > On Fri, 20 May 2016, Peter Zijlstra wrote:
> >
> > >On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> > >> In addition, this makes me wonder if queued_spin_is_locked() should then be:
> > >>
> > >>- return atomic_read(&lock->val);
> > >>+ return atomic_read(&lock->val) & _Q_LOCKED_MASK;
> > >>
> > >>And avoid considering pending waiters as locked.
> > >
> > >Probably
> >
> > Similarly, and I know you hate it, but afaict, then semantically
> > queued_spin_is_contended() ought to be:
> >
> > -       return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
> > +       return atomic_read(&lock->val);
>
> Nah, that would make it return true for (0,0,1), ie. uncontended locked.

FWIW, the only usage of spin_is_contended() should be for lock breaking,
see spin_needbreak().

This also means that

  #define spin_is_contended(l) (false)

is a valid implementation, where the only down-side is worse latency.

This is done (together with GENERIC_LOCKBREAK) to allow trivial
test-and-set spinlock implementations, as these cannot tell whether the
lock is contended.
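
For illustration, a lock-break loop typically looks like the sketch below (a generic,
made-up example; do_one() and struct item are hypothetical):

static void process_items(spinlock_t *lock, struct list_head *items)
{
	spin_lock(lock);
	while (!list_empty(items)) {
		struct item *it = list_first_entry(items, struct item, node);

		list_del(&it->node);
		do_one(it);

		/* Briefly drop a contended lock so a waiter (or preemption) can get in. */
		if (need_resched() || spin_needbreak(lock)) {
			spin_unlock(lock);
			cond_resched();
			spin_lock(lock);
		}
	}
	spin_unlock(lock);
}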

Re: sem_lock() vs qspinlocks

Peter Zijlstra-5
In reply to this post by Peter Zijlstra-5
On Fri, May 20, 2016 at 05:21:49PM +0200, Peter Zijlstra wrote:

> Let me write a patch..

OK, something like the below then.. lemme go build that and verify that it,
too, fixes things.

---
Subject: locking,qspinlock: Fix spin_is_locked() and spin_unlock_wait()

Similar to commits:

  51d7d5205d33 ("powerpc: Add smp_mb() to arch_spin_is_locked()")
  d86b8da04dfa ("arm64: spinlock: serialise spin_unlock_wait against concurrent lockers")

qspinlock suffers from the fact that the _Q_LOCKED_VAL store is
unordered inside the ACQUIRE of the lock.

And while this is not a problem for the regular mutual exclusive
critical section usage of spinlocks, it breaks creative locking like:

        spin_lock(A)                    spin_lock(B)
        spin_unlock_wait(B)             if (!spin_is_locked(A))
        do_something()                  do_something()

In that both CPUs can end up running do_something() at the same time,
because our _Q_LOCKED_VAL store can drop past the spin_unlock_wait() /
spin_is_locked() loads (even on x86!!).

To avoid making the normal case slower, add smp_mb()s to the less used
spin_unlock_wait() / spin_is_locked() side of things to avoid this
problem.

Reported-by: Davidlohr Bueso <[hidden email]>
Reported-by: Giovanni Gherdovich <[hidden email]>
Signed-off-by: Peter Zijlstra (Intel) <[hidden email]>
---
 include/asm-generic/qspinlock.h | 27 ++++++++++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index 35a52a880b2f..6bd05700d8c9 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -28,7 +28,30 @@
  */
 static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
 {
-	return atomic_read(&lock->val);
+	/*
+	 * queued_spin_lock_slowpath() can ACQUIRE the lock before
+	 * issuing the unordered store that sets _Q_LOCKED_VAL.
+	 *
+	 * See both smp_cond_acquire() sites for more detail.
+	 *
+	 * This however means that in code like:
+	 *
+	 *   spin_lock(A)		spin_lock(B)
+	 *   spin_unlock_wait(B)	spin_is_locked(A)
+	 *   do_something()		do_something()
+	 *
+	 * Both CPUs can end up running do_something() because the store
+	 * setting _Q_LOCKED_VAL will pass through the loads in
+	 * spin_unlock_wait() and/or spin_is_locked().
+	 *
+	 * Avoid this by issuing a full memory barrier between the spin_lock()
+	 * and the loads in spin_unlock_wait() and spin_is_locked().
+	 *
+	 * Note that regular mutual exclusion doesn't care about this
+	 * delayed store.
+	 */
+	smp_mb();
+	return atomic_read(&lock->val) & _Q_LOCKED_MASK;
 }
 
 /**
@@ -108,6 +131,8 @@ static __always_inline void queued_spin_unlock(struct qspinlock *lock)
  */
 static inline void queued_spin_unlock_wait(struct qspinlock *lock)
 {
+	/* See queued_spin_is_locked() */
+	smp_mb();
 	while (atomic_read(&lock->val) & _Q_LOCKED_MASK)
 		cpu_relax();
 }

Re: sem_lock() vs qspinlocks

Davidlohr Bueso-5
In reply to this post by Peter Zijlstra-5
On Fri, 20 May 2016, Peter Zijlstra wrote:

>The problem is that the clear_pending_set_locked() is an unordered
>store, therefore this store can be delayed until no later than
>spin_unlock() (which orders against it due to the address dependency).
>
>This opens numerous races; for example:
>
>        ipc_lock_object(&sma->sem_perm);
>        sem_wait_array(sma);
>
>                                false   -> spin_is_locked(&sma->sem_perm.lock)
>
>is entirely possible, because sem_wait_array() consists of pure reads,
>so the store can pass all that, even on x86.

I had pondered the unordered stores in clear_pending_set_locked() for arm, for example,
but I _certainly_ missed this for x86 inside the ACQUIRE region.

Thanks,
Davidlohr