[RFC PATCH 0/3] x86,smp: make ticket spinlock proportional backoff w/ auto tuning

Re: [RFC PATCH 3/3] x86,smp: auto tune spinlock backoff delay factor

Rik van Riel
On 12/21/2012 10:49 PM, Steven Rostedt wrote:
> On Fri, Dec 21, 2012 at 09:51:35PM -0500, Rik van Riel wrote:

>> However, since spinlock contention should not be the
>> usual state, and all a scalable lock does is make sure
>> that N+1 CPUs do not perform worse than N CPUs, using
>> scalable locks is a stop-gap measure.
>>
>> I believe a stop-gap measure should be kept as simple as
>> we can. I am willing to consider moving to a per-lock
>> delay factor if we can figure out an easy way to do it,
>> but I would like to avoid too much extra complexity...
>
> Rik,
>
> I like your solution. It's rather simple and simple solutions tend to
> end up being the closest to optimal. The more complex a solution gets,
> the more it starts chasing fireflies.

> Anyway, I'd like to see this code tested, and more benchmarks run
> against it.

Absolutely.  I would love to see if this code actually
causes regressions anywhere.

It is simple enough that I suspect it will not, but there
really is only one way to find out.

The more people test this with different workloads on
different SMP systems, the better.


Re: [RFC PATCH 1/3] x86,smp: move waiting on contended lock out of line

Michel Lespinasse
In reply to this post by Rik van Riel
On Fri, Dec 21, 2012 at 3:50 PM, Rik van Riel <[hidden email]> wrote:

> diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
> index 33692ea..2a45eb0 100644
> --- a/arch/x86/include/asm/spinlock.h
> +++ b/arch/x86/include/asm/spinlock.h
> @@ -34,6 +34,8 @@
>  # define UNLOCK_LOCK_PREFIX
>  #endif
>
> +extern void ticket_spin_lock_wait(arch_spinlock_t *, struct __raw_tickets);
> +
>  /*
>   * Ticket locks are conceptually two parts, one indicating the current head of
>   * the queue, and the other indicating the current tail. The lock is acquired
> @@ -53,12 +55,11 @@ static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
>
>         inc = xadd(&lock->tickets, inc);
>
> -       for (;;) {
> -               if (inc.head == inc.tail)
> -                       break;
> -               cpu_relax();
> -               inc.head = ACCESS_ONCE(lock->tickets.head);
> -       }
> +       if (inc.head == inc.tail)
> +               goto out;
> +
> +       ticket_spin_lock_wait(lock, inc);
> + out:

why not just:

if (inc.head != inc.tail)
  ticket_spin_lock_wait(lock, inc)

>         barrier();              /* make sure nothing creeps before the lock is taken */
>  }
>
> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
> index 48d2b7d..20da354 100644
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -113,6 +113,20 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
>  static bool smp_no_nmi_ipi = false;
>
>  /*
> + * Wait on a congested ticket spinlock.
> + */
> +void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
> +{
> +       for (;;) {
> +               cpu_relax();
> +               inc.head = ACCESS_ONCE(lock->tickets.head);
> +
> +               if (inc.head == inc.tail)
> +                       break;
> +       }

Why not just:

do {
  cpu_relax()
  inc.head = ...
} while (inc.head != inc.tail);


Other than that, no problems with the principle of it.

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

Re: [RFC PATCH 2/3] x86,smp: proportional backoff for ticket spinlocks

Michel Lespinasse
In reply to this post by Rik van Riel
On Fri, Dec 21, 2012 at 3:51 PM, Rik van Riel <[hidden email]> wrote:
> Subject: x86,smp: proportional backoff for ticket spinlocks
>
> Simple fixed value proportional backoff for ticket spinlocks.
> By pounding on the cacheline with the spin lock less often,
> bus traffic is reduced. In cases of a data structure with
> embedded spinlock, the lock holder has a better chance of
> making progress.
>
> Signed-off-by: Rik van Riel <[hidden email]>

Looks fine to me other than the arbitrary-ness of 50
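
For reference, the constant in question sits in the wait loop (the patch body
is quoted in full in Rafael Aquini's review further down the thread):

        for (;;) {
                int loops = 50 * (__ticket_t)(inc.tail - inc.head);
                while (loops--)
                        cpu_relax();

                inc.head = ACCESS_ONCE(lock->tickets.head);
                if (inc.head == inc.tail)
                        break;
        }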

Reviewed-by: Michel Lespinasse <[hidden email]>

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

Re: [RFC PATCH 1/3] x86,smp: move waiting on contended lock out of line

Rik van Riel
In reply to this post by Michel Lespinasse
On 12/21/2012 11:40 PM, Michel Lespinasse wrote:
> On Fri, Dec 21, 2012 at 3:50 PM, Rik van Riel <[hidden email]> wrote:

>> @@ -53,12 +55,11 @@ static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
>>
>>          inc = xadd(&lock->tickets, inc);

>> +       if (inc.head == inc.tail)
>> +               goto out;
>> +
>> +       ticket_spin_lock_wait(lock, inc);
>> + out:
>
> why not just:
>
> if (inc.head != inc.tail)
>    ticket_spin_lock_wait(lock, inc)

That makes the code nicer, thank you. Applied.
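
With that change applied, the fast path ends up reading like this (sketch of
the result only; the inc declaration is reconstructed from the surrounding
kernel code and is not visible in the quoted hunk):

static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
{
        register struct __raw_tickets inc = { .tail = 1 };

        inc = xadd(&lock->tickets, inc);

        if (inc.head != inc.tail)
                ticket_spin_lock_wait(lock, inc);

        barrier();              /* make sure nothing creeps before the lock is taken */
}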

>> +++ b/arch/x86/kernel/smp.c
>> @@ -113,6 +113,20 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
>>   static bool smp_no_nmi_ipi = false;
>>
>>   /*
>> + * Wait on a congested ticket spinlock.
>> + */
>> +void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
>> +{
>> +       for (;;) {
>> +               cpu_relax();
>> +               inc.head = ACCESS_ONCE(lock->tickets.head);
>> +
>> +               if (inc.head == inc.tail)
>> +                       break;
>> +       }
>
> Why not just:
>
> do {
>    cpu_relax()
>    inc.head = ...
> } while (inc.head != inc.tail);
>
>
> Other than that, no problems with the principle of it.

In patch #3 I do something else inside the head == tail
conditional block, so this one is best left alone.

Thank you for the comments.


Re: [RFC PATCH 3/3] x86,smp: auto tune spinlock backoff delay factor

Michel Lespinasse
In reply to this post by Rik van Riel
On Fri, Dec 21, 2012 at 3:51 PM, Rik van Riel <[hidden email]> wrote:

> Subject: x86,smp: auto tune spinlock backoff delay factor
>
> Many spinlocks are embedded in data structures; having many CPUs
> pounce on the cache line the lock is in will slow down the lock
> holder, and can cause system performance to fall off a cliff.
>
> The paper "Non-scalable locks are dangerous" is a good reference:
>
>         http://pdos.csail.mit.edu/papers/linux:lock.pdf
>
> In the Linux kernel, spinlocks are optimized for the case of
> there not being contention. After all, if there is contention,
> the data structure can be improved to reduce or eliminate
> lock contention.
>
> Likewise, the spinlock API should remain simple, and the
> common case of the lock not being contended should remain
> as fast as ever.
>
> However, since spinlock contention should be fairly uncommon,
> we can add functionality into the spinlock slow path that keeps
> system performance from falling off a cliff when there is lock
> contention.
>
> Proportional delay in ticket locks means the time between checks
> of the ticket is scaled by a delay factor times the number of
> CPUs ahead of us in the queue for this lock. Checking the lock
> less often allows the lock holder to continue running, resulting
> in better throughput and preventing performance from dropping
> off a cliff.
>
> Proportional spinlock delay with a high delay factor works well
> when there is lots of contention on a lock. Likewise, a smaller
> delay factor works well when a lock is lightly contended.
>
> Making the code auto-tune the delay factor results in a system
> that performs well with both light and heavy lock contention.
>
> Signed-off-by: Rik van Riel <[hidden email]>

So I like the idea a lot, and I had never seen the auto-tuning as you
propose it. Your implementation looks simple enough and doesn't slow
the uncontended case, so props for that.

However, I have a few concerns about the behavior of this, which I
think deserve more experimentation (I may try helping with it after
new years).

One thing you mentioned in 0/3 is that the best value varies depending
on the number of CPUs contending. This is somewhat surprising to me; I
would have guessed/hoped that the (inc.tail - inc.head) multiplicative
factor would account for that already. I wonder if we can somehow
adjust the code so that a same constant would work no matter how many
threads are contending for the lock (note that one single read to the
spinlock word gives us both the current number of waiters and our
position among them).
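
Spelling that last point out (a sketch reusing the names from the patch, not
proposed code): with "inc" being the value returned by the xadd, one snapshot
of the lock word gives both numbers:

        struct __raw_tickets now = ACCESS_ONCE(lock->tickets);
        __ticket_t ahead   = inc.tail - now.head; /* tickets ahead of us, holder included */
        __ticket_t waiters = now.tail - now.head; /* holder plus everyone queued behind it */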

The other factor that might change your auto-tuned value is the amount
of time that each thread holds the spinlock. I wonder what would
happen if the constant was tuned for spinlocks that have a low hold
time, and then used on spinlocks that might have a higher hold time.
> Obviously this would result in accessing the spinlock word more often
than necessary, but it shouldn't be very bad since the accesses
wouldn't be any more frequent than in the low hold time case, where
throughput is good. So maybe this would work acceptably well.

What I'm getting at is that I would be more confident that the
autotune algorithm will work well in all cases if the value only
depended on the system parameters such as CPU type and frequency,
rather than per-spinlock parameters such as number of waiters and hold
time.

I feel this review is too high-level to be really helpful, so I'll
stop until I can find time to experiment :)

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

Re: [RFC PATCH 3/3] x86,smp: auto tune spinlock backoff delay factor

Rik van Riel
On 12/22/2012 12:42 AM, Michel Lespinasse wrote:

> However, I have a few concerns about the behavior of this, which I
> think deserve more experimentation (I may try helping with it after
> new years).

More experimentation is always good. Different hardware
will probably behave differently, etc...

> One thing you mentioned in 0/3 is that the best value varies depending
> on the number of CPUs contending. This is somewhat surprising to me; I
> would have guessed/hoped that the (inc.tail - inc.head) multiplicative
> factor would account for that already.

I had hoped the same, but testing things out while
developing the code showed it not to be so.

A larger constant scales to a larger number of CPUs,
while a smaller constant works faster when there are
few CPUs contending on the lock.

The graph attached to patch 0/3 shows this effect.

> What I'm getting at is that I would be more confident that the
> autotune algorithm will work well in all cases if the value only
> depended on the system parameters such as CPU type and frequency,
> rather than per-spinlock parameters such as number of waiters and hold
> time.

The autotune algorithm adjusts the delay factor so
that we check the spinlock, on average, 2.7 times
before we acquire it (i.e. on roughly the 3.7th access).

This provides scalability by keeping the number of
shared cache line touches constant for each lock
acquisition.

Going for a fixed value, either at compile time or
at boot time, cannot provide such a guarantee.
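
(For the record, the 2.7 is not magic either. The delay is decremented once
per acquisition, and incremented on a failed check only when (head & 3) == 0
or (head & 7) == 1, which is the increment condition visible in the versions
of the patch quoted later in this thread, i.e. with probability
1/4 + 1/8 = 3/8. The two balance at 8/3 ~= 2.7 failed checks per acquisition,
which is where the 3.7th-access figure comes from.)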

> I feel this review is too high-level to be really helpful, so I'll
> stop until I can find time to experiment :)

I am looking forward to test results.

If anyone manages to break the code, I will fix it...


Re: [RFC PATCH 1/3] x86,smp: move waiting on contended lock out of line

Rafael Aquini
In reply to this post by Rik van Riel
On Fri, Dec 21, 2012 at 06:50:38PM -0500, Rik van Riel wrote:
> Subject: x86,smp: move waiting on contended ticket lock out of line
>
> Moving the wait loop for congested locks to its own function allows
> us to add things to that wait loop, without growing the size of the
> kernel text appreciably.
>
> Signed-off-by: Rik van Riel <[hidden email]>
> ---

Reviewed-by: Rafael Aquini <[hidden email]>


>  arch/x86/include/asm/spinlock.h |   13 +++++++------
>  arch/x86/kernel/smp.c           |   14 ++++++++++++++
>  2 files changed, 21 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
> index 33692ea..2a45eb0 100644
> --- a/arch/x86/include/asm/spinlock.h
> +++ b/arch/x86/include/asm/spinlock.h
> @@ -34,6 +34,8 @@
>  # define UNLOCK_LOCK_PREFIX
>  #endif
>  
> +extern void ticket_spin_lock_wait(arch_spinlock_t *, struct __raw_tickets);
> +
>  /*
>   * Ticket locks are conceptually two parts, one indicating the current head of
>   * the queue, and the other indicating the current tail. The lock is acquired
> @@ -53,12 +55,11 @@ static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
>  
>         inc = xadd(&lock->tickets, inc);
>  
> -       for (;;) {
> -               if (inc.head == inc.tail)
> -                       break;
> -               cpu_relax();
> -               inc.head = ACCESS_ONCE(lock->tickets.head);
> -       }
> +       if (inc.head == inc.tail)
> +               goto out;
> +
> +       ticket_spin_lock_wait(lock, inc);
> + out:
>         barrier();              /* make sure nothing creeps before the lock is taken */
>  }
>  
> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
> index 48d2b7d..20da354 100644
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -113,6 +113,20 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
>  static bool smp_no_nmi_ipi = false;
>  
>  /*
> + * Wait on a congested ticket spinlock.
> + */
> +void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
> +{
> +       for (;;) {
> +               cpu_relax();
> +               inc.head = ACCESS_ONCE(lock->tickets.head);
> +
> +               if (inc.head == inc.tail)
> +                       break;
> +       }
> +}
> +
> +/*
>   * this function sends a 'reschedule' IPI to another CPU.
>   * it goes straight through and wastes no time serializing
>   * anything. Worst case is that we lose a reschedule ...

Re: [RFC PATCH 2/3] x86,smp: proportional backoff for ticket spinlocks

Rafael Aquini
In reply to this post by Rik van Riel
On Fri, Dec 21, 2012 at 06:51:15PM -0500, Rik van Riel wrote:

> Subject: x86,smp: proportional backoff for ticket spinlocks
>
> Simple fixed value proportional backoff for ticket spinlocks.
> By pounding on the cacheline with the spin lock less often,
> bus traffic is reduced. In cases of a data structure with
> embedded spinlock, the lock holder has a better chance of
> making progress.
>
> Signed-off-by: Rik van Riel <[hidden email]>
> ---

Reviewed-by: Rafael Aquini <[hidden email]>

>  arch/x86/kernel/smp.c |    6 ++++--
>  1 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
> index 20da354..4e44840 100644
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -118,9 +118,11 @@ static bool smp_no_nmi_ipi = false;
>  void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
>  {
>         for (;;) {
> -               cpu_relax();
> -               inc.head = ACCESS_ONCE(lock->tickets.head);
> +               int loops = 50 * (__ticket_t)(inc.tail - inc.head);
> +               while (loops--)
> +                       cpu_relax();
>  
> +               inc.head = ACCESS_ONCE(lock->tickets.head);
>                 if (inc.head == inc.tail)
>                         break;
>         }
>

Re: [RFC PATCH 3/3] x86,smp: auto tune spinlock backoff delay factor

Rafael Aquini
In reply to this post by Rik van Riel
On Fri, Dec 21, 2012 at 10:58:48PM -0500, Rik van Riel wrote:

> On 12/21/2012 10:49 PM, Steven Rostedt wrote:
> >On Fri, Dec 21, 2012 at 09:51:35PM -0500, Rik van Riel wrote:
>
> >>However, since spinlock contention should not be the
> >>usual state, and all a scalable lock does is make sure
> >>that N+1 CPUs do not perform worse than N CPUs, using
> >>scalable locks is a stop-gap measure.
> >>
> >>I believe a stop-gap measure should be kept as simple as
> >>we can. I am willing to consider moving to a per-lock
> >>delay factor if we can figure out an easy way to do it,
> >>but I would like to avoid too much extra complexity...
> >
> >Rik,
> >
> >I like your solution. It's rather simple and simple solutions tend to
> >end up being the closest to optimal. The more complex a solution gets,
> >the more it starts chasing fireflies.
>
> >Anyway, I'd like to see this code tested, and more benchmarks run
> >against it.
>
> Absolutely.  I would love to see if this code actually
> causes regressions anywhere.
>
> It is simple enough that I suspect it will not, but there
> really is only one way to find out.
>
> The more people test this with different workloads on
> different SMP systems, the better.
>

Great work Rik,

I have a couple of small SMP systems I'll start testing your patches on. I
might also be able to test this work after New Year's Eve on a big SMP box
that seems to be facing a severe lock starvation issue due to the bus
saturation your work is aiming to reduce.

Cheers!
-- Rafael

Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor

Eric Dumazet
In reply to this post by Rik van Riel
On Fri, 2012-12-21 at 22:50 -0500, Rik van Riel wrote:

> I will try to run this test on a really large SMP system
> in the lab during the break.
>
> Ideally, the auto-tuning will keep the delay value large
> enough that performance will stay flat even when there are
> 100 CPUs contending over the same lock.
>
> Maybe it turns out that the maximum allowed delay value
> needs to be larger.  Only one way to find out...
>

Hi Rik

I did some tests with your patches with following configuration :

tc qdisc add dev eth0 root htb r2q 1000 default 3
(to force a contention on qdisc lock, even with a multi queue net
device)

and 24 concurrent "netperf -t UDP_STREAM -H other_machine -- -m 128"

Machine : 2 Intel(R) Xeon(R) CPU X5660  @ 2.80GHz
(24 threads), and a fast NIC (10Gbps)

Resulting in a 13 % regression (676 Mbits -> 595 Mbits)

In this workload we have at least two contended spinlocks, with
different delays. (spinlocks are not held for the same duration)

It clearly defeats your assumption of a single per-cpu delay being OK:
some CPUs keep spinning long after the lock has been released.

We might try to use a hash on lock address, and an array of 16 different
delays so that different spinlocks have a chance of not sharing the same
delay.

With following patch, I get 982 Mbits/s with same bench, so an increase
of 45 % instead of a 13 % regression.

 
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 48d2b7d..59f98f6 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -23,6 +23,7 @@
 #include <linux/interrupt.h>
 #include <linux/cpu.h>
 #include <linux/gfp.h>
+#include <linux/hash.h>
 
 #include <asm/mtrr.h>
 #include <asm/tlbflush.h>
@@ -113,6 +114,55 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
 static bool smp_no_nmi_ipi = false;
 
 /*
+ * Wait on a congested ticket spinlock.
+ */
+#define MIN_SPINLOCK_DELAY 1
+#define MAX_SPINLOCK_DELAY 1000
+#define DELAY_HASH_SHIFT 4
+DEFINE_PER_CPU(int [1 << DELAY_HASH_SHIFT], spinlock_delay) = {
+       MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
+       MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
+       MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
+       MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
+       MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
+       MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
+       MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
+       MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
+};
+void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
+{
+       unsigned int slot = hash_32((u32)(unsigned long)lock, DELAY_HASH_SHIFT);
+       int delay = __this_cpu_read(spinlock_delay[slot]);
+
+       for (;;) {
+               int loops = delay * (__ticket_t)(inc.tail - inc.head);
+
+               while (loops--)
+                       cpu_relax();
+
+               inc.head = ACCESS_ONCE(lock->tickets.head);
+
+               if (inc.head == inc.tail) {
+                       /* Decrease the delay, since we may have overslept. */
+                       if (delay > MIN_SPINLOCK_DELAY)
+                               delay--;
+                       break;
+               }
+
+               /*
+                * The lock is still busy, the delay was not long enough.
+                * Going through here 2.7 times will, on average, cancel
+                * out the decrement above. Using a non-integer number
+                * gets rid of performance artifacts and reduces oversleeping.
+                */
+               if (delay < MAX_SPINLOCK_DELAY &&
+                   ((inc.head & 3) == 0 || (inc.head & 7) == 1))
+                       delay++;
+       }
+       __this_cpu_write(spinlock_delay[slot], delay);
+}
+
+/*
  * this function sends a 'reschedule' IPI to another CPU.
  * it goes straight through and wastes no time serializing
  * anything. Worst case is that we lose a reschedule ...



Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor

Eric Dumazet
On Wed, 2012-12-26 at 11:10 -0800, Eric Dumazet wrote:

> +#define DELAY_HASH_SHIFT 4
> +DEFINE_PER_CPU(int [1 << DELAY_HASH_SHIFT], spinlock_delay) = {
> +       MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
> +       MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
> +       MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
> +       MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
> +       MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
> +       MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
> +       MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
> +       MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
> +};

This can use the following initializer, by the way:

DEFINE_PER_CPU(int [1 << DELAY_HASH_SHIFT], spinlock_delay) = {
        [0 ... (1 << DELAY_HASH_SHIFT) - 1] = MIN_SPINLOCK_DELAY,
};




Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor

Rik van Riel
In reply to this post by Eric Dumazet
On 12/26/2012 02:10 PM, Eric Dumazet wrote:

> I did some tests with your patches with following configuration :
>
> tc qdisc add dev eth0 root htb r2q 1000 default 3
> (to force a contention on qdisc lock, even with a multi queue net
> device)
>
> and 24 concurrent "netperf -t UDP_STREAM -H other_machine -- -m 128"
>
> Machine : 2 Intel(R) Xeon(R) CPU X5660  @ 2.80GHz
> (24 threads), and a fast NIC (10Gbps)
>
> Resulting in a 13 % regression (676 Mbits -> 595 Mbits)
>
> In this workload we have at least two contended spinlocks, with
> different delays. (spinlocks are not held for the same duration)
>
> It clearly defeats your assumption of a single per-cpu delay being OK:
> some CPUs keep spinning long after the lock has been released.

Thank you for breaking my patches.

I had been thinking about ways to deal with multiple
spinlocks, and hoping there would not be a serious
issue with systems contending on multiple locks.

> We might try to use a hash on lock address, and an array of 16 different
> delays so that different spinlocks have a chance of not sharing the same
> delay.
>
> With following patch, I get 982 Mbits/s with same bench, so an increase
> of 45 % instead of a 13 % regression.

Thank you even more for fixing my patches :)

That is a huge win!

Could I have your Signed-off-by: line, so I can merge
your hashed spinlock slots in?

I will probably keep it as a separate patch 4/4, with
your report and performance numbers in it, to preserve
the reason why we keep multiple hashed values, etc...

There is enough stuff in this code that will be
indistinguishable from magic if we do not document it
properly...

--
All rights reversed

Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor

Michel Lespinasse
On Wed, Dec 26, 2012 at 11:51 AM, Rik van Riel <[hidden email]> wrote:
> On 12/26/2012 02:10 PM, Eric Dumazet wrote:
>> We might try to use a hash on lock address, and an array of 16 different
>> delays so that different spinlocks have a chance of not sharing the same
>> delay.
>>
>> With following patch, I get 982 Mbits/s with same bench, so an increase
>> of 45 % instead of a 13 % regression.

Awesome :)

> I will probably keep it as a separate patch 4/4, with
> your report and performance numbers in it, to preserve
> the reason why we keep multiple hashed values, etc...
>
> There is enough stuff in this code that will be
> indistinguishable from magic if we do not document it
> properly...

If we go with per-spinlock tunings, I feel we'll most likely want to
add an associative cache in order to avoid the 1/16 chance (~6%) of
getting 595Mbit/s instead of 982Mbit/s when there is a hash collision.

I would still prefer if we could make up something that didn't require
per-spinlock tunings, but it's not clear if that'll work. At least we
now know of a simple enough workload to figure it out :)

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor

Eric Dumazet
On Wed, 2012-12-26 at 22:07 -0800, Michel Lespinasse wrote:

> If we go with per-spinlock tunings, I feel we'll most likely want to
> add an associative cache in order to avoid the 1/16 chance (~6%) of
> getting 595Mbit/s instead of 982Mbit/s when there is a hash collision.
>
> I would still prefer if we could make up something that didn't require
> per-spinlock tunings, but it's not clear if that'll work. At least we
> now know of a simple enough workload to figure it out :)

Even with a per spinlock tuning, we can find workloads where holding
time depends on the context.

For example, complex qdisc hierarchies typically use different times on
enqueue and dequeue operations.

So the hash sounds good to me, because the hash key could mix both lock
address and caller IP ( __builtin_return_address(1) in
ticket_spin_lock_wait())






Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor

Rik van Riel
On 12/27/2012 09:27 AM, Eric Dumazet wrote:

> On Wed, 2012-12-26 at 22:07 -0800, Michel Lespinasse wrote:
>
>> If we go with per-spinlock tunings, I feel we'll most likely want to
>> add an associative cache in order to avoid the 1/16 chance (~6%) of
>> getting 595Mbit/s instead of 982Mbit/s when there is a hash collision.
>>
>> I would still prefer if we could make up something that didn't require
>> per-spinlock tunings, but it's not clear if that'll work. At least we
>> now know of a simple enough workload to figure it out :)
>
> Even with a per spinlock tuning, we can find workloads where holding
> time depends on the context.
>
> For example, complex qdisc hierarchies typically use different times on
> enqueue and dequeue operations.
 >
> So the hash sounds good to me, because the hash key could mix both lock
> address and caller IP ( __builtin_return_address(1) in
> ticket_spin_lock_wait())

The lock acquisition time depends on the holder of the lock,
and what the CPUs ahead of us in line will do with the lock,
not on the caller IP of the spinner.

Therefore, I am not convinced that hashing on the caller IP
will add much, if anything, except increasing the chance
that we end up not backing off when we should...

IMHO it would be good to try keeping this solution as simple
as we can get away with.

--
All rights reversed

Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor

Jan Beulich
>>> Rik van Riel <[hidden email]> 12/27/12 4:01 PM >>>
>On 12/27/2012 09:27 AM, Eric Dumazet wrote:
>> So the hash sounds good to me, because the hash key could mix both lock
>> address and caller IP ( __builtin_return_address(1) in
>> ticket_spin_lock_wait())
>
>The lock acquisition time depends on the holder of the lock,
>and what the CPUs ahead of us in line will do with the lock,
>not on the caller IP of the spinner.

The lock holder could supply its __builtin_return_address(0) for use
in eventual hashing.

Also, with all of this - did you evaluate the alternative of using
monitor/mwait instead?
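
Roughly along these lines, I suppose (just a sketch using the existing
__monitor()/__mwait() helpers from <asm/mwait.h>; feature detection,
interrupt handling and wakeup latency would all need looking at):

        /* sketch: sleep until the lock word is written to, instead of polling it */
        while (ACCESS_ONCE(lock->tickets.head) != inc.tail) {
                __monitor(&lock->tickets.head, 0, 0);
                if (ACCESS_ONCE(lock->tickets.head) != inc.tail)
                        __mwait(0, 0);
        }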

Jan


Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor

Eric Dumazet
In reply to this post by Rik van Riel
On Thu, 2012-12-27 at 09:35 -0500, Rik van Riel wrote:

>
> The lock acquisition time depends on the holder of the lock,
> and what the CPUs ahead of us in line will do with the lock,
> not on the caller IP of the spinner.
>

That would be true only for general cases.

In network land, we do have spinlock acquisition time depending on the
context.

A garbage collector usually runs for longer time than the regular fast
path.

But even without gc, it's pretty often that we have consumers/producers that
don't have the same amount of work to perform per lock/unlock section.

The socket lock, for example, might be held for very small sections in
process context (lock_sock() / release_sock()), but longer sections
from softirq context. Of course, severe lock contention on a socket
seems unlikely in real workloads.

> Therefore, I am not convinced that hashing on the caller IP
> will add much, if anything, except increasing the chance
> that we end up not backing off when we should...
>
> IMHO it would be good to try keeping this solution as simple
> as we can get away with.
>

unsigned long hash = (unsigned long)lock ^
                     (unsigned long)__builtin_return_address(1);

seems simple enough to me, but I get your point.

I also recorded the max 'delay' value reached on my machine to check how
good the MAX_SPINLOCK_DELAY value was:

[   89.628265] cpu 16 delay 3710
[   89.631230] cpu 6 delay 2930
[   89.634120] cpu 15 delay 3186
[   89.637092] cpu 18 delay 3789
[   89.640071] cpu 22 delay 4012
[   89.643080] cpu 11 delay 3389
[   89.646057] cpu 21 delay 3123
[   89.649035] cpu 9 delay 3295
[   89.651931] cpu 3 delay 3063
[   89.654811] cpu 14 delay 3335

Although it makes no performance difference to use a bigger/smaller one.




Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor

Rik van Riel
In reply to this post by Jan Beulich
On 12/27/2012 01:41 PM, Jan Beulich wrote:

>>>> Rik van Riel <[hidden email]> 12/27/12 4:01 PM >>>
>> On 12/27/2012 09:27 AM, Eric Dumazet wrote:
>>> So the hash sounds good to me, because the hash key could mix both lock
>>> address and caller IP ( __builtin_return_address(1) in
>>> ticket_spin_lock_wait())
>>
>> The lock acquisition time depends on the holder of the lock,
>> and what the CPUs ahead of us in line will do with the lock,
>> not on the caller IP of the spinner.
>
> The lock holder could supply its __builtin_return_address(0) for use
> in eventual hashing.
>
> Also, with all of this - did you evaluate the alternative of using
> monitor/mwait instead?

How much bus traffic do monitor/mwait cause behind the scenes?

--
All rights reversed

Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor

Rik van Riel
In reply to this post by Eric Dumazet
On 12/27/2012 01:49 PM, Eric Dumazet wrote:

> On Thu, 2012-12-27 at 09:35 -0500, Rik van Riel wrote:
>
>>
>> The lock acquisition time depends on the holder of the lock,
>> and what the CPUs ahead of us in line will do with the lock,
>> not on the caller IP of the spinner.
>
> That would be true only for general cases.
>
> In network land, we do have spinlock acquisition time depending on the
> context.
>
> A garbage collector usually runs for longer time than the regular fast
> path.

Won't the garbage collector running hold up the lock
acquisition time of OTHER acquirers?

> But even without gc, it's pretty often that we have consumers/producers that
> don't have the same amount of work to perform per lock/unlock section.
>
> The socket lock, for example, might be held for very small sections in
> process context (lock_sock() / release_sock()), but longer sections
> from softirq context. Of course, severe lock contention on a socket
> seems unlikely in real workloads.

If one actor holds the lock for longer than the
others, surely it would be the others that suffer
in lock acquisition time?

>> Therefore, I am not convinced that hashing on the caller IP
>> will add much, if anything, except increasing the chance
>> that we end up not backing off when we should...
>>
>> IMHO it would be good to try keeping this solution as simple
>> as we can get away with.
>>
>
> unsigned long hash = (unsigned long)lock ^
>                       (unsigned long)__builtin_return_address(1);
>
> seems simple enough to me, but I get your point.
>
> I also recorded the max 'delay' value reached on my machine to check how
> good the MAX_SPINLOCK_DELAY value was:
>
> [   89.628265] cpu 16 delay 3710
> [   89.631230] cpu 6 delay 2930
> [   89.634120] cpu 15 delay 3186
> [   89.637092] cpu 18 delay 3789
> [   89.640071] cpu 22 delay 4012
> [   89.643080] cpu 11 delay 3389
> [   89.646057] cpu 21 delay 3123
> [   89.649035] cpu 9 delay 3295
> [   89.651931] cpu 3 delay 3063
> [   89.654811] cpu 14 delay 3335
>
> Although it makes no performance difference to use a bigger/smaller one.

I guess we want a larger value.

With your hashed lock approach, we can get away with
larger values - they will not penalize other locks
the same way a single value per cpu might have.

--
All rights reversed

Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor

Eric Dumazet
On Thu, 2012-12-27 at 14:31 -0500, Rik van Riel wrote:
> to use a bigger/smaller one.
>
> I guess we want a larger value.
>
> With your hashed lock approach, we can get away with
> larger values - they will not penalize other locks
> the same way a single value per cpu might have.

Then, we absolutely want to detect hash collisions to clear the (wrong)
estimation or else we might 'pollute' a spinlock with a delay of a very
slow spinlock.

In my tests, the mm zone lock can be held for very long for example...

[  657.439995] cpu 18 lock ffff88067fffeb40 delay 6906
[  657.444855] ------------[ cut here ]------------
[  657.444859] WARNING: at arch/x86/kernel/smp.c:170
ticket_spin_lock_wait+0xf9/0x100()
[  657.444860] Hardware name: TBG,ICH10
[  657.444861] Modules linked in: msr cpuid genrtc mlx4_en ib_uverbs
mlx4_ib ib_sa ib_mad ib_core mlx4_core e1000e bnx2x libcrc32c mdio ipv6
[  657.444871] Pid: 24942, comm: hotplug Tainted: G        W
3.8.0-smp-DEV #31
[  657.444872] Call Trace:
[  657.444876]  [<ffffffff8108634f>] warn_slowpath_common+0x7f/0xc0
[  657.444878]  [<ffffffff810863aa>] warn_slowpath_null+0x1a/0x20
[  657.444881]  [<ffffffff81070089>] ticket_spin_lock_wait+0xf9/0x100
[  657.444885]  [<ffffffff815ad58f>] _raw_spin_lock_irqsave+0x2f/0x40
[  657.444887]  [<ffffffff8113f5d0>] release_pages+0x160/0x220
[  657.444891]  [<ffffffff8116cffe>] free_pages_and_swap_cache+0x9e/0xc0
[  657.444893]  [<ffffffff8107f8a8>] ? flush_tlb_mm_range+0x48/0x220
[  657.444896]  [<ffffffff81158ee7>] tlb_flush_mmu+0x67/0xb0
[  657.444898]  [<ffffffff81158f4c>] tlb_finish_mmu+0x1c/0x50
[  657.444900]  [<ffffffff81163346>] exit_mmap+0xf6/0x170
[  657.444903]  [<ffffffff81083737>] mmput+0x47/0xf0
[  657.444906]  [<ffffffff8108baed>] do_exit+0x24d/0xa20
[  657.444908]  [<ffffffff810986af>] ? recalc_sigpending+0x1f/0x60
[  657.444910]  [<ffffffff81098db7>] ? __set_task_blocked+0x37/0x80
[  657.444913]  [<ffffffff8108c354>] do_group_exit+0x44/0xa0
[  657.444915]  [<ffffffff8108c3c7>] sys_exit_group+0x17/0x20
[  657.444918]  [<ffffffff815b68a5>] sysenter_dispatch+0x7/0x1a
[  657.444920] ---[ end trace a460fe18a5578dda ]---

My current function looks like:

 /*
+ * Wait on a congested ticket spinlock.
+ */
+#define MIN_SPINLOCK_DELAY 1
+#define MAX_SPINLOCK_DELAY 10000
+#define DELAY_HASH_SHIFT 6
+struct delay_entry {
+       u32 hash;
+       u32 delay;
+};
+static DEFINE_PER_CPU(struct delay_entry [1 << DELAY_HASH_SHIFT], spinlock_delay) = {
+       [0 ... (1 << DELAY_HASH_SHIFT) - 1] = {
+               .hash = 0,
+               .delay = MIN_SPINLOCK_DELAY,
+       },
+};
+static DEFINE_PER_CPU(u16, maxdelay);
+
+void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
+{
+       u32 hash = hash32_ptr(lock);
+       u32 slot = hash_32(hash, DELAY_HASH_SHIFT);
+       struct delay_entry *ent = &__get_cpu_var(spinlock_delay[slot]);
+       u32 delay = (ent->hash == hash) ? ent->delay : MIN_SPINLOCK_DELAY;
+
+       for (;;) {
+               u32 loops = delay * (__ticket_t)(inc.tail - inc.head);
+
+               loops -= delay >> 1;
+               while (loops--)
+                       cpu_relax();
+
+               inc.head = ACCESS_ONCE(lock->tickets.head);
+
+               if (inc.head == inc.tail) {
+                       /* Decrease the delay, since we may have overslept. */
+                       if (delay > MIN_SPINLOCK_DELAY)
+                               delay--;
+                       break;
+               }
+
+               /*
+                * The lock is still busy, the delay was not long enough.
+                * Going through here 2.7 times will, on average, cancel
+                * out the decrement above. Using a non-integer number
+                * gets rid of performance artifacts and reduces oversleeping.
+                */
+               if (delay < MAX_SPINLOCK_DELAY &&
+                   ((inc.head & 3) == 0 || (inc.head & 7) == 1))
+                       delay++;
+       }
+       if (__this_cpu_read(maxdelay) < delay) {
+               pr_err("cpu %d lock %p delay %d\n", smp_processor_id(), lock, delay);
+               __this_cpu_write(maxdelay, delay);
+               WARN_ON(1);
+       }
+       ent->hash = hash;
+       ent->delay = delay;
+}
+

