[RFC PATCH 0/3] x86,smp: make ticket spinlock proportional backoff w/ auto tuning

Rik van Riel
Many spinlocks are embedded in data structures; having many CPUs
pounce on the cache line the lock is in will slow down the lock
holder, and can cause system performance to fall off a cliff.

The paper "Non-scalable locks are dangerous" is a good reference:

        http://pdos.csail.mit.edu/papers/linux:lock.pdf

In the Linux kernel, spinlocks are optimized for the case of
there not being contention. After all, if there is contention,
the data structure can be improved to reduce or eliminate
lock contention.

Likewise, the spinlock API should remain simple, and the
common case of the lock not being contended should remain
as fast as ever.

However, since spinlock contention should be fairly uncommon,
we can add functionality into the spinlock slow path that keeps
system performance from falling off a cliff when there is lock
contention.

Proportional delay in ticket locks means delaying between checks
of the ticket, with the delay scaled by a delay factor and by the
number of CPUs ahead of us in the queue for this lock. Checking
the lock less often allows the lock holder to continue running,
resulting in better throughput and preventing performance from
dropping off a cliff.

The test case has a number of threads locking and unlocking a
semaphore. With just one thread, everything sits in the CPU
cache and throughput is around 2.6 million operations per
second, with a 5-10% variation.

Once a second thread gets involved, data structures bounce
from CPU to CPU, and performance deteriorates to about 1.25
million operations per second, with a 5-10% variation.

However, as more and more threads get added to the mix,
performance with the vanilla kernel continues to deteriorate.
Once I hit 24 threads, on a 24 CPU, 4 node test system,
performance is down to about 290k operations/second.
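
(The test program itself was not posted; below is a rough
reconstruction of what such a test might look like, using SysV
semaphores so that every lock/unlock operation goes through the
kernel and contends on the same in-kernel lock. All names and
details in it are guesses, not the actual benchmark.)

/* Rough reconstruction of the benchmark described above; the real
 * program was not posted, so everything here is a guess at its shape.
 * Each thread does semop() lock/unlock pairs on one SysV semaphore,
 * and throughput is reported as operations per second. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ipc.h>
#include <sys/sem.h>

union semun { int val; struct semid_ds *buf; unsigned short *array; };

static int semid;
static volatile int stop;

static void *worker(void *arg)
{
	struct sembuf down = { 0, -1, 0 };
	struct sembuf up   = { 0, +1, 0 };
	unsigned long ops = 0;

	while (!stop) {
		semop(semid, &down, 1);	/* lock */
		semop(semid, &up, 1);	/* unlock */
		ops++;
	}
	return (void *)ops;
}

int main(int argc, char **argv)
{
	int i, nthreads = argc > 1 ? atoi(argv[1]) : 1;
	int seconds = 10;
	unsigned long total = 0;

	if (nthreads < 1)
		nthreads = 1;

	pthread_t tid[nthreads];

	semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
	semctl(semid, 0, SETVAL, (union semun){ .val = 1 });

	for (i = 0; i < nthreads; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	sleep(seconds);
	stop = 1;
	for (i = 0; i < nthreads; i++) {
		void *ops;
		pthread_join(tid[i], &ops);
		total += (unsigned long)ops;
	}
	printf("%d threads: %lu ops/sec\n", nthreads, total / seconds);

	semctl(semid, 0, IPC_RMID);
	return 0;
}

Build with something like "gcc -O2 -pthread sem-bench.c" and pass the
number of threads as the first argument.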

With a proportional backoff delay added to the spinlock
code, performance with 24 threads goes up to about 400k
operations/second with a 50x delay, and about 900k operations/second
with a 250x delay. However, with a 250x delay, performance with
2-5 threads is worse than with a 50x delay.

Making the code auto-tune the delay factor results in a system
that performs well with both light and heavy lock contention,
and should also protect against the (likely) case of the fixed
delay factor being wrong for other hardware.

The attached graph shows the performance of the multi threaded
semaphore lock/unlock test case, with 1-24 threads, on the
vanilla kernel, with 10x, 50x, and 250x proportional delay,
and with auto-tuned proportional delay, tuned for an average of
either 2 or 2.7 spins before the lock is obtained.

Please let me know if you manage to break this code in any way,
so I can fix it...

spinlock-backoff.png (26K) Download Attachment

[RFC PATCH 1/3] x86,smp: move waiting on contended lock out of line

Rik van Riel
Subject: x86,smp: move waiting on contended ticket lock out of line

Moving the wait loop for congested locks to its own function allows
us to add things to that wait loop, without growing the size of the
kernel text appreciably.

Signed-off-by: Rik van Riel <[hidden email]>
---
 arch/x86/include/asm/spinlock.h |   13 +++++++------
 arch/x86/kernel/smp.c           |   14 ++++++++++++++
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 33692ea..2a45eb0 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -34,6 +34,8 @@
 # define UNLOCK_LOCK_PREFIX
 #endif
 
+extern void ticket_spin_lock_wait(arch_spinlock_t *, struct __raw_tickets);
+
 /*
  * Ticket locks are conceptually two parts, one indicating the current head of
  * the queue, and the other indicating the current tail. The lock is acquired
@@ -53,12 +55,11 @@ static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
 
  inc = xadd(&lock->tickets, inc);
 
- for (;;) {
- if (inc.head == inc.tail)
- break;
- cpu_relax();
- inc.head = ACCESS_ONCE(lock->tickets.head);
- }
+ if (inc.head == inc.tail)
+ goto out;
+
+ ticket_spin_lock_wait(lock, inc);
+ out:
  barrier(); /* make sure nothing creeps before the lock is taken */
 }
 
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 48d2b7d..20da354 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -113,6 +113,20 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
 static bool smp_no_nmi_ipi = false;
 
 /*
+ * Wait on a congested ticket spinlock.
+ */
+void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
+{
+ for (;;) {
+ cpu_relax();
+ inc.head = ACCESS_ONCE(lock->tickets.head);
+
+ if (inc.head == inc.tail)
+ break;
+ }
+}
+
+/*
  * this function sends a 'reschedule' IPI to another CPU.
  * it goes straight through and wastes no time serializing
  * anything. Worst case is that we lose a reschedule ...

[RFC PATCH 2/3] x86,smp: proportional backoff for ticket spinlocks

Rik van Riel
In reply to this post by Rik van Riel
Subject: x86,smp: proportional backoff for ticket spinlocks

Simple fixed value proportional backoff for ticket spinlocks.
By pounding on the cacheline with the spin lock less often,
bus traffic is reduced. In cases of a data structure with
embedded spinlock, the lock holder has a better chance of
making progress.

Signed-off-by: Rik van Riel <[hidden email]>
---
 arch/x86/kernel/smp.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 20da354..4e44840 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -118,9 +118,11 @@ static bool smp_no_nmi_ipi = false;
 void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
 {
  for (;;) {
- cpu_relax();
- inc.head = ACCESS_ONCE(lock->tickets.head);
+ int loops = 50 * (__ticket_t)(inc.tail - inc.head);
+ while (loops--)
+ cpu_relax();
 
+ inc.head = ACCESS_ONCE(lock->tickets.head);
  if (inc.head == inc.tail)
  break;
  }


[RFC PATCH 3/3] x86,smp: auto tune spinlock backoff delay factor

Rik van Riel
In reply to this post by Rik van Riel
Subject: x86,smp: auto tune spinlock backoff delay factor

Many spinlocks are embedded in data structures; having many CPUs
pounce on the cache line the lock is in will slow down the lock
holder, and can cause system performance to fall off a cliff.

The paper "Non-scalable locks are dangerous" is a good reference:

        http://pdos.csail.mit.edu/papers/linux:lock.pdf

In the Linux kernel, spinlocks are optimized for the case of
there not being contention. After all, if there is contention,
the data structure can be improved to reduce or eliminate
lock contention.

Likewise, the spinlock API should remain simple, and the
common case of the lock not being contended should remain
as fast as ever.

However, since spinlock contention should be fairly uncommon,
we can add functionality into the spinlock slow path that keeps
system performance from falling off a cliff when there is lock
contention.

Proportional delay in ticket locks means delaying between checks
of the ticket, with the delay scaled by a delay factor and by the
number of CPUs ahead of us in the queue for this lock. Checking
the lock less often allows the lock holder to continue running,
resulting in better throughput and preventing performance from
dropping off a cliff.

Proportional spinlock delay with a high delay factor works well
when there is lots of contention on a lock. Likewise, a smaller
delay factor works well when a lock is lightly contended.

Making the code auto-tune the delay factor results in a system
that performs well with both light and heavy lock contention.

Signed-off-by: Rik van Riel <[hidden email]>
---
 arch/x86/kernel/smp.c |   49 ++++++++++++++++++++++++++++++++++++++++++++++---
 1 files changed, 46 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 1c1eb02..8786476 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -113,19 +113,62 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
 static bool smp_no_nmi_ipi = false;
 
 /*
- * Wait on a congested ticket spinlock.
+ * Wait on a congested ticket spinlock. Many spinlocks are embedded in
+ * data structures; having many CPUs pounce on the cache line with the
+ * spinlock simultaneously can slow down the lock holder, and the system
+ * as a whole.
+ *
+ * To prevent total performance collapse in case of bad spinlock contention,
+ * perform proportional backoff. The per-cpu value of delay is automatically
+ * tuned to limit the number of times spinning CPUs poll the lock before
+ * obtaining it. This limits the amount of cross-CPU traffic required to obtain
+ * a spinlock, and keeps system performance from dropping off a cliff.
+ *
+ * There is a tradeoff. If we poll too often, the whole system is slowed
+ * down. If we sleep too long, the lock will go unused for a period of
+ * time. Adjusting "delay" to poll, on average, 2.7 times before the
+ * lock is obtained seems to result in low bus traffic. The combination
+ * of aiming for a non-integer amount of average polls, and scaling the
+ * sleep period proportionally to how many CPUs are ahead of us in the
+ * queue for this ticket lock seems to reduce the amount of time spent
+ * "oversleeping" the release of the lock.
  */
+#define MIN_SPINLOCK_DELAY 1
+#define MAX_SPINLOCK_DELAY 1000
+DEFINE_PER_CPU(int, spinlock_delay) = { MIN_SPINLOCK_DELAY };
 void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
 {
+ /*
+ * Use the raw per-cpu pointer; preemption is disabled in the
+ * spinlock code. This avoids put_cpu_var once we have the lock.
+ */
+ int *delay_ptr = &per_cpu(spinlock_delay, smp_processor_id());
+ int delay = *delay_ptr;
+
  for (;;) {
- int loops = 50 * (__ticket_t)(inc.tail - inc.head);
+ int loops = delay * (__ticket_t)(inc.tail - inc.head);
  while (loops--)
  cpu_relax();
 
  inc.head = ACCESS_ONCE(lock->tickets.head);
- if (inc.head == inc.tail)
+ if (inc.head == inc.tail) {
+ /* Decrease the delay, since we may have overslept. */
+ if (delay > MIN_SPINLOCK_DELAY)
+ delay--;
  break;
+ }
+
+ /*
+ * The lock is still busy, the delay was not long enough.
+ * Going through here 2.7 times will, on average, cancel
+ * out the decrement above. Using a non-integer number
+ * gets rid of performance artifacts and reduces oversleeping.
+ */
+ if (delay < MAX_SPINLOCK_DELAY &&
+ (!(inc.head & 3) == 0 || (inc.head & 7) == 1))
+ delay++;
  }
+ *delay_ptr = delay;
 }
 
 /*


[RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor

Rik van Riel
Argh, the first one had a typo in it that did not influence
performance with fewer threads running, but that made things
worse with more than a dozen threads...

Please let me know if you can break these patches.
---8<---
Subject: x86,smp: auto tune spinlock backoff delay factor

Many spinlocks are embedded in data structures; having many CPUs
pounce on the cache line the lock is in will slow down the lock
holder, and can cause system performance to fall off a cliff.

The paper "Non-scalable locks are dangerous" is a good reference:

        http://pdos.csail.mit.edu/papers/linux:lock.pdf

In the Linux kernel, spinlocks are optimized for the case of
there not being contention. After all, if there is contention,
the data structure can be improved to reduce or eliminate
lock contention.

Likewise, the spinlock API should remain simple, and the
common case of the lock not being contended should remain
as fast as ever.

However, since spinlock contention should be fairly uncommon,
we can add functionality into the spinlock slow path that keeps
system performance from falling off a cliff when there is lock
contention.

Proportional delay in ticket locks means delaying between checks
of the ticket, with the delay scaled by a delay factor and by the
number of CPUs ahead of us in the queue for this lock. Checking
the lock less often allows the lock holder to continue running,
resulting in better throughput and preventing performance from
dropping off a cliff.

Proportional spinlock delay with a high delay factor works well
when there is lots of contention on a lock. Likewise, a smaller
delay factor works well when a lock is lightly contended.

Making the code auto-tune the delay factor results in a system
that performs well with both light and heavy lock contention.

Signed-off-by: Rik van Riel <[hidden email]>
---
 arch/x86/kernel/smp.c |   49 ++++++++++++++++++++++++++++++++++++++++++++++---
 1 files changed, 46 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 4e44840..e44c56f 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -113,19 +113,62 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
 static bool smp_no_nmi_ipi = false;
 
 /*
- * Wait on a congested ticket spinlock.
+ * Wait on a congested ticket spinlock. Many spinlocks are embedded in
+ * data structures; having many CPUs pounce on the cache line with the
+ * spinlock simultaneously can slow down the lock holder, and the system
+ * as a whole.
+ *
+ * To prevent total performance collapse in case of bad spinlock contention,
+ * perform proportional backoff. The per-cpu value of delay is automatically
+ * tuned to limit the number of times spinning CPUs poll the lock before
+ * obtaining it. This limits the amount of cross-CPU traffic required to obtain
+ * a spinlock, and keeps system performance from dropping off a cliff.
+ *
+ * There is a tradeoff. If we poll too often, the whole system is slowed
+ * down. If we sleep too long, the lock will go unused for a period of
+ * time. Adjusting "delay" to poll, on average, 2.7 times before the
+ * lock is obtained seems to result in low bus traffic. The combination
+ * of aiming for a non-integer amount of average polls, and scaling the
+ * sleep period proportionally to how many CPUs are ahead of us in the
+ * queue for this ticket lock seems to reduce the amount of time spent
+ * "oversleeping" the release of the lock.
  */
+#define MIN_SPINLOCK_DELAY 1
+#define MAX_SPINLOCK_DELAY 1000
+DEFINE_PER_CPU(int, spinlock_delay) = { MIN_SPINLOCK_DELAY };
 void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
 {
+ /*
+ * Use the raw per-cpu pointer; preemption is disabled in the
+ * spinlock code. This avoids put_cpu_var once we have the lock.
+ */
+ int *delay_ptr = &per_cpu(spinlock_delay, smp_processor_id());
+ int delay = *delay_ptr;
+
  for (;;) {
- int loops = 50 * (__ticket_t)(inc.tail - inc.head);
+ int loops = delay * (__ticket_t)(inc.tail - inc.head);
  while (loops--)
  cpu_relax();
 
  inc.head = ACCESS_ONCE(lock->tickets.head);
- if (inc.head == inc.tail)
+ if (inc.head == inc.tail) {
+ /* Decrease the delay, since we may have overslept. */
+ if (delay > MIN_SPINLOCK_DELAY)
+ delay--;
  break;
+ }
+
+ /*
+ * The lock is still busy, the delay was not long enough.
+ * Going through here 2.7 times will, on average, cancel
+ * out the decrement above. Using a non-integer number
+ * gets rid of performance artifacts and reduces oversleeping.
+ */
+ if (delay < MAX_SPINLOCK_DELAY &&
+ ((inc.head & 3) == 0 || (inc.head & 7) == 1))
+ delay++;
  }
+ *delay_ptr = delay;
 }
 
 /*


Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor

Eric Dumazet-2
On Fri, 2012-12-21 at 18:56 -0500, Rik van Riel wrote:
> Argh, the first one had a typo in it that did not influence
> performance with fewer threads running, but that made things
> worse with more than a dozen threads...

> +
> + /*
> + * The lock is still busy, the delay was not long enough.
> + * Going through here 2.7 times will, on average, cancel
> + * out the decrement above. Using a non-integer number
> + * gets rid of performance artifacts and reduces oversleeping.
> + */
> + if (delay < MAX_SPINLOCK_DELAY &&
> + ((inc.head & 3) == 0 || (inc.head & 7) == 1))
> + delay++;

((inc.head & 3) == 0 || (inc.head & 7) == 1)) seems a strange condition
to me...



Re: [RFC PATCH 3/3] x86,smp: auto tune spinlock backoff delay factor

David Daney-5
In reply to this post by Rik van Riel
On 12/21/2012 03:51 PM, Rik van Riel wrote:
> Subject: x86,smp: auto tune spinlock backoff delay factor

Nice work,

A couple of comments...

>
> Many spinlocks are embedded in data structures; having many CPUs
> pounce on the cache line the lock is in will slow down the lock
> holder, and can cause system performance to fall off a cliff.
>
> The paper "Non-scalable locks are dangerous" is a good reference:
>
> http://pdos.csail.mit.edu/papers/linux:lock.pdf
>
> In the Linux kernel, spinlocks are optimized for the case of
> there not being contention. After all, if there is contention,
> the data structure can be improved to reduce or eliminate
> lock contention.
>
> Likewise, the spinlock API should remain simple, and the
> common case of the lock not being contended should remain
> as fast as ever.
>
> However, since spinlock contention should be fairly uncommon,
> we can add functionality into the spinlock slow path that keeps
> system performance from falling off a cliff when there is lock
> contention.
>
> Proportional delay in ticket locks is delaying the time between
> checking the ticket based on a delay factor, and the number of
> CPUs ahead of us in the queue for this lock. Checking the lock
> less often allows the lock holder to continue running, resulting
> in better throughput and preventing performance from dropping
> off a cliff.
>
> Proportional spinlock delay with a high delay factor works well
> when there is lots contention on a lock. Likewise, a smaller
> delay factor works well when a lock is lightly contended.
>
> Making the code auto-tune the delay factor results in a system
> that performs well with both light and heavy lock contention.
>
> Signed-off-by: Rik van Riel <[hidden email]>
> ---
>   arch/x86/kernel/smp.c |   49 ++++++++++++++++++++++++++++++++++++++++++++++---
>   1 files changed, 46 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
> index 1c1eb02..8786476 100644
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -113,19 +113,62 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
>   static bool smp_no_nmi_ipi = false;
>
>   /*
> - * Wait on a congested ticket spinlock.
> + * Wait on a congested ticket spinlock. Many spinlocks are embedded in
> + * data structures; having many CPUs pounce on the cache line with the
> + * spinlock simultaneously can slow down the lock holder, and the system
> + * as a whole.
> + *
> + * To prevent total performance collapse in case of bad spinlock contention,
> + * perform proportional backoff. The per-cpu value of delay is automatically
> + * tuned to limit the number of times spinning CPUs poll the lock before
> + * obtaining it. This limits the amount of cross-CPU traffic required to obtain
> + * a spinlock, and keeps system performance from dropping off a cliff.
> + *
> + * There is a tradeoff. If we poll too often, the whole system is slowed
> + * down. If we sleep too long, the lock will go unused for a period of
> + * time. Adjusting "delay" to poll, on average, 2.7 times before the
> + * lock is obtained seems to result in low bus traffic. The combination
> + * of aiming for a non-integer amount of average polls, and scaling the
> + * sleep period proportionally to how many CPUs are ahead of us in the
> + * queue for this ticket lock seems to reduce the amount of time spent
> + * "oversleeping" the release of the lock.
>    */
> +#define MIN_SPINLOCK_DELAY 1
> +#define MAX_SPINLOCK_DELAY 1000
> +DEFINE_PER_CPU(int, spinlock_delay) = { MIN_SPINLOCK_DELAY };


This gives the same delay for all locks in the system, but the amount of
work done under each lock is different.  So, for any given lock, the
delay is not optimal.

This is an untested idea that came to me after looking at this:

o Assume that for any given lock, the optimal delay is the same for all
CPUs in the system.

o Store a per-lock delay value in arch_spinlock_t.

o Once a CPU owns the lock it can update the delay as you do for the
per_cpu version.  Tuning the delay on fewer of the locking operations
reduces bus traffic, but makes it converge more slowly.

o Bonus points if you can update the delay as part of the releasing store.
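
In untested pseudo-code, the per-lock variant might look something
like the sketch below; the "delay" field in arch_spinlock_t is an
invented name, and none of this has been benchmarked:

/* Untested sketch of the per-lock idea above.  Assumes arch_spinlock_t
 * grew a "delay" field (invented here).  Only the new lock owner
 * writes the tuned value back, so the store shares the cache line
 * traffic the acquisition already caused. */
void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
{
	int delay = ACCESS_ONCE(lock->delay);	/* racy read, used as a hint */

	for (;;) {
		int loops = delay * (__ticket_t)(inc.tail - inc.head);
		while (loops--)
			cpu_relax();

		inc.head = ACCESS_ONCE(lock->tickets.head);
		if (inc.head == inc.tail) {
			/* Got the lock; we may have overslept. */
			if (delay > MIN_SPINLOCK_DELAY)
				delay--;
			break;
		}
		if (delay < MAX_SPINLOCK_DELAY &&
		    ((inc.head & 3) == 0 || (inc.head & 7) == 1))
			delay++;
	}
	lock->delay = delay;	/* we hold the lock at this point */
}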

>   void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
>   {
> + /*
> + * Use the raw per-cpu pointer; preemption is disabled in the
> + * spinlock code. This avoids put_cpu_var once we have the lock.
> + */
> + int *delay_ptr = &per_cpu(spinlock_delay, smp_processor_id());
> + int delay = *delay_ptr;
> +
>   for (;;) {
> - int loops = 50 * (__ticket_t)(inc.tail - inc.head);
> + int loops = delay * (__ticket_t)(inc.tail - inc.head);
>   while (loops--)
>   cpu_relax();
>
>   inc.head = ACCESS_ONCE(lock->tickets.head);
> - if (inc.head == inc.tail)
> + if (inc.head == inc.tail) {
> + /* Decrease the delay, since we may have overslept. */
> + if (delay > MIN_SPINLOCK_DELAY)
> + delay--;
>   break;
> + }
> +
> + /*
> + * The lock is still busy, the delay was not long enough.
> + * Going through here 2.7 times will, on average, cancel
> + * out the decrement above. Using a non-integer number
> + * gets rid of performance artifacts and reduces oversleeping.
> + */
> + if (delay < MAX_SPINLOCK_DELAY &&
> + (!(inc.head & 3) == 0 || (inc.head & 7) == 1))
> + delay++;
>   }
> + *delay_ptr = delay;
>   }
>
>   /*
>

Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor

Eric Dumazet-2
In reply to this post by Rik van Riel
On Fri, 2012-12-21 at 18:56 -0500, Rik van Riel wrote:
> Argh, the first one had a typo in it that did not influence
> performance with fewer threads running, but that made things
> worse with more than a dozen threads...
>
> Please let me know if you can break these patches.
> ---8<---
> Subject: x86,smp: auto tune spinlock backoff delay factor

> +#define MIN_SPINLOCK_DELAY 1
> +#define MAX_SPINLOCK_DELAY 1000
> +DEFINE_PER_CPU(int, spinlock_delay) = { MIN_SPINLOCK_DELAY };

Using a single spinlock_delay per cpu assumes there is a single
contended spinlock on the machine, or that contended
spinlocks protect the same critical section.

Given that we probably know where the contended spinlocks are, couldn't
we use a real scalable implementation for them?

A known contended one is the Qdisc lock in network layer. We added a
second lock (busylock) to lower a bit the pressure on a separate cache
line, but a scalable lock would be much better...

I guess there are patent issues...



Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor

Rik van Riel
In reply to this post by Eric Dumazet-2
On 12/21/2012 07:18 PM, Eric Dumazet wrote:

> On Fri, 2012-12-21 at 18:56 -0500, Rik van Riel wrote:
>> Argh, the first one had a typo in it that did not influence
>> performance with fewer threads running, but that made things
>> worse with more than a dozen threads...
>
>> +
>> + /*
>> + * The lock is still busy, the delay was not long enough.
>> + * Going through here 2.7 times will, on average, cancel
>> + * out the decrement above. Using a non-integer number
>> + * gets rid of performance artifacts and reduces oversleeping.
>> + */
>> + if (delay < MAX_SPINLOCK_DELAY &&
>> + ((inc.head & 3) == 0 || (inc.head & 7) == 1))
>> + delay++;
>
> ((inc.head & 3) == 0 || (inc.head & 7) == 1)) seems a strange condition
> to me...

It is. It turned out that doing the increment
every 4 times (just the first check) resulted
in odd performance artifacts when running with
4, 8, 12 or 16 CPUs.

Moving to the above got rid of the performance
artifact.

It also results in aiming for a sleep period
that is not an exact multiple of the lock
acquiring period, which results in less
"oversleeping", and measurably better
performance.
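
(Spelling the arithmetic out, as I read the code: the condition
((inc.head & 3) == 0 || (inc.head & 7) == 1) is true for 3 out of
every 8 values of inc.head, so delay gets incremented on roughly
3/8 of the unsuccessful polls, while each successful acquisition
decrements it by one. The two cancel out when a waiter makes about
8/3 ~= 2.7 unsuccessful polls per lock acquisition, which is where
the 2.7 target comes from.)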


Re: [RFC PATCH 3/3] x86,smp: auto tune spinlock backoff delay factor

Rik van Riel
In reply to this post by David Daney-5
On 12/21/2012 07:47 PM, David Daney wrote:

>> +#define MIN_SPINLOCK_DELAY 1
>> +#define MAX_SPINLOCK_DELAY 1000
>> +DEFINE_PER_CPU(int, spinlock_delay) = { MIN_SPINLOCK_DELAY };
>
>
> This gives the same delay for all locks in the system, but the amount of
> work done under each lock is different.  So, for any given lock, the
> delay is not optimal.
>
> This is an untested idea that came to me after looking at this:
>
> o Assume that for any given lock, the optimal delay is the same for all
> CPUs in the system.
 >
> o Store a per-lock delay value in arch_spinlock_t.
>
> o Once a CPU owns the lock it can update the delay as you do for the
> per_cpu version.  Tuning the delay on fewer of the locking operations
> reduces bus traffic, but makes it converge more slowly.
>
> o Bonus points if you can update the delay as part of the releasing store.

It would absolutely have to be part of the same load and
store cycle, otherwise we would increase bus traffic and
defeat the purpose.

However, since spinlock contention should not be the
usual state, and all a scalable lock does is make sure
that N+1 CPUs do not perform worse than N CPUs, using
scalable locks is a stop-gap measure.

I believe a stop-gap measure should be kept as simple as
we can. I am willing to consider moving to a per-lock
delay factor if we can figure out an easy way to do it,
but I would like to avoid too much extra complexity...

Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor

Rik van Riel
In reply to this post by Eric Dumazet-2
On 12/21/2012 07:48 PM, Eric Dumazet wrote:

> On Fri, 2012-12-21 at 18:56 -0500, Rik van Riel wrote:
>> Argh, the first one had a typo in it that did not influence
>> performance with fewer threads running, but that made things
>> worse with more than a dozen threads...
>>
>> Please let me know if you can break these patches.
>> ---8<---
>> Subject: x86,smp: auto tune spinlock backoff delay factor
>
>> +#define MIN_SPINLOCK_DELAY 1
>> +#define MAX_SPINLOCK_DELAY 1000
>> +DEFINE_PER_CPU(int, spinlock_delay) = { MIN_SPINLOCK_DELAY };
>
> Using a single spinlock_delay per cpu assumes there is a single
> contended spinlock on the machine, or that contended
> spinlocks protect the same critical section.

The goal is to reduce bus traffic, and keep total
system performance from falling through the floor.

If we have one lock that takes N cycles to acquire,
and a second contended lock that takes N*2 cycles
to acquire, checking the first lock fewer times
before acquisition, and the second lock more times,
should still result in similar average system
throughput.

I suspect this approach should work well if we have
multiple contended locks in the system.

> Given that we probably know where the contended spinlocks are, couldnt
> we use a real scalable implementation for them ?

The scalable locks tend to have a slightly more
complex locking API, resulting in a slightly
higher overhead in the non-contended (normal)
case.  That means we cannot use them everywhere.

Also, scalable locks merely make sure that N+1
CPUs perform the same as N CPUs when there is
lock contention.  They do not cause the system
to actually scale.

For actual scalability, the data structure would
need to be changed, so locking requirements are
better.

> A known contended one is the Qdisc lock in network layer. We added a
> second lock (busylock) to lower a bit the pressure on a separate cache
> line, but a scalable lock would be much better...

My locking patches are meant for dealing with the
offenders we do not know about, to make sure that
system performance does not fall off a cliff when
we run into a surprise.

Known scalability bugs we can fix.

Unknown ones should not cause somebody's system
to fail.

> I guess there are patent issues...

At least one of the scalable lock implementations has been
known since 1991, so there should not be any patent issues
with that one.


Re: [RFC PATCH 1/3] x86,smp: move waiting on contended lock out of line

Steven Rostedt
In reply to this post by Rik van Riel
On Fri, Dec 21, 2012 at 06:50:38PM -0500, Rik van Riel wrote:
> Subject: x86,smp: move waiting on contended ticket lock out of line
>
> Moving the wait loop for congested loops to its own function allows
> us to add things to that wait loop, without growing the size of the
> kernel text appreciably.
>
> Signed-off-by: Rik van Riel <[hidden email]>

Reviewed-by: Steven Rostedt <[hidden email]>

-- Steve


Re: [RFC PATCH 2/3] x86,smp: proportional backoff for ticket spinlocks

Steven Rostedt
In reply to this post by Rik van Riel
On Fri, Dec 21, 2012 at 06:51:15PM -0500, Rik van Riel wrote:

> Subject: x86,smp: proportional backoff for ticket spinlocks
>
> Simple fixed value proportional backoff for ticket spinlocks.
> By pounding on the cacheline with the spin lock less often,
> bus traffic is reduced. In cases of a data structure with
> embedded spinlock, the lock holder has a better chance of
> making progress.
>
> Signed-off-by: Rik van Riel <[hidden email]>
> ---
>  arch/x86/kernel/smp.c |    6 ++++--
>  1 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
> index 20da354..4e44840 100644
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -118,9 +118,11 @@ static bool smp_no_nmi_ipi = false;
>  void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
>  {
>   for (;;) {
> - cpu_relax();
> - inc.head = ACCESS_ONCE(lock->tickets.head);
> + int loops = 50 * (__ticket_t)(inc.tail - inc.head);
> + while (loops--)
> + cpu_relax();

-ENOCOMMENT

Please add a comment above to explain what it's doing. Don't expect
people to check change logs. Also, explain why you picked 50.

-- Steve

>  
> + inc.head = ACCESS_ONCE(lock->tickets.head);
>   if (inc.head == inc.tail)
>   break;
>   }
>

Re: [RFC PATCH 2/3] x86,smp: proportional backoff for ticket spinlocks

Steven Rostedt
On Fri, Dec 21, 2012 at 10:07:56PM -0500, Steven Rostedt wrote:

> > diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
> > index 20da354..4e44840 100644
> > --- a/arch/x86/kernel/smp.c
> > +++ b/arch/x86/kernel/smp.c
> > @@ -118,9 +118,11 @@ static bool smp_no_nmi_ipi = false;
> >  void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
> >  {
> >   for (;;) {
> > - cpu_relax();
> > - inc.head = ACCESS_ONCE(lock->tickets.head);
> > + int loops = 50 * (__ticket_t)(inc.tail - inc.head);
> > + while (loops--)
> > + cpu_relax();
>
> -ENOCOMMENT
>
> Please add a comment above to explain what it's doing. Don't expect
> people to check change logs. Also, explain why you picked 50.
>

OK, I replied here before reading patch 3 (still reviewing it). Why have
this patch at all? Just to test if you broke something between this and
patch 3? Or perhaps patch 3 may not get accepted? In that case, you
would still need a comment.

Either explicitly state that this patch is just a stepping stone for
patch 3, and will either be accepted or rejected along with patch 3. Or
keep it as a stand-alone patch and add comments as such. Or just get rid
of it altogether.

Thanks,

-- Steve


Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor

Eric Dumazet-2
In reply to this post by Rik van Riel
On Fri, 2012-12-21 at 18:56 -0500, Rik van Riel wrote:
> + int *delay_ptr = &per_cpu(spinlock_delay, smp_processor_id());
> + int delay = *delay_ptr;

int delay = __this_cpu_read(spinlock_delay);

>   }
> + *delay_ptr = delay;

__this_cpu_write(spinlock_delay, delay);
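
Put together, the wait function with those accessors would look
roughly like this (just a sketch of the suggestion, not the actual
updated patch):

void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
{
	/* Preemption is already disabled here, so the raw ops are fine. */
	int delay = __this_cpu_read(spinlock_delay);

	for (;;) {
		int loops = delay * (__ticket_t)(inc.tail - inc.head);
		while (loops--)
			cpu_relax();

		inc.head = ACCESS_ONCE(lock->tickets.head);
		if (inc.head == inc.tail) {
			/* Decrease the delay, since we may have overslept. */
			if (delay > MIN_SPINLOCK_DELAY)
				delay--;
			break;
		}
		if (delay < MAX_SPINLOCK_DELAY &&
		    ((inc.head & 3) == 0 || (inc.head & 7) == 1))
			delay++;
	}
	__this_cpu_write(spinlock_delay, delay);
}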



Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor

Steven Rostedt
In reply to this post by Rik van Riel
On Fri, Dec 21, 2012 at 06:56:13PM -0500, Rik van Riel wrote:

> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
> index 4e44840..e44c56f 100644
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -113,19 +113,62 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
>  static bool smp_no_nmi_ipi = false;
>  
>  /*
> - * Wait on a congested ticket spinlock.
> + * Wait on a congested ticket spinlock. Many spinlocks are embedded in
> + * data structures; having many CPUs pounce on the cache line with the
> + * spinlock simultaneously can slow down the lock holder, and the system
> + * as a whole.
> + *
> + * To prevent total performance collapse in case of bad spinlock contention,
> + * perform proportional backoff. The per-cpu value of delay is automatically
> + * tuned to limit the number of times spinning CPUs poll the lock before
> + * obtaining it. This limits the amount of cross-CPU traffic required to obtain
> + * a spinlock, and keeps system performance from dropping off a cliff.
> + *
> + * There is a tradeoff. If we poll too often, the whole system is slowed
> + * down. If we sleep too long, the lock will go unused for a period of
> + * time. Adjusting "delay" to poll, on average, 2.7 times before the
> + * lock is obtained seems to result in low bus traffic. The combination
> + * of aiming for a non-integer amount of average polls, and scaling the
> + * sleep period proportionally to how many CPUs are ahead of us in the
> + * queue for this ticket lock seems to reduce the amount of time spent
> + * "oversleeping" the release of the lock.
>   */
> +#define MIN_SPINLOCK_DELAY 1
> +#define MAX_SPINLOCK_DELAY 1000
> +DEFINE_PER_CPU(int, spinlock_delay) = { MIN_SPINLOCK_DELAY };
>  void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
>  {
> + /*
> + * Use the raw per-cpu pointer; preemption is disabled in the
> + * spinlock code. This avoids put_cpu_var once we have the lock.
> + */
> + int *delay_ptr = &per_cpu(spinlock_delay, smp_processor_id());
> + int delay = *delay_ptr;

I'm confused by the above comment. Why not just:

        int delay = this_cpu_read(spinlock_delay);
?

> +
>   for (;;) {
> - int loops = 50 * (__ticket_t)(inc.tail - inc.head);
> + int loops = delay * (__ticket_t)(inc.tail - inc.head);
>   while (loops--)
>   cpu_relax();
>  
>   inc.head = ACCESS_ONCE(lock->tickets.head);
> - if (inc.head == inc.tail)
> + if (inc.head == inc.tail) {
> + /* Decrease the delay, since we may have overslept. */
> + if (delay > MIN_SPINLOCK_DELAY)
> + delay--;
>   break;
> + }
> +
> + /*
> + * The lock is still busy, the delay was not long enough.
> + * Going through here 2.7 times will, on average, cancel
> + * out the decrement above. Using a non-integer number
> + * gets rid of performance artifacts and reduces oversleeping.
> + */
> + if (delay < MAX_SPINLOCK_DELAY &&
> + ((inc.head & 3) == 0 || (inc.head & 7) == 1))
> + delay++;
>   }
> + *delay_ptr = delay;

        this_cpu_write(spinlock_delay, delay);

Too bad you posted this just before break. I currently have access to a
40 core box, and I would have loved to test this. But right now I have
it testing other things, and hopefully I'll still have access to it
after the break.

-- Steve

>  }
>  
>  /*
>

Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor

Rik van Riel
In reply to this post by Eric Dumazet-2
On 12/21/2012 10:29 PM, Eric Dumazet wrote:

> On Fri, 2012-12-21 at 18:56 -0500, Rik van Riel wrote:
>> + int *delay_ptr = &per_cpu(spinlock_delay, smp_processor_id());
>> + int delay = *delay_ptr;
>
> int delay = __this_cpu_read(spinlock_delay);
>
>>   }
>> + *delay_ptr = delay;
>
> __this_cpu_write(spinlock_delay, delay);

Thanks for that cleanup. I have applied it to my code.


Re: [RFC PATCH 2/3] x86,smp: proportional backoff for ticket spinlocks

Rik van Riel
In reply to this post by Steven Rostedt
On 12/21/2012 10:14 PM, Steven Rostedt wrote:

> OK, I replied here before reading patch 3 (still reviewing it). Why have
> this patch at all? Just to test if you broke something between this and
> patch 3? Or perhaps patch 3 may not get accepted? In that case, you
> would still need a comment.
>
> Either explicitly state that this patch is just a stepping stone for
> patch 3, and will either be accepted or rejected along with patch 3. Or
> keep it as a stand alone patch and add comments as such. Or just get rid
> of it all together.

I will document this patch better, explaining that it is
a stepping stone, that the number 50 is likely to be
wrong for many systems, and that the next patch fixes
things, using this text in the changelog:



The number 50 is likely to be wrong for many setups, and
this patch is mostly to illustrate the concept of proportional
backoff. The next patch automatically tunes the delay value.


Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor

Rik van Riel
In reply to this post by Steven Rostedt
On 12/21/2012 10:33 PM, Steven Rostedt wrote:

> On Fri, Dec 21, 2012 at 06:56:13PM -0500, Rik van Riel wrote:
>> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
>> index 4e44840..e44c56f 100644
>> --- a/arch/x86/kernel/smp.c
>> +++ b/arch/x86/kernel/smp.c
>> @@ -113,19 +113,62 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
>>   static bool smp_no_nmi_ipi = false;
>>
>>   /*
>> - * Wait on a congested ticket spinlock.
>> + * Wait on a congested ticket spinlock. Many spinlocks are embedded in
>> + * data structures; having many CPUs pounce on the cache line with the
>> + * spinlock simultaneously can slow down the lock holder, and the system
>> + * as a whole.
>> + *
>> + * To prevent total performance collapse in case of bad spinlock contention,
>> + * perform proportional backoff. The per-cpu value of delay is automatically
>> + * tuned to limit the number of times spinning CPUs poll the lock before
>> + * obtaining it. This limits the amount of cross-CPU traffic required to obtain
>> + * a spinlock, and keeps system performance from dropping off a cliff.
>> + *
>> + * There is a tradeoff. If we poll too often, the whole system is slowed
>> + * down. If we sleep too long, the lock will go unused for a period of
>> + * time. Adjusting "delay" to poll, on average, 2.7 times before the
>> + * lock is obtained seems to result in low bus traffic. The combination
>> + * of aiming for a non-integer amount of average polls, and scaling the
>> + * sleep period proportionally to how many CPUs are ahead of us in the
>> + * queue for this ticket lock seems to reduce the amount of time spent
>> + * "oversleeping" the release of the lock.
>>    */
>> +#define MIN_SPINLOCK_DELAY 1
>> +#define MAX_SPINLOCK_DELAY 1000
>> +DEFINE_PER_CPU(int, spinlock_delay) = { MIN_SPINLOCK_DELAY };
>>   void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
>>   {
>> + /*
>> + * Use the raw per-cpu pointer; preemption is disabled in the
>> + * spinlock code. This avoids put_cpu_var once we have the lock.
>> + */
>> + int *delay_ptr = &per_cpu(spinlock_delay, smp_processor_id());
>> + int delay = *delay_ptr;
>
> I'm confused by the above comment. Why not just:
>
> int delay = this_cpu_read(spinlock_delay);
> ?

Eric Dumazet pointed out the same thing.  My code now
uses __this_cpu_read and __this_cpu_write.

> Too bad you posted this just before break. I currently have access to a
> 40 core box, and I would have loved to test this. But right now I have
> it testing other things, and hopefully I'll still have access to it
> after the break.

I will try to run this test on a really large SMP system
in the lab during the break.

Ideally, the auto-tuning will keep the delay value large
enough that performance will stay flat even when there are
100 CPUs contending over the same lock.

Maybe it turns out that the maximum allowed delay value
needs to be larger.  Only one way to find out...


Re: [RFC PATCH 3/3] x86,smp: auto tune spinlock backoff delay factor

Steven Rostedt
In reply to this post by Rik van Riel
On Fri, Dec 21, 2012 at 09:51:35PM -0500, Rik van Riel wrote:

> On 12/21/2012 07:47 PM, David Daney wrote:
>
> >>+#define MIN_SPINLOCK_DELAY 1
> >>+#define MAX_SPINLOCK_DELAY 1000
> >>+DEFINE_PER_CPU(int, spinlock_delay) = { MIN_SPINLOCK_DELAY };
> >
> >
> >This gives the same delay for all locks in the system, but the amount of
> >work done under each lock is different.  So, for any given lock, the
> >delay is not optimal.
> >
> >This is an untested idea that came to me after looking at this:
> >
> >o Assume that for any given lock, the optimal delay is the same for all
> >CPUs in the system.
> >
> >o Store a per-lock delay value in arch_spinlock_t.

This can bloat the data structures. I would like to avoid that.

> >
> >o Once a CPU owns the lock it can update the delay as you do for the
> >per_cpu version.  Tuning the delay on fewer of the locking operations
> >reduces bus traffic, but makes it converge more slowly.
> >
> >o Bonus points if you can update the delay as part of the releasing store.
>
> It would absolutely have to be part of the same load and
> store cycle, otherwise we would increase bus traffic and
> defeat the purpose.
>
> However, since spinlock contention should not be the
> usual state, and all a scalable lock does is make sure
> that N+1 CPUs does not perform worse than N CPUs, using
> scalable locks is a stop-gap measure.
>
> I believe a stop-gap measure should be kept as simple as
> we can. I am willing to consider moving to a per-lock
> delay factor if we can figure out an easy way to do it,
> but I would like to avoid too much extra complexity...

Rik,

I like your solution. It's rather simple and simple solutions tend to
end up being the closest to optimal. The more complex a solution gets,
the more it starts chasing fireflies.

Locks that are not likely to be contended will most likely not hit this
code, as it only gets triggered when contended. Now, really the wait
will be on the order of the size of the critical section. If quick locks
get hit a lot, the auto delay will shrink. If long critical sections
start getting contention, the auto delay will grow. But the general
delay should become a balance in the system that should be ideal. Kind
of like the NUMA balancing ;-)

Anyway, I'd like to see this code tested, and more benchmarks run
against it.

Thanks,

-- Steve

