
[RFC PATCH 0/3] x86,smp: make ticket spinlock proportional backoff w/ auto tuning


Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor

Michel Lespinasse
On Wed, Dec 26, 2012 at 11:10 AM, Eric Dumazet <[hidden email]> wrote:

> I did some tests with your patches with following configuration :
>
> tc qdisc add dev eth0 root htb r2q 1000 default 3
> (to force a contention on qdisc lock, even with a multi queue net
> device)
>
> and 24 concurrent "netperf -t UDP_STREAM -H other_machine -- -m 128"
>
> Machine : 2 Intel(R) Xeon(R) CPU X5660  @ 2.80GHz
> (24 threads), and a fast NIC (10Gbps)
>
> Resulting in a 13 % regression (676 Mbits -> 595 Mbits)

I've been trying to use this workload on a similar machine. I am
getting some confusing results however:

with 24 concurrent netperf -t UDP_STREAM -H $target -- -m 128 -R 1, I
am seeing some non-trivial run-to-run performance variation: about 5%
in the v3.7 baseline, but much larger after applying Rik's 3 patches.
My last few runs gave me results of 890.92, 1073.74, 963.13, 1234.41,
754.18, 893.82. This is generally better than what I'm getting with
baseline, but the variance is huge (which is somewhat surprising given
that Rik's patches don't have the issue of hash collisions). Also,
this is significant in that I am not seeing the regression you were
observing with just these 3 patches.

If I add a 1 second delay in the netperf command line (netperf -t
UDP_STREAM -s 1 -H lpk18 -- -m 128 -R 1), I am seeing a very constant
660 Mbps result, but then I don't see any benefit from applying Rik's
patches. I have no explanation for these results, but I am getting
them very consistently...

> In this workload we have at least two contended spinlocks, with
> different delays. (spinlocks are not held for the same duration)

Just to confirm, I believe you are referring to qdisc->q.lock and
qdisc->busylock ?

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

ticket spinlock proportional backoff experiments

Michel Lespinasse
In reply to this post by Rik van Riel
On Fri, Dec 21, 2012 at 3:49 PM, Rik van Riel <[hidden email]> wrote:

> Many spinlocks are embedded in data structures; having many CPUs
> pounce on the cache line the lock is in will slow down the lock
> holder, and can cause system performance to fall off a cliff.
>
> The paper "Non-scalable locks are dangerous" is a good reference:
>         http://pdos.csail.mit.edu/papers/linux:lock.pdf
>
> Proportional delay in ticket locks is delaying the time between
> checking the ticket based on a delay factor, and the number of
> CPUs ahead of us in the queue for this lock. Checking the lock
> less often allows the lock holder to continue running, resulting
> in better throughput and preventing performance from dropping
> off a cliff.
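
In code terms, the scheme described above amounts to roughly the
following user-space sketch; the names, the C11 atomics, and the fixed
delay_factor value are mine, purely for illustration:

#include <stdatomic.h>

struct ticket_lock {
	atomic_uint next;	/* next ticket number to hand out */
	atomic_uint head;	/* ticket currently holding the lock */
};

static unsigned int delay_factor = 100;	/* the tuning value at issue */

static void ticket_lock(struct ticket_lock *lock)
{
	unsigned int ticket = atomic_fetch_add(&lock->next, 1);
	unsigned int head;

	while ((head = atomic_load(&lock->head)) != ticket) {
		/* back off in proportion to our distance from the head */
		unsigned int spins = (ticket - head) * delay_factor;
		while (spins--)
			__builtin_ia32_pause();	/* cpu_relax() equivalent */
	}
}

static void ticket_unlock(struct ticket_lock *lock)
{
	atomic_fetch_add(&lock->head, 1);
}

The whole question, then, is how delay_factor gets chosen.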
So as I mentioned before, I had a few ideas about trying to remove any
per-spinlock tunings from the equation and see if we can get good
performance with just one per-system tuning value. I recently had time
to run these experiments, so here are my current results.

I have based my experiments on v3.7 and used the synchro-test module
from Andrew's tree as an easily understandable synthetic workload (I
had to extend synchro-test to add a spinlock test, which I posted
separately to lkml a couple of days ago: "[PATCH 0/2] extend synchro-test
module to test spinlocks too").


I tested 3 software configurations:

base: v3.7 + synchro-test ("mutex subsystem, synchro-test module",
"mutex-subsystem-synchro-test-module-fix",
"mutex-subsystem-synchro-test-module-fix-2", "kernel: fix
synchro-test.c printk format warrnings" and "add spinlock test to
synchro-test module")

auto: base + Rik's proposed patches ("x86,smp: move waiting on
contended ticket lock out of line", "x86,smp: proportional backoff for
ticket spinlocks", "x86,smp: auto tune spinlock backoff delay factor")

mine: base + Rik's "x86,smp: move waiting on contended ticket lock out
of line" + my two patches (which I'll attach as replies) "x86,smp:
simplify __ticket_spin_lock" (should be just cosmetic) and my own
version of "x86,smp: proportional backoff for ticket spinlocks". This
version differs from Rik's principally by exporting the system-wide
tuning value through debugfs, and by using a delay of (ticket - head -
1) * spinlock_delay instead of just (ticket - head) *
spinlock_delay. Since spinlock_delay is a tunable parameter in my
experiments, I have been annotating my results with the spinlock_delay
value used in the experiment.
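
As a made-up example of the difference: a spinner holding ticket 7
while head is at 5 computes waiters_ahead = 7 - 5 - 1 = 1, so it waits
1 * spinlock_delay between polls where Rik's formula would wait
2 * spinlock_delay; in particular the thread next in line
(ticket - head == 1) gets a zero delay and just polls, for the reasons
explained in the second patch below.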


I have been testing on 3 machines:
mach1: 4 socket AMD Barcelona (16 cores total)
mach2: 2 socket Intel Westmere (12 cores / 24 threads total)
mach3: 2 socket Intel Sandybridge (16 cores / 32 threads total)


I ran two kinds of synthetic workloads using the synchro-test module:

noload: just N threads taking/releasing a single spinlock as fast as possible

load: N threads taking/releasing a single spinlock with 2us loops
between each operation (2us with spinlock held, 2us with spinlock
released).


The test script looks like the following:
for i in 1 2 3 4 6 8 12 16 20 24 28 32; do
  printf "%2d spinlockers: " $i
  modprobe synchro-test sp=$i load=0 interval=0 do_sched=1 v=1 2>/dev/null || \
    true
  dmesg | sed 's/.*] //' | awk '/^spinlocks taken: / {sp=$3} END {print sp}'
done

(This is for the noload workload; for the load workload the load=0
interval=0 parameters are simply removed, as sketched below.)
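
For the load runs, the inner invocation presumably reduces to
something like:

  modprobe synchro-test sp=$i do_sched=1 v=1 2>/dev/null || true

relying on the module's defaults to produce the 2us load and interval
(the exact default values being my assumption).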


For each machine, I collected the noload performance results
first. When testing my patches, I did several runs and manually
adjusted the spinlock_delay parameter in order to get the best
performance. There is a bit of a line to walk there: too high and we
oversleep, thus reducing the lock utilization; too low and we do
excessive polling which reduces the lock throughput. However the tests
show that there is a small range of parameter values (~375 on mach1,
~75 on mach2, ~45 on mach3) which result in good performance
independently of the number of spinlocker threads.

mach3 was the tougher machine to tune for here. A tuning value of 40
would actually have resulted in slightly higher performance at 24-32
threads, and a value of 60 would have helped a bit in the 6-12 threads
range. However, we are only talking about small percentage differences
here; I think the main goal is to avoid any large performance cliffs,
and we certainly achieve that (note that mach3 was already the best
behaved of the 3 here).


I then collected the "load" performance results (with a 2us load with
spinlock held, and a 2us interval with spinlock released). On each
machine, I collected baseline, auto (rik's proposal), my patches with
the spinlock_delay value that had been determined best for the
"noload" workload, and then I tried to manually tune the
spinlock_delay value for best performance on the "load" workload.


Analysis:

- In the "noload" workload, my proposal can (with some hand tuning)
  result in better performance than Rik's autotuned approach on mach1
  and mach2, and be at rough parity on mach3. I believe the benefits
  on mach1 and mach2 are due to achieving even lower coherency traffic
  than Rik's autotuned target of ~3.7 accesses per contended spinlock
  acquisition. As it turns out, it was possible to find a single
  spinlock_delay value that works well no matter how many spinlocker
  threads are running. On this point, I only hit partial success on
  mach3, as I was able to find a "decent all-around" value for
  spinlock_delay that reached parity with Rik's proposal, but did not
  beat it (and it seems that the optimal value did depend on the
  number of spinlocker threads in use).

- In the "load" workload, Rik's proposal performs poorly. This is
  especially apparent on mach2 and mach3, where the performance
  degradation due to coherency traffic in the baseline run wasn't very
  high in the first place; the extra delays introduced to try and
  avoid that coherency traffic end up hurting more than they help.

- In my proposal, tuning for the "noload" workload and applying that
  to the "load" workload seems to work well enough. It may be possible
  to get a few extra percent performance by using per-spinlock
  tunings, but the difference is low enough that I would think it's
  not worth the extra complexity. I like the simplicity of having one
  single per-system tuning value that works well for any spinlocks it
  might get applied to.

Of course, one major downside in my proposal is that I haven't figured
out an automatic way to find the most appropriate spinlock_delay
system tunable. So there is clearly some more experimentation needed
there. However, IMO the important result here is that our goal of
avoiding performance cliffs seems to be reachable without the
complexity (and IMO, risk) of per-spinlock tuning values.

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

[Attachments: mach1-load.png, mach1-noload.png, mach2-load.png,
mach2-noload.png, mach3-load.png, mach3-noload.png]

[PATCH 1/2] x86,smp: simplify __ticket_spin_lock

Michel Lespinasse
Just cosmetic - avoid an unnecessary goto construct

Signed-off-by: Michel Lespinasse <[hidden email]>

---
 arch/x86/include/asm/spinlock.h |    7 ++-----
 1 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 2a45eb0cdb2c..19e8a36b3b83 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -55,11 +55,8 @@ static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
 
  inc = xadd(&lock->tickets, inc);
 
- if (inc.head == inc.tail)
- goto out;
-
- ticket_spin_lock_wait(lock, inc);
- out:
+ if (inc.head != inc.tail)
+ ticket_spin_lock_wait(lock, inc);
  barrier(); /* make sure nothing creeps before the lock is taken */
 }
 
--
1.7.7.3

[PATCH 2/2] x86,smp: proportional backoff for ticket spinlocks

Michel Lespinasse
In reply to this post by Michel Lespinasse
Simple fixed value proportional backoff for ticket spinlocks.
By pounding on the cacheline with the spin lock less often,
bus traffic is reduced. In cases of a data structure with
embedded spinlock, the lock holder has a better chance of
making progress.

Note that when a thread notices that it's at the head of the line for
acquiring the spinlock, this thread has already touched the spinlock's
cacheline and now holds it in shared state. At this point, extra reads
to the cache line are local to the processor and do not generate any
extra coherency traffic, until another thread (probably the spinlock
owner) writes to it. When that write occurs, the writing thread will
get exclusive access to the cacheline first, and then the waiting
thread will ask to get its shared access again. It is expected that in
many cases, the writing thread will release the spinlock before the
waiting thread gets its shared access back. For these reasons, it
seems unproductive for the head waiter to throttle its accesses;
however we do want to throttle the other waiter threads so that they
don't generate any extra coherency traffic until they can acquire the
spinlock, or at least reach the head position among waiters.

Signed-off-by: Michel Lespinasse <[hidden email]>

---
 arch/x86/include/asm/spinlock.h |    2 ++
 arch/x86/kernel/smp.c           |   33 +++++++++++++++++++++++++++------
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 19e8a36b3b83..b49ae57a62c8 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -34,6 +34,8 @@
 # define UNLOCK_LOCK_PREFIX
 #endif
 
+extern unsigned int __read_mostly spinlock_delay;
+
 extern void ticket_spin_lock_wait(arch_spinlock_t *, struct __raw_tickets);
 
 /*
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 20da35427bd5..eb2c49c6cc08 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -23,6 +23,7 @@
 #include <linux/interrupt.h>
 #include <linux/cpu.h>
 #include <linux/gfp.h>
+#include <linux/debugfs.h>
 
 #include <asm/mtrr.h>
 #include <asm/tlbflush.h>
@@ -111,20 +112,40 @@
 
 static atomic_t stopping_cpu = ATOMIC_INIT(-1);
 static bool smp_no_nmi_ipi = false;
+unsigned int __read_mostly spinlock_delay = 1;
 
 /*
  * Wait on a congested ticket spinlock.
  */
 void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
 {
- for (;;) {
- cpu_relax();
- inc.head = ACCESS_ONCE(lock->tickets.head);
+ __ticket_t head = inc.head, ticket = inc.tail;
+ __ticket_t waiters_ahead;
+ unsigned delay;
+ do {
+ waiters_ahead = ticket - head - 1;
+ if (!waiters_ahead) {
+ do
+ cpu_relax();
+ while (ACCESS_ONCE(lock->tickets.head) != ticket);
+ return;
+ }
+ delay = waiters_ahead * spinlock_delay;
+ do
+ cpu_relax();
+ while (delay--);
+ head = ACCESS_ONCE(lock->tickets.head);
+ } while (head != ticket);
+}
 
- if (inc.head == inc.tail)
- break;
- }
+#ifdef CONFIG_DEBUG_FS
+static __init int spinlock_delay_init_debug(void)
+{
+ debugfs_create_u32("spinlock_delay", 0644, NULL, &spinlock_delay);
+ return 0;
 }
+late_initcall(spinlock_delay_init_debug);
+#endif
 
 /*
  * this function sends a 'reschedule' IPI to another CPU.
--
1.7.7.3

Re: [PATCH 1/2] x86,smp: simplify __ticket_spin_lock

Rik van Riel
In reply to this post by Michel Lespinasse
On 01/01/2013 07:09 PM, Michel Lespinasse wrote:
> Just cosmetic - avoid an unnecessary goto construct
>
> Signed-off-by: Michel Lespinasse <[hidden email]>

I have had this in my code base since you first suggested
it on December 21st. Thank you for sending in more improvements,
though :)

--
All rights reversed

Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor

Jan Beulich
In reply to this post by Rik van Riel
>>> On 27.12.12 at 20:09, Rik van Riel <[hidden email]> wrote:
> On 12/27/2012 01:41 PM, Jan Beulich wrote:
>>>>> Rik van Riel <[hidden email]> 12/27/12 4:01 PM >>>
>>> On 12/27/2012 09:27 AM, Eric Dumazet wrote:
>>>> So the hash sounds good to me, because the hash key could mix both lock
>>>> address and caller IP ( __builtin_return_address(1) in
>>>> ticket_spin_lock_wait())
>>>
>>> The lock acquisition time depends on the holder of the lock,
>>> and what the CPUs ahead of us in line will do with the lock,
>>> not on the caller IP of the spinner.
>>
>> The lock holder could supply its __builtin_return_address(0) for use
>> in eventual hashing.
>>
>> Also, with all of this - did you evaluate the alternative of using
>> monitor/mwait instead?
>
> How much bus traffic do monitor/mwait cause behind the scenes?

I would suppose that this just snoops the bus for writes, but the
amount of bus traffic involved in this isn't explicitly documented.

One downside of course is that unless a spin lock is made to occupy
exactly a cache line, false wakeups are possible.

Jan


Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor

Steven Rostedt
On Thu, 2013-01-03 at 09:05 +0000, Jan Beulich wrote:

> > How much bus traffic do monitor/mwait cause behind the scenes?
>
> I would suppose that this just snoops the bus for writes, but the
> amount of bus traffic involved in this isn't explicitly documented.
>
> One downside of course is that unless a spin lock is made to occupy
> exactly a cache line, false wakeups are possible.

And that would probably be very likely, as the whole purpose of Rik's
patches was to lower cache stalls due to other CPUs pounding on spin
locks that share the cache line of what is being protected (and
modified).

-- Steve



Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor

Eric Dumazet
On Thu, 2013-01-03 at 08:24 -0500, Steven Rostedt wrote:

> On Thu, 2013-01-03 at 09:05 +0000, Jan Beulich wrote:
>
> > > How much bus traffic do monitor/mwait cause behind the scenes?
> >
> > I would suppose that this just snoops the bus for writes, but the
> > amount of bus traffic involved in this isn't explicitly documented.
> >
> > One downside of course is that unless a spin lock is made to occupy
> > exactly a cache line, false wakeups are possible.
>
> And that would probably be very likely, as the whole purpose of Rik's
> patches was to lower cache stalls due to other CPUs pounding on spin
> locks that share the cache line of what is being protected (and
> modified).

A monitor/mwait would be an option only if using MCS (or K42 variant)
locks, where each cpu would wait on a private and dedicated cache line.




Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor

Steven Rostedt
On Thu, 2013-01-03 at 05:35 -0800, Eric Dumazet wrote:

> On Thu, 2013-01-03 at 08:24 -0500, Steven Rostedt wrote:
> > On Thu, 2013-01-03 at 09:05 +0000, Jan Beulich wrote:
> >
> > > > How much bus traffic do monitor/mwait cause behind the scenes?
> > >
> > > I would suppose that this just snoops the bus for writes, but the
> > > amount of bus traffic involved in this isn't explicitly documented.
> > >
> > > One downside of course is that unless a spin lock is made to occupy
> > > exactly a cache line, false wakeups are possible.
> >
> > And that would probably be very likely, as the whole purpose of Rik's
> > patches was to lower cache stalls due to other CPUs pounding on spin
> > locks that share the cache line of what is being protected (and
> > modified).
>
> A monitor/mwait would be an option only if using MCS (or K42 variant)
> locks, where each cpu would wait on a private and dedicated cache line.


But then would the problem even exist? If the lock is on its own cache
line, it shouldn't cause a performance issue if other CPUs are spinning
on it. Would it?

-- Steve




Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor

Eric Dumazet
On Thu, 2013-01-03 at 10:32 -0500, Steven Rostedt wrote:

> On Thu, 2013-01-03 at 05:35 -0800, Eric Dumazet wrote:
> > On Thu, 2013-01-03 at 08:24 -0500, Steven Rostedt wrote:
> > > On Thu, 2013-01-03 at 09:05 +0000, Jan Beulich wrote:
> > >
> > > > > How much bus traffic do monitor/mwait cause behind the scenes?
> > > >
> > > > I would suppose that this just snoops the bus for writes, but the
> > > > amount of bus traffic involved in this isn't explicitly documented.
> > > >
> > > > One downside of course is that unless a spin lock is made to occupy
> > > > exactly a cache line, false wakeups are possible.
> > >
> > > And that would probably be very likely, as the whole purpose of Rik's
> > > patches was to lower cache stalls due to other CPUs pounding on spin
> > > locks that share the cache line of what is being protected (and
> > > modified).
> >
> > A monitor/mwait would be an option only if using MCS (or K42 variant)
> > locks, where each cpu would wait on a private and dedicated cache line.
>
>
> But then would the problem even exist? If the lock is on its own cache
> line, it shouldn't cause a performance issue if other CPUs are spinning
> on it. Would it?

Not sure I understand the question.

The lock itself would not consume a whole cache line; only the items
chained on it would be per-cpu and cache-line aligned.

http://www.cs.rochester.edu/research/synchronization/pseudocode/ss.html#mcs

Instead of spinning in:

repeat while I->next = nil

This part could use monitor/mwait
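
A rough kernel-style sketch of such a spin, assuming the
__monitor()/__mwait() helpers from asm/mwait.h (names illustrative,
not a worked-out proposal):

	struct mcs_node {
		struct mcs_node *next;
		int locked;
	} ____cacheline_aligned;	/* per cpu: private, dedicated line */

	/* spin on our own node; doze instead of hammering the bus */
	while (!ACCESS_ONCE(node->next)) {
		__monitor(&node->next, 0, 0);	/* arm the address monitor */
		if (!ACCESS_ONCE(node->next))	/* re-check to close the race */
			__mwait(0, 0);		/* wake when the line is written */
	}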


But:

1) We don't have such a lock implementation.

2) Trying to save power while waiting on a spinlock would be a clear
sign something is wrong in the implementation. A spinlock should not
protect a long critical section.



Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor

Steven Rostedt
On Thu, 2013-01-03 at 08:10 -0800, Eric Dumazet wrote:

> > But then would the problem even exist? If the lock is on its own cache
> > line, it shouldn't cause a performance issue if other CPUs are spinning
> > on it. Would it?
>
> Not sure I understand the question.
>

I'll explain my question better.

I thought the whole point of Rik's patches was to solve a performance
problem caused by contention on a lock that shares a cache line with
data.

In the ideal case, locks won't be contended, and are taken and released
quickly (being from the RT world, I know this isn't true :-( ). In this
case, it's also advantageous to keep the lock on the same cache line as
the data that's being updated. This way, the process of grabbing the
lock also pulls in the data that you will soon be using.

But then the problem occurs when you have a bunch of other CPUs trying
to take this lock in a tight spin. Every time the owner of the lock
touches the data, the other CPUs doing a LOCK read on the spinlock will
cause bus contention on the owner CPU, as the data shares the cache line
and needs to be synced: the owner CPU just touched the cache line that is
under a tight loop of LOCK reads on other CPUs. By adding the delays,
the CPU with the lock doesn't stall at every update of the data
protected by the lock.
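
Concretely, think of a hypothetical structure like:

	struct foo {
		spinlock_t lock;	/* shares a cache line with... */
		long counter;		/* ...the data it protects */
	};

where every update of foo.counter by the lock holder yanks the line
away from the CPUs spinning on foo.lock.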

Thus, if monitor/mwait is ideal only for locks on their own cache
lines, then it is pointless for the locks that are causing the issue we
are trying to fix.

-- Steve



Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor

Eric Dumazet
On Thu, 2013-01-03 at 11:45 -0500, Steven Rostedt wrote:

> On Thu, 2013-01-03 at 08:10 -0800, Eric Dumazet wrote:
>
> > > But then would the problem even exist? If the lock is on its own cache
> > > line, it shouldn't cause a performance issue if other CPUs are spinning
> > > on it. Would it?
> >
> > Not sure I understand the question.
> >
>
> I'll explain my question better.
>
> I thought the whole point of Rik's patches was to solve a performance
> problem caused by contention on a lock that shares a cache line with
> data.
>
> In the ideal case, locks won't be contended, and are taken and released
> quickly (being from the RT world, I know this isn't true :-( ). In this
> case, it's also advantageous to keep the lock on the same cache line as
> the data that's being updated. This way, the process of grabbing the
> lock also pulls in the data that you will soon be using.
>
> But then the problem occurs when you have a bunch of other CPUs trying
> to take this lock in a tight spin. Every time the owner of the lock
> touches the data, the other CPUs doing a LOCK read on the spinlock will
> cause bus contention on the owner CPU, as the data shares the cache line
> and needs to be synced: the owner CPU just touched the cache line that is
> under a tight loop of LOCK reads on other CPUs. By adding the delays,
> the CPU with the lock doesn't stall at every update of the data
> protected by the lock.
>
> Thus, if monitor/mwait is ideal only for locks on their own cache
> lines, then it is pointless for the locks that are causing the issue we
> are trying to fix.

I think you misunderstood the monitor/mwait usage I was speaking of:
only for MCS type locks, where each cpu spins on its own busy/locked
bit.

Of course, if we use a ticket spinlock with no additional storage, we
have to spin without making any memory reference, and that's Rik's patch
using this idea:

http://www.cs.rochester.edu/research/synchronization/pseudocode/ss.html#ticket




Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor

Eric Dumazet
In reply to this post by Michel Lespinasse
On Sat, 2012-12-29 at 02:27 -0800, Michel Lespinasse wrote:

> On Wed, Dec 26, 2012 at 11:10 AM, Eric Dumazet <[hidden email]> wrote:
> > I did some tests with your patches with following configuration :
> >
> > tc qdisc add dev eth0 root htb r2q 1000 default 3
> > (to force a contention on qdisc lock, even with a multi queue net
> > device)
> >
> > and 24 concurrent "netperf -t UDP_STREAM -H other_machine -- -m 128"
> >
> > Machine : 2 Intel(R) Xeon(R) CPU X5660  @ 2.80GHz
> > (24 threads), and a fast NIC (10Gbps)
> >
> > Resulting in a 13 % regression (676 Mbits -> 595 Mbits)
>
> I've been trying to use this workload on a similar machine. I am
> getting some confusing results however:
>
> with 24 concurrent netperf -t UDP_STREAM -H $target -- -m 128 -R 1, I
> am seeing some non-trivial run-to-run performance variation: about 5%
> in the v3.7 baseline, but much larger after applying Rik's 3 patches.
> My last few runs gave me results of 890.92, 1073.74, 963.13, 1234.41,
> 754.18, 893.82. This is generally better than what I'm getting with
> baseline, but the variance is huge (which is somewhat surprising given
> that Rik's patches don't have the issue of hash collisions).

You mean that with Rik's patch, there is definitely an issue, as it has
a single bucket. Chances of collisions are high.


Your numbers being very random, I suspect you might be hitting another limit.

My tests involved a NIC with 24 transmit queues, to take the per-TX-queue
lock out of the benchmark equation.

My guess is you use a NIC with 4 or 8 TX queues.

"ethtool -S eth0" would probably give some hints.

>  Also,
> this is significant in that I am not seeing the regression you were
> observing with just these 3 patches.
>
> If I add a 1 second delay in the netperf command line (netperf -t
> UDP_STREAM -s 1 -H lpk18 -- -m 128 -R 1), I am seeing a very constant
> 660 Mbps result, but then I don't see any benefit from applying Rik's
> patches. I have no explanation for these results, but I am getting
> them very consistently...
>
> > In this workload we have at least two contended spinlocks, with
> > different delays. (spinlocks are not held for the same duration)
>
> Just to confirm, I believe you are referring to qdisc->q.lock and
> qdisc->busylock ?
>

Yes.


