Disabling in-memory write cache for x86-64 in Linux II

Re: Disabling in-memory write cache for x86-64 in Linux II

Pavel Machek
Hi!

> >>  - temp-files may not be written out at all.
> >>
> >>    Quite frankly, if you have multi-hundred-megabyte temp-files, you've
> >> got issues
> >   Actually people do stuff like this e.g. when generating ISO images before
> > burning them.
>
> Yes, but then the temp-file is long-lived enough that it *will* hit
> the disk anyway. So it's only the "create temporary file and pretty
> much immediately delete it" case that changes behavior (ie compiler
> assembly files etc).
>
> If the temp-file is for something like burning an ISO image, the
> burning part is slow enough that the temp-file will hit the disk
> regardless of when we start writing it.

It will hit the disk, but with the proposed change, burning will still
be slower.

Before:

create 700MB iso
burn the CD, at the same time writing the iso to disk

After:

create 700MB iso and write most of it to disk
burn the CD, writing the rest.

But yes, limiting dirty amounts is a good idea.

> That said, I'd certainly like it even *more* if the limits really were
> per-BDI, and the global limit was in addition to the per-bdi ones.
> Because when you have a USB device that gets maybe 10MB/s on
> contiguous writes, and 100kB/s on random 4k writes, I think it would
> make more sense to make the "start writeout" limits be 1MB/2MB, not

Actually I believe I've seen 10kB/sec on an SD card... I'd expect that
from USB sticks, too.

And yes, there are actually real problems with this at least on N900.

You do apt-get install <big package>. apt internally does fsyncs. It
results in big enough latencies that watchdogs kick in and kill the
machine.

http://pavelmachek.livejournal.com/117089.html

People are doing

    echo 3 > /proc/sys/vm/dirty_ratio
    echo 3 > /proc/sys/vm/dirty_background_ratio
    echo 100 > /proc/sys/vm/dirty_writeback_centisecs
    echo 100 > /proc/sys/vm/dirty_expire_centisecs
    echo 4096 > /proc/sys/vm/min_free_kbytes
    echo 50 > /proc/sys/vm/swappiness
    echo 200 > /proc/sys/vm/vfs_cache_pressure
    echo 8 > /proc/sys/vm/page-cluster
    echo 4 > /sys/block/mmcblk0/queue/nr_requests
    echo 4 > /sys/block/mmcblk1/queue/nr_requests

.. to avoid it, but IIRC it only makes the watchdog reset less likely
:-(.

                                                                        Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] mm: add strictlimit knob

akpm
In reply to this post by Maxim Patlasov
On Fri, 01 Nov 2013 18:31:40 +0400 Maxim Patlasov <[hidden email]> wrote:

> "strictlimit" feature was introduced to enforce per-bdi dirty limits for
> FUSE which sets bdi max_ratio to 1% by default:
>
> http://article.gmane.org/gmane.linux.kernel.mm/105809
>
> However the feature can be useful for other relatively slow or untrusted
> BDIs like USB flash drives and DVD+RW. The patch adds a knob to enable the
> feature:
>
> echo 1 > /sys/class/bdi/X:Y/strictlimit
>
> Being enabled, the feature enforces bdi max_ratio limit even if global (10%)
> dirty limit is not reached. Of course, the effect is not visible until
> max_ratio is decreased to some reasonable value.

I suggest replacing "max_ratio" here with the much more informative
"/sys/class/bdi/X:Y/max_ratio".

Also, Documentation/ABI/testing/sysfs-class-bdi will need an update
please.

>  mm/backing-dev.c |   35 +++++++++++++++++++++++++++++++++++
>  1 file changed, 35 insertions(+)
>

I'm not really sure what to make of the patch.  I assume you tested it
and observed some effect.  Could you please describe the test setup and
the effects in some detail?


Re: Disabling in-memory write cache for x86-64 in Linux II

Andreas Dilger-7
In reply to this post by Linus Torvalds-2

On Oct 25, 2013, at 2:18 AM, Linus Torvalds <[hidden email]> wrote:

> On Fri, Oct 25, 2013 at 8:25 AM, Artem S. Tashkinov <[hidden email]> wrote:
>>
>> On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11
>> kernel built for the i686 (with PAE) and x86-64 architectures. What’s
>> really troubling me is that the x86-64 kernel has the following problem:
>>
>> When I copy large files to any storage device, be it my HDD with ext4
>> partitions or flash drive with FAT32 partitions, the kernel first
>> caches them in memory entirely then flushes them some time later
>> (quite unpredictably though) or immediately upon invoking "sync".
>
> Yeah, I think we default to a 10% "dirty background memory" (and
> allows up to 20% dirty), so on your 16GB machine, we allow up to 1.6GB
> of dirty memory for writeout before we even start writing, and twice
> that before we start *waiting* for it.
>
> On 32-bit x86, we only count the memory in the low 1GB (really
> actually up to about 890MB), so "10% dirty" really means just about
> 90MB of buffering (and a "hard limit" of ~180MB of dirty).
>
> And that "up to 3.2GB of dirty memory" is just crazy. Our defaults
> come from the old days of less memory (and perhaps servers that don't
> much care), and the fact that x86-32 ends up having much lower limits
> even if you end up having more memory.

I think the “delay writes for a long time” is a holdover from the
days when e.g. /tmp was on a disk and compilers had lousy IO
patterns, then they deleted the file.  Today, /tmp is always in
RAM, and IMHO the “write and delete” workload tested by dbench
is not worth optimizing for.

With Lustre, we’ve long taken the approach that if there is enough
dirty data on a file to make a decent write (which is around 8MB
today even for very fast storage) then there isn’t much point to
hold back for more data before starting the IO.

Any decent allocator will be able to grow allocated extents to
handle following data, or allocate a new extent.  At 4-8MB extents,
even very seek-impaired media could do 400-800MB/s (likely much
faster than the underlying storage anyway).

This also avoids wasting (tens of?) seconds of idle disk bandwidth.
If the disk is already busy, then the IO will be delayed anyway.
If it is not busy, then why aggregate GB of dirty data in memory
before flushing it?

Something simple like “start writing at 16MB dirty on a single file”
would probably avoid a lot of complexity at little real-world cost.
That shouldn’t throttle dirtying memory above 16MB, but just start
writeout much earlier than it does today.

Cheers, Andreas






Re: Disabling in-memory write cache for x86-64 in Linux II

David Lang-3
In reply to this post by NeilBrown
On Tue, 5 Nov 2013, Figo.zhang wrote:

>>>
>>> Of course, if you don't use Linux on the desktop you don't really care -
>> well, I do. Also
>>> not everyone in this world has an UPS - which means such a huge buffer
>> can lead to a
>>> serious data loss in case of a power blackout.
>>
>> I don't have a desk (just a lap), but I use Linux on all my computers and
>> I've never really noticed the problem.  Maybe I'm just very patient, or
>> maybe
>> I don't work with large data sets and slow devices.
>>
>> However I don't think data-loss is really a related issue.  Any process
>> that
>> cares about data safety *must* use fsync at appropriate places.  This has
>> always been true.
>>
> => May I ask a question: with a filesystem like ext4, if an app modifies
> files, it creates dirty data. If some metadata is being written to the
> journal when a power blackout occurs, will some data be lost and the
> file damaged?
>

With any filesystem and any OS, if you create dirty data but do not f*sync() the
data, there is a possibility that the system can go down between the time the
application creates the dirty data and the time the OS actually gets it on disk.
If the system goes down in this timeframe, the data will be lost, and the file
may be corrupted if only some of the data got written.

David Lang

Re: Disabling in-memory write cache for x86-64 in Linux II

NeilBrown
In reply to this post by NeilBrown
On Tue, 5 Nov 2013 09:40:55 +0800 "Figo.zhang" <[hidden email]> wrote:

> > >
> > > Of course, if you don't use Linux on the desktop you don't really care -
> > well, I do. Also
> > > not everyone in this world has an UPS - which means such a huge buffer
> > can lead to a
> > > serious data loss in case of a power blackout.
> >
> > I don't have a desk (just a lap), but I use Linux on all my computers and
> > I've never really noticed the problem.  Maybe I'm just very patient, or
> > maybe
> > I don't work with large data sets and slow devices.
> >
> > However I don't think data-loss is really a related issue.  Any process
> > that
> > cares about data safety *must* use fsync at appropriate places.  This has
> > always been true.
> >
> => May I ask a question: with a filesystem like ext4, if an app modifies
> files, it creates dirty data. If some metadata is being written to the
> journal when a power blackout occurs, will some data be lost and the
> file damaged?
If you modify a file, then you must take care that you can recover from a
crash at any point in the process.

If the file is small, the usual approach is to create a copy of the file with
the appropriate changes made, then 'fsync' the file and rename the new file
over the old file.

If the file is large you might need some sort of update log (in a small file)
so you can replay recent updates after a crash.

The journalling that the filesystem provides only protects the filesystem
metadata. It does not protect the consistency of the data in your file.

I hope that helps.

NeilBrown


Re: Disabling in-memory write cache for x86-64 in Linux II

Dave Chinner
In reply to this post by Andreas Dilger-7
On Mon, Nov 04, 2013 at 05:50:13PM -0700, Andreas Dilger wrote:

>
> On Oct 25, 2013, at 2:18 AM, Linus Torvalds <[hidden email]> wrote:
> > On Fri, Oct 25, 2013 at 8:25 AM, Artem S. Tashkinov <[hidden email]> wrote:
> >>
> >> On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11
> >> kernel built for the i686 (with PAE) and x86-64 architectures. What’s
> >> really troubling me is that the x86-64 kernel has the following problem:
> >>
> >> When I copy large files to any storage device, be it my HDD with ext4
> >> partitions or flash drive with FAT32 partitions, the kernel first
> >> caches them in memory entirely then flushes them some time later
> >> (quite unpredictably though) or immediately upon invoking "sync".
> >
> > Yeah, I think we default to a 10% "dirty background memory" (and
> > allows up to 20% dirty), so on your 16GB machine, we allow up to 1.6GB
> > of dirty memory for writeout before we even start writing, and twice
> > that before we start *waiting* for it.
> >
> > On 32-bit x86, we only count the memory in the low 1GB (really
> > actually up to about 890MB), so "10% dirty" really means just about
> > 90MB of buffering (and a "hard limit" of ~180MB of dirty).
> >
> > And that "up to 3.2GB of dirty memory" is just crazy. Our defaults
> > come from the old days of less memory (and perhaps servers that don't
> > much care), and the fact that x86-32 ends up having much lower limits
> > even if you end up having more memory.
>
> I think the “delay writes for a long time” is a holdover from the
> days when e.g. /tmp was on a disk and compilers had lousy IO
> patterns, then they deleted the file.  Today, /tmp is always in
> RAM, and IMHO the “write and delete” workload tested by dbench
> is not worthwhile optimizing for.
>
> With Lustre, we’ve long taken the approach that if there is enough
> dirty data on a file to make a decent write (which is around 8MB
> today even for very fast storage) then there isn’t much point to
> hold back for more data before starting the IO.

Agreed - write-through caching is much better for high throughput
streaming data environments than write back caching that can leave
the devices unnecessarily idle.

However, most systems are not running in high-throughput streaming
data environments... :/

> Any decent allocator will be able to grow allocated extents to
> handle following data, or allocate a new extent.  At 4-8MB extents,
> even very seek-impaired media could do 400-800MB/s (likely much
> faster than the underlying storage anyway).

True, but this makes the assumption that the filesystem you are
using is optimising purely for write throughput and your storage is
not seek limited on reads. That's simply not an assumption we can
allow the generic writeback code to make.

In more detail, if we simply implement "we have 8 MB of dirty pages
on a single file, write it" we can maximise write throughput by
allocating sequentially on disk for each subsequent write. The
problem with this comes when you are writing multiple files at a
time, and that leads to this pattern on disk:

 ABC...ABC....ABC....ABC....

And the result is a) fragmented files, b) a large number of seeks
during sequential read operations, and c) filesystems that age and
degrade rapidly under workloads that concurrently write files with
different lifetimes (i.e. due to free space fragmentation).

In some situations this is acceptable, but the performance
degradation as the filesystem ages that this sort of allocation
causes in most environments is not. I'd say that >90% of filesystems
out there would suffer accelerated aging as a result of doing
writeback in this manner by default.

> This also avoids wasting (tens of?) seconds of idle disk bandwidth.
> If the disk is already busy, then the IO will be delayed anyway.
> If it is not busy, then why aggregate GB of dirty data in memory
> before flushing it?

There are plenty of workloads out there where delaying IO for a few
seconds can result in writeback that is an order of magnitude
faster. Similarly, I've seen other workloads where the writeback
delay results in files that can be *read* orders of magnitude
faster....

> Something simple like “start writing at 16MB dirty on a single file”
> would probably avoid a lot of complexity at little real-world cost.
> That shouldn’t throttle dirtying memory above 16MB, but just start
> writeout much earlier than it does today.

That doesn't solve the "slow device, large file" problem. We can
write data into the page cache at rates of over a GB/s, so it's
irrelevant to a device that can write at 5MB/s whether we start
writeback immediately or a second later when there is 500MB of dirty
pages in memory.  AFAIK, the only way to avoid that problem is to
use write-through caching for such devices - where they throttle to
the IO rate at very low levels of cached data.

Realistically, there is no "one right answer" for all combinations
of applications, filesystems and hardware, but writeback caching is
the best *general solution* we've got right now.

However, IMO users should not need to care about tuning BDI dirty
ratios or even have to understand what a BDI dirty ratio is to
select the right caching method for their devices and/or workload.
The difference between writeback and write through caching is easy
to explain and AFAICT those two modes suffice to solve the problems
being discussed here.  Further, if two modes suffice to solve the
problems, then we should be able to easily define a trigger to
automatically switch modes.

/me notes that if we look at random vs sequential IO and the impact
that has on writeback duration, then it's very similar to suddenly
having a very slow device. IOWs, fadvise(RANDOM) could be used to
switch an *inode* to write through mode rather than writeback mode
to solve the problem aggregating massive amounts of random write IO
in the page cache...

So rather than treating this as a "one size fits all" type of
problem, let's step back and:

        a) define 2-3 different caching behaviours we consider
           optimal for the majority of workloads/hardware we care
           about.
        b) determine optimal workloads for each caching
           behaviour.
        c) develop reliable triggers to detect when we
           should switch between caching behaviours.

e.g:

        a) write back caching
                - what we have now
           write through caching
                - extremely low dirty threshold before writeback
                  starts, enough to optimise for, say, stripe width
                  of the underlying storage.

        b) write back caching:
                - general purpose workload
           write through caching:
                - slow device, write large file, sync
                - extremely high bandwidth devices, multi-stream
                  sequential IO
                - random IO.

        c) write back caching:
                - default
                - fadvise(NORMAL, SEQUENTIAL, WILLNEED)
           write through caching:
                - fadvise(NOREUSE, DONTNEED, RANDOM)
                - random IO
                - sequential IO, BDI write bandwidth <<< dirty threshold
                - sequential IO, BDI write bandwidth >>> dirty threshold

I think that covers most of the issues and use cases that have been
discussed in this thread. IMO, this is the level at which we need to
solve the problem (i.e. architectural), not at the level of "let's
add sysfs variables so we can tweak bdi ratios".

Indeed, the above implies that we need the caching behaviour to be a
property of the address space, not just a property of the backing
device.

IOWs, the implementation needs to trickle down from a coherent high
level design - that will define the knobs that we need to expose to
userspace. We should not be adding new writeback behaviours by
adding knobs to sysfs without first having some clue about whether
we are solving the right problem and solving it in a sane manner...

Cheers,

Dave.
--
Dave Chinner
[hidden email]

Re: [PATCH] mm: add strictlimit knob

Maxim Patlasov
In reply to this post by akpm
Hi Andrew,

On 11/05/2013 02:01 AM, Andrew Morton wrote:

> On Fri, 01 Nov 2013 18:31:40 +0400 Maxim Patlasov <[hidden email]> wrote:
>
>> "strictlimit" feature was introduced to enforce per-bdi dirty limits for
>> FUSE which sets bdi max_ratio to 1% by default:
>>
>> http://article.gmane.org/gmane.linux.kernel.mm/105809
>>
>> However the feature can be useful for other relatively slow or untrusted
>> BDIs like USB flash drives and DVD+RW. The patch adds a knob to enable the
>> feature:
>>
>> echo 1 > /sys/class/bdi/X:Y/strictlimit
>>
>> Being enabled, the feature enforces bdi max_ratio limit even if global (10%)
>> dirty limit is not reached. Of course, the effect is not visible until
>> max_ratio is decreased to some reasonable value.
> I suggest replacing "max_ratio" here with the much more informative
> "/sys/class/bdi/X:Y/max_ratio".
>
> Also, Documentation/ABI/testing/sysfs-class-bdi will need an update
> please.

OK, I'll update it, fix patch description and re-send the patch.

>
>>   mm/backing-dev.c |   35 +++++++++++++++++++++++++++++++++++
>>   1 file changed, 35 insertions(+)
>>
> I'm not really sure what to make of the patch.  I assume you tested it
> and observed some effect.  Could you please describe the test setup and
> the effects in some detail?

I plugged a 16GB USB flash drive into a node with 8GB RAM running
3.12.0-rc7 and started writing a huge file with "dd" (from /dev/zero to
the flash drive's mount-point). While writing, I watched the "Dirty"
counter reported by /proc/meminfo. As expected, it stabilized at about
1.2GB (15% of total RAM). Immediately after dd completed, the "umount"
command took about 5 minutes, which corresponds to the ~5MB/s write
throughput of the flash drive.

Then I repeated the experiment after setting tunables:

echo 1 > /sys/class/bdi/8\:16/max_ratio
echo 1 > /sys/class/bdi/8\:16/strictlimit

This time, the "Dirty" counter was 100 times smaller - about 12MB - and
"umount" took about a second.

Thanks,
Maxim

[PATCH] mm: add strictlimit knob -v2

Maxim Patlasov
In reply to this post by akpm
"strictlimit" feature was introduced to enforce per-bdi dirty limits for
FUSE which sets bdi max_ratio to 1% by default:

http://article.gmane.org/gmane.linux.kernel.mm/105809

However the feature can be useful for other relatively slow or untrusted
BDIs like USB flash drives and DVD+RW. The patch adds a knob to enable the
feature:

echo 1 > /sys/class/bdi/X:Y/strictlimit

Once enabled, the feature enforces the bdi max_ratio limit even if the
global (10%) dirty limit is not reached. Of course, the effect is not
visible until /sys/class/bdi/X:Y/max_ratio is decreased to some
reasonable value.

Changed in v2:
 - updated patch description and documentation

Signed-off-by: Maxim Patlasov <[hidden email]>
---
 Documentation/ABI/testing/sysfs-class-bdi |    8 +++++++
 mm/backing-dev.c                          |   35 +++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-class-bdi b/Documentation/ABI/testing/sysfs-class-bdi
index d773d56..3187a18 100644
--- a/Documentation/ABI/testing/sysfs-class-bdi
+++ b/Documentation/ABI/testing/sysfs-class-bdi
@@ -53,3 +53,11 @@ stable_pages_required (read-only)
 
  If set, the backing device requires that all pages comprising a write
  request must not be changed until writeout is complete.
+
+strictlimit (read-write)
+
+ Forces per-BDI checks for the share of the given device in the
+ write-back cache even before the global background dirty limit is
+ reached. This is useful when the global limit is much higher than
+ affordable for the given relatively slow (or untrusted) device.
+ Turning strictlimit on has no visible effect if max_ratio is 100%.
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index ce682f7..4ee1d64 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -234,11 +234,46 @@ static ssize_t stable_pages_required_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(stable_pages_required);
 
+static ssize_t strictlimit_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	struct backing_dev_info *bdi = dev_get_drvdata(dev);
+	unsigned int val;
+	ssize_t ret;
+
+	ret = kstrtouint(buf, 10, &val);
+	if (ret < 0)
+		return ret;
+
+	switch (val) {
+	case 0:
+		bdi->capabilities &= ~BDI_CAP_STRICTLIMIT;
+		break;
+	case 1:
+		bdi->capabilities |= BDI_CAP_STRICTLIMIT;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return count;
+}
+static ssize_t strictlimit_show(struct device *dev,
+		struct device_attribute *attr, char *page)
+{
+	struct backing_dev_info *bdi = dev_get_drvdata(dev);
+
+	return snprintf(page, PAGE_SIZE-1, "%d\n",
+			!!(bdi->capabilities & BDI_CAP_STRICTLIMIT));
+}
+static DEVICE_ATTR_RW(strictlimit);
+
 static struct attribute *bdi_dev_attrs[] = {
 	&dev_attr_read_ahead_kb.attr,
 	&dev_attr_min_ratio.attr,
 	&dev_attr_max_ratio.attr,
 	&dev_attr_stable_pages_required.attr,
+	&dev_attr_strictlimit.attr,
 	NULL,
 };
 ATTRIBUTE_GROUPS(bdi_dev);


Re: [PATCH] mm: add strictlimit knob -v2

Henrique de Moraes Holschuh-2
Is there a reason to not enforce strictlimit by default?

--
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh

Re: Disabling in-memory write cache for x86-64 in Linux II

Jan Kara
In reply to this post by Dave Chinner
On Tue 05-11-13 15:12:45, Dave Chinner wrote:

> On Mon, Nov 04, 2013 at 05:50:13PM -0700, Andreas Dilger wrote:
> > Something simple like “start writing at 16MB dirty on a single file”
> > would probably avoid a lot of complexity at little real-world cost.
> > That shouldn’t throttle dirtying memory above 16MB, but just start
> > writeout much earlier than it does today.
>
> That doesn't solve the "slow device, large file" problem. We can
> write data into the page cache at rates of over a GB/s, so it's
> irrelevant to a device that can write at 5MB/s whether we start
> writeback immediately or a second later when there is 500MB of dirty
> pages in memory.  AFAIK, the only way to avoid that problem is to
> use write-through caching for such devices - where they throttle to
> the IO rate at very low levels of cached data.
  Agreed.

> Realistically, there is no "one right answer" for all combinations
> of applications, filesystems and hardware, but writeback caching is
> the best *general solution* we've got right now.
>
> However, IMO users should not need to care about tuning BDI dirty
> ratios or even have to understand what a BDI dirty ratio is to
> select the rigth caching method for their devices and/or workload.
> The difference between writeback and write through caching is easy
> to explain and AFAICT those two modes suffice to solve the problems
> being discussed here.  Further, if two modes suffice to solve the
> problems, then we should be able to easily define a trigger to
> automatically switch modes.
>
> /me notes that if we look at random vs sequential IO and the impact
> that has on writeback duration, then it's very similar to suddenly
> having a very slow device. IOWs, fadvise(RANDOM) could be used to
> switch an *inode* to write through mode rather than writeback mode
> to solve the problem aggregating massive amounts of random write IO
> in the page cache...
  I disagree here. Writeback cache is also useful for aggregating random
writes and making semi-sequential writes out of them. There are quite a
few applications which rely on the fact that they can write a file in a
rather random manner (Berkeley DB, linker, ...) but the files are written
out in one large linear sweep. That is actually the reason why SLES (and
I believe RHEL as well) tunes dirty_limit even higher than the default value.

So I think it's rather the other way around: if you can detect that a file
is being written in a streaming manner, there's not much point in caching
too much data for it. And I agree with you that we also have to be careful
not to cache too little, because otherwise two streaming writes would be
interleaved too much. Currently, we have writeback_chunk_size() which
determines how much we ask to write from a single inode. So streaming
writers are going to be interleaved at this chunk size anyway (currently
that number is "measured bandwidth / 2"). So it would make sense to also
limit the amount of dirty cache for each file with a streaming pattern at
this number.

> So rather than treating this as a "one size fits all" type of
> problem, let's step back and:
>
> a) define 2-3 different caching behaviours we consider
>   optimal for the majority of workloads/hardware we care
>   about.
> b) determine optimal workloads for each caching
>   behaviour.
> c) develop reliable triggers to detect when we
>   should switch between caching behaviours.
>
> e.g:
>
> a) write back caching
> - what we have now
>   write through caching
> - extremely low dirty threshold before writeback
>  starts, enough to optimise for, say, stripe width
>  of the underlying storage.
>
> b) write back caching:
> - general purpose workload
>   write through caching:
> - slow device, write large file, sync
> - extremely high bandwidth devices, multi-stream
>  sequential IO
> - random IO.
>
> c) write back caching:
> - default
> - fadvise(NORMAL, SEQUENTIAL, WILLNEED)
>   write through caching:
> - fadvise(NOREUSE, DONTNEED, RANDOM)
> - random IO
> - sequential IO, BDI write bandwidth <<< dirty threshold
> - sequential IO, BDI write bandwidth >>> dirty threshold
>
> I think that covers most of the issues and use cases that have been
> discussed in this thread. IMO, this is the level at which we need to
> solve the problem (i.e. architectural), not at the level of "let's
> add sysfs variables so we can tweak bdi ratios".
>
> Indeed, the above implies that we need the caching behaviour to be a
> property of the address space, not just a property of the backing
> device.
  Yes, and that would be interesting to implement and not make a mess out
of the whole writeback logic because the way we currently do writeback is
inherently BDI based. When we introduce some special per-inode limits,
flusher threads would have to pick more carefully what to write and what
not. We might be forced to go that way eventually anyway because of memcg
aware writeback but it's not a simple step.

> IOWs, the implementation needs to trickle down from a coherent high
> level design - that will define the knobs that we need to expose to
> userspace. We should not be adding new writeback behaviours by
> adding knobs to sysfs without first having some clue about whether
> we are solving the right problem and solving it in a sane manner...
  Agreed. But the ability to limit the amount of dirty pages outstanding
against a particular BDI seems a sane one to me. It's not as flexible
and automatic as the approach you suggested, but it's much simpler and
solves most of the problems we currently have.

The biggest objection against the sysfs-tunable approach is that most
people won't have a clue, meaning that the tunable is useless for them.
But I wonder if something like:
1) turn on strictlimit by default
2) don't allow the dirty cache of a BDI to grow over 5s of measured
   writeback speed

wouldn't go a long way towards solving our current problems without too
much complication...

                                                                Honza
--
Jan Kara <[hidden email]>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Disabling in-memory write cache for x86-64 in Linux II

Dave Chinner
On Thu, Nov 07, 2013 at 02:48:06PM +0100, Jan Kara wrote:

> On Tue 05-11-13 15:12:45, Dave Chinner wrote:
> > On Mon, Nov 04, 2013 at 05:50:13PM -0700, Andreas Dilger wrote:
> > > Something simple like “start writing at 16MB dirty on a single file”
> > > would probably avoid a lot of complexity at little real-world cost.
> > > That shouldn’t throttle dirtying memory above 16MB, but just start
> > > writeout much earlier than it does today.
> >
> > That doesn't solve the "slow device, large file" problem. We can
> > write data into the page cache at rates of over a GB/s, so it's
> > irrelevant to a device that can write at 5MB/s whether we start
> > writeback immediately or a second later when there is 500MB of dirty
> > pages in memory.  AFAIK, the only way to avoid that problem is to
> > use write-through caching for such devices - where they throttle to
> > the IO rate at very low levels of cached data.
>   Agreed.
>
> > Realistically, there is no "one right answer" for all combinations
> > of applications, filesystems and hardware, but writeback caching is
> > the best *general solution* we've got right now.
> >
> > However, IMO users should not need to care about tuning BDI dirty
> > ratios or even have to understand what a BDI dirty ratio is to
> > select the right caching method for their devices and/or workload.
> > The difference between writeback and write through caching is easy
> > to explain and AFAICT those two modes suffice to solve the problems
> > being discussed here.  Further, if two modes suffice to solve the
> > problems, then we should be able to easily define a trigger to
> > automatically switch modes.
> >
> > /me notes that if we look at random vs sequential IO and the impact
> > that has on writeback duration, then it's very similar to suddenly
> > having a very slow device. IOWs, fadvise(RANDOM) could be used to
> > switch an *inode* to write through mode rather than writeback mode
> > to solve the problem aggregating massive amounts of random write IO
> > in the page cache...
>   I disagree here. Writeback cache is also useful for aggregating random
> writes and making semi-sequential writes out of them. There are quite some
> applications which rely on the fact that they can write a file in a rather
> random manner (Berkeley DB, linker, ...) but the files are written out in
> one large linear sweep. That is actually the reason why SLES (and I believe
> RHEL as well) tune dirty_limit even higher than what's the default value.

Right - but the correct behaviour really depends on the pattern of
randomness. The common case we get into trouble with is when no
clustering occurs and we end up with small, random IO for gigabytes
of cached data. That's the case where write-through caching for
random data is better.

It's also questionable whether writeback caching for aggregation is
faster for random IO on high-IOPS devices or not. Again, I think it
would depend very much on how random the patterns are...

> So I think it's rather the other way around: If you can detect the file is
> being written in a streaming manner, there's not much point in caching too
> much data for it.

But we're not talking about how much data we cache here - we are
considering how much data we allow to get dirty before writing it
back.  It doesn't matter if we use writeback or write through
caching, the page cache footprint for a given workload is likely to
be similar, but without any data we can't draw any conclusions here.

> And I agree with you that we also have to be careful not
> to cache too few because otherwise two streaming writes would be
> interleaved too much. Currently, we have writeback_chunk_size() which
> determines how much we ask to write from a single inode. So streaming
> writers are going to be interleaved at this chunk size anyway (currently
> that number is "measured bandwidth / 2"). So it would make sense to also
> limit amount of dirty cache for each file with streaming pattern at this
> number.

My experience says that for streaming IO we typically need at least
5s of cached *dirty* data to even out delays and latencies in the
writeback IO pipeline. Hence limiting a file to what we can write in
a second, given we might only write a file once a second, is likely
to result in pipeline stalls...

Remember, writeback caching is about maximising throughput, not
minimising latency. The "sync latency" problem with caching too much
dirty data on slow block devices is really a corner case behaviour
and should not compromise the common case for bulk writeback
throughput.

> > Indeed, the above implies that we need the caching behaviour to be a
> > property of the address space, not just a property of the backing
> > device.
>   Yes, and it would be interesting to implement that without making a mess
> of the whole writeback logic, because the way we currently do writeback is
> inherently BDI based. When we introduce some special per-inode limits,
> flusher threads would have to pick more carefully what to write and what
> not. We might be forced to go that way eventually anyway because of
> memcg-aware writeback, but it's not a simple step.

Agreed, it's not simple, and that's why we need to start working
from the architectural level....

> > IOWs, the implementation needs to trickle down from a coherent high
> > level design - that will define the knobs that we need to expose to
> > userspace. We should not be adding new writeback behaviours by
> > adding knobs to sysfs without first having some clue about whether
> > we are solving the right problem and solving it in a sane manner...
>   Agreed. But the ability to limit the amount of dirty pages outstanding
> against a particular BDI seems like a sane one to me. It's not as flexible
> and automatic as the approach you suggested, but it's much simpler and
> solves most of the problems we currently have.

That's true, but....

> The biggest objection against the sysfs-tunable approach is that most
> people won't have a clue, meaning the tunable is useless for them.

.... that's the big problem I see - nobody is going to know how to
use it, when to use it, or be able to tell if it's the root cause of
some weird performance problem they are seeing.

> But I
> wonder if something like:
> 1) turn on strictlimit by default
> 2) don't allow the dirty cache of a BDI to grow over 5s of measured
>    writeback speed
>
> wouldn't go a long way toward solving our current problems without too much
> complication...

Turning on strict limit by default is going to change behaviour
quite markedly. Again, it's not something I'd want to see done
without a bunch of data showing that it doesn't cause regressions
for common workloads...

Cheers,

Dave.
--
Dave Chinner
[hidden email]

Re: Disabling in-memory write cache for x86-64 in Linux II

Jan Kara
On Mon 11-11-13 14:22:11, Dave Chinner wrote:

> On Thu, Nov 07, 2013 at 02:48:06PM +0100, Jan Kara wrote:
> > On Tue 05-11-13 15:12:45, Dave Chinner wrote:
> > > On Mon, Nov 04, 2013 at 05:50:13PM -0700, Andreas Dilger wrote:
> > > Realistically, there is no "one right answer" for all combinations
> > > of applications, filesystems and hardware, but writeback caching is
> > > the best *general solution* we've got right now.
> > >
> > > However, IMO users should not need to care about tuning BDI dirty
> > > ratios or even have to understand what a BDI dirty ratio is to
> > > select the right caching method for their devices and/or workload.
> > > The difference between writeback and write through caching is easy
> > > to explain and AFAICT those two modes suffice to solve the problems
> > > being discussed here.  Further, if two modes suffice to solve the
> > > problems, then we should be able to easily define a trigger to
> > > automatically switch modes.
> > >
> > > /me notes that if we look at random vs sequential IO and the impact
> > > that has on writeback duration, then it's very similar to suddenly
> > > having a very slow device. IOWs, fadvise(RANDOM) could be used to
> > > switch an *inode* to write through mode rather than writeback mode
> > > to solve the problem aggregating massive amounts of random write IO
> > > in the page cache...
> >   I disagree here. Writeback cache is also useful for aggregating random
> > writes and making semi-sequential writes out of them. There are quite some
> > applications which rely on the fact that they can write a file in a rather
> > random manner (Berkeley DB, linker, ...) but the files are written out in
> > one large linear sweep. That is actually the reason why SLES (and I believe
> > RHEL as well) tune dirty_limit even higher than what's the default value.
>
> Right - but the correct behaviour really depends on the pattern of
> randomness. The common case we get into trouble with is when no
> clustering occurs and we end up with small, random IO for gigabytes
> of cached data. That's the case where write-through caching for
> random data is better.
>
> It's also questionable whether writeback caching for aggregation is
> faster for random IO on high-IOPS devices or not. Again, I think it
> would depend very much on how random the patterns are...
  I agree that the usefulness of writeback caching for random IO very much
depends on the working set size vs the cache size, how random the accesses
really are, and the HW characteristics. I just wanted to point out that
there are fairly common workloads & setups where writeback caching for
semi-random IO really helps (because you seemed to suggest that random IO
implies we should disable the writeback cache).

> > So I think it's rather the other way around: If you can detect the file is
> > being written in a streaming manner, there's not much point in caching too
> > much data for it.
>
> But we're not talking about how much data we cache here - we are
> considering how much data we allow to get dirty before writing it
> back.
  Sorry, I was imprecise here. I really meant that IMO it doesn't make
sense to allow too much dirty data for sequentially written files.

> It doesn't matter if we use writeback or write through
> caching, the page cache footprint for a given workload is likely to
> be similar, but without any data we can't draw any conclusions here.
>
> > And I agree with you that we also have to be careful not
> > to cache too few because otherwise two streaming writes would be
> > interleaved too much. Currently, we have writeback_chunk_size() which
> > determines how much we ask to write from a single inode. So streaming
> > writers are going to be interleaved at this chunk size anyway (currently
> > that number is "measured bandwidth / 2"). So it would make sense to also
> > limit amount of dirty cache for each file with streaming pattern at this
> > number.
>
> My experience says that for streaming IO we typically need at least
> 5s of cached *dirty* data to even out delays and latencies in the
> writeback IO pipeline. Hence limiting a file to what we can write in
> a second, given we might only write a file once a second, is likely
> to result in pipeline stalls...
  I guess this begs for real data. We agree in principle but differ in
constants :).
 
> Remember, writeback caching is about maximising throughput, not
> minimising latency. The "sync latency" problem with caching too much
> dirty data on slow block devices is really a corner case behaviour
> and should not compromise the common case for bulk writeback
> throughput.
  Agreed. As a primary goal we want to maximise throughput. But we want
to maintain sane latency as well (e.g. because we have a "promise" of
"dirty_writeback_centisecs" we have to cycle through dirty inodes
reasonably frequently).

> >   Agreed. But the ability to limit the amount of dirty pages outstanding
> > against a particular BDI seems like a sane one to me. It's not as flexible
> > and automatic as the approach you suggested, but it's much simpler and
> > solves most of the problems we currently have.
>
> That's true, but....
>
> > The biggest objection against the sysfs-tunable approach is that most
> > people won't have a clue, meaning the tunable is useless for them.
>
> .... that's the big problem I see - nobody is going to know how to
> use it, when to use it, or be able to tell if it's the root cause of
> some weird performance problem they are seeing.
>
> > But I
> > wonder if something like:
> > 1) turn on strictlimit by default
> > 2) don't allow the dirty cache of a BDI to grow over 5s of measured
> >    writeback speed
> >
> > wouldn't go a long way toward solving our current problems without too much
> > complication...
>
> Turning on strict limit by default is going to change behaviour
> quite markedly. Again, it's not something I'd want to see done
> without a bunch of data showing that it doesn't cause regressions
> for common workloads...
  Agreed.

                                                        Honza
--
Jan Kara <[hidden email]>
SUSE Labs, CR

Re: Disabling in-memory write cache for x86-64 in Linux II

Diego Calleja
In reply to this post by kbuild test robot-2
On Saturday, 26 October 2013 at 00:32:25, Fengguang Wu wrote:

> What's the kernel you are running? And it's writing to a hard disk?
> The stalls are most likely caused by either one of
>
> 1) write IO starves read IO
> 2) direct page reclaim blocked when
>    - trying to writeout PG_dirty pages
>    - trying to lock PG_writeback pages
>
> Which may be confirmed by running
>
>         ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32
> or
>         echo w > /proc/sysrq-trigger    # and check dmesg
>
> during the stalls. The latter command works more reliably.


Sorry for the delay (background: rsync'ing large files from/to a hard disk
on a desktop with 16GB of RAM makes the whole desktop unresponsive).

I just triggered it today (running 3.12), and ran sysrq-w:

[ 5547.001505] SysRq : Show Blocked State
[ 5547.001509]   task                        PC stack   pid father
[ 5547.001516] btrfs-transacti D ffff880425d7a8a0     0   193      2 0x00000000
[ 5547.001519]  ffff880425eede10 0000000000000002 ffff880425eedfd8 0000000000012e40
[ 5547.001521]  ffff880425eedfd8 0000000000012e40 ffff880425d7a8a0 ffffea00104baa80
[ 5547.001523]  ffff880425eedd90 ffff880425eedd68 ffff880425eedd70 ffffffff81080edd
[ 5547.001525] Call Trace:
[ 5547.001530]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.001533]  [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50
[ 5547.001535]  [<ffffffff8155cc86>] ? _raw_spin_unlock+0x16/0x40
[ 5547.001552]  [<ffffffffa008a742>] ? btrfs_run_ordered_operations+0x212/0x2c0 [btrfs]
[ 5547.001554]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.001556]  [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50
[ 5547.001557]  [<ffffffff8155d006>] ? _raw_spin_unlock_irqrestore+0x26/0x60
[ 5547.001559]  [<ffffffff8155b719>] schedule+0x29/0x70
[ 5547.001566]  [<ffffffffa0072215>] btrfs_commit_transaction+0x265/0x9d0 [btrfs]
[ 5547.001569]  [<ffffffff81073d20>] ? wake_up_atomic_t+0x30/0x30
[ 5547.001575]  [<ffffffffa006982d>] transaction_kthread+0x19d/0x220 [btrfs]
[ 5547.001581]  [<ffffffffa0069690>] ? free_fs_root+0xc0/0xc0 [btrfs]
[ 5547.001583]  [<ffffffff81072e70>] kthread+0xc0/0xd0
[ 5547.001585]  [<ffffffff81072db0>] ? kthread_create_on_node+0x120/0x120
[ 5547.001587]  [<ffffffff81564bac>] ret_from_fork+0x7c/0xb0
[ 5547.001588]  [<ffffffff81072db0>] ? kthread_create_on_node+0x120/0x120
[ 5547.001590] systemd-journal D ffff880426e19860     0   234      1 0x00000000
[ 5547.001592]  ffff880426d77d90 0000000000000002 ffff880426d77fd8 0000000000012e40
[ 5547.001593]  ffff880426d77fd8 0000000000012e40 ffff880426e19860 ffffffff8155d7cd
[ 5547.001595]  0000000000000001 0000000000000001 0000000000000000 ffffffff81572560
[ 5547.001596] Call Trace:
[ 5547.001598]  [<ffffffff8155d7cd>] ? retint_restore_args+0xe/0xe
[ 5547.001601]  [<ffffffff8122b47b>] ? queue_unplugged+0x3b/0xe0
[ 5547.001602]  [<ffffffff8122da9b>] ? blk_flush_plug_list+0x1eb/0x230
[ 5547.001604]  [<ffffffff8155b719>] schedule+0x29/0x70
[ 5547.001606]  [<ffffffff8155bb88>] schedule_preempt_disabled+0x18/0x30
[ 5547.001607]  [<ffffffff8155a2f4>] __mutex_lock_slowpath+0x124/0x1f0
[ 5547.001613]  [<ffffffffa0071c9b>] ? btrfs_write_marked_extents+0xbb/0xe0 [btrfs]
[ 5547.001615]  [<ffffffff8155a3d7>] mutex_lock+0x17/0x30
[ 5547.001623]  [<ffffffffa00ae06a>] btrfs_sync_log+0x22a/0x690 [btrfs]
[ 5547.001630]  [<ffffffffa0082f47>] btrfs_sync_file+0x287/0x2e0 [btrfs]
[ 5547.001632]  [<ffffffff811abb96>] do_fsync+0x56/0x80
[ 5547.001634]  [<ffffffff811abe20>] SyS_fsync+0x10/0x20
[ 5547.001635]  [<ffffffff81564e5f>] tracesys+0xdd/0xe2
[ 5547.001644] mysqld          D ffff8803f0901860     0   643    579 0x00000000
[ 5547.001645]  ffff8803f090de18 0000000000000002 ffff8803f090dfd8 0000000000012e40
[ 5547.001647]  ffff8803f090dfd8 0000000000012e40 ffff8803f0901860 ffff88016d038000
[ 5547.001648]  ffff880426908d00 0000000024119d80 0000000000000000 0000000000000000
[ 5547.001650] Call Trace:
[ 5547.001657]  [<ffffffffa0074d14>] ? btrfs_submit_bio_hook+0x84/0x1f0 [btrfs]
[ 5547.001659]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.001660]  [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50
[ 5547.001662]  [<ffffffff8155d006>] ? _raw_spin_unlock_irqrestore+0x26/0x60
[ 5547.001663]  [<ffffffff8155b719>] schedule+0x29/0x70
[ 5547.001669]  [<ffffffffa007170f>] wait_current_trans.isra.17+0xbf/0x120 [btrfs]
[ 5547.001671]  [<ffffffff81073d20>] ? wake_up_atomic_t+0x30/0x30
[ 5547.001677]  [<ffffffffa0072cff>] start_transaction+0x37f/0x570 [btrfs]
[ 5547.001680]  [<ffffffff8112632e>] ? do_writepages+0x1e/0x40
[ 5547.001686]  [<ffffffffa0072f0b>] btrfs_start_transaction+0x1b/0x20 [btrfs]
[ 5547.001693]  [<ffffffffa0082e3f>] btrfs_sync_file+0x17f/0x2e0 [btrfs]
[ 5547.001694]  [<ffffffff811abb96>] do_fsync+0x56/0x80
[ 5547.001696]  [<ffffffff811abe43>] SyS_fdatasync+0x13/0x20
[ 5547.001697]  [<ffffffff81564e5f>] tracesys+0xdd/0xe2
[ 5547.001701] virtuoso-t      D ffff88000310b0c0     0   617    609 0x00000000
[ 5547.001702]  ffff8803f4867c20 0000000000000002 ffff8803f4867fd8 0000000000012e40
[ 5547.001704]  ffff8803f4867fd8 0000000000012e40 ffff88000310b0c0 ffffffff813ce4af
[ 5547.001705]  ffffffff81860520 ffff8802d8ad8a00 ffff8803f4867ba0 ffffffff81231a0e
[ 5547.001707] Call Trace:
[ 5547.001709]  [<ffffffff813ce4af>] ? scsi_pool_alloc_command+0x3f/0x80
[ 5547.001712]  [<ffffffff81231a0e>] ? __blk_segment_map_sg+0x4e/0x120
[ 5547.001713]  [<ffffffff81231b6b>] ? blk_rq_map_sg+0x8b/0x1f0
[ 5547.001716]  [<ffffffff812481da>] ? cfq_dispatch_requests+0xba/0xc40
[ 5547.001718]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.001721]  [<ffffffff81119d70>] ? filemap_fdatawait+0x30/0x30
[ 5547.001722]  [<ffffffff8155b719>] schedule+0x29/0x70
[ 5547.001723]  [<ffffffff8155b9bf>] io_schedule+0x8f/0xe0
[ 5547.001725]  [<ffffffff81119d7e>] sleep_on_page+0xe/0x20
[ 5547.001727]  [<ffffffff81559142>] __wait_on_bit+0x62/0x90
[ 5547.001728]  [<ffffffff81119b2f>] wait_on_page_bit+0x7f/0x90
[ 5547.001730]  [<ffffffff81073da0>] ? wake_atomic_t_function+0x40/0x40
[ 5547.001732]  [<ffffffff81119cbb>] filemap_fdatawait_range+0x11b/0x1a0
[ 5547.001734]  [<ffffffff8155cc86>] ? _raw_spin_unlock+0x16/0x40
[ 5547.001740]  [<ffffffffa0071d47>] btrfs_wait_marked_extents+0x87/0xe0 [btrfs]
[ 5547.001747]  [<ffffffffa00ae328>] btrfs_sync_log+0x4e8/0x690 [btrfs]
[ 5547.001754]  [<ffffffffa0082f47>] btrfs_sync_file+0x287/0x2e0 [btrfs]
[ 5547.001756]  [<ffffffff811abb96>] do_fsync+0x56/0x80
[ 5547.001758]  [<ffffffff811abe20>] SyS_fsync+0x10/0x20
[ 5547.001759]  [<ffffffff81564e5f>] tracesys+0xdd/0xe2
[ 5547.001761] pool            D ffff88040db1c100     0   657    477 0x00000000
[ 5547.001763]  ffff8803ee809ba0 0000000000000002 ffff8803ee809fd8 0000000000012e40
[ 5547.001764]  ffff8803ee809fd8 0000000000012e40 ffff88040db1c100 0000000000000004
[ 5547.001766]  ffff8803ee809ae8 ffffffff8155cc86 ffff8803ee809bd0 ffffffffa005ada4
[ 5547.001767] Call Trace:
[ 5547.001769]  [<ffffffff8155cc86>] ? _raw_spin_unlock+0x16/0x40
[ 5547.001775]  [<ffffffffa005ada4>] ? reserve_metadata_bytes+0x184/0x930 [btrfs]
[ 5547.001776]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.001778]  [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50
[ 5547.001779]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.001781]  [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50
[ 5547.001783]  [<ffffffff8155d006>] ? _raw_spin_unlock_irqrestore+0x26/0x60
[ 5547.001784]  [<ffffffff8155b719>] schedule+0x29/0x70
[ 5547.001790]  [<ffffffffa007170f>] wait_current_trans.isra.17+0xbf/0x120 [btrfs]
[ 5547.001792]  [<ffffffff81073d20>] ? wake_up_atomic_t+0x30/0x30
[ 5547.001798]  [<ffffffffa0072cff>] start_transaction+0x37f/0x570 [btrfs]
[ 5547.001804]  [<ffffffffa0072f0b>] btrfs_start_transaction+0x1b/0x20 [btrfs]
[ 5547.001810]  [<ffffffffa0080b8b>] btrfs_create+0x3b/0x200 [btrfs]
[ 5547.001813]  [<ffffffff8120ce3c>] ? security_inode_permission+0x1c/0x30
[ 5547.001815]  [<ffffffff81189634>] vfs_create+0xb4/0x120
[ 5547.001817]  [<ffffffff8118bcd4>] do_last+0x904/0xea0
[ 5547.001818]  [<ffffffff81188cc0>] ? link_path_walk+0x70/0x930
[ 5547.001820]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.001822]  [<ffffffff8120d0e6>] ? security_file_alloc+0x16/0x20
[ 5547.001824]  [<ffffffff8118c32b>] path_openat+0xbb/0x6b0
[ 5547.001827]  [<ffffffff810dd64f>] ? __acct_update_integrals+0x7f/0x100
[ 5547.001829]  [<ffffffff81085782>] ? account_system_time+0xa2/0x180
[ 5547.001831]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.001833]  [<ffffffff8118d7ca>] do_filp_open+0x3a/0x90
[ 5547.001834]  [<ffffffff8155cc86>] ? _raw_spin_unlock+0x16/0x40
[ 5547.001836]  [<ffffffff81199e47>] ? __alloc_fd+0xa7/0x130
[ 5547.001839]  [<ffffffff8117ce89>] do_sys_open+0x129/0x220
[ 5547.001842]  [<ffffffff8100e795>] ? syscall_trace_enter+0x135/0x230
[ 5547.001844]  [<ffffffff8117cf9e>] SyS_open+0x1e/0x20
[ 5547.001845]  [<ffffffff81564e5f>] tracesys+0xdd/0xe2
[ 5547.001850] akregator       D ffff8803ed1d4100     0   875      1 0x00000000
[ 5547.001851]  ffff8803c7f1bba0 0000000000000002 ffff8803c7f1bfd8 0000000000012e40
[ 5547.001853]  ffff8803c7f1bfd8 0000000000012e40 ffff8803ed1d4100 0000000000000004
[ 5547.001854]  ffff8803c7f1bae8 ffffffff8155cc86 ffff8803c7f1bbd0 ffffffffa005ada4
[ 5547.001856] Call Trace:
[ 5547.001858]  [<ffffffff8155cc86>] ? _raw_spin_unlock+0x16/0x40
[ 5547.001863]  [<ffffffffa005ada4>] ? reserve_metadata_bytes+0x184/0x930 [btrfs]
[ 5547.001865]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.001866]  [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50
[ 5547.001868]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.001870]  [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50
[ 5547.001871]  [<ffffffff8155d006>] ? _raw_spin_unlock_irqrestore+0x26/0x60
[ 5547.001873]  [<ffffffff8155b719>] schedule+0x29/0x70
[ 5547.001879]  [<ffffffffa007170f>] wait_current_trans.isra.17+0xbf/0x120 [btrfs]
[ 5547.001881]  [<ffffffff81073d20>] ? wake_up_atomic_t+0x30/0x30
[ 5547.001886]  [<ffffffffa0072cff>] start_transaction+0x37f/0x570 [btrfs]
[ 5547.001888]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.001894]  [<ffffffffa0072f0b>] btrfs_start_transaction+0x1b/0x20 [btrfs]
[ 5547.001900]  [<ffffffffa0080b8b>] btrfs_create+0x3b/0x200 [btrfs]
[ 5547.001902]  [<ffffffff8120ce3c>] ? security_inode_permission+0x1c/0x30
[ 5547.001904]  [<ffffffff81189634>] vfs_create+0xb4/0x120
[ 5547.001906]  [<ffffffff8118bcd4>] do_last+0x904/0xea0
[ 5547.001907]  [<ffffffff81188cc0>] ? link_path_walk+0x70/0x930
[ 5547.001909]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.001911]  [<ffffffff8120d0e6>] ? security_file_alloc+0x16/0x20
[ 5547.001912]  [<ffffffff8118c32b>] path_openat+0xbb/0x6b0
[ 5547.001914]  [<ffffffff810dd64f>] ? __acct_update_integrals+0x7f/0x100
[ 5547.001916]  [<ffffffff81085782>] ? account_system_time+0xa2/0x180
[ 5547.001918]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.001920]  [<ffffffff8118d7ca>] do_filp_open+0x3a/0x90
[ 5547.001921]  [<ffffffff8155cc86>] ? _raw_spin_unlock+0x16/0x40
[ 5547.001923]  [<ffffffff81199e47>] ? __alloc_fd+0xa7/0x130
[ 5547.001925]  [<ffffffff8117ce89>] do_sys_open+0x129/0x220
[ 5547.001927]  [<ffffffff8100e795>] ? syscall_trace_enter+0x135/0x230
[ 5547.001928]  [<ffffffff8117cf9e>] SyS_open+0x1e/0x20
[ 5547.001930]  [<ffffffff81564e5f>] tracesys+0xdd/0xe2
[ 5547.001931] mpegaudioparse3 D ffff880341d10820     0  5917      1 0x00000000
[ 5547.001933]  ffff88030f779ce0 0000000000000002 ffff88030f779fd8 0000000000012e40
[ 5547.001934]  ffff88030f779fd8 0000000000012e40 ffff880341d10820 ffffffff81122a28
[ 5547.001936]  ffff88043e5ddc00 ffff880400000002 ffff88043e2138d0 0000000000000000
[ 5547.001938] Call Trace:
[ 5547.001939]  [<ffffffff81122a28>] ? __alloc_pages_nodemask+0x158/0xb00
[ 5547.001941]  [<ffffffff8102af55>] ? native_send_call_func_single_ipi+0x35/0x40
[ 5547.001943]  [<ffffffff810b31a8>] ? generic_exec_single+0x98/0xa0
[ 5547.001945]  [<ffffffff81086a18>] ? __enqueue_entity+0x78/0x80
[ 5547.001947]  [<ffffffff8108a837>] ? enqueue_entity+0x197/0x780
[ 5547.001948]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.001950]  [<ffffffff81119d90>] ? sleep_on_page+0x20/0x20
[ 5547.001951]  [<ffffffff8155b719>] schedule+0x29/0x70
[ 5547.001953]  [<ffffffff8155b9bf>] io_schedule+0x8f/0xe0
[ 5547.001954]  [<ffffffff81119d9e>] sleep_on_page_killable+0xe/0x40
[ 5547.001956]  [<ffffffff8155925d>] __wait_on_bit_lock+0x5d/0xc0
[ 5547.001958]  [<ffffffff81119f2a>] __lock_page_killable+0x6a/0x70
[ 5547.001960]  [<ffffffff81073da0>] ? wake_atomic_t_function+0x40/0x40
[ 5547.001961]  [<ffffffff8111b9e5>] generic_file_aio_read+0x435/0x700
[ 5547.001963]  [<ffffffff8117d2ba>] do_sync_read+0x5a/0x90
[ 5547.001965]  [<ffffffff8117d85a>] vfs_read+0x9a/0x170
[ 5547.001967]  [<ffffffff8117e039>] SyS_read+0x49/0xa0
[ 5547.001968]  [<ffffffff81564e5f>] tracesys+0xdd/0xe2
[ 5547.001970] mozStorage #2   D ffff8803b7aa1860     0   920    477 0x00000000
[ 5547.001972]  ffff8803b1473d80 0000000000000002 ffff8803b1473fd8 0000000000012e40
[ 5547.001974]  ffff8803b1473fd8 0000000000012e40 ffff8803b7aa1860 0000000000000004
[ 5547.001975]  ffff8803b1473cc8 ffffffff8155cc86 ffff8803b1473db0 ffffffffa005ada4
[ 5547.001977] Call Trace:
[ 5547.001978]  [<ffffffff8155cc86>] ? _raw_spin_unlock+0x16/0x40
[ 5547.001984]  [<ffffffffa005ada4>] ? reserve_metadata_bytes+0x184/0x930 [btrfs]
[ 5547.001990]  [<ffffffffa0084729>] ? __btrfs_buffered_write+0x3d9/0x490 [btrfs]
[ 5547.001992]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.001994]  [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50
[ 5547.001995]  [<ffffffff8155d006>] ? _raw_spin_unlock_irqrestore+0x26/0x60
[ 5547.001997]  [<ffffffff8155b719>] schedule+0x29/0x70
[ 5547.002003]  [<ffffffffa007170f>] wait_current_trans.isra.17+0xbf/0x120 [btrfs]
[ 5547.002004]  [<ffffffff81073d20>] ? wake_up_atomic_t+0x30/0x30
[ 5547.002010]  [<ffffffffa0072cff>] start_transaction+0x37f/0x570 [btrfs]
[ 5547.002016]  [<ffffffffa0072f0b>] btrfs_start_transaction+0x1b/0x20 [btrfs]
[ 5547.002023]  [<ffffffffa007c8a1>] btrfs_setattr+0x101/0x290 [btrfs]
[ 5547.002025]  [<ffffffff810d675c>] ? rcu_eqs_enter+0x5c/0xa0
[ 5547.002027]  [<ffffffff81198a6c>] notify_change+0x1dc/0x360
[ 5547.002029]  [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50
[ 5547.002030]  [<ffffffff8117bdcb>] do_truncate+0x6b/0xa0
[ 5547.002032]  [<ffffffff8117f8b9>] ? __sb_start_write+0x49/0x100
[ 5547.002033]  [<ffffffff8117c12b>] SyS_ftruncate+0x10b/0x160
[ 5547.002035]  [<ffffffff81564e5f>] tracesys+0xdd/0xe2
[ 5547.002036] Cache I/O       D ffff8803b7aa28a0     0   922    477 0x00000000
[ 5547.002038]  ffff8803b1495e18 0000000000000002 ffff8803b1495fd8 0000000000012e40
[ 5547.002039]  ffff8803b1495fd8 0000000000012e40 ffff8803b7aa28a0 ffff8803b1495e08
[ 5547.002041]  ffff8803b1495db0 ffffffff8111a25a ffff8803b1495e40 ffff8803b1495df0
[ 5547.002043] Call Trace:
[ 5547.002045]  [<ffffffff8111a25a>] ? find_get_pages_tag+0xea/0x180
[ 5547.002047]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.002048]  [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50
[ 5547.002050]  [<ffffffff8155d006>] ? _raw_spin_unlock_irqrestore+0x26/0x60
[ 5547.002051]  [<ffffffff8155b719>] schedule+0x29/0x70
[ 5547.002057]  [<ffffffffa007170f>] wait_current_trans.isra.17+0xbf/0x120 [btrfs]
[ 5547.002059]  [<ffffffff81073d20>] ? wake_up_atomic_t+0x30/0x30
[ 5547.002065]  [<ffffffffa0072cff>] start_transaction+0x37f/0x570 [btrfs]
[ 5547.002071]  [<ffffffffa0072f0b>] btrfs_start_transaction+0x1b/0x20 [btrfs]
[ 5547.002077]  [<ffffffffa0082e3f>] btrfs_sync_file+0x17f/0x2e0 [btrfs]
[ 5547.002079]  [<ffffffff811abb96>] do_fsync+0x56/0x80
[ 5547.002080]  [<ffffffff811abe20>] SyS_fsync+0x10/0x20
[ 5547.002081]  [<ffffffff81564e5f>] tracesys+0xdd/0xe2
[ 5547.002083] mozStorage #6   D ffff8803c0cfa8a0     0   982    477 0x00000000
[ 5547.002085]  ffff8803a10f5ba0 0000000000000002 ffff8803a10f5fd8 0000000000012e40
[ 5547.002086]  ffff8803a10f5fd8 0000000000012e40 ffff8803c0cfa8a0 0000000000000004
[ 5547.002088]  ffff8803a10f5ae8 ffffffff8155cc86 ffff8803a10f5bd0 ffffffffa005ada4
[ 5547.002089] Call Trace:
[ 5547.002091]  [<ffffffff8155cc86>] ? _raw_spin_unlock+0x16/0x40
[ 5547.002096]  [<ffffffffa005ada4>] ? reserve_metadata_bytes+0x184/0x930 [btrfs]
[ 5547.002098]  [<ffffffff8102b067>] ? native_smp_send_reschedule+0x47/0x60
[ 5547.002100]  [<ffffffff8107f7bc>] ? resched_task+0x5c/0x60
[ 5547.002101]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.002103]  [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50
[ 5547.002104]  [<ffffffff8155d006>] ? _raw_spin_unlock_irqrestore+0x26/0x60
[ 5547.002106]  [<ffffffff8155b719>] schedule+0x29/0x70
[ 5547.002112]  [<ffffffffa007170f>] wait_current_trans.isra.17+0xbf/0x120 [btrfs]
[ 5547.002113]  [<ffffffff81073d20>] ? wake_up_atomic_t+0x30/0x30
[ 5547.002119]  [<ffffffffa0072cff>] start_transaction+0x37f/0x570 [btrfs]
[ 5547.002125]  [<ffffffffa0072f0b>] btrfs_start_transaction+0x1b/0x20 [btrfs]
[ 5547.002131]  [<ffffffffa0080b8b>] btrfs_create+0x3b/0x200 [btrfs]
[ 5547.002133]  [<ffffffff8120ce3c>] ? security_inode_permission+0x1c/0x30
[ 5547.002134]  [<ffffffff81189634>] vfs_create+0xb4/0x120
[ 5547.002136]  [<ffffffff8118bcd4>] do_last+0x904/0xea0
[ 5547.002138]  [<ffffffff81188cc0>] ? link_path_walk+0x70/0x930
[ 5547.002139]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.002141]  [<ffffffff8120d0e6>] ? security_file_alloc+0x16/0x20
[ 5547.002143]  [<ffffffff8118c32b>] path_openat+0xbb/0x6b0
[ 5547.002145]  [<ffffffff810dd64f>] ? __acct_update_integrals+0x7f/0x100
[ 5547.002147]  [<ffffffff81085782>] ? account_system_time+0xa2/0x180
[ 5547.002148]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.002150]  [<ffffffff8118d7ca>] do_filp_open+0x3a/0x90
[ 5547.002152]  [<ffffffff8155cc86>] ? _raw_spin_unlock+0x16/0x40
[ 5547.002153]  [<ffffffff81199e47>] ? __alloc_fd+0xa7/0x130
[ 5547.002155]  [<ffffffff8117ce89>] do_sys_open+0x129/0x220
[ 5547.002157]  [<ffffffff8100e795>] ? syscall_trace_enter+0x135/0x230
[ 5547.002159]  [<ffffffff8117cf9e>] SyS_open+0x1e/0x20
[ 5547.002160]  [<ffffffff81564e5f>] tracesys+0xdd/0xe2
[ 5547.002164] rsync           D ffff8802dcde0820     0  5803   5802 0x00000000
[ 5547.002165]  ffff8802daeb1a90 0000000000000002 ffff8802daeb1fd8 0000000000012e40
[ 5547.002167]  ffff8802daeb1fd8 0000000000012e40 ffff8802dcde0820 ffff880100000002
[ 5547.002169]  ffff8802daeb19e0 ffffffff81080edd ffff880308b337e0 0000000000000000
[ 5547.002170] Call Trace:
[ 5547.002172]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.002173]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.002175]  [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50
[ 5547.002177]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.002178]  [<ffffffff81560e8d>] ? add_preempt_count+0x3d/0x40
[ 5547.002180]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.002181]  [<ffffffff8155b719>] schedule+0x29/0x70
[ 5547.002182]  [<ffffffff81558f6a>] schedule_timeout+0x11a/0x230
[ 5547.002185]  [<ffffffff8105e0c0>] ? detach_if_pending+0x120/0x120
[ 5547.002187]  [<ffffffff810a5078>] ? ktime_get_ts+0x48/0xe0
[ 5547.002189]  [<ffffffff8155bd2b>] io_schedule_timeout+0x9b/0xf0
[ 5547.002191]  [<ffffffff811259a9>] balance_dirty_pages_ratelimited+0x3d9/0xa10
[ 5547.002198]  [<ffffffffa0c9ad84>] ? ext4_dirty_inode+0x54/0x60 [ext4]
[ 5547.002200]  [<ffffffff8111a8c8>] generic_file_buffered_write+0x1b8/0x290
[ 5547.002202]  [<ffffffff8111bfd9>] __generic_file_aio_write+0x1a9/0x3b0
[ 5547.002203]  [<ffffffff8111c238>] generic_file_aio_write+0x58/0xa0
[ 5547.002208]  [<ffffffffa0c8ef79>] ext4_file_write+0x99/0x3e0 [ext4]
[ 5547.002210]  [<ffffffff810ddaac>] ? acct_account_cputime+0x1c/0x20
[ 5547.002212]  [<ffffffff81085782>] ? account_system_time+0xa2/0x180
[ 5547.002213]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.002215]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.002216]  [<ffffffff8117d34a>] do_sync_write+0x5a/0x90
[ 5547.002218]  [<ffffffff8117d9ed>] vfs_write+0xbd/0x1e0
[ 5547.002220]  [<ffffffff8117e0d9>] SyS_write+0x49/0xa0
[ 5547.002221]  [<ffffffff81564e5f>] tracesys+0xdd/0xe2
[ 5547.002223] ktorrent        D ffff8802e7680820     0  5806      1 0x00000000
[ 5547.002224]  ffff8802daf7fba0 0000000000000002 ffff8802daf7ffd8 0000000000012e40
[ 5547.002226]  ffff8802daf7ffd8 0000000000012e40 ffff8802e7680820 0000000000000004
[ 5547.002227]  ffff8802daf7fae8 ffffffff8155cc86 ffff8802daf7fbd0 ffffffffa005ada4
[ 5547.002229] Call Trace:
[ 5547.002230]  [<ffffffff8155cc86>] ? _raw_spin_unlock+0x16/0x40
[ 5547.002236]  [<ffffffffa005ada4>] ? reserve_metadata_bytes+0x184/0x930 [btrfs]
[ 5547.002241]  [<ffffffffa004ae49>] ? btrfs_set_path_blocking+0x39/0x80 [btrfs]
[ 5547.002246]  [<ffffffffa004fe78>] ? btrfs_search_slot+0x498/0x970 [btrfs]
[ 5547.002247]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.002249]  [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50
[ 5547.002251]  [<ffffffff8155d006>] ? _raw_spin_unlock_irqrestore+0x26/0x60
[ 5547.002252]  [<ffffffff8155b719>] schedule+0x29/0x70
[ 5547.002258]  [<ffffffffa007170f>] wait_current_trans.isra.17+0xbf/0x120 [btrfs]
[ 5547.002260]  [<ffffffff81073d20>] ? wake_up_atomic_t+0x30/0x30
[ 5547.002266]  [<ffffffffa0072cff>] start_transaction+0x37f/0x570 [btrfs]
[ 5547.002268]  [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50
[ 5547.002273]  [<ffffffffa0072f0b>] btrfs_start_transaction+0x1b/0x20 [btrfs]
[ 5547.002280]  [<ffffffffa0080b8b>] btrfs_create+0x3b/0x200 [btrfs]
[ 5547.002281]  [<ffffffff8120ce3c>] ? security_inode_permission+0x1c/0x30
[ 5547.002283]  [<ffffffff81189634>] vfs_create+0xb4/0x120
[ 5547.002285]  [<ffffffff8118bcd4>] do_last+0x904/0xea0
[ 5547.002287]  [<ffffffff81188cc0>] ? link_path_walk+0x70/0x930
[ 5547.002288]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.002290]  [<ffffffff8120d0e6>] ? security_file_alloc+0x16/0x20
[ 5547.002292]  [<ffffffff8118c32b>] path_openat+0xbb/0x6b0
[ 5547.002293]  [<ffffffff810dd64f>] ? __acct_update_integrals+0x7f/0x100
[ 5547.002295]  [<ffffffff81085782>] ? account_system_time+0xa2/0x180
[ 5547.002297]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.002299]  [<ffffffff8118d7ca>] do_filp_open+0x3a/0x90
[ 5547.002300]  [<ffffffff8155cc86>] ? _raw_spin_unlock+0x16/0x40
[ 5547.002302]  [<ffffffff81199e47>] ? __alloc_fd+0xa7/0x130
[ 5547.002304]  [<ffffffff8117ce89>] do_sys_open+0x129/0x220
[ 5547.002306]  [<ffffffff8100e795>] ? syscall_trace_enter+0x135/0x230
[ 5547.002307]  [<ffffffff8117cf9e>] SyS_open+0x1e/0x20
[ 5547.002309]  [<ffffffff81564e5f>] tracesys+0xdd/0xe2
[ 5547.002311] kworker/u16:0   D ffff88035c5ac920     0  6043      2 0x00000000
[ 5547.002313] Workqueue: writeback bdi_writeback_workfn (flush-8:32)
[ 5547.002315]  ffff88036c9cb898 0000000000000002 ffff88036c9cbfd8 0000000000012e40
[ 5547.002316]  ffff88036c9cbfd8 0000000000012e40 ffff88035c5ac920 ffff8804281de048
[ 5547.002318]  ffff88036c9cb7e8 ffffffff81080edd 0000000000000001 ffff88036c9cb800
[ 5547.002319] Call Trace:
[ 5547.002321]  [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50
[ 5547.002323]  [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50
[ 5547.002324]  [<ffffffff8155cc86>] ? _raw_spin_unlock+0x16/0x40
[ 5547.002326]  [<ffffffff8122b47b>] ? queue_unplugged+0x3b/0xe0
[ 5547.002328]  [<ffffffff8155b719>] schedule+0x29/0x70
[ 5547.002329]  [<ffffffff8155b9bf>] io_schedule+0x8f/0xe0
[ 5547.002331]  [<ffffffff8122b8aa>] get_request+0x1aa/0x780
[ 5547.002332]  [<ffffffff8123099e>] ? ioc_lookup_icq+0x4e/0x80
[ 5547.002334]  [<ffffffff81073d20>] ? wake_up_atomic_t+0x30/0x30
[ 5547.002336]  [<ffffffff8122db58>] blk_queue_bio+0x78/0x3e0
[ 5547.002337]  [<ffffffff8122c5c2>] generic_make_request+0xc2/0x110
[ 5547.002338]  [<ffffffff8122c683>] submit_bio+0x73/0x160
[ 5547.002344]  [<ffffffffa0c9bae5>] ext4_io_submit+0x25/0x50 [ext4]
[ 5547.002348]  [<ffffffffa0c981d3>] ext4_writepages+0x823/0xe00 [ext4]
[ 5547.002350]  [<ffffffff8112632e>] do_writepages+0x1e/0x40
[ 5547.002352]  [<ffffffff811a6340>] __writeback_single_inode+0x40/0x330
[ 5547.002353]  [<ffffffff811a7392>] writeback_sb_inodes+0x262/0x450
[ 5547.002355]  [<ffffffff811a761f>] __writeback_inodes_wb+0x9f/0xd0
[ 5547.002357]  [<ffffffff811a797b>] wb_writeback+0x32b/0x360
[ 5547.002358]  [<ffffffff811a8111>] bdi_writeback_workfn+0x221/0x510
[ 5547.002361]  [<ffffffff8106b917>] process_one_work+0x167/0x450
[ 5547.002362]  [<ffffffff8106c6a1>] worker_thread+0x121/0x3a0
[ 5547.002364]  [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50
[ 5547.002366]  [<ffffffff8106c580>] ? manage_workers.isra.25+0x2a0/0x2a0
[ 5547.002367]  [<ffffffff81072e70>] kthread+0xc0/0xd0
[ 5547.002369]  [<ffffffff81072db0>] ? kthread_create_on_node+0x120/0x120
[ 5547.002371]  [<ffffffff81564bac>] ret_from_fork+0x7c/0xb0
[ 5547.002372]  [<ffffffff81072db0>] ? kthread_create_on_node+0x120/0x120



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Disabling in-memory write cache for x86-64 in Linux II

Rob Landley
In reply to this post by Mel Gorman-2
On 10/30/2013 07:01:52 AM, Mel Gorman wrote:

> We talked about this a few months ago but I still suspect that we will
> have to bite the bullet and tune based on "do not dirty more data than
> it takes N seconds to writeback" using per-bdi writeback estimations.
> It's just not that trivial to implement as the writeback speeds can
> change for a variety of reasons (multiple IO sources, random vs
> sequential etc).

Record "block writes finished this second" into an 8 entry ring buffer,
with a flag saying "device was partly idle this period" so you can
ignore those entries. Keep a high water mark, which should converge to
the device's linear write capacity.

This gives you recent thrashing speed and max capacity, and some
weighted average of the two lets you avoid queuing up 10 minutes of
writes all at once like 3.0 would to a terabyte USB2 disk. (And then
vim calls sync() and hangs...)

The first tricky bit is the high water mark, but it's not too bad. If
the device reads and writes at the same rate you can populate it from
that, but even starting it with just one block should converge really
fast because A) the round trip time should be well under a second, and
B) if you're submitting more than one period's worth of data (you can
dirty enough to keep the disk busy for 2 seconds), then it'll queue up
2 blocks at a time, then 4, then 8, and increase exponentially until
you hit the high water mark. (Which is measured, so it won't
overshoot.)

The second tricky bit is weighting the average, but presumably counting
the high water mark as one, then adding in all the "device did not
actually go idle during this period" entries, and dividing by the
number of entries considered... Reasonable first guess?

Obvious optimization: instead of recording the "disk went idle" flag
in the ring buffer, just don't advance the ring buffer at the end of
that second, but zero out the entry and re-accumulate it. That way the
ring buffer should always have 7 seconds of measured activity, even if
it's not necessarily recent. And of course you don't have to wake
anything up when there was no I/O, so it's nicely quiescent when the
system is...

Lowering the high water mark in the case of a transient spurious
reading (maybe clock skew during suspend, or a virtualization glitch,
or some such) is fun, and could give you a 4-billion-block bad reading,
but if you always decrement the high water mark by 25% (x-=(x>>2)) each
second the disk didn't go idle (rounding up), and then queue up more
than one period's worth of data (but no more than say 8 seconds'
worth), such glitches should fix themselves and it'll work its way back
up or down to a reasonably accurate value. (Keep in mind you're
averaging the high water mark back down with 7 seconds of measured
data from the ring buffer. Maybe you can cap the high water mark at
the sum of all the measured values in the ring buffer as an extra
check? You're already calculating it to do the average, so...)

This is assuming your hard drive _itself_ doesn't have bufferbloat, but
http://spritesmods.com/?art=hddhack&f=rss implies they don't, and
tagged command queueing lets you see through that anyway, so your
"actually committed" numbers could presumably still be accurate if the
manufacturers aren't totally lying.

Given how far behind I am on my email, I assume somebody's already
suggested this by now. :)

Rob
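Rob's estimator is concrete enough to sketch in plain userspace C. This is purely illustrative: the names (`wb_est`, `wb_est_tick`, `wb_est_rate`) and the layout are invented for this sketch, not kernel code; a real implementation would live alongside the per-BDI writeback bandwidth estimation.

```c
/*
 * Userspace sketch of the ring-buffer writeback estimator described
 * above. All names are invented for illustration.
 */
#include <assert.h>
#include <stdio.h>

#define WB_RING 8	/* per-second samples, as in the 8-entry ring */

struct wb_est {
	unsigned long ring[WB_RING];	/* blocks completed per busy second */
	int head;			/* next slot to overwrite */
	unsigned long high;		/* high-water mark: best observed rate */
};

/* Feed one second of completions into the estimator. */
static void wb_est_tick(struct wb_est *e, unsigned long done, int was_idle)
{
	if (was_idle)
		return;	/* device partly idle this period: ignore the sample */

	e->ring[e->head] = done;
	e->head = (e->head + 1) % WB_RING;

	/* Decay the mark by 25% each busy second (x -= x >> 2), then
	 * let a faster measured rate pull it back up. Spurious spikes
	 * therefore bleed away instead of sticking forever. */
	e->high -= e->high >> 2;
	if (done > e->high)
		e->high = done;
}

/*
 * Weighted average: count the high-water mark once, then add every
 * measured (non-zero, i.e. non-idle) ring entry and divide by the
 * number of values considered.
 */
static unsigned long wb_est_rate(const struct wb_est *e)
{
	unsigned long sum = e->high;
	int i, n = 1;

	for (i = 0; i < WB_RING; i++) {
		if (e->ring[i]) {
			sum += e->ring[i];
			n++;
		}
	}
	return sum / n;
}
```

Rob's extra safety check, capping the high-water mark at the sum of the ring entries, is omitted here for brevity but would be a one-line comparison inside `wb_est_tick`.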

Re: Disabling in-memory write cache for x86-64 in Linux II

One Thousand Gnomes
> This is assuming your hard drive _itself_ doesn't have bufferbloat, but
> http://spritesmods.com/?art=hddhack&f=rss implies they don't, and
> tagged command queueing lets you see through that anyway so your
> "actually committed" numbers could presumably still be accurate if the
> manufacturers aren't totally lying.

They don't, but they do have wildly variable completion rates and times.
Nothing like a drive having a seven-second hiccup to annoy people, but
they can do that at times.

There are two problems though

1. Disk performance, particularly in the rotating-rust world, is
operations/second, which is rarely related to volume.

2. If the block layer is trying to decide whether the drive is busy,
you've got it the wrong way up IMHO. Busy-ness is a property of the
device, and often very device- and subsystem-specific, so the device end
of the chain should figure out how loaded it feels.


Beyond that, the entire problem is well understood, and there isn't any
real difference between an IPv4 network and a storage layer. In fact, in
some cases like NFS, DRBD, AoE, and remote block device stuff it's even
more so.

(TCP-based remote block devices, btw, are a prime example of why you
need the device end of the chain figuring out busy state... you'll
otherwise end up doing double backoff.)

Alan


Re: [PATCH] mm: add strictlimit knob -v2

akpm
In reply to this post by Maxim Patlasov
On Wed, 06 Nov 2013 19:05:57 +0400 Maxim Patlasov <[hidden email]> wrote:

> "strictlimit" feature was introduced to enforce per-bdi dirty limits for
> FUSE which sets bdi max_ratio to 1% by default:
>
> http://article.gmane.org/gmane.linux.kernel.mm/105809
>
> However the feature can be useful for other relatively slow or untrusted
> BDIs like USB flash drives and DVD+RW. The patch adds a knob to enable the
> feature:
>
> echo 1 > /sys/class/bdi/X:Y/strictlimit
>
> Being enabled, the feature enforces bdi max_ratio limit even if global (10%)
> dirty limit is not reached. Of course, the effect is not visible until
> /sys/class/bdi/X:Y/max_ratio is decreased to some reasonable value.
>
> ...
>
> --- a/Documentation/ABI/testing/sysfs-class-bdi
> +++ b/Documentation/ABI/testing/sysfs-class-bdi
> @@ -53,3 +53,11 @@ stable_pages_required (read-only)
>  
>  	If set, the backing device requires that all pages comprising a write
>  	request must not be changed until writeout is complete.
> +
> +strictlimit (read-write)
> +
> +	Forces per-BDI checks for the share of given device in the write-back
> +	cache even before the global background dirty limit is reached. This
> +	is useful in situations where the global limit is much higher than
> +	affordable for given relatively slow (or untrusted) device. Turning
> +	strictlimit on has no visible effect if max_ratio is equal to 100%.
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index ce682f7..4ee1d64 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -234,11 +234,46 @@ static ssize_t stable_pages_required_show(struct device *dev,
>  }
>  static DEVICE_ATTR_RO(stable_pages_required);
>  
> +static ssize_t strictlimit_store(struct device *dev,
> +		struct device_attribute *attr, const char *buf, size_t count)
> +{
> +	struct backing_dev_info *bdi = dev_get_drvdata(dev);
> +	unsigned int val;
> +	ssize_t ret;
> +
> +	ret = kstrtouint(buf, 10, &val);
> +	if (ret < 0)
> +		return ret;
> +
> +	switch (val) {
> +	case 0:
> +		bdi->capabilities &= ~BDI_CAP_STRICTLIMIT;
> +		break;
> +	case 1:
> +		bdi->capabilities |= BDI_CAP_STRICTLIMIT;
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	return count;
> +}
> +static ssize_t strictlimit_show(struct device *dev,
> +		struct device_attribute *attr, char *page)
> +{
> +	struct backing_dev_info *bdi = dev_get_drvdata(dev);
> +
> +	return snprintf(page, PAGE_SIZE-1, "%d\n",
> +			!!(bdi->capabilities & BDI_CAP_STRICTLIMIT));
> +}
> +static DEVICE_ATTR_RW(strictlimit);
> +
>  static struct attribute *bdi_dev_attrs[] = {
>  	&dev_attr_read_ahead_kb.attr,
>  	&dev_attr_min_ratio.attr,
>  	&dev_attr_max_ratio.attr,
>  	&dev_attr_stable_pages_required.attr,
> +	&dev_attr_strictlimit.attr,
>  	NULL,

Well the patch is certainly simple and straightforward enough and
*seems* like it will be useful.  The main (and large!) downside is that
it adds to the user interface, so we'll have to maintain this feature
and its functionality forever.

Given this, my concern is that while potentially useful, the feature
might not be *sufficiently* useful to justify its inclusion.  So we'll
end up addressing these issues by other means, and then we're left
maintaining this obsolete legacy feature.

So I'm thinking that unless someone can show that this is good and
complete and sufficient for a "large enough" set of issues, I'll take a
pass on the patch[1].  What do people think?


[1] Actually, I'll stick it in -mm and maintain it, so next time
someone reports an issue I can say "hey, try this".
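For context, the knob from the patch would be combined with a lowered max_ratio roughly as below. The device numbers (8:16, i.e. a second SCSI/USB disk) and the 2% ratio are illustrative values, not taken from the patch:

```shell
# Hypothetical usage sketch; requires a kernel with the strictlimit patch.
# Find the bdi name: it is the major:minor of the block device.
ls -l /dev/sdb          # e.g. "brw-rw---- 1 root disk 8, 16 ... /dev/sdb"

# Cap this device at 2% of the global dirty limit...
echo 2 > /sys/class/bdi/8:16/max_ratio

# ...and enforce that cap even before the global background limit is hit.
echo 1 > /sys/class/bdi/8:16/strictlimit
```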
