Disabling in-memory write cache for x86-64 in Linux II


Disabling in-memory write cache for x86-64 in Linux II

Artem S. Tashkinov
Hello!

On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 kernel
built for the i686 (with PAE) and x86-64 architectures. What's really troubling me
is that the x86-64 kernel has the following problem:

When I copy large files to any storage device, be it my HDD with ext4 partitions
or flash drive with FAT32 partitions, the kernel first caches them in memory entirely
then flushes them some time later (quite unpredictably though) or immediately upon
invoking "sync".

How can I disable this memory cache altogether (or at least minimize caching)? When
running the i686 kernel with the same configuration I don't observe this effect - files get
written out almost immediately (for instance "sync" takes less than a second, whereas
on x86-64 it can take a dozen _minutes_ depending on file size and storage
performance).

I'm _not_ talking about disabling write cache on my storage itself (hdparm -W 0 /dev/XXX)
- firstly this command is detrimental to the performance of my PC, secondly, it won't help
in this instance.

Swap is totally disabled, usually my memory is entirely free.

My kernel configuration can be fetched here: https://bugzilla.kernel.org/show_bug.cgi?id=63531

Please, advise.

Best regards,

Artem

Re: Disabling in-memory write cache for x86-64 in Linux II

Linus Torvalds
On Fri, Oct 25, 2013 at 8:25 AM, Artem S. Tashkinov <[hidden email]> wrote:
>
> On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 kernel
> built for the i686 (with PAE) and x86-64 architectures. What's really troubling me
> is that the x86-64 kernel has the following problem:
>
> When I copy large files to any storage device, be it my HDD with ext4 partitions
> or flash drive with FAT32 partitions, the kernel first caches them in memory entirely
> then flushes them some time later (quite unpredictably though) or immediately upon
> invoking "sync".

Yeah, I think we default to a 10% "dirty background memory" (and
allow up to 20% dirty), so on your 16GB machine, we allow up to 1.6GB
of dirty memory for writeout before we even start writing, and twice
that before we start *waiting* for it.

On 32-bit x86, we only count the memory in the low 1GB (really
actually up to about 890MB), so "10% dirty" really means just about
90MB of buffering (and a "hard limit" of ~180MB of dirty).

And that "up to 3.2GB of dirty memory" is just crazy. Our defaults
come from the old days of less memory (and perhaps servers that don't
much care), and the fact that x86-32 ends up having much lower limits
even if you end up having more memory.

You can easily tune it:

    echo $((16*1024*1024)) > /proc/sys/vm/dirty_background_bytes
    echo $((48*1024*1024)) > /proc/sys/vm/dirty_bytes

or similar. But you're right, we need to make the defaults much saner.
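
To see what is currently in effect before changing anything (note that the
*_bytes and *_ratio knobs are mutually exclusive - setting one zeroes the
other):

    sysctl vm.dirty_background_ratio vm.dirty_background_bytes \
           vm.dirty_ratio vm.dirty_bytes
    grep -E '^(Dirty|Writeback):' /proc/meminfo   # dirty data at this moment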

Wu? Andrew? Comments?

             Linus

Re: Disabling in-memory write cache for x86-64 in Linux II

Artem S. Tashkinov
Oct 25, 2013 02:18:50 PM, Linus Torvalds wrote:

> On Fri, Oct 25, 2013 at 8:25 AM, Artem S. Tashkinov wrote:

>>
>> On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 kernel
>> built for the i686 (with PAE) and x86-64 architectures. What's really troubling me
>> is that the x86-64 kernel has the following problem:
>>
>> When I copy large files to any storage device, be it my HDD with ext4 partitions
>> or flash drive with FAT32 partitions, the kernel first caches them in memory entirely
>> then flushes them some time later (quite unpredictably though) or immediately upon
>> invoking "sync".
>
>Yeah, I think we default to a 10% "dirty background memory" (and
>allow up to 20% dirty), so on your 16GB machine, we allow up to 1.6GB
>of dirty memory for writeout before we even start writing, and twice
>that before we start *waiting* for it.
>
>On 32-bit x86, we only count the memory in the low 1GB (really
>actually up to about 890MB), so "10% dirty" really means just about
>90MB of buffering (and a "hard limit" of ~180MB of dirty).
>
>And that "up to 3.2GB of dirty memory" is just crazy. Our defaults
>come from the old days of less memory (and perhaps servers that don't
>much care), and the fact that x86-32 ends up having much lower limits
>even if you end up having more memory.
>
>You can easily tune it:
>
>    echo $((16*1024*1024)) > /proc/sys/vm/dirty_background_bytes
>    echo $((48*1024*1024)) > /proc/sys/vm/dirty_bytes
>
>or similar. But you're right, we need to make the defaults much saner.
>
>Wu? Andrew? Comments?
>

My feeling is that vm.dirty_ratio/vm.dirty_background_ratio should _not_ be
percentage based, 'cause for PCs/servers with a lot of memory (say 64GB or
more) this value becomes unrealistic (13GB) and I've already had some
unpleasant effects due to it.

E.g. when I dump a large MySQL database (its dump weighs around 10GB)
- it appears on the disk almost immediately, but then, later, when the kernel
decides to flush it to the disk, the server almost stalls and other IO requests
take a lot more time to complete, even though mysqldump is run with ionice -c3 -
so the use of ionice has no real effect.

Artem

Re: Disabling in-memory write cache for x86-64 in Linux II

Linus Torvalds
On Fri, Oct 25, 2013 at 9:30 AM, Artem S. Tashkinov <[hidden email]> wrote:
>
> My feeling is that vm.dirty_ratio/vm.dirty_background_ratio should _not_ be
> percentage based, 'cause for PCs/servers with a lot of memory (say 64GB or
> more) this value becomes unrealistic (13GB) and I've already had some
> unpleasant effects due to it.

Right. The percentage notion really goes back to the days when we
typically had 8-64 *megabytes* of memory. So if you had an 8MB machine
you wouldn't want to have more than one megabyte of dirty data, but if
you were "Mr Moneybags" and could afford 64MB, you might want to have
up to 8MB dirty!!

Things have changed.

So I would suggest we change the defaults. Or perhaps make the rule be
that "the ratio numbers are 'ratio of memory up to 1GB'", to make the
semantics similar across 32-bit HIGHMEM machines and 64-bit machines.

The modern way of expressing the dirty limits is to give the actual
absolute byte amounts, but we default to the legacy ratio mode..
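
For example, one way to make byte-based limits persist across reboots (a
sketch; assumes a distro that reads /etc/sysctl.d/ at boot):

    cat > /etc/sysctl.d/99-writeback.conf <<'EOF'
    vm.dirty_background_bytes = 16777216
    vm.dirty_bytes = 50331648
    EOF
    sysctl -p /etc/sysctl.d/99-writeback.conf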

                Linus

Re: Disabling in-memory write cache for x86-64 in Linux II

Karl Kiniger
On Fri 131025, Linus Torvalds wrote:

> On Fri, Oct 25, 2013 at 9:30 AM, Artem S. Tashkinov <[hidden email]> wrote:
> >
> > My feeling is that vm.dirty_ratio/vm.dirty_background_ratio should _not_ be
> > percentage based, 'cause for PCs/servers with a lot of memory (say 64GB or
> > more) this value becomes unrealistic (13GB) and I've already had some
> > unpleasant effects due to it.
>
> Right. The percentage notion really goes back to the days when we
> typically had 8-64 *megabytes* of memory. So if you had an 8MB machine
> you wouldn't want to have more than one megabyte of dirty data, but if
> you were "Mr Moneybags" and could afford 64MB, you might want to have
> up to 8MB dirty!!
>
> Things have changed.
>
> So I would suggest we change the defaults. Or perhaps make the rule be
> that "the ratio numbers are 'ratio of memory up to 1GB'", to make the
> semantics similar across 32-bit HIGHMEM machines and 64-bit machines.
>
> The modern way of expressing the dirty limits is to give the actual
> absolute byte amounts, but we default to the legacy ratio mode..
>
>                 Linus

Is it currently possible to somehow set the above values per block device?

I want the default behaviour for almost everything, but DVD drives in DVD+RW
packet-writing mode can easily take several minutes on a sync.

Karl

Re: Disabling in-memory write cache for x86-64 in Linux II

Theodore Ts'o
In reply to this post by Artem S. Tashkinov
On Fri, Oct 25, 2013 at 08:30:53AM +0000, Artem S. Tashkinov wrote:
> My feeling is that vm.dirty_ratio/vm.dirty_background_ratio should _not_ be
> percentage based, 'cause for PCs/servers with a lot of memory (say 64GB or
> more) this value becomes unrealistic (13GB) and I've already had some
> unpleasant effects due to it.

What I think would make sense is to dynamically measure the speed of
writeback, so that we can set these limits as a function of the device
speed.  It's already the case that the writeback limits don't make
sense on a slow USB 2.0 storage stick; I suspect that for really huge
RAID arrays or very fast flash devices, it doesn't make much sense
either.

The problem is that if you have a system that has *both* a USB stick
_and_ a fast flash/RAID storage array both needing writeback, this
doesn't work well --- but what we have right now doesn't work all that
well anyway.

                                                - Ted

Re: Disabling in-memory write cache for x86-64 in Linux II

Andrew Morton
On Fri, 25 Oct 2013 05:18:42 -0400 "Theodore Ts'o" <[hidden email]> wrote:

> What I think would make sense is to dynamically measure the speed of
> writeback, so that we can set these limits as a function of the device
> speed.

We attempt to do this now - have a look through struct backing_dev_info.

Apparently all this stuff isn't working as desired (and perhaps as designed)
in this case.  Will take a look after a return to normalcy ;)

Re: Disabling in-memory write cache for x86-64 in Linux II

Linus Torvalds
On Fri, Oct 25, 2013 at 10:29 AM, Andrew Morton
<[hidden email]> wrote:
>
> Apparently all this stuff isn't working as desired (and perhaps as designed)
> in this case.  Will take a look after a return to normalcy ;)

It definitely doesn't work. I can trivially reproduce problems by just
having a cheap (==slow) USB key with an ext3 filesystem, and doing a
git clone to it. The end result is not pretty, and that's actually not
even a huge amount of data.
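
A reproducer along those lines (paths are purely illustrative):

    git clone ~/src/linux /mnt/usbkey/linux   # returns long before the data is on the stick
    time sync                                 # this is where the minutes go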

                 Linus

Re: Disabling in-memory write cache for x86-64 in Linux II

NeilBrown
In reply to this post by Artem S. Tashkinov
On Fri, 25 Oct 2013 07:25:13 +0000 (UTC) "Artem S. Tashkinov"
<[hidden email]> wrote:

> Hello!
>
> On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 kernel
> built for the i686 (with PAE) and x86-64 architectures. What's really troubling me
> is that the x86-64 kernel has the following problem:
>
> When I copy large files to any storage device, be it my HDD with ext4 partitions
> or flash drive with FAT32 partitions, the kernel first caches them in memory entirely
> then flushes them some time later (quite unpredictably though) or immediately upon
> invoking "sync".
>
> How can I disable this memory cache altogether (or at least minimize caching)? When
> running the i686 kernel with the same configuration I don't observe this effect - files get
> written out almost immediately (for instance "sync" takes less than a second, whereas
> on x86-64 it can take a dozen _minutes_ depending on file size and storage
> performance).

What exactly is bothering you about this?  The amount of memory used or the
time until data is flushed?

If the latter, then /proc/sys/vm/dirty_expire_centisecs is where you want to
look.
This defaults to 30 seconds (3000 centisecs).
You could make it smaller (providing you also shrink
dirty_writeback_centisecs in a similar ratio) and the VM will flush out data
more quickly.
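
For example (a sketch; 3000 and 500 centisecs are the usual defaults for these
two knobs):

    echo 300 > /proc/sys/vm/dirty_expire_centisecs      # expire dirty data after ~3s
    echo 100 > /proc/sys/vm/dirty_writeback_centisecs   # wake the flusher every 1s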

NeilBrown


>
> I'm _not_ talking about disabling write cache on my storage itself (hdparm -W 0 /dev/XXX)
> - firstly this command is detrimental to the performance of my PC, secondly, it won't help
> in this instance.
>
> Swap is totally disabled, usually my memory is entirely free.
>
> My kernel configuration can be fetched here: https://bugzilla.kernel.org/show_bug.cgi?id=63531
>
> Please, advise.
>
> Best regards,
>
> Artem



Re: Disabling in-memory write cache for x86-64 in Linux II

David Lang
On Fri, 25 Oct 2013, NeilBrown wrote:

> On Fri, 25 Oct 2013 07:25:13 +0000 (UTC) "Artem S. Tashkinov"
> <[hidden email]> wrote:
>
>> Hello!
>>
>> On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 kernel
>> built for the i686 (with PAE) and x86-64 architectures. What's really troubling me
>> is that the x86-64 kernel has the following problem:
>>
>> When I copy large files to any storage device, be it my HDD with ext4 partitions
>> or flash drive with FAT32 partitions, the kernel first caches them in memory entirely
>> then flushes them some time later (quite unpredictably though) or immediately upon
>> invoking "sync".
>>
>> How can I disable this memory cache altogether (or at least minimize caching)? When
>> running the i686 kernel with the same configuration I don't observe this effect - files get
>> written out almost immediately (for instance "sync" takes less than a second, whereas
>> on x86-64 it can take a dozen _minutes_ depending on file size and storage
>> performance).
>
> What exactly is bothering you about this?  The amount of memory used or the
> time until data is flushed?

actually, I think the problem is more the impact of the huge write later on.

David Lang

> If the latter, then /proc/sys/vm/dirty_expire_centisecs is where you want to
> look.
> This defaults to 30 seconds (3000 centisecs).
> You could make it smaller (providing you also shrink
> dirty_writeback_centisecs in a similar ratio) and the VM will flush out data
> more quickly.
>
> NeilBrown
>
>
>>
>> I'm _not_ talking about disabling write cache on my storage itself (hdparm -W 0 /dev/XXX)
>> - firstly this command is detrimental to the performance of my PC, secondly, it won't help
>> in this instance.
>>
>> Swap is totally disabled, usually my memory is entirely free.
>>
>> My kernel configuration can be fetched here: https://bugzilla.kernel.org/show_bug.cgi?id=63531
>>
>> Please, advise.
>>
>> Best regards,
>>
>> Artem

Re: Disabling in-memory write cache for x86-64 in Linux II

David Lang
In reply to this post by Linus Torvalds
On Fri, 25 Oct 2013, Linus Torvalds wrote:

> On Fri, Oct 25, 2013 at 9:30 AM, Artem S. Tashkinov <[hidden email]> wrote:
>>
>> My feeling is that vm.dirty_ratio/vm.dirty_background_ratio should _not_ be
>> percentage based, 'cause for PCs/servers with a lot of memory (say 64GB or
>> more) this value becomes unrealistic (13GB) and I've already had some
>> unpleasant effects due to it.
>
> Right. The percentage notion really goes back to the days when we
> typically had 8-64 *megabytes* of memory. So if you had an 8MB machine
> you wouldn't want to have more than one megabyte of dirty data, but if
> you were "Mr Moneybags" and could afford 64MB, you might want to have
> up to 8MB dirty!!
>
> Things have changed.
>
> So I would suggest we change the defaults. Or perhaps make the rule be
> that "the ratio numbers are 'ratio of memory up to 1GB'", to make the
> semantics similar across 32-bit HIGHMEM machines and 64-bit machines.

If you go this direction, allow ratios larger than 100%; some people may be
willing to have huge amounts of dirty data on large-memory machines (if the load
is extremely bursty, they don't have other needs for I/O, or they have a very
fast storage system, as a few examples).

David Lang

Re: Disabling in-memory write cache for x86-64 in Linux II

Artem S. Tashkinov
In reply to this post by David Lang
Oct 25, 2013 05:26:45 PM, david wrote:

> On Fri, 25 Oct 2013, NeilBrown wrote:
>
>>
>> What exactly is bothering you about this?  The amount of memory used or the
>> time until data is flushed?
>
>actually, I think the problem is more the impact of the huge write later on.

Exactly. And not being able to use applications which show you IO performance
like Midnight Commander. You might prefer to use "cp -a" but I cannot imagine
my life without being able to see the progress of a copying operation. With the current
dirty cache there's no way to understand how your storage media actually behaves.

Hopefully this issue won't dissolve into obscurity and someone will actually come
up with a plan (and a patch) for how to make the dirty write cache behave in a sane manner,
considering the fact that there are devices with very different write speeds and
requirements. It'd be even better if I could specify the dirty cache as a mount option
(though sane defaults or semi-automatic values based on runtime estimates
won't hurt).

Per-device dirty cache seems like a nice idea. I, for one, would like to disable it
altogether or make it an absolute minimum for things like USB flash drives - because
I don't care about multithreaded performance or delayed allocation on such devices -
I'm interested in my data reaching my USB stick ASAP, because that's how most people
use them.

Regards,

Artem

Re: Disabling in-memory write cache for x86-64 in Linux II

Diego Calleja
On Friday, 25 October 2013 18:26:23, Artem S. Tashkinov wrote:
> Oct 25, 2013 05:26:45 PM, david wrote:
> >actually, I think the problem is more the impact of the huge write later
> >on.
> Exactly. And not being able to use applications which show you IO
> performance like Midnight Commander. You might prefer to use "cp -a" but I
> cannot imagine my life without being able to see the progress of a copying
> operation. With the current dirty cache there's no way to understand how
> > your storage media actually behaves.


This is a problem I have also been suffering from for a long time. It's not so much
how much and when the system syncs dirty data, but how unresponsive the
desktop becomes when it happens (usually, with rsync + large files). Most
programs become completely unresponsive, especially if they have a large memory
consumption (i.e. the browser). I need to pause rsync and wait until the
system writes out all dirty data if I want to do simple things like scrolling
or do any action that uses I/O; otherwise I need to wait minutes.

I have 16 GB of RAM and excluding the browser (which usually uses about half
of a GB) and KDE itself, there are no memory hogs, so it seems like it's
something that shouldn't happen. I can understand that I/O operations are
laggy when there is some other intensive I/O ongoing, but right now the system
becomes completely unresponsive. If I am unlucky and Konsole also becomes
unresponsive, I need to switch to a VT (which also takes time).

I haven't reported it before, in part because I didn't know how to do it: "my
browser stalls" is not a very useful description, and I didn't know what kind
of data I'm supposed to report.

Re: Disabling in-memory write cache for x86-64 in Linux II

NeilBrown
In reply to this post by Artem S. Tashkinov
On Fri, 25 Oct 2013 18:26:23 +0000 (UTC) "Artem S. Tashkinov"
<[hidden email]> wrote:

> Oct 25, 2013 05:26:45 PM, david wrote:
> On Fri, 25 Oct 2013, NeilBrown wrote:
> >
> >>
> >> What exactly is bothering you about this?  The amount of memory used or the
> >> time until data is flushed?
> >
> >actually, I think the problem is more the impact of the huge write later on.
>
> Exactly. And not being able to use applications which show you IO performance
> like Midnight Commander. You might prefer to use "cp -a" but I cannot imagine
> my life without being able to see the progress of a copying operation. With the current
> dirty cache there's no way to understand how your storage media actually behaves.

So fix Midnight Commander. If you want the copy to be actually finished when
it says it is finished, then it needs to call 'fsync()' at the end.
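
A copy done that way also reports the real device speed. For instance, dd can
issue the fsync itself - conv=fsync makes it physically write the output data
before exiting, so the elapsed time and throughput it prints reflect the medium
(paths illustrative):

    dd if=bigfile of=/mnt/usb/bigfile bs=1M conv=fsync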

>
> Hopefully this issue won't dissolve into obscurity and someone will actually come
> up with a plan (and a patch) for how to make the dirty write cache behave in a sane manner,
> considering the fact that there are devices with very different write speeds and
> requirements. It'd be even better if I could specify the dirty cache as a mount option
> (though sane defaults or semi-automatic values based on runtime estimates
> won't hurt).
>
> Per-device dirty cache seems like a nice idea. I, for one, would like to disable it
> altogether or make it an absolute minimum for things like USB flash drives - because
> I don't care about multithreaded performance or delayed allocation on such devices -
> I'm interested in my data reaching my USB stick ASAP, because that's how most people
> use them.
>
As has already been said, you can substantially disable  the cache by tuning
down various values in /proc/sys/vm/.
Have you tried?

NeilBrown


Re: Disabling in-memory write cache for x86-64 in Linux II

Artem S. Tashkinov
Oct 26, 2013 02:44:07 AM, neil wrote:

> On Fri, 25 Oct 2013 18:26:23 +0000 (UTC) "Artem S. Tashkinov" wrote:
>>
>> Exactly. And not being able to use applications which show you IO performance
>> like Midnight Commander. You might prefer to use "cp -a" but I cannot imagine
>> my life without being able to see the progress of a copying operation. With the current
>> dirty cache there's no way to understand how your storage media actually behaves.
>
>So fix Midnight Commander.  If you want the copy to be actually finished when
>it says  it is finished, then it needs to call 'fsync()' at the end.

This sounds like a very bad joke. How are applications supposed to show and
calculate an _average_ write speed if there are no kernel calls/ioctls to actually
make the kernel flush dirty buffers _during_ copying? Actually, that would be a good
way to solve this problem in user space - alas, even if such calls were implemented,
user space wouldn't start using them until 2018, if not later.

>>
>> Per-device dirty cache seems like a nice idea. I, for one, would like to disable it
>> altogether or make it an absolute minimum for things like USB flash drives - because
>> I don't care about multithreaded performance or delayed allocation on such devices -
>> I'm interested in my data reaching my USB stick ASAP, because that's how most people
>> use them.
>>
>
>As has already been said, you can substantially disable  the cache by tuning
>down various values in /proc/sys/vm/.
>Have you tried?

I don't understand who you are replying to. I asked about per-device settings; you are
again referring me to system-wide settings - they don't look that good if we're talking
about a 3MB/sec flash drive and a 500MB/sec SSD. Besides, it makes no sense
to allocate 20% of physical RAM for things which don't belong in it in the first place.

I don't know any other OS which has a similar behaviour.

And like people (including me) have already mentioned, such a huge dirty cache can
stall their PCs/servers for a considerable amount of time.

Of course, if you don't use Linux on the desktop you don't really care - well, I do. Also,
not everyone in this world has a UPS - which means such a huge buffer can lead to
serious data loss in case of a power blackout.

Regards,

Artem

Re: Disabling in-memory write cache for x86-64 in Linux II

NeilBrown
On Fri, 25 Oct 2013 21:03:44 +0000 (UTC) "Artem S. Tashkinov"
<[hidden email]> wrote:

> Oct 26, 2013 02:44:07 AM, neil wrote:
> On Fri, 25 Oct 2013 18:26:23 +0000 (UTC) "Artem S. Tashkinov" wrote:
> >>
> >> Exactly. And not being able to use applications which show you IO performance
> >> like Midnight Commander. You might prefer to use "cp -a" but I cannot imagine
> >> my life without being able to see the progress of a copying operation. With the current
> >> dirty cache there's no way to understand how your storage media actually behaves.
> >
> >So fix Midnight Commander.  If you want the copy to be actually finished when
> >it says  it is finished, then it needs to call 'fsync()' at the end.
>
> This sounds like a very bad joke. How are applications supposed to show and
> calculate an _average_ write speed if there are no kernel calls/ioctls to actually
> make the kernel flush dirty buffers _during_ copying? Actually, that would be a good
> way to solve this problem in user space - alas, even if such calls were implemented,
> user space wouldn't start using them until 2018, if not later.

But there is a way to flush dirty buffers *during* copies:

    man 2 sync_file_range

If giving precise feedback is of paramount importance to you, then this would
be the interface to use.
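
As far as I know there is no coreutils wrapper for sync_file_range(2), but a
crude "flush as you go" can be approximated from the shell with synchronized
writes - oflag=dsync forces each output block to the device before the next one
is written, so the reported progress matches reality, at some cost in
throughput (paths illustrative):

    dd if=bigfile of=/mnt/usb/bigfile bs=4M oflag=dsync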

>
> >>
> >> Per-device dirty cache seems like a nice idea. I, for one, would like to disable it
> >> altogether or make it an absolute minimum for things like USB flash drives - because
> >> I don't care about multithreaded performance or delayed allocation on such devices -
> >> I'm interested in my data reaching my USB stick ASAP, because that's how most people
> >> use them.
> >>
> >
> >As has already been said, you can substantially disable  the cache by tuning
> >down various values in /proc/sys/vm/.
> >Have you tried?
>
> I don't understand who you are replying to. I asked about per-device settings; you are
> again referring me to system-wide settings - they don't look that good if we're talking
> about a 3MB/sec flash drive and a 500MB/sec SSD. Besides, it makes no sense
> to allocate 20% of physical RAM for things which don't belong in it in the first place.

Sorry, missed the per-device bit.
You could try playing with

    /sys/class/bdi/XX:YY/max_ratio

where XX:YY is the major/minor number of the device, so 8:0 for /dev/sda.
Wind it right down for slow devices and you might get something like what you
want.
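
For example, for a USB stick at /dev/sdb (major:minor 8:16 - check with
ls -l /dev/sdb):

    echo 1 > /sys/class/bdi/8:16/max_ratio   # let this device use at most 1% of the dirty limit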


>
> I don't know any other OS which has a similar behaviour.

I don't know about the internal details of any other OS, so I cannot really
comment.

>
> And like people (including me) have already mentioned, such a huge dirty cache can
> stall their PCs/servers for a considerable amount of time.

Yes.  But this is a different issue.
There are two very different issues that should be kept separate.

One is that when "cp" or similar completes, the data hasn't all been written out
yet.  It typically takes another 30 seconds before the flush will complete.
You seemed to primarily complain about this, so that is what I originally
addressed.  That is where the "dirty_*_centisecs" values apply.

The other, quite separate, issue is that Linux will cache more dirty data
than it can write out in a reasonable time.  All the tuning parameters refer
to the amount of data (whether as a percentage of RAM or as a number of
bytes), but what people really care about is a number of seconds.

As you might imagine, estimating how long it will take to write out a certain
amount of data is highly non-trivial.  The relationship between megabytes and
seconds can be non-linear and can change over time.

Caching nothing at all can hurt a lot of workloads.  Caching too much can
obviously hurt too.  Caching "5 seconds" worth of data would be ideal, but
would be incredibly difficult to implement.
Keeping a sliding estimate of throughput for each device would be
possible, and using that to automatically adjust the
"max_ratio" value (or some related internal thing) might be a 70% solution.

Certainly it would be an interesting project for someone.


>
> Of course, if you don't use Linux on the desktop you don't really care - well, I do. Also,
> not everyone in this world has a UPS - which means such a huge buffer can lead to
> serious data loss in case of a power blackout.

I don't have a desk (just a lap), but I use Linux on all my computers and
I've never really noticed the problem.  Maybe I'm just very patient, or maybe
I don't work with large data sets and slow devices.

However, I don't think data loss is really a related issue.  Any process that
cares about data safety *must* use fsync at appropriate places.  This has
always been true.

NeilBrown

>
> Regards,
>
> Artem



Re: Disabling in-memory write cache for x86-64 in Linux II

Fengguang Wu
In reply to this post by Andrew Morton
On Fri, Oct 25, 2013 at 02:29:37AM -0700, Andrew Morton wrote:
> On Fri, 25 Oct 2013 05:18:42 -0400 "Theodore Ts'o" <[hidden email]> wrote:
>
> > What I think would make sense is to dynamically measure the speed of
> > writeback, so that we can set these limits as a function of the device
> > speed.
>
> We attempt to do this now - have a look through struct backing_dev_info.

To be exact, it's backing_dev_info.write_bandwidth which is estimated
in bdi_update_write_bandwidth() and exported as "BdiWriteBandwidth" in
debugfs file bdi.stats.
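
For instance (assuming CONFIG_DEBUG_FS and debugfs mounted at the usual place;
8:0 is /dev/sda):

    grep BdiWriteBandwidth /sys/kernel/debug/bdi/8:0/stats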

> Apparently all this stuff isn't working as desired (and perhaps as designed)
> in this case.  Will take a look after a return to normalcy ;)

Right. The write bandwidth is only estimated and used when the
background dirty threshold is reached and hence the disk is actively
doing writeback IO -- which is the case where we can do a reasonable
estimation of the writeback bandwidth.

Note that this estimated BdiWriteBandwidth may better be named
"writeback" bandwidth, because it may change depending on the workload
at the time -- e.g. sequential vs. random writes, or whether there are
parallel reads or direct IO competing for the disk time.

BdiWriteBandwidth is only designed for use by the dirty throttling
logic and is not generally useful/reliable for other purposes.

It's a bit late here, and I'd like to carry the original question along as an
exercise for tomorrow's flights. :)

Thanks,
Fengguang

Re: Disabling in-memory write cache for x86-64 in Linux II

Fengguang Wu
In reply to this post by Theodore Ts'o
On Fri, Oct 25, 2013 at 05:18:42AM -0400, Theodore Ts'o wrote:

> On Fri, Oct 25, 2013 at 08:30:53AM +0000, Artem S. Tashkinov wrote:
> > My feeling is that vm.dirty_ratio/vm.dirty_background_ratio should _not_ be
> > percentage based, 'cause for PCs/servers with a lot of memory (say 64GB or
> > more) this value becomes unrealistic (13GB) and I've already had some
> > unpleasant effects due to it.
>
> What I think would make sense is to dynamically measure the speed of
> writeback, so that we can set these limits as a function of the device
> speed.  It's already the case that the writeback limits don't make
> sense on a slow USB 2.0 storage stick; I suspect that for really huge
> RAID arrays or very fast flash devices, it doesn't make much sense
> either.
>
> The problem is that if you have a system that has *both* a USB stick
> _and_ a fast flash/RAID storage array both needing writeback, this
> doesn't work well --- but what we have right now doesn't work all that
> well anyway.

Ted, when trying to follow up on your email, I got a crazy idea, and it'd
be better to throw it out rather than carry it to bed. :)

We could do per-bdi dirty thresholds - which has been proposed 1-2
times before by different people.

The per-bdi dirty thresholds could be auto-set by the kernel this way:
start with an initial value of 100MB. When that is reached, put all
100MB of dirty data to IO and get an estimate of the write bandwidth.
From then on, set the bdi's dirty threshold to N * bdi_write_bandwidth,
where N is the number of seconds of dirty data we'd like to cache in memory.
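
No such per-bdi knob exists today, but the arithmetic can be sketched against
the global byte limit using the bandwidth estimate exported in debugfs
(hypothetical N=5 seconds; BdiWriteBandwidth is reported in kBps):

    bw_kbps=$(awk '/BdiWriteBandwidth/ {print $2}' /sys/kernel/debug/bdi/8:0/stats)
    echo $((bw_kbps * 1024 * 5)) > /proc/sys/vm/dirty_bytes   # ~5 seconds' worth of dirty data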

Thanks,
Fengguang

Re: Disabling in-memory write cache for x86-64 in Linux II

Fengguang Wu
In reply to this post by Diego Calleja
On Fri, Oct 25, 2013 at 09:40:13PM +0200, Diego Calleja wrote:

> On Friday, 25 October 2013 18:26:23, Artem S. Tashkinov wrote:
> > Oct 25, 2013 05:26:45 PM, david wrote:
> > >actually, I think the problem is more the impact of the huge write later
> > >on.
> > Exactly. And not being able to use applications which show you IO
> > performance like Midnight Commander. You might prefer to use "cp -a" but I
> > cannot imagine my life without being able to see the progress of a copying
> > operation. With the current dirty cache there's no way to understand how
> > your storage media actually behaves.
>
>
> This is a problem I have also been suffering from for a long time. It's not so much
> how much and when the system syncs dirty data, but how unresponsive the
> desktop becomes when it happens (usually, with rsync + large files). Most
> programs become completely unresponsive, especially if they have a large memory
> consumption (i.e. the browser). I need to pause rsync and wait until the
> system writes out all dirty data if I want to do simple things like scrolling
> or do any action that uses I/O; otherwise I need to wait minutes.

That's a problem. And it's kind of independent of the dirty threshold
-- if you are doing large file copies in the background, it will lead
to continuous disk writes and stalls anyway -- a large dirty threshold
merely delays when the write IO happens.

> I have 16 GB of RAM and excluding the browser (which usually uses about half
> of a GB) and KDE itself, there are no memory hogs, so it seems like it's
> something that shouldn't happen. I can understand that I/O operations are
> laggy when there is some other intensive I/O ongoing, but right now the system
> becomes completely unresponsive. If I am unlucky and Konsole also becomes
> unresponsive, I need to switch to a VT (which also takes time).
>
> I haven't reported it before, in part because I didn't know how to do it: "my
> browser stalls" is not a very useful description, and I didn't know what kind
> of data I'm supposed to report.

Which kernel are you running? And is it writing to a hard disk?
The stalls are most likely caused by one of

1) write IO starving read IO
2) direct page reclaim getting blocked when
   - trying to write out PG_dirty pages
   - trying to lock PG_writeback pages

Which may be confirmed by running

        ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32
or
        echo w > /proc/sysrq-trigger    # and check dmesg

during the stalls. The latter command works more reliably.

Thanks,
Fengguang

Re: Disabling in-memory write cache for x86-64 in Linux II

Theodore Ts'o
In reply to this post by Fengguang Wu
On Sat, Oct 26, 2013 at 12:05:45AM +0100, Fengguang Wu wrote:

>
> Ted, when trying to follow up on your email, I got a crazy idea, and it'd
> be better to throw it out rather than carry it to bed. :)
>
> We could do per-bdi dirty thresholds - which has been proposed 1-2
> times before by different people.
>
> The per-bdi dirty thresholds could be auto-set by the kernel this way:
> start with an initial value of 100MB. When that is reached, put all
> 100MB of dirty data to IO and get an estimate of the write bandwidth.
> From then on, set the bdi's dirty threshold to N * bdi_write_bandwidth,
> where N is the number of seconds of dirty data we'd like to cache in memory.

Sure, although I wonder if it would be worth it to calculate some kind of
rolling average of the write bandwidth while we are doing writeback,
so that if it turns out we got unlucky with the contents of the first 100MB
of dirty data (it could be either highly random or highly sequential)
we'll eventually correct to the right level.

This means that the VM would have to keep dirty page counters for each BDI
--- which I thought we weren't doing right now, which is why we have a
global vm.dirty_ratio/vm.dirty_background_ratio threshold.  (Or do I
have cause and effect reversed?  :-)

                                                - Ted