Disabling in-memory write cache for x86-64 in Linux II

Re: Disabling in-memory write cache for x86-64 in Linux II

Pavel Machek
On Fri 2013-10-25 10:32:16, Linus Torvalds wrote:

> On Fri, Oct 25, 2013 at 10:29 AM, Andrew Morton
> <[hidden email]> wrote:
> >
> > Apparently all this stuff isn't working as desired (and perhaps as designed)
> > in this case.  Will take a look after a return to normalcy ;)
>
> It definitely doesn't work. I can trivially reproduce problems by just
> having a cheap (==slow) USB key with an ext3 filesystem, and doing a
> git clone to it. The end result is not pretty, and that's actually not
> even a huge amount of data.

Hmm, I'd expect the result to be "dead USB key". Putting
ext3 on a cheap flash device normally just kills the device :-(.


--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Re: Disabling in-memory write cache for x86-64 in Linux II

Linus Torvalds-2
On Sat, Oct 26, 2013 at 4:32 AM, Pavel Machek <[hidden email]> wrote:
>
> Hmm, I'd expect the result to be "dead USB key". Putting
> ext3 on a cheap flash device normally just kills the device :-(.

Not my experience. It may be true for some really cheap devices, but
normal USB keys seem to just get really slow, probably due to having
had their flash rewrite algorithm tuned for FAT accesses.

I *do* suspect that to see the really bad behavior, you don't write
just one large file to it, but many smaller ones. "git clone" will
check out all the kernel tree files, obviously.
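
A reproduction sketch of the scenario described here (the device name,
mount point and source-tree path are assumptions, not from the thread):

  # slow USB key at /dev/sdb1, ext3 on it, then a clone with many small files
  mkfs.ext3 /dev/sdb1
  mount /dev/sdb1 /mnt/usb
  git clone /usr/src/linux /mnt/usb/linux
  time sync    # the multi-minute stall shows up here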

                     Linus

Re: Disabling in-memory write cache for x86-64 in Linux II

Jan Kara
In reply to this post by Karl Kiniger
On Fri 25-10-13 11:15:55, Karl Kiniger wrote:

> On Fri 131025, Linus Torvalds wrote:
> > On Fri, Oct 25, 2013 at 9:30 AM, Artem S. Tashkinov <[hidden email]> wrote:
> > >
> > > My feeling is that vm.dirty_ratio/vm.dirty_background_ratio should _not_ be
> > > percentage based, 'cause for PCs/servers with a lot of memory (say 64GB or
> > > more) this value becomes unrealistic (13GB) and I've already had some
> > > unpleasant effects due to it.
> >
> > Right. The percentage notion really goes back to the days when we
> > typically had 8-64 *megabytes* of memory. So if you had an 8MB machine
> > you wouldn't want to have more than one megabyte of dirty data, but if
> > you were "Mr Moneybags" and could afford 64MB, you might want to have
> > up to 8MB dirty!!
> >
> > Things have changed.
> >
> > So I would suggest we change the defaults. Or perhaps make the rule be
> > that "the ratio numbers are 'ratio of memory up to 1GB'", to make the
> > semantics similar across 32-bit HIGHMEM machines and 64-bit machines.
> >
> > The modern way of expressing the dirty limits is to give the actual
> > absolute byte amounts, but we default to the legacy ratio mode..
> >
> >                 Linus
>
> Is it currently possible to somehow set above values per block device?
  Yes, to some extent. You can set /sys/block/<device>/bdi/max_ratio to
the maximum proportion the device's dirty data can take of the total
amount. The caveat currently is that this setting only takes effect after
we have more than (dirty_background_ratio + dirty_ratio)/2 dirty data in
total, because that is the amount of dirty data at which we start to throttle
processes. So if the device you'd like to limit is the only one currently
being written to, the limiting doesn't have a big effect.
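
As a concrete sketch of that tuning (the device name here is an
assumption; max_ratio is a percentage of the global dirty limit):

  # cap the dirty data for one device at 1% of the global limit
  echo 1 > /sys/block/sdb/bdi/max_ratio
  # the global knobs it interacts with
  sysctl vm.dirty_background_ratio vm.dirty_ratio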

Andrew has queued up a patch series from Maxim Patlasov which removes this
caveat, but currently there is no way the admin can switch that from
userspace. I'd like to have that tunable from userspace exactly for
cases like the one you describe below.

> I want the default behaviour for almost everything, but DVD drives in DVD+RW
> packet writing mode may easily take several minutes in case of a sync.

                                                                Honza
--
Jan Kara <[hidden email]>
SUSE Labs, CR

Re: Disabling in-memory write cache for x86-64 in Linux II

Jan Kara
In reply to this post by Theodore Ts'o
On Fri 25-10-13 19:37:53, Ted Tso wrote:

> On Sat, Oct 26, 2013 at 12:05:45AM +0100, Fengguang Wu wrote:
> >
> > Ted, when trying to follow up your email, I got a crazy idea and it'd
> > be better to throw it out rather than carry it to bed. :)
> >
> > We could do per-bdi dirty thresholds - which has been proposed 1-2
> > times before by different people.
> >
> > The per-bdi dirty thresholds could be auto set by the kernel this way:
> > start it with an initial value of 100MB. When reached, put all the
> > 100MB dirty data to IO and get an estimation of the write bandwidth.
> > From then on, set the bdi's dirty threshold to N * bdi_write_bandwidth,
> > where N is the seconds of dirty data we'd like to cache in memory.
>
> Sure, although I wonder if it would be worth it to calculate some kind of
> rolling average of the write bandwidth while we are doing writeback,
> so if it turns out we got unlucky with the contents of the first 100MB
> of dirty data (it could be either highly random or highly sequential)
> then we'll eventually correct to the right level.
  We already do average measured throughput over a longer time window and
have a kind of rolling-average algorithm doing some averaging.

> This means that VM would have to keep dirty page counters for each BDI
> --- which I thought we weren't doing right now, which is why we have a
> global vm.dirty_ratio/vm.dirty_background_ratio threshold.  (Or do I
> have cause and effect reversed?  :-)
  And we do currently keep the number of dirty & under-writeback pages per
BDI. We have global limits because mm wants to limit the total number of dirty
pages (as those are harder to free). It doesn't care as much which device
these pages belong to (although it probably should care a bit more, because
there are huge differences in how quickly different devices can get rid
of dirty pages).
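
For reference, those per-BDI counters and the bandwidth estimate are
visible via debugfs, along these lines (the device numbers are an
assumption, and the exact fields vary by kernel version):

  # per-BDI writeback state for the device with major:minor 8:16
  cat /sys/kernel/debug/bdi/8:16/stats
  # shows e.g. BdiWriteback, BdiReclaimable, BdiDirtyThresh and
  # BdiWriteBandwidth (the rolling write-bandwidth estimate)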

                                                                Honza
--
Jan Kara <[hidden email]>
SUSE Labs, CR

Re: Disabling in-memory write cache for x86-64 in Linux II

akpm
In reply to this post by Jan Kara
On Tue, 29 Oct 2013 21:30:50 +0100 Jan Kara <[hidden email]> wrote:

> Andrew has queued up a patch series from Maxim Patlasov which removes this
> caveat, but currently there is no way the admin can switch that from
> userspace. I'd like to have that tunable from userspace exactly for
> cases like the one you describe below.

This?

commit 5a53748568f79641eaf40e41081a2f4987f005c2
Author:     Maxim Patlasov <[hidden email]>
AuthorDate: Wed Sep 11 14:22:46 2013 -0700
Commit:     Linus Torvalds <[hidden email]>
CommitDate: Wed Sep 11 15:58:04 2013 -0700

    mm/page-writeback.c: add strictlimit feature

That's already in mainline, for 3.12.

Re: Disabling in-memory write cache for x86-64 in Linux II

Jan Kara
In reply to this post by Artem S. Tashkinov
On Fri 25-10-13 18:26:23, Artem S. Tashkinov wrote:

> Oct 25, 2013 05:26:45 PM, david wrote:
> On Fri, 25 Oct 2013, NeilBrown wrote:
> >
> >>
> >> What exactly is bothering you about this?  The amount of memory used or the
> >> time until data is flushed?
> >
> >actually, I think the problem is more the impact of the huge write later on.
>
> Exactly. And not being able to use applications which show you IO
> performance like Midnight Commander. You might prefer to use "cp -a" but
> I cannot imagine my life without being able to see the progress of a
> copying operation. With the current dirty cache there's no way to
> understand how your storage media actually behaves.
  Large writes shouldn't stall your desktop, that's certain and we must fix
that. I don't find the problem with copy progress indicators that
pressing...

> Hopefully this issue won't dissolve into obscurity and someone will
> actually make a plan (and a patch) for how to make the dirty write cache
> behave in a sane manner, considering the fact that there are devices with
> very different write speeds and requirements. It'd be even better if I
> could specify the dirty cache as a mount option (though sane defaults or
> semi-automatic values based on runtime estimates won't hurt).
>
> Per-device dirty cache seems like a nice idea. I, for one, would like to
> disable it altogether or set it to an absolute minimum for things like USB
> flash drives - because I don't care about multithreaded performance or
> delayed allocation on such devices - I'm interested in my data reaching
> my USB stick ASAP, because that's how most people use them.
  See my other emails in this thread. There are ways to tune the amount of
dirty data allowed per device. Currently the result isn't very satisfactory
but we should have something usable after the next merge window.

                                                                        Honza
--
Jan Kara <[hidden email]>
SUSE Labs, CR

Re: Disabling in-memory write cache for x86-64 in Linux II

Jan Kara
In reply to this post by Linus Torvalds-2
On Fri 25-10-13 10:32:16, Linus Torvalds wrote:

> On Fri, Oct 25, 2013 at 10:29 AM, Andrew Morton
> <[hidden email]> wrote:
> >
> > Apparently all this stuff isn't working as desired (and perhaps as designed)
> > in this case.  Will take a look after a return to normalcy ;)
>
> It definitely doesn't work. I can trivially reproduce problems by just
> having a cheap (==slow) USB key with an ext3 filesystem, and doing a
> git clone to it. The end result is not pretty, and that's actually not
> even a huge amount of data.
  I'll try to reproduce this tomorrow so that I can have a look at where
exactly we are stuck. But in the last few releases problems like this were
caused by problems in reclaim, which got fed up with seeing lots of dirty
/ under-writeback pages and ended up stuck waiting for IO to finish. Mel
has been tweaking the logic here and there but maybe it hasn't been fixed
completely. Mel, do you know about any outstanding issues?

                                                                Honza
--
Jan Kara <[hidden email]>
SUSE Labs, CR

Re: Disabling in-memory write cache for x86-64 in Linux II

Jan Kara
In reply to this post by akpm
On Tue 29-10-13 13:43:46, Andrew Morton wrote:

> On Tue, 29 Oct 2013 21:30:50 +0100 Jan Kara <[hidden email]> wrote:
>
> > Andrew has queued up a patch series from Maxim Patlasov which removes this
> > caveat, but currently there is no way the admin can switch that from
> > userspace. I'd like to have that tunable from userspace exactly for
> > cases like the one you describe below.
>
> This?
>
> commit 5a53748568f79641eaf40e41081a2f4987f005c2
> Author:     Maxim Patlasov <[hidden email]>
> AuthorDate: Wed Sep 11 14:22:46 2013 -0700
> Commit:     Linus Torvalds <[hidden email]>
> CommitDate: Wed Sep 11 15:58:04 2013 -0700
>
>     mm/page-writeback.c: add strictlimit feature
>
> That's already in mainline, for 3.12.
  Yes, I should have checked the code...

                                                                Honza
--
Jan Kara <[hidden email]>
SUSE Labs, CR

Re: Disabling in-memory write cache for x86-64 in Linux II

Linus Torvalds-2
In reply to this post by Jan Kara
On Tue, Oct 29, 2013 at 1:57 PM, Jan Kara <[hidden email]> wrote:

> On Fri 25-10-13 10:32:16, Linus Torvalds wrote:
>>
>> It definitely doesn't work. I can trivially reproduce problems by just
>> having a cheap (==slow) USB key with an ext3 filesystem, and doing a
>> git clone to it. The end result is not pretty, and that's actually not
>> even a huge amount of data.
>
>   I'll try to reproduce this tomorrow so that I can have a look at where
> exactly we are stuck. But in the last few releases problems like this were
> caused by problems in reclaim, which got fed up with seeing lots of dirty
> / under-writeback pages and ended up stuck waiting for IO to finish. Mel
> has been tweaking the logic here and there but maybe it hasn't been fixed
> completely. Mel, do you know about any outstanding issues?
I'm not sure this has ever worked, and in the last few years the
common desktop memory size has continued to grow.

For servers and "serious" desktops, having tons of dirty data doesn't
tend to be as much of a problem, because those environments are pretty
much defined by also having fairly good IO subsystems, and people
seldom use crappy USB devices for more than doing things like reading
pictures off them etc. And you'd not even see the problem under any
such load.

But it's actually really easy to reproduce by just taking your average
USB key and trying to write to it. I just did it with a random ISO
image, and it's _painful_. And it's not that it's painful for doing
most other things in the background, but if you just happen to run
anything that does "sync" (and it happens in scripts), the thing just
comes to a screeching halt. For minutes.

Same obviously goes with trying to eject/unmount the media etc.

We've had this problem before with the whole "ratio of dirty memory"
thing. It was a mistake. It made sense (and came from) back in the
days when people had 16MB or 32MB of RAM, and the concept of "let's
limit dirty memory to x% of that" was actually fairly reasonable. But
that "x%" doesn't make much sense any more. x% of 16GB (which is quite
the reasonable amount of memory for any modern desktop) is a huge
thing, and in the meantime the performance of disks have gone up a lot
(largely thanks to SSD's), but the *minimum* performance of disks
hasn't really improved all that much (largely thanks to USB ;).

So how about we just admit that the whole "ratio" thing was a big
mistake, and tell people that if they want to set a dirty limit, they
should do so in bytes? Which we already really do, but we default to
that ratio nevertheless. Which is why I'd suggest we just say "the
ratio works fine up to a certain amount, and makes no sense past it".

Why not make that "the ratio works fine up to a certain amount, and
makes no sense past it" be part of the calculations? We actually
*have* exactly that on HIGHMEM machines, where we have this
configuration option of "vm_highmem_is_dirtyable" that defaults to
off. It just doesn't trigger on non-highmem machines (today: "64-bit").

So I would suggest that we just expose that "vm_highmem_is_dirtyable"
on 64-bit too, and just say that anything over 1GB is highmem. That
means that 32-bit and 64-bit environments will basically act the same,
and I think it makes the defaults a bit saner.

Limiting the amount of dirty memory to 100MB/200MB (for "start
background writing" and "wait synchronously" respectively) even if you
happen to have 16GB of memory sounds like a good idea. Sure, it might
make some benchmarks a bit slower, but it will at least avoid the
"wait forever" symptom. And if you really have a very studly IO
subsystem, the fact that it starts writing out earlier won't really be
a problem.
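
Those numbers can already be tried today without any patch, via the
byte-based knobs (a sketch; 100MB/200MB are the values proposed above,
not current defaults):

  sysctl -w vm.dirty_background_bytes=$((100*1024*1024))  # start background writeback
  sysctl -w vm.dirty_bytes=$((200*1024*1024))             # block writers synchronously
  # note: setting the *_bytes knobs zeroes the corresponding *_ratio knobs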

After all, there are two reasons to do delayed writes:

 - temp-files may not be written out at all.

   Quite frankly, if you have multi-hundred-megabyte temp-files, you've
got issues

 - coalescing writes improves throughput

   There are very much diminishing returns, and the big return is to
make sure that we write things out in a good order, which a 100MB
buffer should make more than possible.

so I really think that it's insane to default to 1.6GB of dirty data
before you even start writing it out if you happen to have 16GB of
memory.

And again: if your benchmark is to create a kernel tree and then
immediately delete it, and you used to do that without doing any
actual IO, then yes, the attached patch will make that go much slower.
But for that benchmark, maybe you should just set the dirty limits (in
bytes) by hand, rather than expect the default kernel values to prefer
benchmarks over sanity?

Suggested patch attached. Comments?

                            Linus

[attachment: patch.diff (1K)]

Re: Disabling in-memory write cache for x86-64 in Linux II

Linus Torvalds-2
In reply to this post by akpm
On Tue, Oct 29, 2013 at 1:43 PM, Andrew Morton
<[hidden email]> wrote:

> On Tue, 29 Oct 2013 21:30:50 +0100 Jan Kara <[hidden email]> wrote:
>
>> Andrew has queued up a patch series from Maxim Patlasov which removes this
>> caveat, but currently there is no way the admin can switch that from
>> userspace. I'd like to have that tunable from userspace exactly for
>> cases like the one you describe below.
>
> This?
>
>     mm/page-writeback.c: add strictlimit feature
>
> That's already in mainline, for 3.12.

Nothing currently actually *sets* the BDI_CAP_STRICTLIMIT flag, though.

So it's a potential fix, but it's certainly not a fix now.

               Linus

Re: Disabling in-memory write cache for x86-64 in Linux II

Jan Kara
In reply to this post by Linus Torvalds-2
On Tue 29-10-13 14:33:53, Linus Torvalds wrote:

> On Tue, Oct 29, 2013 at 1:57 PM, Jan Kara <[hidden email]> wrote:
> > On Fri 25-10-13 10:32:16, Linus Torvalds wrote:
> >>
> >> It definitely doesn't work. I can trivially reproduce problems by just
>> having a cheap (==slow) USB key with an ext3 filesystem, and doing a
> >> git clone to it. The end result is not pretty, and that's actually not
> >> even a huge amount of data.
> >
> >   I'll try to reproduce this tomorrow so that I can have a look at where
> > exactly we are stuck. But in the last few releases problems like this were
> > caused by problems in reclaim, which got fed up with seeing lots of dirty
> > / under-writeback pages and ended up stuck waiting for IO to finish. Mel
> > has been tweaking the logic here and there but maybe it hasn't been fixed
> > completely. Mel, do you know about any outstanding issues?
>
> I'm not sure this has ever worked, and in the last few years the
> common desktop memory size has continued to grow.
>
> For servers and "serious" desktops, having tons of dirty data doesn't
> tend to be as much of a problem, because those environments are pretty
> much defined by also having fairly good IO subsystems, and people
> seldom use crappy USB devices for more than doing things like reading
> pictures off them etc. And you'd not even see the problem under any
> such load.
>
> But it's actually really easy to reproduce by just taking your average
> USB key and trying to write to it. I just did it with a random ISO
> image, and it's _painful_. And it's not that it's painful for doing
> most other things in the background, but if you just happen to run
> anything that does "sync" (and it happens in scripts), the thing just
> comes to a screeching halt. For minutes.
  Yes, I agree that caching more than a couple of seconds' worth of writeback
for a device isn't good.

> Same obviously goes with trying to eject/unmount the media etc.
>
> We've had this problem before with the whole "ratio of dirty memory"
> thing. It was a mistake. It made sense (and came from) back in the
> days when people had 16MB or 32MB of RAM, and the concept of "let's
> limit dirty memory to x% of that" was actually fairly reasonable. But
> that "x%" doesn't make much sense any more. x% of 16GB (which is quite
> the reasonable amount of memory for any modern desktop) is a huge
> thing, and in the meantime the performance of disks have gone up a lot
> (largely thanks to SSD's), but the *minimum* performance of disks
> hasn't really improved all that much (largely thanks to USB ;).
>
> So how about we just admit that the whole "ratio" thing was a big
> mistake, and tell people that if they want to set a dirty limit, they
> should do so in bytes? Which we already really do, but we default to
> that ratio nevertheless. Which is why I'd suggest we just say "the
> ratio works fine up to a certain amount, and makes no sense past it".
>
> Why not make that "the ratio works fine up to a certain amount, and
> makes no sense past it" be part of the calculations? We actually
> *have* exactly that on HIGHMEM machines, where we have this
> configuration option of "vm_highmem_is_dirtyable" that defaults to
> off. It just doesn't trigger on non-highmem machines (today: "64-bit").
>
> So I would suggest that we just expose that "vm_highmem_is_dirtyable"
> on 64-bit too, and just say that anything over 1GB is highmem. That
> means that 32-bit and 64-bit environments will basically act the same,
> and I think it makes the defaults a bit saner.
>
> Limiting the amount of dirty memory to 100MB/200MB (for "start
> background writing" and "wait synchronously" respectively) even if you
> happen to have 16GB of memory sounds like a good idea. Sure, it might
> make some benchmarks a bit slower, but it will at least avoid the
> "wait forever" symptom. And if you really have a very studly IO
> subsystem, the fact that it starts writing out earlier won't really be
> a problem.
  So I think we both realize this is only about what the default should be.
There will always be people who have loads which benefit from setting dirty
limits high, but I agree they are a minority. The reason why we left the
limits at what they are now, despite them making less and less sense, is that
we didn't want to break user expectations. If we cap the dirty limits as
you suggest, I bet we'll get some user complaints, and the "don't break users"
policy thus tells me we shouldn't do such changes ;)

Also I'm not sure capping dirty limits at 200MB is the best spot. It may be
but I think we should experiment with numbers a bit to check whether we
didn't miss something.
 
> After all, there are two reasons to do delayed writes:
>
>  - temp-files may not be written out at all.
>
>    Quite frankly, if you have multi-hundred-megabyte temp-files, you've
> got issues
  Actually people do stuff like this e.g. when generating ISO images before
burning them.

>  - coalescing writes improves throughput
>
>    There are very much diminishing returns, and the big return is to
> make sure that we write things out in a good order, which a 100MB
> buffer should make more than possible.
  True.

  There is one more aspect:
- transforming random writes into mostly sequential writes

  Different userspace programs use simple memory-mapped databases which do
random writes into their data files. The less you write these back the
better (at least from a throughput POV). I'm not sure how large these files
are together on the average user desktop, but my guess would be that
100 MB *should* be enough for them. Can anyone with a GNOME / KDE desktop try
running with limits set this low for some time?
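
One way to try that experiment (a sketch; it reuses the byte-based knobs,
and the watch interval is an assumption):

  # run a desktop session for a while with ~100MB limits...
  sysctl -w vm.dirty_background_bytes=$((100*1024*1024)) vm.dirty_bytes=$((200*1024*1024))
  # ...and watch whether dirty pages ever become a bottleneck
  watch -n1 "grep -E '^(Dirty|Writeback):' /proc/meminfo"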
 

> so I really think that it's insane to default to 1.6GB of dirty data
> before you even start writing it out if you happen to have 16GB of
> memory.
>
> And again: if your benchmark is to create a kernel tree and then
> immediately delete it, and you used to do that without doing any
> actual IO, then yes, the attached patch will make that go much slower.
> But for that benchmark, maybe you should just set the dirty limits (in
> bytes) by hand, rather than expect the default kernel values to prefer
> benchmarks over sanity?
>
> Suggested patch attached. Comments?

                                                                Honza
--
Jan Kara <[hidden email]>
SUSE Labs, CR

Re: Disabling in-memory write cache for x86-64 in Linux II

Linus Torvalds-2
On Tue, Oct 29, 2013 at 3:13 PM, Jan Kara <[hidden email]> wrote:
>
>   So I think we both realize this is only about what the default should be.

Yes. Most people will use the defaults, but there will always be
people who tune things for particular loads.

In fact, I think we have gone much too far in saying "all policy in
user space", because the fact is, user space isn't very good at
policy. Especially not at reacting to complex situations with
different devices. From what I've seen, "policy in user space" has
resulted in exactly two modes:

 - user space does something stupid and wrong (example: "nice -19 X"
to work around some scheduler oddities)

 - user space does nothing at all, and the kernel people say "hey,
user space _could_ set this value Xyz, so it's not our problem, and
it's policy, so we shouldn't touch it".

I think we in the kernel should say "our defaults should be what
everybody sane can use, and they should work fine on average". With
"policy in user space" being for crazy people that do really odd
things and can really spare the time to tune for their particular
issue.

So the "policy in user space" should be about *overriding* kernel
policy choices, not about the kernel never having them.

And this kind of "you can have many different devices and they act
quite differently" is a good example of something complicated that
user space really doesn't have a great model for. And we actually have
much better information available in the kernel than user space is ever
likely to have.

> Also I'm not sure capping dirty limits at 200MB is the best spot. It may be
> but I think we should experiment with numbers a bit to check whether we
> didn't miss something.

Sure. That said, the patch I suggested basically makes the numbers be
at least roughly comparable across different architectures. So it's
been at least somewhat tested, even if 16GB x86-32 machines are
hopefully pretty rare (but I hear about people installing 32-bit on
modern machines much too often).

>>  - temp-files may not be written out at all.
>>
>>    Quite frankly, if you have multi-hundred-megabyte temptiles, you've
>> got issues
>   Actually people do stuff like this e.g. when generating ISO images before
> burning them.

Yes, but then the temp-file is long-lived enough that it *will* hit
the disk anyway. So it's only the "create temporary file and pretty
much immediately delete it" case that changes behavior (ie compiler
assembly files etc).

If the temp-file is for something like burning an ISO image, the
burning part is slow enough that the temp-file will hit the disk
regardless of when we start writing it.

>   There is one more aspect:
> - transforming random writes into mostly sequential writes

Sure. And I think that if you have a big database, that's when you do
end up tweaking the dirty limits.

That said, I'd certainly like it even *more* if the limits really were
per-BDI, and the global limit was in addition to the per-bdi ones.
Because when you have a USB device that gets maybe 10MB/s on
contiguous writes, and 100kB/s on random 4k writes, I think it would
make more sense to make the "start writeout" limits be 1MB/2MB, not
100MB/200MB. So my patch doesn't even take it far enough, it's just a
"let's not be ridiculous". The per-BDI limits don't seem quite ready
for prime time yet, though. Even the new "strict" limits seems to be
more about "trusted filesystems" than about really sane writeback
limits.

Fengguang, comments?

(And I added Maxim to the cc, since he's the author of the strict
mode, and while it is currently limited to FUSE, he did mention USB
storage in the commit message..).

                  Linus

Re: Disabling in-memory write cache for x86-64 in Linux II

Artem S. Tashkinov
In reply to this post by Jan Kara
Oct 30, 2013 02:41:01 AM, Jan Kara wrote:
> On Fri 25-10-13 19:37:53, Ted Tso wrote:

>> Sure, although I wonder if it would be worth it to calculate some kind of
>> rolling average of the write bandwidth while we are doing writeback,
>> so if it turns out we got unlucky with the contents of the first 100MB
>> of dirty data (it could be either highly random or highly sequential)
>> then we'll eventually correct to the right level.
>  We already do average measured throughput over a longer time window and
>have a kind of rolling-average algorithm doing some averaging.
>
>> This means that VM would have to keep dirty page counters for each BDI
>> --- which I thought we weren't doing right now, which is why we have a
>> global vm.dirty_ratio/vm.dirty_background_ratio threshold.  (Or do I
>> have cause and effect reversed?  :-)
>  And we do currently keep the number of dirty & under-writeback pages per
>BDI. We have global limits because mm wants to limit the total number of dirty
>pages (as those are harder to free). It doesn't care as much which device
>these pages belong to (although it probably should care a bit more, because
>there are huge differences in how quickly different devices can get rid
>of dirty pages).

This might sound like an absolutely stupid question which makes no sense at
all, so I want to apologize for it in advance, but since the Linux kernel lacks
revoke(), does that mean that dirty buffers will always occupy kernel memory
if I, for instance, remove my USB stick before the kernel has had time to flush
those buffers?

Re: Disabling in-memory write cache for x86-64 in Linux II

Mel Gorman-2
In reply to this post by Jan Kara
On Tue, Oct 29, 2013 at 09:57:56PM +0100, Jan Kara wrote:

> On Fri 25-10-13 10:32:16, Linus Torvalds wrote:
> > On Fri, Oct 25, 2013 at 10:29 AM, Andrew Morton
> > <[hidden email]> wrote:
> > >
> > > Apparently all this stuff isn't working as desired (and perhaps as designed)
> > > in this case.  Will take a look after a return to normalcy ;)
> >
> > It definitely doesn't work. I can trivially reproduce problems by just
> > having a cheap (==slow) USB key with an ext3 filesystem, and doing a
> > git clone to it. The end result is not pretty, and that's actually not
> > even a huge amount of data.
>
>   I'll try to reproduce this tomorrow so that I can have a look at where
> exactly we are stuck. But in the last few releases problems like this were
> caused by problems in reclaim, which got fed up with seeing lots of dirty
> / under-writeback pages and ended up stuck waiting for IO to finish. Mel
> has been tweaking the logic here and there but maybe it hasn't been fixed
> completely. Mel, do you know about any outstanding issues?
>
>

Yeah, there are still a few. The work in that general area dealt with
such problems as dirty pages reaching the end of the LRU (excessive CPU
usage), calling wait_on_page_writeback from reclaim context (random
processes stalling even though there was not much memory pressure), and
desktop applications stalling randomly (a second quick write stalling on
stable writeback). The systemtap script caught those types of problems and I
believe they are fixed up.

There are still problems though. If all dirty pages were backed by a slow
device then dirty limiting is still eventually going to cause stalls in
dirty page balancing. If there is a global sync then the shit can really
hit the fan if it all gets stuck waiting on something like journal space.
Applications that are very fsync-happy can still get stalled for long
periods of time behind slower writers as they wait for the IO to flush.
When all this happens there may still be spikes in CPU usage if it scans
the dirty pages excessively without sleeping.

Consciously or unconsciously my desktop applications generally do not fall
foul of these problems. At least one of the desktop environments can stall
because it calls fsync on history and preference files constantly, but I
cannot remember which one, or if it has been fixed since. I did have a problem
with gnome-terminal as it depended on a library that implemented scrollback
buffering by writing single-line files to /tmp and then truncating them,
which would "freeze" the terminal under IO. I now use tmpfs for /tmp to
get around this. When I'm writing to USB sticks I think it tends to stay
between the point where background writing starts and dirty throttling
occurs, so I rarely notice any major problems. I'm probably unconsciously
avoiding doing any write-heavy work while a USB stick is plugged in.

Addressing this goes back to tuning dirty ratio or replacing it. Tuning
it always falls foul of "works for one person and not another" and fails
utterly when there is storage with different speeds. We talked about this a
few months ago but I still suspect that we will have to bite the bullet and
tune based on "do not dirty more data than it takes N seconds to write back"
using per-bdi writeback estimations. It's just not that trivial to implement
as the writeback speeds can change for a variety of reasons (multiple IO
sources, random vs sequential etc). Hence at one point we think we are
within our target window and then get it completely wrong. Dirty ratio
is a hard guarantee; dirty writeback estimation is best-effort and will
go wrong in some cases.
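
A crude userspace approximation of that "N seconds of writeback" rule,
using a global byte limit since per-bdi byte limits don't exist yet (the
measured bandwidth and N are assumptions):

  BW=$((10*1024*1024))   # slowest device of interest writes ~10MB/s
  N=5                    # accept roughly 5 seconds of dirty data
  sysctl -w vm.dirty_bytes=$((N * BW))
  sysctl -w vm.dirty_background_bytes=$((N * BW / 2))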

--
Mel Gorman
SUSE Labs

Re: Disabling in-memory write cache for x86-64 in Linux II

Jan Kara
In reply to this post by Artem S. Tashkinov
On Wed 30-10-13 10:07:08, Artem S. Tashkinov wrote:

> Oct 30, 2013 02:41:01 AM, Jan Kara wrote:
> On Fri 25-10-13 19:37:53, Ted Tso wrote:
> >> Sure, although I wonder if it would be worth it to calculate some kind of
> >> rolling average of the write bandwidth while we are doing writeback,
> >> so if it turns out we got unlucky with the contents of the first 100MB
> >> of dirty data (it could be either highly random or highly sequential)
> >> then we'll eventually correct to the right level.
> >  We already do average measured throughput over a longer time window and
> >have a kind of rolling-average algorithm doing some averaging.
> >
> >> This means that VM would have to keep dirty page counters for each BDI
> >> --- which I thought we weren't doing right now, which is why we have a
> >> global vm.dirty_ratio/vm.dirty_background_ratio threshold.  (Or do I
> >> have cause and effect reversed?  :-)
> >  And we do currently keep the number of dirty & under-writeback pages per
> >BDI. We have global limits because mm wants to limit the total number of dirty
> >pages (as those are harder to free). It doesn't care as much which device
> >these pages belong to (although it probably should care a bit more, because
> >there are huge differences in how quickly different devices can get rid
> >of dirty pages).
>
> This might sound like an absolutely stupid question which makes no sense at
> all, so I want to apologize for it in advance, but since the Linux kernel lacks
> revoke(), does that mean that dirty buffers will always occupy kernel memory
> if I, for instance, remove my USB stick before the kernel has had time to flush
> those buffers?
  That's actually a good question. And the answer is that currently when we
hit EIO while writing out dirty data, we just throw away that data. Not
an ideal solution for some cases but it solves the problem with unwriteable
data...

                                                                Honza
--
Jan Kara <[hidden email]>
SUSE Labs, CR

Re: Disabling in-memory write cache for x86-64 in Linux II

Karl Kiniger
In reply to this post by Jan Kara
On Tue 131029, Jan Kara wrote:
> On Fri 25-10-13 11:15:55, Karl Kiniger wrote:
> > On Fri 131025, Linus Torvalds wrote:
....
> > Is it currently possible to somehow set above values per block device?
>   Yes, to some extent. You can set /sys/block/<device>/bdi/max_ratio to
> the maximum proportion the device's dirty data can take of the total
> amount. The caveat currently is that this setting only takes effect after
> we have more than (dirty_background_ratio + dirty_ratio)/2 dirty data in
> total, because that is the amount of dirty data at which we start to throttle
> processes. So if the device you'd like to limit is the only one currently
> being written to, the limiting doesn't have a big effect.

Thanks for the info - that's what I was looking for.

You are right that the limiting doesn't have a big effect right now:

on my 4x-speed DVD+RW on /dev/sr0 (x86_64, 4GB, Fedora 19):

max_ratio set to 100  - about 500MB buffered, sync time 2:10 min.
max_ratio set to 1    - about 330MB buffered, sync time 1:23 min.

... way too much buffering.

(measured with strace -tt -ewrite dd if=/dev/zero of=bigfile bs=1M count=1000
by looking at the timestamps).
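
Spelled out, the procedure was roughly this (the mount point is an
assumption):

  echo 1 > /sys/block/sr0/bdi/max_ratio     # or 100 for the other run
  dd if=/dev/zero of=/media/dvd/bigfile bs=1M count=1000
  time sync                                 # source of the 1:23 / 2:10 figures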


Karl



Re: Disabling in-memory write cache for x86-64 in Linux II

Maxim Patlasov
On Thu 31-10-13 14:26:12, Karl Kiniger wrote:

> On Tue 131029, Jan Kara wrote:
> > On Fri 25-10-13 11:15:55, Karl Kiniger wrote:
> > > On Fri 131025, Linus Torvalds wrote:
> ....
> > > Is it currently possible to somehow set above values per block device?
> >   Yes, to some extent. You can set /sys/block/<device>/bdi/max_ratio to
> > the maximum proportion the device's dirty data can take of the total
> > amount. The caveat currently is that this setting only takes effect after
> > we have more than (dirty_background_ratio + dirty_ratio)/2 dirty data in
> > total, because that is the amount of dirty data at which we start to throttle
> > processes. So if the device you'd like to limit is the only one currently
> > being written to, the limiting doesn't have a big effect.
>
> Thanks for the info - that's what I was looking for.
>
> You are right that the limiting doesn't have a big effect right now:
>
> on my 4x-speed DVD+RW on /dev/sr0 (x86_64, 4GB, Fedora 19):
>
> max_ratio set to 100  - about 500MB buffered, sync time 2:10 min.
> max_ratio set to 1    - about 330MB buffered, sync time 1:23 min.
>
> ... way too much buffering.

"strictlimit" feature must fit your and Artem's needs quite well. The feature
enforces per-BDI dirty limits even if the global dirty limit is not reached
yet. I'll send a patch adding knob to turn it on/off.

Thanks,
Maxim

[PATCH] mm: add strictlimit knob

Maxim Patlasov
In reply to this post by Karl Kiniger
"strictlimit" feature was introduced to enforce per-bdi dirty limits for
FUSE which sets bdi max_ratio to 1% by default:

http://www.http.com//article.gmane.org/gmane.linux.kernel.mm/105809

However, the feature can be useful for other relatively slow or untrusted
BDIs like USB flash drives and DVD+RW. The patch adds a knob to enable the
feature:

echo 1 > /sys/class/bdi/X:Y/strictlimit

When enabled, the feature enforces the bdi max_ratio limit even if the global
(10%) dirty limit is not reached. Of course, the effect is not visible until
max_ratio is decreased to some reasonable value.

Signed-off-by: Maxim Patlasov <[hidden email]>
---
 mm/backing-dev.c |   35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index ce682f7..4ee1d64 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -234,11 +234,46 @@ static ssize_t stable_pages_required_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(stable_pages_required);
 
+static ssize_t strictlimit_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	struct backing_dev_info *bdi = dev_get_drvdata(dev);
+	unsigned int val;
+	ssize_t ret;
+
+	ret = kstrtouint(buf, 10, &val);
+	if (ret < 0)
+		return ret;
+
+	switch (val) {
+	case 0:
+		bdi->capabilities &= ~BDI_CAP_STRICTLIMIT;
+		break;
+	case 1:
+		bdi->capabilities |= BDI_CAP_STRICTLIMIT;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return count;
+}
+static ssize_t strictlimit_show(struct device *dev,
+		struct device_attribute *attr, char *page)
+{
+	struct backing_dev_info *bdi = dev_get_drvdata(dev);
+
+	return snprintf(page, PAGE_SIZE-1, "%d\n",
+			!!(bdi->capabilities & BDI_CAP_STRICTLIMIT));
+}
+static DEVICE_ATTR_RW(strictlimit);
+
 static struct attribute *bdi_dev_attrs[] = {
 	&dev_attr_read_ahead_kb.attr,
 	&dev_attr_min_ratio.attr,
 	&dev_attr_max_ratio.attr,
 	&dev_attr_stable_pages_required.attr,
+	&dev_attr_strictlimit.attr,
 	NULL,
 };
 ATTRIBUTE_GROUPS(bdi_dev);
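
With the patch applied, the intended usage would look something like this
(the device numbers are an assumption; as noted above, max_ratio must be
lowered for the strict limit to have a visible effect):

  echo 1 > /sys/class/bdi/8:16/max_ratio      # small per-device share
  echo 1 > /sys/class/bdi/8:16/strictlimit    # enforce it unconditionally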


Re: Disabling in-memory write cache for x86-64 in Linux II

kbuild test robot-2
In reply to this post by Linus Torvalds-2
// Sorry for the late response! I'm on marriage leave these days. :)

On Tue, Oct 29, 2013 at 03:42:08PM -0700, Linus Torvalds wrote:

> On Tue, Oct 29, 2013 at 3:13 PM, Jan Kara <[hidden email]> wrote:
> >
> >   So I think we both realize this is only about what the default should be.
>
> Yes. Most people will use the defaults, but there will always be
> people who tune things for particular loads.
>
> In fact, I think we have gone much too far in saying "all policy in
> user space", because the fact is, user space isn't very good at
> policy. Especially not at reacting to complex situations with
> different devices. From what I've seen, "policy in user space" has
> resulted in exactly two modes:
>
>  - user space does something stupid and wrong (example: "nice -19 X"
> to work around some scheduler oddities)
>
>  - user space does nothing at all, and the kernel people say "hey,
> user space _could_ set this value Xyz, so it's not our problem, and
> it's policy, so we shouldn't touch it".
>
> I think we in the kernel should say "our defaults should be what
> everybody sane can use, and they should work fine on average". With
> "policy in user space" being for crazy people that do really odd
> things and can really spare the time to tune for their particular
> issue.
>
> So the "policy in user space" should be about *overriding* kernel
> policy choices, not about the kernel never having them.

Agreed totally. The kernel defaults should be geared to the typical use
case of the majority of users, unless that would lead to insane behavior
in some less frequent but still relevant use cases.

> And this kind of "you can have many different devices and they act
> quite differently" is a good example of something complicated that
> user space really doesn't have a great model for. And we actually have
> much better information available in the kernel than user space is ever
> likely to have.
>
> > Also I'm not sure capping dirty limits at 200MB is the best spot. It may be
> > but I think we should experiment with numbers a bit to check whether we
> > didn't miss something.
>
> Sure. That said, the patch I suggested basically makes the numbers be
> at least roughly comparable across different architectures. So it's
> been at least somewhat tested, even if 16GB x86-32 machines are
> hopefully pretty rare (but I hear about people installing 32-bit on
> modern machines much too often).

Yeah, it's interesting that the new policy rule actually makes x86_64
behave more consistently with i386, and hence it has been reasonably
tested.

> >>  - temp-files may not be written out at all.
> >>
> >>    Quite frankly, if you have multi-hundred-megabyte temp-files, you've
> >> got issues
> >   Actually people do stuff like this e.g. when generating ISO images before
> > burning them.
>
> Yes, but then the temp-file is long-lived enough that it *will* hit
> the disk anyway. So it's only the "create temporary file and pretty
> much immediately delete it" case that changes behavior (ie compiler
> assembly files etc).
>
> If the temp-file is for something like burning an ISO image, the
> burning part is slow enough that the temp-file will hit the disk
> regardless of when we start writing it.

The temp-file IO avoidance is an optimization, not a guarantee. If a
user seriously wants to avoid IO, he will probably use tmpfs and
disable swap.

So if we have to do some trade-offs in the optimization, I agree that
we should optimize more towards the "large copies to USB stick" use case.

The alternative solution, per-bdi dirty thresholds, could eliminate
the need to do such trade-offs. So it's worth looking at the two
solutions side by side.

> >   There is one more aspect:
> > - transforming random writes into mostly sequential writes
>
> Sure. And I think that if you have a big database, that's when you do
> end up tweaking the dirty limits.

Sure. In general, whenever we have to make some tradeoffs, it's
probably better to "sacrifice" the embedded and supercomputing worlds
much more than the desktop, because in those areas people tend
to have the skill and mindset to do customizations and optimizations.

I wonder if some hand-held devices will set dirty_background_bytes to
0 for better data safety.

> That said, I'd certainly like it even *more* if the limits really were
> per-BDI, and the global limit was in addition to the per-bdi ones.
> Because when you have a USB device that gets maybe 10MB/s on
> contiguous writes, and 100kB/s on random 4k writes, I think it would
> make more sense to make the "start writeout" limits be 1MB/2MB, not
> 100MB/200MB. So my patch doesn't even take it far enough, it's just a
> "let's not be ridiculous". The per-BDI limits don't seem quite ready
> for prime time yet, though. Even the new "strict" limits seems to be
> more about "trusted filesystems" than about really sane writeback
> limits.
>
> Fengguang, comments?

Basically A) lowering the global dirty limit is a reasonable tradeoff,
and B) time-based per-bdi dirty limits seem like the ultimate
solution that could offer the sane defaults to your heart's content.

Since both will be user interface (including semantic) changes, we
have to be careful. Obviously, if (B) can ever be implemented
properly and made mature quickly, it would be the best choice and would
eliminate the need to do (A). But as Mel said in the other email, (B)
is not that easy to implement...

> (And I added Maxim to the cc, since he's the author of the strict
> mode, and while it is currently limited to FUSE, he did mention USB
> storage in the commit message..).
 
The *bytes* based per-bdi limits are relatively easy. It's only a
question of code maturity. Once exported as a user interface to user
space, we can guarantee the exact limit to the user.

However, for *time* based per-bdi limits there will always be
estimation errors, as summarized in Mel's email. They offer sane
semantics to the user, but may not always work to expectation,
since writeback bandwidth may change over time depending on the workload.

It feels much better to have some hard guarantee. So even when the
time based limits are implemented, we'll probably still want to
disable the slippery time/bandwidth estimation when the user is able
to provide byte-based per-bdi limits: hey, I don't care about
random writes and other subtle situations; I know this disk's max write
bandwidth is 100MB/s, and as a good rule of thumb let's simply set
its dirty limit to 100MB.

Or shall we do the simpler and less volatile "max write bandwidth"
estimation and use it for auto per-bdi dirty limits?

Thanks,
Fengguang

Re: Disabling in-memory write cache for x86-64 in Linux II

Pavel Machek
Hi!

> > Yes, but then the temp-file is long-lived enough that it *will* hit
> > the disk anyway. So it's only the "create temporary file and pretty
> > much immediately delete it" case that changes behavior (ie compiler
> > assembly files etc).
> >
> > If the temp-file is for something like burning an ISO image, the
> > burning part is slow enough that the temp-file will hit the disk
> > regardless of when we start writing it.
>
> The temp-file IO avoidance is an optimization, not a guarantee. If a
> user seriously wants to avoid IO, he will probably use tmpfs and
> disable swap.

No, sorry, they can't. Assuming the ISO image fits in tmpfs would be
cruel.

> So if we have to do some trade-offs in the optimization, I agree that
> we should optimize more towards the "large copies to USB stick" use case.
>
> The alternative solution, per-bdi dirty thresholds, could eliminate
> the need to do such trade-offs. So it's worth looking at the two
> solutions side by side.

Yes, please.
                                                                Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html