[RFC 00/13] make direct compaction more deterministic


Re: [RFC 04/13] mm, page_alloc: restructure direct compaction handling in slowpath

Michal Hocko-4
On Tue 10-05-16 09:35:54, Vlastimil Babka wrote:

> The retry loop in __alloc_pages_slowpath is supposed to keep trying reclaim
> and compaction (and OOM), until either the allocation succeeds, or returns
> with failure. Success here is more probable when reclaim precedes compaction,
> as certain watermarks have to be met for compaction to even try, and more free
> pages increase the probability of compaction success. On the other hand,
> starting with light async compaction (if the watermarks allow it), can be
> more efficient, especially for smaller orders, if there's enough free memory
> which is just fragmented.
>
> Thus, the current code starts with compaction before reclaim, and to make sure
> that the last reclaim is always followed by a final compaction, there's another
> direct compaction call at the end of the loop. This makes the code hard to
> follow and adds some duplicated handling of migration_mode decisions. It's also
> somewhat inefficient that even if reclaim or compaction decides not to retry,
> the final compaction is still attempted. Some gfp flag combinations also
> shortcut these retry decisions by "goto noretry;", making it even harder to
> follow.

I completely agree. It was a head scratcher to properly handle all the
potential paths when I was reorganizing the code for the oom detection
rework.

> This patch attempts to restructure the code with only minimal functional
> changes. The call to the first compaction and THP-specific checks are now
> placed above the retry loop, and the "noretry" direct compaction is removed.
>
> The initial compaction is additionally restricted only to costly orders, as we
> can expect smaller orders to be held back by watermarks, and only larger orders
> to suffer primarily from fragmentation. This better matches the checks in
> reclaim's shrink_zones().
>
> There are two other smaller functional changes. One is that the upgrade from
> async migration to light sync migration will always occur after the initial
> compaction.

I do not think this belongs in the patch, for two reasons. First,
we do not need to do the potentially more expensive sync mode when async is
able to make some progress, and second, with the currently
fragile compaction implementation this might reintroduce the premature
OOM for order-2 requests reported by Hugh. Please see
http://lkml.kernel.org/r/alpine.LSU.2.11.1604141114290.1086@...

Your later patch (which I haven't reviewed yet) is then changing this
considerably but I think it would be safer to not touch the migration
mode in this - mostly cleanup - patch.

> This is how it has been until recent patch "mm, oom: protect
> !costly allocations some more", which introduced upgrading the mode based on
> COMPACT_COMPLETE result, but kept the final compaction always upgraded, which
> made it even more special. It's better to return to the simpler handling for
> now, as migration modes will be further modified later in the series.
>
> The second change is that once both reclaim and compaction declare it's not
> worth to retry the reclaim/compact loop, there is no final compaction attempt.
> As argued above, this is intentional. If that final compaction were to succeed,
> it would be due to a wrong retry decision, or simply a race with somebody else
> freeing memory for us.
>
> The main outcome of this patch should be simpler code. Logically, the initial
> compaction without reclaim is the exceptional case to the reclaim/compaction
> scheme, but prior to the patch, it was the last loop iteration that was
> exceptional. Now the code matches the logic better. The change also enables the
> following patches.
>
> Signed-off-by: Vlastimil Babka <[hidden email]>

Other than the above thing I like this patch.
Acked-by: Michal Hocko <[hidden email]>

> ---
>  mm/page_alloc.c | 107 +++++++++++++++++++++++++++++---------------------------
>  1 file changed, 55 insertions(+), 52 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7249949d65ca..88d680b3e7b6 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3555,7 +3555,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>   struct page *page = NULL;
>   unsigned int alloc_flags;
>   unsigned long did_some_progress;
> - enum migrate_mode migration_mode = MIGRATE_ASYNC;
> + enum migrate_mode migration_mode = MIGRATE_SYNC_LIGHT;
>   enum compact_result compact_result;
>   int compaction_retries = 0;
>   int no_progress_loops = 0;
> @@ -3598,6 +3598,50 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>   if (page)
>   goto got_pg;
>  
> + /*
> + * For costly allocations, try direct compaction first, as it's likely
> + * that we have enough base pages and don't need to reclaim.
> + */
> + if (can_direct_reclaim && order > PAGE_ALLOC_COSTLY_ORDER) {
> + page = __alloc_pages_direct_compact(gfp_mask, order,
> + alloc_flags, ac,
> + MIGRATE_ASYNC,
> + &compact_result);
> + if (page)
> + goto got_pg;
> +
> + /* Checks for THP-specific high-order allocations */
> + if (is_thp_gfp_mask(gfp_mask)) {
> + /*
> + * If compaction is deferred for high-order allocations,
> + * it is because sync compaction recently failed. If
> + * this is the case and the caller requested a THP
> + * allocation, we do not want to heavily disrupt the
> + * system, so we fail the allocation instead of entering
> + * direct reclaim.
> + */
> + if (compact_result == COMPACT_DEFERRED)
> + goto nopage;
> +
> + /*
> + * Compaction is contended so rather back off than cause
> + * excessive stalls.
> + */
> + if (compact_result == COMPACT_CONTENDED)
> + goto nopage;
> +
> + /*
> + * It can become very expensive to allocate transparent
> + * hugepages at fault, so use asynchronous memory
> + * compaction for THP unless it is khugepaged trying to
> + * collapse. All other requests should tolerate at
> + * least light sync migration.
> + */
> + if (!(current->flags & PF_KTHREAD))
> + migration_mode = MIGRATE_ASYNC;
> + }
> + }
> +
>  retry:
>   /* Ensure kswapd doesn't accidentaly go to sleep as long as we loop */
>   if (gfp_mask & __GFP_KSWAPD_RECLAIM)
> @@ -3646,55 +3690,33 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>   if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
>   goto nopage;
>  
> - /*
> - * Try direct compaction. The first pass is asynchronous. Subsequent
> - * attempts after direct reclaim are synchronous
> - */
> +
> + /* Try direct reclaim and then allocating */
> + page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
> + &did_some_progress);
> + if (page)
> + goto got_pg;
> +
> + /* Try direct compaction and then allocating */
>   page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
>   migration_mode,
>   &compact_result);
>   if (page)
>   goto got_pg;
>  
> - /* Checks for THP-specific high-order allocations */
> - if (is_thp_gfp_mask(gfp_mask)) {
> - /*
> - * If compaction is deferred for high-order allocations, it is
> - * because sync compaction recently failed. If this is the case
> - * and the caller requested a THP allocation, we do not want
> - * to heavily disrupt the system, so we fail the allocation
> - * instead of entering direct reclaim.
> - */
> - if (compact_result == COMPACT_DEFERRED)
> - goto nopage;
> -
> - /*
> - * Compaction is contended so rather back off than cause
> - * excessive stalls.
> - */
> - if(compact_result == COMPACT_CONTENDED)
> - goto nopage;
> - }
> -
>   if (order && compaction_made_progress(compact_result))
>   compaction_retries++;
>  
> - /* Try direct reclaim and then allocating */
> - page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
> - &did_some_progress);
> - if (page)
> - goto got_pg;
> -
>   /* Do not loop if specifically requested */
>   if (gfp_mask & __GFP_NORETRY)
> - goto noretry;
> + goto nopage;
>  
>   /*
>   * Do not retry costly high order allocations unless they are
>   * __GFP_REPEAT
>   */
>   if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
> - goto noretry;
> + goto nopage;
>  
>   /*
>   * Costly allocations might have made a progress but this doesn't mean
> @@ -3733,25 +3755,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>   goto retry;
>   }
>  
> -noretry:
> - /*
> - * High-order allocations do not necessarily loop after direct reclaim
> - * and reclaim/compaction depends on compaction being called after
> - * reclaim so call directly if necessary.
> - * It can become very expensive to allocate transparent hugepages at
> - * fault, so use asynchronous memory compaction for THP unless it is
> - * khugepaged trying to collapse. All other requests should tolerate
> - * at least light sync migration.
> - */
> - if (is_thp_gfp_mask(gfp_mask) && !(current->flags & PF_KTHREAD))
> - migration_mode = MIGRATE_ASYNC;
> - else
> - migration_mode = MIGRATE_SYNC_LIGHT;
> - page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
> -    ac, migration_mode,
> -    &compact_result);
> - if (page)
> - goto got_pg;
>  nopage:
>   warn_alloc_failed(gfp_mask, order, NULL);
>  got_pg:
> --
> 2.8.2
>

--
Michal Hocko
SUSE Labs

Re: [RFC 05/13] mm, page_alloc: make THP-specific decisions more generic

Michal Hocko-4
In reply to this post by Vlastimil Babka
On Tue 10-05-16 09:35:55, Vlastimil Babka wrote:

> Since THP allocations during page faults can be costly, extra decisions are
> employed for them to avoid excessive reclaim and compaction, if the initial
> compaction doesn't look promising. The detection has never been perfect as
> there is no gfp flag specific to THP allocations. At this moment it checks the
> whole combination of flags that makes up GFP_TRANSHUGE, and hopes that no other
> users of such combination exist, or would mind being treated the same way.
> Extra care is also taken to separate allocations from khugepaged, where latency
> doesn't matter that much.
>
> It is however possible to distinguish these allocations in a simpler and more
> reliable way. The key observation is that after the initial compaction followed
> by the first iteration of "standard" reclaim/compaction, both __GFP_NORETRY
> allocations and costly allocations without __GFP_REPEAT are declared as
> failures:
>
>         /* Do not loop if specifically requested */
>         if (gfp_mask & __GFP_NORETRY)
>                 goto nopage;
>
>         /*
>          * Do not retry costly high order allocations unless they are
>          * __GFP_REPEAT
>          */
>         if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
>                 goto nopage;
>
> This means we can further distinguish allocations that are costly order *and*
> additionally include the __GFP_NORETRY flag. As it happens, GFP_TRANSHUGE
> allocations do already fall into this category. This will also allow other
> costly allocations with similar high-order benefit vs latency considerations to
> use this semantic. Furthermore, we can distinguish THP allocations that should
> try a bit harder (such as from khugepaged) by removing __GFP_NORETRY, as will
> be done in the next patch.

Yes, using __GFP_NORETRY makes perfect sense. It is the weakest mode for
the costly allocation which includes both compaction and reclaim. I am
happy to see is_thp_gfp_mask going away.
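
As a minimal illustration (constants assumed from this era on x86-64 with 4K
pages; the snippet is not part of the patch), a THP page fault allocation
falls into the costly-order plus __GFP_NORETRY category without any
THP-specific gfp test:

	/*
	 * Sketch only: HPAGE_PMD_ORDER is 9 on x86-64 with 4K pages, well
	 * above PAGE_ALLOC_COSTLY_ORDER (3), and GFP_TRANSHUGE still carries
	 * __GFP_NORETRY at this point in the series.
	 */
	unsigned int order = HPAGE_PMD_ORDER;
	gfp_t gfp_mask = GFP_TRANSHUGE;
	bool thp_like = order > PAGE_ALLOC_COSTLY_ORDER &&
			(gfp_mask & __GFP_NORETRY);
	/* true here, so the THP-specific bail-outs quoted above apply */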

> Signed-off-by: Vlastimil Babka <[hidden email]>

Acked-by: Michal Hocko <[hidden email]>

> ---
>  mm/page_alloc.c | 22 +++++++++-------------
>  1 file changed, 9 insertions(+), 13 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 88d680b3e7b6..f5d931e0854a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3182,7 +3182,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>   return page;
>  }
>  
> -
>  /*
>   * Maximum number of compaction retries wit a progress before OOM
>   * killer is consider as the only way to move forward.
> @@ -3447,11 +3446,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>   return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS);
>  }
>  
> -static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
> -{
> - return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == GFP_TRANSHUGE;
> -}
> -
>  /*
>   * Maximum number of reclaim retries without any progress before OOM killer
>   * is consider as the only way to move forward.
> @@ -3610,8 +3604,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>   if (page)
>   goto got_pg;
>  
> - /* Checks for THP-specific high-order allocations */
> - if (is_thp_gfp_mask(gfp_mask)) {
> + /*
> + * Checks for costly allocations with __GFP_NORETRY, which
> + * includes THP page fault allocations
> + */
> + if (gfp_mask & __GFP_NORETRY) {
>   /*
>   * If compaction is deferred for high-order allocations,
>   * it is because sync compaction recently failed. If
> @@ -3631,11 +3628,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>   goto nopage;
>  
>   /*
> - * It can become very expensive to allocate transparent
> - * hugepages at fault, so use asynchronous memory
> - * compaction for THP unless it is khugepaged trying to
> - * collapse. All other requests should tolerate at
> - * least light sync migration.
> + * Looks like reclaim/compaction is worth trying, but
> + * sync compaction could be very expensive, so keep
> + * using async compaction, unless it's khugepaged
> + * trying to collapse.
>   */
>   if (!(current->flags & PF_KTHREAD))
>   migration_mode = MIGRATE_ASYNC;
> --
> 2.8.2
>

--
Michal Hocko
SUSE Labs

Re: [RFC 06/13] mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations

Michal Hocko-4
In reply to this post by Vlastimil Babka
On Tue 10-05-16 09:35:56, Vlastimil Babka wrote:
[...]

> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 570383a41853..0cb09714d960 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -256,8 +256,7 @@ struct vm_area_struct;
>  #define GFP_HIGHUSER (GFP_USER | __GFP_HIGHMEM)
>  #define GFP_HIGHUSER_MOVABLE (GFP_HIGHUSER | __GFP_MOVABLE)
>  #define GFP_TRANSHUGE ((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
> - __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
> - ~__GFP_RECLAIM)
> + __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)

I am not sure this is the right thing to do. I think we should keep
__GFP_NORETRY and clear it where we want a stronger semantic. This is
just too subtle to trust that all callsites are doing the right thing.
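
A minimal sketch of that alternative (the call site below is hypothetical,
not from the posted series): keep __GFP_NORETRY in GFP_TRANSHUGE and clear
it only where we want to try harder, e.g. for khugepaged or madvised
collapses:

	/* hypothetical call site that wants the stronger semantic */
	gfp_t gfp = GFP_TRANSHUGE & ~__GFP_NORETRY;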
--
Michal Hocko
SUSE Labs

Re: [RFC 04/13] mm, page_alloc: restructure direct compaction handling in slowpath

Vlastimil Babka
In reply to this post by Michal Hocko-4
On 05/12/2016 03:29 PM, Michal Hocko wrote:

> On Tue 10-05-16 09:35:54, Vlastimil Babka wrote:
>> This patch attempts to restructure the code with only minimal functional
>> changes. The call to the first compaction and THP-specific checks are now
>> placed above the retry loop, and the "noretry" direct compaction is removed.
>>
>> The initial compaction is additionally restricted only to costly orders, as we
>> can expect smaller orders to be held back by watermarks, and only larger orders
>> to suffer primarily from fragmentation. This better matches the checks in
>> reclaim's shrink_zones().
>>
>> There are two other smaller functional changes. One is that the upgrade from
>> async migration to light sync migration will always occur after the initial
>> compaction.
>
> I do not think this belongs to the patch. There are two reasons. First
> we do not need to do potentially more expensive sync mode when async is
> able to make some progress and the second

My concern was that __GFP_NORETRY non-costly allocations wouldn't
otherwise get a MIGRATE_SYNC_LIGHT pass at all. Previously they would
get it in the noretry: label. So do you think it's a corner case not
worth caring about? Alternatively we could also remove the 'restriction
of initial async compaction to costly orders' from this patch and apply
it separately later. That would also result in the flip to sync_light
after the initial async attempt for these allocations.

> is that with the currently
> fragile compaction implementation this might reintroduce the premature
> OOM for order-2 requests reported by Hugh. Please see
> http://lkml.kernel.org/r/alpine.LSU.2.11.1604141114290.1086@...

Hmm, IIRC that involved some wrong conflict resolution in mmotm? I don't
remember exactly what the code looked like, but wasn't the problem
that the initial compaction was async, then the left-over hunk changed
migration_mode to sync_light, and then should_compact_retry() thought
"oh we already failed sync_light, return false" when in fact the
sync_light compaction never happened? Otherwise I don't see how
switching to sync_light "too early" could lead to premature OOMs.

> Your later patch (which I haven't reviewed yet) is then changing this
> considerably

Yes, my other concern with should_compact_retry() after your "mm, oom:
protect !costly allocations some more" is that relying on
compaction_failed() to upgrade the migration mode is unreliable. Async
compaction can easily keep returning as contended, so we might never see
the COMPACT_COMPLETE result, if it's e.g. limited to nodes without a
really small zone such as ZONE_DMA.

> but I think it would be safer to not touch the migration
> mode in this - mostly cleanup - patch.
>
>> This is how it has been until recent patch "mm, oom: protect
>> !costly allocations some more", which introduced upgrading the mode based on
>> COMPACT_COMPLETE result, but kept the final compaction always upgraded, which
>> made it even more special. It's better to return to the simpler handling for
>> now, as migration modes will be further modified later in the series.
>>
>> The second change is that once both reclaim and compaction declare it's not
>> worth to retry the reclaim/compact loop, there is no final compaction attempt.
>> As argued above, this is intentional. If that final compaction were to succeed,
>> it would be due to a wrong retry decision, or simply a race with somebody else
>> freeing memory for us.
>>
>> The main outcome of this patch should be simpler code. Logically, the initial
>> compaction without reclaim is the exceptional case to the reclaim/compaction
>> scheme, but prior to the patch, it was the last loop iteration that was
>> exceptional. Now the code matches the logic better. The change also enables the
>> following patches.
>>
>> Signed-off-by: Vlastimil Babka <[hidden email]>
>
> Other than the above thing I like this patch.
> Acked-by: Michal Hocko <[hidden email]>

Thanks!


Re: [RFC 06/13] mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations

Vlastimil Babka
In reply to this post by Michal Hocko-4
On 05/12/2016 06:20 PM, Michal Hocko wrote:

> On Tue 10-05-16 09:35:56, Vlastimil Babka wrote:
> [...]
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index 570383a41853..0cb09714d960 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -256,8 +256,7 @@ struct vm_area_struct;
>>   #define GFP_HIGHUSER (GFP_USER | __GFP_HIGHMEM)
>>   #define GFP_HIGHUSER_MOVABLE (GFP_HIGHUSER | __GFP_MOVABLE)
>>   #define GFP_TRANSHUGE ((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
>> - __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
>> - ~__GFP_RECLAIM)
>> + __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
>
> I am not sure this is the right thing to do. I think we should keep
> __GFP_NORETRY and clear it where we want a stronger semantic. This is
> just too subtle to trust that all callsites are doing the right thing.

That would complicate alloc_hugepage_direct_gfpmask() a bit, but if you
think it's worth it, I can turn the default around, OK.

Re: [RFC 04/13] mm, page_alloc: restructure direct compaction handling in slowpath

Michal Hocko-4
In reply to this post by Vlastimil Babka
On Fri 13-05-16 10:10:50, Vlastimil Babka wrote:

> On 05/12/2016 03:29 PM, Michal Hocko wrote:
> > On Tue 10-05-16 09:35:54, Vlastimil Babka wrote:
> > > This patch attempts to restructure the code with only minimal functional
> > > changes. The call to the first compaction and THP-specific checks are now
> > > placed above the retry loop, and the "noretry" direct compaction is removed.
> > >
> > > The initial compaction is additionally restricted only to costly orders, as we
> > > can expect smaller orders to be held back by watermarks, and only larger orders
> > > to suffer primarily from fragmentation. This better matches the checks in
> > > reclaim's shrink_zones().
> > >
> > > There are two other smaller functional changes. One is that the upgrade from
> > > async migration to light sync migration will always occur after the initial
> > > compaction.
> >
> > I do not think this belongs to the patch. There are two reasons. First
> > we do not need to do potentially more expensive sync mode when async is
> > able to make some progress and the second
>
> My concern was that __GFP_NORETRY non-costly allocations wouldn't otherwise
> get a MIGRATE_SYNC_LIGHT pass at all. Previously they would get it in the
> noretry: label.

OK, I haven't considered this. So scratch this then.
--
Michal Hocko
SUSE Labs

Re: [RFC 06/13] mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations

Michal Hocko-4
In reply to this post by Vlastimil Babka
On Fri 13-05-16 10:23:31, Vlastimil Babka wrote:

> On 05/12/2016 06:20 PM, Michal Hocko wrote:
> > On Tue 10-05-16 09:35:56, Vlastimil Babka wrote:
> > [...]
> > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > > index 570383a41853..0cb09714d960 100644
> > > --- a/include/linux/gfp.h
> > > +++ b/include/linux/gfp.h
> > > @@ -256,8 +256,7 @@ struct vm_area_struct;
> > >   #define GFP_HIGHUSER (GFP_USER | __GFP_HIGHMEM)
> > >   #define GFP_HIGHUSER_MOVABLE (GFP_HIGHUSER | __GFP_MOVABLE)
> > >   #define GFP_TRANSHUGE ((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
> > > - __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
> > > - ~__GFP_RECLAIM)
> > > + __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
> >
> > I am not sure this is the right thing to do. I think we should keep
> > __GFP_NORETRY and clear it where we want a stronger semantic. This is
> > just too subtle to trust that all callsites are doing the right thing.
>
> That would complicate alloc_hugepage_direct_gfpmask() a bit, but if you
> think it's worth it, I can turn the default around, OK.

Hmm, on the other hand it is true that GFP_TRANSHUGE is clearing both
reclaim flags by default and then overwrites that. This is just too
ugly. Can we make GFP_TRANSHUGE only define the flags we care about and
then tweak those that should go away at the callsites which matter now
that we do not rely on is_thp_gfp_mask?

--
Michal Hocko
SUSE Labs

Re: [RFC 07/13] mm, compaction: introduce direct compaction priority

Michal Hocko-4
In reply to this post by Vlastimil Babka
On Tue 10-05-16 09:35:57, Vlastimil Babka wrote:

> In the context of direct compaction, for some types of allocations we would
> like the compaction to either succeed or definitely fail while trying as hard
> as possible. Current async/sync_light migration mode is insufficient, as there
> are heuristics such as caching scanner positions, marking pageblocks as
> unsuitable or deferring compaction for a zone. At least the final compaction
> attempt should be able to override these heuristics.
>
> To communicate how hard compaction should try, we replace migration mode with
> a new enum compact_priority and change the relevant function signatures. In
> compact_zone_order() where struct compact_control is constructed, the priority
> is mapped to suitable control flags. This patch itself has no functional
> change, as the current priority levels are mapped back to the same migration
> modes as before. Expanding them will be done next.
>
> Note that !CONFIG_COMPACTION variant of try_to_compact_pages() is removed, as
> the only caller exists under CONFIG_COMPACTION.

Your s-o-b is missing

Anyway I like the idea. The migration_mode felt really weird. It exposes
an internal detail of the compaction code which should have no business
in the allocator path.

Acked-by: Michal Hocko <[hidden email]>

> ---
>  include/linux/compaction.h | 18 +++++++++---------
>  mm/compaction.c            | 14 ++++++++------
>  mm/page_alloc.c            | 27 +++++++++++++--------------
>  3 files changed, 30 insertions(+), 29 deletions(-)
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 4ba90e74969c..900d181ff1b0 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -1,6 +1,14 @@
>  #ifndef _LINUX_COMPACTION_H
>  #define _LINUX_COMPACTION_H
>  
> +// TODO: lower value means higher priority to match reclaim, makes sense?

Yes this makes sense to me.

> +enum compact_priority {

enums might be tricky but I guess it should work ok here. I would just
add

        COMPACT_MIN_PRIO,
> + COMPACT_PRIO_SYNC_LIGHT = COMPACT_MIN_PRIO,
> + DEF_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
> + COMPACT_PRIO_ASYNC,
> + INIT_COMPACT_PRIORITY = COMPACT_PRIO_ASYNC
> +};
> +

to make an implementation independent lowest priority.
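
Putting the suggestion together, the enum might then read (a sketch of the
suggestion, not the posted code):

enum compact_priority {
	COMPACT_MIN_PRIO,
	COMPACT_PRIO_SYNC_LIGHT = COMPACT_MIN_PRIO,
	DEF_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
	COMPACT_PRIO_ASYNC,
	INIT_COMPACT_PRIORITY = COMPACT_PRIO_ASYNC
};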

[...]

> @@ -3269,11 +3269,11 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
>   /*
>   * compaction considers all the zone as desperately out of memory
>   * so it doesn't really make much sense to retry except when the
> - * failure could be caused by weak migration mode.
> + * failure could be caused by insufficient priority
>   */
>   if (compaction_failed(compact_result)) {
> - if (*migrate_mode == MIGRATE_ASYNC) {
> - *migrate_mode = MIGRATE_SYNC_LIGHT;
> + if (*compact_priority > 0) {

                if (*compact_priority > COMPACT_MIN_PRIO)

> + (*compact_priority)--;
>   return true;
>   }
>   return false;
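
With the suggested COMPACT_MIN_PRIO, the whole retry check in
should_compact_retry() would then read (again just a sketch, not the posted
hunk):

	if (compaction_failed(compact_result)) {
		if (*compact_priority > COMPACT_MIN_PRIO) {
			(*compact_priority)--;
			return true;
		}
		return false;
	}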
--
Michal Hocko
SUSE Labs

Re: [RFC 08/13] mm, compaction: simplify contended compaction handling

Michal Hocko-4
In reply to this post by Vlastimil Babka
On Tue 10-05-16 09:35:58, Vlastimil Babka wrote:

> Async compaction detects contention either due to failing trylock on zone->lock
> or lru_lock, or by need_resched(). Since 1f9efdef4f3f ("mm, compaction:
> khugepaged should not give up due to need_resched()") the code got quite
> complicated to distinguish these two up to the __alloc_pages_slowpath() level,
> so different decisions could be taken for khugepaged allocations.
>
> After the recent changes, khugepaged allocations don't check for contended
> compaction anymore, so we again don't need to distinguish lock and sched
> contention, and simplify the current convoluted code a lot.
>
> However, I believe it's also possible to simplify even more and completely
> remove the check for contended compaction after the initial async compaction
> for costly orders, which was originally aimed at THP page fault allocations.
> There are several reasons why this can be done now:
>
> - with the new defaults, THP page faults no longer do reclaim/compaction at
>   all, unless the system admin has overridden the default, or the application has
>   indicated via madvise that it can benefit from THP's. In both cases, it
>   means that the potential extra latency is expected and worth the benefits.

Yes, this sounds reasonable to me, especially when we consider the code
bloat this is causing.

> - even if reclaim/compaction proceeds after this patch where it previously
>   wouldn't, the second compaction attempt is still async and will detect the
>   contention and back off, if the contention persists

MIGRATE_ASYNC still backs off after this patch so I would be surprised to
see more latency issues from this change.

> - there are still heuristics like deferred compaction and pageblock skip bits
>   in place that prevent excessive THP page fault latencies
>
> Signed-off-by: Vlastimil Babka <[hidden email]>

I hope I haven't missed anything because the compaction code is full of
subtle traps, but the changes seem OK to me.

Acked-by: Michal Hocko <[hidden email]>

> ---
>  include/linux/compaction.h | 10 +------
>  mm/compaction.c            | 72 +++++++++-------------------------------------
>  mm/internal.h              |  5 +---
>  mm/page_alloc.c            | 28 +-----------------
>  4 files changed, 16 insertions(+), 99 deletions(-)

This is a really nice cleanup, considering that it doesn't introduce big
behavior changes, which is my understanding from the code.

[...]
> @@ -1564,14 +1564,11 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
>   trace_mm_compaction_end(start_pfn, cc->migrate_pfn,
>   cc->free_pfn, end_pfn, sync, ret);
>  
> - if (ret == COMPACT_CONTENDED)
> - ret = COMPACT_PARTIAL;
> -
>   return ret;
>  }

This took me a while to grasp but then I realized this is correct
because we shouldn't pretend progress when there was none in fact,
especially when __alloc_pages_direct_compact basically replaced this
"fake" COMPACT_PARTIAL by COMPACT_CONTENDED anyway.

[...]
--
Michal Hocko
SUSE Labs

Re: [RFC 09/13] mm, compaction: make whole_zone flag ignore cached scanner positions

Michal Hocko-4
In reply to this post by Vlastimil Babka
On Tue 10-05-16 09:35:59, Vlastimil Babka wrote:

> A recent patch has added whole_zone flag that compaction sets when scanning
> starts from the zone boundary, in order to report that zone has been fully
> scanned in one attempt. For allocations that want to try really hard or cannot
> fail, we will want to introduce a mode where scanning whole zone is guaranteed
> regardless of the cached positions.
>
> This patch reuses the whole_zone flag in a way that if it's already passed true
> to compaction, the cached scanner positions are ignored. Employing this flag
> during reclaim/compaction loop will be done in the next patch. This patch
> however converts compaction invoked from userspace via procfs to use this flag.
> Before this patch, the cached positions were first reset to zone boundaries and
> then read back from struct zone, so there was a window where a parallel
> compaction could replace the reset values, making the manual compaction less
> effective. Using the flag instead of performing reset is more robust.

This makes perfect sense, because a user-visible API had better have
deterministic behavior. Playing scanner-position games which might race
with system-triggered compaction sounds quite unpredictable.

> Signed-off-by: Vlastimil Babka <[hidden email]>

Acked-by: Michal Hocko <[hidden email]>

> ---
>  mm/compaction.c | 15 +++++----------
>  mm/internal.h   |  2 +-
>  2 files changed, 6 insertions(+), 11 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index f649c7bc6de5..1ce6783d3ead 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1442,11 +1442,13 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
>   */
>   cc->migrate_pfn = zone->compact_cached_migrate_pfn[sync];
>   cc->free_pfn = zone->compact_cached_free_pfn;
> - if (cc->free_pfn < start_pfn || cc->free_pfn >= end_pfn) {
> + if (cc->whole_zone || cc->free_pfn < start_pfn ||
> + cc->free_pfn >= end_pfn) {
>   cc->free_pfn = pageblock_start_pfn(end_pfn - 1);
>   zone->compact_cached_free_pfn = cc->free_pfn;
>   }
> - if (cc->migrate_pfn < start_pfn || cc->migrate_pfn >= end_pfn) {
> + if (cc->whole_zone || cc->migrate_pfn < start_pfn ||
> + cc->migrate_pfn >= end_pfn) {
>   cc->migrate_pfn = start_pfn;
>   zone->compact_cached_migrate_pfn[0] = cc->migrate_pfn;
>   zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn;
> @@ -1693,14 +1695,6 @@ static void __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc)
>   INIT_LIST_HEAD(&cc->freepages);
>   INIT_LIST_HEAD(&cc->migratepages);
>  
> - /*
> - * When called via /proc/sys/vm/compact_memory
> - * this makes sure we compact the whole zone regardless of
> - * cached scanner positions.
> - */
> - if (is_via_compact_memory(cc->order))
> - __reset_isolation_suitable(zone);
> -
>   if (is_via_compact_memory(cc->order) ||
>   !compaction_deferred(zone, cc->order))
>   compact_zone(zone, cc);
> @@ -1736,6 +1730,7 @@ static void compact_node(int nid)
>   .order = -1,
>   .mode = MIGRATE_SYNC,
>   .ignore_skip_hint = true,
> + .whole_zone = true,
>   };
>  
>   __compact_pgdat(NODE_DATA(nid), &cc);
> diff --git a/mm/internal.h b/mm/internal.h
> index 556bc9d0a817..2acdee8ab0e6 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -178,7 +178,7 @@ struct compact_control {
>   enum migrate_mode mode; /* Async or sync migration mode */
>   bool ignore_skip_hint; /* Scan blocks even if marked skip */
>   bool direct_compaction; /* False from kcompactd or /proc/... */
> - bool whole_zone; /* Whole zone has been scanned */
> + bool whole_zone; /* Whole zone should/has been scanned */
>   int order; /* order a direct compactor needs */
>   const gfp_t gfp_mask; /* gfp mask of a direct compactor */
>   const unsigned int alloc_flags; /* alloc flags of a direct compactor */
> --
> 2.8.2
>

--
Michal Hocko
SUSE Labs

Re: [RFC 11/13] mm, compaction: add the ultimate direct compaction priority

Michal Hocko-4
In reply to this post by Vlastimil Babka
On Tue 10-05-16 09:36:01, Vlastimil Babka wrote:

> During reclaim/compaction loop, it's desirable to get a final answer from
> unsuccessful compaction so we can either fail the allocation or invoke the OOM
> killer. However, heuristics such as deferred compaction or pageblock skip bits
> can cause compaction to skip parts or whole zones and lead to premature OOM's,
> failures or excessive reclaim/compaction retries.
>
> To remedy this, we introduce a new direct compaction priority called
> COMPACT_PRIO_SYNC_FULL, which instructs direct compaction to:
>
> - ignore deferred compaction status for a zone
> - ignore pageblock skip hints
> - ignore cached scanner positions and scan the whole zone
> - use MIGRATE_SYNC migration mode

I do not think we can do MIGRATE_SYNC because fallback_migrate_page
would trigger pageout and we are in the allocation path and so we
could blow up the stack.
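
For context, the path being referred to looks roughly like this
(mm/migrate.c around this time, simplified from memory, so take the details
with a grain of salt): dirty pages are only written out under full
MIGRATE_SYNC, and doing that writeout from deep inside the allocation path
is what risks blowing the stack.

static int fallback_migrate_page(struct address_space *mapping,
		struct page *newpage, struct page *page, enum migrate_mode mode)
{
	if (PageDirty(page)) {
		/* Only writeback pages in full synchronous migration */
		if (mode != MIGRATE_SYNC)
			return -EBUSY;
		return writeout(mapping, page);
	}

	/* pages with private data must be releasable before falling back */
	if (page_has_private(page) && !try_to_release_page(page, GFP_KERNEL))
		return -EAGAIN;

	return migrate_page(mapping, newpage, page, mode);
}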

> The new priority should get eventually picked up by should_compact_retry() and
> this should improve success rates for costly allocations using __GFP_RETRY,

s@__GFP_RETRY@__GFP_REPEAT@

> such as hugetlbfs allocations, and reduce some corner-case OOM's for non-costly
> allocations.

My testing has shown that even the current implementation, with
deferring, skip hints and cached positions, had a (close to) 100% success
rate even under close-to-OOM conditions.

I am wondering whether this strongest priority should be done only for
!costly high order pages. But we probably want fewer special cases
between costly and !costly orders.

> Signed-off-by: Vlastimil Babka <[hidden email]>

Acked-by: Michal Hocko <[hidden email]>

> ---
>  include/linux/compaction.h |  1 +
>  mm/compaction.c            | 15 ++++++++++++---
>  2 files changed, 13 insertions(+), 3 deletions(-)
>
[...]

> @@ -1631,7 +1639,8 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
>   ac->nodemask) {
>   enum compact_result status;
>  
> - if (compaction_deferred(zone, order)) {
> + if (prio > COMPACT_PRIO_SYNC_FULL
> + && compaction_deferred(zone, order)) {
>   rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
>   continue;
>   }

Wouldn't it be better to pull the prio check into compaction_deferred
directly? There are more callers and I am not really sure all of them
would behave consistently.
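
One hypothetical shape of that, just to make the suggestion concrete (the
extra parameter and the early return are assumptions, not code from the
series):

	/* at the top of compaction_deferred(), taking prio as a new parameter */
	if (prio <= COMPACT_PRIO_SYNC_FULL)
		return false;	/* the strongest priority ignores deferral */
	/* existing deferral bookkeeping continues unchanged below */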
--
Michal Hocko
SUSE Labs

Re: [RFC 12/13] mm, compaction: more reliably increase direct compaction priority

Michal Hocko-4
In reply to this post by Vlastimil Babka
On Tue 10-05-16 09:36:02, Vlastimil Babka wrote:
> During reclaim/compaction loop, compaction priority can be increased by the
> should_compact_retry() function, but the current code is not optimal for
> several reasons:
>
> - priority is only increased when compaction_failed() is true, which means
>   that compaction has scanned the whole zone. This may not happen even after
>   multiple attempts with the lower priority due to parallel activity, so we
>   might needlessly struggle on the lower priority.

OK, I can see that this can be changed if at least one full round is
guaranteed, which seems to be the case for the lowest priority.

>
> - should_compact_retry() is only called when should_reclaim_retry() returns
>   false. This means that compaction priority cannot get increased as long
>   as reclaim makes sufficient progress. Theoretically, reclaim should stop
>   retrying for high-order allocations as long as the high-order page doesn't
>   exist but due to races, this may result in spurious retries when the
>   high-order page momentarily does exist.

This is intentional behavior and I would like to preserve it if it is
possible. For higher order pages should_reclaim_retry retries as long
as there are some eligible high order pages present which are just hidden
by the watermark check. So this is mostly to get us over watermarks to
start caring about fragmentation. If we race there then nothing really
terrible should happen and we should eventually converge to a terminal
state.

Does this make sense to you?
--
Michal Hocko
SUSE Labs

Re: [RFC 08/13] mm, compaction: simplify contended compaction handling

Vlastimil Babka
In reply to this post by Michal Hocko-4
On 05/13/2016 03:09 PM, Michal Hocko wrote:

>> >@@ -1564,14 +1564,11 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
>> >   trace_mm_compaction_end(start_pfn, cc->migrate_pfn,
>> >   cc->free_pfn, end_pfn, sync, ret);
>> >
>> >- if (ret == COMPACT_CONTENDED)
>> >- ret = COMPACT_PARTIAL;
>> >-
>> >   return ret;
>> >  }
> This took me a while to grasp but then I realized this is correct
> because we shouldn't pretend progress when there was none in fact,
> especially when __alloc_pages_direct_compact basically replaced this
> "fake" COMPACT_PARTIAL by COMPACT_CONTENDED anyway.

Yes. Actually COMPACT_CONTENDED compact_result used to be just for the
tracepoint, and __alloc_pages_direct_compact used another function
parameter to signal contention. You changed that with the oom rework so
that the COMPACT_CONTENDED result value is used; this hunk just makes sure
it's still reported correctly.

Re: [RFC 11/13] mm, compaction: add the ultimate direct compaction priority

Vlastimil Babka
In reply to this post by Michal Hocko-4
On 05/13/2016 03:38 PM, Michal Hocko wrote:

> On Tue 10-05-16 09:36:01, Vlastimil Babka wrote:
>> During reclaim/compaction loop, it's desirable to get a final answer from
>> unsuccessful compaction so we can either fail the allocation or invoke the OOM
>> killer. However, heuristics such as deferred compaction or pageblock skip bits
>> can cause compaction to skip parts or whole zones and lead to premature OOM's,
>> failures or excessive reclaim/compaction retries.
>>
>> To remedy this, we introduce a new direct compaction priority called
>> COMPACT_PRIO_SYNC_FULL, which instructs direct compaction to:
>>
>> - ignore deferred compaction status for a zone
>> - ignore pageblock skip hints
>> - ignore cached scanner positions and scan the whole zone
>> - use MIGRATE_SYNC migration mode
>
> I do not think we can do MIGRATE_SYNC because fallback_migrate_page
> would trigger pageout and we are in the allocation path and so we
> could blow up the stack.

Ah, I thought it was just waiting for the writeout to complete, and you
wanted to introduce another migrate mode to actually do the writeout.
But it looks like I misremembered.

>> The new priority should get eventually picked up by should_compact_retry() and
>> this should improve success rates for costly allocations using __GFP_RETRY,
>
> s@__GFP_RETRY@__GFP_REPEAT@

Ah thanks. Depending on the patch timing it might be __GFP_RETRY_HARD in
the end, right :)

>> such as hugetlbfs allocations, and reduce some corner-case OOM's for non-costly
>> allocations.
>
> My testing has shown that even with the current implementation with
> deferring, skip hints and cached positions had (close to) 100% success
> rate even with close to OOM conditions.

Hmm, I thought you at one point said that ignoring skip hints was a
large improvement, because the current resetting of them is just too fuzzy.

> I am wondering whether this strongest priority should be done only for
> !costly high order pages. But we probably want less special cases
> between costly and !costly orders.

Yeah, if somebody wants to retry hard, let him.

>> Signed-off-by: Vlastimil Babka <[hidden email]>
>
> Acked-by: Michal Hocko <[hidden email]>
>
>> ---
>>   include/linux/compaction.h |  1 +
>>   mm/compaction.c            | 15 ++++++++++++---
>>   2 files changed, 13 insertions(+), 3 deletions(-)
>>
> [...]
>> @@ -1631,7 +1639,8 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
>>   ac->nodemask) {
>>   enum compact_result status;
>>
>> - if (compaction_deferred(zone, order)) {
>> + if (prio > COMPACT_PRIO_SYNC_FULL
>> + && compaction_deferred(zone, order)) {
>>   rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
>>   continue;
>>   }
>
> Wouldn't it be better to pull the prio check into compaction_deferred
> directly? There are more callers and I am not really sure all of them
> would behave consistently.

I'll check, thanks.

Re: [RFC 12/13] mm, compaction: more reliably increase direct compaction priority

Vlastimil Babka
In reply to this post by Michal Hocko-4
On 05/13/2016 04:15 PM, Michal Hocko wrote:

> On Tue 10-05-16 09:36:02, Vlastimil Babka wrote:
>>
>> - should_compact_retry() is only called when should_reclaim_retry() returns
>>    false. This means that compaction priority cannot get increased as long
>>    as reclaim makes sufficient progress. Theoretically, reclaim should stop
>>    retrying for high-order allocations as long as the high-order page doesn't
>>    exist but due to races, this may result in spurious retries when the
>>    high-order page momentarily does exist.
>
> This is intentional behavior and I would like to preserve it if it is
> possible. For higher order pages should_reclaim_retry retries as long
> as there are some eligible high order pages present which are just hidden
> by the watermark check. So this is mostly to get us over watermarks to
> start caring about fragmentation. If we race there then nothing really
> terrible should happen and we should eventually converge to a terminal
> state.
>
> Does this make sense to you?

Yeah it should work, my only worry was that this may get subtly wrong
(as experience shows us) and due to e.g. slightly different watermark
checks and/or a corner-case zone such as ZONE_DMA,
should_reclaim_retry() would keep returning true, even if reclaim
couldn't/wouldn't help anything. Then compaction would be needlessly
kept at ineffective priority.

Also my understanding of the initial compaction priorities is to lower
the latency if fragmentation is just light and there's enough memory.
Once we start struggling, I don't see much point in not switching to the
full compaction priority quickly.

Re: [RFC 11/13] mm, compaction: add the ultimate direct compaction priority

Michal Hocko-4
In reply to this post by Vlastimil Babka
On Mon 16-05-16 09:17:11, Vlastimil Babka wrote:
> On 05/13/2016 03:38 PM, Michal Hocko wrote:
> > On Tue 10-05-16 09:36:01, Vlastimil Babka wrote:
[...]
> > > such as hugetlbfs allocations, and reduce some corner-case OOM's for non-costly
> > > allocations.
> >
> > My testing has shown that even with the current implementation with
> > deferring, skip hints and cached positions had (close to) 100% success
> > rate even with close to OOM conditions.
>
> Hmm, I thought you at one point said that ignoring skip hints was a large
> improvement, because the current resetting of them is just too fuzzy.

Not in the hugetlb test. But you are right that skip hints resulted in
really fuzzy behavior.
--
Michal Hocko
SUSE Labs

Re: [RFC 12/13] mm, compaction: more reliably increase direct compaction priority

Michal Hocko-4
In reply to this post by Vlastimil Babka
On Mon 16-05-16 09:31:44, Vlastimil Babka wrote:

> On 05/13/2016 04:15 PM, Michal Hocko wrote:
> > On Tue 10-05-16 09:36:02, Vlastimil Babka wrote:
> > >
> > > - should_compact_retry() is only called when should_reclaim_retry() returns
> > >    false. This means that compaction priority cannot get increased as long
> > >    as reclaim makes sufficient progress. Theoretically, reclaim should stop
> > >    retrying for high-order allocations as long as the high-order page doesn't
> > >    exist but due to races, this may result in spurious retries when the
> > >    high-order page momentarily does exist.
> >
> > This is intentional behavior and I would like to preserve it if it is
> > possible. For higher order pages should_reclaim_retry retries as long
> > as there are some eligible high order pages present which are just hidden
> > by the watermark check. So this is mostly to get us over watermarks to
> > start caring about fragmentation. If we race there then nothing really
> > terrible should happen and we should eventually converge to a terminal
> > state.
> >
> > Does this make sense to you?
>
> Yeah it should work, my only worry was that this may get subtly wrong (as
> experience shows us) and due to e.g. slightly different watermark checks
> and/or a corner-case zone such as ZONE_DMA, should_reclaim_retry() would
> keep returning true, even if reclaim couldn't/wouldn't help anything. Then
> compaction would be needlessly kept at ineffective priority.

The watermark check for ZONE_DMA should always fail, because it fails even
when the zone is completely free, due to the lowmem reserves. I had a subtle
bug in the original code checking highzone_idx rather than classzone_idx, but
the fix for that has been posted recently:
http://lkml.kernel.org/r/1463051677-29418-2-git-send-email-mhocko@...

> Also my understanding of the initial compaction priorities is to lower the
> latency if fragmentation is just light and there's enough memory. Once we
> start struggling, I don't see much point in not switching to the full
> compaction priority quickly.

That is true, but why compact when there are high-order pages and they
are just hidden by the watermark check?
--
Michal Hocko
SUSE Labs

Re: [RFC 13/13] mm, compaction: fix and improve watermark handling

Michal Hocko-4
In reply to this post by Vlastimil Babka
On Tue 10-05-16 09:36:03, Vlastimil Babka wrote:

> Compaction has been using watermark checks when deciding whether it was
> successful, and whether compaction is at all suitable. There are a few problems
> with these checks.
>
> - __compact_finished() uses low watermark in a check that has to pass if
>   the direct compaction is to finish and allocation should succeed. This is
>   too pessimistic, as the allocation will typically use min watermark. It
>   may happen that during compaction, we drop below the low watermark (due to
>   parallel activity), but still form the target high-order page. By checking
>   against low watermark, we might needlessly continue compaction. After this
>   patch, the check uses direct compactor's alloc_flags to determine the
>   watermark, which is effectively the min watermark.

OK, this makes some sense. It would be great if we could have at least
some clarification why the low wmark has been used previously. Probably
Mel can remember?
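
For reference, the check being discussed looked roughly like this before the
patch (mm/compaction.c, __compact_finished(); a simplified recollection, not
a quote of the patch):

	/* Compaction run is not finished if the watermark is not met */
	watermark = low_wmark_pages(zone);

	if (!zone_watermark_ok(zone, cc->order, watermark, cc->classzone_idx,
			       cc->alloc_flags))
		return COMPACT_CONTINUE;

The patch changes the watermark used here to follow the direct compactor's
alloc_flags, which for most direct compactions is effectively the min
watermark.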

> - __compaction_suitable has the same issue in the check whether the allocation
>   is already supposed to succeed and we don't need to compact. Fix it the same
>   way.
>
> - __compaction_suitable() then checks the low watermark plus a (2 << order) gap
>   to decide if there's enough free memory to perform compaction. This check

And this was a real head scratcher when I started looking into the
compaction recently. Why do we need to be above low watermark to even
start compaction. Compaction uses additional memory only for a short
period of time and then releases the already migrated pages.

>   uses direct compactor's alloc_flags, but that's wrong. If alloc_flags doesn't
>   include ALLOC_CMA, we might fail the check, even though the freepage
>   isolation isn't restricted outside of CMA pageblocks. On the other hand,
>   alloc_flags may indicate access to memory reserves, making compaction proceed
>   and then fail watermark check during freepage isolation, which doesn't pass
>   alloc_flags. The fix here is to use fixed ALLOC_CMA flags in the
>   __compaction_suitable() check.

This makes my head hurt. Whut?

> - __isolate_free_page uses low watermark check to decide if free page can be
>   isolated. It also doesn't use ALLOC_CMA, so add it for the same reasons.

Why do we check the watermark at all? What would happen if this obscure
if (!is_migrate_isolate(mt)) was gone? I remember I put some tracing
there and it never hit for me even when I was testing close to OOM
conditions. Maybe an earlier check bailed out, but this code path looks
really obscure, so it either deserves a big fat comment or should die.

> - The use of low watermark checks in __compaction_suitable() and
>   __isolate_free_page does perhaps make sense for high-order allocations where
>   more freepages increase the chance of success, and we can typically fail
>   with some order-0 fallback when the system is struggling. But for low-order
>   allocation, forming the page should not be that hard. So using low watermark
>   here might just prevent compaction from even trying, and eventually lead to
>   OOM killer even if we are above min watermarks. So after this patch, we use
>   min watermark for non-costly orders in these checks, by passing the
>   alloc_flags parameter to split_page() and __isolate_free_page().

OK, so if IIUC, costly high-order requests shouldn't even try when we are
below the watermark (unless they are __GFP_REPEAT, which would get us to a
stronger compaction mode/priority), and reclaim would then get us over the
low wmark and we go on. Is that what you are saying? This makes some sense,
but then let's have a _single_ place to check the watermark, please. These
checks at a few different levels are just subtle as hell and likewise error
prone.

> To sum up, after this patch, the kernel should in some situations finish
> successful direct compaction sooner, prevent compaction from starting when it's
> not needed, proceed with compaction when free memory is in CMA pageblocks, and
> for non-costly orders, prevent OOM killing or excessive reclaim when free
> memory is between the min and low watermarks.

Could you please split this patch into three(?) parts. One to remove as many
wmark checks as possible, move low wmark to min for !costly high orders
and finally the cma part which I fail to understand...

Thanks!
--
Michal Hocko
SUSE Labs

Re: [RFC 12/13] mm, compaction: more reliably increase direct compaction priority

Vlastimil Babka
In reply to this post by Michal Hocko-4
On 05/16/2016 10:14 AM, Michal Hocko wrote:

> On Mon 16-05-16 09:31:44, Vlastimil Babka wrote:
>>
>> Yeah it should work, my only worry was that this may get subtly wrong (as
>> experience shows us) and due to e.g. slightly different watermark checks
>> and/or a corner-case zone such as ZONE_DMA, should_reclaim_retry() would
>> keep returning true, even if reclaim couldn't/wouldn't help anything. Then
>> compaction would be needlessly kept at ineffective priority.
>
> The watermark check for ZONE_DMA should always fail, because it fails even
> when the zone is completely free, due to the lowmem reserves. I had a subtle
> bug in the original code checking highzone_idx rather than classzone_idx, but
> the fix for that has been posted recently:
> http://lkml.kernel.org/r/1463051677-29418-2-git-send-email-mhocko@...

Sure, but that just adds to the experience of being subtly wrong in this
area :) Still, we can leave this part alone until proven wrong; I
don't insist strongly.

>> Also my understanding of the initial compaction priorities is to lower the
>> latency if fragmentation is just light and there's enough memory. Once we
>> start struggling, I don't see much point in not switching to the full
>> compaction priority quickly.
>
> That is true but why to compact when there are high order pages and they
> are just hidden by the watermark check.

Compaction should skip such a zone regardless of priority.

Re: [RFC 13/13] mm, compaction: fix and improve watermark handling

Vlastimil Babka
In reply to this post by Michal Hocko-4
On 05/16/2016 11:25 AM, Michal Hocko wrote:

> On Tue 10-05-16 09:36:03, Vlastimil Babka wrote:
>> Compaction has been using watermark checks when deciding whether it was
>> successful, and whether compaction is at all suitable. There are few problems
>> with these checks.
>>
>> - __compact_finished() uses low watermark in a check that has to pass if
>>    the direct compaction is to finish and allocation should succeed. This is
>>    too pessimistic, as the allocation will typically use min watermark. It
>>    may happen that during compaction, we drop below the low watermark (due to
>>    parallel activity), but still form the target high-order page. By checking
>>    against low watermark, we might needlessly continue compaction. After this
>>    patch, the check uses direct compactor's alloc_flags to determine the
>>    watermark, which is effectively the min watermark.
>
> OK, this makes some sense. It would be great if we could have at least
> some clarification why the low wmark has been used previously. Probably
> Mel can remember?
>
>> - __compaction_suitable has the same issue in the check whether the allocation
>>    is already supposed to succeed and we don't need to compact. Fix it the same
>>    way.
>>
>> - __compaction_suitable() then checks the low watermark plus a (2 << order) gap
>>    to decide if there's enough free memory to perform compaction. This check
>
> And this was a real head scratcher when I started looking into the
> compaction recently. Why do we need to be above low watermark to even
> start compaction.

Hmm, above you said you're fine with low wmark (maybe after
clarification). I don't know why it was used; I can only guess.

> Compaction uses additional memory only for a short
> period of time and then releases the already migrated pages.

As for the 2 << order gap: I can imagine that e.g. order-5 compaction
(32 pages) isolates 20 pages for migration and starts looking for free
pages. It collects 19 free pages and then reaches an order-4 free page.
Splitting that page to collect it would result in 19+16=35 pages
isolated, thus exceeding the 1 << order gap, and failing. With a 2 << order
gap, the chances of this happening are reduced.
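
The gap in question is the one in this check (mm/compaction.c,
__compaction_suitable(); a simplified recollection of the code at this
point, not a quote):

	/*
	 * Order-0 watermarks must be met, with 2UL << order of extra
	 * headroom for the pages isolated while compaction runs.
	 */
	watermark = low_wmark_pages(zone) + (2UL << order);
	if (!zone_watermark_ok(zone, 0, watermark, classzone_idx, alloc_flags))
		return COMPACT_SKIPPED;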

>>    uses direct compactor's alloc_flags, but that's wrong. If alloc_flags doesn't
>>    include ALLOC_CMA, we might fail the check, even though the freepage
>>    isolation isn't restricted outside of CMA pageblocks. On the other hand,
>>    alloc_flags may indicate access to memory reserves, making compaction proceed
>>    and then fail watermark check during freepage isolation, which doesn't pass
>>    alloc_flags. The fix here is to use fixed ALLOC_CMA flags in the
>>    __compaction_suitable() check.
>
> This makes my head hurt. Whut?

I'll try to explain better next time.

>> - __isolate_free_page uses low watermark check to decide if free page can be
>>    isolated. It also doesn't use ALLOC_CMA, so add it for the same reasons.
>
> Why do we check the watermark at all? What would happen if this obscure
> if (!is_migrate_isolate(mt)) was gone? I remember I put some tracing
> there and it never hit for me even when I was testing close to OOM
> conditions. Maybe an earlier check bailed out but this code path looks
> really obscure so it should either deserve a large fat comment or to
> die.

The check is there so that compaction doesn't exhaust memory below
reserves during its work, just like any other non-privileged allocation.

>> - The use of low watermark checks in __compaction_suitable() and
>>    __isolate_free_page does perhaps make sense for high-order allocations where
>>    more freepages increase the chance of success, and we can typically fail
>>    with some order-0 fallback when the system is struggling. But for low-order
>>    allocation, forming the page should not be that hard. So using low watermark
>>    here might just prevent compaction from even trying, and eventually lead to
>>    OOM killer even if we are above min watermarks. So after this patch, we use
>>    min watermark for non-costly orders in these checks, by passing the
>>    alloc_flags parameter to split_page() and __isolate_free_page().
>
> OK, so if IIUC costly high order requests even shouldn't try when we are
> below watermark (unless they are __GFP_REPEAT which would get us to a
> stronger compaction mode/priority) and that would reclaim us over low
> wmark and go on. Is that what you are saying? This makes some sense but
> then let's have a _single_ place to check the watermark please. This
> checks at few different levels is just subtle as hell and error prone
> likewise.

What single place then? The situation might change dynamically so
passing the initial __compaction_suitable() check doesn't guarantee that
enough free pages are still available when it comes to isolating
freepages. Your testing that never hit it shows that this is rare, but
do we want to risk compaction making an OOM situation worse?

>> To sum up, after this patch, the kernel should in some situations finish
>> successful direct compaction sooner, prevent compaction from starting when it's
>> not needed, proceed with compaction when free memory is in CMA pageblocks, and
>> for non-costly orders, prevent OOM killing or excessive reclaim when free
>> memory is between the min and low watermarks.
>
> Could you please split this patch into three(?) parts. One to remove as many
> wmark checks as possible, move low wmark to min for !costly high orders
> and finally the cma part which I fail to understand...

Sure, although I'm not yet convinced we can remove any checks.

> Thanks!
>
