[RFC 00/13] make direct compaction more deterministic


Vlastimil Babka
This is mostly a followup to Michal's oom detection rework, which highlighted
the need for direct compaction to provide better feedback in the
reclaim/compaction loop, so that the loop can reliably recognize when compaction
cannot make further progress and the allocation should invoke the OOM killer or
fail. We discussed this at LSF/MM [1], where I proposed expanding the async/sync
migration mode used in compaction into more general "priorities". This patchset
adds one new priority that simply overrides all the heuristics and makes
compaction fully scan all zones. I don't currently think that we need more
fine-grained priorities, but we'll see. Other than that, there are some smaller
fixes and cleanups, mainly related to the THP-specific hacks.

Testing/evaluation is pending, but I'm posting the series now in the hope of
helping the discussions around the oom detection rework. I also hope for
testing of the near-OOM conditions. The new priority level should also help
hugetlbfs allocations, since they use __GFP_REPEAT, and it has already been
reported that ignoring compaction heuristics helps these allocations.

The series is based on mmotm git [2] tag mmotm-2016-04-27-15-21-14.
One first needs to git revert commit 69340d225e8d ("mm: use compaction
feedback for thp backoff conditions"), which has already happened in mmots.

[1] https://lwn.net/Articles/684611/
[2] git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git

Hugh Dickins (1):
  mm, compaction: don't isolate PageWriteback pages in
    MIGRATE_SYNC_LIGHT mode

Vlastimil Babka (12):
  mm, page_alloc: set alloc_flags only once in slowpath
  mm, page_alloc: don't retry initial attempt in slowpath
  mm, page_alloc: restructure direct compaction handling in slowpath
  mm, page_alloc: make THP-specific decisions more generic
  mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations
  mm, compaction: introduce direct compaction priority
  mm, compaction: simplify contended compaction handling
  mm, compaction: make whole_zone flag ignore cached scanner positions
  mm, compaction: cleanup unused functions
  mm, compaction: add the ultimate direct compaction priority
  mm, compaction: more reliably increase direct compaction priority
  mm, compaction: fix and improve watermark handling

 include/linux/compaction.h |  32 +++----
 include/linux/gfp.h        |   3 +-
 include/linux/mm.h         |   2 +-
 mm/compaction.c            | 196 +++++++++++++++-------------------------
 mm/huge_memory.c           |   8 +-
 mm/internal.h              |  10 +--
 mm/page_alloc.c            | 220 +++++++++++++++++++++------------------------
 mm/page_isolation.c        |   2 +-
 8 files changed, 196 insertions(+), 277 deletions(-)

--
2.8.2

[RFC 01/13] mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode

Vlastimil Babka
From: Hugh Dickins <[hidden email]>

At present MIGRATE_SYNC_LIGHT is allowing __isolate_lru_page() to
isolate a PageWriteback page, which __unmap_and_move() then rejects
with -EBUSY: of course the writeback might complete in between, but
that's not what we usually expect, so probably better not to isolate it.

Signed-off-by: Hugh Dickins <[hidden email]>
Signed-off-by: Vlastimil Babka <[hidden email]>
---
 mm/compaction.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index c72987603343..481004c73c90 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1146,7 +1146,7 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
  struct page *page;
  const isolate_mode_t isolate_mode =
  (sysctl_compact_unevictable_allowed ? ISOLATE_UNEVICTABLE : 0) |
- (cc->mode == MIGRATE_ASYNC ? ISOLATE_ASYNC_MIGRATE : 0);
+ (cc->mode != MIGRATE_SYNC ? ISOLATE_ASYNC_MIGRATE : 0);
 
  /*
  * Start at where we last stopped, or beginning of the zone as
--
2.8.2

[RFC 02/13] mm, page_alloc: set alloc_flags only once in slowpath

Vlastimil Babka
In __alloc_pages_slowpath(), alloc_flags doesn't change after it's initialized,
so move the initialization above the retry: label. Also make the comment above
the initialization more descriptive.

Signed-off-by: Vlastimil Babka <[hidden email]>
---
 mm/page_alloc.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a50184ec6ca0..91fbf6f95403 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3579,17 +3579,17 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
  gfp_mask &= ~__GFP_ATOMIC;
 
-retry:
- if (gfp_mask & __GFP_KSWAPD_RECLAIM)
- wake_all_kswapds(order, ac);
-
  /*
- * OK, we're below the kswapd watermark and have kicked background
- * reclaim. Now things get more complex, so set up alloc_flags according
- * to how we want to proceed.
+ * The fast path uses conservative alloc_flags to succeed only until
+ * kswapd needs to be woken up, and to avoid the cost of setting up
+ * alloc_flags precisely. So we do that now.
  */
  alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
+retry:
+ if (gfp_mask & __GFP_KSWAPD_RECLAIM)
+ wake_all_kswapds(order, ac);
+
  /* This is the last chance, in general, before the goto nopage. */
  page = get_page_from_freelist(gfp_mask, order,
  alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
--
2.8.2

[RFC 12/13] mm, compaction: more reliably increase direct compaction priority

Vlastimil Babka
During the reclaim/compaction loop, compaction priority can be increased by the
should_compact_retry() function, but the current code is not optimal for
several reasons:

- priority is only increased when compaction_failed() is true, which means
  that compaction has scanned the whole zone. This may not happen even after
  multiple attempts at the lower priority, due to parallel activity, so we
  might needlessly struggle at the lower priority.

- should_compact_retry() is only called when should_reclaim_retry() returns
  false. This means that compaction priority cannot be increased as long
  as reclaim makes sufficient progress. Theoretically, reclaim should stop
  retrying for high-order allocations as long as the high-order page doesn't
  exist, but due to races, this may result in spurious retries when the
  high-order page momentarily does exist.

We can remove these corner cases by making sure that should_compact_retry() is
always called, and that it increases compaction priority if possible. Examining
the compaction result further can then be done only after reaching the highest
priority. This is a simple solution, and we don't need to worry about reaching
the highest priority "too soon" here: when should_compact_retry() is called,
the system is already struggling and the allocation is supposed to either try
as hard as possible, or it is not allowed to fail at all. There's not much
point staying at lower priorities with heuristics that may result in only
partial compaction.

The only exception is the COMPACT_SKIPPED result, which means that compaction
could not run at all because the order-0 watermarks failed. In that case, don't
increase compaction priority; instead, check whether compaction could proceed
if everything reclaimable was reclaimed. Before this patch, this was tied to
compaction_withdrawn(), but the other results considered there are in fact only
due to low compaction priority, so thanks to this patch we can ignore them.

Signed-off-by: Vlastimil Babka <[hidden email]>
---
 mm/page_alloc.c | 46 +++++++++++++++++++++++-----------------------
 1 file changed, 23 insertions(+), 23 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index aa9c39a7f40a..623027fb8121 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3248,28 +3248,27 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
  return false;
 
  /*
- * compaction considers all the zone as desperately out of memory
- * so it doesn't really make much sense to retry except when the
- * failure could be caused by insufficient priority
+ * Compaction backed off due to watermark checks for order-0,
+ * so the regular reclaim has to try harder and reclaim something.
+ * Retry only if it looks like reclaim might have a chance.
  */
- if (compaction_failed(compact_result)) {
- if (*compact_priority > 0) {
- (*compact_priority)--;
- return true;
- }
- return false;
- }
+ if (compact_result == COMPACT_SKIPPED)
+ return compaction_zonelist_suitable(ac, order, alloc_flags);
 
  /*
- * make sure the compaction wasn't deferred or didn't bail out early
- * due to locks contention before we declare that we should give up.
- * But do not retry if the given zonelist is not suitable for
- * compaction.
+ * Compaction could have withdrawn early or skipped some zones or
+ * pageblocks. We were asked to retry, which means the allocation
+ * should try really hard, so increase the priority if possible.
  */
- if (compaction_withdrawn(compact_result))
- return compaction_zonelist_suitable(ac, order, alloc_flags);
+ if (*compact_priority > 0) {
+ (*compact_priority)--;
+ return true;
+ }
 
  /*
+ * The remaining possibility is that compaction made progress and
+ * created a high-order page, but it was allocated by somebody else.
+ * To prevent thrashing, limit the number of retries in such case.
  * !costly requests are much more important than __GFP_REPEAT
  * costly ones because they are de facto nofail and invoke OOM
  * killer to move on while costly can fail and users are ready
@@ -3527,6 +3526,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  struct alloc_context *ac)
 {
  bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
+ bool should_retry;
  struct page *page = NULL;
  unsigned int alloc_flags;
  unsigned long did_some_progress;
@@ -3695,22 +3695,22 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  else
  no_progress_loops++;
 
- if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
- did_some_progress > 0, no_progress_loops))
- goto retry;
-
+ should_retry = should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
+ did_some_progress > 0, no_progress_loops);
  /*
  * It doesn't make any sense to retry for the compaction if the order-0
  * reclaim is not able to make any progress because the current
  * implementation of the compaction depends on the sufficient amount
  * of free memory (see __compaction_suitable)
  */
- if (did_some_progress > 0 &&
- should_compact_retry(ac, order, alloc_flags,
+ if (did_some_progress > 0)
+ should_retry |= should_compact_retry(ac, order, alloc_flags,
  compact_result, &compact_priority,
- compaction_retries))
+ compaction_retries);
+ if (should_retry)
  goto retry;
 
+
  /* Reclaim has failed us, start killing things */
  page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
  if (page)
--
2.8.2

[RFC 11/13] mm, compaction: add the ultimate direct compaction priority

Vlastimil Babka
During the reclaim/compaction loop, it's desirable to get a final answer from
an unsuccessful compaction, so that we can either fail the allocation or invoke
the OOM killer. However, heuristics such as deferred compaction or the
pageblock skip bits can cause compaction to skip parts of zones or whole zones,
and lead to premature OOMs, failures, or excessive reclaim/compaction retries.

To remedy this, we introduce a new direct compaction priority called
COMPACT_PRIO_SYNC_FULL, which instructs direct compaction to:

- ignore deferred compaction status for a zone
- ignore pageblock skip hints
- ignore cached scanner positions and scan the whole zone
- use MIGRATE_SYNC migration mode

The new priority should eventually get picked up by should_compact_retry(),
and this should improve success rates for costly allocations using
__GFP_REPEAT, such as hugetlbfs allocations, and reduce some corner-case OOMs
for non-costly allocations.

Signed-off-by: Vlastimil Babka <[hidden email]>
---
 include/linux/compaction.h |  1 +
 mm/compaction.c            | 15 ++++++++++++---
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index eeaed24e87a8..af85c620c788 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -3,6 +3,7 @@
 
 // TODO: lower value means higher priority to match reclaim, makes sense?
 enum compact_priority {
+ COMPACT_PRIO_SYNC_FULL,
  COMPACT_PRIO_SYNC_LIGHT,
  DEF_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
  COMPACT_PRIO_ASYNC,
diff --git a/mm/compaction.c b/mm/compaction.c
index 7d0935e1a195..9bc475dc4c99 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1580,12 +1580,20 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
  .order = order,
  .gfp_mask = gfp_mask,
  .zone = zone,
- .mode = (prio == COMPACT_PRIO_ASYNC) ?
- MIGRATE_ASYNC : MIGRATE_SYNC_LIGHT,
  .alloc_flags = alloc_flags,
  .classzone_idx = classzone_idx,
  .direct_compaction = true,
+ .whole_zone = (prio == COMPACT_PRIO_SYNC_FULL),
+ .ignore_skip_hint = (prio == COMPACT_PRIO_SYNC_FULL)
  };
+
+ if (prio == COMPACT_PRIO_ASYNC)
+ cc.mode = MIGRATE_ASYNC;
+ else if (prio == COMPACT_PRIO_SYNC_LIGHT)
+ cc.mode = MIGRATE_SYNC_LIGHT;
+ else
+ cc.mode = MIGRATE_SYNC;
+
  INIT_LIST_HEAD(&cc.freepages);
  INIT_LIST_HEAD(&cc.migratepages);
 
@@ -1631,7 +1639,8 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
  ac->nodemask) {
  enum compact_result status;
 
- if (compaction_deferred(zone, order)) {
+ if (prio > COMPACT_PRIO_SYNC_FULL
+ && compaction_deferred(zone, order)) {
  rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
  continue;
  }
--
2.8.2

[RFC 13/13] mm, compaction: fix and improve watermark handling

Vlastimil Babka
Compaction has been using watermark checks both when deciding whether it was
successful and whether compaction is suitable at all. There are a few problems
with these checks.

- __compact_finished() uses the low watermark in a check that has to pass if
  direct compaction is to finish and the allocation should succeed. This is
  too pessimistic, as the allocation will typically use the min watermark. It
  may happen that during compaction we drop below the low watermark (due to
  parallel activity), but still form the target high-order page. By checking
  against the low watermark, we might needlessly continue compaction. After
  this patch, the check uses the direct compactor's alloc_flags to determine
  the watermark, which is effectively the min watermark.

- __compaction_suitable has the same issue in the check whether the allocation
  is already supposed to succeed and we don't need to compact. Fix it the same
  way.

- __compaction_suitable() then checks the low watermark plus a (2 << order) gap
  to decide if there's enough free memory to perform compaction. This check
  uses the direct compactor's alloc_flags, which is wrong: if alloc_flags
  doesn't include ALLOC_CMA, we might fail the check even though freepage
  isolation isn't restricted to non-CMA pageblocks. On the other hand,
  alloc_flags may indicate access to memory reserves, making compaction
  proceed only to fail the watermark check during freepage isolation, which
  doesn't get alloc_flags passed. The fix here is to use the fixed ALLOC_CMA
  flags in the __compaction_suitable() check.

- __isolate_free_page() uses the low watermark check to decide whether a free
  page can be isolated. It also doesn't use ALLOC_CMA, so add it for the same
  reasons.

- The use of low watermark checks in __compaction_suitable() and
  __isolate_free_page() does perhaps make sense for high-order allocations,
  where more freepages increase the chance of success and we can typically
  fall back to some order-0 allocation when the system is struggling. But for
  low-order allocations, forming the page should not be that hard, so using
  the low watermark here might just prevent compaction from even trying, and
  eventually lead to the OOM killer even when we are above the min watermark.
  So after this patch, we use the min watermark for non-costly orders in these
  checks, by passing the alloc_flags parameter to split_free_page() and
  __isolate_free_page().

To sum up, after this patch, the kernel should in some situations finish
successful direct compaction sooner, prevent compaction from starting when it's
not needed, proceed with compaction when free memory is in CMA pageblocks, and
for non-costly orders, prevent OOM killing or excessive reclaim when free
memory is between the min and low watermarks.

Signed-off-by: Vlastimil Babka <[hidden email]>
---
 include/linux/mm.h  |  2 +-
 mm/compaction.c     | 28 +++++++++++++++++++++++-----
 mm/internal.h       |  3 ++-
 mm/page_alloc.c     | 13 ++++++++-----
 mm/page_isolation.c |  2 +-
 5 files changed, 35 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index db8979ce28a3..ce7248022114 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -518,7 +518,7 @@ void __put_page(struct page *page);
 void put_pages_list(struct list_head *pages);
 
 void split_page(struct page *page, unsigned int order);
-int split_free_page(struct page *page);
+int split_free_page(struct page *page, unsigned int alloc_flags);
 
 /*
  * Compound pages have a destructor function.  Provide a
diff --git a/mm/compaction.c b/mm/compaction.c
index 9bc475dc4c99..207b6c132d6d 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -368,6 +368,16 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
  unsigned long flags = 0;
  bool locked = false;
  unsigned long blockpfn = *start_pfn;
+ unsigned int alloc_flags;
+
+ /*
+ * Determine how split_free_page() will check watermarks, in line with
+ * compaction_suitable(). Pages in CMA pageblocks should be counted
+ * as free for this purpose as a migratable page is likely movable
+ */
+ alloc_flags = (cc->order > PAGE_ALLOC_COSTLY_ORDER) ?
+ ALLOC_WMARK_LOW : ALLOC_WMARK_MIN;
+ alloc_flags |= ALLOC_CMA;
 
  cursor = pfn_to_page(blockpfn);
 
@@ -440,7 +450,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
  }
 
  /* Found a free page, break it into order-0 pages */
- isolated = split_free_page(page);
+ isolated = split_free_page(page, alloc_flags);
  total_isolated += isolated;
  for (i = 0; i < isolated; i++) {
  list_add(&page->lru, freelist);
@@ -1262,7 +1272,7 @@ static enum compact_result __compact_finished(struct zone *zone, struct compact_
  return COMPACT_CONTINUE;
 
  /* Compaction run is not finished if the watermark is not met */
- watermark = low_wmark_pages(zone);
+ watermark = zone->watermark[cc->alloc_flags & ALLOC_WMARK_MASK];
 
  if (!zone_watermark_ok(zone, cc->order, watermark, cc->classzone_idx,
  cc->alloc_flags))
@@ -1327,7 +1337,7 @@ static enum compact_result __compaction_suitable(struct zone *zone, int order,
  if (is_via_compact_memory(order))
  return COMPACT_CONTINUE;
 
- watermark = low_wmark_pages(zone);
+ watermark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
  /*
  * If watermarks for high-order allocation are already met, there
  * should be no need for compaction at all.
@@ -1339,11 +1349,19 @@ static enum compact_result __compaction_suitable(struct zone *zone, int order,
  /*
  * Watermarks for order-0 must be met for compaction. Note the 2UL.
  * This is because during migration, copies of pages need to be
- * allocated and for a short time, the footprint is higher
+ * allocated and for a short time, the footprint is higher. For
+ * costly orders, we require low watermark instead of min for
+ * compaction to proceed to increase its chances. Note that watermark
+ * and alloc_flags here have to match (or be more pessimistic than)
+ * the watermark checks done in __isolate_free_page(), and we use the
+ * direct compactor's classzone_idx to skip over zones where
+ * lowmem reserves would prevent allocation even if compaction succeeds
  */
+ watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
+ low_wmark_pages(zone) : min_wmark_pages(zone);
  watermark += (2UL << order);
  if (!__zone_watermark_ok(zone, 0, watermark, classzone_idx,
- alloc_flags, wmark_target))
+ ALLOC_CMA, wmark_target))
  return COMPACT_SKIPPED;
 
  /*
diff --git a/mm/internal.h b/mm/internal.h
index 2acdee8ab0e6..62c1bf61953b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -149,7 +149,8 @@ static inline struct page *pageblock_pfn_to_page(unsigned long start_pfn,
  return __pageblock_pfn_to_page(start_pfn, end_pfn, zone);
 }
 
-extern int __isolate_free_page(struct page *page, unsigned int order);
+extern int __isolate_free_page(struct page *page, unsigned int order,
+ unsigned int alloc_flags);
 extern void __free_pages_bootmem(struct page *page, unsigned long pfn,
  unsigned int order);
 extern void prep_compound_page(struct page *page, unsigned int order);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 623027fb8121..2d74eddffcf6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2489,7 +2489,8 @@ void split_page(struct page *page, unsigned int order)
 }
 EXPORT_SYMBOL_GPL(split_page);
 
-int __isolate_free_page(struct page *page, unsigned int order)
+int __isolate_free_page(struct page *page, unsigned int order,
+ unsigned int alloc_flags)
 {
  unsigned long watermark;
  struct zone *zone;
@@ -2502,8 +2503,10 @@ int __isolate_free_page(struct page *page, unsigned int order)
 
  if (!is_migrate_isolate(mt)) {
  /* Obey watermarks as if the page was being allocated */
- watermark = low_wmark_pages(zone) + (1 << order);
- if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
+ watermark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
+ /* We know our order page exists, so only check order-0 */
+ watermark += (1UL << order);
+ if (!zone_watermark_ok(zone, 0, watermark, 0, alloc_flags))
  return 0;
 
  __mod_zone_freepage_state(zone, -(1UL << order), mt);
@@ -2541,14 +2544,14 @@ int __isolate_free_page(struct page *page, unsigned int order)
  * Note: this is probably too low level an operation for use in drivers.
  * Please consult with lkml before using this in your driver.
  */
-int split_free_page(struct page *page)
+int split_free_page(struct page *page, unsigned int alloc_flags)
 {
  unsigned int order;
  int nr_pages;
 
  order = page_order(page);
 
- nr_pages = __isolate_free_page(page, order);
+ nr_pages = __isolate_free_page(page, order, alloc_flags);
  if (!nr_pages)
  return 0;
 
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 612122bf6a42..0bcb7a32d84c 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -107,7 +107,7 @@ static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
 
  if (pfn_valid_within(page_to_pfn(buddy)) &&
     !is_migrate_isolate_page(buddy)) {
- __isolate_free_page(page, order);
+ __isolate_free_page(page, order, 0);
  kernel_map_pages(page, (1 << order), 1);
  set_page_refcounted(page);
  isolated_page = page;
--
2.8.2

[RFC 09/13] mm, compaction: make whole_zone flag ignore cached scanner positions

Vlastimil Babka
A recent patch added a whole_zone flag that compaction sets when scanning
starts from the zone boundary, in order to report that the zone has been fully
scanned in one attempt. For allocations that want to try really hard or cannot
fail, we will want to introduce a mode where scanning the whole zone is
guaranteed regardless of the cached positions.

This patch reuses the whole_zone flag in the other direction: if it is already
passed as true to compaction, the cached scanner positions are ignored.
Employing this flag during the reclaim/compaction loop will be done in the
next patch. This patch, however, converts compaction invoked from userspace
via procfs to use the flag. Before this patch, the cached positions were first
reset to the zone boundaries and then read back from struct zone, so there was
a window where a parallel compaction could replace the reset values, making
the manual compaction less effective. Using the flag instead of performing the
reset is more robust.

Signed-off-by: Vlastimil Babka <[hidden email]>
---
 mm/compaction.c | 15 +++++----------
 mm/internal.h   |  2 +-
 2 files changed, 6 insertions(+), 11 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index f649c7bc6de5..1ce6783d3ead 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1442,11 +1442,13 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
  */
  cc->migrate_pfn = zone->compact_cached_migrate_pfn[sync];
  cc->free_pfn = zone->compact_cached_free_pfn;
- if (cc->free_pfn < start_pfn || cc->free_pfn >= end_pfn) {
+ if (cc->whole_zone || cc->free_pfn < start_pfn ||
+ cc->free_pfn >= end_pfn) {
  cc->free_pfn = pageblock_start_pfn(end_pfn - 1);
  zone->compact_cached_free_pfn = cc->free_pfn;
  }
- if (cc->migrate_pfn < start_pfn || cc->migrate_pfn >= end_pfn) {
+ if (cc->whole_zone || cc->migrate_pfn < start_pfn ||
+ cc->migrate_pfn >= end_pfn) {
  cc->migrate_pfn = start_pfn;
  zone->compact_cached_migrate_pfn[0] = cc->migrate_pfn;
  zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn;
@@ -1693,14 +1695,6 @@ static void __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc)
  INIT_LIST_HEAD(&cc->freepages);
  INIT_LIST_HEAD(&cc->migratepages);
 
- /*
- * When called via /proc/sys/vm/compact_memory
- * this makes sure we compact the whole zone regardless of
- * cached scanner positions.
- */
- if (is_via_compact_memory(cc->order))
- __reset_isolation_suitable(zone);
-
  if (is_via_compact_memory(cc->order) ||
  !compaction_deferred(zone, cc->order))
  compact_zone(zone, cc);
@@ -1736,6 +1730,7 @@ static void compact_node(int nid)
  .order = -1,
  .mode = MIGRATE_SYNC,
  .ignore_skip_hint = true,
+ .whole_zone = true,
  };
 
  __compact_pgdat(NODE_DATA(nid), &cc);
diff --git a/mm/internal.h b/mm/internal.h
index 556bc9d0a817..2acdee8ab0e6 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -178,7 +178,7 @@ struct compact_control {
  enum migrate_mode mode; /* Async or sync migration mode */
  bool ignore_skip_hint; /* Scan blocks even if marked skip */
  bool direct_compaction; /* False from kcompactd or /proc/... */
- bool whole_zone; /* Whole zone has been scanned */
+ bool whole_zone; /* Whole zone should/has been scanned */
  int order; /* order a direct compactor needs */
  const gfp_t gfp_mask; /* gfp mask of a direct compactor */
  const unsigned int alloc_flags; /* alloc flags of a direct compactor */
--
2.8.2

[RFC 08/13] mm, compaction: simplify contended compaction handling

Vlastimil Babka
Async compaction detects contention either due to a failing trylock on
zone->lock or lru_lock, or due to need_resched(). Since commit 1f9efdef4f3f
("mm, compaction: khugepaged should not give up due to need_resched()"), the
code has grown quite complicated in order to distinguish these two all the way
up to the __alloc_pages_slowpath() level, so that different decisions could be
taken for khugepaged allocations.

After the recent changes, khugepaged allocations no longer check for contended
compaction, so we again don't need to distinguish lock and sched contention,
and can simplify the currently convoluted code a lot.

However, I believe it's also possible to simplify even further and completely
remove the check for contended compaction after the initial async compaction
for costly orders, which was originally aimed at THP page fault allocations.
There are several reasons why this can be done now:

- with the new defaults, THP page faults no longer do reclaim/compaction at
  all, unless the system admin has overridden the default or the application
  has indicated via madvise that it can benefit from THPs. In both cases, the
  potential extra latency is expected and worth the benefits.
- even if reclaim/compaction proceeds after this patch where it previously
  wouldn't, the second compaction attempt is still async, and will detect the
  contention and back off if the contention persists
- there are still heuristics, such as deferred compaction and the pageblock
  skip bits, in place that prevent excessive THP page fault latencies

Signed-off-by: Vlastimil Babka <[hidden email]>
---
 include/linux/compaction.h | 10 +------
 mm/compaction.c            | 72 +++++++++-------------------------------------
 mm/internal.h              |  5 +---
 mm/page_alloc.c            | 28 +-----------------
 4 files changed, 16 insertions(+), 99 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 900d181ff1b0..cd3a59f1601e 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -51,14 +51,6 @@ enum compact_result {
  COMPACT_PARTIAL,
 };
 
-/* Used to signal whether compaction detected need_sched() or lock contention */
-/* No contention detected */
-#define COMPACT_CONTENDED_NONE 0
-/* Either need_sched() was true or fatal signal pending */
-#define COMPACT_CONTENDED_SCHED 1
-/* Zone lock or lru_lock was contended in async compaction */
-#define COMPACT_CONTENDED_LOCK 2
-
 struct alloc_context; /* in mm/internal.h */
 
 #ifdef CONFIG_COMPACTION
@@ -74,7 +66,7 @@ extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
  unsigned int order,
  unsigned int alloc_flags, const struct alloc_context *ac,
- enum compact_priority prio, int *contended);
+ enum compact_priority prio);
 extern void compact_pgdat(pg_data_t *pgdat, int order);
 extern void reset_isolation_suitable(pg_data_t *pgdat);
 extern enum compact_result compaction_suitable(struct zone *zone, int order,
diff --git a/mm/compaction.c b/mm/compaction.c
index abfd71e1f1a3..f649c7bc6de5 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -279,7 +279,7 @@ static bool compact_trylock_irqsave(spinlock_t *lock, unsigned long *flags,
 {
  if (cc->mode == MIGRATE_ASYNC) {
  if (!spin_trylock_irqsave(lock, *flags)) {
- cc->contended = COMPACT_CONTENDED_LOCK;
+ cc->contended = true;
  return false;
  }
  } else {
@@ -313,13 +313,13 @@ static bool compact_unlock_should_abort(spinlock_t *lock,
  }
 
  if (fatal_signal_pending(current)) {
- cc->contended = COMPACT_CONTENDED_SCHED;
+ cc->contended = true;
  return true;
  }
 
  if (need_resched()) {
  if (cc->mode == MIGRATE_ASYNC) {
- cc->contended = COMPACT_CONTENDED_SCHED;
+ cc->contended = true;
  return true;
  }
  cond_resched();
@@ -342,7 +342,7 @@ static inline bool compact_should_abort(struct compact_control *cc)
  /* async compaction aborts if contended */
  if (need_resched()) {
  if (cc->mode == MIGRATE_ASYNC) {
- cc->contended = COMPACT_CONTENDED_SCHED;
+ cc->contended = true;
  return true;
  }
 
@@ -1564,14 +1564,11 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
  trace_mm_compaction_end(start_pfn, cc->migrate_pfn,
  cc->free_pfn, end_pfn, sync, ret);
 
- if (ret == COMPACT_CONTENDED)
- ret = COMPACT_PARTIAL;
-
  return ret;
 }
 
 static enum compact_result compact_zone_order(struct zone *zone, int order,
- gfp_t gfp_mask, enum compact_priority prio, int *contended,
+ gfp_t gfp_mask, enum compact_priority prio,
  unsigned int alloc_flags, int classzone_idx)
 {
  enum compact_result ret;
@@ -1595,7 +1592,6 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
  VM_BUG_ON(!list_empty(&cc.freepages));
  VM_BUG_ON(!list_empty(&cc.migratepages));
 
- *contended = cc.contended;
  return ret;
 }
 
@@ -1608,23 +1604,18 @@ int sysctl_extfrag_threshold = 500;
  * @alloc_flags: The allocation flags of the current allocation
  * @ac: The context of current allocation
  * @mode: The migration mode for async, sync light, or sync migration
- * @contended: Return value that determines if compaction was aborted due to
- *       need_resched() or lock contention
  *
  * This is the main entry point for direct page compaction.
  */
 enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
  unsigned int alloc_flags, const struct alloc_context *ac,
- enum compact_priority prio, int *contended)
+ enum compact_priority prio)
 {
  int may_enter_fs = gfp_mask & __GFP_FS;
  int may_perform_io = gfp_mask & __GFP_IO;
  struct zoneref *z;
  struct zone *zone;
  enum compact_result rc = COMPACT_SKIPPED;
- int all_zones_contended = COMPACT_CONTENDED_LOCK; /* init for &= op */
-
- *contended = COMPACT_CONTENDED_NONE;
 
  /* Check if the GFP flags allow compaction */
  if (!order || !may_enter_fs || !may_perform_io)
@@ -1637,7 +1628,6 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
  for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
  ac->nodemask) {
  enum compact_result status;
- int zone_contended;
 
  if (compaction_deferred(zone, order)) {
  rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
@@ -1645,14 +1635,8 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
  }
 
  status = compact_zone_order(zone, order, gfp_mask, prio,
- &zone_contended, alloc_flags,
- ac_classzone_idx(ac));
+ alloc_flags, ac_classzone_idx(ac));
  rc = max(status, rc);
- /*
- * It takes at least one zone that wasn't lock contended
- * to clear all_zones_contended.
- */
- all_zones_contended &= zone_contended;
 
  /* If a normal allocation would succeed, stop compacting */
  if (zone_watermark_ok(zone, order, low_wmark_pages(zone),
@@ -1664,59 +1648,29 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
  * succeeds in this zone.
  */
  compaction_defer_reset(zone, order, false);
- /*
- * It is possible that async compaction aborted due to
- * need_resched() and the watermarks were ok thanks to
- * somebody else freeing memory. The allocation can
- * however still fail so we better signal the
- * need_resched() contention anyway (this will not
- * prevent the allocation attempt).
- */
- if (zone_contended == COMPACT_CONTENDED_SCHED)
- *contended = COMPACT_CONTENDED_SCHED;
 
- goto break_loop;
+ break;
  }
 
  if (prio != COMPACT_PRIO_ASYNC && (status == COMPACT_COMPLETE ||
- status == COMPACT_PARTIAL_SKIPPED)) {
+ status == COMPACT_PARTIAL_SKIPPED))
  /*
  * We think that allocation won't succeed in this zone
  * so we defer compaction there. If it ends up
  * succeeding after all, it will be reset.
  */
  defer_compaction(zone, order);
- }
 
  /*
  * We might have stopped compacting due to need_resched() in
  * async compaction, or due to a fatal signal detected. In that
- * case do not try further zones and signal need_resched()
- * contention.
- */
- if ((zone_contended == COMPACT_CONTENDED_SCHED)
- || fatal_signal_pending(current)) {
- *contended = COMPACT_CONTENDED_SCHED;
- goto break_loop;
- }
-
- continue;
-break_loop:
- /*
- * We might not have tried all the zones, so  be conservative
- * and assume they are not all lock contended.
+ * case do not try further zones
  */
- all_zones_contended = 0;
- break;
+ if ((prio == COMPACT_PRIO_ASYNC && need_resched())
+ || fatal_signal_pending(current))
+ break;
  }
 
- /*
- * If at least one zone wasn't deferred or skipped, we report if all
- * zones that were tried were lock contended.
- */
- if (rc > COMPACT_INACTIVE && all_zones_contended)
- *contended = COMPACT_CONTENDED_LOCK;
-
  return rc;
 }
 
diff --git a/mm/internal.h b/mm/internal.h
index b6ead95a0184..556bc9d0a817 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -184,10 +184,7 @@ struct compact_control {
  const unsigned int alloc_flags; /* alloc flags of a direct compactor */
  const int classzone_idx; /* zone index of a direct compactor */
  struct zone *zone;
- int contended; /* Signal need_sched() or lock
- * contention detected during
- * compaction
- */
+ bool contended; /* Signal lock or sched contention */
 };
 
 unsigned long
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 17abc05be972..aa9c39a7f40a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3196,14 +3196,13 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
  enum compact_priority prio, enum compact_result *compact_result)
 {
  struct page *page;
- int contended_compaction;
 
  if (!order)
  return NULL;
 
  current->flags |= PF_MEMALLOC;
  *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
- prio, &contended_compaction);
+ prio);
  current->flags &= ~PF_MEMALLOC;
 
  if (*compact_result <= COMPACT_INACTIVE)
@@ -3233,24 +3232,6 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
  */
  count_vm_event(COMPACTFAIL);
 
- /*
- * In all zones where compaction was attempted (and not
- * deferred or skipped), lock contention has been detected.
- * For THP allocation we do not want to disrupt the others
- * so we fallback to base pages instead.
- */
- if (contended_compaction == COMPACT_CONTENDED_LOCK)
- *compact_result = COMPACT_CONTENDED;
-
- /*
- * If compaction was aborted due to need_resched(), we do not
- * want to further increase allocation latency, unless it is
- * khugepaged trying to collapse.
- */
- if (contended_compaction == COMPACT_CONTENDED_SCHED
- && !(current->flags & PF_KTHREAD))
- *compact_result = COMPACT_CONTENDED;
-
  cond_resched();
 
  return NULL;
@@ -3621,13 +3602,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  goto nopage;
 
  /*
- * Compaction is contended so rather back off than cause
- * excessive stalls.
- */
- if (compact_result == COMPACT_CONTENDED)
- goto nopage;
-
- /*
  * Looks like reclaim/compaction is worth trying, but
  * sync compaction could be very expensive, so keep
  * using async compaction.
--
2.8.2


[RFC 07/13] mm, compaction: introduce direct compaction priority

Vlastimil Babka
In reply to this post by Vlastimil Babka
In the context of direct compaction, for some types of allocations we would
like the compaction to either succeed or definitely fail while trying as hard
as possible. Current async/sync_light migration mode is insufficient, as there
are heuristics such as caching scanner positions, marking pageblocks as
unsuitable or deferring compaction for a zone. At least the final compaction
attempt should be able to override these heuristics.

To communicate how hard compaction should try, we replace migration mode with
a new enum compact_priority and change the relevant function signatures. In
compact_zone_order() where struct compact_control is constructed, the priority
is mapped to suitable control flags. This patch itself has no functional
change, as the current priority levels are mapped back to the same migration
modes as before. Expanding them will be done next.

Note that !CONFIG_COMPACTION variant of try_to_compact_pages() is removed, as
the only caller exists under CONFIG_COMPACTION.
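The priority-to-mode mapping and the "retry at higher effort" step this patch introduces can be sketched in userspace (this is an illustrative stand-in harness, not kernel code; the enum values mirror the patch, the helper names are made up):

```c
#include <assert.h>

/* Sketch of the mapping done in compact_zone_order(): only the
 * lowest-effort priority uses async migration; everything else
 * currently maps to sync-light. */
enum migrate_mode { MIGRATE_ASYNC, MIGRATE_SYNC_LIGHT };

enum compact_priority {
	COMPACT_PRIO_SYNC_LIGHT,
	DEF_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
	COMPACT_PRIO_ASYNC,
	INIT_COMPACT_PRIORITY = COMPACT_PRIO_ASYNC
};

static enum migrate_mode prio_to_mode(enum compact_priority prio)
{
	return (prio == COMPACT_PRIO_ASYNC) ? MIGRATE_ASYNC
					    : MIGRATE_SYNC_LIGHT;
}

/* should_compact_retry() raises effort by decrementing the priority,
 * since lower values mean higher priority (matching reclaim). */
static int raise_priority(enum compact_priority *prio)
{
	if (*prio > 0) {
		(*prio)--;
		return 1;	/* worth retrying at higher effort */
	}
	return 0;		/* already at maximum effort */
}
```

With only two levels so far, one decrement moves from async to sync-light and a second attempt to raise effort fails, matching the previous MIGRATE_ASYNC to MIGRATE_SYNC_LIGHT upgrade.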
---
 include/linux/compaction.h | 18 +++++++++---------
 mm/compaction.c            | 14 ++++++++------
 mm/page_alloc.c            | 27 +++++++++++++--------------
 3 files changed, 30 insertions(+), 29 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 4ba90e74969c..900d181ff1b0 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -1,6 +1,14 @@
 #ifndef _LINUX_COMPACTION_H
 #define _LINUX_COMPACTION_H
 
+// TODO: lower value means higher priority to match reclaim, makes sense?
+enum compact_priority {
+ COMPACT_PRIO_SYNC_LIGHT,
+ DEF_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
+ COMPACT_PRIO_ASYNC,
+ INIT_COMPACT_PRIORITY = COMPACT_PRIO_ASYNC
+};
+
 /* Return values for compact_zone() and try_to_compact_pages() */
 /* When adding new states, please adjust include/trace/events/compaction.h */
 enum compact_result {
@@ -66,7 +74,7 @@ extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
  unsigned int order,
  unsigned int alloc_flags, const struct alloc_context *ac,
- enum migrate_mode mode, int *contended);
+ enum compact_priority prio, int *contended);
 extern void compact_pgdat(pg_data_t *pgdat, int order);
 extern void reset_isolation_suitable(pg_data_t *pgdat);
 extern enum compact_result compaction_suitable(struct zone *zone, int order,
@@ -151,14 +159,6 @@ extern void kcompactd_stop(int nid);
 extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx);
 
 #else
-static inline enum compact_result try_to_compact_pages(gfp_t gfp_mask,
- unsigned int order, unsigned int alloc_flags,
- const struct alloc_context *ac,
- enum migrate_mode mode, int *contended)
-{
- return COMPACT_CONTINUE;
-}
-
 static inline void compact_pgdat(pg_data_t *pgdat, int order)
 {
 }
diff --git a/mm/compaction.c b/mm/compaction.c
index 481004c73c90..abfd71e1f1a3 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1571,7 +1571,7 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
 }
 
 static enum compact_result compact_zone_order(struct zone *zone, int order,
- gfp_t gfp_mask, enum migrate_mode mode, int *contended,
+ gfp_t gfp_mask, enum compact_priority prio, int *contended,
  unsigned int alloc_flags, int classzone_idx)
 {
  enum compact_result ret;
@@ -1581,7 +1581,8 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
  .order = order,
  .gfp_mask = gfp_mask,
  .zone = zone,
- .mode = mode,
+ .mode = (prio == COMPACT_PRIO_ASYNC) ?
+ MIGRATE_ASYNC : MIGRATE_SYNC_LIGHT,
  .alloc_flags = alloc_flags,
  .classzone_idx = classzone_idx,
  .direct_compaction = true,
@@ -1614,7 +1615,7 @@ int sysctl_extfrag_threshold = 500;
  */
 enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
  unsigned int alloc_flags, const struct alloc_context *ac,
- enum migrate_mode mode, int *contended)
+ enum compact_priority prio, int *contended)
 {
  int may_enter_fs = gfp_mask & __GFP_FS;
  int may_perform_io = gfp_mask & __GFP_IO;
@@ -1629,7 +1630,8 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
  if (!order || !may_enter_fs || !may_perform_io)
  return COMPACT_SKIPPED;
 
- trace_mm_compaction_try_to_compact_pages(order, gfp_mask, mode);
+ //XXX: FIXME
+ //trace_mm_compaction_try_to_compact_pages(order, gfp_mask, mode);
 
  /* Compact each zone in the list */
  for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
@@ -1642,7 +1644,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
  continue;
  }
 
- status = compact_zone_order(zone, order, gfp_mask, mode,
+ status = compact_zone_order(zone, order, gfp_mask, prio,
  &zone_contended, alloc_flags,
  ac_classzone_idx(ac));
  rc = max(status, rc);
@@ -1676,7 +1678,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
  goto break_loop;
  }
 
- if (mode != MIGRATE_ASYNC && (status == COMPACT_COMPLETE ||
+ if (prio != COMPACT_PRIO_ASYNC && (status == COMPACT_COMPLETE ||
  status == COMPACT_PARTIAL_SKIPPED)) {
  /*
  * We think that allocation won't succeed in this zone
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1a5ff4525a0e..17abc05be972 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3193,7 +3193,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 static struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
  unsigned int alloc_flags, const struct alloc_context *ac,
- enum migrate_mode mode, enum compact_result *compact_result)
+ enum compact_priority prio, enum compact_result *compact_result)
 {
  struct page *page;
  int contended_compaction;
@@ -3203,7 +3203,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 
  current->flags |= PF_MEMALLOC;
  *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
- mode, &contended_compaction);
+ prio, &contended_compaction);
  current->flags &= ~PF_MEMALLOC;
 
  if (*compact_result <= COMPACT_INACTIVE)
@@ -3258,7 +3258,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 
 static inline bool
 should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
-     enum compact_result compact_result, enum migrate_mode *migrate_mode,
+     enum compact_result compact_result, enum compact_priority *compact_priority,
      int compaction_retries)
 {
  int max_retries = MAX_COMPACT_RETRIES;
@@ -3269,11 +3269,11 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
  /*
  * compaction considers all the zone as desperately out of memory
  * so it doesn't really make much sense to retry except when the
- * failure could be caused by weak migration mode.
+ * failure could be caused by insufficient priority
  */
  if (compaction_failed(compact_result)) {
- if (*migrate_mode == MIGRATE_ASYNC) {
- *migrate_mode = MIGRATE_SYNC_LIGHT;
+ if (*compact_priority > 0) {
+ (*compact_priority)--;
  return true;
  }
  return false;
@@ -3307,7 +3307,7 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
 static inline struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
  unsigned int alloc_flags, const struct alloc_context *ac,
- enum migrate_mode mode, enum compact_result *compact_result)
+ enum compact_priority prio, enum compact_result *compact_result)
 {
  return NULL;
 }
@@ -3315,7 +3315,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 static inline bool
 should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_flags,
      enum compact_result compact_result,
-     enum migrate_mode *migrate_mode,
+     enum compact_priority *compact_priority,
      int compaction_retries)
 {
  return false;
@@ -3549,7 +3549,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  struct page *page = NULL;
  unsigned int alloc_flags;
  unsigned long did_some_progress;
- enum migrate_mode migration_mode = MIGRATE_SYNC_LIGHT;
+ enum compact_priority compact_priority = DEF_COMPACT_PRIORITY;
  enum compact_result compact_result;
  int compaction_retries = 0;
  int no_progress_loops = 0;
@@ -3599,7 +3599,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  if (can_direct_reclaim && order > PAGE_ALLOC_COSTLY_ORDER) {
  page = __alloc_pages_direct_compact(gfp_mask, order,
  alloc_flags, ac,
- MIGRATE_ASYNC,
+ INIT_COMPACT_PRIORITY,
  &compact_result);
  if (page)
  goto got_pg;
@@ -3632,7 +3632,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  * sync compaction could be very expensive, so keep
  * using async compaction.
  */
- migration_mode = MIGRATE_ASYNC;
+ compact_priority = INIT_COMPACT_PRIORITY;
  }
  }
 
@@ -3693,8 +3693,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 
  /* Try direct compaction and then allocating */
  page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
- migration_mode,
- &compact_result);
+ compact_priority, &compact_result);
  if (page)
  goto got_pg;
 
@@ -3734,7 +3733,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  */
  if (did_some_progress > 0 &&
  should_compact_retry(ac, order, alloc_flags,
- compact_result, &migration_mode,
+ compact_result, &compact_priority,
  compaction_retries))
  goto retry;
 
--
2.8.2


[RFC 10/13] mm, compaction: cleanup unused functions

Vlastimil Babka
In reply to this post by Vlastimil Babka
Since kswapd compaction moved to kcompactd, compact_pgdat() is not called
anymore, so we remove it. The only caller of __compact_pgdat() is
compact_node(), so we merge them and remove code that was only reachable from
kswapd.

Signed-off-by: Vlastimil Babka <[hidden email]>
---
 include/linux/compaction.h |  5 ----
 mm/compaction.c            | 60 +++++++++++++---------------------------------
 2 files changed, 17 insertions(+), 48 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index cd3a59f1601e..eeaed24e87a8 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -67,7 +67,6 @@ extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
  unsigned int order,
  unsigned int alloc_flags, const struct alloc_context *ac,
  enum compact_priority prio);
-extern void compact_pgdat(pg_data_t *pgdat, int order);
 extern void reset_isolation_suitable(pg_data_t *pgdat);
 extern enum compact_result compaction_suitable(struct zone *zone, int order,
  unsigned int alloc_flags, int classzone_idx);
@@ -151,10 +150,6 @@ extern void kcompactd_stop(int nid);
 extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx);
 
 #else
-static inline void compact_pgdat(pg_data_t *pgdat, int order)
-{
-}
-
 static inline void reset_isolation_suitable(pg_data_t *pgdat)
 {
 }
diff --git a/mm/compaction.c b/mm/compaction.c
index 1ce6783d3ead..7d0935e1a195 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1678,10 +1678,18 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 
 
 /* Compact all zones within a node */
-static void __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc)
+static void compact_node(int nid)
 {
+ pg_data_t *pgdat = NODE_DATA(nid);
  int zoneid;
  struct zone *zone;
+ struct compact_control cc = {
+ .order = -1,
+ .mode = MIGRATE_SYNC,
+ .ignore_skip_hint = true,
+ .whole_zone = true,
+ };
+
 
  for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
 
@@ -1689,53 +1697,19 @@ static void __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc)
  if (!populated_zone(zone))
  continue;
 
- cc->nr_freepages = 0;
- cc->nr_migratepages = 0;
- cc->zone = zone;
- INIT_LIST_HEAD(&cc->freepages);
- INIT_LIST_HEAD(&cc->migratepages);
-
- if (is_via_compact_memory(cc->order) ||
- !compaction_deferred(zone, cc->order))
- compact_zone(zone, cc);
-
- VM_BUG_ON(!list_empty(&cc->freepages));
- VM_BUG_ON(!list_empty(&cc->migratepages));
+ cc.nr_freepages = 0;
+ cc.nr_migratepages = 0;
+ cc.zone = zone;
+ INIT_LIST_HEAD(&cc.freepages);
+ INIT_LIST_HEAD(&cc.migratepages);
 
- if (is_via_compact_memory(cc->order))
- continue;
+ compact_zone(zone, &cc);
 
- if (zone_watermark_ok(zone, cc->order,
- low_wmark_pages(zone), 0, 0))
- compaction_defer_reset(zone, cc->order, false);
+ VM_BUG_ON(!list_empty(&cc.freepages));
+ VM_BUG_ON(!list_empty(&cc.migratepages));
  }
 }
 
-void compact_pgdat(pg_data_t *pgdat, int order)
-{
- struct compact_control cc = {
- .order = order,
- .mode = MIGRATE_ASYNC,
- };
-
- if (!order)
- return;
-
- __compact_pgdat(pgdat, &cc);
-}
-
-static void compact_node(int nid)
-{
- struct compact_control cc = {
- .order = -1,
- .mode = MIGRATE_SYNC,
- .ignore_skip_hint = true,
- .whole_zone = true,
- };
-
- __compact_pgdat(NODE_DATA(nid), &cc);
-}
-
 /* Compact all nodes in the system */
 static void compact_nodes(void)
 {
--
2.8.2


[RFC 05/13] mm, page_alloc: make THP-specific decisions more generic

Vlastimil Babka
In reply to this post by Vlastimil Babka
Since THP allocations during page faults can be costly, extra decisions are
employed for them to avoid excessive reclaim and compaction, if the initial
compaction doesn't look promising. The detection has never been perfect as
there is no gfp flag specific to THP allocations. At this moment it checks the
whole combination of flags that makes up GFP_TRANSHUGE, and hopes that no other
users of such a combination exist, or that none would mind being treated the same way.
Extra care is also taken to separate allocations from khugepaged, where latency
doesn't matter that much.

It is however possible to distinguish these allocations in a simpler and more
reliable way. The key observation is that after the initial compaction followed
by the first iteration of "standard" reclaim/compaction, both __GFP_NORETRY
allocations and costly allocations without __GFP_REPEAT are declared as
failures:

        /* Do not loop if specifically requested */
        if (gfp_mask & __GFP_NORETRY)
                goto nopage;

        /*
         * Do not retry costly high order allocations unless they are
         * __GFP_REPEAT
         */
        if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
                goto nopage;

This means we can further distinguish allocations that are costly order *and*
additionally include the __GFP_NORETRY flag. As it happens, GFP_TRANSHUGE
allocations do already fall into this category. This will also allow other
costly allocations with similar high-order benefit vs latency considerations to
use this semantic. Furthermore, we can distinguish THP allocations that should
try a bit harder (such as from khugepaged) by removing __GFP_NORETRY, as will
be done in the next patch.
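The classification this patch relies on can be condensed into a small userspace sketch (the flag values below are made-up stand-ins for illustration; the real __GFP_* bits live in include/linux/gfp.h):

```c
#include <assert.h>

/* Illustrative stand-in flag values, NOT the real GFP bit values. */
#define SK_GFP_NORETRY		0x1u
#define SK_GFP_REPEAT		0x2u
#define PAGE_ALLOC_COSTLY_ORDER	3

/* After the initial compaction plus one reclaim/compaction pass, the
 * slowpath gives up on these requests; so "costly order plus
 * __GFP_NORETRY" identifies lightweight opportunistic allocations
 * such as THP page faults, with no THP-specific check needed. */
static int gives_up_early(unsigned int gfp_mask, unsigned int order)
{
	if (gfp_mask & SK_GFP_NORETRY)
		return 1;
	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & SK_GFP_REPEAT))
		return 1;
	return 0;
}
```

A costly-order request that sets neither __GFP_NORETRY nor __GFP_REPEAT also gives up early, which is why the slowpath only needs the __GFP_NORETRY bit to single out the requests that asked for lightweight treatment explicitly.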

Signed-off-by: Vlastimil Babka <[hidden email]>
---
 mm/page_alloc.c | 22 +++++++++-------------
 1 file changed, 9 insertions(+), 13 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 88d680b3e7b6..f5d931e0854a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3182,7 +3182,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
  return page;
 }
 
-
 /*
  * Maximum number of compaction retries wit a progress before OOM
  * killer is consider as the only way to move forward.
@@ -3447,11 +3446,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
  return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS);
 }
 
-static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
-{
- return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == GFP_TRANSHUGE;
-}
-
 /*
  * Maximum number of reclaim retries without any progress before OOM killer
  * is consider as the only way to move forward.
@@ -3610,8 +3604,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  if (page)
  goto got_pg;
 
- /* Checks for THP-specific high-order allocations */
- if (is_thp_gfp_mask(gfp_mask)) {
+ /*
+ * Checks for costly allocations with __GFP_NORETRY, which
+ * includes THP page fault allocations
+ */
+ if (gfp_mask & __GFP_NORETRY) {
  /*
  * If compaction is deferred for high-order allocations,
  * it is because sync compaction recently failed. If
@@ -3631,11 +3628,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  goto nopage;
 
  /*
- * It can become very expensive to allocate transparent
- * hugepages at fault, so use asynchronous memory
- * compaction for THP unless it is khugepaged trying to
- * collapse. All other requests should tolerate at
- * least light sync migration.
+ * Looks like reclaim/compaction is worth trying, but
+ * sync compaction could be very expensive, so keep
+ * using async compaction, unless it's khugepaged
+ * trying to collapse.
  */
  if (!(current->flags & PF_KTHREAD))
  migration_mode = MIGRATE_ASYNC;
--
2.8.2


[RFC 06/13] mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations

Vlastimil Babka
In reply to this post by Vlastimil Babka
After the previous patch, we can distinguish costly allocations that should be
really lightweight, such as THP page faults, with __GFP_NORETRY. This means we
don't need to recognize khugepaged allocations via PF_KTHREAD anymore. We can
also change THP page faults in areas where madvise(MADV_HUGEPAGE) was used to
try as hard as khugepaged, as the process has indicated that it benefits from
THP's and is willing to pay some initial latency costs.

This is implemented by removing __GFP_NORETRY from GFP_TRANSHUGE and applying
it selectively for current GFP_TRANSHUGE users:

* get_huge_zero_page() - the zero page lifetime should be relatively long and
  it's shared by multiple users, so it's worth spending some effort on it.
  __GFP_NORETRY is not added

* alloc_hugepage_khugepaged_gfpmask() - this is khugepaged, so latency is not
  an issue. So if khugepaged "defrag" is enabled (the default), do reclaim
  without __GFP_NORETRY. We can remove the PF_KTHREAD check from page alloc.
  As a side-effect, khugepaged will now no longer check if the initial
  compaction was deferred or contended. This is OK, as khugepaged sleep times
  between collapse attempts are long enough to prevent noticeable disruption,
  so we should allow it to spend some effort.

* migrate_misplaced_transhuge_page() - already does ~__GFP_RECLAIM, so
  removing __GFP_NORETRY has no effect here

* alloc_hugepage_direct_gfpmask() - vma's with VM_HUGEPAGE (via madvise) are
  now allocating without __GFP_NORETRY. Other vma's keep using __GFP_NORETRY
  if direct reclaim/compaction is at all allowed (by default it's allowed only
  for VM_HUGEPAGE vma's)
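The resulting decision table for page-fault THP allocations can be sketched in userspace as follows (a simplified model of the patched alloc_hugepage_direct_gfpmask(); flag values and names are illustrative stand-ins, and the kernel's actual defrag flags are sysfs-controlled bits):

```c
#include <assert.h>

/* Illustrative stand-in flags; real values are in include/linux/gfp.h. */
#define SK_GFP_DIRECT_RECLAIM	0x1u
#define SK_GFP_KSWAPD_RECLAIM	0x2u
#define SK_GFP_NORETRY		0x4u
#define SK_GFP_TRANSHUGE	0x100u	/* base THP mask, now without NORETRY */

enum thp_defrag { DEFRAG_ALWAYS, DEFRAG_KSWAPD, DEFRAG_MADVISE, DEFRAG_NEVER };

/* Direct reclaim keeps __GFP_NORETRY unless the vma was flagged via
 * madvise(MADV_HUGEPAGE); madvised vma's try harder. */
static unsigned int hugepage_gfpmask(enum thp_defrag defrag, int vm_hugepage)
{
	unsigned int reclaim_flags = 0;

	if (defrag == DEFRAG_ALWAYS ||
	    (defrag == DEFRAG_MADVISE && vm_hugepage))
		reclaim_flags = SK_GFP_DIRECT_RECLAIM |
				(vm_hugepage ? 0 : SK_GFP_NORETRY);
	else if (defrag == DEFRAG_KSWAPD)
		reclaim_flags = SK_GFP_KSWAPD_RECLAIM;

	return SK_GFP_TRANSHUGE | reclaim_flags;
}
```

Under this model, "defrag=always" on an unflagged vma direct-reclaims with __GFP_NORETRY, while a VM_HUGEPAGE vma direct-reclaims without it, i.e. as hard as khugepaged.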

Signed-off-by: Vlastimil Babka <[hidden email]>
---
 include/linux/gfp.h | 3 +--
 mm/huge_memory.c    | 8 +++++---
 mm/page_alloc.c     | 6 ++----
 3 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 570383a41853..0cb09714d960 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -256,8 +256,7 @@ struct vm_area_struct;
 #define GFP_HIGHUSER (GFP_USER | __GFP_HIGHMEM)
 #define GFP_HIGHUSER_MOVABLE (GFP_HIGHUSER | __GFP_MOVABLE)
 #define GFP_TRANSHUGE ((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
- __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
- ~__GFP_RECLAIM)
+ __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
 
 /* Convert GFP flags to their corresponding migrate type */
 #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a69e1e144050..30a254a5e780 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -882,9 +882,10 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 }
 
 /*
- * If THP is set to always then directly reclaim/compact as necessary
- * If set to defer then do no reclaim and defer to khugepaged
+ * If THP defrag is set to always then directly reclaim/compact as necessary
+ * If set to defer then do only background reclaim/compact and defer to khugepaged
  * If set to madvise and the VMA is flagged then directly reclaim/compact
+ * When direct reclaim/compact is allowed, try a bit harder for flagged VMA's
  */
 static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 {
@@ -896,7 +897,8 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
  else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
  reclaim_flags = __GFP_KSWAPD_RECLAIM;
  else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
- reclaim_flags = __GFP_DIRECT_RECLAIM;
+ reclaim_flags = __GFP_DIRECT_RECLAIM |
+ ((vma->vm_flags & VM_HUGEPAGE) ? 0 : __GFP_NORETRY);
 
  return GFP_TRANSHUGE | reclaim_flags;
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f5d931e0854a..1a5ff4525a0e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3630,11 +3630,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  /*
  * Looks like reclaim/compaction is worth trying, but
  * sync compaction could be very expensive, so keep
- * using async compaction, unless it's khugepaged
- * trying to collapse.
+ * using async compaction.
  */
- if (!(current->flags & PF_KTHREAD))
- migration_mode = MIGRATE_ASYNC;
+ migration_mode = MIGRATE_ASYNC;
  }
  }
 
--
2.8.2


[RFC 04/13] mm, page_alloc: restructure direct compaction handling in slowpath

Vlastimil Babka
In reply to this post by Vlastimil Babka
The retry loop in __alloc_pages_slowpath is supposed to keep trying reclaim
and compaction (and OOM), until either the allocation succeeds, or returns
with failure. Success here is more probable when reclaim precedes compaction,
as certain watermarks have to be met for compaction to even try, and more free
pages increase the probability of compaction success. On the other hand,
starting with light async compaction (if the watermarks allow it), can be
more efficient, especially for smaller orders, if there's enough free memory
which is just fragmented.

Thus, the current code starts with compaction before reclaim, and to make sure
that the last reclaim is always followed by a final compaction, there's another
direct compaction call at the end of the loop. This makes the code hard to
follow and adds some duplicated handling of migration_mode decisions. It's also
somewhat inefficient that even if reclaim or compaction decides not to retry,
the final compaction is still attempted. Some gfp flags combination also
shortcut these retry decisions by "goto noretry;", making it even harder to
follow.

This patch attempts to restructure the code with only minimal functional
changes. The call to the first compaction and THP-specific checks are now
placed above the retry loop, and the "noretry" direct compaction is removed.

The initial compaction is additionally restricted only to costly orders, as we
can expect smaller orders to be held back by watermarks, and only larger orders
to suffer primarily from fragmentation. This better matches the checks in
reclaim's shrink_zones().

There are two other smaller functional changes. One is that the upgrade from
async migration to light sync migration will always occur after the initial
compaction. This is how it has been until recent patch "mm, oom: protect
!costly allocations some more", which introduced upgrading the mode based on
COMPACT_COMPLETE result, but kept the final compaction always upgraded, which
made it even more special. It's better to return to the simpler handling for
now, as migration modes will be further modified later in the series.

The second change is that once both reclaim and compaction declare it's not
worth to retry the reclaim/compact loop, there is no final compaction attempt.
As argued above, this is intentional. If that final compaction were to succeed,
it would be due to a wrong retry decision, or simply a race with somebody else
freeing memory for us.

The main outcome of this patch should be simpler code. Logically, the initial
compaction without reclaim is the exceptional case to the reclaim/compaction
scheme, but prior to the patch, it was the last loop iteration that was
exceptional. Now the code matches the logic better. The change also enables the
following patches.
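The restructured control flow can be sketched as a userspace stand-in that merely records the order of attempts (heavily simplified: no watermark checks, no THP-specific bailouts, and the stubs always fail so the whole path is exercised):

```c
#include <assert.h>
#include <string.h>

#define PAGE_ALLOC_COSTLY_ORDER 3

static char trace[128];		/* records the order of attempts */

static void log_step(const char *s) { strcat(trace, s); }

/* Stubs standing in for __alloc_pages_direct_compact() and
 * __alloc_pages_direct_reclaim(); they always fail. */
static int try_compact(const char *mode) { log_step(mode); return 0; }
static int try_reclaim(void) { log_step("R,"); return 0; }

/* Sketch of the patched __alloc_pages_slowpath(): one async compaction
 * pass up front for costly orders only, then a retry loop where reclaim
 * always precedes (by default sync-light) compaction, with no special
 * final-iteration compaction. */
static void slowpath(unsigned int order, int retries)
{
	trace[0] = '\0';
	if (order > PAGE_ALLOC_COSTLY_ORDER)
		try_compact("Casync,");	/* initial opportunistic pass */
	while (retries--) {
		try_reclaim();
		try_compact("Csync,");
	}
}
```

For a costly order the trace is "Casync," followed by repeated "R,Csync," pairs, while a small order skips the initial pass entirely, matching the description above.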

Signed-off-by: Vlastimil Babka <[hidden email]>
---
 mm/page_alloc.c | 107 +++++++++++++++++++++++++++++---------------------------
 1 file changed, 55 insertions(+), 52 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7249949d65ca..88d680b3e7b6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3555,7 +3555,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  struct page *page = NULL;
  unsigned int alloc_flags;
  unsigned long did_some_progress;
- enum migrate_mode migration_mode = MIGRATE_ASYNC;
+ enum migrate_mode migration_mode = MIGRATE_SYNC_LIGHT;
  enum compact_result compact_result;
  int compaction_retries = 0;
  int no_progress_loops = 0;
@@ -3598,6 +3598,50 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  if (page)
  goto got_pg;
 
+ /*
+ * For costly allocations, try direct compaction first, as it's likely
+ * that we have enough base pages and don't need to reclaim.
+ */
+ if (can_direct_reclaim && order > PAGE_ALLOC_COSTLY_ORDER) {
+ page = __alloc_pages_direct_compact(gfp_mask, order,
+ alloc_flags, ac,
+ MIGRATE_ASYNC,
+ &compact_result);
+ if (page)
+ goto got_pg;
+
+ /* Checks for THP-specific high-order allocations */
+ if (is_thp_gfp_mask(gfp_mask)) {
+ /*
+ * If compaction is deferred for high-order allocations,
+ * it is because sync compaction recently failed. If
+ * this is the case and the caller requested a THP
+ * allocation, we do not want to heavily disrupt the
+ * system, so we fail the allocation instead of entering
+ * direct reclaim.
+ */
+ if (compact_result == COMPACT_DEFERRED)
+ goto nopage;
+
+ /*
+ * Compaction is contended so rather back off than cause
+ * excessive stalls.
+ */
+ if (compact_result == COMPACT_CONTENDED)
+ goto nopage;
+
+ /*
+ * It can become very expensive to allocate transparent
+ * hugepages at fault, so use asynchronous memory
+ * compaction for THP unless it is khugepaged trying to
+ * collapse. All other requests should tolerate at
+ * least light sync migration.
+ */
+ if (!(current->flags & PF_KTHREAD))
+ migration_mode = MIGRATE_ASYNC;
+ }
+ }
+
 retry:
  /* Ensure kswapd doesn't accidentaly go to sleep as long as we loop */
  if (gfp_mask & __GFP_KSWAPD_RECLAIM)
@@ -3646,55 +3690,33 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
  goto nopage;
 
- /*
- * Try direct compaction. The first pass is asynchronous. Subsequent
- * attempts after direct reclaim are synchronous
- */
+
+ /* Try direct reclaim and then allocating */
+ page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
+ &did_some_progress);
+ if (page)
+ goto got_pg;
+
+ /* Try direct compaction and then allocating */
  page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
  migration_mode,
  &compact_result);
  if (page)
  goto got_pg;
 
- /* Checks for THP-specific high-order allocations */
- if (is_thp_gfp_mask(gfp_mask)) {
- /*
- * If compaction is deferred for high-order allocations, it is
- * because sync compaction recently failed. If this is the case
- * and the caller requested a THP allocation, we do not want
- * to heavily disrupt the system, so we fail the allocation
- * instead of entering direct reclaim.
- */
- if (compact_result == COMPACT_DEFERRED)
- goto nopage;
-
- /*
- * Compaction is contended so rather back off than cause
- * excessive stalls.
- */
- if(compact_result == COMPACT_CONTENDED)
- goto nopage;
- }
-
  if (order && compaction_made_progress(compact_result))
  compaction_retries++;
 
- /* Try direct reclaim and then allocating */
- page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
- &did_some_progress);
- if (page)
- goto got_pg;
-
  /* Do not loop if specifically requested */
  if (gfp_mask & __GFP_NORETRY)
- goto noretry;
+ goto nopage;
 
  /*
  * Do not retry costly high order allocations unless they are
  * __GFP_REPEAT
  */
  if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
- goto noretry;
+ goto nopage;
 
  /*
  * Costly allocations might have made a progress but this doesn't mean
@@ -3733,25 +3755,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  goto retry;
  }
 
-noretry:
- /*
- * High-order allocations do not necessarily loop after direct reclaim
- * and reclaim/compaction depends on compaction being called after
- * reclaim so call directly if necessary.
- * It can become very expensive to allocate transparent hugepages at
- * fault, so use asynchronous memory compaction for THP unless it is
- * khugepaged trying to collapse. All other requests should tolerate
- * at least light sync migration.
- */
- if (is_thp_gfp_mask(gfp_mask) && !(current->flags & PF_KTHREAD))
- migration_mode = MIGRATE_ASYNC;
- else
- migration_mode = MIGRATE_SYNC_LIGHT;
- page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
-    ac, migration_mode,
-    &compact_result);
- if (page)
- goto got_pg;
 nopage:
  warn_alloc_failed(gfp_mask, order, NULL);
 got_pg:
--
2.8.2


[RFC 03/13] mm, page_alloc: don't retry initial attempt in slowpath

Vlastimil Babka
In reply to this post by Vlastimil Babka
After __alloc_pages_slowpath() sets up new alloc_flags and wakes up kswapd, it
first tries get_page_from_freelist() with the new alloc_flags, as it may
succeed e.g. due to using min watermark instead of low watermark. This attempt
does not have to be retried on each loop, since direct reclaim, direct
compaction and oom call get_page_from_freelist() themselves.

This patch therefore moves the initial attempt above the retry label. The
ALLOC_NO_WATERMARKS attempt is kept under retry label as it's special and
should be retried after each loop. Kswapd wakeups are also done on each retry
to be safe from potential races resulting in kswapd going to sleep while a
process (that may not be able to reclaim by itself) is still looping.

Signed-off-by: Vlastimil Babka <[hidden email]>
---
 mm/page_alloc.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 91fbf6f95403..7249949d65ca 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3586,16 +3586,23 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  */
  alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
-retry:
  if (gfp_mask & __GFP_KSWAPD_RECLAIM)
  wake_all_kswapds(order, ac);
 
- /* This is the last chance, in general, before the goto nopage. */
+ /*
+ * The adjusted alloc_flags might result in immediate success, so try
+ * that first
+ */
  page = get_page_from_freelist(gfp_mask, order,
  alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
  if (page)
  goto got_pg;
 
+retry:
+ /* Ensure kswapd doesn't accidentaly go to sleep as long as we loop */
+ if (gfp_mask & __GFP_KSWAPD_RECLAIM)
+ wake_all_kswapds(order, ac);
+
  /* Allocate without watermarks if the context allows */
  if (alloc_flags & ALLOC_NO_WATERMARKS) {
  /*
--
2.8.2


Re: [RFC 02/13] mm, page_alloc: set alloc_flags only once in slowpath

Tetsuo Handa
In reply to this post by Vlastimil Babka
Vlastimil Babka wrote:
> In __alloc_pages_slowpath(), alloc_flags doesn't change after it's initialized,
> so move the initialization above the retry: label. Also make the comment above
> the initialization more descriptive.

Not true. gfp_to_alloc_flags() will include ALLOC_NO_WATERMARKS if current
thread got TIF_MEMDIE after gfp_to_alloc_flags() was called for the first
time. Do you want to make TIF_MEMDIE threads fail their allocations without
using memory reserves?

Re: [RFC 02/13] mm, page_alloc: set alloc_flags only once in slowpath

Vlastimil Babka
On 05/10/2016 01:28 PM, Tetsuo Handa wrote:
> Vlastimil Babka wrote:
>> In __alloc_pages_slowpath(), alloc_flags doesn't change after it's initialized,
>> so move the initialization above the retry: label. Also make the comment above
>> the initialization more descriptive.
>
> Not true. gfp_to_alloc_flags() will include ALLOC_NO_WATERMARKS if current
> thread got TIF_MEMDIE after gfp_to_alloc_flags() was called for the first

Oh, right. Stupid global state.

> time. Do you want to make TIF_MEMDIE threads fail their allocations without
> using memory reserves?

No, thanks for catching this. How about the following version? I think
that's even nicer cleanup, if correct. Note it causes a conflict in
patch 03/13 but it's simple to resolve.

Thanks

----8<----
From 68f09f1d4381c7451238b4575557580380d8bf30 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <[hidden email]>
Date: Fri, 29 Apr 2016 11:51:17 +0200
Subject: [RFC 02/13] mm, page_alloc: set alloc_flags only once in slowpath

In __alloc_pages_slowpath(), alloc_flags doesn't change after it's initialized,
so move the initialization above the retry: label. Also make the comment above
the initialization more descriptive.

The only exception in the alloc_flags being constant is ALLOC_NO_WATERMARKS,
which may change due to TIF_MEMDIE being set on the allocating thread. We can
fix this, and make the code simpler and a bit more effective at the same time,
by moving the part that determines ALLOC_NO_WATERMARKS from
gfp_to_alloc_flags() to gfp_pfmemalloc_allowed(). This means we don't have to
mask out ALLOC_NO_WATERMARKS in several places in __alloc_pages_slowpath()
anymore.  The only test for the flag can instead call gfp_pfmemalloc_allowed().

Signed-off-by: Vlastimil Babka <[hidden email]>
---
 mm/page_alloc.c | 49 ++++++++++++++++++++++++-------------------------
 1 file changed, 24 insertions(+), 25 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a50184ec6ca0..1b58facf4b5e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3216,8 +3216,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
  */
  count_vm_event(COMPACTSTALL);
 
- page = get_page_from_freelist(gfp_mask, order,
- alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
+ page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
 
  if (page) {
  struct zone *zone = page_zone(page);
@@ -3366,8 +3365,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
  return NULL;
 
 retry:
- page = get_page_from_freelist(gfp_mask, order,
- alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
+ page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
 
  /*
  * If an allocation failed after direct reclaim, it could be because
@@ -3425,16 +3423,6 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
  } else if (unlikely(rt_task(current)) && !in_interrupt())
  alloc_flags |= ALLOC_HARDER;
 
- if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
- if (gfp_mask & __GFP_MEMALLOC)
- alloc_flags |= ALLOC_NO_WATERMARKS;
- else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
- alloc_flags |= ALLOC_NO_WATERMARKS;
- else if (!in_interrupt() &&
- ((current->flags & PF_MEMALLOC) ||
- unlikely(test_thread_flag(TIF_MEMDIE))))
- alloc_flags |= ALLOC_NO_WATERMARKS;
- }
 #ifdef CONFIG_CMA
  if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
  alloc_flags |= ALLOC_CMA;
@@ -3444,7 +3432,19 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 
 bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 {
- return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS);
+ if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
+ return false;
+
+ if (gfp_mask & __GFP_MEMALLOC)
+ return true;
+ if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
+ return true;
+ if (!in_interrupt() &&
+ ((current->flags & PF_MEMALLOC) ||
+ unlikely(test_thread_flag(TIF_MEMDIE))))
+ return true;
+
+ return false;
 }
 
 static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
@@ -3579,25 +3579,24 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
  gfp_mask &= ~__GFP_ATOMIC;
 
-retry:
- if (gfp_mask & __GFP_KSWAPD_RECLAIM)
- wake_all_kswapds(order, ac);
-
  /*
- * OK, we're below the kswapd watermark and have kicked background
- * reclaim. Now things get more complex, so set up alloc_flags according
- * to how we want to proceed.
+ * The fast path uses conservative alloc_flags to succeed only until
+ * kswapd needs to be woken up, and to avoid the cost of setting up
+ * alloc_flags precisely. So we do that now.
  */
  alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
+retry:
+ if (gfp_mask & __GFP_KSWAPD_RECLAIM)
+ wake_all_kswapds(order, ac);
+
  /* This is the last chance, in general, before the goto nopage. */
- page = get_page_from_freelist(gfp_mask, order,
- alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
+ page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
  if (page)
  goto got_pg;
 
  /* Allocate without watermarks if the context allows */
- if (alloc_flags & ALLOC_NO_WATERMARKS) {
+ if (gfp_pfmemalloc_allowed(gfp_mask)) {
  /*
  * Ignore mempolicies if ALLOC_NO_WATERMARKS on the grounds
  * the allocation is high priority and these type of
--
2.8.2



Re: [RFC 12/13] mm, compaction: more reliably increase direct compaction priority

Vlastimil Babka
In reply to this post by Vlastimil Babka
On 05/10/2016 09:36 AM, Vlastimil Babka wrote:

>   /*
> - * compaction considers all the zone as desperately out of memory
> - * so it doesn't really make much sense to retry except when the
> - * failure could be caused by insufficient priority
> + * Compaction backed off due to watermark checks for order-0
> + * so the regular reclaim has to try harder and reclaim something
> + * Retry only if it looks like reclaim might have a chance.
>   */
> - if (compaction_failed(compact_result)) {
> - if (*compact_priority > 0) {
> - (*compact_priority)--;
> - return true;
> - }
> - return false;
> - }

Oops, looks like my editing resulted in the compaction_failed() check
being removed completely, which wasn't intentional and can lead to
infinite loops. This should be added on top.

----8<----
From 59a2b38689aa451f661c964dc9bfb990736ad92d Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <[hidden email]>
Date: Tue, 10 May 2016 14:51:03 +0200
Subject: [PATCH 15/15] fixup! mm, compaction: more reliably increase direct
 compaction priority

---
 mm/page_alloc.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fa49eb4a5919..e8a0d33cfb67 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3268,6 +3268,14 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
  }
 
  /*
+ * Compaction considers all the zones as unfixably fragmented and we
+ * are on the highest priority, which means it can't be due to
+ * heuristics and it doesn't really make much sense to retry.
+ */
+ if (compaction_failed(compact_result))
+ return false;
+
+ /*
  * The remaining possibility is that compaction made progress and
  * created a high-order page, but it was allocated by somebody else.
  * To prevent thrashing, limit the number of retries in such case.
--
2.8.2



Re: [RFC 01/13] mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode

Michal Hocko-4
In reply to this post by Vlastimil Babka
On Tue 10-05-16 09:35:51, Vlastimil Babka wrote:
> From: Hugh Dickins <[hidden email]>
>
> At present MIGRATE_SYNC_LIGHT is allowing __isolate_lru_page() to
> isolate a PageWriteback page, which __unmap_and_move() then rejects
> with -EBUSY: of course the writeback might complete in between, but
> that's not what we usually expect, so probably better not to isolate it.

this makes a lot of sense regardless of the rest of the series. I will
have a closer look at the rest tomorrow.

>
> Signed-off-by: Hugh Dickins <[hidden email]>
> Signed-off-by: Vlastimil Babka <[hidden email]>

Acked-by: Michal Hocko <[hidden email]>

> ---
>  mm/compaction.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index c72987603343..481004c73c90 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1146,7 +1146,7 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
>   struct page *page;
>   const isolate_mode_t isolate_mode =
>   (sysctl_compact_unevictable_allowed ? ISOLATE_UNEVICTABLE : 0) |
> - (cc->mode == MIGRATE_ASYNC ? ISOLATE_ASYNC_MIGRATE : 0);
> + (cc->mode != MIGRATE_SYNC ? ISOLATE_ASYNC_MIGRATE : 0);
>  
>   /*
>   * Start at where we last stopped, or beginning of the zone as
> --
> 2.8.2

--
Michal Hocko
SUSE Labs

Re: [RFC 02/13] mm, page_alloc: set alloc_flags only once in slowpath

Michal Hocko-4
In reply to this post by Vlastimil Babka
On Tue 10-05-16 14:30:11, Vlastimil Babka wrote:
[...]

> From 68f09f1d4381c7451238b4575557580380d8bf30 Mon Sep 17 00:00:00 2001
> From: Vlastimil Babka <[hidden email]>
> Date: Fri, 29 Apr 2016 11:51:17 +0200
> Subject: [RFC 02/13] mm, page_alloc: set alloc_flags only once in slowpath
>
> In __alloc_pages_slowpath(), alloc_flags doesn't change after it's initialized,
> so move the initialization above the retry: label. Also make the comment above
> the initialization more descriptive.
>
> The only exception in the alloc_flags being constant is ALLOC_NO_WATERMARKS,
> which may change due to TIF_MEMDIE being set on the allocating thread. We can
> fix this, and make the code simpler and a bit more effective at the same time,
> by moving the part that determines ALLOC_NO_WATERMARKS from
> gfp_to_alloc_flags() to gfp_pfmemalloc_allowed(). This means we don't have to
> mask out ALLOC_NO_WATERMARKS in several places in __alloc_pages_slowpath()
> anymore.  The only test for the flag can instead call gfp_pfmemalloc_allowed().

I like this _very_ much! gfp_to_alloc_flags was really ugly doing two
separate things and it is nice to split them up and give the whole thing
more sense. gfp_pfmemalloc_allowed() is the bright example of it.

> Signed-off-by: Vlastimil Babka <[hidden email]>
 
Acked-by: Michal Hocko <[hidden email]>

> ---
>  mm/page_alloc.c | 49 ++++++++++++++++++++++++-------------------------
>  1 file changed, 24 insertions(+), 25 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a50184ec6ca0..1b58facf4b5e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3216,8 +3216,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>   */
>   count_vm_event(COMPACTSTALL);
>  
> - page = get_page_from_freelist(gfp_mask, order,
> - alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
> + page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
>  
>   if (page) {
>   struct zone *zone = page_zone(page);
> @@ -3366,8 +3365,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
>   return NULL;
>  
>  retry:
> - page = get_page_from_freelist(gfp_mask, order,
> - alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
> + page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
>  
>   /*
>   * If an allocation failed after direct reclaim, it could be because
> @@ -3425,16 +3423,6 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>   } else if (unlikely(rt_task(current)) && !in_interrupt())
>   alloc_flags |= ALLOC_HARDER;
>  
> - if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> - if (gfp_mask & __GFP_MEMALLOC)
> - alloc_flags |= ALLOC_NO_WATERMARKS;
> - else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
> - alloc_flags |= ALLOC_NO_WATERMARKS;
> - else if (!in_interrupt() &&
> - ((current->flags & PF_MEMALLOC) ||
> - unlikely(test_thread_flag(TIF_MEMDIE))))
> - alloc_flags |= ALLOC_NO_WATERMARKS;
> - }
>  #ifdef CONFIG_CMA
>   if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
>   alloc_flags |= ALLOC_CMA;
> @@ -3444,7 +3432,19 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  
>  bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  {
> - return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS);
> + if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
> + return false;
> +
> + if (gfp_mask & __GFP_MEMALLOC)
> + return true;
> + if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
> + return true;
> + if (!in_interrupt() &&
> + ((current->flags & PF_MEMALLOC) ||
> + unlikely(test_thread_flag(TIF_MEMDIE))))
> + return true;
> +
> + return false;
>  }
>  
>  static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
> @@ -3579,25 +3579,24 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>   (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
>   gfp_mask &= ~__GFP_ATOMIC;
>  
> -retry:
> - if (gfp_mask & __GFP_KSWAPD_RECLAIM)
> - wake_all_kswapds(order, ac);
> -
>   /*
> - * OK, we're below the kswapd watermark and have kicked background
> - * reclaim. Now things get more complex, so set up alloc_flags according
> - * to how we want to proceed.
> + * The fast path uses conservative alloc_flags to succeed only until
> + * kswapd needs to be woken up, and to avoid the cost of setting up
> + * alloc_flags precisely. So we do that now.
>   */
>   alloc_flags = gfp_to_alloc_flags(gfp_mask);
>  
> +retry:
> + if (gfp_mask & __GFP_KSWAPD_RECLAIM)
> + wake_all_kswapds(order, ac);
> +
>   /* This is the last chance, in general, before the goto nopage. */
> - page = get_page_from_freelist(gfp_mask, order,
> - alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
> + page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
>   if (page)
>   goto got_pg;
>  
>   /* Allocate without watermarks if the context allows */
> - if (alloc_flags & ALLOC_NO_WATERMARKS) {
> + if (gfp_pfmemalloc_allowed(gfp_mask)) {
>   /*
>   * Ignore mempolicies if ALLOC_NO_WATERMARKS on the grounds
>   * the allocation is high priority and these type of
> --
> 2.8.2
>

--
Michal Hocko
SUSE Labs

Re: [RFC 03/13] mm, page_alloc: don't retry initial attempt in slowpath

Michal Hocko-4
In reply to this post by Vlastimil Babka
On Tue 10-05-16 09:35:53, Vlastimil Babka wrote:
> After __alloc_pages_slowpath() sets up new alloc_flags and wakes up kswapd, it
> first tries get_page_from_freelist() with the new alloc_flags, as it may
> succeed e.g. due to using min watermark instead of low watermark. This attempt
> does not have to be retried on each loop, since direct reclaim, direct
> compaction and oom call get_page_from_freelist() themselves.
>
> This patch therefore moves the initial attempt above the retry label. The
> ALLOC_NO_WATERMARKS attempt is kept under retry label as it's special and
> should be retried after each loop.

Yes, this makes the code both clearer and more logical.

> Kswapd wakeups are also done on each retry
> to be safe from potential races resulting in kswapd going to sleep while a
> process (that may not be able to reclaim by itself) is still looping.

I am not sure this is really necessary, but it shouldn't be harmful. The
comment clarifies the duplication, so we are not risking "cleanups to
remove duplicated code", I guess.

> Signed-off-by: Vlastimil Babka <[hidden email]>

Acked-by: Michal Hocko <[hidden email]>

> ---
>  mm/page_alloc.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 91fbf6f95403..7249949d65ca 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3586,16 +3586,23 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>   */
>   alloc_flags = gfp_to_alloc_flags(gfp_mask);
>  
> -retry:
>   if (gfp_mask & __GFP_KSWAPD_RECLAIM)
>   wake_all_kswapds(order, ac);
>  
> - /* This is the last chance, in general, before the goto nopage. */
> + /*
> + * The adjusted alloc_flags might result in immediate success, so try
> + * that first
> + */
>   page = get_page_from_freelist(gfp_mask, order,
>   alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
>   if (page)
>   goto got_pg;
>  
> +retry:
> + /* Ensure kswapd doesn't accidentaly go to sleep as long as we loop */
> + if (gfp_mask & __GFP_KSWAPD_RECLAIM)
> + wake_all_kswapds(order, ac);
> +
>   /* Allocate without watermarks if the context allows */
>   if (alloc_flags & ALLOC_NO_WATERMARKS) {
>   /*
> --
> 2.8.2
>

--
Michal Hocko
SUSE Labs