[PATCH 0.14] oom detection rework v6

classic Classic list List threaded Threaded
60 messages Options
123
Reply | Threaded
Open this post in threaded view
|

[PATCH 0.14] oom detection rework v6

Michal Hocko-4
Hi,

This is v6 of the series. The previous version was posted [1]. The
code hasn't changed much since then. I have found one old standing
bug (patch 1) which just got much more severe and visible with this
series. Other than that I have reorganized the series and put the
compaction feedback abstraction to the front just in case we find out
that parts of the series would have to be reverted later on for some
reason. The premature oom killer invocation reported by Hugh [2] seems
to be addressed.

We have discussed this series at LSF/MM summit in Raleigh and there
didn't seem to be any concerns/objections to go on with the patch set
and target it for the next merge window.

Motivation:
As pointed by Linus [3][4] relying on zone_reclaimable as a way to
communicate the reclaim progress is rater dubious. I tend to agree,
not only it is really obscure, it is not hard to imagine cases where a
single page freed in the loop keeps all the reclaimers looping without
getting any progress because their gfp_mask wouldn't allow to get that
page anyway (e.g. single GFP_ATOMIC alloc and free loop). This is rather
rare so it doesn't happen in the practice but the current logic which we
have is rather obscure and hard to follow a also non-deterministic.

This is an attempt to make the OOM detection more deterministic and
easier to follow because each reclaimer basically tracks its own
progress which is implemented at the page allocator layer rather spread
out between the allocator and the reclaim. The more on the implementation
is described in the first patch.

I have tested several different scenarios but it should be clear that
testing OOM killer is quite hard to be representative. There is usually
a tiny gap between almost OOM and full blown OOM which is often time
sensitive. Anyway, I have tested the following 2 scenarios and I would
appreciate if there are more to test.

Testing environment: a virtual machine with 2G of RAM and 2CPUs without
any swap to make the OOM more deterministic.

1) 2 writers (each doing dd with 4M blocks to an xfs partition with 1G
   file size, removes the files and starts over again) running in
   parallel for 10s to build up a lot of dirty pages when 100 parallel
   mem_eaters (anon private populated mmap which waits until it gets
   signal) with 80M each.

   This causes an OOM flood of course and I have compared both patched
   and unpatched kernels. The test is considered finished after there
   are no OOM conditions detected. This should tell us whether there are
   any excessive kills or some of them premature (e.g. due to dirty pages):

I have performed two runs this time each after a fresh boot.

* base kernel
$ grep "Out of memory:" base-oom-run1.log | wc -l
78
$ grep "Out of memory:" base-oom-run2.log | wc -l
78

$ grep "Kill process" base-oom-run1.log | tail -n1
[   91.391203] Out of memory: Kill process 3061 (mem_eater) score 39 or sacrifice child
$ grep "Kill process" base-oom-run2.log | tail -n1
[   82.141919] Out of memory: Kill process 3086 (mem_eater) score 39 or sacrifice child

$ grep "DMA32 free:" base-oom-run1.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
min: 5376.00 max: 6776.00 avg: 5530.75 std: 166.50 nr: 61
$ grep "DMA32 free:" base-oom-run2.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
min: 5416.00 max: 5608.00 avg: 5514.15 std: 42.94 nr: 52

$ grep "DMA32.*all_unreclaimable? no" base-oom-run1.log | wc -l
1
$ grep "DMA32.*all_unreclaimable? no" base-oom-run2.log | wc -l
3

* patched kernel
$ grep "Out of memory:" patched-oom-run1.log | wc -l
78
miso@tiehlicka /mnt/share/devel/miso/kvm $ grep "Out of memory:" patched-oom-run2.log | wc -l
77

e grep "Kill process" patched-oom-run1.log | tail -n1
[  497.317732] Out of memory: Kill process 3108 (mem_eater) score 39 or sacrifice child
$ grep "Kill process" patched-oom-run2.log | tail -n1
[  316.169920] Out of memory: Kill process 3093 (mem_eater) score 39 or sacrifice child

$ grep "DMA32 free:" patched-oom-run1.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
min: 5420.00 max: 5808.00 avg: 5513.90 std: 60.45 nr: 78
$ grep "DMA32 free:" patched-oom-run2.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
min: 5380.00 max: 6384.00 avg: 5520.94 std: 136.84 nr: 77

e grep "DMA32.*all_unreclaimable? no" patched-oom-run1.log | wc -l
2
$ grep "DMA32.*all_unreclaimable? no" patched-oom-run2.log | wc -l
3

The patched kernel run noticeably longer while invoking OOM killer same
number of times. This means that the original implementation is much
more aggressive and triggers the OOM killer sooner. free pages stats
show that neither kernels went OOM too early most of the time, though. I
guess the difference is in the backoff when retries without any progress
do sleep for a while if there is memory under writeback or dirty which
is highly likely considering the parallel IO.
Both kernels have seen races where zone wasn't marked unreclaimable
and we still hit the OOM killer. This is most likely a race where
a task managed to exit between the last allocation attempt and the oom
killer invocation.

2) 2 writers again with 10s of run and then 10 mem_eaters to consume as much
   memory as possible without triggering the OOM killer. This required a lot
   of tuning but I've considered 3 consecutive runs in three different boots
   without OOM as a success.

* base kernel
size=$(awk '/MemFree/{printf "%dK", ($2/10)-(16*1024)}' /proc/meminfo)

* patched kernel
size=$(awk '/MemFree/{printf "%dK", ($2/10)-(12*1024)}' /proc/meminfo)

That means 40M more memory was usable without triggering OOM killer. The
base kernel sometimes managed to handle the same as patched but it
wasn't consistent and failed in at least on of the 3 runs. This seems
like a minor improvement.

I was testing also GPF_REPEAT costly requests (hughetlb) with fragmented
memory and under memory pressure. The results are in patch 11 where the
logic is implemented. In short I can see huge improvement there.

I am certainly interested in other usecases as well as well as any
feedback. Especially those which require higher order requests.

* Changes since v5
- added "vmscan: consider classzone_idx in compaction_ready"
- added "mm, oom, compaction: prevent from should_compact_retry looping
  for ever for costly orders"
- acked-bys from Vlastimil
- integrated feedback from review
* Changes since v4
- dropped __GFP_REPEAT for costly allocation as it is now replaced by
  the compaction based feedback logic
- !costly high order requests are retried based on the compaction feedback
- compaction feedback has been tweaked to give us an useful information
  to make decisions in the page allocator
- rebased on the current mmotm-2016-04-01-16-24 with the previous version
  of the rework reverted

* Changes since v3
- factor out the new heuristic into its own function as suggested by
  Johannes (no functional changes)

* Changes since v2
- rebased on top of mmotm-2015-11-25-17-08 which includes
  wait_iff_congested related changes which needed refresh in
  patch#1 and patch#2
- use zone_page_state_snapshot for NR_FREE_PAGES per David
- shrink_zones doesn't need to return anything per David
- retested because the major kernel version has changed since
  the last time (4.2 -> 4.3 based kernel + mmotm patches)

* Changes since v1
- backoff calculation was de-obfuscated by using DIV_ROUND_UP
- __GFP_NOFAIL high order migh fail fixed - theoretical bug

[1] http://lkml.kernel.org/r/1459855533-4600-1-git-send-email-mhocko@...
[2] http://lkml.kernel.org/r/alpine.LSU.2.11.1602241832160.15564@...
[3] http://lkml.kernel.org/r/CA+55aFwapaED7JV6zm-NVkP-jKie+eQ1vDXWrKD=SkbshZSgmw@...
[4] http://lkml.kernel.org/r/CA+55aFxwg=vS2nrXsQhAUzPQDGb8aQpZi0M7UUh21ftBo-z46Q@...

Reply | Threaded
Open this post in threaded view
|

[PATCH 01/14] vmscan: consider classzone_idx in compaction_ready

Michal Hocko-4
From: Michal Hocko <[hidden email]>

while playing with the oom detection rework [1] I have noticed
that my heavy order-9 (hugetlb) load close to OOM ended up in an
endless loop where the reclaim hasn't made any progress but
did_some_progress didn't reflect that and compaction_suitable
was backing off because no zone is above low wmark + 1 << order.

It turned out that this is in fact an old standing bug in compaction_ready
which ignores the requested_highidx and did the watermark check for
0 classzone_idx. This succeeds for zone DMA most of the time as the zone
is mostly unused because of lowmem protection. This also means that the
OOM killer wouldn't be triggered for higher order requests even when
there is no reclaim progress and we essentially rely on order-0 request
to find this out. This has been broken in one way or another since
fe4b1b244bdb ("mm: vmscan: when reclaiming for compaction, ensure there
are sufficient free pages available") but only since 7335084d446b ("mm:
vmscan: do not OOM if aborting reclaim to start compaction") we are not
invoking the OOM killer based on the wrong calculation.

Propagate requested_highidx down to compaction_ready and use it for both
the watermak check and compaction_suitable to fix this issue.

[1] http://lkml.kernel.org/r/1459855533-4600-1-git-send-email-mhocko@...

Acked-by: Vlastimil Babka <[hidden email]>
Signed-off-by: Michal Hocko <[hidden email]>
---
 mm/vmscan.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c839adc13efd..3e6347e2a5fc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2482,7 +2482,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
  * Returns true if compaction should go ahead for a high-order request, or
  * the high-order allocation would succeed without compaction.
  */
-static inline bool compaction_ready(struct zone *zone, int order)
+static inline bool compaction_ready(struct zone *zone, int order, int classzone_idx)
 {
  unsigned long balance_gap, watermark;
  bool watermark_ok;
@@ -2496,7 +2496,7 @@ static inline bool compaction_ready(struct zone *zone, int order)
  balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP(
  zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO));
  watermark = high_wmark_pages(zone) + balance_gap + (2UL << order);
- watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, 0);
+ watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, classzone_idx);
 
  /*
  * If compaction is deferred, reclaim up to a point where
@@ -2509,7 +2509,7 @@ static inline bool compaction_ready(struct zone *zone, int order)
  * If compaction is not ready to start and allocation is not likely
  * to succeed without it, then keep reclaiming.
  */
- if (compaction_suitable(zone, order, 0, 0) == COMPACT_SKIPPED)
+ if (compaction_suitable(zone, order, 0, classzone_idx) == COMPACT_SKIPPED)
  return false;
 
  return watermark_ok;
@@ -2589,7 +2589,7 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
  if (IS_ENABLED(CONFIG_COMPACTION) &&
     sc->order > PAGE_ALLOC_COSTLY_ORDER &&
     zonelist_zone_idx(z) <= requested_highidx &&
-    compaction_ready(zone, sc->order)) {
+    compaction_ready(zone, sc->order, requested_highidx)) {
  sc->compaction_ready = true;
  continue;
  }
--
2.8.0.rc3

Reply | Threaded
Open this post in threaded view
|

[PATCH 03/14] mm, compaction: cover all compaction mode in compact_zone

Michal Hocko-4
In reply to this post by Michal Hocko-4
From: Michal Hocko <[hidden email]>

the compiler is complaining after "mm, compaction: change COMPACT_
constants into enum"

mm/compaction.c: In function ‘compact_zone’:
mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_DEFERRED’ not handled in switch [-Wswitch]
  switch (ret) {
  ^
mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_COMPLETE’ not handled in switch [-Wswitch]
mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_NO_SUITABLE_PAGE’ not handled in switch [-Wswitch]
mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_NOT_SUITABLE_ZONE’ not handled in switch [-Wswitch]
mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_CONTENDED’ not handled in switch [-Wswitch]

compaction_suitable is allowed to return only COMPACT_PARTIAL,
COMPACT_SKIPPED and COMPACT_CONTINUE so other cases are simply
impossible. Put a VM_BUG_ON to catch an impossible return value.

Signed-off-by: Michal Hocko <[hidden email]>
Acked-by: Vlastimil Babka <[hidden email]>
Acked-by: Hillf Danton <[hidden email]>
---
 mm/compaction.c | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 8ae7b1c46c72..b06de27b7f72 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1433,15 +1433,12 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
 
  ret = compaction_suitable(zone, cc->order, cc->alloc_flags,
  cc->classzone_idx);
- switch (ret) {
- case COMPACT_PARTIAL:
- case COMPACT_SKIPPED:
- /* Compaction is likely to fail */
+ /* Compaction is likely to fail */
+ if (ret == COMPACT_PARTIAL || ret == COMPACT_SKIPPED)
  return ret;
- case COMPACT_CONTINUE:
- /* Fall through to compaction */
- ;
- }
+
+ /* huh, compaction_suitable is returning something unexpected */
+ VM_BUG_ON(ret != COMPACT_CONTINUE);
 
  /*
  * Clear pageblock skip if there were failures recently and compaction
--
2.8.0.rc3

Reply | Threaded
Open this post in threaded view
|

[PATCH 05/14] mm, compaction: distinguish between full and partial COMPACT_COMPLETE

Michal Hocko-4
In reply to this post by Michal Hocko-4
From: Michal Hocko <[hidden email]>

COMPACT_COMPLETE now means that compaction and free scanner met. This is
not very useful information if somebody just wants to use this feedback
and make any decisions based on that. The current caller might be a poor
guy who just happened to scan tiny portion of the zone and that could be
the reason no suitable pages were compacted. Make sure we distinguish
the full and partial zone walks.

Consumers should treat COMPACT_PARTIAL_SKIPPED as a potential success
and be optimistic in retrying.

The existing users of COMPACT_COMPLETE are conservatively changed to
use COMPACT_PARTIAL_SKIPPED as well but some of them should be probably
reconsidered and only defer the compaction only for COMPACT_COMPLETE
with the new semantic.

This patch shouldn't introduce any functional changes.

Acked-by: Vlastimil Babka <[hidden email]>
Signed-off-by: Michal Hocko <[hidden email]>
---
 include/linux/compaction.h        | 10 +++++++++-
 include/trace/events/compaction.h |  1 +
 mm/compaction.c                   | 14 +++++++++++---
 mm/internal.h                     |  1 +
 4 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 7e177d111c39..7c4de92d12cc 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -21,7 +21,15 @@ enum compact_result {
  * pages
  */
  COMPACT_PARTIAL,
- /* The full zone was compacted */
+ /*
+ * direct compaction has scanned part of the zone but wasn't successfull
+ * to compact suitable pages.
+ */
+ COMPACT_PARTIAL_SKIPPED,
+ /*
+ * The full zone was compacted scanned but wasn't successfull to compact
+ * suitable pages.
+ */
  COMPACT_COMPLETE,
  /* For more detailed tracepoint output */
  COMPACT_NO_SUITABLE_PAGE,
diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
index 6ba16c86d7db..36e2d6fb1360 100644
--- a/include/trace/events/compaction.h
+++ b/include/trace/events/compaction.h
@@ -14,6 +14,7 @@
  EM( COMPACT_DEFERRED, "deferred") \
  EM( COMPACT_CONTINUE, "continue") \
  EM( COMPACT_PARTIAL, "partial") \
+ EM( COMPACT_PARTIAL_SKIPPED, "partial_skipped") \
  EM( COMPACT_COMPLETE, "complete") \
  EM( COMPACT_NO_SUITABLE_PAGE, "no_suitable_page") \
  EM( COMPACT_NOT_SUITABLE_ZONE, "not_suitable_zone") \
diff --git a/mm/compaction.c b/mm/compaction.c
index 13709e33a2fc..e2e487cea5ea 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1304,7 +1304,10 @@ static enum compact_result __compact_finished(struct zone *zone, struct compact_
  if (cc->direct_compaction)
  zone->compact_blockskip_flush = true;
 
- return COMPACT_COMPLETE;
+ if (cc->whole_zone)
+ return COMPACT_COMPLETE;
+ else
+ return COMPACT_PARTIAL_SKIPPED;
  }
 
  if (is_via_compact_memory(cc->order))
@@ -1463,6 +1466,10 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
  zone->compact_cached_migrate_pfn[0] = cc->migrate_pfn;
  zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn;
  }
+
+ if (cc->migrate_pfn == start_pfn)
+ cc->whole_zone = true;
+
  cc->last_migrated_pfn = 0;
 
  trace_mm_compaction_begin(start_pfn, cc->migrate_pfn,
@@ -1693,7 +1700,8 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
  goto break_loop;
  }
 
- if (mode != MIGRATE_ASYNC && status == COMPACT_COMPLETE) {
+ if (mode != MIGRATE_ASYNC && (status == COMPACT_COMPLETE ||
+ status == COMPACT_PARTIAL_SKIPPED)) {
  /*
  * We think that allocation won't succeed in this zone
  * so we defer compaction there. If it ends up
@@ -1939,7 +1947,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
  cc.classzone_idx, 0)) {
  success = true;
  compaction_defer_reset(zone, cc.order, false);
- } else if (status == COMPACT_COMPLETE) {
+ } else if (status == COMPACT_PARTIAL_SKIPPED || status == COMPACT_COMPLETE) {
  /*
  * We use sync migration mode here, so we defer like
  * sync direct compaction does.
diff --git a/mm/internal.h b/mm/internal.h
index e9aacea1a0d1..4423dfe69382 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -182,6 +182,7 @@ struct compact_control {
  enum migrate_mode mode; /* Async or sync migration mode */
  bool ignore_skip_hint; /* Scan blocks even if marked skip */
  bool direct_compaction; /* False from kcompactd or /proc/... */
+ bool whole_zone; /* Whole zone has been scanned */
  int order; /* order a direct compactor needs */
  const gfp_t gfp_mask; /* gfp mask of a direct compactor */
  const int alloc_flags; /* alloc flags of a direct compactor */
--
2.8.0.rc3

Reply | Threaded
Open this post in threaded view
|

[PATCH 06/14] mm, compaction: Update compaction_result ordering

Michal Hocko-4
In reply to this post by Michal Hocko-4
From: Michal Hocko <[hidden email]>

compaction_result will be used as the primary feedback channel for
compaction users. At the same time try_to_compact_pages (and potentially
others) assume a certain ordering where a more specific feedback takes
precendence. This gets a bit awkward when we have conflicting feedback
from different zones. E.g one returing COMPACT_COMPLETE meaning the full
zone has been scanned without any outcome while other returns with
COMPACT_PARTIAL aka made some progress. The caller should get
COMPACT_PARTIAL because that means that the compaction still can make
some progress. The same applies for COMPACT_PARTIAL vs.
COMPACT_PARTIAL_SKIPPED. Reorder PARTIAL to be the largest one so the
larger the value is the more progress we have done.

Acked-by: Vlastimil Babka <[hidden email]>
Signed-off-by: Michal Hocko <[hidden email]>
---
 include/linux/compaction.h | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 7c4de92d12cc..a7b9091ff349 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -4,6 +4,8 @@
 /* Return values for compact_zone() and try_to_compact_pages() */
 /* When adding new states, please adjust include/trace/events/compaction.h */
 enum compact_result {
+ /* For more detailed tracepoint output - internal to compaction */
+ COMPACT_NOT_SUITABLE_ZONE,
  /*
  * compaction didn't start as it was not possible or direct reclaim
  * was more suitable
@@ -11,30 +13,34 @@ enum compact_result {
  COMPACT_SKIPPED,
  /* compaction didn't start as it was deferred due to past failures */
  COMPACT_DEFERRED,
+
  /* compaction not active last round */
  COMPACT_INACTIVE = COMPACT_DEFERRED,
 
+ /* For more detailed tracepoint output - internal to compaction */
+ COMPACT_NO_SUITABLE_PAGE,
  /* compaction should continue to another pageblock */
  COMPACT_CONTINUE,
+
  /*
- * direct compaction partially compacted a zone and there are suitable
- * pages
+ * The full zone was compacted scanned but wasn't successfull to compact
+ * suitable pages.
  */
- COMPACT_PARTIAL,
+ COMPACT_COMPLETE,
  /*
  * direct compaction has scanned part of the zone but wasn't successfull
  * to compact suitable pages.
  */
  COMPACT_PARTIAL_SKIPPED,
+
+ /* compaction terminated prematurely due to lock contentions */
+ COMPACT_CONTENDED,
+
  /*
- * The full zone was compacted scanned but wasn't successfull to compact
- * suitable pages.
+ * direct compaction partially compacted a zone and there might be
+ * suitable pages
  */
- COMPACT_COMPLETE,
- /* For more detailed tracepoint output */
- COMPACT_NO_SUITABLE_PAGE,
- COMPACT_NOT_SUITABLE_ZONE,
- COMPACT_CONTENDED,
+ COMPACT_PARTIAL,
 };
 
 /* Used to signal whether compaction detected need_sched() or lock contention */
--
2.8.0.rc3

Reply | Threaded
Open this post in threaded view
|

[PATCH 08/14] mm, compaction: Abstract compaction feedback to helpers

Michal Hocko-4
In reply to this post by Michal Hocko-4
From: Michal Hocko <[hidden email]>

Compaction can provide a wild variation of feedback to the caller. Many
of them are implementation specific and the caller of the compaction
(especially the page allocator) shouldn't be bound to specifics of the
current implementation.

This patch abstracts the feedback into three basic types:
        - compaction_made_progress - compaction was active and made some
          progress.
        - compaction_failed - compaction failed and further attempts to
          invoke it would most probably fail and therefore it is not
          worth retrying
        - compaction_withdrawn - compaction wasn't invoked for an
          implementation specific reasons. In the current implementation
          it means that the compaction was deferred, contended or the
          page scanners met too early without any progress. Retrying is
          still worthwhile.

[[hidden email]: do not change thp back off behavior]
Signed-off-by: Michal Hocko <[hidden email]>
---
 include/linux/compaction.h | 79 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 79 insertions(+)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index a7b9091ff349..a002ca55c513 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -78,6 +78,70 @@ extern void compaction_defer_reset(struct zone *zone, int order,
  bool alloc_success);
 extern bool compaction_restarting(struct zone *zone, int order);
 
+/* Compaction has made some progress and retrying makes sense */
+static inline bool compaction_made_progress(enum compact_result result)
+{
+ /*
+ * Even though this might sound confusing this in fact tells us
+ * that the compaction successfully isolated and migrated some
+ * pageblocks.
+ */
+ if (result == COMPACT_PARTIAL)
+ return true;
+
+ return false;
+}
+
+/* Compaction has failed and it doesn't make much sense to keep retrying. */
+static inline bool compaction_failed(enum compact_result result)
+{
+ /* All zones where scanned completely and still not result. */
+ if (result == COMPACT_COMPLETE)
+ return true;
+
+ return false;
+}
+
+/*
+ * Compaction  has backed off for some reason. It might be throttling or
+ * lock contention. Retrying is still worthwhile.
+ */
+static inline bool compaction_withdrawn(enum compact_result result)
+{
+ /*
+ * Compaction backed off due to watermark checks for order-0
+ * so the regular reclaim has to try harder and reclaim something.
+ */
+ if (result == COMPACT_SKIPPED)
+ return true;
+
+ /*
+ * If compaction is deferred for high-order allocations, it is
+ * because sync compaction recently failed. If this is the case
+ * and the caller requested a THP allocation, we do not want
+ * to heavily disrupt the system, so we fail the allocation
+ * instead of entering direct reclaim.
+ */
+ if (result == COMPACT_DEFERRED)
+ return true;
+
+ /*
+ * If compaction in async mode encounters contention or blocks higher
+ * priority task we back off early rather than cause stalls.
+ */
+ if (result == COMPACT_CONTENDED)
+ return true;
+
+ /*
+ * Page scanners have met but we haven't scanned full zones so this
+ * is a back off in fact.
+ */
+ if (result == COMPACT_PARTIAL_SKIPPED)
+ return true;
+
+ return false;
+}
+
 extern int kcompactd_run(int nid);
 extern void kcompactd_stop(int nid);
 extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx);
@@ -114,6 +178,21 @@ static inline bool compaction_deferred(struct zone *zone, int order)
  return true;
 }
 
+static inline bool compaction_made_progress(enum compact_result result)
+{
+ return false;
+}
+
+static inline bool compaction_failed(enum compact_result result)
+{
+ return false;
+}
+
+static inline bool compaction_withdrawn(enum compact_result result)
+{
+ return true;
+}
+
 static inline int kcompactd_run(int nid)
 {
  return 0;
--
2.8.0.rc3

Reply | Threaded
Open this post in threaded view
|

[PATCH 09/14] mm: use compaction feedback for thp backoff conditions

Michal Hocko-4
In reply to this post by Michal Hocko-4
From: Michal Hocko <[hidden email]>

THP requests skip the direct reclaim if the compaction is either
deferred or contended to reduce stalls which wouldn't help the
allocation success anyway. These checks are ignoring other potential
feedback modes which we have available now.

It clearly doesn't make much sense to go and reclaim few pages if the
previous compaction has failed.

We can also simplify the check by using compaction_withdrawn which
checks for both COMPACT_CONTENDED and COMPACT_DEFERRED. This check
is however covering more reasons why the compaction was withdrawn.
None of them should be a problem for the THP case though.

It is safe to back of if we see COMPACT_SKIPPED because that means
that compaction_suitable failed and a single round of the reclaim is
unlikely to make any difference here. We would have to be close to
the low watermark to reclaim enough and even then there is no guarantee
that the compaction would make any progress while the direct reclaim
would have caused the stall.

COMPACT_PARTIAL_SKIPPED is slightly different because that means that we
have only seen a part of the zone so a retry would make some sense. But
it would be a compaction retry not a reclaim retry to perform. We are
not doing that and that might indeed lead to situations where THP fails
but this should happen only rarely and it would be really hard to
measure.

Signed-off-by: Michal Hocko <[hidden email]>
---
 mm/page_alloc.c | 27 ++++++++-------------------
 1 file changed, 8 insertions(+), 19 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 350d13f3709b..d551fe326c33 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3257,25 +3257,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  if (page)
  goto got_pg;
 
- /* Checks for THP-specific high-order allocations */
- if (is_thp_gfp_mask(gfp_mask)) {
- /*
- * If compaction is deferred for high-order allocations, it is
- * because sync compaction recently failed. If this is the case
- * and the caller requested a THP allocation, we do not want
- * to heavily disrupt the system, so we fail the allocation
- * instead of entering direct reclaim.
- */
- if (compact_result == COMPACT_DEFERRED)
- goto nopage;
-
- /*
- * Compaction is contended so rather back off than cause
- * excessive stalls.
- */
- if(compact_result == COMPACT_CONTENDED)
- goto nopage;
- }
+ /*
+ * Checks for THP-specific high-order allocations and back off
+ * if the the compaction backed off or failed
+ */
+ if (is_thp_gfp_mask(gfp_mask) &&
+ (compaction_withdrawn(compact_result) ||
+ compaction_failed(compact_result)))
+ goto nopage;
 
  /*
  * It can become very expensive to allocate transparent hugepages at
--
2.8.0.rc3

Reply | Threaded
Open this post in threaded view
|

[PATCH 11/14] mm: throttle on IO only when there are too many dirty and writeback pages

Michal Hocko-4
In reply to this post by Michal Hocko-4
From: Michal Hocko <[hidden email]>

wait_iff_congested has been used to throttle allocator before it retried
another round of direct reclaim to allow the writeback to make some
progress and prevent reclaim from looping over dirty/writeback pages
without making any progress. We used to do congestion_wait before
0e093d99763e ("writeback: do not sleep on the congestion queue if
there are no congested BDIs or if significant congestion is not being
encountered in the current zone") but that led to undesirable stalls
and sleeping for the full timeout even when the BDI wasn't congested.
Hence wait_iff_congested was used instead. But it seems that even
wait_iff_congested doesn't work as expected. We might have a small file
LRU list with all pages dirty/writeback and yet the bdi is not congested
so this is just a cond_resched in the end and can end up triggering pre
mature OOM.

This patch replaces the unconditional wait_iff_congested by
congestion_wait which is executed only if we _know_ that the last round
of direct reclaim didn't make any progress and dirty+writeback pages are
more than a half of the reclaimable pages on the zone which might be
usable for our target allocation. This shouldn't reintroduce stalls
fixed by 0e093d99763e because congestion_wait is called only when we
are getting hopeless when sleeping is a better choice than OOM with many
pages under IO.

We have to preserve logic introduced by 373ccbe59270 ("mm, vmstat: allow
WQ concurrency to discover memory reclaim doesn't make any progress")
into the __alloc_pages_slowpath now that wait_iff_congested is not
used anymore.  As the only remaining user of wait_iff_congested is
shrink_inactive_list we can remove the WQ specific short sleep from
wait_iff_congested because the sleep is needed to be done only once in
the allocation retry cycle.

Acked-by: Hillf Danton <[hidden email]>
Signed-off-by: Michal Hocko <[hidden email]>
---
 mm/backing-dev.c | 20 +++-----------------
 mm/page_alloc.c  | 39 ++++++++++++++++++++++++++++++++++++---
 2 files changed, 39 insertions(+), 20 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index bfbd7096b6ed..08e3a58628ed 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -957,9 +957,8 @@ EXPORT_SYMBOL(congestion_wait);
  * jiffies for either a BDI to exit congestion of the given @sync queue
  * or a write to complete.
  *
- * In the absence of zone congestion, a short sleep or a cond_resched is
- * performed to yield the processor and to allow other subsystems to make
- * a forward progress.
+ * In the absence of zone congestion, cond_resched() is called to yield
+ * the processor if necessary but otherwise does not sleep.
  *
  * The return value is 0 if the sleep is for the full timeout. Otherwise,
  * it is the number of jiffies that were still remaining when the function
@@ -979,20 +978,7 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
  */
  if (atomic_read(&nr_wb_congested[sync]) == 0 ||
     !test_bit(ZONE_CONGESTED, &zone->flags)) {
-
- /*
- * Memory allocation/reclaim might be called from a WQ
- * context and the current implementation of the WQ
- * concurrency control doesn't recognize that a particular
- * WQ is congested if the worker thread is looping without
- * ever sleeping. Therefore we have to do a short sleep
- * here rather than calling cond_resched().
- */
- if (current->flags & PF_WQ_WORKER)
- schedule_timeout_uninterruptible(1);
- else
- cond_resched();
-
+ cond_resched();
  /* In case we scheduled, work out time remaining */
  ret = timeout - (jiffies - start);
  if (ret < 0)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 38302c2041a3..3b78936eca70 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3195,8 +3195,9 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
  for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
  ac->nodemask) {
  unsigned long available;
+ unsigned long reclaimable;
 
- available = zone_reclaimable_pages(zone);
+ available = reclaimable = zone_reclaimable_pages(zone);
  available -= DIV_ROUND_UP(no_progress_loops * available,
   MAX_RECLAIM_RETRIES);
  available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
@@ -3207,8 +3208,40 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
  */
  if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
  ac->high_zoneidx, alloc_flags, available)) {
- /* Wait for some write requests to complete then retry */
- wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
+ /*
+ * If we didn't make any progress and have a lot of
+ * dirty + writeback pages then we should wait for
+ * an IO to complete to slow down the reclaim and
+ * prevent from pre mature OOM
+ */
+ if (!did_some_progress) {
+ unsigned long writeback;
+ unsigned long dirty;
+
+ writeback = zone_page_state_snapshot(zone,
+     NR_WRITEBACK);
+ dirty = zone_page_state_snapshot(zone, NR_FILE_DIRTY);
+
+ if (2*(writeback + dirty) > reclaimable) {
+ congestion_wait(BLK_RW_ASYNC, HZ/10);
+ return true;
+ }
+ }
+
+ /*
+ * Memory allocation/reclaim might be called from a WQ
+ * context and the current implementation of the WQ
+ * concurrency control doesn't recognize that
+ * a particular WQ is congested if the worker thread is
+ * looping without ever sleeping. Therefore we have to
+ * do a short sleep here rather than calling
+ * cond_resched().
+ */
+ if (current->flags & PF_WQ_WORKER)
+ schedule_timeout_uninterruptible(1);
+ else
+ cond_resched();
+
  return true;
  }
  }
--
2.8.0.rc3

Reply | Threaded
Open this post in threaded view
|

[PATCH 14/14] mm, oom, compaction: prevent from should_compact_retry looping for ever for costly orders

Michal Hocko-4
In reply to this post by Michal Hocko-4
From: Michal Hocko <[hidden email]>

"mm: consider compaction feedback also for costly allocation" has
removed the upper bound for the reclaim/compaction retries based on the
number of reclaimed pages for costly orders. While this is desirable
the patch did miss a mis interaction between reclaim, compaction and the
retry logic. The direct reclaim tries to get zones over min watermark
while compaction backs off and returns COMPACT_SKIPPED when all zones
are below low watermark + 1<<order gap. If we are getting really close
to OOM then __compaction_suitable can keep returning COMPACT_SKIPPED a
high order request (e.g. hugetlb order-9) while the reclaim is not able
to release enough pages to get us over low watermark. The reclaim is
still able to make some progress (usually trashing over few remaining
pages) so we are not able to break out from the loop.

I have seen this happening with the same test described in "mm: consider
compaction feedback also for costly allocation" on a swapless system.
The original problem got resolved by "vmscan: consider classzone_idx in
compaction_ready" but it shows how things might go wrong when we
approach the oom event horizont.

The reason why compaction requires being over low rather than min
watermark is not clear to me. This check was there essentially since
56de7263fcf3 ("mm: compaction: direct compact when a high-order
allocation fails"). It is clearly an implementation detail though and we
shouldn't pull it into the generic retry logic while we should be able
to cope with such eventuality. The only place in should_compact_retry
where we retry without any upper bound is for compaction_withdrawn()
case.

Introduce compaction_zonelist_suitable function which checks the given
zonelist and returns true only if there is at least one zone which would
would unblock __compaction_suitable if more memory got reclaimed. In
this implementation it checks __compaction_suitable with NR_FREE_PAGES
plus part of the reclaimable memory as the target for the watermark check.
The reclaimable memory is reduced linearly by the allocation order. The
idea is that we do not want to reclaim all the remaining memory for a
single allocation request just unblock __compaction_suitable which
doesn't guarantee we will make a further progress.

The new helper is then used if compaction_withdrawn() feedback was
provided so we do not retry if there is no outlook for a further
progress. !costly requests shouldn't be affected much - e.g. order-2
pages would require to have at least 64kB on the reclaimable LRUs while
order-9 would need at least 32M which should be enough to not lock up.

[[hidden email]: fix classzone_idx vs. high_zoneidx usage in
compaction_zonelist_suitable]
Signed-off-by: Michal Hocko <[hidden email]>
---
 include/linux/compaction.h |  4 ++++
 include/linux/mmzone.h     |  3 +++
 mm/compaction.c            | 42 +++++++++++++++++++++++++++++++++++++++---
 mm/page_alloc.c            | 18 +++++++++++-------
 4 files changed, 57 insertions(+), 10 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index a002ca55c513..7bbdbf729757 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -142,6 +142,10 @@ static inline bool compaction_withdrawn(enum compact_result result)
  return false;
 }
 
+
+bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
+ int alloc_flags);
+
 extern int kcompactd_run(int nid);
 extern void kcompactd_stop(int nid);
 extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 150c6049f961..0bf13c7cd8cd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -746,6 +746,9 @@ static inline bool is_dev_zone(const struct zone *zone)
 extern struct mutex zonelists_mutex;
 void build_all_zonelists(pg_data_t *pgdat, struct zone *zone);
 void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx);
+bool __zone_watermark_ok(struct zone *z, unsigned int order,
+ unsigned long mark, int classzone_idx, int alloc_flags,
+ long free_pages);
 bool zone_watermark_ok(struct zone *z, unsigned int order,
  unsigned long mark, int classzone_idx, int alloc_flags);
 bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
diff --git a/mm/compaction.c b/mm/compaction.c
index e2e487cea5ea..0a7ca578af97 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1369,7 +1369,8 @@ static enum compact_result compact_finished(struct zone *zone,
  *   COMPACT_CONTINUE - If compaction should run now
  */
 static enum compact_result __compaction_suitable(struct zone *zone, int order,
- int alloc_flags, int classzone_idx)
+ int alloc_flags, int classzone_idx,
+ unsigned long wmark_target)
 {
  int fragindex;
  unsigned long watermark;
@@ -1392,7 +1393,8 @@ static enum compact_result __compaction_suitable(struct zone *zone, int order,
  * allocated and for a short time, the footprint is higher
  */
  watermark += (2UL << order);
- if (!zone_watermark_ok(zone, 0, watermark, classzone_idx, alloc_flags))
+ if (!__zone_watermark_ok(zone, 0, watermark, classzone_idx,
+ alloc_flags, wmark_target))
  return COMPACT_SKIPPED;
 
  /*
@@ -1418,7 +1420,8 @@ enum compact_result compaction_suitable(struct zone *zone, int order,
 {
  enum compact_result ret;
 
- ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx);
+ ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx,
+    zone_page_state(zone, NR_FREE_PAGES));
  trace_mm_compaction_suitable(zone, order, ret);
  if (ret == COMPACT_NOT_SUITABLE_ZONE)
  ret = COMPACT_SKIPPED;
@@ -1426,6 +1429,39 @@ enum compact_result compaction_suitable(struct zone *zone, int order,
  return ret;
 }
 
+bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
+ int alloc_flags)
+{
+ struct zone *zone;
+ struct zoneref *z;
+
+ /*
+ * Make sure at least one zone would pass __compaction_suitable if we continue
+ * retrying the reclaim.
+ */
+ for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
+ ac->nodemask) {
+ unsigned long available;
+ enum compact_result compact_result;
+
+ /*
+ * Do not consider all the reclaimable memory because we do not
+ * want to trash just for a single high order allocation which
+ * is even not guaranteed to appear even if __compaction_suitable
+ * is happy about the watermark check.
+ */
+ available = zone_reclaimable_pages(zone) / order;
+ available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
+ compact_result = __compaction_suitable(zone, order, alloc_flags,
+ ac->classzone_idx, available);
+ if (compact_result != COMPACT_SKIPPED &&
+ compact_result != COMPACT_NOT_SUITABLE_ZONE)
+ return true;
+ }
+
+ return false;
+}
+
 static enum compact_result compact_zone(struct zone *zone, struct compact_control *cc)
 {
  enum compact_result ret;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d5a938f12554..6757d6df2160 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2526,7 +2526,7 @@ static inline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
  * one free page of a suitable size. Checking now avoids taking the zone lock
  * to check in the allocation paths if no pages are free.
  */
-static bool __zone_watermark_ok(struct zone *z, unsigned int order,
+bool __zone_watermark_ok(struct zone *z, unsigned int order,
  unsigned long mark, int classzone_idx, int alloc_flags,
  long free_pages)
 {
@@ -3015,8 +3015,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 }
 
 static inline bool
-should_compact_retry(unsigned int order, enum compact_result compact_result,
-     enum migrate_mode *migrate_mode,
+should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
+     enum compact_result compact_result, enum migrate_mode *migrate_mode,
      int compaction_retries)
 {
  int max_retries = MAX_COMPACT_RETRIES;
@@ -3040,9 +3040,11 @@ should_compact_retry(unsigned int order, enum compact_result compact_result,
  /*
  * make sure the compaction wasn't deferred or didn't bail out early
  * due to locks contention before we declare that we should give up.
+ * But do not retry if the given zonelist is not suitable for
+ * compaction.
  */
  if (compaction_withdrawn(compact_result))
- return true;
+ return compaction_zonelist_suitable(ac, order, alloc_flags);
 
  /*
  * !costly requests are much more important than __GFP_REPEAT
@@ -3069,7 +3071,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 }
 
 static inline bool
-should_compact_retry(unsigned int order, enum compact_result compact_result,
+should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_flags,
+     enum compact_result compact_result,
      enum migrate_mode *migrate_mode,
      int compaction_retries)
 {
@@ -3464,8 +3467,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  * of free memory (see __compaction_suitable)
  */
  if (did_some_progress > 0 &&
- should_compact_retry(order, compact_result,
- &migration_mode, compaction_retries))
+ should_compact_retry(ac, order, alloc_flags,
+ compact_result, &migration_mode,
+ compaction_retries))
  goto retry;
 
  /* Reclaim has failed us, start killing things */
--
2.8.0.rc3

Reply | Threaded
Open this post in threaded view
|

[PATCH 13/14] mm: consider compaction feedback also for costly allocation

Michal Hocko-4
In reply to this post by Michal Hocko-4
From: Michal Hocko <[hidden email]>

PAGE_ALLOC_COSTLY_ORDER retry logic is mostly handled inside
should_reclaim_retry currently where we decide to not retry after at
least order worth of pages were reclaimed or the watermark check for at
least one zone would succeed after reclaiming all pages if the reclaim
hasn't made any progress. Compaction feedback is mostly ignored and we
just try to make sure that the compaction did at least something before
giving up.

The first condition was added by a41f24ea9fd6 ("page allocator: smarter
retry of costly-order allocations) and it assumed that lumpy reclaim
could have created a page of the sufficient order. Lumpy reclaim,
has been removed quite some time ago so the assumption doesn't hold
anymore. Remove the check for the number of reclaimed pages and rely
on the compaction feedback solely. should_reclaim_retry now only
makes sure that we keep retrying reclaim for high order pages only
if they are hidden by watermaks so order-0 reclaim makes really sense.

should_compact_retry now keeps retrying even for the costly allocations.
The number of retries is reduced wrt. !costly requests because they are
less important and harder to grant and so their pressure shouldn't cause
contention for other requests or cause an over reclaim. We also do not
reset no_progress_loops for costly request to make sure we do not keep
reclaiming too agressively.

This has been tested by running a process which fragments memory:
        - compact memory
        - mmap large portion of the memory (1920M on 2GRAM machine with 2G
          of swapspace)
        - MADV_DONTNEED single page in PAGE_SIZE*((1UL<<MAX_ORDER)-1)
          steps until certain amount of memory is freed (250M in my test)
          and reduce the step to (step / 2) + 1 after reaching the end of
          the mapping
        - then run a script which populates the page cache 2G (MemTotal)
          from /dev/zero to a new file
And then tries to allocate
nr_hugepages=$(awk '/MemAvailable/{printf "%d\n", $2/(2*1024)}' /proc/meminfo)
huge pages.

root@test1:~# echo 1 > /proc/sys/vm/overcommit_memory;echo 1 > /proc/sys/vm/compact_memory; ./fragment-mem-and-run /root/alloc_hugepages.sh 1920M 250M
Node 0, zone      DMA     31     28     31     10      2      0      2      1      2      3      1
Node 0, zone    DMA32    437    319    171     50     28     25     20     16     16     14    437

* This is the /proc/buddyinfo after the compaction

Done fragmenting. size=2013265920 freed=262144000
Node 0, zone      DMA    165     48      3      1      2      0      2      2      2      2      0
Node 0, zone    DMA32  35109  14575    185     51     41     12      6      0      0      0      0

* /proc/buddyinfo after memory got fragmented

Executing "/root/alloc_hugepages.sh"
Eating some pagecache
508623+0 records in
508623+0 records out
2083319808 bytes (2.1 GB) copied, 11.7292 s, 178 MB/s
Node 0, zone      DMA      3      5      3      1      2      0      2      2      2      2      0
Node 0, zone    DMA32    111    344    153     20     24     10      3      0      0      0      0

* /proc/buddyinfo after page cache got eaten

Trying to allocate 129
129

* 129 hugepages requested and all of them granted.

Node 0, zone      DMA      3      5      3      1      2      0      2      2      2      2      0
Node 0, zone    DMA32    127     97     30     99     11      6      2      1      4      0      0

* /proc/buddyinfo after hugetlb allocation.

10 runs will behave as follows:
Trying to allocate 130
130
--
Trying to allocate 129
129
--
Trying to allocate 128
128
--
Trying to allocate 129
129
--
Trying to allocate 128
128
--
Trying to allocate 129
129
--
Trying to allocate 132
132
--
Trying to allocate 129
129
--
Trying to allocate 128
128
--
Trying to allocate 129
129

So basically 100% success for all 10 attempts.
Without the patch numbers looked much worse:
Trying to allocate 128
12
--
Trying to allocate 129
14
--
Trying to allocate 129
7
--
Trying to allocate 129
16
--
Trying to allocate 129
30
--
Trying to allocate 129
38
--
Trying to allocate 129
19
--
Trying to allocate 129
37
--
Trying to allocate 129
28
--
Trying to allocate 129
37

Just for completness the base kernel without oom detection rework looks
as follows:
Trying to allocate 127
30
--
Trying to allocate 129
12
--
Trying to allocate 129
52
--
Trying to allocate 128
32
--
Trying to allocate 129
12
--
Trying to allocate 129
10
--
Trying to allocate 129
32
--
Trying to allocate 128
14
--
Trying to allocate 128
16
--
Trying to allocate 129
8

As we can see the success rate is much more volatile and smaller without
this patch. So the patch not only makes the retry logic for costly
requests more sensible the success rate is even higher.

Acked-by: Vlastimil Babka <[hidden email]>
Signed-off-by: Michal Hocko <[hidden email]>
---
 mm/page_alloc.c | 63 +++++++++++++++++++++++++++++----------------------------
 1 file changed, 32 insertions(+), 31 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bb4df1be0d43..d5a938f12554 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3019,6 +3019,8 @@ should_compact_retry(unsigned int order, enum compact_result compact_result,
      enum migrate_mode *migrate_mode,
      int compaction_retries)
 {
+ int max_retries = MAX_COMPACT_RETRIES;
+
  if (!order)
  return false;
 
@@ -3036,17 +3038,24 @@ should_compact_retry(unsigned int order, enum compact_result compact_result,
  }
 
  /*
- * !costly allocations are really important and we have to make sure
- * the compaction wasn't deferred or didn't bail out early due to locks
- * contention before we go OOM. Still cap the reclaim retry loops with
- * progress to prevent from looping forever and potential trashing.
+ * make sure the compaction wasn't deferred or didn't bail out early
+ * due to locks contention before we declare that we should give up.
  */
- if (order <= PAGE_ALLOC_COSTLY_ORDER) {
- if (compaction_withdrawn(compact_result))
- return true;
- if (compaction_retries <= MAX_COMPACT_RETRIES)
- return true;
- }
+ if (compaction_withdrawn(compact_result))
+ return true;
+
+ /*
+ * !costly requests are much more important than __GFP_REPEAT
+ * costly ones because they are de facto nofail and invoke OOM
+ * killer to move on while costly can fail and users are ready
+ * to cope with that. 1/4 retries is rather arbitrary but we
+ * would need much more detailed feedback from compaction to
+ * make a better decision.
+ */
+ if (order > PAGE_ALLOC_COSTLY_ORDER)
+ max_retries /= 4;
+ if (compaction_retries <= max_retries)
+ return true;
 
  return false;
 }
@@ -3207,18 +3216,17 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
  * Checks whether it makes sense to retry the reclaim to make a forward progress
  * for the given allocation request.
  * The reclaim feedback represented by did_some_progress (any progress during
- * the last reclaim round), pages_reclaimed (cumulative number of reclaimed
- * pages) and no_progress_loops (number of reclaim rounds without any progress
- * in a row) is considered as well as the reclaimable pages on the applicable
- * zone list (with a backoff mechanism which is a function of no_progress_loops).
+ * the last reclaim round) and no_progress_loops (number of reclaim rounds without
+ * any progress in a row) is considered as well as the reclaimable pages on the
+ * applicable zone list (with a backoff mechanism which is a function of
+ * no_progress_loops).
  *
  * Returns true if a retry is viable or false to enter the oom path.
  */
 static inline bool
 should_reclaim_retry(gfp_t gfp_mask, unsigned order,
      struct alloc_context *ac, int alloc_flags,
-     bool did_some_progress, unsigned long pages_reclaimed,
-     int no_progress_loops)
+     bool did_some_progress, int no_progress_loops)
 {
  struct zone *zone;
  struct zoneref *z;
@@ -3230,14 +3238,6 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
  if (no_progress_loops > MAX_RECLAIM_RETRIES)
  return false;
 
- if (order > PAGE_ALLOC_COSTLY_ORDER) {
- if (pages_reclaimed >= (1<<order))
- return false;
-
- if (did_some_progress)
- return true;
- }
-
  /*
  * Keep reclaiming pages while there is a chance this will lead somewhere.
  * If none of the target zones can satisfy our allocation request even
@@ -3308,7 +3308,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
  struct page *page = NULL;
  int alloc_flags;
- unsigned long pages_reclaimed = 0;
  unsigned long did_some_progress;
  enum migrate_mode migration_mode = MIGRATE_ASYNC;
  enum compact_result compact_result;
@@ -3444,16 +3443,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
  goto noretry;
 
- if (did_some_progress) {
+ /*
+ * Costly allocations might have made a progress but this doesn't mean
+ * their order will become available due to high fragmentation so
+ * always increment the no progress counter for them
+ */
+ if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
  no_progress_loops = 0;
- pages_reclaimed += did_some_progress;
- } else {
+ else
  no_progress_loops++;
- }
 
  if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
- did_some_progress > 0, pages_reclaimed,
- no_progress_loops))
+ did_some_progress > 0, no_progress_loops))
  goto retry;
 
  /*
--
2.8.0.rc3

Reply | Threaded
Open this post in threaded view
|

[PATCH 12/14] mm, oom: protect !costly allocations some more

Michal Hocko-4
In reply to this post by Michal Hocko-4
From: Michal Hocko <[hidden email]>

should_reclaim_retry will give up retries for higher order allocations
if none of the eligible zones has any requested or higher order pages
available even if we pass the watermak check for order-0. This is done
because there is no guarantee that the reclaimable and currently free
pages will form the required order.

This can, however, lead to situations were the high-order request (e.g.
order-2 required for the stack allocation during fork) will trigger
OOM too early - e.g. after the first reclaim/compaction round. Such a
system would have to be highly fragmented and there is no guarantee
further reclaim/compaction attempts would help but at least make sure
that the compaction was active before we go OOM and keep retrying even
if should_reclaim_retry tells us to oom if
        - the last compaction round backed off or
        - we haven't completed at least MAX_COMPACT_RETRIES active
          compaction rounds.

The first rule ensures that the very last attempt for compaction
was not ignored while the second guarantees that the compaction has done
some work. Multiple retries might be needed to prevent occasional
pigggy backing of other contexts to steal the compacted pages before
the current context manages to retry to allocate them.

compaction_failed() is taken as a final word from the compaction that
the retry doesn't make much sense. We have to be careful though because
the first compaction round is MIGRATE_ASYNC which is rather weak as it
ignores pages under writeback and gives up too easily in other
situations. We therefore have to make sure that MIGRATE_SYNC_LIGHT mode
has been used before we give up. With this logic in place we do not have
to increase the migration mode unconditionally and rather do it only if
the compaction failed for the weaker mode. A nice side effect is that
the stronger migration mode is used only when really needed so this has
a potential of smaller latencies in some cases.

Please note that the compaction doesn't tell us much about how
successful it was when returning compaction_made_progress so we just
have to blindly trust that another retry is worthwhile and cap the
number to something reasonable to guarantee a convergence.

If the given number of successful retries is not sufficient for a
reasonable workloads we should focus on the collected compaction
tracepoints data and try to address the issue in the compaction code.
If this is not feasible we can increase the retries limit.

Acked-by: Vlastimil Babka <[hidden email]>
Signed-off-by: Michal Hocko <[hidden email]>
---
 mm/page_alloc.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 77 insertions(+), 10 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3b78936eca70..bb4df1be0d43 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2939,6 +2939,13 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
  return page;
 }
 
+
+/*
+ * Maximum number of compaction retries wit a progress before OOM
+ * killer is consider as the only way to move forward.
+ */
+#define MAX_COMPACT_RETRIES 16
+
 #ifdef CONFIG_COMPACTION
 /* Try memory compaction for high-order allocations before reclaim */
 static struct page *
@@ -3006,6 +3013,43 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 
  return NULL;
 }
+
+static inline bool
+should_compact_retry(unsigned int order, enum compact_result compact_result,
+     enum migrate_mode *migrate_mode,
+     int compaction_retries)
+{
+ if (!order)
+ return false;
+
+ /*
+ * compaction considers all the zone as desperately out of memory
+ * so it doesn't really make much sense to retry except when the
+ * failure could be caused by weak migration mode.
+ */
+ if (compaction_failed(compact_result)) {
+ if (*migrate_mode == MIGRATE_ASYNC) {
+ *migrate_mode = MIGRATE_SYNC_LIGHT;
+ return true;
+ }
+ return false;
+ }
+
+ /*
+ * !costly allocations are really important and we have to make sure
+ * the compaction wasn't deferred or didn't bail out early due to locks
+ * contention before we go OOM. Still cap the reclaim retry loops with
+ * progress to prevent from looping forever and potential trashing.
+ */
+ if (order <= PAGE_ALLOC_COSTLY_ORDER) {
+ if (compaction_withdrawn(compact_result))
+ return true;
+ if (compaction_retries <= MAX_COMPACT_RETRIES)
+ return true;
+ }
+
+ return false;
+}
 #else
 static inline struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
@@ -3014,6 +3058,14 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 {
  return NULL;
 }
+
+static inline bool
+should_compact_retry(unsigned int order, enum compact_result compact_result,
+     enum migrate_mode *migrate_mode,
+     int compaction_retries)
+{
+ return false;
+}
 #endif /* CONFIG_COMPACTION */
 
 /* Perform direct synchronous page reclaim */
@@ -3260,6 +3312,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  unsigned long did_some_progress;
  enum migrate_mode migration_mode = MIGRATE_ASYNC;
  enum compact_result compact_result;
+ int compaction_retries = 0;
  int no_progress_loops = 0;
 
  /*
@@ -3371,13 +3424,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  compaction_failed(compact_result)))
  goto nopage;
 
- /*
- * It can become very expensive to allocate transparent hugepages at
- * fault, so use asynchronous memory compaction for THP unless it is
- * khugepaged trying to collapse.
- */
- if (!is_thp_gfp_mask(gfp_mask) || (current->flags & PF_KTHREAD))
- migration_mode = MIGRATE_SYNC_LIGHT;
+ if (order && compaction_made_progress(compact_result))
+ compaction_retries++;
 
  /* Try direct reclaim and then allocating */
  page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
@@ -3408,6 +3456,17 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  no_progress_loops))
  goto retry;
 
+ /*
+ * It doesn't make any sense to retry for the compaction if the order-0
+ * reclaim is not able to make any progress because the current
+ * implementation of the compaction depends on the sufficient amount
+ * of free memory (see __compaction_suitable)
+ */
+ if (did_some_progress > 0 &&
+ should_compact_retry(order, compact_result,
+ &migration_mode, compaction_retries))
+ goto retry;
+
  /* Reclaim has failed us, start killing things */
  page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
  if (page)
@@ -3421,10 +3480,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 
 noretry:
  /*
- * High-order allocations do not necessarily loop after
- * direct reclaim and reclaim/compaction depends on compaction
- * being called after reclaim so call directly if necessary
+ * High-order allocations do not necessarily loop after direct reclaim
+ * and reclaim/compaction depends on compaction being called after
+ * reclaim so call directly if necessary.
+ * It can become very expensive to allocate transparent hugepages at
+ * fault, so use asynchronous memory compaction for THP unless it is
+ * khugepaged trying to collapse. All other requests should tolerate
+ * at least light sync migration.
  */
+ if (is_thp_gfp_mask(gfp_mask) && !(current->flags & PF_KTHREAD))
+ migration_mode = MIGRATE_ASYNC;
+ else
+ migration_mode = MIGRATE_SYNC_LIGHT;
  page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
     ac, migration_mode,
     &compact_result);
--
2.8.0.rc3

Reply | Threaded
Open this post in threaded view
|

[PATCH 10/14] mm, oom: rework oom detection

Michal Hocko-4
In reply to this post by Michal Hocko-4
From: Michal Hocko <[hidden email]>

__alloc_pages_slowpath has traditionally relied on the direct reclaim
and did_some_progress as an indicator that it makes sense to retry
allocation rather than declaring OOM. shrink_zones had to rely on
zone_reclaimable if shrink_zone didn't make any progress to prevent
from a premature OOM killer invocation - the LRU might be full of dirty
or writeback pages and direct reclaim cannot clean those up.

zone_reclaimable allows to rescan the reclaimable lists several
times and restart if a page is freed. This is really subtle behavior
and it might lead to a livelock when a single freed page keeps allocator
looping but the current task will not be able to allocate that single
page. OOM killer would be more appropriate than looping without any
progress for unbounded amount of time.

This patch changes OOM detection logic and pulls it out from shrink_zone
which is too low to be appropriate for any high level decisions such as OOM
which is per zonelist property. It is __alloc_pages_slowpath which knows
how many attempts have been done and what was the progress so far
therefore it is more appropriate to implement this logic.

The new heuristic is implemented in should_reclaim_retry helper called
from __alloc_pages_slowpath. It tries to be more deterministic and
easier to follow.  It builds on an assumption that retrying makes sense
only if the currently reclaimable memory + free pages would allow the
current allocation request to succeed (as per __zone_watermark_ok) at
least for one zone in the usable zonelist.

This alone wouldn't be sufficient, though, because the writeback might
get stuck and reclaimable pages might be pinned for a really long time
or even depend on the current allocation context. Therefore there is a
backoff mechanism implemented which reduces the reclaim target after
each reclaim round without any progress. This means that we should
eventually converge to only NR_FREE_PAGES as the target and fail on the
wmark check and proceed to OOM. The backoff is simple and linear with
1/16 of the reclaimable pages for each round without any progress. We
are optimistic and reset counter for successful reclaim rounds.

Costly high order pages mostly preserve their semantic and those without
__GFP_REPEAT fail right away while those which have the flag set will
back off after the amount of reclaimable pages reaches equivalent of the
requested order. The only difference is that if there was no progress
during the reclaim we rely on zone watermark check. This is more logical
thing to do than previous 1<<order attempts which were a result of
zone_reclaimable faking the progress.

[[hidden email]: check classzone_idx for shrink_zone]
[[hidden email]: separate the heuristic into should_reclaim_retry]
[[hidden email]: use zone_page_state_snapshot for NR_FREE_PAGES]
[[hidden email]: shrink_zones doesn't need to return anything]
Acked-by: Hillf Danton <[hidden email]>
Signed-off-by: Michal Hocko <[hidden email]>
---
 include/linux/swap.h |   1 +
 mm/page_alloc.c      | 100 ++++++++++++++++++++++++++++++++++++++++++++++-----
 mm/vmscan.c          |  25 +++----------
 3 files changed, 97 insertions(+), 29 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index d18b65c53dbb..b14a2bb33514 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -316,6 +316,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
  struct vm_area_struct *vma);
 
 /* linux/mm/vmscan.c */
+extern unsigned long zone_reclaimable_pages(struct zone *zone);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
  gfp_t gfp_mask, nodemask_t *mask);
 extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d551fe326c33..38302c2041a3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3145,6 +3145,77 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
  return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == GFP_TRANSHUGE;
 }
 
+/*
+ * Maximum number of reclaim retries without any progress before OOM killer
+ * is consider as the only way to move forward.
+ */
+#define MAX_RECLAIM_RETRIES 16
+
+/*
+ * Checks whether it makes sense to retry the reclaim to make a forward progress
+ * for the given allocation request.
+ * The reclaim feedback represented by did_some_progress (any progress during
+ * the last reclaim round), pages_reclaimed (cumulative number of reclaimed
+ * pages) and no_progress_loops (number of reclaim rounds without any progress
+ * in a row) is considered as well as the reclaimable pages on the applicable
+ * zone list (with a backoff mechanism which is a function of no_progress_loops).
+ *
+ * Returns true if a retry is viable or false to enter the oom path.
+ */
+static inline bool
+should_reclaim_retry(gfp_t gfp_mask, unsigned order,
+     struct alloc_context *ac, int alloc_flags,
+     bool did_some_progress, unsigned long pages_reclaimed,
+     int no_progress_loops)
+{
+ struct zone *zone;
+ struct zoneref *z;
+
+ /*
+ * Make sure we converge to OOM if we cannot make any progress
+ * several times in the row.
+ */
+ if (no_progress_loops > MAX_RECLAIM_RETRIES)
+ return false;
+
+ if (order > PAGE_ALLOC_COSTLY_ORDER) {
+ if (pages_reclaimed >= (1<<order))
+ return false;
+
+ if (did_some_progress)
+ return true;
+ }
+
+ /*
+ * Keep reclaiming pages while there is a chance this will lead somewhere.
+ * If none of the target zones can satisfy our allocation request even
+ * if all reclaimable pages are considered then we are screwed and have
+ * to go OOM.
+ */
+ for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
+ ac->nodemask) {
+ unsigned long available;
+
+ available = zone_reclaimable_pages(zone);
+ available -= DIV_ROUND_UP(no_progress_loops * available,
+  MAX_RECLAIM_RETRIES);
+ available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
+
+ /*
+ * Would the allocation succeed if we reclaimed the whole
+ * available?
+ */
+ if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
+ ac->high_zoneidx, alloc_flags, available)) {
+ /* Wait for some write requests to complete then retry */
+ wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
+ return true;
+ }
+ }
+
+ return false;
+}
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  struct alloc_context *ac)
@@ -3156,6 +3227,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  unsigned long did_some_progress;
  enum migrate_mode migration_mode = MIGRATE_ASYNC;
  enum compact_result compact_result;
+ int no_progress_loops = 0;
 
  /*
  * In the slowpath, we sanity check order to avoid ever trying to
@@ -3284,23 +3356,35 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  if (gfp_mask & __GFP_NORETRY)
  goto noretry;
 
- /* Keep reclaiming pages as long as there is reasonable progress */
- pages_reclaimed += did_some_progress;
- if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) ||
-    ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
- /* Wait for some write requests to complete then retry */
- wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
- goto retry;
+ /*
+ * Do not retry costly high order allocations unless they are
+ * __GFP_REPEAT
+ */
+ if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
+ goto noretry;
+
+ if (did_some_progress) {
+ no_progress_loops = 0;
+ pages_reclaimed += did_some_progress;
+ } else {
+ no_progress_loops++;
  }
 
+ if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
+ did_some_progress > 0, pages_reclaimed,
+ no_progress_loops))
+ goto retry;
+
  /* Reclaim has failed us, start killing things */
  page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
  if (page)
  goto got_pg;
 
  /* Retry as long as the OOM killer is making progress */
- if (did_some_progress)
+ if (did_some_progress) {
+ no_progress_loops = 0;
  goto retry;
+ }
 
 noretry:
  /*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3e6347e2a5fc..a2ba60aa7b88 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -191,7 +191,7 @@ static bool sane_reclaim(struct scan_control *sc)
 }
 #endif
 
-static unsigned long zone_reclaimable_pages(struct zone *zone)
+unsigned long zone_reclaimable_pages(struct zone *zone)
 {
  unsigned long nr;
 
@@ -2530,10 +2530,8 @@ static inline bool compaction_ready(struct zone *zone, int order, int classzone_
  *
  * If a zone is deemed to be full of pinned pages then just give it a light
  * scan then give up on it.
- *
- * Returns true if a zone was reclaimable.
  */
-static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
+static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 {
  struct zoneref *z;
  struct zone *zone;
@@ -2541,7 +2539,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
  unsigned long nr_soft_scanned;
  gfp_t orig_mask;
  enum zone_type requested_highidx = gfp_zone(sc->gfp_mask);
- bool reclaimable = false;
 
  /*
  * If the number of buffer_heads in the machine exceeds the maximum
@@ -2606,17 +2603,10 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
  &nr_soft_scanned);
  sc->nr_reclaimed += nr_soft_reclaimed;
  sc->nr_scanned += nr_soft_scanned;
- if (nr_soft_reclaimed)
- reclaimable = true;
  /* need some check for avoid more shrink_zone() */
  }
 
- if (shrink_zone(zone, sc, zone_idx(zone) == classzone_idx))
- reclaimable = true;
-
- if (global_reclaim(sc) &&
-    !reclaimable && zone_reclaimable(zone))
- reclaimable = true;
+ shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);
  }
 
  /*
@@ -2624,8 +2614,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
  * promoted it to __GFP_HIGHMEM.
  */
  sc->gfp_mask = orig_mask;
-
- return reclaimable;
 }
 
 /*
@@ -2650,7 +2638,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
  int initial_priority = sc->priority;
  unsigned long total_scanned = 0;
  unsigned long writeback_threshold;
- bool zones_reclaimable;
 retry:
  delayacct_freepages_start();
 
@@ -2661,7 +2648,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
  vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
  sc->priority);
  sc->nr_scanned = 0;
- zones_reclaimable = shrink_zones(zonelist, sc);
+ shrink_zones(zonelist, sc);
 
  total_scanned += sc->nr_scanned;
  if (sc->nr_reclaimed >= sc->nr_to_reclaim)
@@ -2708,10 +2695,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
  goto retry;
  }
 
- /* Any of the zones still reclaimable?  Don't OOM. */
- if (zones_reclaimable)
- return 1;
-
  return 0;
 }
 
--
2.8.0.rc3

Reply | Threaded
Open this post in threaded view
|

[PATCH 07/14] mm, compaction: Simplify __alloc_pages_direct_compact feedback interface

Michal Hocko-4
In reply to this post by Michal Hocko-4
From: Michal Hocko <[hidden email]>

__alloc_pages_direct_compact communicates potential back off by two
variables:
        - deferred_compaction tells that the compaction returned
          COMPACT_DEFERRED
        - contended_compaction is set when there is a contention on
          zone->lock resp. zone->lru_lock locks

__alloc_pages_slowpath then backs of for THP allocation requests to
prevent from long stalls. This is rather messy and it would be much
cleaner to return a single compact result value and hide all the nasty
details into __alloc_pages_direct_compact.

This patch shouldn't introduce any functional changes.

Acked-by: Vlastimil Babka <[hidden email]>
Signed-off-by: Michal Hocko <[hidden email]>
---
 mm/page_alloc.c | 67 ++++++++++++++++++++++++++-------------------------------
 1 file changed, 31 insertions(+), 36 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 06af8a757d52..350d13f3709b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2944,29 +2944,21 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 static struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
  int alloc_flags, const struct alloc_context *ac,
- enum migrate_mode mode, int *contended_compaction,
- bool *deferred_compaction)
+ enum migrate_mode mode, enum compact_result *compact_result)
 {
- enum compact_result compact_result;
  struct page *page;
+ int contended_compaction;
 
  if (!order)
  return NULL;
 
  current->flags |= PF_MEMALLOC;
- compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
- mode, contended_compaction);
+ *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
+ mode, &contended_compaction);
  current->flags &= ~PF_MEMALLOC;
 
- switch (compact_result) {
- case COMPACT_DEFERRED:
- *deferred_compaction = true;
- /* fall-through */
- case COMPACT_SKIPPED:
+ if (*compact_result <= COMPACT_INACTIVE)
  return NULL;
- default:
- break;
- }
 
  /*
  * At least in one zone compaction wasn't deferred or skipped, so let's
@@ -2992,6 +2984,24 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
  */
  count_vm_event(COMPACTFAIL);
 
+ /*
+ * In all zones where compaction was attempted (and not
+ * deferred or skipped), lock contention has been detected.
+ * For THP allocation we do not want to disrupt the others
+ * so we fallback to base pages instead.
+ */
+ if (contended_compaction == COMPACT_CONTENDED_LOCK)
+ *compact_result = COMPACT_CONTENDED;
+
+ /*
+ * If compaction was aborted due to need_resched(), we do not
+ * want to further increase allocation latency, unless it is
+ * khugepaged trying to collapse.
+ */
+ if (contended_compaction == COMPACT_CONTENDED_SCHED
+ && !(current->flags & PF_KTHREAD))
+ *compact_result = COMPACT_CONTENDED;
+
  cond_resched();
 
  return NULL;
@@ -3000,8 +3010,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 static inline struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
  int alloc_flags, const struct alloc_context *ac,
- enum migrate_mode mode, int *contended_compaction,
- bool *deferred_compaction)
+ enum migrate_mode mode, enum compact_result *compact_result)
 {
  return NULL;
 }
@@ -3146,8 +3155,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  unsigned long pages_reclaimed = 0;
  unsigned long did_some_progress;
  enum migrate_mode migration_mode = MIGRATE_ASYNC;
- bool deferred_compaction = false;
- int contended_compaction = COMPACT_CONTENDED_NONE;
+ enum compact_result compact_result;
 
  /*
  * In the slowpath, we sanity check order to avoid ever trying to
@@ -3245,8 +3253,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  */
  page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
  migration_mode,
- &contended_compaction,
- &deferred_compaction);
+ &compact_result);
  if (page)
  goto got_pg;
 
@@ -3259,25 +3266,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  * to heavily disrupt the system, so we fail the allocation
  * instead of entering direct reclaim.
  */
- if (deferred_compaction)
- goto nopage;
-
- /*
- * In all zones where compaction was attempted (and not
- * deferred or skipped), lock contention has been detected.
- * For THP allocation we do not want to disrupt the others
- * so we fallback to base pages instead.
- */
- if (contended_compaction == COMPACT_CONTENDED_LOCK)
+ if (compact_result == COMPACT_DEFERRED)
  goto nopage;
 
  /*
- * If compaction was aborted due to need_resched(), we do not
- * want to further increase allocation latency, unless it is
- * khugepaged trying to collapse.
+ * Compaction is contended so rather back off than cause
+ * excessive stalls.
  */
- if (contended_compaction == COMPACT_CONTENDED_SCHED
- && !(current->flags & PF_KTHREAD))
+ if(compact_result == COMPACT_CONTENDED)
  goto nopage;
  }
 
@@ -3325,8 +3321,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  */
  page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
     ac, migration_mode,
-    &contended_compaction,
-    &deferred_compaction);
+    &compact_result);
  if (page)
  goto got_pg;
 nopage:
--
2.8.0.rc3

Reply | Threaded
Open this post in threaded view
|

[PATCH 04/14] mm, compaction: distinguish COMPACT_DEFERRED from COMPACT_SKIPPED

Michal Hocko-4
In reply to this post by Michal Hocko-4
From: Michal Hocko <[hidden email]>

try_to_compact_pages can currently return COMPACT_SKIPPED even when the
compaction is defered for some zone just because zone DMA is skipped
in 99% of cases due to watermark checks. This makes COMPACT_DEFERRED
basically unusable for the page allocator as a feedback mechanism.

Make sure we distinguish those two states properly and switch their
ordering in the enum. This would mean that the COMPACT_SKIPPED will be
returned only when all eligible zones are skipped.

As a result COMPACT_DEFERRED handling for THP in __alloc_pages_slowpath
will be more precise and we would bail out rather than reclaim.

Acked-by: Vlastimil Babka <[hidden email]>
Signed-off-by: Michal Hocko <[hidden email]>
---
 include/linux/compaction.h        | 7 +++++--
 include/trace/events/compaction.h | 2 +-
 mm/compaction.c                   | 8 +++++---
 3 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 4458fd94170f..7e177d111c39 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -4,13 +4,16 @@
 /* Return values for compact_zone() and try_to_compact_pages() */
 /* When adding new states, please adjust include/trace/events/compaction.h */
 enum compact_result {
- /* compaction didn't start as it was deferred due to past failures */
- COMPACT_DEFERRED,
  /*
  * compaction didn't start as it was not possible or direct reclaim
  * was more suitable
  */
  COMPACT_SKIPPED,
+ /* compaction didn't start as it was deferred due to past failures */
+ COMPACT_DEFERRED,
+ /* compaction not active last round */
+ COMPACT_INACTIVE = COMPACT_DEFERRED,
+
  /* compaction should continue to another pageblock */
  COMPACT_CONTINUE,
  /*
diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
index e215bf68f521..6ba16c86d7db 100644
--- a/include/trace/events/compaction.h
+++ b/include/trace/events/compaction.h
@@ -10,8 +10,8 @@
 #include <trace/events/mmflags.h>
 
 #define COMPACTION_STATUS \
- EM( COMPACT_DEFERRED, "deferred") \
  EM( COMPACT_SKIPPED, "skipped") \
+ EM( COMPACT_DEFERRED, "deferred") \
  EM( COMPACT_CONTINUE, "continue") \
  EM( COMPACT_PARTIAL, "partial") \
  EM( COMPACT_COMPLETE, "complete") \
diff --git a/mm/compaction.c b/mm/compaction.c
index b06de27b7f72..13709e33a2fc 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1637,7 +1637,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
  int may_perform_io = gfp_mask & __GFP_IO;
  struct zoneref *z;
  struct zone *zone;
- enum compact_result rc = COMPACT_DEFERRED;
+ enum compact_result rc = COMPACT_SKIPPED;
  int all_zones_contended = COMPACT_CONTENDED_LOCK; /* init for &= op */
 
  *contended = COMPACT_CONTENDED_NONE;
@@ -1654,8 +1654,10 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
  enum compact_result status;
  int zone_contended;
 
- if (compaction_deferred(zone, order))
+ if (compaction_deferred(zone, order)) {
+ rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
  continue;
+ }
 
  status = compact_zone_order(zone, order, gfp_mask, mode,
  &zone_contended, alloc_flags,
@@ -1726,7 +1728,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
  * If at least one zone wasn't deferred or skipped, we report if all
  * zones that were tried were lock contended.
  */
- if (rc > COMPACT_SKIPPED && all_zones_contended)
+ if (rc > COMPACT_INACTIVE && all_zones_contended)
  *contended = COMPACT_CONTENDED_LOCK;
 
  return rc;
--
2.8.0.rc3

Reply | Threaded
Open this post in threaded view
|

[PATCH 02/14] mm, compaction: change COMPACT_ constants into enum

Michal Hocko-4
In reply to this post by Michal Hocko-4
From: Michal Hocko <[hidden email]>

compaction code is doing weird dances between
COMPACT_FOO -> int -> unsigned long

but there doesn't seem to be any reason for that. All functions which
return/use one of those constants are not expecting any other value
so it really makes sense to define an enum for them and make it clear
that no other values are expected.

This is a pure cleanup and shouldn't introduce any functional changes.

Signed-off-by: Michal Hocko <[hidden email]>
Acked-by: Vlastimil Babka <[hidden email]>
Acked-by: Hillf Danton <[hidden email]>
---
 include/linux/compaction.h | 45 +++++++++++++++++++++++++++------------------
 mm/compaction.c            | 27 ++++++++++++++-------------
 mm/page_alloc.c            |  2 +-
 3 files changed, 42 insertions(+), 32 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index d7c8de583a23..4458fd94170f 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -2,21 +2,29 @@
 #define _LINUX_COMPACTION_H
 
 /* Return values for compact_zone() and try_to_compact_pages() */
-/* compaction didn't start as it was deferred due to past failures */
-#define COMPACT_DEFERRED 0
-/* compaction didn't start as it was not possible or direct reclaim was more suitable */
-#define COMPACT_SKIPPED 1
-/* compaction should continue to another pageblock */
-#define COMPACT_CONTINUE 2
-/* direct compaction partially compacted a zone and there are suitable pages */
-#define COMPACT_PARTIAL 3
-/* The full zone was compacted */
-#define COMPACT_COMPLETE 4
-/* For more detailed tracepoint output */
-#define COMPACT_NO_SUITABLE_PAGE 5
-#define COMPACT_NOT_SUITABLE_ZONE 6
-#define COMPACT_CONTENDED 7
 /* When adding new states, please adjust include/trace/events/compaction.h */
+enum compact_result {
+ /* compaction didn't start as it was deferred due to past failures */
+ COMPACT_DEFERRED,
+ /*
+ * compaction didn't start as it was not possible or direct reclaim
+ * was more suitable
+ */
+ COMPACT_SKIPPED,
+ /* compaction should continue to another pageblock */
+ COMPACT_CONTINUE,
+ /*
+ * direct compaction partially compacted a zone and there are suitable
+ * pages
+ */
+ COMPACT_PARTIAL,
+ /* The full zone was compacted */
+ COMPACT_COMPLETE,
+ /* For more detailed tracepoint output */
+ COMPACT_NO_SUITABLE_PAGE,
+ COMPACT_NOT_SUITABLE_ZONE,
+ COMPACT_CONTENDED,
+};
 
 /* Used to signal whether compaction detected need_sched() or lock contention */
 /* No contention detected */
@@ -38,12 +46,13 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
 extern int sysctl_compact_unevictable_allowed;
 
 extern int fragmentation_index(struct zone *zone, unsigned int order);
-extern unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
+extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
+ unsigned int order,
  int alloc_flags, const struct alloc_context *ac,
  enum migrate_mode mode, int *contended);
 extern void compact_pgdat(pg_data_t *pgdat, int order);
 extern void reset_isolation_suitable(pg_data_t *pgdat);
-extern unsigned long compaction_suitable(struct zone *zone, int order,
+extern enum compact_result compaction_suitable(struct zone *zone, int order,
  int alloc_flags, int classzone_idx);
 
 extern void defer_compaction(struct zone *zone, int order);
@@ -57,7 +66,7 @@ extern void kcompactd_stop(int nid);
 extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx);
 
 #else
-static inline unsigned long try_to_compact_pages(gfp_t gfp_mask,
+static inline enum compact_result try_to_compact_pages(gfp_t gfp_mask,
  unsigned int order, int alloc_flags,
  const struct alloc_context *ac,
  enum migrate_mode mode, int *contended)
@@ -73,7 +82,7 @@ static inline void reset_isolation_suitable(pg_data_t *pgdat)
 {
 }
 
-static inline unsigned long compaction_suitable(struct zone *zone, int order,
+static inline enum compact_result compaction_suitable(struct zone *zone, int order,
  int alloc_flags, int classzone_idx)
 {
  return COMPACT_SKIPPED;
diff --git a/mm/compaction.c b/mm/compaction.c
index 8cc495042303..8ae7b1c46c72 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1281,7 +1281,7 @@ static inline bool is_via_compact_memory(int order)
  return order == -1;
 }
 
-static int __compact_finished(struct zone *zone, struct compact_control *cc,
+static enum compact_result __compact_finished(struct zone *zone, struct compact_control *cc,
     const int migratetype)
 {
  unsigned int order;
@@ -1344,8 +1344,9 @@ static int __compact_finished(struct zone *zone, struct compact_control *cc,
  return COMPACT_NO_SUITABLE_PAGE;
 }
 
-static int compact_finished(struct zone *zone, struct compact_control *cc,
-    const int migratetype)
+static enum compact_result compact_finished(struct zone *zone,
+ struct compact_control *cc,
+ const int migratetype)
 {
  int ret;
 
@@ -1364,7 +1365,7 @@ static int compact_finished(struct zone *zone, struct compact_control *cc,
  *   COMPACT_PARTIAL  - If the allocation would succeed without compaction
  *   COMPACT_CONTINUE - If compaction should run now
  */
-static unsigned long __compaction_suitable(struct zone *zone, int order,
+static enum compact_result __compaction_suitable(struct zone *zone, int order,
  int alloc_flags, int classzone_idx)
 {
  int fragindex;
@@ -1409,10 +1410,10 @@ static unsigned long __compaction_suitable(struct zone *zone, int order,
  return COMPACT_CONTINUE;
 }
 
-unsigned long compaction_suitable(struct zone *zone, int order,
+enum compact_result compaction_suitable(struct zone *zone, int order,
  int alloc_flags, int classzone_idx)
 {
- unsigned long ret;
+ enum compact_result ret;
 
  ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx);
  trace_mm_compaction_suitable(zone, order, ret);
@@ -1422,9 +1423,9 @@ unsigned long compaction_suitable(struct zone *zone, int order,
  return ret;
 }
 
-static int compact_zone(struct zone *zone, struct compact_control *cc)
+static enum compact_result compact_zone(struct zone *zone, struct compact_control *cc)
 {
- int ret;
+ enum compact_result ret;
  unsigned long start_pfn = zone->zone_start_pfn;
  unsigned long end_pfn = zone_end_pfn(zone);
  const int migratetype = gfpflags_to_migratetype(cc->gfp_mask);
@@ -1588,11 +1589,11 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
  return ret;
 }
 
-static unsigned long compact_zone_order(struct zone *zone, int order,
+static enum compact_result compact_zone_order(struct zone *zone, int order,
  gfp_t gfp_mask, enum migrate_mode mode, int *contended,
  int alloc_flags, int classzone_idx)
 {
- unsigned long ret;
+ enum compact_result ret;
  struct compact_control cc = {
  .nr_freepages = 0,
  .nr_migratepages = 0,
@@ -1631,7 +1632,7 @@ int sysctl_extfrag_threshold = 500;
  *
  * This is the main entry point for direct page compaction.
  */
-unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
+enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
  int alloc_flags, const struct alloc_context *ac,
  enum migrate_mode mode, int *contended)
 {
@@ -1639,7 +1640,7 @@ unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
  int may_perform_io = gfp_mask & __GFP_IO;
  struct zoneref *z;
  struct zone *zone;
- int rc = COMPACT_DEFERRED;
+ enum compact_result rc = COMPACT_DEFERRED;
  int all_zones_contended = COMPACT_CONTENDED_LOCK; /* init for &= op */
 
  *contended = COMPACT_CONTENDED_NONE;
@@ -1653,7 +1654,7 @@ unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
  /* Compact each zone in the list */
  for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
  ac->nodemask) {
- int status;
+ enum compact_result status;
  int zone_contended;
 
  if (compaction_deferred(zone, order))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c4efafc38273..06af8a757d52 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2947,7 +2947,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
  enum migrate_mode mode, int *contended_compaction,
  bool *deferred_compaction)
 {
- unsigned long compact_result;
+ enum compact_result compact_result;
  struct page *page;
 
  if (!order)
--
2.8.0.rc3

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 01/14] vmscan: consider classzone_idx in compaction_ready

Hillf Danton-2
In reply to this post by Michal Hocko-4
>
> From: Michal Hocko <[hidden email]>
>
> while playing with the oom detection rework [1] I have noticed
> that my heavy order-9 (hugetlb) load close to OOM ended up in an
> endless loop where the reclaim hasn't made any progress but
> did_some_progress didn't reflect that and compaction_suitable
> was backing off because no zone is above low wmark + 1 << order.
>
> It turned out that this is in fact an old standing bug in compaction_ready
> which ignores the requested_highidx and did the watermark check for
> 0 classzone_idx. This succeeds for zone DMA most of the time as the zone
> is mostly unused because of lowmem protection. This also means that the
> OOM killer wouldn't be triggered for higher order requests even when
> there is no reclaim progress and we essentially rely on order-0 request
> to find this out.

Thanks.

> This has been broken in one way or another since
> fe4b1b244bdb ("mm: vmscan: when reclaiming for compaction, ensure there
> are sufficient free pages available") but only since 7335084d446b ("mm:
> vmscan: do not OOM if aborting reclaim to start compaction") we are not
> invoking the OOM killer based on the wrong calculation.
>
> Propagate requested_highidx down to compaction_ready and use it for both
> the watermak check and compaction_suitable to fix this issue.
>
> [1] http://lkml.kernel.org/r/1459855533-4600-1-git-send-email-mhocko@...
>
> Acked-by: Vlastimil Babka <[hidden email]>
> Signed-off-by: Michal Hocko <[hidden email]>
> ---

Acked-by: Hillf Danton <[hidden email]>

>  mm/vmscan.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c839adc13efd..3e6347e2a5fc 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2482,7 +2482,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
>   * Returns true if compaction should go ahead for a high-order request, or
>   * the high-order allocation would succeed without compaction.
>   */
> -static inline bool compaction_ready(struct zone *zone, int order)
> +static inline bool compaction_ready(struct zone *zone, int order, int classzone_idx)
>  {
>   unsigned long balance_gap, watermark;
>   bool watermark_ok;
> @@ -2496,7 +2496,7 @@ static inline bool compaction_ready(struct zone *zone, int order)
>   balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP(
>   zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO));
>   watermark = high_wmark_pages(zone) + balance_gap + (2UL << order);
> - watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, 0);
> + watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, classzone_idx);
>
>   /*
>   * If compaction is deferred, reclaim up to a point where
> @@ -2509,7 +2509,7 @@ static inline bool compaction_ready(struct zone *zone, int order)
>   * If compaction is not ready to start and allocation is not likely
>   * to succeed without it, then keep reclaiming.
>   */
> - if (compaction_suitable(zone, order, 0, 0) == COMPACT_SKIPPED)
> + if (compaction_suitable(zone, order, 0, classzone_idx) == COMPACT_SKIPPED)
>   return false;
>
>   return watermark_ok;
> @@ -2589,7 +2589,7 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>   if (IS_ENABLED(CONFIG_COMPACTION) &&
>      sc->order > PAGE_ALLOC_COSTLY_ORDER &&
>      zonelist_zone_idx(z) <= requested_highidx &&
> -    compaction_ready(zone, sc->order)) {
> +    compaction_ready(zone, sc->order, requested_highidx)) {
>   sc->compaction_ready = true;
>   continue;
>   }
> --
> 2.8.0.rc3

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 05/14] mm, compaction: distinguish between full and partial COMPACT_COMPLETE

Hillf Danton-2
In reply to this post by Michal Hocko-4
>
> From: Michal Hocko <[hidden email]>
>
> COMPACT_COMPLETE now means that compaction and free scanner met. This is
> not very useful information if somebody just wants to use this feedback
> and make any decisions based on that. The current caller might be a poor
> guy who just happened to scan tiny portion of the zone and that could be
> the reason no suitable pages were compacted. Make sure we distinguish
> the full and partial zone walks.
>
> Consumers should treat COMPACT_PARTIAL_SKIPPED as a potential success
> and be optimistic in retrying.
>
> The existing users of COMPACT_COMPLETE are conservatively changed to
> use COMPACT_PARTIAL_SKIPPED as well but some of them should be probably
> reconsidered and only defer the compaction only for COMPACT_COMPLETE
> with the new semantic.
>
> This patch shouldn't introduce any functional changes.
>
> Acked-by: Vlastimil Babka <[hidden email]>
> Signed-off-by: Michal Hocko <[hidden email]>
> ---

Acked-by: Hillf Danton <[hidden email]>

>  include/linux/compaction.h        | 10 +++++++++-
>  include/trace/events/compaction.h |  1 +
>  mm/compaction.c                   | 14 +++++++++++---
>  mm/internal.h                     |  1 +
>  4 files changed, 22 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 7e177d111c39..7c4de92d12cc 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -21,7 +21,15 @@ enum compact_result {
>   * pages
>   */
>   COMPACT_PARTIAL,
> - /* The full zone was compacted */
> + /*
> + * direct compaction has scanned part of the zone but wasn't successfull
> + * to compact suitable pages.
> + */
> + COMPACT_PARTIAL_SKIPPED,
> + /*
> + * The full zone was compacted scanned but wasn't successfull to compact
> + * suitable pages.
> + */
>   COMPACT_COMPLETE,
>   /* For more detailed tracepoint output */
>   COMPACT_NO_SUITABLE_PAGE,
> diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
> index 6ba16c86d7db..36e2d6fb1360 100644
> --- a/include/trace/events/compaction.h
> +++ b/include/trace/events/compaction.h
> @@ -14,6 +14,7 @@
>   EM( COMPACT_DEFERRED, "deferred") \
>   EM( COMPACT_CONTINUE, "continue") \
>   EM( COMPACT_PARTIAL, "partial") \
> + EM( COMPACT_PARTIAL_SKIPPED, "partial_skipped") \
>   EM( COMPACT_COMPLETE, "complete") \
>   EM( COMPACT_NO_SUITABLE_PAGE, "no_suitable_page") \
>   EM( COMPACT_NOT_SUITABLE_ZONE, "not_suitable_zone") \
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 13709e33a2fc..e2e487cea5ea 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1304,7 +1304,10 @@ static enum compact_result __compact_finished(struct zone *zone, struct compact_
>   if (cc->direct_compaction)
>   zone->compact_blockskip_flush = true;
>
> - return COMPACT_COMPLETE;
> + if (cc->whole_zone)
> + return COMPACT_COMPLETE;
> + else
> + return COMPACT_PARTIAL_SKIPPED;
>   }
>
>   if (is_via_compact_memory(cc->order))
> @@ -1463,6 +1466,10 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
>   zone->compact_cached_migrate_pfn[0] = cc->migrate_pfn;
>   zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn;
>   }
> +
> + if (cc->migrate_pfn == start_pfn)
> + cc->whole_zone = true;
> +
>   cc->last_migrated_pfn = 0;
>
>   trace_mm_compaction_begin(start_pfn, cc->migrate_pfn,
> @@ -1693,7 +1700,8 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
>   goto break_loop;
>   }
>
> - if (mode != MIGRATE_ASYNC && status == COMPACT_COMPLETE) {
> + if (mode != MIGRATE_ASYNC && (status == COMPACT_COMPLETE ||
> + status == COMPACT_PARTIAL_SKIPPED)) {
>   /*
>   * We think that allocation won't succeed in this zone
>   * so we defer compaction there. If it ends up
> @@ -1939,7 +1947,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
>   cc.classzone_idx, 0)) {
>   success = true;
>   compaction_defer_reset(zone, cc.order, false);
> - } else if (status == COMPACT_COMPLETE) {
> + } else if (status == COMPACT_PARTIAL_SKIPPED || status == COMPACT_COMPLETE) {
>   /*
>   * We use sync migration mode here, so we defer like
>   * sync direct compaction does.
> diff --git a/mm/internal.h b/mm/internal.h
> index e9aacea1a0d1..4423dfe69382 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -182,6 +182,7 @@ struct compact_control {
>   enum migrate_mode mode; /* Async or sync migration mode */
>   bool ignore_skip_hint; /* Scan blocks even if marked skip */
>   bool direct_compaction; /* False from kcompactd or /proc/... */
> + bool whole_zone; /* Whole zone has been scanned */
>   int order; /* order a direct compactor needs */
>   const gfp_t gfp_mask; /* gfp mask of a direct compactor */
>   const int alloc_flags; /* alloc flags of a direct compactor */
> --
> 2.8.0.rc3

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 06/14] mm, compaction: Update compaction_result ordering

Hillf Danton-2
In reply to this post by Michal Hocko-4
>
> From: Michal Hocko <[hidden email]>
>
> compaction_result will be used as the primary feedback channel for
> compaction users. At the same time try_to_compact_pages (and potentially
> others) assume a certain ordering where a more specific feedback takes
> precendence. This gets a bit awkward when we have conflicting feedback
> from different zones. E.g one returing COMPACT_COMPLETE meaning the full
> zone has been scanned without any outcome while other returns with
> COMPACT_PARTIAL aka made some progress. The caller should get
> COMPACT_PARTIAL because that means that the compaction still can make
> some progress. The same applies for COMPACT_PARTIAL vs.
> COMPACT_PARTIAL_SKIPPED. Reorder PARTIAL to be the largest one so the
> larger the value is the more progress we have done.
>
> Acked-by: Vlastimil Babka <[hidden email]>
> Signed-off-by: Michal Hocko <[hidden email]>
> ---

Acked-by: Hillf Danton <[hidden email]>

>  include/linux/compaction.h | 26 ++++++++++++++++----------
>  1 file changed, 16 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 7c4de92d12cc..a7b9091ff349 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -4,6 +4,8 @@
>  /* Return values for compact_zone() and try_to_compact_pages() */
>  /* When adding new states, please adjust include/trace/events/compaction.h */
>  enum compact_result {
> + /* For more detailed tracepoint output - internal to compaction */
> + COMPACT_NOT_SUITABLE_ZONE,
>   /*
>   * compaction didn't start as it was not possible or direct reclaim
>   * was more suitable
> @@ -11,30 +13,34 @@ enum compact_result {
>   COMPACT_SKIPPED,
>   /* compaction didn't start as it was deferred due to past failures */
>   COMPACT_DEFERRED,
> +
>   /* compaction not active last round */
>   COMPACT_INACTIVE = COMPACT_DEFERRED,
>
> + /* For more detailed tracepoint output - internal to compaction */
> + COMPACT_NO_SUITABLE_PAGE,
>   /* compaction should continue to another pageblock */
>   COMPACT_CONTINUE,
> +
>   /*
> - * direct compaction partially compacted a zone and there are suitable
> - * pages
> + * The full zone was compacted scanned but wasn't successfull to compact
> + * suitable pages.
>   */
> - COMPACT_PARTIAL,
> + COMPACT_COMPLETE,
>   /*
>   * direct compaction has scanned part of the zone but wasn't successfull
>   * to compact suitable pages.
>   */
>   COMPACT_PARTIAL_SKIPPED,
> +
> + /* compaction terminated prematurely due to lock contentions */
> + COMPACT_CONTENDED,
> +
>   /*
> - * The full zone was compacted scanned but wasn't successfull to compact
> - * suitable pages.
> + * direct compaction partially compacted a zone and there might be
> + * suitable pages
>   */
> - COMPACT_COMPLETE,
> - /* For more detailed tracepoint output */
> - COMPACT_NO_SUITABLE_PAGE,
> - COMPACT_NOT_SUITABLE_ZONE,
> - COMPACT_CONTENDED,
> + COMPACT_PARTIAL,
>  };
>
>  /* Used to signal whether compaction detected need_sched() or lock contention */
> --
> 2.8.0.rc3

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 07/14] mm, compaction: Simplify __alloc_pages_direct_compact feedback interface

Hillf Danton-2
In reply to this post by Michal Hocko-4
>
> From: Michal Hocko <[hidden email]>
>
> __alloc_pages_direct_compact communicates potential back off by two
> variables:
> - deferred_compaction tells that the compaction returned
>  COMPACT_DEFERRED
> - contended_compaction is set when there is a contention on
>  zone->lock resp. zone->lru_lock locks
>
> __alloc_pages_slowpath then backs of for THP allocation requests to
> prevent from long stalls. This is rather messy and it would be much
> cleaner to return a single compact result value and hide all the nasty
> details into __alloc_pages_direct_compact.
>
> This patch shouldn't introduce any functional changes.
>
> Acked-by: Vlastimil Babka <[hidden email]>
> Signed-off-by: Michal Hocko <[hidden email]>
> ---

Acked-by: Hillf Danton <[hidden email]>

>  mm/page_alloc.c | 67 ++++++++++++++++++++++++++-------------------------------
>  1 file changed, 31 insertions(+), 36 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 06af8a757d52..350d13f3709b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2944,29 +2944,21 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  static struct page *
>  __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>   int alloc_flags, const struct alloc_context *ac,
> - enum migrate_mode mode, int *contended_compaction,
> - bool *deferred_compaction)
> + enum migrate_mode mode, enum compact_result *compact_result)
>  {
> - enum compact_result compact_result;
>   struct page *page;
> + int contended_compaction;
>
>   if (!order)
>   return NULL;
>
>   current->flags |= PF_MEMALLOC;
> - compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
> - mode, contended_compaction);
> + *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
> + mode, &contended_compaction);
>   current->flags &= ~PF_MEMALLOC;
>
> - switch (compact_result) {
> - case COMPACT_DEFERRED:
> - *deferred_compaction = true;
> - /* fall-through */
> - case COMPACT_SKIPPED:
> + if (*compact_result <= COMPACT_INACTIVE)
>   return NULL;
> - default:
> - break;
> - }
>
>   /*
>   * At least in one zone compaction wasn't deferred or skipped, so let's
> @@ -2992,6 +2984,24 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>   */
>   count_vm_event(COMPACTFAIL);
>
> + /*
> + * In all zones where compaction was attempted (and not
> + * deferred or skipped), lock contention has been detected.
> + * For THP allocation we do not want to disrupt the others
> + * so we fallback to base pages instead.
> + */
> + if (contended_compaction == COMPACT_CONTENDED_LOCK)
> + *compact_result = COMPACT_CONTENDED;
> +
> + /*
> + * If compaction was aborted due to need_resched(), we do not
> + * want to further increase allocation latency, unless it is
> + * khugepaged trying to collapse.
> + */
> + if (contended_compaction == COMPACT_CONTENDED_SCHED
> + && !(current->flags & PF_KTHREAD))
> + *compact_result = COMPACT_CONTENDED;
> +
>   cond_resched();
>
>   return NULL;
> @@ -3000,8 +3010,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  static inline struct page *
>  __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>   int alloc_flags, const struct alloc_context *ac,
> - enum migrate_mode mode, int *contended_compaction,
> - bool *deferred_compaction)
> + enum migrate_mode mode, enum compact_result *compact_result)
>  {
>   return NULL;
>  }
> @@ -3146,8 +3155,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>   unsigned long pages_reclaimed = 0;
>   unsigned long did_some_progress;
>   enum migrate_mode migration_mode = MIGRATE_ASYNC;
> - bool deferred_compaction = false;
> - int contended_compaction = COMPACT_CONTENDED_NONE;
> + enum compact_result compact_result;
>
>   /*
>   * In the slowpath, we sanity check order to avoid ever trying to
> @@ -3245,8 +3253,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>   */
>   page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
>   migration_mode,
> - &contended_compaction,
> - &deferred_compaction);
> + &compact_result);
>   if (page)
>   goto got_pg;
>
> @@ -3259,25 +3266,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>   * to heavily disrupt the system, so we fail the allocation
>   * instead of entering direct reclaim.
>   */
> - if (deferred_compaction)
> - goto nopage;
> -
> - /*
> - * In all zones where compaction was attempted (and not
> - * deferred or skipped), lock contention has been detected.
> - * For THP allocation we do not want to disrupt the others
> - * so we fallback to base pages instead.
> - */
> - if (contended_compaction == COMPACT_CONTENDED_LOCK)
> + if (compact_result == COMPACT_DEFERRED)
>   goto nopage;
>
>   /*
> - * If compaction was aborted due to need_resched(), we do not
> - * want to further increase allocation latency, unless it is
> - * khugepaged trying to collapse.
> + * Compaction is contended so rather back off than cause
> + * excessive stalls.
>   */
> - if (contended_compaction == COMPACT_CONTENDED_SCHED
> - && !(current->flags & PF_KTHREAD))
> + if(compact_result == COMPACT_CONTENDED)
>   goto nopage;
>   }
>
> @@ -3325,8 +3321,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>   */
>   page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
>      ac, migration_mode,
> -    &contended_compaction,
> -    &deferred_compaction);
> +    &compact_result);
>   if (page)
>   goto got_pg;
>  nopage:
> --
> 2.8.0.rc3

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 08/14] mm, compaction: Abstract compaction feedback to helpers

Hillf Danton-2
In reply to this post by Michal Hocko-4
>
> From: Michal Hocko <[hidden email]>
>
> Compaction can provide a wild variation of feedback to the caller. Many
> of them are implementation specific and the caller of the compaction
> (especially the page allocator) shouldn't be bound to specifics of the
> current implementation.
>
> This patch abstracts the feedback into three basic types:
> - compaction_made_progress - compaction was active and made some
>  progress.
> - compaction_failed - compaction failed and further attempts to
>  invoke it would most probably fail and therefore it is not
>  worth retrying
> - compaction_withdrawn - compaction wasn't invoked for an
>           implementation specific reasons. In the current implementation
>           it means that the compaction was deferred, contended or the
>           page scanners met too early without any progress. Retrying is
>           still worthwhile.
>
> [[hidden email]: do not change thp back off behavior]
> Signed-off-by: Michal Hocko <[hidden email]>
> ---

Acked-by: Hillf Danton <[hidden email]>

>  include/linux/compaction.h | 79 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 79 insertions(+)
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index a7b9091ff349..a002ca55c513 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -78,6 +78,70 @@ extern void compaction_defer_reset(struct zone *zone, int order,
>   bool alloc_success);
>  extern bool compaction_restarting(struct zone *zone, int order);
>
> +/* Compaction has made some progress and retrying makes sense */
> +static inline bool compaction_made_progress(enum compact_result result)
> +{
> + /*
> + * Even though this might sound confusing this in fact tells us
> + * that the compaction successfully isolated and migrated some
> + * pageblocks.
> + */
> + if (result == COMPACT_PARTIAL)
> + return true;
> +
> + return false;
> +}
> +
> +/* Compaction has failed and it doesn't make much sense to keep retrying. */
> +static inline bool compaction_failed(enum compact_result result)
> +{
> + /* All zones where scanned completely and still not result. */

s/where/were/

> + if (result == COMPACT_COMPLETE)
> + return true;
> +
> + return false;
> +}
> +
> +/*
> + * Compaction  has backed off for some reason. It might be throttling or
> + * lock contention. Retrying is still worthwhile.
> + */
> +static inline bool compaction_withdrawn(enum compact_result result)
> +{
> + /*
> + * Compaction backed off due to watermark checks for order-0
> + * so the regular reclaim has to try harder and reclaim something.
> + */
> + if (result == COMPACT_SKIPPED)
> + return true;
> +
> + /*
> + * If compaction is deferred for high-order allocations, it is
> + * because sync compaction recently failed. If this is the case
> + * and the caller requested a THP allocation, we do not want
> + * to heavily disrupt the system, so we fail the allocation
> + * instead of entering direct reclaim.
> + */
> + if (result == COMPACT_DEFERRED)
> + return true;
> +
> + /*
> + * If compaction in async mode encounters contention or blocks higher
> + * priority task we back off early rather than cause stalls.
> + */
> + if (result == COMPACT_CONTENDED)
> + return true;
> +
> + /*
> + * Page scanners have met but we haven't scanned full zones so this
> + * is a back off in fact.
> + */
> + if (result == COMPACT_PARTIAL_SKIPPED)
> + return true;
> +
> + return false;
> +}
> +
>  extern int kcompactd_run(int nid);
>  extern void kcompactd_stop(int nid);
>  extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx);
> @@ -114,6 +178,21 @@ static inline bool compaction_deferred(struct zone *zone, int order)
>   return true;
>  }
>
> +static inline bool compaction_made_progress(enum compact_result result)
> +{
> + return false;
> +}
> +
> +static inline bool compaction_failed(enum compact_result result)
> +{
> + return false;
> +}
> +
> +static inline bool compaction_withdrawn(enum compact_result result)
> +{
> + return true;
> +}
> +
>  static inline int kcompactd_run(int nid)
>  {
>   return 0;
> --
> 2.8.0.rc3

123