[PATCH 0/6] Handle oom bypass more gracefully

Michal Hocko
Hi,
the following 6 patches should put some order to the very rare cases of
an mm shared between processes and finally make the paths which bypass
the oom killer oom-reapable and thus much more reliable. Even though an
mm shared outside of a thread group is rare (either use_mm by kernel
threads or the exotic clone(CLONE_VM) without CLONE_THREAD resp.
CLONE_SIGHAND), it makes the current oom killer logic quite hard to
follow and evaluate. It is possible to select an oom victim which shares
its mm with an unkillable process, or to bypass the oom killer even when
other processes sharing the mm are still alive, among other weird cases.
The sketch below shows how such sharing comes about.
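
For illustration only (not part of the series): a minimal userspace
sketch of how an mm ends up shared outside of a thread group. Error
handling is trimmed.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int child_fn(void *arg)
    {
            /*
             * Shares the parent's address space but is a separate process
             * for the oom killer: own pid, own signal_struct and thus its
             * own oom_score_adj.
             */
            printf("child pid=%d shares the parent's mm\n", getpid());
            return 0;
    }

    int main(void)
    {
            const size_t stack_size = 1024 * 1024;
            char *stack = malloc(stack_size);

            if (!stack)
                    return 1;
            /* CLONE_VM without CLONE_THREAD/CLONE_SIGHAND */
            if (clone(child_fn, stack + stack_size,
                      CLONE_VM | SIGCHLD, NULL) == -1)
                    return 1;
            wait(NULL);
            return 0;
    }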

Patch 1 optimizes oom_kill_process to skip the costly process
iteration when the current oom victim is not sharing its mm with other
processes. Patch 2 is a clean up of the oom_score_adj handling and
preparatory work. Patch 3 enforces oom_score_adj to be consistent
between processes sharing the mm so that they behave consistently with
regular thread groups. Patch 4 tries to handle vforked tasks better in
the oom path, patch 5 ensures that all tasks sharing the mm are killed
and finally patch 6 should guarantee that task_will_free_mem will
always imply a reapable bypass of the oom killer.

The patchset is based on the current mmotm tree (mmotm-2016-05-23-16-51).
I would really appreciate a deep review as this area is full of land
mines, but I hope I've made the code much cleaner with fewer kludges.

I am CCing Oleg (sorry, I know you hate this code) but I would feel much
better if you double-checked my assumptions about locking and vfork
behavior.

Michal Hocko (6):
      mm, oom: do not loop over all tasks if there are no external tasks sharing mm
      proc, oom_adj: extract oom_score_adj setting into a helper
      mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
      mm, oom: skip over vforked tasks
      mm, oom: kill all tasks sharing the mm
      mm, oom: fortify task_will_free_mem

 fs/proc/base.c      | 168 +++++++++++++++++++++++++++++-----------------------
 include/linux/mm.h  |   2 +
 include/linux/oom.h |  72 ++++++++++++++++++++--
 mm/memcontrol.c     |   4 +-
 mm/oom_kill.c       |  96 ++++++++++--------------------
 5 files changed, 196 insertions(+), 146 deletions(-)

[PATCH 3/6] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj

Michal Hocko
From: Michal Hocko <[hidden email]>

oom_score_adj is shared within a thread group (via struct signal_struct)
but this is not sufficient to cover processes sharing an mm (CLONE_VM
without CLONE_THREAD resp. CLONE_SIGHAND) and so we can easily end up in
a situation where some processes update their oom_score_adj and confuse
the oom killer. In the worst case some of those processes might hide
from the oom killer altogether via OOM_SCORE_ADJ_MIN while others remain
eligible. The oom killer would then pick an eligible one but wouldn't be
allowed to kill the others sharing the same mm, so the victim would
never release the mm and thus the memory.

It would be ideal to have the oom_score_adj per mm_struct because that
is the natural entity the OOM killer considers. But this will not work
because some programs are doing
        vfork()
        set_oom_adj()
        exec()
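
A hedged sketch of that pattern; set_oom_adj() is spelled out as a
write to /proc/self/oom_score_adj and "some-daemon" is a placeholder.
Until the execve() the child still runs on the parent's mm:

    #include <fcntl.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
            pid_t pid = vfork();

            if (pid == 0) {
                    /* child: shares the parent's mm until exec */
                    int fd = open("/proc/self/oom_score_adj", O_WRONLY);

                    if (fd >= 0) {
                            write(fd, "-1000", 5);
                            close(fd);
                    }
                    execlp("some-daemon", "some-daemon", (char *)NULL);
                    _exit(1);
            }
            return 0;
    }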

We can achieve the same effect though. The oom_score_adj write handler
can set the oom_score_adj for all processes sharing the same mm if the
task is not in the middle of a vfork. As a result all such processes
will share the same oom_score_adj.

Note that we have to serialize all the oom_score_adj writers now to
guarantee they do not interleave and generate inconsistent results.

Signed-off-by: Michal Hocko <[hidden email]>
---
 fs/proc/base.c     | 35 +++++++++++++++++++++++++++++++++++
 include/linux/mm.h |  2 ++
 mm/oom_kill.c      |  2 +-
 3 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 23679673bf5a..e3ee4fb1930c 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1043,10 +1043,13 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
 
 static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 {
+ static DEFINE_MUTEX(oom_adj_mutex);
+ struct mm_struct *mm = NULL;
  struct task_struct *task;
  unsigned long flags;
  int err = 0;
 
+ mutex_lock(&oom_adj_mutex);
  task = get_proc_task(file_inode(file));
  if (!task) {
  err = -ESRCH;
@@ -1085,6 +1088,20 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
  }
  }
 
+ /*
+ * If we are not in vfork and share the mm with other processes we
+ * have to propagate the score, otherwise we would have schizophrenic
+ * requirements for the same mm. We can use a racy check because we
+ * only risk taking the slow path.
+ */
+ if (!task->vfork_done &&
+ atomic_read(&task->mm->mm_users) > get_nr_threads(task)) {
+ mm = task->mm;
+
+ /* pin the mm so it doesn't go away and get reused */
+ atomic_inc(&mm->mm_count);
+ }
+
  task->signal->oom_score_adj = oom_adj;
  if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
  task->signal->oom_score_adj_min = (short)oom_adj;
@@ -1094,7 +1111,25 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 err_task_lock:
  task_unlock(task);
  put_task_struct(task);
+
+ if (mm) {
+ struct task_struct *p;
+
+ rcu_read_lock();
+ for_each_process(p) {
+ task_lock(p);
+ if (!p->vfork_done && process_shares_mm(p, mm)) {
+ p->signal->oom_score_adj = oom_adj;
+ if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
+ p->signal->oom_score_adj_min = (short)oom_adj;
+ }
+ task_unlock(p);
+ }
+ rcu_read_unlock();
+ mmdrop(mm);
+ }
 out:
+ mutex_unlock(&oom_adj_mutex);
  return err;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 05102822912c..b44d3d792a00 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2248,6 +2248,8 @@ static inline int in_gate_area(struct mm_struct *mm, unsigned long addr)
 }
 #endif /* __HAVE_ARCH_GATE_AREA */
 
+extern bool process_shares_mm(struct task_struct *p, struct mm_struct *mm);
+
 #ifdef CONFIG_SYSCTL
 extern int sysctl_drop_caches;
 int drop_caches_sysctl_handler(struct ctl_table *, int,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0e33e912f7e4..eeccb4d7e7f5 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -416,7 +416,7 @@ bool oom_killer_disabled __read_mostly;
  * task's threads: if one of those is using this mm then this task was also
  * using it.
  */
-static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
+bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 {
  struct task_struct *t;
 
--
2.8.1


[PATCH 5/6] mm, oom: kill all tasks sharing the mm

Michal Hocko
From: Michal Hocko <[hidden email]>

Currently oom_kill_process skips both the oom reaper and SIGKILL if a
process sharing the same mm is unkillable via OOM_SCORE_ADJ_MIN. After
"mm, oom_adj: make sure processes sharing mm have same view of
oom_score_adj" all such processes share the same value so we shouldn't
see such a task at all (oom_badness would rule them out).
Moreover after "mm, oom: skip over vforked tasks" we cannot even
encounter a vforked task, so we can allow both SIGKILL and the oom
reaper. A potential race is highly unlikely but possible. It would
happen if __set_oom_adj raced with select_bad_process, in which case it
is OK to consider the old value, or with fork, which should be
acceptable as well.
Let's add a note to the log so that people can tell us whether this
really happens in real life and whether it matters.

Signed-off-by: Michal Hocko <[hidden email]>
---
 mm/oom_kill.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d1cbaaa1a666..008c5b4732de 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -850,8 +850,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
  continue;
  if (same_thread_group(p, victim))
  continue;
- if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
-    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+ if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p)) {
  /*
  * We cannot use oom_reaper for the mm shared by this
  * process because it wouldn't get killed and so the
@@ -860,6 +859,11 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
  can_oom_reap = false;
  continue;
  }
+ if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+ pr_warn("%s pid=%d shares mm with oom disabled %s pid=%d. Seems like misconfiguration, killing anyway!"
+ " Report at [hidden email]\n",
+ victim->comm, task_pid_nr(victim),
+ p->comm, task_pid_nr(p));
  do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
  }
  rcu_read_unlock();
--
2.8.1


[PATCH 6/6] mm, oom: fortify task_will_free_mem

Michal Hocko
From: Michal Hocko <[hidden email]>

task_will_free_mem is rather weak. It doesn't really tell whether
the task has a chance to drop its mm. 98748bd72200 ("oom: consider
multi-threaded tasks in task_will_free_mem") made a first step
toward making it more robust for multi-threaded applications so now we
know that the whole process is going down. This patch builds on top of
that for more complex scenarios where the mm is shared between different
processes (CLONE_VM without CLONE_THREAD resp. CLONE_SIGHAND).

Make sure that all processes sharing the mm are killed or exiting. This
allows us to replace try_oom_reaper by wake_oom_reaper, because all
paths which bypass the oom killer are now reapable and so they
shouldn't lock up the oom killer.

Drop the mm checks for the bypass because those do not really guarantee
anything as the condition might change at any time after task_lock is
dropped.

Signed-off-by: Michal Hocko <[hidden email]>
---
 include/linux/oom.h | 72 +++++++++++++++++++++++++++++++++++++++++++++++++----
 mm/memcontrol.c     |  4 +--
 mm/oom_kill.c       | 65 ++++-------------------------------------------
 3 files changed, 74 insertions(+), 67 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 83469522690a..412c4ecb42b1 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -70,9 +70,9 @@ static inline bool oom_task_origin(const struct task_struct *p)
 extern void mark_oom_victim(struct task_struct *tsk);
 
 #ifdef CONFIG_MMU
-extern void try_oom_reaper(struct task_struct *tsk);
+extern void wake_oom_reaper(struct task_struct *tsk);
 #else
-static inline void try_oom_reaper(struct task_struct *tsk)
+static inline void wake_oom_reaper(struct task_struct *tsk)
 {
 }
 #endif
@@ -105,7 +105,7 @@ extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
-static inline bool task_will_free_mem(struct task_struct *task)
+static inline bool __task_will_free_mem(struct task_struct *task)
 {
  struct signal_struct *sig = task->signal;
 
@@ -117,13 +117,75 @@ static inline bool task_will_free_mem(struct task_struct *task)
  if (sig->flags & SIGNAL_GROUP_COREDUMP)
  return false;
 
- if (!(task->flags & PF_EXITING))
+ if (!(task->flags & PF_EXITING || fatal_signal_pending(task)))
  return false;
 
  /* Make sure that the whole thread group is going down */
- if (!thread_group_empty(task) && !(sig->flags & SIGNAL_GROUP_EXIT))
+ if (!thread_group_empty(task) &&
+ !(sig->flags & SIGNAL_GROUP_EXIT || fatal_signal_pending(task)))
+ return false;
+
+ return true;
+}
+
+/*
+ * Checks whether the given task is dying or exiting and likely to
+ * release its address space. This means that all threads and processes
+ * sharing the same mm have to be killed or exiting.
+ */
+static inline bool task_will_free_mem(struct task_struct *task)
+{
+ struct mm_struct *mm = NULL;
+ struct task_struct *p;
+
+ /*
+ * If the process has passed exit_mm we have to skip it because
+ * we have lost a link to other tasks sharing this mm, we do not
+ * have anything to reap and the task might then get stuck waiting
+ * for parent as zombie and we do not want it to hold TIF_MEMDIE
+ */
+ p = find_lock_task_mm(task);
+ if (!p)
  return false;
 
+ if (!__task_will_free_mem(p)) {
+ task_unlock(p);
+ return false;
+ }
+
+ /*
+ * Check whether there are other processes sharing the mm - they all have
+ * to be killed or exiting.
+ */
+ if (atomic_read(&p->mm->mm_users) > get_nr_threads(p)) {
+ mm = p->mm;
+ /* pin the mm to not get freed and reused */
+ atomic_inc(&mm->mm_count);
+ }
+ task_unlock(p);
+
+ if (mm) {
+ rcu_read_lock();
+ for_each_process(p) {
+ bool vfork;
+
+ /*
+ * skip over vforked tasks because they are mostly
+ * independent and will drop the mm soon
+ */
+ task_lock(p);
+ vfork = p->vfork_done;
+ task_unlock(p);
+ if (vfork)
+ continue;
+
+ if (!__task_will_free_mem(p))
+ break;
+ }
+ rcu_read_unlock();
+ mmdrop(mm);
+ }
+
  return true;
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f6477a9dbe7a..878a4308164c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1273,9 +1273,9 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
  * select it.  The goal is to allow it to allocate so that it may
  * quickly exit and free its memory.
  */
- if (fatal_signal_pending(current) || task_will_free_mem(current)) {
+ if (task_will_free_mem(current)) {
  mark_oom_victim(current);
- try_oom_reaper(current);
+ wake_oom_reaper(current);
  goto unlock;
  }
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 008c5b4732de..428e34df9f49 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -573,7 +573,7 @@ static int oom_reaper(void *unused)
  return 0;
 }
 
-static void wake_oom_reaper(struct task_struct *tsk)
+void wake_oom_reaper(struct task_struct *tsk)
 {
  if (!oom_reaper_th)
  return;
@@ -594,50 +594,6 @@ static void wake_oom_reaper(struct task_struct *tsk)
 /* Check if we can reap the given task. This has to be called with stable
  * tsk->mm
  */
-void try_oom_reaper(struct task_struct *tsk)
-{
- struct mm_struct *mm = tsk->mm;
- struct task_struct *p;
-
- if (!mm)
- return;
-
- /*
- * There might be other threads/processes which are either not
- * dying or even not killable.
- */
- if (atomic_read(&mm->mm_users) > 1) {
- rcu_read_lock();
- for_each_process(p) {
- bool exiting;
-
- if (!process_shares_mm(p, mm))
- continue;
- if (same_thread_group(p, tsk))
- continue;
- if (fatal_signal_pending(p))
- continue;
-
- /*
- * If the task is exiting make sure the whole thread group
- * is exiting and cannot acces mm anymore.
- */
- spin_lock_irq(&p->sighand->siglock);
- exiting = signal_group_exit(p->signal);
- spin_unlock_irq(&p->sighand->siglock);
- if (exiting)
- continue;
-
- /* Give up */
- rcu_read_unlock();
- return;
- }
- rcu_read_unlock();
- }
-
- wake_oom_reaper(tsk);
-}
-
 static int __init oom_init(void)
 {
  oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
@@ -649,10 +605,6 @@ static int __init oom_init(void)
  return 0;
 }
 subsys_initcall(oom_init)
-#else
-static void wake_oom_reaper(struct task_struct *tsk)
-{
-}
 #endif
 
 /**
@@ -750,15 +702,12 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
  * If the task is already exiting, don't alarm the sysadmin or kill
  * its children or threads, just set TIF_MEMDIE so it can die quickly
  */
- task_lock(p);
- if (p->mm && task_will_free_mem(p)) {
+ if (task_will_free_mem(p)) {
  mark_oom_victim(p);
- try_oom_reaper(p);
- task_unlock(p);
+ wake_oom_reaper(p);
  put_task_struct(p);
  return;
  }
- task_unlock(p);
 
  if (__ratelimit(&oom_rs))
  dump_header(oc, p, memcg);
@@ -945,14 +894,10 @@ bool out_of_memory(struct oom_control *oc)
  * If current has a pending SIGKILL or is exiting, then automatically
  * select it.  The goal is to allow it to allocate so that it may
  * quickly exit and free its memory.
- *
- * But don't select if current has already released its mm and cleared
- * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
  */
- if (current->mm &&
-    (fatal_signal_pending(current) || task_will_free_mem(current))) {
+ if (task_will_free_mem(current)) {
  mark_oom_victim(current);
- try_oom_reaper(current);
+ wake_oom_reaper(current);
  return true;
  }
 
--
2.8.1


[PATCH 4/6] mm, oom: skip over vforked tasks

Michal Hocko
From: Michal Hocko <[hidden email]>

vforked tasks are not really sitting on any memory, so killing them
doesn't help much. The parent is waiting for the vforked task in a
killable state, so it is better to choose the parent, which is the real
mm owner. Teach oom_badness to ignore all tasks which haven't passed
mm_release yet. oom_kill_process should ignore them as well because they
will drop the mm soon and they will not block the oom_reaper because
they cannot touch any memory. A sketch of the vfork_done semantics this
relies on follows.
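
A sketch of the assumption behind the vfork_done test; the helper name
task_in_vfork() is made up, the patch itself open-codes the check:

    /*
     * vfork_done points to the parent's completion for CLONE_VFORK
     * children and is cleared via mm_release() once the child execs or
     * exits, so a non-NULL value marks a task inside the vfork window.
     */
    static inline bool task_in_vfork(struct task_struct *p)
    {
            return p->vfork_done != NULL;
    }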

Signed-off-by: Michal Hocko <[hidden email]>
---
 mm/oom_kill.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index eeccb4d7e7f5..d1cbaaa1a666 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -176,11 +176,13 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 
  /*
  * Do not even consider tasks which are explicitly marked oom
- * unkillable or have been already oom reaped.
+ * unkillable or have been already oom reaped or they are in
+ * the middle of vfork
  */
  adj = (long)p->signal->oom_score_adj;
  if (adj == OOM_SCORE_ADJ_MIN ||
- test_bit(MMF_OOM_REAPED, &p->mm->flags)) {
+ test_bit(MMF_OOM_REAPED, &p->mm->flags) ||
+ p->vfork_done) {
  task_unlock(p);
  return 0;
  }
@@ -839,6 +841,13 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
  for_each_process(p) {
  if (!process_shares_mm(p, mm))
  continue;
+ /*
+ * vforked tasks are ignored because they will drop the mm soon
+ * hopefully and even if not they will not mind being oom
+ * reaped because they cannot touch any memory.
+ */
+ if (p->vfork_done)
+ continue;
  if (same_thread_group(p, victim))
  continue;
  if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
--
2.8.1


[PATCH 2/6] proc, oom_adj: extract oom_score_adj setting into a helper

Michal Hocko
From: Michal Hocko <[hidden email]>

Currently we have two proc interfaces to set oom_score_adj: the legacy
/proc/<pid>/oom_adj and /proc/<pid>/oom_score_adj, each with its own
handler. A big part of the logic is duplicated, so extract the common
code into a __set_oom_adj helper. The legacy knob still expects some
details to be handled slightly differently, so pass a legacy flag to
the helper, e.g. the legacy mode ignores oom_score_adj_min and warns
about the usage.

This patch shouldn't introduce any functional changes.

Signed-off-by: Michal Hocko <[hidden email]>
---
 fs/proc/base.c | 133 +++++++++++++++++++++++++--------------------------------
 1 file changed, 59 insertions(+), 74 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index be73f4d0cb01..23679673bf5a 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1041,6 +1041,63 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
  return simple_read_from_buffer(buf, count, ppos, buffer, len);
 }
 
+static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
+{
+ struct task_struct *task;
+ unsigned long flags;
+ int err = 0;
+
+ task = get_proc_task(file_inode(file));
+ if (!task) {
+ err = -ESRCH;
+ goto out;
+ }
+
+ task_lock(task);
+ if (!task->mm) {
+ err = -EINVAL;
+ goto err_task_lock;
+ }
+
+ if (!lock_task_sighand(task, &flags)) {
+ err = -ESRCH;
+ goto err_task_lock;
+ }
+
+ if (legacy) {
+ if (oom_adj < task->signal->oom_score_adj &&
+ !capable(CAP_SYS_RESOURCE)) {
+ err = -EACCES;
+ goto err_sighand;
+ }
+ /*
+ * /proc/pid/oom_adj is provided for legacy purposes, ask users to use
+ * /proc/pid/oom_score_adj instead.
+ */
+ pr_warn_once("%s (%d): /proc/%d/oom_adj is deprecated, please use /proc/%d/oom_score_adj instead.\n",
+  current->comm, task_pid_nr(current), task_pid_nr(task),
+  task_pid_nr(task));
+ } else {
+ if ((short)oom_adj < task->signal->oom_score_adj_min &&
+ !capable(CAP_SYS_RESOURCE)) {
+ err = -EACCES;
+ goto err_sighand;
+ }
+ }
+
+ task->signal->oom_score_adj = oom_adj;
+ if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
+ task->signal->oom_score_adj_min = (short)oom_adj;
+ trace_oom_score_adj_update(task);
+err_sighand:
+ unlock_task_sighand(task, &flags);
+err_task_lock:
+ task_unlock(task);
+ put_task_struct(task);
+out:
+ return err;
+}
+
 /*
  * /proc/pid/oom_adj exists solely for backwards compatibility with previous
  * kernels.  The effective policy is defined by oom_score_adj, which has a
@@ -1054,10 +1111,8 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
 static ssize_t oom_adj_write(struct file *file, const char __user *buf,
      size_t count, loff_t *ppos)
 {
- struct task_struct *task;
  char buffer[PROC_NUMBUF];
  int oom_adj;
- unsigned long flags;
  int err;
 
  memset(buffer, 0, sizeof(buffer));
@@ -1077,23 +1132,6 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
  goto out;
  }
 
- task = get_proc_task(file_inode(file));
- if (!task) {
- err = -ESRCH;
- goto out;
- }
-
- task_lock(task);
- if (!task->mm) {
- err = -EINVAL;
- goto err_task_lock;
- }
-
- if (!lock_task_sighand(task, &flags)) {
- err = -ESRCH;
- goto err_task_lock;
- }
-
  /*
  * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
  * value is always attainable.
@@ -1103,27 +1141,8 @@ static ssize_t oom_adj_write(struct file *file, const char __user *buf,
  else
  oom_adj = (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE;
 
- if (oom_adj < task->signal->oom_score_adj &&
-    !capable(CAP_SYS_RESOURCE)) {
- err = -EACCES;
- goto err_sighand;
- }
-
- /*
- * /proc/pid/oom_adj is provided for legacy purposes, ask users to use
- * /proc/pid/oom_score_adj instead.
- */
- pr_warn_once("%s (%d): /proc/%d/oom_adj is deprecated, please use /proc/%d/oom_score_adj instead.\n",
-  current->comm, task_pid_nr(current), task_pid_nr(task),
-  task_pid_nr(task));
+ err = __set_oom_adj(file, oom_adj, true);
 
- task->signal->oom_score_adj = oom_adj;
- trace_oom_score_adj_update(task);
-err_sighand:
- unlock_task_sighand(task, &flags);
-err_task_lock:
- task_unlock(task);
- put_task_struct(task);
 out:
  return err < 0 ? err : count;
 }
@@ -1157,9 +1176,7 @@ static ssize_t oom_score_adj_read(struct file *file, char __user *buf,
 static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
  size_t count, loff_t *ppos)
 {
- struct task_struct *task;
  char buffer[PROC_NUMBUF];
- unsigned long flags;
  int oom_score_adj;
  int err;
 
@@ -1180,39 +1197,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
  goto out;
  }
 
- task = get_proc_task(file_inode(file));
- if (!task) {
- err = -ESRCH;
- goto out;
- }
-
- task_lock(task);
- if (!task->mm) {
- err = -EINVAL;
- goto err_task_lock;
- }
-
- if (!lock_task_sighand(task, &flags)) {
- err = -ESRCH;
- goto err_task_lock;
- }
-
- if ((short)oom_score_adj < task->signal->oom_score_adj_min &&
- !capable(CAP_SYS_RESOURCE)) {
- err = -EACCES;
- goto err_sighand;
- }
-
- task->signal->oom_score_adj = (short)oom_score_adj;
- if (has_capability_noaudit(current, CAP_SYS_RESOURCE))
- task->signal->oom_score_adj_min = (short)oom_score_adj;
- trace_oom_score_adj_update(task);
-
-err_sighand:
- unlock_task_sighand(task, &flags);
-err_task_lock:
- task_unlock(task);
- put_task_struct(task);
+ err = __set_oom_adj(file, oom_score_adj, false);
 out:
  return err < 0 ? err : count;
 }
--
2.8.1


[PATCH 1/6] mm, oom: do not loop over all tasks if there are no external tasks sharing mm

Michal Hocko
From: Michal Hocko <[hidden email]>

oom_kill_process makes sure to kill all processes outside of the thread
group which are sharing the mm. This requires iterating over all tasks.
This is however not a common case, so we can optimize a bit and take
that path only if we know that there are external users of the mm
struct outside of the thread group.

Signed-off-by: Michal Hocko <[hidden email]>
---
 mm/oom_kill.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5bb2f7698ad7..0e33e912f7e4 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -820,6 +820,13 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
  task_unlock(victim);
 
  /*
+ * skip expensive iterations over all tasks if we know that there
+ * are no users outside of threads in the same thread group
+ */
+ if (atomic_read(&mm->mm_users) <= get_nr_threads(victim))
+ goto oom_reap;
+
+ /*
  * Kill all user processes sharing victim->mm in other thread groups, if
  * any.  They don't get access to memory reserves, though, to avoid
  * depletion of all memory.  This prevents mm->mmap_sem livelock when an
@@ -848,6 +855,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
  }
  rcu_read_unlock();
 
+oom_reap:
  if (can_oom_reap)
  wake_oom_reaper(victim);
 
--
2.8.1


Re: [PATCH 6/6] mm, oom: fortify task_will_free_mem

Tetsuo Handa
Michal Hocko wrote:

> +/*
> + * Checks whether the given task is dying or exiting and likely to
> + * release its address space. This means that all threads and processes
> + * sharing the same mm have to be killed or exiting.
> + */
> +static inline bool task_will_free_mem(struct task_struct *task)
> +{
> + struct mm_struct *mm = NULL;
> + struct task_struct *p;
> +
> + /*
> + * If the process has passed exit_mm we have to skip it because
> + * we have lost a link to other tasks sharing this mm, we do not
> + * have anything to reap and the task might then get stuck waiting
> + * for parent as zombie and we do not want it to hold TIF_MEMDIE
> + */
> + p = find_lock_task_mm(task);
> + if (!p)
>   return false;
>  
> + if (!__task_will_free_mem(p)) {
> + task_unlock(p);
> + return false;
> + }
> +
> + /*
> + * Check whether there are other processes sharing the mm - they all have
> + * to be killed or exiting.
> + */
> + if (atomic_read(&p->mm->mm_users) > get_nr_threads(p)) {
> + mm = p->mm;
> + /* pin the mm to not get freed and reused */
> + atomic_inc(&mm->mm_count);
> + }
> + task_unlock(p);
> +
> + if (mm) {
> + rcu_read_lock();
> + for_each_process(p) {
> + bool vfork;
> +
> + /*
> + * skip over vforked tasks because they are mostly
> + * independent and will drop the mm soon
> + */
> + task_lock(p);
> + vfork = p->vfork_done;
> + task_unlock(p);
> + if (vfork)
> + continue;
> +
> + if (!__task_will_free_mem(p))
> + break;
> + }
> + rcu_read_unlock();
> + mmdrop(mm);

Did you forget "if (something) return false;" here?

> + }
> +

If task_will_free_mem() == true is always followed by mark_oom_victim()
and wake_oom_reaper(), can we call them from here?

>   return true;
>  }

Re: [PATCH 1/6] mm, oom: do not loop over all tasks if there are no external tasks sharing mm

Tetsuo Handa
Michal Hocko wrote:

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 5bb2f7698ad7..0e33e912f7e4 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -820,6 +820,13 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>   task_unlock(victim);
>  
>   /*
> + * skip expensive iterations over all tasks if we know that there
> + * are no users outside of threads in the same thread group
> + */
> + if (atomic_read(&mm->mm_users) <= get_nr_threads(victim))
> + goto oom_reap;

Is this really safe? Isn't it possible that the victim thread's thread
group has more than atomic_read(&mm->mm_users) threads because some of
them are past exit_mm() and blocked at exit_task_work(), which runs
before __exit_signal() from release_task() from exit_notify()?

> +
> + /*
>   * Kill all user processes sharing victim->mm in other thread groups, if
>   * any.  They don't get access to memory reserves, though, to avoid
>   * depletion of all memory.  This prevents mm->mmap_sem livelock when an

Re: [PATCH 6/6] mm, oom: fortify task_will_free_mem

Michal Hocko
On Thu 26-05-16 23:11:47, Tetsuo Handa wrote:
[...]

> > + if (mm) {
> > + rcu_read_lock();
> > + for_each_process(p) {
> > + bool vfork;
> > +
> > + /*
> > + * skip over vforked tasks because they are mostly
> > + * independent and will drop the mm soon
> > + */
> > + task_lock(p);
> > + vfork = p->vfork_done;
> > + task_unlock(p);
> > + if (vfork)
> > + continue;
> > +
> > + if (!__task_will_free_mem(p))
> > + break;
> > + }
> > + rcu_read_unlock();
> > + mmdrop(mm);
>
> Did you forget "if (something) return false;" here?

Dohh, I screwed this up when cleaning up the code before submission.
Thanks for catching that.

>
> > + }
> > +
>
> If task_will_free_mem() == true is always followed by mark_oom_victim()
> and wake_oom_reaper(), can we call them from here?

I would rather keep the function doing only the check. We never know
when this could be reused somewhere else. All the callsites are quite
trivial so we do not duplicate a lot of code.
---
From 0b94b4d6f749f322279e911e00abf1390fea4b2b Mon Sep 17 00:00:00 2001
From: Michal Hocko <[hidden email]>
Date: Thu, 26 May 2016 09:38:32 +0200
Subject: [PATCH] mm, oom: fortify task_will_free_mem

task_will_free_mem is rather weak. It doesn't really tell whether
the task has a chance to drop its mm. 98748bd72200 ("oom: consider
multi-threaded tasks in task_will_free_mem") made a first step
toward making it more robust for multi-threaded applications so now we
know that the whole process is going down. This patch builds on top of
that for more complex scenarios where the mm is shared between different
processes (CLONE_VM without CLONE_THREAD resp. CLONE_SIGHAND).

Make sure that all processes sharing the mm are killed or exiting. This
allows us to replace try_oom_reaper by wake_oom_reaper, because all
paths which bypass the oom killer are now reapable and so they
shouldn't lock up the oom killer.

Drop the mm checks for the bypass because those do not really guarantee
anything as the condition might change at any time after task_lock is
dropped.

Signed-off-by: Michal Hocko <[hidden email]>
---
 include/linux/oom.h | 74 +++++++++++++++++++++++++++++++++++++++++++++++++----
 mm/memcontrol.c     |  4 +--
 mm/oom_kill.c       | 65 ++++------------------------------------------
 3 files changed, 76 insertions(+), 67 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 83469522690a..91d90a5885dc 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -70,9 +70,9 @@ static inline bool oom_task_origin(const struct task_struct *p)
 extern void mark_oom_victim(struct task_struct *tsk);
 
 #ifdef CONFIG_MMU
-extern void try_oom_reaper(struct task_struct *tsk);
+extern void wake_oom_reaper(struct task_struct *tsk);
 #else
-static inline void try_oom_reaper(struct task_struct *tsk)
+static inline void wake_oom_reaper(struct task_struct *tsk)
 {
 }
 #endif
@@ -105,7 +105,7 @@ extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
-static inline bool task_will_free_mem(struct task_struct *task)
+static inline bool __task_will_free_mem(struct task_struct *task)
 {
  struct signal_struct *sig = task->signal;
 
@@ -117,16 +117,80 @@ static inline bool task_will_free_mem(struct task_struct *task)
  if (sig->flags & SIGNAL_GROUP_COREDUMP)
  return false;
 
- if (!(task->flags & PF_EXITING))
+ if (!(task->flags & PF_EXITING || fatal_signal_pending(task)))
  return false;
 
  /* Make sure that the whole thread group is going down */
- if (!thread_group_empty(task) && !(sig->flags & SIGNAL_GROUP_EXIT))
+ if (!thread_group_empty(task) &&
+ !(sig->flags & SIGNAL_GROUP_EXIT || fatal_signal_pending(task)))
  return false;
 
  return true;
 }
 
+/*
+ * Checks whether the given task is dying or exiting and likely to
+ * release its address space. This means that all threads and processes
+ * sharing the same mm have to be killed or exiting.
+ */
+static inline bool task_will_free_mem(struct task_struct *task)
+{
+ struct mm_struct *mm = NULL;
+ struct task_struct *p;
+ bool ret = false;
+
+ /*
+ * If the process has passed exit_mm we have to skip it because
+ * we have lost a link to other tasks sharing this mm, we do not
+ * have anything to reap and the task might then get stuck waiting
+ * for parent as zombie and we do not want it to hold TIF_MEMDIE
+ */
+ p = find_lock_task_mm(task);
+ if (!p)
+ return false;
+
+ if (!__task_will_free_mem(p)) {
+ task_unlock(p);
+ return false;
+ }
+
+ /*
+ * Check whether there are other processes sharing the mm - they all have
+ * to be killed or exiting.
+ */
+ if (atomic_read(&p->mm->mm_users) > get_nr_threads(p)) {
+ mm = p->mm;
+ /* pin the mm to not get freed and reused */
+ atomic_inc(&mm->mm_count);
+ }
+ task_unlock(p);
+
+ if (mm) {
+ rcu_read_lock();
+ for_each_process(p) {
+ bool vfork;
+
+ /*
+ * skip over vforked tasks because they are mostly
+ * independent and will drop the mm soon
+ */
+ task_lock(p);
+ vfork = p->vfork_done;
+ task_unlock(p);
+ if (vfork)
+ continue;
+
+ ret = __task_will_free_mem(p);
+ if (!ret)
+ break;
+ }
+ rcu_read_unlock();
+ mmdrop(mm);
+ }
+
+ return ret;
+}
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f6477a9dbe7a..878a4308164c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1273,9 +1273,9 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
  * select it.  The goal is to allow it to allocate so that it may
  * quickly exit and free its memory.
  */
- if (fatal_signal_pending(current) || task_will_free_mem(current)) {
+ if (task_will_free_mem(current)) {
  mark_oom_victim(current);
- try_oom_reaper(current);
+ wake_oom_reaper(current);
  goto unlock;
  }
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 008c5b4732de..428e34df9f49 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -573,7 +573,7 @@ static int oom_reaper(void *unused)
  return 0;
 }
 
-static void wake_oom_reaper(struct task_struct *tsk)
+void wake_oom_reaper(struct task_struct *tsk)
 {
  if (!oom_reaper_th)
  return;
@@ -594,50 +594,6 @@ static void wake_oom_reaper(struct task_struct *tsk)
 /* Check if we can reap the given task. This has to be called with stable
  * tsk->mm
  */
-void try_oom_reaper(struct task_struct *tsk)
-{
- struct mm_struct *mm = tsk->mm;
- struct task_struct *p;
-
- if (!mm)
- return;
-
- /*
- * There might be other threads/processes which are either not
- * dying or even not killable.
- */
- if (atomic_read(&mm->mm_users) > 1) {
- rcu_read_lock();
- for_each_process(p) {
- bool exiting;
-
- if (!process_shares_mm(p, mm))
- continue;
- if (same_thread_group(p, tsk))
- continue;
- if (fatal_signal_pending(p))
- continue;
-
- /*
- * If the task is exiting make sure the whole thread group
- * is exiting and cannot acces mm anymore.
- */
- spin_lock_irq(&p->sighand->siglock);
- exiting = signal_group_exit(p->signal);
- spin_unlock_irq(&p->sighand->siglock);
- if (exiting)
- continue;
-
- /* Give up */
- rcu_read_unlock();
- return;
- }
- rcu_read_unlock();
- }
-
- wake_oom_reaper(tsk);
-}
-
 static int __init oom_init(void)
 {
  oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
@@ -649,10 +605,6 @@ static int __init oom_init(void)
  return 0;
 }
 subsys_initcall(oom_init)
-#else
-static void wake_oom_reaper(struct task_struct *tsk)
-{
-}
 #endif
 
 /**
@@ -750,15 +702,12 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
  * If the task is already exiting, don't alarm the sysadmin or kill
  * its children or threads, just set TIF_MEMDIE so it can die quickly
  */
- task_lock(p);
- if (p->mm && task_will_free_mem(p)) {
+ if (task_will_free_mem(p)) {
  mark_oom_victim(p);
- try_oom_reaper(p);
- task_unlock(p);
+ wake_oom_reaper(p);
  put_task_struct(p);
  return;
  }
- task_unlock(p);
 
  if (__ratelimit(&oom_rs))
  dump_header(oc, p, memcg);
@@ -945,14 +894,10 @@ bool out_of_memory(struct oom_control *oc)
  * If current has a pending SIGKILL or is exiting, then automatically
  * select it.  The goal is to allow it to allocate so that it may
  * quickly exit and free its memory.
- *
- * But don't select if current has already released its mm and cleared
- * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
  */
- if (current->mm &&
-    (fatal_signal_pending(current) || task_will_free_mem(current))) {
+ if (task_will_free_mem(current)) {
  mark_oom_victim(current);
- try_oom_reaper(current);
+ wake_oom_reaper(current);
  return true;
  }
 
--
2.8.1

--
Michal Hocko
SUSE Labs

Re: [PATCH 6/6] mm, oom: fortify task_will_free_mem

Tetsuo Handa
Michal Hocko wrote:

> +/*
> + * Checks whether the given task is dying or exiting and likely to
> + * release its address space. This means that all threads and processes
> + * sharing the same mm have to be killed or exiting.
> + */
> +static inline bool task_will_free_mem(struct task_struct *task)
> +{
> + struct mm_struct *mm = NULL;
> + struct task_struct *p;
> + bool ret = false;

If atomic_read(&p->mm->mm_users) <= get_nr_threads(p), this returns "false".
According to the previous version, I think this should be "bool ret = true;".

> +
> + /*
> + * If the process has passed exit_mm we have to skip it because
> + * we have lost a link to other tasks sharing this mm, we do not
> + * have anything to reap and the task might then get stuck waiting
> + * for parent as zombie and we do not want it to hold TIF_MEMDIE
> + */
> + p = find_lock_task_mm(task);
> + if (!p)
> + return false;
> +
> + if (!__task_will_free_mem(p)) {
> + task_unlock(p);
> + return false;
> + }
> +
> + /*
> + * Check whether there are other processes sharing the mm - they all have
> + * to be killed or exiting.
> + */
> + if (atomic_read(&p->mm->mm_users) > get_nr_threads(p)) {
> + mm = p->mm;
> + /* pin the mm to not get freed and reused */
> + atomic_inc(&mm->mm_count);
> + }
> + task_unlock(p);
> +
> + if (mm) {
> + rcu_read_lock();
> + for_each_process(p) {
> + bool vfork;
> +
> + /*
> + * skip over vforked tasks because they are mostly
> + * independent and will drop the mm soon
> + */
> + task_lock(p);
> + vfork = p->vfork_done;
> + task_unlock(p);
> + if (vfork)
> + continue;
> +
> + ret = __task_will_free_mem(p);
> + if (!ret)
> + break;
> + }
> + rcu_read_unlock();
> + mmdrop(mm);
> + }
> +
> + return ret;
> +}

Re: [PATCH 6/6] mm, oom: fortify task_will_free_mem

Michal Hocko
On Thu 26-05-16 23:41:54, Tetsuo Handa wrote:

> Michal Hocko wrote:
> > +/*
> > + * Checks whether the given task is dying or exiting and likely to
> > + * release its address space. This means that all threads and processes
> > + * sharing the same mm have to be killed or exiting.
> > + */
> > +static inline bool task_will_free_mem(struct task_struct *task)
> > +{
> > + struct mm_struct *mm = NULL;
> > + struct task_struct *p;
> > + bool ret = false;
>
> If atomic_read(&p->mm->mm_users) <= get_nr_threads(p), this returns "false".
> According to the previous version, I think this should be "bool ret = true;".

true. Thanks for catching this. Fixed locally.
--
Michal Hocko
SUSE Labs

Re: [PATCH 1/6] mm, oom: do not loop over all tasks if there are no external tasks sharing mm

Michal Hocko
On Thu 26-05-16 23:30:06, Tetsuo Handa wrote:

> Michal Hocko wrote:
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 5bb2f7698ad7..0e33e912f7e4 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -820,6 +820,13 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> >   task_unlock(victim);
> >  
> >   /*
> > + * skip expensive iterations over all tasks if we know that there
> > + * are no users outside of threads in the same thread group
> > + */
> > + if (atomic_read(&mm->mm_users) <= get_nr_threads(victim))
> > + goto oom_reap;
>
> Is this really safe? Isn't it possible that the victim thread's thread
> group has more than atomic_read(&mm->mm_users) threads because some of
> them are past exit_mm() and blocked at exit_task_work(), which runs
> before __exit_signal() from release_task() from exit_notify()?

You are right. The race window between exit_mm and __exit_signal is
really large. I thought about an == check instead but that wouldn't work
for the same reason. Dang, it looked so promising.

Scratch this patch then.

--
Michal Hocko
SUSE Labs

Re: [PATCH 1/6] mm, oom: do not loop over all tasks if there are no external tasks sharing mm

Tetsuo Handa
Michal Hocko wrote:

> On Thu 26-05-16 23:30:06, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index 5bb2f7698ad7..0e33e912f7e4 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -820,6 +820,13 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> > >   task_unlock(victim);
> > >  
> > >   /*
> > > + * skip expensive iterations over all tasks if we know that there
> > > + * are no users outside of threads in the same thread group
> > > + */
> > > + if (atomic_read(&mm->mm_users) <= get_nr_threads(victim))
> > > + goto oom_reap;
> >
> > Is this really safe? Isn't it possible that the victim thread's thread
> > group has more than atomic_read(&mm->mm_users) threads because some of
> > them are past exit_mm() and blocked at exit_task_work(), which runs
> > before __exit_signal() from release_task() from exit_notify()?
>
> You are right. The race window between exit_mm and __exit_signal is
> really large. I thought about an == check instead but that wouldn't work
> for the same reason. Dang, it looked so promising.
>
> Scratch this patch then.
>

I think that remembering whether this mm might be shared between
multiple thread groups at clone() time (i.e. whether
clone(CLONE_VM without CLONE_SIGHAND) was ever requested on this mm)
is safe (given that that thread already got SIGKILL or is exiting).

By the way, in oom_kill_process(), how can the (p->flags & PF_KTHREAD)
case become true when process_shares_mm() is true? Even if it can
become true, why can't we reap that mm? Is the (p->flags & PF_KTHREAD)
check only there to avoid sending SIGKILL rather than to avoid reaping
the mm?

Re: [PATCH 1/6] mm, oom: do not loop over all tasks if there are no external tasks sharing mm

Michal Hocko
On Fri 27-05-16 00:25:23, Tetsuo Handa wrote:

> Michal Hocko wrote:
> > On Thu 26-05-16 23:30:06, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > > index 5bb2f7698ad7..0e33e912f7e4 100644
> > > > --- a/mm/oom_kill.c
> > > > +++ b/mm/oom_kill.c
> > > > @@ -820,6 +820,13 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> > > >   task_unlock(victim);
> > > >  
> > > >   /*
> > > > + * skip expensive iterations over all tasks if we know that there
> > > > + * are no users outside of threads in the same thread group
> > > > + */
> > > > + if (atomic_read(&mm->mm_users) <= get_nr_threads(victim))
> > > > + goto oom_reap;
> > >
> > > Is this really safe? Isn't it possible that the victim thread's thread
> > > group has more than atomic_read(&mm->mm_users) threads because some of
> > > them are past exit_mm() and blocked at exit_task_work(), which runs
> > > before __exit_signal() from release_task() from exit_notify()?
> >
> > You are right. The race window between exit_mm and __exit_signal is
> > really large. I thought about an == check instead but that wouldn't work
> > for the same reason. Dang, it looked so promising.
> >
> > Scratch this patch then.
> >
>
> I think that remembering whether this mm might be shared between
> multiple thread groups at clone() time (i.e. whether
> clone(CLONE_VM without CLONE_SIGHAND) was ever requested on this mm)
> is safe (given that that thread already got SIGKILL or is exiting).

I was already playing with that idea but I didn't want to add anything
to the fork path which is really hot. This patch is not really needed
for the rest. It just felt like a nice optimization. I do not think it
is worth deeper changes in the fast paths.

> By the way, in oom_kill_process(), how can the (p->flags & PF_KTHREAD)
> case become true when process_shares_mm() is true?

Not sure I understand, but the PF_KTHREAD check is there to catch
use_mm() usage by kernel threads.

> Even if it can become true, why can't we reap that mm? Is the
> (p->flags & PF_KTHREAD) check only there to avoid sending SIGKILL
> rather than to avoid reaping the mm?

If we reaped the mm then the kernel thread could blow up when accessing
that memory. A sketch of the use_mm() pattern is below.
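
A rough sketch of that pattern, assuming the use_mm()/unuse_mm() API
from <linux/mmu_context.h> of this kernel era; the helper name and the
put_user() payload are made up, and the caller is a kernel thread
holding a reference on mm:

    #include <linux/mmu_context.h>
    #include <linux/uaccess.h>

    static int kthread_poke_user(struct mm_struct *mm, int __user *uptr,
                                 int val)
    {
            int err;

            use_mm(mm);                     /* adopt the user mm */
            err = put_user(val, uptr);      /* may fault user memory */
            unuse_mm(mm);                   /* drop it again */
            return err;
    }

Reaping the mm between use_mm() and unuse_mm() would pull the pages out
from under that put_user().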

--
Michal Hocko
SUSE Labs

Re: [PATCH 1/6] mm, oom: do not loop over all tasks if there are no external tasks sharing mm

Tetsuo Handa
Michal Hocko wrote:

> On Fri 27-05-16 00:25:23, Tetsuo Handa wrote:
> > I think that remembering whether this mm might be shared between
> > multiple thread groups at clone() time (i.e. whether
> > clone(CLONE_VM without CLONE_SIGHAND) was ever requested on this mm)
> > is safe (given that that thread already got SIGKILL or is exiting).
>
> I was already playing with that idea but I didn't want to add anything
> to the fork path which is really hot. This patch is not really needed
> for the rest. It just felt like a nice optimization. I do not think it
> is worth deeper changes in the fast paths.

"[PATCH 6/6] mm, oom: fortify task_will_free_mem" depends on [PATCH 1/6].
You will need to update [PATCH 6/6].

It seems to me that [PATCH 6/6] resembles
http://lkml.kernel.org/r/201605250005.GHH26082.JOtQOSLMFFOFVH@... .
I think we will be happy if we can speed up mm_is_reapable() test using
"whether this mm might be shared between multiple thread groups" flag.
I don't think updating such flag at clone() is too heavy operation to add.
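
A rough sketch of the idea with hypothetical names; MMF_EXTERNAL_USER
is a made-up, one-way mm flag set at clone() time:

    #include <linux/sched.h>

    #define MMF_EXTERNAL_USER 21    /* hypothetical MMF_* bit */

    /* called from the clone()/fork() path once the child's mm is set up */
    static inline void mm_note_external_user(unsigned long clone_flags,
                                             struct mm_struct *mm)
    {
            /* CLONE_VM without CLONE_SIGHAND: mm escapes the thread group */
            if ((clone_flags & CLONE_VM) && !(clone_flags & CLONE_SIGHAND))
                    set_bit(MMF_EXTERNAL_USER, &mm->flags);
    }

    /* the oom path could then test the sticky flag cheaply */
    static inline bool mm_maybe_shared_externally(struct mm_struct *mm)
    {
            return test_bit(MMF_EXTERNAL_USER, &mm->flags);
    }

Because the flag is never cleared it can only produce false positives
(sharers which have already exited), which keeps it safe for the oom
path.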