[PATCH v4 0/2] cgroup: allow management of subtrees by new cgroup namespaces

classic Classic list List threaded Threaded
17 messages Options
Reply | Threaded
Open this post in threaded view
|

[PATCH v4 0/2] cgroup: allow management of subtrees by new cgroup namespaces

Aleksa Sarai-2
This is an updated (and rewritten) version of v3 of this patchset[1].

The main difference is that I changed how we the "allow management" is
implemented. Rather than just chmod-ing the cgroup directory (which
everyone agreed was quite an odd way of doing it),
unshare(CLONE_NEWCGROUP) will create a new subtree in every cgroup the
task is associated with. The task will then be migrated to those
subtrees (which form the root cset of the cgroup namespace). This change
will be transparent to namespaced processes, and they'll gain a new
ability (the ability to create cgroups).

The name of the cgroup is randomly generated to ensure we don't get
conflicts (but maybe this should be dealt with in a nicer way). In
addition, I've updated the cgroup.procs write permission checks to be
user namespace aware, but I also added an additional "permitted" case
(where all of the tasks are in the same cgroup namespace and %current
has CAP_SYS_ADMIN in all of the relevant user namespaces).

I'm not _completely_ convinced about the addition of that case, and
maybe we should drop it (but I might be biased since this all comes from
the requirements of rootless containers).

Also, I haven't added a way to disable the functionality on a per-cgroup
(or even global) basis. Maybe there should be a way to do that, but I'm
not sure how it should be done (a cgroup.ns_subtrees file that allows
administrators to change it on a per-cgroup basis, or just a sysctl?).

PTAL.

[1]: https://lkml.org/lkml/2016/5/2/280

Aleksa Sarai (2):
  cgroup: make cgroup.procs permissions userns-aware
  cgroup: implement subtree creation on copy_cgroup_ns()

 kernel/cgroup.c | 149 +++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 126 insertions(+), 23 deletions(-)

--
2.8.2

Reply | Threaded
Open this post in threaded view
|

[PATCH v4 1/2] cgroup: make cgroup.procs permissions userns-aware

Aleksa Sarai-2
Instead of comparing the thread's euid to GLOBAL_ROOT_UID, use
capabilities to check the additional privileges required to migrate
processes between cgroups. In addition, add further permissions to
cgroup namespaces where:

* Both tasks are in the same cgroup namespace; and
* The current task has CAP_SYS_ADMIN in the target task's user namespace
  well as in the user namespace the cgroup namespace was created in.
  This essentially requires all of the namespaces to to be the same.

This allows for subtree management of processes in a cgroup namespace,
by a process that is in a user namespace.

Signed-off-by: Aleksa Sarai <[hidden email]>
---
 kernel/cgroup.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 86cb5c6e8932..f1c798b69561 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2841,6 +2841,8 @@ static int cgroup_procs_write_permission(struct task_struct *task,
  struct cgroup *dst_cgrp,
  struct kernfs_open_file *of)
 {
+ struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
+ struct cgroup_namespace *tns = task->nsproxy->cgroup_ns;
  const struct cred *cred = current_cred();
  const struct cred *tcred = get_task_cred(task);
  int ret = 0;
@@ -2850,6 +2852,10 @@ static int cgroup_procs_write_permission(struct task_struct *task,
  * need to check permissions on one of them.
  */
  if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) &&
+    !capable(CAP_SYS_ADMIN) &&
+    !(ns == tns && ns_capable(tcred->user_ns, CAP_SYS_ADMIN) &&
+      ns_capable(cred->user_ns, CAP_SYS_ADMIN) &&
+      ns_capable(ns->user_ns, CAP_SYS_ADMIN)) &&
     !uid_eq(cred->euid, tcred->uid) &&
     !uid_eq(cred->euid, tcred->suid))
  ret = -EACCES;
--
2.8.2

Reply | Threaded
Open this post in threaded view
|

[PATCH v4 2/2] cgroup: implement subtree creation on copy_cgroup_ns()

Aleksa Sarai-2
In reply to this post by Aleksa Sarai-2
Allow unprivileged processes to control subtrees of their associated
processes, a necessary feature if a rootless container wishes to take
advantage of cgroups for its own processes.

As cgroups are hierarchical, having the ability to set limits in a
subtree does not preclude the ability to modify the limits imposed by
parent cgroups. In addition, in the default hierarchy a process must
have write access to the common ancestor of the two (src and dest)
cgroups' cgroup.procs file. This makes this change safe against cgroup
escape.

There isn't a way to disable this at the moment.

Signed-off-by: Aleksa Sarai <[hidden email]>
---
 kernel/cgroup.c | 143 +++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 120 insertions(+), 23 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index f1c798b69561..f455488dc899 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -62,6 +62,7 @@
 #include <linux/proc_ns.h>
 #include <linux/nsproxy.h>
 #include <linux/proc_ns.h>
+#include <linux/time.h>
 #include <net/sock.h>
 
 /*
@@ -5269,34 +5270,40 @@ out_destroy:
  return ERR_PTR(ret);
 }
 
-static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
- umode_t mode)
+/**
+ * cgroup_create_subtree - creates a new subtree of a cgroup
+ * @parent: the parent cgroup to create the subtree under
+ * @name: the name of the cgroup in kernfs
+ * @mode: the mode of the cgroup in kernfs
+ *
+ * Creates a new cgroup under the given @parent, with the given @name and @mode.
+ * The caller must hold cgroup_mutex, and must not be under active protection of
+ * kernfs.
+ */
+static struct cgroup *cgroup_create_subtree(struct cgroup *parent,
+    const char *name, umode_t mode)
 {
- struct cgroup *parent, *cgrp;
+ struct cgroup *child;
  struct kernfs_node *kn;
- int ret;
+ int ret = 0;
+
+ lockdep_assert_held(&cgroup_mutex);
 
  /* do not accept '\n' to prevent making /proc/<pid>/cgroup unparsable */
  if (strchr(name, '\n'))
- return -EINVAL;
-
- parent = cgroup_kn_lock_live(parent_kn, false);
- if (!parent)
- return -ENODEV;
+ return ERR_PTR(-EINVAL);
 
- cgrp = cgroup_create(parent);
- if (IS_ERR(cgrp)) {
- ret = PTR_ERR(cgrp);
- goto out_unlock;
- }
+ child = cgroup_create(parent);
+ if (IS_ERR(child))
+ return child;
 
  /* create the directory */
- kn = kernfs_create_dir(parent->kn, name, mode, cgrp);
+ kn = kernfs_create_dir(parent->kn, name, mode, child);
  if (IS_ERR(kn)) {
  ret = PTR_ERR(kn);
  goto out_destroy;
  }
- cgrp->kn = kn;
+ child->kn = kn;
 
  /*
  * This extra ref will be put in cgroup_free_fn() and guarantees
@@ -5308,22 +5315,51 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
  if (ret)
  goto out_destroy;
 
- ret = css_populate_dir(&cgrp->self);
+ ret = css_populate_dir(&child->self);
  if (ret)
  goto out_destroy;
 
- ret = cgroup_apply_control_enable(cgrp);
+ ret = cgroup_apply_control_enable(child);
  if (ret)
  goto out_destroy;
 
  /* let's create and online css's */
  kernfs_activate(kn);
 
- ret = 0;
- goto out_unlock;
+ return child;
 
 out_destroy:
- cgroup_destroy_locked(cgrp);
+ cgroup_destroy_locked(child);
+ return ERR_PTR(ret);
+}
+
+/*
+ * cgroup directories starting with this prefix are forbidden from being created
+ * from userspace. This prefix is used internally to make sure that there's no
+ * conflicts with userspace when creating cgroups inside copy_cgroup_ns().
+ */
+#define CGROUPNS_INTERNAL_PREFIX ".__cgroupns_subtree:"
+
+static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
+ umode_t mode)
+{
+ struct cgroup *parent, *cgrp;
+ int ret = 0;
+
+ if (strncmp(CGROUPNS_INTERNAL_PREFIX, name,
+    strlen(CGROUPNS_INTERNAL_PREFIX)) == 0)
+ return -EINVAL;
+
+ parent = cgroup_kn_lock_live(parent_kn, false);
+ if (!parent)
+ return -ENODEV;
+
+ cgrp = cgroup_create_subtree(parent, name, mode);
+ if (IS_ERR(cgrp)) {
+ ret = PTR_ERR(cgrp);
+ goto out_unlock;
+ }
+
 out_unlock:
  cgroup_kn_unlock(parent_kn);
  return ret;
@@ -6298,7 +6334,9 @@ struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
  struct cgroup_namespace *old_ns)
 {
  struct cgroup_namespace *new_ns;
+ struct cgroup_root *root;
  struct css_set *cset;
+ char id[16], id_string[1+2*ARRAY_SIZE(id)] = {0};
 
  BUG_ON(!old_ns);
 
@@ -6311,12 +6349,71 @@ struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
  if (!ns_capable(user_ns, CAP_SYS_ADMIN))
  return ERR_PTR(-EPERM);
 
+ /*
+ * In order to make sure that the dirname we create is unique, we use a
+ * random id for all of the subtrees. The ID is the same to reduce
+ * confusion when reading /proc/<pid>/cgroup.
+ */
+ get_random_bytes(id, ARRAY_SIZE(id));
+ bin2hex(id_string, id, ARRAY_SIZE(id));
+
+ /*
+ * Create a new subtree in every cgroup the task is associated with.
+ * The cgroup is owned by the task uid and gid, to allow for management
+ * of subtrees in cgroup namespaces. This is safe because:
+ *
+ * 1. cgroups are hierarchical, so having the ability to set limits in
+ *    a subtree does not preclude the ability to modify the limits
+ *    imposed by parent cgroups.
+ *
+ * 2. cgroup_procs_write_permission() does checks to ensure that a
+ *    task cannot move other tasks into its cgroup unless they are both
+ *    running as the same user (or the task moving the process has
+ *    CAP_SYS_ADMIN in the user namespace of the process being moved).
+ *    This means that a misbehaving process can't start messing around
+ *    with other processes' cgroup associations.
+ *
+ * 3. On the default hierarchy, you cannot migrate a process to a
+ *    non-descendant cgroup unless you have write access to the
+ *    cgroup.procs file in the common ancestor of the two cgroups. This
+ *    means that two cooperative processes in the default hierarchy
+ *    can't move processes between their cgroups (if the admin
+ *    disallows it). Unfortunately, this functionality doesn't exist in
+ *    the other hierarchies (for backwards compatibility reasons).
+ *    However, this requirement isn't as important as the previous two.
+ */
  mutex_lock(&cgroup_mutex);
- spin_lock_bh(&css_set_lock);
+ for_each_root(root) {
+ struct cgroup *parent, *child;
+ char namebuf[CGROUP_FILE_NAME_MAX];
+ bool is_dfl = cgroup_on_dfl(&root->cgrp);
+
+ spin_lock_bh(&css_set_lock);
+ parent = task_cgroup_from_root(current, root);
+ spin_unlock_bh(&css_set_lock);
+
+ snprintf(namebuf, CGROUP_FILE_NAME_MAX,
+ CGROUPNS_INTERNAL_PREFIX "%s", id_string);
+
+ /* This should not fail, since we're under &cgroup_mutex. */
+ child = cgroup_create_subtree(parent, namebuf, 0755);
+ if (WARN_ON(IS_ERR(child)))
+ continue;
 
+ /*
+ * Move the task to the new cgroup, which is owned by the user.
+ * Should never fail, since we're under &cgroup_mutex here.
+ */
+ rcu_read_lock();
+ if (WARN_ON(cgroup_attach_task(child, current, is_dfl)))
+ cgroup_destroy_locked(child);
+ rcu_read_unlock();
+
+ }
+
+ spin_lock_bh(&css_set_lock);
  cset = task_css_set(current);
  get_css_set(cset);
-
  spin_unlock_bh(&css_set_lock);
  mutex_unlock(&cgroup_mutex);
 
--
2.8.2

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH v4 0/2] cgroup: allow management of subtrees by new cgroup namespaces

Aleksa Sarai-2
In reply to this post by Aleksa Sarai-2
> This is an updated (and rewritten) version of v3 of this patchset[1].
>
> The main difference is that I changed how we the "allow management" is
> implemented. Rather than just chmod-ing the cgroup directory (which
> everyone agreed was quite an odd way of doing it),
> unshare(CLONE_NEWCGROUP) will create a new subtree in every cgroup the
> task is associated with. The task will then be migrated to those
> subtrees (which form the root cset of the cgroup namespace). This change
> will be transparent to namespaced processes, and they'll gain a new
> ability (the ability to create cgroups).
>
> The name of the cgroup is randomly generated to ensure we don't get
> conflicts (but maybe this should be dealt with in a nicer way). In
> addition, I've updated the cgroup.procs write permission checks to be
> user namespace aware, but I also added an additional "permitted" case
> (where all of the tasks are in the same cgroup namespace and %current
> has CAP_SYS_ADMIN in all of the relevant user namespaces).
>
> I'm not _completely_ convinced about the addition of that case, and
> maybe we should drop it (but I might be biased since this all comes from
> the requirements of rootless containers).
>
> Also, I haven't added a way to disable the functionality on a per-cgroup
> (or even global) basis. Maybe there should be a way to do that, but I'm
> not sure how it should be done (a cgroup.ns_subtrees file that allows
> administrators to change it on a per-cgroup basis, or just a sysctl?).
>
> PTAL.
>
> [1]: https://lkml.org/lkml/2016/5/2/280
>
> Aleksa Sarai (2):
>    cgroup: make cgroup.procs permissions userns-aware
>    cgroup: implement subtree creation on copy_cgroup_ns()
>
>   kernel/cgroup.c | 149 +++++++++++++++++++++++++++++++++++++++++++++++---------
>   1 file changed, 126 insertions(+), 23 deletions(-)
>

Are there any comments on this version of the patchset? I thought we had
reached an agreement that the underlying feature (allowing a process to
manage its own cgroups) was useful. Is there a better way of solving
this problem, that I don't know of?

--
Aleksa Sarai
Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH v4 0/2] cgroup: allow management of subtrees by new cgroup namespaces

Tejun Heo-2
On Sat, May 21, 2016 at 12:48:48AM +1000, Aleksa Sarai wrote:
> Are there any comments on this version of the patchset? I thought we had
> reached an agreement that the underlying feature (allowing a process to
> manage its own cgroups) was useful. Is there a better way of solving this
> problem, that I don't know of?

I still don't see why this is necessary.  Delegation is done through
chmodding.  There's no reason to deviate for namespaces.

Thanks.

--
tejun
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH v4 0/2] cgroup: allow management of subtrees by new cgroup namespaces

James Bottomley-2
On Fri, 2016-05-20 at 08:22 -0700, Tejun Heo wrote:
> On Sat, May 21, 2016 at 12:48:48AM +1000, Aleksa Sarai wrote:
> > Are there any comments on this version of the patchset? I thought
> > we had reached an agreement that the underlying feature (allowing a
> > process to manage its own cgroups) was useful. Is there a better
> > way of solving this problem, that I don't know of?
>
> I still don't see why this is necessary.  Delegation is done through
> chmodding.  There's no reason to deviate for namespaces.

Given it's merge window time, I haven't yet had time to look at the
patch, but I can tell you why it (or something like it) is necessary:
unprivileged containers need to be able to set up cgroups as well as
namespaces, so we do need a way for the user ns owner to modify cgroups
in their default configuration otherwise cgroups just won't fit into
the unprivileged model.  Whether this should be through the cgroup ns
is up for debate, as is how we should actually allow this to happen and
what we should present to the user ns owner, but we do need a way to do
this.

Delegation can't be through chmodding in this case because the user ns
owner can't chmod something owned by init_user_ns root.

James


Reply | Threaded
Open this post in threaded view
|

Re: [PATCH v4 0/2] cgroup: allow management of subtrees by new cgroup namespaces

Tejun Heo-2
Hello, James.

On Fri, May 20, 2016 at 11:30:58AM -0400, James Bottomley wrote:
> Given it's merge window time, I haven't yet had time to look at the
> patch, but I can tell you why it (or something like it) is necessary:
> unprivileged containers need to be able to set up cgroups as well as
> namespaces, so we do need a way for the user ns owner to modify cgroups
> in their default configuration otherwise cgroups just won't fit into
> the unprivileged model.  Whether this should be through the cgroup ns

That can be allowed by delegating a cgroup sub-hierarchy to the
unpriviledged user.  Sub-cgroups created by that user will be owned by
that user.  IOW, if the owner of the cgroup hierarchy doesn't
explicitly delegate the subtree, the unpriv user can't have the
subtree.  This is true for regular use cases and shouldn't be
different for namespaces.

> is up for debate, as is how we should actually allow this to happen and
> what we should present to the user ns owner, but we do need a way to do
> this.
>
> Delegation can't be through chmodding in this case because the user ns
> owner can't chmod something owned by init_user_ns root.

I still don't see why the existing mechanisms (including userns if
delegation from inside a ns is necessary) aren't enough.  And even if
we need something, I'd much prefer not to add a behavior which wildly
deviates from the usual rules.

Thanks.

--
tejun
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH v4 0/2] cgroup: allow management of subtrees by new cgroup namespaces

James Bottomley-2
On Fri, 2016-05-20 at 09:03 -0700, Tejun Heo wrote:

> Hello, James.
>
> On Fri, May 20, 2016 at 11:30:58AM -0400, James Bottomley wrote:
> > Given it's merge window time, I haven't yet had time to look at the
> > patch, but I can tell you why it (or something like it) is
> > necessary:
> > unprivileged containers need to be able to set up cgroups as well
> > as
> > namespaces, so we do need a way for the user ns owner to modify
> > cgroups
> > in their default configuration otherwise cgroups just won't fit
> > into
> > the unprivileged model.  Whether this should be through the cgroup
> > ns
>
> That can be allowed by delegating a cgroup sub-hierarchy to the
> unpriviledged user.  Sub-cgroups created by that user will be owned
> by that user.  IOW, if the owner of the cgroup hierarchy doesn't
> explicitly delegate the subtree, the unpriv user can't have the
> subtree.  This is true for regular use cases and shouldn't be
> different for namespaces.

The use case for unprivileged containers is that the end user should be
able to use them from the default configuration without and
administrator intervention (although there may be ways the admin can
prevent this if desired).  Your "deletage a cgroup sub-hierarchy"
requires admin intervention, so it's not yet a workflow that
unprivileged containers can use.

>
> > is up for debate, as is how we should actually allow this to happen
> > and what we should present to the user ns owner, but we do need a
> > way to do this.
> >
> > Delegation can't be through chmodding in this case because the user
> > ns owner can't chmod something owned by init_user_ns root.
>
> I still don't see why the existing mechanisms (including userns if
> delegation from inside a ns is necessary) aren't enough.  And even if
> we need something, I'd much prefer not to add a behavior which wildly
> deviates from the usual rules.

I think it's just different definitions.  If you take on our definition
of being able to set up a container without any admin intervention, do
you see our problem: we can't get the initial delegation of the
hierarchy.

James

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH v4 0/2] cgroup: allow management of subtrees by new cgroup namespaces

Tejun Heo-2
Hello, James.

On Fri, May 20, 2016 at 12:09:10PM -0400, James Bottomley wrote:
> I think it's just different definitions.  If you take on our definition
> of being able to set up a container without any admin intervention, do
> you see our problem: we can't get the initial delegation of the
> hierarchy.

Yeah, I can see the difference but we can't solve that by special
casing NS case.  This is stemming from the fact that an unpriv
application can't create its sub-cgroups without explicit delegation
from the root and that has always been an explicit design choice.
It's tied to who's responsible for cleanup afterwards and what happens
when the process gets migrated to a different cgroup.  The latter is
an important issue on v1 hierarchies because migrating tasks sometimes
is used as a way to control resource distribution.

Thanks.

--
tejun
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH v4 0/2] cgroup: allow management of subtrees by new cgroup namespaces

James Bottomley-2
On Fri, 2016-05-20 at 09:17 -0700, Tejun Heo wrote:

> Hello, James.
>
> On Fri, May 20, 2016 at 12:09:10PM -0400, James Bottomley wrote:
> > I think it's just different definitions.  If you take on our
> > definition of being able to set up a container without any admin
> > intervention, do you see our problem: we can't get the initial
> > delegation of the hierarchy.
>
> Yeah, I can see the difference but we can't solve that by special
> casing NS case.

Great, we agree on the problem definition ... as I said, I'm not saying
this patch is the solution, but it gives us a starting point for
exploring whether there is a solution.

>   This is stemming from the fact that an unpriv application can't
> create its sub-cgroups without explicit delegation from the root and
> that has always been an explicit design choice.
> It's tied to who's responsible for cleanup afterwards and what
> happens when the process gets migrated to a different cgroup.  The
> latter is an important issue on v1 hierarchies because migrating
> tasks sometimes is used as a way to control resource distribution.

OK, so is the only problem cleanup?  If so, what if I proposed that a
cgroup directory could only be created by the owner of the userns
(which would be any old unprivileged user) iff they create a cgroup ns
and the cgroup ns would be responsible for removing it again, so the
cgroup subdirectory would be tied to the cgroup namespace as its holder
and we'd use release of the cgroup to remove all the directories?

James

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH v4 0/2] cgroup: allow management of subtrees by new cgroup namespaces

Aleksa Sarai-2
>>    This is stemming from the fact that an unpriv application can't
>> create its sub-cgroups without explicit delegation from the root and
>> that has always been an explicit design choice.
>> It's tied to who's responsible for cleanup afterwards and what
>> happens when the process gets migrated to a different cgroup.  The
>> latter is an important issue on v1 hierarchies because migrating
>> tasks sometimes is used as a way to control resource distribution.
>
> OK, so is the only problem cleanup?  If so, what if I proposed that a
> cgroup directory could only be created by the owner of the userns
> (which would be any old unprivileged user) iff they create a cgroup ns
> and the cgroup ns would be responsible for removing it again, so the
> cgroup subdirectory would be tied to the cgroup namespace as its holder
> and we'd use release of the cgroup to remove all the directories?

The only question I'd have is how would we deal with visibility (and
should a "parent" cgroup namespace be able to modify things inside the
cgroup?). And of course, how would that interact with VFS?

--
Aleksa Sarai
Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH v4 0/2] cgroup: allow management of subtrees by new cgroup namespaces

Tejun Heo-2
In reply to this post by James Bottomley-2
Hello,

On Fri, May 20, 2016 at 12:25:09PM -0400, James Bottomley wrote:
> OK, so is the only problem cleanup?  If so, what if I proposed that a

For generic cases, it's a much larger problem.  We'd have to change
delegation model completely so that delegations are allowed by
default, which btw can't be allowed on v1 hierarchies as some
controllers don't behave properly hierarchically in v1 and would allow
unpriv users to escape the constraints of its ancestors.

> cgroup directory could only be created by the owner of the userns
> (which would be any old unprivileged user) iff they create a cgroup ns
> and the cgroup ns would be responsible for removing it again, so the
> cgroup subdirectory would be tied to the cgroup namespace as its holder
> and we'd use release of the cgroup to remove all the directories?

Unfortunately, cgroup hierarchy isn't designed to support this sort of
automatic delegation.  Unpriv processes would be able to escape
constraints on v1 with some controllers and on v2 controllers have to
be explicitly enabled by root for delegated scope to have access to
them.  We can try to isolate these delegated subtrees and make them
work transparently, which rgroup tried to do, but that collides
directly with the vfs conventions (rgroups don't show up in cgroup
hierarchy at all so avoid this problem).

Why does an unpriv NS need to have cgroup delegated to it without
cooperation from cgroup manager?  If for resource control, I'm pretty
sure we don't want to allow that without explicit cooperation from the
enclosing scope.  Overall, it feels like this is trying to work around
an issue which should be solved from userland.

Thanks.

--
tejun
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH v4 0/2] cgroup: allow management of subtrees by new cgroup namespaces

James Bottomley-2
On Fri, 2016-05-20 at 12:53 -0400, Tejun Heo wrote:

> Hello,
>
> On Fri, May 20, 2016 at 12:25:09PM -0400, James Bottomley wrote:
> > OK, so is the only problem cleanup?  If so, what if I proposed that
> > a
>
> For generic cases, it's a much larger problem.  We'd have to change
> delegation model completely so that delegations are allowed by
> default, which btw can't be allowed on v1 hierarchies as some
> controllers don't behave properly hierarchically in v1 and would
> allow unpriv users to escape the constraints of its ancestors.

Just so I'm clear: by delegation you mean create a subdirectory in the
cgroup hierarchy with a non-root owner?  We may have a solution for the
escape constraints problem: see below.

> > cgroup directory could only be created by the owner of the userns
> > (which would be any old unprivileged user) iff they create a cgroup
> > ns and the cgroup ns would be responsible for removing it again, so
> > the cgroup subdirectory would be tied to the cgroup namespace as
> > its holder and we'd use release of the cgroup to remove all the
> > directories?
>
> Unfortunately, cgroup hierarchy isn't designed to support this sort
> of automatic delegation.  Unpriv processes would be able to escape
> constraints on v1 with some controllers and on v2 controllers have to
> be explicitly enabled by root for delegated scope to have access to
> them.

Not necessarily.  We also talked about pinning the cgroup tree so that
once you enter the cgroup namespace, your current cgroup directory
becomes your root, meaning you can't cd back into the ancestors and
thus can't write their tasks file, meaning, I think, that it should be
impossible to escape ancestor constraints.

>   We can try to isolate these delegated subtrees and make them
> work transparently, which rgroup tried to do, but that collides
> directly with the vfs conventions (rgroups don't show up in cgroup
> hierarchy at all so avoid this problem).

Well, let's see if we can solve it within the current framework first.

>
> Why does an unpriv NS need to have cgroup delegated to it without
> cooperation from cgroup manager?

There's actually many answers to this.  The one I'm insterested in is
the ability for applications to make use of container features without
having to ask permission from some orchestration engine.  The problem
most people are looking at is how do I prevent the cgroup manager from
running as root, because that's a security problem waiting to happen.

>   If for resource control, I'm pretty sure we don't want to allow
> that without explicit cooperation from the enclosing scope.

The enclosing scope should be allowed to define the parameters (happens
today with namespaces) but there shouldn't be an active "thing" which
is the permission gateway.

>   Overall, it feels like this is trying to work around an issue which
> should be solved from userland.

So it's not impossible to have some setuid (or CAP_ scoped) universal
binary do this.  We do this today for the user namespace range of uids
problem.  However, it would have to be something that operated
independently of the cgroup manager, since every container
orchestration system wants to be their own cgroup manager, so there's
no one true one.

James


Reply | Threaded
Open this post in threaded view
|

Re: [PATCH v4 0/2] cgroup: allow management of subtrees by new cgroup namespaces

Aditya Kali
In reply to this post by James Bottomley-2
On Fri, May 20, 2016 at 9:25 AM, James Bottomley
<[hidden email]> wrote:

>
> On Fri, 2016-05-20 at 09:17 -0700, Tejun Heo wrote:
> > Hello, James.
> >
> > On Fri, May 20, 2016 at 12:09:10PM -0400, James Bottomley wrote:
> > > I think it's just different definitions.  If you take on our
> > > definition of being able to set up a container without any admin
> > > intervention, do you see our problem: we can't get the initial
> > > delegation of the hierarchy.
> >
> > Yeah, I can see the difference but we can't solve that by special
> > casing NS case.
>
> Great, we agree on the problem definition ... as I said, I'm not saying
> this patch is the solution, but it gives us a starting point for
> exploring whether there is a solution.
>
> >   This is stemming from the fact that an unpriv application can't
> > create its sub-cgroups without explicit delegation from the root and
> > that has always been an explicit design choice.
> > It's tied to who's responsible for cleanup afterwards and what
> > happens when the process gets migrated to a different cgroup.  The
> > latter is an important issue on v1 hierarchies because migrating
> > tasks sometimes is used as a way to control resource distribution.
>
> OK, so is the only problem cleanup?  If so, what if I proposed that a
> cgroup directory could only be created by the owner of the userns
> (which would be any old unprivileged user) iff they create a cgroup ns
> and the cgroup ns would be responsible for removing it again, so the
> cgroup subdirectory would be tied to the cgroup namespace as its holder
> and we'd use release of the cgroup to remove all the directories?
>

cgroup namspace doesn't own the resources in the cgroupns-root, and so
I am not sure how it will be able to do the cleanup either. I.e, even
if all the processes in the cgroup ns die, it doesn't mean that the
cgroupns-root they belonged to is available for cleanup. For this
reason, one of the implicit design choice in cgroupns was that the
cgroup-ns root should already exist and the target process should
already be moved to it (presumably by some admin process) before
creating the cgroupns.

Moreover, the subsystem controllers (cpu, memory, etc.) are oblivious
to cgroup namespaces. So, for example, creating new cgroup namespace
doesn't affect the reclaim behavior. But, allowing
creation/modification of sub-cgroups affects it. So I think allowing
any unprivileged process to do that cannot be considered safe for now.
Explicit approval from some admin process will still be needed (which
can be given by chmod/chown today).


>
> James
>
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to [hidden email]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Thanks,

--
Aditya
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH v4 0/2] cgroup: allow management of subtrees by new cgroup namespaces

W. Trevor King
In reply to this post by Tejun Heo-2
I'm just chipping in from the peanut gallery, so sorry if this misses
some earlier discussion.

On Fri, May 20, 2016 at 12:53:26PM -0400, Tejun Heo wrote:
> Why does an unpriv NS need to have cgroup delegated to it without
> cooperation from cgroup manager?  If for resource control, I'm
> pretty sure we don't want to allow that without explicit cooperation
> from the enclosing scope.

A useful analogy is renice(1), which allows users to manage the way
their allocated resources are distributed among their processes.  A
system where a sysadmin has to explicitly grant users permission to
use renice seems overly restrictive, as does a system where a sysadmin
has to explicitly grant users permission to use cgroups to manage
their resources.  At that level, the only different is probably that
adjusting niceness doesn't consume additional system resources, while
creating a new cgroup does, and sysadmins might want to protect
themselves from having users create zillions of cgroups.

On the other hand, sysadmins who do want to grant this power can
automatically put users in their own cgroup with adjusted POSIX
permissions at login time (e.g. “Hi Alice, here's your own cgroup to
manage as you see fit.  Hi Bob, …”).  But having a way to say “I'm
fine with users creating their own cgroup namespaces and sub-cgroups”
is easier.  And making it opt-out (“I'm *not* fine with users creating
their own…”) is even easier, and the choice between opt-in and opt-out
probably depends on how expensive cgroups are.

Cheers,
Trevor

--
This email may be signed or encrypted with GnuPG (http://www.gnupg.org).
For more information, see http://en.wikipedia.org/wiki/Pretty_Good_Privacy

signature.asc (836 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH v4 0/2] cgroup: allow management of subtrees by new cgroup namespaces

Tejun Heo-2
In reply to this post by James Bottomley-2
Hello, James.

On Fri, May 20, 2016 at 01:28:46PM -0400, James Bottomley wrote:
> Just so I'm clear: by delegation you mean create a subdirectory in the
> cgroup hierarchy with a non-root owner?  We may have a solution for the
> escape constraints problem: see below.

Yeah, there's delegation section in cgroup-v2.txt.

> > Unfortunately, cgroup hierarchy isn't designed to support this sort
> > of automatic delegation.  Unpriv processes would be able to escape
> > constraints on v1 with some controllers and on v2 controllers have to
> > be explicitly enabled by root for delegated scope to have access to
> > them.
>
> Not necessarily.  We also talked about pinning the cgroup tree so that
> once you enter the cgroup namespace, your current cgroup directory
> becomes your root, meaning you can't cd back into the ancestors and
> thus can't write their tasks file, meaning, I think, that it should be
> impossible to escape ancestor constraints.

I wish it were that clean.  Unfortunately, on v1, some controllers
(memory and blkio depending on settings, netcls and netprio always)
simply aren't properly hierarchical and if you have write perm to
subdirectory you can escape the constraints of your ancestors.
Whether you can cd back up or not doesn't matter at all, so we can't
allow delegation by default.

> > Why does an unpriv NS need to have cgroup delegated to it without
> > cooperation from cgroup manager?
>
> There's actually many answers to this.  The one I'm insterested in is
> the ability for applications to make use of container features without
> having to ask permission from some orchestration engine.  The problem

What's "container features"?  Do you mean resource control by that?

> most people are looking at is how do I prevent the cgroup manager from
> running as root, because that's a security problem waiting to happen.

It's distributing system wide resources so the top of the tree will
always be owned by root and delegating subtrees is a fairly minimal
operation.  I don't see how that would necessarily lead to security
problems.

> >   If for resource control, I'm pretty sure we don't want to allow
> > that without explicit cooperation from the enclosing scope.
>
> The enclosing scope should be allowed to define the parameters (happens
> today with namespaces) but there shouldn't be an active "thing" which
> is the permission gateway.

It's not that I fundamentally disagree that that'd be nice to have
but, given the way cgroup is designed and implemented currently, I'm
not sure this is a feasible goal.

Thanks.

--
tejun
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH v4 0/2] cgroup: allow management of subtrees by new cgroup namespaces

James Bottomley-2
In reply to this post by Aditya Kali
On Fri, 2016-05-20 at 10:33 -0700, Aditya Kali wrote:

> On Fri, May 20, 2016 at 9:25 AM, James Bottomley
> <[hidden email]> wrote:
> >
> > On Fri, 2016-05-20 at 09:17 -0700, Tejun Heo wrote:
> > > Hello, James.
> > >
> > > On Fri, May 20, 2016 at 12:09:10PM -0400, James Bottomley wrote:
> > > > I think it's just different definitions.  If you take on our
> > > > definition of being able to set up a container without any
> > > > admin intervention, do you see our problem: we can't get the
> > > > initial delegation of the hierarchy.
> > >
> > > Yeah, I can see the difference but we can't solve that by special
> > > casing NS case.
> >
> > Great, we agree on the problem definition ... as I said, I'm not
> > saying this patch is the solution, but it gives us a starting point
> > for exploring whether there is a solution.
> >
> > >   This is stemming from the fact that an unpriv application can't
> > > create its sub-cgroups without explicit delegation from the root
> > > and that has always been an explicit design choice.
> > > It's tied to who's responsible for cleanup afterwards and what
> > > happens when the process gets migrated to a different cgroup.
> > >  The latter is an important issue on v1 hierarchies because
> > > migrating tasks sometimes is used as a way to control resource
> > > distribution.
> >
> > OK, so is the only problem cleanup?  If so, what if I proposed that
> > a cgroup directory could only be created by the owner of the userns
> > (which would be any old unprivileged user) iff they create a cgroup
> > ns and the cgroup ns would be responsible for removing it again, so
> > the cgroup subdirectory would be tied to the cgroup namespace as
> > its holder and we'd use release of the cgroup to remove all the
> > directories?
> >
>
> cgroup namspace doesn't own the resources in the cgroupns-root, and
> so I am not sure how it will be able to do the cleanup either. I.e,
> even if all the processes in the cgroup ns die, it doesn't mean that
> the cgroupns-root they belonged to is available for cleanup. For this
> reason, one of the implicit design choice in cgroupns was that the
> cgroup-ns root should already exist and the target process should
> already be moved to it (presumably by some admin process) before
> creating the cgroupns.
>
> Moreover, the subsystem controllers (cpu, memory, etc.) are oblivious
> to cgroup namespaces. So, for example, creating new cgroup namespace
> doesn't affect the reclaim behavior.

That doesn't mean we can't give them an owning cgroup namespace. It's
more or less the way user ns works today for all the other namespaces.
 However, lets try to arrive at a proposal that works before we start
thinking about the implementation.

>  But, allowing creation/modification of sub-cgroups affects it. So I
> think allowing any unprivileged process to do that cannot be
> considered safe for now. Explicit approval from some admin process
> will still be needed (which can be given by chmod/chown today).

Well, that's the essence of the question.  Can this be done in a safe
way for cgroups like it is for namespaces today?

James