[PATCHv1 0/8] CGroup Namespaces

Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

Andy Lutomirski
On Wed, Oct 22, 2014 at 11:37 AM, Aditya Kali <[hidden email]> wrote:

> On Tue, Oct 21, 2014 at 5:58 PM, Andy Lutomirski <[hidden email]> wrote:
>> On Tue, Oct 21, 2014 at 5:46 PM, Aditya Kali <[hidden email]> wrote:
>>> On Tue, Oct 21, 2014 at 3:42 PM, Andy Lutomirski <[hidden email]> wrote:
>>>> On Tue, Oct 21, 2014 at 3:33 PM, Aditya Kali <[hidden email]> wrote:
>>>>>>
>>>>>>> And with explicit permission from
>>>>>>> cgroup subsystem (something like cgroup.may_unshare as you had
>>>>>>> suggested previously), we can make sure that unprivileged processes
>>>>>>> cannot pin themselves. Also, maintaining this invariant (your current
>>>>>>> cgroup is always under your cgroupns-root) keeps the code and the
>>>>>>> semantics simple.
>>>>>>
>>>>>> I actually think it makes the semantics more complex.  The less policy
>>>>>> you stick in the kernel, the easier it is to understand the impact of
>>>>>> that policy.
>>>>>>
>>>>>
>>>>> My inclination is towards keeping things simpler - both in code as
>>>>> well as in configuration. I agree that cgroupns might seem
>>>>> "less-flexible", but in its current form, it encourages consistent
>>>>> container configuration. If you have a process that needs to move
>>>>> around between cgroups belonging to different containers, then that
>>>>> process should probably not be inside any container's cgroup
>>>>> namespace. Allowing that will just make the cgroup namespace
>>>>> pretty-much meaningless.
>>>>
>>>> The problem with pinning is that preventing it causes problems
>>>> (specifically, either something potentially complex and incompatible
>>>> needs to be added or unprivileged processes will be able to pin
>>>> themselves).
>>>>
>>>> Unless I'm missing something, a normal cgroupns user doesn't actually
>>>> need kernel pinning support to effectively constrain its members'
>>>> cgroups.
>>>>
>>>
>>> So there are 2 scenarios to consider:
>>>
>>> We have 2 containers with cgroups: /container1 and /container2
>>> Assume process P is running under cgroupns-root '/container1'
>>>
>>> (1) process P wants to 'write' to cgroup.procs outside its
>>> cgroupns-root (say to /container2/cgroup.procs)
>>
>> This, at least, doesn't have the problem with unprivileged processes
>> pinning themselves.
>>
>>> (2) An admin process running in init_cgroup_ns (or any parent cgroupns
>>> with cgroupns-root above /container1) wants to write pid of process P
>>> to /container2/cgroup.procs (which lies outside of P's cgroupns-root)
>>>
>>> For (1), I think it's OK to reject such a write. This is consistent
>>> with the restriction in cgroup_file_write added in 'Patch 6' of this
>>> set. I believe this should be independent of visibility of the cgroup
>>> hierarchy for P.
>>>
>>> For (2), we may allow the write to succeed if we make sure that the
>>> process doing the write is an admin process (with CAP_SYS_ADMIN in its
>>> userns AND over P's cgroupns->user_ns).
>>
>> Why is its userns relevant?
>>
>> Why not just check whether the target cgroup is in the process doing
>> the write's cgroupns? (NB: you need to check f_cred, here, not
>> current_cred(), but that's orthogonal.)  Then the policy becomes: no
>> user of cgroupfs can move any process outside of the cgroupfs's user's
>> cgroupns root.
>>
> Humm .. it doesn't have to be. I think it's simpler to not enforce
> artificial permission checks unless there is a security concern (and
> in this case, there doesn't seem to be any). So I will leave the
> capability check out from here.
>
>> I think I'm okay with this.
>>
>>> If this write succeeds, then:
>>> (a) process P's /proc/<pid>/cgroup does not show anything when viewed
>>> by 'self' or any other process in P's cgroupns. I would really like
>>> to avoid showing relative paths or paths outside the cgroupns-root
>>
>> The empty string seems just as problematic to me.
>
> Actually, there is no right answer here. Our options are:
> * show relative path
> -- this will break userspace as /proc/<pid>/cgroup does not show
> relative paths today. This is also very ambiguous (is it relative to
> cgroupns-root or relative to /proc/<pid>/cgroup file reader's cgroup?).
>

Confused now.  If ".." in /proc/pid/cgroup would be ambiguous, then so
would a path relative to cgroupns root, right?  Or am I missing
something?

(I'm not saying that ".." is beautiful or that it won't confuse
things.  I'm just not sure why it's ambiguous.)

> * show absolute path
> -- this will also be wrong, as the process won't be able to make sense
> of it unless it has exposure to the global cgroup hierarchy.
> -- worse, the global path may also exist under the cgroupns-root ...
> so now the process thinks it's in a completely wrong
> cgroup
> -- this exposes system
>
> * show only "/"
> -- this is arguably better, but if the process tries to verify that
> its pid is in cgroup.procs of the cgroupns-root, it's in for a
> surprise!
>
> In either case, whatever we expose, the userspace won't be able to use
> this path correctly (worse yet, it associates wrong cgroup for that
> path). So I think it's best to not print out the line for default
> hierarchy at all. This happens today when cgroupfs is not mounted. I
> am open to other suggestions.

I suppose that ".." is a possible security problem.  If I can force
you to see lots of ..s in there, then I might be able to get you to
write outside cgroupfs.
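
To make the concern concrete: a tool that joins its cgroupfs mount point
with whatever path it reads from /proc/<pid>/cgroup could be walked out of
the mount by ".." components. A minimal userspace sketch of a defensive
check follows; toy_has_dotdot() is a hypothetical helper, not something
from this patchset.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return true if any '/'-separated component of
 * path is exactly "..". A caller could refuse to append such a path
 * to its cgroupfs mount point.
 */
bool toy_has_dotdot(const char *path)
{
	const char *p = path;

	while (*p) {
		size_t n;

		while (*p == '/')		/* skip separators */
			p++;
		n = strcspn(p, "/");		/* length of this component */
		if (n == 2 && p[0] == '.' && p[1] == '.')
			return true;
		p += n;
	}
	return false;
}
```

Components that merely contain dots, such as "a..b" or "...", are not
flagged; only an exact ".." is.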

Grr.  No great solution here.  I suppose that the empty string isn't
so bad.  We could also write something obviously invalid like
"(unreachable)".  As long as no one actually creates a cgroup called
"(unreachable)", then this could result in errors but not actual
confusion.
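
A rough sketch of the "(unreachable)" idea, working on plain path strings
rather than kernel objects. toy_ns_path() and its arguments are
assumptions made for illustration; the real code would walk kernfs nodes.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: given the absolute path of a cgroupns root and
 * the absolute path of a task's cgroup, write what a reader inside that
 * namespace would be shown. Cgroups outside the root are reported as
 * "(unreachable)".
 */
int toy_ns_path(const char *ns_root, const char *cgrp, char *buf, size_t buflen)
{
	size_t rlen = strlen(ns_root);
	const char *shown;
	int n;

	if (strcmp(ns_root, "/") == 0)
		shown = cgrp;			/* init namespace: nothing to strip */
	else if (strncmp(cgrp, ns_root, rlen) != 0)
		shown = "(unreachable)";
	else if (cgrp[rlen] == '\0')
		shown = "/";			/* the namespace root itself */
	else if (cgrp[rlen] == '/')
		shown = cgrp + rlen;		/* strip the root prefix */
	else
		shown = "(unreachable)";	/* e.g. /container10 vs /container1 */

	n = snprintf(buf, buflen, "%s", shown);
	if (n < 0 || (size_t)n >= buflen)
		return -ENAMETOOLONG;
	return 0;
}
```

With ns_root "/container1", a task in "/container1/a" would be shown "/a",
while a task moved to "/container2" would be shown "(unreachable)".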

>>> * should we then also allow setns() without first entering the
>>> cgroupns-root? setns also checks the same conditions as in (a) plus it
>>> checks that your current cgroup is descendant of target cgroupns-root.
>>> Alternatively we can special-case setns() to own cgroupns so that it
>>> doesn't fail.
>>
>> I think setns should completely ignore the caller's cgroup and should
>> not change it.  Userspace can do this.
>>
>
> All the above changes more or less mean that tasks cannot pin themselves
> by unsharing cgroupns. Do you agree that we don't need that "explicit
> permission from cgroupfs" anymore (via cgroup.may_unshare file or
> other mechanism)?

Yes, I agree.

>
>>> * migration for these processes will be tricky, if not impossible. But
>>> the admin trying to do this probably doesn't care about it or will
>>> provision for it.
>>
>> Migration for processes in a mntns that have a current directory
>> outside their mntns is also difficult or impossible.  Same with
>> pidnses with an fd pointing at /proc/self from an outside-the-pid-ns
>> procfs.  Nothing new here.
>>
>> --Andy
>
> Thanks for the review!

No problem.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv1 6/8] cgroup: restrict cgroup operations within task's cgroupns

Aditya Kali
In reply to this post by Serge E. Hallyn-3
On Fri, Oct 17, 2014 at 2:28 AM, Serge E. Hallyn <[hidden email]> wrote:

> Quoting Aditya Kali ([hidden email]):
>> Restrict following operations within the calling tasks:
>> * cgroup_mkdir & cgroup_rmdir
>> * cgroup_attach_task
>> * writes to cgroup files outside of task's cgroupns-root
>>
>> Also, read of /proc/<pid>/cgroup file is now restricted only
>> to tasks under same cgroupns-root. If a task tries to look
>> at cgroup of another task outside of its cgroupns-root, then
>> it won't be able to see anything for the default hierarchy.
>> This is same as if the cgroups are not mounted.
>>
>> Signed-off-by: Aditya Kali <[hidden email]>
>
> So this is a bit different from some other namespaces - if I
> have an open fd to a file, then setns into a mntns where that
> file is not addressable, I can still use the file.
>
> I guess not allowing attach to a cgroup outside our ns is a
> good failsafe as we'll otherwise risk falling off a cliff in
> some code, but I'm not sure the cgroup_file_write/mkdir/rmdir
> restrictions are needed.  (And really I can fchdir to a
> directory not in my ns, so I'm not sure the cgroup-attach restriction
> is any more justified).
>

As discussed on another thread, most of the restrictions in this patch
are undesirable and will be removed in the next version. Even the
restriction in cgroup_attach_task() will change to something like:

-     if (!cgroup_is_descendant(dst_cgrp, task_cgroupns_root(leader)))
+     if (!cgroup_is_descendant(dst_cgrp, task_cgroupns_root(current)))
            return -EPERM;

i.e., we don't care about the cgroup of the process being moved. We only
check if the writer has access to the dst_cgrp.
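
A small userspace model of the revised rule, assuming a toy tree where
each cgroup only knows its parent. struct toy_cgroup,
toy_is_descendant() and toy_may_attach() are hypothetical stand-ins for
struct cgroup, cgroup_is_descendant() and the check in
cgroup_attach_task(); like the kernel helper as used here, a cgroup
counts as its own descendant.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct cgroup: only the parent link matters here. */
struct toy_cgroup {
	const char *name;
	struct toy_cgroup *parent;
};

/* True if cgrp is ancestor itself or lies somewhere below it. */
bool toy_is_descendant(const struct toy_cgroup *cgrp,
		       const struct toy_cgroup *ancestor)
{
	while (cgrp) {
		if (cgrp == ancestor)
			return true;
		cgrp = cgrp->parent;
	}
	return false;
}

/*
 * Revised rule: only the writer's cgroupns root is consulted. The
 * namespace of the task being moved plays no part in the decision.
 */
int toy_may_attach(const struct toy_cgroup *dst_cgrp,
		   const struct toy_cgroup *writer_ns_root)
{
	if (!toy_is_descendant(dst_cgrp, writer_ns_root))
		return -EPERM;
	return 0;
}
```

In this model an admin whose namespace root is "/" may move any task into
/container2, while a writer rooted at /container1 gets -EPERM for the
same destination.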

So I will just drop this patch in the next version and merge the
cgroup_attach_task() change in another patch.

> Still I'm not strictly opposed to this, so
>
> Acked-by: Serge Hallyn <[hidden email]>
>
> just wanted to point this out.
>
>> ---
>>  kernel/cgroup.c | 34 +++++++++++++++++++++++++++++++++-
>>  1 file changed, 33 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>> index f8099b4..2fc0dfa 100644
>> --- a/kernel/cgroup.c
>> +++ b/kernel/cgroup.c
>> @@ -2318,6 +2318,12 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
>>       struct task_struct *task;
>>       int ret;
>>
>> +     /* Only allow changing cgroups accessible within task's cgroup
>> +      * namespace. i.e. 'dst_cgrp' should be a descendant of task's
>> +      * cgroupns->root_cgrp. */
>> +     if (!cgroup_is_descendant(dst_cgrp, task_cgroupns_root(leader)))
>> +             return -EPERM;
>> +
>>       /* look up all src csets */
>>       down_read(&css_set_rwsem);
>>       rcu_read_lock();
>> @@ -2882,6 +2888,10 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
>>       struct cgroup_subsys_state *css;
>>       int ret;
>>
>> +     /* Reject writes to cgroup files outside of task's cgroupns-root. */
>> +     if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current)))
>> +             return -EINVAL;
>> +
>>       if (cft->write)
>>               return cft->write(of, buf, nbytes, off);
>>
>> @@ -4560,6 +4570,13 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
>>       parent = cgroup_kn_lock_live(parent_kn);
>>       if (!parent)
>>               return -ENODEV;
>> +
>> +     /* Allow mkdir only within process's cgroup namespace root. */
>> +     if (!cgroup_is_descendant(parent, task_cgroupns_root(current))) {
>> +             ret = -EPERM;
>> +             goto out_unlock;
>> +     }
>> +
>>       root = parent->root;
>>
>>       /* allocate the cgroup and its ID, 0 is reserved for the root */
>> @@ -4822,6 +4839,13 @@ static int cgroup_rmdir(struct kernfs_node *kn)
>>       if (!cgrp)
>>               return 0;
>>
>> +     /* Allow rmdir only within process's cgroup namespace root.
>> +      * The process can't delete its own root anyways. */
>> +     if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current))) {
>> +             cgroup_kn_unlock(kn);
>> +             return -EPERM;
>> +     }
>> +
>>       ret = cgroup_destroy_locked(cgrp);
>>
>>       cgroup_kn_unlock(kn);
>> @@ -5051,6 +5075,15 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
>>               if (root == &cgrp_dfl_root && !cgrp_dfl_root_visible)
>>                       continue;
>>
>> +             cgrp = task_cgroup_from_root(tsk, root);
>> +
>> +             /* The cgroup path on default hierarchy is shown only if it
>> +              * falls under current task's cgroupns-root.
>> +              */
>> +             if (root == &cgrp_dfl_root &&
>> +                 !cgroup_is_descendant(cgrp, task_cgroupns_root(current)))
>> +                     continue;
>> +
>>               seq_printf(m, "%d:", root->hierarchy_id);
>>               for_each_subsys(ss, ssid)
>>                       if (root->subsys_mask & (1 << ssid))
>> @@ -5059,7 +5092,6 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
>>                       seq_printf(m, "%sname=%s", count ? "," : "",
>>                                  root->name);
>>               seq_putc(m, ':');
>> -             cgrp = task_cgroup_from_root(tsk, root);
>>               path = cgroup_path(cgrp, buf, PATH_MAX);
>>               if (!path) {
>>                       retval = -ENAMETOOLONG;
>> --
>> 2.1.0.rc2.206.gedb03e5
>>
>> _______________________________________________
>> Containers mailing list
>> [hidden email]
>> https://lists.linuxfoundation.org/mailman/listinfo/containers


Thanks for the review!

--
Aditya

Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

Tejun Heo-2
In reply to this post by Aditya Kali
Hello,

On Wed, Oct 22, 2014 at 11:37:55AM -0700, Aditya Kali wrote:
...
> Actually, there is no right answer here. Our options are:
> * show relative path
> -- this will break userspace as /proc/<pid>/cgroup does not show
> relative paths today. This is also very ambiguous (is it relative to
> cgroupns-root or relative to /proc/<pid>/cgroup file reader's cgroup?).

Let's go with this w/o pinning.  The only necessary feature for
cgroupns is making the /proc/*/cgroup relative to its own root.  It's
not like containers can avoid trusting their outside world anyway and
playing tricks with things like this tend to lead to weird surprises
down the road.  If userland messes up, userland messes up.
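
For illustration, a userspace sketch of how a path relative to the
namespace root could be computed, with ".." components for cgroups that
lie outside it. toy_relpath() is a hypothetical helper working on
normalized absolute path strings; it is not the kernfs code.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: write into buf the path of cgroup `to` as seen
 * from namespace root `from`. Both are absolute, normalized paths such
 * as "/container1" (no trailing slash, no "." or ".." components).
 */
int toy_relpath(const char *from, const char *to, char *buf, size_t buflen)
{
	size_t i, common = 0, len = 0, tlen;
	const char *p;

	/* Treat "/" as the empty path so every component starts with '/'. */
	if (strcmp(from, "/") == 0)
		from = "";
	if (strcmp(to, "/") == 0)
		to = "";

	/* Longest common prefix that ends on a component boundary. */
	for (i = 0; ; i++) {
		bool fend = from[i] == '/' || from[i] == '\0';
		bool tend = to[i] == '/' || to[i] == '\0';

		if (fend && tend)
			common = i;
		if (from[i] == '\0' || from[i] != to[i])
			break;
	}

	/* One "/.." for every component of `from` below the common prefix. */
	for (p = from + common; *p; p++) {
		if (*p != '/')
			continue;
		if (len + 3 >= buflen)
			return -ENAMETOOLONG;
		memcpy(buf + len, "/..", 3);
		len += 3;
	}

	/* Then the part of `to` below the common prefix. */
	tlen = strlen(to + common);
	if (len + tlen >= buflen)
		return -ENAMETOOLONG;
	memcpy(buf + len, to + common, tlen);
	len += tlen;

	if (len == 0) {			/* `to` is the namespace root itself */
		if (buflen < 2)
			return -ENAMETOOLONG;
		buf[len++] = '/';
	}
	buf[len] = '\0';
	return 0;
}
```

Under these assumptions, a reader rooted at /container1 would see a task
in /container1/job as "/job" and a task in /container2/sub as
"/../container2/sub".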

Thanks.

--
tejun

Re: [PATCHv1 5/8] cgroup: introduce cgroup namespaces

Aditya Kali
In reply to this post by Serge E. Hallyn-3
I will include the suggested changes in the new patchset. Some comments inline.

On Thu, Oct 16, 2014 at 9:37 AM, Serge E. Hallyn <[hidden email]> wrote:

> Quoting Aditya Kali ([hidden email]):
>> Introduce the ability to create new cgroup namespace. The newly created
>> cgroup namespace remembers the 'struct cgroup *root_cgrp' at the point
>> of creation of the cgroup namespace. The task that creates the new
>> cgroup namespace and all its future children will now be restricted only
>> to the cgroup hierarchy under this root_cgrp.
>> The main purpose of cgroup namespace is to virtualize the contents
>> of /proc/self/cgroup file. Processes inside a cgroup namespace
>> are only able to see paths relative to their namespace root.
>> This allows container-tools (like libcontainer, lxc, lmctfy, etc.)
>> to create completely virtualized containers without leaking system
>> level cgroup hierarchy to the task.
>> This patch only implements the 'unshare' part of the cgroupns.
>>
>> Signed-off-by: Aditya Kali <[hidden email]>
>
> I'm not sure that the CONFIG_CGROUP_NS is worthwhile.  If you already
> have cgroups in the kernel this won't add much in the way of memory
> usage, right?  And I think the 'experimental' argument has long since
> been squashed.  So I'd argue for simplifying this patch by removing
> CONFIG_CGROUP_NS.
>

With no pinning involved, I think it's safe to enable the feature
without needing a config option. Removed it from next version. This
feature is now implicitly available with CONFIG_CGROUPS.

> (more below)
>
>> ---
>>  fs/proc/namespaces.c             |   3 +
>>  include/linux/cgroup.h           |  18 +++++-
>>  include/linux/cgroup_namespace.h |  62 +++++++++++++++++++
>>  include/linux/nsproxy.h          |   2 +
>>  include/linux/proc_ns.h          |   4 ++
>>  init/Kconfig                     |   9 +++
>>  kernel/Makefile                  |   1 +
>>  kernel/cgroup.c                  |  11 ++++
>>  kernel/cgroup_namespace.c        | 128 +++++++++++++++++++++++++++++++++++++++
>>  kernel/fork.c                    |   2 +-
>>  kernel/nsproxy.c                 |  19 +++++-
>>  11 files changed, 255 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
>> index 8902609..e04ed4b 100644
>> --- a/fs/proc/namespaces.c
>> +++ b/fs/proc/namespaces.c
>> @@ -32,6 +32,9 @@ static const struct proc_ns_operations *ns_entries[] = {
>>       &userns_operations,
>>  #endif
>>       &mntns_operations,
>> +#ifdef CONFIG_CGROUP_NS
>> +     &cgroupns_operations,
>> +#endif
>>  };
>>
>>  static const struct file_operations ns_file_operations = {
>> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
>> index 4a0eb2d..aa86495 100644
>> --- a/include/linux/cgroup.h
>> +++ b/include/linux/cgroup.h
>> @@ -22,6 +22,8 @@
>>  #include <linux/seq_file.h>
>>  #include <linux/kernfs.h>
>>  #include <linux/wait.h>
>> +#include <linux/nsproxy.h>
>> +#include <linux/types.h>
>>
>>  #ifdef CONFIG_CGROUPS
>>
>> @@ -460,6 +462,13 @@ struct cftype {
>>  #endif
>>  };
>>
>> +struct cgroup_namespace {
>> +     atomic_t                count;
>> +     unsigned int            proc_inum;
>> +     struct user_namespace   *user_ns;
>> +     struct cgroup           *root_cgrp;
>> +};
>> +
>>  extern struct cgroup_root cgrp_dfl_root;
>>  extern struct css_set init_css_set;
>>
>> @@ -584,10 +593,17 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
>>       return kernfs_name(cgrp->kn, buf, buflen);
>>  }
>>
>> +static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
>> +                                              struct cgroup *cgrp, char *buf,
>> +                                              size_t buflen)
>> +{
>> +     return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf, buflen);
>> +}
>> +
>>  static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
>>                                             size_t buflen)
>>  {
>> -     return kernfs_path(cgrp->kn, buf, buflen);
>> +     return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf, buflen);
>>  }
>>
>>  static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
>> diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
>> new file mode 100644
>> index 0000000..9f637fe
>> --- /dev/null
>> +++ b/include/linux/cgroup_namespace.h
>> @@ -0,0 +1,62 @@
>> +#ifndef _LINUX_CGROUP_NAMESPACE_H
>> +#define _LINUX_CGROUP_NAMESPACE_H
>> +
>> +#include <linux/nsproxy.h>
>> +#include <linux/cgroup.h>
>> +#include <linux/types.h>
>> +#include <linux/user_namespace.h>
>> +
>> +extern struct cgroup_namespace init_cgroup_ns;
>> +
>> +static inline struct cgroup *task_cgroupns_root(struct task_struct *tsk)
>> +{
>> +     return tsk->nsproxy->cgroup_ns->root_cgrp;
>
> Per the rules in nsproxy.h, you should be taking the task_lock here.
>
> (If you are making assumptions about tsk then you need to state them
> here - I only looked quickly enough that you pass in 'leader')
>

In the new version of the patch, we call this function only for the
'current' task. As per nsproxy.h, no special precautions are needed when
reading current task's nsproxy. So I just remodeled this function into
"current_cgroupns_root(void)".

>> +}
>> +
>> +#ifdef CONFIG_CGROUP_NS
>> +
>> +extern void free_cgroup_ns(struct cgroup_namespace *ns);
>> +
>> +static inline struct cgroup_namespace *get_cgroup_ns(
>> +             struct cgroup_namespace *ns)
>> +{
>> +     if (ns)
>> +             atomic_inc(&ns->count);
>> +     return ns;
>> +}
>> +
>> +static inline void put_cgroup_ns(struct cgroup_namespace *ns)
>> +{
>> +     if (ns && atomic_dec_and_test(&ns->count))
>> +             free_cgroup_ns(ns);
>> +}
>> +
>> +extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
>> +                                            struct user_namespace *user_ns,
>> +                                            struct cgroup_namespace *old_ns);
>> +
>> +#else  /* CONFIG_CGROUP_NS */
>> +
>> +static inline struct cgroup_namespace *get_cgroup_ns(
>> +             struct cgroup_namespace *ns)
>> +{
>> +     return &init_cgroup_ns;
>> +}
>> +
>> +static inline void put_cgroup_ns(struct cgroup_namespace *ns)
>> +{
>> +}
>> +
>> +static inline struct cgroup_namespace *copy_cgroup_ns(
>> +             unsigned long flags,
>> +             struct user_namespace *user_ns,
>> +             struct cgroup_namespace *old_ns) {
>> +     if (flags & CLONE_NEWCGROUP)
>> +             return ERR_PTR(-EINVAL);
>> +
>> +     return old_ns;
>> +}
>> +
>> +#endif  /* CONFIG_CGROUP_NS */
>> +
>> +#endif  /* _LINUX_CGROUP_NAMESPACE_H */
>> diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
>> index 35fa08f..ac0d65b 100644
>> --- a/include/linux/nsproxy.h
>> +++ b/include/linux/nsproxy.h
>> @@ -8,6 +8,7 @@ struct mnt_namespace;
>>  struct uts_namespace;
>>  struct ipc_namespace;
>>  struct pid_namespace;
>> +struct cgroup_namespace;
>>  struct fs_struct;
>>
>>  /*
>> @@ -33,6 +34,7 @@ struct nsproxy {
>>       struct mnt_namespace *mnt_ns;
>>       struct pid_namespace *pid_ns_for_children;
>>       struct net           *net_ns;
>> +     struct cgroup_namespace *cgroup_ns;
>>  };
>>  extern struct nsproxy init_nsproxy;
>>
>> diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
>> index 34a1e10..e56dd73 100644
>> --- a/include/linux/proc_ns.h
>> +++ b/include/linux/proc_ns.h
>> @@ -6,6 +6,8 @@
>>
>>  struct pid_namespace;
>>  struct nsproxy;
>> +struct task_struct;
>> +struct inode;
>>
>>  struct proc_ns_operations {
>>       const char *name;
>> @@ -27,6 +29,7 @@ extern const struct proc_ns_operations ipcns_operations;
>>  extern const struct proc_ns_operations pidns_operations;
>>  extern const struct proc_ns_operations userns_operations;
>>  extern const struct proc_ns_operations mntns_operations;
>> +extern const struct proc_ns_operations cgroupns_operations;
>>
>>  /*
>>   * We always define these enumerators
>> @@ -37,6 +40,7 @@ enum {
>>       PROC_UTS_INIT_INO       = 0xEFFFFFFEU,
>>       PROC_USER_INIT_INO      = 0xEFFFFFFDU,
>>       PROC_PID_INIT_INO       = 0xEFFFFFFCU,
>> +     PROC_CGROUP_INIT_INO    = 0xEFFFFFFBU,
>>  };
>>
>>  #ifdef CONFIG_PROC_FS
>> diff --git a/init/Kconfig b/init/Kconfig
>> index e84c642..c3be001 100644
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -1144,6 +1144,15 @@ config DEBUG_BLK_CGROUP
>>       Enable some debugging help. Currently it exports additional stat
>>       files in a cgroup which can be useful for debugging.
>>
>> +config CGROUP_NS
>> +     bool "CGroup Namespaces"
>> +     default n
>> +     help
>> +       This options enables CGroup Namespaces which can be used to isolate
>> +       cgroup paths. This feature is only useful when unified cgroup
>> +       hierarchy is in use (i.e. cgroups are mounted with sane_behavior
>> +       option).
>> +
>>  endif # CGROUPS
>>
>>  config CHECKPOINT_RESTORE
>> diff --git a/kernel/Makefile b/kernel/Makefile
>> index dc5c775..75334f8 100644
>> --- a/kernel/Makefile
>> +++ b/kernel/Makefile
>> @@ -51,6 +51,7 @@ obj-$(CONFIG_KEXEC) += kexec.o
>>  obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
>>  obj-$(CONFIG_COMPAT) += compat.o
>>  obj-$(CONFIG_CGROUPS) += cgroup.o
>> +obj-$(CONFIG_CGROUP_NS) += cgroup_namespace.o
>>  obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
>>  obj-$(CONFIG_CPUSETS) += cpuset.o
>>  obj-$(CONFIG_UTS_NS) += utsname.o
>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>> index 2b3e9f9..f8099b4 100644
>> --- a/kernel/cgroup.c
>> +++ b/kernel/cgroup.c
>> @@ -57,6 +57,8 @@
>>  #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
>>  #include <linux/kthread.h>
>>  #include <linux/delay.h>
>> +#include <linux/proc_ns.h>
>> +#include <linux/cgroup_namespace.h>
>>
>>  #include <linux/atomic.h>
>>
>> @@ -195,6 +197,15 @@ static void kill_css(struct cgroup_subsys_state *css);
>>  static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
>>                             bool is_add);
>>
>> +struct cgroup_namespace init_cgroup_ns = {
>> +     .count = {
>> +             .counter = 1,
>> +     },
>> +     .proc_inum = PROC_CGROUP_INIT_INO,
>> +     .user_ns = &init_user_ns,
>
> This might mean that you should bump the init_user_ns refcount.
>

Humm. Doesn't look like all other namespaces are doing it though (ex:
init_pid_ns or init_ipc_ns). The initial count in init_user_ns is set
to 3 which only accounts for some current users, but not all. I will
increment it for init_cgroup_ns nevertheless (in cgroup_init()).

>> +     .root_cgrp = &cgrp_dfl_root.cgrp,
>> +};
>> +
>>  /* IDR wrappers which synchronize using cgroup_idr_lock */
>>  static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end,
>>                           gfp_t gfp_mask)
>> diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
>> new file mode 100644
>> index 0000000..c16604f
>> --- /dev/null
>> +++ b/kernel/cgroup_namespace.c
>> @@ -0,0 +1,128 @@
>> +
>> +#include <linux/cgroup.h>
>> +#include <linux/cgroup_namespace.h>
>> +#include <linux/sched.h>
>> +#include <linux/slab.h>
>> +#include <linux/nsproxy.h>
>> +#include <linux/proc_ns.h>
>> +
>> +static struct cgroup_namespace *alloc_cgroup_ns(void)
>> +{
>> +     struct cgroup_namespace *new_ns;
>> +
>> +     new_ns = kmalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
>> +     if (new_ns)
>> +             atomic_set(&new_ns->count, 1);
>> +     return new_ns;
>> +}
>> +
>> +void free_cgroup_ns(struct cgroup_namespace *ns)
>> +{
>> +     cgroup_put(ns->root_cgrp);
>> +     put_user_ns(ns->user_ns);
>
> This is a problem on the error path in copy_cgroup_ns.  The
> alloc_cgroup_ns() doesn't initialize these values, so if
> you should fail in proc_alloc_inum() you'll show up here
> with random values in ns->*.
>

I don't see the codepath that leads to calling free_cgroup_ns() with
uninitialized members. We don't call free_cgroup_ns() on the error
path in copy_cgroup_ns().

>> +     proc_free_inum(ns->proc_inum);

BTW, I was missing the actual kfree(ns) here. Added it.
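
A single-threaded userspace model of the get/put lifecycle, including
the final free that was missing. All names are hypothetical; the plain
int counters stand in for atomic_t and for the reference held on
root_cgrp, and calloc() is used so that a partially set-up object never
carries random values.

```c
#include <stdlib.h>

struct toy_ns {
	int count;		/* plain int: this toy is single-threaded */
	int *root_refs;		/* stands in for the ref held on root_cgrp */
};

/* Allocate a namespace holding one reference on the toy root. */
struct toy_ns *toy_ns_alloc(int *root_refs)
{
	struct toy_ns *ns = calloc(1, sizeof(*ns));	/* zeroed memory */

	if (!ns)
		return NULL;
	ns->count = 1;
	ns->root_refs = root_refs;
	(*root_refs)++;
	return ns;
}

struct toy_ns *toy_ns_get(struct toy_ns *ns)
{
	if (ns)
		ns->count++;
	return ns;
}

/* Returns 1 if this put dropped the last reference and freed ns. */
int toy_ns_put(struct toy_ns *ns)
{
	if (!ns || --ns->count > 0)
		return 0;
	(*ns->root_refs)--;	/* drop the root reference ... */
	free(ns);		/* ... and release the object itself */
	return 1;
}
```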

>> +}
>> +EXPORT_SYMBOL(free_cgroup_ns);
>> +
>> +struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
>> +                                     struct user_namespace *user_ns,
>> +                                     struct cgroup_namespace *old_ns)
>> +{
>> +     struct cgroup_namespace *new_ns = NULL;
>> +     struct cgroup *cgrp = NULL;
>> +     int err;
>> +
>> +     BUG_ON(!old_ns);
>> +
>> +     if (!(flags & CLONE_NEWCGROUP))
>> +             return get_cgroup_ns(old_ns);
>> +
>> +     /* Allow only sysadmin to create cgroup namespace. */
>> +     err = -EPERM;
>> +     if (!ns_capable(user_ns, CAP_SYS_ADMIN))
>> +             goto err_out;
>> +
>> +     /* Prevent cgroup changes for this task. */
>> +     threadgroup_lock(current);
>> +
>> +     cgrp = get_task_cgroup(current);
>> +
>> +     /* Creating new CGROUPNS is supported only when unified hierarchy is in
>> +      * use. */
>
> Oh, drat.  Well, I'll take it, but under protest  :)
>

Actually, I realized that this comment and the check below are bogus.
The 'get_task_cgroup(current)' always only returns the cgroup on the
default hierarchy. And so, the check below is unnecessary.
What this comment should really say is that cgroup namespace only
virtualizes the cgroup path for the default (unified) hierarchy. It's
fine if you have other hierarchies mounted too. Just that for those
hierarchies, full (non-virtualized) cgroup path will be visible in
/proc/self/cgroup. So cgroupns won't help there.

I have updated the comment in the new version of the patch.
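
For reference, each line of /proc/<pid>/cgroup has the shape
"hierarchy-ID:controller-list:path", as printed by proc_cgroup_show().
A small userspace sketch that splits one such line, so a tool can tell
which hierarchy a (possibly virtualized) path belongs to. The struct and
function names are hypothetical, and the buffer sizes are arbitrary.

```c
#include <stdlib.h>
#include <string.h>

struct toy_cgroup_line {
	int hierarchy_id;
	char controllers[64];	/* may be empty */
	char path[256];
};

/* Parse "ID:controllers:path" (optional trailing newline). 0 on success. */
int toy_parse_cgroup_line(const char *line, struct toy_cgroup_line *out)
{
	const char *c1 = strchr(line, ':');
	const char *c2 = c1 ? strchr(c1 + 1, ':') : NULL;
	size_t clen, plen;
	char *end;
	long id;

	if (!c2)
		return -1;
	id = strtol(line, &end, 10);
	if (end == line || end != c1)	/* ID must be all digits up to ':' */
		return -1;

	clen = (size_t)(c2 - c1 - 1);
	plen = strcspn(c2 + 1, "\n");
	if (clen >= sizeof(out->controllers) || plen >= sizeof(out->path))
		return -1;

	out->hierarchy_id = (int)id;
	memcpy(out->controllers, c1 + 1, clen);
	out->controllers[clen] = '\0';
	memcpy(out->path, c2 + 1, plen);
	out->path[plen] = '\0';
	return 0;
}
```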

>> +     err = -EINVAL;
>> +     if (!cgroup_on_dfl(cgrp))
>> +             goto err_out_unlock;
>> +
>> +     err = -ENOMEM;
>> +     new_ns = alloc_cgroup_ns();
>> +     if (!new_ns)
>> +             goto err_out_unlock;
>> +
>> +     err = proc_alloc_inum(&new_ns->proc_inum);
>> +     if (err)
>> +             goto err_out_unlock;
>> +
>> +     new_ns->user_ns = get_user_ns(user_ns);
>> +     new_ns->root_cgrp = cgrp;
>> +
>> +     threadgroup_unlock(current);
>> +
>> +     return new_ns;
>> +
>> +err_out_unlock:
>> +     threadgroup_unlock(current);
>> +err_out:
>> +     if (cgrp)
>> +             cgroup_put(cgrp);
>> +     kfree(new_ns);
>> +     return ERR_PTR(err);
>> +}
>> +
>> +static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
>> +{
>> +     pr_info("setns not supported for cgroup namespace");
>> +     return -EINVAL;
>> +}
>> +
>> +static void *cgroupns_get(struct task_struct *task)
>> +{
>> +     struct cgroup_namespace *ns = NULL;
>> +     struct nsproxy *nsproxy;
>> +
>> +     rcu_read_lock();
>> +     nsproxy = task->nsproxy;
>> +     if (nsproxy) {
>> +             ns = nsproxy->cgroup_ns;
>> +             get_cgroup_ns(ns);
>> +     }
>> +     rcu_read_unlock();
>> +
>> +     return ns;
>> +}
>> +
>> +static void cgroupns_put(void *ns)
>> +{
>> +     put_cgroup_ns(ns);
>> +}
>> +
>> +static unsigned int cgroupns_inum(void *ns)
>> +{
>> +     struct cgroup_namespace *cgroup_ns = ns;
>> +
>> +     return cgroup_ns->proc_inum;
>> +}
>> +
>> +const struct proc_ns_operations cgroupns_operations = {
>> +     .name           = "cgroup",
>> +     .type           = CLONE_NEWCGROUP,
>> +     .get            = cgroupns_get,
>> +     .put            = cgroupns_put,
>> +     .install        = cgroupns_install,
>> +     .inum           = cgroupns_inum,
>> +};
>> +
>> +static __init int cgroup_namespaces_init(void)
>> +{
>> +     return 0;
>> +}
>> +subsys_initcall(cgroup_namespaces_init);
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 0cf9cdb..cc06851 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -1790,7 +1790,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
>>       if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
>>                               CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
>>                               CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
>> -                             CLONE_NEWUSER|CLONE_NEWPID))
>> +                             CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
>>               return -EINVAL;
>>       /*
>>        * Not implemented, but pretend it works if there is nothing to
>> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
>> index ef42d0a..a8b1970 100644
>> --- a/kernel/nsproxy.c
>> +++ b/kernel/nsproxy.c
>> @@ -25,6 +25,7 @@
>>  #include <linux/proc_ns.h>
>>  #include <linux/file.h>
>>  #include <linux/syscalls.h>
>> +#include <linux/cgroup_namespace.h>
>>
>>  static struct kmem_cache *nsproxy_cachep;
>>
>> @@ -39,6 +40,7 @@ struct nsproxy init_nsproxy = {
>>  #ifdef CONFIG_NET
>>       .net_ns                 = &init_net,
>>  #endif
>> +     .cgroup_ns              = &init_cgroup_ns,
>>  };
>>
>>  static inline struct nsproxy *create_nsproxy(void)
>> @@ -92,6 +94,13 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
>>               goto out_pid;
>>       }
>>
>> +     new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
>> +                                         tsk->nsproxy->cgroup_ns);
>> +     if (IS_ERR(new_nsp->cgroup_ns)) {
>> +             err = PTR_ERR(new_nsp->cgroup_ns);
>> +             goto out_cgroup;
>> +     }
>> +
>>       new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
>>       if (IS_ERR(new_nsp->net_ns)) {
>>               err = PTR_ERR(new_nsp->net_ns);
>> @@ -101,6 +110,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
>>       return new_nsp;
>>
>>  out_net:
>> +     if (new_nsp->cgroup_ns)
>> +             put_cgroup_ns(new_nsp->cgroup_ns);
>> +out_cgroup:
>>       if (new_nsp->pid_ns_for_children)
>>               put_pid_ns(new_nsp->pid_ns_for_children);
>>  out_pid:
>> @@ -128,7 +140,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
>>       struct nsproxy *new_ns;
>>
>>       if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
>> -                           CLONE_NEWPID | CLONE_NEWNET)))) {
>> +                           CLONE_NEWPID | CLONE_NEWNET |
>> +                           CLONE_NEWCGROUP)))) {
>>               get_nsproxy(old_ns);
>>               return 0;
>>       }
>> @@ -165,6 +178,8 @@ void free_nsproxy(struct nsproxy *ns)
>>               put_ipc_ns(ns->ipc_ns);
>>       if (ns->pid_ns_for_children)
>>               put_pid_ns(ns->pid_ns_for_children);
>> +     if (ns->cgroup_ns)
>> +             put_cgroup_ns(ns->cgroup_ns);
>>       put_net(ns->net_ns);
>>       kmem_cache_free(nsproxy_cachep, ns);
>>  }
>> @@ -180,7 +195,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
>>       int err = 0;
>>
>>       if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
>> -                            CLONE_NEWNET | CLONE_NEWPID)))
>> +                            CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
>>               return 0;
>>
>>       user_ns = new_cred ? new_cred->user_ns : current_user_ns();
>> --
>> 2.1.0.rc2.206.gedb03e5
>>
>> _______________________________________________
>> Containers mailing list
>> [hidden email]
>> https://lists.linuxfoundation.org/mailman/listinfo/containers


Thanks for the review!
--
Aditya
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv1 5/8] cgroup: introduce cgroup namespaces

Serge E. Hallyn-3
Quoting Aditya Kali ([hidden email]):

> >> +void free_cgroup_ns(struct cgroup_namespace *ns)
> >> +{
> >> +     cgroup_put(ns->root_cgrp);
> >> +     put_user_ns(ns->user_ns);
> >
> > This is a problem on the error path in copy_cgroup_ns.  The
> > alloc_cgroup_ns() doesn't initialize these values, so if
> > you should fail in proc_alloc_inum() you'll show up here
> > with random values in ns->*.
> >
>
> I don't see the codepath that leads to calling free_cgroup_ns() with
> uninitialized members. We don't call free_cgroup_ns() on the error
> path in copy_cgroup_ns().

Hm, yeah, I'm not seeing it now, sorry.
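
The concern discussed above (a teardown function that drops references must never see an object whose members were never set) can be modelled in userspace. The following is a minimal sketch, not the patch's code: struct ref, struct ns_model, and every function name here are hypothetical stand-ins, and the id_ok parameter merely simulates the id-allocation step failing.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical refcounted object standing in for a cgroup or user ns. */
struct ref { int count; };

/* Hypothetical model of a namespace holding two references. */
struct ns_model {
	struct ref *root;   /* models ns->root_cgrp */
	struct ref *owner;  /* models ns->user_ns */
	int id;
};

static void ref_get(struct ref *r) { r->count++; }
static void ref_put(struct ref *r) { assert(r->count > 0); r->count--; }

/* Full teardown: only valid once root and owner hold references. */
static void ns_model_free(struct ns_model *ns)
{
	ref_put(ns->root);
	ref_put(ns->owner);
	free(ns);
}

/* id_ok simulates whether the id-allocation step succeeds. */
static struct ns_model *ns_model_copy(struct ref *root, struct ref *owner,
				      int id_ok)
{
	struct ns_model *ns = malloc(sizeof(*ns));

	if (!ns)
		return NULL;
	if (!id_ok) {
		/* Members are still uninitialized: release raw memory only. */
		free(ns);
		return NULL;
	}
	ns->id = 1;
	ref_get(root);
	ref_get(owner);
	ns->root = root;
	ns->owner = owner;
	return ns;
}
```

In this model the failure branch never reaches ns_model_free(), so the reference counts of root and owner are unchanged after a failed copy, which is the property the reply above relies on.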

Re: [PATCHv1 0/8] CGroup Namespaces

Vincent Batts
In reply to this post by Eric W. Biederman
Has there been further movement on CLONE_NEWCGROUP outside of this?


vb

On Sun, Oct 19, 2014 at 12:54 AM, Eric W. Biederman
<[hidden email]> wrote:

> Aditya Kali <[hidden email]> writes:
>
>> Second take at the Cgroup Namespace patch-set.
>>
>> Major changes from RFC (V0):
>> 1. setns support for cgroupns
>> 2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
>>    mounts the cgroup hierarchy with cgroupns-root as the filesystem root.
>> 3. writes to cgroup files outside of cgroupns-root are not allowed
>> 4. visibility of /proc/<pid>/cgroup is further restricted by not showing
>>    anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
>>    your cgroupns-root.
>>
>> More details in the writeup below.
>
> This definitely looks like the right direction to go, and something that
> in some form or another I had been asking for since cgroups were merged.
> So I am very glad to see this work moving forward.
>
> I had hoped that we might just be able to be clever with remounting
> cgroupfs but 2 things stand in the way.
> 1) /proc/<pid>/cgroup (but proc could capture that).
> 2) providing a hard guarantee that tasks stay within a subset of the
>    cgroup hierarchy.
>
> So I think this clearly meets the requirements for a new namespace.
>
> We need to have the discussion on chmod of files on cgroupfs.  There is
> a notion that has floated around that only systemd or only root (with
> the appropriate capabilities) should be allowed to set resource limits
> in cgroupfs.  In practical reality that is nonsense.  If an attribute
> is properly bound in its hierarchy it should be safe to change.
>
> Not all attributes are properly bound to hierarchy and some are or at
> least were dangerous for anyone except root to set.  So I suggest that a
> CFTYPE flag perhaps CFTYPE_UNPRIV be added for attributes that are safe
> to allow anyone to set, and require CFTYPE_UNPRIV be set before we chmod
> a cgroup attribute from root.
>
> That would be complementary work, and not strictly tied to cgroup
> namespaces, but unprivileged cgroup namespaces don't make much sense
> without that work.
>
> Eric
>
>> Background
>>   Cgroups and Namespaces are used together to create “virtual”
>>   containers that isolate the host environment from the processes
>>   running in the container. But since cgroups themselves are not
>>   “virtualized”, a task is always able to see the global cgroup view
>>   through the cgroupfs mount and via the /proc/self/cgroup file.
>>
>>   $ cat /proc/self/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>
>>   This exposure of cgroup names to the processes running inside a
>>   container results in some problems:
>>   (1) The container names are typically host-container-management-agent
>>       (systemd, docker/libcontainer, etc.) data and leaking its name (or
>>       leaking the hierarchy) reveals too much information about the host
>>       system.
>>   (2) It makes the container migration across machines (CRIU) more
>>       difficult as the container names need to be unique across the
>>       machines in the migration domain.
>>   (3) It makes it difficult to run container management tools (like
>>       docker/libcontainer, lmctfy, etc.) within virtual containers
>>       without adding dependency on some state/agent present outside the
>>       container.
>>
>>   Note that the feature proposed here is completely different than the
>>   “ns cgroup” feature which existed in the linux kernel until recently.
>>   The ns cgroup also attempted to connect cgroups and namespaces by
>>   creating a new cgroup every time a new namespace was created. It did
>>   not solve any of the above mentioned problems and was later dropped
>>   from the kernel. Incidentally though, it used the same config option
>>   name CONFIG_CGROUP_NS as used in my prototype!
>>
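
Each /proc/self/cgroup line quoted in the Background section has the form hierarchy-id:controller-list:path. A minimal sketch of splitting such a line in C follows; struct cgroup_entry, parse_cgroup_line, and the buffer sizes are illustrative assumptions, not part of the patch set.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for one parsed line; sizes are arbitrary. */
struct cgroup_entry {
	int hierarchy_id;       /* the leading "0" in the quoted example */
	char controllers[128];  /* comma-separated controller list */
	char path[256];         /* cgroup path as seen by the reader */
};

/* Returns 0 on success, -1 on malformed or oversized input. */
static int parse_cgroup_line(const char *line, struct cgroup_entry *out)
{
	assert(line != NULL && out != NULL);

	/* Locate the two field separators; the path is everything after
	 * the second one, so a ':' inside the path is preserved. */
	const char *c1 = strchr(line, ':');
	const char *c2 = c1 ? strchr(c1 + 1, ':') : NULL;
	if (!c2)
		return -1;

	char *end;
	long id = strtol(line, &end, 10);
	if (end == line || end != c1 || id < 0 || id > INT_MAX)
		return -1;

	size_t clen = (size_t)(c2 - (c1 + 1));
	size_t plen = strcspn(c2 + 1, "\n");  /* drop a trailing newline */
	if (clen >= sizeof(out->controllers) || plen >= sizeof(out->path))
		return -1;

	out->hierarchy_id = (int)id;
	memcpy(out->controllers, c1 + 1, clen);
	out->controllers[clen] = '\0';
	memcpy(out->path, c2 + 1, plen);
	out->path[plen] = '\0';
	return 0;
}
```

A caller would typically read the file line by line with fgets() and pass each line to parse_cgroup_line(); the path field is the part that the proposed namespace would rewrite relative to the cgroupns-root.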
>> Introducing CGroup Namespaces
>>   With unified cgroup hierarchy
>>   (Documentation/cgroups/unified-hierarchy.txt), the containers can now
>>   have a much more coherent cgroup view and it's easy to associate a
>>   container with a single cgroup. This also allows us to virtualize the
>>   cgroup view for tasks inside the container.
>>
>>   The new CGroup Namespace allows a process to “unshare” its cgroup
>>   hierarchy starting from the cgroup it's currently in.
>>   For example:
>>   $ cat /proc/self/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>   $ ls -l /proc/self/ns/cgroup
>>   lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>>   $ ~/unshare -c  # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
>>   [ns]$ ls -l /proc/self/ns/cgroup
>>   lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>>   # From within new cgroupns, process sees that it's in the root cgroup
>>   [ns]$ cat /proc/self/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>
>>   # From global cgroupns:
>>   $ cat /proc/<pid>/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>
>>   # Unshare cgroupns along with userns and mountns
>>   # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>>   # sets up uid/gid map and exec’s /bin/bash
>>   $ ~/unshare -c -u -m
>>
>>   # Originally, we were in /batchjobs/c_job_id1 cgroup. Mount our own cgroup
>>   # hierarchy.
>>   [ns]$ mount -t cgroup cgroup /tmp/cgroup
>>   [ns]$ ls -l /tmp/cgroup
>>   total 0
>>   -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>>   -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>>   -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>>   -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>>
>>   The cgroupns-root (/batchjobs/c_job_id1 in above example) becomes the
>>   filesystem root for the namespace specific cgroupfs mount.
>>
>>   The virtualization of /proc/self/cgroup file combined with restricting
>>   the view of cgroup hierarchy by namespace-private cgroupfs mount
>>   should provide a completely isolated cgroup view inside the container.
>>
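
The ~/unshare helper used in the sessions above is the author's own test tool and its source is not shown in this thread. A minimal sketch of what the "-c" case might do is given here under stated assumptions: the function names are hypothetical, and the fallback value 0x02000000 for CLONE_NEWCGROUP is the one later mainline kernels use, which may differ from what this patch set defines in include/uapi/linux/sched.h.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <unistd.h>

/* Assumption: value used by later mainline kernels; not quoted in this thread. */
#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000
#endif

/*
 * Hypothetical helper: detach the caller into a new cgroup namespace.
 * Returns 0 on success or a negative errno value (for example -EPERM
 * without the needed privilege, or -EINVAL on kernels lacking the flag).
 */
int enter_new_cgroupns(void)
{
	if (unshare(CLONE_NEWCGROUP) == -1)
		return -errno;
	return 0;
}

/*
 * Hypothetical helper: unshare, then replace the process with a shell.
 * Only returns if the unshare or the exec fails.
 */
int shell_in_new_cgroupns(const char *shell)
{
	assert(shell != NULL);

	int rc = enter_new_cgroupns();
	if (rc < 0)
		return rc;
	execl(shell, shell, (char *)NULL);
	return -errno;
}
```

Calling shell_in_new_cgroupns("/bin/bash") would correspond to the "[ns]$" prompt in the examples; the "-u -m" variant would additionally pass CLONE_NEWUSER|CLONE_NEWNS and write the uid/gid maps, which is omitted here.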
>>   In its current form, the cgroup namespaces patchset provides the following
>>   behavior:
>>
>>   (1) The “root” cgroup for a cgroup namespace is the cgroup in which
>>       the process calling unshare is running.
>>       For example, if a process in /batchjobs/c_job_id1 cgroup calls unshare,
>>       cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
>>       For the init_cgroup_ns, this is the real root (“/”) cgroup
>>       (identified in code as cgrp_dfl_root.cgrp).
>>
>>   (2) The cgroupns-root cgroup does not change even if the namespace
>>       creator process later moves to a different cgroup.
>>       $ ~/unshare -c # unshare cgroupns in some cgroup
>>       [ns]$ cat /proc/self/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>       [ns]$ mkdir sub_cgrp_1
>>       [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
>>       [ns]$ cat /proc/self/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>>
>>   (3) Each process gets its CGROUPNS specific view of
>>       /proc/<pid>/cgroup.
>>   (a) Processes running inside the cgroup namespace will be able to see
>>       cgroup paths (in /proc/self/cgroup) only inside their root cgroup
>>       [ns]$ sleep 100000 &  # From within unshared cgroupns
>>       [1] 7353
>>       [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
>>       [ns]$ cat /proc/7353/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>>
>>   (b) From global cgroupns, the real cgroup path will be visible:
>>       $ cat /proc/7353/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>>
>>   (c) From a sibling cgroupns (cgroupns rooted at a sibling cgroup), no cgroup
>>       path will be visible:
>>       # ns2's cgroupns-root is at '/batchjobs/c_job_id2'
>>       [ns2]$ cat /proc/7353/cgroup
>>       [ns2]$
>>       This is the same as when the cgroup hierarchy is not mounted at all.
>>       (In a correct container setup, though, it should not be possible to
>>        access PIDs in another container in the first place.)
>>
>>   (4) Processes inside a cgroupns are not allowed to move out of the
>>       cgroupns-root. This is true even if a privileged process in global
>>       cgroupns tries to move the process out of its cgroupns-root.
>>
>>       # From global cgroupns
>>       $ cat /proc/7353/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>>       # cgroupns-root for 7353 is /batchjobs/c_job_id1
>>       $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
>>       -bash: echo: write error: Operation not permitted
>>
>>   (5) Setns to another cgroup namespace is allowed only when:
>>       (a) process has CAP_SYS_ADMIN in its current userns
>>       (b) process has CAP_SYS_ADMIN in the target cgroupns' userns
>>       (c) the process's current cgroup is a descendant of the cgroupns-root
>>           of the target namespace.
>>       (d) the target cgroupns-root is a descendant of the current cgroupns-root.
>>       The last check (d) prevents processes from escaping their cgroupns-root by
>>       attaching to parent cgroupns. Thus, setns is allowed only when the process
>>       is trying to restrict itself to a deeper cgroup hierarchy.
>>
>>   (6) When some thread from a multi-threaded process unshares its
>>       cgroup-namespace, the new cgroupns gets applied to the entire
>>       process (all the threads). This should be OK since
>>       unified-hierarchy only allows process-level containerization. So
>>       all the threads in the process will have the same cgroup. And both
>>       - changing cgroups and unsharing namespaces - are protected under
>>       threadgroup_lock(task).
>>
>>   (7) The cgroup namespace is alive as long as there is at least one
>>       process inside it. When the last process exits, the cgroup
>>       namespace is destroyed. The cgroupns-root and the actual cgroups
>>       remain though.
>>
>>   (8) 'mount -t cgroup cgroup <mntpt>' when called from within cgroupns mounts
>>       the unified cgroup hierarchy with cgroupns-root as the filesystem root.
>>       The process needs CAP_SYS_ADMIN in its userns and mntns. This allows the
>>       container management tools to be run inside the containers transparently.
>>
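
Items (3) and (5) in the list above both reduce to ancestor/descendant tests on cgroup paths. The sketch below models them on plain path strings; the kernel code works on struct cgroup objects rather than strings, so all three function names are hypothetical. It assumes normalized absolute paths without trailing slashes, treats a cgroup as a descendant of itself (an assumption the cover letter does not spell out), and leaves the CAP_SYS_ADMIN checks (5)(a) and (5)(b) out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* True if path equals root or lies beneath it. */
static bool cgroup_path_is_descendant(const char *path, const char *root)
{
	assert(path && root && path[0] == '/' && root[0] == '/');

	size_t n = strlen(root);
	if (n == 1)
		return true;  /* "/" is an ancestor of every path */
	if (strncmp(path, root, n) != 0)
		return false;
	/* Reject prefix matches that end mid-component, e.g. id1 vs id10. */
	return path[n] == '\0' || path[n] == '/';
}

/*
 * Item (3): the path a reader rooted at ns_root would see for a task in
 * path, or NULL when the task lies outside ns_root (nothing is shown).
 */
static const char *cgroup_path_in_ns(const char *path, const char *ns_root)
{
	if (!cgroup_path_is_descendant(path, ns_root))
		return NULL;
	if (strcmp(ns_root, "/") == 0)
		return path;

	const char *rel = path + strlen(ns_root);
	return *rel ? rel : "/";
}

/* Item (5), checks (c) and (d) only. */
static bool setns_paths_allowed(const char *task_cgroup,
				const char *current_ns_root,
				const char *target_ns_root)
{
	return cgroup_path_is_descendant(task_cgroup, target_ns_root) &&
	       cgroup_path_is_descendant(target_ns_root, current_ns_root);
}
```

With the paths from the examples, a task in /batchjobs/c_job_id1/sub_cgrp_1 viewed from a namespace rooted at /batchjobs/c_job_id1 maps to /sub_cgrp_1, while a view from a namespace rooted at /batchjobs/c_job_id2 yields NULL. A setns from a namespace rooted at /batchjobs/c_job_id1 to one rooted at "/" fails check (d), matching the statement that setns can only narrow the view.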
>> Implementation
>>   The current patch-set is based on top of Tejun Heo's cgroup tree (for-next
>>   branch). It's fairly non-intrusive and provides the above-mentioned
>>   features.
>>
>> Possible extensions of CGROUPNS:
>>   (1) The Documentation/cgroups/unified-hierarchy.txt mentions use of
>>       capabilities to restrict cgroups to administrative users. CGroup
>>       namespaces could be of help here. With cgroup namespaces, it might
>>       be possible to delegate administration of sub-cgroups under a
>>       cgroupns-root to the cgroupns owner.
>
>
>
>
>> ---
>>  fs/kernfs/dir.c                  |  53 +++++++++---
>>  fs/kernfs/mount.c                |  48 +++++++++++
>>  fs/proc/namespaces.c             |   3 +
>>  include/linux/cgroup.h           |  41 +++++++++-
>>  include/linux/cgroup_namespace.h |  62 +++++++++++++++
>>  include/linux/kernfs.h           |   5 ++
>>  include/linux/nsproxy.h          |   2 +
>>  include/linux/proc_ns.h          |   4 +
>>  include/uapi/linux/sched.h       |   3 +-
>>  init/Kconfig                     |   9 +++
>>  kernel/Makefile                  |   1 +
>>  kernel/cgroup.c                  | 139 ++++++++++++++++++++++++++------
>>  kernel/cgroup_namespace.c        | 168 +++++++++++++++++++++++++++++++++++++++
>>  kernel/fork.c                    |   2 +-
>>  kernel/nsproxy.c                 |  19 ++++-
>>  15 files changed, 518 insertions(+), 41 deletions(-)
>>  create mode 100644 include/linux/cgroup_namespace.h
>>  create mode 100644 kernel/cgroup_namespace.c
>>
>> [PATCHv1 1/8] kernfs: Add API to generate relative kernfs path
>> [PATCHv1 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup
>> [PATCHv1 3/8] cgroup: add function to get task's cgroup on default
>> [PATCHv1 4/8] cgroup: export cgroup_get() and cgroup_put()
>> [PATCHv1 5/8] cgroup: introduce cgroup namespaces
>> [PATCHv1 6/8] cgroup: restrict cgroup operations within task's cgroupns
>> [PATCHv1 7/8] cgroup: cgroup namespace setns support
>> [PATCHv1 8/8] cgroup: mount cgroupns-root when inside non-init cgroupns
> _______________________________________________
> Containers mailing list
> [hidden email]
> https://lists.linuxfoundation.org/mailman/listinfo/containers