[PATCHv1 0/8] CGroup Namespaces

Aditya Kali
Second take at the Cgroup Namespace patch-set.

Major changes from RFC (V0):
1. setns support for cgroupns
2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
   mounts the cgroup hierarchy with cgroupns-root as the filesystem root.
3. writes to cgroup files outside of cgroupns-root are not allowed
4. visibility of /proc/<pid>/cgroup is further restricted by not showing
   anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
   your cgroupns-root.

More details in the writeup below.

Background
  Cgroups and Namespaces are used together to create “virtual”
  containers that isolate the host environment from the processes
  running in the container. But since cgroups themselves are not
  “virtualized”, a task is always able to see the global cgroup view
  through the cgroupfs mount and via the /proc/self/cgroup file.

  $ cat /proc/self/cgroup
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1

  This exposure of cgroup names to the processes running inside a
  container results in some problems:
  (1) The container names are typically host-container-management-agent
      (systemd, docker/libcontainer, etc.) data, and leaking these names (or
      the hierarchy) reveals too much information about the host
      system.
  (2) It makes container migration across machines (CRIU) more
      difficult, as the container names need to be unique across the
      machines in the migration domain.
  (3) It makes it difficult to run container management tools (like
      docker/libcontainer, lmctfy, etc.) within virtual containers
      without adding dependency on some state/agent present outside the
      container.

  Note that the feature proposed here is completely different from the
  “ns cgroup” feature which existed in the Linux kernel until recently.
  The ns cgroup also attempted to connect cgroups and namespaces by
  creating a new cgroup every time a new namespace was created. It did
  not solve any of the above-mentioned problems and was later dropped
  from the kernel. Incidentally though, it used the same config option
  name CONFIG_CGROUP_NS as used in my prototype!

Introducing CGroup Namespaces
  With the unified cgroup hierarchy
  (Documentation/cgroups/unified-hierarchy.txt), containers can now
  have a much more coherent cgroup view and it's easy to associate a
  container with a single cgroup. This also allows us to virtualize the
  cgroup view for tasks inside the container.

  The new CGroup Namespace allows a process to “unshare” its cgroup
  hierarchy starting from the cgroup it is currently in.
  For example:
  $ cat /proc/self/cgroup
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
  $ ls -l /proc/self/ns/cgroup
  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
  $ ~/unshare -c  # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
  [ns]$ ls -l /proc/self/ns/cgroup
  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
  # From within the new cgroupns, the process sees that it's in the root cgroup
  [ns]$ cat /proc/self/cgroup
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/

  # From global cgroupns:
  $ cat /proc/<pid>/cgroup
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1

  # Unshare cgroupns along with userns and mountns
  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
  # sets up uid/gid map and exec’s /bin/bash
  $ ~/unshare -c -u -m

  # Originally, we were in /batchjobs/c_job_id1 cgroup. Mount our own cgroup
  # hierarchy.
  [ns]$ mount -t cgroup cgroup /tmp/cgroup
  [ns]$ ls -l /tmp/cgroup
  total 0
  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control

  The cgroupns-root (/batchjobs/c_job_id1 in the above example) becomes the
  filesystem root for the namespace-specific cgroupfs mount.

  The virtualization of the /proc/self/cgroup file, combined with restricting
  the view of the cgroup hierarchy through a namespace-private cgroupfs mount,
  should provide a completely isolated cgroup view inside the container.
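
  The '~/unshare -c' helper used in the transcript above is not part of
  this posting. As a rough sketch only, a user-space wrapper could look
  like the code below. The function names enter_new_cgroupns() and
  read_self_cgroup() are hypothetical, and the fallback definition of
  CLONE_NEWCGROUP simply repeats the value proposed in patch 2/8. The raw
  syscall is used so the sketch does not depend on the C library knowing
  about the new flag.

```c
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Value proposed in patch 2/8; libc headers may not define it yet. */
#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000
#endif

/*
 * Hypothetical helper: move the calling process into a new cgroup
 * namespace. Returns 0 on success, -1 with errno set on failure
 * (for example EPERM without privileges, EINVAL on kernels without
 * cgroup namespace support).
 */
int enter_new_cgroupns(void)
{
	return (int)syscall(SYS_unshare, CLONE_NEWCGROUP);
}

/*
 * Hypothetical helper: copy /proc/self/cgroup into buf and
 * NUL-terminate it. Returns the number of bytes read, or -1 if
 * buflen is 0 or the file cannot be opened.
 */
int read_self_cgroup(char *buf, size_t buflen)
{
	FILE *f;
	size_t n;

	if (buflen == 0)
		return -1;
	f = fopen("/proc/self/cgroup", "r");
	if (!f)
		return -1;
	n = fread(buf, 1, buflen - 1, f);
	buf[n] = '\0';
	fclose(f);
	return (int)n;
}
```

  A caller would invoke enter_new_cgroupns() and, on success, exec a shell;
  reading /proc/self/cgroup before and after the call should then show the
  path change from the real cgroup to '/'.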

  In its current form, the cgroup namespaces patchset provides the following
  behavior:

  (1) The “root” cgroup for a cgroup namespace is the cgroup in which
      the process calling unshare is running.
      For example, if a process in the /batchjobs/c_job_id1 cgroup calls
      unshare, cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
      For the init_cgroup_ns, this is the real root (“/”) cgroup
      (identified in code as cgrp_dfl_root.cgrp).

  (2) The cgroupns-root cgroup does not change even if the namespace
      creator process later moves to a different cgroup.
      $ ~/unshare -c # unshare cgroupns in some cgroup
      [ns]$ cat /proc/self/cgroup
      0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
      [ns]$ mkdir sub_cgrp_1
      [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
      [ns]$ cat /proc/self/cgroup
      0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1

  (3) Each process gets its cgroupns-specific view of
      /proc/<pid>/cgroup.
      (a) Processes running inside the cgroup namespace will be able to see
          cgroup paths (in /proc/self/cgroup) only inside their root cgroup:
          [ns]$ sleep 100000 &  # From within unshared cgroupns
          [1] 7353
          [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
          [ns]$ cat /proc/7353/cgroup
          0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1

      (b) From the global cgroupns, the real cgroup path will be visible:
          $ cat /proc/7353/cgroup
          0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1

      (c) From a sibling cgroupns (a cgroupns rooted at a sibling cgroup), no
          cgroup path will be visible:
          # ns2's cgroupns-root is at '/batchjobs/c_job_id2'
          [ns2]$ cat /proc/7353/cgroup
          [ns2]$
          This is the same as when the cgroup hierarchy is not mounted at all.
          (In a correct container setup though, it should not be possible to
           access PIDs in another container in the first place.)

  (4) Processes inside a cgroupns are not allowed to move out of the
      cgroupns-root. This is true even if a privileged process in global
      cgroupns tries to move the process out of its cgroupns-root.

      # From global cgroupns
      $ cat /proc/7353/cgroup
      0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
      # cgroupns-root for 7353 is /batchjobs/c_job_id1
      $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
      -bash: echo: write error: Operation not permitted

  (5) Setns to another cgroup namespace is allowed only when:
      (a) the process has CAP_SYS_ADMIN in its current userns
      (b) the process has CAP_SYS_ADMIN in the target cgroupns' userns
      (c) the process's current cgroup is a descendant of the cgroupns-root
          of the target namespace.
      (d) the target cgroupns-root is the same as, or a descendant of, the
          current cgroupns-root.
      The last check (d) prevents processes from escaping their cgroupns-root by
      attaching to a parent cgroupns. Thus, setns is allowed only when the
      process is trying to restrict itself to a deeper cgroup hierarchy.

  (6) When some thread from a multi-threaded process unshares its
      cgroup-namespace, the new cgroupns gets applied to the entire
      process (all the threads). This should be OK since
      unified-hierarchy only allows process-level containerization. So
      all the threads in the process will have the same cgroup. And both
      - changing cgroups and unsharing namespaces - are protected under
      threadgroup_lock(task).

  (7) The cgroup namespace is alive as long as there is at least one
      process inside it. When the last process exits, the cgroup
      namespace is destroyed. The cgroupns-root and the actual cgroups
      remain though.

  (8) 'mount -t cgroup cgroup <mntpt>' when called from within cgroupns mounts
      the unified cgroup hierarchy with cgroupns-root as the filesystem root.
      The process needs CAP_SYS_ADMIN in its userns and mntns. This allows the
      container management tools to be run inside the containers transparently.
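
  The visibility rules in (3) can be modelled in user space on plain path
  strings. The following is only a sketch of the behavior described above,
  not the kernel implementation; the function names are hypothetical and
  cgroups are represented by their absolute paths in the global namespace.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if cgroup path `cgrp` equals `ancestor` or lies below it. */
int cgroup_path_is_descendant(const char *cgrp, const char *ancestor)
{
	size_t n = strlen(ancestor);

	if (strcmp(ancestor, "/") == 0)
		return cgrp[0] == '/';
	if (strncmp(cgrp, ancestor, n) != 0)
		return 0;
	/* Reject mere string prefixes such as /a/b1 under /a/b. */
	return cgrp[n] == '\0' || cgrp[n] == '/';
}

/*
 * Path of `cgrp` as seen from a cgroupns rooted at `ns_root`:
 *   - "/" when cgrp is the cgroupns-root itself        (rule 1)
 *   - the path relative to ns_root when cgrp is below  (rule 3a)
 *   - NULL when cgrp is outside ns_root, i.e. nothing
 *     is shown                                          (rule 3c)
 * With ns_root == "/" (the init namespace) the real path is
 * returned unchanged (rule 3b).
 */
const char *cgroupns_visible_path(const char *cgrp, const char *ns_root)
{
	size_t n;

	if (!cgroup_path_is_descendant(cgrp, ns_root))
		return NULL;
	if (strcmp(ns_root, "/") == 0)
		return cgrp;
	n = strlen(ns_root);
	return cgrp[n] == '\0' ? "/" : cgrp + n;
}
```

  With the example above, PID 7353 in /batchjobs/c_job_id1/sub_cgrp_1 is
  reported as /sub_cgrp_1 from its own namespace, with its full path from
  the global namespace, and not at all from a namespace rooted at
  /batchjobs/c_job_id2.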

Implementation
  The current patch-set is based on top of Tejun Heo's cgroup tree (for-next
  branch). It's fairly non-intrusive and provides the above-mentioned
  features.

Possible extensions of CGROUPNS:
  (1) The Documentation/cgroups/unified-hierarchy.txt mentions use of
      capabilities to restrict cgroups to administrative users. CGroup
      namespaces could be of help here. With cgroup namespaces, it might
      be possible to delegate administration of sub-cgroups under a
      cgroupns-root to the cgroupns owner.


---
 fs/kernfs/dir.c                  |  53 +++++++++---
 fs/kernfs/mount.c                |  48 +++++++++++
 fs/proc/namespaces.c             |   3 +
 include/linux/cgroup.h           |  41 +++++++++-
 include/linux/cgroup_namespace.h |  62 +++++++++++++++
 include/linux/kernfs.h           |   5 ++
 include/linux/nsproxy.h          |   2 +
 include/linux/proc_ns.h          |   4 +
 include/uapi/linux/sched.h       |   3 +-
 init/Kconfig                     |   9 +++
 kernel/Makefile                  |   1 +
 kernel/cgroup.c                  | 139 ++++++++++++++++++++++++++------
 kernel/cgroup_namespace.c        | 168 +++++++++++++++++++++++++++++++++++++++
 kernel/fork.c                    |   2 +-
 kernel/nsproxy.c                 |  19 ++++-
 15 files changed, 518 insertions(+), 41 deletions(-)
 create mode 100644 include/linux/cgroup_namespace.h
 create mode 100644 kernel/cgroup_namespace.c

[PATCHv1 1/8] kernfs: Add API to generate relative kernfs path
[PATCHv1 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
[PATCHv1 3/8] cgroup: add function to get task's cgroup on default hierarchy
[PATCHv1 4/8] cgroup: export cgroup_get() and cgroup_put()
[PATCHv1 5/8] cgroup: introduce cgroup namespaces
[PATCHv1 6/8] cgroup: restrict cgroup operations within task's cgroupns
[PATCHv1 7/8] cgroup: cgroup namespace setns support
[PATCHv1 8/8] cgroup: mount cgroupns-root when inside non-init cgroupns
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCHv1 1/8] kernfs: Add API to generate relative kernfs path

Aditya Kali
The new function kernfs_path_from_node() generates and returns the
kernfs path of a given kernfs_node relative to a given parent
kernfs_node.

Signed-off-by: Aditya Kali <[hidden email]>
---
 fs/kernfs/dir.c        | 53 ++++++++++++++++++++++++++++++++++++++++----------
 include/linux/kernfs.h |  3 +++
 2 files changed, 46 insertions(+), 10 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index a693f5b..8655485 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -44,14 +44,24 @@ static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
  return strlcpy(buf, kn->parent ? kn->name : "/", buflen);
 }
 
-static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
-      size_t buflen)
+static char * __must_check kernfs_path_from_node_locked(
+ struct kernfs_node *kn_root,
+ struct kernfs_node *kn,
+ char *buf,
+ size_t buflen)
 {
  char *p = buf + buflen;
  int len;
 
+ BUG_ON(!buflen);
+
  *--p = '\0';
 
+ if (kn == kn_root) {
+ *--p = '/';
+ return p;
+ }
+
  do {
  len = strlen(kn->name);
  if (p - buf < len + 1) {
@@ -63,6 +73,8 @@ static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
  memcpy(p, kn->name, len);
  *--p = '/';
  kn = kn->parent;
+ if (kn == kn_root)
+ break;
  } while (kn && kn->parent);
 
  return p;
@@ -92,26 +104,47 @@ int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen)
 }
 
 /**
- * kernfs_path - build full path of a given node
+ * kernfs_path_from_node - build path of node @kn relative to @kn_root.
+ * @kn_root: parent kernfs_node relative to which we need to build the path
  * @kn: kernfs_node of interest
- * @buf: buffer to copy @kn's name into
+ * @buf: buffer to copy @kn's path into
  * @buflen: size of @buf
  *
- * Builds and returns the full path of @kn in @buf of @buflen bytes.  The
- * path is built from the end of @buf so the returned pointer usually
+ * Builds and returns @kn's path relative to @kn_root. @kn_root is expected to
+ * be parent of @kn at some level. If this is not true or if @kn_root is NULL,
+ * then full path of @kn is returned.
+ * The path is built from the end of @buf so the returned pointer usually
  * doesn't match @buf.  If @buf isn't long enough, @buf is nul terminated
  * and %NULL is returned.
  */
-char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
+char *kernfs_path_from_node(struct kernfs_node *kn_root, struct kernfs_node *kn,
+    char *buf, size_t buflen)
 {
  unsigned long flags;
  char *p;
 
  spin_lock_irqsave(&kernfs_rename_lock, flags);
- p = kernfs_path_locked(kn, buf, buflen);
+ p = kernfs_path_from_node_locked(kn_root, kn, buf, buflen);
  spin_unlock_irqrestore(&kernfs_rename_lock, flags);
  return p;
 }
+EXPORT_SYMBOL_GPL(kernfs_path_from_node);
+
+/**
+ * kernfs_path - build full path of a given node
+ * @kn: kernfs_node of interest
+ * @buf: buffer to copy @kn's name into
+ * @buflen: size of @buf
+ *
+ * Builds and returns the full path of @kn in @buf of @buflen bytes.  The
+ * path is built from the end of @buf so the returned pointer usually
+ * doesn't match @buf.  If @buf isn't long enough, @buf is nul terminated
+ * and %NULL is returned.
+ */
+char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
+{
+ return kernfs_path_from_node(NULL, kn, buf, buflen);
+}
 EXPORT_SYMBOL_GPL(kernfs_path);
 
 /**
@@ -145,8 +178,8 @@ void pr_cont_kernfs_path(struct kernfs_node *kn)
 
  spin_lock_irqsave(&kernfs_rename_lock, flags);
 
- p = kernfs_path_locked(kn, kernfs_pr_cont_buf,
-       sizeof(kernfs_pr_cont_buf));
+ p = kernfs_path_from_node_locked(NULL, kn, kernfs_pr_cont_buf,
+ sizeof(kernfs_pr_cont_buf));
  if (p)
  pr_cont("%s", p);
  else
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 30faf79..3c2be75 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -258,6 +258,9 @@ static inline bool kernfs_ns_enabled(struct kernfs_node *kn)
 }
 
 int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen);
+char * __must_check kernfs_path_from_node(struct kernfs_node *kn_root,
+  struct kernfs_node *kn, char *buf,
+  size_t buflen);
 char * __must_check kernfs_path(struct kernfs_node *kn, char *buf,
  size_t buflen);
 void pr_cont_kernfs_name(struct kernfs_node *kn);
--
2.1.0.rc2.206.gedb03e5
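
As a stand-alone illustration of the technique in this patch (building the
path backwards from the end of the buffer and stopping at a given ancestor),
here is a user-space sketch. The struct node type and the function name
path_from_node() are hypothetical stand-ins for struct kernfs_node and
kernfs_path_from_node(); locking is omitted, and the sketch adds a bounds
check for the root == node case that the patch itself does not have.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct kernfs_node. */
struct node {
	const char *name;          /* "" for the filesystem root */
	const struct node *parent; /* NULL for the filesystem root */
};

/*
 * Build the path of @n relative to @root at the end of @buf.
 * If @root is NULL (or not an ancestor of @n) the full path results.
 * Returns a pointer into @buf, or NULL if @buf is too small.
 */
char *path_from_node(const struct node *root, const struct node *n,
		     char *buf, size_t buflen)
{
	char *p;

	assert(buflen > 0);
	p = buf + buflen;
	*--p = '\0';

	if (n == root) {
		if (p == buf)      /* no room left for "/" */
			return NULL;
		*--p = '/';
		return p;
	}

	do {
		size_t len = strlen(n->name);

		if ((size_t)(p - buf) < len + 1) {
			buf[0] = '\0';
			return NULL;
		}
		p -= len;
		memcpy(p, n->name, len);
		*--p = '/';
		n = n->parent;
		if (n == root)     /* reached the requested ancestor */
			break;
	} while (n && n->parent);

	return p;
}
```

Passing the node for /batchjobs/c_job_id1 as root and its child sub_cgrp_1
as the node yields "/sub_cgrp_1"; passing NULL as root yields the full path.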

[PATCHv1 3/8] cgroup: add function to get task's cgroup on default hierarchy

Aditya Kali
get_task_cgroup() returns the (reference counted) cgroup of the
given task on the default hierarchy.

Signed-off-by: Aditya Kali <[hidden email]>
---
 include/linux/cgroup.h |  1 +
 kernel/cgroup.c        | 25 +++++++++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 1d51968..80ed6e0 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -579,6 +579,7 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
 }
 
 char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen);
+struct cgroup *get_task_cgroup(struct task_struct *task);
 
 int cgroup_add_dfl_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
 int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index cab7dc4..56d507b 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1916,6 +1916,31 @@ char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen)
 }
 EXPORT_SYMBOL_GPL(task_cgroup_path);
 
+/*
+ * get_task_cgroup - returns the cgroup of the task in the default cgroup
+ * hierarchy.
+ *
+ * @task: target task
+ * This function returns the @task's cgroup on the default cgroup hierarchy. The
+ * returned cgroup has its reference incremented (by calling cgroup_get()). So
+ * the caller must cgroup_put() the obtained reference once it is done with it.
+ */
+struct cgroup *get_task_cgroup(struct task_struct *task)
+{
+ struct cgroup *cgrp;
+
+ mutex_lock(&cgroup_mutex);
+ down_read(&css_set_rwsem);
+
+ cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
+ cgroup_get(cgrp);
+
+ up_read(&css_set_rwsem);
+ mutex_unlock(&cgroup_mutex);
+ return cgrp;
+}
+EXPORT_SYMBOL_GPL(get_task_cgroup);
+
 /* used to track tasks and other necessary states during migration */
 struct cgroup_taskset {
  /* the src and dst cset list running through cset->mg_node */
--
2.1.0.rc2.206.gedb03e5

[PATCHv1 7/8] cgroup: cgroup namespace setns support

Aditya Kali
setns on a cgroup namespace is allowed only if
* the task has CAP_SYS_ADMIN in its current user-namespace and
  over the user-namespace associated with the target cgroupns.
* the task's current cgroup is a descendant of the target cgroupns-root
  cgroup.
* the target cgroupns-root is the same as or deeper than the task's current
  cgroupns-root. This is so that the task cannot escape out of its
  cgroupns-root. This also ensures that setns() only makes the task
  get restricted to a deeper cgroup hierarchy.

Signed-off-by: Aditya Kali <[hidden email]>
---
 kernel/cgroup_namespace.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 42 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
index c16604f..c612946 100644
--- a/kernel/cgroup_namespace.c
+++ b/kernel/cgroup_namespace.c
@@ -80,8 +80,48 @@ err_out:
 
 static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
 {
- pr_info("setns not supported for cgroup namespace");
- return -EINVAL;
+ struct cgroup_namespace *cgroup_ns = ns;
+ struct task_struct *task = current;
+ struct cgroup *cgrp = NULL;
+ int err = 0;
+
+ if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
+    !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
+ return -EPERM;
+
+ /* Prevent cgroup changes for this task. */
+ threadgroup_lock(task);
+
+ cgrp = get_task_cgroup(task);
+
+ err = -EINVAL;
+ if (!cgroup_on_dfl(cgrp))
+ goto out_unlock;
+
+ /* Allow switch only if the task's current cgroup is descendant of the
+ * target cgroup_ns->root_cgrp.
+ */
+ if (!cgroup_is_descendant(cgrp, cgroup_ns->root_cgrp))
+ goto out_unlock;
+
+ /* Only allow setns to a cgroupns root-ed deeper than task's current
+ * cgroupns-root. This will make sure that tasks cannot escape their
+ * cgroupns by attaching to parent cgroupns.
+ */
+ if (!cgroup_is_descendant(cgroup_ns->root_cgrp,
+  task_cgroupns_root(task)))
+ goto out_unlock;
+
+ err = 0;
+ get_cgroup_ns(cgroup_ns);
+ put_cgroup_ns(nsproxy->cgroup_ns);
+ nsproxy->cgroup_ns = cgroup_ns;
+
+out_unlock:
+ threadgroup_unlock(current);
+ if (cgrp)
+ cgroup_put(cgrp);
+ return err;
 }
 
 static void *cgroupns_get(struct task_struct *task)
--
2.1.0.rc2.206.gedb03e5
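
The order of checks in cgroupns_install() can be summarised by the following
user-space sketch. It is a model only: cgroups are represented by absolute
path strings, the capability checks are reduced to two booleans, and the
function names are hypothetical. The cgroup_on_dfl() check and all locking
and reference counting are left out.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if `cgrp` equals `ancestor` or lies below it. */
static int path_is_descendant(const char *cgrp, const char *ancestor)
{
	size_t n = strlen(ancestor);

	if (strcmp(ancestor, "/") == 0)
		return cgrp[0] == '/';
	if (strncmp(cgrp, ancestor, n) != 0)
		return 0;
	return cgrp[n] == '\0' || cgrp[n] == '/';
}

/*
 * Model of the setns() permission rules:
 *   -EPERM  if CAP_SYS_ADMIN is missing in either user namespace
 *   -EINVAL if the task's cgroup is not under the target cgroupns-root
 *   -EINVAL if the target cgroupns-root is not the same as, or below,
 *           the task's current cgroupns-root
 *   0       otherwise
 */
int cgroupns_setns_check(int admin_in_current_userns,
			 int admin_in_target_userns,
			 const char *task_cgrp,
			 const char *current_ns_root,
			 const char *target_ns_root)
{
	if (!admin_in_current_userns || !admin_in_target_userns)
		return -EPERM;
	if (!path_is_descendant(task_cgrp, target_ns_root))
		return -EINVAL;
	if (!path_is_descendant(target_ns_root, current_ns_root))
		return -EINVAL;
	return 0;
}
```

For instance, a task in /batchjobs/c_job_id1 whose cgroupns-root is already
/batchjobs/c_job_id1 gets -EINVAL when it targets a namespace rooted at '/',
which is the escape that the last check is meant to prevent.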

[PATCHv1 8/8] cgroup: mount cgroupns-root when inside non-init cgroupns

Aditya Kali
This patch enables cgroup mounting inside a userns when a process
has appropriate privileges. The cgroup filesystem mounted is
rooted at the cgroupns-root. Thus, in a container setup, only
the hierarchy under the cgroupns-root is exposed inside the container.
This allows container management tools to run inside the containers
without depending on any global state.
In order to support this, a new kernfs API is added to look up the
dentry for the cgroupns-root.

Signed-off-by: Aditya Kali <[hidden email]>
---
 fs/kernfs/mount.c      | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/kernfs.h |  2 ++
 kernel/cgroup.c        | 47 +++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 95 insertions(+), 2 deletions(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index f973ae9..e334f45 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
  return NULL;
 }
 
+/**
+ * kernfs_obtain_root - create new root dentry for the given kernfs_node.
+ * @sb: the kernfs super_block
+ * @kn: kernfs_node for which a dentry is needed
+ *
+ * This can be used by callers which want to mount only a part of the kernfs
+ * as root of the filesystem.
+ */
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+  struct kernfs_node *kn)
+{
+ struct dentry *dentry;
+ struct inode *inode;
+
+ BUG_ON(sb->s_op != &kernfs_sops);
+
+ /* inode for the given kernfs_node should already exist. */
+ inode = ilookup(sb, kn->ino);
+ if (!inode) {
+ pr_debug("kernfs: could not get inode for '");
+ pr_cont_kernfs_path(kn);
+ pr_cont("'.\n");
+ return ERR_PTR(-EINVAL);
+ }
+
+ /* instantiate and link root dentry */
+ dentry = d_obtain_root(inode);
+ if (IS_ERR(dentry)) {
+ pr_debug("kernfs: could not get dentry for '");
+ pr_cont_kernfs_path(kn);
+ pr_cont("'.\n");
+ return dentry;
+ }
+
+ /* If this is a new dentry, set it up. We need kernfs_mutex because this
+ * may be called by callers other than kernfs_fill_super. */
+ mutex_lock(&kernfs_mutex);
+ if (!dentry->d_fsdata) {
+ kernfs_get(kn);
+ dentry->d_fsdata = kn;
+ } else {
+ WARN_ON(dentry->d_fsdata != kn);
+ }
+ mutex_unlock(&kernfs_mutex);
+
+ return dentry;
+}
+
 static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
 {
  struct kernfs_super_info *info = kernfs_info(sb);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 3c2be75..b9538e0 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
 struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
 struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
 
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+  struct kernfs_node *kn);
 struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
        unsigned int flags, void *priv);
 void kernfs_destroy_root(struct kernfs_root *root);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 2fc0dfa..ef27dc4 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
 
  memset(opts, 0, sizeof(*opts));
 
+ /* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
+ * namespace.
+ */
+ if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
+ opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
+ }
+
  while ((token = strsep(&o, ",")) != NULL) {
  nr_opts++;
 
@@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
 
  if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
  pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
- if (nr_opts != 1) {
+ if (nr_opts > 1) {
  pr_err("sane_behavior: no other mount options allowed\n");
  return -EINVAL;
  }
@@ -1581,6 +1588,15 @@ static void init_cgroup_root(struct cgroup_root *root,
  set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
 }
 
+struct dentry *cgroupns_get_root(struct super_block *sb,
+ struct cgroup_namespace *ns)
+{
+ struct dentry *nsdentry;
+
+ nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
+ return nsdentry;
+}
+
 static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
 {
  LIST_HEAD(tmp_links);
@@ -1684,6 +1700,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
  int ret;
  int i;
  bool new_sb;
+ struct cgroup_namespace *ns =
+ get_cgroup_ns(current->nsproxy->cgroup_ns);
+
+ /* Check if the caller has permission to mount. */
+ if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
+ put_cgroup_ns(ns);
+ return ERR_PTR(-EPERM);
+ }
 
  /*
  * The first time anyone tries to mount a cgroup, enable the list
@@ -1816,11 +1840,28 @@ out_free:
  kfree(opts.release_agent);
  kfree(opts.name);
 
- if (ret)
+ if (ret) {
+ put_cgroup_ns(ns);
  return ERR_PTR(ret);
+ }
 
  dentry = kernfs_mount(fs_type, flags, root->kf_root,
  CGROUP_SUPER_MAGIC, &new_sb);
+
+ if (!IS_ERR(dentry)) {
+ /* If this mount is for a non-init cgroup namespace, then
+ * instead of root's dentry, we return the dentry specific to
+ * the cgroupns->root_cgrp.
+ */
+ if (ns != &init_cgroup_ns) {
+ struct dentry *nsdentry;
+
+ nsdentry = cgroupns_get_root(dentry->d_sb, ns);
+ dput(dentry);
+ dentry = nsdentry;
+ }
+ }
+
  if (IS_ERR(dentry) || !new_sb)
  cgroup_put(&root->cgrp);
 
@@ -1833,6 +1874,7 @@ out_free:
  deactivate_super(pinned_sb);
  }
 
+ put_cgroup_ns(ns);
  return dentry;
 }
 
@@ -1861,6 +1903,7 @@ static struct file_system_type cgroup_fs_type = {
  .name = "cgroup",
  .mount = cgroup_mount,
  .kill_sb = cgroup_kill_sb,
+ .fs_flags = FS_USERNS_MOUNT,
 };
 
 static struct kobject *cgroup_kobj;
--
2.1.0.rc2.206.gedb03e5

[PATCHv1 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace

Aditya Kali
CLONE_NEWCGROUP will be used to create a new cgroup namespace.

Signed-off-by: Aditya Kali <[hidden email]>
---
 include/uapi/linux/sched.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 34f9d73..2f90d00 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -21,8 +21,7 @@
 #define CLONE_DETACHED 0x00400000 /* Unused, ignored */
 #define CLONE_UNTRACED 0x00800000 /* set if the tracing process can't force CLONE_PTRACE on this clone */
 #define CLONE_CHILD_SETTID 0x01000000 /* set the TID in the child */
-/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
-   and is now available for re-use. */
+#define CLONE_NEWCGROUP 0x02000000 /* New cgroup namespace */
 #define CLONE_NEWUTS 0x04000000 /* New utsname group? */
 #define CLONE_NEWIPC 0x08000000 /* New ipcs */
 #define CLONE_NEWUSER 0x10000000 /* New user namespace */
--
2.1.0.rc2.206.gedb03e5

[PATCHv1 4/8] cgroup: export cgroup_get() and cgroup_put()

Aditya Kali
Move cgroup_get() and cgroup_put() into cgroup.h so that
they can be called from other places.

Signed-off-by: Aditya Kali <[hidden email]>
---
 include/linux/cgroup.h | 22 ++++++++++++++++++++++
 kernel/cgroup.c        | 22 ----------------------
 2 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 80ed6e0..4a0eb2d 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -521,6 +521,28 @@ static inline bool cgroup_on_dfl(const struct cgroup *cgrp)
  return cgrp->root == &cgrp_dfl_root;
 }
 
+/* convenient tests for these bits */
+static inline bool cgroup_is_dead(const struct cgroup *cgrp)
+{
+ return !(cgrp->self.flags & CSS_ONLINE);
+}
+
+static inline void cgroup_get(struct cgroup *cgrp)
+{
+ WARN_ON_ONCE(cgroup_is_dead(cgrp));
+ css_get(&cgrp->self);
+}
+
+static inline bool cgroup_tryget(struct cgroup *cgrp)
+{
+ return css_tryget(&cgrp->self);
+}
+
+static inline void cgroup_put(struct cgroup *cgrp)
+{
+ css_put(&cgrp->self);
+}
+
 /* no synchronization, the result can only be used as a hint */
 static inline bool cgroup_has_tasks(struct cgroup *cgrp)
 {
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 56d507b..2b3e9f9 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -284,12 +284,6 @@ static struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgrp,
  return cgroup_css(cgrp, ss);
 }
 
-/* convenient tests for these bits */
-static inline bool cgroup_is_dead(const struct cgroup *cgrp)
-{
- return !(cgrp->self.flags & CSS_ONLINE);
-}
-
 struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
 {
  struct cgroup *cgrp = of->kn->parent->priv;
@@ -1002,22 +996,6 @@ static umode_t cgroup_file_mode(const struct cftype *cft)
  return mode;
 }
 
-static void cgroup_get(struct cgroup *cgrp)
-{
- WARN_ON_ONCE(cgroup_is_dead(cgrp));
- css_get(&cgrp->self);
-}
-
-static bool cgroup_tryget(struct cgroup *cgrp)
-{
- return css_tryget(&cgrp->self);
-}
-
-static void cgroup_put(struct cgroup *cgrp)
-{
- css_put(&cgrp->self);
-}
-
 /**
  * cgroup_refresh_child_subsys_mask - update child_subsys_mask
  * @cgrp: the target cgroup
--
2.1.0.rc2.206.gedb03e5

[PATCHv1 5/8] cgroup: introduce cgroup namespaces

Aditya Kali
Introduce the ability to create a new cgroup namespace. The newly created
cgroup namespace remembers the 'struct cgroup *root_cgrp' at the point
of creation of the cgroup namespace. The task that creates the new
cgroup namespace and all its future children will now be restricted only
to the cgroup hierarchy under this root_cgrp.
The main purpose of cgroup namespace is to virtualize the contents
of /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root.
This allows container-tools (like libcontainer, lxc, lmctfy, etc.)
to create completely virtualized containers without leaking system
level cgroup hierarchy to the task.
This patch only implements the 'unshare' part of the cgroupns.

Signed-off-by: Aditya Kali <[hidden email]>
---
 fs/proc/namespaces.c             |   3 +
 include/linux/cgroup.h           |  18 +++++-
 include/linux/cgroup_namespace.h |  62 +++++++++++++++++++
 include/linux/nsproxy.h          |   2 +
 include/linux/proc_ns.h          |   4 ++
 init/Kconfig                     |   9 +++
 kernel/Makefile                  |   1 +
 kernel/cgroup.c                  |  11 ++++
 kernel/cgroup_namespace.c        | 128 +++++++++++++++++++++++++++++++++++++++
 kernel/fork.c                    |   2 +-
 kernel/nsproxy.c                 |  19 +++++-
 11 files changed, 255 insertions(+), 4 deletions(-)

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 8902609..e04ed4b 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -32,6 +32,9 @@ static const struct proc_ns_operations *ns_entries[] = {
  &userns_operations,
 #endif
  &mntns_operations,
+#ifdef CONFIG_CGROUP_NS
+ &cgroupns_operations,
+#endif
 };
 
 static const struct file_operations ns_file_operations = {
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4a0eb2d..aa86495 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -22,6 +22,8 @@
 #include <linux/seq_file.h>
 #include <linux/kernfs.h>
 #include <linux/wait.h>
+#include <linux/nsproxy.h>
+#include <linux/types.h>
 
 #ifdef CONFIG_CGROUPS
 
@@ -460,6 +462,13 @@ struct cftype {
 #endif
 };
 
+struct cgroup_namespace {
+ atomic_t count;
+ unsigned int proc_inum;
+ struct user_namespace *user_ns;
+ struct cgroup *root_cgrp;
+};
+
 extern struct cgroup_root cgrp_dfl_root;
 extern struct css_set init_css_set;
 
@@ -584,10 +593,17 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
  return kernfs_name(cgrp->kn, buf, buflen);
 }
 
+static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
+ struct cgroup *cgrp, char *buf,
+ size_t buflen)
+{
+ return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf, buflen);
+}
+
 static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
       size_t buflen)
 {
- return kernfs_path(cgrp->kn, buf, buflen);
+ return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf, buflen);
 }
 
 static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
new file mode 100644
index 0000000..9f637fe
--- /dev/null
+++ b/include/linux/cgroup_namespace.h
@@ -0,0 +1,62 @@
+#ifndef _LINUX_CGROUP_NAMESPACE_H
+#define _LINUX_CGROUP_NAMESPACE_H
+
+#include <linux/nsproxy.h>
+#include <linux/cgroup.h>
+#include <linux/types.h>
+#include <linux/user_namespace.h>
+
+extern struct cgroup_namespace init_cgroup_ns;
+
+static inline struct cgroup *task_cgroupns_root(struct task_struct *tsk)
+{
+ return tsk->nsproxy->cgroup_ns->root_cgrp;
+}
+
+#ifdef CONFIG_CGROUP_NS
+
+extern void free_cgroup_ns(struct cgroup_namespace *ns);
+
+static inline struct cgroup_namespace *get_cgroup_ns(
+ struct cgroup_namespace *ns)
+{
+ if (ns)
+ atomic_inc(&ns->count);
+ return ns;
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+ if (ns && atomic_dec_and_test(&ns->count))
+ free_cgroup_ns(ns);
+}
+
+extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+       struct user_namespace *user_ns,
+       struct cgroup_namespace *old_ns);
+
+#else  /* CONFIG_CGROUP_NS */
+
+static inline struct cgroup_namespace *get_cgroup_ns(
+ struct cgroup_namespace *ns)
+{
+ return &init_cgroup_ns;
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+}
+
+static inline struct cgroup_namespace *copy_cgroup_ns(
+ unsigned long flags,
+ struct user_namespace *user_ns,
+ struct cgroup_namespace *old_ns) {
+ if (flags & CLONE_NEWCGROUP)
+ return ERR_PTR(-EINVAL);
+
+ return old_ns;
+}
+
+#endif  /* CONFIG_CGROUP_NS */
+
+#endif  /* _LINUX_CGROUP_NAMESPACE_H */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 35fa08f..ac0d65b 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -8,6 +8,7 @@ struct mnt_namespace;
 struct uts_namespace;
 struct ipc_namespace;
 struct pid_namespace;
+struct cgroup_namespace;
 struct fs_struct;
 
 /*
@@ -33,6 +34,7 @@ struct nsproxy {
  struct mnt_namespace *mnt_ns;
  struct pid_namespace *pid_ns_for_children;
  struct net     *net_ns;
+ struct cgroup_namespace *cgroup_ns;
 };
 extern struct nsproxy init_nsproxy;
 
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 34a1e10..e56dd73 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -6,6 +6,8 @@
 
 struct pid_namespace;
 struct nsproxy;
+struct task_struct;
+struct inode;
 
 struct proc_ns_operations {
  const char *name;
@@ -27,6 +29,7 @@ extern const struct proc_ns_operations ipcns_operations;
 extern const struct proc_ns_operations pidns_operations;
 extern const struct proc_ns_operations userns_operations;
 extern const struct proc_ns_operations mntns_operations;
+extern const struct proc_ns_operations cgroupns_operations;
 
 /*
  * We always define these enumerators
@@ -37,6 +40,7 @@ enum {
  PROC_UTS_INIT_INO = 0xEFFFFFFEU,
  PROC_USER_INIT_INO = 0xEFFFFFFDU,
  PROC_PID_INIT_INO = 0xEFFFFFFCU,
+ PROC_CGROUP_INIT_INO = 0xEFFFFFFBU,
 };
 
 #ifdef CONFIG_PROC_FS
diff --git a/init/Kconfig b/init/Kconfig
index e84c642..c3be001 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1144,6 +1144,15 @@ config DEBUG_BLK_CGROUP
  Enable some debugging help. Currently it exports additional stat
  files in a cgroup which can be useful for debugging.
 
+config CGROUP_NS
+ bool "CGroup Namespaces"
+ default n
+ help
+  This option enables CGroup Namespaces which can be used to isolate
+  cgroup paths. This feature is only useful when unified cgroup
+  hierarchy is in use (i.e. cgroups are mounted with sane_behavior
+  option).
+
 endif # CGROUPS
 
 config CHECKPOINT_RESTORE
diff --git a/kernel/Makefile b/kernel/Makefile
index dc5c775..75334f8 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -51,6 +51,7 @@ obj-$(CONFIG_KEXEC) += kexec.o
 obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
 obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup.o
+obj-$(CONFIG_CGROUP_NS) += cgroup_namespace.o
 obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_UTS_NS) += utsname.o
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 2b3e9f9..f8099b4 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -57,6 +57,8 @@
 #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
 #include <linux/kthread.h>
 #include <linux/delay.h>
+#include <linux/proc_ns.h>
+#include <linux/cgroup_namespace.h>
 
 #include <linux/atomic.h>
 
@@ -195,6 +197,15 @@ static void kill_css(struct cgroup_subsys_state *css);
 static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
       bool is_add);
 
+struct cgroup_namespace init_cgroup_ns = {
+ .count = {
+ .counter = 1,
+ },
+ .proc_inum = PROC_CGROUP_INIT_INO,
+ .user_ns = &init_user_ns,
+ .root_cgrp = &cgrp_dfl_root.cgrp,
+};
+
 /* IDR wrappers which synchronize using cgroup_idr_lock */
 static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end,
     gfp_t gfp_mask)
diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
new file mode 100644
index 0000000..c16604f
--- /dev/null
+++ b/kernel/cgroup_namespace.c
@@ -0,0 +1,128 @@
+
+#include <linux/cgroup.h>
+#include <linux/cgroup_namespace.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/nsproxy.h>
+#include <linux/proc_ns.h>
+
+static struct cgroup_namespace *alloc_cgroup_ns(void)
+{
+ struct cgroup_namespace *new_ns;
+
+ new_ns = kmalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
+ if (new_ns)
+ atomic_set(&new_ns->count, 1);
+ return new_ns;
+}
+
+void free_cgroup_ns(struct cgroup_namespace *ns)
+{
+ cgroup_put(ns->root_cgrp);
+ put_user_ns(ns->user_ns);
+ proc_free_inum(ns->proc_inum);
+}
+EXPORT_SYMBOL(free_cgroup_ns);
+
+struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+ struct user_namespace *user_ns,
+ struct cgroup_namespace *old_ns)
+{
+ struct cgroup_namespace *new_ns = NULL;
+ struct cgroup *cgrp = NULL;
+ int err;
+
+ BUG_ON(!old_ns);
+
+ if (!(flags & CLONE_NEWCGROUP))
+ return get_cgroup_ns(old_ns);
+
+ /* Allow only sysadmin to create cgroup namespace. */
+ err = -EPERM;
+ if (!ns_capable(user_ns, CAP_SYS_ADMIN))
+ goto err_out;
+
+ /* Prevent cgroup changes for this task. */
+ threadgroup_lock(current);
+
+ cgrp = get_task_cgroup(current);
+
+ /* Creating new CGROUPNS is supported only when unified hierarchy is in
+ * use. */
+ err = -EINVAL;
+ if (!cgroup_on_dfl(cgrp))
+ goto err_out_unlock;
+
+ err = -ENOMEM;
+ new_ns = alloc_cgroup_ns();
+ if (!new_ns)
+ goto err_out_unlock;
+
+ err = proc_alloc_inum(&new_ns->proc_inum);
+ if (err)
+ goto err_out_unlock;
+
+ new_ns->user_ns = get_user_ns(user_ns);
+ new_ns->root_cgrp = cgrp;
+
+ threadgroup_unlock(current);
+
+ return new_ns;
+
+err_out_unlock:
+ threadgroup_unlock(current);
+err_out:
+ if (cgrp)
+ cgroup_put(cgrp);
+ kfree(new_ns);
+ return ERR_PTR(err);
+}
+
+static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
+{
+ pr_info("setns not supported for cgroup namespace\n");
+ return -EINVAL;
+}
+
+static void *cgroupns_get(struct task_struct *task)
+{
+ struct cgroup_namespace *ns = NULL;
+ struct nsproxy *nsproxy;
+
+ rcu_read_lock();
+ nsproxy = task->nsproxy;
+ if (nsproxy) {
+ ns = nsproxy->cgroup_ns;
+ get_cgroup_ns(ns);
+ }
+ rcu_read_unlock();
+
+ return ns;
+}
+
+static void cgroupns_put(void *ns)
+{
+ put_cgroup_ns(ns);
+}
+
+static unsigned int cgroupns_inum(void *ns)
+{
+ struct cgroup_namespace *cgroup_ns = ns;
+
+ return cgroup_ns->proc_inum;
+}
+
+const struct proc_ns_operations cgroupns_operations = {
+ .name = "cgroup",
+ .type = CLONE_NEWCGROUP,
+ .get = cgroupns_get,
+ .put = cgroupns_put,
+ .install = cgroupns_install,
+ .inum = cgroupns_inum,
+};
+
+static __init int cgroup_namespaces_init(void)
+{
+ return 0;
+}
+subsys_initcall(cgroup_namespaces_init);
diff --git a/kernel/fork.c b/kernel/fork.c
index 0cf9cdb..cc06851 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1790,7 +1790,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
  if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
  CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
  CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
- CLONE_NEWUSER|CLONE_NEWPID))
+ CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
  return -EINVAL;
  /*
  * Not implemented, but pretend it works if there is nothing to
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index ef42d0a..a8b1970 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -25,6 +25,7 @@
 #include <linux/proc_ns.h>
 #include <linux/file.h>
 #include <linux/syscalls.h>
+#include <linux/cgroup_namespace.h>
 
 static struct kmem_cache *nsproxy_cachep;
 
@@ -39,6 +40,7 @@ struct nsproxy init_nsproxy = {
 #ifdef CONFIG_NET
  .net_ns = &init_net,
 #endif
+ .cgroup_ns = &init_cgroup_ns,
 };
 
 static inline struct nsproxy *create_nsproxy(void)
@@ -92,6 +94,13 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
  goto out_pid;
  }
 
+ new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
+    tsk->nsproxy->cgroup_ns);
+ if (IS_ERR(new_nsp->cgroup_ns)) {
+ err = PTR_ERR(new_nsp->cgroup_ns);
+ goto out_cgroup;
+ }
+
  new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
  if (IS_ERR(new_nsp->net_ns)) {
  err = PTR_ERR(new_nsp->net_ns);
@@ -101,6 +110,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
  return new_nsp;
 
 out_net:
+ if (new_nsp->cgroup_ns)
+ put_cgroup_ns(new_nsp->cgroup_ns);
+out_cgroup:
  if (new_nsp->pid_ns_for_children)
  put_pid_ns(new_nsp->pid_ns_for_children);
 out_pid:
@@ -128,7 +140,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
  struct nsproxy *new_ns;
 
  if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
-      CLONE_NEWPID | CLONE_NEWNET)))) {
+      CLONE_NEWPID | CLONE_NEWNET |
+      CLONE_NEWCGROUP)))) {
  get_nsproxy(old_ns);
  return 0;
  }
@@ -165,6 +178,8 @@ void free_nsproxy(struct nsproxy *ns)
  put_ipc_ns(ns->ipc_ns);
  if (ns->pid_ns_for_children)
  put_pid_ns(ns->pid_ns_for_children);
+ if (ns->cgroup_ns)
+ put_cgroup_ns(ns->cgroup_ns);
  put_net(ns->net_ns);
  kmem_cache_free(nsproxy_cachep, ns);
 }
@@ -180,7 +195,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
  int err = 0;
 
  if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
-       CLONE_NEWNET | CLONE_NEWPID)))
+       CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
  return 0;
 
  user_ns = new_cred ? new_cred->user_ns : current_user_ns();
--
2.1.0.rc2.206.gedb03e5


[PATCHv1 6/8] cgroup: restrict cgroup operations within task's cgroupns

Aditya Kali
In reply to this post by Aditya Kali
Restrict the following operations to the calling task's cgroupns-root:
* cgroup_mkdir & cgroup_rmdir
* cgroup_attach_task
* writes to cgroup files

Also, reading the /proc/<pid>/cgroup file is now restricted to
tasks under the same cgroupns-root. If a task tries to look
at the cgroup of another task outside of its cgroupns-root, it
won't see anything for the default hierarchy.
This is the same as if the cgroups were not mounted.
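
The common pattern behind these restrictions can be sketched in userspace.
This is a minimal model under stated assumptions, not the kernel code:
struct toy_cgroup, toy_is_descendant() and toy_check_in_ns() are
hypothetical names standing in for struct cgroup, cgroup_is_descendant()
and the checks added in the diff below. Note that in the patch the mkdir,
rmdir and attach paths return -EPERM, while the file-write path returns
-EINVAL; the model only shows the -EPERM variant.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct cgroup: only the parent link matters. */
struct toy_cgroup {
	const char *name;
	struct toy_cgroup *parent;	/* NULL for the hierarchy root */
};

/* Walk up from cgrp; true if ancestor is cgrp itself or one of its parents. */
static bool toy_is_descendant(const struct toy_cgroup *cgrp,
			      const struct toy_cgroup *ancestor)
{
	for (; cgrp; cgrp = cgrp->parent)
		if (cgrp == ancestor)
			return true;
	return false;
}

/* Model of the added checks: 0 if target is under ns_root, else -EPERM. */
static int toy_check_in_ns(const struct toy_cgroup *target,
			   const struct toy_cgroup *ns_root)
{
	return toy_is_descendant(target, ns_root) ? 0 : -EPERM;
}
```

In this model a task whose cgroupns-root is c_job_id1 may operate on
c_job_id1 and anything below it, but not on the sibling c_job_id2 or on
the parent batchjobs.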

Signed-off-by: Aditya Kali <[hidden email]>
---
 kernel/cgroup.c | 34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index f8099b4..2fc0dfa 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2318,6 +2318,12 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
  struct task_struct *task;
  int ret;
 
+ /* Only allow changing cgroups accessible within task's cgroup
+ * namespace. i.e. 'dst_cgrp' should be a descendant of task's
+ * cgroupns->root_cgrp. */
+ if (!cgroup_is_descendant(dst_cgrp, task_cgroupns_root(leader)))
+ return -EPERM;
+
  /* look up all src csets */
  down_read(&css_set_rwsem);
  rcu_read_lock();
@@ -2882,6 +2888,10 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
  struct cgroup_subsys_state *css;
  int ret;
 
+ /* Reject writes to cgroup files outside of task's cgroupns-root. */
+ if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current)))
+ return -EINVAL;
+
  if (cft->write)
  return cft->write(of, buf, nbytes, off);
 
@@ -4560,6 +4570,13 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
  parent = cgroup_kn_lock_live(parent_kn);
  if (!parent)
  return -ENODEV;
+
+ /* Allow mkdir only within process's cgroup namespace root. */
+ if (!cgroup_is_descendant(parent, task_cgroupns_root(current))) {
+ ret = -EPERM;
+ goto out_unlock;
+ }
+
  root = parent->root;
 
  /* allocate the cgroup and its ID, 0 is reserved for the root */
@@ -4822,6 +4839,13 @@ static int cgroup_rmdir(struct kernfs_node *kn)
  if (!cgrp)
  return 0;
 
+ /* Allow rmdir only within process's cgroup namespace root.
+ * The process can't delete its own root anyways. */
+ if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current))) {
+ cgroup_kn_unlock(kn);
+ return -EPERM;
+ }
+
  ret = cgroup_destroy_locked(cgrp);
 
  cgroup_kn_unlock(kn);
@@ -5051,6 +5075,15 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
  if (root == &cgrp_dfl_root && !cgrp_dfl_root_visible)
  continue;
 
+ cgrp = task_cgroup_from_root(tsk, root);
+
+ /* The cgroup path on default hierarchy is shown only if it
+ * falls under current task's cgroupns-root.
+ */
+ if (root == &cgrp_dfl_root &&
+    !cgroup_is_descendant(cgrp, task_cgroupns_root(current)))
+ continue;
+
  seq_printf(m, "%d:", root->hierarchy_id);
  for_each_subsys(ss, ssid)
  if (root->subsys_mask & (1 << ssid))
@@ -5059,7 +5092,6 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
  seq_printf(m, "%sname=%s", count ? "," : "",
    root->name);
  seq_putc(m, ':');
- cgrp = task_cgroup_from_root(tsk, root);
  path = cgroup_path(cgrp, buf, PATH_MAX);
  if (!path) {
  retval = -ENAMETOOLONG;
--
2.1.0.rc2.206.gedb03e5


Re: [PATCHv1 0/8] CGroup Namespaces

Andy Lutomirski
In reply to this post by Aditya Kali
On Mon, Oct 13, 2014 at 2:23 PM, Aditya Kali <[hidden email]> wrote:

> Second take at the Cgroup Namespace patch-set.
>
> Major changes form RFC (V0):
> 1. setns support for cgroupns
> 2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
>    mounts the cgroup hierarcy with cgroupns-root as the filesystem root.
> 3. writes to cgroup files outside of cgroupns-root are not allowed
> 4. visibility of /proc/<pid>/cgroup is further restricted by not showing
>    anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
>    your cgroupns-root.
>
> More details in the writeup below.
>
> Background
>   Cgroups and Namespaces are used together to create “virtual”
>   containers that isolates the host environment from the processes
>   running in container. But since cgroups themselves are not
>   “virtualized”, the task is always able to see global cgroups view
>   through cgroupfs mount and via /proc/self/cgroup file.
>
>   $ cat /proc/self/cgroup
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>
>   This exposure of cgroup names to the processes running inside a
>   container results in some problems:
>   (1) The container names are typically host-container-management-agent
>       (systemd, docker/libcontainer, etc.) data and leaking its name (or
>       leaking the hierarchy) reveals too much information about the host
>       system.
>   (2) It makes the container migration across machines (CRIU) more
>       difficult as the container names need to be unique across the
>       machines in the migration domain.
>   (3) It makes it difficult to run container management tools (like
>       docker/libcontainer, lmctfy, etc.) within virtual containers
>       without adding dependency on some state/agent present outside the
>       container.
>
>   Note that the feature proposed here is completely different than the
>   “ns cgroup” feature which existed in the linux kernel until recently.
>   The ns cgroup also attempted to connect cgroups and namespaces by
>   creating a new cgroup every time a new namespace was created. It did
>   not solve any of the above mentioned problems and was later dropped
>   from the kernel. Incidentally though, it used the same config option
>   name CONFIG_CGROUP_NS as used in my prototype!
>
> Introducing CGroup Namespaces
>   With unified cgroup hierarchy
>   (Documentation/cgroups/unified-hierarchy.txt), the containers can now
>   have a much more coherent cgroup view and its easy to associate a
>   container with a single cgroup. This also allows us to virtualize the
>   cgroup view for tasks inside the container.
>
>   The new CGroup Namespace allows a process to “unshare” its cgroup
>   hierarchy starting from the cgroup its currently in.
>   For Ex:
>   $ cat /proc/self/cgroup
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>   $ ls -l /proc/self/ns/cgroup
>   lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>   $ ~/unshare -c  # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
>   [ns]$ ls -l /proc/self/ns/cgroup
>   lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup ->
>   cgroup:[4026532183]
>   # From within new cgroupns, process sees that its in the root cgroup
>   [ns]$ cat /proc/self/cgroup
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>
>   # From global cgroupns:
>   $ cat /proc/<pid>/cgroup
>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>
>   # Unshare cgroupns along with userns and mountns
>   # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>   # sets up uid/gid map and exec’s /bin/bash
>   $ ~/unshare -c -u -m
>
>   # Originally, we were in /batchjobs/c_job_id1 cgroup. Mount our own cgroup
>   # hierarchy.
>   [ns]$ mount -t cgroup cgroup /tmp/cgroup
>   [ns]$ ls -l /tmp/cgroup
>   total 0
>   -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>   -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>   -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>   -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>
>   The cgroupns-root (/batchjobs/c_job_id1 in above example) becomes the
>   filesystem root for the namespace specific cgroupfs mount.
>
>   The virtualization of /proc/self/cgroup file combined with restricting
>   the view of cgroup hierarchy by namespace-private cgroupfs mount
>   should provide a completely isolated cgroup view inside the container.
>
>   In its current form, the cgroup namespaces patcheset provides following
>   behavior:
>
>   (1) The “root” cgroup for a cgroup namespace is the cgroup in which
>       the process calling unshare is running.
>       For ex. if a process in /batchjobs/c_job_id1 cgroup calls unshare,
>       cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
>       For the init_cgroup_ns, this is the real root (“/”) cgroup
>       (identified in code as cgrp_dfl_root.cgrp).
>
>   (2) The cgroupns-root cgroup does not change even if the namespace
>       creator process later moves to a different cgroup.
>       $ ~/unshare -c # unshare cgroupns in some cgroup
>       [ns]$ cat /proc/self/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>       [ns]$ mkdir sub_cgrp_1
>       [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
>       [ns]$ cat /proc/self/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>
>   (3) Each process gets its CGROUPNS specific view of
>       /proc/<pid>/cgroup.
>   (a) Processes running inside the cgroup namespace will be able to see
>       cgroup paths (in /proc/self/cgroup) only inside their root cgroup
>       [ns]$ sleep 100000 &  # From within unshared cgroupns
>       [1] 7353
>       [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
>       [ns]$ cat /proc/7353/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>
>   (b) From global cgroupns, the real cgroup path will be visible:
>       $ cat /proc/7353/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1

This is a little weird.  Not sure it's a problem.

>
>   (c) From a sibling cgroupns (cgroupns root-ed at a sibling cgroup), no cgroup
>       path will be visible:
>       # ns2's cgroupns-root is at '/batchjobs/c_job_id2'
>       [ns2]$ cat /proc/7353/cgroup
>       [ns2]$
>       This is same as when cgroup hierarchy is not mounted at all.
>       (In correct container setup though, it should not be possible to
>        access PIDs in another container in the first place.)
>
>   (4) Processes inside a cgroupns are not allowed to move out of the
>       cgroupns-root. This is true even if a privileged process in global
>       cgroupns tries to move the process out of its cgroupns-root.
>
>       # From global cgroupns
>       $ cat /proc/7353/cgroup
>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>       # cgroupns-root for 7353 is /batchjobs/c_job_id1
>       $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
>       -bash: echo: write error: Operation not permitted
>

>
>   (6) When some thread from a multi-threaded process unshares its
>       cgroup-namespace, the new cgroupns gets applied to the entire
>       process (all the threads). This should be OK since
>       unified-hierarchy only allows process-level containerization. So
>       all the threads in the process will have the same cgroup. And both
>       - changing cgroups and unsharing namespaces - are protected under
>       threadgroup_lock(task).

This seems odd to me.  Does unsharing the cgroupns unshare for all
tasks in the process?  If not, then I think that it shouldn't change
the cgroup either.

What did you end up doing to grant permission to unshare the cgroup ns?

--Andy

Re: [PATCHv1 0/8] CGroup Namespaces

Aditya Kali
On Tue, Oct 14, 2014 at 3:42 PM, Andy Lutomirski <[hidden email]> wrote:

> On Mon, Oct 13, 2014 at 2:23 PM, Aditya Kali <[hidden email]> wrote:
>> Second take at the Cgroup Namespace patch-set.
>>
>> Major changes form RFC (V0):
>> 1. setns support for cgroupns
>> 2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
>>    mounts the cgroup hierarcy with cgroupns-root as the filesystem root.
>> 3. writes to cgroup files outside of cgroupns-root are not allowed
>> 4. visibility of /proc/<pid>/cgroup is further restricted by not showing
>>    anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
>>    your cgroupns-root.
>>
>> More details in the writeup below.
>>
>> Background
>>   Cgroups and Namespaces are used together to create “virtual”
>>   containers that isolates the host environment from the processes
>>   running in container. But since cgroups themselves are not
>>   “virtualized”, the task is always able to see global cgroups view
>>   through cgroupfs mount and via /proc/self/cgroup file.
>>
>>   $ cat /proc/self/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>
>>   This exposure of cgroup names to the processes running inside a
>>   container results in some problems:
>>   (1) The container names are typically host-container-management-agent
>>       (systemd, docker/libcontainer, etc.) data and leaking its name (or
>>       leaking the hierarchy) reveals too much information about the host
>>       system.
>>   (2) It makes the container migration across machines (CRIU) more
>>       difficult as the container names need to be unique across the
>>       machines in the migration domain.
>>   (3) It makes it difficult to run container management tools (like
>>       docker/libcontainer, lmctfy, etc.) within virtual containers
>>       without adding dependency on some state/agent present outside the
>>       container.
>>
>>   Note that the feature proposed here is completely different than the
>>   “ns cgroup” feature which existed in the linux kernel until recently.
>>   The ns cgroup also attempted to connect cgroups and namespaces by
>>   creating a new cgroup every time a new namespace was created. It did
>>   not solve any of the above mentioned problems and was later dropped
>>   from the kernel. Incidentally though, it used the same config option
>>   name CONFIG_CGROUP_NS as used in my prototype!
>>
>> Introducing CGroup Namespaces
>>   With unified cgroup hierarchy
>>   (Documentation/cgroups/unified-hierarchy.txt), the containers can now
>>   have a much more coherent cgroup view and its easy to associate a
>>   container with a single cgroup. This also allows us to virtualize the
>>   cgroup view for tasks inside the container.
>>
>>   The new CGroup Namespace allows a process to “unshare” its cgroup
>>   hierarchy starting from the cgroup its currently in.
>>   For Ex:
>>   $ cat /proc/self/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>   $ ls -l /proc/self/ns/cgroup
>>   lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>>   $ ~/unshare -c  # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
>>   [ns]$ ls -l /proc/self/ns/cgroup
>>   lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup ->
>>   cgroup:[4026532183]
>>   # From within new cgroupns, process sees that its in the root cgroup
>>   [ns]$ cat /proc/self/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>
>>   # From global cgroupns:
>>   $ cat /proc/<pid>/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>
>>   # Unshare cgroupns along with userns and mountns
>>   # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>>   # sets up uid/gid map and exec’s /bin/bash
>>   $ ~/unshare -c -u -m
>>
>>   # Originally, we were in /batchjobs/c_job_id1 cgroup. Mount our own cgroup
>>   # hierarchy.
>>   [ns]$ mount -t cgroup cgroup /tmp/cgroup
>>   [ns]$ ls -l /tmp/cgroup
>>   total 0
>>   -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>>   -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>>   -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>>   -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>>
>>   The cgroupns-root (/batchjobs/c_job_id1 in above example) becomes the
>>   filesystem root for the namespace specific cgroupfs mount.
>>
>>   The virtualization of /proc/self/cgroup file combined with restricting
>>   the view of cgroup hierarchy by namespace-private cgroupfs mount
>>   should provide a completely isolated cgroup view inside the container.
>>
>>   In its current form, the cgroup namespaces patcheset provides following
>>   behavior:
>>
>>   (1) The “root” cgroup for a cgroup namespace is the cgroup in which
>>       the process calling unshare is running.
>>       For ex. if a process in /batchjobs/c_job_id1 cgroup calls unshare,
>>       cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
>>       For the init_cgroup_ns, this is the real root (“/”) cgroup
>>       (identified in code as cgrp_dfl_root.cgrp).
>>
>>   (2) The cgroupns-root cgroup does not change even if the namespace
>>       creator process later moves to a different cgroup.
>>       $ ~/unshare -c # unshare cgroupns in some cgroup
>>       [ns]$ cat /proc/self/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>       [ns]$ mkdir sub_cgrp_1
>>       [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
>>       [ns]$ cat /proc/self/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>>
>>   (3) Each process gets its CGROUPNS specific view of
>>       /proc/<pid>/cgroup.
>>   (a) Processes running inside the cgroup namespace will be able to see
>>       cgroup paths (in /proc/self/cgroup) only inside their root cgroup
>>       [ns]$ sleep 100000 &  # From within unshared cgroupns
>>       [1] 7353
>>       [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
>>       [ns]$ cat /proc/7353/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>>
>>   (b) From global cgroupns, the real cgroup path will be visible:
>>       $ cat /proc/7353/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>
> This is a little weird.  Not sure it's a problem.
>
>>
>>   (c) From a sibling cgroupns (cgroupns root-ed at a sibling cgroup), no cgroup
>>       path will be visible:
>>       # ns2's cgroupns-root is at '/batchjobs/c_job_id2'
>>       [ns2]$ cat /proc/7353/cgroup
>>       [ns2]$
>>       This is same as when cgroup hierarchy is not mounted at all.
>>       (In correct container setup though, it should not be possible to
>>        access PIDs in another container in the first place.)
>>
>>   (4) Processes inside a cgroupns are not allowed to move out of the
>>       cgroupns-root. This is true even if a privileged process in global
>>       cgroupns tries to move the process out of its cgroupns-root.
>>
>>       # From global cgroupns
>>       $ cat /proc/7353/cgroup
>>       0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>>       # cgroupns-root for 7353 is /batchjobs/c_job_id1
>>       $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
>>       -bash: echo: write error: Operation not permitted
>>
>
>>
>>   (6) When some thread from a multi-threaded process unshares its
>>       cgroup-namespace, the new cgroupns gets applied to the entire
>>       process (all the threads). This should be OK since
>>       unified-hierarchy only allows process-level containerization. So
>>       all the threads in the process will have the same cgroup. And both
>>       - changing cgroups and unsharing namespaces - are protected under
>>       threadgroup_lock(task).
>
> This seems odd to me.  Does unsharing the cgroupns unshare for all
> tasks in the process?  If not, then I think that it shouldn't change
> the cgroup either.
>

Unsharing cgroupns unshares for all tasks in the process, yes.

The cgroup changes are protected by threadgroup_lock. So it made sense
to protect cgroupns changes (unshare or setns) by the same lock as we
don't want task's cgroup to change underneath while we are changing
its cgroup-namespace. No cgroup change happens during the
unshare/setns call.

> What did you end up doing to grant permission to unshare the cgroup ns?
>

Currently the only requirement is ns_capable(cgroupns->user_ns,
CAP_SYS_ADMIN). It's possible to refine this further, but for now I
just kept it simple. I am looking into the explicit permission check
discussed previously (https://lkml.org/lkml/2014/7/29/402), but wanted
to get this out sooner.

> --Andy

Thanks,
--
Aditya
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv1 1/8] kernfs: Add API to generate relative kernfs path

Serge E. Hallyn-3
In reply to this post by Aditya Kali
Quoting Aditya Kali ([hidden email]):
> The new function kernfs_path_from_node() generates and returns
> kernfs path of a given kernfs_node relative to a given parent
> kernfs_node.
>
> Signed-off-by: Aditya Kali <[hidden email]>

Acked-by: Serge Hallyn <[hidden email]>

(with or without my comment below taken)

> ---
>  fs/kernfs/dir.c        | 53 ++++++++++++++++++++++++++++++++++++++++----------
>  include/linux/kernfs.h |  3 +++
>  2 files changed, 46 insertions(+), 10 deletions(-)
>
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index a693f5b..8655485 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -44,14 +44,24 @@ static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
>   return strlcpy(buf, kn->parent ? kn->name : "/", buflen);
>  }
>  
> -static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
> -      size_t buflen)
> +static char * __must_check kernfs_path_from_node_locked(
> + struct kernfs_node *kn_root,
> + struct kernfs_node *kn,
> + char *buf,
> + size_t buflen)
>  {
>   char *p = buf + buflen;
>   int len;
>  
> + BUG_ON(!buflen);
> +
>   *--p = '\0';
>  
> + if (kn == kn_root) {
> + *--p = '/';
> + return p;
> + }
> +
>   do {
>   len = strlen(kn->name);
>   if (p - buf < len + 1) {
> @@ -63,6 +73,8 @@ static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
>   memcpy(p, kn->name, len);
>   *--p = '/';
>   kn = kn->parent;
> + if (kn == kn_root)
> + break;

I wonder if it would be clearer to change the while condition instead, i.e.

        } while (kn && kn != kn_root && kn->parent);

That way it's not a special condition, just part of the expected flow.
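
To illustrate, here is a minimal userspace sketch of the loop with the
kn_root test folded into the while condition. It is not the kernel code:
struct node and path_from_node() are hypothetical stand-ins for
struct kernfs_node and kernfs_path_from_node_locked(), and locking is
left out entirely.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct kernfs_node: only what the loop needs. */
struct node {
	const char *name;
	struct node *parent;
};

/*
 * Build the path of @kn relative to @root, working backwards from the
 * end of @buf. Returns a pointer into @buf, or NULL if @buf is too small.
 * A NULL @root yields the full path.
 */
static char *path_from_node(struct node *root, struct node *kn,
			    char *buf, size_t buflen)
{
	char *p;
	size_t len;

	if (!buflen)
		return NULL;

	p = buf + buflen;
	*--p = '\0';

	/*
	 * A do-while body always runs once, so asking for the root itself
	 * (or the top of the tree) still needs an early return of "/".
	 */
	if (kn == root || !kn->parent) {
		if (p == buf)
			return NULL;
		*--p = '/';
		return p;
	}

	do {
		len = strlen(kn->name);
		if ((size_t)(p - buf) < len + 1) {
			buf[0] = '\0';
			return NULL;
		}
		p -= len;
		memcpy(p, kn->name, len);
		*--p = '/';
		kn = kn->parent;
		/* Reaching @root is just another way for the walk to end. */
	} while (kn && kn != root && kn->parent);

	return p;
}
```

With a chain / -> batchjobs -> c_job_id1 -> sub_cgrp_1, passing c_job_id1
as the root should give "/sub_cgrp_1", and passing NULL should give the
full path.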

>   } while (kn && kn->parent);
>  
>   return p;
> @@ -92,26 +104,47 @@ int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen)
>  }
>  
>  /**
> - * kernfs_path - build full path of a given node
> + * kernfs_path_from_node - build path of node @kn relative to @kn_root.
> + * @kn_root: parent kernfs_node relative to which we need to build the path
>   * @kn: kernfs_node of interest
> - * @buf: buffer to copy @kn's name into
> + * @buf: buffer to copy @kn's path into
>   * @buflen: size of @buf
>   *
> - * Builds and returns the full path of @kn in @buf of @buflen bytes.  The
> - * path is built from the end of @buf so the returned pointer usually
> + * Builds and returns @kn's path relative to @kn_root. @kn_root is expected to
> + * be parent of @kn at some level. If this is not true or if @kn_root is NULL,
> + * then full path of @kn is returned.
> + * The path is built from the end of @buf so the returned pointer usually
>   * doesn't match @buf.  If @buf isn't long enough, @buf is nul terminated
>   * and %NULL is returned.
>   */
> -char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
> +char *kernfs_path_from_node(struct kernfs_node *kn_root, struct kernfs_node *kn,
> +    char *buf, size_t buflen)
>  {
>   unsigned long flags;
>   char *p;
>  
>   spin_lock_irqsave(&kernfs_rename_lock, flags);
> - p = kernfs_path_locked(kn, buf, buflen);
> + p = kernfs_path_from_node_locked(kn_root, kn, buf, buflen);
>   spin_unlock_irqrestore(&kernfs_rename_lock, flags);
>   return p;
>  }
> +EXPORT_SYMBOL_GPL(kernfs_path_from_node);
> +
> +/**
> + * kernfs_path - build full path of a given node
> + * @kn: kernfs_node of interest
> + * @buf: buffer to copy @kn's name into
> + * @buflen: size of @buf
> + *
> + * Builds and returns the full path of @kn in @buf of @buflen bytes.  The
> + * path is built from the end of @buf so the returned pointer usually
> + * doesn't match @buf.  If @buf isn't long enough, @buf is nul terminated
> + * and %NULL is returned.
> + */
> +char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
> +{
> + return kernfs_path_from_node(NULL, kn, buf, buflen);
> +}
>  EXPORT_SYMBOL_GPL(kernfs_path);
>  
>  /**
> @@ -145,8 +178,8 @@ void pr_cont_kernfs_path(struct kernfs_node *kn)
>  
>   spin_lock_irqsave(&kernfs_rename_lock, flags);
>  
> - p = kernfs_path_locked(kn, kernfs_pr_cont_buf,
> -       sizeof(kernfs_pr_cont_buf));
> + p = kernfs_path_from_node_locked(NULL, kn, kernfs_pr_cont_buf,
> + sizeof(kernfs_pr_cont_buf));
>   if (p)
>   pr_cont("%s", p);
>   else
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index 30faf79..3c2be75 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -258,6 +258,9 @@ static inline bool kernfs_ns_enabled(struct kernfs_node *kn)
>  }
>  
>  int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen);
> +char * __must_check kernfs_path_from_node(struct kernfs_node *root_kn,
> +  struct kernfs_node *kn, char *buf,
> +  size_t buflen);
>  char * __must_check kernfs_path(struct kernfs_node *kn, char *buf,
>   size_t buflen);
>  void pr_cont_kernfs_name(struct kernfs_node *kn);
> --
> 2.1.0.rc2.206.gedb03e5
>
> _______________________________________________
> Containers mailing list
> [hidden email]
> https://lists.linuxfoundation.org/mailman/listinfo/containers

Re: [PATCHv1 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace

Serge E. Hallyn-3
In reply to this post by Aditya Kali
Quoting Aditya Kali ([hidden email]):
> CLONE_NEWCGROUP will be used to create new cgroup namespace.
>
> Signed-off-by: Aditya Kali <[hidden email]>

Acked-by: Serge Hallyn <[hidden email]>

> ---
>  include/uapi/linux/sched.h | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 34f9d73..2f90d00 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -21,8 +21,7 @@
>  #define CLONE_DETACHED 0x00400000 /* Unused, ignored */
>  #define CLONE_UNTRACED 0x00800000 /* set if the tracing process can't force CLONE_PTRACE on this clone */
>  #define CLONE_CHILD_SETTID 0x01000000 /* set the TID in the child */
> -/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
> -   and is now available for re-use. */
> +#define CLONE_NEWCGROUP 0x02000000 /* New cgroup namespace */
>  #define CLONE_NEWUTS 0x04000000 /* New utsname group? */
>  #define CLONE_NEWIPC 0x08000000 /* New ipcs */
>  #define CLONE_NEWUSER 0x10000000 /* New user namespace */
> --
> 2.1.0.rc2.206.gedb03e5
>

Re: [PATCHv1 3/8] cgroup: add function to get task's cgroup on default hierarchy

Serge E. Hallyn-3
In reply to this post by Aditya Kali
Quoting Aditya Kali ([hidden email]):
> get_task_cgroup() returns the (reference counted) cgroup of the
> given task on the default hierarchy.
>
> Signed-off-by: Aditya Kali <[hidden email]>

Acked-by: Serge Hallyn <[hidden email]>

> ---
>  include/linux/cgroup.h |  1 +
>  kernel/cgroup.c        | 25 +++++++++++++++++++++++++
>  2 files changed, 26 insertions(+)
>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 1d51968..80ed6e0 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -579,6 +579,7 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
>  }
>  
>  char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen);
> +struct cgroup *get_task_cgroup(struct task_struct *task);
>  
>  int cgroup_add_dfl_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
>  int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index cab7dc4..56d507b 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1916,6 +1916,31 @@ char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen)
>  }
>  EXPORT_SYMBOL_GPL(task_cgroup_path);
>  
> +/*
> + * get_task_cgroup - returns the cgroup of the task in the default cgroup
> + * hierarchy.
> + *
> + * @task: target task
> + * This function returns the @task's cgroup on the default cgroup hierarchy. The
> + * returned cgroup has its reference incremented (by calling cgroup_get()). So
> + * the caller must cgroup_put() the obtained reference once it is done with it.
> + */
> +struct cgroup *get_task_cgroup(struct task_struct *task)
> +{
> + struct cgroup *cgrp;
> +
> + mutex_lock(&cgroup_mutex);
> + down_read(&css_set_rwsem);
> +
> + cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
> + cgroup_get(cgrp);
> +
> + up_read(&css_set_rwsem);
> + mutex_unlock(&cgroup_mutex);
> + return cgrp;
> +}
> +EXPORT_SYMBOL_GPL(get_task_cgroup);
> +
>  /* used to track tasks and other necessary states during migration */
>  struct cgroup_taskset {
>   /* the src and dst cset list running through cset->mg_node */
> --
> 2.1.0.rc2.206.gedb03e5
>

Re: [PATCHv1 4/8] cgroup: export cgroup_get() and cgroup_put()

Serge E. Hallyn-3
In reply to this post by Aditya Kali
Quoting Aditya Kali ([hidden email]):
> move cgroup_get() and cgroup_put() into cgroup.h so that
> they can be called from other places.
>
> Signed-off-by: Aditya Kali <[hidden email]>

Acked-by: Serge Hallyn <[hidden email]>

> ---
>  include/linux/cgroup.h | 22 ++++++++++++++++++++++
>  kernel/cgroup.c        | 22 ----------------------
>  2 files changed, 22 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 80ed6e0..4a0eb2d 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -521,6 +521,28 @@ static inline bool cgroup_on_dfl(const struct cgroup *cgrp)
>   return cgrp->root == &cgrp_dfl_root;
>  }
>  
> +/* convenient tests for these bits */
> +static inline bool cgroup_is_dead(const struct cgroup *cgrp)
> +{
> + return !(cgrp->self.flags & CSS_ONLINE);
> +}
> +
> +static inline void cgroup_get(struct cgroup *cgrp)
> +{
> + WARN_ON_ONCE(cgroup_is_dead(cgrp));
> + css_get(&cgrp->self);
> +}
> +
> +static inline bool cgroup_tryget(struct cgroup *cgrp)
> +{
> + return css_tryget(&cgrp->self);
> +}
> +
> +static inline void cgroup_put(struct cgroup *cgrp)
> +{
> + css_put(&cgrp->self);
> +}
> +
>  /* no synchronization, the result can only be used as a hint */
>  static inline bool cgroup_has_tasks(struct cgroup *cgrp)
>  {
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 56d507b..2b3e9f9 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -284,12 +284,6 @@ static struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgrp,
>   return cgroup_css(cgrp, ss);
>  }
>  
> -/* convenient tests for these bits */
> -static inline bool cgroup_is_dead(const struct cgroup *cgrp)
> -{
> - return !(cgrp->self.flags & CSS_ONLINE);
> -}
> -
>  struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
>  {
>   struct cgroup *cgrp = of->kn->parent->priv;
> @@ -1002,22 +996,6 @@ static umode_t cgroup_file_mode(const struct cftype *cft)
>   return mode;
>  }
>  
> -static void cgroup_get(struct cgroup *cgrp)
> -{
> - WARN_ON_ONCE(cgroup_is_dead(cgrp));
> - css_get(&cgrp->self);
> -}
> -
> -static bool cgroup_tryget(struct cgroup *cgrp)
> -{
> - return css_tryget(&cgrp->self);
> -}
> -
> -static void cgroup_put(struct cgroup *cgrp)
> -{
> - css_put(&cgrp->self);
> -}
> -
>  /**
>   * cgroup_refresh_child_subsys_mask - update child_subsys_mask
>   * @cgrp: the target cgroup
> --
> 2.1.0.rc2.206.gedb03e5
>

Re: [PATCHv1 5/8] cgroup: introduce cgroup namespaces

Serge E. Hallyn-3
In reply to this post by Aditya Kali
Quoting Aditya Kali ([hidden email]):

> Introduce the ability to create new cgroup namespace. The newly created
> cgroup namespace remembers the 'struct cgroup *root_cgrp' at the point
> of creation of the cgroup namespace. The task that creates the new
> cgroup namespace and all its future children will now be restricted only
> to the cgroup hierarchy under this root_cgrp.
> The main purpose of cgroup namespace is to virtualize the contents
> of /proc/self/cgroup file. Processes inside a cgroup namespace
> are only able to see paths relative to their namespace root.
> This allows container-tools (like libcontainer, lxc, lmctfy, etc.)
> to create completely virtualized containers without leaking system
> level cgroup hierarchy to the task.
> This patch only implements the 'unshare' part of the cgroupns.
>
> Signed-off-by: Aditya Kali <[hidden email]>

I'm not sure that the CONFIG_CGROUP_NS is worthwhile.  If you already
have cgroups in the kernel this won't add much in the way of memory
usage, right?  And I think the 'experimental' argument has long since
been squashed.  So I'd argue for simplifying this patch by removing
CONFIG_CGROUP_NS.

(more below)

> ---
>  fs/proc/namespaces.c             |   3 +
>  include/linux/cgroup.h           |  18 +++++-
>  include/linux/cgroup_namespace.h |  62 +++++++++++++++++++
>  include/linux/nsproxy.h          |   2 +
>  include/linux/proc_ns.h          |   4 ++
>  init/Kconfig                     |   9 +++
>  kernel/Makefile                  |   1 +
>  kernel/cgroup.c                  |  11 ++++
>  kernel/cgroup_namespace.c        | 128 +++++++++++++++++++++++++++++++++++++++
>  kernel/fork.c                    |   2 +-
>  kernel/nsproxy.c                 |  19 +++++-
>  11 files changed, 255 insertions(+), 4 deletions(-)
>
> diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
> index 8902609..e04ed4b 100644
> --- a/fs/proc/namespaces.c
> +++ b/fs/proc/namespaces.c
> @@ -32,6 +32,9 @@ static const struct proc_ns_operations *ns_entries[] = {
>   &userns_operations,
>  #endif
>   &mntns_operations,
> +#ifdef CONFIG_CGROUP_NS
> + &cgroupns_operations,
> +#endif
>  };
>  
>  static const struct file_operations ns_file_operations = {
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 4a0eb2d..aa86495 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -22,6 +22,8 @@
>  #include <linux/seq_file.h>
>  #include <linux/kernfs.h>
>  #include <linux/wait.h>
> +#include <linux/nsproxy.h>
> +#include <linux/types.h>
>  
>  #ifdef CONFIG_CGROUPS
>  
> @@ -460,6 +462,13 @@ struct cftype {
>  #endif
>  };
>  
> +struct cgroup_namespace {
> + atomic_t count;
> + unsigned int proc_inum;
> + struct user_namespace *user_ns;
> + struct cgroup *root_cgrp;
> +};
> +
>  extern struct cgroup_root cgrp_dfl_root;
>  extern struct css_set init_css_set;
>  
> @@ -584,10 +593,17 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
>   return kernfs_name(cgrp->kn, buf, buflen);
>  }
>  
> +static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
> + struct cgroup *cgrp, char *buf,
> + size_t buflen)
> +{
> + return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf, buflen);
> +}
> +
>  static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
>        size_t buflen)
>  {
> - return kernfs_path(cgrp->kn, buf, buflen);
> + return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf, buflen);
>  }
>  
>  static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
> diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
> new file mode 100644
> index 0000000..9f637fe
> --- /dev/null
> +++ b/include/linux/cgroup_namespace.h
> @@ -0,0 +1,62 @@
> +#ifndef _LINUX_CGROUP_NAMESPACE_H
> +#define _LINUX_CGROUP_NAMESPACE_H
> +
> +#include <linux/nsproxy.h>
> +#include <linux/cgroup.h>
> +#include <linux/types.h>
> +#include <linux/user_namespace.h>
> +
> +extern struct cgroup_namespace init_cgroup_ns;
> +
> +static inline struct cgroup *task_cgroupns_root(struct task_struct *tsk)
> +{
> + return tsk->nsproxy->cgroup_ns->root_cgrp;

Per the rules in nsproxy.h, you should be taking the task_lock here.

(If you are making assumptions about tsk then you need to state them
here - I only looked quickly enough to see that you pass in 'leader')

> +}
> +
> +#ifdef CONFIG_CGROUP_NS
> +
> +extern void free_cgroup_ns(struct cgroup_namespace *ns);
> +
> +static inline struct cgroup_namespace *get_cgroup_ns(
> + struct cgroup_namespace *ns)
> +{
> + if (ns)
> + atomic_inc(&ns->count);
> + return ns;
> +}
> +
> +static inline void put_cgroup_ns(struct cgroup_namespace *ns)
> +{
> + if (ns && atomic_dec_and_test(&ns->count))
> + free_cgroup_ns(ns);
> +}
> +
> +extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
> +       struct user_namespace *user_ns,
> +       struct cgroup_namespace *old_ns);
> +
> +#else  /* CONFIG_CGROUP_NS */
> +
> +static inline struct cgroup_namespace *get_cgroup_ns(
> + struct cgroup_namespace *ns)
> +{
> + return &init_cgroup_ns;
> +}
> +
> +static inline void put_cgroup_ns(struct cgroup_namespace *ns)
> +{
> +}
> +
> +static inline struct cgroup_namespace *copy_cgroup_ns(
> + unsigned long flags,
> + struct user_namespace *user_ns,
> + struct cgroup_namespace *old_ns) {
> + if (flags & CLONE_NEWCGROUP)
> + return ERR_PTR(-EINVAL);
> +
> + return old_ns;
> +}
> +
> +#endif  /* CONFIG_CGROUP_NS */
> +
> +#endif  /* _LINUX_CGROUP_NAMESPACE_H */
> diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
> index 35fa08f..ac0d65b 100644
> --- a/include/linux/nsproxy.h
> +++ b/include/linux/nsproxy.h
> @@ -8,6 +8,7 @@ struct mnt_namespace;
>  struct uts_namespace;
>  struct ipc_namespace;
>  struct pid_namespace;
> +struct cgroup_namespace;
>  struct fs_struct;
>  
>  /*
> @@ -33,6 +34,7 @@ struct nsproxy {
>   struct mnt_namespace *mnt_ns;
>   struct pid_namespace *pid_ns_for_children;
>   struct net     *net_ns;
> + struct cgroup_namespace *cgroup_ns;
>  };
>  extern struct nsproxy init_nsproxy;
>  
> diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
> index 34a1e10..e56dd73 100644
> --- a/include/linux/proc_ns.h
> +++ b/include/linux/proc_ns.h
> @@ -6,6 +6,8 @@
>  
>  struct pid_namespace;
>  struct nsproxy;
> +struct task_struct;
> +struct inode;
>  
>  struct proc_ns_operations {
>   const char *name;
> @@ -27,6 +29,7 @@ extern const struct proc_ns_operations ipcns_operations;
>  extern const struct proc_ns_operations pidns_operations;
>  extern const struct proc_ns_operations userns_operations;
>  extern const struct proc_ns_operations mntns_operations;
> +extern const struct proc_ns_operations cgroupns_operations;
>  
>  /*
>   * We always define these enumerators
> @@ -37,6 +40,7 @@ enum {
>   PROC_UTS_INIT_INO = 0xEFFFFFFEU,
>   PROC_USER_INIT_INO = 0xEFFFFFFDU,
>   PROC_PID_INIT_INO = 0xEFFFFFFCU,
> + PROC_CGROUP_INIT_INO = 0xEFFFFFFBU,
>  };
>  
>  #ifdef CONFIG_PROC_FS
> diff --git a/init/Kconfig b/init/Kconfig
> index e84c642..c3be001 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1144,6 +1144,15 @@ config DEBUG_BLK_CGROUP
>   Enable some debugging help. Currently it exports additional stat
>   files in a cgroup which can be useful for debugging.
>  
> +config CGROUP_NS
> + bool "CGroup Namespaces"
> + default n
> + help
> +  This options enables CGroup Namespaces which can be used to isolate
> +  cgroup paths. This feature is only useful when unified cgroup
> +  hierarchy is in use (i.e. cgroups are mounted with sane_behavior
> +  option).
> +
>  endif # CGROUPS
>  
>  config CHECKPOINT_RESTORE
> diff --git a/kernel/Makefile b/kernel/Makefile
> index dc5c775..75334f8 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -51,6 +51,7 @@ obj-$(CONFIG_KEXEC) += kexec.o
>  obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
>  obj-$(CONFIG_COMPAT) += compat.o
>  obj-$(CONFIG_CGROUPS) += cgroup.o
> +obj-$(CONFIG_CGROUP_NS) += cgroup_namespace.o
>  obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
>  obj-$(CONFIG_CPUSETS) += cpuset.o
>  obj-$(CONFIG_UTS_NS) += utsname.o
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 2b3e9f9..f8099b4 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -57,6 +57,8 @@
>  #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
>  #include <linux/kthread.h>
>  #include <linux/delay.h>
> +#include <linux/proc_ns.h>
> +#include <linux/cgroup_namespace.h>
>  
>  #include <linux/atomic.h>
>  
> @@ -195,6 +197,15 @@ static void kill_css(struct cgroup_subsys_state *css);
>  static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
>        bool is_add);
>  
> +struct cgroup_namespace init_cgroup_ns = {
> + .count = {
> + .counter = 1,
> + },
> + .proc_inum = PROC_CGROUP_INIT_INO,
> + .user_ns = &init_user_ns,

This might mean that you should bump the init_user_ns refcount.

> + .root_cgrp = &cgrp_dfl_root.cgrp,
> +};
> +
>  /* IDR wrappers which synchronize using cgroup_idr_lock */
>  static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end,
>      gfp_t gfp_mask)
> diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
> new file mode 100644
> index 0000000..c16604f
> --- /dev/null
> +++ b/kernel/cgroup_namespace.c
> @@ -0,0 +1,128 @@
> +
> +#include <linux/cgroup.h>
> +#include <linux/cgroup_namespace.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/nsproxy.h>
> +#include <linux/proc_ns.h>
> +
> +static struct cgroup_namespace *alloc_cgroup_ns(void)
> +{
> + struct cgroup_namespace *new_ns;
> +
> + new_ns = kmalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
> + if (new_ns)
> + atomic_set(&new_ns->count, 1);
> + return new_ns;
> +}
> +
> +void free_cgroup_ns(struct cgroup_namespace *ns)
> +{
> + cgroup_put(ns->root_cgrp);
> + put_user_ns(ns->user_ns);

This is a problem on the error path in copy_cgroup_ns.  The
alloc_cgroup_ns() doesn't initialize these values, so if
you should fail in proc_alloc_inum() you'll show up here
with random values in ns->*.
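
One way to make the teardown robust, sketched in userspace under the
assumption that the free routine can be reached with a partially set-up
namespace: zero the allocation (kzalloc() in the kernel, calloc() here)
and let the put helpers ignore NULL. struct res, struct ns and every
helper name below are hypothetical models, not the kernel's types.

```c
#include <stdlib.h>

/* Hypothetical refcounted object standing in for a cgroup or user_ns. */
struct res {
	int refs;
};

/* Hypothetical model of struct cgroup_namespace. */
struct ns {
	struct res *user_ns;
	struct res *root_cgrp;
	unsigned int inum;
};

static void res_get(struct res *r)
{
	r->refs++;
}

static void res_put(struct res *r)
{
	if (r)			/* tolerate members that were never set */
		r->refs--;
}

static struct ns *alloc_ns(void)
{
	/* Zeroed allocation: every pointer member starts out NULL. */
	return calloc(1, sizeof(struct ns));
}

static void free_ns(struct ns *ns)
{
	if (!ns)
		return;
	res_put(ns->root_cgrp);
	res_put(ns->user_ns);
	free(ns);
}

/* @fail_inum simulates the inode number allocation failing. */
static struct ns *copy_ns(struct res *user_ns, struct res *cgrp, int fail_inum)
{
	struct ns *new_ns = alloc_ns();

	if (!new_ns)
		return NULL;

	if (fail_inum) {
		/* Safe: the members are still NULL, so nothing is put. */
		free_ns(new_ns);
		return NULL;
	}
	new_ns->inum = 1;

	res_get(user_ns);
	new_ns->user_ns = user_ns;
	res_get(cgrp);
	new_ns->root_cgrp = cgrp;
	return new_ns;
}
```

In this model a failure before the members are assigned leaves both
reference counts untouched, and a full setup followed by free_ns()
returns them to their starting values.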

> + proc_free_inum(ns->proc_inum);
> +}
> +EXPORT_SYMBOL(free_cgroup_ns);
> +
> +struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
> + struct user_namespace *user_ns,
> + struct cgroup_namespace *old_ns)
> +{
> + struct cgroup_namespace *new_ns = NULL;
> + struct cgroup *cgrp = NULL;
> + int err;
> +
> + BUG_ON(!old_ns);
> +
> + if (!(flags & CLONE_NEWCGROUP))
> + return get_cgroup_ns(old_ns);
> +
> + /* Allow only sysadmin to create cgroup namespace. */
> + err = -EPERM;
> + if (!ns_capable(user_ns, CAP_SYS_ADMIN))
> + goto err_out;
> +
> + /* Prevent cgroup changes for this task. */
> + threadgroup_lock(current);
> +
> + cgrp = get_task_cgroup(current);
> +
> + /* Creating new CGROUPNS is supported only when unified hierarchy is in
> + * use. */

Oh, drat.  Well, I'll take it, but under protest  :)

> + err = -EINVAL;
> + if (!cgroup_on_dfl(cgrp))
> + goto err_out_unlock;
> +
> + err = -ENOMEM;
> + new_ns = alloc_cgroup_ns();
> + if (!new_ns)
> + goto err_out_unlock;
> +
> + err = proc_alloc_inum(&new_ns->proc_inum);
> + if (err)
> + goto err_out_unlock;
> +
> + new_ns->user_ns = get_user_ns(user_ns);
> + new_ns->root_cgrp = cgrp;
> +
> + threadgroup_unlock(current);
> +
> + return new_ns;
> +
> +err_out_unlock:
> + threadgroup_unlock(current);
> +err_out:
> + if (cgrp)
> + cgroup_put(cgrp);
> + kfree(new_ns);
> + return ERR_PTR(err);
> +}
> +
> +static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
> +{
> + pr_info("setns not supported for cgroup namespace");
> + return -EINVAL;
> +}
> +
> +static void *cgroupns_get(struct task_struct *task)
> +{
> + struct cgroup_namespace *ns = NULL;
> + struct nsproxy *nsproxy;
> +
> + rcu_read_lock();
> + nsproxy = task->nsproxy;
> + if (nsproxy) {
> + ns = nsproxy->cgroup_ns;
> + get_cgroup_ns(ns);
> + }
> + rcu_read_unlock();
> +
> + return ns;
> +}
> +
> +static void cgroupns_put(void *ns)
> +{
> + put_cgroup_ns(ns);
> +}
> +
> +static unsigned int cgroupns_inum(void *ns)
> +{
> + struct cgroup_namespace *cgroup_ns = ns;
> +
> + return cgroup_ns->proc_inum;
> +}
> +
> +const struct proc_ns_operations cgroupns_operations = {
> + .name = "cgroup",
> + .type = CLONE_NEWCGROUP,
> + .get = cgroupns_get,
> + .put = cgroupns_put,
> + .install = cgroupns_install,
> + .inum = cgroupns_inum,
> +};
> +
> +static __init int cgroup_namespaces_init(void)
> +{
> + return 0;
> +}
> +subsys_initcall(cgroup_namespaces_init);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 0cf9cdb..cc06851 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1790,7 +1790,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
>   if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
>   CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
>   CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
> - CLONE_NEWUSER|CLONE_NEWPID))
> + CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
>   return -EINVAL;
>   /*
>   * Not implemented, but pretend it works if there is nothing to
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index ef42d0a..a8b1970 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -25,6 +25,7 @@
>  #include <linux/proc_ns.h>
>  #include <linux/file.h>
>  #include <linux/syscalls.h>
> +#include <linux/cgroup_namespace.h>
>  
>  static struct kmem_cache *nsproxy_cachep;
>  
> @@ -39,6 +40,7 @@ struct nsproxy init_nsproxy = {
>  #ifdef CONFIG_NET
>   .net_ns = &init_net,
>  #endif
> + .cgroup_ns = &init_cgroup_ns,
>  };
>  
>  static inline struct nsproxy *create_nsproxy(void)
> @@ -92,6 +94,13 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
>   goto out_pid;
>   }
>  
> + new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
> +    tsk->nsproxy->cgroup_ns);
> + if (IS_ERR(new_nsp->cgroup_ns)) {
> + err = PTR_ERR(new_nsp->cgroup_ns);
> + goto out_cgroup;
> + }
> +
>   new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
>   if (IS_ERR(new_nsp->net_ns)) {
>   err = PTR_ERR(new_nsp->net_ns);
> @@ -101,6 +110,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
>   return new_nsp;
>  
>  out_net:
> + if (new_nsp->cgroup_ns)
> + put_cgroup_ns(new_nsp->cgroup_ns);
> +out_cgroup:
>   if (new_nsp->pid_ns_for_children)
>   put_pid_ns(new_nsp->pid_ns_for_children);
>  out_pid:
> @@ -128,7 +140,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
>   struct nsproxy *new_ns;
>  
>   if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
> -      CLONE_NEWPID | CLONE_NEWNET)))) {
> +      CLONE_NEWPID | CLONE_NEWNET |
> +      CLONE_NEWCGROUP)))) {
>   get_nsproxy(old_ns);
>   return 0;
>   }
> @@ -165,6 +178,8 @@ void free_nsproxy(struct nsproxy *ns)
>   put_ipc_ns(ns->ipc_ns);
>   if (ns->pid_ns_for_children)
>   put_pid_ns(ns->pid_ns_for_children);
> + if (ns->cgroup_ns)
> + put_cgroup_ns(ns->cgroup_ns);
>   put_net(ns->net_ns);
>   kmem_cache_free(nsproxy_cachep, ns);
>  }
> @@ -180,7 +195,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
>   int err = 0;
>  
>   if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
> -       CLONE_NEWNET | CLONE_NEWPID)))
> +       CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
>   return 0;
>  
>   user_ns = new_cred ? new_cred->user_ns : current_user_ns();
> --
> 2.1.0.rc2.206.gedb03e5
>

Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

Serge E. Hallyn-3
In reply to this post by Aditya Kali
Quoting Aditya Kali ([hidden email]):
> setns on a cgroup namespace is allowed only if
> * task has CAP_SYS_ADMIN in its current user-namespace and
>   over the user-namespace associated with target cgroupns.
> * task's current cgroup is descendent of the target cgroupns-root
>   cgroup.

What is the point of this?

If I'm a user logged into
/lxc/c1/user.slice/user-1000.slice/session-c12.scope and I start
a container which is in
/lxc/c1/user.slice/user-1000.slice/session-c12.scope/x1
then I will want to be able to enter the container's cgroup.
The container's cgroup root is under my own (satisfying the
below condition) but my cgroup is not a descendent of the
container's cgroup.
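
The case above can be modeled with plain path strings. This is only a
sketch: path_is_descendant() and setns_allowed() are hypothetical
helpers that mimic the two cgroup_is_descendant() checks in the quoted
cgroupns_install(), and they assume a cgroup counts as a descendant of
itself.

```c
#include <stdbool.h>
#include <string.h>

/* True if @path is @ancestor itself or lies somewhere below it. */
static bool path_is_descendant(const char *path, const char *ancestor)
{
	size_t n = strlen(ancestor);

	if (strcmp(ancestor, "/") == 0)
		return path[0] == '/';
	if (strncmp(path, ancestor, n) != 0)
		return false;
	/* Reject sibling names that merely share a prefix, e.g. c1 vs c10. */
	return path[n] == '\0' || path[n] == '/';
}

/*
 * Model of the two cgroup checks in the quoted patch:
 *  1. the task's current cgroup must be under the target cgroupns-root;
 *  2. the target cgroupns-root must be under the task's own cgroupns-root.
 */
static bool setns_allowed(const char *task_cgrp, const char *task_ns_root,
			  const char *target_ns_root)
{
	return path_is_descendant(task_cgrp, target_ns_root) &&
	       path_is_descendant(target_ns_root, task_ns_root);
}
```

With task_cgrp set to the session-c12.scope path and target_ns_root set
to the x1 path under it, the first check fails, which is the objection
here. The same call passes once the task has first been moved into x1.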


> * target cgroupns-root is same as or deeper than task's current
>   cgroupns-root. This is so that the task cannot escape out of its
>   cgroupns-root. This also ensures that setns() only makes the task
>   get restricted to a deeper cgroup hierarchy.
>
> Signed-off-by: Aditya Kali <[hidden email]>
> ---
>  kernel/cgroup_namespace.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 42 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
> index c16604f..c612946 100644
> --- a/kernel/cgroup_namespace.c
> +++ b/kernel/cgroup_namespace.c
> @@ -80,8 +80,48 @@ err_out:
>  
>  static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
>  {
> -     pr_info("setns not supported for cgroup namespace");
> -     return -EINVAL;
> +     struct cgroup_namespace *cgroup_ns = ns;
> +     struct task_struct *task = current;
> +     struct cgroup *cgrp = NULL;
> +     int err = 0;
> +
> +     if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
> +         !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
> +             return -EPERM;
> +
> +     /* Prevent cgroup changes for this task. */
> +     threadgroup_lock(task);
> +
> +     cgrp = get_task_cgroup(task);
> +
> +     err = -EINVAL;
> +     if (!cgroup_on_dfl(cgrp))
> +             goto out_unlock;
> +
> +     /* Allow switch only if the task's current cgroup is descendant of the
> +      * target cgroup_ns->root_cgrp.
> +      */
> +     if (!cgroup_is_descendant(cgrp, cgroup_ns->root_cgrp))
> +             goto out_unlock;
> +
> +     /* Only allow setns to a cgroupns root-ed deeper than task's current
> +      * cgroupns-root. This will make sure that tasks cannot escape their
> +      * cgroupns by attaching to parent cgroupns.
> +      */
> +     if (!cgroup_is_descendant(cgroup_ns->root_cgrp,
> +                               task_cgroupns_root(task)))
> +             goto out_unlock;
> +
> +     err = 0;
> +     get_cgroup_ns(cgroup_ns);
> +     put_cgroup_ns(nsproxy->cgroup_ns);
> +     nsproxy->cgroup_ns = cgroup_ns;
> +
> +out_unlock:
> +     threadgroup_unlock(current);
> +     if (cgrp)
> +             cgroup_put(cgrp);
> +     return err;
>  }
>  
>  static void *cgroupns_get(struct task_struct *task)
> --
> 2.1.0.rc2.206.gedb03e5
>

Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

Andy Lutomirski
On Thu, Oct 16, 2014 at 2:12 PM, Serge E. Hallyn <[hidden email]> wrote:

> Quoting Aditya Kali ([hidden email]):
>> setns on a cgroup namespace is allowed only if
>> * task has CAP_SYS_ADMIN in its current user-namespace and
>>   over the user-namespace associated with target cgroupns.
>> * task's current cgroup is descendent of the target cgroupns-root
>>   cgroup.
>
> What is the point of this?
>
> If I'm a user logged into
> /lxc/c1/user.slice/user-1000.slice/session-c12.scope and I start
> a container which is in
> /lxc/c1/user.slice/user-1000.slice/session-c12.scope/x1
> then I will want to be able to enter the container's cgroup.
> The container's cgroup root is under my own (satisfying the
> condition below) but my cgroup is not a descendent of the
> container's cgroup.
>

Presumably you need to ask your friendly cgroup manager to stick you
in that cgroup first.  Or we need to generally allow tasks to move
themselves deeper in the hierarchy, but that seems like a big change.

--Andy




--
Andy Lutomirski
AMA Capital Management, LLC

Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

Aditya Kali
In reply to this post by Serge E. Hallyn-3
On Thu, Oct 16, 2014 at 2:12 PM, Serge E. Hallyn <[hidden email]> wrote:

> Quoting Aditya Kali ([hidden email]):
>> setns on a cgroup namespace is allowed only if
>> * task has CAP_SYS_ADMIN in its current user-namespace and
>>   over the user-namespace associated with target cgroupns.
>> * task's current cgroup is descendent of the target cgroupns-root
>>   cgroup.
>
> What is the point of this?
>
> If I'm a user logged into
> /lxc/c1/user.slice/user-1000.slice/session-c12.scope and I start
> a container which is in
> /lxc/c1/user.slice/user-1000.slice/session-c12.scope/x1
> then I will want to be able to enter the container's cgroup.
> The container's cgroup root is under my own (satisfying the
> condition below) but my cgroup is not a descendent of the
> container's cgroup.
>
This condition is there because we don't want to make implicit cgroup
changes when a process attaches to another cgroupns. cgroupns tries to
preserve the invariant that, at any point, your current cgroup is
always under the cgroupns-root of your cgroup namespace. In your
example, if we allowed a process in the "session-c12.scope" container to
attach to the cgroupns rooted at the "session-c12.scope/x1" container
(without implicitly moving its cgroup), this invariant would no longer
hold.
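
The invariant can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel implementation: cgroups are plain path strings, and struct model_task, under(), invariant_holds() and model_setns() are hypothetical names made up for this example. model_setns() changes only the namespace root and never the cgroup, matching the "no implicit cgroup changes" rule.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy model: a task is its cgroup path plus its cgroupns-root path. */
struct model_task {
	const char *cgrp;
	const char *ns_root;
};

/* True if `path` equals `root` or lies below it in the hierarchy. */
static bool under(const char *path, const char *root)
{
	size_t n;

	assert(path && root);
	n = strlen(root);
	if (n == 1 && root[0] == '/')
		return path[0] == '/';
	return strncmp(path, root, n) == 0 &&
	       (path[n] == '\0' || path[n] == '/');
}

/* The invariant: the task's cgroup is at or below its cgroupns-root. */
static bool invariant_holds(const struct model_task *t)
{
	return under(t->cgrp, t->ns_root);
}

/* Switch namespace roots without touching the task's cgroup.
 * Refuses (returns false) if the switch would break the invariant. */
static bool model_setns(struct model_task *t, const char *target_root)
{
	struct model_task after = { t->cgrp, target_root };

	if (!invariant_holds(&after))
		return false;
	*t = after;
	return true;
}
```

In this model, a task whose cgroup is session-c12.scope is refused when it tries to attach to a namespace rooted at session-c12.scope/x1, because its cgroup would end up outside the new root. If the task's cgroup is first changed to session-c12.scope/x1 (the explicit move by a cgroup manager suggested earlier in the thread), the same call succeeds and the invariant still holds.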




--
Aditya

Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

Serge E. Hallyn-3
Quoting Aditya Kali ([hidden email]):

> On Thu, Oct 16, 2014 at 2:12 PM, Serge E. Hallyn <[hidden email]> wrote:
> > Quoting Aditya Kali ([hidden email]):
> >> setns on a cgroup namespace is allowed only if
> >> * task has CAP_SYS_ADMIN in its current user-namespace and
> >>   over the user-namespace associated with target cgroupns.
> >> * task's current cgroup is descendent of the target cgroupns-root
> >>   cgroup.
> >
> > What is the point of this?
> >
> > If I'm a user logged into
> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope and I start
> > a container which is in
> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope/x1
> > then I will want to be able to enter the container's cgroup.
> > The container's cgroup root is under my own (satisfying the
> > condition below) but my cgroup is not a descendent of the
> > container's cgroup.
> >
> This condition is there because we don't want to make implicit cgroup
> changes when a process attaches to another cgroupns. cgroupns tries to
> preserve the invariant that, at any point, your current cgroup is
> always under the cgroupns-root of your cgroup namespace. In your
> example, if we allowed a process in the "session-c12.scope" container to
> attach to the cgroupns rooted at the "session-c12.scope/x1" container
> (without implicitly moving its cgroup), this invariant would no longer
> hold.

Oh, I see.  Guess that should be workable.  Thanks.

-serge