[PATCHv3 0/8] CGroup Namespaces

classic Classic list List threaded Threaded
46 messages Options
123
Reply | Threaded
Open this post in threaded view
|

[PATCHv3 0/8] CGroup Namespaces

Aditya Kali
Another spin for CGroup Namespaces feature.

Changes from V2:
1. Added documentation in Documentation/cgroups/namespace.txt
2. Fixed a bug that caused crash
3. Incorporated some other suggestions from last patchset:
   - removed use of threadgroup_lock() while creating new cgroupns
   - use task_lock() instead of rcu_read_lock() while accessing
     task->nsproxy
   - optimized setns() to own cgroupns
   - simplified code around sane-behavior mount option parsing
4. Restored ACKs from Serge Hallyn from v1 on few patches that have
   not changed since then.

Changes from V1:
1. No pinning of processes within cgroupns. Tasks can be freely moved
   across cgroups even outside of their cgroupns-root. Usual DAC/MAC policies
   apply as before.
2. Path in /proc/<pid>/cgroup is now always shown and is relative to
   cgroupns-root. So path can contain '/..' strings depending on cgroupns-root
   of the reader and cgroup of <pid>.
3. setns() does not require the process to first move under target
   cgroupns-root.

Changes form RFC (V0):
1. setns support for cgroupns
2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
   mounts the cgroup hierarcy with cgroupns-root as the filesystem root.
3. writes to cgroup files outside of cgroupns-root are not allowed
4. visibility of /proc/<pid>/cgroup is further restricted by not showing
   anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
   your cgroupns-root.

---
 Documentation/cgroups/namespace.txt | 147 +++++++++++++++++++++++++++
 fs/kernfs/dir.c                     | 195 ++++++++++++++++++++++++++++++++----
 fs/kernfs/mount.c                   |  48 +++++++++
 fs/proc/namespaces.c                |   1 +
 include/linux/cgroup.h              |  52 +++++++++-
 include/linux/cgroup_namespace.h    |  36 +++++++
 include/linux/kernfs.h              |   5 +
 include/linux/nsproxy.h             |   2 +
 include/linux/proc_ns.h             |   4 +
 include/uapi/linux/sched.h          |   3 +-
 kernel/Makefile                     |   2 +-
 kernel/cgroup.c                     | 106 +++++++++++++++-----
 kernel/cgroup_namespace.c           | 140 ++++++++++++++++++++++++++
 kernel/fork.c                       |   2 +-
 kernel/nsproxy.c                    |  19 +++-
 15 files changed, 711 insertions(+), 51 deletions(-)
 create mode 100644 Documentation/cgroups/namespace.txt
 create mode 100644 include/linux/cgroup_namespace.h
 create mode 100644 kernel/cgroup_namespace.c

[PATCHv3 1/8] kernfs: Add API to generate relative kernfs path
[PATCHv3 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup
[PATCHv3 3/8] cgroup: add function to get task's cgroup on default
[PATCHv3 4/8] cgroup: export cgroup_get() and cgroup_put()
[PATCHv3 5/8] cgroup: introduce cgroup namespaces
[PATCHv3 6/8] cgroup: cgroup namespace setns support
[PATCHv3 7/8] cgroup: mount cgroupns-root when inside non-init cgroupns
[PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

[PATCHv3 1/8] kernfs: Add API to generate relative kernfs path

Aditya Kali
The new function kernfs_path_from_node() generates and returns
kernfs path of a given kernfs_node relative to a given parent
kernfs_node.

Signed-off-by: Aditya Kali <[hidden email]>
---
 fs/kernfs/dir.c        | 195 +++++++++++++++++++++++++++++++++++++++++++------
 include/linux/kernfs.h |   3 +
 2 files changed, 177 insertions(+), 21 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 1c77193..cb225a7 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -44,28 +44,159 @@ static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
  return strlcpy(buf, kn->parent ? kn->name : "/", buflen);
 }
 
-static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
-      size_t buflen)
+/**
+ * kernfs_node_depth - compute depth of the kernfs node from root.
+ * The root node itself is considered to be at depth 0.
+ */
+static size_t kernfs_node_depth(struct kernfs_node *kn)
 {
- char *p = buf + buflen;
+ size_t depth = 0;
+
+ BUG_ON(!kn);
+ while (kn->parent) {
+ depth++;
+ kn = kn->parent;
+ }
+ return depth;
+}
+
+/**
+ * kernfs_path_from_node_locked - find a relative path from @kn_from to @kn_to
+ * @kn_from: reference node of the path
+ * @kn_to: kernfs node to which path is needed
+ * @buf: buffer to copy the path into
+ * @buflen: size of @buf
+ *
+ * We need to handle couple of scenarios here:
+ * [1] when @kn_from is an ancestor of @kn_to at some level
+ * kn_from: /n1/n2/n3
+ * kn_to:   /n1/n2/n3/n4/n5
+ * result:  /n4/n5
+ *
+ * [2] when @kn_from is on a different hierarchy and we need to find common
+ * ancestor between @kn_from and @kn_to.
+ * kn_from: /n1/n2/n3/n4
+ * kn_to:   /n1/n2/n5
+ * result:  /../../n5
+ * OR
+ * kn_from: /n1/n2/n3/n4/n5   [depth=5]
+ * kn_to:   /n1/n2/n3         [depth=3]
+ * result:  /../..
+ */
+static char * __must_check kernfs_path_from_node_locked(
+ struct kernfs_node *kn_from,
+ struct kernfs_node *kn_to,
+ char *buf,
+ size_t buflen)
+{
+ char *p = buf;
+ struct kernfs_node *kn;
+ size_t depth_from = 0, depth_to, d;
  int len;
 
- *--p = '\0';
+ /* We atleast need 2 bytes to write "/\0". */
+ BUG_ON(buflen < 2);
 
- do {
- len = strlen(kn->name);
- if (p - buf < len + 1) {
- buf[0] = '\0';
- p = NULL;
- break;
+ /* Short-circuit the easy case - kn_to is the root node. */
+ if ((kn_from == kn_to) || (!kn_from && !kn_to->parent)) {
+ *p = '/';
+ *(p + 1) = '\0';
+ return p;
+ }
+
+ /* We can find the relative path only if both the nodes belong to the
+ * same kernfs root.
+ */
+ if (kn_from) {
+ BUG_ON(kernfs_root(kn_from) != kernfs_root(kn_to));
+ depth_from = kernfs_node_depth(kn_from);
+ }
+
+ depth_to = kernfs_node_depth(kn_to);
+
+ /* We compose path from left to right. So first write out all possible
+ * "/.." strings needed to reach from 'kn_from' to the common ancestor.
+ */
+ if (kn_from) {
+ while (depth_from > depth_to) {
+ len = strlen("/..");
+ if ((buflen - (p - buf)) < len + 1) {
+ /* buffer not big enough. */
+ buf[0] = '\0';
+ return NULL;
+ }
+ memcpy(p, "/..", len);
+ p += len;
+ *p = '\0';
+ --depth_from;
+ kn_from = kn_from->parent;
  }
+
+ d = depth_to;
+ kn = kn_to;
+ while (depth_from < d) {
+ kn = kn->parent;
+ d--;
+ }
+
+ /* Now we have 'depth_from == depth_to' at this point. Add more
+ * "/.."s until we reach common ancestor. In the worst case,
+ * root node will be the common ancestor.
+ */
+ while (depth_from > 0) {
+ /* If we reached common ancestor, stop. */
+ if (kn_from == kn)
+ break;
+ len = strlen("/..");
+ if ((buflen - (p - buf)) < len + 1) {
+ /* buffer not big enough. */
+ buf[0] = '\0';
+ return NULL;
+ }
+ memcpy(p, "/..", len);
+ p += len;
+ *p = '\0';
+ --depth_from;
+ kn_from = kn_from->parent;
+ kn = kn->parent;
+ }
+ }
+
+ /* Figure out how many bytes we need to write the path.
+ */
+ d = depth_to;
+ kn = kn_to;
+ len = 0;
+ while (depth_from < d) {
+ /* Account for "/<name>". */
+ len += strlen(kn->name) + 1;
+ kn = kn->parent;
+ --d;
+ }
+
+ if ((buflen - (p - buf)) < len + 1) {
+ /* buffer not big enough. */
+ buf[0] = '\0';
+ return NULL;
+ }
+
+ /* We have enough space. Move 'p' ahead by computed length and start
+ * writing node names into buffer.
+ */
+ p += len;
+ *p = '\0';
+ d = depth_to;
+ kn = kn_to;
+ while (d > depth_from) {
+ len = strlen(kn->name);
  p -= len;
  memcpy(p, kn->name, len);
  *--p = '/';
  kn = kn->parent;
- } while (kn && kn->parent);
+ --d;
+ }
 
- return p;
+ return buf;
 }
 
 /**
@@ -92,26 +223,48 @@ int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen)
 }
 
 /**
- * kernfs_path - build full path of a given node
+ * kernfs_path_from_node - build path of node @kn relative to @kn_root.
+ * @kn_root: parent kernfs_node relative to which we need to build the path
  * @kn: kernfs_node of interest
- * @buf: buffer to copy @kn's name into
+ * @buf: buffer to copy @kn's path into
  * @buflen: size of @buf
  *
- * Builds and returns the full path of @kn in @buf of @buflen bytes.  The
- * path is built from the end of @buf so the returned pointer usually
- * doesn't match @buf.  If @buf isn't long enough, @buf is nul terminated
+ * Builds and returns @kn's path relative to @kn_root. @kn_root and @kn must
+ * be on the same kernfs-root. If @kn_root is not parent of @kn, then a relative
+ * path (which includes '..'s) as needed to reach from @kn_root to @kn is
+ * returned.
+ * The path may be built from the end of @buf so the returned pointer may not
+ * match @buf.  If @buf isn't long enough, @buf is nul terminated
  * and %NULL is returned.
  */
-char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
+char *kernfs_path_from_node(struct kernfs_node *kn_root, struct kernfs_node *kn,
+    char *buf, size_t buflen)
 {
  unsigned long flags;
  char *p;
 
  spin_lock_irqsave(&kernfs_rename_lock, flags);
- p = kernfs_path_locked(kn, buf, buflen);
+ p = kernfs_path_from_node_locked(kn_root, kn, buf, buflen);
  spin_unlock_irqrestore(&kernfs_rename_lock, flags);
  return p;
 }
+EXPORT_SYMBOL_GPL(kernfs_path_from_node);
+
+/**
+ * kernfs_path - build full path of a given node
+ * @kn: kernfs_node of interest
+ * @buf: buffer to copy @kn's name into
+ * @buflen: size of @buf
+ *
+ * Builds and returns the full path of @kn in @buf of @buflen bytes.  The
+ * path is built from the end of @buf so the returned pointer usually
+ * doesn't match @buf.  If @buf isn't long enough, @buf is nul terminated
+ * and %NULL is returned.
+ */
+char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
+{
+ return kernfs_path_from_node(NULL, kn, buf, buflen);
+}
 EXPORT_SYMBOL_GPL(kernfs_path);
 
 /**
@@ -145,8 +298,8 @@ void pr_cont_kernfs_path(struct kernfs_node *kn)
 
  spin_lock_irqsave(&kernfs_rename_lock, flags);
 
- p = kernfs_path_locked(kn, kernfs_pr_cont_buf,
-       sizeof(kernfs_pr_cont_buf));
+ p = kernfs_path_from_node_locked(NULL, kn, kernfs_pr_cont_buf,
+ sizeof(kernfs_pr_cont_buf));
  if (p)
  pr_cont("%s", p);
  else
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 30faf79..3c2be75 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -258,6 +258,9 @@ static inline bool kernfs_ns_enabled(struct kernfs_node *kn)
 }
 
 int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen);
+char * __must_check kernfs_path_from_node(struct kernfs_node *root_kn,
+  struct kernfs_node *kn, char *buf,
+  size_t buflen);
 char * __must_check kernfs_path(struct kernfs_node *kn, char *buf,
  size_t buflen);
 void pr_cont_kernfs_name(struct kernfs_node *kn);
--
2.2.0.rc0.207.ga3a616c

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

[PATCHv3 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace

Aditya Kali
In reply to this post by Aditya Kali
CLONE_NEWCGROUP will be used to create new cgroup namespace.

Acked-by: Serge Hallyn <[hidden email]>
Signed-off-by: Aditya Kali <[hidden email]>
---
 include/uapi/linux/sched.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 34f9d73..2f90d00 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -21,8 +21,7 @@
 #define CLONE_DETACHED 0x00400000 /* Unused, ignored */
 #define CLONE_UNTRACED 0x00800000 /* set if the tracing process can't force CLONE_PTRACE on this clone */
 #define CLONE_CHILD_SETTID 0x01000000 /* set the TID in the child */
-/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
-   and is now available for re-use. */
+#define CLONE_NEWCGROUP 0x02000000 /* New cgroup namespace */
 #define CLONE_NEWUTS 0x04000000 /* New utsname group? */
 #define CLONE_NEWIPC 0x08000000 /* New ipcs */
 #define CLONE_NEWUSER 0x10000000 /* New user namespace */
--
2.2.0.rc0.207.ga3a616c

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

[PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces

Aditya Kali
In reply to this post by Aditya Kali
Signed-off-by: Aditya Kali <[hidden email]>
---
 Documentation/cgroups/namespace.txt | 147 ++++++++++++++++++++++++++++++++++++
 1 file changed, 147 insertions(+)
 create mode 100644 Documentation/cgroups/namespace.txt

diff --git a/Documentation/cgroups/namespace.txt b/Documentation/cgroups/namespace.txt
new file mode 100644
index 0000000..6480379
--- /dev/null
+++ b/Documentation/cgroups/namespace.txt
@@ -0,0 +1,147 @@
+ CGroup Namespaces
+
+CGroup Namespace provides a mechanism to virtualize the view of the
+/proc/<pid>/cgroup file. The CLONE_NEWCGROUP clone-flag can be used with
+clone() and unshare() syscalls to create a new cgroup namespace.
+The process running inside the cgroup namespace will have its /proc/<pid>/cgroup
+output restricted to cgroupns-root. cgroupns-root is the cgroup of the process
+at the time of creation of the cgroup namespace.
+
+Prior to CGroup Namespace, the /proc/<pid>/cgroup file used to show complete
+path of the cgroup of a process. In a container setup (where a set of cgroups
+and namespaces are intended to isolate processes), the /proc/<pid>/cgroup file
+may leak potential system level information to the isolated processes.
+
+For Example:
+  $ cat /proc/self/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+The path '/batchjobs/container_id1' can generally be considered as system-data
+and its desirable to not expose it to the isolated process.
+
+CGroup Namespaces can be used to restrict visibility of this path.
+For Example:
+  # Before creating cgroup namespace
+  $ ls -l /proc/self/ns/cgroup
+  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
+  $ cat /proc/self/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+  # unshare(CLONE_NEWCGROUP) and exec /bin/bash
+  $ ~/unshare -c
+  [ns]$ ls -l /proc/self/ns/cgroup
+  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
+  # From within new cgroupns, process sees that its in the root cgroup
+  [ns]$ cat /proc/self/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
+
+  # From global cgroupns:
+  $ cat /proc/<pid>/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+  # Unshare cgroupns along with userns and mountns
+  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
+  # sets up uid/gid map and execs /bin/bash
+  $ ~/unshare -c -u -m
+  # Originally, we were in /batchjobs/container_id1 cgroup. Mount our own cgroup
+  # hierarchy.
+  [ns]$ mount -t cgroup cgroup /tmp/cgroup
+  [ns]$ ls -l /tmp/cgroup
+  total 0
+  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
+  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
+  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
+  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
+
+The cgroupns-root (/batchjobs/container_id1 in above example) becomes the
+filesystem root for the namespace specific cgroupfs mount.
+
+The virtualization of /proc/self/cgroup file combined with restricting
+the view of cgroup hierarchy by namespace-private cgroupfs mount
+should provide a completely isolated cgroup view inside the container.
+
+Note that CGroup Namespaces virtualizes the path on unified hierarchy only. If
+other hierarchies are mounted, /proc/<pid>/cgroup will continue to show the full
+cgroup path for those.
+
+In its current form, the cgroup namespaces patcheset provides following
+behavior:
+
+(1) The 'cgroupns-root' for a cgroup namespace is the cgroup in which
+    the process calling unshare is running.
+    For ex. if a process in /batchjobs/container_id1 cgroup calls unshare,
+    cgroup /batchjobs/container_id1 becomes the cgroupns-root.
+    For the init_cgroup_ns, this is the real root ('/') cgroup
+    (identified in code as cgrp_dfl_root.cgrp).
+
+(2) The cgroupns-root cgroup does not change even if the namespace
+    creator process later moves to a different cgroup.
+    $ ~/unshare -c # unshare cgroupns in some cgroup
+    [ns]$ cat /proc/self/cgroup
+    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
+    [ns]$ mkdir sub_cgrp_1
+    [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
+    [ns]$ cat /proc/self/cgroup
+    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
+
+(3) Each process gets its CGROUPNS specific view of /proc/<pid>/cgroup
+(a) Processes running inside the cgroup namespace will be able to see
+    cgroup paths (in /proc/self/cgroup) only inside their root cgroup
+    [ns]$ sleep 100000 &  # From within unshared cgroupns
+    [1] 7353
+    [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
+    [ns]$ cat /proc/7353/cgroup
+    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
+
+(b) From global cgroupns, the real cgroup path will be visible:
+    $ cat /proc/7353/cgroup
+    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1/sub_cgrp_1
+
+(c) From a sibling cgroupns (cgroupns root-ed at a different cgroup), cgroup
+    path relative to its own cgroupns-root will be shown:
+    # ns2's cgroupns-root is at '/batchjobs/container_id2'
+    [ns2]$ cat /proc/7353/cgroup
+    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../container_id2/sub_cgrp_1
+
+    Note that the relative path always starts with '/' to indicate that its
+    relative to the cgroupns-root of the caller.
+
+(4) Processes inside a cgroupns can move in-and-out of the cgroupns-root
+    (if they have proper access to external cgroups).
+    # From inside cgroupns (with cgroupns-root at /batchjobs/container_id1), and
+    # assuming that the global hierarchy is still accessible inside cgroupns:
+    $ cat /proc/7353/cgroup
+    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
+    $ echo 7353 > batchjobs/container_id2/cgroup.procs
+    $ cat /proc/7353/cgroup
+    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../container_id2
+
+    Note that this kind of setup is not encouraged. A task inside cgroupns
+    should only be exposed to its own cgroupns hierarchy. Otherwise it makes
+    the virtualization of /proc/<pid>/cgroup less useful.
+
+(5) Setns to another cgroup namespace is allowed when:
+    (a) the process has CAP_SYS_ADMIN in its current userns
+    (b) the process has CAP_SYS_ADMIN in the target cgroupns' userns
+    No implicit cgroup changes happen with attaching to another cgroupns. It
+    is expected that the somone moves the attaching process under the target
+    cgroupns-root.
+
+(6) When some thread from a multi-threaded process unshares its
+    cgroup-namespace, the new cgroupns gets applied to the entire
+    process (all the threads). This should be OK since
+    unified-hierarchy only allows process-level containerization. So
+    all the threads in the process will have the same cgroup.
+
+(7) The cgroup namespace is alive as long as there is atleast 1
+    process inside it. When the last process exits, the cgroup
+    namespace is destroyed. The cgroupns-root and the actual cgroups
+    remain though.
+
+(8) Namespace specific cgroup hierarchy can be mounted by a process running
+    inside cgroupns:
+    $ mount -t cgroup -o __DEVEL__sane_behavior cgroup $MOUNT_POINT
+
+    This will mount the unified cgroup hierarchy with cgroupns-root as the
+    filesystem root. The process needs CAP_SYS_ADMIN in its userns and mntns.
+
--
2.2.0.rc0.207.ga3a616c

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

[PATCHv3 5/8] cgroup: introduce cgroup namespaces

Aditya Kali
In reply to this post by Aditya Kali
Introduce the ability to create new cgroup namespace. The newly created
cgroup namespace remembers the cgroup of the process at the point
of creation of the cgroup namespace (referred as cgroupns-root).
The main purpose of cgroup namespace is to virtualize the contents
of /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root
(unless they are moved outside of their cgroupns-root, at which point
 they will see a relative path from their cgroupns-root).
For a correctly setup container this enables container-tools
(like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
containers without leaking system level cgroup hierarchy to the task.
This patch only implements the 'unshare' part of the cgroupns.

Signed-off-by: Aditya Kali <[hidden email]>
---
 fs/proc/namespaces.c             |   1 +
 include/linux/cgroup.h           |  29 ++++++++-
 include/linux/cgroup_namespace.h |  36 +++++++++++
 include/linux/nsproxy.h          |   2 +
 include/linux/proc_ns.h          |   4 ++
 kernel/Makefile                  |   2 +-
 kernel/cgroup.c                  |  13 ++++
 kernel/cgroup_namespace.c        | 127 +++++++++++++++++++++++++++++++++++++++
 kernel/fork.c                    |   2 +-
 kernel/nsproxy.c                 |  19 +++++-
 10 files changed, 230 insertions(+), 5 deletions(-)

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 8902609..55bc5da 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -32,6 +32,7 @@ static const struct proc_ns_operations *ns_entries[] = {
  &userns_operations,
 #endif
  &mntns_operations,
+ &cgroupns_operations,
 };
 
 static const struct file_operations ns_file_operations = {
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 6e7533b..94a5a0c 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -22,6 +22,8 @@
 #include <linux/seq_file.h>
 #include <linux/kernfs.h>
 #include <linux/wait.h>
+#include <linux/nsproxy.h>
+#include <linux/types.h>
 
 #ifdef CONFIG_CGROUPS
 
@@ -460,6 +462,13 @@ struct cftype {
 #endif
 };
 
+struct cgroup_namespace {
+ atomic_t count;
+ unsigned int proc_inum;
+ struct user_namespace *user_ns;
+ struct cgroup *root_cgrp;
+};
+
 extern struct cgroup_root cgrp_dfl_root;
 extern struct css_set init_css_set;
 
@@ -584,10 +593,28 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
  return kernfs_name(cgrp->kn, buf, buflen);
 }
 
+static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
+ struct cgroup *cgrp, char *buf,
+ size_t buflen)
+{
+ if (ns) {
+ BUG_ON(!cgroup_on_dfl(cgrp));
+ return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf,
+     buflen);
+ } else {
+ return kernfs_path(cgrp->kn, buf, buflen);
+ }
+}
+
 static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
       size_t buflen)
 {
- return kernfs_path(cgrp->kn, buf, buflen);
+ if (cgroup_on_dfl(cgrp)) {
+ return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf,
+      buflen);
+ } else {
+ return cgroup_path_ns(NULL, cgrp, buf, buflen);
+ }
 }
 
 static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
new file mode 100644
index 0000000..0b97b8d
--- /dev/null
+++ b/include/linux/cgroup_namespace.h
@@ -0,0 +1,36 @@
+#ifndef _LINUX_CGROUP_NAMESPACE_H
+#define _LINUX_CGROUP_NAMESPACE_H
+
+#include <linux/nsproxy.h>
+#include <linux/cgroup.h>
+#include <linux/types.h>
+#include <linux/user_namespace.h>
+
+extern struct cgroup_namespace init_cgroup_ns;
+
+static inline struct cgroup *current_cgroupns_root(void)
+{
+ return current->nsproxy->cgroup_ns->root_cgrp;
+}
+
+extern void free_cgroup_ns(struct cgroup_namespace *ns);
+
+static inline struct cgroup_namespace *get_cgroup_ns(
+ struct cgroup_namespace *ns)
+{
+ if (ns)
+ atomic_inc(&ns->count);
+ return ns;
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+ if (ns && atomic_dec_and_test(&ns->count))
+ free_cgroup_ns(ns);
+}
+
+extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+       struct user_namespace *user_ns,
+       struct cgroup_namespace *old_ns);
+
+#endif  /* _LINUX_CGROUP_NAMESPACE_H */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 35fa08f..ac0d65b 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -8,6 +8,7 @@ struct mnt_namespace;
 struct uts_namespace;
 struct ipc_namespace;
 struct pid_namespace;
+struct cgroup_namespace;
 struct fs_struct;
 
 /*
@@ -33,6 +34,7 @@ struct nsproxy {
  struct mnt_namespace *mnt_ns;
  struct pid_namespace *pid_ns_for_children;
  struct net     *net_ns;
+ struct cgroup_namespace *cgroup_ns;
 };
 extern struct nsproxy init_nsproxy;
 
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 34a1e10..e56dd73 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -6,6 +6,8 @@
 
 struct pid_namespace;
 struct nsproxy;
+struct task_struct;
+struct inode;
 
 struct proc_ns_operations {
  const char *name;
@@ -27,6 +29,7 @@ extern const struct proc_ns_operations ipcns_operations;
 extern const struct proc_ns_operations pidns_operations;
 extern const struct proc_ns_operations userns_operations;
 extern const struct proc_ns_operations mntns_operations;
+extern const struct proc_ns_operations cgroupns_operations;
 
 /*
  * We always define these enumerators
@@ -37,6 +40,7 @@ enum {
  PROC_UTS_INIT_INO = 0xEFFFFFFEU,
  PROC_USER_INIT_INO = 0xEFFFFFFDU,
  PROC_PID_INIT_INO = 0xEFFFFFFCU,
+ PROC_CGROUP_INIT_INO = 0xEFFFFFFBU,
 };
 
 #ifdef CONFIG_PROC_FS
diff --git a/kernel/Makefile b/kernel/Makefile
index dc5c775..d9731e2 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -50,7 +50,7 @@ obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
 obj-$(CONFIG_KEXEC) += kexec.o
 obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
 obj-$(CONFIG_COMPAT) += compat.o
-obj-$(CONFIG_CGROUPS) += cgroup.o
+obj-$(CONFIG_CGROUPS) += cgroup.o cgroup_namespace.o
 obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_UTS_NS) += utsname.o
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index e12d36e..b1ae6d9 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -57,6 +57,8 @@
 #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
 #include <linux/kthread.h>
 #include <linux/delay.h>
+#include <linux/proc_ns.h>
+#include <linux/cgroup_namespace.h>
 
 #include <linux/atomic.h>
 
@@ -195,6 +197,15 @@ static void kill_css(struct cgroup_subsys_state *css);
 static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
       bool is_add);
 
+struct cgroup_namespace init_cgroup_ns = {
+ .count = {
+ .counter = 1,
+ },
+ .proc_inum = PROC_CGROUP_INIT_INO,
+ .user_ns = &init_user_ns,
+ .root_cgrp = &cgrp_dfl_root.cgrp,
+};
+
 /* IDR wrappers which synchronize using cgroup_idr_lock */
 static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end,
     gfp_t gfp_mask)
@@ -4989,6 +5000,8 @@ int __init cgroup_init(void)
  unsigned long key;
  int ssid, err;
 
+ get_user_ns(init_cgroup_ns.user_ns);
+
  BUG_ON(cgroup_init_cftypes(NULL, cgroup_dfl_base_files));
  BUG_ON(cgroup_init_cftypes(NULL, cgroup_legacy_base_files));
 
diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
new file mode 100644
index 0000000..0e0ef3a
--- /dev/null
+++ b/kernel/cgroup_namespace.c
@@ -0,0 +1,127 @@
+/*
+ *  Copyright (C) 2014 Google Inc.
+ *
+ *  Author: Aditya Kali ([hidden email])
+ *
+ *  This program is free software; you can redistribute it and/or modify it
+ *  under the terms of the GNU General Public License as published by the Free
+ *  Software Foundation, version 2 of the License.
+ */
+
+#include <linux/cgroup.h>
+#include <linux/cgroup_namespace.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/nsproxy.h>
+#include <linux/proc_ns.h>
+
+static struct cgroup_namespace *alloc_cgroup_ns(void)
+{
+ struct cgroup_namespace *new_ns;
+
+ new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
+ if (new_ns)
+ atomic_set(&new_ns->count, 1);
+ return new_ns;
+}
+
+void free_cgroup_ns(struct cgroup_namespace *ns)
+{
+ cgroup_put(ns->root_cgrp);
+ put_user_ns(ns->user_ns);
+ proc_free_inum(ns->proc_inum);
+ kfree(ns);
+}
+EXPORT_SYMBOL(free_cgroup_ns);
+
+struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+ struct user_namespace *user_ns,
+ struct cgroup_namespace *old_ns)
+{
+ struct cgroup_namespace *new_ns = NULL;
+ struct cgroup *cgrp = NULL;
+ int err;
+
+ BUG_ON(!old_ns);
+
+ if (!(flags & CLONE_NEWCGROUP))
+ return get_cgroup_ns(old_ns);
+
+ /* Allow only sysadmin to create cgroup namespace. */
+ err = -EPERM;
+ if (!ns_capable(user_ns, CAP_SYS_ADMIN))
+ goto err_out;
+
+ /* CGROUPNS only virtualizes the cgroup path on the unified hierarchy.
+ */
+ cgrp = get_task_cgroup(current);
+
+ err = -ENOMEM;
+ new_ns = alloc_cgroup_ns();
+ if (!new_ns)
+ goto err_out;
+
+ err = proc_alloc_inum(&new_ns->proc_inum);
+ if (err)
+ goto err_out;
+
+ new_ns->user_ns = get_user_ns(user_ns);
+ new_ns->root_cgrp = cgrp;
+
+ return new_ns;
+
+err_out:
+ if (cgrp)
+ cgroup_put(cgrp);
+ kfree(new_ns);
+ return ERR_PTR(err);
+}
+
+static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
+{
+ pr_info("setns not supported for cgroup namespace");
+ return -EINVAL;
+}
+
+static void *cgroupns_get(struct task_struct *task)
+{
+ struct cgroup_namespace *ns = NULL;
+ struct nsproxy *nsproxy;
+
+ task_lock(task);
+ nsproxy = task->nsproxy;
+ if (nsproxy) {
+ ns = nsproxy->cgroup_ns;
+ get_cgroup_ns(ns);
+ }
+ task_unlock(task);
+
+ return ns;
+}
+
+static void cgroupns_put(void *ns)
+{
+ put_cgroup_ns(ns);
+}
+
+static unsigned int cgroupns_inum(void *ns)
+{
+ struct cgroup_namespace *cgroup_ns = ns;
+
+ return cgroup_ns->proc_inum;
+}
+
+const struct proc_ns_operations cgroupns_operations = {
+ .name = "cgroup",
+ .type = CLONE_NEWCGROUP,
+ .get = cgroupns_get,
+ .put = cgroupns_put,
+ .install = cgroupns_install,
+ .inum = cgroupns_inum,
+};
+
+static __init int cgroup_namespaces_init(void)
+{
+ return 0;
+}
+subsys_initcall(cgroup_namespaces_init);
diff --git a/kernel/fork.c b/kernel/fork.c
index 9b7d746..d22d793 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1797,7 +1797,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
  if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
  CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
  CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
- CLONE_NEWUSER|CLONE_NEWPID))
+ CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
  return -EINVAL;
  /*
  * Not implemented, but pretend it works if there is nothing to
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index ef42d0a..a8b1970 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -25,6 +25,7 @@
 #include <linux/proc_ns.h>
 #include <linux/file.h>
 #include <linux/syscalls.h>
+#include <linux/cgroup_namespace.h>
 
 static struct kmem_cache *nsproxy_cachep;
 
@@ -39,6 +40,7 @@ struct nsproxy init_nsproxy = {
 #ifdef CONFIG_NET
  .net_ns = &init_net,
 #endif
+ .cgroup_ns = &init_cgroup_ns,
 };
 
 static inline struct nsproxy *create_nsproxy(void)
@@ -92,6 +94,13 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
  goto out_pid;
  }
 
+ new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
+    tsk->nsproxy->cgroup_ns);
+ if (IS_ERR(new_nsp->cgroup_ns)) {
+ err = PTR_ERR(new_nsp->cgroup_ns);
+ goto out_cgroup;
+ }
+
  new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
  if (IS_ERR(new_nsp->net_ns)) {
  err = PTR_ERR(new_nsp->net_ns);
@@ -101,6 +110,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
  return new_nsp;
 
 out_net:
+ if (new_nsp->cgroup_ns)
+ put_cgroup_ns(new_nsp->cgroup_ns);
+out_cgroup:
  if (new_nsp->pid_ns_for_children)
  put_pid_ns(new_nsp->pid_ns_for_children);
 out_pid:
@@ -128,7 +140,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
  struct nsproxy *new_ns;
 
  if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
-      CLONE_NEWPID | CLONE_NEWNET)))) {
+      CLONE_NEWPID | CLONE_NEWNET |
+      CLONE_NEWCGROUP)))) {
  get_nsproxy(old_ns);
  return 0;
  }
@@ -165,6 +178,8 @@ void free_nsproxy(struct nsproxy *ns)
  put_ipc_ns(ns->ipc_ns);
  if (ns->pid_ns_for_children)
  put_pid_ns(ns->pid_ns_for_children);
+ if (ns->cgroup_ns)
+ put_cgroup_ns(ns->cgroup_ns);
  put_net(ns->net_ns);
  kmem_cache_free(nsproxy_cachep, ns);
 }
@@ -180,7 +195,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
  int err = 0;
 
  if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
-       CLONE_NEWNET | CLONE_NEWPID)))
+       CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
  return 0;
 
  user_ns = new_cred ? new_cred->user_ns : current_user_ns();
--
2.2.0.rc0.207.ga3a616c

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

[PATCHv3 7/8] cgroup: mount cgroupns-root when inside non-init cgroupns

Aditya Kali
In reply to this post by Aditya Kali
This patch enables cgroup mounting inside userns when a process
as appropriate privileges. The cgroup filesystem mounted is
rooted at the cgroupns-root. Thus, in a container-setup, only
the hierarchy under the cgroupns-root is exposed inside the container.
This allows container management tools to run inside the containers
without depending on any global state.
In order to support this, a new kernfs api is added to lookup the
dentry for the cgroupns-root.

Signed-off-by: Aditya Kali <[hidden email]>
---
 fs/kernfs/mount.c      | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/kernfs.h |  2 ++
 kernel/cgroup.c        | 46 +++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 95 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index f973ae9..efe5e15 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
  return NULL;
 }
 
+/**
+ * kernfs_obtain_root - get a dentry for the given kernfs_node
+ * @sb: the kernfs super_block
+ * @kn: kernfs_node for which a dentry is needed
+ *
+ * This can used used by callers which want to mount only a part of the kernfs
+ * as root of the filesystem.
+ */
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+  struct kernfs_node *kn)
+{
+ struct dentry *dentry;
+ struct inode *inode;
+
+ BUG_ON(sb->s_op != &kernfs_sops);
+
+ /* inode for the given kernfs_node should already exist. */
+ inode = ilookup(sb, kn->ino);
+ if (!inode) {
+ pr_debug("kernfs: could not get inode for '");
+ pr_cont_kernfs_path(kn);
+ pr_cont("'.\n");
+ return ERR_PTR(-EINVAL);
+ }
+
+ /* instantiate and link root dentry */
+ dentry = d_obtain_root(inode);
+ if (!dentry) {
+ pr_debug("kernfs: could not get dentry for '");
+ pr_cont_kernfs_path(kn);
+ pr_cont("'.\n");
+ return ERR_PTR(-ENOMEM);
+ }
+
+ /* If this is a new dentry, set it up. We need kernfs_mutex because this
+ * may be called by callers other than kernfs_fill_super. */
+ mutex_lock(&kernfs_mutex);
+ if (!dentry->d_fsdata) {
+ kernfs_get(kn);
+ dentry->d_fsdata = kn;
+ } else {
+ WARN_ON(dentry->d_fsdata != kn);
+ }
+ mutex_unlock(&kernfs_mutex);
+
+ return dentry;
+}
+
 static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
 {
  struct kernfs_super_info *info = kernfs_info(sb);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 3c2be75..b9538e0 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
 struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
 struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
 
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+  struct kernfs_node *kn);
 struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
        unsigned int flags, void *priv);
 void kernfs_destroy_root(struct kernfs_root *root);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index b1ae6d9..e779890 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1438,6 +1438,14 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
  return -ENOENT;
  }
 
+ /* If inside a non-init cgroup namespace, only allow default hierarchy
+ * to be mounted.
+ */
+ if ((current->nsproxy->cgroup_ns != &init_cgroup_ns) &&
+    !(opts->flags & CGRP_ROOT_SANE_BEHAVIOR)) {
+ return -EINVAL;
+ }
+
  if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
  pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
  if (nr_opts != 1) {
@@ -1630,6 +1638,15 @@ static void init_cgroup_root(struct cgroup_root *root,
  set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
 }
 
+struct dentry *cgroupns_get_root(struct super_block *sb,
+ struct cgroup_namespace *ns)
+{
+ struct dentry *nsdentry;
+
+ nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
+ return nsdentry;
+}
+
 static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
 {
  LIST_HEAD(tmp_links);
@@ -1734,6 +1751,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
  int ret;
  int i;
  bool new_sb;
+ struct cgroup_namespace *ns =
+ get_cgroup_ns(current->nsproxy->cgroup_ns);
+
+ /* Check if the caller has permission to mount. */
+ if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
+ put_cgroup_ns(ns);
+ return ERR_PTR(-EPERM);
+ }
 
  /*
  * The first time anyone tries to mount a cgroup, enable the list
@@ -1866,11 +1891,28 @@ out_free:
  kfree(opts.release_agent);
  kfree(opts.name);
 
- if (ret)
+ if (ret) {
+ put_cgroup_ns(ns);
  return ERR_PTR(ret);
+ }
 
  dentry = kernfs_mount(fs_type, flags, root->kf_root,
  CGROUP_SUPER_MAGIC, &new_sb);
+
+ if (!IS_ERR(dentry) && (root == &cgrp_dfl_root)) {
+ /* If this mount is for the default hierarchy in non-init cgroup
+ * namespace, then instead of root cgroup's dentry, we return
+ * the dentry corresponding to the cgroupns->root_cgrp.
+ */
+ if (ns != &init_cgroup_ns) {
+ struct dentry *nsdentry;
+
+ nsdentry = cgroupns_get_root(dentry->d_sb, ns);
+ dput(dentry);
+ dentry = nsdentry;
+ }
+ }
+
  if (IS_ERR(dentry) || !new_sb)
  cgroup_put(&root->cgrp);
 
@@ -1883,6 +1925,7 @@ out_free:
  deactivate_super(pinned_sb);
  }
 
+ put_cgroup_ns(ns);
  return dentry;
 }
 
@@ -1911,6 +1954,7 @@ static struct file_system_type cgroup_fs_type = {
  .name = "cgroup",
  .mount = cgroup_mount,
  .kill_sb = cgroup_kill_sb,
+ .fs_flags = FS_USERNS_MOUNT,
 };
 
 static struct kobject *cgroup_kobj;
--
2.2.0.rc0.207.ga3a616c

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

[PATCHv3 4/8] cgroup: export cgroup_get() and cgroup_put()

Aditya Kali
In reply to this post by Aditya Kali
move cgroup_get() and cgroup_put() into cgroup.h so that
they can be called from other places.

Acked-by: Serge Hallyn <[hidden email]>
Signed-off-by: Aditya Kali <[hidden email]>
---
 include/linux/cgroup.h | 22 ++++++++++++++++++++++
 kernel/cgroup.c        | 22 ----------------------
 2 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index d6930de..6e7533b 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -521,6 +521,28 @@ static inline bool cgroup_on_dfl(const struct cgroup *cgrp)
  return cgrp->root == &cgrp_dfl_root;
 }
 
+/* convenient tests for these bits */
+static inline bool cgroup_is_dead(const struct cgroup *cgrp)
+{
+ return !(cgrp->self.flags & CSS_ONLINE);
+}
+
+static inline void cgroup_get(struct cgroup *cgrp)
+{
+ WARN_ON_ONCE(cgroup_is_dead(cgrp));
+ css_get(&cgrp->self);
+}
+
+static inline bool cgroup_tryget(struct cgroup *cgrp)
+{
+ return css_tryget(&cgrp->self);
+}
+
+static inline void cgroup_put(struct cgroup *cgrp)
+{
+ css_put(&cgrp->self);
+}
+
 /* no synchronization, the result can only be used as a hint */
 static inline bool cgroup_has_tasks(struct cgroup *cgrp)
 {
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 5d8fc84..e12d36e 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -321,12 +321,6 @@ out_unlock:
  return css;
 }
 
-/* convenient tests for these bits */
-static inline bool cgroup_is_dead(const struct cgroup *cgrp)
-{
- return !(cgrp->self.flags & CSS_ONLINE);
-}
-
 struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
 {
  struct cgroup *cgrp = of->kn->parent->priv;
@@ -1039,22 +1033,6 @@ static umode_t cgroup_file_mode(const struct cftype *cft)
  return mode;
 }
 
-static void cgroup_get(struct cgroup *cgrp)
-{
- WARN_ON_ONCE(cgroup_is_dead(cgrp));
- css_get(&cgrp->self);
-}
-
-static bool cgroup_tryget(struct cgroup *cgrp)
-{
- return css_tryget(&cgrp->self);
-}
-
-static void cgroup_put(struct cgroup *cgrp)
-{
- css_put(&cgrp->self);
-}
-
 /**
  * cgroup_calc_child_subsys_mask - calculate child_subsys_mask
  * @cgrp: the target cgroup
--
2.2.0.rc0.207.ga3a616c

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

[PATCHv3 6/8] cgroup: cgroup namespace setns support

Aditya Kali
In reply to this post by Aditya Kali
setns on a cgroup namespace is allowed only if
task has CAP_SYS_ADMIN in its current user-namespace and
over the user-namespace associated with target cgroupns.
No implicit cgroup changes happen with attaching to another
cgroupns. It is expected that the somone moves the attaching
process under the target cgroupns-root.

Signed-off-by: Aditya Kali <[hidden email]>
---
 kernel/cgroup_namespace.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
index 0e0ef3a..ee0cc51 100644
--- a/kernel/cgroup_namespace.c
+++ b/kernel/cgroup_namespace.c
@@ -79,8 +79,21 @@ err_out:
 
 static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
 {
- pr_info("setns not supported for cgroup namespace");
- return -EINVAL;
+ struct cgroup_namespace *cgroup_ns = ns;
+
+ if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
+    !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
+ return -EPERM;
+
+ /* Don't need to do anything if we are attaching to our own cgroupns. */
+ if (cgroup_ns == nsproxy->cgroup_ns)
+ return 0;
+
+ get_cgroup_ns(cgroup_ns);
+ put_cgroup_ns(nsproxy->cgroup_ns);
+ nsproxy->cgroup_ns = cgroup_ns;
+
+ return 0;
 }
 
 static void *cgroupns_get(struct task_struct *task)
--
2.2.0.rc0.207.ga3a616c

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

[PATCHv3 3/8] cgroup: add function to get task's cgroup on default hierarchy

Aditya Kali
In reply to this post by Aditya Kali
get_task_cgroup() returns the (reference counted) cgroup of the
given task on the default hierarchy.

Acked-by: Serge Hallyn <[hidden email]>
Signed-off-by: Aditya Kali <[hidden email]>
---
 include/linux/cgroup.h |  1 +
 kernel/cgroup.c        | 25 +++++++++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 9fd99f5..d6930de 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -579,6 +579,7 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
 }
 
 char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen);
+struct cgroup *get_task_cgroup(struct task_struct *task);
 
 int cgroup_add_dfl_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
 int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index bb263d0..5d8fc84 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1966,6 +1966,31 @@ char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen)
 }
 EXPORT_SYMBOL_GPL(task_cgroup_path);
 
+/*
+ * get_task_cgroup - returns the cgroup of the task in the default cgroup
+ * hierarchy.
+ *
+ * @task: target task
+ * This function returns the @task's cgroup on the default cgroup hierarchy. The
+ * returned cgroup has its reference incremented (by calling cgroup_get()). So
+ * the caller must cgroup_put() the obtained reference once it is done with it.
+ */
+struct cgroup *get_task_cgroup(struct task_struct *task)
+{
+ struct cgroup *cgrp;
+
+ mutex_lock(&cgroup_mutex);
+ down_read(&css_set_rwsem);
+
+ cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
+ cgroup_get(cgrp);
+
+ up_read(&css_set_rwsem);
+ mutex_unlock(&cgroup_mutex);
+ return cgrp;
+}
+EXPORT_SYMBOL_GPL(get_task_cgroup);
+
 /* used to track tasks and other necessary states during migration */
 struct cgroup_taskset {
  /* the src and dst cset list running through cset->mg_node */
--
2.2.0.rc0.207.ga3a616c

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: [PATCHv3 0/8] CGroup Namespaces

Aditya Kali
In reply to this post by Aditya Kali
These patches are now also hosted on github at
https://github.com/adityakali/linux/tree/cgroupns_v3.

Thanks,

On Thu, Dec 4, 2014 at 5:55 PM, Aditya Kali <[hidden email]> wrote:

> Another spin for CGroup Namespaces feature.
>
> Changes from V2:
> 1. Added documentation in Documentation/cgroups/namespace.txt
> 2. Fixed a bug that caused crash
> 3. Incorporated some other suggestions from last patchset:
>    - removed use of threadgroup_lock() while creating new cgroupns
>    - use task_lock() instead of rcu_read_lock() while accessing
>      task->nsproxy
>    - optimized setns() to own cgroupns
>    - simplified code around sane-behavior mount option parsing
> 4. Restored ACKs from Serge Hallyn from v1 on few patches that have
>    not changed since then.
>
> Changes from V1:
> 1. No pinning of processes within cgroupns. Tasks can be freely moved
>    across cgroups even outside of their cgroupns-root. Usual DAC/MAC policies
>    apply as before.
> 2. Path in /proc/<pid>/cgroup is now always shown and is relative to
>    cgroupns-root. So path can contain '/..' strings depending on cgroupns-root
>    of the reader and cgroup of <pid>.
> 3. setns() does not require the process to first move under target
>    cgroupns-root.
>
> Changes form RFC (V0):
> 1. setns support for cgroupns
> 2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
>    mounts the cgroup hierarcy with cgroupns-root as the filesystem root.
> 3. writes to cgroup files outside of cgroupns-root are not allowed
> 4. visibility of /proc/<pid>/cgroup is further restricted by not showing
>    anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
>    your cgroupns-root.
>
> ---
>  Documentation/cgroups/namespace.txt | 147 +++++++++++++++++++++++++++
>  fs/kernfs/dir.c                     | 195 ++++++++++++++++++++++++++++++++----
>  fs/kernfs/mount.c                   |  48 +++++++++
>  fs/proc/namespaces.c                |   1 +
>  include/linux/cgroup.h              |  52 +++++++++-
>  include/linux/cgroup_namespace.h    |  36 +++++++
>  include/linux/kernfs.h              |   5 +
>  include/linux/nsproxy.h             |   2 +
>  include/linux/proc_ns.h             |   4 +
>  include/uapi/linux/sched.h          |   3 +-
>  kernel/Makefile                     |   2 +-
>  kernel/cgroup.c                     | 106 +++++++++++++++-----
>  kernel/cgroup_namespace.c           | 140 ++++++++++++++++++++++++++
>  kernel/fork.c                       |   2 +-
>  kernel/nsproxy.c                    |  19 +++-
>  15 files changed, 711 insertions(+), 51 deletions(-)
>  create mode 100644 Documentation/cgroups/namespace.txt
>  create mode 100644 include/linux/cgroup_namespace.h
>  create mode 100644 kernel/cgroup_namespace.c
>
> [PATCHv3 1/8] kernfs: Add API to generate relative kernfs path
> [PATCHv3 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup
> [PATCHv3 3/8] cgroup: add function to get task's cgroup on default
> [PATCHv3 4/8] cgroup: export cgroup_get() and cgroup_put()
> [PATCHv3 5/8] cgroup: introduce cgroup namespaces
> [PATCHv3 6/8] cgroup: cgroup namespace setns support
> [PATCHv3 7/8] cgroup: mount cgroupns-root when inside non-init cgroupns
> [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces



--
Aditya
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces

Zefan Li
In reply to this post by Aditya Kali
> +In its current form, the cgroup namespaces patcheset provides following
> +behavior:
> +
> +(1) The 'cgroupns-root' for a cgroup namespace is the cgroup in which
> +    the process calling unshare is running.
> +    For ex. if a process in /batchjobs/container_id1 cgroup calls unshare,
> +    cgroup /batchjobs/container_id1 becomes the cgroupns-root.
> +    For the init_cgroup_ns, this is the real root ('/') cgroup
> +    (identified in code as cgrp_dfl_root.cgrp).
> +
> +(2) The cgroupns-root cgroup does not change even if the namespace
> +    creator process later moves to a different cgroup.
> +    $ ~/unshare -c # unshare cgroupns in some cgroup
> +    [ns]$ cat /proc/self/cgroup
> +    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
> +    [ns]$ mkdir sub_cgrp_1
> +    [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
> +    [ns]$ cat /proc/self/cgroup
> +    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
> +
> +(3) Each process gets its CGROUPNS specific view of /proc/<pid>/cgroup
> +(a) Processes running inside the cgroup namespace will be able to see
> +    cgroup paths (in /proc/self/cgroup) only inside their root cgroup
> +    [ns]$ sleep 100000 &  # From within unshared cgroupns
> +    [1] 7353
> +    [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
> +    [ns]$ cat /proc/7353/cgroup
> +    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
> +
> +(b) From global cgroupns, the real cgroup path will be visible:
> +    $ cat /proc/7353/cgroup
> +    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1/sub_cgrp_1
> +
> +(c) From a sibling cgroupns (cgroupns root-ed at a different cgroup), cgroup
> +    path relative to its own cgroupns-root will be shown:
> +    # ns2's cgroupns-root is at '/batchjobs/container_id2'
> +    [ns2]$ cat /proc/7353/cgroup
> +    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../container_id2/sub_cgrp_1

Should be ../container_id1/sub_cgrp_1 ?

> +
> +    Note that the relative path always starts with '/' to indicate that its
> +    relative to the cgroupns-root of the caller.

If a path doesn't start with '/', then it's a relative path, so why make it start with '/'?

> +
> +(4) Processes inside a cgroupns can move in-and-out of the cgroupns-root
> +    (if they have proper access to external cgroups).
> +    # From inside cgroupns (with cgroupns-root at /batchjobs/container_id1), and
> +    # assuming that the global hierarchy is still accessible inside cgroupns:
> +    $ cat /proc/7353/cgroup
> +    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
> +    $ echo 7353 > batchjobs/container_id2/cgroup.procs
> +    $ cat /proc/7353/cgroup
> +    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../container_id2
> +
> +    Note that this kind of setup is not encouraged. A task inside cgroupns
> +    should only be exposed to its own cgroupns hierarchy. Otherwise it makes
> +    the virtualization of /proc/<pid>/cgroup less useful.
> +
> +(5) Setns to another cgroup namespace is allowed when:
> +    (a) the process has CAP_SYS_ADMIN in its current userns
> +    (b) the process has CAP_SYS_ADMIN in the target cgroupns' userns
> +    No implicit cgroup changes happen with attaching to another cgroupns. It
> +    is expected that the somone moves the attaching process under the target
> +    cgroupns-root.
> +

s/the somone/someone

> +(6) When some thread from a multi-threaded process unshares its
> +    cgroup-namespace, the new cgroupns gets applied to the entire
> +    process (all the threads). This should be OK since
> +    unified-hierarchy only allows process-level containerization. So
> +    all the threads in the process will have the same cgroup.
> +
> +(7) The cgroup namespace is alive as long as there is atleast 1

s/atelast/at least

> +    process inside it. When the last process exits, the cgroup
> +    namespace is destroyed. The cgroupns-root and the actual cgroups
> +    remain though.
> +
> +(8) Namespace specific cgroup hierarchy can be mounted by a process running
> +    inside cgroupns:
> +    $ mount -t cgroup -o __DEVEL__sane_behavior cgroup $MOUNT_POINT
> +
> +    This will mount the unified cgroup hierarchy with cgroupns-root as the
> +    filesystem root. The process needs CAP_SYS_ADMIN in its userns and mntns.
> +
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: [PATCHv3 5/8] cgroup: introduce cgroup namespaces

Zefan Li
In reply to this post by Aditya Kali
On 2014/12/5 9:55, Aditya Kali wrote:

> Introduce the ability to create new cgroup namespace. The newly created
> cgroup namespace remembers the cgroup of the process at the point
> of creation of the cgroup namespace (referred as cgroupns-root).
> The main purpose of cgroup namespace is to virtualize the contents
> of /proc/self/cgroup file. Processes inside a cgroup namespace
> are only able to see paths relative to their namespace root
> (unless they are moved outside of their cgroupns-root, at which point
>  they will see a relative path from their cgroupns-root).
> For a correctly setup container this enables container-tools
> (like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
> containers without leaking system level cgroup hierarchy to the task.
> This patch only implements the 'unshare' part of the cgroupns.
>
> Signed-off-by: Aditya Kali <[hidden email]>
> ---
>  fs/proc/namespaces.c             |   1 +
>  include/linux/cgroup.h           |  29 ++++++++-
>  include/linux/cgroup_namespace.h |  36 +++++++++++
>  include/linux/nsproxy.h          |   2 +
>  include/linux/proc_ns.h          |   4 ++
>  kernel/Makefile                  |   2 +-
>  kernel/cgroup.c                  |  13 ++++
>  kernel/cgroup_namespace.c        | 127 +++++++++++++++++++++++++++++++++++++++
>  kernel/fork.c                    |   2 +-
>  kernel/nsproxy.c                 |  19 +++++-
>  10 files changed, 230 insertions(+), 5 deletions(-)
>
> diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
> index 8902609..55bc5da 100644
> --- a/fs/proc/namespaces.c
> +++ b/fs/proc/namespaces.c
> @@ -32,6 +32,7 @@ static const struct proc_ns_operations *ns_entries[] = {
>   &userns_operations,
>  #endif
>   &mntns_operations,
> + &cgroupns_operations,

Should be guarded with CONFIG_CGROUPS ?

There are other changes that break compile if !CONFIG_CGROUPS.

>  };
>  
>  static const struct file_operations ns_file_operations = {
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 6e7533b..94a5a0c 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -22,6 +22,8 @@
>  #include <linux/seq_file.h>
>  #include <linux/kernfs.h>
>  #include <linux/wait.h>
> +#include <linux/nsproxy.h>
> +#include <linux/types.h>
>  
>  #ifdef CONFIG_CGROUPS
>  
> @@ -460,6 +462,13 @@ struct cftype {
>  #endif
>  };
>  
> +struct cgroup_namespace {
> + atomic_t count;
> + unsigned int proc_inum;
> + struct user_namespace *user_ns;
> + struct cgroup *root_cgrp;
> +};
> +
>  extern struct cgroup_root cgrp_dfl_root;
>  extern struct css_set init_css_set;
>  
> @@ -584,10 +593,28 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
>   return kernfs_name(cgrp->kn, buf, buflen);
>  }
>  
> +static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
> + struct cgroup *cgrp, char *buf,
> + size_t buflen)
> +{
> + if (ns) {
> + BUG_ON(!cgroup_on_dfl(cgrp));
> + return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf,
> +     buflen);
> + } else {
> + return kernfs_path(cgrp->kn, buf, buflen);
> + }
> +}
> +
>  static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
>        size_t buflen)
>  {
> - return kernfs_path(cgrp->kn, buf, buflen);
> + if (cgroup_on_dfl(cgrp)) {
> + return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf,
> +      buflen);
> + } else {
> + return cgroup_path_ns(NULL, cgrp, buf, buflen);
> + }
>  }
>  
>  static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
> diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
> new file mode 100644
> index 0000000..0b97b8d
> --- /dev/null
> +++ b/include/linux/cgroup_namespace.h
> @@ -0,0 +1,36 @@
> +#ifndef _LINUX_CGROUP_NAMESPACE_H
> +#define _LINUX_CGROUP_NAMESPACE_H
> +
> +#include <linux/nsproxy.h>
> +#include <linux/cgroup.h>
> +#include <linux/types.h>
> +#include <linux/user_namespace.h>
> +
> +extern struct cgroup_namespace init_cgroup_ns;
> +
> +static inline struct cgroup *current_cgroupns_root(void)
> +{
> + return current->nsproxy->cgroup_ns->root_cgrp;
> +}
> +
> +extern void free_cgroup_ns(struct cgroup_namespace *ns);
> +
> +static inline struct cgroup_namespace *get_cgroup_ns(
> + struct cgroup_namespace *ns)
> +{
> + if (ns)
> + atomic_inc(&ns->count);
> + return ns;
> +}
> +
> +static inline void put_cgroup_ns(struct cgroup_namespace *ns)
> +{
> + if (ns && atomic_dec_and_test(&ns->count))
> + free_cgroup_ns(ns);
> +}
> +
> +extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
> +       struct user_namespace *user_ns,
> +       struct cgroup_namespace *old_ns);
> +
> +#endif  /* _LINUX_CGROUP_NAMESPACE_H */
> diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
> index 35fa08f..ac0d65b 100644
> --- a/include/linux/nsproxy.h
> +++ b/include/linux/nsproxy.h
> @@ -8,6 +8,7 @@ struct mnt_namespace;
>  struct uts_namespace;
>  struct ipc_namespace;
>  struct pid_namespace;
> +struct cgroup_namespace;
>  struct fs_struct;
>  
>  /*
> @@ -33,6 +34,7 @@ struct nsproxy {
>   struct mnt_namespace *mnt_ns;
>   struct pid_namespace *pid_ns_for_children;
>   struct net     *net_ns;
> + struct cgroup_namespace *cgroup_ns;
>  };
>  extern struct nsproxy init_nsproxy;
>  
> diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
> index 34a1e10..e56dd73 100644
> --- a/include/linux/proc_ns.h
> +++ b/include/linux/proc_ns.h
> @@ -6,6 +6,8 @@
>  
>  struct pid_namespace;
>  struct nsproxy;
> +struct task_struct;
> +struct inode;

These two lines seems unnecessary.

>  
>  struct proc_ns_operations {
>   const char *name;
> @@ -27,6 +29,7 @@ extern const struct proc_ns_operations ipcns_operations;
>  extern const struct proc_ns_operations pidns_operations;
>  extern const struct proc_ns_operations userns_operations;
>  extern const struct proc_ns_operations mntns_operations;
> +extern const struct proc_ns_operations cgroupns_operations;
>  
>  /*
>   * We always define these enumerators
> @@ -37,6 +40,7 @@ enum {
>   PROC_UTS_INIT_INO = 0xEFFFFFFEU,
>   PROC_USER_INIT_INO = 0xEFFFFFFDU,
>   PROC_PID_INIT_INO = 0xEFFFFFFCU,
> + PROC_CGROUP_INIT_INO = 0xEFFFFFFBU,
>  };
>  
>  #ifdef CONFIG_PROC_FS
> diff --git a/kernel/Makefile b/kernel/Makefile
> index dc5c775..d9731e2 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -50,7 +50,7 @@ obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
>  obj-$(CONFIG_KEXEC) += kexec.o
>  obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
>  obj-$(CONFIG_COMPAT) += compat.o
> -obj-$(CONFIG_CGROUPS) += cgroup.o
> +obj-$(CONFIG_CGROUPS) += cgroup.o cgroup_namespace.o
>  obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
>  obj-$(CONFIG_CPUSETS) += cpuset.o
>  obj-$(CONFIG_UTS_NS) += utsname.o
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index e12d36e..b1ae6d9 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -57,6 +57,8 @@
>  #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
>  #include <linux/kthread.h>
>  #include <linux/delay.h>
> +#include <linux/proc_ns.h>
> +#include <linux/cgroup_namespace.h>
>  
>  #include <linux/atomic.h>
>  
> @@ -195,6 +197,15 @@ static void kill_css(struct cgroup_subsys_state *css);
>  static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
>        bool is_add);
>  
> +struct cgroup_namespace init_cgroup_ns = {
> + .count = {
> + .counter = 1,
> + },

.count = ATOMIC_INIT(1)

> + .proc_inum = PROC_CGROUP_INIT_INO,
> + .user_ns = &init_user_ns,
> + .root_cgrp = &cgrp_dfl_root.cgrp,
> +};
> +
>  /* IDR wrappers which synchronize using cgroup_idr_lock */
>  static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end,
>      gfp_t gfp_mask)
> @@ -4989,6 +5000,8 @@ int __init cgroup_init(void)
>   unsigned long key;
>   int ssid, err;
>  
> + get_user_ns(init_cgroup_ns.user_ns);
> +
>   BUG_ON(cgroup_init_cftypes(NULL, cgroup_dfl_base_files));
>   BUG_ON(cgroup_init_cftypes(NULL, cgroup_legacy_base_files));
>  
> diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
> new file mode 100644
> index 0000000..0e0ef3a
> --- /dev/null
> +++ b/kernel/cgroup_namespace.c
> @@ -0,0 +1,127 @@
> +/*
> + *  Copyright (C) 2014 Google Inc.
> + *
> + *  Author: Aditya Kali ([hidden email])
> + *
> + *  This program is free software; you can redistribute it and/or modify it
> + *  under the terms of the GNU General Public License as published by the Free
> + *  Software Foundation, version 2 of the License.
> + */
> +
> +#include <linux/cgroup.h>
> +#include <linux/cgroup_namespace.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/nsproxy.h>
> +#include <linux/proc_ns.h>
> +
> +static struct cgroup_namespace *alloc_cgroup_ns(void)
> +{
> + struct cgroup_namespace *new_ns;
> +
> + new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
> + if (new_ns)
> + atomic_set(&new_ns->count, 1);
> + return new_ns;
> +}

Better fold this function into copy_cgroup_ns().

> +
> +void free_cgroup_ns(struct cgroup_namespace *ns)
> +{
> + cgroup_put(ns->root_cgrp);
> + put_user_ns(ns->user_ns);
> + proc_free_inum(ns->proc_inum);
> + kfree(ns);
> +}
> +EXPORT_SYMBOL(free_cgroup_ns);

This should be a static inline function.

> +
> +struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
> + struct user_namespace *user_ns,
> + struct cgroup_namespace *old_ns)
> +{
> + struct cgroup_namespace *new_ns = NULL;
> + struct cgroup *cgrp = NULL;
> + int err;
> +
> + BUG_ON(!old_ns);
> +
> + if (!(flags & CLONE_NEWCGROUP))
> + return get_cgroup_ns(old_ns);
> +
> + /* Allow only sysadmin to create cgroup namespace. */
> + err = -EPERM;
> + if (!ns_capable(user_ns, CAP_SYS_ADMIN))
> + goto err_out;
> +
> + /* CGROUPNS only virtualizes the cgroup path on the unified hierarchy.
> + */

The comment style should be

/*
 * ...
 */

> + cgrp = get_task_cgroup(current);
> +
> + err = -ENOMEM;
> + new_ns = alloc_cgroup_ns();
> + if (!new_ns)
> + goto err_out;
> +
> + err = proc_alloc_inum(&new_ns->proc_inum);
> + if (err)
> + goto err_out;
> +
> + new_ns->user_ns = get_user_ns(user_ns);
> + new_ns->root_cgrp = cgrp;
> +
> + return new_ns;
> +
> +err_out:
> + if (cgrp)
> + cgroup_put(cgrp);
> + kfree(new_ns);
> + return ERR_PTR(err);
> +}
> +
> +static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
> +{
> + pr_info("setns not supported for cgroup namespace");
> + return -EINVAL;
> +}
> +
> +static void *cgroupns_get(struct task_struct *task)
> +{
> + struct cgroup_namespace *ns = NULL;
> + struct nsproxy *nsproxy;
> +
> + task_lock(task);
> + nsproxy = task->nsproxy;
> + if (nsproxy) {
> + ns = nsproxy->cgroup_ns;
> + get_cgroup_ns(ns);
> + }
> + task_unlock(task);
> +
> + return ns;
> +}
> +
> +static void cgroupns_put(void *ns)
> +{
> + put_cgroup_ns(ns);
> +}
> +
> +static unsigned int cgroupns_inum(void *ns)
> +{
> + struct cgroup_namespace *cgroup_ns = ns;
> +
> + return cgroup_ns->proc_inum;
> +}
> +
> +const struct proc_ns_operations cgroupns_operations = {
> + .name = "cgroup",
> + .type = CLONE_NEWCGROUP,
> + .get = cgroupns_get,
> + .put = cgroupns_put,
> + .install = cgroupns_install,
> + .inum = cgroupns_inum,
> +};
> +
> +static __init int cgroup_namespaces_init(void)
> +{
> + return 0;
> +}
> +subsys_initcall(cgroup_namespaces_init);

Why provide this unused init function?

> diff --git a/kernel/fork.c b/kernel/fork.c
> index 9b7d746..d22d793 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1797,7 +1797,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
>   if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
>   CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
>   CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
> - CLONE_NEWUSER|CLONE_NEWPID))
> + CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
>   return -EINVAL;
>   /*
>   * Not implemented, but pretend it works if there is nothing to
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index ef42d0a..a8b1970 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -25,6 +25,7 @@
>  #include <linux/proc_ns.h>
>  #include <linux/file.h>
>  #include <linux/syscalls.h>
> +#include <linux/cgroup_namespace.h>
>  
>  static struct kmem_cache *nsproxy_cachep;
>  
> @@ -39,6 +40,7 @@ struct nsproxy init_nsproxy = {
>  #ifdef CONFIG_NET
>   .net_ns = &init_net,
>  #endif
> + .cgroup_ns = &init_cgroup_ns,
>  };
>  
>  static inline struct nsproxy *create_nsproxy(void)
> @@ -92,6 +94,13 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
>   goto out_pid;
>   }
>  
> + new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
> +    tsk->nsproxy->cgroup_ns);
> + if (IS_ERR(new_nsp->cgroup_ns)) {
> + err = PTR_ERR(new_nsp->cgroup_ns);
> + goto out_cgroup;
> + }
> +
>   new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
>   if (IS_ERR(new_nsp->net_ns)) {
>   err = PTR_ERR(new_nsp->net_ns);
> @@ -101,6 +110,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
>   return new_nsp;
>  
>  out_net:
> + if (new_nsp->cgroup_ns)
> + put_cgroup_ns(new_nsp->cgroup_ns);
> +out_cgroup:
>   if (new_nsp->pid_ns_for_children)
>   put_pid_ns(new_nsp->pid_ns_for_children);
>  out_pid:
> @@ -128,7 +140,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
>   struct nsproxy *new_ns;
>  
>   if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
> -      CLONE_NEWPID | CLONE_NEWNET)))) {
> +      CLONE_NEWPID | CLONE_NEWNET |
> +      CLONE_NEWCGROUP)))) {
>   get_nsproxy(old_ns);
>   return 0;
>   }
> @@ -165,6 +178,8 @@ void free_nsproxy(struct nsproxy *ns)
>   put_ipc_ns(ns->ipc_ns);
>   if (ns->pid_ns_for_children)
>   put_pid_ns(ns->pid_ns_for_children);
> + if (ns->cgroup_ns)
> + put_cgroup_ns(ns->cgroup_ns);
>   put_net(ns->net_ns);
>   kmem_cache_free(nsproxy_cachep, ns);
>  }
> @@ -180,7 +195,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
>   int err = 0;
>  
>   if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
> -       CLONE_NEWNET | CLONE_NEWPID)))
> +       CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
>   return 0;
>  
>   user_ns = new_cred ? new_cred->user_ns : current_user_ns();
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: [PATCHv3 7/8] cgroup: mount cgroupns-root when inside non-init cgroupns

Zefan Li
In reply to this post by Aditya Kali
On 2014/12/5 9:55, Aditya Kali wrote:
> This patch enables cgroup mounting inside userns when a process
> as appropriate privileges. The cgroup filesystem mounted is

s/as/has

> rooted at the cgroupns-root. Thus, in a container-setup, only
> the hierarchy under the cgroupns-root is exposed inside the container.
> This allows container management tools to run inside the containers
> without depending on any global state.
> In order to support this, a new kernfs api is added to lookup the
> dentry for the cgroupns-root.
>
> Signed-off-by: Aditya Kali <[hidden email]>
> ---
>  fs/kernfs/mount.c      | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/kernfs.h |  2 ++
>  kernel/cgroup.c        | 46 +++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 95 insertions(+), 1 deletion(-)
>
> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> index f973ae9..efe5e15 100644
> --- a/fs/kernfs/mount.c
> +++ b/fs/kernfs/mount.c
> @@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
>   return NULL;
>  }
>  
> +/**
> + * kernfs_obtain_root - get a dentry for the given kernfs_node
> + * @sb: the kernfs super_block
> + * @kn: kernfs_node for which a dentry is needed
> + *
> + * This can used used by callers which want to mount only a part of the kernfs

s/used used/be used/

s/which/who

> + * as root of the filesystem.
> + */
> +struct dentry *kernfs_obtain_root(struct super_block *sb,
> +  struct kernfs_node *kn)
> +{
> + struct dentry *dentry;
> + struct inode *inode;
> +
> + BUG_ON(sb->s_op != &kernfs_sops);
> +
> + /* inode for the given kernfs_node should already exist. */
> + inode = ilookup(sb, kn->ino);
> + if (!inode) {
> + pr_debug("kernfs: could not get inode for '");
> + pr_cont_kernfs_path(kn);
> + pr_cont("'.\n");
> + return ERR_PTR(-EINVAL);
> + }
> +
> + /* instantiate and link root dentry */
> + dentry = d_obtain_root(inode);
> + if (!dentry) {
> + pr_debug("kernfs: could not get dentry for '");
> + pr_cont_kernfs_path(kn);
> + pr_cont("'.\n");
> + return ERR_PTR(-ENOMEM);
> + }
> +
> + /* If this is a new dentry, set it up. We need kernfs_mutex because this
> + * may be called by callers other than kernfs_fill_super. */

/*
 * ...
 */

> + mutex_lock(&kernfs_mutex);
> + if (!dentry->d_fsdata) {
> + kernfs_get(kn);
> + dentry->d_fsdata = kn;
> + } else {
> + WARN_ON(dentry->d_fsdata != kn);
> + }
> + mutex_unlock(&kernfs_mutex);
> +
> + return dentry;
> +}

Seperate this as a standalone patch?

> +
>  static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
>  {
>   struct kernfs_super_info *info = kernfs_info(sb);
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index 3c2be75..b9538e0 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
>  struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
>  struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
>  
> +struct dentry *kernfs_obtain_root(struct super_block *sb,
> +  struct kernfs_node *kn);
>  struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
>         unsigned int flags, void *priv);
>  void kernfs_destroy_root(struct kernfs_root *root);
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index b1ae6d9..e779890 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1438,6 +1438,14 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>   return -ENOENT;
>   }
>  
> + /* If inside a non-init cgroup namespace, only allow default hierarchy
> + * to be mounted.
> + */

/*
 * ...
 */

> + if ((current->nsproxy->cgroup_ns != &init_cgroup_ns) &&
> +    !(opts->flags & CGRP_ROOT_SANE_BEHAVIOR)) {
> + return -EINVAL;
> + }
> +
>   if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>   pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
>   if (nr_opts != 1) {
> @@ -1630,6 +1638,15 @@ static void init_cgroup_root(struct cgroup_root *root,
>   set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
>  }
>  
> +struct dentry *cgroupns_get_root(struct super_block *sb,
> + struct cgroup_namespace *ns)
> +{
> + struct dentry *nsdentry;
> +
> + nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
> + return nsdentry;
> +}
> +
>  static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
>  {
>   LIST_HEAD(tmp_links);
> @@ -1734,6 +1751,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
>   int ret;
>   int i;
>   bool new_sb;
> + struct cgroup_namespace *ns =
> + get_cgroup_ns(current->nsproxy->cgroup_ns);
> +
> + /* Check if the caller has permission to mount. */
> + if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
> + put_cgroup_ns(ns);
> + return ERR_PTR(-EPERM);
> + }
>  
>   /*
>   * The first time anyone tries to mount a cgroup, enable the list
> @@ -1866,11 +1891,28 @@ out_free:
>   kfree(opts.release_agent);
>   kfree(opts.name);
>  
> - if (ret)
> + if (ret) {
> + put_cgroup_ns(ns);
>   return ERR_PTR(ret);
> + }
>  
>   dentry = kernfs_mount(fs_type, flags, root->kf_root,
>   CGROUP_SUPER_MAGIC, &new_sb);
> +
> + if (!IS_ERR(dentry) && (root == &cgrp_dfl_root)) {
> + /* If this mount is for the default hierarchy in non-init cgroup
> + * namespace, then instead of root cgroup's dentry, we return
> + * the dentry corresponding to the cgroupns->root_cgrp.
> + */
> + if (ns != &init_cgroup_ns) {
> + struct dentry *nsdentry;
> +
> + nsdentry = cgroupns_get_root(dentry->d_sb, ns);
> + dput(dentry);
> + dentry = nsdentry;
> + }
> + }
> +
>   if (IS_ERR(dentry) || !new_sb)
>   cgroup_put(&root->cgrp);
>  
> @@ -1883,6 +1925,7 @@ out_free:
>   deactivate_super(pinned_sb);
>   }
>  
> + put_cgroup_ns(ns);
>   return dentry;
>  }
>  
> @@ -1911,6 +1954,7 @@ static struct file_system_type cgroup_fs_type = {
>   .name = "cgroup",
>   .mount = cgroup_mount,
>   .kill_sb = cgroup_kill_sb,
> + .fs_flags = FS_USERNS_MOUNT,
>  };
>  
>  static struct kobject *cgroup_kobj;
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces

Richard Weinberger
In reply to this post by Aditya Kali
Aditya,

I gave your patch set a try but it does not work for me.
Maybe you can bring some light into the issues I'm facing.
Sadly I still had no time to dig into your code.

Am 05.12.2014 um 02:55 schrieb Aditya Kali:

> Signed-off-by: Aditya Kali <[hidden email]>
> ---
>  Documentation/cgroups/namespace.txt | 147 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 147 insertions(+)
>  create mode 100644 Documentation/cgroups/namespace.txt
>
> diff --git a/Documentation/cgroups/namespace.txt b/Documentation/cgroups/namespace.txt
> new file mode 100644
> index 0000000..6480379
> --- /dev/null
> +++ b/Documentation/cgroups/namespace.txt
> @@ -0,0 +1,147 @@
> + CGroup Namespaces
> +
> +CGroup Namespace provides a mechanism to virtualize the view of the
> +/proc/<pid>/cgroup file. The CLONE_NEWCGROUP clone-flag can be used with
> +clone() and unshare() syscalls to create a new cgroup namespace.
> +The process running inside the cgroup namespace will have its /proc/<pid>/cgroup
> +output restricted to cgroupns-root. cgroupns-root is the cgroup of the process
> +at the time of creation of the cgroup namespace.
> +
> +Prior to CGroup Namespace, the /proc/<pid>/cgroup file used to show complete
> +path of the cgroup of a process. In a container setup (where a set of cgroups
> +and namespaces are intended to isolate processes), the /proc/<pid>/cgroup file
> +may leak potential system level information to the isolated processes.
> +
> +For Example:
> +  $ cat /proc/self/cgroup
> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
> +
> +The path '/batchjobs/container_id1' can generally be considered as system-data
> +and its desirable to not expose it to the isolated process.
> +
> +CGroup Namespaces can be used to restrict visibility of this path.
> +For Example:
> +  # Before creating cgroup namespace
> +  $ ls -l /proc/self/ns/cgroup
> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
> +  $ cat /proc/self/cgroup
> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
> +
> +  # unshare(CLONE_NEWCGROUP) and exec /bin/bash
> +  $ ~/unshare -c
> +  [ns]$ ls -l /proc/self/ns/cgroup
> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
> +  # From within new cgroupns, process sees that its in the root cgroup
> +  [ns]$ cat /proc/self/cgroup
> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
> +
> +  # From global cgroupns:
> +  $ cat /proc/<pid>/cgroup
> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
> +
> +  # Unshare cgroupns along with userns and mountns
> +  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
> +  # sets up uid/gid map and execs /bin/bash
> +  $ ~/unshare -c -u -m

This command does not issue CLONE_NEWUSER, -U does.

> +  # Originally, we were in /batchjobs/container_id1 cgroup. Mount our own cgroup
> +  # hierarchy.
> +  [ns]$ mount -t cgroup cgroup /tmp/cgroup
> +  [ns]$ ls -l /tmp/cgroup
> +  total 0
> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
> +  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
> +  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control

I've patched libvirt-lxc to issue CLONE_NEWCGROUP and not bind mount cgroupfs into a container.
But I'm unable to mount cgroupfs within the container, mount(2) is failing with EINVAL.
And /proc/self/cgroup still shows the cgroup from outside.

---cut---
container:/ # ls /sys/fs/cgroup/
container:/ # mount -t cgroup none /sys/fs/cgroup/
mount: wrong fs type, bad option, bad superblock on none,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.
container:/ # cat /proc/self/cgroup
8:memory:/machine/test00.libvirt-lxc
7:devices:/machine/test00.libvirt-lxc
6:hugetlb:/
5:cpuset:/machine/test00.libvirt-lxc
4:blkio:/machine/test00.libvirt-lxc
3:cpu,cpuacct:/machine/test00.libvirt-lxc
2:freezer:/machine/test00.libvirt-lxc
1:name=systemd:/user.slice/user-0.slice/session-c2.scope
container:/ # ls -la /proc/self/ns
total 0
dr-x--x--x 2 root root 0 Dec 14 23:02 .
dr-xr-xr-x 8 root root 0 Dec 14 23:02 ..
lrwxrwxrwx 1 root root 0 Dec 14 23:02 cgroup -> cgroup:[4026532240]
lrwxrwxrwx 1 root root 0 Dec 14 23:02 ipc -> ipc:[4026532238]
lrwxrwxrwx 1 root root 0 Dec 14 23:02 mnt -> mnt:[4026532235]
lrwxrwxrwx 1 root root 0 Dec 14 23:02 net -> net:[4026532242]
lrwxrwxrwx 1 root root 0 Dec 14 23:02 pid -> pid:[4026532239]
lrwxrwxrwx 1 root root 0 Dec 14 23:02 user -> user:[4026532234]
lrwxrwxrwx 1 root root 0 Dec 14 23:02 uts -> uts:[4026532236]
container:/ #

#host side
lxc-os132:~ # ls -la /proc/self/ns
total 0
dr-x--x--x 2 root root 0 Dec 14 23:56 .
dr-xr-xr-x 8 root root 0 Dec 14 23:56 ..
lrwxrwxrwx 1 root root 0 Dec 14 23:56 cgroup -> cgroup:[4026531835]
lrwxrwxrwx 1 root root 0 Dec 14 23:56 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 root root 0 Dec 14 23:56 mnt -> mnt:[4026531840]
lrwxrwxrwx 1 root root 0 Dec 14 23:56 net -> net:[4026531957]
lrwxrwxrwx 1 root root 0 Dec 14 23:56 pid -> pid:[4026531836]
lrwxrwxrwx 1 root root 0 Dec 14 23:56 user -> user:[4026531837]
lrwxrwxrwx 1 root root 0 Dec 14 23:56 uts -> uts:[4026531838]
---cut---

Any ideas?

Thanks,
//richard
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces

Aditya Kali
On Sun, Dec 14, 2014 at 3:05 PM, Richard Weinberger <[hidden email]> wrote:

> Aditya,
>
> I gave your patch set a try but it does not work for me.
> Maybe you can bring some light into the issues I'm facing.
> Sadly I still had no time to dig into your code.
>
> Am 05.12.2014 um 02:55 schrieb Aditya Kali:
>> Signed-off-by: Aditya Kali <[hidden email]>
>> ---
>>  Documentation/cgroups/namespace.txt | 147 ++++++++++++++++++++++++++++++++++++
>>  1 file changed, 147 insertions(+)
>>  create mode 100644 Documentation/cgroups/namespace.txt
>>
>> diff --git a/Documentation/cgroups/namespace.txt b/Documentation/cgroups/namespace.txt
>> new file mode 100644
>> index 0000000..6480379
>> --- /dev/null
>> +++ b/Documentation/cgroups/namespace.txt
>> @@ -0,0 +1,147 @@
>> +                     CGroup Namespaces
>> +
>> +CGroup Namespace provides a mechanism to virtualize the view of the
>> +/proc/<pid>/cgroup file. The CLONE_NEWCGROUP clone-flag can be used with
>> +clone() and unshare() syscalls to create a new cgroup namespace.
>> +The process running inside the cgroup namespace will have its /proc/<pid>/cgroup
>> +output restricted to cgroupns-root. cgroupns-root is the cgroup of the process
>> +at the time of creation of the cgroup namespace.
>> +
>> +Prior to CGroup Namespace, the /proc/<pid>/cgroup file used to show complete
>> +path of the cgroup of a process. In a container setup (where a set of cgroups
>> +and namespaces are intended to isolate processes), the /proc/<pid>/cgroup file
>> +may leak potential system level information to the isolated processes.
>> +
>> +For Example:
>> +  $ cat /proc/self/cgroup
>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>> +
>> +The path '/batchjobs/container_id1' can generally be considered as system-data
>> +and its desirable to not expose it to the isolated process.
>> +
>> +CGroup Namespaces can be used to restrict visibility of this path.
>> +For Example:
>> +  # Before creating cgroup namespace
>> +  $ ls -l /proc/self/ns/cgroup
>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>> +  $ cat /proc/self/cgroup
>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>> +
>> +  # unshare(CLONE_NEWCGROUP) and exec /bin/bash
>> +  $ ~/unshare -c
>> +  [ns]$ ls -l /proc/self/ns/cgroup
>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>> +  # From within new cgroupns, process sees that its in the root cgroup
>> +  [ns]$ cat /proc/self/cgroup
>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>> +
>> +  # From global cgroupns:
>> +  $ cat /proc/<pid>/cgroup
>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>> +
>> +  # Unshare cgroupns along with userns and mountns
>> +  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>> +  # sets up uid/gid map and execs /bin/bash
>> +  $ ~/unshare -c -u -m
>
> This command does not issue CLONE_NEWUSER, -U does.
>
I was using a custom unshare binary. But I will update the command
line to be similar to the one in util-linux.

>> +  # Originally, we were in /batchjobs/container_id1 cgroup. Mount our own cgroup
>> +  # hierarchy.
>> +  [ns]$ mount -t cgroup cgroup /tmp/cgroup
>> +  [ns]$ ls -l /tmp/cgroup
>> +  total 0
>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>
> I've patched libvirt-lxc to issue CLONE_NEWCGROUP and not bind mount cgroupfs into a container.
> But I'm unable to mount cgroupfs within the container, mount(2) is failing with EINVAL.
> And /proc/self/cgroup still shows the cgroup from outside.
>
> ---cut---
> container:/ # ls /sys/fs/cgroup/
> container:/ # mount -t cgroup none /sys/fs/cgroup/

You need to provide "-o __DEVEL_sane_behavior" flag. Inside the
container, only unified hierarchy can be mounted. So, for now, that
flag is needed. I will fix the documentation too.

> mount: wrong fs type, bad option, bad superblock on none,
>        missing codepage or helper program, or other error
>
>        In some cases useful info is found in syslog - try
>        dmesg | tail or so.
> container:/ # cat /proc/self/cgroup
> 8:memory:/machine/test00.libvirt-lxc
> 7:devices:/machine/test00.libvirt-lxc
> 6:hugetlb:/
> 5:cpuset:/machine/test00.libvirt-lxc
> 4:blkio:/machine/test00.libvirt-lxc
> 3:cpu,cpuacct:/machine/test00.libvirt-lxc
> 2:freezer:/machine/test00.libvirt-lxc
> 1:name=systemd:/user.slice/user-0.slice/session-c2.scope
> container:/ # ls -la /proc/self/ns
> total 0
> dr-x--x--x 2 root root 0 Dec 14 23:02 .
> dr-xr-xr-x 8 root root 0 Dec 14 23:02 ..
> lrwxrwxrwx 1 root root 0 Dec 14 23:02 cgroup -> cgroup:[4026532240]
> lrwxrwxrwx 1 root root 0 Dec 14 23:02 ipc -> ipc:[4026532238]
> lrwxrwxrwx 1 root root 0 Dec 14 23:02 mnt -> mnt:[4026532235]
> lrwxrwxrwx 1 root root 0 Dec 14 23:02 net -> net:[4026532242]
> lrwxrwxrwx 1 root root 0 Dec 14 23:02 pid -> pid:[4026532239]
> lrwxrwxrwx 1 root root 0 Dec 14 23:02 user -> user:[4026532234]
> lrwxrwxrwx 1 root root 0 Dec 14 23:02 uts -> uts:[4026532236]
> container:/ #
>
> #host side
> lxc-os132:~ # ls -la /proc/self/ns
> total 0
> dr-x--x--x 2 root root 0 Dec 14 23:56 .
> dr-xr-xr-x 8 root root 0 Dec 14 23:56 ..
> lrwxrwxrwx 1 root root 0 Dec 14 23:56 cgroup -> cgroup:[4026531835]
> lrwxrwxrwx 1 root root 0 Dec 14 23:56 ipc -> ipc:[4026531839]
> lrwxrwxrwx 1 root root 0 Dec 14 23:56 mnt -> mnt:[4026531840]
> lrwxrwxrwx 1 root root 0 Dec 14 23:56 net -> net:[4026531957]
> lrwxrwxrwx 1 root root 0 Dec 14 23:56 pid -> pid:[4026531836]
> lrwxrwxrwx 1 root root 0 Dec 14 23:56 user -> user:[4026531837]
> lrwxrwxrwx 1 root root 0 Dec 14 23:56 uts -> uts:[4026531838]
> ---cut---
>
> Any ideas?
>

Please try with "-o __DEVEL_sane_behavior" flag to the mount command.

> Thanks,
> //richard


Thanks,
--
Aditya
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces

Richard Weinberger
Am 05.01.2015 um 23:48 schrieb Aditya Kali:

> On Sun, Dec 14, 2014 at 3:05 PM, Richard Weinberger <[hidden email]> wrote:
>> Aditya,
>>
>> I gave your patch set a try but it does not work for me.
>> Maybe you can bring some light into the issues I'm facing.
>> Sadly I still had no time to dig into your code.
>>
>> Am 05.12.2014 um 02:55 schrieb Aditya Kali:
>>> Signed-off-by: Aditya Kali <[hidden email]>
>>> ---
>>>  Documentation/cgroups/namespace.txt | 147 ++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 147 insertions(+)
>>>  create mode 100644 Documentation/cgroups/namespace.txt
>>>
>>> diff --git a/Documentation/cgroups/namespace.txt b/Documentation/cgroups/namespace.txt
>>> new file mode 100644
>>> index 0000000..6480379
>>> --- /dev/null
>>> +++ b/Documentation/cgroups/namespace.txt
>>> @@ -0,0 +1,147 @@
>>> +                     CGroup Namespaces
>>> +
>>> +CGroup Namespace provides a mechanism to virtualize the view of the
>>> +/proc/<pid>/cgroup file. The CLONE_NEWCGROUP clone-flag can be used with
>>> +clone() and unshare() syscalls to create a new cgroup namespace.
>>> +The process running inside the cgroup namespace will have its /proc/<pid>/cgroup
>>> +output restricted to cgroupns-root. cgroupns-root is the cgroup of the process
>>> +at the time of creation of the cgroup namespace.
>>> +
>>> +Prior to CGroup Namespace, the /proc/<pid>/cgroup file used to show complete
>>> +path of the cgroup of a process. In a container setup (where a set of cgroups
>>> +and namespaces are intended to isolate processes), the /proc/<pid>/cgroup file
>>> +may leak potential system level information to the isolated processes.
>>> +
>>> +For Example:
>>> +  $ cat /proc/self/cgroup
>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>> +
>>> +The path '/batchjobs/container_id1' can generally be considered as system-data
>>> +and its desirable to not expose it to the isolated process.
>>> +
>>> +CGroup Namespaces can be used to restrict visibility of this path.
>>> +For Example:
>>> +  # Before creating cgroup namespace
>>> +  $ ls -l /proc/self/ns/cgroup
>>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>>> +  $ cat /proc/self/cgroup
>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>> +
>>> +  # unshare(CLONE_NEWCGROUP) and exec /bin/bash
>>> +  $ ~/unshare -c
>>> +  [ns]$ ls -l /proc/self/ns/cgroup
>>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>>> +  # From within new cgroupns, process sees that its in the root cgroup
>>> +  [ns]$ cat /proc/self/cgroup
>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>> +
>>> +  # From global cgroupns:
>>> +  $ cat /proc/<pid>/cgroup
>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>> +
>>> +  # Unshare cgroupns along with userns and mountns
>>> +  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>>> +  # sets up uid/gid map and execs /bin/bash
>>> +  $ ~/unshare -c -u -m
>>
>> This command does not issue CLONE_NEWUSER, -U does.
>>
> I was using a custom unshare binary. But I will update the command
> line to be similar to the one in util-linux.
>
>>> +  # Originally, we were in /batchjobs/container_id1 cgroup. Mount our own cgroup
>>> +  # hierarchy.
>>> +  [ns]$ mount -t cgroup cgroup /tmp/cgroup
>>> +  [ns]$ ls -l /tmp/cgroup
>>> +  total 0
>>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>>
>> I've patched libvirt-lxc to issue CLONE_NEWCGROUP and not bind mount cgroupfs into a container.
>> But I'm unable to mount cgroupfs within the container, mount(2) is failing with EINVAL.
>> And /proc/self/cgroup still shows the cgroup from outside.
>>
>> ---cut---
>> container:/ # ls /sys/fs/cgroup/
>> container:/ # mount -t cgroup none /sys/fs/cgroup/
>
> You need to provide "-o __DEVEL_sane_behavior" flag. Inside the
> container, only unified hierarchy can be mounted. So, for now, that
> flag is needed. I will fix the documentation too.
>
>> mount: wrong fs type, bad option, bad superblock on none,
>>        missing codepage or helper program, or other error
>>
>>        In some cases useful info is found in syslog - try
>>        dmesg | tail or so.
>> container:/ # cat /proc/self/cgroup
>> 8:memory:/machine/test00.libvirt-lxc
>> 7:devices:/machine/test00.libvirt-lxc
>> 6:hugetlb:/
>> 5:cpuset:/machine/test00.libvirt-lxc
>> 4:blkio:/machine/test00.libvirt-lxc
>> 3:cpu,cpuacct:/machine/test00.libvirt-lxc
>> 2:freezer:/machine/test00.libvirt-lxc
>> 1:name=systemd:/user.slice/user-0.slice/session-c2.scope
>> container:/ # ls -la /proc/self/ns
>> total 0
>> dr-x--x--x 2 root root 0 Dec 14 23:02 .
>> dr-xr-xr-x 8 root root 0 Dec 14 23:02 ..
>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 cgroup -> cgroup:[4026532240]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 ipc -> ipc:[4026532238]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 mnt -> mnt:[4026532235]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 net -> net:[4026532242]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 pid -> pid:[4026532239]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 user -> user:[4026532234]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 uts -> uts:[4026532236]
>> container:/ #
>>
>> #host side
>> lxc-os132:~ # ls -la /proc/self/ns
>> total 0
>> dr-x--x--x 2 root root 0 Dec 14 23:56 .
>> dr-xr-xr-x 8 root root 0 Dec 14 23:56 ..
>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 cgroup -> cgroup:[4026531835]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 ipc -> ipc:[4026531839]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 mnt -> mnt:[4026531840]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 net -> net:[4026531957]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 pid -> pid:[4026531836]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 user -> user:[4026531837]
>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 uts -> uts:[4026531838]
>> ---cut---
>>
>> Any ideas?
>>
>
> Please try with "-o __DEVEL_sane_behavior" flag to the mount command.

Ohh, this renders the whole patch useless for me as systemd needs the "old/default" behavior of cgroups. :-(
I really hoped that cgroup namespaces will help me running systemd in a sane way within Linux containers.

Thanks,
//richard
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces

Aditya Kali
In reply to this post by Zefan Li
Thanks for the review. I have made the suggested fixes. Regarding
relative path, please see inline.

On Fri, Dec 12, 2014 at 12:54 AM, Zefan Li <[hidden email]> wrote:

>> +In its current form, the cgroup namespaces patcheset provides following
>> +behavior:
>> +
>> +(1) The 'cgroupns-root' for a cgroup namespace is the cgroup in which
>> +    the process calling unshare is running.
>> +    For ex. if a process in /batchjobs/container_id1 cgroup calls unshare,
>> +    cgroup /batchjobs/container_id1 becomes the cgroupns-root.
>> +    For the init_cgroup_ns, this is the real root ('/') cgroup
>> +    (identified in code as cgrp_dfl_root.cgrp).
>> +
>> +(2) The cgroupns-root cgroup does not change even if the namespace
>> +    creator process later moves to a different cgroup.
>> +    $ ~/unshare -c # unshare cgroupns in some cgroup
>> +    [ns]$ cat /proc/self/cgroup
>> +    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>> +    [ns]$ mkdir sub_cgrp_1
>> +    [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
>> +    [ns]$ cat /proc/self/cgroup
>> +    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>> +
>> +(3) Each process gets its CGROUPNS specific view of /proc/<pid>/cgroup
>> +(a) Processes running inside the cgroup namespace will be able to see
>> +    cgroup paths (in /proc/self/cgroup) only inside their root cgroup
>> +    [ns]$ sleep 100000 &  # From within unshared cgroupns
>> +    [1] 7353
>> +    [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
>> +    [ns]$ cat /proc/7353/cgroup
>> +    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>> +
>> +(b) From global cgroupns, the real cgroup path will be visible:
>> +    $ cat /proc/7353/cgroup
>> +    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1/sub_cgrp_1
>> +
>> +(c) From a sibling cgroupns (cgroupns root-ed at a different cgroup), cgroup
>> +    path relative to its own cgroupns-root will be shown:
>> +    # ns2's cgroupns-root is at '/batchjobs/container_id2'
>> +    [ns2]$ cat /proc/7353/cgroup
>> +    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../container_id2/sub_cgrp_1
>
> Should be ../container_id1/sub_cgrp_1 ?
>

Starting with '/' was deliberate.

>> +
>> +    Note that the relative path always starts with '/' to indicate that its
>> +    relative to the cgroupns-root of the caller.
>
> If a path doesn't start with '/', then it's a relative path, so why make it start with '/'?
>

This is so as not to surprise the apps parsing /proc/<pid>/cgroup
files and using the path in it as absolute path. The existing paths
there always start with '/' right now. Retaining the '/' means path
generated by userspace continuous to work. Does this makes sense?

>> +
>> +(4) Processes inside a cgroupns can move in-and-out of the cgroupns-root
>> +    (if they have proper access to external cgroups).
>> +    # From inside cgroupns (with cgroupns-root at /batchjobs/container_id1), and
>> +    # assuming that the global hierarchy is still accessible inside cgroupns:
>> +    $ cat /proc/7353/cgroup
>> +    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>> +    $ echo 7353 > batchjobs/container_id2/cgroup.procs
>> +    $ cat /proc/7353/cgroup
>> +    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../container_id2
>> +
>> +    Note that this kind of setup is not encouraged. A task inside cgroupns
>> +    should only be exposed to its own cgroupns hierarchy. Otherwise it makes
>> +    the virtualization of /proc/<pid>/cgroup less useful.
>> +
>> +(5) Setns to another cgroup namespace is allowed when:
>> +    (a) the process has CAP_SYS_ADMIN in its current userns
>> +    (b) the process has CAP_SYS_ADMIN in the target cgroupns' userns
>> +    No implicit cgroup changes happen with attaching to another cgroupns. It
>> +    is expected that the somone moves the attaching process under the target
>> +    cgroupns-root.
>> +
>
> s/the somone/someone
>
fixed.

>> +(6) When some thread from a multi-threaded process unshares its
>> +    cgroup-namespace, the new cgroupns gets applied to the entire
>> +    process (all the threads). This should be OK since
>> +    unified-hierarchy only allows process-level containerization. So
>> +    all the threads in the process will have the same cgroup.
>> +
>> +(7) The cgroup namespace is alive as long as there is atleast 1
>
> s/atelast/at least
>
fixed.

>> +    process inside it. When the last process exits, the cgroup
>> +    namespace is destroyed. The cgroupns-root and the actual cgroups
>> +    remain though.
>> +
>> +(8) Namespace specific cgroup hierarchy can be mounted by a process running
>> +    inside cgroupns:
>> +    $ mount -t cgroup -o __DEVEL__sane_behavior cgroup $MOUNT_POINT
>> +
>> +    This will mount the unified cgroup hierarchy with cgroupns-root as the
>> +    filesystem root. The process needs CAP_SYS_ADMIN in its userns and mntns.
>> +
>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to [hidden email]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Thanks!
--
Aditya
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces

Eric W. Biederman
In reply to this post by Richard Weinberger
Richard Weinberger <[hidden email]> writes:

> Am 05.01.2015 um 23:48 schrieb Aditya Kali:
>> On Sun, Dec 14, 2014 at 3:05 PM, Richard Weinberger <[hidden email]> wrote:
>>> Aditya,
>>>
>>> I gave your patch set a try but it does not work for me.
>>> Maybe you can bring some light into the issues I'm facing.
>>> Sadly I still had no time to dig into your code.
>>>
>>> Am 05.12.2014 um 02:55 schrieb Aditya Kali:
>>>> Signed-off-by: Aditya Kali <[hidden email]>
>>>> ---
>>>>  Documentation/cgroups/namespace.txt | 147 ++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 147 insertions(+)
>>>>  create mode 100644 Documentation/cgroups/namespace.txt
>>>>
>>>> diff --git a/Documentation/cgroups/namespace.txt b/Documentation/cgroups/namespace.txt
>>>> new file mode 100644
>>>> index 0000000..6480379
>>>> --- /dev/null
>>>> +++ b/Documentation/cgroups/namespace.txt
>>>> @@ -0,0 +1,147 @@
>>>> +                     CGroup Namespaces
>>>> +
>>>> +CGroup Namespace provides a mechanism to virtualize the view of the
>>>> +/proc/<pid>/cgroup file. The CLONE_NEWCGROUP clone-flag can be used with
>>>> +clone() and unshare() syscalls to create a new cgroup namespace.
>>>> +The process running inside the cgroup namespace will have its /proc/<pid>/cgroup
>>>> +output restricted to cgroupns-root. cgroupns-root is the cgroup of the process
>>>> +at the time of creation of the cgroup namespace.
>>>> +
>>>> +Prior to CGroup Namespace, the /proc/<pid>/cgroup file used to show complete
>>>> +path of the cgroup of a process. In a container setup (where a set of cgroups
>>>> +and namespaces are intended to isolate processes), the /proc/<pid>/cgroup file
>>>> +may leak potential system level information to the isolated processes.
>>>> +
>>>> +For Example:
>>>> +  $ cat /proc/self/cgroup
>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>> +
>>>> +The path '/batchjobs/container_id1' can generally be considered as system-data
>>>> +and its desirable to not expose it to the isolated process.
>>>> +
>>>> +CGroup Namespaces can be used to restrict visibility of this path.
>>>> +For Example:
>>>> +  # Before creating cgroup namespace
>>>> +  $ ls -l /proc/self/ns/cgroup
>>>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>>>> +  $ cat /proc/self/cgroup
>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>> +
>>>> +  # unshare(CLONE_NEWCGROUP) and exec /bin/bash
>>>> +  $ ~/unshare -c
>>>> +  [ns]$ ls -l /proc/self/ns/cgroup
>>>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>>>> +  # From within new cgroupns, process sees that its in the root cgroup
>>>> +  [ns]$ cat /proc/self/cgroup
>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>>> +
>>>> +  # From global cgroupns:
>>>> +  $ cat /proc/<pid>/cgroup
>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>> +
>>>> +  # Unshare cgroupns along with userns and mountns
>>>> +  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>>>> +  # sets up uid/gid map and execs /bin/bash
>>>> +  $ ~/unshare -c -u -m
>>>
>>> This command does not issue CLONE_NEWUSER, -U does.
>>>
>> I was using a custom unshare binary. But I will update the command
>> line to be similar to the one in util-linux.
>>
>>>> +  # Originally, we were in /batchjobs/container_id1 cgroup. Mount our own cgroup
>>>> +  # hierarchy.
>>>> +  [ns]$ mount -t cgroup cgroup /tmp/cgroup
>>>> +  [ns]$ ls -l /tmp/cgroup
>>>> +  total 0
>>>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>>>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>>>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>>>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>>>
>>> I've patched libvirt-lxc to issue CLONE_NEWCGROUP and not bind mount cgroupfs into a container.
>>> But I'm unable to mount cgroupfs within the container, mount(2) is failing with EINVAL.
>>> And /proc/self/cgroup still shows the cgroup from outside.
>>>
>>> ---cut---
>>> container:/ # ls /sys/fs/cgroup/
>>> container:/ # mount -t cgroup none /sys/fs/cgroup/
>>
>> You need to provide "-o __DEVEL_sane_behavior" flag. Inside the
>> container, only unified hierarchy can be mounted. So, for now, that
>> flag is needed. I will fix the documentation too.
>>
>>> mount: wrong fs type, bad option, bad superblock on none,
>>>        missing codepage or helper program, or other error
>>>
>>>        In some cases useful info is found in syslog - try
>>>        dmesg | tail or so.
>>> container:/ # cat /proc/self/cgroup
>>> 8:memory:/machine/test00.libvirt-lxc
>>> 7:devices:/machine/test00.libvirt-lxc
>>> 6:hugetlb:/
>>> 5:cpuset:/machine/test00.libvirt-lxc
>>> 4:blkio:/machine/test00.libvirt-lxc
>>> 3:cpu,cpuacct:/machine/test00.libvirt-lxc
>>> 2:freezer:/machine/test00.libvirt-lxc
>>> 1:name=systemd:/user.slice/user-0.slice/session-c2.scope
>>> container:/ # ls -la /proc/self/ns
>>> total 0
>>> dr-x--x--x 2 root root 0 Dec 14 23:02 .
>>> dr-xr-xr-x 8 root root 0 Dec 14 23:02 ..
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 cgroup -> cgroup:[4026532240]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 ipc -> ipc:[4026532238]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 mnt -> mnt:[4026532235]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 net -> net:[4026532242]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 pid -> pid:[4026532239]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 user -> user:[4026532234]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 uts -> uts:[4026532236]
>>> container:/ #
>>>
>>> #host side
>>> lxc-os132:~ # ls -la /proc/self/ns
>>> total 0
>>> dr-x--x--x 2 root root 0 Dec 14 23:56 .
>>> dr-xr-xr-x 8 root root 0 Dec 14 23:56 ..
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 cgroup -> cgroup:[4026531835]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 ipc -> ipc:[4026531839]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 mnt -> mnt:[4026531840]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 net -> net:[4026531957]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 pid -> pid:[4026531836]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 user -> user:[4026531837]
>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 uts -> uts:[4026531838]
>>> ---cut---
>>>
>>> Any ideas?
>>>
>>
>> Please try with "-o __DEVEL_sane_behavior" flag to the mount command.
>
> Ohh, this renders the whole patch useless for me as systemd needs the "old/default" behavior of cgroups. :-(
> I really hoped that cgroup namespaces will help me running systemd in a sane way within Linux containers.

Ugh.  It sounds like there is a real mess here.  At the very least there
is misunderstanding.

I have a memory that systemd should have been able to use a unified
hierarchy.  As you could still mount the different controllers
independently (they just use the same directory structure on each
mount).

That said from a practical standpoint I am not certain that a cgroup
namespace is viable if it can not support the behavior of cgroupsfs
that everyone is using.

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces

Richard Weinberger
Am 06.01.2015 um 00:53 schrieb Eric W. Biederman:

> Richard Weinberger <[hidden email]> writes:
>
>> Am 05.01.2015 um 23:48 schrieb Aditya Kali:
>>> On Sun, Dec 14, 2014 at 3:05 PM, Richard Weinberger <[hidden email]> wrote:
>>>> Aditya,
>>>>
>>>> I gave your patch set a try but it does not work for me.
>>>> Maybe you can bring some light into the issues I'm facing.
>>>> Sadly I still had no time to dig into your code.
>>>>
>>>> Am 05.12.2014 um 02:55 schrieb Aditya Kali:
>>>>> Signed-off-by: Aditya Kali <[hidden email]>
>>>>> ---
>>>>>  Documentation/cgroups/namespace.txt | 147 ++++++++++++++++++++++++++++++++++++
>>>>>  1 file changed, 147 insertions(+)
>>>>>  create mode 100644 Documentation/cgroups/namespace.txt
>>>>>
>>>>> diff --git a/Documentation/cgroups/namespace.txt b/Documentation/cgroups/namespace.txt
>>>>> new file mode 100644
>>>>> index 0000000..6480379
>>>>> --- /dev/null
>>>>> +++ b/Documentation/cgroups/namespace.txt
>>>>> @@ -0,0 +1,147 @@
>>>>> +                     CGroup Namespaces
>>>>> +
>>>>> +CGroup Namespace provides a mechanism to virtualize the view of the
>>>>> +/proc/<pid>/cgroup file. The CLONE_NEWCGROUP clone-flag can be used with
>>>>> +clone() and unshare() syscalls to create a new cgroup namespace.
>>>>> +The process running inside the cgroup namespace will have its /proc/<pid>/cgroup
>>>>> +output restricted to cgroupns-root. cgroupns-root is the cgroup of the process
>>>>> +at the time of creation of the cgroup namespace.
>>>>> +
>>>>> +Prior to CGroup Namespace, the /proc/<pid>/cgroup file used to show complete
>>>>> +path of the cgroup of a process. In a container setup (where a set of cgroups
>>>>> +and namespaces are intended to isolate processes), the /proc/<pid>/cgroup file
>>>>> +may leak potential system level information to the isolated processes.
>>>>> +
>>>>> +For Example:
>>>>> +  $ cat /proc/self/cgroup
>>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>>> +
>>>>> +The path '/batchjobs/container_id1' can generally be considered as system-data
>>>>> +and its desirable to not expose it to the isolated process.
>>>>> +
>>>>> +CGroup Namespaces can be used to restrict visibility of this path.
>>>>> +For Example:
>>>>> +  # Before creating cgroup namespace
>>>>> +  $ ls -l /proc/self/ns/cgroup
>>>>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>>>>> +  $ cat /proc/self/cgroup
>>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>>> +
>>>>> +  # unshare(CLONE_NEWCGROUP) and exec /bin/bash
>>>>> +  $ ~/unshare -c
>>>>> +  [ns]$ ls -l /proc/self/ns/cgroup
>>>>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>>>>> +  # From within new cgroupns, process sees that its in the root cgroup
>>>>> +  [ns]$ cat /proc/self/cgroup
>>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>>>> +
>>>>> +  # From global cgroupns:
>>>>> +  $ cat /proc/<pid>/cgroup
>>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>>> +
>>>>> +  # Unshare cgroupns along with userns and mountns
>>>>> +  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>>>>> +  # sets up uid/gid map and execs /bin/bash
>>>>> +  $ ~/unshare -c -u -m
>>>>
>>>> This command does not issue CLONE_NEWUSER, -U does.
>>>>
>>> I was using a custom unshare binary. But I will update the command
>>> line to be similar to the one in util-linux.
>>>
>>>>> +  # Originally, we were in /batchjobs/container_id1 cgroup. Mount our own cgroup
>>>>> +  # hierarchy.
>>>>> +  [ns]$ mount -t cgroup cgroup /tmp/cgroup
>>>>> +  [ns]$ ls -l /tmp/cgroup
>>>>> +  total 0
>>>>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>>>>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>>>>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>>>>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>>>>
>>>> I've patched libvirt-lxc to issue CLONE_NEWCGROUP and not bind mount cgroupfs into a container.
>>>> But I'm unable to mount cgroupfs within the container, mount(2) is failing with EINVAL.
>>>> And /proc/self/cgroup still shows the cgroup from outside.
>>>>
>>>> ---cut---
>>>> container:/ # ls /sys/fs/cgroup/
>>>> container:/ # mount -t cgroup none /sys/fs/cgroup/
>>>
>>> You need to provide "-o __DEVEL_sane_behavior" flag. Inside the
>>> container, only unified hierarchy can be mounted. So, for now, that
>>> flag is needed. I will fix the documentation too.
>>>
>>>> mount: wrong fs type, bad option, bad superblock on none,
>>>>        missing codepage or helper program, or other error
>>>>
>>>>        In some cases useful info is found in syslog - try
>>>>        dmesg | tail or so.
>>>> container:/ # cat /proc/self/cgroup
>>>> 8:memory:/machine/test00.libvirt-lxc
>>>> 7:devices:/machine/test00.libvirt-lxc
>>>> 6:hugetlb:/
>>>> 5:cpuset:/machine/test00.libvirt-lxc
>>>> 4:blkio:/machine/test00.libvirt-lxc
>>>> 3:cpu,cpuacct:/machine/test00.libvirt-lxc
>>>> 2:freezer:/machine/test00.libvirt-lxc
>>>> 1:name=systemd:/user.slice/user-0.slice/session-c2.scope
>>>> container:/ # ls -la /proc/self/ns
>>>> total 0
>>>> dr-x--x--x 2 root root 0 Dec 14 23:02 .
>>>> dr-xr-xr-x 8 root root 0 Dec 14 23:02 ..
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 cgroup -> cgroup:[4026532240]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 ipc -> ipc:[4026532238]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 mnt -> mnt:[4026532235]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 net -> net:[4026532242]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 pid -> pid:[4026532239]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 user -> user:[4026532234]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 uts -> uts:[4026532236]
>>>> container:/ #
>>>>
>>>> #host side
>>>> lxc-os132:~ # ls -la /proc/self/ns
>>>> total 0
>>>> dr-x--x--x 2 root root 0 Dec 14 23:56 .
>>>> dr-xr-xr-x 8 root root 0 Dec 14 23:56 ..
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 cgroup -> cgroup:[4026531835]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 ipc -> ipc:[4026531839]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 mnt -> mnt:[4026531840]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 net -> net:[4026531957]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 pid -> pid:[4026531836]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 user -> user:[4026531837]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 uts -> uts:[4026531838]
>>>> ---cut---
>>>>
>>>> Any ideas?
>>>>
>>>
>>> Please try with "-o __DEVEL_sane_behavior" flag to the mount command.
>>
>> Ohh, this renders the whole patch useless for me as systemd needs the "old/default" behavior of cgroups. :-(
>> I really hoped that cgroup namespaces will help me running systemd in a sane way within Linux containers.
>
> Ugh.  It sounds like there is a real mess here.  At the very least there
> is misunderstanding.
>
> I have a memory that systemd should have been able to use a unified
> hierarchy.  As you could still mount the different controllers
> independently (they just use the same directory structure on each
> mount).

Luckily systemd folks want to move to the unified but as of now it does not work.
Please see this mail from Lennart:
https://www.redhat.com/archives/libvir-list/2014-November/msg01090.html

Maybe the porting is easy. Dunno.
I had no time yet to look into that.

> That said from a practical standpoint I am not certain that a cgroup
> namespace is viable if it can not support the behavior of cgroupsfs
> that everyone is using.

Yep.

systemd *really* wants to own cgroupfs, so it has to mount it within the container.
Currently libvirt does nasty hacks using bind mounts which are also problematic.
My hope was that with cgroup namespaces I can simply cheat systemd and give it
a cgroupfs to mess with.

Thanks,
//richard
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces

Aditya Kali
In reply to this post by Eric W. Biederman
On Mon, Jan 5, 2015 at 3:53 PM, Eric W. Biederman <[hidden email]> wrote:

> Richard Weinberger <[hidden email]> writes:
>
>> Am 05.01.2015 um 23:48 schrieb Aditya Kali:
>>> On Sun, Dec 14, 2014 at 3:05 PM, Richard Weinberger <[hidden email]> wrote:
>>>> Aditya,
>>>>
>>>> I gave your patch set a try but it does not work for me.
>>>> Maybe you can bring some light into the issues I'm facing.
>>>> Sadly I still had no time to dig into your code.
>>>>
>>>> Am 05.12.2014 um 02:55 schrieb Aditya Kali:
>>>>> Signed-off-by: Aditya Kali <[hidden email]>
>>>>> ---
>>>>>  Documentation/cgroups/namespace.txt | 147 ++++++++++++++++++++++++++++++++++++
>>>>>  1 file changed, 147 insertions(+)
>>>>>  create mode 100644 Documentation/cgroups/namespace.txt
>>>>>
>>>>> diff --git a/Documentation/cgroups/namespace.txt b/Documentation/cgroups/namespace.txt
>>>>> new file mode 100644
>>>>> index 0000000..6480379
>>>>> --- /dev/null
>>>>> +++ b/Documentation/cgroups/namespace.txt
>>>>> @@ -0,0 +1,147 @@
>>>>> +                     CGroup Namespaces
>>>>> +
>>>>> +CGroup Namespace provides a mechanism to virtualize the view of the
>>>>> +/proc/<pid>/cgroup file. The CLONE_NEWCGROUP clone-flag can be used with
>>>>> +clone() and unshare() syscalls to create a new cgroup namespace.
>>>>> +The process running inside the cgroup namespace will have its /proc/<pid>/cgroup
>>>>> +output restricted to cgroupns-root. cgroupns-root is the cgroup of the process
>>>>> +at the time of creation of the cgroup namespace.
>>>>> +
>>>>> +Prior to CGroup Namespace, the /proc/<pid>/cgroup file used to show complete
>>>>> +path of the cgroup of a process. In a container setup (where a set of cgroups
>>>>> +and namespaces are intended to isolate processes), the /proc/<pid>/cgroup file
>>>>> +may leak potential system level information to the isolated processes.
>>>>> +
>>>>> +For Example:
>>>>> +  $ cat /proc/self/cgroup
>>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>>> +
>>>>> +The path '/batchjobs/container_id1' can generally be considered as system-data
>>>>> +and its desirable to not expose it to the isolated process.
>>>>> +
>>>>> +CGroup Namespaces can be used to restrict visibility of this path.
>>>>> +For Example:
>>>>> +  # Before creating cgroup namespace
>>>>> +  $ ls -l /proc/self/ns/cgroup
>>>>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>>>>> +  $ cat /proc/self/cgroup
>>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>>> +
>>>>> +  # unshare(CLONE_NEWCGROUP) and exec /bin/bash
>>>>> +  $ ~/unshare -c
>>>>> +  [ns]$ ls -l /proc/self/ns/cgroup
>>>>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>>>>> +  # From within new cgroupns, process sees that its in the root cgroup
>>>>> +  [ns]$ cat /proc/self/cgroup
>>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>>>> +
>>>>> +  # From global cgroupns:
>>>>> +  $ cat /proc/<pid>/cgroup
>>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>>> +
>>>>> +  # Unshare cgroupns along with userns and mountns
>>>>> +  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>>>>> +  # sets up uid/gid map and execs /bin/bash
>>>>> +  $ ~/unshare -c -u -m
>>>>
>>>> This command does not issue CLONE_NEWUSER, -U does.
>>>>
>>> I was using a custom unshare binary. But I will update the command
>>> line to be similar to the one in util-linux.
>>>
>>>>> +  # Originally, we were in /batchjobs/container_id1 cgroup. Mount our own cgroup
>>>>> +  # hierarchy.
>>>>> +  [ns]$ mount -t cgroup cgroup /tmp/cgroup
>>>>> +  [ns]$ ls -l /tmp/cgroup
>>>>> +  total 0
>>>>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>>>>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>>>>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>>>>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>>>>
>>>> I've patched libvirt-lxc to issue CLONE_NEWCGROUP and not bind mount cgroupfs into a container.
>>>> But I'm unable to mount cgroupfs within the container, mount(2) is failing with EINVAL.
>>>> And /proc/self/cgroup still shows the cgroup from outside.
>>>>
>>>> ---cut---
>>>> container:/ # ls /sys/fs/cgroup/
>>>> container:/ # mount -t cgroup none /sys/fs/cgroup/
>>>
>>> You need to provide "-o __DEVEL_sane_behavior" flag. Inside the
>>> container, only unified hierarchy can be mounted. So, for now, that
>>> flag is needed. I will fix the documentation too.
>>>
>>>> mount: wrong fs type, bad option, bad superblock on none,
>>>>        missing codepage or helper program, or other error
>>>>
>>>>        In some cases useful info is found in syslog - try
>>>>        dmesg | tail or so.
>>>> container:/ # cat /proc/self/cgroup
>>>> 8:memory:/machine/test00.libvirt-lxc
>>>> 7:devices:/machine/test00.libvirt-lxc
>>>> 6:hugetlb:/
>>>> 5:cpuset:/machine/test00.libvirt-lxc
>>>> 4:blkio:/machine/test00.libvirt-lxc
>>>> 3:cpu,cpuacct:/machine/test00.libvirt-lxc
>>>> 2:freezer:/machine/test00.libvirt-lxc
>>>> 1:name=systemd:/user.slice/user-0.slice/session-c2.scope
>>>> container:/ # ls -la /proc/self/ns
>>>> total 0
>>>> dr-x--x--x 2 root root 0 Dec 14 23:02 .
>>>> dr-xr-xr-x 8 root root 0 Dec 14 23:02 ..
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 cgroup -> cgroup:[4026532240]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 ipc -> ipc:[4026532238]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 mnt -> mnt:[4026532235]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 net -> net:[4026532242]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 pid -> pid:[4026532239]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 user -> user:[4026532234]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:02 uts -> uts:[4026532236]
>>>> container:/ #
>>>>
>>>> #host side
>>>> lxc-os132:~ # ls -la /proc/self/ns
>>>> total 0
>>>> dr-x--x--x 2 root root 0 Dec 14 23:56 .
>>>> dr-xr-xr-x 8 root root 0 Dec 14 23:56 ..
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 cgroup -> cgroup:[4026531835]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 ipc -> ipc:[4026531839]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 mnt -> mnt:[4026531840]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 net -> net:[4026531957]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 pid -> pid:[4026531836]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 user -> user:[4026531837]
>>>> lrwxrwxrwx 1 root root 0 Dec 14 23:56 uts -> uts:[4026531838]
>>>> ---cut---
>>>>
>>>> Any ideas?
>>>>
>>>
>>> Please try with "-o __DEVEL_sane_behavior" flag to the mount command.
>>
>> Ohh, this renders the whole patch useless for me as systemd needs the "old/default" behavior of cgroups. :-(
>> I really hoped that cgroup namespaces will help me running systemd in a sane way within Linux containers.
>
> Ugh.  It sounds like there is a real mess here.  At the very least there
> is misunderstanding.
>
> I have a memory that systemd should have been able to use a unified
> hierarchy.  As you could still mount the different controllers
> independently (they just use the same directory structure on each
> mount).
>
In theory, if you boot kernel with
"cgroup__DEVEL__legacy_files_on_dfl" command-line parameter, and mount
cgroups with sane-behavior flag, then it should be more-or-less
similar to mounting all hierarchies together at the same mount-point
(mount -t cgroup -o __DEVEL_sane_behavior none $mntpt). I haven't
tried this, but systemd should be able to work with it and you can
enable cgroup-namespace too.

> That said from a practical standpoint I am not certain that a cgroup
> namespace is viable if it can not support the behavior of cgroupsfs
> that everyone is using.
>

Since the old/default behavior is on its way out, I didn't invest time
in fixing that. Also, some of the properties that make
cgroup-namespace simpler are only provided by unified hierarchy (for
example: a single root-cgroup per container).


> Eric

--
Aditya
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
123