[PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

classic Classic list List threaded Threaded
35 messages Options
12
Reply | Threaded
Open this post in threaded view
|

[PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

Waskiewicz Jr, Peter P
This patchset adds support for the new Cache QoS Monitoring (CQM)
feature found in future Intel Xeon processors.

CQM allows a process, or set of processes, to be tracked by the CPU
to determine the cache usage of that task group.  Using this data
from the CPU, software can be written to extract this data and
report cache usage and occupancy for a particular process, or
group of processes.

More information about Cache QoS Monitoring can be found in the
Intel (R) x86 Architecture Software Developer Manual, section 17.14.

This series is also laying the framework for additional Platform
QoS features in future Intel Xeon processors.

The CPU features themselves are relatively straight-forward, but
the presentation of the data is less straight-forward.  Since this
tracks cache usage and occupancy per process (by swapping Resource
Monitor IDs, or RMIDs, when processes are rescheduled), perf would
not be a good fit for this data, which does not report on a
per-process level.  Therefore, a new cgroup subsystem, cacheqos, has
been added.  This operates very similarly to the cpu and cpuacct
cgroup subsystems, where tasks can be grouped into sub-leaves of the
root-level cgroup.

Peter P Waskiewicz Jr (4):
      x86: Add support for Cache QoS Monitoring (CQM) detection
      x86: Add Cache QoS Monitoring support to x86 perf uncore
      cgroup: Add new cacheqos cgroup subsys to support Cache QoS Monitoring
      Documentation: Add documentation for cacheqos cgroup
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

[PATCH 1/4] x86: Add support for Cache QoS Monitoring (CQM) detection

Waskiewicz Jr, Peter P
This patch adds support for the new Cache QoS Monitoring (CQM)
feature found in future Intel Xeon processors.  It includes the
new values to track CQM resources to the cpuinfo_x86 structure,
plus the CPUID detection routines for CQM.

CQM allows a process, or set of processes, to be tracked by the CPU
to determine the cache usage of that task group.  Using this data
from the CPU, software can be written to extract this data and
report cache usage and occupancy for a particular process, or
group of processes.

More information about Cache QoS Monitoring can be found in the
Intel (R) x86 Architecture Software Developer Manual, section 17.14.

Signed-off-by: Peter P Waskiewicz Jr <[hidden email]>
---
 arch/x86/configs/x86_64_defconfig |  1 +
 arch/x86/include/asm/cpufeature.h |  9 ++++++++-
 arch/x86/include/asm/processor.h  |  3 +++
 arch/x86/kernel/cpu/common.c      | 39 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/arch/x86/configs/x86_64_defconfig b/arch/x86/configs/x86_64_defconfig
index c1119d4..8e98ed4 100644
--- a/arch/x86/configs/x86_64_defconfig
+++ b/arch/x86/configs/x86_64_defconfig
@@ -14,6 +14,7 @@ CONFIG_LOG_BUF_SHIFT=18
 CONFIG_CGROUPS=y
 CONFIG_CGROUP_FREEZER=y
 CONFIG_CPUSETS=y
+CONFIG_CGROUP_CACHEQOS=y
 CONFIG_CGROUP_CPUACCT=y
 CONFIG_RESOURCE_COUNTERS=y
 CONFIG_CGROUP_SCHED=y
diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 89270b4..5dd59a2 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -8,7 +8,7 @@
 #include <asm/required-features.h>
 #endif
 
-#define NCAPINTS 10 /* N 32-bit words worth of info */
+#define NCAPINTS 12 /* N 32-bit words worth of info */
 #define NBUGINTS 1 /* N 32-bit bug flags */
 
 /*
@@ -216,10 +216,17 @@
 #define X86_FEATURE_ERMS (9*32+ 9) /* Enhanced REP MOVSB/STOSB */
 #define X86_FEATURE_INVPCID (9*32+10) /* Invalidate Processor Context ID */
 #define X86_FEATURE_RTM (9*32+11) /* Restricted Transactional Memory */
+#define X86_FEATURE_CQM (9*32+12) /* Cache QoS Monitoring */
 #define X86_FEATURE_RDSEED (9*32+18) /* The RDSEED instruction */
 #define X86_FEATURE_ADX (9*32+19) /* The ADCX and ADOX instructions */
 #define X86_FEATURE_SMAP (9*32+20) /* Supervisor Mode Access Prevention */
 
+/* Intel-defined CPU QoS Sub-leaf, CPUID level 0x0000000F:0 (edx), word 10 */
+#define X86_FEATURE_CQM_LLC (10*32+ 1) /* LLC QoS if 1 */
+
+/* Intel-defined CPU QoS Sub-leaf, CPUID level 0x0000000F:1 (edx), word 11 */
+#define X86_FEATURE_CQM_OCCUP_LLC (11*32+ 0) /* LLC occupancy monitoring if 1 */
+
 /*
  * BUG word(s)
  */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 7b034a4..3892281 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -110,6 +110,9 @@ struct cpuinfo_x86 {
  /* in KB - valid for CPUS which support this call: */
  int x86_cache_size;
  int x86_cache_alignment; /* In bytes */
+ /* Cache QoS architectural values: */
+ int x86_cache_max_rmid; /* max index */
+ int x86_cache_occ_scale; /* scale to bytes */
  int x86_power;
  unsigned long loops_per_jiffy;
  /* cpuid returned max cores value: */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 6abc172..f18bc43 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -626,6 +626,30 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
  c->x86_capability[9] = ebx;
  }
 
+ /* Additional Intel-defined flags: level 0x0000000F */
+ if (c->cpuid_level >= 0x0000000F) {
+ u32 eax, ebx, ecx, edx;
+
+ /* QoS sub-leaf, EAX=0Fh, ECX=0 */
+ cpuid_count(0x0000000F, 0, &eax, &ebx, &ecx, &edx);
+ c->x86_capability[10] = edx;
+ if (cpu_has(c, X86_FEATURE_CQM_LLC)) {
+ /* will be overridden if occupancy monitoring exists */
+ c->x86_cache_max_rmid = ebx;
+
+ /* QoS sub-leaf, EAX=0Fh, ECX=1 */
+ cpuid_count(0x0000000F, 1, &eax, &ebx, &ecx, &edx);
+ c->x86_capability[11] = edx;
+ if (cpu_has(c, X86_FEATURE_CQM_OCCUP_LLC)) {
+ c->x86_cache_max_rmid = ecx;
+ c->x86_cache_occ_scale = ebx;
+ }
+ } else {
+ c->x86_cache_max_rmid = -1;
+ c->x86_cache_occ_scale = -1;
+ }
+ }
+
  /* AMD-defined flags: level 0x80000001 */
  xlvl = cpuid_eax(0x80000000);
  c->extended_cpuid_level = xlvl;
@@ -814,6 +838,20 @@ static void generic_identify(struct cpuinfo_x86 *c)
  detect_nopl(c);
 }
 
+static void x86_init_cache_qos(struct cpuinfo_x86 *c)
+{
+ /*
+ * The heavy lifting of max_rmid and cache_occ_scale are handled
+ * in get_cpu_cap().  Here we just set the max_rmid for the boot_cpu
+ * in case CQM bits really aren't there in this CPU.
+ */
+ if (c != &boot_cpu_data) {
+ boot_cpu_data.x86_cache_max_rmid =
+ min(boot_cpu_data.x86_cache_max_rmid,
+    c->x86_cache_max_rmid);
+ }
+}
+
 /*
  * This does the hard work of actually picking apart the CPU stuff...
  */
@@ -903,6 +941,7 @@ static void identify_cpu(struct cpuinfo_x86 *c)
 
  init_hypervisor(c);
  x86_init_rdrand(c);
+ x86_init_cache_qos(c);
 
  /*
  * Clear/Set all flags overriden by options, need do it
--
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

[PATCH 2/4] x86: Add Cache QoS Monitoring support to x86 perf uncore

Waskiewicz Jr, Peter P
In reply to this post by Waskiewicz Jr, Peter P
This patch adds the MSRs and masks for CQM to the x86 uncore.

The actual schedling functions using the MSRs will be included
in the next patch when the new cgroup subsystem is added, as there
are dependencies on structs from the cgroup.

Signed-off-by: Peter P Waskiewicz Jr <[hidden email]>
---
 arch/x86/kernel/cpu/perf_event_intel_uncore.h | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.h b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
index a80ab71..f788145 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.h
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
@@ -412,6 +412,19 @@
 
 #define NHMEX_W_PMON_GLOBAL_FIXED_EN (1ULL << 31)
 
+#ifdef CONFIG_CGROUP_CACHEQOS
+/* Intel Cache QoS Monitoring uncore support */
+#define IA32_QM_EVTSEL 0xc8d
+#define IA32_QM_CTR 0xc8e
+#define IA32_PQR_ASSOC 0xc8f
+
+#define IA32_QM_EVTSEL_EVTID_READ_OCC 0x01
+#define IA32_QM_CTR_ERR (0x03llu << 62)
+#define IA32_RMID_PQR_MASK 0x3ff
+#define IA32_QM_EVTSEL_RMID_POSITION 32
+
+#endif /* CONFIG_CGROUP_CACHEQOS */
+
 struct intel_uncore_ops;
 struct intel_uncore_pmu;
 struct intel_uncore_box;
--
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

[PATCH 3/4] cgroup: Add new cacheqos cgroup subsys to support Cache QoS Monitoring

Waskiewicz Jr, Peter P
In reply to this post by Waskiewicz Jr, Peter P
This patch adds a new cgroup subsystem, named cacheqos.  This cgroup
controller is intended to manage task groups to track cache occupancy
and usage of a CPU.

The cacheqos subsystem operates very similarly to the cpuacct
subsystem.  Tasks can be grouped into different child subgroups,
and have separate cache occupancy accounting for each of the
subgroups.  See Documentation/cgroups/cacheqos-subsystem.txt for
more details.

The patch also adds the Kconfig option for enabling/disabling the
CGROUP_CACHEQOS subsystem.  As this CPU feature is currently found
only in Intel Xeon processors, the cgroup subsystem depends on X86.

Signed-off-by: Peter P Waskiewicz Jr <[hidden email]>
---
 arch/x86/kernel/cpu/perf_event_intel_uncore.c | 112 ++++++++
 include/linux/cgroup_subsys.h                 |   4 +
 include/linux/perf_event.h                    |  14 +
 init/Kconfig                                  |  10 +
 kernel/sched/Makefile                         |   1 +
 kernel/sched/cacheqos.c                       | 397 ++++++++++++++++++++++++++
 kernel/sched/cacheqos.h                       |  59 ++++
 7 files changed, 597 insertions(+)
 create mode 100644 kernel/sched/cacheqos.c
 create mode 100644 kernel/sched/cacheqos.h

diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.c b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
index 29c2487..4d48e26 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
@@ -1633,6 +1633,118 @@ static struct intel_uncore_type *snb_msr_uncores[] = {
 };
 /* end of Sandy Bridge uncore support */
 
+#ifdef CONFIG_CGROUP_CACHEQOS
+
+/* needed for the cacheqos cgroup structs */
+#include "../../../kernel/sched/cacheqos.h"
+
+extern struct cacheqos root_cacheqos_group;
+static DEFINE_MUTEX(cqm_mutex);
+
+static int __init cacheqos_late_init(void)
+{
+ struct cpuinfo_x86 *c = &boot_cpu_data;
+ struct rmid_list_element *elem;
+ int i;
+
+ mutex_lock(&cqm_mutex);
+
+ if (cpu_has(c, X86_FEATURE_CQM_OCCUP_LLC)) {
+ root_cacheqos_group.subsys_info =
+       kzalloc(sizeof(struct cacheqos_subsys_info), GFP_KERNEL);
+ if (!root_cacheqos_group.subsys_info) {
+ mutex_unlock(&cqm_mutex);
+ return -ENOMEM;
+ }
+
+ root_cacheqos_group.subsys_info->cache_max_rmid =
+  c->x86_cache_max_rmid;
+ root_cacheqos_group.subsys_info->cache_occ_scale =
+ c->x86_cache_occ_scale;
+ root_cacheqos_group.subsys_info->cache_size = c->x86_cache_size;
+ } else {
+ root_cacheqos_group.monitor_cache = false;
+ root_cacheqos_group.css.ss->disabled = 1;
+ mutex_unlock(&cqm_mutex);
+ return -ENODEV;
+ }
+
+ /* Populate the unused rmid list with all rmids. */
+ INIT_LIST_HEAD(&root_cacheqos_group.subsys_info->rmid_unused_fifo);
+ INIT_LIST_HEAD(&root_cacheqos_group.subsys_info->rmid_inuse_list);
+ elem = kzalloc(sizeof(*elem), GFP_KERNEL);
+ if (!elem)
+ return -ENOMEM;
+
+ elem->rmid = 0;
+ list_add_tail(&elem->list,
+      &root_cacheqos_group.subsys_info->rmid_inuse_list);
+ for (i = 1; i < root_cacheqos_group.subsys_info->cache_max_rmid; i++) {
+ elem = kzalloc(sizeof(*elem), GFP_KERNEL);
+ if (!elem)
+ return -ENOMEM;
+
+ elem->rmid = i;
+ INIT_LIST_HEAD(&elem->list);
+ list_add_tail(&elem->list,
+    &root_cacheqos_group.subsys_info->rmid_unused_fifo);
+ }
+
+ /* go live on the root group */
+ root_cacheqos_group.monitor_cache = true;
+
+ mutex_unlock(&cqm_mutex);
+ return 0;
+}
+late_initcall(cacheqos_late_init);
+
+void cacheqos_map_schedule_out(void)
+{
+ /*
+ * cacheqos_map_schedule_in() will set the MSR correctly, but
+ * clearing the MSR here will prevent occupancy counts against this
+ * task during the context switch.  In other words, this gives a
+ * "better" representation of what's happening in the cache.
+ */
+ wrmsrl(IA32_PQR_ASSOC, 0);
+}
+
+void cacheqos_map_schedule_in(struct cacheqos *cq)
+{
+ u64 map;
+
+ map = cq->rmid & IA32_RMID_PQR_MASK;
+ wrmsrl(IA32_PQR_ASSOC, map);
+}
+
+void cacheqos_read(void *arg)
+{
+ struct cacheqos *cq = arg;
+ u64 config;
+ u64 result = 0;
+ int cpu, node;
+
+ cpu = smp_processor_id(),
+ node = cpu_to_node(cpu);
+ config = cq->rmid;
+ config = ((config & IA32_RMID_PQR_MASK) <<
+   IA32_QM_EVTSEL_RMID_POSITION) |
+   IA32_QM_EVTSEL_EVTID_READ_OCC;
+
+ wrmsrl(IA32_QM_EVTSEL, config);
+ rdmsrl(IA32_QM_CTR, result);
+
+ /* place results in sys_wide_info area for recovery */
+ if (result & IA32_QM_CTR_ERR)
+ result = -1;
+ else
+ result &= ~IA32_QM_CTR_ERR;
+
+ cq->subsys_info->node_results[node] =
+      result * cq->subsys_info->cache_occ_scale;
+}
+#endif /* CONFIG_CGROUP_CACHEQOS */
+
 /* Nehalem uncore support */
 static void nhm_uncore_msr_disable_box(struct intel_uncore_box *box)
 {
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index b613ffd..14b97e4 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -50,6 +50,10 @@ SUBSYS(net_prio)
 #if IS_SUBSYS_ENABLED(CONFIG_CGROUP_HUGETLB)
 SUBSYS(hugetlb)
 #endif
+
+#if IS_SUBSYS_ENABLED(CONFIG_CGROUP_CACHEQOS)
+SUBSYS(cacheqos)
+#endif
 /*
  * DO NOT ADD ANY SUBSYSTEM WITHOUT EXPLICIT ACKS FROM CGROUP MAINTAINERS.
  */
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2e069d1..59eabf3 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -54,6 +54,11 @@ struct perf_guest_info_callbacks {
 #include <linux/perf_regs.h>
 #include <asm/local.h>
 
+#ifdef CONFIG_CGROUP_CACHEQOS
+inline void cacheqos_sched_out(struct task_struct *task);
+inline void cacheqos_sched_in(struct task_struct *task);
+#endif /* CONFIG_CGROUP_CACHEQOS */
+
 struct perf_callchain_entry {
  __u64 nr;
  __u64 ip[PERF_MAX_STACK_DEPTH];
@@ -676,6 +681,10 @@ static inline void perf_event_task_sched_in(struct task_struct *prev,
 {
  if (static_key_false(&perf_sched_events.key))
  __perf_event_task_sched_in(prev, task);
+
+#ifdef CONFIG_CGROUP_CACHEQOS
+ cacheqos_sched_in(task);
+#endif /* CONFIG_CGROUP_CACHEQOS */
 }
 
 static inline void perf_event_task_sched_out(struct task_struct *prev,
@@ -685,6 +694,11 @@ static inline void perf_event_task_sched_out(struct task_struct *prev,
 
  if (static_key_false(&perf_sched_events.key))
  __perf_event_task_sched_out(prev, next);
+
+#ifdef CONFIG_CGROUP_CACHEQOS
+ /* use outgoing task to see if cacheqos is active or not */
+ cacheqos_sched_out(prev);
+#endif /* CONFIG_CGROUP_CACHEQOS */
 }
 
 extern void perf_event_mmap(struct vm_area_struct *vma);
diff --git a/init/Kconfig b/init/Kconfig
index 4e5d96a..9619cdc 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -905,6 +905,16 @@ config PROC_PID_CPUSET
  depends on CPUSETS
  default y
 
+config CGROUP_CACHEQOS
+ bool "Simple Cache QoS Monitoring cgroup subsystem"
+ depends on X86 || X86_64
+ help
+  Provides a simple Resource Controller for monitoring the
+  total cache occupancy by the tasks in a cgroup.  This requires
+  hardware support to track cache usage.
+
+  Say N if unsure.
+
 config CGROUP_CPUACCT
  bool "Simple CPU accounting cgroup subsystem"
  help
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 7b62140..30aa883 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -18,3 +18,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
 obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_CGROUP_CACHEQOS) += cacheqos.o
diff --git a/kernel/sched/cacheqos.c b/kernel/sched/cacheqos.c
new file mode 100644
index 0000000..1ce799e
--- /dev/null
+++ b/kernel/sched/cacheqos.c
@@ -0,0 +1,397 @@
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/percpu.h>
+#include <linux/spinlock.h>
+#include <linux/cpumask.h>
+#include <linux/seq_file.h>
+#include <linux/rcupdate.h>
+#include <linux/kernel_stat.h>
+#include <linux/err.h>
+
+#include "cacheqos.h"
+#include "sched.h"
+
+struct cacheqos root_cacheqos_group;
+static DEFINE_MUTEX(cacheqos_mutex);
+
+#if !defined(CONFIG_X86_64) || !defined(CONFIG_X86)
+static int __init cacheqos_late_init(void)
+{
+ /* No Cache QoS support on this architecture, disable the subsystem */
+ root_cacheqos_group.monitor_cache = false;
+ root_cacheqos_group.css.ss->disabled = 1;
+ return -ENODEV;
+}
+late_initcall(cacheqos_late_init);
+#endif
+
+inline void cacheqos_sched_out(struct task_struct *task)
+{
+ struct cacheqos *cq = task_cacheqos(task);
+ /*
+ * Assumption is that this thread is running on the logical processor
+ * from which the task is being scheduled out.
+ *
+ * As the task is scheduled out mapping goes back to default map.
+ */
+ if (cq->monitor_cache)
+ cacheqos_map_schedule_out();
+}
+
+inline void cacheqos_sched_in(struct task_struct *task)
+{
+ struct cacheqos *cq = task_cacheqos(task);
+ /*
+ * Assumption is that this thread is running on the logical processor
+ * of which this task is being scheduled onto.
+ *
+ * As the task is scheduled in, the cgroup's rmid is loaded
+ */
+ if (cq->monitor_cache)
+ cacheqos_map_schedule_in(cq);
+}
+
+static void cacheqos_adjust_children_rmid(struct cacheqos *cq)
+{
+ struct cgroup_subsys_state *css, *pos;
+ struct cacheqos *p_cq, *pos_cq;
+
+ css = &cq->css;
+ rcu_read_lock();
+
+ css_for_each_descendant_pre(pos, css) {
+ pos_cq = css_cacheqos(pos);
+ if (!pos_cq->monitor_cache) {
+ /* monitoring is disabled, so use the parent's RMID */
+ p_cq = parent_cacheqos(pos_cq);
+ spin_lock_irq(&pos_cq->lock);
+ pos_cq->rmid = p_cq->rmid;
+ spin_unlock_irq(&pos_cq->lock);
+ }
+ }
+ rcu_read_unlock();
+}
+
+static int cacheqos_move_rmid_to_unused_list(struct cacheqos *cq)
+{
+ struct rmid_list_element *elem;
+
+ /*
+ * Assumes only called when cq->rmid is valid (ie, it is on the
+ * inuse list) and cacheqos_mutex is held.
+ */
+ lockdep_assert_held(&cacheqos_mutex);
+ list_for_each_entry(elem, &cq->subsys_info->rmid_inuse_list, list) {
+ if (cq->rmid == elem->rmid) {
+ /* Move rmid from inuse to unused list */
+ list_del_init(&elem->list);
+ list_add_tail(&elem->list,
+      &cq->subsys_info->rmid_unused_fifo);
+ goto quick_exit;
+ }
+ }
+ return -ELIBBAD;
+
+quick_exit:
+ return 0;
+}
+
+static int cacheqos_deallocate_rmid(struct cacheqos *cq)
+{
+ struct cacheqos *cq_parent = parent_cacheqos(cq);
+ int err;
+
+ mutex_lock(&cacheqos_mutex);
+ err = cacheqos_move_rmid_to_unused_list(cq);
+ if (err)
+ return err;
+ /* assign parent's rmid to cgroup */
+ cq->monitor_cache = false;
+ cq->rmid = cq_parent->rmid;
+
+ /* Check for children using this cgroup's rmid, iterate */
+ cacheqos_adjust_children_rmid(cq);
+
+ mutex_unlock(&cacheqos_mutex);
+ return 0;
+}
+
+static int cacheqos_allocate_rmid(struct cacheqos *cq)
+{
+ struct rmid_list_element *elem;
+ struct list_head *item;
+
+ mutex_lock(&cacheqos_mutex);
+
+ if (list_empty(&cq->subsys_info->rmid_unused_fifo)) {
+ mutex_unlock(&cacheqos_mutex);
+ return -EAGAIN;
+ }
+
+ /* Move rmid from unused to inuse list */
+ item = cq->subsys_info->rmid_unused_fifo.next;
+ list_del_init(item);
+ list_add_tail(item, &cq->subsys_info->rmid_inuse_list);
+
+ /* assign rmid to cgroup */
+ elem = list_entry(item, struct rmid_list_element, list);
+ cq->rmid = elem->rmid;
+ cq->monitor_cache = true;
+
+ /* Check for children using this cgroup's rmid, iterate */
+ cacheqos_adjust_children_rmid(cq);
+
+ mutex_unlock(&cacheqos_mutex);
+
+ return 0;
+}
+
+/* create a new cacheqos cgroup */
+static struct cgroup_subsys_state *
+cacheqos_css_alloc(struct cgroup_subsys_state *parent_css)
+{
+ struct cacheqos *parent = css_cacheqos(parent_css);
+ struct cacheqos *cq;
+
+ if (!parent) {
+ /* cacheqos_late_init() will enable monitoring on the root */
+ root_cacheqos_group.rmid = 0;
+ return &root_cacheqos_group.css;
+ }
+
+ cq = kzalloc(sizeof(struct cacheqos), GFP_KERNEL);
+ if (!cq)
+ goto out;
+
+ cq->cgrp = parent_css->cgroup;
+ cq->monitor_cache = false; /* disabled i.e., use parent's RMID */
+ cq->rmid = parent->rmid; /* Start by using parent's RMID*/
+ cq->subsys_info = root_cacheqos_group.subsys_info;
+ return &cq->css;
+
+out:
+ return ERR_PTR(-ENOMEM);
+}
+
+/* destroy an existing cacheqos task group */
+static void cacheqos_css_free(struct cgroup_subsys_state *css)
+{
+ struct cacheqos *cq = css_cacheqos(css);
+
+ if (cq->monitor_cache) {
+ mutex_lock(&cacheqos_mutex);
+ cacheqos_move_rmid_to_unused_list(cq);
+ mutex_unlock(&cacheqos_mutex);
+ }
+ kfree(cq);
+}
+
+/* return task group's monitoring state */
+static u64 cacheqos_monitor_read(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ struct cacheqos *cq = css_cacheqos(css);
+
+ return cq->monitor_cache;
+}
+
+/* set the task group's monitoring state */
+static int cacheqos_monitor_write(struct cgroup_subsys_state *css,
+  struct cftype *cftype, u64 enable)
+{
+ struct cacheqos *cq = css_cacheqos(css);
+ int err = 0;
+
+ if (enable != 0 && enable != 1) {
+ err = -EINVAL;
+ goto monitor_out;
+ }
+
+ if (enable && cq->monitor_cache)
+ goto monitor_out;
+
+ if (cq->monitor_cache)
+ err = cacheqos_deallocate_rmid(cq);
+ else
+ err = cacheqos_allocate_rmid(cq);
+
+monitor_out:
+ return err;
+}
+
+static int cacheqos_get_occupancy_data(struct cacheqos *cq)
+{
+ unsigned int cpu;
+ unsigned int node;
+ const struct cpumask *node_cpus;
+ int err = 0;
+
+ /* Assumes cacheqos_mutex is held */
+ lockdep_assert_held(&cacheqos_mutex);
+ for_each_node_with_cpus(node) {
+ node_cpus = cpumask_of_node(node);
+ cpu = any_online_cpu(*node_cpus);
+ err = smp_call_function_single(cpu, cacheqos_read, cq, 1);
+
+ if (err) {
+ break;
+ } else if (cq->subsys_info->node_results[node] == -1) {
+ err = -EPROTO;
+ break;
+ }
+ }
+ return err;
+}
+
+/* return total system LLC occupancy in bytes of a task group */
+static int cacheqos_occupancy_read(struct cgroup_subsys_state *css,
+   struct cftype *cft, struct seq_file *m)
+{
+ struct cacheqos *cq = css_cacheqos(css);
+ u64 total_occupancy = 0;
+ int err, node;
+
+ mutex_lock(&cacheqos_mutex);
+ err = cacheqos_get_occupancy_data(cq);
+ if (err) {
+ mutex_unlock(&cacheqos_mutex);
+ return err;
+ }
+
+ for_each_node_with_cpus(node)
+ total_occupancy += cq->subsys_info->node_results[node];
+
+ mutex_unlock(&cacheqos_mutex);
+
+ seq_printf(m, "%llu\n", total_occupancy);
+ return 0;
+}
+
+/* return display each LLC's occupancy in bytes of a task group */
+static int
+cacheqos_occupancy_persocket_seq_read(struct cgroup_subsys_state *css,
+      struct cftype *cft, struct seq_file *m)
+{
+ struct cacheqos *cq = css_cacheqos(css);
+ int err, node;
+
+ mutex_lock(&cacheqos_mutex);
+ err = cacheqos_get_occupancy_data(cq);
+ if (err) {
+ mutex_unlock(&cacheqos_mutex);
+ return err;
+ }
+
+ for_each_node_with_cpus(node) {
+ seq_printf(m, "%llu\n",
+   cq->subsys_info->node_results[node]);
+ }
+
+ mutex_unlock(&cacheqos_mutex);
+
+ return 0;
+}
+
+/* return total system LLC occupancy as a %of system LLC for the task group */
+static int cacheqos_occupancy_percent_read(struct cgroup_subsys_state *css,
+   struct cftype *cft,
+   struct seq_file *m)
+{
+ struct cacheqos *cq = css_cacheqos(css);
+ u64 total_occupancy = 0;
+ int err, node;
+ int node_cnt = 0;
+ int parts_of_100, parts_of_10000;
+ int cache_size;
+
+ mutex_lock(&cacheqos_mutex);
+ err = cacheqos_get_occupancy_data(cq);
+ if (err) {
+ mutex_unlock(&cacheqos_mutex);
+ return err;
+ }
+
+ for_each_node_with_cpus(node) {
+ ++node_cnt;
+ total_occupancy += cq->subsys_info->node_results[node];
+ }
+
+ mutex_unlock(&cacheqos_mutex);
+
+ cache_size = cq->subsys_info->cache_size * node_cnt;
+ parts_of_100 = (total_occupancy * 100) / (cache_size * 1024);
+ parts_of_10000 = (total_occupancy * 10000) / (cache_size * 1024) -
+ parts_of_100 * 100;
+ seq_printf(m, "%d.%02d\n", parts_of_100, parts_of_10000);
+
+ return 0;
+}
+
+/* return display each LLC's % occupancy of the socket's LLC for task group */
+static int
+cacheqos_occupancy_percent_persocket_seq_read(struct cgroup_subsys_state *css,
+      struct cftype *cft,
+      struct seq_file *m)
+{
+ struct cacheqos *cq = css_cacheqos(css);
+ u64 total_occupancy;
+ int err, node;
+ int cache_size;
+ int parts_of_100, parts_of_10000;
+
+ mutex_lock(&cacheqos_mutex);
+ err = cacheqos_get_occupancy_data(cq);
+ if (err) {
+ mutex_unlock(&cacheqos_mutex);
+ return err;
+ }
+
+ cache_size = cq->subsys_info->cache_size;
+ for_each_node_with_cpus(node) {
+ total_occupancy = cq->subsys_info->node_results[node];
+ parts_of_100 = (total_occupancy * 100) / (cache_size * 1024);
+ parts_of_10000 = (total_occupancy * 10000) /
+ (cache_size * 1024) - parts_of_100 * 100;
+
+ seq_printf(m, "%d.%02d\n", parts_of_100, parts_of_10000);
+ }
+
+ mutex_unlock(&cacheqos_mutex);
+
+ return 0;
+}
+
+static struct cftype cacheqos_files[] = {
+ {
+ .name = "monitor_cache",
+ .read_u64 = cacheqos_monitor_read,
+ .write_u64 = cacheqos_monitor_write,
+ .mode = 0666,
+ .flags = CFTYPE_NOT_ON_ROOT,
+ },
+ {
+ .name = "occupancy_persocket",
+ .read_seq_string = cacheqos_occupancy_persocket_seq_read,
+ },
+ {
+ .name = "occupancy",
+ .read_seq_string = cacheqos_occupancy_read,
+ },
+ {
+ .name = "occupancy_percent_persocket",
+ .read_seq_string = cacheqos_occupancy_percent_persocket_seq_read,
+ },
+ {
+ .name = "occupancy_percent",
+ .read_seq_string = cacheqos_occupancy_percent_read,
+ },
+ { } /* terminate */
+};
+
+struct cgroup_subsys cacheqos_subsys = {
+ .name = "cacheqos",
+ .css_alloc = cacheqos_css_alloc,
+ .css_free = cacheqos_css_free,
+ .subsys_id = cacheqos_subsys_id,
+ .base_cftypes = cacheqos_files,
+};
diff --git a/kernel/sched/cacheqos.h b/kernel/sched/cacheqos.h
new file mode 100644
index 0000000..b20f25e
--- /dev/null
+++ b/kernel/sched/cacheqos.h
@@ -0,0 +1,59 @@
+#ifndef _CACHEQOS_H_
+#define _CACHEQOS_H_
+#ifdef CONFIG_CGROUP_CACHEQOS
+
+#include <linux/cgroup.h>
+
+struct rmid_list_element {
+ int rmid;
+ struct list_head list;
+};
+
+struct cacheqos_subsys_info {
+ struct list_head rmid_unused_fifo;
+ struct list_head rmid_inuse_list;
+ int cache_max_rmid;
+ int cache_occ_scale;
+ int cache_size;
+ u64 node_results[MAX_NUMNODES];
+};
+
+struct cacheqos {
+ struct cgroup_subsys_state css;
+ struct cacheqos_subsys_info *subsys_info;
+ struct cgroup *cgrp;
+ bool monitor_cache; /* false - use parent RMID / true - new RMID */
+
+ /*
+ * Used for walking the task groups to update RMID's of the various
+ * sub-groups.  If monitor_cache is false, the sub-groups will inherit
+ * the parent's RMID.  If monitor_cache is true, then the group has its
+ * own RMID.
+ */
+ spinlock_t lock;
+ u32 rmid;
+};
+
+extern void cacheqos_map_schedule_out(void);
+extern void cacheqos_map_schedule_in(struct cacheqos *);
+extern void cacheqos_read(void *);
+
+/* return cacheqos group corresponding to this container */
+static inline struct cacheqos *css_cacheqos(struct cgroup_subsys_state *css)
+{
+ return css ? container_of(css, struct cacheqos, css) : NULL;
+}
+
+/* return cacheqos group to which this task belongs */
+static inline struct cacheqos *task_cacheqos(struct task_struct *task)
+{
+ return css_cacheqos(task_css(task, cacheqos_subsys_id));
+}
+
+static inline struct cacheqos *parent_cacheqos(struct cacheqos *cacheqos)
+{
+ return css_cacheqos(css_parent(&cacheqos->css));
+}
+
+#endif /* CONFIG_CGROUP_CACHEQOS */
+#endif /* _CACHEQOS_H_ */
--
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

[PATCH 4/4] Documentation: Add documentation for cacheqos cgroup

Waskiewicz Jr, Peter P
In reply to this post by Waskiewicz Jr, Peter P
This patch adds the documentation for the new cacheqos cgroup
subsystem.  It provides the overview of how the new subsystem
works, how Cache QoS Monitoring works in the x86 architecture,
and how everything is tied together between the hardware and the
cgroup software stack.

Signed-off-by: Peter P Waskiewicz Jr <[hidden email]>
---
 Documentation/cgroups/00-INDEX     |   2 +
 Documentation/cgroups/cacheqos.txt | 166 +++++++++++++++++++++++++++++++++++++
 2 files changed, 168 insertions(+)
 create mode 100644 Documentation/cgroups/cacheqos.txt

diff --git a/Documentation/cgroups/00-INDEX b/Documentation/cgroups/00-INDEX
index bc461b6..055655d 100644
--- a/Documentation/cgroups/00-INDEX
+++ b/Documentation/cgroups/00-INDEX
@@ -2,6 +2,8 @@
  - this file
 blkio-controller.txt
  - Description for Block IO Controller, implementation and usage details.
+cacheqos.txt
+ - Description for Cache QoS Monitoring; implementation and usage details
 cgroups.txt
  - Control Groups definition, implementation details, examples and API.
 cpuacct.txt
diff --git a/Documentation/cgroups/cacheqos.txt b/Documentation/cgroups/cacheqos.txt
new file mode 100644
index 0000000..b7b85ce
--- /dev/null
+++ b/Documentation/cgroups/cacheqos.txt
@@ -0,0 +1,166 @@
+Cache QoS Monitoring Controller
+-------------------------------
+
+1. Overview
+===========
+
+The Cache QoS Monitoring controller is used to group tasks using cgroups and
+monitor the CPU cache usage and occupancy of the grouped tasks.  This
+monitoring does require hardware support for this information, especially
+since cache optimization and usage models will vary between CPU architectures.
+
+The Cache QoS Monitoring controller supports multi-hierarchy groups. A
+monitoring group accumulates the cache usage of all of its child groups and
+the tasks directly present in its group.
+
+Monitoring groups can be created by first mounting the cgroup filesystem.
+
+# mount -t cgroup -ocacheqos none /sys/fs/cgroup/cacheqos
+
+With the above step, the initial or the parent monitoring group becomes
+visible at /sys/fs/cgroup/cacheqos. At bootup, this group includes all the
+tasks in the system. /sys/fs/cgroup/cacheqos/tasks lists the tasks in this
+cgroup.  Each file in the cgroup is described in greater detail below.
+
+
+2. Basic usage
+==============
+
+New monitoring groups can be created under the parent group
+/sys/fs/cgroup/cacheqos.
+
+# cd /sys/fs/cgroup/cacheqos
+# mkdir g1
+# echo $$ > g1/tasks
+
+The above steps create a new group g1 and move the current shell
+process (bash) into it. At this point, the group is ready to be monitored.
+However, since this process requires hardware support to identify tasks
+properly, the mechanisms in the hardware are most likely a finite resource.
+So new monitoring groups are not activated by default to monitor their
+respective task groups.
+
+To enable a task group for hardware monitoring:
+
+# cd /sys/fs/cgroup/cacheqos
+# mkdir g1
+# echo $$ > g1/tasks
+# echo 1 > g1/cacheqos.monitor_cache
+
+This will enable monitoring for the tasks in the g1 monitoring group.  Note
+that the root monitoring group is always enabled and cannot be turned off.
+
+
+3. Overview of files
+====================
+
+- cacheqos.monitor_cache:
+ Controls whether or not the monitoring group is enabled or not.  This
+ is a R/W field, and expects 0 for disable, 1 for enable.
+
+ If no available hardware resources are left for monitoring, writing a
+ 1 to this file will result in -EAGAIN being returned (Resource
+ temporarily unavailable).
+
+- cacheqos.occupancy:
+ This is a read-only field.  It returns the total cache occupancy in
+ bytes of the task group for all CPUs it has run on.
+
+- cacheqos.occupancy_percent:
+ This is a read-only field.  It returns the total cache occupancy used
+ as a percentage for all CPUs it has run on.  The percentage is based
+ on the size of the cache, which can obviously vary from CPU to CPU.
+
+- cacheqos.occupancy_persocket:
+ This is a read-only field.  It returns the total cache occupancy used
+ by the task group, broken down per CPU socket (usually per NUMA node).
+
+- cacheqos.occupancy_percent_persocket:
+ This is a read-only field.  It returns the total cache occupancy used
+ by the task group, broken down per CPU socket (usually per NUMA node).
+ Each socket's occupancy is presented as a percentage of the total
+ cache.
+
+4. Adding new architectures
+===========================
+
+Currently Cache QoS Monitoring support only exists in modern Intel Xeon
+processors.  Due to this, the Kconfig option for Cache QoS Monitoring depends
+on X86_64 or X86.  If another architecture supports cache monitoring, then
+a few functions need to be implemented by the architecture, and that
+architecture needs to be added to some #if clauses for support.  These are:
+
+- init/Kconfig
+ Add the new architecture to the dependancy list
+
+- kernel/sched/cacheqos.c
+ Add the new architecture to the #if line to compile out
+ cacheqos_late_init():
+
+ #if !defined(CONFIG_X86_64) || !defined(CONFIG_X86)
+ static int __init cacheqos_late_init(void)          ^^^^^^^
+
+The following functions need to be implemented by the architecture:
+
+- void cacheqos_map_schedule_out(void);
+ This function is called by the scheduler when swapping out a task from
+ a CPU.  This would be where the CPU architecture code to stop monitoring
+ for a particular task would be executed.
+
+ Refer to arch/x86/kernel/cpu/perf_event_intel_uncore.c for an example.
+
+- void cacheqos_map_schedule_in(struct cacheqos *);
+ This function is called by the scheduler when swapping a task into a
+ CPU core.  This would be where the CPU architecture code to start
+ monitoring a particular task would be executed.
+
+ Refer to arch/x86/kernel/cpu/perf_event_intel_uncore.c for an example.
+
+- void cacheqos_read(void *);
+ This function is called by the cacheqos cgroup subsystem when
+ collating the cache usage data.  This would be where the CPU
+ architecture code to pull information for a particular monitoring
+ unit would exist.
+
+ Refer to arch/x86/kernel/cpu/perf_event_intel_uncore.c for an example.
+
+- int __init cacheqos_late_init(void);  (late_initcall)
+ This function needs to be implemented as late_initcall for the
+ specific architecture.  The reason for a later invocation is the
+ CPU features can be determined, which happens after the cgroup subsystem
+ is started in the kernel boot sequence.  Since the configuration of
+ the cacheqos cgroup depends on how much of particular monitoring
+ resources are available, the cgroup's root_cacheqos_group's subsys_info
+ field cannot be initialized until the CPU features are discovered.
+
+ This function's responsibility is to allocate the
+ root_cacheqos_group.subsys_info field and initialize these fields:
+ - cache_max_rmid: Maximum resource monitoring ID on this CPU
+ - cache_occ_scale: This is used to scale the occupancy data
+   being collected, meant to help compress the
+   values being stored in the CPU.  This may
+   exist or not in a particular architecture.
+ - cache_size: Size of the cache being monitored, used for the
+      percentage reporting.
+
+ Refer to arch/x86/kernel/cpu/perf_event_intel_uncore.c for an example.
+
+
+5. Intel-specific implementation
+================================
+
+Intel Xeon processors implement Cache QoS Monitoring using Resource Monitoring
+Identifiers, or RMIDs.  When a task is scheduled on a CPU core, the RMID that
+is associated with that task (or group that task belongs to) is written to the
+IA32_PRQ_ASSOC MSR for that CPU.  That instructs that CPU to accumulate cache
+occupancy data while that task runs.  When that task is scheduled out, the
+IA32_PQR_ASSOC MSR is written with a 0, clearing the monitoring mechanism.
+
+To retrieve the monitoring data, the RMID for the task group being read is
+used to build a configuration map for the IA32_QM_EVTSEL MSR.  Once the map is
+written to that MSR, the result is written to the IA32_QM_CTR MSR.  That data
+is then stored, but multiplied by the cache_occ_scale, which is read from the
+CPUID sub-leaf during CPU initialization.
+
+For details on the implementation, please refer to the Intel Software
+Development Manual, Volume 3, Chapter 17.14: Cache Quality of Service Monitoring
--
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

Tejun Heo-2
In reply to this post by Waskiewicz Jr, Peter P
Hello,

On Fri, Jan 03, 2014 at 12:34:41PM -0800, Peter P Waskiewicz Jr wrote:
> The CPU features themselves are relatively straight-forward, but
> the presentation of the data is less straight-forward.  Since this
> tracks cache usage and occupancy per process (by swapping Resource
> Monitor IDs, or RMIDs, when processes are rescheduled), perf would
> not be a good fit for this data, which does not report on a
> per-process level.  Therefore, a new cgroup subsystem, cacheqos, has
> been added.  This operates very similarly to the cpu and cpuacct
> cgroup subsystems, where tasks can be grouped into sub-leaves of the
> root-level cgroup.

I don't really understand why this is implemented as part of cgroup.
There doesn't seem to be anything which requires cgroup.  Wouldn't
just doing it per-process make more sense?  Even grouping would be
better done along the traditional process hierarchy, no?  And
per-cgroup accounting can be trivially achieved from userland by just
accumulating the stats according to the process's cgroup membership.
What am I missing here?

Thanks.

--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

Waskiewicz Jr, Peter P
On Sat, 2014-01-04 at 11:10 -0500, Tejun Heo wrote:
> Hello,

Hi Tejun,

> On Fri, Jan 03, 2014 at 12:34:41PM -0800, Peter P Waskiewicz Jr wrote:
> > The CPU features themselves are relatively straight-forward, but
> > the presentation of the data is less straight-forward.  Since this
> > tracks cache usage and occupancy per process (by swapping Resource
> > Monitor IDs, or RMIDs, when processes are rescheduled), perf would
> > not be a good fit for this data, which does not report on a
> > per-process level.  Therefore, a new cgroup subsystem, cacheqos, has
> > been added.  This operates very similarly to the cpu and cpuacct
> > cgroup subsystems, where tasks can be grouped into sub-leaves of the
> > root-level cgroup.
>
> I don't really understand why this is implemented as part of cgroup.
> There doesn't seem to be anything which requires cgroup.  Wouldn't
> just doing it per-process make more sense?  Even grouping would be
> better done along the traditional process hierarchy, no?  And
> per-cgroup accounting can be trivially achieved from userland by just
> accumulating the stats according to the process's cgroup membership.
> What am I missing here?

Thanks for the quick response!  I knew the approach would generate
questions, so let me explain.

The feature I'm enabling in the Xeon processors is fairly simple.  It
has a set of Resource Monitoring ID's (RMIDs), and those are used by the
CPU cores to track the cache usage while any process associated with the
RMID is running.  The more complicated part is how to present the
interface of creating RMID groups and assigning processes to them for
both tracking, and for stat collection.

We discussed (internally) a few different approaches to implement this.
The first natural thought was this is similar to other PMU features, but
this deals with processes and groups of processes, not overall CPU core
or uncore state.  Given the way processes in a cgroup can be grouped
together and treated as single entities, this felt like a natural fit
with the RMID concept.

Simply put, when we want to allocate an RMID for monitoring httpd
traffic, we can create a new child in the subsystem hierarchy, and
assign the httpd processes to it.  Then the RMID can be assigned to the
subsystem, and each process inherits that RMID.  So instead of dealing
with assigning an RMID to each and every process, we can leverage the
existing cgroup mechanisms for grouping processes and their children to
a group, and they inherit the RMID.

Please let me know if this is a better explanation, and gives a better
picture of why we decided to approach the implementation this way.  Also
note that this feature, Cache QoS Monitoring, is the first in a series
of Platform QoS Monitoring features that will be coming.  So this isn't
a one-off feature, so however this first piece gets accepted, we want to
make sure it's easy to expand and not impact userspace tools repeatedly
(if possible).

Cheers,
-PJ Waskiewicz

--------------
Intel Open Source Technology Center
N�����r��y���b�X��ǧv�^�)޺{.n�+����{����zX����ܨ}���Ơz�&j:+v�������zZ+��+zf���h���~����i���z��w���?����&�)ߢf��^jǫy�m��@A�a��� 0��h��i
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

Tejun Heo-2
Hello,

On Sat, Jan 04, 2014 at 10:43:00PM +0000, Waskiewicz Jr, Peter P wrote:
> Simply put, when we want to allocate an RMID for monitoring httpd
> traffic, we can create a new child in the subsystem hierarchy, and
> assign the httpd processes to it.  Then the RMID can be assigned to the
> subsystem, and each process inherits that RMID.  So instead of dealing
> with assigning an RMID to each and every process, we can leverage the
> existing cgroup mechanisms for grouping processes and their children to
> a group, and they inherit the RMID.

Here's one thing that I don't get, possibly because I'm not
understanding the processor feature too well.  Why does the processor
have to be aware of the grouping?  ie. why can't it be done
per-process and then aggregated?  Is there something inherent about
the monitored events which requires such peculiarity?  Or is it that
accessing the stats data is noticieably expensive to do per context
switch?

> Please let me know if this is a better explanation, and gives a better
> picture of why we decided to approach the implementation this way.  Also
> note that this feature, Cache QoS Monitoring, is the first in a series
> of Platform QoS Monitoring features that will be coming.  So this isn't
> a one-off feature, so however this first piece gets accepted, we want to
> make sure it's easy to expand and not impact userspace tools repeatedly
> (if possible).

In general, I'm quite strongly opposed against using cgroup as
arbitrary grouping mechanism for anything other than resource control,
especially given that we're moving away from multiple hierarchies.

Thanks.

--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

Waskiewicz Jr, Peter P
On Sat, 2014-01-04 at 17:50 -0500, Tejun Heo wrote:
> Hello,

Hi Tejun,

> On Sat, Jan 04, 2014 at 10:43:00PM +0000, Waskiewicz Jr, Peter P wrote:
> > Simply put, when we want to allocate an RMID for monitoring httpd
> > traffic, we can create a new child in the subsystem hierarchy, and
> > assign the httpd processes to it.  Then the RMID can be assigned to the
> > subsystem, and each process inherits that RMID.  So instead of dealing
> > with assigning an RMID to each and every process, we can leverage the
> > existing cgroup mechanisms for grouping processes and their children to
> > a group, and they inherit the RMID.
>
> Here's one thing that I don't get, possibly because I'm not
> understanding the processor feature too well.  Why does the processor
> have to be aware of the grouping?  ie. why can't it be done
> per-process and then aggregated?  Is there something inherent about
> the monitored events which requires such peculiarity?  Or is it that
> accessing the stats data is noticieably expensive to do per context
> switch?

The processor doesn't need to understand the grouping at all, but it
also isn't tracking things per-process that are rolled up later.
They're tracked via the RMID resource in the hardware, which could
correspond to a single process, or 500 processes.  It really comes down
to the ease of management of grouping tasks in groups for two consumers,
1) the end user, and 2) the process scheduler.

I think I still may not be explaining how the CPU side works well
enough, in order to better understand what I'm trying to do with the
cgroup.  Let me try to be a bit more clear, and if I'm still sounding
vague or not making sense, please tell me what isn't clear and I'll try
to be more specific.  The new Documentation addition in patch 4 also has
a good overview, but let's try this:

A CPU may have 32 RMID's in hardware.  This is for the platform, not per
core.  I may want to have a single process assigned to an RMID for
tracking, say qemu to monitor cache usage of a specific VM.  But I also
may want to monitor cache usage of all MySQL database processes with
another RMID, or even split specific processes of that database between
different RMID's.  It all comes down to how the end-user wants to
monitor their specific workloads, and how those workloads are impacting
cache usage and occupancy.

With this implementation I've sent, all tasks are in RMID 0 by default.
Then one can create a subdirectory, just like the cpuacct cgroup, and
then add tasks to that subdirectory's task list.  Once that
subdirectory's task list is enabled (through the cacheqos.monitor_cache
handle), then a free RMID is assigned from the CPU, and when the
scheduler switches to any of the tasks in that cgroup under that RMID,
the RMID begins monitoring the usage.

The CPU side is easy and clean.  When something in the software wants to
monitor when a particular task is scheduled and started, write whatever
RMID that task is assigned to (through some mechanism) to the proper MSR
in the CPU.  When that task is swapped out, clear the MSR to stop
monitoring of that RMID.  When that RMID's statistics are requested by
the software (through some mechanism), then the CPU's MSRs are written
with the RMID in question, and the value is read of what has been
collected so far.  In my case, I decided to use a cgroup for this
"mechanism" since so much of the grouping and task/group association
already exists and doesn't need to be rebuilt or re-invented.

> > Please let me know if this is a better explanation, and gives a better
> > picture of why we decided to approach the implementation this way.  Also
> > note that this feature, Cache QoS Monitoring, is the first in a series
> > of Platform QoS Monitoring features that will be coming.  So this isn't
> > a one-off feature, so however this first piece gets accepted, we want to
> > make sure it's easy to expand and not impact userspace tools repeatedly
> > (if possible).
>
> In general, I'm quite strongly opposed against using cgroup as
> arbitrary grouping mechanism for anything other than resource control,
> especially given that we're moving away from multiple hierarchies.

Just to clarify then, would the mechanism in the cpuacct cgroup to
create a group off the root subsystem be considered multi-hierarchical?
If not, then the intent for this new cacheqos subsystem is to be
identical in that regard to cpuacct in the behavior.

This is a resource controller, it just happens to be tied to a hardware
resource instead of an OS resource.

Cheers,
-PJ

--
PJ Waskiewicz Open Source Technology Center
[hidden email] Intel Corp.
N�����r��y���b�X��ǧv�^�)޺{.n�+����{����zX����ܨ}���Ơz�&j:+v�������zZ+��+zf���h���~����i���z��w���?����&�)ߢf��^jǫy�m��@A�a��� 0��h��i
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

Peter Zijlstra-5
In reply to this post by Waskiewicz Jr, Peter P
On Fri, Jan 03, 2014 at 12:34:41PM -0800, Peter P Waskiewicz Jr wrote:
> The CPU features themselves are relatively straight-forward, but
> the presentation of the data is less straight-forward.  Since this
> tracks cache usage and occupancy per process (by swapping Resource
> Monitor IDs, or RMIDs, when processes are rescheduled), perf would
> not be a good fit for this data, which does not report on a
> per-process level.  Therefore, a new cgroup subsystem, cacheqos, has
> been added.  This operates very similarly to the cpu and cpuacct
> cgroup subsystems, where tasks can be grouped into sub-leaves of the
> root-level cgroup.

This doesn't make any sense.. From a quick SDM read you can do pretty
much whatever with those RMIDs. If you allocate a RMID per task (thread
in userspace) you can actually measure things on a task basis.

From then on you can use perf-cgroup to group whatever tasks you want.

So please be more explicit in why you think this doesn't fit into perf.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

Peter Zijlstra-5
In reply to this post by Waskiewicz Jr, Peter P
On Sun, Jan 05, 2014 at 05:23:07AM +0000, Waskiewicz Jr, Peter P wrote:

> The processor doesn't need to understand the grouping at all, but it
> also isn't tracking things per-process that are rolled up later.
> They're tracked via the RMID resource in the hardware, which could
> correspond to a single process, or 500 processes.  It really comes down
> to the ease of management of grouping tasks in groups for two consumers,
> 1) the end user, and 2) the process scheduler.
>
> I think I still may not be explaining how the CPU side works well
> enough, in order to better understand what I'm trying to do with the
> cgroup.  Let me try to be a bit more clear, and if I'm still sounding
> vague or not making sense, please tell me what isn't clear and I'll try
> to be more specific.  The new Documentation addition in patch 4 also has
> a good overview, but let's try this:
>
> A CPU may have 32 RMID's in hardware.  This is for the platform, not per
> core.  I may want to have a single process assigned to an RMID for
> tracking, say qemu to monitor cache usage of a specific VM.  But I also
> may want to monitor cache usage of all MySQL database processes with
> another RMID, or even split specific processes of that database between
> different RMID's.  It all comes down to how the end-user wants to
> monitor their specific workloads, and how those workloads are impacting
> cache usage and occupancy.
>
> With this implementation I've sent, all tasks are in RMID 0 by default.
> Then one can create a subdirectory, just like the cpuacct cgroup, and
> then add tasks to that subdirectory's task list.  Once that
> subdirectory's task list is enabled (through the cacheqos.monitor_cache
> handle), then a free RMID is assigned from the CPU, and when the
> scheduler switches to any of the tasks in that cgroup under that RMID,
> the RMID begins monitoring the usage.
>
> The CPU side is easy and clean.  When something in the software wants to
> monitor when a particular task is scheduled and started, write whatever
> RMID that task is assigned to (through some mechanism) to the proper MSR
> in the CPU.  When that task is swapped out, clear the MSR to stop
> monitoring of that RMID.  When that RMID's statistics are requested by
> the software (through some mechanism), then the CPU's MSRs are written
> with the RMID in question, and the value is read of what has been
> collected so far.  In my case, I decided to use a cgroup for this
> "mechanism" since so much of the grouping and task/group association
> already exists and doesn't need to be rebuilt or re-invented.

This still doesn't explain why you can't use perf-cgroup for this.

> > In general, I'm quite strongly opposed against using cgroup as
> > arbitrary grouping mechanism for anything other than resource control,
> > especially given that we're moving away from multiple hierarchies.
>
> Just to clarify then, would the mechanism in the cpuacct cgroup to
> create a group off the root subsystem be considered multi-hierarchical?
> If not, then the intent for this new cacheqos subsystem is to be
> identical in that regard to cpuacct in the behavior.
>
> This is a resource controller, it just happens to be tied to a hardware
> resource instead of an OS resource.

No, cpuacct and perf-cgroup aren't actually controllers at all. They're
resource monitors at best. Same with your Cache QoS Monitor, it doesn't
control anything.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

Waskiewicz Jr, Peter P
On Mon, 2014-01-06 at 12:16 +0100, Peter Zijlstra wrote:

> On Sun, Jan 05, 2014 at 05:23:07AM +0000, Waskiewicz Jr, Peter P wrote:
> > The CPU side is easy and clean.  When something in the software wants to
> > monitor when a particular task is scheduled and started, write whatever
> > RMID that task is assigned to (through some mechanism) to the proper MSR
> > in the CPU.  When that task is swapped out, clear the MSR to stop
> > monitoring of that RMID.  When that RMID's statistics are requested by
> > the software (through some mechanism), then the CPU's MSRs are written
> > with the RMID in question, and the value is read of what has been
> > collected so far.  In my case, I decided to use a cgroup for this
> > "mechanism" since so much of the grouping and task/group association
> > already exists and doesn't need to be rebuilt or re-invented.
>
> This still doesn't explain why you can't use perf-cgroup for this.

I'm not completely familiar with perf-cgroup, so I looked for some
documentation for it to better understand it.  Are you referring to perf
-G to monitor an existing cgroup/all cgroups?  Or something else?  If
it's the former, I'm not following you how this would fit.

> > > In general, I'm quite strongly opposed against using cgroup as
> > > arbitrary grouping mechanism for anything other than resource control,
> > > especially given that we're moving away from multiple hierarchies.
> >
> > Just to clarify then, would the mechanism in the cpuacct cgroup to
> > create a group off the root subsystem be considered multi-hierarchical?
> > If not, then the intent for this new cacheqos subsystem is to be
> > identical in that regard to cpuacct in the behavior.
> >
> > This is a resource controller, it just happens to be tied to a hardware
> > resource instead of an OS resource.
>
> No, cpuacct and perf-cgroup aren't actually controllers at all. They're
> resource monitors at best. Same with your Cache QoS Monitor, it doesn't
> control anything.

I may be using controller in a different way than you are.  Yes, the
Cache QoS Monitor is monitoring cache data.  But it is also controlling
the allocation and deallocation of RMIDs to tasks/task groups as
monitoring is enabled and disabled for those groups.  That's why I
called it a controller.  If that's not accurate, I apologize.

Cheers,
-PJ

--
PJ Waskiewicz Open Source Technology Center
[hidden email] Intel Corp.
N�����r��y���b�X��ǧv�^�)޺{.n�+����{����zX����ܨ}���Ơz�&j:+v�������zZ+��+zf���h���~����i���z��w���?����&�)ߢf��^jǫy�m��@A�a��� 0��h��i
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

Peter Zijlstra-5
On Mon, Jan 06, 2014 at 04:34:04PM +0000, Waskiewicz Jr, Peter P wrote:

> On Mon, 2014-01-06 at 12:16 +0100, Peter Zijlstra wrote:
> > On Sun, Jan 05, 2014 at 05:23:07AM +0000, Waskiewicz Jr, Peter P wrote:
> > > The CPU side is easy and clean.  When something in the software wants to
> > > monitor when a particular task is scheduled and started, write whatever
> > > RMID that task is assigned to (through some mechanism) to the proper MSR
> > > in the CPU.  When that task is swapped out, clear the MSR to stop
> > > monitoring of that RMID.  When that RMID's statistics are requested by
> > > the software (through some mechanism), then the CPU's MSRs are written
> > > with the RMID in question, and the value is read of what has been
> > > collected so far.  In my case, I decided to use a cgroup for this
> > > "mechanism" since so much of the grouping and task/group association
> > > already exists and doesn't need to be rebuilt or re-invented.
> >
> > This still doesn't explain why you can't use perf-cgroup for this.
>
> I'm not completely familiar with perf-cgroup, so I looked for some
> documentation for it to better understand it.  Are you referring to perf
> -G to monitor an existing cgroup/all cgroups?  Or something else?  If
> it's the former, I'm not following you how this would fit.

All the bits under CONFIG_CGROUP_PERF, I've no idea how userspace looks.

> > > > In general, I'm quite strongly opposed against using cgroup as
> > > > arbitrary grouping mechanism for anything other than resource control,
> > > > especially given that we're moving away from multiple hierarchies.
> > >
> > > Just to clarify then, would the mechanism in the cpuacct cgroup to
> > > create a group off the root subsystem be considered multi-hierarchical?
> > > If not, then the intent for this new cacheqos subsystem is to be
> > > identical in that regard to cpuacct in the behavior.
> > >
> > > This is a resource controller, it just happens to be tied to a hardware
> > > resource instead of an OS resource.
> >
> > No, cpuacct and perf-cgroup aren't actually controllers at all. They're
> > resource monitors at best. Same with your Cache QoS Monitor, it doesn't
> > control anything.
>
> I may be using controller in a different way than you are.  Yes, the
> Cache QoS Monitor is monitoring cache data.  But it is also controlling
> the allocation and deallocation of RMIDs to tasks/task groups as
> monitoring is enabled and disabled for those groups.  That's why I
> called it a controller.  If that's not accurate, I apologize.

Yeah that's not accurate, nor desired I think, because you get into
horrible problems with hierarchies, do child groups belong to your RMID
or not?

As is I don't really see a good use for RMIDs and I would simply not use
them.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

Waskiewicz Jr, Peter P
In reply to this post by Peter Zijlstra-5
On Mon, 2014-01-06 at 12:08 +0100, Peter Zijlstra wrote:

> On Fri, Jan 03, 2014 at 12:34:41PM -0800, Peter P Waskiewicz Jr wrote:
> > The CPU features themselves are relatively straight-forward, but
> > the presentation of the data is less straight-forward.  Since this
> > tracks cache usage and occupancy per process (by swapping Resource
> > Monitor IDs, or RMIDs, when processes are rescheduled), perf would
> > not be a good fit for this data, which does not report on a
> > per-process level.  Therefore, a new cgroup subsystem, cacheqos, has
> > been added.  This operates very similarly to the cpu and cpuacct
> > cgroup subsystems, where tasks can be grouped into sub-leaves of the
> > root-level cgroup.
>
> This doesn't make any sense.. From a quick SDM read you can do pretty
> much whatever with those RMIDs. If you allocate a RMID per task (thread
> in userspace) you can actually measure things on a task basis.

Exactly.  An RMID can be assigned to a single task or a group of tasks.
Because the RMID is a hardware resource and is limited, the
implementation of using it is what we're really discussing here.  Our
approach is to either monitor per-task, or per group of tasks.

> From then on you can use perf-cgroup to group whatever tasks you want.
>
> So please be more explicit in why you think this doesn't fit into perf.

I said this in my other reply to the other thread, but I'll ask again
because I'm not following.  I'm looking for information on perf-cgroup,
and I all see is a way to monitor CPU events for tasks in a cgroup (perf
-G option).

The other part I'm not seeing is how to control the RMIDs being
allocated across to different groups.  There may be 100 task groups to
monitor, but only 32 RMIDs.  So the RMIDs need to be handed out to
active tasks and then enabled, data extracted, then disabled.  That was
the intent of the cacheqos.monitor_cache knob.

The bottom line is I'm asking for a bit more information from you about
perf-cgroup, since it sounds like you see a fit for CQM here, and I'm
not seeing what you're looking at yet.  Any information is much
appreciated.

Cheers,
-PJ

--
PJ Waskiewicz Open Source Technology Center
[hidden email] Intel Corp.
N�����r��y���b�X��ǧv�^�)޺{.n�+����{����zX����ܨ}���Ơz�&j:+v�������zZ+��+zf���h���~����i���z��w���?����&�)ߢf��^jǫy�m��@A�a��� 0��h��i
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

Waskiewicz Jr, Peter P
In reply to this post by Peter Zijlstra-5
On Mon, 2014-01-06 at 17:41 +0100, Peter Zijlstra wrote:

> On Mon, Jan 06, 2014 at 04:34:04PM +0000, Waskiewicz Jr, Peter P wrote:
> > On Mon, 2014-01-06 at 12:16 +0100, Peter Zijlstra wrote:
> > > On Sun, Jan 05, 2014 at 05:23:07AM +0000, Waskiewicz Jr, Peter P wrote:
> > > > The CPU side is easy and clean.  When something in the software wants to
> > > > monitor when a particular task is scheduled and started, write whatever
> > > > RMID that task is assigned to (through some mechanism) to the proper MSR
> > > > in the CPU.  When that task is swapped out, clear the MSR to stop
> > > > monitoring of that RMID.  When that RMID's statistics are requested by
> > > > the software (through some mechanism), then the CPU's MSRs are written
> > > > with the RMID in question, and the value is read of what has been
> > > > collected so far.  In my case, I decided to use a cgroup for this
> > > > "mechanism" since so much of the grouping and task/group association
> > > > already exists and doesn't need to be rebuilt or re-invented.
> > >
> > > This still doesn't explain why you can't use perf-cgroup for this.
> >
> > I'm not completely familiar with perf-cgroup, so I looked for some
> > documentation for it to better understand it.  Are you referring to perf
> > -G to monitor an existing cgroup/all cgroups?  Or something else?  If
> > it's the former, I'm not following you how this would fit.
>
> All the bits under CONFIG_CGROUP_PERF, I've no idea how userspace looks.

Ah ok.  Yes, the userspace side of perf really doesn't fit controlling
the CQM bits at all from what I see.

> > > > > In general, I'm quite strongly opposed against using cgroup as
> > > > > arbitrary grouping mechanism for anything other than resource control,
> > > > > especially given that we're moving away from multiple hierarchies.
> > > >
> > > > Just to clarify then, would the mechanism in the cpuacct cgroup to
> > > > create a group off the root subsystem be considered multi-hierarchical?
> > > > If not, then the intent for this new cacheqos subsystem is to be
> > > > identical in that regard to cpuacct in the behavior.
> > > >
> > > > This is a resource controller, it just happens to be tied to a hardware
> > > > resource instead of an OS resource.
> > >
> > > No, cpuacct and perf-cgroup aren't actually controllers at all. They're
> > > resource monitors at best. Same with your Cache QoS Monitor, it doesn't
> > > control anything.
> >
> > I may be using controller in a different way than you are.  Yes, the
> > Cache QoS Monitor is monitoring cache data.  But it is also controlling
> > the allocation and deallocation of RMIDs to tasks/task groups as
> > monitoring is enabled and disabled for those groups.  That's why I
> > called it a controller.  If that's not accurate, I apologize.
>
> Yeah that's not accurate, nor desired I think, because you get into
> horrible problems with hierarchies, do child groups belong to your RMID
> or not?

I'd rather not support a child group of a child group.  Only groups off
the root, and each group would be assigned an RMID when it's activated
for monitoring.

> As is I don't really see a good use for RMIDs and I would simply not use
> them.

If you want to use CQM in the hardware, then the RMID is how you get the
cache usage data from the CPU.  If you don't want to use CQM, then you
can ignore RMIDs.

One of the best use cases for using RMIDs is in virtualization.  A VM
may be a heavy cache user, or a light cache user.  Tracing different VMs
on different RMIDs can allow an admin to identify which VM may be
causing high levels of eviction, and either migrate it to another host,
or move other tasks/VMs to other hosts.  Without CQM, it's much harder
to find which process is eating the cache up.

Cheers,
-PJ

--
PJ Waskiewicz Open Source Technology Center
[hidden email] Intel Corp.
N�����r��y���b�X��ǧv�^�)޺{.n�+����{����zX����ܨ}���Ơz�&j:+v�������zZ+��+zf���h���~����i���z��w���?����&�)ߢf��^jǫy�m��@A�a��� 0��h��i
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

Peter Zijlstra-5
On Mon, Jan 06, 2014 at 04:47:57PM +0000, Waskiewicz Jr, Peter P wrote:
> > Yeah that's not accurate, nor desired I think, because you get into
> > horrible problems with hierarchies, do child groups belong to your RMID
> > or not?
>
> I'd rather not support a child group of a child group.  Only groups off
> the root, and each group would be assigned an RMID when it's activated
> for monitoring.

Yeah, that's a complete non started for cgroups. Cgroups need to be
completely hierarchical.

So even the root group should represent all tasks; which if you fragment
RMIDs on child cgroups doesn't work anymore.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

Waskiewicz Jr, Peter P
On Mon, 2014-01-06 at 18:53 +0100, Peter Zijlstra wrote:

> On Mon, Jan 06, 2014 at 04:47:57PM +0000, Waskiewicz Jr, Peter P wrote:
> > > Yeah that's not accurate, nor desired I think, because you get into
> > > horrible problems with hierarchies, do child groups belong to your RMID
> > > or not?
> >
> > I'd rather not support a child group of a child group.  Only groups off
> > the root, and each group would be assigned an RMID when it's activated
> > for monitoring.
>
> Yeah, that's a complete non started for cgroups. Cgroups need to be
> completely hierarchical.
>
> So even the root group should represent all tasks; which if you fragment
> RMIDs on child cgroups doesn't work anymore.

The root group does represent all tasks in the current patchset on RMID
0.  Then any child assigned to another group will be assigned to a
different RMID.  It looks like this:

                       root (rmid 0)
                       /  \
             (rmid 4) g1  g2 (rmid 16)

We could keep going down from there, but I don't see it buying anything
extra.

Cheers,
-PJ

--
PJ Waskiewicz Open Source Technology Center
[hidden email] Intel Corp.
N�����r��y���b�X��ǧv�^�)޺{.n�+����{����zX����ܨ}���Ơz�&j:+v�������zZ+��+zf���h���~����i���z��w���?����&�)ߢf��^jǫy�m��@A�a��� 0��h��i
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

Peter Zijlstra-5
In reply to this post by Waskiewicz Jr, Peter P
On Mon, Jan 06, 2014 at 04:47:57PM +0000, Waskiewicz Jr, Peter P wrote:
> > As is I don't really see a good use for RMIDs and I would simply not use
> > them.
>
> If you want to use CQM in the hardware, then the RMID is how you get the
> cache usage data from the CPU.  If you don't want to use CQM, then you
> can ignore RMIDs.

I think you can make do with a single RMID (per cpu). When you program
the counter (be it for a task, cpu or cgroup context) you set the 1 RMID
and EVSEL and read the CTR.

What I'm not entirely clear on is if the EVSEL and CTR MSR are per
logical CPU or per L3 (package); /me prays they're per logical CPU.

> One of the best use cases for using RMIDs is in virtualization.

*groan*.. /me plugs wax in ears and goes la-la-la-la

> A VM
> may be a heavy cache user, or a light cache user.  Tracing different VMs
> on different RMIDs can allow an admin to identify which VM may be
> causing high levels of eviction, and either migrate it to another host,
> or move other tasks/VMs to other hosts.  Without CQM, it's much harder
> to find which process is eating the cache up.

Not necessarily VMs, there's plenty large processes that exhibit similar
problems.. why must people always do VMs :-(

That said, even with a single RMID you can get that information by
simply running it against all competing processes one at a time. Since
there's limited RMID space you need to rotate at some point anyway.

The cgroup interface you propose wouldn't allow for rotation; other than
manual by creating different cgroups one after another.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

Waskiewicz Jr, Peter P
On Mon, 2014-01-06 at 19:06 +0100, Peter Zijlstra wrote:

> On Mon, Jan 06, 2014 at 04:47:57PM +0000, Waskiewicz Jr, Peter P wrote:
> > > As is I don't really see a good use for RMIDs and I would simply not use
> > > them.
> >
> > If you want to use CQM in the hardware, then the RMID is how you get the
> > cache usage data from the CPU.  If you don't want to use CQM, then you
> > can ignore RMIDs.
>
> I think you can make do with a single RMID (per cpu). When you program
> the counter (be it for a task, cpu or cgroup context) you set the 1 RMID
> and EVSEL and read the CTR.
>
> What I'm not entirely clear on is if the EVSEL and CTR MSR are per
> logical CPU or per L3 (package); /me prays they're per logical CPU.

There is one per logical CPU.  However, in the current generation, they
report on the usage of the same L3 cache.  But the CPU takes care of the
resolution of which MSR write and read comes from the logical CPU, so
software doesn't need to lock access to it from different CPUs.

> > One of the best use cases for using RMIDs is in virtualization.
>
> *groan*.. /me plugs wax in ears and goes la-la-la-la
>
> > A VM
> > may be a heavy cache user, or a light cache user.  Tracing different VMs
> > on different RMIDs can allow an admin to identify which VM may be
> > causing high levels of eviction, and either migrate it to another host,
> > or move other tasks/VMs to other hosts.  Without CQM, it's much harder
> > to find which process is eating the cache up.
>
> Not necessarily VMs, there's plenty large processes that exhibit similar
> problems.. why must people always do VMs :-(

Completely agreed.  It's just the loudest people right now asking for
this capability are using VMs for the most part.

> That said, even with a single RMID you can get that information by
> simply running it against all competing processes one at a time. Since
> there's limited RMID space you need to rotate at some point anyway.
>
> The cgroup interface you propose wouldn't allow for rotation; other than
> manual by creating different cgroups one after another.

I see your points, and I also think that the cgroup approach now isn't
the best way to make this completely flexible.  What about this:

Add a new read/write entry to the /proc/<pid> attributes that is the
RMID to assign that process to.  Then expose all the available RMIDs
in /sys/devices/system/cpu, say in a new directory platformqos (or
whatever), which then have all the statistics inside those, plus a knob
to enable monitoring or not.  Then all the kernel exposes is a way to
assign a PID to an RMID, and a way to turn on monitoring or turn it off,
and get the data.  I can then put a simple userspace tool together to
make the management suck less.

Thoughts?

Cheers,
-PJ

--
PJ Waskiewicz Open Source Technology Center
[hidden email] Intel Corp.
N�����r��y���b�X��ǧv�^�)޺{.n�+����{����zX����ܨ}���Ơz�&j:+v�������zZ+��+zf���h���~����i���z��w���?����&�)ߢf��^jǫy�m��@A�a��� 0��h��i
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

Peter Zijlstra-5
On Mon, Jan 06, 2014 at 08:10:45PM +0000, Waskiewicz Jr, Peter P wrote:
> There is one per logical CPU.  However, in the current generation, they
> report on the usage of the same L3 cache.  But the CPU takes care of the
> resolution of which MSR write and read comes from the logical CPU, so
> software doesn't need to lock access to it from different CPUs.

What are the rules of RMIDs, I can't seem to find that in the SDM and I
think you're tagging cachelines with them. Which would mean that in
order to (re) use them you need a complete cache (L3) wipe.

Without a wipe you keep having stale entries of the former user and no
clear indication on when your numbers are any good.

Also, is there any sane way of shooting down the entire L3?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
12