Quantcast

[PATCH V2 0/5] Updates to EDAC and AMD MCE driver

classic Classic list List threaded Threaded
18 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[PATCH V2 0/5] Updates to EDAC and AMD MCE driver

Aravind Gopalakrishnan
This patchset mainly provides necessary EDAC bits to decode errors
occuring on Scalable MCA enabled processors and also updates AMD MCE
driver to program the correct MCx_MISC register address for upcoming
processors.

Patches 1, 2 and 3 are meant for the upcoming processors.

Patches 4 and 5 are either fixing or adding comments to help in
understanding the code and do not introduce any functional changes.

Patch 1: Move MSR definition to mce.h
Patch 2: Updates to EDAC driver to decode the new error signatures
Patch 3: Fix logic to obtain correct block address
Patch 4: Fix deferred error comment
Patch 5: Add comments to amd_nb.h to describe threshold_block structure

Tested V2 patches for regressions on Fam15h, Fam10h systems
and found none.

Note 1: Introduced new patch for moving MCx_CONFIG MSR to mce.h
Note 2: The enums ans amd_hwid_mappings[] array are placed in arch/x86
        as there are follow-up patches which need the struct there

Changes from V1: (per Boris suggestions)
  - Simplify error decoding routines
  - Move headers to mce.h
  - Rename enumerations and struct members (to be more descriptive)
  - Drop gerund usage
  - Remove comments that are spelling out the code

Aravind Gopalakrishnan (5):
  x86/mce: Move MCx_CONFIG MSR definition
  EDAC, MCE, AMD: Enable error decoding of Scalable MCA errors
  x86/mce/AMD: Fix logic to obtain block address
  x86/mce: Clarify comments regarding deferred error
  x86/mce/AMD: Add comments for easier understanding

 arch/x86/include/asm/amd_nb.h        |  18 +-
 arch/x86/include/asm/mce.h           |  63 ++++++-
 arch/x86/include/asm/msr-index.h     |   4 -
 arch/x86/kernel/cpu/mcheck/mce_amd.c | 108 +++++++----
 drivers/edac/mce_amd.c               | 342 ++++++++++++++++++++++++++++++++++-
 5 files changed, 486 insertions(+), 49 deletions(-)

--
2.7.0

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[PATCH V2 1/5] x86/mce: Move MCx_CONFIG MSR definition

Aravind Gopalakrishnan
Since this is contained to only MCE code, move
the MSR definiton there instead of adding to msr-index

Per discussion here:
http://marc.info/?l=linux-kernel&m=145633699026474&w=2

Signed-off-by: Aravind Gopalakrishnan <[hidden email]>
---
 arch/x86/include/asm/mce.h       | 4 ++++
 arch/x86/include/asm/msr-index.h | 4 ----
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 18d2ba9..e8b09b3 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -91,6 +91,10 @@
 #define MCE_LOG_LEN 32
 #define MCE_LOG_SIGNATURE "MACHINECHECK"
 
+/* 'SMCA': AMD64 Scalable MCA */
+#define MSR_AMD64_SMCA_MC0_CONFIG 0xc0002004
+#define MSR_AMD64_SMCA_MCx_CONFIG(x) (MSR_AMD64_SMCA_MC0_CONFIG + 0x10*(x))
+
 /*
  * This structure contains all data related to the MCE log.  Also
  * carries a signature to make it easier to find from external
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 75a5bb6..984ab75 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -269,10 +269,6 @@
 #define MSR_IA32_MC0_CTL2 0x00000280
 #define MSR_IA32_MCx_CTL2(x) (MSR_IA32_MC0_CTL2 + (x))
 
-/* 'SMCA': AMD64 Scalable MCA */
-#define MSR_AMD64_SMCA_MC0_CONFIG 0xc0002004
-#define MSR_AMD64_SMCA_MCx_CONFIG(x) (MSR_AMD64_SMCA_MC0_CONFIG + 0x10*(x))
-
 #define MSR_P6_PERFCTR0 0x000000c1
 #define MSR_P6_PERFCTR1 0x000000c2
 #define MSR_P6_EVNTSEL0 0x00000186
--
2.7.0

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[PATCH V2 5/5] x86/mce/AMD: Add comments for easier understanding

Aravind Gopalakrishnan
In reply to this post by Aravind Gopalakrishnan
In an attempt to aid in understand of what threshold_block
structure holds, assing comments to describe the members here.
Also, trimming comments around threshold_restart_bank()
and updating copyright info.

No functional change is introduced.

Signed-off-by: Aravind Gopalakrishnan <[hidden email]>
---
 arch/x86/include/asm/amd_nb.h        | 18 +++++++++---------
 arch/x86/kernel/cpu/mcheck/mce_amd.c |  7 ++-----
 2 files changed, 11 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/amd_nb.h b/arch/x86/include/asm/amd_nb.h
index 3c56ef1..bc01c0a 100644
--- a/arch/x86/include/asm/amd_nb.h
+++ b/arch/x86/include/asm/amd_nb.h
@@ -27,15 +27,15 @@ struct amd_l3_cache {
 };
 
 struct threshold_block {
- unsigned int block;
- unsigned int bank;
- unsigned int cpu;
- u32 address;
- u16 interrupt_enable;
- bool interrupt_capable;
- u16 threshold_limit;
- struct kobject kobj;
- struct list_head miscj;
+ unsigned int block; /* Threshold block number within bank */
+ unsigned int bank; /* MCA bank the block belongs to */
+ unsigned int cpu; /* CPU which controls the MCA bank */
+ u32 address; /* MSR address for the block */
+ u16 interrupt_enable; /* Enable/ Disable APIC interrupt upon threshold error */
+ bool interrupt_capable; /* Specifies if interrupt is possible from the block */
+ u16 threshold_limit; /* Value upon which threshold interrupt is generated */
+ struct kobject kobj; /* sysfs object */
+ struct list_head miscj; /* Add multiple threshold blocks within a bank to the list */
 };
 
 struct threshold_bank {
diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd.c b/arch/x86/kernel/cpu/mcheck/mce_amd.c
index a155eaa..ebb63ec 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_amd.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_amd.c
@@ -1,5 +1,5 @@
 /*
- *  (c) 2005-2015 Advanced Micro Devices, Inc.
+ *  (c) 2005-2016 Advanced Micro Devices, Inc.
  *  Your use of this code is subject to the terms and conditions of the
  *  GNU general public license version 2. See "COPYING" or
  *  http://www.gnu.org/licenses/gpl.html
@@ -183,10 +183,7 @@ static int lvt_off_valid(struct threshold_block *b, int apic, u32 lo, u32 hi)
  return 1;
 };
 
-/*
- * Called via smp_call_function_single(), must be called with correct
- * cpu affinity.
- */
+/* Reprogram MCx_MISC MSR behind this threshold bank */
 static void threshold_restart_bank(void *_tr)
 {
  struct thresh_restart *tr = _tr;
--
2.7.0

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[PATCH V2 4/5] x86/mce: Clarify comments regarding deferred error

Aravind Gopalakrishnan
In reply to this post by Aravind Gopalakrishnan
The Deferred field indicates if we have a Deferred error.
Deferred errors indicate errors that hardware could not
fix. But it still does not cause any interruption to program
flow. So it does not generate any #MC and UC bit in MCx_STATUS
is not set.

Fixing comment here. No functional change

Signed-off-by: Aravind Gopalakrishnan <[hidden email]>
---
 arch/x86/include/asm/mce.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 69f8bda..3b45e36 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -40,7 +40,7 @@
 #define MCI_STATUS_AR (1ULL<<55)  /* Action required */
 
 /* AMD-specific bits */
-#define MCI_STATUS_DEFERRED (1ULL<<44)  /* declare an uncorrected error */
+#define MCI_STATUS_DEFERRED (1ULL<<44)  /* uncorrected error, deferred exception */
 #define MCI_STATUS_POISON (1ULL<<43)  /* access poisonous data */
 #define MCI_STATUS_TCC (1ULL<<55)  /* Task context corrupt */
 
--
2.7.0

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[PATCH V2 3/5] x86/mce/AMD: Fix logic to obtain block address

Aravind Gopalakrishnan
In reply to this post by Aravind Gopalakrishnan
In upcoming processors, the BLKPTR field is no longer used
to indicate the MSR number of the additional register.
Insted, it simply indicates the prescence of additional MSRs.

Fixing the logic here to gather MSR address from
MSR_AMD64_SMCA_MCx_MISC() for newer processors
and we fall back to existing logic for older processors.

Signed-off-by: Aravind Gopalakrishnan <[hidden email]>
---
 arch/x86/include/asm/mce.h           |  4 ++
 arch/x86/kernel/cpu/mcheck/mce_amd.c | 90 ++++++++++++++++++++++++------------
 2 files changed, 65 insertions(+), 29 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index e83bbd6..69f8bda 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -104,10 +104,14 @@
 #define MCE_LOG_SIGNATURE "MACHINECHECK"
 
 /* 'SMCA': AMD64 Scalable MCA */
+#define MSR_AMD64_SMCA_MC0_MISC0 0xc0002003
 #define MSR_AMD64_SMCA_MC0_CONFIG 0xc0002004
 #define MSR_AMD64_SMCA_MC0_IPID 0xc0002005
+#define MSR_AMD64_SMCA_MC0_MISC1 0xc000200a
+#define MSR_AMD64_SMCA_MCx_MISC(x) (MSR_AMD64_SMCA_MC0_MISC0 + 0x10*(x))
 #define MSR_AMD64_SMCA_MCx_CONFIG(x) (MSR_AMD64_SMCA_MC0_CONFIG + 0x10*(x))
 #define MSR_AMD64_SMCA_MCx_IPID(x) (MSR_AMD64_SMCA_MC0_IPID + 0x10*(x))
+#define MSR_AMD64_SMCA_MCx_MISCy(x, y) ((MSR_AMD64_SMCA_MC0_MISC1 + y) + (0x10*(x)))
 
 /*
  * This structure contains all data related to the MCE log.  Also
diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd.c b/arch/x86/kernel/cpu/mcheck/mce_amd.c
index 13f15cb..a155eaa 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_amd.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_amd.c
@@ -286,6 +286,54 @@ static void deferred_error_interrupt_enable(struct cpuinfo_x86 *c)
  wrmsr(MSR_CU_DEF_ERR, low, high);
 }
 
+static u32 get_block_address(u32 current_addr, u32 low, u32 high,
+     unsigned int bank, unsigned int block)
+{
+ u32 addr = 0, offset = 0;
+
+ if (mce_flags.smca) {
+ if (!block) {
+ addr = MSR_AMD64_SMCA_MCx_MISC(bank);
+ } else {
+ /*
+ * For SMCA enabled processors, BLKPTR field
+ * of the first MISC register (MCx_MISC0) indicates
+ * presence of additional MISC register set (MISC1-4)
+ */
+ u32 low, high;
+
+ if (rdmsr_safe(MSR_AMD64_SMCA_MCx_CONFIG(bank),
+       &low, &high) ||
+    !(low & MCI_CONFIG_MCAX))
+ goto nextaddr_out;
+
+ if (!rdmsr_safe(MSR_AMD64_SMCA_MCx_MISC(bank),
+ &low, &high) &&
+    (low & MASK_BLKPTR_LO))
+ addr = MSR_AMD64_SMCA_MCx_MISCy(bank, block - 1);
+ }
+
+ goto nextaddr_out;
+ }
+
+ /* Fall back to method we used for older processors */
+ switch (block) {
+ case 0:
+ addr = MSR_IA32_MCx_MISC(bank);
+ break;
+ case 1:
+ offset = ((low & MASK_BLKPTR_LO) >> 21);
+ if (offset)
+ addr = MCG_XBLK_ADDR + offset;
+ break;
+ default:
+ addr = ++current_addr;
+ }
+
+nextaddr_out:
+ return addr;
+}
+
 static int
 prepare_threshold_block(unsigned int bank, unsigned int block, u32 addr,
  int offset, u32 misc_high)
@@ -348,16 +396,10 @@ void mce_amd_feature_init(struct cpuinfo_x86 *c)
 
  for (bank = 0; bank < mca_cfg.banks; ++bank) {
  for (block = 0; block < NR_BLOCKS; ++block) {
- if (block == 0)
- address = MSR_IA32_MCx_MISC(bank);
- else if (block == 1) {
- address = (low & MASK_BLKPTR_LO) >> 21;
- if (!address)
- break;
-
- address += MCG_XBLK_ADDR;
- } else
- ++address;
+ address = get_block_address(address, low, high,
+    bank, block);
+ if (!address)
+ break;
 
  if (rdmsr_safe(address, &low, &high))
  break;
@@ -462,16 +504,10 @@ static void amd_threshold_interrupt(void)
  if (!(per_cpu(bank_map, cpu) & (1 << bank)))
  continue;
  for (block = 0; block < NR_BLOCKS; ++block) {
- if (block == 0) {
- address = MSR_IA32_MCx_MISC(bank);
- } else if (block == 1) {
- address = (low & MASK_BLKPTR_LO) >> 21;
- if (!address)
- break;
- address += MCG_XBLK_ADDR;
- } else {
- ++address;
- }
+ address = get_block_address(address, low, high,
+    bank, block);
+ if (!address)
+ break;
 
  if (rdmsr_safe(address, &low, &high))
  break;
@@ -691,16 +727,12 @@ static int allocate_threshold_blocks(unsigned int cpu, unsigned int bank,
  if (err)
  goto out_free;
 recurse:
- if (!block) {
- address = (low & MASK_BLKPTR_LO) >> 21;
- if (!address)
- return 0;
- address += MCG_XBLK_ADDR;
- } else {
- ++address;
- }
+ address = get_block_address(address, low, high, bank, ++block);
+
+ if (!address)
+ return 0;
 
- err = allocate_threshold_blocks(cpu, bank, ++block, address);
+ err = allocate_threshold_blocks(cpu, bank, block, address);
  if (err)
  goto out_free;
 
--
2.7.0

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[PATCH V2 2/5] EDAC, MCE, AMD: Enable error decoding of Scalable MCA errors

Aravind Gopalakrishnan
In reply to this post by Aravind Gopalakrishnan
For Scalable MCA enabled processors, errors are listed
per IP block. And since it is not required for an IP to
map to a particular bank, we need to use HWID and McaType
values from the MCx_IPID register to figure out which IP
a given bank represents.

We also have a new bit (TCC) in the MCx_STATUS register
to indicate Task context is corrupt.

Add logic here to decode errors from all known IP
blocks for Fam17h Model 00-0fh and to print TCC errors.

Signed-off-by: Aravind Gopalakrishnan <[hidden email]>
---
 arch/x86/include/asm/mce.h           |  53 ++++++
 arch/x86/kernel/cpu/mcheck/mce_amd.c |  11 ++
 drivers/edac/mce_amd.c               | 342 ++++++++++++++++++++++++++++++++++-
 3 files changed, 405 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index e8b09b3..e83bbd6 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -42,6 +42,18 @@
 /* AMD-specific bits */
 #define MCI_STATUS_DEFERRED (1ULL<<44)  /* declare an uncorrected error */
 #define MCI_STATUS_POISON (1ULL<<43)  /* access poisonous data */
+#define MCI_STATUS_TCC (1ULL<<55)  /* Task context corrupt */
+
+/*
+ * McaX field if set indicates a given bank supports MCA extensions:
+ *  - Deferred error interrupt type is specifiable by bank.
+ *  - MCx_MISC0[BlkPtr] field indicates presence of extended MISC registers,
+ *    But should not be used to determine MSR numbers.
+ *  - TCC bit is present in MCx_STATUS.
+ */
+#define MCI_CONFIG_MCAX 0x1
+#define MCI_IPID_MCATYPE 0xFFFF0000
+#define MCI_IPID_HWID 0xFFF
 
 /*
  * Note that the full MCACOD field of IA32_MCi_STATUS MSR is
@@ -93,7 +105,9 @@
 
 /* 'SMCA': AMD64 Scalable MCA */
 #define MSR_AMD64_SMCA_MC0_CONFIG 0xc0002004
+#define MSR_AMD64_SMCA_MC0_IPID 0xc0002005
 #define MSR_AMD64_SMCA_MCx_CONFIG(x) (MSR_AMD64_SMCA_MC0_CONFIG + 0x10*(x))
+#define MSR_AMD64_SMCA_MCx_IPID(x) (MSR_AMD64_SMCA_MC0_IPID + 0x10*(x))
 
 /*
  * This structure contains all data related to the MCE log.  Also
@@ -292,4 +306,43 @@ struct cper_sec_mem_err;
 extern void apei_mce_report_mem_error(int corrected,
       struct cper_sec_mem_err *mem_err);
 
+/*
+ * Enumerate new IP types and HWID values
+ * in ScalableMCA enabled AMD processors
+ */
+#ifdef CONFIG_X86_MCE_AMD
+enum amd_ip_types {
+ SMCA_F17H_CORE_BLOCK = 0, /* Core errors */
+ SMCA_DF_BLOCK, /* Data Fabric */
+ SMCA_UMC_BLOCK, /* Unified Memory Controller */
+ SMCA_PB_BLOCK, /* Parameter Block */
+ SMCA_PSP_BLOCK, /* Platform Security Processor */
+ SMCA_SMU_BLOCK, /* System Management Unit */
+ N_AMD_IP_TYPES
+};
+
+struct amd_hwid {
+ const char *amd_ipname;
+ unsigned int amd_hwid_value;
+};
+
+extern struct amd_hwid amd_hwid_mappings[N_AMD_IP_TYPES];
+
+enum amd_core_mca_blocks {
+ SMCA_LS_BLOCK = 0, /* Load Store */
+ SMCA_IF_BLOCK, /* Instruction Fetch */
+ SMCA_L2_CACHE_BLOCK, /* L2 cache */
+ SMCA_DE_BLOCK, /* Decoder unit */
+ RES, /* Reserved */
+ SMCA_EX_BLOCK, /* Execution unit */
+ SMCA_FP_BLOCK, /* Floating Point */
+ SMCA_L3_CACHE_BLOCK /* L3 cache */
+};
+
+enum amd_df_mca_blocks {
+ SMCA_CS_BLOCK = 0, /* Coherent Slave */
+ SMCA_PIE_BLOCK /* Power management, Interrupts, etc */
+};
+#endif
+
 #endif /* _ASM_X86_MCE_H */
diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd.c b/arch/x86/kernel/cpu/mcheck/mce_amd.c
index 88de27b..13f15cb 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_amd.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_amd.c
@@ -71,6 +71,17 @@ static const char * const th_names[] = {
  "execution_unit",
 };
 
+/* Define HWID to IP type mappings for Scalable MCA */
+struct amd_hwid amd_hwid_mappings[] = {
+ [SMCA_F17H_CORE_BLOCK] = { "f17h_core", 0xB0 },
+ [SMCA_DF_BLOCK] = { "df", 0x2E },
+ [SMCA_UMC_BLOCK] = { "umc", 0x96 },
+ [SMCA_PB_BLOCK] = { "pb", 0x5 },
+ [SMCA_PSP_BLOCK] = { "psp", 0xFF },
+ [SMCA_SMU_BLOCK] = { "smu", 0x1 },
+};
+EXPORT_SYMBOL_GPL(amd_hwid_mappings);
+
 static DEFINE_PER_CPU(struct threshold_bank **, threshold_banks);
 static DEFINE_PER_CPU(unsigned char, bank_map); /* see which banks are on */
 
diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index e3a945c..409448e 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -147,6 +147,136 @@ static const char * const mc6_mce_desc[] = {
  "Status Register File",
 };
 
+/* Scalable MCA error strings */
+
+static const char * const f17h_ls_mce_desc[] = {
+ "Load queue parity",
+ "Store queue parity",
+ "Miss address buffer payload parity",
+ "L1 TLB parity",
+ "", /* reserved */
+ "DC tag error type 6",
+ "DC tag error type 1",
+ "Internal error type 1",
+ "Internal error type 2",
+ "Sys Read data error thread 0",
+ "Sys read data error thread 1",
+ "DC tag error type 2",
+ "DC data error type 1 (poison comsumption)",
+ "DC data error type 2",
+ "DC data error type 3",
+ "DC tag error type 4",
+ "L2 TLB parity",
+ "PDC parity error",
+ "DC tag error type 3",
+ "DC tag error type 5",
+ "L2 fill data error",
+};
+
+static const char * const f17h_if_mce_desc[] = {
+ "microtag probe port parity error",
+ "IC microtag or full tag multi-hit error",
+ "IC full tag parity",
+ "IC data array parity",
+ "Decoupling queue phys addr parity error",
+ "L0 ITLB parity error",
+ "L1 ITLB parity error",
+ "L2 ITLB parity error",
+ "BPQ snoop parity on Thread 0",
+ "BPQ snoop parity on Thread 1",
+ "L1 BTB multi-match error",
+ "L2 BTB multi-match error",
+};
+
+static const char * const f17h_l2_mce_desc[] = {
+ "L2M tag multi-way-hit error",
+ "L2M tag ECC error",
+ "L2M data ECC error",
+ "HW assert",
+};
+
+static const char * const f17h_de_mce_desc[] = {
+ "uop cache tag parity error",
+ "uop cache data parity error",
+ "Insn buffer parity error",
+ "Insn dispatch queue parity error",
+ "Fetch address FIFO parity",
+ "Patch RAM data parity",
+ "Patch RAM sequencer parity",
+ "uop buffer parity"
+};
+
+static const char * const f17h_ex_mce_desc[] = {
+ "Watchdog timeout error",
+ "Phy register file parity",
+ "Flag register file parity",
+ "Immediate displacement register file parity",
+ "Address generator payload parity",
+ "EX payload parity",
+ "Checkpoint queue parity",
+ "Retire dispatch queue parity",
+};
+
+static const char * const f17h_fp_mce_desc[] = {
+ "Physical register file parity",
+ "Freelist parity error",
+ "Schedule queue parity",
+ "NSQ parity error",
+ "Retire queue parity",
+ "Status register file parity",
+};
+
+static const char * const f17h_l3_mce_desc[] = {
+ "Shadow tag macro ECC error",
+ "Shadow tag macro multi-way-hit error",
+ "L3M tag ECC error",
+ "L3M tag multi-way-hit error",
+ "L3M data ECC error",
+ "XI parity, L3 fill done channel error",
+ "L3 victim queue parity",
+ "L3 HW assert",
+};
+
+static const char * const f17h_cs_mce_desc[] = {
+ "Illegal request from transport layer",
+ "Address violation",
+ "Security violation",
+ "Illegal response from transport layer",
+ "Unexpected response",
+ "Parity error on incoming request or probe response data",
+ "Parity error on incoming read response data",
+ "Atomic request parity",
+ "ECC error on probe filter access",
+};
+
+static const char * const f17h_pie_mce_desc[] = {
+ "HW assert",
+ "Internal PIE register security violation",
+ "Error on GMI link",
+ "Poison data written to internal PIE register",
+};
+
+static const char * const f17h_umc_mce_desc[] = {
+ "DRAM ECC error",
+ "Data poison error on DRAM",
+ "SDP parity error",
+ "Advanced peripheral bus error",
+ "Command/address parity error",
+ "Write data CRC error",
+};
+
+static const char * const f17h_pb_mce_desc[] = {
+ "Parameter Block RAM ECC error",
+};
+
+static const char * const f17h_psp_mce_desc[] = {
+ "PSP RAM ECC or parity error",
+};
+
+static const char * const f17h_smu_mce_desc[] = {
+ "SMU RAM ECC or parity error",
+};
+
 static bool f12h_mc0_mce(u16 ec, u8 xec)
 {
  bool ret = false;
@@ -731,6 +861,192 @@ static bool amd_filter_mce(struct mce *m)
  return false;
 }
 
+static void decode_f17hcore_errors(u8 xec, unsigned int mca_type)
+{
+ const char * const *error_desc_array;
+ char *ip_name;
+ size_t len;
+
+ switch (mca_type) {
+ case SMCA_LS_BLOCK:
+ error_desc_array = f17h_ls_mce_desc;
+ ip_name = "LS";
+ len = ARRAY_SIZE(f17h_ls_mce_desc) - 1;
+
+ if (xec == 0x4) {
+ pr_cont("Unrecognized error code from LS MCA bank\n");
+ return;
+ }
+
+ break;
+
+ case SMCA_IF_BLOCK:
+ error_desc_array = f17h_if_mce_desc;
+ ip_name = "IF";
+ len = ARRAY_SIZE(f17h_if_mce_desc) - 1;
+ break;
+
+ case SMCA_L2_CACHE_BLOCK:
+ error_desc_array = f17h_l2_mce_desc;
+ ip_name = "L2_Cache";
+ len = ARRAY_SIZE(f17h_l2_mce_desc) - 1;
+ break;
+
+ case SMCA_DE_BLOCK:
+ error_desc_array = f17h_de_mce_desc;
+ ip_name = "DE";
+ len = ARRAY_SIZE(f17h_de_mce_desc) - 1;
+ break;
+
+ case SMCA_EX_BLOCK:
+ error_desc_array = f17h_ex_mce_desc;
+ ip_name = "EX";
+ len = ARRAY_SIZE(f17h_ex_mce_desc) - 1;
+ break;
+
+ case SMCA_FP_BLOCK:
+ error_desc_array = f17h_fp_mce_desc;
+ ip_name = "FP";
+ len = ARRAY_SIZE(f17h_fp_mce_desc) - 1;
+ break;
+
+ case SMCA_L3_CACHE_BLOCK:
+ error_desc_array = f17h_l3_mce_desc;
+ ip_name = "L3_Cache";
+ len = ARRAY_SIZE(f17h_l3_mce_desc) - 1;
+ break;
+
+ default:
+ pr_cont("Unrecognized Mca Type value for F17h Core. Unable to decode errors\n");
+ return;
+ }
+
+ if (xec > len) {
+ pr_cont("Unrecognized error code from %s MCA bank\n", ip_name);
+ return;
+ }
+
+ pr_cont("%s.\n", error_desc_array[xec]);
+}
+
+static void decode_df_errors(u8 xec, unsigned int mca_type)
+{
+ const char * const *error_desc_array;
+ char *ip_name;
+ size_t len;
+
+ switch (mca_type) {
+ case  SMCA_CS_BLOCK:
+ error_desc_array = f17h_cs_mce_desc;
+ ip_name = "CS";
+ len = ARRAY_SIZE(f17h_cs_mce_desc) - 1;
+ break;
+
+ case SMCA_PIE_BLOCK:
+ error_desc_array = f17h_pie_mce_desc;
+ ip_name = "PIE";
+ len = ARRAY_SIZE(f17h_pie_mce_desc) - 1;
+ break;
+
+ default:
+ pr_cont("Unrecognized Mca Type value for DF. Unable to decode errors\n");
+ return;
+ }
+
+ if (xec > len) {
+ pr_cont("Unrecognized error code from %s MCA bank\n", ip_name);
+ return;
+ }
+
+ pr_cont("%s.\n", error_desc_array[xec]);
+}
+
+/* Decode errors according to Scalable MCA specification */
+static void decode_smca_errors(struct mce *m)
+{
+ u32 low, high;
+ u32 addr = MSR_AMD64_SMCA_MCx_IPID(m->bank);
+ unsigned int hwid, mca_type, i;
+ u8 xec = XEC(m->status, xec_mask);
+ const char * const *error_desc_array;
+ char *ip_name;
+ size_t len;
+
+ if (rdmsr_safe(addr, &low, &high)) {
+ pr_emerg("Invalid IP block specified, error information is unreliable.\n");
+ return;
+ }
+
+ hwid = high & MCI_IPID_HWID;
+ mca_type = (high & MCI_IPID_MCATYPE) >> 16;
+
+ pr_emerg(HW_ERR "MC%d IPID value: 0x%08x%08x\n", m->bank, high, low);
+
+ /*
+ * Based on hwid and mca_type values,
+ * decode errors from respective IPs.
+ * Note: mca_type values make sense only
+ * in the context of an hwid
+ */
+ for (i = 0; i < ARRAY_SIZE(amd_hwid_mappings); i++)
+ if (amd_hwid_mappings[i].amd_hwid_value == hwid)
+ break;
+
+ switch (i) {
+ case SMCA_F17H_CORE_BLOCK:
+ ip_name = (mca_type == SMCA_L3_CACHE_BLOCK) ?
+  "L3 Cache" : "F17h Core";
+ break;
+
+ case SMCA_DF_BLOCK:
+ ip_name = "DF";
+ break;
+
+ case SMCA_UMC_BLOCK:
+ error_desc_array = f17h_umc_mce_desc;
+ ip_name = "UMC";
+ len = ARRAY_SIZE(f17h_umc_mce_desc) - 1;
+ break;
+
+ case SMCA_PB_BLOCK:
+ error_desc_array = f17h_pb_mce_desc;
+ ip_name = "PB";
+ len = ARRAY_SIZE(f17h_pb_mce_desc) - 1;
+ break;
+
+ case SMCA_PSP_BLOCK:
+ error_desc_array = f17h_psp_mce_desc;
+ ip_name = "PSP";
+ len = ARRAY_SIZE(f17h_psp_mce_desc) - 1;
+ break;
+
+ case SMCA_SMU_BLOCK:
+ error_desc_array = f17h_smu_mce_desc;
+ ip_name = "SMU";
+ len = ARRAY_SIZE(f17h_smu_mce_desc) - 1;
+ break;
+
+ default:
+ pr_emerg(HW_ERR "HWID:%d does not match any existing IPs\n", hwid);
+ return;
+ }
+
+ pr_emerg(HW_ERR "%s Error: ", ip_name);
+
+ if (i == SMCA_F17H_CORE_BLOCK) {
+ decode_f17hcore_errors(xec, mca_type);
+ } else if (i == SMCA_DF_BLOCK) {
+ decode_df_errors(xec, mca_type);
+ } else {
+ if (xec > len) {
+ pr_cont("Unrecognized error code from %s MCA bank\n", ip_name);
+ return;
+ }
+
+ pr_cont("%s.\n", error_desc_array[xec]);
+ }
+}
+
 static const char *decode_error_status(struct mce *m)
 {
  if (m->status & MCI_STATUS_UC) {
@@ -752,6 +1068,7 @@ int amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
  struct mce *m = (struct mce *)data;
  struct cpuinfo_x86 *c = &cpu_data(m->extcpu);
  int ecc;
+ u32 ebx = cpuid_ebx(0x80000007);
 
  if (amd_filter_mce(m))
  return NOTIFY_STOP;
@@ -769,11 +1086,20 @@ int amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
  ((m->status & MCI_STATUS_PCC) ? "PCC"  : "-"),
  ((m->status & MCI_STATUS_ADDRV) ? "AddrV" : "-"));
 
- if (c->x86 == 0x15 || c->x86 == 0x16)
+ if (c->x86 >= 0x15)
  pr_cont("|%s|%s",
  ((m->status & MCI_STATUS_DEFERRED) ? "Deferred" : "-"),
  ((m->status & MCI_STATUS_POISON)   ? "Poison"   : "-"));
 
+ if (!!(ebx & BIT(3))) {
+ u32 low, high;
+ u32 addr = MSR_AMD64_SMCA_MCx_CONFIG(m->bank);
+
+ if (!rdmsr_safe(addr, &low, &high) &&
+    (low & MCI_CONFIG_MCAX))
+ pr_cont("|%s", ((m->status & MCI_STATUS_TCC) ? "TCC" : "-"));
+ }
+
  /* do the two bits[14:13] together */
  ecc = (m->status >> 45) & 0x3;
  if (ecc)
@@ -784,6 +1110,11 @@ int amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
  if (m->status & MCI_STATUS_ADDRV)
  pr_emerg(HW_ERR "MC%d Error Address: 0x%016llx\n", m->bank, m->addr);
 
+ if (!!(ebx & BIT(3))) {
+ decode_smca_errors(m);
+ goto err_code;
+ }
+
  if (!fam_ops)
  goto err_code;
 
@@ -834,6 +1165,7 @@ static struct notifier_block amd_mce_dec_nb = {
 static int __init mce_amd_init(void)
 {
  struct cpuinfo_x86 *c = &boot_cpu_data;
+ u32 ebx = cpuid_ebx(0x80000007);
 
  if (c->x86_vendor != X86_VENDOR_AMD)
  return -ENODEV;
@@ -888,6 +1220,14 @@ static int __init mce_amd_init(void)
  fam_ops->mc2_mce = f16h_mc2_mce;
  break;
 
+ case 0x17:
+ xec_mask = 0x3f;
+ if (!(ebx & BIT(3))) {
+ printk(KERN_WARNING "Decoding supported only on Scalable MCA enabled processors\n");
+ return 0;
+ }
+ break;
+
  default:
  printk(KERN_WARNING "Huh? What family is it: 0x%x?!\n", c->x86);
  kfree(fam_ops);
--
2.7.0

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [PATCH V2 2/5] EDAC, MCE, AMD: Enable error decoding of Scalable MCA errors

Borislav Petkov-3
On Mon, Feb 29, 2016 at 04:32:56PM -0600, Aravind Gopalakrishnan wrote:

> For Scalable MCA enabled processors, errors are listed
> per IP block. And since it is not required for an IP to
> map to a particular bank, we need to use HWID and McaType
> values from the MCx_IPID register to figure out which IP
> a given bank represents.
>
> We also have a new bit (TCC) in the MCx_STATUS register
> to indicate Task context is corrupt.
>
> Add logic here to decode errors from all known IP
> blocks for Fam17h Model 00-0fh and to print TCC errors.
>
> Signed-off-by: Aravind Gopalakrishnan <[hidden email]>
> ---
>  arch/x86/include/asm/mce.h           |  53 ++++++
>  arch/x86/kernel/cpu/mcheck/mce_amd.c |  11 ++
>  drivers/edac/mce_amd.c               | 342 ++++++++++++++++++++++++++++++++++-
>  3 files changed, 405 insertions(+), 1 deletion(-)

Ok, applied with a bunch of changes ontop. I'm sending them as a reply
to this message. The second patch is relying on the assumption that a
hwid of 0 is invalid. Is that so?

--
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[PATCH 1/3] x86/mce/AMD, EDAC: Enable error decoding of Scalable MCA errors

Borislav Petkov-3
From: Aravind Gopalakrishnan <[hidden email]>
Date: Mon, 29 Feb 2016 16:32:56 -0600
Subject: [PATCH 1/3] x86/mce/AMD, EDAC: Enable error decoding of Scalable MCA
 errors

For Scalable MCA enabled processors, errors are listed per IP block. And
since it is not required for an IP to map to a particular bank, we need
to use HWID and McaType values from the MCx_IPID register to figure out
which IP a given bank represents.

We also have a new bit (TCC) in the MCx_STATUS register to indicate Task
context is corrupt.

Add logic here to decode errors from all known IP blocks for Fam17h
Model 00-0fh and to print TCC errors.

Boris:
- reorganize function placement in drivers/edac/mce_amd.c
- reflow comments

Signed-off-by: Aravind Gopalakrishnan <[hidden email]>
Cc: "H. Peter Anvin" <[hidden email]>
Cc: Ingo Molnar <[hidden email]>
Cc: linux-edac <[hidden email]>
Cc: Mauro Carvalho Chehab <[hidden email]>
Cc: Thomas Gleixner <[hidden email]>
Cc: Tony Luck <[hidden email]>
Cc: x86-ml <[hidden email]>
Link: http://lkml.kernel.org/r/1456785179-14378-3-git-send-email-Aravind.Gopalakrishnan@...
Signed-off-by: Borislav Petkov <[hidden email]>
---
 arch/x86/include/asm/mce.h           |  53 ++++++
 arch/x86/kernel/cpu/mcheck/mce_amd.c |  12 ++
 drivers/edac/mce_amd.c               | 342 ++++++++++++++++++++++++++++++++++-
 3 files changed, 406 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index f9d4b8d4baf2..6f1380064471 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -42,6 +42,18 @@
 /* AMD-specific bits */
 #define MCI_STATUS_DEFERRED (1ULL<<44)  /* declare an uncorrected error */
 #define MCI_STATUS_POISON (1ULL<<43)  /* access poisonous data */
+#define MCI_STATUS_TCC (1ULL<<55)  /* Task context corrupt */
+
+/*
+ * McaX field if set indicates a given bank supports MCA extensions:
+ *  - Deferred error interrupt type is specifiable by bank.
+ *  - MCx_MISC0[BlkPtr] field indicates presence of extended MISC registers,
+ *    But should not be used to determine MSR numbers.
+ *  - TCC bit is present in MCx_STATUS.
+ */
+#define MCI_CONFIG_MCAX 0x1
+#define MCI_IPID_MCATYPE 0xFFFF0000
+#define MCI_IPID_HWID 0xFFF
 
 /*
  * Note that the full MCACOD field of IA32_MCi_STATUS MSR is
@@ -93,7 +105,9 @@
 
 /* 'SMCA': AMD64 Scalable MCA */
 #define MSR_AMD64_SMCA_MC0_CONFIG 0xc0002004
+#define MSR_AMD64_SMCA_MC0_IPID 0xc0002005
 #define MSR_AMD64_SMCA_MCx_CONFIG(x) (MSR_AMD64_SMCA_MC0_CONFIG + 0x10*(x))
+#define MSR_AMD64_SMCA_MCx_IPID(x) (MSR_AMD64_SMCA_MC0_IPID + 0x10*(x))
 
 /*
  * This structure contains all data related to the MCE log.  Also
@@ -291,4 +305,43 @@ struct cper_sec_mem_err;
 extern void apei_mce_report_mem_error(int corrected,
       struct cper_sec_mem_err *mem_err);
 
+/*
+ * Enumerate new IP types and HWID values in AMD processors which support
+ * Scalable MCA.
+ */
+#ifdef CONFIG_X86_MCE_AMD
+enum amd_ip_types {
+ SMCA_F17H_CORE_BLOCK = 0, /* Core errors */
+ SMCA_DF_BLOCK, /* Data Fabric */
+ SMCA_UMC_BLOCK, /* Unified Memory Controller */
+ SMCA_PB_BLOCK, /* Parameter Block */
+ SMCA_PSP_BLOCK, /* Platform Security Processor */
+ SMCA_SMU_BLOCK, /* System Management Unit */
+ N_AMD_IP_TYPES
+};
+
+struct amd_hwid {
+ const char *amd_ipname;
+ unsigned int amd_hwid_value;
+};
+
+extern struct amd_hwid amd_hwid_mappings[N_AMD_IP_TYPES];
+
+enum amd_core_mca_blocks {
+ SMCA_LS_BLOCK = 0, /* Load Store */
+ SMCA_IF_BLOCK, /* Instruction Fetch */
+ SMCA_L2_CACHE_BLOCK, /* L2 cache */
+ SMCA_DE_BLOCK, /* Decoder unit */
+ RES, /* Reserved */
+ SMCA_EX_BLOCK, /* Execution unit */
+ SMCA_FP_BLOCK, /* Floating Point */
+ SMCA_L3_CACHE_BLOCK /* L3 cache */
+};
+
+enum amd_df_mca_blocks {
+ SMCA_CS_BLOCK = 0, /* Coherent Slave */
+ SMCA_PIE_BLOCK /* Power management, Interrupts, etc */
+};
+#endif
+
 #endif /* _ASM_X86_MCE_H */
diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd.c b/arch/x86/kernel/cpu/mcheck/mce_amd.c
index 88de27bd5797..3188cd9eb9b5 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_amd.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_amd.c
@@ -71,6 +71,18 @@ static const char * const th_names[] = {
  "execution_unit",
 };
 
+/* Define HWID to IP type mappings for Scalable MCA */
+struct amd_hwid amd_hwid_mappings[] =
+{
+ [SMCA_F17H_CORE_BLOCK] = { "f17h_core", 0xB0 },
+ [SMCA_DF_BLOCK] = { "data fabric", 0x2E },
+ [SMCA_UMC_BLOCK] = { "UMC", 0x96 },
+ [SMCA_PB_BLOCK] = { "param block", 0x5 },
+ [SMCA_PSP_BLOCK] = { "PSP", 0xFF },
+ [SMCA_SMU_BLOCK] = { "SMU", 0x1 },
+};
+EXPORT_SYMBOL_GPL(amd_hwid_mappings);
+
 static DEFINE_PER_CPU(struct threshold_bank **, threshold_banks);
 static DEFINE_PER_CPU(unsigned char, bank_map); /* see which banks are on */
 
diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index e3a945ce374b..6820d17fea9c 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -147,6 +147,135 @@ static const char * const mc6_mce_desc[] = {
  "Status Register File",
 };
 
+/* Scalable MCA error strings */
+static const char * const f17h_ls_mce_desc[] = {
+ "Load queue parity",
+ "Store queue parity",
+ "Miss address buffer payload parity",
+ "L1 TLB parity",
+ "", /* reserved */
+ "DC tag error type 6",
+ "DC tag error type 1",
+ "Internal error type 1",
+ "Internal error type 2",
+ "Sys Read data error thread 0",
+ "Sys read data error thread 1",
+ "DC tag error type 2",
+ "DC data error type 1 (poison comsumption)",
+ "DC data error type 2",
+ "DC data error type 3",
+ "DC tag error type 4",
+ "L2 TLB parity",
+ "PDC parity error",
+ "DC tag error type 3",
+ "DC tag error type 5",
+ "L2 fill data error",
+};
+
+static const char * const f17h_if_mce_desc[] = {
+ "microtag probe port parity error",
+ "IC microtag or full tag multi-hit error",
+ "IC full tag parity",
+ "IC data array parity",
+ "Decoupling queue phys addr parity error",
+ "L0 ITLB parity error",
+ "L1 ITLB parity error",
+ "L2 ITLB parity error",
+ "BPQ snoop parity on Thread 0",
+ "BPQ snoop parity on Thread 1",
+ "L1 BTB multi-match error",
+ "L2 BTB multi-match error",
+};
+
+static const char * const f17h_l2_mce_desc[] = {
+ "L2M tag multi-way-hit error",
+ "L2M tag ECC error",
+ "L2M data ECC error",
+ "HW assert",
+};
+
+static const char * const f17h_de_mce_desc[] = {
+ "uop cache tag parity error",
+ "uop cache data parity error",
+ "Insn buffer parity error",
+ "Insn dispatch queue parity error",
+ "Fetch address FIFO parity",
+ "Patch RAM data parity",
+ "Patch RAM sequencer parity",
+ "uop buffer parity"
+};
+
+static const char * const f17h_ex_mce_desc[] = {
+ "Watchdog timeout error",
+ "Phy register file parity",
+ "Flag register file parity",
+ "Immediate displacement register file parity",
+ "Address generator payload parity",
+ "EX payload parity",
+ "Checkpoint queue parity",
+ "Retire dispatch queue parity",
+};
+
+static const char * const f17h_fp_mce_desc[] = {
+ "Physical register file parity",
+ "Freelist parity error",
+ "Schedule queue parity",
+ "NSQ parity error",
+ "Retire queue parity",
+ "Status register file parity",
+};
+
+static const char * const f17h_l3_mce_desc[] = {
+ "Shadow tag macro ECC error",
+ "Shadow tag macro multi-way-hit error",
+ "L3M tag ECC error",
+ "L3M tag multi-way-hit error",
+ "L3M data ECC error",
+ "XI parity, L3 fill done channel error",
+ "L3 victim queue parity",
+ "L3 HW assert",
+};
+
+static const char * const f17h_cs_mce_desc[] = {
+ "Illegal request from transport layer",
+ "Address violation",
+ "Security violation",
+ "Illegal response from transport layer",
+ "Unexpected response",
+ "Parity error on incoming request or probe response data",
+ "Parity error on incoming read response data",
+ "Atomic request parity",
+ "ECC error on probe filter access",
+};
+
+static const char * const f17h_pie_mce_desc[] = {
+ "HW assert",
+ "Internal PIE register security violation",
+ "Error on GMI link",
+ "Poison data written to internal PIE register",
+};
+
+static const char * const f17h_umc_mce_desc[] = {
+ "DRAM ECC error",
+ "Data poison error on DRAM",
+ "SDP parity error",
+ "Advanced peripheral bus error",
+ "Command/address parity error",
+ "Write data CRC error",
+};
+
+static const char * const f17h_pb_mce_desc[] = {
+ "Parameter Block RAM ECC error",
+};
+
+static const char * const f17h_psp_mce_desc[] = {
+ "PSP RAM ECC or parity error",
+};
+
+static const char * const f17h_smu_mce_desc[] = {
+ "SMU RAM ECC or parity error",
+};
+
 static bool f12h_mc0_mce(u16 ec, u8 xec)
 {
  bool ret = false;
@@ -691,6 +820,193 @@ static void decode_mc6_mce(struct mce *m)
  pr_emerg(HW_ERR "Corrupted MC6 MCE info?\n");
 }
 
+static void decode_f17h_core_errors(u8 xec, unsigned int mca_type)
+{
+ const char * const *error_desc_array;
+ char *ip_name;
+ size_t len;
+
+ switch (mca_type) {
+ case SMCA_LS_BLOCK:
+ error_desc_array = f17h_ls_mce_desc;
+ ip_name = "LS";
+ len = ARRAY_SIZE(f17h_ls_mce_desc) - 1;
+
+ if (xec == 0x4) {
+ pr_cont("Unrecognized error code from LS MCA bank\n");
+ return;
+ }
+
+ break;
+
+ case SMCA_IF_BLOCK:
+ error_desc_array = f17h_if_mce_desc;
+ ip_name = "IF";
+ len = ARRAY_SIZE(f17h_if_mce_desc) - 1;
+ break;
+
+ case SMCA_L2_CACHE_BLOCK:
+ error_desc_array = f17h_l2_mce_desc;
+ ip_name = "L2_Cache";
+ len = ARRAY_SIZE(f17h_l2_mce_desc) - 1;
+ break;
+
+ case SMCA_DE_BLOCK:
+ error_desc_array = f17h_de_mce_desc;
+ ip_name = "DE";
+ len = ARRAY_SIZE(f17h_de_mce_desc) - 1;
+ break;
+
+ case SMCA_EX_BLOCK:
+ error_desc_array = f17h_ex_mce_desc;
+ ip_name = "EX";
+ len = ARRAY_SIZE(f17h_ex_mce_desc) - 1;
+ break;
+
+ case SMCA_FP_BLOCK:
+ error_desc_array = f17h_fp_mce_desc;
+ ip_name = "FP";
+ len = ARRAY_SIZE(f17h_fp_mce_desc) - 1;
+ break;
+
+ case SMCA_L3_CACHE_BLOCK:
+ error_desc_array = f17h_l3_mce_desc;
+ ip_name = "L3_Cache";
+ len = ARRAY_SIZE(f17h_l3_mce_desc) - 1;
+ break;
+
+ default:
+ pr_cont("Unrecognized Mca Type value for F17h Core. Unable to decode errors\n");
+ return;
+ }
+
+ if (xec > len) {
+ pr_cont("Unrecognized error code from %s MCA bank\n", ip_name);
+ return;
+ }
+
+ pr_cont("%s.\n", error_desc_array[xec]);
+}
+
+static void decode_df_errors(u8 xec, unsigned int mca_type)
+{
+ const char * const *error_desc_array;
+ char *ip_name;
+ size_t len;
+
+ switch (mca_type) {
+ case  SMCA_CS_BLOCK:
+ error_desc_array = f17h_cs_mce_desc;
+ ip_name = "CS";
+ len = ARRAY_SIZE(f17h_cs_mce_desc) - 1;
+ break;
+
+ case SMCA_PIE_BLOCK:
+ error_desc_array = f17h_pie_mce_desc;
+ ip_name = "PIE";
+ len = ARRAY_SIZE(f17h_pie_mce_desc) - 1;
+ break;
+
+ default:
+ pr_cont("Unrecognized Mca Type value for DF. Unable to decode errors\n");
+ return;
+ }
+
+ if (xec > len) {
+ pr_cont("Unrecognized error code from %s MCA bank\n", ip_name);
+ return;
+ }
+
+ pr_cont("%s.\n", error_desc_array[xec]);
+}
+
+/* Decode errors according to Scalable MCA specification */
+static void decode_smca_errors(struct mce *m)
+{
+ u32 low, high;
+ u32 addr = MSR_AMD64_SMCA_MCx_IPID(m->bank);
+ unsigned int hwid, mca_type, i;
+ u8 xec = XEC(m->status, xec_mask);
+ const char * const *error_desc_array;
+ char *ip_name;
+ size_t len;
+
+ if (rdmsr_safe(addr, &low, &high)) {
+ pr_emerg("Invalid IP block specified, error information is unreliable.\n");
+ return;
+ }
+
+ hwid = high & MCI_IPID_HWID;
+ mca_type = (high & MCI_IPID_MCATYPE) >> 16;
+
+ pr_emerg(HW_ERR "MC%d IPID value: 0x%08x%08x\n", m->bank, high, low);
+
+ /*
+ * Based on hwid and mca_type values,
+ * decode errors from respective IPs.
+ * Note: mca_type values make sense only
+ * in the context of an hwid
+ */
+ for (i = 0; i < ARRAY_SIZE(amd_hwid_mappings); i++)
+ if (amd_hwid_mappings[i].amd_hwid_value == hwid)
+ break;
+
+ switch (i) {
+ case SMCA_F17H_CORE_BLOCK:
+ ip_name = (mca_type == SMCA_L3_CACHE_BLOCK) ?
+  "L3 Cache" : "F17h Core";
+ break;
+
+ case SMCA_DF_BLOCK:
+ ip_name = "DF";
+ break;
+
+ case SMCA_UMC_BLOCK:
+ error_desc_array = f17h_umc_mce_desc;
+ ip_name = "UMC";
+ len = ARRAY_SIZE(f17h_umc_mce_desc) - 1;
+ break;
+
+ case SMCA_PB_BLOCK:
+ error_desc_array = f17h_pb_mce_desc;
+ ip_name = "PB";
+ len = ARRAY_SIZE(f17h_pb_mce_desc) - 1;
+ break;
+
+ case SMCA_PSP_BLOCK:
+ error_desc_array = f17h_psp_mce_desc;
+ ip_name = "PSP";
+ len = ARRAY_SIZE(f17h_psp_mce_desc) - 1;
+ break;
+
+ case SMCA_SMU_BLOCK:
+ error_desc_array = f17h_smu_mce_desc;
+ ip_name = "SMU";
+ len = ARRAY_SIZE(f17h_smu_mce_desc) - 1;
+ break;
+
+ default:
+ pr_emerg(HW_ERR "HWID:%d does not match any existing IPs\n", hwid);
+ return;
+ }
+
+ pr_emerg(HW_ERR "%s Error: ", ip_name);
+
+ if (i == SMCA_F17H_CORE_BLOCK) {
+ decode_f17h_core_errors(xec, mca_type);
+ } else if (i == SMCA_DF_BLOCK) {
+ decode_df_errors(xec, mca_type);
+ } else {
+ if (xec > len) {
+ pr_cont("Unrecognized error code from %s MCA bank\n", ip_name);
+ return;
+ }
+
+ pr_cont("%s.\n", error_desc_array[xec]);
+ }
+}
+
+
 static inline void amd_decode_err_code(u16 ec)
 {
  if (INT_ERROR(ec)) {
@@ -752,6 +1068,7 @@ int amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
  struct mce *m = (struct mce *)data;
  struct cpuinfo_x86 *c = &cpu_data(m->extcpu);
  int ecc;
+ u32 ebx = cpuid_ebx(0x80000007);
 
  if (amd_filter_mce(m))
  return NOTIFY_STOP;
@@ -769,11 +1086,20 @@ int amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
  ((m->status & MCI_STATUS_PCC) ? "PCC"  : "-"),
  ((m->status & MCI_STATUS_ADDRV) ? "AddrV" : "-"));
 
- if (c->x86 == 0x15 || c->x86 == 0x16)
+ if (c->x86 >= 0x15)
  pr_cont("|%s|%s",
  ((m->status & MCI_STATUS_DEFERRED) ? "Deferred" : "-"),
  ((m->status & MCI_STATUS_POISON)   ? "Poison"   : "-"));
 
+ if (!!(ebx & BIT(3))) {
+ u32 low, high;
+ u32 addr = MSR_AMD64_SMCA_MCx_CONFIG(m->bank);
+
+ if (!rdmsr_safe(addr, &low, &high) &&
+    (low & MCI_CONFIG_MCAX))
+ pr_cont("|%s", ((m->status & MCI_STATUS_TCC) ? "TCC" : "-"));
+ }
+
  /* do the two bits[14:13] together */
  ecc = (m->status >> 45) & 0x3;
  if (ecc)
@@ -784,6 +1110,11 @@ int amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
  if (m->status & MCI_STATUS_ADDRV)
  pr_emerg(HW_ERR "MC%d Error Address: 0x%016llx\n", m->bank, m->addr);
 
+ if (!!(ebx & BIT(3))) {
+ decode_smca_errors(m);
+ goto err_code;
+ }
+
  if (!fam_ops)
  goto err_code;
 
@@ -834,6 +1165,7 @@ static struct notifier_block amd_mce_dec_nb = {
 static int __init mce_amd_init(void)
 {
  struct cpuinfo_x86 *c = &boot_cpu_data;
+ u32 ebx = cpuid_ebx(0x80000007);
 
  if (c->x86_vendor != X86_VENDOR_AMD)
  return -ENODEV;
@@ -888,6 +1220,14 @@ static int __init mce_amd_init(void)
  fam_ops->mc2_mce = f16h_mc2_mce;
  break;
 
+ case 0x17:
+ xec_mask = 0x3f;
+ if (!(ebx & BIT(3))) {
+ printk(KERN_WARNING "Decoding supported only on Scalable MCA enabled processors\n");
+ return 0;
+ }
+ break;
+
  default:
  printk(KERN_WARNING "Huh? What family is it: 0x%x?!\n", c->x86);
  kfree(fam_ops);
--
2.3.5


--
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[PATCH 2/3] x86/mce/AMD, EDAC: Simplify SMCA decoding

Borislav Petkov-3
In reply to this post by Borislav Petkov-3
From: Borislav Petkov <[hidden email]>
Date: Wed, 2 Mar 2016 11:23:13 +0100
Subject: [PATCH 2/3] x86/mce/AMD, EDAC: Simplify SMCA decoding

Merge all IP blocks into a single enum. This allows for easier block
name use later. Drop superfluous "_BLOCK" from the enum names.

Signed-off-by: Borislav Petkov <[hidden email]>
Cc: Aravind Gopalakrishnan <[hidden email]>
---
 arch/x86/include/asm/mce.h           | 46 +++++++++----------
 arch/x86/kernel/cpu/mcheck/mce_amd.c | 27 +++++++----
 drivers/edac/mce_amd.c               | 88 +++++++++++++++---------------------
 3 files changed, 76 insertions(+), 85 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 6f1380064471..4a197cb25593 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -311,37 +311,33 @@ extern void apei_mce_report_mem_error(int corrected,
  */
 #ifdef CONFIG_X86_MCE_AMD
 enum amd_ip_types {
- SMCA_F17H_CORE_BLOCK = 0, /* Core errors */
- SMCA_DF_BLOCK, /* Data Fabric */
- SMCA_UMC_BLOCK, /* Unified Memory Controller */
- SMCA_PB_BLOCK, /* Parameter Block */
- SMCA_PSP_BLOCK, /* Platform Security Processor */
- SMCA_SMU_BLOCK, /* System Management Unit */
+ SMCA_F17H_CORE = 0, /* Core errors */
+ SMCA_LS, /* - Load Store */
+ SMCA_IF, /* - Instruction Fetch */
+ SMCA_L2_CACHE, /* - L2 cache */
+ SMCA_DE, /* - Decoder unit */
+ RES, /* - Reserved */
+ SMCA_EX, /* - Execution unit */
+ SMCA_FP, /* - Floating Point */
+ SMCA_L3_CACHE, /* - L3 cache */
+
+ SMCA_DF, /* Data Fabric */
+ SMCA_CS, /* - Coherent Slave */
+ SMCA_PIE, /* - Power management, Interrupts, etc */
+
+ SMCA_UMC, /* Unified Memory Controller */
+ SMCA_PB, /* Parameter Block */
+ SMCA_PSP, /* Platform Security Processor */
+ SMCA_SMU, /* System Management Unit */
  N_AMD_IP_TYPES
 };
 
 struct amd_hwid {
- const char *amd_ipname;
- unsigned int amd_hwid_value;
+ const char *name;
+ unsigned int hwid;
 };
 
-extern struct amd_hwid amd_hwid_mappings[N_AMD_IP_TYPES];
-
-enum amd_core_mca_blocks {
- SMCA_LS_BLOCK = 0, /* Load Store */
- SMCA_IF_BLOCK, /* Instruction Fetch */
- SMCA_L2_CACHE_BLOCK, /* L2 cache */
- SMCA_DE_BLOCK, /* Decoder unit */
- RES, /* Reserved */
- SMCA_EX_BLOCK, /* Execution unit */
- SMCA_FP_BLOCK, /* Floating Point */
- SMCA_L3_CACHE_BLOCK /* L3 cache */
-};
+extern struct amd_hwid amd_hwids[N_AMD_IP_TYPES];
 
-enum amd_df_mca_blocks {
- SMCA_CS_BLOCK = 0, /* Coherent Slave */
- SMCA_PIE_BLOCK /* Power management, Interrupts, etc */
-};
 #endif
-
 #endif /* _ASM_X86_MCE_H */
diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd.c b/arch/x86/kernel/cpu/mcheck/mce_amd.c
index 3188cd9eb9b5..c184c92a00ab 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_amd.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_amd.c
@@ -72,16 +72,25 @@ static const char * const th_names[] = {
 };
 
 /* Define HWID to IP type mappings for Scalable MCA */
-struct amd_hwid amd_hwid_mappings[] =
-{
- [SMCA_F17H_CORE_BLOCK] = { "f17h_core", 0xB0 },
- [SMCA_DF_BLOCK] = { "data fabric", 0x2E },
- [SMCA_UMC_BLOCK] = { "UMC", 0x96 },
- [SMCA_PB_BLOCK] = { "param block", 0x5 },
- [SMCA_PSP_BLOCK] = { "PSP", 0xFF },
- [SMCA_SMU_BLOCK] = { "SMU", 0x1 },
+struct amd_hwid amd_hwids[] =
+{
+ [SMCA_F17H_CORE] = { "F17h core", 0xB0 },
+ [SMCA_LS] = { "Load-Store", 0x0 },
+ [SMCA_IF] = { "IFetch", 0x0 },
+ [SMCA_L2_CACHE]  = { "L2 Cache", 0x0 },
+ [SMCA_DE] = { "Decoder", 0x0 },
+ [SMCA_EX] = { "Execution", 0x0 },
+ [SMCA_FP] = { "Floating Point", 0x0 },
+ [SMCA_L3_CACHE]  = { "L3 Cache", 0x0 },
+ [SMCA_DF] = { "Data Fabric", 0x2E },
+ [SMCA_CS] = { "Coherent Slave", 0x0 },
+ [SMCA_PIE] = { "PwrMan/Intr", 0x0 },
+ [SMCA_UMC] = { "UMC", 0x96 },
+ [SMCA_PB] = { "Param Block", 0x5 },
+ [SMCA_PSP] = { "PSP", 0xFF },
+ [SMCA_SMU] = { "SMU", 0x1 },
 };
-EXPORT_SYMBOL_GPL(amd_hwid_mappings);
+EXPORT_SYMBOL_GPL(amd_hwids);
 
 static DEFINE_PER_CPU(struct threshold_bank **, threshold_banks);
 static DEFINE_PER_CPU(unsigned char, bank_map); /* see which banks are on */
diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index 6820d17fea9c..0f9953cbde4e 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -820,58 +820,51 @@ static void decode_mc6_mce(struct mce *m)
  pr_emerg(HW_ERR "Corrupted MC6 MCE info?\n");
 }
 
-static void decode_f17h_core_errors(u8 xec, unsigned int mca_type)
+static void decode_f17h_core_errors(const char *ip_name, u8 xec, unsigned int mca_type)
 {
  const char * const *error_desc_array;
- char *ip_name;
  size_t len;
 
+ pr_emerg(HW_ERR "%s Error: ", ip_name);
+
  switch (mca_type) {
- case SMCA_LS_BLOCK:
+ case SMCA_LS:
  error_desc_array = f17h_ls_mce_desc;
- ip_name = "LS";
  len = ARRAY_SIZE(f17h_ls_mce_desc) - 1;
 
  if (xec == 0x4) {
  pr_cont("Unrecognized error code from LS MCA bank\n");
  return;
  }
-
  break;
 
- case SMCA_IF_BLOCK:
+ case SMCA_IF:
  error_desc_array = f17h_if_mce_desc;
- ip_name = "IF";
  len = ARRAY_SIZE(f17h_if_mce_desc) - 1;
  break;
 
- case SMCA_L2_CACHE_BLOCK:
+ case SMCA_L2_CACHE:
  error_desc_array = f17h_l2_mce_desc;
- ip_name = "L2_Cache";
  len = ARRAY_SIZE(f17h_l2_mce_desc) - 1;
  break;
 
- case SMCA_DE_BLOCK:
+ case SMCA_DE:
  error_desc_array = f17h_de_mce_desc;
- ip_name = "DE";
  len = ARRAY_SIZE(f17h_de_mce_desc) - 1;
  break;
 
- case SMCA_EX_BLOCK:
+ case SMCA_EX:
  error_desc_array = f17h_ex_mce_desc;
- ip_name = "EX";
  len = ARRAY_SIZE(f17h_ex_mce_desc) - 1;
  break;
 
- case SMCA_FP_BLOCK:
+ case SMCA_FP:
  error_desc_array = f17h_fp_mce_desc;
- ip_name = "FP";
  len = ARRAY_SIZE(f17h_fp_mce_desc) - 1;
  break;
 
- case SMCA_L3_CACHE_BLOCK:
+ case SMCA_L3_CACHE:
  error_desc_array = f17h_l3_mce_desc;
- ip_name = "L3_Cache";
  len = ARRAY_SIZE(f17h_l3_mce_desc) - 1;
  break;
 
@@ -881,7 +874,7 @@ static void decode_f17h_core_errors(u8 xec, unsigned int mca_type)
  }
 
  if (xec > len) {
- pr_cont("Unrecognized error code from %s MCA bank\n", ip_name);
+ pr_cont("Unrecognized error code from %s MCA bank\n", amd_hwids[mca_type].name);
  return;
  }
 
@@ -891,19 +884,18 @@ static void decode_f17h_core_errors(u8 xec, unsigned int mca_type)
 static void decode_df_errors(u8 xec, unsigned int mca_type)
 {
  const char * const *error_desc_array;
- char *ip_name;
  size_t len;
 
+ pr_emerg(HW_ERR "Data Fabric Error: ");
+
  switch (mca_type) {
- case  SMCA_CS_BLOCK:
+ case  SMCA_CS:
  error_desc_array = f17h_cs_mce_desc;
- ip_name = "CS";
  len = ARRAY_SIZE(f17h_cs_mce_desc) - 1;
  break;
 
- case SMCA_PIE_BLOCK:
+ case SMCA_PIE:
  error_desc_array = f17h_pie_mce_desc;
- ip_name = "PIE";
  len = ARRAY_SIZE(f17h_pie_mce_desc) - 1;
  break;
 
@@ -913,7 +905,7 @@ static void decode_df_errors(u8 xec, unsigned int mca_type)
  }
 
  if (xec > len) {
- pr_cont("Unrecognized error code from %s MCA bank\n", ip_name);
+ pr_cont("Unrecognized error code from %s MCA bank\n", amd_hwids[mca_type].name);
  return;
  }
 
@@ -928,7 +920,7 @@ static void decode_smca_errors(struct mce *m)
  unsigned int hwid, mca_type, i;
  u8 xec = XEC(m->status, xec_mask);
  const char * const *error_desc_array;
- char *ip_name;
+ const char *ip_name;
  size_t len;
 
  if (rdmsr_safe(addr, &low, &high)) {
@@ -947,41 +939,39 @@ static void decode_smca_errors(struct mce *m)
  * Note: mca_type values make sense only
  * in the context of an hwid
  */
- for (i = 0; i < ARRAY_SIZE(amd_hwid_mappings); i++)
- if (amd_hwid_mappings[i].amd_hwid_value == hwid)
+ for (i = 0; i < ARRAY_SIZE(amd_hwids); i++)
+ if (amd_hwids[i].hwid == hwid)
  break;
 
  switch (i) {
- case SMCA_F17H_CORE_BLOCK:
- ip_name = (mca_type == SMCA_L3_CACHE_BLOCK) ?
-  "L3 Cache" : "F17h Core";
+ case SMCA_F17H_CORE:
+ ip_name = (mca_type == SMCA_L3_CACHE) ?
+   "L3 Cache" : "F17h Core";
+
+ return decode_f17h_core_errors(ip_name, xec, mca_type);
  break;
 
- case SMCA_DF_BLOCK:
- ip_name = "DF";
+ case SMCA_DF:
+ return decode_df_errors(xec, mca_type);
  break;
 
- case SMCA_UMC_BLOCK:
+ case SMCA_UMC:
  error_desc_array = f17h_umc_mce_desc;
- ip_name = "UMC";
  len = ARRAY_SIZE(f17h_umc_mce_desc) - 1;
  break;
 
- case SMCA_PB_BLOCK:
+ case SMCA_PB:
  error_desc_array = f17h_pb_mce_desc;
- ip_name = "PB";
  len = ARRAY_SIZE(f17h_pb_mce_desc) - 1;
  break;
 
- case SMCA_PSP_BLOCK:
+ case SMCA_PSP:
  error_desc_array = f17h_psp_mce_desc;
- ip_name = "PSP";
  len = ARRAY_SIZE(f17h_psp_mce_desc) - 1;
  break;
 
- case SMCA_SMU_BLOCK:
+ case SMCA_SMU:
  error_desc_array = f17h_smu_mce_desc;
- ip_name = "SMU";
  len = ARRAY_SIZE(f17h_smu_mce_desc) - 1;
  break;
 
@@ -990,20 +980,16 @@ static void decode_smca_errors(struct mce *m)
  return;
  }
 
- pr_emerg(HW_ERR "%s Error: ", ip_name);
+ ip_name = amd_hwids[mca_type].name;
 
- if (i == SMCA_F17H_CORE_BLOCK) {
- decode_f17h_core_errors(xec, mca_type);
- } else if (i == SMCA_DF_BLOCK) {
- decode_df_errors(xec, mca_type);
- } else {
- if (xec > len) {
- pr_cont("Unrecognized error code from %s MCA bank\n", ip_name);
- return;
- }
+ pr_emerg(HW_ERR "%s Error: ", ip_name);
 
- pr_cont("%s.\n", error_desc_array[xec]);
+ if (xec > len) {
+ pr_cont("Unrecognized error code from %s MCA bank\n", ip_name);
+ return;
  }
+
+ pr_cont("%s.\n", error_desc_array[xec]);
 }
 
 
--
2.3.5


--
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[PATCH 3/3] EDAC, mce_amd: Correct error paths

Borislav Petkov-3
In reply to this post by Borislav Petkov-3
From: Borislav Petkov <[hidden email]>
Date: Wed, 2 Mar 2016 11:46:58 +0100
Subject: [PATCH 3/3] EDAC, mce_amd: Correct error paths

We need to unwind properly when we fail to find the proper decoding
functions. Streamline error messages to resemble the rest of this file,
while at it and do some minor stylistic changes.

Signed-off-by: Borislav Petkov <[hidden email]>
---
 drivers/edac/mce_amd.c | 38 +++++++++++++++++++++-----------------
 1 file changed, 21 insertions(+), 17 deletions(-)

diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index 0f9953cbde4e..81495a360eea 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -833,7 +833,7 @@ static void decode_f17h_core_errors(const char *ip_name, u8 xec, unsigned int mc
  len = ARRAY_SIZE(f17h_ls_mce_desc) - 1;
 
  if (xec == 0x4) {
- pr_cont("Unrecognized error code from LS MCA bank\n");
+ pr_cont("Unrecognized LS MCA error code.\n");
  return;
  }
  break;
@@ -869,12 +869,12 @@ static void decode_f17h_core_errors(const char *ip_name, u8 xec, unsigned int mc
  break;
 
  default:
- pr_cont("Unrecognized Mca Type value for F17h Core. Unable to decode errors\n");
+ pr_cont("Corrupted MCA Core info.\n");
  return;
  }
 
  if (xec > len) {
- pr_cont("Unrecognized error code from %s MCA bank\n", amd_hwids[mca_type].name);
+ pr_cont("Unrecognized %s MCA bank error code.\n", amd_hwids[mca_type].name);
  return;
  }
 
@@ -900,12 +900,12 @@ static void decode_df_errors(u8 xec, unsigned int mca_type)
  break;
 
  default:
- pr_cont("Unrecognized Mca Type value for DF. Unable to decode errors\n");
+ pr_cont("Corrupted MCA Data Fabric info.\n");
  return;
  }
 
  if (xec > len) {
- pr_cont("Unrecognized error code from %s MCA bank\n", amd_hwids[mca_type].name);
+ pr_cont("Unrecognized %s MCA bank error code.\n", amd_hwids[mca_type].name);
  return;
  }
 
@@ -915,12 +915,12 @@ static void decode_df_errors(u8 xec, unsigned int mca_type)
 /* Decode errors according to Scalable MCA specification */
 static void decode_smca_errors(struct mce *m)
 {
- u32 low, high;
  u32 addr = MSR_AMD64_SMCA_MCx_IPID(m->bank);
  unsigned int hwid, mca_type, i;
  u8 xec = XEC(m->status, xec_mask);
  const char * const *error_desc_array;
  const char *ip_name;
+ u32 low, high;
  size_t len;
 
  if (rdmsr_safe(addr, &low, &high)) {
@@ -934,10 +934,8 @@ static void decode_smca_errors(struct mce *m)
  pr_emerg(HW_ERR "MC%d IPID value: 0x%08x%08x\n", m->bank, high, low);
 
  /*
- * Based on hwid and mca_type values,
- * decode errors from respective IPs.
- * Note: mca_type values make sense only
- * in the context of an hwid
+ * Based on hwid and mca_type values, decode errors from respective IPs.
+ * Note: mca_type values make sense only in the context of a hwid.
  */
  for (i = 0; i < ARRAY_SIZE(amd_hwids); i++)
  if (amd_hwids[i].hwid == hwid)
@@ -976,7 +974,7 @@ static void decode_smca_errors(struct mce *m)
  break;
 
  default:
- pr_emerg(HW_ERR "HWID:%d does not match any existing IPs\n", hwid);
+ pr_emerg(HW_ERR "HWID:%d does not match any existing IPs.\n", hwid);
  return;
  }
 
@@ -985,7 +983,7 @@ static void decode_smca_errors(struct mce *m)
  pr_emerg(HW_ERR "%s Error: ", ip_name);
 
  if (xec > len) {
- pr_cont("Unrecognized error code from %s MCA bank\n", ip_name);
+ pr_cont("Unrecognized %s MCA bank error code.\n", ip_name);
  return;
  }
 
@@ -1151,7 +1149,7 @@ static struct notifier_block amd_mce_dec_nb = {
 static int __init mce_amd_init(void)
 {
  struct cpuinfo_x86 *c = &boot_cpu_data;
- u32 ebx = cpuid_ebx(0x80000007);
+ u32 ebx;
 
  if (c->x86_vendor != X86_VENDOR_AMD)
  return -ENODEV;
@@ -1207,17 +1205,18 @@ static int __init mce_amd_init(void)
  break;
 
  case 0x17:
+ ebx = cpuid_ebx(0x80000007);
  xec_mask = 0x3f;
+
  if (!(ebx & BIT(3))) {
- printk(KERN_WARNING "Decoding supported only on Scalable MCA enabled processors\n");
- return 0;
+ printk(KERN_WARNING "Decoding supported only on Scalable MCA processors.\n");
+ goto err_out;
  }
  break;
 
  default:
  printk(KERN_WARNING "Huh? What family is it: 0x%x?!\n", c->x86);
- kfree(fam_ops);
- fam_ops = NULL;
+ goto err_out;
  }
 
  pr_info("MCE: In-kernel MCE decoding enabled.\n");
@@ -1225,6 +1224,11 @@ static int __init mce_amd_init(void)
  mce_register_decode_chain(&amd_mce_dec_nb);
 
  return 0;
+
+err_out:
+ kfree(fam_ops);
+ fam_ops = NULL;
+ return -EINVAL;
 }
 early_initcall(mce_amd_init);
 
--
2.3.5

--
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [PATCH 2/3] x86/mce/AMD, EDAC: Simplify SMCA decoding

Aravind Gopalakrishnan
In reply to this post by Borislav Petkov-3
On 3/2/2016 4:53 AM, Borislav Petkov wrote:

> Merge all IP blocks into a single enum. This allows for easier block
> name use later. Drop superfluous "_BLOCK" from the enum names.
>
> Signed-off-by: Borislav Petkov <[hidden email]>
> Cc: Aravind Gopalakrishnan <[hidden email]>
>
>   enum amd_ip_types {
> - SMCA_F17H_CORE_BLOCK = 0, /* Core errors */
> - SMCA_DF_BLOCK, /* Data Fabric */
> - SMCA_UMC_BLOCK, /* Unified Memory Controller */
> - SMCA_PB_BLOCK, /* Parameter Block */
> - SMCA_PSP_BLOCK, /* Platform Security Processor */
> - SMCA_SMU_BLOCK, /* System Management Unit */
> + SMCA_F17H_CORE = 0, /* Core errors */
> + SMCA_LS, /* - Load Store */
> + SMCA_IF, /* - Instruction Fetch */
> + SMCA_L2_CACHE, /* - L2 cache */
> + SMCA_DE, /* - Decoder unit */
> + RES, /* - Reserved */
> + SMCA_EX, /* - Execution unit */
> + SMCA_FP, /* - Floating Point */
> + SMCA_L3_CACHE, /* - L3 cache */
> +
> + SMCA_DF, /* Data Fabric */
> + SMCA_CS, /* - Coherent Slave */
> + SMCA_PIE, /* - Power management, Interrupts, etc */
> +
> + SMCA_UMC, /* Unified Memory Controller */
> + SMCA_PB, /* Parameter Block */
> + SMCA_PSP, /* Platform Security Processor */
> + SMCA_SMU, /* System Management Unit */
>   N_AMD_IP_TYPES
>   };
>  

No, this would break the logic in EDAC.
The main reason I placed it in separate enums is because the "mca_type"
values map to the enum.

These blocks-

+ SMCA_LS, /* - Load Store */
+ SMCA_IF, /* - Instruction Fetch */
+ SMCA_L2_CACHE, /* - L2 cache */
+ SMCA_DE, /* - Decoder unit */
+ RES, /* - Reserved */
+ SMCA_EX, /* - Execution unit */
+ SMCA_FP, /* - Floating Point */
+ SMCA_L3_CACHE, /* - L3 cache */


have the same hwid value (0xb0). But they differ in mca_type values. And
in exactly that order.
(LS is mca_type 0, IF is mca_type 1 and so on..)

So, after we get mca_type value from the MSR (mca_type = (high &
MCI_IPID_MCATYPE) >> 16),
We check for LS (=0) or IF (=1) ...
With this change, LS block is assigned 1 due to the ordering in enum.

And this logic applies to "DF" block as well.  (whose component blocks
are "coherent slave" and "pie" which have mca_type values of 0 and 1
respectively)
"DF" and "F17h_core" are essentially parent blocks and CS, PIE, LS, IF
etc are children. hwid indicates the parent, mca_type indicates the child..

>
>  
>   /* Define HWID to IP type mappings for Scalable MCA */
> -struct amd_hwid amd_hwid_mappings[] =
> -{
> - [SMCA_F17H_CORE_BLOCK] = { "f17h_core", 0xB0 },
> - [SMCA_DF_BLOCK] = { "data fabric", 0x2E },
> - [SMCA_UMC_BLOCK] = { "UMC", 0x96 },
> - [SMCA_PB_BLOCK] = { "param block", 0x5 },
> - [SMCA_PSP_BLOCK] = { "PSP", 0xFF },
> - [SMCA_SMU_BLOCK] = { "SMU", 0x1 },
> +struct amd_hwid amd_hwids[] =
> +{
> + [SMCA_F17H_CORE] = { "F17h core", 0xB0 },
> + [SMCA_LS] = { "Load-Store", 0x0 },
> + [SMCA_IF] = { "IFetch", 0x0 },
> + [SMCA_L2_CACHE]  = { "L2 Cache", 0x0 },
> + [SMCA_DE] = { "Decoder", 0x0 },
> + [SMCA_EX] = { "Execution", 0x0 },
> + [SMCA_FP] = { "Floating Point", 0x0 },
> + [SMCA_L3_CACHE]  = { "L3 Cache", 0x0 },
> + [SMCA_DF] = { "Data Fabric", 0x2E },
> + [SMCA_CS] = { "Coherent Slave", 0x0 },
> + [SMCA_PIE] = { "PwrMan/Intr", 0x0 },
> + [SMCA_UMC] = { "UMC", 0x96 },
> + [SMCA_PB] = { "Param Block", 0x5 },
> + [SMCA_PSP] = { "PSP", 0xFF },
> + [SMCA_SMU] = { "SMU", 0x1 },
>   };
> -EXPORT_SYMBOL_GPL(amd_hwid_mappings);
> +EXPORT_SYMBOL_GPL(amd_hwids);
>  

These strings are what I intend to use for the sysfs interface.
So, I am not sure if "PwrMan/Intr" would work when I need to create the
kobj..

Also, the legacy names use snake_case-
static const char * const th_names[] = {
         "load_store",
         "insn_fetch",
         "combined_unit",
         "",
         "northbridge",
         "execution_unit",
};

So, I think we should continue this approach and have something like this-
static const char * const amd_core_mcablock_names[] = {
         [SMCA_LS]         = "load_store",
         [SMCA_IF]         = "insn_fetch",
         [SMCA_L2_CACHE]   = "l2_cache",
         [SMCA_DE]         = "decode_unit",
         [RES]                   = "",
         [SMCA_EX]         = "execution_unit",
         [SMCA_FP]         = "floating_point",
         [SMCA_L3_CACHE]   = "l3_cache",
};

static const char * const amd_df_mcablock_names[] = {
         [SMCA_CS]  = "coherent_slave",
         [SMCA_PIE] = "pie",
};

(Split arrays again because I feel it'd be better to have arrays ordered
according to mca_type values)

Expanding "df" to "data_fabric" and "pb" to "param_block" is fine with me.

>
>  
>   if (xec > len) {
> - pr_cont("Unrecognized error code from %s MCA bank\n", ip_name);
> + pr_cont("Unrecognized error code from %s MCA bank\n", amd_hwids[mca_type].name);

This wouldn't work because of the mca_type ordering as mentioned above.
(Or it should be amd_core_mcablock_names[mca_type])

>  
>   if (xec > len) {
> - pr_cont("Unrecognized error code from %s MCA bank\n", ip_name);
> + pr_cont("Unrecognized error code from %s MCA bank\n", amd_hwids[mca_type].name);
>   return;
>   }
>  

Ditto.

>
>  
> - pr_emerg(HW_ERR "%s Error: ", ip_name);
> + ip_name = amd_hwids[mca_type].name;

This should be amd_hwids[i].name
We shouldn't be using mca_type value as index..


Thanks,
-Aravind.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [PATCH 3/3] EDAC, mce_amd: Correct error paths

Aravind Gopalakrishnan
In reply to this post by Borislav Petkov-3
On 3/2/2016 4:54 AM, Borislav Petkov wrote:
> From: Borislav Petkov <[hidden email]>
> Date: Wed, 2 Mar 2016 11:46:58 +0100
> Subject: [PATCH 3/3] EDAC, mce_amd: Correct error paths
>
> We need to unwind properly when we fail to find the proper decoding
> functions. Streamline error messages to resemble the rest of this file,
> while at it and do some minor stylistic changes.
>
> Signed-off-by: Borislav Petkov <[hidden email]>

Looks good. Thanks.

Reviewed-by: Aravind Gopalakrishnan<[hidden email]>

> -
>  
>   default:
>   printk(KERN_WARNING "Huh? What family is it: 0x%x?!\n", c->x86);
> - kfree(fam_ops);
> - fam_ops = NULL;
> + goto err_out;
>   }
>  
>   pr_info("MCE: In-kernel MCE decoding enabled.\n");
> @@ -1225,6 +1224,11 @@ static int __init mce_amd_init(void)
>   mce_register_decode_chain(&amd_mce_dec_nb);
>  
>   return 0;
> +
> +err_out:
> + kfree(fam_ops);
> + fam_ops = NULL;
> + return -EINVAL;

Thanks! Sorry I missed this.

-Aravind.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [PATCH V2 2/5] EDAC, MCE, AMD: Enable error decoding of Scalable MCA errors

Aravind Gopalakrishnan
In reply to this post by Borislav Petkov-3
On 3/2/2016 4:50 AM, Borislav Petkov wrote:
>
> Ok, applied with a bunch of changes ontop.


Thanks!

>   The second patch is relying on the assumption that a
> hwid of 0 is invalid. Is that so?
>

Yes, HWID of 0 is invalid.

Thanks,
-Aravind.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [PATCH 2/3] x86/mce/AMD, EDAC: Simplify SMCA decoding

Borislav Petkov-3
In reply to this post by Aravind Gopalakrishnan
On Wed, Mar 02, 2016 at 09:52:23AM -0600, Aravind Gopalakrishnan wrote:

> So, I think we should continue this approach and have something like this-
> static const char * const amd_core_mcablock_names[] = {
>         [SMCA_LS]         = "load_store",
>         [SMCA_IF]         = "insn_fetch",
>         [SMCA_L2_CACHE]   = "l2_cache",
>         [SMCA_DE]         = "decode_unit",
>         [RES]                   = "",
>         [SMCA_EX]         = "execution_unit",
>         [SMCA_FP]         = "floating_point",
>         [SMCA_L3_CACHE]   = "l3_cache",
> };
>
> static const char * const amd_df_mcablock_names[] = {
>         [SMCA_CS]  = "coherent_slave",
>         [SMCA_PIE] = "pie",
> };
>
> (Split arrays again because I feel it'd be better to have arrays ordered
> according to mca_type values)

Ok, care to take the patch and redo it as you suggest?

I really don't want to be assigning strings each time during decoding.

Also, make sure the strings are as human readable as possible and so
that users can at least have an idea what we're saying. "load_store"
is better than "LS", "insn_fetch" is better than "IF", etc. Some
abbreviations should remain, though. "platform_security_processor" is
yucky and I guess there we can stick to "PSP". Ditto for "SMU"...

Making the unabbreviated lowercase for sysfs usage is fine too, of
course.

Thanks.

--
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [PATCH 2/3] x86/mce/AMD, EDAC: Simplify SMCA decoding

Aravind Gopalakrishnan
On 3/2/2016 10:21 AM, Borislav Petkov wrote:

> On Wed, Mar 02, 2016 at 09:52:23AM -0600, Aravind Gopalakrishnan wrote:
>> So, I think we should continue this approach and have something like this-
>> static const char * const amd_core_mcablock_names[] = {
>>          [SMCA_LS]         = "load_store",
>>          [SMCA_IF]         = "insn_fetch",
>>          [SMCA_L2_CACHE]   = "l2_cache",
>>          [SMCA_DE]         = "decode_unit",
>>          [RES]                   = "",
>>          [SMCA_EX]         = "execution_unit",
>>          [SMCA_FP]         = "floating_point",
>>          [SMCA_L3_CACHE]   = "l3_cache",
>> };
>>
>> static const char * const amd_df_mcablock_names[] = {
>>          [SMCA_CS]  = "coherent_slave",
>>          [SMCA_PIE] = "pie",
>> };
>>
>> (Split arrays again because I feel it'd be better to have arrays ordered
>> according to mca_type values)
> Ok, care to take the patch and redo it as you suggest?

Sure. I was going to introduce these strings as part of patch to update
sysfs code to
understand the new banks anyway. So it's already in the works:)

> I really don't want to be assigning strings each time during decoding.

Ok, Will update the EDAC to use the existing string array.

> Also, make sure the strings are as human readable as possible and so
> that users can at least have an idea what we're saying. "load_store"
> is better than "LS", "insn_fetch" is better than "IF", etc. Some
> abbreviations should remain, though. "platform_security_processor" is
> yucky and I guess there we can stick to "PSP". Ditto for "SMU"...

Understood. Will do as you suggest.

> Making the unabbreviated lowercase for sysfs usage is fine too, of
> course.
>
>

So, have you pushed the set of patches you applied somewhere? (bp.git?)
I can work on top of those and it will be easier to rebase on top of tip.git
once the patches find their way there..

Thanks,
-Aravind.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [PATCH 2/3] x86/mce/AMD, EDAC: Simplify SMCA decoding

Borislav Petkov-3
On Wed, Mar 02, 2016 at 10:27:57AM -0600, Aravind Gopalakrishnan wrote:
> So, have you pushed the set of patches you applied somewhere? (bp.git?)
> I can work on top of those and it will be easier to rebase on top of tip.git
> once the patches find their way there..

No.

But you can take the three here, merge them again into a single patch
and do the changes ontot.

I made them into three to show you more easily what should be changed.

Thanks.

--
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [PATCH 2/3] x86/mce/AMD, EDAC: Simplify SMCA decoding

Aravind Gopalakrishnan
On 3/2/2016 10:38 AM, Borislav Petkov wrote:
> But you can take the three here, merge them again into a single patch
> and do the changes ontot.
>
> I made them into three to show you more easily what should be changed.
>
>


Ok, I'll just spin a V3 of the entire patchset with all your suggested
changes then..

Thanks,
-Aravind.


Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[tip:ras/core] x86/mce: Move MCx_CONFIG MSR definitions

tip-bot for Peter Zijlstra
In reply to this post by Aravind Gopalakrishnan
Commit-ID:  adc53f2e0ae2fcff10a4b981df14729ffb1482fc
Gitweb:     http://git.kernel.org/tip/adc53f2e0ae2fcff10a4b981df14729ffb1482fc
Author:     Aravind Gopalakrishnan <[hidden email]>
AuthorDate: Mon, 7 Mar 2016 14:02:17 +0100
Committer:  Ingo Molnar <[hidden email]>
CommitDate: Tue, 8 Mar 2016 11:48:14 +0100

x86/mce: Move MCx_CONFIG MSR definitions

Those MSRs are used only by the MCE code so move them there.

Signed-off-by: Aravind Gopalakrishnan <[hidden email]>
Signed-off-by: Borislav Petkov <[hidden email]>
Cc: Borislav Petkov <[hidden email]>
Cc: Linus Torvalds <[hidden email]>
Cc: Peter Zijlstra <[hidden email]>
Cc: Thomas Gleixner <[hidden email]>
Cc: Tony Luck <[hidden email]>
Cc: linux-edac <[hidden email]>
Link: http://lkml.kernel.org/r/1456785179-14378-2-git-send-email-Aravind.Gopalakrishnan@...
Signed-off-by: Ingo Molnar <[hidden email]>
---
 arch/x86/include/asm/mce.h       | 4 ++++
 arch/x86/include/asm/msr-index.h | 4 ----
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 2ea4527..80ba0a8 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -91,6 +91,10 @@
 #define MCE_LOG_LEN 32
 #define MCE_LOG_SIGNATURE "MACHINECHECK"
 
+/* AMD Scalable MCA */
+#define MSR_AMD64_SMCA_MC0_CONFIG 0xc0002004
+#define MSR_AMD64_SMCA_MCx_CONFIG(x) (MSR_AMD64_SMCA_MC0_CONFIG + 0x10*(x))
+
 /*
  * This structure contains all data related to the MCE log.  Also
  * carries a signature to make it easier to find from external
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 5523465..b05402e 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -264,10 +264,6 @@
 #define MSR_IA32_MC0_CTL2 0x00000280
 #define MSR_IA32_MCx_CTL2(x) (MSR_IA32_MC0_CTL2 + (x))
 
-/* 'SMCA': AMD64 Scalable MCA */
-#define MSR_AMD64_SMCA_MC0_CONFIG 0xc0002004
-#define MSR_AMD64_SMCA_MCx_CONFIG(x) (MSR_AMD64_SMCA_MC0_CONFIG + 0x10*(x))
-
 #define MSR_P6_PERFCTR0 0x000000c1
 #define MSR_P6_PERFCTR1 0x000000c2
 #define MSR_P6_EVNTSEL0 0x00000186
Loading...