Re: [PATCH] edac:Fix kernel panic regression in edac_mc_reset_delay_period

Borislav Petkov-3
On Thu, May 19, 2016 at 03:44:57PM -0400, Nicholas Krause wrote:
> This fixes a kernel panic regression in the function
> edac_mc_reset_delay_period, as shown by this kernel panic
> trace:
> [   58.402137] BUG: unable to handle kernel paging request at 0000000000015d10
> [   58.410564] IP: [<ffffffff8109ab82>] queued_spin_lock_slowpath+0x132/0x170
> [   58.418941] PGD 3ffcc8067 PUD 3ffc56067 PMD 0
> [   58.428821] Oops: 0002 [#1] SMP
> [   58.439076] Modules linked in: xt_nat ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_addrtype iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables
> [   58.468176] CPU: 1 PID: 2792 Comm: edactest Not tainted 4.6.0-dirty #1
                                        ^^^^^^^^
Ha, what is that program?

> [   58.478878] Hardware name: HP ProLiant MicroServer, BIOS O41     10/01/2013
> [   58.488590] task: ffff8803ff9a9300 ti: ffff8803ffbf0000 task.ti: ffff8803ffbf0000
> [   58.499562] RIP: 0010:[<ffffffff8109ab82>]  [<ffffffff8109ab82>] queued_spin_lock_slowpath+0x132/0x170
> [   58.521850] RSP: 0018:ffff8803ffbf3cf8  EFLAGS: 00010002
> [   58.532653] RAX: 0000000000002bfe RBX: 0000000000000082 RCX: 0000000000080000
> [   58.545334] RDX: 0000000000015d10 RSI: 00000000affd0fc4 RDI: ffffffff81d39940
> [   58.555376] RBP: ffff88040a97b848 R08: ffff88041ed15d00 R09: 0000000000000004
> [   58.565813] R10: 000000000000000a R11: f000000000000000 R12: ffffffff81d39940
> [   58.577911] R13: 000000000000c940 R14: ffff8803ffbf3d48 R15: ffff8803ffbf3f28
> [   58.588311] FS:  00007f639468f780(0000) GS:ffff88041ed00000(0000) knlGS:00000000f7743680
> [   58.598270] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   58.609814] CR2: 0000000000015d10 CR3: 00000003ffafa000 CR4: 00000000000006e0
> [   58.620848] Stack:
> [   58.630118]  ffffffff81774d3f 000000000000000f ffffffff810ae889 ffff88040a97b820
> [   58.640635]  ffff8803ffbf3d90 0000000000002000 ffff88040c335c00 00000000000003e8
> [   58.652220]  ffffffff810aed20 0000000000000041 0000000200000000 ffff88040a97b800
> [   58.662230] Call Trace:
> [   58.672043]  [<ffffffff81774d3f>] ? _raw_spin_lock_irqsave+0x1f/0x30
> [   58.682221]  [<ffffffff810ae889>] ? lock_timer_base.isra.34+0x49/0x60
> [   58.693178]  [<ffffffff810aed20>] ? del_timer+0x30/0x70
> [   58.704839]  [<ffffffff81075494>] ? try_to_grab_pending+0xa4/0x140
> [   58.715206]  [<ffffffff81075569>] ? mod_delayed_work_on+0x39/0x80
> [   58.725250]  [<ffffffff81684e90>] ? edac_mc_reset_delay_period+0x30/0x50
> [   58.735572]  [<ffffffff81685865>] ? edac_set_poll_msec+0x45/0x60
> [   58.745346]  [<ffffffff8107a43b>] ? param_attr_store+0x6b/0xe0
> [   58.755254]  [<ffffffff81079975>] ? module_attr_store+0x15/0x20
> [   58.764869]  [<ffffffff811f7192>] ? kernfs_fop_write+0x142/0x190
> [   58.774516]  [<ffffffff81187a1e>] ? __vfs_write+0x1e/0xe0
> [   58.783565]  [<ffffffff811879d4>] ? __vfs_read+0xa4/0xd0
> [   58.792437]  [<ffffffff811a47a7>] ? __alloc_fd+0x37/0x160
> [   58.801108]  [<ffffffff811887f0>] ? vfs_write+0xb0/0x1b0
> [   58.809465]  [<ffffffff81189bdb>] ? SyS_write+0x4b/0xb0
> [   58.817707]  [<ffffffff81774f5f>] ? entry_SYSCALL_64_fastpath+0x17/0x93
> [   58.825626] Code: f8 66 c7 07 01 00 c3 66 90 f3 c3 48 89 c2 c1 e8 12 48 c1 ea 0c ff c8 83 e2 30 48 98 48 81 c2 00 5d 01 00 48 03 14 c5 40 24 d1 81 <4c> 89 02 41 8b 40 08 85 c0 75 0a f3 90 41 8b 40 08 85 c0 74 f6
> [   58.852733] RIP  [<ffffffff8109ab82>] queued_spin_lock_slowpath+0x132/0x170
> [   58.861275]  RSP <ffff8803ffbf3cf8>
> [   58.869458] CR2: 0000000000015d10
> [   58.877632] ---[ end trace 3f286bc71cca15d1 ]---
> [   58.885869] Kernel panic - not syncing: Fatal exception

So I see the splat but the fix does not look correct... It looks more
like an uninitialized workqueue somewhere. How do you trigger this?

By writing some values into
/sys/module/edac_core/parameters/edac_mc_poll_msec ? I guess that's what
that edactest program does.
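
For reference, a minimal user-space sketch of such a trigger could look like the following. It is only a guess at what a test program might do: the helper name write_poll_msec() is hypothetical, and nothing here is taken from the actual edactest source.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not from the thread): write a poll period in
 * milliseconds to a module parameter file.
 * Returns 0 on success, -1 on failure.
 */
int write_poll_msec(const char *path, unsigned long msec)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;

	/* Module parameter files take the value as ASCII text. */
	if (fprintf(f, "%lu\n", msec) < 0) {
		fclose(f);
		return -1;
	}

	/* fclose() flushes the buffer, so a failed write shows up here. */
	return fclose(f) == 0 ? 0 : -1;
}
```

Run as root, a call such as write_poll_msec("/sys/module/edac_core/parameters/edac_mc_poll_msec", 2000) would go through the param_attr_store -> edac_set_poll_msec path visible in the call trace. The value 2000 is arbitrary.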

Can I have your .config please?

...

Ok, I think I see it - we initialize the workqueues only when
->edac_check is defined. And you're probably using an EDAC driver which
doesn't define that function, thus the splat.

But which driver are you using? I don't see it in your module list. So
it is either compiled in or you've simply loaded edac_core.ko only.

If you want to write a proper fix, I'd give you a hint: look at
->op_state. That should be tested.

:-)

Thanks.

--
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

Re: [PATCH] edac:Fix kernel panic regression in edac_mc_reset_delay_period

Borislav Petkov-3
On Thu, May 19, 2016 at 06:10:37PM -0400, nick wrote:
> Here is the issue though: it does not happen on v4.4 but on newer
> kernels. I bisected it and it does work if the commit I stated is
> reverted, and working at that commit the only line changed in any
> function is in my patch. Here we can add a check like you're asking:

This is not what I hinted at. I hinted at checking ->op_state in
edac_mc_reset_delay_period() before doing edac_mod_work().

Here's another hint: ->op_state gets set to OP_RUNNING_POLL only when
->edac_check is not NULL.
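
The two hints can be put together in a small user-space model. This is a sketch only, not kernel code and not the final patch: struct mock_mci, mock_add_mc() and mock_reset_delay_period() are made-up stand-ins, the work_initialized flag merely models "the delayed work was set up", and the interrupt-mode state assigned in the else branch is an assumption about what non-polling drivers end up with.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's op_state values (mock only). */
enum mock_op_state { OP_ALLOC, OP_RUNNING_POLL, OP_RUNNING_INTERRUPT };

/* Mock memory controller; NOT the real struct mem_ctl_info. */
struct mock_mci {
	void (*edac_check)(struct mock_mci *mci);
	enum mock_op_state op_state;
	bool work_initialized;    /* models the delayed work being set up */
	unsigned long poll_msec;
};

/* Example polling callback so that ->edac_check can be non-NULL. */
void mock_check(struct mock_mci *mci)
{
	(void)mci;
}

/* Registration: the work is set up only when ->edac_check is defined. */
void mock_add_mc(struct mock_mci *mci)
{
	mci->work_initialized = false;
	mci->poll_msec = 0;

	if (mci->edac_check) {
		mci->work_initialized = true;
		mci->op_state = OP_RUNNING_POLL;
	} else {
		/* Assumption: non-polling drivers run in interrupt mode. */
		mci->op_state = OP_RUNNING_INTERRUPT;
	}
}

/*
 * Reset the poll period, touching only controllers that actually poll.
 * Returns the number of controllers that were rescheduled.
 */
int mock_reset_delay_period(struct mock_mci **mcis, size_t n,
			    unsigned long msec)
{
	int rescheduled = 0;

	for (size_t i = 0; i < n; i++) {
		struct mock_mci *mci = mcis[i];

		/* The guard: never touch work that was not initialized. */
		if (mci->op_state != OP_RUNNING_POLL)
			continue;

		mci->poll_msec = msec;
		rescheduled++;
	}

	return rescheduled;
}
```

In this model, a controller registered without ->edac_check is skipped by the reset, which corresponds to testing ->op_state in edac_mc_reset_delay_period() before calling edac_mod_work().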

Better?

--
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.