kernel BUG at mm/huge_memory.c:1798!

classic Classic list List threaded Threaded
41 messages Options
123
Reply | Threaded
Open this post in threaded view
|

Re: BUG: unable to handle kernel NULL pointer dereference at 0000000000000500

Zlatko Calusic
On 28.12.2012 01:37, Sedat Dilek wrote:

> On Fri, Dec 28, 2012 at 1:33 AM, Zlatko Calusic <[hidden email]> wrote:
>> On 28.12.2012 01:24, Sedat Dilek wrote:
>>>
>>> On Fri, Dec 28, 2012 at 12:51 AM, Zlatko Calusic
>>> <[hidden email]> wrote:
>>>>
>>>> On 28.12.2012 00:42, Sedat Dilek wrote:
>>>>>
>>>>>
>>>>> On Fri, Dec 28, 2012 at 12:39 AM, Zlatko Calusic
>>>>> <[hidden email]> wrote:
>>>>>>
>>>>>>
>>>>>> On 28.12.2012 00:30, Sedat Dilek wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hi Zlatko,
>>>>>>>
>>>>>>> I am not sure if I hit the same problem as described in this thread.
>>>>>>>
>>>>>>> Under heavy load, while building a customized toolchain for the Freetz
>>>>>>> router project I got a BUG || NULL pointer derefence || kswapd ||
>>>>>>> zone_balanced || pgdat_balanced() etc. (details see my screenshot).
>>>>>>>
>>>>>>> I will try your patch from [1] ***only*** on top of my last
>>>>>>> Linux-v3.8-rc1 GIT setup (post-v3.8-rc1 mainline + some net-fixes).
>>>>>>>
>>>>>>
>>>>>> Yes, that's the same bug. It should be fixed with my latest patch, so
>>>>>> I'd
>>>>>> appreciate you testing it, to be on the safe side this time. There
>>>>>> should
>>>>>> be
>>>>>> no difference if you apply it to anything newer than 3.8-rc1, so go for
>>>>>> it.
>>>>>> Thanks!
>>>>>>
>>>>>
>>>>> Not sure how I can really reproduce this bug as one build worked fine
>>>>> within my last v3.8-rc1 kernel.
>>>>> I increased the parallel-make-jobs-number from "4" to "8" to stress a
>>>>> bit harder.
>>>>> Just building right now... and will report.
>>>>>
>>>>> If you have any test-case (script or whatever), please let me/us know.
>>>>>
>>>>
>>>> Unfortunately not, I haven't reproduced it yet on my machines. But it
>>>> seems
>>>> that bug will hit only under heavy memory pressure. When close to OOM, or
>>>> possibly with lots of writing to disk. It's also possible that
>>>> fragmentation
>>>> of memory zones could provoke it, that means testing it for a longer
>>>> time.
>>>>
>>>
>>> I tested successfully by doing simultaneously...
>>> - building Freetz with 8 parallel make-jobs
>>> - building Linux GIT with 1 make-job
>>> - 9 tabs open in firefox
>>> - In one tab I ran YouTube music video
>>> - etc.
>>>
>>> I am reading [1] and [2] where another user reports success by reverting
>>> this...
>>>
>>> commit cda73a10eb3f493871ed39f468db50a65ebeddce
>>> "mm: do not sleep in balance_pgdat if there's no i/o congestion"
>>>
>>> BTW, this machine has also 4GiB RAM (Ubuntu/precise AMD64).
>>>
>>> Feel free to add a "Reported-by/Tested-by" if you think this is a
>>> positive report.
>>>
>>
>> Thanks for the testing! And keep running it in case something interesting
>> pops up. ;)
>>
>> No need to revert cda73a10eb because it fixes another bug. And the patch
>> you're now running fixes the new bug I introduced with a combination of my
>> latest 2 patches. Nah, it gets complicated... :)
>>
>> But, at least I found the culprit and as soon as Linus applies the fix,
>> everything will be hunky dory again, at least on this front. :P
>>
>
> I am not subscribed to LKML and linux-mm,,,
> Do you have a patch with a proper subject and descriptive text? URL?
>

Soon to follow. I'd appreciate Zhouping Liu testing it too, though.
--
Zlatko
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: BUG: unable to handle kernel NULL pointer dereference at 0000000000000500

Sedat Dilek-2
On Fri, Dec 28, 2012 at 1:42 AM, Zlatko Calusic <[hidden email]> wrote:

> On 28.12.2012 01:37, Sedat Dilek wrote:
>>
>> On Fri, Dec 28, 2012 at 1:33 AM, Zlatko Calusic <[hidden email]>
>> wrote:
>>>
>>> On 28.12.2012 01:24, Sedat Dilek wrote:
>>>>
>>>>
>>>> On Fri, Dec 28, 2012 at 12:51 AM, Zlatko Calusic
>>>> <[hidden email]> wrote:
>>>>>
>>>>>
>>>>> On 28.12.2012 00:42, Sedat Dilek wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Dec 28, 2012 at 12:39 AM, Zlatko Calusic
>>>>>> <[hidden email]> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 28.12.2012 00:30, Sedat Dilek wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Zlatko,
>>>>>>>>
>>>>>>>> I am not sure if I hit the same problem as described in this thread.
>>>>>>>>
>>>>>>>> Under heavy load, while building a customized toolchain for the
>>>>>>>> Freetz
>>>>>>>> router project I got a BUG || NULL pointer derefence || kswapd ||
>>>>>>>> zone_balanced || pgdat_balanced() etc. (details see my screenshot).
>>>>>>>>
>>>>>>>> I will try your patch from [1] ***only*** on top of my last
>>>>>>>> Linux-v3.8-rc1 GIT setup (post-v3.8-rc1 mainline + some net-fixes).
>>>>>>>>
>>>>>>>
>>>>>>> Yes, that's the same bug. It should be fixed with my latest patch, so
>>>>>>> I'd
>>>>>>> appreciate you testing it, to be on the safe side this time. There
>>>>>>> should
>>>>>>> be
>>>>>>> no difference if you apply it to anything newer than 3.8-rc1, so go
>>>>>>> for
>>>>>>> it.
>>>>>>> Thanks!
>>>>>>>
>>>>>>
>>>>>> Not sure how I can really reproduce this bug as one build worked fine
>>>>>> within my last v3.8-rc1 kernel.
>>>>>> I increased the parallel-make-jobs-number from "4" to "8" to stress a
>>>>>> bit harder.
>>>>>> Just building right now... and will report.
>>>>>>
>>>>>> If you have any test-case (script or whatever), please let me/us know.
>>>>>>
>>>>>
>>>>> Unfortunately not, I haven't reproduced it yet on my machines. But it
>>>>> seems
>>>>> that bug will hit only under heavy memory pressure. When close to OOM,
>>>>> or
>>>>> possibly with lots of writing to disk. It's also possible that
>>>>> fragmentation
>>>>> of memory zones could provoke it, that means testing it for a longer
>>>>> time.
>>>>>
>>>>
>>>> I tested successfully by doing simultaneously...
>>>> - building Freetz with 8 parallel make-jobs
>>>> - building Linux GIT with 1 make-job
>>>> - 9 tabs open in firefox
>>>> - In one tab I ran YouTube music video
>>>> - etc.
>>>>
>>>> I am reading [1] and [2] where another user reports success by reverting
>>>> this...
>>>>
>>>> commit cda73a10eb3f493871ed39f468db50a65ebeddce
>>>> "mm: do not sleep in balance_pgdat if there's no i/o congestion"
>>>>
>>>> BTW, this machine has also 4GiB RAM (Ubuntu/precise AMD64).
>>>>
>>>> Feel free to add a "Reported-by/Tested-by" if you think this is a
>>>> positive report.
>>>>
>>>
>>> Thanks for the testing! And keep running it in case something interesting
>>> pops up. ;)
>>>
>>> No need to revert cda73a10eb because it fixes another bug. And the patch
>>> you're now running fixes the new bug I introduced with a combination of
>>> my
>>> latest 2 patches. Nah, it gets complicated... :)
>>>
>>> But, at least I found the culprit and as soon as Linus applies the fix,
>>> everything will be hunky dory again, at least on this front. :P
>>>
>>
>> I am not subscribed to LKML and linux-mm,,,
>> Do you have a patch with a proper subject and descriptive text? URL?
>>
>
> Soon to follow. I'd appreciate Zhouping Liu testing it too, though.

I understand, but I want to push your patch with a proper subject-line
as a pending patch with reference to patchwork, so...

- Sedat -

> --
> Zlatko
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: BUG: unable to handle kernel NULL pointer dereference at 0000000000000500

Zhouping Liu
In reply to this post by Zlatko Calusic
>
> Thank you for the report Zhouping!
>
> Would you be so kind to test the following patch and report results?
> Apply the patch to the latest mainline.

Hello Zlatko,

I have tested the below patch(applied it on mainline directly),
but IMO, I'd like to say it maybe don't fix the issue completely.

run the reproducer[1] on two machine, one machine has 2 numa nodes(8Gb RAM),
another one has 4 numa nodes(8Gb RAM), then the system hung all the time, such as the dmesg log:

[  713.066937] Killed process 6085 (oom01) total-vm:18880768kB, anon-rss:7915612kB, file-rss:4kB
[  959.555269] INFO: task kworker/13:2:147 blocked for more than 120 seconds.
[  959.562144] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1079.382018] INFO: task kworker/13:2:147 blocked for more than 120 seconds.
[ 1079.388872] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1199.209709] INFO: task kworker/13:2:147 blocked for more than 120 seconds.
[ 1199.216562] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1319.036939] INFO: task kworker/13:2:147 blocked for more than 120 seconds.
[ 1319.043794] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1438.864797] INFO: task kworker/13:2:147 blocked for more than 120 seconds.
[ 1438.871649] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1558.691611] INFO: task kworker/13:2:147 blocked for more than 120 seconds.
[ 1558.698466] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
......

I'm not sure whether it's your patch triggering the hung task or not, but reverted cda73a10eb3,
the reproducer(oom01) can PASS without both 'NULL pointer dereference at 0000000000000500' and hung task issues.

but some time, it's possible that the reproducer(oom01) cause hung task on a box with large RAM(100Gb+), so I can't judge it...

Thanks,
Zhouping

>
> Thanks,
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 23291b9..e55ce55 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2564,6 +2564,7 @@ static bool prepare_kswapd_sleep(pg_data_t
> *pgdat, int order, long remaining,
>  static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>   int *classzone_idx)
>  {
> + bool pgdat_is_balanced = false;
>   struct zone *unbalanced_zone;
>   int i;
>   int end_zone = 0; /* Inclusive.  0 = ZONE_DMA */
> @@ -2638,8 +2639,11 @@ loop_again:
>   zone_clear_flag(zone, ZONE_CONGESTED);
>   }
>   }
> - if (i < 0)
> +
> + if (i < 0) {
> + pgdat_is_balanced = true;
>   goto out;
> + }
>  
>   for (i = 0; i <= end_zone; i++) {
>   struct zone *zone = pgdat->node_zones + i;
> @@ -2766,8 +2770,11 @@ loop_again:
>   pfmemalloc_watermark_ok(pgdat))
>   wake_up(&pgdat->pfmemalloc_wait);
>  
> - if (pgdat_balanced(pgdat, order, *classzone_idx))
> + if (pgdat_balanced(pgdat, order, *classzone_idx)) {
> + pgdat_is_balanced = true;
>   break; /* kswapd: all done */
> + }
> +
>   /*
>   * OK, kswapd is getting into trouble.  Take a nap, then take
>   * another pass across the zones.
> @@ -2775,7 +2782,7 @@ loop_again:
>   if (total_scanned && (sc.priority < DEF_PRIORITY - 2)) {
>   if (has_under_min_watermark_zone)
>   count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT);
> - else
> + else if (unbalanced_zone)
>   wait_iff_congested(unbalanced_zone, BLK_RW_ASYNC, HZ/10);
>   }
>  
> @@ -2788,9 +2795,9 @@ loop_again:
>   if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
>   break;
>   } while (--sc.priority >= 0);
> -out:
>  
> - if (!pgdat_balanced(pgdat, order, *classzone_idx)) {
> +out:
> + if (!pgdat_is_balanced) {
>   cond_resched();
>  
>   try_to_freeze();
>
> --
> Zlatko
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [hidden email].  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[hidden email]"> [hidden email] </a>
>

--
Thanks,
Zhouping
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: BUG: unable to handle kernel NULL pointer dereference at 0000000000000500

Zhouping Liu
On 12/28/2012 10:45 AM, Zhouping Liu wrote:

>> Thank you for the report Zhouping!
>>
>> Would you be so kind to test the following patch and report results?
>> Apply the patch to the latest mainline.
> Hello Zlatko,
>
> I have tested the below patch(applied it on mainline directly),
> but IMO, I'd like to say it maybe don't fix the issue completely.
>
> run the reproducer[1] on two machine, one machine has 2 numa nodes(8Gb RAM),
> another one has 4 numa nodes(8Gb RAM), then the system hung all the time, such as the dmesg log:
>
> [  713.066937] Killed process 6085 (oom01) total-vm:18880768kB, anon-rss:7915612kB, file-rss:4kB
> [  959.555269] INFO: task kworker/13:2:147 blocked for more than 120 seconds.
> [  959.562144] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1079.382018] INFO: task kworker/13:2:147 blocked for more than 120 seconds.
> [ 1079.388872] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1199.209709] INFO: task kworker/13:2:147 blocked for more than 120 seconds.
> [ 1199.216562] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1319.036939] INFO: task kworker/13:2:147 blocked for more than 120 seconds.
> [ 1319.043794] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1438.864797] INFO: task kworker/13:2:147 blocked for more than 120 seconds.
> [ 1438.871649] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1558.691611] INFO: task kworker/13:2:147 blocked for more than 120 seconds.
> [ 1558.698466] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> ......
>
> I'm not sure whether it's your patch triggering the hung task or not, but reverted cda73a10eb3,
> the reproducer(oom01) can PASS without both 'NULL pointer dereference at 0000000000000500' and hung task issues.
>
> but some time, it's possible that the reproducer(oom01) cause hung task on a box with large RAM(100Gb+), so I can't judge it...

sorry, I forgot to link the reproducer.
oom01 in LTP test suite:
https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/oom/oom01.c

from my site, it can 100% reproduce the bug using oom01 test case.

Thanks,
Zhouping

>
> Thanks,
> Zhouping
>
>> Thanks,
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 23291b9..e55ce55 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2564,6 +2564,7 @@ static bool prepare_kswapd_sleep(pg_data_t
>> *pgdat, int order, long remaining,
>>   static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>>   int *classzone_idx)
>>   {
>> + bool pgdat_is_balanced = false;
>>   struct zone *unbalanced_zone;
>>   int i;
>>   int end_zone = 0; /* Inclusive.  0 = ZONE_DMA */
>> @@ -2638,8 +2639,11 @@ loop_again:
>>   zone_clear_flag(zone, ZONE_CONGESTED);
>>   }
>>   }
>> - if (i < 0)
>> +
>> + if (i < 0) {
>> + pgdat_is_balanced = true;
>>   goto out;
>> + }
>>  
>>   for (i = 0; i <= end_zone; i++) {
>>   struct zone *zone = pgdat->node_zones + i;
>> @@ -2766,8 +2770,11 @@ loop_again:
>>   pfmemalloc_watermark_ok(pgdat))
>>   wake_up(&pgdat->pfmemalloc_wait);
>>  
>> - if (pgdat_balanced(pgdat, order, *classzone_idx))
>> + if (pgdat_balanced(pgdat, order, *classzone_idx)) {
>> + pgdat_is_balanced = true;
>>   break; /* kswapd: all done */
>> + }
>> +
>>   /*
>>   * OK, kswapd is getting into trouble.  Take a nap, then take
>>   * another pass across the zones.
>> @@ -2775,7 +2782,7 @@ loop_again:
>>   if (total_scanned && (sc.priority < DEF_PRIORITY - 2)) {
>>   if (has_under_min_watermark_zone)
>>   count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT);
>> - else
>> + else if (unbalanced_zone)
>>   wait_iff_congested(unbalanced_zone, BLK_RW_ASYNC, HZ/10);
>>   }
>>  
>> @@ -2788,9 +2795,9 @@ loop_again:
>>   if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
>>   break;
>>   } while (--sc.priority >= 0);
>> -out:
>>  
>> - if (!pgdat_balanced(pgdat, order, *classzone_idx)) {
>> +out:
>> + if (!pgdat_is_balanced) {
>>   cond_resched();
>>  
>>   try_to_freeze();
>>
>> --
>> Zlatko
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to [hidden email].  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"[hidden email]"> [hidden email] </a>
>>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: BUG: unable to handle kernel NULL pointer dereference at 0000000000000500

Zhouping Liu
In reply to this post by Zhouping Liu
On 12/28/2012 10:45 AM, Zhouping Liu wrote:
>> Thank you for the report Zhouping!
>>
>> Would you be so kind to test the following patch and report results?
>> Apply the patch to the latest mainline.
> Hello Zlatko,
>
> I have tested the below patch(applied it on mainline directly),
> but IMO, I'd like to say it maybe don't fix the issue completely.

Hi Zlatko,

I re-tested it on another machine, which has 60+ Gb RAM and 4 numa nodes,
without your patch, it's easy to reproduce the 'NULL pointer' error,
after applying your patch, I couldn't reproduce the issue any more.

depending on the above, it implied that your patch fixed the issue.

but in my last mail, I tested it on two machines, which caused hung task
with your patch,
so I'm confusing is it your patch block some oom-killer performance? if
it's not, your patch is good for me.

Thanks,
Zhouping

>
> run the reproducer[1] on two machine, one machine has 2 numa nodes(8Gb RAM),
> another one has 4 numa nodes(8Gb RAM), then the system hung all the time, such as the dmesg log:
>
> [  713.066937] Killed process 6085 (oom01) total-vm:18880768kB, anon-rss:7915612kB, file-rss:4kB
> [  959.555269] INFO: task kworker/13:2:147 blocked for more than 120 seconds.
> [  959.562144] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1079.382018] INFO: task kworker/13:2:147 blocked for more than 120 seconds.
> [ 1079.388872] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1199.209709] INFO: task kworker/13:2:147 blocked for more than 120 seconds.
> [ 1199.216562] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1319.036939] INFO: task kworker/13:2:147 blocked for more than 120 seconds.
> [ 1319.043794] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1438.864797] INFO: task kworker/13:2:147 blocked for more than 120 seconds.
> [ 1438.871649] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1558.691611] INFO: task kworker/13:2:147 blocked for more than 120 seconds.
> [ 1558.698466] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> ......
>
> I'm not sure whether it's your patch triggering the hung task or not, but reverted cda73a10eb3,
> the reproducer(oom01) can PASS without both 'NULL pointer dereference at 0000000000000500' and hung task issues.
>
> but some time, it's possible that the reproducer(oom01) cause hung task on a box with large RAM(100Gb+), so I can't judge it...
>
> Thanks,
> Zhouping
>
>> Thanks,
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 23291b9..e55ce55 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2564,6 +2564,7 @@ static bool prepare_kswapd_sleep(pg_data_t
>> *pgdat, int order, long remaining,
>>   static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>>   int *classzone_idx)
>>   {
>> + bool pgdat_is_balanced = false;
>>   struct zone *unbalanced_zone;
>>   int i;
>>   int end_zone = 0; /* Inclusive.  0 = ZONE_DMA */
>> @@ -2638,8 +2639,11 @@ loop_again:
>>   zone_clear_flag(zone, ZONE_CONGESTED);
>>   }
>>   }
>> - if (i < 0)
>> +
>> + if (i < 0) {
>> + pgdat_is_balanced = true;
>>   goto out;
>> + }
>>  
>>   for (i = 0; i <= end_zone; i++) {
>>   struct zone *zone = pgdat->node_zones + i;
>> @@ -2766,8 +2770,11 @@ loop_again:
>>   pfmemalloc_watermark_ok(pgdat))
>>   wake_up(&pgdat->pfmemalloc_wait);
>>  
>> - if (pgdat_balanced(pgdat, order, *classzone_idx))
>> + if (pgdat_balanced(pgdat, order, *classzone_idx)) {
>> + pgdat_is_balanced = true;
>>   break; /* kswapd: all done */
>> + }
>> +
>>   /*
>>   * OK, kswapd is getting into trouble.  Take a nap, then take
>>   * another pass across the zones.
>> @@ -2775,7 +2782,7 @@ loop_again:
>>   if (total_scanned && (sc.priority < DEF_PRIORITY - 2)) {
>>   if (has_under_min_watermark_zone)
>>   count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT);
>> - else
>> + else if (unbalanced_zone)
>>   wait_iff_congested(unbalanced_zone, BLK_RW_ASYNC, HZ/10);
>>   }
>>  
>> @@ -2788,9 +2795,9 @@ loop_again:
>>   if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
>>   break;
>>   } while (--sc.priority >= 0);
>> -out:
>>  
>> - if (!pgdat_balanced(pgdat, order, *classzone_idx)) {
>> +out:
>> + if (!pgdat_is_balanced) {
>>   cond_resched();
>>  
>>   try_to_freeze();
>>
>> --
>> Zlatko
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to [hidden email].  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"[hidden email]"> [hidden email] </a>
>>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: BUG: unable to handle kernel NULL pointer dereference at 0000000000000500

Zlatko Calusic
In reply to this post by Zhouping Liu
On 28.12.2012 03:45, Zhouping Liu wrote:

>>
>> Thank you for the report Zhouping!
>>
>> Would you be so kind to test the following patch and report results?
>> Apply the patch to the latest mainline.
>
> Hello Zlatko,
>
> I have tested the below patch(applied it on mainline directly),
> but IMO, I'd like to say it maybe don't fix the issue completely.
>
> run the reproducer[1] on two machine, one machine has 2 numa nodes(8Gb RAM),
> another one has 4 numa nodes(8Gb RAM), then the system hung all the time, such as the dmesg log:
>
> [  713.066937] Killed process 6085 (oom01) total-vm:18880768kB, anon-rss:7915612kB, file-rss:4kB
> [  959.555269] INFO: task kworker/13:2:147 blocked for more than 120 seconds.
> [  959.562144] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1079.382018] INFO: task kworker/13:2:147 blocked for more than 120 seconds.
> [ 1079.388872] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1199.209709] INFO: task kworker/13:2:147 blocked for more than 120 seconds.
> [ 1199.216562] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1319.036939] INFO: task kworker/13:2:147 blocked for more than 120 seconds.
> [ 1319.043794] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1438.864797] INFO: task kworker/13:2:147 blocked for more than 120 seconds.
> [ 1438.871649] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1558.691611] INFO: task kworker/13:2:147 blocked for more than 120 seconds.
> [ 1558.698466] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> ......
>
> I'm not sure whether it's your patch triggering the hung task or not, but reverted cda73a10eb3,
> the reproducer(oom01) can PASS without both 'NULL pointer dereference at 0000000000000500' and hung task issues.
>
> but some time, it's possible that the reproducer(oom01) cause hung task on a box with large RAM(100Gb+), so I can't judge it...
>

Thanks for the test.

Yes, close to OOM things get quite unstable and it's hard to get
reliable test results. Maybe you could run it a few times, and see if
you can get any meaningful statistics out of a few runs. I need to check
oom.c myself and see what it's doing. Thanks for the link.

Regards,
--
Zlatko
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: BUG: unable to handle kernel NULL pointer dereference at 0000000000000500

Zlatko Calusic
In reply to this post by Zhouping Liu
On 28.12.2012 10:01, Zhouping Liu wrote:

> On 12/28/2012 10:45 AM, Zhouping Liu wrote:
>>> Thank you for the report Zhouping!
>>>
>>> Would you be so kind to test the following patch and report results?
>>> Apply the patch to the latest mainline.
>> Hello Zlatko,
>>
>> I have tested the below patch(applied it on mainline directly),
>> but IMO, I'd like to say it maybe don't fix the issue completely.
>
> Hi Zlatko,
>
> I re-tested it on another machine, which has 60+ Gb RAM and 4 numa nodes,
> without your patch, it's easy to reproduce the 'NULL pointer' error,
> after applying your patch, I couldn't reproduce the issue any more.
>
> depending on the above, it implied that your patch fixed the issue.
>

Yes, that's exactly what I expected. Just wanted to doublecheck this
time. Live and learn. ;)

> but in my last mail, I tested it on two machines, which caused hung task
> with your patch,
> so I'm confusing is it your patch block some oom-killer performance? if
> it's not, your patch is good for me.
>

 From what I know, the patch shouldn't have much influence on the oom
killer, if any. But, as all those subsystems are closely interconnected,
both oom & vmscan code is mm after all, there could be some
interference. Is the hung-task issue repeatable?
--
Zlatko
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: kernel BUG at mm/huge_memory.c:1798!

Hillf Danton
In reply to this post by Alex Xu
On Fri, Dec 28, 2012 at 12:08 AM, Alex Xu <[hidden email]> wrote:

> On 25/12/12 07:05 AM, Hillf Danton wrote:
>> On Tue, Dec 25, 2012 at 12:38 PM, Zhouping Liu <[hidden email]> wrote:
>>> Hello all,
>>>
>>> I found the below kernel bug using latest mainline(637704cbc95),
>>> my hardware has 2 numa nodes, and it's easy to reproduce the issue
>>> using LTP test case: "# ./mmap10 -a -s -c 200":
>>
>> Can you test with 5a505085f0 and 4fc3f1d66b1 reverted?
>>
>> Hillf
>>
>
> (for people from mailing lists, please cc me when replying)
>
> Same thing?

Yes and thank you very much for reporting it.

Hillf
>
> mapcount 0 page_mapcount 1
> ------------[ cut here ]------------
> kernel BUG at mm/huge_memory.c:1798!
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: kernel BUG at mm/huge_memory.c:1798!

Mel Gorman-2
In reply to this post by Zhouping Liu
(Adding Michel and Rik to cc)

On Mon, Dec 24, 2012 at 11:38:51PM -0500, Zhouping Liu wrote:
> Hello all,
>
> I found the below kernel bug using latest mainline(637704cbc95),
> my hardware has 2 numa nodes, and it's easy to reproduce the issue
> using LTP test case: "# ./mmap10 -a -s -c 200":
>

Sorry for the long delay in responding, first day back online after being
off for a holiday. I can reproduce this problem. It takes many iterations
but triggers within a minute. It also affects 3.8-rc2.

This BUG is triggered because the rmap walk is not finding all vmas mapping
a page (walk found 0 avc's while the page mapcount was positive). THP
requires this rmap walk to be correct or minimally all the pmds will not
be marked as splitting. If PMDs are not marked splitting then all sorts
of problems can occur. The test case triggering this is fairly simple.

o A process creates a mapping 10MB-4KB in size
o The process forks, then unmaps the entire mapping and exits
o First child forks then unmaps part of the mapping and exits
o Second child unmaps then unmaps part of the mapping and exits

The partial unmapping on a page boundary but not a THP boundary is what
causes the THP split and allows this bug to be triggered. The two children
are racing with each other.

I confirmed that this bug triggers on UMA and the THP migration for NUMA
balancing is very unlikely to be a factor.  As it looked like anon_vma
rwsem was being taken for write in the correct places, it led me to conclude
that this must be a race within split_huge_page() itself somewhere.

split_huge_page consists of three tasks

  o __split_huge_page_splitting marks all PMDs as splitting

  o __split_huge_page_refcount uses the page compound lock to serialise
    based on the page and converts the compount page into 512 base pages

  o __split_huge_page_map uses the page table lock to serialise within
    the address space and updates the PTE

None of these affect avc linkages but they might be affecting the rmap
walk after the conversion to interval trees. I'm adding Michel and Rik
to the cc in case they can quickly spot how the walk could be affected
during a THP split.

The patch below hides the race by serialising split_huge_page but the
exact race needs to be identified before figuring out what the real locking
problem is. That's as far as I got with it today and will pick it up again
tomorrow. Better theories as to what is going wrong are welcome.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9e894ed..f815ddb 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1822,6 +1822,8 @@ int split_huge_page(struct page *page)
  anon_vma = page_lock_anon_vma_read(page);
  if (!anon_vma)
  goto out;
+ page_unlock_anon_vma_read(anon_vma);
+ anon_vma_lock_write(anon_vma);
  ret = 0;
  if (!PageCompound(page))
  goto out_unlock;
@@ -1832,7 +1834,7 @@ int split_huge_page(struct page *page)
 
  BUG_ON(PageCompound(page));
 out_unlock:
- page_unlock_anon_vma_read(anon_vma);
+ anon_vma_unlock(anon_vma);
 out:
  return ret;
 }

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

[PATCH] mm: thp: Acquire the anon_vma rwsem for lock during split

Mel Gorman-2
Zhouping, please test this patch.

Andrea and Hugh, any comments on whether this could be improved?

---8<---
mm: thp: Acquire the anon_vma rwsem for lock during split

Zhouping Liu reported the following against 3.8-rc1 when running a mmap
testcase from LTP.

[  588.143072] mapcount 0 page_mapcount 3
[  588.147471] ------------[ cut here ]------------
[  588.152856] kernel BUG at mm/huge_memory.c:1798!
[  588.158125] invalid opcode: 0000 [#1] SMP
[  588.162882] Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables bnep bluetooth rfkill iptable_mangle ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter
+ip_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfat fat
+dm_mirror dm_region_hash dm_log dm_mod cdc_ether iTCO_wdt i7core_edac coretemp usbnet iTCO_vendor_support mii crc32c_intel edac_core lpc_ich shpchp ioatdma mfd_core i2c_i801 pcspkr serio_raw bnx2 microcode dca
+vhost_net tun macvtap macvlan kvm_intel kvm uinput mgag200 sr_mod cdrom i2c_algo_bit sd_mod drm_kms_helper crc_t10dif ata_generic pata_acpi ttm ata_piix drm libata i2c_core megaraid_sas

[  588.246517] CPU 1
[  588.248636] Pid: 23217, comm: mmap10 Not tainted 3.8.0-rc1mainline+ #17 IBM IBM System x3400 M3 Server -[7379I08]-/69Y4356
[  588.262171] RIP: 0010:[<ffffffff8118fac7>]  [<ffffffff8118fac7>] __split_huge_page+0x677/0x6d0
[  588.272067] RSP: 0000:ffff88017a03fc08  EFLAGS: 00010293
[  588.278235] RAX: 0000000000000003 RBX: ffff88027a6c22e0 RCX: 00000000000034d2
[  588.286394] RDX: 000000000000748b RSI: 0000000000000046 RDI: 0000000000000246
[  588.294216] RBP: ffff88017a03fcb8 R08: ffffffff819d2440 R09: 000000000000054a
[  588.302441] R10: 0000000000aaaaaa R11: 00000000ffffffff R12: 0000000000000000
[  588.310495] R13: 00007f4f11a00000 R14: ffff880179e96e00 R15: ffffea0005c08000
[  588.318640] FS:  00007f4f11f4a740(0000) GS:ffff88017bc20000(0000) knlGS:0000000000000000
[  588.327894] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  588.334569] CR2: 00000037e9ebb404 CR3: 000000017a436000 CR4: 00000000000007e0
[  588.342718] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  588.350861] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  588.359134] Process mmap10 (pid: 23217, threadinfo ffff88017a03e000, task ffff880172dd32e0)
[  588.368667] Stack:
[  588.370960]  ffff88017a540ec8 ffff88017a03fc20 ffffffff816017b5 ffff88017a03fc88
[  588.379566]  ffffffff812fa014 0000000000000000 ffff880279ebd5c0 00000000f4f11a4c
[  588.388150]  00000007f4f11f49 00000007f4f11a00 ffff88017a540ef0 ffff88017a540ee8
[  588.396711] Call Trace:
[  588.455106]  [<ffffffff816017b5>] ? rwsem_down_read_failed+0x15/0x17
[  588.518106]  [<ffffffff812fa014>] ? call_rwsem_down_read_failed+0x14/0x30
[  588.580897]  [<ffffffff815ffc04>] ? down_read+0x24/0x2b
[  588.642630]  [<ffffffff8118fb88>] split_huge_page+0x68/0xb0
[  588.703814]  [<ffffffff81190ed4>] __split_huge_page_pmd+0x134/0x330
[  588.766064]  [<ffffffff8104b997>] ? pte_alloc_one+0x37/0x50
[  588.826460]  [<ffffffff81191121>] split_huge_page_pmd_mm+0x51/0x60
[  588.887746]  [<ffffffff8119116b>] split_huge_page_address+0x3b/0x50
[  588.948673]  [<ffffffff8119121c>] __vma_adjust_trans_huge+0x9c/0xf0
[  589.008660]  [<ffffffff811650f4>] vma_adjust+0x684/0x750
[  589.066328]  [<ffffffff811653ba>] __split_vma.isra.28+0x1fa/0x220
[  589.123497]  [<ffffffff810135d1>] ? __switch_to+0x181/0x4a0
[  589.180704]  [<ffffffff811661a9>] do_munmap+0xf9/0x420
[  589.237461]  [<ffffffff8160026c>] ? __schedule+0x3cc/0x7b0
[  589.294520]  [<ffffffff8116651e>] vm_munmap+0x4e/0x70
[  589.350784]  [<ffffffff8116741b>] sys_munmap+0x2b/0x40
[  589.406971]  [<ffffffff8160a159>] system_call_fastpath+0x16/0x1b

Alexander Beregalov reported a very similar bug and Hillf Danton identified
that commit 5a505085 (mm/rmap: Convert the struct anon_vma::mutex to an
rwsem) and commit 4fc3f1d6 (mm/rmap, migration: Make rmap_walk_anon()
and try_to_unmap_anon() more scalable) were likely the problem. Reverting
these commits was reported to solve the problem.

Despite the reason for these commits, NUMA balancing is not the direct
source of the problem. split_huge_page() expected the anon_vma lock to be
exclusive to serialise the whole split operation. Ordinarily it is expected
that the anon_vma lock would only be required when updating the avcs but
THP also uses it. The locking requirements for THP are complex and there
is some overlap but broadly speaking they include the following

1. mmap_sem for read or write prevents THPs being created underneath
2. anon_vma is taken for write if collapsing a huge page
3. mm->page_table_lock should be taken when checking if pmd_trans_huge as
   split_huge_page can run in parallel
4. wait_split_huge_page uses anon_vma taken for write mode to serialise
   against other THP operations
5. compound_lock is used to serialise between
   __split_huge_page_refcount() and gup

split_huge_page takes anon_vma for read but that does not serialise against
parallel split_huge_page operations on the same page (rule 2). One process
could be modifying the ref counts while the other modifies the page tables
leading to counters not being reliable. This patch takes the anon_vma
lock for write to serialise against parallel split_huge_page and parallel
collapse operations as it is the most fine-grained lock available that
protects against both.

Reported-by: Zhouping Liu <[hidden email]>
Reported-by: Alexander Beregalov <[hidden email]>
Signed-off-by: Mel Gorman <[hidden email]>
---
 mm/huge_memory.c |   15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9e894ed..6001ee6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1819,9 +1819,19 @@ int split_huge_page(struct page *page)
 
  BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
  BUG_ON(!PageAnon(page));
- anon_vma = page_lock_anon_vma_read(page);
+
+ /*
+ * The caller does not necessarily hold an mmap_sem that would prevent
+ * the anon_vma disappearing so we first we take a reference to it
+ * and then lock the anon_vma for write. This is similar to
+ * page_lock_anon_vma_read except the write lock is taken to serialise
+ * against parallel split or collapse operations.
+ */
+ anon_vma = page_get_anon_vma(page);
  if (!anon_vma)
  goto out;
+ anon_vma_lock_write(anon_vma);
+
  ret = 0;
  if (!PageCompound(page))
  goto out_unlock;
@@ -1832,7 +1842,8 @@ int split_huge_page(struct page *page)
 
  BUG_ON(PageCompound(page));
 out_unlock:
- page_unlock_anon_vma_read(anon_vma);
+ anon_vma_unlock(anon_vma);
+ put_anon_vma(anon_vma);
 out:
  return ret;
 }
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: kernel BUG at mm/huge_memory.c:1798!

Zhouping Liu
In reply to this post by Mel Gorman-2
On 01/04/2013 01:57 AM, Mel Gorman wrote:

> (Adding Michel and Rik to cc)
>
> On Mon, Dec 24, 2012 at 11:38:51PM -0500, Zhouping Liu wrote:
>> Hello all,
>>
>> I found the below kernel bug using latest mainline(637704cbc95),
>> my hardware has 2 numa nodes, and it's easy to reproduce the issue
>> using LTP test case: "# ./mmap10 -a -s -c 200":
>>
> Sorry for the long delay in responding, first day back online after being
> off for a holiday. I can reproduce this problem. It takes many iterations
> but triggers within a minute. It also affects 3.8-rc2.
>
> This BUG is triggered because the rmap walk is not finding all vmas mapping
> a page (walk found 0 avc's while the page mapcount was positive). THP
> requires this rmap walk to be correct or minimally all the pmds will not
> be marked as splitting. If PMDs are not marked splitting then all sorts
> of problems can occur. The test case triggering this is fairly simple.
>
> o A process creates a mapping 10MB-4KB in size
> o The process forks, then unmaps the entire mapping and exits
> o First child forks then unmaps part of the mapping and exits
> o Second child unmaps then unmaps part of the mapping and exits
>
> The partial unmapping on a page boundary but not a THP boundary is what
> causes the THP split and allows this bug to be triggered. The two children
> are racing with each other.
>
> I confirmed that this bug triggers on UMA and the THP migration for NUMA
> balancing is very unlikely to be a factor.  As it looked like anon_vma
> rwsem was being taken for write in the correct places, it led me to conclude
> that this must be a race within split_huge_page() itself somewhere.
>
> split_huge_page consists of three tasks
>
>    o __split_huge_page_splitting marks all PMDs as splitting
>
>    o __split_huge_page_refcount uses the page compound lock to serialise
>      based on the page and converts the compount page into 512 base pages
>
>    o __split_huge_page_map uses the page table lock to serialise within
>      the address space and updates the PTE
>
> None of these affect avc linkages but they might be affecting the rmap
> walk after the conversion to interval trees. I'm adding Michel and Rik
> to the cc in case they can quickly spot how the walk could be affected
> during a THP split.
>
> The patch below hides the race by serialising split_huge_page but the
> exact race needs to be identified before figuring out what the real locking
> problem is. That's as far as I got with it today and will pick it up again
> tomorrow. Better theories as to what is going wrong are welcome.

Hello Mel,

I have tested the below patch, and run the reproducer for a long time,
the issue
couldn't be triggered any more with the patch.

Thanks for your patch, and I will tested your new patch(it seemed that
this one
is a little different with the new one), provide the feedback ASAP.

Zhouping

>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9e894ed..f815ddb 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1822,6 +1822,8 @@ int split_huge_page(struct page *page)
>   anon_vma = page_lock_anon_vma_read(page);
>   if (!anon_vma)
>   goto out;
> + page_unlock_anon_vma_read(anon_vma);
> + anon_vma_lock_write(anon_vma);
>   ret = 0;
>   if (!PageCompound(page))
>   goto out_unlock;
> @@ -1832,7 +1834,7 @@ int split_huge_page(struct page *page)
>  
>   BUG_ON(PageCompound(page));
>   out_unlock:
> - page_unlock_anon_vma_read(anon_vma);
> + anon_vma_unlock(anon_vma);
>   out:
>   return ret;
>   }
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [hidden email].  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[hidden email]"> [hidden email] </a>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] mm: thp: Acquire the anon_vma rwsem for lock during split

Hugh Dickins-2
In reply to this post by Mel Gorman-2
I've added Alexander, Hillf and Alex to the Cc.

On Fri, 4 Jan 2013, Mel Gorman wrote:
> Zhouping, please test this patch.
>
> Andrea and Hugh, any comments on whether this could be improved?

Your patch itself looks just right to me, no improvement required;
and it's easy to understand how the bug crept in, from a blanket
rwsem replacement of anon_vma mutex meeting the harmless-looking
anon_vma_interval_tree_foreach in __split_huge_page, which looked
as if it needed only the readlock provided by the usual method.

But I'd fight shy myself of trying to describe all the THP locking
conventions in the commit message: I haven't really tried to work
out just how right you've got all those details.

The actual race in question here was just two processes (one or both
forked) doing split_huge_page() on the same THPage at the same time,
wasn't it?  (Though of course we only see the backtrace from one of
them.)  Which would be very confusing, and no surprise that the
pmd_trans_splitting test ends up skipping pmds already updated by
the racing process, so the mapcount doesn't match what's expected.
Of course we need exclusive lock against that, which you give it.

>
> ---8<---
> mm: thp: Acquire the anon_vma rwsem for lock during split

"write" rather than "lock"?

>
> Zhouping Liu reported the following against 3.8-rc1 when running a mmap
> testcase from LTP.
>
> [  588.143072] mapcount 0 page_mapcount 3
> [  588.147471] ------------[ cut here ]------------
> [  588.152856] kernel BUG at mm/huge_memory.c:1798!
> [  588.158125] invalid opcode: 0000 [#1] SMP
> [  588.162882] Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables bnep bluetooth rfkill iptable_mangle ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter
> +ip_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfat fat
> +dm_mirror dm_region_hash dm_log dm_mod cdc_ether iTCO_wdt i7core_edac coretemp usbnet iTCO_vendor_support mii crc32c_intel edac_core lpc_ich shpchp ioatdma mfd_core i2c_i801 pcspkr serio_raw bnx2 microcode dca
> +vhost_net tun macvtap macvlan kvm_intel kvm uinput mgag200 sr_mod cdrom i2c_algo_bit sd_mod drm_kms_helper crc_t10dif ata_generic pata_acpi ttm ata_piix drm libata i2c_core megaraid_sas
>
> [  588.246517] CPU 1
> [  588.248636] Pid: 23217, comm: mmap10 Not tainted 3.8.0-rc1mainline+ #17 IBM IBM System x3400 M3 Server -[7379I08]-/69Y4356
> [  588.262171] RIP: 0010:[<ffffffff8118fac7>]  [<ffffffff8118fac7>] __split_huge_page+0x677/0x6d0
> [  588.272067] RSP: 0000:ffff88017a03fc08  EFLAGS: 00010293
> [  588.278235] RAX: 0000000000000003 RBX: ffff88027a6c22e0 RCX: 00000000000034d2
> [  588.286394] RDX: 000000000000748b RSI: 0000000000000046 RDI: 0000000000000246
> [  588.294216] RBP: ffff88017a03fcb8 R08: ffffffff819d2440 R09: 000000000000054a
> [  588.302441] R10: 0000000000aaaaaa R11: 00000000ffffffff R12: 0000000000000000
> [  588.310495] R13: 00007f4f11a00000 R14: ffff880179e96e00 R15: ffffea0005c08000
> [  588.318640] FS:  00007f4f11f4a740(0000) GS:ffff88017bc20000(0000) knlGS:0000000000000000
> [  588.327894] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  588.334569] CR2: 00000037e9ebb404 CR3: 000000017a436000 CR4: 00000000000007e0
> [  588.342718] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  588.350861] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [  588.359134] Process mmap10 (pid: 23217, threadinfo ffff88017a03e000, task ffff880172dd32e0)
> [  588.368667] Stack:
> [  588.370960]  ffff88017a540ec8 ffff88017a03fc20 ffffffff816017b5 ffff88017a03fc88
> [  588.379566]  ffffffff812fa014 0000000000000000 ffff880279ebd5c0 00000000f4f11a4c
> [  588.388150]  00000007f4f11f49 00000007f4f11a00 ffff88017a540ef0 ffff88017a540ee8
> [  588.396711] Call Trace:
> [  588.455106]  [<ffffffff816017b5>] ? rwsem_down_read_failed+0x15/0x17
> [  588.518106]  [<ffffffff812fa014>] ? call_rwsem_down_read_failed+0x14/0x30
> [  588.580897]  [<ffffffff815ffc04>] ? down_read+0x24/0x2b
> [  588.642630]  [<ffffffff8118fb88>] split_huge_page+0x68/0xb0
> [  588.703814]  [<ffffffff81190ed4>] __split_huge_page_pmd+0x134/0x330
> [  588.766064]  [<ffffffff8104b997>] ? pte_alloc_one+0x37/0x50
> [  588.826460]  [<ffffffff81191121>] split_huge_page_pmd_mm+0x51/0x60
> [  588.887746]  [<ffffffff8119116b>] split_huge_page_address+0x3b/0x50
> [  588.948673]  [<ffffffff8119121c>] __vma_adjust_trans_huge+0x9c/0xf0
> [  589.008660]  [<ffffffff811650f4>] vma_adjust+0x684/0x750
> [  589.066328]  [<ffffffff811653ba>] __split_vma.isra.28+0x1fa/0x220
> [  589.123497]  [<ffffffff810135d1>] ? __switch_to+0x181/0x4a0
> [  589.180704]  [<ffffffff811661a9>] do_munmap+0xf9/0x420
> [  589.237461]  [<ffffffff8160026c>] ? __schedule+0x3cc/0x7b0
> [  589.294520]  [<ffffffff8116651e>] vm_munmap+0x4e/0x70
> [  589.350784]  [<ffffffff8116741b>] sys_munmap+0x2b/0x40
> [  589.406971]  [<ffffffff8160a159>] system_call_fastpath+0x16/0x1b
>
> Alexander Beregalov reported a very similar bug and Hillf Danton identified

And Alex Xu.

> that commit 5a505085 (mm/rmap: Convert the struct anon_vma::mutex to an
> rwsem) and commit 4fc3f1d6 (mm/rmap, migration: Make rmap_walk_anon()
> and try_to_unmap_anon() more scalable) were likely the problem. Reverting
> these commits was reported to solve the problem.
>
> Despite the reason for these commits, NUMA balancing is not the direct
> source of the problem. split_huge_page() expected the anon_vma lock to be
> exclusive to serialise the whole split operation. Ordinarily it is expected
> that the anon_vma lock would only be required when updating the avcs but
> THP also uses it. The locking requirements for THP are complex and there
> is some overlap but broadly speaking they include the following
>
> 1. mmap_sem for read or write prevents THPs being created underneath
> 2. anon_vma is taken for write if collapsing a huge page
> 3. mm->page_table_lock should be taken when checking if pmd_trans_huge as
>    split_huge_page can run in parallel
> 4. wait_split_huge_page uses anon_vma taken for write mode to serialise
>    against other THP operations
> 5. compound_lock is used to serialise between
>    __split_huge_page_refcount() and gup
>
> split_huge_page takes anon_vma for read but that does not serialise against
> parallel split_huge_page operations on the same page (rule 2). One process
> could be modifying the ref counts while the other modifies the page tables
> leading to counters not being reliable. This patch takes the anon_vma
> lock for write to serialise against parallel split_huge_page and parallel
> collapse operations as it is the most fine-grained lock available that
> protects against both.
>
> Reported-by: Zhouping Liu <[hidden email]>
> Reported-by: Alexander Beregalov <[hidden email]>
> Signed-off-by: Mel Gorman <[hidden email]>
> ---
>  mm/huge_memory.c |   15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9e894ed..6001ee6 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1819,9 +1819,19 @@ int split_huge_page(struct page *page)
>  
>   BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
>   BUG_ON(!PageAnon(page));
> - anon_vma = page_lock_anon_vma_read(page);
> +
> + /*
> + * The caller does not necessarily hold an mmap_sem that would prevent
> + * the anon_vma disappearing so we first we take a reference to it
> + * and then lock the anon_vma for write. This is similar to
> + * page_lock_anon_vma_read except the write lock is taken to serialise
> + * against parallel split or collapse operations.
> + */
> + anon_vma = page_get_anon_vma(page);
>   if (!anon_vma)
>   goto out;
> + anon_vma_lock_write(anon_vma);
> +
>   ret = 0;
>   if (!PageCompound(page))
>   goto out_unlock;
> @@ -1832,7 +1842,8 @@ int split_huge_page(struct page *page)
>  
>   BUG_ON(PageCompound(page));
>  out_unlock:
> - page_unlock_anon_vma_read(anon_vma);
> + anon_vma_unlock(anon_vma);
> + put_anon_vma(anon_vma);
>  out:
>   return ret;
>  }
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] mm: thp: Acquire the anon_vma rwsem for lock during split

Michel Lespinasse
In reply to this post by Mel Gorman-2
On Fri, Jan 4, 2013 at 6:08 AM, Mel Gorman <[hidden email]> wrote:

> Despite the reason for these commits, NUMA balancing is not the direct
> source of the problem. split_huge_page() expected the anon_vma lock to be
> exclusive to serialise the whole split operation. Ordinarily it is expected
> that the anon_vma lock would only be required when updating the avcs but
> THP also uses it. The locking requirements for THP are complex and there
> is some overlap but broadly speaking they include the following
>
> 1. mmap_sem for read or write prevents THPs being created underneath
> 2. anon_vma is taken for write if collapsing a huge page
> 3. mm->page_table_lock should be taken when checking if pmd_trans_huge as
>    split_huge_page can run in parallel
> 4. wait_split_huge_page uses anon_vma taken for write mode to serialise
>    against other THP operations
> 5. compound_lock is used to serialise between
>    __split_huge_page_refcount() and gup
>
> split_huge_page takes anon_vma for read but that does not serialise against
> parallel split_huge_page operations on the same page (rule 2). One process
> could be modifying the ref counts while the other modifies the page tables
> leading to counters not being reliable. This patch takes the anon_vma
> lock for write to serialise against parallel split_huge_page and parallel
> collapse operations as it is the most fine-grained lock available that
> protects against both.

Your comment about this being the most fine-grained lock made me
think, couldn't we use lock_page() on the THP page here ?

Now I don't necessarily want to push you that direction, because I
haven't fully thought it trough and because what you propose brings us
closer to what happened before anon_vma became an rwlock, which is
more obviously safe. But I felt I should still mention it, since we're
really only trying to protect from concurrent operations on the same
THP page, so locking at just that granularity would seem desirable.

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] mm: thp: Acquire the anon_vma rwsem for lock during split

Zhouping Liu
In reply to this post by Mel Gorman-2
On 01/04/2013 10:08 PM, Mel Gorman wrote:
> Zhouping, please test this patch.

Tested it, the issue is gone with following patch.

Tested-by: Zhouping Liu <[hidden email]>

Thanks,
Zhouping

>
> Andrea and Hugh, any comments on whether this could be improved?
>
> ---8<---
> mm: thp: Acquire the anon_vma rwsem for lock during split
>
> Zhouping Liu reported the following against 3.8-rc1 when running a mmap
> testcase from LTP.
>
> [  588.143072] mapcount 0 page_mapcount 3
> [  588.147471] ------------[ cut here ]------------
> [  588.152856] kernel BUG at mm/huge_memory.c:1798!
> [  588.158125] invalid opcode: 0000 [#1] SMP
> [  588.162882] Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables bnep bluetooth rfkill iptable_mangle ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter
> +ip_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfat fat
> +dm_mirror dm_region_hash dm_log dm_mod cdc_ether iTCO_wdt i7core_edac coretemp usbnet iTCO_vendor_support mii crc32c_intel edac_core lpc_ich shpchp ioatdma mfd_core i2c_i801 pcspkr serio_raw bnx2 microcode dca
> +vhost_net tun macvtap macvlan kvm_intel kvm uinput mgag200 sr_mod cdrom i2c_algo_bit sd_mod drm_kms_helper crc_t10dif ata_generic pata_acpi ttm ata_piix drm libata i2c_core megaraid_sas
>
> [  588.246517] CPU 1
> [  588.248636] Pid: 23217, comm: mmap10 Not tainted 3.8.0-rc1mainline+ #17 IBM IBM System x3400 M3 Server -[7379I08]-/69Y4356
> [  588.262171] RIP: 0010:[<ffffffff8118fac7>]  [<ffffffff8118fac7>] __split_huge_page+0x677/0x6d0
> [  588.272067] RSP: 0000:ffff88017a03fc08  EFLAGS: 00010293
> [  588.278235] RAX: 0000000000000003 RBX: ffff88027a6c22e0 RCX: 00000000000034d2
> [  588.286394] RDX: 000000000000748b RSI: 0000000000000046 RDI: 0000000000000246
> [  588.294216] RBP: ffff88017a03fcb8 R08: ffffffff819d2440 R09: 000000000000054a
> [  588.302441] R10: 0000000000aaaaaa R11: 00000000ffffffff R12: 0000000000000000
> [  588.310495] R13: 00007f4f11a00000 R14: ffff880179e96e00 R15: ffffea0005c08000
> [  588.318640] FS:  00007f4f11f4a740(0000) GS:ffff88017bc20000(0000) knlGS:0000000000000000
> [  588.327894] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  588.334569] CR2: 00000037e9ebb404 CR3: 000000017a436000 CR4: 00000000000007e0
> [  588.342718] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  588.350861] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [  588.359134] Process mmap10 (pid: 23217, threadinfo ffff88017a03e000, task ffff880172dd32e0)
> [  588.368667] Stack:
> [  588.370960]  ffff88017a540ec8 ffff88017a03fc20 ffffffff816017b5 ffff88017a03fc88
> [  588.379566]  ffffffff812fa014 0000000000000000 ffff880279ebd5c0 00000000f4f11a4c
> [  588.388150]  00000007f4f11f49 00000007f4f11a00 ffff88017a540ef0 ffff88017a540ee8
> [  588.396711] Call Trace:
> [  588.455106]  [<ffffffff816017b5>] ? rwsem_down_read_failed+0x15/0x17
> [  588.518106]  [<ffffffff812fa014>] ? call_rwsem_down_read_failed+0x14/0x30
> [  588.580897]  [<ffffffff815ffc04>] ? down_read+0x24/0x2b
> [  588.642630]  [<ffffffff8118fb88>] split_huge_page+0x68/0xb0
> [  588.703814]  [<ffffffff81190ed4>] __split_huge_page_pmd+0x134/0x330
> [  588.766064]  [<ffffffff8104b997>] ? pte_alloc_one+0x37/0x50
> [  588.826460]  [<ffffffff81191121>] split_huge_page_pmd_mm+0x51/0x60
> [  588.887746]  [<ffffffff8119116b>] split_huge_page_address+0x3b/0x50
> [  588.948673]  [<ffffffff8119121c>] __vma_adjust_trans_huge+0x9c/0xf0
> [  589.008660]  [<ffffffff811650f4>] vma_adjust+0x684/0x750
> [  589.066328]  [<ffffffff811653ba>] __split_vma.isra.28+0x1fa/0x220
> [  589.123497]  [<ffffffff810135d1>] ? __switch_to+0x181/0x4a0
> [  589.180704]  [<ffffffff811661a9>] do_munmap+0xf9/0x420
> [  589.237461]  [<ffffffff8160026c>] ? __schedule+0x3cc/0x7b0
> [  589.294520]  [<ffffffff8116651e>] vm_munmap+0x4e/0x70
> [  589.350784]  [<ffffffff8116741b>] sys_munmap+0x2b/0x40
> [  589.406971]  [<ffffffff8160a159>] system_call_fastpath+0x16/0x1b
>
> Alexander Beregalov reported a very similar bug and Hillf Danton identified
> that commit 5a505085 (mm/rmap: Convert the struct anon_vma::mutex to an
> rwsem) and commit 4fc3f1d6 (mm/rmap, migration: Make rmap_walk_anon()
> and try_to_unmap_anon() more scalable) were likely the problem. Reverting
> these commits was reported to solve the problem.
>
> Despite the reason for these commits, NUMA balancing is not the direct
> source of the problem. split_huge_page() expected the anon_vma lock to be
> exclusive to serialise the whole split operation. Ordinarily it is expected
> that the anon_vma lock would only be required when updating the avcs but
> THP also uses it. The locking requirements for THP are complex and there
> is some overlap but broadly speaking they include the following
>
> 1. mmap_sem for read or write prevents THPs being created underneath
> 2. anon_vma is taken for write if collapsing a huge page
> 3. mm->page_table_lock should be taken when checking if pmd_trans_huge as
>     split_huge_page can run in parallel
> 4. wait_split_huge_page uses anon_vma taken for write mode to serialise
>     against other THP operations
> 5. compound_lock is used to serialise between
>     __split_huge_page_refcount() and gup
>
> split_huge_page takes anon_vma for read but that does not serialise against
> parallel split_huge_page operations on the same page (rule 2). One process
> could be modifying the ref counts while the other modifies the page tables
> leading to counters not being reliable. This patch takes the anon_vma
> lock for write to serialise against parallel split_huge_page and parallel
> collapse operations as it is the most fine-grained lock available that
> protects against both.
>
> Reported-by: Zhouping Liu <[hidden email]>
> Reported-by: Alexander Beregalov <[hidden email]>
> Signed-off-by: Mel Gorman <[hidden email]>
> ---
>   mm/huge_memory.c |   15 +++++++++++++--
>   1 file changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9e894ed..6001ee6 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1819,9 +1819,19 @@ int split_huge_page(struct page *page)
>  
>   BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
>   BUG_ON(!PageAnon(page));
> - anon_vma = page_lock_anon_vma_read(page);
> +
> + /*
> + * The caller does not necessarily hold an mmap_sem that would prevent
> + * the anon_vma disappearing so we first we take a reference to it
> + * and then lock the anon_vma for write. This is similar to
> + * page_lock_anon_vma_read except the write lock is taken to serialise
> + * against parallel split or collapse operations.
> + */
> + anon_vma = page_get_anon_vma(page);
>   if (!anon_vma)
>   goto out;
> + anon_vma_lock_write(anon_vma);
> +
>   ret = 0;
>   if (!PageCompound(page))
>   goto out_unlock;
> @@ -1832,7 +1842,8 @@ int split_huge_page(struct page *page)
>  
>   BUG_ON(PageCompound(page));
>   out_unlock:
> - page_unlock_anon_vma_read(anon_vma);
> + anon_vma_unlock(anon_vma);
> + put_anon_vma(anon_vma);
>   out:
>   return ret;
>   }

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] mm: thp: Acquire the anon_vma rwsem for lock during split

Simon Jeons
In reply to this post by Mel Gorman-2
On Fri, 2013-01-04 at 14:08 +0000, Mel Gorman wrote:

> Zhouping, please test this patch.
>
> Andrea and Hugh, any comments on whether this could be improved?
>
> ---8<---
> mm: thp: Acquire the anon_vma rwsem for lock during split
>
> Zhouping Liu reported the following against 3.8-rc1 when running a mmap
> testcase from LTP.
>
> [  588.143072] mapcount 0 page_mapcount 3
> [  588.147471] ------------[ cut here ]------------
> [  588.152856] kernel BUG at mm/huge_memory.c:1798!
> [  588.158125] invalid opcode: 0000 [#1] SMP
> [  588.162882] Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables bnep bluetooth rfkill iptable_mangle ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter
> +ip_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfat fat
> +dm_mirror dm_region_hash dm_log dm_mod cdc_ether iTCO_wdt i7core_edac coretemp usbnet iTCO_vendor_support mii crc32c_intel edac_core lpc_ich shpchp ioatdma mfd_core i2c_i801 pcspkr serio_raw bnx2 microcode dca
> +vhost_net tun macvtap macvlan kvm_intel kvm uinput mgag200 sr_mod cdrom i2c_algo_bit sd_mod drm_kms_helper crc_t10dif ata_generic pata_acpi ttm ata_piix drm libata i2c_core megaraid_sas
>
> [  588.246517] CPU 1
> [  588.248636] Pid: 23217, comm: mmap10 Not tainted 3.8.0-rc1mainline+ #17 IBM IBM System x3400 M3 Server -[7379I08]-/69Y4356
> [  588.262171] RIP: 0010:[<ffffffff8118fac7>]  [<ffffffff8118fac7>] __split_huge_page+0x677/0x6d0
> [  588.272067] RSP: 0000:ffff88017a03fc08  EFLAGS: 00010293
> [  588.278235] RAX: 0000000000000003 RBX: ffff88027a6c22e0 RCX: 00000000000034d2
> [  588.286394] RDX: 000000000000748b RSI: 0000000000000046 RDI: 0000000000000246
> [  588.294216] RBP: ffff88017a03fcb8 R08: ffffffff819d2440 R09: 000000000000054a
> [  588.302441] R10: 0000000000aaaaaa R11: 00000000ffffffff R12: 0000000000000000
> [  588.310495] R13: 00007f4f11a00000 R14: ffff880179e96e00 R15: ffffea0005c08000
> [  588.318640] FS:  00007f4f11f4a740(0000) GS:ffff88017bc20000(0000) knlGS:0000000000000000
> [  588.327894] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  588.334569] CR2: 00000037e9ebb404 CR3: 000000017a436000 CR4: 00000000000007e0
> [  588.342718] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  588.350861] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [  588.359134] Process mmap10 (pid: 23217, threadinfo ffff88017a03e000, task ffff880172dd32e0)
> [  588.368667] Stack:
> [  588.370960]  ffff88017a540ec8 ffff88017a03fc20 ffffffff816017b5 ffff88017a03fc88
> [  588.379566]  ffffffff812fa014 0000000000000000 ffff880279ebd5c0 00000000f4f11a4c
> [  588.388150]  00000007f4f11f49 00000007f4f11a00 ffff88017a540ef0 ffff88017a540ee8
> [  588.396711] Call Trace:
> [  588.455106]  [<ffffffff816017b5>] ? rwsem_down_read_failed+0x15/0x17
> [  588.518106]  [<ffffffff812fa014>] ? call_rwsem_down_read_failed+0x14/0x30
> [  588.580897]  [<ffffffff815ffc04>] ? down_read+0x24/0x2b
> [  588.642630]  [<ffffffff8118fb88>] split_huge_page+0x68/0xb0
> [  588.703814]  [<ffffffff81190ed4>] __split_huge_page_pmd+0x134/0x330
> [  588.766064]  [<ffffffff8104b997>] ? pte_alloc_one+0x37/0x50
> [  588.826460]  [<ffffffff81191121>] split_huge_page_pmd_mm+0x51/0x60
> [  588.887746]  [<ffffffff8119116b>] split_huge_page_address+0x3b/0x50
> [  588.948673]  [<ffffffff8119121c>] __vma_adjust_trans_huge+0x9c/0xf0
> [  589.008660]  [<ffffffff811650f4>] vma_adjust+0x684/0x750
> [  589.066328]  [<ffffffff811653ba>] __split_vma.isra.28+0x1fa/0x220
> [  589.123497]  [<ffffffff810135d1>] ? __switch_to+0x181/0x4a0
> [  589.180704]  [<ffffffff811661a9>] do_munmap+0xf9/0x420
> [  589.237461]  [<ffffffff8160026c>] ? __schedule+0x3cc/0x7b0
> [  589.294520]  [<ffffffff8116651e>] vm_munmap+0x4e/0x70
> [  589.350784]  [<ffffffff8116741b>] sys_munmap+0x2b/0x40
> [  589.406971]  [<ffffffff8160a159>] system_call_fastpath+0x16/0x1b
>
> Alexander Beregalov reported a very similar bug and Hillf Danton identified
> that commit 5a505085 (mm/rmap: Convert the struct anon_vma::mutex to an
> rwsem) and commit 4fc3f1d6 (mm/rmap, migration: Make rmap_walk_anon()
> and try_to_unmap_anon() more scalable) were likely the problem. Reverting
> these commits was reported to solve the problem.
>
> Despite the reason for these commits, NUMA balancing is not the direct
> source of the problem. split_huge_page() expected the anon_vma lock to be
> exclusive to serialise the whole split operation. Ordinarily it is expected
> that the anon_vma lock would only be required when updating the avcs but
> THP also uses it. The locking requirements for THP are complex and there
> is some overlap but broadly speaking they include the following
>
> 1. mmap_sem for read or write prevents THPs being created underneath
> 2. anon_vma is taken for write if collapsing a huge page
> 3. mm->page_table_lock should be taken when checking if pmd_trans_huge as
>    split_huge_page can run in parallel
> 4. wait_split_huge_page uses anon_vma taken for write mode to serialise
>    against other THP operations
> 5. compound_lock is used to serialise between
>    __split_huge_page_refcount() and gup

Can't find gup path hold compound_lock.

>
> split_huge_page takes anon_vma for read but that does not serialise against
> parallel split_huge_page operations on the same page (rule 2). One process
> could be modifying the ref counts while the other modifies the page tables
> leading to counters not being reliable. This patch takes the anon_vma
> lock for write to serialise against parallel split_huge_page and parallel
> collapse operations as it is the most fine-grained lock available that
> protects against both.
>
> Reported-by: Zhouping Liu <[hidden email]>
> Reported-by: Alexander Beregalov <[hidden email]>
> Signed-off-by: Mel Gorman <[hidden email]>
> ---
>  mm/huge_memory.c |   15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9e894ed..6001ee6 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1819,9 +1819,19 @@ int split_huge_page(struct page *page)
>  
>   BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
>   BUG_ON(!PageAnon(page));
> - anon_vma = page_lock_anon_vma_read(page);
> +
> + /*
> + * The caller does not necessarily hold an mmap_sem that would prevent
> + * the anon_vma disappearing so we first we take a reference to it
> + * and then lock the anon_vma for write. This is similar to
> + * page_lock_anon_vma_read except the write lock is taken to serialise
> + * against parallel split or collapse operations.
> + */
> + anon_vma = page_get_anon_vma(page);
>   if (!anon_vma)
>   goto out;
> + anon_vma_lock_write(anon_vma);
> +
>   ret = 0;
>   if (!PageCompound(page))
>   goto out_unlock;
> @@ -1832,7 +1842,8 @@ int split_huge_page(struct page *page)
>  
>   BUG_ON(PageCompound(page));
>  out_unlock:
> - page_unlock_anon_vma_read(anon_vma);
> + anon_vma_unlock(anon_vma);
> + put_anon_vma(anon_vma);
>  out:
>   return ret;
>  }
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [hidden email].  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[hidden email]"> [hidden email] </a>


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] mm: thp: Acquire the anon_vma rwsem for lock during split

Simon Jeons
In reply to this post by Michel Lespinasse
On Fri, 2013-01-04 at 17:32 -0800, Michel Lespinasse wrote:

> On Fri, Jan 4, 2013 at 6:08 AM, Mel Gorman <[hidden email]> wrote:
> > Despite the reason for these commits, NUMA balancing is not the direct
> > source of the problem. split_huge_page() expected the anon_vma lock to be
> > exclusive to serialise the whole split operation. Ordinarily it is expected
> > that the anon_vma lock would only be required when updating the avcs but
> > THP also uses it. The locking requirements for THP are complex and there
> > is some overlap but broadly speaking they include the following
> >
> > 1. mmap_sem for read or write prevents THPs being created underneath
> > 2. anon_vma is taken for write if collapsing a huge page
> > 3. mm->page_table_lock should be taken when checking if pmd_trans_huge as
> >    split_huge_page can run in parallel
> > 4. wait_split_huge_page uses anon_vma taken for write mode to serialise
> >    against other THP operations
> > 5. compound_lock is used to serialise between
> >    __split_huge_page_refcount() and gup
> >
> > split_huge_page takes anon_vma for read but that does not serialise against
> > parallel split_huge_page operations on the same page (rule 2). One process
> > could be modifying the ref counts while the other modifies the page tables
> > leading to counters not being reliable. This patch takes the anon_vma
> > lock for write to serialise against parallel split_huge_page and parallel
> > collapse operations as it is the most fine-grained lock available that
> > protects against both.
>
> Your comment about this being the most fine-grained lock made me
> think, couldn't we use lock_page() on the THP page here ?
>
> Now I don't necessarily want to push you that direction, because I
> haven't fully thought it trough and because what you propose brings us
> closer to what happened before anon_vma became an rwlock, which is
> more obviously safe. But I felt I should still mention it, since we're
> really only trying to protect from concurrent operations on the same
> THP page, so locking at just that granularity would seem desirable.

Why you said that anon_vma lock who will protect page associated to a
list of vmas is fine-grained then page lock who just protect one page?

>


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] mm: thp: Acquire the anon_vma rwsem for lock during split

Mel Gorman-2
In reply to this post by Hugh Dickins-2
On Fri, Jan 04, 2013 at 01:28:09PM -0800, Hugh Dickins wrote:

> I've added Alexander, Hillf and Alex to the Cc.
>
> On Fri, 4 Jan 2013, Mel Gorman wrote:
> > Zhouping, please test this patch.
> >
> > Andrea and Hugh, any comments on whether this could be improved?
>
> Your patch itself looks just right to me, no improvement required;
> and it's easy to understand how the bug crept in, from a blanket
> rwsem replacement of anon_vma mutex meeting the harmless-looking
> anon_vma_interval_tree_foreach in __split_huge_page, which looked
> as if it needed only the readlock provided by the usual method.
>

Indeed. Thanks Hugh for taking a look over it.

> But I'd fight shy myself of trying to describe all the THP locking
> conventions in the commit message: I haven't really tried to work
> out just how right you've got all those details.
>

I thought it was risky myself but it was the best way of getting Andrea
to object if I missed some subtlety! If I had infinite time I would
follow up with a patch to Documentation/vm/transhuge.txt explaining how
the anon_vma lock is used by THP.

> The actual race in question here was just two processes (one or both
> forked) doing split_huge_page() on the same THPage at the same time,
> wasn't it?  (Though of course we only see the backtrace from one of
> them.)  Which would be very confusing, and no surprise that the
> pmd_trans_splitting test ends up skipping pmds already updated by
> the racing process, so the mapcount doesn't match what's expected.
> Of course we need exclusive lock against that, which you give it.
>

Ok, thanks. Will resend to Andrew with some changelog edits.

--
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] mm: thp: Acquire the anon_vma rwsem for lock during split

Mel Gorman-2
In reply to this post by Zhouping Liu
On Sat, Jan 05, 2013 at 01:51:09PM +0800, Zhouping Liu wrote:
> On 01/04/2013 10:08 PM, Mel Gorman wrote:
> >Zhouping, please test this patch.
>
> Tested it, the issue is gone with following patch.
>
> Tested-by: Zhouping Liu <[hidden email]>
>

Super. Thanks very much for reporting and testing this quickly.

--
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

[PATCH] mm: thp: Acquire the anon_vma rwsem for write during split

Mel Gorman-2
In reply to this post by Hugh Dickins-2
Zhouping Liu reported the following against 3.8-rc1 when running a mmap
testcase from LTP.

[  588.143072] mapcount 0 page_mapcount 3
[  588.147471] ------------[ cut here ]------------
[  588.152856] kernel BUG at mm/huge_memory.c:1798!
[  588.158125] invalid opcode: 0000 [#1] SMP
[  588.162882] Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables bnep bluetooth rfkill iptable_mangle ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter
+ip_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfat fat
+dm_mirror dm_region_hash dm_log dm_mod cdc_ether iTCO_wdt i7core_edac coretemp usbnet iTCO_vendor_support mii crc32c_intel edac_core lpc_ich shpchp ioatdma mfd_core i2c_i801 pcspkr serio_raw bnx2 microcode dca
+vhost_net tun macvtap macvlan kvm_intel kvm uinput mgag200 sr_mod cdrom i2c_algo_bit sd_mod drm_kms_helper crc_t10dif ata_generic pata_acpi ttm ata_piix drm libata i2c_core megaraid_sas

[  588.246517] CPU 1
[  588.248636] Pid: 23217, comm: mmap10 Not tainted 3.8.0-rc1mainline+ #17 IBM IBM System x3400 M3 Server -[7379I08]-/69Y4356
[  588.262171] RIP: 0010:[<ffffffff8118fac7>]  [<ffffffff8118fac7>] __split_huge_page+0x677/0x6d0
[  588.272067] RSP: 0000:ffff88017a03fc08  EFLAGS: 00010293
[  588.278235] RAX: 0000000000000003 RBX: ffff88027a6c22e0 RCX: 00000000000034d2
[  588.286394] RDX: 000000000000748b RSI: 0000000000000046 RDI: 0000000000000246
[  588.294216] RBP: ffff88017a03fcb8 R08: ffffffff819d2440 R09: 000000000000054a
[  588.302441] R10: 0000000000aaaaaa R11: 00000000ffffffff R12: 0000000000000000
[  588.310495] R13: 00007f4f11a00000 R14: ffff880179e96e00 R15: ffffea0005c08000
[  588.318640] FS:  00007f4f11f4a740(0000) GS:ffff88017bc20000(0000) knlGS:0000000000000000
[  588.327894] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  588.334569] CR2: 00000037e9ebb404 CR3: 000000017a436000 CR4: 00000000000007e0
[  588.342718] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  588.350861] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  588.359134] Process mmap10 (pid: 23217, threadinfo ffff88017a03e000, task ffff880172dd32e0)
[  588.368667] Stack:
[  588.370960]  ffff88017a540ec8 ffff88017a03fc20 ffffffff816017b5 ffff88017a03fc88
[  588.379566]  ffffffff812fa014 0000000000000000 ffff880279ebd5c0 00000000f4f11a4c
[  588.388150]  00000007f4f11f49 00000007f4f11a00 ffff88017a540ef0 ffff88017a540ee8
[  588.396711] Call Trace:
[  588.455106]  [<ffffffff816017b5>] ? rwsem_down_read_failed+0x15/0x17
[  588.518106]  [<ffffffff812fa014>] ? call_rwsem_down_read_failed+0x14/0x30
[  588.580897]  [<ffffffff815ffc04>] ? down_read+0x24/0x2b
[  588.642630]  [<ffffffff8118fb88>] split_huge_page+0x68/0xb0
[  588.703814]  [<ffffffff81190ed4>] __split_huge_page_pmd+0x134/0x330
[  588.766064]  [<ffffffff8104b997>] ? pte_alloc_one+0x37/0x50
[  588.826460]  [<ffffffff81191121>] split_huge_page_pmd_mm+0x51/0x60
[  588.887746]  [<ffffffff8119116b>] split_huge_page_address+0x3b/0x50
[  588.948673]  [<ffffffff8119121c>] __vma_adjust_trans_huge+0x9c/0xf0
[  589.008660]  [<ffffffff811650f4>] vma_adjust+0x684/0x750
[  589.066328]  [<ffffffff811653ba>] __split_vma.isra.28+0x1fa/0x220
[  589.123497]  [<ffffffff810135d1>] ? __switch_to+0x181/0x4a0
[  589.180704]  [<ffffffff811661a9>] do_munmap+0xf9/0x420
[  589.237461]  [<ffffffff8160026c>] ? __schedule+0x3cc/0x7b0
[  589.294520]  [<ffffffff8116651e>] vm_munmap+0x4e/0x70
[  589.350784]  [<ffffffff8116741b>] sys_munmap+0x2b/0x40
[  589.406971]  [<ffffffff8160a159>] system_call_fastpath+0x16/0x1b

Alexander Beregalov and Alex Xu reported similar bugs and Hillf Danton
identified that commit 5a505085 (mm/rmap: Convert the struct anon_vma::mutex
to an rwsem) and commit 4fc3f1d6 (mm/rmap, migration: Make rmap_walk_anon()
and try_to_unmap_anon() more scalable) were likely the problem. Reverting
these commits was reported to solve the problem for Alexander.

Despite the reason for these commits, NUMA balancing is not the direct
source of the problem. split_huge_page() expects the anon_vma lock to be
exclusive to serialise the whole split operation. Ordinarily it is expected
that the anon_vma lock would only be required when updating the avcs but
THP also uses the anon_vma rwsem for collapse and split operations where
the page lock or compound lock cannot be used (as the page is changing
from base to THP or vice versa) and the page table locks are
insufficient.

This patch takes the anon_vma lock for write to serialise against parallel
split_huge_page as THP expected before the conversion to rwsem.

Reported-and-tested-by: Zhouping Liu <[hidden email]>
Reported-by: Alexander Beregalov <[hidden email]>
Reported-by: Alex Xu <[hidden email]>
Signed-off-by: Mel Gorman <[hidden email]>
---
 mm/huge_memory.c |   15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9e894ed..6001ee6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1819,9 +1819,19 @@ int split_huge_page(struct page *page)
 
  BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
  BUG_ON(!PageAnon(page));
- anon_vma = page_lock_anon_vma_read(page);
+
+ /*
+ * The caller does not necessarily hold an mmap_sem that would prevent
+ * the anon_vma disappearing so we first we take a reference to it
+ * and then lock the anon_vma for write. This is similar to
+ * page_lock_anon_vma_read except the write lock is taken to serialise
+ * against parallel split or collapse operations.
+ */
+ anon_vma = page_get_anon_vma(page);
  if (!anon_vma)
  goto out;
+ anon_vma_lock_write(anon_vma);
+
  ret = 0;
  if (!PageCompound(page))
  goto out_unlock;
@@ -1832,7 +1842,8 @@ int split_huge_page(struct page *page)
 
  BUG_ON(PageCompound(page));
 out_unlock:
- page_unlock_anon_vma_read(anon_vma);
+ anon_vma_unlock(anon_vma);
+ put_anon_vma(anon_vma);
 out:
  return ret;
 }
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] mm: thp: Acquire the anon_vma rwsem for lock during split

Mel Gorman-2
In reply to this post by Michel Lespinasse
On Fri, Jan 04, 2013 at 05:32:17PM -0800, Michel Lespinasse wrote:

> On Fri, Jan 4, 2013 at 6:08 AM, Mel Gorman <[hidden email]> wrote:
> > Despite the reason for these commits, NUMA balancing is not the direct
> > source of the problem. split_huge_page() expected the anon_vma lock to be
> > exclusive to serialise the whole split operation. Ordinarily it is expected
> > that the anon_vma lock would only be required when updating the avcs but
> > THP also uses it. The locking requirements for THP are complex and there
> > is some overlap but broadly speaking they include the following
> >
> > 1. mmap_sem for read or write prevents THPs being created underneath
> > 2. anon_vma is taken for write if collapsing a huge page
> > 3. mm->page_table_lock should be taken when checking if pmd_trans_huge as
> >    split_huge_page can run in parallel
> > 4. wait_split_huge_page uses anon_vma taken for write mode to serialise
> >    against other THP operations
> > 5. compound_lock is used to serialise between
> >    __split_huge_page_refcount() and gup
> >
> > split_huge_page takes anon_vma for read but that does not serialise against
> > parallel split_huge_page operations on the same page (rule 2). One process
> > could be modifying the ref counts while the other modifies the page tables
> > leading to counters not being reliable. This patch takes the anon_vma
> > lock for write to serialise against parallel split_huge_page and parallel
> > collapse operations as it is the most fine-grained lock available that
> > protects against both.
>
> Your comment about this being the most fine-grained lock made me
> think, couldn't we use lock_page() on the THP page here ?
>
> Now I don't necessarily want to push you that direction, because I
> haven't fully thought it trough and because what you propose brings us
> closer to what happened before anon_vma became an rwlock, which is
> more obviously safe. But I felt I should still mention it, since we're
> really only trying to protect from concurrent operations on the same
> THP page, so locking at just that granularity would seem desirable.
>

I considered this too because anon_vma locking is really coarse. The
coarse nature is not the only issue as depending on the anon_vma lock is
yet another obstacle to using THP for file-backed pages. I also did not
think about it fully but it felt that trying to convert to the page lock
would be problematic. Take reclaim;

shrink_page_list (trylock_page so we hold the page lock)
  -> add_to_swap
    -> split_huge_page (lock_page)

BANG, we recursively lock. During collapse it's also problematic because we
use the anon_vma lock to protect the whole operation but still depend on
the page lock to protect against LRU modifications. We take the page
lock before isolating the page from the LRU otherwise we'd have to
disable IRQs each time and that would ugly.

I did not think converting to the page lock woul dbe impossible but it
it non-trivial and it's not better if it means we have to take the LRU
lock frequently in khugepaged when collapsing to a THP.

--
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
123