[PATCH 0/3] mm, thp: remove duplication and fix locking issues in swapin


Ebru Akagunduz
This patch series removes a duplicated header include
and fixes locking inconsistencies in khugepaged swapin.

Ebru Akagunduz (3):
  mm, thp: remove duplication of included header
  mm, thp: fix possible circular locking dependency caused by
    sum_vm_event()
  mm, thp: make swapin readahead under down_read of mmap_sem

 mm/huge_memory.c | 39 ++++++++++++++++++++++++++++++---------
 1 file changed, 30 insertions(+), 9 deletions(-)

--
1.9.1


[PATCH 1/3] mm, thp: remove duplication of included header

Ebru Akagunduz
swapops.h was included a second time by commit:
http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=639040960a340f6f987065fc52e149f4ea25ce25

This patch removes the duplicate include.

Signed-off-by: Ebru Akagunduz <[hidden email]>
---
 mm/huge_memory.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9472fed..91442a9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -30,7 +30,6 @@
 #include <linux/hashtable.h>
 #include <linux/userfaultfd_k.h>
 #include <linux/page_idle.h>
-#include <linux/swapops.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
--
1.9.1


[PATCH 2/3] mm, thp: fix possible circular locking dependency caused by sum_vm_event()

Ebru Akagunduz
A nested circular locking dependency was detected by the kernel robot (udevadm):

  udevadm/221 is trying to acquire lock:
   (&mm->mmap_sem){++++++}, at: [<ffffffff81262543>] __might_fault+0x83/0x150
  but task is already holding lock:
   (s_active#12){++++.+}, at: [<ffffffff813315ee>] kernfs_fop_write+0x8e/0x250
  which lock already depends on the new lock.

 Possible unsafe locking scenario:

 CPU0                    CPU1
 ----                    ----
 lock(s_active);
                         lock(cpu_hotplug.lock);
                         lock(s_active);
 lock(&mm->mmap_sem);

 the existing dependency chain (in reverse order) is:
 -> #2 (s_active#12){++++.+}:
        [<ffffffff8117da2c>] lock_acquire+0xac/0x180
        [<ffffffff8132f50a>] __kernfs_remove+0x2da/0x410
        [<ffffffff81330630>] kernfs_remove_by_name_ns+0x40/0x90
        [<ffffffff813339fb>] sysfs_remove_file_ns+0x2b/0x70
        [<ffffffff81ba8a16>] device_del+0x166/0x320
        [<ffffffff81ba943c>] device_destroy+0x3c/0x50
        [<ffffffff8105aa61>] cpuid_class_cpu_callback+0x51/0x70
        [<ffffffff81131ce9>] notifier_call_chain+0x59/0x190
        [<ffffffff81132749>] __raw_notifier_call_chain+0x9/0x10
        [<ffffffff810fe6b0>] __cpu_notify+0x40/0x90
        [<ffffffff810fe890>] cpu_notify_nofail+0x10/0x30
        [<ffffffff810fe8d7>] notify_dead+0x27/0x1e0
        [<ffffffff810fe273>] cpuhp_down_callbacks+0x93/0x190
        [<ffffffff82096062>] _cpu_down+0xc2/0x1e0
        [<ffffffff810ff727>] do_cpu_down+0x37/0x50
        [<ffffffff8110003b>] cpu_down+0xb/0x10
        [<ffffffff81038e4d>] _debug_hotplug_cpu+0x7d/0xd0
        [<ffffffff8435d6bb>] debug_hotplug_cpu+0xd/0x11
        [<ffffffff84352426>] do_one_initcall+0x138/0x1cf
        [<ffffffff8435270a>] kernel_init_freeable+0x24d/0x2de
        [<ffffffff8209533a>] kernel_init+0xa/0x120
        [<ffffffff820a7972>] ret_from_fork+0x22/0x50

 -> #1 (cpu_hotplug.lock#2){+.+.+.}:
        [<ffffffff8117da2c>] lock_acquire+0xac/0x180
        [<ffffffff820a20d1>] mutex_lock_nested+0x71/0x4c0
        [<ffffffff810ff526>] get_online_cpus+0x66/0x80
        [<ffffffff81246fb3>] sum_vm_event+0x23/0x1b0
        [<ffffffff81293768>] collapse_huge_page+0x118/0x10b0
        [<ffffffff81294c5d>] khugepaged+0x55d/0xe80
        [<ffffffff81130304>] kthread+0x134/0x1a0
        [<ffffffff820a7972>] ret_from_fork+0x22/0x50

 -> #0 (&mm->mmap_sem){++++++}:
        [<ffffffff8117bf61>] __lock_acquire+0x2861/0x31f0
        [<ffffffff8117da2c>] lock_acquire+0xac/0x180
        [<ffffffff8126257e>] __might_fault+0xbe/0x150
        [<ffffffff8133160f>] kernfs_fop_write+0xaf/0x250
        [<ffffffff812a8933>] __vfs_write+0x43/0x1a0
        [<ffffffff812a8d3a>] vfs_write+0xda/0x240
        [<ffffffff812a8f84>] SyS_write+0x44/0xa0
        [<ffffffff820a773c>] entry_SYSCALL_64_fastpath+0x1f/0xbd

This patch moves the sum_vm_event() call to before down_write(&mm->mmap_sem)
is taken, breaking the lock dependency.

Signed-off-by: Ebru Akagunduz <[hidden email]>
---
 mm/huge_memory.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 91442a9..feee44c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2451,6 +2451,9 @@ static void collapse_huge_page(struct mm_struct *mm,
  goto out_nolock;
  }
 
+ swap = get_mm_counter(mm, MM_SWAPENTS);
+ curr_allocstall = sum_vm_event(ALLOCSTALL);
+
  /*
  * Prevent all access to pagetables with the exception of
  * gup_fast later hanlded by the ptep_clear_flush and the VM
@@ -2483,8 +2486,6 @@ static void collapse_huge_page(struct mm_struct *mm,
  goto out;
  }
 
- swap = get_mm_counter(mm, MM_SWAPENTS);
- curr_allocstall = sum_vm_event(ALLOCSTALL);
  /*
  * Don't perform swapin readahead when the system is under pressure,
  * to avoid unnecessary resource consumption.
--
1.9.1


[PATCH 3/3] mm, thp: make swapin readahead under down_read of mmap_sem

Ebru Akagunduz
Currently khugepaged performs swapin readahead while
holding mmap_sem for writing. This patch makes swapin
readahead run under down_read instead of down_write.

The patch was tested with a test program that allocates
800MB of memory, writes to it, and then sleeps. The system
was forced to swap it all out. Afterwards, the test program
touches the area by writing again, skipping one page in
every 20 pages of the area.

Signed-off-by: Ebru Akagunduz <[hidden email]>
---
 mm/huge_memory.c | 33 +++++++++++++++++++++++++++------
 1 file changed, 27 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index feee44c..668bc07 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2386,13 +2386,14 @@ static bool hugepage_vma_check(struct vm_area_struct *vma)
  * but with mmap_sem held to protect against vma changes.
  */
 
-static void __collapse_huge_page_swapin(struct mm_struct *mm,
+static bool __collapse_huge_page_swapin(struct mm_struct *mm,
  struct vm_area_struct *vma,
  unsigned long address, pmd_t *pmd)
 {
  unsigned long _address;
  pte_t *pte, pteval;
  int swapped_in = 0, ret = 0;
+ struct vm_area_struct *vma_orig = vma;
 
  pte = pte_offset_map(pmd, address);
  for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE;
@@ -2402,11 +2403,19 @@ static void __collapse_huge_page_swapin(struct mm_struct *mm,
  continue;
  swapped_in++;
  ret = do_swap_page(mm, vma, _address, pte, pmd,
-   FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT,
+   FAULT_FLAG_ALLOW_RETRY,
    pteval);
+ /* do_swap_page returns VM_FAULT_RETRY with released mmap_sem */
+ if (ret & VM_FAULT_RETRY) {
+ down_read(&mm->mmap_sem);
+ vma = find_vma(mm, address);
+ /* vma is no longer available, don't continue to swapin */
+ if (vma != vma_orig)
+ return false;
+ }
  if (ret & VM_FAULT_ERROR) {
  trace_mm_collapse_huge_page_swapin(mm, swapped_in, 0);
- return;
+ return false;
  }
  /* pte is unmapped now, we need to map it */
  pte = pte_offset_map(pmd, _address);
@@ -2414,6 +2423,7 @@ static void __collapse_huge_page_swapin(struct mm_struct *mm,
  pte--;
  pte_unmap(pte);
  trace_mm_collapse_huge_page_swapin(mm, swapped_in, 1);
+ return true;
 }
 
 static void collapse_huge_page(struct mm_struct *mm,
@@ -2459,7 +2469,7 @@ static void collapse_huge_page(struct mm_struct *mm,
  * gup_fast later hanlded by the ptep_clear_flush and the VM
  * handled by the anon_vma lock + PG_lock.
  */
- down_write(&mm->mmap_sem);
+ down_read(&mm->mmap_sem);
  if (unlikely(khugepaged_test_exit(mm))) {
  result = SCAN_ANY_PROCESS;
  goto out;
@@ -2490,9 +2500,20 @@ static void collapse_huge_page(struct mm_struct *mm,
  * Don't perform swapin readahead when the system is under pressure,
  * to avoid unnecessary resource consumption.
  */
- if (allocstall == curr_allocstall && swap != 0)
- __collapse_huge_page_swapin(mm, vma, address, pmd);
+ if (allocstall == curr_allocstall && swap != 0) {
+ /*
+ * __collapse_huge_page_swapin always returns with mmap_sem
+ * locked. If it fails, release mmap_sem and jump directly
+ * label out. Continuing to collapse causes inconsistency.
+ */
+ if (!__collapse_huge_page_swapin(mm, vma, address, pmd)) {
+ up_read(&mm->mmap_sem);
+ goto out;
+ }
+ }
 
+ up_read(&mm->mmap_sem);
+ down_write(&mm->mmap_sem);
  anon_vma_lock_write(vma->anon_vma);
 
  pte = pte_offset_map(pmd, address);
--
1.9.1


Re: [PATCH 0/3] mm, thp: remove duplication and fix locking issues in swapin

Ebru Akagunduz
On Mon, May 23, 2016 at 08:14:08PM +0300, Ebru Akagunduz wrote:

> This patch series removes duplication of included header
> and fixes locking inconsistency in khugepaged swapin
>
> Ebru Akagunduz (3):
>   mm, thp: remove duplication of included header
>   mm, thp: fix possible circular locking dependency caused by
>     sum_vm_event()
>   mm, thp: make swapin readahead under down_read of mmap_sem
>
>  mm/huge_memory.c | 39 ++++++++++++++++++++++++++++++---------
>  1 file changed, 30 insertions(+), 9 deletions(-)
>

Hi Andrew,

I prepared this patch series to solve the remaining
problems of khugepaged swapin.

I have seen the discussion:
http://marc.info/?l=linux-mm&m=146373278424897&w=2

In my opinion, checking whether kswapd has been woken up
could be a good approach.

It's up to you. I can take an action according to community's decision.

Here is the last status of the discussion:

"
Optimistic swapin collapsing

1. it could be too optimistic, losing the gain by evicting the workingset
2. let's detect memory pressure
3. current allocstall magic is not a good idea.
4. let's change the design from optimistic to conservative
5. how we can be conservative
6. two things - detect hot pages and threshold of swap pte
7. threshold of swap pte is already done, so the remaining thing is to detect hot pages
8. how to detect hot page - let's use young bit
9. Now, we are conservative so we will swap in when it's worth it
10. let's remove allocstall magic

I think it's not off-topic.
Anyway, it's just my thought; I don't have any real workload or objection.
Feel free to ignore.
"

Thanks.

Re: [PATCH 3/3] mm, thp: make swapin readahead under down_read of mmap_sem

Kirill A. Shutemov
On Mon, May 23, 2016 at 08:14:11PM +0300, Ebru Akagunduz wrote:

> Currently khugepaged makes swapin readahead under
> down_write. This patch supplies to make swapin
> readahead under down_read instead of down_write.
>
> The patch was tested with a test program that allocates
> 800MB of memory, writes to it, and then sleeps. The system
> was forced to swap out all. Afterwards, the test program
> touches the area by writing, it skips a page in each
> 20 pages of the area.
>
> Signed-off-by: Ebru Akagunduz <[hidden email]>
> ---
>  mm/huge_memory.c | 33 +++++++++++++++++++++++++++------
>  1 file changed, 27 insertions(+), 6 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index feee44c..668bc07 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2386,13 +2386,14 @@ static bool hugepage_vma_check(struct vm_area_struct *vma)
>   * but with mmap_sem held to protect against vma changes.
>   */
>  
> -static void __collapse_huge_page_swapin(struct mm_struct *mm,
> +static bool __collapse_huge_page_swapin(struct mm_struct *mm,
>   struct vm_area_struct *vma,
>   unsigned long address, pmd_t *pmd)
>  {
>   unsigned long _address;
>   pte_t *pte, pteval;
>   int swapped_in = 0, ret = 0;
> + struct vm_area_struct *vma_orig = vma;
>  
>   pte = pte_offset_map(pmd, address);
>   for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE;
> @@ -2402,11 +2403,19 @@ static void __collapse_huge_page_swapin(struct mm_struct *mm,
>   continue;
>   swapped_in++;
>   ret = do_swap_page(mm, vma, _address, pte, pmd,
> -   FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT,
> +   FAULT_FLAG_ALLOW_RETRY,
>     pteval);
> + /* do_swap_page returns VM_FAULT_RETRY with released mmap_sem */
> + if (ret & VM_FAULT_RETRY) {
> + down_read(&mm->mmap_sem);
> + vma = find_vma(mm, address);
> + /* vma is no longer available, don't continue to swapin */
> + if (vma != vma_orig)
> + return false;
> + }
>   if (ret & VM_FAULT_ERROR) {
>   trace_mm_collapse_huge_page_swapin(mm, swapped_in, 0);
> - return;
> + return false;
>   }
>   /* pte is unmapped now, we need to map it */
>   pte = pte_offset_map(pmd, _address);
> @@ -2414,6 +2423,7 @@ static void __collapse_huge_page_swapin(struct mm_struct *mm,
>   pte--;
>   pte_unmap(pte);
>   trace_mm_collapse_huge_page_swapin(mm, swapped_in, 1);
> + return true;
>  }
>  
>  static void collapse_huge_page(struct mm_struct *mm,
> @@ -2459,7 +2469,7 @@ static void collapse_huge_page(struct mm_struct *mm,
>   * gup_fast later hanlded by the ptep_clear_flush and the VM
>   * handled by the anon_vma lock + PG_lock.
>   */
> - down_write(&mm->mmap_sem);
> + down_read(&mm->mmap_sem);
>   if (unlikely(khugepaged_test_exit(mm))) {
>   result = SCAN_ANY_PROCESS;
>   goto out;
> @@ -2490,9 +2500,20 @@ static void collapse_huge_page(struct mm_struct *mm,
>   * Don't perform swapin readahead when the system is under pressure,
>   * to avoid unnecessary resource consumption.
>   */
> - if (allocstall == curr_allocstall && swap != 0)
> - __collapse_huge_page_swapin(mm, vma, address, pmd);
> + if (allocstall == curr_allocstall && swap != 0) {
> + /*
> + * __collapse_huge_page_swapin always returns with mmap_sem
> + * locked. If it fails, release mmap_sem and jump directly
> + * label out. Continuing to collapse causes inconsistency.
> + */
> + if (!__collapse_huge_page_swapin(mm, vma, address, pmd)) {
> + up_read(&mm->mmap_sem);
> + goto out;
> + }
> + }
>  
> + up_read(&mm->mmap_sem);
> + down_write(&mm->mmap_sem);

That's the critical point.

How do you guarantee that the vma will not be destroyed (or changed)
between up_read() and down_write()?

You need at least find_vma() again.

>   anon_vma_lock_write(vma->anon_vma);
>  
>   pte = pte_offset_map(pmd, address);
> --
> 1.9.1

--
 Kirill A. Shutemov

Re: [PATCH 3/3] mm, thp: make swapin readahead under down_read of mmap_sem

Michal Hocko
On Mon 23-05-16 20:14:11, Ebru Akagunduz wrote:
> Currently khugepaged makes swapin readahead under
> down_write. This patch supplies to make swapin
> readahead under down_read instead of down_write.

You are still keeping down_write. Can we do without it altogether?
Blocking the mmap_sem of a remote process for write is certainly not nice.
--
Michal Hocko
SUSE Labs

Re: [PATCH 3/3] mm, thp: make swapin readahead under down_read of mmap_sem

Rik van Riel
On Mon, 2016-05-23 at 20:42 +0200, Michal Hocko wrote:
> On Mon 23-05-16 20:14:11, Ebru Akagunduz wrote:
> >
> > Currently khugepaged makes swapin readahead under
> > down_write. This patch supplies to make swapin
> > readahead under down_read instead of down_write.
> You are still keeping down_write. Can we do without it altogether?
> Blocking mmap_sem of a remote proces for write is certainly not nice.

Maybe Andrea can explain why khugepaged requires
a down_write of mmap_sem?

If it were possible to have just down_read that
would make the code a lot simpler.

--
All Rights Reversed.



Re: [PATCH 3/3] mm, thp: make swapin readahead under down_read of mmap_sem

Kirill A. Shutemov
On Mon, May 23, 2016 at 02:49:09PM -0400, Rik van Riel wrote:

> On Mon, 2016-05-23 at 20:42 +0200, Michal Hocko wrote:
> > On Mon 23-05-16 20:14:11, Ebru Akagunduz wrote:
> > >
> > > Currently khugepaged makes swapin readahead under
> > > down_write. This patch supplies to make swapin
> > > readahead under down_read instead of down_write.
> > You are still keeping down_write. Can we do without it altogether?
> > Blocking mmap_sem of a remote proces for write is certainly not nice.
>
> Maybe Andrea can explain why khugepaged requires
> a down_write of mmap_sem?
>
> If it were possible to have just down_read that
> would make the code a lot simpler.

You need a down_write() to retract the page table. We need to make sure that
nobody sees the page table before we can replace it with a huge pmd.

--
 Kirill A. Shutemov

Re: [PATCH 3/3] mm, thp: make swapin readahead under down_read of mmap_sem

Rik van Riel
On Mon, 2016-05-23 at 22:01 +0300, Kirill A. Shutemov wrote:

> On Mon, May 23, 2016 at 02:49:09PM -0400, Rik van Riel wrote:
> >
> > On Mon, 2016-05-23 at 20:42 +0200, Michal Hocko wrote:
> > >
> > > On Mon 23-05-16 20:14:11, Ebru Akagunduz wrote:
> > > >
> > > >
> > > > Currently khugepaged makes swapin readahead under
> > > > down_write. This patch supplies to make swapin
> > > > readahead under down_read instead of down_write.
> > > You are still keeping down_write. Can we do without it
> > > altogether?
> > > Blocking mmap_sem of a remote proces for write is certainly not
> > > nice.
> > Maybe Andrea can explain why khugepaged requires
> > a down_write of mmap_sem?
> >
> > If it were possible to have just down_read that
> > would make the code a lot simpler.
> You need a down_write() to retract page table. We need to make sure
> that
> nobody sees the page table before we can replace it with huge pmd.
Good point.

I guess the alternative is to have the page_table_lock
taken by a helper function (everywhere) that can return
failure if the page table was changed while the caller
was waiting for the lock.

Doable, but a fair amount of code churn.

--
All Rights Reversed.



Re: [PATCH 3/3] mm, thp: make swapin readahead under down_read of mmap_sem

Kirill A. Shutemov
On Mon, May 23, 2016 at 03:26:47PM -0400, Rik van Riel wrote:

> On Mon, 2016-05-23 at 22:01 +0300, Kirill A. Shutemov wrote:
> > On Mon, May 23, 2016 at 02:49:09PM -0400, Rik van Riel wrote:
> > >
> > > On Mon, 2016-05-23 at 20:42 +0200, Michal Hocko wrote:
> > > >
> > > > On Mon 23-05-16 20:14:11, Ebru Akagunduz wrote:
> > > > >
> > > > >
> > > > > Currently khugepaged makes swapin readahead under
> > > > > down_write. This patch supplies to make swapin
> > > > > readahead under down_read instead of down_write.
> > > > You are still keeping down_write. Can we do without it
> > > > altogether?
> > > > Blocking mmap_sem of a remote proces for write is certainly not
> > > > nice.
> > > Maybe Andrea can explain why khugepaged requires
> > > a down_write of mmap_sem?
> > >
> > > If it were possible to have just down_read that
> > > would make the code a lot simpler.
> > You need a down_write() to retract page table. We need to make sure
> > that
> > nobody sees the page table before we can replace it with huge pmd.
>
> Good point.
>
> I guess the alternative is to have the page_table_lock
> taken by a helper function (everywhere) that can return
> failure if the page table was changed while the caller
> was waiting for the lock.

It's not that the page table was changed, but that the pmd is now pointing
to something else. Basically, we would need to nest all pte-ptl's within
pmd_lock(). That's not good for scalability.

--
 Kirill A. Shutemov

Re: [PATCH 3/3] mm, thp: make swapin readahead under down_read of mmap_sem

Rik van Riel
On Mon, 2016-05-23 at 23:02 +0300, Kirill A. Shutemov wrote:

> On Mon, May 23, 2016 at 03:26:47PM -0400, Rik van Riel wrote:
> >
> > On Mon, 2016-05-23 at 22:01 +0300, Kirill A. Shutemov wrote:
> > >
> > > On Mon, May 23, 2016 at 02:49:09PM -0400, Rik van Riel wrote:
> > > >
> > > >
> > > > On Mon, 2016-05-23 at 20:42 +0200, Michal Hocko wrote:
> > > > >
> > > > >
> > > > > On Mon 23-05-16 20:14:11, Ebru Akagunduz wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > Currently khugepaged makes swapin readahead under
> > > > > > down_write. This patch supplies to make swapin
> > > > > > readahead under down_read instead of down_write.
> > > > > You are still keeping down_write. Can we do without it
> > > > > altogether?
> > > > > Blocking mmap_sem of a remote proces for write is certainly
> > > > > not
> > > > > nice.
> > > > Maybe Andrea can explain why khugepaged requires
> > > > a down_write of mmap_sem?
> > > >
> > > > If it were possible to have just down_read that
> > > > would make the code a lot simpler.
> > > You need a down_write() to retract page table. We need to make
> > > sure
> > > that
> > > nobody sees the page table before we can replace it with huge
> > > pmd.
> > Good point.
> >
> > I guess the alternative is to have the page_table_lock
> > taken by a helper function (everywhere) that can return
> > failure if the page table was changed while the caller
> > was waiting for the lock.
> Not page table was changed, but pmd is now pointing to something
> else.
> Basically, we would need to nest all pte-ptl's within pmd_lock().
> That's not good for scalability.
I can see a few alternatives here:

1) huge pmd collapsing takes both the pmd lock and the pte lock,
   preventing pte updates from happening simultaneously

2) code that (re-)acquires the pte lock can read a sequence number
   at the pmd level, check that it did not change after the
   pte lock has been acquired, and abort if it has - I believe most
   of the code that re-acquires the pte lock already knows how to
   abort if somebody else touched the pte while it was looking
   elsewhere

That way the (uncommon) thp collapse code should still exclude
pte level operations, at the cost of potentially teaching a few
more pte level operations to abort (chances are most already do,
considering a race with other pte-level manipulations requires that).

--
All Rights Reversed.



Re: [PATCH 3/3] mm, thp: make swapin readahead under down_read of mmap_sem

Kirill A. Shutemov
On Mon, May 23, 2016 at 04:13:03PM -0400, Rik van Riel wrote:

> On Mon, 2016-05-23 at 23:02 +0300, Kirill A. Shutemov wrote:
> > On Mon, May 23, 2016 at 03:26:47PM -0400, Rik van Riel wrote:
> > >
> > > On Mon, 2016-05-23 at 22:01 +0300, Kirill A. Shutemov wrote:
> > > >
> > > > On Mon, May 23, 2016 at 02:49:09PM -0400, Rik van Riel wrote:
> > > > >
> > > > >
> > > > > On Mon, 2016-05-23 at 20:42 +0200, Michal Hocko wrote:
> > > > > >
> > > > > >
> > > > > > On Mon 23-05-16 20:14:11, Ebru Akagunduz wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Currently khugepaged makes swapin readahead under
> > > > > > > down_write. This patch supplies to make swapin
> > > > > > > readahead under down_read instead of down_write.
> > > > > > You are still keeping down_write. Can we do without it
> > > > > > altogether?
> > > > > > Blocking mmap_sem of a remote proces for write is certainly
> > > > > > not
> > > > > > nice.
> > > > > Maybe Andrea can explain why khugepaged requires
> > > > > a down_write of mmap_sem?
> > > > >
> > > > > If it were possible to have just down_read that
> > > > > would make the code a lot simpler.
> > > > You need a down_write() to retract page table. We need to make
> > > > sure
> > > > that
> > > > nobody sees the page table before we can replace it with huge
> > > > pmd.
> > > Good point.
> > >
> > > I guess the alternative is to have the page_table_lock
> > > taken by a helper function (everywhere) that can return
> > > failure if the page table was changed while the caller
> > > was waiting for the lock.
> > Not page table was changed, but pmd is now pointing to something
> > else.
> > Basically, we would need to nest all pte-ptl's within pmd_lock().
> > That's not good for scalability.
>
> I can see a few alternatives here:
>
> 1) huge pmd collapsing takes both the pmd lock and the pte lock,
>    preventing pte updates from happening simultaneously

That's what we do now and that's not enough.

We would need to serialize against pmd_lock() during the normal page-fault
path (and other pte manipulation), which we don't do now if the pmd points
to a page table.

That's a huge hit on scalability.

>
> 2) code that (re-)acquires the pte lock can read a sequence number
>    at the pmd level, check that it did not change after the
>    pte lock has been acquired, and abort if it has - I believe most
>    of the code that re-acquires the pte lock already knows how to
>    abort if somebody else touched the pte while it was looking
>    elsewhere

So, every pmd_lock() (and every other means of taking the lock) would have
to bump the sequence number, and we would need to be able to read a stable
result outside pmd_lock(), meaning it should be an atomic_t or something
similar.

Not exactly free.

And I'm not convinced the hassle worth the gain.

> That way the (uncommon) thp collapse code should still exclude
> pte level operations, at the cost of potentially teaching a few
> more pte level operations to abort (chances are most already do,
> considering a race with other pte-level manipulations requires that).

--
 Kirill A. Shutemov

Re: [PATCH 3/3] mm, thp: make swapin readahead under down_read of mmap_sem

Andrea Arcangeli-4
On Tue, May 24, 2016 at 12:49:42AM +0300, Kirill A. Shutemov wrote:
> That's what we do now and that's not enough.
>
> We would need to serialize against pmd_lock() during normal page-fault
> path (and other pte manipulation), which we don't do now if pmd points to
> page table.

Yes, mmap_sem for writing while converting the pmd to a
pmd_trans_huge() in khugepaged, is so that the pagetable walk doesn't
require the pmd_lock after holding the mmap_sem for reading if the pmd
is found !pmd_trans_unstable, i.e. if the pmd points to a pte.

This way the non-THP pte walk retains the identical cost it has with
THP not compiled into the kernel (even when THP is enabled).

khugepaged already starts by doing work with the mmap_sem for reading,
then while holding the mmap_sem for reading if khugepaged_scan_pmd()
finds a candidate pmd to collapse into a pmd_trans_huge(), it calls
collapse_huge_page which at some point releases the mmap_sem for
reading (before the THP memory allocation) and takes it again for
writing if the allocation succeeded and we can go ahead with the
atomic THP collapse (under mmap_sem for writing and under the anon_vma
lock for writing too to serialize against split_huge_page which can be
called on the physical page and doesn't hold any mmap_sem but just
finds the pagetables through a rmap walk). The atomic part is all non
blocking.

The swapin loop can run under the mmap_sem for reading if it does the
proper checks to revalidate the vma, but it should be moved above the
comment below.

        /*
         * Prevent all access to pagetables with the exception of
         * gup_fast later hanlded by the ptep_clear_flush and the VM
         * handled by the anon_vma lock + PG_lock.
         */
        down_write(&mm->mmap_sem);

Which is more or less what the last patch was doing, except that by keeping
the comment above the swapin stage it made the comment wrong, as the comment
was then followed by a down_read.

Aside from the comment being wrong (which is not a kernel-crashing
issue), the real problem was the lack of revalidation after releasing the
mmap_sem, and this revalidation attempt is also not correct:

+                       vma = find_vma(mm, address);
+                       /* vma is no longer available, don't continue to swapin */
+                       if (vma != vma_orig)
+                               return false;

Because the mmap_sem was temporarily dropped, the vma may have been
freed and reallocated at the same address, but it may be a completely
different vma with different vm_start/end values or it may not be
anonymous or mremap may have altered the vm_start/end too or the "mm"
may have exited in the meanwhile.

collapse_huge_page already shows how to correctly to revalidate the
vma after dropping the mmap_sem temporarily:

        down_write(&mm->mmap_sem);
        if (unlikely(khugepaged_test_exit(mm))) {
                result = SCAN_ANY_PROCESS;
                goto out;
        }

        vma = find_vma(mm, address);
        if (!vma) {
                result = SCAN_VMA_NULL;
                goto out;
        }
        hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
        hend = vma->vm_end & HPAGE_PMD_MASK;
        if (address < hstart || address + HPAGE_PMD_SIZE > hend) {
                result = SCAN_ADDRESS_RANGE;
                goto out;
        }
        if (!hugepage_vma_check(vma)) {
                result = SCAN_VMA_CHECK;
                goto out;
        }

All checks above are needed for a correct revalidate, otherwise the
above code could also have been replaced by a vma != vma_orig.

If we move this swapin stage above that comment and after the THP
allocation has succeeded, and we do all the revalidates correctly (they
are currently missing or incorrect; find_vma alone is not enough for a
revalidate, it only says there is some vma with vm_end > address), it
should then work fine under only the mmap_sem for reading.

Overall the last patch goes in the right direction; it just needs to do
all the revalidates right and to move the swapin stage a bit further up,
to avoid invalidating the comment.

Thanks,
Andrea