
Commit d15cd6d

thejh authored and gregkh committed
mm/khugepaged: take the right locks for page table retraction
commit 8d3c106 upstream.

pagetable walks on address ranges mapped by VMAs can be done under the mmap
lock, the lock of an anon_vma attached to the VMA, or the lock of the VMA's
address_space. Only one of these needs to be held, and it does not need to be
held in exclusive mode.

Under those circumstances, the rules for concurrent access to page table
entries are:

 - Terminal page table entries (entries that don't point to another page
   table) can be arbitrarily changed under the page table lock, with the
   exception that they always need to be consistent for hardware page table
   walks and lockless_pages_from_mm(). This includes that they can be changed
   into non-terminal entries.
 - Non-terminal page table entries (which point to another page table) can
   not be modified; readers are allowed to READ_ONCE() an entry, verify that
   it is non-terminal, and then assume that its value will stay as-is.

Retracting a page table involves modifying a non-terminal entry, so
page-table-level locks are insufficient to protect against concurrent page
table traversal; it requires taking all the higher-level locks under which it
is possible to start a page walk in the relevant range in exclusive mode.

The collapse_huge_page() path for anonymous THP already follows this rule,
but the shmem/file THP path was getting it wrong, making it possible for
concurrent rmap-based operations to cause corruption.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 27e1f82 ("khugepaged: enable collapse pmd for pte-mapped THP")
Signed-off-by: Jann Horn <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
[manual backport: this code was refactored from two copies into a common
 helper between 5.15 and 6.0]
Signed-off-by: Jann Horn <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
1 parent 26f084e, commit d15cd6d
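For context, here is a minimal, hypothetical sketch (not part of this patch, and
not an existing kernel helper; the function name walk_to_pte_table is made up)
of the read-side rule the commit message describes: a walker holding any one of
the mmap lock, an anon_vma lock, or the i_mmap lock, even in shared mode, may
READ_ONCE() a non-terminal PMD entry, verify it, and then rely on it, precisely
because retraction is required to hold all of those locks exclusively.

#include <linux/mm.h>

/*
 * Hypothetical illustration only: with one of the high-level rmap/mmap locks
 * held, a non-terminal PMD entry can be read once and treated as stable.
 */
static pte_t *walk_to_pte_table(struct mm_struct *mm, pmd_t *pmd,
				unsigned long addr, spinlock_t **ptlp)
{
	pmd_t pmdval = READ_ONCE(*pmd);

	/* Empty, huge or corrupt entries do not point to a PTE table. */
	if (pmd_none(pmdval) || pmd_trans_huge(pmdval) || pmd_bad(pmdval))
		return NULL;

	/*
	 * pmdval points to a PTE table. Under the rules above, that table
	 * cannot be retracted or freed while we hold one of the high-level
	 * locks, so it is safe to map it and take its page table lock.
	 */
	return pte_offset_map_lock(mm, pmd, addr, ptlp);
}

The fix below enforces the write side of that contract in the shmem/file THP
path: collapse_pte_mapped_thp() now bails out if an anon_vma exists and takes
i_mmap_lock_write() before clearing the PMD.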

File tree

1 file changed: 26 additions, 5 deletions

mm/khugepaged.c

Lines changed: 26 additions & 5 deletions
@@ -1456,6 +1456,14 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 	if (!hugepage_vma_check(vma, vma->vm_flags | VM_HUGEPAGE))
 		return;
 
+	/*
+	 * Symmetry with retract_page_tables(): Exclude MAP_PRIVATE mappings
+	 * that got written to. Without this, we'd have to also lock the
+	 * anon_vma if one exists.
+	 */
+	if (vma->anon_vma)
+		return;
+
 	hpage = find_lock_page(vma->vm_file->f_mapping,
 			       linear_page_index(vma, haddr));
 	if (!hpage)
@@ -1468,6 +1476,19 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 	if (!pmd)
 		goto drop_hpage;
 
+	/*
+	 * We need to lock the mapping so that from here on, only GUP-fast and
+	 * hardware page walks can access the parts of the page tables that
+	 * we're operating on.
+	 */
+	i_mmap_lock_write(vma->vm_file->f_mapping);
+
+	/*
+	 * This spinlock should be unnecessary: Nobody else should be accessing
+	 * the page tables under spinlock protection here, only
+	 * lockless_pages_from_mm() and the hardware page walker can access page
+	 * tables while all the high-level locks are held in write mode.
+	 */
 	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
 
 	/* step 1: check all mapped PTEs are to the right huge page */
@@ -1514,19 +1535,20 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 	}
 
 	/* step 4: collapse pmd */
-	ptl = pmd_lock(vma->vm_mm, pmd);
 	_pmd = pmdp_collapse_flush(vma, haddr, pmd);
-	spin_unlock(ptl);
 	mm_dec_nr_ptes(mm);
 	pte_free(mm, pmd_pgtable(_pmd));
 
+	i_mmap_unlock_write(vma->vm_file->f_mapping);
+
 drop_hpage:
 	unlock_page(hpage);
 	put_page(hpage);
 	return;
 
 abort:
 	pte_unmap_unlock(start_pte, ptl);
+	i_mmap_unlock_write(vma->vm_file->f_mapping);
 	goto drop_hpage;
 }
 
@@ -1575,7 +1597,8 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 		 * An alternative would be drop the check, but check that page
 		 * table is clear before calling pmdp_collapse_flush() under
 		 * ptl. It has higher chance to recover THP for the VMA, but
-		 * has higher cost too.
+		 * has higher cost too. It would also probably require locking
+		 * the anon_vma.
 		 */
 		if (vma->anon_vma)
 			continue;
@@ -1597,10 +1620,8 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 		 */
 		if (mmap_write_trylock(mm)) {
 			if (!khugepaged_test_exit(mm)) {
-				spinlock_t *ptl = pmd_lock(mm, pmd);
 				/* assume page table is clear */
 				_pmd = pmdp_collapse_flush(vma, addr, pmd);
-				spin_unlock(ptl);
 				mm_dec_nr_ptes(mm);
 				pte_free(mm, pmd_pgtable(_pmd));
 			}
