From andrea@suse.de Fri Apr 7 18:25:22 2000 Date: Fri, 7 Apr 2000 12:45:13 +0200 (CEST) From: Andrea Arcangeli To: Ben LaHaise Cc: riel@nl.linux.org, Linus Torvalds , linux-mm@kvack.org Subject: Re: [patch] take 2 Re: PG_swap_entry bug in recent kernels On Thu, 6 Apr 2000, Ben LaHaise wrote: >First off, it doesn't verify that all of the bits which should be clear in >a swap entry actually are -- eg bit 0 (present) can and is getting >set. It would have to rebuild and compare the swap entry produced by >SWP_ENTRY(type,offset) to be 100% sure that it is a valid swap entry. That shouldn't be a problem. What we cares is to return a _valid_ swap entry from acquire_swap_entry, if that wasn't a cached-swap-entry that were a minor problem (not a stability issue at least). However I agree the current design is not very clean and not very strict and it's too silent. So now I make sure that the swap-entry bitflag is set only on good page->index. >Heheh, okay I've moved it there. Just in case I've also added a >BUG() check in free_pages_okay to make sure there aren't any other places >that have been missed. That's certainly a good idea, I did that too indeed :). I'm working on that and right now I am stuck fixing up SMP/non-SMP races and bugs in the VM that I found while checking the swap entry stuff. However the swap entry stuff in my current tree should be fixed now. This is a snapshot of my tree so that we can stay in sync. I guess the below stuff it's ready for inclusion but if you don't agree with something let me know of course, thanks. It seems to work stable here. I included also some early fix (the non intrusive ones). One for a SMP race fix for swapoff/get_swap_page, an SMP race in the vmlist access lock during swapoff, do_wp_page that have to know the swap cache can have reference count 3 and not being shared, and the highmem page replacement should cache also the swap-entry bitflag (I just do a copy of all the flags to be faster). It doesn't need to copy the mapping pointer since it's always null and the new_page have it null too. It's against 2.3.99-pre4-4. diff -urN ref/mm/filemap.c swap-entry-1/mm/filemap.c --- ref/mm/filemap.c Thu Apr 6 01:00:51 2000 +++ swap-entry-1/mm/filemap.c Fri Apr 7 04:44:27 2000 @@ -300,6 +300,8 @@ if (PageSwapCache(page)) { spin_unlock(&pagecache_lock); __delete_from_swap_cache(page); + /* the page is local to us now */ + page->flags &= ~(1UL << PG_swap_entry); goto made_inode_progress; } diff -urN ref/mm/highmem.c swap-entry-1/mm/highmem.c --- ref/mm/highmem.c Mon Apr 3 03:21:59 2000 +++ swap-entry-1/mm/highmem.c Fri Apr 7 03:48:46 2000 @@ -75,7 +75,7 @@ /* Preserve the caching of the swap_entry. */ highpage->index = page->index; - highpage->mapping = page->mapping; + highpage->flags = page->flags; /* * We can just forget the old page since diff -urN ref/mm/memory.c swap-entry-1/mm/memory.c --- ref/mm/memory.c Fri Apr 7 12:17:24 2000 +++ swap-entry-1/mm/memory.c Fri Apr 7 12:26:53 2000 @@ -838,6 +838,7 @@ */ switch (page_count(old_page)) { case 2: + case 3: /* * Lock the page so that no one can look it up from * the swap cache, grab a reference and start using it. @@ -880,7 +881,19 @@ new_page = old_page; } spin_unlock(&tsk->mm->page_table_lock); - __free_page(new_page); + /* + * We're releasing a page, it can be an anonymous + * page as well. Since we don't hold any lock (except + * the mmap_sem semaphore) the other user of the anonymous + * page may have released it from under us and now we + * could be the only owner of the page, thus put_page_testzero() can + * return 1, and we have to clear the swap-entry + * bitflag in such case. + */ + if (put_page_testzero(new_page)) { + new_page->flags &= ~(1UL << PG_swap_entry); + __free_pages_ok(new_page, 0); + } return 1; bad_wp_page: diff -urN ref/mm/page_alloc.c swap-entry-1/mm/page_alloc.c --- ref/mm/page_alloc.c Thu Apr 6 01:00:52 2000 +++ swap-entry-1/mm/page_alloc.c Thu Apr 6 05:05:59 2000 @@ -110,6 +110,8 @@ BUG(); if (PageDecrAfter(page)) BUG(); + if (PageSwapEntry(page)) + BUG(); zone = page->zone; diff -urN ref/mm/swap_state.c swap-entry-1/mm/swap_state.c --- ref/mm/swap_state.c Thu Apr 6 01:00:52 2000 +++ swap-entry-1/mm/swap_state.c Fri Apr 7 12:29:00 2000 @@ -126,9 +126,14 @@ UnlockPage(page); } - ClearPageSwapEntry(page); - - __free_page(page); + /* + * Only the last unmap have to lose the swap entry + * information that we have cached into page->index. + */ + if (put_page_testzero(page)) { + page->flags &= ~(1UL << PG_swap_entry); + __free_pages_ok(page, 0); + } } @@ -151,7 +156,7 @@ * Right now the pagecache is 32-bit only. But it's a 32 bit index. =) */ repeat: - found = find_lock_page(&swapper_space, entry.val); + found = find_get_page(&swapper_space, entry.val); if (!found) return 0; /* @@ -163,7 +168,6 @@ * is enough to check whether the page is still in the scache. */ if (!PageSwapCache(found)) { - UnlockPage(found); __free_page(found); goto repeat; } @@ -172,13 +176,11 @@ #ifdef SWAP_CACHE_INFO swap_cache_find_success++; #endif - UnlockPage(found); return found; } out_bad: printk (KERN_ERR "VM: Found a non-swapper swap page!\n"); - UnlockPage(found); __free_page(found); return 0; } diff -urN ref/mm/swapfile.c swap-entry-1/mm/swapfile.c --- ref/mm/swapfile.c Thu Apr 6 01:00:52 2000 +++ swap-entry-1/mm/swapfile.c Fri Apr 7 12:35:59 2000 @@ -212,22 +212,22 @@ /* We have the old entry in the page offset still */ if (!page->index) - goto new_swap_entry; + goto null_swap_entry; entry.val = page->index; type = SWP_TYPE(entry); if (type >= nr_swapfiles) - goto new_swap_entry; + goto bad_nofile; + swap_list_lock(); p = type + swap_info; if ((p->flags & SWP_WRITEOK) != SWP_WRITEOK) - goto new_swap_entry; + goto unlock_list; offset = SWP_OFFSET(entry); if (offset >= p->max) - goto new_swap_entry; + goto bad_offset; /* Has it been re-used for something else? */ - swap_list_lock(); swap_device_lock(p); if (p->swap_map[offset]) - goto unlock_new_swap_entry; + goto unlock; /* We're cool, we can just use the old one */ p->swap_map[offset] = 1; @@ -236,11 +236,24 @@ swap_list_unlock(); return entry; -unlock_new_swap_entry: +unlock: swap_device_unlock(p); +unlock_list: swap_list_unlock(); +clear_swap_entry: + ClearPageSwapEntry(page); new_swap_entry: return get_swap_page(); + +null_swap_entry: + printk(KERN_WARNING __FUNCTION__ " null swap entry\n"); + goto clear_swap_entry; +bad_nofile: + printk(KERN_WARNING __FUNCTION__ " nonexistent swap file\n"); + goto clear_swap_entry; +bad_offset: + printk(KERN_WARNING __FUNCTION__ " bad offset\n"); + goto unlock_list; } /* @@ -263,8 +276,11 @@ /* If this entry is swap-cached, then page must already hold the right address for any copies in physical memory */ - if (pte_page(pte) != page) + if (pte_page(pte) != page) { + if (page->index == entry.val) + ClearPageSwapEntry(page); return; + } /* We will be removing the swap cache in a moment, so... */ set_pte(dir, pte_mkdirty(pte)); return; @@ -358,10 +374,20 @@ */ if (!mm) return; + /* + * Avoid the vmas to go away from under us + * and also avoids the task to play with + * pagetables while we're running. If the + * vmlist_modify_lock wouldn't acquire the + * mm->page_table_lock spinlock we should + * acquire it by hand. + */ + vmlist_access_lock(mm); for (vma = mm->mmap; vma; vma = vma->vm_next) { pgd_t * pgd = pgd_offset(mm, vma->vm_start); unuse_vma(vma, pgd, entry, page); } + vmlist_access_unlock(mm); return; } @@ -418,8 +444,10 @@ shm_unuse(entry, page); /* Now get rid of the extra reference to the temporary page we've been using. */ - if (PageSwapCache(page)) + if (PageSwapCache(page)) { delete_from_swap_cache(page); + ClearPageSwapEntry(page); + } __free_page(page); /* * Check for and clear any overflowed swap map counts. @@ -488,8 +516,8 @@ swap_list.next = swap_list.head; } nr_swap_pages -= p->pages; - swap_list_unlock(); p->flags = SWP_USED; + swap_list_unlock(); err = try_to_unuse(type); if (err) { /* re-insert swap space back into swap_list */ Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/