Paul Mackerras [Tue, 6 Jan 2009 22:41:02 +0000 (14:41 -0800)]
Allow times and time system calls to return small negative values
At the moment, the times() system call will appear to fail for a period
shortly after boot, while the value it want to return is between -4095 and
-1. The same thing will also happen for the time() system call on 32-bit
platforms some time in 2106 or so.
On some platforms, such as x86, this is unavoidable because of the system
call ABI, but other platforms such as powerpc have a separate error
indication from the return value, so system calls can in fact return small
negative values without indicating an error. On those platforms,
force_successful_syscall_return() provides a way to indicate that the
system call return value should not be treated as an error even if it is
in the range which would normally be taken as a negative error number.
This adds a force_successful_syscall_return() call to the time() and
times() system calls plus their 32-bit compat versions, so that they don't
erroneously indicate an error on those platforms whose system call ABI has
a separate error indication. This will not affect anything on other
platforms.
Joakim Tjernlund added the fix for time() and the compat versions of
time() and times(), after I did the fix for times().
Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se> Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Brownell [Tue, 6 Jan 2009 22:41:01 +0000 (14:41 -0800)]
documentation: when to BUG(), and when to not BUG()
Provide some basic advice about when to use BUG()/BUG_ON(): never, unless
there's really no better option.
This matches my understanding of the standard policy ... which seems not
to be written down so far, outside of LKML messages that I haven't
bookmarked.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Cc: Hans Verkuil <hverkuil@xs4all.nl> Cc: Ingo Molnar <mingo@elte.hu> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Tue, 6 Jan 2009 22:40:59 +0000 (14:40 -0800)]
poll: allow f_op->poll to sleep
f_op->poll is the only vfs operation which is not allowed to sleep. It's
because poll and select implementation used task state to synchronize
against wake ups, which doesn't have to be the case anymore as wait/wake
interface can now use custom wake up functions. The non-sleep restriction
can be a bit tricky because ->poll is not called from an atomic context
and the result of accidentally sleeping in ->poll only shows up as
temporary busy looping when the timing is right or rather wrong.
This patch converts poll/select to use custom wake up function and use
separate triggered variable to synchronize against wake up events. The
only added overhead is an extra function call during wake up and
negligible.
This patch removes the one non-sleep exception from vfs locking rules and
is beneficial to userland filesystem implementations like FUSE, 9p or
peculiar fs like spufs as it's very difficult for those to implement
non-sleeping poll method.
While at it, make the following cosmetic changes to make poll.h and
select.c checkpatch friendly.
* s/type * symbol/type *symbol/ : three places in poll.h
* remove blank line before EXPORT_SYMBOL() : two places in select.c
Oleg: spotted missing barrier in poll_schedule_timeout()
Davide: spotted missing write barrier in pollwake()
Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Ingo Molnar <mingo@elte.hu> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: Brad Boyer <flar@allandria.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Roland McGrath <roland@redhat.com> Cc: Mauro Carvalho Chehab <mchehab@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
scripts: script from kerneloops.org to pretty print oops dumps
We're struggling all the time to figure out where the code came from that
oopsed.. The script below (a adaption from a script used by
kerneloops.org) can help developers quite a bit, at least for non-module
cases.
memory = agp_allocate_memory(agp_bridge, pg_count, type); c055c10f: 89 c2 mov %eax,%edx
if (memory == NULL) c055c111: 74 19 je c055c12c <agp_allocate_memory_wrap+0x30>
/* This function must only be called when current_controller != NULL */
static void agp_insert_into_pool(struct agp_memory * temp)
{
struct agp_memory *prev;
so in this case, we faulted while dereferencing agp_fe.current_controller
pointer, and we get to see exactly which function and line it affects...
Personally I find this very useful, and I can see value for having this
script in the kernel for more-than-just-me to use.
Caveats:
* It only works for oopses not-in-modules
* It only works nicely for kernels compiled with CONFIG_DEBUG_INFO
* It's not very fast.
* It only works on x86
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
... because we do want repeated same-oops to be seen by automated
tools like kerneloops.org
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Machek [Tue, 6 Jan 2009 22:40:53 +0000 (14:40 -0800)]
strict_strto* is not strict enough
It decodes "\n" as 0, which is bad, because stray echo into backlight
will turn your backlight off, etc...
Signed-off-by: Pavel Machek <pavel@suse.cz> Cc: Yi Yang <yi.y.yang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Frysinger [Tue, 6 Jan 2009 22:40:53 +0000 (14:40 -0800)]
autodetect_raid: add missing __init marking
The function autodetect_raid is only used by __init functions, and it refers
to __initdata, so it needs __init markings. Fixes this error:
The function autodetect_raid() references
the variable __initdata raid_noautodetect.
This is often because autodetect_raid lacks a __initdata
annotation or the annotation of raid_noautodetect is wrong.
Signed-off-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Qinghuang Feng [Tue, 6 Jan 2009 22:40:52 +0000 (14:40 -0800)]
samples: mark {static|__init|__exit} for {init|exit} functions
None of these (init|exit) functions is called from other functions which
is outside the kernel module mechanism or kernel itself, so mark them as
{static|__init|__exit}.
Darrick J. Wong [Tue, 6 Jan 2009 22:40:51 +0000 (14:40 -0800)]
Create a DIV_ROUND_CLOSEST macro to do division with rounding
Create a helper macro to divide two numbers and round the result to the
nearest whole number. This is a helper macro for hwmon drivers that want
to convert incoming sysfs values per standard hwmon practice, though the
macro itself can be used by anyone.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fs/exec.c:__bprm_mm_init(): clean up error handling
Untangle the error unwinding in this function, saving a test of local
variable `vma'.
Signed-off-by: Luiz Fernando N. Capitulino <lcapitulino@mandriva.com.br> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Marton Balint [Tue, 6 Jan 2009 22:40:43 +0000 (14:40 -0800)]
do_mounts: add device info to mount message
In the past, I used the root=... command line parameter to specify the
root filesystem to the kernel. Now it seems that specifying it is not
necessary. The kernel detects the root filesystem even if the kernel
command line is empty. My root fs is on a raid1 device by the way, and I
am not using initrd for the boot process.
If the kernel detects the root filesystem somehow, I think it should print
out the result of this detection, otherwise I will not know which device
has the root filesystem. Or is there an easy way to get this information
on a running system? I had a quick look at the /proc and /sys
filesystems, but haven't found anything useful there.
Signed-off-by: Marton Balint <cus@fazekas.hu> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
oops handling: ensure that any oops is flushed to the mtdoops console
This used to work unpatched with older kernels, during the development
phase of mtdoops. Before commit e3e8a75d2acfc61ebf25524666a0a2c6abb0620c
a space was printed with console_loglevel set to 15, which probably
flushed the oops message as a side effect.
This is another patch from the Nokia N810 kernel.
Signed-off-by: Viktor Rosendahl <viktor.rosendahl@nokia.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Several subsystem open handlers dereference the fops_get() return value
without checking it for nullness. This opens a race condition between the
open handler and module unloading.
A module can be marked as being unloaded (MODULE_STATE_GOING) before its
exit function is called and gets the chance to unregister the driver.
During that window open handlers can still be called, and fops_get() will
fail in try_module_get() and return a NULL pointer.
This change checks the fops_get() return value and returns -ENODEV if NULL.
Reported-by: Alan Jenkins <alan-jenkins@tuffmail.co.uk> Signed-off-by: Laurent Pinchart <laurent.pinchart@skynet.be> Acked-by: Takashi Iwai <tiwai@suse.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dave Airlie <airlied@linux.ie> Acked-by: Mauro Carvalho Chehab <mchehab@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the newly introduced pci_ioremap_bar() function in drivers/misc.
pci_ioremap_bar() just takes a pci device and a bar number, with the goal
of making it really hard to get wrong, while also having a central place
to stick sanity checks.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox [Tue, 6 Jan 2009 22:40:39 +0000 (14:40 -0800)]
atomic_t: unify all arch definitions
The atomic_t type cannot currently be used in some header files because it
would create an include loop with asm/atomic.h. Move the type definition
to linux/types.h to break the loop.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rakib Mullick [Tue, 6 Jan 2009 22:40:38 +0000 (14:40 -0800)]
init: properly placing noinline keyword
checkpatch warns about 'static void noinline'. It wants `static noinline
void'.
Both are permissible, but the kernel consistently uses `static inline' and
`static noinline', and consistency is good. Hence let's keep the
checkpatch warning and fix up this code site.
[akpm@linux-foundation.org: rewrote changelog] Signed-off-by: Md.Rakib H. Mullick <rakib.mullick@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KOSAKI Motohiro [Tue, 6 Jan 2009 22:40:33 +0000 (14:40 -0800)]
mm: stop kswapd's infinite loop at high order allocation
Wassim Dagash reported following kswapd infinite loop problem.
kswapd runs in some infinite loop trying to swap until order 10 of zone
highmem is OK.... kswapd will continue to try to balance order 10 of zone
highmem forever (or until someone release a very large chunk of highmem).
For non order-0 allocations, the system may never be balanced due to
fragmentation but kswapd should not infinitely loop as a result.
Instead, recheck all watermarks at order-0 as they are the most important.
If watermarks are ok, kswapd will go back to sleep.
Johannes Weiner [Tue, 6 Jan 2009 22:40:32 +0000 (14:40 -0800)]
bootmem: print request details before BUG_ON(them)
Moving the request details print-out before the sanity checks that
might panic() enables us to analyse invalid requests without having
access to the line information of the stack dump.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
xacct_add_tsk() relies on do_exit()->update_hiwater_xxx() and uses
mm->hiwater_xxx directly, this leads to 2 problems:
- taskstats_user_cmd() can call fill_pid()->xacct_add_tsk() at any
moment before the task exits, so we should check the current values of
rss/vm anyway.
- do_exit()->update_hiwater_xxx() calls are racy. An exiting thread can
be preempted right before mm->hiwater_xxx = new_val, and another thread
can use A_LOT of memory and exit in between. When the first thread
resumes it can be the last thread in the thread group, in that case we
report the wrong hiwater_xxx values which do not take A_LOT into
account.
Introduce get_mm_hiwater_rss() and get_mm_hiwater_vm() helpers and change
xacct_add_tsk() to use them. The first helper will also be used by
rusage->ru_maxrss accounting.
Kill do_exit()->update_hiwater_xxx() calls. Unless we are going to
decrease rss/vm there is no point to update mm->hiwater_xxx, and nobody
can look at this mm_struct when exit_mmap() actually unmaps the memory.
Nick Piggin [Tue, 6 Jan 2009 22:40:28 +0000 (14:40 -0800)]
mm: pagecache gfp flags fix
Frustratingly, gfp_t is really divided into two classes of flags. One are
the context dependent ones (can we sleep? can we enter filesystem? block
subsystem? should we use some extra reserves, etc.). The other ones are
the type of memory required and depend on how the algorithm is implemented
rather than the point at which the memory is allocated (highmem? dma
memory? etc).
Some of the functions which allocate a page and add it to page cache take
a gfp_t, but sometimes those functions or their callers aren't really
doing the right thing: when allocating pagecache page, the memory type
should be mapping_gfp_mask(mapping). When allocating radix tree nodes,
the memory type should be kernel mapped (not highmem) memory. The gfp_t
argument should only really be needed for context dependent options.
This patch doesn't really solve that tangle in a nice way, but it does
attempt to fix a couple of bugs.
- find_or_create_page changes its radix-tree allocation to only include
the main context dependent flags in order so the pagecache page may be
allocated from arbitrary types of memory without affecting the
radix-tree. In practice, slab allocations don't come from highmem
anyway, and radix-tree only uses slab allocations. So there isn't a
practical change (unless some fs uses GFP_DMA for pages).
- grab_cache_page_nowait() is changed to allocate radix-tree nodes with
GFP_NOFS, because it is not supposed to reenter the filesystem. This
bug could cause lock recursion if a filesystem is not expecting the
function to reenter the fs (as-per documentation).
Filesystems should be careful about exactly what semantics they want and
what they get when fiddling with gfp_t masks to allocate pagecache. One
should be as liberal as possible with the type of memory that can be used,
and same for the the context specific flags.
Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nick Piggin [Tue, 6 Jan 2009 22:40:26 +0000 (14:40 -0800)]
fs: sys_sync fix
s_syncing livelock avoidance was breaking data integrity guarantee of
sys_sync, by allowing sys_sync to skip writing or waiting for superblocks
if there is a concurrent sys_sync happening.
This livelock avoidance is much less important now that we don't have the
get_super_to_sync() call after every sb that we sync. This was replaced
by __put_super_and_need_restart.
Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nick Piggin [Tue, 6 Jan 2009 22:40:25 +0000 (14:40 -0800)]
fs: sync_sb_inodes fix
Fix data integrity semantics required by sys_sync, by iterating over all
inodes and waiting for any writeback pages after the initial writeout.
Comments explain the exact problem.
Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nick Piggin [Tue, 6 Jan 2009 22:40:25 +0000 (14:40 -0800)]
fs: remove WB_SYNC_HOLD
Remove WB_SYNC_HOLD. The primary motiviation is the design of my
anti-starvation code for fsync. It requires taking an inode lock over the
sync operation, so we could run into lock ordering problems with multiple
inodes. It is possible to take a single global lock to solve the ordering
problem, but then that would prevent a future nice implementation of "sync
multiple inodes" based on lock order via inode address.
Seems like a backward step to remove this, but actually it is busted
anyway: we can't use the inode lists for data integrity wait: an inode can
be taken off the dirty lists but still be under writeback. In order to
satisfy data integrity semantics, we should wait for it to finish
writeback, but if we only search the dirty lists, we'll miss it.
It would be possible to have a "writeback" list, for sys_sync, I suppose.
But why complicate things by prematurely optimise? For unmounting, we
could avoid the "livelock avoidance" code, which would be easier, but
again premature IMO.
Fixing the existing data integrity problem will come next.
Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
WB_SYNC_HOLD is going to be zapped so we should not use it. Use
%WB_SYNC_NONE instead. Here is what akpm said:
"I think I'll just switch that to WB_SYNC_NONE. The `wait==0' mode is
just an advisory thing to help the fs shove lots of data into the
queues. If some gets missed then it'll be picked up on the second
->sync_fs call, with wait==1."
Thanks to Randy Dunlap for catching this.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nick Piggin [Tue, 6 Jan 2009 22:40:22 +0000 (14:40 -0800)]
mm: direct IO starvation improvement
Direct IO can invalidate and sync a lot of pagecache pages in the mapping.
A 4K direct IO will actually try to sync and/or invalidate the pagecache
of the entire file, for example (which might be many GB or TB large).
Improve this by doing range syncs. Also, memory no longer has to be
unmapped to catch the dirty bits for syncing, as dirty bits would remain
coherent due to dirty mmap accounting.
This fixes the immediate DM deadlocks when doing direct IO reads to block
device with a mounted filesystem, if only by papering over the problem
somewhat rather than addressing the fsync starvation cases.
Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matt Mackall [Tue, 6 Jan 2009 22:40:20 +0000 (14:40 -0800)]
shmem: unify regular and tiny shmem
tiny-shmem shares most of its 130 lines of code with shmem and tends to
break when particular bits of shmem get modified. Unifying saves code and
makes keeping these two in sync much easier.
Ying Han [Tue, 6 Jan 2009 22:40:18 +0000 (14:40 -0800)]
mm: make get_user_pages() interruptible
The initial implementation of checking TIF_MEMDIE covers the cases of OOM
killing. If the process has been OOM killed, the TIF_MEMDIE is set and it
return immediately. This patch includes:
1. add the case that the SIGKILL is sent by user processes. The
process can try to get_user_pages() unlimited memory even if a user
process has sent a SIGKILL to it(maybe a monitor find the process
exceed its memory limit and try to kill it). In the old
implementation, the SIGKILL won't be handled until the get_user_pages()
returns.
2. change the return value to be ERESTARTSYS. It makes no sense to
return ENOMEM if the get_user_pages returned by getting a SIGKILL
signal. Considering the general convention for a system call
interrupted by a signal is ERESTARTNOSYS, so the current return value
is consistant to that.
Lee:
An unfortunate side effect of "make-get_user_pages-interruptible" is that
it prevents a SIGKILL'd task from munlock-ing pages that it had mlocked,
resulting in freeing of mlocked pages. Freeing of mlocked pages, in
itself, is not so bad. We just count them now--altho' I had hoped to
remove this stat and add PG_MLOCKED to the free pages flags check.
However, consider pages in shared libraries mapped by more than one task
that a task mlocked--e.g., via mlockall(). If the task that mlocked the
pages exits via SIGKILL, these pages would be left mlocked and
unevictable.
Proposed fix:
Add another GUP flag to ignore sigkill when calling get_user_pages from
munlock()--similar to Kosaki Motohiro's 'IGNORE_VMA_PERMISSIONS flag for
the same purpose. We are not actually allocating memory in this case,
which "make-get_user_pages-interruptible" intends to avoid. We're just
munlocking pages that are already resident and mapped, and we're reusing
get_user_pages() to access those pages.
?? Maybe we should combine 'IGNORE_VMA_PERMISSIONS and '_IGNORE_SIGKILL
into a single flag: GUP_FLAGS_MUNLOCK ???
[Lee.Schermerhorn@hp.com: ignore sigkill in get_user_pages during munlock] Signed-off-by: Paul Menage <menage@google.com> Signed-off-by: Ying Han <yinghan@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Hugh Dickins <hugh@veritas.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Rohit Seth <rohitseth@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 6 Jan 2009 22:40:13 +0000 (14:40 -0800)]
badpage: KERN_ALERT BUG instead of KERN_EMERG
bad_page() and rmap Eeek messages have said KERN_EMERG for a few years,
which I've followed in print_bad_pte(). These are serious system errors,
on a par with BUGs, but they're not quite emergencies, and we do our best
to carry on: say KERN_ALERT "BUG: " like the x86 oops does.
And remove the "Trying to fix it up, but a reboot is needed" line: it's
not untrue, but I hope the KERN_ALERT "BUG: " conveys as much.
Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 6 Jan 2009 22:40:12 +0000 (14:40 -0800)]
badpage: ratelimit print_bad_pte and bad_page
print_bad_pte() and bad_page() might each need ratelimiting - especially
for their dump_stacks, almost never of interest, yet not quite
dispensible. Correlating corruption across neighbouring entries can be
very helpful, so allow a burst of 60 reports before keeping quiet for the
remainder of that minute (or allow a steady drip of one report per
second).
Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 6 Jan 2009 22:40:11 +0000 (14:40 -0800)]
badpage: remove vma from page_remove_rmap
Remove page_remove_rmap()'s vma arg, which was only for the Eeek message.
And remove the BUG_ON(page_mapcount(page) == 0) from CONFIG_DEBUG_VM's
page_dup_rmap(): we're trying to be more resilient about that than BUGs.
Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 6 Jan 2009 22:40:10 +0000 (14:40 -0800)]
badpage: zap print_bad_pte on swap and file
Complete zap_pte_range()'s coverage of bad pagetable entries by calling
print_bad_pte() on a pte_file in a linear vma and on a bad swap entry.
That needs free_swap_and_cache() to tell it, which will also have shown
one of those "swap_free" errors (but with much less information).
Similar checks in fork's copy_one_pte()? No, that would be more noisy
than helpful: we'll see them when parent and child exec or exit.
Where do_nonlinear_fault() calls print_bad_pte(): omit !VM_CAN_NONLINEAR
case, that could only be a bug in sys_remap_file_pages(), not a bad pte.
VM_FAULT_OOM rather than VM_FAULT_SIGBUS? Well, okay, that is consistent
with what happens if do_swap_page() operates a bad swap entry; but don't
we have patches to be more careful about killing when VM_FAULT_OOM?
Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 6 Jan 2009 22:40:09 +0000 (14:40 -0800)]
badpage: vm_normal_page use print_bad_pte
print_bad_pte() is so far being called only when zap_pte_range() finds
negative page_mapcount, or there's a fault on a pte_file where it does not
belong. That's weak coverage when we suspect pagetable corruption.
Originally, it was called when vm_normal_page() found an invalid pfn: but
pfn_valid is expensive on some architectures and configurations, so 2.6.24
put that under CONFIG_DEBUG_VM (which doesn't help in the field), then
2.6.26 replaced it by a VM_BUG_ON (likewise).
Reinstate the print_bad_pte() in vm_normal_page(), but use a cheaper test
than pfn_valid(): memmap_init_zone() (used in bootup and hotplug) keep a
__read_mostly note of the highest_memmap_pfn, vm_normal_page() then check
pfn against that. We could call this pfn_plausible() or pfn_sane(), but I
doubt we'll need it elsewhere: of course it's not reliable, but gives much
stronger pagetable validation on many boxes.
Also use print_bad_pte() when the pte_special bit is found outside a
VM_PFNMAP or VM_MIXEDMAP area, instead of VM_BUG_ON.
Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 6 Jan 2009 22:40:08 +0000 (14:40 -0800)]
badpage: replace page_remove_rmap Eeek and BUG
Now that bad pages are kept out of circulation, there is no need for the
infamous page_remove_rmap() BUG() - once that page is freed, its negative
mapcount will issue a "Bad page state" message and the page won't be
freed. Removing the BUG() allows more info, on subsequent pages, to be
gathered.
We do have more info about the page at this point than bad_page() can know
- notably, what the pmd is, which might pinpoint something like low 64kB
corruption - but page_remove_rmap() isn't given the address to find that.
In practice, there is only one call to page_remove_rmap() which has ever
reported anything, that from zap_pte_range() (usually on exit, sometimes
on munmap). It has all the info, so remove page_remove_rmap()'s "Eeek"
message and leave it all to zap_pte_range().
mm/memory.c already has a hardly used print_bad_pte() function, showing
some of the appropriate info: extend it to show what we want for the rmap
case: pte info, page info (when there is a page) and vma info to compare.
zap_pte_range() already knows the pmd, but print_bad_pte() is easier to
use if it works that out for itself.
Some of this info is also shown in bad_page()'s "Bad page state" message.
Keep them separate, but adjust them to match each other as far as
possible. Say "Bad page map" in print_bad_pte(), and add a TAINT_BAD_PAGE
there too.
print_bad_pte() show current->comm unconditionally (though it should get
repeated in the usually irrelevant stack trace): sorry, I misled Nick
Piggin to make it conditional on vm_mm == current->mm, but current->mm is
already NULL in the exit case. Usually current->comm is good, though
exceptionally it may not be that of the mm (when "swapoff" for example).
Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 6 Jan 2009 22:40:06 +0000 (14:40 -0800)]
badpage: keep any bad page out of circulation
Until now the bad_page() checkers have special-cased PageReserved, keeping
those pages out of circulation thereafter. Now extend the special case to
all: we want to keep ANY page with bad state out of circulation - the
"free" page may well be in use by something.
Leave the bad state of those pages untouched, for examination by
debuggers; except for PageBuddy - leaving that set would risk bringing the
page back.
Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 6 Jan 2009 22:40:05 +0000 (14:40 -0800)]
badpage: simplify page_alloc flag check+clear
Simplify the PAGE_FLAGS checking and clearing when freeing and allocating
a page: check the same flags as before when freeing, clear ALL the flags
(unless PageReserved) when freeing, check ALL flags off when allocating.
Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dmitri Monakhov [Tue, 6 Jan 2009 22:40:04 +0000 (14:40 -0800)]
fs: truncate blocks outside i_size after O_DIRECT write error
In case of error extending write may have instantiated a few blocks
outside i_size. We need to trim these blocks. We have to do it
*regardless* to blocksize. At least ext2, ext3 and reiserfs interpret
(i_size < biggest block) condition as error. Fsck will complain about
wrong i_size. Then fsck will fix the error by changing i_size according
to the biggest block. This is bad because this blocks contain garbage
from previous write attempt. And result in data corruption.
####TESTCASE_BEGIN
$touch /mnt/test/BIG_FILE
## at this moment /mnt/test/BIG_FILE size and blocks equal to zero
open("/mnt/test/BIG_FILE", O_WRONLY|O_CREAT|O_DIRECT, 0666) = 3
write(3, "aaaaaaaaaaaa"..., 104857600) = -1 ENOSPC (No space left on device)
## size and block sould't be changed because write op failed.
$stat /mnt/test/BIG_FILE
File: `/mnt/test/BIG_FILE'
Size: 0 Blocks: 110896 IO Block: 1024 regular empty file
<<<<<<<<^^^^^^^^^^^^^^^^^^^^^^^^^^^^^file size is less than biggest block idx
Device: fe07h/65031d Inode: 14 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2007-01-24 20:03:38.000000000 +0300
Modify: 2007-01-24 20:03:38.000000000 +0300
Change: 2007-01-24 20:03:39.000000000 +0300
#fsck.ext3 -f /dev/VG/test
e2fsck 1.39 (29-May-2006)
Pass 1: Checking inodes, blocks, and sizes
Inode 14, i_size is 0, should be 56556544. Fix<y>? yes
Pass 2: Checking directory structure
....
#####TESTCASE_ENDdiff --git a/fs/direct-io.c b/fs/direct-io.c
index af0558d..4e88bea 100644
[akpm@linux-foundation.org: use i_size_read()] Signed-off-by: Dmitri Monakhov <dmonakhov@openvz.org> Cc: Zach Brown <zach.brown@oracle.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rik van Riel [Tue, 6 Jan 2009 22:40:01 +0000 (14:40 -0800)]
vmscan: bail out of direct reclaim after swap_cluster_max pages
When the VM is under pressure, it can happen that several direct reclaim
processes are in the pageout code simultaneously. It also happens that
the reclaiming processes run into mostly referenced, mapped and dirty
pages in the first round.
This results in multiple direct reclaim processes having a lower
pageout priority, which corresponds to a higher target of pages to
scan.
This in turn can result in each direct reclaim process freeing
many pages. Together, they can end up freeing way too many pages.
This kicks useful data out of memory (in some cases more than half
of all memory is swapped out). It also impacts performance by
keeping tasks stuck in the pageout code for too long.
A 30% improvement in hackbench has been observed with this patch.
The fix is relatively simple: in shrink_zone() we can check how many
pages we have already freed, direct reclaim tasks break out of the
scanning loop if they have already freed enough pages and have reached
a lower priority level.
We do not break out of shrink_zone() when priority == DEF_PRIORITY,
to ensure that equal pressure is applied to every zone in the common
case.
However, in order to do this we do need to know how many pages we already
freed, so move nr_reclaimed into scan_control.
akpm: a historical interlude...
We tried this in 2004:
:commit e468e46a9bea3297011d5918663ce6d19094cf87
:Author: akpm <akpm>
:Date: Thu Jun 24 15:53:52 2004 +0000
:
:[PATCH] vmscan.c: dont reclaim too many pages
:
: The shrink_zone() logic can, under some circumstances, cause far too many
: pages to be reclaimed. Say, we're scanning at high priority and suddenly hit
: a large number of reclaimable pages on the LRU.
: Change things so we bale out when SWAP_CLUSTER_MAX pages have been reclaimed.
And we reverted it in 2006:
:commit 210fe530305ee50cd889fe9250168228b2994f32
:Author: Andrew Morton <akpm@osdl.org>
:Date: Fri Jan 6 00:11:14 2006 -0800
:
: [PATCH] vmscan: balancing fix
:
: Revert a patch which went into 2.6.8-rc1. The changelog for that patch was:
:
: The shrink_zone() logic can, under some circumstances, cause far too many
: pages to be reclaimed. Say, we're scanning at high priority and suddenly
: hit a large number of reclaimable pages on the LRU.
:
: Change things so we bale out when SWAP_CLUSTER_MAX pages have been
: reclaimed.
:
: Problem is, this change caused significant imbalance in inter-zone scan
: balancing by truncating scans of larger zones.
:
: Suppose, for example, ZONE_HIGHMEM is 10x the size of ZONE_NORMAL. The zone
: balancing algorithm would require that if we're scanning 100 pages of
: ZONE_HIGHMEM, we should scan 10 pages of ZONE_NORMAL. But this logic will
: cause the scanning of ZONE_HIGHMEM to bale out after only 32 pages are
: reclaimed. Thus effectively causing smaller zones to be scanned relatively
: harder than large ones.
:
: Now I need to remember what the workload was which caused me to write this
: patch originally, then fix it up in a different way...
And we haven't demonstrated that whatever problem caused that reversion is
not being reintroduced by this change in 2008.
Signed-off-by: Rik van Riel <riel@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 6 Jan 2009 22:39:57 +0000 (14:39 -0800)]
swapfile: let others seed random
Remove the srandom32((u32)get_seconds()) from non-rotational swapon:
there's been a coincidental discussion of earlier randomization, assume
that goes ahead, let swapon be a client rather than stirring for itself.
Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Donjun Shin <djshin90@gmail.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Joern Engel <joern@logfs.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Tejun Heo <teheo@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 6 Jan 2009 22:39:56 +0000 (14:39 -0800)]
swapfile: change discard pgoff_t to sector_t
Change pgoff_t nr_blocks in discard_swap() and discard_swap_cluster() to
sector_t: given the constraints on swap offsets (in particular, the 5 bits
of swap type accommodated in the same unsigned long), pgoff_t was actually
safe as is, but it certainly looked worrying when shifted left.
[akpm@linux-foundation.org: fix shift overflow] Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Joern Engel <joern@logfs.org> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Donjun Shin <djshin90@gmail.com> Cc: Tejun Heo <teheo@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 6 Jan 2009 22:39:55 +0000 (14:39 -0800)]
swapfile: swap allocation cycle if nonrot
Though attempting to find free clusters (Andrea), swap allocation has
always restarted its searches from the beginning of the swap area (sct),
to reduce seek times between swap pages, by not scattering them all over
the partition.
But on a solidstate swap device, seeks are cheap, and block remapping to
level the wear may be limited by zones: in that case it's better to cycle
around the whole partition.
Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Joern Engel <joern@logfs.org> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Donjun Shin <djshin90@gmail.com> Cc: Tejun Heo <teheo@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 6 Jan 2009 22:39:54 +0000 (14:39 -0800)]
swapfile: swapon randomize if nonrot
Swap allocation has always started from the beginning of the swap area;
but if we're dealing with a solidstate swap device which can only remap
blocks within limited zones, that would sooner wear out the first zone.
Therefore sys_swapon() test whether blk_queue is non-rotational, and if so
randomize the cluster_next starting position for allocation.
If blk_queue is nonrot, note SWP_SOLIDSTATE for later use, and report it
with an "SS" at the right end of the kernel's "Adding ... swap" message
(so that if it's both nonrot and discardable, "SSD" will be shown there).
Perhaps something should be shown in /proc/swaps (swapon -s), but we have
to be more cautious before making any addition to that format.
Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Joern Engel <joern@logfs.org> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Donjun Shin <djshin90@gmail.com> Cc: Tejun Heo <teheo@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 6 Jan 2009 22:39:53 +0000 (14:39 -0800)]
swapfile: swap allocation use discard
When scan_swap_map() finds a free cluster of swap pages to allocate,
discard the old contents of the cluster if the device supports discard.
But don't bother when swap is so fragmented that we allocate single pages.
Be careful about racing allocations made while we're scanning for a
cluster; and hold up allocations made while we're discarding.
Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Joern Engel <joern@logfs.org> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Donjun Shin <djshin90@gmail.com> Cc: Tejun Heo <teheo@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 6 Jan 2009 22:39:51 +0000 (14:39 -0800)]
swapfile: swapon use discard (trim)
When adding swap, all the old data on swap can be forgotten: sys_swapon()
discard all but the header page of the swap partition (or every extent but
the header of the swap file), to give a solidstate swap device the
opportunity to optimize its wear-levelling.
If that succeeds, note SWP_DISCARDABLE for later use, and report it with a
"D" at the right end of the kernel's "Adding ... swap" message. Perhaps
something should be shown in /proc/swaps (swapon -s), but we have to be
more cautious before making any addition to that format.
Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Joern Engel <joern@logfs.org> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Donjun Shin <djshin90@gmail.com> Cc: Tejun Heo <teheo@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 6 Jan 2009 22:39:50 +0000 (14:39 -0800)]
swapfile: rearrange scan and swap_info
Before making functional changes, rearrange scan_swap_map() to simplify
subsequent diffs. Actually, there is one functional change in there:
leave cluster_nr negative while scanning for a new cluster - resetting it
early increased the likelihood that when we have difficulty finding a free
cluster, another task may come in and try doing exactly the same - just a
waste of cpu.
Before making functional changes, rearrange struct swap_info_struct
slightly: flags will be needed as an unsigned long (for wait_on_bit), next
is a good int to pair with prio, old_block_size is uninteresting so shift
it to the end.
Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 6 Jan 2009 22:39:49 +0000 (14:39 -0800)]
swapfile: remove v0 SWAP-SPACE message
The kernel has not supported v0 SWAP-SPACE since 2.5.22: I think we can
now safely drop its "version 0 swap is no longer supported" message - just
say "Unable to find swap-space signature" as usual. This removes one
level of indentation from a stretch of sys_swapon().
I'd have liked to be specific, saying "Unable to find SWAPSPACE2
signature", but it's just too confusing that the version 1 signature shows
the number 2.
Hugh Dickins [Tue, 6 Jan 2009 22:39:47 +0000 (14:39 -0800)]
swapfile: swapon needs larger size type
sys_swapon()'s swapfilesize (better renamed swapfilepages) is declared as
an int, but should be an unsigned long like the maxpages it's compared
against: on 64-bit (with 4kB pages) a swapfile of 2^44 bytes was rejected
with "Swap area shorter than signature indicates".
Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KOSAKI Motohiro [Tue, 6 Jan 2009 22:39:46 +0000 (14:39 -0800)]
mm: make vread() and vwrite() declaration
Sparse output following warnings.
mm/vmalloc.c:1436:6: warning: symbol 'vread' was not declared. Should it be static?
mm/vmalloc.c:1474:6: warning: symbol 'vwrite' was not declared. Should it be static?
Hugh Dickins [Tue, 6 Jan 2009 22:39:41 +0000 (14:39 -0800)]
mm: optimize get_scan_ratio for no swap
Rik suggests a simplified get_scan_ratio() for !CONFIG_SWAP. Yes, the gcc
optimizer gives us that, when nr_swap_pages is #defined as 0L. Move usual
declaration to swapfile.c: it never belonged in page_alloc.c.
Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Robin Holt <holt@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 6 Jan 2009 22:39:40 +0000 (14:39 -0800)]
mm: add add_to_swap stub
If we add a failing stub for add_to_swap(), then we can remove the #ifdef
CONFIG_SWAP from mm/vmscan.c.
This was intended as a source cleanup, but looking more closely, it turns
out that the !CONFIG_SWAP case was going to keep_locked for an anonymous
page, whereas now it goes to the more suitable activate_locked, like the
CONFIG_SWAP nr_swap_pages 0 case.
Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Robin Holt <holt@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 6 Jan 2009 22:39:39 +0000 (14:39 -0800)]
mm: remove gfp_mask from add_to_swap
Remove gfp_mask argument from add_to_swap(): it's misleading because its
only caller, shrink_page_list(), is not atomic at that point; and in due
course (implementing discard) we'll sometimes want to allocate some memory
with GFP_NOIO (as is used in swap_writepage) when allocating swap.
No change to the gfp_mask passed down to add_to_swap_cache(): still use
__GFP_HIGH without __GFP_WAIT (with nomemalloc and nowarn as before):
though it's not obvious if that's the best combination to ask for here.
Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Robin Holt <holt@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 6 Jan 2009 22:39:38 +0000 (14:39 -0800)]
mm: remove try_to_munlock from vmscan
An unfortunate feature of the Unevictable LRU work was that reclaiming an
anonymous page involved an extra scan through the anon_vma: to check that
the page is evictable before allocating swap, because the swap could not
be freed reliably soon afterwards.
Now try_to_free_swap() has replaced remove_exclusive_swap_page(), that's
not an issue any more: remove try_to_munlock() call from
shrink_page_list(), leaving it to try_to_munmap() to discover if the page
is one to be culled to the unevictable list - in which case then
try_to_free_swap().
Update unevictable-lru.txt to remove comments on the try_to_munlock() in
shrink_page_list(), and shorten some lines over 80 columns.
Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Robin Holt <holt@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 6 Jan 2009 22:39:37 +0000 (14:39 -0800)]
mm: try_to_unuse check removing right swap
There's a possible race in try_to_unuse() which Nick Piggin led me to two
years ago. Where it does lock_page() after read_swap_cache_async(), what
if another task removed that page from swapcache just before we locked it?
It would sail though the (*swap_map > 1) tests doing nothing (because it
could not have been removed from swapcache before its swap references were
gone), until it reaches the delete_from_swap_cache(page) near the bottom.
Now imagine that this page has been allocated to swap on a different swap
area while we dropped page lock (perhaps at the top, perhaps in unuse_mm):
we could wrongly remove from swap cache before the page has been written
to swap, so a subsequent do_swap_page() would read in stale data from
swap.
I think this case could not happen before: remove_exclusive_swap_page()
refused while page count was raised. But now with reuse_swap_page() and
try_to_free_swap() removing from swap cache without minding page count, I
think it could happen - the previous patch argued that it was safe because
try_to_unuse() already ignored page count, but overlooked that it might be
breaking the assumptions in try_to_unuse() itself.
Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Robin Holt <holt@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
remove_exclusive_swap_page(): its problem is in living up to its name.
It doesn't matter if someone else has a reference to the page (raised
page_count); it doesn't matter if the page is mapped into userspace
(raised page_mapcount - though that hints it may be worth keeping the
swap): all that matters is that there be no more references to the swap
(and no writeback in progress).
swapoff (try_to_unuse) has been removing pages from swapcache for years,
with no concern for page count or page mapcount, and we used to have a
comment in lookup_swap_cache() recognizing that: if you go for a page of
swapcache, you'll get the right page, but it could have been removed from
swapcache by the time you get page lock.
So, give up asking for exclusivity: get rid of
remove_exclusive_swap_page(), and remove_exclusive_swap_page_ref() and
remove_exclusive_swap_page_count() which were spawned for the recent LRU
work: replace them by the simpler try_to_free_swap() which just checks
page_swapcount().
Similarly, remove the page_count limitation from free_swap_and_count(),
but assume that it's worth holding on to the swap if page is mapped and
swap nowhere near full. Add a vm_swap_full() test in free_swap_cache()?
It would be consistent, but I think we probably have enough for now.
Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Robin Holt <holt@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 6 Jan 2009 22:39:34 +0000 (14:39 -0800)]
mm: reuse_swap_page replaces can_share_swap_page
A good place to free up old swap is where do_wp_page(), or do_swap_page(),
is about to redirty the page: the data on disk is then stale and won't be
read again; and if we do decide to write the page out later, using the
previous swap location makes an unnecessary disk seek very likely.
So give can_share_swap_page() the side-effect of delete_from_swap_cache()
when it safely can. And can_share_swap_page() was always a misleading
name, the more so if it has a side-effect: rename it reuse_swap_page().
Irrelevant cleanup nearby: remove swap_token_default_timeout definition
from swap.h: it's used nowhere.
Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Robin Holt <holt@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 6 Jan 2009 22:39:33 +0000 (14:39 -0800)]
mm: wp lock page before deciding cow
An application may rely on get_user_pages() to give it pages writable from
userspace and shared with a driver, GUP breaking COW if necessary. It may
mprotect() the pages' writability, off and on, from time to time.
Normally this works fine (so long as the app does not fork); but just
occasionally, under memory pressure, a readonly pte in a newly writable
area is COWed unnecessarily, breaking the link with the driver: because
do_wp_page() does trylock_page, and falls back to COW whenever that fails.
For reliable behaviour in the unshared case, when the trylock_page fails,
now unlock pagetable, lock page and relock pagetable, before deciding
whether Copy-On-Write is really necessary.
Reported-by: Zhou Yingchao Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Robin Holt <holt@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 6 Jan 2009 22:39:32 +0000 (14:39 -0800)]
mm: gup persist for write permission
do_wp_page()'s VM_FAULT_WRITE return value tells __get_user_pages() that
COW has been done if necessary, though it may be leaving the pte without
write permission - for the odd case of forced writing to a readonly vma
for ptrace. At present GUP then retries the follow_page() without asking
for write permission, to escape an endless loop when forced.
But an application may be relying on GUP to guarantee a writable page
which won't be COWed again when written from userspace, whereas a race
here might leave a readonly pte in place? Change the VM_FAULT_WRITE
handling to ask follow_page() for write permission again, except in that
odd case of forced writing to a readonly vma.
Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Robin Holt <holt@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Tue, 6 Jan 2009 22:39:31 +0000 (14:39 -0800)]
mm: add dirty_background_bytes and dirty_bytes sysctls
This change introduces two new sysctls to /proc/sys/vm:
dirty_background_bytes and dirty_bytes.
dirty_background_bytes is the counterpart to dirty_background_ratio and
dirty_bytes is the counterpart to dirty_ratio.
With growing memory capacities of individual machines, it's no longer
sufficient to specify dirty thresholds as a percentage of the amount of
dirtyable memory over the entire system.
dirty_background_bytes and dirty_bytes specify quantities of memory, in
bytes, that represent the dirty limits for the entire system. If either
of these values is set, its value represents the amount of dirty memory
that is needed to commence either background or direct writeback.
When a `bytes' or `ratio' file is written, its counterpart becomes a
function of the written value. For example, if dirty_bytes is written to
be 8096, 8K of memory is required to commence direct writeback.
dirty_ratio is then functionally equivalent to 8K / the amount of
dirtyable memory:
Only one of dirty_background_bytes and dirty_background_ratio may be
specified at a time, and only one of dirty_bytes and dirty_ratio may be
specified. When one sysctl is written, the other appears as 0 when read.
The `bytes' files operate on a page size granularity since dirty limits
are compared with ZVC values, which are in page units.
Prior to this change, the minimum dirty_ratio was 5 as implemented by
get_dirty_limits() although /proc/sys/vm/dirty_ratio would show any user
written value between 0 and 100. This restriction is maintained, but
dirty_bytes has a lower limit of only one page.
Also prior to this change, the dirty_background_ratio could not equal or
exceed dirty_ratio. This restriction is maintained in addition to
restricting dirty_background_bytes. If either background threshold equals
or exceeds that of the dirty threshold, it is implicitly set to half the
dirty threshold.
Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Andrea Righi <righi.andrea@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Tue, 6 Jan 2009 22:39:29 +0000 (14:39 -0800)]
mm: change dirty limit type specifiers to unsigned long
The background dirty and dirty limits are better defined with type
specifiers of unsigned long since negative writeback thresholds are not
possible.
These values, as returned by get_dirty_limits(), are normally compared
with ZVC values to determine whether writeback shall commence or be
throttled. Such page counts cannot be negative, so declaring the page
limits as signed is unnecessary.
Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Andrea Righi <righi.andrea@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Julia Lawall [Tue, 6 Jan 2009 22:39:28 +0000 (14:39 -0800)]
mm/page_alloc.c: eliminate NULL test and memset after alloc_bootmem
As noted by Akinobu Mita in patch b1fceac2b9e04d278316b2faddf276015fc06e3b,
alloc_bootmem and related functions never return NULL and always return a
zeroed region of memory. Thus a NULL test or memset after calls to these
functions is unnecessary.
This was fixed using the following semantic patch.
(http://www.emn.fr/x-info/coccinelle/)
// <smpl>
@@
expression E;
statement S;
@@
E = \(alloc_bootmem\|alloc_bootmem_low\|alloc_bootmem_pages\|alloc_bootmem_low_pages\|alloc_bootmem_node\|alloc_bootmem_low_pages_node\|alloc_bootmem_pages_node\)(...)
... when != E
(
- BUG_ON (E == NULL);
|
- if (E == NULL) S
)
@@
expression E,E1;
@@
E = \(alloc_bootmem\|alloc_bootmem_low\|alloc_bootmem_pages\|alloc_bootmem_low_pages\|alloc_bootmem_node\|alloc_bootmem_low_pages_node\|alloc_bootmem_pages_node\)(...)
... when != E
- memset(E,0,E1);
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 6 Jan 2009 22:39:27 +0000 (14:39 -0800)]
mm: further cleanup page_add_new_anon_rmap
Moving lru_cache_add_active_or_unevictable() into page_add_new_anon_rmap()
was good but stupid: we can and should SetPageSwapBacked() there too; and
we know for sure that this anonymous, swap-backed page is not file cache.
Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 6 Jan 2009 22:39:26 +0000 (14:39 -0800)]
mm: make page_lock_anon_vma() static
page_lock_anon_vma() and page_unlock_anon_vma() were made available to
show_page_path() in vmscan.c; but now that has been removed, make them
static in rmap.c again, they're better kept private if possible.
Hugh Dickins [Tue, 6 Jan 2009 22:39:25 +0000 (14:39 -0800)]
mm: add_active_or_unevictable into rmap
lru_cache_add_active_or_unevictable() and page_add_new_anon_rmap() always
appear together. Save some symbol table space and some jumping around by
removing lru_cache_add_active_or_unevictable(), folding its code into
page_add_new_anon_rmap(): like how we add file pages to lru just after
adding them to page cache.
Remove the nearby "TODO: is this safe?" comments (yes, it is safe), and
change page_add_new_anon_rmap()'s address BUG_ON to VM_BUG_ON as
originally intended.
Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 6 Jan 2009 22:39:25 +0000 (14:39 -0800)]
mm: replace some BUG_ONs by VM_BUG_ONs
The swap code is over-provisioned with BUG_ONs on assorted page flags,
mostly dating back to 2.3. They're good documentation, and guard against
developer error, but a waste of space on most systems: change them to
VM_BUG_ONs, conditional on CONFIG_DEBUG_VM. Just delete the PagePrivate
ones: they're later, from 2.5.69, but even less interesting now.
Signed-off-by: Hugh Dickins <hugh@veritas.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 6 Jan 2009 22:39:23 +0000 (14:39 -0800)]
mm: remove GFP_HIGHUSER_PAGECACHE
GFP_HIGHUSER_PAGECACHE is just an alias for GFP_HIGHUSER_MOVABLE, making
that harder to track down: remove it, and its out-of-work brothers
GFP_NOFS_PAGECACHE and GFP_USER_PAGECACHE.
Since we're making that improvement to hotremove_migrate_alloc(), I think
we can now also remove one of the "o"s from its comment.
Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 6 Jan 2009 22:39:22 +0000 (14:39 -0800)]
mm: remove cgroup_mm_owner_callbacks
cgroup_mm_owner_callbacks() was brought in to support the memrlimit
controller, but sneaked into mainline ahead of it. That controller has
now been shelved, and the mm_owner_changed() args were inadequate for it
anyway (they needed an mm pointer instead of a task pointer).
Remove the dead code, and restore mm_update_next_owner() locking to how it
was before: taking mmap_sem there does nothing for memcontrol.c, now the
only user of mm->owner.
Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Paul Menage <menage@google.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nick Piggin [Tue, 6 Jan 2009 22:39:20 +0000 (14:39 -0800)]
mm: vmalloc make lazy unmapping configurable
Lazy unmapping in the vmalloc code has now opened the possibility for use
after free bugs to go undetected. We can catch those by forcing an unmap
and flush (which is going to be slow, but that's what happens).
Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nick Piggin [Tue, 6 Jan 2009 22:39:19 +0000 (14:39 -0800)]
mm: vmalloc use mutex for purge
The vmalloc purge lock can be a mutex so we can sleep while a purge is
going on (purge involves a global kernel TLB invalidate, so it can take
quite a while).
Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Glauber Costa [Tue, 6 Jan 2009 22:39:19 +0000 (14:39 -0800)]
mm: vmalloc improve vmallocinfo
If we do that, output of files like /proc/vmallocinfo will show things
like "vmalloc_32", "vmalloc_user", or whomever the caller was as the
caller. This info is not as useful as the real caller of the allocation.
So, proposal is to call __vmalloc_node node directly, with matching
parameters to save the caller information
Signed-off-by: Glauber Costa <glommer@redhat.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Tue, 6 Jan 2009 22:39:17 +0000 (14:39 -0800)]
mm: more likely reclaim MADV_SEQUENTIAL mappings
File pages mapped only in sequentially read mappings are perfect reclaim
canditates.
This patch makes these mappings behave like weak references, their pages
will be reclaimed unless they have a strong reference from a normal
mapping as well.
It changes the reclaim and the unmap path where they check if the page has
been referenced. In both cases, accesses through sequentially read
mappings will be ignored.