Vadim Lobanov [Fri, 23 Jun 2006 09:05:16 +0000 (02:05 -0700)]
[PATCH] Poll cleanups/microoptimizations
The "count" and "pt" variables are declared and modified by do_poll(), as
well as accessed and written indirectly in the do_pollfd() subroutine.
This patch pulls all handling of these variables into the do_poll()
function, thereby eliminating the odd use of indirection in do_pollfd().
This is done by pulling the "struct pollfd" traversal loop from do_pollfd()
into its only caller do_poll(). As an added bonus, the patch saves a few
clock cycles, and also adds comments to make the code easier to follow.
It's wasn't referenced in Makefile since at least 2.2.8, unbuildable due to
trivial typos and things like DATA_LATCH and arc_write_control() which
doesn't exist.
Adrian Bunk:
adapted the patch to unrelated context changes
Signed-off-by: Domen Puncer <domen@coderock.org> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Miklos Szeredi [Fri, 23 Jun 2006 09:05:12 +0000 (02:05 -0700)]
[PATCH] vfs: add lock owner argument to flush operation
Pass the POSIX lock owner ID to the flush operation.
This is useful for filesystems which don't want to store any locking state
in inode->i_flock but want to handle locking/unlocking POSIX locks
internally. FUSE is one such filesystem but I think it possible that some
network filesystems would need this also.
Also add a flag to indicate that a POSIX locking request was generated by
close(), so filesystems using the above feature won't send an extra locking
request in this case.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
posix_lock_file() was too cautious, failing operations on OOM, even if they
didn't actually require an allocation.
This has the disadvantage, that a failing unlock on process exit could lead to
a memory leak. There are two possibilites for this:
- filesystem implements .lock() and calls back to posix_lock_file(). On
cleanup of files_struct locks_remove_posix() is called which should remove all
locks belonging to files_struct. However if filesystem calls
posix_lock_file() which fails, then those locks will never be freed.
- if a file is closed while a lock is blocked, then after acquiring
fcntl_setlk() will undo the lock. But this unlock itself might fail on OOM,
again possibly leaking the lock.
The solution is to move the checking of the allocations until after it is sure
that they will be needed. This will solve the above problem since unlock will
always succeed unless it splits an existing region.
Pekka Enberg [Fri, 23 Jun 2006 09:05:08 +0000 (02:05 -0700)]
[PATCH] read_mapping_page for address space
Add read_mapping_page() which is used for callers that pass
mapping->a_ops->readpage as the filler for read_cache_page. This removes
some duplication from filesystem code.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Jeff Moyer [Fri, 23 Jun 2006 09:05:07 +0000 (02:05 -0700)]
[PATCH] Add a sysfs file to determine if a kexec kernel is loaded
Create two files in /sys/kernel, kexec_loaded and kexec_crash_loaded. Each
file contains a simple boolean value indicating whether the relevant kernel
has been loaded into memory. The motivation for this is geared around
support.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Michael Holzheu [Fri, 23 Jun 2006 09:05:06 +0000 (02:05 -0700)]
[PATCH] s390_hypfs filesystem
On zSeries machines there exists an interface which allows the operating
system to retrieve LPAR hypervisor accounting data. For example, it is
possible to get usage data for physical and virtual cpus. In order to
provide this information to user space programs, I implemented a new
virtual Linux file system named 's390_hypfs' using the Linux 2.6 libfs
framework. The name 's390_hypfs' stands for 'S390 Hypervisor Filesystem'.
All the accounting information is put into different virtual files which
can be accessed from user space. All data is represented as ASCII strings.
When the file system is mounted the accounting information is retrieved and
a file system tree is created with the attribute files containing the cpu
information. The content of the files remains unchanged until a new update
is made. An update can be triggered from user space through writing
'something' into a special purpose update file.
We create the following directory structure:
<mount-point>/
update
cpus/
<cpu-id>
type
mgmtime
<cpu-id>
...
hyp/
type
systems/
<lpar-name>
cpus/
<cpu-id>
type
mgmtime
cputime
onlinetime
<cpu-id>
...
<lpar-name>
cpus/
...
- update: File to trigger update
- cpus/: Directory for all physical cpus
- cpus/<cpu-id>/: Directory for one physical cpu.
- cpus/<cpu-id>/type: Type name of physical zSeries cpu.
- cpus/<cpu-id>/mgmtime: Physical-LPAR-management time in microseconds.
- hyp/: Directory for hypervisor information
- hyp/type: Typ of hypervisor (currently only 'LPAR Hypervisor')
- systems/: Directory for all LPARs
- systems/<lpar-name>/: Directory for one LPAR.
- systems/<lpar-name>/cpus/<cpu-id>/: Directory for the virtual cpus
- systems/<lpar-name>/cpus/<cpu-id>/type: Typ of cpu.
- systems/<lpar-name>/cpus/<cpu-id>/mgmtime:
Accumulated number of microseconds during which a physical
CPU was assigned to the logical cpu and the cpu time was
consumed by the hypervisor and was not provided to
the LPAR (LPAR overhead).
- systems/<lpar-name>/cpus/<cpu-id>/cputime:
Accumulated number of microseconds during which a physical CPU
was assigned to the logical cpu and the cpu time was consumed
by the LPAR.
- systems/<lpar-name>/cpus/<cpu-id>/onlinetime:
Accumulated number of microseconds during which the logical CPU
has been online.
As mount point for the filesystem /sys/hypervisor/s390 is created.
The update process is triggered when writing 'something' into the
'update' file at the top level hypfs directory. You can do this e.g.
with 'echo 1 > update'. During the update the whole directory structure
is deleted and built up again.
Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Ingo Oeser <ioe-lkml@rameria.de> Cc: Joern Engel <joern@wohnheim.fh-wedel.de> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Michael Holzheu <holzheu@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Roman Zippel [Fri, 23 Jun 2006 09:05:00 +0000 (02:05 -0700)]
[PATCH] m68k: clean up uaccess.h
This uninlines a few large functions in uaccess.h and cleans up the rest.
It includes a (hopefully temporary) workaround for the broken typeof of
gcc-4.1.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Finn Thain [Fri, 23 Jun 2006 09:04:59 +0000 (02:04 -0700)]
[PATCH] m68k: m68k mac VIA2 fixes and cleanups
Some fixes and cleanups from the linux-mac68k repo. Fix mac_esp by clearing
the VIA2 SCSI IRQ flag before the SCSI IRQ handler is invoked. Also fix a
race condition caused by unmasking a nubus slot IRQ then setting the relevant
nubus_active bit.
Signed-off-by: Finn Thain <fthain@telegraphics.com.au> Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Roman Zippel [Fri, 23 Jun 2006 09:04:56 +0000 (02:04 -0700)]
[PATCH] m68k: restore amikbd compatibility with 2.4
Dump the extra mapping in the amikbd interrupt handler, so old Amiga keymaps
work again. Amigas need a special keymap anyway, standard keymaps are not
usable and recreating all keymaps is simply not worth the trouble.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: Dmitry Torokhov <dtor_core@ameritech.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Roman Zippel [Fri, 23 Jun 2006 09:04:51 +0000 (02:04 -0700)]
[PATCH] m68k: completely initialize hw_regs_t in ide_setup_ports
ide_setup_ports does not completely initialize the hw_regs_t structure which
can cause random failures, as the structure is often on the stack. None of
the callers expect a partially initialized structure, i.e. none of them do
any setup of their own before calling ide_setup_ports().
Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: Bartlomiej Zolnierkiewicz <B.Zolnierkiewicz@elka.pw.edu.pl> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Shaohua Li [Fri, 23 Jun 2006 09:04:50 +0000 (02:04 -0700)]
[PATCH] move do_suspend_lowlevel to correct segment
Move do_suspend_lowlevel to correct segment. If it is in the same hugepage
with ro data, mark_rodata_ro will make it unexecutable.
Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: Len Brown <len.brown@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make swsusp allocate only as much memory as needed to store the image data
and metadata during resume.
Without this patch swsusp additionally allocates many page frames that will
conflict with the "original" locations of the image data and are considered
as "unsafe", treating them as "eaten" pages (ie. allocated but unusable).
The patch makes swsusp allocate as many pages as it'll need to store the
data read from the image in one shot, creating a list of allocated "safe"
pages, and use the observation that all pages allocated by it are marked
with the PG_nosave and PG_nosave_free flags set. Â Namely, when it's about
to load an image page, swsusp can check whether the page frame
corresponding to the "original" location of this page has been allocated
(ie. if the page frame has the PG_nosave and PG_nosave_free flags set) and
if so, it can load the page directly into this page frame. Â Otherwise it
uses an allocated "safe" page from the list to store the data that will be
copied to their "original" location later on.
This allows us to save many page copyings and page allocations during
resume and in the future it may allow us to load images greater than 50% of
the normal zone.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: "Pavel Machek" <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
swsusp allocates memory from the normal zone, so it cannot use lowmem
reserve pages from the lower zones. Therefore it should not count these
pages as available to it.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
WARNING: vmlinux: 'enable_irq' exported twice. Previous export was in vmlinux
WARNING: vmlinux: 'disable_irq' exported twice. Previous export was in vmlinux
WARNING: vmlinux: 'disable_irq_nosync' exported twice. Previous export was in vmlinux
WARNING: vmlinux: 'probe_irq_mask' exported twice. Previous export was in vmlinux
WARNING: vmlinux: 'sys_open' exported twice. Previous export was in vmlinux
WARNING: vmlinux: 'sys_read' exported twice. Previous export was in vmlinux
WARNING: vmlinux: 'strstr' exported twice. Previous export was in vmlinux
WARNING: vmlinux: 'memscan' exported twice. Previous export was in vmlinux
WARNING: vmlinux: 'memcmp' exported twice. Previous export was in vmlinux
WARNING: vmlinux: 'strnlen' exported twice. Previous export was in vmlinux
WARNING: vmlinux: 'strncmp' exported twice. Previous export was in vmlinux
WARNING: vmlinux: 'strcmp' exported twice. Previous export was in vmlinux
Signed-off-by: Mathieu Chouquet-Stringer <mchouque@free.fr> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Chuck Ebbert [Fri, 23 Jun 2006 09:04:31 +0000 (02:04 -0700)]
[PATCH] i386: extra checks in show_registers()
Sometimes thread_info and task_struct get out-of-sync with each other.
Printing task.thread_info in show_registers() can help spot this. And when
task_struct is corrupt then task.comm can contain garbage, so only print as
many characters as it can hold.
Chuck Ebbert [Fri, 23 Jun 2006 09:04:23 +0000 (02:04 -0700)]
[PATCH] i386: let usermode execute the "enter" instruction
The i386 page fault handler does not allow enough slack when checking for
userspace access below the current stack pointer. This prevents use of the
enter instruction by user code. Fix this by allowing enough slack for
"enter $65535,$31" to execute.
Problem reported by Tomasz Malesinski <tmal@mimuw.edu.pl>
Tested using this program, based on the original from Tomasz:
Zhang Yanmin [Fri, 23 Jun 2006 09:04:22 +0000 (02:04 -0700)]
[PATCH] x86: kernel irq balance doesn't work
On i386, kernel irq balance doesn't work.
1) In function do_irq_balance, after kernel finds the min_loaded cpu but
before calling set_pending_irq to really pin the selected_irq to the
target cpu, kernel does a cpus_and with irq_affinity[selected_irq].
Later on, when the irq is acked, kernel would calls
move_native_irq=>desc->handler->set_affinity to change the irq affinity.
However, every function pointed by
hw_interrupt_type->set_affinity(unsigned int irq, cpumask_t cpumask)
always changes irq_affinity[irq] to cpumask. Next time when recalling
do_irq_balance, it has to do cpu_ands again with
irq_affinity[selected_irq], but irq_affinity[selected_irq] already
becomes one cpu selected by the first irq balance.
2) Function balance_irq in file arch/i386/kernel/io_apic.c has the same
issue.
Due to "end_of_stack_stop_unwind_function" recursing back to itself in the
EBP stackframe-walker. So avoid this type of recursion when walking the
stack .
Jan Beulich [Fri, 23 Jun 2006 09:04:19 +0000 (02:04 -0700)]
[PATCH] fix x86 microcode driver handling of multiple matching revisions
When multiple updates matching a given CPU are found in the update file, the
action taken by the microcode update driver was inappropriate:
- when lower revision microcode was found before matching or higher revision
one, the driver would needlessly complain that it would not downgrade the
CPU
- when microcode matching the currently installed revision was found before
newer revision code, no update would actually take place
To change this behavior, the driver now concludes about possibly updates and
issues messages only when the entire input was parsed.
Additionally, this adds back (in different places, and conditionalized upon
a new module option) some messages removed by a previous patch.
Andreas Mohr [Fri, 23 Jun 2006 09:04:17 +0000 (02:04 -0700)]
[PATCH] i386 apm.c optimization
- avoid expensive modulo (integer division) which happened
since APM_MAX_EVENTS is 20 (non-power-of-2)
- kill compiler warnings by initializing two variables
- add __read_mostly to some important static variables that are read often
(by idle loop etc.)
- constify several structures
Signed-off-by: Andreas Mohr <andi@lisas.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
L3 cache miss reduction of __copy_from_user_ll
samples %
37408 65.1412 vmlinux __copy_from_user_ll
23 0.1103 vmlinux __copy_user_zeroing_intel_nocache
23/37408=0.061% (99.94% reduction)
Top 5 of 2.6.12.4.nt
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
samples % app name symbol name
128392 8.0274 vmlinux __copy_user_zeroing_intel_nocache
64206 4.0143 vmlinux journal_add_journal_head
59746 3.7355 vmlinux do_get_write_access
47674 2.9807 vmlinux journal_put_journal_head
46021 2.8774 vmlinux journal_dirty_metadata
pattern9-0-cpu4-0-09011728/summary.out
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x3f (multiple flags) count 3000
samples % app name symbol name
69755 4.2861 vmlinux __copy_user_zeroing_intel_nocache
55685 3.4215 vmlinux journal_add_journal_head
52371 3.2179 vmlinux __find_get_block
45504 2.7960 vmlinux journal_put_journal_head
36005 2.2123 vmlinux journal_stop
pattern9-0-cpu4-0-09011744/summary.out
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples % app name symbol name
1147 5.4994 vmlinux journal_add_journal_head
881 4.2240 vmlinux journal_dirty_data
872 4.1809 vmlinux blk_rq_map_sg
734 3.5192 vmlinux journal_commit_transaction
617 2.9582 vmlinux radix_tree_delete
pattern9-0-cpu4-0-09011731/summary.out
iozone results are
original 2.6.12.4 CPU time = 207.768 sec
cache aware CPU time = 184.783 sec
(three times run)
184.783/207.768=88.94% (11.06% reduction)
original:
pattern9-0-cpu4-0-08191720/iozone.out: CPU Utilization: Wall time 45.997 CPU time 64.527 CPU utilization 140.28 %
pattern9-0-cpu4-0-08191741/iozone.out: CPU Utilization: Wall time 46.878 CPU time 71.933 CPU utilization 153.45 %
pattern9-0-cpu4-0-08191743/iozone.out: CPU Utilization: Wall time 45.152 CPU time 71.308 CPU utilization 157.93 %
cache awre:
pattern9-0-cpu4-0-09011728/iozone.out: CPU Utilization: Wall time 44.842 CPU time 62.465 CPU utilization 139.30 %
pattern9-0-cpu4-0-09011731/iozone.out: CPU Utilization: Wall time 44.718 CPU time 59.273 CPU utilization 132.55 %
pattern9-0-cpu4-0-09011744/iozone.out: CPU Utilization: Wall time 44.367 CPU time 63.045 CPU utilization 142.10 %
Sergei Shtylyov [Fri, 23 Jun 2006 09:04:13 +0000 (02:04 -0700)]
[PATCH] Au1550/1200: add missing PSC #define's, make OSS driver use the proper ones
Add missing PSC #define's required for the drivers using PSC on DBAu1550
board (also fixing Au1550 PSC3 address) and all Au1200-based boards as
well. Make the OSS driver use the correct PSC definitions fo each board.
Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
David Quigley [Fri, 23 Jun 2006 09:04:01 +0000 (02:04 -0700)]
[PATCH] SELinux: add task_movememory hook
This patch adds new security hook, task_movememory, to be called when memory
owened by a task is to be moved (e.g. when migrating pages to a this hook is
identical to the setscheduler implementation, but a separate hook introduced
to allow this check to be specialized in the future if necessary.
Since the last posting, the hook has been renamed following feedback from
Christoph Lameter.
Signed-off-by: David Quigley <dpquigl@tycho.nsa.gov> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: James Morris <jmorris@namei.org> Cc: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Acked-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
David Quigley [Fri, 23 Jun 2006 09:04:00 +0000 (02:04 -0700)]
[PATCH] SELinux: add security hook call to mediate attach_task (kernel/cpuset.c)
Add a security hook call to enable security modules to control the ability
to attach a task to a cpuset. While limited control over this operation is
possible via permission checks on the pseudo fs interface, those checks are
not sufficient to control access to the target task, which is looked up in
this function. The existing task_setscheduler hook is re-used for this
operation since this falls under the same class of operations.
Signed-off-by: David Quigley <dpquigl@tycho.nsa.gov> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: James Morris <jmorris@namei.org> Acked-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
David Quigley [Fri, 23 Jun 2006 09:03:59 +0000 (02:03 -0700)]
[PATCH] SELinux: add security hooks to {get,set}affinity
This patch adds LSM hooks into the setaffinity and getaffinity functions to
enable security modules to control these operations between tasks with
task_setscheduler and task_getscheduler LSM hooks.
Signed-off-by: David Quigley <dpquigl@tycho.nsa.gov> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
James Morris [Fri, 23 Jun 2006 09:03:58 +0000 (02:03 -0700)]
[PATCH] lsm: add task_setioprio hook
Implement an LSM hook for setting a task's IO priority, similar to the hook
for setting a tasks's nice value.
A previous version of this LSM hook was included in an older version of
multiadm by Jan Engelhardt, although I don't recall it being submitted
upstream.
Also included is the corresponding SELinux hook, which re-uses the setsched
permission in the proccess class.
Signed-off-by: James Morris <jmorris@namei.org> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Cc: Jan Engelhardt <jengelh@linux01.gwdg.de> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Jens Axboe <axboe@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[PATCH] move_pages: fix 32 -> 64 bit compat function
The definition of the third parameter is a pointer to an array of virtual
addresses which give us some trouble. The existing code calculated the
wrong address in the array since I used void to avoid having to specify a
type.
I now use the correct type "compat_uptr_t __user *" in the definition of
the function in kernel/compat.c.
However, I used __u32 in syscalls.h. Would have to include compat.h there
in order to provide the same definition which would generate an ugly
include situation.
On both ia64 and x86_64 compat_uptr_t is u32. So this works although
parameter declarations differ.
Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[PATCH] sys_move_pages: 32bit support (i386, x86_64)
sys_move_pages() support for 32bit (i386 plus x86_64 compat layer)
Add support for move_pages() on i386 and also add the compat functions
necessary to run 32 bit binaries on x86_64.
Add compat_sys_move_pages to the x86_64 32bit binary layer. Note that it is
not up to date so I added the missing pieces. Not sure if this is done the
right way.
[akpm@osdl.org: compile fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[PATCH] page migration: sys_move_pages(): support moving of individual pages
move_pages() is used to move individual pages of a process. The function can
be used to determine the location of pages and to move them onto the desired
node. move_pages() returns status information for each page.
long move_pages(pid, number_of_pages_to_move,
addresses_of_pages[],
nodes[] or NULL,
status[],
flags);
The addresses of pages is an array of void * pointing to the
pages to be moved.
The nodes array contains the node numbers that the pages should be moved
to. If a NULL is passed instead of an array then no pages are moved but
the status array is updated. The status request may be used to determine
the page state before issuing another move_pages() to move pages.
The status array will contain the state of all individual page migration
attempts when the function terminates. The status array is only valid if
move_pages() completed successfullly.
Possible page states in status[]:
0..MAX_NUMNODES The page is now on the indicated node.
-ENOENT Page is not present
-EACCES Page is mapped by multiple processes and can only
be moved if MPOL_MF_MOVE_ALL is specified.
-EPERM The page has been mlocked by a process/driver and
cannot be moved.
-EBUSY Page is busy and cannot be moved. Try again later.
-EFAULT Invalid address (no VMA or zero page).
-ENOMEM Unable to allocate memory on target node.
-EIO Unable to write back page. The page must be written
back in order to move it since the page is dirty and the
filesystem does not provide a migration function that
would allow the moving of dirty pages.
-EINVAL A dirty page cannot be moved. The filesystem does not provide
a migration function and has no ability to write back pages.
The flags parameter indicates what types of pages to move:
MPOL_MF_MOVE Move pages that are only mapped by the process.
MPOL_MF_MOVE_ALL Also move pages that are mapped by multiple processes.
Requires sufficient capabilities.
Possible return codes from move_pages()
-ENOENT No pages found that would require moving. All pages
are either already on the target node, not present, had an
invalid address or could not be moved because they were
mapped by multiple processes.
-EINVAL Flags other than MPOL_MF_MOVE(_ALL) specified or an attempt
to migrate pages in a kernel thread.
-EPERM MPOL_MF_MOVE_ALL specified without sufficient priviledges.
or an attempt to move a process belonging to another user.
-EACCES One of the target nodes is not allowed by the current cpuset.
-ENODEV One of the target nodes is not online.
-ESRCH Process does not exist.
-E2BIG Too many pages to move.
-ENOMEM Not enough memory to allocate control array.
-EFAULT Parameters could not be accessed.
A test program for move_pages() may be found with the patches
on ftp.kernel.org:/pub/linux/kernel/people/christoph/pmig/patches-2.6.17-rc4-mm3
From: Christoph Lameter <clameter@sgi.com>
Detailed results for sys_move_pages()
Pass a pointer to an integer to get_new_page() that may be used to
indicate where the completion status of a migration operation should be
placed. This allows sys_move_pags() to report back exactly what happened to
each page.
Wish there would be a better way to do this. Looks a bit hacky.
Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Jes Sorensen <jes@trained-monkey.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Andi Kleen <ak@muc.de> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[PATCH] page migration: use allocator function for migrate_pages()
Instead of passing a list of new pages, pass a function to allocate a new
page. This allows the correct placement of MPOL_INTERLEAVE pages during page
migration. It also further simplifies the callers of migrate pages.
migrate_pages() becomes similar to migrate_pages_to() so drop
migrate_pages_to(). The batching of new page allocations becomes unnecessary.
[PATCH] page migration: handle freeing of pages in migrate_pages()
Do not leave pages on the lists passed to migrate_pages(). Seems that we will
not need any postprocessing of pages. This will simplify the handling of
pages by the callers of migrate_pages().
Currently migrate_pages() is mess with lots of goto. Extract two functions
from migrate_pages() and get rid of the gotos.
Plus we can just unconditionally set the locked bit on the new page since we
are the only one holding a reference. Locking is to stop others from
accessing the page once we establish references to the new page.
Remove the list_del from move_to_lru in order to have finer control over list
processing.
Kirill Korotaev [Fri, 23 Jun 2006 09:03:50 +0000 (02:03 -0700)]
[PATCH] printk() should not be called under zone->lock
This patch fixes printk() under zone->lock in show_free_areas(). It can be
unsafe to call printk() under this lock, since caller can try to
allocate/free some memory and selfdeadlock on this lock. I found
allocations/freeing mem both in netconsole and serial console.
This issue was faced in reallity when meminfo was periodically printed for
debug purposes and netconsole was used.
Andrew Morton [Fri, 23 Jun 2006 09:03:47 +0000 (02:03 -0700)]
[PATCH] initialise total_memory() earlier
Initialise total_memory earlier in boot. Because if for some reason we run
page reclaim early in boot, we don't want total_memory to be zero when we use
it as a divisor.
And rename total_memory to vm_total_pages to avoid naming clashes with
architectures.
Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Martin Bligh <mbligh@google.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Ingo Molnar [Fri, 23 Jun 2006 09:03:46 +0000 (02:03 -0700)]
[PATCH] mm/slab.c: fix early init assumption
The SLAB bootstrap code assumes that the first two kmalloc caches created
(the INDEX_AC and INDEX_L3 kmalloc caches) wont be off-slab. But due to AC
and L3 structure size increase in lockdep, one of them ended up being
off-slab, and subsequently crashing with:
The fix is to introduce a bootstrap flag and to use it to prevent off-slab
caches being created so early during bootup.
(The calculation for off-slab caches is quite complex so i didnt want to
complicate things with introducing yet another INDEX_ calculation, the flag
approach is simpler and smaller.)
Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Hugh Dickins [Fri, 23 Jun 2006 09:03:45 +0000 (02:03 -0700)]
[PATCH] fix update_mmu_cache in fremap.c
There are two calls to update_mmu_cache in fremap.c, both defective.
The one in install_page needs to be accompanied by lazy_mmu_prot_update
(some other cleanup time, move that into ia64 update_mmu_cache itself); and
the one in install_file_pte should be removed since the pte is not present.
Hugh Dickins [Fri, 23 Jun 2006 09:03:44 +0000 (02:03 -0700)]
[PATCH] swapoff: use atomic_inc_not_zero() on mm_users
Now that we have atomic_inc_not_zero, it's more elegant for try_to_unuse to
use that on mm_users: doesn't actually matter at present, but safer to be
sure that once mm_users has gone to 0, nothing raises it for an instant.