[PATCH] inode-diet: Eliminate i_blksize from the inode structure
This eliminates the i_blksize field from struct inode. Filesystems that want
to provide a per-inode st_blksize can do so by providing their own getattr
routine instead of using the generic_fillattr() function.
Note that some filesystems were providing pretty much random (and incorrect)
values for i_blksize.
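A filesystem-specific getattr routine of the kind described above might look roughly like the following. This is a userspace mock, not kernel code: struct mock_inode, struct mock_kstat, mock_generic_fillattr() and myfs_getattr() are hypothetical stand-ins, and the 4096-byte generic default is an assumption made only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; not the real kernel structures. */
struct mock_inode {
    unsigned long i_size;
    unsigned long fs_blksize;   /* filesystem-private preferred I/O size */
};

struct mock_kstat {
    unsigned long size;
    unsigned long blksize;
};

#define MOCK_DEFAULT_BLKSIZE 4096UL   /* assumed generic default */

/* Generic helper: has no per-inode block size to report any more. */
static void mock_generic_fillattr(const struct mock_inode *inode,
                                  struct mock_kstat *stat)
{
    stat->size = inode->i_size;
    stat->blksize = MOCK_DEFAULT_BLKSIZE;
}

/* Filesystem-specific getattr: start from the generic values, then
 * override st_blksize with the value this filesystem prefers. */
static int myfs_getattr(const struct mock_inode *inode,
                        struct mock_kstat *stat)
{
    assert(inode != NULL && stat != NULL);
    mock_generic_fillattr(inode, stat);
    stat->blksize = inode->fs_blksize;
    return 0;
}
```

The design point is that the block size lives in filesystem-private data rather than in every in-core inode.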
[PATCH] inode_diet: Replace inode.u.generic_ip with inode.i_private
The following patches reduce the size of the VFS inode structure by 28 bytes
on a UP x86. (It would be more on an x86_64 system). This is a 10% reduction
in the inode size on a UP kernel that is configured in a production mode
(i.e., with no spinlock or other debugging functions enabled; if you want to
save memory taken up by in-core inodes, the first thing you should do is
disable the debugging options; they are responsible for a huge amount of bloat
in the VFS inode structure).
This patch:
The filesystem or device-specific pointer in the inode is inside a union,
which is pretty pointless given that all 30+ users of this field have been
using the void pointer. Get rid of the union and rename it to i_private, with
a comment to explain who is allowed to use the void pointer. This is just a
cleanup, but it allows us to reuse the union 'u' for something where
the union will actually be used.
[PATCH] kdump: introduce "reset_devices" command line option
Resetting the devices during driver initialization can be a costly
operation in terms of time (especially for SCSI devices). Drivers can use
this option to learn that the user wants to force the devices to be reset
during initialization.
This option can be useful when the kernel is booting in an unreliable
environment, for example during a kdump boot, where devices are in an
unknown random state and BIOS execution has been skipped.
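The driver-side pattern described above can be sketched in userspace as follows. All names here (mock_reset_devices, mock_parse_cmdline(), mock_driver_init()) are hypothetical, and the command-line parsing is a simplification rather than the kernel's real parameter handling.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Mock of the flag that the "reset_devices" option would set. */
static int mock_reset_devices;

/* Set the flag if "reset_devices" appears as a whole word on the
 * space-separated command line. */
static void mock_parse_cmdline(const char *cmdline)
{
    static const char opt[] = "reset_devices";
    const size_t len = sizeof(opt) - 1;
    const char *p = cmdline;

    assert(cmdline != NULL);
    mock_reset_devices = 0;
    while ((p = strstr(p, opt)) != NULL) {
        int starts_word = (p == cmdline) || (p[-1] == ' ');
        int ends_word = (p[len] == '\0') || (p[len] == ' ');

        if (starts_word && ends_word) {
            mock_reset_devices = 1;
            return;
        }
        p += len;
    }
}

/* A driver only pays for the slow reset when the user asked for it. */
static int mock_driver_init(int *did_reset)
{
    *did_reset = 0;
    if (mock_reset_devices)
        *did_reset = 1;   /* the costly hardware reset would go here */
    return 0;
}
```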
Ian Kent [Wed, 27 Sep 2006 08:50:44 +0000 (01:50 -0700)]
[PATCH] autofs4 needs to force fail return revalidate
For a long time now I have had a problem with not being able to return a
lookup failure on an existing directory. In autofs this corresponds to a
mount failure on an autofs managed mount entry that is browsable (and so the
mount point directory exists).
While this problem has been present for a long time I've avoided resolving
it because it was not very visible. But now that autofs v5 has "mount and
expire on demand" of nested multiple mounts, such as is found when mounting
an export list from a server, solving the problem cannot be avoided any
longer.
I've tried very hard to find a way to do this entirely within the autofs4
module but have not been able to find a satisfactory way to achieve it.
So, I need to propose a change to the VFS.
Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Jeff Dike [Wed, 27 Sep 2006 08:50:42 +0000 (01:50 -0700)]
[PATCH] uml: fix sleep length bug
um_timer shouldn't add local_offset to the host time since get_time already
did it. This threw off sleep when a settimeofday or equivalent had happened.
Signed-off-by: Jeff Dike <jdike@addtoit.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Jeff Dike [Wed, 27 Sep 2006 08:50:40 +0000 (01:50 -0700)]
[PATCH] uml: thread creation tidying
fork on UML has always been somewhat subtle. The underlying cause has been the
need to initialize a stack for the new process. The only portable way to
initialize a new stack is to set it as the alternate signal stack and take a
signal. The signal handler does whatever initialization is needed and jumps
back to the original stack, where the fork processing is finished. The basic
context switching mechanism is a jmp_buf for each process. You switch to a
new process by longjmping to its jmp_buf.
Now that UML has its own implementation of setjmp and longjmp, and I can poke
around inside a jmp_buf without fear that libc will change the structure, a
much simpler mechanism is possible. The jmp_buf can simply be initialized by
hand.
This eliminates:
- the need to set up and remove the alternate signal stack
- sending and handling a signal
- the signal blocking needed around the stack switching, since
  there is no stack switching
- setting up the jmp_buf needed to jump back to the original
  stack after the new one is set up
In addition, since jmp_buf is now defined by UML, and not by libc, it can be
embedded in the thread struct. This makes it unnecessary to have it exist on
the stack, where it used to be. It also simplifies interfaces, since the
switch jmp_buf used to be a void * inside the thread struct, and functions
which took it as an argument needed to define a jmp_buf variable and assign it
from the void *.
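The interface simplification from embedding the jmp_buf can be illustrated with standard setjmp/longjmp. This is only a sketch: struct mock_thread, resume() and run_once() are hypothetical names, and it deliberately does not attempt the hand-initialization of a jmp_buf, which depends on the layout of a particular setjmp implementation.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical thread structure with the jmp_buf embedded directly,
 * instead of a void * that callers must copy into a local jmp_buf. */
struct mock_thread {
    int resumed;
    jmp_buf switch_buf;
};

/* Switch to the saved context; callers just pass the thread. */
static void resume(struct mock_thread *t)
{
    t->resumed = 1;
    longjmp(t->switch_buf, 1);
}

/* Save a context, jump to it once, and report whether we came back
 * through longjmp. The function calling setjmp is still active when
 * longjmp runs, as standard C requires. */
static int run_once(struct mock_thread *t)
{
    assert(t != NULL);
    if (setjmp(t->switch_buf) == 0) {
        t->resumed = 0;
        resume(t);   /* does not return; control reappears at setjmp */
    }
    return t->resumed;
}
```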
Signed-off-by: Jeff Dike <jdike@addtoit.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Jeff Dike [Wed, 27 Sep 2006 08:50:38 +0000 (01:50 -0700)]
[PATCH] uml: mark some tt-mode code
Mark a symbol and file as being tt-mode only. This shrinks the binary
slightly when tt mode support is compiled out and makes it easier to identify
stuff when tt mode is removed.
Signed-off-by: Jeff Dike <jdike@addtoit.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Jeff Dike [Wed, 27 Sep 2006 08:50:37 +0000 (01:50 -0700)]
[PATCH] uml: add checkstack support
Make checkstack work for UML. We need to pass the underlying architecture
name, rather than "um" to checkstack.pl.
Signed-off-by: Jeff Dike <jdike@addtoit.com> Acked-by: Matt Mackall <mpm@selenic.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Jeff Dike [Wed, 27 Sep 2006 08:50:37 +0000 (01:50 -0700)]
[PATCH] uml: use correct SIGBUS handler
BB noticed that we had the wrong bus error handler.
Signed-off-by: Jeff Dike <jdike@addtoit.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Jeff Dike [Wed, 27 Sep 2006 08:50:36 +0000 (01:50 -0700)]
[PATCH] uml: fix gcov support
Make __bb_init_func weak in order to avoid a link failure with some libcs
and/or gccs.
Signed-off-by: Jeff Dike <jdike@addtoit.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The UML/x86_64 headers were missing ptrace support for some segment registers.
The underlying problem was that the x86_64 kernel uses user_regs_struct
rather than the ptrace register definitions in ptrace. This patch switches
UML/x86_64 to using user_regs_struct for its definitions of the host's
registers.
Signed-off-by: Jeff Dike <jdike@addtoit.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Jeff Dike [Wed, 27 Sep 2006 08:50:34 +0000 (01:50 -0700)]
[PATCH] uml: get rid of ZONE_DMA use
ZONE_DMA might become dependent on CONFIG_ZONE_DMA, which UML doesn't define
(we're still arguing about this). So, let's change ZONE_DMA to ZONE_NORMAL.
This is prompted by optional-zone_dma-in-the-vm.patch, but should be harmless
on its own.
Signed-off-by: Jeff Dike <jdike@addtoit.com> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Jeff Dike [Wed, 27 Sep 2006 08:50:33 +0000 (01:50 -0700)]
[PATCH] uml: const more data
Make lots of structures const in order to make it obvious that they need no
locking.
Signed-off-by: Jeff Dike <jdike@addtoit.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This spinlock can be taken on interrupt too, so spin_lock_irq[save] must be
used.
However, Documentation/networking/netdevices.txt explains we are called with
rtnl_lock() held - so we don't need to care about other concurrent opens.
Verified also in LDD3 and by direct checking. Also verified that the network
layer (through a state machine) guarantees us that nobody will close the
interface while it's being used. Please correct me if I'm wrong.
Also, we must check we don't sleep with irqs disabled!!! But anyway, this is
not news - we already can't sleep while holding a spinlock. Who says this is
really guaranteed by the present code?
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Cc: Jeff Dike <jdike@addtoit.com> Cc: Jeff Garzik <jeff@garzik.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We have never used this flag and recently one user hit a warning about this
(there was a symbol in the positive half of the address space IIRC). So fix
it.
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Cc: Jeff Dike <jdike@addtoit.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
David Howells [Wed, 27 Sep 2006 08:50:23 +0000 (01:50 -0700)]
[PATCH] NOMMU: move the fallback arch_vma_name() to a sensible place
Move the fallback arch_vma_name() to a sensible place (kernel/signal.c).
Currently it's in fs/proc/task_mmu.c, a file that is dependent on both
CONFIG_PROC_FS and CONFIG_MMU being enabled, but it's called unconditionally
from kernel/signal.c.
[akpm@osdl.org: build fix] Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The first program will set up a SYSV IPC SHM segment and wait on a futex in it
for the number at the start to change. The second program will increment that number
and wake the first program up. This leads to output of the form:
David Howells [Wed, 27 Sep 2006 08:50:21 +0000 (01:50 -0700)]
[PATCH] NOMMU: Make mremap() partially work for NOMMU kernels
Make mremap() partially work for NOMMU kernels. It may resize a VMA provided
that the new size doesn't exceed the size of the slab object in which the
storage that the VMA refers to is allocated. Shareable VMAs may not be resized.
Moving VMAs (as permitted by MREMAP_MAYMOVE) is not currently supported.
This patch also makes use of the fact that the VMA list is now ordered to cut
it short when possible.
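The resize rules above can be sketched as a userspace mock. struct mock_vma, its backing_size field and mock_nommu_mremap() are hypothetical, and the specific error codes are assumptions chosen for illustration rather than taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical VMA: backing_size stands in for the size of the slab
 * object that holds the mapping's storage. */
struct mock_vma {
    unsigned long start;
    unsigned long end;
    int shareable;
    size_t backing_size;
};

/* Resize in place only; moving the mapping is not supported. */
static long mock_nommu_mremap(struct mock_vma *vma, unsigned long addr,
                              size_t old_len, size_t new_len)
{
    assert(vma != NULL);
    if (new_len == 0)
        return -EINVAL;
    if (vma->start != addr || vma->end != addr + old_len)
        return -EINVAL;          /* must name the whole VMA */
    if (vma->shareable)
        return -EPERM;           /* shareable VMAs may not be resized */
    if (new_len > vma->backing_size)
        return -ENOMEM;          /* would outgrow the slab object */

    vma->end = vma->start + new_len;
    return (long)vma->start;     /* address is unchanged */
}
```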
Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
David Howells [Wed, 27 Sep 2006 08:50:19 +0000 (01:50 -0700)]
[PATCH] NOMMU: Permit ptrace to ignore non-PROT_WRITE VMAs in NOMMU mode
Permit ptrace to modify a section that's non-shared but is marked
unwritable, such as is obtained by mapping the text segment of an ELF-FDPIC
executable binary into a binary that's being ptraced[*].
[*] Under NOMMU conditions ptrace causes read-only MAP_PRIVATE mmaps to become
totally private copies because if a private mapping was actually shared
then the debugger setting breakpoints in it would potentially crash
other processes.
This is done by using the VM_MAYWRITE flag rather than the VM_WRITE flag
when deciding whether to permit a write.
Without this patch a debugger can't set breakpoints in the mapped text
sections of executables that are mapped read-only private, even if the
mmap() syscall has taken a private copy because PT_PTRACED is set.
In addition, VM_MAYREAD is used instead of VM_READ for similar reasons.
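The difference between the two checks can be shown with a small standalone sketch. The flag values below are meant to mirror the kernel's VM_* bits but should be treated as illustrative, and old_access_ok() and new_access_ok() are hypothetical helpers, not functions from the patch.

```c
#include <assert.h>

/* Illustrative flag bits. */
#define VM_READ      0x01UL
#define VM_WRITE     0x02UL
#define VM_MAYREAD   0x10UL
#define VM_MAYWRITE  0x20UL

/* Old behaviour: test the current protection of the mapping. */
static int old_access_ok(unsigned long vm_flags, int write)
{
    return (vm_flags & (write ? VM_WRITE : VM_READ)) != 0;
}

/* New behaviour: test whether the protection could be granted. */
static int new_access_ok(unsigned long vm_flags, int write)
{
    return (vm_flags & (write ? VM_MAYWRITE : VM_MAYREAD)) != 0;
}
```

A read-only private text mapping typically has VM_READ set and VM_WRITE clear, but VM_MAYWRITE set, so only the new check lets a debugger write a breakpoint into it.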
Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Sonic Zhang [Wed, 27 Sep 2006 08:50:17 +0000 (01:50 -0700)]
[PATCH] Check if start address is in vma region in NOMMU function get_user_pages()
On NOMMU architectures, running "cat /proc/self/mem" reads data starting at
physical address 0. This differs from MMU architectures: on IA32, the message
"cat: /proc/self/mem: Input/output error" is reported instead.
The root cause is that the NOMMU version of get_user_pages() does not
validate the start address. This patch fixes that.
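The missing validation amounts to checking that the start address falls inside some VMA before handing back pages. A minimal userspace sketch follows; mock_find_vma() and mock_get_user_pages() are hypothetical, and returning -EFAULT is an assumption about the error code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical VMA covering [start, end). */
struct mock_vma {
    unsigned long start;
    unsigned long end;
};

/* Return the VMA containing addr, or NULL if there is none. */
static const struct mock_vma *mock_find_vma(const struct mock_vma *vmas,
                                            size_t n, unsigned long addr)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (addr >= vmas[i].start && addr < vmas[i].end)
            return &vmas[i];
    return NULL;
}

/* Refuse addresses that no VMA covers instead of silently reading
 * whatever sits at that physical address. */
static long mock_get_user_pages(const struct mock_vma *vmas, size_t n,
                                unsigned long start)
{
    assert(vmas != NULL || n == 0);
    if (mock_find_vma(vmas, n, start) == NULL)
        return -EFAULT;
    return 1;   /* one page "pinned" in this mock */
}
```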
Signed-off-by: Sonic Zhang <sonic.adi@gmail.com> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
David Howells [Wed, 27 Sep 2006 08:50:16 +0000 (01:50 -0700)]
[PATCH] NOMMU: Set BDI capabilities for /dev/mem and /dev/kmem
Set the backing device info capabilities for /dev/mem and /dev/kmem to
permit direct sharing under no-MMU conditions and full mapping capabilities
under MMU conditions. Make the BDI used by these available to all directly
mappable character devices.
Also comment the capabilities for /dev/zero.
[akpm@osdl.org: ifdef reductions] Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The empty line between the short description and the first argument
description causes a section to appear twice in the generated manpage.
Also the short description should really be short: the script can't handle
multiple lines.
Signed-off-by: Rolf Eike Beer <eike-kernel@sf-tec.de> Acked-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Implement the special memory driver (mspec) based on the do_no_pfn
approach. The driver is currently used only on SN2 hardware with special
fetchop support but could be beneficial on other architectures using the
uncached mode.
Signed-off-by: Jes Sorensen <jes@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Implement do_no_pfn() for handling mapping of memory without a struct page
backing it. This avoids creating fake page table entries for regions which
are not backed by real memory.
This feature is used by the MSPEC driver and other users, where it is
highly undesirable to have a struct page sitting behind the page (for
instance if the page is accessed in cached mode via the struct page in
parallel to the driver accessing it uncached, which can result in data
corruption on some architectures, such as ia64).
This version uses specific NOPFN_{SIGBUS,OOM} return values, rather than
expecting that all negative pfn values are errors. It also BUGs on COW
mappings, as these would not work with the VM.
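The benefit of dedicated sentinel values can be sketched as below. The sentinel definitions are assumed for illustration, and enum mock_fault and classify_nopfn() are hypothetical names. The point is that a large pfn that merely looks negative when viewed as a signed value is still treated as a valid pfn.

```c
#include <assert.h>

/* Assumed sentinel values, chosen for illustration. */
#define NOPFN_SIGBUS ((unsigned long)-1)
#define NOPFN_OOM    ((unsigned long)-2)

enum mock_fault { MOCK_FAULT_OK, MOCK_FAULT_SIGBUS, MOCK_FAULT_OOM };

/* Only the two explicit sentinels are errors; everything else is a pfn. */
static enum mock_fault classify_nopfn(unsigned long pfn)
{
    if (pfn == NOPFN_SIGBUS)
        return MOCK_FAULT_SIGBUS;
    if (pfn == NOPFN_OOM)
        return MOCK_FAULT_OOM;
    return MOCK_FAULT_OK;
}
```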
Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch ensures that the slab node lists in the NUMA case only contain
slabs that belong to that specific node. All slab allocations use
GFP_THISNODE when calling into the page allocator. If an allocation fails
then we fall back in the slab allocator according to the zonelists appropriate
for a certain context.
This allows a replication of the behavior of alloc_pages() and
alloc_pages_node() in the slab layer.
Currently allocations requested from the page allocator may be redirected via
cpusets to other nodes. This results in remote pages on nodelists and that in
turn results in interrupt latency issues during cache draining. Plus the slab
is handing out memory as local when it is really remote.
Fallback for slab memory allocations will occur within the slab allocator and
not in the page allocator. This is necessary in order to be able to use the
existing pools of objects on the nodes that we fall back to before adding more
pages to a slab.
The fallback function ensures that the nodes we fall back to obey cpuset
restrictions of the current context. We do not allocate objects from outside
of the current cpuset context like before.
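The fallback order described above can be modelled very roughly in userspace. Everything here is hypothetical: struct mock_cache, the bitmask used for the cpuset and mock_slab_alloc() are simplifications and do not reflect the real slab data structures.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_MAX_NODES 8

/* Hypothetical cache: free objects already sitting on each node's lists. */
struct mock_cache {
    unsigned int free_objs[MOCK_MAX_NODES];
};

/* Try the local node first. On failure, walk the zonelist inside the slab
 * layer, skip nodes outside the allowed set, and take an existing object
 * from the first node that has one. Returns the node id used, or -1. */
static int mock_slab_alloc(struct mock_cache *c, int local_node,
                           const int *zonelist, size_t n,
                           unsigned long allowed_mask)
{
    size_t i;

    assert(c != NULL && local_node >= 0 && local_node < MOCK_MAX_NODES);
    if (c->free_objs[local_node] > 0) {
        c->free_objs[local_node]--;
        return local_node;
    }
    for (i = 0; i < n; i++) {
        int nid = zonelist[i];

        assert(nid >= 0 && nid < MOCK_MAX_NODES);
        if (!(allowed_mask & (1UL << nid)))
            continue;            /* node is outside the cpuset */
        if (c->free_objs[nid] > 0) {
            c->free_objs[nid]--;
            return nid;
        }
    }
    return -1;                   /* caller would now grow a slab */
}
```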
Note that the implementation of locality constraints within the slab allocator
requires importing logic from the page allocator. This is a mishmash that is
not that great. Other allocators (uncached allocator, vmalloc, huge pages)
face similar problems and have similar minimal reimplementations of the basic
fallback logic of the page allocator. There is another way of implementing a
slab by avoiding per node lists (see modular slab) but this won't work within
the existing slab.
V1->V2:
- Use NUMA_BUILD to avoid #ifdef CONFIG_NUMA
- Exploit GFP_THISNODE being 0 in the NON_NUMA case to avoid another
#ifdef
[akpm@osdl.org: build fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[PATCH] Add NUMA_BUILD definition in kernel.h to avoid #ifdef CONFIG_NUMA
The NUMA_BUILD constant is always available and will be set to 1 on
NUMA builds. That way checks valid only under CONFIG_NUMA can easily be done
without #ifdef CONFIG_NUMA.
For example:
if (NUMA_BUILD && <numa_condition>) {
...
}
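A compilable version of this pattern might look like the sketch below. The NUMA_BUILD definition follows the description in this message, while pick_node() is a hypothetical helper used only to show a condition the compiler can drop on non-NUMA builds.

```c
#include <assert.h>

/* CONFIG_NUMA would normally come from the kernel configuration. */
#ifdef CONFIG_NUMA
#define NUMA_BUILD 1
#else
#define NUMA_BUILD 0
#endif

/* The NUMA-only branch is ordinary C, so it is always parsed and
 * type-checked, but it is dead code when NUMA_BUILD is 0. */
static int pick_node(int preferred, int nr_nodes)
{
    assert(nr_nodes > 0);
    if (NUMA_BUILD && preferred >= 0 && preferred < nr_nodes)
        return preferred;
    return 0;
}
```

Unlike an #ifdef block, the guarded code cannot silently rot on configurations that do not enable it.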
[akpm: not a thing we'd normally do, but CONFIG_NUMA is special: it is
causing ifdef explosion in core kernel, so let's see if this is a comfortable
way in which to control that]
Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
On larger systems, the amount of output dumped on the console when you do
SysRq-M is beyond insane. This patch is trying to reduce it somewhat as
even with the smaller NUMA systems that have hit the desktop this seems to
be a fair thing to do.
The philosophy I have taken is as follows:
1) If a zone is empty, don't tell, we don't need yet another line
telling us so. The information is available since one can look up
how many zones were initialized in the first place.
2) Put as much information on a line as possible; if it can be done
in one line rather than two, then do it in one. I tried to format
the temperature stuff for easy reading.
Change show_free_areas() to not print lines for empty zones. If no zone
output is printed, the zone is empty. This reduces the number of lines
dumped to the console in sysrq on a large system by several thousand lines.
Change the zone temperature printouts to use one line per CPU instead of
two lines (one hot, one cold). On a 1024 CPU, 1024 node system, this
reduces the console output by over a million lines of output.
While this is a bigger problem on large NUMA systems, it is also applicable
to smaller desktop sized and mid range NUMA systems.
Old format:
Mem-info:
Node 0 DMA per-cpu:
cpu 0 hot: high 42, batch 7 used:24
cpu 0 cold: high 14, batch 3 used:1
cpu 1 hot: high 42, batch 7 used:34
cpu 1 cold: high 14, batch 3 used:0
cpu 2 hot: high 42, batch 7 used:0
cpu 2 cold: high 14, batch 3 used:0
cpu 3 hot: high 42, batch 7 used:0
cpu 3 cold: high 14, batch 3 used:0
cpu 4 hot: high 42, batch 7 used:0
cpu 4 cold: high 14, batch 3 used:0
cpu 5 hot: high 42, batch 7 used:0
cpu 5 cold: high 14, batch 3 used:0
cpu 6 hot: high 42, batch 7 used:0
cpu 6 cold: high 14, batch 3 used:0
cpu 7 hot: high 42, batch 7 used:0
cpu 7 cold: high 14, batch 3 used:0
Node 0 DMA32 per-cpu: empty
Node 0 Normal per-cpu: empty
Node 0 HighMem per-cpu: empty
Node 1 DMA per-cpu:
[snip]
Free pages: 5410688kB (0kB HighMem)
Active:9536 inactive:4261 dirty:6 writeback:0 unstable:0 free:338168 slab:1931 mapped:1900 pagetables:208
Node 0 DMA free:1676304kB min:3264kB low:4080kB high:4896kB active:128048kB inactive:61568kB present:1970880kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 HighMem free:0kB min:512kB low:512kB high:512kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 1 DMA free:1951728kB min:3280kB low:4096kB high:4912kB active:5632kB inactive:1504kB present:1982464kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
....
kmalloc_node() falls back to ____cache_alloc() under certain conditions and
at that point memory policies may be applied redirecting the allocation
away from the current node. Therefore kmalloc_node(...,numa_node_id()) or
kmalloc_node(...,-1) may not return memory from the local node.
Fix this by doing the policy check in __cache_alloc() instead of
____cache_alloc().
This version here is a cleanup of Kiran's patch.
- Tested on ia64.
- Extra material removed.
- Consolidate the exit path if alternate_node_alloc() returned an object.
[akpm@osdl.org: warning fix] Signed-off-by: Alok N Kataria <alok.kataria@calsoftinc.com> Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Shai Fultheim <shai@scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This moves the definition of struct page from mm.h to its own header file
page-struct.h. This is a prereq to fix SetPageUptodate which is broken on
s390:
#define SetPageUptodate(_page)                                      \
        do {                                                        \
                struct page *__page = (_page);                      \
                if (!test_and_set_bit(PG_uptodate, &__page->flags)) \
                        page_test_and_clear_dirty(_page);           \
        } while (0)
_page gets used twice in this macro which can cause subtle bugs. Using
__page for the page_test_and_clear_dirty call doesn't work since it causes
yet another problem with the page_test_and_clear_dirty macro as well.
In order to avoid all these problems caused by macros it seems to be a good
idea to get rid of them and convert them to static inline functions.
Because of header file include order it's necessary to have a separate
header file for the struct page definition.
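The double-evaluation hazard in the macro quoted above can be reproduced in userspace. struct mock_page, mock_clear_dirty(), MOCK_SET_UPTODATE() and mock_set_uptodate() are hypothetical stand-ins that copy only the shape of the original macro, not the real page flag handling.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical page with one flag bit and a counter for the callback. */
struct mock_page {
    unsigned long flags;
    int cleaned;
};

static void mock_clear_dirty(struct mock_page *page)
{
    assert(page != NULL);
    page->cleaned++;
}

/* Same shape as the macro above: the argument _page is expanded twice,
 * so an argument with side effects is evaluated twice. */
#define MOCK_SET_UPTODATE(_page)                        \
    do {                                                \
        struct mock_page *__page = (_page);             \
        if (!(__page->flags & 1UL)) {                   \
            __page->flags |= 1UL;                       \
            mock_clear_dirty(_page);                    \
        }                                               \
    } while (0)

/* Static inline replacement: the argument is evaluated exactly once. */
static inline void mock_set_uptodate(struct mock_page *page)
{
    if (!(page->flags & 1UL)) {
        page->flags |= 1UL;
        mock_clear_dirty(page);
    }
}
```

Passing an expression such as p++ to the macro advances the pointer twice and runs the callback on the wrong page, while the inline function behaves as a caller would expect.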
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Andrew Morton [Wed, 27 Sep 2006 08:50:00 +0000 (01:50 -0700)]
[PATCH] vm: add per-zone writeout counter
The VM is supposed to minimise the number of pages which get written off the
LRU (for IO scheduling efficiency, and for high reclaim-success rates). But
we don't actually have a clear way of showing how true this is.
So add `nr_vmscan_write' to /proc/vmstat and /proc/zoneinfo - the number of
pages which have been written by the vm scanner in this zone and globally.
Cc: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Arch-independent zone-sizing determines the size of a node
(pgdat->node_spanned_pages) based on the physical memory that was
registered by the architecture. However, when
CONFIG_MEMORY_HOTPLUG_RESERVE is set, the architecture expects that the
spanned_pages will be much larger and that a mem_map will be allocated for
use later on memory hot-add.
This patch allows an architecture that sets CONFIG_MEMORY_HOTPLUG_RESERVE
to call push_node_boundaries() which will set the node beginning and end to
at *least* the requested boundary.
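The "at least" semantics can be sketched as a widening operation on a node's span. struct mock_node_span and mock_push_node_boundaries() are hypothetical and ignore how the real function locates the node's data.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node span covering PFNs [start_pfn, end_pfn). */
struct mock_node_span {
    unsigned long start_pfn;
    unsigned long end_pfn;
};

/* Widen the span so it covers at least the requested boundaries;
 * never shrink a span that is already larger. */
static void mock_push_node_boundaries(struct mock_node_span *span,
                                      unsigned long start_pfn,
                                      unsigned long end_pfn)
{
    assert(span != NULL && start_pfn <= end_pfn);
    if (start_pfn < span->start_pfn)
        span->start_pfn = start_pfn;
    if (end_pfn > span->end_pfn)
        span->end_pfn = end_pfn;
}
```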
Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Andi Kleen <ak@muc.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Keith Mannthey" <kmannth@gmail.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[PATCH] Account for holes that are outside the range of physical memory
absent_pages_in_range() made the assumption that users of the API would not
care about holes beyond the end of physical memory. This was not the
case. This patch will account for ranges outside of physical memory as
holes correctly.
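Counting holes so that PFNs past the end of physical memory are included can be sketched as follows. struct mock_range and mock_absent_pages_in_range() are hypothetical, and the sketch assumes the registered ranges do not overlap.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical active range covering PFNs [start_pfn, end_pfn). */
struct mock_range {
    unsigned long start_pfn;
    unsigned long end_pfn;
};

/* Holes = size of the queried range minus the pages actually present.
 * Anything in the query that lies beyond the last active range is
 * therefore counted as a hole too. Ranges must not overlap. */
static unsigned long mock_absent_pages_in_range(const struct mock_range *r,
                                                size_t n,
                                                unsigned long start,
                                                unsigned long end)
{
    unsigned long present = 0;
    size_t i;

    assert(start <= end);
    for (i = 0; i < n; i++) {
        unsigned long lo = r[i].start_pfn > start ? r[i].start_pfn : start;
        unsigned long hi = r[i].end_pfn < end ? r[i].end_pfn : end;

        if (lo < hi)
            present += hi - lo;
    }
    return (end - start) - present;
}
```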
Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Andi Kleen <ak@muc.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Keith Mannthey" <kmannth@gmail.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[PATCH] Account for memmap and optionally the kernel image as holes
The x86_64 code accounted for memmap and some portions of the DMA zone as
holes. This was because those areas would never be reclaimed and accounting
for them as memory affects min watermarks. This patch will account for the
memmap as a memory hole. Architectures may optionally use set_dma_reserve()
if they wish to account for a portion of memory in ZONE_DMA as a hole.
Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Andi Kleen <ak@muc.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Keith Mannthey" <kmannth@gmail.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[PATCH] Introduce mechanism for registering active regions of memory
At a basic level, architectures define structures to record where active
ranges of page frames are located. Once located, the code to calculate zone
sizes and holes in each architecture is very similar. Some of this zone and
hole sizing code is difficult to read for no good reason. This set of patches
eliminates the similar-looking architecture-specific code.
The patches introduce a mechanism where architectures register where the
active ranges of page frames are with add_active_range(). When all areas have
been discovered, free_area_init_nodes() is called to initialise the pgdat and
zones. The zone sizes and holes are then calculated in an architecture
independent manner.
Patch 1 introduces the mechanism for registering and initialising PFN ranges
Patch 2 changes ppc to use the mechanism - 139 arch-specific LOC removed
Patch 3 changes x86 to use the mechanism - 136 arch-specific LOC removed
Patch 4 changes x86_64 to use the mechanism - 74 arch-specific LOC removed
Patch 5 changes ia64 to use the mechanism - 52 arch-specific LOC removed
Patch 6 accounts for mem_map as a memory hole as the pages are not reclaimable.
It adjusts the watermarks slightly
Tony Luck has successfully tested for ia64 on Itanium with tiger_defconfig,
gensparse_defconfig and defconfig. Bob Picco has also tested and debugged on
IA64. Jack Steiner successfully boot tested on a mammoth SGI IA64-based
machine. These were on patches against 2.6.17-rc1 and release 3 of these
patches but there have been no ia64-changes since release 3.
There are differences in the zone sizes for x86_64 as the arch-specific code
for x86_64 accounts the kernel image and the starting mem_maps as memory holes
but the architecture-independent code accounts the memory as present.
The big benefit of this set of patches is a sizable reduction of
architecture-specific code, some of which is very hairy. There should be a
greater reduction when other architectures use the same mechanisms for zone
and hole sizing but I lack the hardware to test on.
Additional credit;
Dave Hansen for the initial suggestion and comments on early patches
Andy Whitcroft for reviewing early versions and catching numerous
errors
Tony Luck for testing and debugging on IA64
Bob Picco for fixing bugs related to pfn registration, reviewing a
number of patch revisions, providing a number of suggestions
on future direction and testing heavily
Jack Steiner and Robin Holt for testing on IA64 and clarifying
issues related to memory holes
Yasunori for testing on IA64
Andi Kleen for reviewing and feeding back about x86_64
Christian Kujau for providing valuable information related to ACPI
problems on x86_64 and testing potential fixes
This patch:
Define the structure to represent an active range of page frames within a node
in an architecture independent manner. Architectures are expected to register
active ranges of PFNs using add_active_range(nid, start_pfn, end_pfn) and call
free_area_init_nodes() passing the PFNs of the end of each zone.
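The register-then-size flow can be illustrated with a small userspace model. The names are hypothetical mocks: mock_add_active_range() only records ranges in a fixed array (and returns a status, purely for the mock), and mock_zone_present_pages() shows the kind of architecture-independent calculation that becomes possible once the ranges and zone end PFNs are known.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_MAX_RANGES 16

/* Hypothetical record of one active PFN range [start_pfn, end_pfn). */
struct mock_active_range {
    int nid;
    unsigned long start_pfn;
    unsigned long end_pfn;
};

static struct mock_active_range mock_ranges[MOCK_MAX_RANGES];
static size_t mock_nr_ranges;

/* Architecture code registers each range of present page frames. */
static int mock_add_active_range(int nid, unsigned long start_pfn,
                                 unsigned long end_pfn)
{
    if (mock_nr_ranges == MOCK_MAX_RANGES || start_pfn >= end_pfn)
        return -1;
    mock_ranges[mock_nr_ranges].nid = nid;
    mock_ranges[mock_nr_ranges].start_pfn = start_pfn;
    mock_ranges[mock_nr_ranges].end_pfn = end_pfn;
    mock_nr_ranges++;
    return 0;
}

/* Generic code: pages present on node nid inside [zone_start, zone_end). */
static unsigned long mock_zone_present_pages(int nid,
                                             unsigned long zone_start,
                                             unsigned long zone_end)
{
    unsigned long present = 0;
    size_t i;

    assert(zone_start <= zone_end);
    for (i = 0; i < mock_nr_ranges; i++) {
        unsigned long lo, hi;

        if (mock_ranges[i].nid != nid)
            continue;
        lo = mock_ranges[i].start_pfn > zone_start ?
             mock_ranges[i].start_pfn : zone_start;
        hi = mock_ranges[i].end_pfn < zone_end ?
             mock_ranges[i].end_pfn : zone_end;
        if (lo < hi)
            present += hi - lo;
    }
    return present;
}
```

The hole count for a zone is then simply its span minus the present pages, with no per-architecture code involved.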
Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Bob Picco <bob.picco@hp.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Andi Kleen <ak@muc.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Keith Mannthey" <kmannth@gmail.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
un-, de-, -free, -destroy, -exit, etc functions should in general return
void. Also, there is very little that, say, filesystem driver code can do
upon a failed kmem_cache_destroy(). If it is decided to BUG in this case,
the BUG should be put in generic code instead.
[PATCH] Really ignore kmem_cache_destroy return value
* Roughly half of the callers already do it by not checking the return value
* Code in drivers/acpi/osl.c does the following to be sure:
(void)kmem_cache_destroy(cache);
* Those who check it printk something, however, slab_error already printed
the name of the failed cache.
* XFS BUGs on a failed kmem_cache_destroy, which is not a decision a
low-level filesystem driver should make. Converted to ignore.
The SWsoft Virtuozzo/OpenVZ Linux kernel team has discovered that ext3 error
behavior has been broken in Linux kernels since the 2.5.x versions by the
following patch:
2002/10/31 02:15:26-05:00 tytso@snap.thunk.org
Default mount options from superblock for ext2/3 filesystems
http://linux.bkbits.net:8080/linux-2.6/gnupatch@3dc0d88eKbV9ivV4ptRNM8fBuA3JBQ
When an ext3 file system is mounted with errors=continue
(EXT3_ERRORS_CONTINUE), errors should be ignored when possible. However, at
present, on any error the kernel aborts the journal and remounts the
filesystem read-only. This behavior has been hit a number of times and was
noted to differ from that of 2.4.x kernels.
This patch fixes this:
- do nothing in case of EXT3_ERRORS_CONTINUE,
- set EXT3_MOUNT_ABORT and call journal_abort() in all other cases
- panic() should be called after ext3_commit_super() to save
sb marked as EXT3_ERROR_FS
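The intended ordering can be modelled in userspace as below. This is one reading of the three points above, not the patch itself: enum mock_errors_policy, struct mock_sb and mock_handle_error() are hypothetical, and the real calls (journal_abort(), ext3_commit_super(), panic()) are represented only by flags.

```c
#include <assert.h>
#include <stddef.h>

enum mock_errors_policy {
    MOCK_ERRORS_CONTINUE,
    MOCK_ERRORS_RO,
    MOCK_ERRORS_PANIC
};

/* Hypothetical superblock state recording what the handler did. */
struct mock_sb {
    enum mock_errors_policy policy;
    int error_fs;               /* superblock marked as having errors */
    int journal_aborted;        /* stands in for EXT3_MOUNT_ABORT + journal_abort() */
    int super_committed;        /* stands in for ext3_commit_super() */
    int panicked_after_commit;  /* 1 if the panic point saw a committed sb */
};

static void mock_handle_error(struct mock_sb *sb)
{
    assert(sb != NULL);
    sb->error_fs = 1;

    /* errors=continue: leave the journal alone. */
    if (sb->policy != MOCK_ERRORS_CONTINUE)
        sb->journal_aborted = 1;

    /* Write the error marker out before any panic. */
    sb->super_committed = 1;

    if (sb->policy == MOCK_ERRORS_PANIC)
        sb->panicked_after_commit = sb->super_committed;
}
```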
Signed-off-by: Vasily Averin <vvs@sw.ru> Acked-by: Kirill Korotaev <dev@sw.ru> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Stephen C. Tweedie" <sct@redhat.com> Cc: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Mingming Cao [Wed, 27 Sep 2006 08:49:32 +0000 (01:49 -0700)]
[PATCH] ext3: turn on reservation dump on block allocation errors
In the past there were a few kernel panics related to block reservation
tree operations failure (insert/remove etc). It would be very useful to
get the block allocation reservation map info when such an error happens.
Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Eric Sandeen [Wed, 27 Sep 2006 08:49:31 +0000 (01:49 -0700)]
[PATCH] JBD: 16T fixes
These are a few places I've found in jbd that look like they may not be
16T-safe, or consistent with the use of unsigned longs for block
containers. Problems here would be somewhat hard to hit, would require
journal blocks past the 8T boundary, which would not be terribly common.
Still, should fix.
(some of these have come from the ext4 work on jbd as well).
I think there's one more possibility that the wrap() function may not be
safe IF your last block in the journal butts right up against the 2^32 block
boundary, but that seems like a VERY remote possibility, and I'm not
worrying about it at this point.
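For reference, the 8T and 16T figures follow from the block number width. The sketch below assumes a 4 KiB block size (the message does not state one) and uses hypothetical helper names; it is arithmetic only, not jbd code.

```c
#include <stdint.h>

/* Assumed block size for this sketch: 4 KiB. */
#define SKETCH_BLOCK_SIZE 4096ULL

/* Byte offset at which block number `block` starts. */
static uint64_t block_to_bytes(uint64_t block)
{
    return block * SKETCH_BLOCK_SIZE;
}

/* True if `block` can be held in a signed 32-bit container. */
static int fits_signed_32(uint64_t block)
{
    return block <= (uint64_t)INT32_MAX;
}

/* True if `block` can be held in an unsigned 32-bit container. */
static int fits_unsigned_32(uint64_t block)
{
    return block <= (uint64_t)UINT32_MAX;
}
```

Under that assumption, block 2^31 starts at 8 TiB and no longer fits a signed 32-bit value, while block 2^32 (16 TiB) does not fit an unsigned one either.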
Signed-off-by: Eric Sandeen <esandeen@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Eric Sandeen [Wed, 27 Sep 2006 08:49:30 +0000 (01:49 -0700)]
[PATCH] ext3: inode numbers are unsigned long
This is primarily format string fixes, with changes to ialloc.c where large
inode counts could overflow, and also pass around journal_inum as an
unsigned long, just to be pedantic about it....
Signed-off-by: Eric Sandeen <esandeen@redhat.com> Cc: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Eric Sandeen [Wed, 27 Sep 2006 08:49:29 +0000 (01:49 -0700)]
[PATCH] fix ext3 mounts at 16T
I need to do some actual IO testing now, but this gets things mounting for
a 16T ext3 filesystem. (patched up e2fsprogs is needed too, I'll send that
off the kernel list)
This patch fixes these issues in the kernel:
o sbi->s_groups_count overflows in ext3_fill_super()
at 16T, s_blocks_count is already maxed out; adding
EXT3_BLOCKS_PER_GROUP(sb) overflows it and groups_count comes out to 0.
Not really what we want, and causes a failed mount.
Feel free to check my math (actually, please do!), but changing it this
way should work & avoid the overflow:
(A + B - 1)/B changed to: ((A - 1)/B) + 1
o ext3_check_descriptors() overflows range checks
ext3_check_descriptors() iterates over all block groups making sure
that various bits are within the right block ranges... on the last pass
through, it is checking the error case
[item] >= block + EXT3_BLOCKS_PER_GROUP(sb)
where "block" is the first block in the last block group. The last
block in this group (and the last one that will fit in 32 bits) is block
+ EXT3_BLOCKS_PER_GROUP(sb)- 1. block + EXT3_BLOCKS_PER_GROUP(sb) wraps
back around to 0.
so, make things clearer with "first_block" and "last_block" where those
are first and last, inclusive, and use <, > rather than <, >=.
Finally, the last block group may be smaller than the rest, so account
for this on the last pass through: last_block = sb->s_blocks_count - 1;
(a similar patch could be done for ext2; does anyone in their right mind
use ext2 at 16T? I'll send an ext2 patch doing the same thing if that's
warranted)
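Both overflows can be reproduced in user space with 32-bit unsigned arithmetic. The following is a minimal sketch with hypothetical function names, assuming 32768 blocks per group and a first data block of 0; it is not the ext3 code itself.

```c
#include <assert.h>
#include <stdint.h>

/* Old form: (A + B - 1) / B. The addition can wrap in 32 bits. */
static uint32_t groups_old(uint32_t blocks, uint32_t per_group)
{
    return (blocks + per_group - 1) / per_group;
}

/* New form: ((A - 1) / B) + 1. Requires blocks >= 1, or A - 1 wraps. */
static uint32_t groups_new(uint32_t blocks, uint32_t per_group)
{
    assert(blocks >= 1 && per_group >= 1);
    return ((blocks - 1) / per_group) + 1;
}

/* Old range check: the upper bound first_block + per_group can wrap to 0. */
static int out_of_range_old(uint32_t item, uint32_t first_block,
                            uint32_t per_group)
{
    return item < first_block || item >= first_block + per_group;
}

/* New range check: both bounds are inclusive, so no sum is needed. */
static int out_of_range_new(uint32_t item, uint32_t first_block,
                            uint32_t last_block)
{
    return item < first_block || item > last_block;
}
```

With blocks = 0xFFFFFFFF and 32768 blocks per group, groups_old() yields 0 and groups_new() yields 131072. For the last group, which starts at block 0xFFFF8000, the old check rejects every item because its upper bound has wrapped to 0; the new check accepts items up to last_block.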
Signed-off-by: Eric Sandeen <esandeen@redhat.com> Cc: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
jbd_sync_bh releases journal->j_list_lock. Add a lock annotation to this
function so that sparse can check callers for lock pairing, and so that
sparse will not complain about this function since it intentionally uses
the lock in this manner.
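A user-space sketch of what such an annotation looks like. The macro name SKETCH_RELEASES and the toy types are hypothetical; the kernel's own macro is __releases(), and the attribute form shown here is modeled on the kernel's sparse-only definition, which may differ between kernel versions. Outside sparse the macro expands to nothing.

```c
/* sparse defines __CHECKER__; ordinary compilers see an empty macro. */
#ifdef __CHECKER__
# define SKETCH_RELEASES(x) __attribute__((context(x, 1, 0)))
#else
# define SKETCH_RELEASES(x)
#endif

/* Toy stand-ins for a lock and for the journal that owns it. */
struct toy_lock { int held; };
struct toy_journal { struct toy_lock list_lock; };

static void toy_unlock(struct toy_lock *l)
{
    l->held = 0;
}

/* Entered with list_lock held; returns with it released. */
static void toy_sync(struct toy_journal *journal)
    SKETCH_RELEASES(journal->list_lock);

static void toy_sync(struct toy_journal *journal)
{
    toy_unlock(&journal->list_lock);
    /* Work that must not run under the lock would go here. */
}
```

The annotation documents the asymmetric locking for the checker: the caller takes the lock and the callee drops it.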
* master.kernel.org:/pub/scm/linux/kernel/git/gregkh/i2c-2.6: (30 commits)
i2c: Drop unimplemented slave functions
i2c: Constify i2c_algorithm declarations, part 2
i2c: Constify i2c_algorithm declarations, part 1
i2c: Let drivers constify i2c_algorithm data
i2c-isa: Restore driver owner
i2c-viapro: Add support for the VT8237A and VT8251
i2c: Warn on i2c client creation failure
i2c-core: Drop useless bitmaskings
i2c-algo-pcf: Discard the mdelay data struct member
i2c-algo-bit: Cleanups
i2c-isa: Fail adding driver on attach_adapter error
i2c: __must_check fixes (chip drivers)
i2c-dev: attach/detach_adapter cleanups
i2c-stub: Chip address as a module parameter
i2c: Plan i2c-isa for removal
i2c: New bus driver for TI OMAP boards
i2c-algo-bit: Discard the mdelay data struct member
i2c-matroxfb: Struct init conversion
i2c: Fix copy-n-paste in subsystem Kconfig
i2c-au1550: Add I2C support for Au1200
...
Franck Bui-Huu [Fri, 11 Aug 2006 15:51:53 +0000 (17:51 +0200)]
[MIPS] setup.c: use early_param() for early command line parsing
There's no point to rewrite some logic to parse command line
to pass initrd parameters or to declare a user memory area.
We could use instead parse_early_param() that does the same
thing.
Franck Bui-Huu [Fri, 18 Aug 2006 14:18:09 +0000 (16:18 +0200)]
[MIPS] get_wchan(): remove uses of mfinfo[64]
This array was used to 'cache' some frame info about scheduler
functions to speed up get_wchan(). This array was 1 KB in size and
was only used when CONFIG_KALLSYMS was set but declared for all
configs.
Rather than make the array statement conditional, this patch
removes this array and its uses. Indeed the common case doesn't
seem to use this array and get_wchan() is not a critical path
anyways.
It results in a smaller bss and a smaller/cleaner code:
   text    data     bss     dec     hex filename
2543808  254148  139296 2937252  2cd1a4 vmlinux-new-get-wchan
2544080  254148  143392 2941620  2ce2b4 vmlinux~old
Franck Bui-Huu [Fri, 18 Aug 2006 14:18:08 +0000 (16:18 +0200)]
[MIPS] get_frame_info(): null function size means size is unknown
This patch adds 2 sanity checks.
The first one tests that the start address of the function to analyze has been
set by the caller. If not, return an error since nothing useful can be done
without it.
The second one checks that the function's size has been set. A null size can
happen if CONFIG_KALLSYMS is not set and it means that we don't know the size
of the function to analyze. In this case, we make it equal to 128 instructions
by default.
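The two checks can be sketched as below. The struct and function names are hypothetical and only model the logic described; the 4-byte instruction size is an assumption of this sketch (standard MIPS encodings, ignoring MIPS16).

```c
#include <stddef.h>

#define SKETCH_DEFAULT_INSNS 128u /* fallback when the size is unknown */
#define SKETCH_INSN_SIZE     4u   /* assumed bytes per instruction */

struct sketch_frame_info {
    const void *func;        /* start address, to be set by the caller */
    unsigned long func_size; /* size in bytes; 0 means unknown */
};

/* Returns the number of instructions to scan, or -1 if func is unset. */
static long sketch_insns_to_scan(const struct sketch_frame_info *info)
{
    /* Check 1: without a start address nothing useful can be done. */
    if (info->func == NULL)
        return -1;

    /* Check 2: a null size means "unknown", so use the default. */
    if (info->func_size == 0)
        return SKETCH_DEFAULT_INSNS;

    return (long)(info->func_size / SKETCH_INSN_SIZE);
}
```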
MIPS is the only port to call its fstatat()-related syscalls
"__NR_fstatat". Now I can see why that might be seen as every
other port being wrong, but I think for o32, it is at best confusing.
__NR_fstat provides a plain (32-bit) stat while __NR_fstatat provides a
64-bit stat. Changing the name to __NR_fstatat64 would make things more
explicit, match x86, and make the glibc port slightly easier.
The current name is more appropriate for n32 and n64, but it would be
appropriate for other 64-bit targets too, and those targets have chosen
to call it __NR_newfstatat instead. Using the same name for MIPS would
again be more consistent and make the glibc port slightly easier.
I'm not wedded to this idea if the current names are preferred,
but FWIW...
Signed-off-by: Richard Sandiford <richard@codesourcery.com> Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
[MIPS] The o32 fstatat syscall behaves differently on 32 and 64 bit kernels
While working on a glibc patch to support the fstatat() functions[1],
I noticed that the o32 implementation behaves differently on 32-bit and
64-bit kernels; the former provides a stat64 while the latter provides
a plain (o32) stat. I think the former is what's intended, as there is
no separate fstatat64. It's also what x86 does.
I think this is just a case of a compat too far.
[1] I've seen Khem's patch, but I don't think it's right.
Signed-off-by: Richard Sandiford <richard@codesourcery.com> Signed-off-by: Ralf Baechle <ralf@linux-mips.org>