Qi Yong [Tue, 5 Feb 2008 06:29:23 +0000 (22:29 -0800)]
skip writing data pages when inode is under I_SYNC
Since I_SYNC was split out from I_LOCK, the concern in commit 4b89eed93e0fa40a63e3d7b1796ec1337ea7a3aa ("Write back inode data pages
even when the inode itself is locked") is not longer valid.
We should revert to the original behavior: in __writeback_single_inode(),
when we find an I_SYNC-ed inode and we're not doing a data-integrity sync,
skip writing entirely. Otherwise, we are double calling do_writepages()
Signed-off-by: Qi Yong <qiyong@fc-cn.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hugh@veritas.com> Cc: Joern Engel <joern@wohnheim.fh-wedel.de> Cc: WU Fengguang <wfg@mail.ustc.edu.cn> Cc: Michael Rubin <mrubin@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 5 Feb 2008 06:29:23 +0000 (22:29 -0800)]
mm: don't waste swap on locked pages
try_to_unmap always fails on a page found in a VM_LOCKED vma (unless
migrating), and recycles it back to the active list. But if it's an
anonymous page, we've already allocated swap to it: just wasting swap.
Spot locked pages in page_referenced_one and treat them as referenced.
This patch fixes a sles9 system hang in start_this_handle from a customer
with some heavy workload where all tasks are waiting on kjournald to commit
the transaction, but kjournald waits on t_updates to go down to zero (it
never does).
This was reported as a lowmem shortage deadlock but when checking the debug
data I noticed the VM wasn't under pressure at all (well it was really
under vm pressure, because lots of tasks hanged in the VM prune_dcache
methods trying to flush dirty inodes, but no task was hanging in GFP_NOFS
mode, the holder of the journal handle should have if this was a vm issue
in the first place).
No task was apparently holding the leftover handle in the committing
transaction, so I deduced t_updates was stuck to 1 because a journal_stop
was never run by some path (this turned out to be correct). With a debug
patch adding proper reverse links and stack trace logging in ext3 deployed
in production, I found journal_stop is never run because
mark_inode_dirty_sync is called inside release_task called by do_exit.
(that was quite fun because I would have never thought about this
subtleness, I thought a regular path in ext3 had a bug and it forgot to
call journal_stop)
do_exit->release_task->mark_inode_dirty_sync->schedule() (will never
come back to run journal_stop)
The reason is that shrink_dcache_parent is racy by design (feature not
a bug) and it can do blocking I/O in some case, but the point is that
calling shrink_dcache_parent at the last stage of do_exit isn't safe
for self-reaping tasks.
I guess the memory pressure of the unbalanced highmem system allowed
to trigger this more easily.
Now mainline doesn't have this line in iput (like sles9 has):
if (inode->i_state & I_DIRTY_DELAYED)
mark_inode_dirty_sync(inode);
so it will probably not crash with ext3, but for example ext2 implements an
I/O-blocking ext2_put_inode that will lead to similar screwups with
ext2_free_blocks never coming back and it's definitely wrong to call
blocking-IO paths inside do_exit. So this should fix a subtle bug in
mainline too (not verified in practice though). The equivalent fix for
ext3 is also not verified yet to fix the problem in sles9 but I don't have
doubt it will (it usually takes days to crash, so it'll take weeks to be
sure).
An alternate fix would be to offload that work to a kernel thread, but I
don't think a reschedule for this is worth it, the vm should be able to
collect those entries for the synchronous release_task.
Signed-off-by: Andrea Arcangeli <andrea@suse.de> Cc: Jan Kara <jack@ucw.cz> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bron Gondwana [Tue, 5 Feb 2008 06:29:20 +0000 (22:29 -0800)]
mm/page-writeback: highmem_is_dirtyable option
Add vm.highmem_is_dirtyable toggle
A 32 bit machine with HIGHMEM64 enabled running DCC has an MMAPed file of
approximately 2Gb size which contains a hash format that is written
randomly by the dbclean process. On 2.6.16 this process took a few
minutes. With lowmem only accounting of dirty ratios, this takes about 12
hours of 100% disk IO, all random writes.
Include a toggle in /proc/sys/vm/highmem_is_dirtyable which can be set to 1 to
add the highmem back to the total available memory count.
[akpm@linux-foundation.org: Fix the CONFIG_DETECT_SOFTLOCKUP=y build] Signed-off-by: Bron Gondwana <brong@fastmail.fm> Cc: Ethan Solomita <solo@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: WU Fengguang <wfg@mail.ustc.edu.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have repeatedly discussed if the cold pages still have a point. There is
one way to join the two lists: Use a single list and put the cold pages at the
end and the hot pages at the beginning. That way a single list can serve for
both types of allocations.
The discussion of the RFC for this and Mel's measurements indicate that
there may not be too much of a point left to having separate lists for
hot and cold pages (see http://marc.info/?t=119492914200001&r=1&w=2).
Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Martin Bligh <mbligh@mbligh.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Robert Bragg [Tue, 5 Feb 2008 06:29:18 +0000 (22:29 -0800)]
mm: don't allow ioremapping of ranges larger than vmalloc space
When running with a 16M IOREMAP_MAX_ORDER (on armv7) we found that the
vmlist search routine in __get_vm_area_node can mistakenly allow a driver
to ioremap a range larger than vmalloc space.
If at the time of the ioremap all existing vmlist areas sit below the
determined alignment then the search routine continues past all entries and
exits the for loop - straight into the found: label - without ever testing
for integer wrapping or that the requested size fits.
We were seeing a driver successfully ioremap 128M of flash even though
there was only 120M of vmalloc space. From that point the system was left
with the remainder of the first 16M of space to vmalloc/ioremap within.
Signed-off-by: Robert Bragg <robert@sixbynine.org> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1. Add comments explaining how the function can be called.
2. Collect global diffs in a local array and only spill
them once into the global counters when the zone scan
is finished. This means that we only touch each global
counter once instead of each time we fold cpu counters
into zone counters.
Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In order to change the layout of the page tables after an mmap has crossed the
adress space limit of the current page table layout a architecture hook in
get_unmapped_area is needed. The arguments are the address of the new mapping
and the length of it.
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(with Martin Schwidefsky <schwidefsky@de.ibm.com>)
The pgd/pud/pmd/pte page table allocation functions get a mm_struct pointer as
first argument. The free functions do not get the mm_struct argument. This
is 1) asymmetrical and 2) to do mm related page table allocations the mm
argument is needed on the free function as well.
[kamalesh@linux.vnet.ibm.com: i386 fix]
[akpm@linux-foundation.org: coding-syle fixes] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Rename drain_all_local_pages to drain_all_pages(). It does drain
all pages not only those of the local processor.
- Eliminate useless interrupt off / on sequences. drain_pages()
disables interrupts on its own. The execution thread is
pinned to processor by the caller. So there is no need to
disable interrupts.
- Put drain_all_pages() declaration in gfp.h and remove the
declarations from suspend.h and from mm/memory_hotplug.c
- Make software suspend call drain_all_pages(). The draining
of processor local pages is may not the right approach if
software suspend wants to support SMP. If they call drain_all_pages
then we can make drain_pages() static.
[akpm@linux-foundation.org: fix build] Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Daniel Walker <dwalker@mvista.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nick Piggin [Tue, 5 Feb 2008 06:29:10 +0000 (22:29 -0800)]
radix-tree: avoid atomic allocations for preloaded insertions
Most pagecache (and some other) radix tree insertions have the great
opportunity to preallocate a few nodes with relaxed gfp flags. But the
preallocation is squandered when it comes time to allocate a node, we
default to first attempting a GFP_ATOMIC allocation -- that doesn't
normally fail, but it can eat into atomic memory reserves that we don't
need to be using.
Another upshot of this is that it removes the sometimes highly contended
zone->lock from underneath tree_lock. Pagecache insertions are always
performed with a radix tree preload, and after this change, such a
situation will never fall back to kmem_cache_alloc within
radix_tree_node_alloc.
David Miller reports seeing this allocation fail on a highly threaded
sparc64 system:
The calltrace is significant: __do_page_cache_readahead allocates a number
of pages with GFP_KERNEL, and hence it should have reclaimed sufficient
memory to satisfy GFP_ATOMIC allocations. However after the list of pages
goes to mpage_readpages, there can be significant intervals (including disk
IO) before all the pages are inserted into the radix-tree. So the reserves
can easily be depleted at that point. The patch is confirmed to fix the
problem.
Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matt Mackall [Tue, 5 Feb 2008 06:29:06 +0000 (22:29 -0800)]
maps4: add /proc/kpageflags interface
This makes a subset of physical page flags available to userspace. Together
with /proc/pid/kpagemap, this allows tracking of a wide variety of VM behaviors.
Exported flags are decoupled from the kernel's internal flags. This
allows us to reorder flag bits, and synthesize any bits that get
redefined in terms of other bits.
[akpm@linux-foundation.org: remove unneeded access_ok()]
[akpm@linux-foundation.org: s/0/NULL/] Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matt Mackall [Tue, 5 Feb 2008 06:29:05 +0000 (22:29 -0800)]
maps4: add /proc/kpagecount interface
This makes physical page map counts available to userspace. Together
with /proc/pid/pagemap and /proc/pid/clear_refs, this can be used to
monitor memory usage on a per-page basis.
[akpm@linux-foundation.org: remove unneeded access_ok()]
[bunk@stusta.de: make struct proc_kpagemap static] Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matt Mackall [Tue, 5 Feb 2008 06:29:04 +0000 (22:29 -0800)]
maps4: add /proc/pid/pagemap interface
This interface provides a mapping for each page in an address space to its
physical page frame number, allowing precise determination of what pages are
mapped and what pages are shared between processes.
New in this version:
- headers gone again (as recommended by Dave Hansen and Alan Cox)
- 64-bit entries (as per discussion with Andi Kleen)
- swap pte information exported (from Dave Hansen)
- page walker callback for holes (from Dave Hansen)
- direct put_user I/O (as suggested by Rusty Russell)
This patch folds in cleanups and swap PTE support from Dave Hansen
<haveblue@us.ibm.com>.
Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matt Mackall [Tue, 5 Feb 2008 06:29:03 +0000 (22:29 -0800)]
maps4: regroup task_mmu by interface
Reorder source so that all the code and data for each interface is together.
Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: David Rientjes <rientjes@google.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matt Mackall [Tue, 5 Feb 2008 06:29:03 +0000 (22:29 -0800)]
maps4: move clear_refs code to task_mmu.c
This puts all the clear_refs code where it belongs and probably lets things
compile on MMU-less systems as well.
Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: David Rientjes <rientjes@google.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matt Mackall [Tue, 5 Feb 2008 06:29:02 +0000 (22:29 -0800)]
maps4: simplify interdependence of maps and smaps
This pulls the shared map display code out of show_map and puts it in
show_smap where it belongs.
Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave Hansen [Tue, 5 Feb 2008 06:28:59 +0000 (22:28 -0800)]
maps4: rework TASK_SIZE macros
The following replaces the earlier patches sent. It should address
David Rientjes's comments, and has been compile tested on all the
architectures that it touches, save for parisc.
For the /proc/<pid>/pagemap code[1], we need to able to query how
much virtual address space a particular task has. The trick is
that we do it through /proc and can't use TASK_SIZE since it
references "current" on some arches. The process opening the
/proc file might be a 32-bit process opening a 64-bit process's
pagemap file.
I'd like to have that for other architectures. So, add it
for all the architectures that actually use "current" in
their TASK_SIZE. For the others, just add a quick #define
in sched.h to use plain old TASK_SIZE.
Fengguang Wu [Tue, 5 Feb 2008 06:28:56 +0000 (22:28 -0800)]
maps4: add proportional set size accounting in smaps
The "proportional set size" (PSS) of a process is the count of pages it has
in memory, where each page is divided by the number of processes sharing
it. So if a process has 1000 pages all to itself, and 1000 shared with one
other process, its PSS will be 1500.
- lwn.net: "ELC: How much memory are applications really using?"
The PSS proposed by Matt Mackall is a very nice metic for measuring an
process's memory footprint. So collect and export it via
/proc/<pid>/smaps.
Matt Mackall's pagemap/kpagemap and John Berthels's exmap can also do the
job. They are comprehensive tools. But for PSS, let's do it in the simple
way.
vmtruncate is a twisted maze of gotos, this patch cleans it up to have a
proper if else for the two major cases of extending and truncating truncate
and thus makes it a lot more readable while keeping exactly the same
functinality.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 5 Feb 2008 06:28:55 +0000 (22:28 -0800)]
tmpfs: fix shmem_swaplist races
Intensive swapoff testing shows shmem_unuse spinning on an entry in
shmem_swaplist pointing to itself: how does that come about? Days pass...
First guess is this: shmem_delete_inode tests list_empty without taking the
global mutex (so the swapping case doesn't slow down the common case); but
there's an instant in shmem_unuse_inode's list_move_tail when the list entry
may appear empty (a rare case, because it's actually moving the head not the
the list member). So there's a danger of leaving the inode on the swaplist
when it's freed, then reinitialized to point to itself when reused. Fix that
by skipping the list_move_tail when it's a no-op, which happens to plug this.
But this same spinning then surfaces on another machine. Ah, I'd never
suspected it, but shmem_writepage's swaplist manipulation is unsafe: though we
still hold page lock, which would hold off inode deletion if the page were in
pagecache, it doesn't hold off once it's in swapcache (free_swap_and_cache
doesn't wait on locked pages). Hmm: we could put the the inode on swaplist
earlier, but then shmem_unuse_inode could never prune unswapped inodes.
Fix this with an igrab before dropping info->lock, as in shmem_unuse_inode;
though I am a little uneasy about the iput which has to follow - it works, and
I see nothing wrong with it, but it is surprising that shmem inode deletion
may now occur below shmem_writepage. Revisit this fix later?
And while we're looking at these races: the way shmem_unuse tests swapped
without holding info->lock looks unsafe, if we've more than one swap area: a
racing shmem_writepage on another page of the same inode could be putting it
in swapcache, just as we're deciding to remove the inode from swaplist -
there's a danger of going on swap without being listed, so a later swapoff
would hang, being unable to locate the entry. Move that test and removal down
into shmem_unuse_inode, once info->lock is held.
Hugh Dickins [Tue, 5 Feb 2008 06:28:54 +0000 (22:28 -0800)]
tmpfs: radix_tree_preloading
Nick has observed that shmem.c still uses GFP_ATOMIC when adding to page cache
or swap cache, without any radix tree preload: so tending to deplete emergency
reserves of memory.
GFP_ATOMIC remains appropriate in shmem_writepage's add_to_swap_cache: it's
being called under memory pressure, so must not wait for more memory to become
available. But shmem_unuse_inode now has a window in which it can and should
preload with GFP_KERNEL, and say GFP_NOWAIT instead of GFP_ATOMIC in its
add_to_page_cache.
shmem_getpage is not so straightforward: its filepage/swappage integrity
relies upon exchanging between caches under spinlock, and it would need a lot
of restructuring to place the preloads correctly. Instead, follow its pattern
of retrying on races: use GFP_NOWAIT instead of GFP_ATOMIC in
add_to_page_cache, and begin each circuit of the repeat loop with a sleeping
radix_tree_preload, followed immediately by radix_tree_preload_end - that
won't guarantee success in the next add_to_page_cache, but doesn't need to.
And we can then remove that bothersome congestion_wait: when needed, it'll
automatically get done in the course of the radix_tree_preload.
Signed-off-by: Hugh Dickins <hugh@veritas.com> Looks-good-to: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 5 Feb 2008 06:28:53 +0000 (22:28 -0800)]
tmpfs: open a window in shmem_unuse_inode
There are a couple of reasons (patches follow) why it would be good to open a
window for sleep in shmem_unuse_inode, between its search for a matching swap
entry, and its handling of the entry found.
shmem_unuse_inode must then use igrab to hold the inode against deletion in
that window, and its corresponding iput might result in deletion: so it had
better unlock_page before the iput, and might as well release the page too.
Nor is there any need to hold on to shmem_swaplist_mutex once we know we'll
leave the loop. So this unwinding moves from try_to_unuse and shmem_unuse
into shmem_unuse_inode, in the case when it finds a match.
Let try_to_unuse break on error in the shmem_unuse case, as it does in the
unuse_mm case: though at this point in the series, no error to break on.
Hugh Dickins [Tue, 5 Feb 2008 06:28:52 +0000 (22:28 -0800)]
tmpfs: make shmem_unuse more preemptible
shmem_unuse is at present an unbroken search through every swap vector page of
every tmpfs file which might be swapped, all under shmem_swaplist_lock. This
dates from long ago, when the caller held mmlist_lock over it all too: long
gone, but there's never been much pressure for preemptible swapoff.
Make it a little more preemptible, replacing shmem_swaplist_lock by
shmem_swaplist_mutex, inserting a cond_resched in the main loop, and a
cond_resched_lock (on info->lock) at one convenient point in the
shmem_unuse_inode loop, where it has no outstanding kmap_atomic.
If we're serious about preemptible swapoff, there's much further to go e.g.
I'm stupid to let the kmap_atomics of the decreasingly significant HIGHMEM
case dictate preemptiblility for other configs. But as in the earlier patch
to make swapoff scan ptes preemptibly, my hidden agenda is really towards
making memcgroups work, hardly about preemptibility at all.
Hugh Dickins [Tue, 5 Feb 2008 06:28:51 +0000 (22:28 -0800)]
tmpfs: allocate on read when stacked
tmpfs is expected to limit the memory used (unless mounted with nr_blocks=0 or
size=0). But if a stacked filesystem such as unionfs gets pages from a sparse
tmpfs file by reading holes, and then writes to them, it can easily exceed any
such limit at present.
So suppress the SGP_READ "don't allocate page" ZERO_PAGE optimization when
reading for the kernel (a KERNEL_DS check, ugh, sorry about that). Indeed,
pessimistically mark such pages as dirty, so they cannot get reclaimed and
unaccounted by mistake. The venerable shmem_recalc_inode code (originally to
account for the reclaim of clean pages) suffices to get the accounting right
when swappages are dropped in favour of more uptodate filepages.
This also fixes the NULL shmem_swp_entry BUG or oops in shmem_writepage,
caused by unionfs writing to a very sparse tmpfs file: to minimize memory
allocation in swapout, tmpfs requires the swap vector be allocated upfront,
which wasn't always happening in this stacked case.
Hugh Dickins [Tue, 5 Feb 2008 06:28:51 +0000 (22:28 -0800)]
tmpfs: allow filepage alongside swappage
tmpfs has long allowed for a fresh filepage to be created in pagecache, just
before shmem_getpage gets the chance to match it up with the swappage which
already belongs to that offset. But unionfs_writepage now does a
find_or_create_page, divorced from shmem_getpage, which leaves conflicting
filepage and swappage outstanding indefinitely, when unionfs is over tmpfs.
Therefore shmem_writepage (where a page is swizzled from file to swap) must
now be on the lookout for existing swap, ready to free it in favour of the
more uptodate filepage, instead of BUGging on that clash. And when the
add_to_page_cache fails in shmem_unuse_inode, it must defer to an uptodate
filepage, otherwise swapoff would hang. Whereas when add_to_page_cache fails
in shmem_getpage, it should retry in the same way it already does.
Hugh Dickins [Tue, 5 Feb 2008 06:28:50 +0000 (22:28 -0800)]
tmpfs: move swap swizzling into shmem
move_to_swap_cache and move_from_swap_cache functions (which swizzle a page
between tmpfs page cache and swap cache, to avoid page copying) are only used
by shmem.c; and our subsequent fix for unionfs needs different treatments in
the two instances of move_from_swap_cache. Move them from swap_state.c into
their callsites shmem_writepage, shmem_unuse_inode and shmem_getpage, making
add_to_swap_cache externally visible.
shmem.c likes to say set_page_dirty where swap_state.c liked to say
SetPageDirty: respect that diversity, which __set_page_dirty_no_writeback
makes moot (and implies we should lose that "shift page from clean_pages to
dirty_pages list" comment: it's on neither).
Hugh Dickins [Tue, 5 Feb 2008 06:28:49 +0000 (22:28 -0800)]
tmpfs: shuffle add_to_swap_caches
add_to_swap_cache doesn't amount to much: merge it into its sole caller
read_swap_cache_async. But we'll be needing to call __add_to_swap_cache from
shmem.c, so promote it to the new add_to_swap_cache. Both were static, so
there's no interface confusion to worry about.
And lose that inappropriate "Anon pages are already on the LRU" comment in the
merging: they're not already on the LRU, as Nick Piggin noticed.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
No-problems-with: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 5 Feb 2008 06:28:49 +0000 (22:28 -0800)]
tmpfs: move swap_state stats update
Both unionfs and memcgroups pose challenges to tmpfs and shmem. To help fix,
it's best to move the swap swizzling functions from swap_state.c to shmem.c.
As a preliminary to that, move swap stats updating down into
__add_to_swap_cache, which will remain internal to swap_state.c.
Well, actually, just move down the incrementation of add_total: remove
noent_race and exist_race completely, they are relics of my 2.4.11 testing.
Alt-SysRq-m users will be thrilled if 2.6.25 is at last free of "race M+N"s.
tmpfs: fix mounts when size is less than the page size
When tmpfs is mounted with a size less than one page, the number of blocks
is set to 0 which makes the tmpfs mount unlimited. This can lead to a
quick and surprising death if someone typos a tmpfs mount command and
writes too much.
tmpfs can still be mounted as unlimited if size or nr_blocks is exactly 0,
as Documentation/filesystems/tmpfs.txt says.
Hugh: do this by rounding size up instead of down in all cases: which
slightly expands other odd-sized tmpfs mounts, but in a consistent way.
Signed-off-by: Michael Marineau <mike@marineau.org> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Tue, 5 Feb 2008 06:28:47 +0000 (22:28 -0800)]
shmem: factor out sbi->free_inodes manipulations
The shmem_sb_info structure has a number of free_inodes. This
value is altered in appropriate places under spinlock and with
the sbi->max_inodes != 0 check.
Consolidate these manipulations into two helpers.
This is minus 42 bytes of shmem.o and minus 4 :) lines of code.
Hugh Dickins [Tue, 5 Feb 2008 06:28:46 +0000 (22:28 -0800)]
swapoff: scan ptes preemptibly
Provided that CONFIG_HIGHPTE is not set, unuse_pte_range can reduce latency
in swapoff by scanning the page table preemptibly: so long as unuse_pte is
careful to recheck that entry under pte lock.
(To tell the truth, this patch was not inspired by any cries for lower
latency here: rather, this restructuring permits a future memory controller
patch to allocate with GFP_KERNEL in unuse_pte, where before it could not.
But it would be wrong to tuck this change away inside a memcgroup patch.)
Hugh Dickins [Tue, 5 Feb 2008 06:28:45 +0000 (22:28 -0800)]
swapin: fix valid_swaphandles defect
valid_swaphandles is supposed to do a quick pass over the swap map entries
neigbouring the entry which swapin_readahead is targetting, to determine for
it a range worth reading all together. But since it always starts its search
from the beginning of the swap "cluster", a reject (free entry) there
immediately curtails the readaround, and every swapin_readahead from that
cluster is for just a single page. Instead scan forwards and backwards around
the target entry.
Use better names for some variables: a swap_info pointer is usually called
"si" not "swapdev". And at the end, if only the target page should be read,
return count of 0 to disable readaround, to avoid the unnecessarily repeated
call to read_swap_cache_async.
Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 5 Feb 2008 06:28:44 +0000 (22:28 -0800)]
shmem_file_write is redundant
With the old aops, writing to a tmpfs file had to use its own special method:
the generic method would pass in a fresh page to prepare_write when the right
page was there in swapcache - which was inefficient to handle, even once we'd
concocted the code to handle it.
With the new aops, the generic method uses shmem_write_end, which lets
shmem_getpage find the right page: so now abandon shmem_file_write in favour
of the generic method. Yes, that does do several things that tmpfs hasn't
really needed (notably balance_dirty_pages_ratelimited, which ramfs also
calls); but more use of common code is preferable.
Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 5 Feb 2008 06:28:44 +0000 (22:28 -0800)]
shmem_getpage return page locked
In the new aops, write_begin is supposed to return the page locked: though
I've seen no ill effects, that's been overlooked in the case of
shmem_write_begin, and should be fixed. Then shmem_write_end must unlock the
page: do so _after_ updating i_size, as we found to be important in other
filesystems (though since shmem pages don't go the usual writeback route, they
never suffered from that corruption).
For shmem_write_begin to return the page locked, we need shmem_getpage to
return the page locked in SGP_WRITE case as well as SGP_CACHE case: let's
simplify the interface and return it locked even when SGP_READ.
Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 5 Feb 2008 06:28:43 +0000 (22:28 -0800)]
shmem: SGP_QUICK and SGP_FAULT redundant
Remove SGP_QUICK from the sgp_type enum: it was for shmem_populate and has no
users now. Remove SGP_FAULT from the enum: SGP_CACHE does just as well (and
shmem_getpage is about to return with page always locked).
Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 5 Feb 2008 06:28:42 +0000 (22:28 -0800)]
swapin needs gfp_mask for loop on tmpfs
Building in a filesystem on a loop device on a tmpfs file can hang when
swapping, the loop thread caught in that infamous throttle_vm_writeout.
In theory this is a long standing problem, which I've either never seen in
practice, or long ago suppressed the recollection, after discounting my load
and my tmpfs size as unrealistically high. But now, with the new aops, it has
become easy to hang on one machine.
Loop used to grab_cache_page before the old prepare_write to tmpfs, which
seems to have been enough to free up some memory for any swapin needed; but
the new write_begin lets tmpfs find or allocate the page (much nicer, since
grab_cache_page missed tmpfs pages in swapcache).
When allocating a fresh page, tmpfs respects loop's mapping_gfp_mask, which
has __GFP_IO|__GFP_FS stripped off, and throttle_vm_writeout is designed to
break out when __GFP_IO or GFP_FS is unset; but when tmfps swaps in,
read_swap_cache_async allocates with GFP_HIGHUSER_MOVABLE regardless of the
mapping_gfp_mask - hence the hang.
So, pass gfp_mask down the line from shmem_getpage to shmem_swapin to
swapin_readahead to read_swap_cache_async to add_to_swap_cache.
Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 5 Feb 2008 06:28:41 +0000 (22:28 -0800)]
swapin_readahead: move and rearrange args
swapin_readahead has never sat well in mm/memory.c: move it to mm/swap_state.c
beside its kindred read_swap_cache_async. Why were its args in a different
order? rearrange them. And since it was always followed by a
read_swap_cache_async of the target page, fold that in and return struct
page*. Then CONFIG_SWAP=n no longer needs valid_swaphandles and
read_swap_cache_async stubs.
Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 5 Feb 2008 06:28:40 +0000 (22:28 -0800)]
swapin_readahead: excise NUMA bogosity
For three years swapin_readahead has been cluttered with fanciful CONFIG_NUMA
code, advancing addr, and stepping on to the next vma at the boundary, to line
up the mempolicy for each page allocation.
It _might_ be a good idea to allocate swap more according to vma layout; but
the fact is, that's not how we do it at all, 2.6 even less than 2.4: swap is
allocated as needed for pages as they sink to the bottom of the inactive LRUs.
Sometimes that may match vma layout, but not so often that it's worth going
to these misleading vma->vm_next lengths: rip all that out.
Originally I intended to retain the incrementation of addr, but correct its
initial value: valid_swaphandles generally supplies an offset below the target
addr (this is readaround rather than readahead), but addr has not been
adjusted accordingly, so in the interleave case it has usually been allocating
the target page from the "wrong" node (though that may not matter very much).
But look at the equivalent shmem_swapin code: either by oversight or by
design, though it has all the apparatus for choosing a new mempolicy per page,
it uses the same idx throughout, choosing the same mempolicy and interleave
node for each page of the cluster.
Which is actually a much better strategy: each node has its own LRUs and its
own kswapd, so if you're betting on any particular relationship between swap
and node, the best bet is that nearby swap entries belong to pages from the
same node - even when the mempolicy of the target page is to interleave. And
examining a map of nodes corresponding to swap entries on a numa=fake system
bears this out. (We could later tweak swap allocation to make it even more
likely, but this patch is merely about removing cruft.)
So, neither adjust nor increment addr in swapin_readahead, and then
shmem_swapin can use it too; the pseudo-vma to pass policy need only be set up
once per cluster, and so few fields of pvma are used, let's skip the memset -
from shmem_alloc_page also.
Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <ak@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ken Chen [Tue, 5 Feb 2008 06:28:36 +0000 (22:28 -0800)]
hugetlb: allow sticky directory mount option
Allow sticky directory mount option for hugetlbfs. This allows admin
to create a shared hugetlbfs mount point for multiple users, while
prevent accidental file deletion that users may step on each other.
It is similiar to default tmpfs mount option, or typical option used
on /tmp.
Signed-off-by: Ken Chen <kenchen@google.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Adam Litke <agl@us.ibm.com> Cc: David Gibson <hermes@gibson.dropbear.id.au> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The constructor for buffer_head slabs was removed recently. We need the
constructor back in slab defrag in order to insure that slab objects always
have a definite state even before we allocated them.
I think we mistakenly merged the removal of the constuctor into a cleanup
patch. You (ie: akpm) had a test that showed that the removal of the
constructor led to a small regression. The prior state makes things easier
for slab defrag.
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The page array is repeatedly indexed both in vunmap and vmalloc_area_node().
Add a temporary variable to make it easier to read (and easier to patch
later).
Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
is_vmalloc_addr(): Check if an address is within the vmalloc boundaries
Checking if an address is a vmalloc address is done in a couple of places.
Define a common version in mm.h and replace the other checks.
Again the include structures suck. The definition of VMALLOC_START and
VMALLOC_END is not available in vmalloc.h since highmem.c cannot be included
there.
Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
i386: Resolve dependency of asm-i386/pgtable.h on highmem.h
pgtable.h does not include highmem.h but uses various constants from
highmem.h. We cannot include highmem.h because highmem.h will in turn include
many other include files that also depend on pgtable.h
So move the definitions from highmem.h into pgtable.h.
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We already have page table manipulation for vmalloc in vmalloc.c. Move the
vmalloc_to_page() function there as well.
Move the definitions for vmalloc related functions in mm.h to a newly created
section. A better place would be vmalloc.h but mm.h is basic and may depend
on these functions. An alternative would be to include vmalloc.h in mm.h
(like done for vmstat.h).
Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Zeros two segments of the page. It takes the position where to
start and end the zeroing which avoids length calculations and
makes code clearer.
zero_user_segment(page, start, end)
Same for a single segment.
zero_user(page, start, length)
Length variant for the case where we know the length.
We remove the zero_user_page macro. Issues:
1. Its a macro. Inline functions are preferable.
2. The KM_USER0 macro is only defined for HIGHMEM.
Having to treat this special case everywhere makes the
code needlessly complex. The parameter for zeroing is always
KM_USER0 except in one single case that we open code.
Avoiding KM_USER0 makes a lot of code not having to be dealing
with the special casing for HIGHMEM anymore. Dealing with
kmap is only necessary for HIGHMEM configurations. In those
configurations we use KM_USER0 like we do for a series of other
functions defined in highmem.h.
Since KM_USER0 is depends on HIGHMEM the existing zero_user_page
function could not be a macro. zero_user_* functions introduced
here can be be inline because that constant is not used when these
functions are called.
Also extract the flushing of the caches to be outside of the kmap.
[akpm@linux-foundation.org: fix nfs and ntfs build]
[akpm@linux-foundation.org: fix ntfs build some more] Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Steven French <sfrench@us.ibm.com> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Cc: <linux-ext4@vger.kernel.org> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Anton Altaparmakov <aia21@cantab.net> Cc: Mark Fasheh <mark.fasheh@oracle.com> Cc: David Chinner <dgc@sgi.com> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Cc: Steven French <sfrench@us.ibm.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Brownell [Tue, 5 Feb 2008 06:28:28 +0000 (22:28 -0800)]
gpiolib: avr32 at32ap platform support
Teach AVR32 to use the "GPIO Library" when exposing its GPIOs, so that signals
on external chips (like GPIO expanders) can easily be used.
This mostly reorganizes some existing logic, with two minor changes in
behavior:
- The PSR registers are used instead of the previous "gpio_mask" values,
matching AT91 behavior and removing some duplication between that role
and that of "pinmux_mask".
- NR_IRQs grew to acommodate a bank of external GPIOs. Eventually this
number should probably become a board-specific config option.
There's a debugfs dump of status for the built-in GPIOs, showing which pins
have deglitching, pullups, or open drain drive enabled, as well as the ID
string used when requesting each IRQ.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Acked-by: Haavard Skinnemoen <hskinnemoen@atmel.com> Cc: Jean Delvare <khali@linux-fr.org> Cc: Eric Miao <eric.miao@marvell.com> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Philipp Zabel <philipp.zabel@gmail.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ben Gardner <bgardner@wabtec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
eric miao [Tue, 5 Feb 2008 06:28:27 +0000 (22:28 -0800)]
deprecate obsolete pca9539 driver
Use drivers/gpio/pca9539.c instead.
Signed-off-by: eric miao <eric.miao@marvell.com> Acked-by: Ben Gardner <bgardner@wabtec.com> Acked-by: Jean Delvare <khali@linux-fr.org> Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Haavard Skinnemoen <hskinnemoen@atmel.com> Cc: Philipp Zabel <philipp.zabel@gmail.com> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
eric miao [Tue, 5 Feb 2008 06:28:26 +0000 (22:28 -0800)]
gpiolib: pca9539 i2c gpio expander support
This adds a new-style I2C driver with basic support for the sixteen bit
PCA9539 GPIO expanders. These chips have multiple registers, push-pull output
drivers, and (not supported in this patch) pin change interrupts.
Board-specific code must provide "pca9539_platform_data" with each chip's
"i2c_board_info". That provides the GPIO numbers to be used by that chip, and
callbacks for board-specific setup/teardown logic.
Derived from drivers/i2c/chips/pca9539.c (which has no current known users).
This is faster and simpler; it uses 16-bit register access, and cache the
OUTPUT and DIRECTION registers for fast access
Signed-off-by: eric miao <eric.miao@marvell.com> Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Acked-by: Jean Delvare <khali@linux-fr.org> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Haavard Skinnemoen <hskinnemoen@atmel.com> Cc: Philipp Zabel <philipp.zabel@gmail.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ben Gardner <bgardner@wabtec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Brownell [Tue, 5 Feb 2008 06:28:25 +0000 (22:28 -0800)]
mcp23s08 spi gpio expander support
Basic driver for 8-bit SPI based MCP23S08 GPIO expander, without support for
IRQs or the shared chipselect mechanism.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Cc: Jean Delvare <khali@linux-fr.org> Cc: Eric Miao <eric.miao@marvell.com> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Haavard Skinnemoen <hskinnemoen@atmel.com> Cc: Philipp Zabel <philipp.zabel@gmail.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ben Gardner <bgardner@wabtec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Brownell [Tue, 5 Feb 2008 06:28:24 +0000 (22:28 -0800)]
gpiolib: pcf857x i2c gpio expander support
This is a new-style I2C driver for most common 8 and 16 bit I2C based
"quasi-bidirectional" GPIO expanders: pcf8574 or pcf8575, and several
compatible models (mostly faster, supporting I2C at up to 1 MHz).
The driver exposes the GPIO signals using the platform-neutral GPIO
programming interface, so they are easily accessed by other kernel code. The
lack of such a flexible kernel API has been a big factor in the proliferation
of board-specific drivers for these chips... stuff that rarely makes it
upstream since it's so ugly. This driver will let such boards use standard
calls.
Since it's a new-style driver, these devices must be configured as part of
board-specific init. That eliminates the need for error-prone manual
configuration of module parameters, and makes compatibility with legacy
drivers (pcf8574.c, pc8575.c) for these chips easier (there's a clear
either/or disjunction).
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Acked-by: Jean Delvare <khali@linux-fr.org> Cc: Eric Miao <eric.miao@marvell.com> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Haavard Skinnemoen <hskinnemoen@atmel.com> Cc: Philipp Zabel <philipp.zabel@gmail.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ben Gardner <bgardner@wabtec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Philipp Zabel [Tue, 5 Feb 2008 06:28:22 +0000 (22:28 -0800)]
gpiolib support for the PXA architecture
This adds gpiolib support for the PXA architecture:
- move all GPIO API functions from generic.c into gpio.c
- convert the gpio_get/set_value macros into inline functions
This makes it easier to hook up GPIOs provided by external chips like
ASICs and CPLDs.
Signed-off-by: Philipp Zabel <philipp.zabel@gmail.com> Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Jean Delvare <khali@linux-fr.org> Cc: Eric Miao <eric.miao@marvell.com> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Haavard Skinnemoen <hskinnemoen@atmel.com> Cc: Ben Gardner <bgardner@wabtec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Minor ARM fixup from David Brownell folded into this ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Brownell [Tue, 5 Feb 2008 06:28:21 +0000 (22:28 -0800)]
gpiolib: update Documentation/gpio.txt
Update Documentation/gpio.txt, primarily to include the new "gpiolib"
infrastructure.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Cc: Jean Delvare <khali@linux-fr.org> Cc: Eric Miao <eric.miao@marvell.com> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Haavard Skinnemoen <hskinnemoen@atmel.com> Cc: Philipp Zabel <philipp.zabel@gmail.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ben Gardner <bgardner@wabtec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Brownell [Tue, 5 Feb 2008 06:28:20 +0000 (22:28 -0800)]
gpiolib: add gpio provider infrastructure
Provide new implementation infrastructure that platforms may choose to use
when implementing the GPIO programming interface. Platforms can update their
GPIO support to use this. In many cases the incremental cost to access a
non-inlined GPIO should be less than a dozen instructions, with the memory
cost being about a page (total) of extra data and code. The upside is:
* Providing two features which were "want to have (but OK to defer)" when
GPIO interfaces were first discussed in November 2006:
- A "struct gpio_chip" to plug in GPIOs that aren't directly supported
by SOC platforms, but come from FPGAs or other multifunction devices
using conventional device registers (like UCB-1x00 or SM501 GPIOs,
and southbridges in PCs with more open specs than usual).
- Full support for message-based GPIO expanders, where registers are
accessed through sleeping I/O calls. Previous support for these
"cansleep" calls was just stubs. (One example: the widely used
pcf8574 I2C chips, with 8 GPIOs each.)
* Including a non-stub implementation of the gpio_{request,free}() calls,
making those calls much more useful. The diagnostic labels are also
recorded given DEBUG_FS, so /sys/kernel/debug/gpio can show a snapshot
of all GPIOs known to this infrastructure.
The driver programming interfaces introduced in 2.6.21 do not change at all;
this infrastructure is entirely below those covers.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Jean Delvare <khali@linux-fr.org> Cc: Eric Miao <eric.miao@marvell.com> Cc: Haavard Skinnemoen <hskinnemoen@atmel.com> Cc: Philipp Zabel <philipp.zabel@gmail.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ben Gardner <bgardner@wabtec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Brownell [Tue, 5 Feb 2008 06:28:17 +0000 (22:28 -0800)]
gpiolib: add drivers/gpio directory
Add an empty drivers/gpio directory for gpiolib infrastructure and GPIO
expanders. It will be populated by later patches.
This won't be the only place to hold such gpio_chip code. Many external chips
add a few GPIOs as secondary functionality (such as MFD drivers) and platform
code frequently needs to closely integrate GPIO and IRQ support.
This is placed *early* in the build/link sequence since it's common for other
drivers to depend on GPIOs to do their work, so they must be initialized early
in the device_initcall() sequence.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Acked-by: Jean Delvare <khali@linux-fr.org> Cc: Eric Miao <eric.miao@marvell.com> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Haavard Skinnemoen <hskinnemoen@atmel.com> Cc: Philipp Zabel <philipp.zabel@gmail.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ben Gardner <bgardner@wabtec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I think that it would be better to set up segment boundary in the same
way as we did for the maximum segment size. That is, removing
shost->dma_boundary and LLDs call pci_set_dma_seg_boundary (or its
friends).
Then __scsi_alloc_queue() can set up both limits in the same way:
killing dma_boundary in scsi_host_template needs a large patch for
libata (dma_boundary is used by only libata and sym53c8xx). I'll send
a patch to do that if it is acceptable. James and Jeff?
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: James Bottomley <James.Bottomley@steeleye.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Greg KH <greg@kroah.com> Cc: Jeff Garzik <jeff@garzik.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
FUJITA Tomonori [Tue, 5 Feb 2008 06:28:16 +0000 (22:28 -0800)]
iommu sg merging: swiotlb: respect the segment boundary limits
This patch makes swiotlb not allocate a memory area spanning LLD's segment
boundary.
is_span_boundary() judges whether a memory area spans LLD's segment boundary.
If map_single finds such a area, map_single tries to find the next available
memory area.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: James Bottomley <James.Bottomley@steeleye.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Greg KH <greg@kroah.com> Cc: Jeff Garzik <jeff@garzik.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
FUJITA Tomonori [Tue, 5 Feb 2008 06:28:13 +0000 (22:28 -0800)]
iommu sg merging: add accessors for segment_boundary_mask in device_dma_parameters()
This adds new accessors for segment_boundary_mask in device_dma_parameters
structure in the same way I did for max_segment_size. So we can easily change
where to place struct device_dma_parameters in the future.
dma_get_segment boundary returns 0xffffffff if dma_parms in struct device
isn't set up properly. 0xffffffff is the default value used in the block
layer and the scsi mid layer.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: James Bottomley <James.Bottomley@steeleye.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Greg KH <greg@kroah.com> Cc: Jeff Garzik <jeff@garzik.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Previously, during initialization of the IOMMU tables, the last entry
at each 4GB boundary is marked as used since there are many adapters
which cannot handle DMAing across any 4GB boundary.
The IOMMU doesn't allocate a memory area spanning LLD's segment
boundary anymore. The segment boundary of devices are set to 4GB by
default. So we can remove 4GB boundary protection now.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Jeff Garzik <jeff@garzik.org> Cc: James Bottomley <James.Bottomley@steeleye.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
FUJITA Tomonori [Tue, 5 Feb 2008 06:28:07 +0000 (22:28 -0800)]
iommu sg: add IOMMU helper functions for the free area management
This adds IOMMU helper functions for the free area management. These
functions take care of LLD's segment boundary limit for IOMMUs. They would be
useful for IOMMUs that use bitmap for the free area management.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Jeff Garzik <jeff@garzik.org> Cc: James Bottomley <James.Bottomley@steeleye.com> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
FUJITA Tomonori [Tue, 5 Feb 2008 06:28:05 +0000 (22:28 -0800)]
iommu sg merging: call blk_queue_segment_boundary in __scsi_alloc_queue
request_queue and device struct must have the same value of a segment
size limit. This patch adds blk_queue_segment_boundary in
__scsi_alloc_queue so LLDs don't need to call both
blk_queue_segment_boundary and set_dma_max_seg_size. A LLD can change
the default value (64KB) can call device_dma_parameters accessors like
pci_set_dma_max_seg_size when allocating scsi_host.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Acked-by: Jeff Garzik <jeff@garzik.org> Cc: James Bottomley <James.Bottomley@steeleye.com> Acked-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
IOMMUs merges scatter/gather segments without considering a low level
driver's restrictions. The problem is that IOMMUs can't access to the
limitations because they are in request_queue.
This patchset introduces a new structure, device_dma_parameters,
including dma information. A pointer to device_dma_parameters is added
to struct device. The bus specific structures (like pci_dev) includes
device_dma_parameters. Low level drivers can use dma_set_max_seg_size
to tell IOMMUs about the restrictions.
We can move more dma stuff in struct device (like dma_mask) to struct
device_dma_parameters later (needs some cleanups before that).
This includes patches for all the IOMMUs that could merge sg (x86_64,
ppc, IA64, alpha, sparc64, and parisc) though only the ppc patch was
tested. The patches for other IOMMUs are only compile tested.
This patch:
Add a new structure, device_dma_parameters, including dma information. A
pointer to device_dma_parameters is added to struct device.
- there are only max_segment_size and segment_boundary_mask there but we'll
move more dma stuff in struct device (like dma_mask) to struct
device_dma_parameters later. segment_boundary_mask is not supported yet.
- new accessors for the dma parameters are added. So we can easily change
where to place struct device_dma_parameters in the future.
- dma_get_max_seg_size returns 64K if dma_parms in struct device isn't set
up properly. 64K is the default max_segment_size in the block layer.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Acked-by: Jeff Garzik <jeff@garzik.org> Cc: James Bottomley <James.Bottomley@steeleye.com> Acked-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mark A. Greer [Tue, 5 Feb 2008 06:27:54 +0000 (22:27 -0800)]
serial: MPSC: set baudrate when BRG divider is set.
The clock to generate the desired baudrate with the MPSC is first divided
by the Baud Rate Generator (BRG) and then by the MPSC itself. So, when the
BRG divider is changed, the MPSC divider must also be changed to generate
the correct baudrate. During MPSC initialization, the BRG divider is
changed but the MPSC divider isn't changed until much later. This results
in some printk's coming out garbled. To fix that, set the MPSC divider at
the same time that the BRG divider is changed.
Signed-off-by: Mark A. Greer <mgreer@mvista.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alan Cox [Tue, 5 Feb 2008 06:27:53 +0000 (22:27 -0800)]
serial: speed setup failure reporting
Invalid speeds are forced to 9600. Update the code for this to encode new
style baud rates properly.
Signed-off-by: Alan Cox <alan@redhat.com> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Russell King [Tue, 5 Feb 2008 06:27:52 +0000 (22:27 -0800)]
serial: avoid stalling suspend if serial port won't drain
Some ports seem to be unable to drain their transmitters on shut down. Such a
problem can occur if the port is programmed for hardware imposed flow control,
characters are in the FIFO but the CTS signal is inactive.
Normally, this isn't a problem because most places where we wait for the
transmitter to drain have a time-out. However, there is no timeout in the
suspend path.
Give a port 30ms to drain; this is an arbitary value chosen to avoid long
delays if there are many such ports in the system, while giving a reasonable
chance for a single port to drain. Should a port not drain within this
timeout, issue a warning.
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Acked-by: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Russell King [Tue, 5 Feb 2008 06:27:51 +0000 (22:27 -0800)]
serial: avoid waking up closed serial ports on resume
When we boot, serial ports remain in low power mode until they're used either
by userspace or for the kernel console.
However, if you suspend the system, and then resume, all serial ports will be
taken out of low power mode. This is bad news for embedded devices where this
can mean higher power consumption.
Only bring a serial port out of low power mode if the port is being used as
the kernel console, or is in use by userspace.
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Acked-by: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Will Newton [Tue, 5 Feb 2008 06:27:50 +0000 (22:27 -0800)]
8250.c: support specifying DW APB UARTs in device platform_data
Allow the private_data field to be specified in platform_data for the
standard 8250/16550 UART. This field is used by DW APB type UARTs and
without this patch it's only possible to set this field when registering
the port by hand. If private_data is not set then the driver will
potentially oops with a NULL pointer dereference.
Signed-off-by: Will Newton <will.newton@gmail.com> Acked-by: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Krauth J. <krauth.julien@addi-data.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Alan Cox <alan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jiri Olsa [Tue, 5 Feb 2008 06:27:48 +0000 (22:27 -0800)]
drivers/serial/s3c2410.c: remove dead config symbols
Remove dead config symbol.
Signed-off-by: Jiri Olsa <olsajiri@gmail.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ben Dooks <ben@fluff.org> Cc: Alan Cox <alan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yinghai Lu [Tue, 5 Feb 2008 06:27:46 +0000 (22:27 -0800)]
serial: keep the DTR setting for serial console.
with reverting "x86, serial: convert legacy COM ports to platform devices",
we will have the serial console before the port is probled again.
uart_add_one_port==>uart_configure_port==>set_mcttrl(port, 0) will clear
the DTR setting by uart_set_options(). then I will lose my output from
serial console again.
So try to keep DTR in uart_configure_port()
Signed-off-by: Yinghai Lu <yinghai.lu@sun.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Andi Kleen <ak@suse.de> Cc: Bjorn Helgaas <bjorn.helgaas@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Julia Lawall [Tue, 5 Feb 2008 06:27:46 +0000 (22:27 -0800)]
drivers/pcmcia: add missing pci_dev_get
pci_get_slot does a pci_dev_get, so pci_dev_put needs to be called in an
error case.
An extract of the semantic match used to find the problem is as follows:
(http://www.emn.fr/x-info/coccinelle/)
// <smpl>
@@
type find1.T,T1,T2;
identifier find1.E;
statement find1.S;
expression x1,x2,x3;
expression find1.test;
int ret != 0;
@@
T E;
...
(
* E = pci_get_slot(...);
if (E == NULL) S
|
* if ((E = pci_get_slot(...)) == NULL)
S
)
... when != pci_dev_put(...,(T1)E,...)
when != if (E != NULL) { ... pci_dev_put(...,(T1)E,...); ...}
when != x1 = (T1)E
when != E = x3;
when any
if (test) {
... when != pci_dev_put(...,(T2)E,...)
when != if (E != NULL) { ... pci_dev_put(...,(T2)E,...); ...}
when != x2 = (T2)E
(
* return;
|
* return ret;
)
}
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Julia Lawall [Tue, 5 Feb 2008 06:27:44 +0000 (22:27 -0800)]
drivers/pcmcia: Add missing iounmap
of_iomap calls ioremap, and so should be matched with an iounmap. At the
two error returns, the result of calling of_iomap is only stored in a local
variable, so these error paths need to call iounmap. Furthermore, this
function ultimately stores the result of of_iomap in an array that is local
to the file. These values should be iounmapped at some point. I have
added a corresponding call to iounmap at the end of the function
m8xx_remove.
The problem was found using the following semantic match.
(http://www.emn.fr/x-info/coccinelle/)
// <smpl>
@@
type T,T1,T2;
identifier E;
statement S;
expression x1,x2,x3;
int ret;
@@
T E;
...
* E = of_iomap(...);
if (E == NULL) S
... when != iounmap(...,(T1)E,...)
when != if (E != NULL) { ... iounmap(...,(T1)E,...); ...}
when != x1 = (T1)E
when != E = x3;
when any
if (...) {
... when != iounmap(...,(T2)E,...)
when != if (E != NULL) { ... iounmap(...,(T2)E,...); ...}
when != x2 = (T2)E
(
* return;
|
* return ret;
)
}
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Vitaly Bordug <vitb@kernel.crashing.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Olof Johansson <olof@lixom.net> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Kumar Gala <galak@kernel.crashing.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>