David Chinner [Thu, 24 May 2007 05:26:31 +0000 (15:26 +1000)]
[XFS] Lazy Superblock Counters
When we have a couple of hundred transactions on the fly at once, they all
typically modify the on disk superblock in some way.
create/unlink/mkdir/rmdir modify inode counts, allocation/freeing modify
free block counts.
When these counts are modified in a transaction, they must eventually lock
the superblock buffer and apply the mods. The buffer then remains locked
until the transaction is committed into the incore log buffer. The result
of this is that with enough transactions on the fly the incore superblock
buffer becomes a bottleneck.
The result of contention on the incore superblock buffer is that
transaction rates fall - the more pressure that is put on the superblock
buffer, the slower things go.
The key to removing the contention is to not require the superblock fields
in question to be locked. We do that by not marking the superblock dirty
in the transaction. IOWs, we modify the incore superblock but do not
modify the cached superblock buffer. In short, we do not log superblock
modifications to critical fields in the superblock on every transaction.
In fact we only do it just before we write the superblock to disk every
sync period or just before unmount.
This creates an interesting problem - if we don't log or write out the
fields in every transaction, then how do the values get recovered after a
crash? The answer is simple - we keep enough duplicate, logged information
in other structures that we can reconstruct the correct count after log
recovery has been performed.
It is the AGF and AGI structures that contain the duplicate information;
after recovery, we walk every AGI and AGF and sum their individual
counters to get the correct value, and we do a transaction into the log to
correct them. An optimisation of this is that if we have a clean unmount
record, we know the value in the superblock is correct, so we can avoid
the summation walk under normal conditions and so mount/recovery times do
not change under normal operation.
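A minimal sketch of the recovery-time summation described above, modelled in Python rather than kernel C. The class and field names (`AllocGroup`, `agi_inode_count`, and so on) are hypothetical stand-ins for the per-AG AGI/AGF counters, not the real XFS structures, and the sketch leaves out the free space btree block accounting.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AllocGroup:
    agi_inode_count: int  # inodes allocated in this AG (hypothetical AGI field)
    agi_free_inodes: int  # free inodes in this AG (hypothetical AGI field)
    agf_free_blocks: int  # free blocks in this AG (hypothetical AGF field)


@dataclass
class Superblock:
    icount: int
    ifree: int
    fdblocks: int


def recover_counters(sb: Superblock, groups: List[AllocGroup],
                     clean_unmount: bool) -> Superblock:
    """Return superblock counters that can be trusted after mount."""
    if clean_unmount:
        # A clean unmount record means the on-disk values are already correct,
        # so the summation walk is skipped.
        return sb
    # After log recovery, rebuild the counts from the logged per-AG values.
    return Superblock(
        icount=sum(g.agi_inode_count for g in groups),
        ifree=sum(g.agi_free_inodes for g in groups),
        fdblocks=sum(g.agf_free_blocks for g in groups),
    )
```

The clean-unmount shortcut is what keeps mount time unchanged in the common case; only a crash pays for the walk over every AG.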
One wrinkle that was discovered during development was that the blocks
used in the freespace btrees are never accounted for in the AGF counters.
This was once a valid optimisation to make; when the filesystem is full,
the free space btrees are empty and consume no space. Hence when it
matters, the "accounting" is correct. But that means that when we do the
AGF summations, we would not have a correct count and xfs_check would
complain. Hence a new counter was added to track the number of blocks used
by the free space btrees. This is an *on-disk format change*.
As a result of this, lazy superblock counters are a mkfs option and at the
moment on linux there is no way to convert an old filesystem. This is
possible - xfs_db can be used to twiddle the right bits and then
xfs_repair will do the format conversion for you. Similarly, you can
convert backwards as well. At some point we'll add functionality to
xfs_admin to do the bit twiddling easily....
David Chinner [Thu, 24 May 2007 05:22:19 +0000 (15:22 +1000)]
[XFS] Make hole punching at EOF atomic.
If hole punching at EOF is done as two steps (i.e. truncate then extend)
the file is in a transient state between the two steps where an
application can see the incorrect file size. Punching a hole to EOF needs
to be treated in the same way as all other hole punching cases so that the
file size is never seen to change.
David Chinner [Thu, 24 May 2007 05:21:57 +0000 (15:21 +1000)]
[XFS] Fix vmalloc leak on mount/unmount.
When setting the length of the iclogbuf to write out we should just be
changing the desired byte count rather than completely reassociating the buffer
memory with the buffer. Reassociating the buffer memory changes the
apparent length of the buffer and hence when we free the buffer, we don't
free all the vmap()d space we originally allocated.
David Chinner [Mon, 14 May 2007 08:24:09 +0000 (18:24 +1000)]
[XFS] Sleeping with the ilock waiting for I/O completion is Bad.
Recent fixes to the filesystem freezing code introduced a vn_iowait call
in the middle of the sync code. Unfortunately, at the point where this
call was added we are holding the ilock. The ilock is needed by I/O
completion for unwritten extent conversion and now updating the file size.
Hence I/O cannot complete if we hold the ilock while waiting for I/O
completion.
Nathan Scott [Mon, 14 May 2007 08:24:02 +0000 (18:24 +1000)]
[XFS] Don't grow filesystems past the size they can index.
When growing a filesystem we don't check to see if the new size overflows
the page cache index range, so we can do silly things like grow a
filesystem past 16TB on a 32-bit system. Check new filesystem sizes against the
limits the kernel can support.
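The 16TB figure follows from the page cache index arithmetic, which can be sketched as below. This is an illustration in Python, assuming a 32-bit page index and 4096-byte pages; the function names are hypothetical and the real kernel check may differ in detail.

```python
TIB = 1 << 40


def max_indexable_bytes(index_bits: int, page_size: int) -> int:
    """Largest byte range addressable with a page index of index_bits bits."""
    return (1 << index_bits) * page_size


def can_grow(new_size_bytes: int, index_bits: int = 32,
             page_size: int = 4096) -> bool:
    """Reject a grow request that the page cache index could not cover."""
    return new_size_bytes <= max_indexable_bytes(index_bits, page_size)
```

With these assumptions, 2^32 pages of 2^12 bytes gives 2^44 bytes, i.e. 16TB, so any larger target size is refused on such a system.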
Many block drivers (aoe, iscsi) really want refcountable pages in bios,
which is what almost everyone sends down. XFS unfortunately has a few
places where it sends down buffers that may come from kmalloc, which
breaks them.
Alan Cox [Wed, 11 Jul 2007 00:22:27 +0000 (17:22 -0700)]
lots-of-architectures: enable arbitrary speed tty support
Add the termios2 structure ready for enabling on most platforms. One or
two like Sparc are plain weird so have been left alone. Most can use the
same structure as ktermios for termios2 (ie the newer ioctl uses the
structure matching the current kernel structure).
Signed-off-by: Alan Cox <alan@redhat.com>
Cc: Bryan Wu <bryan.wu@analog.com>
Cc: Ian Molton <spyro@f2s.com>
Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Matthew Wilcox <willy@debian.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelianov [Wed, 11 Jul 2007 00:22:26 +0000 (17:22 -0700)]
Make common helpers for seq_files that work with list_heads
Many places in kernel use seq_file API to iterate over a regular list_head.
The code for such iteration is identical in all the places, so it's worth
introducing common helpers.
This makes code about 300 lines smaller:
The first version of this patch made the helper functions static inline
in the seq_file.h header. This patch moves them to the fs/seq_file.c as
Andrew proposed. The vmlinux .text section sizes are as follows:
2.6.22-rc1-mm1: 0x001794d5
with the previous version: 0x00179505
with this patch: 0x00179135
The config file used was make allnoconfig with the "y" inclusion of all
the possible options to make the files modified by the patch compile plus
drivers I have on the test node.
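The position-based iteration that each seq_file user had been open-coding can be sketched as follows. This is a Python model, not the kernel helpers themselves: `list_start`, `list_next` and `show_all` are hypothetical names, and a plain Python list stands in for a list_head chain.

```python
from typing import Callable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def list_start(items: List[T], pos: int) -> Optional[T]:
    """Return the entry at offset pos, or None when pos is past the end."""
    return items[pos] if 0 <= pos < len(items) else None


def list_next(items: List[T], pos: int) -> Tuple[Optional[T], int]:
    """Advance the position by one and return the new entry and position."""
    pos += 1
    return list_start(items, pos), pos


def show_all(items: List[T], show: Callable[[T], str]) -> List[str]:
    """Drive the start/next pair the way a seq_file read loop would."""
    out = []
    pos = 0
    entry = list_start(items, pos)
    while entry is not None:
        out.append(show(entry))
        entry, pos = list_next(items, pos)
    return out
```

Once the start and next steps live in one shared place, each caller only has to supply its own `show` step, which is where the line savings come from.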
Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Richard Purdie [Wed, 11 Jul 2007 00:22:24 +0000 (17:22 -0700)]
Add LZO1X algorithm to the kernel
This is a hybrid version of the patch to add the LZO1X compression
algorithm to the kernel. Nitin and I have merged the best parts of
the various patches to form this version which we're both happy with (and
are jointly signing off).
The performance of this version is equivalent to the original minilzo code
it was based on. Bytecode comparisons have also been made on ARM, i386 and
x86_64 with favourable results.
There are several users of LZO lined up including jffs2, crypto and reiser4
since it's much faster than zlib.
Signed-off-by: Nitin Gupta <nitingupta910@gmail.com>
Signed-off-by: Richard Purdie <rpurdie@openedhand.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/drzeus/mmc
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/drzeus/mmc:
mmc: at91_mci: fix hanging and rework to match flowcharts
mmc: at91_mci typo
sdhci: Fix "Unexpected interrupt" handling
mmc: fix silly copy-and-paste error
mmc: move layer init and workqueue to core file
mmc: refactor host class handling
mmc: refactor bus operations
sdhci: add ene controller id
mmc: bounce requests for simple hosts
Merge branch 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6
* 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6: (40 commits)
bonding/bond_main.c: make 2 functions static
ps3: gigabit ethernet driver for PS3, take3
[netdrvr] Fix dependencies for ax88796 ne2k clone driver
eHEA: Capability flag for DLPAR support
Remove sk98lin ethernet driver.
sunhme.c:quattro_pci_find() must be __devinit
bonding / ipv6: no addrconf for slaves separately from master
atl1: remove write-only var in tx handler
macmace: use "unsigned long flags;"
Cleanup usbnet_probe() return value handling
netxen: deinline and sparse fix
eeprom_93cx6: shorten pulse timing to match spec (bis)
phylib: Add Marvell 88E1112 phy id
phylib: cleanup marvell.c a bit
AX88796 network driver
IOC3: Switch to pci refcounting safe APIs
e100: Fix Tyan motherboard e100 not receiving IPMI commands
QE Ethernet driver writes to wrong register to mask interrupts
rrunner.c:rr_init() must be __devinit
tokenring/3c359.c:xl_init() must be __devinit
...
Merge branch 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/libata-dev
* 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/libata-dev: (32 commits)
[libata] sata_mv: print out additional chip info during probe
[libata] Use ATA_UDMAx standard masks when filling driver's udma_mask info
[libata] AHCI: Add support for Marvell AHCI-like chips (initially 6145)
[libata] Clean up driver udma_mask initializers
libata: Support chips with 64K PRD quirk
Add a PCI ID for santa rosa's PATA controller.
sata_sil24: sil24_interrupt() micro-optimisation
Add irq_flags to struct pata_platform_info
sata_promise: cleanups
[libata] pata_ixp4xx: kill unused var
ata_piix: fix pio/mwdma programming
[libata] ahci: minor internal cleanups
[ATA] Add named constant for ATAPI command DEVICE RESET
[libata] sata_sx4, sata_via: minor documentation updates
[libata] ahci: minor internal cleanups
[libata] ahci: Factor out SATA port init into a separate function
[libata] pata_sil680: minor cleanups from benh
[libata] sata_sx4: named constant cleanup
[libata] pata_ixp4xx: convert to new EH
[libata] pdc_adma: Reorder initializers with a couple structs
...
Merge branch 'for-linus' of git://git390.osdl.marist.edu/pub/scm/linux-2.6
* 'for-linus' of git://git390.osdl.marist.edu/pub/scm/linux-2.6:
[S390] vmlogrdr function annotation.
[S390] s390: rename CPU_IDLE to S390_CPU_IDLE
[S390] cio: Remove prototype for non-existing function cmf_reset().
[S390] zcrypt: fix request timeout handling
[S390] system call optimization.
[S390] dasd: Avoid compile warnings on !CONFIG_DASD_PROFILE
[S390] Remove volatile from atomic_t
[S390] Program check in diag 210 under 31 bit
[S390] Bogomips calculation for 64 bit.
[S390] smp: Merge smp_count_cpus() and smp_get_save_areas().
[S390] zcore: Fix __user annotation.
[S390] fixed cdl-format detection.
[S390] sclp: Test facility list before executing a service call.
[S390] sclp: introduce some new interfaces.
[S390] Fixed comment typo.
[S390] vmcp cleanup
Merge branch 'splice-2.6.23' of git://git.kernel.dk/data/git/linux-2.6-block
* 'splice-2.6.23' of git://git.kernel.dk/data/git/linux-2.6-block:
pipe: add documentation and comments
pipe: change the ->pin() operation to ->confirm()
Remove remnants of sendfile()
xip sendfile removal
splice: completely document external interface with kerneldoc
sendfile: remove bad_sendfile() from bad_file_ops
shmem: convert to using splice instead of sendfile()
relay: use splice_to_pipe() instead of open-coding the pipe loop
pipe: allow passing around of ops private pointer
splice: divorce the splice structure/function definitions from the pipe header
splice: relay support
sendfile: convert nfsd to splice_direct_to_actor()
sendfile: convert nfs to using splice_read()
loop: convert to using splice_direct_to_actor() instead of sendfile()
splice: add void cookie to the actor data
sendfile: kill generic_file_sendfile()
sendfile: remove .sendfile from filesystems that use generic_file_sendfile()
sys_sendfile: switch to using ->splice_read, if available
vmsplice: add vmsplice-to-user support
splice: abstract out actor data
Merge branch 'trivial-2.6.23' of git://git.kernel.dk/data/git/linux-2.6-block
* 'trivial-2.6.23' of git://git.kernel.dk/data/git/linux-2.6-block:
Documentation/block/barrier.txt is not in sync with the actual code: - blk_queue_ordered() no longer has a gfp_mask parameter - blk_queue_ordered_locked() no longer exists - sd_prepare_flush() looks slightly different
Use list_for_each_entry() instead of list_for_each() in the block device
Make a "menuconfig" out of the Kconfig objects "menu, ..., endmenu",
block/Kconfig already has its own "menuconfig" so remove these
Use menuconfigs instead of menus, so the whole menu can be disabled at once
cfq-iosched: fix async queue behaviour
unexport bio_{,un}map_user
Remove legacy CDROM drivers
[PATCH] fix request->cmd == INT cases
cciss: add new controller support for P700m
[PATCH] Remove acsi.c
[BLOCK] drop unnecessary bvec rewinding from flush_dry_bio_endio
[PATCH] cdrom_sysctl_info fix
blk_hw_contig_segment(): bad segment size checks
[TRIVIAL PATCH] Kill blk_congestion_wait() stub for !CONFIG_BLOCK
Adrian Bunk [Mon, 9 Jul 2007 18:51:12 +0000 (11:51 -0700)]
bonding/bond_main.c: make 2 functions static
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Chad Tindel <ctindel@users.sourceforge.net>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
This is the third submission of the network driver for PS3.
The differences from the previous one are:
- renamed source file names so that their prefix can match
with the module name
- added cbe-oss-dev@ozlabs.org line for MAINTAINERS file
- changed some copyright comments
If there are no more comments, please apply for 2.6.23.
Thank you
--
Subject: PS3: Ethernet driver
From: Masakazu Mokuno <mokuno@sm.sony.co.jp>
Add Gigabit Ethernet support for the PS3 game console. The module will
be called ps3_gelic.
Jay Vosburgh [Mon, 9 Jul 2007 17:42:47 +0000 (10:42 -0700)]
bonding / ipv6: no addrconf for slaves separately from master
At present, when a device is enslaved to bonding, if ipv6 is
active then addrconf will be initiated on the slave (because it is closed
then opened during the enslavement processing). This causes DAD and RS
packets to be sent from the slave. These packets in turn can confuse
switches that perform ipv6 snooping, causing them to incorrectly update
their forwarding tables (if, e.g., the slave being added is an inactive
backup that won't be used right away) and direct traffic away from the
active slave to a backup slave (where the incoming packets will be
dropped).
This patch alters the behavior so that addrconf will only run on
the master device itself. I believe this is logically correct, as it
prevents slaves from having an IPv6 identity independent from the
master. This is consistent with the IPv4 behavior for bonding.
This is accomplished by (a) having bonding set IFF_SLAVE sooner
in the enslavement processing than currently occurs (before open, not
after), and (b) having ipv6 addrconf ignore UP and CHANGE events on
slave devices.
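The two parts (a) and (b) can be sketched together as below. This is a simplified Python model, assuming string event names and a bare flags integer in place of a net_device; only the `IFF_SLAVE` value (0x800) is taken from the Linux headers, and the function names are hypothetical.

```python
IFF_SLAVE = 0x800  # value of IFF_SLAVE in Linux's <linux/if.h>

# Hypothetical stand-ins for the netdevice notifier events.
NETDEV_UP = "up"
NETDEV_CHANGE = "change"
NETDEV_DOWN = "down"


def addrconf_should_handle(dev_flags: int, event: str) -> bool:
    """Part (b): ignore UP and CHANGE events on slave devices."""
    if dev_flags & IFF_SLAVE and event in (NETDEV_UP, NETDEV_CHANGE):
        return False
    return True


def enslave(dev_flags: int, open_device) -> int:
    """Part (a): set IFF_SLAVE before the device is opened, not after."""
    dev_flags |= IFF_SLAVE
    open_device(dev_flags)  # the UP event is now raised with IFF_SLAVE set
    return dev_flags
```

The ordering matters: if the flag were set after the open, the UP event would arrive with the flag clear and addrconf would still run on the slave.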
The eql driver also uses the IFF_SLAVE flag. I inspected eql,
and I believe this change is reasonable for its usage of IFF_SLAVE, but
I did not test it.
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
Peter Korsgaard [Mon, 2 Jul 2007 22:46:42 +0000 (00:46 +0200)]
Cleanup usbnet_probe() return value handling
usbnet_probe() handles a positive return value from the driver bind()
function as success, but will later only setup the status handler if the
return value was zero, leading to confusion. Patch adjusts this to accept
positive values as success in both checks.
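The mismatch between the two checks can be illustrated with a small model. This Python sketch is an assumption-laden simplification: `probe_old` and `probe_new` are hypothetical names, and the real function also depends on whether the driver provides a status callback at all.

```python
from typing import Tuple


def probe_old(bind_status: int) -> Tuple[bool, bool]:
    """Before the patch: returns (probe succeeded, status handler set up)."""
    ok = bind_status >= 0                   # positive counts as success here
    status_ready = ok and bind_status == 0  # but only zero counts here
    return ok, status_ready


def probe_new(bind_status: int) -> Tuple[bool, bool]:
    """After the patch: both checks accept any non-negative value."""
    ok = bind_status >= 0
    status_ready = ok
    return ok, status_ready
```

A positive return from bind() is the interesting case: the old logic reports success but silently skips the status handler, while the new logic treats it the same as zero.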
Signed-off-by: Peter Korsgaard <jacmet@sunsite.dk>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
Get rid of dubious casts to (void *) which cause a sparse warning.
Also move a largeish function from inline to the one file that uses the
code; the compiler can then decide to inline it.
Compile tested only.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
Olof Johansson [Tue, 3 Jul 2007 21:23:46 +0000 (16:23 -0500)]
phylib: cleanup marvell.c a bit
Simplify the marvell driver init a bit: Make the supported devices an
array instead of explicitly registering each structure. This makes it
considerably easier to add new devices down the road.
Signed-off-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
Robert P. J. Day [Tue, 10 Jul 2007 10:37:56 +0000 (06:37 -0400)]
[MIPS] PNX8550: Cleanup proc code.
Here's a slightly cleaner way of creating the /proc structure for the
pnx8550. Mostly, it creates a directory with default mode 555, since the
one you're creating is mode 444, which is somewhat unusual for a directory
under /proc.
[MIPS] Change names of local variables to silence sparse
This patch is a workaround for these sparse warnings:
linux/include/linux/calc64.h:25:17: warning: symbol '__quot' shadows an earlier one
linux/include/linux/calc64.h:25:17: originally declared here
linux/include/linux/calc64.h:25:17: warning: symbol '__mod' shadows an earlier one
linux/include/linux/calc64.h:25:17: originally declared here
[MIPS] Add debugfs files to show fpuemu statistics
Export contents of struct mips_fpu_emulator_stats via debugfs.
There is no way to read these statistics for now but they (at least
the "emulated" count) might be sometimes useful for performance tuning
on FPU-less CPUs.
[MIPS] rbtx4938: Fix secondary PCIC and glue internal NICs
* Fix pci ops for secondary PCIC
* Do not reserve 1MB for PCI MEM region (leave PCIBIOS_MIN_MEM zero)
* Use platform_device to provide ethernet addresses for internal NICs.
(background: TX49XX SoCs include PCI NIC (TC35815 compatible)
connected via its internal PCI bus, but the NIC's PROM interface is
not connected to SEEPROM. So we must provide its ethernet address
by another way.)
* Check return value of early_read_config_word()
Atsushi Nemoto [Fri, 29 Jun 2007 13:34:53 +0000 (22:34 +0900)]
[MIPS] tc35815: Load MAC address via platform_device
TX49XX SoCs include PCI NIC (TC35815 compatible) connected via its
internal PCI bus, but the NIC's PROM interface is not connected to
SEEPROM. So we must provide its ethernet address by another way.
Atsushi Nemoto [Mon, 25 Jun 2007 16:14:01 +0000 (01:14 +0900)]
[MIPS] Make ioremap() work on TX39/49 special unmapped segment
TX39XX and TX49XX have "reserved" segment in CKSEG3 area.
0xff000000-0xff3fffff on TX49XX and 0xff000000-0xfffeffff on TX39XX
are reserved (unmapped, uncached). Controllers on these SoCs are
placed in this segment.
This patch adds plat_ioremap() and plat_iounmap() to override default
behavior and implement these hooks for TX39/TX49.
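A rough model of what such a hook could look like, using the address ranges quoted above. This is a Python sketch under assumptions: the `RESERVED` table and the convention of returning `None` to fall back to the default ioremap() path are illustrative choices, not the kernel's actual code.

```python
from typing import Optional

# Reserved (unmapped, uncached) ranges quoted in the commit message.
RESERVED = {
    "TX49XX": (0xFF000000, 0xFF3FFFFF),
    "TX39XX": (0xFF000000, 0xFFFEFFFF),
}


def plat_ioremap(soc: str, phys_addr: int, size: int) -> Optional[int]:
    """Return the address unchanged if the whole range lies in the reserved
    segment; return None to signal that the generic path should be used."""
    lo, hi = RESERVED[soc]
    last = phys_addr + size - 1
    if size > 0 and lo <= phys_addr and last <= hi:
        return phys_addr  # already usable as is, no mapping needed
    return None
```

Because the segment is unmapped, a request that falls entirely inside it needs no page table setup, which is why the hook can hand the address straight back.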
- use RTC_CLASS instead of GEN_RTC
- get rid of ds1216 in favour of a RTC_CLASS driver
- use correct console device for older RM400
- use physical addresses for 82596 device
- use 128 byte L1 cache line size (this is needed because most of the
SNI caches are using 128 byte L2 cache lines)
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
[MIPS] Enable support for the userlocal hardware register
This cuts the cost of RDHWR $29, which is used to obtain the TLS pointer
and has so far been emulated in software, down to a single cycle
operation.
This is an optimised implementation of early printk() for the DECstation.
After the recent conversion to a MIPS-specific generic routine using a
character-by-character output the performance dropped significantly.
This change reverts to the previous speed -- even at 9600 bps of the
serial console the difference is visible to the naked eye; I presume for a
framebuffer it is even worse (it may depend on exactly which one is used
though).
Additionally the change includes a fix for a problem that the old
implementation had -- the format used would not actually limit the length
of the string output. This new implementation uses a local buffer to deal
with it -- even with this additional copying it is much faster than the
generic function.
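One way a local buffer can enforce the length limit is sketched below. This is a Python model under assumptions: `console_write`, the 128-byte default buffer size and the chunking strategy are hypothetical, and `prom_print` stands in for whatever firmware output routine the driver calls.

```python
from typing import Callable


def console_write(s: bytes, n: int, prom_print: Callable[[bytes], None],
                  bufsize: int = 128) -> None:
    """Emit at most n bytes of s, one buffer-sized chunk per firmware call."""
    if bufsize < 2:
        raise ValueError("bufsize must leave room for data plus a terminator")
    end = min(n, len(s))
    pos = 0
    while pos < end:
        # Copy into the local buffer, keeping one slot for the terminator
        # that a C implementation would need.
        chunk = s[pos:min(pos + bufsize - 1, end)]
        prom_print(chunk)
        pos += len(chunk)
```

The copy costs something, but each firmware call now handles a whole chunk rather than a single character, and nothing beyond the requested length is ever printed.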
Plus this driver is registered much earlier than the generic one,
allowing one to see critical messages, such as one about an incorrect CPU
setting used, that are produced beforehand. :-)
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Franck Bui-Huu [Mon, 4 Jun 2007 15:46:35 +0000 (17:46 +0200)]
[MIPS] Fix PHYS_OFFSET for 64-bits kernels with 32-bits symbols
The current implementation of __pa() for 64-bits kernels with 32-bits
symbols is broken. In this configuration, we need 2 values for
PAGE_OFFSET, one in XKPHYS and the other in CKSEG0 space.
When the value in CKSEG0 space is used, it doesn't take PHYS_OFFSET
into account. Even worse, we can't redefine this value.
The patch restores CPHYSADDR(), but in __pa()'s implementation, because
it removes the need for two PAGE_OFFSET values.
OTOH, CPHYSADDR() is quite bad when dealing with mapped kernels. So
this patch assumes there's no need to deal with such kernel in 64-bits
world.
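For illustration, the masking that CPHYSADDR() performs on an unmapped CKSEG0 address can be modelled as below. This is a Python sketch; the 0x1fffffff mask and the sign-extended CKSEG0 base are the usual MIPS values, stated here as assumptions rather than taken from this patch.

```python
CPHYSADDR_MASK = 0x1FFFFFFF   # keeps the low 29 bits (assumed MIPS definition)
CKSEG0 = 0xFFFFFFFF80000000   # sign-extended CKSEG0 base on a 64-bit kernel


def cphysaddr(vaddr: int) -> int:
    """Strip the segment bits from an unmapped CKSEG0 address."""
    return vaddr & CPHYSADDR_MASK
```

Since the mask simply drops the segment bits, no PAGE_OFFSET constant is involved at all, which matches the point that a second PAGE_OFFSET is no longer needed. It also shows why the approach does not suit mapped kernels, whose addresses are not a fixed bit pattern on top of the physical address.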
Franck Bui-Huu [Mon, 4 Jun 2007 15:46:33 +0000 (17:46 +0200)]
[MIPS] Make PAGE_OFFSET aware of PHYS_OFFSET
For platforms that use PHYS_OFFSET and do not use a mapped kernel,
this patch automatically adds PHYS_OFFSET into PAGE_OFFSET.
Therefore they no longer need to redefine PAGE_OFFSET.
Mapped kernels need to redefine PAGE_OFFSET anyway.
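The arithmetic behind folding PHYS_OFFSET into PAGE_OFFSET can be worked through as follows. This is a Python sketch assuming a 32-bit unmapped kernel whose cached base is 0x80000000; `CAC_BASE`, `pa` and `va` are illustrative names for the idea, not a copy of the kernel macros.

```python
CAC_BASE = 0x80000000  # assumed cached, unmapped segment base (32-bit)


def page_offset(phys_offset: int) -> int:
    """PAGE_OFFSET with the platform's PHYS_OFFSET folded in."""
    return CAC_BASE + phys_offset


def pa(vaddr: int, phys_offset: int) -> int:
    """Virtual to physical for an unmapped kernel address."""
    return vaddr - page_offset(phys_offset) + phys_offset


def va(paddr: int, phys_offset: int) -> int:
    """Physical to virtual, the inverse of pa()."""
    return paddr + page_offset(phys_offset) - phys_offset
```

For a platform whose RAM starts at physical 0x10000000, `page_offset` evaluates to 0x90000000 without the platform defining anything beyond its PHYS_OFFSET.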
No point in adding yet another #ifdef for Loongson since all this mask is
being used for is converting an XKPHYS address into a physical address
anyway. So replace all definitions by one with the highest architectural
possible value.
Florian Fainelli [Tue, 22 May 2007 19:44:42 +0000 (21:44 +0200)]
[MIPS] Add generic GPIO to Au1x00
This patch adds support for the generic GPIO API to Au1x00 boards. It requires
the generic GPIO patch for MIPS boards by Yoichi Yuasa. Now there is a MIPS
target using it, can you queue this patchset for 2.6.22? Thank you very
much in advance.
Atsushi Nemoto [Tue, 29 May 2007 15:38:07 +0000 (00:38 +0900)]
[MIPS] Simplify missing-syscalls for N32 and O32
Use standard missing-syscalls with EXTRA_CFLAGS instead of duplicating
the command. And move the archprepare rule before the archclean rule.
Suggested by Franck Bui-Huu. Also add "echo" to show the target ABI.
Alan Cox [Tue, 10 Jul 2007 16:05:16 +0000 (17:05 +0100)]
IOC3: Switch to pci refcounting safe APIs
Convert the IOC3 driver to use ref counting pci interfaces so that we can
obsolete the (usually unsafe) pci_find_{slot/device} interfaces and avoid
future authors writing hotplug-unsafe device drivers.
Signed-off-by: Alan Cox <alan@redhat.com>
Build fixes: Signed-off-by: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Jeff Garzik <jeff@garzik.org>