pilppa.com Git - linux-2.6-omap-h63xx.git/log
linux-2.6-omap-h63xx.git
17 years ago IB/mthca: Fix posting >255 recv WRs for Tavor
Michael S. Tsirkin [Mon, 14 May 2007 04:26:51 +0000 (07:26 +0300)]
IB/mthca: Fix posting >255 recv WRs for Tavor

Fix posting lists of > 255 receive WRs for Tavor: rq.next_ind must
be updated each doorbell, otherwise the next doorbell will use an
incorrect index.

Found by Ronni Zimmermann at Mellanox.

Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
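
A minimal sketch of the indexing issue described in the commit above, not the mthca code itself. It models a receive queue as a ring, rings a doorbell after every 255 posted WRs, and lets each doorbell carry the index of the first WR in its batch. All names here (struct rq_model, post_recv_list, ring_doorbell, RQ_SIZE, MAX_WRS_PER_DOORBELL) are hypothetical.

```c
#include <stddef.h>

#define RQ_SIZE 1024u                /* hypothetical ring size (power of two) */
#define MAX_WRS_PER_DOORBELL 255u    /* batch limit taken from the commit title */
#define MAX_DOORBELLS 16u            /* capacity of the recording array below */

struct rq_model {
    unsigned next_ind;                      /* index of the next free slot */
    unsigned doorbell_start[MAX_DOORBELLS]; /* start index carried by each doorbell */
    size_t doorbells;                       /* number of doorbells rung so far */
};

/* Record the start index a doorbell would hand to the hardware. */
static void ring_doorbell(struct rq_model *rq, unsigned start)
{
    if (rq->doorbells < MAX_DOORBELLS)
        rq->doorbell_start[rq->doorbells++] = start;
}

/* Post nwr receive WRs, ringing one doorbell per full batch. */
static void post_recv_list(struct rq_model *rq, unsigned nwr)
{
    unsigned ind = rq->next_ind;
    unsigned pending = 0;

    for (unsigned i = 0; i < nwr; i++) {
        ind = (ind + 1) & (RQ_SIZE - 1);
        if (++pending == MAX_WRS_PER_DOORBELL) {
            ring_doorbell(rq, rq->next_ind);
            rq->next_ind = ind;   /* without this, the next doorbell reuses a stale index */
            pending = 0;
        }
    }
    if (pending) {
        ring_doorbell(rq, rq->next_ind);
        rq->next_ind = ind;
    }
}
```

In this model, dropping the assignment inside the loop makes every mid-list doorbell report the same start index, which is the kind of stale index the commit message describes.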
17 years ago RDMA/cma: Add check to validate that cm_id is bound to a device
Sean Hefty [Mon, 7 May 2007 18:49:27 +0000 (11:49 -0700)]
RDMA/cma: Add check to validate that cm_id is bound to a device

Several checks in the rdma_cm check against the state of the
cm_id, but only to validate that the cm_id is bound to an underlying
transport specific CM and an RDMA device.  Make the check explicit
in what we're trying to check for, since we're not synchronizing
against the cm_id state.

This will allow a user to disconnect a cm_id or reject a connection
after receiving a device removal event.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
17 years ago RDMA/cma: Fix synchronization with device removal in cma_iw_handler
Sean Hefty [Mon, 7 May 2007 18:49:12 +0000 (11:49 -0700)]
RDMA/cma: Fix synchronization with device removal in cma_iw_handler

The cma_iw_handler needs to validate the state of the rdma_cm_id before
processing a new connection request to ensure that a device removal is
not already being processed for the same rdma_cm_id.  Without the state
check, the user can receive simultaneous callbacks for the same cm_id, or
a callback after they've destroyed the cm_id.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
17 years ago RDMA/cma: Simplify device removal handling code
Sean Hefty [Mon, 7 May 2007 18:49:00 +0000 (11:49 -0700)]
RDMA/cma: Simplify device removal handling code

Add a new routine and rename another to encapsulate common code for
synchronizing with device removal.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
17 years ago IB/ehca: Disable scaling code by default, bump version number
Joachim Fenkes [Wed, 9 May 2007 11:48:31 +0000 (13:48 +0200)]
IB/ehca: Disable scaling code by default, bump version number

- Scaling code is still considered experimental, so disable it by default
- Increase version to SVNEHCA_0023

Signed-off-by: Joachim Fenkes <fenkes@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
17 years ago IB/ehca: Beautify sysfs attribute code and fix compiler warnings
Joachim Fenkes [Wed, 9 May 2007 11:48:25 +0000 (13:48 +0200)]
IB/ehca: Beautify sysfs attribute code and fix compiler warnings

eHCA's sysfs attributes are now being created via sysfs_create_group(),
making the process neatly table-driven. The return value is checked, thus
fixing a few compiler warnings.

Signed-off-by: Joachim Fenkes <fenkes@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
17 years ago IB/ehca: Remove _irqsave, move #ifdef
Joachim Fenkes [Wed, 9 May 2007 11:48:20 +0000 (13:48 +0200)]
IB/ehca: Remove _irqsave, move #ifdef

- In ehca_process_eq(), we're IRQ safe throughout the whole function, so we
  don't need another _irqsave in the middle of flight.

- take_over_work() is only called by comp_pool_callback(), so it can move
  into the same #ifdef block.

Signed-off-by: Joachim Fenkes <fenkes@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
17 years ago IB/ehca: Fix AQP0/1 QP number
Hoang-Nam Nguyen [Wed, 9 May 2007 11:48:11 +0000 (13:48 +0200)]
IB/ehca: Fix AQP0/1 QP number

AQP0/1 should report qp_num={0|1} and the actual QP# should be stored
in struct ehca_qp, not the other way round.

Signed-off-by: Joachim Fenkes <fenkes@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
17 years ago IB/ehca: Correctly set GRH mask bit in ehca_modify_qp()
Joachim Fenkes [Wed, 9 May 2007 11:48:01 +0000 (13:48 +0200)]
IB/ehca: Correctly set GRH mask bit in ehca_modify_qp()

The driver needs to always supply the "GRH present" flag to the
hypervisor, whether it's true or false. Not supplying it (i.e. not
setting the corresponding mask bit) amounts to a "perhaps", which we
don't want.

Signed-off-by: Joachim Fenkes <fenkes@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
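
A minimal sketch of the mask-plus-value pattern the commit above describes, assuming a request structure in which a field is honoured only when its mask bit is set. The names (struct modify_req, MASK_GRH_PRESENT, set_grh) are hypothetical and are not the ehca or hypervisor interface.

```c
#include <stdint.h>

#define MASK_GRH_PRESENT (1u << 0)   /* hypothetical mask bit for the flag */

struct modify_req {
    uint32_t update_mask;  /* which fields the receiver should honour */
    uint8_t grh_present;   /* meaningful only when its mask bit is set */
};

/* Always mark the flag as supplied, whether it is true or false. */
static void set_grh(struct modify_req *req, int has_grh)
{
    req->grh_present = has_grh ? 1 : 0;
    req->update_mask |= MASK_GRH_PRESENT;  /* set in both cases, never left as "perhaps" */
}
```

Setting the mask bit only when has_grh is true would leave the false case unspecified, which is the "perhaps" the commit message wants to avoid.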
17 years ago IB/ehca: Serialize hypervisor calls in ehca_register_mr()
Stefan Roscher [Wed, 9 May 2007 11:47:56 +0000 (13:47 +0200)]
IB/ehca: Serialize hypervisor calls in ehca_register_mr()

Some pSeries hypervisor versions show a race condition in the allocate
MR hCall.  Serialize this call per adapter to circumvent this problem.

Signed-off-by: Joachim Fenkes <fenkes@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
17 years ago IB/ipath: Shadow the gpio_mask register
Arthur Jones [Thu, 10 May 2007 19:10:49 +0000 (12:10 -0700)]
IB/ipath: Shadow the gpio_mask register

Once upon a time, GPIO interrupts were rare.  But then a chip bug in
the waldo series forced the use of a GPIO interrupt to signal packet
reception.  This greatly increased the frequency of GPIO interrupts
which have the gpio_mask bits set on the waldo chips.  Other bits in
the gpio_status register are used for I2C clock and data lines; these
bits are usually on.  An "unlikely" annotation leftover from the old
days was improperly applied to these bits, and an unnecessary chip
mmio read was being accessed in the interrupt fast path on waldo.

Remove the stagnant unlikely annotation in the interrupt handler and
keep a shadow copy of the gpio_mask register to avoid the slow mmio
read when testing for interruptible GPIO bits.

Signed-off-by: Arthur Jones <arthur.jones@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
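
A minimal sketch of the shadow-register idea from the commit above, assuming a device whose mask register is slow to read back. The structure and function names (struct dev_model, write_gpio_mask, gpio_irq_pending) are hypothetical; a plain variable stands in for the MMIO register.

```c
#include <stdint.h>

struct dev_model {
    volatile uint64_t *gpio_mask_reg; /* stands in for the slow MMIO register */
    uint64_t gpio_mask_shadow;        /* cached copy kept in ordinary memory */
};

/* Every write goes to both the register and the shadow, keeping them in sync. */
static void write_gpio_mask(struct dev_model *d, uint64_t mask)
{
    *d->gpio_mask_reg = mask;
    d->gpio_mask_shadow = mask;
}

/* Fast path: test the status bits against the shadow, with no register read. */
static int gpio_irq_pending(const struct dev_model *d, uint64_t gpio_status)
{
    return (gpio_status & d->gpio_mask_shadow) != 0;
}
```

The sketch only stays correct if all writes to the mask go through write_gpio_mask(); a write that bypasses it would leave the shadow stale.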
17 years ago IB/mlx4: Fix uninitialized spinlock for 32-bit archs
Jack Morgenstein [Sun, 13 May 2007 14:18:23 +0000 (17:18 +0300)]
IB/mlx4: Fix uninitialized spinlock for 32-bit archs

uar_lock spinlock was used in mlx4_ib_cq_arm without being initialized
(this only affects 32-bit archs, because uar_lock is not used on
64-bit archs and MLX4_INIT_DOORBELL_LOCK() is a NOP).

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
17 years ago mlx4_core: Remove unused doorbell_lock
Roland Dreier [Sun, 13 May 2007 15:54:18 +0000 (08:54 -0700)]
mlx4_core: Remove unused doorbell_lock

struct mlx4_priv.doorbell_lock is never used, so delete it.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
17 years ago net: Trivial MLX4_DEBUG dependency fix.
Paul Mundt [Thu, 10 May 2007 03:50:28 +0000 (12:50 +0900)]
net: Trivial MLX4_DEBUG dependency fix.

CONFIG_MLX4_DEBUG works out to a def_bool y for those that have
CONFIG_EMBEDDED set.  Make it depend on MLX4_CORE.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
17 years ago Merge branch 'for-linus' of master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband
Linus Torvalds [Thu, 10 May 2007 02:40:09 +0000 (19:40 -0700)]
Merge branch 'for-linus' of master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband

* 'for-linus' of master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband:
  IB/mlx4: Add a driver Mellanox ConnectX InfiniBand adapters
  IB: Put rlimit accounting struct in struct ib_umem
  IB/uverbs: Export ib_umem_get()/ib_umem_release() to modules

17 years ago Merge branch 'usb-move' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6
Linus Torvalds [Thu, 10 May 2007 01:53:12 +0000 (18:53 -0700)]
Merge branch 'usb-move' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6

* 'usb-move' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6:
  Move USB network drivers to drivers/net/usb.

17 years ago Merge branch 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik...
Linus Torvalds [Thu, 10 May 2007 01:52:45 +0000 (18:52 -0700)]
Merge branch 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/libata-dev

* 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/libata-dev:
  Doc Fix: remove mention of combined mode-related kernel parameters
  libata: fix kernel-doc parameters
  Fix pata_qdi.c probe code
  pata_scc: fix compilation
  sata_via: add missing PM hooks
  sata_nv: fix ADMA freeze/thaw/irq_clear issues
  pata_pcmcia.c: add card ident for jvc cdrom
  sata_promise: SATAII-150/300 TX4 port numbering fix
  sata_promise: fix another error decode regression
  libata-acpi: fix _GTF command protocol for ATAPI devices

17 years ago Revert "md: improve partition detection in md array"
Linus Torvalds [Thu, 10 May 2007 01:51:36 +0000 (18:51 -0700)]
Revert "md: improve partition detection in md array"

This reverts commit 5b479c91da90eef605f851508744bfe8269591a0.

Quoth Neil Brown:

  "It causes an oops when auto-detecting raid arrays, and it doesn't
   seem easy to fix.

   The array may not be 'open' when do_md_run is called, so
   bdev->bd_disk might be NULL, so bd_set_size can oops.

   This whole approach of opening an md device before it has been
   assembled just seems to get more and more painful.  I think I'm going
   to have to come up with something clever to provide both backward
   compatibility with usage expectation, and sane integration into the
   rest of the kernel."

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago Move USB network drivers to drivers/net/usb.
Jeff Garzik [Thu, 10 May 2007 01:31:55 +0000 (21:31 -0400)]
Move USB network drivers to drivers/net/usb.

It is preferable to group drivers by usage (net, scsi, ATA, ...) rather than
by bus.  When reviewing drivers, the [PCI|USB|PCMCIA|...] maintainer
is probably less qualified on networking issues than a networking
maintainer.  Also, from a practical standpoint, chips often
appear on multiple buses, which is why we do not put drivers into
drivers/pci/net.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
17 years ago Doc Fix: remove mention of combined mode-related kernel parameters
Jesse Barnes [Tue, 1 May 2007 21:34:39 +0000 (14:34 -0700)]
Doc Fix: remove mention of combined mode-related kernel parameters

Looks like you removed the combined_mode quirk (yay!) but didn't update
kernel-parameters.txt...  might confuse people.  Here's a patch to remove
mention of it from the documentation.

Signed-off-by: Jesse Barnes <jesse.barnes@intel.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
17 years ago libata: fix kernel-doc parameters
Randy Dunlap [Wed, 2 May 2007 00:35:55 +0000 (17:35 -0700)]
libata: fix kernel-doc parameters

Warning(linux-2.6.21-git4//drivers/ata/libata-core.c:904): No description found for parameter 'new_sectors'
Warning(linux-2.6.21-git4//drivers/ata/libata-core.c:941): No description found for parameter 'new_sectors'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
17 years ago Fix pata_qdi.c probe code
Samuel Thibault [Thu, 3 May 2007 09:30:25 +0000 (11:30 +0200)]
Fix pata_qdi.c probe code

There is a small typo in the probe code of pata_qdi.c, here is a patch.

Signed-off-by: Jeff Garzik <jeff@garzik.org>
17 years ago pata_scc: fix compilation
Alexey Dobriyan [Thu, 3 May 2007 19:44:59 +0000 (23:44 +0400)]
pata_scc: fix compilation

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
17 years ago sata_via: add missing PM hooks
Tejun Heo [Fri, 4 May 2007 13:30:34 +0000 (15:30 +0200)]
sata_via: add missing PM hooks

For some reason, sata_via is missing PM hooks.  Add them.  Spotted by
Jeroen Janssen <jeroen.janssen@gmail.com>.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: Jeroen Janssen <jeroen.janssen@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
17 years ago sata_nv: fix ADMA freeze/thaw/irq_clear issues
Robert Hancock [Sat, 5 May 2007 21:36:36 +0000 (15:36 -0600)]
sata_nv: fix ADMA freeze/thaw/irq_clear issues

This patch fixes some problems with ADMA-capable controllers with
regard to freeze, thaw and irq_clear libata callbacks. Freeze and
thaw didn't switch the ADMA-specific interrupts on or off, and more
critically the irq_clear function didn't respect the restriction that
the notifier clear registers for both ports have to be written at
the same time even when only one port is being cleared. This could
result in timeouts on one port when error handling (i.e. as a result
of hotplug) occurred on the other port.

As well, this fixes some issues in the interrupt handler: we shouldn't
check any ADMA status if the port has ADMA switched off because of
an ATAPI device, and it also checks to see if any ADMA interrupt has
been raised even when we are in port-register mode.

Signed-off-by: Robert Hancock <hancockr@shaw.ca>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
17 years ago pata_pcmcia.c: add card ident for jvc cdrom
Richard Kennedy [Tue, 8 May 2007 14:20:56 +0000 (15:20 +0100)]
pata_pcmcia.c: add card ident for jvc cdrom

update pata_pcmcia to add card ident for JVC MP-CDX1 cdrom drive
card info:
PRODID_1="KME"
PRODID_2="KXLC005"
PRODID_3="00"
MANFID=0032,2904

Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
17 years ago sata_promise: SATAII-150/300 TX4 port numbering fix
Mikael Pettersson [Sun, 6 May 2007 20:14:01 +0000 (22:14 +0200)]
sata_promise: SATAII-150/300 TX4 port numbering fix

There is a known problem with sata_promise on SATAII-150/300 TX4
controller cards: it enumerates drives in an order that differs
from the port numbers printed on the controller cards. However,
Promise's BIOS and Linux driver both get the order right.

I investigated Promise's Linux driver (v1.01.0.23), and found
that it explicitly changes the mapping from logical port number
to ATA engine MMIO address on the SATAII TX4 cards. It does this
on all SATAII TX4 cards, without inspecting revision etc. The
SATAII TX2plus cards continue to use the same mapping that was
used for the first-generation chips.

This patch updates sata_promise to use the new port number to
ATA engine mapping on SATAII TX4 cards, which fixes the drive
enumeration order problem on those cards. Tested on several
1st and 2nd generation TX2plus and TX4 chips.

Signed-off-by: Mikael Pettersson <mikpe@it.uu.se>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
17 years ago sata_promise: fix another error decode regression
Mikael Pettersson [Sun, 6 May 2007 20:12:31 +0000 (22:12 +0200)]
sata_promise: fix another error decode regression

The sata_promise error decode update changed pdc_host_intr()
to return and not complete the qc after detecting an error.
Unfortunately not completing the qc:s causes them to always
time out on error, which is wrong and has nasty side-effects.

This patch updates pdc_error_intr() to call ata_port_abort(),
similar to ahci and sata_sil24. Doing this is important as it
makes EH see the original error and not a bogus timeout.

Signed-off-by: Mikael Pettersson <mikpe@it.uu.se>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
17 years ago libata-acpi: fix _GTF command protocol for ATAPI devices
Tejun Heo [Sun, 22 Apr 2007 17:06:46 +0000 (02:06 +0900)]
libata-acpi: fix _GTF command protocol for ATAPI devices

_GTF command is never ATA_PROT_ATAPI_NODATA whether the device is
ATAPI or not.  It's always ATA_PROT_NODATA.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
17 years ago atl1: add netconsole support
Alexey Dobriyan [Wed, 9 May 2007 14:52:35 +0000 (18:52 +0400)]
atl1: add netconsole support

Copied from b44 driver, but it works:

netconsole: device eth0 not up yet, forcing it
atl1: eth0 link is up 100 Mbps full duplex
netconsole: network logging started

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
17 years ago Fix hang on IBM Token Ring PCMCIA card ejection
Paul Walmsley [Wed, 9 May 2007 16:47:16 +0000 (10:47 -0600)]
Fix hang on IBM Token Ring PCMCIA card ejection

Ejecting a PCMCIA IBM Token Ring card that has not had its dev->open()
called will reliably trigger an uninitialized spinlock oops when
spinlock debugging is enabled. The system then hangs, occasionally
softlockup oopsing.  Apparently ibmtr.c:tok_interrupt() doesn't expect
to be called before tok_open(), but tok_interrupt() gets called anyway
when the card is ejected.  So, set an already-existing flag which
causes tok_interrupt() to bail out early upon card ejection. Tested by
inserting and removing the PCMCIA card several times.

Signed-off-by: Paul Walmsley <paul@booyaka.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
17 years ago skge: default WOL should be magic only (rev2)
Stephen Hemminger [Tue, 8 May 2007 20:36:20 +0000 (13:36 -0700)]
skge: default WOL should be magic only (rev2)

By default, the skge driver now enables wake on magic and wake on PHY.
This is a bad default (bug): wake on PHY means the machine will never shut
down if connected to a switch.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
17 years ago Merge branch 'upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/linville...
Jeff Garzik [Wed, 9 May 2007 22:54:49 +0000 (18:54 -0400)]
Merge branch 'upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6 into upstream

17 years ago Merge master.kernel.org:/pub/scm/linux/kernel/git/bart/ide-2.6
Linus Torvalds [Wed, 9 May 2007 22:41:31 +0000 (15:41 -0700)]
Merge master.kernel.org:/pub/scm/linux/kernel/git/bart/ide-2.6

* master.kernel.org:/pub/scm/linux/kernel/git/bart/ide-2.6:
  ide: fix PIO setup on resume for ATAPI devices
  ide: legacy PCI bus order probing fixes
  ide: add ide_proc_register_port()
  ide: add "initializing" argument to ide_register_hw()
  ide: cable detection fixes (take 2)
  ide: move IDE settings handling to ide-proc.c
  ide: split off ioctl handling from IDE settings (v2)
  ide: make /proc/ide/ optional
  ide: add ide_tune_dma() helper
  ide: rework the code for selecting the best DMA transfer mode (v3)
  ide: fix UDMA/MWDMA/SWDMA masks (v3)

17 years ago Merge git://git.linux-nfs.org/pub/linux/nfs-2.6
Linus Torvalds [Wed, 9 May 2007 22:29:58 +0000 (15:29 -0700)]
Merge git://git.linux-nfs.org/pub/linux/nfs-2.6

* git://git.linux-nfs.org/pub/linux/nfs-2.6:
  NFS: Kill the obsolete NFS_PARANOIA
  NFS: use __set_current_state()
  sunrpc: fix crash in rpc_malloc()
  NFS: Clean up NFSv4 XDR error message
  NFS: NFS client underestimates how large an NFSv4 SETATTR reply can be
  SUNRPC: Fix pointer arithmetic bug recently introduced in rpc_malloc/free
  NFS: Remove redundant check in nfs_check_verifier()
  NFS: Fix a jiffie wraparound issue

17 years ago ide: fix PIO setup on resume for ATAPI devices
Bartlomiej Zolnierkiewicz [Wed, 9 May 2007 22:01:11 +0000 (00:01 +0200)]
ide: fix PIO setup on resume for ATAPI devices

PIO should be restored also for ATAPI devices during resume, fix it.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago ide: legacy PCI bus order probing fixes
Bartlomiej Zolnierkiewicz [Wed, 9 May 2007 22:01:11 +0000 (00:01 +0200)]
ide: legacy PCI bus order probing fixes

IDE PCI host drivers should register themselves with IDE core only when
IDE driver is built-in, otherwise (IDE driver is modular and thus IDE PCI
host drivers are also modular) the code has no effect and just complicates
the probing.

Fix it by adding new config option CONFIG_IDEPCI_PCIBUS (defined only when
needed and invisible to the user) and covering by #ifdef/#endif the code
in question.  It turned out that "ide=reverse" was silently accepted but did
nothing when the IDE driver was modular; this is fixed now.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago ide: add ide_proc_register_port()
Bartlomiej Zolnierkiewicz [Wed, 9 May 2007 22:01:11 +0000 (00:01 +0200)]
ide: add ide_proc_register_port()

* create_proc_ide_interfaces() tries to add /proc entries for every probed
  and initialized IDE port, replace it by ide_proc_register_port() which does
  it only for the given port (also rename destroy_proc_ide_interface() to
  ide_proc_unregister_port() for consistency)

* convert {create,destroy}_proc_ide_interface[s]() users to use new functions

* pmac driver depended on proc_ide_create() to add /proc port entries, fix it

* au1xxx-ide, swarm and cs5520 drivers depended indirectly on ide-generic
  driver (CONFIG_IDE_GENERIC=y) to add port /proc entries, fix them

* there is now no need to add /proc entries for IDE ports in proc_ide_create()
  so don't do it

* proc_ide_create() needs now to be called before drivers are probed - fix it,
  while at it make proc_ide_create() create /proc "ide" directory

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago ide: add "initializing" argument to ide_register_hw()
Bartlomiej Zolnierkiewicz [Wed, 9 May 2007 22:01:10 +0000 (00:01 +0200)]
ide: add "initializing" argument to ide_register_hw()

Add "initializing" argument to ide_register_hw() and use it instead of ide.c
wide variable of the same name.  Update all users of ide_register_hw()
accordingly.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago ide: cable detection fixes (take 2)
Bartlomiej Zolnierkiewicz [Wed, 9 May 2007 22:01:10 +0000 (00:01 +0200)]
ide: cable detection fixes (take 2)

Tejun's recent eighty_ninty_three() fix has inspired me to do a more thorough
review of the cable detection code...

* print user-friendly warning about limiting the maximum transfer speed
  to UDMA33 (and the reason behind it) when 80-wire cable is not detected,
  also while at it cleanup eighty_ninty_three() a bit

* use eighty_ninty_three() in ide_ata66_check(), this actually fixes 3 bugs:
  - bit 14 (word 93 validity check) == 1 && bit 13 (80-wire cable test) == 1
    were used as 80-wire cable present test for CONFIG_IDEDMA_IVB=n case
    (please see FIXME comment in eighty_ninty_three() for more details)
  - CONFIG_IDEDMA_IVB=y/n cases were interchanged
  - check for SATA devices was missing

* remove private cable warnings from pdc_202xx{old,new} drivers now that core
  code provides this functionality (plus, in pdc202xx_new case the test could
  give false warnings for ATAPI devices because pdc202xx_new driver doesn't
  even support ATAPI DMA)

Cc: Tejun Heo <htejun@gmail.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago ide: move IDE settings handling to ide-proc.c
Bartlomiej Zolnierkiewicz [Wed, 9 May 2007 22:01:10 +0000 (00:01 +0200)]
ide: move IDE settings handling to ide-proc.c

* move
__ide_add_setting()
ide_add_setting()
__ide_remove_setting()
auto_remove_settings()
ide_find_setting_by_name()
ide_read_setting()
ide_write_setting()
set_xfer_rate()
ide_add_generic_settings()
ide_register_subdriver()
ide_unregister_subdriver()

  from ide.c to ide-proc.c

* set_{io_32bit,pio_mode,using_dma}() cannot be marked static now, fix it

* rename ide_[un]register_subdriver() to ide_proc_[un]register_driver(),
  update device drivers to use new names

* add CONFIG_IDE_PROC_FS=n versions of ide_proc_[un]register_driver()
  and ide_add_generic_settings()

* make ide_find_setting_by_name(), ide_{read,write}_setting()
  and ide_{add,remove}_proc_entries() static

* cover IDE settings code in device drivers with CONFIG_IDE_PROC_FS #ifdef,
  also while at it cover with CONFIG_IDE_PROC_FS #ifdef ide_driver_t.proc

* remove bogus comment from ide.h

* cover with CONFIG_IDE_PROC_FS #ifdef .proc and .settings in ide_drive_t

Besides saner code this patch results in the IDE core smaller by ~2 kB
(on x86-32) and IDE disk driver by ~1 kB (ditto) when CONFIG_IDE_PROC_FS=n.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago ide: split off ioctl handling from IDE settings (v2)
Bartlomiej Zolnierkiewicz [Wed, 9 May 2007 22:01:10 +0000 (00:01 +0200)]
ide: split off ioctl handling from IDE settings (v2)

* do write permission and min/max checks in ide_procset_t functions

* ide-disk.c: drive->id is always available so cleanup "multcount" setting
  accordingly

* ide-disk.c: "address" setting was incorrectly defined as type TYPE_INTA,
  fix it by using type TYPE_BYTE and updating ide_drive_t->addressing field,
  the bug didn't trigger because this IDE setting uses custom ->set function

* ide.c: add set_ksettings() for handling HDIO_SET_KEEPSETTINGS ioctl

* ide.c: add set_unmaskirq() for handling HDIO_SET_UNMASKINTR ioctl

* handle ioctls directly in generic_ide_ioctl() and idedisk_ioctl()
  instead of using IDE settings to deal with them

* remove no longer needed ide_find_setting_by_ioctl() and {read,write}_ioctl
  fields from ide_settings_t, also remove now unused TYPE_INTA handling

v2:
* add missing EXPORT_SYMBOL_GPL(ide_setting_sem) needed now for ide-disk

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago ide: make /proc/ide/ optional
Bartlomiej Zolnierkiewicz [Wed, 9 May 2007 22:01:09 +0000 (00:01 +0200)]
ide: make /proc/ide/ optional

All important information/features should be already available through
sysfs and ioctl interfaces.

Add CONFIG_IDE_PROC_FS (CONFIG_SCSI_PROC_FS rip-off) config option,
disabling it makes IDE driver ~5 kB smaller (on x86-32).

While at it add CONFIG_PROC_FS=n versions of proc_ide_{create,destroy}()
and remove no longer needed #ifdefs.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago ide: add ide_tune_dma() helper
Bartlomiej Zolnierkiewicz [Wed, 9 May 2007 22:01:09 +0000 (00:01 +0200)]
ide: add ide_tune_dma() helper

After reworking the code responsible for selecting the best DMA
transfer mode it is now possible to add generic ide_tune_dma() helper.

Convert some IDE PCI host drivers to use it (the ones left need more work).

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago ide: rework the code for selecting the best DMA transfer mode (v3)
Bartlomiej Zolnierkiewicz [Wed, 9 May 2007 22:01:08 +0000 (00:01 +0200)]
ide: rework the code for selecting the best DMA transfer mode (v3)

Depends on the "ide: fix UDMA/MWDMA/SWDMA masks" patch.

* add ide_hwif_t.udma_filter hook for filtering UDMA mask
  (use it in alim15x3, hpt366, siimage and serverworks drivers)
* add ide_max_dma_mode() for finding best DMA mode for the device
  (loosely based on some older libata-core.c code)
* convert ide_dma_speed() users to use ide_max_dma_mode()
* make ide_rate_filter() take "ide_drive_t *drive" as an argument instead
  of "u8 mode" and teach it how to use UDMA mask to do filtering
* use ide_rate_filter() in hpt366 driver
* remove no longer needed ide_dma_speed() and *_ratemask()
* unexport eighty_ninty_three()

v2:
* rename ->filter_udma_mask to ->udma_filter
  [ Suggested by Sergei Shtylyov <sshtylyov@ru.mvista.com>. ]

v3:
* updated for scc_pata driver (fixes XFER_UDMA_6 filtering for user-space
  originated transfer mode change requests when 100MHz clock is used)

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
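
A minimal sketch of mask-based mode selection in the spirit of the commit above, assuming each mask has bit n set when transfer mode n is supported. best_mode() is a hypothetical helper and does not reproduce ide_max_dma_mode(); the host mask is taken to be the value left after any filtering hook has run.

```c
#include <stdint.h>

/* Return the highest mode number set in both masks, or -1 if they share none. */
static int best_mode(uint8_t device_mask, uint8_t host_mask)
{
    uint8_t common = device_mask & host_mask;
    int mode = -1;

    /* Shift until the mask is empty; the last counted position is the top bit. */
    while (common) {
        mode++;
        common >>= 1;
    }
    return mode;
}
```

With this convention a host mask of 0x00 disables the mode class entirely, since no bit can survive the AND.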
17 years ago ide: fix UDMA/MWDMA/SWDMA masks (v3)
Bartlomiej Zolnierkiewicz [Wed, 9 May 2007 22:01:07 +0000 (00:01 +0200)]
ide: fix UDMA/MWDMA/SWDMA masks (v3)

* use 0x00 instead of 0x80 to disable ->{ultra,mwdma,swdma}_mask
* add udma_mask field to ide_pci_device_t and use it to initialize
  ->ultra_mask in aec62xx, cmd64x, pdc202xx_{new,old} and piix drivers
* fix UDMA masks to match with chipset specific *_ratemask()
  (alim15x3, hpt366, serverworks and siimage drivers need UDMA mask
   filtering method - done in the next patch)

v2:
* piix: fix cable detection for 82801AA_1 and 82372FB_1
  [ Noticed by Sergei Shtylyov <sshtylyov@ru.mvista.com>. ]
* cmd64x: use hwif->cds->udma_mask
  [ Suggested by Sergei Shtylyov <sshtylyov@ru.mvista.com>. ]
* aec62xx: fix newly introduced bug - check DMA status not command register
  [ Noticed by Sergei Shtylyov <sshtylyov@ru.mvista.com>. ]

v3:
* piix: use hwif->cds->udma_mask
  [ Suggested by Sergei Shtylyov <sshtylyov@ru.mvista.com>. ]

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
17 years ago NFS: Kill the obsolete NFS_PARANOIA
Jesper Juhl [Thu, 26 Apr 2007 07:29:02 +0000 (00:29 -0700)]
NFS: Kill the obsolete NFS_PARANOIA

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
17 years ago NFS: use __set_current_state()
Milind Arun Choudhary [Thu, 26 Apr 2007 07:29:03 +0000 (00:29 -0700)]
NFS: use __set_current_state()

use __set_current_state(TASK_*) instead of current->state = TASK_*, in fs/nfs

Signed-off-by: Milind Arun Choudhary <milindchoudhary@gmail.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
17 years ago sunrpc: fix crash in rpc_malloc()
Peter Zijlstra [Wed, 9 May 2007 06:30:11 +0000 (08:30 +0200)]
sunrpc: fix crash in rpc_malloc()

While the comment says:
 * To prevent rpciod from hanging, this allocator never sleeps,
 * returning NULL if the request cannot be serviced immediately.

The function does not actually check for NULL pointers being returned.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
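
A minimal sketch of the missing check described in the commit above, assuming an allocator that stores a small header in front of the data it hands out and whose backing allocator may return NULL. The names (struct buf_hdr, buf_alloc, always_fail) are hypothetical; this is not the rpc_malloc() code.

```c
#include <stddef.h>
#include <stdlib.h>

struct buf_hdr {
    size_t len;    /* size requested by the caller */
    char data[];   /* payload handed back to the caller */
};

typedef void *(*alloc_fn)(size_t);

/* Allocate a buffer with a header; never touch the header if allocation failed. */
static void *buf_alloc(size_t size, alloc_fn backing)
{
    struct buf_hdr *hdr = backing(sizeof(*hdr) + size);

    if (!hdr)          /* the kind of check the commit says was missing */
        return NULL;
    hdr->len = size;
    return hdr->data;
}

/* Backing allocator that always fails, to exercise the NULL path. */
static void *always_fail(size_t n)
{
    (void)n;
    return NULL;
}
```

Without the check, the store to hdr->len would dereference NULL whenever the backing allocator cannot satisfy the request immediately.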
17 years ago NFS: Clean up NFSv4 XDR error message
Chuck Lever [Tue, 8 May 2007 22:23:28 +0000 (18:23 -0400)]
NFS: Clean up NFSv4 XDR error message

Make it more useful for debugging purposes.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
17 years ago NFS: NFS client underestimates how large an NFSv4 SETATTR reply can be
Chuck Lever [Tue, 8 May 2007 22:23:28 +0000 (18:23 -0400)]
NFS: NFS client underestimates how large an NFSv4 SETATTR reply can be

The maximum size of an NFSv4 SETATTR compound reply should include the
GETATTR operation that we send.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
17 years ago SUNRPC: Fix pointer arithmetic bug recently introduced in rpc_malloc/free
Chuck Lever [Tue, 8 May 2007 22:23:28 +0000 (18:23 -0400)]
SUNRPC: Fix pointer arithmetic bug recently introduced in rpc_malloc/free

Use a cleaner method to find the size of an rpc_buffer.  This actually
works on x86-64!

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
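
A minimal sketch of one way to recover a buffer's size from its data pointer without hand-written pointer arithmetic, assuming the size lives in a header placed in front of the data. The names (struct sized_buf, hdr_of, buf_len) are hypothetical, and the commit does not say that this is the method it adopted.

```c
#include <stddef.h>

struct sized_buf {
    size_t len;    /* size of the payload */
    char data[];   /* pointer handed out to callers */
};

/* Step back from the data pointer to the enclosing header using offsetof(). */
#define hdr_of(ptr) \
    ((struct sized_buf *)((char *)(ptr) - offsetof(struct sized_buf, data)))

/* Look up the payload size recorded in the header. */
static size_t buf_len(void *data)
{
    return hdr_of(data)->len;
}
```

Because offsetof() is computed by the compiler from the structure layout, the result does not depend on assumptions about field sizes or padding on a particular architecture.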
17 years ago NFS: Remove redundant check in nfs_check_verifier()
Trond Myklebust [Wed, 9 May 2007 13:00:18 +0000 (09:00 -0400)]
NFS: Remove redundant check in nfs_check_verifier()

The check for nfs_attribute_timeout(dir) in nfs_check_verifier is
redundant: nfs_lookup_revalidate() will already call nfs_revalidate_inode()
on the parent dir when necessary.

The only case where this is not done is the case of a negative dentry. Fix
this case by moving up the revalidation code.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
17 years ago NFS: Fix a jiffie wraparound issue
Trond Myklebust [Wed, 9 May 2007 13:00:17 +0000 (09:00 -0400)]
NFS: Fix a jiffie wraparound issue

dentry verifiers are always set to the parent directory's
cache_change_attribute. There is no reason to be testing for anything other
than equality when we're trying to find out if the dentry has been checked
since the last time the directory was modified.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
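
A minimal sketch of why an equality test is safe on a wrapping counter while an ordering test is not. Both functions and the 32-bit counter width are hypothetical; the ordered variant is an illustration of the general hazard, not the code this patch removed.

```c
#include <stdint.h>

/* Naive ordering test: gives the wrong answer once the counter wraps. */
static int verifier_ok_ordered(uint32_t verifier, uint32_t dir_change_attr)
{
    return verifier >= dir_change_attr;
}

/* Equality test: valid only if the directory is unchanged since the check. */
static int verifier_ok_equal(uint32_t verifier, uint32_t dir_change_attr)
{
    return verifier == dir_change_attr;
}
```

For example, a verifier recorded at 0xFFFFFFF0 compared against a directory attribute of 0x10 set after wraparound passes the ordered test even though the directory has changed, while the equality test correctly rejects it.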
17 years ago [IA64] sa_interrupt is deprecated
akpm@linux-foundation.org [Wed, 9 May 2007 07:43:17 +0000 (00:43 -0700)]
[IA64] sa_interrupt is deprecated

Seems more than just deprecated, we can't build using SA_INTERRUPT.

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
17 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6
Linus Torvalds [Wed, 9 May 2007 20:38:45 +0000 (13:38 -0700)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6

* git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6:
  [IA64] wire up pselect, ppoll
  [IA64] Add TIF_RESTORE_SIGMASK
  [IA64] unwind did not work for processes born with CLONE_STOPPED
  [IA64] Optional method to purge the TLB on SN systems
  [IA64] SPIN_LOCK_UNLOCKED macro cleanup in arch/ia64
  [IA64-SN2][KJ] mmtimer.c-kzalloc
  [IA64] fix stack alignment for ia32 signal handlers
  [IA64] - Altix: hotplug after intr redirect can crash system
  [IA64] save and restore cpus_allowed in cpu_idle_wait
  [IA64] Removal of percpu TR cleanup in kexec code
  [IA64] Fix some section mismatch errors

17 years agoMerge git://git.infradead.org/mtd-2.6
Linus Torvalds [Wed, 9 May 2007 20:10:11 +0000 (13:10 -0700)]
Merge git://git.infradead.org/mtd-2.6

* git://git.infradead.org/mtd-2.6: (21 commits)
  [MTD] [CHIPS] Remove MTD_OBSOLETE_CHIPS (jedec, amd_flash, sharp)
  [MTD] Delete allegedly obsolete "bank_size" field of mtd_info.
  [MTD] Remove unnecessary user space check from mtd.h.
  [MTD] [MAPS] Remove flash maps for no longer supported 405LP boards
  [MTD] [MAPS] Fix missing printk() parameter in physmap_of.c MTD driver
  [MTD] [NAND] platform NAND driver: add driver
  [MTD] [NAND] platform NAND driver: update header
  [JFFS2] Simplify and clean up jffs2_add_tn_to_tree() some more.
  [JFFS2] Remove another bogus optimisation in jffs2_add_tn_to_tree()
  [JFFS2] Remove broken insert_point optimisation in jffs2_add_tn_to_tree()
  [JFFS2] Remember to calculate overlap on nodes which replace older nodes
  [JFFS2] Don't advance c->wbuf_ofs to next eraseblock after wbuf flush
  [MTD] [NAND] at91_nand.c: CMDLINE_PARTS support
  [MTD] [NAND] Tidy up handling of page number in nand_block_bad()
  [MTD] block2mtd_paramline[] mustn't be __initdata
  [MTD] [NAND] Support multiple chips in CAFÉ driver
  [MTD] [NAND] Rename cafe.c to cafe_nand.c and remove the multi-obj magic
  [MTD] [NAND] Use rslib for CAFÉ ECC
  [RSLIB] Support non-canonical GF representations
  [JFFS2] Remove dead file histo_mips.h
  ...

17 years agoMerge master.kernel.org:/pub/scm/linux/kernel/git/lethal/sh-2.6
Linus Torvalds [Wed, 9 May 2007 20:08:20 +0000 (13:08 -0700)]
Merge master.kernel.org:/pub/scm/linux/kernel/git/lethal/sh-2.6

* master.kernel.org:/pub/scm/linux/kernel/git/lethal/sh-2.6:
  sh: Fix stacktrace simplification fallout.
  sh: SH7760 DMABRG support.
  sh: clockevent/clocksource/hrtimers/nohz TMU support.
  sh: Truncate MAX_ACTIVE_REGIONS for the common case.
  rtc: rtc-sh: Fix rtc_dev pointer for rtc_update_irq().
  sh: Convert to common die chain.
  sh: Wire up utimensat syscall.
  sh: landisk mv_nr_irqs definition.
  sh: Fixup ndelay() xloops calculation for alternate HZ.
  sh: Add 32-bit opcode feature CPU flag.
  sh: Fix PC adjustments for varying opcode length.
  sh: Support for SH-2A 32-bit opcodes.
  sh: Kill off redundant __div64_32 symbol export.
  sh: Share exception vector table for SH-3/4.
  sh: Always define TRAPA_BUG_OPCODE.
  sh: __GFP_REPEAT for pte allocations, too.
  rtc: rtc-sh: Fix up dev_dbg() warnings.
  sh: generic quicklist support.

17 years agoMerge branch 'for-linus' of master.kernel.org:/home/rmk/linux-2.6-arm
Linus Torvalds [Wed, 9 May 2007 20:05:57 +0000 (13:05 -0700)]
Merge branch 'for-linus' of master.kernel.org:/home/rmk/linux-2.6-arm

* 'for-linus' of master.kernel.org:/home/rmk/linux-2.6-arm: (28 commits)
  ARM: OMAP: Fix GCC-reported compile time bug
  ARM: OMAP: restore CONFIG_GENERIC_TIME
  ARM: OMAP: partial LED fixes
  ARM: OMAP: add SoSSI clock (call propagate_rate for childrens)
  ARM: OMAP: FB sync with N800 tree (support for dynamic SRAM allocations)
  ARM: OMAP: Sync framebuffer headers with N800 tree
  ARM: OMAP: Mostly cosmetic to sync up with linux-omap tree
  ARM: OMAP: Fix gpmc header
  ARM: OMAP: Add mailbox support for IVA
  [ARM] armv7: add Makefile and Kconfig entries
  [ARM] armv7: add support for asid-tagged VIVT I-cache
  [ARM] armv7: add dedicated ARMv7 barrier instructions
  [ARM] armv7: Add ARMv7 cacheid macros
  [ARM] armv7: add support for ARMv7 cores.
  [ARM] Fix ARM branch relocation range
  [ARM] 4363/1: AT91: Remove legacy PIO definitions
  [ARM] 4361/1: AT91: Build error
  ARM: OMAP: Sync core code with linux-omap
  ARM: OMAP: Sync headers with linux-omap
  ARM: OMAP: h4 must have blinky leds!!
  ...

17 years agoFix a bad error case handling in read_cache_page_async()
David Howells [Wed, 9 May 2007 12:42:20 +0000 (13:42 +0100)]
Fix a bad error case handling in read_cache_page_async()

Commit 6fe6900e1e5b6fa9e5c59aa5061f244fe3f467e2 introduced a nasty bug
in read_cache_page_async().

It added a "mark_page_accessed(page)" at the final return path in
read_cache_page_async().  But in error cases, 'page' holds the error
code, and you can't mark it accessed.
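
The failure mode can be sketched in user-space C. The demo_* helpers below are a hypothetical re-implementation of the error-pointer convention (an errno value encoded in the pointer), not the kernel's own macros, and demo_read_page() only models the shape of the fix: return early on error instead of touching the page.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *demo_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Decode the errno value again. */
static inline long demo_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True when the pointer lies in the top DEMO_MAX_ERRNO addresses. */
static inline int demo_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-DEMO_MAX_ERRNO;
}

struct demo_page {
	int accessed;
};

/* Hypothetical lookup: returns the page, or an encoded error. */
struct demo_page *demo_lookup(struct demo_page *backing, int fail)
{
	if (fail)
		return demo_err_ptr(-EIO);
	return backing;
}

struct demo_page *demo_read_page(struct demo_page *backing, int fail)
{
	struct demo_page *page;

	assert(backing != NULL);
	page = demo_lookup(backing, fail);
	if (demo_is_err(page))
		return page;	/* early return: there is no page to touch */

	page->accessed = 1;	/* safe: page is a real pointer here */
	return page;
}
```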

[ and Glauber de Oliveira Costa points out that we can use a return
  instead of adding more goto's ]

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agoMerge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
Linus Torvalds [Wed, 9 May 2007 19:56:01 +0000 (12:56 -0700)]
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc

* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc:
  [POWERPC] Further fixes for the removal of 4level-fixup hack from ppc32
  [POWERPC] EEH: log all PCI-X and PCI-E AER registers
  [POWERPC] EEH: capture and log pci state on error
  [POWERPC] EEH: Split up long error msg
  [POWERPC] EEH: log error only after driver notification.
  [POWERPC] fsl_soc: Make mac_addr const in fs_enet_of_init().
  [POWERPC] Don't use SLAB/SLUB for PTE pages
  [POWERPC] Spufs support for 64K LS mappings on 4K kernels
  [POWERPC] Add ability to 4K kernel to hash in 64K pages
  [POWERPC] Introduce address space "slices"
  [POWERPC] Small fixes & cleanups in segment page size demotion
  [POWERPC] iSeries: Make HVC_ISERIES the default
  [POWERPC] iSeries: suppress build warning in lparmap.c
  [POWERPC] Mark pages that don't exist as nosave
  [POWERPC] swsusp: Introduce register_nosave_region_late

17 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial
Linus Torvalds [Wed, 9 May 2007 19:54:17 +0000 (12:54 -0700)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial

* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (25 commits)
  sound: convert "sound" subdirectory to UTF-8
  MAINTAINERS: Add cxacru website/mailing list
  include files: convert "include" subdirectory to UTF-8
  general: convert "kernel" subdirectory to UTF-8
  documentation: convert the Documentation directory to UTF-8
  Convert the toplevel files CREDITS and MAINTAINERS to UTF-8.
  remove broken URLs from net drivers' output
  Magic number prefix consistency change to Documentation/magic-number.txt
  trivial: s/i_sem /i_mutex/
  fix file specification in comments
  drivers/base/platform.c: fix small typo in doc
  misc doc and kconfig typos
  Remove obsolete fat_cvf help text
  Fix occurrences of "the the "
  Fix minor typoes in kernel/module.c
  Kconfig: Remove reference to external mqueue library
  Kconfig: A couple of grammatical fixes in arch/i386/Kconfig
  Correct comments in genrtc.c to refer to correct /proc file.
  Fix more "deprecated" spellos.
  Fix "deprecated" typoes.
  ...

Fix trivial comment conflict in kernel/relay.c.

17 years agoMerge branch 'for-linus' of git://www.atmel.no/~hskinnemoen/linux/kernel/avr32
Linus Torvalds [Wed, 9 May 2007 19:50:25 +0000 (12:50 -0700)]
Merge branch 'for-linus' of git://www.atmel.no/~hskinnemoen/linux/kernel/avr32

* 'for-linus' of git://www.atmel.no/~hskinnemoen/linux/kernel/avr32:
  [AVR32] Wire up sys_utimensat
  [AVR32] Fix section mismatch .taglist -> .init.text
  [AVR32] Implement dma_{alloc,free}_writecombine()
  AVR32: Spinlock initializer cleanup
  [AVR32] Use correct config symbol when setting cpuflags

17 years agoi386: msr.h: be paranoid about types and parentheses
H. Peter Anvin [Wed, 9 May 2007 07:02:11 +0000 (00:02 -0700)]
i386: msr.h: be paranoid about types and parentheses

When implementing things as macros, make sure we use typecasts and
parentheses where needed.  The macros as defined were vulnerable to
surreptitious promotion causing problems.

Avoid macros where practical; e.g. wrmsr() can be an inline instead.
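
A small sketch of both hazards, using made-up macros rather than the ones in msr.h. The fragile join is written as two 16-bit shifts only so that the demo stays well-defined C; a single shift by 32 on a 32-bit operand would be undefined.

```c
#include <assert.h>
#include <stdint.h>

/* Fragile: the argument is not parenthesized, so operator precedence in the
 * caller's expression changes the result. */
#define LOW_BYTE_FRAGILE(v)	(v & 0xff)
/* Careful: parenthesized argument and an explicit result type. */
#define LOW_BYTE(v)		((uint8_t)((v) & 0xff))

/* Fragile: the shift happens at the width of `hi`; with 32-bit operands the
 * high half is simply shifted out. */
#define JOIN64_FRAGILE(lo, hi)	((lo) | ((hi) << 16 << 16))

/* An inline function fixes the argument types, so widening is explicit. */
static inline uint64_t join64(uint32_t lo, uint32_t hi)
{
	return ((uint64_t)hi << 32) | lo;
}

void demo_macro_check(void)
{
	uint32_t a = 0x1234, lo = 0xdeadbeef, hi = 1;

	/* Expands to (a | 0x100 & 0xff), which is a | 0. */
	assert(LOW_BYTE_FRAGILE(a | 0x100) == 0x1234);
	assert(LOW_BYTE(a | 0x100) == 0x34);

	/* The high word is lost at 32 bits but kept by the inline function. */
	assert(JOIN64_FRAGILE(lo, hi) == 0xdeadbeefu);
	assert(join64(lo, hi) == 0x1deadbeefULL);
}
```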

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agoi386: remove unused rdtsc() macro
H. Peter Anvin [Wed, 9 May 2007 07:02:06 +0000 (00:02 -0700)]
i386: remove unused rdtsc() macro

All users of the two-part rdtsc() macro have already switched to using
rdtscl() or rdtscll().  Remove the now-obsolete macro.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agoi386: cpu/transmeta.c: fix definition of USER686
H. Peter Anvin [Wed, 9 May 2007 07:02:00 +0000 (00:02 -0700)]
i386: cpu/transmeta.c: fix definition of USER686

The definition of USER686 is supposed to be a mask of feature bits,
not an OR of feature numbers!  It happened to work anyway on the only
processor affected, simply by pure coincidence.  Fix.
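
The difference between the two forms is easy to see with numbers. In this sketch the DEMO_FEATURE_* values are the CPUID leaf 1 EDX bit positions for TSC, CX8 and CMOV; the macro names themselves are hypothetical and the snippet is not taken from transmeta.c.

```c
#include <assert.h>
#include <stdint.h>

/* Feature numbers, i.e. bit positions within the feature word. */
#define DEMO_FEATURE_TSC	4
#define DEMO_FEATURE_CX8	8
#define DEMO_FEATURE_CMOV	15

/* Wrong: ORs the bit *numbers* together (4 | 8 | 15 == 15). */
#define DEMO_WRONG_MASK	(DEMO_FEATURE_TSC | DEMO_FEATURE_CX8 | DEMO_FEATURE_CMOV)

/* Right: ORs one set bit per feature. */
#define DEMO_RIGHT_MASK	((1u << DEMO_FEATURE_TSC) | \
			 (1u << DEMO_FEATURE_CX8) | \
			 (1u << DEMO_FEATURE_CMOV))

/* True when every bit of `mask` is set in the feature word. */
int demo_has_all(uint32_t features, uint32_t mask)
{
	return (features & mask) == mask;
}

void demo_mask_check(void)
{
	assert(DEMO_WRONG_MASK == 0xf);
	assert(DEMO_RIGHT_MASK == 0x8110u);
}
```

A feature word with exactly those three features set (0x8110) passes the test against DEMO_RIGHT_MASK and fails it against DEMO_WRONG_MASK.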

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agomd: improve partition detection in md array
NeilBrown [Wed, 9 May 2007 09:35:39 +0000 (02:35 -0700)]
md: improve partition detection in md array

md currently uses ->media_changed to make sure rescan_partitions
is called on md arrays after they are assembled.

However that doesn't happen until the array is opened, which is later
than some people would like.

So use blkdev_ioctl to do the rescan as soon as the
array has been assembled.

This means we can remove all the ->change infrastructure as it was only used
to trigger a partition rescan.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agomd: allow reshape_position for md arrays to be set via sysfs
NeilBrown [Wed, 9 May 2007 09:35:38 +0000 (02:35 -0700)]
md: allow reshape_position for md arrays to be set via sysfs

"reshape_position" records how much progress has been made on a "reshape"
(adding drives, changing layout or chunksize).

When it is set, the number of drives, layout and chunksize can have
two possible values, an old and a new.

So allow these different values to be visible, and allow both old and new to
be set: Set the old ones first, then the reshape_position, then the new
values.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agomd: remove the slash from the name of a kmem_cache used by raid5
NeilBrown [Wed, 9 May 2007 09:35:37 +0000 (02:35 -0700)]
md: remove the slash from the name of a kmem_cache used by raid5

SLUB doesn't like slashes as it wants to use the cache name as the name of a
directory (or symlink) in sysfs.

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agomd: stop using csum_partial for checksum calculation in md
NeilBrown [Wed, 9 May 2007 09:35:37 +0000 (02:35 -0700)]
md: stop using csum_partial for checksum calculation in md

If CONFIG_NET is not selected, csum_partial is not exported, so md.ko cannot
use it.  We shouldn't really be using csum_partial anyway as it is an
internal-to-networking interface.

So replace it with C code to do the same thing.  Speed is not crucial here, so
something simple and correct is best.
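
As an illustration of what "simple and correct" can look like, here is a sketch that sums 32-bit words in a 64-bit accumulator and then adds the high half back into the low half once. It is an assumption about the general shape, not the code from the patch, and it makes no claim to produce the same values as csum_partial().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum `nwords` 32-bit words; carries out of the low 32 bits collect in the
 * high half of the accumulator and are added back in once at the end. */
uint32_t demo_simple_csum(const uint32_t *words, size_t nwords)
{
	uint64_t sum = 0;
	size_t i;

	assert(words != NULL || nwords == 0);
	for (i = 0; i < nwords; i++)
		sum += words[i];

	return (uint32_t)((sum & 0xffffffffu) + (sum >> 32));
}
```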

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agomd: move test for whether level supports bitmap to correct place
NeilBrown [Wed, 9 May 2007 09:35:36 +0000 (02:35 -0700)]
md: move test for whether level supports bitmap to correct place

We need to check for internal-consistency of superblock in load_super.
validate_super is for inter-device consistency.

With the test in the wrong place, a badly created array will confuse md rather
than produce sensible errors.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agomd: cleanup: use seq_release_private() where appropriate
Martin Peschke [Wed, 9 May 2007 09:35:35 +0000 (02:35 -0700)]
md: cleanup: use seq_release_private() where appropriate

We can save some lines of code by using seq_release_private().

Signed-off-by: Martin Peschke <mp3@de.ibm.com>
Acked-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agodrivers/md.c: Use ARRAY_SIZE macro when appropriate
Ahmed S. Darwish [Wed, 9 May 2007 09:35:34 +0000 (02:35 -0700)]
drivers/md.c: Use ARRAY_SIZE macro when appropriate

Use the ARRAY_SIZE macro already defined in kernel.h.
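
For readers unfamiliar with the idiom, a user-space sketch follows. The macro here is a simplified stand-in for the kernel's definition, and the table and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified definition: number of elements in a true array. */
#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

static const char *const demo_levels[] = { "raid0", "raid1", "raid5" };

/* Return the index of `name` in demo_levels, or -1 if it is not listed. */
int demo_find_level(const char *name)
{
	size_t i;

	assert(name != NULL);
	/* Replaces the open-coded sizeof(demo_levels) / sizeof(demo_levels[0]). */
	for (i = 0; i < ARRAY_SIZE(demo_levels); i++)
		if (strcmp(demo_levels[i], name) == 0)
			return (int)i;
	return -1;
}
```

The macro only works on arrays whose size is visible at the point of use; applied to a pointer it silently yields the wrong count.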

Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
Acked-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agoframe buffer: geforce 7300 gt
Michal Piotrowski [Wed, 9 May 2007 09:35:34 +0000 (02:35 -0700)]
frame buffer: geforce 7300 gt

My geforce isn't supported by nvidia frame buffer.

/sbin/lspci
01:00.0 VGA compatible controller: nVidia Corporation Unknown device 02e2 (rev a2)

/usr/sbin/fbset -i

mode "1024x768-60"
    # D: 65.003 MHz, H: 48.365 kHz, V: 60.006 Hz
    geometry 1024 768 1024 32767 8
    timings 15384 160 24 29 3 136 6
    accel true
    rgba 8/0,8/0,8/0,0/0
endmode

Frame buffer device information:
    Name        : NV2e
    Address     : 0xe0000000
    Size        : 134217728
    Type        : PACKED PIXELS
    Visual      : PSEUDOCOLOR
    XPanStep    : 8
    YPanStep    : 1
    YWrapStep   : 0
    LineLength  : 1024
    MMIO Address: 0xf6000000
    MMIO Size   : 16777216
    Accelerator : Unknown (46)

Here is a patch for this problem.

Signed-off-by: Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
Cc: "Antonino A. Daplas" <adaplas@pol.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agofbdev: add support for AVR32
Haavard Skinnemoen [Wed, 9 May 2007 09:35:33 +0000 (02:35 -0700)]
fbdev: add support for AVR32

Provide framebuffer page protection flags and definitions of
fb_readl/fb_writel for AVR32.

Signed-off-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
Cc: "Antonino A. Daplas" <adaplas@pol.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agosvgalib: move fb_get_caps to svgalib
Antonino A. Daplas [Wed, 9 May 2007 09:35:32 +0000 (02:35 -0700)]
svgalib: move fb_get_caps to svgalib

Move fb_get_caps() method to svgalib.c as svga_get_caps() so it can be used by
s3fb, arkfb and vt8623fb.

Signed-off-by: Antonino Daplas <adaplas@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agoarkfb: new framebuffer driver for ARK Logic cards
Ondrej Zajicek [Wed, 9 May 2007 09:35:31 +0000 (02:35 -0700)]
arkfb: new framebuffer driver for ARK Logic cards

This patch adds an fbdev driver for graphics cards with the ARK Logic 2000PV
graphics chip and an ICS 5342 ramdac.

[adaplas@gmail.com: build fixes]
Signed-off-by: Ondrej Zajicek <santiago@crfreenet.org>
Signed-off-by: Antonino Daplas <adaplas@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agovt8623fb: new framebuffer driver for VIA VT8623
Ondrej Zajicek [Wed, 9 May 2007 09:35:31 +0000 (02:35 -0700)]
vt8623fb: new framebuffer driver for VIA VT8623

This patch adds an fbdev driver for the graphics core in the VIA VT8623.

[adaplas@gmail.com: build fixes]
Signed-off-by: Ondrej Zajicek <santiago@crfreenet.org>
Signed-off-by: Antonino Daplas <adaplas@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agoi386 mmzone: use __maybe_unused
David Rientjes [Wed, 9 May 2007 09:35:30 +0000 (02:35 -0700)]
i386 mmzone: use __maybe_unused

Replace automatic variable instances of __attribute__ ((unused)) with
__maybe_unused.

Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agoi386: voyager: use __maybe_unused
David Rientjes [Wed, 9 May 2007 09:35:29 +0000 (02:35 -0700)]
i386: voyager: use __maybe_unused

Replace automatic variable instances of __attribute__((unused)) with
__maybe_unused in mca_nmi_hook().

Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agosh: dma: use __maybe_unused
David Rientjes [Wed, 9 May 2007 09:35:28 +0000 (02:35 -0700)]
sh: dma: use __maybe_unused

There is no such thing as labeling a variable as __attribute__((used)).  Since
ts_shift is not referenced in inline assembly, we assume that we're simply
suppressing a warning here if the variable is declared but unreferenced.

Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agoi386 pci: use __maybe_unused
David Rientjes [Wed, 9 May 2007 09:35:28 +0000 (02:35 -0700)]
i386 pci: use __maybe_unused

Use the new macro here

Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agocompiler: introduce __used and __maybe_unused
David Rientjes [Wed, 9 May 2007 09:35:27 +0000 (02:35 -0700)]
compiler: introduce __used and __maybe_unused

__used is defined to be __attribute__((unused)) for all pre-3.3 gcc
compilers to suppress warnings for unused functions because perhaps they
are referenced only in inline assembly.  It is defined to be
__attribute__((used)) for gcc 3.3 and later so that the code is still
emitted for such functions.

__maybe_unused is defined to be __attribute__((unused)) for both function
and variable use if it could possibly be unreferenced due to the evaluation
of preprocessor macros.  Function prototypes shall be marked with
__maybe_unused if the actual definition of the function is dependent on
preprocessor macros.

No update to compiler-intel.h is necessary because ICC supports both
__attribute__((used)) and __attribute__((unused)) as specified by the gcc
manual.

__attribute_used__ is deprecated and will be removed once all current
code is converted to using __used.
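
The definitions described above can be sketched as follows. The demo_used and demo_maybe_unused names are hypothetical stand-ins (chosen to stay out of the reserved double-underscore namespace in user space), and DEMO_DEBUG is an invented configuration symbol.

```c
#include <assert.h>

#if defined(__GNUC__)
# if __GNUC__ > 3 || (__GNUC__ == 3 && __GNUC_MINOR__ >= 3)
#  define demo_used		__attribute__((used))	/* keep the code emitted */
# else
#  define demo_used		__attribute__((unused))	/* only silence the warning */
# endif
# define demo_maybe_unused	__attribute__((unused))
#else
# define demo_used
# define demo_maybe_unused
#endif

/* Nothing in C calls this; imagine it is referenced only from inline asm. */
static int demo_used demo_asm_only_helper(void)
{
	return 42;
}

int demo_configured_value(void)
{
	/* Only read when DEMO_DEBUG is defined, so it may go unreferenced. */
	int demo_maybe_unused debug_level = 3;

#ifdef DEMO_DEBUG
	return debug_level;
#else
	return 0;
#endif
}

void demo_attr_check(void)
{
#ifndef DEMO_DEBUG
	assert(demo_configured_value() == 0);
#endif
}
```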

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agorename thread_info to stack
Roman Zippel [Wed, 9 May 2007 09:35:17 +0000 (02:35 -0700)]
rename thread_info to stack

This finally renames the thread_info field in task structure to stack, so that
the assumptions about this field are gone and archs have more freedom about
placing the thread_info structure.

Nonbroken archs which have a proper thread pointer can do the access to both
current thread and task structure via a single pointer.

It'll allow for a few more cleanups of the fork code, from which e.g.  ia64
could benefit.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
[akpm@linux-foundation.org: build fix]
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Andi Kleen <ak@muc.de>
Cc: Chris Zankel <chris@zankel.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agowrap access to thread_info
Roman Zippel [Wed, 9 May 2007 09:35:16 +0000 (02:35 -0700)]
wrap access to thread_info

Recently a few direct accesses to the thread_info in the task structure snuck
back, so this wraps them with the appropriate wrapper.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agoAllow arch to initialize arch field of the module structure
Roman Zippel [Wed, 9 May 2007 09:35:15 +0000 (02:35 -0700)]
Allow arch to initialize arch field of the module structure

This will later allow an arch to add module specific information via linker
generated tables instead of poking directly in the module object structure.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agoclocksource: fix resume logic
Thomas Gleixner [Wed, 9 May 2007 09:35:15 +0000 (02:35 -0700)]
clocksource: fix resume logic

We need to make sure that the clocksources are resumed, when timekeeping is
resumed.  The current resume logic does not guarantee this.

Add a resume function pointer to the clocksource struct, so clocksource
drivers which need to reinitialize the clocksource can provide a resume
function.

Add a resume function, which calls any available clocksource resume
functions and resets the watchdog function, so a stable TSC can be used
across suspend/resume.
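
A minimal sketch of the optional-callback pattern this describes. The structure, the list walk and the watchdog flag are hypothetical simplifications, not the clocksource code itself.

```c
#include <assert.h>
#include <stddef.h>

struct demo_clocksource {
	const char *name;
	void (*resume)(void);	/* optional: NULL if nothing to reinitialize */
	struct demo_clocksource *next;
};

int demo_hw_reinit_count;

/* Example driver callback: pretend to reprogram the hardware. */
void demo_hw_resume(void)
{
	demo_hw_reinit_count++;
}

/* Call every resume callback that is provided, then mark the watchdog as
 * reset.  Returns how many callbacks were invoked. */
int demo_resume_all(struct demo_clocksource *list, int *watchdog_reset)
{
	struct demo_clocksource *cs;
	int called = 0;

	assert(watchdog_reset != NULL);
	for (cs = list; cs != NULL; cs = cs->next) {
		if (cs->resume != NULL) {
			cs->resume();
			called++;
		}
	}
	*watchdog_reset = 1;
	return called;
}
```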

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agoMove remote node draining out of slab allocators
Christoph Lameter [Wed, 9 May 2007 09:35:14 +0000 (02:35 -0700)]
Move remote node draining out of slab allocators

Currently the slab allocators contain callbacks into the page allocator to
perform the draining of pagesets on remote nodes.  This requires SLUB to have
a whole subsystem in order to be compatible with SLAB.  Moving node draining
out of the slab allocators avoids a section of code in SLUB.

Move the node draining so that it is done when the vm statistics are updated.
At that point we are already touching all the cachelines with the pagesets of
a processor.

Add an expire counter there.  If we have to update per zone or global vm
statistics then assume that the pageset will require subsequent draining.

The expire counter will be decremented on each vm stats update pass until it
reaches zero.  Then we will drain one batch from the pageset.  The draining
will cause vm counter updates which will then cause another expiration until
the pcp is empty.  So we will drain a batch every 3 seconds.
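
The countdown can be modelled in a few lines. This is a simplified user-space sketch: the structure, the field names and the reload value of 3 passes are assumptions made for illustration, with one pass standing in for one statistics update.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_EXPIRE_PASSES 3	/* assumed reload value */

struct demo_pageset {
	int count;	/* pages currently sitting in the queue */
	int batch;	/* pages released per drain */
	int expire;	/* passes left before the next drain; 0 means idle */
	int stat_diff;	/* nonzero if counters changed since the last pass */
};

/* One statistics-update pass.  Returns the number of pages drained. */
int demo_stat_pass(struct demo_pageset *p)
{
	int drained;

	assert(p != NULL && p->batch > 0);
	if (p->stat_diff) {
		p->stat_diff = 0;	/* folding the counters is elided */
		p->expire = DEMO_EXPIRE_PASSES;
	}
	if (p->expire == 0)
		return 0;		/* nothing pending */
	if (--p->expire > 0)
		return 0;		/* still counting down */

	drained = p->count < p->batch ? p->count : p->batch;
	p->count -= drained;
	if (drained > 0)
		p->stat_diff = 1;	/* freeing pages changes the counters */
	return drained;
}
```

Starting from a queue of 5 pages with a batch of 2, the model drains on passes 3, 6 and 9 and then goes idle.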

Note that remote node draining is a somewhat esoteric feature that is required
on large NUMA systems because otherwise significant portions of system memory
can become trapped in pcp queues.  The number of pcp is determined by the
number of processors and nodes in a system.  A system with 4 processors and 2
nodes has 8 pcps which is okay.  But a system with 1024 processors and 512
nodes has 512k pcps with a high potential for a large amount of memory being
caught in them.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agoMake vm statistics update interval configurable
Christoph Lameter [Wed, 9 May 2007 09:35:13 +0000 (02:35 -0700)]
Make vm statistics update interval configurable

Code in mm already makes the vm statistics interval independent of the
cache reaper.  Use that opportunity to make the interval configurable.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agovmstat: use our own timer events
Christoph Lameter [Wed, 9 May 2007 09:35:12 +0000 (02:35 -0700)]
vmstat: use our own timer events

vmstat is currently using the cache reaper to periodically bring the
statistics up to date.  The cache reaper only exists in SLUB as a way to
provide compatibility with SLAB.  This patch removes the vmstat calls from the
slab allocators and provides its own handling.

The advantage is also that we can use a different frequency for the updates.
Refreshing vm stats is a pretty fast job so we can run this every second and
stagger this by only one tick.  This will lead to some overlap in large
systems.  For example, a system running at 250 HZ with 1024 processors will have 4 vm
updates occurring at once.
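
The overlap figure can be checked with a toy model in which processor N runs its once-per-second update on tick N modulo HZ. That assignment rule is an assumption made for illustration; the function names are hypothetical.

```c
#include <assert.h>

/* Tick within the second on which `cpu` runs its update (one-tick stagger). */
int demo_update_tick(int cpu, int hz)
{
	assert(cpu >= 0 && hz > 0);
	return cpu % hz;
}

/* Number of processors whose update lands on `tick`. */
int demo_updates_on_tick(int ncpus, int hz, int tick)
{
	int cpu, n = 0;

	for (cpu = 0; cpu < ncpus; cpu++)
		if (demo_update_tick(cpu, hz) == tick)
			n++;
	return n;
}
```

With 1024 processors and 250 ticks per second, 1024 / 250 is about 4.1, so in this model most ticks carry 4 updates and the first 24 ticks carry 5.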

However, the vm stats update only accesses per node information.  It is only
necessary to stagger the vm statistics updates per processor in each node.  Vm
counter updates occurring on distant nodes will not cause cacheline
contention.

We could implement an alternate approach that runs the first processor on each
node at the second and then each of the other processor on a node on a
subsequent tick.  That may be useful to keep a large amount of the second free
of timer activity.  Maybe the timer folks will have some feedback on this one?

[jirislaby@gmail.com: add missing break]
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agomicrocode: use suspend-related CPU hotplug notifications
Rafael J. Wysocki [Wed, 9 May 2007 09:35:11 +0000 (02:35 -0700)]
microcode: use suspend-related CPU hotplug notifications

Make the microcode driver use the suspend-related CPU hotplug notifications
to handle the CPU hotplug events occurring during system-wide suspend and
resume transitions.  Remove the global variable suspend_cpu_hotplug
previously used for this purpose.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agoAdd suspend-related notifications for CPU hotplug
Rafael J. Wysocki [Wed, 9 May 2007 09:35:10 +0000 (02:35 -0700)]
Add suspend-related notifications for CPU hotplug

Since nonboot CPUs are now disabled after tasks and devices have been
frozen and the CPU hotplug infrastructure is used for this purpose, we need
special CPU hotplug notifications that will help the CPU-hotplug-aware
subsystems distinguish normal CPU hotplug events from CPU hotplug events
related to a system-wide suspend or resume operation in progress.  This
patch introduces such notifications and causes them to be used during
suspend and resume transitions.  It also changes all of the
CPU-hotplug-aware subsystems to take these notifications into consideration
(for now they are handled in the same way as the corresponding "normal"
ones).
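
A sketch of how a subsystem callback might treat the two families of events alike while still being able to tell them apart. The DEMO_* constant names and values are illustrative assumptions, not copied from the patch.

```c
#include <assert.h>

/* Illustrative event codes; the "frozen" variants add one flag bit. */
#define DEMO_CPU_ONLINE		0x0002
#define DEMO_CPU_DEAD		0x0007
#define DEMO_CPU_TASKS_FROZEN	0x0010
#define DEMO_CPU_ONLINE_FROZEN	(DEMO_CPU_ONLINE | DEMO_CPU_TASKS_FROZEN)
#define DEMO_CPU_DEAD_FROZEN	(DEMO_CPU_DEAD | DEMO_CPU_TASKS_FROZEN)

enum demo_result { DEMO_IGNORED, DEMO_STARTED, DEMO_STOPPED };

/* Handle suspend-related events the same way as the normal ones. */
enum demo_result demo_cpu_callback(unsigned long action)
{
	switch (action) {
	case DEMO_CPU_ONLINE:
	case DEMO_CPU_ONLINE_FROZEN:
		return DEMO_STARTED;
	case DEMO_CPU_DEAD:
	case DEMO_CPU_DEAD_FROZEN:
		return DEMO_STOPPED;
	default:
		return DEMO_IGNORED;
	}
}

/* True when the event belongs to a suspend or resume transition. */
int demo_during_suspend(unsigned long action)
{
	return (action & DEMO_CPU_TASKS_FROZEN) != 0;
}

void demo_hotplug_check(void)
{
	assert(demo_cpu_callback(DEMO_CPU_ONLINE) ==
	       demo_cpu_callback(DEMO_CPU_ONLINE_FROZEN));
	assert(demo_during_suspend(DEMO_CPU_DEAD_FROZEN));
}
```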

[oleg@tv-sign.ru: cleanups]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agofs: deprecate memclear_highpage_flush
Nate Diller [Wed, 9 May 2007 09:35:09 +0000 (02:35 -0700)]
fs: deprecate memclear_highpage_flush

Now that all the in-tree users are converted over to zero_user_page(),
deprecate the old memclear_highpage_flush() call.

Signed-off-by: Nate Diller <nate.diller@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agoreiserfs: use zero_user_page
Nate Diller [Wed, 9 May 2007 09:35:09 +0000 (02:35 -0700)]
reiserfs: use zero_user_page

Use zero_user_page() instead of open-coding it.

Signed-off-by: Nate Diller <nate.diller@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agoext3: use zero_user_page
Nate Diller [Wed, 9 May 2007 09:35:08 +0000 (02:35 -0700)]
ext3: use zero_user_page

Use zero_user_page() instead of open-coding it.

Signed-off-by: Nate Diller <nate.diller@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agoaffs: use zero_user_page
Nate Diller [Wed, 9 May 2007 09:35:07 +0000 (02:35 -0700)]
affs: use zero_user_page

Use zero_user_page() instead of open-coding it.

Signed-off-by: Nate Diller <nate.diller@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years agofs: convert core functions to zero_user_page
Nate Diller [Wed, 9 May 2007 09:35:07 +0000 (02:35 -0700)]
fs: convert core functions to zero_user_page

It's very common for file systems to need to zero part or all of a page;
the simplest way is just to use kmap_atomic() and memset().  There's
actually a library function in include/linux/highmem.h that does exactly
that, but it's confusingly named memclear_highpage_flush(), which is
descriptive of *how* it does the work rather than what the *purpose* is.
So this patchset renames the function to zero_user_page(), and calls it
from the various places that currently open code it.
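
The operation being named is small. A user-space analogue, with a plain byte array standing in for a page and hypothetical demo_* names, might look like this; the kernel-side mapping and cache flushing are only indicated in comments.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096u

struct demo_page {
	unsigned char bytes[DEMO_PAGE_SIZE];
};

/* Zero `size` bytes of the page starting at `offset`. */
void demo_zero_page_range(struct demo_page *page, size_t offset, size_t size)
{
	assert(page != NULL);
	assert(offset <= DEMO_PAGE_SIZE && size <= DEMO_PAGE_SIZE - offset);

	/* In the kernel the page would be mapped first (kmap_atomic()). */
	memset(page->bytes + offset, 0, size);
	/* ...and then flushed and unmapped again. */
}
```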

This first patch introduces the new function call, and converts all the
core kernel callsites, both the open-coded ones and the old
memclear_highpage_flush() ones.  Following this patch is a series of
conversions for each file system individually, per AKPM, and finally a
patch deprecating the old call.  The diffstat below shows the entire
patchset.

[akpm@linux-foundation.org: fix a few things]
Signed-off-by: Nate Diller <nate.diller@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago timer: parenthesis fix in tbase_get_deferrable() etc
Jarek Poplawski [Wed, 9 May 2007 09:35:05 +0000 (02:35 -0700)]
timer: parenthesis fix in tbase_get_deferrable() etc

Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago FUTEX: new PRIVATE futexes
Eric Dumazet [Wed, 9 May 2007 09:35:04 +0000 (02:35 -0700)]
FUTEX: new PRIVATE futexes

  Analysis of current linux futex code :
  --------------------------------------

A central hash table futex_queues[] holds all contexts (futex_q) of waiting
threads.

Each futex_wait()/futex_wake() has to obtain a spinlock on a hash slot to
perform lookups or insert/deletion of a futex_q.

When a futex_wait() is done, calling thread has to :

1) - Obtain a read lock on mmap_sem to be able to validate the user pointer
     (calling find_vma()). This validation tells us if the futex uses
     an inode based store (mapped file), or mm based store (anonymous mem)

2) - compute a hash key

3) - Atomic increment of reference counter on an inode or a mm_struct

4) - lock part of futex_queues[] hash table

5) - perform the test on value of futex.
     (rollback if value != expected_value, returns EWOULDBLOCK)
     (various loops if test triggers mm faults)

6) queue the context into hash table, release the lock got in 4)

7) - release the read_lock on mmap_sem

   <block>

8) Eventually unqueue the context (but rarely, as this part  may be done
   by the futex_wake())
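
The value test in step 5 is visible from userspace.  The following is a
minimal sketch, assuming a Linux system where the raw futex syscall is
reachable through syscall(2); the helper names futex_sketch() and
wait_if_equal_sketch() are hypothetical.  It only demonstrates the
rollback path, where the futex word no longer holds the expected value.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

/* Thin wrapper around the raw syscall (hypothetical helper name). */
static long futex_sketch(uint32_t *uaddr, int op, uint32_t val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/*
 * Ask the kernel to sleep on *uaddr only if it still holds `expected`.
 * Returns 0 after a wakeup, or the errno value on failure.  When the
 * value differs, the kernel rolls back without queueing the caller and
 * the call fails with EWOULDBLOCK (the same number as EAGAIN on Linux).
 * No timeout is passed, so a matching value blocks until a wakeup.
 */
static int wait_if_equal_sketch(uint32_t *uaddr, uint32_t expected)
{
    if (futex_sketch(uaddr, FUTEX_WAIT, expected) == -1)
        return errno;
    return 0;
}
```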

Futexes were designed to improve scalability but current implementation has
various problems :

- Central hashtable :

  This means scalability problems if many processes/threads want to use
  futexes at the same time.
  This means NUMA imbalance because this hashtable is located on one node.

- Using mmap_sem on every futex() syscall :

  Even if mmap_sem is a rw_semaphore, up_read()/down_read() are doing atomic
  ops on mmap_sem, dirtying cache line :
    - lot of cache line ping pongs on SMP configurations.

  mmap_sem is also extensively used by mm code (page faults, mmap()/munmap())
  Highly threaded processes might suffer from mmap_sem contention.

  mmap_sem is also used by oprofile code. Enabling oprofile hurts threaded
  programs because of contention on the mmap_sem cache line.

- Using an atomic_inc()/atomic_dec() on inode ref counter or mm ref counter:
  It's also a cache line ping pong on SMP. It also increases mmap_sem hold time
  because of cache misses.

Most of these scalability problems come from the fact that futexes are in
one global namespace.  As we use a central hash table, we must make sure
they are all using the same reference (given by the mm subsystem).  We
chose to force all futexes be 'shared'.  This has a cost.

But fact is POSIX defined PRIVATE and SHARED, allowing clear separation,
and optimal performance if carefully implemented.  Time has come for linux
to have better threading performance.

The goal is to permit new futex commands to avoid :
 - Taking the mmap_sem semaphore, conflicting with other subsystems.
 - Modifying a ref_count on mm or an inode, still conflicting with mm or fs.

This is possible because, for one process using PTHREAD_PROCESS_PRIVATE
futexes, we only need to distinguish futexes by their virtual address, no
matter what the underlying mm storage is.

If glibc wants to exploit this new infrastructure, it should use new
_PRIVATE futex subcommands for PTHREAD_PROCESS_PRIVATE futexes.  And be
prepared to fall back on old subcommands for old kernels.  Using one global
variable with the FUTEX_PRIVATE_FLAG or 0 value should be OK.
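
One way the "global variable" approach could look is sketched below.  This
is an illustration under stated assumptions, not glibc's actual code: it
assumes that a kernel without this patch rejects the unknown command with
an error (ENOSYS), and that waking a futex with no waiters is harmless.
The names futex_call_sketch(), futex_private_flag_sketch,
probe_private_futex_sketch() and private_wait_sketch() are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

/* Holds FUTEX_PRIVATE_FLAG on kernels that accept it, else 0. */
static int futex_private_flag_sketch = 0;

/* Thin wrapper around the raw syscall (hypothetical helper name). */
static long futex_call_sketch(uint32_t *uaddr, int op, uint32_t val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/*
 * Probe once at startup: try a private wake on a word nobody waits on.
 * If the kernel accepts the command, remember the flag; otherwise keep 0
 * so that every later call silently uses the old shared subcommands.
 */
static void probe_private_futex_sketch(void)
{
    uint32_t word = 0;
    long rc = futex_call_sketch(&word, FUTEX_WAKE | FUTEX_PRIVATE_FLAG, 1);

    futex_private_flag_sketch = (rc >= 0) ? FUTEX_PRIVATE_FLAG : 0;
}

/* Wait using whichever variant the probe selected; returns 0 or errno. */
static int private_wait_sketch(uint32_t *uaddr, uint32_t expected)
{
    int op = FUTEX_WAIT | futex_private_flag_sketch;

    if (futex_call_sketch(uaddr, op, expected) == -1)
        return errno;
    return 0;
}
```

OR-ing the variable into every command keeps the fast path free of
branches: the same call site works on old and new kernels.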

PTHREAD_PROCESS_SHARED futexes should still use the old subcommands.

Compatibility with old applications is preserved, they still hit the
scalability problems, but new applications can fly :)

Note : the same SHARED futex (mapped on a file) can be used by old binaries
*and* new binaries, because both binaries will use the old subcommands.

Note : The vast majority of futexes should be using PROCESS_PRIVATE semantics,
as this is the default. Almost all applications should benefit
from these changes (given a new kernel and an updated libc).

Some bench results on a Pentium M 1.6 GHz (SMP kernel on a UP machine)

/* calling futex_wait(addr, value) with value != *addr */
433 cycles per futex(FUTEX_WAIT) call (mixing 2 futexes)
424 cycles per futex(FUTEX_WAIT) call (using one futex)
334 cycles per futex(FUTEX_WAIT_PRIVATE) call (mixing 2 futexes)
334 cycles per futex(FUTEX_WAIT_PRIVATE) call (using one futex)
For reference :
187 cycles per getppid() call
188 cycles per umask() call
181 cycles per ni_syscall() call

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Pierre Peiffer <pierre.peiffer@bull.net>
Cc: "Ulrich Drepper" <drepper@gmail.com>
Cc: "Nick Piggin" <nickpiggin@yahoo.com.au>
Cc: "Ingo Molnar" <mingo@elte.hu>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17 years ago futex_requeue_pi optimization
Pierre Peiffer [Wed, 9 May 2007 09:35:02 +0000 (02:35 -0700)]
futex_requeue_pi optimization

This patch provides the futex_requeue_pi functionality, which allows some
threads waiting on a normal futex to be requeued on the wait-queue of a
PI-futex.

This provides an optimization, already used for (normal) futexes, to be used
with the PI-futexes.

This optimization is currently used by the glibc in pthread_cond_broadcast, when
using "normal" mutexes.  With futex_requeue_pi, it can be used with
PRIO_INHERIT mutexes too.

Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>