Linus Torvalds [Wed, 31 Dec 2008 01:32:25 +0000 (17:32 -0800)]
Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev:
sata_sil: add Large Block Transfer support
[libata] ata_piix: cleanup dmi strings checking
DMI: add dmi_match
libata: blacklist NCQ on OCZ CORE 2 SSD (resend)
[libata] Update kernel-doc comments to match source code
libata: perform port detach in EH
libata: when restoring SControl during detach do the PMP links first
libata: beef up iterators
Linus Torvalds [Wed, 31 Dec 2008 01:31:25 +0000 (17:31 -0800)]
Merge branch 'oprofile-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'oprofile-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
oprofile: select RING_BUFFER
ring_buffer: adding EXPORT_SYMBOLs
oprofile: fix lost sample counter
oprofile: remove nr_available_slots()
oprofile: port to the new ring_buffer
ring_buffer: add remaining cpu functions to ring_buffer.h
oprofile: moving cpu_buffer_reset() to cpu_buffer.h
oprofile: adding cpu_buffer_entries()
oprofile: adding cpu_buffer_write_commit()
oprofile: adding cpu buffer r/w access functions
ftrace: remove unused function arg in trace_iterator_increment()
ring_buffer: update description for ring_buffer_alloc()
oprofile: set values to default when creating oprofilefs
oprofile: implement switch/case in buffer_sync.c
x86/oprofile: cleanup IBS init/exit functions in op_model_amd.c
x86/oprofile: reordering IBS code in op_model_amd.c
oprofile: fix typo
oprofile: whitspace changes only
oprofile: update comment for oprofile_add_sample()
oprofile: comment cleanup
Linus Torvalds [Wed, 31 Dec 2008 01:28:09 +0000 (17:28 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
slub: avoid leaking caches or refcounts on sysfs error
slab: Fix comment on #endif
slab: remove GFP_THISNODE clearing from alloc_slabmgmt()
slub: Add might_sleep_if() to slab_alloc()
SLUB: failslab support
slub: Fix incorrect use of loose
slab: Update the kmem_cache_create documentation regarding the name parameter
slub: make early_kmem_cache_node_alloc void
slab: unsigned slabp->inuse cannot be less than 0
slub - fix get_object_page comment
SLUB: Replace __builtin_return_address(0) with _RET_IP_.
SLUB: cleanup - define macros instead of hardcoded numbers
Linus Torvalds [Wed, 31 Dec 2008 01:25:49 +0000 (17:25 -0800)]
Merge branch 'drm-next' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6
* 'drm-next' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6: (37 commits)
drm/i915: fix modeset devname allocation + agp init return check.
drm/i915: Remove redundant test in error path.
drm: Add a debug node for vblank state.
drm: Avoid use-before-null-test on dev in drm_cleanup().
drm/i915: Don't print to dmesg when taking signal during object_pin.
drm: pin new and unpin old buffer when setting a mode.
drm/i915: un-EXPORT and make 'intelfb_panic' static
drm/i915: Delete unused, pointless i915_driver_firstopen.
drm/i915: fix sparse warnings: returning void-valued expression
drm/i915: fix sparse warnings: move 'extern' decls to header file
drm/i915: fix sparse warnings: make symbols static
drm/i915: fix sparse warnings: declare one-bit bitfield as unsigned
drm/i915: Don't double-unpin buffers if we take a signal in evict_everything().
drm/i915: Fix fbcon setup to align display pitch to 64b.
drm/i915: Add missing userland definitions for gem init/execbuffer.
i915/drm: provide compat defines for userspace for certain struct members.
drm: drop DRM_IOCTL_MODE_REPLACEFB, add+remove works just as well.
drm: sanitise drm modesetting API + remove unused hotplug
drm: fix allowing master ioctls on non-master fds.
drm/radeon: use locked rmmap to remove sarea mapping.
...
Linus Torvalds [Wed, 31 Dec 2008 01:25:29 +0000 (17:25 -0800)]
Merge branch 'agp-next' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/agp-2.6
* 'agp-next' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/agp-2.6:
agp/intel: Fix broken ® symbol in device name.
agp/intel: add support for G41 chipset
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next-2.6: (98 commits)
sparc: move select of ARCH_SUPPORTS_MSI
sparc: drop SUN_IO
sparc: unify sections.h
sparc: use .data.init_task section for init_thread_union
sparc: fix array overrun check in of_device_64.c
sparc: unify module.c
sparc64: prepare module_64.c for unification
sparc64: use bit neutral Elf symbols
sparc: unify module.h
sparc: introduce CONFIG_BITS
sparc: fix hardirq.h removal fallout
sparc64: do not export pus_fs_struct
sparc: use sparc64 version of scatterlist.h
sparc: Commonize memcmp assembler.
sparc: Unify strlen assembler.
sparc: Add asm/asm.h
sparc: Kill memcmp_32.S code which has been ifdef'd out for centuries.
sparc: replace for_each_cpu_mask_nr with for_each_cpu
sparc: fix sparse warnings in irq_32.c
sparc: add include guards to kernel.h
...
Linus Torvalds [Wed, 31 Dec 2008 01:20:05 +0000 (17:20 -0800)]
Merge branch 'for-2.6.29' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.29' of git://git.kernel.dk/linux-2.6-block: (43 commits)
bio: get rid of bio_vec clearing
bounce: don't rely on a zeroed bio_vec list
cciss: simplify parameters to deregister_disk function
cfq-iosched: fix race between exiting queue and exiting task
loop: Do not call loop_unplug for not configured loop device.
loop: Flush possible running bios when loop device is released.
alpha: remove dead BIO_VMERGE_BOUNDARY
Get rid of CONFIG_LSF
block: make blk_softirq_init() static
block: use min_not_zero in blk_queue_stack_limits
block: add one-hit cache for disk partition lookup
cfq-iosched: remove limit of dispatch depth of max 4 times quantum
nbd: tell the block layer that it is not a rotational device
block: get rid of elevator_t typedef
aio: make the lookup_ioctx() lockless
bio: add support for inlining a number of bio_vecs inside the bio
bio: allow individual slabs in the bio_set
bio: move the slab pointer inside the bio_set
bio: only mempool back the largest bio_vec slab cache
block: don't use plugging on SSD devices
...
Linus Torvalds [Wed, 31 Dec 2008 00:20:19 +0000 (16:20 -0800)]
Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, sparseirq: clean up Kconfig entry
x86: turn CONFIG_SPARSE_IRQ off by default
sparseirq: fix numa_migrate_irq_desc dependency and comments
sparseirq: add kernel-doc notation for new member in irq_desc, -v2
locking, irq: enclose irq_desc_lock_class in CONFIG_LOCKDEP
sparseirq, xen: make sure irq_desc is allocated for interrupts
sparseirq: fix !SMP building, #2
x86, sparseirq: move irq_desc according to smp_affinity, v7
proc: enclose desc variable of show_stat() in CONFIG_SPARSE_IRQ
sparse irqs: add irqnr.h to the user headers list
sparse irqs: handle !GENIRQ platforms
sparseirq: fix !SMP && !PCI_MSI && !HT_IRQ build
sparseirq: fix Alpha build failure
sparseirq: fix typo in !CONFIG_IO_APIC case
x86, MSI: pass irq_cfg and irq_desc
x86: MSI start irq numbering from nr_irqs_gsi
x86: use NR_IRQS_LEGACY
sparse irq_desc[] array: core kernel and x86 changes
genirq: record IRQ_LEVEL in irq_desc[]
irq.h: remove padding from irq_desc on 64bits
Linus Torvalds [Wed, 31 Dec 2008 00:16:21 +0000 (16:16 -0800)]
Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
hrtimers: fix warning in kernel/hrtimer.c
x86: make sure we really have an hpet mapping before using it
x86: enable HPET on Fujitsu u9200
linux/timex.h: cleanup for userspace
posix-timers: simplify de_thread()->exit_itimers() path
posix-timers: check ->it_signal instead of ->it_pid to validate the timer
posix-timers: use "struct pid*" instead of "struct task_struct*"
nohz: suppress needless timer reprogramming
clocksource, acpi_pm.c: put acpi_pm_read_slow() under CONFIG_PCI
nohz: no softirq pending warnings for offline cpus
hrtimer: removing all ur callback modes, fix
hrtimer: removing all ur callback modes, fix hotplug
hrtimer: removing all ur callback modes
x86: correct link to HPET timer specification
rtc-cmos: export second NVRAM bank
Fixed up conflicts in sound/drivers/pcsp/pcsp.c and sound/core/hrtimer.c
manually.
Linus Torvalds [Wed, 31 Dec 2008 00:10:19 +0000 (16:10 -0800)]
Merge branch 'core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (63 commits)
stacktrace: provide save_stack_trace_tsk() weak alias
rcu: provide RCU options on non-preempt architectures too
printk: fix discarding message when recursion_bug
futex: clean up futex_(un)lock_pi fault handling
"Tree RCU": scalable classic RCU implementation
futex: rename field in futex_q to clarify single waiter semantics
x86/swiotlb: add default swiotlb_arch_range_needs_mapping
x86/swiotlb: add default phys<->bus conversion
x86: unify pci iommu setup and allow swiotlb to compile for 32 bit
x86: add swiotlb allocation functions
swiotlb: consolidate swiotlb info message printing
swiotlb: support bouncing of HighMem pages
swiotlb: factor out copy to/from device
swiotlb: add arch hook to force mapping
swiotlb: allow architectures to override phys<->bus<->phys conversions
swiotlb: add comment where we handle the overflow of a dma mask on 32 bit
rcu: fix rcutorture behavior during reboot
resources: skip sanity check of busy resources
swiotlb: move some definitions to header
swiotlb: allow architectures to override swiotlb pool allocation
...
Fix up trivial conflicts in
arch/x86/kernel/Makefile
arch/x86/mm/init_32.c
include/linux/hardirq.h
as per Ingo's suggestions.
Robert Hancock [Thu, 25 Dec 2008 01:06:06 +0000 (19:06 -0600)]
sata_sil: add Large Block Transfer support
This implements support for the Large Block Transfer feature found in Silicon
Image 311x controllers. This allows transferring bigger contiguous chunks of
data from system memory and avoids the 64KB boundary restriction of standard
SFF controllers.
This is based on a patch from Jeff Garzik (from the sii-lbt branch of
libata-dev) but includes a few bug fixes: Since the bmdma2 register does not
implement the status bits, the original bmdma register must be used except
where the bmdma2 register is required. As well the DMA boundary should be
31-bit instead of 32-bit since the top bit of the length field is still
required for the PRD end-of-table flag.
Signed-off-by: Robert Hancock <hancockr@shaw.ca> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Jiri Slaby [Wed, 10 Dec 2008 13:07:22 +0000 (14:07 +0100)]
[libata] ata_piix: cleanup dmi strings checking
Commit
ATA: piix, fix pointer deref on suspend
fixed a possible oops in an ugly manner. Use newly introduced dmi_match()
to make the code pretty again.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Cc: Alexandru Romanescu <a_romanescu@yahoo.co.uk> Cc: Tejun Heo <tj@kernel.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Lubomir Bulej [Mon, 22 Dec 2008 10:35:22 +0000 (11:35 +0100)]
libata: blacklist NCQ on OCZ CORE 2 SSD (resend)
The patchlet below blacklists NCQ on OCZ CORE v2 SSD drive(s). Even
though the drive advertises NCQ support with queue depth 1, it responds
with all-zeroes FIS to NCQ commands which triggers ata error handling
several times before the kernel decides to disable NCQ on the drive.
Signed-off-by: Lubomir Bulej <lubomir.bulej@dsrg.mff.cuni.cz> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
David Rientjes [Thu, 18 Dec 2008 06:09:46 +0000 (22:09 -0800)]
slub: avoid leaking caches or refcounts on sysfs error
If a slab cache is mergeable and the sysfs alias cannot be added, the
target cache shall have its refcount decremented. kmem_cache_create()
will return NULL, so if kmem_cache_destroy() is ever called on the target
cache, it will never be freed if the refcount has been leaked.
Likewise, if a slab cache is not mergeable and the sysfs link cannot be
added, the new cache shall be removed from the slab_caches list.
kmem_cache_create() will return NULL, so it will be impossible to call
kmem_cache_destroy() on it.
Both of these operations require slub_lock since refcount of all slab
caches and slab_caches are protected by the lock.
In the mergeable case, it would be better to restore objsize and offset
back to their original values, but this could race with another merge
since slub_lock was dropped.
Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Pekka Enberg [Wed, 26 Nov 2008 08:01:31 +0000 (10:01 +0200)]
slab: remove GFP_THISNODE clearing from alloc_slabmgmt()
Commit 6cb062296f73e74768cca2f3eaf90deac54de02d ("Categorize GFP flags")
left one call-site in alloc_slabmgmt() to clear GFP_THISNODE instead of
GFP_CONSTRAINT_MASK. Unfortunately, that ends up clearing __GFP_NOWARN
and __GFP_NORETRY as well which is not what we want. As the only caller
of alloc_slabmgmt() already clears GFP_CONSTRAINT_MASK before passing
local_flags to it, we can just remove the clearing of GFP_THISNODE.
This patch should fix spurious page allocation failure warnings on the
mempool_alloc() path. See the following URL for the original discussion
of the bug:
http://lkml.org/lkml/2008/10/27/100
Acked-by: Christoph Lameter <cl@linux-foundation.org> Reported-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Akinobu Mita [Tue, 23 Dec 2008 10:37:01 +0000 (19:37 +0900)]
SLUB: failslab support
Currently fault-injection capability for SLAB allocator is only
available to SLAB. This patch makes it available to SLUB, too.
[penberg@cs.helsinki.fi: unify slab and slub implementations] Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Eric Anholt [Fri, 19 Dec 2008 22:47:48 +0000 (14:47 -0800)]
drm/i915: Don't print to dmesg when taking signal during object_pin.
This showed up in logs where people had a hung chip, so pinning was blocked
on the chip unpinning other buffers, and the X Server took its scheduler
signal during that time.
Signed-off-by: Eric Anholt <eric@anholt.net> Signed-off-by: Dave Airlie <airlied@linux.ie>
Hannes Eder [Thu, 18 Dec 2008 14:09:00 +0000 (15:09 +0100)]
drm/i915: fix sparse warnings: declare one-bit bitfield as unsigned
Signed-off-by: Hannes Eder <hannes@hanneseder.net> Signed-off-by: Eric Anholt <eric@anholt.net> Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Dave Airlie <airlied@linux.ie>
Eric Anholt [Wed, 10 Dec 2008 18:09:41 +0000 (10:09 -0800)]
drm/i915: Don't double-unpin buffers if we take a signal in evict_everything().
We haven't seen this in practice, but it was visible when looking at a bug
report from when i915_gem_evict_everything() was broken and would always
return error.
Signed-off-by: Eric Anholt <eric@anholt.net> Signed-off-by: Dave Airlie <airlied@linux.ie>
drm: drop DRM_IOCTL_MODE_REPLACEFB, add+remove works just as well.
The replace fb ioctl replaces the backing buffer object for a modesetting
framebuffer object. This can be acheived by just creating a new
framebuffer backed by the new buffer object, setting that for the crtcs
in question and then removing the old framebuffer object.
Signed-off-by: Kristian Hogsberg <krh@redhat.com> Acked-by: Jakob Bornecrantz <jakob@vmware.com> Signed-off-by: Dave Airlie <airlied@redhat.com>
Jesse Barnes [Fri, 7 Nov 2008 22:24:08 +0000 (14:24 -0800)]
DRM: i915: add mode setting support
This commit adds i915 driver support for the DRM mode setting APIs.
Currently, VGA, LVDS, SDVO DVI & VGA, TV and DVO LVDS outputs are
supported. HDMI, DisplayPort and additional SDVO output support will
follow.
Support for the mode setting code is controlled by the new 'modeset'
module option. A new config option, CONFIG_DRM_I915_KMS controls the
default behavior, and whether a PCI ID list is built into the module for
use by user level module utilities.
Note that if mode setting is enabled, user level drivers that access
display registers directly or that don't use the kernel graphics memory
manager will likely corrupt kernel graphics memory, disrupt output
configuration (possibly leading to hangs and/or blank displays), and
prevent panic/oops messages from appearing. So use caution when
enabling this code; be sure your user level code supports the new
interfaces.
A new SysRq key, 'g', provides emergency support for switching back to
the kernel's framebuffer console; which is useful for testing.
Co-authors: Dave Airlie <airlied@linux.ie>, Hong Liu <hong.liu@intel.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Eric Anholt <eric@anholt.net> Signed-off-by: Dave Airlie <airlied@redhat.com>
Dave Airlie [Fri, 7 Nov 2008 22:05:41 +0000 (14:05 -0800)]
DRM: add mode setting support
Add mode setting support to the DRM layer.
This is a fairly big chunk of work that allows DRM drivers to provide
full output control and configuration capabilities to userspace. It was
motivated by several factors:
- the fb layer's APIs aren't suited for anything but simple
configurations
- coordination between the fb layer, DRM layer, and various userspace
drivers is poor to non-existent (radeonfb excepted)
- user level mode setting drivers makes displaying panic & oops
messages more difficult
- suspend/resume of graphics state is possible in many more
configurations with kernel level support
This commit just adds the core DRM part of the mode setting APIs.
Driver specific commits using these new structure and APIs will follow.
Co-authors: Jesse Barnes <jbarnes@virtuousgeek.org>, Jakob Bornecrantz <jakob@tungstengraphics.com>
Contributors: Alan Hourihane <alanh@tungstengraphics.com>, Maarten Maathuis <madman2003@gmail.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Eric Anholt <eric@anholt.net> Signed-off-by: Dave Airlie <airlied@redhat.com>
Jesse Barnes [Wed, 12 Nov 2008 18:03:55 +0000 (10:03 -0800)]
drm/i915: add GEM GTT mapping support
Use the new core GEM object mapping code to allow GTT mapping of GEM
objects on i915. The fault handler will make sure a fence register is
allocated too, if the object in question is tiled.
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Eric Anholt <eric@anholt.net> Signed-off-by: Dave Airlie <airlied@redhat.com>
Jesse Barnes [Wed, 5 Nov 2008 18:31:53 +0000 (10:31 -0800)]
drm: GEM mmap support
Add core support for mapping of GEM objects. Drivers should provide a
vm_operations_struct if they want to support page faulting of objects.
The code for handling GEM object offsets was taken from TTM, which was
written by Thomas Hellström.
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Eric Anholt <eric@anholt.net> Signed-off-by: Dave Airlie <airlied@redhat.com>
Vegard Nossum [Tue, 2 Dec 2008 03:38:47 +0000 (13:38 +1000)]
drm: fix leak of uninitialized data to userspace
...so drm_getunique() is trying to copy some uninitialized data to
userspace. The ECX register contains the number of words that are
left to copy -- so there are 5 * 4 = 20 bytes left. The offset of the
first uninitialized byte (counting from the start of the string) is
also 20 (i.e. 0xf65d2294&((1 << 5)-1) == 20). So somebody tried to
copy 40 bytes when the string was only 19 long.
...so it seems that dev->unique is never updated to reflect the
actual length of the string. The remaining bytes (20 in this case)
are random uninitialized bytes that are copied into userspace.
This patch fixes the problem by setting dev->unique_len after the
snprintf().
airlied- I've had to fix this up to store the alloced size so
we have it for drm_free later.
Dave Airlie [Fri, 28 Nov 2008 04:22:24 +0000 (14:22 +1000)]
drm: move to kref per-master structures.
This is step one towards having multiple masters sharing a drm
device in order to get fast-user-switching to work.
It splits out the information associated with the drm master
into a separate kref counted structure, and allocates this when
a master opens the device node. It also allows the current master
to abdicate (say while VT switched), and a new master to take over
the hardware.
It moves the Intel and radeon drivers to using the sarea from
within the new master structures.
Dave Airlie [Fri, 28 Nov 2008 03:43:47 +0000 (13:43 +1000)]
drm: cleanup exit path for module unload
The current sub-module unload exit path is a mess, it tries
to abuse the idr. Just keep a list of devices per driver struct
and free them in-order on rmmod.
Jens Axboe [Tue, 23 Dec 2008 11:46:21 +0000 (12:46 +0100)]
bio: get rid of bio_vec clearing
We don't need to clear the memory used for adding bio_vec entries,
since nobody should be looking at members unitialized. Any valid
use should be below bio->bi_vcnt, and that members up until that count
must be valid since they were added through bio_add_page().
Jens Axboe [Tue, 23 Dec 2008 11:44:19 +0000 (12:44 +0100)]
bounce: don't rely on a zeroed bio_vec list
__blk_queue_bounce() relies on a zeroed bio_vec list, since it looks
up arbitrary indexes in the allocated bio. The block layer only
guarentees that added entries are valid, so clear memory after alloc.
Jens Axboe [Mon, 15 Dec 2008 20:19:25 +0000 (21:19 +0100)]
cfq-iosched: fix race between exiting queue and exiting task
Original patch from Nikanth Karthikesan <knikanth@suse.de>
When a queue exits the queue lock is taken and cfq_exit_queue() would free all
the cic's associated with the queue.
But when a task exits, cfq_exit_io_context() gets cic one by one and then
locks the associated queue to call __cfq_exit_single_io_context. It looks like
between getting a cic from the ioc and locking the queue, the queue might have
exited on another cpu.
Fix this by rechecking the cfq_io_context queue key inside the queue lock
again, and not calling into __cfq_exit_single_io_context() if somebody
beat us to it.
Milan Broz [Fri, 12 Dec 2008 13:50:49 +0000 (14:50 +0100)]
loop: Do not call loop_unplug for not configured loop device.
In loop_unplug() function is expected that mapping is set
and lo->lo_backing_file is not NULL.
Unfortunately loop_set_fd() set the request queue unplug function,
but loop_clr_fd() doesn't clear that.
Loop device allows open of non-configured loop in some situations.
If the unplug on request queue is called, loop module oopses because
of missing lo_backing_file.
Simple reproducer:
losetup /dev/loop0 /xxx
losetup -d /dev/loop0
dmsetup create x --table "0 1 linear /dev/loop0 0"
Milan Broz [Fri, 12 Dec 2008 13:48:27 +0000 (14:48 +0100)]
loop: Flush possible running bios when loop device is released.
When there are still queued bios and reference count
drops to zero, loop device must flush all queued bios.
Otherwise it can lead to situation that caller
closes the device, but some bios are still running
and endio() function call later OOpses when uses
unallocated mempool.
This happens for example when running dm-crypt over loop,
here is typical oops backtrace:
Jens Axboe [Fri, 12 Dec 2008 08:51:16 +0000 (09:51 +0100)]
Get rid of CONFIG_LSF
We have two seperate config entries for large devices/files. One
is CONFIG_LBD that guards just the devices, the other is CONFIG_LSF
that handles large files. This doesn't make a lot of sense, you typically
want both or none. So get rid of CONFIG_LSF and change CONFIG_LBD wording
to indicate that it covers both.
Acked-by: Jean Delvare <khali@linux-fr.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
FUJITA Tomonori [Thu, 4 Dec 2008 07:56:35 +0000 (08:56 +0100)]
block: use min_not_zero in blk_queue_stack_limits
zero is invalid for max_phys_segments, max_hw_segments, and
max_segment_size. It's better to use use min_not_zero instead of
min. min() works though (because the commit 0e435ac makes sure that
these values are set to the default values, non zero, if a queue is
initialized properly).
With this patch, blk_queue_stack_limits does the almost same thing
that dm's combine_restrictions_low() does. I think that it's easy to
remove dm's combine_restrictions_low.
Jens Axboe [Fri, 24 Oct 2008 10:52:42 +0000 (12:52 +0200)]
block: add one-hit cache for disk partition lookup
disk_map_sector_rcu() returns a partition from a sector offset,
which we use for IO statistics on a per-partition basis. The
lookup itself is an O(N) list lookup, where N is the number of
partitions. This actually hurts performance quite a bit, even
on the lower end partitions. On higher numbered partitions,
it can get pretty bad.
Solve this by adding a one-hit cache for partition lookup.
This makes the lookup O(1) for the case where we do most IO to
one partition. Even for mixed partition workloads, amortized cost
is pretty close to O(1) since the natural IO batching makes the
one-hit cache last for lots of IOs.
Jens Axboe [Mon, 20 Oct 2008 13:44:28 +0000 (15:44 +0200)]
cfq-iosched: remove limit of dispatch depth of max 4 times quantum
This basically limits the hardware queue depth to 4*quantum at any
point in time, which is 16 with the default settings. As CFQ uses
other means to shrink the hardware queue when necessary in the first
place, there's really no need for this extra heuristic. Additionally,
it ends up hurting performance in some cases.
Jens Axboe [Tue, 9 Dec 2008 07:11:22 +0000 (08:11 +0100)]
aio: make the lookup_ioctx() lockless
The mm->ioctx_list is currently protected by a reader-writer lock,
so we always grab that lock on the read side for doing ioctx
lookups. As the workload is extremely reader biased, turn this into
an rcu hlist so we can make lookup_ioctx() lockless. Get rid of
the rwlock and use a spinlock for providing update side exclusion.
There's usually only 1 entry on this list, so it doesn't make sense
to look into fancier data structures.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Jens Axboe [Tue, 23 Dec 2008 11:42:54 +0000 (12:42 +0100)]
bio: add support for inlining a number of bio_vecs inside the bio
When we go and allocate a bio for IO, we actually do two allocations.
One for the bio itself, and one for the bi_io_vec that holds the
actual pages we are interested in.
This feature inlines a definable amount of io vecs inside the bio
itself, so we eliminate the bio_vec array allocation for IO's up
to a certain size. It defaults to 4 vecs, which is typically 16k
of IO.
Jens Axboe [Wed, 10 Dec 2008 14:35:05 +0000 (15:35 +0100)]
bio: allow individual slabs in the bio_set
Instead of having a global bio slab cache, add a reference to one
in each bio_set that is created. This allows for personalized slabs
in each bio_set, so that they can have bios of different sizes.
This means we can personalize the bios we return. File systems may
want to embed the bio inside another structure, to avoid allocation
more items (and stuffing them in ->bi_private) after the get a bio.
Or we may want to embed a number of bio_vecs directly at the end
of a bio, to avoid doing two allocations to return a bio. This is now
possible.
Jens Axboe [Thu, 11 Dec 2008 10:53:43 +0000 (11:53 +0100)]
bio: only mempool back the largest bio_vec slab cache
We only very rarely need the mempool backing, so it makes sense to
get rid of all but one of the mempool in a bio_set. So keep the
largest bio_vec count mempool so we can always honor the largest
allocation, and "upgrade" callers that fail.
Tejun Heo [Fri, 28 Nov 2008 04:32:07 +0000 (13:32 +0900)]
block: fix empty barrier on write-through w/ ordered tag
Empty barrier on write-through (or no cache) w/ ordered tag has no
command to execute and without any command to execute ordered tag is
never issued to the device and the ordering is never achieved. Force
draining for such cases.
Tejun Heo [Fri, 28 Nov 2008 04:32:06 +0000 (13:32 +0900)]
block: simplify empty barrier implementation
Empty barrier required special handling in __elv_next_request() to
complete it without letting the low level driver see it.
With previous changes, barrier code is now flexible enough to skip the
BAR step using the same barrier sequence selection mechanism. Drop
the special handling and mask off q->ordered from start_ordered().
Remove blk_empty_barrier() test which now has no user.
Tejun Heo [Fri, 28 Nov 2008 04:32:05 +0000 (13:32 +0900)]
block: make barrier completion more robust
Barrier completion had the following assumptions.
* start_ordered() couldn't finish the whole sequence properly. If all
actions are to be skipped, q->ordseq is set correctly but the actual
completion was never triggered thus hanging the barrier request.
* Drain completion in elv_complete_request() assumed that there's
always at least one request in the queue when drain completes.
Both assumptions are true but these assumptions need to be removed to
improve empty barrier implementation. This patch makes the following
changes.
* Make start_ordered() use blk_ordered_complete_seq() to mark skipped
steps complete and notify __elv_next_request() that it should fetch
the next request if the whole barrier has completed inside
start_ordered().
* Make drain completion path in elv_complete_request() check whether
the queue is empty. Empty queue also indicates drain completion.
* While at it, convert 0/1 return from blk_do_ordered() to false/true.
Tejun Heo [Fri, 28 Nov 2008 04:32:04 +0000 (13:32 +0900)]
block: make every barrier action optional
In all barrier sequences, the barrier write itself was always assumed
to be issued and thus didn't have corresponding control flag. This
patch adds QUEUE_ORDERED_DO_BAR and unify action mask handling in
start_ordered() such that any barrier action can be skipped.
This patch doesn't introduce any visible behavior changes.
Tejun Heo [Fri, 28 Nov 2008 04:32:03 +0000 (13:32 +0900)]
block: remove duplicate or unused barrier/discard error paths
* Because barrier mode can be changed dynamically, whether barrier is
supported or not can be determined only when actually issuing the
barrier and there is no point in checking it earlier. Drop barrier
support check in generic_make_request() and __make_request(), and
update comment around the support check in blk_do_ordered().
* There is no reason to check discard support in both
generic_make_request() and __make_request(). Drop the check in
__make_request(). While at it, move error action block to the end
of the function and add unlikely() to q existence test.
* Barrier request, be it empty or not, is never passed to low level
driver and thus it's meaningless to try to copy back req->sector to
bio->bi_sector on error. In addition, the notion of failed sector
doesn't make any sense for empty barrier to begin with. Drop the
code block from __end_that_request_first().
Tejun Heo [Fri, 28 Nov 2008 04:32:02 +0000 (13:32 +0900)]
block: reorganize QUEUE_ORDERED_* constants
Separate out ordering type (drain,) and action masks (preflush,
postflush, fua) from visible ordering mode selectors
(QUEUE_ORDERED_*). Ordering types are now named QUEUE_ORDERED_BY_*
while action masks are named QUEUE_ORDERED_DO_*.
This change is necessary to add QUEUE_ORDERED_DO_BAR and make it
optional to improve empty barrier implementation.
Richard Kennedy [Wed, 3 Dec 2008 11:41:40 +0000 (12:41 +0100)]
block: reorder struct bio to remove padding on 64bit
Remove 8 bytes of padding from struct bio which also removes 16 bytes from
struct bio_pair to make it 248 bytes. bio_pair then fits into one fewer
cache lines & into a smaller slab.
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Keith Mannthey [Tue, 25 Nov 2008 09:24:35 +0000 (10:24 +0100)]
block: Supress Buffer I/O errors when SCSI REQ_QUIET flag set
Allow the scsi request REQ_QUIET flag to be propagated to the buffer
file system layer. The basic ideas is to pass the flag from the scsi
request to the bio (block IO) and then to the buffer layer. The buffer
layer can then suppress needless printks.
This patch declutters the kernel log by removed the 40-50 (per lun)
buffer io error messages seen during a boot in my multipath setup . It
is a good chance any real errors will be missed in the "noise" it the
logs without this patch.
During boot I see blocks of messages like
"
__ratelimit: 211 callbacks suppressed
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242847
Buffer I/O error on device sdm, logical block 1
Buffer I/O error on device sdm, logical block 5242878
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242872
"
in my logs.
My disk environment is multipath fiber channel using the SCSI_DH_RDAC
code and multipathd. This topology includes an "active" and "ghost"
path for each lun. IO's to the "ghost" path will never complete and the
SCSI layer, via the scsi device handler rdac code, quick returns the IOs
to theses paths and sets the REQ_QUIET scsi flag to suppress the scsi
layer messages.
I am wanting to extend the QUIET behavior to include the buffer file
system layer to deal with these errors as well. I have been running this
patch for a while now on several boxes without issue. A few runs of
bonnie++ show no noticeable difference in performance in my setup.
Thanks for John Stultz for the quiet_error finalization.
Submitted-by: Keith Mannthey <kmannth@us.ibm.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Wu Fengguang [Tue, 25 Nov 2008 08:08:39 +0000 (09:08 +0100)]
block: don't take lock on changing ra_pages
There's no need to take queue_lock or kernel_lock when modifying
bdi->ra_pages. So remove them. Also remove out of date comment for
queue_max_sectors_store().
Documentation: remove reference to ll_rw_blk.c and moved drivers/block/elevator.c
The drivers/block/ll_rw_block.c has been split and organized in the block/
directory, and also drivers/block/elevator.c has been moved to the block/
directory. Update Documentation/block/biodoc.txt accordingly
Do not free io context when taking recursive faults in do_exit
When taking recursive faults in do_exit, if the io_context is not null,
exit_io_context() is being called. But it might decrement the refcount
more than once. It is better to leave this task alone.
Jens Axboe [Wed, 19 Nov 2008 13:38:39 +0000 (14:38 +0100)]
block: leave the request timeout timer running even on an empty list
For sync IO, we'll often do them serialized. This means we'll be touching
the queue timer for every IO, as opposed to only occasionally like we
do for queued IO. Instead of deleting the timer when the last request
is removed, just let continue running. If a new request comes up soon
we then don't have to readd the timer again. If no new requests arrive,
the timer will expire without side effect later.
Now the rq->deadline can't be zero if the request is in the
timeout_list, so there is no need to have next_set. There is no need to
access a request's deadline field if blk_rq_timed_out is called on it.
Xen's blkfront sets noop as the default I/O scheduler at initialization
time to avoid elevator overheads such as idling, but with the advent of
basic disk profiling capabilities this is not necessary anymore. We
should just tell the block layer that we are a paravirt front-end driver
and the elevator will automatically make the necessary adjustments.
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>