Chris Mason [Thu, 22 Jan 2009 14:23:10 +0000 (09:23 -0500)]
Btrfs: do less aggressive btree readahead
Just before reading a leaf, btrfs scans the node for blocks that are
close by and reads them too. It tries to build up a large window
of IO looking for blocks that are within a max distance from the top
and bottom of the IO window.
This patch changes things to just look for blocks within 64k of the
target block. It will trigger less IO and make for lower latencies on
the read size.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Yehuda Sadeh [Wed, 21 Jan 2009 19:39:14 +0000 (14:39 -0500)]
Btrfs: fiemap support
Now that bmap support is gone, this is the only way to get extent
mappings for userland. These are still not valid for IO, but they
can tell us if a file has holes or how much fragmentation there is.
Chris Mason [Wed, 21 Jan 2009 18:11:13 +0000 (13:11 -0500)]
Btrfs: stop providing a bmap operation to avoid swapfile corruptions
Swapfiles use bmap to build a list of extents belonging to the file,
and they assume these extents won't change over the life of the file.
They also use resulting list to do IO directly to the block device.
This causes problems for btrfs in a few ways:
btrfs returns logical block numbers through bmap, and these are not suitable
for IO. They might translate to different devices, raid etc.
COW means that file block mappings are going to change frequently.
Using swapfiles on btrfs will lead to corruption, so we're avoiding the
problem for now by dropping bmap support entirely. A later commit
will add fiemap support for people that really want to know how
a file is laid out.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Yan Zheng [Wed, 21 Jan 2009 17:54:03 +0000 (12:54 -0500)]
Btrfs: fix tree logs parallel sync
To improve performance, btrfs_sync_log merges tree log sync
requests. But it wrongly merges sync requests for different
tree logs. If multiple tree logs are synced at the same time,
only one of them actually gets synced.
This patch has following changes to fix the bug:
Move most tree log related fields in btrfs_fs_info to
btrfs_root. This allows merging sync requests separately
for each tree log.
Don't insert root item into the log root tree immediately
after log tree is allocated. Root item for log tree is
inserted when log tree get synced for the first time. This
allows syncing the log root tree without first syncing all
log trees.
At tree-log sync, btrfs_sync_log first sync the log tree;
then updates corresponding root item in the log root tree;
sync the log root tree; then update the super block.
Yan Zheng [Wed, 21 Jan 2009 15:49:16 +0000 (10:49 -0500)]
Btrfs: fix stop searching test in replace_one_extent
replace_one_extent searches tree leaves for references to a given extent. It
stops searching if it goes beyond the last possible position.
The last possible position is computed by adding the starting offset of a found
file extent to the full size of the extent. The code uses physical size of the
extent as the full size. This is incorrect when compression is used.
The fix is get the full size from ram_bytes field of file extent item.
Yan Zheng [Wed, 21 Jan 2009 15:49:16 +0000 (10:49 -0500)]
Btrfs: Fix infinite loop in btrfs_extent_post_op
btrfs_extent_post_op calls finish_current_insert and del_pending_extents. They
both may enter infinite loops.
finish_current_insert enters infinite loop if it only finds some backrefs to
update. The fix is to check for pending backref updates before restarting the
loop.
The infinite loop in del_pending_extents is due to a the skipped variable
not being properly reset before looping around.
Roland Dreier [Wed, 21 Jan 2009 15:49:16 +0000 (10:49 -0500)]
Btrfs: Remove extra KERN_INFO in the middle of a line
The "devid <xxx> transid <xxx>" printk in btrfs_scan_one_device()
actually follows another printk that doesn't end in a newline (since the
intention is for the two printks to make one line of output), so the
KERN_INFO just ends up messing up the output:
device label exp <6>devid 1 transid 9 /dev/sda5
Fix this by changing the extra KERN_INFO to KERN_CONT.
Signed-off-by: Roland Dreier <rolandd@cisco.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
Josef Bacik [Wed, 21 Jan 2009 15:49:16 +0000 (10:49 -0500)]
Btrfs: cleanup xattr code
Andrew's review of the xattr code revealed some minor issues that this patch
addresses. Just an error return fix, got rid of a useless statement and
commented one of the trickier parts of __btrfs_getxattr.
Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
The structure used to send device in btrfs ioctl calls was not
properly aligned, and so 32 bit ioctls would not work properly on
64 bit kernels.
We could fix this with compat ioctls, but we're just one byte away
and it doesn't make sense at this stage to carry about the compat ioctls
forever at this stage in the project.
This patch brings the ioctl arg up to an evenly aligned 4k.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Fri, 16 Jan 2009 16:58:19 +0000 (11:58 -0500)]
Btrfs: Clear the device->running_pending flag before bailing on congestion
Btrfs maintains a queue of async bio submissions so the checksumming
threads don't have to wait on get_request_wait. In order to avoid
extra wakeups, this code has a running_pending flag that is used
to tell new submissions they don't need to wake the thread.
When the threads notice congestion on a single device, they
may decide to requeue the job and move on to other devices. This
makes sure the running_pending flag is cleared before the
job is requeued.
It should help avoid IO stalls by making sure the task is woken up
when new submissions come in.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Fri, 9 Jan 2009 18:14:17 +0000 (13:14 -0500)]
Btrfs: explicitly mark the tree log root for writeback
Each subvolume has an extent_state_tree used to mark metadata
that needs to be sent to disk while syncing the tree. This is
used in addition to the dirty bits on the pages themselves so that
a single subvolume can be sent to disk efficiently in disk order.
Normally this marking happens in btrfs_alloc_free_block, which also does
special recording of dirty tree blocks for the tree log roots.
Yan Zheng noticed that when the root of the log tree is allocated, it is added
to the wrong writeback list. The fix used here is to explicitly set
it dirty as part of tree log creation.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Yan Zheng [Tue, 6 Jan 2009 14:58:02 +0000 (09:58 -0500)]
Btrfs: don't change file extent's ram_bytes in btrfs_drop_extents
btrfs_drop_extents doesn't change file extent's ram_bytes
in the case of booked extent. To be consistent, we should
also not change ram_bytes when truncating existing extent.
Yan Zheng [Tue, 6 Jan 2009 14:58:06 +0000 (09:58 -0500)]
Btrfs: Use btrfs_join_transaction to avoid deadlocks during snapshot creation
Snapshot creation happens at a specific time during transaction commit. We
need to make sure the code called by snapshot creation doesn't wait
for the running transaction to commit.
This changes btrfs_delete_inode and finish_pending_snaps to use
btrfs_join_transaction instead of btrfs_start_transaction to avoid deadlocks.
It would be better if btrfs_delete_inode didn't use the join, but the
call path that triggers it is:
Liu Hui [Mon, 5 Jan 2009 20:57:51 +0000 (15:57 -0500)]
Btrfs: Fix free block discard calls down to the block layer
This is a patch to fix discard semantic to make Btrfs work with FTL and SSD.
We can improve FTL's performance by telling it which sectors are freed by file
system. But if we don't tell FTL the information of free sectors in proper
time, the transaction mechanism of Btrfs will be destroyed and Btrfs could not
roll back the previous transaction under the power loss condition.
There are some problems in the old implementation:
1, In __free_extent(), the pinned down extents should not be discarded.
2, In free_extents(), the free extents are all pinned, so they need to
be discarded in transaction committing time instead of free_extents().
3, The reserved extent used by log tree should be discard too.
This patch change discard behavior as follows:
1, For the extents which need to be free at once,
we discard them in update_block_group().
2, Delay discarding the pinned extent in btrfs_finish_extent_commit()
when committing transaction.
3, Remove discarding from free_extents() and __free_extent()
4, Add discard interface into btrfs_free_reserved_extent()
5, Discard sectors before updating the free space cache, otherwise,
FTL will destroy file system data.
Yan Zheng [Mon, 5 Jan 2009 20:43:42 +0000 (15:43 -0500)]
Btrfs: avoid potential super block corruption
The data in fs_info->super_for_commit are zeros before the
first transaction commit. If tree log sync and system crash
both occur before the first transaction commit, super block
will get corrupted.
This fixes it by properly filling in the super_for_commit field at
open time.
Chris Mason [Mon, 5 Jan 2009 21:57:23 +0000 (16:57 -0500)]
Btrfs: add permission checks to the ioctls
Only root can add/remove devices
Only root can defrag subtrees
Only files open for writing can be defragged
Only files open for writing can be the destination for a clone
Signed-off-by: Chris Mason <chris.mason@oracle.com>
and are queued up for v2.6.29. This shows that the facility is still not
tested well enough to release into a stable kernel - disable it for now and
reactivate in .29. In .29 the hardware-branch-tracer will use the DS/BTS
facilities too - hopefully resulting in better code.
Kyle McMartin [Tue, 23 Dec 2008 13:44:30 +0000 (08:44 -0500)]
parisc: disable UP-optimized flush_tlb_mm
flush_tlb_mm's "optimized" uniprocessor case of allocating a new
context for userspace is exposing a race where we can suddely return
to a syscall with the protection id and space id out of sync, trapping
on the next userspace access.
Harry Ciao [Tue, 23 Dec 2008 21:57:16 +0000 (13:57 -0800)]
edac: fix edac core deadlock when removing a device
When deleting an edac device, we have to wait for its edac_dev.work to be
completed before deleting the whole edac_dev structure. Since we have no
idea which work in current edac_poller's workqueue is the work we are
conerned about, we wait for all work in the edac_poller's workqueue to be
proceseed. This is done via flush_cpu_workqueue() which inserts a
wq_barrier into the tail of the workqueue and then sleeping on the
completion of this wq_barrier. The edac_poller will wake up sleepers when
it is found.
EDAC core creates only one kernel worker thread, edac_poller, to run the
works of all current edac devices. They share the same callback function
of edac_device_workq_function(), which would grab the mutex of
device_ctls_mutex first before it checks the device. This is exactly
where edac_poller and rmmod would have a great chance to deadlock.
In below call trace of rmmod > ... >
edac_device_del_device >
edac_device_workq_teardown > flush_workqueue > flush_cpu_workqueue,
device_ctls_mutex would have already been grabbed by
edac_device_del_device(). So, on one hand rmmod would sleep on the
completion of a wq_barrier, holding device_ctls_mutex; on the other hand
edac_poller would be blocked on the same mutex when it's running any one
of works of existing edac evices(Note, this edac_dev.work is likely to be
totally irrelevant to the one that is being removed right now)and never
would have a chance to run the work of above wq_barrier to wake rmmod up.
edac_device_workq_teardown() should not be called within the critical
region of device_ctls_mutex. Just like is done in edac_pci_del_device()
and edac_mc_del_mc(), where edac_pci_workq_teardown() and
edac_mc_workq_teardown() are called after related mutex are released.
Moreover, an edac_dev.work should check first if it is being removed. If
this is the case, then it should bail out immediately. Since not all of
existing edac devices are to be removed, this "shutting flag" should be
contained to edac device being removed. The current edac_dev.op_state can
be used to serve this purpose.
The original deadlock problem and the solution have been witnessed and
tested on actual hardware. Without the solution, rmmod an edac driver
would result in below deadlock:
Evgeniy Polyakov [Tue, 23 Dec 2008 21:57:12 +0000 (13:57 -0800)]
w1: fix slave selection on big-endian systems
During test of the w1-gpio driver i found that in "w1.c:679
w1_slave_found()" the device id is converted to little-endian with
"cpu_to_le64()", but its not converted back to cpu format in "w1_io.c:293
w1_reset_select_slave()".
Based on a patch created by Andreas Hummel.
[akpm@linux-foundation.org: remove unneeded cast] Reported-by: Andreas Hummel <andi_hummel@gmx.de> Signed-off-by: Evgeniy Polyakov <zbr@ioremap.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
V4L/DVB (9920): em28xx: fix NULL pointer dereference in call to VIDIOC_INT_RESET command
Fix a NULL pointer dereference that would occur if the video decoder tied to
the em28xx supports the VIDIOC_INT_RESET call (for example: the cx25840 driver)
Thomas Gleixner [Sat, 20 Dec 2008 20:27:34 +0000 (21:27 +0100)]
Null pointer deref with hrtimer_try_to_cancel()
Impact: Prevent kernel crash with posix timer clockid CLOCK_MONOTONIC_RAW
commit 2d42244ae71d6c7b0884b5664cf2eda30fb2ae68 (clocksource:
introduce CLOCK_MONOTONIC_RAW) introduced a new clockid, which is only
available to read out the raw not NTP adjusted system time.
The above commit did not prevent that a posix timer can be created
with that clockid. The timer_create() syscall succeeds and initializes
the timer to a non existing hrtimer base. When the timer is deleted
either by timer_delete() or by the exit() cleanup the kernel crashes.
Prevent the creation of timers for CLOCK_MONOTONIC_RAW by setting the
posix clock function to no_timer_create which returns an error code.
Reported-and-tested-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sat, 20 Dec 2008 19:07:31 +0000 (11:07 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
fs/9p: change simple_strtol to simple_strtoul
9p: convert d_iname references to d_name.name
9p: Remove potentially bad parameter from function entry debug print.
Linus Torvalds [Sat, 20 Dec 2008 19:07:18 +0000 (11:07 -0800)]
Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: fix resume (S2R) broken by Intel microcode module, on A110L
x86 gart: don't complain if no AMD GART found
AMD IOMMU: panic if completion wait loop fails
AMD IOMMU: set cmd buffer pointers to zero manually
x86: re-enable MCE on secondary CPUS after suspend/resume
AMD IOMMU: allocate rlookup_table with __GFP_ZERO
might hang upon resuming, OTOH it should have likely hanged each and every time.
(1) possible deadlock in microcode_resume_cpu() if either 'if' section is
taken;
(2) now, I don't see it in spec. and can't experimentally verify it (newer
ucodes don't seem to be available for my Core2duo)... but logically-wise, I'd
think that when read upon resuming, the 'microcode revision' (MSR 0x8B) should
be back to its original one (we need to reload ucode anyway so it doesn't seem
logical if a cpu doesn't drop the version)... if so, the comparison with
memcmp() for the full 'struct cpu_signature' is wrong... and that's how one of
the aforementioned 'if' sections might have been triggered - leading to a
deadlock.
Obviously, in my tests I simulated loading/resuming with the ucode of the same
version (just to see that the file is loaded/re-loaded upon resuming) so this
issue has never popped up.
I'd appreciate if someone with an appropriate system might give a try to the
2nd patch (titled "fix a comparison && deadlock...").
In any case, the deadlock situation is a must-have fix.
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6:
[SCSI] mpt fusion: clear list of outstanding commands on host reset
[SCSI] scsi_lib: only call scsi_unprep_request() under queue lock
[SCSI] ibmvstgt: move crq_queue_create to the end of initialization
[SCSI] libiscsi REGRESSION: fix passthrough support with older iscsi tools
[SCSI] aacraid: disable Dell Percraid quirk on Adaptec 2200S and 2120S
Linus Torvalds [Fri, 19 Dec 2008 19:36:04 +0000 (11:36 -0800)]
Merge branch 'drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6
* 'drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6:
drm/i915: GEM on PAE has problems - disable it for now.
drm/i915: Don't return busy for buffers left on the flushing list.
Yan Zheng [Fri, 19 Dec 2008 15:58:46 +0000 (10:58 -0500)]
Btrfs: properly update block accounting for metadata
This adds the missing block accounting code to finish_current_insert and makes
block accounting for root item properly protected by the delalloc spin lock.
Stanley Miao [Fri, 19 Dec 2008 14:08:22 +0000 (22:08 +0800)]
ALSA: Fix a Oops bug in omap soc driver.
There will be a Oops or frequent underrun messages when playing music with
omap soc driver, this is because a data region is incorretly sized, other data
region will be overwriten when writing to this data region.
Signed-off-by: Stanley Miao <stanley.miao@windriver.com> Acked-by: Jarkko Nikula <jarkko.nikula@nokia.com> Cc: stable@kernel.org Signed-off-by: Takashi Iwai <tiwai@suse.de>
Takashi Iwai [Fri, 19 Dec 2008 13:02:32 +0000 (14:02 +0100)]
ALSA: hda - Remove non-working headphone control for Dell laptops
The previous commit re-enabled hp_nid setup for IDT92HD73*, but
it's unneeded indeed for Dell laptops that have multiple headphones.
Setting the extra hp_nid results in a non-working "Headpohne" mixer
control. Thus hp_nid should be 0 for these dell models.
Also, the automatic addition of hp_nid should check whether it's
a dual-HP model or not. For dual-HPs, the pins are already checked
by the early workaround.
Wu Fengguang [Wed, 26 Nov 2008 06:35:22 +0000 (14:35 +0800)]
ACPI: don't cond_resched() when irqs_disabled()
The ACPI interpreter usually runs with irqs enabled.
However, during suspend/resume it runs with
irqs disabled to evaluate _GTS/_BFS, as well as
by irqrouter_resume() which evaluates _CRS, _PRS, _SRS.
http://bugzilla.kernel.org/show_bug.cgi?id=12252
Signed-off-by: Wu Fengguang <wfg@linux.intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
Bjorn Helgaas [Thu, 13 Nov 2008 23:30:13 +0000 (17:30 -0600)]
ACPI: fix 2.6.28 acpi.debug_level regression
acpi_early_init() was changed to over-write the cmdline param,
making it really inconvenient to set debug flags at boot-time.
Also,
This sets the default level to "info", which is what all the ACPI
drivers use. So to enable messages from drivers, you only have to
supply the "layer" (a.k.a. "component"). For non-"info" ACPI core
and ACPI interpreter messages, you have to supply both level and
layer masks, as before.
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Signed-off-by: Len Brown <len.brown@intel.com>
Takashi Iwai [Wed, 17 Dec 2008 13:51:01 +0000 (14:51 +0100)]
ALSA: hda - Add no-jd model for IDT 92HD73xx
Added the model without the jack-detection for some desktops that
have really no jack-detection. The recent driver caused regressions
regarding the sound output on such machines.
cciss: fix problem that deleting multiple logical drives could cause a panic
Fix problem that deleting multiple logical drives could cause a panic.
It fixes a panic which can be easily reproduced in the following way: Just
create several "arrays," each with multiple logical drives via hpacucli,
then delete the first array, and it will blow up in deregister_disk(), in
the call to get_host() when it tries to dig the hba pointer out of a NULL
queue pointer.
The problem has been present since my code to make rebuild_lun_table
behave better went in.
Signed-off-by: Stephen M. Cameron <scameron@beardog.cca.cpqcorp.net> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Dave Airlie [Fri, 19 Dec 2008 05:38:34 +0000 (15:38 +1000)]
drm/i915: GEM on PAE has problems - disable it for now.
On PAE systems, GEM allocates pages using shmem, and passes these
pages to be bound into AGP, however the AGP interfaces + the x86
set_memory interfaces all take unsigned long not dma_addr_t.
The initial fix for this was a mess, so we need to do this correctly
for 2.6.29.
Eric Anholt [Mon, 15 Dec 2008 03:05:04 +0000 (19:05 -0800)]
drm/i915: Don't return busy for buffers left on the flushing list.
These buffers don't have active rendering still occurring to them, they just
need either a flush to be emitted or a retire_requests to occur so that we
notice they're done. Return unbusy so that one of the two occurs. The two
expected consumers of this interface (OpenGL and libdrm_intel BO cache) both
want this behavior.
Signed-off-by: Eric Anholt <eric@anholt.net> Acked-by: Keith Packard <keithp@keithp.com> Signed-off-by: Dave Airlie <airlied@redhat.com>
NeilBrown [Fri, 19 Dec 2008 05:25:01 +0000 (16:25 +1100)]
md: Don't read past end of bitmap when reading bitmap.
When we read the write-intent-bitmap off the device, we currently
read a whole number of pages.
When PAGE_SIZE is 4K, this works due to the alignment we enforce
on the superblock and bitmap.
When PAGE_SIZE is 64K, this case read past the end-of-device
which causes an error.
When we write the superblock, we ensure to clip the last page
to just be the required size. Copy that code into the read path
to just read the required number of sectors.
Signed-off-by: Neil Brown <neilb@suse.de> Cc: stable@kernel.org
James Chapman [Wed, 17 Dec 2008 12:02:16 +0000 (12:02 +0000)]
ppp: fix segfaults introduced by netdev_priv changes
This patch fixes a segfault in ppp_shutdown_interface() and
ppp_destroy_interface() when a PPP connection is closed. I bisected
the problem to the following commit:
netdevice ppp: Convert directly reference of netdev->priv
1. Use netdev_priv(dev) to replace dev->priv.
2. Alloc netdev's private data by alloc_netdev().
Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The original ppp_generic code treated the netdev and struct ppp as
independent data structures which were freed separately. In moving the
ppp struct into the netdev, it is now possible for the private data to
be freed before the call to ppp_shutdown_interface(), which is bad.
The kfree(ppp) in ppp_destroy_interface() is also wrong; presumably
ppp hasn't worked since the above commit.
The following patch fixes both problems.
Signed-off-by: James Chapman <jchapman@katalix.com> Reviewed-by: Wang Chen <wangchen@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Yongjun [Fri, 19 Dec 2008 03:35:10 +0000 (19:35 -0800)]
net: Fix module refcount leak in kernel_accept()
The kernel_accept() does not hold the module refcount of newsock->ops->owner,
so we need __module_get(newsock->ops->owner) code after call kernel_accept()
by hand.
In sunrpc, the module refcount is missing to hold. So this cause kernel panic.
Used following script to reproduct:
while [ 1 ];
do
mount -t nfs4 192.168.0.19:/ /mnt
touch /mnt/file
umount /mnt
lsmod | grep ipv6
done
This patch fixed the problem by add __module_get(newsock->ops->owner) to
kernel_accept(). So we do not need to used __module_get(newsock->ops->owner)
in every place when used kernel_accept().
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
John McCutchan [Thu, 18 Dec 2008 01:43:02 +0000 (17:43 -0800)]
Maintainer email fixes for inotify
Update John McCutchan and Robert Love's email addresses for
maintenance of inotify
Signed-off-by: John McCutchan <john@johnmccutchan.com> Acked-by: Robert Love <rlove@rlove.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>