x64, x2apic/intr-remap: code re-structuring, to be used by both DMA and Interrupt remapping
Allocate the iommu during the parse of DMA remapping hardware
definition structures. And also, introduce routines for device
scope initialization which will be explicitly called during
dma-remapping initialization.
These will be used for enabling interrupt remapping separately from the
existing DMA-remapping enabling sequence.
Roland McGrath [Thu, 10 Jul 2008 21:50:39 +0000 (14:50 -0700)]
x86_64: fix delayed signals
On three of the several paths in entry_64.S that call
do_notify_resume() on the way back to user mode, we fail to properly
check again for newly-arrived work that requires another call to
do_notify_resume() before going to user mode. These paths set the
mask to check only _TIF_NEED_RESCHED, but this is wrong. The other
paths that lead to do_notify_resume() do this correctly already, and
entry_32.S does it correctly in all cases.
All paths back to user mode have to check all the _TIF_WORK_MASK
flags at the last possible stage, with interrupts disabled.
Otherwise, we miss any flags (TIF_SIGPENDING for example) that were
set any time after we entered do_notify_resume(). More work flags
can be set (or left set) synchronously inside do_notify_resume(), as
TIF_SIGPENDING can be, or asynchronously by interrupts or other CPUs
(which then send an asynchronous interrupt).
There are many different scenarios that could hit this bug, most of
them races. The simplest one to demonstrate does not require any
race: when one signal has done handler setup at the check before
returning from a syscall, and there is another signal pending that
should be handled. The second signal's handler should interrupt the
first signal handler before it actually starts (so the interrupted PC
is still at the handler's entry point). Instead, it runs away until
the next kernel entry (next syscall, tick, etc).
This test behaves correctly on 32-bit kernels, and fails on 64-bit
(either 32-bit or 64-bit test binary). With this fix, it works.
Mark Rustad [Thu, 10 Jul 2008 19:27:11 +0000 (14:27 -0500)]
[PATCH] IPMI: return correct value from ipmi_write
This patch corrects the handling of write operations to the IPMI watchdog
to work as intended by returning the number of characters actually
processed. Without this patch, an "echo V >/dev/watchdog" enables the
watchdog if IPMI is providing the watchdog function.
Signed-off-by: Mark Rustad <MRustad@gmail.com> Signed-off-by: Corey Minyard <cminyard@mvista.com> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
x86: Recover timer_ack lost in the merge of the NMI watchdog
In the course of the recent unification of the NMI watchdog an assignment
to timer_ack to switch off unnecesary POLL commands to the 8259A in the
case of a watchdog failure has been accidentally removed. The statement
used to be limited to the 32-bit variation as since the rewrite of the
timer code it has been relevant for the 82489DX only. This change brings
it back.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
There is no such entity as ISA IRQ2. The ACPI spec does not make it
explicitly clear, but does not preclude it either -- all it says is ISA
legacy interrupts are identity mapped by default (subject to overrides),
but it does not state whether IRQ2 exists or not. As a result if there is
no IRQ0 override, then IRQ2 is normally initialised as an ISA interrupt,
which implies an edge-triggered line, which is unmasked by default as this
is what we do for edge-triggered I/O APIC interrupts so as not to miss an
edge.
To the best of my knowledge it is useless, as IRQ2 has not been in use
since the PC/AT as back then it was taken by the 8259A cascade interrupt
to the slave, with the line position in the slot rerouted to newly-created
IRQ9. No device could thus make use of this line with the pair of 8259A
chips. Now in theory INTIN2 of the I/O APIC may be usable, but the
interrupt of the device wired to it would not be available in the PIC mode
at all, so I seriously doubt if anybody decided to reuse it for a regular
device.
However there are two common uses of INTIN2. One is for IRQ0, with an
ACPI interrupt override (or its equivalent in the MP table). But in this
case IRQ2 is gone entirely with INTIN0 left vacant. The other one is for
an 8959A ExtINTA cascade. In this case IRQ0 goes to INTIN0 and if ACPI is
used INTIN2 is assumed to be IRQ2 (there is no override and ACPI has no
way to report ExtINTA interrupts). This is where a problem happens.
The problem is INTIN2 is configured as a native APIC interrupt, with a
vector assigned and the mask cleared. And the line may indeed get active
and inject interrupts if the master 8959A has its timer interrupt enabled
(it might happen for other interrupts too, but they are normally masked in
the process of rerouting them to the I/O APIC). There are two cases where
it will happen:
* When the I/O APIC NMI watchdog is enabled. This is actually a misnomer
as the watchdog pulses are delivered through the 8259A to the LINT0
inputs of all the local APICs in the system. The implication is the
output of the master 8259A goes high and low repeatedly, signalling
interrupts to INTIN2 which is enabled too!
[The origin of the name is I think for a brief period during the
development we had a capability in our code to configure the watchdog to
use an I/O APIC input; that would be INTIN2 in this scenario.]
* When the native route of IRQ0 via INTIN0 fails for whatever reason -- as
it happens with the system considered here. In this scenario the timer
pulse is delivered through the 8259A to LINT0 input of the local APIC of
the bootstrap processor, quite similarly to how is done for the watchdog
described above. The result is, again, INTIN2 receives these pulses
too. Rafael's system used to escape this scenario, because an incorrect
IRQ0 override would occupy INTIN2 and prevent it from being unmasked.
My conclusion is IRQ2 should be excluded from configuration in all the
cases and the current exception for ACPI systems should be lifted. The
reason being the exception not only being useless, but harmful as well.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Andreas Herrmann <andreas.herrmann3@amd.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Unlike the 32-bit one, the 64-bit variation of the LVT0 setup code for
the "8259A Virtual Wire" through the local APIC timer configuration does
not fully configure the relevant irq_chip structure. Instead it relies on
the preceding I/O APIC code to have set it up, which does not happen if
the I/O APIC variants have not been tried.
The patch includes corresponding changes to the 32-bit variation too
which make them both the same, barring a small syntactic difference
involving sequence of functions in the source. That should work as an aid
with the upcoming merge.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Andreas Herrmann <andreas.herrmann3@amd.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>
IRQ0 is edge-triggered, but the "8259A Virtual Wire" through the local
APIC configuration in the 32-bit version uses the "fasteoi" handler
suitable for level-triggered APIC interrupt. Rewrite code so that the
"edge" handler is used. The 64-bit version uses different code and is
unaffected.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Andreas Herrmann <andreas.herrmann3@amd.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Glauber Costa [Thu, 10 Jul 2008 18:17:53 +0000 (15:17 -0300)]
x86: use AS_CFI instead of UNWIND_INFO
In dwarf2_32.h, test for CONFIG_AS_CFI instead of
CONFIG_UNWIND_INFO. Turns out that searching for UNWIND_INFO
returns no match in any Kconfig or Makefile, so we're really
just throwing everything away regarding dwarf frames for i386.
The test that generates CONFIG_AS_CFI does not have anything
x86_64-specific, and right now, checking V=1 builds shows me
that the flags is there anyway, although unused.
Signed-off-by: Glauber Costa <gcosta@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev:
libata-acpi: don't call sleeping function from invalid context
Added Targa Visionary 1000 IDE adapter to pata_sis.c
libata-acpi: filter out DIPM enable
Dave Chinner [Fri, 11 Jul 2008 07:43:55 +0000 (17:43 +1000)]
Fix reference counting race on log buffers
When we release the iclog, we do an atomic_dec_and_lock to determine if
we are the last reference and need to trigger update of log headers and
writeout. However, in xlog_state_get_iclog_space() we also need to
check if we have the last reference count there. If we do, we release
the log buffer, otherwise we decrement the reference count.
But the compare and decrement in xlog_state_get_iclog_space() is not
atomic, so both places can see a reference count of 2 and neither will
release the iclog. That leads to a filesystem hang.
Close the race by replacing the atomic_read() and atomic_dec() pair with
atomic_add_unless() to ensure that they are executed atomically.
Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Tim Shimmin <tes@sgi.com> Tested-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
eventually causing init to die and panic the system:
Kernel panic - not syncing: Attempted to kill init!
Pid: 1, comm: init Not tainted 2.6.26-rc9-tip #13878
after a maratonic bisection session, the bad commit turned out to be:
| b7675791859075418199c7af86a116ea34eaf5bd is first bad commit
| commit b7675791859075418199c7af86a116ea34eaf5bd
| Author: Jeremy Fitzhardinge <jeremy@goop.org>
| Date: Wed Jun 25 00:19:00 2008 -0400
|
| x86: remove open-coded save/load segment operations
|
| This removes a pile of buggy open-coded implementations of savesegment
| and loadsegment.
after some more bisection of this patch itself, it turns out that what
makes the difference are the savesegment() changes to __switch_to().
Taking a look at this portion of arch/x86/kernel/process_64.o revealed
this crutial difference:
note the "m" modifier - it allows GCC to generate the segment move
into a memory operand as well.
But regarding segment operands there's a subtle detail in the x86
instruction set: the above 16-bit moves are zero-extend, but only
if it goes to a register.
If it goes to a memory operand, -0x34(%rbp) in the above case, there's
no zero-extend to 32-bit and the instruction will only save 16 bits
instead of the intended 32-bit.
The other 16 bits is random data - which can cause problems when that
value is used later on.
The solution is to only allow segment operands to go to registers.
This fix allows my test-system to boot up without crashing.
x86_64: add pseudo-features for 32-bit compat syscall
Add pseudo-feature bits to describe whether the CPU supports sysenter
and/or syscall from ia32-compat userspace. This removes a hardcoded
test in vdso32-setup.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Some BIOSen enable DIPM via _GTF which causes command timeouts under
certain configuration. This didn't occur on 2.6.25 because 2.6.25
defaulted to SRST, so _GTF wasn't executed during boot probe, so ahci
host reset disabled DIPM and as _GTF wasn't executed after SRST, DIPM
wasn't enabled. On 2.6.26, hardreset is used during probe and after
probe _GTF is executed enabling DIPM and thus the failures.
This patch could theoretically disable DIPM on machines which used to
have it enabled on 2.6.25 but AFAIK ahci is currently the only driver
which uses SATA ACPI hierarchy (_SDD) and as the host reset would have
always disabled DIPM, this shouldn't happen.
Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Paul Gortmaker [Fri, 11 Jul 2008 00:30:48 +0000 (17:30 -0700)]
rtc: fix reported IRQ rate for when HPET is enabled
The IRQ rate reported back by the RTC is incorrect when HPET is enabled.
Newer hardware that has HPET to emulate the legacy RTC device gets this value
wrong since after it sets the rate, it returns before setting the variable
used to report the IRQ rate back to users of the device -- so the set rate and
the reported rate get out of sync.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Brownell <david-b@pacbell.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (27 commits)
tun: Persistent devices can get stuck in xoff state
xfrm: Add a XFRM_STATE_AF_UNSPEC flag to xfrm_usersa_info
ipv6: missed namespace context in ipv6_rthdr_rcv
netlabel: netlink_unicast calls kfree_skb on error path by itself
ipv4: fib_trie: Fix lookup error return
tcp: correct kcalloc usage
ip: sysctl documentation cleanup
Documentation: clarify tcp_{r,w}mem sysctl docs
netfilter: nf_nat_snmp_basic: fix a range check in NAT for SNMP
netfilter: nf_conntrack_tcp: fix endless loop
libertas: fix memory alignment problems on the blackfin
zd1211rw: stop beacons on remove_interface
rt2x00: Disable synchronization during initialization
rc80211_pid: Fix fast_start parameter handling
sctp: Add documentation for sctp sysctl variable
ipv6: fix race between ipv6_del_addr and DAD timer
irda: Fix netlink error path return value
irda: New device ID for nsc-ircc
irda: via-ircc proper dma freeing
sctp: Mark the tsn as received after all allocations finish
...
Max Krasnyansky [Thu, 10 Jul 2008 23:59:11 +0000 (16:59 -0700)]
tun: Persistent devices can get stuck in xoff state
The scenario goes like this. App stops reading from tun/tap.
TX queue gets full and driver does netif_stop_queue().
App closes fd and TX queue gets flushed as part of the cleanup.
Next time the app opens tun/tap and starts reading from it but
the xoff state is not cleared. We're stuck.
Normally xoff state is cleared when netdev is brought up. But
in the case of persistent devices this happens only during
initial setup.
The fix is trivial. If device is already up when an app opens
it we clear xoff state and that gets things moving again.
Signed-off-by: Max Krasnyansky <maxk@qualcomm.com> Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
xfrm: Add a XFRM_STATE_AF_UNSPEC flag to xfrm_usersa_info
Add a XFRM_STATE_AF_UNSPEC flag to handle the AF_UNSPEC behavior for
the selector family. Userspace applications can set this flag to leave
the selector family of the xfrm_state unspecified. This can be used
to to handle inter family tunnels if the selector is not set from
userspace.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Ben Hutchings [Thu, 10 Jul 2008 23:52:52 +0000 (16:52 -0700)]
ipv4: fib_trie: Fix lookup error return
In commit a07f5f508a4d9728c8e57d7f66294bf5b254ff7f "[IPV4] fib_trie: style
cleanup", the changes to check_leaf() and fn_trie_lookup() were wrong - where
fn_trie_lookup() would previously return a negative error value from
check_leaf(), it now returns 0.
Now fn_trie_lookup() doesn't appear to care about plen, so we can revert
check_leaf() to returning the error value.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Tested-by: William Boughton <bill@boughton.de> Acked-by: Stephen Heminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Take out the confusing language in tcp_frto, and organize the
undocumented values.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Acked-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: David S. Miller <davem@davemloft.net>
"The register %ecx looks innocent but is very important here. The disassembly:
mov %edx,%ecx
shr $0x2,%ecx
rep stos %eax,%es:(%edi) <-- the fault
So %ecx has been loaded from %edx... which is 0x6b6b6b6b/POISON_FREE.
(0x6b6b6b6b >> 2 == 0x1adadada.)
%ecx is the counter for the memset, from here:
memset(object, 0, c->objsize);
i.e. %ecx was loaded from c->objsize, so "c" must have been freed.
Where did "c" come from? Uh-oh...
c = get_cpu_slab(s, smp_processor_id());
This looks like it has very much to do with CPU hotplug/unplug. Is
there a race between SLUB/hotplug since the CPU slab is used after it
has been freed?"
Good analysis.
Yeah, it's possible that a caller of kmem_cache_alloc() -> slab_alloc()
can be migrated on another CPU right after local_irq_restore() and
before memset(). The inital cpu can become offline in the mean time (or
a migration is a consequence of the CPU going offline) so its
'kmem_cache_cpu' structure gets freed ( slab_cpuup_callback).
At some point of time the caller continues on another CPU having an
obsolete pointer...
Kernel Bugzilla #11063 points out that on some architectures (e.g. x86_32)
exec'ing an ELF without a PT_GNU_STACK program header should default to an
executable stack; but this got broken by the unlimited argv feature because
stack vma is now created before the right personality has been established:
so breaking old binaries using nested function trampolines.
Therefore re-evaluate VM_STACK_FLAGS in setup_arg_pages, where stack
vm_flags used to be set, before the mprotect_fixup. Checking through
our existing VM_flags, none would have changed since insert_vm_struct:
so this seems safer than finding a way through the personality labyrinth.
Gas 2.15 complains about 32-bit registers being used in lea.
AS arch/x86/lib/copy_user_64.o
/local/scratch-2/jeremy/hg/xen/paravirt/linux/arch/x86/lib/copy_user_64.S: Assembler messages:
/local/scratch-2/jeremy/hg/xen/paravirt/linux/arch/x86/lib/copy_user_64.S:188: Error: `(%edx,%ecx,8)' is not a valid 64 bit base/index expression
/local/scratch-2/jeremy/hg/xen/paravirt/linux/arch/x86/lib/copy_user_64.S:257: Error: `(%edx,%ecx,8)' is not a valid 64 bit base/index expression
AS arch/x86/lib/copy_user_nocache_64.o
/local/scratch-2/jeremy/hg/xen/paravirt/linux/arch/x86/lib/copy_user_nocache_64.S: Assembler messages:
/local/scratch-2/jeremy/hg/xen/paravirt/linux/arch/x86/lib/copy_user_nocache_64.S:107: Error: `(%edx,%ecx,8)' is not a valid 64 bit base/index expression
Nick Piggin [Thu, 10 Jul 2008 07:25:35 +0000 (17:25 +1000)]
Fix PREEMPT_RCU without HOTPLUG_CPU
PREEMPT_RCU without HOTPLUG_CPU is broken. The rcu_online_cpu is called
to initially populate rcu_cpu_online_map with all online CPUs when the
hotplug event handler is installed, and also to populate the map with
CPUs as they come online. The former case is meant to happen with and
without HOTPLUG_CPU, but without HOTPLUG_CPU, the rcu_offline_cpu
function is no-oped -- while it still gets called, it does not set the
rcu CPU map.
With a blank RCU CPU map, grace periods get to tick by completely
oblivious to active RCU read side critical sections. This results in
free-before-grace bugs.
Fix is obvious once the problem is known. (Also, change __devinit to
__cpuinit so the function gets thrown away on !HOTPLUG_CPU kernels).
Signed-off-by: Nick Piggin <npiggin@suse.de> Reported-and-tested-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ Nick is my personal hero of the day - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
arch/x86/kernel/built-in.o: In function `visws_early_detect':
: undefined reference to `mach_get_smp_config_quirk'
arch/x86/kernel/built-in.o: In function `visws_early_detect':
: undefined reference to `mach_find_smp_config_quirk'
arch/x86/kernel/visws_quirks.c: In function ‘visws_early_detect’:
arch/x86/kernel/visws_quirks.c:293: error: ‘no_broadcast’ undeclared (first use in this function)
arch/x86/kernel/visws_quirks.c:293: error: (Each undeclared identifier is reported only once
arch/x86/kernel/visws_quirks.c:293: error: for each function it appears in.)
make[1]: *** [arch/x86/kernel/visws_quirks.o] Error 1
make: *** [arch/x86/kernel/visws_quirks.o] Error 2
Robert Richter [Thu, 10 Jul 2008 16:58:24 +0000 (18:58 +0200)]
x86/pci merge: fixing numaq initialization
Patch d49c4288 (tip/x86/mpparse) introduced some changes in calling
subsys_init calls if CONFIG_X86_NUMAQ option is set. This patch
updates subsystem initalization according to this changes.
Signed-off-by: Robert Richter <robert.richter@amd.com> Cc: Robert Richter <robert.richter@amd.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
x86, VisWS: turn into generic arch, copy visws files
copy arch/x86/mach-visws/setup_visws.c, apic_visws.c and traps_visws.c
files to arch/x86/kernel/, in preparation of the switchover to a
non-subarch setup for VISWS.
x86, VisWS: turn into generic arch, eliminate include/asm-x86/mach-visws/smpboot_hooks.h
now that include/asm-x86/mach-visws/smpboot_hooks.h equals
to the default file in ../mach-default/smpboot_hooks.h, simply
include it instead of maintaining a copy.
Mark Fasheh [Thu, 10 Jul 2008 16:25:39 +0000 (09:25 -0700)]
ocfs2: Fix flags in ocfs2_file_lock
The stack-glue merge changed the way we use flags in dlmglue in that we now
use the fs/dlm equivalents. Unfortunately, a merge error left the new flock
code only partially updated. This took a while to show up though, because
the lock level constants are actually identical between o2dlm and fs/dlm.
The *_CONVERT and *_NOQUEUE flags have different values though, which is
eventually causing a crash in flags_to_o2dlm().
x86: I/O APIC: Add a 64-bit variation of replace_pin_at_irq()
When an interrupt is rerouted to a different I/O APIC pin the relevant
entry of the irq_2_pin list should get updated accordingly so that
operations are performed on the correct redirection entry.
This is already done by the 32-bit variation of the code and here is a
complementing 64-bit implementation. Should make someone's decision less
tough when merging the two. ;)
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
AMD_IOMMU should depend on IOMMU_HELPER since they are the IOMMU
helper functions. SWIOTLB requires IOMMU_HELPER so declaring that
AMD_IOMMU depends on SWIOTLB properly fixes the problems.