pilppa.com Git - linux-2.6-omap-h63xx.git/log

firewire: enforce access order between generation and node ID, fix "giving up on config rom"

fw_device.node_id and fw_device.generation are accessed without mutexes.
We have to ensure that all readers will get to see node_id updates
before generation updates.

Fixes an inability to recognize devices after "giving up on config rom",
https://bugzilla.redhat.com/show_bug.cgi?id=429950

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Reviewed by Nick Piggin <nickpiggin@yahoo.com.au>.

Verified to fix 'giving up on config rom' issues on multiple system and
drive combinations that were previously affected.

Signed-off-by: Jarod Wilson <jwilson@redhat.com>
Signed-off-by: Kristian Høgsberg <krh@redhat.com>

firewire: fw-cdev: use device generation, not card generation

We have to use the fw_device.generation here, not the fw_card.generation,
because the generation must never be newer than the node ID when we emit
a transaction. This cannot be guaranteed with fw_card.generation.

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Verified in concert with subsequent memory barriers patch to fix 'giving
up on config rom' issues on multiple system and drive combinations that
were previously affected.

Signed-off-by: Jarod Wilson <jwilson@redhat.com>

firewire: fw-sbp2: use device generation, not card generation

There was a small window where a login or reconnect job could use an
already updated card generation with an outdated node ID.  We have to
use the fw_device.generation here, not the fw_card.generation, because
the generation must never be newer than the node ID when we emit a
transaction.  This cannot be guaranteed with fw_card.generation.

Furthermore, the target's and initiator's node IDs can be obtained from
fw_device and fw_card.  Dereferencing their underlying topology objects
is not necessary.

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Verified in concert with subsequent memory barriers patch to fix 'giving
up on config rom' issues on multiple system and drive combinations that
were previously affected.

Signed-off-by: Jarod Wilson <jwilson@redhat.com>

firewire: fw-sbp2: try to increase reconnect_hold (speed up reconnection)

Ask the target to grant 4 seconds instead of the standard and minimum of
1 second window after bus reset for reconnection.  This accelerates
reconnection if there are more than one targets on the bus:  If a login
and inquiry to one target blocks the fw-sbp2 workqueue for more than 1s
after bus reset, we now still can reconnect to the other target.

Before that, fw-sbp2's reconnect attempts would be rejected with "error
status: 0:9" (function rejected), and fw-sbp2 would finally re-login.
All those futile reconnect attemps cost extra time until the target
which needs re-login is ready for I/O again.

The reconnect timeout field in the login ORB doesn't have to be honored
by the target though.  I found that we could get up to
  - allegedly 32768s from an old OXFW911 firmware
  - 256s from LSI bridges
  - 4s from OXUF922 and OXFW912 bridges,
  - 2s from TI bridges,
  - only the standard 1s from Initio and Prolific bridges and from
    Apple OpenFirmware in target mode.

We just try to get 4 seconds which already covers the case of a few
HDDs on the same bus quite nicely.

A minor drawback occurs in the following (rare and impractical) border
case:
  - two initiators are there, initiator 1 holds an exclusive login to
    a target,
  - initiator 1 goes off the bus,
  - target refuses login attempts from initiator 2 until reconnect_hold
    seconds after bus reset.

An alternative approach to the issue at hand would be to parallelize
fw-sbp2's reconnect and login work.

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Acked-by: Jarod Wilson <jwilson@redhat.com>

firewire: fw-sbp2: skip unnecessary logout

Don't attempt to send a logout ORB if the target was already unplugged
or had its link switched off. If two targets are attached, this
enhances the chance to quickly reconnect to the remaining target when
one target is plugged out.

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Acked-by: Jarod Wilson <jwilson@redhat.com>

firewire vs. ieee1394: clarify MAINTAINERS

Maintainers like to receive less mail, and submitters like to have to Cc
less recipients.

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>

firewire: fw-ohci: Dynamically allocate buffers for DMA descriptors

Previously, the fw-ohci driver used fixed-length buffers for storing
descriptors for isochronous receive DMA programs.  If an application
(such as libdc1394) generated a DMA program that was too large, fw-ohci
would reach the limit of its fixed-sized buffer and return an error to
userspace.

This patch replaces the fixed-length ring-buffer with a linked-list of
page-sized buffers.  Additional buffers can be dynamically allocated and
appended to the list when necessary.  For a particular context, buffers
are kept around after use and reused as necessary, so there is no
allocation taking place after the DMA program is generated for the first
time.

In addition, the buffers it uses are coherent for DMA so there is no
syncing required before and after writes.  This syncing wasn't properly
done in the previous version of the code.

-

This is the fourth version of my patch that replaces a fixed-length
buffer for DMA descriptors with a dynamically allocated linked-list of
buffers.

As we discovered with the last attempt, new context programs are
sometimes queued from interrupt context, making it unacceptable to call
tasklet_disable() from context_get_descriptors().

This version of the patch uses ohci->lock for all locking needs instead
of tasklet_disable/enable.  There is a new requirement that
context_get_descriptors() be called while holding ohci->lock.  It was
already held for the AT context, so adding the requirement for the iso
context did not seem particularly onerous.  In addition, this has the
side benefit of allowing iso queue to be safely called from concurrent
user-space threads, which previously was not safe.

Signed-off-by: David Moore <dcm@acm.org>
Signed-off-by: Kristian Høgsberg <krh@redhat.com>
Signed-off-by: Jarod Wilson <jwilson@redhat.com>
-

Fixes the following issues:
  - Isochronous reception stopped prematurely if an application used a
    larger buffer.  (Reproduced with coriander.)
  - Isochronous reception stopped after one or a few frames on VT630x
    in OHCI 1.0 mode.  (Fixes reception in coriander, but dvgrab still
    doesn't work with these chips.)

Patch update: struct member alignment, whitespace nits

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>

firewire: fw-ohci: CycleTooLong interrupt management

The firewire-ohci driver so far lacked the ability to resume cycle
master duty after that condition happened, as added to ohci1394 in Linux
2.6.18 by commit 57fdb58fa5a140bdd52cf4c4ffc30df73676f0a5.  This ports
this patch to fw-ohci.

The "cycle too long" condition has been seen in practice
  - with IIDC cameras if a mode with packets too large for a speed is
    chosen,
  - sporadically when capturing DV on a VIA VT6306 card with ohci1394/
    ieee1394/ raw1394/ dvgrab 2.
    https://bugzilla.redhat.com/show_bug.cgi?id=415841#c7
(This does not fix Fedora bug 415841.)

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>

firewire: Fix extraction of source node id

Fix extraction of the source node id from the packet header.

Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>

firewire: fw-ohci: Bug fixes for packet-per-buffer support

This patch corrects a number of bugs in the current OHCI 1.0
packet-per-buffer support:

1. Correctly deal with payloads that cross a page boundary.  The
previous version would not split the descriptor at such a boundary,
potentially corrupting unrelated memory.

2. Allow user-space to specify multiple packets per struct
fw_cdev_iso_packet in the same way that dual-buffer allows.  This is
signaled by header_length being a multiple of header_size.  This
multiple determines the number of packets.  The payload size allocated
per packet is determined by dividing the total payload size by the
number of packets.

3. Make sync support work properly for packet-per-buffer.

I have tested this patch with libdc1394 by forcing my OHCI 1.1
controller to use the packet-per-buffer support instead of dual-buffer.

I would greatly appreciate testing by those who have a DV devices and
other types of iso streamers to make sure I didn't cause any
regressions.

Stefan, with this patch, I'm hoping that libdc1394 will work with all
your OHCI 1.0 controllers now.

The one bit of future work that remains for packet-per-buffer support is
the automatic compaction of short payloads that I discussed with
Kristian.

Signed-off-by: David Moore <dcm@acm.org>
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>

firewire: fw-ohci: Fix for dualbuffer three-or-more buffers

This patch fixes the problem where different OHCI 1.1 controllers behave
differently when a received iso packet straddles three or more buffers
when using the dual-buffer receive mode.  Two changes are made in order
to handle this situation:

1. The packet sync DMA descriptor is given a non-zero header length and
non-zero payload length.  This is because zero-payload descriptors are
not discussed in the OHCI 1.1 specs and their behavior is thus
undefined.  Instead we use a header size just large enough for a single
header and a payload length of 4 bytes for this first descriptor.

2. As we process received packets in the context's tasklet, read the
packet length out of the headers.  Keep track of the running total of
the packet length as "excess_bytes", so we can ignore any descriptors
where no packet starts or ends.  These descriptors may not have had
their first_res_count or second_res_count fields updated by the
controller so we cannot rely on those values.

The main drawback of this patch is that the excess_bytes value might get
"out of sync" with the packet descriptors if something strange happens
to the DMA program.  I'm not if such a thing could ever happen, but I
appreciate any suggestions in making it more robust.

Also, the packet-per-buffer support may need a similar fix to deal with
issue 1, but I haven't done any work on that yet.

Stefan, I'm hoping that with this patch, all your OHCI 1.1 controllers
will work properly with an unmodified version of libdc1394.

Signed-off-by: David Moore <dcm@acm.org>
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>

firewire: fw-sbp2: remove unused misleading macro

SBP2_MAX_SECTORS is nowhere used in fw-sbp2.
It merely got copied over from sbp2 where it played a role in the past.

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>

firewire: fw-sbp2: prepare for s/g chaining

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>

firewire: fw-sbp2: refactor workq and kref handling

This somewhat reduces the size of firewire-sbp2.ko.

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>

ieee1394: ohci1394: don't schedule IT tasklets on IR events

Bug noted by Pieter Palmers: Isochronous transmit tasklets were
scheduled on isochronous receive events, in addition to the proper
isochronous receive tasklets.

http://marc.info/?l=linux1394-devel&m=119783196222802

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>

ieee1394: sbp2: raise default transfer size limit

This patch speeds up sbp2 a little bit --- but more importantly, it
brings the behavior of sbp2 and fw-sbp2 closer to each other.  Like
fw-sbp2, sbp2 now does not limit the size of single transfers to 255
sectors anymore, unless told so by a blacklist flag or by module load
parameters.

Only very old bridge chips have been known to need the 255 sectors
limit, and we have got one such chip in our hardwired blacklist.  There
certainly is a danger that more bridges need that limit; but I prefer to
have this issue present in both fw-sbp2 and sbp2 rather than just one of
them.

An OXUF922 with 400GB 7200RPM disk on an S400 controller is sped up by
this patch from 22.9 to 23.5 MB/s according to hdparm.  The same effect
could be achieved before by setting a higher max_sectors module
parameter.  On buses which use 1394b beta mode, sbp2 and fw-sbp2 will
now achieve virtually the same bandwidth.  Fw-sbp2 only remains faster
on 1394a buses due to fw-core's gap count optimization.

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>

ieee1394: remove unused code

The code has been in "#if 0 - #endif" since Linux 2.6.12.

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>

ieee1394: small cleanup after "nopage"

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>

ieee1394: nopage

Convert ieee1394 from nopage to fault.
Remove redundant vma range checks (correct resource range check is retained).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>

ieee1394: Add missing "space"

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>

ieee1394: sbp2: s/g list access cosmetics

Replace sg->length by sg_dma_len(sg). Rename a variable for shorter
line lengths and eliminate some superfluous local variables.

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>

ieee1394: sbp2: prepare for s/g chaining

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86

* git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86: (890 commits)
  x86: fix nodemap_size according to nodeid bits
  x86: fix overlap between pagetable with bss section
  x86: add PCI IDs to k8topology_64.c
  x86: fix early_ioremap pagetable ops
  x86: use the same pgd_list for PAE and 64-bit
  x86: defer cr3 reload when doing pud_clear()
  x86: early boot debugging via FireWire (ohci1394_dma=early)
  x86: don't special-case pmd allocations as much
  x86: shrink some ifdefs in fault.c
  x86: ignore spurious faults
  x86: remove nx_enabled from fault.c
  x86: unify fault_32|64.c
  x86: unify fault_32|64.c with ifdefs
  x86: unify fault_32|64.c by ifdef'd function bodies
  x86: arch/x86/mm/init_32.c printk fixes
  x86: arch/x86/mm/init_32.c cleanup
  x86: arch/x86/mm/init_64.c printk fixes
  x86: unify ioremap
  x86: fixes some bugs about EFI memory map handling
  x86: use reboot_type on EFI 32
  ...

[net] Gracefully handle shared e1000/1000e driver PCI ID's

Both the old e1000 driver and the new e1000e driver can drive some
PCI-Express e1000 cards, and we should avoid ambiguity about which
driver will pick up the support for those cards when both drivers are
enabled.

This solves the problem by having the old driver support those cards if
the new driver isn't configured, but otherwise ceding support for PCI
Express versions of the e1000 chipset to the newer driver. Thus
allowing both legacy configurations where only the old driver is active
(and handles all chips it knows about) and the new configuration with
the new driver handling the more modern PCIE variants.

Acked-by: Jeff Garzik <jeff@garzik.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Make !NETFILTER_ADVANCED enable IP6_NF_MATCH_IPV6HEADER

We want IPV6HEADER matching for the non-advanced default netfilter
configuration, since it's part of the standard netfilter setup of at
least some distributions (eg Fedora).

Otherwise NETFILTER_ADVANCED loses much of its point, since even
non-advanced users would have to enable all the advanced options just to
get a working IPv6 netfilter setup.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

x86: fix nodemap_size according to nodeid bits

memnode.map is s16 array because of nodeid is 16 bit now.

so need to increase the nodemap_size according to that bits.

Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: fix overlap between pagetable with bss section

one early crash on one 8 node 256g machine:

Command line: console=uart8250,io,0x3f8,115200n8 initrd=kernel.org/mydisk11_x86_64.gz rw root=/dev/ram0 debug initcall_debug apic=debug acpi.debug_level=0x0000000f pci=routeirq ip=dhcp load_ramdisk=1 ramdisk_size=131072 BOOT_IMAGE=kernel.org/bzImage_2.6.25_k8.1
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009bc00 (usable)
BIOS-e820: 000000000009bc00 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e6000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 00000000dffe0000 (usable)
BIOS-e820: 00000000dffe0000 - 00000000dffee000 (ACPI data)
BIOS-e820: 00000000dffee000 - 00000000dffff050 (ACPI NVS)
BIOS-e820: 00000000dffff050 - 00000000e0000000 (reserved)
BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
BIOS-e820: 00000000ff700000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 0000004020000000 (usable)
Early serial console at I/O port 0x3f8 (options '115200n8')
console [uart0] enabled
end_pfn_map = 67239936
Kernel panic - not syncing: Duplicated early reservation d40000-e42000

Pid: 0, comm: swapper Not tainted 2.6.24-smp-g5a514e21-dirty #3

Call Trace:
[<ffffffff80221545>] lapic_get_maxlvt+0x0/0x10
[<ffffffff80221657>] clear_local_APIC+0x5/0xcf
[<ffffffff80221726>] disable_local_APIC+0x5/0x17
[<ffffffff8021fe16>] smp_send_stop+0x46/0x4c
[<ffffffff80235293>] panic+0x94/0x13e
[<ffffffff80bc3b03>] sctp_eps_proc_init+0x12/0x34
[<ffffffff80b9f1c5>] reserve_early+0x30/0x6c
[<ffffffff80803925>] init_memory_mapping+0x2cd/0x2dc
[<ffffffff80b9dc01>] setup_arch+0x21f/0x44e
[<ffffffff80b978be>] start_kernel+0x6f/0x2c7
[<ffffffff80b971cc>] _sinittext+0x1cc/0x1d3

it turns out there is overlap between pgtable and bss...

in System.map we have
ffffffff80d40420 b rsi_table
ffffffff80d40620 B krb5_seq_lock
ffffffff80d40628 b i.20437
ffffffff80d40630 b xprt_rdma_inline_write_padding
ffffffff80d40638 b sunrpc_table_header
ffffffff80d40640 b zero
ffffffff80d40644 b min_memreg
ffffffff80d40648 b rpcrdma_tk_lock_g
ffffffff80d40650 B sctp_assocs_id_lock
ffffffff80d40658 B proc_net_sctp
ffffffff80d40660 B sctp_assocs_id
ffffffff80d40680 B sysctl_sctp_mem
ffffffff80d40690 B sysctl_sctp_rmem
ffffffff80d406a0 B sysctl_sctp_wmem
ffffffff80d406b0 b sctp_ctl_socket
ffffffff80d406b8 b sctp_pf_inet6_specific
ffffffff80d406c0 b sctp_pf_inet_specific
ffffffff80d406c8 b sctp_af_v4_specific
ffffffff80d406d0 b sctp_af_v6_specific
ffffffff80d406d8 b sctp_rand.33270
ffffffff80d406dc b sctp_memory_pressure
ffffffff80d406e0 b sctp_sockets_allocated
ffffffff80d406e4 b sctp_memory_allocated
ffffffff80d406e8 b sctp_sysctl_header
ffffffff80d406f0 b zero
ffffffff80d406f4 A __bss_stop
ffffffff80d406f4 A _end

need to round up table_start to PAGE_SIZE.

also make the panic more informative.

Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: add PCI IDs to k8topology_64.c

This just adds the PCI IDs of AMD's family 10h and 11h CPU's northbridges to
k8topology discovery.

Signed-off-by: Joachim Deguara <joachim.deguara@amd.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Yinghai Lu <yinghai.lu@sun.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: fix early_ioremap pagetable ops

Put appropriate pagetable update hooks in so that paravirt knows
what's going on in there.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: use the same pgd_list for PAE and 64-bit

Use a standard list threaded through page->lru for maintaining the pgd
list on PAE. This is the same as 64-bit, and seems saner than using a
non-standard list via page->index.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: defer cr3 reload when doing pud_clear()

PAE mode requires that we reload cr3 in order to guarantee that
changes to the pgd will be noticed by the processor. This means that
in principle pud_clear needs to reload cr3 every time. However,
because reloading cr3 implies a tlb flush, we want to avoid it where
possible.

pud_clear() is only used in a couple of places:
- in free_pmd_range(), when pulling down a range of process address space, and
- huge_pmd_unshare()

In both cases, the calling code will do a a tlb flush anyway, so
there's no need to do it within pud_clear().

In free_pmd_range(), the pud_clear is immediately followed by
pmd_free_tlb(); we can hook that to make the mmu_gather do an
unconditional full flush to make sure cr3 gets reloaded.

In huge_pmd_unshare, it is followed by flush_tlb_range, which always
results in a full cr3-reload tlb flush.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: William Irwin <wli@holomorphy.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: early boot debugging via FireWire (ohci1394_dma=early)

This patch adds a new configuration option, which adds support for a new
early_param which gets checked in arch/x86/kernel/setup_{32,64}.c:setup_arch()
to decide wether OHCI-1394 FireWire controllers should be initialized and
enabled for physical DMA access to allow remote debugging of early problems
like issues ACPI or other subsystems which are executed very early.

If the config option is not enabled, no code is changed, and if the boot
paramenter is not given, no new code is executed, and independent of that,
all new code is freed after boot, so the config option can be even enabled
in standard, non-debug kernels.

With specialized tools, it is then possible to get debugging information
from machines which have no serial ports (notebooks) such as the printk
buffer contents, or any data which can be referenced from global pointers,
if it is stored below the 4GB limit and even memory dumps of of the physical
RAM region below the 4GB limit can be taken without any cooperation from the
CPU of the host, so the machine can be crashed early, it does not matter.

In the extreme, even kernel debuggers can be accessed in this way. I wrote
a small kgdb module and an accompanying gdb stub for FireWire which allows
to gdb to talk to kgdb using remote remory reads and writes over FireWire.

An version of the gdb stub fore FireWire is able to read all global data
from a system which is running a a normal kernel without any kernel debugger,
without any interruption or support of the system's CPU. That way, e.g. the
task struct and so on can be read and even manipulated when the physical DMA
access is granted.

A HOWTO is included in this patch, in Documentation/debugging-via-ohci1394.txt
and I've put a copy online at
ftp://ftp.suse.de/private/bk/firewire/docs/debugging-via-ohci1394.txt

It also has links to all the tools which are available to make use of it
another copy of it is online at:
ftp://ftp.suse.de/private/bk/firewire/kernel/ohci1394_dma_early-v2.diff

Signed-Off-By: Bernhard Kaindl <bk@suse.de>
Tested-By: Thomas Renninger <trenn@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: don't special-case pmd allocations as much

In x86 PAE mode, stop treating pmds as a special case.  Previously
they were always allocated and freed with the pgd.  The modifies the
code to be the same as 64-bit mode, where they are allocated on
demand.

This is a step on the way to unifying 32/64-bit pagetable allocation
as much as possible.

There is a complicating wart, however.  When you install a new
reference to a pmd in the pgd, the processor isn't guaranteed to see
it unless you reload cr3.  Since reloading cr3 also has the
side-effect of flushing the tlb, this is an expense that we want to
avoid whereever possible.

This patch simply avoids reloading cr3 unless the update is to the
current pagetable.  Later patches will optimise this further.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: William Irwin <wli@holomorphy.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: shrink some ifdefs in fault.c

The change from current to tsk in do_page_fault is safe as
this is set at the very beginning of the function.

Removes a likely() annotation from the 64-bit version, this
could have instead been added to 32-bit.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: ignore spurious faults

When changing a kernel page from RO->RW, it's OK to leave stale TLB
entries around, since doing a global flush is expensive and they pose
no security problem. They can, however, generate a spurious fault,
which we should catch and simply return from (which will have the
side-effect of reloading the TLB to the current PTE).

This can occur when running under Xen, because it frequently changes
kernel pages from RW->RO->RW to implement Xen's pagetable semantics.
It could also occur when using CONFIG_DEBUG_PAGEALLOC, since it avoids
doing a global TLB flush after changing page permissions.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: remove nx_enabled from fault.c

On !PAE 32-bit, _PAGE_NX will be 0, making is_prefetch always
return early. The test is sufficient on PAE as __supported_pte_mask
is updated in the same places as nx_enabled in init_32.c which also
takes disable_nx into account.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: unify fault_32|64.c

Unify includes in moved fault.c.

Modify Makefiles to pick up unified file.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: unify fault_32|64.c with ifdefs

Elimination of these ifdefs can be done in a unified file.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: unify fault_32|64.c by ifdef'd function bodies

It's about time to get on with unifying these files, elimination
of the ugly ifdefs can occur in the unified file.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: arch/x86/mm/init_32.c printk fixes

printk fixes. NOP in terms of functionality, but strings got
a bit larger due to the KERN_ markers that were added.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: arch/x86/mm/init_32.c cleanup

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: arch/x86/mm/init_64.c printk fixes

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: unify ioremap

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: fixes some bugs about EFI memory map handling

This patch fixes some bugs of EFI memory handing code.

- On x86_64, it is possible that EFI memory map can not be mapped via
  identity map, so efi_map_memmap is removed, just use early_ioremap.

- On i386, the EFI memory map mapping take effect cross paging_init,
  so it is not necessary to use efi_map_memmap.

- EFI memory map is unmapped in efi_enter_virtual_mode to avoid
  early_ioremap leak.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: use reboot_type on EFI 32

This patch makes reboot_type of BOOT_EFI is used on i386 too. Because
correpsonding reboot code of i386 and x86_64 is merged.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: unify page fault oops printing

This changes the oops dumping format for page faults to
be similar between X86_32 and 64.

This is the first user of printk_address on X86_32.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: introduce show_fault_oops helper to fault_32|64.c

This will help when unifying the oops dumping code on 32/64
bit. No functional changes.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: add is_errata100 helper to fault_32|64.c

Further towards unifying these files, add another helper
in same spirit as is_errata93.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: add is_f00f_bug helper to fault_32|64.c

Further towards unifying these files, add another helper
in same spirit as is_errata93.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: make ioremap() UC by default

Yes! A mere 120 c_p_a() fixing and rewriting patches later,
we are now confident that we can enable UC by default for
ioremap(), on x86 too.

Every other architectures was doing this already. Doing so
makes Linux more robust against MTRR mixups (which might go
unnoticed if BIOS writers test other OSs only - where PAT
might override bad MTRRs defaults).

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: cpa cleanup the 64-bit alias math

Cleanup the address calculations, which are necessary to identify the
high/low alias mappings of the kernel on 64 bit machines. Instead of
calling __pa/__va back and forth, calculate the physical address once
and base the other calculations on it. Add understandable constants so
we can use the already available within() helper. Also add comments,
which help mere mortals to understand what this code does.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: cpa: fix the self-test

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: rodata config hookup

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: enable CONFIG_DEBUG_PAGEALLOC more widely

make CONFIG_DEBUG_PAGEALLOC universally available.

CONFIG_HIBERNATION and CONFIG_HUGETLBFS was disabling it, for no
particular reason.

If there are any unfixed bugs here we'll fix it, but do not disable
vital debugging facilities like that ..

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: init memory debugging

debug incorrect/late access to init memory, by permanently unmapping
the init memory ranges. Depends on CONFIG_DEBUG_PAGEALLOC=y.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: move misplaced rodata check call

It looks like a mismerge put the rodata self-check in the wrong spot; move
it to the right place after marking the .rodata section read only.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: fix clflush_page_range logic

only present ptes must be flushed.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: optimize clflush

clflush is sufficient to be issued on one CPU. The invalidation is
broadcast throughout the coherence domain.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: clflush_page_range needs mfence

clflush is an unordered operation with respect to other memory
traffic, including other CLFLUSH instructions. This needs proper
fencing with mfence.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: cpa: rename global_flush_tlb() to cpa_flush_all()

The function name global_flush_tlb() suggests something different from
what the function really does. Rename it to cpa_flush_all(), which is an
understandable counterpart to cpa_flush_range().

no global visibility of the old API anymore.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: cpa: implement clflush optimization

Use clflush on CPUs which support this.

clflush is only used when the page attribute operation has been
successful. On CPUs which do not support clflush and in the case of
error the old fashioned global_flush_tlb() is called.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: cpa use the new set_clr function

Convert cpa_set and cpa_clear to call the new set_clr function.
Seperate out the debug helpers.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: cpa create set_and_clr function

Create a set_and_clr function to avoid the duplicate loops. Allows
also to do combined operations for optimization.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: cpa move the flush into set and clear functions

To avoid the modification of the flush code for the clflush
implementation, move the flush into the set and clear functions and
provide helper functions for the debugging code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: add testcases for RODATA and NX protections/attributes

Latest update; I now have 4 NX tests, but 2 fail so they're #if 0'd.
I also cleaned up the NX test code quite a bit, and got rid of the ugly
exception table sorting stuff.

From: Arjan van de Ven <arjan@linux.intel.com>

This patch adds testcases for the CONFIG_DEBUG_RODATA configuration option
as well as the NX CPU feature/mappings. Both testcases can move to tests/
once that patch gets merged into mainline.
(I'm half considering moving the rodata test into mm/init.c but I'll
wait with that until init.c is unified)

As part of this I had to fix a not-quite-right alignment in the vmlinux.lds.h
for the RODATA sections, which lead to 1 page less being marked read only.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: ioremap KERN_INFO

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: cpa: clean up change_page_attr_set/clear()

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: cpa: fix loop

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: cpa: fix split thinko

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: make sure initmem is writable

When we free initmem, various rodata and CPA checks may have left
memory read only.. this patch ensures that the memory is writable
before we free it.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: fix pageattr-selftest

In Ingo's testing, he found a bug in the CPA selftest code. What would
happen is that the test would call change_page_attr_addr on a range of
memory, part of which was read only, part of which was writable. The
only thing the test wanted to change was the global bit...

What actually happened was that the selftest would take the permissions
of the first page, and then the change_page_attr_addr call would then
set the permissions of the entire range to this first page. In the
rodata section case, this resulted in pages after the .rodata becoming
read only... which made the kernel rather unhappy in many interesting
ways.

This is just another example of how dangerous the cpa API is (was); this
patch changes the test to use the incremental clear/set APIs
instead, and it changes the clear/set implementation to work on a 1 page
at a time basis.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: remove flush_agp_mappings()

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: cpa: move flush to cpa

The set_memory_* and set_pages_* family of API's currently requires the
callers to do a global tlb flush after the function call; forgetting this is
a very nasty deathtrap. This patch moves the global tlb flush into
each of the callers

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: make various pageattr.c functions static

change_page_attr_add is only used in pageattr.c now, so we can
make this function static.
change_page_attr() isn't used anywere at all anymore; this function
is a really bad API anyway so just remove the bloat entirely.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: cpa: set_memory_notpresent()

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: cpa: convert ioremap to new API

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: fix ioremap API

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: fix ioremap RAM check

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: fix the missing BIOS area check in page_is_ram

page_is_ram has a FIXME since ages, which reminds to sanity check the
BIOS area between 640k and 1M, which is sometimes falsely reported as
RAM in the e820 tables.

Implement the sanity check. Move the BIOS range defines from
pageattr.c into e820.h to avoid duplicate defines.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: move page_is_ram() function

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: deprecate change_page_attr() for drivers

With the introduction of the new API, no driver or non-archcore code needs
to use c-p-a anymore, so this patch also deprecates the EXPORT_SYMBOL of CPA
(it's a horrible API after all).

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: convert CPA users to the new set_page_ API

This patch converts various users of change_page_attr() to the new,
more intent driven set_page_*/set_memory_* API set.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: a new API for drivers/etc to control cache and other page attributes

Right now, if drivers or other code want to change, say, a cache attribute of a
page, the only API they have is change_page_attr(). c-p-a is a really bad API
for this, because it forces the caller to know *ALL* the attributes he wants
for the page, not just the 1 thing he wants to change. So code that wants to
set a page uncachable, needs to be aware of the NX status as well etc etc etc.

This patch introduces a set of new APIs for this, set_pages_<attr> and
set_memory_<attr>, that offer a logical change to the user, and leave all
attributes not implied by the requested logical change alone.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: cpa: move clflush_cache_range()

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: unify ioremap_32 and _64

Unify the now identical ioremap_32.c and ioremap_64.c into the
same ioremap.c file. No code changed.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: unify ioremap

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: use remove_vm_are in ioremap_32 error path

When ioremap_page_range fails, then we can use remove_vm_area instead
of vunmap safely.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: __iomem annotations

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: switch to change_page_attr_addr in ioremap_32.c

Use change_page_attr_addr() instead of change_page_attr(), which
simplifies the code significantly and matches the 64bit
implementation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: make c_p_a unconditional in ioremap

Make c_p_a unconditional for ioremap and iounmap. This ensures
complete consistency of the flags which are handed to
ioremap_page_range and the real flags in the mappings.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: introduce max_pfn_mapped

64bit uses end_pfn_map and 32bit uses max_low_pfn. There are several
files which have #ifdef'ed defines which map either to end_pfn_map or
max_low_pfn. Replace this by a universal define and clean up all the
other instances.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: cleanup ioremap includes

Get rid of the douplicate define of ISA_START/END_ADDRESS and use the
same headers in 32 and 64 bit code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: style cleanup of ioremap code

Fix the coding style before going further.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: fix ioremap pgprot inconsistency

The pgprot flags which are handed into ioremap_page_range() are
different to those which are set in change_page_attr(). The
ioremap_page_range flags are executable, while the c_p_a flags are
not.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: fix ioremap pgprot inconsistency

The pgprot flags which are handed into ioremap_page_range() are
different to those which are set in change_page_attr(). The
ioremap_page_range flags are executable, while the c_p_a flags are
not. Also make the mappings global (which is a NOP currently on 32bit,
although CPUs from PPRO+ onwards support it, but that's a separate
fix.)

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: turn the check_exec function into function that

What the check_exec() function really is trying to do is enforce certain
bits in the pgprot that are required by the x86 architecture, but that
callers might not be aware of (such as NX bit exclusion of the BIOS
area for BIOS based PCI access; it's not uncommon to ioremap the BIOS
region for various purposes and normally ioremap() memory has the NX bit
set).

This patch turns the check_exec() function into static_protections()
which also is now used to make sure the kernel text area remains non-NX
and that the .rodata section remains read-only. If the architecture
ends up requiring more such mandatory prot settings for specific areas,
this is now a reasonable place to add these.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: cpa: make self-test depend on DEBUG_KERNEL

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: ioremap_nocache fix

This patch fixes a bug of ioremap_nocache. ioremap_nocache() will call
__ioremap() with flags != 0 to do the real work, which will call
change_page_attr_addr() if phys_addr + size - 1 < (end_pfn_map << PAGE_SHIFT).
But some pages between 0 ~ end_pfn_map << PAGE_SHIFT are not mapped by
identity map, this will make change_page_attr_addr failed.

This patch is based on latest x86 git and has been tested on x86_64 platform.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: add PAGE_KERNEL_EXEC_NOCACHE

add PAGE_KERNEL_EXEC_NOCACHE.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86: fix NX bit handling in change_page_attr()

This patch fixes a bug of change_page_attr/change_page_attr_addr on
Intel i386/x86_64 CPUs.  After changing page attribute to be
executable with these functions, the page remains un-executable on
Intel i386/x86_64 CPU.  Because on Intel i386/x86_64 CPU, only if the
"NX" bits of all three level page tables are cleared (PAE is enabled),
the corresponding page is executable (refer to section 4.13.2 of Intel
64 and IA-32 Architectures Software Developer's Manual).  So, the bug
is fixed through clearing the "NX" bit of PMD when splitting the huge
PMD.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>