Gerrit Renker [Wed, 21 Nov 2007 12:09:56 +0000 (10:09 -0200)]
[DCCP]: Promote CCID2 as default CCID
This patch addresses the following problems:
1. DCCP relies for its proper functioning on having at least one CCID module
enabled (as in TCP plugable congestion control). Currently it is possible to
disable both CCIDs and thus leave the DCCP module in a compiled, but entirely
non-functional state: no sockets can be created when no CCID is available.
Furthermore, the protocol is (again like TCP) not intended to be used without
CCIDs. Last, a non-empty CCID list is needed for doing CCID feature negotiation.
2. Internally the default CCID that is advertised by the Linux host is set to CCID2
(DCCPF_INITIAL_CCID in include/linux/dccp.h). Disabling CCID2 in the Kconfig
menu without changing the defaults leads to a failure `module not found' when
trying to load the dccp module (which internally tries to load the default CCID).
3. The specification (RFC 4340, sec. 10) treats CCID2 somewhat like a
`minimum common denominator'; the specification says that:
* "New connections start with CCID 2 for both endpoints"
* "A DCCP implementation intended for general use, such as an implementation in a
general-purpose operating system kernel, SHOULD implement at least CCID 2.
The intent is to make CCID 2 broadly available for interoperability [...]"
Providing CCID2 as minimum-required CCID (like Reno/Cubic in TCP) thus seems reasonable.
Hence this patch automatically selects CCID2 when DCCP is enabled. Documentation also added.
Discussions with Ian McDonald on this subject are gratefully acknowledged.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Gerrit Renker [Wed, 21 Nov 2007 12:00:17 +0000 (10:00 -0200)]
[DCCP]: Update documentation
This updates the DCCP documentation, following input from Ian McDonald,
clarifiying the status of DCCP, and adding a note about the test tree.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Gerrit Renker [Wed, 21 Nov 2007 11:56:48 +0000 (09:56 -0200)]
[DCCP]: Honour and make use of shutdown option set by user
This extends the DCCP socket API by honouring any shutdown(2) option set by the user.
The behaviour is, as much as possible, made consistent with the API for TCP's shutdown.
This patch exploits the information provided by the user via the socket API to reduce
processing costs:
* if the read end is closed (SHUT_RD), it is not necessary to deliver to input CCID;
* if the write end is closed (SHUT_WR), the same idea applies, but with a difference -
as long as the TX queue has not been drained, we need to receive feedback to keep
congestion-control rates up to date. Hence SHUT_WR is honoured only after the last
packet (under congestion control) has been sent;
* although SHUT_RDWR seems nonsensical, it is nevertheless supported in the same manner
as for TCP (and agrees with test for SHUTDOWN_MASK in dccp_poll() in net/dccp/proto.c).
Furthermore, most of the code already honours the sk_shutdown flags (dccp_recvmsg() for
instance sets the read length to 0 if SHUT_RD had been called); CCID handling is now added
to this by the present patch.
There will also no longer be any delivery when the socket is in the final stages, i.e. when
one of dccp_close(), dccp_fin(), or dccp_done() has been called - which is fine since at
that stage the connection is its final stages.
Motivation and background are on http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/shutdown
A FIXME has been added to notify the other end if SHUT_RD has been set (RFC 4340, 11.7).
Note: There is a comment in inet_shutdown() in net/ipv4/af_inet.c which asks to "make
sure the socket is a TCP socket". This should probably be extended to mean
`TCP or DCCP socket' (the code is also used by UDP and raw sockets).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Gerrit Renker [Tue, 20 Nov 2007 23:56:37 +0000 (21:56 -0200)]
[DCCP]: Make PARTOPEN an autonomous state
This decouples PARTOPEN from TCP-specific stream-states.
It thus addresses the FIXME.
The code has been checked with regard to dependency on PARTOPEN and FIN_WAIT1
states (to which PARTOPEN previously was mapped): there is no difference, as
PARTOPEN is always referred to directly (i.e. not via the mapping to TCP
state).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Gerrit Renker [Tue, 20 Nov 2007 20:09:59 +0000 (18:09 -0200)]
[CCID3]: Inline for moving average
The moving average computation occurs so frequently in the CCID 3 code that
it merits an inline function of its own. This is uses a suggestion by
Arnaldo as per http://www.mail-archive.com/dccp@vger.kernel.org/msg01662.html
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Gerrit Renker [Tue, 20 Nov 2007 20:01:59 +0000 (18:01 -0200)]
[CCID3]: Accurately determine idle & application-limited periods
This fixes/updates the handling of idle and application-limited periods in CCID3,
which currently is broken: there is no detection as to how long a sender has been
idle - there is only one flag which is toggled in between function calls.
Being obsolete now, the `idle' flag is removed.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Gerrit Renker [Tue, 20 Nov 2007 20:00:39 +0000 (18:00 -0200)]
[CCID3]: Ignore trivial amounts of elapsed time
This patch fixes a previously undiscovered bug; the problem is in computing
the elapsed time as the time between `receiving' the packet (i.e. skb enters
CCID module) and sending feedback:
- there is no layer-processing, queueing, or delay involved,
- hence the elapsed time is in the order of 1 function call
- this is in the dimension of maximally 50..100usec
- which renders the use of elapsed time almost entirely useless.
The fix is simply to ignore such trivial amounts of elapsed time.
As a further advantage, the now useless elapsed_time field can be removed from
the socket, which reduces the socket structure by another four bytes.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Gerrit Renker [Tue, 20 Nov 2007 19:33:17 +0000 (17:33 -0200)]
[CCID3]: Revert use of MSS instead of s
This updates the CCID3 code with regard to two instances of using `MSS' in place of `s':
1. The RFC3390-based initial rate: both rfc3448bis as well as the Faster Restart
draft now consistently use `s' instead of MSS.
2. Now agrees with section 4.2 of rfc3448bis: "If the sender is ready to send data when
it does not yet have a round trip sample, the value of X is set to s bytes per
second, for segment size s [...]"
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
[NET]: Define infrastructure to keep 'inuse' changes in an efficent SMP/NUMA way.
Please look at the comment in there to see the rationale.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg [Fri, 16 Nov 2007 01:17:07 +0000 (02:17 +0100)]
mac80211: remove more forgotten code
Hopefully that's the rest. Seems I didn't do a very thorough job
removing the management interface.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ron Rindjunsky [Wed, 14 Nov 2007 17:57:38 +0000 (19:57 +0200)]
mac80211: adding 802.11n definitions in ieee80211.h
This patch adds several structs and definitions to ieee80211.h
to support 802.11n draft specifications.
As 802.11n depends on and extends the 802.11e standard in several issues,
there are also several definitions that belong to 802.11e.
Signed-off-by: Ron Rindjunsky <ron.rindjunsky@intel.com> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Helmut Schaa [Fri, 9 Nov 2007 15:26:09 +0000 (16:26 +0100)]
mac80211: Remove local->scan_flags
This patch removes all references to local->scan_flags as these are not
used anymore since the removal of prism2 ioctls.
Signed-off-by: Helmut Schaa <hschaa@suse.de> Signed-off-by: Jiri Benc <jbenc@suse.cz> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg [Fri, 9 Nov 2007 00:57:29 +0000 (01:57 +0100)]
mac80211: provide interface iterator for drivers
Sometimes drivers need to know which interfaces are associated with
their hardware. Rather than forcing those drivers to keep track of
the interfaces that were added, this adds an iteration function to
mac80211.
As it is intended to be used from the interface add/remove callbacks,
the iteration function may currently only be called under RTNL.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Pavel Emelyanov [Tue, 20 Nov 2007 07:20:59 +0000 (23:20 -0800)]
[NET]: Compact sk_stream_mem_schedule() code
This function references sk->sk_prot->xxx for many times.
It turned out, that there's so many code in it, that gcc
cannot always optimize access to sk->sk_prot's fields.
After saving the sk->sk_prot on the stack and comparing
disassembled code, it turned out that the function became
~10 bytes shorter and made less dereferences (on i386 and
x86_64). Stack consumption didn't grow.
Besides, this patch drives most of this function into the
80 columns limit.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Benjamin Thery [Tue, 20 Nov 2007 07:18:16 +0000 (23:18 -0800)]
[NET]: Make netns cleanup to run in a separate queue
This patch adds a separate workqueue for cleaning up a network
namespace. If we use the keventd workqueue to execute cleanup_net(),
there is a problem to unregister devices in IPv6. Indeed the code
that cleans up also schedule work in keventd: as long as cleanup_net()
hasn't return, dst_gc_task() cannot run and as long as dst_gc_task() has
not run, there are still some references pending on the net devices and
cleanup_net() can not unregister and exit the keventd workqueue.
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Acked-by: Denis V. Lunev <den@openvz.org> Acked-By: Kirill Korotaev <dev@sw.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 20 Nov 2007 06:43:37 +0000 (22:43 -0800)]
[PATCH] IPV4 : Move ip route cache flush (secret_rebuild) from softirq to workqueue
Every 600 seconds (ip_rt_secret_interval), a softirq flush of the
whole ip route cache is triggered. On loaded machines, this can starve
softirq for many seconds and can eventually crash.
This patch moves this flush to a workqueue context, using the worker
we intoduced in commit 39c90ece7565f5c47110c2fa77409d7a9478bd5b (IPV4:
Convert rt_check_expire() from softirq processing to workqueue.)
Also, immediate flushes (echo 0 >/proc/sys/net/ipv4/route/flush) are
using rt_do_flush() helper function, wich take attention to
rescheduling.
Next step will be to handle delayed flushes
("echo -1 >/proc/sys/net/ipv4/route/flush" or "ip route flush cache")
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Pavel Emelyanov [Tue, 20 Nov 2007 06:38:33 +0000 (22:38 -0800)]
[RAW]: Consolidate proc interface.
Both ipv6/raw.c and ipv4/raw.c use the seq files to walk
through the raw sockets hash and show them.
The "walking" code is rather huge, but is identical in both
cases. The difference is the hash table to walk over and
the protocol family to check (this was not in the first
virsion of the patch, which was noticed by YOSHIFUJI)
Make the ->open store the needed hash table and the family
on the allocated raw_iter_state and make the start/next/stop
callbacks work with it.
This removes most of the code.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Denis V. Lunev [Tue, 20 Nov 2007 06:29:30 +0000 (22:29 -0800)]
[NET]: Make AF_UNIX per network namespace safe [v2]
Because of the global nature of garbage collection, and because of the
cost of per namespace hash tables unix_socket_table has been kept
global. With a filter added on lookups so we don't see sockets from
the wrong namespace.
Currently I don't fold the namesapce into the hash so multiple
namespaces using the same socket name will be guaranteed a hash
collision.
Changes from v1:
- fixed unix_seq_open
Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Denis V. Lunev [Tue, 20 Nov 2007 06:28:35 +0000 (22:28 -0800)]
[NET]: Make AF_PACKET handle multiple network namespaces
This is done by making packet_sklist_lock and packet_sklist per
network namespace and adding an additional filter condition on
received packets to ensure they came from the proper network
namespace.
Changes from v1:
- prohibit to call inet_dgram_ops.ioctl in other than init_net
Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
[NET]: Make the netlink methods in rtnetlink handle multiple network namespaces
After the previous prep work this just consists of removing checks
limiting the code to work in the initial network namespace, and
updating rtmsg_ifinfo so we can generate events for devices in
something other then the initial network namespace.
Referring to network other network devices like the IFLA_LINK
and IFLA_MASTER attributes do, gets interesting if those network
devices happen to be in other network namespaces. Currently
ifindex numbers are allocated globally so I have taken the path
of least resistance and not still report the information even
though the devices they are talking about are invisible.
If applications start getting confused or when ifindex
numbers become local to the network namespace we may need
to do something different in the future.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Denis V. Lunev <den@openz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Denis V. Lunev [Tue, 20 Nov 2007 06:26:51 +0000 (22:26 -0800)]
[NET]: Make rtnetlink infrastructure network namespace aware (v3)
After this patch none of the netlink callback support anything
except the initial network namespace but the rtnetlink infrastructure
now handles multiple network namespaces.
Changes from v2:
- IPv6 addrlabel processing
Changes from v1:
- no need for special rtnl_unlock handling
- fixed IPv6 ndisc
Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Denis V. Lunev [Fri, 30 Nov 2007 13:21:31 +0000 (00:21 +1100)]
[NET]: Modify all rtnetlink methods to only work in the initial namespace (v2)
Before I can enable rtnetlink to work in all network namespaces I need
to be certain that something won't break. So this patch deliberately
disables all of the rtnletlink methods in everything except the
initial network namespace. After the methods have been audited this
extra check can be disabled.
Changes from v1:
- added IPv6 addrlabel protection
Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
[IPVS]: Create synced connections with their real state
With this patch the synced connections are created with their real state,
which can be changed on the next synchronizations if necessary. This way
on fail-over all the connections will be treated according to their actual
state, causing no scheduling problems (the active and the nonactive
connections have different weights in the schedulers).
The backwards compatibility is preserved and the existing tools will show
the true connection states even on the backup director.
Signed-off-by: Rumen G. Bogdanovski <rumen@voicecho.com> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
[IPVS]: Flag synced connections and expose them in proc
This patch labels the sync-created connections with IP_VS_CONN_F_SYNC
flag and creates /proc/net/ip_vs_conn_sync to enable monitoring of the
origin of the connections, if they are local or created by the
synchronization.
Signed-off-by: Rumen G. Bogdanovski <rumen@voicecho.com> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Wu [Tue, 30 Oct 2007 20:50:05 +0000 (16:50 -0400)]
ieee80211: Add IEEE80211_MAX_FRAME_LEN to linux/ieee80211.h
This patch adds IEEE80211_MAX_FRAME_LEN which is useful for drivers trying
to determine how much to allocate for their RX buffers.
It also updates the comment on IEEE80211_MAX_DATA_LEN based on revisions
in 802.11e.
IEEE80211_MAX_FRAG_THRESHOLD and IEEE80211_MAX_RTS_THRESHOLD are also
revised due to the new maximum frame size.
Signed-off-by: Michael Wu <flamingice@sourmilk.net> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Mattias Nissler [Wed, 24 Oct 2007 21:30:36 +0000 (23:30 +0200)]
mac80211: Accept auto txpower setting
This changes the SIWTXPOWER ioctl to also accept a txpower setting of
"automatic". Since mac80211 currently cannot tell drivers to automatically
adjust tx power, we select the tx power level of the current channel. While
this is kind of a hack, it certainly saves some iwconfig users from headaches.
Signed-off-by: Mattias Nissler <mattias.nissler@gmx.de> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ilpo Järvinen [Sat, 17 Nov 2007 00:17:05 +0000 (16:17 -0800)]
[TCP]: Correct DSACK check placing
Previously one of the in-block skip branches was missing it.
Also, drop it from tail-fully-processed case because the next
iteration will do exactly the same thing, i.e., process the
SACK block that contains the DSACK information.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
Oliver Hartkopp [Sat, 17 Nov 2007 00:09:28 +0000 (16:09 -0800)]
[CAN]: Add documentation
This patch adds documentation for the PF_CAN protocol family.
Signed-off-by: Oliver Hartkopp <oliver.hartkopp@volkswagen.de> Signed-off-by: Urs Thuermann <urs.thuermann@volkswagen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Oliver Hartkopp [Sat, 17 Nov 2007 00:07:41 +0000 (16:07 -0800)]
[CAN]: Add maintainer entries
This patch adds entries in the CREDITS and MAINTAINERS file for CAN.
Signed-off-by: Oliver Hartkopp <oliver.hartkopp@volkswagen.de> Signed-off-by: Urs Thuermann <urs.thuermann@volkswagen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Oliver Hartkopp [Fri, 16 Nov 2007 23:56:08 +0000 (15:56 -0800)]
[CAN]: Add virtual CAN netdevice driver
This patch adds the virtual CAN bus (vcan) network driver.
The vcan device is just a loopback device for CAN frames, no
real CAN hardware is involved.
Signed-off-by: Oliver Hartkopp <oliver.hartkopp@volkswagen.de> Signed-off-by: Urs Thuermann <urs.thuermann@volkswagen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Oliver Hartkopp [Fri, 16 Nov 2007 23:53:52 +0000 (15:53 -0800)]
[CAN]: Add broadcast manager (bcm) protocol
This patch adds the CAN broadcast manager (bcm) protocol.
Signed-off-by: Oliver Hartkopp <oliver.hartkopp@volkswagen.de> Signed-off-by: Urs Thuermann <urs.thuermann@volkswagen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Oliver Hartkopp [Fri, 16 Nov 2007 23:53:09 +0000 (15:53 -0800)]
[CAN]: Add raw protocol
This patch adds the CAN raw protocol.
Signed-off-by: Oliver Hartkopp <oliver.hartkopp@volkswagen.de> Signed-off-by: Urs Thuermann <urs.thuermann@volkswagen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Oliver Hartkopp [Fri, 16 Nov 2007 23:52:17 +0000 (15:52 -0800)]
[CAN]: Add PF_CAN core module
This patch adds the CAN core functionality but no protocols or drivers.
No protocol implementations are included here. They come as separate
patches. Protocol numbers are already in include/linux/can.h.
Signed-off-by: Oliver Hartkopp <oliver.hartkopp@volkswagen.de> Signed-off-by: Urs Thuermann <urs.thuermann@volkswagen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Oliver Hartkopp [Sun, 16 Dec 2007 23:59:24 +0000 (15:59 -0800)]
[CAN]: Allocate protocol numbers for PF_CAN
This patch adds a protocol/address family number, ARP hardware type,
ethernet packet type, and a line discipline number for the SocketCAN
implementation.
Signed-off-by: Oliver Hartkopp <oliver.hartkopp@volkswagen.de> Signed-off-by: Urs Thuermann <urs.thuermann@volkswagen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 16 Nov 2007 11:32:10 +0000 (03:32 -0800)]
[NET]: NET_CLS_ROUTE : convert ip_rt_acct to per_cpu variables
ip_rt_acct needs 4096 bytes per cpu to perform some accounting.
It is actually allocated as a single huge array [4096*NR_CPUS]
(rounded up to a power of two)
Converting it to a per cpu variable is wanted to :
- Save space on machines were num_possible_cpus() < NR_CPUS
- Better NUMA placement (each cpu gets memory on its node)
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ilpo Järvinen [Fri, 16 Nov 2007 03:50:37 +0000 (19:50 -0800)]
[TCP]: Rewrite SACK block processing & sack_recv_cache use
Key points of this patch are:
- In case new SACK information is advance only type, no skb
processing below previously discovered highest point is done
- Optimize cases below highest point too since there's no need
to always go up to highest point (which is very likely still
present in that SACK), this is not entirely true though
because I'm dropping the fastpath_skb_hint which could
previously optimize those cases even better. Whether that's
significant, I'm not too sure.
Currently it will provide skipping by walking. Combined with
RB-tree, all skipping would become fast too regardless of window
size (can be done incrementally later).
Previously a number of cases in TCP SACK processing fails to
take advantage of costly stored information in sack_recv_cache,
most importantly, expected events such as cumulative ACK and new
hole ACKs. Processing on such ACKs result in rather long walks
building up latencies (which easily gets nasty when window is
huge). Those latencies are often completely unnecessary
compared with the amount of _new_ information received, usually
for cumulative ACK there's no new information at all, yet TCP
walks whole queue unnecessary potentially taking a number of
costly cache misses on the way, etc.!
Since the inclusion of highest_sack, there's a lot information
that is very likely redundant (SACK fastpath hint stuff,
fackets_out, highest_sack), though there's no ultimate guarantee
that they'll remain the same whole the time (in all unearthly
scenarios). Take advantage of this knowledge here and drop
fastpath hint and use direct access to highest SACKed skb as
a replacement.
Effectively "special cased" fastpath is dropped. This change
adds some complexity to introduce better coveraged "fastpath",
though the added complexity should make TCP behave more cache
friendly.
The current ACK's SACK blocks are compared against each cached
block individially and only ranges that are new are then scanned
by the high constant walk. For other parts of write queue, even
when in previously known part of the SACK blocks, a faster skip
function is used (if necessary at all). In addition, whenever
possible, TCP fast-forwards to highest_sack skb that was made
available by an earlier patch. In typical case, no other things
but this fast-forward and mandatory markings after that occur
making the access pattern quite similar to the former fastpath
"special case".
DSACKs are special case that must always be walked.
The local to recv_sack_cache copying could be more intelligent
w.r.t DSACKs which are likely to be there only once but that
is left to a separate patch.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
Ilpo Järvinen [Fri, 16 Nov 2007 03:42:54 +0000 (19:42 -0800)]
[TCP]: Make lost retrans detection more self-contained
Highest_sack_end_seq is no longer calculated in the loop,
thus it can be pushed to the worker function altogether
making that function independent of the sacktag.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
Ilpo Järvinen [Fri, 16 Nov 2007 03:39:31 +0000 (19:39 -0800)]
[TCP]: non-FACK SACK follows conservative SACK loss recovery
Many assumptions that are true when no reordering or other
strange events happen are not a part of the RFC3517. FACK
implementation is based on such assumptions. Previously (before
the rewrite) the non-FACK SACK was basically doing fast rexmit
and then it times out all skbs when first cumulative ACK arrives,
which cannot really be called SACK based recovery :-).
RFC3517 SACK disables these things:
- Per SKB timeouts & head timeout entry to recovery
- Marking at least one skb while in recovery (RFC3517 does this
only for the fast retransmission but not for the other skbs
when cumulative ACKs arrive in the recovery)
- Sacktag's loss detection flavors B and C (see comment before
tcp_sacktag_write_queue)
This does not implement the "last resort" rule 3 of NextSeg, which
allows retransmissions also when not enough SACK blocks have yet
arrived above a segment for IsLost to return true [RFC3517].
The implementation differs from RFC3517 in these points:
- Rate-halving is used instead of FlightSize / 2
- Instead of using dupACKs to trigger the recovery, the number
of SACK blocks is used as FACK does with SACK blocks+holes
(which provides more accurate number). It seems that the
difference can affect negatively only if the receiver does not
generate SACK blocks at all even though it claimed to be
SACK-capable.
- Dupthresh is not a constant one. Dynamical adjustments include
both holes and sacked segments (equal to what FACK has) due to
complexity involved in determining the number sacked blocks
between highest_sack and the reordered segment. Thus it's will
be an over-estimate.
Implementation note:
tcp_clean_rtx_queue doesn't need a lost_cnt tweak because head
skb at that point cannot be SACKED_ACKED (nor would such
situation last for long enough to cause problems).
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
Ilpo Järvinen [Fri, 16 Nov 2007 03:35:11 +0000 (19:35 -0800)]
[TCP]: Extend reordering detection to cover CA_Loss partially
This implements more accurately what is stated in sacktag's
overall comment:
"Both of these heuristics are not used in Loss state, when
we cannot account for retransmits accurately."
When CA_Loss state is entered, the state changer ensures that
undo_marker is only set if no TCPCB_RETRANS skbs were found,
thus having non-zero undo_marker in CA_Loss basically tells
that the R-bits still accurately reflect the current state
of TCP.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
Pavel Emelyanov [Thu, 15 Nov 2007 11:03:19 +0000 (03:03 -0800)]
[NET]: Move sock_valbool_flag to socket.c
The sock_valbool_flag() helper is used in setsockopt to
set or reset some flag on the sock. This helper is required
in the net/socket.c only, so move it there.
Besides, patch two places in sys_setsockopt() that repeat
this helper functionality manually.
Since this is not a bugfix, but a trivial cleanup, I
prepared this patch against net-2.6.25, but it also
applies (with a single offset) to the latest net-2.6.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Pavel Emelyanov [Thu, 15 Nov 2007 00:01:43 +0000 (16:01 -0800)]
[NET]: Use sockfd_lookup_light in the rest of the net/socket.c
Some time ago a sockfd_lookup_light was introduced and
most of the socket.c file was patched to use it. However
two routines were left - sys_sendto and sys_recvfrom.
Patch them as well, since this helper does exactly what
these two need.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Patrick McHardy [Tue, 20 Nov 2007 02:53:30 +0000 (18:53 -0800)]
[NETFILTER]: Introduce NF_INET_ hook values
The IPv4 and IPv6 hook values are identical, yet some code tries to figure
out the "correct" value by looking at the address family. Introduce NF_INET_*
values for both IPv4 and IPv6. The old values are kept in a #ifndef __KERNEL__
section for userspace compatibility.
Signed-off-by: Patrick McHardy <kaber@trash.net> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Tue, 20 Nov 2007 02:50:17 +0000 (18:50 -0800)]
[IPSEC]: Add async resume support on input
This patch adds support for async resumptions on input. To do so, the
transform would return -EINPROGRESS and subsequently invoke the
function xfrm_input_resume to resume processing.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Tue, 20 Nov 2007 02:47:58 +0000 (18:47 -0800)]
[IPSEC]: Remove nhoff from xfrm_input
The nhoff field isn't actually necessary in xfrm_input. For tunnel
mode transforms we now throw away the output IP header so it makes no
sense to fill in the nexthdr field. For transport mode we can now let
the function transport_finish do the setting and it knows where the
nexthdr field is.
The only other thing that needs the nexthdr field to be set is the
header extraction code. However, we can simply move the protocol
extraction out of the generic header extraction.
We want to minimise the amount of info we have to carry around between
transforms as this simplifies the resumption process for async crypto.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Wed, 14 Nov 2007 05:47:08 +0000 (21:47 -0800)]
[IPSEC]: Make x->lastused an unsigned long
Currently x->lastused is u64 which means that it cannot be
read/written atomically on all architectures. David Miller observed
that the value stored in it is only an unsigned long which is always
atomic.
So based on his suggestion this patch changes the internal
representation from u64 to unsigned long while the user-interface
still refers to it as u64.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Sun, 16 Dec 2007 23:55:02 +0000 (15:55 -0800)]
[IPSEC]: Move integrity stat collection into xfrm_input
Similar to the moving out of the replay processing on the output, this
patch moves the integrity stat collectin from x->type->input into
xfrm_input.
This would eventually allow transforms such as AH/ESP to be lockless.
The error value EBADMSG (currently unused in the crypto layer) is used
to indicate a failed integrity check. In future this error can be
directly returned by the crypto layer once we switch to aead
algorithms.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Wed, 14 Nov 2007 05:44:55 +0000 (21:44 -0800)]
[IPSEC]: Store xfrm states in security path directly
As it is xfrm_input first collects a list of xfrm states on the stack
before storing them in the packet's security path just before it
returns. For async crypto, this construction presents an obstacle
since we may need to leave the loop after each transform.
In fact, it's much easier to just skip the stack completely and always
store to the security path. This is proven by the fact that this
patch actually shrinks the code.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Wed, 14 Nov 2007 05:44:23 +0000 (21:44 -0800)]
[IPSEC]: Merge most of the input path
As part of the work on asynchronous cryptographic operations, we need
to be able to resume from the spot where they occur. As such, it
helps if we isolate them to one spot.
This patch moves most of the remaining family-specific processing into
the common input code.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Wed, 14 Nov 2007 05:43:43 +0000 (21:43 -0800)]
[IPSEC]: Add async resume support on output
This patch adds support for async resumptions on output. To do so,
the transform would return -EINPROGRESS and subsequently invoke the
function xfrm_output_resume to resume processing.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Wed, 14 Nov 2007 05:43:11 +0000 (21:43 -0800)]
[IPSEC]: Merge most of the output path
As part of the work on asynchrnous cryptographic operations, we need
to be able to resume from the spot where they occur. As such, it
helps if we isolate them to one spot.
This patch moves most of the remaining family-specific processing into
the common output code.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Sat, 12 Jan 2008 03:14:00 +0000 (19:14 -0800)]
[IPV4]: Add ip_local_out
Most callers of the LOCAL_OUT chain will set the IP packet length and
header checksum before doing so. They also share the same output
function dst_output.
This patch creates a new function called ip_local_out which does all
of that and converts the appropriate users over to it.
Apart from removing duplicate code, it will also help in merging the
IPsec output path once the same thing is done for IPv6.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Wed, 14 Nov 2007 05:41:28 +0000 (21:41 -0800)]
[IPSEC]: Separate inner/outer mode processing on input
With inter-family transforms the inner mode differs from the outer
mode. Attempting to handle both sides from the same function means
that it needs to handle both IPv4 and IPv6 which creates duplication
and confusion.
This patch separates the two parts on the input path so that each
function deals with one family only.
In particular, the functions xfrm4_extract_inut/xfrm6_extract_inut
moves the pertinent fields from the IPv4/IPv6 IP headers into a
neutral format stored in skb->cb. This is then used by the inner mode
input functions to modify the inner IP header. In this way the input
function no longer has to know about the outer address family.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Wed, 14 Nov 2007 05:40:52 +0000 (21:40 -0800)]
[IPSEC]: Separate inner/outer mode processing on output
With inter-family transforms the inner mode differs from the outer
mode. Attempting to handle both sides from the same function means
that it needs to handle both IPv4 and IPv6 which creates duplication
and confusion.
This patch separates the two parts on the output path so that each
function deals with one family only.
In particular, the functions xfrm4_extract_output/xfrm6_extract_output
moves the pertinent fields from the IPv4/IPv6 IP headers into a
neutral format stored in skb->cb. This is then used by the outer mode
output functions to write the outer IP header. In this way the output
function no longer has to know about the inner address family.
Since the extract functions are only called by tunnel modes (the only
modes that can support inter-family transforms), I've also moved the
xfrm*_tunnel_check_size calls into them. This allows the correct ICMP
message to be sent as opposed to now where you might call icmp_send
with an IPv6 packet and vice versa.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Wed, 14 Nov 2007 05:40:13 +0000 (21:40 -0800)]
[INET]: Give outer DSCP directly to ip*_copy_dscp
This patch changes the prototype of ipv4_copy_dscp and ipv6_copy_dscp so
that they directly take the outer DSCP rather than the outer IP header.
This will help us to unify the code for inter-family tunnels.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Wed, 14 Nov 2007 05:39:38 +0000 (21:39 -0800)]
[IPSEC]: Move x->outer_mode->output out of locked section
RO mode is the only one that requires a locked output function. So
it's easier to move the lock into that function rather than requiring
everyone else to run under the lock.
In particular, this allows us to move the size check into the output
function without causing a potential dead-lock should the ICMP error
somehow hit the same SA on transmission.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Wed, 14 Nov 2007 05:39:08 +0000 (21:39 -0800)]
[IPSEC]: Forbid BEET + ipcomp for now
While BEET can theoretically work with IPComp the current code can't
do that because it tries to construct a BEET mode tunnel type which
doesn't (and cannot) exist. In fact as it is it won't even attach a
tunnel object at all for BEET which is bogus.
To support this fully we'd also need to change the policy checks on
input to recognise a plain tunnel as a legal variant of an optional
BEET transform.
This patch simply fails such constructions for now.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Tue, 11 Dec 2007 17:32:34 +0000 (09:32 -0800)]
[IPSEC]: Merge common code into xfrm_bundle_create
Half of the code in xfrm4_bundle_create and xfrm6_bundle_create are
common. This patch extracts that logic and puts it into
xfrm_bundle_create. The rest of it are then accessed through afinfo.
As a result this fixes the problem with inter-family transforms where
we treat every xfrm dst in the bundle as if it belongs to the top
family.
This patch also fixes a long-standing error-path bug where we may free
the xfrm states twice.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Wed, 14 Nov 2007 05:37:28 +0000 (21:37 -0800)]
[IPSEC]: Move flow construction into xfrm_dst_lookup
This patch moves the flow construction from the callers of
xfrm_dst_lookup into that function. It also changes xfrm_dst_lookup
so that it takes an xfrm state as its argument instead of explicit
addresses.
This removes any address-specific logic from the callers of
xfrm_dst_lookup which is needed to correctly support inter-family
transforms.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Wed, 14 Nov 2007 05:36:51 +0000 (21:36 -0800)]
[IPSEC]: Replace x->type->{local,remote}_addr with flags
The functions local_addr and remote_addr are more than what they're
needed for. The same thing can be done easily with flags on the type
object. This patch does that and simplifies the wrapper functions in
xfrm6_policy accordingly.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Wed, 14 Nov 2007 05:36:07 +0000 (21:36 -0800)]
[IPSEC]: Make sure idev is consistent with dev in xfrm_dst
Previously we took the device from the bottom route and idev from the
top route. This is bad because idev may well point to a different
device. This patch changes it so that we get the idev from the device
directly.
It also makes it an error if either dev or idev is NULL. This is
consistent with the rest of the routing code which also treats these
cases as errors.
I've removed the err initialisation in xfrm6_policy.c because it
achieves no purpose and hid a bug when an initial version of this
patch neglected to set err to -ENODEV (fortunately the IPv4 version
warned about it).
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Wed, 14 Nov 2007 05:35:32 +0000 (21:35 -0800)]
[IPSEC]: Set dst->input to dst_discard
The input function should never be invoked on IPsec dst objects. This
is because we don't apply IPsec on input until after we've made the
routing decision.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Wed, 14 Nov 2007 05:35:01 +0000 (21:35 -0800)]
[IPSEC]: Only set neighbour on top xfrm dst
The neighbour field is only used by dst_confirm which only ever happens on
the top-most xfrm dst. So it's a waste to duplicate for every other xfrm
dst. This patch moves its setting out of the loop so that only the top one
gets set.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Wed, 14 Nov 2007 05:34:06 +0000 (21:34 -0800)]
[NET]: Eliminate duplicate copies of dst_discard
We have a number of copies of dst_discard scattered around the place
which all do the same thing, namely free a packet on the input or
output paths.
This patch deletes all of them except dst_discard and points all the
users to it.
The only non-trivial bit is decnet where it returns an error.
However, conceptually this is identical to the blackhole functions
used in IPv4 and IPv6 which do not return errors. So they should
either all return errors or all return zero. For now I've stuck with
the majority and picked zero as the return value.
It doesn't really matter in practice since few if any driver would
react differently depending on a zero return value or NET_RX_DROP.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Wed, 14 Nov 2007 05:33:32 +0000 (21:33 -0800)]
[IPV6]: Move nfheader_len into rt6_info
The dst member nfheader_len is only used by IPv6. It's also currently
creating a rather ugly alignment hole in struct dst. Therefore this patch
moves it from there into struct rt6_info.
It also reorders the fields in rt6_info to minimize holes.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Wed, 14 Nov 2007 05:33:01 +0000 (21:33 -0800)]
[IPSEC]: Use dst->header_len when resizing on output
Currently we use x->props.header_len when resizing on output.
However, if we're resizing at all we might as well go the whole hog
and do it for the whole dst.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Wed, 14 Nov 2007 05:32:26 +0000 (21:32 -0800)]
[IPV6]: Only set nfheader_len for top xfrm dst
We only need to set nfheader_len in the top xfrm dst. This is because
we only ever read the nfheader_len from the top xfrm dst.
It is also easier to count nfheader_len as part of header_len which
then lets us remove the ugly wrapper functions for incrementing and
decrementing header lengths in xfrm6_policy.c.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>