Remove a support of ISA addresses predefined at compile time. It is
unused (filled by zeroes) and prolongs the code. Don't initialize global
array and add `ioaddr' module param description.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Acked-by: Alan Cox <alan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- use dev_* for printing in pci probe function
- move ISA p[rints directly into isa find function, do not postpone it.
Remove macros bound to it then.
- prepend some prints by "mxser: " to know what it belongs to
Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Acked-by: Alan Cox <alan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- remove unused mxvar_diagflag
- move mxser_msr into the only user/function
- GMStatus, hmm, fix race-prone access to it. We need only one instance for
real, not MXSER_PORTS. Move it to MOXA_GETMSTATUS ioctl.
- mxser_mon_ext, almost the same, but alloc it on heap, since it has more than
2 kilos.
- fix indexing, `i' is not the index value, `i * MXSER_PORTS_PER_BOARD + j' is
Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Acked-by: Alan Cox <alan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- remove break ctl from ioctl handler, it's never reached, since
tty_ops->break_ctl is defined (mxser break handling is done in software)
- mark MOXA_GET_MAJOR as deprecated
- fix TIOCGICOUNT (some retval non-checks of put_user). Use copy_to_user
to whole structure instead.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Acked-by: Alan Cox <alan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Fri, 25 Jul 2008 08:48:18 +0000 (01:48 -0700)]
nwflash: use simple_read_from_buffer()
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Tim Schmielau <tim@physik3.uni-rostock.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alan Cox [Fri, 25 Jul 2008 08:48:14 +0000 (01:48 -0700)]
mwave: ioctl BKL pushdown
Push the BKL down to the point it wraps the actual mwave method handlers
Signed-off-by: Alan Cox <alan@redhat.com> Cc: Eric Sesterhenn <snakebyte@gmx.de> Cc: Yani Ioannou <yani.ioannou@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
elf: use ELF_CORE_EFLAGS for kcore ELF header flags
ELF_CORE_EFLAGS is already used by the binfmt_elf coredumper to set correct
arch specific ELF header flags on coredumps. Use it for kcore dumps as well.
At the moment, this affects the CRIS and the H8300 arch.
Signed-off-by: Edgar E. Iglesias <edgar@axis.com> Cc: Mikael Starvik <starvik@axis.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adrian Bunk [Fri, 25 Jul 2008 08:48:09 +0000 (01:48 -0700)]
pty: remove unused UNIX98_PTY_COUNT options
The h8300 and sparc options somehow survived when the code stopped using
CONFIG_UNIX98_PTY_COUNT.
Reviewed-by: Robert P. J. Day <rpjday@crashcourse.ca> Signed-off-by: Adrian Bunk <bunk@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ipc: do not use a negative value to re-enable msgmni automatic recomputing
This patch proposes an alternative to the "magical
positive-versus-negative number trick" Andrew complained about last week
in http://lkml.org/lkml/2008/6/24/418.
This had been introduced with the patches that scale msgmni to the amount
of lowmem. With these patches, msgmni has a registered notification
routine that recomputes msgmni value upon memory add/remove or ipc
namespace creation/ removal.
When msgmni is changed from user space (i.e. value written to the proc
file), that notification routine is unregistered, and the way to make it
registered back is to write a negative value into the proc file. This is
the "magical positive-versus-negative number trick".
To fix this, a new proc file is introduced: /proc/sys/kernel/auto_msgmni.
This file acts as ON/OFF for msgmni automatic recomputing.
With this patch, the process is the following:
1) kernel boots in "automatic recomputing mode"
/proc/sys/kernel/msgmni contains the value that has been computed (depends
on lowmem)
/proc/sys/kernel/automatic_msgmni contains "1"
2) echo <val> > /proc/sys/kernel/msgmni
. sets msg_ctlmni to <val>
. de-activates automatic recomputing (i.e. if, say, some memory is added
msgmni won't be recomputed anymore)
. /proc/sys/kernel/automatic_msgmni now contains "0"
3) echo "0" > /proc/sys/kernel/automatic_msgmni
. de-activates msgmni automatic recomputing
this has the same effect as 2) except that msg_ctlmni's value stays
blocked at its current value)
3) echo "1" > /proc/sys/kernel/automatic_msgmni
. recomputes msgmni's value based on the current available memory size
and number of ipc namespaces
. re-activates automatic recomputing for msgmni.
The attached patch:
- reverses the locking order of ulp->lock and sem_lock:
Previously, it was first ulp->lock, then inside sem_lock.
Now it's the other way around.
- converts the undo structure to rcu.
Benefits:
- With the old locking order, IPC_RMID could not kfree the undo structures.
The stale entries remained in the linked lists and were released later.
- The patch fixes a a race in semtimedop(): if both IPC_RMID and a semget() that
recreates exactly the same id happen between find_alloc_undo() and sem_lock,
then semtimedop() would access already kfree'd memory.
Remove the ipc_lock_down() routines: they used to call idr_find() locklessly
(given that the ipc ids lock was already held), so they are not needed
anymore.
Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net> Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Jim Houston <jim.houston@comcast.net> Cc: Pierre Peiffer <peifferp@gmail.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
idr: rename some of the idr APIs internal routines
This is a trivial patch that renames:
. alloc_layer to get_from_free_list since it idr_pre_get that actually
allocates memory.
. free_layer to move_to_free_list since memory is not actually freed there.
This makes things more clear for the next patches.
Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net> Reviewed-by: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Jim Houston <jim.houston@comcast.net> Cc: Pierre Peiffer <peifferp@gmail.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After scalability problems have been detected when using the sysV ipcs, I have
proposed to use an RCU based implementation of the IDR api instead (see
threads http://lkml.org/lkml/2008/4/11/212 and
http://lkml.org/lkml/2008/4/29/295).
This resulted in many people asking to convert the idr API and make it rcu
safe (because most of the code was duplicated and thus unmaintanable and
unreviewable).
So here is a first attempt.
The important change wrt to the idr API itself is during idr removes: idr
layers are freed after a grace period, instead of being moved to the free
list.
The important change wrt to ipcs, is that idr_find() can now be called
locklessly inside a rcu read critical section.
Here are the results I've got for the pmsg test sent by Manfred:
calgary iommu: use the first kernels TCE tables in kdump
kdump kernel fails to boot with calgary iommu and aacraid driver on a x366
box. The ongoing dma's of aacraid from the first kernel continue to exist
until the driver is loaded in the kdump kernel. Calgary is initialized
prior to aacraid and creation of new tce tables causes wrong dma's to
occur. Here we try to get the tce tables of the first kernel in kdump
kernel and use them. While in the kdump kernel we do not allocate new tce
tables but instead read the base address register contents of calgary
iommu and use the tables that the registers point to. With these changes
the kdump kernel and hence aacraid now boots normally.
The callback which has returned NOTIFY_BAD will not receive
CPU_UP_CANCELED. Change the code to fulfil the CPU_UP_CANCELED logic if
CPU_UP_PREPARE fails.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Reported-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
workqueues: make get_online_cpus() useable for work->func()
workqueue_cpu_callback(CPU_DEAD) flushes cwq->thread under
cpu_maps_update_begin(). This means that the multithreaded workqueues
can't use get_online_cpus() due to the possible deadlock, very bad and
very old problem.
Introduce the new state, CPU_POST_DEAD, which is called after
cpu_hotplug_done() but before cpu_maps_update_done().
Change workqueue_cpu_callback() to use CPU_POST_DEAD instead of CPU_DEAD.
This means that create/destroy functions can't rely on get_online_cpus()
any longer and should take cpu_add_remove_lock instead.
[akpm@linux-foundation.org: fix CONFIG_SMP=n] Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Gautham R Shenoy <ego@in.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Max Krasnyansky <maxk@qualcomm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vegard Nossum <vegard.nossum@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
workqueues: schedule_on_each_cpu: use flush_work()
Change schedule_on_each_cpu() to use flush_work() instead of
flush_workqueue(), this way we don't wait for other work_struct's which
can be queued meanwhile.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Jarek Poplawski <jarkao2@gmail.com> Cc: Max Krasnyansky <maxk@qualcomm.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Most of users of flush_workqueue() can be changed to use cancel_work_sync(),
but sometimes we really need to wait for the completion and cancelling is not
an option. schedule_on_each_cpu() is good example.
Add the new helper, flush_work(work), which waits for the completion of the
specific work_struct. More precisely, it "flushes" the result of of the last
queue_work() which is visible to the caller.
doesn't necessary work "as expected". What can happen in the WINDOW above is
- wq starts the execution of work->func()
- the caller migrates to another CPU
now, after the 2nd queue_work() this work is active on the previous CPU, and
at the same time it is queued on another. In this case flush_work(work) may
return before the first work->func() completes.
It is trivial to add another helper
int flush_work_sync(struct work_struct *work)
{
return flush_work(work) || wait_on_work(work);
}
which works "more correctly", but it has to iterate over all CPUs and thus
it much slower than flush_work().
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Max Krasnyansky <maxk@qualcomm.com> Acked-by: Jarek Poplawski <jarkao2@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
workqueues: insert_work: use "list_head *" instead of "int tail"
insert_work() inserts the new work_struct before or after cwq->worklist,
depending on the "int tail" parameter. Change it to accept "list_head *"
instead, this shrinks .text a bit and allows us to insert the barrier
after specific work_struct.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Jarek Poplawski <jarkao2@gmail.com> Cc: Max Krasnyansky <maxk@qualcomm.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
coredump: format_corename: fix the "core_uses_pid" logic
I don't understand why the multi-thread coredump implies the core_uses_pid
behaviour, but we shouldn't use mm->mm_users for that. This counter can
be incremented by get_task_mm(). Use the valued returned by
coredump_wait() instead.
Also, remove the "const char *pattern" argument, format_corename() can use
core_pattern directly.
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that we have core_state->dumper list we can use it to wake up the
sub-threads waiting for the coredump completion.
This uglifies the code and .text grows by 47 bytes, but otoh mm_struct
lessens by sizeof(struct completion). Also, with this change we can
decouple exit_mm() from the coredumping code.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
coredump: construct the list of coredumping threads at startup time
binfmt->core_dump() has to iterate over the all threads in system in order
to find the coredumping threads and construct the list using the
GFP_ATOMIC allocations.
With this patch each thread allocates the list node on exit_mm()'s stack and
adds itself to the list.
This allows us to do further changes:
- simplify ->core_dump()
- change exit_mm() to clear ->mm first, then wait for ->core_done.
this makes the coredumping process visible to oom_kill
- kill mm->core_done
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change zap_process() to return int instead of incrementing
mm->core_state->nr_threads directly. Change zap_threads() to set
mm->core_state only on success.
This patch restores the original size of .text, and more importantly now
->nr_threads is used in two places only.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
coredump: turn mm->core_startup_done into the pointer to struct core_state
mm->core_startup_done points to "struct completion startup_done" allocated
on the coredump_wait()'s stack. Introduce the new structure, core_state,
which holds this "struct completion". This way we can add more info
visible to the threads participating in coredump without enlarging
mm_struct.
No changes in affected .o files.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
linux_binfmt->core_dump() runs before the process does exit_aio(), this
means that we can hit the kernel thread which shares the same ->mm.
Afaics, nothing really bad can happen, but perhaps it makes sense to fix
this minor bug.
It is sad we have to iterate over all threads in system and use
GFP_ATOMIC. Hopefully we can kill theses ugly do_each_thread()s, but this
needs some nontrivial changes in mm_struct and do_coredump.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The main loop in zap_threads() must skip kthreads which may use the same
mm. Otherwise we "kill" this thread erroneously (for example, it can not
fork or exec after that), and the coredumping task stucks in the
TASK_UNINTERRUPTIBLE state forever because of the wrong ->core_waiters
count.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kill PF_BORROWED_MM. Change use_mm/unuse_mm to not play with ->flags, and
do s/PF_BORROWED_MM/PF_KTHREAD/ for a couple of other users.
No functional changes yet. But this allows us to do further
fixes/cleanups.
oom_kill/ptrace/etc often check "p->mm != NULL" to filter out the
kthreads, this is wrong because of use_mm(). The problem with
PF_BORROWED_MM is that we need task_lock() to avoid races. With this
patch we can check PF_KTHREAD directly, or use a simple lockless helper:
/* The result must not be dereferenced !!! */
struct mm_struct *__get_task_mm(struct task_struct *tsk)
{
if (tsk->flags & PF_KTHREAD)
return NULL;
return tsk->mm;
}
Note also ecard_task(). It runs with ->mm != NULL, but it's the kernel
thread without PF_BORROWED_MM.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce the new PF_KTHREAD flag to mark the kernel threads. It is set
by INIT_TASK() and copied to the forked childs (we could set it in
kthreadd() along with PF_NOFREEZE instead).
daemonize() was changed as well. In that case testing of PF_KTHREAD is
racy, but daemonize() is hopeless anyway.
This flag is cleared in do_execve(), before search_binary_handler().
Probably not the best place, we can do this in exec_mmap() or in
start_thread(), or clear it along with PF_FORKNOEXEC. But I think this
doesn't matter in practice, and if do_execve() fails kthread should die
soon.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1. SIGKILL can't be blocked, remove this check from sigkill_pending().
2. When ptrace_stop() sees sigkill_pending() == T, it can just return.
Kill "int killed" and simplify the code. This also is more correct,
the tracer shouldn't see us in TASK_TRACED if we are not going to
stop.
I strongly believe this code needs further changes. We should do the "was
this task killed" check unconditionally, currently it depends on
arch_ptrace_stop_needed(). On the other hand, sigkill_pending() isn't
very clever. If the task was killed tkill(SIGKILL), the signal can be
already dequeued if the caller is do_exit().
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ptrace_stop() has some complicated checks to prevent the scheduling in the
TASK_TRACED state with the pending SIGKILL, but these checks are racy, and
they depend on arch_ptrace_stop_needed().
This patch assumes that the traced task should die asap if it was killed by
SIGKILL, in that case schedule()->signal_pending_state() has no reason to
ignore the TASK_WAKEKILL part of TASK_TRACED, and we can kill this nasty
special case.
Note: do_exit()->ptrace_notify() is special, the killed task can already
dequeue SIGKILL at this point. Another indication that fatal_signal_pending()
is not exactly right.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adrian Bunk [Fri, 25 Jul 2008 08:47:34 +0000 (01:47 -0700)]
include/asm/ptrace.h userspace headers cleanup
This patch contains the following cleanups for the asm/ptrace.h
userspace headers:
- include/asm-generic/Kbuild.asm already lists ptrace.h, remove
the superfluous listings in the Kbuild files of the following
architectures:
- cris
- frv
- powerpc
- x86
- don't expose function prototypes and macros to userspace:
- arm
- blackfin
- cris
- mn10300
- parisc
- remove #ifdef CONFIG_'s around #define's:
- blackfin
- m68knommu
- sh: AFAIK __SH5__ should work in both kernel and userspace,
no need to leak CONFIG_SUPERH64 to userspace
- xtensa: cosmetical change to remove empty
#ifndef __ASSEMBLY__ #else #endif
from the userspace headers
Not changed by this patch is the fact that the following architectures
have a different struct pt_regs depending on CONFIG_ variables:
- h8300
- m68knommu
- mips
This does not work in userspace.
Signed-off-by: Adrian Bunk <bunk@kernel.org> Cc: <linux-arch@vger.kernel.org> Cc: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Greg Ungerer <gerg@uclinux.org> Acked-by: Paul Mundt <lethal@linux-sh.org> Acked-by: Grant Grundler <grundler@parisc-linux.org> Acked-by: Jesper Nilsson <jesper.nilsson@axis.com> Acked-by: Chris Zankel <chris@zankel.net> Acked-by: David Howells <dhowells@redhat.com> Acked-by: Paul Mackerras <paulus@samba.org> Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michael Kerrisk [Fri, 25 Jul 2008 08:47:32 +0000 (01:47 -0700)]
signals: make siginfo_t si_utime + si_sstime report times in USER_HZ, not HZ
In the switch to configurable HZ in 2.6, the treatment of the si_utime and
si_stime fields that are exposed to userland via the siginfo structure
looks to have been botched. As things stand, these fields report times in
units of HZ, so that userland gets information that varies depending on
the HZ that the kernel was configured with. This patch changes the
reported values to use USER_HZ units.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
signals: do_signal_stop: kill the SIGNAL_UNKILLABLE check
fae5fa44f1fd079ffbed8e0add929dd7bbd1347f changed do_signal_stop() to check
SIGNAL_UNKILLABLE, this wasn't needed. If signal_group_exit() == F, the
signal sent to SIGNAL_UNKILLABLE task must be already filtered out by the
caller, get_signal_to_deliver(). And if signal_group_exit() == T we are
not going to stop.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
signals: dequeue_signal: don't check SIGNAL_GROUP_EXIT when setting SIGNAL_STOP_DEQUEUED
dequeue_signal() checks SIGNAL_GROUP_EXIT before setting
SIGNAL_STOP_DEQUEUED. This was added by 788e05a67c343fa22f2ae1d3ca264e7f15c25eaf a long ago to avoid the
coredump/SIGSTOP race.
Since then the related code was changed, and now this subtle check is both
incomplete and unneeded at the same time. It is incomplete because
nowadays exec() doesn't set SIGNAL_GROUP_EXIT, so in fact we should check
signal_group_exit() to avoid a similar race. Fortunately, we doesn't need
the check at all. The only function which relies on SIGNAL_STOP_DEQUEUED
is do_signal_stop(), and it ignores this flag if signal_group_exit() == T,
this covers the SIGNAL_GROUP_EXIT case.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
posix timers: release_posix_timer: kill the bogus put_task_struct(->it_process);
release_posix_timer() can't be called with ->it_process != NULL. Once
sys_timer_create() sets ->it_process it must not call
release_posix_timer(), otherwise we can race with another thread doing
sys_timer_delete(), this timer is visible to idr_find() and unlocked.
The same is true for two other callers (actually, for any possible
caller), sys_timer_delete() and itimer_delete(). They must clear
->it_process before unlock_timer() + release_posix_timer().
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Roland McGrath <roland@redhat.com> Cc: john stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
posix timers: timer_delete: remove the bogus "->it_process != NULL" check
sys_timer_delete() and itimer_delete() check "timer->it_process != NULL",
this looks completely bogus. ->it_process == NULL means that this timer
is already under destruction or it is not fully initialized, this must not
happen.
sys_timer_delete: the timer is locked, and lock_timer() can't succeed
if ->it_process == NULL.
itimer_delete: it is called by exit_itimers() when there are no other
threads which can play with signal_struct->posix_timers.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Roland McGrath <roland@redhat.com> Cc: john stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Fri, 25 Jul 2008 08:47:25 +0000 (01:47 -0700)]
cpuset: two minor code-cleanups
In cpuset_update_task_memory_state() local variable struct task_struct
*tsk = current;
And local variable tsk is used 14 times and statement task_cs(tsk) is used
twice in this function. So using task_cs(tsk) instead of task_cs(current)
is better for readability.
And "(struct cgroup_scanner *)&scan" is not good for readability also.
(and "container_of" is used in cpuset_do_move_task(), not
"(cpuset_hotplug_scanner *)scan")
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Cc: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Fri, 25 Jul 2008 08:47:24 +0000 (01:47 -0700)]
cpuset: code-cleanup for started_after
cgroup(cgroup_scan_tasks) will initialize heap->gt for us. This patch
removes started_after() and its helper-function.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Cc: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Fri, 25 Jul 2008 08:47:23 +0000 (01:47 -0700)]
cpuset: don't pass empty cpumasks to partition_sched_domains()
I create lots of empty cpusets(empty cpumasks) and turn off the
"sched_load_balance" in top cpuset.
I found that all these empty cpumasks are passed to
partition_sched_domains() in rebuild_sched_domains(), it's very
time-consuming for partition_sched_domains() and it's not need.
It also reduce memory consumed and some works in rebuild_sched_domains()
too.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Cc: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When changing 'sched_relax_domain_level', don't rebuild sched domains if
'cpus' is empty or 'sched_load_balance' is not set.
Also make the comments of rebuild_sched_domains() more readable.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Paul Jackson <pj@sgi.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cpusets: update task's cpus_allowed and mems_allowed after CPU/NODE offline/online
The bug is that a task may run on the cpu/node which is not in its
cpuset.cpus/ cpuset.mems.
It can be reproduced by the following commands:
-----------------------------------
# mkdir /dev/cpuset
# mount -t cpuset xxx /dev/cpuset
# mkdir /dev/cpuset/0
# echo 0-1 > /dev/cpuset/0/cpus
# echo 0 > /dev/cpuset/0/mems
# echo $$ > /dev/cpuset/0/tasks
# echo 0 > /sys/devices/system/cpu/cpu1/online
# echo 1 > /sys/devices/system/cpu/cpu1/online
-----------------------------------
There is only CPU0 in cpuset.cpus, but the task in this cpuset runs on
both CPU0 and CPU1.
It is because the task's cpu_allowed didn't get updated after we did CPU
offline/online manipulation. Similar for mem_allowed.
This patch fixes this bug expect for root cpuset. Because there is a
problem about root cpuset, in that whether it is necessary to update all
the tasks in root cpuset or not after cpu/node offline/online.
If updating, some kernel threads which is bound into a specified cpu will
be unbound.
If not updating, there is a bug in root cpuset. This bug is also caused
by offline/online manipulation. For example, there is a dual-cpu machine.
we create a sub cpuset in root cpuset and assign 1 to its cpus. And then
we attach some tasks into this sub cpuset. After this, we offline CPU1.
Now, the tasks in this new cpuset are moved into root cpuset automatically
because there is no cpu in sub cpuset. Then we online CPU1, we find all
the tasks which doesn't belong to root cpuset originally just run on CPU0.
Maybe we need to add a flag in the task_struct to mark which task can't be
unbound?
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Acked-by: Paul Jackson <pj@sgi.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Jackson <pj@sgi.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cpusets: restructure the function update_cpumask() and update_nodemask()
Extract two functions from update_cpumask() and update_nodemask().They
will be used later for updating tasks' cpus_allowed and mems_allowed after
CPU/NODE offline/online.
[lizf@cn.fujitsu.com: build fix] Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Acked-by: Paul Jackson <pj@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Because of remove refcnt patch, it's very rare case to that
mem_cgroup_charge_common() is called against a page which is accounted.
mem_cgroup_charge_common() is called when.
1. a page is added into file cache.
2. an anon page is _newly_ mapped.
A racy case is that a newly-swapped-in anonymous page is referred from
prural threads in do_swap_page() at the same time.
(a page is not Locked when mem_cgroup_charge() is called from do_swap_page.)
Another case is shmem. It charges its page before calling add_to_page_cache().
Then, mem_cgroup_charge_cache() is called twice. This case is handled in
mem_cgroup_cache_charge(). But this check may be too hacky...
Signed-off-by : KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A new call, mem_cgroup_shrink_usage() is added for shmem handling and
relacing non-standard usage of mem_cgroup_charge/uncharge.
Now, shmem calls mem_cgroup_charge() just for reclaim some pages from
mem_cgroup. In general, shmem is used by some process group and not for
global resource (like file caches). So, it's reasonable to reclaim pages
from mem_cgroup where shmem is mainly used.
[hugh@veritas.com: shmem_getpage release page sooner]
[hugh@veritas.com: mem_cgroup_shrink_usage css_put] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch changes page migration under memory controller to use a
different algorithm. (thanks to Christoph for new idea.)
Before:
- page_cgroup is migrated from an old page to a new page.
After:
- a new page is accounted , no reuse of page_cgroup.
Pros:
- We can avoid compliated lock depndencies and races in migration.
Cons:
- new param to mem_cgroup_charge_common().
- mem_cgroup_getref() is added for handling ref_cnt ping-pong.
This version simplifies complicated lock dependency in page migraiton
under memory resource controller.
new refcnt sequence is following.
a mapped page:
prepage_migration() ..... +1 to NEW page
try_to_unmap() ..... all refs to OLD page is gone.
move_pages() ..... +1 to NEW page if page cache.
remap... ..... all refs from *map* is added to NEW one.
end_migration() ..... -1 to New page.
page's mapcount + (page_is_cache) refs are added to NEW one.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Hugh Dickins <hugh@veritas.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Fri, 25 Jul 2008 08:47:07 +0000 (01:47 -0700)]
devcgroup: relax white-list protection down to RCU
Currently this list is protected with a simple spinlock, even for reading
from one. This is OK, but can be better.
Actually I want it to be better very much, since after replacing the
OpenVZ device permissions engine with the cgroup-based one I noticed, that
we set 12 default device permissions for each newly created container (for
/dev/null, full, terminals, ect devices), and people sometimes have up to
20 perms more, so traversing the ~30-40 elements list under a spinlock
doesn't seem very good.
Here's the RCU protection for white-list - dev_whitelist_item-s are added
and removed under the devcg->lock, but are looked up in permissions
checking under the rcu_read_lock.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Menage <menage@google.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cgroup_clone: use pid of newly created task for new cgroup
cgroup_clone creates a new cgroup with the pid of the task. This works
correctly for unshare, but for clone cgroup_clone is called from
copy_namespaces inside copy_process, which happens before the new pid is
created. As a result, the new cgroup was created with current's pid.
This patch:
1. Moves the call inside copy_process to after the new pid
is created
2. Passes the struct pid into ns_cgroup_clone (as it is not
yet attached to the task)
3. Passes a name from ns_cgroup_clone() into cgroup_clone()
so as to keep cgroup_clone() itself simpler
4. Uses pid_vnr() to get the process id value, so that the
pid used to name the new cgroup is always the pid as it
would be known to the task which did the cloning or
unsharing. I think that is the most intuitive thing to
do. This way, task t1 does clone(CLONE_NEWPID) to get
t2, which does clone(CLONE_NEWPID) to get t3, then the
cgroup for t3 will be named for the pid by which t2 knows
t3.
(Thanks to Dan Smith for finding the main bug)
Changelog:
June 11: Incorporate Paul Menage's feedback: don't pass
NULL to ns_cgroup_clone from unshare, and reduce
patch size by using 'nodename' in cgroup_clone.
June 10: Original version
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Serge Hallyn <serge@us.ibm.com> Acked-by: Paul Menage <menage@google.com> Tested-by: Dan Smith <danms@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Paul Menage [Fri, 25 Jul 2008 08:47:04 +0000 (01:47 -0700)]
cgroup files: convert res_counter_write() to be a cgroups write_string() handler
Currently res_counter_write() is a raw file handler even though it's
ultimately taking a number, since in some cases it wants to
pre-process the string when converting it to a number.
This patch converts res_counter_write() from a raw file handler to a
write_string() handler; this allows some of the boilerplate
copying/locking/checking to be removed, and simplies the cleanup path,
since these functions are now performed by the cgroups framework.
[lizf@cn.fujitsu.com: build fix] Signed-off-by: Paul Menage <menage@google.com> Cc: Paul Jackson <pj@sgi.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>