From: KAMEZAWA Hiroyuki Date: Thu, 2 Apr 2009 23:57:38 +0000 (-0700) Subject: memcg: fix OOM killer under memcg X-Git-Url: http://pilppa.com/gitweb/?a=commitdiff_plain;h=0b7f569e45bb6be142d87017030669a6a7d327a1;p=linux-2.6-omap-h63xx.git memcg: fix OOM killer under memcg This patch tries to fix OOM Killer problems caused by hierarchy. Now, memcg itself has OOM KILL function (in oom_kill.c) and tries to kill a task in memcg. But, when hierarchy is used, it's broken and correct task cannot be killed. For example, in following cgroup /groupA/ hierarchy=1, limit=1G, 01 nolimit 02 nolimit All tasks' memory usage under /groupA, /groupA/01, groupA/02 is limited to groupA's 1Gbytes but OOM Killer just kills tasks in groupA. This patch provides makes the bad process be selected from all tasks under hierarchy. BTW, currently, oom_jiffies is updated against groupA in above case. oom_jiffies of tree should be updated. To see how oom_jiffies is used, please check mem_cgroup_oom_called() callers. [akpm@linux-foundation.org: build fix] [akpm@linux-foundation.org: const fix] Signed-off-by: KAMEZAWA Hiroyuki Cc: Paul Menage Cc: Li Zefan Cc: Balbir Singh Cc: Daisuke Nishimura Cc: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- diff --git a/Documentation/cgroups/memcg_test.txt b/Documentation/cgroups/memcg_test.txt index 523a9c16c40..8a11caf417a 100644 --- a/Documentation/cgroups/memcg_test.txt +++ b/Documentation/cgroups/memcg_test.txt @@ -1,5 +1,5 @@ Memory Resource Controller(Memcg) Implementation Memo. -Last Updated: 2009/1/19 +Last Updated: 2009/1/20 Base Kernel Version: based on 2.6.29-rc2. Because VM is getting complex (one of reasons is memcg...), memcg's behavior @@ -360,3 +360,21 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. # kill malloc task. Of course, tmpfs v.s. swapoff test should be tested, too. + + 9.8 OOM-Killer + Out-of-memory caused by memcg's limit will kill tasks under + the memcg. When hierarchy is used, a task under hierarchy + will be killed by the kernel. + In this case, panic_on_oom shouldn't be invoked and tasks + in other groups shouldn't be killed. + + It's not difficult to cause OOM under memcg as following. + Case A) when you can swapoff + #swapoff -a + #echo 50M > /memory.limit_in_bytes + run 51M of malloc + + Case B) when you use mem+swap limitation. + #echo 50M > memory.limit_in_bytes + #echo 50M > memory.memsw.limit_in_bytes + run 51M of malloc diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index b2816fba530..43763bd772b 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -503,7 +503,7 @@ struct cgroup_subsys_state *css_get_next(struct cgroup_subsys *ss, int id, /* Returns true if root is ancestor of cg */ bool css_is_ancestor(struct cgroup_subsys_state *cg, - struct cgroup_subsys_state *root); + const struct cgroup_subsys_state *root); /* Get id and depth of css */ unsigned short css_id(struct cgroup_subsys_state *css); diff --git a/kernel/cgroup.c b/kernel/cgroup.c index f2a3f5c9936..382109b5bae 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -3405,7 +3405,7 @@ unsigned short css_depth(struct cgroup_subsys_state *css) } bool css_is_ancestor(struct cgroup_subsys_state *child, - struct cgroup_subsys_state *root) + const struct cgroup_subsys_state *root) { struct css_id *child_id = rcu_dereference(child->id); struct css_id *root_id = rcu_dereference(root->id); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 6f6a575e77a..025f8abfae2 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -295,6 +295,9 @@ struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p) static struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm) { struct mem_cgroup *mem = NULL; + + if (!mm) + return NULL; /* * Because we have no locks, mm->owner's may be being moved to other * cgroup. We use css_tryget() here even if this looks @@ -486,10 +489,20 @@ void mem_cgroup_move_lists(struct page *page, int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem) { int ret; + struct mem_cgroup *curr = NULL; task_lock(task); - ret = task->mm && mm_match_cgroup(task->mm, mem); + rcu_read_lock(); + curr = try_get_mem_cgroup_from_mm(task->mm); + rcu_read_unlock(); task_unlock(task); + if (!curr) + return 0; + if (curr->use_hierarchy) + ret = css_is_ancestor(&curr->css, &mem->css); + else + ret = (curr == mem); + css_put(&curr->css); return ret; } @@ -820,6 +833,19 @@ bool mem_cgroup_oom_called(struct task_struct *task) rcu_read_unlock(); return ret; } + +static int record_last_oom_cb(struct mem_cgroup *mem, void *data) +{ + mem->last_oom_jiffies = jiffies; + return 0; +} + +static void record_last_oom(struct mem_cgroup *mem) +{ + mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb); +} + + /* * Unlike exported interface, "oom" parameter is added. if oom==true, * oom-killer can be invoked. @@ -902,7 +928,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm, mutex_lock(&memcg_tasklist); mem_cgroup_out_of_memory(mem_over_limit, gfp_mask); mutex_unlock(&memcg_tasklist); - mem_over_limit->last_oom_jiffies = jiffies; + record_last_oom(mem_over_limit); } goto nomem; }