From 6f4e786e0302c8d80005a9f1868d3c2f93ba6ebd Mon Sep 17 00:00:00 2001 From: Eric K Date: Wed, 30 Jan 2019 17:47:56 -0800 Subject: [PATCH] Memory leak use case Change-Id: I9f8fab1c12b7c94ca2c1a1d3c8c4ed470d821b63 --- doc/source/use-cases.rst | 1 + use-cases/memory-leak.rst | 180 ++++++++++++++++++++++++++++++++++++++ 2 files changed, 181 insertions(+) create mode 100644 use-cases/memory-leak.rst diff --git a/doc/source/use-cases.rst b/doc/source/use-cases.rst index bc9e7b3..9babfe5 100644 --- a/doc/source/use-cases.rst +++ b/doc/source/use-cases.rst @@ -11,3 +11,4 @@ a starting point. use-cases/nic-failure-affects-instance-and-app.rst use-cases/heat-mistral-aodh.rst use-cases/fenix-rolling-upgrade.rst + use-cases/memory-leak.rst diff --git a/use-cases/memory-leak.rst b/use-cases/memory-leak.rst new file mode 100644 index 0000000..62ed5d2 --- /dev/null +++ b/use-cases/memory-leak.rst @@ -0,0 +1,180 @@ +.. + +This work is licensed under a Creative Commons Attribution 3.0 Unported License. +http://creativecommons.org/licenses/by/3.0/legalcode + +============================================= +Memory leak mitigation and failure prevention +============================================= + +In a cloud infrastructure, it is important to detect and mitigate memory leaks +in critical infrastructure software before catastrophic failure takes place. + + +Problem description +=================== + +Many long-running software services are susceptible to memory leaks. A service +experiencing memory leak tends to gradually consume more and more memory +during its operation, ultimately compromising the performance and the +stability of the service. In the case of critical infrastructure software, +it is important to detect and prevent catastrophic failure due to memory +leaks. + + +Fault class +=========== + + * Software error + * Performance degradation + + +OpenStack projects used +======================= + +There are at least three solution architectures. + +1. Local: mitigation decisions and actions are conducted locally on each + server. +2. Central: mitigation decisions and actions are orchestrated by a central + service. +3. Delegated: mitigation decision policy can be defined centrally but + localized to each server. + + +Local +----- +In the local architecture, a local utility can make the decision to restart a +process or take other mitigating actions when the process exceeds certain +fixed memory thresholds specified by the cloud operator. Candidate +implementations include: + +- custom scripts. +- native memory limit mechanisms (e.g., cgroups) which would kill a process + when memory usage becomes too high, allowing another mechanism to restart + the process. +- `Monit`_. + +.. _Monit: https://mmonit.com/monit/documentation/monit.html#Process-resource-tests + + +Central +------- +In the central architecture, mitigation decisions can be made at a central +level which is able to use cloud level information and policy not available at +a local level. Mitigation actions can include the orchestration of graceful +failovers that involve multiple servers. + +There are three logical components to a solution. + +* Memory usage collection: + + - Monasca (monasca-agent) + - Nagios + - Zabbix + +* Mitigation decision: + + - Congress + - Vitrage + - Watcher + +* Mitigation action: + + - Mistral + - Watcher + + +Delegated +--------- +The delegated architecture might be implemented as a mixture of the above two. + + +Remediation class +================= + + Proactive / preemptive + + +Fault detection +=============== + +Definitive detection of memory leak is an unsolved problem. For the purpose of +this use case, suspected memory leak can be determined based on operator-set +limits or a more generic procedure based on memory usage history and other +relevant information. + + +Inputs and decision-making +========================== + +Inputs: + * Memory usage by process. + * Memory usage by server. + * (Potentially) Memory usage history. + * (Potentially) A list of candidate services or processes for memory leak + mitigation. + +Decision making: + * The simplest case is when the operator prescribes memory limits for each + relevant process. Take mitigating actions when prescribed memory limits + for a service/process is breached. + + * The appropriate memory limits for each service/process might be determined + by an inductive algorithm. The subject is under active investigation by + the research community (for selected references, see `References`_). + + * When there are no prescribed memory limits, decisions can be made on the + basis of a more generic procedure or policy. For example, a policy sketch + may be as follows. + + - When a server's overall memory usage exceeds 90% of available memory for + a period of 10 minutes, take mitigating actions on the candidate + services or processes, prioritized by parameters such as: + + + Each service's total memory usage. + + Each service's historical memory usage. + + Risk and level of disruption of mitigating action taken upon each + service. + + +Remediation +=========== + +Two main mitigating approaches are available: + + * Restart the service experiencing memory leak. + * Orchestrate a graceful fail-over. + +Existing implementation(s) +========================== + +Existing implementations are available for the local architecture. See `Local`_. + + +Future work +=========== + +If there is operator interest in the central or delegated architectures, +future work would include implementing the architectures using the referenced +projects and documenting the results. + + +Dependencies +============ + +Not applicable. + + +References +========== + +Matthias Hauswirth and Trishul M. Chilimbi. 2004. +Low-overhead memory leak detection using adaptive statistical profiling. +SIGOPS Oper. Syst. Rev. 38, 5 (October 2004), 156-164. +DOI=http://dx.doi.org/10.1145/1037949.1024412 + +Sor, Vladimir, Plumbr Ou, Tarvo Treier and Satish Narayana Srirama. +“Improving Statistical Approach for Memory Leak Detection Using Machine +Learning.” 2013 IEEE International Conference on Software Maintenance (2013): +544-547.