Project: Transcendent Memory
Project Description: A new approach to managing physical memory in a virtualized system
License: GPL
Finally updated again: 20101122 (though this page could still use a major rewrite)
WHAT IS TRANSCENDENT MEMORY?
Virtualization provides significant cost savings by allowing otherwise
underutilized resources to be time-shared between multiple
virtual machines.
While efficient techniques for optimizing
CPU and I/O device utilization are widely implemented,
time-sharing of physical
memory is much more difficult.
As a result, physical memory is increasingly
becoming a bottleneck in virtualized systems, limiting the efficient
consolidation of virtual machines onto physical machines.
Transcendent Memory (tmem for short)
provides a new approach for improving the utilization of
physical memory in a virtualized** environment by claiming underutilized
memory in a system and making it available where it is most needed.
From the perspective of an operating system, tmem is fast pseudo-RAM
of indeterminate and varying
size that is useful primarily when real RAM is in short supply and is
accessible only via a somewhat quirky copy-based interface.
More formally, Transcendent Memory is both: (a) a collection of idle
physical memory in a system and (b) an API for providing
indirect access to that memory. A tmem host (such as
a hypervisor
in a virtualized system) maintains and manages one or more tmem pools
of physical memory. One or more tmem clients
(such as a guest OS in a virtualized system) can access this memory
only indirectly via a well-defined tmem API which imposes a
carefully-crafted set of rules and restrictions. Through proper use of
the tmem API, a tmem client may utilize a tmem pool as an extension
to its memory, thus reducing disk I/O and improving performance.
As an added bonus, tmem is compatible and complementary -- indeed
even supplementary -- to
other physical memory management mechanisms such as automated ballooning
and page-sharing.
(** While tmem's primary goal is to better optimize the multiplexing of
physical memory between guests in a virtualized system, the tmem
interface is useful within a native operating system on a physical system
too. See references below.)
The tmem approach is flexible and we have already identified four different
use models that can benefit a tmem client:
- A second-chance clean page cache called cleancache (previously known as precache and hcache)
- A fast swap "device" called frontswap (previously known as preswap or hswap)
- A shared page cache (shared cleancache) for remote clustered filesystems
- An interdomain shared memory mechanism
The first three are sufficient for significantly improving utilization
of physical memory -- and thus performance -- in many
virtualized workloads. The last may provide
substantial incremental benefits that are still under investigation.
To be useful, tmem must be present and enabled in both the hypervisor
and in each guest operating system;
guest OS changes are explicitly required, but are surprisingly non-invasive.
Both Xen 4.0 and Oracle VM Server
2.2 support tmem in the hypervisor. Tmem-enabled kernels are
available in the 2.6.18-xen tree, in OpenSuse 11.2, SLE11, and Oracle
Enterprise Linux 5u5, and RPMs are also available for the Enterprise
Linux 6 beta to ensure tmem is ready when that OS is released.
Finally, patches to implement cleancache and frontswap are being
proposed to upstream Linux and have received a great deal of support.
A brief explanation of how Linux can use cleancache/frontswap/tmem (NOTE: uses the obsolete
terms precache and preswap instead of cleancache and frontswap) is
here.
For Linux experts, a good overview of the motivations and related state of the art, a brief intro
to tmem, and some impressive performance results were presented at the Virtualization mini-summit
at the Linux Plumbers Conference 2010, and are available in these
slides,
including a complete script in these
speaker notes.
Some ideas on how tmem-like concepts might be extended for broader use in Linux were posted
here,
and briefly discussed at the Linux LSF/MM10 summit.
Tmem was previously described at the Linux Symposium 2009 conference
from which there is a
paper
and
presentation slides,
including a complete script in the
speaker notes.
A similar
presentation
with updated information and
speaker notes
was presented in January
2010 at Linux Conference Australia (LCA10).
There is also a more Xen-focused
presentation with
speaker notes
and
video
from Xen Summit North America 2010, and
older slides and video from a presentation on tmem at Xen Summit
North America 2009 can be found
here.
Finally, a brief older one-page academic overview of Transcendent
Memory prepared for the
OSDI'08
work-in-progress session
is
here.
Some preliminary but promising measurements of cleancache can be
found in our
Paravirtualized Paging
paper from
WIOV'08.
Various Linux patches and RPMs, some very outdated, can be found
here.
The patch for Xen's paravirtualized 2.6.18-xen Linux is
in tree
here.
HOW DOES TRANSCENDENT MEMORY COMPARE TO OTHER APPROACHES?
First let's define the problem space:
The deceptively simple objective is to optimize, across time,
the distribution of a fixed amount of machine memory among
a maximal set of virtual machines (VMs). To do this, we measure
the current and future memory need or working set size of each
running VM and periodically reclaim memory from those VMs
that have an excess of memory and either provide it to VMs that
need more memory or use it to provision additional new VMs.
However, attempts to implement this objective encounter a
number of difficulties:
- Working set is a poorly-defined concept, so the current working set size
  of a VM cannot be accurately measured and an approximation must be deduced
  from a collection of externally observed data and events.
- The future working set size of a VM cannot be accurately predicted, must be
  estimated from trends in externally available data, and may change
  unpredictably, perhaps dramatically, at any instant.
- Memory has inertia and so cannot be randomly taken away from
  a VM without potential loss of data.
- The consequence of underprovisioning the memory of a VM may be severe: a
  significant non-linear performance impact on that VM (i.e. thrashing).
- The consequence of mispredicting the number of VMs that can be supported
  by the fixed total machine memory is also severe: a significant non-linear
  performance impact on some set of VMs -- and thus on the overall
  system (i.e. host swapping).
Most research and development projects in this problem space focus either on
working set estimation and prediction, or on
cleverly increasing the apparent amount of physical memory (e.g.
through compression or page-sharing). The transfer of memory between
VMs is generally considered a solved problem -- ballooning [1]
is implemented on all major virtualization platforms -- but memory
inertia imposes challenges that ballooning can only partially overcome.
Most notably, ballooning depends on requests to a VM to donate memory,
which may have a long latency especially for large requests.
Our approach, tmem, is fundamentally different in that it assumes that
working set estimation and prediction will never be sufficiently accurate and,
instead, provides a mechanism for mitigating the impact of that
inaccuracy. Tmem is complementary to compression and/or page-sharing
and in fact may provide opportunities to optimize them by isolating the
memory that can benefit most. And while tmem partially depends on some
form of ballooning-like mechanism, unneeded memory is pooled and utilized
in a way that minimizes memory inertia, thus making a significant fraction
of main memory instantly available to accommodate rapidly changing needs.
Finally, tmem explicitly exposes certain previously hidden OS memory
management operations, such as page eviction, that may uncover new avenues of
virtualization research, not only in collaborative memory management but
perhaps in areas such as VM introspection and improved virtual disk management.
Tmem has its limitations too. There is some overhead, which may result in
a small negative performance impact on some workloads. OS change --
paravirtualization -- is required, though the changes are surprisingly
small and non-invasive. And tmem doesn't help at all if all of main
memory is truly in use (i.e. the sum of the working set of all active
virtual machines exceeds the size of physical memory).
But we believe the benefits of tmem
will greatly outweigh these costs.
WHERE DOES THE MEMORY FOR TRANSCENDENT MEMORY COME FROM?
In order for tmem to work, unused memory must be collected.
Where does this memory come from?
To answer, we need to understand some basics of
physical memory management, first in a physical system and then in
a virtualized system.
Single system physical memory management
In a running physical system, memory utilization can be characterized by two
metrics: W, the working set size of the currently running
workload and P, the amount of physical memory on the system.
On most systems, W will vary dramatically and often rapidly,
and rarely will W be equal to P.
For any point in time where
W is smaller than P, some fraction of memory is
unused or idle. If at another point in time,
W is larger than P, most systems will use disk space (a
swap disk) to contain the overflow. Since a disk is
horribly slow compared to physical memory, if the memory-on-disk
must be regularly accessed, system performance can become abysmal,
a situation widely known as thrashing.
Since "memory is cheap", to avoid thrashing, a system administrator
will often ensure P is much larger than W will ever be,
which means that physical memory is usually greatly overprovisioned.
As a result, an even greater fraction of memory is usually idle.
Now, it is true that in any decent operating system, idle memory is
not really entirely unused. Generally, much of this memory is used
as some form of a
page cache, storing copies of previously-used disk pages that
might need to again
be used in the future, thus avoiding the cost and delay of re-reading
that page from the disk. However, despite the fact that operating-system
designers have studied this problem for years, their algorithms still
can only approximate the future, not predict it. As a result, many,
maybe even most, of the pages in the page cache will never be used
again and will eventually be evicted from memory. Annoyingly
(and in another indication that the future cannot be predicted), it
will turn out that some pages that are evicted are actually
needed again in the future... Too late, those pages will need to be
fetched from disk. (Note: In later discussion, we will refer to those
pages as "false negative evictions.")
So, to differentiate those pages that at any moment in time will
be used in the future from those that will not, we will consider
the former to be in the working set W and will only consider
the latter to be truly idle. At any moment in time, this still
leaves a large quantity of physical memory as unused... wasted.
Virtualized system physical memory management -- Static partitioning
Now let's turn our attention to a virtual environment.
Depending
on the virtualization system and on options selected, physical
memory is either statically or dynamically partitioned.
In a statically partitioned system, whenever a new virtual machine
is launched, it is given a fixed amount of guest physical memory which
it uses as it sees fit. The machine physical memory backing the
guest physical memory may not be contiguous and the mapping from a
guest physical page to machine physical memory may change over time,
but the total guest physical memory size remains fixed, reflecting
the real world equivalent. Subsequently launched virtual machines
require additional fixed chunks of machine physical memory and, of
course, the hypervisor reserves some memory for itself. Eventually
only a fragment of machine physical memory remains, which is insufficient
to add another virtual machine. (Some virtualization systems support
host swapping, but when it is used performance falls off a cliff, so
it is avoided except in rare conditions. For our purposes,
it can be ignored.) This fragment is therefore unassigned and we
call it fallow memory. Depending on the total machine physical
memory in the system, and the amount of guest physical memory assigned
to individual virtual machines, the amount of memory left fallow
may be rather substantial. When combined with the idle memory from
each guest, the total may be the majority of the machine physical
memory. (In some virtualization systems, a service domain may
own all non-hypervisor machine physical memory in the system and dole
it out as virtual machines are launched, which means the amount
of fallow memory is zero. However, the amount of idle memory in the
service domain is instead a rough approximation of the same.)
Virtualized system physical memory management -- Dynamic partitioning
In order to allow underutilized machine physical memory to be reassigned
from one virtual machine to another, a dynamic partitioning technique
called ballooning was invented. With ballooning, a pseudo-device
running in each virtual machine called the balloon driver absorbs and
releases memory from the fixed guest physical memory allocation. For example,
the balloon driver in virtual machine A transfers ownership of a set of
pages to the hypervisor, which in turn transfers it to a balloon driver
in virtual machine B.
Ballooning, while very interesting and useful, suffers from many constraints.
Most notably, decisions regarding which virtual machines have memory
to donate and which need more memory -- and how much -- must be driven
by a sophisticated algorithm which can infer and predict the working set
size of all of the guest virtual machines. We have already observed
the difficulty of this on a single machine and the dire consequences of
incorrectly predicting the future; the frequency of such mispredictions is only exacerbated under
the memory pressure provoked by ballooning.
Also notable is the fact that ballooning
is entirely dependent on the graces of each guest operating system to
surrender memory; we have also observed that an operating system believes
it is successfully utilizing idle memory and so may be loath to give up
memory, especially if the demand is urgent and the amount is large. We
refer to this property as memory inertia.
In short, ballooning is useful for shaping memory over time, but is not
responsive enough to ensure that, for example, the rapidly growing working set
of one or more VMs can be instantly satisfied. And once again the
consequence of insufficient memory for needy virtual machines is a
potentially dramatic reduction in performance.
Virtualized system physical memory management -- Live migration
Before we leave this topic, one more important significant source of
underutilized memory should be identified. One of the key advantages
of virtualization is the ability to migrate a live virtual machine from
one physical machine to another, with effectively no downtime.
Some believe that, in future virtual data centers, migration
will be employed frequently
to optimize load-balancing, respect power constraints, or manage other resources.
One constraint of effective migration is that all of the guest physical
memory must also (eventually) move from the sending physical machine to
the destination. So, to allow for fluid migration, bin packing
techniques must be applied and memory holes must be prevalent in the
data center. While these holes may eventually be filled by an inward-bound
virtual machine, the sum of the unused physical memory across the entire
data center at any moment may be substantial. More idle memory.
Summary
As can be seen, the sources of idle memory are varied and may add
up to a significant fraction of the physical memory in a machine or
data center. Single system algorithms that attempt to make better use of idle
memory depend on predicting the future of a machine's working set,
which is destined to fail, at least some of the time. And mechanisms
to move memory between virtual machines to meet dynamically varying
demands must not only successfully predict the future for multiple virtual
machines simultaneously
but also overcome memory inertia.
But if idle memory could be used for the benefit of virtual (or physical)
machines without being owned by those machines, to both reduce the penalty
for failing to predict the future and eliminate memory inertia, could
physical memory be more effectively utilized?
This is the foundation of Transcendent Memory.
HOW IS TRANSCENDENT MEMORY USED?
First, while tmem works with any pool of otherwise idle memory
it is ideally paired with a dynamic partitioning mechanism that is
capable of responding automatically to changing memory needs.
In this model, an automated ballooning mechanism (such as
self-ballooning, or MEB [6])
handles longer-term and
larger-scale memory shaping by transferring full ownership
(i.e. direct addressability) of physical memory to a needy virtual
machine, and tmem alleviates the impact of short-term mispredictions
and memory inertia which are exacerbated by the memory pressure
applied by dynamic partitioning. (Note that in recent tmem-enabled
kernels, this self-ballooning mechanism is integrated into the kernel
itself.)
In general,
a tmem client makes requests to a tmem host via the tmem API, so a
tmem host must expose the API to the client. As an example,
Xen implements and exports a hypercall interface that supports
the tmem API and the Linux kernel utilizes that API.
The tmem API is very narrow: There are two services, one to
create a tmem pool and one to perform operations on the pool.
At an abstract level, they are (using C syntax):
pool_id = tmem_new_pool(uuid, flags);
and
retval = tmem_op(OP, handle, pfn);
(For clarity, we will use slightly different syntax in the following but simple
preprocessor directives can be used to translate between the two.)
First, let's look at pool creation. Two parameters are passed, a 128-bit uuid
(universally unique identifier) and a 32-bit set of flags. Two of
the flags are important as they select very different semantics:
- shared vs private
- ephemeral vs persistent
For shared pools, the uuid represents a share name. The uuid is ignored
for private pools. We will explore the related semantics shortly but
it's important to note that there is no size parameter; the client creating
the tmem pool has no control or knowledge of the size of the pool.
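To make the creation call concrete, here is a minimal sketch, assuming illustrative flag names (TMEM_POOL_SHARED, TMEM_POOL_PERSISTENT) and an assumed uuid layout; the actual bit assignments are defined by the tmem API specification and the hypervisor ABI, not by this example.
    /* Minimal sketch only: the flag names, bit values, and uuid layout below
     * are assumptions for illustration, not the real tmem ABI. */
    #include <stdint.h>

    typedef struct { uint64_t lo, hi; } tmem_uuid_t;  /* 128-bit uuid, assumed layout */

    #define TMEM_POOL_PERSISTENT 0x1   /* assumed: default is ephemeral */
    #define TMEM_POOL_SHARED     0x2   /* assumed: default is private */

    /* Provided by the tmem host (e.g. via a hypercall shim); shape as in the text. */
    extern int32_t tmem_new_pool(tmem_uuid_t uuid, uint32_t flags);

    int32_t create_example_pools(tmem_uuid_t fs_uuid)
    {
        tmem_uuid_t ignored = { 0, 0 };

        /* Private ephemeral pool (cleancache-style): the uuid is ignored and,
         * notably, there is no size argument -- the client never learns how
         * big the pool is. */
        int32_t private_pool = tmem_new_pool(ignored, 0);

        /* Shared ephemeral pool: the uuid names the share, e.g. derived from a
         * cluster filesystem's uuid so co-resident VMs join the same pool. */
        int32_t shared_pool = tmem_new_pool(fs_uuid, TMEM_POOL_SHARED);

        return (private_pool < 0 || shared_pool < 0) ? -1 : private_pool;
    }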
Once created, operations can be performed on the pool. For example, a
page of data can be put to a pool and associated with a
handle, which is actually a three-element tuple consisting of a 32-bit pool-id,
a 64-bit object-id, and a 32-bit page-id. The pool-id selects
a previously created pool, the object-id is roughly analogous to
a file, and the page-id is analogous to a sequential page number
within the file. The
data is specified as a physical page frame number. So, the C code:
retval = tmem_put(pool_id, object_id, page_id, pfn);
copies the page of data from pfn into the previously created tmem pool
specified by pool_id and associates it with the object_id/page_id pair.
Then
retval = tmem_get(pool_id, object_id, page_id, pfn);
copies the page of data associated with the pool_id/object_id/page_id
(if present) from the tmem pool into the page frame specified by pfn.
Note that copying is explicitly required; no magic remapping (e.g.
page flipping) is done. Also
other semantics are enforced... and differ somewhat depending on
the type of pool that was created.
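The translation between that two-call interface and the tmem_put/tmem_get syntax used here might look like the following sketch. The handle structure, op codes, and pass-by-value convention are assumptions chosen for illustration; the real encoding is given in the tmem API specification.
    /* Sketch of the syntax translation mentioned above; op numbers, the struct
     * layout, and passing the handle by value are illustrative assumptions. */
    #include <stdint.h>

    struct tmem_handle {            /* the three-element tuple from the text */
        uint32_t pool_id;
        uint64_t object_id;
        uint32_t page_id;
    };

    enum tmem_opcode { TMEM_PUT_PAGE = 1, TMEM_GET_PAGE = 2 };  /* assumed values */

    extern int tmem_op(enum tmem_opcode op, struct tmem_handle handle,
                       unsigned long pfn);

    /* tmem_put: copy the page at pfn into the pool under the given handle. */
    static inline int tmem_put(uint32_t pool_id, uint64_t object_id,
                               uint32_t page_id, unsigned long pfn)
    {
        struct tmem_handle h = { pool_id, object_id, page_id };
        return tmem_op(TMEM_PUT_PAGE, h, pfn);
    }

    /* tmem_get: copy the page under the given handle (if present) into pfn. */
    static inline int tmem_get(uint32_t pool_id, uint64_t object_id,
                               uint32_t page_id, unsigned long pfn)
    {
        struct tmem_handle h = { pool_id, object_id, page_id };
        return tmem_op(TMEM_GET_PAGE, h, pfn);
    }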
For any tmem pools it creates,
a tmem client is responsible for ensuring any data it puts into a
pool remains consistent with any other memory or storage. To do
this, two flush operations are provided, one for a page
and one for an entire object.
There are also operations to read, write, or exchange a partial page,
or destroy a previously-created pool.
A complete list of operations and semantics can be found in the
Transcendent Memory API specification.
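To illustrate that consistency responsibility, here is a small sketch of the hooks a client might invoke when file data that may have been put to an ephemeral pool is overwritten or truncated. The function names tmem_flush_page and tmem_flush_object mirror the two flush operations described above, but their exact names and signatures are assumptions of this sketch.
    /* Hedged sketch: the prototypes follow the page- and object-level flush
     * operations described in the text; exact names and signatures are assumed. */
    #include <stdint.h>

    extern int tmem_flush_page(uint32_t pool_id, uint64_t object_id,
                               uint32_t page_id);
    extern int tmem_flush_object(uint32_t pool_id, uint64_t object_id);

    /* One page of a file is about to change on disk: flush it so that a later
     * tmem_get can never return the stale copy. */
    void example_invalidate_page(uint32_t pool_id, uint64_t inode_no,
                                 uint32_t page_index)
    {
        (void)tmem_flush_page(pool_id, inode_no, page_index);
    }

    /* A whole file is truncated or deleted: drop every page that was put
     * under its object-id. */
    void example_invalidate_file(uint32_t pool_id, uint64_t inode_no)
    {
        (void)tmem_flush_object(pool_id, inode_no);
    }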
As an example, let's consider a
private ephemeral pool, or using the Linux client terminology,
a cleancache. We put a page to the pool and then later attempt
to get the page. But since the size of the pool/cleancache is
unknown, any put or any get can fail. A get of a previously
successfully put page will probably be successful, but sometimes
the get may fail. In other words, persistence
is not guaranteed! Thus, a client can put only clean pages, not
dirty pages, into the cleancache. This is a minor inconvenience for
an operating system, but in a virtualized environment the restriction
is extremely important because any page in the pool can be instantly
reclaimed by the hypervisor for other needs!
The use model for cleancache is straightforward: When Linux's pageframe
replacement algorithm evicts a clean page, that page is put to a
previously-created per-filesystem tmem pool (using the inode number
and page index as the handle). Whenever a page must be read from disk,
the tmem pool is first checked; if the page is found by tmem, it is used.
If it is not found, the disk is read. Linux must sometimes judiciously
flush pages from tmem to ensure consistency, but this is entirely
manageable and the code is well-contained.
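In rough C, the eviction-side and read-side hooks just described might look like the sketch below. All helper names (read_page_from_disk, cleancache_*) and the assumption that 0 means success are invented for this illustration; the real Linux hooks live in the page cache code and are more involved.
    /* Hedged sketch of the cleancache flow described above; helper names, types,
     * and the "0 means success" convention are assumptions for illustration. */
    #include <stdint.h>

    extern int tmem_put(uint32_t pool_id, uint64_t object_id,
                        uint32_t page_id, unsigned long pfn);
    extern int tmem_get(uint32_t pool_id, uint64_t object_id,
                        uint32_t page_id, unsigned long pfn);
    extern int read_page_from_disk(uint64_t inode_no, uint32_t index,
                                   unsigned long pfn);   /* assumed helper */

    /* Eviction path: a clean page is being dropped by pageframe replacement,
     * so offer it to the per-filesystem ephemeral pool.  The put may fail,
     * which is harmless -- the data is clean and still on disk. */
    void cleancache_put_page(uint32_t fs_pool_id, uint64_t inode_no,
                             uint32_t index, unsigned long pfn)
    {
        (void)tmem_put(fs_pool_id, inode_no, index, pfn);
    }

    /* Read path: check the ephemeral pool first; only touch the disk on a miss. */
    int cleancache_read_page(uint32_t fs_pool_id, uint64_t inode_no,
                             uint32_t index, unsigned long pfn)
    {
        if (tmem_get(fs_pool_id, inode_no, index, pfn) == 0)
            return 0;                                     /* hit: data copied into pfn */
        return read_page_from_disk(inode_no, index, pfn); /* miss: read the disk */
    }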
As a second example, consider a private persistent pool, or using
the Linux client terminology, a frontswap. In this pool, persistence
is guaranteed (for the life of the client) so any get following
a successful put must also be successful. But because the size is
unknown, any put can fail. These semantics nicely support a
swap-disk-like mechanism, but the hypervisor controls the sizing
of the swap-disk. As an illustration, the hypervisor may choose to
accept a page into the tmem pool only when the caller is not using
its full memory allocation (i.e. when it has previously ballooned-out
some of its memory).
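The corresponding swap-side flow might be sketched as follows: on swap-out, try to put the page into the persistent pool and fall back to the real swap device only if tmem refuses it; on swap-in, a page that was successfully put is guaranteed to come back. The helper names (swap_write_page, swap_read_page) and the return-code convention are assumptions of this sketch, not the actual frontswap hooks.
    /* Hedged sketch of the frontswap flow described above; helper names and
     * the "0 means success" convention are assumptions, not the real hooks. */
    #include <stdint.h>

    extern int tmem_put(uint32_t pool_id, uint64_t object_id,
                        uint32_t page_id, unsigned long pfn);
    extern int tmem_get(uint32_t pool_id, uint64_t object_id,
                        uint32_t page_id, unsigned long pfn);
    extern int swap_write_page(uint32_t type, uint32_t offset,
                               unsigned long pfn);        /* assumed helper */
    extern int swap_read_page(uint32_t type, uint32_t offset,
                              unsigned long pfn);         /* assumed helper */

    /* Swap-out: offer the page to the persistent pool first.  The hypervisor
     * may refuse (e.g. if this guest is over its allocation), in which case
     * the page goes to the real swap device as usual. */
    int frontswap_store(uint32_t swap_pool_id, uint32_t type,
                        uint32_t offset, unsigned long pfn)
    {
        if (tmem_put(swap_pool_id, type, offset, pfn) == 0)
            return 0;                         /* page now lives in tmem */
        return swap_write_page(type, offset, pfn);
    }

    /* Swap-in: a page that was successfully put is guaranteed (for the life of
     * the client) to come back; otherwise read the real swap device. */
    int frontswap_load(uint32_t swap_pool_id, uint32_t type,
                       uint32_t offset, unsigned long pfn)
    {
        if (tmem_get(swap_pool_id, type, offset, pfn) == 0)
            return 0;
        return swap_read_page(type, offset, pfn);
    }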
The implementation of cleancache and frontswap is best understood by
reading the code.
The most recent patchset submittal for cleancache can be found
here
and the most recent patchset submittal for frontswap can be found
here.
The cleancache submittal includes support for the OCFS2 filesystem
using a tmem shared ephemeral pool. It results in a
shared page cache for cluster VMs that reside on the same physical
machine. And
we believe a shared persistent pool will offer an interesting form
of inter-VM shared memory that, once tmem is in use for other purposes,
will be very easy to use as a foundation for inter-VM communication
mechanisms.
We are also brainstorming other use models and pool types and would value any input.
TRANSCENDENT MEMORY SELECTED BIBLIOGRAPHY
[1] C. A. Waldspurger.
Memory Resource Management in VMware ESX Server.
In Proc. of the 5th USENIX Symp. on Operating System Design and Implementation, pp 181-194, Boston MA, December 2002.
[2] S.T. Jones, A.C. Arpaci-Dusseau, and R.H. Arpaci-Dusseau.
Geiger: Monitoring the buffer cache in a virtual environment.
In Proc. of the 12th Int'l Conf. on Architectural Support for Programming Languages and Operating Systems, pp 13-23, San Jose CA, October 2006.
[3] P. Lu and K. Shen.
Virtual Machine Memory Tracing with Hypervisor Exclusive Cache.
In Proc. of the 2007 USENIX Annual Tech. Conf., Santa Clara, CA, June 2007.
[4] M. Schwidefsky, et al.
Collaborative Memory Management in Hosted Linux Environments.
In Proc. of the 2006 Ottawa Linux Symposium, Ottawa, Canada, July 2006.
[5] D. Gupta, et al.
Difference Engine: Harnessing Memory Redundancy in Virtual Machines.
In Proc. of the 8th USENIX Symp. on Operating System Design and Implementation, San Diego CA, December 2008.
[6] W. Zhao and Z. Wang.
Dynamic Memory Balancing for Virtual Machines.
In Proc. of the 2009 ACM Int'l Conf. on Virtual Execution Environments, Washington DC, March 2009.