Project: Transcendent Memory
Project Description: A new approach to managing physical memory in a virtualized system
License: GPL
Finally updated again: 20101122 (though this page could still use a major rewrite)
WHAT IS TRANSCENDENT MEMORY?
Virtualization provides significant cost savings by allowing otherwise
underutilized resources to be time-shared between multiple
virtual machines.
While efficient techniques for optimizing
CPU and I/O device utilization are widely implemented,
time-sharing of physical
memory is much more difficult.
As a result, physical memory is increasingly
becoming a bottleneck in virtualized systems, limiting the efficient
consolidation of virtual machines onto physical machines.
Transcendent Memory (tmem for short)
provides a new approach for improving the utilization of
physical memory in a virtualized** environment by claiming underutilized
memory in a system and making it available where it is most needed.
From the perspective of an operating system, tmem is fast pseudo-RAM
of indeterminate and varying
size that is useful primarily when real RAM is in short supply and is
accessible only via a somewhat quirky copy-based interface.
More formally, Transcendent Memory is both: (a) a collection of idle
physical memory in a system and (b) an API for providing
indirect access to that memory. A tmem host (such as
a hypervisor
in a virtualized system) maintains and manages one or more tmem pools
of physical memory. One or more tmem clients
(such as a guest OS in a virtualized system) can access this memory
only indirectly via a well-defined tmem API which imposes a
carefully-crafted set of rules and restrictions. Through proper use of
the tmem API, a tmem client may utilize a tmem pool as an extension
to its memory, thus reducing disk I/O and improving performance.
As an added bonus, tmem is compatible and complementary -- indeed
even supplementary -- to
other physical memory management mechanisms such as automated ballooning
and page-sharing.
(** While tmem's primary goal is to better optimize the multiplexing of
physical memory between guests in a virtualized system, the tmem
interface is useful within a native operating system on a physical system
too. See references below.)
The tmem approach is flexible and we have already identified four different
use models that can benefit a tmem client:
- A second-chance clean page cache called cleancache (previously known as precache and hcache)
- A fast swap "device" called frontswap (previously known as preswap or hswap)
- A shared page cache (shared cleancache) for remote clustered filesystems
- An interdomain shared memory mechanism
The first three are sufficient for significantly improving utilization
of physical memory -- and thus performance -- in many
virtualized workloads. The last may provide
substantial incremental benefits that are still under investigation.
To be useful, tmem must be present and enabled in both the hypervisor
and in each guest operating system;
guest OS changes are explicitly required, but are surprisingly non-invasive.
Both Xen 4.0 and Oracle VM Server
2.2 support tmem in the hypervisor. Tmem-enabled kernels are
available in the 2.6.18-xen tree, in OpenSuse 11.2, SLE11, and Oracle
Enterprise Linux 5u5, and RPMs are also available for the Enterprise
Linux 6 beta to ensure tmem is ready when that OS is released.
Finally, patches to implement cleancache and frontswap are being
proposed to upstream Linux and have received a great deal of support.
A brief explanation of how Linux can use cleancache/frontswap/tmem (NOTE: uses the obsolete
terms precache and preswap instead of cleancache and frontswap) is
here.
For Linux experts, a good overview of the motivations and related state of the art, a brief intro
to tmem, and some impressive performance results were presented at the Virtualization mini-summit
at the Linux Plumbers Conference 2010, and are available in these
slides,
including a complete script in these
speaker notes.
Some ideas on how tmem-like concepts might be extended for broader use in Linux were posted
here,
and briefly discussed at the Linux LSF/MM10 summit.
Tmem was previously described at the Linux Symposium 2009 conference
from which there is a
paper
and
presentation slides,
including a complete script in the
speaker notes.
A similar
presentation
with updated information and
speaker notes
was presented in January
2010 at Linux Conference Australia (LCA10).
There is also a more Xen-focused
presentation with
speaker notes
and
video
from Xen Summit North America 2010, and
older slides and video from a presentation on tmem at Xen Summit
North America 2009 can be found
here.
Finally, a brief older one-page academic overview of Transcendent
Memory prepared for the
OSDI'08
work-in-progress session
is
here.
Some preliminary but promising measurements of cleancache can be
found in our
Paravirtualized Paging
paper from
WIOV'08.
Various Linux patches and RPMs, some very outdated, can be found
here.
The patch for Xen's paravirtualized 2.6.18-xen Linux is
in tree
here.
HOW DOES TRANSCENDENT MEMORY COMPARE TO OTHER APPROACHES?
First let's define the problem space:
The deceptively simple objective is to optimize, across time,
the distribution of a fixed amount of machine memory among
a maximal set of virtual machines (VMs). To do this, we measure
the current and future memory need or working set size of each
running VM and periodically reclaim memory from those VMs
that have an excess of memory and either provide it to VMs that
need more memory or use it to provision additional new VMs.
However, attempts to implement this objective encounter a
number of difficulties:
- Working set is a poorly-defined concept, so the current working set size
  of a VM cannot be accurately measured and an approximation must be deduced
  from a collection of externally observed data and events.
- The future working set size of a VM cannot be accurately predicted, must be
  estimated from trends in externally available data, and may change
  unpredictably, perhaps dramatically, at any instant.
- Memory has inertia and so cannot be randomly taken away from
  a VM without potential loss of data.
- The consequence of underprovisioning the memory of a VM may be severe: a
  significant non-linear performance impact on that VM (i.e. thrashing).
- The consequence of mispredicting the number of VMs that can be supported
  by the fixed total machine memory is also severe: a significant non-linear
  performance impact on some set of VMs -- and thus on the overall
  system (i.e. host swapping).
Most research and development projects in this problem space focus either on
working set estimation and prediction, or on
cleverly increasing the apparent amount of physical memory (e.g.
through compression or page-sharing). The transfer of memory between
VMs is generally considered a solved problem -- ballooning [1]
is implemented on all major virtualization platforms -- but memory
inertia imposes challenges that ballooning can only partially overcome.
Most notably, ballooning depends on requests to a VM to donate memory,
which may have a long latency especially for large requests.
Our approach, tmem, is fundamentally different in that it assumes that
working set estimation and prediction will never be sufficiently accurate and,
instead, provides a mechanism for mitigating the impact of that
inaccuracy. Tmem is complementary to compression and/or page-sharing
and in fact may provide opportunities to optimize them by isolating the
memory that can benefit most. And while tmem partially depends on some
form of ballooning-like mechanism, unneeded memory is pooled and utilized
in a way that minimizes memory inertia, thus making a significant fraction
of main memory instantly available to accommodate rapidly changing needs.
Finally, tmem explicitly exposes certain previously hidden OS memory
management operations, such as page eviction, that may uncover new avenues of
virtualization research, not only in collaborative memory management but
perhaps in areas such as VM introspection and improved virtual disk management.
Tmem has its limitations too. There is some overhead, which may result in
a small negative performance impact on some workloads. OS change --
paravirtualization -- is required, though the changes are surprisingly
small and non-invasive. And tmem doesn't help at all if all of main
memory is truly in use (i.e. the sum of the working set of all active
virtual machines exceeds the size of physical memory).
But we believe the benefits of tmem
will greatly outweigh these costs.
WHERE DOES THE MEMORY FOR TRANSCENDENT MEMORY COME FROM?
In order for tmem to work, unused memory must be collected.
Where does this memory come from?
To answer, we need to understand some basics of
physical memory management, first in a physical system and then in
a virtualized system.
Single system physical memory management
In a running physical system, memory utilization can be characterized by two
metrics: W, the working set size of the currently running
workload and P, the amount of physical memory on the system.
On most systems, W will vary dramatically and often rapidly,
and rarely will W be equal to P.
For any point in time where
W is smaller than P, some fraction of memory is
unused or idle. If at another point in time,
W is larger than P, most systems will use disk space (a
swap disk) to contain the overflow. Since a disk is
horribly slow compared to physical memory, if the memory-on-disk
must be regularly accessed, system performance can become abysmal,
a situation widely known as thrashing.
Since "memory is cheap", to avoid thrashing, a system administrator
will often ensure P is much larger than W will ever be,
which means that physical memory is usually greatly overprovisioned.
As a result, an even greater fraction of memory is usually idle.
Now, it is true that in any decent operating system, idle memory is
not really entirely unused. Generally, much of this memory is used
as some form of a
page cache, storing copies of previously-used disk pages that
might need to again
be used in the future, thus avoiding the cost and delay of re-reading
that page from the disk. However, despite the fact that operating-system
designers have studied this problem for years, their algorithms still
can only approximate the future, not predict it. As a result, many,
maybe even most, of the pages in the page cache will never be used
again and will eventually be evicted from memory. Annoyingly
(and in another indication that the future cannot be predicted), it
will turn out that some pages that are evicted are actually
needed again in the future... Too late, those pages will need to be
fetched from disk. (Note: In later discussion, we will refer to those
pages as "false negative evictions.")
So, to differentiate those pages that at any moment in time will
be used in the future from those that will not, we will consider
the former to be in the working set W and will only consider
the latter to be truly idle. At any moment in time, this still
leaves a large quantity of physical memory as unused... wasted.
Virtualized system physical memory management -- Static partitioning
Now let's turn our attention to a virtual environment.
Depending
on the virtualization system and on options selected, physical
memory is either statically or dynamically partitioned.
In a statically partitioned system, whenever a new virtual machine
is launched, it is given a fixed amount of guest physical memory which
it uses as it sees fit. The machine physical memory backing the
guest physical memory may not be contiguous and the mapping from a
guest physical page to machine physical memory may change over time,
but the total guest physical memory size remains fixed, reflecting
the real world equivalent. Subsequently launched virtual machines
require additional fixed chunks of machine physical memory and, of
course, the hypervisor reserves some memory for itself. Eventually
only a fragment of machine physical memory remains, which is insufficient
to add another virtual machine. (Some virtualization systems support
host swapping, but when it is used performance falls off a cliff, so
it is avoided except in rare conditions. For our purposes,
it can be ignored.) This fragment is therefore unassigned and we
call it fallow memory. Depending on the total machine physical
memory in the system, and the amount of guest physical memory assigned
to individual virtual machines, the amount of memory left fallow
may be rather substantial. When combined with the idle memory from
each guest, the total may be the majority of the machine physical
memory. (In some virtualization systems, a service domain may
own all non-hypervisor machine physical memory in the system and dole
it out as virtual machines are launched, which means the amount
of fallow memory is zero. However, the amount of idle memory in the
service domain is instead a rough approximation of the same.)
Virtualized system physical memory management -- Dynamic partitioning
In order to allow underutilized machine physical memory to be reassigned
from one virtual machine to another, a dynamic partitioning technique
called ballooning was invented. With ballooning, a pseudo-device
running in each virtual machine called the balloon driver absorbs and
releases memory from the fixed guest physical memory allocation. For example,
the balloon driver in virtual machine A transfers ownership of a set of
pages to the hypervisor, which in turn transfers it to a balloon driver
in virtual machine B.
Ballooning, while very interesting and useful, suffers from many constraints.
Most notably, decisions regarding which virtual machines have memory
to donate and which need more memory -- and how much -- must be driven
by a sophisticated algorithm which can infer and predict the working set
size of all of the guest virtual machines. We have already observed
the difficulty of this on a single machine and the dire consequences of
incorrectly predicting the future; the frequency of such mispredictions is only exacerbated under
the memory pressure provoked by ballooning.
Also notable is the fact that ballooning
is entirely dependent on the graces of each guest operating system to
surrender memory; we have also observed that an operating system believes
it is successfully utilizing idle memory and so may be loath to give up
memory, especially if the demand is urgent and the amount is large. We
refer to this property as memory inertia.
In short, ballooning is useful for shaping memory over time, but is not
responsive enough to ensure that, for example, the rapidly growing working set
of one or more VMs can be instantly satisfied. And once again the
consequence of insufficient memory for needy virtual machines is a
potentially dramatic reduction in performance.
Virtualized system physical memory management -- Live migration
Before we leave this topic, one more important significant source of
underutilized memory should be identified. One of the key advantages
of virtualization is the ability to migrate a live virtual machine from
one physical machine to another, with effectively no downtime.
Some believe that, in future virtual data centers, migration
will be employed frequently
to optimize load-balancing, respect power constraints, or manage other resources.
One constraint of effective migration is that all of the guest physical
memory must also (eventually) move from the sending physical machine to
the destination. So, to allow for fluid migration, bin packing
techniques must be applied and memory holes must be prevalent in the
data center. While these holes may eventually be filled by an inward-bound
virtual machine, the sum of the unused physical memory across the entire
data center at any moment may be substantial. More idle memory.
Summary
As can be seen, the sources of idle memory are varied and may add
up to a significant fraction of the physical memory in a machine or
data center. Single system algorithms that attempt to make better use of idle
memory depend on predicting the future of a machine's working set,
which is destined to fail, at least some of the time. And mechanisms
to move memory between virtual machines to meet dynamically varying
demands must not only successfully predict the future for multiple virtual
machines simultaneously
but also overcome memory inertia.
But if idle memory could be used for the benefit of virtual (or physical)
machines without being owned by those machines, to both reduce the penalty
for failing to predict the future and eliminate memory inertia, could
physical memory be more effectively utilized?
This is the foundation of Transcendent Memory.
HOW IS TRANSCENDENT MEMORY USED?
First, while tmem works with any pool of otherwise idle memory
it is ideally paired with a dynamic partitioning mechanism that is
capable of responding automatically to changing memory needs.
In this model, an automated ballooning mechanism (such as
self-ballooning, or MEB [6])
handles longer-term and
larger-scale memory shaping by transferring full ownership
(i.e. direct addressability) of physical memory to a needy virtual
machine, and tmem alleviates the impact of short-term mispredictions
and memory inertia which are exacerbated by the memory pressure
applied by dynamic partitioning. (Note that in recent tmem-enabled
kernels, this self-ballooning mechanism is integrated into the kernel
itself.)
In general,
a tmem client makes requests to a tmem host via the tmem API, so a
tmem host must expose the API to the client. As an example,
Xen implements and exports a hypercall interface that supports
the tmem API and the Linux kernel utilizes that API.
The tmem API is very narrow: There are two services, one to
create a tmem pool and one to perform operations on the pool.
At an abstract level, they are (using C syntax):
pool_id = tmem_new_pool(uuid, flags);
and
retval = tmem_op(OP, handle, pfn);
(For clarity, we will use slightly different syntax in the following but simple
preprocessor directives can be used to translate between the two.)
First, let's look at pool creation. Two parameters are passed, a 128-bit uuid
(universally unique identifier) and a 32-bit set of flags. Two of
the flags are important as they select very different semantics:
- shared vs private
- ephemeral vs persistent
For shared pools, the uuid represents a share name. The uuid is ignored
for private pools. We will explore the related semantics shortly but
it's important to note that there is no size parameter; the client creating
the tmem pool has no control or knowledge of the size of the pool.
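To make the creation call concrete, here is a minimal sketch, assuming illustrative flag names (TMEM_POOL_SHARED, TMEM_POOL_PERSISTENT) and an assumed uuid layout; the actual bit assignments are defined by the tmem API specification and the hypervisor ABI, not by this example.
    /* Minimal sketch only: the flag names, bit values, and uuid layout below
     * are assumptions for illustration, not the real tmem ABI. */
    #include <stdint.h>

    typedef struct { uint64_t lo, hi; } tmem_uuid_t;  /* 128-bit uuid, assumed layout */

    #define TMEM_POOL_PERSISTENT 0x1   /* assumed: default is ephemeral */
    #define TMEM_POOL_SHARED     0x2   /* assumed: default is private */

    /* Provided by the tmem host (e.g. via a hypercall shim); shape as in the text. */
    extern int32_t tmem_new_pool(tmem_uuid_t uuid, uint32_t flags);

    int32_t create_example_pools(tmem_uuid_t fs_uuid)
    {
        tmem_uuid_t ignored = { 0, 0 };

        /* Private ephemeral pool (cleancache-style): the uuid is ignored and,
         * notably, there is no size argument -- the client never learns how
         * big the pool is. */
        int32_t private_pool = tmem_new_pool(ignored, 0);

        /* Shared ephemeral pool: the uuid names the share, e.g. derived from a
         * cluster filesystem's uuid so co-resident VMs join the same pool. */
        int32_t shared_pool = tmem_new_pool(fs_uuid, TMEM_POOL_SHARED);

        return (private_pool < 0 || shared_pool < 0) ? -1 : private_pool;
    }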
Once created, operations can be performed on the pool. For example, a
page of data can be put to a pool and associated with a
handle, which is actually a three-element tuple consisting of a 32-bit pool-id,
a 64-bit object-id, and a 32-bit page-id. The pool-id selects
a previously created pool, the object-id is roughly analogous to
a file, and the page-id is analogous to a sequential page number
within the file. The
data is specified as a physical page frame number. So, the C code:
retval = tmem_put(pool_id, object_id, page_id, pfn);
copies the page of data from pfn into the previously created tmem pool
specified by pool_id and associates it with the object_id/page_id pair.
Then
retval = tmem_get(pool_id, object_id, page_id, pfn);
copies the page of data associated with the pool_id/object_id/page_id
(if present) from the tmem pool into the page frame specified by pfn.
Note that copying is explicitly required; no magic remapping (e.g.
page flipping) is done. Also
other semantics are enforced... and differ somewhat depending on
the type of pool that was created.
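The translation between that two-call interface and the tmem_put/tmem_get syntax used here might look like the following sketch. The handle structure, op codes, and pass-by-value convention are assumptions chosen for illustration; the real encoding is given in the tmem API specification.
    /* Sketch of the syntax translation mentioned above; op numbers, the struct
     * layout, and passing the handle by value are illustrative assumptions. */
    #include <stdint.h>

    struct tmem_handle {            /* the three-element tuple from the text */
        uint32_t pool_id;
        uint64_t object_id;
        uint32_t page_id;
    };

    enum tmem_opcode { TMEM_PUT_PAGE = 1, TMEM_GET_PAGE = 2 };  /* assumed values */

    extern int tmem_op(enum tmem_opcode op, struct tmem_handle handle,
                       unsigned long pfn);

    /* tmem_put: copy the page at pfn into the pool under the given handle. */
    static inline int tmem_put(uint32_t pool_id, uint64_t object_id,
                               uint32_t page_id, unsigned long pfn)
    {
        struct tmem_handle h = { pool_id, object_id, page_id };
        return tmem_op(TMEM_PUT_PAGE, h, pfn);
    }

    /* tmem_get: copy the page under the given handle (if present) into pfn. */
    static inline int tmem_get(uint32_t pool_id, uint64_t object_id,
                               uint32_t page_id, unsigned long pfn)
    {
        struct tmem_handle h = { pool_id, object_id, page_id };
        return tmem_op(TMEM_GET_PAGE, h, pfn);
    }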
For any tmem pools it creates,
a tmem client is responsible for ensuring any data it puts into a
pool remains consistent with any other memory or storage. To do
this, two flush operations are provided, one for a page
and one for an entire object.
There are also operations to read, write, or exchange a partial page,
or destroy a previously-created pool.
A complete list of operations and semantics can be found in the
Transcendent Memory API specification.
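To illustrate that consistency responsibility, here is a small sketch of the hooks a client might invoke when file data that may have been put to an ephemeral pool is overwritten or truncated. The function names tmem_flush_page and tmem_flush_object mirror the two flush operations described above, but their exact names and signatures are assumptions of this sketch.
    /* Hedged sketch: the prototypes follow the page- and object-level flush
     * operations described in the text; exact names and signatures are assumed. */
    #include <stdint.h>

    extern int tmem_flush_page(uint32_t pool_id, uint64_t object_id,
                               uint32_t page_id);
    extern int tmem_flush_object(uint32_t pool_id, uint64_t object_id);

    /* One page of a file is about to change on disk: flush it so that a later
     * tmem_get can never return the stale copy. */
    void example_invalidate_page(uint32_t pool_id, uint64_t inode_no,
                                 uint32_t page_index)
    {
        (void)tmem_flush_page(pool_id, inode_no, page_index);
    }

    /* A whole file is truncated or deleted: drop every page that was put
     * under its object-id. */
    void example_invalidate_file(uint32_t pool_id, uint64_t inode_no)
    {
        (void)tmem_flush_object(pool_id, inode_no);
    }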
As an example, let's consider a
private ephemeral pool, or using the Linux client terminology,
a cleancache. We put a page to the pool and then later attempt
to get the page. But since the size of the pool/cleancache is
unknown, any put or any get can fail. A get of a previously
successfully put page will probably be successful, but sometimes
the get may fail. In other words, persistence
is not guaranteed! Thus, a client can put only clean pages, not
dirty pages, into the cleancache. This is a minor inconvenience for
an operating system, but in a virtualized environment the restriction
is extremely important because any page in the pool can be instantly
reclaimed by the hypervisor for other needs!
The use model for cleancache is straightforward: When Linux's pageframe
replacement algorithm evicts a clean page, that page is put to a
previously-created per-filesystem tmem pool (using the inode number
and page index as the handle). Whenever a page must be read from disk,
the tmem pool is first checked; if the page is found by tmem, it is used.
If it is not found, the disk is read. Linux must sometimes judiciously
flush pages from tmem to ensure consistency, but this is entirely
manageable and the code is well-contained.
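In rough C, the eviction-side and read-side hooks just described might look like the sketch below. All helper names (read_page_from_disk, cleancache_*) and the assumption that 0 means success are invented for this illustration; the real Linux hooks live in the page cache code and are more involved.
    /* Hedged sketch of the cleancache flow described above; helper names, types,
     * and the "0 means success" convention are assumptions for illustration. */
    #include <stdint.h>

    extern int tmem_put(uint32_t pool_id, uint64_t object_id,
                        uint32_t page_id, unsigned long pfn);
    extern int tmem_get(uint32_t pool_id, uint64_t object_id,
                        uint32_t page_id, unsigned long pfn);
    extern int read_page_from_disk(uint64_t inode_no, uint32_t index,
                                   unsigned long pfn);   /* assumed helper */

    /* Eviction path: a clean page is being dropped by pageframe replacement,
     * so offer it to the per-filesystem ephemeral pool.  The put may fail,
     * which is harmless -- the data is clean and still on disk. */
    void cleancache_put_page(uint32_t fs_pool_id, uint64_t inode_no,
                             uint32_t index, unsigned long pfn)
    {
        (void)tmem_put(fs_pool_id, inode_no, index, pfn);
    }

    /* Read path: check the ephemeral pool first; only touch the disk on a miss. */
    int cleancache_read_page(uint32_t fs_pool_id, uint64_t inode_no,
                             uint32_t index, unsigned long pfn)
    {
        if (tmem_get(fs_pool_id, inode_no, index, pfn) == 0)
            return 0;                                     /* hit: data copied into pfn */
        return read_page_from_disk(inode_no, index, pfn); /* miss: read the disk */
    }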
As a second example, consider a private persistent pool, or using
the Linux client terminology, a frontswap. In this pool, persistence
is guaranteed (for the life of the client) so any get following
a successful put must also be successful. But because the size is
unknown, any put can fail. These semantics nicely support a
swap-disk-like mechanism, but the hypervisor controls the sizing
of the swap-disk. As an illustration, the hypervisor may choose to
accept a page into the tmem pool only when the caller is not using
its full memory allocation (i.e. when it has previously ballooned-out
some of its memory).
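The corresponding swap-side flow might be sketched as follows: on swap-out, try to put the page into the persistent pool and fall back to the real swap device only if tmem refuses it; on swap-in, a page that was successfully put is guaranteed to come back. The helper names (swap_write_page, swap_read_page) and the return-code convention are assumptions of this sketch, not the actual frontswap hooks.
    /* Hedged sketch of the frontswap flow described above; helper names and
     * the "0 means success" convention are assumptions, not the real hooks. */
    #include <stdint.h>

    extern int tmem_put(uint32_t pool_id, uint64_t object_id,
                        uint32_t page_id, unsigned long pfn);
    extern int tmem_get(uint32_t pool_id, uint64_t object_id,
                        uint32_t page_id, unsigned long pfn);
    extern int swap_write_page(uint32_t type, uint32_t offset,
                               unsigned long pfn);        /* assumed helper */
    extern int swap_read_page(uint32_t type, uint32_t offset,
                              unsigned long pfn);         /* assumed helper */

    /* Swap-out: offer the page to the persistent pool first.  The hypervisor
     * may refuse (e.g. if this guest is over its allocation), in which case
     * the page goes to the real swap device as usual. */
    int frontswap_store(uint32_t swap_pool_id, uint32_t type,
                        uint32_t offset, unsigned long pfn)
    {
        if (tmem_put(swap_pool_id, type, offset, pfn) == 0)
            return 0;                         /* page now lives in tmem */
        return swap_write_page(type, offset, pfn);
    }

    /* Swap-in: a page that was successfully put is guaranteed (for the life of
     * the client) to come back; otherwise read the real swap device. */
    int frontswap_load(uint32_t swap_pool_id, uint32_t type,
                       uint32_t offset, unsigned long pfn)
    {
        if (tmem_get(swap_pool_id, type, offset, pfn) == 0)
            return 0;
        return swap_read_page(type, offset, pfn);
    }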
The implementation of cleancache and frontswap is best understood by
reading the code.
The most recent patchset submittal for cleancache can be found
here
and the most recent patchset submittal for frontswap can be found
here.
The cleancache submittal includes support for the OCFS2 filesystem
using a tmem shared ephemeral pool. It results in a
shared page cache for cluster VMs that reside on the same physical
machine. And
we believe a shared persistent pool will offer an interesting form
of inter-VM shared memory that, once tmem is in use for other purposes,
will be very easy to use as a foundation for inter-VM communication
mechanisms.
We are also brainstorming other use models and pool types and would value any input.
TRANSCENDENT MEMORY SELECTED BIBLIOGRAPHY
[1] C. A. Waldspurger.
Memory Resource Management in VMware ESX Server.
In Proc. of the 5th USENIX Symp. on Operating System Design and Implementation, pp 181-194, Boston MA, December 2002.
[2] S.T. Jones, A.C. Arpaci-Dusseau, and R.H. Arpaci-Dusseau.
Geiger: Monitoring the buffer cache in a virtual environment.
In Proc. of the 12th Int'l Conf. on Architectural Support for Programming Languages and Operating Systems, pp 13-23, San Jose CA, October 2006.
[3] P. Lu and K. Shen.
Virtual Machine Memory Tracing with Hypervisor Exclusive Cache.
In Proc. of the 2007 USENIX Annual Tech. Conf., Santa Clara, CA, June 2007.
[4] M. Schwidefsky, et al.
Collaborative Memory Management in Hosted Linux Environments.
In Proc. of the 2006 Ottawa Linux Symposium, Ottawa, Canada, July 2006.
[5] D. Gupta, et al.
Difference Engine: Harnessing Memory Redundancy in Virtual Machines.
In Proc. of the 8th USENIX Symp. on Operating System Design and Implementation, San Diego CA, December 2008.
[6] W. Zhao and Z. Wang.
Dynamic Memory Balancing for Virtual Machines.
In Proc. of the 2009 ACM Int'l Conf. on Virtual Execution Environments, Washington DC, March 2009.