Files
UnrealEngine/Engine/Source/Runtime/AutoRTFM/Documentation/README.md
Brandyn / Techy fcc1b09210 init
2026-04-04 15:40:51 -05:00

15 KiB
Raw Blame History

AutoRTFM User Guide

Author: John Stiles
Team: Verse - AutoRTFM

Context

A core Verse language feature is the ability to “roll back” changes to the state of execution when failures are encountered, making it appear as if nothing has changed. Verse code is allowed to wrap almost any operation within a failure scope, and runtime failures can occur at any point, so very little in Verse is truly exempt from rollback. dSTM will also use this ability to abort and retry a transaction whenever a remote TObjectPtr is dereferenced.

Before AutoRTFM was enabled, rollback was handled using explicit callbacks in Verse::Stm and this is still used on the client. However, servers now use AutoRTFM instead, and automatically support most forms of rollback without any programmer involvement. However, AutoRTFM has limitations around threading and resource sharing which sometimes require very careful thought.

AutoRTFM Rollback

Live in Production

The goal of AutoRTFM is to automatically implement rollback for single-threaded code. The basics of this approach are covered in Phil Pizlos blog post, Bringing Verse Transactional Memory Semantics to C++. Note that this post discusses AutoRTFM as being compiled into the binary as of Release 28.10 (January 2024), and thats technically correct, but it wasn't actually enabled on live servers at that time. AutoRTFM was disabled shortly after launch due to server stability issues.

After the initial launch attempt, the team regrouped and devised a more robust launch plan. AutoRTFM was then successfully enabled for a small percentage of VkPlay games on public servers in Release 34.10 (March 2025). As of Release 36.20 (July 2025), AutoRTFM is fully enabled for all VkPlay and VkEdit sessions. In Release 38.00 (November 2025), we plan to deprecate the ability to disable AutoRTFM in production. We are no longer testing Valkyrie in the AutoRTFM-off path, and features like the new VM rely on AutoRTFM to function.

When AutoRTFM is enabled, almost all game logic on the main thread automatically supports rollback without manual intervention. Unfortunately, some things have side effects that simply cant be undone automatically—for instance, you cant un-send a network packet after its been sent!—so more complex or multi-threaded code will need to be aware of the implementation details of AutoRTFM. Fortunately, most of the time, your existing code will work as-is, and no additional work will be necessary.

Open and Closed Code

AutoRTFM is designed around an Epic-internal fork of the Clang compiler. This version of Clang is responsible for adding transactional instrumentation to our code. For performance reasons, though, we dont always run with this instrumentation enabled; instead, AutoRTFM is designed to compile all of our code twice. One version is the typical, uninstrumented form that you would expect from a normal Clang build, more or less—by convention, we refer to this form as “open.” The other version is compiled with our extra transactional logic mixed in—our convention is to call this instrumented form “closed.” Open and closed versions of almost all functions coexist in our binaries.

Programs start off in the “open” state; in situations where rollback is needed, we transition into the “closed” state via AutoRTFM::Transact. This function is responsible for teleporting us to the matching, “closed” form of the currently-running code.

Instrumentation

In transactional mode, writes to heap memory occur in real-time, but they are also closely tracked so that they can be undone if necessary. We also wrap some low-level APIs like malloc and free to make them transactionally safe. Function pointers to “open” functions are dynamically rerouted to their “closed” equivalent so that we dont accidentally escape from “closed” code into its “open” form. Finally, we maintain a list of deferred tasks to perform at the conclusion of the transaction.

Committing

If the transaction reaches the end of its Transact block without being aborted, it has succeeded and will be committed. To do this, we execute our list of CommitTasks which are responsible for handling all deferred operations. For instance, we always defer free(MyData) to commit time, via the CommitTasks list, instead of calling free immediately; this allows us to resuscitate the block of heap memory if the transaction is aborted. Once the commit tasks are complete, we return to the “open” state and discard any heap tracking information. Users can also defer work to the end of a transaction by calling AutoRTFM::OnCommit to add their own callbacks to the commit task list. Commit tasks are run in first-in, first-out order.

Aborting

Conversely, if AutoRTFM::AbortTransaction is called anywhere within a Transact block, the transaction is considered to be aborted and must be rolled back. In other words, AutoRTFM is now responsible for undoing all changes made within the transaction; we must teleport execution straight to the end of the Transact block, while maintaining the illusion that nothing at all has changed. Our heap-write tracking data is used to undo all transactional changes to the heap, and we additionally have a list of AbortTasks to execute. As you might expect, these AbortTasks are the parallel opposite of the CommitTasks. For instance, we need to ensure that memory isnt just leaked if the user calls malloc(Size) and then aborts the transaction. This is handled by generating an AbortTask inside of malloc which frees the allocated data; in the event of an abort, this saves us from a leak. Users can also call OnAbort to add work to the abort task list. Abort tasks are run in first-in, last-out order.

Hazards

Some APIs are inherently multi-threaded; these are considered hazards in AutoRTFM, since our transactional model only has a single-threaded view of the world. These APIs intentionally trigger a language failure and will need to be manually fixed up by a programmer. For instance, std::atomic<> or FThreadSafeCounter are considered unsafe because they are designed to communicate state across threads; it would be possible to automatically roll back an atomic increment with an atomic decrement, but in general this wouldnt be sufficient to guarantee correct behavior of the entire program. The AutoRTFM team has designed transactionally safe primitives like critical sections (FTransactionallySafeCriticalSection) and mutexes (FTransactionallySafeMutex), but these must be manually replaced in the code because they are a little more complex, and take more memory, than a plain FCriticalSection or FMutex.

Resource Locking

The approach for resource locking within a transaction is that, once a lock is taken or a resource is acquired, that lock or resource is always held until the end of the transaction. That is, calling FTransactionallySafeCriticalSection::Unlock within a transaction will not immediately release the lock; instead, it enqueues a CommitTask entry responsible for unlocking the resource. Because the AutoRTFM thread maintains its lock for the duration of the transaction, it is free to re-acquire or re-lock the resource within the same transaction at will; this is important, because otherwise, we would quickly deadlock.

Holding resource locks is an important principle, because it prevents the transactional state from being exposed to other threads prematurely. In other words, transactional changes shouldnt be visible to other threads while the transaction is still in flight. This wouldnt be safe, because the transaction is still in a provisional state and might be rolled back—we dont want other threads to see or act on those changes until the transaction is fully committed. (For that matter, immediate unlocks would also make rollback itself a threading hazard, since the heap rollback code doesnt know about critical sections or mutexes, and wont take any locks while it is undoing heap changes.)

Nesting

It is legal to nest a sub-transaction inside of a transaction, and it is safe to abort the sub-transaction within the outer transaction. Aborting an outer transaction will roll back all the work performed in the sub-transaction as well, even if those inner transactions succeeded and were committed. In general, aborting a transaction should always roll back everything that happened inside of that transaction—even if that includes an inner sub-transaction.

Missing Closed Code

In some cases, we dont have the “closed” version of a function at all. For instance, Unreal Engine includes some libraries like Oodle and EOS as precompiled binaries; we also compile some code with specialized compilers like Intel ISPC that dont support instrumentation. In these cases, we are required to manually go into the “open” state via UE_AUTORTFM_OPEN or AutoRTFM::Open before calling these functions. However, for an abort to properly undo these changes to the heap, callers are required to manually inform the AutoRTFM runtime about any heap memory that the code will write to—crucially, before the write occurs!—via AutoRTFM::RecordOpenWrite.

Obviously, this extra work is complex and undesirable. If calling an open function from closed code cannot be avoided, consider wrapping the work in a helper function. If it is feasible, design the code so that heap writes do not need to be undone at all—e.g., only write to a short-lived scratch buffer that will be destroyed before the transaction is complete, or maintain a persistent scratch buffer where the contents are always assumed to be clobbered across transactions.

Mixed Open and Closed Code

When trying to resolve AutoRTFM incompatibilities, it is tempting to consider invoking large, complex functions in the “open”—thereby dodging all hazards—and then registering an OnAbort handler to undo its effects. However, this is dangerous and should be avoided. This approach can lead to new, subtle transactional hazards:

  • Locks will be released before the transaction is finished. This exposes transactional changes to other threads while the transaction is still in a provisional state. Also, if a rollback occurs, this becomes a second thread hazard, because locks wont be taken at all during the rollback.
  • An “open” block cannot free or reallocate memory that was allocated in a “closed” block. (Reason: When allocating, the closed code will also have created an abort handler to free its memory. Those callbacks will still run if the transaction is aborted, but the pointer passed to Free will already have been deallocated; in short, it will lead to a double-free.)
  • The above-mentioned reallocation hazard also extends to classes which implicitly reallocate memory on your behalf. In particular, innocuous methods like TArray::Add() can become very dangerous when both open and closed code add items to the same array.
  • Passing ownership of non-trivial objects across the open-closed boundary can lead to memory handling errors. For instance, if you have an FString that was created in “closed” code, and assign a new value to it from an “open” block, this is unsafe (discussed in depth here. We have a special mechanism which is designed to allow returning a non-trivial object from AutoRTFM::Open into closed code by making a copy; this must be handled on a type-by-type basis.
  • Altering the same bytes of memory from both “open” and “closed” code in the same transaction can lead to hard-to-diagnose silent rollback failures. (Reason: heap changes made in the open are not tracked and cant be undone automatically, as you probably know. But if we subsequently make a write to the same range of heap memory from “closed” code, the instrumentation will log the write so that it can be undone. Unfortunately, we already changed the data, so the true “original” value is already gone, and we log the already-changed value. If an abort occurs, this behavior is sometimes harmless, and other times very wrong.)

These hazards can be extraordinarily difficult to debug once they have occurred.

To help identify writes made in the closed, and then within the same transaction, in the open, you can enable a memory validator with the AutoRTFMMemoryValidationLevel flag (see below).

Command line flags

AutoRTFM can be controlled with the following dpcvars, which can be combined with a comma.

NOTE: Be aware that these settings are ignored in Shipping builds! A Test build should be used to run the memory validator.

AutoRTFM mode Server Flags
Disable -dpcvars=AutoRTFMRuntimeEnabled=off or -dpcvars=AutoRTFMRuntimeEnabled=0
Enabled -dpcvars=AutoRTFMRuntimeEnabled=on or -dpcvars=AutoRTFMRuntimeEnabled=1
Force-disabled -dpcvars=AutoRTFMRuntimeEnabled=forceoff or -dpcvars=AutoRTFMRuntimeEnabled=2
Force-enabled -dpcvars=AutoRTFMRuntimeEnabled=forceon or -dpcvars=AutoRTFMRuntimeEnabled=3
Retry Validation Mode Server Flags
Disable -dpcvars=AutoRTFMRetryTransactions=0
Retry non-nested -dpcvars=AutoRTFMRetryTransactions=1
Retry nested too -dpcvars=AutoRTFMRetryTransactions=2
Memory Validation Mode Server Flags
Disable -dpcvars=AutoRTFMRuntimeEnabled=1 or -dpcvars=AutoRTFMRuntimeEnabled=1,AutoRTFMMemoryValidationLevel=1
Warn and continue -dpcvars=AutoRTFMRuntimeEnabled=1,AutoRTFMMemoryValidationLevel=2
Hard error -dpcvars=AutoRTFMRuntimeEnabled=1,AutoRTFMMemoryValidationLevel=3
AutoRTFM Enable Probability Server Flags
0.1% chance to enable AutoRTFM -dpcvars=AutoRTFMEnabledProbability=0.1
5% chance to enable AutoRTFM -dpcvars=AutoRTFMEnabledProbability=5.0
50% chance to enable AutoRTFM -dpcvars=AutoRTFMEnabledProbability=50.0