
# AutoRTFM User Guide

**Author**: [John Stiles](mailto:john.stiles@epicgames.com)
**Team**: Verse - AutoRTFM

# Context

A core Verse language feature is the ability to “roll back” changes to the state of execution when
failures are encountered, making it appear as if nothing has changed. Verse code is allowed to wrap
almost any operation within a failure scope, and runtime failures can occur at any point, so very
little in Verse is truly exempt from rollback. dSTM will also use this ability to abort and retry a
transaction whenever a remote `TObjectPtr` is dereferenced.

Before AutoRTFM was enabled, rollback was handled using explicit callbacks in
[Verse::Stm](https://docs.google.com/document/d/1gQlVTJigO4AdpljegaRe9hDGIeaLfqgJLq1p6z8QBNE/edit?tab=t.0),
and this approach is still used on the client. Servers now use AutoRTFM instead, which automatically
supports most forms of rollback without any programmer involvement. However, AutoRTFM has limitations
around threading and resource sharing which sometimes require very careful thought.

# AutoRTFM Rollback

## Live in Production

The goal of AutoRTFM is to automatically implement rollback for single-threaded code. The basics of
this approach are covered in Phil Pizlo’s blog post,
[Bringing Verse Transactional Memory Semantics to C++](https://www.unrealengine.com/en-US/tech-blog/bringing-verse-transactional-memory-semantics-to-c).
Note that this post discusses AutoRTFM as being compiled into the binary as of Release 28.10
(January 2024), and that’s technically correct, but it wasn’t actually *enabled* on live servers at
that time. AutoRTFM was disabled shortly after launch due to server stability issues.

After the initial launch attempt, the team regrouped and devised a more robust launch plan. AutoRTFM
was then successfully enabled for a small percentage of VkPlay games on public servers in Release
34.10 (March 2025). As of Release 36.20 (July 2025), AutoRTFM is fully enabled for all VkPlay and
VkEdit sessions. In Release 38.00 (November 2025), we plan to deprecate the ability to disable
AutoRTFM in production. We are no longer testing Valkyrie in the AutoRTFM-off path, and features
like the new VM rely on AutoRTFM to function.

When AutoRTFM is enabled, almost all game logic on the main thread automatically supports rollback
without manual intervention. Unfortunately, some things have side effects that simply can’t be
undone automatically—for instance, you can’t un-send a network packet after it’s been sent!—so more
complex or multi-threaded code will need to be aware of the implementation details of AutoRTFM.
Fortunately, most of the time, your existing code will work as-is, and no additional work will be
necessary.

## Open and Closed Code

AutoRTFM is designed around an Epic-internal fork of the Clang compiler. This version
of Clang is responsible for adding transactional instrumentation to our code. For performance
reasons, though, we don’t always run with this instrumentation enabled; instead, AutoRTFM
is designed to compile all of our code twice. One version is the typical, uninstrumented
form that you would expect from a normal Clang build, more or less—by convention, we refer
to this form as “open.” The other version is compiled with our extra transactional logic
mixed in—our convention is to call this instrumented form “closed.” Open and closed versions
of almost all functions coexist in our binaries.

Programs start off in the “open” state; in situations where rollback is needed, we transition
into the “closed” state via `AutoRTFM::Transact`. This function is responsible for teleporting
us to the matching, “closed” form of the currently-running code.
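
One way to picture the coexisting “open” and “closed” forms is as a lookup from each open function to its instrumented twin. The sketch below is a toy model under that assumption; `FToyFunction`, `ToyClosedTable`, `FindClosedVersion`, and `CallWhileClosed` are hypothetical names for illustration, and the real mechanism lives in the compiler and runtime rather than in a hand-written table.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Toy model only: every name here is hypothetical, not part of AutoRTFM.
using FToyFunction = int (*)(int);

int GNumInstrumentedCalls = 0;

// "Open" form: the plain, uninstrumented function.
int AddOneOpen(int X) { return X + 1; }

// "Closed" form: same logic, plus stand-in bookkeeping for the instrumentation.
int AddOneClosed(int X)
{
    ++GNumInstrumentedCalls;
    return X + 1;
}

// Stand-in for a function that has no closed twin at all.
int PrecompiledOnly(int X) { return X * 2; }

// Maps each open function to its closed twin.
const std::vector<std::pair<FToyFunction, FToyFunction>>& ToyClosedTable()
{
    static const std::vector<std::pair<FToyFunction, FToyFunction>> Table = {
        { &AddOneOpen, &AddOneClosed },
    };
    return Table;
}

// Returns the closed twin of an open function, or nullptr if there is none.
FToyFunction FindClosedVersion(FToyFunction OpenFunction)
{
    for (const auto& Entry : ToyClosedTable())
    {
        if (Entry.first == OpenFunction) { return Entry.second; }
    }
    return nullptr;
}

// Models a call made while in the closed state: always run the closed twin.
// Returns false when no closed version exists.
bool CallWhileClosed(FToyFunction OpenFunction, int Argument, int& OutResult)
{
    FToyFunction Closed = FindClosedVersion(OpenFunction);
    if (Closed == nullptr) { return false; }
    OutResult = Closed(Argument);
    return true;
}

// Returns the rerouted call's result, or -1 if rerouting failed.
int DemoReroute()
{
    int Result = -1;
    return CallWhileClosed(&AddOneOpen, 41, Result) ? Result : -1;
}

// Returns true when the missing closed version is detected.
bool DemoMissingClosedVersion()
{
    int Unused = 0;
    return !CallWhileClosed(&PrecompiledOnly, 1, Unused);
}
```

In this model, `DemoReroute()` runs `AddOneClosed` even though the caller named `AddOneOpen`, and `DemoMissingClosedVersion()` shows the lookup failing for a function that was never compiled in closed form.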

## Instrumentation

In transactional mode, writes to heap memory occur in real time, but they are also closely
tracked so that they can be undone if necessary. We also wrap some low-level APIs like
`malloc` and `free` to make them transactionally safe. Function pointers to “open” functions
are dynamically rerouted to their “closed” equivalent so that we don’t accidentally escape
from “closed” code into its “open” form. Finally, we maintain a list of deferred tasks
to perform at the conclusion of the transaction.
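
As a rough mental model of the write tracking described above, the sketch below saves the original bytes before each tracked write and restores them, newest first, if the transaction is undone. `FToyWriteLog` and `TrackedStore` are hypothetical names; this is a simplified illustration and not the data structure the AutoRTFM runtime actually uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Toy model only: FToyWriteLog is a hypothetical name, not an AutoRTFM type.
class FToyWriteLog
{
public:
    // Called before a tracked write: remember the bytes that are about to change.
    void RecordWrite(void* Address, std::size_t Size)
    {
        FEntry Entry;
        Entry.Address = static_cast<unsigned char*>(Address);
        Entry.Original.assign(Entry.Address, Entry.Address + Size);
        Entries.push_back(Entry);
    }

    // Abort: restore saved bytes, newest entry first, so the oldest value wins.
    void Undo()
    {
        for (auto It = Entries.rbegin(); It != Entries.rend(); ++It)
        {
            std::memcpy(It->Address, It->Original.data(), It->Original.size());
        }
        Entries.clear();
    }

    // Commit: the writes already happened in place, so just drop the log.
    void Discard() { Entries.clear(); }

private:
    struct FEntry
    {
        unsigned char* Address = nullptr;
        std::vector<unsigned char> Original;
    };
    std::vector<FEntry> Entries;
};

// Stand-in for what instrumentation might emit around a store in "closed" code.
template <typename T>
void TrackedStore(FToyWriteLog& Log, T& Target, const T& Value)
{
    Log.RecordWrite(&Target, sizeof(T));
    Target = Value;
}

// Writes twice to the same int, then aborts; returns the value after rollback.
int DemoAbortRestoresOriginal()
{
    FToyWriteLog Log;
    int Health = 100;
    TrackedStore(Log, Health, 75);
    TrackedStore(Log, Health, 50);
    Log.Undo();
    return Health; // 100
}

// Writes once, then commits; returns the value that stays in place.
int DemoCommitKeepsWrite()
{
    FToyWriteLog Log;
    int Health = 100;
    TrackedStore(Log, Health, 75);
    Log.Discard();
    return Health; // 75
}
```

Note that the write lands in memory immediately in both demos; only the saved copy decides whether it survives.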

## Committing

If the transaction reaches the end of its `Transact` block without being aborted, it has
succeeded and will be committed. To do this, we execute our list of `CommitTasks` which
are responsible for handling all deferred operations. For instance, we always defer `free(MyData)`
to commit time, via the `CommitTasks` list, instead of calling `free` immediately; this
allows us to resuscitate the block of heap memory if the transaction is aborted. Once
the commit tasks are complete, we return to the “open” state and discard any heap tracking
information. Users can also defer work to the end of a transaction by calling `AutoRTFM::OnCommit`
to add their own callbacks to the commit task list. Commit tasks are run in first-in,
first-out order.
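
The deferred-free idea and the first-in, first-out ordering can be sketched with a small task queue. `FToyCommitQueue`, `DeferredFree`, and the demo functions are hypothetical names; the real `AutoRTFM::OnCommit` signature is not shown in this guide, so the `OnCommit` method here is only an assumption about its general shape.

```cpp
#include <cassert>
#include <cstdlib>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Toy model only: hypothetical names, not the AutoRTFM runtime.
class FToyCommitQueue
{
public:
    // Similar in spirit to AutoRTFM::OnCommit: defer work until the commit.
    void OnCommit(std::function<void()> Task) { CommitTasks.push_back(std::move(Task)); }

    // A transactional "free": the block stays alive until the commit.
    void DeferredFree(void* Ptr)
    {
        OnCommit([Ptr] { std::free(Ptr); });
    }

    // Commit: run the deferred tasks in first-in, first-out order.
    void Commit()
    {
        for (std::function<void()>& Task : CommitTasks) { Task(); }
        CommitTasks.clear();
    }

    // Abort: deferred work is dropped unrun, so "freed" blocks are still valid.
    void Abort() { CommitTasks.clear(); }

private:
    std::vector<std::function<void()>> CommitTasks;
};

// Registers three tasks and commits; returns the order in which they ran.
std::string DemoCommitOrder()
{
    FToyCommitQueue Queue;
    std::string Order;
    Queue.OnCommit([&Order] { Order += 'A'; });
    Queue.OnCommit([&Order] { Order += 'B'; });
    Queue.OnCommit([&Order] { Order += 'C'; });
    Queue.Commit();
    return Order; // "ABC"
}

// "Frees" a block inside the transaction, then aborts; the block is still readable.
int DemoFreeSurvivesAbort()
{
    FToyCommitQueue Queue;
    int* Data = static_cast<int*>(std::malloc(sizeof(int)));
    if (Data == nullptr) { return -1; }
    *Data = 42;
    Queue.DeferredFree(Data); // nothing is released yet
    Queue.Abort();            // rolled back: the block was never freed
    const int Value = *Data;
    std::free(Data);          // clean up for the demo
    return Value; // 42
}
```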

## Aborting

Conversely, if `AutoRTFM::AbortTransaction` is called anywhere within a `Transact` block,
the transaction is considered to be aborted and must be rolled back. In other words, AutoRTFM
is now responsible for undoing all changes made within the transaction; we must teleport
execution straight to the end of the `Transact` block, while maintaining the illusion
that nothing at all has changed. Our heap-write tracking data is used to undo all transactional
changes to the heap, and we additionally have a list of `AbortTasks` to execute. As you
might expect, these `AbortTasks` are the counterpart of the `CommitTasks`. For instance,
we need to ensure that memory isn’t just leaked if the user calls `malloc(Size)` and then
aborts the transaction. This is handled by generating an `AbortTask` inside of `malloc`
which frees the allocated data; in the event of an abort, this saves us from a leak. Users
can also call `OnAbort` to add work to the abort task list. Abort tasks are run in last-in,
first-out order.
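
The abort path can be sketched the same way: tasks run newest first, and an allocation registers its own cleanup. `FToyAbortList`, `Allocate`, and `NumFreedOnAbort` are hypothetical names used only for this illustration, and the `OnAbort` method is an assumed shape rather than the real API signature.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Toy model only: hypothetical names, not the AutoRTFM runtime.
class FToyAbortList
{
public:
    // Similar in spirit to OnAbort: register work to run only if we roll back.
    void OnAbort(std::function<void()> Task) { AbortTasks.push_back(std::move(Task)); }

    // Transactional "malloc": allocate now, and schedule a free in case of abort.
    void* Allocate(std::size_t Size)
    {
        void* Ptr = std::malloc(Size);
        OnAbort([this, Ptr] { std::free(Ptr); ++NumFreedOnAbort; });
        return Ptr;
    }

    // Abort: run tasks newest first, so later work is undone before earlier work.
    void Abort()
    {
        for (auto It = AbortTasks.rbegin(); It != AbortTasks.rend(); ++It) { (*It)(); }
        AbortTasks.clear();
    }

    // Commit: the allocations are kept, so the abort tasks are dropped unrun.
    void Commit() { AbortTasks.clear(); }

    int NumFreedOnAbort = 0;

private:
    std::vector<std::function<void()>> AbortTasks;
};

// Registers three tasks and aborts; returns the order in which they ran.
std::string DemoAbortOrder()
{
    FToyAbortList List;
    std::string Order;
    List.OnAbort([&Order] { Order += 'A'; });
    List.OnAbort([&Order] { Order += 'B'; });
    List.OnAbort([&Order] { Order += 'C'; });
    List.Abort();
    return Order; // "CBA"
}

// Allocates inside the transaction and aborts; the abort task frees the block.
int DemoAllocationFreedOnAbort()
{
    FToyAbortList List;
    List.Allocate(64);
    List.Abort();
    return List.NumFreedOnAbort; // 1
}

// Allocates inside the transaction and commits; the caller now owns the block.
int DemoAllocationKeptOnCommit()
{
    FToyAbortList List;
    void* Ptr = List.Allocate(64);
    List.Commit();
    std::free(Ptr); // clean up for the demo
    return List.NumFreedOnAbort; // 0
}
```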

## Hazards

Some APIs are inherently multi-threaded; these are considered hazards in AutoRTFM, since
our transactional model only has a single-threaded view of the world. These APIs intentionally
trigger a language failure and will need to be manually fixed up by a programmer. For
instance, `std::atomic<>` or `FThreadSafeCounter` are considered unsafe because they are
designed to communicate state across threads; it would be possible to automatically roll
back an atomic increment with an atomic decrement, but in general this wouldn’t be sufficient
to guarantee correct behavior of the entire program. The AutoRTFM team has designed transactionally
safe primitives like critical sections (`FTransactionallySafeCriticalSection`) and mutexes
(`FTransactionallySafeMutex`), but these must be manually replaced in the code because
they are a little more complex, and take more memory, than a plain `FCriticalSection`
or `FMutex`.

## Resource Locking

The approach for resource locking within a transaction is that, once a lock is taken or
a resource is acquired, that lock or resource is always held until the end of the transaction.
That is, calling `FTransactionallySafeCriticalSection::Unlock` within a transaction will
not immediately release the lock; instead, it enqueues a `CommitTask` entry responsible
for unlocking the resource. Because the AutoRTFM thread maintains its lock for the duration
of the transaction, it is free to re-acquire or re-lock the resource within the same transaction
at will; this is important, because otherwise, we would quickly deadlock.

Holding resource locks is an important principle, because it prevents the transactional
state from being exposed to other threads prematurely. In other words, transactional changes
shouldn’t be visible to other threads while the transaction is still in flight. This wouldn’t
be safe, because the transaction is still in a provisional state and might be rolled back—we
don’t want other threads to see or act on those changes until the transaction is fully
committed. (For that matter, immediate unlocks would also make rollback *itself* a threading
hazard, since the heap rollback code doesn’t know about critical sections or mutexes,
and won’t take any locks while it is undoing heap changes.)
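
The hold-until-commit behavior can be illustrated with a deliberately simplified, single-threaded model. `FToyTransaction` and `FToyDeferredLock` are hypothetical names; no real OS lock is involved, and the sketch only models the commit path, not what `FTransactionallySafeCriticalSection` actually does internally.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Toy model only: hypothetical names, single-threaded, no real OS lock involved.
struct FToyTransaction
{
    std::vector<std::function<void()>> CommitTasks;

    // Commit: run the deferred tasks in first-in, first-out order.
    void Commit()
    {
        for (std::function<void()>& Task : CommitTasks) { Task(); }
        CommitTasks.clear();
    }
};

class FToyDeferredLock
{
public:
    // Re-locking inside the same transaction succeeds, because the
    // transaction never actually let go of the lock.
    void Lock() { ++HoldCount; }

    // Unlock releases nothing now; it defers the release to commit time.
    void Unlock(FToyTransaction& Transaction)
    {
        Transaction.CommitTasks.push_back([this] { --HoldCount; });
    }

    bool IsHeld() const { return HoldCount > 0; }

private:
    int HoldCount = 0;
};

// Records whether the lock is held ('H') or free ('F') after each step.
std::string DemoDeferredUnlock()
{
    FToyTransaction Transaction;
    FToyDeferredLock Lock;
    std::string Trace;
    auto Note = [&] { Trace += Lock.IsHeld() ? 'H' : 'F'; };

    Lock.Lock();
    Note();                   // held
    Lock.Unlock(Transaction);
    Note();                   // still held: the release is only queued
    Lock.Lock();              // re-locking mid-transaction is fine
    Lock.Unlock(Transaction);
    Note();                   // still held
    Transaction.Commit();
    Note();                   // released once the commit tasks have run
    return Trace; // "HHHF"
}
```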

## Nesting

It is legal to nest a sub-transaction inside of a transaction, and it is safe to abort
the sub-transaction within the outer transaction. Aborting an outer transaction will roll
back all the work performed in its sub-transactions as well, even if those sub-transactions
succeeded and were committed. In general, aborting a transaction should always roll back
*everything* that happened inside of that transaction, including any inner
sub-transactions.
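
One way to get these nesting semantics, shown here purely as an assumption for illustration, is for a committed sub-transaction to hand its undo records to its parent instead of discarding them. `FToyTransactionStack` and its methods are hypothetical names, not AutoRTFM types.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Toy model only: hypothetical names, not the AutoRTFM runtime.
class FToyTransactionStack
{
public:
    // Start a new (possibly nested) transaction with its own undo log.
    void Begin() { Logs.emplace_back(); }

    // Tracked write: save the old value in the innermost log, then store.
    void Write(int& Target, int Value)
    {
        Logs.back().push_back(FSavedValue{ &Target, Target });
        Target = Value;
    }

    // Commit the innermost transaction. If it has a parent, the commit is only
    // provisional, so its undo records are appended to the parent's log.
    void Commit()
    {
        std::vector<FSavedValue> Finished = std::move(Logs.back());
        Logs.pop_back();
        if (!Logs.empty())
        {
            std::vector<FSavedValue>& Parent = Logs.back();
            Parent.insert(Parent.end(), Finished.begin(), Finished.end());
        }
    }

    // Abort the innermost transaction: restore its saved values, newest first.
    void Abort()
    {
        const std::vector<FSavedValue>& Log = Logs.back();
        for (auto It = Log.rbegin(); It != Log.rend(); ++It) { *It->Address = It->OldValue; }
        Logs.pop_back();
    }

private:
    struct FSavedValue { int* Address; int OldValue; };
    std::vector<std::vector<FSavedValue>> Logs;
};

// Inner transaction aborts, outer commits: only the inner write is undone.
int DemoInnerAbortOuterCommit()
{
    int A = 1, B = 10;
    FToyTransactionStack Stack;
    Stack.Begin();
    Stack.Write(A, 2);
    Stack.Begin();
    Stack.Write(B, 20);
    Stack.Abort();   // inner
    Stack.Commit();  // outer
    return A * 100 + B; // 210
}

// Inner transaction commits, outer aborts: both writes are undone.
int DemoInnerCommitOuterAbort()
{
    int A = 1, B = 10;
    FToyTransactionStack Stack;
    Stack.Begin();
    Stack.Write(A, 2);
    Stack.Begin();
    Stack.Write(B, 20);
    Stack.Commit();  // inner
    Stack.Abort();   // outer
    return A * 100 + B; // 110
}
```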

## Missing Closed Code

In some cases, we don’t have the “closed” version of a function at all. For instance,
Unreal Engine includes some libraries like [Oodle](https://dev.epicgames.com/documentation/en-us/unreal-engine/using-oodle-in-unreal-engine)
and [EOS](https://dev.epicgames.com/documentation/en-us/unreal-engine/online-subsystem-eos-plugin-in-unreal-engine)
as precompiled binaries; we also compile some code with specialized compilers like [Intel
ISPC](https://ispc.github.io/ispc.html) that don’t support instrumentation. In these cases,
we are required to manually go into the “open” state via `UE_AUTORTFM_OPEN` or `AutoRTFM::Open`
before calling these functions. However, for an abort to properly undo these changes to
the heap, callers are required to *manually* inform the AutoRTFM runtime about any heap
memory that the code will write to—crucially, before the write occurs!—via
`AutoRTFM::RecordOpenWrite`.

Obviously, this extra work is complex and undesirable. If calling an open function from
closed code cannot be avoided, consider wrapping the work in a helper function. If it
is feasible, design the code so that heap writes do not need to be undone at all—e.g.,
only write to a short-lived scratch buffer that will be destroyed before the transaction
is complete, or maintain a persistent scratch buffer where the contents are always assumed
to be clobbered across transactions.
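
The “record before you write” rule can be illustrated with a toy log. The method below borrows the name `RecordOpenWrite` only to show the role it plays; its signature here is an assumption, and `FToyWriteLog` and `UninstrumentedFill` are hypothetical stand-ins for the runtime and for a precompiled routine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Toy model only: hypothetical names, not the AutoRTFM runtime.
class FToyWriteLog
{
public:
    // Stand-in for the role of AutoRTFM::RecordOpenWrite (assumed signature):
    // snapshot the bytes before the open code overwrites them.
    void RecordOpenWrite(void* Address, std::size_t Size)
    {
        const unsigned char* Bytes = static_cast<const unsigned char*>(Address);
        Entries.push_back(FEntry{ Address, std::vector<unsigned char>(Bytes, Bytes + Size) });
    }

    // Abort: restore every recorded region, newest first.
    void Undo()
    {
        for (auto It = Entries.rbegin(); It != Entries.rend(); ++It)
        {
            std::memcpy(It->Address, It->Original.data(), It->Original.size());
        }
        Entries.clear();
    }

private:
    struct FEntry
    {
        void* Address;
        std::vector<unsigned char> Original;
    };
    std::vector<FEntry> Entries;
};

// Stand-in for a precompiled routine that has no "closed" version.
void UninstrumentedFill(unsigned char* Dest, std::size_t Size, unsigned char Value)
{
    std::memset(Dest, Value, Size);
}

// Records the region first, then writes in the "open": the abort restores it.
bool DemoRecordedWriteIsUndone()
{
    unsigned char Buffer[4] = { 1, 2, 3, 4 };
    FToyWriteLog Log;
    Log.RecordOpenWrite(Buffer, sizeof(Buffer));
    UninstrumentedFill(Buffer, sizeof(Buffer), 0xFF);
    Log.Undo();
    return Buffer[0] == 1 && Buffer[3] == 4;
}

// Skips the record step: the abort has nothing to restore, so the write leaks out.
bool DemoUnrecordedWriteSurvivesAbort()
{
    unsigned char Buffer[4] = { 1, 2, 3, 4 };
    FToyWriteLog Log;
    UninstrumentedFill(Buffer, sizeof(Buffer), 0xFF);
    Log.Undo();
    return Buffer[0] == 0xFF;
}
```

The second demo is the failure mode: nothing crashes, but the rollback silently leaves the modified bytes in place.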

## Mixed Open and Closed Code

When trying to resolve AutoRTFM incompatibilities, it is tempting to consider invoking
large, complex functions in the “open” (thereby dodging all hazards) and then registering
an `OnAbort` handler to undo their effects. However, this is dangerous and should be avoided.
This approach can lead to new, subtle transactional hazards:

- Locks will be released before the transaction is finished. This exposes transactional
  changes to other threads while the transaction is still in a provisional state. Also,
  if a rollback occurs, this becomes a second thread hazard, because locks won’t be taken
  at all during the rollback.
- An “open” block cannot free or reallocate memory that was allocated in a “closed” block.
  (Reason: When allocating, the closed code will also have created an abort handler to
  free its memory. That handler will still run if the transaction is aborted, but the
  pointer passed to `free` will already have been deallocated; in short, it will lead to a
  double-free.)
- The above-mentioned reallocation hazard also extends to classes which implicitly reallocate
  memory on your behalf. In particular, innocuous methods like `TArray::Add()` can become
  very dangerous when both open and closed code add items to the same array.
- Passing ownership of non-trivial objects across the open-closed boundary can lead to
  memory handling errors. For instance, if you have an `FString` that was created in “closed”
  code, and assign a new value to it from an “open” block, this is unsafe (discussed in
  depth [here](https://jira.it.epicgames.com/browse/SOL-6991)). We have a special mechanism
  which is designed to allow returning a non-trivial object from `AutoRTFM::Open` into closed
  code by making a copy; this must be handled on a type-by-type basis.
- Altering the same bytes of memory from both “open” and “closed” code in the same transaction
  can lead to hard-to-diagnose silent rollback failures.
  (Reason: heap changes made in the open are not tracked and can’t be undone automatically,
  as you probably know. But if we subsequently make a write to the same range of heap memory
  from “closed” code, the instrumentation will log the write so that it can be undone.
  Unfortunately, we already changed the data, so the true “original” value is already gone, and we
  log the already-changed value. If an abort occurs, this behavior is sometimes harmless, and
  other times very wrong.)

These hazards can be extraordinarily difficult to debug once they have occurred.
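
The last hazard in the list, where “open” and “closed” code write to the same bytes, can be reproduced in a toy model. `FToyIntLog`, `ClosedStore`, and `OpenStore` are hypothetical names; the sketch only assumes that a closed write logs whatever value is in memory at the time of the write.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Toy model only: hypothetical names, not the AutoRTFM runtime.
class FToyIntLog
{
public:
    // "Closed" store: log the value currently in memory, then write.
    void ClosedStore(int& Target, int Value)
    {
        Saved.push_back(std::make_pair(&Target, Target));
        Target = Value;
    }

    // Abort: restore the logged values, newest first.
    void Undo()
    {
        for (auto It = Saved.rbegin(); It != Saved.rend(); ++It) { *It->first = It->second; }
        Saved.clear();
    }

private:
    std::vector<std::pair<int*, int>> Saved;
};

// "Open" store: an untracked write that the log never sees.
void OpenStore(int& Target, int Value) { Target = Value; }

// Only closed writes: the abort restores the true original value.
int DemoClosedOnly()
{
    int Score = 1;
    FToyIntLog Log;
    Log.ClosedStore(Score, 3);
    Log.Undo();
    return Score; // 1
}

// Open write followed by a closed write to the same int: the log captures the
// already-changed value, so the abort "restores" 2 instead of the original 1.
int DemoOpenThenClosed()
{
    int Score = 1;
    FToyIntLog Log;
    OpenStore(Score, 2);
    Log.ClosedStore(Score, 3);
    Log.Undo();
    return Score; // 2
}
```

Nothing in the second demo fails loudly, which is why this class of bug tends to surface far from its cause.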

To help identify memory that is written in the closed and then, within the same transaction,
written again in the open, you can enable a memory validator with the
`AutoRTFMMemoryValidationLevel` flag (see below).

## Command Line Flags

AutoRTFM can be controlled with the following `dpcvars`. Multiple settings can be combined
in a single `-dpcvars=` argument by separating them with commas.

NOTE: Be aware that these settings are ignored in Shipping builds! A Test build should
be used to run the memory validator.

| AutoRTFM mode  | Server Flags                                                                        |
| :------------- | :---------------------------------------------------------------------------------- |
| Disabled       | `-dpcvars=AutoRTFMRuntimeEnabled=off`      *or* `-dpcvars=AutoRTFMRuntimeEnabled=0` |
| Enabled        | `-dpcvars=AutoRTFMRuntimeEnabled=on`       *or* `-dpcvars=AutoRTFMRuntimeEnabled=1` |
| Force-disabled | `-dpcvars=AutoRTFMRuntimeEnabled=forceoff` *or* `-dpcvars=AutoRTFMRuntimeEnabled=2` |
| Force-enabled  | `-dpcvars=AutoRTFMRuntimeEnabled=forceon`  *or* `-dpcvars=AutoRTFMRuntimeEnabled=3` |

| Retry Validation Mode | Server Flags                           |
| :-------------------- | :------------------------------------- |
| Disable               | `-dpcvars=AutoRTFMRetryTransactions=0` |
| Retry non-nested      | `-dpcvars=AutoRTFMRetryTransactions=1` |
| Retry nested too      | `-dpcvars=AutoRTFMRetryTransactions=2` |

| Memory Validation Mode | Server Flags                                                                                                 |
| :--------------------- | :----------------------------------------------------------------------------------------------------------- |
| Disable                | `-dpcvars=AutoRTFMRuntimeEnabled=1` *or* `-dpcvars=AutoRTFMRuntimeEnabled=1,AutoRTFMMemoryValidationLevel=1` |
| Warn and continue      | `-dpcvars=AutoRTFMRuntimeEnabled=1,AutoRTFMMemoryValidationLevel=2`                                          |
| Hard error             | `-dpcvars=AutoRTFMRuntimeEnabled=1,AutoRTFMMemoryValidationLevel=3`                                          |

| AutoRTFM Enable Probability    | Server Flags                               |
| :----------------------------- | :----------------------------------------- |
| 0.1% chance to enable AutoRTFM | `-dpcvars=AutoRTFMEnabledProbability=0.1`  |
| 5% chance to enable AutoRTFM   | `-dpcvars=AutoRTFMEnabledProbability=5.0`  |
| 50% chance to enable AutoRTFM  | `-dpcvars=AutoRTFMEnabledProbability=50.0` |