# AutoRTFM User Guide

**Author**: [John Stiles](mailto:john.stiles@epicgames.com)
**Team**: Verse - AutoRTFM

# Context

A core Verse language feature is the ability to “roll back” changes to the state of execution when
failures are encountered, making it appear as if nothing has changed. Verse code is allowed to wrap
almost any operation within a failure scope, and runtime failures can occur at any point, so very
little in Verse is truly exempt from rollback. dSTM will also use this ability to abort and retry a
transaction whenever a remote `TObjectPtr` is dereferenced.

Before AutoRTFM was enabled, rollback was handled using explicit callbacks in
[Verse::Stm](https://docs.google.com/document/d/1gQlVTJigO4AdpljegaRe9hDGIeaLfqgJLq1p6z8QBNE/edit?tab=t.0),
and this approach is still used on the client. Servers, however, now use AutoRTFM, which supports
most forms of rollback automatically, without any programmer involvement. That said, AutoRTFM has
limitations around threading and resource sharing which sometimes require very careful thought.

# AutoRTFM Rollback

## Live in Production

The goal of AutoRTFM is to automatically implement rollback for single-threaded code. The basics of
this approach are covered in Phil Pizlo’s blog post,
[Bringing Verse Transactional Memory Semantics to C++](https://www.unrealengine.com/en-US/tech-blog/bringing-verse-transactional-memory-semantics-to-c).
Note that this post discusses AutoRTFM as being compiled into the binary as of Release 28.10
(January 2024), and that’s technically correct, but it wasn’t actually *enabled* on live servers at
that time. AutoRTFM was disabled shortly after launch due to server stability issues.

After the initial launch attempt, the team regrouped and devised a more robust launch plan. AutoRTFM
was then successfully enabled for a small percentage of VkPlay games on public servers in Release
34.10 (March 2025). As of Release 36.20 (July 2025), AutoRTFM is fully enabled for all VkPlay and
VkEdit sessions. In Release 38.00 (November 2025), we plan to deprecate the ability to disable
AutoRTFM in production. We are no longer testing Valkyrie in the AutoRTFM-off path, and features
like the new VM rely on AutoRTFM to function.

When AutoRTFM is enabled, almost all game logic on the main thread automatically supports rollback
without manual intervention. Unfortunately, some things have side effects that simply can’t be
undone automatically (for instance, you can’t un-send a network packet after it’s been sent!), so
more complex or multi-threaded code will need to be aware of the implementation details of AutoRTFM.
Fortunately, most of the time, your existing code will work as-is, and no additional work will be
necessary.

## Open and Closed Code

AutoRTFM is designed around an Epic-internal fork of the Clang compiler. This version
of Clang is responsible for adding transactional instrumentation to our code. For performance
reasons, though, we don’t always run with this instrumentation enabled; instead, AutoRTFM
is designed to compile all of our code twice. One version is the typical, uninstrumented
form that you would expect from a normal Clang build, more or less; by convention, we refer
to this form as “open.” The other version is compiled with our extra transactional logic
mixed in; our convention is to call this instrumented form “closed.” Open and closed versions
of almost all functions coexist in our binaries.

Programs start off in the “open” state; in situations where rollback is needed, we transition
into the “closed” state via `AutoRTFM::Transact`. This function is responsible for teleporting
us to the matching, “closed” form of the currently-running code.

## Instrumentation

In transactional mode, writes to heap memory occur in real-time, but they are also closely
tracked so that they can be undone if necessary. We also wrap some low-level APIs like
`malloc` and `free` to make them transactionally safe. Function pointers to “open” functions
are dynamically rerouted to their “closed” equivalent so that we don’t accidentally escape
from “closed” code into its “open” form. Finally, we maintain a list of deferred tasks
to perform at the conclusion of the transaction.

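To make the idea of tracked heap writes concrete, here is a minimal, self-contained sketch of a write log. It is a toy model, not the AutoRTFM runtime: `FToyWriteLog`, `TrackedStore`, and `DemoWriteLog` are hypothetical names invented for this illustration, and the real instrumentation is emitted by the compiler rather than called by hand.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <utility>
#include <vector>

// Toy model only: NOT the AutoRTFM runtime. Every name here is hypothetical.
class FToyWriteLog
{
    struct FEntry
    {
        unsigned char* Address = nullptr;
        std::vector<unsigned char> Original;
    };

public:
    // Save the bytes currently stored at Ptr so they can be restored later.
    // This must be called BEFORE the write happens.
    void RecordWrite(void* Ptr, std::size_t Size)
    {
        assert(Ptr != nullptr);
        FEntry Entry;
        Entry.Address = static_cast<unsigned char*>(Ptr);
        Entry.Original.assign(Entry.Address, Entry.Address + Size);
        Entries.push_back(std::move(Entry));
    }

    // Abort path: restore saved bytes, newest entry first, so that repeated
    // writes to the same address end up at the oldest (pre-transaction) value.
    void Undo()
    {
        for (auto It = Entries.rbegin(); It != Entries.rend(); ++It)
        {
            std::memcpy(It->Address, It->Original.data(), It->Original.size());
        }
        Entries.clear();
    }

    // Commit path: the writes are already in place, so just drop the log.
    void Discard() { Entries.clear(); }

private:
    std::vector<FEntry> Entries;
};

// Stand-in for what instrumented ("closed") code does on every store.
template <typename T>
void TrackedStore(FToyWriteLog& Log, T& Target, const T& Value)
{
    static_assert(std::is_trivially_copyable<T>::value, "toy log copies raw bytes");
    Log.RecordWrite(&Target, sizeof(T));
    Target = Value; // the write itself happens immediately
}

int DemoWriteLog()
{
    int Health = 100;
    FToyWriteLog Log;
    TrackedStore(Log, Health, 75);
    TrackedStore(Log, Health, 50);
    assert(Health == 50); // writes are visible right away
    Log.Undo();           // simulate an abort
    return Health;        // back to the pre-transaction value
}
```

The point of the sketch is the ordering: the old bytes are captured before each store, and the log is replayed newest-first on abort.
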
## Committing

If the transaction reaches the end of its `Transact` block without being aborted, it has
succeeded and will be committed. To do this, we execute our list of `CommitTasks`, which
are responsible for handling all deferred operations. For instance, we always defer `free(MyData)`
to commit time, via the `CommitTasks` list, instead of calling `free` immediately; this
allows us to resuscitate the block of heap memory if the transaction is aborted. Once
the commit tasks are complete, we return to the “open” state and discard any heap tracking
information. Users can also defer work to the end of a transaction by calling `AutoRTFM::OnCommit`
to add their own callbacks to the commit task list. Commit tasks are run in first-in,
first-out order.

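The commit behavior described above can be sketched with a small stand-in for the commit task list. This is a toy model under our own assumptions, not the real API: `FToyCommitList` and the two `Demo` functions are hypothetical, and the real `AutoRTFM::OnCommit` is not called here.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Toy model only: NOT the AutoRTFM API. Every name here is hypothetical.
class FToyCommitList
{
public:
    void OnCommit(std::function<void()> Task)
    {
        assert(Task); // reject empty callbacks
        Tasks.push_back(std::move(Task));
    }

    // Run deferred work in first-in, first-out order.
    void Commit()
    {
        for (auto& Task : Tasks)
        {
            Task();
        }
        Tasks.clear();
    }

    // On abort, deferred work is simply dropped and never runs.
    void Abort() { Tasks.clear(); }

private:
    std::vector<std::function<void()>> Tasks;
};

std::string DemoCommitOrder()
{
    std::string Trace;
    FToyCommitList Transaction;
    Transaction.OnCommit([&Trace] { Trace += "A"; });
    Transaction.OnCommit([&Trace] { Trace += "B"; });
    Trace += "-"; // the transaction body is still running; nothing deferred has run
    Transaction.Commit();
    return Trace;
}

bool DemoDeferredFree()
{
    bool bFreed = false;
    FToyCommitList Transaction;
    int* Data = new int(42);

    // Instead of freeing immediately, queue the free for commit time.
    Transaction.OnCommit([Data, &bFreed] { delete Data; bFreed = true; });

    // The memory is still valid inside the transaction, so an abort
    // could keep using it as if the free had never been requested.
    const bool bAliveBeforeCommit = !bFreed && *Data == 42;
    Transaction.Commit();
    return bAliveBeforeCommit && bFreed;
}
```

In this model, `DemoCommitOrder` shows the first-in, first-out ordering, and `DemoDeferredFree` shows why deferring the free keeps the block recoverable until the outcome of the transaction is known.
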
## Aborting

Conversely, if `AutoRTFM::AbortTransaction` is called anywhere within a `Transact` block,
the transaction is considered to be aborted and must be rolled back. In other words, AutoRTFM
is now responsible for undoing all changes made within the transaction; we must teleport
execution straight to the end of the `Transact` block, while maintaining the illusion
that nothing at all has changed. Our heap-write tracking data is used to undo all transactional
changes to the heap, and we additionally have a list of `AbortTasks` to execute. As you
might expect, these `AbortTasks` are the parallel opposite of the `CommitTasks`. For instance,
we need to ensure that memory isn’t just leaked if the user calls `malloc(Size)` and then
aborts the transaction. This is handled by generating an `AbortTask` inside of `malloc`
which frees the allocated data; in the event of an abort, this saves us from a leak. Users
can also call `OnAbort` to add work to the abort task list. Abort tasks are run in first-in,
last-out order (that is, the most recently added task runs first).

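As a rough illustration of abort tasks, the sketch below models the abort list and a `malloc` wrapper that registers its own cleanup. It is a toy under stated assumptions: `FToyAbortList`, `ToyMalloc`, and the `LiveAllocations` counter are hypothetical and exist only so the behavior can be observed; the real wrappers live inside the AutoRTFM runtime.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Toy model only: NOT the AutoRTFM API. Every name here is hypothetical.
class FToyAbortList
{
public:
    void OnAbort(std::function<void()> Task)
    {
        assert(Task); // reject empty callbacks
        Tasks.push_back(std::move(Task));
    }

    // Run abort tasks newest-first (first-in, last-out).
    void Abort()
    {
        while (!Tasks.empty())
        {
            std::function<void()> Task = std::move(Tasks.back());
            Tasks.pop_back();
            Task();
        }
    }

    // On commit, the abort tasks are no longer needed.
    void Commit() { Tasks.clear(); }

private:
    std::vector<std::function<void()>> Tasks;
};

// Allocation wrapper: the matching free is registered as an abort task, so an
// aborted transaction does not leak. LiveAllocations is only for the demo.
void* ToyMalloc(FToyAbortList& Transaction, std::size_t Size, int& LiveAllocations)
{
    void* Ptr = std::malloc(Size);
    ++LiveAllocations;
    Transaction.OnAbort([Ptr, &LiveAllocations] {
        std::free(Ptr);
        --LiveAllocations;
    });
    return Ptr;
}

std::string DemoAbortOrder()
{
    std::string Trace;
    FToyAbortList Transaction;
    Transaction.OnAbort([&Trace] { Trace += "A"; });
    Transaction.OnAbort([&Trace] { Trace += "B"; });
    Transaction.Abort();
    return Trace;
}

int DemoAbortedAllocation()
{
    int LiveAllocations = 0;
    FToyAbortList Transaction;
    void* Scratch = ToyMalloc(Transaction, 64, LiveAllocations);
    (void)Scratch;
    assert(LiveAllocations == 1);
    Transaction.Abort(); // the abort task frees the block
    return LiveAllocations;
}
```

Running tasks newest-first mirrors how the changes were made: the last thing done is the first thing undone.
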
## Hazards

Some APIs are inherently multi-threaded; these are considered hazards in AutoRTFM, since
our transactional model only has a single-threaded view of the world. These APIs intentionally
trigger a language failure and will need to be manually fixed up by a programmer. For
instance, `std::atomic<>` and `FThreadSafeCounter` are considered unsafe because they are
designed to communicate state across threads; it would be possible to automatically roll
back an atomic increment with an atomic decrement, but in general this wouldn’t be sufficient
to guarantee correct behavior of the entire program. The AutoRTFM team has designed transactionally
safe primitives like critical sections (`FTransactionallySafeCriticalSection`) and mutexes
(`FTransactionallySafeMutex`), but these must be manually swapped into the code because
they are a little more complex, and take more memory, than a plain `FCriticalSection`
or `FMutex`.

## Resource Locking

The approach for resource locking within a transaction is that, once a lock is taken or
a resource is acquired, that lock or resource is always held until the end of the transaction.
That is, calling `FTransactionallySafeCriticalSection::Unlock` within a transaction will
not immediately release the lock; instead, it enqueues a `CommitTask` entry responsible
for unlocking the resource. Because the AutoRTFM thread maintains its lock for the duration
of the transaction, it is free to re-acquire or re-lock the resource within the same transaction
at will; this is important, because otherwise, we would quickly deadlock.

Holding resource locks is an important principle, because it prevents the transactional
state from being exposed to other threads prematurely. In other words, transactional changes
shouldn’t be visible to other threads while the transaction is still in flight. This wouldn’t
be safe, because the transaction is still in a provisional state and might be rolled back; we
don’t want other threads to see or act on those changes until the transaction is fully
committed. (For that matter, immediate unlocks would also make rollback *itself* a threading
hazard, since the heap rollback code doesn’t know about critical sections or mutexes,
and won’t take any locks while it is undoing heap changes.)

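The deferred-unlock rule can be sketched as follows. This is a single-threaded toy, not the implementation of `FTransactionallySafeCriticalSection`: `FToyTransaction` and `FToyTransactionallySafeLock` are hypothetical names, and the abort path (which would also have to release the lock eventually) is left out of scope.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Toy model only: NOT the AutoRTFM API. Every name here is hypothetical.
class FToyTransaction
{
public:
    void OnCommit(std::function<void()> Task) { CommitTasks.push_back(std::move(Task)); }

    void Commit()
    {
        for (auto& Task : CommitTasks)
        {
            Task();
        }
        CommitTasks.clear();
    }

private:
    std::vector<std::function<void()>> CommitTasks;
};

class FToyTransactionallySafeLock
{
public:
    void Lock()
    {
        // If the transaction already holds the lock, re-locking is a no-op
        // rather than a self-deadlock.
        if (!bHeldByTransaction)
        {
            Mutex.lock();
            bHeldByTransaction = true;
        }
    }

    void Unlock(FToyTransaction& Transaction)
    {
        // Do not release now; queue the real unlock (once) for commit time.
        if (!bUnlockQueued)
        {
            bUnlockQueued = true;
            Transaction.OnCommit([this] {
                Mutex.unlock();
                bHeldByTransaction = false;
                bUnlockQueued = false;
            });
        }
    }

    bool IsHeld() const { return bHeldByTransaction; }

private:
    std::mutex Mutex;
    bool bHeldByTransaction = false;
    bool bUnlockQueued = false;
};

bool DemoDeferredUnlock()
{
    FToyTransaction Transaction;
    FToyTransactionallySafeLock Lock;

    Lock.Lock();
    Lock.Unlock(Transaction);
    const bool bHeldAfterUnlock = Lock.IsHeld(); // still held: unlock was deferred

    Lock.Lock();              // re-lock inside the same transaction: no deadlock
    Lock.Unlock(Transaction);

    Transaction.Commit();     // the real unlock happens here
    return bHeldAfterUnlock && !Lock.IsHeld();
}
```

Other threads (not modeled here) would stay blocked on the mutex until `Commit` runs, which is what keeps provisional state hidden from them.
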
## Nesting

It is legal to nest a sub-transaction inside of a transaction, and it is safe to abort
the sub-transaction within the outer transaction. Aborting an outer transaction will roll
back all the work performed in the sub-transaction as well, even if that inner transaction
succeeded and was committed. In general, aborting a transaction should always roll back
*everything* that happened inside of that transaction, even if that includes an inner
sub-transaction.

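One way to picture these nesting rules is a toy in which a committed sub-transaction hands its undo log to its parent. This is only an illustrative assumption about how such behavior could be modeled, not a description of AutoRTFM internals; `FToyTransaction` and the `Demo` functions are hypothetical.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Toy model only: NOT the AutoRTFM runtime. Every name here is hypothetical.
class FToyTransaction
{
public:
    explicit FToyTransaction(FToyTransaction* InParent = nullptr) : Parent(InParent) {}

    // Tracked store: remember the old value, then write immediately.
    void Store(int& Target, int Value)
    {
        UndoLog.emplace_back(&Target, Target);
        Target = Value;
    }

    // A committed sub-transaction passes its undo entries to the parent, so
    // the parent can still roll that work back later.
    void Commit()
    {
        if (Parent != nullptr)
        {
            Parent->UndoLog.insert(Parent->UndoLog.end(), UndoLog.begin(), UndoLog.end());
        }
        UndoLog.clear();
    }

    // Restore old values, newest entry first.
    void Abort()
    {
        for (auto It = UndoLog.rbegin(); It != UndoLog.rend(); ++It)
        {
            *It->first = It->second;
        }
        UndoLog.clear();
    }

private:
    FToyTransaction* Parent;
    std::vector<std::pair<int*, int>> UndoLog;
};

// The inner transaction commits, then the outer one aborts: everything is undone.
int DemoOuterAbort()
{
    int Score = 1;
    FToyTransaction Outer;
    Outer.Store(Score, 2);
    FToyTransaction Inner(&Outer);
    Inner.Store(Score, 3);
    Inner.Commit();
    assert(Score == 3);
    Outer.Abort();
    return Score;
}

// The inner transaction aborts, then the outer one commits: only inner work is undone.
int DemoInnerAbort()
{
    int Score = 1;
    FToyTransaction Outer;
    Outer.Store(Score, 2);
    FToyTransaction Inner(&Outer);
    Inner.Store(Score, 3);
    Inner.Abort();
    assert(Score == 2);
    Outer.Commit();
    return Score;
}
```
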
## Missing Closed Code

In some cases, we don’t have the “closed” version of a function at all. For instance,
Unreal Engine includes some libraries like [Oodle](https://dev.epicgames.com/documentation/en-us/unreal-engine/using-oodle-in-unreal-engine)
and [EOS](https://dev.epicgames.com/documentation/en-us/unreal-engine/online-subsystem-eos-plugin-in-unreal-engine)
as precompiled binaries; we also compile some code with specialized compilers like
[Intel ISPC](https://ispc.github.io/ispc.html) that don’t support instrumentation. In these cases,
we are required to manually go into the “open” state via `UE_AUTORTFM_OPEN` or `AutoRTFM::Open`
before calling these functions. However, for an abort to properly undo these changes to
the heap, callers are required to *manually* inform the AutoRTFM runtime about any heap
memory that the code will write to (crucially, before the write occurs!) via
`AutoRTFM::RecordOpenWrite`.

Obviously, this extra work is complex and undesirable. If calling an open function from
closed code cannot be avoided, consider wrapping the work in a helper function. If it
is feasible, design the code so that heap writes do not need to be undone at all; for example,
only write to a short-lived scratch buffer that will be destroyed before the transaction
is complete, or maintain a persistent scratch buffer where the contents are always assumed
to be clobbered across transactions.

## Mixed Open and Closed Code

When trying to resolve AutoRTFM incompatibilities, it is tempting to consider invoking
large, complex functions in the “open” (thereby dodging all hazards) and then registering
an `OnAbort` handler to undo their effects. However, this is dangerous and should be avoided.
This approach can lead to new, subtle transactional hazards:

- Locks will be released before the transaction is finished. This exposes transactional
  changes to other threads while the transaction is still in a provisional state. Also,
  if a rollback occurs, this becomes a second thread hazard, because locks won’t be taken
  at all during the rollback.
- An “open” block cannot free or reallocate memory that was allocated in a “closed” block.
  (Reason: When allocating, the closed code will also have created an abort handler to
  free its memory. Those callbacks will still run if the transaction is aborted, but the
  pointer passed to `free` will already have been deallocated; in short, it will lead to a
  double-free.)
- The above-mentioned reallocation hazard also extends to classes which implicitly reallocate
  memory on your behalf. In particular, innocuous methods like `TArray::Add()` can become
  very dangerous when both open and closed code add items to the same array.
- Passing ownership of non-trivial objects across the open-closed boundary can lead to
  memory handling errors. For instance, if you have an `FString` that was created in “closed”
  code, and assign a new value to it from an “open” block, this is unsafe (discussed in
  depth [here](https://jira.it.epicgames.com/browse/SOL-6991)). We have a special mechanism
  which is designed to allow returning a non-trivial object from `AutoRTFM::Open` into closed
  code by making a copy; this must be handled on a type-by-type basis.
- Altering the same bytes of memory from both “open” and “closed” code in the same transaction
  can lead to hard-to-diagnose silent rollback failures.
  (Reason: heap changes made in the open are not tracked and can’t be undone automatically,
  as you probably know. But if we subsequently make a write to the same range of heap memory
  from “closed” code, the instrumentation will log the write so that it can be undone.
  Unfortunately, we already changed the data, so the true “original” value is already gone, and we
  log the already-changed value. If an abort occurs, this behavior is sometimes harmless, and
  other times very wrong.)

These hazards can be extraordinarily difficult to debug once they have occurred.

To help identify memory that is written in the closed and then written again in the open
within the same transaction, you can enable a memory validator with the
`AutoRTFMMemoryValidationLevel` flag (see below).

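The last hazard in the list above (an untracked “open” write followed by a tracked “closed” write to the same bytes) is easy to reproduce in a toy undo log. The sketch below is not the AutoRTFM runtime; `FToyUndoLog`, `Record`, and `ClosedStore` are hypothetical names, with `Record` standing in for the idea of telling the runtime about an open write before it happens.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Toy model only: NOT the AutoRTFM runtime. Every name here is hypothetical.
struct FToyUndoLog
{
    std::vector<std::pair<int*, int>> Entries;

    // Remember the current value of Target so it can be restored on rollback.
    void Record(int& Target) { Entries.emplace_back(&Target, Target); }

    // What instrumented ("closed") code does: log the old value, then write.
    void ClosedStore(int& Target, int Value)
    {
        Record(Target);
        Target = Value;
    }

    // Restore logged values, newest entry first.
    void Rollback()
    {
        for (auto It = Entries.rbegin(); It != Entries.rend(); ++It)
        {
            *It->first = It->second;
        }
        Entries.clear();
    }
};

// Hazard: the open write bypasses the log, so the closed write logs the
// already-changed value as if it were the original.
int DemoUntrackedOpenWrite()
{
    int Gold = 10;
    FToyUndoLog Log;
    Gold = 20;                 // "open" write: not tracked
    Log.ClosedStore(Gold, 30); // logs 20, not 10
    assert(Gold == 30);
    Log.Rollback();
    return Gold;               // the true original value is lost
}

// Recording the location BEFORE the open write preserves the original value.
int DemoRecordedOpenWrite()
{
    int Gold = 10;
    FToyUndoLog Log;
    Log.Record(Gold);          // tell the log about the write first
    Gold = 20;                 // "open" write
    Log.ClosedStore(Gold, 30);
    assert(Gold == 30);
    Log.Rollback();
    return Gold;
}
```

In the first function the rollback “succeeds” silently but leaves the wrong value behind, which is what makes this class of bug hard to spot.
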
## Command Line Flags

AutoRTFM can be controlled with the following `dpcvars`. Multiple settings can be combined
in a single `-dpcvars=` argument by separating them with commas.

NOTE: Be aware that these settings are ignored in Shipping builds! A Test build should
be used to run the memory validator.

| AutoRTFM mode  | Server Flags                                                                        |
| :------------- | :---------------------------------------------------------------------------------- |
| Disable        | `-dpcvars=AutoRTFMRuntimeEnabled=off` *or* `-dpcvars=AutoRTFMRuntimeEnabled=0`      |
| Enabled        | `-dpcvars=AutoRTFMRuntimeEnabled=on` *or* `-dpcvars=AutoRTFMRuntimeEnabled=1`       |
| Force-disabled | `-dpcvars=AutoRTFMRuntimeEnabled=forceoff` *or* `-dpcvars=AutoRTFMRuntimeEnabled=2` |
| Force-enabled  | `-dpcvars=AutoRTFMRuntimeEnabled=forceon` *or* `-dpcvars=AutoRTFMRuntimeEnabled=3`  |

| Retry Validation Mode | Server Flags                           |
| :-------------------- | :------------------------------------- |
| Disable               | `-dpcvars=AutoRTFMRetryTransactions=0` |
| Retry non-nested      | `-dpcvars=AutoRTFMRetryTransactions=1` |
| Retry nested too      | `-dpcvars=AutoRTFMRetryTransactions=2` |

| Memory Validation Mode | Server Flags                                                                                                 |
| :--------------------- | :----------------------------------------------------------------------------------------------------------- |
| Disable                | `-dpcvars=AutoRTFMRuntimeEnabled=1` *or* `-dpcvars=AutoRTFMRuntimeEnabled=1,AutoRTFMMemoryValidationLevel=1` |
| Warn and continue      | `-dpcvars=AutoRTFMRuntimeEnabled=1,AutoRTFMMemoryValidationLevel=2`                                          |
| Hard error             | `-dpcvars=AutoRTFMRuntimeEnabled=1,AutoRTFMMemoryValidationLevel=3`                                          |

| AutoRTFM Enable Probability    | Server Flags                               |
| :----------------------------- | :----------------------------------------- |
| 0.1% chance to enable AutoRTFM | `-dpcvars=AutoRTFMEnabledProbability=0.1`  |
| 5% chance to enable AutoRTFM   | `-dpcvars=AutoRTFMEnabledProbability=5.0`  |
| 50% chance to enable AutoRTFM  | `-dpcvars=AutoRTFMEnabledProbability=50.0` |