Emulation Performance

TEMU is at its core an instruction set simulator. The instruction set simulator is optimized in the following ways:

High performance interpreter with a pre-decode direct dispatch algorithm.
Binary translator for hot code segments
Idle and power-down mode detection
Parallelization of multi-core processors

Although performance is both host and target software dependent, in general the raw rating for the different instruction set simulation modes are:

Interpretation: Should run at around 150-200 MIPS.
Binary Translation: Should run at around 300-500 MIPS.
Idle: Runs at infinite speed, but simulation events will throttle the speed.

Performance Impact of Target Software

Target software has a major impact on performance. The following cases are known to have a negative impact on emulation performance:

Excessive I/O counts
- Including register polling loops
Round-tripping between binary translated code and interpreted code (traps, privilege mode changing instructions, floating point instructions on some targets).
Self-modifying code
- True self-modifying code (where code is actually written)
- False self-modifying code (where a page containing code is written, but the write goes to a location on the page without code).
MMU invalidations (as these flush the emulator ATCs, which triggers MMU checks). This can be especially bad when the target software is thrashing the MMU pages.
Undetectable idle mode.
Non-parallel idle mode, when running TEMU with multiple threads.

Many of the cases above, can be improved with better target software, in principle, better performing software (on the hardware), will perform better in the simulator.

In other cases, there may be options, further detailed later in this section.

With respect to non-parallel idle mode, this applies to the simulation of multi-core processors. As the cores synchronize occasionally (configurable via the Scheduler quanta), if some cores are idling while another one is running, the threads running the idle cores will wait in a barrier for the core running code.

That means that the performance is bounded by the processor running code, and idle mode has little overall impact.

A solution in the case of real-time software is to plan the processor time partitioning for the software, so idle is triggered in parallel on all cores at the same time if possible.

Another potential issue is performance demanding parts of the target software. In some cases it is possible to replace these with simulation-specific variants.

Example 1. Spacecraft Star Tracker

In a spacecraft control software, a function or partition could be running image processing from star-tracker cameras.

The image processing function primarily produces a rotational vector or quaternion. And since the image processing may be slow, it makes sense to substitute it.

Such substitution could be done using the custom code pattern detector functionality discussed below, or by adding custom device models which would read the rotational parameters from the device model attached to the environmental models.

Performance Impact from Host

The host machine can impact performance in serious ways. Most noticeable is the impact from the lack of physical processors. To exploit parallelism, TEMU assigns different processor cores to different threads.

If there are not enough processors to run all the threads in parallel, execution of simulated processor cores will become serialized.

This can be especially noticeable in virtualized or containerized environments. Even if a virtual machine has 4 virtual processors, those processors are virtual, and may or may not run on separate physical processors.

It is always best to run TEMU on a physical machine, in the case of virtualized environments it may be needed to configure virtual machines and containers to ensure they are assigned physical processor cores.

Note that it is possible to explicitly set the thread affinity in order to control thread to host core allocation.

Performance Improvement Capabilities

Infinite Bus Speed

Many buses offer the option to run the simulation in infinite speed mode. This is typically controlled by the bus controller configuration.

E.g. instead of simulating the time it takes to transmit a byte on the UART, it is possible to emit the byte immediately, when the data register is written.

This saves time for both the target software which might be waiting for the UART to finish transferring the data; and in the emulator, which can omit event posting (or at least fall back on stacked / immediate events, which are faster to post).

Quanta Size

The quanta size (number of instructions executed between time synchronization), is impacting the performance, especially when TEMU runs in multi-threaded mode.

Higher quanta sizes are preferred, however one should at minimum make sure that the quanta size is compatible with TEMU in single threaded mode, which introduce worst case temporal decoupling.

With higher quanta’s, different processors run more instructions, before stopping, waiting for other processors to catch up. This in turn results in less time spent in processor switching and synchronization which is overhead.

Instruction Cycle Adjustments

TEMU 4 assumes that one instruction takes one cycle. It is possible to tweak the instruction throughput, by setting either the Cycles per Instruction (CPI) or Instructions per Cycle (IPC).

To set the CPI, the following can be run in a TEMU script:

cpu0.cpi = 1.5

By setting the CPI to 1.5, performance should increase roughly 30-35%, since the instruction throughput has been reduced by a factor 1.0/1.5 (or around 67%).

The IPC parameter is the inverse of CPI, it is primarily used when one wish to model the effects of super scalar processors.

cpu0.ipc = 2 # Same as setting cpi = 0.5

Note that in the case of, real-time software, it is not certain that deadlines are kept when modifying these parameters. Sufficient experimental validation will be needed to ensure that the target software can keep on running.

Custom Idle Detector

As of TEMU 4.2, the previous internal idle pattern detector API, has been exposed as an interface to the processor models.

Internally, TEMU specify several idle instructions as built-in patterns. These idle patterns are per target and described in the respective target guides. It is possible to add additional patterns.

For example, the SPARCv8 two instruction ba 0; nop pattern is defined as follows:

// Match the two instructions in sequence,
// we match the whole instructions using 0xffffffff as masks
const temu_CodePatternEntry ba0nop[] = {
    {0x10800000, 0xffffffff}, // ba 0
    {0x01000000, 0xffffffff}  // nop or sethi 0, %g0
};

temu_CodePattern ba0nopPattern = {
    .PhysicalAddress = 0,
    .PhysicalAddressMask = 0, // Physical address is don't care
    .Action = tePA_Idle, // Trigger idle mode on entry
    .Callback = NULL, // Applicable to tePA_Call only
    .CallbackData = NULL, // Applicable to tePA_Call only
    .Parameter = 0, // 0 makes this an untagged idle
    .PatternLength = 2, // Number of instructions in pattern
    .Pattern = ba0nop // Instruction patterns
};

CodePatternIface->installPattern(cpu, &ba0nopPattern);

Tagged Idle Mode

Tagged idle mode works like idle as discussed above. The difference is that, in tagged idle the instructions will be executed once.

Whether to execute the instructions is controlled by the processor’s skipIdleTags property.

The tag used for a particular idle mode pattern is controlled by the integer .Parameter field in the pattern.

If the pattern is non-zero, the tag bit will be checked before entering idle. If the processor property has the relevant bit set, idle will not be entered, but the bit will be cleared.

The bits are set by calling the wakeUp function in temu_CpuIface.

This means that we can specify a specific instruction as tagged idle, trigger on it, and then have it wakeUp when a model condition is reached.

There are two use cases for this, one is to force the processor to idle mode, when it is reading certain I/O registers. When the I/O registers change, the model can wake-up the processor, which forces it to run the I/O operation again. Although a model, could trigger idle at any I/O, the pattern mechanism makes it possible to single out specific instruction sequences.

The second use case is to optimize cross processor polling loops. This is a bit more complicated, but a program polling a memory location. Can trigger tagged idle mode in the same way as when polling an I/O register. The wake-up logic, will need to be triggered based on other conditions.

temu_CodePattern idleAtAddressPattern = {
    .PhysicalAddress = 0x40000000,
    .PhysicalAddressMask = 0xffffffff, // Trigger when running address above
    .Action = tePA_Idle, // Trigger idle mode on entry
    .Callback = NULL, // Applicable to tePA_Call only
    .CallbackData = NULL, // Applicable to tePA_Call only
    .Parameter = 1, // Non-zero makes this a tagged idle, tag is 1
    .PatternLength = 0,
    .Pattern = NULL
};

CodePatternIface->installPattern(cpu, &idleAtAddressPattern);



// From model when register state changes so it should wake up the software
cpuIf.Iface->wakeUp(cpuIf.Obj);

Skip Patterns

Skip patterns is a simple way to specify an instruction pattern that will be skipped. This can for example be the execution of scrubber logic, memory tests, etc.

The skip patterns are defined by setting the .Action field in the pattern, to tePA_Skip.

Note that the skipped instructions are assumed to take no time.

To some extent, the skip pattern is a bit like a branch. It would in some cases be possible to patch in a branch instruction to accomplish the same behavior. The patterns let you define a masked address, this in turn means that the scrubber or memory test can move due to target software changes.

temu_CodePattern skipPattern = {
    .PhysicalAddress = 0x40000000,
    .PhysicalAddressMask = 0xffffffff, // Trigger when running address above
    .Action = tePA_Skip, // Trigger skip instructions when running
    .Callback = NULL,
    .CallbackData = NULL,
    .Parameter = 10, // Skip 10 instructions, including the first
    .PatternLength = 0,
    .Pattern = NULL
};

CodePatternIface->installPattern(cpu, &skipPattern);

Arbitrary Call Operations

The general code pattern case is the tePA_Call action. The call action does not normally enter idle mode or skip instructions.

It is however possible to possible to:

Change the program counter (will skip instructions)
Change the processor state (e.g. enter idle mode)

That makes it possible to accomplish the same behavior as both the skip and tagged idle operations.

However the callback must take care of the state updates, making it a bit more difficult to implement.

A use case is to for example install custom function replacements.

For example, a call to the puts function in the target software, could be eliminated. The pattern triggered call operation could read out string directly, and send it to the simulator log bypassing both I/Os and system call invocations.

Similarly, a call to an image processing kernel, could be eliminated, further enabling the call handler to update registers as expected.

E.g. if the kernel computed a quaternion for a star-tracker, the result could be produced by the call operation, by accessing the the rotation directly from the simulator models.

Correctly using the tePA_Call operation to skip function and system calls is a per system / mission customization. To implement it correctly, good knowledge of the target’s ABI will be needed.

To handle specific target function calls using this mechanism, the pattern should be defined so it is installed on the call instruction. To handle all system calls / function calls, there are two ways:

In the case of system calls, intercept either all the relevant trapping instructions in the user software, or the system call handler in kernel space.
In the case of function calls, intercept the first instruction in the called function.

The reason that the calling instruction cannot be intercepted, is that call and branch operation in most cases are relative. The pattern API can at present not match on complex conditions such as CPU state.

The example below shows how to specify a custom call handler. The handler is triggered at physical address 0x40000000. It will read out CPU state and modify it before returning.

void
myCallFunc(void *obj, void *data)
{
  temu_Object *cpu = (temu_Object*)obj;
  MyData *myData = (MyData*)data;

  uint32_t param0 = temu_cpuGetReg(cpu, 8); // Get parameter 0 (%o0)
  temu_logInfo(cpu, "Calling function with parameter %" PRIx32, param0);
  // Do something with myData

  // ...

  // Skip instruction and its delay slot
  temu_cpuSetPc(cpu, temu_cpuGetPc(cpu) + 8);
  temu_cpuSetReg(cpu, 8, 1); // Set return value (%o0) of function to 1
}


MyData myCallbackData;
// ...

temu_CodePattern callPattern = {
    .PhysicalAddress = 0x40000000,
    .PhysicalAddressMask = 0xffffffff, // Trigger when running address above
    .Action = tePA_Call, // Call function
    .Callback = myCallFunc,
    .CallbackData = myCallbackData,
    .Parameter = 0, // Not used
    .PatternLength = 0,
    .Pattern = NULL
};

CodePatternIface->installPattern(cpu, &callPattern);

Barrier Strategy

When running in parallel, the simulated time is kept in sync by running each processor core for a fixed duration of simulated time called the quanta.

The threads finishing their work will wait for the other threads in a barrier.

At the time of writing there are two barrier implementations, with different performance characteristics:

Spin Barrier (the default)
Std Barrier with backoff support (where threads yield or sleep if they wait too long).

The best strategy depends on the target software and needs to be experimentally determined.

Barrier Control

# Enable the default spin barrier.
sched.enable-spin-barrier

# Enable the standard barrier with backoff support.
sched.enable-std-barrier

Thread Affinity

It is possible to control the affinity and bind TEMU scheduler threads to specific host processor cores.

The configuration of affinity prevents thread migration, and gives more predictable performance.

This must be done before the number of threads is configured.

Affinity Example

# Set affinity for the 2 threads
sched.set-affinity thread-id=0 host-core=4
sched.set-affinity thread-id=1 host-core=5

sched.set-threads threads=2

In the example above the we assign thread 0 to host CPU 4, and thread 1 to host CPU 5.

The host CPU number is the same as the one listed in /proc/cpuinfo on Linux.

Host cores should not select SMT (or hyperthreaded) cores on the same physical core.

There is no correct affinity setting for all host processors, and it should be experimentally determined on the deployed system.